Review

Road map for clinicians to develop and evaluate AI predictive models to inform clinical decision-making

Abstract

Background Predictive models have been used in clinical care for decades. They can determine the risk of a patient developing a particular condition or complication and inform the shared decision-making process. Developing artificial intelligence (AI) predictive models for use in clinical practice is challenging; even if they have good predictive performance, this does not guarantee that they will be used or enhance decision-making. We describe nine stages of developing and evaluating a predictive AI model, recognising the challenges that clinicians might face at each stage and providing practical tips to help manage them.

Findings The nine stages included clarifying the clinical question or outcome(s) of interest (output), identifying appropriate predictors (feature selection), choosing relevant datasets, developing the AI predictive model, validating and testing the developed model, presenting and interpreting the model prediction(s), licensing the AI predictive model, maintaining the AI predictive model and evaluating the impact of the AI predictive model. The introduction of an AI prediction model into clinical practice usually consists of multiple interacting components, including the accuracy of the model predictions, physician and patient understanding and use of these probabilities, expected effectiveness of subsequent actions or interventions and adherence to these. Much of the difference in whether benefits are realised relates to whether the predictions are given to clinicians in a timely way that enables them to take an appropriate action.

Conclusion The downstream effects on processes and outcomes of AI prediction models vary widely, and it is essential to evaluate the use in clinical practice using an appropriate study design.

Introduction

Healthcare systems worldwide generate enormous amounts of patient-related health data, much of which is now held electronically in developed countries. There is growing interest among clinicians and healthcare staff in how they could use these data to support patient care.1 Much of medicine is about anticipating and reducing risk, based on current and historical experiences. Predictive analytics in healthcare can help determine the risk of a patient developing a particular condition or complication, which can inform the shared decision-making process between clinicians and patients and improve patient satisfaction with their overall medical care.2–7 With the new era of artificial intelligence (AI), clinical prediction tools can help personalise treatment and management decisions.

The Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD) framework was published to guide the development of multivariable prediction models,8 outlining what should be reported (eg, data sources, modelling techniques) when written up for publication.9 However, a recent systematic review highlighted that the reporting of these models has remained poor since its publication.10 TRIPOD also focused only on regression-based prediction models (although it can be applied to AI-generated approaches) and highlighted the need for more ‘practical methods’ for developing the models more commonly used in healthcare (ie, supervised learning techniques).11 The Consolidated Standards of Reporting Trials–AI guidelines were published in 2020 to guide the reporting of studies evaluating AI interventions; however, there was limited guidance on how these AI predictive models could be developed and usefully applied in clinical practice12; clinicians have sought further information on this.1 13 Even if a newly developed AI model has good predictive performance, this does not guarantee that it will be used in clinical practice or enhance clinical decision-making, let alone improve health outcomes.14 The quality criteria important for evaluating AI predictive models were described in a recent scoping review; however, little information was provided on how such tools affect the clinical routine of physicians, which may vary per physician.15

The nine stages for developing and evaluating predictive AI models

Stage 1: clarifying the clinical question or outcome(s) of interest (output).

Stage 2: identifying appropriate predictors (feature selection).

Stage 3: choosing relevant datasets.

Stage 4: developing the AI predictive model.

Stage 5: validating and testing the developed model.

Stage 6: presenting and interpreting the model prediction(s).

Stage 7: licensing the AI predictive model.

Stage 8: maintaining the AI predictive model.

Stage 9: ongoing evaluation of the impact of the AI predictive model.

It is vital to seek the input of a multidisciplinary team early when developing AI predictive models. This includes clinical specialists when deciding how the model could potentially enhance clinical decision-making and computing scientists when selecting the most appropriate algorithm(s).16 Patients and providers should also be involved in deciding whether and how recommendations will be presented to them, including what, how and when information might be usefully presented (ie, content and alerts).2 7 17 We take each of these stages in turn below.

Stage 1: clarifying the clinical question or outcome(s) of interest (output)

The clinical question or outcome(s) of interest should be clearly defined from the outset. An example of a clinical question might be ‘what is the likelihood of a patient developing type 2 diabetes mellitus (T2DM)?’, with the aim of modifying some of the patient’s potential risk factors through lifestyle changes and/or prescribing medication.18 It is essential to consider how we define T2DM here. Kopitar et al defined it as a fasting plasma glucose level of 6.1 mmol/L or higher without diabetes symptoms.18 This definition makes the model a prognostic rather than diagnostic predictive model, given that it focuses on predicting a future health outcome. It is worth mentioning that this definition varies from those presented in different clinical guidelines18 and can also change over time, highlighting the importance of model updating and maintenance. Another example of a clinical question could be ‘what is the likelihood of a patient developing an infection and subsequent sepsis as an inpatient?’. Again, multiple definitions of sepsis could be used,19–21 each varying in how closely aligned it is with the systemic effects of sepsis syndrome (see figure 1).19 20 The choice of definition here is critical as it can directly influence the model performance measures, particularly specificity, which we will discuss later.22 Clinicians should decide on the most accurate clinical definition for the predicted output, with the model updated to reflect any future changes to this definition.
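As an illustration, the agreed outcome definition can be encoded explicitly as a label when preparing the data, so that any later change to the definition is a visible, versioned change rather than an implicit one. The following minimal Python sketch assumes hypothetical column names and applies the fasting plasma glucose threshold described by Kopitar et al; it is illustrative only and not the original authors’ implementation.

```python
import pandas as pd

# Minimal sketch: deriving a binary outcome label from an explicit outcome definition.
# Column names ('fasting_glucose_mmol_l', 'diabetes_symptoms') are hypothetical.
df = pd.DataFrame({
    "fasting_glucose_mmol_l": [5.4, 6.3, 7.1, 5.9],
    "diabetes_symptoms":      [False, False, True, False],
})

# Definition used by Kopitar et al: fasting plasma glucose >= 6.1 mmol/L
# without diabetes symptoms. Changing this threshold changes the label,
# and therefore everything the model learns downstream.
GLUCOSE_THRESHOLD = 6.1
df["t2dm_label"] = (
    (df["fasting_glucose_mmol_l"] >= GLUCOSE_THRESHOLD)
    & (~df["diabetes_symptoms"])
).astype(int)

print(df)
```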

Figure 1

Different definitions of sepsis and their related clinical predictors. *Note that SIRS criteria are non-specific on the type of infection. **Note that suspected infection became a requirement to define sepsis. ***Note that clinical parameters are more specific to the systemic mechanism of sepsis.

Stage 2: identifying appropriate predictors (feature selection)

The second step involves identifying appropriate clinical predictors (features) related to the outcome of interest. Thus, if we take our sepsis-3 definition (figure 1), the next question relates to ‘what clinical variables should we use for predicting sepsis?’. These clinical predictors will again depend on whether you want to develop a prognostic predictive model (which predicts the likelihood of sepsis occurring before the systemic inflammation process begins)23 or a diagnostic predictive model (which detects the likelihood of sepsis early, but after the inflammation process has already begun).24 A review of the medical literature can help identify potential predictors that might be worth considering; 194 clinical predictors have previously been used to train machine learning algorithms for sepsis prediction, 13 of which were used across all 17 newly developed algorithms.22 These 13 predictors contained a blend of non-modifiable (eg, age, gender) and modifiable (eg, blood glucose levels, blood pressure) predictors, the latter potentially increasing the applicability of the model in clinical practice.22 It is important to consider here how these predictors have been defined and selected in previous studies, their source (ie, retrospective or real-time data) and whether any were excluded, thus recognising any inherent bias.14 25 In terms of predictor type, numerical predictors should be given preference over categorical predictors whenever possible.8 26–28 A classic example is blood pressure, which can be recorded as a numerical (eg, 110 mm Hg) or categorical (eg, high, normal, low) value. Categorising the value would, for example, treat a patient with a systolic blood pressure of 110 mm Hg as having the same level of hypotension as another patient with a systolic blood pressure below 90 mm Hg, even though the latter is more characteristic of sepsis. In the T2DM example mentioned above, Kopitar et al screened the electronic health records (EHRs) of patients who went on to develop T2DM to identify potentially modifiable (eg, total cholesterol) and non-modifiable (eg, age) predictors.18 EHR data can also allow exploration of variables with predictive potential that might not otherwise have been considered.18
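To illustrate the information lost by categorising a numerical predictor, the short Python sketch below converts hypothetical systolic blood pressure readings into categories using arbitrary, illustrative cut-points; the within-category differences that a model could otherwise learn from are discarded.

```python
import pandas as pd

# Hypothetical systolic blood pressure readings (mm Hg).
sbp = pd.Series([85, 92, 110, 128, 150])

# Categorising with illustrative cut-points ('low' <= 100, 'normal' 101-140, 'high' > 140)
# discards information: 110 and 128 mm Hg both become 'normal', 85 and 92 mm Hg both
# become 'low', and the cut-points themselves are arbitrary.
sbp_cat = pd.cut(sbp, bins=[0, 100, 140, 300], labels=["low", "normal", "high"])
print(pd.DataFrame({"sbp_mm_hg": sbp, "sbp_category": sbp_cat}))
```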

The potential clinical predictors are then correlated with the model’s outcome of interest (output) using either statistical methods or machine learning techniques.29 Some predictors are likely to correlate strongly with the output but may be more suitable for a diagnostic rather than a prognostic predictive model. For example, the Sequential Organ Failure Assessment (SOFA) score, which reflects multiorgan dysfunction, will have a strong correlation with the sepsis diagnosis and would be more suitable for developing a diagnostic predictive model; similarly, the lipid profile will correlate strongly with a diabetes diagnosis rather than prognosis, because patients with established diabetes are likely to have hypercholesterolaemia.18 We suggest using a ‘blended approach’ to predictor selection, in which the predictors are correlated with the model’s output and clinical input on the choice of predictors is also obtained to support the model’s clinical application.19 22 30
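A blended approach of this kind could, for example, combine a data-driven ranking of candidate predictors with a clinician-nominated list. The sketch below uses synthetic data, hypothetical predictor names and mutual information (one of several possible statistical or machine learning ranking methods); it is an illustration of the principle rather than a prescribed method.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)

# Hypothetical candidate predictors and a placeholder binary sepsis label.
X = pd.DataFrame({
    "age":         rng.integers(18, 90, 500),
    "heart_rate":  rng.normal(85, 15, 500),
    "systolic_bp": rng.normal(120, 20, 500),
    "lactate":     rng.gamma(2.0, 1.0, 500),
})
y = rng.integers(0, 2, 500)

# Step 1: data-driven ranking of predictors against the outcome.
scores = pd.Series(mutual_info_classif(X, y, random_state=0), index=X.columns)
data_driven = set(scores.sort_values(ascending=False).head(3).index)

# Step 2: clinician-nominated predictors (eg, from literature review and expert input).
clinician_nominated = {"systolic_bp", "lactate", "respiratory_rate"}

# Blended approach: keep predictors supported by either source (and available
# in the dataset), and flag disagreements for multidisciplinary discussion.
selected = data_driven | (clinician_nominated & set(X.columns))
print("Selected predictors:", sorted(selected))
```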

Stage 3: choosing relevant datasets

The existence of, choice of and access to relevant datasets often represent a limiting step in developing predictive AI models.1 31 Thousands of organisations hold health datasets in the UK, so it can be difficult for clinicians, researchers and innovators to discover what datasets already exist.32–34 Developers should first look at the relevance, data size and diversity of potential datasets; the proposed dataset should ideally represent the targeted population in which the AI model is intended to be used, to reduce the risk of inherent bias.35 If the key outcome(s) of interest is not directly recorded, developers may have to decide how the available variables are used to define it.

Researchers and innovators can search and request access to UK health-related datasets through ‘the Gateway’, a common entry point established by Health Data Research UK for nine UK-based health data research (HDR) hubs across the country.33 These hubs include DATAMIND (mental health data), PIONEER (acute care data) and Discover-Now (primary care data), the latter being one of the largest primary care datasets in Europe. The UK HDR Alliance is also an independent alliance of leading healthcare and research organisations united to establish best practices for the ethical use of UK health data for research at scale.34 In the UK, patients’ information is protected by the General Data Protection Regulation, and patients can refuse permission for their confidential data to be used through the national data opt-out service. Deidentification can be challenging, specifically with demographic variables, some of which can be important predictors when training the model; removing them can degrade the model’s performance. A trusted research environment with anonymised patient data can be prepared for the clinician or researcher, once all the necessary ethical approvals have been obtained and the required training on data use and security completed.36–39 Alternatively, data can be processed in a safe environment at a hospital or university site; however, checks will need to be made on the safety of these environments, and the data will not be approved for release if they do not meet the HDR UK five safes (safe people, safe projects, safe settings, safe outputs and safe data).34

The diabetes risk prediction model mentioned above was developed using anonymised data collected from 10 diabetes screening clinics pooled in a single database.18 Internationally, the Medical Information Mart for Intensive Care (MIMIC) database has clinical information from more than 40 000 patients admitted to critical care units at one tertiary centre (Beth Israel Deaconess Medical Centre, Boston, Massachusetts, USA). Healthcare professionals can freely access the dataset after completing appropriate data use and security training and signing a data usage agreement.36 40 An important consideration is how these data have been collected and recorded. Numerical variables in the chosen dataset should ideally be collected and recorded synchronously.37 The MIMIC database developers recognised this as a potential limitation of their dataset, with vital signs such as heart rate and blood pressure recorded at different time points, thus potentially impacting the accuracy of the model.36 Clinicians should help decide which dataset best represents the patient population in which the model is intended to be used.
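Before committing to a dataset, it is worth profiling it for missingness and for how synchronously key variables are recorded. The following sketch uses a small, hypothetical extract (invented patient identifiers, timestamps and vital signs) to show the kind of checks that can reveal asynchronous recording of the sort described by the MIMIC developers.

```python
import numpy as np
import pandas as pd

# Hypothetical extract with asynchronously recorded vital signs; in practice
# this would come from the chosen dataset (eg, a MIMIC-style export).
df = pd.DataFrame({
    "patient_id":  [1, 1, 1, 2, 2, 3],
    "recorded_at": pd.to_datetime([
        "2023-01-01 08:00", "2023-01-01 09:00", "2023-01-01 12:00",
        "2023-01-01 08:30", "2023-01-01 10:30", "2023-01-01 09:15",
    ]),
    "heart_rate":  [88, 92, np.nan, 110, 104, 76],
    "systolic_bp": [np.nan, 118, 121, 95, np.nan, 132],
})

# Proportion of missing values per candidate predictor.
print(df[["heart_rate", "systolic_bp"]].isna().mean())

# How often each vital sign is actually recorded per patient, revealing the
# asynchronous recording the MIMIC developers describe as a limitation.
print(df.groupby("patient_id")[["heart_rate", "systolic_bp"]].count())
```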

Stage 4: developing the AI predictive model

There are four major types of machine learning algorithms: supervised learning, unsupervised learning, semisupervised learning and reinforcement learning.41 The choice of machine learning algorithm will depend on several factors, including the outcome of interest (ie, numerical or discrete value); the number of predictors; the ‘shape’ of the dataset (ie, size, completeness, uniformity); and the performance measures of the algorithm (ie, sensitivity, specificity, accuracy, area under the curve).30 In the case of the latter, a number of algorithms may need to be tried before deciding on the most suitable one, or a combination (an ensemble model).41 Supervised learning is commonly used for predictive models and can be subclassified into regression (ie, numerical output) or classification (ie, discrete output) algorithms.42 The higher the number of predictors used, the more computational power is needed to train the model and the higher the potential risk of overfitting.42 An overfitted model has high accuracy during the training phase but lower accuracy during the validation and testing phases; potential ways to overcome this are described below.26 42 43 It is important to remember, however, that the strength of the computed correlations depends largely on the entered values (eg, non-extreme vs extreme) and the amount of missing data. Missing data can potentially be managed by statistical methods (eg, multiple imputation) or machine learning algorithms (eg, k-nearest neighbours), the choice of which will usually depend on the type and extent of missing information.41 44 45
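As an illustration of these two families of approaches to missing data, the sketch below applies k-nearest neighbours imputation and an iterative (multiple-imputation-style) imputer from scikit-learn to a small, hypothetical predictor matrix; in practice the choice should still be guided by the type and extent of missingness.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical predictor matrix (eg, age, systolic BP, lactate) with missing values.
X = np.array([
    [65, 120.0, np.nan],
    [72, np.nan, 1.8],
    [54, 135.0, 2.4],
    [81, 110.0, 3.1],
])

# Option 1: k-nearest neighbours imputation (machine learning approach).
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

# Option 2: iterative imputation (a multiple-imputation-style statistical approach),
# modelling each predictor with missing values as a function of the others.
X_iter = IterativeImputer(random_state=0).fit_transform(X)

print(X_knn)
print(X_iter)
```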

Deep learning and artificial neural networks can perform better than conventional machine learning techniques, particularly with large, complex datasets. These networks act as layers of interconnected ‘neurons’ that can identify patterns and correlations in a dataset, allowing the model to self-learn from these patterns. The ‘deep’ refers to the depth (number) of layers in a neural network, and the performance of a deep learning model is directly correlated with the data size (ie, the larger the dataset, the better the model performance).46 47 However, this can be challenging with rare diseases.42 46 48
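The sketch below illustrates the idea of depth using a small feed-forward neural network on synthetic data; real deep learning applications typically use dedicated frameworks and far larger datasets, so this is a conceptual illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

# Synthetic data standing in for a large clinical dataset.
X, y = make_classification(n_samples=5000, n_features=30, n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# A small feed-forward neural network; 'deep' models simply stack more hidden
# layers, and generally need much larger datasets to perform well.
net = MLPClassifier(hidden_layer_sizes=(64, 32, 16), max_iter=300, random_state=0)
net.fit(X_train, y_train)

print("AUROC:", roc_auc_score(y_test, net.predict_proba(X_test)[:, 1]))
```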

Python is one of the most common programming languages for developing AI predictive models and is freely available.49 After importing the dataset into the programming software, you usually divide it into two portions: one for training the algorithm (70%) and one for internal validation (30%).41 43 As described in stage 2 above, each predictor is then correlated with the outcome of interest (feature selection) using the training set, and the performance measures of the algorithm are calculated. These include the specificity, sensitivity, receiver operating characteristic (ROC) curve and the area under the ROC curve (AUROC). The AUROC measures the discriminative ability of the algorithm to predict the outcome, with a value of >0.9 considered excellent.22 50 AI systems learn to make decisions based on these training data, which may reflect human biases or social inequities, even if predictors such as race or gender have been removed.51 It is beneficial to have the input of a programming specialist when preparing or revising the code and judging the performance measures of any resulting models.
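The sketch below illustrates this basic workflow on synthetic data: a 70/30 split into training and internal validation sets, fitting a simple classifier and calculating sensitivity, specificity and the AUROC. It uses scikit-learn and logistic regression purely as an example; the same steps apply to other algorithms.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, confusion_matrix

# Synthetic data standing in for the prepared clinical dataset.
X, y = make_classification(n_samples=1000, n_features=15, random_state=42)

# 70% for training, 30% held out for internal validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Performance measures on the internal validation set.
y_prob = model.predict_proba(X_val)[:, 1]
y_pred = (y_prob >= 0.5).astype(int)
tn, fp, fn, tp = confusion_matrix(y_val, y_pred).ravel()

print("Sensitivity:", tp / (tp + fn))
print("Specificity:", tn / (tn + fp))
print("AUROC:      ", roc_auc_score(y_val, y_prob))
```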

Stage 5: validating and testing the AI predictive model

After developing the model, its predictive accuracy is reassessed using a validation dataset (internal validation) and again in a completely new, unseen dataset (ie, external validation), ideally from another site. This comparison of performance measures is important for evaluating the risk of over/underfitting and widening the generalisability of the model, considering the diversity and representation of the patient population.52 The testing phase usually involves running the model in a silent clinical environment, where the output is not shared with clinicians but compared with conventional clinical judgement and diagnosis. The T2DM prediction model was tested in a silent clinical environment over 6 months to assess its performance, before ‘going live’ to support clinical decision-making.18 It is important to recognise that not all data are equal in quality; laboratory values may be coded differently, or a predictor may be partly or entirely missing, in the validation dataset compared with the training dataset. Complete case analysis can handle missing data by removing all patients with missing values; however, this requires a large sample size and may introduce selection bias. Alternatively, mean imputation can be used for missing numerical predictors, but it is sensitive to outliers (ie, extreme values).53
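The sketch below illustrates these points on synthetic data: a model developed on one dataset is evaluated on a second, ‘external’ set containing missing values, handled either by complete case analysis or by mean imputation, and the resulting AUROCs are compared with the development performance. The data, split and missingness pattern are all invented for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-in: a development dataset and a held-back 'external' set
# (in practice the external set would come from another site).
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_dev, X_ext, y_dev, y_ext = train_test_split(X, y, test_size=0.25, random_state=0)
X_ext = X_ext.copy()
X_ext[rng.random(X_ext.shape) < 0.05] = np.nan  # 5% missing values at the new site

model = LogisticRegression(max_iter=1000).fit(X_dev, y_dev)
auc_dev = roc_auc_score(y_dev, model.predict_proba(X_dev)[:, 1])

# Option 1: complete case analysis - keep only rows with no missing predictors.
complete = ~np.isnan(X_ext).any(axis=1)
auc_complete = roc_auc_score(y_ext[complete], model.predict_proba(X_ext[complete])[:, 1])

# Option 2: mean imputation - simpler, but sensitive to extreme values.
X_imp = SimpleImputer(strategy="mean").fit_transform(X_ext)
auc_imputed = roc_auc_score(y_ext, model.predict_proba(X_imp)[:, 1])

# A marked drop from the development AUROC would suggest overfitting or a
# different case mix at the external site.
print(f"Development AUROC: {auc_dev:.3f}")
print(f"External AUROC (complete cases): {auc_complete:.3f}")
print(f"External AUROC (mean imputation): {auc_imputed:.3f}")
```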

Stage 6: presenting and interpreting the model prediction(s)

It is essential to consider how the model prediction(s) is presented to target users (patients/clinicians) and whether a recommendation accompanies it. The predicted probability (output) can be presented to users without any corresponding recommendations; this assistive presentation format allows clinicians to combine these predictions with clinical judgement.54 55 In contrast, a directive prediction model provides the physician with a recommendation in addition to the predicted probability; this, in turn, can potentially increase the ease of use of the AI prediction model, especially if integrated into the electronic ordering system.56 57 Clinicians should be informed of the underlying assumptions of the model, including which predictors were included and why, any inherent bias (eg, if groups are over-represented or under-represented in the training data) and how patients with specific outcome risk profiles might be affected by different recommendations.14 For example, the inclusion of health costs as a proxy for health needs could potentially introduce racial bias, as less money is spent on black patients with the same level of need in the USA; in other words, the algorithm could falsely conclude that black patients are healthier than equally sick white patients.58 There is some evidence that clinicians in English-speaking countries have felt more legally supported when using decision support tools because they can provide documented evidence for the rationale behind their decisions.59 Chua et al proposed an AI–human interface, where clinicians identify which patients might be eligible to use the tool, and the algorithm identifies (more accurately) which patients have serious illness communication needs and promotes upstream data collection.7 Target users should contribute to the design of the model interface, ensuring that it is user-friendly and that any outputs and recommendations are easy to understand.
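The difference between assistive and directive presentation can be illustrated with a few lines of code. In the sketch below, the wording, the outcome and the 30% decision threshold are all hypothetical; in practice any threshold and recommendation text would need to be agreed with clinicians and reflected in local protocols.

```python
def present_prediction(probability: float, mode: str = "assistive") -> str:
    """Format a model prediction for display to a clinician.

    'assistive' shows the predicted probability only; 'directive' adds a
    suggested action above a hypothetical decision threshold.
    """
    message = f"Predicted risk of the outcome: {probability:.0%}"
    if mode == "directive" and probability >= 0.30:  # illustrative threshold
        message += " - consider preventive action (per local protocol)"
    return message

print(present_prediction(0.42, mode="assistive"))
print(present_prediction(0.42, mode="directive"))
```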

Stage 7: licensing the AI predictive model

In the UK, AI-based tools are classified as medical devices and therefore need Medicines and Healthcare products Regulatory Agency (MHRA) approval. Following Brexit, approved tools require either the ‘United Kingdom Conformity Assessed’ (UKCA) or ‘Conformité Européenne’ (CE) logo to be marketed in the UK.60 However, from July 2023, only tools with the UKCA logo will be allowed to be marketed in the UK.61 In Europe, AI-based software and tools are regulated by the EU Medical Device Regulation (EU MDR),31 62 63 whereas in the USA, AI-based tools are regulated by the Food and Drug Administration (FDA).64

To licence an AI predictive tool in the UK, the MHRA must ensure that it complies with certain ‘conformity assessment’ standards, described by the National Institute for Health and Care Excellence (NICE) in 2018 and updated in 2021.65 It is worth mentioning that the NICE framework is designed for AI tools with fixed algorithms (ie, that do not change over time) rather than AI tools with adaptive algorithms (ie, that continually and automatically change)65; the latter are covered by separate standards (including principle 7 of the code of conduct for data-driven health and care technology).65 Higher-risk AI tools are classified as those that target vulnerable patient populations, have serious consequences in the event of error or system failure, are used solely by patients without healthcare professionals’ support or require a change in clinical workflow.65 For EU-approved tools, the tool should comply with the general safety and performance requirements stated in the EU MDR.66 67 Clinicians should be aware of the appropriate approvals that need to be obtained, especially with the growing adoption of these tools.

Stage 8: maintaining the AI predictive model

Maintenance of the model and knowledge management are critical.68 It may be necessary to update the model as populations, diseases and treatments change, and to include an expiry date.68 In the UK, the NICE framework recommends that a regression test be done when the model is updated, to ensure that any new changes do not have a negative impact on its performance, reliability and functionality.65 Model developers should also keep users (clinicians and patients) informed when releasing new model versions. In the USA, model recertification is needed when AI predictive models are updated,15 although the US FDA is currently working on a framework that allows repeated updating of an AI predictive model without recertification through a change control plan.69
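A regression test of this kind can be as simple as comparing the current and updated models on a fixed, versioned reference dataset and blocking the release if performance falls beyond an agreed tolerance. The sketch below uses synthetic data, an illustrative tolerance and arbitrary model choices; it shows the mechanism rather than any mandated procedure.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Fixed, versioned reference dataset used for every release (synthetic here).
X, y = make_classification(n_samples=2000, n_features=12, random_state=7)
X_train, X_ref, y_train, y_ref = train_test_split(X, y, test_size=0.4, random_state=7)

current_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
updated_model = RandomForestClassifier(random_state=7).fit(X_train, y_train)

auc_current = roc_auc_score(y_ref, current_model.predict_proba(X_ref)[:, 1])
auc_updated = roc_auc_score(y_ref, updated_model.predict_proba(X_ref)[:, 1])

# Regression check: block the release if the update degrades performance on
# the reference set beyond an agreed tolerance (illustrative value).
TOLERANCE = 0.02
if auc_updated >= auc_current - TOLERANCE:
    print(f"Release OK: AUROC {auc_current:.3f} -> {auc_updated:.3f}")
else:
    print(f"Release blocked: AUROC fell from {auc_current:.3f} to {auc_updated:.3f}")
```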

Stage 9: ongoing evaluation of the impact of the AI predictive model

Introducing an AI prediction model into clinical practice can be considered a complex intervention; it usually consists of multiple interacting components, including the accuracy of the model predictions, physician and patient understanding and use of these probabilities, expected effectiveness of subsequent actions or interventions, and adherence to these. A new framework has now replaced the UK Medical Research Council’s earlier guidance for developing and evaluating complex interventions; it focuses on recent developments in methods and the need to optimise the efficiency, use and impact of research.70 The downstream effects on patient outcomes of using an AI prediction model are not always predictable. For example, Kappen et al described no decrease in the incidence of postoperative nausea and vomiting, despite an increase in the administration of prophylactic antiemetics, in a cluster-randomised trial of an AI prediction model (using an assistive presentation format).56 This may indicate that the predictive performance of the model was insufficient, that the impact on physician decision-making was too small (eg, too few prophylactic drugs were administered despite high predicted probabilities), that the antiemetic drugs were not as effective as thought and/or that patients chose not to take them.56 Collecting additional data (observations and interviews) may help improve our understanding of such study results.

When designing an impact study before applying for licensing, a clinician needs to consider whether the complex intervention will have an individual effect on patients or whether it induces a more group-level effect.56 A prediction model often aims to affect the clinical routine of a physician, which may vary per physician; this could lead to clustering of the effect per physician or practice (hospital) when use of the AI model is compared across providers or practices.19 31 56 After repeated exposure to the predictions, clinicians may also become better at estimating the probability in subsequent similar patients, even when those patients are in the control group.19 31 56 This is likely to dilute the effectiveness, and thus the impact, of the model use.48 56 As Kappen et al highlight, the effects of such a learning curve may be minimised, though not completely prevented, by randomisation at a cluster level, for example, by physician or hospital.52 56
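In practical terms, cluster randomisation means allocating whole physicians (or hospitals) to the intervention or control arm, rather than individual patients. The sketch below demonstrates this allocation step with invented physician and patient identifiers; it is illustrative only and omits the sample size and analysis considerations that a real cluster-randomised design would require.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical patient list with the physician responsible for each patient.
patients = pd.DataFrame({
    "patient_id": range(12),
    "physician": ["A", "A", "B", "B", "C", "C", "D", "D", "E", "E", "F", "F"],
})

# Cluster randomisation: allocate whole physicians (clusters), not individual
# patients, so that control-arm patients are not managed by clinicians already
# exposed to the model's predictions.
physicians = patients["physician"].unique()
allocation = dict(zip(physicians, rng.permutation(["model"] * 3 + ["control"] * 3)))
patients["arm"] = patients["physician"].map(allocation)

print(patients)
```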

Conclusion

We have provided a road map which clinicians and others developing algorithms can use to develop and evaluate AI predictive models to inform clinical decision-making. We described nine stages, recognising the challenges that clinicians might face at each stage and providing practical tips to manage them. A ‘blended approach’ should be considered for clinical predictor selection, and the proposed dataset should clearly represent the targeted population in which the AI model is intended to be used. Comparing performance measures between the training, validation and unseen clinical datasets is important for evaluating the risk of over/underfitting and widening the generalisability of the model. The format of the predictive model (assistive or directive) should be carefully chosen and designed. Maintenance of the model is important as populations, diseases and treatments change. The downstream effects on patient outcomes of using an AI prediction model are not always predictable, and it is important to evaluate its use in clinical practice using an appropriate study design.