Original Research

Resampling to address inequities in predictive modeling of suicide deaths

Abstract

Objective To improve methodology for equitable prediction of suicide death when machine learning and statistical models use sensitive predictors, such as race/ethnicity.

Methods We train four predictive models (logistic regression, naive Bayes, gradient boosting (XGBoost) and random forests) using three resampling techniques (Blind, Separate, Equity) on emergency department (ED) administrative patient records. The Blind method resamples without considering racial/ethnic group. In contrast, the Separate method trains disjoint models for each group, and the Equity method builds a training set that is balanced both by racial/ethnic group and by class.

Results Using the Blind method, the range across racial/ethnic groups of the models’ sensitivity for predicting suicide death (a measure of prediction inequity) was 0.47 for logistic regression, 0.37 for naive Bayes, 0.56 for XGBoost and 0.58 for random forest. By building separate models for different racial/ethnic groups or by applying the Equity method to the training set, we decreased these ranges to 0.16, 0.13, 0.19 and 0.20 with the Separate method, and to 0.14, 0.12, 0.24 and 0.13 with the Equity method, respectively. XGBoost had the highest overall area under the curve (AUC), ranging from 0.69 to 0.79.

Discussion We increased performance equity across racial/ethnic groups and show that imbalanced training sets lead to models with poor predictive equity. These methods achieve AUC scores comparable to other work in the field while using only data from a single ED administrative record.

Conclusion We propose two methods to improve equity of suicide death prediction among different racial/ethnic groups. These methods may be applied to other sensitive characteristics to improve equity in machine learning with healthcare applications.

Summary

What is already known?

  • There has been significant research in building machine learning/statistical models for predicting suicide.

  • Most of these models use race as a predictor, but do not include analysis of how this predictor is used.

  • Most of these models follow patients over a period of time and do not analyse a single visit.

What does this paper add?

  • Shows that models can perform competitively using only one patient visit and administrative patient records.

  • Compares model performance on different racial/ethnic groups.

  • Introduces two resampling techniques to increase racial/ethnic equitability in model performance.

Introduction

Suicide is the 10th leading cause of death in the USA, and the suicide rate increased 35% from 1999 to 2018.1 Despite decades of clinical and epidemiological research, our ability to predict which individuals will die by suicide has not improved significantly in the last 50 years.2 Many factors (eg, prior non-fatal suicide attempt, psychiatric disorder, stressful life events and key demographic characteristics) are associated with elevated suicide risk at the population level, but individualised suicide risk prediction remains challenging.

Recent research attempting to improve the performance of previous suicide prediction models has used statistical and machine learning tools to explore suicide risk factors and to classify patients according to their risk for suicidal behaviour.3–9 Much of this work has focused on patients in healthcare settings, motivated by the growing availability of large-scale longitudinal health data through electronic medical record (EMR) systems, the high proportion of suicide decedents who have contact with healthcare providers in the year before their deaths,10 and healthcare patients’ substantially elevated risks of suicide.11 Many of these studies focus on high-risk groups5 6 9 and/or predicting non-fatal suicidal behaviours7 8 instead of suicide death, due to the low base rate of suicide and/or the difficulty of linking EMRs with death records.

The increasing prominence of machine learning models in healthcare applications has been accompanied by increasing concerns that these models perpetuate and potentially exacerbate long-standing inequities in the provision and quality of healthcare services.12 13 Algorithmic unfairness can stem from two sources: the collected data and the machine learning algorithms.14 To address this issue, several groups15 16 have advocated for machine learning models to be proactively designed in ways that advance equity in health outcomes and prioritise fairness. This goal is critical in the mental healthcare and suicide prevention fields, where research has long documented both racial discrimination in care as well as racial/ethnic disparities in rates of suicidal behaviour and mental health stigma.17–19 Recent work has shown that predictive models for suicide death are less accurate for Native American/Alaskan Aleut patients, non-Hispanic Black patients and patients with unknown racial/ethnic information compared with Hispanic, non-Hispanic White or Asian patients.9 Although the ultimate goal is ensuring that racial/ethnic minoritised groups derive equal benefit with respect to patient outcomes from the deployment of machine learning models in healthcare systems, an important goal in the earlier stages of model development is testing whether a prediction model is equally accurate for patients in minoritised and non-minority groups.15 20

We build models that quantify an individual’s risk of future death by suicide, using information gleaned from a single visit to an emergency department to seek care for any condition, including non-psychiatric conditions. Our retrospective cohort study uses a database of administrative patient records (APRs) linked with death records that has not been used in prior predictive modelling studies. To address the low base rate of suicide death and/or racial/ethnic imbalances, we resample database records to build three different training sets. Using metrics established in the literature, we measure the test set performance of four classifiers trained on each of the three resampled training sets, focusing on methods that equalise opportunity and odds across all subgroups.

Methods

Data sources

This study uses APRs provided by the California Office of Statewide Health Planning and Development together with linked death records provided by the California Department of Public Health Vital Records. All data obtained and used were deidentified.

We analyse all visits to all California-licensed EDs from 2009 to 2012 by individuals aged at least 5 years with a California residential zip code and fewer than 500 visits. The data contain N=35 393 415 records from 12 818 456 patients,21 and include the date and underlying cause of death for all decedents who died in California in 2009–2013.

For each record, we assign a label of Y=1 if the record corresponds to a patient who died by suicide (corresponding to International Classification of Diseases-version 10 (ICD-10) codes X60-X84, Y87.0 or U03) during the period 2009–2013; otherwise, we assign a label of Y=0. This allows a minimum of 1 year between each patient visit and when deaths are assessed. The goal of our models is to use information from a single visit by a single patient to predict Y, death by suicide between 2009 and 2013. In our records, 9364 patients (with 37 661 records) died by suicide; as <0.11% of the data is in the Y=1 (death by suicide) class, the classification problem is imbalanced.
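As a concrete illustration of this labelling rule, the following sketch (Python/pandas) flags suicide codes and assigns Y; the table layout and column names (patient_id, underlying_cod) are hypothetical stand-ins for the actual fields of the linked APR/death-record data.

```python
# Minimal sketch of the labelling rule, assuming pandas data frames with hypothetical
# column names (patient_id, underlying_cod); these are stand-ins for the actual fields
# of the linked APR/death-record data.
import pandas as pd

def is_suicide_code(icd10) -> bool:
    """True for ICD-10 underlying causes of death denoting suicide: X60-X84, Y87.0 or U03."""
    if not isinstance(icd10, str):
        return False
    code = icd10.replace(".", "").upper()
    if code.startswith("Y870") or code.startswith("U03"):
        return True
    return code.startswith("X") and code[1:3].isdigit() and 60 <= int(code[1:3]) <= 84

def add_labels(records: pd.DataFrame, deaths: pd.DataFrame) -> pd.DataFrame:
    """Assign Y=1 to every record of a patient whose linked 2009-2013 death is a suicide."""
    suicide_ids = set(deaths.loc[deaths["underlying_cod"].map(is_suicide_code), "patient_id"])
    labelled = records.copy()
    labelled["Y"] = labelled["patient_id"].isin(suicide_ids).astype(int)
    return labelled
```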

The APRs contain both patient- and facility-level information which includes basic patient demographic characteristics, insurance/payer status, discharge information, type of care, admission type and one primary and up to five secondary Clinical Classifications Software (CCS) diagnostic codes. These CCS codes aggregate more than 14 000 ICD-9-CM diagnoses into 285 mutually exclusive and interpretable category codes. The APRs also contain supplemental E-Codes, which provide information about the intent (accidental, intentional, assault, or undetermined) of external injuries and poisonings. Note that APRs omit information such as vital signs, height/weight and other biological indicators found in a full medical record. See online supplemental material for additional information regarding APRs.

Table 1 breaks down the data set by racial/ethnic identity. Seven categories describe racial/ethnic identity: Black, Native American/Eskimo/Aleut, Asian/Pacific Islander, White, unknown/invalid/blank, other and Hispanic. While we recognise that these are crude measures for racial/ethnic identity, this is the granularity of information collected by hospitals and used in machine learning models. Native American/Eskimo/Aleut and White patients have significantly higher rates of suicide death than Hispanic, Black or Asian/Pacific Islander patients, which is consistent with the measured trends.1 In this work, we do not train classification models for the Native American/Eskimo/Aleut group, as the number of suicide deaths is too small to generalise to a wider population. Note that racial/ethnic information is supposed to be self-reported by patients but may be inferred incorrectly by clinical personnel or be incorrectly recorded22; we assume the error rate is low enough to not affect our results substantially.

Table 1 | Data broken down by race/ethnic feature, excluding the ‘other’ and ‘unknown’ race categories

Statistical methods

Given the large imbalance in the class distribution, training directly on the raw data would yield classifiers that achieve accuracies exceeding 99.9% by predicting that no one dies by suicide. To derive meaningful results, we must proactively address the class imbalance; we focus on resampling, an established approach for classification with imbalanced data.23

For each of three resampling methods (denoted below as Blind, Separate and Equity), we apply four statistical/machine learning techniques: logistic regression, naive Bayes, random forests and gradient boosted trees (model descriptions in online supplemental material). This yields 12 models, which we compare below. In each case, we split the raw data into training, validation, and test sets and resample only the training sets. We select model hyperparameters (eg, for tree-based models, the maximum depth of the tree) by assessing the performance of trained models on validation sets. Once we have selected hyperparameters and finished training a model, we report its test set performance.24 The test set is not used for any other purpose, simulating a scenario in which a model is applied to newly collected data.
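For concreteness, the sketch below instantiates the four model families and scores them on a validation set; the specific scikit-learn/XGBoost classes, hyperparameter values and choice of naive Bayes variant are illustrative assumptions, not the exact configurations tuned in this study.

```python
# Illustrative sketch of the four model families and validation-set scoring; the exact
# hyperparameter values and naive Bayes variant used in the study are not shown here,
# so the choices below are assumptions.
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

MODELS = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": BernoulliNB(),
    "random_forest": RandomForestClassifier(n_estimators=200),
    "xgboost": XGBClassifier(max_depth=4, n_estimators=200, eval_metric="logloss"),
}

def fit_and_validate(models, X_train, y_train, X_val, y_val):
    """Fit each model on a resampled training set and score it on the held-out validation set."""
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    return scores
```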

For imbalanced binary classification problems, among the most widely used resampling methods are those that sample uniformly from either or both classes to create a class-balanced training set.23 We choose this method as a baseline and denote this method as Blind; it resamples without considering racial/ethnic group membership. The Separate and Equity resampling procedures are different ways to account for racial/ethnic group membership when forming balanced training sets. These sampling techniques address two sources of bias in the data: representation bias and aggregation bias. From table 1, the White population comprises the majority of patient records as well as suicide deaths, leaving all minority groups underrepresented. The aggregation of over-represented data with underrepresented data can lead to bias. However, there can still be aggregation bias when groups are equally represented.14 For this reason, we train separate models for each racial/ethnic group in addition to a joint model with Equity resampling.

For all three approaches, we begin by shuffling the data by unique patient identifier. We then divide the data into training, validation, and test sets with a roughly 60/20/20 ratio, ensuring that the sets are disjoint in terms of patients. This ensures that patients used for training do not appear in the test set, which could otherwise artificially inflate model performance.
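A minimal sketch of such a patient-disjoint 60/20/20 split, assuming a pandas data frame with an illustrative patient_id column:

```python
# Sketch of a patient-disjoint 60/20/20 split: shuffle unique patient identifiers, then
# route all of each patient's visits to exactly one of the three sets. The patient_id
# column name is illustrative.
import numpy as np
import pandas as pd

def patient_disjoint_split(records: pd.DataFrame, seed: int = 0):
    rng = np.random.default_rng(seed)
    patients = records["patient_id"].unique()
    rng.shuffle(patients)
    n = len(patients)
    train_ids = set(patients[: int(0.6 * n)])
    val_ids = set(patients[int(0.6 * n): int(0.8 * n)])
    train = records[records["patient_id"].isin(train_ids)]
    val = records[records["patient_id"].isin(val_ids)]
    test = records[~records["patient_id"].isin(train_ids | val_ids)]
    return train, val, test
```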

In the Blind method, we separate the training set by class, resulting in two sets. We then randomly sample a subset of the majority class (patients who do not die by suicide) until we achieve a balanced training set.
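A sketch of Blind resampling under the same assumptions (a data frame with label column Y), undersampling the majority class to the size of the minority class:

```python
# Sketch of Blind resampling: undersample the Y=0 majority class to the size of the
# Y=1 minority class, ignoring racial/ethnic group membership.
import pandas as pd

def blind_resample(train: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    pos = train[train["Y"] == 1]
    neg = train[train["Y"] == 0].sample(n=len(pos), random_state=seed)
    return pd.concat([pos, neg]).sample(frac=1, random_state=seed)  # shuffle rows
```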

In the Separate method, the training data are separated by racial/ethnic group and, as in the Blind method, we undersample the majority class within each group to balance the data. We thus train disjoint models for each racial/ethnic group.
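The Separate strategy can then be sketched as applying the same undersampling within each group; the race_ethnicity column name is illustrative and the helper blind_resample is the one sketched above.

```python
# Sketch of the Separate strategy: apply the same undersampling within each racial/ethnic
# group (reusing the blind_resample helper above), yielding one balanced training set,
# and later one model, per group. The race_ethnicity column name is illustrative.
def separate_resample(train, seed: int = 0) -> dict:
    return {
        group: blind_resample(subset, seed=seed)
        for group, subset in train.groupby("race_ethnicity")
    }
```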

In contrast, in the Equity method, we divide the training set by both racial/ethnic group and class label. This results in eight training subsets. We then sample 7500 files with replacement from each of the eight training subsets. The union of these samples is the equity-directed resampled training set; note that it is balanced across racial/ethnic groups and across 0/1 labels. This is a form of stratified resampling in which the strata are racial/ethnic group and 0/1 label.25 In this case, the trained model can be applied to test data from any of the four groups.
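A sketch of Equity resampling under the same assumptions, drawing 7500 files with replacement from each (group, label) cell:

```python
# Sketch of Equity resampling: draw 7500 files with replacement from each of the eight
# (racial/ethnic group x class label) training subsets and pool them into one training
# set that is balanced on both strata. Column names are illustrative.
import pandas as pd

def equity_resample(train: pd.DataFrame, n_per_cell: int = 7500, seed: int = 0) -> pd.DataFrame:
    cells = [
        subset.sample(n=n_per_cell, replace=True, random_state=seed)
        for _, subset in train.groupby(["race_ethnicity", "Y"])
    ]
    return pd.concat(cells).sample(frac=1, random_state=seed)  # shuffle rows
```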

In these models, we treat each visit by each patient independently. Consequently, each predictive model bases its prediction only on the APR from the current (index) visit. Because resampling involves randomness, we demonstrate the robustness of our results by repeating the sampling procedure and building/training the models with 10 different random seeds. Additional information about the random trials can be found in online supplemental material, figures S1-S3. When reporting the results, we provide the average performance (with SD) of each model for each racial/ethnic group and resampling method.
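A minimal sketch of this repeated-trials protocol, assuming a hypothetical user-supplied run_one_trial(seed) function that performs one resample/train/evaluate cycle and returns a dict of metric values:

```python
# Sketch of the repeated-trials protocol: run the full resample/train/evaluate cycle under
# several seeds and summarise each metric by its mean and SD. run_one_trial is a
# hypothetical user-supplied function returning a dict of metric name -> value.
import numpy as np

def repeated_trials(run_one_trial, n_trials: int = 10):
    results = [run_one_trial(seed) for seed in range(n_trials)]
    return {
        metric: (np.mean([r[metric] for r in results]), np.std([r[metric] for r in results]))
        for metric in results[0]
    }
```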

Results

In tables 2–4, we report test set sensitivity, specificity and area under the curve (AUC)24 for each resampling method and model type, broken down by racial/ethnic group. We do not report accuracy due to the class imbalance. Here sensitivity and specificity are, respectively, the percentages of correctly classified records in the Y=1 (patients who died by suicide) and Y=0 (all other patients) classes. When analysing the performance of different models, we imagine a setting in which patients classified as positive (ie, at high risk of suicide) have the opportunity to receive an intervention such as a postdischarge phone call. We thus prioritise sensitivity over specificity, as false negatives are patients who die by suicide with no intervention, while false positives are patients who receive a potentially unneeded intervention.
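In terms of confusion-matrix counts (true positives TP, false negatives FN, true negatives TN, false positives FP), these two metrics are:

```latex
\text{sensitivity} = \frac{TP}{TP + FN}, \qquad
\text{specificity} = \frac{TN}{TN + FP}
```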

Table 2 | Average test set sensitivity with SD (at training set specificity of approximately 0.76) of different combinations of resampling procedure plus statistical/machine learning method, by racial/ethnic group
Table 3 | Average test set specificity with SD (at training set specificity of approximately 0.76) of different combinations of resampling procedure plus statistical/machine learning method, by racial/ethnic group
Table 4 | Average test set AUC with SD (at training set specificity of approximately 0.76) of different combinations of resampling procedure plus statistical/machine learning method, by racial/ethnic group

Our models output a probability of Y=1 (suicide death) conditional on a patient’s APR. In each case, there is a threshold τ such that when the model output exceeds (respectively, does not exceed) τ, we assign a predicted label of 1 (respectively, 0).26 We assign τ values to each model to approximately balance specificity, enabling comparison of models based on test set sensitivity. We also report the size of the range, defined as the difference between the highest and lowest performance across racial/ethnic groups. A smaller range implies more equitable performance across the racial/ethnic groups; a model whose range is zero (for both sensitivity and specificity) satisfies the equal odds criterion established in the algorithmic fairness literature.
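A sketch of this thresholding and of the range metric, assuming illustrative column names (score for the model output, Y for the label, race_ethnicity for group membership):

```python
# Sketch of threshold selection and the range metric. threshold_for_specificity picks tau
# from the model's scores on training set Y=0 records so that specificity is roughly the
# target; sensitivity_range_by_group reports the spread of per-group test set sensitivities
# at that tau. Column names (score, Y, race_ethnicity) are illustrative.
import numpy as np

def threshold_for_specificity(neg_scores, target_specificity: float = 0.76) -> float:
    """tau such that roughly `target_specificity` of the Y=0 scores fall below it."""
    return float(np.quantile(np.asarray(neg_scores), target_specificity))

def sensitivity_range_by_group(test_df, tau: float) -> float:
    """Difference between the highest and lowest per-group sensitivity at threshold tau."""
    sens = {
        group: ((g["score"] >= tau) & (g["Y"] == 1)).sum() / max((g["Y"] == 1).sum(), 1)
        for group, g in test_df.groupby("race_ethnicity")
    }
    return max(sens.values()) - min(sens.values())
```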

We see several trends in the results. First, Blind resampling is the least equitable in terms of either test set sensitivity or test set specificity. All models yield worse test set sensitivity on minoritised racial/ethnic groups than on the White group. Models trained with Blind resampling learn to overclassify White patient files as dying by suicide. The AUC metric obscures these differences. These results hold for all four statistical/machine learning methods considered.

Both the Separate and Equity resampling strategies lead to more equalised sensitivity and specificity across the four racial/ethnic groups. These strategies improve all four statistical/machine learning methods in terms of the equal odds criteria for fairness in classification20 and treatment equality, reducing the range of false negative and false positive rates across the different racial/ethnic groups.14 27 For instance, the range of sensitivities for logistic regression decreases from 0.47 (Blind) to either 0.16 (Separate) or 0.14 (Equity). Notably, this reduction in performance range is coupled with a boost in test set sensitivities on the minority racial/ethnic groups, and a boost in test set specificity for the White group. For further discussion on fairness, see online supplemental material.

Table 4 shows that the test set AUC of XGBoost (with Equity resampling) is between 0.73 and 0.78, signifying good diagnostic accuracy.26 This is clearly better than random guessing (AUC of 0.5) and exceeds all AUC scores reported in a meta-analysis of 50 years of suicide modelling.2 Our AUC scores are comparable to a recent study’s male-specific models (0.77 for CART28 and 0.80 for random forests), and slightly less than that study’s female-specific models (0.87 for CART and 0.88 for random forests).4

Discussion

We trained machine learning models for suicide death classification on statewide emergency department APRs using three resampling methods. We have shown that these resampling methods can reduce the range in model performance across racial/ethnic groups by at least 50%. Specifically, equity-focused resampling increases the predictive performance of all four machine learning models on minoritised racial/ethnic patient groups to approximately match that of the majority (non-Hispanic White) patient group.

This study has several strengths. Our models achieve high predictive accuracy using only single-visit APRs, whereas other studies often have a much richer feature space from which to learn4 and/or restrict attention to only those patients with at least three visits.8 Additionally, the resampling and machine learning methods we employ are highly scalable. Given additional records from other healthcare systems (eg, from neighbouring states), we could add them to our current data set and resample/retrain without difficulty. Our models also issue predictions for the general population of emergency department patients instead of subpopulations with higher suicide risk,5 6 increasing their scope and generalisability. We also use linked mortality records to predict suicide death rather than non-fatal suicidal behaviours or self-harm.3 7 8 When previous large-scale machine learning models have been trained on such linked data sets, they have often used data from nationalised/centralised systems unavailable in the USA.4 Additionally, because each racial/ethnic group is equally represented, the Equity method allows learning across groups while still allowing racial/ethnic-specific predictors to be identified. For example, a mental health diagnosis is a recognised predictor for suicide death,2 but non-Hispanic Black, Hispanic and Asian individuals are less likely to be diagnosed with a mental health condition.18 19

While we may be able to improve on our methods with additional features such as lab results and medication history, there are benefits to training with APRs. The features in APRs are accessible in most (if not all) existing EMR databases. Because these models are trained solely on information gathered at a single emergency department visit, there is no need to process a patient’s medical history. While the logistic regression and naive Bayes models are inherently interpretable, the boosted tree and random forest models could also be analysed and interpreted in detail prior to implementation. Therefore, deployment of these methods as an extension of existing database software is feasible. We envision this as a tool that could potentially assist healthcare providers in identifying patients at risk for suicide death.29 30

Other machine learning for healthcare applications can benefit from this equity analysis. We showed that regardless of model type, the Blind resampling method resulted in inequitable suicide classification for different racial/ethnic groups. Our findings suggest that sensitive group representation should be considered as a type of class imbalance that must be rectified before model training takes place. While we have focused here on racial/ethnic group membership, the Separate or Equity resampling methods can be directly applied to other sensitive categories. We hypothesise that in other problem domains and applications, one can improve prediction equity either by building separate models, or by using equity-directed resampling. When separation of data by sensitive group results in sample sizes too small to train machine learning models, equity-directed resampling may still be viable.

This study also has limitations. First, as with all machine learning models, the finalised predictions are intended to complement (rather than substitute for) human judgement. As with other technology (eg, medical imaging), practitioners may require additional explanation/interpretation of what the models do internally to trust and apply their predictions in a beneficial way. Though we address algorithmic fairness, we should not expect purely technological solutions to address systemic inequities in the healthcare system.13 These inequities may cause unequal mislabelling of suicides by race/ethnicity, affecting the quality of the linked data we analyse and thereby reducing the true generalisability of our models to real-world settings.18 Within the algorithmic fairness context, while our equity-resampled models achieve predictive equality across racial/ethnic groups, recorded membership in these groups is not always accurate. Additionally, patients potentially belong to many vulnerable groups (via their socioeconomic status, disability status, Veteran status, etc); further resampling/stratification may be needed to achieve algorithmic fairness with respect to all such groups. In some cases, for instance when sample sizes are too small, achieving the equal opportunity standard may not be possible. Finally, because we have trained our models only on data from California residents in specific years, we cannot be sure that the trained models themselves will generalise to other locations and time periods. However, the techniques we describe could be applied to construct analogous models given sufficient data from other locations.

Conclusion

When building suicide prediction models using highly imbalanced data sets, resampling is necessary. However, blind resampling can negatively impact model performance for minority groups. Applying either of two resampling methods, we develop predictive models that have reduced prediction inequity across racial/ethnic groups.