Original Research

Social vulnerability and initial COVID-19 community spread in the US South: a machine learning approach

Abstract

Background and objectives More than 93 million COVID-19 cases and more than 1 million COVID-19 deaths had been reported in the USA by August 2022. The disproportionate effect of the pandemic and its severe impact on vulnerable communities raised concerns. This research aimed to identify and rank Social Vulnerability Index (SVI) factors highly predictive of the spread of COVID-19 in the US South at the beginning of the pandemic.

Methods We used Extreme Gradient Boosting (XGBoost), a machine learning method, with SVI data and COVID-19 case counts for all counties in the US South to predict the number of positive cases within 30 days of a county’s first case.

Results The percentage of mobile homes was the most important feature in predicting the increase in COVID-19 cases. Population density per square mile, per capita income, percentage of housing in structures with 10+ units, percentage of people below poverty and percentage of people with no high school diploma were also important predictors of COVID-19 community spread, in that order.

Conclusions SVI can help assess the vulnerability or resilience of communities to the spread of COVID-19 and can help identify communities at high risk of COVID-19 spread.

What is already known on this topic

  • Social and economic factors influence vulnerability to infection and health outcomes, and COVID-19 has had a severe impact on vulnerable communities.

What this study adds

  • Percentage of mobile homes within a county, population density per square mile and per capita income are important predictors of community spread of COVID-19.

How this study might affect research, practice or policy

  • The Social Vulnerability Index can help assess the resilience of communities to the spread of COVID-19 and can help identify communities at high risk of COVID-19 spread.

Introduction

More than 93 million COVID-19 cases and more than 1 million COVID-19 deaths had been reported in the USA by August 2022.1 The pandemic has disproportionately affected minority communities at the local level.2 Even at the early stages of the pandemic, the severe impact of COVID-19 on vulnerable communities raised concerns.3 Historically, poverty, inequalities and social determinants of health have facilitated the spread of infectious diseases.4 There is evidence that socioeconomic factors may influence the spatial spread of COVID-19 at the county level.5 Past pandemics have also shown that social and economic factors influence vulnerability to infection and health outcomes.6 Further, individuals residing in deprived neighbourhoods (ie, neighbourhoods with higher poverty, lower education, low housing quality and low employment rates) had a higher risk of COVID-19 infection.7 Also, a recent study analysed the association of social, economic and demographic factors with the initial spread of COVID-19 and reported that social and economic factors are strongly and positively associated with COVID-19.8

Many communities in the US South have substantial social vulnerabilities that may worsen the impact of COVID-19. In recent weeks, the US South has become a major region of community spread, ranging from Florida to Texas (figure 1). While studies suggest that policies including lockdowns and mandatory mask use are effective for controlling the spread of COVID-19 in communities,9 10 the lack of consistent and effective public policies to mitigate infection spread has been a source of debate in several of these states. In Georgia, for example, the governor filed a lawsuit (later dropped) against the mayor of Atlanta to prevent the latter’s enforcement of a mask mandate.11 The city of Atlanta is racially diverse, and its minority communities have experienced both high rates of poverty and other socioeconomic vulnerabilities and high COVID-19 community spread.

Figure 1

County-level distribution of COVID-19 cases in the US South (August 2020). US South region includes the states of Alabama, Arkansas, Florida, Georgia, Louisiana, Mississippi, North Carolina, Oklahoma, South Carolina, Tennessee and Texas.

Social vulnerability describes the resilience of communities against disease outbreaks and natural or human-caused disasters.12 It can be used to identify the communities most at risk when faced with adverse events that may impact health (eg, disease outbreaks). Social vulnerability refers to socioeconomic and demographic factors that affect a community’s ability and power to prevent human suffering in the event of disasters or outbreaks. The Centers for Disease Control and Prevention (CDC) categorises these socioeconomic and demographic factors into four overall vulnerability domains: socioeconomic status; household composition and disability; minority status and language; and housing type and transportation.13 The Social Vulnerability Index (SVI) provides social and spatial information to help public health officials and local emergency response planners identify communities at high risk of being adversely affected during a crisis.13 This information helps communities prepare a better response to emergency events, especially disease outbreaks.12 13 SVI was associated with increased rates of COVID-19.14 Also, counties with the highest SVI had a greater risk of COVID-19 infection and death,3 and the most vulnerable counties had higher death rates, especially at the beginning of the pandemic.15

Although racial/ethnic minority communities have been disproportionately impacted by COVID-19,3 6 16 17 it is unclear how specific social vulnerabilities faced in these communities, such as poverty and housing insecurity, contributed to the spread of infection at the beginning of the pandemic. To address this gap in knowledge, we use machine learning-based analyses of the SVI data to identify and rank SVI factors that are highly predictive of the spread of COVID-19 cases at the county level across 11 states in the US South.

Methods

Study setting and design

This machine learning-based study included COVID-19 cases and 16 social vulnerability features for all counties across 11 US states located in the South: Alabama, Arkansas, Florida, Georgia, Louisiana, Mississippi, North Carolina, Oklahoma, South Carolina, Tennessee and Texas (online supplemental figures A1,A2). To investigate the association between social vulnerability factors and the spread of COVID-19 at the county level, we use a regression method based on a prediction algorithm. We regress the number of COVID-19 cases 30 days after the first confirmed COVID-19 case in each county against social vulnerability features (detailed below). We chose to examine the US South because of the number of major COVID-19 ‘hot spots’ located in that region as well as the region’s long-standing historical socioeconomic inequities across minority and non-minority communities.18

Study sample and data

We used daily COVID-19 cases from January 2020 to August 2020 from the official website of Johns Hopkins University’s Coronavirus Resource Center.1 For each county in the US South (1086 counties), we identified the number of COVID-19 cases 30 days after its first COVID-19 case was confirmed.
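The outcome construction described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the outcome is the cumulative count on the day exactly 30 days after the first day with a non-zero count, and the function name `cases_after_first` and the toy series are hypothetical.

```python
from datetime import date, timedelta

def cases_after_first(daily_cumulative, window_days=30):
    """Return the cumulative case count `window_days` after a county's first case.

    `daily_cumulative` maps a date to the cumulative confirmed cases on that date.
    Returns None if the county has no cases or the series ends before the window closes.
    """
    dates = sorted(daily_cumulative)
    # First date on which the cumulative count is above zero.
    first = next((d for d in dates if daily_cumulative[d] > 0), None)
    if first is None:
        return None
    target = first + timedelta(days=window_days)
    return daily_cumulative.get(target)

# Hypothetical county: first case on 10 March 2020, then 2 new cases per day.
start = date(2020, 3, 1)
series = {}
for offset in range(60):
    day = start + timedelta(days=offset)
    series[day] = 0 if offset < 9 else 1 + 2 * (offset - 9)

outcome = cases_after_first(series)          # count 30 days after the first case
no_cases = cases_after_first({start: 0})     # county with no confirmed case
```

Changing `window_days` to 60 gives the outcome used in the sensitivity analysis under the same assumption.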

We also used the latest SVI data available from the CDC released in 2018.13 We used 16 social vulnerability features as independent variables: percentage of people below poverty, unemployment rate, per capita income, percentage of people with no high school diploma, percentage of people aged 65 and older, percentage of people aged 17 and younger, percentage of non-institutionalised people with a disability, percentage of single-parent households with children, percentage of minority people (except white, non-Hispanic), percentage of people aged 5+ who speak limited English, percentage of housing in structures with 10+ units, percentage of mobile homes, percentage of overoccupied housing units, percentage of households with no vehicle available, percentage of institutionalised group quarters (eg, correctional institutions, nursing homes) and population density per square mile (see online supplemental table A1 for definitions). All data used in the manuscript are publicly available.

Statistical analysis

We used Extreme Gradient Boosting (XGBoost) to predict the number of positive cases within 30 days of a county’s first case. XGBoost is a scalable machine learning system using gradient tree boosting, which is available as an open source software package.19 Chen and Guestrin presented the XGBoost algorithm in 2016.20 XGBoost is a highly effective and widely used machine learning method that can be used for regression, classification and prediction.20 Gradient boosted decision trees (GBDT) are an ensemble learning method (ie, a method that aggregates the predictions of a group of predictors) that uses decision trees as its base predictors and sequentially adds decision trees to the ensemble, with each added tree improving on the fit of its predecessors to the data.21 XGBoost benefits from several innovations and optimisation techniques that add scalability to GBDT, making it faster and yielding better performance. In this study, the XGBoost algorithm is used to predict COVID-19 cases as the sum of predictions from thousands of individual decision trees, with each tree trained on the residual of all previous trees and making a marginal improvement to the overall model prediction.19 21
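The residual-fitting idea behind gradient boosted trees can be illustrated with a toy example. The sketch below is plain gradient boosting for squared error with one-split trees (stumps) on a single feature; it is not XGBoost itself, which adds regularisation, second-order gradient information and systems optimisations. All names and data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single-threshold split on a 1-D feature that minimises squared error."""
    best = None
    # Every distinct value except the largest is a candidate threshold,
    # so both sides of the split are always non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_trees=50, learning_rate=0.3):
    """Sequentially add stumps, each fitted to the residual of the current ensemble."""
    base = sum(ys) / len(ys)          # start from the mean prediction
    stumps = []
    preds = [base] * len(ys)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    # Final prediction: base value plus the shrunken sum of all stump outputs.
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

# Hypothetical data: one feature value and one case count per county.
xs = [1, 2, 3, 4, 5, 6]
ys = [5, 7, 9, 40, 44, 48]
model = boost(xs, ys)

mean_y = sum(ys) / len(ys)
mse_mean = sum((y - mean_y) ** 2 for y in ys) / len(ys)
mse_boosted = sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
```

Each stump on its own is a weak predictor; the training error falls because every new stump corrects part of what the ensemble still gets wrong.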

While XGBoost learns from the training data and makes predictions with the testing data, it also produces an importance matrix that contains the information gain, cover and frequency of the features actually used in the boosted trees. The interpretation of prediction results and of how features contribute to the prediction is based on these three importance metrics. Gain is the most relevant attribute for interpreting the relative importance of each feature and denotes the relative contribution of a feature in explaining variation in outcomes within the model; that is, a higher gain implies that the feature is more important for generating the prediction. Cover denotes the average coverage (the relative number of counties affected) of splits that use a specific feature; it corresponds to the percentage of counties whose leaf node is decided using that feature. Frequency is the percentage representing the relative number of times a specific feature occurs across all the trees estimated within the model.22 All three measures are reported as relative amounts, so each sums to 1 across features.
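The way the three relative measures are normalised can be sketched from a list of tree splits. This is an illustration under an assumption: that each measure is a per-feature total (summed split gain, summed split cover, number of splits) divided by the total over all features. The function name `importance_matrix`, the feature names and the numbers are hypothetical.

```python
from collections import defaultdict

def importance_matrix(splits):
    """Aggregate per-split statistics into relative gain, cover and frequency.

    `splits` is a list of (feature, gain, cover) tuples, one per internal tree node.
    """
    gain = defaultdict(float)
    cover = defaultdict(float)
    count = defaultdict(int)
    for feature, split_gain, split_cover in splits:
        gain[feature] += split_gain
        cover[feature] += split_cover
        count[feature] += 1
    total_gain = sum(gain.values())
    total_cover = sum(cover.values())
    total_count = sum(count.values())
    # Divide by the totals so that each measure sums to 1 across features.
    return {
        feature: {
            "gain": gain[feature] / total_gain,
            "cover": cover[feature] / total_cover,
            "frequency": count[feature] / total_count,
        }
        for feature in gain
    }

# Hypothetical splits from a small ensemble.
splits = [
    ("pct_mobile_homes", 60.0, 100),
    ("population_density", 30.0, 200),
    ("pct_mobile_homes", 10.0, 100),
]
matrix = importance_matrix(splits)
```

In this toy case the mobile homes feature accounts for most of the gain while sharing cover equally with population density, which shows why a feature can rank differently on the three measures.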

A subset of 869 counties (80% of the total 1086 counties) was used as our training data set, and 217 counties (20% of all counties) were used as our testing data set. We used 10-fold cross-validation, a commonly used method in applied machine learning, to tune the model’s hyperparameters. Cross-validation assesses how the results of a statistical analysis will generalise to an independent data set and tests the model’s ability to predict with a new data set. It also points out problems like overfitting or selection bias.23 Tenfold cross-validation divides the training sample into 10 parts; the model is trained on nine parts (90% of the 869 counties), and performance is measured by its ability to accurately predict COVID-19 cases in the remaining part (the other 10% of the 869 counties), with each part serving once as the held-out part. Once the hyperparameters of the XGBoost model were tuned, the model was trained with the tuned parameters on all 869 counties. Finally, the model was used to predict the outcomes (ie, number of positive COVID-19 cases 30 days after the county’s first confirmed case) for the test data (ie, the 217 counties). We also conducted a SHapley Additive exPlanations (SHAP) analysis to explain the model’s predictions. A positive SHAP value means that a feature has a positive impact on the prediction. For the sensitivity analysis, the model was used to predict the number of positive COVID-19 cases 60 days after the county’s first confirmed case. We used the RStudio V.4.0.2 (R Core Team, 2020) statistical package for all analyses.
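The data partitioning described above can be sketched with county indexes alone. This is an illustrative outline rather than the authors' procedure: the random seed, the shuffling and the way folds are assigned are assumptions, and the helper names are hypothetical.

```python
import random

def train_test_split(ids, test_fraction=0.2, seed=0):
    """Shuffle the ids and hold out a fraction of them as the test set."""
    shuffled = ids[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def k_folds(ids, k=10):
    """Yield (training, validation) pairs; every id is in validation exactly once."""
    for i in range(k):
        validation = ids[i::k]                 # every k-th id, starting at position i
        validation_set = set(validation)
        training = [c for c in ids if c not in validation_set]
        yield training, validation

county_ids = list(range(1086))                 # stand-ins for the 1086 counties
train_ids, test_ids = train_test_split(county_ids)
folds = list(k_folds(train_ids, k=10))
```

With 869 training counties, nine folds hold 87 validation counties and one holds 86. Hyperparameters would be compared on the validation parts only; the 217 test counties stay untouched until the final evaluation.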

Results

Table 1 provides sample characteristics of the 16 SVI features, COVID-19 cases and COVID-19 rates per 100 000 population 30 days after the first COVID-19-positive case in all counties in the 11 states of the US South (1086 counties). On average, 85.3 COVID-19 cases were reported 30 days after the first reported case in a county, with a maximum of 6119 cases. Also, on average, 139.5 COVID-19 cases per 100 000 population were reported 30 days after the first reported case, with a maximum of 4026.8 cases per 100 000 population.

Table 1
Descriptive statistics of the 16 SVIs and COVID-19 cases and COVID-19 rates per 100 000 population after 30 days of the first COVID-19-positive cases in all counties in the US South (1086 counties)

To evaluate the accuracy of our model, we tested the reliability of our predictions on the 217 counties in the test data set. Goodness of fit and prediction evaluation measures (adjusted R-squared=0.59, root mean square error (RMSE)=92.36) indicate that the model was robust (online supplemental table A2). Online supplemental figure A5 also shows a calibration plot of the predicted versus observed COVID-19 rates. Figure 2 shows the XGBoost relative importance by gain. The percentage of mobile homes in counties is the most important feature, followed by population density per square mile and per capita income, in predicting the growth of COVID-19 within 30 days of the first case. The relative contributions of percentage of mobile homes, population density per square mile and per capita income to the model for generating predictions are 0.35, 0.12 and 0.12, respectively. Percentage of housing in structures with 10+ units, percentage of people below poverty and percentage of people with no high school diploma have relative contributions of 0.10, 0.08 and 0.04, respectively. The percentage of overoccupied housing units and the percentage of institutionalised group quarters are the least important features in the model, with relative gains of 0.003 and 0.002, respectively.
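For readers unfamiliar with the two goodness-of-fit measures reported above, the sketch below shows one common way to compute them. The text does not state the exact formulas used, so the adjusted R-squared definition here, 1 - (1 - R²)(n - 1)/(n - p - 1) with n observations and p features, is an assumption, and the numbers are toy values rather than study data.

```python
import math

def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def adjusted_r2(observed, predicted, n_features):
    """R-squared penalised for the number of features used by the model."""
    n = len(observed)
    mean_obs = sum(observed) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - n_features - 1)

# Toy example: five counties, a model with one feature.
observed = [10, 20, 30, 40, 50]
predicted = [12, 18, 33, 39, 48]
toy_rmse = rmse(observed, predicted)
toy_adj_r2 = adjusted_r2(observed, predicted, n_features=1)
```

RMSE is expressed in the units of the outcome (here, cases), so it should be read against the spread of case counts across counties rather than in isolation.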

Figure 2

Extreme Gradient Boosting (XGBoost) gain relative importance. The measures are all reported as relative amounts and all sum up to 1.0.

The relative cover for percentage of mobile homes, population density per square mile and per capita income is 0.09, 0.12 and 0.07, respectively, which shows the relative proportion of counties in our sample affected by splits on these features across all the decision trees (online supplemental figure A3). Also, the relative cover for percentage of housing in structures with 10+ units, percentage of people below poverty and percentage of people with no high school diploma is 0.07, 0.06 and 0.06, respectively. Relative frequency is calculated as the proportion of decision tree nodes that include a specific feature. Percentage of mobile homes, population density per square mile and per capita income occurred in 0.069, 0.093 and 0.079 of nodes within the trees of the model, respectively (online supplemental table A4). In addition, percentage of housing in structures with 10+ units, percentage of people below poverty and percentage of people with no high school diploma accounted for 0.059, 0.085 and 0.061 of nodes in the trees of the model, respectively. Additional XGBoost feature importance matrix details can be found in online supplemental table A3. Figure 3 shows the results of the SHAP analysis. Population density per square mile, percentage of housing in structures with 10+ units and percentage of people below poverty had the most positive impact on the predicted number of COVID-19 cases in a county. Per capita income and percentage of people aged 17 and younger had the most negative impact on the predicted number of COVID-19 cases in a county.

Figure 3

Shapley additive explanations (SHAP) analysis results.

Online supplemental table A4 shows the result of XGBoost gain relative importance after 60 days of the county’s first COVID-19 case. The population density per square mile in counties is the most important feature in predicting the growth of COVID-19 within 60 days of the first case with a relative gain of 31.8%. This is followed by percentage of housing in structures with 10+ units and percentage of mobile homes, with relative gains of 30.4% and 11.2%, respectively. Also, percentage of people aged 65 and older, per capita income and percentage of people aged 5+ who speak limited English have relative contributions of 5.5%, 4.9% and 2.6%, respectively. Additional XGBoost feature importance matrix details can be found in online supplemental table A4.

Discussion

Our machine learning study used SVI data and the number of COVID-19 cases across all counties in the US South to analyse the role of social vulnerability features in predicting the community spread of infection. Our analysis suggests that the percentage of mobile homes within a county is the most important feature in predicting the increase in COVID-19. This was followed by population density per square mile and per capita income. Percentage of housing in structures with 10+ units, percentage of people below poverty and percentage of people with no high school diploma were also important predictors of community spread. However, the percentage of overoccupied housing units and the percentage of institutionalised group quarters were the least important features in predicting COVID-19 spread at the county level.

Our findings are consistent with the results from prior studies that investigated COVID-19 cases and socioeconomic factors and considered the impact of the pandemic on racial and ethnic minorities.2 3 16 24 25 Studies report a disproportionate rate of infections and deaths among non-Hispanic Blacks and Hispanics.2 25 For example, a recent study found that minority status and language, household composition and transportation, and housing and disability were associated with the number of COVID-19 cases in the USA.25 Poverty, crowded housing and lack of vehicle ownership were reported to be associated with increased COVID-19 cases and deaths in urban areas. Also, high population densities catalyse the spread of COVID-19; therefore, avoiding situations with higher population densities will limit the spread of COVID-19.26 In addition, in rural communities, minority status and language are associated with increases in COVID-19 cases.3 Another study reported that counties with a higher percentage of minority, high-density housing structures and crowded housing units were at higher risk of becoming a COVID-19 hot spot.27 A study of urban-rural differences in COVID-19 exposures and outcomes in South Carolina has shown a positive correlation between the case rates, mortality rates and pre-existing social vulnerability. Also, a negative correlation between mortality rates and county resilience patterns suggests that counties with higher levels of inherent resilience had lower death rates.28

Although the US South has numerous hot spots of community spread of COVID-19, few prior studies have systematically investigated the initial spread of COVID-19 in relation to social vulnerabilities across counties in the region. A recent study investigated the spatial association of social vulnerability with COVID-19 prevalence and reported a spatially varying relationship between SVI and COVID-19 cases and deaths.29 Further, our use of a machine learning approach helped determine the specific community vulnerabilities that are most salient in determining the rapid spread of COVID-19. One study reported that mobility habits (eg, number of citizens who make at least one trip per day; transport accessibility; distance from the main city clusters) have a positive association with the spread of COVID-19.30 A recent study also forecasted the geographic spread of COVID-19 as a communicable disease by using the social structure of networks.31 Aggregated data from Facebook also showed that COVID-19 cases were more likely to spread between regions that had stronger social network connections.32 Google COVID-19 Community Mobility Reports also provide a new tool to assess the role of policies to mitigate community spread (eg, to work from home, shelter in place and other recommendations) in flattening the curve of the COVID-19 pandemic.33

This study is subject to limitations. The results of this study should not be interpreted causally. Various state and local policies (eg, lockdowns, business closures and face mask mandates) may have impacted our findings. Hence, residual confounding due to the omission of important covariates should be considered. Also, the number of COVID-19 cases in a county might affect the number of cases in neighbouring counties through the connections between counties. Finally, our results are regional and may not generalise to other regions of the USA. Despite the availability of various free COVID-19 vaccines, the USA still struggles to fight the pandemic, and new waves of COVID-19 are an ongoing threat to public health in the USA. More studies are needed to investigate the resilience of vulnerable counties against COVID-19.

Conclusions

Our findings showed that SVI can help assess the vulnerability or resilience of communities to the spread of COVID-19. Thus, our results can help identify communities at high risk of spread and aid in policy efforts tailored to addressing these communities’ specific vulnerabilities to COVID-19. An understanding of the role social vulnerabilities have in determining the spread of COVID-19 is critical for forecasting the trajectory of this disease and designing effective mitigation interventions at the community level.