Influence of social determinants of health and county vaccination rates on machine learning models to predict COVID-19 case growth in Tennessee
  1. Lukasz S Wylezinski (1, 2),
  2. Coleman R Harris (1, 3),
  3. Cody N Heiser (1, 4),
  4. Jamieson D Gray (1) and
  5. Charles F Spurlock (1, 2, 5)

Affiliations
  1. Decode Health, Inc. and IQuity Labs, Inc, Nashville, Tennessee, USA
  2. Department of Medicine, Vanderbilt University School of Medicine, Nashville, Tennessee, USA
  3. Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
  4. Program in Chemical and Physical Biology, Vanderbilt University School of Medicine, Nashville, Tennessee, USA
  5. Wagner School of Public Health, New York University, New York, New York, USA

Correspondence to Dr Charles F Spurlock; chase.spurlock{at}


Introduction The SARS-CoV-2 (COVID-19) pandemic has exposed health disparities throughout the USA, particularly among racial and ethnic minorities. As a result, there is a need for data-driven approaches to pinpoint the unique constellation of clinical and social determinants of health (SDOH) risk factors that give rise to poor patient outcomes following infection in US communities.

Methods We combined county-level COVID-19 testing data, COVID-19 vaccination rates and SDOH information in Tennessee. Between February and May 2021, we trained machine learning models on a semimonthly basis using these datasets to predict COVID-19 incidence in Tennessee counties. We then analyzed SDOH data features at each time point to rank the impact of each feature on model performance.

Results Our results indicate that COVID-19 vaccination rates play a crucial role in determining future COVID-19 disease risk. Beginning in mid-March 2021, higher vaccination rates significantly correlated with lower COVID-19 case growth predictions. Further, as the relative importance of COVID-19 vaccination data features grew, demographic SDOH features such as age, race and ethnicity decreased in importance while the impact of socioeconomic and environmental factors, including access to healthcare and transportation, increased.

Conclusion Incorporating a data framework to track the evolving patterns of community-level SDOH risk factors could provide policy-makers with additional data resources to improve health equity and resilience to future public health emergencies.

  • health equity
  • machine learning
  • COVID-19
  • public health
  • artificial intelligence

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See:



The SARS-CoV-2 (COVID-19) pandemic exacerbated health inequities throughout the USA, disproportionately affecting at-risk populations.1 Identifying social determinants of health (SDOH) risk factors within US communities that contribute to poor outcomes following infection can improve health equity and strengthen community readiness for future public health emergencies.2 3 Following vaccine roll-outs in 2021, we predicted Tennessee COVID-19 case growth using machine learning models and investigated the influence of SDOH factors on COVID-19 incidence to quantify and track opportunities to improve health equity.


Our approach combined publicly available COVID-19 testing, vaccination, hospitalization and death metrics with county-specific SDOH and demographic data.4 5 Data sources included the Tennessee Department of Health, Johns Hopkins Coronavirus Research Center and the US Census database. We employed feature engineering and feature selection to identify novel predictors such as offset case counts to best represent changes in Tennessee county COVID-19 incidence between February and May 2021. We aggregated data from multiple sources to minimize implicit bias and removed or ignored missing values depending on the model type. An ensemble of generalized linear and tree-based machine learning models was built in parallel, each trained and tested with 4–6 weeks of historical COVID-19 case data to generate predictions from 40 to 50 models at 13 time points. Optimal models were selected using cross-validation metrics (eg, mean absolute error, R²) and prediction accuracy for future relative case growth normalized to county population.6 We analyzed the impact of all features from top performing models to quantify and rank SDOH by their influence on COVID-19 incidence predictions. Finally, we calculated Pearson coefficients to quantify associations between vaccination rates and county COVID-19 case growth over time.
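As a rough illustration of the prediction target and the offset-style predictors described above, the following sketch computes case growth normalized to county population and a few lagged case counts. The function names, the 14-day horizon, the per-100,000 scaling and the 7/14/21-day lags are assumptions made for illustration and are not specified in the text; "offset case counts" is interpreted here as lagged daily counts, which may differ from the authors' definition.

```python
from typing import Dict, Sequence


def relative_case_growth(
    cumulative_cases: Sequence[int],
    population: int,
    horizon_days: int = 14,  # assumed horizon, not stated in the source
    per: int = 100_000,      # assumed scaling, not stated in the source
) -> float:
    """New cases over the last `horizon_days`, per `per` county residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    if len(cumulative_cases) <= horizon_days:
        raise ValueError("need more than `horizon_days` observations")
    new_cases = cumulative_cases[-1] - cumulative_cases[-1 - horizon_days]
    return new_cases * per / population


def offset_case_features(
    daily_new_cases: Sequence[int],
    offsets: Sequence[int] = (7, 14, 21),  # hypothetical lags, in days
) -> Dict[str, int]:
    """Daily new-case counts observed `k` days before the most recent day."""
    if len(daily_new_cases) <= max(offsets):
        raise ValueError("series is shorter than the largest offset")
    return {f"cases_lag_{k}": daily_new_cases[-1 - k] for k in offsets}
```

Normalizing by population keeps small and large counties on a comparable scale, so a model is not rewarded simply for tracking county size.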


Machine learning models across all time points were more than 90% accurate when comparing model predictions to actual cases (online supplemental figure 1A and C). The top models demonstrated an average R² value of 0.99, mean absolute error of 0.21 and 0.001 mean Tweedie deviance (online supplemental figure 1B).
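For readers unfamiliar with the three metrics reported here, the sketch below shows how each can be computed from observed and predicted values. It is a minimal pure-Python illustration, not the authors' evaluation code. The Tweedie power parameter is not stated in the text, so the default of 1.5 is an assumption, and the deviance function only handles powers strictly between 1 and 2.

```python
from typing import Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination; undefined if y_true is constant."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true has zero variance")
    return 1.0 - ss_res / ss_tot


def mean_tweedie_deviance(
    y_true: Sequence[float],
    y_pred: Sequence[float],
    power: float = 1.5,  # assumed value; the source does not report it
) -> float:
    """Mean Tweedie deviance for 1 < power < 2 (y_true >= 0, y_pred > 0)."""
    if not 1.0 < power < 2.0:
        raise ValueError("this sketch only supports 1 < power < 2")
    p = power
    total = 0.0
    for t, mu in zip(y_true, y_pred):
        if t < 0 or mu <= 0:
            raise ValueError("requires y_true >= 0 and y_pred > 0")
        total += 2.0 * (
            t ** (2 - p) / ((1 - p) * (2 - p))
            - t * mu ** (1 - p) / (1 - p)
            + mu ** (2 - p) / (2 - p)
        )
    return total / len(y_true)
```

All three functions return their ideal value for perfect predictions: 0 for mean absolute error and Tweedie deviance, and 1 for R².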


Highly predictive SDOH features changed in importance over time. Categorically, demographic SDOH were most important in February 2021, but socioeconomic and environmental SDOH became increasingly influential towards May. Health outcome SDOH features remained largely consistent during the study period. Individually, the female and under-18 age demographic features ranked highest in February and then declined, while African American poverty and health infrastructure features, such as the number of hospital beds and community provider access statistics, increased in importance by mid-April. Lastly, COVID-19 vaccination data features grew in relative importance by May compared with the other SDOH factors (figure 1).
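The text does not say how feature impact was measured. One common, model-agnostic way to rank features is permutation importance, sketched below as a stand-in: each feature is shuffled across counties and the resulting increase in mean absolute error is recorded. The function name, the dictionary-based row format and the feature names in the usage note are hypothetical.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Row = Dict[str, float]


def _mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rank_features_by_permutation(
    predict: Callable[[Row], float],
    rows: Sequence[Row],
    targets: Sequence[float],
    feature_names: Sequence[str],
    n_repeats: int = 20,
    seed: int = 0,
) -> List[Tuple[str, float]]:
    """Rank features by the mean increase in MAE after shuffling each one."""
    rng = random.Random(seed)
    baseline = _mae(targets, [predict(r) for r in rows])
    scores: Dict[str, float] = {}
    for name in feature_names:
        increases = []
        for _ in range(n_repeats):
            # Shuffle one column across rows, leaving all others intact.
            shuffled = [r[name] for r in rows]
            rng.shuffle(shuffled)
            permuted = [{**r, name: v} for r, v in zip(rows, shuffled)]
            error = _mae(targets, [predict(r) for r in permuted])
            increases.append(error - baseline)
        scores[name] = sum(increases) / n_repeats
    # Largest error increase first: the model relies on that feature most.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Running the ranking separately for the top model at each time point, for example with features such as `vaccination_rate` or `hospital_beds`, would yield a time series of ranks comparable in spirit to the shifts described above.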

Figure 1

Social determinants of health (SDOH) linked to COVID-19 case growth in Tennessee dynamically shift in importance over time. SDOH include social, physical and environmental factors that impact community health such as age, race, gender, access to transportation, access to primary care and community vaccination rates. Twelve of these SDOH features demonstrated the highest feature importance across all predictive models during the study period. Size and color are used to emphasize SDOH feature importance at each time point. Large red bubbles connote the top ranked SDOH feature while small dark blue bubbles signify lower importance of a given feature at each time point. Black bubbles represent the least important feature at each time point compared with the other top ranked SDOH data elements.

As Tennessee vaccination rates increased, counties with the lowest vaccination rates exhibited the highest COVID-19 case growth (online supplemental figure 2A). Initially, vaccination rates were not correlated with COVID-19 risk, but by mid-March, a statistically significant correlation with low risk of COVID-19 case growth emerged (online supplemental figure 2B).
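The association described here rests on the Pearson coefficient between county vaccination rates and county case growth, computed at each time point. A minimal sketch of that calculation follows; the function names and the dictionary layout keyed by time point are assumptions, and the sketch does not compute the p-values needed to judge statistical significance.

```python
import math
from typing import Dict, Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


def correlation_by_time_point(
    vaccination_rates: Dict[str, Sequence[float]],
    case_growth: Dict[str, Sequence[float]],
) -> Dict[str, float]:
    """Map each time point label to r across counties at that time point."""
    return {
        label: pearson_r(vaccination_rates[label], case_growth[label])
        for label in vaccination_rates
    }
```

A coefficient near zero at early time points that becomes increasingly negative later would correspond to the pattern reported above, in which higher vaccination rates come to track lower case growth.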



Efforts to curtail the health and economic impact of the SARS-CoV-2 pandemic illuminate the need to define specific risk factors that catalyze future case growth, worsen health disparities and adversely impact the public health response across US communities.7 Addressing these challenges, we constructed a real-time predictive framework to discover and rank county-level SDOH risk factors that drive machine learning predictions of future COVID-19 incidence (figure 1).

In Tennessee, we found that communities with rapid vaccine roll-out were at lower risk for case growth (online supplemental figure 2). As vaccination levels began to rise, demographic SDOH features such as age, race and ethnicity declined in relative importance while socioeconomic and environmental risk factors such as poverty, access to transportation and healthcare infrastructure increased significantly. Measures promoting health equity rely on constant assessment of risk mitigation effectiveness. Real-time knowledge of community-specific SDOH risk factors empowers healthcare organizations and local governments to improve policy and resource allocation to mitigate outbreaks, enhance resilience to future public health threats, and capture evolving risk profiles as novel virus variants emerge.8

Ethics statements

Patient consent for publication


Supplementary materials

  • Supplementary Data

    This web only file has been produced by the BMJ Publishing Group from an electronic file supplied by the author(s) and has not been edited for content.


  • Twitter @colemanrharris, @cody_heiser, @jamiesongray, @cfspurlock

  • Contributors CFS had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis. CFS devised the concept and study design. All authors took part in acquisition, analysis and interpretation of the data along with drafting and revising the manuscript.

  • Funding This work was supported by Decode Health, IQuity Labs and grants from the National Institutes of Health (AI124766, AI129147 and AI145505).

  • Competing interests LSW, JDG and CFS are shareholders in IQuity Labs (Nashville, Tennessee, USA) and Decode Health (Nashville, Tennessee, USA). IQuity Labs develops blood-based RNA tools to aid in the diagnosis and treatment of human disease. Decode Health develops artificial intelligence approaches to predict chronic and infectious disease risk in patient populations.

  • Provenance and peer review Not commissioned; internally peer reviewed.

  • Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.