Reliability of administrative data to identify sexually transmitted infections for population health: a systematic review
Brian E Dixon (1,2), Saurabh Rahurkar (2,3), Yenling Ho (1) and Janet N Arno (4,5)
  1. Department of Epidemiology, Indiana University Richard M Fairbanks School of Public Health, Indianapolis, Indiana, USA
  2. Center for Biomedical Informatics, Regenstrief Institute, Indianapolis, Indiana, USA
  3. Department of Biomedical Informatics, Ohio State University, Columbus, Ohio, USA
  4. Division of Infectious Diseases, Indiana University School of Medicine, Indianapolis, Indiana, USA
  5. Bell Flower STD Control Program, Marion County Public Health Department, Indianapolis, Indiana, USA
  1. Correspondence to Dr Brian E Dixon; bedixon{at}


Introduction International Classification of Diseases (ICD) codes in administrative health data are used to identify cases of disease, including sexually transmitted infections (STIs), for population health research. The purpose of this review is to examine the extant literature on the reliability of ICD codes to correctly identify STIs.

Methods We conducted a systematic review of empirical articles in which ICD codes were validated with respect to their ability to identify cases of chlamydia, gonorrhoea, syphilis or pelvic inflammatory disease (PID). Articles that included sensitivity, specificity and positive predictive value of ICD codes were the target. In addition to keyword searches in PubMed and Scopus databases, we further examined bibliographies of articles selected for full review to maximise yield.

Results From a total of 1779 articles identified, only two studies measured the reliability of ICD codes to identify cases of STIs. Both articles targeted PID, a serious complication of chlamydia and gonorrhoea. Neither article directly assessed the validity of ICD codes to identify cases of chlamydia, gonorrhoea or syphilis independent of PID. Using ICD codes alone, the positive predictive value for PID was mixed (range: 18%–79%).

Discussion and conclusion While existing studies have used ICD codes to identify STI cases, their reliability is unclear. Further, available evidence from studies of PID suggests potentially large variation in the accuracy of ICD codes indicating the need for primary studies to evaluate ICD codes for use in STI-related public health research.

  • administrative codes
  • validation study
  • sexually transmitted infections
  • systematic review
  • international classification of diseases codes
  • public health informatics

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See:


Surveillance is the cornerstone of public health practice as well as research.1 Public health surveillance involves the systematic collection, analysis, interpretation and dissemination of health-related data to inform population health policies, programmes and interventions. The most prevalent type of data used by public health authorities to identify population health trends, examine healthcare access and assess population level outcomes is administrative data.

The term ‘administrative data’ in healthcare refers to data generated during routine healthcare delivery processes,2 which includes but is not limited to outpatient encounters, hospital admissions and pharmacy dispensing events. These data are used for a range of administrative purposes in healthcare, such as billing payers for healthcare services and measuring the efficiency of healthcare delivery. Administrative data include patient demographic information (eg, age, race), insurance plan enrolment (eg, Medicare, Medicaid, Blue Cross/Blue Shield, etc), hospital discharge information (eg, reason for admission, discharge disposition), procedures delivered during an outpatient encounter and pharmaceutical claims (eg, medication dispensed, route of administration). Administrative data are typically structured using information coding standards to enable interpretation and analysis. As a result, they represent accessible, available data suited for secondary purposes such as population health research.

Accessibility and availability are two important dimensions of data quality,3 yet they are only part of the quality equation with respect to population health research.4 To be of high quality, administrative data must also be accurate.5 Otherwise health policies, programmes and interventions that derive from observational research will be developed on false premises and may therefore fail to prevent disease, prolong life or promote health.

In the context of public health surveillance at the national level, scientists use diagnostic codes from administrative data in the form of International Classification of Diseases (ICD) codes to identify disease-specific cohorts for assessment of population level trends and outcomes. While other coding systems, such as the Logical Observation Identifiers Names and Codes, are used to electronically report positive disease cases to local and state health authorities,6 7 national surveillance systems lack these data as they are often removed when cases are reported to federal information systems. Moreover, scientists at federal health agencies are unable to measure positivity, adherence to testing guidelines and other population indicators when data reported to state health authorities come from just those individuals with a disease. Robust public health surveillance and research requires accessible, available and accurate data for population numerators and denominators. Therefore, national public health scientists leverage large, population health data sets that consist primarily of administrative data.

Existing literature on the accuracy of ICD codes in administrative data is mixed. Whereas a recent Canadian study found that ICD codes had a low positive predictive value (PPV) of 16% with respect to the identification of pertussis cases,8 an earlier study in Canada found that these data possessed high sensitivity (96.2%) and specificity (99.6%) with respect to the identification of HIV infection.9 The ability of ICD codes from administrative data to reliably identify true cases of disease therefore appears to vary by condition, suggesting that administrative data should be validated for all diseases of public health importance.
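The accuracy measures cited above all derive from a 2×2 table that compares the ICD code against a reference standard such as chart review. The following is a minimal sketch of those definitions; the class name and the counts are hypothetical and are not drawn from any of the cited studies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Counts from comparing an ICD code against a reference standard."""
    tp: int  # code present, disease confirmed by the reference standard
    fp: int  # code present, disease not confirmed
    fn: int  # code absent, disease confirmed
    tn: int  # code absent, disease not confirmed

    def sensitivity(self) -> float:
        # Share of true cases that carry the code.
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        # Share of non-cases that do not carry the code.
        return self.tn / (self.tn + self.fp)

    def ppv(self) -> float:
        # Share of coded records that are true cases.
        return self.tp / (self.tp + self.fp)

    def npv(self) -> float:
        # Share of uncoded records that are true non-cases.
        return self.tn / (self.tn + self.fn)


# Hypothetical counts for illustration only.
table = TwoByTwo(tp=80, fp=20, fn=10, tn=890)
for name in ("sensitivity", "specificity", "ppv", "npv"):
    print(f"{name}: {getattr(table, name)():.1%}")
```

Sensitivity and specificity describe the code itself, whereas PPV also depends on how common the disease is in the population being coded, which is one reason PPVs can differ so widely between studies.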

Sexually transmitted infections (STIs) are an important public health challenge and a focus of the Healthy People 2020 goals.10 Undiagnosed and untreated STIs are associated with adverse outcomes such as infertility, pelvic inflammatory disease, chronic pelvic pain, HIV acquisition, neurologic disease and adverse pregnancy outcomes. Several STI health services are recommended by the WHO to protect the reproductive and sexual health of men and women, yet referral to and performance of these services requires accurate identification of STI incidence, prevalence and outcomes. Chlamydia, gonorrhoea and syphilis are the most prevalent and curable STIs reportable under public health laws in the USA11 and more than 1 million STIs are acquired every day worldwide.12 Accuracy of administrative data for identifying cases of chlamydia, gonorrhoea and syphilis is unknown.


The purpose of this study is to review the extant literature for evidence on using ICD codes from administrative data to reliably identify populations with chlamydia, gonorrhoea and syphilis. Our findings will provide insights into the interpretation of results from STI studies that use administrative data for cohort identification. Further, these insights may inform future population health research.


To examine the existing literature on the validity of using administrative data to identify STI cases, we conducted a systematic review in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. We performed a comprehensive search of the literature using the MEDLINE and Scopus databases for peer-reviewed articles published before February 2018. We limited the scope of our review to chlamydia, gonorrhoea and syphilis, as these are the STIs that constitute the highest reportable morbidity for local health departments.

Articles were identified using search terms consisting of search strings as well as Medical Subject Heading (MeSH) terms associated with the key words. We used STI-related search terms including ‘syphilis’, ‘gonorrhoea’, ‘chlamydia’, ‘pelvic inflammatory disease’, ‘sexually transmitted disease’ and those related to administrative codes such as ‘international classification of diseases’, ‘administrative codes’, ‘ICD’, ‘diagnostic codes’, ‘diagnosis, classification’, ‘international classification of diseases’, ‘ICD code’, together with terms intended to identify validation studies: ‘validation studies’, ‘reliability’ and ‘validity’. Pelvic inflammatory disease (PID) was included as previous literature indicates that 33%–50% of all PID cases are due to either Chlamydia trachomatis or Neisseria gonorrhoeae.13 14
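The three groups of search terms above can be combined into a single boolean query. The sketch below assumes the usual pattern of joining synonyms with OR inside a concept group and joining the groups with AND; the exact syntax, field tags and MeSH mappings used in the review are not stated in the text, so the helper names and the resulting string are illustrative only.

```python
# Term lists are taken from the text; how they were combined is an assumption.
STI_TERMS = [
    "syphilis",
    "gonorrhoea",
    "chlamydia",
    "pelvic inflammatory disease",
    "sexually transmitted disease",
]
CODE_TERMS = [
    "international classification of diseases",
    "administrative codes",
    "ICD",
    "diagnostic codes",
    "diagnosis, classification",
    "ICD code",
]
VALIDATION_TERMS = ["validation studies", "reliability", "validity"]


def or_group(terms):
    """Quote each term and join the synonyms with OR inside parentheses."""
    return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"


def build_query(*groups):
    """Require one hit from every concept group by joining groups with AND."""
    return " AND ".join(or_group(group) for group in groups)


query = build_query(STI_TERMS, CODE_TERMS, VALIDATION_TERMS)
print(query)
```

Structuring the query this way means an article must mention an STI, an administrative coding concept and a validation concept to be retrieved, which matches the stated aim of finding validation studies of ICD codes for STIs.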

Only empirical publications appearing in peer-reviewed English language journals were included. As such, we excluded letters to the editor, policy briefs, perspectives, commentaries, summaries of future research plans and grey literature. Articles without abstracts were also excluded.

Article identification and selection followed the process outlined in figure 1. We used Covidence (Melbourne, Victoria, Australia), a web-based tool developed for systematic reviews by Cochrane, to facilitate article selection and review.

Figure 1

Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) diagram depicting the article selection process for the systematic review. ICD, International Classification of Diseases; STI, sexually transmitted infection.

In the first step, articles were reviewed with a focus on title and abstract. We eliminated articles that did not focus on STIs or did not validate STI administrative data as a component of the research. This included validation studies focused on conditions or diseases other than STIs. Articles were included if they focused on any combination or all of the STIs of interest, or on PID, and were validation studies focused on diagnostic testing or administrative codes.

Screened articles were then subjected to a second review, which focused on including only validation studies examining the use of either ICD, Ninth Revision, Clinical Modification (ICD-9-CM) or ICD, 10th Revision, Clinical Modification (ICD-10-CM) codes to identify a diagnosis of chlamydia, gonorrhoea, syphilis or PID. Other forms of administrative data for identifying cases, such as pharmacy claims, were not considered because the target STIs are often treated with common antibiotics such as azithromycin, which is also used to treat a number of other bacterial infections. Moreover, studies that focused on the validation of specific laboratory assessments, such as PCR assays for diagnosing chlamydia, were also excluded.

Two reviewers (SR and YH) reviewed each article to determine inclusion with conflicts in judgement reconciled by consensus. Finally, in order to be exhaustive, we used a snowball technique whereby we reviewed all references found in the bibliographies of included articles.

The final set of articles included for review met the following criteria: (1) the study included syphilis, chlamydia or gonorrhoea; (2) the study assessed and listed ICD-9-CM or ICD-10-CM codes and (3) the study measured accuracy, sensitivity, specificity, PPV and/or negative predictive value for ICD codes related to any of the three STIs. References from each selected article were then examined to identify other relevant articles.


Our search strategy identified 1754 unique articles. Of these, only five (0.29%) studies met initial inclusion criteria for full-text review based on review of title and abstract. After full-text review, only two (40%) articles met the criteria for inclusion in the systematic review.15 16 The three articles that did not meet inclusion criteria focused on validating Healthcare Effectiveness Data and Information Set (HEDIS) measures for estimating screening rates for chlamydia among young women.17–19 HEDIS measures utilise administrative data, yet the identified articles were removed because the assessment focused on identifying whether individuals were tested for chlamydia rather than on positive diagnosis of disease. Finally, the two included articles resulted in 60 references identified using the snowball technique. No additional articles were identified in this step in the review process.
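The percentages in the selection flow above follow directly from the reported counts. A short check, using only numbers stated in the text (the variable names are ours):

```python
# Counts as reported in the text.
unique_articles = 1754      # unique articles returned by the search
full_text_reviewed = 5      # passed title and abstract screening
included = 2                # met criteria after full-text review
snowball_references = 60    # references checked from the included articles
snowball_added = 0          # additional articles found by snowballing

pct_to_full_text = 100 * full_text_reviewed / unique_articles
pct_included = 100 * included / full_text_reviewed
excluded_at_full_text = full_text_reviewed - included
final_included = included + snowball_added

print(f"Advanced to full-text review: {pct_to_full_text:.2f}%")
print(f"Included after full-text review: {pct_included:.0f}%")
print(f"Excluded at full-text review: {excluded_at_full_text}")
print(f"Final number of included articles: {final_included}")
```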

The two articles that met the inclusion criteria focused on the validation of administrative data to detect PID. Neither study examined syphilis, and neither examined the accuracy of administrative data to diagnose chlamydia or gonorrhoea independent of PID. Furthermore, both studies examined only ICD-9-CM codes, because the data used predated October 2015. Table 1 presents the list of ICD-9-CM codes used in these articles to identify potential cases.

Table 1

ICD-9 codes utilised by included studies

Ratelle and colleagues (2003) assessed women at a group practice in Massachusetts from 1995 to 1997.16 A cohort of 1051 patients with PID was identified based on ICD-9 codes (see table 1). From this cohort, chart reviews were conducted on de-identified medical records for a random sample of 296 patients focused on the 614.9 (unspecified inflammatory disease, female pelvic organs) code. Chart reviews were limited to the first encounter for each patient with PID; 72.5% of patients had one encounter for the ICD-9 code of interest. The study identified 39 cases which met US Centers for Disease Control and Prevention (CDC) case criteria for PID, resulting in a PPV of 18.1%. For each PID case, reviewers examined whether patients were tested for chlamydia or gonorrhoea. Most patients were tested for chlamydia (84.3%) and gonorrhoea (82.8%). When the PID diagnostic code was used in combination with a positive laboratory test for chlamydia or gonorrhoea, the PPV for PID was 56% (five out of nine cases) and 100% (four out of four cases), respectively. These determinations were based on laboratory testing results for chlamydia or gonorrhoea rather than use of ICD-9-CM codes for these diseases.
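The PPVs for the code combined with a positive laboratory test rest on very small denominators (nine and four cases). The sketch below recomputes the two proportions and, as an illustration of how imprecise such estimates are, attaches a 95% Wilson score interval. The interval is our addition; neither the review nor the text above reports confidence intervals, and the choice of the Wilson method is an assumption.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Counts as reported in the text: (confirmed PID, coded and test-positive).
combined = {
    "PID code + positive chlamydia test": (5, 9),
    "PID code + positive gonorrhoea test": (4, 4),
}

for label, (confirmed, total) in combined.items():
    low, high = wilson_interval(confirmed, total)
    print(f"{label}: PPV {confirmed / total:.0%} "
          f"(illustrative 95% interval {low:.0%} to {high:.0%})")
```

With so few cases, the intervals are wide, so the apparent improvement from adding laboratory results should be read as suggestive rather than precise.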

The second article by Satterwhite and colleagues (2011) assessed the validity of a PID case-finding algorithm among women aged 15–44 years at healthcare organisations in Washington and Colorado from 2003 to 2007.15 Potential PID cases were identified from both healthcare organisations using ICD-9-CM codes (see table 1). In order to identify PID cases associated with chlamydia and gonorrhoea, only non-chronic cases were considered. A total of 2764 and 2685 potential PID cases were identified from Washington and Colorado, respectively. Chart reviews were conducted on 393 cases from Washington and 500 cases from Colorado. Relevant data, including the ICD-9-CM codes, were extracted from medical records during the chart review process. Data from Washington were used to develop the PID case-finding algorithm, which used several variables from administrative data, including ICD-9-CM codes, to identify cases; this algorithm was then tested on the Colorado data. Using ICD-9-CM codes alone, the PPV of identifying PID was 78.8% in Washington and 79.1% in Colorado. When supplemented with other administrative data (eg, age at diagnosis, inpatient admission, etc), PPV increased to 87.9% in Washington and 84.5% in Colorado. Sensitivity was high at 96.4% in Washington and 90.3% in Colorado. In contrast, specificity was low in both Washington (45.9%) and Colorado (37%). The accuracy of identifying chlamydia was not determined using either laboratory testing data extracted from the electronic health record (EHR) or ICD codes.
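A high PPV alongside a low specificity can look contradictory, but PPV also depends on how common true disease is among the records being evaluated. The sketch below applies Bayes' theorem to the Washington sensitivity and specificity reported above across a range of prevalence values. The prevalence values are hypothetical, and the sketch does not attempt to reproduce the published PPVs.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from Bayes' theorem."""
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1 - specificity) * (1 - prevalence)
    return true_positive_rate / (true_positive_rate + false_positive_rate)


# Sensitivity and specificity as reported for Washington in the text.
SENSITIVITY = 0.964
SPECIFICITY = 0.459

# Hypothetical prevalence of true PID among the records evaluated.
for prevalence in (0.05, 0.25, 0.50, 0.80):
    value = ppv(SENSITIVITY, SPECIFICITY, prevalence)
    print(f"prevalence {prevalence:.0%} -> PPV {value:.1%}")
```

When most evaluated records are true cases, as is plausible for a sample already selected by PID diagnostic codes, PPV stays high even with poor specificity; the same algorithm applied to a low-prevalence population would yield a much lower PPV.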


Our systematic review identified two population health studies that evaluated the validity of using diagnostic codes to identify PID. No studies could be identified that examined the validity of using ICD codes to identify a positive diagnosis of chlamydia, gonorrhoea or syphilis. The reviewed studies reported a wide range of PPVs for PID using only diagnostic codes, from 18.1% to 79.1%. The PPV range increased to 56%–100% when additional data were combined with an ICD diagnosis. These findings suggest there is sparse evidence on the reliability of using administrative data to identify STI cases in the conduct of population health research.

Knowledge gaps identified

The review identifies several gaps in our understanding of the validity of ICD codes for STIs. First, validation studies of STI-related ICD codes are limited to PID. Notably, our review found no studies that evaluated the reliability of using administrative data to identify chlamydia, gonorrhoea or syphilis cases. Moreover, studies that did evaluate chlamydia and gonorrhoea identified potential cases using laboratory test results rather than ICD codes. Second, a limited set of STI-related codes were used to identify PID cases. Neither study examined the full range of administrative codes for chlamydia or gonorrhoea, which includes 079.98 (unspecified chlamydial infection) and 098.0 (acute gonococcal infection of lower genitourinary tract). Individuals who present with symptoms of STIs without pain, or those who are routinely screened for STIs during pregnancy, would likely be assigned administrative codes not considered by the prior studies and, therefore, the reliability of these codes remains unknown. Third, neither article examined the validity of administrative codes in identifying syphilis. Syphilis is one of the most prevalent reportable and curable STIs. Moreover, due to its correlation with congenital syphilis and stillbirths in pregnant women, syphilis is particularly important to public health research.
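To make the coding gap concrete, the sketch below shows how encounters might be flagged against a set of ICD-9-CM codes. The code set contains only the three codes named in this article, so it is not a complete STI code list, and the encounter records, field names and function names are hypothetical.

```python
# Only the codes named in the text; not a complete STI code set.
ICD9_CODES_OF_INTEREST = {
    "614.9": "unspecified inflammatory disease, female pelvic organs",
    "079.98": "unspecified chlamydial infection",
    "098.0": "acute gonococcal infection of lower genitourinary tract",
}


def normalise(code):
    """Trim whitespace; codes stay as strings so leading zeros are kept."""
    return code.strip()


def flag_encounters(encounters):
    """Return (encounter_id, matching codes) for encounters with any hit."""
    flagged = []
    for encounter in encounters:
        hits = [
            normalise(code)
            for code in encounter["dx_codes"]
            if normalise(code) in ICD9_CODES_OF_INTEREST
        ]
        if hits:
            flagged.append((encounter["encounter_id"], hits))
    return flagged


# Hypothetical encounter records for illustration only.
encounters = [
    {"encounter_id": "e1", "dx_codes": ["614.9", "401.9"]},
    {"encounter_id": "e2", "dx_codes": ["250.00"]},
    {"encounter_id": "e3", "dx_codes": [" 098.0 ", "079.98"]},
]

flagged = flag_encounters(encounters)
for encounter_id, hits in flagged:
    print(encounter_id, hits)
```

A cohort built this way is only as good as the code set behind it: any true case recorded under a code outside the set is silently missed, and any coded record that is not a true case is included, which is why each code needs validation against chart review or laboratory results.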

Existing validation studies are also limited in the populations used to study the validity of ICD codes. Due to their focus on PID, populations in existing studies are limited to sexually active women. This is an important limitation to note as public health has observed significant increases in STI incidence among men, including men who have sex with men20–22 as well as older adults in recent years.23 24 To thoroughly examine the reliability of STI-related ICD codes for use in health services or public health research, ICD codes assigned to other, broader populations will be required.

Implications for population health research and practice

Administrative data are abundant and used frequently in national population health research. For example, the CDC uses commercial claims databases like Marketscan (The MEDSTAT Group, Ann Arbor, Michigan, USA) to examine healthcare utilisation and outcomes by identifying cohorts based solely on administrative codes.25 26 Patel et al used ICD codes to identify stillbirth cases in order to examine adherence to syphilis testing recommendations during pregnancy.27 Nelson et al used Marketscan data to examine trends in the incidence of Lyme disease in the USA.28 Guoyu et al used ICD codes to study rates of ectopic pregnancy among commercially and Medicaid insured women.29 Based on our findings, public health research that relies solely on administrative data to identify STI cases, without validating findings through chart reviews or additional data such as laboratory test results, may misidentify or falsely exclude true cases. As such, findings from these studies may be biased and/or limited in their generalisability.

Given the need for administrative data, validation studies of ICD codes for STIs are needed. These studies should focus on a broad range of STIs as well as populations. Further, STIs such as chlamydia, gonorrhoea and syphilis should be studied with a specific focus rather than as part of other conditions, such as PID. This is especially true with the transition to ICD-10-CM codes, since no studies were found to have examined the reliability of the newer coding system now in wide use within most nations. Conducting validation studies will generate evidence to enable utilisation of administrative data, specifically ICD codes, to examine healthcare and health services outcomes for populations with these diseases.

Limitations of the study

Although we examined all English language empirical articles published before February 2018, our search strategy identified only two studies that met inclusion criteria. While we recognise this shortcoming, the small number of articles is more a reflection of the literature than a flaw in our methodology. Despite an exhaustive search process, we could not identify any studies that focused on chlamydia, gonorrhoea or syphilis, had a focus other than PID, and/or evaluated ICD-10-CM codes, thus limiting the generalisability of our findings. Finally, it is possible that our search strategy missed some relevant studies; however, in order to minimise this risk, we used the snowball technique to identify additional relevant articles based on the bibliographies of identified articles.


Based on a review of the literature, there is scant evidence on the reliability of using administrative codes to identify cases of chlamydia, gonorrhoea or syphilis. In the available literature, we found high variability in the predictive value of using administrative codes. Given these findings, further studies are required to examine the predictive value of administrative codes for all three diseases in the general population as well as high risk populations, including pregnant women and men who have sex with men.


The authors acknowledge Ashley Wiensch, MPH, of the Regenstrief Institute Center for Biomedical Informatics for her role in nudging the authors to complete the review and editing the manuscript for clarity and conformance to the journal’s guidelines. We further acknowledge Guoyu Tao, PhD, of the Division of STD Prevention, Centers for Disease Control and Prevention, for providing review and feedback on the manuscript as well as the research.



  • Contributors BED conceived of and obtained funding for the study. BED, SR and YH designed the study methods. SR and YH conducted the review and drafted the manuscript. JNA provided guidance for the study design and critical review of the manuscript. BED provided critical review of the manuscript and finalised it for submission. All authors reviewed and approved the final version of the manuscript.

  • Funding The research reported in this publication was supported by the Centers for Disease Control and Prevention (CDC), US Department of Health and Human Services (HHS), under contract number 200-2017-M-94698. YH is supported by a training grant (Award Number T15LM012502) from the National Library of Medicine of the National Institutes of Health. The content is solely the responsibility of the authors and does not necessarily represent the official views of the CDC, NIH or HHS.

  • Competing interests None declared.

  • Patient consent for publication Not required.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Data availability statement Data are available in a public, open access repository.
