
Exploring the reliability of inpatient EMR algorithms for diabetes identification
  1. Seungwon Lee1,2,
  2. Elliot A Martin1,2,
  3. Jie Pan1,3,
  4. Cathy A Eastwood1,3,
  5. Danielle A Southern3,
  6. David J T Campbell1,4,
  7. Abdel Aziz Shaheen1,4,
  8. Hude Quan1,3 and
  9. Sonia Butalia1,4
  1. 1Community Health Sciences, University of Calgary Cumming School of Medicine, Calgary, Alberta, Canada
  2. 2Provincial Research Data Services, Alberta Health Services, Edmonton, Alberta, Canada
  3. 3Centre for Health Informatics, University of Calgary Cumming School of Medicine, Calgary, Alberta, Canada
  4. 4Department of Medicine, University of Calgary Cumming School of Medicine, Calgary, Alberta, Canada
  1. Correspondence to Dr Seungwon Lee; seungwon.lee{at}


Introduction Accurate identification of medical conditions within a real-time inpatient setting is crucial for health systems. Current inpatient comorbidity algorithms rely on integrating various sources of administrative data, but at times, there is a considerable lag in obtaining and linking these data. Our study objective was to develop electronic medical records (EMR) data-based inpatient diabetes phenotyping algorithms.

Materials and methods A chart review on 3040 individuals was completed, and 583 had diabetes. We linked EMR data on these individuals to the International Classification of Disease (ICD) administrative databases. The following EMR-data-based diabetes algorithms were developed: (1) laboratory data, (2) medication data, (3) laboratory and medications data, (4) diabetes concept keywords and (5) diabetes free-text algorithm. Combined algorithms used logical OR statements between the above algorithms. Algorithm performances were measured using chart review as a gold standard. We determined the best-performing algorithm as the one that showed high performance in both sensitivity (SN) and positive predictive value (PPV).

Results The algorithms tested generally performed well: ICD-coded data, SN 0.84, specificity (SP) 0.98, PPV 0.93 and negative predictive value (NPV) 0.96; medication and laboratory algorithm, SN 0.90, SP 0.95, PPV 0.80 and NPV 0.97; all document types algorithm, SN 0.95, SP 0.98, PPV 0.94 and NPV 0.99.

Discussion A free-text data-based diabetes algorithm can yield comparable or superior performance to a commonly used ICD-coded algorithm and could supplement existing methods. These types of inpatient EMR-based algorithms for case identification may become a key method for timely resource planning and care delivery.

  • health services research
  • electronic health records
  • medical informatics
  • medical record linkage

Data availability statement

Data may be obtained from a third party and are not publicly available. Restrictions apply to the availability of these data. Data were obtained from Alberta Health Services and are available with the permission of Alberta Health Services.

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See:



  • Identifying people with diabetes in databases is typically carried out by using International Classification of Disease codes, laboratory results and/or medications.


  • The diabetes identification algorithm based on free-text electronic medical records (EMR) notes shows excellent performance. This study further supports the idea that EMRs contain a wealth of details that can be leveraged to complement existing methods to identify people with diabetes within databases.


  • This study provides evidence that free-text EMR data could enhance the flow of diabetes information in clinical care and improve associated downstream processes in case identification, surveillance and clinical outcome research.


Introduction

Accurate identification of chronic conditions, such as diabetes, within acute care facilities or hospitals is imperative for delivering optimal care.1 Information regarding comorbidity status is typically gathered by healthcare professionals during their care encounters and stored within electronic medical records (EMRs). The collected information is subsequently conveyed to other care providers based on individual needs. Comorbidity information is useful not only in point-of-care clinical encounters but also for research, quality improvement and resource planning. In the Canadian context, a key inpatient administrative health database, the discharge abstract database (DAD),2 is used by >90% of hospitals. This database is populated by trained coding specialists who review physician documentation, such as discharge summaries, from the EMRs and assign International Classification of Diseases 10th Revision Canadian modification (ICD-10-CA) codes to each encounter.3 The DAD serves various purposes such as care service planning activities, fiscal planning, operational planning, population surveillance and epidemiology. However, there is a considerable delay in obtaining these data.

The increased adoption of EMRs within acute care facilities,4 coupled with the integration of artificial intelligence techniques in healthcare,5 has created the potential to extract chronic conditions and comorbidities directly from EMRs. This has the benefit of enhancing operational data practices in health systems and ensuring timelier information. For example, diabetes definitions have typically used ICD codes and laboratory and medication data to define diabetes in clinical datasets.6 However, most detailed contextual information in healthcare is stored within free-text notes in paper charts or EMRs. The advancement of natural language processing (NLP) techniques now enables using EMR free-text data to refine condition definitions and facilitate the identification process, thereby enhancing various healthcare processes, including real-time point of care, research and care planning processes.

Our hypothesis is that a diabetes algorithm using clinical free-text notes can perform similarly to or better than existing standard methods. We were also interested in assessing whether different components of free-text notes could contribute to phenotyping diabetes. The purpose of this study was to develop diabetes algorithms based on different EMR data modalities and compare their performances.

Materials and methods

Study population and design

This study is a retrospective cohort study covering the period of 1 January 2015 to 30 June 2015, from Alberta, Canada. This cohort was assembled from data sources listed below.

Data sources and linkage

The EMR and administrative data records were linked using Personal Health Number (PHN), and generated Patient Identification and Encounter Management details (eg, encounter number, health record number) sourced from the Clinibase system. These represent a unique set of identifiers for patient encounters7 that are loaded into the EMR system. The combination ensured the linkage was pinpointed to the correct admission period contained within EMR data. We developed this linkage mechanism in a previous study8 and subsequently created multiple EMR databases linking administrative health databases. The PHN and other personal identifiers were anonymised after the linkage was completed. The following sources of information were used: chart review database, Allscripts Sunrise Clinical Manager EMR and the DAD.
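
The deterministic linkage described above can be pictured as an exact-match join on a composite key. The following is a minimal sketch only, assuming hypothetical field names (`phn`, `encounter_number`) and toy records; it is not the study's linkage code.

```python
def link_records(chart_rows, emr_rows):
    """Exact-match (deterministic) join on (PHN, encounter number)."""
    # Index EMR encounters by the composite key; field names are hypothetical.
    emr_index = {(r["phn"], r["encounter_number"]): r for r in emr_rows}
    linked = []
    for row in chart_rows:
        key = (row["phn"], row["encounter_number"])
        match = emr_index.get(key)
        if match is not None:
            # Merge both records; a real pipeline would then anonymise identifiers.
            linked.append({**match, **row})
    return linked


# Toy data: only the first chart row has a matching EMR encounter.
chart = [
    {"phn": "A1", "encounter_number": 10, "diabetes": 1},
    {"phn": "B2", "encounter_number": 20, "diabetes": 0},
]
emr = [
    {"phn": "A1", "encounter_number": 10, "admit_date": "2015-02-01"},
    {"phn": "A1", "encounter_number": 11, "admit_date": "2015-05-09"},
]
linked = link_records(chart, emr)
```

Using the encounter number alongside the PHN is what keeps a patient with several admissions from being matched to the wrong admission period.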

Chart review database

A previously conducted project assembled a chart review cohort of randomly selected patients in acute-care facilities in Calgary, Alberta.9 The chart review data recorded patients’ chronic disease status (binary) which included diabetes status, admission date and other system variables for linking it to the DAD and other data sources. The chart review included a total of 51 medical conditions and 3 healthcare-related adverse events. The chart review team consisted of six nurses who received training and followed a consistent protocol to review the charts. These reviewers were blinded to the ICD coding status.

Allscripts Sunrise Clinical Manager EMR

Sunrise Clinical Manager (SCM) has been used as the inpatient EMR for several acute-care sites operated by Alberta Health Services (AHS), the single health authority in the province of Alberta, since 2009. This EMR contains (but is not limited to) patient demographic information, laboratory information, medications, free-text history and physical notes, interdisciplinary progress notes, and discharge summaries for inpatient encounters. Detailed description of this EMR system is available in our previous work.1

Discharge abstract database

DAD is a national Canadian administrative health database which includes all inpatient separations (by discharge or death) through a collaborative system set up between provincial and territorial governments and the Canadian Institute for Health Information (CIHI). CIHI sets national training requirements for those responsible for coding the data. Administrative health data such as the DAD, used through their ICD codes, are widely acknowledged as the reference standard in Canada for both research activities10 and public health initiatives.11

Data extraction

Once the coded patient records were deterministically linked to the EMR using PHN and Clinibase variables, linkage to the EMR subtables of interest was conducted through system variables (eg, table record identifier, health record number). We extracted and cleaned the EMR subtables that contained the following information: (1) inpatient laboratory subtable (contains all laboratory tests conducted within a patient encounter period), (2) inpatient medication subtable (contains all medications prescribed and fulfilled to the patient within a patient encounter period), and (3) subtable containing all clinical notes (free-text notes documented throughout the patient encounter period). ICD codes were obtained from the linked DAD data. These EMR subtables were used to develop the diabetes algorithms listed in the next section.

Diabetes algorithm development

Chart review labels served as the gold standard labels for algorithm development.

Operational standards—validated administrative data-based ICD codes algorithm

Current operational algorithm standards for surveillance and research are based on ICD-coded data. The National Diabetes Surveillance System (NDSS)12 employs an ICD code-based algorithm developed by Quan et al,13 which includes ICD-10-CA codes E10–E14 recorded during hospitalisation. We assessed the performance of the algorithm by Quan et al against the chart review labels.
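
As an illustration of the code range named above, the check for ICD-10-CA categories E10–E14 can be written in a few lines. This is a sketch under the assumption that codes arrive as plain strings (with or without a decimal point); the function name is hypothetical and the actual NDSS implementation may differ.

```python
import re

# ICD-10-CA categories E10-E14 (diabetes mellitus), as stated in the text.
DIABETES_ICD_PATTERN = re.compile(r"^E1[0-4]")


def has_diabetes_icd(codes):
    """Return True if any code recorded for the hospitalisation is in E10-E14."""
    return any(
        DIABETES_ICD_PATTERN.match(code.strip().upper()) is not None
        for code in codes
    )
```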

EMR data-based algorithms

Various approaches were implemented for developing algorithms accounting for different data modalities. All algorithms were compared against the chart review labels for performance measurements.

Laboratory data-based clinical diagnosis algorithm

To identify diabetes, we used haemoglobin A1C (HbA1c) tests, oral glucose tolerance tests, random plasma glucose tests, or fasting plasma glucose tests, adhering to the thresholds outlined in Diabetes Canada’s national guidelines for diagnosis. The criteria and thresholds for these tests have been published.14 While Diabetes Canada requires at least two separate test types for a diabetes diagnosis, the varied prevalence of recommended tests for each patient led us to implement a single test meeting the diagnostic criteria15 for performance reporting in this study.
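
The single-test rule described above could look like the sketch below. Only the HbA1c cut-off of 6.5% appears in this article; the glucose thresholds (7.0 mmol/L fasting, 11.1 mmol/L for the 2-hour oral glucose tolerance test and for random plasma glucose) are assumed here from the Diabetes Canada diagnostic criteria and should be checked against the cited guideline. The test names are hypothetical labels, not EMR field names.

```python
# Diagnostic thresholds. Glucose values are assumptions to verify against
# the cited guideline; only the HbA1c cut-off is stated in the article.
THRESHOLDS = {
    "hba1c_percent": 6.5,
    "fasting_glucose_mmol_l": 7.0,
    "ogtt_2h_glucose_mmol_l": 11.1,
    "random_glucose_mmol_l": 11.1,
}


def lab_algorithm_positive(results):
    """results: iterable of (test_name, value) pairs for one encounter.

    Positive if any single recognised test meets or exceeds its threshold.
    """
    return any(
        name in THRESHOLDS and value >= THRESHOLDS[name]
        for name, value in results
    )
```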

Medication data-based clinical diagnosis algorithm

The medication clinical algorithm included any use of a single agent or multiple agents commonly used to treat diabetes. The list of diabetes medications was derived from Diabetes Canada’s national guidelines, reviewed by clinicians (endocrinologists), and validated against Canada’s Drug Product Database14 (online supplemental appendix table 1).


Inpatient laboratory and medication data-based clinical diagnosis algorithm

This clinical diagnosis algorithm included both laboratory and medications data. Specifically, the absence of diabetes was defined as the highest HbA1c laboratory result below 6.5%16 17 with no evidence of prescribed or fulfilled medications. Pre-diabetes was defined by the highest HbA1c falling within the range of 6.0%–6.4% or through an oral glucose tolerance test, random plasma glucose test, or fasting plasma glucose test adhering to the thresholds listed in the Diabetes Canada guidelines, and no prescribed antidiabetic medications. Diabetes status was categorised as follows: (1) HbA1c ≥6.5%, if no evidence of medication, (2) meeting glycaemic targets: HbA1c values <7.0%, supported by evidence of both prescribed and dispensed medications, and (3) not meeting glycaemic targets: indicated by the highest HbA1c laboratory result closest to discharge >7.0%.18 Another subgroup of individuals with diabetes was identified as those with appropriately intensified therapy with agents known to confer cardiorenal benefit, such as (1) a GLP1RA if obese or with a history of cardiovascular disease or stroke, and (2) an SGLT2 inhibitor if chronic kidney disease (low GFR or albuminuria) or cardiovascular disease was present. These data were analysed in a time-series context, and all laboratory and medication records were used.

NLP clinical notes-based machine learning (ML) algorithm

Free-text notes were cleaned and decoded into American Standard Code for Information Interchange (ASCII) to ensure extracted free-text notes were converted to an analyzable format. Then all free-text notes were stratified by document types. The default clinical pipeline of clinical Text Analysis and Knowledge Extraction Systems (cTAKES)19 was used to process the raw text documents into unified medical language system’s (UMLS) concept unique identifiers (CUIs) for each patient.20 Two algorithms were developed: the first one was a CUI search of the diabetes concept which encompasses its synonyms (eg, diabetes, diabetes mellitus, hyperglycaemia), and the second algorithm was based on a data-driven model of all CUIs extracted from all document types. These CUIs covered anatomical sites, signs/symptoms, procedures, diseases/disorders and medications.

A data-driven supervised ML model on all document types and CUIs was developed (figure 1), closely following our previous work.21 The Boruta22 feature selection algorithm was applied to reduce the dimension of CUIs. An XGBoost23 algorithm was trained against the chart review cohort. The dataset was divided into training and testing sets at an 80:20 ratio, stratified by the diabetes outcome to ensure a similar ratio between the labels was maintained. Fivefold cross-validation was employed, and a grid search of hyperparameters was conducted. Feature importance was assessed for the top predictive CUI and document name pairs (ie, a specific CUI in a specific document type) associated with diabetes. The top 20 predictive document type and concept pairs were identified after fitting the XGBoost algorithm.
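
The stratified 80:20 split mentioned above can be illustrated with the standard library alone. This sketch covers only the splitting step, not Boruta or XGBoost; the function name, seed and toy labels are assumptions for illustration. In practice a library routine such as scikit-learn's `train_test_split` with its `stratify` argument would typically be used instead.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), sampling test_fraction within each label."""
    rng = random.Random(seed)
    by_label = {}
    for idx, label in enumerate(labels):
        by_label.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        # Hold out the same fraction of every class so the ratio is preserved.
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels mirroring the cohort's class balance (583 of 3040 with diabetes).
labels = [1] * 583 + [0] * 2457
train_idx, test_idx = stratified_split(labels)
```

Splitting within each label keeps the share of people with diabetes nearly identical in the training and testing sets, which matters when the positive class is roughly one fifth of the cohort.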

Figure 1

Clinical Text Analysis and Knowledge Extraction Systems (cTAKES) and XGBoost free-text algorithm. After free-text notes were extracted from the Sunrise Clinical Manager (SCM) electronic medical record (EMR), these notes were processed by document type using cTAKES. Boruta feature selection was employed and an XGBoost classification model was fit. This diagram was adapted and modified from our previous work on hypertension.

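Combined algorithms used logical OR statements between the above algorithms (ie, a patient was classified as having diabetes if any of the combined algorithms was positive).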
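
A minimal sketch of such an OR combination is shown below, assuming each algorithm outputs one binary flag per patient in the same patient order. The function and variable names are hypothetical.

```python
def combine_or(*algorithm_flags):
    """Combine per-patient binary flags from several algorithms with logical OR."""
    return [int(any(flags)) for flags in zip(*algorithm_flags)]


# Toy flags for four patients from three algorithms.
icd_flags = [1, 0, 0, 0]
medication_flags = [1, 1, 0, 0]
free_text_flags = [0, 1, 0, 1]
combined = combine_or(icd_flags, medication_flags, free_text_flags)
```

Because an OR combination can only add positive classifications, SN cannot decrease relative to any single component, while any additional false positives may lower PPV.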

Evaluation metrics and validation

Several evaluation metrics were calculated to assess the model performance. These metrics included sensitivity (SN), specificity (SP), positive predictive value (PPV), and negative predictive value (NPV). Statistical tests such as t-test, χ2 and Kruskal-Wallis one-way analysis of variance test were applied for continuous, categorical and ordinal variables, respectively.
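
For reference, the four metrics follow directly from the confusion matrix against the gold standard. The sketch below uses toy labels, not study data, and the function name is hypothetical.

```python
def diagnostic_metrics(y_true, y_pred):
    """Compute SN, SP, PPV and NPV from binary gold-standard and predicted labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth and pred:
            tp += 1
        elif truth and not pred:
            fn += 1
        elif not truth and pred:
            fp += 1
        else:
            tn += 1

    def ratio(numerator, denominator):
        # Undefined metrics (empty denominator) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    return {
        "SN": ratio(tp, tp + fn),   # true positives among those with diabetes
        "SP": ratio(tn, tn + fp),   # true negatives among those without
        "PPV": ratio(tp, tp + fp),  # true positives among predicted positives
        "NPV": ratio(tn, tn + fn),  # true negatives among predicted negatives
    }


# Toy example: 3 people with diabetes, 5 without.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = diagnostic_metrics(y_true, y_pred)
```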

Figure 2 schematically presents the process flow from data linkage to algorithm development. Figure 1 depicts the detailed algorithm development process of applying cTAKES on the free-text data. We determined the best-performing algorithm as the one that showed high performance in both SN and PPV.

Figure 2

Flow process of algorithm development. Chart Review data were deterministically linked to discharge abstract database (DAD) and inpatient data. The International Classification of Diseases (ICD) algorithm was developed by Quan et al. Laboratory and medication algorithms used Diabetes Canada3’s established definitions. Medications were ascertained on Canada’s drug product database. Free-text algorithm employed clinical Text Analysis and Knowledge Extraction Systems (cTAKES) for extracting concept unique identifiers (CUIs) from clinical notes and XGBoost was applied.


Results

Cohort overview

We analysed the charts of 3040 individuals, and their demographic details are summarised in table 1. The median age was 62.5 years and there was an equal distribution between males and females. The median body mass index of the cohort was 23.8 kg/m2, and 1617 individuals (53.2%) had no Charlson comorbidities. Among these 3040 individuals, 583 individuals (19.2%) had diabetes based on the chart review ‘gold standard’. The cohort with diabetes was, on average, 10 years older than the overall chart review cohort (p<0.01). Within the diabetes cohort, there was a higher proportion of males than females (p<0.01). Additionally, the comorbidity profiles differed between the two groups, with the diabetes subcohort exhibiting a higher prevalence of comorbidities compared with the overall cohort (p<0.01).

Table 1

Demographics of people with diabetes from the chart review cohort

Feature selection on all document type ML model

The cTAKES system successfully processed 692 918 free-text records across a total of 59 document types within this cohort. The system also extracted negation status and experiencer details, distinguishing between patients and family members. We retained only CUIs that were not negated and had the patient as the experiencer, resulting in a total of 83 107 CUIs. The Boruta method recommended the inclusion of 42 ranked features, with an additional three features identified as tentative. Therefore, we considered the top 45 ranked features, which constituted the training dataset for the XGBoost model.
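
The filtering step described above (keep only non-negated concepts whose experiencer is the patient, then form document type and CUI pairs) can be sketched as follows. The record layout, field names and CUI strings are hypothetical placeholders and do not reflect the actual cTAKES output format.

```python
from collections import Counter


def build_features(mentions):
    """Count (document_type, cui) pairs per patient.

    Mentions that are negated, or whose experiencer is not the patient,
    are dropped before counting.
    """
    features = {}
    for m in mentions:
        if m["negated"] or m["experiencer"] != "patient":
            continue
        key = (m["document_type"], m["cui"])
        features.setdefault(m["patient_id"], Counter())[key] += 1
    return features


# Toy mentions; CUI values are placeholders, not real UMLS identifiers.
mentions = [
    {"patient_id": "p1", "document_type": "Discharge Summary",
     "cui": "CUI_DIABETES", "negated": False, "experiencer": "patient"},
    {"patient_id": "p1", "document_type": "History and Physical",
     "cui": "CUI_DIABETES", "negated": True, "experiencer": "patient"},
    {"patient_id": "p1", "document_type": "History and Physical",
     "cui": "CUI_DIABETES", "negated": False, "experiencer": "family_member"},
    {"patient_id": "p2", "document_type": "Progress Note",
     "cui": "CUI_METFORMIN", "negated": False, "experiencer": "patient"},
    {"patient_id": "p2", "document_type": "Progress Note",
     "cui": "CUI_METFORMIN", "negated": False, "experiencer": "patient"},
]
features = build_features(mentions)
```

Dropping negated and family-member mentions prevents phrases such as "no history of diabetes" or "mother has diabetes" from counting as evidence for the patient.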

Algorithm performance

Table 2 presents the performance of the diabetes clinical and ML algorithms on the testing dataset. The administrative database ICD-based algorithm yielded SN of 0.84, SP of 0.98, PPV of 0.93 and NPV of 0.96; medication data-based clinical algorithm, SN of 0.89, SP of 0.98, PPV of 0.91 and NPV of 0.98; selected keyword concepts from free-text notes, SN of 0.73, SP of 0.93, PPV of 0.70 and NPV of 0.93; ML algorithm based on free-text notes, SN of 0.95, SP of 0.98, PPV of 0.94 and NPV of 0.99. The performance of the various combined clinical and ML algorithms is also shown in table 2.

Table 2

Performance of clinical and ML algorithms on the testing dataset (n=609)

Top features from all document type ML model

Figure 3 presents the top 20 pairs of document type and feature from the free-text ML algorithm in the chart review cohort. Calibration plot of the free text (ie, all documents; CUI and XGBoost) is shown in online supplemental appendix figure 1. Grid search space and best hyperparameters are shown in online supplemental appendix table 2. The confusion matrix for the free-text XGBoost algorithm on testing dataset is shown in online supplemental appendix table 3.


Figure 3

Top 20 predictive document type and concept pairs selected from the all-document types supervised free-text XGBoost algorithm. Consistent diabetes-related terminologies were identified from multiple document types.

Among the top 20 features of the XGBoost model, the most influential contributors to classifying individuals with a diabetes diagnosis were any glucose documentation or fasting blood glucose measurement recorded within SCM inpatient settings. Following closely was the mention of ‘breakfast’ in the free-text notes. Other top features included mentions of diabetes, medication administration (eg, metformin, insulin) and diabetic diet. Several document types consistently captured predictor variables or features.


Discussion

This study explored various EMR data-based case definitions for diabetes, uncovering algorithms with excellent performance. We used chart review labels as our gold standard. While the validated administrative data-based ICD-code algorithm demonstrated strong performance, the findings support our hypothesis that harnessing free-text notes can yield comparable or superior results to existing standard methods. The ML algorithm that included all document types of free-text notes was the top performer in this study cohort, with 0.95 SN and 0.94 PPV. Meanwhile, the combination of the free-text algorithm, medication and ICD codes improved the SN to 0.97 but reduced the PPV to 0.87.

The current operational standards for defining diabetes for surveillance (ie, NDSS)12 and research purposes in Canada were shaped by the administrative data-based ICD code algorithm.13 These methodologies rely on readily available, standardised ICD-code databases, like the DAD, established in both national and international settings. In the Canadian context, these DAD records are reliant on the quality of ICD codes produced by the trained coders who review the charts. Diabetes is a chronic condition which is heavily emphasised for ICD coding in Alberta, and yet the algorithms that solely use these codes resulted in a lower SN compared with the free-text algorithm. This discrepancy stems from the fact that ICD coders primarily review physician documentation from free-text documents within the EMR system for ICD coding in Canada, as dictated by the system design. Challenges and limitations encountered in ICD coding have been described in previous studies,24 indicating the information overload experienced by the healthcare system and workers in various areas when dealing with EMR data.

A recent scoping review highlighted that diabetes definitions typically incorporate laboratory and medications data, along with ICD codes.8 Laboratory data typically employ values surpassing specific clinical thresholds to determine disease status. When a patient is being treated with antihyperglycaemic medication, these clinical values are presumed not to reach that threshold due to the medication’s effect. In our study, the combined clinical diagnosis algorithm of laboratory and medication had a 0.90 SN and 0.80 PPV, which is comparable to algorithms described in the above-mentioned review. A systematic review25 on the applications of NLP in diabetes care showed that out of 38 studies, 17 aimed to define diabetes, but most of these studies relied on single concept words or keyword-based definitions (ie, diabetes). In our cohort, the keyword algorithm had a 0.73 SN and 0.70 PPV, potentially reflecting the quality of documentation or the practice of data being entered into the EMR from the front end. Figure 3 showed that several consistent diabetes-related medication terminologies (eg, metformin and insulin) were captured across multiple EMR document types. The ML-based algorithm which included all types of free-text documents performed the best in this study cohort, achieving an SN of 0.95 and a PPV of 0.94, raising several important considerations. The ICD code algorithm had a 0.84 SN and 0.93 PPV. Combined algorithms often increased SN but reduced PPV, which was expected.

EMR systems, such as SCM1 and Connect Care (Alberta’s newly implemented province-wide clinical information system),26 based on Epic software (Madison, WI), typically have a front-end graphical user interface for delivering clinical care. It is important to note that not all healthcare workers or providers have access to complete patient charts, and access is typically determined based on assigned roles in the system. Information overload from EMR data can occur if too much information is given,27 and communication oversight could arise if insufficient information is provided.28 Additionally, the quality of clinical notes documentation can be heavily influenced by interactions between the care providers and patients or their family members, potentially triggering varying sets of orders and interventions documented in the EMR system. This project extracted all free-text notes from the back end of the EMR system and processed these documents using a standardised medical terminology dictionary (ie, UMLS). Our findings demonstrated that various types of healthcare workers and providers are documenting similar medical concepts across multiple EMR document types for diabetes. Therefore, analysing the commonality in documentation across roles to consolidate and centralise information for shared awareness would enhance information flow in clinical care settings and improve downstream processes, such as improving the quality of the administrative health databases.

Current diabetes definitions based on ICD-code databases are not integrated into clinical practices within the Canadian context, as DAD coding systems and EMR systems operate separately from each other. Alberta’s Connect Care clinical information system, which includes Epic-based EMR infrastructure and is now in operation throughout AHS-operated acute care and ambulatory facilities, has the capacity to integrate ML models,29 with potential outputs incorporated into dashboards. The integration of inpatient data-specific case definitions could facilitate easier identification of comorbidities and the design of automated risk prediction algorithms within the EMR, which could be implemented at the point of care as needed. As EMR adoption in Canada continues to rise,4 the implementation of EMR data-based diabetes case definitions from both inpatient and outpatient care30 has the potential to enhance the quality of DAD data for diabetes. This, from a research operations standpoint, could assist with cohort selection for epidemiological and clinical studies. The subsequent improvement in DAD will, in turn, enhance the surveillance capabilities of the NDSS for Alberta in the long run.

This study is not without limitations. First, as we used a single geographic setting, external validation from a different geographical setting is needed. Second, our algorithms do not differentiate between type 1 and type 2 diabetes, the two most common forms of diabetes. With the prevalence of both types increasing, as well as differences in management and care, differentiating between these types is important and will be an area of future work. Third, we appreciate the immaturity of the proposed application in real-life practice, but importantly this study is foundational work for ML in healthcare systems. Fourth, we appreciate the limited interpretability of the prediction model. Importantly, in our study, we demonstrated explainability by showing that the top features (figure 3) coincide with what is documented within clinical practice. This strengthens the application of our model in real-world practice. Finally, we appreciate the lack of system infrastructure to implement models, with existing EMRs not having the capacity to implement designed ML models. AHS has recently implemented an Epic-based clinical information system, which has the capacity to integrate ML models into EMR systems, in AHS-operated and partner acute and subacute care sites, ambulatory care locations, clinical lab services and diagnostic imaging areas. That being said, our study has many strengths. These include taking a multimodal EMR data approach to develop a case definition for diabetes and comparing it with existing standards, integrating ML and NLP onto EMR data, and using the randomly selected chart review data as the gold standard.

Our future studies will expand to include Connect Care data and eventually validate this work in other jurisdictions. Furthermore, we will evaluate the implementation of our ML models into existing clinical information systems. Recent advancements in large language models have shifted the interest in developing such models for eventual deployment in healthcare systems from the NLP field perspective. While we acknowledge that we have not considered these deep learning NLP models for this study, a future study is in the planning stages, aiming to explore large language model methods on a study cohort with a much larger disease prevalence.10


Conclusion

As NLP techniques advance, there is the potential to leverage them in healthcare, particularly for using free-text data within EMRs. As such, we assessed several algorithms and found that the free-text algorithm performed the best in this cohort. Determining the ideal algorithm or combination of algorithms for implementation would depend on the needs, the clinical practice culture and data availability. These types of inpatient EMR-based algorithms for case identification are ideal for timely care delivery and resource planning.


Ethics statements

Patient consent for publication

Ethics approval

The study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Conjoint Health Research and Ethics Board of University of Calgary (REB15-0790, and approved 7 May 2023).




  • Contributors SL, HQ and SB conceptualised this study. CAE and DAS provided the chart review data. EAM and SL conducted data extraction and linkage. SL conducted analysis. EAM and JP assisted with analysis. DJTC and AAS refined the laboratory and medications algorithm and reviewed the medications list. SL and SB drafted the manuscript. All authors reviewed the contents of the manuscript. SL, HQ and SB are the guarantors of this study.

  • Funding This work was supported by Canadian Institutes of Health Research, Foundation Grant FDN-167272, awarded to HQ.

  • Competing interests None declared.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.