Methods
The TMUCRD is a central data warehouse of EHRs, providing us with a platform to leverage our accumulated expertise in managing and combining data, as illustrated in figure 1. This data repository contains a wealth of information including details about patients’ demographics, observations, diagnoses, prescribed medications, medical devices used, laboratory measurements, procedure codes, pathology and medical imaging reports, as well as vital health data. Currently, the database covers a vast range of medical records for approximately 4.15 million patients, spanning from the year 2004 to 2021.
Figure 1 The overview of the TMUCRD. ATC, anatomical therapeutic chemical; ICD-9-CM, International Classification of Disease, 9th Revision, Clinical Modification; NHI, National Health Insurance; OHDSI, Observational Health Data Sciences and Informatics; SHH, Shuang Ho Hospital; TMUH, Taipei Medical University Hospital; WFH, Wan-Fang Hospital.
Database development
The Clinical Data Centre (CDC) at the TMU Office of Data Science is a collaborative group made up of experts in data science, pharmacists and practising physicians. They have joined forces to create the research database. The TMUCRD database is filled with information collected during regular hospital care, meaning it does not cause any extra work for healthcare providers or disrupt their usual routines. The data have been gathered from various sources and linked, including:
Archives from hospital information system (HIS) databases.
Taiwan Cancer Registry database.
Taiwan Death Registry database.
During the data collection period, information was gathered from three distinct HISs: TMUH, WFH and SHH. These systems served as the origin of clinical data, comprising various elements such as:
Different types of forms like outpatient, inpatient and emergency records.
Results of ordered measurements.
Medications prescribed by clinicians/physicians.
Details about procedures performed and associated fees.
Patient demographic data including birthdates, zip codes, height, weight, blood pressure readings (systolic blood pressure, diastolic blood pressure), temperature for each hospital visit and in-hospital mortality.
Recorded notes such as discharge summaries and reports from examinations such as radiology, cardiology and pathology.
Medical images in Digital Imaging and Communications in Medicine (DICOM) format, which include X-rays, CT scans, MRIs and ultrasounds.
With the exception of data specifically collected for research purposes, the data were extracted and organised into database tables with structures distinct from those of the HISs. These data are stored individually for each hospital and are differentiated using a suffix denoting their source. For instance, TMUH’s outpatient visits are stored in the OPD_BASIC_T table while WFH’s and SHH’s outpatient visits are stored in the OPD_BASIC_W and OPD_BASIC_S tables, respectively. However, patient data can still be cross-referenced across hospitals using their pseudoidentification, represented by the ‘ID_NO’.
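As a minimal sketch of how the per-hospital tables might be queried together, the example below builds toy versions of OPD_BASIC_T, OPD_BASIC_W and OPD_BASIC_S in an in-memory SQLite database and collects one patient's outpatient visits across hospitals through ID_NO. Only the table names and ID_NO come from the description above; the VISIT_DATE column, the sample rows, the function names and the choice of SQLite are assumptions made for illustration and do not reflect the actual TMUCRD schema or storage engine.

```python
import sqlite3

# Table suffix -> hospital, following the naming convention described above.
HOSPITAL_SUFFIXES = {"T": "TMUH", "W": "WFH", "S": "SHH"}


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with toy outpatient tables."""
    conn = sqlite3.connect(":memory:")
    for suffix in HOSPITAL_SUFFIXES:
        # VISIT_DATE is an assumed column; only ID_NO is named in the text.
        conn.execute(
            f"CREATE TABLE OPD_BASIC_{suffix} (ID_NO TEXT, VISIT_DATE TEXT)"
        )
    # Hypothetical sample rows.
    conn.execute("INSERT INTO OPD_BASIC_T VALUES ('p001', '2019-03-01')")
    conn.execute("INSERT INTO OPD_BASIC_W VALUES ('p001', '2020-07-15')")
    conn.execute("INSERT INTO OPD_BASIC_S VALUES ('p002', '2021-01-20')")
    return conn


def outpatient_visits(conn: sqlite3.Connection, id_no: str) -> list:
    """Return (hospital, visit_date) pairs for one ID_NO across all hospitals."""
    parts = [
        f"SELECT '{name}' AS HOSPITAL, VISIT_DATE "
        f"FROM OPD_BASIC_{suffix} WHERE ID_NO = ?"
        for suffix, name in HOSPITAL_SUFFIXES.items()
    ]
    query = " UNION ALL ".join(parts) + " ORDER BY VISIT_DATE"
    # One bound parameter per SELECT in the union.
    return conn.execute(query, [id_no] * len(parts)).fetchall()
```

Calling outpatient_visits(build_demo_db(), "p001") on the toy data returns the patient's visits at both TMUH and WFH, which shows how a shared pseudoidentifier allows cross-hospital linkage even though the tables are stored separately.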
We acquired information about mortality occurring outside the hospital environment by referring to the Taiwan Death Registry database, which is maintained by the Taiwan Ministry of the Interior.21 Additionally, we have established a link between the TMUCRD and the Taiwan Cancer Registry, a dataset offered by the Taiwan Ministry of Health.22 This linkage allowed us to identify patients who were diagnosed with various forms of cancer and had visited any of the three hospitals in our study.
The TMUCRD vocabulary contains various terms, and the team at the CDC has worked to link these terms with standardised dictionaries within the database. As an example, the codes used for laboratory tests and medications in TMUCRD, which are recognised by Taiwan’s National Health Insurance (NHI), have been connected to codes in LOINC23 and RxNorm,24 respectively. These efforts have been made to adapt TMUCRD into widely accepted data formats, such as the Observational Medical Outcomes Partnership CDM (OMOP CDM). This adaptation enables the use of consistent tools and methodologies.25
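The vocabulary linkage described above can be pictured as a lookup from local codes to standard codes. The sketch below is illustrative only: the vocabulary labels NHI_LAB and NHI_DRUG, the function names and every code value are hypothetical placeholders, not real NHI, LOINC or RxNorm codes, and the actual mapping tables maintained by the CDC may be organised differently.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical mapping: (local vocabulary, local code) -> (standard vocabulary, code).
# Every code below is a placeholder, not a real NHI, LOINC or RxNorm code.
CODE_MAP: Dict[Tuple[str, str], Tuple[str, str]] = {
    ("NHI_LAB", "LAB-0001"): ("LOINC", "LOINC-PLACEHOLDER-1"),
    ("NHI_DRUG", "DRUG-0001"): ("RxNorm", "RXNORM-PLACEHOLDER-1"),
}


def to_standard(vocabulary: str, code: str) -> Optional[Tuple[str, str]]:
    """Look up the standard code for a local code; None means 'not yet mapped'."""
    return CODE_MAP.get((vocabulary, code))


def mapping_coverage(records: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (vocabulary, code) records that have a standard mapping."""
    records = list(records)
    if not records:
        return 0.0
    mapped = sum(1 for vocab, code in records if to_standard(vocab, code) is not None)
    return mapped / len(records)
```

A coverage figure of this kind is one simple way to monitor how much of the local vocabulary has been linked before data are converted to a common data model.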
Deidentification
Prior to being integrated into the TMUCRD database, the data underwent a deidentification process to adhere to the standards set by the Health Insurance Portability and Accountability Act (HIPAA). The initial step was conducted independently by the Centre for Management and Development (CMD) at TMU.26 This deidentification was achieved using structured data techniques.27 The process for structured data involved the elimination of eighteen specific data elements that could potentially identify individuals, as outlined in HIPAA. This removal included details such as patient names, phone numbers, addresses and dates. Notably, for the birth dates, only the year and month were retained for each patient, ensuring further privacy.
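The structured-data step can be illustrated with a short sketch that drops direct identifiers and reduces the birth date to year and month. The field names (name, phone, address, birth_date, birth_year_month) and the function name are assumptions for illustration, and the example covers only a few of the eighteen HIPAA elements; it is not the procedure implemented by the CMD.

```python
from datetime import date

# Hypothetical field names; the text names only the categories of data removed.
DIRECT_IDENTIFIERS = {"name", "phone", "address"}


def deidentify_record(record: dict) -> dict:
    """Drop direct identifiers and keep only year and month of the birth date."""
    # Build a new dict so the input record is left unchanged.
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    birth = cleaned.pop("birth_date", None)
    if birth is not None:
        # Retain year and month only, as described for birth dates.
        cleaned["birth_year_month"] = f"{birth.year:04d}-{birth.month:02d}"
    return cleaned
```

For a record with a birth_date of date(1980, 5, 17), the sketch keeps only the string "1980-05" alongside the non-identifying fields.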
Moreover, an additional layer of deidentification was implemented by introducing randomisation to the variables within each data table. Specifically, we combined the initial pseudoidentification with a randomly generated salt-key, which consists of data from one or multiple variables associated with each patient. This salt-key serves as an additional input to a one-way function that hashes the pseudoidentification. We also employed checksum functions based on the MD5, SHA1 and SHA256 hash algorithms. This process was completed before the data were provided to each respective study principal investigator (PI). The components of this deidentification system are continually expanded to accommodate new data as they are obtained.
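The salted hashing and checksum steps can be sketched with Python's standard hashlib and secrets modules. This is a simplified illustration under stated assumptions: the salt here is a purely random byte string, whereas the salt-key described above is derived from patient-associated variables; SHA-256 is chosen arbitrarily as the one-way function; and the function names are hypothetical.

```python
import hashlib
import secrets


def new_salt(n_bytes: int = 16) -> bytes:
    """Generate a random salt (assumption: one salt per released dataset)."""
    return secrets.token_bytes(n_bytes)


def hash_pseudo_id(pseudo_id: str, salt: bytes) -> str:
    """One-way hash of salt + pseudoidentifier (SHA-256 chosen for illustration)."""
    return hashlib.sha256(salt + pseudo_id.encode("utf-8")).hexdigest()


def checksums(payload: bytes) -> dict:
    """Compute MD5, SHA1 and SHA256 checksums of a payload, e.g. an exported file."""
    return {
        "md5": hashlib.md5(payload).hexdigest(),
        "sha1": hashlib.sha1(payload).hexdigest(),
        "sha256": hashlib.sha256(payload).hexdigest(),
    }
```

Because the same salt always yields the same hash for a given identifier, records remain linkable within one released dataset, while a different salt per study prevents identifiers from being matched across datasets given to different PIs.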
The code used to create the TMUCRD introduction website and its accompanying documentation is accessible solely to individuals associated with TMU, including the PIs. The link to access this code is available.28
CDM conversion
The OMOP CDM serves as a standardised structure for organising observational medical data. Its purpose is to ensure the reliable analysis and utilisation of medical information for research purposes. This model includes standardised vocabularies that establish uniform terminology usage across different medical areas.29 Essentially, it provides a systematic framework for converting varied healthcare data into a shared format, facilitating consistent analysis across diverse data sources and research investigations.30 In January 2021, work began to convert the TMUCRD database to the OMOP CDM standard, with the support of the Observational Health Data Sciences and Informatics (OHDSI) global initiative. The converted database, which combines data from all three affiliated hospitals, was named the TMU-CDM.
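To make the idea of a shared format concrete, the sketch below converts a source visit row into a record shaped like a subset of the OMOP CDM visit_occurrence table. The source field names (visit_type, start_date, end_date) and the function name are hypothetical, and the visit concept IDs are given as we understand them from the OHDSI standard vocabulary and should be verified against the vocabulary release in use. This is not the conversion pipeline used for the TMU-CDM.

```python
from dataclasses import dataclass
from datetime import date

# Visit concept IDs as understood from the OHDSI standard vocabulary
# (inpatient, outpatient, emergency room); verify before relying on them.
VISIT_CONCEPTS = {"inpatient": 9201, "outpatient": 9202, "emergency": 9203}


@dataclass
class VisitOccurrence:
    """A subset of the columns of the OMOP CDM visit_occurrence table."""
    visit_occurrence_id: int
    person_id: int
    visit_concept_id: int
    visit_start_date: date
    visit_end_date: date


def to_visit_occurrence(row: dict, visit_id: int, person_id: int) -> VisitOccurrence:
    """Convert a hypothetical source visit row into a visit_occurrence record."""
    # Concept ID 0 is the OMOP convention for "no matching concept".
    concept_id = VISIT_CONCEPTS.get(row["visit_type"], 0)
    start = row["start_date"]
    # Assumption: a visit without a recorded end date ends on the day it starts.
    end = row.get("end_date") or start
    return VisitOccurrence(visit_id, person_id, concept_id, start, end)
```

Once visits from all three hospitals are expressed in this common shape, the same analysis code can run against any of them without knowledge of the source table layout.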
Technical validation
To keep the TMUCRD a close representation of the original data collected from the three affiliated hospitals, we aimed to minimise changes to its structure while still achieving the necessary level of deidentification and the required data schema.
We adhered to the best practices in scientific computing whenever feasible. The development of TMUCRD was managed with version control, ensuring that changes were well tracked and documented. Issue tracking was implemented to transparently document any limitations in the data or code and address them appropriately. We actively encourage the research community to report and address any issues they come across. Furthermore, we have established a system for minor updates to the database.
The process of converting the TMUCRD to the TMU-CDM was carefully validated. This validation process followed the guidance of the OHDSI global initiative, particularly the SOS project.31 This rigorous approach ensured the accuracy and reliability of the conversion process.