Probabilistic linking to enhance deterministic algorithms and reduce linkage errors in hospital administrative data

Background The pseudonymisation algorithm used to link together episodes of care belonging to the same patient in England [Hospital Episode Statistics ID (HESID)] has never undergone any formal evaluation to determine the extent of data linkage error.
Objective To quantify improvements in linkage accuracy from adding probabilistic linkage to existing deterministic HESID algorithms.
Methods We analysed inpatient admissions to National Health Service (NHS) hospitals in England (HES) over 17 years (1998 to 2015) for a sample of patients (born 13th or 28th of months in 1992/1998/2005/2012). We compared the existing deterministic algorithm with one that included an additional probabilistic step, in relation to a reference standard created using enhanced probabilistic matching with additional clinical and demographic information. Missed and false matches were quantified and the impact on estimates of hospital readmission within one year was determined.
Results HESID produced a high missed match rate, improving over time (8.6% in 1998-2003 to 0.4% in 2010-2015). Missed matches were more common for ethnic minorities, those living in areas of high socio-economic deprivation, foreign patients and those with ‘no fixed abode’. Estimates of the readmission rate were biased for several patient groups owing to missed matches; the additional probabilistic step reduced this bias for nearly all groups.
Conclusion Probabilistic linkage of HES reduced missed matches and bias in estimated readmission rates, with clear implications for commissioning, service evaluation and performance monitoring of hospitals. The existing algorithm should be modified to address data linkage error, and a retrospective update of the existing data would address existing linkage errors and their implications.


Introduction
Data linkage algorithms are widely used to combine records that belong to the same individual. Errors in patient identifiers,1 data quality problems,2 missing data3 or imperfect linkage algorithms1 can produce two kinds of linkage errors: false matches, where two records belonging to different patients are linked (2) and missed matches, where two records belonging to the same patient are not linked. Linkage errors can bias the results of data analyses, with important implications for the accuracy of official statistics,4 and for data used for funding, planning or delivering services or for monitoring the relative performance of hospitals.
Bias due to linkage errors can artefactually alter differences between groups (for example, between hospitals, or age groups) by making differences bigger or smaller or changing the direction of the effect.4-6 The impact of bias due to linkage error can be compounded by low event rates or when sensitivity of an algorithm differs across cohorts.7 Analysts are rarely able to take linkage error into account in their analyses as linkage methods are rarely reported in detail,8 and few algorithms have been validated against good quality reference standard data sets.1,6 Hence, analysts using anonymised data, without identifiers, are often unaware of the extent of linkage error and cannot adjust for such error in their analyses.9 Data linkage errors can be addressed by improving both data quality and the algorithm used for linkage. Algorithms that use deterministic matching are popular, in part because they can be fully automated. However, deterministic algorithms designed to minimise false matches often have the disadvantage of a high missed match rate.10 The algorithm used to link together the records of care belonging to the same patient using National Health Service (NHS) hospitals in England [Hospital Episode Statistics (HES)] is thought to have a missed match rate of at least 4%.1 Although data quality has improved over time, the frequency with which key identifiers, such as NHS number, are missing disproportionately affects certain patient groups, leading to increased missed match rates and hence to underestimates of readmission and mortality rates.1,9 HES is widely used for calculating costs, commissioning services, monitoring performance of NHS hospitals, evaluating services and monitoring health inequalities. Bias due to linkage error will affect all these analyses, and so has specific and important implications.
Probabilistic data linkage is known to produce more accurate linkage and less biased results11 than deterministic linkage, particularly in settings where data quality is poor.12 The aim of our evaluation was to determine whether adding a probabilistic step to the existing deterministic algorithm used to link data on admissions to English hospitals (HES) would reduce the missed match rate and provide more accurate estimates of the relative risk of hospital readmission within one year, for different patient groups.

Population and databases
The HES administrative data set records care within English hospitals, from 1989/90 onwards.13 A deterministic linkage algorithm is used for internal data linkage, producing a pseudonym called the HES ID (HESID) that identifies the same patient when they are readmitted.14 Our study population comprised records where the date of birth was 13th or 28th of any month, appearing in the Admitted Patient Care data set from 1998 (the first available calendar year with data available on ethnic group and other relevant variables) to 2015 (the last available calendar year). These dates were chosen in order to avoid issues associated with transposition of days and months, and with commonly used default date values (1st and 15th).15 We restricted the sample to patients born in four years (1992, 1998, 2005, 2012), allowing us to consider both age and year of data collection. Analysis took place within the Health and Social Care Information Centre (HSCIC) in 2015 and 2016.
Covariates-Age at admission was calculated using date of birth and admission date then grouped into 0-3, 4-7, 8-11, 12-15, 16-19 and 20-23. Sex was classified as male, female or missing. Ethnic groups were grouped into White, Mixed, Asian, Black, Chinese/Other and missing (missing included codes referring to unknown ethnic group). Postcode was used to identify records referring to foreign patients (which includes countries in the UK other than England), those with 'no fixed abode' (which includes homeless patients), and to calculate the index of multiple deprivation (IMD) 2004 score,16 a measure of socio-economic deprivation at a small area level. Five mutually exclusive socio-economic groups were created from postcode and IMD score: socio-economically deprived (most deprived quintile), not socio-economically deprived, missing postcode, foreign postcode and 'no fixed abode' postcode. For analyses after data linkage (described below) that considered patients and their risk of readmission, covariates may change over time, leading us to select the most commonly occurring category.
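The age banding and the 'most commonly occurring category' rule can be illustrated with a minimal Python sketch. It assumes that age is measured in completed years and that ties between categories are broken by first occurrence; neither detail is stated in the text, and the function names are hypothetical.

```python
from collections import Counter
from datetime import date

# Age bands as listed in the text (inclusive bounds, in years).
AGE_BANDS = [(0, 3), (4, 7), (8, 11), (12, 15), (16, 19), (20, 23)]


def age_group(date_of_birth: date, admission_date: date) -> str:
    """Return the age band for age at admission.

    Assumption: age is counted in completed years.
    """
    years = admission_date.year - date_of_birth.year
    # Subtract one year if the birthday has not yet occurred in the admission year.
    if (admission_date.month, admission_date.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    for low, high in AGE_BANDS:
        if low <= years <= high:
            return f"{low}-{high}"
    return "out of range"


def modal_category(values):
    """Return the most commonly occurring category across a patient's records.

    Assumption: ties are broken by first occurrence (Counter ordering).
    """
    return Counter(values).most_common(1)[0][0]
```

For example, a patient recorded as White on two admissions and Mixed on one would be assigned White for the patient-level analysis.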

Linkage procedures
Date of birth, sex, NHS number, local ID, provider code and postcode were used as the personal identifiers to match records.14 For the reference standard data set, ethnic group, general practitioner (GP) code, local authority code and the first three diagnostic codes (on the basis that 85% of records have up to three diagnostic codes) were used as additional identifying characteristics to ascertain true match status.13 Record linkage was performed in Microsoft SQL Server 2008, for deterministic and probabilistic matching.
Deterministic linkage-The existing deterministic algorithm operated by HSCIC to allocate HESID is not publicly available and is considered proprietary, but is described in sufficient detail elsewhere14 to be replicated using a range of programming languages. We wrote a version in SQL that has the same three steps: (1) Records are initially matched on the basis of partial or full agreement on date of birth, exact agreement on sex and exact agreement on NHS number; (2) Records are matched if partially agreed on date of birth, exactly agreed on sex, exact local ID within provider and exact postcode; (3) Records are matched if they agreed exactly on date of birth, sex and postcode. At this third step, communal postcodes are not considered and existing NHS numbers are disallowed. To match at step 3, NHS number and either local ID or provider code would have to be missing. 14 Local ID within provider is a concatenation of provider and local ID, with zeros or spaces removed prior to linkage.14 Records with contradictory NHS numbers can be matched to the same HESID at step 2. Due to an ongoing error in compiling HES, most postcodes are missing for birth records prior to 2014. This technical issue means that all birth episodes extracted from hospitals into HES have blank postcodes, and therefore, limited geographic or socio-economic information is available. Only birth episodes incorrectly coded as another episode type (e.g. general episode) contain postcode.17 Allocation of NHS number at birth was introduced in 2005,18 generating linkage errors for multiple births before that time (given that NHS number was often missing and a match on local ID would not be allowed if postcode was missing).
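The three steps described above can be sketched as a pairwise rule in Python. This is an illustration of the published description, not the HSCIC implementation or our SQL version. Several details are assumptions made for the sketch: 'partial agreement' on date of birth is taken to mean that at least two of year, month and day agree; 'zeros or spaces removed' is read as stripping spaces and leading zeros from local ID; and the step 3 restriction is read as requiring NHS number and local ID within provider each to be missing on at least one record of the pair. The field names and the `communal_postcodes` argument are hypothetical.

```python
def local_id_within_provider(rec):
    """Concatenate provider and cleaned local ID; None if either part is missing."""
    provider, local_id = rec.get("provider"), rec.get("local_id")
    if not provider or not local_id:
        return None
    cleaned = local_id.replace(" ", "").lstrip("0")  # assumption: spaces and leading zeros
    return f"{provider}|{cleaned}" if cleaned else None


def deterministic_step(a, b, communal_postcodes=frozenset()):
    """Return the step (1, 2 or 3) at which two records match, or None."""
    # dob is a (year, month, day) tuple; count agreeing components.
    dob_hits = sum(x == y for x, y in zip(a["dob"], b["dob"]))
    partial_dob = dob_hits >= 2  # assumption: two of three components agree
    exact_dob = dob_hits == 3
    same_sex = a.get("sex") is not None and a.get("sex") == b.get("sex")
    same_nhs = a.get("nhs_number") is not None and a.get("nhs_number") == b.get("nhs_number")
    lid_a, lid_b = local_id_within_provider(a), local_id_within_provider(b)
    same_local = lid_a is not None and lid_a == lid_b
    same_postcode = a.get("postcode") is not None and a.get("postcode") == b.get("postcode")

    # Step 1: partial or full date of birth, exact sex, exact NHS number.
    if partial_dob and same_sex and same_nhs:
        return 1
    # Step 2: NHS number is not checked, so contradictory NHS numbers can still link.
    if partial_dob and same_sex and same_local and same_postcode:
        return 2
    # Step 3: exact date of birth, sex and postcode; communal postcodes excluded.
    nhs_missing = a.get("nhs_number") is None or b.get("nhs_number") is None
    local_missing = lid_a is None or lid_b is None
    if (exact_dob and same_sex and same_postcode
            and a["postcode"] not in communal_postcodes
            and nhs_missing and local_missing):
        return 3
    return None
```

Under this reading, a pair with missing NHS number, differing local ID and agreement on sex, date of birth and postcode is not linked at any step.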
Probabilistic linkage-We designed an additional probabilistic step to link records that remained unlinked after step 3 because of missing NHS numbers or other identifiers. The probability that two identifiers would agree, given a match (m probability), was specified for each identifier: date of birth [0.95 (day), 0.94 (month), 0.91 (year)], 0.9 (sex), 0.9 (NHS number), 0.62 (local ID within provider) and 0.68 (postcode). These values were determined from preliminary analyses of the probabilities that NHS number agreed, and by evaluating their level of agreement in the reference standard data set. The probability that each identifier agreed, given a non-match (the u probability), was specified as 0.5 (sex), 0.03226 (day), 0.08333 (month), 0.05 (year), 0.00001 (NHS number), 0.00002 (local ID within hospital) and 0.00001 (postcode). Match weights for each identifier were calculated by dividing the m probability by the u probability and taking the log2 of the result.19 The total match weight for a record pair is the sum of the match weights for each identifier. Based on visual inspection of a histogram of match weights, we chose three thresholds above which a pair of records could be considered as an additional link: 10 (relaxed), 20 (middle) and 30 (strict). We then manually reviewed all scenarios producing additional links above each threshold, deciding on a final threshold of >21.5. This threshold was sufficiently relaxed to allow sex or date of birth to be missing or differ and to allow postcode to be missing if sex, date of birth and local ID agreed, but sufficiently strict to prevent additional false matches. Examples are available in Table 1 as part of the section on results.
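The weight calculation can be sketched in Python using the m and u probabilities given above. The agreement weight, log2(m/u), follows the text. The disagreement weight, log2((1 - m)/(1 - u)), and the choice to give missing identifiers a weight of zero are standard conventions assumed here for illustration, so totals from this sketch will not necessarily reproduce the weights reported in Table 1. The dictionary keys and function names are hypothetical.

```python
import math

# m and u probabilities as specified in the text.
M_PROB = {"dob_day": 0.95, "dob_month": 0.94, "dob_year": 0.91, "sex": 0.9,
          "nhs_number": 0.9, "local_id": 0.62, "postcode": 0.68}
U_PROB = {"dob_day": 0.03226, "dob_month": 0.08333, "dob_year": 0.05, "sex": 0.5,
          "nhs_number": 0.00001, "local_id": 0.00002, "postcode": 0.00001}
THRESHOLD = 21.5  # final threshold chosen after manual review


def agreement_weight(field):
    """Weight when an identifier agrees: log2(m / u), as described in the text."""
    return math.log2(M_PROB[field] / U_PROB[field])


def disagreement_weight(field):
    """Weight when an identifier disagrees (assumed convention, not from the text)."""
    return math.log2((1 - M_PROB[field]) / (1 - U_PROB[field]))


def total_match_weight(comparison):
    """Sum weights over identifiers.

    comparison maps identifier -> True (agree), False (disagree) or None (missing).
    Assumption: a missing identifier contributes zero.
    """
    total = 0.0
    for field, outcome in comparison.items():
        if outcome is True:
            total += agreement_weight(field)
        elif outcome is False:
            total += disagreement_weight(field)
    return total


def is_additional_link(comparison):
    """A pair is accepted as an additional link if its total weight exceeds the threshold."""
    return total_match_weight(comparison) > THRESHOLD
```

Under these assumptions, a pair agreeing on sex, date of birth and local ID with postcode and NHS number missing exceeds the threshold, whereas agreement on sex and date of birth alone does not.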
Reference standard-A 'reference standard' HES data set was created by probabilistic matching using the same identifiers that are used by the existing algorithm, in addition to a wider range of identifying characteristics (ethnic group, local authority, GP and diagnostic codes), and manual review. The m probabilities were based on the overall probabilities that identifiers agreed given a match on NHS number: local ID (0.8), postcode (0.7), ethnic group (0.8), local authority (0.9), GP (0.8) and agreement on one (0.3), two (0.1) or three (0.04) diagnostic codes. Following manual review, we found that false matches occurred primarily because of disagreement on NHS number and local ID, or because the record pairs may belong to multiple births. For these reasons, records were allowed to match in two scenarios: (1) total match weight >22.8 with the additional requirement that NHS number and local ID may not disagree; (2) NHS numbers were allowed to differ, if the level of agreement on other identifiers produced a total match weight >35 with the additional requirement that no multiple birth was indicated. Multiple births were defined as birth order or baby number >1, or ICD10 codes Z372 to Z377 inclusive. This decision was made on the basis of prior knowledge that NHS number can be wrong,1 but NHS number and local ID are the only two identifiers in this data set that can potentially distinguish multiple births sharing other identifiers.
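The two acceptance scenarios for the reference standard can be written as a short decision rule. The Python sketch below is one reading of the description: 'may not disagree' is taken to mean that neither identifier actively disagrees (a missing value does not count as disagreement), and the total match weight is assumed to have been computed beforehand. The function and argument names are hypothetical.

```python
# ICD10 codes Z372 to Z377 inclusive, as listed in the text.
MULTIPLE_BIRTH_ICD10 = {f"Z37{d}" for d in range(2, 8)}


def is_multiple_birth(birth_order, diagnosis_codes):
    """Flag a multiple birth: birth order (or baby number) > 1, or a Z372-Z377 code."""
    has_code = any(code in MULTIPLE_BIRTH_ICD10 for code in diagnosis_codes)
    return (birth_order is not None and birth_order > 1) or has_code


def accept_reference_link(weight, nhs_agrees, local_id_agrees, multiple_birth):
    """Decide whether a record pair is a link in the reference standard.

    nhs_agrees and local_id_agrees are True (agree), False (disagree) or None (missing).
    """
    # Scenario 1: weight > 22.8 and neither NHS number nor local ID disagrees.
    if weight > 22.8 and nhs_agrees is not False and local_id_agrees is not False:
        return True
    # Scenario 2: NHS numbers differ, but weight > 35 and no multiple birth indicated.
    if nhs_agrees is False and weight > 35 and not multiple_birth:
        return True
    return False
```

The higher bar in the second scenario reflects the point made above: NHS number and local ID are the only identifiers that can separate multiple births, so a pair with differing NHS numbers is linked only when other agreement is strong and no multiple birth is indicated.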

Ethical approval
As the analysis was a service evaluation to improve the quality of service provided by the HSCIC, which did not directly involve participants in research, we did not require NHS Research Ethics Committee ethical approval.20 The first author conducted all analyses internally at the HSCIC on record-level data, tables of results were shared with co-authors, and small cell sizes were suppressed to minimize the risk of disclosure. The study design and results were shared with HSCIC staff at three meetings between January and May 2016.

Statistical analysis
Before data linkage, we cleaned the data sets using existing data cleaning rules and data dictionaries.13 The quality of the data set was evaluated in terms of the proportion of missing data for different identifiers and different patient groups. After data linkage, we evaluated the missed match rate (at the record level), comparing the deterministic and probabilistic algorithms against the reference standard for all records within the entire study period (1998-2015). Sensitivity and specificity were calculated according to the standard formulae.21 The missed match rate is 1 - sensitivity. To evaluate the impact of data linkage error on results (at the patient level), we modelled the risk of hospital readmission for patients within one year (the first admission linked to a second admission). Results from the deterministically linked and probabilistically linked data were compared to the reference standard. The percentage bias was estimated by comparing the coefficients (log odds) in logistic regression models with the coefficient in the model using the reference standard (the difference between the log odds of readmission in the comparison model and the reference standard, as a proportion of the log odds of readmission in the reference standard). In sensitivity analyses, we repeated the analyses, comparing relaxed, middle and strict thresholds for probabilistic matching, to determine the impact of the choice of threshold on bias in estimates of readmission. We also repeated analyses allowing the m probabilities to vary across three periods of data collection (1998-2003, 2004-2009, 2010-2015).
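The accuracy measures and the percentage bias described above reduce to a few lines of arithmetic, sketched here in Python. The function names are hypothetical, and the counts of true and false matches are taken to be defined relative to the reference standard.

```python
def sensitivity(true_matches, missed_matches):
    """Proportion of reference-standard links that the algorithm also links."""
    return true_matches / (true_matches + missed_matches)


def specificity(true_non_matches, false_matches):
    """Proportion of reference-standard non-links that the algorithm leaves unlinked."""
    return true_non_matches / (true_non_matches + false_matches)


def missed_match_rate(true_matches, missed_matches):
    """Missed match rate, defined in the text as 1 - sensitivity."""
    return 1 - sensitivity(true_matches, missed_matches)


def percentage_bias(coefficient, reference_coefficient):
    """Difference in log odds relative to the reference standard, as a percentage."""
    return 100 * (coefficient - reference_coefficient) / reference_coefficient
```

As an illustration with made-up numbers, a coefficient of 0.9 against a reference coefficient of 1.0 gives a bias of about -10%, meaning the log odds are underestimated by a tenth.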
Patient involvement-There was no patient involvement in this service evaluation.

Results
There were 418,046 records extracted from HES (calendar years 1998 to 2015). We removed 451 records where the year of admission was outside this range and 336 with no admission date available. Table 2 evaluates data quality for all records in the remaining extract of 417,259 records. Sex and local ID within hospital were very rarely missing (<0.1%) and are not shown. There was improvement in data quality over time. The number of records with missing NHS number fell from 43.8% (birth year 1992) to 0.7% (birth year 2012). The proportion of records with missing NHS numbers in the 1992 birth cohort is higher, because birth episodes were not captured by our sampling frame for this birth year. Postcode is missing for many birth records (prior to 2014) due to a system error,17 explaining the high proportion of missing postcodes in the 2005 (30.6%) and 2012 (47.3%) birth cohorts in our evaluation population. Postcode would usually be available for admissions after birth or where birth episodes had been incorrectly recorded as another type of episode. This is also shown in Table 4 that shows data quality across three data periods (1998-2003, 2004-2009 and 2010-2015) and additionally for different age groups. Table 2 shows that NHS number is more likely to be missing for ethnic minorities, foreign patients, those with no fixed abode and where the record has missing data in other fields (e.g. sex, ethnic group or postcode are also missing). Postcode is more likely to be missing when other fields are missing, particularly ethnic group, and is often missing for birth records prior to 2014. This highlights the potential for the rate of data linkage errors to vary across patient groups and produce biased results, given the strong emphasis placed on NHS number and postcode in the deterministic algorithm.
Linking records across the study period (1998-2015), the existing deterministic HESID algorithm has a missed match rate of 2.3% [95% Confidence Interval (CI) 2.2%, 2.4%] overall, but Table 3 shows that this was higher in older data years: from 1998 to 2003, this was 8.6% (95% CI 8.4%, 8.8%). There was variation across patient groups, with higher rates seen in ethnic minorities, foreign patients, those with no fixed abode and young infants. Specificity also improved over time, but even after the introduction of NHS number for babies in 2005 (which would reduce false matches generated by multiple births) the false match rate was higher than previously estimated (0.5% vs. 0.2%(1)). Table 3 shows that the additional probabilistic match step lowered the missed match rate for all patient groups. Table 1 shows the scenarios that would allow additional links not permitted by the existing algorithm. For example, in the first row, if NHS number, date of birth, local ID and postcode agreed but sex disagreed (as happened for 5 records), this would receive a match weight of 62.02 that would be permitted by our probabilistic algorithm but not by the existing deterministic algorithm. The most common scenario for missed matches was when NHS number was missing, local ID differed but sex, date of birth and postcode agreed (n = 3,642). This would not be permitted at step 3 of the existing algorithm because NHS number and local ID would have to be blank.14 Our reference standard considered these to be links, on the basis of other identifiers and identifying characteristics agreeing. The second most common scenario was for sex, date of birth and local ID to agree but postcode to be missing. This is not currently permitted but identified 1,809 additional links. An additional 722 links were identified where sex, date of birth and local ID agreed but postcode disagreed (Table 1).

Impact on results (readmission rates for patients)
Whereas the missed matches in Table 3 refer to data linkage across the evaluation period for records, Table 4 considers the next aim of our evaluation: to evaluate the impact on the relative risk of hospital readmission for each patient within one year, comparing the existing deterministic algorithm (readmission rate 18.4%) with the additional probabilistic step (readmission rate 18.7%), adjusting for covariates. By comparing the coefficients with the same model run on the reference standard data (readmission rate 18.7%), we calculated bias, defined as the percentage by which the coefficient is under- or over-estimated. The number of patients decreases in the probabilistic model and the reference standard model, because fewer HESIDs are assigned to the same number of records (181,395 patients in the deterministic model, 176,990 with the additional probabilistic step, 175,773 in the reference standard). Table 4 shows evidence of bias for nearly all patient groups, particularly males (6%), young infants (13%), children aged 8 to 11 (119%), young adults aged 16 to 19 (77%) or 20 to 23 (50%), Black (13%) and Chinese/Other (−3%) ethnic minority groups, patients living in areas of high socio-economic deprivation (9%), those with 'no fixed abode' (−70%) and in newer data years (−7%). The probabilistic match step reduced bias for nearly all patient groups, with the exception of foreign patients where it increased from 2% to 14%, although this involved a small number of patients (n = 142).
In sensitivity analyses (Table 5), relaxing the threshold for the additional probabilistic step lowered the missed match rate further, particularly for older data years, but increased the false match rate. A stricter threshold lowered the false match rate but increased the missed match rate.

Discussion
Our results show missed matches that are produced by an existing deterministic algorithm that is used to link together hospital records in England within HES (inpatients) and the most common scenarios that create these data linkage errors. An additional probabilistic step reduced the number of missed matches, particularly for common scenarios where local ID agreed but other identifiers such as postcode were missing. Analyses of data that were linked using the additional probabilistic step had less biased estimates of hospital readmission rates for certain patient groups (e.g. ethnic minorities). Although the mismatch rate improved in recent years, there were discernible improvements in mismatch rates in virtually all patient groups and throughout the 17 years of analysis. The technique is particularly well suited to this administrative data source, where data quality is poor (particularly in older data years) but the implications of missed matches are serious, given that the HES data are widely used for commissioning and research. The reference standard we created additionally shows that other identifying characteristics (ethnic group, local authority, GP and diagnostic codes) can be used to substantially improve linkage success.
The strength of our evaluation is that it is the first attempt to evaluate data linkage error between multiple episodes of care for patients within the HES longitudinal data set. We previously showed that applying the HESID algorithm to link multiple episodes of paediatric intensive care data produced a false match rate of 0.2% and a missed match rate of >4%.1 In this study, the missed match rate was 2.3% overall but ranged from 8.6% (1998-2003) to 0.4% (2010-2015), with marked variation across patient groups.
A second strength of our evaluation was that we quantified the mechanisms that caused data linkage errors. A relatively small number of common scenarios created missed matches (Table 1). This has important implications for HES because it shows that the current deterministic algorithm is too strict, preventing matches that are very likely to be correct (e.g. sex, date of birth and local ID agree but postcode is missing; sex is missing but other identifiers agree; NHS number may be incorrect but other identifiers agree). The deterministic algorithm could be improved with additional deterministic steps that address these specific scenarios, or an additional probabilistic step could be introduced that automatically allows all scenarios above a threshold. Probabilistic matching is suitable for data sets where only one or two identifiers might have problems,3 because it can evaluate the overall level of agreement across all identifiers. It additionally allows situations in which NHS number might be valid, but incorrect.1 The technique was particularly useful for highlighting the benefit of local ID within hospitals, not currently allowed unless postcode also agrees. A relatively small number of additional links were captured by probabilistic matching, but small improvements in linkage error benefit certain subgroups (e.g. infants, young adults, ethnic minorities, foreign patients, those with 'no fixed abode' and those with poor quality data).
A limitation of our approach is that we cannot determine whether additional links are correct in relation to an external reference standard data set, since none exists for HES. Our analysis can be further extended using a recently developed method22 that uses all possible matches and their weights, rather than taking only those above a fixed threshold, but we have not pursued this further here. It may also be possible to improve linkage error by allowing m and u probabilities to change depending on the frequencies of different values for identifiers, which we did not consider here. 23 The rate of change in postcodes, for example, will differ for different age groups,24 and the probability that NHS number or local ID agrees for a match may increase over successive data years. In our reference standard data set, we considered exact matching on up to three diagnostic codes, but future evaluations could consider clusters of disease codes that are likely to be more stable over time. 25 A major limitation was that we focused on records for children and adolescents, meaning that results may not generalise to records for adults. Many of the mechanisms generating linkage error will, however, be similar across the age range, and the methods we propose can be used in other data sets.
Given that Accident and Emergency data is known to be lower quality than inpatient records, our results represent a 'best case' scenario in terms of linkage error for hospital data in England as a whole. In Accident & Emergency settings, there may be less opportunity to check patient identifiers and the proportion of missing data is higher.9,26 Linkage error is also likely to be worse when additionally considering records where date of birth is missing, incorrect or estimated with a 'default' date; our sampling frame was created using date of birth assumed to be valid and correct. These scenarios were excluded from our evaluation but could be addressed by probabilistic matching that would allow these records to link if agreement on other identifiers was sufficiently high. Although the probability that two identifiers agree for a match may change in different data sets, the threshold can be adjusted so that probabilistic matching is useful even for lower quality data sets.
The evaluation extends previous studies of apparent false matches in pseudonymised HES extracts9 and a preliminary estimate of the false and missed match rate when applying the HESID algorithm to a well-curated clinical data set.1 For the first time, the patient identifiers in HES (and additional identifying characteristics) were used to create a reference standard that could be used to evaluate the existing deterministic algorithm and identify which scenarios generated data linkage errors. The results show that there are vulnerable patient groups who are disadvantaged by the current algorithm, such as those without NHS numbers. Patients with 'no fixed abode' include the homeless, who have important healthcare needs and are frequently readmitted.27 Without an NHS number or postcode, their records are difficult to link, but probabilistic linkage can help if a local ID is available at the hospital. Our results will be particularly important for evaluating the health outcomes of vulnerable and mobile populations who are less likely to have NHS numbers.

Implications for research
Future evaluations need to consider whether different match weights and thresholds are needed for different hospitals. The accuracy of local ID for some hospitals may not be the same as for others, and we have previously shown that there is significant variation in data linkage error across hospitals in England.9 Further evaluations are necessary to determine how well local ID identifies patients in each hospital, particularly when NHS number is missing. Most patients in our study population will have a birth record, which will increase the prevalence of blank postcodes relative to those whose birth was not recorded in HES. Evaluations of older adults and the elderly would be useful, as would an evaluation of the impact of linkage error on mortality estimates. Although we considered a long time window for linking records, we considered readmissions within one year for patients. Over long periods, there is more opportunity for linkage error. There is a clear need for a reference standard data set that can be used to check patient identifiers for several administrative health data sets.

Implications for practice
Even in recent years, the existing HES algorithm generates mismatch rates in some groups that result in clinically important biases in estimated readmission rates, thereby underestimating service use, health needs and comorbidity. Mismatch rates are likely to similarly underestimate mortality rates.28 Improvements to the algorithm for future years should be accompanied by retrospective linkage to update existing HESIDs. This is particularly important for infants who did not automatically acquire NHS numbers at birth prior to 2005, and whose birth episodes did not contain a postcode before 2014. Interpreting trends over time in readmission rates is problematic if these partly reflect improvements in data linkage. Also for infants, it is very important to correctly link a patient to a birth episode and maternity episode so that critical birth characteristics can be linked into children's health care trajectories. HES is widely used for commissioning and research and it is imperative to address data quality issues. HES is also linked to external data sets that can further introduce problems if the internal linkage problems are not addressed.

Conclusion
Deterministic linkage of hospital administrative data is prone to generate missed matches, which produces biased estimates of hospital readmission for vulnerable patient groups and for older data. Probabilistic data linkage is suitable for data sets like HES where data quality is poor, and it can highlight the benefits of making better use of particular identifiers such as local patient ID within hospitals. The algorithm can be changed to improve future record linkage, but a retrospective update is also required to address linkage error in existing data. It is important to evaluate and address linkage error and data quality,29 particularly for this data set that is used to allocate >£100 billion of public resources annually, and to plan and deliver health services. Development of an external, reference or 'gold' standard data set that could identify patients across a range of data sets, even where NHS number was not available, would be extremely useful.