DISCUSSION
Our results quantify the missed matches produced by the existing deterministic algorithm used to link hospital records in England within HES (inpatients), and identify the most common scenarios that create these data linkage errors. An additional probabilistic step reduced the number of missed matches, particularly for common scenarios where local ID agreed but other identifiers such as postcode were missing. Analyses of data that were linked using the additional probabilistic step had less biased estimates of hospital readmission rates for certain patient groups (e.g. ethnic minorities). Although the mismatch rate of the existing algorithm improved in recent years, the additional probabilistic step produced discernible improvements in mismatch rates in virtually all patient groups and throughout the 17 years of analysis. The technique is particularly well suited to this administrative data source, where data quality is poor (particularly in older data years) but the implications of missed matches are serious, given that the HES data are widely used for commissioning and research. The reference standard we created additionally shows that other identifying characteristics (ethnic group, local authority, GP and diagnostic codes) can be used to substantially improve linkage success.
A strength of our evaluation is that it is the first attempt to evaluate data linkage error between multiple episodes of care for patients within the HES longitudinal data set. We previously showed that applying the HESID algorithm to link multiple episodes of paediatric intensive care data produced a false match rate of 0.2% and a missed match rate of >4%.1 In this study, the missed match rate was 2.3% overall but ranged from 8.6% (1998–2003) to 0.4% (2010–2015), with marked variation across patient groups.
A second strength of our evaluation was that we quantified the mechanisms that caused data linkage errors. A relatively small number of common scenarios created missed matches (Table 1). This has important implications for HES because it shows that the current deterministic algorithm is too strict, preventing matches that are very likely to be correct (e.g. sex, date of birth and local ID agree but postcode is missing; sex is missing but other identifiers agree; NHS number may be incorrect but other identifiers agree). The deterministic algorithm could be improved with additional deterministic steps that address these specific scenarios, or an additional probabilistic step could be introduced that automatically accepts all scenarios whose match weight exceeds a threshold. Probabilistic matching is suitable for data sets where only one or two identifiers might have problems,3 because it can evaluate the overall level of agreement across all identifiers. It additionally accommodates situations in which NHS number might be valid, but incorrect.1 The technique was particularly useful for highlighting the benefit of local ID within hospitals, agreement on which is not currently accepted as a match unless postcode also agrees. A relatively small number of additional links were captured by probabilistic matching, but even small reductions in linkage error benefit certain subgroups (e.g. infants, young adults, ethnic minorities, foreign patients, those with ‘no fixed abode’ and those with poor quality data).
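As a minimal illustration of how a probabilistic step can accept a scenario that a strict deterministic rule rejects, the sketch below computes a Fellegi-Sunter style match weight by summing log-likelihood ratios across identifiers. The field names, the m- and u-probabilities, the threshold of 15 and the choice to give missing values zero weight are all assumptions made for illustration; they are not the values or rules used in this evaluation.

```python
import math

# Hypothetical (m, u) probabilities per identifier, for illustration only.
# m = P(identifier agrees | records are a true match)
# u = P(identifier agrees | records are not a match)
M_U = {
    "nhs_number": (0.95, 0.0001),
    "local_id": (0.90, 0.001),
    "date_of_birth": (0.98, 0.003),
    "sex": (0.99, 0.5),
    "postcode": (0.85, 0.001),
}


def match_weight(rec_a, rec_b, m_u=M_U):
    """Sum log2 likelihood ratios over identifiers for a pair of records."""
    total = 0.0
    for field, (m, u) in m_u.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None or b is None:
            # Assumption: a missing value is neither evidence for nor against.
            continue
        if a == b:
            total += math.log2(m / u)              # agreement weight (positive)
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight (negative)
    return total


def is_link(rec_a, rec_b, threshold=15.0):
    """Accept the pair if its total weight reaches the (illustrative) threshold."""
    return match_weight(rec_a, rec_b) >= threshold


# Scenario from the text: sex, date of birth and local ID agree,
# but postcode (and here also NHS number) is missing on one record.
episode_1 = {"nhs_number": None, "local_id": "H123", "date_of_birth": "2001-05-17",
             "sex": "F", "postcode": "XX1 1XX"}
episode_2 = {"nhs_number": None, "local_id": "H123", "date_of_birth": "2001-05-17",
             "sex": "F", "postcode": None}

# A clearly different patient: only sex agrees.
episode_3 = {"nhs_number": None, "local_id": "H999", "date_of_birth": "1999-01-02",
             "sex": "F", "postcode": "YY2 2YY"}
```

Under these assumed values, `is_link(episode_1, episode_2)` is true because agreement on local ID and date of birth alone carries the pair over the threshold, whereas `is_link(episode_1, episode_3)` is false. Raising or lowering `threshold` trades missed matches against false matches.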
A limitation of our approach is that we cannot determine whether additional links are correct in relation to an external reference standard data set, since none exists for HES. Our analysis could be extended using a recently developed method22 that uses all possible matches and their weights, rather than taking only those above a fixed threshold, but we have not pursued this here. It may also be possible to reduce linkage error by allowing m and u probabilities to change depending on the frequencies of different values for identifiers, which we did not consider here.23 The rate of change in postcodes, for example, will differ for different age groups,24 and the probability that NHS number or local ID agrees for a match may increase over successive data years. In our reference standard data set, we considered exact matching on up to three diagnostic codes, but future evaluations could consider clusters of disease codes that are likely to be more stable over time.25 A major limitation was that we focused on records for children and adolescents, meaning that results may not generalise to records for adults. Many of the mechanisms generating linkage error will, however, be similar across the age range, and the methods we propose can be used in other data sets.
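The idea of value-specific probabilities, which we did not implement, can be sketched as follows. One common simplification, assumed here, is to approximate the u-probability for agreement on a particular value by that value's relative frequency in the file, so that agreement on a rare value earns a larger weight than agreement on a common one. The function name, the fixed m-probability of 0.95 and the example values are hypothetical.

```python
import math
from collections import Counter


def value_specific_agreement_weights(values, m=0.95):
    """Return log2(m / u_v) for each observed value v.

    Assumption: u_v, the chance that a non-matching record shares value v,
    is approximated by the relative frequency of v among non-missing values.
    """
    observed = [v for v in values if v is not None]
    counts = Counter(observed)
    n = len(observed)
    return {v: math.log2(m / (count / n)) for v, count in counts.items()}


# Hypothetical identifier values: one common, one rare, one missing.
example_values = ["common"] * 8 + ["rare"] * 2 + [None]
weights = value_specific_agreement_weights(example_values)
```

Here `weights["rare"]` exceeds `weights["common"]`, reflecting that agreement on an unusual value is stronger evidence of a true match. The same approach could in principle be stratified by age group or data year, as discussed above.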
Given that Accident and Emergency data are known to be of lower quality than inpatient records, our results represent a ‘best case’ scenario in terms of linkage error for hospital data in England as a whole. In Accident and Emergency settings, there may be less opportunity to check patient identifiers and the proportion of missing data is higher.9,26 Linkage error is also likely to be worse when additionally considering records where date of birth is missing, incorrect or estimated with a ‘default’ date, because our sampling frame was created using dates of birth assumed to be valid and correct. These scenarios were excluded from our evaluation but could be addressed by probabilistic matching that would allow these records to link if agreement on other identifiers was sufficiently high. Although the probability that two identifiers agree for a match may change in different data sets, the threshold can be adjusted so that probabilistic matching is useful even for lower quality data sets.
The evaluation extends previous studies of apparent false matches in pseudonymised HES extracts9 and a preliminary estimate of the false and missed match rate when applying the HESID algorithm to a well-curated clinical data set.1 For the first time, the patient identifiers in HES (and additional identifying characteristics) were used to create a reference standard that could be used to evaluate the existing deterministic algorithm and identify which scenarios generated data linkage errors. The results show that there are vulnerable patient groups who are disadvantaged by the current algorithm, such as those without NHS numbers. Patients with ‘no fixed abode’ include the homeless, who have important healthcare needs and are frequently readmitted.27 Without an NHS number or postcode, their records are difficult to link, but probabilistic linkage can help if a local ID is available at the hospital. Our results will be particularly important for evaluating the health outcomes of vulnerable and mobile populations who are less likely to have NHS numbers.
Implications for research
Future evaluations need to consider whether different match weights and thresholds are needed for different hospitals. The accuracy of local ID for some hospitals may not be the same as for others, and we have previously shown that there is significant variation in data linkage error across hospitals in England.9 Further evaluations are needed to determine how well local ID identifies patients in each hospital, particularly when NHS number is missing. Most patients in our study population will have a birth record in HES, which will increase the prevalence of blank postcodes relative to patients whose birth was not recorded in HES. Evaluations of older adults and the elderly would be useful, as would an evaluation of the impact of linkage error on mortality estimates. Although we considered a long time window for linking records, we considered only readmissions within one year. Over longer periods, there is more opportunity for linkage error. There is a clear need for a reference standard data set that can be used to check patient identifiers for several administrative health data sets.
Implications for practice
Even in recent years, the existing HES algorithm generates mismatch rates in some groups that result in clinically important biases in estimated readmission rates, thereby underestimating service use, health needs and comorbidity. Mismatches are likely to lead to similar underestimation of mortality rates.28 Improvements to the algorithm for future years should be accompanied by retrospective linkage to update existing HESIDs. This is particularly important for infants who did not automatically acquire NHS numbers at birth prior to 2005, and whose birth episodes did not contain a postcode before 2014. Interpreting trends over time in readmission rates is problematic if these partly reflect improvements in data linkage. Also for infants, it is very important to correctly link a patient to a birth episode and maternity episode so that critical birth characteristics can be linked into children’s health care trajectories. HES is widely used for commissioning and research, and it is imperative to address data quality issues. HES is also linked to external data sets, which can introduce further problems if the internal linkage problems are not addressed.