Discussion
We used machine learning algorithms to analyse three large datasets and investigate the consistency of clinical coding of three mandatory health conditions within a large administrative healthcare dataset. Clinical coding of DMPC as a mandatory condition was relatively consistent. However, over two-fifths of subsequent spells for autism patients and almost a quarter of subsequent spells for patients with PDD had data inconsistencies. There was a high level of variation in the proportion of data inconsistencies between trusts, and there was no evidence that trusts were consistently poor at reporting mandatory codes across the three conditions studied.
In the HES dataset, inconsistencies related to mandatory clinical codes can arise from two main sources: a failure of the clinician to record the diagnosis in the medical notes, or a failure of the clinical coder to code a diagnosis that was recorded in the medical notes. In our analysis, data inconsistencies could also be due to misuse of the code of interest in the first spell (ie, a false positive in the index spell), although the numbers involved are likely to be small.
From the random forest classifier algorithms, age was strongly associated with data inconsistencies. A greater proportion of data inconsistencies was associated with increasing age for autism and DMPC, and with decreasing age for PDD. This confirms the pattern seen in the descriptive data and is likely to reflect clinicians' expectations about how likely a patient is to have the condition. This may also explain the relative importance of the association between female sex and a higher proportion of inconsistencies in the autism dataset. Although we identified a relationship between deprivation score and data inconsistencies in all three datasets, the nature of the relationship was unclear. This may suggest a bias towards continuous variables in the algorithms used.19 20
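The bias of impurity-based random forest importances towards continuous (or high-cardinality) variables can be probed with a permutation-based cross-check. The sketch below uses scikit-learn on synthetic data; it is illustrative only, not the study's pipeline, and all feature names (`age`, `female`, `imd_score`) are hypothetical stand-ins.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
# Hypothetical features: one continuous with real signal (age),
# one binary (sex), one continuous pure noise (deprivation score).
X = pd.DataFrame({
    "age": rng.integers(0, 90, n),
    "female": rng.integers(0, 2, n),
    "imd_score": rng.normal(20.0, 8.0, n),
})
# Synthetic label: inconsistency driven by older age only.
y = (rng.random(n) < np.where(X["age"] > 60, 0.9, 0.05)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Impurity-based importances can be inflated for continuous features
# even when they carry no signal...
impurity_imp = pd.Series(clf.feature_importances_, index=X.columns)

# ...so permutation importance on held-out data is a common cross-check:
# noise features should drop towards zero.
perm = permutation_importance(clf, X_te, y_te, n_repeats=10, random_state=0)
perm_imp = pd.Series(perm.importances_mean, index=X.columns)
```

Comparing `impurity_imp` with `perm_imp` shows whether an apparent association (such as the unclear deprivation-score relationship) survives a method less biased towards continuous variables.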
Change in provider, change in main specialty and time from first spell to the subsequent admission were also associated with a higher proportion of data inconsistencies across all datasets. Initiatives that allow easier cross-referencing of information across providers and settings, and over extended periods of time, should be encouraged.
For the PDD dataset, coding of Parkinson's disease and emergency admission were associated with lower rates of inconsistencies. Elective admissions, by contrast, are generally of short duration, and the case notes are likely to focus on the elective procedure being conducted, with limited coding depth.
Large-scale administrative datasets, such as HES, are increasingly being used to inform decision-making in healthcare.21 22 Such data have helped inform the response to the COVID-19 pandemic23 24 and are being used to inform service structure postpandemic.25–27 Having data that are as reliable as possible will be invaluable. Understanding the source and structure of coding inconsistencies may also help the development of new quality improvement programmes, as well as inform the work of researchers, clinical coders and policy analysts.22 28 The impact of the data inconsistencies identified in this paper will vary in importance depending on the nature and aims of the data analysis being undertaken. However, we recommend that researchers using HES who are interested in long-term comorbidities should not rely on the coding of the index spell alone, but should also examine prior spells for the same patient. Frailty/comorbidity indices, such as the Charlson Comorbidity Index and HFRS, if constructed from HES data, perform this function (to an extent) by looking back over 1 and 2 years of prior hospital spells, respectively.
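The recommended lookback over prior spells can be sketched in a few lines of pandas. The field names (`patient_id`, `admission_date`, `diag_codes`) and the semicolon-joined diagnosis format below are illustrative assumptions, not the HES schema, and the 1-year window simply mirrors the Charlson-style lookback mentioned above.

```python
import pandas as pd

# Hypothetical spell-level extract: one row per spell,
# semicolon-joined ICD-10 codes (illustrative, not the HES layout).
spells = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "admission_date": pd.to_datetime(
        ["2018-01-10", "2018-06-01", "2019-09-15", "2019-03-02", "2019-04-20"]),
    "diag_codes": ["E10;I10", "I10", "E10;J45", "F84.0", "Z50"],
})

def coded_in_lookback(df, code_prefix, lookback_days=365):
    """Flag spells where `code_prefix` appears in any prior spell for the
    same patient within `lookback_days` (cf. Charlson/HFRS lookback)."""
    df = df.sort_values(["patient_id", "admission_date"]).copy()
    flags = []
    for _, row in df.iterrows():
        prior = df[(df["patient_id"] == row["patient_id"])
                   & (df["admission_date"] < row["admission_date"])
                   & (df["admission_date"] >= row["admission_date"]
                      - pd.Timedelta(days=lookback_days))]
        flags.append(prior["diag_codes"].str.contains(code_prefix).any())
    df["coded_in_lookback"] = flags
    return df

# Diabetes (E10) coded in a prior spell within 1 year?
out = coded_in_lookback(spells, "E10")
```

A researcher would then treat a spell lacking the mandatory code but with `coded_in_lookback == True` as a candidate inconsistency rather than as an absence of the condition.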
The performance of the algorithms used to identify key features of data inconsistencies was similar in smaller subgroups of ethnicity and sex. There are concerns that artificial intelligence (AI) techniques can accentuate known biases against representation of smaller subpopulations of a dataset.29 30 Although the problem of fair data analysis is not unique to AI techniques, and can occur with more traditional forms of data processing and analysis, the 'black-box' element of AI methodology leads naturally to concerns over 'fair AI' and data equity. We used random forest classifiers in our analysis, which allowed us to identify the key features represented in our algorithms and provided a degree of transparency.
Our study has a number of strengths and limitations. We had access to one of the most extensive and complete healthcare datasets anywhere in the world. However, this meant that there was no 'gold standard' against which to externally validate the dataset. Differences in coding practice across trusts will have affected our assessment of data quality on the national scale, and we highlight the variation across trusts. We were not able to identify whether an inconsistency was related to a mandatory code being misused in a first spell or being missing in all subsequent spells. We recognise that patients with diabetes mellitus can go into remission, but the numbers involved across the time period investigated are likely to be very small indeed. We also acknowledge that some forms of dementia and autism may be mild and not impact on clinical care. Nevertheless, all the conditions studied are mandatory and should still be recorded once diagnosed. Given the potential variability in the source and proportions of coding inconsistencies across the three conditions, the performance of the three classifiers should not be assessed by a single metric alone. For that reason, we opted to also use precision-recall curves and precision gain-recall gain curves, which are particularly relevant for the coding of diabetes, where the number of inconsistencies is much lower (ie, higher class imbalance). Our analysis highlights that the characteristics of coding inconsistencies can be particular to the condition under investigation. Although we selected conditions that tend to be present across the lifetime, extrapolation to other disease groups should be done with caution. More broadly, although we investigated inconsistent use of mandatory diagnostic codes in this study, it would be possible to investigate other types of inconsistencies using similar methods.
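For readers unfamiliar with precision gain-recall gain curves (Flach and Kull's transform of the standard precision-recall curve, which corrects for class prevalence), a brief sketch on synthetic imbalanced data follows. The ~5% prevalence is an illustrative stand-in for a low-inconsistency condition such as diabetes; this is not the study's code.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(1)
# Synthetic imbalanced problem: ~5% positives (the "inconsistent" class).
y_true = (rng.random(5000) < 0.05).astype(int)
scores = (y_true * rng.normal(0.7, 0.2, 5000)
          + (1 - y_true) * rng.normal(0.3, 0.2, 5000))

# Standard precision-recall curve.
prec, rec, _ = precision_recall_curve(y_true, scores)
pi = y_true.mean()  # positive-class prevalence

# Precision gain / recall gain transform:
#   gain = (value - pi) / ((1 - pi) * value), clipped to [0, 1],
# so a classifier no better than always-positive scores gain 0.
with np.errstate(divide="ignore", invalid="ignore"):
    prec_gain = np.clip((prec - pi) / ((1 - pi) * prec), 0.0, 1.0)
    rec_gain = np.clip((rec - pi) / ((1 - pi) * rec), 0.0, 1.0)
```

Because the transform rescales relative to prevalence, curves for conditions with very different inconsistency rates become comparable, which is why a single metric such as accuracy would be misleading here.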