Introduction
Health Data Research UK (HDR UK) was established to unite the UK’s health data to enable discoveries that improve people’s lives.1 By making health data available to researchers and innovators, it will be possible to develop more rapidly an improved understanding of diseases and approaches to prevent, treat and cure them. During the establishment of HDR UK an initial ‘listening exercise’ was carried out, collating responses across the landscape of health data users, which reported that the major perceived barriers to use of data for research and innovation were issues regarding data access and data quality.2
It is generally accepted that secondary use of health data for research and development has huge potential value, but that a significant amount of work to improve the data will be required to make such routine data useful.3 One difficulty is that precisely which improvements provide most value in this context remains unknown. For example, ‘quality’ of datasets for most users is usually composed of a view of technical ‘data quality’ dimensions such as completeness, in addition to subjective assessment of factors related to a specific use case.4
The recently published National Data Strategy highlights the importance of high-quality data for the UK, and specifically references the lack of standardised approaches for assessing and managing data quality in this context.5 With the potential for future widespread investment in the general area of ‘improving data’, it is important that such investment is evidence-based and focused on the areas that will have the greatest impact. A widely adopted standard tool for assessing the broad usefulness of a dataset would help to inform this future development and investment.
Previous studies address specific aspects of the broad area of ‘data quality’ in health, but none presents a framework similar to that suggested here. For example, evaluations of data quality improvement programmes have been described that focus on specific quality dimensions such as accuracy and precision.6 There are approaches described for evaluating quality of medical device data,7 use of rule-based approaches for data quality evaluation and management,8 and outputs of workshops focusing on health data quality issues.9 There has been a suggestion that quality informatics may become a specific area of health informatics.10 Despite such recognition of the importance of data quality for widespread uses, evaluation of data utility for specific purposes has remained difficult.11
The aim of this study was to develop a proposed framework for evaluating the usefulness of a health dataset across a range of potential use cases. Such a framework is intended to allow rapid identification of the datasets that are likely to be most applicable for a specific purpose, and to provide an objective method of evaluating or categorising a dataset in order to deploy data curation resources rationally. In addition to the needs across the health data community, HDR UK in particular requires a means of determining improvement across seven publicly funded HDR Hubs: consortia of organisations involved in improving data and providing access to it for research and innovation.12 The Hubs present multiple, parallel experiments for improving data, and so understanding their effectiveness in improving datasets for particular use cases allows for effective evaluation and could focus future investment. This presented an opportunity and need to develop a data utility framework as a service development project for HDR UK.