Original research

Development of a data utility framework to support effective health data curation

Abstract

Objectives The value of healthcare data is being increasingly recognised, including the need to improve health dataset utility. There is no established mechanism for evaluating healthcare dataset utility, making it difficult to evaluate the effectiveness of activities improving the data. This study aimed to describe the method for generating a proposed framework for the evaluation and communication of healthcare dataset utility for given research areas, and for involving the user community in its development.

Methods An initial version of a matrix to review datasets across a range of dimensions was developed based on previous published findings regarding healthcare data. This was used to initiate a design process through interviews and surveys with data users representing a broad range of user types and use cases, to help develop a focused framework for characterising datasets.

Results Following 21 interviews, 31 survey responses and testing on 43 datasets, five major categories and 13 subcategories were identified as useful for a dataset, including Data Model, Completeness and Linkage. Each sub-category was graded to facilitate rapid and reproducible evaluation of dataset utility for specific use-cases. Testing of applicability to >40 existing datasets demonstrated potential usefulness for subsequent evaluation in real-world practice.

Discussion The research has developed an evidence-based initial approach for a framework to understand the utility of a healthcare dataset. It is likely to require further refinement following wider application, and additional categories may be required.

Conclusion The process has resulted in a framework, developed through user-centred design, for objectively evaluating the likely utility of specific healthcare datasets. It should therefore be of value both to potential users of health data and to data custodians seeking to identify the areas that provide the optimal value for data curation investment.

Summary

What is already known?

  • The concept of data quality is well established, but the overall usefulness of a dataset for a purpose may be impacted by numerous additional factors.

  • One of the stated challenges in using UK health data for research is perceived lack of useful datasets.

  • There is no currently available standard framework for evaluating the utility of a healthcare dataset for a particular purpose.

What does this paper add?

  • We describe the process by which a framework for understanding the usefulness of a dataset was developed.

  • Information on the key characteristics related to dataset utility, based on surveys, interviews and testing, was used to provide a standard framework for dataset evaluation according to purpose.

Introduction

Health Data Research UK (HDR UK) was established to unite the UK’s health data to enable discoveries that improve people’s lives.1 By making health data available to researchers and innovators, it will be possible to develop more rapidly an improved understanding of diseases and approaches to prevent, treat and cure them. During the establishment of HDR UK, an initial ‘listening exercise’ was carried out, collating responses across the landscape of health data users, which reported that the major perceived barriers to use of data for research and innovation were issues regarding data access and data quality.2

It is generally accepted that secondary use of health data for research and development has huge potential value, but a significant amount of work to improve the data will be required to make such routine data useful.3 One difficulty is that precisely which improvements provide most value in this context remains unknown. For example, ‘quality’ of datasets for most users is usually composed of a view of technical ‘data quality’ dimensions such as completeness, in addition to subjective assessment of factors related to a specific use case.4

The recently published National Data Strategy highlights the importance of high-quality data for the UK, and specifically references the lack of standardised approaches for assessing and managing data quality in this context.5 With the potential for future widespread investment in the general area of ‘improving data’, it is important that this is evidence based and focused on the areas that will have the greatest impact. A widely adopted standard tool for assessing the broad usefulness of a dataset would help to inform this future development and investment.

Previous studies are available which address specific aspects of the broad area of ‘data quality’ in health, but none presents a framework similar to that suggested here. For example, evaluations of data quality improvement programmes are described focusing on specific quality dimensions such as accuracy and precision.6 There are approaches described for evaluating quality of medical device data,7 use of rule-based approaches for data quality evaluation and management,8 and outputs of workshops focusing on health data quality issues.9 There has been a suggestion that quality informatics may become a specific area of health informatics.10 Despite such recognition of the importance of data quality for widespread uses, evaluation of data utility for specific purposes has remained difficult.11

The aim of this study was to develop a proposed framework for evaluating the usefulness of a health dataset, across a range of potential use cases, to rapidly identify those which are likely to be most applicable for the specific purpose, and to provide an objective method of evaluating or categorising a dataset in order to rationally deploy data curation resources. In addition to the needs across the health data community, HDR UK in particular requires a means of determining improvement across seven publicly funded HDR Hubs: consortia of organisations involved in improving data and providing access to it for research and innovation.12 The Hubs present multiple, parallel experiments for improving data, and so understanding their effectiveness in improving datasets for particular use cases allows for effective evaluation and could focus future investment. This presented an opportunity and need to develop a data utility framework as a service development project for HDR UK.

Methods

The initial framework was developed based on the broad areas relating to data utility, which had been identified from previously published evidence.13 We adopted a user centred co-design approach which was designed to result in an understanding of the areas of user interest in data usefulness.14 Given the absence of consensus on an existing approach, it was important to gain a broad understanding of the topic, so a combination of interviews, surveys and user testing was used to help achieve diversity of inputs into the development of the framework, in line with standard practice in user centred design.15 16

We, therefore, focused on the major issues relating to health dataset utility by interviewing a range of data users in the domain, followed by collation of additional views from key stakeholders across the community through a survey consultation process. This was done to create and discover the main areas of interest for such a framework across a range of user groups and use cases. These findings were then used to further refine and develop a proposed Data Utility Framework that could be subsequently used and iterated by the community.

The framework aimed to indicate the ‘utility’ of a dataset for researchers for a particular purpose based on a set of key attributes, which would be classified using predefined criteria into arbitrary qualitative categories (bronze, silver, gold and platinum). Principles for the framework were that: (1) the bronze-platinum categories should differentiate utility for any given dimension in a progressive manner; (2) each dimension should have objective criteria that allow users to determine a utility score (initially through self-evaluation); (3) a greater score can only be achieved if the dataset meets the criteria of the previous categories as well as an additional criterion for the new score; and (4) there is no expectation that all datasets should achieve a particular classification, since some use cases may only require a minimum standard, while others may require greater utility scores for their purposes.
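The cumulative grading principle can be illustrated with a minimal Python sketch. The function name `grade_dimension`, the ‘documentation’ dimension and its four criteria are hypothetical and are not taken from the framework itself; only the ordering of the four categories and the rule that a higher category requires all lower ones come from the text.

```python
from typing import Callable, Dict, Optional

# Ordered from lowest to highest utility, as named in the text.
LEVELS = ("bronze", "silver", "gold", "platinum")

Dataset = Dict[str, object]
Criterion = Callable[[Dataset], bool]


def grade_dimension(dataset: Dataset, criteria: Dict[str, Criterion]) -> Optional[str]:
    """Return the highest level whose criterion, and every lower one, is met."""
    achieved = None
    for level in LEVELS:
        if not criteria[level](dataset):
            break  # a failed level blocks all higher levels
        achieved = level
    return achieved


# Hypothetical criteria for a made-up "documentation" dimension.
documentation_criteria: Dict[str, Criterion] = {
    "bronze": lambda d: bool(d.get("has_description")),
    "silver": lambda d: bool(d.get("has_data_dictionary")),
    "gold": lambda d: bool(d.get("has_provenance_notes")),
    "platinum": lambda d: bool(d.get("has_named_contact")),
}

example = {
    "has_description": True,
    "has_data_dictionary": True,
    "has_provenance_notes": False,
    "has_named_contact": True,  # does not count: gold is not met
}

print(grade_dimension(example, documentation_criteria))  # silver
```

A dataset that meets none of the criteria returns `None` rather than a category, which is consistent with the principle that no particular classification is expected of every dataset.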

For interviews, the interviewee’s particular data requirements were used as a basis for discussion, and users articulated the ways by which they determine the utility of a dataset for their particular use case, that is, how they determined if a dataset was ‘useful’ for their purpose. They were specifically asked their views regarding various components of an initial proposed framework which was based on the previous HDR UK scoping information (online supplemental table 1). The number of times interviewees referred to specific components of the draft framework was quantified, to capture interest in the items as presented. This methodology is a common user-centred design approach, to identify features that a range of individuals would like to see represented.17
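As a sketch of how such a tally might be produced from coded transcripts, the following Python example counts both the total number of mentions of each dimension and the number of distinct interviewees mentioning it. The interviewee identifiers and coded values are invented placeholders, not study data, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List

# Invented coding output: interviewee id -> dimensions they commented on.
coded_interviews: Dict[str, List[str]] = {
    "interviewee_01": ["provenance", "linkage", "provenance"],
    "interviewee_02": ["completeness", "linkage"],
    "interviewee_03": ["linkage"],
}


def count_mentions(coded: Dict[str, List[str]]) -> Counter:
    """Total number of comments per dimension across all interviews."""
    counts: Counter = Counter()
    for dimensions in coded.values():
        counts.update(dimensions)
    return counts


def count_interviewees(coded: Dict[str, List[str]]) -> Counter:
    """Number of distinct interviewees who commented on each dimension."""
    counts: Counter = Counter()
    for dimensions in coded.values():
        counts.update(set(dimensions))  # each person counted once per dimension
    return counts


mentions = count_mentions(coded_interviews)
people = count_interviewees(coded_interviews)
print(mentions["provenance"], people["provenance"])  # 2 1
```

Keeping the two counts separate matters because one interviewee returning repeatedly to a topic inflates the mention count without indicating broader interest.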

Interviewees were selected based on a segmented sample to ensure representation from multiple sectors, including artificial intelligence (AI)/tech firms, large pharmaceutical companies, National Health Service/data custodians and academics (figure 1). Forty individuals were contacted to request an interview. Of these, 8 did not respond, 10 declined to be interviewed, 3 were unable to schedule a time and 21 were finally interviewed. Interviews were held in April and May 2020 using an online meeting platform, and these were recorded to support transcription of the discussion.

Figure 1

Bar chart showing breakdown of interviewees by sector.

Interviews were semistructured (online supplemental table 2). Questions were sent at least 24 hours before the interview, in addition to the initial framework. Consent of the participants was taken from their initial agreement to participate in the interview process—all were informed of the nature of the project in developing the tool.18 As the approach to develop the framework was service development work and did not include patients or staff in their clinical roles, ethical approval was not required—as confirmed by the Health Research Authority.19

The interviewees were asked to comment on the importance of dimensions within the proposed framework, and to suggest any dimensions which were not originally included, but were not otherwise directed and were free to discuss whichever aspects they felt most important.

Qualitative content analysis was used on the outputs of interviews to establish the relative interest in the various dimensions. This included an estimation of the categories for the proposed ‘medallion’ ratings (online supplemental table 3).

In June 2020, a survey was issued to the interviewee list, as well as HDR UK’s Data Officer Community, with a request for all recipients to share with their own contacts. The survey (online supplemental table 4) requested input on the revised matrix. Responses from 30 individuals were received with some respondents spread across multiple sectors (figure 2). The content analysis was repeated to identify refinements and develop the second version of the matrix (online supplemental table 5).

Figure 2

Bar chart showing breakdown of survey respondents by sector.

Following the survey, the second version of the framework was included in a wider consultation, which was publicly open online from August to September 2020. In addition, the second version of the framework was applied to 43 existing datasets across seven HDR Hubs, with feedback provided from each team on the potential suitability and applicability of the framework. These datasets include routinely collected clinical data, genomic data, national datasets and imaging datasets. The feedback from this process was used to develop the final version of the framework but given that no existing ‘gold standard’ framework existed, formal testing of performance was not carried out as part of this process.

Results

Participant characteristics

The interviewees represented a range of different sectors (figure 1). All were required to use health data for research, or support others in doing so, as part of their professional context.

Original framework comments

All interviews (21) emphasised the importance of comprehensive dataset metadata, describing the nature and scope of the data collection. This information enabled users to identify the utility and relevance of a dataset for specific use cases. A number of interviews (nine) emphasised the importance of a readily available data dictionary and the ability to interrogate the dataset at a data element level. Beyond typical dataset descriptions, users emphasised the requirement to understand data provenance, especially in the case of data consolidated from multiple sources, where the provenance may differ across data elements. The number of mentions of each dimension of the original framework is provided in table 1 (note that these counts reflect interest rather than support, as a comment may either support or disagree with the dimension):

Table 1
Table showing number of interview respondents commenting on each item from the original framework, to gain an understanding of end users’ views of the initial framework categories

Description: characteristics and service

Interviewees drew little distinction between dimensions in the description and characteristics and service categories, as many of the elements relate to the information that is available about the dataset, known as the metadata. Several users, particularly from industry, noted the importance of upfront clarity regarding the uses applicable for the dataset, something that is not currently widely available. A number of interviewees described the availability of additional resources and information on the dataset, including previous academic publications, documentation of frequently asked questions and a contact whom they could reach with questions, as a key factor of data utility.

Beyond descriptive metadata, ‘access’ was a key consideration for data users sourcing data from outside their organisations. Users with commercial use cases (including companies developing and supplying data and machine learning products, and pharmaceutical companies) wanted indicators of which datasets were available for commercial use, in order to minimise enquiries into unsuitable data sources. Researchers noted that the amount of time required to gain approval for data access was a deciding factor in selecting datasets (and sometimes research questions), and commercial data users also highlighted this as a risk. The means of data access (eg, direct download, use of a secure environment, via an internal analysis team) was noted as an important factor impacting time and costs for commercial organisations accessing data. This emphasis led to the creation of a separate category focusing on service.

The service category was subject to significant attention at the testing stage. Respondents noted that clarity was required on the role of the research environment, as well as refining the wording on the categories for timeliness.

Scale

The initial elements within the Scale category were perceived to be useful by interviewees. Some interviewees (3) mentioned the pathway coverage dimension, and those from a pharmaceutical background indicated that the longitudinal patient journey helped explain health outcomes. Of the 20 survey respondents who indicated their feelings, 14 deemed this element either important or very important.

While nine interviewees specifically commented on the ‘coverage’ field, interview questions relating to the number of expected items did not yield meaningful answers due to the significant variability across use cases. Therefore, the number of entries element was excluded from the final matrix. Additionally, pharmaceutical companies wanted details of coverage on a specific number of patients meeting multiple requirements. Duration was excluded since feedback suggested that ‘length of patient follow-up’ was an element of particular interest. Depth was also excluded since users valued the details of what was measured rather than the number of things being measured. Missing data and missing data handling were topics of interest but were excluded from the final framework since it was difficult to agree the optimal method; instead, it was suggested that this be included in the additional documentation and support section of the metadata.

‘Technical’ quality dimensions

Throughout the surveys and interviews, the aspects of ‘technical’ data quality, defined here as those listed by the Data Management Association (DAMA), were seen as important, mentioned by 11 interviewees, but no indication could be given of specific required or expected levels for the dimensions. This led to the supplementation of the DAMA dimensions (Completeness, Uniqueness, Timeliness, Validity, Accuracy and Consistency) with an additional element relating to the data management process itself, which was supported by several (six) interviewees. This additional data management process element was considered by all but one survey respondent to be either important (14) or very important (7).

Added value

The additional information available about a dataset was generally considered to be of importance by interviewees and respondents. The majority of discussions in this area related to the ability to link the dataset with others, with this being mentioned 10 times by interviewees. Many survey respondents (12) felt that this element was very important. Respondents also commented that data enrichment was important (13), but there was no direct discussion on this dimension from the interviews. Current usage was considered relatively useful but was not mentioned directly by interviewees or in the respondents’ comments. Data access requests were said to be a useful indicator of the level of utility of the dataset and more likely to be understandable with a functional and operational data governance process. Feelings among the respondents about this dimension were mixed, and it was deemed to be of low importance. For the provenance of access request dimension, there were only two specific comments in each of the interview and survey responses. The added value section was reduced in scope for the final matrix given this response.

The final proposed version of the data utility framework following the user design feedback is provided in figure 3.

Figure 3

Final version of the proposed data utility framework based on data user feedback.

Discussion

The findings of this study have provided an evidence-based initial approach for a proposed framework for identifying and understanding the main factors that are related to the utility of a healthcare dataset for secondary purposes across a range of industries and use cases. This has allowed development of a data quality matrix and classification system, which will be further refined through testing and implementation. The real-world usefulness of such a framework will be evaluated following implementation and feedback from users as well as usage statistics of particular datasets in relation to their framework score and classification. It is believed to be the first successful attempt to create a semi-structured framework for characterising health datasets on usability and allowing users, regardless of their specific use-case, to identify in advance whether a dataset would be useful for their purposes.

The main strengths of the study are that this is a practical and robust approach to a problem that has been theoretically reviewed but not addressed in an objective manner previously. The involvement of a range of different stakeholders through multiple cycles, together with repeated improvement and testing, has allowed for the development of a framework that can be adapted to different data types and implemented in real-life situations.

However, by necessity, the process relied on a non-random group of respondents, which introduces potential selection bias among the survey respondents. The sampling strategy was developed based on a systematic but pragmatic approach to gather a range of views from different organisations and sectors to ensure all main stakeholder groups were represented. Any potential bias could lead to a matrix that was incomplete, or had unnecessary emphasis on particular categories, and only large-scale feedback will determine these aspects. The continuous development of the matrix during the process was appropriate given the pragmatic nature of the project in the absence of an existing standard practice, but a more structured approach would have identified in advance the stages and cycles of development. This methodological limitation impacts the ability of the matrix in its present form to be used for other data contexts without further development, and here we make no claims regarding cross-sector applicability. Given these limitations, future iterations of the framework are likely to evolve from the original version presented here based on real-world usage and feedback. However, further research is required to identify the ‘effectiveness’ of the framework in terms of identifying areas to address through data improvement activities and then demonstrating that usability of the dataset increased subsequently. The power of the framework and its potential use in other circumstances are not known at this stage.

The framework is likely to require further development as it is tested on an increasingly wide range of health datasets. This development will take several forms: the refinement of the existing categories, the addition of new categories, integration into tools and the creation of extensions. The refinement of the existing categories will continue as more feedback is received through the testing process. One key development is likely to be ‘normalisation’ of the categories—currently the breakdown is based on the surveys and interviews, as well as initial testing on >40 datasets. As the tool is applied to many more datasets, it may be necessary to adjust the categories to ensure an appropriate distribution across the existing categories in the UK health landscape.

Similarly, it is possible that further categories may be added. For example, during the development, a category on ‘Research Environment’ was proposed. However, owing to an inability to reconcile the varying tensions between the current status of research environments across UK data custodians, user requests from particular sectors and the principles for Trusted Research Environments, this category was not included in the final version of the framework.

The integration of the framework into tools will allow for its value to be realised in practice. In isolation, it provides a view of a given dataset. However, the power of such a framework approach comes with the ability for an individual to specify their requirements in a catalogue, such as the HDR Innovation Gateway, and use their requirements from the framework to remove the datasets which would not be fit for their purposes. The framework was deliberately designed to be applicable in a general manner, but it was solely tested on datasets and users relating to biomedical data and applications. Further work is required to explore whether it can be usefully applied to other data types, and if specific extensions are required for certain data modalities, such as imaging data.
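The idea of filtering a catalogue against user requirements can be sketched in a few lines of Python. This is an illustration under stated assumptions only: the catalogue structure, the dataset names and the functions `meets_requirements` and `filter_catalogue` are hypothetical, and nothing here describes how the HDR Innovation Gateway is actually implemented. A dataset with no rating for a required dimension is treated as not meeting the requirement.

```python
from typing import Dict, List

LEVELS = ("bronze", "silver", "gold", "platinum")
RANK = {level: position for position, level in enumerate(LEVELS)}

Ratings = Dict[str, str]  # dimension name -> category achieved


def meets_requirements(ratings: Ratings, requirements: Ratings) -> bool:
    """True if every required dimension is rated at or above its minimum."""
    for dimension, minimum in requirements.items():
        rating = ratings.get(dimension)
        if rating is None or RANK[rating] < RANK[minimum]:
            return False
    return True


def filter_catalogue(catalogue: Dict[str, Ratings], requirements: Ratings) -> List[str]:
    """Names of datasets that satisfy all of the user's minimum requirements."""
    return [
        name
        for name, ratings in catalogue.items()
        if meets_requirements(ratings, requirements)
    ]


# Hypothetical catalogue entries and user requirements.
catalogue = {
    "dataset_a": {"completeness": "gold", "linkage": "silver"},
    "dataset_b": {"completeness": "bronze", "linkage": "platinum"},
    "dataset_c": {"completeness": "platinum"},  # linkage not rated
}
requirements = {"completeness": "silver", "linkage": "silver"}

print(filter_catalogue(catalogue, requirements))  # ['dataset_a']
```

Expressing requirements as per-dimension minimums reflects the principle that different use cases need different levels of utility, so no single overall score is computed.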

Conclusion

In conclusion, we have proposed a codesigned and evidence-based health dataset utility framework, for potential widespread evaluation and use, which will be integrated into HDR UK’s ambitions to make health data more useful for research and enable discoveries that improve people’s lives. The tool will be tested on datasets currently discoverable through the Innovation Gateway, and feedback from this process will be used to refine the framework as applicable to the range of potential use cases. Such a framework may also provide objective evidence to demonstrate the benefit of specific data improvement and ‘curation’ activities, including the potential for providing a return-on-investment type understanding of work in data.