
An instrument to identify computerised primary care research networks, genetic and disease registries prepared to conduct linked research: TRANSFoRm International Research Readiness (TIRRE) survey
1. Emily Jennings,
2. Simon de Lusignan,
3. Georgios Michalakidis, PhD Data Analytics,
4. Paul Krause,
5. Frank Sullivan,
6. Harshana Liyanage and
7. Brendan C. Delaney
1. Hon. Research Assistant, Department of Clinical and Experimental Medicine, University of Surrey, Guildford, UK
2. Professor, Primary Care and Clinical Informatics, Department of Clinical and Experimental Medicine, University of Surrey, Guildford, UK
3. Department of Computer Science, University of Surrey, Guildford, UK
4. Professor, Complex Systems, Department of Computer Science, University of Surrey, Guildford, UK
5. Professor, Primary Care, Population Health Sciences, University of Dundee, Dundee, UK
6. Research Fellow, Department of Clinical and Experimental Medicine, University of Surrey, Guildford, UK
7. Professor, Primary Care Research, Department of Surgery and Cancer, Imperial College London, St Mary’s Campus, London, UK
1. Author address for correspondence: Simon de Lusignan Professor Primary Care and Clinical Informatics Department of Clinical and Experimental Medicine University of Surrey Guildford GU2 7XH, UK s.lusignan{at}surrey.ac.uk

## Abstract

Purpose The Translational Research and Patient Safety in Europe (TRANSFoRm) project aims to integrate primary care with clinical research whilst improving patient safety. The TRANSFoRm International Research Readiness survey (TIRRE) aims to demonstrate data use through two linked data studies and by identifying clinical data repositories and genetic databases or disease registries prepared to participate in linked research.

Method The TIRRE survey collects data at micro-, meso- and macro-levels of granularity to fulfil the data, study-specific, business, geographical and readiness requirements of potential data providers for the TRANSFoRm demonstration studies. We used descriptive statistics to differentiate between demonstration-study compliant and non-compliant repositories. We only included surveys with >70% of questions answered in our final analysis, reporting the odds ratio (OR) of positive responses associated with a demonstration-study compliant data provider.

Results We contacted 531 organisations within the European Union (EU). Two declined to supply information; 56 made a valid response and a further 26 made a partial response. Of the 56 valid responses, 29 were databases of primary care data, 12 were genetic databases and 15 were cancer registries. The demonstration compliant primary care sites made 2098 positive responses compared with 268 in non-use-case compliant data sources [OR: 4.59, 95% confidence interval (CI): 3.93–5.35, p < 0.008]; for genetic databases: 380:44 (OR: 6.13, 95% CI: 4.25–8.85, p < 0.008) and cancer registries: 553:44 (OR: 5.87, 95% CI: 4.13–8.34, p < 0.008).

Conclusions TIRRE comprehensively assesses the preparedness of data repositories to participate in specific research projects. Multiple contacts about hypothetical participation in research identified few potential sites.

• medical informatics
• family practice
• medical records systems
• electronic health records
• diabetes mellitus
• Barrett’s disease

## ABBREVIATIONS

IT - information technology; EMR - electronic medical records; TRANSFoRm - Translational Research and Patient Safety in Europe; TIRRE survey - TRANSFoRm International Research Readiness survey; IBM SPSS - IBM Statistical Package for Social Sciences; eHR - electronic health record; OR - odds ratio; ICD - International Classification of Disease; ICPC - International Classification of Primary Care; SNOMED - Systematised Nomenclature of Medicine; CTv3 - Clinical Terms Version 3; ATC - Anatomical Therapeutic Chemical; HL7 - Health Level-7; RIM - Reference Information Model; CDISC - Clinical Data Interchange Standards Consortium; BRIDG - Biomedical Research Integrated Domain Group; CSV - Comma Separated Values; CPRD - Clinical Practice Research Datalink.

## INTRODUCTION

Large databases of health data are widely used for research but less often combined.1 Linked data facilitates better measurement of clinical performance and patient health outcomes in health care systems.2 The technical challenges of linking data are usually considered to be the key barrier to integrating disparate heterogeneous data sources.3 Data privacy legislation can considerably hinder research in a multinational setting.4 Data collected within primary care have been computerised since the 1990s5 with data widely used for research,6 but with relatively little linkage of data beyond disease-specific programmes in individual localities. In the United States, the federal electronic medical records mandate aims not only to save money but also to modernise health information technology (IT). A team of RAND Corporation researchers projected in 2005 that a move towards health IT could potentially save $81 billion. However, this saving has far from materialised and, despite the recommendations, spending in the US has increased over the past 9 years by $800 billion.7 The increase in spending was, in part, attributed to the slow adoption of health IT systems that are neither interoperable nor easy to use.

The Translational Research and Patient Safety in Europe (TRANSFoRm) project aims to reduce barriers to conducting research using routine healthcare data across Europe.8–10 The European eHealth Action Plan prioritises interoperability between health records so that internationally comparable data can be collected on the quality of care and for research.11 The TRANSFoRm International Research Readiness (TIRRE) survey was developed and designed to collect information about these data sources with the primary aim of assessing the preparedness of disease registries, throughout Europe, to conduct linked research using the TRANSFoRm project (Appendix 1). The TRANSFoRm requirements for the TIRRE instrument were that it could assess the feasibility of conducting two simulated studies (use-cases): one on the genetics of response to oral anti-diabetic medication; the other on the relationship between anti-indigestion medication, Barrett’s disease, oesophageal cancer and the quality of life. The ‘use-cases’ were designed to capture how primary care recorded oesophageal reflux might be a prodrome of cancer; and any genetic predisposition to complications of people with type 2 diabetes.6

## METHOD

### Sampling and data collection

Our initial contact was to the health ministry of each EU country and to National Primary Care Organisations. Subsequent strategies included trying to identify sites through Internet and Medline searches, and snowball sampling through contacts made or work references. We also contacted National and European informatics and research networks. We identified sites across Europe willing to participate in the survey by contacting them through email or web-form and we then followed this up with a phone call. We exported these data from the completed online questionnaires directly into either Microsoft Excel or into Statistical Package for Social Sciences (IBM SPSS). We categorised ‘non-compliance’ as a respondent who partially completed the online survey, answering <70% of the questions; or as a respondent with whom we had initially made telephone contact but who was then unavailable for their telephone interview or failed to proceed to online completion of the survey. A major component of the workload in this project involved identifying potential survey respondents.

### Micro-, meso- and macro-level

The broad scope of the survey emerged from a series of workshops and is composed of a wide range of questions designed to assess how data might be linked, the data itself, extraction methods and social and organisational influences.15,16 The final instrument contained 160 questions divided into a framework which consisted of micro-, meso-, macro- and study-specific levels.

• The first section covered micro-level issues and was concerned with the data source, the data itself, metadata, the potential for linkage or achieving semantic interoperability between data sources17 and details of how many studies have been published using the data.

• The meso-level explored the data extraction,18 the architecture for the computerised medical record and other data repositories,19 audit trails and the size of the database.

• The macro-issues related to the nature of the health system, socio-cultural factors and issues relating to the funding, purpose and restrictions on the use of the data.

• Study-specific questions make up the final part of the survey instrument (Supplementary data file, Table S1); these were designed to identify sites that were eligible to participate in the use-cases in pairs of primary care and genetic, or primary care and cancer registry data.

We described the coding systems used to store data, including drug dictionaries and any standards used (the aim was to determine whether there were a small number of possible combinations of coded data to identify within data repositories and the mechanisms for achieving interoperability), the number and details of eHR vendors, vendors of communications and data processing applications routinely used (including their international scope, coding systems offered and if they had common data export formats) along with organisational, policy, cultural or legislative restrictions on data reuse.

### Use-case specific

We analysed the process of conducting two use-cases and defined the studies using a framework which set out the micro-, meso- and macro-levels of data and process information required to conduct successful linked research, where multiple data sources are semantically integrated. We summarised the sites eligible to participate in the use-cases in pairs of primary care and genetic or primary care and cancer registry data. If the database can support a use-case, we consider the site a use-case compliant site; if it cannot, we define the site as a non-use-case site. Registries were only eligible if they provided a valid response to the questionnaire. We required as much of the survey to be completed as possible as each part of it was determined from our requirements analysis. We defined a valid response to be one which answered >70% of the questions. Key compulsory answer questions which defined compliance provided information such as valid contact details, a link to another dataset, size of the dataset, data model and details of the coding system, the likely lead time in any approval process and that they have use-case variables available. All sections of the questionnaire provided significant and useful information to determine if the database was use-case specific.
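The validity and compliance rules described above can be sketched in code. This is a minimal illustration, not the authors' implementation: the `Response` structure, the item identifiers in `KEY_ITEMS`, and the choice to compute completion against the questions shown after skip logic are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Threshold from the text: a valid response answers more than 70% of questions.
VALID_THRESHOLD = 0.70

# Hypothetical identifiers for the key compulsory items listed in the text.
KEY_ITEMS = (
    "contact_details",
    "linked_dataset",
    "dataset_size",
    "data_model",
    "coding_system",
    "approval_lead_time",
    "use_case_variables",
)


@dataclass
class Response:
    """One returned questionnaire (illustrative structure only)."""
    applicable: int  # questions shown to this respondent (assumed denominator)
    answered: int    # questions actually answered
    key_items: set = field(default_factory=set)  # compulsory items supplied

    @property
    def completion(self) -> float:
        """Fraction of applicable questions answered."""
        return self.answered / self.applicable if self.applicable else 0.0


def is_valid(r: Response) -> bool:
    """Valid response: strictly more than 70% of questions answered."""
    return r.completion > VALID_THRESHOLD


def is_use_case_compliant(r: Response) -> bool:
    """Assumed rule: a valid response that also supplies every key item."""
    return is_valid(r) and set(KEY_ITEMS) <= r.key_items


if __name__ == "__main__":
    example = Response(applicable=100, answered=85, key_items=set(KEY_ITEMS))
    print(is_valid(example), is_use_case_compliant(example))
```

Note that the text defines non-compliance as <70% and validity as >70%, so a response at exactly 70% is treated here as not valid; that boundary choice is also an assumption.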

### Reporting and analysis

We compared the responses from databases that proved eligible to participate in the use-cases with those who were not. We wanted to explore whether it was more likely that those associated with eligibility would give a positive response to questions than those who were not deemed eligible. A valid response provided by the respondent is considered a positive response. The purpose of this exercise was to identify any questions that were not purposeful and to reduce the number of questions. We identified and reviewed any questions that were not answered positively by any of the use-case eligible respondents on the basis that they were not discriminatory of eligibility to participate in either of the studies.

### Statistical methods

We used descriptive statistics (i.e. measures of frequency) to describe response rates, quoted odds ratios (ORs) with 95% confidence intervals (CIs) and used tests of proportion to report whether sections of the questionnaire helped to discriminate between those able to conduct the use-case or not.
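For readers unfamiliar with the calculation, an odds ratio and its 95% CI can be obtained from a 2×2 table of positive and non-positive responses. The sketch below assumes a Wald interval on the log scale, which is the common textbook approach; the paper does not state which interval method was used, and the counts in the example are hypothetical, not survey data.

```python
import math


def odds_ratio_ci(a: int, b: int, c: int, d: int, z: float = 1.96):
    """Odds ratio and Wald confidence interval for a 2x2 table.

    a = positive responses from eligible sources
    b = non-positive responses from eligible sources
    c = positive responses from non-eligible sources
    d = non-positive responses from non-eligible sources
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive")
    or_ = (a * d) / (b * c)
    # Standard error of log(OR) for a 2x2 table.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_) - z * se)
    upper = math.exp(math.log(or_) + z * se)
    return or_, lower, upper


# Hypothetical counts chosen for illustration only.
or_, lo, hi = odds_ratio_ci(80, 20, 40, 60)
print(f"OR = {or_:.2f}, 95% CI {lo:.2f} to {hi:.2f}")
```

An OR above 1 with a CI that excludes 1 indicates that eligible sources were more likely to give a positive response than non-eligible ones.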

### Ethics statement

There was no formal ethics board review. This survey only seeks to report information about the capacity and capability of information sources to be combined to conduct research studies and does not involve any access to personal data. However, the TIRRE survey does check whether data sources collect individual consent and if they contain strong identifiers and if there are restrictions on the use of data.

## RESULTS

### Sample and data collection for use-case specific defined studies

We made many contacts but received few responses. We contacted 531 different organisations, and later individuals in EU countries (including eHR vendors) and received 56 valid responses. Of the health ministries we contacted, seven provided useful information and a further five responded but could not provide any helpful information. Only two site representatives declined to participate at this stage (Supplementary data file, Table S2).

### eHR vendors

We also collected details of the national or international eHR vendors with a significant presence in one or more EU countries. We contacted 17 companies identified initially, as well as any reported by survey respondents. Nine of these eHR vendors had a presence in more than one country. Two of those contacted started to complete the TIRRE survey instrument but failed to complete the questionnaire. We also approached nine vendors listed by site representatives who completed the questionnaire but they once again expressed no interest in participating in the survey. They did suggest they might consider completing the survey in the future if and when we had something more definite to offer. Few vendors responded; however, when they did reply to the survey, their responses to the questions posed provided useful detail.

### Telephone and online completion of the survey

Of the 531 organisations we made contact with, 45 respondents commenced but did not complete the TIRRE survey online (Supplementary data file, Table S3) and 26 made a partial response during telephone enquiries but were then either unavailable for their telephone interview or failed to go ahead and complete the online survey. The initial telephone interviews took 1.5 hours and with experience still took 50–75 minutes. The feedback from the pilot survey suggested that the process took too long and that there was very little incentive for the respondent to complete the survey. While this drawback could have biased the responses collected, we consider it a valuable lesson for similar database profiling activities conducted in the future.

### Completion of the survey

The valid surveys were, on average, returned with 76% of the questions completed and this was consistent across the three respondent groups. Looking at the survey by category, the Data source and Record system sections were the only ones that fell below the 75% level (many sections were returned with above 90% of the questions completed). The main reason for this was the variation in the skip logic for individual respondents in these sections of the questionnaire (Supplementary data file, Table S4). There was little difference between the sites which we had identified as eligible to participate in the use-cases and those we had identified as not eligible (77% use-case sites versus 75% non-use-case sites).

### Results micro-level data

The greater the number of coding systems in use, the harder it will be to achieve semantic interoperability; therefore, the micro-level data collection was primarily concerned with collecting information about the coding systems the repositories used. We found that the WHO International Classification of Disease (ICD)20 was the most common coding system used by 71% (n = 39) of respondents. ICD-10 (n = 32) was used by 82%; 13% (n = 5) used ICD-9; 23% (n = 9) used an ICD modification and 5% (n = 2) did not respond (Supplementary data file, Table S5).

The second most used coding system was the WHO International Classification of Primary Care (ICPC), which was used by 20% (n = 11) of respondents. Eighty-two percent (n = 9) of those using ICPC used ICPC-2 and 18% (n = 2) used ICPC-1; none reported using an ICPC modification (Supplementary data file, Table S6). The third most common coding system was the Systematised Nomenclature of Medicine (SNOMED),21 which was used by 13% (n = 7) of all the respondents; 44% (n = 4) of those using SNOMED used the Clinical Terms version; 33% (n = 3) used the Reference Terminology version and 22% (n = 2) did not respond (Supplementary data file, Table S7).

One of the least common coding systems used was the Read Coding system [version 2 (5-byte) and the Clinical Terms Version 3 (CTv3)] and these were only used by the seven UK repositories. They represented 9% (n = 5) and 4% (n = 2) of all respondents, respectively; 87% (n = 50) did not respond (Supplementary data file, Table S8).

The survey highlighted that there was a great variety in the number of drug dictionaries utilised by the repositories and this is one potential barrier to achieving semantic interoperability. Sixty percent (n = 33) of respondents said that they have a coding system for drugs (Primary care 83%, n = 24; Cancer 33%, n = 5; Genetic 36%, n = 4). Of these, 76% (n = 25) use the Anatomical Therapeutic Chemical classification system;22 9% (n = 3) use Multilex; 12% (n = 4) responded ‘other’ and 3% (n = 1) responded ‘no data’ (Supplementary data file, Table S9). We were interested to know whether it was possible to extract information about the administration of drugs and we asked respondents if it was possible to extract data about daily dose and administration route from their database. Only around one-third of the Primary care and Cancer registries could extract data of this nature, while none of the genetic databases held this information (Supplementary data file, Table S10).

The survey was designed to assess what systems the registries had in place to achieve interoperability and to ensure data quality. Thirty-four percent (n = 19) of respondents had no system at all; only 5% (n = 3) used Health Level-7 (HL7), an international interoperability organisation whose Reference Information Model underpins much interoperability in healthcare; 2% (n = 1) used the Clinical Data Interchange Standards Consortium (CDISC);23 none used the Biomedical Research Integrated Domain Group (BRIDG)24 and 52% (n = 29) used an ‘In-house or other’ system (Table 1). Nearly all (93%, n = 52) of the respondents either had no system in place or used an in-house system or provided no data.

Table 1 Systems used to ensure data quality

### Data collection meso- and socio-cultural levels

Data extraction at this level was concerned with record level issues. The majority of respondents (82%; n = 45) have the ability to extract data in standardised formats such as Comma Separated Values, Excel and full text. All of the respondents have at least one appropriate format. The data collected have a wide application and this is reflected by the diverse nature of the information stored within these repositories which ranges from research to mortality records (Table 2).

Table 2 The aims of the data source for the data collected

The respondents reported that socio-cultural influences had a small but significant impact on the validity of their data. These factors included ethical, religious and legal factors (Table 3); these might delay or prevent participation in the TRANSFoRm studies.

Table 3 Socio-cultural influences on the validity of the data

Socio-cultural factors, which include legal and ethical constraints, as well as influences on diagnosis, and organisational components of the health system from which the data originates are often barriers to conducting research. In summary, 71% (n = 39) of respondents use ICD and 20% (n = 11) use ICPC; however, 86% (n = 48) do not use one of the three main systems for ensuring data quality; 29% opt instead for an in-house system. Very few sites are adopting national standards for interoperability in linking data. Whilst multiple drug dictionaries were used, 66% (n = 10) of cancer repositories did not use them. Extract formats for data were standardised and only 3% (n = 6) of respondents chose to use a non-standard format. Data were not forthcoming from eHR vendors (n = 40). Repositories had a broad range of applications for their data; the most important was research (49%, n = 51). The most common socio-cultural influences that could potentially affect the validity of their data were ethical (10%, n = 7) and social (10%, n = 7) factors, although 49% (n = 36) reported no social issues at all.

### Difference in response depending on eligibility

Data sources that were non-use-case eligible tended to produce far fewer positive responses than those that were eligible. Overall, the repositories identified as potentially being use-case eligible made 2098 positive responses to questionnaire items compared with 268 from non-use-case eligible data sources (OR: 4.59; 95% CI: 3.93–5.35; p < 0.008); for genetic databases, the respective figures were 380:44 (OR: 6.13; 95% CI: 4.25–8.85; p < 0.008) and for cancer registries, they were 553:44 (OR: 5.87; 95% CI: 4.13–8.34; p < 0.008); the full results are in Table 4.
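A reader can check the internal consistency of the reported intervals. If a CI is symmetric on the log scale (an assumption here, as the interval method is not stated), the reported OR should sit at the geometric mean of its bounds, and the standard error of log(OR) can be recovered from the interval width. The sketch below uses only the figures quoted above; the function names are illustrative.

```python
import math

# (label, reported OR, lower bound, upper bound) as quoted in the text.
REPORTED = [
    ("2098:268 comparison", 4.59, 3.93, 5.35),
    ("genetic databases, 380:44", 6.13, 4.25, 8.85),
    ("cancer registries, 553:44", 5.87, 4.13, 8.34),
]


def interval_centre(lower: float, upper: float) -> float:
    """Geometric mean of the bounds: the centre of a log-scale interval."""
    return math.sqrt(lower * upper)


def implied_log_se(lower: float, upper: float, z: float = 1.96) -> float:
    """Standard error of log(OR) implied by a symmetric log-scale interval."""
    return (math.log(upper) - math.log(lower)) / (2 * z)


for label, reported_or, lower, upper in REPORTED:
    centre = interval_centre(lower, upper)
    se = implied_log_se(lower, upper)
    print(f"{label}: reported OR {reported_or}, "
          f"interval centre {centre:.2f}, implied SE(log OR) {se:.3f}")
```

For all three comparisons the interval centre agrees with the reported OR to within rounding, which is what a log-scale interval would produce.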

Table 4 Positive responses to the questionnaire sections – comparing non-use case eligible and use-case-eligible data sources

### Data repositories capable of participation in the survey

Of the 56 valid responses, there were 15 pairs eligible to complete one or other of the use-cases. The 56 valid responses were made up of 29 databases of routine primary care data, 12 genetic databases and 15 cancer registries. From the valid responses, we were able to identify the location of databases with the potential to participate in the research studies. We identified five locations for linking primary care databases with genetic databases and 10 for linking primary care databases with cancer registries. The 15 eligible sites were spread across 11 countries (Supplementary data file, Table S11).
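The pairing step can be illustrated with a short sketch. It assumes that a pair is formed when a primary care database and a genetic database or cancer registry share a location, which is one reading of the description above; the respondent names and locations are invented for the example and are not survey data.

```python
from collections import defaultdict

# Hypothetical respondents: (name, location, kind). Not survey data.
RESPONDENTS = [
    ("PC-A", "Country X", "primary_care"),
    ("GEN-A", "Country X", "genetic"),
    ("PC-B", "Country Y", "primary_care"),
    ("CAN-B", "Country Y", "cancer"),
    ("CAN-C", "Country Z", "cancer"),  # no primary care partner, so no pair
]


def candidate_pairs(respondents):
    """Pair each primary care database with registries in the same location."""
    by_location = defaultdict(lambda: defaultdict(list))
    for name, location, kind in respondents:
        by_location[location][kind].append(name)

    pairs = []
    for location, kinds in sorted(by_location.items()):
        for primary_care in kinds.get("primary_care", []):
            for partner_kind in ("genetic", "cancer"):
                for partner in kinds.get(partner_kind, []):
                    pairs.append((location, primary_care, partner, partner_kind))
    return pairs


for pair in candidate_pairs(RESPONDENTS):
    print(pair)
```

In practice a shared location is only a first filter; whether records can actually be linked depends on the identifiers, approvals and data models each source reports.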

### Details of the eligible sites

The sites had a total of around 1.5 million potential patients eligible to participate in this research; over 30,000 in the genetics of diabetes use-case and over 1 million in the use-case on Barrett’s disease, oesophageal cancer and the prescription of 30 medicines used to treat dyspepsia. The country of origin, the website for these sites, the main coding system used and the expected delay in ethical approval are shown in Tables 5 and 6. We sometimes found contradictions: some data sources indicated that they could supply linked data but, on closer questioning, several of the participants were only linking on a pilot basis; we have shaded out in grey the sites which are not currently active. The outcome of this process is that we have identified one fully functional location able to run the diabetes use-case (Table 5) and five pairs of locations able to run the Barrett’s disease use-case (Table 6). The one able to run the diabetes use-case is the Wellcome Type 2 Diabetes study group in Scotland. The five locations that can run the second use-case are as follows: Finland, Germany (Bremen), Norway, UK (General Practice Research Database), UK, Scotland (pilot).

Table 5 The eligible sites for conducting the diabetes TRANSFoRm use-cases
Table 6 The eligible sites for conducting Barrett’s disease TRANSFoRm use-cases

## DISCUSSION

### Principal findings

The TIRRE survey has been completed by 56 data-repositories across Europe and six outside the EU. We have developed a usable instrument which can assess their potential to take part in linked data research. There were no equivalent International sites available to conduct this type of research. A challenge was to get databases to complete the questionnaire; when we did get a response, the completeness of information gathered was high and proved useful in identifying their potential to participate in linked research. Meso- and macro-level questions were important discriminators between use-case and non-use-case eligible data sources. There are currently no other survey instruments available to enable brokerage between databases potentially willing to participate in research. Micro-level questions provided information about the data and its granularity.

### Implications of the findings

The TIRRE survey is the first step towards assessing the potential of a database for linkage. It can identify a data source’s suitability in terms of data availability and readiness to participate in a study. Whilst the initial focus of TIRRE was on linking data sources (which were important and consistent), the meso- and macro-factors generally had higher ORs for predicting use-case eligibility.

Different coding systems have varying levels of granularity. For example, at the time of this study, neither ICD-10 nor ICPC differentiated between types of diabetes according to the latest WHO classification. ICD-10 differentiates insulin-dependent and non-insulin-dependent, rather than the Type 1 (insulin for survival) and Type 2 diabetes used in the latest classifications. We acknowledge that this has been updated in later releases.

### Comparison with the literature

It is possible to draw comparisons between the complexity of this task and the existing successful projects that involve linking data. However, the successful data repositories in the UK have all been based on a single vendor of GP eHR system. The Clinical Practice Research Datalink previously extracted data only from a single vendor, In-Practice Systems, though it is expanding this to all UK vendors;25 Q-Research is based on the EMIS system,26 and other UK research networks (The Health Improvement Network27 and ResearchOne28) follow the same pattern. The only exception to this in the UK is the Royal College of General Practitioners (RCGP) Research and Surveillance Centre (RSC);29 this network extracts data from all the different brands of medical record system. It has published a cohort profile about patients in the RCGP RSC database with diabetes, one of the TRANSFoRm use-case areas.30 Notwithstanding the RCGP RSC’s success, the relatively simple task of linking data from this small number of brands of computer within the UK has proved challenging, both in terms of creating a summary care record31 and in developing a common data extraction system.32

### Limitations of the method

Any initial screening process will need to be followed up by a detailed assessment of whether the dataset needed for a given study can be elicited from the data repositories. There was no real incentive for data repositories to supply us with the data required, as there was not a reciprocal offer of benefit. As a consequence, our results inevitably underestimate the number of sites where this type of research can be conducted. We propose that future projects should consider including incentives in their budget. An effective method to reduce the impact of this self-selection bias could be to approach databases with a partially completed survey (using information available in the public domain) in order to encourage participation. Furthermore, the collected data could be shared publicly as a metadata registry that would facilitate advertising data offered by organisations for prospective studies. We also recommend limiting surveys to 30–40 questions to improve the response rate.

### Call for further research

We need to conduct test-retest studies to assess the reliability of the survey instrument. The reliability test could be carried out by repeating the data collection after a period of time. While this would help to validate the instrument, it will also potentially remove any bias introduced by the specific person responding to the survey. We should conduct simulated and real studies with data extractions to test its validity. However, conducting real studies may be affected by the availability of funding. Alternatively, we can promote reuse of the instrument in other projects within the research area.

## CONCLUSIONS

A large complex set of data is needed to know if it will be possible to link primary care and either a disease registry or a genetic database. This complex set of data can either be classified by level of granularity or as a business or data requirement.

The TIRRE instrument is a useful tool that can be used to assess general suitability and readiness to participate in linked research studies. With increased use, it is likely that TIRRE will evolve further, but its use needs to be embedded in a concrete ‘offer’ and business case rather than a one-off research study.

## Acknowledgements

Paul van Royen for his comments on the manuscript; IMIA and EFMI for supporting their primary health care informatics working groups. Antonis Ntasioudis for his contribution to this research. TRANSFoRm is supported by the European Commission – DG INFSO (FP7 2477).

## Appendix 1 Details of the TRANSFoRm work tasks

Table S1 Categories of data collection and min-to-max number of questions; skip logic reduces the number of questions that each type of respondent might answer
Table S2 Number of contacts and valid responses
Table S3 Number of contacts and valid responses
Table S4 Completion of the questionnaire
Table S5 Coding systems information (ICD)
Table S6 Coding systems information (ICPC)
Table S7 Coding systems information (SNOMED)
Table S8 Read coding systems usage (CTv3 and Read codes version used)
Table S9 Coding systems for drugs
Table S10 Extraction of drug information from data provided
Table S11 Location of respondents and eligible sites

