Outputs and growth of primary care databases in the United Kingdom: bibliometric analysis

Background: Electronic health database (EHD) data are increasingly used by researchers. The major United Kingdom EHDs are the ‘Clinical Practice Research Datalink’ (CPRD), ‘The Health Improvement Network’ (THIN) and ‘QResearch’. Over time, outputs from these databases have increased but have not been evaluated.
Objective: This study compares research outputs from CPRD, THIN and QResearch, assessing growth and publication outputs over a 10-year period (2004-2013). CPRD was also reviewed separately over 20 years as a case study.
Methods: Publications from CPRD and QResearch were extracted using the Science Citation Index of the Thomson Scientific Institute for Scientific Information (Web of Science). THIN data were obtained from University College London and validated in the Web of Science. All databases were analysed for growth in publications, the speciality areas and the journals in which their data have been published.
Results: These databases collectively produced 1,296 publications over the 10-year period, with CPRD representing 63.6% (n = 825 papers), THIN 30.4% (n = 394) and QResearch 5.9% (n = 77). Pharmacoepidemiology and General Medicine were the most common specialities featured. Over this period, publications for THIN and QResearch increased slowly, whereas CPRD publications increased substantially in the last 4 years, with 72.3% of all CPRD publications appearing between 2004 and 2013.
Conclusion: These databases are enhancing scientific research and are growing yearly, but they display variability in their growth. They could become more powerful research tools if the National Health Service and general practitioners can provide accurate and comprehensive data for inclusion in these databases.


INTRODUCTION
Data collected in electronic medical records for a patient in primary care can span from birth to death and can have enormous benefits in improving health care and public health, and for research. Several systems exist in the United Kingdom (UK) to facilitate the use of research data generated from consultations between primary care professionals and their patients. General Practitioners play a gatekeeper role in the UK's National Health Service (NHS) because they are responsible for providing primary care services and for referring patients to see specialists.
In more recent years, these databases have been supplemented (through data linkage) with additional data from areas such as laboratory investigations, hospital admissions and mortality statistics. Data collected in primary care research databases are now increasingly used for research in many areas, and for providing information on patterns of disease. 1 These databases have clinical and prescription data and can provide information to support pharmacovigilance, including information on demographics, medical symptoms, therapy (medicines, vaccines, devices) and treatment outcomes. 1 The major primary care research databases in the UK include the 'Clinical Practice Research Datalink' (CPRD), 'QResearch' and 'The Health Improvement Network' (THIN). For all three systems, the information relating to symptoms, diseases, consultations and other clinical events is recorded using the Read code system. The data made available to researchers are anonymised, and strong patient identifiers such as name, address and postcode, date of birth and NHS number are removed.
The CPRD is jointly funded by the NHS National Institute for Health Research (NIHR) and the Medicines and Healthcare products Regulatory Agency (MHRA). 2 It is one of the largest databases of longitudinal medical records derived from primary care in the world. 3 The collection of information began in 1987 under the previous name General Practice Research Database (GPRD). GPRD was initially part of Value Added Medical Products, a company that pioneered the design and marketing of a general practice office computer system that allowed individual patient medical records to be recorded. The database was later transferred to government control. 4 CPRD now holds nearly 30 years of longitudinal data. As of December 2014, the database contained data for over 13.5 million patients, of whom approximately 5.7 million are currently active. 4 As well as primary care data, CPRD now links to a number of other data sets such as 'Hospital Episode Statistics' (HES) and mortality data from the Office for National Statistics (ONS). It is increasingly being used to enhance clinical trial efficiency (protocol optimisation, feasibility and recruitment) through work with general practitioners, and can provide data for both industry and academic researchers. 4 Access to the data is subject to protocol approval by the MHRA Independent Scientific Advisory Committee. Over 1,500 research reports published in peer-reviewed journals have used data from the CPRD and have had direct impacts on public health and disease speciality areas. 5
QResearch is a large primary care database derived from the anonymised health records of over 12 million patients. 6 The data currently come from over 950 general practices using the Egton Medical Information Systems (EMIS) clinical computer system, which is used throughout the UK. 6 Although the data contain socio-economic details of patients based on their postcode, they do not include any identifiable data, and access is open only to academic researchers who have ethical approval to receive datasets. QResearch has led many projects such as QFlu, which was used for monitoring and tracking the prevalence of the swine flu outbreak in 2009, reporting to the Health Protection Agency. 6 One of the limitations of QResearch is that although it has links to external databases such as HES, the anonymisation process in compiling the database means that there is no way to identify patients.
THIN is a collaboration between two companies: In Practice Systems Ltd. (INPS), who developed the Vision software used by General Practitioners in the UK to manage patient data, and Cegedim Healthcare Software. 7 THIN data collection started in 2003 and over 500 Vision practices have so far joined the scheme. THIN data currently contain the electronic medical records of 11.1 million patients (3.7 million active patients). This covers 6.2% of the UK population. 7 In addition to the main consultations being recorded, most patient data in THIN are linked to postcode-level area-based socioeconomic, ethnicity and environmental indices. The data are based on the patients' postcodes so that variables at ward level are available. 8 The patient is identified only by a code allocated by the GP system and cannot be identified outside the practice.

AIM
The aim of this study was to conduct a bibliometric review of the research outputs, and the longitudinal growth in the number of publications, that harness these three primary care databases (CPRD, QResearch and THIN) from 2004 until 2013, and also to examine the growth of CPRD on its own from 1993 to 2013 as a case study.

METHODS
To evaluate the impact of the three primary care databases (CPRD, QResearch and THIN), publications using data from each of the databases were extracted and analysed. For CPRD, we extracted publications from 1993 to 2013 using the Science Citation Index of the Thomson Scientific Institute for Scientific Information (Web of Science). Conference abstracts and posters were not included in the data. We used the same method to extract QResearch publications from 2004 to 2013. We obtained data for THIN from the Department of Primary Care and Population Health, University College London (UCL) and verified the data using the Web of Science. The data were provided in Excel format and contained details of the author, the title of the article, the journal in which the article was published, the article reference and the year of publication. Publications provided by UCL dated from 2004 to 2013.
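The tallying step behind this kind of analysis can be illustrated with a short Python sketch. It assumes the publication list has already been exported to rows carrying the fields described above; the field names, the `tally` helper and the sample rows are hypothetical placeholders and are not data or code from this study.

```python
from collections import Counter

# Hypothetical rows mirroring the fields described above (author, title,
# journal, year). These are placeholders, not records from the study.
records = [
    {"author": "Author A", "title": "Title 1", "journal": "Journal X", "year": 2004},
    {"author": "Author B", "title": "Title 2", "journal": "Journal Y", "year": 2004},
    {"author": "Author C", "title": "Title 3", "journal": "Journal X", "year": 2013},
]


def tally(rows, field):
    """Count how many publications share each value of the given field."""
    return Counter(row[field] for row in rows)


per_year = tally(records, "year")        # publications per year (growth)
per_journal = tally(records, "journal")  # publications per journal

for year in sorted(per_year):
    print(year, per_year[year])
print(per_journal.most_common(1))
```

The same helper could be pointed at a speciality-area field to reproduce the grouping by speciality, provided each row carries one.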
The number of times publications were cited, their speciality area and the names of the journals they were published in were then extracted using the Web of Science for all three databases. The speciality areas of publications in all three databases were then categorised into four groups.
Speciality areas were classed as follows for all three databases.

RESULTS
The CPRD database
A total of 1,140 publications categorised into 28 speciality areas were extracted for CPRD. Results are presented in Table 1. Between 2004 and 2013 (9-year period), the total number of publications listed in the CPRD database was 825, which shows significant growth in this period: 72.3% of the CPRD publications were published in the last 9 years. The highest number of publications appeared in the Pharmacoepidemiology and Drug Safety journal, which represents 4.5% (52) of the publications in CPRD. Table 2 shows the journals in which CPRD papers were most frequently published.

QResearch
Seventy-seven articles from studies conducted with QResearch data between 2004 and 2013, categorised into 13 speciality areas, were extracted from the Web of Science. Results are listed in Table 3.

THIN database
Three hundred and ninety-four (394) articles from studies conducted with THIN data between 2004 and 2013, categorised into 32 speciality areas, were extracted. Results are listed in Table 4.
The largest speciality areas with publications from THIN, with >50 publications (Group 1) were the following: Pharmacology (116)

Combined growth
The results presented in Table 5 and Figure 1 show an increase in publications using data from all three databases. Over the 9-year period, publications for THIN and QResearch increased slowly, whereas CPRD publications increased substantially in the last 4 years. The journal most published in, across the three databases, was the Pharmacoepidemiology and Drug Safety journal. As shown in Table 6, both CPRD and THIN had more publications in this journal than in any other.
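The relative contribution of each database can be checked with simple arithmetic. The sketch below uses only the counts reported in this paper (825, 394 and 77 publications for 2004-2013, and 1,140 CPRD publications for 1993-2013); the variable and function names are illustrative. Printed to two decimals, the values sit within 0.1 percentage points of the percentages quoted in the text, which appear to be truncated rather than rounded.

```python
# Publication counts for 2004-2013 as reported in this paper.
counts = {"CPRD": 825, "THIN": 394, "QResearch": 77}
cprd_all_years = 1140  # CPRD publications extracted for 1993-2013

total = sum(counts.values())


def share(part, whole):
    """Return part as a percentage of whole."""
    return 100 * part / whole


# Share of the combined 2004-2013 output contributed by each database.
shares = {name: share(n, total) for name, n in counts.items()}

for name, pct in shares.items():
    print(f"{name}: {pct:.2f}% of {total} publications")

# Proportion of all CPRD publications that fall in 2004-2013.
recent = share(counts["CPRD"], cprd_all_years)
print(f"CPRD 2004-2013 as a share of 1993-2013: {recent:.2f}%")
```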

DISCUSSION
This review looked at the publications using data from the three main UK primary care databases: CPRD, QResearch and THIN. Other databases derived from primary care data do exist, such as the Prescribing Analysis and Cost database and the Quality Management and Analysis System, but they have not been included in this review because they do not contain data based on individual patient records. A more recently developed database, ResearchOne, which is derived from SystmOne, was also excluded because it has only recently begun to be used for research.
This review showed a large increase over time in studies conducted with data from all three databases. The combined number of publications based on data from these three databases from 2004 to 2013 was 1,296. The publications covered a large number of speciality areas, providing evidence of the widespread usage of these data. CPRD data were used most commonly by researchers conducting studies in the Pharmacology speciality area, which reflects the database's importance for pharmacoepidemiological and drug safety research and its management by the UK medicines regulator, the MHRA. This category accounted for 27% of the publications over the 20-year period reviewed for CPRD. QResearch data were used most often for studies in the General and Internal Medicine speciality area, which accounted for 45% of the publications for QResearch. Finally, THIN data were used most often for studies in the Pharmacoepidemiology and Drug Safety speciality area, which accounted for 22% of the publications generated from THIN data. Overall, publications from all three databases showed that researchers from varying speciality areas have a keen interest in the use of EHDs for research, and this interest is likely to increase over time.
Some limitations do exist with this study because the number of publications obtained for each database varied. The CPRD, being the largest of the three, contained more publications, over 90% more when compared to QResearch and 46.1% more than THIN. The THIN database when compared to QResearch contained 82% more publications. Some publications could also have been missed during extraction from the Web of Science. THIN publications were obtained from UCL and checked against the Web of Science rather than extracted from it directly. Also, the blanket categorisations (speciality areas) of the Web of Science were used, which may not always be reflective of a publication's true audience. Further, publications in non-peer-reviewed journals would not be accounted for in this work. Finally, another primary care database, ResearchOne, was not included in this study, as it has only recently been established. In spite of these limitations, the growth in publications derived from these databases is clear. The use of EHDs in this manner remains invaluable and researchers are beginning to realise their benefits. The increase in publications for the combined databases from 74 in 2004 to a cumulative total of 1,296 by 2013 clearly indicates that this method of research is becoming more popular. It also highlights the importance of addressing any limitations that this research method may present. Opportunities to use patient records for secondary uses are also on the increase, with advances in technology allowing routinely collected data to be easily stored and shared.
Studies conducted in the past have also highlighted that EHDs such as CPRD will promote scientific output in many ways. 9 For example, the clinical impact of these databases can be significant: studies based largely on CPRD data, and entirely on database research, have contributed to the evidence for a draft consultation document on management, investigation and referral by the National Institute for Health and Care Excellence. 3, 10 The wealth of information available in EHDs from primary care is invaluable and plays a key role in healthcare improvement in the UK. Studies that have provided useful information for many disease areas have used data available from both primary and secondary databases. The use of the data available in these databases has increased significantly over time and more recently, as seen in this review, has attracted more international interest. This is evident in the authorship of publications from such studies. These studies have provided researchers with valuable tools to help make the best decisions for the care of their patients.
Monitoring bodies have also been beneficiaries of the data contained in these databases because of the surveillance information that they provide. With the ability of CPRD to now link to other databases such as the HES, mortality statistics from ONS, cancer registry data from the National Cancer Intelligence Network, cardiovascular disease registry data from the Myocardial Ischaemia National Audit Project and socioeconomic data at the Lower Layer Super Output Area level, the value of the database is expanding. 11-14
It is also important to consider differences in data quality between the primary care databases that are inherent to the computerised medical records (CMR) systems that contribute to them. Data contained within CPRD have historically been derived mainly from Vision/INPS, only recently taking some data from EMIS. The THIN database contains data exclusively from Vision/INPS, while QResearch uses only EMIS. Vision/INPS is Problem Oriented Medical Record (POMR) software, meaning that it mandates the linkage of medical and drug information or visit consultations to one or more specific coded problems. 15 EMIS, however, is Episode Oriented Medical Record (EOMR) software, meaning that this linkage is not mandatory. It is widely accepted that POMR-based systems reduce intrapatient coding variability, which assists in maintaining higher quality data. 15 This study has demonstrated that data from CPRD and THIN account for the vast majority of research outputs from primary care databases. Given that these two databases primarily use POMR-based CMR data, it may be the case that researchers are aware of the potential data inaccuracies inherent to EOMR systems, such as EMIS, and thus preferentially elect to use data from the CPRD and THIN databases.
The success of EHDs and their impact on research are completely dependent on the quality of the data that have been entered. 16 The accuracy, consistency and completeness of the data they contain have always been a challenge, one that is still seeking a solution. Evidence for the accuracy and validity of the data from clinical databases is mixed and varies between clinical areas, the individual databases and the use of data. 17-19 A recent systematic review illustrated the low usage of EHDs to inform national healthcare guidelines. 20 Across the 25 guidelines included in the review, only 43 CPRD/GPRD studies were referenced, highlighting how electronic health records (EHRs) are a relatively untapped resource for informing evidence-based medicine (EBM). 20 This may in part be due to the widely accepted status quo that randomised controlled trials (RCTs) are the 'gold standard' data source for informing EBM. However, as exemplified in this study and the aforementioned review, research outputs from EHRs are rising, more rapidly so in recent years. It may be important to appreciate that in a primary care setting, 'big data' generated from EHR databases afford a number of advantages over RCT data, including better generalisability to real-life clinical settings and facilitation of increased autonomy over patients' own healthcare data.
To increase the research outputs of EHR data, future research should concentrate on examining differences in the nature of data collection between databases, particularly at the patient level, which will better inform how EHRs can be designed to answer clinical and population health questions. General Practitioners will need to be fully committed to reporting data, to increasing the quality of the data reported and, as gatekeepers of the data, to ensuring that patients' privacy is well protected. 21

CONCLUSION
Based on the review of publications that harness data in CPRD, QResearch and THIN, there is strong evidence that these electronic health care databases are promoting scientific research. The growth in publications has shown that researchers are now conducting more studies using these databases and are beginning to realise their full potential. To continue to promote academic research, General Practitioners will need to continue to provide complete and accurate data; set standards will need to be provided to General Practitioners to encourage enthusiasm and willingness to enter the required data; and public support will need to be encouraged for the continued use of these databases for research.