Learning health systems need to bridge the ‘two cultures’ of clinical informatics and data science

Background: UK health research policy and plans for population health management are predicated upon transformative knowledge discovery from operational ‘Big Data’. Learning health systems require not only data, but feedback loops of knowledge into changed practice. This depends on knowledge management and application, which in turn depends upon effective system design and implementation. Biomedical informatics is the interdisciplinary field at the intersection of health science, social science and information science and technology that spans this entire scope.
Issues: In the UK, the separate worlds of health data science (bioinformatics, ‘Big Data’) and effective healthcare system design and implementation (clinical informatics, ‘Digital Health’) have operated as ‘two cultures’. Much National Health Service and social care data is of very poor quality. Substantial research funding is wasted on ‘data cleansing’ or by producing very weak evidence. There is not yet a sufficiently powerful professional community or evidence base of best practice to influence the practitioner community or the digital health industry.
Recommendation: The UK needs increased clinical informatics research and education capacity and capability at much greater scale and ambition to be able to meet policy expectations, address the fundamental gaps in the discipline’s evidence base and mitigate the absence of regulation. Independent evaluation of digital health interventions should be the norm, not the exception.
Conclusions: Policy makers and research funders need to acknowledge the existing gap between the ‘two cultures’ and recognise that the full social and economic benefits of digital health and data science can only be realised by accepting the interdisciplinary nature of biomedical informatics and supporting a significant expansion of clinical informatics capacity and capability.


INTRODUCTION
C. P. Snow famously characterised the gulf between the 'two cultures' of science and the humanities as a serious barrier to progress. 1 In our field, at least in the UK, there appears to be an analogous gap between the policy and funding programmes of data science (bioinformatics, 'Big Data') and effective system design and implementation (clinical informatics, 'Digital Health').
Data science in healthcare is subject to strong regulatory and ethical controls, minimum educational qualifications, well-established methodologies, mandatory professional accreditation and evidence-based independent scrutiny. By contrast, 'Digital Health' has minimal substantive regulation or ethical foundation, no specified educational requirements, weak methodologies, a contested evidence base and negligible peer scrutiny. Yet, the 'Big Data' vision is to base its science on the data routinely produced by digital health systems.
This paper is focussed on the UK context. We bring together experience from frontline National Health Service (NHS) clinical informatics and from epidemiological research to present the operational realities of health data quality and the implications for data science. We argue that to build a successful learning health system, data science and clinical informatics should be seen as two parts of the same discipline with a common mission. We commend the work in progress to bridge this cultural divide, but propose that the UK needs to expand its clinical informatics research and education capacity and capability at much greater scale to address the substantial gaps in the evidence base and to realise the anticipated societal aims.

ROUTINE CLINICAL DATA IS HIGHLY PROBLEMATIC
Data quality in the frontline health and care system faces a dual challenge in our current environment. The first is the lack of standard data sets and of adoption of reference values, though work is progressing in this area. 2 The second is poor data quality due to unreliable adherence to process 3 and poor system usability. 4 Embarking on the implementation of clinical terminologies, including Systematised Nomenclature of Medicine Clinical Terms (SNOMED CT) and Logical Observation Identifiers Names and Codes (LOINC), shows us that our historical environment and the complexity of these standards always cause long debate and demand significant implementation effort. So far, little progress has been made even by the 'Global Digital Exemplars' 5 in implementing SNOMED CT in any depth. Further complexity is introduced when interoperating with other care settings such as social care and mental health. General practitioner (GP) data is far from consistent. Different practices use different fields in different ways and usage varies from clinician to clinician. Historically, the system has not forced users to standardise their recording or practice. This results in varying data quality between GP practices, which affects not just epidemiological studies but also operational processes. Failure to enter accurate data into health and care systems occurs for a number of reasons including poor usability, overly complex systems, lack of data input logic to check errors and poor business change leadership.
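To make 'data input logic to check errors' concrete, the following is a minimal sketch of point-of-entry validation, not a description of any deployed NHS system. The `Observation` type, the `RULES` table, the placeholder codes (`SBP`, `WEIGHT`) and the plausibility ranges are all assumptions introduced for illustration; they are not real SNOMED CT or LOINC codes and the ranges are not clinical guidance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """A single coded measurement as entered by a user (hypothetical shape)."""
    code: str
    value: float
    unit: str


# Hypothetical local rules: code -> (expected unit, plausible low, plausible high).
# Codes and ranges are placeholders for illustration only.
RULES = {
    "SBP": ("mmHg", 50.0, 300.0),
    "WEIGHT": ("kg", 0.2, 400.0),
}


def validate(obs, rules=RULES):
    """Return a list of problems found in one entry; an empty list means it passed."""
    rule = rules.get(obs.code)
    if rule is None:
        # Without a rule we cannot check unit or range, so report and stop.
        return [f"unknown code: {obs.code}"]

    expected_unit, low, high = rule
    problems = []
    if obs.unit != expected_unit:
        problems.append(f"unit {obs.unit!r} does not match expected {expected_unit!r}")
    # Written as 'not (low <= value <= high)' so that NaN is also rejected.
    if not (low <= obs.value <= high):
        problems.append(f"value {obs.value} outside plausible range {low}-{high}")
    return problems


if __name__ == "__main__":
    entry = Observation(code="SBP", value=1200.0, unit="mmHg")
    for problem in validate(entry):
        print(problem)
```

Checks of this kind catch only implausible or malformed entries; they cannot detect a plausible value recorded against the wrong code, which is why usability and process adherence remain the larger part of the problem.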
Most epidemiological research with routine clinical data uses coded data, rather than free text. Thus, there is over-reliance on the codes used during clinical consultations. A national evaluation of code usage in primary care in Scotland, taking allergy as an example, found that across more than 2 million consultations over 7 years, 50% of usage came from eight codes used to report for an incentive programme for GPs, 95% of usage came from 10% of the 352 allergy codes (n = 36) and 21% of codes were never used. 6 A systematic review found variations in completeness (66%-96%) and correctness of morbidity recording across disease areas. 7 For instance, the quality of recording in primary care is better for diabetes than for asthma. There are also changes in case definition and diagnostic criteria across disease areas over time, which are seldom mentioned in the databases. A recent primary care study found that the choice of codes can make a difference to outcome measures: for example, the incidence rate was higher when non-diagnostic codes were used than when diagnostic codes were used. 8 Coding also varies across GP practices: when practices with poor quality of recording were included in the analysis, incidence rates and trends differed significantly, with a lower incidence rate and decreasing trends. This study highlights the effect of miscoding and misclassification. It also shows that when data are missing, they might not be missing at random. Furthermore, codes needed during a consultation may be unavailable, so that the information is recorded in free text instead. All these salient features of coded data are often ignored when interrogating patient databases for research, which could lead to erroneous conclusions. No amount of data cleansing can resolve the inherent discrepancies in coded data.
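The kind of usage concentration reported above (a handful of codes accounting for most recording, and a fraction of the dictionary never used) can be checked on any coded extract before analysis. The following is a minimal sketch under stated assumptions, not the method of the cited evaluation: the function name `code_usage_profile`, its parameters and the toy data are hypothetical.

```python
from collections import Counter


def code_usage_profile(recorded_codes, dictionary, top_k=8, target_share=0.95):
    """Summarise how concentrated code usage is within a coding dictionary.

    recorded_codes: iterable of codes as entered in consultations (hypothetical extract).
    dictionary:     all codes available for the clinical area of interest.
    top_k:          number of most-used codes whose combined share is reported.
    target_share:   share of usage for which the minimum number of codes is reported.
    """
    dictionary = set(dictionary)
    if not dictionary:
        raise ValueError("dictionary must not be empty")

    # Count only entries that belong to the dictionary under study.
    counts = Counter(code for code in recorded_codes if code in dictionary)
    total = sum(counts.values())
    ranked = [n for _, n in counts.most_common()]  # usage counts, highest first

    top_k_share = sum(ranked[:top_k]) / total if total else 0.0

    # Smallest number of codes whose cumulative usage reaches the target share.
    codes_for_target = 0
    running = 0
    for n in ranked:
        if running >= target_share * total:
            break
        running += n
        codes_for_target += 1

    return {
        "total_uses": total,
        "top_k_share": top_k_share,
        "codes_for_target": codes_for_target,
        "never_used_fraction": (len(dictionary) - len(counts)) / len(dictionary),
    }


if __name__ == "__main__":
    # Toy data: five available codes, three of them ever used.
    available = ["A", "B", "C", "D", "E"]
    entered = ["A"] * 6 + ["B"] * 3 + ["C"]
    print(code_usage_profile(entered, available, top_k=1, target_share=0.8))
```

A profile like this describes recording behaviour only; it says nothing about whether the heavily used codes were clinically correct, so it is a screening step rather than a substitute for validation against case notes.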
There could be confounding by indication or severity, for example, when severely ill patients receive more intensive treatment yet have poorer outcomes than other patients. 9 Clinical databases comprise only patients who attended healthcare services. A UK-wide study showed a difference in asthma prevalence between population surveys and clinical databases. 10 Besides the quality of coded data, clinical databases may lack key variables, since they were not designed primarily for research, for example, the absence of diagnoses in outpatient hospital attendances.
Furthermore, significant variance is seen in the success of electronic patient record deployments from the same commercial vendor in different localities. For example, the Arch Collaborative from KLAS Research 11 shows variance in all aspects of success, including data quality, across deployments by Cerner, Epic and Allscripts. US experience has shown a particular risk from 'copy and paste' errors. 12

BIOMEDICAL INFORMATICS IS AN INTERDISCIPLINARY FIELD WITH A COMMON MISSION
The 'two cultures' are both embraced by the widely adopted American Medical Informatics Association (AMIA) definition of biomedical informatics as: 'the interdisciplinary field that studies and pursues the effective uses of biomedical data, information, and knowledge for scientific inquiry, problem solving and decision making, motivated by efforts to improve human health'. 13 Biomedical informatics can be visualised as the intersection of health science, social science and information science and technology (Figure 1, reproduced with permission from AMIA 14).
In this definition, biomedical informatics has sub-fields such as health informatics (comprising clinical and public health informatics) and bioinformatics (also called computational biology). Whereas bioinformatics deals with data science, clinical informatics 'covers the practice of informatics in healthcare' (emphasis added). Therefore, getting clinical informatics right is more about people than it is about technology or data. As Coiera said, informatics is 'as much about computers as cardiology is about stethoscopes'. 15 Of course, biomedical informatics must be aimed at a grand outcome, the betterment of health, rather than a contained body of knowledge or an abstract philosophy. The sole axis of interest is whether or not health is ultimately improved.
This has a number of implications. In pursuit of a better health outcome, a clinician may employ nuclear physics or big data analytics. Similarly, an informatician needs to be multi-disciplinary and citizen-centred as they play their part in a shared mission. Maintaining a system-wide view of outcomes is an ethical imperative for everyone involved, from research to application. 16 Treating the 'two cultures' within biomedical informatics as separate disciplines, rather than as a shared mission, may be professionally attractive and tractable for funders and policymakers, but risks maintaining silos and working against the public interest. Instead, biomedical informatics researchers and practitioners, including clinicians, need to be part of a single professional organism made of interlocking professional communities, able to work together in a single systemic view of citizen benefit and harm, and able to implement the best scientific, engineering and medical disciplines available. To do otherwise is simply unethical.
This ethical perspective opens up an exciting vista of fruitful, high impact, applied research and professional practice. Global health public policy is united in its view that digital systems, data and digital transformation are vital tools for the advancement of health and care. Learning health systems 17 require not only the Big Data 'engine' but also the feedback loop of knowledge into changed practice. This crucially depends on knowledge management and application, which in turn depends on effective system design and implementation: clinical informatics. Figure 2 (adapted from Rouse et al. 18 originally based on ONC 19 ) illustrates how much of the learning health system depends on clinical informatics and how much on data science.

STEPS TOWARDS CONVERGENCE
There are several encouraging steps towards convergence. We highlight and commend several excellent initiatives that are taking a collaborative and aligned approach:
In addition, some of the Academic Health Science Networks 24 are helping to bring together the practitioner and research communities in both data science and clinical informatics initiatives, and the 'Global Digital Exemplars' 5 are to participate in a national evaluation programme. 25 The invitation to participate in the recently launched 'Local health and care record exemplar' programme 26 includes several references to 'research', but unfortunately this seems to be solely the 'Big Data' aspect, not the clinical informatics research needed to improve frontline usage and data quality.
One focus of the NHS Digital Academy (Figure 3) will be to unpick the currently secret recipe for deriving user satisfaction, productivity and good quality data from clinical systems. There is a significant focus on user-centred design, interoperability and healthcare system standards within the modules. The aim is to ensure that the cohort of 'digital leaders' understands the role of the end-to-end technology, from data standards to usability, in achieving good data for direct care and research.

EXPANDING CLINICAL INFORMATICS RESEARCH AND EDUCATION CAPACITY AND CAPABILITY
However, we suggest that the UK needs increased clinical informatics research and education capacity and capability at much greater scale and ambition to be able to address the fundamental gaps in the discipline's evidence base and mitigate the absence of regulation. 4 Numerous basic clinical informatics research questions remain to be satisfactorily addressed, 27 including in the fields of:
This realisation has led to the 'Evidence-Based Health Informatics' movement, which is well described in an open access textbook. 43 The way to build our discipline's evidence base is to identify and test relevant theories using rigorous evaluation studies. 44 A key measure that would bring the 'two cultures' of data science and clinical informatics closer is to make independent evaluation of digital health interventions the norm, not the exception. 45, 46 These studies need to be carried out by independent evaluators, not system developers, because there is clear systematic review evidence that even randomised controlled trials (RCTs) carried out by system developers are three times as likely to generate positive results as RCTs carried out by independent evaluators. 47

CONCLUSIONS
We have highlighted serious issues with the quality of routine data and how they can be addressed beyond nugatory 'data cleansing'. We submit that policy makers and research funders need to acknowledge the existing gap between the 'two cultures' and recognise that the full social and economic benefits of digital health and data science can only be realised by accepting the interdisciplinary nature of biomedical informatics and supporting a significant expansion of clinical informatics capacity and capability.