Introduction
Data, particularly ‘big data’, are increasingly being used for research in health.1 2 Currently, research largely involves developing a protocol, then collecting carefully curated data and analysing them. However, increasing attention is turning to the potential of interrogating pre-existing large data sets.3 These come with a particular set of challenges. The data are usually collected for purposes other than research4 and potentially from different sources. Issues such as coding, data quality and completeness must all be addressed with care and maturity of thought.
As a specific subset, the potential power of primary care data has long been identified.5 In many settings, primary care is synonymous with general practice (family practice). Most of the Australian population see a general practitioner (GP) at least once a year, and can visit as many practices as they wish.6 By contrast, hospital databases contain only a limited subset of patient encounters, which may be separated by many years.
Therefore, when looking at true population health issues, pooled general practice data should be a key resource.7 Such data have multiple uses: individual GPs use them for direct patient care, whereas pooled data can also serve other purposes such as clinical governance and population health.
General practice care in Australia is funded by universal, government-provided health insurance called Medicare, supported by a publicly funded hospital network through the states. Referrals to private specialists must be made through a GP, who acts as a gatekeeper to specialist care.8 Australian general practice is the primary point of contact for the population: 90% of the population see a GP each year. GPs are also almost universally computerised, and have been for over 10 years.9 Therefore, the largest and most comprehensive electronic database of the population sits on the 8000 servers that service these independent GP practices.
This is not just a theoretical exercise. The link between coding and care has been demonstrated in other settings, usually around a specific diagnosis.10 11 These projects coded the complete content of GP records from a subset of practices to SNOMED Clinical Terms (SNOMED CT, hereafter SCT). SCT has been the endorsed and recommended Australian standard for coding in clinical systems since 2005.12 It is an increasingly global standard, with over 39 member countries in the SCT consortium. Despite this, its local adoption remains an ongoing challenge. In Australia, only a small number of implemented systems are mature enough to allow full integration of SCT.13 Australia has its own extension, SNOMED-CT-AU, and its own medicines terminology extension, the Australian Medicines Terminology. Hospital systems, for the most part, still use the International Classification of Diseases rather than a clinical terminology.
Australian general practice also lacks a ‘coding culture’14 (unlike in the UK and USA). The two main clinical systems, for instance, still use their own proprietary coding terms. There is no published mapping to recommended international standards, so each study must perform its own mapping and validation. Nor is coding enforced in any way: an Australian clinician can (and often does) write free text into the diagnosis, reason for encounter, or indication for prescribing field. There are also no professionally led or large-scale attempts to minimise the variability in the way clinicians enter data.
Given this background, this paper outlines an approach to dealing with these issues15 and to developing data suitable for a broad range of uses, not just direct patient care.