Discussion
Nine years of colorectal cancer data were successfully collected across three pilot sites and collated in a centralised research database as part of the NIHR HIC. This process demonstrated that it is possible to create an automated, data-rich, longitudinal, research-focused database from routinely collected health data. In doing so, this paper highlights the potential of this database in future colorectal cancer research. To our knowledge, this is the first database of its type.
NLP of histopathology, imaging and endoscopy reports has the potential to create a depth of data beyond coding, synoptic reporting and manually entered data. Several algorithms were created to extract key data points and were successfully implemented for different source data. This paper presents only one Trust’s NLP results; however, the algorithms have been shared and implemented across the other pilot sites. This ability to share complex validated algorithms has the potential to take the database beyond common epidemiological or research parameters, especially when using an open source philosophy. In this context, open source facilitates sharing, collaboration, personalisation and rapid advancement of algorithm development and thus data processing.
The pathway plots developed are valuable in identifying groupings of patients with similar pathways to aid future analysis. Group A (patients with a diagnosis of colon cancer who had a pretreatment staging scan followed by surgical resection of the cancer) and Group B (patients with a diagnosis of rectal cancer who had a pretreatment staging scan and neoadjuvant treatment, followed by surgical resection and then further treatment, either adjuvant or for recurrence) provide apt examples. Group A plots provided a visual indication of the proportion of the group who had adjuvant treatment, the completeness of the follow-up regime and the incidence of disease recurrence, while Group B plots provided insight into temporal variability in adjuvant treatment.
These particular groups were selected to illustrate the potential of this form of representation of the data, rather than to address particular research questions. The plots also clarified issues with the data that needed to be addressed. For example, several pathways recorded treatment or even recurrence well before the initial diagnosis. The plots also made it possible to identify certain groups within the data set, such as patients primarily managed at a peripheral hospital before referral to a specialist tertiary unit, whose records need further attention and processing before the data are used for research analysis.
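The ordering problem described above (treatment or recurrence dated before the initial diagnosis) lends itself to a simple automated check. The following is a minimal sketch only, not the method used in this study: the event labels, the `(kind, date)` tuple layout, the `tolerance_days` parameter and the function name are all hypothetical, chosen for illustration.

```python
from datetime import date

# Hypothetical event labels; the real database schema is not described in the text.
CHECKED_EVENTS = {"surgery", "chemotherapy", "radiotherapy", "recurrence"}


def events_before_diagnosis(pathway, tolerance_days=0):
    """Return treatment or recurrence events dated before the first diagnosis.

    `pathway` is a list of (kind, date) tuples for one patient.
    `tolerance_days` allows a small margin for same-week recording differences.
    """
    diagnosis_dates = [d for kind, d in pathway if kind == "diagnosis"]
    if not diagnosis_dates:
        return []  # nothing to compare against
    first_diagnosis = min(diagnosis_dates)
    flagged = []
    for kind, event_date in pathway:
        days_early = (first_diagnosis - event_date).days
        if kind in CHECKED_EVENTS and days_early > tolerance_days:
            flagged.append((kind, event_date))
    return flagged


example_pathway = [
    ("diagnosis", date(2015, 3, 1)),
    ("surgery", date(2014, 11, 20)),      # recorded before diagnosis
    ("staging_scan", date(2015, 3, 10)),  # not a checked event type
    ("recurrence", date(2016, 1, 5)),
]
flagged_events = events_before_diagnosis(example_pathway)
```

A flagged pathway is not necessarily wrong. A patient first managed at a peripheral hospital may legitimately have earlier events, so a check like this would route records for review rather than exclude them.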
Although not specifically explored in this pilot study, such data capture could be used to explore variation in practice. For example, there is significant variation in the use of neoadjuvant treatment across the UK, especially for higher rectal tumours.25 The breadth of this database has the potential to identify variance in greater detail, and to provide insight into outcomes across various patient cohorts.
The processes developed for this pilot study will be applied to generate a much larger database as other centres contribute data and the time period is extended. Further, the data set can be expanded and adapted to match the requirements of research questions. This does not replace other data collection programmes such as the NBOCA,5 which above all plays an important role in the quality of colorectal cancer care. The attributes of this data set, however, provide a unique research opportunity to investigate novel strategies in the management of colorectal cancer.
Alongside the research potential illustrated by this study, the process also highlighted challenges in such data extraction. Trusts were readily able to obtain inpatient data points; however, outpatient data were more difficult to capture, which explains some of the variables still missing in the results (table 1). This highlights the importance of greater collaboration across inpatient and outpatient facilities, and calls for a greater focus on this aspect of data extraction in the longer term. Further, several therapy points, including neoadjuvant, surgical and adjuvant therapy, were missing when treatments were provided at facilities outside the central Trust, reflecting the centralisation of certain services at a regional level in the NHS. The database will ultimately need to be expanded to include more centres across the UK to maximise the research potential.
Although NLP was successful in capturing more complex data components, it currently has a low capture rate. For example, T staging has not yet been identified in some 41% of patients for whom NLP was undertaken. However, when analysis was restricted to patients who had surgical resection recorded at the site, T stage was obtained for 95% of patients. The algorithms have only been applied to imaging and histopathology reports so far, and require these reports to specifically mention T stage. It is expected that data capture will increase as the algorithm is improved, and as it is applied across a wider range of data sources, for example, including multidisciplinary meeting reports and operative notes.
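The dependence on reports that explicitly mention T stage can be illustrated with a small rule-based extractor. This is a sketch under stated assumptions: the text does not describe how the study's algorithms work internally, so the regular expression, the function name `extract_t_stages` and the example report strings below are hypothetical.

```python
import re

# Hypothetical pattern for explicit T stage mentions such as "pT3", "ypT2", "T4b".
T_STAGE = re.compile(
    r"(?<![A-Za-z0-9])"                  # not glued to a preceding word or number
    r"(?P<prefix>y?[cp]?)"               # optional y (post-therapy), c/p (clinical/pathological)
    r"T(?P<stage>[0-4][a-d]?|is|[xX])"   # T0-T4 with optional subdivision, Tis, Tx
    r"(?![a-z0-9])"                      # reject words such as "Tissue"
    r"(?![\s-]*(?:weighted|signal))"     # skip MRI sequence names such as "T2-weighted"
)


def extract_t_stages(report_text):
    """Return every explicit T stage mention in a report, in order of appearance."""
    return ["T" + m.group("stage") for m in T_STAGE.finditer(report_text)]


pathology_hits = extract_t_stages("Adenocarcinoma, pT3 N1 M0.")
imaging_hits = extract_t_stages("T2-weighted images show a mass, staged as T4b.")
no_hits = extract_t_stages("Tissue fragments only, no staging given.")
```

A report that describes depth of invasion in words, without a TNM code, yields no match here. That is consistent with the lower capture rate when reports do not state T stage directly.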
The database is only as accurate as the data entered. While it is possible to build in simple validation checks to exclude or correct nonsensical values, for example in age or BMI, more complex issues such as errors in reporting that result in misclassification will not be detected at the ‘big data’ level. Such errors may be detected in smaller scale research projects where original data are scrutinised, but at the larger scale, the assumption is that the incidence of such errors will be relatively small and will not significantly impact the overall results. This is, however, a limitation of the database and a focus for optimisation as the database continues to be developed.
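A simple validation check of the kind mentioned above might look like the following sketch. The field names and plausibility bounds are assumptions made for illustration and are not taken from the study.

```python
# Assumed plausibility bounds; real limits would be agreed with clinicians.
PLAUSIBLE_RANGES = {
    "age_years": (0, 120),
    "bmi": (10.0, 80.0),
}


def validate_record(record):
    """Blank out values outside plausible bounds and report which fields were rejected."""
    cleaned = dict(record)  # leave the caller's record untouched
    rejected = []
    for field, (low, high) in PLAUSIBLE_RANGES.items():
        value = cleaned.get(field)
        if value is not None and not (low <= value <= high):
            cleaned[field] = None
            rejected.append(field)
    return cleaned, rejected


cleaned_record, rejected_fields = validate_record({"age_years": 67, "bmi": 250.0})
```

A range check of this kind catches a BMI of 250 but not a BMI of 25 entered for a patient whose true value is 35. This is the class of plausible-but-wrong error that remains undetected at scale.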
In summary, automated collation of routinely collected clinical data not only promises to alleviate administrative burden but also allows for expansive data sets that capture a theoretically unlimited and expanding number of touchpoints for every patient. Ultimately, research using catalogued, comparable, comprehensive and longitudinal patient data will inform clinical practice and aid governing bodies in the development of colorectal cancer care pathways to reduce disparities and improve overall patient outcomes.