Introduction
Breast cancer is the second leading cause of cancer deaths in US women and accounts for 30% of new female cancer diagnoses.1 It is the most common cancer across all ethnic groups in the USA, but disparities exist in outcomes.2 While white women have higher incidence rates, black and Hispanic women face higher mortality rates.3 4 Additionally, the incidence is increasing rapidly among Asian/Pacific Islanders and American Indian/Alaska Natives.4
The widespread adoption of electronic health records (EHRs) offers promising opportunities for predicting future events using large amounts of data.5 In particular, unstructured clinical notes contain important information that is often not captured in structured, coded formats.6 For example, patient-reported outcomes from patients with cancer are often not captured in structured EHR fields but are increasingly found in unstructured or semi-structured text within EHRs, which can facilitate translational research and personalised care.7–9 One common approach in clinical text analysis is a rule-based natural language processing (NLP) algorithm that leverages distinct medical keywords from clinical texts.10 11 With advances in neural language modelling, features extracted by this rule-based NLP method can be integrated with neural networks by using word embedding models for feature extraction.12 This approach allows a fully neural network-based pipeline to be built that combines embedding models with supervised learning algorithms.13
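As an illustration of the keyword-driven, rule-based step described above, the following is a minimal sketch and not the algorithm used in this study. The lexicon, the negation cues and the function name `extract_symptoms` are hypothetical and chosen only for demonstration; the negation handling is deliberately simplified.

```python
import re
from typing import Dict, List

# Hypothetical symptom lexicon: illustrative terms only, not the study's keyword list.
SYMPTOM_LEXICON: Dict[str, List[str]] = {
    "fatigue": ["fatigue", "tired", "exhausted"],
    "nausea": ["nausea", "nauseous", "vomiting"],
    "pain": ["pain", "ache", "soreness"],
}

# Simplified negation cues (an assumption; real clinical NLP uses richer rules).
NEGATION_CUES = re.compile(r"\b(no|denies|without|not)\b", re.IGNORECASE)


def extract_symptoms(note: str) -> Dict[str, int]:
    """Return a 0/1 flag per symptom concept found, un-negated, in a note."""
    features = {concept: 0 for concept in SYMPTOM_LEXICON}
    # Split the note into rough sentences so negation stays local.
    for sentence in re.split(r"[.;\n]+", note):
        for concept, terms in SYMPTOM_LEXICON.items():
            pattern = r"\b(" + "|".join(map(re.escape, terms)) + r")\b"
            match = re.search(pattern, sentence, re.IGNORECASE)
            if match is None:
                continue
            # Skip the mention if a negation cue precedes it in the same sentence.
            if NEGATION_CUES.search(sentence[: match.start()]):
                continue
            features[concept] = 1
    return features


if __name__ == "__main__":
    print(extract_symptoms("Patient reports fatigue and mild nausea. Denies pain."))
```

The resulting flags form a simple feature vector per note. In an embedding-based pipeline of the kind mentioned above, the matched keywords would instead be mapped to word embeddings before being passed to a supervised learner.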
In cancer research, incorporating clinical notes into analyses is crucial for capturing comprehensive information on the symptoms and side effects that patients experience,14 because this information can provide insights for monitoring and individualised symptom management. Several studies have investigated breast cancer treatment outcomes using clinical notes and NLP14–16; however, research that specifically aims to capture treatment side effects and patient-reported outcomes in patients with breast cancer from under-represented populations remains sparse. Addressing this research gap is important because these populations face unique health disparities that affect treatment outcomes and patient care. Understanding these specific challenges and barriers enables the development of targeted interventions to mitigate disparities and enhance health outcomes. There is a clear need for an automated tool that captures symptoms and side effects from clinical notes, enabling accurate symptom management and tailored nursing care planning for patients from under-represented populations.
The goal of this study was to develop NLP algorithms that automate the extraction of knowledge about patient-centred breast cancer treatment outcomes from clinical notes, with the aim of gaining insights to improve care for patients from under-represented populations. Specifically, we aimed to compare the effectiveness of these algorithms to provide scientific evidence for their use in the care of patients with breast cancer from under-represented populations.