Original Research

ChatGPT in glioma adjuvant therapy decision making: ready to assume the role of a doctor in the tumour board?

Abstract

Objective To evaluate ChatGPT’s performance in brain glioma adjuvant therapy decision-making.

Methods We randomly selected 10 patients with brain gliomas discussed at our institution’s central nervous system tumour board (CNS TB). Patients’ clinical status, surgical outcome, textual imaging information and immuno-pathology results were provided to ChatGPT V.3.5 and seven CNS tumour experts. The chatbot was asked to give the adjuvant treatment choice and the regimen while considering the patient’s functional status. The experts rated the artificial intelligence-based recommendations from 0 (complete disagreement) to 10 (complete agreement). The intraclass correlation coefficient (ICC) was used to measure inter-rater agreement.

Results Eight patients (80%) met the criteria for glioblastoma and two (20%) were low-grade gliomas. The experts rated the quality of ChatGPT recommendations as poor for diagnosis (median 3, IQR 1–7.8, ICC 0.9, 95% CI 0.7 to 1.0), good for treatment recommendation (7, IQR 6–8, ICC 0.8, 95% CI 0.4 to 0.9), good for therapy regimen (7, IQR 4–8, ICC 0.8, 95% CI 0.5 to 0.9), moderate for functional status consideration (6, IQR 1–7, ICC 0.7, 95% CI 0.3 to 0.9) and moderate for overall agreement with the recommendations (5, IQR 3–7, ICC 0.7, 95% CI 0.3 to 0.9). No differences were observed between the glioblastomas and low-grade glioma ratings.

Conclusions ChatGPT performed poorly in classifying glioma types but was good for adjuvant treatment recommendations as evaluated by CNS TB experts. Even though ChatGPT lacks the precision to replace expert opinion, it may serve as a promising supplemental tool within a human-in-the-loop approach.

What is already known on this topic

  • Advanced artificial intelligence (AI) language models, such as ChatGPT, are quickly evolving and have the potential to incorporate multi-modal medical information and assist with complicated medical decision-making.

What this study adds

  • The use of AI in making therapeutic decisions for central nervous system tumours has not been fully explored. This study aims to assess the effectiveness of AI compared with expert recommendations in aiding complex brain tumour decision-making, providing valuable insights into the potential and limitations of AI in this field.

  • This study shows that an AI language model was successful in suggesting adjuvant treatment plans for glioma patients. However, the model had difficulty accurately identifying glioma subtypes and only achieved moderate success in taking patients’ functional status into account when making recommendations.

How this study might affect research, practice or policy

  • While AI language models like ChatGPT cannot currently replace the opinions of medical experts, they may serve as a useful supplementary tool in aiding complex brain tumour decisions when used as part of a human-in-the-loop approach.

Introduction

Artificial intelligence (AI) is attracting a lot of interest in the present era of personalised medicine.1–3 Since novel drug discovery, surgical robotics or complex interdisciplinary oncological therapy decisions are time-consuming and resource-demanding, innovative AI-based language models may enhance the performance of healthcare ecosystems.4–6 Recently, a novel general-purpose AI chatbot, called ChatGPT-3.5 (Generative Pretrained Transformer 3.5), was launched, spurring mixed reactions of curiosity and scepticism from the scientific community.7–11

ChatGPT is an AI-powered chat interface built on a language model that uses unsupervised learning to generate human-like text. It allows humans to satisfy their curiosity by engaging in a dialogue using various questions and prompts.12 Although the chatbot was not designed to deliver medical knowledge, it allows one to chat on specific medical topics and provides answers with a tone of authority, as if one were interacting with an expert. Nevertheless, ChatGPT has some limitations: its knowledge is restricted to online data available until September 2021, and it sometimes provides incorrect although plausible-sounding answers,13 possibly limiting its use in medical settings.

Neuro-oncology has significantly evolved in parallel with new research advances.14 For instance, the treatment of high-grade gliomas has been extensively studied for the last 20 years to offer longer survival for affected individuals.15 16 Furthermore, consideration of the patient’s clinical status, age and comorbidities has been included in novel trials to optimise treatment protocols.17 Low-grade gliomas, which account for approximately 20% of all gliomas, are even more heterogeneous, and adjuvant treatment is based on their complex molecular profile.18–20 In order to deliver the best treatment strategies for glioma patients, central nervous system (CNS) tumour boards (TB) have emerged, involving a multidisciplinary team composed of neurosurgeons, oncologists, neurologists, pathologists, radiation oncologists and neuroradiologists.21 TBs, however, mobilise an extensive amount of resources, which might make them challenging to implement in every setting. In this regard, AI-assisted decision-making could prove helpful in delivering personalised treatment strategies.22

Given the promise of AI in using vast amounts of knowledge to synthesise information and provide recommendations, we investigated whether ChatGPT had a role to play in CNS TB regarding glioma patient adjuvant therapy decision-making. We hypothesised that ChatGPT would perform as well as CNS TB experts in providing glioma subtype diagnosis and adjuvant treatment strategy in line with the current guidelines.23

Methods

Patients’ selection

We randomly selected 10 glioma cases from our institutional CNS TB registry from 2014 to 2022. During this period a total of 215 brain glioma cases were evaluated. Inclusion criteria were: (1) new onset or recurrent supratentorial glioma, (2) surgical treatment was performed (removal or biopsy), (3) CNS TB recommendation and (4) informed consent was available. Exclusion criteria were: (1) a presence of brain metastasis, (2) extra-axial tumours and (3) glioma involving the brainstem or the spinal cord.
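The random draw described above can be sketched as follows. This is a minimal illustration in Python, not the authors' script: the Statistics section states only that R's floor(runif) was used, so the function name `select_cases`, the seed value and the choice to sample without replacement are all assumptions made for the example.

```python
import random


def select_cases(n_registry: int, n_select: int, seed: int) -> list:
    """Draw n_select distinct 1-based registry indices uniformly at random."""
    if not 0 <= n_select <= n_registry:
        raise ValueError("n_select must be between 0 and n_registry")
    rng = random.Random(seed)  # fixed seed (assumed) makes the draw reproducible
    # sample() draws without replacement, so no case can be picked twice
    return sorted(rng.sample(range(1, n_registry + 1), n_select))


# 215 registry cases and 10 selections are the figures reported in the text;
# the seed is arbitrary and purely illustrative.
selected = select_cases(n_registry=215, n_select=10, seed=2023)
```

Sampling without replacement avoids the duplicate indices that repeated independent uniform draws can produce; whether the original procedure handled duplicates this way is not stated.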

Dialogue with ChatGPT

Patients’ electronic records were retrospectively reviewed. From 1 February to 14 February 2023, 10 case summaries were presented to ChatGPT (V.3.5, February 2023). A separate chat session was used for each case, which was presented concisely with information on age, sex, medical history, symptoms, textual imaging results, surgical outcome, tumour resection extent, and histopathological and molecular examination results. Neither a diagnosis nor patient identification information was provided to ChatGPT. The questionnaire was modelled after a real-life TB panel discussion format. ChatGPT was asked two questions: (1) ‘what is the best adjuvant treatment?’, (2) ‘what would be the regimen of radiotherapy and chemotherapy for this patient?’. ChatGPT’s answers were collected. The same case information and a complete chat transcript were provided to the experts (online supplemental material 1). As a quality control measure, we asked the chatbot to provide the presumed diagnosis, which was consistent with its initial spontaneous response for each case.

CNS TB and experts’ selection

Our institutional CNS TB is composed of neuro-oncologists, radio-oncologists, radiologists, neurosurgeons, neuropathologists and neurologists. We considered our institutional CNS TB as a reference, as its decisions are evidence-based and are supported by a multidisciplinary consensus. Every patient with CNS oncological disease admitted to our institution is presented at this multidisciplinary meeting. For the purpose of this study, five experts from our CNS TB (two neuropathologists, one neurosurgeon, one radio-oncologist and one oncologist) and two external independent experts (two neurosurgeons from Europe and North America) evaluated ChatGPT’s output with regard to the formal decision of the CNS TB.

Studied parameters

The experts were asked to rate ChatGPT’s answers for each of the 10 cases, with the CNS TB decisions used as the gold standard. Ratings were given on a scale from 0 to 10, where ‘0’ indicated complete disagreement, ‘10’ indicated complete agreement and ‘5’ a neutral answer (‘neither agreement nor disagreement’). Using a questionnaire, the experts rated ChatGPT’s performance regarding the diagnosis of the specific glioma type, the adjuvant treatment recommendation, the adjuvant therapy regimen, how well the chatbot integrated the patient’s overall functional status into the decision-making and the overall quality of the recommendations provided. The experts were also asked to provide their opinion on the possible place of AI in interdisciplinary CNS tumour decision-making. Figure 1 summarises the study workflow. Online supplemental material 2 presents the questions asked to the experts. Finally, the agreement between experts was evaluated.

Figure 1

Summary of study workflow. Ten patients were randomly selected from our institutional central nervous system (CNS) tumour board (TB) registry. All cases received state-of-the-art preoperative and postoperative glioma workups. A summary of each anonymised case, including clinical, textual imaging and immunohistological findings, was presented to ChatGPT, as it would be at the CNS TB. Seven experts compared ChatGPT’s output with the TB recommendations. The results represent the median expert rating with the IQR. The figure was created with BioRender.com.

Statistics

We used R V.3.6.1 for the statistical analysis. The randomisation process was performed using the function floor(runif). Ordinal variables were presented as median with IQR and were compared using a Mann-Whitney U test when appropriate. An expert rating score of 0–3 was considered poor, 4–6 moderate, 7–8 good and 9–10 excellent. The intraclass correlation coefficient (ICC) was used to evaluate the agreement between the experts (two-way random effects, absolute agreement, multiple raters average, ICC (2,k)).24 An ICC <0.5 was considered as poor, ≥0.5 and <0.75 as moderate, ≥0.75 and <0.9 as good and ≥0.9 as excellent agreement.24 Hypothesis testing was considered significant at p value <0.05 (two-sided).
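The rating categories and the ICC (2,k) described above can be sketched in a few lines. The following Python code is an illustration only (the analysis itself was run in R): the function names are hypothetical, the ratings matrix is assumed to be complete (one row per case, one column per rater, no missing values), and the handling of non-integer scores such as a median of 3.5 is an assumption, since the text defines the categories for whole-number scores.

```python
def rating_category(score):
    """Label a 0-10 rating: 0-3 poor, 4-6 moderate, 7-8 good, 9-10 excellent.

    Assumption: a non-integer score falls into the next band up once it
    exceeds the upper bound of a band (eg, 3.5 is labelled 'moderate').
    """
    if not 0 <= score <= 10:
        raise ValueError("score must lie between 0 and 10")
    if score <= 3:
        return "poor"
    if score <= 6:
        return "moderate"
    if score <= 8:
        return "good"
    return "excellent"


def icc_category(icc):
    """Label an ICC using the cut-offs stated in the text."""
    if icc < 0.5:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc < 0.9:
        return "good"
    return "excellent"


def icc2k(ratings):
    """ICC(2,k): two-way random effects, absolute agreement, average of k raters.

    ratings: list of n rows (cases), each holding k scores (raters).
    Uses the two-way ANOVA mean squares:
        ICC(2,k) = (MSR - MSE) / (MSR + (MSC - MSE) / n)
    """
    n, k = len(ratings), len(ratings[0])
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a complete matrix with >= 2 cases and >= 2 raters")
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between cases
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols                  # residual
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_error / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (msc - mse) / n)
```

With these cut-offs, an ICC of 0.9 is labelled excellent, 0.8 good and 0.7 moderate, and a median rating of 3 is labelled poor, 5 or 6 moderate and 7 good. The sketch returns a point estimate only; it does not compute the 95% CI, and it would need adapting if a rater left some items unrated.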

Results

ChatGPT’s output

ChatGPT provided the diagnosis for suspected glioma type, recommendations for adjuvant treatment plan, regimen for radiotherapy and chemotherapy, and consideration of functional status for all 10 cases. Regarding the first question ‘what is the best adjuvant treatment’, ChatGPT started the dialogue by giving its assessment of the diagnosis. Based on the patient summary, it correctly recognised and classified the tumours as glioma in all cases and suggested the tumour type (eg, low-grade glioma, grade II or III astrocytoma, glioblastoma). Of note, no alternative diagnosis such as brain metastasis or extra-axial brain tumour was proposed. ChatGPT then recommended ‘the best adjuvant treatment […]’ or ‘the standard of care for glioblastoma […]’. Concerning the second question ‘what would be the regimen of radiotherapy and chemotherapy for this patient’, ChatGPT provided a recommendation for all cases. However, a complete regimen of radiotherapy (grays in fractions over weeks) was provided in 70% of the cases, and a complete regimen of chemotherapy (medication and doses) in 50% of cases.

For both questions, ChatGPT nuanced its answers for all cases by mentioning the need to adjust the treatment according to the patient’s individual preferences and functional status, although never specifying alternatives. Finally, ChatGPT mentioned the need to confirm its treatment suggestion with a multidisciplinary team in 80% of the cases.

Experts’ opinion and agreement

Seven experts rated ChatGPT’s output regarding the diagnosis, recommendations for therapy and regimen, the consideration of the patient’s functional status and ChatGPT’s overall performance. Rater 6 only rated the diagnosis accuracy and treatment recommendations for case 2 and did not rate the output regarding the consideration of the functional status nor the regimen of adjuvant therapy (the expert preferred to remain in their scope of practice).

Figure 2 demonstrates the inter-rater agreement for each evaluated outcome. Concerning the diagnosis, ChatGPT’s output was evaluated as poor with a median score of 3 (IQR 1–7.8) with excellent agreement between the experts (ICC 0.9, 95% CI 0.7 to 1.0). For the adjuvant therapy, the ChatGPT recommendations were evaluated as good with a median score of 7 (IQR 6–8) and a good agreement (ICC 0.8, 95% CI 0.4 to 0.9). The adjuvant therapy regimen was evaluated as good with a median score of 7 (IQR 4–8) and good expert agreement (ICC 0.8, 95% CI 0.5 to 0.9). Regarding ChatGPT’s output on the consideration of the patient’s functional status, the experts rated the recommendations as moderate with a median score of 6 (IQR 1–7) and a moderate agreement (ICC 0.7, 95% CI 0.3 to 0.9). Finally, the global evaluation of ChatGPT’s output accuracy was moderate and scored 5 (IQR 3–5) with a moderate expert agreement (ICC 0.7, 95% CI 0.3 to 0.9). Six experts (86%) evaluated ChatGPT’s role in a CNS TB as useful if the AI-based system can evolve and learn. One rater (14%) evaluated ChatGPT’s role in a CNS TB as useful, but only in specific circumstances.
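Medians and IQRs of the kind reported above can be computed from raw ratings along the following lines. This is a hedged Python sketch rather than the authors' R code: the function name `median_iqr` is hypothetical, and the `"inclusive"` method is chosen on the assumption that it should correspond to R's default quantile definition (linear interpolation between order statistics), which would explain non-integer quartiles such as 7.8.

```python
import statistics


def median_iqr(ratings):
    """Return (median, Q1, Q3) for a list of 0-10 ratings.

    method="inclusive" treats the smallest and largest values as the 0th and
    100th percentiles and interpolates linearly between order statistics.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    q1, med, q3 = statistics.quantiles(ratings, n=4, method="inclusive")
    return med, q1, q3
```

Because quartiles are interpolated, the bounds of an IQR need not be values that any expert actually gave.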

Figure 2

Barplots representing the ratings per patient and per expert, regarding (A) the diagnosis, (B) the adjuvant treatment recommendation, (C) the consideration of the patient’s functional status, (D) the regimen of the adjuvant therapy, (E) ChatGPT’s overall performance, (F) the legend. ICC, intraclass correlation coefficient (from 0 to 10, 95% CI). The dashed red line represents the median value of the experts’ rating.

There was no significant difference in experts’ ratings between the glioblastoma cases (8/10) and the low-grade glioma cases (2/10).
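A two-group comparison of this kind can be sketched with a Mann-Whitney U test, as named in the Statistics section. The Python code below is an assumption-laden illustration, not the authors' analysis: it uses the normal approximation with tie correction and a continuity correction, the function name is hypothetical, and the text does not specify which values (individual ratings or per-case summaries) were compared.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation).

    Returns (U statistic for sample x, p value). Tied values receive their
    average rank, the variance is tie-corrected and a continuity correction
    of 0.5 is applied.
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    pooled = sorted(list(x) + list(y))
    n = n1 + n2

    # Assign the average rank to each distinct value and accumulate the
    # tie term sum(t^3 - t) over groups of t tied observations.
    rank_of = {}
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        t = j - i
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        tie_term += t ** 3 - t
        i = j

    u_x = sum(rank_of[v] for v in x) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:  # every observation identical: no evidence of a difference
        return u_x, 1.0
    z = max(abs(u_x - mean_u) - 0.5, 0.0) / math.sqrt(var_u)
    p = math.erfc(z / math.sqrt(2))  # two-sided tail of the standard normal
    return u_x, min(p, 1.0)


# Made-up ratings for illustration only; these are NOT the study data.
u, p = mann_whitney_u([7, 6, 8, 5, 7, 6, 8, 7], [6, 7])
significant = p < 0.05  # two-sided threshold stated in the text
```

With groups as small as two cases, the normal approximation is rough and the test has little power, so a non-significant result should be read cautiously.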

Discussion

In this study, we assessed the performance of ChatGPT, an AI-based language model, in providing treatment recommendations for glioma patients. To the best of our knowledge, this is the first study aiming to evaluate this novel chatbot within the framework of CNS tumour multidisciplinary decision-making. While ChatGPT demonstrated proficiency in accurately identifying cases as gliomas, it displayed limited precision in identifying specific tumour subtypes. Furthermore, the tool’s recommendations regarding treatment strategy and regimen were rated as good, while its ability to incorporate functional status in its decision-making process was rated as moderate.

Rationale for CNS TB

Oncological patients discussed in the multidisciplinary CNS TB are more likely to benefit from a preoperative and postoperative staging and are more likely to receive the optimal adjuvant treatment.25 26 Barbaro et al presented the foundations of neuro-oncology and the need for multidisciplinary expertise in order to embrace the multiple disease aspects in CNS tumour-affected patients.14 The authors highlighted the prerogatives and missions of a CNS TB: (1) neuro-oncology, neurosurgery, radiation oncology, neuropathology, neurology and radiology are specialties necessary to compose the CNS TB; (2) the expert consortium’s main goal is to propose a collaborative treatment plan; (3) the development of novel clinical trials. Furthermore, a single-centre prospective evaluation of a CNS TB showed that the experts’ consortium influences the clinical management of patients suffering from a brain tumour through high-impact decisions.27 However, the organisation of CNS TB is limited by economic costs, time expenditure, resource availability and the limited presence of TB across the geographic and socioeconomic strata.26 New AI-based tools with underlying deep learning, such as ChatGPT, might represent a valuable complement or at least offer some help to centres lacking expertise or resources.

ChatGPT ready to assume the role of the doctor?

ChatGPT was asked two questions that corresponded to the main aim of a CNS TB discussion: ‘what is the best adjuvant treatment?’, and ‘what would be the regimen of radiotherapy and chemotherapy for this patient?’. ChatGPT scored well on both parameters, but its responses were less accurate on other parameters such as incorporating the functional status of the patient, and glioma subtype diagnostic accuracy. Regarding the latter, the output provided by the chatbot was often incorrect (eg, pleiomorphic astrocytoma instead of glioblastoma in one case), or not detailed enough (eg, no distinction between grade II or III astrocytoma). On the other hand, the adjuvant treatment suggestion and its regimen were rated as good. In future studies, it may be worth exploring alternative questioning methods that align better with how chatbots process information. This approach could potentially lead to more accurate results.

In this cohort, 80% of the included patients were diagnosed with glioblastoma (WHO grade IV). In the literature, the treatment of WHO grade IV glioblastoma has been extensively studied.15–17 19 23 28 AI models used by ChatGPT are trained on a large dataset of information found online including websites, journals and digitalised books. It is thus understandable that ChatGPT’s output regarding the adjuvant treatment and its regimen related to glioblastoma is of better quality because the underlying knowledge base is well-documented. Conversely, ChatGPT’s performance is mediocre regarding recommendations that are based on a less extensive knowledge base. The consideration of patient functional status was rated as moderate, even though the clinical preoperative and postoperative state of the included cases was presented to ChatGPT. This consideration is much less documented in the literature as only a few clinical trials studied adjuvant therapy for glioblastoma in patients with impaired functional status or in older adults.17

Strengths and limitations

Our results provide valuable information on the potential of human-AI interfaces in medical decision-making. To test the chatbot’s performance, we used glioma cases, which represent a homogeneous sample of tumour cases; this allowed us to test the performance in this setting but limits the generalisability of our findings to other tumour types. Of note, ChatGPT’s recommendations were carefully qualified with disclaimers that it was not designed to provide medical advice, which presents another limitation in a medical setting. This might, however, be seen as an opportunity if similar algorithms were designed specifically for this purpose. Given this, at the moment we cannot appreciate the full potential of ChatGPT in CNS TB. Notwithstanding this limitation, one could imagine that AI chatbots, with continued development in the medical field, could hold great promise to complement the classic CNS TB workflow. Another limitation lies in the fact that the chatbot’s knowledge relies on content from the internet limited to 2021. Although information on more recent research developments in neuro-oncology was not accessible to the chatbot, this should not have impacted its recommendations for standard clinical care. If the chatbot had access to information on new clinical trials, it could greatly aid the therapeutic discussion and potentially lead to new development directions.

Moreover, ChatGPT recommendations cannot be taken at face value without specialist verification since it is not uncommon for the chatbot to provide erroneous information.13 In language models such as ChatGPT, a phenomenon known as ‘hallucinations’ frequently occurs and can span from rather benign, for example, providing plausible but non-existent scientific references, to very dangerous medical scenarios, such as recommending an ineffective or harmful treatment.2 Therefore, whether used to inform medical or other high-stake decisions, at this stage it is indispensable that the output is verified by a human professional. Finally, our study relied on textual neuroimaging information and did not involve a quantitative AI imaging analysis which could be a potential area of development.

Further developments

Six of the seven experts evaluated ChatGPT as useful if the system could learn and improve. This notion is supported by the medical community as AI is growing and holds immense promise in medicine.2 6 29–31 However, since its launch in November 2022, ChatGPT has raised scepticism in the scientific community regarding threats to the originality of scientific work.10 11 32–35 Another consideration is the risk that AI chatbots may be prone to bias or commit omissions and errors in the interpretation of medical information. Due to these shortcomings, AI-based systems in medicine should be used with a human-in-the-loop approach.

Even if our results suggest a reserved rating for ChatGPT’s performance on glioma subtype diagnosis and multi-modal information integration, AI-based chatbots may be a promising supplement in TB decision-making. Future studies could explore ways to refine ChatGPT’s functionality, such as incorporating more patient-specific data and refining its ability to provide nuanced recommendations based on the clinical context. Furthermore, future developments in the ChatGPT interface could introduce the ability to read medical imaging, such as preoperative and postoperative brain MRI, which could enormously improve its diagnostic ability and treatment recommendations.

Nonetheless, our results highlight the potential utility of ChatGPT in facilitating clinical decision-making. Chatbots could be used to quickly provide information related to a patient’s medical history, differential diagnosis, relevant diagnostic tests, experimental treatment options and potential side effects. Furthermore, we intentionally provided the chatbot with only one conversation log. Thus, it is possible that further interaction and additional discussion with the chatbot may have yielded increased performance.

However, ChatGPT’s ability to provide medical information was restricted as it did not have access to the latest clinical trial findings, because it lacks live internet access and access to research databases.28 Overcoming these barriers and facilitating AI access to the newest scientific information could be a potential direction of future development, as novel clinical trials are a crucial part of a CNS TB discussion.14 AI-based chatbots could have the potential to integrate the newest trial and bench science information into multidisciplinary decision-making and help TBs direct patients to potentially applicable treatments.

AI language models are evolving at a tremendous speed, and by the time of the publication of this manuscript, a newer ChatGPT V.4.0 had been introduced, offering a more versatile conversational tool. It is possible that future updates may include a neuro-imaging analysis tool, which would greatly enhance the capabilities of AI tools available for the medical field.

Conclusion

We have evaluated the performance of the novel AI-based language generator ChatGPT in glioma-related treatment recommendations. ChatGPT correctly identified the cases as CNS tumours but lacked precision on tumour subtype. The treatment strategy and regimen recommendations were rated as good; however, it lacked the ability to nuance its recommendations when taking into consideration the functional status. Overall, our findings suggest that ChatGPT has potential as an adjunct to the multidisciplinary TB decision workflow within a human-in-the-loop approach, provided that further algorithmic advancements are made in the medical domain.