I read the article by Haemmerli et al on the performance of ChatGPT-3.5 in generating treatment recommendations for central nervous system (CNS) tumours, which were then evaluated by tumour board (TB) experts. While the study highlighted promising aspects of the artificial intelligence (AI) model, the design of the prompt used to interact with ChatGPT warrants further consideration.
In the study, the prompt employed was a brief patient history followed by two questions, which appears to have limited the model’s performance. As a sophisticated large language model (LLM), GPT-3.5 relies heavily on the context and specificity of the prompt it is given.1 2 Based on the cited literature, an alternative prompt structure could have included context, a specific intent, a question and an expected response format. Moreover, priming the LLM with examples of the expected answer (few-shot prompting) significantly improves the quality of its responses.2 3 Finally, GPT-4, introduced in March 2023, has shown considerable improvements in understanding and generating responses compared with ChatGPT-3.5.4 5
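To make this structure concrete, the sketch below shows, purely as an illustrative assumption, how such a prompt could be assembled and sent to the model through the OpenAI Python client. The system message carries the context, intent, question and expected response format, while the few-shot examples are supplied as prior chat turns; the patient histories and board responses shown here are placeholders, not the cases from the study, and the model names and client call reflect the current openai library rather than the interface used by the original authors.

```python
# Minimal sketch (not the authors' protocol) of a context-driven prompt:
# context, specific intent, question and expected response format in the
# system message, plus few-shot examples supplied as prior chat turns.
# Requires the `openai` package (>=1.0) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "Context: You are assisting a neuro-oncology tumour board.\n"
    "Intent: Provide a treatment recommendation for the case described.\n"
    "Question: What diagnosis and management would the board most likely propose?\n"
    "Expected format: (1) presumed oncological diagnosis, (2) proposed treatment, "
    "(3) relevance of the patient's functional status."
)

# Placeholder pairs standing in for patients 1-8 of the study
# (the real histories and TB responses are in its supplemental material).
FEW_SHOT_EXAMPLES = [
    ("<patient history 1>", "<tumour board response 1>"),
    ("<patient history 2>", "<tumour board response 2>"),
]

def build_messages(new_case: str) -> list[dict]:
    """Assemble the structured prompt, the example pairs and the new case."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for history, board_response in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user", "content": history})
        messages.append({"role": "assistant", "content": board_response})
    messages.append({"role": "user", "content": new_case})
    return messages

response = client.chat.completions.create(
    model="gpt-4",  # or "gpt-3.5-turbo" to mirror the comparison discussed here
    messages=build_messages("<patient history 9>"),
)
print(response.choices[0].message.content)
```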
With the application of these techniques, researchers could have guided the model towards more relevant and contextually nuanced responses. This could have been particularly helpful in areas where the model underperformed, such as the precise identification of glioma subtypes and the consideration of patient functional status.
As an illustration, both ChatGPT-3.5 and GPT-4 were primed with eight examples (patients 1–8, each comprising the patient history followed by the TB response) taken from the study’s online supplemental material. A more context-specific prompt was then used with the histories of patients 9 and 10. Table 1 displays the main output obtained with this technique, revealing enhanced precision in the oncological diagnosis, treatment discussion and assessment of patient functional status from ChatGPT-3.5 compared with what was presented in the paper. GPT-4 appeared to align even more closely with the board’s opinion, which was defined as the gold standard. The full discussion with the chatbot is available in online supplemental material 1.
It is critical to acknowledge that the performance of LLM applications depends heavily on the prompt used and the quality of the data provided. Future research should employ a refined, context-driven approach when interacting with these models, and the development and sharing of prompt engineering techniques should continue to be prioritised.
In conclusion, the exploration of LLMs in CNS oncology research is commendable, but it is essential to optimise the methodology to unlock the full potential of AI tools in such a complex and challenging clinical landscape.