Comparative study of ChatGPT and human evaluators on the assessment of medical literature according to recognised reporting standards

Richard HR Roberts; Stephen R Ali; Hayley A Hutchings; Thomas D Dobbs; Iain S Whitaker

doi:10.1136/bmjhci-2023-100830

Short report

Richard HR Roberts¹,²,³ (ORCID: http://orcid.org/0000-0002-9600-5943), Stephen R Ali¹,³, Hayley A Hutchings², Thomas D Dobbs¹,³ and Iain S Whitaker¹,³

¹Reconstructive Surgery and Regenerative Medicine Research Centre, Swansea University, Swansea, UK
²Swansea University Medical School, Swansea University, Swansea, UK
³Welsh Centre for Burns and Plastic Surgery, Morriston Hospital, Swansea, UK

Correspondence to Dr Richard HR Roberts; 838272@swansea.ac.uk

Abstract

Introduction Amid clinicians’ challenges in staying updated with medical research, artificial intelligence (AI) tools like the large language model (LLM) ChatGPT could automate appraisal of research quality, saving time and reducing bias. This study compares the proficiency of ChatGPT3 against human evaluation in scoring abstracts to determine its potential as a tool for evidence synthesis.

Methods We compared ChatGPT’s scoring of implant dentistry abstracts with human evaluators using the Consolidated Standards of Reporting Trials for Abstracts reporting standards checklist, yielding an overall compliance score (OCS). Bland-Altman analysis assessed agreement between human and AI-generated OCS percentages. Additional error analysis included mean difference of OCS subscores, Welch’s t-test and Pearson’s correlation coefficient.

Results Bland-Altman analysis showed a mean difference of 4.92% (95% CI 0.62%, 0.37%) in OCS between human evaluation and ChatGPT. Error analysis displayed small mean differences in most domains, with the highest in ‘conclusion’ (0.764 (95% CI 0.186, 0.280)) and the lowest in ‘blinding’ (0.034 (95% CI 0.818, 0.895)). The strongest correlations between human and ChatGPT scores were in ‘harms’ (r=0.32, p<0.001) and ‘trial registration’ (r=0.34, p=0.002), whereas the weakest were in ‘intervention’ (r=0.02, p<0.001) and ‘objective’ (r=0.06, p<0.001).

Conclusion LLMs like ChatGPT can help automate appraisal of medical literature, aiding in the identification of accurately reported research. Possible applications of ChatGPT include integration within medical databases for abstract evaluation. Current limitations include the token limit, restricting its usage to abstracts. As AI technology advances, future versions like GPT4 could offer more reliable, comprehensive evaluations, enhancing the identification of high-quality research and potentially improving patient outcomes.

Artificial intelligence
Medical Informatics

This is an open access article distributed in accordance with the Creative Commons Attribution 4.0 Unported (CC BY 4.0) license, which permits others to copy, redistribute, remix, transform and build upon this work for any purpose, provided the original work is properly cited, a link to the licence is given, and indication of whether changes were made. See: https://creativecommons.org/licenses/by/4.0/.

Introduction

In the dynamic landscape of medical research, clinicians face the daunting challenge of staying abreast of the latest advancements amid their demanding clinical responsibilities. The rate and varying quality of emerging research further compound this challenge. A number of appraisal tools exist to help readers assess the quality of the reported research, although these can be time-consuming to employ and are at risk of user bias. The use of large language models (LLMs) like ChatGPT has the potential to automate this evaluation, thereby aiding clinicians in making informed decisions.1 However, the accuracy of LLMs compared with human expertise as a gold standard remains uncertain.

In November 2022, OpenAI unveiled ChatGPT, a generative pretrained transformer (GPT) language model grounded in transformer architecture, which enables it to process vast amounts of text data and generate coherent text outputs by discerning the relationships between input and output sequences. ChatGPT has been trained on extensive human language datasets, and several studies attest to its ability to produce high-quality, coherent text outputs.2 3 Clinical research applications of ChatGPT have yielded promising results, suggesting that artificial intelligence could critically appraise abstracts and free up valuable clinician time.4

The objective of this study is to compare the proficiency of ChatGPT3, the third iteration of OpenAI’s GPT model, in scoring abstracts against human evaluation as the benchmark. By determining the accuracy and efficiency of these LLMs in assessing research quality, we aim to explore their potential as valuable tools for clinicians in appraisal and evidence synthesis.

Methods

In this study, we used a previously published paper as the basis of our comparison with ChatGPT.5 In that study, abstracts from a systematic review on implant dentistry were scored by its human authors using the Consolidated Standards of Reporting Trials for Abstracts (CONSORT-A)6 statement. Selection and data extraction were performed independently and in duplicate by two clinician reviewers across a sample of 30 abstracts. Discrepancies were addressed through discussion until a consensus of at least 80% was achieved. Subsequent data extraction was conducted by one reviewer alone. The CONSORT-A checklist scores abstract reporting standards against clear definitions for subsections such as trial design, blinding and randomisation. The human evaluators scored each item as fully reported, partially reported or not reported. ChatGPT was used to score the same set of abstracts, using a prompt to assess each domain within the CONSORT-A checklist (figure 1). Following the methodology of the earlier study, each constituent subgroup was scored and placed in one of the three categories (figure 1A). An overall compliance score (OCS) was given out of 15, along with an OCS percentage (figure 1B). This was performed using the GPT3.5 model.

Figure 1

(A) Example prompt used to generate the OCS as per CONSORT-A criteria. (B) An example of the calculated OCS and OCS% as generated by ChatGPT. CONSORT-A, Consolidated Standards of Reporting Trials for Abstracts; OCS, overall compliance score.
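As a rough illustration of how an OCS and OCS percentage could be derived from per-item ratings, the sketch below may help. The point values assigned to each category are an assumption made purely for illustration (the text states only the three categories and the maximum of 15), and the function name and example ratings are hypothetical.

```python
from typing import List, Tuple

# ASSUMPTION: the source does not state how the three categories map to points.
# These weights are illustrative only.
ASSUMED_POINTS = {
    "fully reported": 1.0,
    "partially reported": 0.5,
    "not reported": 0.0,
}

def overall_compliance_score(ratings: List[str], max_score: float = 15.0) -> Tuple[float, float]:
    """Return (OCS, OCS%) for one abstract, given one rating per checklist item."""
    total = sum(ASSUMED_POINTS[rating] for rating in ratings)
    return total, 100.0 * total / max_score

# Hypothetical ratings for a single abstract (15 checklist items).
example = (
    ["fully reported"] * 9
    + ["partially reported"] * 4
    + ["not reported"] * 2
)
ocs, ocs_pct = overall_compliance_score(example)
print(f"OCS = {ocs:.1f}/15 ({ocs_pct:.1f}%)")
```

With a different weighting scheme only the `ASSUMED_POINTS` mapping would need to change.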

Bland-Altman analysis was used to evaluate the overall agreement between human and ChatGPT-generated OCS percentages. For error analysis, the mean difference of the absolute OCS subscores, Welch’s two-sample t-test and Pearson’s correlation coefficient were calculated. The mean difference indicates the magnitude and direction of the differences in OCS between ChatGPT and human evaluators, while Pearson’s correlation coefficient indicates the strength and direction of the linear relationship between the two sets of scores. Together, these provided complementary information on the agreement between ChatGPT and human evaluators. Pearson’s correlation coefficient was interpreted by magnitude: r 0.00–0.19 very weak, 0.20–0.39 weak, 0.40–0.59 moderate, 0.60–0.79 strong and 0.80–1.00 very strong correlation. Statistical analysis was done in R (V.4.1.1). P<0.001 was deemed statistically significant.
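The three quantities named above can be sketched in a few lines. The example below is written in Python rather than the R used in the study, uses only the standard library, and runs on made-up scores. The 1.96 SD limits of agreement follow the usual Bland-Altman convention and are not stated in the text, and the function names are hypothetical.

```python
import math
from statistics import mean, stdev, variance

def bland_altman(a, b):
    """Return (bias, lower limit, upper limit) for paired scores a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    bias = mean(diffs)                      # mean difference between methods
    sd = stdev(diffs)                       # sample SD of the differences
    # Conventional 95% limits of agreement (assumed; not stated in the text).
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

def pearson_r(a, b):
    """Pearson's correlation coefficient for two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    ss_a = sum((x - ma) ** 2 for x in a)
    ss_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(ss_a * ss_b)

def welch_t(a, b):
    """Welch's two-sample t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical OCS percentages for four abstracts (not study data).
human = [60.0, 70.0, 80.0, 50.0]
chatgpt = [55.0, 72.0, 70.0, 45.0]

bias, lower, upper = bland_altman(human, chatgpt)
r = pearson_r(human, chatgpt)
t_stat, dof = welch_t(human, chatgpt)
print(f"bias={bias:.2f}, limits=({lower:.2f}, {upper:.2f}), r={r:.2f}, t={t_stat:.2f}, df={dof:.1f}")
```

The p value for the Welch statistic is omitted because it needs the t distribution, which the standard library does not provide; a statistics package would normally supply it.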

Results

Bland-Altman analysis revealed a mean difference of 4.92% (95% CI 0.62%, 0.37%) in OCS percentage (figure 2). Error analysis revealed small mean differences between human evaluation and ChatGPT in most domains (table 1).

Table 1

Error analysis of ChatGPT CONSORT-A OCS subscores

Figure 2

Bland-Altman analysis between ChatGPT and human evaluation. OCS, overall compliance score.

The mean difference in absolute OCS was highest for the ‘conclusion’ domain (0.764, 95% CI: 0.186, 0.280), indicating that ChatGPT differed the most from human evaluators in this domain. In contrast, the domain with the lowest mean difference in absolute OCS was ‘blinding’ (0.034, 95% CI: 0.818, 0.895), indicating that ChatGPT was most accurate in this domain. Correlation between ChatGPT and human evaluators varied across domains. The strongest positive correlations were in ‘harms’ (r=0.32, p<0.001) and ‘trial registration’ (r=0.34, p=0.002); both fall in the weak band of the scale described in the Methods, so consistency between ChatGPT and human evaluators was greatest in these domains but still limited. The ‘intervention’ (r=0.02, p<0.001) and ‘objective’ (r=0.06, p<0.001) domains had very weak correlations, suggesting that ChatGPT’s performance was least consistent with human evaluators in these domains.
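To make the link to the interpretive bands in the Methods explicit, the short sketch below classifies the four correlation coefficients reported above. The function name is hypothetical. The published bands leave small gaps (for example, between 0.19 and 0.2), so the sketch assumes each band starts at its lower bound.

```python
def interpret_r(r: float) -> str:
    """Label a Pearson coefficient using the bands given in the Methods.

    ASSUMPTION: each band is taken to start at its lower bound, which
    closes the small gaps in the published ranges.
    """
    magnitude = abs(r)
    if magnitude < 0.2:
        return "very weak"
    if magnitude < 0.4:
        return "weak"
    if magnitude < 0.6:
        return "moderate"
    if magnitude < 0.8:
        return "strong"
    return "very strong"

# Correlation coefficients as reported in the Results.
reported = {
    "harms": 0.32,
    "trial registration": 0.34,
    "intervention": 0.02,
    "objective": 0.06,
}
classified = {domain: interpret_r(r) for domain, r in reported.items()}
for domain, label in classified.items():
    print(f"{domain}: r={reported[domain]:.2f} ({label})")
```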

Discussion

The emergence of LLMs like ChatGPT offers a promising way to streamline the assessment of reporting standards in medical literature and assist clinicians in making informed decisions. Bland-Altman analysis supports the overall finding of the study that ChatGPT has the potential to automate appraisal of medical literature. By providing a score for the quality of reporting in abstracts, ChatGPT can help clinicians and researchers quickly identify studies with more comprehensive and transparent reporting.

The recent release of ChatGPT4, an advancement on the ChatGPT3 architecture, has demonstrated enhanced performance across diverse domains.7 8 Full access is currently limited by a paywall; however, its web integration technology creates immediate possibilities for further application. These could include searching for papers with minimum CONSORT compliance scores or using ChatGPT as a widget within popular medical databases, where it could automatically evaluate the quality of abstracts and provide a score to users, promoting comprehensive and transparent reporting.

One important barrier to using LLMs more widely in medical literature evaluation is the token limit. ChatGPT’s current token limit may not allow it to process entire research articles, limiting its use to abstracts. Nevertheless, the potential to feed ChatGPT full papers in the future and have it evaluate studies using other appraisal tools is an exciting possibility.

Large, unexpected differences were seen in the conclusion and outcome (methods) subdomains. In the context of LLMs such as ChatGPT, the paucity of data in relation to training makes pinpointing a single cause challenging. However, the quality of the prompt has been underscored as a major determinant of response accuracy,9 and in the context of academic writing and interpretation, ChatGPT has been shown not to follow directions correctly.10 These factors may have played a pivotal role in the differences observed. Furthermore, some specifics of the human evaluation were not elaborated upon, and inaccuracies in human assessment may have influenced scoring. Future research could assess variation between human evaluators and pave the way for a more in-depth analysis in conjunction with ChatGPT.

Conclusion

As the technology continues to evolve and improve, the next iteration of GPT, GPT4, may further enhance the accuracy and efficiency of the tool, allowing for even more reliable and comprehensive evaluations of research. While there are still limitations to this technology, the promise it holds for assisting in the evaluation and identification of high-quality research is a significant step towards improving patient care and outcomes.

Ethics statements

Patient consent for publication

Ethics approval

Not applicable.

References

1. Lee P, Bubeck S, Petro J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. N Engl J Med 2023;388:2400. doi:10.1056/NEJMc2305286
2. Brown TB, Mann B, Ryder N, et al. Language models are few-shot learners. 2020. Available: http://arxiv.org/abs/2005.14165
3. Raffel C, Shazeer N, Roberts A, et al. Exploring the limits of transfer learning with a unified text-to-text transformer. 2020. Available: http://arxiv.org/abs/1910.10683
4. Sanmarchi F, Bucci A, Golinelli D. A step-by-step researcher’s guide to the use of an AI-based transformer in epidemiology: an exploratory analysis of ChatGPT using the STROBE checklist for observational studies. Z Gesundh Wiss [Preprint] 2023. doi:10.1101/2023.02.06.23285514
5. Menne MC, Pandis N, Faggion CM. Reporting quality of abstracts of randomized controlled trials related to implant dentistry. J Periodontol 2021;93:73–82. doi:10.1002/JPER.21-0396
6. Moher D, Hopewell S, Schulz KF, et al. CONSORT 2010 explanation and elaboration: updated guidelines for reporting parallel group randomised trials. BMJ 2010;340:c869. doi:10.1136/bmj.c869
7. He N, Yan Y, Wu Z, et al. Chat GPT-4 significantly surpasses GPT-3.5 in drug information queries. J Telemed Telecare 2023. doi:10.1177/1357633X231181922
8. Takagi S, Watari T, Erabi A, et al. Performance of GPT-3.5 and GPT-4 on the Japanese medical licensing examination: comparison study. JMIR Med Educ 2023;9:e48002. doi:10.2196/48002
9. Zuccon G, Koopman B. Dr ChatGPT, tell me what I want to hear: how prompt knowledge impacts health answer correctness. 2023. Available: http://arxiv.org/abs/2302.13793
10. HS Kumar A. Analysis of ChatGPT tool to assess the potential of its utility for academic writing in biomedical domain. BEMS Reports 2023;9:24–30. doi:10.5530/bems.9.1.5

Footnotes

Contributors RHRR and SRA conceptualised the study. RHRR performed the review and initial data analysis. Both RHRR and SRA were jointly responsible for subsequent in-depth data analysis. HAH, SRA, TDD and ISW contributed significantly to the editing process, refining the manuscript for clarity and consistency. All authors reviewed the final manuscript before submission.
Funding The research conducted herein was funded by Swansea University. SRA and TDD are funded by the Welsh Clinical Academic Training Fellowship (no award number). SRA received a Paton Masser grant from the British Association of Plastic, Reconstructive and Aesthetic Surgeons to support this work (no award number). ISW is the surgical specialty lead for Health and Care Research Wales and the chief investigator for the Scar Free Foundation & Health and Care Research Wales Programme of Reconstructive and Regenerative Surgery Research (no award number). The Scar Free Foundation is the only medical research charity focused on scarring with the mission to achieve scar-free healing within a generation. ISW is an associate editor for the Annals of Plastic Surgery, editorial board member of BMC Medicine and takes numerous other editorial board roles.
Competing interests None declared.
Provenance and peer review Not commissioned; externally peer reviewed.
