
Review of study reporting guidelines for clinical studies using artificial intelligence in healthcare
  1. Susan Cheng Shelmerdine1,
  2. Owen J Arthurs1,
  3. Alastair Denniston2 and
  4. Neil J Sebire3

  1. Radiology, Great Ormond Street Hospital NHS Foundation Trust, London, UK
  2. Institute of Inflammation and Ageing, University of Birmingham, Birmingham, UK
  3. Digital Research, Informatics and Virtual Environments Unit (DRIVE), London, UK

  Correspondence to Dr Susan Cheng Shelmerdine; susie.shelmerdine@gmail.com


Introduction

Recent, rapid developments in computational technologies and increased volumes of digital data for analysis have resulted in an unprecedented growth in research activities relating to artificial intelligence (AI), particularly within healthcare. This volume of work has even led several high-impact journals to launch their own subjournals within the ‘AI healthcare’ field (eg, Nature Machine Intelligence,1 Lancet Digital Health,2 Radiology: Artificial Intelligence3). High-quality research should be accompanied by transparency, reproducibility and validity of techniques for adequate evaluation and translation into clinical practice. Standardised reporting guidelines help researchers define key components of their study, ensuring that relevant information is provided in the final publication.4 Studies pertaining to algorithm development and the clinical application of AI, however, bring unique challenges and added complexities in how such studies are reported, assessed and compared, in relation to elements that are not conventionally prespecified in traditional reporting guidelines. This can lead to missing information and a high risk of hidden bias. If these actual or potential limitations are not identified, they may gain tacit approval through publication, which in turn may support premature adoption of new technologies.5 6 Conversely, well-designed, well-delivered studies that are poorly reported may be judged unfavourably, being adjudged to carry a high risk of bias simply because of a lack of information.

Inadequacies in the reporting of AI clinical studies are increasingly well recognised. In 2019, a systematic review by Liu et al7 screened over 20 500 articles but found that fewer than 1% of these were sufficiently robust in their design and reporting to allow independent reviewers to have confidence in their claims. Similarly, Nagendran et al8 identified high levels of bias in the field. In another study,9 it was reported that only 6% of over 500 eligible radiological AI research publications performed any external validation of their models, and none used multicentre or prospective data collection. Similarly, most studies using machine learning (ML) models for medical diagnosis10 provided neither adequate detail on how the models were evaluated nor sufficient detail for them to be reproduced. Inconsistencies in how ML models derived from electronic health records are reported have also been noted, with details regarding the race and ethnicity of participants omitted in 64% of studies and only 12% of models externally validated.11

In order to address these concerns, adapted research reporting guidelines based on the well-established EQUATOR Network (Enhancing the QUAlity and Transparency Of health Research)12 13 and de novo recommendations by individual societies have been published, with greater relevance to AI research. In this review, we highlight those that will cover the majority of healthcare-focused AI-related studies, and explain how they differ from the well-known guidance for non-AI related clinical work. Our intention is to raise awareness of how such studies should be structured, thereby improving the quality of future submissions and providing a helpful aid for researchers, peer reviewers and editors.

In compiling a detailed yet relevant list of study guidelines, we reviewed the EQUATOR Network13 website for guidelines containing the terms AI, ML or deep learning. A separate search was also conducted using the Medline, Scopus and Google Scholar databases for publications using the same search terms with the addition of ‘reporting guideline’, ‘checklist’ or ‘template’. Opinion pieces were excluded. Articles were included where a description of the recommendations was provided and the article had been published at the time of the search (March 2021).

Types of research reporting guidelines

An ideal reporting guideline should be a clear, structured tool with a minimum list of key information to include within a published scientific manuscript. The EQUATOR Network13 is the international ‘standard bearer’ for reporting guidelines, committed to improving ‘the reliability and value of published health research literature by promoting transparent and accurate reporting and wider use of robust reporting guidelines’. Since the landmark publication of the Consolidated Standards of Reporting Trials (CONSORT) statement,14 the network has overseen the development and publication of a number of guidelines that address other types of study design (eg, diagnostic accuracy studies). The EQUATOR guidelines are centrally registered (available via a core library), which ensures adherence to a robust methodology of development and avoids redundancy from parallel initiatives addressing the same issue. Importantly, these guidelines are not medical specialty specific but are focused on the type of study, which helps ensure a consistent approach and quality when addressing the same study design. It is recognised that certain scenarios may require specific extensions to these guidelines. For example, the increasing recognition of the importance of patient-reported outcomes (PROs) has led to the development of SPIRIT-PRO (an extension of the Standard Protocol Items: Recommendations for Interventional Trials statement)15 and CONSORT-PRO.16 In a similar way, the specific attributes of AI as an intervention have led to a number of AI extensions, both published and in development, which build on the robust methodology of the original EQUATOR guidelines while ensuring that AI-specific elements are also addressed.

In parallel to the work of the EQUATOR network, a number of experts and institutions have developed their own recommendations for good practice and reporting. In contrast, these start with the intervention (ie, AI) rather than the study type (eg, randomised controlled trial (RCT)), and therefore cover essentially the same territory. They vary in depth, and there can be differences in nuance depending on their primary purpose. For example, some originated from the need to support reviewers and editorial staff (‘is this complete and is it good enough?’), whereas others are aimed at building a shared understanding of appropriate design and delivery (‘this is what good looks like’).

Given the number of different reporting guidelines in this area, there is value in setting them in context to help users understand which is most appropriate for a particular setting (table 1). Ultimately, the most important elements of a high-quality study are contained within the methodology of the study design itself and not within the intervention; it is these elements that help minimise the major biases that all studies must address. In line with leading journals, we would therefore recommend starting with the guideline that addresses the particular study design (eg, CONSORT14 for an RCT). If an AI extension already exists for that study type, then it is clearly the appropriate choice (eg, CONSORT-AI).17–19 If no such AI extension exists, we recommend using the appropriate EQUATOR guideline (eg, Standards for Reporting of Diagnostic Accuracy Studies (STARD)20 for diagnostic accuracy studies), supplemented with the AI-specific elements recommended in other guidelines (eg, SPIRIT-AI,21–23 CONSORT-AI17–19 or the non-EQUATOR guidelines described below). Indeed, all the guidelines considered here contain valuable insights into the specific challenges of AI studies and are recommended reading on good practice in design and reporting.

Table 1

Summary of reporting guidelines for common study types used in radiological research, and their corresponding guideline extensions where these involve artificial intelligence

EQUATOR network guidelines

Clinical trials protocols

The quality of a study and the trustworthiness of its findings start at the design phase. The study protocol should contain all elements of the study design, sufficient for independent groups to carry out the study and expect replicability. Prepublication of the study protocol helps avoid biases such as post-hoc assignment of the primary outcome, in which the triallist can ‘cherry pick’ one of a number of outcomes that points in the desired direction.

Guidance on recommended items to include in a trial protocol is provided by the SPIRIT statement (latest version published in 2013),24 which has recently been adapted for trials with an AI-related focus, termed the ‘SPIRIT-AI’ guideline.21–23 This adaptation adds 15 items (12 extensions, 3 elaborations) to the existing 33-item SPIRIT 2013 guideline. The key differences, outlined in table 2, are mostly focused on trial methodology (accounting for eight extensions and one elaboration), with emphasis on the inclusion/exclusion of data and participants, the handling of poor-quality data, and how the AI intervention will be applied to and benefit clinical practice.

Table 2

Additional items proposed for studies relating to AI intervention clinical protocols within the SPIRIT-AI statement (in addition to the SPIRIT 2013 statement)

Clinical trials reports

While most AI studies are currently at the early-phase validation stage, those evaluating the use of ‘AI interventions’ in real-world settings are fast emerging and will become increasingly important, since these are required to demonstrate real-world clinical benefit. RCTs are the exemplar study design for providing a robust evidence base for the efficacy and safety of a given intervention, with the CONSORT statement (2010 version)14 providing a 25-item checklist of minimum reporting content for such studies. An adapted version, entitled the ‘CONSORT-AI’ extension,17–19 was published in September 2020 for ‘AI intervention’ studies. This adds 14 items (11 extensions, 3 elaborations) to the existing CONSORT 2010 statement, the majority of which (8 extensions, 1 elaboration) relate to the study participants and details of the ‘AI intervention’ being evaluated, and are similar to the additions already described in the SPIRIT-AI extension. Specific key differences in the new guideline are outlined in table 3. Although not specific to AI interventions, some aspects of the Template for Intervention Description and Replication checklist (2014)25 may be a helpful addition when reporting details of the interventional elements of a study (ie, as an extension of item 5 of the CONSORT 2010 statement or item 11 of the SPIRIT 2013 statement). These include details regarding any modifications of the intervention during a study, including how and why certain aspects were personalised or adapted. To the best of our knowledge, there are currently no publicly announced plans to publish an ‘AI’ extension to this guideline.

Table 3

Additional criteria to be included for studies relating to AI interventions within the CONSORT-AI statement (in addition to the CONSORT 2010 statement)

Diagnostic accuracy studies

The STARD statement (2015 version)20 is the most widely accepted reporting standard for diagnostic accuracy studies. A steering group has been established to devise an AI-specific extension to the latest version of the 30-item STARD statement (called the STARD-AI extension).26 At the time of writing, this is undergoing an international consensus survey among leaders in the AI field for suggested adaptations, with publication pending.

Prediction models

Extensions to reporting guidelines describing prediction models that use ML have been announced and are anticipated for publication soon. These include an adapted version of the ‘Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis’ statement (TRIPOD, 2015 version),27 to be entitled ‘TRIPOD-AI’,28 29 supported by an adaptation of the ‘Prediction model Risk Of Bias Assessment Tool’ (PROBAST, 2019 version),30 proposed to be entitled ‘PROBAST-ML’.28 29

Human factors

Another upcoming guideline, focused on the evaluation of ‘human factors’ in algorithm implementation, has been announced: the DECIDE-AI checklist (Developmental and Exploratory Clinical Investigation of Decision-support systems driven by AI).31 This checklist is intended for use in early, small-scale clinical trials that evaluate and provide information on how algorithms may be used in practice, bridging the gap between the algorithm development/validation stage (which would follow TRIPOD-AI, STARD-AI or the Checklist for Artificial Intelligence in Medical Imaging (CLAIM)) and large-scale clinical trials of AI interventions (where CONSORT-AI would be used). Publication is anticipated in late 2021 or early 2022.

Systematic reviews

Given the increasing volume of radiological AI-related research across a growing variety of conditions and clinical settings, it is also likely that we will encounter more systematic reviews and meta-analyses that aim to aggregate the evidence from studies in this field (eg, recent publications have already emerged that summarise research regarding the role of AI in COVID-19).32–34 At present, the ‘Preferred Reporting Items for Systematic Reviews and Meta-analyses’ (PRISMA, 2009)35 guidelines are the most established for systematic reviews and meta-analyses, with a modified version specifically tailored for meta-analyses of diagnostic test accuracy (ie, the PRISMA-Diagnostic Test Accuracy (DTA) extension, 2018).36 Currently, there have not been any announcements of an AI-related update to these guidelines for systematic reviews or meta-analyses, and it is therefore suggested that the PRISMA 200935 or PRISMA-DTA 201836 guidance should be followed.

For the planning stages of systematic reviews of prediction models, the ‘Checklist for critical appraisal and data extraction for systematic reviews of prediction modelling studies’ (CHARMS, 2014)37 was developed by the Cochrane Prognosis Methods Group. This was not created for AI-related publications per se, but is applicable to a wide range of studies, which also happen to include those evaluating ML models. The developers provide the checklist to help authors frame their review question, design the review, extract relevant items from published reports of prediction models and guide the assessment of risk of bias (rather than the analysis itself). This checklist will therefore be useful to those who wish to plan a review of AI tools that provide a ‘risk score’ or ‘probability of diagnosis’. A tutorial on how to carry out a ‘CHARMS analysis’ for prognostic multivariable models, with real-life worked examples, has been published38 and may be a helpful resource for readers wishing to carry out similar work. It is worth noting that the authors of CHARMS still recommend reference to the PRISMA 200935 and PRISMA-DTA 201836 statements for the reporting and analysis of trial results, in conjunction with their own checklist for planning the review design.

Other (NON-EQUATOR network) guidelines

Alternative guidelines have been published by expert interest groups and endorsed by various specialty societies. A few are described here for further reading and interest.

The Radiological Society of North America published the ‘CLAIM’ checklist39 in 2020, containing elements of the STARD 2015 guideline and applicable to studies addressing a wide spectrum of AI applications using medical images (eg, classification, reconstruction, text analysis, workflow optimisation). The checklist comprises 42 items, of which 6 are new (pertaining to model design and training), 8 are extensions of pre-existing STARD 2015 items, 14 are elaborations (mostly relating to methods and results) and 14 remain the same. Particular emphasis is given to the data, the reference standard of ‘ground truth’ and the precise development and methodology of the AI algorithm being tested. These are listed in further detail in table 4, where differences from the STARD 2015 statement are highlighted. Care should be taken to avoid confusion with another similarly named checklist, entitled ‘minimum information about clinical AI modelling’ (MI-CLAIM),40 which is less a reporting guideline than a document outlining the shared understanding required in the development and evaluation of AI models, aimed at clinical and data scientists, repository managers and model users.

Table 4

Criteria for the CLAIM checklist for diagnostic accuracy studies using AI

It is also worth noting that the American Medical Informatics Association produced a set of guidelines in 2020 termed the ‘MINimum Information for Medical AI Reporting’ (MINIMAR),41 specific to studies reporting the use of AI solutions in healthcare. Rather than a list of items for manuscript writing, this guidance provides suggestions for the details that should be reported about the data sources used in algorithm development and their intended usage, spread across four key subject areas (ie, study population and setting, patient demographics, model architecture and model evaluation). There are many similarities with the aforementioned CLAIM checklist, although key differences include the granularity with which MINIMAR suggests researchers should explicitly state participant demographics (eg, ethnicity and socioeconomic status, rather than just age and sex) and how code and data can be shared with the wider community.

Further reading

There is an increasing need to build a cadre of researchers and reviewers with sufficient domain knowledge of technical aspects (including limitations and risk) and of the principles of good trial methodology (including areas of potential bias, analysis issues, etc). There is also a need for ML experts and clinical trial communities to increasingly learn each other’s language, to ensure accurate and precise communication of concepts, and enable comparison between studies. A number of reviews are highlighted here for further reading42–46 along with work47 explaining different evaluation metrics used in AI and ML studies. It is also worth bearing in mind the wider clinical and ethical context of how any AI tool would fit into our existing clinical pathways and healthcare systems.48
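
To make the terminology concrete, the short Python sketch below illustrates how some of the evaluation metrics most commonly reported in diagnostic AI studies (sensitivity, specificity, positive predictive value and the area under the receiver operating characteristic curve) are derived from a model’s outputs. The labels, predicted probabilities and 0.5 operating threshold are purely illustrative assumptions and are not taken from any study cited above.

# Illustrative sketch only: toy labels and predictions, not data from any cited study.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                    # ground truth (1 = disease present)
y_score = np.array([0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.8, 0.3])   # model output probabilities

y_pred = (y_score >= 0.5).astype(int)                           # assumed operating threshold of 0.5
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)          # proportion of true cases detected
specificity = tn / (tn + fp)          # proportion of non-cases correctly ruled out
ppv = tp / (tp + fp)                  # positive predictive value (precision)
auc = roc_auc_score(y_true, y_score)  # threshold-independent discrimination

print(f"Sensitivity={sensitivity:.2f}, Specificity={specificity:.2f}, PPV={ppv:.2f}, AUC={auc:.2f}")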

Conclusion

In conclusion, this article has provided readers with an overview of the changes to standard clinical reporting guidelines that are specific to AI-related studies. The fundamentals of describing the trial setup and inclusion and exclusion criteria, detailing the study methodology and standards used, and providing details of algorithm development should create transparency and support reproducibility. The guidelines most relevant to a particular healthcare specialty will depend on the type of research being conducted in that field (eg, guidelines for AI-related diagnostic accuracy studies may be more relevant to radiological or pathological specialties, whereas those addressing patient outcomes with the aid of an AI algorithm may be more relevant to oncological or surgical specialties).

Although the reporting guidelines outlined here may seem comprehensive, there remain areas that will need to be addressed, such as the health economic evaluation of AI tools and algorithms (many current guidelines were developed for ‘pharmacoeconomic evaluations’).49 It is likely that future guidelines will take the form of an extension to the widely used CHEERS guidance (Consolidated Health Economic Evaluation Reporting Standards),50 51 available via the EQUATOR network.13 Nevertheless, wide variation in opinion regarding the most appropriate economic evaluation guideline already exists for non-AI related tools, and this may be reflected in future iterations of such guidelines, depending on how the algorithms are funded in different healthcare systems.52

The current guidelines outlined here will likely continue to be updated in light of new understanding of the specific challenges of AI as an intervention, and of how traditional study designs and reports need to be adapted.

Data availability statement

Data sharing not applicable as no datasets generated and/or analysed for this study.

Ethics statements

Ethics approval

Not required

References

Footnotes

  • Funding OJA is funded by a National Institute for Health Research (NIHR) Career Development Fellowship (NIHR-CDF-2017-10-037). SS, OJA and NJS receive funding from the Great Ormond Street Children’s Charity and the Great Ormond Street Hospital NIHR Biomedical Research Centre. AD receives funding from Health Data Research UK, an initiative funded by UK Research and Innovation, the Department of Health and Social Care (England), the devolved administrations and leading medical research charities.

  • Disclaimer The funding source(s) did not have any direct involvement in the methodology, design or write-up of this review article.

  • Competing interests None declared.

  • Patient and public involvement statement Not required

  • Provenance and peer review Not commissioned; externally peer reviewed.