Equity in essence: a call for operationalising fairness in machine learning for healthcare
Introduction
Machine learning for healthcare (MLHC) is at the juncture of leaping from the pages of journals and conference proceedings to clinical implementation at the bedside. Succeeding in this endeavour requires the synthesis of insights from both the machine learning and healthcare domains, in order to ensure that the unique characteristics of MLHC are leveraged to maximise benefits and minimise risks. An important part of this effort is establishing and formalising processes and procedures for characterising these tools and assessing their performance. Meaningful progress in this direction can be found in recently developed guidelines for the development of MLHC models,1 guidelines for the design and reporting of MLHC clinical trials,2 3 and protocols for the regulatory assessment of MLHC tools.4 5
But while such guidelines and protocols engage extensively with relevant technical considerations, engagement with issues of fairness, bias and unintended disparate impact is lacking. Such issues have taken on a place of prominence in the broader ML community,6–9 with recent work highlighting issues such as racial disparities in the accuracy of facial recognition and gender classification software,6 10 gender bias in the output of natural language processing models11 12 and racial bias in algorithms for bail and criminal sentencing.13 MLHC is not immune to these concerns, as seen in disparate outcomes from algorithms for allocating healthcare resources,14 15 bias in language models developed on clinical notes16 and melanoma detection models developed primarily on images of light-coloured skin.17 Within this paper, we examine the inclusion of fairness in recent guidelines for MLHC model reporting, clinical trials and regulatory approval. We highlight opportunities to ensure that fairness is made fundamental to MLHC, and examine how this can be operationalised in the MLHC context.
Fairness as an afterthought?
Model development and trial reporting guidelines
Several recent documents have attempted, with varying degrees of practical implication, to enumerate guiding principles for MLHC. Broadly, these documents do an excellent job of highlighting artificial intelligence (AI)-specific technical and operational concerns, such as how to handle human-AI interaction, or how to account for model performance errors. Yet as outlined in table 1, references to fairness are either conspicuously absent, made merely in passing, or relegated to supplemental discussion.
Table 1 | Fairness in recently released and upcoming guidelines
Notable examples are the recent Standard Protocol Items: Recommendations for Interventional Trials-AI (SPIRIT-AI)2 and Consolidated Standards of Reporting Trials-AI (CONSORT-AI)3 extensions, which expand prominent guidelines for the design and reporting of clinical trials to cover concerns specific to AI. While the latter states in the discussion that ‘investigators should also be encouraged to explore differences in performance and error rates across population subgroups’,3 the concept is not formally incorporated into the guideline itself. Similarly, the announcement papers for the upcoming Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis-ML (TRIPOD-ML)18 and Standards for Reporting of Diagnostic Accuracy Studies-AI (STARD-AI)19 guidelines for model reporting do not allude to these issues (though we wait in anticipation of their potential inclusion in the final versions of these guidelines). While recently published guidelines from the editors of respiratory, sleep and critical care medicine journals engage with the concept in an exemplary fashion, the depth of their discussion is relegated to a supplementary segment of the paper.1
Regulatory guidance
Broadly, the engagement of prominent regulatory bodies with MLHC remains at a preliminary stage, and their engagement with fairness tends to be either minimal or vague. The Food and Drug Administration in the USA has made significant strides towards modernising its frameworks for the approval and regulation of software-based medical interventions, including MLHC tools.5 Its documents engage broadly with technical concerns and criteria for effective clinical evaluation, but entirely lack discussion of fairness or of the relationship between these tools and the broader health equity context.20 The Canadian Agency for Drugs and Technologies in Health has explicitly highlighted the need for fairness and bias to be considered, but further elaboration is lacking.21
The work of the European Union on this topic likewise remains broad in scope.4 While its documents make reference to principles of ‘diversity, non-discrimination and fairness’, they do so in a very general manner without clearly operationalised specifics.22 23 The engagement of the UK with MLHC is relatively advanced, with several prominent reports engaging with the topic,24–26 and an explicit ‘Code of Conduct for Data-Driven Healthcare Technology’27 from the Department of Health and Social Care that highlights the need for fairness. However, the specifics of this regulatory approach are still being decided, and no clear guidance has yet been put forth to clarify these principles in practice.28 MLHC as a whole would benefit from increased clarity and force in the regulatory guidance of these major agencies.29
Operationalising fairness in MLHC practice
If fairness is an afterthought in the design and reporting of MLHC papers and trials, as well as in regulatory processes, it is likely to remain an afterthought in the development and implementation of MLHC tools. If MLHC is to prove effective for, and be trusted by, a diverse range of patients, fairness cannot be a post-hoc consideration. Nor is it sufficient for fairness to be a vague abstraction of academic importance but ineffectual consequence. The present moment affords a tremendous opportunity to define MLHC such that fairness is integral, and to ensure that this commitment is reflected in model reporting guidelines, clinical trial guidelines and regulatory approaches.
However, moving from vague commitments to fairness to practical and effective guidance is far from trivial. As work in the machine learning community has demonstrated, fairness has multiple definitions which can occasionally be incompatible,7 and bias can arise from a complex range of sources.30 Operationalisation of fairness must therefore be context-specific, and the way it is operationalised embeds particular values in a field. We call for concerted effort from the MLHC community, and in particular from the groups responsible for the development and propagation of guidelines, to affirm a commitment to fairness in an explicit and operationalised fashion. Similarly, we call on the various regulatory agencies to establish clear minimum standards for AI fairness. In box 1, we highlight a non-exhaustive series of recommendations that are likely to be beneficial as the MLHC community engages in this endeavour.
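Before turning to those recommendations, it is worth making the incompatibility of fairness definitions concrete. The following minimal Python sketch uses entirely hypothetical labels and predictions for two groups with different base rates; the two metrics shown (demographic parity and equal opportunity) are illustrative choices on our part, not metrics prescribed by any of the guidelines discussed above.

```python
# Minimal sketch: two common fairness definitions evaluated on hypothetical
# data. All labels and predictions below are invented for illustration.

def rates(y_true, y_pred):
    """Return (positive prediction rate, true positive rate) for one group."""
    ppr = sum(y_pred) / len(y_pred)
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    tpr = sum(preds_for_positives) / len(preds_for_positives)
    return ppr, tpr

# Group A has a 50% base rate of the true condition; group B has 25%.
group_a_true = [1, 1, 0, 0, 1, 0, 1, 0]
group_a_pred = [1, 1, 0, 0, 1, 0, 0, 1]
group_b_true = [1, 0, 0, 0, 1, 0, 0, 0]
group_b_pred = [1, 0, 1, 0, 0, 1, 0, 1]

ppr_a, tpr_a = rates(group_a_true, group_a_pred)
ppr_b, tpr_b = rates(group_b_true, group_b_pred)

# Demographic parity compares positive prediction rates across groups;
# equal opportunity compares true positive rates.
print(f"Demographic parity gap: {abs(ppr_a - ppr_b):.2f}")  # 0.00: satisfied
print(f"Equal opportunity gap:  {abs(tpr_a - tpr_b):.2f}")  # 0.25: violated
```

In this toy example the model issues positive predictions at identical rates in both groups, satisfying demographic parity, yet detects true cases far less reliably in group B. Which of the two gaps matters more is precisely the kind of context-specific, value-laden choice that guidelines must make explicit.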
Box 1
Recommendations for operationalising fairness
Engage members of the public and in particular members of marginalised communities in the process of determining acceptable fairness standards.
Collect necessary data on vulnerable protected groups in order to perform audits of model function (eg, on race, gender).
Analyse and report model performance for different intersectional subpopulations at risk of unfair outcomes (a minimal illustrative sketch follows this box).
Establish target thresholds and maximum disparities for model function between groups.
Be transparent regarding the specific definitions of fairness that are used in the evaluation of a machine learning for healthcare (MLHC) model.
Explicitly evaluate for disparate treatment and disparate impact in MLHC clinical trials.
Commit to postmarketing surveillance to assess the ongoing real-world impact of MLHC models.
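To illustrate how several of these recommendations (intersectional subgroup reporting, transparent metric choice and a pre-specified maximum disparity) might fit together, the sketch below performs a toy audit in Python. The records, protected attributes, accuracy metric and 0.05 threshold are all hypothetical placeholders for exposition, not values endorsed by any guideline, and a real audit would require far larger samples per subgroup.

```python
from collections import defaultdict

# Hypothetical records: (race, gender, true label, predicted label).
# A real audit would draw these from a held-out clinical dataset.
records = [
    ("black", "female", 1, 1), ("black", "female", 0, 0),
    ("black", "male",   1, 0), ("black", "male",   0, 0),
    ("white", "female", 1, 1), ("white", "female", 0, 0),
    ("white", "male",   1, 1), ("white", "male",   0, 1),
]

# Group records by the intersection of the two protected attributes.
subgroups = defaultdict(list)
for race, gender, y_true, y_pred in records:
    subgroups[(race, gender)].append((y_true, y_pred))

# Report the chosen metric (accuracy here) per intersectional subgroup.
accuracy = {}
for key, pairs in sorted(subgroups.items()):
    accuracy[key] = sum(t == p for t, p in pairs) / len(pairs)
    print(f"{key}: accuracy {accuracy[key]:.2f} (n={len(pairs)})")

# Fail the audit if the gap between the best- and worst-served subgroups
# exceeds a pre-specified maximum disparity (0.05 is a placeholder).
MAX_DISPARITY = 0.05
gap = max(accuracy.values()) - min(accuracy.values())
print(f"Max accuracy gap: {gap:.2f} ->", "FAIL" if gap > MAX_DISPARITY else "PASS")
```

The design choice worth noting is that the disparity threshold is fixed before the audit is run; choosing it after seeing the results would defeat the purpose of a pre-specified standard.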
Conclusion
Values are embedded throughout the MLHC pipeline, from the design of models, to the execution and reporting of trials, to the regulatory approval process. Guidelines hold significant power in defining what is worthy of emphasis. While fairness is essential to the impact and consequences of MLHC tools, the concept is often conspicuously absent or ineffectually vague in emerging guidelines. The field of MLHC has the opportunity at this juncture to render fairness integral to its identity. We call on the MLHC community to commit to the project of operationalising fairness, and to emphasise fairness as a requirement in practice.
Contributors: Initial conceptions and design: JWG, LGM, MG and LAC. Drafting of the paper: LGM, JWG, MG and LAC. Critical revision of the paper for important intellectual content: JWG, LGM, MG and LAC.
Funding: Division of Electrical, Communications and Cyber Systems (1928481), National Institute of Biomedical Imaging and Bioengineering (EB017205).
Competing interests: MG acts as an advisor to Radical Ventures in Toronto.
Provenance and peer review: Not commissioned; externally peer reviewed.
References
Leisman DE, Harhay MO, Lederer DJ, et al. Development and reporting of prediction models: guidance for authors from editors of respiratory, sleep, and critical care journals. Crit Care Med 2020;48:623–33. doi:10.1097/CCM.0000000000004246
Cruz Rivera S, Liu X, Chan A-W, et al. Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI extension. Nat Med 2020;26:1351–63. doi:10.1038/s41591-020-1037-7
Liu X, Cruz Rivera S, Moher D, et al. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Nat Med 2020;26:1364–74. doi:10.1038/s41591-020-1034-x
Cohen IG, Evgeniou T, Gerke S, et al. The European artificial intelligence strategy: implications and challenges for digital health. Lancet Digit Health 2020;2:e376–9. doi:10.1016/S2589-7500(20)30112-6
FDA. Artificial intelligence and machine learning in software as a medical device. 2020.
De-Arteaga M, Romanov A, Wallach H, et al. Bias in bios: a case study of semantic representation bias in a high-stakes setting. 2019.
Klare BF, Burge MJ, Klontz JC, et al. Face recognition performance: role of demographic information. IEEE Trans Inf Forensics Secur 2012;7:1789–801. doi:10.1109/TIFS.2012.2214212
Caliskan A, Bryson JJ, Narayanan A. Semantics derived automatically from language corpora contain human-like biases. Science 2017;356:183–6. doi:10.1126/science.aal4230
Bordia S, Bowman SR. Identifying and reducing gender bias in word-level language models. arXiv:1904.03035 [cs]. 2019.
Sounderajah V, Ashrafian H, Aggarwal R, et al. Developing specific reporting guidelines for diagnostic accuracy studies assessing AI interventions: the STARD-AI steering group. Nat Med 2020;26:807–8. doi:10.1038/s41591-020-0941-1
Ferryman K. Addressing health disparities in the Food and Drug Administration's artificial intelligence and machine learning regulatory framework. J Am Med Inform Assoc 2020;27:2016–9. doi:10.1093/jamia/ocaa133
Mason A, Morrison A, Visintini S, et al. An overview of clinical applications of artificial intelligence. Ottawa: CADTH, 2018.
European Commission. COM(2019) 168 final: building trust in human-centric artificial intelligence. 2019.
European Commission. White paper on artificial intelligence: a European approach to excellence and trust. 2020.
Tankelevitch L, Ahn A, Paterson R, et al. Advancing AI in the NHS. 2018.
Fenech M, Strukelj N, Buston O, et al. Ethical, social, and political challenges of artificial intelligence in health. London: Wellcome Trust/Future Advocacy, 2018.
Topol E. The Topol review: preparing the healthcare workforce to deliver the digital future. Health Education England, 2019.
Department of Health and Social Care. Code of conduct for data-driven health and care technology. 2019.
Mongan J, Moy L, Kahn CE. Checklist for artificial intelligence in medical imaging (CLAIM): a guide for authors and reviewers. Radiol Artif Intell 2020;2. doi:10.1148/ryai.2020200029