Equity in essence: a call for operationalising fairness in machine learning for healthcare
==========================================================================================

* Judy Wawira Gichoya
* Liam G McCoy
* Leo Anthony Celi
* Marzyeh Ghassemi

*BMJ Health Informatics*

## Introduction

Machine learning for healthcare (MLHC) is at the juncture of leaping from the pages of journals and conference proceedings to clinical implementation at the bedside. Succeeding in this endeavour requires synthesising insights from both the machine learning and healthcare domains, to ensure that the unique characteristics of MLHC are leveraged to maximise benefits and minimise risks. An important part of this effort is establishing and formalising processes and procedures for characterising these tools and assessing their performance. Meaningful progress in this direction can be found in recently developed guidelines for the development of MLHC models,1 guidelines for the design and reporting of MLHC clinical trials,2 3 and protocols for the regulatory assessment of MLHC tools.4 5

But while such guidelines and protocols engage extensively with relevant technical considerations, engagement with issues of fairness, bias and unintended disparate impact is lacking. Such issues have taken on a place of prominence in the broader machine learning community,6–9 with recent work highlighting racial disparities in the accuracy of facial recognition and gender classification software,6 10 gender bias in the output of natural language processing models11 12 and racial bias in algorithms for bail and criminal sentencing.13 MLHC is not immune to these concerns, as seen in disparate outcomes from algorithms for allocating healthcare resources,14 15 bias in language models developed on clinical notes16 and melanoma detection models developed primarily on images of light-coloured skin.17

In this paper, we examine the inclusion of fairness in recent guidelines for MLHC model reporting, clinical trials and regulatory approval. We highlight opportunities to ensure that fairness is made fundamental to MLHC, and examine how this can be operationalised for the MLHC context.

## Fairness as an afterthought?

### Model development and trial reporting guidelines

Several recent documents have attempted, with varying degrees of practical specificity, to enumerate guiding principles for MLHC. Broadly, these documents do an excellent job of highlighting artificial intelligence (AI)-specific technical and operational concerns, such as how to handle human-AI interaction, or how to account for model performance errors. Yet as outlined in table 1, references to fairness are either conspicuously absent, made merely in passing, or relegated to supplemental discussion.

[Table 1](http://informatics.bmj.com/content/28/1/e100289/T1): Fairness in recently released and upcoming guidelines

Notable examples are the recent Standard Protocol Items: Recommendations for Interventional Trials-AI (SPIRIT-AI)2 and Consolidated Standards of Reporting Trials-AI (CONSORT-AI)3 extensions, which expand prominent guidelines for the design and reporting of clinical trials to include concerns relevant to AI. While the latter states in the discussion that ‘investigators should also be encouraged to explore differences in performance and error rates across population subgroups’,3 the concept is not included more formally in the guideline itself.
Similarly, the announcement papers for the upcoming Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis-ML (TRIPOD-ML)18 and Standards for Reporting of Diagnostic Accuracy Studies AI Extension (STARD-AI)19 guidelines for model reporting do not allude to these issues (though we wait in anticipation for their potential inclusion in the final versions of these guidelines). While recently published guidelines from the editors of respiratory, sleep and critical care medicine journals engage with the concept in an exemplary fashion, the depth of their discussion is relegated to a supplementary segment of the paper.1

### Regulatory guidance

Broadly, the engagement of prominent regulatory bodies with MLHC remains at a preliminary stage, and engagement with fairness tends to be either minimal or vague. The Food and Drug Administration in the USA has made significant strides towards modernising its frameworks for the approval and regulation of software-based medical interventions, including MLHC tools.5 Its documents engage broadly with technical concerns and criteria for effective clinical evaluation, but entirely lack discussion of fairness or of the relationship between these tools and the broader health equity context.20 The Canadian Agency for Drugs and Technologies in Health has explicitly highlighted the need for fairness and bias to be considered, but further elaboration is lacking.21 The work of the European Union on this topic remains at an early, high-level stage.4 While its documents do make reference to principles of ‘diversity, non-discrimination and fairness’, they do so broadly and without clearly operationalised specifics.22 23 The engagement of the UK with MLHC is relatively advanced, with several prominent reports engaging with the topic,24–26 and an explicit ‘Code of Conduct for Data-Driven Healthcare Technology’27 from the Department of Health and Social Care that highlights the need for fairness. However, the specifics of this regulatory approach are still being decided, and no clear guidance has yet been put forth to clarify these principles in practice.28 MLHC as a whole would benefit from increased clarity and force in regulatory guidance from these major agencies.29

## Operationalising fairness in MLHC practice

If fairness is an afterthought in the design and reporting of MLHC papers and trials, as well as in regulatory processes, it is likely to remain an afterthought in the development and implementation of MLHC tools. If MLHC is going to prove effective for, and be trusted by, a diverse range of patients, fairness cannot be a post-hoc consideration. Nor is it sufficient for fairness to be a vague abstraction of academic importance but little practical consequence. The present moment affords a tremendous opportunity to define MLHC such that fairness is integral, and to ensure that this commitment is reflected in model reporting guidelines, clinical trial guidelines and regulatory approaches.

However, moving from vague commitments to fairness to practical and effective guidance is far from a trivial task. As work in the machine learning community has demonstrated, fairness has multiple definitions which can occasionally be incompatible,7 and bias can arise from a complex range of sources.30 The operationalisation of fairness must therefore be context-specific, and the choices made in doing so embed particular values in the field.
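As a purely illustrative sketch (not drawn from the paper or from any cited guideline), the Python snippet below shows how two common group-fairness definitions, demographic parity and equalised odds, might be computed and compared for a binary classifier during a subgroup audit. The groups, outcome rates and the 0.05 disparity threshold are all synthetic assumptions chosen for demonstration; in practice the relevant subpopulations, metrics and acceptable thresholds would have to be defined in context, with input from affected communities.

```python
# Illustrative sketch only: a minimal subgroup audit of a binary classifier,
# using synthetic data. Group labels, rates and thresholds are assumptions
# for demonstration, not values recommended by the paper.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical evaluation set: a protected attribute with two groups,
# true outcomes and model scores thresholded into predictions.
n = 10_000
group = rng.choice(["A", "B"], size=n, p=[0.7, 0.3])
y_true = rng.binomial(1, np.where(group == "A", 0.30, 0.45))
# Scores deliberately noisier for group B, so the two checks below can differ.
noise = np.where(group == "A", 0.15, 0.30)
y_score = np.clip(y_true * 0.6 + rng.normal(0.2, noise, size=n), 0, 1)
y_pred = (y_score >= 0.5).astype(int)

def subgroup_metrics(y_true, y_pred, mask):
    """Selection rate and true/false positive rates within one subgroup."""
    yt, yp = y_true[mask], y_pred[mask]
    return {
        "selection_rate": yp.mean(),   # P(prediction = 1 | group)
        "tpr": yp[yt == 1].mean(),     # sensitivity within the group
        "fpr": yp[yt == 0].mean(),     # 1 - specificity within the group
    }

report = {g: subgroup_metrics(y_true, y_pred, group == g) for g in ["A", "B"]}
for g, m in report.items():
    print(g, {k: round(v, 3) for k, v in m.items()})

# Two common (and sometimes incompatible) group-fairness checks:
# demographic parity compares selection rates; equalised odds compares
# error rates (TPR/FPR) across groups.
dp_gap = abs(report["A"]["selection_rate"] - report["B"]["selection_rate"])
eo_gap = max(abs(report["A"]["tpr"] - report["B"]["tpr"]),
             abs(report["A"]["fpr"] - report["B"]["fpr"]))
print(f"demographic parity gap: {dp_gap:.3f}")
print(f"equalised odds gap:     {eo_gap:.3f}")

# A hypothetical maximum-disparity threshold of the kind a guideline or
# regulator might set; 0.05 is arbitrary and used only for illustration.
MAX_DISPARITY = 0.05
print("within threshold:", dp_gap <= MAX_DISPARITY and eo_gap <= MAX_DISPARITY)
```

Because selection rates and error rates capture different notions of fairness, a model can look acceptable under one check and not the other, which is one reason guidelines need to state explicitly which definition is being applied and what disparity is tolerable.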
We call for a concerted effort from the MLHC community, and in particular from the groups responsible for the development and propagation of guidelines, to affirm a commitment to fairness in an explicit and operationalised fashion. Similarly, we call on the various regulatory agencies to establish clear minimum standards for AI fairness. In box 1, we highlight a non-exhaustive series of recommendations that are likely to be beneficial as the MLHC community engages in this endeavour.

Box 1

### Recommendations for operationalising fairness

#### Recommendations

* Engage members of the public, and in particular members of marginalised communities, in the process of determining acceptable fairness standards.
* Collect the data on vulnerable protected groups (eg, race, gender) necessary to perform audits of model function.
* Analyse and report model performance for different intersectional subpopulations at risk of unfair outcomes.
* Establish target thresholds and maximum permissible disparities in model performance between groups.
* Be transparent regarding the specific definitions of fairness used in the evaluation of a machine learning for healthcare (MLHC) model.
* Explicitly evaluate for disparate treatment and disparate impact in MLHC clinical trials.
* Commit to postmarketing surveillance to assess the ongoing real-world impact of MLHC models.

## Conclusion

Values are embedded throughout the MLHC pipeline, from the design of models, to the execution and reporting of trials, to the regulatory approval process. Guidelines hold significant power in defining what is worthy of emphasis. While fairness is essential to the impact and consequences of MLHC tools, the concept is often conspicuously absent or ineffectually vague in emerging guidelines. The field of MLHC has the opportunity at this juncture to render fairness integral to the identity of the field. We call on the MLHC community to commit to the project of operationalising fairness, and to emphasise fairness as a requirement in practice.

## Footnotes

* Twitter @judywawira, @liamgmccoy, @MITCriticalData
* Contributors Initial conception and design: JWG, LGM, MG and LAC. Drafting of the paper: LGM, JWG, MG and LAC. Critical revision of the paper for important intellectual content: JWG, LGM, MG and LAC.
* Funding Division of Electrical, Communications and Cyber Systems (1928481), National Institute of Biomedical Imaging and Bioengineering (EB017205).
* Competing interests MG acts as an advisor to Radical Ventures in Toronto.
* Provenance and peer review Not commissioned; externally peer reviewed.
* Received November 22, 2020.
* Revision received February 7, 2021.
* Accepted February 9, 2021.
* © Author(s) (or their employer(s)) 2021. Re-use permitted under CC BY-NC. No commercial re-use. See rights and permissions. Published by BMJ. [http://creativecommons.org/licenses/by-nc/4.0/](http://creativecommons.org/licenses/by-nc/4.0/)

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: [http://creativecommons.org/licenses/by-nc/4.0/](http://creativecommons.org/licenses/by-nc/4.0/).

## References

1. Leisman DE, Harhay MO, Lederer DJ, et al.
Development and reporting of prediction models: guidance for authors from editors of respiratory, sleep, and critical care journals. Crit Care Med 2020;48:623–33. [doi:10.1097/CCM.0000000000004246](http://dx.doi.org/10.1097/CCM.0000000000004246)
2. Cruz Rivera S, Liu X, Chan A-W, et al. Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI extension. Nat Med 2020;26:1351–63. [doi:10.1038/s41591-020-1037-7](http://dx.doi.org/10.1038/s41591-020-1037-7)
3. Liu X, Cruz Rivera S, Moher D, et al. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Nat Med 2020;26:1364–74. [doi:10.1038/s41591-020-1034-x](http://dx.doi.org/10.1038/s41591-020-1034-x)
4. Cohen IG, Evgeniou T, Gerke S, et al. The European artificial intelligence strategy: implications and challenges for digital health. Lancet Digit Health 2020;2:e376–9. [doi:10.1016/S2589-7500(20)30112-6](http://dx.doi.org/10.1016/S2589-7500(20)30112-6)
5. FDA. Artificial intelligence and machine learning in software as a medical device, 2020. Available: [https://www.fda.gov/medical-devices/software-medical-device-samd/artificial-intelligence-and-machine-learning-software-medical-device](https://www.fda.gov/medical-devices/software-medical-device-samd/artificial-intelligence-and-machine-learning-software-medical-device) [Accessed 11 Oct 2020].
6. Buolamwini J, Gebru T. Gender shades: intersectional accuracy disparities in commercial gender classification. Conference on Fairness, Accountability and Transparency, 2018:77–91.
7. Gajane P, Pechenizkiy M. On formalizing fairness in prediction with machine learning. arXiv:1710.03184 [cs, stat], 2018. Available: [http://arxiv.org/abs/1710.03184](http://arxiv.org/abs/1710.03184) [Accessed 20 Sept 2020].
8. Mehrabi N, Morstatter F, Saxena N, et al. A survey on bias and fairness in machine learning. arXiv:1908.09635 [cs], 2019. Available: [http://arxiv.org/abs/1908.09635](http://arxiv.org/abs/1908.09635) [Accessed 11 Oct 2020].
9. De-Arteaga M, Romanov A, Wallach H. Bias in bios: a case study of semantic representation bias in a high-stakes setting. Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT* ’19), 2019:120–8.
10. Klare BF, Burge MJ, Klontz JC, et al. Face recognition performance: role of demographic information. IEEE Transactions on Information Forensics and Security 2012;7:1789–801. [doi:10.1109/TIFS.2012.2214212](http://dx.doi.org/10.1109/TIFS.2012.2214212)
11. Caliskan A, Bryson JJ, Narayanan A. Semantics derived automatically from language corpora contain human-like biases.
Science 2017;356:183–6. [doi:10.1126/science.aal4230](http://dx.doi.org/10.1126/science.aal4230)
12. Bordia S, Bowman SR. Identifying and reducing gender bias in word-level language models. arXiv:1904.03035 [cs], 2019. Available: [http://arxiv.org/abs/1904.03035](http://arxiv.org/abs/1904.03035) [Accessed 30 Jan 2021].
13. Huq AZ. Racial equity in algorithmic criminal justice. Duke LJ 2018;68:1043.
14. Obermeyer Z, Powers B, Vogeli C, et al. Dissecting racial bias in an algorithm used to manage the health of populations. Science 2019;366:447–53. [doi:10.1126/science.aax2342](http://dx.doi.org/10.1126/science.aax2342)
15. Benjamin R. Assessing risk, automating racism. Science 2019;366:421–2. [doi:10.1126/science.aaz3873](http://dx.doi.org/10.1126/science.aaz3873)
16. Zhang H, Lu AX, Abdalla M, et al. Hurtful words: quantifying biases in clinical contextual word embeddings. Proceedings of the ACM Conference on Health, Inference, and Learning (CHIL ’20). Association for Computing Machinery, 2020:110–20.
17. Adamson AS, Smith A. Machine learning and health care disparities in dermatology. JAMA Dermatol 2018;154:1247–8. [doi:10.1001/jamadermatol.2018.2348](http://dx.doi.org/10.1001/jamadermatol.2018.2348)
18. Collins GS, Moons KGM. Reporting of artificial intelligence prediction models. Lancet 2019;393:1577–9. [doi:10.1016/S0140-6736(19)30037-6](http://dx.doi.org/10.1016/S0140-6736(19)30037-6)
19. Sounderajah V, Ashrafian H, Aggarwal R, et al. Developing specific reporting guidelines for diagnostic accuracy studies assessing AI interventions: the STARD-AI steering group.
Nat Med 2020;26:807–8. [doi:10.1038/s41591-020-0941-1](http://dx.doi.org/10.1038/s41591-020-0941-1)
20. Ferryman K. Addressing health disparities in the Food and Drug Administration's artificial intelligence and machine learning regulatory framework. J Am Med Inform Assoc 2020;27:2016–9. [doi:10.1093/jamia/ocaa133](http://dx.doi.org/10.1093/jamia/ocaa133)
21. Mason A, Morrison A, Visintini S. An overview of clinical applications of artificial intelligence. Ottawa: CADTH, 2018.
22. European Commission. COM(2019) 168 final: building trust in human-centric artificial intelligence, 2019.
23. European Commission. White paper on artificial intelligence – a European approach to excellence and trust, 2020.
24. Tankelevitch L, Ahn A, Paterson R. Advancing AI in the NHS, 2018.
25. Fenech M, Strukelj N, Buston O. Ethical, social, and political challenges of artificial intelligence in health. London: Wellcome Trust Future Advocacy, 2018.
26. Topol E. The Topol review: preparing the healthcare workforce to deliver the digital future. Health Education England, 2019.
27. Department of Health and Social Care. Code of conduct for data-driven health and care technology, 2019. Available: [https://www.gov.uk/government/publications/code-of-conduct-for-data-driven-health-and-care-technology/initial-code-of-conduct-for-data-driven-health-and-care-technology](https://www.gov.uk/government/publications/code-of-conduct-for-data-driven-health-and-care-technology/initial-code-of-conduct-for-data-driven-health-and-care-technology) [Accessed 1 Aug 2020].
28. NHS. Regulating AI in health and care - Technology in the NHS. Available: [https://healthtech.blog.gov.uk/2020/02/12/regulating-ai-in-health-and-care/](https://healthtech.blog.gov.uk/2020/02/12/regulating-ai-in-health-and-care/) [Accessed 12 Oct 2020].
29. Parikh RB, Obermeyer Z, Navathe AS. Regulation of predictive analytics in medicine. Science 2019;363:810–2. [doi:10.1126/science.aaw0029](http://dx.doi.org/10.1126/science.aaw0029)
30. Suresh H, Guttag JV. A framework for understanding unintended consequences of machine learning. arXiv:1901.10002 [cs, stat], 2020. Available: [http://arxiv.org/abs/1901.10002](http://arxiv.org/abs/1901.10002) [Accessed 20 Sept 2020].
31. Mongan J, Moy L, Kahn CE. Checklist for artificial intelligence in medical imaging (CLAIM): a guide for authors and reviewers. Radiology: Artificial Intelligence 2020;2:e200029. [doi:10.1148/ryai.2020200029](http://dx.doi.org/10.1148/ryai.2020200029)