AI arrogance and ignorance
The current system means that the risk of sharing one's data is high, with little personal gain. While these risks are real to the institution, failing to disclose data does not eliminate the risk; it merely transfers it from the institution to the patients being treated on the basis of the research. Thus, those whom we claim to be helping must carry the risk of our own arrogance and ignorance, which may be worse than fatal: one person's data may worsen the outcomes of another human being who 'does not look like you'. This problem can be further exacerbated by reasoning that artificial intelligence (AI) methods such as the synthetic minority oversampling technique (SMOTE) will simply 'fix' issues such as sex and race data imbalances. AI has introduced effective new methods, such as SMOTE, that can advance medical and social aims, but it is not a 'cure all'; it is a specific methodological tool.44 Popular interpretation methods such as local interpretable model-agnostic explanations (LIME) and SHapley Additive exPlanations (SHAP) have their own limitations, such as reducing the model to a localised linear approximation or assigning attribution values to covariates that are in reality collinear.45 These limitations are not sufficient reason to discard such methods, but they should be executed thoughtfully rather than blindly. Methods at the intersection of AI and causal frameworks, which pose counterfactual scenarios about outcomes based on attributes, should not be applied indiscriminately to features that are conditional on one another.46 For example, to understand the survival expectancy of a male patient had they instead been female, other attributes such as occupation, income level, age and race would need to be considered holistically.
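To make the point concrete, the minimal sketch below (assuming the open-source scikit-learn and imbalanced-learn packages and an invented toy data set, not any data discussed here) shows that SMOTE equalises class counts by interpolating synthetic minority records; it cannot add information about a subgroup that was never adequately sampled.

```python
# Minimal sketch: SMOTE rebalances class counts by interpolating
# synthetic minority samples, but it cannot recover subgroup signal
# that was never collected in the first place.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Toy cohort with a 9:1 class imbalance standing in for an
# under-represented demographic group (illustrative data only).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print("before:", Counter(y))  # roughly {0: 900, 1: 100}

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))  # class counts now equal

# Caveat: the new minority rows are interpolations of existing ones;
# balanced counts can create a false sense of representativeness.
```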
Historically, the tools and software used for research are specified in publications, but code sharing is newer and less frequently incorporated into the publication or its supplement. As AI and coding are linked, so are AI arrogance and the lack of code sharing and transparency. Much like data under a DAS, code is 'available on request'. While the true availability of the data outlined in DASs has begun to be studied, code sharing is not well researched and has likely been examined mostly within computational journals.47 While tools and software may by nature use graphical user interfaces (GUIs) whose operation cannot be automatically reproduced, coding scripts can simply be rerun. Code sharing is possible through Git and providers such as GitHub and GitLab, but there are legal, technical and reputational risks associated with sharing source code. These span from how deidentification is conducted to critiques of ways the code could be more methodologically robust, scalable or elegant (fewer lines of code). By turning the research focus back to patient centricity, the risks posed by code sharing become small compared with the costs of non-reproducibility and foregone model improvements.
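As one hedged illustration of the deidentification concern, the sketch below (standard-library Python; the identifier format, salt and function name are hypothetical) shows the kind of salted-hash pseudonymisation step that shared analysis code must handle correctly, and that reviewers of shared code are likely to scrutinise.

```python
# Hypothetical sketch: stable, salted pseudonyms allow a shared script
# to be rerun without exposing original patient identifiers.
import hashlib

def pseudonymise(patient_id: str, salt: str) -> str:
    """Map a patient identifier to a stable, non-reversible pseudonym.

    The salt must be kept private and out of the shared repository;
    without it, the mapping cannot be brute-forced from known ID formats.
    """
    digest = hashlib.sha256((salt + patient_id).encode("utf-8")).hexdigest()
    return digest[:12]

# Illustrative identifier only (not a real record).
print(pseudonymise("MRN-00042", salt="keep-this-secret"))
```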
Continuous improvement process and validation
A discontinuous and stochastic approach dominates current quality improvement; in the future, a mindset shift towards a data-centric and systems-based methodology should be leveraged. To make data sharing a more frequent reality that acts in service of the patient, incremental change is required at the organisation, researcher and data set levels. A continuous improvement process for data sharing means iterating on the parts of the process that cause failure. It is distinct from the data management plan: while a data management plan is created before or during data sharing and is largely complete once the data are shared, a continuous improvement process is cyclical. While a continuous improvement process has technical aspects, it is driven by an organisation's consideration of how to serve both the patient and the research community.48
Typically, the data sharing process comprises how, and by whom, a data sharing inquiry is received; the approval process; the data transfer and/or sharing; and clarification and follow-up support. A continuous improvement process for data sharing means first designing with data in mind and then iterating on the pain points to achieve greater data dissemination.49 Figure 2 illustrates two possible process flows for a four-step data sharing process, A and B. Scenario A represents the data sharing process without any online data repository or portal; scenario B represents the process once a repository or portal solution has been implemented. From inquiry through clarification and follow-up, scenario A has many more back-and-forth communication touchpoints between the corresponding author and the researcher making the request. Scenario B outlines the type of data sharing process that becomes possible when a continuous improvement process is implemented with the patient and research community in mind.
Figure 2 Data sharing process with manual and automated scenarios, A and B, respectively. CITI, Collaborative Institutional Training Initiative; FAQs, frequently asked questions; NDA, non-disclosure agreement; SFTP, Secure File Transfer Protocol.
From these two scenarios, we can glean that a continuous improvement process creates opportunities to automate and reduce the time and effort required to share data (figure 3). A continuous improvement process laid out by an institution may comprise multiple aims, such as using a trusted research database portal, adopting the field's data standards before an experiment's data collection begins, and incorporating a deidentification requirement for project completion with the intent of data sharing. By framing data sharing as a goal to be met in service of the patient and research community, it is less likely to be considered an afterthought or extra work with low incentivisation for the researcher. A continuous improvement process is never considered complete: as new needs arise, whether making data sizes more accessible or creating documentation for frequently asked questions about the data set, the process aims to give the best possible experience in sharing and understanding the data. An exemplar system that lifts the onus of data sharing from the researcher entirely is MIMIC. The MIMIC data are accessible via PhysioNet, where data sets are categorised as open, restricted or credentialed. For credentialed data sets, including the latest version of MIMIC, users must complete CITI training, provide user information and sign the DUA. Additionally, MIMIC documents data dictionaries, release notes specifying incorrect data and their subsequent corrections, and directions for joining commonly created data views.50
Figure 3 Continuous improvement process for more ubiquitous data sharing.
Data sharing currently emphasises better scientific reproducibility, but validation is equally, if not more, important. From a treatment perspective, it is imperative to prove clinical efficacy, for example, of AI-enabled treatment recommendations created from longitudinal analysis of demographic, symptom and vital sign data. By putting the patient first, AI is refocused as a tool, where clinical safety and efficacy supersede the importance of AI interpretability and explainability.51 Where AI contributes to treatment enhancement indirectly, as in medical imaging analysis or the organisation of unstructured EHR data, patients benefit from validation of the method's accuracy, degree of utility and ability to generalise. To mitigate AI research arrogance and ignorance, goals need to be oriented so that there is a direct relationship between the patient providing their data and improvements in health outcomes.
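A minimal sketch of what such generalisability validation can look like in code is given below (using scikit-learn with synthetic stand-ins for two clinical sites; no real cohorts are implied): a model fitted on one site's data is evaluated, unchanged, on another site's data.

```python
# Minimal sketch of external validation: fit at 'site A', then measure
# discrimination at 'site B' with no retraining or recalibration.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-ins for two sites; different seeds give differently
# structured data, loosely mimicking a shift in case mix between sites.
X_a, y_a = make_classification(n_samples=800, random_state=1)
X_b, y_b = make_classification(n_samples=400, random_state=2, flip_y=0.1)

model = LogisticRegression(max_iter=1000).fit(X_a, y_a)

# Internal (site A) versus external (site B) discrimination.
auc_internal = roc_auc_score(y_a, model.predict_proba(X_a)[:, 1])
auc_external = roc_auc_score(y_b, model.predict_proba(X_b)[:, 1])
print(f"internal AUROC: {auc_internal:.2f}, external AUROC: {auc_external:.2f}")
```

A drop from internal to external performance is the signal of interest: it quantifies how far the method generalises before patients at a new site depend on it.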
The recent NIH initiative forces the sharing of such data and thus, we hope, a change in mindset that promotes humility and transparency. The development of continuous and systematic approaches to quality improvement is a beneficiary of such a mindset. Further, such a mindset shares the psychological sentiment that drives data sharing and would discourage ODIAO.