Role of evaluation throughout the life cycle of biomedical and health AI applications
In the development and evaluation of medical artificial intelligence (AI) programmes, there is a tendency to focus the work on the system’s decision-making performance. This is natural, since the typical goal is to develop software that can assist physicians or other clinicians with decision tasks that they encounter when caring for patients. Yet it is short-sighted to focus evaluation efforts on decision-making performance alone when many other elements determine the success, impact, and validity of the system. Furthermore, performance goals and evaluation approaches need to be considered from the beginning of a development project, not as an afterthought when much of the work is already complete. Suitable planning and development need to be driven by a desire to assure not only that the system makes good decisions or provides accurate analyses but also that it ultimately achieves high perceived value among the anticipated user community.1
Several recent articles have summarised evaluation issues for medical computing systems.2–4 I add here an emphasis on the stages through which a system and its evaluation must move as it is designed, implemented and tested. In practice, development is an iterative process, with formative evaluations along the way leading to revisions or to rethinking of the goals and characteristics of the evolving system. Such work tends to begin in laboratory settings, where there can be early assessments of both the key component technologies and the system’s developing decision-making performance, typically by evaluating its generation of advice or interpretation of data. Also crucial during these early stages is consideration of the system’s usability, with an emphasis on how it will integrate with the workflow of the envisioned users.
Subsequent studies move out of the laboratory and into a real-world production environment. Such work may not require full-blown implementation of the final product, but the studies do require use in actual clinical settings where key performance elements can be assessed. First, is the system acceptable to users? Do early exposures lead to regular use or do users ignore and circumvent the new features being offered? Second, if it becomes clear that the clinicians do use it, is there evidence that they occasionally change their behaviour or decisions because of the system’s availability? If no impact can be documented, there is no point in asking whether patients have benefited.
When these crucial issues of acceptance and impact have been positively demonstrated, it is appropriate to move on to the formal summative assessment of how patients benefit from the system’s introduction. Formal clinical trials, suitably controlled and similar to those used in drug studies, are devilishly complex for studies of decision-support systems used in busy patient-care environments. Crossover effects are inevitable, and blinding is generally not an option, so most studies will try to assess differences that occur before and after a system is implemented. There are natural confounders that affect data interpretation, including turnover in the staff who make the decisions, other systemic or therapeutic innovations that make the two time periods poorly comparable, and other external factors such as changes in the costs of procedures or treatments. A system that works well at one institution may fail at another because of local factors such as different conventions regarding use of computer systems, variations in patient or physician populations, or cultural or fiscal factors. Similar variables and issues can complicate efforts to determine the cost-effectiveness of such innovations or to assess a system’s impact on the health of a population. Such issues, and how the team has controlled for them, should be addressed in any written evaluation reports.
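The ordering of studies described above (laboratory assessment, then acceptance, then impact on clinicians’ decisions, and only then patient benefit) can be read as a simple gating rule. The Python sketch below is purely illustrative. The names `Stage`, `Findings` and `next_study` are hypothetical and are not taken from any published tool, and real evaluations will rarely reduce to yes/no answers.

```python
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    """Evaluation stages, in the order the text says they should be tackled."""
    LABORATORY = "laboratory assessment of performance and usability"
    ACCEPTANCE = "acceptance study in an actual clinical setting"
    IMPACT = "study of changes in clinicians' behaviour or decisions"
    PATIENT_BENEFIT = "summative assessment of patient benefit"


@dataclass
class Findings:
    """Hypothetical yes/no summaries of what earlier studies have shown."""
    lab_performance_adequate: bool = False
    clinicians_use_system: bool = False
    decisions_change: bool = False


def next_study(findings: Findings) -> Stage:
    """Return the earliest stage whose question is still unanswered.

    A later stage is reached only when every earlier question has been
    answered positively; for example, patient benefit is not studied
    until an impact on decisions has been documented.
    """
    if not findings.lab_performance_adequate:
        return Stage.LABORATORY
    if not findings.clinicians_use_system:
        return Stage.ACCEPTANCE
    if not findings.decisions_change:
        return Stage.IMPACT
    return Stage.PATIENT_BENEFIT
```

For instance, `next_study(Findings(lab_performance_adequate=True, clinicians_use_system=True))` returns `Stage.IMPACT`: clinicians are using the system, so the next question is whether their decisions change.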
With this background in mind, readers can consider a proposed multistep process for full design, implementation, evaluation, and documentation of decision-support programmes, including AI systems.5 As I have previously described, the process involves planning, from the outset, in three major stages (figure 1).
Figure 1. Steps in the evolution and execution of AI (and other) informatics research and development efforts. AI, artificial intelligence.
Before
Identify the specific problem to be addressed.
Partner as necessary to assure availability of the required breadth of expertise, including collaborators from the domain of application who will understand both the clinical and acceptance issues.
Analyse the problem in detail, including prior work and solutions that previously failed or succeeded.
Motivate others in order to generate excitement and commitment among those who are joining the development and implementation teams.
Create and specify a fiscal and organisational plan that is intended to achieve the stated goals, while assuming that some iteration and rethinking will be required and accommodated.
During
Innovate by considering how the effort will add knowledge to the underlying informatics discipline, in addition to its ability to address the specific clinical problem.
Implement by building the system, carrying out formative studies that allow reconsideration of approaches or goals.
Assess by carrying out the pertinent studies (in both laboratory and—when ready—naturalistic settings), which will address the range of issues previously outlined.
After
Generalise by considering and describing the range or types of problems for which the development effort has offered solutions.
Critique by assessing the weaknesses or limitations of the work, as well as the summative studies, asking what remains to be done and what elements of the solution are suboptimal.
Share by publishing and presenting the work, focusing as much on the generalised lessons and methods of the work as on the achieved performance of the finished product.
Inspire by sharing the work in ways that excite others and encourage them to assume some of the challenges embodied in the questions that remain unanswered.
The clinical collaborators on a development project will typically serve as early adopters of the developing system. In turn they often become partners in the summative evaluation studies, where they can help to explain the system to others and to engage them in the clinical trial.
I have argued that clinical AI systems require a time-consuming staged evaluation plan that demonstrates more than the quality of advice or interpretation. This is also true for commercial decision-support projects. The phases and activities described here may seem intuitive and acceptable when developers are working in an academic environment. However, commercial development efforts tend to proceed under different pressures, largely related to finances and time constraints. Industry may skip some of the proposed steps (eg, formal usability testing), which can seriously compromise the ultimate impact and acceptance of the resulting decision-support tool. Accordingly, it would be wise for commercial developers to take heed of the stages and priorities outlined in this article, despite the implied costs and time requirements. Rigorous evaluation can only help when bringing a commercial product to market.
The framework offered here is intended to place iterative evaluation and system assessment at the centre of overall planning for a development effort. Although my motivation has been to stress the relevance of such considerations for today’s rapidly evolving AI systems, the discussion is largely relevant to any decision-support project’s life cycle. Carefully crafted studies emphasise the science underlying the system’s development and allow a new product’s impact to extend beyond its own clinical use to the sharing of lessons and methods that can affect future generations of systems as well.