Evaluating the quality of voice assistants’ responses to consumer health questions about vaccines: an exploratory comparison of Alexa, Google Assistant and Siri
Abstract
Objective To assess the quality and accuracy of the voice assistants (VAs), Amazon Alexa, Siri and Google Assistant, in answering consumer health questions about vaccine safety and use.
Methods Responses of each VA to 54 questions related to vaccination were scored using a rubric designed to assess the accuracy of each answer provided through audio output and the quality of the source supporting each answer.
Results Out of a total of 6 possible points, Siri averaged 5.16 points, Google Assistant averaged 5.10 points and Alexa averaged 0.98 points. Google Assistant and Siri understood voice queries accurately and provided users with links to authoritative sources about vaccination. Alexa understood fewer voice queries and did not draw answers from the same sources that were used by Google Assistant and Siri.
Conclusions Those involved in patient education should be aware of the high variability of results between VAs. Developers and health technology experts should also push for greater usability and transparency about information partnerships as the health information delivery capabilities of these devices expand in the future.
Summary
What is already known?
Voice assistants (VAs) are increasingly used to search for online information.
Siri and Google Assistant have been shown to deliver health information inconsistently.
What does this paper add?
This study evaluates vaccine health information delivered by the top three virtual assistants.
This paper highlights answer variability across devices and explores potential health information delivery models for these tools.
Introduction
Patients widely use the internet to find health information.1 A growing share of internet searches is conducted using voice search. In 2018, voice queries accounted for one-fifth of search queries, and industry leaders predict that figure will grow to 30%–50% by 2020.2–4 The growth in voice search is partially driven by the ubiquity of artificial intelligence-powered voice assistants (VAs) on mobile apps, such as Siri and Google Assistant, and on smart speakers, such as Google Home, Apple HomePod and the Alexa-powered Amazon Echo.5 As VAs become available on more household devices, more people are turning to them for informational queries. In 2018, 72.9% of smart speaker owners reported using their devices to ask a question at least once a month.6 A number of health-related companion apps have been released for these VAs, suggesting developer confidence that VAs will be used in the health information context in the future.7 8
Many studies have evaluated the quality of health information websites and found varied results depending on the topic being researched.9–12 However, the literature evaluating how well VAs find and interpret online health information is limited. Miner et al found that VAs from Google, Apple and Samsung responded inconsistently when asked questions about mental health and interpersonal violence.13 Boyd and Wilson found that the quality of smoking cessation information provided by Siri and Google Assistant is poor.14 Similarly, Wilson et al found that Siri and Google Assistant answered sexual health questions with expert sources only 48% of the time.15
Online misinformation about vaccines has recently become a particular concern in light of outbreaks of vaccine-preventable diseases in the USA. Several studies have demonstrated that online networks are instrumental in spreading misinformation about vaccination safety.16–19 Because the internet hosts a large amount of inaccurate vaccination information, the topic of vaccines is ideal for testing how well VAs distinguish between evidence-based and non-evidence-based online sources. At the time of writing, there are no studies evaluating how well VAs navigate the online vaccine information landscape. There are also no studies evaluating consumer health information provided by Amazon Alexa, which accounts for a growing share of the voice search market.
Study objective
This study aims to assess the quality and accuracy of Amazon Alexa, Siri and Google Assistant in answering consumer health questions about vaccine safety and use. For the purposes of this paper, ‘consumer health’ refers to health information aimed at patients and lay persons rather than healthcare practitioners and policymakers.
Materials and methods
Selection of VAs
Siri, Google Assistant and Alexa were chosen for analysis due to their rankings as the top three VAs by search volume and smart speaker market share.20 21
Selection of questions
The sample set of questions was selected from government agency frequently asked question (FAQ) pages and organic web search queries about vaccines. This dual-pronged approach to question harvesting was chosen to ensure the questions reflected both agency expertise and realistic use cases from online information seekers.
Question set 1 contained questions from the Centers for Disease Control and Prevention (CDC) immunisation FAQ page and the CDC infant immunisation FAQ page.22 23 Question set 2 was generated using AnswerThePublic.com, a free tool that aggregates autosuggest data from Google and Bing.24 Autosuggest data offers insight into the volume of questions millions of users search for in online search engines. Prior studies have used autosuggest data from Google Trends to identify vaccination topics of interest to online health information consumers.18 The authors chose to build upon this method by using AnswerThePublic.com, which incorporates data from both Google Trends and Bing users, to capture a broader sample of online search queries. The authors used the English language setting and searched for the term ‘vaccines’ to pull 186 frequently searched vaccine-related phrases. From these phrases, fully formed questions were included and partial phrases that did not form a full sentence were discarded. Partial phrase queries were removed because they do not reflect the longer conversational queries typically used to address VAs.3 25 Questions that were redundant with question set 1 were also removed. The final sample set of questions included 54 items.
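The phrase-filtering step can be summarised in a few lines of code. The sketch below is an illustration only, not the authors' actual workflow: the file names and the heuristic used to decide whether a phrase is a fully formed question are assumptions.

```python
# A minimal sketch (not the authors' actual workflow) of the filtering step:
# keep only autosuggest phrases that read as fully formed questions and drop
# items that duplicate the CDC-derived question set. File names and the
# question-detection heuristic are illustrative assumptions.
import csv
from pathlib import Path

QUESTION_STARTERS = ("are", "can", "do", "does", "how", "is", "should",
                     "what", "when", "where", "which", "who", "why", "will")

def is_full_question(phrase: str) -> bool:
    """Treat a phrase as a question if it starts with a question word, e.g. 'are vaccines safe'."""
    words = phrase.lower().strip().split()
    return len(words) >= 3 and words[0] in QUESTION_STARTERS

# Assumed export of the 186 phrases returned by AnswerThePublic.com for 'vaccines'
with open("answerthepublic_vaccines.csv", newline="") as f:
    phrases = [row[0] for row in csv.reader(f) if row]

# Assumed plain-text list of the questions in question set 1 (CDC FAQ pages)
cdc_questions = {q.lower().strip()
                 for q in Path("cdc_faq_questions.txt").read_text().splitlines()}

question_set_2 = [p for p in phrases
                  if is_full_question(p) and p.lower().strip() not in cdc_questions]
print(f"{len(question_set_2)} candidate questions retained")
```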
Evidence-based answers were created for each question to serve as a comparison reference when assessing VA answer accuracy. A complete list of questions, approved answers and supporting sources is available in online supplementary table S1.
Developing the rubric
To grade the quality of each answer, the authors developed a rubric that assigned points based on author expertise, quality of sources cited and accuracy of the answer provided. The development of the rubric was informed by prior work in health web content evaluation and VA evaluation. The rubric incorporated quality standards for authorship, attribution, disclosure and currency from the JAMA benchmark criteria for evaluating websites.26 The rubric was also informed by the hierarchy of health information/advice created by Boyd and Wilson in their evaluation of smoking cessation advice provided by VAs.14 In their hierarchy, information produced by health agencies, such as the National Health Service (NHS) and the CDC, is grade A, information produced by commercially oriented medical sites, such as WebMD, is grade B, and information produced by non-health organisations and individual publishers is grade C. Our rubric similarly assigned the highest value to government health agencies, such as the CDC, the NHS or the National Institutes of Health (NIH), and lower values to crowdsourced and non-health websites. In cases where the immediate answer did not come from an expert source and was instead pulled from a for-profit or crowdsourced site, points could be gained if the answer was accurate and/or supported with an expert source citation.
All three VAs provided both an audio answer and a link to the source supporting each answer through the app interface. In determining the accuracy of the answer, both the verbal answer and the link provided as a source for the answer were considered for scoring. To assess how well the VAs processed voice interactions, the rubric assigned points for the VA’s comprehension of the question. The app interfaces also transcribed the text of the questions asked by the user, so the reviewers were able to assess whether or not the assistants had accurately recorded the question. The supporting links were also useful for evaluating which evidence was used to generate each answer. An answer was scored as fully accurate if the source it cited contained the correct answer, even if the VA did not provide the full answer through audio output. Possible scores ranged from 0.0 (VA did not understand the question and/or did not provide an answer) to 6.0 (VA answered the question correctly using an evidence-based government or non-profit source). The authors tested the rubric using a pilot set of 10 questions. Additional categories were added based on the pilot test to create the final rubric (figure 1).
Figure 1 Rubric for evaluating the quality of voice assistant responses.
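Because the full rubric appears only in figure 1, its exact point allocations are not reproduced here. The sketch below is an illustrative approximation of its structure, with assumed weights: credit for comprehension of the question, for the quality tier of the cited source (following the Boyd and Wilson hierarchy), and for the accuracy of the answer, capped at 6.0.

```python
# Illustrative sketch only: the actual point allocations come from the rubric in
# figure 1 and are not reproduced here, so the weights below are assumptions.
from dataclasses import dataclass

SOURCE_TIERS = {          # assumed tiering, after Boyd and Wilson's hierarchy
    "government_or_nonprofit": 2.0,   # e.g. CDC, NHS, NIH, WHO
    "commercial_medical": 1.0,        # e.g. WebMD, Healthline
    "crowdsourced_or_other": 0.5,     # e.g. Wikipedia, non-health sites
    "none": 0.0,
}

@dataclass
class Response:
    understood_question: bool
    source_tier: str       # one of SOURCE_TIERS
    answer_accurate: bool  # judged against the evidence-based reference answer

def score(r: Response) -> float:
    """Return a rubric-style score between 0.0 and 6.0 (weights are illustrative)."""
    if not r.understood_question:
        return 0.0
    pts = 2.0                              # assumed credit for comprehension
    pts += SOURCE_TIERS[r.source_tier]     # assumed credit for source quality
    if r.answer_accurate:
        pts += 2.0                         # assumed credit for accuracy
    return min(pts, 6.0)

print(score(Response(True, "government_or_nonprofit", True)))  # 6.0
```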
Data collection
Using two Apple iPads that had been reset to factory mode and had Siri, Alexa and Google Assistant installed as apps, both authors independently asked each VA the sample questions and assigned scores. Both authors speak with American English accents. The iPads ran on iOS V.11.4.1. Because search history can influence search results, the authors took several steps to ensure that the results were depersonalised.27 Each reviewer created new Amazon, Apple and Google accounts to use with each VA application. ‘Siri & Search’ was also turned off in settings to keep the Alexa and Google Assistant apps from learning from Siri’s responses. Location tracking was disabled in each app to avoid having the answers influenced by location-based results. If more than one answer or source was provided, the first source was used for scoring. Author 1 collected data on 10 August 2018. Author 2 collected data on 1–8 October 2018.
Results and discussion
Summary statistics
The authors combined the scores from both reviewers to calculate the overall mean score for each VA. Possible overall means ranged from 0.0 (VA did not understand the question and/or did not provide an answer) to 6.0 (VA answered the question correctly using an evidence-based government or non-profit source). Alexa’s overall mean was 0.9815. Google Assistant’s overall mean was 5.1012, and Siri’s was 5.1574. See table 1 for additional analysis.
Table 1 Summary performance statistics
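The overall means in table 1 amount to a simple aggregation of the two reviewers' scores. A minimal sketch, assuming the scores are stored in a long-format file with hypothetical columns 'assistant', 'reviewer', 'question' and 'score':

```python
# Minimal aggregation sketch; the file name and column layout are assumptions.
import pandas as pd

scores = pd.read_csv("va_scores.csv")   # one row per assistant x reviewer x question
overall_means = scores.groupby("assistant")["score"].mean().round(4)
print(overall_means)  # reported values: Alexa 0.9815, Google Assistant 5.1012, Siri 5.1574
```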
Inter-rater reliability for the total score for each answer was strong, as measured by an equally weighted Cohen’s kappa of 0.761 (95% CI 0.6908 to 0.8308). It is possible that the kappa statistic was affected by inconsistency in some responses: the VAs offered the same answer to both reviewers for only 67% of the queries. In instances where a VA offered the same answer and source to both reviewers, 78% of the reviewer scores were identical. While the rubric was adequate for this pilot exploration of VA health information provision, the reviewers identified more nuanced scoring of the audio answer as an area where the rubric’s reliability could be improved for future researchers.
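In principle, the agreement statistic can be reproduced as follows. This is a rough sketch, assuming the per-question total scores from both reviewers are available as aligned arrays; the paper does not state how the 95% CI was computed, so a percentile bootstrap is shown here as one common option.

```python
# Rough sketch of the inter-rater agreement calculation; the bootstrap CI is an
# assumption about method, since the paper does not specify how its interval was derived.
import numpy as np
from sklearn.metrics import cohen_kappa_score

def weighted_kappa_with_ci(r1, r2, n_boot=2000, seed=0):
    """Equally (linearly) weighted Cohen's kappa with a percentile bootstrap 95% CI."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    kappa = cohen_kappa_score(r1, r2, weights="linear")
    rng = np.random.default_rng(seed)
    boots = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(r1), len(r1))   # resample questions with replacement
        boots.append(cohen_kappa_score(r1[idx], r2[idx], weights="linear"))
    low, high = np.percentile(boots, [2.5, 97.5])
    return kappa, (low, high)
```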
Source quality
Google Assistant and Siri both scored highly for understanding the questions and delivering links to expert sources to the user. For most questions, Google Assistant and Siri delivered the recommended link to the reviewers’ devices with a brief audio response, such as ‘Here’s what I found on the web’ or ‘These came back from a search’. Both VAs provided links to evidence-based sources that contained accurate answers for the majority of the questions. The CDC website was the most frequently cited source for both Siri and Google Assistant. The CDC’s prevalence most likely reflects the fact that many of the sample questions were pulled from CDC websites, and it demonstrates the VAs’ ability to accurately match voice input to verbatim text online. This finding is also consistent with prior research that found high online visibility for the CDC’s vaccination content.18 Other highly cited sources included expert sources produced by the WHO, the US Department of Health and Human Services, the Mayo Clinic and the American Academy of Pediatrics. Sources with less transparent funding and editorial processes included procon.org, healthline.com and Wikipedia.
Alexa often responded with ‘Sorry, I don’t know that’, and consequently received a low average score. This low rate of command comprehension supports prior research documenting Alexa’s low comprehension of long natural language phrases.28 In the instances where Alexa offered a spoken response, Wikipedia was the most frequently cited source. Because Wikipedia is editable by the general public and does not undergo peer review, its frequent use as a source also contributed to Alexa’s low score. Online supplementary table S2 contains the transcribed audio output and sources recommended by each device for each question.
Differences in supporting sources might be explained by variations in the search engines powering each VA. Alexa uses Microsoft Bing, and Siri and Google Assistant’s answers are powered by Google search.29 30 Both search engine providers have previously disclosed medical information partnerships, and these partnerships may explain the variation in source selection.31 32
Audio output
Although Siri had the highest score for quality of answers, it used the fewest spoken words of all VAs. Alexa had the longest average spoken response in spite of having the lowest rate of understanding and providing answers to the questions. The length of audio output may offer insight into how the developers of each device envision the role of VAs as information assistants. With the brief audio responses offered by Google Assistant and Siri, the responsibility of assessing the quality of the information and locating the most important sections on the web page is placed on the user. The devices primarily function as a neutral voice-initiated web search.
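The verbosity comparison is a straightforward word count over the transcribed audio output (online supplementary table S2). A minimal sketch, with placeholder responses standing in for the real transcripts:

```python
# Sketch comparing response verbosity; the sample responses below are
# illustrative placeholders, not the full transcripts from table S2.
from statistics import mean

transcripts = {
    "Siri": ["Here's what I found on the web."],
    "Google Assistant": ["These came back from a search."],
    "Alexa": ["Sorry, I don't know that."],
}

avg_spoken_words = {va: mean(len(response.split()) for response in responses)
                    for va, responses in transcripts.items()}
print(avg_spoken_words)
```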
Alexa’s longer answers may reflect an attempt to deliver audio-only answers that do not require additional reading. Preliminary usability research suggests that users prefer audio-only answers to those delivered via screen.33 Of the devices sampled for this study, Alexa is the only VA with documentation that claims it can ‘answer questions about medical topics using information compiled from trusted sources’.34 Alexa was the VA most likely to offer an audio-only answer without links to additional reading. Although these answers were factually correct, they often included grammar errors, as exemplified in this response to the question ‘are vaccines bad?’:
According to data from the United States Department of Health and Human Services: I know about twenty-two vaccines including the chickenpox vaccine whose health effect is getting vaccinated is the best way to prevent chickenpox. Getting the chickenpox vaccine is much safer than getting chickenpox. Influenza vaccine whose health effect is getting vaccinated every years is the best way to lower your chances of getting the flu…
Based on the data collected, Alexa is behind Google Assistant and Siri in its ability to process natural language health queries and deliver an answer from a high-quality source. Although it did not function as an effective health information assistant at the time of data collection, Alexa’s approach to audio-only responses raises larger questions about future directions for VAs as medical information tools. It is appealing to envision a convenient hands-free system that delivers evidence-based answers to consumers in easily digestible lengths. Compared to an assistant that initiates a web search on the user’s mobile device and delivers a web page that must be read, an audio-only system would improve accessibility and potentially provide health information to consumers in less time. However, the audio-only method also raises ethical concerns about transparency and bias in the creation of answers, particularly surrounding health topics that have been the target of past misinformation campaigns. The audio-only approach may also lead to a reduced choice of sources for information for consumers because they would not be offered multiple sources to explore. Recent industry analysis has explored the risks of reduced consumer choice with the ‘one perfect answer’ audio-only approach to voice search, but the implications for health information provision remain unclear.35
More research is needed to understand whether the audio-only approach or the voice-powered web search approach is more effective for delivering consumer health information. The companies developing VAs should also be more transparent about how their search algorithms process health queries.
Third-party apps
Third-party apps are an additional model for health information delivery through VAs. Amazon allows third parties, such as WebMD and Boston Children’s Hospital, to create Alexa Skills, which users can enable in the Alexa Skills Store.8 Google similarly allows users to enable third-party apps through the Google Assistant app. A recent evaluation of third-party VA apps found 300 ‘health and fitness’ apps in the Alexa Skills store and 9 available for Google Assistant.7 Although the reviewers of the present study did not evaluate third-party apps, Google Assistant offered to let them ‘talk to WebMD’ in response to the question ‘are vaccines tested?’. This active recommendation of a third-party health information tool demonstrates an alternative model in which VAs connect consumers to third-party skills designed to address specific topics. This model may reduce VA manufacturers’ liability for offering medical advice, but it is currently unclear what quality standards are used to evaluate third-party health app developers.
Limitations
The reviewers asked questions and gathered data on different dates, which may have led to variations in VA answers. However, the rubric was designed to evaluate answer quality and did not penalise answers from varying sources if the answer was correct and from an authoritative source. The scoring rubric was developed for this exploratory study. As such, it has not been assessed for reliability or validity and needs refinement in future studies to keep pace with new developments in VA technology. Additionally, because all accounts were set to default mode and cleared of search history, the VA answers may not reflect use cases where search results are tailored to the local use history of individual users. Finally, the authors chose to include the link provided via screen in the quality assessment. Some preliminary data show that users prefer audio output, so the overall scores for all assistants may have been lower if the rubric evaluated audio output only.33
Conclusions
This study assessed health information provided by three VAs. By incorporating user-generated health questions about vaccines into the set of test questions, the authors created realistic natural language use cases to test VA performance. After evaluating VA responses using a novel evaluation rubric, the authors found that Google Assistant and Siri understood voice queries accurately and provided users with links to authoritative sources about vaccination. Alexa performed poorly at understanding voice queries and did not draw answers from high-quality information sources. However, Alexa attempted to deliver audio-only answers more frequently than the other VAs, illustrating a potential alternative approach to VA health information delivery. Existing frameworks for information evaluation need to be adapted to better assess the quality of health information provided by these tools and the health and technological literacy levels required to use them.
Those involved in patient education should be aware of the increasing popularity of VAs and the high variability of results between users and devices. Consumers should also push for greater usability and transparency about information partnerships as the health information delivery capabilities of these devices expand in the future.