Speech-Interface Prompt Design: Lessons From the Field
Jerome White
New York University, Abu Dhabi, UAE
jerome.white@nyu.edu

Mayuri Duggirala
Tata Consultancy Services, Pune, India
mayuri.duggirala@tcs.com

ABSTRACT
Designers of IVR systems often shy away from using speech prompts, preferring, where they can, to use keypad input. Part of the reason is that speech processing is expensive and often error prone. This work attempts to address that problem by offering guidelines for prompt design based on field experiments. It is shown, specifically, that accuracy can be influenced by prompt examples, depending on the nature of the information requested.

Categories and Subject Descriptors
[Human-centred computing]: Empirical studies in interaction design

General Terms
Human Factors

Keywords
Spoken Web, interactive voice response, prompt design

1. INTRODUCTION
Voice-based systems, and in particular those based on interactive voice response (IVR), have played an influential role in technology geared for human development. Within this setting, voice-based systems have been broadly used for information dissemination [11], community building [8], real-time monitoring [4], and data collection [9]. They have been applied, specifically, for purposes as diverse as health care [10], journalism [7], agriculture [8], and education [6]. At a recent international conference focusing on ICTD [12], in fact, 20 per cent of the accepted work centred on some application of the voice modality.

Mobile interfaces driven by speech are generally limited by inaccuracy: overall task performance is hindered by recognition errors, which in turn negatively affect users' task completion and experience. Although this is the nature of speech, recognition quality can be improved by using professional voice recognition software. The expense of such software, however, makes that solution unrealistic for small organisations operating on limited resources. Even for organisations that can afford such solutions, local languages within developing regions are often unsupported. Thus, for speech to be a realistic option within the IVR space, and in turn for IVR to be a more viable option within the development community, affordable improvements to recognition technology are paramount.

One such approach is prompt design that encourages recognisable input; that is, providing enough instructional dialogue that a user's utterance is likely to fall within the recognised vocabulary. Studies of such approaches exist [2, 3, 5], but none have been conducted on a large-scale, live deployment within a developing region. This work fills that gap. Specifically, it evaluates differences in spoken user input as a function of a prompt's instructional dialogue. The results suggest that, depending on the nature of the requested information, a prompt can influence user input, and in doing so ultimately improve accuracy.
2. BACKGROUND

2.1 The Employment Service
This work is based on interaction data from a voice-based employment platform. The employment service allowed candidates to input their resume information, employers to input their job details, and the two parties to be matched where appropriate. Candidates were able to apply for jobs of interest, and employers were able to obtain contact information for candidates they deemed appropriate. The entire interaction—leaving information, editing information, and searching for matching parties—took place over the phone. Users were taken through a series of keypad (dual-tone multi-frequency, or DTMF) and speech prompts to acquire their information and to navigate the system itself. Interaction took place in either English or Kannada, the local language of the deployment region. Language was chosen at the start of the call, and interaction took place in that language for the remainder of the call. The application was built using the Spoken Web platform, a spoken dialogue system that provides Internet-like features on top of IVR [1]. Although the service was designed for use on phones in general, it was intended for use on low-end mobile phones in particular.

2.2 System Specifics and Definitions
When a candidate called for the first time, they were required to register with the system. The registration process consisted of a series of question-answer dialogues that resulted in the creation of a user resume. For purposes of this paper, an entire question-answer dialogue is referred to as a section, where the "question" portion of a section is known as a prompt. The entire registration process consisted of seven sections: two sections—a user's age and work experience—required DTMF input, while the remaining five required spoken input. Of the five spoken sections, two—name and free speech, during which the user was instructed to speak freely about themselves for thirty seconds—were saved for presentation to job providers; no speech recognition was attempted in these cases. For the remaining three sections—qualification, skill, and location—an attempt was made to recognise what the user said. In the ideal case, the speech was recognised, converted to text, and stored in a database. In cases where the speech was not recognised, a null value was stored, and the user's input was recorded and saved for offline analysis. A user was given two attempts to speak an utterance that the system recognised; the offline recording was made from the third and final attempt.
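To make the flow concrete, the following is a minimal sketch of the two-attempt dialogue just described. It is our own reconstruction under stated assumptions: every name in it (run_section, play_prompt, listen, recognise, store, save_recording) is a hypothetical stand-in, not the Spoken Web platform's actual API.

```python
"""Sketch of the two-attempt recognition flow described in Section 2.2.
All callables are hypothetical stand-ins, not the Spoken Web API."""

from typing import Callable, Optional

MAX_RECOGNISED_ATTEMPTS = 2  # attempts at a recognisable utterance


def run_section(
    play_prompt: Callable[[], None],              # plays the section's prompt
    listen: Callable[[], bytes],                  # captures the caller's utterance
    recognise: Callable[[bytes], Optional[str]],  # returns None if unrecognised
    store: Callable[[Optional[str]], None],       # writes text (or a null value)
    save_recording: Callable[[bytes], None],      # keeps audio for offline analysis
) -> Optional[str]:
    # Two attempts at an utterance the system recognises.
    for _ in range(MAX_RECOGNISED_ATTEMPTS):
        play_prompt()
        text = recognise(listen())
        if text is not None:
            store(text)  # recognised: converted to text and stored
            return text
    # Third and final attempt: store a null value and record the
    # audio for offline analysis, as the system described above does.
    play_prompt()
    utterance = listen()
    store(None)
    save_recording(utterance)
    return None
```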
Table 1: Prompt transcription.

Location
  Original: "Speak the name of the district where you live. For example, you may say Mysore, Mandya, Bijapur, Dharwad, et cetera."
  Altered examples: "Speak the name of the district where you live. For example, you may say Kolar, Hassan, Gulbarga, Belgaum, et cetera."
  No examples: "Speak the name of the district where you live."

Skill
  Original: "Now, using a single phrase, inform your skills. For example you may say data entry operator, DTP, plumber, electrician, welder, et cetera."
  Altered examples: "Now, using a single phrase, inform your skills. For example you may say secretary, waiter, driver, mechanic, technician, et cetera."
  No examples: "Now, using a single phrase, inform your skills."

Qualification
  Original: "Speak your highest educational qualification. For example you may say ITI, diploma, Below SSLC, PUC, BA, et cetera."
  Altered examples: "Speak your highest educational qualification. For example you may say B-com, SSLC, 12th standard, B-tech, PG-diploma, et cetera."
  No examples: "Speak your highest educational qualification."

2.3 User demographics
The system was deployed throughout the state of Karnataka for a total of 11 months. In that time, candidate registration was attempted 29,152 times; 8,625 of those attempts completed all sections of the registration process. Most callers were in their early twenties, with one to two years of work experience. Approximately 21 per cent specified that they were from Bangalore, the largest city in Karnataka; more than 20 other districts across the state were each represented by more than 1 per cent of callers. The educational split was more uniform, with most callers possessing at least 10 years of education.

3. SETUP
As previously mentioned, three sections of the registration process attempted to recognise what the user said and then convert that value to text. The prompts for these three sections instructed the caller not only on the purpose of the section, but also on appropriate words to speak. Specifically, the user was presented with the section's purpose, along with examples of valid input; see Table 1 for details.

The basis for the examples was twofold. The first was to give the user an idea of the type of input that was expected: both to reinforce what was meant by "skill" or "qualification," for example, and to convey that answers should be condensed to one or two words. The second was to increase the chances that what a user said was in the speech recogniser's vocabulary; thus, examples were provided that were expected to be recognised.

The examples for the original prompts were chosen based on field studies [13] and discussions with government employment specialists; they were selected for their relatability to the target audience and designed in conjunction with the speech recognition dictionaries. The examples for the experimental prompts were drawn from what was already in the recogniser vocabulary. They were chosen at random, subject only to the conditions that the system was capable of recognising them if spoken and that they were not already present in the original prompt. Equivalent prompts were run for both Kannada and English.

The original prompts ran in the system, exclusively, for 10 weeks. From this, a reference distribution of user input was established. After this period, users received one of the three prompts—original, altered examples, or no examples—chosen uniformly at random. Original prompt statistics are reported from the start of the experimental time period, not from system inception.
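For illustration, the per-caller assignment just described reduces to an unweighted random choice among the three conditions. A minimal sketch follows; the condition labels are ours, and the deployed system's actual implementation is not shown.

```python
import random

# The three prompt conditions of the experiment; labels are ours.
CONDITIONS = ("original", "altered examples", "no examples")


def assign_prompt(rng=random) -> str:
    """Pick one of the three prompt conditions, uniformly at random,
    for a caller entering the experimental period."""
    return rng.choice(CONDITIONS)
```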
4. RESULTS
Results from the experimental study are outlined in Table 2. Three aspects of the data are of particular interest: the amount of time required to go through each section, the accuracy of speech recognition, and the distribution of recognised answers. Reported times are exclusively user interaction time—the amount of time spent listening to the prompt itself is removed.

Table 2: Prompt effects on user input. N is the number of samples taken for a given prompt. Time refers to the amount of time users spent in a given prompt; asterisks (*) denote values that are significantly different from "original" (two-sided paired t-test, p < 0.05). Accuracy refers to the fraction of user input that was successfully recognised. Ex. Dist. ("example distribution") is the cumulative distribution of the examples mentioned in the prompts. Top-five inputs are the most common system-recognised inputs given by users, presented in descending order; daggers (†) denote input values that were also used as examples in the prompt.

Location
  Original:         N=1597, time 33.3 s (SD 22.1), accuracy 73.6%, ex. dist. 0.179
                    Top five: Bangalore, Bijapur†, Bellary, Dharwad†, Bagalkot
  Altered examples: N=1550, time 32.4* s (SD 19.8), accuracy 75.8%, ex. dist. 0.112
                    Top five: Bangalore, Bijapur, Bellary, Belgaum†, Dharwad
  No examples:      N=1506, time 29.9* s (SD 24.0), accuracy 69.7%
                    Top five: Bangalore, Bijapur, Bellary, Bagalkot, Hassan

Skill
  Original:         N=1915, time 51.7 s (SD 29.6), accuracy 38.3%, ex. dist. 0.155
                    Top five: Data entry operator†, Electrician†, DTP†, Attendant, Fitter
  Altered examples: N=1953, time 45.1 s (SD 33.6), accuracy 50.5%, ex. dist. 0.281
                    Top five: Technician†, Secretary†, Teacher, Driver†, Mechanic†
  No examples:      N=1938, time 51.0* s (SD 30.9), accuracy 18.7%
                    Top inputs: Computer Science, Electrician, Cashier, Finance

Qualification
  Original:         N=2140, time 43.8 s (SD 24.7), accuracy 64.2%, ex. dist. 0.331
                    Top five: PUC†, BA†, ITI†, BEd, BCom
  Altered examples: N=2101, time 45.9* s (SD 26.7), accuracy 62.3%, ex. dist. 0.128
                    Top five: BA, ITI, BCom†, BEd, 12th standard†
  No examples:      N=2174, time 38.0* s (SD 26.3), accuracy 58.8%
                    Top five: PUC, BA, ITI, BCom, BEd

4.1 Input bias
Example distribution (Table 2, "Ex. Dist.") and top-five input are telling indicators of whether users merely repeat the examples they hear. Example distribution is the fraction of users exposed to a given prompt who responded with a value that was also present in the examples. Based on the values observed across all prompts, it is unlikely that the results are dominated by repeats: were that the case, observed values would have been closer to one, yet in no section was the cumulative distribution of the prompt examples greater than 35 per cent. Further, in no prompt did the observed top-five values coincide completely with the values presented in the prompt examples. These findings suggest that the values extracted during this experiment were genuine representations of the population.
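For concreteness, the two indicators used above can be computed from logged responses as follows. This is a sketch under an assumed data layout: responses holds the recognised text (or None, for unrecognised input) of every caller exposed to a prompt, and examples holds that prompt's example values; neither name comes from the deployed system.

```python
from collections import Counter
from typing import Iterable, List, Optional, Sequence


def example_distribution(responses: Sequence[Optional[str]],
                         examples: Iterable[str]) -> float:
    """Fraction of exposed users whose recognised response was also one
    of the prompt's examples; None entries are unrecognised input."""
    example_set = {e.lower() for e in examples}
    hits = sum(1 for r in responses
               if r is not None and r.lower() in example_set)
    return hits / len(responses)


def top_five(responses: Sequence[Optional[str]]) -> List[str]:
    """Most common recognised inputs, in descending order of frequency."""
    counts = Counter(r for r in responses if r is not None)
    return [value for value, _ in counts.most_common(5)]
```

Applied to the log of the original location prompt, for example, example_distribution would be expected to return a value near the reported 0.179.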
4.2 Accuracy
In all cases, irrespective of section, the accuracy of the speech recogniser was lowest when no examples were presented to the user. The gain in accuracy, however, was not constant across sections: the size of the gain when examples were presented depended on the section, and ultimately on the nature of the information extracted from the user. In the cases of location and qualification, increases of 4 and 2.8 percentage points, respectively, were observed in the worst case (where "worst case" is the difference between the no-example accuracy and the lower of the original and altered-example accuracies). In the case of skill, however, accuracy doubled with the introduction of examples. Yet the best-case accuracy of any skill example section was almost 10 percentage points lower than the no-example accuracies of location and qualification, which happened to be the worst observed accuracies in those sections.

It is likely that skill accuracy was poor, and that qualification was slightly worse than location, because of the nature of the information. When asked to describe their skill, many people will either list several concise topics or speak freely, in a conversation-like manner, about a single ability. Moreover, what constitutes a skill, and how to describe it, often vary across a populace: the same expertise may be described in several different ways depending on who is speaking. This combines to make skill a difficult concept to classify, and in turn to recognise automatically.

Within the location section, regardless of whether examples were presented, the user was asked to speak the name of their "district," an administrative division in India that is well defined and widely understood. There are approximately 700 districts within the country, 30 of which are contained in Karnataka. This not only simplifies the task of the speech recognition engine, it also reduces the values considered by the caller to a set that is finite and universal. While there may be several ways to describe a skill, and slight variations in the degrees offered across syllabi, districts are relatively well known and stable.

Qualification accuracy, overall, was better than the accuracies observed for skill, but worse than those observed for location. Qualification benefits from many of the same advantages seen in location: a finite, well-established set of values, further narrowed by the platform's target audience. However, qualification—like skill, but to a lesser extent—is plagued by a lack of standardised structure, along with varied methods of expression. India has various boards that govern primary, secondary, and trade-specific tertiary education. As such, equivalent degrees may go by different names depending on the board governing a particular student's education. Complicating matters further, some people may express their most recent educational instance as a combination of that instance and its result: rather than "PUC," for example, a person may give the more accurate description "PUC-fail." They may also combine their last successful degree with the degree they are currently pursuing, making the speech interaction more conversational and harder to discern automatically.

4.3 Time
With the exception of skill, the amount of time required to move through a section was smallest when no examples were presented. The difference is slight—about five seconds in the worst case—but observable. The additional section time that examples incur, however, generally brought accuracy improvements. Skill was an exception: the duration of the no-example prompt was as long or longer, and the accuracy much lower, compared to the original and altered-example prompts.
5. CONCLUSION
What examples to use within a prompt, and even whether to use examples at all, depends on the nature of the data and the acceptable trade-off between time and accuracy. In all cases, there was an improvement in accuracy when examples were introduced. There was also—with the exception of skill—an increase in the time spent moving through the section. Whether the increased time was worth the improved accuracy, however, was not always clear. This is most notable in the case of location: examples, in the best case, improved accuracy by over 6 percentage points. Inclusion of these examples, however, cost the caller, and the system, an additional 2.2 seconds of connection time, on average. Moreover, irrespective of which examples were used, or whether examples were used at all, there was little difference between the observed top-five inputs. This suggests that location is relatively standard in the minds of callers, and that creating prompts that optimise time would not significantly hinder the quality of the data collection.

Qualification benefits from examples, but the time required for those examples could likely be reduced; presenting fewer examples than our system did, for instance, would be one way of achieving this. Again, given the overlap in top-five inputs across example schemes, what a qualification is seems clear to our user group; examples likely serve to educate them on what format—standard one-word abbreviations—is expected.

The overwhelming conclusion for skill is that examples are not just helpful, but necessary. As noted in Table 2, the discrepancy between the best-case and worst-case skill prompts—altered examples and no examples, respectively—was the largest across all prompts and sections in which the experiment was conducted. Further, the fact that users spent as much or more time in the no-example case as they did in the original and altered-example cases suggests either that their speech was too long, or that numerous retries were required to move through the prompt (and even then, "success," in the sense of the system recognising the input, was not guaranteed). Particularly interesting was that accuracy was higher when examples were chosen at random ("altered examples") than when they were chosen based on an understanding of the potential user base ("original"). This suggests that for relatively open-ended data collection—as is the nature of skill—systems, and researchers, should continually observe and adapt the prompt to maximise success.

6. ACKNOWLEDGEMENTS
The authors would like to thank the Karnataka Vocational Training and Skill Development Corporation for their assistance in making this work possible; Vaijayanthi Desai and Kundan Shrivastava for their infrastructural assistance; and Brian DeRenzi, whose meaningful discussions played a key role in the formation of this work. Appreciation is also extended to the reviewers for their insightful comments and suggestions.

References
[1] S. Agarwal et al. The Spoken Web: A web for the underprivileged. SIGWEB Newsletter, (Summer):1:1–1:9, June 2010.
[2] C. Baber et al. Factors affecting users' choice of words in speech-based interaction with public technology. Int. J. Speech Tech., 2(1):45–59, 1997.
[3] K. Baker et al. Constraining user response via multimodal dialog interface. Int. J. Speech Tech., 7(4):251–258, 2004.
[4] W. Curioso et al. Design and implementation of Cell-PREVEN: A real-time surveillance system for adverse events using cell phones in Peru. In AMIA Annual Symposium, 2005.
[5] L. Karsenty. Shifting the design philosophy of spoken natural language dialogue: From invisible to transparent systems. Int. J. Speech Tech., 5(2):147–157, 2002.
[6] M. Larson et al. "I want to be Sachin Tendulkar!": A spoken English cricket game for rural students. In CSCW, 2013.
[7] P. Mudliar et al. Emergent practices around CGNet Swara, a voice forum for citizen journalism in rural India. In ICTD, 2012.
[8] N. Patel et al. Avaaj Otalo: A field study of an interactive voice forum for small farmers in rural India. In CHI, 2010.
[9] S. Patnaik et al. Evaluating the accuracy of data collection on mobile phones: A study of forms, SMS, and voice. In ICTD, 2009.
[10] J. Sherwani et al. Healthline: Speech-based access to health information by low-literate users. In ICTD, 2007.
[11] J. Sherwani et al. Speech vs. touch-tone: Telephony interfaces for information access by low-literate users. In ICTD, 2009.
[12] B. Thies and A. Nanavati, editors. Proceedings of the 3rd ACM DEV, 2013.
[13] J. White et al. Designing a voice-based employment exchange for rural India. In ICTD, 2012.