Intelligent Interfaces Enabling Blind Web Users to Build Accessibility Into the Web

Jeffrey P. Bigham

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

University of Washington, 2009

Program Authorized to Offer Degree: Computer Science and Engineering

University of Washington Graduate School

This is to certify that I have examined this copy of a doctoral dissertation by Jeffrey P. Bigham and have found that it is complete and satisfactory in all respects, and that any and all revisions required by the final examining committee have been made.

Chair of the Supervisory Committee: Richard E. Ladner

Reading Committee: Richard E. Ladner, Tessa Lau, Jacob O. Wobbrock

In presenting this dissertation in partial fulfillment of the requirements for the doctoral degree at the University of Washington, I agree that the Library shall make its copies freely available for inspection. I further agree that extensive copying of this dissertation is allowable only for scholarly purposes, consistent with "fair use" as prescribed in the U.S. Copyright Law. Requests for copying or reproduction of this dissertation may be referred to ProQuest Information and Learning, 300 North Zeeb Road, Ann Arbor, MI 48106-1346, 1-800-521-0600, to whom the author has granted "the right to reproduce and sell (a) copies of the manuscript in microform and/or (b) printed copies of the manuscript made from microform."

University of Washington

Abstract

Intelligent Interfaces Enabling Blind Web Users to Build Accessibility Into the Web

Jeffrey P. Bigham

Chair of the Supervisory Committee: Boeing Professor Richard E. Ladner, Computer Science and Engineering

The web holds incredible potential for blind computer users. Most web content is relatively open, represented in digital formats that can be automatically converted to voice or refreshable Braille. Software programs called screen readers can convert some content to an accessible form, but struggle on content not created with accessibility in mind. Even content that is possible to access may not have been designed for non-visual access, requiring blind web users to inefficiently search for what they want in the lengthy serialized views of content exposed by their screen readers. Screen readers are expensive, costing nearly $1000, and are not installed on most computers. These problems collectively limit the accessibility, usability, and availability of web access for blind people.

Existing approaches to addressing these problems have not included blind people as part of the solution, instead relying on either (i) the owners of content and infrastructure to improve access or (ii) automated approaches that are limited in scope and can produce confusing errors. Developers can improve access to their content and administrators to their computing infrastructure, but relying on them represents a bottleneck that cannot be easily overcome when they fail. Automated tools can improve access but cannot address all concerns and can cause confusion when they make errors. Despite having the incentive to improve access, blind web users have largely been left out.

This dissertation explores novel intelligent interfaces enabling blind people to independently improve web content. These tools are made possible by novel predictive models of web actions and careful consideration of the design constraints for creating software that can run anywhere.
Solutions created by users of these tools can be shared so that blind users can collaboratively help one another make sense of the web. Disabled people should not only be seen as access consumers but also as effective partners in achieving better access for everyone.

The thesis of this dissertation is the following: With intelligent interfaces supporting them, blind end users can collaboratively and effectively improve the accessibility, usability, and availability of their own web access.

TABLE OF CONTENTS

List of Figures

Chapter 1: Introduction
1.1 Achieving the Full Potential of the Web
1.2 Accessibility, Usability and Availability
1.3 Who should fix the web?
1.4 Dissertation Outline

Chapter 2: Related Work
2.1 Understanding the User Experience
2.2 Enabling Users to Improve Accessibility and Usability
2.3 Improving the Availability of Accessible Interfaces

Chapter 3: WebinSitu: Understanding Accessibility Problems
3.1 Motivation
3.2 Recording Data
3.3 Study Design
3.4 Results
3.5 Discussion
3.6 Related Work
3.7 Summary

Chapter 4: Collaborative Accessibility with Accessmonkey
4.1 Motivation
4.2 Related Work
4.3 Accessmonkey Framework
4.4 Implementations
4.5 Implemented Scripts
4.6 Discussion
4.7 Summary & Ongoing Work

Chapter 5: A More Usable Interface to Audio CAPTCHAs
5.1 Introduction and Motivation
5.2 Related Work
5.3 Evaluation of Existing CAPTCHAs
5.4 Improved Interface for Non-Visual Use
5.5 Evaluation of New CAPTCHA Interface
5.6 Future Work
5.7 Summary
Chapter 6: More Effective Access with TrailBlazer
6.1 Motivation
6.2 Related Work
6.3 An Accessible Guide
6.4 Clipping
6.5 Formative Evaluation
6.6 Dynamic Script Generation
6.7 Evaluation of Suggestions
6.8 Summary and Future Directions

Chapter 7: Improving the Availability of Web Access with WebAnywhere
7.1 Introduction & Motivation
7.2 Related Work
7.3 Public Computer Terminals
7.4 The WebAnywhere System
7.5 User-Driven Design
7.6 User Evaluation
7.7 Reducing Latency
7.8 Evaluation
7.9 Security
7.10 Summary

Chapter 8: A New Delivery Model for Access Technology
8.1 The Audience for WebAnywhere
8.2 Getting New Technology to Users: Two Examples
8.3 Summary

Chapter 9: Conclusion and Future Directions
9.1 Contributions
9.2 Future Directions
9.3 Final Remarks

Bibliography

LIST OF FIGURES

1.1 A motivating example of existing access problems. (a) Finding content, even on the relatively simple gmail.com login page, can be time-consuming. (b) An incorrect login is difficult to detect, and an audio CAPTCHA must be solved to try again. (c) The most efficient route to the inbox requires knowing arbitrary key mappings tied to the underlying HTML structure of the web page, (d) as does finding the beginning of the message. (e) A table of important statistics and other information on nytimes.com is an image assigned the uninformative alternative text "INSERT DESCRIPTION," making it impossible for a screen reader user to read.

1.2 Effective web access involves more than simply making it possible for users with diverse abilities to access content.
Content must also be usable and the tools needed to access it widely available. Accessibility is the foundation of usability and availability, usability increases the potential audience for whom access is possible, and availability determines where content can be accessed and who will be able to access it.

2.1 An overview of the flow of accessible content from web developer to blind web user, along with a selection of the components that have been explored at each stage to improve web accessibility. While later stages can influence earlier stages, such change is slower and more difficult to achieve.

2.2 Many products enable web access for blind individuals, but few have high availability and low cost (upper-left portion of this diagram). Only systems that both voice web information and provide an interface for browsing it are included.

3.1 Log frequency of visits per domain name recorded for all participants, ordered by popularity.

3.2 Diagram of the system used to record users' browsing behavior.

3.3 For the web pages visited by each participant, percentage of: (1) images with alt text, (2) pages that had one or more mouse movements, (3) pages with Flash, (4) pages with Asynchronous Javascript and XML (AJAX), (5) pages containing dynamic content, (6) pages where the participant interacted with dynamic content.

3.4 Number of probes for each page that had at least one probe. Blind participants performed more probes from more pages.

3.5 For each participant, average time spent on: (1) all pages visited, (2) WebinSitu search page, (3) WebinSitu results page, (4) Google home page, (5) Google results pages.

3.6 Web history search interface and results.

4.1 Accessmonkey allows web users, web developers and systems for automated accessibility improvement to collaboratively improve web accessibility.

4.2 A screenshot of the WebInSight Accessmonkey script in developer mode applied to the homepage of the International World Wide Web Conference. This script helps web developers discover images that are assigned inappropriate alternative text (such as the highlighted image) and suggests appropriate alternatives. The developer can modify these suggestions, as was done here, to produce the final alternative text.

4.3 The menubar of this online retailer is inaccessible due to its reliance on the mouse. To fix this problem we wrote an Accessmonkey script that makes this menu accessible from the keyboard.

4.4 This script moves the header and navigation menus of this site to the bottom of the page, providing users with a view of the web page that presents the main content window first in the page.

5.1 Examples of existing interfaces for solving audio CAPTCHAs. (a) A separate window containing the sound player opens to play the CAPTCHA, (b) the sound player is in the same window as the answer box but separate from the answer box, and (c) clicking a link plays the CAPTCHA.
In all three interfaces, a button or link is pressed to play the audio CAPTCHA, and the answer is typed in a separate answer box.

5.2 A summary of the features of the CAPTCHAs that we gathered. Audio CAPTCHAs varied primarily along the several common dimensions shown here.

5.3 An interface for solving audio CAPTCHAs modeled after those currently provided to users (Figure 5.1).

5.4 Percentage of participants answering each value on a Likert scale from 1 (Strongly Agree) to 5 (Strongly Disagree) reflecting the perceived frustration of blind and sighted participants in solving audio and visual CAPTCHAs. Participants could also respond "I have never independently solved a visual [audio] CAPTCHA." Results illustrate that (i) nearly half of sighted and blind participants had not solved an audio or visual CAPTCHA, respectively, (ii) visual CAPTCHAs are a great source of frustration for blind participants, and (iii) audio CAPTCHAs are also somewhat frustrating to solve.

5.5 The average time spent by blind and sighted users to submit their first solution to the ten audio CAPTCHAs presented to them. Error bars represent ±1 standard error (SE).

5.6 The number of tries required to correctly answer each CAPTCHA problem, illustrating that (i) multiple tries resulted in relatively few corrections, (ii) the success rates of blind and sighted solvers were on par, and (iii) many audio CAPTCHAs remained unsolved after three tries.

5.7 The new interface developed to better support solving audio CAPTCHAs. The interface is combined within the answer textbox to give users control of CAPTCHA playback from within the element in which they will type the answer.

5.8 The percentage of CAPTCHAs answered correctly by blind participants using the original and optimized interfaces. The optimized interface enabled participants to answer 59% more CAPTCHAs correctly on their first try as compared to the original interface.

6.1 A CoScript for entering time worked into an online time card. The natural language steps in the CoScript can be interpreted both by tools such as CoScripter and TrailBlazer, and also read by humans. These steps are also sufficient to identify all of the web page elements required to complete this task - the textbox and two buttons. Without TrailBlazer, steps 2-4 would require a time-consuming linear search for screen reader users.

6.2 The TrailBlazer interface is integrated directly into the page, is keyboard accessible, and directs screen readers to read each new step. A) The description of the current step is displayed visually in an offset bubble but is placed in DOM order so that the target of a step immediately follows its description when viewed linearly with a screen reader. B) Script controls are placed in the page for easy discoverability but also have alternative keyboard shortcuts for efficient access.

6.3 TrailBlazer guiding a user step-by-step through purchasing a book on Amazon. 1) The first step is to go to the Amazon.com homepage.
2) TrailBlazer directs the user to select the "Books" option from the highlighted listbox. 8) On the product detail page, TrailBlazer directs users past the standard template material directly to the product information.

6.4 The descriptions provided by two participants for the screenshots shown, illustrating diversity in how regions were described. Selected regions are 1) the table of statistics for a particular baseball player, and 2) the search results for a medical query.

6.5 Participant responses to Likert scale questions indicating that they think completing new tasks and finding content is difficult (1, 2), think TrailBlazer can help them complete tasks more quickly and easily (3, 4, 5), and want to use it in the future (6), especially if scripts are available for more tasks (7).

6.6 Proportion of action types at each step number for scripts in the CoScripter repository. These scripts were contributed by current users of CoScripter. The action types represented include actions recognized by CoScripter which appeared in at least one script as of October 2008.

6.7 The features calculated and used by TrailBlazer in order to rank potential action suggestions, along with the three sources from which they are formed.

6.8 Suggestions are presented to users within the page context, inserted into the DOM of the web page following the last element with which they interacted. In this case, the user has just entered "105" into the "Flight Number" textbox and TrailBlazer recommends clicking on the "Check" button as its first suggestion.

6.9 The fraction of the time that the correct action appeared among the top suggestions provided by TrailBlazer for varying numbers of suggestions. The correct suggestion was listed first in 41.4% of cases and within the top 5 in 75.9% of cases.

7.1 People often use computers besides their own, such as computers in university labs, library kiosks, or friends' laptops.

7.2 WebAnywhere is a self-voicing web browser inside a web browser.

7.3 Survey of public computer terminals by category indicating that WebAnywhere can run on most of them.

7.4 Browsing the ICWE 2008 homepage with the WebAnywhere self-voicing, web-browsing web application. Users use the keyboard to interact with WebAnywhere like they would with their own screen readers. Here, the user has pressed the TAB key to skip to the next focusable element, and CTRL+h to skip to the next heading element. Both web content and interfaces are voiced to give blind web users access.

7.5 The WebAnywhere system consists of server-side components that convert text to speech and proxy web content, and client-side components that provide the user interface, coordinate what speech will be played, and play the speech. Users interact with the system using the keyboard.

7.6 Selected shortcut functionality provided by WebAnywhere and the default keys assigned to each. The system implements the functionality for more than 30 different shortcut keys.
Users can customize the keys assigned to each.

7.7 Participant responses to WebAnywhere, indicating that they saw a need for a low-cost, highly-available screen-reading solution (7, 8, 9) and thought that WebAnywhere could provide it (3, 4, 6, 10). Full results available in Appendix A.

7.8 Caching and prefetching on the server and client improve latency.

7.9 An evaluation of server load as the number of simultaneous users reading news.google.com is increased for 3 different caching combinations.

7.10 Counts of recorded actions along with the contexts in which they were recorded (current node and prior action), ordered by observed frequency.

7.11 Average latency per sound using different prefetching strategies. The first set contains tasks performed by participants in our user evaluation, including results for prefetching strategies that are based on user behavior. The second set contains five popular sites, read straight through from top to bottom, with and without DFS prefetching. Bars are shown overlapping.

8.1 Weekly Web Usage between November 15, 2008 and May 1, 2009. An average of approximately 600 unique users visit WebAnywhere each week. The large drop in users in December roughly corresponded to winter break. WebAnywhere offers the chance for a living laboratory to improve our understanding of how blind people browse the web and the problems that they face.

8.2 From November 2008 to May 2009, WebAnywhere was used by people from over 90 countries. This chart lists the 40 best-represented countries ranked by the number of unique IPs identified from each country that accessed WebAnywhere over this period. 33.9% of the total 23,384 IPs could not be localized and are not included.

8.3 WebAnywhere, May 2009. Since its release, new languages have been added to WebAnywhere. This screenshot shows an early Cantonese version of the system. We have also started to introduce features that may make it more useful for other populations. The text being read is highlighted within the page and shown in a magnified, high-contrast view.

ACKNOWLEDGMENTS

I would like to thank my advisor Richard Ladner. He has helped to shape all of the work presented in this dissertation, and none of it would have turned out as well absent his input. I have enjoyed my time working with Richard. He has helped me become the researcher that I am today. From him, I have learned the importance of connecting my research to impact and outreach. For that, I will always be thankful. I also thank Jacob Wobbrock, who somehow has always known the right advice to give on everything from research direction to future plans, and Tessa Lau, who graciously entrusted me with her wisdom. Tessa pressed me to consider the broader implications of my research and provides an excellent example of an incredibly successful researcher who somehow manages a balanced life. I thank Jake and Tessa for helping me become a better HCI researcher, both through their explicit mentoring and by example.

Anna Cavender has played an integral role in nearly all of the projects forming this dissertation.
From early brainstorming and developing prototypes, to conducting user studies and considering broader themes, my projects would not have been nearly as successful without her input and support. Craig Prince is one of the smartest and most talented people I know. He was always there to listen to another crazy idea, spend sleepless nights helping to solidify prototypes, and get everything working before the next deadline. Thank you, Craig. I thank Jeffrey Nichols for his energy and great conversations. In Jeff, I see an example of what I want to become - a builder with a strong base in science, an innovator who pushes the boundaries of what interfaces can do.

I would like to thank the following organizations who have helped to fund me over the years: the National Science Foundation, Boeing, Microsoft, TRACE, and NASA. I would like to especially thank Allan and Inger Osberg, who funded me through my last year of graduate school via their generous Osberg Fellowship.

A large number of other students, faculty, friends, and family have also influenced this work. They are, in alphabetical order: Chieko Asakawa, Jennison Asuncion, Yevgen Borodin, Jeremy Brudvick, Charles Chen, Wendy Chisholm, Allen Cypher, Clemens Drews, Reuben Firmin, James Fogarty, Jim Fruchterman, Susumu Harada, Simon Harper, Sangyun Hahn, Shaun Kane, Ed Lazowska, Clayton Lewis, Benson Limketkai, Jimmy Lin, I.V. Ramakrishnan, T.V. Raman, Hironobu Takagi, Gregg Vanderheiden, Lindsay Yazzolino and Yeliz Yesilada. This list is a who's who in the access and end user programming worlds - I am grateful for the time that they have given me to learn from them.

I want to thank my parents, Richard and Peggy Bigham. The unshakable confidence and support that they have shown in me throughout my life has allowed me to dream big. Thank you.

Finally, I want to thank my wife Jennifer Bigham. Without her unwavering support, my work would not have been possible. She inspires me with her unending enthusiasm and positive attitude, and helps to connect me to what is really important in life. Thanks Jen for forcing me to take a walk on a sunny day every once in a while even in the midst of yet another important deadline; thanks for understanding when I disappeared to California for a summer to pursue another important opportunity (twice!); and thanks for being everything you are.

DEDICATION

This dissertation has two dedications.

• To my parents, Richard and Peggy Bigham, who have supported and encouraged me throughout my life, helping me realize that anything is possible.

• To my wife Jennifer Bigham, whose unwavering love and support has encouraged me to strive for my best.

Chapter 1

INTRODUCTION

The web holds incredible potential for blind computer users. Most web content is relatively open, represented in digital formats that can be automatically converted to voice or refreshable Braille. Software programs called screen readers can convert some content to an accessible form, but struggle on content not created with accessibility in mind. Even content that is possible to access may not have been designed for non-visual access, requiring blind web users to inefficiently search through content using the linearized view of content exposed by screen readers. Screen readers are expensive, costing nearly $1000 [82, 146], and are not installed on most computers. These problems can make access impossible, unusable, and unavailable for blind web users.
This dissertation explores intelligent tools that enable blind web users to collaboratively solve these problems for themselves and share the results with one another.

1.1 Achieving the Full Potential of the Web

The web has already become an incredible resource for blind people. Unlike printed material of the past, most web information can be automatically converted to an accessible form. Remaining problems can make web browsing frustrating and access to some content impossible. Despite its potential, the web remains difficult to use for many of those who could benefit most [34]. This is particularly true for blind web users, whose view of content is devoid of the rich visual structure that developers use to organize web information. Web developers can help by creating content that relies less on its visual representation. Existing web standards, such as the World Wide Web Consortium (W3C) Web Content Accessibility Guidelines [142] and Section 508 of the U.S. Rehabilitation Act, offer quantifiable rules to which web developers can strive to adhere, but creating truly accessible content requires a more subtle and discerning evaluation [85].

Web developers can improve access by following standards and creating content with accessibility in mind, but many developers are either ill-equipped to create accessible content or unaware of the problem. Access problems are wide-ranging, and preventing them can require a subjective design process. Information encoded as images, in the layout, or as color is not accessible for blind web users. Finding desired content in the long linear stream to which a complex page is converted can be frustrating for screen reader users. Small text and fixed-width structures do not work well for low-vision users or anyone using a small screen device. With so many possible problems, everyone can find themselves in a situation in which the web fails to work as well as it could for them, preventing full realization of the web's potential.

Compounding the problem are the numerous stakeholders in web access, who often lack tools and opportunities to improve access and help one another. Blind people often cannot fix the problems that they experience themselves and cannot easily communicate the problems to those who could. Developers and content producers may not understand the problems faced by users different from themselves, or may overestimate the costs of fixing existing problems. Developer tools and user agents have evolving support for creating and conveying accessible content, and often play catch-up when new technologies are released. This dissertation considers intelligent tools that enable blind web users to help one another improve the web.

1.1.1 A Motivating Example

As an example, consider the following scenario describing the problems that Joe, a blind computer user, might experience when reading his web-based email and following a link sent by a friend.

First, Joe opens a web browser and navigates to gmail.com. His screen reader presents the content in a linear, time-dependent stream of voice. To login to the site, Joe first needs to find the "username" textbox. Fortunately, he has memorized the keyboard shortcut to skip directly there, is able to quickly enter his username and password, and then press submit. Joe's screen reader announces that a page has loaded and he begins to explore the new page, expecting to find his inbox.

Figure 1.1: A motivating example of existing access problems.
(a) Finding content, even on the relatively simple gmail.com login page, can be time-consuming. (b) An incorrect login is difficult to detect, and an audio CAPTCHA must be solved to try again. (c) The most efficient route to the inbox requires knowing arbitrary key mappings tied to the underlying HTML structure of the web page, (d) as does finding the beginning of the message. (e) A table of important statistics and other information on nytimes.com is an image assigned the uninformative alternative text "INSERT DESCRIPTION," making it impossible for a screen reader user to read.

Skipping through the page by heading tags usually lets him get to his inbox more quickly, but his screen reader unexpectedly reports that the page contains no heading tags. After some exploration, Joe eventually realizes that he's still on the login page. Without a quick visual scan, it was not obvious that the page had not changed as he expected, and finding the error message explaining that he had not entered his password correctly would have taken a long time. The interface was designed for visual use, and can be frustrating for Joe to use with a screen reader even though all of the content is technically accessible to him.

Due to the failed login, gmail.com asks Joe to complete a CAPTCHA. The site provides an audio CAPTCHA as an accessible alternative, but the interface provided is not easy to use with a screen reader. Joe presses a button to play the CAPTCHA and then navigates to an answer field to enter the text that he hears. During this navigation, his screen reader talks over the playing audio CAPTCHA, announcing navigation information. Audio CAPTCHAs are purposefully made to be difficult to understand, and the added interference from the screen reader makes them all but impossible. Joe eventually just asks a sighted friend to solve the much easier visual CAPTCHA.

After successfully logging into the site, Joe reads his new messages. To reach the inbox, he presses the 'h' key repeatedly to cycle through the headings on the page. Once he hears "Gmail Inbox," he can navigate forward from the beginning of the inbox to review his messages. Joe has not yet learned the trick of pressing the 'x' key to jump directly to the inbox, which works because each message in the inbox happens to be preceded by a checkbox. This mapping is arbitrary, impossible to predict before visiting the site, and confusing to a non-technical user.

A new message from a friend includes a link to an article on nytimes.com that discusses the prevalence of bullying in elementary schools [23]. A figure containing numerous statistics and descriptive text supporting the claims in the article is represented as an image. The creator of this content chose not to provide a meaningful textual description that the screen reader could read. Instead, a default description ("INSERT DESCRIPTION") is read when Joe reaches the image, as the markup sketched below illustrates.
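The markup pattern behind this kind of failure is easy to see. The snippet below is a hypothetical reconstruction (the filename is invented for illustration), contrasting the placeholder alternative text with the kind of meaningful description the content creator could have provided:

    <!-- Hypothetical markup illustrating the problem: the alt attribute
         still holds a template placeholder, so a screen reader can only
         announce "INSERT DESCRIPTION". -->
    <img src="bullying-statistics.gif" alt="INSERT DESCRIPTION">

    <!-- A meaningful description would make the same content readable: -->
    <img src="bullying-statistics.gif"
         alt="Table of statistics on the prevalence of bullying in elementary schools">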
Finally, Joe had to wait until he returned home to his own computer before he was able to access his email, even though he had been at the public library earlier that day. The computers at the library he visited did not have screen readers installed on them. Without a screen reader, those computers were not possible for him to use. He could have tried to run the screen reader executable that he keeps on a USB keychain, but the USB ports weren't easy to access and the computers prevent new software from being run as a security precaution anyway.

The preceding example highlighted a number of accessibility problems that a blind person might encounter while completing the common task of checking their email. Most of these problems have work-arounds - for example, blind users can always default to a slow linear scan of a page to find changed content, or they can carry their own laptop around with them so they'll also have access to a screen reader. Some problems, like the image lacking alternative text, do not have easy work-arounds. Nevertheless, with the right tools (for instance, Optical Character Recognition (OCR) software), blind people could use this content. Even the lack of access technology could be addressed by blind web users if a base level of access was available from any computer. Once one blind user has solved a particular problem, future users should benefit from the prior solution. This dissertation looks at how blind web users could be empowered to collaboratively improve web content, and help one another more effectively overcome the problems highlighted in the motivating example.

1.2 Accessibility, Usability and Availability

Creating accessible content is multi-faceted, just like any design process. Content is not simply accessible or not; instead, access exists on a spectrum across several dimensions. The accessibility and usability of content have been extensively explored [130, 95], and research suggests that they are linked [103]. We also explore the availability of access, a less frequently explored dimension that often determines successful web access. This dissertation considers how end users can improve their access along the dimensions of accessibility, usability and availability (Figure 1.2). The remainder of this section describes these dimensions in relation to web access and provides examples.

Figure 1.2: Effective web access involves more than simply making it possible for users with diverse abilities to access content. Content must also be usable and the tools needed to access it widely available. Accessibility is the foundation of usability and availability, usability increases the potential audience for whom access is possible, and availability determines where content can be accessed and who will be able to access it.

1.2.1 Accessibility

Accessibility means making access to content possible. For example, in the motivating example presented earlier, Joe was unable to access the statistics and other information contained within the image that was not assigned alternative text. Other examples include encoding information in either the color or visual layout of the page - for instance, "names marked in red have been chosen to participate" - and embedding content that screen readers cannot access - for instance, some Flash applications do not correctly implement the accessibility APIs that enable blind people to access them.

Chapter 4 discusses a tool called Accessmonkey that helps end users write scripts to improve the accessibility of the content that they access. For example, one component of this tool can automatically formulate alternative text for images. Blind and sighted people can use Accessmonkey to create scripts that make content more accessible and then share those scripts with others. Making content possible to access is fundamental to improving access along other dimensions. A sketch of what such an alternative-text script might look like follows.
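The following is a minimal sketch of that idea, not the actual WebInSight or Accessmonkey code: it repairs missing or placeholder alternative text using crude stand-ins (the link target or the image filename) for the techniques the real system uses, and the placeholder list and heuristics are assumptions made purely for this example.

    // Sketch of an Accessmonkey-style user script that repairs missing or
    // placeholder alternative text. Illustration only; the placeholder
    // list and the heuristics below are invented for this sketch.
    (function () {
      var PLACEHOLDERS = ["INSERT DESCRIPTION", "IMAGE", "SPACER"];

      function needsRepair(img) {
        var alt = (img.getAttribute("alt") || "").trim().toUpperCase();
        // Zero-length alternative text is appropriate for purely decorative
        // images, so only treat an empty value as broken when the image is
        // itself a link and therefore must be announced.
        if (alt === "") {
          return img.parentNode && img.parentNode.nodeName === "A";
        }
        return PLACEHOLDERS.indexOf(alt) !== -1;
      }

      function guessAlt(img) {
        // Crude stand-ins for the real system's techniques: describe a
        // linked image by its link target, otherwise fall back to the
        // image filename with punctuation turned into spaces.
        var parent = img.parentNode;
        if (parent && parent.nodeName === "A" && parent.getAttribute("href")) {
          return "Link to " + parent.getAttribute("href");
        }
        var src = img.getAttribute("src") || "";
        var file = src.split("/").pop() || "image";
        return file.replace(/\.[a-zA-Z]+$/, "").replace(/[-_]/g, " ");
      }

      var imgs = document.getElementsByTagName("img");
      for (var i = 0; i < imgs.length; i++) {
        if (needsRepair(imgs[i])) {
          imgs[i].setAttribute("alt", guessAlt(imgs[i]));
        }
      }
    })();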
1.2.2 Usability

Usability means creating content in a way that helps users be more effective and efficient. Content can be possible for someone to access, but it might require inefficient or confusing interactions. Just as the elements of usable visual designs are partly subjective, so are the elements of design for other modalities. A complex page containing a large amount of information may be confusing or inefficient for blind web users to navigate, but adding semantic information - for example, an outline using heading tags - can help blind users more effectively access it [138].

Chapter 5 illustrates this concept by exploring in depth the interactions required to solve current audio CAPTCHAs. By making the interface to audio CAPTCHAs more usable, blind participants improved their success rate by 59%. The improvements made demonstrate the importance of broadly applicable principles in designing content for blind web users. Importantly, web users themselves have control over the interface they use to solve audio CAPTCHAs, as they can overwrite the existing interface with the improved interface using a script that we provide. The core idea behind the improved interface is sketched below.
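The following is a minimal sketch of the core idea, not the script we distribute: playback of the audio CAPTCHA is controlled from inside the answer textbox itself, so the user never navigates away while listening and the screen reader has nothing to announce over the audio. The specific control keys (comma and period), the element id, and the audio URL are illustrative assumptions.

    // Sketch of the optimized audio CAPTCHA interface: playback controls
    // live inside the answer textbox. The control keys chosen here are
    // illustrative only.
    function attachCaptchaControls(answerBox, audioUrl) {
      var audio = new Audio(audioUrl);  // assumes a browser with HTML5 audio

      answerBox.addEventListener("keydown", function (event) {
        if (event.key === ",") {            // toggle playback
          if (audio.paused) { audio.play(); } else { audio.pause(); }
          event.preventDefault();           // keep the key out of the answer
        } else if (event.key === ".") {     // rewind and replay from the start
          audio.currentTime = 0;
          audio.play();
          event.preventDefault();
        }
        // Every other key types into the answer box as usual, so the user
        // never has to leave the box while the CAPTCHA plays.
      });
    }

    // Hypothetical usage, assuming an answer box with id "captcha-answer":
    // attachCaptchaControls(document.getElementById("captcha-answer"),
    //                       "/captchas/challenge.wav");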
TrailBlazer, presented in Chapter 6, extends the idea of more usable interfaces to arbitrary web-based tasks. Blind web users have been shown to be less efficient than their sighted counterparts [17], in part because it takes time to find content in the time-dependent, linear stream of information exposed by the screen reader. TrailBlazer helps guide users through completing tasks on the web, suggesting actions that they might want to take based on their prior actions and those of other users.

1.2.3 Availability

The technology that users need to access the web is often not available to them. Many people do not own their own computers and rely on public computers, such as those at libraries and schools, for access. Access technology is not installed on most computers for many reasons. Improving availability means making access technology available wherever users happen to be. Chapter 7 explores tools that bring access technology, including many of the projects outlined elsewhere in this dissertation, to the computers to which users happen to have access. Chapter 8 discusses the implications of a more available delivery model for access technology.

Summary

Many access problems can be improved with existing technology. Accessible alternatives can be provided for inaccessible content, web interfaces can be designed with careful consideration of blind people using screen readers, and computer administrators can install screen readers on all computers. With current approaches, there is always a bottleneck: someone who may not make the best choices for access but on whom access relies. The creation of accessible web content relies on web developers and content producers, and getting access software on public computers relies on computer administrators. While these people are in the best position to improve access, they may not have the knowledge, desire or awareness to make the best choices for achieving access.

1.3 Who should fix the web?

Who should make the web accessible? This is a technical question, a legal question, and even an ethical question. It is a question that can conjure a desire to help, but one that quickly leads to a litany of other questions: What does it mean to make the web accessible? How much will it cost? Do disabled people even visit my site? With so many questions being asked, and with few concrete answers available, often nothing is done.

The question of who should make the web accessible is misleading. Until recently, only content producers could directly influence its accessibility; blind end users were primarily consumers. This dissertation shows the advantages of a complementary approach: enabling blind web users to collaboratively improve the accessibility of their own web experiences. Expanding the set of possible "whos" to anyone with the need, incentive or desire to create a more accessible web forms a more pragmatic solution to access problems. This dissertation discusses how to better understand the problems that blind web users face on the web and enable blind web users to independently improve their own access across the three dimensions.

1.3.1 Enabling Users to Improve Access

In order to enable blind web users to improve the accessibility of their own web experiences, we made contributions in the following areas: recording web interactions, understanding the problems that users face, predicting future web interactions, and leveraging predictions to help end users improve access for themselves. Our goal is to understand what problems most impact the access of users, and develop tools that effectively leverage the intelligence of blind web users to improve access for everyone. The contributions of this dissertation are summarized in Section 9.1 and described in more detail throughout the document as outlined next.

1.4 Dissertation Outline

The remainder of this dissertation is organized as follows:

Background (Chapter 2). Overview of prior work in (i) understanding the user experience, (ii) improving the accessibility of web content for blind web users, and (iii) improving the availability of access technology.

Understanding Problems (Chapter 3). Describes WebinSitu, infrastructure for conducting remote studies with disabled users. WebinSitu records web interactions with a proxy and, unlike many prior systems, records user actions within web pages. The studies that we completed using WebinSitu quantify the differences between blind and sighted web users.

Accessibility (Chapter 4). Explores the potential of blind web users to both collaboratively improve accessibility for themselves and partner with web developers to improve their content. Presents an implementation of this idea, Accessmonkey, which enables users to create and inject scripts across many platforms to improve the accessibility of web content and share improvements with web developers.

Usability (Chapters 5 and 6). Demonstrates via a large study of audio CAPTCHAs conducted with WebinSitu that audio CAPTCHAs are much more difficult than visual alternatives. Offers a redesigned interface that targets non-visual use, which blind web users can add to existing web sites with Accessmonkey. Provides quantitative results showing the importance of designing non-visual interfaces with users in mind. Presents TrailBlazer, which lets blind web users record and replay tasks on the web to make browsing more efficient. By predicting the actions that users might want to complete next, TrailBlazer can also help make users more effective at completing tasks for which no script has already been recorded.

Availability (Chapters 7 and 8). WebAnywhere improves the availability of access by enabling users to access the web on any computer that has web access. WebAnywhere can also help get technology for improving accessibility and usability to users wherever they happen to be. WebAnywhere is broadly a platform for delivering access technology to people with diverse needs, requirements and goals.
Since its release, WebAnywhere has drawn a large global audience of blind and low-vision users, but also web developers, special education teachers, people learning English, and people with learning disabilities. Because WebAnywhere is free and does not need to be installed, people can quickly try it and use it if it works for them.

Realizing the full potential of the web for blind people involves improving access across several dimensions. Relying only on developers or computer administrators to create accessible and usable content and provide the tools required by users has been shown not to be enough to address these problems. This dissertation describes intelligent tools that can help blind web users partner with developers and administrators without relying on them.

Chapter 2

RELATED WORK

This chapter summarizes related work in understanding and improving the accessibility, usability and availability of the web. Section 2.1 considers work in understanding the web experience of blind users, specifically in terms of the accessibility and usability of web content, and describes gaps in this understanding that WebinSitu (Chapter 3) helps to fill. Section 2.2 describes work that has sought to improve the accessibility and usability of web content, either automatically or by supporting developers. This dissertation explores several tools that involve users in the process of improving content (Chapters 4-6). Section 2.3 considers a wide variety of products designed to improve the availability of web access for blind web users, motivating the need for WebAnywhere (Chapter 7), which provides a base level of access on any web-enabled device.

2.1 Understanding the User Experience

A vital component of improving web accessibility is understanding the user experience of those involved. Previous work has offered surprising results: for instance, that blind web users may not evaluate the accessibility of web content as thoroughly as sighted developers employing screen readers [85]. More work needs to be done to understand the user experience from the perspective of blind web users and understand the implications for blind users working collaboratively to help improve access. In this section, we review work in understanding the experience of web users, especially disabled users, motivating our own WebinSitu approach (Chapter 3).

The Disability Rights Commission in the United Kingdom sought to formally investigate the accessibility problems faced by users with different disabilities [34]. In this extensive study, the results of focus groups, automated accessibility testing, user testing of 100 websites, and a controlled study of 6 web pages were combined. This work identified the effects of a number of different accessibility problems.

Coyne and Nielsen conducted extensive observation of blind web users by going to the homes and workplaces of a number of blind individuals [35]. Each session comprised manually observing users completing four specified tasks, which did not allow them to record low-level events associated with each browsing session or to observe for extended periods. Since both studies were conducted, new web technologies have become increasingly important, such as scripting and dynamic content in the form of AJAX, Adobe Flash, and Rich Internet Applications (RIAs). WebinSitu adds to both studies by enabling the observation of participants over a longer period of time.
We measure the practical and observable effects of accessibility features on web users, which is difficult to determine in a controlled lab study. Watanabe used lab-based studies to find that the proper use of HTML heading elements to structure web content can dramatically improve the observed completion time of both sighted and blind web users [138]. They used screen recordings and key loggers combined with in-person observation to implement their study.

Another approach that has been explored is to consider the accessibility of the web divorced from user behavior. Some studies have used manual evaluation of pages [102], others have been conducted automatically via a web crawl [18, 36], and still others have used automated evaluation [39]. Not considering which pages users will likely visit or the coping strategies they might employ makes the practical effects of the results obtained difficult to interpret.

Because finding participants who meet specific requirements in a given geographic area and recreating personalized setups and software configurations are both difficult, studies with disabled populations can be costly, time-consuming and, in many cases, impractical. This has led some researchers to utilize remote evaluation and observation procedures instead [101, 85, 27]. Such user studies are particularly well-suited for investigating web accessibility issues because users employ a variety of screen readers and other assistive technologies that may be difficult to replicate in the lab. Because these technologies are designed for web usage, they are already connected to the Internet and are therefore more easily adapted to remote monitoring. In the remainder of this section, we first review work that has investigated remote user evaluation with disabled users in order to highlight the lessons learned. We will then discuss papers that investigate technical aspects of observing web users remotely.

2.1.1 Remote Studies with Blind Web Users

The most common type of remote study involving blind web users is a diary-based study. For example, Lazar et al. conducted a study in which blind web users recorded their web experiences in user diaries and discovered that, in contrast to sighted users, the mood of blind users does not seem to deteriorate as the inefficiency of the browsing experience increases [74]. Possible problems with diary-based studies are that users may misrepresent their difficulty in achieving tasks and may choose to report only experiences that are unusual. Another option is on-site observation in the homes and offices of participants, as explored by Coyne et al. [35]. On-site studies are expensive and impractical for longitudinal observation.

Mankoff et al. compared the following four evaluation techniques for discovering accessibility problems in web pages: the Bobby [140] automated accessibility evaluator, web developers, web developers using screen readers, and blind web users who evaluated web pages remotely [85]. The web developers were each introduced to the Web Content Accessibility Guidelines (WCAG) 1.0 accessibility standard [143]. Based on representative tasks developed in a baseline investigation with blind web users, members of each of the four conditions were asked to identify accessibility problems. The results indicated that multiple web developers employing screen readers were able to find the most accessibility problems and the automated tool was able to find the fewest.
Remote blind users, although shown to be much less thorough at identifying web accessibility problems, most often found true problems (that is, they labeled few false positives). This number was also artificially deflated because users were unable to complete some of the tasks due to severe accessibility problems. The researchers also speculated that the remote users were not adequately encouraged to report all accessibility problems and hoped to improve on this in future work.

Petrie et al. investigated remote evaluation by disabled users further by comparing the results obtained via remote evaluation with those obtained via laboratory evaluation [101]. In this work, participants conducted the following two evaluation tasks: an evaluation of a system that converts certain technical diagrams into spoken language, and a summative evaluation of websites. Users who conducted their evaluations remotely were more likely to give high-level qualitative feedback, whereas users who evaluated locally gave more specific evaluations. For example, when evaluating the same feature, a remote participant said "there are a lot of problems with the description," while a local participant said, "I take it each thing in brackets is the thing they are talking about [Door to Hall and Living room]... [Door to Bathroom] it is not clear what the relationship is... I cannot technically understand these relationships...it just doesn't work." While both participants expressed their inability to understand the relationships expressed by this system, the local participant provided much more useful feedback. In other instances, usability was so poor that remote users could not determine whether they had successfully completed a task and, therefore, could not provide an adequate evaluation. In local studies, researchers could tell participants whether or not they had successfully completed the task. An important conclusion of this work is that while remote evaluation may be appropriate for summative results, it is often not as valuable during formative studies, when the technology may not work as intended and researchers benefit greatly from observing how users attempt to interact with it.

A common theme of both the Mankoff and Petrie studies is that to maximize the value of remote evaluation, the technology used must achieve at least a base level of usability; otherwise participants may not be able to provide useful feedback. When operating away from the guidance of researchers, participants may be less likely to be able to successfully complete required tasks. Furthermore, the quality and extent of qualitative feedback is likely to be much greater in local studies in which participants can interact with the researchers. We seek to address these problems in our WebinSitu work by asking qualitative questions before and after remote studies.

2.1.2 Recording User Data

Numerous projects have been created to record and aggregate web usage statistics to better understand the experience of web users. A common approach has been to use a traditional proxy that passively observes participants as they browse the web. Users have been central to the web from its inception, and an extensive body of work has been dedicated to better understanding the user experience. Initially, much of this research concentrated on improving the quantitative performance of web infrastructure components. The Medusa proxy allows web researchers to go a step beyond these first systems by investigating the impact such systems have as perceived by the user [71].
The Medusa proxy is a non-caching forwarding proxy that provides several convenient features to researchers investigating the user experience. These features include the ability to simultaneously mirror requests (send a request to multiple sources) and the ability to transform user requests. The Medusa proxy was used to explore the user-perceived advantages of using the NLANR cache hierarchy and the Akamai content distribution network. Later work used the proxy to discover that, of several HTTP optimizations, parallel connections provide the largest benefit and that persistent connections are helpful, but only on the subset of pages on which they are fully utilized [12]. The ability of the Medusa proxy to record and then replay request logs allows collected data to be reused, reducing the cost of testing variations. Medusa allows researchers to accurately record quantitative data about web browsing as long as that data can be directly observed as part of an HTTP request, but the system is unable to record finer-grained user events such as mouse movements, button clicks, or keys pressed. Such detail is important for discovering the web components that directly affect the observed access of users.

Goecks and Shavlik used an augmented browser to record the mouse movements and scrolling behaviors of users. They showed that these recordings could be used to predict how users would interact with their web browsers [45]. Claypool et al. utilized a similar approach in order to compare the explicit ratings that users provide to web pages with several methods of gathering implicit ratings [32]. While an augmented web browser allows recording data at a finer granularity, we would like to avoid requiring users to use specific web browsers in order to avoid confounding factors related to the introduction of new tools. Users in disabled populations also use a wide variety of web browsers and screen readers, making it impractical to provide a version tailored to each user's individual setup.

WebinSitu (Chapter 3) addresses these concerns using an enhanced web proxy. A proxy-based approach is desirable because it is relatively easy to set up and users can be remotely observed using the technology to which they are already accustomed. Gonzalez introduced a Java-based system that is capable of remotely monitoring users as they browse the web [46] and later advocated its potential for conducting remote usability studies with people with disabilities [47]. The goal of using a proxy is to eliminate confounding factors that are often an unfortunate consequence of traditional laboratory studies, such as users being unable to use the assistive technology to which they have become accustomed. The Gonzalez proxy introduces Java applets into web pages that, once on a client's machine, can observe the actions of users and report back to the remote server.

UsaProxy takes the lighter-weight option of using Javascript to remotely record user actions and has found it to be quite powerful [9]. In this approach, users connect to a proxy server which alters web pages on-the-fly to include a Javascript script that monitors the user's actions. When a page is loaded, the Javascript script attaches event listeners to most Javascript events, including the keypress, mouseover, and mousedown events. It also includes code that polls at regular intervals to determine the position of the mouse. For all events, the script records the event type, the time the event occurred, and the element of the Document Object Model (DOM) associated with it (when applicable). To allow this information to be recorded, the system sends messages back to the server using AJAX. The resulting log file contains a combination of these user event recordings and the information recorded by a traditional web proxy, such as the Medusa proxy. Use of the system is unobtrusive and does not affect the normal function of the web browser. A sketch of this style of instrumentation appears below.
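The following sketch illustrates this style of injected instrumentation. It is modeled on the description above rather than on UsaProxy's actual source; the /log endpoint, the polling interval, and the element-naming scheme are assumptions made for illustration.

    // Sketch of the kind of script a UsaProxy-style proxy injects into
    // pages. Illustration only, not UsaProxy's actual code.
    (function () {
      var lastMouse = { x: 0, y: 0 };

      function describe(element) {
        // Identify the DOM element associated with an event, when applicable.
        return element && element.nodeName
            ? element.nodeName + (element.id ? "#" + element.id : "")
            : "";
      }

      function report(type, target) {
        // Ship the event type, timestamp, and element back to the proxy's
        // logging endpoint via AJAX.
        var xhr = new XMLHttpRequest();
        xhr.open("GET", "/log?event=" + encodeURIComponent(type) +
                        "&time=" + Date.now() +
                        "&target=" + encodeURIComponent(describe(target)), true);
        xhr.send();
      }

      // Listen for the user events of interest on the whole document.
      ["keypress", "mouseover", "mousedown"].forEach(function (type) {
        document.addEventListener(type, function (event) {
          report(type, event.target);
        }, true);  // capture phase, so page handlers cannot swallow events
      });

      // Track the mouse position, and poll at a regular interval to record
      // it even when no discrete event fires.
      document.addEventListener("mousemove", function (event) {
        lastMouse = { x: event.clientX, y: event.clientY };
      }, true);
      setInterval(function () {
        report("mousepos:" + lastMouse.x + "," + lastMouse.y, null);
      }, 1000);
    })();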
For all events, the script records the event type, the time the event occurred, and the element of the Document Object Model (DOM) associated with it (when applicable). To allow this information to be recorded, the system sends messages back to the server using AJAX. The resulting log file contains a combination of these user event recordings and the information recorded by a traditional web proxy, such as the Medusa proxy. Use of the system is unobtrusive and does not affect the normal function of the web browser. Atterer et al. later demonstrated how a complex AJAX application (gmail.com) can be remotely observed and evaluated using the system [8]. Javascript running in the browser is particularly suited for this type of observation because it has access to the DOM representation of the document being displayed and automatically incorporates updates that have been made dynamically. WebinSitu (Chapter 3) uses the UsaProxy approach, but modifies it to record additional information useful for understanding the user experience of blind web users. The framework exposed by WebinSitu facilitates short experiments or longitudinal studies on arbitrary web content, and will be discussed in the next chapter.

2.2 Enabling Users to Improve Accessibility and Usability

The accessibility of web content can be implemented and improved during the many stages between when the idea for the content is originally conceived and when the implemented web page is conveyed to web users. A simplified view of the stages of this process is outlined in Figure 2.1, including a selection of techniques that have been explored at each stage, which helps to place the related work presented in the remainder of this section. Most of this work has not involved blind web users in the process of improving content. This section highlights work at each stage along the web publishing pipeline, discussing a sample of the solutions that have been explored at each.

Figure 2.1: An overview of the flow of accessible content from web developer to blind web user, along with a selection of the components designed to improve web accessibility that have been explored at each stage. While later stages can influence earlier stages, such change is slower and more difficult to achieve.

Accessibility Standards

Web accessibility standards set guidelines that, if met, should ensure that a web page is accessible to blind users and other web users with disabilities. Achieving accessibility through a set of specific guidelines is difficult in general because implementing web accessibility requires providing efficient access, not just making information available. By requiring the web developers in the screen reader condition of the comparative study by Mankoff et al., discussed previously, to actually use a screen reader, the researchers may have implicitly forced those developers to consider usability [85]. This may partially explain why they fared better than their counterparts who did not make use of the software. Problems with static standards have been previously noted [69], but such standards remain in use, in part due to the difficulty of formulating, checking, and enforcing more subjective usability standards. The most important web accessibility standards for web developers in the United States are the World Wide Web Consortium (W3C)'s WCAG 2.0 [142] and the technology access requirements of Section 508 of the U.S. Rehabilitation Act of 1973, which were expanded and strengthened by the Rehabilitation Act Amendments of 1998.
Many other countries have similar accessibility requirements, many of which are based on the W3C's WCAG [126]. In the United States, only web sites that receive funding from the Federal government are compelled to comply with Section 508 guidelines, while private entities are exempt. A recent court case, however, may expose a wider range of web sites to legal liability if they fail to implement accessibility features. In National Federation of the Blind vs. Target Corporation, a Federal court ruled that Target Corporation could be sued on grounds that its inaccessible web site violated the Americans with Disabilities Act (ADA) because the web site is an extension of the physical store [93].

Developer Tools

Numerous tools are available to web developers that automatically identify accessibility problems. Some of the most popular include A-Prompt [1], UsableNet's LIFT [77], the W3C Accessibility Validation Service [139], IBM alphaWorks' aDesigner [63], and Watchfire's Bobby Worldwide [140]. These tools commonly report on how well a web page adheres to web standards, flagging problems that can be identified automatically, such as missing alternative text, lack of row and column heading tags in HTML tables, and use of deprecated HTML tags instead of Cascading Style Sheets (CSS). Automated tools can provide advice to web developers on how to fix the errors that have been identified, but they cannot offer subjective suggestions or critical feedback. As an example, a web page that uses zero-length alternative text for all images will pass most validators with only warnings because zero-length alternative text is appropriate for purely decorative images. As noted previously, web developers must be skilled in order to effectively implement standards. Another approach is to semantically tag web page elements using ontologies that client tools can use to improve the interface to content. The Dante approach recognizes that not all elements of accessibility, particularly those that deal with efficient use of a web page, can be met by existing Hyper Text Markup Language (HTML) markup, and offers web developers the ability to semantically annotate their web pages in order to facilitate more appropriate audio presentation. This approach, in which semantic knowledge was originally assigned manually, was introduced by Yesilada et al. [150]. The Web Authoring for Accessibility (WAfA) ontology (introduced as the Travel Ontology) was designed to facilitate web navigation and is the ontology that Dante uses for annotation [149]. This ontology includes such concepts as Advertisement, Header, and NavigationalList. Once a web page has been annotated with concepts from this ontology, Dante enables easier navigation of it by performing transformations, such as removing advertisements, providing skip links to bypass headers and navigation links, and allowing users to easily move between semantically-related sections of a page. A downside of this approach is that it requires manual annotation by web developers. Work by Plessers et al. removes the manual component of annotating visual objects with semantic labels entirely by building such annotation directly into the design process [105]. This work was based on an existing web engineering approach (the Web Site Design Method (WSDM) [38]), but could potentially be extended to work with any formal design methodology.
Design methodologies in general, and WSDM in particular, help web developers break down the task of creating a usable web site into a series of manageable phases. They force designers to carefully consider their target audience and the tasks those users will most likely want to perform before considering low-level details related to implementation. As part of this process, WSDM guides users through creating models of navigation, tasks, and objects on the site. Plessers et al. demonstrated that 70.51% of the WAfA ontology could be automatically generated on web sites that were created using the WSDM process by directly mapping elements of the WSDM ontology to elements in the WAfA ontology (84.62% in the best case). This approach certainly has potential in that it shows web developers who are already using a formalized method for designing and implementing web pages that they can build in accessibility without added cost. Unfortunately, its practical benefit would seem to come largely from its ability to shift the justification for semantic annotation of content away from the merits of accessibility to a justification based on the merits of using a formal design method. Plessers et al. also showed that accessibility through semantic markup can be built into dynamically-generated, template-based web pages, which is a powerful idea and may benefit the accessibility of a number of web sites even if they are not employing a formal design method. The automatic method presented in this work still requires some manual annotation, however, and so still suffers from the drawbacks associated with such approaches.

Automatic Transcoding for Improved Accessibility

Automatically transcoding content in order to render it more accessible is an approach that has been explored extensively. Tools that use this approach generally intercept content after it is retrieved from the web server (where it is stored) and before it is read to web users. This step can occur as part of an edge-service co-located with the web server, as part of a transformation proxy to which users connect, or as part of a web browser plugin or extension. A number of previous systems have been introduced that help screen reader users independently transform documents to better suit their needs, including several systems that use a proxy-based architecture that allows web pages to be automatically made more accessible before they are delivered to blind web users. Iaccarino et al. described edge-services for improving web accessibility designed to be implemented at the edge between web servers and the Internet [61, 62]. Others have proposed that such transformations be implemented in a proxy to which clients can connect [53]. Harper et al. created an extension for the Firefox web browser that inserts GIST summaries, allowing blind users to get an overview of web content without reading the entire page [54]. von Ahn et al. created online games that entice humans to accurately label images and suggested that the labels produced could be stored in a centralized database where they could be used to help make web images more accessible [135, 134]. Altifier provides alternative text for images using a set of heuristics [136]. Chapter 4 discusses how we can apply many transcoding services using Javascript scripts that can be introduced by web pages, web proxies, or client-side tools. As an example, we present our WebInSight system, which automatically formulates and adds alternative text to web images that lack it [18].
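As a simplified illustration of this kind of transcoding, the following sketch adds fallback alternative text to images that lack it. This is not WebInSight's actual pipeline; the filename-based fallback heuristic is an assumption used only for illustration.

// Simplified sketch of on-the-fly alternative-text transcoding, in the
// spirit of WebInSight. This is an illustration, not WebInSight's actual
// pipeline; the filename-based fallback heuristic is a simplification.
function addMissingAltText(doc) {
  var images = doc.getElementsByTagName("img");
  for (var i = 0; i < images.length; i++) {
    var img = images[i];
    // Skip images that already have alternative text (including the
    // zero-length alt appropriate for purely decorative images).
    if (img.hasAttribute("alt")) continue;
    // Fallback: derive a rough label from the image filename,
    // e.g. ".../company_logo.png" becomes "company logo".
    var file = img.src.split("/").pop().split(".")[0];
    var guess = file.replace(/[-_]+/g, " ").trim();
    img.setAttribute("alt", guess ? guess : "");
  }
}

addMissingAltText(document);

A script like this can run in the browser, in a proxy, or be handed to a developer as a suggested fix, which is the reuse this dissertation advocates.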
Screen Readers

Screen readers were originally developed in the context of console computing systems and converted the visual information displayed by a computer to text that could be read (hence the name screen reader) by directly interfacing with the computer's screen buffer. Since the arrival of this technology in the 1980s, a number of more advanced screen readers have been developed to handle the graphical user interfaces exposed by modern operating systems. Popular screen readers include Freedom Scientific's JAWS [82], G.W. Micro's Window-Eyes [146], IBM's Home Page Reader [64], IBM alphaWorks' open source Linux Screen Reader (LSR) [78], and Screenreader.net's freeware Thunder Screen Reader [127]. Modern screen readers attempt to interface directly with the applications that they are reading and have introduced technology, such as off-screen models of web browsing, that seeks to overcome limitations of that interface. Screen readers have been developed extensively to make web content accessible, and they work reasonably well when content has been appropriately annotated for accessibility, but the user experience can still be frustrating and inefficient. The hypertext environment of the web is represented by a wealth of links, multimedia, and structure, and the HTML that represents much of it often lacks the annotation necessary for conveying this information non-visually. For a blind person, a screen reader currently provides the most direct control over how web content is presented. Determining the semantic purpose of web elements is difficult in general, and most screen readers instead report based on the underlying HTML markup. The Hearsay web browser is a screen reader that is designed to be controlled by voice and transforms web content into VoiceXML for audio output [106]. Hearsay does extensive processing on the DOM of web pages to automatically transform it into a semantic tree represented in VoiceXML [131] that is ideal for audio browsing. This approach leverages automatic identification of repetition and locality in HTML documents in order to automatically derive semantic structure. The process for discovering implicit schemas on web pages was derived from earlier work by Mukherjee et al. that focused on isolating the schemas imposed by template-driven HTML documents [91]. This process identifies semantically-related groups of objects that appear together, such as items appearing together in a list. For instance, alternating <h1> and <p> tags in HTML often express the semantic pairing of an object's title and its summary. In this example, these would all be siblings of their parent node, but for auditory browsing they should be paired to allow the listener to efficiently select each group. To aid in the successful reconstruction of the semantic tree, the system uses various heuristics in order to accept partial matches, and leverages semantic similarity of nodes as determined by the similarity of the words contained within them. CSurf, a recent addition to the Hearsay web browser, allows web users to browse in a context-directed manner [84]. In this system, when a user clicks on a link, the text of that link is used to automatically detect an appropriate place in the resulting web page for the browser to begin reading. Hearsay has the potential to greatly improve the efficiency of web browsing, but does so at the potential cost of reduced transparency for users.
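The sibling-pairing idea just described can be made concrete with a brief sketch. This is not Hearsay's actual algorithm, which additionally accepts partial matches and uses word-level semantic similarity; it illustrates only the simplest case of grouping alternating title and summary siblings.

// Minimal sketch of the sibling-pairing idea described above: group
// alternating <h1>/<p> siblings into (title, summary) pairs so a listener
// can select each group as a unit. This is not Hearsay's actual
// algorithm, which accepts partial matches and uses word similarity.
function pairTitlesWithSummaries(parent) {
  var groups = [];
  var children = parent.children;
  for (var i = 0; i < children.length; i++) {
    var node = children[i];
    if (node.tagName === "H1" &&
        children[i + 1] && children[i + 1].tagName === "P") {
      // Pair the heading with the paragraph that follows it.
      groups.push({ title: node.textContent.trim(),
                    summary: children[i + 1].textContent.trim() });
      i++; // skip the paired paragraph
    }
  }
  return groups;
}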
The tree representation created by Hearsay may place items at unexpected levels of the tree, which may render it confusing to users. The system was shown to be very accurate in its ability to correctly create the semantic tree, but these experiments were conducted in the “news” domain, for which an ontology was manually created and Hearsay was manually tuned. The authors later showed the promise of bootstrapping ontologies for new domains [90], although it remains unclear how well these techniques will perform in arbitrary domains. User studies of the system were positive, but they too operated on this manually-tuned “news” domain and were conducted only in short lab studies by (mostly) sighted users. Evaluators hinted at the issue of transparency by suggesting that they would like additional control and that they wanted Hearsay to be more explicit about the types of elements in the semantic tree. CSurf suffers similar transparency concerns because users are automatically redirected to content but are provided no information concerning where on the page they have been redirected or the surrounding context. Chapter 6 presents a system called TrailBlazer that helps users decide what to do next. For its interface, we chose to offer a list of suggestions from which users could choose. This interface is designed to keep users in control, but it remains unclear whether this is sufficient given the limited context afforded by the audio interface. Balancing transparency with automation is a concern with any intelligent user interface, and especially so when the interface concerns blind web users, who may not have supplementary clues to help them determine what is being done for them by the interface.

User Involvement

To browse the web as a blind web user, one must be both skilled and patient. The reality is that a primary way for blind web users to influence the accessibility of content is to request improvements from the original developer of the site. In some cases, these requests have escalated into lawsuits, for example the National Federation of the Blind vs. Target Corporation [93] and Maguire vs. SOCOG [31]. The ability of blind users to directly influence the accessibility of the web has historically been severely limited; the motivation of this dissertation is to bring control back to these users. Blind users also affect the accessibility of the web through their choice of access technology. Iaccarino et al. let users choose among many options for transcoding using a single tool [62]. This offered a substantial benefit for personalization, but users were still restricted to the options offered by the system; there was no easy way for users to contribute new transcoding services. Accessmonkey (Chapter 4) and TrailBlazer (Chapter 6) help users contribute improvements. Recent projects have sought to involve blind users more directly in the process of improving web content. IBM's Social Accessibility Project lets blind web users report problems to sighted volunteers who can fix them [121] and share those improvements in a common repository [67]. This project has two primary advantages: first, it lets blind people easily evaluate and report access problems, and, second, any sighted person with the knowledge and desire to fix problems can do so. Blind users do not directly improve content as part of Social Accessibility, but they are active participants in the process. This dissertation presents several examples of involving blind users in the process of improving web access.
Chapter 4 discusses our Accessmonkey Project, which was at the forefront of collaborative accessibility, and Chapter 6 presents TrailBlazer, which lets blind users demonstrate trails through web pages that they can share with others.

Figure 2.2: Many products enable web access to blind individuals, but few have high availability and low cost (upper-left portion of this diagram). The figure plots user cost against software portability; only systems that both voice web information and provide an interface for browsing it are included.

2.3 Improving the Availability of Accessible Interfaces

Many existing solutions convert web content to voice to enable access by blind individuals. Three important dimensions on which to compare them are functionality, portability, and cost. WebAnywhere provides full web-browsing functionality from any computer at no cost to users.

Screen Readers

Screen readers, such as JAWS [82] or Window-Eyes [146], are special-purpose software programs that cost more than $1,000 US. The Linux Screen Reader [78], the Orca screen reader [97] for the GNOME platform, and the NVDA screen reader for Windows [96] are free alternatives to the commercial products. Screen readers are seldom installed on computers not normally used by blind individuals because of their expense and because their owners are unaware of free alternatives or that blind users might want to use them. Fire Vox is a free extension to the Firefox web browser that provides screen reading functionality [30], the HearSay web browser is a standalone self-voicing web browser [106], and aiBrowser is a self-voicing web browser that targets making multimedia web content accessible [87]. Although free, these alternatives are similarly unlikely to be installed on most systems. Users are rarely allowed to install new software on public terminals, and installing software takes time and may be difficult for a blind user to do without a screen reader already running. Many users would also be hesitant to install new software on a friend's laptop. WebAnywhere is designed to replicate the web functionality of screen readers in a way that can be easily accessed from any computer, requires minimal permissions on the client computer, and starts up quickly without requiring a large download before the system can be used. Recent versions of Macintosh OS X include a screen reader called VoiceOver, which voices the contents of a web page and provides support for navigating through web content [132]. Most computers do not run OS X, and, on public terminals, access to features not explicitly allowed by an administrator may be restricted. Windows XP and Vista include a limited-functionality screen reader called Narrator, which is described as a “text-to-speech utility” and does not support the interaction required for web browsing [147].

Mobile Alternatives

Mobile access alternatives can be quite costly. PDA solutions can access the web in remote locations that offer wireless Internet and usually also offer an integrated Braille display that can be used with any computer. The GW Micro Braille Sense [25], which includes such a display, costs roughly $5,000 US.
A Pocket PC device and the screen reading software designed for it, Mobile Speak Pocket [88], together cost about $1,000 US. Many cannot afford, or would prefer not to carry, such expensive devices. The Serotek System Access Mobile (SAM) [119] is a screen reader designed to run on Windows computers from a USB key without prior installation. It is available for $500 US and requires access to a USB port and permission to run arbitrary executables. The Serotek System Access To Go (SA to Go) screen reader can be downloaded from the web via a speech-enabled web page, but the program requires Windows, Internet Explorer, and permission to run executables on the computer. The AIR Foundation [2] has recently made this product free. A self-voicing Flash interface for downloading and running the screen reader is provided.

Alternatives with Limited Functionality

Some web pages voice their own content, but are limited either by the scope of information that can be voiced or by the lack of an accessible interface for reaching the speech. Talklets enable web developers to provide a spoken version of their web pages by including code that works with the Talklet server to play a spoken version of each page [123]. This speech plays as a single file, and neither its playback nor the interface of the page can be manipulated by users. Scribd.com converts documents to speech [117], but the speech is available only as a single MP3 file that does not support interactive navigation. The interface for converting a document is also not voiced. The National Association for the Blind, India, provides access to a portion of their web content via a self-voicing Flash movie [92]. The information contained in the separate Flash movie is not nearly as comprehensive as the entire web page, which could be read by WebAnywhere. Web information can also be retrieved using a phone. The free GOOG-411 service enables users to access business information from Google Maps using a voice interface over the phone [48]. For a fee, email2phone provides voice access to email over the phone [40].

Availability Summary

Products that provide full web-browsing functionality are shown in Figure 2.2. The portability axis is approximate. Solutions that can be run on any computer can also be run on wireless devices and are therefore rated more highly. Mobile phones are more portable than other solutions, but only when cellphone service is available. WebAnywhere will be able to run on many mobile devices, regardless of the underlying platform, as they increasingly support more complex web browsers that can play sound. The WebAnywhere web application is more highly portable than the Serotek System Access Mobile, which can run only on Windows computers on which users have permission to run arbitrary executables. Braille Sense PDAs use a proprietary operating system, and some versions cannot connect to wireless networks using WPA security. WebAnywhere is designed to be free and highly available, but other solutions may be more appropriate or provide more functionality for users with different requirements using different devices.

Chapter 3
WEBINSITU: UNDERSTANDING ACCESSIBILITY PROBLEMS

The extent of accessibility problems and their practical effects on the browsing experience of blind web users are not yet adequately understood. For web access guidelines, standards, and future technology to be truly relevant and useful, more information about the real-life web interactions of blind web users is needed.
In this chapter, we present infrastructure that enables remote web studies of blind participants. Unlike prior systems, the WebinSitu infrastructure presented here can capture and record the actions that users perform on the web: for example, the buttons pressed, the text entered into form fields, and the links clicked. The focus of this chapter is on a study using this infrastructure that illustrates the differences in browsing behavior of blind and sighted web users. The problems identified in this study help to motivate many of the solutions presented later in this dissertation, and the infrastructure is used later to help evaluate those solutions.

3.1 Motivation

We used an advanced web proxy to enable our study and quantitatively measured both the presence and the observed effectiveness of components thought to impact web accessibility. Most proxy systems can only record HTTP requests and cannot easily discern user actions performed on web pages [28, 58]. WebinSitu is an enhanced version of UsaProxy [9], designed to be used for long periods of time and to record information especially important for access. UsaProxy was used as a base for WebinSitu because it can record actions that are impossible to record with a traditional proxy, such as key presses, clicks on arbitrary page elements (including within-page anchor links), and the use of the “back” button to return to a page that was previously viewed. Recording user actions has traditionally required study participants to install specialized browser plugins [45, 32], but UsaProxy is able to record most user actions by using Javascript code that is injected into the pages that are viewed. Because it uses Javascript to parse the viewed web pages, WebinSitu can also record the use of technology at the center of increasingly important accessibility concerns, such as dynamic page changes, interaction with dynamic content, and AJAX requests. A proxy approach enables transparent setup by participants and allows them to use their own equipment with its existing configuration. Prior work has sought a better understanding of the web user experience for general users [58, 71]. The importance of measuring accessibility in situ from the user perspective is illustrated by the relative popularity of the web sites visited by users in our study, as shown in Figure 3.1. The distribution is Zipf-like [26], with three sites (google.com, myspace.com, and msn.com) accounting for approximately 20% of the pages viewed by the participants in our study. The google.com domain alone accounted for almost twice as many page views as the 630 domains that were viewed five or fewer times during our study. The accessibility of popular sites therefore affects users more strongly than does the accessibility of sites on the long tail of popularity. While our study is not a replacement for laboratory studies that use common tasks, it offers an important view of accessibility that better matches the experiences of real users. Blind web users have proven adept at overcoming accessibility problems, and one of the goals of this study was to gain a clearer picture of how blind users might be changing their browsing behavior to avoid problems. For instance, the lack of alternative text is an often-cited accessibility concern, but blind users can often obtain the same information contained within an image from the surrounding context.
Within-page anchors called “skip links” are designed to help blind users effectively navigate complex web pages by enabling them to jump to relevant content, but these links may be used infrequently because other screen reader functionality also enables users to move quickly through a page. If the context surrounding links on a page is not clearly expressed to blind users, they may explore the page by clicking on links simply to see where they point and then returning. WebinSitu explores whether blind web users tend not to visit inaccessible content and considers strategies that they may be using to cope with the problems they experience. Quantitative differences that are observed may suggest motivation or browsing strategies, but causal relationships are difficult to determine clearly from interaction histories alone. To address this problem, we supplement our observations with qualitative feedback to get a better sense of why we might have observed these differences.

Figure 3.1: Log frequency of visits per domain name recorded for all participants, ordered by popularity.

The direct effects of technology and developer practices for improving accessibility are also difficult to measure in practice because users employ many different browsing and coping strategies that may vary based on the user's familiarity with the page being accessed. Related work has looked at task-based analysis of accessibility [85, 138, 34, 101], with a major focus on supporting effective accessibility evaluation (see Ivory for a survey of this work [65]). Realistic studies with blind web users are difficult to conduct in the lab due to difficulties in replicating the diversity of assistive technology and configurations normally used by participants. Previous work has advocated remote studies because they allow participants to use their existing assistive technology and software [85, 101, 47]. These studies noted that blind participants often provide less feedback when a page is considerably inaccessible, indicating that simply asking blind users to list the problems they face may not be sufficient. Overall, we found that blind web users browse the web quite similarly to sighted users and that most pages visited during our study were inaccessible to some degree. In our study, these problems are placed in the context of their predicted effects because we implicitly weighted pages relative to their popularity. Perhaps most surprising, blind participants generally did not shy away from pages exhibiting accessibility problems any more than did sighted users. Blind participants were, however, much less likely to visit pages containing content not well addressed by assistive technology. Blind users tended not to visit sites heavily dependent on AJAX, but visited many pages that included Flash content. Blind users also interacted less with both dynamic content and inaccessible web images. Skip links, added to web pages to assist screen reader users, were used only occasionally by our participants. Although from interaction histories alone we cannot determine with certainty the causal relationship behind such differences in browsing behavior, these observations, combined with the known technical capabilities of assistive technology, present a strong case that access problems cause the differences. The contributions of this chapter are as follows:

• We present the design of the WebinSitu infrastructure, which enables remote user studies with disabled participants.
• We demonstrate the effectiveness of proxy-based recording for exploring the interaction of blind web users.

• We compare the browsing experience of sighted and blind web users on several quantitative dimensions, and report on access as experienced by blind web users.

• We formulate practical user observations that can influence the direction of future web accessibility research.

3.2 Recording Data

To enable our study, we developed a tracking proxy called WebinSitu to record statistics about the web browsing of its users. WebinSitu is an extended implementation of UsaProxy (Figure 3.2), which allows both HTTP request data and user-level events to be recorded [9]. This method of data collection allows participants to be located remotely and to use their own equipment, which is important for our study because of the diversity of assistive technology and configurations employed by blind users. Our proxy-based approach requires minimal configuration by the user and does not require the installation of new software. Connecting to the system involved participants configuring their browsers to communicate with the tracking proxy and entering a login and password. Names and passwords were not connected with individuals, but a record was kept indicating whether the participant primarily uses a screen reader or a visual browser to browse the web.

Figure 3.2: Diagram of the system used to record users' browsing behavior.

A browsing session begins with the participant initiating an HTTP request, which is first sent to the proxy and then passed directly to the web server. The web server sends a response back to the proxy, which logs statistics about the response header and web page contents. The proxy also injects Javascript into HTML responses to record user-generated events and sends this modified response back to the user. After the response is received by the user and is loaded in their browser, the Javascript inserted into the page can record events such as key presses, mouse events, and focus events and send data about each event, including the DOM elements associated with it, back to the proxy for logging. For example, if a user clicks on a linked image, the click event and its associated image (dimensions, source, alternative text, etc.), the link address, and the position in the DOM are sent to the proxy and recorded. The proxy also records whether content with which participants interact is dynamic (i.e., created via Javascript after the page was loaded) and whether the pages viewed issue AJAX requests. All of the data pertaining to a participant's browsing experience is stored in a remote database. At any time during the study, participants could examine their generated web traces, comment on the web pages viewed, enter general comments about their browsing experience, or delete portions of their recorded browsing history (see Figure 3.6). Our participants deleted only three browsing history entries.

3.3 Study Design

In this study, we considered two categories of data related to web browsing that yield insight into accessibility problems faced by blind web users. Many definitions of blindness exist; we use the term blind users for those users who primarily use a screen reader to browse the web and sighted users for those who use a visual display. First, we recorded statistics relating to the basic web accessibility of pages viewed in our study, such as alternative text for images, heading tags for added structure, and label elements to associate form inputs with their labels.
Second, we considered the browsing behavior of both blind and sighted users, including average time spent on pages and interaction with page elements.

3.3.1 Accessibility of Content

Accessibility guidelines for web developers offer suggestions on how to create accessible web content. Most notable is the WCAG [142], on which many other accessibility guidelines are based. Web developers often do not incorporate the advice presented in these guidelines into their designs [34, 18]. Our study effectively weights pages based on the frequency with which they are viewed, allowing us to measure the accessibility of web content as perceived by web users. The individual metrics reported here suggest the accessibility of the web pages that users view, but cannot capture the true usability of these pages. Because inaccessible pages can be inefficient or impractical to use, blind users may choose to visit sites that are more accessible according to our metrics. In our analysis, we compared the browsing behavior of blind and sighted users according to the metrics below.

Descriptive Anchor Text and Skip Links

Navigating from link to link is a common method of moving through web pages using a screen reader. Many screen readers provide users with a list of links accessed via a shortcut key. However, links can be difficult to interpret when separated from their surrounding context. For instance, the destination of a link labeled “Click Here” is impossible to determine without accompanying context. Prior work has shown that descriptive link text helps users efficiently navigate web pages [55], and related work has explored automatically supplying richer descriptions for links [54]. In our study, we collected all links on the pages viewed by our participants as well as all links clicked on by our participants. We sampled 1000 links from each set and manually labeled whether or not each was descriptive. Skip links are within-page links that enable users to skip ahead in content. They normally appear near the beginning of the HTML source of a page and are meant for blind web users. We identified skip links in two steps. First, we selected all within-page anchors whose anchor text or alternative text (in the case of images used as skip links) contained one of the following phrases (case insensitive): “skip,” “jump to,” “content,” “navigation,” or “menu.” These phrases may not appear in all skip links, but this works for our purpose of efficiently selecting a set of such links (a sketch of this first-pass filter appears below). Second, to ensure that the chosen links were skip links, we manually verified each one chosen in the first step.

Structure, Semantics and Images

Browsing is more efficient for blind web users when the structure of the page is encoded in its content and when the semantics of elements are not dependent on visual features. Heading tags (<h1>... <h6>) have been shown to provide useful structure that can aid navigation efficiency [138]. The <label> tag allows web developers to semantically associate input elements with the text that describes them. Often this association is expressed only visually, which can make filling out forms difficult for blind web users. These are some of the easiest methods for encoding structure and semantics into HTML pages. Their use may foreshadow the likelihood that web developers will use more complex methods for assigning structure and semantics to web content, such as the Roadmap for Accessible Rich Internet Applications (WAI-ARIA) [110].
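The first-pass skip-link filter described above can be sketched as follows. The manual verification step is necessarily omitted; this shows only the automated candidate selection.

// Minimal sketch of the first-pass skip-link filter described above:
// select within-page anchors whose text (or image alternative text)
// contains one of the indicator phrases, case-insensitively. Candidates
// were then verified manually; this sketch covers only the first step.
function findSkipLinkCandidates(doc) {
  var phrases = ["skip", "jump to", "content", "navigation", "menu"];
  var candidates = [];
  var anchors = doc.getElementsByTagName("a");
  for (var i = 0; i < anchors.length; i++) {
    var a = anchors[i];
    var href = a.getAttribute("href") || "";
    if (href.charAt(0) !== "#") continue; // within-page anchors only
    // Use the anchor text, or the alt text of an image used as a link.
    var img = a.getElementsByTagName("img")[0];
    var text = (img ? (img.getAttribute("alt") || "")
                    : a.textContent).toLowerCase();
    for (var j = 0; j < phrases.length; j++) {
      if (text.indexOf(phrases[j]) !== -1) {
        candidates.push(a);
        break;
      }
    }
  }
  return candidates;
}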
The accessibility of web images has often been used as an easily measured indicator of overall web accessibility [18]. In this study, we analyzed the appropriateness of the alternative text on images viewed by participants. We sampled both 1000 of the images contained on the pages viewed by our participants and 1000 images that were clicked on by our participants, and we manually judged the appropriateness of the alternative text provided for these images.

Dynamic Content, AJAX and Flash

The web has evolved into a more dynamic medium than its earlier static pages. This trend, popularly known as Web 2.0, uses Dynamic HTML (DHTML) and Javascript to arbitrarily modify web pages on the client side after they have been loaded. All users may benefit from this technology, but it raises important accessibility concerns for blind users. Changes or updates to content that occur dynamically have long been recognized by standards such as the WCAG [142] as potential problems for screen reader users because dynamic changes often occur away from a user's focus. In our study, we recorded dynamic changes in viewed pages. A dynamic change is defined as any change to the DOM after the page has loaded. We took special interest when users directly interacted with dynamically changed content. Our system cannot detect when users read content that is dynamically introduced, but it can detect when users perform an action that uses such an element. We also recorded how many of the pages viewed by our participants contained programmatic or AJAX web requests. While not necessarily an accessibility concern themselves, these requests are indicative of the complex web applications that often do raise accessibility concerns. A growing number of web pages include Flash content. Recent improvements to this technology have enabled web developers to make much of this content accessible, but doing so requires them to consciously decide to implement accessibility features. Conveying this accessibility information to users also requires that users browse with up-to-date versions of their web browsers, screen readers, and Adobe Flash. We report on the percentage of web pages visited by blind and sighted web users that contain Flash content.

3.3.2 Browsing Behavior

Blind web users browse the web differently from their sighted counterparts in terms of the tools that they use and the way information is conveyed to them. We explored how these different access methods manifest in quantifiable differences according to several metrics. In particular, because blind web users have proven quite adept at overcoming accessibility problems, it is interesting to explore the practical effects of those problems. For instance, an image that lacks alternative text does not conform to accessibility guidelines, but may still be accessible if it points to a web page with a meaningful filename. Similarly, skip links seem as though they would be of assistance to users, but users may choose not to follow them, either because they are most often interested in content that would be skipped or because they prefer potentially longer reading times to potentially missing out on valuable information. Our study seeks to measure such factors. Beyond the simple presence of accessible and inaccessible components in web pages, we also wanted to collect information that helps suggest the practical effects of the accessibility of web page components.

Probing

A probing event occurs when a user leaves and then quickly returns to a page.
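A minimal sketch of how probes might be identified in a timestamped navigation log follows. The log format, the immediate-return simplification, and the 30-second threshold are assumptions made for illustration, not the study's actual parameters.

// Minimal sketch of identifying probes in a timestamped navigation log.
// Each entry is { url, time } in visit order. The 30-second return
// threshold is an arbitrary illustration, not the study's actual value,
// and this simplified version considers only immediate returns.
function countProbes(log, thresholdMs) {
  thresholdMs = thresholdMs || 30000;
  var probes = 0;
  for (var i = 0; i + 2 < log.length; i++) {
    var left = log[i], away = log[i + 1], back = log[i + 2];
    // A probe: leave a page for another, then quickly return to it.
    if (left.url === back.url && away.url !== left.url &&
        back.time - away.time <= thresholdMs) {
      probes++;
    }
  }
  return probes;
}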
Web users often exhibit probing behavior as a method of exploration when they are unsure which link to choose [55]. Probing is also often used as a metric of the quality of returned results when analyzing search engines [145]: if a returned link is probed, then the user likely did not find its contents relevant. Because exploring the context surrounding links is less efficient for screen reader users, they may choose to directly follow links to determine explicitly where they lead. If screen reader users probe more than their sighted counterparts, this would motivate the further development of techniques for associating contextual clues with links. In our study, we investigated the use of probing by our blind and sighted participants.

Timing

Underlying work in improving web accessibility is the goal of increasing efficiency for blind web users. In our study, we attempted to quantify the differences in time spent web browsing by blind and sighted web users. We first looked at average time per page to see if there is a measurable effect of blindness on per-page browsing time. We then looked at specific tasks, identified from our collected data, that were common across our users. The first was entering a query on the Google search engine, looking through the returned results, and then clicking on a result page. The second was using our web history page to find a particular page they themselves had visited during the web study, finding it on the results page, and then entering feedback for the page. Even though both groups of users could accomplish these tasks (the tasks were accessible to each group), this comparison provides a sense of the relative efficiency of performing typical tasks.

3.4 Results

For our study, we recruited both blind and sighted web users. In the end, we had 10 blind participants (5 female) ranging in age from 18 to 63 and 10 sighted participants (3 female) ranging in age from 19 to 61. We began our recruiting of blind users by first contacting people who had previously expressed interest in volunteering for one of our user studies and then by advertising on an email list for blind web users. Our sighted participants were contacted from a large list of potential volunteers and chosen to be roughly similar to our blind participants in age and occupation area. Participants were given $30 in exchange for completing the week-long study. Both our blind and sighted participants were diverse in their occupations, although fields related to engineering and science accounted for slightly more than half of the participants in both groups. We placed no restriction on participation, but all of our participants resided in either the United States or Canada, with geographical diversity within this region. Participants were sent instructions outlining how to configure their computers to access the web through our proxy. Only one participant had difficulty with this setup procedure, and the issue was quickly resolved by speaking with the researchers on the phone. Each participant was told to browse the web as they normally would for 7 days. During this time, our participants visited 21,244 total pages (7,161 by blind participants), which represented approximately 325 combined hours of browsing (141 by blind participants). “Browsing” time here refers to total time spent on our system with no more than 1 hour of inactivity. The pages they viewed contained 337,036 images (109,264 by blind participants) and 926,901 links (285,207 by blind participants).
Of our blind participants, 8 used the JAWS screen reader and 2 used Window-Eyes; 9 used Internet Explorer and 1 used Mozilla Firefox. All of our blind participants but one used the latest major version of their preferred screen reader. None reported using multiple screen readers, although we know of individuals who report switching between JAWS and Window-Eyes depending on the application or web page. All of our participants used Javascript-enabled web browsers, although we did not screen for this.

Figure 3.3: For the web pages visited by each participant, percentage of: (1) images with alt text, (2) pages that had one or more mouse movements, (3) pages with Flash, (4) pages with AJAX, (5) pages containing dynamic content, (6) pages where the participant interacted with dynamic content.

Our data was collected “in the wild,” and, as is often required when working with real data, it was necessary to remove outliers that might have otherwise inappropriately skewed our results. For each metric in this section, we removed data that was more than 3 standard deviations (SD) from the mean. This resulted in an average of 1.04% of our data being eliminated for the applicable metrics. Our measures are averages over subjects. The remainder of this section explores the results of our study for the two broad categories initially outlined in our Study Design (Section 3.3). A summary of many of the measurements reported in this section is presented in Figure 3.3 for both blind and sighted participants.

3.4.1 Accessibility of Content

Descriptive Anchor Text and Skip Links

Overall, 93.71% (SD 0.07) of the anchors on pages visited by blind users contained descriptive anchor text, compared with 92.84% (0.06) of anchors on pages visited by sighted users. The percentage of clicked anchors that were descriptive was slightly higher, at 98.25% (0.03) and 95.99% (0.06), respectively, but this difference was not detectably significant. This shows that web developers do a good job of providing descriptive anchor text. We identified 822 skip links viewed by our blind participants compared to 881 skip links viewed by our sighted participants, which was not a detectably significant difference. Blind participants clicked on 46 (5.60%) of the skip links presented to them, whereas sighted users clicked on only 6 (0.07%); these links are often made invisible in visual web browsers. These results suggest that blind users may use other functionality provided by their screen readers to skip forward in content in lieu of skip links. We were unable to test this hypothesis due to the difficulty of reliably determining when users used screen reader functionality to skip forward in content.

Structure, Semantics and Images

Overall, 53.08% of the web pages viewed by our participants contained at least one heading tag, and there was no significant difference between pages visited by sighted and blind users. We found that, of the pages containing input elements that required labels, only 41.73% contained at least one label element. Using manual assessment of appropriateness, we found that 56.9% of all images on the pages visited by our participants were properly assigned alternative text, as were 55.3% of the images clicked on by our participants. Blind participants were more likely to click on images that contained alternative text.
72.17% (19.61) of images clicked on by blind participants were assigned appropriate alternative text, compared to 34.03% (29.74) of the images clicked on by sighted participants, which represents a statistically significant effect of blindness on this measure (F(1,19) = 11.46, p < .01).

Dynamic Content, AJAX and Flash

Many of the pages viewed by our participants contained dynamic content, AJAX, and Flash content. Pages visited by sighted participants underwent an average of 21.65 (35.38) dynamic changes to their content, as compared to an average of only 1.44 (1.81) changes per page visited by blind participants. This difference was marginally significant (F(1,19) = 3.59, p = .07). Blind users interacted with an average of only 0.04 (0.08) page elements per page that were either dynamically introduced or dynamically altered, while sighted users interacted with 0.77 (0.89) such elements per page. There was a significant effect of blindness on this measure (F(1,19) = 7.49, p < .01). Our blind participants may not have been aware that the content had been introduced or changed, or they may have been unable to interact with it. Pages visited by blind and sighted users issued an average of 0.02 (0.02) and 0.15 (0.20) AJAX requests, respectively. This result is statistically significant (F(1,19) = 4.59, p < .05) and suggests that blind users tend not to visit web pages that contain AJAX content. Of the dynamic content categories, Flash was the only one for which we were unable to detect a significant difference in the likelihood of blind versus sighted participants visiting those types of pages. On average, 17.03% (SD 0.24) and 16.00% (11.38) of the web pages viewed by blind and sighted participants, respectively, contained some Flash content; there was not a detectably significant difference on this measure (F(1,19) = 0.90, n.s.). We also calculated these four metrics for domains visited (groups of web pages) and reached analogous conclusions.

3.4.2 Browsing Behavior

Blind users used the mouse (or simulated it using the keyboard) a surprising amount. On average, blind participants used or simulated the mouse on 25.85% (SD 22.01) of the pages that they viewed, and sighted participants used the mouse on 35.07% (12.56) of the pages that they viewed. This difference was not detectably significant (F(1,19) = 1.35, n.s.). Blind and sighted participants, however, on average performed 0.43 (0.33) and 8.92 (4.21) discrete mouse movements per page, respectively. This difference was statistically significant (F(1,19) = 44.57, p < .0001). Our users arrived at 24.21% of the pages that they viewed by following a link. The HearSay browser leverages the context surrounding followed links to begin reading at relevant content on the resulting page [84], and this approach could likely apply in these cases.

Probing

Our blind participants exhibited more probing than their sighted counterparts, as shown in Figure 3.4. On average, blind participants executed 0.34 (SD 0.18) probes per page while sighted participants executed only 0.12 (0.12), a significant difference (F(1,19) = 10.40, p < .01) that may be indicative of the greater difficulty blind web users face due to limited context. (See Figure 3.4 to better visualize participant probing behavior for individual pages.)

Figure 3.4: Number of probes for each page that had at least one probe. Blind participants performed more probes from more pages.

Timing

In examining the time spent per task, we found that our data was skewed toward shorter time periods, which is typical when time is used as a performance measure.
Because this data does not meet the normality assumption of ANOVA, we applied the commonly used log transformation to all time data [120]. Although this complicates the interpretation of results, it was necessary in order to perform parametric statistical analysis [43]. All statistical significance reported here is in reference to the transformed data; however, results are reported in the original, untransformed scale.

Range (min)   Blind          Sighted        F(1,19)
0 - 1.0       0.38 (0.26)    0.23 (0.23)    32.55, p < .0001
0 - 2.5       0.76 (0.65)    0.38 (0.49)    31.83, p < .0001
0 - 5.0       1.04 (1.05)    0.51 (0.80)    10.69, p < .01
0 - 10.0      1.25 (1.54)    0.77 (1.52)    6.90, p < .05
0 - 20.0      1.50 (2.35)    1.11 (2.51)    5.01, p < .05
0 - ∞         5.08 (16.68)   11.30 (74.36)  0.01, p = .91

Table 3.1: Average time (minutes) and standard deviation per page for increasing time ranges.

We found that blind participants spent more time on average on each page visited than sighted participants. For a summary of the results, see Figure 3.5. These results seemed particularly strong for short tasks, where sighted users were able to complete the tasks much faster than blind users. Blindness had a significant effect on the log of time spent on each page for all but the longest time range. Table 3.1 shows that the average times spent by blind and sighted participants approach one another as task length increases. We also identified four tasks conducted by both blind and sighted participants, which enabled us to compare the time required for users to complete these tasks.

Google

This task consisted of two subtasks: 1) querying from the Google homepage, and 2) choosing a result. On the first subtask, blind and sighted users spent a mean of 74.66 (SD 31.57) and 34.54 (105.5) seconds, respectively. Blindness had a significant effect on the log of time spent issuing queries (F(1,17) = 7.47, p < .01). On the second subtask, the time from page load to clicking on a search result for blind and sighted users was 155.06 (46.14) and 34.81 (222.24) seconds, respectively. This represents a significant effect of blindness on the log of time spent searching Google's results (F(1,19) = 28.3, p < .0001).

Providing Feedback

Another common task performed by most of our participants was to provide qualitative comments on some of the web pages that they visited as part of the study (Figure 3.6). This task also consisted of two subtasks: 1) querying for web pages from the user's web history, and 2) commenting on one of the pages returned. On average, blind and sighted users took 30.36 and 18.41 seconds (SD 20.59 and 19.84) to complete the first subtask. This represents a marginally significant effect of blindness on the log of time spent querying personal web history (F(1,14) = 4.25, p = .06). On average, blind and sighted participants spent 104.60 (30.98) and 68.74 (78.74) seconds, respectively, to leave a comment. This represented a significant effect of blindness on the log of time spent commenting on personal web history (F(1,11) = 5.23, p < .05).

Figure 3.5: For each participant, average time spent on: (1) all pages visited, (2) WebinSitu search page, (3) WebinSitu results page, (4) Google home page, (5) Google results pages.

3.5 Discussion

Our study provided an interesting look into the web accessibility experienced by web users. Overall, the presence of the traditional accessibility problems measured in our study did not seem to deter blind web users from visiting most pages, but problems with the dynamic content characteristic of Web 2.0 did.
Our blind participants were less likely than sighted participants to visit pages that contained dynamic content or that issued AJAX requests. Much of this content is known to be largely inaccessible to blind web users. Our blind participants did not detectably avoid Flash content. Upon manual review of 2000 examples of Flash content, we found that 44.1% of the Flash objects shown to our participants were advertisements. The inaccessibility of these Flash objects is unlikely to deter blind users from visiting the pages that contain them. Only 5.6% of the Flash objects viewed by our participants (both blind and sighted) presented the main content of the page. The remainder of the Flash objects contained content relevant to the main content of the page but supplemental to it. Blind users may miss out on some information contained in such Flash objects, but might still find value in other information on the page that is accessible. Flash was also often used to play sound, which does not require a visual interface. Finally, recent strides in Flash accessibility are making it easier to design accessible Flash objects that can be used by blind users. We also observed that blind web users were less likely to interact with content that is inaccessible. Participants were less likely to interact with content that was dynamically introduced. We also found that blind users are more likely to click on images assigned appropriate alternative text. This should be a warning to web developers: not only are their pages more difficult for blind users to navigate when they fail to assign appropriate alternative text, but they may be driving away potential visitors. Our blind participants appeared to employ numerous coping strategies. For example, blind participants used the mouse cursor when page elements were otherwise inaccessible. One participant explained that he is often required to search for items that are inaccessible using keyboard commands. Blind participants also exhibited more probing than their sighted counterparts, suggesting that web pages still have far to go to make their content efficiently navigable using a screen reader. Technology that obviates the need for these coping strategies would be quite useful. Overall, our observations underscore the importance of enabling accessible dynamic content. While our blind participants may have employed (inefficient) coping strategies to access web content that might be considered inaccessible, they generally tended not to visit pages that rely on dynamic content at all.

Figure 3.6: Web history search interface and results.

3.6 Related Work

Proxy-based recording of user actions on the web has been explored before. The Medusa proxy measures user-perceived web performance [71], and WebQuilt displays a visualization of web experiences based on recorded HTTP requests [57]. Traditional proxy systems record only the information contained in HTTP requests, and so others have created browser plugins that can record richer information about user experiences [32]. UsaProxy, on which WebinSitu is based, is not the only example of using Javascript to record web user actions. Google Analytics allows web developers to include a Javascript file in their web pages that tracks basic actions of visitors to those pages [50]. WebAnywhere uses a web proxy to both record what users are doing and speak the web content being read (Chapter 7).
The benefits and trade-offs involved in conducting remote studies with blind participants have been explored previously (Section 2.1.1). WebinSitu enables remote deployment to blind and sighted participants who are likely using a diversity of browsers and assistive technology. Developing plugins for each desired browser and deploying them would be a large undertaking. Our users initially expressed concern over installing new software onto their machines and wanted to be sure they knew when it was and was not collecting data. Specifying a proxy server is easy in popular web browsers (Internet Explorer, Firefox, Safari, Opera, etc.) and allows users to maintain transparent control. WebinSitu is the first large-scale, longitudinal study demonstrating the promise of this approach.

3.7 Summary

This chapter has presented a study in situ of blind and sighted web users performing real-life web browsing tasks using their own equipment over the period of one week. Our analysis indicates that blind web users employ coping strategies to overcome many accessibility problems and are undeterred from visiting pages containing them, although they took more time to access all pages than their sighted counterparts. Blind users tended not to visit pages containing severe accessibility problems, such as those related to dynamic content. In all cases, our blind participants were less likely than our sighted participants to interact with page elements that exhibited accessibility problems.

Our user-centered approach afforded a unique look at web accessibility and the problems that most need addressing, motivating work in later chapters that seeks to address these problems. Chapter 5 uses the WebinSitu infrastructure to explore a particular interaction, solving audio CAPTCHAs, in more depth and to evaluate an interface that addresses observed problems. Chapter 4 presents Accessmonkey, an intelligent tool that helps blind web users and others improve web access, specifically the problems identified by our WebinSitu study. Finally, to address browsing efficiency, Chapter 6 presents a tool called TrailBlazer that suggests paths through the web by predicting the actions that users will want to complete next. As users choose which suggestions to take, TrailBlazer records their choices and lets other users play them back, making blind web users more efficient. Although these tools do not fully solve the problems highlighted by our study, they represent the direction forward at the heart of our thesis: one in which blind users have more control.

Chapter 4

COLLABORATIVE ACCESSIBILITY WITH ACCESSMONKEY

Standards outline what is required for content to be accessible [142], but relying on developers has proven insufficient. As a clear example, nearly fifteen years after the introduction of the alt attribute to the image tag, less than 50% of informative web images are assigned descriptive alternative text [18]. The need for alternative text is readily apparent, but it has not been pervasively applied. Our WebinSitu study (Chapter 3) demonstrated that other accessibility problems may be even more prevalent and that these problems negatively influence the behavior and effectiveness of blind web users. A Firefox extension called Greasemonkey lets users customize web content by writing Javascript scripts which change web pages after they are loaded [51]. This chapter explores the potential of such scripts to improve accessibility on-the-fly. We introduce a variation on Greasemonkey called Accessmonkey [19].
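As a concrete illustration of the kind of user script Greasemonkey supports, the sketch below restyles pages for readability after they load, one classic accessibility use of such scripts. It is an invented example, not a script shipped with Greasemonkey or Accessmonkey.

    // ==UserScript==
    // @name     Larger, high-contrast text (illustrative example)
    // @include  *
    // ==/UserScript==

    // Runs after each page loads and restyles it for readability,
    // a common personalization for readers with low vision.
    document.body.style.backgroundColor = 'black';
    document.body.style.color = 'white';
    document.body.style.fontSize = '150%';

    // Recolor links so they remain distinguishable on the dark background.
    var links = document.getElementsByTagName('a');
    for (var i = 0; i < links.length; i++) {
      links[i].style.color = 'yellow';
    }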
Accessmonkey is a framework that targets the reuse of scripts to help developers improve access to content they control. Several implementations of Accessmonkey are provided to improve the availability of the improvements that Accessmonkey scripts offer, letting users take advantage of Accessmonkey without installing new software. This chapter explores collaborative accessibility, the idea that anyone with the incentive, desire, or need to create more accessible content should be able to do so. We present Accessmonkey, an implementation of this idea that enables people to fix accessibility problems and share solutions both with one another and with the developers of the content.

4.1 Motivation

Despite efforts to promote accessible web development, web developers often fail to implement even basic elements of accessibility standards. For many web developers, the cause is a lack of appropriate experience, although even experienced web developers may require more time to produce a visually appealing web page that is also accessible [100]. Available tools can help spot deviation from the easily quantifiable portions of established standards, but they fail to adequately guide developers through the process of creating accessible content. Perhaps it should be unsurprising that, when faced with a deadline to release new web content or with a daunting backlog of web pages to be updated, web developers often delay a full consideration of accessibility. Web developers need tools that can efficiently guide them through the process of creating accessible web pages, and blind users need tools that can help them overcome accessibility problems even when developers fail.

One approach is to automatically transcode documents in order to render them more accessible and usable [18, 62, 60, 53]. Automatic transcoding is potentially powerful, but implementations require web users to employ a specific browser or platform on their machine. The transformations made by these tools help the web users who know to use them, but are not easily utilized by web developers, who might create more accessible content given easier access to the same underlying technology. Most importantly, web users cannot easily influence this technology or independently suggest new accessibility improvements.

Accessmonkey is a scripting framework that helps web users and web developers collaboratively improve the accessibility of the web. Accessmonkey helps web users automatically and independently transcode web content according to their personal needs on a number of browsers and platforms. The same Accessmonkey scripts can also be used by developers, who can leverage the technology that transcodes content for blind users to receive concrete suggestions for their own pages. Many existing tools and systems address accessibility problems, but they often require a specific browser or require the user to install a separate tool. As such, they can be difficult for users to independently improve and difficult for developers to integrate into existing tools. Accessmonkey provides a convenient mechanism for sharing the techniques developed and insights gained. The framework allows both web users and web developers to collaboratively improve the accessibility of the web by leveraging the work of those who have come before them. Users can improve the accessibility of the web by writing a script, and other users can immediately use and adapt the script to fit their own needs.
Web developers can use the script to improve the accessibility of their web pages automatically, reducing the job of providing accessibility information to a more efficient editing and approval task. To allow as many users as possible to utilize our framework, we offer several implementations of it that work on multiple platforms and in multiple web browsers. The contributions of our work include the following:

1. We illustrate the advantages of using Javascript and dynamic content for accessibility improvement. Both technologies have been thought to decrease access.

2. We introduce a framework for dual-display of the results of scripts that enables web users and web developers to utilize the same underlying technology and avoid duplicating work.

3. We re-implement several previous systems designed to improve accessibility as Accessmonkey scripts. These systems can now be used on more web browsers and platforms by both web users and developers.

4.2 Related Work

Automatically transcoding web content to better support the needs of web users has been a popular topic in web accessibility research for almost a decade, especially in relation to improving access for blind web users. Accessmonkey seeks to allow both web users and web developers to collaboratively create scripts that direct the automatic transcoding of web content in a way that helps both groups efficiently increase web accessibility.

4.2.1 Scripting Frameworks

Greasemonkey was introduced in 2004 by Aaron Boodman. The project was partially motivated by a desire to provide web users with the ability to automatically transcode web pages into a form that is more accessible. Several examples of such scripts are offered in the book Greasemonkey Hacks [104]. NextPlease! is an example of a Greasemonkey script that has become quite popular among blind web users [137]. This script allows web users to define key combinations that simulate clicks on links that contain user-defined text strings. Currently, web developers who want to offer similar functionality in their own web pages cannot directly leverage the changes made by users of a script like NextPlease! and must instead implement these changes independently. Accessmonkey extends the original idea behind Greasemonkey by providing a mechanism by which web users who write scripts can also make their scripts useful to web developers. We also provide web and proxy implementations of Accessmonkey that open the system to additional users.

Scripts designed to automatically improve accessibility are already available as Greasemonkey scripts. Popular existing user scripts include those for automatically detecting and removing distracting ads and those that add new functionality to popular web sites like google.com and hotmail.com. Others automatically add accessibility features to web pages. Often these scripts present solutions that web developers could have included (support for access keys, proper table annotation, etc.), while others address problems that apply to particular subsets of a web page's visitors (high-contrast colors, larger fonts, etc.). Some of the most popular scripts are those that add access keys to popular web sites and those that adapt web pages to be easier to view by people with low vision. A large repository of Greasemonkey scripts is available at userscripts.org, including 49 scripts targeted at accessibility. These scripts alter pages in ways that users have found helpful, such as adding heading tags (<h2>) to the Google results page.
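The heading-injection idea just mentioned can be expressed in a few lines. The sketch below is illustrative only; the '.g' selector for result containers is an assumption about Google's markup, which changes frequently, rather than a stable interface.

    // Illustrative sketch: wrap the first link of each search result in an
    // <h2> so screen-reader users can jump between results with heading
    // navigation. The '.g' selector is hypothetical.
    var results = document.querySelectorAll('.g');
    for (var i = 0; i < results.length; i++) {
      var link = results[i].getElementsByTagName('a')[0];
      if (!link) continue;
      var heading = document.createElement('h2');
      // Insert the new heading where the link was, then move the link into it.
      link.parentNode.insertBefore(heading, link);
      heading.appendChild(link);
    }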
Many scripts are available, which suggests both that a number of individuals are willing to write such scripts and that many web users find them useful. Another web scripting framework is Chickenfoot, which allows users to programmatically manipulate web pages using a superset of Javascript that includes methods specific to web browsing [24]. The interface of Chickenfoot is designed to make web page manipulation easier, although it still requires some level of programming knowledge. Platypus, another Firefox extension, seeks to remove this requirement as well by providing an interface that allows users to manipulate the content of web pages by simply clicking on items [129]. Neither of these systems offers a mechanism that allows users to save altered web pages, which Accessmonkey supports.

4.2.2 Accessibility Evaluation

The automatic evaluation of the accessibility of web pages has been a popular topic in both research and industry, and has resulted in the availability of many evaluation tools [140, 139, 1, 77]. Most of these tools have focused on assisting developers in meeting quantifiable accessibility standards, such as the W3C Web Content Accessibility Guidelines [142] or Section 508 of the U.S. Rehabilitation Act. The research community has sought to extend the capabilities of evaluation tools to allow for the automatic detection of more subtle usability and accessibility concerns [65], but tools that can do this well have yet to be developed. Mankoff et al. noted that an effective method for identifying accessibility problems in a web page is to have it reviewed by multiple web developers using a screen reader, but that blind web users could effectively detect subtle usability problems [85]. Accessmonkey allows both groups to collaboratively assist in the evaluation and remediation of web content, but neither group must rely on members of the other before accessibility improvements can be implemented.

4.2.3 Automatic Accessibility Improvement

Previous work has explored automatically improving the accessibility of web pages [60, 62, 53, 18]. To take advantage of these systems, content has generally needed to be processed by the web developer using a specialized tool, displayed using a specialized web browser [106], or transcoded for users on-the-fly. Harper et al. suggested the following three alternative approaches for transcoding documents to make them more accessible for blind web users [53]: (i) in a browser plugin or extension, (ii) in a transcoding proxy, and (iii) in Javascript included within web pages. Accessmonkey uses a hybrid approach in which scripts are injected using a proxy implemented as either a browser extension, a traditional proxy, or a web-based proxy. This flexibility lets Accessmonkey transcode pages and allows scripts written for it to be used on many different platforms and in many different browsers. Users can personalize which of the available transcoding services they would like to apply to the web pages that they view and can also write or modify their own scripts.

Implementing transcoders as scripts also has the potential to make it easier to extend the techniques that they encompass to web development tools. While several systems have suggested that techniques used to automatically transcode documents could also be used to help web developers more easily create accessible content [77, 136], this process has often been difficult to directly integrate into existing developer tools.
Accessmonkey allows scripts to be written once and included in a variety of tools used by both web users and web developers. To our knowledge, this is the first example of a system that unifies automatic accessibility improvement targeted at web users and web developers in an extensible way. Despite the similarity in the underlying technology, little work has been devoted to assisting web developers in automatically improving the content of their web pages through specific suggestions. Many tools used for evaluation display general advice about how to rectify problems that are discovered [42, 140, 1], but a web developer must still be skilled in accessibility to correctly act on this advice. The guidance provided is usually general and is often drawn from the standard against which the tool is evaluating the web page. A-Prompt, for example, guides web developers through the process of fixing accessibility problems. Related systems have been designed to assist users in annotating web content for the semantic web [11], and Plessers et al. showed how such annotations could be automatically generated as a direct result of the web design process [105]. Accessmonkey scripts can utilize the same technology used to assist web users to help web developers.

4.2.4 Adding Alternative Text with WebInSight

WebInSight improves the accessibility of web images by automatically deriving alternative text [18]. This system was shown to be capable of providing labels with high accuracy for 43.2% of web images that originally lacked alternative text, by using the title of the linked page for linked images and by applying optical character recognition (OCR) to images that contain text. WebInSight was originally developed for web users, but developers could also use it to help them choose appropriate alternative text. Many available accessibility tools inform web developers when images lack alternative text, but few suggest alternative text or automatically judge the quality of the alternative text already provided.

WebInSight uses a supervised learning model built from contextual features in order to identify alternative text that is likely to be incorrect [14]. The ability to automatically judge the quality of alternative text could potentially improve the user experience by eliding alternative text that is likely to be inappropriate. Using the WebInSight Accessmonkey script, web developers are not only told when an image lacks alternative text, but also whether the alternative text already provided is likely to be correct.

4.3 Accessmonkey Framework

The framework exported by the Accessmonkey system allows users to edit web pages using Javascript. The Greasemonkey Firefox extension [51] is one of the most successful examples of an open scripting framework and exposes the framework from which Accessmonkey is derived. The Greasemonkey extension allows users to inject their own scripts into arbitrary web pages, and these scripts can then alter web pages automatically. The main difference between Accessmonkey and Greasemonkey is that Accessmonkey natively supports web developers by providing a mechanism for them to edit, approve and save changes that have been made to web pages by user scripts. Figure 4.1 shows the relation between the components that use the Accessmonkey framework. Accessmonkey is designed to support multiple implementations, which may be placed on a remote server, placed on the user's computer or directly integrated into web tools.
Accessmonkey scripts can be used in different browsers and on different platforms because of the near ubiquity of Javascript. While Greasemonkey is only available on Mozilla web browsers, other major web browsers, such as Internet Explorer, Safari and Opera, already afford similar capabilities and can often run Greasemonkey scripts unaltered. Incompatibility concerns remain because of differences between the implementations of the ECMAScript standard (commonly known as Javascript) used by different browsers. The primary implementations of ECMAScript are JScript, as implemented by Internet Explorer, and Javascript, as implemented by other popular browsers, including Firefox and Safari. Despite these limitations, web developers are accustomed to writing scripts that are compatible with the different implementations of ECMAScript.

Figure 4.1: Accessmonkey allows web users, web developers and systems for automated accessibility improvement to collaboratively improve web accessibility.

Many web browsers and screen readers do not work well with Javascript code, and some ignore it altogether. These increasingly rare browsers are currently not supported. The scripts presented in this chapter have been tested with Window-Eyes 6.0 [146]. Accessmonkey gives users the option of running solely on the client side. A disadvantage of our approach is that Javascript limits the space of possible transcoding operations that can be performed, but, as shown in Section 4.5, many of the transcoding operations that have been previously suggested can be achieved using Javascript alone. Furthermore, as we discuss in Section 4.6, future versions of Accessmonkey may allow Java code to be bundled with Accessmonkey scripts in order to enhance their functionality.

4.3.1 Writing Scripts

Accessmonkey scripts share structure with Greasemonkey scripts but rely on additional functionality provided by Accessmonkey implementations. Greasemonkey scripts can be run unaltered in Accessmonkey implementations. Accessmonkey scripts are expected to provide a mechanism for users to view, edit and approve changes that are automatically made by the script when appropriate, and they rely on functionality exposed by the Accessmonkey implementation to facilitate this. Accessmonkey differentiates two modes of operation: a user mode, in which changes are automatically made to a page, and a developer mode, in which users are provided an interface that allows them to edit, approve and save automatic changes. A script can query the implementation in which it is running to determine the mode that is currently activated and to obtain a pre-defined area in which the script can place its developer interface. The implementation provides functionality that coordinates which script's developer interface should be displayed and allows changes that have been made to the web page to be saved. To write a script, a user must be able to write Javascript code, but any user can use an existing script. Future versions of this tool will include a mechanism to help users locate applicable scripts. We also plan to explore ways of enabling users who are not technically savvy to create scripts (see Section 4.6).
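The skeleton below sketches this dual-mode structure. The host-provided methods accessmonkeyInDeveloperMode and accessmonkeyGetDeveloperArea are hypothetical stand-ins for the two implementation-supplied methods described in the next section, and computeAccessibilityFixes is a placeholder for a script's own logic; none of these names is the actual Accessmonkey API.

    // Hypothetical skeleton of an Accessmonkey script (method names invented).
    // Placeholder for the script's own logic: find images lacking alt text
    // and propose a fix for each.
    function computeAccessibilityFixes(doc) {
      var fixes = [];
      var imgs = doc.getElementsByTagName('img');
      for (var i = 0; i < imgs.length; i++) {
        (function (img) {
          if (!img.hasAttribute('alt')) {
            fixes.push({
              description: 'Add alt text to ' + img.src,
              apply: function () { img.setAttribute('alt', 'image'); }
            });
          }
        })(imgs[i]);
      }
      return fixes;
    }

    var suggestions = computeAccessibilityFixes(document);

    if (accessmonkeyInDeveloperMode()) {       // provided by the implementation
      // Developer mode: build a review interface inside the area the
      // implementation sets aside for this script.
      var devArea = accessmonkeyGetDeveloperArea();
      suggestions.forEach(function (s) {
        var row = document.createElement('div');
        row.textContent = s.description;
        var approve = document.createElement('button');
        approve.textContent = 'Approve';
        approve.onclick = function () { s.apply(); };
        row.appendChild(approve);
        devArea.appendChild(row);
      });
    } else {
      // User mode: apply all suggested changes automatically.
      suggestions.forEach(function (s) { s.apply(); });
    }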
4.3.2 Requirements

An Accessmonkey implementation requires only a few elements. First, the implementation must have the capabilities of Greasemonkey. Specifically, it must be able to load a web page, add custom Javascript to it and execute this Javascript. The implementation must also provide the standard Greasemonkey API [104] and two additional methods required for Accessmonkey features. The first method returns a Boolean value indicating whether the system is in developer mode or user mode, which allows user scripts to appropriately alter their presentation and editing options. The second method returns a reference to a div element that represents the script's development area. The script may append elements to this element to form its developer interface. This interface supports the user in viewing, editing and approving changes automatically suggested by the script. Each implementation must also provide a mechanism for saving changes that were made to the web page by the user. Figure 4.2 shows an implementation of Accessmonkey running a script. The selection boxes and buttons at the top of this screenshot form the developer interface. They let users both switch tools and usage modes and save changes made to the web page by the script.

4.3.3 Developer Workflow

An important consideration for the usability of Accessmonkey is its potential to fit into developer workflows. We hope to address one of the main shortcomings of accessibility tools, which is their inability to integrate well into current developer workflows [65]. Designing a tool that easily integrates into the wide diversity of available developer products is impractical, but the implementations provided allow our system to be immediately available. Accessmonkey integrates into the developer workflow by letting developers make and edit potential changes and then save them. Developers of sites that are generated dynamically from underlying data sources and web page templates cannot directly save the changes that Accessmonkey produces, but these developers may still benefit from its suggestions. Ideally, Accessmonkey would be implemented directly in the tools already used by web developers. The simple and open scripting framework exposed by Accessmonkey allows users to develop such implementations that more closely integrate into these tools. Regardless, previous work has shown that an improved workflow that still involves the use of several applications can nevertheless dramatically increase efficiency [73].

4.4 Implementations

People use a variety of web browsers on a number of platforms, and web developers likewise use a variety of development tools. Accessmonkey should be easy to integrate into these tools. The decision to implement Accessmonkey as a scripting framework using Javascript lets new implementations be developed easily because many platforms already support Javascript. Creating implementations of Accessmonkey that integrate directly into all possible tools used by users and developers is impractical. Instead, Accessmonkey provides a simple framework which can be extended to other tools and platforms by users. We have created the following three implementations of Accessmonkey covering a wide variety of use cases: a Firefox extension, a stand-alone web page, and a web proxy. Web users and developers can access the full range of Accessmonkey functionality by using any of these implementations.

Figure 4.2: A screenshot of the WebInSight Accessmonkey script in developer mode applied to the homepage of the International World Wide Web Conference. This script helps web developers discover images that are assigned inappropriate alternative text (such as the highlighted image) and suggests appropriate alternatives. The developer can modify these suggestions, as was done here, to produce the final alternative text.
4.4.1 Firefox Extension

The Firefox extension implementation is a straightforward adaptation of the existing Greasemonkey extension, which was the motivation for Accessmonkey and already provides much of the required functionality. To allow the extension to fully support Accessmonkey scripts, we enhanced it by adding the Accessmonkey-specific methods described earlier in Section 4.3.2 and added a toggle that allows users to switch between user and developer mode. Finally, we added the ability to save changes that were made to the web page. A screenshot of the resulting system is shown in Figure 4.2.

4.4.2 Web Proxy

The web proxy version of Accessmonkey is implemented as an Apache module. This module inserts a script containing the Accessmonkey code into each web page visited. Disadvantages of proxy-based approaches were discussed previously in Section 4.2.3, but for some users a proxy is the most viable option because it does not require Firefox. Currently, the administrator of the proxy is responsible for adding new user scripts, although future versions may allow users to upload scripts and have them immediately included in their suite of Accessmonkey scripts. Eventually, we would also like to offer a proxy-based solution that processes web pages on the fly according to user scripts as the user browses the web.

For security reasons, some methods in the Greasemonkey API do not have direct analogies in Javascript. The Greasemonkey method used to retrieve the content of an arbitrary URL is useful for allowing scripts to include information derived from web services or integrated from other web sites. The analogous Javascript functionality, however, is restricted to retrieving content from the same domain as where the script is located. To allow Accessmonkey scripts running in this implementation to incorporate data not available on the original domain of the web page, the implementation allows scripts to request the content of any URL from the proxy, which effectively anonymizes these requests. To avoid abuse, the proxy implementation limits use of the system to registered users.
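A sketch of how a script might use such a proxy-mediated fetch appears below. The /fetch path and the callback-based wrapper are hypothetical, intended only to show how the proxy sidesteps the same-origin restriction.

    // Hypothetical helper: ask the proxy to fetch a cross-domain URL on the
    // script's behalf, since XMLHttpRequest alone is restricted to the
    // page's own domain.
    function proxyFetch(url, callback) {
      var xhr = new XMLHttpRequest();
      // The proxy itself is same-origin, so this request is permitted; the
      // proxy then retrieves the target URL and relays the response.
      xhr.open('GET', '/fetch?url=' + encodeURIComponent(url), true);
      xhr.onreadystatechange = function () {
        if (xhr.readyState === 4 && xhr.status === 200) {
          callback(xhr.responseText);
        }
      };
      xhr.send(null);
    }

    // Example use: retrieve the title of a linked page to suggest alt text.
    proxyFetch('http://example.com/some-page', function (html) {
      var match = html.match(/<title>([^<]*)<\/title>/i);
      if (match) {
        console.log('Suggested alternative text: ' + match[1]);
      }
    });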
4.4.3 Web Page

Several popular evaluation tools are web-based [140, 139, 141]. Visitors to these web sites can enter the URL of a web page that they would like to evaluate and the tools will analyze it. Such tools are convenient because they do not require users to install additional software and can be accessed from anywhere. Because the evaluation is done remotely, these tools require the web page to be publicly available and, therefore, may be inappropriate for accessibility evaluation of private or pre-release web pages. To allow users of our system additional flexibility, we have created a web-based version of Accessmonkey. Our web implementation requires a browser that supports Javascript, but requires the user neither to use a specific browser nor to install an extension, which opens Accessmonkey scripts to potential users who prefer Internet Explorer, Opera or another web browser. This implementation allows a large portion of web users and developers to use Accessmonkey scripts.

Our web page version of Accessmonkey is implemented using a variation on the module for the Apache Web Server that we developed for our proxy implementation. When users visit the Accessmonkey web page, they are first asked for a URL. The system then fetches that URL and alters the page returned in a way that allows it to be displayed at a local address. All of the URLs in each web page are automatically translated into fully qualified URLs that are directed through the proxy. This Accessmonkey implementation uses the same techniques for producing the full Accessmonkey API that were required in the proxy implementation discussed previously.

4.4.4 Future Implementations

Future implementations will allow more web users and developers to use Accessmonkey on more platforms. Turnabout is a plug-in for Internet Explorer that is similar to Greasemonkey and allows user-defined scripts [128]. It could be modified to provide the added functionality required of an Accessmonkey implementation. We would also like to add the capability of running Accessmonkey scripts directly in web development tools. SeaMonkey Composer and Adobe Dreamweaver are attractive options because they already support Javascript, although we would like to eventually create Accessmonkey implementations for other popular tools, such as Microsoft FrontPage.

4.5 Implemented Scripts

We have implemented several scripts for our system that demonstrate the usefulness of the Accessmonkey architecture. Our current implementations are both strengthened and limited by their restriction to Javascript. Restricting our scripts to Javascript allows them to be easily extended to many other platforms, but comes at the cost of accepting the limitations of Javascript. For instance, our scripts cannot gain pixel-level access to images. One method of circumventing this limitation is to utilize web services, as we did for our WebInSight script so that it could access OCR functionality. In this section, we further demonstrate the diversity of powerful transformations that can be accomplished using Accessmonkey and how they can be leveraged by both web users and developers.

4.5.1 WebInSight Script

One motivation for Accessmonkey was to enable web developers to leverage the technology that we developed for WebInSight (described in Section 4.2.4) to make the creation of accessible web pages easier. Our belief is that web developers would be more likely to create accessible content if they were given specific suggestions about what to do; our WebInSight Accessmonkey script is one example. A screenshot of the Accessmonkey system running the WebInSight script is shown in Figure 4.2. The developer interface provides web developers with functionality to quickly approve and edit the alternative text assigned to each image in the page. To assist in this process, the system provides several automatically-computed suggestions for appropriate alternative text that web developers can select with a single click after optionally editing the suggestion. The script computes all suggestions automatically, except for the OCR suggestion, which is retrieved from a web service. Each suggestion is automatically evaluated by the system, and the best suggestion is always displayed in the lowest text box. The interface allows developers to skip images that are unlikely to be informative and assign these images zero-length alternative text. The system does not provide developers with a button that automatically applies alternative text to all images because the system's suggestions are not always correct. Following the spirit of Accessmonkey, blind web users can also utilize this script. In user mode, the script simply inserts the best alternative text for each image directly into the page, although users are provided the option to preface each inserted alternative text label with a string indicating that it was automatically generated.
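In user mode, the insertion step might look like the sketch below. Here, bestSuggestionFor stands in for the script's real suggestion-ranking logic and is hypothetical, as is the prefaceLabels preference name.

    // Illustrative user-mode pass: insert the best available suggestion as
    // alternative text, optionally marking it as automatically generated.
    var prefaceLabels = true;  // assumed user preference

    function bestSuggestionFor(img) {
      // Hypothetical placeholder for the real ranking logic, which would
      // weigh linked-page titles, OCR output, and contextual features.
      return img.src.split('/').pop();
    }

    var imgs = document.getElementsByTagName('img');
    for (var i = 0; i < imgs.length; i++) {
      if (!imgs[i].hasAttribute('alt')) {
        var text = bestSuggestionFor(imgs[i]);
        if (prefaceLabels) {
          text = 'Automatic: ' + text;
        }
        imgs[i].setAttribute('alt', text);
      }
    }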
4.5.2 Context-Centered Web Browsing

Mahmud et al. introduced a context-driven method called CSurf for browsing the web that they showed to be much more efficient than browsing with a traditional screen reader [84]. The increased efficiency of this method is derived from its ability to automatically direct users to relevant content, instead of requiring them to read a web page starting at the beginning, as is common in most screen readers. When using the system, users are directed to content related to the links that they have followed. The text of a link is likely similar to the content in which they are interested. The system calculates where in the web page to begin reading by choosing the section of the web page that contains content most similar to the text of the link that was followed. This enhanced functionality is expected to be included in the Hearsay browser [106].

Figure 4.3: The menubar of this online retailer is inaccessible due to its reliance on the mouse. To fix this problem we wrote an Accessmonkey script that makes this menu accessible from the keyboard.

We have implemented a variation of this accessibility improvement as an Accessmonkey script. On every web page, the script first adds an onclick event to each anchor tag on the page. When a user clicks on a link, the text of the link is recorded. When a new page is loaded, the script checks to see if the load occurred as a result of the user clicking a link. If so, it then finds the DOM element of the page that is most similar to the text of the clicked link using a simple word-vector comparison. The focus of the web page is changed to the identified DOM element, which allows modern screen readers to begin reading at that location. The script also assists web developers in setting skip links, which are links at the beginning of a web page that are visually hidden but provide a mechanism for screen reader users to skip to the main content area of a web page. This Accessmonkey script detects content on the web page that is likely to be relevant, highlights the identified area and adds the skip link if it is approved by the user. While this script cannot perform the full machine learning and semantic analysis that is done in CSurf, it allows this powerful technique to be used immediately with the tools users already own.
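The sketch below approximates the two steps of this script: recording the followed link's text and then focusing the most similar element on the next page. The bare-bones word-overlap measure, the restriction to paragraph elements, and the use of sessionStorage for carrying state between pages are all simplifying assumptions, not the script's actual design.

    // Step 0: a crude word-vector representation and overlap score.
    function wordSet(text) {
      var words = text.toLowerCase().match(/[a-z0-9]+/g) || [];
      var set = {};
      for (var i = 0; i < words.length; i++) set[words[i]] = true;
      return set;
    }
    function overlap(a, b) {
      var count = 0;
      for (var w in a) if (b[w]) count++;
      return count;
    }

    // Step 1: record the text of any link the user follows.
    document.addEventListener('click', function (e) {
      if (e.target.tagName === 'A') {
        sessionStorage.setItem('lastLinkText', e.target.textContent);
      }
    }, true);

    // Step 2: on the next page, focus the element most similar to that text
    // so the screen reader begins reading there.
    var linkText = sessionStorage.getItem('lastLinkText');
    if (linkText) {
      var target = wordSet(linkText);
      var best = null, bestScore = 0;
      var candidates = document.getElementsByTagName('p');
      for (var i = 0; i < candidates.length; i++) {
        var score = overlap(target, wordSet(candidates[i].textContent));
        if (score > bestScore) { bestScore = score; best = candidates[i]; }
      }
      if (best) {
        best.setAttribute('tabindex', '-1');  // make the element focusable
        best.focus();
      }
    }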
4.5.3 Personalized Edge Services

Iaccarino et al. outlined a number of edge services that transcode web content into a form better suited to the personal needs of web users [62]. Accessmonkey provides an ideal framework in which to implement these edge services, and we have replicated many of them as Accessmonkey scripts. The original intent of the edge services was to provide web users with disabilities with options for personalization. By implementing them as Accessmonkey scripts, web developers can leverage them as well. Although many of these services are not appropriate for all users, web developers may employ them to produce alternative views for specific audiences. We have replicated several edge services as Accessmonkey scripts, including services that replace all images in a page with links to the image, delete the target attribute from all anchor tags and add access keys to all links. Such improvements can improve access for certain individuals. These scripts can also be used by web developers, although, because of the nature of the transformations applied, they may be best used to help create alternative versions of a web page rather than a single universally accessible version.

4.5.4 Site-Specific Scripts

Many useful accessibility improvements cannot yet be implemented in a general way that will apply to multiple web sites. Site-specific scripts can reorganize a site's layout, add accessibility features, or improve the accessibility of dynamic content. We have implemented several scripts that demonstrate the dramatic improvements that can be made by Accessmonkey scripts targeted at specific sites. For example, the web page of a popular online retailer contains a menubar at the top listing the major categories of products that they sell, organized in a tree. This menubar (and the elements contained within it) is inaccessible because it requires the use of a mouse. We wrote an Accessmonkey script that allows the same content to be accessed via keyboard commands (see Figure 4.3). In this example, the menu content is available to screen reader users, but is not efficiently conveyed to them. Figure 4.4 demonstrates another example of a site-specific script that, in this case, removes distracting ads and places the main content of the page closest to the top in reading order.

The content that the scripts in this section modify already exists on the page, and, therefore, blind web users could potentially conduct these transformations independently. This is in contrast to the content in images or Flash content, which is more difficult to access. While figuring out how to create a script that will improve accessibility may take time, the user will benefit from these improvements on subsequent visits to the page. These improvements could be leveraged by other web users visiting the page and, perhaps, even by the web developers responsible for creating it. Javascript is a powerful mechanism for transcoding content, and we are exploring how users can more easily discover and apply these scripts.

4.6 Discussion

Accessmonkey provides a common framework for web users, web developers, and web researchers to share automatic accessibility improvements. To facilitate this collaboration, we plan to create an online repository where such scripts can be posted and shared. We also plan to explore methods for enabling users to easily locate and, perhaps automatically, install scripts from this repository. For example, users could arrive at a news site they have not visited before and be immediately greeted with the possibility of jumping directly to the content, navigation or search areas of the page.

Creating an Accessmonkey script currently requires a knowledge of Javascript programming, but many tools let users personalize web content without programming. For example, the Platypus Firefox extension lets users create Greasemonkey scripts by clicking and dragging elements with the mouse [129]. Keyword commands let users naturally create simple scripts [81], and their keyboard-driven interface is naturally accessible. Accessmonkey seeks to enable blind web users to improve the accessibility and usability of their own web experiences, but programming is often still required. Programming-by-demonstration methods for automating web tasks, such as Web Macros [113], PLOW [66], and Turquoise [86], enable more people to customize the web for themselves. Chapter 6 introduces TrailBlazer, which helps extend these capabilities to blind web users.
The transformations that current Accessmonkey scripts can achieve are limited by the Javascript programming language. While Javascript is more than adequate for achieving many transformations, more complex transformations often require specialized libraries. For example, natural language processing and image manipulation functions are not currently available in Javascript and would be difficult to implement in Javascript alone. Javascript also has limited ability to directly access other formats in which web content is represented, such as Flash and Silverlight. Accessmonkey scripts currently rely on web services for advanced functionality, but a better solution may be for scripts to use supplementary libraries. We will explore both adding commonly-used functionality to Accessmonkey implementations and allowing user scripts to bundle Java code libraries. The implementations that we have provided already support calling Java code from Javascript, so a main challenge is to provide a standardized method for users to include such code along with their scripts and to support such bundles in a variety of Accessmonkey implementations.

Figure 4.4: This script moves the header and navigation menus of this site to the bottom of the page, providing users with a view of the web page that presents the main content window first in the page.

4.7 Summary & Ongoing Work

Accessmonkey is a common scripting framework that enables web users and developers to write and apply scripts for improving web accessibility. We have created implementations that run on a variety of platforms, including as a Firefox extension and directly from the web. We have reimplemented several existing systems for automatically improving web pages, which renders these systems available on more platforms and allows them to be utilized more easily by web developers. In particular, we have converted our WebInSight system for automatically generating and inserting alternative text into web pages into an Accessmonkey script from which both web users and web developers can benefit. We also demonstrated that dynamic content can be made accessible on a per-site basis.

Accessmonkey is at the forefront of the burgeoning area of social accessibility. Soon after its development, IBM Japan released Social Accessibility [121], a project which helps connect blind web users experiencing accessibility problems with volunteers who can fix those problems. They provide end user tools for both groups. The improvements that have been made are stored in a shared repository called the Accessibility Commons [67], enabling collaborative improvement of accessibility problems. Social Accessibility has been developed beyond the prototype stage and is currently in use by many blind users and sighted volunteers. The AxsJAX project from Google is a scripting framework that uses scripts to turn web pages into dynamic web applications customized for non-visual use [49]. Both projects explore different aspects of collaborative accessibility and illustrate the continued importance of this approach.

An important component of Accessmonkey was getting access improvements out to as many people as possible, regardless of what browser or platform they were running. One solution advanced by this work was a web-based proxy that could be accessed from any computer. This work foreshadows our work on WebAnywhere (Chapter 7), which brings not only access improvements but access itself to users on any computer.
Chapter 5

A MORE USABLE INTERFACE TO AUDIO CAPTCHAS

The goal of a CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) is to differentiate humans from automated agents by requesting the solution to a problem that is easy for humans but difficult for computers. CAPTCHAs are used to guard access to web resources and, therefore, prevent automated agents from abusing them. Current CAPTCHAs rely on superior human perception, leading to CAPTCHAs that are predominantly visual and, therefore, unsolvable by people with vision impairments. Audio CAPTCHAs that rely instead on human audio perception were introduced as a non-visual alternative but are much more difficult for web users to solve. Part of the problem is that the interface has not been designed for non-visual use. This chapter first presents a study of audio CAPTCHAs conducted using the WebinSitu infrastructure (Chapter 3), and then presents and evaluates a more usable interface designed for non-visual use. With the new interface, blind web users had a 59% higher success rate in solving audio CAPTCHAs [16]. The results of improving this interaction illustrate broader themes that can inform the design of interfaces for non-visual use; specifically, that visual interfaces cannot simply be naively adapted, and that achieving effective access means making interfaces usable. Chapter 6 seeks to enable users to expand these benefits to completing other web-based tasks generally, with a tool called TrailBlazer.

5.1 Introduction and Motivation

Most CAPTCHAs on the web today exhibit the following pattern: the solver is presented with text that has been obfuscated in some way and is asked to type the original text into an answer box. The technique for obfuscation is chosen such that it is difficult for automated agents to recover the original text but humans should be able to do so easily. Visually, this most often means that graphic text is displayed with distorted characters (Figure 5.1).

Figure 5.1: Examples of existing interfaces for solving audio CAPTCHAs from (a) Microsoft, (b) reCAPTCHA, and (c) AOL. (a) A separate window containing the sound player opens to play the CAPTCHA, (b) the sound player is in the same window as the answer box but separate from it, and (c) clicking a link plays the CAPTCHA. In all three interfaces, a button or link is pressed to play the audio CAPTCHA, and the answer is typed in a separate answer box.

In audio CAPTCHAs, this often means text is synthesized and mixed with background noise, such as music or unidentifiable chatter. Although the two types of CAPTCHAs seem roughly analogous, their usability is quite different because of inherent differences in the interfaces used to perceive and answer them. Visual CAPTCHAs are perceived as a whole and can be viewed even when focus is on the answer box. Once focused on the answer box, solvers can continue to look at visual CAPTCHAs, edit their answer, and verify their answer. They can repeat this process until satisfied without pressing any keys other than those that form their answer. Errors primarily arise from CAPTCHAs that are obfuscated too much or from careless solvers. Audio playback, in contrast, is linear. A solver of an audio CAPTCHA first plays the CAPTCHA and then quickly focuses the answer box to provide their answer.
For sighted solvers, focusing the answer box involves a single click of the mouse, but for blind solvers, focusing the answer box requires navigating with the keyboard using the audio output of a screen reader. Solving audio CAPTCHAs is difficult, especially when using a screen reader. Screen readers voice user interfaces that have been designed for visual display, enabling blind people to access and use standard computers. Screen readers often talk over playing CAPTCHAs as solvers navigate to the answer box: the interface is spoken at the same time the CAPTCHA plays. A playing CAPTCHA will not pause for solvers as they type their answer or deliberate about what they heard. Reviewing an audio CAPTCHA is cumbersome, often requiring the user to start again from the beginning, and replaying an audio CAPTCHA requires solvers to navigate away from the answer box in order to reach the controls of the audio player. The interface to audio CAPTCHAs was not designed to help blind users solve them non-visually.

Audio CAPTCHAs have been shown previously to be difficult for blind web users. Sauer et al. found that six blind participants had a success rate of only 46% in solving the audio version of the popular reCAPTCHA [114], and Bigham et al. observed that none of the fifteen blind high school students in an introductory programming class were able to solve the audio CAPTCHA guarding a web service required for the course [15]. In this chapter, we present a study with 89 blind web users who achieved only a 43% success rate in solving 10 popular audio CAPTCHAs. On many websites, unsuccessful solvers must try again on a new CAPTCHA with no guarantee of success on subsequent attempts, a frustrating and often time-consuming experience.

Given its limitations, audio may be an inappropriate modality for CAPTCHAs. Developing CAPTCHAs that require human intelligence that computers do not yet have seems an ideal alternative, but the development of such CAPTCHAs has proven elusive [44]. CAPTCHAs cannot be drawn from a fixed set of questions and answers because doing so would make them easily solvable by computers. Computers are quite good at the math and logic questions that can be generated automatically. Audio CAPTCHAs could also be made more understandable, but that could also make them easier for computers to solve automatically.

The new interface that we developed improves usability without changing the underlying audio CAPTCHAs. By moving the interface for controlling playback directly into the answer box, a change in focus (and thus a change in context) is not required. Using the new interface, solvers have localized access to playback controls without the need to navigate from the answer box to the playback controls. Solvers also do not need to memorize the CAPTCHA, hurry to navigate to the answer box after starting playback of the CAPTCHA, or solve the CAPTCHA while their screen readers are talking over it. Solvers can play the CAPTCHA without triggering their screen readers to speak, type their answer as they go, pause to think or correct what they have typed, and rewind to review, all from within the answer box. Because popular audio CAPTCHAs have similar interfaces, our optimized interface can easily be used in place of these existing interfaces. Both the ideas and the interface itself are likely to be applicable to CAPTCHAs yet to be developed. Finally, the design considerations explored here apply to improving a wide range of interfaces for non-visual access.
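A sketch of the core idea, handling playback controls inside the answer box itself, appears below. The specific key bindings, the rewind interval, and the element ids are illustrative choices, not necessarily those of the deployed interface.

    // Illustrative sketch: intercept control keys inside the answer box so
    // the solver never has to leave it. Key choices here are hypothetical,
    // and an input element with id 'answer' is assumed to exist.
    var audio = new Audio('captcha.mp3');  // audio source path is a placeholder
    var answerBox = document.getElementById('answer');

    answerBox.addEventListener('keydown', function (e) {
      if (e.key === ',') {            // play or pause without losing focus
        if (audio.paused) { audio.play(); } else { audio.pause(); }
        e.preventDefault();           // keep the key out of the typed answer
      } else if (e.key === '.') {     // rewind slightly to review
        audio.currentTime = Math.max(0, audio.currentTime - 2);
        e.preventDefault();
      }
      // All other keys behave normally, so the solver can type while listening.
    });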
Our work on audio CAPTCHAs offers the following four contributions:

• A study of 162 blind and sighted web users showing that popular audio CAPTCHAs are much more difficult than their visual counterparts.

• An improved interface for solving audio CAPTCHAs optimized for non-visual use that moves the controls for playback into the answer box.

• A study of the optimized interface indicating that it increases the success rate of blind web users on popular CAPTCHAs by 59% without altering the underlying CAPTCHAs.

• An illustration via the optimized interface that usable interfaces for non-visual access should not be directly adapted from their visual alternatives without considering the differences inherent in non-visual access.

5.2 Related Work

CAPTCHAs were developed in order to control access to online resources and prevent access by automated agents that may seek to abuse those resources [133]. As their popularity increased, so did concern that the CAPTCHAs in use were based primarily on the superiority of human visual perception, and therefore excluded blind web users. Although audio CAPTCHAs were introduced as an accessible alternative, the interface used to solve them did not consider the lessons of prior work on optimizing interfaces for non-visual use.

5.2.1 Making CAPTCHAs Accessible

Audio CAPTCHAs were introduced soon after their visual alternatives [133, 70], and have been slowly adopted by web sites using visual CAPTCHAs since that time. Although the adoption of audio CAPTCHAs has been slower than that of visual CAPTCHAs, many popular sites now include audio alternatives, including services offered by Google and Microsoft. Over 2600 web users have signed a petition asking Yahoo to provide an accessible alternative [148]. The reCAPTCHA project, a popular, centralized CAPTCHA service with the goal of improving the automated OCR (Optical Character Recognition) processing of books, also provides an audio alternative. Although audio CAPTCHAs exist, their usability has not been adequately examined.

Researchers have quantified the difficulty that users have solving both audio and visual CAPTCHAs. For instance, Kumar et al. explored the solvability of visual CAPTCHAs while varying their difficulty on several dimensions [29]. Studies of audio CAPTCHAs have been smaller but informative. For instance, Sauer et al. conducted a small usability study (N=6) in order to evaluate the effectiveness of the reCAPTCHA audio CAPTCHA [114]. They noted that participants in the study employed a variety of strategies for solving audio CAPTCHAs. Four participants memorized the characters as they were being read and then entered them into the answer box after the CAPTCHA had finished playing, and one participant used a separate note-taking device to record the CAPTCHA characters as they were read. They noted that the process of solving this audio CAPTCHA was highly error-prone, resulting in only a 46% success rate. The study presented in the next section expands these results to a diverse selection of popular CAPTCHAs in use today and further illustrates the frustration that blind web users experience and the strategies they employ in solving audio CAPTCHAs.

The usability of CAPTCHAs for human users must be achieved while maintaining the inability of automated agents to solve them. Although visual CAPTCHAs have had the highest profile in attempts to break them, audio CAPTCHAs have recently faced similar attempts [124].
As audio CAPTCHAs are increasingly made the target of automated attacks, changes that make them easier to understand will be less likely to be adopted, out of concern that they will make automated attacks easier as well. Changing the interface used to solve a CAPTCHA, however, only impacts the usability for human solvers.

The audio CAPTCHAs described earlier are currently the most popular type of accessible CAPTCHA, but they are not the only approach pursued. Holman et al. developed a CAPTCHA that pairs pictures with the sounds that they make (for instance, a dog is paired with a barking sound) so that either the visual or the audio representation can be used to identify the subject of the CAPTCHA [56]. Tam et al. proposed phrase-based CAPTCHAs that could be more obfuscated than current audio CAPTCHAs but remain easy for humans to solve because human solvers can rely on context [124]. The improvements provided by our optimized interface to audio CAPTCHAs could be adapted to both of these new approaches should they be shown to be better alternatives.

5.2.2 Other Alternatives

Because audio CAPTCHAs remain difficult to use and are not offered on many web sites, several alternatives have been developed to support access for blind web users. Many sites require blind web users to call or email someone to gain access. This can be slow and detracts from the instant gratification afforded to sighted users. The WebVisum Firefox extension enables web users to submit requests for CAPTCHAs to be solved, which are then forwarded to their system to be solved by a combination of automated and manual techniques [144]. Because of the potential for abuse, the system is currently offered by invitation only, and questions remain about its long-term effectiveness. For many blind web users, the best solution continues to be asking a sighted person for assistance when required to solve a visual CAPTCHA. Combinations of (i) new approaches to creating audio CAPTCHA problems and (ii) interfaces targeting non-visual use promise to enable blind web users to independently solve CAPTCHAs in the future. This chapter demonstrates the importance of the interface.

5.2.3 Targeting Non-Visual Access

The interface that we developed for solving audio CAPTCHAs builds on work considering the development of non-visual interfaces. Such interfaces are often very different from the interfaces developed for visual use even though they enable equivalent interaction. For instance, in the math domain, specialized interfaces have been developed to make navigation of complex mathematics feasible in the linear space exposed by non-visual interfaces [108]. Emacspeak explores the usability improvement resulting from applications designed for non-visual access instead of being adapted from visual interfaces [107]. For instance, a standard screen reader may not correctly reflect the semantics of columns in a software calendar, making it difficult for users to determine what day a particular date falls on. Emacspeak would announce the day along with the date.

With the increasing importance of web content, much work has targeted better non-visual web access. For instance, the HearSay browser converts web pages into semantically-meaningful trees [106] and, in some circumstances, automatically directs users to content in a web page that is likely to be interesting to them [84]. TrailBlazer (Chapter 6) suggests paths through web content for users to follow, helping them avoid slow linear searches through content [20].
A common theme in work targeting web accessibility is that content should be accessed in a semantically meaningful way and that functionality should be easily available from the context in which it makes the most sense. The aiBrowser for multimedia web content enables users to independently control the volume of their screen reader and of the multimedia content on the web pages they view [87]. Without the interface provided by aiBrowser, content on a web page can begin making noise (for instance, playing a song in an embedded sound player or Flash movie), making screen readers difficult to hear. This audio clutter can make navigating to the controls of the multimedia content using a screen reader difficult, if controls are provided for the multimedia content at all. One of the goals of our optimized interface to audio CAPTCHAs is to prevent CAPTCHAs from starting to play before the user is in the answer field where they will type their answers, a major complaint of our study participants concerning how audio CAPTCHAs currently work. Just as with the aiBrowser, the goal is, in part, to give users finer control over the audio channel used by both their screen readers and applications.

Work in accessibility has also explored the difference between accessibility and usability. Many web sites are technically accessible to screen reader users, but they are inefficient and time-consuming to access. Prior work has shown that the addition of heading elements to semantically break up a web page, or the use of skip links to enable users to quickly skip to the main content of a page, can increase its usability [126, 138]. Audio CAPTCHAs are accessible non-visually, but their usability is quite poor for most blind web users. Our new interface helps to improve that usability.

5.3 Evaluation of Existing CAPTCHAs

Many web services now offer audio CAPTCHAs because they believe them to be an accessible alternative to visual CAPTCHAs. However, the accessibility and usability of these audio CAPTCHAs has not been extensively evaluated. Our initial study aims to evaluate the accessibility of existing audio CAPTCHAs and to search for insights we could use to improve them. We did this by gathering currently used CAPTCHAs from the most popular web services and presenting them to study participants to solve. During the study, we collected tracking data to investigate the means by which both sighted and blind users solve CAPTCHAs. The tracking data we collected allowed us to analyze the timing (from page load to submit) of every key pressed and button clicked, and to search for problem areas and possible improvements to existing CAPTCHAs.

5.3.1 Existing Audio CAPTCHAs

To gather existing audio CAPTCHAs for our study, we used Alexa [4], a web tracking and statistics gathering service, to determine the most popular web sites visited from the United States as of July 2008. Of the top 100, 38 used some form of CAPTCHA, and of those less than half (47%) had an audio CAPTCHA alternative. For our study, we chose to include only sites offering both visual and audio CAPTCHAs and avoided sites using the same third-party CAPTCHA services.
Using this method we chose 10 unique types of CAPTCHAs that represent those used by today’s most popular websites: AOL (aol), Authorize.net payment gateway service provider (authorize), craigslist.org online classifieds (craigslist), Digg content sharing forum (digg), Facebook social utility (facebook), Google (google), Microsoft Windows Live individual web services and software products (mslive), PayPal e-commerce site (paypal), Slashdot technology-related news website (slashdot), and Veoh Internet television service (veoh). For each of the 10 CAPTCHA types we downloaded 10 examples, resulting in a total of 100 audio CAPTCHAs used for the study (Figure 5.2).

Figure 5.2: A summary of the features of the CAPTCHAs that we gathered. Audio CAPTCHAs varied primarily along the several common dimensions shown here.

Feature              aol       authorize  craigslist  digg      facebook  google  mslive  paypal  slashdot  veoh
Assistance offered   no        no         yes         no        no        no      no      no      yes       no
Beeps before         3         0          0           0         1         3      0       0       0         1
Background noise     voice     none       music       static    voice     voice  voice   static  none      voice
Challenge alphabet   A-Z, 0-9  A-Z        A-Z         A-Z, 0-9  0-9       0-9    0-9     A-Z     word      0-9
Duration (sec)       10.2      5.1        9.3         6.9       24.7      40.9   7.1     4.3     3.0       25.1
Repeat               no        no         no          no        no        no     no      no      yes       no

Several of these sites attempted to block the download of the audio files representing each CAPTCHA, although all of them were in either the MP3 or WAV format. Many sites added the audio files to web pages using obfuscated Javascript and would allow each to be downloaded only once. These techniques at best marginally improve security, but they can hinder access for users who may want to play the audio CAPTCHA with a separate interface that is easier for them to use.

5.3.2 Study Description

To conduct our study, we created interfaces for solving visual and audio CAPTCHAs mimicking those we observed on existing web pages (Figure 5.3). The interface for visual CAPTCHAs consisted of the CAPTCHA image, an answer field, and a submit button. The interface for solving audio CAPTCHAs replaced the image with a play button that, when pressed, caused the audio CAPTCHA to play. These simplified interfaces preserve the necessary components of the CAPTCHA interface, enabling interface components to be isolated from the surrounding content. Solving CAPTCHAs in real web pages may be more difficult as there are additional distractions, such as other content, and the CAPTCHA may need to be solved with a less ideal interface, for instance using a pop-up window.

Figure 5.3: An interface for solving audio CAPTCHAs modeled after those currently provided to users to solve audio CAPTCHAs (Figure 5.1). The pictured interface has a separate play button and a separate answer field.

Our study was conducted remotely. As Petrie et al. observed, conducting studies with disabled people in a lab setting can be difficult, but remote studies can produce similar results [101]. Blind users in particular use many different screen readers and settings that would be difficult to replicate fully in a lab setting, meaning that remote studies can better approximate the true performance of participants. Participants were first presented with a questionnaire asking about their experience with web browsing, their experience with CAPTCHAs and the level of difficulty or frustration they present, as well as demographic information. They were then asked to solve 10 visual CAPTCHAs and 10 audio CAPTCHAs (for sighted participants) or 10 audio CAPTCHAs (for blind participants).
Each participant was asked to solve one problem randomly drawn from each CAPTCHA type, and the CAPTCHA types were presented in random order to help avoid ordering effects. For this study, participants were designated as belonging to the blind or sighted condition based on their response to the question: “How do you access the web?” The following answers were provided as options: “I am blind and use a screen reader,” “I am sighted and use a visual browser,” and “Other.” In this chapter, blind participates refer to those who answered with the first option and sighted participants to those who answered with the second option. 77 Participants were given up to 3 chances to correctly solve each CAPTCHA, but of primary concern was their ability to correctly solve each CAPTCHA on the first try because this is what is required by most existing CAPTCHAs. To instrument our study, we included Javascript tracking code on each page of the study that allowed us to keep track of the keys users typed and other interaction with page elements. This approach is similar to that provided by the more general UsaProxy [9] system which records all user actions in the browser when users connect through its proxy. This approach has also been used before in studies with screen reader users [17]. The data recorded enabled us to make observations, including the time required to answer the CAPTCHA, how many times the CAPTCHA was played, how many mistakes were made in the process of answering a CAPTCHA, and the number of attempts required. The full list of the events gathered and the information recorded for each is shown below: • Page Loaded - the web page has loaded. • Focused Play - participant selected the play button. • Pressed Play - participant pressed the play button. • Blurred Play - participant moved away from the play button. • Answer Box Focused - participant entered the answer box either by clicking on it or tabbing to it. • Answer Box Blurred - participant exited the answer box either by clicking out or moving away. • Key Pressed - participant pressed a keyboard key. • Focused Submit - submit button was selected. • Pressed Submit - submit button was pressed. • Blurred Submit - participant moved from the submit button but did not press it. • Incorrect Answer - the answer provided by the participant is incorrect, leading the participant to be presented with a 2nd or 3rd try. Personally identifying information was not recorded. 78 5.3.3 Results Of our 162 participants, 89 were blind and 73 were sighted; 56 were female, 99 were male, and 7 chose not to answer that question; and their ages ranged from 18 to 69 with an average age of 38.0 (SD = 13.2). Before participating in our study, blind and sighted participants showed differing levels of frustration toward the audio and visual CAPTCHAs they had already come across. Participants were asked to rate the following questions on a scale from Strongly Agree (1) to Strongly Disagree (5) or opt out by answering “I have never independently solved a visual[audio] CAPTCHA” for the following questions: “Audio CAPTCHAs are frustrating to solve.” and “Visual CAPTCHAs are frustrating to solve.” For the question about audio CAPTCHAs, averages from the two groups were similar, 2.73 (SD = 1.3) for blind participants and 2.82 (SD = 1.4) for sighted participants. Far more sighted participants opted out; however, as only 7.87% of blind participants opted out compared to 44.44% of sighted participants who opted out (χ2 = 69.13, N = 161, df = 1, p < .0001). 
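To make the instrumentation concrete, the following is a minimal sketch of per-page tracking code of the kind described above. It is illustrative only: the element ids (“play”, “answer”, “submit”) and the “/log” endpoint are hypothetical, and the deployed tracking code recorded more detail (such as which key was pressed) than this sketch does.

    // Minimal, hypothetical sketch of the study's Javascript event logging.
    var pageLoadTime = new Date().getTime();

    function logEvent(name) {
      // Record the event name and milliseconds elapsed since page load.
      var entry = name + "," + (new Date().getTime() - pageLoadTime);
      // Requesting an image is a simple way to send the entry to the server.
      new Image().src = "/log?entry=" + encodeURIComponent(entry);
    }

    logEvent("Page Loaded");

    var play = document.getElementById("play");
    var answer = document.getElementById("answer");
    var submit = document.getElementById("submit");

    play.addEventListener("focus", function () { logEvent("Focused Play"); }, false);
    play.addEventListener("click", function () { logEvent("Pressed Play"); }, false);
    play.addEventListener("blur", function () { logEvent("Blurred Play"); }, false);
    answer.addEventListener("focus", function () { logEvent("Answer Box Focused"); }, false);
    answer.addEventListener("blur", function () { logEvent("Answer Box Blurred"); }, false);
    answer.addEventListener("keypress", function () { logEvent("Key Pressed"); }, false);
    submit.addEventListener("focus", function () { logEvent("Focused Submit"); }, false);
    submit.addEventListener("click", function () { logEvent("Pressed Submit"); }, false);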
5.3.3 Results

Of our 162 participants, 89 were blind and 73 were sighted; 56 were female, 99 were male, and 7 chose not to answer that question; and their ages ranged from 18 to 69 with an average age of 38.0 (SD = 13.2).

Before participating in our study, blind and sighted participants showed differing levels of frustration toward the audio and visual CAPTCHAs they had already come across. Participants were asked to rate the following statements on a scale from Strongly Agree (1) to Strongly Disagree (5), or to opt out by answering “I have never independently solved a visual [audio] CAPTCHA”: “Audio CAPTCHAs are frustrating to solve.” and “Visual CAPTCHAs are frustrating to solve.” For the statement about audio CAPTCHAs, averages from the two groups were similar: 2.73 (SD = 1.3) for blind participants and 2.82 (SD = 1.4) for sighted participants. Far more sighted participants opted out, however: only 7.87% of blind participants opted out, compared to 44.44% of sighted participants (χ² = 69.13, N = 161, df = 1, p < .0001). This shows that nearly half of our sighted participants had never solved an audio CAPTCHA before, but those who had were nearly as frustrated by them as blind participants. For the statement about visual CAPTCHAs, blind participants averaged 1.58 (SD = 0.9) with 38.2% opting out, and sighted participants averaged 2.98 (SD = 1.2) with only 1.4% opting out (χ² = 14.21, N = 161, df = 1, p = .0002). This shows that more than a third of blind participants said they had never solved a visual CAPTCHA, and the others found them very frustrating, with a rating very close to (1) Strongly Agree. This rating may mean that some of our participants who checked the “I am blind and use a screen reader” box did have some vision and had tried to solve visual CAPTCHAs before, or perhaps some participants found the required phone call to technical support, the added step of waiting for an email, or the task of finding a sighted person for help to be extremely frustrating. These results are summarized in Figure 5.4.

Figure 5.4: Percentage of participants answering each value on a Likert scale from 1 Strongly Agree to 5 Strongly Disagree reflecting perceived frustration of blind and sighted participants in solving audio and visual CAPTCHAs. Participants could also respond “I have never independently solved a visual [audio] CAPTCHA.” Results illustrate that (i) nearly half of sighted and blind participants had not solved an audio or visual CAPTCHA, respectively, (ii) visual CAPTCHAs are a great source of frustration for blind participants, and (iii) audio CAPTCHAs are also somewhat frustrating to solve.

The data gathered from the Javascript tracking code were analyzed using a mixed-effects model analysis of variance with repeated measures [79, 116]. Condition (blind or sighted), CAPTCHA type (audio or visual), and CAPTCHA source were modeled as fixed effects, with Condition and CAPTCHA type combined as a fixed-effect group with three possible values (blind-audio, sighted-audio, and sighted-visual). Participant was correctly modeled as a random effect. Mixed-effects models properly handle the imbalance in our data due to not all participants solving both audio and visual CAPTCHAs. Mixed-effects models also account for correlated measurements within participants. However, they retain large denominator degrees of freedom, which can be fractional for unbalanced data.

Figure 5.5: The average time spent by blind and sighted users to submit their first solution to the ten audio CAPTCHAs presented to them (conditions: audio-blind, audio-sighted, and visual-sighted; times in seconds). Error bars represent ±1 standard error (SE).

Sighted participants solving visual CAPTCHAs were much faster than blind participants solving audio CAPTCHAs; on average, their completion times were more than 5 times faster. Sighted participants averaged 9.9 seconds (SD = 1.9) and blind participants averaged 50.9 seconds (SD = 1.8) (F(1, 232.1) = 243.9, p < .0001).
This may have been expected, but sighted participants also outperformed blind participants on audio CAPTCHAs, with average completion times of 22.8 seconds (SD = 1.9), or about twice as fast as our blind participants (F(1, 232.4) = 113.9, p < .0001). The timing data alone show the drastic inequalities in current CAPTCHAs for blind web users (Figure 5.5).

The largest differences were observed in success rates. The sighted participants in this study successfully solved nearly 80% of the visual CAPTCHAs presented to them (on the first try). This resembles the 90% previously reported [29]; the lower observed success rate may reflect the trend of CAPTCHAs having become more difficult in order to thwart increasingly sophisticated automated attacks. These same participants, however, were only able to solve 39% of audio CAPTCHAs on the first try, demonstrating again the higher difficulty of solving audio CAPTCHAs. And while it did take blind participants longer (see above), blind and sighted participants were on par when it came to solving the audio CAPTCHAs correctly. Blind participants solved 43% of the audio CAPTCHAs presented to them successfully on the first try, although the difference between blind and sighted participants was not significant (χ² = 3.46, N = 161, df = 1, p = .06). Second and third tries rarely helped in finding a correct answer (Figure 5.6).

Figure 5.6: The number of tries required to correctly answer each CAPTCHA problem, illustrating that (i) multiple tries resulted in relatively few corrections, (ii) the success rates of blind and sighted solvers were on par, and (iii) many audio CAPTCHAs remained unsolved after three tries.

Even though blind participants were on par (slightly better, but not significantly so) at solving audio CAPTCHAs correctly, they took twice as long to do so. So, what occupied the remaining time? This extra time may have been spent listening to the CAPTCHA: on average, blind participants clicked play 3.6 times (SD = 0.1) whereas sighted participants clicked play 2.5 times (SD = 0.1) (F(1, 232.1) = 52.2, p < .0001). Or they may have spent more time navigating to and from the text box: blind participants entered the text box on average 2.9 times (SD = 0.1) whereas sighted participants entered it on average 2.4 times (SD = 0.1) (F(1, 232.2) = 10.2, p < .001).

5.3.4 Discussion

Recruiting participants to take part in studies can be especially difficult when looking for participants with specific characteristics, such as participants who use a screen reader. Despite this, we had very little trouble recruiting participants for this study (as reflected by the large number of responses). Our post on an online mailing list for blind web users was greeted with a flurry of responses, both positive and negative. Many seemed pleased to find that this problem was being worked on, and many doubted that audio CAPTCHAs could ever be improved. Our first anecdotal evidence that CAPTCHAs were a widely-acknowledged problem was the number of responses, many of which were written with what appeared to be significant emotion.

Audio CAPTCHAs were anecdotally a great source of frustration to both blind and sighted participants in our study. Many sighted participants had no prior experience with audio CAPTCHAs and told us that they were much more difficult than they expected. In fact, one participant said, “After going through this exercise, I’ve changed my opinion
that audio CAPTCHA is a good alternative solution for people who are blind.” Many participants, but perhaps especially the blind participants, expressed exasperation toward CAPTCHAs: “I understand the necessity for CAPTCHAs, but they are the only obstacle on the Internet I have been unable to overcome independently.”

Clearly, some types of audio CAPTCHAs are much more difficult to solve than others, and some features were better received than others. For example: “The random-letters, random-numbers ones were completely impossible for me to solve. I couldn’t tell the difference between c/t/v/b, for example. Those with human-intelligible context (e.g. ‘c as in cucumber’) were far easier and less stressful.”

While some of the frustration from solving CAPTCHAs seemed to stem from the difficulty of deciphering distorted audio, for blind people much of the frustration comes from interacting with the CAPTCHA with their screen reader. For example: “It will always be hard to activate the play button, jump to the answer edit box, silence a screen reader and get focused to listen and enter data accurately.” This process takes time, and often content at the beginning of the CAPTCHA is missed: “At the beginning of the captcha, give me time to get down to the edit box and enter it. My screen reader is chattering while I’m getting to the edit box and the captcha is playing.”

Instead of trying to navigate while the CAPTCHA plays, some people try to memorize the answer, wait for the playback to finish, and then move to the text box and start typing. But this presents an entirely new challenge: “I heard them, but could not remember them. And if I tried to type them out [while] listening, my screen reader interfered with my listening.” This process resembles what one might expect sighted users to do if the visual CAPTCHA and the answer box were located on different pages and only one could be viewed at a time.

The interaction problems identified in this study motivate a new interface design with simple improvements that could greatly increase the usability of audio CAPTCHAs.

5.4 Improved Interface for Non-Visual Use

The comments of participants identified two main areas in which audio CAPTCHAs could be improved. As expected, one area was the audio itself: the speech representation should be made clearer, background noise reduced, and additional contextual hints provided in order to make audio CAPTCHAs easier to solve. The audio characteristics of a CAPTCHA were important in determining its difficulty but are difficult to change because they directly determine how resistant the CAPTCHA will be to automated attacks. Audio CAPTCHAs have recently become a more popular target for automated attacks; for example, reCAPTCHA was shown likely to be vulnerable to automated attack [124].

The second area of difficulty mentioned by participants was the interface provided for solving audio CAPTCHAs. Users found the current interfaces cumbersome and sometimes confusing to use. Unlike many improvements to the CAPTCHA itself, improvements to the interface do not affect its resistance to automated attacks. As long as the interface does not embed clues to the answer of the CAPTCHA, it can be modified in whatever way is best for users.

The navigation elements used to listen to and answer an audio CAPTCHA can be distracting, forcing users to either miss the beginning of the CAPTCHA or memorize the entire CAPTCHA before typing the answer.
Participants reported that they appreciated CAPTCHAs that began with a few beeps (as 4 of the 10 CAPTCHAs did) because this allowed them time to move from the “Play” button to the answer box. This suggested that a more usable interface would not require users to navigate back and forth.

Our interface optimized for non-visual use addresses this navigation problem by moving the controls for playback into the answer box, obviating the need to navigate from the playback controls to the answer box because they are now one and the same. By combining the playback controls and the answer box into a single control, the interface presents less of a hurdle for users to overcome, enabling them to focus on answering the CAPTCHA.

Many participants mentioned that the current interface required them to play through an entire audio CAPTCHA in order to review a specific portion. Even when controls other than “play” are available, users do not use them because doing so requires navigating to the appropriate control and then back again to the answer box. Based on this feedback, we added simple controls into the answer box that enable users to play/pause the CAPTCHA and to rewind or fast-forward by one second without additional navigation (Figure 5.7).

Figure 5.7: The new interface developed to better support solving audio CAPTCHAs. The playback controls are combined within the answer textbox to give users control of CAPTCHA playback from within the element in which they will type the answer. Keyboard controls: comma (,) rewinds 1 second, period (.) plays/pauses, and forward slash (/) fast-forwards 1 second; all other keys behave as normal.

Through several rounds of iterative design with several blind participants, we refined this new interface. For example, we initially used various control key combinations to control playback of the CAPTCHA (such as CTRL+P for play), but we found that the shortcuts we chose often overlapped with shortcuts available in screen readers. We briefly considered using the single key “p” for play, but this overlaps with the alphabet used in many popular CAPTCHAs, meaning our interface could not be used with them. On the suggestion of a blind participant, we chose to use the following individual keys for the playback controls: comma (,) for rewind, period (.) for play/pause, and forward slash (/) for fast-forward. These characters are not included in the alphabets of any of the CAPTCHAs that we considered (Figure 5.2) and are located in that order on standard American keyboards. For users of keyboards with different layouts, the keys could be similarly chosen to avoid collisions with screen reader shortcuts and with characters used in language-specific CAPTCHAs, and such that they are conveniently located on popular local keyboards.

5.4.1 Integration Into Existing Websites

An advantage of altering the interface used to solve CAPTCHAs, instead of attempting to make CAPTCHA problems themselves more usable, is that a new interface can be independently added to existing web sites. We have written a Greasemonkey script [104] that detects the reCAPTCHA interface and replaces the interface used to solve its audio CAPTCHA with our optimized interface. For web sites in which this is not currently possible, web developers could add this interface to their sites without concern that the new interface will expose them to additional risk of automated attack. All of the currently-used CAPTCHAs considered in the study in the previous section can be used directly with our optimized interface.
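The following sketch illustrates how the combined control might be implemented. It is illustrative only: the audio object is assumed to expose play(), pause(), a paused flag, and a seekable currentTime property (as the HTML5 Audio element does), and a deployed version, like our Greasemonkey script, would also need to locate the CAPTCHA’s audio and answer elements within the host page.

    // Illustrative sketch: comma/period/slash playback controls attached to
    // the answer textbox. The names here are not taken from the actual script.
    function attachCaptchaControls(answerBox, audio) {
      answerBox.addEventListener("keypress", function (event) {
        var key = String.fromCharCode(event.charCode || event.keyCode);
        if (key === ".") {
          // Period toggles between playing and paused.
          if (audio.paused) { audio.play(); } else { audio.pause(); }
        } else if (key === ",") {
          // Comma rewinds one second and continues playing.
          audio.currentTime = Math.max(0, audio.currentTime - 1);
          audio.play();
        } else if (key === "/") {
          // Forward slash skips ahead one second and continues playing.
          audio.currentTime += 1;
          audio.play();
        } else {
          return; // All other keys behave as normal.
        }
        event.preventDefault(); // Keep control characters out of the answer.
      }, false);
    }

Because the three control characters never appear in the alphabets of the CAPTCHAs we studied (Figure 5.2), intercepting them costs the user nothing when typing an answer.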
5.5 Evaluation of New CAPTCHA Interface

We evaluated our new interface for solving audio CAPTCHAs, with its optimizations for screen reader users, based on the insights from our initial study.

5.5.1 Study Design

To evaluate the new interface for audio CAPTCHAs, we repeated the study described earlier but with the new interface. Below is a snippet from the instructions that were given to participants before the study began:

We are testing a different interface for solving CAPTCHAs. Please take some time to familiarize yourself with the new interface. Keys for controlling playback are as follows:

• Typing a period in this box will cause the CAPTCHA to play, and pressing it again will pause playback.
• Typing a comma will rewind the CAPTCHA by 1 second and then continue playing.
• Typing a forward slash will fast forward the CAPTCHA by 1 second and then continue.

These keys work only when the textbox used to answer the CAPTCHA problem has focus. This allows you to control the CAPTCHA directly from the box into which you will enter your answer. The control key characters will not be entered into the box and are only used to control playback of the CAPTCHA.

No participants reported difficulty learning the new interface.

5.5.2 Results

This study included 14 blind participants: 2 were female, 10 were male, and 2 chose not to answer that question; their ages ranged from 22 to 59 with an average age of 36.1 (SD = 10.2). We again used a mixed-effects model analysis of variance with repeated measures to analyze our data.

While our optimized interface did not have a significant effect on the time required to solve CAPTCHAs (participants averaged 50.9 seconds (SD = 2.4) with the original interface and 47.3 seconds (SD = 5.9) with the optimized interface; F(1, 101.3) = 0.31, n.s.), it did have significant and positive effects on the number of tries required to solve CAPTCHAs and on the observed success rate of participants. With our optimized interface, participants reduced the average number of attempts required to solve the CAPTCHAs from 2.21 (SD = 0.4) with the original interface to 1.56 (SD = 1.2) (F(1, 100) = 20.3, p < .0001). Perhaps more importantly, participants solved over 50% more CAPTCHAs correctly on the first try with the optimized interface than they did with the original interface: 42.9% (SD = 0.2) were correctly solved on the first try using the original interface and 68.5% (SD = 0.5) were correctly solved on the first try using the optimized interface (F(1, 100) = 22.3, p < .0001). These improvements are shown in Figure 5.8.

Figure 5.8: The percentage of CAPTCHAs answered correctly by blind participants using the original and optimized interfaces (grouped by whether the CAPTCHA was answered on the 1st try, 2nd try, 3rd try, or never). The optimized interface enabled participants to answer 59% more CAPTCHAs correctly on their first try as compared to the original interface.

5.5.3 Discussion

Participants in this study were generally enthusiastic about the new interface to audio CAPTCHAs that we created, leading one participant to say, “I really liked the interface provided here for answering the captchas. I think it could really be benefitial [sic] if widely used.” Some participants felt that while the new interface offered an improvement, audio CAPTCHAs were still frustrating. For example, one participant said, “...
sometimes the audio captchas are still so distorted that it’s hard to solve them even then.” In general, while audio CAPTCHAs remained challenging, users were both more accurate with the new interface (answering incorrectly less often) and required fewer attempts to find the right answer. Because the new interface does not affect the security of the underlying CAPTCHA and can be easily adapted to new CAPTCHAs, we hope this interface will become the default in the near future.

5.6 Future Work

In this work, we have demonstrated the difficulty of audio CAPTCHAs and offered improvements to the interface used to answer them that can help make them more usable. We plan to explore other areas in which interface changes may improve non-visual access, and to consider how the lessons we learned in this work may generalize beyond the interfaces to audio CAPTCHAs.

Future work may explore how audio CAPTCHAs could be created that are easier for humans to solve while still addressing the improved automatic techniques for defeating them. The ten audio CAPTCHAs explored in our study varied along a wide variety of dimensions yet remained quite similar in design. Perceptual CAPTCHAs face many problems, including that (i) none are currently accessible to individuals who are both blind and deaf and (ii) automated techniques are becoming increasingly effective in defeating them. An important direction for future work is addressing these problems.

5.7 Summary

Creating an interface optimized for non-visual access presents challenges that are very different from those of targeting visual access. Our study with blind participants demonstrated that existing audio CAPTCHAs are inadequate alternatives and that participants’ frustration is due in part to the interface provided for solving them. Based on this feedback, we optimized the interface for solving audio CAPTCHAs for non-visual use by localizing the playback interface to the answer box. Although we did not change the audio CAPTCHAs themselves, users in our subsequent study were able to successfully solve CAPTCHAs on their first try 59% more of the time. This dramatic improvement can be directly applied to existing CAPTCHA interfaces without impacting the ability of the CAPTCHA to protect access from automatic agents. Because of the incredible differences in non-visual access, the interface can make all the difference when developing applications designed to be accessed non-visually.

This chapter demonstrates the utility of our WebinSitu infrastructure (Chapter 3) for conducting large studies of user interaction with blind web users. Not only did we discover the difficulty of using the interfaces to audio CAPTCHAs with WebinSitu, but we were also able to evaluate our newly designed, improved interface using it. The new interface can be added to existing pages using an Accessmonkey script (Chapter 4), injected by the user on a variety of different platforms.

In the next chapter, we present TrailBlazer, a collaborative tool that lets blind web users connect individual interactions together to more effectively complete web-based tasks. The goal of TrailBlazer is to enable end users to create interactions that work better for them, just as the script offered in this chapter improves access to audio CAPTCHAs, without requiring programming experience.
Chapter 6

MORE EFFECTIVE ACCESS WITH TRAILBLAZER

The previous chapter illustrated how screen reader users could dramatically improve their success rate at solving audio CAPTCHAs by injecting a script that altered the interface to make it better for them. This chapter explores the potential for screen reader users to help one another be more effective at completing tasks on the web by improving the constituent interactions by demonstration. We introduce TrailBlazer, a system that provides an accessible, non-visual interface to guide blind users through existing how-to knowledge [20]. A formative study indicated that participants saw the value of TrailBlazer but wanted to use it for tasks and web sites for which no existing script was available. To address this, TrailBlazer offers suggestion-based help created on-the-fly from a short, user-provided task description and an existing repository of how-to knowledge.

6.1 Motivation

For blind web users, completing tasks on the web can be time-consuming and frustrating. Blind users interact with the web through software programs called screen readers. Screen readers convert information on the screen to a linear stream of either synthesized voice or refreshable Braille. If a blind user needs to search for a specific item on the page, they must either listen to the entire linear stream until the goal item is reached or skip around in the page using structural elements, such as headings, as a guide. To become proficient, users must learn hundreds of keyboard shortcuts to navigate web page structures and access mouse-only controls. Unfortunately, as shown in Chapter 3, even experienced screen reader users do not approach the speed of searching a web page that is afforded to sighted users for many tasks [17, 122].

Existing repositories contain how-to knowledge that is able to guide people through web tasks quickly and efficiently. This how-to knowledge is often encoded as a list of steps that must be performed in order to complete the task. The description of each step consists of information describing the element that must be interacted with, such as a button or text box, and the type of operation to perform with that element. For example, one step in the task of buying an airplane flight on orbitz.com is to enter the destination city into the text box labeled “To”. One such repository is provided by CoScripter [80], which contains a collection of scripts written in a “sloppy” programming language that is both human- and machine-understandable (Figure 6.1).

Time Card CoScript
1. goto “http://www.mycompany.com/timecard/”
2. enter “8” into the “Hours worked” textbox
3. click the “Submit” button
4. click the “Verify” button

Figure 6.1: A CoScript for entering time worked into an online time card. The natural language steps in the CoScript can be interpreted both by tools such as CoScripter and TrailBlazer, and also read by humans. These steps are also sufficient to identify all of the web page elements required to complete this task – the textbox and two buttons. Without TrailBlazer, steps 2-4 would require a time-consuming linear search for screen reader users.
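To make the encoding concrete, a step such as step 2 of the CoScript above can be thought of as pairing an operation with a description of its target element and, when applicable, a value. The following record is a hypothetical illustration, not CoScripter’s internal representation:

    // Hypothetical parsed form of: enter "8" into the "Hours worked" textbox
    var step = {
      action: "enter",             // the operation to perform
      value: "8",                  // the text to type, when applicable
      targetType: "textbox",       // the kind of element to act upon
      targetLabel: "Hours worked"  // the label used to locate the element
    };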
In this chapter, we present TrailBlazer, a system that guides blind users through completing tasks step-by-step. TrailBlazer offers users suggestions of what to do next, automatically advancing the focus of the screen reader to the interactive element that needs to be operated or the information that needs to be heard. This capability reduces the need for time-consuming linear searches when using a screen reader.

TrailBlazer was created to support the specific needs of blind users. First, its interface explicitly accommodates screen readers and keyboard-only access. Second, TrailBlazer augments the CoScripter language with a “clip” command that specifies a particular region of the page on which to focus the user’s attention. A feature for identifying regions is not included in other systems under the assumption that users can visually search content to quickly find desired regions. Third, TrailBlazer is able to dynamically create new scripts from a brief user-specified description of the goal task and the existing corpus of scripts. Dynamic script creation was inspired by a formative user study of the initial TrailBlazer system, which confirmed that TrailBlazer made the web more usable but was not helpful in the vast majority of cases where a script did not already exist. To address this problem, we hypothesized that users would be willing to spend a few moments describing their desired task for TrailBlazer if it could make them more efficient on tasks lacking a script. The existing repository of scripts helps TrailBlazer to incorporate knowledge from similar tasks or sub-tasks that have already been demonstrated. Building from the existing corpus of CoScripter scripts and the short task description, TrailBlazer dynamically creates new scripts that suggest patterns of interaction with previously unseen web sites, guiding blind users through sites for which no script exists.

As blind web users interact with TrailBlazer to follow these dynamically-suggested steps, they are implicitly supervising the synthesis of new scripts. These scripts can be added to the script repository and reused by all users. Studies have shown that many users are unwilling to pay the upfront costs of script creation even though those scripts could save them time in the future [75]. Through the use of TrailBlazer, we can effectively reverse the traditional roles of the two groups, enabling blind web users to create new scripts that lead sighted users through completing web tasks.

This chapter makes the following three contributions:

• An Accessible Guide - TrailBlazer is an accessible interface to the how-to knowledge contained in the CoScripter repository that enables blind users to avoid linear searches of content and complete tasks more efficiently.
• Formative Evaluation - A formative evaluation of TrailBlazer illustrating its promise for improving non-visual access, as well as the desire of participants to use it on tasks for which a script does not already exist.
• Dynamic Script Generation - TrailBlazer, when given a natural language description of a user’s goal and a pre-existing corpus of scripts, dynamically suggests steps to follow to achieve the user’s goal.

6.2 Related Work

Work related to TrailBlazer falls into two main categories: (i) tools and techniques for improving non-visual web access, and (ii) programming by demonstration and interactive help systems that record and play back how-to knowledge.

6.2.1 Improving Web Accessibility

Most screen readers simply speak aloud a verbal description of the visual interface. While this enables blind users to access most of the software available to sighted people, the resulting interfaces are often not easy to use because they were not designed to be viewed non-visually.
Emacspeak demonstrated the usability benefits that result from designing applications with voice output in mind [107]. The openness of the web enables it to be adapted for non-visual access. Unfortunately, most web content is not designed with voice output in mind. In order to produce a usable spoken interface to a web site, screen readers extract semantic information and structure from each page and provide interaction techniques designed for typical web interactions. When pages contain good semantics, these can be used to improve the usability of the page, for instance by enabling users to skip over sections irrelevant to them.

Semantic information can either be added to pages by content providers or formulated automatically when pages are accessed. Adding meaningful heading tags (<H1>-<H6>) has been shown to improve web efficiency for blind web users browsing structural information [138] but, as shown in Chapter 3, less than half of web pages use them. To improve web navigation, in-page “skip” links visible only to screen reader users can be added to complex pages by web developers. These links enable users to quickly jump to areas of the page possibly far away in linear distance. Unfortunately, these links are often broken (Chapter 3). Web developers have proven unreliable in manually providing navigation aids by annotating their web pages.

Numerous middleware systems [7] have suggested ways of inserting semantically relevant markup into web pages before they reach the client. Other systems have moved the automatic detection of semantically-important regions to the interface itself. For example, the HearSay non-visual web browser parses web pages into a semantic tree that can be more easily navigated with a screen reader [106].

Augmenting the screen reader interface has also been explored. Several systems have added information about surrounding pages to existing pages to make them easier to use. Harper et al. augment links in web pages with “Gist” summaries of the linked pages in order to provide users more information about the page to which a link would direct them [54]. CSurf observes the context of clicked links in order to begin reading at a relevant point in the resulting page [84].

Although adding appropriate semantic information makes web content more usable, finding specific content on a page is still a difficult problem for screen reader users. AxsJAX addresses this problem by embedding “trails” into web pages that guide users through semantically-related elements [49]. TrailBlazer scripts expand on this trail metaphor. Because AxsJAX trails are generally restricted to a single page and are written in Javascript, they cannot be created by end users or applied to the same range of tasks as TrailBlazer’s scripts.

6.2.2 Recording and Playback of How-To Knowledge

Interactive help systems and programming by demonstration tools have explored how to capture procedural knowledge and express it to users. COACH [118] and Eager [37] are early systems in this space that worked with standard desktop applications instead of the web. COACH observes computer users in order to provide targeted help, and Eager learned and executed repetitive tasks by observing users.

Expressing procedural knowledge, especially to assist a user who is currently working to complete a task, is a key issue for interactive help systems.
Kelleher et al.’s work on stencil-based tutorials demonstrates a variety of useful mechanisms [68], such as blurring all of the items on the screen except for those relevant to the current task. Sticky notes adding useful contextual information were also found to be effective. TrailBlazer makes use of analogous ideas to direct the attention of users to important content in its non-visual user interface.

Representing procedural knowledge is also a difficult challenge. Keyword commands are one method, using simple pseudo-natural language descriptions to refer to interface elements and the operations to be applied to them [81]. This is similar to the sloppy language used by CoScripter to describe web-based activity [80]. TrailBlazer builds upon these approaches because the stored procedural knowledge represented by TrailBlazer can be easily spoken aloud and understood by blind users.

A limitation of most current systems is that they cannot generalize captured procedural knowledge to other contexts. For example, recording the process of purchasing a plane flight on orbitz.com will not help perform the same task on travelocity.com. One of the only systems to explore generalization is the Goal-Oriented Web Browser [41], which attempts to generalize a previously demonstrated script using a database of common sense knowledge. This approach centered around data detectors that could determine the type of data appearing on web sites. TrailBlazer incorporates additional inputs into its generalization process, including a brief task description from the user, and does not require a common-sense knowledgebase.

An alternate approach to navigating full-size web pages with a script, as TrailBlazer does, is to instead shrink the web pages by keeping only the information needed to perform the current task. This can be done using a system such as Highlight, which enables users to re-author web pages for display on small-screen devices by demonstrating which parts of the pages used in the task are important [94]. The resulting simplified interfaces created by Highlight are more efficient to navigate with a screen reader, but they prevent the user from deviating from the task because content not directly related to the task is removed.

6.3 An Accessible Guide

Figure 6.2: The TrailBlazer interface is integrated directly into the page, is keyboard accessible, and directs screen readers to read each new step. A) The description of the current step (in the pictured example, “Step 2 of 15: select ‘Books’ from the ‘Search’ listbox”) is displayed visually in an offset bubble but is placed in DOM order so that the target of a step immediately follows its description when viewed linearly with a screen reader. B) Script controls (the “Previous Step,” “Play from Here,” and “Next Step” buttons) are placed in the page for easy discoverability but also have alternative keyboard shortcuts for efficient access.

TrailBlazer was designed from the start for non-visual access using the following three guidelines (Figure 6.2):

• Keyboard Access. All playback functions are accessible using only the keyboard, making access feasible for those who do not use a mouse.
• Minimize Context Switches. The playback interface is integrated directly into the web pages through which the user is being guided. This close coupling of the interface with the web page enables users to easily switch between TrailBlazer’s suggestions and the web page components needed to complete each step.
• Directing Focus.
TrailBlazer directs users to the location on each page at which each step is completed. As mentioned, a main limitation of using a screen reader is the difficulty of finding specific content quickly. TrailBlazer directs users to the content necessary to complete the instruction that it suggests. If the user wants to complete a different action, the rest of the page is immediately available.

The bubbles used to visually highlight the relevant portion of the page and provide contextual information were inspired by the “sticky notes” used in Stencil-Based Tutorials [68]. The non-visual equivalent in TrailBlazer was achieved by causing the screen reader to begin reading at the step (Figure 6.2). Although the location of each bubble is visually offset from the target element, the DOM order of the bubble’s components was chosen such that they are read in an intuitive order for screen reader users. The visual representation resembles that of some tutoring systems, and may also be preferred by users of visual browsers, in addition to supporting non-visual access with TrailBlazer.

Upon advancing to a new instruction, the screen reader’s focus is set to the instruction description (e.g., “Step 2 of 5: click the ‘search’ button”). The element containing that text is inserted immediately before the relevant control (e.g., the search button) in DOM order so that exploring forward from this position will take the user directly to the element mentioned in the instruction. The playback controls for previous step, play, and next step are represented as buttons and are inserted following the relevant control. Each of these functions can also be activated by a separate keyboard shortcut - for example, ALT+S advances to the next step.

The TrailBlazer interface enables screen reader users to move from step to step, verifying that each step is going to be conducted correctly, while avoiding all linear searches through content (Figure 6.3). In the event that the user does not want to follow a particular step of the script they are using, the entire web page is available to them as normal. TrailBlazer is a guide but does not override the user’s intentions.
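A minimal sketch of this insert-and-focus technique is shown below. The function names are hypothetical, and the real interface also builds the visual bubble and its playback buttons; setting tabindex to -1 and calling focus() is one standard way to move a screen reader’s reading position to newly inserted content.

    // Illustrative sketch: announce a step by placing its description
    // immediately before the target element in DOM order and focusing it.
    function announceStep(stepNumber, totalSteps, description, targetElement) {
      var bubble = document.createElement("div");
      bubble.textContent = "Step " + stepNumber + " of " + totalSteps +
                           ": " + description;
      bubble.setAttribute("tabindex", "-1"); // focusable from script
      // Reading forward from the description now reaches the target element.
      targetElement.parentNode.insertBefore(bubble, targetElement);
      bubble.focus(); // screen readers begin reading at the step description
    }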
8) On the product detail page, TrailBlazer directs users past the standard template material directly to the product information. to search. The “clip” command that TrailBlazer adds to the CoScripter language enables regions to be described and TrailBlazer users to be quickly directed to them. 6.4.1 Region Description Study Existing CoScripter commands are written in natural language. In order to determine what language would be appropriate for our CoScripter command, we conducted a study in which we asked 5 participants to describe 20 regions covering a variety of content (Figure 6.4). To encourage participants to provide descriptions that would generalize to multiple regions, two different versions of each region were presented. Upon an initial review of the results of this study, we concluded that the descriptions provided fell into the following 5 non-exclusive categories: high-level semantic descriptions 99 1-a. “2008 season stats” 1-b. “The highlighted region is of statistics. This is a table that has multiple numbers describing a player's achievements and records of what he has accomplished.” 2-a. “This region lists search results for your query.” 2-b. “This area contains the heading ‘Search Results’ along with the returns from a search of a term.” 1 2 Figure 6.4: The descriptions provided by two participants for the screenshots shown illustrating diversity in how regions were described. Selected regions are 1) the table of statistics for a particular baseball player, and 2) the search results for a medical query. of the content (78%), descriptions matching all or part of the headings provided on the page for the region (53%), descriptions drawn directly from the words used in the region (37%), descriptions including the color, size, or other stylistic qualities of the region (18%), and descriptions of the location of the region on the page (11%). 6.4.2 The Syntax of the Clip Command We based the formulation of the syntax of the clip command on the results of the study just described. Clearly, users found it most convenient to describe the semantic class of the region. While future work may seek to leverage a data detector like Miro to automatically determine the class of data in order to facilitate such a command [41], our clip command currently refers to regions by either their heading or the content contained within them. When using a heading to refer to a region, a user lists text that starts the region of interest. For instance, the step “clip the ‘search results”’ would begin the clipped region at the text “search results.” This formulation closely matched what many users wrote in our 100 study, but does not explicitly specify an end to the clip. TrailBlazer uses several heuristics to end the clip. The most important part is directing the user to the general area before the information that is valuable to them. If the end of the clipped region comes too soon, they can simply keep reading past the end of the region. To use text contained within a region to refer to it, users write commands like, “clip the region containing “flight status””. For scripts operating on templated web site or for those that use dynamically-generated content, this is not always an ideal formulation because specific text may not always be present in a desired region. By using both commands, users have the flexibility to describe most regions, and, importantly, TrailBlazer is able to easily interpret them. 
6.5 Formative Evaluation

The improvements offered in the previous sections were designed to make TrailBlazer accessible to blind web users using a screen reader. In order to investigate its perceived usefulness and remaining usability concerns, we conducted a formative user study with 5 blind participants. Our participants were experienced screen reader users. On average, they had 15.0 (SD = 4.7) years of computer experience, including 11.8 (SD = 2.8) years using the web.

We first demonstrated how to use TrailBlazer as a guide through pre-defined tasks. We showed users how they could, at each step, choose to either have TrailBlazer complete the step automatically, complete it themselves, or choose any other action on the page. After this short introduction, participants performed the following three tasks using TrailBlazer: (i) checking the status of a flight on united.com, (ii) finding real estate listings fitting specific criteria, and (iii) querying the local library to see if a particular book is available. Each participant was asked the extent to which they agreed with several statements on a Likert scale after completing the tasks (Figure 6.5).

In general, participants were enthusiastic about TrailBlazer, leading one to say, “this is exactly what most blind users would like.” One participant said TrailBlazer was a “very good concept, especially for the work setting where the scenarios and templates are already there.” Another participant, who trains people on screen reader use, thought that it would be a good way to gradually introduce the concept of using a screen reader to a new computer user for whom the complexity of web sites and the numerous shortcuts available can be overwhelming. Participants uniformly agreed that despite their experience using screen readers, “finding a little kernel of information can be really time-consuming on a complex web page” and that “sometimes there is too much content to just use headings and links to navigate.”

Participants wondered if TrailBlazer could help them with dynamic web content, which is often added to the DOM of web pages far from where it appears visually, making it difficult to find. Screen readers can also have trouble presenting dynamically-created content to users. TrailBlazer could not only direct users to such content automatically, avoiding a long linear search, but also help them interact with it.

Despite their enthusiasm for using TrailBlazer on tasks that were already defined, participants questioned how useful it would be if they had to rely on others to provide the scripts for them to use. One participant even questioned the usefulness of scripts created for a task that he wanted to complete because “designers and users do not always agree on what is important.” TrailBlazer did not support recording new tasks at the time of the evaluation, although new CoScripts could be created by sighted users using CoScripter.

Participants also had several suggestions on how to improve the interface. TrailBlazer guides users from one step to the next by dynamically modifying the page, but screen readers do not always update the external models of the pages from which they read. To fix this, users would need to occasionally refresh the model of the screen reader, which many thought could be confusing to novice users. Other systems that improve non-visual access have similar limitations [49], and these problems are being addressed in upcoming versions of screen readers.
6.6 Dynamic Script Generation

TrailBlazer can suggest actions that users may want to take even when no pre-existing script is available for their current task. These suggestions are based on a short task description provided by the user and an existing repository of how-to knowledge. Suggestions are presented to users as options, which they can quickly jump to when correct but also easily ignore. Collectively, these suggestions help users dynamically create a new script, potentially increasing efficiency even the first time they complete a task.

Figure 6.5: Participant responses to Likert scale questions indicating that they think completing new tasks and finding content is difficult (1, 2), think TrailBlazer can help them complete tasks more quickly and easily (3, 4, 5), and want to use it in the future (6), especially if scripts are available for more tasks (7). The statements were: 1) “Completing tasks that are new to me is easy on most web sites”; 2) “Finding relevant content on web pages can be challenging”; 3) “TrailBlazer makes completing tasks easier”; 4) “TrailBlazer makes completing tasks faster”; 5) “TrailBlazer made it easier to find content on web pages”; 6) “I want to use TrailBlazer in the future”; 7) “I would be more likely to use TrailBlazer if more scripts were available.”

6.6.1 Example Use Case

To inform its suggestions, TrailBlazer first asks users for a short textual description of the task that they want to complete; it then provides appropriate suggestions to help them complete that task. As an example, consider Jane, a blind web user who wants to look up the status of her friend’s flight on Air Canada. She first provides TrailBlazer with the following description of her task: “flight status on Air Canada.” The CoScripter repository does not contain a script for finding the status of a flight on Air Canada, but it does contain scripts for finding the status of flights on Delta and United. After some pre-processing of the request, TrailBlazer conducts a web search using the task description to find likely web sites on which to complete it. “goto aircanada.com” is its first suggested step, and Jane chooses to follow that suggestion. If an appropriate suggestion was not listed, then Jane could have chosen to visit a different web site or even searched the web for the appropriate web site herself (perhaps using TrailBlazer to guide her search).

Figure 6.6: Proportion of action types (click button, click link, enter, goto, select, turnon) at each step number for scripts in the CoScripter repository. These scripts were contributed by current users of CoScripter. The action types represented include actions recognized by CoScripter which appeared in at least one script as of October 2008.

TrailBlazer automatically loads aircanada.com and then presents Jane with the following three suggestions: “click the ‘flight status’ button”, “click the ‘flight’ button”, and “fill out the ‘search the site’ textbox.” Jane chooses the first, and TrailBlazer completes it automatically. Importantly, TrailBlazer only suggests actions that are possible to complete on the current web page. Jane uses this interface to complete the entire task without needing to search within the page for any of the necessary page elements.
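A rough sketch of how such starting-point suggestions might be formed from the task description is shown below (the goto component is described in Section 6.6.3). It is hypothetical: searchWeb stands in for a real web search API, and the simple capitalization test is a crude stand-in for the part-of-speech tagging TrailBlazer actually uses.

    // Hypothetical sketch: derive "goto" suggestions from a task description.
    // searchWeb(query) is assumed to return a ranked array of result URLs.
    function suggestStartingSites(taskDescription, searchWeb) {
      var keywords = [];
      var words = taskDescription.split(/\s+/);
      for (var i = 0; i < words.length; i++) {
        // Crude stand-ins for proper nouns and URLs in the description,
        // e.g., "flight status on Air Canada" -> ["Air", "Canada"].
        if (/^[A-Z]/.test(words[i]) || /\.(com|org|net)/.test(words[i])) {
          keywords.push(words[i]);
        }
      }
      var urls = searchWeb(keywords.join(" ")).slice(0, 5);
      return urls.map(function (url) { return "goto " + url; });
    }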
A pre-existing script for the described task is not required for TrailBlazer to accurately suggest appropriate actions. TrailBlazer can, in effect, apply scripts describing tasks (or subtasks) on one web site to other web sites. It can, for example, use a script for buying a book at Amazon to buy a book at Barnes and Noble, a script for booking a trip on Amtrak to help book a trip on the United Kingdom’s National Rail Line, or a script for checking the status of a package being delivered by UPS to help check on one being delivered by Federal Express. Subtasks contained within scripts can also be applied by TrailBlazer in different domains. For example, the sequence of steps in a script on a shopping site that helps users enter their contact information can be applied during the registration process on an employment site. If a script already exists for a user’s entire task, then the suggestions
The success of the goto component is highly dependent on the task description provided by the user and on the popularity of the site on which the task should be completed. The authors have observed this component to work well on many real-world tasks, but future work will test and improve its performance.

6.6.4 General Suggestion Component

The main suggestion component described in this chapter is the general suggestion component, which suggests specific actions for users to complete on the current page. These suggestions are presented as natural language steps in the CoScripter language and are chosen from all the actions possible to complete on the current page. TrailBlazer ranks suggestions based on the user's task description, knowledge mined from the CoScripter script repository, and the history of actions that the user has already completed. Suggestions are first assigned a probability by a Naive Bayes classifier and then ranked according to those probabilities. Naive Bayes is a simple but powerful supervised learning method that, after training on labeled examples, can assign probability estimates to new examples. Although the probabilities assigned are only estimates, they are known to be useful for ranking [76]. The model is trained on tasks that were previously demonstrated using either TrailBlazer or CoScripter, which are contained within the CoScripter script repository.

The knowledge represented by the features used in the model could also have been expressed as static rules for the system to follow. TrailBlazer's built-in machine learning model, however, enables it to continually improve as it is used. Because tasks that users complete using TrailBlazer implicitly describe new scripts, the features based on the script repository should become more informative over time as more scripts are added.

Figure 6.7: The features calculated and used by TrailBlazer in order to rank potential action suggestions, along with the three sources from which they are formed (the task description, the script repository, and the user's history): (1) Task Description Similarity, (2) Task Script Similarity, (3) Prior Action Script Similarity, (4) Likelihood Action Pair, (5) Same Form as Prior Action, and (6) Button First Form Action.

6.6.5 Features Used in Making Suggestions

In order to accurately rank potential actions, TrailBlazer relies on a number of informative, automatically-derived features (Figure 6.7). The remainder of this section explains the motivation behind the features found to be informative and describes how they are computed.

Leveraging Action History

TrailBlazer includes several features that leverage its record of actions that it has observed the user perform. Two features capture how the user's prior actions relate to their interaction with forms (Figure 6.7-5,6). Intuitively, when using a form containing more than one element, interacting with one element increases the chance that you will interact with another in the same form. The Same Form as Prior Action feature expresses whether the action under consideration refers to an element in a form for which an action has previously been completed. Next, although it can occur in other situations, pressing a button in a form usually occurs after acting on another element in the form. The Button First Form Action feature captures whether the potential action is a button press in a form in which no other elements have been acted upon.
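Under the assumption that each candidate action and each item in the user's history record the form (if any) containing its target element, these two features might be computed as in the following sketch; the field names are illustrative, not TrailBlazer's actual representation.

```javascript
// Sketches of the two form-history features. Both assume candidate and
// history items of the form { type, form }, where form identifies the
// containing form element (or is null).
function sameFormAsPriorAction(candidate, history) {
  // True when the candidate acts on an element inside a form that the
  // user has already interacted with.
  return candidate.form != null &&
         history.some(action => action.form === candidate.form);
}

function buttonFirstFormAction(candidate, history) {
  // True when the candidate presses a button in a form in which no other
  // element has yet been acted upon, usually an unlikely next step.
  return candidate.type === 'click button' &&
         candidate.form != null &&
         !history.some(action => action.form === candidate.form);
}
```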
Similarity to Task Description

The Task Description Similarity feature enables TrailBlazer to weight steps similar to the task description provided by the user more highly (Figure 6.7-1). Similarity is quantified by calculating the vector cosine between the words in the task description and the words in each potential suggestion. The word-vector cosine metric considers each set of words as a vector in which each dimension corresponds to a different term, and in which the magnitude along each dimension is the frequency with which the corresponding word has been observed. For this calculation, a list of stopwords is removed. The similarity between the task description word vector $v_d$ and the potential suggestion word vector $v_s$ is calculated as follows:

$$VC(v_d, v_s) = \frac{v_d \cdot v_s}{\|v_d\| \, \|v_s\|} \qquad (6.1)$$

The word-vector cosine is often used in information retrieval settings to compare documents with keyword queries [10].
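A minimal sketch of this computation follows, with a tiny illustrative stopword list; the full list used by the system would be larger.

```javascript
// A sketch of the word-vector cosine of Equation 6.1.
const STOPWORDS = new Set(['the', 'a', 'an', 'of', 'on', 'to', 'and']);

function wordVector(text) {
  const vector = {};
  for (const word of text.toLowerCase().split(/\W+/)) {
    if (word && !STOPWORDS.has(word)) {
      vector[word] = (vector[word] || 0) + 1;  // term frequency
    }
  }
  return vector;
}

function vectorCosine(a, b) {
  let dot = 0;
  for (const word in a) {
    if (word in b) dot += a[word] * b[word];
  }
  const norm = v => Math.sqrt(Object.values(v).reduce((s, x) => s + x * x, 0));
  return dot === 0 ? 0 : dot / (norm(a) * norm(b));
}

// Example: scoring a candidate step against a task description.
// vectorCosine(wordVector('flight status on Air Canada'),
//              wordVector("click the 'flight status' button"))
```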
Using the Existing Script Repository

TrailBlazer uses the record of user actions combined with the scripts contained within the script repository to directly influence which suggestions are chosen. The CoScripter repository contains nearly 1000 human-created scripts describing the steps to complete a diversity of tasks on the web. These scripts contain not only the specific steps required to complete a given web-based task, but also more general knowledge about web tasks. Features used for ranking suggestions built from the repository are based on (i) statistics of the actions and the sequence in which they appear, and (ii) matching suggestions to relevant scripts already in the repository. These features represent the knowledge contained within existing scripts, enabling TrailBlazer to apply that knowledge to tasks for which no script exists.

Some action types are more likely than others depending on how many actions the user has completed (Figure 6.6). For instance, clicking a link is more likely near the beginning of a task than near the end. In addition, some actions are more likely to follow actions of particular types. For instance, clicking a button is more likely to follow entering text than to follow clicking a link, because buttons are usually pressed after entering information into a form. Following this motivation, the Likelihood Action Pair feature used by TrailBlazer is the likelihood of each action given the actions that the user completed before (Figure 6.7-4). This likelihood is computed through consideration of all scripts in the repository.

Leveraging Existing Scripts for Related Tasks

TrailBlazer also directly uses related scripts already in the repository to help form its suggestions. TrailBlazer retrieves two sets of related scripts and computes a separate feature for each. First, TrailBlazer uses the task description provided by the user as a query to the repository, retrieving scripts that are related to the user's task. For instance, if the user's task description was "Flight status on Air Canada," matches on the words "flight" and "status" will enable the system to retrieve scripts for finding the flight status on United and Delta. The procedure for checking flight status on both of these sites is different than it is on Air Canada, but certain steps, like entering information into a textbox with a label containing the words "Flight Number," are repeated on all three. The Task Script Similarity feature captures this information (Figure 6.7-2).

The second set of scripts that TrailBlazer retrieves is found using the last action that the user completed. These scripts may contain subtasks that do not relate to the user's stated goal but can still be predictive of the next action to be completed. For instance, if the user just entered their username into a textbox with the label "username," many scripts will be retrieved suggesting that a good next action would be to enter their password into a password box. The Prior Action Script Similarity feature enables TrailBlazer to respond to relevant sub-tasks (Figure 6.7-3).

The motivation for the Task Script Similarity and Prior Action Script Similarity features is that if TrailBlazer can find steps in existing scripts similar to either the task description or an action previously completed by the user, then subsequent steps in that script should be predictive of future actions. The scores assigned to each step are, therefore, fed forward to other script steps so that those steps are weighted more highly. All tasks implicitly start with a goto step specifying the page on which the user first requests suggestions, so a prior action always exists. The process used is similar to spreading activation, a method used to connect semantically-related elements represented in a tree structure [33]. The added value from a prior step decreases exponentially with each subsequent step, meaning that primarily the steps following close after highly-weighted steps benefit.

To compute these features, TrailBlazer first finds a set of related scripts S by sending either the task description or the user's prior action as a query to the CoScripter repository. TrailBlazer then derives a weight for each of the steps contained in each related script. Each script s contains a sequential list of natural language steps (Figure 6.1). The weight of each script's first step is set to $VC(s_0, query)$, the vector cosine between the first step and the query as described earlier. TrailBlazer computes the weight of each subsequent step as follows:

$$W(s_i) = w \cdot W(s_{i-1}) + VC(s_i, query) \qquad (6.2)$$

TrailBlazer currently uses w = 0.3, which has worked well in practice. The fractional inclusion of the weight of prior steps serves to feed their weight forward to later steps. Next, TrailBlazer constructs a weighted sentence sent_S of all the words contained within S. The weight of each word is set to the sum of the computed weights $W(s_i)$ of each step in which the word is contained. The final feature value is the word-vector cosine between vectors formed from the words in sent_S and the query. Importantly, although the features constructed in this way do not explicitly consider action types, the labels assigned to page elements, or the types of page elements, all are implicitly included because they are included in the natural language CoScripter steps.
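Putting Equation 6.2 together with the weighted-sentence construction, the feature might be computed as in the following sketch, which reuses the wordVector and vectorCosine helpers sketched above; scripts is assumed to be the set of retrieved scripts, each an array of natural language steps.

```javascript
// A sketch of the feed-forward step weighting of Equation 6.2 and the
// weighted-sentence feature built from it.
const W = 0.3;  // fraction of a step's weight fed forward to the next step

function scriptSimilarityFeature(scripts, query) {
  const queryVector = wordVector(query);
  const weightedWords = {};
  for (const steps of scripts) {
    let prior = 0;
    for (const step of steps) {
      // W(s_i) = w * W(s_{i-1}) + VC(s_i, query); with prior = 0 this
      // reduces to VC(s_0, query) for the first step.
      const weight = W * prior + vectorCosine(wordVector(step), queryVector);
      for (const word in wordVector(step)) {
        // Each word's weight is the sum of the weights of the steps
        // containing it, forming the weighted sentence sent_S.
        weightedWords[word] = (weightedWords[word] || 0) + weight;
      }
      prior = weight;
    }
  }
  // The feature value is the cosine between sent_S and the query.
  return vectorCosine(weightedWords, queryVector);
}
```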
6.6.6 Presenting Suggestions to Users

Once the values of all of the features are computed and all potential actions are ranked, the most highly-ranked actions are presented to the user as suggestions. The suggestions are integrated into the accessible guide interface outlined earlier. TrailBlazer provides five suggestions, displayed in the interface in rank order (Figure 6.8).

Figure 6.8: Suggestions are presented to users within the page context, inserted into the DOM of the web page following the last element with which they interacted. In this case, the user has just entered "105" into the "Flight Number" textbox and TrailBlazer recommends clicking on the "Check" button as its first suggestion.

The suggestions are inserted into the DOM immediately following the target of the prior command, making them appear to non-visual users to come immediately after the step that they just completed. This continues the convenient non-visual interface design used in TrailBlazer for script playback. Users are directed to the suggestions just as they would be directed to the next action in a pre-existing script. Just as with pre-defined actions, users can choose to review the suggestions or to skip past them if they prefer, representing a hallmark of mixed-initiative design [59]. Because the suggestions are contained within a single listbox, moving past them requires only one keystroke. Future user studies will seek to answer questions about how best to present suggestions to users, how many suggestions should be presented, and how the system's confidence in its suggestions might be conveyed by the user interface.

6.7 Evaluation of Suggestions

We evaluated TrailBlazer by testing its ability to accurately suggest the correct next action while being used to complete 15 tasks. The chosen tasks represented the 15 most popular scripts in the CoScripter repository according to the number of people who have run them. The scripts contained a total of 102 steps, with an average of 6.8 steps per script (SD=3.1). None of the scripts included in the test set were included when training the model. Using existing scripts to test TrailBlazer provided two advantages: first, the scripts represented a natural ground truth to which we could compare TrailBlazer's suggestions; second, each provided a short title that we could use as the user's task description for purposes of testing. The provided titles were relatively short, averaging 5.1 words per title. Because users provided these titles, the authors believe it is reasonable to assume that users of TrailBlazer could provide similar task descriptions.

On the 15 tasks in this study, TrailBlazer listed the correct next action as its top suggestion in 41.4% of cases and within its top 5 suggestions in 75.9% of cases (Figure 6.9). Predicting the next action correctly can dramatically reduce the number of elements that users need to consider when completing tasks on the web. The average number of possible actions per step was 41.8 (SD=37.9), meaning that choosing the correct action by chance has a probability of only 2.3%. TrailBlazer's suggestions could help users avoid a long, linear search over these possibilities.

6.7.1 Discussion

TrailBlazer was able to suggest the correct next action among its top 5 suggestions in 75.9% of cases. The current interface enables users to review these 5 choices quickly, so that in these cases users will not need to search the entire page in order to complete the action - TrailBlazer will lead them right there. Furthermore, in the 24.1% of cases in which TrailBlazer did not make an accurate suggestion, users can continue completing their tasks as they would have without TrailBlazer. Future studies will look at the effect on users of incorrect suggestions and how we might mitigate these problems.

Figure 6.9: The fraction of the time that the correct action appeared among the top suggestions provided by TrailBlazer for varying numbers of suggestions provided (1 through 10). The correct suggestion was listed first in 41.4% of cases and within the top 5 in 75.9% of cases.

TrailBlazer had difficulties making correct suggestions when the steps of the procedure seemed arbitrary.
For instance, the script for changing one's Emergency Contact information begins by clicking through links titled "Career and Life" and "About me - personal" on pages on which nearly 150 different actions could be made. Because no similar scripts existed, no features indicated that these steps should be selected. Later, the script included the step "click the 'Emergency contacts' link," which the system recommended as its first choice. These problems illustrate the importance of having scripts in the first place, and are indicative of the improvements possible as additional scripts and other knowledge are incorporated.

Fortunately, TrailBlazer is able to quickly recover upon making an error. For instance, if it does not suggest an action involving a particular form and the user completes an action on the form anyway, TrailBlazer is likely to recover on the next suggestion. This is particularly true with form elements because of the features created specifically for this case (features 5 and 6 in Figure 6.7), but TrailBlazer is often able to recover in more general cases. TrailBlazer benefits greatly from its design as a guide to a human who can occasionally correct its suggestions, and its operators also benefit from the efficiency gains when it is correct.

6.8 Summary and Future Directions

This chapter introduced TrailBlazer, an accessible interface to how-to knowledge that helps blind users complete web-based tasks faster and more effectively by guiding them through a task step-by-step. By directing the user's attention to the right places on the page and by providing accessible shortcut keys, TrailBlazer enables users to follow existing how-to instructions quickly and easily. A formative evaluation of the system revealed that users were positive about the system, but that the lack of how-to scripts could be a barrier to use. We extended the TrailBlazer system to dynamically suggest possible next steps based on a short description of the desired task, the user's previous behavior, and a repository of existing scripts. The user's next step was contained within the top 5 suggestions 75.9% of the time, showing that TrailBlazer is successfully able to guide users through new tasks. TrailBlazer can accurately predict the next step that users are likely to want to perform based on an initial action, and it provides an accessible interface enabling blind users to leverage those suggestions. A next step will be user studies with blind web users to study how they use the system and to discover improvements to the suggestion interface.

Interfaces for non-visual access could benefit from moving toward task-level assistance of the kind exposed by TrailBlazer. Current interfaces too often focus on either low-level annotations or on deciding beforehand what users will want to do, taking away their control. The need to inform potential users of TrailBlazer and encourage them to install it can limit its potential impact. In the next chapter (Chapter 7), we discuss WebAnywhere, which improves not only the availability of web access but also the availability of new tools such as TrailBlazer. Chapter 8 discusses in more depth the implications of the WebAnywhere delivery model, which promises to make it easier for users to gain access to tools like TrailBlazer.

Chapter 7

IMPROVING THE AVAILABILITY OF WEB ACCESS WITH WEBANYWHERE

WebAnywhere is a web application that enables blind web users to access the web on any computer that is available.
WebAnywhere requires no new software to be installed and, as a web application, users always have the most recent version when they visit the site. WebAnywhere is easily extensible and open source, enabling researchers to easily try out new ideas. The infrastructure supports remote user studies, so researchers can also use it as part of empirical evaluations. This chapter first motivates WebAnywhere and describes its architecture. An evaluation of WebAnywhere showing that blind participants can use it to perform common tasks is presented, followed by an automated performance study under different conditions to explore the limitations of this approach.

7.1 Introduction & Motivation

WebAnywhere is a web-based screen reader that can be used by blind individuals to access the web from any computer with web access and audio output. Other screen readers are expensive, costing nearly a thousand dollars for each installation because of their complexity, relatively small market, and high support costs. Development of these programs is complex because the interface with each supported program must be deciphered independently. As a result of this expense, screen readers are not installed on most computers, leaving blind users on the go unable to access the web from computers that they happen to have access to, and leaving many blind users who cannot afford a screen reader unable to access the web at all.

Even blind web users who have a screen reader installed at home or at work cannot access the web everywhere that sighted people can. From terminals in public libraries to the local gym, from coffee shops to pay-per-use computers at the airport, from a friend's laptop to a school laboratory (Figure 7.1), computers are used for a variety of useful tasks that most of us take for granted, such as checking our email, viewing the bus schedule, or finding a restaurant. Few would argue that the ease of use of web mail or document editors has surpassed their desktop analogs, but their popularity is increasing, indicating the rising importance of accessing the web from wherever someone happens to be.

Figure 7.1: People often use computers besides their own, such as computers in university labs, library kiosks, or friends' laptops.

The cost of screen readers is one problem that leads to screen readers not being installed where people need them. Libraries are known to provide an invaluable connection to web information for those with low incomes [13], and, while almost all libraries in the United States today provide Internet access, many do not provide screen readers because of their expense. Even when they do, screen readers are provided on a limited number of computers. Most blind people with low incomes cannot afford a computer and a screen reader of their own, and governmental assistance in receiving one (in countries where such programs exist) is often tied to a demonstrated need because of school or employment, as it is in the United States. Unemployed blind people often miss out on such services even though they could potentially benefit the most. In countries where these programs do not exist, the problem may be worse.

For blind users unable to afford a full screen reader, WebAnywhere might serve as a temporary alternative. Voice output while navigating through a web page can also be beneficial for people who have low vision or dyslexia. Web developers have been shown to produce more accessible content when they have access to a screen reader [85], but may be deterred from using one by the expense and the hassle of installing one. All of these users could use WebAnywhere for free from anywhere.
Figure 7.2: WebAnywhere is a self-voicing web browser inside a web browser. The browser frame replicates browser functionality and provides a screen-reading interface to both web content and browser functions; the content frame loads web content via a proxy server, and the browser frame speaks the content loaded there.

7.1.1 Using WebAnywhere

When users open WebAnywhere's homepage, it speaks the contents of the page that is currently loaded (initially a welcome page). WebAnywhere voices both its interface and the content of the current web page. Users can navigate from this starting page to any other web page, and the WebAnywhere interface will speak those pages to the user as well. No separate software needs to be downloaded or run by the user; the system runs entirely as a web application with minimal permissions.

WebAnywhere replicates much of the functionality of existing screen readers for enabling interaction with the web. WebAnywhere traverses the DOM of each loaded web page using a pre-order Depth First Search (DFS), which is approximately top-to-bottom, left-to-right in the visual page. As the system's cursor reaches each element of the DOM, it speaks the element's textual representation to the user (Figure 7.2). For instance, upon reaching a link containing the text "Google," it will speak "Link Google." Users can follow the link by pressing enter. They can also skip forward or backward in this content by sentence, word, or character. Users can also skip through page content by headings, input elements, links, paragraphs, and tables. In input fields, the system speaks the characters typed by the user and allows them to review what they have typed.
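The reading order can be sketched as an explicit pre-order depth-first traversal; speak() stands in for requesting and playing a speech sound, and the element-to-text rules shown are a small illustrative subset of those the system actually uses.

```javascript
// A sketch of the pre-order depth-first search defining WebAnywhere's
// reading order.
function readPage(root, speak) {
  const stack = [root];
  while (stack.length > 0) {
    const node = stack.pop();
    const text = textualRepresentation(node);
    if (text) speak(text);
    // Push children in reverse so the leftmost child is visited first,
    // giving the approximate top-to-bottom, left-to-right reading order.
    for (let i = node.children.length - 1; i >= 0; i--) {
      stack.push(node.children[i]);
    }
  }
}

function textualRepresentation(element) {
  // Two of the system's many element rules, shown for illustration.
  if (element.nodeName === 'A') return 'Link ' + element.textContent.trim();
  if (element.nodeName === 'H1') return 'Heading 1 ' + element.textContent.trim();
  return element.childElementCount === 0 ? element.textContent.trim() : '';
}

// readPage(document.body, text => console.log(text));
```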
As a web application running with minimal permissions, WebAnywhere does not have access to the interface of the browser. It instead replicates the functionality already provided, such as providing its own address bar where the URL of the current page is displayed and where users can type to navigate to a new page (Figure 7.2). We used the following design goals while developing WebAnywhere in order to ensure that it would be accessible and usable on most computers:

1. WebAnywhere should replicate the functionality and security provided by traditional screen readers when used to browse the web.

2. In order to work on as many web-enabled devices and public computer terminals as possible, WebAnywhere should not require special software or permissions on the host machine in order to run.

3. WebAnywhere should be usable: users should not be substantially affected by or aware of the engineering decisions made in order to satisfy the first two goals.

7.2 Related Work

Section 2.3 presents an overview of different systems impacting the availability of access for blind people. Prior solutions have all required users either to have permission to run or install software on their machine or to carry around a specialized device. Serotek provides the products most closely related to WebAnywhere. The Serotek System Access Mobile (SAM) [119] screen reader is designed to run directly from a USB key on Windows computers without prior installation. It is available for $500 US and still requires access to a USB port and permission to run arbitrary executables.

The Serotek System Access to Go (SA-to-Go) screen reader can be downloaded from the web via a speech-enabled web page, but the program requires Windows, Internet Explorer, and permission to run executables on the computer. This product has recently been made freely available by the AIR Foundation [2], who also provide a self-voicing Flash interface for downloading and running the screen reader. Starting this system requires downloading more than 10 MB of data, compared to WebAnywhere's 100 kB. WebAnywhere, therefore, may be more appropriate for quickly checking email or looking up information, even on systems where SA-to-Go is able to run.

A small number of web pages voice their own content, but the scope of information or access is limited. Talklets enable web developers to provide a spoken version of their web pages as a single file that users cannot control [123]. Scribd.com provides an unvoiced interface for converting documents to speech [117], but the speech is available only as a single MP3 file that does not support interactive navigation. The National Association for the Blind, India, provides access to a portion of their web content via a self-voicing Flash movie that provides keyboard commands for navigation [92], but the information contained in the separate Flash movie does not cover the entire web site. The web information that can be accessed by WebAnywhere is not limited in this way.

7.3 Public Computer Terminals

The potential of WebAnywhere to provide access everywhere is limited by the capabilities of public computer terminals. In the United States, nearly all public libraries provide Internet access to their patrons, and 63% of these libraries provide high speed connections [13]. As of 2006, 22.4% of libraries specifically provided Internet-based video content and 32.8% provided Internet-based audio content [13]. Other public terminals likely have different capabilities. The aspects of public terminals explored here will also determine the ability of other web applications to enable self-voicing technology, and this analysis is, therefore, generally applicable beyond WebAnywhere.

We surveyed 15 public computer terminals available in Seattle, WA, to get an idea of their technical capabilities and the environments in which they are located. We visited computers in the following locations: 5 libraries, 3 Internet cafes, 3 university and community college labs, 2 business centers, a gym, and a retirement community. The computers were all managed by different groups; for instance, we visited only one Seattle public library. Although we considered only public terminals located in Seattle, we found public terminals with diverse capabilities, suggesting that our results may generalize. For instance, while most ran Windows XP as their operating system, two ran Macintosh OS X, and one ran Linux. The terminals were designed for different use cases. Several assumed users would stand while accessing them, one was used primarily as a print station, and many appeared to be several years old.

Figure 7.3: Survey of public computer terminals by category (Internet cafes, kiosks, libraries, and other), indicating that WebAnywhere can run on most of them.
Figure 7.3 summarizes the results of this survey. Most computers tested were capable of running WebAnywhere, and in 12 of the locations someone was available in an official capacity to assist users. In all locations, people (including employees) were nearby and could also have assisted users. The most notable restriction to access that we found was that only 2 locations (a library and an Internet cafe) provided headphones for users. However, in all of the libraries that we visited, we were able to ask to use headphones. Bringing headphones seems like a reasonable requirement given that headphones are inexpensive and many people already carry them for listening to their music players and other devices. In the 5 locations that did not provide headphones, speakers were available. Using speakers is not ideal because it renders a user's private browsing public and could be bothersome to others, but at least in these locations users could access the web without needing to bring anything with them. One location was restricted to using the embedded sound player, which suggests that, while supporting the embedded player could help in some cases, access on most computers could probably be achieved using only the Flash player to play sound. On 14 out of 15 systems, blind users could potentially access web content, even though none of the computers had a screen reader installed on them.

7.4 The WebAnywhere System

WebAnywhere is designed to function on any computer with web access and the ability to play sound. Its design has carefully considered independence from any particular web browser or plugin. To facilitate its use on public systems on which users may not have permission to install new software, functionality that would require such permission has been moved to a remote server. WebAnywhere can play sound using several different sound players commonly available in web browsers. The system consists of the following three components (Figure 7.5): (i) client-side Javascript, which supports user interaction, decides which sounds to play, and interfaces with sound players to play them; (ii) server-side text-to-speech generation and caching; and (iii) a server-side transformation proxy that makes web pages appear to come from a local server in order to overcome violations of the same-origin security policy enforced by most web browsers. WebAnywhere consists of less than 100 kB of data in four files, and that is all a user needs to download to begin using the system.

7.4.1 WebAnywhere Script

The client-side portion of WebAnywhere forms a self-voicing web browser that can be run inside an existing web browser (Figure 7.4). The system is written in cross-browser Javascript that is downloaded from the server, allowing it to be run in most modern web browsers, including Firefox, Internet Explorer, and Safari. The system captures all key events, allowing it both to provide a rich set of keyboard commands like those users are accustomed to in their usual screen readers and to maintain control of the browser window. WebAnywhere's use of Javascript to capture a rich set of user interactions is similar to that of UsaProxy [9] and Google Analytics [50], which are used to gather web usage statistics.
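A minimal sketch of this kind of key capture follows; the shortcut table is illustrative, and handleCommand stands in for WebAnywhere's actual command dispatch.

```javascript
// A sketch of capturing keyboard shortcuts in a web application, in the
// spirit of WebAnywhere's key handling.
const SHORTCUTS = {
  'ctrl+h': 'next-heading',
  'ctrl+i': 'next-input',
  'ctrl+l': 'focus-location-bar',
};

function handleCommand(command) {
  // Stand-in for dispatching to the screen reader's commands.
  console.log('command:', command);
}

document.addEventListener('keydown', function (event) {
  const combo = (event.ctrlKey ? 'ctrl+' : '') + event.key.toLowerCase();
  if (combo in SHORTCUTS) {
    // Suppressing the default behavior is what lets the web application
    // keep control of the browser window.
    event.preventDefault();
    handleCommand(SHORTCUTS[combo]);
  }
}, true);
```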
Web pages viewed in the system are loaded through a modified version of the web-based proxy PHProxy [6]. This enables the system to bypass the same-origin policy that prevents scripts from accessing content loaded from other domains. Without this step, the scripts retrieved from WebAnywhere's domain would not be able to traverse the DOM of pages that are retrieved, for instance, from google.com. Deliberately bypassing the same-origin policy can introduce security concerns, which we address in Section 7.9.

Figure 7.4: Browsing the ICWE 2008 homepage with the WebAnywhere self-voicing, web-browsing web application. Users use the keyboard to interact with WebAnywhere like they would with their own screen readers. Here, the user has pressed the TAB key to skip to the next focusable element, and CTRL+h to skip to the next heading element. Both web content and interfaces are voiced to enable blind web users access.

7.4.2 Producing and Playing Speech

Speech is produced on separate speech servers. Our current system uses the free Festival Text-to-Speech System [125] because it is distributed along with the Fedora Linux distribution on which the rest of the system runs. The sounds produced by the Festival Text-to-Speech (TTS) system are converted server-side to the MP3 format because this format can be played by most sound players already available in browsers and because it creates the small files necessary for achieving our goal of low latency. For example, the sound "Welcome to WebAnywhere," played when the system loads, is approximately 10 kB, while the single letter "t," played when users type the letter, is 3 kB. See Figure 7.4 for more examples of how the speech sounds in WebAnywhere are generated and used.

Figure 7.5: The WebAnywhere system consists of server-side components that convert text to speech and proxy web content, and client-side components that provide the user interface, coordinate which speech will be played, and play the speech. Users interact with the system using the keyboard.

Sounds are cached on the server so they do not need to be generated again by the TTS service, and on the client as part of the existing browser cache. For the most efficient caching, WebAnywhere would generate a separate sound for each word, but this results in choppy-sounding speech. Another option would be to generate a single speech sound for an entire web page, but this would prevent the system from providing its rich interface. Sound players already installed in the browser do not support jumping to arbitrary places in a sound file, as would be required when a user decides to skip to the middle of the page. Instead, the system generates a separate sound for each phrase, and the WebAnywhere script coordinates which sound to play based on user input.

The system primarily uses the SoundManager 2 Flash object [115] for playing sound. This Flash object provides a Javascript bridge between the WebAnywhere Javascript code and the Flash sound player.
It provides an event-based API for sound playback that includes an onfinish event signaling when a sound has finished playing. It is also able to begin playing sounds before they have finished downloading using streaming, which results in lower perceived latency. Adobe reports that version 8 or later of the Flash player, required for SoundManager 2, is installed on 98.5% of computers [3].

To enable the system to operate on more systems, we have also developed our own Javascript API for playing speech using embedded players, such as QuickTime or the Windows Media Player. The existing API for controlling embedded players is limited and makes precise reaction to sounds that are being played difficult. While programmatic methods exist for playing and stopping audio files, they do not implement an onfinish event that would provide a programmatic method for determining when a sound has finished playing. WebAnywhere relies on this information to tell it when it should start playing the next sound as a user reads through a page. We initially required users to manually advance sounds, but this proved cumbersome and caused the system to behave differently depending on which sound player was being used. It was also frustrating for users who wanted to read a large section of a page, as they have become accustomed to doing with their usual screen readers.

To enable WebAnywhere to simulate an onfinish event for embedded sound players, the TTS service includes a header in its HTTP responses that specifies the length of each speech sound. Before programmatically embedding each sound file into the page, WebAnywhere first issues an xmlHttpRequest for the file and records the length of the returned sound. WebAnywhere then sets a timer for slightly longer than the length of the sound and finally inserts an embedded player for the sound into the navigation frame. Because the sound has already been retrieved via the programmatic request, it is located in the cache and the embedded player can start playing it almost immediately; there is only a small delay for the embedded player to load and begin playing the sound. The timer serves as an onfinish event signaling that the sound has stopped playing and that the next sound in the queue should be played.
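The scheme can be sketched as follows; the X-Sound-Length response header name and the small padding added to the timer are assumed for illustration.

```javascript
// A sketch of simulating an onfinish event for embedded players. The TTS
// server is assumed to report each sound's duration in a response header.
function playWithSimulatedOnfinish(url, embedPlayer, onfinish) {
  const request = new XMLHttpRequest();
  request.open('GET', url);
  request.onload = function () {
    const seconds = parseFloat(request.getResponseHeader('X-Sound-Length'));
    // The file is now in the browser cache, so the embedded player can
    // begin playing it almost immediately.
    embedPlayer(url);
    // Fire slightly after the sound should have ended, standing in for
    // the onfinish event that embedded players do not provide.
    setTimeout(onfinish, (seconds + 0.25) * 1000);
  };
  request.send();
}
```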
7.5 User-Driven Design

WebAnywhere was designed in close consultation with blind users. These potential users of WebAnywhere have offered valuable feedback on its design. During preliminary evaluations with 3 blind web users, participants wanted to be able to customize the shortcut keys and other features of WebAnywhere in order to emulate their usual screen readers (either JAWS or Window-Eyes) [21]. To support this, WebAnywhere now includes user-configurable components that specify which key combinations to use for various functionality and the text that is read when specified actions are taken. Users can also choose to emulate their preferred screen reader (either JAWS or Window-Eyes) and their preferred browser (Internet Explorer or Firefox). For instance, Internet Explorer announces "Type the Internet address of a document or folder, and Internet Explorer will open it for you" when users move to the location bar, while users will hear "Location Bar Combo Box" if they are using Firefox. These settings can be changed using a web-based form and are saved for each individual user.

To load their personal settings, users first press a shortcut key. The system asks them to enter the name of their profile and press enter when they are finished; it then applies the appropriate settings. Users can create an account and edit their personal settings either using a screen reader to which they are already accustomed or using the default interface provided by WebAnywhere. Explanations of the available functionality and the initial keyboard shortcuts assigned to each are read when the system first loads. It is unclear to what extent frequent users will want to use personalized settings, but the option may ease the transition to WebAnywhere from another screen reader. In the future, personal profiles may also enable users to specify their preferred speech rate and interface language.

When browsing the web, screen readers use an off-screen model of each web page that is encountered, which results in the screen reader exposing two different, complementary but incomplete, modes of operation to the user. Because WebAnywhere accesses the DOM of the web page directly, it does not need a separate forms mode. This has the advantage of immediately updating when content in the page dynamically changes. Traditional screen readers must be manually toggled between a "forms mode" and a "browse mode" (names for these modes differ across screen readers; these are the names used by JAWS) using a specific control key. Even though this is cumbersome and unnecessary in WebAnywhere, its absence caused confusion for users who were accustomed to it, and so this behavior can be emulated in WebAnywhere.

Our consultants felt that the main limitation of the system was the limited functionality provided for skipping through page content relative to other screen readers. Users felt that WebAnywhere most needed the following two features:

• Skipping functionality - Users have become accustomed to the rich array of shortcuts designed to enable them to skip through content. Our initial system only supported skipping by focusable element using the TAB key, but most screen readers provide shortcuts to skip through web content by heading, by input element, by paragraph, and by link.

• Find feature - Users wanted to be able to search for text in the page using the familiar find functionality provided by most browsers, which has been shown to be valuable for blind web users [35]. The existing find functionality in the web browser is not accessible using WebAnywhere because, as a web application, WebAnywhere can only access the elements in the web pages it loads.

We used these preliminary evaluations to direct the priorities for development of WebAnywhere, and the system now includes these features. Individuals also wanted a variety of other functionality available in existing screen readers, such as the ability to spell out the word that was just spoken, to speak user-defined pronunciations for words, and to specify the speech rate. These can be implemented in future versions of WebAnywhere.

7.5.1 Reaching WebAnywhere

Before using WebAnywhere, blind users must first navigate to the WebAnywhere web page. Blind users have proven adept at using computers to start a screen reader when one is not yet running; for instance, some screen readers have required users to log in to the operating system without the screen reader running. Existing solutions, such as the Serotek System Access Mobile [119], share this requirement and are still used. In most locations, blind users can ask for assistance in navigating to the WebAnywhere page and then browse independently.
This issue is explored further in our survey of public terminals presented in Section 7.3. Windows also provides a limited voice interface that could be used to reach WebAnywhere: Windows Narrator can be started using a standard shortcut key and can voice the run dialog, enabling users to navigate to the WebAnywhere URL. Windows Narrator's rudimentary support for browsing the web is not sufficient for web access, but it would enable users to open the WebAnywhere web page.

Figure 7.6: Selected shortcut functionality provided by WebAnywhere and the default keys assigned to each. Navigation: CTRL+l focuses the location text field, and CTRL+f focuses the finder text field. Reading granularity: dedicated key pairs read the next or previous element, word (with SHIFT), and character. Skipping: TAB skips to the next focusable element, CTRL+h to the next heading, CTRL+i to the next input element, CTRL+r to the next table row, and CTRL+d to the next table column; pressing SHIFT in combination with these keys reverses the skipping direction. The system implements functionality for more than 30 different shortcut keys, and users can customize the keys assigned to each.

7.6 User Evaluation

In order to investigate the usability and perceived value of WebAnywhere, we conducted a study with 8 blind users (4 female) ranging in age from 18 to 51. Participants represented a diversity of screen reader users, including both students and professionals, involved in fields ranging from the sciences and law to journalism and sales. Their self-described screen-reader expertise varied from beginner to expert. Similarly, their experience with the methods described in Section 2.3 for accessing the web when away from their own computers varied considerably. Participants were compensated with $15 US for their participation in our study, and none had participated in earlier stages of the development of WebAnywhere.

Two of our participants were located remotely. Remote user studies can be particularly appropriate for users with disabilities, for whom it may be difficult or costly to conduct in-person studies. Such studies have been shown to yield similar quantitative results, although they risk collecting less-informative qualitative feedback [101]. We conducted interviews with remote participants to gather valuable qualitative feedback.

We examined (i) the effects of technological differences between our remote screen reader and a local one, and (ii) the likelihood that participants would use WebAnywhere in the future. While we did not explicitly compare the screen-reading interface with existing screen readers, all of our participants had previously used a screen reader and many of their comments were made with respect to this experience.

In this evaluation, participants were first introduced to the system and then asked to browse the WebAnywhere homepage using it. They then independently completed the following four tasks: searching Google to find the phone number of a local restaurant, finding when the next bus would be arriving at a bus stop, checking a Gmail email account, and completing a survey about their experience using WebAnywhere. Gmail.com and google.com were frequently visited by blind participants in our WebinSitu study (Chapter 3), and mybus.org is a popular site for checking the bus in Seattle, where most of our evaluators live. The authors feel that these tasks are representative of those that a screen reader user who is away from their primary computer may want to perform.
We did not test the system with blind individuals who would like to learn to use a screen reader but cannot afford one. Using a screen reader efficiently requires practice, and our expectation is that if current screen reader users can use WebAnywhere, then others could likely learn to use it as well.

Task 1: Restaurant Phone Number on Google

Participants were asked to find the phone number of the Volterra Italian Restaurant in Seattle by using google.com. Participants were told to search for the phrase "Volterra Seattle." The phone number of the restaurant can be found on the Google results page, although some participants did not notice this and, instead, found the number on the restaurant's home page. This task represented an example of users wanting to quickly find information on the go.

Task 2: Gmail

Participants checked a web-based email account and located and read a message with the subject "Important Information." Participants first navigated to the gmail.com homepage and entered a provided username and password into the appropriate fields. They next found the message and then read its text. This task involved navigating the complex pages of gmail.com, which include a lot of information and large tables.

Task 3: Bus Schedule

Participants found when the 48 bus would next be arriving at the intersection of 15th Ave and 65th St using the real-time bus tracking web page mybus.org. Participants first navigated to the mybus.org homepage, entered the bus number into a text input field, and clicked the submit button. This brought them to a page with a large list of links consisting of all of the stops for which information is available for that bus. Participants needed to find the correct stop among these links and then navigate to its results. This task also included navigating through large tables of information.

Task 4: WebAnywhere Survey

The final task asked participants to complete a survey about their experiences using WebAnywhere. They completed the survey using the WebAnywhere screen reader itself. This task involved completing a web-based survey that consisted of eleven statements and associated selection boxes that allowed them to specify to what extent they agreed or disagreed with each statement on a Likert scale. The survey also included a text area for additional, free-form comments. For this task, the researchers left the room, and so participants completed the task completely independently.

7.6.1 Study Results

All participants were able to complete the four tasks. Most users were not familiar with the pages used in the study and found it tedious to find the information of interest without already knowing the structure of the page. However, most noted that this experience was not unlike using their usual screen reader to access a page that they had not accessed before. Some participants noted functionality available in their current screen reader that would have helped them complete the tasks more quickly. For example, the JAWS screen reader has a function for skipping to the main content area of a page. Participants who were already familiar with the web pages used to complete the tasks in the study were, unsurprisingly, faster at completing those tasks. For instance, several participants were frequent Gmail users and leveraged their knowledge of the page's structure to quickly jump to the inbox.
In that example, skipping through the page using a combination of the next-heading and next-input-element shortcut keys is an efficient way to reach the messages in the inbox.

Figure 7.7 summarizes the results of the survey participants took as part of task 4. Participants all felt that WebAnywhere was a bit tedious to use, although many mentioned in a post-study interview that it was only slightly more tedious than the screen reader to which they are accustomed. Most agreed that mobile technology for accessing the web is expensive, and most find themselves in situations where a computer is available but they cannot access it because a screen reader is not installed on it. The main exception was a skilled computer user who carries a portable version of the NVDA screen reader with him on a USB key. He was uniformly negative about WebAnywhere because it provided an inferior experience relative to NVDA and his solution works on the computers that he has tried. Most of our participants could see themselves using the system when it is released.

Figure 7.7: Participant responses to the WebAnywhere survey, indicating that they saw a need for a low-cost, highly-available screen-reading solution (7, 8, 9) and thought that WebAnywhere could provide it (3, 4, 6, 10). Participants rated each statement from 1 (strongly disagree) to 5 (strongly agree). The statements were: (1) WebAnywhere is difficult to use; (2) WebAnywhere is tedious to use; (3) I could use WebAnywhere to access the web; (4) I could access the web from more locations using WebAnywhere; (5) with practice, I could independently access the web with WebAnywhere; (6) I would use WebAnywhere to access the web from computers lacking screen readers; (7) technologies enabling mobile access are expensive; (8) other tools would provide access in as many locations as this tool; (9) I often find myself where a computer is available but it lacks a screen reader; (10) I would use WebAnywhere if no other screen reader is available; and (11) WebAnywhere would be useful for someone unable to afford another screen reader. Full results are available in Appendix A.

7.6.2 Discussion

Participants felt that WebAnywhere would be adequate for accessing the web, although none were prepared to give up their normal screen reader to use it instead. This was the expected and desired result. One participant remarked that the system would be useful for providing access when he is visiting a relative and would not be comfortable installing new software. Participants who completed our study after the release of the free Serotek tool said that they could see themselves using both tools depending on their situation. For instance, if they only needed to find a phone number or an email, they would probably use WebAnywhere because it does not take as long to load. They would also use WebAnywhere if they were on a machine on which SA-to-Go would not run. SA-to-Go remains an important option because it can be used to access applications other than the web.

Participants did not mention the latency of the system as a major concern, but some were confused when a sound or web page took a while to load because, during that period, WebAnywhere was silent. We can address this by having the system periodically say "content loading, X%" while content is loading.
One participant mentioned that the latency of the speech repeated to him while typing was bothersome. We have improved this by prefetching the speech representation of the letters and punctuation that result when users type. Others noted that errors in the current implementation occasionally produced incorrect effects and that WebAnywhere lacks some of the functionality of existing screen readers. Many of these shortcomings have already been addressed in the current version of WebAnywhere. Most importantly, participants successfully accessed the web using WebAnywhere; future versions will seek to further improve the user experience and functionality.

7.7 Reducing Latency

WebAnywhere uses remote text-to-speech conversion, and the latency of requesting, generating, and retrieving speech could potentially disrupt the user experience. Because the sound players used in WebAnywhere are able to play sounds soon after they begin downloading, latency is low on high-speed connections but can be noticeable on slower connections. The latency of retrieving speech is an important factor in the system because it directly determines the user-perceived latency of the system. When a user hits a key, they know that the system has responded only when the sound that should be played in response to that key press is played.

Figure 7.8: Caching and prefetching on the server and client improve latency. Speech is served from the fast browser cache (megabytes) when possible, then from the slower server cache (gigabytes), and only as a last resort from the slow text-to-speech server.

To reduce the perceived latency of retrieving speech, the system aggressively caches the speech generated by the TTS service on both the server and the client. In order to increase the likelihood that the speech a user wants played is in the cache when they want to play it, the system can use several different prefetching strategies designed to prime these caches. Prefetching has been explored before as a way to reduce web latency [98] and has been shown to dramatically reduce the latency of general web browsing [72]. Traditional screen readers running as processes on the client machine do not require prefetching because generating and playing sound locally has low latency. Web applications have long used prefetching as a mechanism for reducing the delay introduced by network latency. For instance, web applications that use dynamic images use Javascript to preload those images so that changes appear more responsive. The prefetching and caching strategies explored in WebAnywhere may also be useful for visually rich web applications.

7.7.1 Caching

The system uses caching on both the server and the client browser in order to reduce the perceived latency of retrieving speech. TTS conversion is a relatively processor-intensive task. To reduce how often speech must be generated from scratch, WebAnywhere stores the speech that is generated on a hard disk on the server. While hard disk seek times can be slow, their latency is low compared with the cost of generating the speech again. The speech that is retrieved by the client is cached on the client machine by the browser. Most browsers maintain their own caches of files retrieved from the web, and an unprivileged web application such as WebAnywhere does not have permission to directly set either the size of the cache or the cache replacement policy; WebAnywhere does not attempt to do so. Flash uses the regular browser cache. Both the Internet Explorer and Firefox disk caches default to 50 megabytes, which can hold a large number of the relatively small MP3 files used to represent speech. The performance implications of these caching strategies are explored in Section 7.8.1.
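The server-side half of this caching can be sketched as follows, in Node-style Javascript for consistency with the other examples (the actual server is implemented differently). The flow is the same: hash the text, serve a cached MP3 when one exists, and otherwise run the TTS engine and store the result.

```javascript
// A sketch of the server-side speech cache; paths and the ttsToMp3 hook
// (e.g., a call out to Festival plus MP3 conversion) are illustrative.
const crypto = require('crypto');
const fs = require('fs');

function speechFor(text, ttsToMp3 /* text -> MP3 Buffer */) {
  const key = crypto.createHash('sha1').update(text).digest('hex');
  const path = '/var/cache/speech/' + key + '.mp3';
  if (fs.existsSync(path)) {
    return fs.readFileSync(path);   // a disk seek is cheap compared to TTS
  }
  const mp3 = ttsToMp3(text);       // the expensive text-to-speech step
  fs.writeFileSync(path, mp3);      // prime the cache for future requests
  return mp3;
}
```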
7.7.2 Prefetching Speech

The goal of prefetching in WebAnywhere is to determine what text the user is likely to want played as speech in the future, and to increase the likelihood that the corresponding speech sounds are in the web browser's cache by the time the user wants them played. The browser cache is stored in a combination of memory and hard disk, and retrieving sounds from it is a very low-latency operation relative to retrieving sounds from the remote WebAnywhere server. The distribution of requested speech sounds is expected to be Zipf-like [26], meaning that the most popular sounds are likely to already be in the cache, but there is a long tail of speech sounds that have not been generated before.

All prefetching strategies add strings to a priority queue, which a separate prefetching thread uses to prioritize which strings should be converted to speech. We explored several different strategies for deciding which strings should be given the highest prefetching priority; the function of each strategy is to determine the priority that should be assigned to each string. To prefetch speech sounds, the prefetching thread issues an xmlHttpRequest for the speech sound (an MP3) representing each string in its queue. This populates the browser's local cache, so that when the sound is requested later, it is retrieved quickly from the cache. We next present the prefetching strategies that we have implemented; Section 7.8.3 presents a comparison of these strategies.

DFS Prefetching

WebAnywhere and other screen readers traverse the DOM using a pre-order Depth First Search (DFS). The basic prefetching mode of the WebAnywhere system performs a separate DFS of the DOM that inserts the text that will be spoken for each node into the priority queue, with a weight corresponding to its order in the DFS traversal. This method retrieves speech sounds in the order in which users would reach them if they were to read through the page in the natural top-to-bottom, left-to-right order. If users normally read in this order and do not read through the page more quickly than prefetching can be performed, then this strategy should work well. However, blind web users are known for leveraging the large number of shortcut keys made available by their screen readers to skip around in web content [35, 17], so it is worthwhile to consider other strategies that may better address this usage.

DFS Prefetching + Position Update

The system could better adapt if it updated the nodes to be prefetched based on the node currently being read. This can prevent nodes that have already been skipped by the user from being prefetched and taking bandwidth that could otherwise be used to download sounds that are more likely to be played. The DFS+Update prefetching algorithm includes support for changing its position in prefetching. For example, if the prefetcher is working on content near the top of the page when a user skips to the middle of the page using the find functionality, the prefetcher updates its current position and continues prefetching at the new location. When the user skips ahead in the page, the priorities of elements in the queue are updated to reflect the new position. These updates make prefetching more likely to improve performance.
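A sketch of a prefetcher supporting both strategies follows; fetchSound stands in for the xmlHttpRequest that primes the browser cache, and the queue simply preserves DFS order while dropping phrases the user has already skipped past.

```javascript
// A sketch of the prefetching queue with position updates.
function makePrefetcher(fetchSound) {
  let queue = [];     // items: { text, order }, kept in DFS order

  return {
    // Basic DFS prefetching: queue every phrase in DFS traversal order.
    enqueuePage(phrasesInDfsOrder) {
      queue = phrasesInDfsOrder.map((text, order) => ({ text, order }));
    },
    // DFS + position update: when the user skips, drop phrases behind the
    // new position so bandwidth goes to sounds still likely to be played.
    updatePosition(dfsIndex) {
      queue = queue.filter(item => item.order >= dfsIndex);
    },
    // Called repeatedly by the prefetching loop; each fetch populates the
    // browser cache so later playback is low latency.
    prefetchNext() {
      const next = queue.shift();
      if (next) fetchSound(next.text);
    },
  };
}
```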
DFS Prefetching + Prediction

WebAnywhere also prefetches sounds based on a personalized, predictive model of user behavior. The set of shortcut keys supported by the system is rich, but users frequently employ only a few. Furthermore, users do not skip through the page randomly, meaning that the likelihood that a user will issue each keyboard shortcut can be inferred from factors such as the keys they previously pressed and the node currently being read. For instance, a user who has pressed TAB to move from one link to the next is more likely to press TAB again upon reaching another link than is a user who has been reading through the page sequentially. Similarly, some users may frequently skip forward by form element, while others may rarely do so.

To use such behavior patterns to improve the efficacy of prefetching, WebAnywhere records each user's interactions with the system and uses this empirical data to construct a predictive model. The model is used to predict which actions the user is most likely to take at each point, helping to direct the prefetcher to retrieve those sounds most likely to be played.

An action is defined as a shortcut key pressed by the user. WebAnywhere records the history of actions performed by the user along with the type of the node current at each action. The system distinguishes three node types: link, input element, and other. These types were chosen because they roughly align with the most popular actions currently implemented in the system and could be expanded in the future. Assuming that the next action depends only on prior observations of actions and visited nodes, the probability of the next action $action_i$ being $x$ is:

$$P(\mathit{action}_i = x \mid \mathit{node}_0, \ldots, \mathit{node}_i, \mathit{action}_0, \ldots, \mathit{action}_{i-1})$$

WebAnywhere uses the standard Markov assumption to estimate this probability by looking back only one time step [112]. Therefore, the probability that the user's next action is $x$ given the type of the current node and the user's previous action can be expressed as:

$$P(\mathit{action}_i = x \mid \mathit{node}_i, \mathit{action}_{i-1})$$

All actions are initially assigned uniform probability. These probabilities are dynamically updated as the system is used, and sounds in the priority queue are weighted using them. Specifically, for each possible condition $w$ (a combination of previous action and type of the current node), a count $c_w(x)$ is maintained. Each count is initially set to 1 and is incremented by 1 when the event $x$ occurs in condition $w$, where an event $x$ is defined as the user taking a specified action while in a particular condition. The probability of each new action can then be calculated as:

$$P(\mathit{action}_i = x \mid \mathit{node}_i, \mathit{action}_{i-1} = w) = \frac{c_w(x)}{\sum_y c_w(y)}$$

WebAnywhere reweights nodes in the priority queue used for prefetching according to these probabilities. Section 7.8.2 presents an evaluation of the accuracy of predictive prefetching.
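A minimal sketch of this count-based predictor appears below (hypothetical names; the actual implementation differs). Initializing every count to 1 yields the uniform initial probabilities described above, and the probability computation implements $c_w(x)/\sum_y c_w(y)$ directly.

```javascript
// Minimal sketch of the count-based Markov predictor: the condition w is
// the pair (current node type, previous action).
function MarkovPredictor(actions) {
  this.actions = actions;  // e.g. ['next node', 'next focusable', ...]
  this.counts = {};        // counts[w][x] = c_w(x)
}

MarkovPredictor.prototype.key = function (nodeType, prevAction) {
  return nodeType + '|' + prevAction;
};

MarkovPredictor.prototype.observe = function (nodeType, prevAction, action) {
  var w = this.key(nodeType, prevAction);
  if (!this.counts[w]) {
    this.counts[w] = {};
    for (var i = 0; i < this.actions.length; i++) {
      this.counts[w][this.actions[i]] = 1;  // every count initialized to 1
    }
  }
  this.counts[w][action] += 1;
};

MarkovPredictor.prototype.probability = function (nodeType, prevAction, action) {
  var w = this.key(nodeType, prevAction);
  if (!this.counts[w]) return 1 / this.actions.length;  // uniform before data
  var total = 0;
  for (var x in this.counts[w]) total += this.counts[w][x];
  return this.counts[w][action] / total;  // c_w(x) / sum_y c_w(y)
};
```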
7.8 Evaluation

We evaluated WebAnywhere along several dimensions, including how caching improves the performance of the system, the load that the system can withstand, and the accuracy and latency effects of the prefetching strategies discussed in Section 7.7.2.

7.8.1 Server Load

In order for us to release the system, we must be able to support a reasonable number of simultaneous users per machine. In this section, we present our evaluation of the performance of the WebAnywhere speech retrieval system under increasing load, while varying the caching and prefetching strategies used by the system. We chose to evaluate the latency of sound retrieval because it contributes most to the perceived latency of the system: when users press a key, they expect the appropriate speech sound to be played immediately. The TTS system and cache ran on a single machine with a 2.26 GHz Intel Pentium 4 processor and 1 GB of memory.

To implement this evaluation, we first recorded the first 250 requests that the system made for speech sounds when reading news.google.com from top to bottom. This represented a total of 446 seconds of speech, with an average file size of 11.7 kB over the 250 retrieved files. A multi-threaded script was used to replay those requests for any number of simulated users. The script first retrieves a speech sound and then waits for the length of that sound, repeating this process until all of the recorded sounds have completed, reproducing the requests that WebAnywhere would make when reading the page (a sketch of this replay loop appears at the end of this section). This script was run on a separate machine that issued requests over a high-speed connection to the WebAnywhere server.

We tested the following three caching conditions: (i) both server and browser caching enabled, (ii) only server caching enabled, and (iii) no caching enabled. Speech sounds were assumed to be in the server cache when it was used, and speech sounds were added to the browser cache as they were retrieved. Figure 7.9 presents the results of our evaluation, which demonstrate that TTS production is the major bottleneck in the system. The latency of retrieved speech quickly increases as the system attempts to serve more than 10 users. With server-side caching, this is dramatically improved. Client-side caching in the browser improves latency slightly more, although its effect is limited in this example because relatively few speech sounds are repeated. Repeated sounds include "link," which is said before each link that is read, and "Associated Press," which appears frequently because this is a news site.

[Figure 7.9: An evaluation of server load as the number of simultaneous users reading news.google.com is increased, for three different caching combinations (No Cache, Server Cache, Server + Browser Cache). Average latency per sound in seconds is plotted against the number of simultaneous users (5 to 20).]

As we move toward releasing WebAnywhere to a larger audience, we will continue to evaluate its performance under real-world conditions. A number of assumptions were made here that may not hold in practice. For example, the system will not achieve the perfect server-side cache hit rate assumed here, although both the server and browser caches will likely have had more opportunity to be primed when users read through multiple pages. Most users also do not read pages straight through. As we have observed people using the system, we have seen them most often skipping around the page, often returning to nodes that they had visited before, causing speech already retrieved by the browser to be played again. In this initial system we have also not optimized the TTS generation or the cache routines themselves, but could likely achieve better results by doing so. Finally, latency here was calculated as the delay between when a speech sound was requested and when it was fully retrieved; in practice, the Flash sound player can stream sounds and begin playing them much earlier. Despite its limitations, this evaluation generally illustrates the performance of the current system and where future development should be targeted.
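As a rough sketch of the replay methodology used in this evaluation (hypothetical code, not the multi-threaded script we actually ran), each simulated user retrieves the recorded sounds in order, waits for each sound's duration, and records the retrieval latency:

```javascript
// Simulate one user: request each recorded sound, measure the time from
// request to retrieval, then "play" the sound by waiting for its duration.
async function replayUser(requests, fetchSound) {
  const latencies = [];
  for (const req of requests) {  // req: { url, durationMs }
    const start = Date.now();
    await fetchSound(req.url);                              // retrieve the MP3
    latencies.push(Date.now() - start);                     // retrieval latency
    await new Promise(r => setTimeout(r, req.durationMs));  // sound plays
  }
  return latencies;  // average these at each load level
}

// N simultaneous users: launch N replays concurrently.
function simulateLoad(n, requests, fetchSound) {
  return Promise.all(
    Array.from({ length: n }, () => replayUser(requests, fetchSound))
  );
}
```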
7.8.2 Prefetching Accuracy

This section presents an analysis of the predictive power of the model underlying DFS Prefetching + Prediction, described in Section 7.7.2. To conduct this study, we collected traces of the interactions of 3 users of WebAnywhere in the study presented earlier in this chapter (Section 7.6). In that study, users completed the following four tasks using WebAnywhere: checking their email on gmail.com, looking up the next arrival of a bus, finding the phone number of a restaurant using google.com, and completing a web survey about WebAnywhere.

In total, we recorded 2632 individual key presses. 1110 of these were not command key presses and resulted in the name of a key being spoken, for instance "a" or "forward slash"; the system prefetches these key sounds automatically when it first loads. The remaining 1522 key presses were commands that caused the screen reader to read a new element in the page. For instance, the TAB key causes WebAnywhere to advance its internal cursor to the next focusable page element and read it to the user.

Using this data, we computed the probability of each future action given the current node and the user's previous action, as described earlier. Figure 7.10 shows the counts that we recorded. From this data, it appears that users are more likely to issue a keyboard command again after issuing it once. Some commands are also more likely given certain types of nodes; for instance, users are more likely to request the next input element when the cursor is currently located on an input element.

[Figure 7.10: Counts of recorded actions along with the contexts in which they were recorded (current node and prior action), ordered by observed frequency.]

We replayed the traces to build the models that would have been created as a user browsed with WebAnywhere, in order to measure the system's accuracy in predicting the user's next action. Using the Markov model to predict the user's most likely next action correctly predicts the next action in 72.4% of cases. However, simply predicting that the user will repeat the action they just performed predicts the next action in 74.5% of cases. Markov prediction is still useful because it can quickly adapt to individual users whose behavior may not follow this particular pattern. Its predictions also enable prefetching of the second-most-likely action: the true action is among the two most likely candidates in 87.3% of cases. Predictive prefetching is quite accurate.
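The accuracy numbers above can be reproduced from a trace with a small amount of code. The sketch below is hypothetical (reusing the MarkovPredictor sketch given earlier): it predicts each action before observing it, mimicking a live user session, and reports top-1 and top-2 accuracy.

```javascript
// Replay a recorded trace through the predictor and measure how often the
// true next action is the top prediction (and among the top two).
function evaluateTrace(trace, predictor) {
  var top1 = 0, top2 = 0, total = 0;
  for (var i = 1; i < trace.length; i++) {
    var nodeType = trace[i].nodeType;
    var prev = trace[i - 1].action;
    // Rank all actions by predicted probability under condition (node, prev).
    var ranked = predictor.actions.slice().sort(function (a, b) {
      return predictor.probability(nodeType, prev, b) -
             predictor.probability(nodeType, prev, a);
    });
    if (ranked[0] === trace[i].action) top1++;
    if (ranked[0] === trace[i].action || ranked[1] === trace[i].action) top2++;
    total++;
    // Update the model only after predicting, as when a user browses live.
    predictor.observe(nodeType, prev, trace[i].action);
  }
  return { top1: top1 / total, top2: top2 / total };
}
```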
Markov predictions are used in DFS Prefetching + Prediction. The next section explores how that accuracy manifests in the perceived latency of the system.

7.8.3 Observed Latency

The main difference between WebAnywhere and other screen readers is that WebAnywhere generates speech remotely and transfers it to the client for playback; existing screen readers play sounds with almost no latency because the sounds are both generated and played on the client. To better understand the latency trade-offs inherent in WebAnywhere and the implemented prefetching algorithms, we performed a series of experiments designed to test latency under various conditions. A summary of these results is presented in Figure 7.11.

[Figure 7.11: Average latency per sound using different prefetching strategies. The first set contains tasks performed by participants in our user evaluation, including results for prefetching strategies that are based on user behavior. The second set contains five popular sites, read straight through from top to bottom, with and without DFS prefetching. Bars are shown overlapping.]

For all experiments, we report the average latency per sound on both a dial-up and a high-speed connection, whose connection speeds were 44 kBps and 791 kBps, respectively. Although 63% of public libraries in the United States have high-speed connections [13], a dial-up connection may better represent speeds available in many communities. Timing was recorded and aggregated within the WebAnywhere client script. The latency of a sound is the time between when the sound is requested by the client-side script and when that sound begins playing; the average latency is thus the time one should expect to wait before each sound is played. Because the system streams audio files, the entire sound does not need to be loaded before it starts playing. Experiments were conducted on a single computer, and the browser cache was cleared between experiments.

We first compared the DFS prefetching strategy to using no prefetching on a straight-through read of the 5 web pages visited most often by blind users in a study of browsing behavior [17]. We did not test the other prefetching methods because, absent user interaction, they operate identically to DFS prefetching. On these tests, the average latency for the high-speed connection was under 500 ms even without prefetching and under 100 ms with prefetching (Figure 7.11). Delays under 200 ms are generally not perceivable by users [83], which may explain why most users did not cite latency as a concern with WebAnywhere during a user evaluation of it [22]. The dial-up connection recorded latencies above 2 seconds per sound for all five web pages, making the system almost unusable without prefetching. The latency with prefetching averages less than one second per sound, and the average length of all sounds was 2.4 seconds.

Screen reader users often skip around in content instead of reading it straight through. Using recordings of the actions performed by participants during a user evaluation [22], we identified the common methods used to complete the tasks and replayed them manually to see the effect of different prefetching strategies on these tasks; recording and then replaying actions to test web performance under varying strategies has been done before [71]. The prefetching strategies tested were DFS-DOM traversal, DFS with dynamic updating, and Markov model prediction. Observed latency was again quite low for runs using the high-speed connection.
When using the dial-up connection, however, the results differed dramatically. On both the Gmail and Google tasks, DFS prefetching increased the latency of retrieving sounds. This happened because our participants used skipping extensively on these sites and quickly moved beyond the point in the DOM where the prefetcher was retrieving sounds. When this happened, the prefetcher used bandwidth to retrieve speech that the user was never going to play, slowing the retrieval of speech that would be played. Only the survey task showed a significant benefit for the predictive model of user behavior. On this task, participants exhibited a regular pattern of tabbing from selection box to selection box, making their actions easy to predict correctly. Importantly, the prediction method did not perform worse than the Update method, and both far outperformed DFS and no prefetching.

7.9 Security

WebAnywhere enables users to browse untrusted web content. Because WebAnywhere is a web application running inside the browser with no special permissions, it lacks the usual mechanisms that web browsers use to enforce security. In this section, we describe the primary security concerns resulting from our engineering decisions in building WebAnywhere and the steps we have taken to address them.

7.9.1 Enforcing the Same-Origin Policy

The primary security policy that web browsers enforce is the same-origin policy, which prevents scripts loaded from one web domain from accessing scripts or documents loaded from another domain [111]. This policy restricts scripts from stealYourPassword.com from accessing content on, for instance, bankOfAmerica.com. To enable WebAnywhere to access and voice the contents of the pages its users want to visit, the system retrieves all web content through a web proxy, bypassing the same-origin restriction. This makes all content appear to originate from the WebAnywhere domain (for these examples, assume wadomain.org) and affords the WebAnywhere script access to that content. Bypassing the same-origin policy is fundamental to WebAnywhere's operation, but it gives malicious web sites an opportunity to violate expected security guarantees, potentially accessing information to which they should not have access.

WebAnywhere cannot directly enforce the same-origin policy, so it instead ensures that all retrieved content that should be isolated under the same-origin policy originates from a different domain. This is done by prepending the original domain of a resource onto its existing domain; for instance, content from yahoo.com is made to originate from yahoo.com.wadomain.org. WebAnywhere rewrites all URLs in this way (sketched below), causing the browser to correctly enforce the same-origin policy for the content viewed. All requests to open new pages and all URL references within retrieved pages are rewritten to point to the proper domain. The web proxy server enforces that a request for web content located on domain d must originate from the d.wadomain.org domain. WebAnywhere is able to respond to requests for any domain, regardless of the subdomain supplied, through the use of a wildcard DNS record [89] for *.wadomain.org that directs all such requests to WebAnywhere. The WebAnywhere script and the Flash sound player must also originate from the d.wadomain.org domain, and so they are reloaded. This 100 KB download need only occur when users browse to a new domain.
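The subdomain-encoding rewrite can be sketched in a few lines. This is illustrative only; the actual rewriting is performed by the proxy server and handles many more cases, such as relative URLs and ports (wadomain.org is the example domain from the text).

```javascript
// Content from http://yahoo.com/news becomes
// http://yahoo.com.wadomain.org/news, so the browser itself keeps pages
// from different original origins isolated under the same-origin policy.
var PROXY_SUFFIX = '.wadomain.org';

function rewriteUrl(originalUrl) {
  var u = new URL(originalUrl);
  // Prepend the original domain onto the proxy's domain, preserving the path.
  return u.protocol + '//' + u.hostname + PROXY_SUFFIX + u.pathname + u.search;
}

// rewriteUrl('http://yahoo.com/news?x=1')
//   => 'http://yahoo.com.wadomain.org/news?x=1'
```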
Browser cookies also have access restrictions designed to prevent unauthorized scripts from accessing them [99]. The PHProxy web proxy used by WebAnywhere keeps track of the cookies assigned by each domain and only sends cookies set by a domain back to that domain. This is not entirely sufficient: access to cookies is controlled both by the domain and by the path listed when a cookie is set, as determined by the web page that sets it. Future versions of WebAnywhere will modify the domain and the path requested by the cookie to match its location in WebAnywhere. For example, the URL www.domain.com/articles/index.php will appear to come from www.domain.com.wadomain.org/web-proxy/articles/index.php using the URL rewriting supported by the Apache web server; the domain and path of the Set-Cookie request could be adjusted accordingly.

Others have attempted to detect malicious Javascript in the browser client [52], but this relies on the potentially malicious Javascript code being isolated from code within the browser; the WebAnywhere Javascript runs in the same security context as the potentially malicious code. BrowserShield describes a method for rewriting static web pages to enforce run-time security checks with Javascript [109]. It is targeted at protecting against generalized threats and is, therefore, a fairly heavyweight option. The same-origin policy is our main concern because it is the security policy that we removed by introducing the web proxy. We believe the approach described here could be used more generally by web proxies to make them less vulnerable to violations of the same-origin policy.

7.9.2 Remaining Concerns

Fixing the same-origin policy vulnerability created by WebAnywhere was of primary importance to us, but other concerns remain. The first is that, in order to work for secure web sites, WebAnywhere must intercept and decode secure connections made by users of the system before forwarding them on. When a secure request is made, the web proxy establishes a separate secure connection with both the client and the server, and the data is unencrypted on the WebAnywhere server. All accesses to WebAnywhere are made over its SSL-enabled web server, but users still must trust that the WebAnywhere server is secure and, therefore, may want to avoid connecting to secure sites using the system.

The second unresolved concern is the opportunity for sites to use phishing to misrepresent their identity, potentially tricking users into giving up personal information. Although phishing is a problem general to web browsing, the unique design of WebAnywhere makes phishing potentially more difficult to detect. A web page could override the WebAnywhere script used to play speech and prevent users from discovering the real origin of the web page. For instance, as the system currently exists, a malicious site could override the commands in WebAnywhere used to speak the current URL, preventing a user from discovering the real web address they are visiting. Future versions of WebAnywhere will include protections like those in BrowserShield [109] to enforce runtime checks ensuring that the WebAnywhere functions have not been altered.

Finally, because content that has been read previously is cached on the server, malicious users could determine what other users have had read to them, possibly exposing private information.
While this problem is shared by all proxy-based systems, WebAnywhere enables it at a finer granularity than most other systems, which is potentially more revealing. For instance, if a user visits a page that contains their credit card number, the system will likely generate a separate speech sound for that number. A malicious user could repeatedly query the system for credit card numbers and isolate those that are retrieved most quickly. We have partially addressed this problem by not caching sounds that originate from secure web addresses.

7.10 Summary

The WebAnywhere web-based, self-voicing web browser enables blind individuals otherwise unable to afford a screen reader, as well as blind individuals on the go, to access the web from any computer that happens to be available. WebAnywhere can run on most systems, even public terminals on which users are given few permissions. Its small startup size means that users can quickly begin browsing the web, even on relatively slow connections. Participants who evaluated WebAnywhere were able to complete tasks representative of those that users may want to complete on the go.

The focus of this chapter has been improving web access for blind web users. After we released WebAnywhere, we found that a wider variety of people were using it to address their needs, illustrating the power of access technology that is easy to get going. WebAnywhere also offers a promising way for new technology to reach users. These implications are discussed in greater detail in the next chapter.

Chapter 8
A NEW DELIVERY MODEL FOR ACCESS TECHNOLOGY

In the previous chapter (Chapter 7), we introduced WebAnywhere, a web-based screen reader that blind people can use to access the web on any computer to which they have access. This chapter explores the implications of the WebAnywhere model as a method for delivering more general access technology. We released WebAnywhere on a public site in June 2008, and since then it has attracted a large number of visitors. Surprisingly, many of these visitors weren't blind web users: WebAnywhere also attracted people with low vision, web developers, special education teachers, and people with learning disabilities. People came from all over the world, and a small community of developers has begun to create localized versions for many different languages. WebAnywhere has the potential to serve as a vehicle to disseminate access technology quickly and easily to a large number of users across the world.

8.1 The Audience for WebAnywhere

Since its release in June 2008, a large number of users have used WebAnywhere (Figure 8.1). These users have offered their feedback directly through emails and indirectly via features of the web traffic that they generate.

[Figure 8.1: Weekly web usage (unique users per week, sliding window) between November 15, 2008 and May 1, 2009. An average of approximately 600 unique users visit WebAnywhere each week. The large drop in users in December roughly corresponded to winter break. WebAnywhere offers the chance for a living laboratory to improve our understanding of how blind people browse the web and the problems that they face.]

8.1.1 User Feedback

Participants in our initial user study of the system requested several features that are offered by other screen readers but are currently unavailable in WebAnywhere. Many of these requests involved new keyboard shortcuts and functionality, but several involved producing different, individualized speech sounds. Implementing these features in a straightforward way has the potential to reduce the efficacy of the prefetching and caching strategies employed by the system. For instance, users requested that the system use a voice that is
preferred by them; popular screen readers offer tens of voices from which users can choose. Others asked for the ability to set the speech rate to a custom rate. Because using a screen reader can be inefficient, many users speed up the rate of the speech that is read by two times or more; many users, however, do not prefer this because speech can be difficult to understand at high speeds. Both of these improvements will cause the speech played by the system to be located in its cache less frequently, and, therefore, the value of these features will need to be balanced against their performance implications. We also plan to explore the option of switching to client-side TTS when users both have permission to use it and it is available; several operating systems have native support for TTS that WebAnywhere could leverage when permitted.

8.1.2 Broader Audience

Releasing WebAnywhere demonstrated that its audience was much broader than we had originally anticipated. Many WebAnywhere users have sent us email feedback since its release, from which we have identified two themes. First, we discovered from our usage logs that WebAnywhere had a large global reach (Figure 8.2). The most popular request we have received is for support of additional languages. The most active contributor to the WebAnywhere open source project (http://webanywhere.googlecode.com) has been a developer who has created a Cantonese version of WebAnywhere (Figure 8.3).

[Figure 8.2: From November 2008 to May 2009, WebAnywhere was used by people from over 90 countries. This chart lists the 40 best-represented countries ranked by the number of unique IPs identified from each country that accessed WebAnywhere over this period, led by the United States (6815), the United Kingdom (2307), Canada (675), and Italy (611). 33.9% of the total 23,384 IPs could not be localized and are not included.]

Second, from the feedback of users, it has become clear that not only blind people are using WebAnywhere. We have received emails from web developers who use WebAnywhere to quickly test the accessibility of their content. A special education teacher emailed us saying that she uses WebAnywhere with her students. Specialized software is available that is specifically designed for developers wanting to create accessible content and for students with learning disabilities. We speculate that because WebAnywhere can be used without installing new software and works on any platform, it is likely to be used even when better alternatives are available.
Future work will look to (i) understand why people are using WebAnywhere, (ii) determine how WebAnywhere could better support the features that these new audiences want, and (iii) explore how tools targeting specific groups might provide the advantages that cause people to use options like WebAnywhere instead.

[Figure 8.3: WebAnywhere, May 2009. Since its release, new languages have been added to WebAnywhere. This screenshot shows an early Cantonese version of the system. We have also started to introduce features that may make it more useful for other populations: the text being read is highlighted within the page and shown in a magnified, high-contrast view.]

The implication is that the WebAnywhere approach to providing access may be appropriate for people with different needs all over the world.

8.2 Getting New Technology to Users: Two Examples

WebAnywhere has the potential to bring both access and new technology to users. Many research projects and good ideas fail to make it to users because they are difficult to integrate into existing products and require users to find and install new software.

8.2.1 Social Accessibility

To take advantage of collaboratively-authored accessibility improvements in systems like Accessmonkey (Chapter 4), Social Accessibility [121], or AxsJAX [49], users must first install software, often a browser extension or plugin. For the reasons mentioned previously (Chapter 7), users may not be able to install new software in order to benefit from this technology.

WebAnywhere can include new technology, for instance adding support for TrailBlazer trails (Chapter 6), and users will get the latest version the next time they use it. The improvements made using all three current social accessibility systems can be introduced with a Javascript script, and WebAnywhere can easily inject such scripts when it loads a new web page.

8.2.2 Recording and Replaying Tasks

TrailBlazer (Chapter 6) offers users the ability to record and replay web-based tasks. WebAnywhere has immediate access to the actions that users are performing, since it is the interface they are using, and can easily record them. We have already implemented basic macro recording and playback in WebAnywhere and plan to fully support TrailBlazer scripts in future versions. Importantly, WebAnywhere users will not have to install any new software; support for web macros can be added transparently, without user involvement.

8.2.3 Key Features

The WebAnywhere delivery model includes several key features that future projects may want to emulate. We have outlined these features below:

• Free - The fact that WebAnywhere is free for users is important. Beyond the issues of fairness and equal access discussed in the introduction to the previous chapter, a free tool allows people to easily try the software without committing to a purchase. We also do not have to put restrictions on where it can be run.

• No Installation - A related advantage of the WebAnywhere model is that no new software needs to be installed. As a consequence, software developed following the WebAnywhere model will work on any platform that supports web access, even platforms developed later.

• Low-Cost Distribution & Updates - As a consequence of web-based delivery, users always receive the latest version of WebAnywhere.
New features and updates can reach users quickly, helping to decrease the lag users might otherwise experience while waiting for their access technology to respond to technology trends.

The features just described make WebAnywhere easy to try out, easy to demonstrate to others, quick to adapt to changing technology, and available wherever people are.

8.3 Summary

This chapter has discussed some of the interesting implications of WebAnywhere and some of our experiences after releasing it publicly. The main conclusion we have drawn from this experience is that there is a large need for access technology that is easy to run and personalize on whatever machines people have access to. Releasing technology on the web means that anyone, anywhere in the world, can access it, and in other countries there may be an even greater need for affordable access technology that can run on any computer.

Chapter 9
CONCLUSION AND FUTURE DIRECTIONS

This dissertation has explored the potential of including blind web users as partners in improving access to the web. In this context, we have offered general contributions in (i) tracking the actions of web users for predictive purposes, (ii) enabling end user customization of web content through better interfaces, and (iii) formulating design constraints for making tools widely available on the web. The inclusion of blind users in this work has been both explicit, such as providing improvements using Accessmonkey or choosing to install the Usable CAPTCHA Greasemonkey script, and implicit, such as when their web interactions improved the performance of WebAnywhere's prefetching and TrailBlazer's suggestions. Including users in the process of improving their access is powerful, and we believe this approach may extend to other populations and contexts. This chapter first overviews the contributions of this dissertation, then discusses future directions, and lastly presents some final remarks on the broader message of the work.

9.1 Contributions

This dissertation has offered contributions both in understanding the problems that blind web users face and in technology that helps blind web users improve the accessibility of their own web experiences. Many of the tools and technologies developed have broader applications in understanding the web experiences of all users and in helping users with diverse requirements collaboratively create more effective access.

Recording the actions that users perform on the web forms an important component of improving access for two reasons. First, it helps in building tools that improve our understanding of how people use the web and the problems that they experience. Second, recording actions is an important first step toward predicting what actions users will perform next. Both of these can be used to improve access.

9.1.1 Understanding Web Browsing Problems

WebinSitu (Chapter 3) explored the extent of accessibility problems and their practical effects from the perspective of users. This remote study used an advanced web proxy that leverages AJAX technology to record both the pages viewed and the actions taken by users on the web pages that they visited. Over the period of one week, participants used the assistive technology and software to which they were already accustomed and had already configured according to preference. These advantages allowed us to aggregate observations of many users and to explore the practical effects on, and coping strategies employed by, our blind participants.
Our WebinSitu study reflects web accessibility from the perspective of web users and describes quantitative differences in the browsing behavior of blind and sighted web users. These results have motivated the areas that we have explored in the subsequent research presented in this dissertation. Conducting remote studies of these systems using the WebinSitu infrastructure has allowed us to include more participants and offers promising directions for future research (Section 9.2.4).

9.1.2 Predicting Web Actions

A core contribution of the research presented here is a set of methods for observing and predicting user actions on the web - button presses, clicks on links, reading the next heading, etc. We show several examples of how predicting what users will do next can yield interfaces that users can use to improve their own web experiences.

There has been a long history of both (i) prefetching the web pages users will likely visit next, and (ii) adapting pages for smaller screens (or screen readers, for that matter). For instance, the PROTEUS system [5] learned a model of page access from web logs. PROTEUS used this model to create shortcut links to pages located deep within a site and to hide content not related to the inferred user goal. Many other systems (before and after) offered their own variations on this theme.

The models used in these prior systems were generally learned from the history of pages that users visited, often drawn from page access logs. In contrast, the work presented here learns models of user actions from actions observed within web pages (buttons pressed, content read, links followed, commands executed, etc.).

Action-based learning within web pages is an important new direction. Web pages are becoming more complex and web applications more popular, making the question of what to do next within a web page increasingly important - it's been important for blind users for a long time. Users may also only visit a single web page when using a web application (for instance, on gmail.com), making the question of what web page to visit next irrelevant. Importantly, actions other than following a link can lead to a new HTTP request. Both of these lead to the breakdown of page-based models. More fundamentally, page-based models treat the web as a collection of linked, static documents - an assumption that is increasingly violated. Action-based models treat web pages as applications. As such, ideas explored in this area can find applications in traditional desktop applications as well. Related work in this area has generally been limited by the difficulty of recording and automating arbitrary desktop applications; the relative openness of the web allows more progress to be made in this domain.

The models explored here are constructed, and predictions are made, directly in the browser, personalized to each user as they browse. The models can learn not only from what others have done on a page before, but also from what a particular user has done on the pages they have visited recently. Most prior work has formulated models offline, often requiring the logs of specific sites.

9.1.3 Intelligent End User Tools for Improving Access

As part of this dissertation, we have created a number of tools that have either been released by us or have helped to influence released projects.

• Accessmonkey - Accessmonkey was at the forefront of the burgeoning area of social accessibility. The Accessmonkey research prototype has influenced several related projects.
AxsJAX by Google injects scripts into web pages, converting them to dynamic web applications targeting non-visual use [49]. The Social Accessibility project lets blind web users report accessibility problems to sighted volunteers, whose fixes are represented as scripts [121]. Accessmonkey was created with the idea that blind web users who were not programmers could independently improve accessibility; both Accessmonkey and the tools that have followed it have made steps toward that aim.

• More Usable Audio CAPTCHAs - Our more usable interface to audio CAPTCHAs improves the interactions required to solve existing audio CAPTCHAs and results in improved performance by blind web users. Several popular sites have since made their audio CAPTCHAs easier for blind web users to use. Although they did not adopt our entire interface, they solve the problem of the screen reader speaking over the playing audio CAPTCHA in another way. A common solution is to automatically focus the answer box after the play button is pressed and then delay the start of the audio CAPTCHA by preceding it with several seconds of beeps. This prevents the screen reader from talking over the playing CAPTCHA.

• TrailBlazer - A version of the TrailBlazer interface for accessible script playback has been released as a feature of CoScripter. WebAnywhere now includes limited support for macro recording and playback. Both tools plan to include better support in the future and to study how these capabilities are used "in the wild." WebinSitu clearly demonstrated the differences in browsing efficiency between blind and sighted participants; this difference is one reason we believe script playback and suggestions might be especially beneficial for blind users. As the web becomes more complex, similar tools may become increasingly useful for sighted users as well.

• WebAnywhere - WebAnywhere has been publicly available for nearly a year, as discussed in Chapter 8. Multiple parties have contributed to the open source project, both to improve it and add desired features, and to use it as a platform for their own research. We believe WebAnywhere is a harbinger of a delivery model especially promising for access technology. Hosting technology remotely and reusing mainstream devices to deliver it means disabled users are in control and do not need to rely on computer administrators who may be ill-equipped to provide the access technology they need.

9.1.4 Summary of Technical Contributions

We have shown that tools that record, play back, and predict web actions can both (i) help us understand the problems faced by blind web users and (ii) enable blind web users to improve the accessibility, usability, and availability of web content. We have created tools that implement these ideas, several of which people are actively using.

9.2 Future Directions

Throughout most of its existence, the web has been an untamed, loosely-connected collection of documents. As the web continues to evolve, we see it transitioning away from the current model, in which users are responsible for finding what they want, to one in which the web itself takes a more active role in enabling what users want. The research and directions explored within this dissertation suggest numerous opportunities for future work.

9.2.1 Improving Web Interactions by Predicting Actions

Both WebAnywhere (Chapter 7) and TrailBlazer (Chapter 6) use predictive models of web actions to improve the web interface.
Predicting actions within web pages had not been adequately explored prior to this work, and it has the potential to improve interactions in diverse scenarios and on many different types of devices. For instance, if the web browser could predict which parts of a complex web page you might be interested in, it could highlight those parts, providing easy access. A browser on a small-screen device could provide easy access to its top 5 predictions. Prefetching based on predicted actions within a web application could also be useful for any application that updates its content.

9.2.2 Extending to New Technologies

A problem with accessibility is that nearly as soon as a new technology is made accessible, it is eclipsed by another. This has already happened with the web, as static web pages created using only HTML have started to give way to dynamic web applications powered by Javascript. While the particular technologies used will change, the ideas presented here are likely to translate. The challenge is to build technologies that can adapt faster and with less modification.

Currently, the web is becoming more closed and more resistant to end user personalization. For instance, Flash and Silverlight are increasingly popular formats for web content. While the long-term success of either format is uncertain, these examples suggest that whatever comes next may be more closed, representing a challenge not only for accessibility but also for end user web programming in general.

9.2.3 More Available Access

This dissertation has motivated availability as a new dimension on which to consider access. WebAnywhere seeks to improve the availability of access by exposing a base level of access on any computer with web access. Future work may explore improving availability in other ways. For instance, in some situations people don't have access to a computer, but they do have access to a smart phone. Can we build access into smart phones? Even more people have access to a simple cellphone that does not provide data services. Can we provide access to the web using only the voice channel? Increasing the diversity of devices on which people can access web content can help realize the potential of the web for everyone. Building software, such as web applications, that can work on a variety of common, mainstream devices can help maximize potential impact. Another option may be to build non-visual access into the server and enable anyone to access their computer or the web with a phone call. Most importantly, it is increasingly not enough to consider whether the web is possible to use, or even whether the experience is usable; we also need to consider the availability of tools enabling access.

9.2.4 Longitudinal Field Studies in a Living Laboratory

WebAnywhere offers not only the opportunity to provide web access to many people who might not otherwise have it; it also provides the opportunity to study how a large number of people browse the web using the tool and to iterate on tools that help them improve access. We have recently added the ability to record and play back web tasks in WebAnywhere and plan to use this capability to test the ability of trails to help web users more easily complete web tasks.

For the WebinSitu study (Chapter 3), users had to explicitly sign up and were paid for participation.
The large number of visitors coming to WebAnywhere every day offers a rich resource for better understanding what is working and what is not, without requiring participants to use or configure a new tool. We think conducting studies and iterating over designs in this environment over longer periods of time is a powerful direction for future research.

9.3 Final Remarks

This dissertation has explored the following thesis: With intelligent interfaces supporting them, blind end users can collaboratively and effectively improve the accessibility, usability, and availability of their own web access.

Promoting blind web users as active participants in the development of accessible content represents a paradigm shift in access technology, demonstrating a new role that disabled users can play in access technology. We hope this work will be a valuable example to researchers, developers, and practitioners. In the design of new tools, disabled people should be seen as effective partners in creating a more accessible experience for everyone. This dissertation represents important first steps in this direction, revealing significant opportunities for new research.

BIBLIOGRAPHY

[1] A-Prompt. Adaptive Technology Resource Centre (ATRC) and the TRACE Center at the University of Wisconsin. http://www.aprompt.ca/. Accessed April 17, 2007.

[2] Accessibility Is a Right Foundation. http://accessibilityisaright.org/. Accessed May 28, 2009.

[3] Adobe. Adobe Shockwave and Flash players: Adoption statistics. http://www.adobe.com/products/player_census/. Accessed June 15, 2007.

[4] Alexa web search - data services, 2008. http://www.alexa.com. Accessed June 15, 2008.

[5] Corin Anderson, Pedro Domingos, and Daniel S. Weld. Web site personalizers for mobile devices. In IJCAI Workshop on Intelligent Techniques for Web Personalization (ITWP), 2001.

[6] Abdullah Arif. PHProxy. http://whitefyre.com/poxy/. Accessed June 7, 2007.

[7] Chieko Asakawa and Hironobu Takagi. Web Accessibility: A Foundation for Research, chapter Transcoding. Springer, 2008.

[8] Richard Atterer. Logging usage of AJAX applications with the UsaProxy HTTP proxy. In Proceedings of the WWW 2006 Workshop on Logging Traces of Web Activity: The Mechanics of Data Collection, 2006.

[9] Richard Atterer, Monika Wnuk, and Albrecht Schmidt. Knowing the user's every move - user activity tracking for website usability evaluation and implicit interaction. In Proceedings of the 15th International Conference on World Wide Web (WWW 2006), pages 203–212, New York, NY, 2006.

[10] Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. Addison Wesley Longman, 1999.

[11] Sean Bechhofer, Carole Goble, Leslie Carr, Simon Kampa, Wendy Hall, and Dave De Roure. COHSE: Conceptual open hypermedia service. Frontiers in Artificial Intelligence and Applications, 96, 2003.

[12] Leean Bent and Geoff Voelker. Whole page performance. In Proceedings of the 7th Annual Web Caching Workshop, June 2002.

[13] John Carlo Bertot, Charles R. McClure, Paul T. Jaeger, and Joe Ryan. Public libraries and the internet 2006: Study results and findings. Technical report, Information Use Management and Policy Institute, Florida State University, September 2006.

[14] Jeffrey P. Bigham. Increasing web accessibility by automatically judging alternative text quality. In Proceedings of the 12th International Conference on Intelligent User Interfaces (IUI 2007), New York, NY, USA, 2007. ACM Press.

[15] Jeffrey P. Bigham, Maxwell B. Aller, Jeremy T. Brudvik, Jessica O. Leung, Lindsay A.
Yazzolino, and Richard Ladner. Inspiring blind high school students to pursue computer science with instant messaging chatbots. In Proceedings of the 39th SIGCSE Technical Symposium on Computer Science Education (SIGCSE 2008), Portland, OR, USA, 2008.

[16] Jeffrey P. Bigham and Anna C. Cavender. Evaluating existing audio CAPTCHAs and an interface optimized for non-visual use. In Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI 2009), pages 1829–1838, Boston, MA, USA.

[17] Jeffrey P. Bigham, Anna C. Cavender, Jeremy T. Brudvik, Jacob O. Wobbrock, and Richard Ladner. WebinSitu: A comparative analysis of blind and sighted browsing behavior. In Proceedings of the 9th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS 2007), pages 51–58, Tempe, AZ, USA.

[18] Jeffrey P. Bigham, Ryan S. Kaminsky, Richard Ladner, Oscar M. Danielsson, and Gordon L. Hempton. WebInSight: Making web images accessible. In Proceedings of the 8th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS 2006), pages 181–188, Portland, Oregon, 2006.

[19] Jeffrey P. Bigham and Richard E. Ladner. Accessmonkey: A collaborative scripting framework for web users and developers. In Proceedings of the 4th International Cross-Disciplinary Conference on Web Accessibility (W4A 2007), pages 25–34, Banff, Canada, 2007.

[20] Jeffrey P. Bigham, Tessa Lau, and Jeffrey Nichols. TrailBlazer: Enabling blind users to blaze trails through the web. In Proceedings of the 12th International Conference on Intelligent User Interfaces (IUI 2009), Sanibel Island, FL, USA, 2009.

[21] Jeffrey P. Bigham and Craig M. Prince. WebAnywhere: A screen reader on-the-go. In Proceedings of the 9th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS 2007), pages 225–226, New York, NY, USA, 2007. ACM.

[22] Jeffrey P. Bigham, Craig M. Prince, and Richard E. Ladner. WebAnywhere: A screen reader on-the-go. In Proceedings of the International Cross-Disciplinary Conference on Web Accessibility (W4A 2008), pages 73–82, Beijing, China, 2008.

[23] Charles M. Blow. Two little boys. http://blow.blogs.nytimes.com/2009/04/24/two-little-boys/. Accessed May 15, 2009.

[24] Michael Bolin, Matthew Webber, Philip Rha, Tom Wilson, and Robert C. Miller. Automation and customization of rendered web pages. In Proceedings of the 18th Annual ACM Symposium on User Interface Software and Technology (UIST 2005), pages 163–172, Seattle, WA, USA, 2005.

[25] Braille Sense. GW Micro. http://www.gwmicro.com/Braille_Sense/. Accessed April 12, 2007.

[26] Lee Breslau, Pei Cao, Li Fan, Graham Phillips, and Scott Shenker. Web caching and Zipf-like distributions: Evidence and implications. In INFOCOM (1), pages 126–134, 1999.

[27] A. J. Bernheim Brush, Morgan Ames, and Janet Davis. A comparison of synchronous remote and local usability studies for an expert interface. In CHI 2004 Extended Abstracts on Human Factors in Computing Systems, pages 1179–1182, New York, NY, USA, 2004. ACM Press.

[28] Michael D. Byrne, Bonnie E. John, Neil S. Wehrle, and David C. Crow. The tangled web we wove: A taxonomy of WWW use. In Proceedings of the Conference on Human Factors in Computing Systems (CHI '99), 1999.

[29] Kumar Chellapilla, Kevin Larson, Patrice Y. Simard, and Mary Czerwinski. Designing human friendly human interaction proofs (HIPs). In Proceedings of Computer Human Interaction (CHI 2005), pages 711–720, 2005.

[30] Charles Chen. Fire Vox: A screen reader Firefox extension. http://firevox.clcworld.net/. Accessed July 23, 2007.
[31] Joe Clark. Reader's guide to Sydney Olympics accessibility complaint, 2001. http://www.contenu.nu/socog.html. Accessed May 15, 2007.

[32] Mark Claypool, Phong Le, Makoto Wased, and David Brown. Implicit interest indicators. In Proceedings of the 6th International Conference on Intelligent User Interfaces (IUI 2001), pages 33–40, New York, NY, USA, 2001.

[33] A. M. Collins and E. F. Loftus. A spreading activation theory of semantic processing. Psychological Review, 82:407–428, 1975.

[34] Disability Rights Commission. The web: Access and inclusion for disabled people. The Stationery Office, 2004.

[35] Kara Pernice Coyne and Jakob Nielsen. Beyond ALT text: Making the web easy to use for users with disabilities, 2001.

[36] Timothy C. Craven. Some features of alt text associated with images in web pages. Information Research, 11, 2006.

[37] Allen Cypher. Eager: Programming repetitive tasks by example. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '91), pages 33–39, New Orleans, Louisiana, USA, 1991.

[38] O. De Troyer and C. Leune. WSDM: A user-centered design method for web sites. In Proceedings of the Seventh International World Wide Web Conference, pages 85–94, 1998.

[39] D. Diaper and L. Worman. Two falls out of three in the automated accessibility assessment of world wide web sites: A-Prompt v. Bobby. People and Computers XVII, pages 349–363, 2003.

[40] Email2me. Across Communications. http://www.email2phone.net/. Accessed February 9, 2007.

[41] Alexander Faaborg and Henry Lieberman. A goal-oriented web browser. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI 2006), pages 751–760, 2006.

[42] Firefox accessibility extension. Illinois Center for Information Technology. Accessed July 23, 2007.

[43] G. E. P. Box and D. R. Cox. An analysis of transformations. Journal of the Royal Statistical Society, Series B, 26:211–246, 1964.

[44] Philip Brighten Godfrey. Text-based CAPTCHA algorithms. In First Workshop on Human Interactive Proofs, 2002. Unpublished manuscript. http://www.aladdin.cs.cmu.edu/hips/events/abs/godfreyb_abstract.pdf.

[45] Jeremy Goecks and Jude Shavlik. Learning users' interests by unobtrusively observing their normal behavior. In Proceedings of the 5th International Conference on Intelligent User Interfaces (IUI 2000), pages 129–132, New York, NY, USA, 2000. ACM Press.

[46] Martin Gonzalez. Automatic data-gathering agents for remote navigability testing. IEEE Software, 19(6):78–85, 2002.

[47] Martin Gonzalez, Marcos Gonzalez, Cristobal Rivera, Ignacio Pintado, and Agueda Vidau. Testing web navigation for all: An agent-based approach. In Proceedings of the 10th International Conference on Computers Helping People with Special Needs (ICCHP 2006), volume 4061 of Lecture Notes in Computer Science, pages 223–228, Berlin, Germany, 2006. Springer.

[48] GOOG-411. Google Labs. http://labs.google.com/goog411/. Accessed February 7, 2007.

[49] Google AxsJAX. http://code.google.com/p/google-axsjax/. Accessed April 15, 2009.

[50] Google Analytics. http://analytics.google.com/. Accessed February 12, 2009.

[51] Greasemonkey Firefox extension. http://www.greasespot.net/. Accessed June 4, 2009.

[52] Oystein Hallaraker and Giovanni Vigna. Detecting malicious JavaScript code in Mozilla. In Proceedings of the 10th IEEE International Conference on Engineering of Complex Computer Systems (ICECCS 2005), pages 85–94, Washington, DC, USA, 2005. IEEE Computer Society.
[53] Simon Harper, Carole Goble, Robert Stevens, and Yeliz Yesilada. Middleware to expand context and preview in hypertext. In Proceedings of the 6th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS 2004), pages 63–70, New York, NY, USA, 2004. ACM Press.

[54] Simon Harper and Neha Patel. Gist summaries for visually impaired surfers. In Proceedings of the 7th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS 2005), pages 90–97, New York, NY, USA, 2005. ACM Press.

[55] Simon Harper, Yeliz Yesilada, Carole Goble, and Robert Stevens. How much is too much in a hypertext link?: Investigating context and preview - a formative evaluation. In Proceedings of the 15th Conference on Hypertext and Hypermedia (HYPERTEXT 2004), pages 116–125, Santa Cruz, CA, USA, 2004.

[56] Jonathan Holman, Jonathan Lazar, Jinjuan Heidi Feng, and John D'Arcy. Developing usable CAPTCHAs for blind users. In Proceedings of the 9th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS 2007), pages 245–246, New York, NY, USA, 2007.

[57] Jason I. Hong, Jeffrey Heer, Sarah Waterson, and James A. Landay. WebQuilt: A proxy-based approach to remote web usability testing. Information Systems, 19(3):263–285, 2001.

[58] Jason I. Hong and James A. Landay. WebQuilt: A framework for capturing and visualizing the web experience. In Proceedings of the 10th International Conference on the World Wide Web (WWW 2001), pages 717–724, 2001.

[59] Eric Horvitz. Principles of mixed-initiative user interfaces. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '99), pages 159–166, New York, NY, USA, 1999. ACM Press.

[60] Anita W. Huang and Neel Sundaresan. A semantic transcoding system to adapt web services for users with disabilities. In Proceedings of the Fourth International ACM Conference on Assistive Technologies (ASSETS 2000), pages 156–163, New York, NY, USA, 2000. ACM Press.

[61] Gennaro Iaccarino, Delfina Malandrino, and Vittorio Scarano. Efficient edge-services for colorblind users. In Proceedings of the 15th International Conference on World Wide Web (WWW 2006), pages 919–920, New York, NY, 2006. ACM Press.

[62] Gennaro Iaccarino, Delfina Malandrino, and Vittorio Scarano. Personalizable edge services for web accessibility. In Proceedings of the 2006 International Cross-Disciplinary Workshop on Web Accessibility (W4A 2006), pages 23–32, New York, NY, USA, 2006. ACM Press.

[63] IBM alphaWorks aDesigner. http://www.alphaworks.ibm.com/tech/adesigner. Accessed May 15, 2007.

[64] IBM Home Page Reader. http://www-03.ibm.com/able/. Accessed May 15, 2009.

[65] Melody Y. Ivory. Automated Web Site Evaluation: Researchers' and Practitioners' Perspectives. Kluwer Academic Publishers, 2003.

[66] H. Jung, J. Allen, N. Chambers, L. Galescu, M. Swift, and W. Taysom. One-shot procedure learning from instruction and observation. In Proceedings of the International FLAIRS Conference: Special Track on Natural Language and Knowledge Representation.

[67] Shinya Kawanaka, Yevgen Borodin, Jeffrey P. Bigham, Darren Lunn, Hironobu Takagi, and Chieko Asakawa. Accessibility Commons: A metadata infrastructure for web accessibility. In Proceedings of the 10th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS 2008), pages 153–160, Halifax, Nova Scotia, Canada, 2008.

[68] Caitlin Kelleher and Randy Pausch. Stencils-based tutorials: Design and evaluation.
In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI 2005), pages 541–550, Portland, Oregon, USA, 2005.
[69] Brian Kelly, David Sloan, Lawrie Phipps, Helen Petrie, and Fraser Hamilton. Forcing standardization or accommodating diversity?: A framework for applying the WCAG in the real world. In Proceedings of the 2005 International Cross-Disciplinary Workshop on Web Accessibility (W4A 2005), pages 46–54, New York, NY, USA, 2005. ACM Press.
[70] Greg Kochanski, Daniel Lopresti, and Chilin Shih. A reverse Turing test using speech. In Proceedings of the International Conference on Spoken Language Processing (ICSLP 2002), pages 1357–1360, 2002.
[71] Mimika Koletsou and Geoff Voelker. The Medusa proxy: A tool for exploring user-perceived web performance. In Proceedings of the 6th Annual Web Caching Workshop, June 2001.
[72] Tom M. Kroeger, Darrell D. E. Long, and Jeffrey C. Mogul. Exploring the bounds of web latency reduction from caching and prefetching. In USENIX Symposium on Internet Technologies and Systems, 1997.
[73] Richard Ladner, Melody Y. Ivory, Raj Rao, Sheryl Burgstahler, Dan Comden, Sangyun Hahn, Mathew Renzelmann, Satria Krisnandi, Mahalakshmi Ramasamy, Beverly Slabosky, Andrew Martin, Amelia Lacenski, Stuart Olsen, and Dmitri Groce. Automating tactile graphics translation. In Proceedings of the Seventh International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS 2005), pages 50–57, New York, NY, 2005. ACM Press.
[74] Jonathan Lazar, Jinjuan Feng, and Aaron Allen. Determining the impact of computer frustration on the mood of blind users browsing the web. In Proceedings of the 8th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS 2006), pages 149–156, New York, NY, USA, 2006. ACM Press.
[75] Gilly Leshed, Eben M. Haber, Tara Matthews, and Tessa Lau. CoScripter: Automating & sharing how-to knowledge in the enterprise. In Proceedings of the 26th SIGCHI Conference on Human Factors in Computing Systems (CHI 2008), pages 1719–1728, Florence, Italy, 2008.
[76] David D. Lewis. Naive Bayes at forty: The independence assumption in information retrieval. In Proceedings of ECML-98, 10th European Conference on Machine Learning, number 1398, pages 4–15, Chemnitz, Germany, 1998. Springer Verlag, Heidelberg, Germany.
[77] Lift. UsableNet, 2006. http://www.usablenet.com/. Accessed April 15, 2009.
[78] Linux Screen Reader (LSR). http://live.gnome.org/LSR. Accessed February 17, 2007.
[79] R. C. Littell, G. A. Milliken, W. W. Stroup, and R. D. Wolfinger. SAS System for Mixed Models. SAS Institute, Inc., Cary, North Carolina, USA, 1996.
[80] Greg Little, Tessa Lau, Allen Cypher, James Lin, Eben M. Haber, and Eser Kandogan. Koala: Capture, share, automate, personalize business processes on the web. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI 2007), pages 943–946, 2007.
[81] Greg Little and Robert C. Miller. Translating keyword commands into executable code. In Proceedings of the 19th Annual ACM Symposium on User Interface Software and Technology (UIST 2006), pages 135–144, New York, NY, USA, 2006. ACM Press.
[82] JAWS 8.0 for Windows. Freedom Scientific. http://www.freedomscientific.com. Accessed May 4, 2007.
[83] I. Scott MacKenzie and Colin Ware. Lag as a determinant of human performance in interactive systems. In Proceedings of the INTERACT '93 and CHI '93 Conference on Human Factors in Computing Systems, pages 488–493, New York, NY, USA, 1993. ACM Press.
[84] Jalal Mahmud, Yevgen Borodin, and I. V. Ramakrishnan. CSurf: A context-driven non-visual web browser. In Proceedings of the International Conference on the World Wide Web (WWW 2007), pages 31–40, 2007.
[85] Jennifer Mankoff, Holly Fait, and Tu Tran. Is your web page accessible?: A comparative study of methods for assessing web page accessibility for the blind. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI 2005), pages 41–50, New York, NY, USA, 2005.
[86] Robert C. Miller and B. Myers. Creating dynamic World Wide Web pages by demonstration, 1997.
[87] Hisashi Miyashita, Daisuke Sato, Hironobu Takagi, and Chieko Asakawa. aiBrowser for multimedia: Introducing multimedia content accessibility for visually impaired users. In Proceedings of the 9th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS 2007), pages 91–98, New York, NY, USA, 2007. ACM.
[88] Mobile Speak Pocket. Code Factory. http://www.codefactory.es/mobile_speak_pocket/. Accessed February 7, 2009.
[89] Paul Mockapetris. RFC 1034: Domain Names - Concepts and Facilities. Network Working Group, November 1987. http://tools.ietf.org/html/rfc1034.
[90] Saikat Mukherjee, I. V. Ramakrishnan, and A. Singh. Bootstrapping semantic annotation for content-rich HTML documents. In Proceedings of the International Conference on Data Engineering (ICDE 2005), 2005.
[91] Saikat Mukherjee, Guizhen Yang, Wenfang Tan, and I. V. Ramakrishnan. Automatic discovery of semantic structures in HTML documents. In Proceedings of the International Conference on Document Analysis and Recognition (ICDAR 2003), 2003.
[92] National Association for the Blind, India. http://www.nabindia.com/. Accessed July 23, 2007.
[93] National Federation of the Blind v. Target Corporation. U.S. District Court: Northern District of California, 2006. No. C 06-01802 MHP.
[94] Jeffrey Nichols and Tessa Lau. Mobilization by demonstration: Using traces to re-author existing web sites. In Proceedings of the 13th International Conference on Intelligent User Interfaces (IUI 2008), pages 149–158, New York, NY, USA, 2008.
[95] Jakob Nielsen. Designing Web Usability: The Practice of Simplicity. New Riders, 2000.
[96] NVDA screen reader. NV Access Inc. http://www.nvda-project.org/. Accessed November 17, 2007.
[97] Orca: The GNOME Project. http://live.gnome.org/Orca. Accessed February 11, 2008.
[98] Venkata N. Padmanabhan and Jeffrey C. Mogul. Using predictive prefetching to improve World Wide Web latency. SIGCOMM Computer Communication Review, 26(3):22–36, 1996.
[99] Joon S. Park and Ravi Sandhu. Secure cookies on the web. IEEE Internet Computing, 4(4):36–44, July 2000.
[100] Helen Petrie, Fraser Hamilton, and Neil King. Tension, what tension?: Website accessibility and visual design. In Proceedings of the International Cross-Disciplinary Workshop on Web Accessibility (W4A 2004), pages 13–18, New York, NY, USA, 2004. ACM Press.
[101] Helen Petrie, Fraser Hamilton, Neil King, and Pete Pavan. Remote usability evaluations with disabled people. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI 2006), pages 1133–1141, New York, NY, USA, 2006. ACM Press.
[102] Helen Petrie, Chandra Harrison, and Sundeep Dev. Describing images on the web: A survey of current practice and prospects for the future. In Proceedings of Human Computer Interaction International (HCII 2005), July 2005.
[103] Helen Petrie and Omar Kheir. The relationship between accessibility and usability of websites.
In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI 2007), pages 397–406, San Jose, California, USA, 2007.
[104] Mark Pilgrim, editor. Greasemonkey Hacks: Tips & Tools for Remixing the Web with Firefox. O'Reilly Media, 2005.
[105] Peter Plessers, Sven Casteleyn, Yeliz Yesilada, Olga De Troyer, Robert Stevens, Simon Harper, and Carole Goble. Accessibility: A web engineering approach. In Proceedings of the 14th International Conference on World Wide Web (WWW 2005), pages 353–362, New York, NY, USA, 2005. ACM Press.
[106] I. V. Ramakrishnan, A. Stent, and Guizhen Yang. HearSay: Enabling audio browsing on hypertext content. In Proceedings of the 13th International Conference on the World Wide Web (WWW 2004), 2004.
[107] T. V. Raman. Emacspeak—a speech interface. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '96), pages 66–71, Vancouver, British Columbia, Canada, 1996.
[108] T. V. Raman. Auditory User Interfaces: Toward the Speaking Computer. Kluwer Academic Publishers, Boston, MA, 1997.
[109] Charlie Reis, John Dunagan, Helen J. Wang, Opher Dubrovsky, and Saher Esmeir. BrowserShield: Vulnerability-driven filtering of dynamic HTML. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI 2006), 2006.
[110] Roadmap for Accessible Rich Internet Applications (WAI-ARIA Roadmap). World Wide Web Consortium, 2007. http://www.w3.org/TR/wai-aria-roadmap/.
[111] Jesse Ruderman. The same origin policy, 2008. http://www.mozilla.org/projects/security/components/same-origin.html.
[112] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, Englewood Cliffs, NJ, second edition, 2003.
[113] Alex Safonov. Web macros by example: Users managing the WWW of applications. In CHI '99 Extended Abstracts on Human Factors in Computing Systems (CHI '99), pages 71–72, New York, NY, USA, 1999. ACM Press.
[114] Graig Sauer, Harry Hochheiser, Jinjuan Feng, and Jonathan Lazar. Towards a universally usable CAPTCHA. In Proceedings of the 4th Symposium on Usable Privacy and Security (SOUPS 2008), Pittsburgh, PA, USA, 2008.
[115] Scott Schiller. SoundManager 2, 2007. http://www.schillmania.com/projects/soundmanager2/.
[116] C. Schuster and A. Von Eye. The relationship of ANOVA models with random effects and repeated measurement designs. Journal of Adolescent Research, 16(2):205–220, 2001.
[117] Scribd. http://www.scribd.com/. Accessed March 21, 2008.
[118] Ted Selker. Cognitive adaptive computer help (COACH). In Proceedings of the International Conference on Artificial Intelligence, pages 25–34, IOS, Amsterdam, 1989.
[119] Serotek System Access Mobile. Serotek. http://www.serotek.com/. Accessed November 7, 2007.
[120] David C. Howell. Statistical Methods for Psychology. PWS-Kent Publishing Company, Boston, third edition, 1992.
[121] Hironobu Takagi, Shinya Kawanaka, Masatomo Kobayashi, Takashi Itoh, and Chieko Asakawa. Social Accessibility: Achieving accessibility through collaborative metadata authoring. In Proceedings of the 10th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS 2008), pages 193–200, Halifax, Nova Scotia, Canada, 2008.
[122] Hironobu Takagi, Shin Saito, Kentarou Fukuda, and Chieko Asakawa. Analysis of navigability of web applications for improving blind usability. ACM Transactions on Computer-Human Interaction, 14(3):13, 2007.
[123] Talklets. Hidden Differences Group. http://www.talklets.com/. Accessed April 7, 2007.
[124] Jennifer Tam, Jiri Simsa, David Huggins-Daines, Luis von Ahn, and Manuel Blum. Improving audio CAPTCHAs. In Proceedings of the 4th Symposium on Usable Privacy and Security (SOUPS 2008), Pittsburgh, PA, USA, July 2008.
[125] Paul A. Taylor, Alan W. Black, and Richard J. Caley. The architecture of the Festival speech synthesis system. In Proceedings of the 3rd International Workshop on Speech Synthesis, Sydney, Australia, November 1998.
[126] Jim Thatcher, Paul Bohman, Michael Burks, Shawn Henry, Bob Regan, Sarah Swierenga, Mark D. Urban, and Cynthia D. Waddell. Constructing Accessible Web Sites. glasshaus Ltd., Birmingham, UK, 2002.
[127] Thunder screen reader. http://www.screenreader.net/. Accessed February 16, 2007.
[128] Turnabout. Reify Software. http://www.reifysoft.com/turnabout.php. Accessed June 7, 2006.
[129] Scott R. Turner. Platypus Firefox extension, 2006. http://platypus.mozdev.org/.
[130] Gregg Vanderheiden. Fundamental principles and priority setting for universal usability. In Proceedings of the 2000 Conference on Universal Usability (CUU 2000), pages 32–37, Arlington, Virginia, USA, 2000.
[131] Voice Extensible Markup Language (VoiceXML) 2.1. http://www.w3.org/TR/voicexml21/. Accessed April 6, 2007.
[132] VoiceOver: Macintosh OS X. http://www.apple.com/accessibility/voiceover/. Accessed April 5, 2007.
[133] Luis von Ahn, Manuel Blum, and John Langford. Telling humans and computers apart automatically: How lazy cryptographers do AI. Communications of the ACM, 47(2):57–60, February 2004.
[134] Luis von Ahn and Laura Dabbish. Labeling images with a computer game. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI 2004), April 2004.
[135] Luis von Ahn, Shiry Ginosar, Mihir Kedia, Ruoran Liu, and Manuel Blum. Improving accessibility of the web with a computer game. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI 2006), pages 79–82, New York, NY, USA, 2006. ACM Press.
[136] Michael Vorburger. Altifier: Web accessibility enhancement tool, 1999.
[137] Howie Wang. NextPlease!, 2006. http://nextplease.mozdev.org/.
[138] Takayuki Watanabe. Experimental evaluation of usability and accessibility of heading elements. In Proceedings of the International Cross-Disciplinary Conference on Web Accessibility (W4A 2007), pages 157–164, 2007.
[139] W3C Markup Validation Service v0.7.4. http://validator.w3.org/. Accessed November 11, 2006.
[140] Watchfire Bobby. http://www.watchfire.com/products/webxm/bobby.aspx. Accessed April 7, 2007.
[141] Web Accessibility Checker. University of Toronto Adaptive Technology Resource Centre (ATRC), 2006. http://checker.atrc.utoronto.ca/. Accessed March 15, 2007.
[142] Web Content Accessibility Guidelines 2.0 (WCAG 2.0). World Wide Web Consortium. http://www.w3.org/TR/WCAG20/.
[143] Web Content Accessibility Guidelines 1.0 (WCAG 1.0). World Wide Web Consortium, 1999.
[144] WebVisum Firefox extension, 2008. http://www.webvisum.com/.
[145] Ryen White and Steven M. Drucker. Investigating behavioral variability in web search. In Proceedings of the International Conference on the World Wide Web (WWW 2007), 2007.
[146] Window-Eyes. GW Micro. http://www.gwmicro.com/Window-Eyes/. Accessed April 3, 2008.
[147] Windows Narrator: Microsoft Windows XP and Vista, 2008. http://www.microsoft.com/enable/training/windowsxp/narratorturnon.aspx.
[148] Yahoo accessibility improvement petition. http://www.petitiononline.com/yabvipma/. Accessed September 3, 2008.
[149] Yeliz Yesilada, Simon Harper, Carole Goble, and Robert Stevens. Screen readers cannot see (ontology-based semantic annotation for visually impaired web travellers). In Proceedings of the 4th International Conference on Web Engineering (ICWE 2004), pages 445–458, 2004.
[150] Yeliz Yesilada, Robert Stevens, and Carole Goble. A foundation for tool based mobility support for visually impaired web users. In Proceedings of the 12th International Conference on World Wide Web (WWW 2003), pages 422–430, New York, NY, USA, 2003. ACM Press.

VITA

Jeffrey P. Bigham received his B.S.E. degree in Computer Science from Princeton University in 2003. Starting in fall 2003, he attended the University of Washington, where he worked with Richard E. Ladner. He has won the Microsoft Imagine Cup Accessible Technology Award, the W4A Accessibility Challenge Delegate's Award, the Andrew W. Mellon Foundation Award for Technology Collaboration, the NCTI Technology in the Works Award, and the University of Washington College of Engineering Student Innovator Award for Research. In 2008, he was awarded an Osberg Fellowship. He received his M.Sc. degree in 2005 and his Ph.D. in 2009, both in Computer Science and Engineering from the University of Washington.