Decision Support Systems 42 (2006) 1697–1714
www.elsevier.com/locate/dss

Supporting non-English Web searching: An experiment on the Spanish business and the Arabic medical intelligence portals

Wingyan Chung a,⁎, Alfonso Bonillas b,1, Guanpi Lai b,1, Wei Xi b,1, Hsinchun Chen b,1

a Department of Information and Decision Sciences, College of Business Administration, The University of Texas at El Paso, 500 W. University Avenue, El Paso, TX 79968, USA
b Artificial Intelligence Lab, Department of Management Information Systems, The University of Arizona, 1130 East Helen Street, McClelland Hall 430, Tucson, AZ 85721, USA

Received 3 March 2005; received in revised form 19 February 2006; accepted 22 February 2006
Available online 27 June 2006

Abstract

Although non-English-speaking online populations are growing rapidly, support for searching non-English Web content is much weaker than for English content. Prior research has implicitly assumed English to be the primary language used on the Web, but this is not the case for many non-English-speaking regions. This research proposes a language-independent approach that uses meta-searching, statistical language processing, summarization, categorization, and visualization techniques to build high-quality domain-specific collections and to support searching and browsing of non-English information. Based on this approach, we developed SBizPort and AMedPort for the Spanish business and Arabic medical domains respectively. Experimental results showed that the portals achieved significantly better search accuracy, information quality, and overall satisfaction than benchmark search engines. Subjects strongly favored the portals' search and browse functionality and user interface. This research thus contributes to developing and validating a useful approach to non-English Web searching and providing an example of supporting decision-making in non-English Web domains.
© 2006 Elsevier B.V. All rights reserved.
Keywords: Internet; Web; Searching; Browsing; Business intelligence; Medical intelligence; Spanish; Arabic; Non-English Web searching; Web portal; Mutual information; Summarization; Categorization; Visualization; Kohonen self-organizing map

⁎ Corresponding author. Tel.: +1 915 747 5496; fax: +1 915 747 5126. E-mail address: wchung@utep.edu (W. Chung).
1 Tel.: +1 520 621 2748; fax: +1 520 621 2433.
0167-9236/$ - see front matter. doi:10.1016/j.dss.2006.02.015

1. Introduction

The Internet has gained popularity worldwide and is estimated to continue to grow as access to Web content in different languages increases. A report published in September 2004 shows that the majority (64.8%) of the world's online population consists of non-English speakers [13]. Moreover, that population was estimated to grow significantly in the near future to 820 million while the size of the English-speaking online population was predicted to remain at 300 million [12]. For instance, there are more than 3.5 million Internet users in the Arab world [1], where the growth of Arabic Web content is estimated to double every year [28]. The Spanish-speaking online population has exceeded 9 million, and Latin America is estimated to have the fastest growing population in the world in the coming decades [2].

These statistics suggest a growing need for better support for Web searching in some non-English languages that individuals and organizations use on a daily basis. Non-English-speaking Internet users use their native language to search for useful information, and such searching typically happens across different regions where the language is used, as is the case for members of multinational organizations (MNOs) that have operations in multiple regions using the same language. These MNOs increasingly rely on the Internet when seeking information.
An example might be searching for opportunities to expand a business in Latin America. A medical institution in an Arab region may need to discover efficient ways to collect, analyze, and disseminate large amounts of information about different regions in the Middle East and North Africa.

Despite growing needs for non-English Web searching, most existing technologies have been developed for English-speaking users and fail to address the needs of non-English Web searching. Current search engines in Spanish and Arabic, for example, lack search and analysis capabilities. In particular, these search engines lack high-quality collections to support searching across different regions. Better approaches to overcoming these problems would provide system developers with insights to enhance non-English Web searching.

To address these needs, we propose in this paper a language-independent approach to building intelligent portals in non-English languages. Our goal was to develop and validate the approach to non-English Web searching. Based on the approach, we developed two Web search portals for the Spanish business and Arabic medical domains. We empirically studied the way these portals support decision-making and the related issues, using native Spanish and Arabic speakers as subjects.

The rest of the paper is structured as follows. Section 2 surveys previous research in non-English Web searching and search support in different languages. Section 3 presents a language-independent approach to supporting non-English Web searching and the two Web portals developed using the approach. Section 4 describes the methodology for evaluating the portals. Section 5 reports and discusses the findings. Section 6 concludes the paper and discusses future directions.

2. Literature review

Since the inception of the Internet, English has been the dominant language for communication on the Web. Prior research about Web searching has assumed implicitly that technologies are developed for English-speaking users.
However, as more non-English-speaking users have adopted Internet technologies, other languages have gained popularity. It therefore is useful to review previous research in information seeking on the Web in a multilingual world. We also review developments of Web search technologies for the Spanish-speaking and Arabic-speaking regions.

2.1. Information seeking on the Web

Researchers who have studied information seeking on the Web have described the process of information seeking as consisting of various stages of problem identification, problem definition, problem resolution, and solution presentation [39]. Variations of this process model can be found in the literature [18,22,35]. Two major information-seeking activities are searching and browsing. Prior research has considered searching to include behaviors ranging from goal-directed information searching, where the user has a specific target in mind, to more serendipitous or exploratory information browsing, where no specific goal is present besides the intention to explore the information repository [35].

In directed searching, the user first decomposes his goal into smaller problems, then expresses his needs as concepts and higher level semantics, formulates queries using such supports as Boolean query languages and syntax-directed editors, and finally evaluates the results by serial search or systematic sampling. In exploratory browsing, the user first transforms his general information need into a problem. He then (1) articulates that need as search terms or hyperlinks that appear on the system interface; (2) searches using the terms or explores the hyperlinks using such browse supports as automatic summarization, clustering and visualization tools, and Web directories; and (3) finally evaluates the results by scanning through them.

2.1.1. Support for Web searching and browsing

To support Web searching and browsing, various types of information technologies have been proposed.
Meta-searching has been found to be a promising method [4] to alleviate biases of search results from different search engines [26] by sending queries to multiple search engines and collating the set of top-ranked results from each engine. In addition, post-retrieval analysis provides added value to results returned by search engines. Previews and overviews of retrieved Web pages are important elements in post-retrieval analysis. A preview is extracted from, and acts as a surrogate for, a single object of interest [14]. Document summarization techniques provide previews of individual Web pages in the form of indicative summaries [10], query-biased summaries [36], or generic summaries [25]. An overview is constructed from and represents a collection of objects of interest [14]. Document categorization techniques such as the self-organizing map algorithm [17] have been used to categorize and search Web pages [5]. Document visualization techniques also have been used to amplify human cognition in browsing Internet search results [20,23].

Despite the potential advantages of meta-searching and information previews and overviews, they rarely have been applied to non-English search engines. One such application is CBizPort, which helps users to search, browse, categorize, and summarize Chinese business information [7] but does not contain a domain-specific collection and lacks useful functionality such as visualization.

2.1.2. Information quality

Information quality, a multifaceted concept, is considered to be an important aspect of evaluating the quality of a Web site [21]. Wang and Strong [38] evaluated information quality using a set of 16 dimensions that were tested in [33]. These dimensions were for the most part used in evaluating the quality of information of organizations or companies, not the quality of information obtained from search engines.
Although Marsico and Levialdi [24] have developed a Web site evaluation methodology that considers a site's information quality, their methodology was designed for evaluating general Web sites (e.g., travel information Web sites) and does not consider the special requirements of non-English Web searching. There have been studies on cross-regional use of Chinese search engines (e.g., [7]), but because Chinese is mainly used in three geographically close regions (mainland China, Hong Kong, and Taiwan), its regional characteristics are less apparent than those of Spanish and Arabic, which are used across continents and widely separated regions. Unfortunately, no attempts have been made to study the cross-regional impacts of Spanish and Arabic search engines, an evaluation of which could improve understanding of the optimal design of search engines and portals.

2.2. Search engines for Spanish-speaking and Arabic-speaking regions

As more non-English-speaking people use the Internet to search and browse information, major search engines have attempted to expand their services for non-English speakers. Regional search engines that provide more localized searching have begun to emerge. In addition to English, these search engines typically accept queries in a user's native language and return pages from the regions being served. A survey of major search engines in Spanish and Arabic, two widely used languages that are gaining popularity on the Web, follows.

2.2.1. Spanish search engines

Major search engines have been developed for Spanish, the second most popular language in the United States and the primary language for Spain and some 22 Latin American countries. Terra (http://www.terra.com/) offers its services to more than 3.1 million Internet users in Europe and the Americas. A Gallup poll in 2002 reported Terra to be the most popular search engine in Spain; Wanadoo (http://www.wanadoo.com/), a subsidiary of France Telecom, was rated second [11].
Currently, Terra serves more than 3 million Internet users in Spain, Latin America, the United States, and many European countries. Supporting Web searching in English and French as well as Spanish, Wanadoo is currently the leading Internet service provider in France and the United Kingdom, with 9.3 million customers in June 2004.

Spanish search engines serving Latin America include Yahoo Español, Ahijuna, Auyantepui, Quepasa, Bacan, and Conexcol. Yahoo Español (YahooES; Spain, http://espanol.yahoo.com/), the Spanish version of Yahoo, provides a human-compiled Web directory developed by about 150 editors who categorized over one million listed sites. YahooES also supplements its results with those from Inktomi and Google. Inktomi matches appear to users only after all YahooES matches have been shown. Established in 1995, BIWE (Buscador en Internet para la web en Español, http://www.biwe.com/) is one of the earliest search engines for searching Spanish information on the Web. BIWE supports searching of news, products, images, and other information and provides a variety of services including a Web directory, email, entertainment, and market information for Hispanics. Headquartered in the United States, Quepasa (http://www.quepasa.com/) was launched in 1997 and is a bilingual Web portal (Spanish and English) serving Hispanic populations in the United States and Latin America. It uses proprietary Web search technologies to reduce the number of irrelevant results by utilizing terms most frequently used and documents most frequently viewed [32]. Quepasa also offers other services such as news, email, online radio, chat, online translation, forums, and Web hosting.

The following Spanish search engines primarily serve their own or adjacent regions. Launched in 1998, Ahijuna (Argentina, http://www.ahijuna.com.ar/) provides search services for Argentine Web sites and other Spanish Web sites. It contains a Web directory with 14 categories having a total of 7578 hyperlinks.
Based in Venezuela, Auyantepui (http://www.auyantepui.com/) provides a searchable Web directory of Spanish sites. It grew from 14 categories listing 117 Web sites in 1996 to 550 categories with over 18,000 Web sites in 2002. Launched in 1998, Conexcol (Colombia, http://www.conexcol.com/) provides a searchable Web directory containing 14 categories having 400 subcategories and 13,214 Web site URLs. With more than 150,000 unique visitors per month, it is one of the top four most visited sites in Colombia. Bacan (Ecuador, http://www.bacan.com/), a major search engine in Ecuador, began its operations in 1996. It provides services such as news, email, online chat, entertainment, and shopping guides. Every month Bacan has 80,000 individual visitors and generates over 2 million hits. Ascinsa Internet (http://www.ascinsa.com/) is widely used in Peru and contains Web sites from Latin American countries and the United States. It provides services such as Internet access, email, Web page design, domain registration, and Web hosting, among others. It also contains a directory listed by countries and then by domains.

Table 1 summarizes the content and functionality of major Spanish search engines. Although different types of information are provided, these search engines typically present results as a long textual list and lack post-retrieval analysis capabilities. Moreover, except for some large portals such as Yahoo Español, BIWE, and Terra, most Spanish search engines serve a few regions rather than an entire Spanish-speaking community.

Table 1
Comparing major Spanish search engines
[Table body not recoverable from the extracted text; only its labels survive.]
Search engines compared: Terra (Spain), Wanadoo (France), Auyantepui (Venezuela), Ascinsa (Peru), Conexcol (Colombia), Bacan (Ecuador), Quepasa (Mexico and U.S.), YahooES (Spain), Ahijuna (Argentina), BIWE (Spain).
Content labels, in extracted order: Spain; Web pages and news on IT; Business; Government; Financial; Medical; Other Latin American countries; General; Size of collection.
Functionality labels, in extracted order: Links to related resources; Membership services; Newsgroup search; Web directory; Search for Web sites; Search stock prices; Filtering for adult content; Online translation tool; Search for news; Multimedia search (image, music, software, etc.); User interface.

2.2.2. Arabic search engines

Arabic is spoken by more than 284 million people in about 22 countries. Although Arabic is the fifth most frequently spoken language in the world, the Arabic Web is still in its infancy, constituting less than 1% of the total Web content and having a low 2.2% penetration rate [1]. The cross-regional use of Arabic and the exponential growth of the Arabic Web [28] nevertheless have highlighted the necessity of providing better Web searching and browsing.
Four major search engines offer the Arab World comprehensive services and extensive content coverage. Ajeeb (http://www.ajeeb.com/) is a bilingual Web portal (English/Arabic) launched in 2000 by Sakhr Software Company. Its database contains over one million searchable Arabic Web pages, which can be translated to English using the online version of Sakhr's machine-translation software. In addition, Ajeeb has a multilingual dictionary and is known for its large Web directory, “Dalil Ajeeb,” which the company claims is the world's largest online Arabic directory. Ajeeb has launched Johaina, an automatic tool that gathers news from many Middle Eastern and worldwide news agencies. Using Sakhr's “IDRISI” search engine, Johaina gathers mainly Middle East related news and categorizes it into primary and secondary topic categories.

Albawaba.com (http://www.albawaba.com/) is a consumer portal offering comprehensive services including news, sports, entertainment, e-mail, and online chatting. The portal supports searching for both Arabic and English pages, and the results are classified according to language and relevancy. Albawaba also provides meta-searching of other search engines (Google, Yahoo, Excite, Alltheweb, Dogpile) and a comprehensive directory of all Arab countries.

Launched in 2000, UAE-based Albahhar (http://www.albahhar.com/) searches its 1.25 million Arabic Web pages and provides Arabic speakers with a wide range of other online services such as news, online chatting, and entertainment.

Based in New Hampshire, Ayna (http://www.ayna.com/) is a Web portal providing an Arabic Web directory, an Arabic search engine, and other services such as a bilingual (English/Arabic) email system, chat, greeting cards, personal homepage hosting, and personal commercial classifieds.
In July 2001, Ayna had over 700,000 registered users and provided access to more than 25 million pages per month. Due to Ayna's popularity, Alexa Research ranks it among the top three leading Web sites in the Arab World.

Table 2 compares the content and functionality of the Arabic search engines mentioned above. Despite their rich content and comprehensive services, Arabic search engines lack post-retrieval capability and their contents tend to be general, offering limited resources to serve domain-specific needs. They also fall short of supporting advanced search and browse functions. For example, none of them supports categorization or visualization of search results.

Table 2
Comparing major Arabic search engines

Content              Ajeeb       Albawaba    Albahhar    Ayna
Business             ✓           ✓           ✓           ✓
Government           ✓           ✓           ✓           ✓
Financial            ✓           ✓           ✓           ✓
Medical              ✓           ✓           ✓           ✓
General              ✓           ✓           ✓           ✓
Size of collection   Very good   Good        Very good   Very good

Functionality (same four search engines)
[Cell values not recoverable from the extracted text; only the row labels survive.]
Rows: Encoding conversion (utf8-CP1256); Links to related resources; Membership services; Web directory; Search for Web sites; Search by time period; Search for news; Languages of the search database; Cross-regional search support; User interface; System reliability.

2.3. Summary

Because existing search engines in Spanish and Arabic typically lack analysis capabilities, they limit users' ability to understand retrieved results. The collections searched by these search engines are often region-specific, so they do not provide a comprehensive understanding of the environment where they are operating. Major English search engines such as Google provide searching of non-English resources but fall short of covering domain- and region-specific information.
There is a need for better approaches to overcoming these problems and to providing high-quality information to users in multinational organizations. We therefore propose a language-independent approach to address the following questions:

1. How can a language-independent approach support Web searching in non-English languages that are widely used across different regions?
2. How well do portals developed by the approach perform in comparison with existing search engines, in terms of accuracy, precision, recall, and user satisfaction?
3. Compared with existing search engines, what is the information quality of the portals developed by the approach?

3. A language-independent approach

In this section, we describe a language-independent approach to supporting non-English Web searching. The approach uses meta-searching, statistical language processing, summarization, categorization, and visualization techniques to build high-quality domain-specific collections and to support searching and browsing of non-English information on the Web. Specifically, we used the approach to build two Web portals providing domain-specific collections for non-English Web searching and post-retrieval analysis for the Spanish business and Arabic medical domains. Because the implementation of the approach requires no (or minimal) customization to the portals' languages, it allows system developers to easily adapt the development to new languages and domains.

3.1. The SBizPort and AMedPort

The chosen domains of the two portals, Spanish Business Intelligence Portal (SBizPort) and Arabic Medical Intelligence Portal (AMedPort), represent important segments of the Web of interest to individual users and multinational organizations. Given the growing Spanish-speaking populations in the United States, Spain, and Latin America, businesses actively expand their opportunities by seeking information on the Web. Meanwhile, the growing Arabic online population and medical professionals seek a comprehensive, one-stop Web portal through which to communicate medical information among different Arab regions. The SBizPort and AMedPort were developed to address these growing needs.

In addition to providing relevant information, the portals support intelligence gathering and analysis, where intelligence is defined as the product of acquisition, interpretation, collation, assessment, and exploitation of information in the respective domains [6]. The intelligence is presented in the forms of collated search results obtained from various high-quality information sources, Web page summaries, categorized search results, and visual maps showing clusters of Web pages. Table 3 provides topical coverage details of the two portals, screen shots of which are shown in Figs. 1 and 2.

Table 3
Topical coverage details of the two portals

Scenario
  SBizPort: The user searches for Spanish Web pages about electronic commerce.
  AMedPort: The user searches for Arabic medical information about excretion.
Search page
  SBizPort: The query “electronic commerce” is used (Fig. 1(a)).
  AMedPort: The query “excretion” is used (Fig. 2(a)).
Result page
  SBizPort: Approximately 40 results from 4 meta-searchers (top 10 from each) are displayed (Fig. 1(b)).
  AMedPort: Approximately 30 results from 4 meta-searchers (top 10 or fewer from each) are displayed (Fig. 2(b)). Examples of the results include “middle ear infection” and “skin symptoms of diabetes.”
Categorizer
  SBizPort: The categorizer groups retrieved Web pages into 20 folders, among which are labeled “customs agent,” “electronic commerce,” and “foreign commerce” (Fig. 1(c)).
  AMedPort: Examples of categories include “children's education” and “sports.” The user selects the seventh category titled “special education,” within which he browses 2 Web pages about “questions and answers” (Fig. 2(c)).
Summarizer
  SBizPort: The summarizer provides a 3-sentence summary of the circled result that contains information on an e-commerce event held in 2000 in Caracas, Venezuela (Fig. 1(d)).
  AMedPort: The 3-sentence summary is listed on the left while the original page about petrochemical information is displayed on the right (Fig. 2(d)).
Visualizer
  SBizPort: The SOM visualizer categorizes about 40 Web pages onto 2 regions labeled “foreign commerce” and “international commerce” and displays hyperlinks on the right (Fig. 1(e)).
  AMedPort: The SOM visualizer categorizes about 30 Web pages onto 6 regions and displays hyperlinks on the right (Fig. 2(e)). Examples of the regions include “special education” and “sports.”

Fig. 1. Screen shots of SBizPort. (a) Search page: the user types in “comercio electronico” to search for BI information about electronic commerce in Spanish-speaking regions; the page provides search, categorize, and visualize buttons. (b) Result page: retrieved pages' titles and abstracts are listed, with links to summarize in 3 or 5 sentences. (c) Categorizer: retrieved pages are categorized into folders labeled by key phrases. (d) Summarizer: the summary is listed on the left while the original page is on the right. (e) Visualizer: the SOM visualizer categorizes about 40 Web pages onto 2 regions and displays hyperlinks on the right.

Fig. 2. Screen shots of AMedPort. (a) Search page: the user types in “excretion” to search for medical information about the topic in Arab regions; the page provides a virtual Arabic keyboard and buttons to search for, categorize, and visualize results. (b) Result page: retrieved pages' titles and abstracts are listed, with links to summarize in 3 or 5 sentences. (c) Categorizer: retrieved pages are categorized into folders labeled by key phrases. (d) Summarizer: the summary is listed on the left while the original page is on the right. (e) Visualizer: the SOM visualizer categorizes about 30 Web pages onto 6 regions and displays hyperlinks on the right.

3.2. Steps in the approach

Our approach consists of five major steps, which are described in the context of building SBizPort and AMedPort.

3.2.1. Collection building and searching

Figs. 1(a), 1(b), 2(a), and 2(b) show the search and result pages of the two portals. On the search page, a user can input keywords and choose whether to search, organize, or visualize the results. The user can input multiple keywords separated by line breaks and can choose among a number of carefully selected information sources from the Spanish or Arab regions by checking the boxes. The result page lists search results according to the information sources selected by the user.

To provide high-quality information, we manually analyzed the existing information sources in the two domains. For the Spanish business domain, key business categories such as e-commerce, international business, and competitive intelligence were searched to obtain seed URLs (translated into English) that were used for domain spidering/collecting of Web pages. More than 183 seed URLs were obtained. A Web crawler then followed these URLs to collect pages automatically. The pages were then automatically indexed and stored in our database. In addition to domain spidering, we performed meta-spidering of six major search engines (Yahoo ES, Ahijuna, Conexcol, Ambdirecto, Auyantepui, and Teoma) using queries translated from English queries that previously had been used to build an English business intelligence search portal [23]. We chose these search engines because of their rich Spanish business content. The Spanish business collection obtained from this method contained more than 476,084 Web pages covering more than 22 countries.
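The domain-spidering step described above (a crawler that starts from seed URLs and follows links to collect pages) can be pictured as a breadth-first traversal. The sketch below is only an illustration under stated assumptions: the function name `crawl`, the caller-supplied `fetch_links` callback, and the `max_depth` and `max_pages` limits are all hypothetical and are not taken from the paper, which does not describe the crawler's internals.

```python
from collections import deque
from typing import Callable, Iterable, List


def crawl(
    seed_urls: Iterable[str],
    fetch_links: Callable[[str], Iterable[str]],  # hypothetical: returns outgoing links of a page
    max_depth: int = 2,      # assumed limit; not specified in the paper
    max_pages: int = 1000,   # assumed limit; not specified in the paper
) -> List[str]:
    """Collect URLs breadth-first, starting from the seed URLs."""
    seeds = list(dict.fromkeys(seed_urls))  # drop duplicate seeds, keep order
    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)
    collected: List[str] = []

    while queue and len(collected) < max_pages:
        url, depth = queue.popleft()
        collected.append(url)
        if depth >= max_depth:
            continue  # keep the page but do not follow its links
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return collected


# Usage with an in-memory link graph standing in for the Web.
toy_web = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": ["e"]}
pages = crawl(["a"], lambda url: toy_web.get(url, []), max_depth=2)
print(pages)
```

In a real system `fetch_links` would download and parse each page, and the collected pages would then be indexed and stored, as the text describes.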
Similarly, our Arabic medical collection was built by using 105 seed URLs collected from seven major search engines (Google, Yahoo, AlltheWeb, Ajeeb, ArabVista, AltaVista, and DMOZ) and by meta-spidering these URLs using keywords from an Arabic medical glossary [16]. The results were then filtered depending on their number and quality. The resulting Arabic medical collection contained more than 220,000 Web pages covering more than 22 countries.

Apart from searching its own database, the SBizPort supports meta-searching two domain-specific databases (SBizPort collection and AMBDirecto) and six Spanish general search engines (Yahoo Español, Terra, Ahijuna, Auyantepui, Bacan, and Ascinsa). The AMedPort supports meta-searching three domain-specific databases (AMedPort, Sehha.com, and ArabMedmag.com) and three Arabic general search engines (Ba7th.com, ArabVista.com, and Ayna). These meta-search engines were chosen because of their rich content and domain-specific coverage. A virtual keyboard provided for AMedPort facilitates input (see Fig. 2(a)).

3.2.2. Summarizer

The SBizPort and AMedPort summarizers were modified from an English summarizer that uses sentence-selection heuristics to rank text segments [25]. These heuristics strive to reduce redundancy of information in a query-based summary [3]. The summarization takes place in three main steps: (1) sentence evaluation, (2) segmentation or topic identification, and (3) segment ranking and extraction. First, a Web page to be summarized is fetched from the remote server and parsed to extract its full text. All sentences are extracted by identifying punctuation serving as periods. Important information such as the presence of cue phrases (e.g., “therefore,” “in summary” in the respective languages), sentence lengths, and sentence positions is also extracted for ranking the sentences. Second, we use the TextTiling algorithm [15] to analyze the Web page and determine topic boundaries.
A Jaccard similarity function is used to compare the similarity of different blocks of sentences. Third, we rank document segments identified in the previous step according to the ranking scores obtained in the first step, and key sentences are extracted as the summary.

The summarizer can summarize Web pages flexibly, using three or five sentences. Users can invoke it by clicking the number of sentences for summarization under each result. A new window is then activated (shown in Figs. 1(d) and 2(d)) that displays the summary and the original Web page.

3.2.3. Categorizer

The SBizPort and AMedPort categorizers organize the Web pages (related to the query shown on top) into 20 (or fewer) folders labeled by the key phrases appearing most frequently in the page summaries or titles (see Figs. 1(c) and 2(c)). Each categorizer relies on a phrase lexicon in the relevant language to extract phrases from Web page summaries obtained from meta-searching or searching our collections. To create the lexicons, we collected a large number of Web pages in the two domains. From each collection of pages, we extracted meaningful phrases by using the mutual information approach, a statistical method that identifies significant patterns as meaningful phrases from a large amount of text in any language [30]. The approach is an iterative process of identifying significant lexical patterns by examining the frequencies of word co-occurrences in a large amount of text. The mutual information (MI) algorithm is used in the approach to compute how frequently a pattern appears in the corpus, relative to its sub-patterns. Based on the algorithm, the MI of a pattern $c$ ($\mathrm{MI}_c$) can be found by

$$\mathrm{MI}_c = \frac{f_c}{f_{\mathrm{left}} + f_{\mathrm{right}} - f_c}$$

where $f$ stands for the frequency of a set of words. Intuitively, $\mathrm{MI}_c$ represents the probability of co-occurrence of pattern $c$, relative to its left sub-pattern and right sub-pattern. Phrases with high MI are likely to be extracted and used in automatic indexing.
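As a rough illustration of the MI score just defined, the sketch below computes it from raw frequencies and, as a convenience, from a list of tokens. The function names and the simple whitespace tokenization are assumptions made for this example; the paper's own implementation relies on an updateable PAT-tree rather than the plain counters used here. The first call uses the frequencies given for “gerencia del conocimiento” in this section.

```python
from collections import Counter
from typing import Sequence


def mutual_information(f_pattern: int, f_left: int, f_right: int) -> float:
    """MI_c = f_c / (f_left + f_right - f_c)."""
    denominator = f_left + f_right - f_pattern
    if denominator <= 0:
        raise ValueError("inconsistent frequencies: denominator must be positive")
    return f_pattern / denominator


def ngram_counts(tokens: Sequence[str], n: int) -> Counter:
    """Count every contiguous n-gram in the token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def mi_for_pattern(tokens: Sequence[str], pattern: Sequence[str]) -> float:
    """MI of a pattern (two or more words) against its left and right sub-patterns."""
    pattern = tuple(pattern)
    if len(pattern) < 2:
        raise ValueError("pattern must contain at least two words")
    f_c = ngram_counts(tokens, len(pattern))[pattern]
    f_left = ngram_counts(tokens, len(pattern) - 1)[pattern[:-1]]   # all but the last word
    f_right = ngram_counts(tokens, len(pattern) - 1)[pattern[1:]]   # all but the first word
    return mutual_information(f_c, f_left, f_right)


# Frequencies from the paper's example: 100 / (110 + 105 - 100)
print(round(mutual_information(100, 110, 105), 2))  # 0.87

# Toy corpus (illustrative only)
corpus = "gerencia del conocimiento y gerencia del proyecto".split()
print(mi_for_pattern(corpus, ["gerencia", "del", "conocimiento"]))  # 0.5
```

Recounting n-grams on every call is wasteful on a large corpus; it is done here only to keep the example short.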
For example, if the Spanish phrase “gerencia del conocimiento” (knowledge management) appears in the corpus 100 times, the left sub-pattern (gerencia del) appears 110 times and the right sub-pattern (del conocimiento) appears 105 times, then the mutual information (MI) for the pattern “gerencia del conocimiento” is 100 / (110 + 105 − 100) = 0.87. In addition, we employed an updateable PAT-tree data structure developed in [30] that supports online frequency update after removing extracted patterns to facilitate subsequent extraction. Repetitive removal of sub-patterns therefore is not necessary. In addition, we used a stop word list and manual filtering to refine the results obtained. Using the approach, we extracted 19,417 phrases from the SBizPort collection and 68,079 phrases from the AMedPort collection. The categorizer then uses these phrases to categorize the Web pages non-exclusively (see Figs. 1(c) and 2(c)).

3.2.4. Visualizer

The resulting portals also support visualization of Web pages retrieved using a Kohonen self-organizing map (SOM) algorithm [17] to categorize and place Web pages onto a two-dimensional jigsaw map [23] (see Figs. 1(e) and 2(e)). SOM is a neural network algorithm that has been used in image processing and pattern recognition applications. When applied to automatic categorization and visualization of Web pages, SOM assigns similar pages to adjacent regions with each region labeled by the most frequently occurring phrases extracted by the mutual information approach described. The larger a region on the map, the more Web pages are assigned to it. Users can click on a region to see a list of pages on the right and can open pages by clicking the link-embedded titles.

3.2.5. Web directory

In addition, each of our portals provides a Web directory of the resources in its specific domain.
Organized in a hierarchical manner, the directory was built from a combination of human identification and meta-searching. The Spanish business directory contains 295 categories and the Arabic medical directory contains 232 categories. Both have a depth of 5 levels.

3.3. Enhancements of the approach

We believe that the proposed approach offers benefits and new enhancements in five aspects:

(1) New integration of existing techniques: Although some of the techniques used in the approach have been studied in prior work, we have not found a comprehensive approach that addresses the problem of information quality on the Web and the need for Web searching in languages used in widely separated geographic regions (e.g., Spanish and Arabic). By integrating human analysis with existing techniques for text processing, our approach was developed to alleviate information overload in searching and browsing Web content in non-English languages. For example, we have customized the Kohonen self-organizing map algorithm to the Spanish business and Arabic medical domains to support dynamic visualization of Web pages. This integration of visualization technique has been enhanced from our previous work [6,23] by considering languages (Spanish and Arabic) used widely in a multitude of geographic regions and by applying the technique to non-English domains. To our knowledge, there has been no previous attempt to integrate the technique into an application similar to the portals described here.

(2) Collection building: Previous work on building Web collections typically focuses on English content due to the more abundant resources available. To deal with the challenge of supporting non-English Web searching, our proposed approach was used to build non-English Web collections encompassing wide arrays of geographic regions and content providers.
For example, the SBizPort collection was built from spidering more than 183 Spanish business Web sites located in such regions as Argentina, Bolivia, Central America, Chile, Colombia, Ecuador, Spain, Mexico, Paraguay, Peru, Uruguay, and Venezuela. The AMedPort collection covered Web resources obtained from such regions as Saudi Arabia, Bahrain, Lebanon, Tunisia, Kuwait, Egypt, United Arab Emirates, Switzerland, United Kingdom, USA, Russia, and Canada. While existing search engines in those regions mainly provide regional services, the SBizPort and AMedPort collections respectively serve the entire communities that use Spanish and Arabic in Web searching. The collections also represent new advances over the English business collection built in [23] and the lack of its own Web collection in [6].

(3) Language processing: To extract meaningful phrases as input for the categorizer and visualizer, we used the mutual information technique that considered the co-occurrence of terms in a large corpus (see Section 3.2.3). Because the approach used the probabilities of the terms appearing in the corpus rather than their linguistic patterns as the criterion for extraction, the technique was statistics-based and hence different from linguistic techniques used in previous research (e.g., [23]). Compared with our previous work [6], in which the system served only three closely located geographic regions (China, Taiwan, and Hong Kong), we have enhanced the performance of this technique by using a large number of Web pages from different regions as our corpus and by testing the technique in the two chosen languages.

(4) User interface customization: The user interfaces of SBizPort and AMedPort were specially designed to bring out the industry features and to address the language-specific needs. For example, AMedPort provides a virtual keyboard to assist in the input of the right-to-left Arabic language.
The images in the SBizPort user interface are related to major industries in Latin America.

(5) Application domains: This research has extended to domains such as Arabic medicine and Spanish business that are less explored in prior work. As the online populations in these two languages will grow significantly (see Section 1), this work helps system developers to easily customize their development to the particular language they consider.

We believe that our approach can help multinational organizations to search effectively for non-English information on the Web.

4. Evaluation methodology

In this section, we describe our methodology for evaluating the usability of the Web portals developed by our approach. Our evaluation objectives are: (1) to study how the Web portals developed by our approach can assist searching and browsing of specialized domains on the Web; (2) to compare our portals with existing search engines in order to understand the effectiveness and efficiency of our portals; and (3) to evaluate the information quality and user satisfaction achieved by using our portals.

To achieve objective (1), we invited human subjects to use our portals to search and browse the Spanish business or Arabic medical domains, two specialized domains that do not have as much coverage on the Web as their English counterparts. To achieve objective (2), we selected BIWE and Ayna as benchmarks against which to compare SBizPort and AMedPort because of their comprehensive coverage and functionality. BIWE (http://www.biwe.com/) is a major Spanish search engine providing information for the Spanish-speaking community. It also has a detailed Web directory for users to browse topics in which they are interested. Compared with other Spanish search engines, BIWE's services are more comprehensive and are targeted more closely at Hispanics. As one of the most visited Arab Internet hubs, Ayna (http://www.ayna.com/) serves Arabic-speaking people of the Middle East and North Africa.
Unlike many Arabic search engines, Ayna is stable and reliable, which makes it a good benchmark for a fair comparison with AMedPort. To achieve objective (3), we asked subjects to provide subjective ratings and comments on information quality and user satisfaction.

4.1. Experimental design

We designed scenario-based search and browse tasks consistent with Text Retrieval Conference standards [37] to evaluate the performance of our Web portals. For example, a scenario for testing SBizPort was “America Online (AOL) in Latin America,” where a search task was “When was AOL Latin America launched in the United States?” and a browse task was “Find the URLs of financial portals where you can find stock quotes on America Online.” In a scenario for testing AMedPort, “Prevention and treatment of cancer,” a search task was “Give the name of one vitamin that helps to prevent cancer,” and a browse task was “Find articles about healthy diet and cancer prevention.” To further validate the relevance of tasks, before conducting the actual experiment we did a pilot test with three subjects for each portal.

We recruited 19 Spanish students and 11 Arab students as volunteer subjects to evaluate the performance of the SBizPort and AMedPort. In each one-hour experiment, we introduced two systems (our portal and the benchmark system) to a subject and randomly assigned different scenarios to evaluate the systems. Each scenario contained two search tasks and one browse task. To test the impact of the domain-specific collection, we asked the subjects not to use the collection in the first task when using our portal but to use it in the second task. In the third task, we asked the subjects to use the SOM visualizer when using our portal and to use the available browse tools (e.g., hyperlinks, Web directory) when using the benchmark search engine (see Table 4).
Although we did not impose any time limit on completing the tasks, we found that each subject spent an average of three minutes to finish a search task and eight minutes to finish a browse task. The order in which the systems were used was randomly assigned to avoid bias due to sequence of use. After using a system, a subject filled in a post-session questionnaire with ratings and comments on the system. The experimenter recorded all verbal comments or behavioral observations that were later analyzed using protocol analysis [9]. Upon finishing the study, each subject also filled in a post-study questionnaire to rate each system in terms of information quality and overall satisfaction and to provide additional feedback.

Table 4
A summary of the experimental setup

System    Scenario    Task       Task type
First     1           1 and 2    Search
                      3          Browse
Second    2           4 and 5    Search
                      6          Browse

The systems and scenarios were randomly assigned to subjects.

The questionnaire was developed based on the user satisfaction measures used in [8,19]. We asked the subjects to rate their satisfaction on each system along a seven-point Likert scale. To measure information quality, we modified the 16-dimension construct developed in [38] by dropping the “security” dimension, which is not relevant because the information provided by the systems is already public. To accommodate the different levels of importance in the remaining 15 dimensions, we invited two experts to provide ratings on the relative importance of different dimensions in the two domains (see Table 5). The Spanish business expert is a senior executive of a management consulting company in Mexico. Being a native Spanish speaker, he had 24 years of experience in business development, raising capital, negotiations, finance, and strategic planning. He also had worked as the Vice President of Business Development for the Gallup Organization in Mexico.
The Arabic medical expert is an Arab microbiology Ph.D. student at a major research university in the United States. These experts provided answers that we used to judge subjects' performances in the tasks. The subjects also provided demographic information, which was kept confidential in accordance with the Institutional Review Board Guidebook [31].

4.2. Hypothesis testing

Because the Web portals developed by our approach encompassed Web resources from different Spanish or Arab regions, we believed that they would provide richer content and higher usability than those of benchmark systems. Users could thus find relevant results more quickly from our portals. With respect to the two domains, we tested the following five sets of hypotheses, none of which had been explored in previous research.

H1. Using a domain-specific collection in SBizPort/AMedPort enables users to achieve higher effectiveness and efficiency than performing search tasks without its support.

H2. SBizPort/AMedPort enables users to achieve higher effectiveness and efficiency than relying on benchmark search engines for searching.

H3. The use of SOM visualizer in SBizPort/AMedPort enables users to achieve higher effectiveness and efficiency than using benchmark search engines to perform browse tasks.

H4. SBizPort/AMedPort users achieve a higher overall satisfaction than users of a benchmark search engine.
H5. SBizPort/AMedPort provides higher information quality than a benchmark search engine.

Table 5
Definitions of 15 dimensions of information quality and expert ratings

Presentation quality and clarity
  Accessibility: The extent to which information is available, or easily and quickly retrievable
  Concise representation: The extent to which information is compactly represented
  Consistent representation: The extent to which information is presented in the same format
  Ease of manipulation: The extent to which information is easy to manipulate and apply to different tasks
  Interpretability: The extent to which information is in appropriate languages, symbols, and units, and the definitions are clear

Coverage and reliability
  Appropriate amount of information: The extent to which the volume of information is appropriate for the task at hand
  Believability: The extent to which information is regarded as true and credible
  Completeness: The extent to which information is not missing and is of sufficient breadth and depth for the task at hand
  Free-of-error: The extent to which information is correct and reliable
  Objectivity: The extent to which information is unbiased, unprejudiced, and impartial

Usability and analysis quality
  Relevancy: The extent to which information is applicable and helpful for the task at hand
  Reputation: The extent to which information is highly regarded in terms of its source or content
  Timeliness: The extent to which information is sufficiently up-to-date for the task at hand
  Understandability: The extent to which information is easily comprehended
  Value-added: The extent to which information is beneficial and provides advantages from its use

Expert rating (Spanish and Arab experts): 3 = extremely important, 2 = very important, 1 = important. Every dimension was rated 2 or 3 by both experts; the per-dimension values cannot be aligned to the rows in this copy of the table.
To test H1, we compared the performances of using (task 2) and not using (task 1) our domain-specific collections. To test H2, we compared the search performances of our portal and the benchmark search engine. To test H3, we compared browse performances of using our portal's SOM visualizer and the benchmark search engine's browse support tools. Because previous research [7] conducted a focused evaluation on the use of summarizer and categorizer to support Web searching and browsing, we did not repeat the evaluation of these tools here. To test H4 and H5, we compared subjects' ratings on the aforementioned aspects. As each subject was asked to perform similar tasks using the two systems, we used a one-factor repeated-measures design, which gives greater precision than designs that employ only between-subjects factors [27].

4.3. Performance measure

We recorded the time the subject spent on each task to measure the efficiency of using a system. We also measured the effectiveness of using a system by the following formulae:

$$\text{Accuracy} = \frac{\text{Number of correctly answered parts}}{\text{Total number of parts}}$$

$$\text{Precision} = \frac{\text{Number of relevant URLs identified by the subject}}{\text{Number of all URLs identified by the subject}}$$

$$\text{Recall} = \frac{\text{Number of relevant URLs identified by the subject}}{\text{Number of relevant URLs identified by the expert}}$$

$$F\ \text{value} = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}}$$

Accuracy reflects how well a system finds correct answers for search tasks. To measure the browse task performance, we used precision, recall, and F value. Precision reflected how well the portal helped users find relevant results and avoid irrelevant results. Recall reflected how well the portal helped users find all the relevant results that had been identified by experts. F value was used to balance recall and precision simultaneously [34], reflecting the performances achieved by the expert and by subjects.

5. Experimental results and discussions

In this section, we report and discuss the results of our user evaluation study. Table 6 summarizes the means and standard deviations of various performance measures. Table 7 shows the p-values and results of testing various hypotheses. Table 8 summarizes subjects' demographic profiles.

5.1. SBizPort performance

5.1.1. Search performance

Using SBizPort's domain-specific collection achieved higher mean accuracy and lower mean efficiency than not using it. However, the differences were not significant. The figures show that employing our domain-specific collection resulted in performance comparable to that achieved by using all the meta-search engines in combination, suggesting the comprehensive nature of our collection. We nevertheless believe that the SBizPort collection should be further enhanced to provide more comprehensive results in a shorter time, so H1 was not confirmed.

Comparing our portal with the benchmark search engine, we found that the mean accuracy of SBizPort was significantly higher than that of BIWE, while there was no significant difference between the efficiencies achieved by the two systems. We believe that SBizPort's ability to provide comprehensive, high-quality information from many sources helped users get accurate results. However, the efficiency of SBizPort was not significantly better than that of BIWE. Because SBizPort is a research prototype, it lacks the professional operations of BIWE. Therefore, H2 was partially confirmed.

5.1.2. Browse performance

We found that SBizPort achieved a higher mean precision, recall, and F value than BIWE. However, only the difference in F value was significant at a 5% alpha-error level and the difference in recall was significant at a 6% alpha-error level. The results show that SBizPort's browse support tools and SOM visualizer could enable users to find more relevant results than BIWE.
However, there is still room for improvements in terms of efficiency and precision. Therefore, H3 was partially confirmed.

5.1.3. User ratings and comments

Subjects rated SBizPort more favorably than BIWE in terms of information quality and overall satisfaction (see Table 6). The mean differences between the two systems' ratings ranged from 0.6 to 1.5 and were all significant at a 5% alpha-error level. Subjects were very satisfied with SBizPort. We believe that several aspects of SBizPort contributed to its good performance: the high-quality meta-searchers and domain-specific collection used in SBizPort, the useful browse support tools, and the comprehensive content coverage. H4 and H5 were confirmed.

Table 6
Means and standard deviations of different measures

Measure                                SBizPort          BIWE              AMedPort          Ayna
                                       Mean(a)  S.D.     Mean(a)  S.D.     Mean(a)  S.D.     Mean(a)  S.D.
Task 1: Search performance (b)
  Accuracy (c)                         0.87     0.33     0.55     0.50     0.64     0.50     0.23     0.41
  Efficiency (d)                       131      43       149      48       141      45       146      37
Task 2: Search performance (b)
  Accuracy                             0.95     0.23     0.55     0.50     0.50     0.50     0.18     0.40
  Efficiency (d)                       134      59       151      37       141      45       174      19
Task 3: Browse performance (e)
  Precision                            0.87     0.29     0.86     0.34     0.43     0.37     0.27     0.41
  Recall                               0.21     0.14     0.13     0.085    0.26     0.21     0.12     0.18
  F value                              0.78     0.38     0.48     0.49     0.24     0.23     0.11     0.21
  Efficiency (d)                       288      63       285      24       289      26       300      24
Information quality (overall)          2.1      0.66     2.9      1.07     2.6      1.1      4.7      1.0
  Presentation quality and clarity     2.3      0.78     2.9      1.3      2.4      1.0      4.5      1.2
  Coverage and reliability             2.2      0.63     3.0      1.1      2.9      1.3      5.0      0.87
  Usability and analysis quality       1.98     0.76     2.9      1.1      2.4      1.2      4.6      1.2
Overall satisfaction                   1.8      0.76     3.1      1.7      2.2      1.3      4.9      1.8

(a) The range of rating is from 1 to 7, with 1 being the best.
(b) When using our portals, the subjects were asked not to use our domain-specific collection in task 1 but used it in task 2.
(c) In task 1, the “SBizPort” or “AMedPort” column refers to using domain-specific collection and the right column (“Benchmark”) refers to not using domain-specific collection.
(d) Efficiency was measured by the time (in seconds) used.
(e) In task 3, the subjects were asked to use the SOM visualizer when using our portals and could use all available browse tools when using the benchmark search engines.

The subjects provided many positive comments on SBizPort's search and browse capabilities. Twelve subjects agreed that SBizPort was very useful for searching Spanish business information. For instance, subject #s10 said that SBizPort “is very useful for searching,” and “(the information) is clear.” Subject #s1 said “For specific topics (SBizPort) gave out specific results, making the searches better than other search engines.” The subjects also liked the browse support tools provided by SBizPort. A majority of seventeen subjects commented positively on them. For example, subject #s6 said that SBizPort was “really nice to have different functions and have a catalog.” Subject #s18 said that the browse tools “made it easy to view retrieved data.” Regarding the search performance, fifteen subjects commented that SBizPort did a good job or has a greater variety than the benchmark search engine. For example, subject #s7 said: “(SBizPort) gives lots of pages related to what I look for from different countries.” Subject #s10 said “(SBizPort) looks with more information and (is) able to provide in detail.” However, five subjects complained about the low speed of the system, especially when retrieving information from many meta-searchers.

On the other hand, the subjects were unhappy with BIWE's lack of relevance and clarity in searching and browsing. For example, subject #s7 said that BIWE “gives irrelevant pages (of) other countries I'm not interested in.” Subject #s9 said that it was “time-consuming” to use BIWE. Moreover, most users did not like the presence of pop-up advertisements when using BIWE. Nevertheless, six subjects said that BIWE was useful for searching Spanish business information. Three subjects commented that the system was easy to use and fast.

Table 7
p-values of testing various hypotheses (alpha error* = 0.05)

Hypothesis  Measure                          SBizPort vs. BIWE                 AMedPort vs. Ayna                 Result
                                             Effectiveness   Efficiency(a)     Effectiveness   Efficiency(a)
H1          Accuracy                         0.42            0.85              0.54            0.83              Not confirmed
H2          Accuracy                         0.002*          0.30              0.046*          0.011*            Partially confirmed
H3          Precision                        0.89            0.84              0.22            0.31              Partially confirmed
            Recall                           0.06                              0.09
            F value                          0.035*                            0.07
H4          Satisfaction                     0.005*                            0.000*                            Confirmed
H5          Information quality (overall)    0.009*                            0.000*                            Confirmed

(a) Efficiency was measured by the time (in seconds) used.
* p values ≤ 0.05.

Table 8
Subjects' demographic profile

Demographic information
  Spanish subjects (total: 19)
  Arab subjects (total: 11)

Country of origin
  Spanish: Mexico (12), USA (3), Panama (1), Puerto Rico (1), Colombia (1), Peru (1)
  Arab: Lebanon (7), Morocco (1), Iraq (1), Mauritania (1), Jordan (1)
Education
  Spanish: Undergraduate (13), bachelor earned (2), master earned (3), doctorate earned (1)
  Arab: Undergraduate (3), associate degree (1), bachelor earned (2), master earned (5)
Age range
  Spanish: 18–25 (14), 26–30 (2), 31–35 (2), 41–50 (1)
  Arab: 18–25 (6), 26–30 (3), 36–40 (1), 41–50 (1)
Gender
  Spanish: Female (10), male (9)
  Arab: Female (3), male (8)
Hours of using computer per week
  Spanish: <5 (1), 5–10 (2), 10–15 (1), 15–20 (3), 20–25 (9), 30–35 (1), >40 (2)
  Arab: 5–10 (1), 10–15 (3), 15–20 (1), 20–25 (2), 25–30 (1), 30–35 (1), >40 (2)

5.2. AMedPort performance

5.2.1. Search performance

Using AMedPort's collection resulted in higher mean accuracy and efficiency than not using it. However, similarly to SBizPort, the differences were not significant. We believe that the AMedPort collection should be improved to provide more comprehensive results to users in a shorter time. H1 was not confirmed.

Comparing our portal with the benchmark search engine, we found that the mean accuracy and efficiency of AMedPort were significantly higher than those of Ayna. We believe that, like SBizPort, AMedPort provided comprehensive, high-quality information from many sources and helped users find correct results in a shorter time. H2 was confirmed.

5.2.2. Browse performance

Contrary to our expectation, AMedPort achieved performance comparable to that of Ayna, as shown by insignificant differences in precision, recall, and F value. Yet, at 7% and 10% alpha-error levels, AMedPort achieved better F value and recall respectively. So AMedPort needs further fine-tuning to be able to achieve a better performance. H3 was not confirmed.

5.2.3. User ratings and comments

Similarly to SBizPort, AMedPort received significantly better ratings than the benchmark search engine in terms of information quality and overall satisfaction. The mean differences ranged from 2.1 to 2.8 and were all significant at a 5% alpha-error level. We believe that AMedPort's good performance can be attributed to its high-quality meta-searchers and domain-specific collection and its useful browse support tools. H4 and H5 were confirmed.

Subjects' verbal comments show better satisfaction with AMedPort than with Ayna. Nine (out of eleven) subjects said that AMedPort was useful or provides more topics and information. For instance, subject #a7 said AMedPort was “helpful in cross-referencing information from specific to general.” Subject #a5 said AMedPort was “very useful because it does meta-searching.” Subject #a2 said the AMedPort was “very easy to use for Arabs.” In contrast, Ayna received many negative comments from subjects because of its lack of relevant results and confusing interface. For example, subject #a2 said that Ayna was “very clumsy, disorganized, (and) very brief.” Subject #a8 said she “couldn't easily access it” and subject #a9 said Ayna was “hard to use.”

5.3. Discussion

The encouraging results from our experiment demonstrate that the proposed approach is useful to support non-English Web searching and browsing. Although we applied the approach to building two portals in different domains and languages, the experimental results are surprisingly similar. We believe that this was because similar procedures were used to develop the portals and ensured high information quality, comprehensiveness in content coverage, useful functionality, and user-friendly interface. These important components help users who need to search for information from widely scattered regions in a language used by a multitude of countries and places. The results may also imply applicability of the proposed approach to building portals in other domains and languages. Given that the Internet will likely become more and more internationalized [29], the proposed approach is expected to benefit a wide range of domains and users.

Looking more closely into the findings, we observed that the performance differences between the two Arabic search engines are generally larger than those between the two Spanish search engines. This may be due to the relatively weaker Internet development in Arabic-speaking regions. However, as Arabic gains importance on the Internet, we expect the demand for better searching and browsing will grow significantly. Meanwhile, the performance of existing Spanish search engines is expected to lag behind the rapidly-growing Hispanic and Latino populations. Our proposed approach may possibly fill some of the needs.

Compared with previous research (such as [7,23]), our experimental findings provide insights into non-English Web searching in languages that are used in widely-separated regions. New empirical findings and developments are provided in this work.
For example, this research is the first attempt to use and to empirically study the SOM visualizer in supporting non-English Web searching. The Web collections provided by SBizPort and AMedPort are also much larger and contain more diverse regional information than the one developed in [23]. Meta-searching and post-retrieval analysis are new applications in Spanish and Arabic. While major languages like English and Chinese will still be important on the Web, the notion of “multilingual Web” is expected to draw attention from practitioners and researchers in the future. This research will likely shed light on some system development and decision support issues for non-English Web searching.

6. Conclusions and future directions

As non-English speakers increasingly use the Web to seek information, there is a need for better support of searching the Web across different regions. However, support for Internet searching in non-English speaking regions is much weaker than in English-speaking regions. This research proposes a language-independent approach to building Web search portals to support non-English Web searching. Based on the approach, we developed two portals, SBizPort and AMedPort, for the Spanish business and Arabic medical domains, respectively. Experimental results show that the two portals significantly outperformed the benchmark search engines in terms of search accuracy and user ratings on information quality and overall satisfaction. The two portals also achieved precision and recall comparable to those of benchmark search engines. Subjects much preferred our portals to the benchmark search engines in many types of usage. We therefore conclude that the proposed approach is useful in supporting non-English Web searching. This research thus contributes to developing and validating a useful approach to non-English Web searching and providing an example of supporting Web searching in different non-English domains.

This study was limited in several ways.
Our two research prototype portals have speed and stability that are not as good as those of commercial search engines like the chosen benchmarks. Several subjects complained about the slow responses of our systems. We also have been limited by the scarcity of prior work on non-English Web searching, which has prevented a more comprehensive review of a topic that possibly would offer better criteria for designing our approach. As for the user study, we had difficulty in recruiting native speakers as our subjects. Future work should consider expanding the sample size to establish a higher statistical confidence in the experimental results.

We are pursuing several directions to extend our research. As the notion of a “multilingual Web” continues to draw attention, we are developing scalable techniques to collect and analyze information in different languages and to relate diverse content meaningfully to produce intelligence. For instance, multinational corporations (MNCs) typically provide Web site information in different languages. Analyzing MNCs' relationships with their multinational stakeholders could help provide a holistic picture of how they stand in the international arena. Other domains that we will explore include Spanish medical and Arabic business domains. The resulting business intelligence from stakeholders will serve to guide global development strategies. Another challenging area is the digital archiving of multilingual data from heterogeneous sources, often scattered in different regions. We will investigate techniques and methods to facilitate such a process and better support non-English Web searching. Furthermore, we will develop and validate new visualization techniques to support browsing and comprehending massive multilingual information on the Web.
Acknowledgments

This research was partly supported by funding from the National Science Foundation Knowledge Discovery and Dissemination (KDD) program #9983304, June 2003–March 2004 and October 2003–March 2004 and from the University Research Institute Grant Program of the University of Texas at El Paso. We are grateful to our project members and the experts and the student subjects who participated in the user study.

References

[1] R. Abbi, Internet in the Arab world, UNESCO Observatory on the Information Society 3 (2002).
[2] P. Caramelli, The current and future rapid growth of older people in Latin America: implications in psychogeriatrics (keynote presentation), Proceedings of the Eleventh International Congress, International Psychogeriatric Association, Chicago, IL, 2003.
[3] J. Carbonell, J. Goldstein, The use of MMR: diversity-based reranking for reordering documents and producing summaries, Proceedings of the 21st Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, ACM Press, Melbourne, Australia, 1998, pp. 335–336.
[4] H. Chen, H. Fan, M. Chau, D. Zeng, MetaSpider: meta-searching and categorization on the web, Journal of the American Society for Information Science and Technology 52 (13) (2001) 1134–1147.
[5] H. Chen, A. Houston, R. Sewell, B. Schatz, Internet browsing and searching: user evaluation of category map and concept space techniques, Journal of the American Society for Information Science, Special Issue on AI Techniques for Emerging Information Systems Applications 49 (7) (1998) 582–603.
[6] W. Chung, H. Chen, J.F. Nunamaker, A visual framework for knowledge discovery on the Web: an empirical study on business intelligence exploration, Journal of Management Information Systems 21 (4) (2005) 57–84.
[7] W. Chung, Y. Zhang, Z. Huang, G. Wang, T.-H. Ong, H. Chen, Internet searching and browsing in a multilingual world: an experiment on the Chinese Business Intelligence Portal (CBizPort), Journal of the American Society for Information Science and Technology 55 (9) (2004) 818–831.
[8] F.D. Davis, Perceived usefulness, perceived ease of use, and user acceptance of information technology, MIS Quarterly 13 (3) (1989) 319–340.
[9] K.A. Ericsson, H.A. Simon, Protocol Analysis: Verbal Reports as Data, MIT Press, Cambridge, MA, 1993.
[10] T. Firmin, M.J. Chrzanowski, An Evaluation of Automatic Text Summarization Systems, The MIT Press, Cambridge, 1999.
[11] Gallup, Encuesta Sobre Portales 2002, http://aui.es/estadi/gallup/gallup_portales_2002.htm, 2002.
[12] Global Reach, Evolution of non-English online populations, http://global-reach.biz/globstats/evol.html, 2004.
[13] Global Reach, Global internet statistics (by language), http://www.glreach.com/globstats/, 2004.
[14] S. Greene, G. Marchionini, C. Plaisant, B. Shneiderman, Previews and overviews in digital libraries: designing surrogates to support visual information seeking, Journal of the American Society for Information Science 51 (4) (2000) 380–393.
[15] M.A. Hearst, Multi-paragraph segmentation of expository text, Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, Morgan Kaufmann Publishers, Las Cruces, New Mexico, 1994, pp. 9–16.
[16] Y.K. Hitti, Hitti's Medical Dictionary English–Arabic, Librairie du Liban, Beirut, 1972.
[17] T. Kohonen, Self-Organizing Maps, Springer-Verlag, Berlin, 1995.
[18] C. Kuhlthau, Longitudinal case studies of the information search process of users in libraries, Library and Information Science Research 10 (3) (1998) 257–304.
[19] J.R. Lewis, IBM computer usability satisfaction questionnaires: psychometric evaluation and instructions for use, International Journal of Human–Computer Interaction 7 (1) (1995) 57–78.
[20] X. Lin, Map displays for information retrieval, Journal of the American Society for Information Science 48 (1) (1997) 40–54.
[21] E. Loiacono, WebQual™: a web site quality instrument, Proceedings of International Conference on Information Systems (ICIS) Doctoral Consortium, Charlotte, NC, USA, 2002.
[22] G. Marchionini, Information Seeking in Electronic Environments, Cambridge University Press, New York, 1995.
[23] B. Marshall, D. McDonald, H. Chen, W. Chung, EBizPort: collecting and analyzing business intelligence information, Journal of the American Society for Information Science and Technology 55 (10) (2004) 873–891.
[24] M.D. Marsico, S. Levialdi, Evaluating web sites: exploiting user's expectations, International Journal of Human–Computer Studies 60 (3) (2004) 381–416.
[25] D. McDonald, H. Chen, Using sentence selection heuristics to rank text segments in TXTRACTOR, Proceedings of the Second ACM/IEEE–CS Joint Conference on Digital Libraries, ACM/IEEE–CS, Portland, OR, USA, 2002, pp. 28–35.
[26] A. Mowshowitz, A. Kawaguchi, Bias on the web, Communications of the ACM 45 (9) (2002) 56–60.
[27] J. Myers, A. Well, Research Design and Statistical Analysis, Lawrence Erlbaum Associates, Publishers, Hillsdale, NJ, USA, 1995.
[28] L. Norton, The Expanding Universe: Internet Adoption in the Arab Region, World Markets Research Centre, 2001, p. 3.
[29] E.T. O'Neill, B.F. Lavoie, R. Bennett, Trends in the evolution of the public web 1998–2002, Digital Library Magazine 9 (4) (2003).
[30] T.-H. Ong, H. Chen, Updateable PAT-array approach for Chinese key phrase extraction using mutual information: a linguistic foundation for knowledge management, Proceedings of the Second Asian Digital Library Conference, Taipei, Taiwan, 1999, pp. 63–84.
[31] R.L. Penslar, Institutional Review Board Guidebook, Office for Human Research Protection, U.S. Department of Health and Human Services, http://ohrp.osophs.dhhs.gov/irb/irb_guidebook.htm, 2001.
[32] J. Peterson, Quepasa Announces Agreement to Acquire Vayala Corporation, Hispanic PR Wire–Business Wire, Phoenix, 2002.
[33] L.L. Pipino, Y.W. Lee, R.Y. Wang, Data quality assessment, Communications of the ACM 45 (4) (2002) 211–218.
[34] W.M.J. Shaw, R. Burgin, P. Howell, Performance standards and evaluations in information retrieval test collections: cluster-based retrieval models, Information Processing and Management 33 (1) (1997) 1–14.
[35] A.G. Sutcliffe, M. Ennis, Towards a cognitive theory of information retrieval, Interacting with Computers (Special Edition on HCI and Information Retrieval) 10 (1998) 321–351.
[36] A. Tombros, M. Sanderson, Advantages of query biased summaries in information retrieval, Proceedings of the 21st Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, ACM Press, Melbourne, Australia, 1998, pp. 2–10.
[37] E. Voorhees, D. Harman, Overview of the sixth text retrieval conference (TREC-6), NIST Special Publication 500-240: The Sixth Text Retrieval Conference (TREC-6), National Institute of Standards and Technology, Gaithersburg, MD, USA, 1997.
[38] R.Y. Wang, D.M. Strong, Beyond accuracy: what data quality means to data consumers, Journal of Management Information Systems 12 (4) (1996) 5–34.
[39] T.D. Wilson, Models of information behavior research, Journal of Documentation 55 (3) (1999) 249–270.

Wingyan Chung is Assistant Professor of CIS in the Department of Information and Decision Sciences at The University of Texas at El Paso. He received his Ph.D. in Management Information Systems from The University of Arizona, and M.S. in information and technology management and BBA in business administration from The Chinese University of Hong Kong. His research interests include knowledge management, Web analysis and mining, data and text mining, information visualization, and human-computer interaction.
He has published in leading journals such as Communications of the ACM, Journal of Management Information Systems, IEEE Computer, International Journal of Human-Computer Studies, and Decision Support Systems. Contact him at wchung@utep.edu.

Alfonso A. Bonillas received his B.S. in Systems Engineering at the University of Arizona. His main interests are Web development, systems optimization, programming, and database management. Contact him at artunso@yahoo.com.

Guanpi (Greg) Lai is a doctoral student in the Systems and Industrial Engineering (SIE) Department at the University of Arizona. He received his B.S. in Computer Science from Tsinghua University, China and M.S. in Industrial Engineering from the University of Arizona. His research interests include task scheduling for embedded systems, intelligent control (automobile, home automation), data mining, and data visualization. Contact him at guanpi@email.arizona.edu.

Wei Xi received her master's degree in Management Information Systems from the University of Arizona in 2004 and her B.A. in English from Xi'an Foreign Languages University, China (1995). She joined the AI Lab in Spring 2003. Her areas of interest include Web programming and database management. Contact her at duoduoxi@gmail.com.

Hsinchun Chen is McClelland Professor of MIS at the Eller College of the University of Arizona and Andersen Consulting Professor of the Year (1999). He received the Ph.D. degree in Information Systems from New York University in 1989, MBA in Finance from SUNY-Buffalo in 1985, and BS in Management Science from the National Chiao-Tung University in Taiwan. He is author/editor of 10 books and more than 130 journal articles covering intelligence analysis, biomedical informatics, data/text/Web mining, digital library, knowledge management, and Web computing. Contact him at hchen@eller.arizona.edu.