Journal of Emerging Technologies in Web Intelligence
ISSN 1798-0461
Volume 4, Number 3, August 2012

Contents

Special Issue: Web Data Mining
Guest Editor: Richard Khoury

Guest Editorial
Richard Khoury  205

SPECIAL ISSUE PAPERS

Query Classification using Wikipedia's Category Graph
Milad AlemZadeh, Richard Khoury, and Fakhri Karray  207

Towards Identifying Personalized Twitter Trending Topics using the Twitter Client RSS Feeds
Jinan Fiaidhi, Sabah Mohammed, and Aminul Islam  221

Architecture of a Cloud-Based Social Networking News Site
Jeff Luo, Jon Kivinen, Joshua Malo, and Richard Khoury  227

Analyzing Temporal Query for Improving Web Search
Rim Faiz  234

Trend Recalling Algorithm for Automated Online Trading in Stock Market
Simon Fong, Jackie Tai, and Pit Pichappan  240

A Novel Method of Significant Words Identification in Text Summarization
Maryam Kiabod, Mohammad Naderi Dehkordi, and Mehran Sharafi  252

Attribute Overlap Minimization and Outlier Elimination as Dimensionality Reduction Techniques for Text Classification Algorithms
Simon Fong and Antonio Cerone  259

New Metrics between Bodies of Evidences
Pascal Djiknavorian, Dominic Grenier, and Pierre Valin  264

REGULAR PAPERS

Bringing location to IP Addresses with IP Geolocation
Jamie Taylor, Joseph Devlin, and Kevin Curran  273

On the Network Characteristics of the Google's Suggest Service
Zakaria Al-Qudah, Mohammed Halloush, Hussein R. Alzoubi, and Osama Al-kofahi  278

RISING SCHOLAR PAPERS

Review of Web Personalization
Zeeshan Khawar Malik and Colin Fyfe  285

SHORT PAPERS

An International Comparison on Need Analysis of Web Counseling System Design
Takaaki Goto, Chieko Kato, Futoshi Sugimoto, and Kensei Tsuchida  297

Special Issue: Web Data Mining
Guest Editorial
Richard Khoury
Department of Software Engineering, Lakehead University, Thunder Bay, Canada
Email: rkhoury@lakeheadu.ca

The Internet is a massive and continuously growing source of data and information on subjects ranging from breaking news to personal anecdotes to objective documentation. It has proven to be a hugely beneficial resource for researchers, giving them a source of free, up-to-date, real-world data that can be used in a wide range of projects and applications. Hundreds of new algorithms and systems are proposed each year to filter out desired information from this seemingly endless amount of data, to clean and organize it, to infer knowledge from it, and to act on this knowledge.

The importance of the web in scientific research today can be informally gauged by counting the number of published papers that use the name of a major website in their titles, abstracts, or keyword lists. To illustrate, we gathered these statistics using the IEEE Xplore search system for eight well-known websites for each of the past 10 years. The results of this survey, presented in Figure 1, indicate that research interest in web data is increasing steadily. The individual websites naturally experience increases and decreases in popularity; for example, we can see Google overtake Yahoo! around 2007. In 2011, the three most cited websites in our sample of the scientific literature were Google, Facebook and Twitter.

Figure 1. Number of publications per year that use the name of a major website.

This ranking is similar to, but not identical to, the real-world popularity of these websites as measured by the website ranking site Alexa.
Alexa's ranking does put Google and Facebook in first and second place respectively, but ranks Twitter ninth, below YouTube, Yahoo!, Baidu and Wikipedia. There is nonetheless a good similarity between the rankings of Figure 1 and those of Alexa, which is not unexpected: to be useful for scientific research a site needs to contain a lot of data, which means that it must be visited and contributed to by a lot of users, which in turn means a high Alexa rating.

This special issue is thus dedicated to the topic of Web Data Mining. We attempted to compile papers that touch upon both a variety of websites and a variety of data mining challenges. Clearly it would be impossible to create a representative sample of all data mining tasks and all websites in use in the literature. However, after thorough peer review and careful deliberation, we have selected the following papers as good examples of the range of web data mining challenges being addressed today.

In our first paper, "Query Classification Using Wikipedia's Category Graph", the authors Milad AlemZadeh, Richard Khoury and Fakhri Karray perform data mining on Wikipedia, one of the more popular websites both in the literature and with the public. The task they focused on is query classification, or the challenge of determining the topic intended by a query given only the words of that query. This is a challenge with broad applicability, from web search engines to question-answering systems. Their work demonstrates how a system exploiting web data can perform this task with virtually no domain restrictions, making it very appealing for applications that need to interact with human beings in any setting whatsoever.

Next, we move from Wikipedia to Twitter, a website whose popularity we discussed earlier. In "Towards Identifying Personalized Twitter Trending Topics using the Twitter Client RSS Feeds", Jinan Fiaidhi, Sabah Mohammed, and Aminul Islam take on the challenge of mining the massive, real-time stream of tweets for interesting trending topics. Moreover, the notion of what is interesting can be personalized for each user based not only on the tweets' vocabulary, but also on the user's personal details and geographical location. Their paper thus defines the first true Twitter stream personalization system.

Staying on the topic of defining innovative new systems, in "Architecture of a Cloud-Based Social Networking News Site", three undergraduate engineering students, Jeff Luo, Jon Kivinen, and Joshua Malo, give us a tour of a social networking platform they developed. Their work presents a new perspective on web data mining from social networks, in which every aspect of the social network is under the control of the researchers, from the type of information users can put up to the underlying cloud architecture itself.

In "Analyzing Temporal Queries for Improving Web Search", Rim Faiz brings us back to the topic of web query understanding. Her work focuses on the challenge of adding a temporal understanding component to web search systems. Mining temporal information in this way can help improve search engines by making it possible to correctly interpret queries that are dependent on temporal context. And indeed, her enhanced method shows a promising increase in accuracy.
In "Trend Recalling Algorithm for Automated Online Trading in Stock Market", Simon Fong, Jackie Tai, and Pit Pichappan exploit another source of web data: the online stock market. This data source is one that is often overlooked (only 75 references for the entire 2002-2011 period we studied in Figure 1), but one of unquestionable importance today. This paper shows how this data can be mined for trends, which can then be used to successfully guide trading decisions. Specifically, by matching the current trend to past trends and recalling the trading strategies that worked in the past, the system can adapt its behaviour and greatly increase its profits.

In a further example of both the variety of web data and of data mining tasks, in "A Novel Method of Significant Words Identification in Text Summarization", Maryam Kiabod, Mohammad Naderi Dehkordi and Mehran Sharafi mine a database of web newswire in order to train a neural network to mimic the behaviour of a human reader. This neural network underlies the ability of their system to pick out the important keywords and key sentences that summarize a document. Once trained on the web data set, the system works quite well, and in fact outperforms commercially available summarization tools.

Our final two papers take a wider view of the challenge of web data mining, and focus on the mining process itself. In our penultimate paper, "Attribute Overlap Minimization and Outlier Elimination as Dimensionality Reduction Techniques for Text Classification Algorithms", Simon Fong and Antonio Cerone note that the massive and increasing volume of online documents, and the corresponding increase in the number of features to be handled to represent them, is becoming a problem for web mining algorithms, and especially for real-time algorithms. They thus explore the challenge of feature reduction in web documents. Their experiments, conducted on Wikipedia articles in multiple languages and on CNN.com news articles, demonstrate not only the possibility but also the benefits of dimensionality reduction of web data.

Our final paper, "New Metrics between Bodies of Evidences" by Pascal Djiknavorian, Dominic Grenier, and Pierre Valin, presents a higher-level theoretical perspective on web data mining. They propose new metrics to compare and evaluate evidence and uncertainty in the context of the Dempster-Shafer theory. Their work introduces fundamental theoretical advances of consequence for all information retrieval applications. It could be useful, for example, for a new generation of web search systems that can pinpoint relevant information in a web page, rather than consider the page as a whole. It could also be useful to handle the uncertainty incurred when combining information from multiple heterogeneous web data sources.

Richard Khoury received his Bachelor's Degree and his Master's Degree in Electrical and Computer Engineering from Laval University (Québec City, QC) in 2002 and 2004 respectively, and his Doctorate in Electrical and Computer Engineering from the University of Waterloo (Waterloo, ON) in 2007. Since August 2008, he has been an Assistant Professor, tenure track, in the Department of Software Engineering at Lakehead University. Dr. Khoury has published 20 papers in international journals and conferences, and has served on the organization committee of three major conferences.
His primary area of research is natural language processing, but his research interests also include data mining, knowledge management, machine learning, and intelligent systems.

Query Classification using Wikipedia's Category Graph

Milad AlemZadeh
Centre for Pattern Analysis and Machine Intelligence, University of Waterloo, Waterloo, Ontario, Canada
Email: malemzad@uwaterloo.ca

Richard Khoury
Department of Software Engineering, Lakehead University, Thunder Bay, Ontario, Canada
Email: richard.khoury@lakeheadu.ca

Fakhri Karray
Centre for Pattern Analysis and Machine Intelligence, University of Waterloo, Waterloo, Ontario, Canada
Email: karray@uwaterloo.ca

Abstract— Wikipedia's category graph is a network of 300,000 interconnected category labels, and it can be a powerful resource for many classification tasks. However, its size and lack of order can make it difficult to navigate. In this paper, we present a new algorithm to efficiently exploit this graph and accurately rank classification labels given user-specified keywords. We highlight multiple possible variations of this algorithm, and study the impact of these variations on the classification results in order to determine the optimal way to exploit the category graph. We implement our algorithm as the core of a query classification system and demonstrate its reliability using the KDD CUP 2005 and TREC 2007 competitions as benchmarks.

Index Terms— Keyword search, Natural language processing, Knowledge based systems, Web sites, Semantic Web

I. INTRODUCTION

Query classification is the task of Natural Language Processing (NLP) whose goal is to identify the category label, in a predefined set, that best represents the domain of a question being asked. An accurate query classification system would be beneficial in many practical systems, including search engines and question-answering systems. Query classification shares some similarities with other categorization tasks in NLP, and with document classification in particular. However, the challenge of query classification is accentuated by the fact that a typical query is only between one and four words long [1], [2], rather than the hundreds or thousands of words one can get from an average text document. Such a limited number of keywords makes it difficult to select the correct category label, and moreover it makes the selection very sensitive to "noise words", or words unrelated to the query that the user entered for some reason, such as not remembering the correct name or technical term to query for. A second challenge of query classification comes from the fact that, while document libraries and databases can be specialized to a single domain, the users of query systems expect to be able to ask queries about any domain at all [1].

This paper continues our work on query classification using the Wikipedia category graph [3], [4]. It refines and expands on our previous work by studying multiple different design alternatives that similar classification systems could opt for, and considers the impact of each one. In contrast with our previous papers, the focus here is not on presenting a single classification system, but on implementing and comparing multiple systems that differ on critical points. The rest of the paper is organized as follows.
Section 2 presents an overview of the literature in the field of query classification, with a special focus on the use of Wikipedia for that task. We present in detail our ranking and classification algorithm in Section 3, and take care to highlight the points where we considered different design options. Each of these options was implemented and tested, and in Section 4 we describe and analyze the experimental results we obtained with each variation of our system. Finally, we give some concluding remarks in Section 5.

II. BACKGROUND

Query classification is the task of NLP that focuses on inferring the domain information surrounding user-written queries, and on assigning to each query the best category label from a predefined set. Given the ubiquity of search engines and question-handling systems today, this challenge has been receiving a growing amount of attention. For example, it was the topic of the ACM's annual KDD CUP competition in 2005 [5], where 37 systems competed to classify a set of 800,000 real web queries into a set of 67 categories designed to cover most topics found on the internet. The winning system was designed to classify a query by comparing its word vector to that of each website in a set pre-classified in the Google directory. The query was assigned the category of the most similar website, and the directory's set of categories was mapped to the KDD CUP's set [2]. This system was later improved by introducing a bridging classifier and an intermediate-level category taxonomy [6].

Most query classifiers in the literature, like the system described above, are based on the idea of mapping the queries into an external knowledge source (an objective third-party knowledge base) or an internal knowledge source (user-specific information) to classify them. This simple idea leads to a great variety of classification systems. Using an internal knowledge source, Cao et al. [7] developed a query classifier that disambiguates the queries based on the context of the user's recent online history. On the other hand, many very different external knowledge sources have been used in practice, including ontologies [8], websites [9], web query logs [10], and Wikipedia [4], [11], [12].

Exploiting Wikipedia as a knowledge source has become commonplace in scientific research. Several hundred journal and conference papers have been published using this tool since its creation in 2001. However, while both query classification and NLP using Wikipedia are common challenges, to the best of our knowledge there have been only three query classification systems based on Wikipedia. The first of these three systems was proposed by Hu et al. [11]. Their system begins with a set of seed concepts to recognize, and it retrieves the Wikipedia articles and categories relevant to these concepts. It then builds a domain graph by following the links in these articles using a Markov random walk algorithm. Each step from one concept to the next on the graph is assigned a transition probability, and these probabilities are then used to compute the likelihood of each domain. Once the knowledge base has been built in this way, a new user query can be classified simply by using its keywords to retrieve a list of relevant Wikipedia domains, and sorting them by likelihood. Unfortunately, their system remained small-scale and limited to only three basic domains, namely "travel", "personal name" and "job".
It is not a general-domain classifier such as the one we aim to create. The second query classification system was designed by one of our co-authors in [12]. It follows Wikipedia's encyclopedia structure to classify queries step by step, using the query's words to select titles, then selecting articles based on these titles, then categories from the articles. At each step, the weights of the selected elements are computed based on the relevant elements in the previous step: a title's weight depends on the words that selected it, an article's weight on the titles', and a category's weight on the articles'. Unlike [11], this system was a general classifier that could handle queries from any domain, and its performance would have ranked near the top of the KDD CUP 2005 competition.

The last query classification system is our own previous work, described in [4]. It is also a general classifier, but it differs fundamentally from [12]. Instead of using titles and articles to pinpoint the categories in which to classify a query as was done in [12], the classifier of [4] used titles only to create a set of inexact initial categories for the query, and then explored the category graph to discover the best goal categories from a set of predetermined valid classification goals. This classifier also differs from the one described in this work on a number of points, including the equations used to weight and rank categories and the mapping of the classification goals. But the most fundamental difference is the use in this paper of pre-computed base-goal category distances instead of an exploration algorithm. As we will show in this paper, all these modifications are justified both from a theoretical standpoint and practically by improvements in the experimental results.

While using Wikipedia for query classification has not been a common task, there have been several document classification projects done using that resource which are worth mentioning. Schönhofen [13] successfully developed a complete document classifier using Wikipedia, by mapping the document's vocabulary to titles, articles, and finally categories, and weighting the mapping at each step. In fact, we used some of the mapping techniques he developed in one of our previous works [12]. Alternatively, other authors use Wikipedia to enrich existing text classifiers by improving upon the simple bag-of-words approach. The authors of [14] use it to build a kernel to map the document's words to the Wikipedia article space and classify there, while the authors of [15] and [16] use it for text enrichment, to expand the vocabulary of the text by adding relevant synonyms taken from Wikipedia titles. Interestingly, improvements are reported in the classification results of [13], [15] and [16], while only [14] reports worse results than the bag-of-words method. The conclusion seems to be that working in the word space is the better option, a conclusion that [14] also shares. Likewise, that is the approach we used in the system we present in this paper.

III. ALGORITHM

Wikipedia's category graph is a massive set of almost 300,000 category labels, describing every domain of knowledge and ranging from the very precise, such as "fictional secret agents and spies", to the very general, such as "information". The categories are connected by hypernym relationships, with a child category having an "is-a" relationship to its parents.
However, the graph is not strictly hierarchical: there exist shortcuts in the connections (i.e. starting from one child category and going up two different paths of different lengths to reach the same parent category) as well as loops (i.e. starting from one child category and going up a path to reach the same child category again). The query classification algorithm we propose in this paper is designed to exploit this graph structure. As we will show in this section, it is a three-stage algorithm, with a lot of flexibility possible within each step. The first stage is a pre-processing stage, during which the category graph is built and critical application-specific information is determined. This stage needs to be done only once to create the system, by contrast with the next two stages that are executed for each submitted query. In the second stage, a user's query is mapped to a set of base categories, and these base categories are weighted and ranked. And finally, the algorithm explores the graph starting from the base categories and going towards the nearest goal categories in stage 3. The pseudocode of our new algorithm is shown in Figure 1.

Input: Wikipedia database dump
1. CG ← the Category Graph extracted from Wikipedia
2. Associate to each category in CG the list of all titles pointing to it
3. GC ← the set of Goal Categories identified in CG
4. Dist(GC,CG) ← the shortest-path distance between every GC and all categories in CG

Input: User query, CG
5. KL ← Keyword List of all keywords in the user query
6. TL ← Title List of all titles in CG featuring at least one word in KL
7. KTW ← Keyword-Title Weight, a list containing the weight of a keyword from KL featured in a title from TL
8. BC ← Base Categories, all categories in CG pointed to by TL
9. CD ← Category Density for all BC computed from the KTW
10. BC ← top BC ranked by CD

Input: GC, Dist(GC,BC), CD
11. GS ← Goal Score of each GC, computed based on their distance to each BC and on CD
12. Return: top 3 GC ranked by GS

Figure 1. Structure of the three steps of our classification algorithm: the pre-processing step (top), the base category evaluation (middle), and the exploration for the goal categories (bottom).

A. Stage 1: Pre-Processing the Category Graph

We begin the first stage of our algorithm by extracting the list of categories in Wikipedia and the connections between categories from the database dump made freely available by the Wikimedia Foundation. For this project, we used the version available from September 2008. Furthermore, our graph includes one extra piece of information in addition to the categories, namely the article titles. In Wikipedia, each article is an encyclopedic entry on a given topic which is classified in a set of categories, and which is pointed to by a number of titles: a single main title, some redirect titles (for common alternative names, including foreign translations and typos) and some disambiguation titles (for ambiguous names that may refer to it). For example, the article for the United States is under the main title "United States", as well as the redirect titles "USA", "United States of America" and "United Staets" (common typo redirection), and the disambiguation title "America". Our pre-processing deletes stopwords and punctuation from the titles, then maps them directly to the categories of the articles and discards the articles.
After this processing, we find that our category graph features 5,453,808 titles and 282,271 categories.

The next step in constructing the category graph is to define a set of goal categories that are acceptable classification labels. The exact number and nature of these goal categories will be application-specific. However, the set of Wikipedia category labels is large enough to cover numerous domains at many levels of precision, which means that it will be easy for system designers to identify a subset of relevant categories for their applications, or to map an existing category set to Wikipedia categories.

The final pre-processing step is to define, compute and store the distance between the goal categories and every category in the graph. This distance between two categories is the number of intermediate categories that must be visited on the shortest path between them. We allow any path between two categories, regardless of whether it goes up to parent categories or down to children categories or zigzags through the graph. This stands in contrast with our previous work [4], where we only allowed paths going from child to parent category. The reason for adopting this more permissive approach is to make our classifier more general: the parent-only approach may work well in the case of [4], where all the goal categories selected were higher in the hierarchy than the average base category, but it would fail miserably in the opposite scenario, where the base categories are parents of the goal categories. When searching for the shortest paths, we can avoid the graph problems we mentioned previously, of multiple paths and loops between categories, by only saving the first encounter of a category and by terminating paths that revisit categories. Finally, we can note that, while exploring the graph to find the shortest distance between every goal category and all other categories may seem like a daunting task, for a set of about 100 goal categories such as we used in our experiments it can be done in only a few minutes on a regular computer.
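As a rough illustration of the distance pre-computation in step 4 of Figure 1, the sketch below runs a breadth-first search from each goal category. It is our own sketch, not the authors' implementation; `category_graph` is an assumed adjacency dictionary that lists each category's parents and children together, matching the paper's choice to ignore edge direction.

```python
from collections import deque

def goal_distances(category_graph, goal_categories):
    """Breadth-first search from each goal category. Saving only the first
    encounter of a category and never revisiting one yields the shortest-path
    distance while side-stepping the shortcuts and loops of the graph."""
    distances = {}
    for goal in goal_categories:
        dist = {goal: 0}
        queue = deque([goal])
        while queue:
            current = queue.popleft()
            for neighbour in category_graph.get(current, ()):
                if neighbour not in dist:          # first encounter = shortest path
                    dist[neighbour] = dist[current] + 1
                    queue.append(neighbour)
        distances[goal] = dist                     # distance from `goal` to every reachable category
    return distances
```

Since each goal category requires only one traversal of the graph, a table of this kind for roughly 100 goal categories stays well within the few minutes of computation mentioned above.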
B. Stage 2: Discovering the Base Categories

The second stage of our algorithm, as shown in Figure 1, is to map the user's query to an initial set of weighted base categories. This is accomplished by stripping the query of stopwords to keep only relevant keywords, and then generating the exhaustive list of titles that feature at least one of these keywords. Next, the algorithm considers each title t and determines the weight Wt of the keywords it contains. This weight is computed based on two parameters: the number of keywords featured in the title (Nk), and the proportional importance of keywords in the title (Pk). The form of the weight equation is given in equation (1).

W_t = N_k P_k    (1)

The number of keywords featured in the title is a simple and unambiguous measure. The proportional importance is open to interpretation, however. In this research, we considered three different measures of importance. The first is simply the proportion of keywords in the title (N_k / N_t, where N_t is the total number of words in title t). The second is the proportion of characters in the title that belong to keywords (C_k / C_t, where C_k is the number of characters of the keywords featured in the title and C_t is the total number of characters in title t). This metric assumes that longer keywords are more important; in the context of queries, which are only a few words long [1], [2], it may be true that more emphasis was meant by the user on the longest, most evident word in the query. The final measure of proportional importance is based on the words' inverted frequencies. It is computed as the sum of the inverted frequencies of the keywords in the title over the sum of the inverted frequencies of all title words (ΣF_k / ΣF_t), where the inverted frequency of a word w is computed as:

F_w = ln(T / T_w)    (2)

In equation (2), T is the total number of titles in our category graph and T_w is the number of titles featuring word w. It is, in essence, the IDF part of the classic term frequency-inverse document frequency (TFIDF) equation, (N_w / N) ln(T / T_w), where N_w is the number of instances of word w in a specific title (or more generally, a document) and N is the total number of words in that title. The TF part (N_w / N) is ignored because it does not give a reliable result when dealing with short titles that only feature each word once or twice. We have used this metric successfully in the past in another classifier we designed [12].

We can see from equation (1) that every keyword appearing in a title will receive the same weight W_t. Moreover, when a title is composed exclusively of query keywords, their weight will be the number of keywords contained in the title. The maximum weight a keyword can have is thus equal to the number of keywords in the query; it occurs in the case where a title is composed of all query keywords and nothing else.

Next, our algorithm builds a set of base categories by listing exhaustively all categories pointed to by the list of titles. This set of base categories can be seen as an initial coarse classification for the query. These base categories are each assigned a density value. A category's density value is computed by determining the maximum weight each query keyword takes in the list of titles that point to that category, then summing the weights of all keywords, as shown in equation (3).

D_i = Σ_k max_t(W_t^{k,i})    (3)

In that equation, D_i is the density of category i, and W_t^{k,i} refers to the weight W_t of a title t that contains keyword k and points to category i. Following our discussion on equation (1), we can see that the maximum density a category can have is the square of the number of query keywords. It happens in the case where each keyword has its maximum value in that category, meaning that one of the titles pointing to the category is composed of exactly the query words.

At the end of this stage of the algorithm, we have a weighted list of base categories, featuring some categories pointed to by high-weight words and summing to a high density score, and a lot of categories pointed to by only lower-weight words and having a lower score. In our experiments, we found that the set contains over 3,000 base categories on average. We limit the size of this list by keeping only the set of highest-density categories, as categories with a density too low are deemed to be too unrelated to the original query to be of use. This can be done either on a density basis (i.e. keeping categories whose density is more than a certain proportion of the highest density obtained for this query, regardless of the number of categories this represents, as we did in [4]) or on a set-size basis (i.e. keeping a fixed number of categories regardless of their density, the approach we will prioritize in this paper). When using the set-size approach, a question arises on how to deal with ties when the number of tied categories exceeds the size of the set to return. In our system, we break ties by keeping a count of the number of titles that feature keywords and that point to each category, and giving priority to the categories pointed to by more titles.
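A minimal sketch of this second stage, under assumed data structures, could look as follows. The hypothetical `titles` dictionary maps each pre-processed title to the categories of its article; the sketch uses the keyword-proportion measure Wt = Nk * (Nk / Nt) for equation (1) and the density of equation (3), and breaks ties by the number of supporting titles as described above. It is our illustration, not the authors' code.

```python
def stage2_base_categories(query_keywords, titles, set_size=25):
    """Weight matching titles with W_t = N_k * (N_k / N_t), accumulate the
    per-keyword maxima of equation (3) into a category density, and keep the
    `set_size` densest base categories, breaking ties by title support."""
    keywords = {w.lower() for w in query_keywords}
    max_weight = {}   # (category, keyword) -> best title weight seen so far
    support = {}      # category -> number of matching titles (tie-breaker)

    for title, categories in titles.items():
        words = title.lower().split()
        matched = keywords.intersection(words)
        if not matched:
            continue
        n_k, n_t = len(matched), len(words)
        w_t = n_k * (n_k / n_t)                       # equation (1) with P_k = N_k / N_t
        for cat in categories:
            support[cat] = support.get(cat, 0) + 1
            for kw in matched:
                if w_t > max_weight.get((cat, kw), 0.0):
                    max_weight[(cat, kw)] = w_t

    density = {}                                       # equation (3): sum of per-keyword maxima
    for (cat, _kw), w in max_weight.items():
        density[cat] = density.get(cat, 0.0) + w

    ranked = sorted(density, key=lambda c: (density[c], support[c]), reverse=True)
    return [(c, density[c]) for c in ranked[:set_size]]
```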
C. Stage 3: Ranking the Goal Categories

Once the list of base categories is available, the third and final stage of the algorithm is to determine which of the goal categories identified in the first stage are the best classification labels for the query. As we outlined in the pseudocode of Figure 1, our system does this by ranking the goal categories based on their shortest-path distance to the selected base categories. There are of course other options that have been considered in the literature. For example, Coursey and Mihalcea [17] proposed an alternative metric based on graph centrality, while Syed et al. [18] developed a spreading activation scheme to discover related concepts in a set of documents. Some of these ideas could be adapted into our method in future research.

However, even after settling on the shortest-path distance metric, there are many ways we could take the base categories' densities into account in the goal categories' ranking. The simplest option is to use the density as a threshold value: to cut off base categories that have a density lower than a certain value, and then rank the goal categories according to which are closest to any remaining base category, regardless of density. That is the approach we used in [4]. On the other hand, taking the density into account creates different conditions for the system. Since some base categories are now more important than others, it becomes acceptable, for example, to rank a goal that is further away from several high-density base categories higher than a goal that is closer to a low-density base category. We thus define a ranking score for the goal categories as the sum, over all base categories, of a ratio of their density to the distance separating the goal and base. There are several ways to compute this ratio; five options that we considered in this study are:

S_j = Σ_i D_i / (dist(i,j) + 0.0001)    (4)
S_j = Σ_i D_i / (dist(i,j)^2 + 0.0001)    (5)
S_j = Σ_i D_i e^{-dist(i,j)}    (6)
S_j = Σ_i D_i e^{-2 dist(i,j)}    (7)
S_j = Σ_i D_i e^{-dist(i,j)^2}    (8)

In each of these equations, the score S_j of goal category j is computed as the sum, for all base categories i, of the density D_i of that category, which was computed in equation (3), divided by a function of the distance between categories i and j. This function is a simple division in equations (4) and (5), but the exponentials in equations (6)-(8) put progressively more importance on the distance compared to the density. The addition of 0.0001 in equations (4) and (5) is simply to avoid a division by zero in the case where a selected base category is also a goal category. Finally, the goal categories with the highest score are returned as classification results. In our current version of the system, we return the top three categories, to allow for queries to belong to several different categories. We believe that this corresponds to a human level of categorization; for example, in the KDD CUP 2005 competition [5], human labelers used on average 3.3 categories per query. However, this parameter is flexible, and we ran experiments keeping anywhere from one to five goal categories.
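A sketch of this third stage is given below, reusing the hypothetical structures of the earlier sketches (`base_categories` as (category, density) pairs and the precomputed `goal_distances` table). The five scoring options mirror equations (4)-(8), with equation (5) as the default since that is the variant retained later in Section IV; this is our illustration rather than the paper's implementation.

```python
import math

def stage3_rank_goals(base_categories, goal_distances, scoring="eq5", top_n=3):
    """Rank goal categories by summing, over the retained base categories,
    a function of base density and base-to-goal distance (equations 4-8)."""
    def contribution(density, d):
        if scoring == "eq4":
            return density / (d + 0.0001)
        if scoring == "eq5":
            return density / (d * d + 0.0001)
        if scoring == "eq6":
            return density * math.exp(-d)
        if scoring == "eq7":
            return density * math.exp(-2 * d)
        return density * math.exp(-d * d)              # eq8

    scores = {}
    for goal, dist in goal_distances.items():
        total = 0.0
        for category, density in base_categories:
            if category in dist:                       # unreachable bases contribute nothing
                total += contribution(density, dist[category])
        scores[goal] = total
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
```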
IV. EXPERIMENTAL RESULTS

The various alternatives and options for our classifier described in the previous section were all implemented and tested, in order to study the behavior of the system and determine the optimal combination. That optimal combination was then subjected to a final set of tests with new data. In order to compare and study the variations of our system, we submitted them all to the same challenge as the KDD CUP 2005 competition [5]. The 37 solutions entered in that competition were evaluated by classifying a set of 800 queries into up to five categories from a predefined set of 67 target categories c_j and comparing the results to the classification done by three human labelers. The solutions were ranked based on overall precision and overall F1 value, as computed by equations (9)-(14). The competition's Performance Award was given to the system with the top overall F1 value, and the Precision Award was given to the system with the top overall precision value within the top 10 systems evaluated on overall F1 value. Overall Recall was not used in the competition, but is included here because it is useful in our experiments.

Precision = Σ_j (queries correctly labeled as c_j) / Σ_j (queries labeled as c_j)    (9)
Recall = Σ_j (queries correctly labeled as c_j) / Σ_j (queries belonging to c_j)    (10)
F1 = 2 × Precision × Recall / (Precision + Recall)    (11)
Overall Precision = (1/3) Σ_{L=1}^{3} (Precision against labeler L)    (12)
Overall Recall = (1/3) Σ_{L=1}^{3} (Recall against labeler L)    (13)
Overall F1 = (1/3) Σ_{L=1}^{3} (F1 against labeler L)    (14)

In order to compare our system to the KDD CUP competition results, we need to use the same set of category labels. As we mentioned in Section 3, the size and level of detail of Wikipedia's category graph make it possible to identify categories to map most sets of labels to. In our case, we identified 99 goal categories in Wikipedia corresponding to the 67 KDD CUP categories. These correspondences are presented in Appendix A.
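For reference, the overall scores of equations (9)-(14) against the three labelers can be computed along the following lines. The structures `predictions` and `labelers` are hypothetical stand-ins for the KDD CUP data, and the sketch is ours rather than the competition's evaluation code.

```python
def evaluate(predictions, labelers):
    """`predictions` maps a query to the set of categories returned by the
    classifier; `labelers` is a list of three dicts mapping each query to the
    set of categories assigned by one human labeler."""
    def against(labels):
        returned = correct = relevant = 0
        for query, predicted in predictions.items():
            truth = labels.get(query, set())
            returned += len(predicted)
            relevant += len(truth)
            correct += len(predicted & truth)
        precision = correct / returned if returned else 0.0      # equation (9)
        recall = correct / relevant if relevant else 0.0         # equation (10)
        f1 = (2 * precision * recall / (precision + recall)      # equation (11)
              if precision + recall else 0.0)
        return precision, recall, f1

    per_labeler = [against(labels) for labels in labelers]
    overall = [sum(values) / len(per_labeler) for values in zip(*per_labeler)]
    return {"precision": overall[0], "recall": overall[1], "f1": overall[2]}  # equations (12)-(14)
```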
A. Proportional Importance of Keywords

The first aspect of the system we studied is the different formulae for the proportional importance of query keywords in a title. As we explained in Section IIIB, the choice of formula has a direct impact on the system, as it determines which titles are more relevant given the user's query. This in turn determines the relevance of the base categories that lead to the goal categories. A bad choice at this stage can have an impact on the rest of the system. The weight of a title, and of the query keywords it contains, is a function of the two parameters presented in equation (1), namely the number of keywords present in the title and the importance of those keywords in that title. Section IIIB gives three possible mathematical definitions of keyword importance in a title. They are a straightforward proportion of keywords in the title, the proportion of characters in the title that belong to keywords, and the proportion of IDF of keywords to the total IDF of the title, as computed with equation (2). We implemented all three equations and tested the system independently using each. In all implementations, we limited the list of base categories to 25, weighted the goal categories using equation (5), and varied the number of returned goal categories from 1 to 5. The results of these experiments are presented in Figure 2.

Figure 2. Overall precision (dashed line), recall (dotted line) and F1 (solid line) using Nk*(Ck/Ct) (dark squares), Nk*(Nk/Nt) (medium triangles), and Nk*(ΣFk/ΣFt) (light circles).

The three different experiments are shown with different grey shades and markers: dark squares for the formula using the proportion of characters, medium triangles for the formula using the proportion of words, and light circles for the formula using the proportion of IDF. Three results are also shown for each experiment: the overall precision computed using equation (12) in a dashed line, the overall recall of equation (13) in a dotted line, and the overall F1 of equation (14) in a solid line.

A few observations can be made from Figure 2. The first is that the overall result curves of all three variations have the same shape. This means that the system behaves in a very consistent way regardless of the exact formula used. There is no point where the results given one equation shoot off into a wildly different range of values from the other two equations. Moreover, while the exact difference in the results between the three equations varies, there is no point where they switch and one equation goes from giving worse results than another to giving better results. We can also see that the precision decreases and the recall increases as we increase the number of acceptable goal categories. This result was to be expected: increasing the number of categories returned in the results means that each query is classified in more categories, leading to more correct classifications (which increase recall) and more incorrect classifications (which decrease precision).

Finally, we can note that the best equation for the proportional importance of keywords in titles is consistently the proportion of keywords (Nk / Nt), followed closely by the proportion of characters (Ck / Ct), while the proportion of IDF (ΣFk / ΣFt) trails in third position. It is surprising that the IDF measure gives the worst results of the three, when it worked well in other projects [12]. However, the IDF measure is based on a simple assumption, that a word with low semantic importance is one that is used commonly in most documents of the corpus. In our current system, however, the "documents" are article titles, which are by design short, limited to important keywords, and stripped of semantically irrelevant words. These are clearly in contradiction with the assumptions that underlie the IDF measure. We can see this clearly when we compare the statistics of the keywords given in the example in [12] with the same keywords in our system, as we do in Table I.

TABLE I. COMPARISON OF IDF OF SAMPLE KEYWORDS
Keyword        | Tw*    | Fw* | Tw    | Fw
WWE            | 2,705  | 7.8 | 657   | 9.0
Chief          | 83,977 | 5.6 | 1,695 | 8.1
Executive      | 82,976 | 5.8 | 867   | 8.7
Chairman       | 40,241 | 7.2 | 233   | 10.1
Headquartered  | 38,749 | 7.1 | 10    | 13.2
(* Columns 2 and 3 are taken from [12].)

The system in [12] computed its statistics from the entire Wikipedia corpus, including article text, and thus computed reliable statistics; in the example in Table I the rarely-used company name WWE is found much more significant than the common corporate nouns chief, executive, chairman and headquartered. On the other hand, in our system WWE is used in almost as many titles as executive and has a comparable Fw score, which is dwarfed by the Fw scores of chairman and headquartered, two common words that are very rarely used in article titles.
Finally, we can wonder if the two parts of equation (1) are really necessary, especially since the best equation we found for proportional importance repeats the Nk term. To explore that question, we ran the same test again using each part of the equation separately. Figure 3 plots the overall F1 using Nk alone in a light dotted line with circle markers, using Nk / Nt in a black dashed line with square markers, and reproduces the overall F1 of Nk * (Nk / Nt) from Figure 2 in a medium solid line with triangle markers for comparison. This figure shows clearly that using the complete equation gives better results than using either one of its components.

Figure 3. Overall F1 using Nk*(Nk/Nt) (solid medium triangles), Nk/Nt (dashed dark rectangles), and Nk (light dotted circles).

B. Size of the Base Category Set

The second aspect of the system we studied comes at the end of the second stage of the algorithm, when the list of base categories is trimmed down to keep only the most relevant ones. This list will initially contain all categories connected to any title that contains at least one of the keywords the user specified. As we mentioned before, the average number of base categories generated by a query is 3,400 and the maximum is 45,000. These base categories are then used to compute the score of the goal categories, using one of the summations of equations (4)-(8). This test aims to see if the quality of the results can be improved by limiting the size of the set of base categories used in this summation, and if so, what the approximate ideal size is. For this test, we used Wt = Nk * (Nk / Nt) for equation (1), the best formula found in the previous test. We again weighted the goal categories using equation (5) and varied the number of returned goal categories from 1 to 5. Figure 4 shows the F1 value of the system under these conditions when trimming the list of base categories to 500 (black solid line with diamonds), 100 (light solid line with circles), 50 (medium solid line with triangles), 25 (light dotted line with squares), 10 (black dashed line with squares) and 1 (black dotted line with circles).

Figure 4. Overall F1 using from 1 to 500 base categories.

Figure 4 shows clearly that the quality of the results drops if the set of base categories is too large (500) or too small (1). The difference in the results between the other four cases is less marked, and in fact the results with 10 and 100 base categories overlap. More notably, the results with 10 base categories start weaker than the case with 100, spike around 3 goal categories to outperform it, then drop again and tie it at 5 goal categories. This instability seems to indicate that 10 base categories are not enough. The tests with 25 and 50 base categories are the two that yield the best results; it thus seems that the optimal size of the base category set is in that range. The 25 base category case outperforms the 50 case, and is the one we will prefer. It is interesting to consider that in our previous study [4], we used the other alternative we proposed, namely to trim the set based on the density values.
The cutoff we used was half the density value of the base category in the set with the highest density; any category with less than that density value was eliminated. This gave us a set of 28 base categories on average, a result which is consistent with the optimum we discovered in the present study.

C. Goal Category Score and Ranking

Another aspect of the system we wanted to study is the choice of equations we can use to account for the base categories' density and distance when ranking the goal categories. The option we used in the previous subsection, to find the nearest goals to any of the retained base categories regardless of their densities, is entirely valid. The alternative we consider here is to rank the goal categories as a function of their distance to each base category and of the density of that base. We proposed five possible equations in Section IIIC to mathematically combine density and distance to rank the goal categories. Equation (4) considers both distance and density evenly, and the others put progressively more importance on the distance, up to equation (8). To illustrate the different impact of equations (4)-(8), consider three fictional base categories: one which has a density of 4 and is at a distance of 4 from a goal category, a second with a density of 4 and a distance of 3 from the same goal category, and a third with a density of 3 and a distance of 3 from the goal. The contribution of each of these bases to the goal category in each of the summations is given in Table II.

TABLE II. IMPACT OF THE GOAL CATEGORY EQUATIONS
Equation | Density 4, Distance 4 | Density 4, Distance 3 | Density 3, Distance 3
(4)      | 1.00                  | 1.33                  | 1.00
(5)      | 0.25                  | 0.44                  | 0.33
(6)      | 0.07                  | 0.20                  | 0.15
(7)      | 0.001                 | 0.01                  | 0.007
(8)      | 4.5×10^-7             | 0.0005                | 0.0004

As we can see in this table, the contribution of each base decreases as we move down from equation (4) to equation (8), but it decreases a lot more and a lot faster for the base at a distance of 4. The contribution to the summation of the category at a distance of 4 is almost equal to that of the categories at a distance of 3 when using equation (4), but is three orders of magnitude smaller when using equation (8). That is the result of putting more and more emphasis on distance rather than density: the impact of a farther-away higher-density category becomes negligible compared to a closer lower-density category. Meanwhile, comparing the contribution of the two categories of different densities at the same distance shows that, while they are in the same order of magnitude, the higher-density one is always more important than the lower-density one, as we would want.
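The Table II figures can be reproduced directly from equations (4)-(8); the short check below is our own illustration and not part of the system.

```python
import math

# Reproduce the Table II contributions for the three (density, distance) cases in the text.
cases = [(4, 4), (4, 3), (3, 3)]
formulas = {
    "(4)": lambda d, x: d / (x + 0.0001),
    "(5)": lambda d, x: d / (x * x + 0.0001),
    "(6)": lambda d, x: d * math.exp(-x),
    "(7)": lambda d, x: d * math.exp(-2 * x),
    "(8)": lambda d, x: d * math.exp(-x * x),
}
for name, f in formulas.items():
    row = ", ".join(f"{f(density, distance):.2g}" for density, distance in cases)
    print(f"Equation {name}: {row}")
# Prints approximately 1, 1.3, 1 for (4); 0.25, 0.44, 0.33 for (5); 0.073, 0.2, 0.15 for (6);
# 0.0013, 0.0099, 0.0074 for (7); and 4.5e-07, 0.00049, 0.00037 for (8), matching Table II.
```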
For comparison, we also ran the classification using the exploration algorithm from our previous work [4], and included those results as a black dotted line with square markers. We can see from Figure 5 that putting too much importance on distance rather than density can have a detrimental impact on the quality of the results: the results using equations (7) and (8) are the worst of the five equations. Even the results from equation (6) are of debatable quality: although it is in the same range as the results of equations (4) and (5), it shows a clear downward trend as we increase the number of goal categories considered, going from the best result for 2 goals to second-best with 3 goals to a narrow third place with 4 goals and finally to a more distant third place with 5 goals. Finally, we can see that the results using the exploration algorithm of [4] are clearly the worst ones, despite the system being updated to use the better category density equations and goal category mappings discovered in this study. The main difference between the two systems is thus the use of our old exploration algorithm to discover the goal categories nearest to any of the 25 base categories. This is also the source of the poorer results: the exploration algorithm is very sensitive to noise and outlier base categories that are near a goal category, and will return that goal category as a result. On the other hand, all five equations have in common that they sum the value of all base categories for each goal category, and therefore build-in noise tolerance. An outlier base category will seldom give enough of a score boost to one goal to eclipse the combined effect of the 24 other base categories on other goals. Out of curiosity, we ran the same test a second time, but this time keeping the 100 highest-density base categories. These results are presented in Figure 6, using the same line conventions as Figure 5. It is interesting to see that this time it is equation (6) that yields the best results with a solid lead, not equation (5). This indicates a more fundamental relationship in our system: the best summation for the goal categories is not an absolute but depends on the number of base categories retained. With a smaller set of 25 base categories, the system works best when it considers a larger picture including the impact of more distant categories. But with a larger set of 100 base categories, the abundance of more distant categories seems to generate noise, and the system works best by limiting their impact and by focusing on closer base categories. Figure 5. Goal score formulae using 25 base categories. Figure 6. Goal score formulae using 100 base categories. © 2012 ACADEMY PUBLISHER D. Number of Goal Categories Returned The final parameter in our system is the number of goal categories to return. We have already explained in Section IIIC that our preference to return three goal categories is based on a study of human classification – namely, in the KDD CUP 2005 competition [5], human labelers used on average 3.3 categories per query. Moreover, looking at the F1 figures we presented in the previous subsections, we can see that the curve seems to be exponential, with each extra category returned giving a lesser increase in F1 value. Returning a fifth goal JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 category gives the least improvement compared to returning only four goal categories, and in fact in some cases it causes a drop in F1. 
Returning three categories seems to be at the limit between the initial faster rate of increase of the curve and the later plateau. Another way to look at the question is to consider the average score of goal categories at each rank, after summing the densities of the base categories for each and ranking them. If on average the top-ranked categories have a large difference to the rest of the graph, it will show that there exist a robust division between the likelycorrect goal categories to return and the other goal categories. The opposite observation, on the other hand, would reveal that the rankings could be unstable and sensitive to noise, and that there is no solid score distinction between the goals our system returns and the others. For this part of the study, we used the summation of equation (5). We can recall from our discussion in Section III that the maximum word weight is Nk and the maximum category density is Nk². Queries in the KDD CUP data set are at most 10 words long, giving a maximum base category density of 100. This in turn gives a maximum goal category score of 1,002,400 using equation (5) and 25 base categories in the case where the distance between the goal category and each of the base categories is one except for a single base category at a distance of zero; in other words, the goal is one of the base categories found and all other base categories are immediately connected to it. More realistically, we find in our experiments that the average base category density computed by equation (3) is 1.48, and the average distance between a base and goal category is 5.6 steps, so an average goal category score using equation (5) would be 1.18. Figure 7 shows the average score of the goal category at each rank over all KDD CUP queries used in our experiment from the previous section, obtained using the method described above. This graph shows that the top category has on average a score of 3,272, several orders of magnitude above the average but still below our theoretical maximum. In fact, even the maximum score we observed in our experiments is only 67,506, very far below the theoretical maximum. This is due to the fact that most base categories are more than a single step removed from the goal category. The graph also shows the massive difference between the first three ranks of goal categories and the other 96. The average score goes from 3,272, to 657 at rank 2 and 127 at rank 3, down to 20 and 16 at ranks 4 and 5 respectively, then cover the interval from 2 to 0.7 between ranks 6 and 99. This demonstrates a number of facts. First of all, both the values of the first five goal ranks and the differences between their scores when compared to the other 94 shows that these first ranks are resilient to noise and variations. It also justifies our decision to study the performance of our system using the top 1 to 5 goal categories, and it gives further experimental support to our decision to limit the number © 2012 ACADEMY PUBLISHER 215 Figure 7. Average goal category score per rank over all KDD CUP queries. of goal categories returned by the classifier to three. It is interesting to note that the average score of the categories over the entire distribution is 42.53, very far off from our theoretical average of 1.18. However, if we ignore the first three ranks, whose values are very high outliners in this distribution, the average score becomes 1.62. Moreover, the average score over ranks 6 to 99 is 1.28. Both of these values are in line with the average we expected to find. 
E. The Optimal System

After having performed these experiments, we are ready to put forward the optimal classifier, the one that combines the best features from the options we have studied. This classifier uses Wt = Nk * (Nk / Nt) for equation (1), selects the top 25 base categories, ranks the goal categories using the summation formula of equation (5), and returns the top three categories ranked. The results we obtain with that system are presented in Table III, with other KDD CUP competition finalists reported in [5] for comparison. Note that participants had the option of entering their system for only the precision ranking or only the F1 ranking rather than both, and several participants chose to use that option. Consequently, there are some N/A values in the results in Table III.

TABLE III. CLASSIFICATION RESULTS
System            | F1 Rank | Overall F1 | Precision Rank | Overall Precision
KDDCUP #22        | 1       | 0.4444     | N/A            | 0.4141
KDDCUP #37        | N/A     | 0.4261     | 1              | 0.4237
KDDCUP #21        | 6       | 0.3401     | 2*             | 0.3409
Our system        | 7       | 0.3366     | 2              | 0.3643
KDDCUP #14        | 7*      | 0.3129     | N/A            | 0.3173
Our previous work | 10      | 0.2827     | 7              | 0.3065
KDDCUP Mean       |         | 0.2353     |                | 0.2545
KDDCUP Median     |         | 0.2327     |                | 0.2446
(* indicates competition systems that would have been outranked by ours.)

As can be seen from Table III, our system performs well above the competition average, and in fact ranks in the top-10 of the competition in F1 and in the top-5 in precision.
These queries are all single words, where the word is either an uncommon abbreviation (the query "AATFCU" for example), misspelled in an unusual way ("egyptains"), an erroneous compounding of two words ("contactlens"), a rare website URL, or even a combination of the above (such as the misspelled URL "studioeonline.com" instead of "studioweonline.com"). These are all situations that occur in real user search queries, and they are therefore present in the KDDCUP data set. It is worth noting that Wikipedia titles include common cases of all these errors, so that only the 5.9% most unusual cases lead to failure in our system.

It is interesting to study a specific example, to see the system's behavior step by step. We chose for this purpose to study a query for "internet explorer" in the KDDCUP set. This query was manually classified by the competition's three labelers into the KDDCUP categories "Computers\Software; Computers\Internet & Intranet; Computers\Security; Computers\Multimedia; Information\Companies & Industries" by the first labeler, into "Computers\Internet & Intranet; Computers\Software" by the second labeler, and into "Computers\Software; Computers\Internet & Intranet; Information\Companies & Industries" by the third labeler.

The algorithm begins by identifying a set of relevant base categories using the procedure explained in Section III.B and then weighting them using equation (3). For this query, our algorithm identifies 1,810 base categories and keeps the 25 highest-density ones, breaking the tie for position 25 by considering the number of titles pointing to the categories, as we explained in Section III.B. For any two-word query, the maximum title weight value that can be computed by equation (1) is 2, and the maximum base category density value that can be returned by equation (3) is 4. And in fact, we find that 8 categories receive this maximum density, including some of the examples listed in Table IV.

TABLE IV
SAMPLE BASE CATEGORIES

Rank    Category                                   Density   Titles
1       Internet Explorer                          4         36
2       Internet history                           4         32
3       Windows web browsers                       4         20
8       Microsoft criticisms and controversies     4         4
25      HTTP                                       2.67      5
26      Mobile phone web browsers                  2.67      4
33      Cascading Style Sheets                     2         5
37      Internet                                   1         17
660     PlayStation Games                          0.4       2
905     Islands of Finland                         0.33      1
1811    History of animation                       0.05      1

We can see from these examples that the top-ranked base categories are indeed very relevant to the query. Examining the entire set of base categories reveals that the density values drop to half the maximum by rank 33, and to a quarter of it by rank 37. The density value continues to drop as we go down the list: the average density of a base category in this example is 0.4, which corresponds to rank 660; by the middle of the list, at rank 905, the density is 0.33; and the final category in the list has a density of only 0.05. It can also be seen from the samples in Table IV that the relevance of the categories to the query does seem to decrease along with the density value. Looking at the complete list of 1,810 base categories, we find that the first non-software-related category is "Exploration" at rank 41, with a density of 1. But software-related categories continue to dominate the list, mixed with a growing number of non-software categories, until rank 354 (density of 0.5 and 1 title pointing to the category), where non-computer categories begin to dominate. Incidentally, the last software-related category in the list is "United States internet case law", at rank 1791 with a density of 0.11.
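To make the selection step above concrete, the following is a minimal sketch of how the top-25 base-category cutoff with a title-count tie-break could be implemented. It assumes, consistent with the maxima stated in the paper, that Nk is the number of query keywords matched by a Wikipedia title and Nt the total number of words in that title; the class and method names are ours, not the authors'.

    import java.util.*;

    class BaseCategorySelector {

        // Title weight of equation (1): Wt = Nk * (Nk / Nt), under the assumption
        // stated above about what Nk and Nt denote.
        static double titleWeight(int matchedKeywords, int titleLength) {
            return matchedKeywords * (matchedKeywords / (double) titleLength);
        }

        static class BaseCategory {
            final String name;
            final double density;    // accumulated by equation (3) over matching titles
            final int titleCount;    // number of matching titles pointing to this category
            BaseCategory(String name, double density, int titleCount) {
                this.name = name;
                this.density = density;
                this.titleCount = titleCount;
            }
        }

        // Keep the 'cutoff' highest-density categories; ties (e.g., at position 25)
        // are broken in favour of the category with more titles pointing to it.
        static List<BaseCategory> selectTop(List<BaseCategory> all, int cutoff) {
            all.sort((a, b) -> {
                int byDensity = Double.compare(b.density, a.density);
                return byDensity != 0 ? byDensity : Integer.compare(b.titleCount, a.titleCount);
            });
            return all.subList(0, Math.min(cutoff, all.size()));
        }
    }

In the "internet explorer" example above, the eight categories with density 4 would sort ahead of the rest, and the tie at position 25 would be resolved in favour of the category with more matching titles.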
The next step of our algorithm is to rank the 99 goal categories using the sum of density values in equation (5). Sample rankings are given in Table V. This table uses the Wikipedia goal category labels; the matching KDDCUP categories can be found in Appendix A.

TABLE V
SAMPLE GOAL CATEGORIES

Rank    Goal Category       Score
1       Internet            11.51
2       Software            10.09
3       Computing           8.03
4       Internet culture    5.63
5       Websites            4.97
16      Technology          3.54
18      Magazines           3.39
30      Industries          2.86
49      Law                 2.54
99      Renting             1.30
Refer to Appendix A for the list of KDDCUP categories corresponding to these goal categories.

We can see from these results that the scores drop by half from the first result to the fourth one. This is much less drastic than the drop we observed on average in Figure 7, but it is nonetheless consistent with it, as it shows a quick drop from a peak over the first three ranks and a longer, more stable tail over ranks 4 to 99. It is also encouraging to see that the two best goal categories selected by our system correspond to "Computers\Internet & Intranet" and "Computers\Software", the only two categories to be picked by all three KDDCUP labelers. The fourth goal corresponds to "Online Community\Other" and is the first goal that is not in the KDDCUP "Computers\" category, although it is still strongly relevant to the query. Further down, the first goal that corresponds neither to a "Computers\" nor an "Online Community\" category is Technology ("Information\Science & Technology") at rank 16, which is still somewhat related to the query, and the first truly irrelevant result is Magazines ("Living\Book & Magazine") at rank 18, with a little over a quarter of the top category's score. Of the categories picked by the labelers, the one that ranked worst in our system was "Information\Companies & Industries" at rank 30. All the other categories they identified are found in the top-10 results of our system.

F. New Data and Final Tests

In order to show that our results in Table III are general and not due to picking the best system for a specific data set, we ran two more tests of our system with two new data sets. The first data set is a set of 111 KDD CUP 2005 queries classified by a competition judge. This set was not part of the 800 test queries we used previously; it was a set of queries made available by the competition organizers to participants prior to the competition, to develop and test their systems. Naturally, the queries in this set are similar to the other KDD CUP queries, and so we expect similar results. The second data set is a set of queries taken from the TREC 2007 Question-Answering (QA) track [19]. That data set is composed of 445 questions on 70 different topics; we randomly selected three questions per topic to use for our test. It is also worth noting that the questions in TREC 2007 were designed to be asked sequentially, meaning that a system could rely on information from the previous questions, while our system is designed to classify each query by itself with no query history. Consequently, questions that were too vague to be understood without previous information were disambiguated by adding the topic label.
For example, the question "Who is the CEO?" in the series of questions on the company 3M was rephrased as "Who is the CEO of 3M?". Finally, two of the co-authors independently labeled the questions with KDD CUP categories in order to have a standard against which to compare our system's results in equations (9) and (10). The TREC data set was selected in order to subject our system to very different testing conditions: instead of the short, keyword-only KDD CUP web queries, TREC has long and grammatically-correct English questions.

The results from both tests are presented in Table VI, along with our system's development results already presented in Table III for comparison.

TABLE VI
TEST CLASSIFICATION RESULTS

Query set      Overall F1   Overall Precision   Overall Recall
KDDCUP 111     0.3636       0.4254              0.3175
TREC           0.4639       0.4223              0.5267
KDDCUP 800     0.3366       0.3643              0.3195

These results show that our classifier works better with the test data than with the training data it was developed and optimized on. This counter-intuitive result requires explanation. The greatest difference in our results is on recall, which increases by over 20% from the training KDDCUP test to the TREC test. Recall, as presented in equation (10), is the ratio of correct category labels identified by our system for a query to the total number of category labels the query really has. Since our classifier returns a fixed number of three categories per query, it stands to reason that it cannot achieve perfect recall for a query set that assigns more than three categories per query, and that it can achieve better recall on a query set that assigns fewer categories per query. To examine this hypothesis, we compared the results of five labelers individually: the three labelers of the KDDCUP competition and the two labelers of our TREC data set (the 111 KDDCUP demo queries, having been labeled by only one person, were not useful for this test). Specifically, we looked at the average number of categories per query each labeler used and the recall value our system achieved using that labeler's query set.

TABLE VII
CATEGORIZATION AND RECALL

Query set            Average number of categories   Recall
TREC Labeler 1       1.93 ± 0.81                    0.5443
TREC Labeler 2       2.91 ± 0.92                    0.5090
KDDCUP Labeler 2     2.39 ± 0.93                    0.3763
KDDCUP Labeler 1     3.67 ± 1.13                    0.3076
KDDCUP Labeler 3     3.85 ± 1.09                    0.2747

The results, presented in Table VII, show that our intuition is correct: query sets with fewer categories per query lead to higher recall, with the most drastic example being the increase of 1.5 categories per query between KDDCUP labelers 2 and 3, which yielded a 10% decrease in recall. However, it also appears from that table that the relationship does not hold across different query sets: KDDCUP labeler 2 assigns fewer labels per query than TREC labeler 2 but still has a much lower recall.

Next, we can contrast the two KDDCUP tests: they both had nearly identical recall, but the new data gave a 6% increase in precision. This is interesting because the queries come from the same data set, they are web keyword searches of the same average length, and the correct categorization statistics are nearly identical to those of Labeler 3, so we would actually expect the recall to be lower than it ended up being. An increase in both precision and recall can have the same origin in equations (9) and (10): a greater proportion of correct categories identified by our classifier. But everything else being equal, this would only happen if the queries themselves were easier for our system to understand. To verify this hypothesis, we checked both query sets for words that are unknown to our system.
As we explained previously, many of these unknown words are simple typos ("egyptains") or missing spaces between two words ("contactlens"); while they are unknown to and ignored by our system, their meaning is immediately obvious to the human labelers. The labelers thus have more information with which to classify the queries, which makes it inherently more difficult for our system to generate the same classification. Upon evaluation of our data, we find that the KDDCUP set of 800 queries features about twice the frequency of unknown words as the set of 111 queries. Indeed, 10.4% of queries in the 800-query set have unknown words and 4.4% of words overall are unknown, while only 5.4% of queries in the 111-query set have unknown words and only 2.5% of words in that set are unknown. This is an important difference between the two query sets, and we believe it explains why the 111 queries are more often classified correctly. It incidentally also indicates that an automated spelling corrector should be incorporated into the system in the future.

The better performance of our system on the TREC query set can be explained in the same way. Because that set is composed of correct English questions, it features even fewer unknown words: a mere 0.4% of words in 1.9% of queries. Moreover, for the same reason, the queries are much longer: on average 5.3 words in length after stopword removal, compared to 2.4 words for the KDDCUP queries. This means that even if there is an unknown word in a query, there are still enough other words in a TREC query for our system to make a reasonably good classification.

Differences in the queries aside, there do not appear to be any major distinctions, much less setbacks, when our classifier is used on new and unseen data sets. It seems robust enough to handle new queries in a different spread of domains, and to handle both web-style keyword searches and English questions without loss of precision or recall.

Finally, it is interesting to determine how our classifier's performance compares to that of a human doing the same labeling task. Query classification is a subjective task: since queries are short and often ambiguous, their exact meaning and classification is often dependent on human interpretation [20]. It is clear from Table VII that this is the case for our query sets: the human labelers do not agree with each other on the classification of these queries. We can evaluate the human labelers by computing the F1 of each one's classification compared to the others' on the same data set. In the case of the KDDCUP data, the average F1 of human labelers is known to be between 0.4771 and 0.5377 [5], while for our labeled TREC data we can compute the F1 between the two human labelers to be 0.5605. This means our system achieves between 63% and 71% of a human's performance when labeling the KDDCUP queries (0.3366/0.5377 ≈ 0.63 and 0.3366/0.4771 ≈ 0.71), and 83% of a human's performance when labeling the TREC queries (0.4639/0.5605 ≈ 0.83). It thus appears that, by this benchmark, our classifier again performs better on the TREC data set than on the KDDCUP one. This gives further weight to our conclusion that our system is robust enough to handle very diverse queries.

V. CONCLUSION

In this paper, we presented a ranking and classification algorithm that exploits the Wikipedia category graph to find the best set of goal categories given user-specified keywords. To demonstrate its efficiency, we implemented a query classification system using our algorithm.
We performed a thorough study of the algorithm in this paper, focusing on each design decision individually and considering the practical impact of different alternatives. We showed that our system’s classification results compare favorably to those of the KDD CUP 2005 competition: it would have ranked 2nd on precision with a performance 10% better than the competition mean, and 7th in the competition on F1. We further detailed the results of an example query in key steps of the algorithm, to demonstrate that each partial result is correct. And finally we presented two blind tests on different data sets that were not used to develop the system, to validate our results. We believe this work will be of interest to anyone developing query classification systems, text classification systems, or most other kinds of classification software. By using Wikipedia, a classification system gains the ability to classify queries into a set of almost 300,000 categories covering most of human knowledge and which can easily be mapped to a simpler application-specific set of categories when needed, as we did in this study. And while we considered and tested multiple alternatives at every design stage of our system, it is possible to conceive of further alternatives that could be implemented on the same framework and compared to our results. Future work can focus on exploring these alternatives and further improving the quality of the classification. In that respect, as we indicated in Section IV.F, one of the first directions to work in will be to integrate an automated corrector into the system, to address the problem of unknown words. APPENDIX A This appendix lists how we mapped the 67 KDD CUP categories to 99 corresponding Wikipedia categories in the September 2008 version of the encyclopedia. KDD CUP Category Computers\Hardware Wikipedia Category Computer hardware Internet Computers\Internet & Intranet Computer networks Computers\Mobile Mobile computers Computing Computers\Multimedia Multimedia Networks Computers\Networks & Telecommunication Telecommunications JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 
3, AUGUST 2012 Computers\Security Computers\Software Computers\Other Entertainment\Celebrities Computer security Software Computing Celebrities Games Entertainment\Games & Toys Toys Entertainment\Humor & Fun Humor Entertainment\Movies Film Entertainment\Music Music Entertainment\Pictures & Photographs Photos Entertainment\Radio Radio Entertainment\TV Television Entertainment\Other Entertainment Arts Information\Arts & Humanities Humanities Companies Information\Companies & Industries Industries Science Information\Science & Technology Technology Information\Education Education Law Information\Law & Politics Politics Regions Information\Local & Municipalities Regional Local government Reference Information\References & Libraries Libraries Information\Other Information Books Living\Book & Magazine Magazines Automobiles Living\Car & Garage Garages Living\Career & Jobs Employment Dating Living\Dating & Relationships Intimate relationships Family Living\Family & Kids Children Fashion Living\Fashion & Apparel Clothing Finance Living\Finance & Investment Investment Food and drink Living\Food & Cooking Cooking Decorative arts Living\Furnishing & Furnishings Houseware Home appliances Giving Living\Gifts & Collectables Collecting Health Living\Health & Fitness Exercise Landscape Living\Landscaping & Gardening Gardening Pets Living\Pets & Animals Animals © 2012 ACADEMY PUBLISHER Living\Real Estate Living\Religion & Belief Living\Tools & Hardware Living\Travel & Vacation Living\Other Online Community\Chat & Instant Messaging Online Community\Forums & Groups Online Community\Homepages Online Community\People Search Online Community\Personal Services Online Community\Other Shopping\Auctions & Bids Shopping\Stores & Products Shopping\Buying Guides & Researching Shopping\Lease & Rent Shopping\Bargains & Discounts Shopping\Other Sports\American Football Sports\Auto Racing Sports\Baseball Sports\Basketball Sports\Hockey Sports\News & Scores Sports\Schedules & Tickets Sports\Soccer Sports\Tennis Sports\Olympic Games Sports\Outdoor Recreations Sports\Other 219 Real estate Religion Belief Tools Hardware (mechanical) Travel Holidays Personal life On-line chat Instant messaging Internet forums Websites Internet personalities Online social networking Virtual communities Internet culture Auctions and trading Retail Product management Consumer behaviour Consumer protection Renting Sales promotion Bargaining theory Distribution, retailing, and wholesaling American football Auto racing Baseball Basketball Hockey Sports media Sport events Seasons Football (soccer) Tennis Olympics Outdoor recreation Sports REFERENCES [1] Maj. B. J. Jansen, A. Spink, T. Saracevic, “Real life, real users, and real needs: a study and analysis of user queries on the web”, Information Processing and Management, vol. 36, issue 2, 2000, pp. 207-227. [2] D. Shen, R. Pan, J.-T. Sun, J. J. Pan, K. Wu, J. Yin, Q. Yang, “Q2C@UST: our winning solution to query classification in KDDCUP 2005”, ACM SIGKDD Explorations Newsletter, vol. 7, issue 2, 2005, pp. 100-110. [3] M. Alemzadeh, F. Karray, “An efficient method for tagging a query with category labels using Wikipedia towards enhancing search engine results”, 2010 IEEE/WIC/ACM International Conference on Web 220 [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 Intelligence and Intelligent Agent Technology, Toronto, Canada, 2010, pp. 192-195. M. Alemzadeh, R. Khoury, F. 
Karray, “Exploring Wikipedia’s Category Graph for Query Classification”, in Autonomous and Intelligent Systems, M. Kamel, F. Farray, W. Gueaieb, A. Khamis (eds.), Lecture notes in Artificial Intelligence, 1st edition, vol. 6752, Springer, 2011, pp. 222-230. Y. Li, Z. Zheng, H. Dai, “KDD CUP-2005 report: Facing a great challenge”, ACM SIGKDD Explorations Newsletter, vol. 7 issue 2, 2005, pp. 91-99. D. Shen, J. Sun, Q. Yang, Z. Chen, “Building bridges for web query classification”, Proceedings of SIGIR’06, 2006, pp. 131-138. H. Cao, D. H. Hu, D. Shen, D. Jiang, J.-T. Sun, E. Chen, and Q. Yang. “Context-aware query classification”, Proceedings of SIGIR, 2009. J. Fu, J. Xu, K. Jia, “Domain ontology based automatic question answering”, International Conference on Computer Engineering and Technology (ICCET '08), vol. 2, 2009, pp. 346-349. J. Yu, N. Ye, “Automatic web query classification using large unlabeled web pages”, 9th International Conference on Web-Age Information Management, Zhangjiajie, China, 2008, pp. 211-215. S. M. Beitzel, E. C. Jensen, D. D. Lewis, A. Chowdhury, O. Frieder, “Automatic classification of web queries using very large unlabeled query logs”, ACM Transactions on Information Systems, vol. 25, no. 2, 2007, article 9. J. Hu, G. Wang, F. Lochovsky, J.-T. Sun, Z. Chen, “Understanding user's query intent with Wikipedia”, Proceedings of the 18th international conference on World Wide Web, Spain, 2009, pp. 471-480. R. Khoury, “Query Classification using Wikipedia”, International Journal of Intelligent Information and Database Systems, vol. 5, no. 2, April 2011, pp. 143-163. P. Schönhofen, “Identifying document topics using the Wikipedia category network”, Web Intelligence and Agent Systems, IOS Press, Vol. 7, No. 2, 2009, pp. 195-207. Z. Minier, Z. Bodo, L. Csato, “Wikipedia-based kernels for text categorization”, International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, Romania, 2007, pp, 157-164. P. Wang, J. Hu, H.-J. Zeng, Z. Chen, “Using Wikipedia knowledge to improve text classification”, Knowledge and Information Systems, vol. 19, issue 3, 2009, pp. 265-281. S. Banerjee, K. Ramanathan, A. Gupta, “Clustering short texts using Wikipedia,” Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, Amsterdam, Netherlands, 2007 pp. 787-788. K. Coursey, R. Mihalcea, “Topic identification using Wikipedia graph centrality”, Proceedings of NAACL HLT, 2009, pp. 117-120. Z. S. Syed, T. Finin, A. Joshi, “Wikipedia as an ontology for describing documents”, Proceedings of the Second International Conference on Weblogs and Social Media, March 2008. H. T. Dang, D. Kelly, J. Lin, “Overview of the TREC 2007 question answering track”, Proceedings of the Sixteenth Text Retrieval Conference, 2007. B. Cao, J.-T. Sun, E. W. Xiang, D. H. Hu, Q. Yang, Z. Chen, “PQC: personalized query classification”, Proceedings of the 18th ACM conference on information and knowledge management, Hong Kong, China, 2009, pp. 1217-1226. © 2012 ACADEMY PUBLISHER JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 
3, AUGUST 2012

Towards Identifying Personalized Twitter Trending Topics using the Twitter Client RSS Feeds

Jinan Fiaidhi, Sabah Mohammed
Department of Computer Science, Lakehead University, Thunder Bay, Ontario P7B 5E1, Canada
{jfiaidhi,mohammed}@lakeheadu.ca

Aminul Islam
Department of Computer Science, Lakehead University, Thunder Bay, Ontario P7B 5E1, Canada
maislam@lakeheadu.ca

Abstract—We are currently witnessing an information explosion, aided by micro-blogging services like Twitter. Although Twitter provides a real-time list of the most popular topics people tweet about, known as trending topics, it is often hard to understand what these trending topics are about, and most of them are far removed from the personal preferences of the Twitter user. In this article, we address the issue of personalizing the search for trending topics by enabling the Twitter user to provide RSS feeds that capture their personal preferences, along with a Twitter client that can filter personalized tweets and trending topics according to a sound algorithm for capturing the trending information. The algorithms used are Latent Dirichlet Allocation (LDA) and the Levenshtein distance. Our experiments show that the developed prototype for personalized trending topics (T3C) finds more interesting trending topics that match the Twitter user's list of preferences than traditional techniques without RSS personalization.

Index Terms—Trending topics; Twitter; Streaming; Classification; LDA

I. INTRODUCTION

Twitter is a popular social networking service with over 100 million users. Twitter monitors the millions and billions of 140-character bits of wisdom that travel the Twitterverse and lists the top 10 hottest trends (also known as "trending topics") [1]. With such social networking, streams have become the main source of information, for sharing and analyzing it as it comes into the system. Streams are a central concept in most important Twitter applications. They have become so important that they even replace search engines as a starting point for Web browsing: a typical Web session now consists of reading Twitter streams and following the links found in these streams instead of starting with a Web search.

One of the central applications of Twitter streams is mining trending topics in real time. To develop such an application, one needs to use one of the three Twitter Application Programming Interfaces (APIs). The first is the REpresentational State Transfer (REST) API, which covers the basic Twitter functions (e.g., sending direct messages, retweeting, manipulating lists). The second is the Twitter Search API, which can do everything that Twitter Advanced Search can do. The third is the streaming API, which gives developers low-latency access to Twitter's global stream of tweet data. In particular, the streaming API gives the developer the ability to create a long-standing connection to Twitter that receives "push" updates when new tweets matching certain criteria arrive, obviating the need to constantly poll for updates. For this reason, the streaming API has become the more common choice for Twitter applications such as finding trending topics. In such an approach, the user subscribes to a set of other users (e.g., through FriendFeed) and reads the stream made up of their posts.
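As an illustration of this push-based model, the following is a minimal sketch of a filtered streaming client using the Twitter4J library mentioned later in this paper; it is not the T3C implementation itself, and the tracked keyword and the credential handling are placeholder assumptions.

    import twitter4j.*;

    // Minimal sketch of a filtered streaming connection (not the actual T3C code).
    public class StreamingSketch {
        public static void main(String[] args) throws Exception {
            // Assumes OAuth credentials are configured in twitter4j.properties.
            TwitterStream stream = new TwitterStreamFactory().getInstance();

            // Print every tweet pushed to us that matches the filter.
            stream.addListener(new StatusAdapter() {
                @Override
                public void onStatus(Status status) {
                    System.out.println(status.getUser().getScreenName() + ": " + status.getText());
                }
            });

            // Track a sample keyword; Twitter pushes matching tweets as they arrive.
            FilterQuery query = new FilterQuery();
            query.track(new String[]{"basketball"});
            stream.filter(query);
        }
    }

The filter predicate can also be a location bounding box rather than a keyword, which corresponds to the location-based collection described in the next section.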
However, the problem with this approach is that there is always a compromise between the number of users one would like to follow and the amount of information one is able to consume. Twitter users share a variety of comments regarding a wide range of topics. Some researchers have recommended a streaming approach that identifies interesting tweets based on their density, negativity, trending and influence characteristics [2, 3]. However, mining this content to define user interests is a challenge that requires an effective solution. Certainly, identifying a personalized stream that contains only a moderate number of posts that are potentially interesting for the user can be used for the customization and personalization of a variety of commercial and non-commercial applications, such as product marketing and recommendation. Recently, Twitter introduced local trending topics, which contribute to the solution of this problem by improving the Discover tab to show what users in your geographic area are tweeting about. But this service falls short of providing many kinds of personalized trends, such as displaying trends only from those you follow or from those they follow. There are other current attempts to fill this gap, such as Cadmus (http://thecadmus.com/) and KeyTweet (http://keytweet.com/); however, there is no comprehensive solution that provides a wide range of personalization avenues for Twitter users. This article introduces our investigation into identifying personalized trending topics over a stream of tweets.

II. RELATED RESEARCH

Research on topic identification within textual data is related to information retrieval, data mining, or a hybrid of both. Information retrieval research provides searching techniques that can identify the main concepts in a given text based on structural elements available within the provided text (e.g., by identifying noun phrases as good topic markers [5]). This is a multi-stage process that starts by identifying key concepts within a document, then grouping these to find topics, and finally mapping the topics back to documents and using the mapping to find higher-level groupings. Information retrieval research also utilizes computational linguistics and natural language techniques to predict important terms in a document, using methods like coreference, anaphora resolution, or discourse centering [6]. However, the important terms identified by linguistic techniques do not necessarily correspond to the subject or theme of the text. Predicting important terms involves numerically weighting the terms in a document; terms with the top weights are judged important and representative of the document. Term extraction methods like term frequency-inverse document frequency (TF-IDF) [7] follow this direction: TF-IDF generally extracts from a text keywords which represent topics within the text, but it does not perform segmentation. A segmentation method (e.g., TextTiling [8]) generally segments a text into blocks (paragraphs) according to topic changes within the text, but it does not by itself identify (or label) the topics discussed in each of the blocks. While both techniques (i.e., TF-IDF and segmentation) have some appealing features—notably the basic identification of sets of words that are discriminative for documents in the collection—these approaches also provide a relatively small reduction in description length and reveal little in the way of inter- or intra-document statistical structure.
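For reference, a standard form of the TF-IDF weighting discussed above is given below; this is the textbook definition rather than a formula taken from [7], so the exact notation is our assumption.

    \mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log \frac{N}{\mathrm{df}(t)}

Here tf(t, d) is the frequency of term t in document d, df(t) is the number of documents containing t, and N is the number of documents in the collection; the terms with the highest weights are then taken as topic markers.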
To address these shortcomings, IR researchers have proposed several other dimensionality reduction and topic identification techniques, such as latent semantic indexing (LSI) and Latent Dirichlet Allocation (LDA) [9]. On the other hand, data mining approaches try to analyze text to find frequent itemsets, or groups of named entities that commonly appear together in a training dataset, and use these associations to predict topics in future documents [10]. This approach assumes a previously available dataset and is not suitable for streaming and dynamically changing topics such as those associated with Twitter. For this reason, we consider this approach to be out of the scope of this article.

III. DEVELOPING A STREAMING CLIENT FOR IDENTIFYING TRENDING TOPICS

It is a simple task to start developing a Twitter streaming client, especially with the availability of a variety of Twitter streaming libraries (e.g., Twitter4J, http://repo1.maven.org/maven2/net/homeip/yusuke/twitter4j/; JavaTwitter, http://www.javacodegeeks.com/2011/10/java-twitter-client-withtwitter4j.html; JTwitter, http://www.winterwell.com/software/jtwitter.php). However, modifying such a client to search for trending topics and adapt to the user's preferences is another issue that adds programming complexity. The advantage of using trending topics is to reduce the messaging overload that each active user receives every day. Without classification of the incoming tweets, users are forced to march through a chronologically-ordered morass to find tweets of interest. Finding personalized trending topics and grouping tweets into coherently clustered trending topics for more directed exploration will simplify searching for and identifying tweets of interest. In this section we present a Twitter client that groups tweets, according to the user's preferences, into topics mentioned explicitly or implicitly, which users can then browse for items of interest. To implement this topic clustering, we have developed a revised LDA (Latent Dirichlet Allocation) algorithm for discovering trending topics. Figure 1 illustrates the structure of our Trending Topics Twitter Client (T3C).

Figure 1. The Structure of the T3C Twitter Client.

Data was collected using the Twitter streaming API (http://twitter.com), with the filter tweet stream providing the input data and the trends/location stream providing the list of terms identified by Twitter as trending topics. The filter streaming API is a limited stream that returns public statuses matching one or more filter predicates. The United States (New York) and Canada (Toronto) were used as the locations for evaluation. The Google Geocoding API (http://developers.google.com/maps/geocoding) was used to get location-specific Twitter data. The streaming data was collected automatically using the Twitter4J API and stored in a tabular CSV-formatted file. Data was collected at different time intervals for the same city and topic, and we collected datasets on different topics from different cities at different time intervals. We collected about 200 tweets on Canadian gas price topics from Thunder Bay and Toronto on 25th and 26th April 2012, and about 28,000 tweets on sports (basketball) related topics from New York and Toronto on 8th and 9th May 2012.
We collected about 600 tweets on health (flu) related topics from Toronto and Vancouver on 8th and 9th May 2012, about 6,000 tweets on political (election) topics from Los Angeles and Toronto on 9th and 10th May 2012, and about 2,000 tweets on education (engineering school) related topics from Toronto and New York on 9th and 10th May 2012. Additionally, we collected a large set of data from the USA and Canada between 25th June 2012 and 30th June 2012: a total of 2,736,048 tweets (economy 1,795,211, education 89,455, health 390,801, politics 60,265, sports 400,316). We ran our client to collect the data automatically, and we used multiple Twitter accounts to collect data concurrently; the collected data is available at http://flash.lakeheadu.ca/~maislam/Data. Next, the tweets were preprocessed to remove URLs, Unicode characters, usernames, punctuation, HTML, etc. A stop word file containing common English stop words (http://flash.lakeheadu.ca/~maislam/Data/stopwords.txt) was used to filter common words out of the tweets. The T3C client collects tweets and filters those that match the user's preferences according to the feeds sent by the T3C user via the RSS protocol.

IV. T3C TRENDING TOPICS PERSONALIZATION

In this section, we describe an improved Twitter personalization mechanism that incorporates the user's personalization RSS feeds. The user of our T3C Twitter client can provide his or her own personalization feeds via the RSS protocol, or can directly upload the personalization data list from a given URL. The T3C reads these feeds with an XMLEventReader that reads all the available feeds and stores them in a personalization data list. Figure 2 illustrates the filtering of personalized tweets through the streaming process.

Figure 2. Filtering Personalized Tweets During Streaming.

While the Twitter API collects tweets to form a dataset, the tweets that are related to the user's personalization feeds are filtered using a string similarity method based on the Levenshtein distance (http://en.wikipedia.org/wiki/Levenshtein_distance). The Levenshtein distance is a measure between strings: the minimum cost of transforming one string into another through a sequence of edit operations. In our T3C, the use of this measure can be illustrated with the following code snippet:

    // Filter the streamed tweets: keep a tweet only if it is within a Levenshtein
    // distance of 80 from at least one entry in the user's personalization list.
    while ((line = input.readLine()) != null) {
        line = cleanup(line);                    // strip URLs, usernames, punctuation, etc.
        double distance = 80;                    // default threshold when no personalization list is given
        if (personalize.size() > 0) {
            distance = 4000;                     // large initial value so the minimum distance can be found
        }
        for (int j = 0; j < personalize.size(); ++j) {
            String comparisonTweet = personalize.get(j);
            int thisDistance = Util.computeLevenshteinDistance(comparisonTweet, line);
            if (distance > thisDistance) {
                distance = thisDistance;         // keep the minimum distance over all personalization entries
            }
        }
        if (distance <= 80) {
            articleTextList.add(line);           // tweet is close enough to the user's preferences
        }
    }

The similarity detection loop continues until the end of the dataset. For each tweet we remove URLs, Unicode characters, usernames, punctuation, HTML, stop words, etc. The similarity loop then iterates over the user's personalization RSS data list to get the minimum Levenshtein distance value. In our implementation we have set a distance threshold of 80, which we found to be a good value for catching most related personalized tweets (a sketch of a standard Levenshtein distance routine is given below). We also found that the Levenshtein distance lets us remove duplicate tweets: a distance of zero indicates that two tweets are identical [11]. After filtering the personalized tweets, the Latent Dirichlet Allocation (LDA) algorithm is used to generate the trending topics model.
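The Util.computeLevenshteinDistance routine used above is not shown in the paper; the following is a minimal sketch of what such a utility could look like, using the standard dynamic-programming formulation of the Levenshtein distance. The class and method names are taken from the snippet above, but this body is our assumption, not the authors' code.

    class Util {
        // Standard dynamic-programming Levenshtein distance: the minimum number of
        // single-character insertions, deletions, and substitutions turning a into b.
        static int computeLevenshteinDistance(String a, String b) {
            int[][] d = new int[a.length() + 1][b.length() + 1];
            for (int i = 0; i <= a.length(); i++) d[i][0] = i;   // delete all of a
            for (int j = 0; j <= b.length(); j++) d[0][j] = j;   // insert all of b
            for (int i = 1; i <= a.length(); i++) {
                for (int j = 1; j <= b.length(); j++) {
                    int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                    d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,       // deletion
                                                d[i][j - 1] + 1),      // insertion
                                       d[i - 1][j - 1] + cost);        // substitution
                }
            }
            return d[a.length()][b.length()];
        }
    }

A distance of zero, as noted above, means the two strings are identical, which is how duplicate tweets can be detected.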
The basic idea of LDA is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words [4]. LDA makes the assumption that document generation can be explained in terms of these distributions, which are assumed to have a Dirichlet prior. First a topic distribution is chosen for the document, and then each word in the document is generated by randomly selecting a topic from the topic distribution and randomly selecting a word from the chosen topic. Given a set of documents, the main challenge is to infer the word distributions and topic mixtures that best explain the observed data. This inference is computationally intractable, but an approximate answer can be found using a Gibbs sampling approach. The LingPipe LDA implementation (http://aliasi.com/lingpipe/docs/api/com/aliasi/cluster/LatentDirichletAllocation.html) was used in our Twitter client prototype. In this LDA implementation, a topic is nothing more than a discrete probability distribution over words. That is, given a topic, each word has a probability of occurring, and the sum of all word probabilities in a topic must be one. For the purposes of LDA, a document is modeled as a sequence of tokens. We use the tokenizer factory and symbol table to convert the text to a sequence of token identifiers in the symbol table, using the static utility method built into LingPipe LDA (http://alias-i.com/lingpipe/demos/tutorial/cluster/read-me.html). Using the LingPipe LDA API, we can report topics according to the following code snippet:

    // Report the top words of each LDA topic, keeping only words that are close
    // (Levenshtein distance <= 4) to an entry in the user's personalization list.
    for (int topic = 0; topic < numTopics; ++topic) {
        int topicCount = sample.topicCount(topic);   // number of tokens assigned to this topic
        ObjectToCounterMap<Integer> counter = new ObjectToCounterMap<Integer>();
        for (int wordId = 0; wordId < numWords; ++wordId) {
            String word = mSymbolTable.idToSymbol(wordId);
            double distance = 4;                     // default threshold when no personalization list is given
            if (personalize.size() > 0) {
                distance = 4000;                     // large initial value so the minimum can be found
            }
            for (int j = 0; j < personalize.size(); ++j) {
                String comparisonTweet = personalize.get(j);
                int thisDistance = Util.computeLevenshteinDistance(comparisonTweet, word);
                if (distance > thisDistance) {
                    distance = thisDistance;
                }
            }
            if (distance <= 4) {
                counter.set(Integer.valueOf(wordId), sample.topicWordCount(topic, wordId));
            }
        }
        List<Integer> topWords = counter.keysOrderedByCountList();
    }

This iterative process maps the word identifiers to their counts in the current topic. The resulting mapping is sorted by count, from high to low, and assigned to a list of integers. Trending topics are then ranked according to the Z score, by testing the binomial hypothesis of a word's frequency in the personalized topic against its frequency in the corpus [11]. Tables I and II illustrate running T3C with and without personalization, with an initial search around football and basketball.

TABLE I.
TRENDING TOPICS WITHOUT PERSONALIZATION

Trending Topic        Count   Probability
basketball            12479   0.161
play                  2471    0.032
watch                 1277    0.016
school                1271    0.016
game                  1153    0.015
bballproblemz         1109    0.014
#basketballproblem    1082    0.014
basketballproblem     1079    0.014
asleep                1063    0.014
love                  874     0.011
player                853     0.011
football              647     0.008

TABLE II.
TRENDING TOPICS WITH PERSONALIZATION

Trending Topic   Count   Probability
basketball       3660    0.298
watch            372     0.030
love             322     0.026
play             322     0.026
game             289     0.024
football         204     0.017
player           181     0.015
team             82      0.007
don              72      0.007
season           72      0.007
baseball         59      0.005
short            57      0.005
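The Z-score ranking mentioned above tests how over-represented a word is in the personalized topic relative to the whole corpus. The exact statistic used by the authors is not given in the paper, so the following is only a sketch of a standard one-proportion binomial z-test that matches the description; the class, method, and variable names are ours.

    class TrendScore {
        // One-proportion binomial z-test: how far the word's rate inside the
        // personalized topic deviates from its expected rate under the
        // corpus-wide frequency. A larger z means the word is trending harder.
        static double zScore(int countInTopic, int topicSize,
                             int countInCorpus, int corpusSize) {
            double p0 = countInCorpus / (double) corpusSize;      // corpus-wide word rate
            double observed = countInTopic / (double) topicSize;  // rate inside the topic
            double stdErr = Math.sqrt(p0 * (1 - p0) / topicSize);
            return (observed - p0) / stdErr;
        }
    }

Words within a topic would then be ranked by decreasing z to obtain the final trending-topic ordering.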
V. EXPERIMENTATION RELATED TO THE IDENTIFICATION OF TRENDING TOPICS

Our experimentation starts by collecting reasonable samples of tweets on general topics like health, education, sports, economy, and politics. For this purpose, we run our T3C client to find trending topics both with certain personalization feeds/queries applied and without any personalization. To demonstrate the effects of our RSS-based personalization, we conducted a series of experiments. For the first experiment, we collected 390,801 tweets related to the health topic and applied our personalization mechanism by uploading medical feeds related to cancer/oncology research from MedicalNewsToday (http://www.medicalnewstoday.com/rss/cancer-oncology.xml). For comparison purposes, we used the same sample and filtered trending topics without personalization feeds, using the Twitter filter API and the LDA algorithm. For personalization, we first apply the Twitter filter API to get general health-related tweets, then call our RSS reader to read the client's personalization feeds, and after that we apply the Levenshtein distance algorithm (between the user's personalization feeds and the health-related tweets) followed by the LDA algorithm to finally find the personalized trending topics. Figures 3a and 3b illustrate the comparison between finding trending topics with personalization and without personalization. For this experiment, we fixed several variables: the Dirichlet priors at 0.01 for η and 0.01 for α, the number of topics at 12, the number of samples at 2000, the burn-in period at 200, and the sampling frequency at 5 (see http://alias-i.com/lingpipe/docs/api/index.html).

Figure 3. Comparing health-related trending topics with RSS personalization and without personalization: (a) comparison histogram; (b) comparison graph.

Figures 4a and 4b show the most frequent health-related trending topic words for both the personalized and non-personalized cases. Moreover, we conducted similar experiments using other general topics. For economy and finance, we collected 1,795,211 tweets and used the Economist banking RSS feed (http://www.economist.com/topics/banking/index.xml). For education we collected 89,455 tweets and used the CBC technology feed (http://rss.cbc.ca/lineup/technology.xml). For politics we collected 60,265 tweets and used the CBC politics feed (http://rss.cbc.ca/lineup/politics.xml). Finally, for sports we collected 400,316 tweets and used the CBC sports feed (http://rss.cbc.ca/lineup/sports.xml). We publish all the results of these experiments on our Lakehead University Flash server (http://flash.lakeheadu.ca/~maislam/TestSample). Our experiments show clearly that our RSS-based personalization mechanism finds trending topics that match the user's preferences as expressed through the provided feeds.

Figure 4. Frequency counts for trending topics with and without RSS personalization.

VI. CONCLUSIONS

Among the numerous tweets users receive daily, certain popular issues tend to capture their attention. Such trending topics are of great interest not only to Twitter micro-bloggers but also to advertisers, marketers, journalists and many others. An examination of the state of the art in this area reveals progress that lags its importance [14]. In this article, we have introduced a new method for identifying trending topics using RSS feeds.
In this method we used two algorithms to identify tweets that are similar to the RSS Levenshtein Distance algorithm and the LDA. Although LDA is a popular information retrieval algorithm that have been used also for finding trending topics [12], no attempt that we know have used the RSS feed for personalization. Figure 5 shows a screenshot of GUI of our RSS-Based Personalization Twitter Client (T3C). 226 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 [4] [5] [6] [7] Figure 5. GUI for the RSS-Based T3C Client. We are continuing our attempts to develop more personalization mechanisms that adds more focused identification of personalized trending topics using techniques that utilize machine learning algorithms [13]. The results of these experiments will be the subject of our next article. ACKNOWLEDGMENT [8] [9] [10] [11] Dr. J. Fiaidhi would like to acknowledge the support of NSERC for the research conducted in this article. REFERENCES James Benhardus, Streaming Trend Detection in Twitter, 2010 UCCS REU FOR ARTIFICIAL INTELLIGENCE, NATURAL LANGUAGE PROCESSING AND INFORMATION RETRIEVAL FINAL REPORT. [2] Ming Hao et. al, Visual sentiment analysis on twitter data streams,2011 IEEE Conference onVisual Analytics Science and Technology (VAST), 23-28 Oct. 2011, pp277 – 278 [3] Suzumura, T. and Oiki, T., StreamWeb: Real-Time Web Monitoring with Stream Computing, 2011 IEEE [12] [1] © 2012 ACADEMY PUBLISHER [13] [14] International Conference on Web Services (ICWS), 4-9 July 2011, pp620 – 627 Kevin R. Canini, Lei Shi and Thomas L. Griffiths, Online Inference of Topics with Latent Dirichlet Allocation,In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2009, http://cocosci.berkeley.edu/tom/papers/topicpf.pdf Bendersky, M. and Croft, W.B. Discovering key concepts in verbose queries. SIGIR '08, ACM Press (2008). Nomoto, Tadashi and Matsumoto, Yuji,EXPLOITING TEXT STRUCTURE FOR TOPIC IDENTIFICATION, Workshop On Very Large Corpora, 1996 Salton, G., & Yang, C. S. (1973). On the specification of term values in automatic indexing. Journal of Documentation, 29(4), 351–372. Hearst, M. (1997). Texttiling: Segmenting text into multiparagraph subtopic passages. Computational Linguistics, 23(1), 33–64. David M. Blei, Andrew Y. Ng and Michael I. Jordan, Latent Dirichlet Allocation, Journal of Machine Learning Research 3 (2003) 993-1022 Alexander Pak and Patrick Paroubek, Twitter as a Corpus for Sentiment Analysis and Opinion Mining, Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)}, May 19-21, 2010,Valletta, Malta. Alex Hai Wang, Don’t Follow me: Spam Detection in Twitter, IEEE Proceedings of the 2010 International Conference on Security and Cryptography (SECRYPT), 26-28 July 2010, http://test.scripts.psu.edu/students/h/x/hxw164/files/SECR YPT2010_Wang.pdf Daniel Ramage, Susan Dumais, and Dan Liebling, Characterizing Microblogs with Topic Models, in Proc. ICWSM 2010, American Association for Artificial Intelligence , May 2010 Kathy Lee, Diana Palsetia, Ramanathan Narayanan, Md. Mostofa Ali Patwary, Ankit Agrawal, and Alok Choudhary, Twitter Trending Topic Classification, 2011 11th IEEE International Conference on Data Mining Workshops, ICDMW2011,pp.251-258. Fang Fang and Nargis Pervin, Finding Trending Topics in Twitter in Real Time, NRICH Research, 2010, Available online: http://nrich.comp.nus.edu.sg/research_topic3.html. JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 
3, AUGUST 2012 227 Architecture of a Cloud-Based Social Networking News Site Jeff Luo, Jon Kivinen, Joshua Malo, Richard Khoury Department of Software Engineering, Lakehead University, Thunder Bay, Canada Email: {yluo, jlkivine, jmalo, rkhoury}@lakeheadu.ca Abstract—Web 2.0 websites provide a set of tools for internet users to produce and consume news items. Prominent examples of news production sites include Issuu (issuu.com) and FlippingBook (page-flip.com) which allows users to upload publication files and transform them into flash-animated online publications with integrated socialmedia-sharing and statistics-tracking features. A prominent example of news consumption site is Google News (news.google.com), which allows users some degree of control over the layout of the presentation of news feeds, including trusted news sources and extra category keywords, but offers no real editing and social sharing components. This proposed project bridges the gap between news production sites and news consumption sites in order to offer to any user - including non-profit organizations, student or professional news media organizations, and the average Internet user - the ability to create, share, and consume social news publications in a way that gives users complete control of the layout and content of their paper, the facilities to share designs and article collections socially, as well as provide related article suggestions all in a single easy to use horizontally scaling system. Index Terms—Web 2.0, Web engineering, Cloud computing, News, Social networking, Recommendation systems I. INTRODUCTION Web 2.0 applications, and in particular social networking sites, enjoy unprecedented popularity today. For example, there were over 900,000 registered users on Facebook at the end of March 2012 1 , more than the population of any country on Earth save China and India. In a parallel development, a recent survey by the Pew Research Centre [1] discovered that an overwhelming 92% of Americans get news from multiple platforms, including in 61% of cases online news sources. Moreover, this survey showed that the news is no longer seen as a passive “they report, we read” activity but as an interactive activity, with 72% of Americans saying they follow the news explicitly because they enjoy talking about it with other people. The social aspect of news is dominant online: 75% of online news consumers receive news articles from friends, 52% retransmit those news articles, and 25% of people contribute to them by writing comments. There is also a clear intersection between 1 http://newsroom.fb.com/content/default.aspx?NewsAreaId=22 © 2012 ACADEMY PUBLISHER doi:10.4304/jetwi.4.3.227-233 social networks and news consumption: half of socialnetwork-using news consumers use the social network to receive news items from their friends or from people they follow, and a third use their social networking site to actively share news. However, news portal sites such as GoogleNews remain the single most popular online news source. There is thus a clear user need for a social networking news site. Such a site will combine the interactive social aspect of news that users enjoy with the diversity of news sources that portal sites offer. 
The aim of this paper is to develop a new Web 2.0 site that offers any news publisher - including non-profit organizations, student or professional news media organizations, and the average Internet user - the ability to create and share news publications in a way that gives them complete control over the layout and content of their paper as well as the sharing and accessibility of their publications. The system will allow readers to discover new publications by offering helpful suggestions based on their current interests, their reading patterns, and their social network connections. It will also allow them to share news, comments, and interact generally with other readers and friends in their social network. Finally, the cloud platform will offer a horizontally-scaling and flexible platform for the system. The contribution of this paper is thus to present the design and development of a new Web 2.0 special-purpose social networking site. From a web data mining point of view, implementing and controlling such a system opens up a lot of very interesting avenues of research. The news articles available on the system will create a growing text and multimedia corpus useful for a wide range of research projects, ranging from traditional text classification to specialized applications such as news topic detection and tracking [2]. Social networking platforms now supply data for a wide range of social relationship studies, such as exploring community structures and divides [3]. And feedback from the recommendation system will be useful to determine which variables are more or less influential in human decision-making [4]. The rest of this paper is organized as follows. The next section will present a brief overview of related projects. We will give an overview of the structure of our proposed system in Section III. In Section IV we will present in more details the functional requirements of each of the system’s components. We will bring these components 228 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 together and present the overall architecture of the system in Section V. Next, Section VI will discuss implementation and testing considerations of the system. Finally, we will present some concluding thoughts in Section VII. II. RELATED WORKS There has been some work done in developing alternative architectures for social networking sites. For example, the authors of [5] propose a mobile social network server divided into five components: an HTTP server that interacts with the web, standard profile repository database and privacy control components, a location database that allows the system to keep track of the user’s location as he moves around with his mobile device, and the matching logic component that connects the other four components together. The authors of [6] take it one step further to create a theme-based mobile social network, which is aware not only of the user’s location but also of activities related to his interests in his immediate surroundings, of their duration and of other participants. Our proposed system is also a theme-based social network, as it learns the users’ interests both from what is stated explicitly in their profiles and implicitly from the material they read, and will propose new publications based on these themes. However, the architectures mentioned above were based on having a single central web server, in contrast to our cloud architecture. 
Researchers have been aware for some time of the network congestion issues that comes with the traditional client-server architecture [7]. Cloud-based networking is a growingly popular solution to this problem. A relevant example of this solution is the cloud-based collaborative multimedia sharing system proposed in [8]. The building block of that system is a media server that allows users in a common session to collaborate on multimedia streams in real time. Media servers are created and destroyed according to user demand for the service by a group manager server, and users access the system through an access control server. The entire system is designed to interface with existing social networks (the prototype was integrated to Facebook). By comparison, our system is not a stand-alone component to integrate to an external social network, but an entire and complete social network. There are many open research challenges related to online news data mining. Some examples surveyed in [9] include automated content extraction from the news websites, topic detection and tracking, clustering of related news items, and news summarization. All these challenges are further compounded when one considers that online news sources are multilingual, and therefore elements of automated translation and corpus alignment may be required. These individual challenges are all combined into one in the task of news aggregation [9], or automatically collecting articles from multiple sources and clustering the related ones under a single headline. News aggregate sites are of critical importance however; as the PEW survey noted, they are the single most popular source of online news [1]. Our news-themed © 2012 ACADEMY PUBLISHER social network site would serve as a new type of news aggregate site. Researchers working on recommendation systems have shown that individuals trust people they are close to (family members, close friends) over more distant acquaintances or complete strangers. This connection between relationship degree and trust can be applied to social networks, to turn friend networks in to a Web-oftrust [10]. The trust a user feels for another can be further extrapolated from their joint history (such as the number of public and private messages exchanged), the overlap in their interests, or even simply whether the second user has completed their personal profile on the site they are both members of [4]. It is clear, then, that social networks are a ripe source of data for recommendation systems. Our proposed system is in line with this realization. One of the key areas of applied research today in Cloud is on performance and scalability. The authors of [11] propose a dynamic scaling architecture via the use of “a front-end load balancer routing user requests to web applications deployed on virtual machines (VMs)” [11], aiming to maximizing resource utilization in the VMs while minimizing total number of VMs. The authors also used a scaling algorithm based on threshold number of active user sessions. Our proposed system adopts this approach, but considers the thresholds of both the virtual machines' hardware utilization as well as the number of active user-generated requests and events, instead of sessions. Further, our system adopts the performance architecture principles discussed in [12] to examine the practical considerations in the design and development of performance intelligence architectures. 
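To make the scaling policy described above more concrete, here is a minimal sketch of a threshold-based scaling decision that considers both VM hardware utilization and the number of active user-generated requests, as the text describes. The thresholds, class, and method names are illustrative assumptions, not the system's actual values or code.

    // Illustrative sketch of a threshold-based scaling decision.
    class ScalingPolicy {
        static final double CPU_SCALE_OUT = 0.80;   // assumed average VM utilization above which a VM is added
        static final double CPU_SCALE_IN  = 0.30;   // assumed average VM utilization below which a VM is removed
        static final int    REQ_SCALE_OUT = 1000;   // assumed active user requests/events per VM

        enum Action { SCALE_OUT, SCALE_IN, NONE }

        static Action decide(double avgCpuUtilization, int activeRequests, int vmCount) {
            int requestsPerVm = activeRequests / Math.max(1, vmCount);
            // Scale out if either the hardware load or the request load crosses its threshold.
            if (avgCpuUtilization > CPU_SCALE_OUT || requestsPerVm > REQ_SCALE_OUT) {
                return Action.SCALE_OUT;
            }
            // Scale in only when both signals are comfortably low and more than one VM is running.
            if (vmCount > 1 && avgCpuUtilization < CPU_SCALE_IN && requestsPerVm < REQ_SCALE_OUT / 2) {
                return Action.SCALE_IN;
            }
            return Action.NONE;
        }
    }

In practice the inputs would come from the server-side monitoring data described below, while client-side end-to-end measurements would be used to evaluate whether the chosen thresholds actually preserve the user-perceived performance.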
For performance metrics and measurements, our system adopts the resource, workload, and performance indicators discussed in [13], together with the approach discussed in [14]: server-side monitoring data is used to determine the thresholds and to decide when to trigger a reconfiguration of the cloud, while client-side end-to-end monitoring data is used to evaluate the effectiveness of the performance architecture and implementation designed into the system, as it would be felt and perceived by its users.

Our proposed system thus stands at the intersection of several areas of research. Part of its appeal is that it would combine all these components into a single unified website and serve as a research and development platform for researchers in all these areas. Likewise, the web data generated by the system would be valuable in several research projects. And all this would be done while answering a real user need.

III. SYSTEM OVERVIEW

There are eight major components to our proposed system:
1. The cloud component serves the overall function of performing active systems monitoring and providing a high-performance, horizontally-scaling back-end infrastructure for the system.
2. A relational database system is essential to store and retrieve all the data that will be handled by the system, including user information, social network information, news articles, and layout information.
3. The social network aspect of the system is crucial to turn the passive act of reading news into an active social activity. Social networking will comprise a large portion of the front-end functionalities available to users.
4. A natural language processing engine will be integrated into the system to analyze all the articles submitted. It will work both on individual articles, to detect each article's topic and classify it appropriately, and on sets of articles, to detect trends and discover related articles.
5. A suggestion engine will combine information from both the social network and the natural language processor in order to suggest new reading materials for each individual user.
6. The content layout system is central to the content producer's experience. It will provide the producer with a simple and easy-to-use interface to control all aspects of a news article's layout (placement of text and multimedia, margins and spacing, etc.) and thus to create their own unique experience for the readers.
7. The user interface will give the reader access to the content and will display it in the way the producer designed it. The interface will also give the user access to and control over his involvement in the community through the social network.
8. The business logic component will facilitate user authentication and access control, to ensure users are able to connect to the system and access their designs and article collections, and to prevent them from accessing unauthorized content.

IV. SYSTEM REQUIREMENTS

Each of the eight components listed in Section III is responsible for a set of functionalities in the overall system. The functional requirements these components must satisfy in order for the entire system to work properly are described here.

A. Cloud
The Cloud component responds to changes in processing demand by modifying the amount of available computing resources. It does this by monitoring and responding to traffic volume and resource usage, and by creating or destroying virtual resources as needed.
Additionally, the Cloud component is responsible for distributing the traffic load across the available resources. The Cloud must be designed to support the following functionalities: ability to interface with a Cloud Hypervisor [15] that virtualizes system resources to allow the system to control the operation of the Cloud; ability to perform real time monitoring of the web traffic and workloads in the system; ability to monitor the state and performance of the system, including its individual machines; © 2012 ACADEMY PUBLISHER 229 ability for individual virtual servers within the system to communicate with each other; ability for the virtual servers to load share and load balance amongst each other; ability to distribute workloads evenly across the set of virtual servers within the system; ability to add or remove computing resources into the system based upon demand and load; ability to dynamically reconfigure the topology of virtual servers to optimally consume computing resources; ability to scale horizontally. B. Database The database component is required for persistent storage and optimized data retrieval. The database must be designed to support the following functionalities: represent and store subscribers and authors of each given publication; represent and store layouts of individual articles; represent and store sets of linked articles to form a publication; represent and store the social network. C. Social Network The social network component provides users with a richer content discovery experience by allowing users to obtain meaningful content suggestions. It must support the following functionalities: support user groups (aka friend lists); map social relationships; model user interactions with articles and publications; control sharing and privacy; comment on articles. D. Natural Language Processor The natural language processor allows the articles in the system to be analyzed and used for content suggestions and discovery. A basic version can simply build a word vector for each article, and computes the cosine similarity with the word vectors of other articles and of categories. The natural language processor must support the following features: ability to categorize the topics of an article or newspaper; ability to measure the similarity between different articles. E. Suggestion Engine The suggestion engine is a meaningful content discovery tool for users. One way to suggest new content to users is to display related articles alongside what they are currently viewing. The suggestion engine must support the following features: ability to draw conclusions on the interests of a reader given their activities and relationships on the social network; ability to draw conclusions on the interests of a reader based on the set of articles they have read; 230 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 ability to provide relevant article suggestions based upon the conclusions reached about the user’s interests. F. Content Layout System The layout system will provide content producers with an experience similar to that of a desktop editor application. Page design should be done in-browser with tools that the editor will find familiar. 
The layout system must support the following features: drag and drop content into place; adjust style of content; adjust layout of content including but not limited to adding, removing and adjusting columns; and adjusting size and position of areas for content; save and reload layouts or layout elements; aid positioning using grids or guidelines; save and reload default layouts for a publication. G. User Interface The user interface should render the content for the reader the way the content producer envisioned it to be. Navigation through content should be non-obtrusive. The user interface must support the following features: overall styling during reading experience set by content producer; user account management including but not limited to profile information, authentication, password recovery; content production including but not limited to viewing, adding, modifying, and removing content, issues, and layouts; unobtrusive navigation while immersed in the reading experience; options during the reading experience to comment on and/or share the article to the internal social networking, by email, and/or to the currently ubiquitous social networking sites. H. Business Logic User access to content should be controlled in order to differentiate between a user who may edit content of a given article or publication and a user who may only view the content. Additionally, in the case of paid subscriptions to publications, access control needs to differentiate between users who may or may not read certain articles. The business logics must support the following features: ability to authenticate users using unique login and passwords; ability to enforce access control of user’s data based upon the user’s privileges. I. Non-Functional Requirements The system’s non-functional requirements are consistent with those of other web-based systems. They are as follow: the site must be accessible with all major web browsers, namely Internet Explorer, Firefox, Chrome, and Safari; © 2012 ACADEMY PUBLISHER system-generated data must be kept to a minimum and encoded so as to minimize the amount of bandwidth used; the user interfaces must be easy to understand and use, both for readers, producers, and administrators; the system must be quick to respond and errorfree. V. SYSTEM ARCHITECTURE A. Logical Architecture Figure 1 illustrates the logical connections between the components of our proposed system. The user interface is presented to the user through a browser, and it directly connects to the business and cloud logics module which will use the Suggestion Engine and Layout Engine as needed. It is also connected to the Social Networking module, which provides the social networking functionalities. The data is stored in a Database, which the Natural Language Processor periodically queries to analyze all available articles. The Social Networking module will maintain a Social Graph of all user relationships and interactions. Meanwhile, the Cloud Logics component will monitor the system’s overall performance and interact with the Cloud Hypervisor as needed to adjust the system’s physical structure to respond to levels of user demand. B. Physical Architecture All the end users of our system, be them readers or producers, will connect to the site via a web browser running on any device or platform. The browser connects to the cloud’s software load balancer, which is hosted on a virtual machine. 
The load balancer forwards each request to one of a set of identical web servers (also hosted as virtual machines), which service the requests and may load share amongst themselves. Furthermore, the web servers connect to a set of databases that replicate between each other, for both fail-safe redundancy and throughput. The web servers also connect to the social network. This setup is illustrated in Figure 2.

Figure 1. Logical architecture of the system.

Figure 2. Physical architecture of the system.

C. Architecture Details
The software architecture for the client-side Web UI consists mainly of JavaScript model classes used to represent the locations of elements on the page. Objects of these classes are saved to the database through requests to the web server, or instantiated from saved object states retrieved from the web server. The software architecture for the server side consists mainly of controllers and models: Ruby on Rails uses a Model-View-Controller architecture, in which controllers use instances of Models and Views to render a page for the user, and requests are mapped to member functions of controllers based mainly on the request URI and HTTP method.

The natural language processor is implemented as a client-server system as well. The NLP server is responsible for the language processing functionalities of the system. It uses TCP to receive communications from the NLP client, and it also interfaces directly with the database to fetch information independently of the rest of the system.

The cloud is composed of several interconnected subcomponents. There is a Cloud Controller, which is responsible for real-time monitoring and workload management functions, dynamic cloud reconfiguration and server load balancing, and cloud hypervisor operations. This is the central component of the cloud, which manages the other servers and optimizes the entire system; to illustrate its function, its state chart is presented in Figure 3. The Controller accepts incoming TCP connections from the Load Balancer, and also connects via TCP to each of the Server System Monitors that reside in the servers under its control. The Server System Monitors are a simple monitoring subcomponent which periodically gathers system performance information and forwards it to the Controller through a TCP connection. Both the Controller and the Monitors are written in C++ as Linux applications. The Load Balancer, by contrast, is implemented in PHP as a web application at the forefront of the Cloud. Its function is to accept incoming HTTP requests from users, forward them to designated web servers, and balance the workload so that no single server is over- or under-utilized. It also forwards the responses from the servers back to the users. The Load Balancer communicates with the cloud over TCP to obtain the information for the web servers.
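Purely as an illustration of the monitor-to-controller message flow described above (the real Server System Monitors and Cloud Controller are C++ Linux applications), the following Python sketch shows a monitor that periodically reports a load snapshot to the controller over TCP. The port number, message format, reporting interval and controller address are assumptions, not part of the prototype.

```python
# Illustrative sketch of a Server System Monitor forwarding performance data
# to the Cloud Controller over TCP. The message format, port and interval are
# assumptions; the /proc/loadavg read assumes a Linux host, as in the prototype.
import json
import socket
import time


def collect_stats() -> dict:
    """Gather a minimal performance snapshot of this virtual server."""
    with open("/proc/loadavg") as f:               # 1-minute load average
        load1 = float(f.read().split()[0])
    return {"host": socket.gethostname(), "load1": load1, "ts": time.time()}


def report_forever(controller_host: str, controller_port: int = 9100,
                   interval_s: float = 5.0) -> None:
    """Periodically open a TCP connection and forward one JSON stats message."""
    while True:
        stats = collect_stats()
        with socket.create_connection((controller_host, controller_port)) as sock:
            sock.sendall(json.dumps(stats).encode("utf-8") + b"\n")
        time.sleep(interval_s)


if __name__ == "__main__":
    report_forever("cloud-controller.local")       # hypothetical controller address
```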
VI. PROTOTYPE IMPLEMENTATION AND TESTING

A working prototype of the entire system was implemented and run on a VMware ESXi 5.0 physical server with a quad-core CPU (2 GHz per core) and 3 GB of RAM. Each of the virtual machines within runs the Ubuntu 11.10 Server operating system (32-bit) with virtual hardware settings of 1 CPU core (VMware time-shares the 4 physical CPU cores among the virtual machines) and 5 GB of HDD space. The setup also allocates 384 MB of RAM to the web and load balancer servers, 512 MB of RAM to the MySQL database and NLP servers, and 1024 MB of RAM to the Cloud Controller server. The server is connected to the internet through a router with the ability to set static IP addresses and port forwarding.

A screenshot of the content layout interface is given in Figure 4. While the interface is simple, it gives content producers complete control to add, move and edit text, headings and multimedia items.

Figure 4. The prototype's content layout editor.

A set of test cases was developed and run to verify the functionality of critical features of the system. The cloud controller was tested for its server manipulation features: the ability to create new servers, to configure them correctly, to reconnect to them to check on them, and to delete them when they are no longer needed. Each of these operations also tested the cloud controller's ability to update the network's topology. With these components validated, we then proceeded to test higher-level functionalities, such as the controller's ability to gather usage statistics from the servers, to balance the servers' workloads and optimize the topology, and to forward requests and responses.

Figure 3. State chart of the cloud controller program.

The user interface and database were tested by registering both regular user accounts and publication editor accounts, and executing the legal and illegal functions of both classes of users. The editor could create new publications and new articles inside the publications, and edit these articles using the layout editor shown in Figure 4. The regular user could browse the publications, subscribe to those desired, and see the articles displayed in exactly the way the editor had laid them out. Finally, the natural language processing functionalities of the system (along with that section of the database) were tested by uploading a set of 10 news articles into the system and then having the processor parse them, build word vectors, and compare these to a set of predefined class vectors to classify the articles into their correct topics (a sketch of this vector comparison is given at the end of this section).

The last test we ran was a workload test, designed to verify the robustness and reliability of our cloud controller and architecture. For this test, we used a set of other computers to send HTTP requests, both requests for the website's home page and multi-page requests. Given our system's hardware, we tested setups with one and two web servers. In each case, we measured the system's throughput, CPU usage, disk usage and memory usage. In both setups, we found similar disk, memory, and CPU usage. The throughput was different, however, with the two-web-server setup consistently performing better than the setup with only one web server. On requests for the home page, the two-server setup yielded an average throughput of 258 kB/s against 182 kB/s for the setup with one web server. And for multi-page requests, which required database queries and more processing, the throughput of the two-server setup was 142 kB/s against 102 kB/s for the single-server setup. This improvement in throughput of roughly 40% demonstrates that our system can scale up quite efficiently. To further illustrate, the response times of both setups during our test are shown graphically in Figure 5. In that figure, the higher line is the response time of the single-server setup, the lower line is the response time of the two-server setup, and the individual points are HTTP requests. We can see that throughout the experiments, the single-server setup consistently requires more time to respond to the requests than the two-server setup. While these tests were conducted with static topologies, we expect the response times to "jump" from the single-server line to the two-server line when the system adds a web server dynamically, and the throughput to increase in a similar fashion. Additional jumps are expected as the system adds further servers.

Figure 5. Response times of the two setups tested.
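As a rough illustration of the word-vector comparison used in the NLP test above, the following Python sketch builds term-frequency vectors and assigns an article to the class whose predefined vector gives the highest cosine similarity. The tokenization, the toy class vectors and the function names are assumptions made for illustration; they are not the prototype's actual implementation.

```python
# Minimal sketch of word-vector classification: each article becomes a
# term-frequency vector and is assigned to the class with the most similar
# (cosine similarity) predefined class vector. Sample classes are made up.
import math
import re
from collections import Counter


def word_vector(text: str) -> Counter:
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def classify(article: str, class_vectors: dict[str, Counter]) -> str:
    vec = word_vector(article)
    return max(class_vectors, key=lambda c: cosine(vec, class_vectors[c]))


if __name__ == "__main__":
    classes = {
        "sports": word_vector("match team goal season league player score"),
        "finance": word_vector("market stock index shares trading investors bank"),
    }
    print(classify("The index fell as investors dumped bank shares.", classes))  # finance
```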
VII. CONCLUSION

Building a Web 2.0 social networking site is a very ambitious project. In this paper, we developed a web application with smart horizontal scaling using a cloud-based architecture, incorporating modern aspects of web technology as well as elements of natural language processing to help readers discover content and help publishers get discovered. The size and scope of this project make it a challenge for any developer, and this paper aims to be a roadmap to help others duplicate or improve upon our architecture. While the system is completely functional in the state described in this paper, there is room to further develop and improve each one of its components. There is a wide range of NLP and recommendation algorithms in the literature, some of which could be adopted to improve the natural language processor and the suggestion engine respectively. New editing tools can be added to the content layout system to give more control to content producers. The design of a better user interface is an open challenge, not just for our system but for the entire software world. Gathering real workload usage data will allow us to fine-tune the cloud's load balancing algorithms. And finally, the social networking component of the system could be both simplified and enhanced by linking our system to an existing social network such as Facebook, Google+, or Twitter. Each new feature and improvement we make in each component will of course require additional testing. And once a more complete and polished version of the site is ready, it should be deployed and used in practice, to gather both real-world usage information and user feedback that will help guide the next iteration of the system.

REFERENCES
[1] Kristen Purcell, Lee Rainie, Amy Mitchell, Tom Rosenstiel, Kenny Olmstead, "Understanding the participatory news consumer: How internet and cell phone users have turned news into a social experience", Pew Research Center, March 2010. Available: http://www.pewinternet.org/Reports/2010/OnlineNews.aspx?r=1, accessed April 2012.
[2] Xiangying Dai, Yunlian Sun, "Event identification within news topics", International Conference on Intelligent Computing and Integrated Systems (ICISS), October 2010, pp. 498-502.
[3] Nam P. Nguyen, Thang N. Dinh, Ying Xuan, My T. Thai, "Adaptive algorithms for detecting community structure in dynamic social networks", Proceedings of IEEE INFOCOM, 2011, pp. 2282-2290.
[4] Chen Wei, Simon Fong, "Social Network Collaborative Filtering Framework and Online Trust Factors: a Case Study on Facebook", 5th International Conference on Digital Information Management, 2010.
[5] Yao-Jen Chang, Hung-Huan Liu, Li-Der Chou, Yen-Wen Chen, Haw-Yun Shin, "A General Architecture of Mobile Social Network Services", International Conference on Convergence Information Technology, November 2007, pp. 151-156.
[6] Jiamei Tang, Sangwook Kim, "Theme-Based Mobile Social Network System", IEEE Ninth International Conference on Dependable, Autonomic and Secure Computing (DASC), 2011, pp. 1089-1095.
[7] Rabih Dagher, Cristian Gadea, Bogdan Ionescu, Dan Ionescu, Robin Tropper, "A SIP Based P2P Architecture for Social Networking Multimedia", 12th IEEE/ACM International Symposium on Distributed Simulation and Real-Time Applications, 2008, pp. 187-193.
[8] Cristian Gadea, Bogdan Solomon, Bogdan Ionescu, Dan Ionescu, "A Collaborative Cloud-Based Multimedia Sharing Platform for Social Networking Environments", Proceedings of the 20th International Conference on Computer Communications and Networks (ICCCN), 2011, pp. 1-6.
[9] Wael M.S. Yafooz, Siti Z.Z. Abidin, Nasiroh Omar, "Challenges and issues on online news management", IEEE International Conference on Control System, Computing and Engineering (ICCSCE), 2011, pp. 482-487.
[10] Paolo Massa, Paolo Avesani, "Trust-aware recommender systems", Proceedings of the 2007 ACM Conference on Recommender Systems (RecSys '07), October 2007, Minneapolis, USA, pp. 17-24.
[11] Trieu C. Chieu, Ajay Mohindra, Alexei A. Karve, "Scalability and Performance of Web Applications in a Compute Cloud", IEEE 8th International Conference on e-Business Engineering (ICEBE), Oct. 2011, pp. 317-323.
[12] Prasad Calyam, Munkundan Sridharan, Yingxiao Xu, Kunpeng Zhu, Alex Berryman, Rohit Patali, Aishwarya Venkataraman, "Enabling performance intelligence for application adaptation in the Future Internet", Journal of Communications and Networks, vol. 13, no. 6, Dec. 2011, pp. 591-601.
[13] Jerry Gao, Pushkala Pattabhiraman, Xiaoying Bai, W. T. Tsai, "SaaS performance and scalability evaluation in clouds", 2011 IEEE 6th International Symposium on Service Oriented System Engineering (SOSE), Dec. 2011, pp. 61-71.
[14] Niclas Snellman, Adnan Ashraf, Ivan Porres, "Towards Automatic Performance and Scalability Testing of Rich Internet Applications in the Cloud", 37th EUROMICRO Conference on Software Engineering and Advanced Applications (SEAA), Aug.-Sept. 2011, pp. 161-169.
[15] Bhanu P. Tholeti, "Hypervisors, virtualization, and the cloud: Learn about hypervisors, system virtualization, and how it works in a cloud environment", IBM developerWorks, September 2011. Available: http://www.ibm.com/developerworks/cloud/library/clhypervisorcompare, accessed April 2012.

Analyzing Temporal Query for Improving Web Search

Rim Faiz
LARODEC, IHEC, University of Carthage, Tunisia
E-mail: Rim.Faiz@ihec.rnu.tn

Abstract— The search for pertinent information on the web is a recent concern of the information society. Processing based on statistics is no longer enough to handle (i.e. to search, translate, or summarize) relevant information from texts. The problem is how to extract knowledge while taking into account document contents as well as the context of the query. Requests looking for events taking place during a certain time period (e.g. between 1990 and 2001) do not yet provide the expected results. We propose a method to transform the query in order to "understand" its context and its temporal framework. Our method is validated by the SORTWEB system.
Index Terms— Information Extraction, Semantics of Queries, Web Search, Temporal Expression Identification

I. INTRODUCTION

The Web has become the primary source of information in the world, and the search for relevant information on the Web is considered one of the new needs of the information society. The value of consulting this medium depends on the effectiveness of search engines at retrieving information. The main search engines operate essentially on keywords, but this technique has limitations: thousands of pages are offered in response to each query, yet only some contain the relevant information. To improve the quality of the obtained results, search engines must take into account the semantics of queries. The methods of information processing based on statistics are no longer sufficient to meet the needs of users who want to manipulate (search, translate, summarize, ...) information on the Web. One thing has become necessary: introducing "more semantics" into the search for relevant information from texts.

The extraction of specific information remains the fundamental question of our study. In this sense, it shares the concerns of researchers who have examined text understanding (Sabah, 2001), (Nazarenko and Poibeau, 2004), (Poibeau and Nazarenko, 1999), as well as of those dealing today with the link between the semantic web and textual data (Berners-Lee et al., 2001), (Poibeau, 2004).

The objective of our work is to refine the search for information on the web. It is to treat the content structure and make it usable for other types of automatic processing. Indeed, when the user makes his query, he generally expects to find precisely what he seeks, i.e. to find "the relevant information" without being overwhelmed by a volume of uncontrollable and unmanageable answers.

In the section that follows, we present some new methods based on the analysis of context for improving search on the Web. Then we propose our method, which is based on two concepts: the concept of context in general (Desclés et al., 1997), (Lawrence et al. 1998) and the concept of temporal context (Faiz, 2002), (ElKhlifi and Faiz, 2010). Finally, we present the validation of our method through the SORTWEB system.

II. RELATED WORKS ON TEMPORAL INFORMATION RETRIEVAL

Nowadays, the web is used by people who seek information via a search engine and exploit the results themselves. Tomorrow, the web should primarily be used by automatons that will themselves address the questions asked by people and automatically give the best results. Thus, the web becomes a forum for the exchange of information between machines, allowing access to a very large volume of information and providing the means to manage this information. In this case, a machine can make sense of the volume of information available on the web and thus provide more consistent assistance to people, provided that we endow the machine with some "intelligence". By "intelligence", we mean linking human intelligence with artificial intelligence to optimize information search activities on the web.

The search for information involves the user in a process of interrogating the search engine. The defined query is matched against the indexes of documents, and the documents whose indexes have an adequate "similarity" to the query (i.e. the keywords of the query exist in the resulting documents) are considered relevant. However, the request for information expressed by a query can be an inaccurate description of the user's needs.
In general, when the user is not satisfied with the results of its initial query, he tries to change it so as to identify its needs better. This change in the query is to be reformulated. In general, the reformulation is expressed by removing or adding words. The results of the study by PD Bruza (Bruza and al., 2000), (Bruza and Dennis, 1997), conducted on reformulations made by users themselves have shown that reformulation is often the repetition of the initial JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 request, the adding or the withdrawal of few words, changing the spelling of the request, or the use of its derivatives or abbreviations. In this context, we can cite the system developed by HyperIndex P.D. Bruza and al. (Bruza and Dennis, 1997) (Dennis et al., 2002) relating to a technical reformulation of queries that helps the user to refine or extend the initial request by the addition, deletion or substitution of terms. The terms of reformulation, are extracted from the titles of Web pages. It is a post-interrogation reformulation: the user defines an initial query, after which the resulting titles of Web pages provided by the search system are analyzed as a lattice of terms in order to be used by the HyperIndex search engine. The user can navigate through this HyperIndex giving an overview of all possible forms of reformulation (refinement or enlargement). Other work has been developed in this context, we can cite: - R. W. Van Der Pol, (Van Der Pol, 2003) proposed a system to reformulation pre-interrogation based on the representation of a medical field. This field is organized into concepts linked by a certain number of binary relations (i.e. causes, treats and subclass). The complaints are built in a specification language in which users express their needs. The reformulation of requests is automatic. It takes place in two stages, the first concern the identification of concepts that pairs the need of the user, the second concerns the making up of these terms in order to formulate the request. - A. D. Mezaour (Mezaour, 2004) proposed a method of targeted research documents. The proposed language allows the user to combine multiple criteria to characterize the pages of interest with the use of logical operators. Each criteria specified in a query can target the search for its values (keywords) on a fixed part of the structure of a page (for example, its title) or characterize a particular property of a page (example: URL). By using the logical operators conjunction and disjunction, it is possible to combine the above criteria in order to target both the type of page (html, pdf, etc.) with certain properties of the URL of a page, or characteristics of some key parts (title, body of the document). Mezaour thinks a possibility of improving its approach consists in enriching the initial request by synonyms representing the values of words for each query. According to him, the assessment of his requests passes over relevant documents that do not contain the terms of the request but equivalent synonyms. - O. Alonso (Alonso et al., 2016) proposed a method for clustering and exploring search results based on temporal expressions within the text. They mentioned that temporal reasoning is also essential in supporting the emerging temporal information retrieval research direction (Alonso et al., 2011). In other work (Strötgen et al. 
2012), they present an approach to identify top relevant temporal expressions in documents using expression, document, corpus, and query-based features. They present two relevance functions: one to calculate relevance scores for temporal expressions in general, and © 2012 ACADEMY PUBLISHER 235 one with respect to a search query, which consists of a textual part, a temporal part, or both. - In their work, E. Alfonseca et al. (Alfonseca et al., 2009) showed how query periodicities could be used to improve query suggestions, although they seem to have more limited utility for general topical categorization. - A. Kulkarni et al. (2011), in their work, showed that Web search is strongly influenced by time. They mentioned that the relationship between documents and queries can change as people’s intent changes. They have explored how queries, their associated documents, and query intents change over the course of 10 weeks by analyzing large scale query log data, a daily Web crawl, and periodic human relevance judgments. To improve their work, A. Kulkarni et al. plan to develop a search algorithm that uses the term history in a document to identify the most relevant documents. - A. Kumar et al. (2011) proposed a language modeling approach that builds histograms encoding the probability of different temporal periods for a document. They have shown that it is possible to perform accurate temporal resolution of texts by combining evidence from both explicit temporal expressions and the implicit temporal properties of general words. Initial results indicate this language modeling approach is effective for predicting the dates of publication of short stories, which contain few explicit mentions of years. - Zhao et al. (2012) develop a temporal reasoning system that addresses three fundamental tasks related to temporal expressions in text: extraction, normalization to time intervals and comparison. They demonstrate that their system can perform temporal reasoning by comparing normalized temporal expressions with respect to several temporal relations. We note that, in general, manual reformulation aims at building a new query with a list of terms proposed by the system. In the case of an automatic reformulation, the system will build the new query. However, the method of automatic reformulation, generally, does not take into account the context of the query. The standard model of search tools admits many disadvantages such as limited diversification, competence and performance. While, the establishment of research by the context is much more advantageous. The contextual information retrieval refers to implicit or explicit knowledge regarding the intentions of the user, the user's environment and the system itself. The hypothesis of our work is that making explicit certain elements of context could improve the performance of information research systems. The improved performance of engines is a major issue. Our study deals with a particular aspect: taking into account the temporal context. In order to improve accuracy and allow a more contextual search, we described a method based on the analysis of the temporal context of a query so as to obtain relevant event information. 236 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 III. CONTRIBUTIONS The explosion in the volume of data and the improving of the storage capacity of databases were not accompanied by the development of analytical tools and research needed to exploit this mass of information. 
The realization of intelligent systems research has become an emergency. In addition, queries for responding to requests for information from users become very complex and the extraction of the most relevant data becomes increasingly difficult when the data sources are diverse and numerous. It is imperative to consider the semantics of the data and use this semantics to improve web search. More especially as the results of a search query with a search engine returns a large number of documents which is not easy to manage and operate. Indeed, in carrying out tests on several search engines, we found inefficient engines for queries on a date or a period of time. Therefore, we propose to develop a tool to take into account the temporal context of the query. In this context, we propose an approach, like those aimed at improving the performance of search engines (Agichtein et al., 2001), (Glover et al., 1999, 2001) (Lawrence et al. 2001) such as the introduction of the concept of context, the analysis of web pages or the creation of specific search engines in a given field. The objective of our work is to improve the efficiency and accuracy of event information retrieval on the Web and analyzing the temporal context for understanding the query. Therefore, the matter is to propose more precise queries semantically close to the original user’s queries. Our study consists on the one hand to reformulate queries searching for text documents having an event aspect, i.e. containing temporal markers (i.e. during, after, since, etc.) taking into account the temporal context of the query, and on the other hand, to obtain relevant results specifically responding to the queries. The question that arises is how to find event information and transform collections of data into intelligible knowledge, useful and interesting in the temporal context where we are. We found that, in general, queries seeking one or more events taking place at a given date or during a determined period do not produce the expected results. For example, the scientific discoveries since 1940. In this sample of query, the user wants to seek scientific discoveries since 1940 until today, not for the year 1940 only; it is then to deal with a period of time. Indeed, a standard search engine only searches on the term "1940" and not on the time period in question, from which the idea of the reformulation of the user’s query, basing the search on the term introduced by the user and a combination of words synonymous with the terms of the original query. The processing of the query is mainly done at the context level. The system must be able to understand the timing of the query. Therefore, we provide it with some intelligence (to approach the human reasoning) plus a semantic analysis (for understanding the query). Such a system is very difficult to implement for several reasons: © 2012 ACADEMY PUBLISHER The diversity of documents types on the web (file types: doc, txt, ppt, pdf, ps, etc.), The multitude of languages, The richness of languages: it is very difficult to establish a genuine process of parsing which took into account the structure of each sentence. To do this, we will focus our work on a document type and a type of event queries containing temporal indicators (in the month, in the year, between time and date, etc.). For the identification of temporal expressions, we used our method of automatic filtering of temporal information we have developed in earlier works, (Faiz and Biskri, 2002). 
The temporal information in the query is retrieved by identifying temporal markers (since, during, before, until, ...) or by the presence of an explicit date in the query. Then, to interpret these terms and to search for event information taking place on a date or over a period, we propose a time representation based on the concept of interval (Allen and Ferguson, 1994). This representation is based on the start and end dates of events (punctual or instantaneous events and durative events). Moreover, in view of the type of queries we study and of temporal markers such as "before", "after" and "until", we need to express them in terms of intervals.

There are two types of events: punctual or instantaneous events (Evi) and durative events (Evd). For an instantaneous event, the beginning date is equal to the ending date of the event: d(Evi) = f(Evi). A durative event is one that takes place without interruption: d(Evd) ≠ f(Evd). More generally, we consider that an event E admits a start date d(E) and an end date f(E), with d(E) ≤ f(E). We thus distinguish events of zero duration, d(E) = f(E), which are expressed, for example, by the phrase "in + (date)" (e.g. "in 2001"), and events of non-zero duration, d(E) ≠ f(E), whose interval is [d(E), f(E)] and which are expressed, for example, by "between 1990 and 2001" or "since 1980". The temporal granularity on which we base our examples is the year.

In our work, we need to represent the temporal information contained in the query in the form of an interval, in order to use it as additional information for the query. Thus, we apply interpretation rules to determine the time interval explicitly. Example: if the query contains "from" + beginning_year, then interval = [beginning_year, current_year]. So, if a document contains the sought event taking place during the generated time interval, we consider it relevant. To better understand the context of the query made by the user, we also considered extending the query by adding words synonymous with the event in question. Example: for the word "attack", we use the synonyms "attack, explosion, crime", etc.
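To make these interpretation rules concrete, the following sketch shows one possible way to normalize the temporal part of a query into a year interval and to enrich the event part with synonyms. The regular expressions, the marker list and the miniature synonym base are assumptions made for illustration; this is not the SORTWEB implementation itself.

```python
# Rough sketch of the interpretation rules described above: detect a temporal
# marker, normalize it into a year interval, and enrich the event part with
# synonyms. Marker patterns and the toy synonym base are illustrative assumptions.
import datetime
import re

CURRENT_YEAR = datetime.date.today().year
MARKERS = {"since", "from", "between", "and", "in", "before", "until", "during"}
SYNONYMS = {"attack": ["attack", "explosion", "crime"]}   # assumed synonym base


def interpret_interval(query: str):
    """Return (start_year, end_year) for the temporal part of the query, or None."""
    m = re.search(r"\bbetween\s+(\d{4})\s+and\s+(\d{4})", query, re.I)
    if m:
        return int(m.group(1)), int(m.group(2))        # durative event: d(E) != f(E)
    m = re.search(r"\b(?:since|from)\s+(\d{4})", query, re.I)
    if m:
        return int(m.group(1)), CURRENT_YEAR           # e.g. "since 1980" -> [1980, now]
    m = re.search(r"\bin\s+(\d{4})", query, re.I)
    if m:
        year = int(m.group(1))
        return year, year                              # punctual event: d(E) = f(E)
    return None


def reformulate(query: str):
    """Split the query into an enriched event part and a normalized interval."""
    interval = interpret_interval(query)
    event_terms = [w for w in re.findall(r"[a-z]+", query.lower()) if w not in MARKERS]
    enriched = []
    for term in event_terms:
        enriched.extend(SYNONYMS.get(term, [term]))
    return enriched, interval


if __name__ == "__main__":
    print(reformulate("attack since 2000"))
    # (['attack', 'explosion', 'crime'], (2000, <current year>))
```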
IV. VALIDATION OF THE PROPOSED METHOD: THE SORTWEB SYSTEM

We validated our work by developing the SORTWEB system (System Optimization of Time Queries on the Web), which improves Web search through automatic query reformulation in order to obtain relevant results and meet the expectations of the user. This reformulation is done by automatically adding terms synonymous with the event sought. Enriching the request allows a better search, with results derived from the search terms and their synonyms rather than only from the terms entered by the user. The process is as follows. A query of the form "event + temporal marker + date or time period" is analyzed and segmented into two parts by detecting a temporal marker (in the month, in the year, since, ...): the event part, containing the description of the event sought, which is transformed and reformulated using the synonym base so as to enrich the query with terms of the same meaning and thus take account of the semantics of the request; and the part with the date or time period during which the event took place. The web search is launched once these changes have been made to the query (cf. Figure 1). The latter part does not always take the form of an explicit date (for example: "last year", "next year") or of an interval (for example: "during this century", "since 1990"); it is therefore processed into the standard form of a date or a time interval. For example, "since 1980" will be represented by the interval [1980, 2006].

Figure 1. SORTWEB system architecture.

The search itself is done using a search engine. Each returned document is then downloaded, analyzed and filtered according to the temporal constraint of the request. The filtering consists of traversing the documents and verifying whether the selected information respects the semantics of the initial request. After going through all the downloaded documents, only the addresses of the documents considered relevant are added to the results page.

To test and validate our system, we launched the same requests (for example, "attacks since 2000", "wars between 1990 and 2002") on several search engines such as Google and Yahoo. We observed that our system returns a much smaller number of documents than those engines do directly. In addition, the returned results contain relevant documents obtained through the search of synonyms and not only through the terms of the initial request. For example, the query "the attacks since 2000" is processed by our system, and the search is carried out using the term "attack" as well as the terms "explosion" and "crime". The use of synonyms is very important because the user may be interested in documents containing not only the word "attack" but also other words in the same context.

It should be noted that the evaluation of an information retrieval system is measured by the degree of relevance of its results. The difficulty lies in the fact that user relevance is different from system relevance: in general, in a relevant document the user can find the information he needs, and we speak of user relevance when the user considers that a document meets his needs, whereas system relevance is judged through the matching function used. To determine the relevance of the obtained results, we conducted an evaluation by human experts and found that 80% of the results were relevant. We also calculated the accuracy to evaluate the quality of the answers provided by the system; the results of the tests were measured using the accuracy rate as follows: Accuracy = (number of relevant documents found / number of documents found) = 80.6%.

While our method has many advantages, such as minimizing the number of results while keeping their relevance, and returning documents found through words (synonyms of the user's request terms) added automatically by the system, it also opens up new avenues of study. One of the perspectives we intend to pursue is improving the search for event information. We plan to work further on well-known events in the countries where they occur, such as the event "pilgrimage", which may be associated with "Saudi Arabia".

V. CONCLUSION

The new generation of search engines differs from the previous generation in that these engines increasingly incorporate techniques beyond simple keyword search, adding other methods to improve results, such as the introduction of the concept of context, the analysis of web pages, or the creation of search engines specific to a given area. Thus, the improved performance of engines is a major issue.
Our study addresses a particular aspect: taking into account the temporal context. In order to improve accuracy and allow a more contextual search, we have described a method based on the analysis of the temporal context of a query to obtain relevant event information.

REFERENCES
[1] Alfonseca, E., Ciaramita, M. and Hall, K. (2009), Gazpacho and summer rash: Lexical relationships from temporal patterns of Web search queries. In Proceedings of EMNLP 2009, pp. 1046-1055.
[2] Agichtein E., Lawrence S. and Gravano L. (2001), Learning Search Engine Specific Query Transformations for Question Answering. Proceedings of the Tenth International World Wide Web Conference, WWW10, May 1-5.
[3] Allen J.F., Ferguson G. (1994), Actions and Events in Interval Temporal Logic. Journal of Logic and Computation, vol. 4, no. 5, pp. 531-579.
[4] Alfonseca, E., Ciaramita, M. and Hall, K. (2009), Gazpacho and summer rash: Lexical relationships from temporal patterns of Web search queries. In Proceedings of EMNLP 2009, pp. 1046-1055.
[5] Alonso, O. and Gertz, M. (2006), Clustering of search results using temporal attributes. In Proceedings of SIGIR 2006, pp. 597-598.
[6] Alonso O., Strötgen J., Baeza-Yates R. and Gertz M. (2011), Temporal information retrieval: Challenges and opportunities. TWAW 2011, Hyderabad, India, pp. 1-8.
[7] Berners-Lee T., Hendler J. and Lassila O. (2001), The Semantic Web: A new form of Web content that is meaningful to computers will unleash a revolution of new possibilities. Scientific American.
[8] Bruza P., McArthur R., Dennis S. (2000), Interactive internet search: keyword, directory and query reformulation mechanisms compared. Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, July 24-28, Athens, ACM Press, pp. 280-287.
[9] Bruza P.D. and Dennis S. (1997), Query Reformulation on the Internet: Empirical Data and the Hyperindex Search Engine. Proceedings of RIAO-97, Computer Assisted Information Searching on the Internet.
[10] Dennis S., Bruza P., McArthur R. (2002), Web searching: A process-oriented experimental study of three interactive search paradigms. Journal of the American Society for Information Science and Technology, vol. 53, no. 2, pp. 120-133.
[11] ElKhlifi A. and Faiz R. (2010), French-written Event Extraction based on Contextual Exploration. Proceedings of the 23rd International FLAIRS Conference, AAAI Press, California, USA.
[12] Faiz R. and Biskri I. (2002), Hybrid approach for the assistance in the events extraction in great textual data bases. Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (IEEE SMC 2002), Tunisia, 6-9 October, Vol. 1, pp. 615-619.
[13] Faiz R. (2002), Exev: extracting events from news reports. Actes des Journées internationales d'Analyse statistique des Données Textuelles (JADT 2002), A. Morin and P. Sébillot (Eds.), Vol. 1, France, pp. 257-264.
[14] Faiz R. (2006), Identifying relevant sentences in news articles for event information extraction. International Journal of Computer Processing of Oriental Languages (IJCPOL), World Scientific, Vol. 19, No. 1, pp. 1-19.
[15] Glover E., Flake G., Lawrence S., Birmingham W., Giles C.L., Kruger A. and Pennock D. (2001), Improving category specific web search by learning query modifications. In Symposium on Applications and the Internet (SAINT-2001), pp. 23-31.
[16] Glover E., Lawrence S., Gordon M., Birmingham W., Giles C.L. (1999), Architecture of a Metasearch Engine that Supports User Information Needs.
Proceedings of the Eighth International Conference on Information and Knowledge Management (CIKM 99), ACM, pp. 210-216.
[17] Kulkarni A., Teevan J., Svore K. M. and Dumais S. T. (2011), Understanding temporal query dynamics. WSDM '11, February 9-12, 2011, Hong Kong, China.
[18] Kumar A., Lease M., Baldridge J. (2011), Supervised language modeling for temporal resolution of texts. CIKM 2011, pp. 2069-2072.
[19] Lawrence S., Coetzee F., Glover E., Pennock D., Flake G., Nielsen F., Krovetz R., Kruger A., Giles C.L. (2001), Persistence of Web References in Scientific Research. IEEE Computer, Vol. 34, pp. 26-31.
[20] Mezaour A. (2004), Recherche ciblée de documents sur le Web [Targeted document search on the Web]. Revue des Nouvelles Technologies de l'Information (RNTI), D.A. Zighed and G. Venturini (Eds.), Cépaduès-Editions, vol. 2, pp. 491-502.
[21] Nazarenko A., Poibeau T. (2004), L'évaluation des systèmes d'analyse et de compréhension de textes [Evaluation of text analysis and understanding systems]. In L'évaluation des systèmes de traitement de l'information, Chaudiron S. (Ed.), Paris, Lavoisier.
[22] Poibeau T. (2004), Annotation d'informations textuelles : le cas du web sémantique [Annotation of textual information: the case of the semantic web]. Revue d'Intelligence Artificielle (RIA), vol. 18, no. 1, Paris, Editions Hermès, pp. 139-157.
[23] Poibeau T., Nazarenko A. (1999), L'extraction d'information, une nouvelle conception de la compréhension de texte [Information extraction, a new conception of text understanding]. Traitement Automatique des Langues (TAL), vol. 40, no. 2, pp. 87-115.
[24] Sabah G. (2001), Sens et traitements automatiques des langues [Meaning and automatic language processing]. In Pierrel J. M. (Ed.), Ingénierie des langues, Hermès.
[25] Strötgen J., Alonso O. and Gertz M. (2012), Identification of Top Relevant Temporal Expressions in Documents. In TempWeb 2012: 2nd Temporal Web Analytics Workshop (together with WWW 2012), Lyon, France.
[26] Van der Pol R.W. (2003), Dipe-D: A Tool for Knowledge-Based Query Formulation in Information Retrieval. Information Retrieval, vol. 6, no. 1, pp. 21-47.
[27] Zhao R., Do Q., Roth D. (2012), A Robust Shallow Temporal Reasoning System. Proceedings of the Demonstration Session at the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (HLT-NAACL 2012), pp. 29-32, Montréal, Canada.

Dr. Rim Faiz obtained her Ph.D. in Computer Science from the University of Paris-Dauphine, France. She is currently a Professor of Computer Science at the University of Carthage, Institute of High Business Study (IHEC) at Carthage, Tunisia. Her research interests include Artificial Intelligence, Machine Learning, Natural Language Processing, Information Retrieval, Text Mining, Web Mining and the Semantic Web. She is a member of the scientific and organization committees of several international conferences and has several publications in international journals and conferences (AAAI, IEEE, ACM, ...). Dr. Faiz is also in charge of the Professional Master "Electronic Commerce" and the Research Master "Business Intelligence applied to Management" at IHEC of Carthage.
Trend Recalling Algorithm for Automated Online Trading in Stock Market

Simon Fong, Jackie Tai
Department of Computer and Information Science, University of Macau, Taipa, Macau SAR
Email: ccfong@umac.mo, ma56562@umac.mo

Pit Pichappan
School of Information Systems, Al Imam University, Riyadh, Saudi Arabia
Email: pichappan@dirf.org

Abstract—Unlike financial forecasting, the mechanical trading technique called Trend Following (TF) does not predict any market movement; instead it identifies a trend early in the trading day and then trades automatically according to a pre-defined strategy, regardless of how the market moves at run time. Trend following trading has a long and successful history among speculators. The traditional TF trading method relies on human judgment to set the rules (aka the strategy), and the TF strategy is subsequently executed in a purely objective, operational manner. Finding the correct strategy at the beginning is crucial in TF. This usually involves human intervention in first identifying a trend and then configuring when to place an order and when to close it out, once certain conditions are met. In this paper, we present a new type of TF, namely the Trend Recalling algorithm, which operates in a totally automated manner. It works by partially matching the current trend with one of the proven successful patterns from the past. Our experiments, based on real stock market data, show that this algorithm has an edge over other trend following methods in profitability. The new algorithm is also compared to the time-series forecasting type of stock trading, and in a simulation it can even outperform the best forecasting method.

Index Terms—Trend Following Algorithm, Automated Stock Market Trading

I. INTRODUCTION

Trend following (TF) [1] is a reactive trading method that responds to the real-time market situation; it does neither price forecasting nor prediction of any market movement. Once a trend is identified, it activates the predefined trading rules and adheres rigidly to them until the next prominent trend is detected. Trend following does not guarantee a profit every time; nonetheless, over a long period of time it will probably profit by obtaining more gains than losses. Since TF is an objective mechanism that is totally free from human judgment and technical forecasting, the trends and patterns of the underlying data play the primary role in deciding its ultimate performance. It was already shown in [2] that market fluctuation adversely affects the performance of TF. Although financial cycles are known phenomena, it remains controversial whether cycles can be predicted, or whether past values cannot forecast future values because they are random in nature. Nonetheless, we observed that although cycles cannot be easily predicted, the abstract patterns of such cycles can be practically recalled and used for simple pattern matching. The formal interpretation of a financial cycle (better known as an economic cycle) refers to economy-wide fluctuations in production or economic activity over several months or years. Here we consider it as the cycle that runs continuously between bull market and bear market; some people refer to this as the market cycle (the two are, in any case, highly correlated). In general a cycle is made up of four stages: "(1) consolidation (2) upward advancement (3) culmination (4) decline" [3]. Despite being termed cycles, they do not follow a mechanical or predictable periodic pattern.
However, similar patterns have been observed to keep repeating themselves in the future, in approximate shapes; it is just a question of when. We can anticipate that some exceptional peak (or other particular pattern) of the market trend that happens today will one day happen again, just as it happened in history. For instance, in the "1997 Asian Financial Crisis" [4] the Hang Seng Index in Hong Kong plunged from top to bottom (stages 3 to 4); about ten years later the scenario repeated itself in the "2008 Financial Crisis" [5] with a similar pattern.

Dow Theory [6] describes the market trend (part of the cycle) as three types of movement. (1) The "primary movement", main movement or primary trend can last from a few months to several years; it can be either a bullish or a bearish market trend. (2) The "secondary movement", medium trend or intermediate reaction may last from ten days to three months. (3) The "minor movement" or daily swing varies from hours to a day. The primary trend is a part of the cycle and consists of one or several intermediate reactions, while the daily swings are the minor movements that make up all the detailed movements. Now, if we project the previous assumption that the cycle keeps rolling continuously onto the minor daily movements, can we assume that the trend that happens today may also appear again some days later?

Here is an example of this assumption. Figure 1 shows two intra-day trend graphs of Hang Seng Index Futures, sourced from two different dates: 2009-12-07 (top) and 2008-01-31 (bottom). Although they are not exactly the same, in terms of major upward and downward trends the two graphs do look alike. This is the underlying concept of our trend recalling trading strategies, which are based on searching for similar patterns from the past. This concept is valid for TF because TF works by smoothing out the averages of the time series: minor fluctuations or jitters along the trend are averaged out. This is important because TF is known to work well on major trending cycles, i.e. the major outlines of the market trend.

The paper is structured as follows. Details of the trend recalling algorithm are presented in Section 2, step by step. Simulation experiments for evaluating the performance of the Trend Recalling algorithm in automated trading are carried out in Section 3; in particular, we compare the Trend Recalling algorithm with a selected time series forecasting algorithm. Section 4 concludes the paper.

Figure 1. Intra-day trend graphs of 2009-12-07 and 2008-01-31.

II. RECALLING PAST TRENDS

An improved version of the Trend Following algorithm, called Trend Recalling, is proposed in this paper; it looks back to the past for reference when selecting the best trading strategy. It works exactly like TF except that the trading rules are borrowed from one of the best-performing past trends that most closely matches the current one. The design of a TF system is grounded on the rules that are summarized by Michael W. Covel into the following five questions [7]:
1. How does the system determine what market to buy or sell at any time?
2. How does the system determine how much of a market to buy or sell at any time?
3. How does the system determine when you buy or sell a market?
4. How does the system determine when you get out of a losing position?
5. How does the system determine when you get out of a winning position?
There is no standard answer to these questions; likewise, there exists no definite guideline for how the trading rules in TF should be implemented. The first and second questions are already answered in our previous works [1][2]. The third question is rather challenging: it is actually the core decision maker in the TF system and where the key to making a profit lies; questions 4 and 5 are related to it. Suppose that we have found a way to identify a trend signal to buy or sell, and we have a position opened. If the system along the way identifies another trend signal that complies with the direction of the currently opened position, we should keep the position open, since the signal suggests that the trend is not yet over. However, if the signal runs counter to the current position, we should probably close out, regardless of whether we are currently winning or losing, as it indicates a trend reversal. Our improved TF algorithm is designed to answer this question of when to buy or sell; the clue is derived from the most similar past trend.
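The open/close rule just described can be summarized in a few lines of code. The sketch below is illustrative only: the +1/-1 encoding of buy/sell signals and the Position class are assumptions, not anything taken from the authors' implementation.

```python
# Sketch of the position-management rule described above: a new trend signal in
# the same direction keeps the position open; a counter signal closes it out.
# The +1/-1 signal encoding and the Position class are illustrative assumptions.
from dataclasses import dataclass
from typing import Optional

LONG, SHORT = +1, -1


@dataclass
class Position:
    direction: int      # LONG or SHORT
    entry_price: float


def on_signal(position: Optional[Position], signal: int, price: float) -> Optional[Position]:
    """Apply one trend signal to the (possibly empty) open position."""
    if position is None:
        return Position(direction=signal, entry_price=price)   # open on the first signal
    if signal == position.direction:
        return position                                         # trend not over: keep it open
    return None                                                 # counter signal: close out


if __name__ == "__main__":
    pos = on_signal(None, LONG, 22000.0)
    pos = on_signal(pos, LONG, 22150.0)    # same direction -> stays open
    pos = on_signal(pos, SHORT, 22080.0)   # trend reversal -> closed out
    print(pos)                             # None
```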
Figure 1. Intra-day of 2009-12-07 and 2008-01-31 day trend graphs.

II. RECALLING PAST TRENDS

An improved version of the Trend Following algorithm, called Trend Recalling, is proposed in this paper; it looks back to the past for reference when selecting the best trading strategy. It works exactly like TF except that the trading rules are borrowed from one of the best performing past trends that most closely matches the current trend. The design of a TF system is grounded on the rules summarized by Michael W. Covel into the following five questions [7]:
1. How does the system determine what market to buy or sell at any time?
2. How does the system determine how much of a market to buy or sell at any time?
3. How does the system determine when you buy or sell a market?
4. How does the system determine when you get out of a losing position?
5. How does the system determine when you get out of a winning position?
There is no standard answer to the above questions; likewise there exists no definite guideline for how the trading rules in TF should be implemented. The first and second questions are already answered in our previous works [1][2]. The third question is rather challenging; it is actually the core decision maker in the TF system and where the key to making profit lies, and questions 4 and 5 are related to it. Suppose that we have found a way to identify a trend signal to buy or sell, and we have a position opened. If the system along the way identifies another trend signal that complies with the direction of the currently opened position, we should keep the position open, since the signal suggests that the trend is not yet over. However, if it runs counter to the current position, we should probably close out, regardless of whether we are currently winning or losing, as it indicates a trend reversion. Our improved TF algorithm is designed to answer this question of when to buy or sell. The clue is derived from the most similar past trend.
It is a fact that financial cycles do exist, and it is hypothesized that a trend on a particular day from the past could happen again some days later. This assumption supports the Trend Recalling trading mechanism, which is the basic driving force that our improved trend following algorithm relies on. The idea is expressed as a process diagram in Figure 2. As can be seen in the diagram, there are four major processes for decision making, namely Pre-processing, Selection, Verification and Analysis. Figure 2 shows how our improved TF model works by recalling a trading strategy that used to perform well in the past, matching the current shape of the pattern to that of the old time. A handful of such patterns and their corresponding trading strategies are short-listed; one strategy is picked from the list after thorough verification and analysis.

A. Pre-processing

In this step, raw historical data collected from the past market are archived into a pool of samples. The pool size is chosen arbitrarily by the user; five years of data were archived in the database in our case. A sample is a day trend from the past with the corresponding trading strategy attached. The trend serves as an index pattern for locating the winning trading strategy, which takes the format of a sequence of buy and sell decisions. A good trading strategy is one that maximized profit in the past given the specific market trend pattern. This past pattern, deemed to be similar to the current market trend, serves as guidance for locating the strategy to be applied for decision making during the current trade session. Because such a past day trend yielded a good profit before, reusing its strategy comes close to guaranteeing a trading strategy superior to human judgment or to a complex time series forecasting algorithm. The past samples and their best trading strategies are indexed by an indicator that we name "EDM" (exponential divergence in movement). EDM is a crisp-valued indicator based on the difference of two moving averages:

EDM(t) = f( EMA_s(t) - EMA_l(t) )  (1)

EMA(t) = ( price(t) - EMA(t-1) ) * 2/(n+1) + EMA(t-1)  (2)

Figure 2. Improved TF process with Recalling function.
Where price(t) is the current price at any given time t, n is the number of periods, s denotes a shorter period of Exponential Moving Average, EMA(t) at time t, l represents a longer period EMA(t), f(.) is a function for generating the crisp result. The indicator sculpts the trend; and based on this information, a TF program finds a list of best trading strategies, which can potentially generate high profit. The following diagram in Figure 3 is an example of pre-processing a trend dated on 2009-12-07 that shows the EDM. As indicated from the diagram the program first found a long position at 10:00 followed by a short position at around 10:25, then a long position at 11:25, finally a short position around 13:51 and closes it out at the end of the day, which reaps a total of 634 index points. Each index point is equivalent to $50 Hong Kong dollars (KKD). In Hong Kong stock market, there is a two hours break between morning and afternoon sessions. To avoid this discontinuation on the chart, we shift the time backward, and joined these two sessions into one, so 13:15 is equivalent to 15:15. Figure 3. Example of EDM and preprocessed trend of 2009-12-07. © 2012 ACADEMY PUBLISHER JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 243 2009-12-07 closed market day trend Best fitted sample 2008-01-31 Figure 4. Example of 2009-12-07 sample (above) and its corresponding best fitness (2008-01-31) day trend and RSI graph (below). B. Selection Once a pool of samples reached a substantial size, the Trend Recalling mechanism is ready to use. The stored past samples are searched and the matching ones are short-listed. The goal of this selection process is to find the most similar samples from the pool, which will be used as a guideline in the forthcoming trading session. A foremost technical challenge is that no two trends are exactly the same, as they do differ from day to day as the market fluctuates in all different manners. Secondly, even two sample day trends look similar but their price ranges can usually be quite different. With consideration of these challenges, it implies that the sample cannot be compared directly value to value and by every interval for a precise match. Some normalization is necessary for enabling some rough matches. Furthermore the comparison should allow certain level of fuzziness. Hence each sample trend should be converted into a normalized graph, and by comparing their rough edges and measure the difference, it is possible to quantitatively derive a numeric list of similarities. In pattern recognition, the shape of an image can be converted to an outline like a wire-frame by using © 2012 ACADEMY PUBLISHER some image processing algorithm. The same type of algorithm is used here for extracting features from the trend line samples for quick comparison during a TF trading process. In our algorithm, each sample is first converted into a normalized graph, by calculating their technical indicators data. A popular indicator Relative Strength Index (RSI) has a limited value range (from 1 to 100), which is suitable for fast comparison, and they are sufficient to reflect the shape of a trend. In other words, these indicators help to normalize each trend sample into a simple 2D line graph. We can then simply compare each of their differences of shapes by superimposing these line graphs on top of each other for estimating the differences. 
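To make the two ingredients of this step concrete, the sketch below implements the EMA recurrence and EDM indicator of equations (1)-(2), together with a rough sum-of-absolute-differences distance between two normalized (e.g. RSI) day-trend graphs. Taking f(.) to be the plain EMA difference and using this particular distance measure are assumptions made for the sake of a runnable example; the paper leaves both unspecified.

import java.util.Arrays;

// Sketch of the EDM indicator (equations (1)-(2)) and a simple shape
// distance between two normalized day trends (e.g. their RSI graphs).
public class TrendRecallSketch {

    // EMA(t) = (price(t) - EMA(t-1)) * 2/(n+1) + EMA(t-1), seeded with the first price.
    static double[] ema(double[] price, int n) {
        double[] out = new double[price.length];
        out[0] = price[0];
        double k = 2.0 / (n + 1);
        for (int t = 1; t < price.length; t++) {
            out[t] = (price[t] - out[t - 1]) * k + out[t - 1];
        }
        return out;
    }

    // EDM(t) = f(EMA_s(t) - EMA_l(t)); here f is the identity (assumption).
    static double[] edm(double[] price, int shortN, int longN) {
        double[] s = ema(price, shortN), l = ema(price, longN);
        double[] out = new double[price.length];
        for (int t = 0; t < price.length; t++) out[t] = s[t] - l[t];
        return out;
    }

    // Rough distance between two normalized graphs of equal sampling:
    // the smaller the value, the more alike the two day trends.
    static double shapeDistance(double[] a, double[] b) {
        double d = 0;
        int len = Math.min(a.length, b.length);
        for (int t = 0; t < len; t++) d += Math.abs(a[t] - b[t]);
        return d;
    }

    public static void main(String[] args) {
        double[] prices = {100, 101, 103, 102, 104, 106, 105};
        System.out.println(Arrays.toString(edm(prices, 2, 5)));
    }
}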
This comparison approach produces a hierarchical similarity list, so that we can get around the inexact matching problem and allow a certain level of fuzziness without losing the similarity attributes. Figure 4 shows an example of two similar sample trend graphs with the RSI displayed: the blue line is the original market trend, the red line is the moving average and the green line is the RSI.

C. Verification

During this process, each candidate from the list is tested against the current market state. Ranked from the smallest difference value (the most similar) downwards, the candidates are passed through a fitness test. Each trend sample corresponds to a specific trading strategy (already established in the pre-processing step). Each trading strategy is extracted and evaluated against historical data; the strategy is then tested on how well it performs as a trial, and each trial performance is recorded. The trial performance is used as a criterion to rearrange the list. Here we have an example before and after the fitness test, which was run on 2009-12-07 in the middle of a simulated trade session. The comparison is done based solely on the EDM indicator of the moving market price. Verification is needed because the selection of these candidates is a best-effort approach: the current and past market situations may still differ to a certain extent.

D. Confirmation

After the verification process is done, the candidate list is re-sorted according to the fitness test results. The fittest one is used as the reference trading strategy during the subsequent TF decision making. To further improve performance on top of referencing the past best strategy, some technical analysis should be consulted as well. Following the advice of Richard L. Weissman in his book [8], the two-moving-average crossover system should be used as signal confirmation. A cross-over means that a rise in the market price starts to emerge; it must cross over its averaged trend. The two-moving-average crossover system entails the rise of a second, shorter-term moving average. Instead of simple moving averages, however, exponential moving averages (EMA) of the RSI should be used, that is, a short-term RSI EMA and a long-term RSI EMA crossover system. When a changing trend is confirmed and it appears to be a good trading signal, the crossover system must also be referenced to check whether it gives a consistent signal; otherwise the potential change in trend is considered a false signal or intermittent noise. For example, in our case the trading strategy from the recalled sample hints at a long position trade, so we check whether the RSI crossover system shows the short-term EMA crossing over its long-term EMA or not.

Figure 5. Fitness test applied on 2009-12-07 at the time 14:47.
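A minimal sketch of the confirmation check just described: a hinted long trade is accepted only when the short-term RSI EMA has just crossed above the long-term RSI EMA, and a hinted short trade only on a downward cross. The two EMA series are assumed to be precomputed elsewhere, and the exact crossing test is an illustrative choice.

// Confirmation sketch: a recalled trading hint is accepted only when the
// short-term EMA of the RSI crosses the long-term EMA of the RSI in the
// same direction as the hint.
public class CrossoverConfirmation {

    enum Hint { LONG, SHORT }

    // rsiEmaShort / rsiEmaLong hold the EMA of the RSI series up to time t.
    static boolean confirms(Hint hint, double[] rsiEmaShort, double[] rsiEmaLong, int t) {
        if (t < 1) return false;  // not enough history yet
        boolean crossedUp   = rsiEmaShort[t - 1] <= rsiEmaLong[t - 1]
                           && rsiEmaShort[t]     >  rsiEmaLong[t];
        boolean crossedDown = rsiEmaShort[t - 1] >= rsiEmaLong[t - 1]
                           && rsiEmaShort[t]     <  rsiEmaLong[t];
        // A long hint needs an upward cross-over; a short hint a downward one.
        return hint == Hint.LONG ? crossedUp : crossedDown;
    }
}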
In addition to validating the hinted trading signals from the past strategy, market volatility should be considered during decision making. It was found in our previous work [2] that the performance of TF is affected mostly by market fluctuation; losses resulted because frequent wrong trading actions were triggered by the TF rules when the market fluctuated too often. The market fluctuation is therefore fuzzified as a fuzzy volatility indicator. This fuzzy volatility indicator is embedded in the TF mechanism to monitor the volatility automatically and to proceed with trading only when the market fluctuation is neither too high nor too low. When the reading of this fuzzy system is too high or too low, the system closes out the position. There are many ways to calculate volatility; the most common one is finding the standard deviation of an asset's closing price over a year. The central concept of volatility is the amount of change or variance in the price of the asset over a period of time. So, we can measure market volatility simply by the following equations:

Volatility(t) = SMA_n( ln(price(t)) - ln(price(t-1)) ) * C  (3)

SMA(t) = ( Close(t) + Close(t-1) + ... + Close(t-n+1) ) / n  (4)

where ln(.) is the natural logarithm, n is the number of periods, t is the current time, and C is a constant that enlarges the value to a significant figure. SMA is the Simple Moving Average, i.e., the average stock price over a certain period of time. By observing how the equation responds to historical data, we found the maximum volatility to be ±15 percent. Based on the previous fluctuation test result, we can define it as the following fuzzy membership.

Figure 6. Fitness test applied on 2009-12-07 at the time 14:47.

During the trading session, volatility is constantly referenced while the following rules apply in the TF system:
IF volatility is too positive high and a long position is opened THEN close it out
IF volatility is too positive high and no position is opened THEN open a short position
IF volatility is too low THEN do nothing
IF volatility is too negative high and a short position is opened THEN close it out
IF volatility is too negative high and no position is opened THEN open a long position
These rules have a higher priority than the trade strategies, so that when any of these conditions is met they take over control regardless of the decision the trade strategy has made; in other words, only the volatility factor is considered. The four processes are summarized as pseudo code in the Appendix. Though the model is generic and should be able to work on any market with varying patterns, a new sample pool is recommended for each different market, as described in the Pre-processing section.

III. EXPERIMENTS

Two experiments are conducted in this project. One verifies the efficacy of the Trend Recalling algorithm in a simulated automated trading system; the other compares the profitability of the Trend Recalling algorithm against time series forecasting algorithms. The objective of the experiments is to investigate the feasibility of the Trend Recalling algorithm in an automated trading environment as an alternative to time-series forecasting.

A. Performance of Trend Recalling in Automated Trading

The improved TF algorithm with the Trend Recalling function is programmed into an automated trading simulator in Java. A simplified diagram of the prototype is shown in Figure 7. It is essentially an automated system that adopts trading algorithms for deciding when to buy and sell based on predefined rules and the current market trend. The system interfaces with application plug-ins that instruct an online broker to trade in an open market. The trading interval is one minute. Two sets of data are used for the experiment to avoid bias in data selection.
One set is market data of Hang Seng Index Futures collected during the year of 2010, the other one is H-Share also during the same year. They are basically time-series that have two attributes: timestamp and price. Their prices move and the records get updated in every minute. The two datasets however share the same temporal format and the same length, with identical market start and end times for fair comparisons of the algorithms. The past patterns stored in the data base are collected from the past 2.5 years for the use by Trend Recalling algorithm. All trials of simulations are run and the corresponding trading strategies are decided by the automated trading on the fly. At the end of the day, a trade is concluded by measuring the profit or loss that the system has made. The overall performance of the algorithms is the sum of profit/loss averaged by the number of days. In the simulation each trade is calculated in the unit of index point, each index point is equivalent to HKD 50, which is subject to overhead cost as defined by Interactive Broker unbundled commission scheme at HKD 19.3 per trade. The Return-of-Investment (ROI) is the prime performance index that is based on Hong Kong Exchange current Initial margin requirement (each contract HKD 7400 in year of 2010). Figure 7. A prototype diagram that shows the trading algorithms is the core in the system. A time-sequence illustration is shown in Figure 8 that depicts the essential ‘incubation’ period required prior to the start of trading. The timings are chosen arbitrarily. However, sufficient time (e.g., 30 minutes was chosen in our experiment) should be allowed since the beginning of the market for RSI to be calculated. Subsequently another buffer period of time followed by the calculation of the first RSI0 would be required for growing the initial trend pattern to be used for matching. If this initial part of the © 2012 ACADEMY PUBLISHER trend pattern is too short, the following trading by the Trend Recalling algorithm may not work effectively because of inaccurate matching by short patterns. If it is stalled for too long for accumulating a long matching pattern, it would be late for catching up with potential trading opportunities for the rest of the day. The stock market is assumed to operate on a daily basis. A fresh trade is started from the beginning of each day. In our case, we chose to wait for 30 minutes between the time when 246 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 RSI0 is calculated and when the trading by Trend Recalling started. The Trend Recalling steps repeat in every interval that periodically guides the buying or selling actions in the automatic trading system. As time progresses the matching pattern lengthens, matching would increasingly become more accurate and the advices for trading actions become more reliable. In our simulation we found that the whole process by Trend . Recalling algorithm that includes fetching samples from the database, matching and deciding the trading actions etc., consumes a small amount of running time. In average, it takes only 463.44 milliseconds to complete a trading decision with standard deviation of 45.16; the experiment was run on a PC with a CPU of Xeon QC X3430 and 4Gb RAM, Windows XP SP3 operating system. 
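For concreteness, the accounting just described reduces to simple arithmetic: index points gained are converted at HKD 50 per point, each trade is charged the HKD 19.3 commission, and ROI is taken against the HKD 7,400 initial margin per contract. A small sketch of that arithmetic follows; the number of trades in the example is an assumed figure.

// Profit-and-loss bookkeeping used in the simulation: HKD 50 per index point,
// HKD 19.3 commission per trade, ROI measured against the HKD 7,400 initial
// margin per contract (2010 figures quoted in the text).
public class TradeAccounting {

    static final double HKD_PER_POINT = 50.0;
    static final double COMMISSION_PER_TRADE = 19.3;
    static final double INITIAL_MARGIN = 7400.0;

    static double netProfitHkd(double indexPointsGained, int tradesExecuted) {
        return indexPointsGained * HKD_PER_POINT - tradesExecuted * COMMISSION_PER_TRADE;
    }

    static double roi(double netProfitHkd) {
        return netProfitHkd / INITIAL_MARGIN;  // e.g. 0.5 means a 50% return on margin
    }

    public static void main(String[] args) {
        // Example: the 634-point day of Figure 3, assuming four trades were placed.
        double profit = netProfitHkd(634, 4);
        System.out.printf("profit = HKD %.1f, ROI = %.1f%%%n", profit, roi(profit) * 100);
    }
}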
Figure 8. Illustration of the incubation period in market trading by Trend Recalling. (The market starts at 09:45; the initial RSI0 is calculated from 10:15, since the RSI needs at least 15 minutes of data; from 10:45 the starting trend is collected for matching, after which the Selection, Verification and Confirmation steps repeat, deciding buy, sell or no action at every interval until the market closes.)

The simulation results are shown in Table I and Table II respectively, for running the trading systems with the different TF algorithms over Hang Seng Index Futures data and H-Share data. The Static TF is one that has predefined thresholds P and Q whose values do not change throughout the whole trade; P and Q are the bars such that, when the current price moves beyond them, the system automatically sells or buys respectively. Dynamic TF allows the values of the bars to change. Fuzzy TF essentially fuzzifies these bars, and FuzzyVix fuzzifies both the bars and the volatility of the market price. Readers who are interested in the full details can refer to [2]. The Tables show the performance figures in terms of ROI, profits and losses, and the error rates. Overhead costs per trade are taken into account when calculating profits. The error rate is the frequency, or percentage of times, that the TF system made a wrong move that incurred a loss. As we observed from both Tables, more than a 400% increase in ROI is achieved by the Trend Recalling algorithm at the end of the experimental runs. This is a significant result, as it implies that the proposed algorithm can reap more than four times the initial investment annually. The trading pattern of the Trend Recalling algorithm is shown in Figure 9 for the Hang Seng Index Futures data and in Figure 10 for H-Share; the same default simulation parameters are used. The trading pattern of the Trend Recalling algorithm is compared to that of the other TF algorithms proposed earlier by the authors; readers interested in the other TF algorithms can refer to [2][3] for details. From the Figures, the trading performance of the Trend Recalling strategy remains in profit throughout and keeps improving over the long run. Figure 11 shows a longitudinal view of trading results over a day; one can see that TF does not guarantee profits at all times, but overall there are more profits than losses.

TABLE I. PERFORMANCE OF ALL TF TRADING ALGORITHMS ON HANG SENG INDEX FUTURES 2010

Figure 9. Simulation of all TF trading algorithms on Hang Seng Index Futures during 2010 (series: Static, Dynamic, Fuzzy, FuzzyVix, Recalling; vertical axis: Index Point; horizontal axis: Date).

TABLE II. PERFORMANCE OF ALL TF TRADING ALGORITHMS ON H-SHARE 2010

Figure 10.
Simulation of all TF trading algorithms on H-Share during 2010. Trend Recalling: Daily Profit and Loss Index Point 600 400 200 0 -200 -400 Figure 11. Profit and loss diagram. B. Comparison of Trend Recalling and Time Series Forecasting Time series forecasting (TSF) is another popular technique for stock market trading by mining over the former part of the trend in order to predicting the trend of near-future. The major difference between TSF and TF is that, TSF focuses on the current movements of the trend with no regard to history, and TSF regresses over a set of past observations collected over time. Some people may distinguish them as predictive and reactive types of trading algorithms. Though the reactive type of algorithms have not been widely studied in research community, there are many predictive types of time series forecasting models available, such as stationary model, trend model, linear trend model, regression model, etc. Some advance even combined neural network with TSF [9]. In our experiment here, we want to compare the working performance of TSF and TF, which is represented by its best performer so far – Trend Recalling algorithm. For a fair comparison, both types of algorithms would operate over the same dataset, which is the Hang © 2012 ACADEMY PUBLISHER Seng Index Futures. We simulate their operations and trading results over a year, under the same conditions, and compare the level of profits each of them can achieve. The profit or loss for each trade would be recorded down, and then compute an average return-of-investment (ROI) out of them. ROI will then be the common performance indicator for the two competing algorithms. It is assumed that ROI is of prime interest here though there may be other technical performance indictors available for evaluating a trading algorithm [10]. For examples, Need to Finish, Price Sensitivity, Risk Tolerance, Frequency of Trade Signals and Algorithmic Trading Costs etc. In the TSF, future values are predicted continuously as trading proceeds. If the predicted value is greater than the closing value, the system shall take a long position for the upcoming trade. And if it is lower than the previous value, it takes a short position; anything else it will do nothing. Instead of testing out each individual algorithm under the TSF family, a representative algorithm will be chosen JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 based on its best prediction accuracy for this specific set of testing data. Oracle Crystal Ball [11] that is well-known prediction software with good industrial strength is used to find a prediction model that offers the best accuracy. For comparison the “best” candidate forecasting algorithm is selected by Oracle Crystal Ball that yields the lowest average prediction error. Oracle Crystal Ball has built-in estimators that calculate the performance of each prediction model by four commonly used accuracy measures: the mean absolute deviation (MAD), the mean absolute percent error (MAPE), the mean square error (MSE), and the root mean square error (RMSE). Theil’s U statistic is a relative accuracy measure that compares the forecasted results with a naive forecast. When Theil’s U is one the forecasting technique is about as good as guessing; more than one implies the forecasting technique is worse than guessing, less than one means it is better than guessing. Durbin–Watson statistic is a test statistic used to detect the presence of autocorrelation in the prediction error. 
The value always lies between 0 and 4. If the Durbin–Watson statistic is substantially smaller than 2, there is evidence of positive serial correlation. In general if Durbin–Watson is smaller than 1, there may be cause for alarm. Small values of Durbin–Watson statistic indicate successive error terms are, on average, close in value to one another, or positively correlated. If it is greater than 2 successive error terms are, on average, much different in value to one another, i.e., negatively correlated. In regressions, this can imply an underestimation of the level of statistical significance. Table 3 lists the prediction accuracies in terms of the error measures. The best performer is Single Exponential Smoothing prediction model for the chosen testing dataset. TABLE III. PERFORMANCE OF PREDICTIVE MODELS GENERATED BY ORACLE CRYSTAL BALL With this optimal prediction model suggested by Oracle Crystal Ball for the given data, we apply the following trade strategies for the prediction model: For a long position to open, the following equation should be satisfied, Pvt+1 ﹣ Pt > 0. For a short position to open, the following equation should be satisfied, Pv t+1 ﹣ Pt < 0 where Pv t+1 is the predictive value, and Pt is the closing price at the time t. The two trading models, one by TF and the other by TSF, are put vis-à-vis in the simulation. The simulation results are gathered and presented in Table 4 and their corresponding performance curves are shown in Figure 12. The results show that Trend Recalling consistently outperformed Single Exponential Smoothing algorithm in our experiment. © 2012 ACADEMY PUBLISHER 249 TABLE IV. SIMULATION RESULTS OF "PREDICTIVE MODEL" AND "REACTIVE MODEL" IV. CONCLUSION Trend following has been known as a rational stock trading technique that just rides on the market trends with some preset rules for deciding when to buy or sell. TF has been widely used in industries, but none of it was studied academically in computer science communities. We pioneered in formulating TF into algorithms and evaluating their performance. Our previous work has shown that its performance suffers when the market fluctuates in large extents. In this paper, we extended the original TF algorithm by adding a market trend recalling function, innovating a new algorithm called Trend Recalling Algorithm. Trading strategy that used to make profit from the past was recalled for serving as a reference for the current trading. The trading strategy was recalled by matching the current market trend that was elapsed since the market opened, with the past market trend at which good profit was made by the strategy. Matching market trend patterns was not easy because patterns can be quite different in details, and the problem was overcome in this paper. Our simulation showed that the improved TF model with Trend Recalling algorithm is able to generate profit from stock market trading at more than four times of ROI. The new Trend Recalling algorithm was shown to outperform the previous TF algorithms as well as a timeseries forecasting algorithm in our experiments. 250 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 Predictive (Overnight) Reactive (Trend Recalling) 8000 6000 Index Point 4000 2000 0 20100104 20100202 20100305 20100408 20100507 20100608 20100709 20100809 20100907 20101008 20101108 20101207 -2000 -4000 Date Figure 12. Simulation trade result of predictive model and reactive model on HSI futures contracts in 2010. REFERENCES [1] [2] [3] [4] [5] Fong S. 
and Tai J., " Improved Trend Following Trading Model by Recalling Past Strategies in Derivatives Market", The Third International Conferences on Pervasive Patterns and Applications (PATTERNS 2011), 25-30 September 2011, Rome, Italy, pp. 31-36 Fong S., Tai J., and Si Y.W., "Trend Following Algorithms for Technical Trading in Stock Market", Journal of Emerging Technologies in Web Intelligence (JETWI), Academy Publisher, ISSN 1798-0461, Volume 3, Issue 2, May 2011, Oulu, Finland, pp. 136-145. Stan Weinstein's, Secrets for Profiting in Bull and Bear Markets, pp. 31-44, McGraw-Hill, USA, 1988. Wikipedia, 1997 Asian Financial Crisis, Available at http://en.wikipedia.org/wiki/1997_Asian_Financial_Crisis, last accessed on July-3-2012. Wikipedia, Financial crisis of 2007–2010, Available at http://en.wikipedia.org/wiki/Financial_crisis_of_2007, last accessed on July-3-2012. © 2012 ACADEMY PUBLISHER [6] [7] [8] [9] [10] [11] Schannep J., "Dow Theory for the 21st Century: Technical Indicators for Improving Your Investment Results", Wiley, USA, 2008. Covel M.W., "Trend Following: How Great Traders Make Millions in Up or Down Markets", New Expanded Edition, Prentice Hall, USA, 2007, pp. 220-231. Weissman R.L., "Mechanical Trading Systems: Pairing Trader Psychology with Technical Analysis", Wiley, USA, 2004, pp. 10-19. Mehdi K. and Mehdi B., "A New Hybrid Methodology for Nonlinear Time Series Forecasting", Modelling and Simulation in Engineering, vol. 2011, Article ID 379121, 5 pages, 2011. Domowitz I. and Yegerman H., "Measuring and Interpreting the Performance of Broker Algorithms", 2005, Techical Report, ITG Inc., August 2005, pp. 1-12. Oracle Crystal Ball, a spreadsheet-based application for predictive modeling, forecasting, simulation Available at http://www.oracle.com/technetwork/middleware/crystalbal l/overview/index.html, last accessed on July-3-2012. JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 Appendix – The Pseudo Code of the Trend Recalling Algorithm © 2012 ACADEMY PUBLISHER 251 252 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 A Novel Method of Significant Words Identification in Text Summarization Maryam Kiabod Department of Computer Engineering, Najafabad Branch, Islamic azad University, Isfahan, Iran Email: m_kiabod@sco.iaun.ac.ir Mohammad Naderi Dehkordi and Mehran Sharafi Department of Computer Engineering, Najafabad Branch, Islamic azad University, Isfahan, Iran Email: naderi@iaun.ac.ir, mehran_sharafi@iaun.ac.ir Abstract—Text summarization is a process that reduces the size of the text document and extracts significant sentences from a text document. We present a novel technique for text summarization. The originality of technique lies on exploiting local and global properties of words and identifying significant words. The local property of word can be considered as the sum of normalized term frequency multiplied by its weight and normalized number of sentences containing that word multiplied by its weight. If local score of a word is less than local score threshold, we remove that word. Global property can be thought of as maximum semantic similarity between a word and title words. Also we introduce an iterative algorithm to identify significant words. This algorithm converges to the fixed number of significant words after some iterations and the number of iterations strongly depends on the text document. 
We used a two-layered backpropagation neural network with three neurons in the hidden layer to calculate weights. The results show that this technique has better performance than MS-word 2007, baseline and Gistsumm summarizers. Index Terms—Significant Words, Text Summarization, Pruning Algorithm I. INTRODUCTION As the amount of information grows rapidly, text summarization is getting more important. Text summarization is a tool to save time and to decide about reading a document or not. It is a very complicated task. It should manipulate a huge quantity of words and produce a cohesive summary. The main goal in text summarization is extracting the most important concept of text document. Two kinds of text summarization are: Extractive and Abstractive. Extractive method selects a subset of sentences that contain the main concept of text. In contrast, abstractive method derives main concept of text and builds the summarization based on Natural Language Processing. Our technique is based on extractive method. There are several techniques used for extractive method. Some researchers applied statistical criterions. Some of these criterions include TF/IDF (Term Frequency-Inverse Document Frequency) [1], number of words occurring in title [2], and number of numerical © 2012 ACADEMY PUBLISHER doi:10.4304/jetwi.4.3.252-258 data [3]. Using these criterions does not produce a readerfriendly summary. As a result NLP (Natural Language Processing) and lexical cohesion [4] are used to guarantee the cohesion of the summary. Lexical cohesion is the chains of related words in text that capture a part of the cohesive structure of the text. Semantic relations between words are used in lexical cohesion. Halliday and Hasan [5] classified lexical cohesion into two categories: reiteration category and collocation category. Reiteration category considers repetition, synonym, and hyponyms, while collocation category deals with the co-occurrence between words in text document. In this article, we present a new technique which benefits of the advantages of both statistical and NLP techniques and reduces the number of words for Natural Language Processing. We use two statistical features: term frequency normalized by number of text words and number of sentences containing the word normalized by total number of text sentences. Also we use synonym, hyponymy, and meronymy relations in reiteration category to reflect the semantic similarity between text words and title words. A twolayered backpropation neural network is used to automate identification of weights of features. The rest of the article is organized as follow. Section 2 provides a review of previous works on text summarization systems. Section 3 presents our technique. Section 4 describes experimental results and evaluation. Finally we conclude and suggest future work in section 5. II. TEXT SUMMARIZATION APPROACHES Automatic text summarization dates back to fifties. In 1958, Luhn [6] created text summarization system based on weighting sentences of a text. He used word frequency to specify topic of the text document. There are some methods that consider statistical criterions. Edmundson [7] used Cue method (i.e. "introduction", "conclusion", and "result"), title method and location method for determining the weight of sentences. Statistical methods suffer from not considering the cohesion of text. Kupiec, Pederson, and Chen [8] suggested a trainable method to summarize text document. In this method, JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 
4, NO. 3, AUGUST 2012 number of votes collected by the sentence determines the probability of being included the sentence in the summary. Another method includes graph approach proposed by Kruengkrai and Jaruskululchi [9] to determine text title and produce summary. Their approach takes advantages of both the local and global properties of sentences. They used clusters of significant words within each sentence to calculate the local property of sentence and relations of all sentences in document to determine global property of text document. Beside statistical methods, there are other approaches that consider semantic relations between words. These methods need linguistic knowledge. Chen, Wang, and Guan [10] proposed an automated text summarization system based on lexical chain. Lexical chain is a series of interrelated words in a text. WordNet is a lexical database which includes relations between words such as synonym, hyponymy, meronymy, and some other relations. Svore, Vander Wende and Bures [11] used machine learning algorithm to summarize text. Eslami, Khosravyan D., Kyoomarsi, and Khosravi proposed an approach based on Fuzzy Logic [12]. Fuzzy Logic does not guarantee the cohesion of the summary of text. Halavati, Qazvinian, Sharif H. applied Genetic algorithm in text summarization system [13]. Latent Semantic Analysis [14] is another approach used in text summarization system. Abdel Fattha and Ren [15] proposed a technique based on Regression to estimate text features weights. In regression model a mathematical function can relate output to input variables. Feature parameters were considered as input variables and training phase identifies corresponding outputs. There are some methods that combine algorithms, such as, Fuzzy Logic and PSO [16]. Salim, Salem Binwahla, and Suanmali [17] proposed a technique based on fuzzy logic. Text features (such as similarity to title, sentence length, and similarity to keywords, etc.) were given to fuzzy system as input parameters. Ref. [18] presented MMR (Maximal Marginal Relevance) as text summarization technique. In this approach a greedy algorithm is used to select the most relevant sentences of text to user query. Another aim in this approach is minimizing redundancy with sentences already included in the summary. Then, a linear combination of these two criterions is used to choose the best sentences for summary. Carbonell and Goldstein [19] used cosine similarity to calculate these two properties. In 2008 [20] used centroid score to calculate the first property and cosine similarity to compute the second property. Different measures of novelty were used to adopt this technique [21, 22]. To avoid greedy algorithms problems, many have used optimization algorithms to solve the new formulation of the summarization task [23, 24, 25]. III. PROPOSED TECHNIQUE The goal in extractive text summarization is selecting the most relevant sentences of the text. One of the most important phases in text summarization process is identifying significant words of the text. Significant words play an important role in specifying the best sentences for summary. There are some methods to identify significant words of the text. Some methods use statistical techniques and some other methods apply semantic relations between words of the text to determine significant words of text. Such as term frequency (TF), similarity to title words, etc. each method has its own advantages and disadvantages. 
In our work, a combination of these methods is used to improve the performance of the text summarization system. In this way, we use the advantages of several techniques to make text summarization system better. We use both statistical criterions and semantic relations between words to identify significant words of text. Our technique has five steps: preprocessing, calculating words score, significant words identification, calculating sentences score, and sentence selection. These steps are shown in Fig. 1. Figure 1: the flowchart of proposed technique © 2012 ACADEMY PUBLISHER 253 254 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 The first step, preprocessing, involves preparing text document for the next analysis and pruning the words of the text document. This step involves sentence segmentation, sentence tokenization part of speech tagging, and finding the nouns of the text document. Keywords or significant words are usually nouns, so finding nouns of the text can help improving performance of our system. The second step, calculating words scores, calculates words scores according to their local score and global score explained in detail later. Local score is determined based on statistical criterions and global score is determined through semantic similarity between a word and title words. The third step, significant words identification, uses words score and an iterative algorithm to select the most important words of text. The fourth step, calculating sentence score, calculates sentence score according to sentence local score, sentence global score and sentence location. The fifth step, sentence selection, selects the most relevant sentences of text based on their scores. These five steps are explained in detail in the next five sections. A. Preprocessing The first step in text summarization involves preparing text document to be analyzed by text summarization algorithm. First of all we perform sentence segmentation to separate text document into sentences. Then sentence tokenization is applied to separate the input text into individual words. Some words in text document do not play any role in selecting relevant sentences of text for summary, Such as stop words ("a", "an", the"). For this purpose, we use part of speech tagging to recognize types of the text words. Finally, we separate nouns of the text document. Our technique works on nouns of text. In the rest of the article we use "word" rather than "noun". In this phase, we use two statistical criterions: term frequency of the word normalized by total number of words (represented by TF) and number of sentences containing the word normalized by total number of sentences of text document (represented by Sen_Count). We combine these two criterions to define equation (1) to calculate local score of words. word_local_score = α * TF + (1- α) * Sen_Count (1) where α is weight of the parameter and is in the range of (0, 1). We utilize a two-layered backpropagation neural network with three neurons in hidden layer, maximum error of 0.001, and learning rate of 0.2 to obtain this weight. The dendrites weights of this network are initialized in the range of (0, 1). We use sigmoid function as transfer function. The importance of each parameter is determined by the average of dendrites weights connected to the input neuron that represents a parameter [26]. After training neural network with training dataset we use weights to calculate words local scores. 
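A minimal sketch of equation (1): TF is the word's term frequency normalized by the total number of words, Sen_Count is the number of sentences containing the word normalized by the total number of sentences, and the weight α is assumed to be supplied externally by the trained neural network described above.

import java.util.List;

// Sketch of equation (1): word_local_score = alpha * TF + (1 - alpha) * Sen_Count.
// The weight alpha is assumed to come from the trained neural network; here it
// is simply a parameter.
public class LocalScore {

    static double localScore(String word, List<List<String>> sentences, double alpha) {
        int totalWords = 0, occurrences = 0, sentencesWithWord = 0;
        for (List<String> sentence : sentences) {
            boolean found = false;
            for (String w : sentence) {
                totalWords++;
                if (w.equalsIgnoreCase(word)) { occurrences++; found = true; }
            }
            if (found) sentencesWithWord++;
        }
        double tf = occurrences / (double) totalWords;                   // normalized term frequency
        double senCount = sentencesWithWord / (double) sentences.size(); // normalized sentence count
        return alpha * tf + (1 - alpha) * senCount;
    }

    public static void main(String[] args) {
        List<List<String>> sentences = List.of(
                List.of("summarization", "reduces", "text"),
                List.of("text", "summarization", "extracts", "sentences"));
        System.out.println(localScore("summarization", sentences, 0.6));
    }
}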
The algorithm in this step prunes words of the text document and deletes words without any role in selecting relevant sentences for summary. This is done by defining a threshold and taking words whose scores are above that threshold. This algorithm is shown in Algorithm 1: Algorithm 1: Word pruning Algorithm Input: local score of words, words list Output: pruned words list 1. B. Calculating Words Score After preparing input text for text summarization process, it is time to determine words score to be used in later steps. In this step we utilize combination of statistical criterions and lexical cohesion to calculate text words scores. Finding semantic relations between words is a complicated and time consuming process. So, first of all, we remove unimportant words. For this reason, we calculate local score of word. If local score of a word is less than the word_local_score_threshold, we will remove that word. Word_local_score_threshold is the average of all text words scores multiplied by a PF (a number in the range of (0, 1) as a Pruning Factor in word selection). By increasing PF, more words will be removed from text document. In this way, the number of words decreases and the algorithm gets faster. We calculate global score for remaining words based on reiteration category of lexical cohesion. Finally, we calculate words scores by using local and global score of words. This step is described in detail in three next sections. 2. foreach words w of text do 3. If (word_local_score < word_local_score_threshold) Delete word from significant words list; 4. end 5. end 6. return pruned words list; In this algorithm, i represents word index and PF stands for Pruning Factor. The first line of the Algorithm 1 computes local score threshold of words by taking the average of the local score of words multiplied by PF. The second line of it prunes words by taking words whose scores are above the word_local_score_threshold. Finally, the algorithm returns the pruned words list in the seventh line. Calculating global score of words Calculating local score of words In this phase, we consider semantic similarity between text words and title words. We use WordNet, a lexical © 2012 ACADEMY PUBLISHER JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 database, to determine semantic relations between text words and title words. We fixed the weight of repetition and synonym to 1, of hyponymy and hyperonymy to 0.7, and of meronymy and holonymy to 0.4. We also consider repetition of keywords in the text and fix the weight of it to 0.9. We define equation (2) to calculate global score of words: Word_global_score = Max (sim (w, )) (2) According to this equation, first of all, we calculate the maximum similarity between each word and title words. Then the sum of maximum similarities is calculated to determine global score of words. This score is used in the next section. Calculating word score The final phase in this step is calculating word score. In our technique, word score is calculated by combination of local score and global score of word. We define equation (3) to calculate word score. Word_score=α*(word_local_score)+β*(word_global_s core) (3) α and β are determined by neural network illustrated before. C. Identifying Significant Words Significant words play an important role in text summarization systems. The sentences containing important words have better chance to be included in summary. 
In the case of finding significant words of text with a high accuracy, the results of text summarization will be great. So, we focus on significant word identification process to improve text summarization results. In this step, we introduce a new iterative method to determine significant words of text. In this method, significant words are initiated with text words. Then a threshold is defined to be used to identify the words that should be removed from initial significant words. This is done by applying the average of all significant words scores in previous iteration as word_score_threshold. If a word score is less than this threshold, we will remove that word from significant words list. In each loop of this algorithm some words are deleted from significant words list. The algorithm converges to the fixed number of significant words after some iteration. The algorithm is shown below: Algorithm 2: Significant words identification algorithm 5. 6. 7. 8. 9. 10. 11. 255 if (word_score< words_score_threshold) Delete word from significant words list; end end Word_score_threshold:=average(significant_words_scores); end return significant words list; words_score_threshold in Algorithm 2 is the average of all scores of significant words of text. This threshold changes in every iteration of algorithm. The new value of it is calculated through the average of scores of significant words in previous iteration of algorithm. The first line of Algorithm 2 initiates significant words list by text words. The second line initiates Word_score_threshold by calculating the average of scores of text words. The third line to the tenth line iterates to delete unimportant words from significant words list. The ninth line of the algorithm computes words_score_threshold for the next iteration. Finally, the algorithm returns significant words list in line ten. D. Calculating Sentence Score In this step, we use significant words determined by previous step to calculate sentence score. Our technique in this phase is based on Kruengkrai and Jaruskululchi [9] approach, but we changed the parameters. They combined local and global properties of a sentence to determine sentence score as follow: Sentence_score = α*G + (1-α)*L (4) Where G is the normalized global connectivity score and L is the normalized local clustering score. It results this score in the range of (0, 1). We define G and L as follow: G= (5) L= (6) where is the maximum semantic relation among sentence words and title and keywords. As shown in equation (5), we consider semantic relations among sentence words and title and keywords to determine the global property of a sentence. Then, we normalize it by total number of words in the sentence. The parameter α determines the importance of G and L. we use neural network illustrated before to determine α. Baxendale [27] showed that sentences located at first and last paragraph of text document are more important and having greater chances to be included in summary. So, we divide text document into three sections and multiply sentences scores in the first and last section by 0.4 and in the second section by 0.2. The algorithm is shown below. Input: text words list, text words scores Output: significant words list Algorithm 3: Sentence score calculation algorithm 1. significant_words := text_words; 2. Word_score_threshold :=average(text_words_scores); 3. 4. 
while number of significant words changes do foreach significant words of text do © 2012 ACADEMY PUBLISHER Input: number of significant words of each sentence, total number of significant words of text, total number of words in each sentence, similarity score between a word and title words, sentence location, and the parameters α and β Output: scores of sentences JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 1. foreach sentence of text do 2. sentence_local_score:= 3. sentence_global_score := ; ; 4. Sentence_score := α*G + (1-α)*L; 5. If ((1/3)*TSN < sentence_loc < (2/3)*TSN) 6. Sentence_score *:=0.2; 7. else : Sentence_score *:=0.4; 8. end 9. end 10. return scores of sentences; TSN IN Algorithm 3 is referred as total number of text sentences. Sentence_loc is the location of sentence in text document. The Algorithm 3 repeats line two to line eight for each sentence. Line two computes local score of sentences. The third line of the algorithm computes global score of sentence. The forth line computes sentence score according to local score and global score. The fifth line to the eighth line considers the sentence location. If sentence location is in the first section or last section of the text document, multiply it’s score by 0.4 otherwise multiply score of sentence by 0.2. Finally, the algorithm returns sentences scores in line ten. E. Sentence Selection After calculating scores of the sentences, we can use these scores to select the most important sentences of text. This is done by ranking sentences according to their scores in decreasing order. Sentences with higher score tend to be included in summary more than other sentences of the text document. In our technique these sentences have more similarity to title. This similarity is measured according to statistical and semantic techniques used in our technique. Another criterion to choose sentences for summary is Compression Rate. Compression rate is a scale to decrease the size of text summary. A higher compression rate leads to a shorter summary. We fix compression rate to 80%. Then n topscoring sentences are selected according to compression rate to form the output summary. We use DUC2002 1 as input data to train neural network and test our technique. DUC 2002 is a collection of newswire articles, comprised of 59 document clusters. Each document in DUC2002 consists of 9 to 56 sentences with an average of 28 sentences. Each document within the collections has one or two manually created abstracts with approximately 100 words which are specified by a model. We evaluate the technique for different PF. The best result was achieved for PF=0.25 as shown in Fig. 2. We compare our results with MS-word 2007, Gistsumm, and baseline summarizers. MS-word 2007 uses statistical criterions, such as term frequency, to summarize a text. Gistsumm uses the gist as a guideline to identify and select text segments to include in the final extract. Gist is calculated on the basis of a list of keywords of the source text and is the result of the measurement of the representativeness of intra- and inter-paragraph sentences. The baseline is the first 100 words from the beginning of the document as determine by DUC2002. The results are shown in Fig. 3 and Fig. 4. The numerical results are shown in Table 1. The text number in Table 1 shows the text number in the tables. Our technique (OptSumm) reaches the average precision of 0.577, recall of 0.4935 and f-measure of 0.531. 
The MSword 2007 summarizer achieves the average precision of 0.258, recall of 0.252 and f-measure of 0.254. The Gistsumm reaches the average precision of 0.333 and fmeasure of 0.299. the baseline achieves the average of 0.388, recall of 0.28 and f-measure of 0.325.the results have shown that our system has better performance in comparison with MS-word 2007, Gistsumm and baseline summarizers. Fig. 3, Fig. 4, and Fig. 5 show that the precision score, the Recall score, and F-measure are higher when we use OptSumm rather than MS-word 2007, Gistsumm, and baseline summarizers. 1 0.8 Precision 256 0.6 PF=0.25 0.4 PF=0.5 0.2 PF=0.75 0 1 3 5 7 PF=1.0 Text Number IV. EVALUATION Figure 2: the comparison of different PF Text summarization evaluation is a complicated task. We use three criterions to evaluate our system [28]: Precision Rate = (7) Recall Rate = (8) F-measure= (9) 1. © 2012 ACADEMY PUBLISHER 9 11 13 www.nlpir.nist.gov JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 257 1.2 1 OptSumm 0.6 0.8 F-measure Precision 1 0.8 MS-Word 0.4 GistSumm 0.2 baseline 0 1 3 5 7 9 11 13 0.6 OptSumm 0.4 MS-word 0.2 Gistsumm baseline 0 Text Number 1 3 5 7 9 11 13 Text Number Figure 3: the comparison of precision score among four summarizers Figure 5: the comparison of F-measure score among four summarizers 1 Recall 0.8 0.6 OptSumm 0.4 MS-Word 0.2 GistSumm baseline 0 1 3 5 7 9 11 13 Text Number Figure 4: the comparison of recall score among four summarizers Table I. THE COMPARISON OF PRECISION AND RECALL AMONG FOUR SUMMARIZERS Text Number 1 2 3 4 5 6 7 8 9 10 11 12 13 14 SET NO. D061J D062J D106g D113h D083a D071f D072f D092c D074b D091c D110h D102e D098e average Model Precision b a a b b a j a a j b f a - 0.45 0.8 0.875 0.8 0.5 0.66 0.5 0.85 0.8 0.27 0.875 0.107 0.125 0.577 OptSumm Recall Fmeasure 0.5 0.473 0.66 0.723 0.38 0.529 0.44 0.567 0.428 0.461 0.5 0.568 0.5 0.5 0.666 0.746 0.666 0.726 0.42 0.328 0.777 0.823 0.33 0.161 0.142 0.132 0.4935 0.531 Precision 0.1 0.142 0.363 0.25 0.4 0.33 0.222 0.55 0.4 0.1 0.5 0.01 0.125 0.258 MS-word 2007 Recall Fmeasure 0.125 0.111 0.166 0.153 0.22 0.273 0.111 0.153 0.285 0.332 0.375 0.351 0.25 0.235 0.55 0.55 0.33 0.361 0.15 0.12 0.55 0.523 0.11 0.018 0.142 0.132 0.252 0.254 Precision 0.285 0.4 0.33 0.25 0.25 0.2 0.44 0.57 0.2 0.36 0.6 0.09 0.2 0.333 GistSumm Recall Fmeasure 0.25 0.266 0.33 0.361 0.166 0.220 0.111 0.153 0.142 0.181 0.125 0.153 0.5 0.468 0.22 0.317 0.16 0.177 0.57 0.441 0.33 0.425 0.11 0.099 0.142 0.166 0.272 0.299 Precision 0.5 1.0 0.625 0.8 0.4 0.5 0.2 0.2 0.6 0.1 0.166 0.1 0 0.388 baseline Recall 0.375 0.5 0.27 0.44 0.285 0.75 0.125 0.111 0.5 0.15 0.111 0.15 0 0.28 Fmeasure 0.428 0.666 0.377 0.567 0.332 0.6 0.153 0.142 0.545 0.12 0.133 0.12 0 0.325 V. CONCLUSION and FUTURE WORK REFERENCES In this article, we proposed a new technique to summarize text documents. We introduced a new approach to calculate words scores and identify significant words of the text. A neural network was used to determine the style of human reader and to which words and sentences the human reader deems to be important in a text. The evaluation results show better performance than MS-word 2007, GistSumm, and baseline summarizers. In future work, we intend to use other features, such as font based feature and cue-phrase feature in words local score and calculate words scores based on it. Also the sentence local score and global score can be changed to reflect the reader's needs. 
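The evaluation above relies on the precision, recall and F-measure criteria of equations (7)-(9). Under the standard set-overlap reading for extractive summaries, comparing the sentences selected by a system against those of a model summary, they can be computed as in the following sketch; representing sentences by integer identifiers is an assumption of this sketch.

import java.util.HashSet;
import java.util.Set;

// Standard set-overlap reading of precision, recall and F-measure between the
// sentences chosen by a summarizer and the sentences of a model summary.
public class SummaryEvaluation {

    static double[] evaluate(Set<Integer> system, Set<Integer> model) {
        Set<Integer> overlap = new HashSet<>(system);
        overlap.retainAll(model);  // correctly extracted sentences
        double precision = system.isEmpty() ? 0 : overlap.size() / (double) system.size();
        double recall    = model.isEmpty()  ? 0 : overlap.size() / (double) model.size();
        double f = (precision + recall == 0) ? 0
                 : 2 * precision * recall / (precision + recall);
        return new double[] { precision, recall, f };
    }

    public static void main(String[] args) {
        double[] prf = evaluate(Set.of(1, 3, 5, 8), Set.of(1, 2, 5, 8));
        System.out.printf("P=%.2f R=%.2f F=%.2f%n", prf[0], prf[1], prf[2]);
    }
}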
[1] M.Wasson, "Using Leading Text for News Summaries: Evaluation results and implications for commercial summarization applications”, In Proceedings of the 17th International Conference on Computational Linguistics and 36th Annual Meeting of the ACL, pp.1364-1368, 1998. [2] G.Salton,C.Buckley,"Term-weighting Approaches in Automatic Text Retrieval", Information Proceeding and Management 24,1988,513-523.Reprinted in:Sparck-Jones, K.; Willet ,P.(eds).Readings in I.Retreival, Morgan Kaufmann,pp.323-328,1997 [3] C.Y.Lin, "Training a Selection Function for Extraction", In Proceedings of eighth international conference on Information and knowledge management, Kansas City, Missouri, United States, pp.55-62,1999. [4] M.Hoey, Patterns of Lexis in Text. Oxford: Oxford University Press, 1991 © 2012 ACADEMY PUBLISHER 258 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 [5] M.Halliday, and Hasan, R.1975.Cohesion in English. London: Longman [6] H.P.Luhn, “The Automatic Creation of Literature Abstracts”, IBM journal of Research Development, 1958, pp.159-165. [7] H.P.Edmundson, “New Methods in Automatic Extraction”, journal of the ACM, 1969, pp.264-285. [8] J.Kupiec , j.Pedersen, AND j.Chen, “A Trainable Document Summarizer”, In Proceedings of the 18th ACMSIGIR Conference,1955,pp.68-73. [9] C.Jaruskululchi, Kruengkrai, “Generic Text Summarization Using Local and Global Properties of Sentences”, IEEE/WIC international conference on web intelligence, October 2003, pp.13-16. [10] Y.Chen, X. Wang, L.V.YI Guan,” Automatic Text Summarization Based on Lexical Chains”, in Advances in Natural Computation, 2005, pp.947-951. [11] K.Svore, L.Vanderwende, and C.Bures, “Enhancing Single-document Summarization by Combining Ranknet and Third-party Sources”, In Proceeding of the EMNLPCoNLL. [12] F.Kyoomarsi, H.Khosravi, E.Eslami, and P.Khosravyan Dehkordy, “Optimizing Text Summarization Based on Fuzzy Logic”, In Proceedings of Seventh IEEE/ACIS International Conference on Computer and Information Science, IEEE, University of shahid Bahonar Kerman,2008,pp.347-352. [13] V.Qazvinian, L.Sharif Hassanabadi, R.Halavati, “Summarization Text with a Genetic Algorithm-Based Sentence Extraction”, International of Knowledge Management Studies (IJKMS),2008,vol.4,no.2,pp.426-444. [14] S.Hariharan, “Multi Document Summarization by Combinational Approach”, International Journal of Computational Cognition, 2010, vol.8, no.4, pp.68-74. [15] M.Abdel Fattah, and F.Ren, “Automatic Text Summarization”, Proceedings of World of Science, Engineering and Technology,2008,vol.27,pp.195-192. [16] L.Suanmali, M.Salem, N.Binwahlan and Salim, “sentence Features Fusion for Text Summarization Using Fuzzy Logic”, IEEE, 2009,pp.142-145. [17] L.Suanmali, N. Salim, and M.Salem Binwahlan, “Fuzzy Swarm Based Text Summarization”, journal of computer science, 2009, pp.338-346. [18] J. Carbonell and J. Goldstein, “The use of MMR, diversitybased rerunning for reordering documents and producing © 2012 ACADEMY PUBLISHER [19] [20] [21] [22] [23] [24] [25] [26] [27] [28] summaries,” in Proceedings of theAnnual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 335–336, 1998. C. D. Manning and H. Schutze, Foundations of Natural Language Processing.MIT Press, 1999. S. Xie and Y. Liu, “Using corpus and knowledge-based similarity measure in Maximum Marginal Relevance for meeting summarization,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 
[21] G. Murray, S. Renals, and J. Carletta, "Extractive summarization of meeting recordings", in Proceedings of the 9th European Conference on Speech Communication and Technology, pp. 593-596, 2005.
[22] D. R. Radev, T. Allison, S. Blair-Goldensohn, J. Blitzer, A. Celebi, S. Dimitrov, E. Drabek, A. Hakim, W. Lam, D. Liu, J. Otterbacher, H. Qi, H. Saggion, S. Teufel, A. Winkel, and Z. Zhang, "MEAD — a platform for multidocument multilingual text summarization", in Proceedings of the International Conference on Language Resources and Evaluation, 2004.
[23] R. McDonald, "A study of global inference algorithms in multi-document summarization", in Proceedings of the European Conference on IR Research, pp. 557-564, 2007.
[24] S. Ye, T.-S. Chua, M.-Y. Kan, and L. Qiu, "Document concept lattice for text understanding and summarization", Information Processing and Management, vol. 43, no. 6, pp. 1643-1662, 2007.
[25] W. Yih, J. Goodman, L. Vanderwende, and H. Suzuki, "Multi-document summarization by maximizing informative content-words", in Proceedings of the International Joint Conference on Artificial Intelligence, pp. 1776-1782, 2007.
[26] N. Soltanian Zadeh and L. Sharif, "Evaluation of Effective Parameters and Their Effect on Summarization Systems Using Neural Network", Fifth Annual International Conference of the Computer Society of Iran, 2008.
[27] P. Baxendale, "Machine-Made Index for Technical Literature — an Experiment", IBM Journal of Research and Development, 1958.
[28] Y.Y. Chen, O.M. Foong, S.P. Yong, and I. Kurniawan, "Text Summarization for Oil and Gas Drilling Topic", Proceedings of World Academy of Science, Engineering and Technology, vol. 32, pp. 37-40, 2008.

Attribute Overlap Minimization and Outlier Elimination as Dimensionality Reduction Techniques for Text Classification Algorithms

Simon Fong
Department of Computer and Information Science, University of Macau, Macau SAR
Email: ccfong@umac.mo

Antonio Cerone
International Institute for Software Technology, United Nations University, Macau SAR
Email: antonio@iist.unu.edu

Abstract—Text classification is the task of assigning free-text documents to predefined groups. Many algorithms have been proposed; in particular, dimensionality reduction (DR), an important data pre-processing step, has been studied extensively. DR can effectively reduce the feature representation space, which in turn helps improve the efficiency of text classification. Two DR methods, namely Attribute Overlap Minimization (AOM) and Outlier Elimination (OE), are applied to downsize the feature representation space, on the number of attributes and the number of instances respectively, prior to training a decision model for text classification. AOM works by switching the membership of overlapped attributes (also known as features or keywords) to the group in which they occur most frequently. Dimensionality is lowered when only significant and unique attributes describe each group. OE eliminates instances that describe infrequent attributes. These two DR techniques can be used together with conventional feature selection to further enhance its effectiveness. In this paper, two datasets, one on classifying languages and one on categorizing online news into six emotion groups, are tested with a combination of AOM, OE and a wide range of classification algorithms. Significant improvements in prediction accuracy, tree size and speed are observed.
Index Terms—Data stream mining, optimized very fast decision tree, incremental optimization.

I. INTRODUCTION

Text classification is a classical text mining process that concerns automatically sorting unstructured, free-text documents into predefined groups [1]. The problem receives much attention from the data mining research community because of its practical importance in many online applications, such as automatic categorization of web pages in search engines [2], detection of public moods online [3], and information retrieval systems that selectively acquire online text documents into preferred categories. Given the online nature of text classification applications, the algorithms often have to deal with massive volumes of online text stored in unstructured formats, such as hypertexts, emails, electronic news archives and digital libraries. A prominent challenge of text classification is processing the high dimensionality of the attribute representation space manifested by the text data. Text information is often represented by a string variable, which is a single-dimensional data array or linked list in computer memory. Though the size of a string may be bounded, a string variable can potentially contain an infinite number of word combinations, and each string that represents an instance of a text document will have a different length. The large number of values in the training dataset and the irregular length of each instance make training a classifier extremely difficult. To tackle this issue, the text strings are transformed into a fixed-sized list of attributes that represent the frequency of occurrence of each corresponding word in the dataset. The frequency list is often called a word vector, in the form of a bit vector, which is an occurrence-frequency representation of the words. The length of a word vector is bounded by the maximum number of unique words that exist in the dataset. An example in WEKA (the 'Waikato Environment for Knowledge Analysis', a popular suite of machine learning software written in Java and developed at the University of Waikato) illustrates how sentences in natural language are converted to word vectors of frequency counts.

Figure 1. An example of a text string converted to a word vector.

Although word vectors can be processed by most classification algorithms, the transformation approach is not scalable. For large texts that contain many words, the word vector grows prohibitively large, which slows down model training and leads to the well-known data mining problem called the 'curse of dimensionality'. The word vector is also mostly sparse and occupies unnecessary runtime memory. Hence dimensionality reduction (DR) techniques are extensively studied by researchers. These techniques aim to reduce the number of components of a dataset, such as word vectors, while representing the original data as accurately as possible. DR often yields fewer features and/or instances, so a compact representation of the data can be achieved, improving text mining performance and reducing computational costs. Two types of DR are usually applied, often together, for reducing the number of attributes/features and for streamlining the number of instances.
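As an illustration of the word-vector conversion shown in Figure 1, the following sketch builds occurrence-frequency vectors for a handful of sentences. It is a plain Python illustration, not WEKA's filter; the example documents are invented.

```python
from collections import Counter

def to_word_vectors(documents):
    """Convert raw text strings into fixed-length word-frequency vectors."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One attribute per unique word in the whole dataset.
        vectors.append([counts.get(word, 0) for word in vocabulary])
    return vocabulary, vectors

# Invented example documents.
docs = ["cancer cure found", "cancer cases rising", "markets rising today"]
vocab, vecs = to_word_vectors(docs)
print(vocab)   # the attribute (word) list
print(vecs)    # one frequency vector per document
```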
They attempt to eliminate irrelevant and redundant attributes/data from the training dataset and/or its transformed representation, making the training data compact and efficient for a data-intensive task like constructing a classifier. In this paper, a DR method called Attribute Overlap Minimization (AOM) is introduced, which reduces the number of dimensions by refining the membership of each group that the word vector is most likely to belong to. Furthermore, the corresponding instances that do not fit well in the rearranged groups are removed. This paper reports on this DR technique, and experiments are conducted to demonstrate its effectiveness over two different datasets.

II. MODEL FRAMEWORK

A typical text mining workflow consists of data pre-processing, which includes data cleaning, formatting and missing-value handling, dimensionality reduction, and data mining model training. Figure 2 shows such a typical text mining workflow. A classifier, which is enabled by data mining algorithms, needs to be trained initially by processing a substantial amount of pre-labeled records to an acceptable accuracy before it can be used for classifying new, unseen instances into the predicted groups. The data mining algorithms are relatively mature in their efficacy, and their performance largely depends on the quality of the training data, which is the result of the DR that tries to abstract the original dataset into a compact representation. A type of DR method well known as stemming [4] has been proposed and widely used in the past. Stemming algorithms, or so-called stemmers, are designed to reduce a single word to its stem or root form [5] by finding its morphological root. This is done by removing the suffix of the words, and it helps shorten the length of most terms. The other important type of DR is feature selection, which selects only the attributes whose values represent the words that exist in the text documents, and filters out those attributes that have little predictive power with respect to the classification model, so that a subset of the original attributes can be retained for building an accurate model. A comparative study [6] evaluated different feature selection methods with respect to reducing the dimensionality of the text space. It was shown that between 50% and 90% of the terms in the text space can be removed by using suitable feature selection schemes without sacrificing any accuracy in the final classification model.

Figure 2. A typical text-mining workflow: data extraction from the web, formatting of unstructured text, denoising and stemming, class labeling of training records, dimensionality reduction (attribute reduction by feature selection and AOM, data reduction by outlier removal), and classification model training.

Both types of DR methods reduce the dimensionality of a dataset as an important element of the text data pre-processing stage. However, it is observed that feature selection heavily removes less-important attributes based on their potential contributing power in a classifier, without regard to the context of the training text data. We identify that one of the leading factors in misclassification is the confusion of the contexts of words in different groups.
The confusion disrupts the training process of the classification model by mistakenly interpreting a word/term from an instance as an indication of one group when in fact it is more likely to belong to another. A redundant and false mapping relation between the attributes and the target group is thereby created in the model, which dampens the accuracy of the resulting classification model. The source of this problem is the common attributes that are owned by more than one group. A single term, without reference to the context of its use, can belong to two or more target groups of text. In the example given in Figure 1, the individual term 'Cancer' actually occurs in a case that belongs to 'Good news', while the same term could intuitively be deemed an element of 'Bad news'. To rectify this problem, a data pre-processing method called Attribute Overlap Minimization (AOM) is proposed. In principle, it works by relocating each term to the group in which the term has the highest occurrence frequency. The relocation can be absolute, that is, based on a winner-takes-all approach: the group that has the highest frequency count of the overlapped word recruits it entirely. In the dataset, the instances that contain the overlapped words have to delete them if their labeled class group is not the winner group; the instances that belong to the winner group continue to own the words for describing the characteristics of the group. Another, milder approach is to assign 'weights' according to the relative occurrence frequencies across the groups. The strict approach may have the disadvantage of over-relocation, which leads to a situation where the winner group monopolizes the ownership of the frequently occurring terms, leaving the other groups short of key terms for training up their mapping relations. However, when the dataset has a sufficient number of instances and the overlapped terms are not too many, AOM works well and fast. Compared to FS, AOM has the advantage of preserving most of the attributes, and yet it can prevent potential confusion in the classification training. Another benefit is speed, since it is not necessary to refer to any ontological information during processing. An example is shown in Figure 3, where common words with the same spelling overlap across different languages. AOM is a competitive scheme in which the language group where the words appear most frequently acquires the overlapped words.

Figure 3. An illustration of overlapped words among different languages (e.g. 'en', 'die', 'data', 'pour', 'de', 'se', 'que', 'un', 'la' shared among German, English, French and Spanish).
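A minimal sketch of the winner-takes-all variant of AOM described above: each overlapped term is kept only by the class in which it occurs most often, and it is dropped from the losing classes. The data structures and function names are illustrative assumptions, not code from the paper.

```python
from collections import defaultdict

def attribute_overlap_minimization(term_counts_per_class):
    """term_counts_per_class: dict class -> dict term -> frequency.
    Returns a copy in which every term is kept only by the class
    where it occurs most frequently (winner takes all)."""
    # Find, for each term, the class with the highest frequency.
    best_class = {}
    for cls, counts in term_counts_per_class.items():
        for term, freq in counts.items():
            if term not in best_class or freq > best_class[term][1]:
                best_class[term] = (cls, freq)
    reduced = defaultdict(dict)
    for cls, counts in term_counts_per_class.items():
        for term, freq in counts.items():
            if best_class[term][0] == cls:   # keep only for the winner group
                reduced[cls][term] = freq
    return dict(reduced)

# Invented toy data: 'data' overlaps between English and Spanish.
counts = {"english": {"data": 7, "the": 9}, "spanish": {"data": 3, "la": 6}}
print(attribute_overlap_minimization(counts))
# {'english': {'data': 7, 'the': 9}, 'spanish': {'la': 6}}
```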
III. EXPERIMENT

In order to validate the feasibility of our proposed model, a text mining program is built in WEKA over two representative datasets, using a wide range of classification algorithms. We aim to study the performance of the classifiers together with the use of different dimensionality reduction methods. The training data, which are obtained from online websites, are unstructured in nature. After the conversion, the word vector grows to a size of 8135 attributes for maintaining frequency counts for each word in the documents. A combination of DR techniques is applied in our experiment. An outlier removal algorithm is used for trimming off data rows that have exceptionally different values from the norm. For reducing the number of attributes, a standard feature selection (FS) algorithm, Chi-Square, is used because of its popularity and generality, together with our novel approach, Attribute Overlap Minimization (AOM). Two training datasets are used in the experiment: one is a collection of sample sentences on topics related to data mining, retrieved from Wikipedia websites in four different languages – Spanish, French, English and German. The other is excerpted from the CNN news website, from news articles released over the ten days around New Year 2012. The news collection has a good mix of political happenings, important world events and lifestyle stories. One hundred sample news articles were obtained in total, and they were rated manually according to six basic human psychological emotions, namely Anger, Fear, Joy, Love, Sadness and Surprise. The data are formatted into ARFF format, with one news item/instance per row in the following structure: <emotion>, <"text of the news">, where the second field has a variable length. Similarly, for the language sample dataset, the structure is <language>, <"wiki page text">. HTML tags, punctuation marks and symbols are filtered out. The training datasets are then subjected to the above-mentioned dimensionality reduction methods to transform them into concise datasets in which the remaining attributes have substantial predictive power. Accuracy, which is a key performance indicator, is defined as the percentage of correctly classified instances over the total number of instances in the training dataset. Other indicators are the decision tree size, or the number of generated rules, which reflects the runtime memory requirement, and the time taken for training the model. By applying attribute reduction and data reduction, we can observe that the initial number of attributes is reduced greatly, from 8135 to 11. Having a concise and elite set of attributes is crucial in real-time applications, and in text mining of online news the number of attributes is proportional to the coverage of the news articles – the more unique words (vocabulary) that are covered, the greater the number of attributes.

TABLE I. PERFORMANCE OF DECISION TREE MODEL TESTED UNDER DIFFERENT TYPES OF DR METHODS APPLIED, LANGUAGE DATASET.
TABLE II. PERFORMANCE OF DECISION TREE MODEL TESTED UNDER DIFFERENT TYPES OF DR METHODS APPLIED, EMOTION DATASET.

In general, it can be seen from the above tables that the smallest tree size, the highest accuracy and a very short training time are obtained when the three DR methods are used together. The language dataset represents a scenario where the number of attributes is approximately 10 times larger than the number of instances, which is usual in text mining when a vector space is used.
The emotion dataset represents an extremely imbalanced case where the ratio of attributes to instances is greater than 80:1. It should be highlighted that, by applying the series FS+AOM+OE in the extreme case of the emotion dataset, the number of attributes was not cut to an extremely small number (50 instead of 11), which is still sufficient to characterize an emotion group, and the number of instances was not overly reduced (91 over 63), which is sufficient for training the model; yet the accuracy achieved is the highest possible. The experiment is then extended to evaluate the use of machine learning algorithms, with the benchmarking objective of achieving the highest accuracy. The selection of machine learning algorithms used in our experiment is by no means exhaustive, but it forms the basis of a performance comparison that should cover most of the popular algorithms. The machine learning algorithms are grouped into five main categories, Decision Tree, Rules, Bayes, Meta and Miscellaneous; all of them are known to be effective for data classification to certain extents. Three versions of inflected datasets were text-mined by the different classification algorithms in this experiment. They are the dataset with FS only, the transformed dataset with reduced attributes and overlapped attributes rearranged (by both FS and AOM), and the transformed dataset with both attributes reduced and outliers removed (FS+AOM+OE). The full performance results in terms of accuracy, tree/rule size and time taken are shown in Tables III, IV and V.

TABLE III. PERFORMANCE COMPARISON USING DIFFERENT CLASSIFIERS FOR LANGUAGE DATASET WITH FS TECHNIQUE ONLY.
TABLE IV. PERFORMANCE COMPARISON USING DIFFERENT CLASSIFIERS FOR LANGUAGE DATASET WITH FS AND AOM TECHNIQUES.
TABLE V. PERFORMANCE COMPARISON USING DIFFERENT CLASSIFIERS FOR LANGUAGE DATASET WITH FS+AOM+OE TECHNIQUES.

The experiments are then repeated with respect to accuracy only, graphically showing the effects of applying no technique at all, techniques that reduce the attributes, and techniques that reduce both attributes and instances. The results are displayed as scatter plots in Figure 4 and Figure 5 for the language dataset and the emotion dataset respectively.

Figure 4. Accuracy graph of classifiers over the language dataset.
Figure 5. Accuracy graph of classifiers over the emotion dataset.

From the charts, across which the accuracy values of the various classifiers are laid out, it can be observed that in general DR methods indeed yield a certain improvement. The improvement between the original dataset without any technique applied and the inflected datasets with DR techniques is very apparent in the emotion dataset, which represents a very large vector space. This means that for text mining applications that deal with a wide coverage of vocabulary, like online news, it is essential to apply DR techniques to maintain accuracy. In fact, the gain results in Table VI show a big leap in improvement between no DR applied and DR applied: 3.584684% vs. 101.3853% increases for the language and emotion datasets respectively. On a second note, the improvement gain between with and without outlier elimination is relatively higher for the language dataset (5.519077% > 3.584684%). This implies the importance of removing outliers, especially in a relatively small vector space. Of all the classifiers under test, decision-tree-type and Bayes-type classifiers outperform the rest.
This phenomenon is observed consistently over the different datasets and the different DR techniques used. All the classification algorithms yield improvement and survive model training with a dataset of high dimensionality, except Rotation Forest.

TABLE VI. % PERFORMANCE GAIN – (L) LANGUAGE, (R) EMOTION.

IV. CONCLUSION

Novel dimensionality reduction techniques for text mining, namely Attribute Overlap Minimization and Outlier Elimination, are introduced in this paper. Their performance is tested in empirical experiments to verify the advantage of the techniques. The results show that the techniques are effective, especially on a large vector space.

REFERENCES

[1] E. Leopold and J. Kindermann, "Text categorization with support vector machines: how to represent texts in input space?", Machine Learning, vol. 46, pp. 423-444, 2002.
[2] X. Qi and B. Davison, "Web Page Classification: Features and Algorithms", ACM Computing Surveys, vol. 41, no. 2, pp. 12-31, 2009.
[3] S. Fong, "Measuring Emotions from Online News and Evaluating Public Models from Netizens' Comments: A Text Mining Approach", Journal of Emerging Technologies in Web Intelligence, vol. 4, no. 1, pp. 60-66, 2012.
[4] P. Ponmuthuramalingam and T. Devi, "Effective Dimension Reduction Techniques for Text Documents", International Journal of Computer Science and Network Security, vol. 10, no. 7, pp. 101-109, 2010.
[5] M.F. Porter, "An Algorithm for Suffix Stripping", Program, vol. 14, no. 3, pp. 130-137, 1980.
[6] Y. Yang and J. O. Pedersen, "A comparative study on feature selection in text categorization", in Proceedings of the Fourteenth International Conference on Machine Learning, pp. 412-420, 1997.

New Metrics between Bodies of Evidences

Pascal Djiknavorian, Dominic Grenier
Laval University/Electrical and Computer Engineering, Quebec, Canada
Email: {djikna, Dominic.Grenier}@gel.ulaval.ca

Pierre Valin
Defence R&D Canada Valcartier/Decision Support Systems section, Quebec, Canada
Email: pierre.valin@drdc-rddc.gc.ca

Abstract—We address the computational difficulties caused by the heavy processing load required by the use of the Dempster-Shafer Theory (DST) in Information Retrieval. Specifically, we focus our efforts on the measure of performance known as the Jousselme distance between two basic probability assignments (or bodies of evidence). We first discuss the extension of the Jousselme distance from the DST to the Dezert-Smarandache Theory, a generalization of the DST. This is followed by an introduction to two new metrics we have developed: a Hamming-inspired metric for evidences, and a metric based on the degree of shared uncertainty. The performances of these metrics are compared with each other.

Index Terms—Dempster-Shafer, Measure of performance, Evidential Theory, Dezert-Smarandache, Distance

I. INTRODUCTION

Comparing two or more bodies of evidence (BOE) over a large frame of discernment, in the Dempster-Shafer theory of evidence [1, 2], may not always give intuitive choices from which we can simply pick the proposition with the largest basic probability assignment (BPA, or mass) or belief. A metric becomes very useful to analyze the behavior of a decision system in order to correct and enhance its performance. It is also useful when trying to evaluate the distance between two systems giving different BOEs, and it is helpful to determine whether a source of information regularly gives an answer that is far from the other sources, so that this faulty source can be down-weighted or discarded. Different approaches to deal with conflicting or unreliable sources are proposed in [3, 4, 5]. Although the Dempster-Shafer Theory (DST) has many advantages, such as its ability to deal with uncertainty and ignorance, it has the problem of quickly becoming computationally heavy, as it is an NP-hard problem [6]. To alleviate this computational burden, many approximation techniques of belief functions exist [7, 8, 9].
References [10, 11] show implementations and a comparative study of some approximation techniques. To be able to efficiently evaluate the various approximation techniques, one needs some form of metric. The Jousselme distance between two bodies of evidence [12] is one of them. However, there is a problem with this metric: it requires the computation of the cardinal of a given set, an operation which is computationally very costly within the DST. Alternatives to the Jousselme distance are thus needed; this is the objective of the research we present here.

A. The Dempster-Shafer Theory in Information Retrieval

The authors of [13] use the DST to combine visual and textual measures for ranking and choosing the best word to use as an annotation for an image. The DST is also used in the modeling of uncertainty in Information Retrieval (IR) applied to structured documents. We find in [14] that the use of the DST is due to: (i) its ability to represent leaf objects; (ii) its ability to capture uncertainty and the aggregation operator it provides, allowing the expression of uncertainty with respect to aggregated components; and (iii) the properties of the aggregation operator, which are compatible with those defined by the logical model developed by [15]. Extensible Markup Language (XML) IR, by contrast to traditional IR, deals with documents that contain structural markup which can be used as hints to assess the relevancy of individual elements instead of the whole document. Reference [16] presents how the DST can be used in the weighting of elements in the document. It is also used to express uncertainty and to combine evidence derived from different inferences, providing relevancy values for all elements of the XML document. Good mapping algorithms that perform efficient syntactic and semantic mappings between classes and their properties in different ontologies are often required for Question Answering systems. For that purpose, a multi-agent framework was proposed in [17]. In this framework, individual agents perform the mappings, and their beliefs are combined using the DST. In that system, the DST is used to deal with the uncertainty related to the use of different ontologies. The authors also use similarity assessment algorithms between concepts (words) and inherited hypernyms; once information is represented as BOEs, metrics between BOEs could be used to accomplish this. As shown in [18], the fundamental issues in IR are the selection of an appropriate scheme/model for document representation and query formulation, and the determination of a ranking function to express the relevance of a document to a query. The authors compare IR systems based on probability and belief theories, and note a series of advantages and disadvantages of using the DST in IR. Putting aside the issue of computational complexity, they come to the conclusion that the DST is the better option, thanks to its ability to deal with uncertainty and ignorance. The most significant differences between DST and probability theory are the explicit representation of uncertainty and the evidence combination mechanism. This can allow for more effective document processing [19].
It is also reported by [20] that the uncertainty occurring in IR can come from three sources regarding the relation of a document to a query: (i) the existence of different evidences; (ii) an unknown number of evidences; and (iii) the existence of incorrect evidences. There is thus a clear benefit to using a method that can better combine evidences and handle their uncertainty. Interested readers are encouraged to consult [21] for an extensive study of the use of the Dempster-Shafer Theory in Information Retrieval.

II. BACKGROUND

A. Dempster-Shafer Theory of Evidence

The Dempster-Shafer Theory (DST) has been in use for over 40 years [1, 2]. The theory of evidence, or DST, has been shown to be a good tool for representing and combining pieces of uncertain information. It offers a powerful approach to managing the uncertainties within the problem of target identity. DST requires no a priori information about the probability distribution of the hypotheses; it can also resolve conflicts and can assign a mathematical meaning to ignorance. However, traditional DST has the major inconvenience of being an NP-hard problem [6]. As various evidences are combined over time, Dempster-Shafer (DS) combination rules tend to generate more and more propositions (i.e. focal elements), which in turn have to be combined with new input evidences. Since this problem grows exponentially, the number of retained solutions must be limited by some approximation scheme, which truncates the number of such propositions in a coherent (but somewhat arbitrary) way.

Let Θ be the frame of discernment, i.e. the finite set of mutually exclusive and exhaustive hypotheses. The power set of Θ, noted 2^Θ, is the set of the subsets of Θ, where ∅ denotes the empty set.

1) Belief functions: Based on the information provided by sensor sources and known a priori information (i.e. a knowledge base), a new proposition is built. Then, based on this proposition, a Basic Probability Assignment (BPA, or mass function) is generated, taking into account some uncertainty or vagueness. Let us call m_new the new incoming BPA. The core of the fusion process is the combination of m_new and the BPA at the previous time, m_{t-1}. The resulting BPA at time t, m_t, is then the support for decision making. Using different criteria, the best candidate for identification is selected from the database. On the other hand, m_t must be combined with a new incoming BPA and thus becomes the m_{t-1} of the next step. However, this step must be preceded by a proposition management step, where the BPA is approximated. Indeed, since the combination process is based on intersections of sets, the number of focal elements increases exponentially and rapidly becomes unmanageable. This proposition management step is a crucial one, as it can influence the entire identification process.

The Basic Probability Assignment is a function m: 2^Θ → [0, 1] which satisfies the following conditions:

m(\emptyset) = 0   (1)
\sum_{A \subseteq \Theta} m(A) = 1   (2)

where m(A) is called the mass of A. It represents our confidence in the fact that "all we know is that the object belongs to A". In other words, m(A) is a measure of the belief attributed exactly to A, and to none of the subsets of A. The elements of 2^Θ that have a non-zero mass are called focal elements. Given a BPA m, two functions from 2^Θ to [0, 1] are defined: a belief function Bel and a plausibility function Pl, such that

Bel(A) = \sum_{B \subseteq A} m(B)   (3)
Pl(A) = \sum_{B \cap A \neq \emptyset} m(B)   (4)

It can also be stated that Pl(A) = 1 - Bel(Ā), where Ā is the complement of A; Bel(A) measures the total belief that the object is in A, whereas Pl(A) measures the total belief that can move into A.
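To make the definitions in equations (1)-(4) concrete, the following sketch evaluates belief and plausibility for a BPA whose focal elements are represented as Python frozensets. It is an illustrative toy, not the authors' MATLAB implementation.

```python
def belief(bpa, A):
    """Bel(A): total mass of focal elements B entirely contained in A (eq. 3)."""
    return sum(m for B, m in bpa.items() if B <= A)

def plausibility(bpa, A):
    """Pl(A): total mass of focal elements B intersecting A (eq. 4)."""
    return sum(m for B, m in bpa.items() if B & A)

# Toy frame of discernment and BPA (masses sum to 1, m(empty set) = 0).
bpa = {frozenset({"t1"}): 0.5,
       frozenset({"t1", "t2"}): 0.3,
       frozenset({"t1", "t2", "t3"}): 0.2}

A = frozenset({"t1"})
print(belief(bpa, A), plausibility(bpa, A))  # 0.5 and 1.0
```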
The functions m, Bel and Pl are in one-to-one correspondence, so it is equivalent to talk about any one of them or about the corresponding body of evidence.

2) Conflict definition: The conflict corresponds to the sum of all masses for which the set intersection yields the null set ∅. K is called the conflict factor and is defined as:

K = \sum_{A \cap B = \emptyset} m_1(A)\, m_2(B)   (5)

K measures the degree of conflict between m_1 and m_2: K = 0 corresponds to the absence of conflict, whereas K = 1 implies a complete contradiction between m_1 and m_2. Indeed, K = 0 if and only if no empty set is created when m_1 and m_2 are combined. On the other hand, we get K = 1 if and only if all the sets resulting from this combination are empty.

3) Dempster-Shafer Combination Formulae: In DST, a combined or "fused" mass is obtained by combining the previous m_1 (presumably the result of previous fusion steps) with a new m_2 to obtain a fused result as follows:

m_{12}(C) = \frac{1}{1-K} \sum_{A \cap B = C} m_1(A)\, m_2(B), \quad C \neq \emptyset   (6)
m_{12}(\emptyset) = 0   (7)

The renormalization step using the conflict K, corresponding to the sum of all masses for which the set intersection yields the null set, is a critical feature of the DS combination rule. Formulated as in equation (6), the DS combination rule is associative. Many alternative ways of redistributing the conflict lose this property. The associativity of the DS combination rule is critical when the timestamps of the sensor reports are unreliable, because an associative rule of combination is impervious to a change in the order of incoming reports. By contrast, other rules can be extremely sensitive to the order of combination.

B. Dezert-Smarandache Theory

The Dezert-Smarandache Theory (DSmT) [22, 23, 24] encompasses DST as a special case, namely when all intersections are null. Both the DST and the DSmT use the language of masses assigned to each declaration from a sensor. In DST, a declaration is a set made up of singletons of the frame of discernment Θ, and all sets that can be made from them through unions are allowed (this is referred to as the power set 2^Θ). In DSmT, all unions and intersections are allowed for a declaration, forming the much larger hyper power set D^Θ, whose size follows the Dedekind sequence. For a case of cardinality 3, Θ = {θ1, θ2, θ3}, the hyper power set, with its 19 elements built from θ1, θ2 and θ3 through unions and intersections, is still of manageable size. For larger cardinalities, the hyper power set makes computations prohibitively expensive (in CPU time). Table I illustrates the problem with the first few cardinalities of 2^Θ and D^Θ.

TABLE I. CARDINALITIES FOR DST AND DSMT
Cardinal of Θ | Cardinal of 2^Θ | Cardinal of D^Θ
2             | 4               | 5
3             | 8               | 19
4             | 16              | 167
5             | 32              | 7,580
6             | 64              | 7,828,353

1) Dezert-Smarandache Hybrid Combination Formulae: In DSmT, the hybrid rule [22, 23, 24] appropriate for constraints turns out to be much more complicated. The reader is referred to the series of books on DSmT [22, 23, 24] for lengthy descriptions of the meaning of this formula. A three-step approach is proposed in the second of these books, which is used in this work. From now on, the term "hybrid" will be dropped for simplicity.

C. Pignistic Transformation

1) Classical Pignistic Transformation: One of the most popular transformations is the pignistic transformation proposed by Smets [25] as a basis for decision in the evidential theory framework. The decision rule based on a BPA m is:

BetP(\theta_i) = \sum_{\theta_i \in A \subseteq \Theta} \frac{m(A)}{|A|}, \qquad \theta_{id} = \arg\max_{\theta_i \in \Theta} BetP(\theta_i)   (13)-(15)

with θ_id the identified object among the objects in Θ. This decision rule has the main advantage of taking into account the cardinality of the focal elements.

2) DSm Cardinal: The Dezert-Smarandache (DSm) cardinal [22, 23, 24] of a set A, noted C_M(A), accounts for the total number of partitions of A, including all intersection subsets. Each of these partitions possesses a numeric weight equal to 1, and thus they are all equal. The DSm cardinal is used in the generalized pignistic transformation to redistribute the mass of a set A among all its partitions B such that B ⊆ A.
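Returning to equations (5)-(7), the sketch below gives a direct implementation of the conflict factor and of the classical Dempster combination rule over focal elements represented as frozensets. It is an illustrative toy, not the authors' MATLAB system, and it does not attempt the DSmT hybrid rule.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Classical DS combination (eqs. (5)-(7)): conjunctive combination
    of two BPAs followed by renormalization by 1 - K."""
    combined, conflict = {}, 0.0
    for (A, mA), (B, mB) in product(m1.items(), m2.items()):
        C = A & B
        if C:
            combined[C] = combined.get(C, 0.0) + mA * mB
        else:
            conflict += mA * mB          # mass sent to the empty set (eq. 5)
    if conflict >= 1.0:
        raise ValueError("Total conflict: Dempster's rule is undefined")
    return {C: mass / (1.0 - conflict) for C, mass in combined.items()}, conflict

# Toy BPAs over the frame {t1, t2, t3}.
m1 = {frozenset({"t1"}): 0.6, frozenset({"t1", "t2"}): 0.4}
m2 = {frozenset({"t2"}): 0.3, frozenset({"t1", "t2", "t3"}): 0.7}
fused, K = dempster_combine(m1, m2)
print(K)       # conflict factor, here 0.18
print(fused)   # fused BPA, masses sum to 1
```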
3) Generalized Pignistic Transformation: The mathematical transformation that lets us go from a representation model of belief functions to a probabilistic model is called the generalized pignistic transformation [22, 23, 24]. The following equation defines the transformation operator:

BetP(A) = \sum_{X \in D^\Theta} \frac{C_M(X \cap A)}{C_M(X)}\, m(X)   (16)

D. Jousselme Distance between two BOEs

1) Similarity Properties: Diaz et al. [26] expect that a good similarity measure should respect six properties, stated as equations (17)-(22): monotonicity (the measure is increasing with the common part of the two sets and decreasing with their differing part), symmetry, exclusiveness, identity of indiscernibles, and normalization.

2) Jaccard Similarity Measure: The Jaccard similarity measure [27] is a statistic used for comparing the similarity and diversity of sample sets. It was originally created for species similarity evaluation:

J(A, B) = \frac{|A \cap B|}{|A \cup B|}   (23)

3) Distance Properties: A distance function, also called a distance metric, on a set of points is a function with four properties [28, 29]:

d(x, y) \geq 0 \quad (non-negativity)   (24)
d(x, y) = 0 \iff x = y \quad (identity of indiscernibles)   (25)
d(x, y) = d(y, x) \quad (symmetry)   (26)
d(x, z) \leq d(x, y) + d(y, z) \quad (triangle inequality)   (27)

Some authors also require that the set of points be non-empty.

4) Jousselme Distance: To analyze the performance of an approximation algorithm, to compare its proximity to the non-approximated version, or to analyze the performance of the DS fusion algorithm by comparing its proximity to the ground truth if available, the Jousselme distance measure can be used [12]. The Jousselme distance is a Euclidean distance between two BPAs. Let m_1 and m_2 be two BPAs defined on the same frame of discernment Θ; the distance between m_1 and m_2 is defined as:

d_J(m_1, m_2) = \sqrt{\tfrac{1}{2}\,(m_1 - m_2)^T D\,(m_1 - m_2)}   (28)
D(A, B) = \frac{|A \cap B|}{|A \cup B|}   (29)

where D is the Jaccard similarity measure between focal sets A and B.

III. NEW METRICS

A. Extension of the Jousselme distance to the DSmT

The Jousselme distance as defined originally in [12] can work without major changes within the DSm framework. The user simply has to use two BPAs defined over the DSm theory instead of BPAs defined within the DS theory. Boundaries, size, and thus the amount of computation will of course increase, but otherwise there is no counter-indication to using this distance in DSmT. We thus keep equation (28) as the definition of the Jousselme distance within DSmT, with the DSm cardinal as the cardinality. Tables II and III show the bodies of evidence and their distances to one another. The example was realized with a discernment frame of size three, so that the cardinal of its hyper power set is 19 for the free model, as defined by Dezert and Smarandache [22]. Table II is divided into three sections, each of which represents the data for one BOE; the three columns give the focal sets, the associated BPA value, and the cardinal of that set.

TABLE II. FIRST SERIES OF THREE BODIES OF EVIDENCE.

Pairwise computation between the different pairs of BOEs took quite some time because of all the calculations required by the Jousselme distance of evidences. The results are shown in Table III. The proof that all the properties are respected has already been given for the DST in [12].

TABLE III. EXTENDED JOUSSELME DISTANCE RESULTS.
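As an illustration of equations (28)-(29), the sketch below computes the Jousselme distance between two BPAs whose focal elements are frozensets, using the Jaccard similarity as the weighting matrix D. It is a toy Python illustration (the original study used MATLAB) and works for DST-style focal sets; a DSmT version would replace the set cardinalities with the DSm cardinal.

```python
import math

def jaccard(A, B):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two focal sets (eq. 29)."""
    return len(A & B) / len(A | B)

def jousselme_distance(m1, m2):
    """d(m1, m2) = sqrt(0.5 * (m1 - m2)^T D (m1 - m2)), D(A, B) = Jaccard(A, B)."""
    focal_sets = sorted(set(m1) | set(m2), key=sorted)
    diff = [m1.get(A, 0.0) - m2.get(A, 0.0) for A in focal_sets]
    quad = sum(diff[i] * diff[j] * jaccard(A, B)
               for i, A in enumerate(focal_sets)
               for j, B in enumerate(focal_sets))
    return math.sqrt(0.5 * max(quad, 0.0))

m1 = {frozenset({"t1"}): 0.7, frozenset({"t1", "t2"}): 0.3}
m2 = {frozenset({"t2"}): 0.4, frozenset({"t1", "t2", "t3"}): 0.6}
print(jousselme_distance(m1, m2))
```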
The difference with the original version of the distance presented in [12] is the allowed presence of intersections, which creates the hyper power set out of the power set. This difference adds further computation to obtain the distance value. More specifically, the cardinal evaluation part of the Jousselme distance is worsened by the increase in size of the hyper power set compared to the power set.

B. Hamming-inspired metric on evidences

1) Continuous XOR mathematical operator: In [30], Weisstein defines the standard OR operator as a connective in logic which yields true if any one of a sequence of conditions is true, and false if all conditions are false. In [31], Germundsson and Weisstein define the standard XOR logical operator as a connective in logic known as the exclusive OR or exclusive disjunction. It yields true if exactly one, but not both, of two conditions is true. This operator is typically described as the symmetric difference in set theory [32]. As such, the authors define it as the union of the complement of A with respect to B and of B with respect to A. Figure 1 is a Venn diagram displaying the binary XOR operator on discrete numerical values.

Figure 1. Venn diagram displaying the binary XOR operator on discrete numerical values.
Figure 2. Venn diagram displaying the continuous XOR operator.

Starting with the standard XOR logical operator and inspired by the Hamming distance [33], which implicitly uses a symmetric difference, we develop the idea of a continuous XOR operator. Figure 2 shows a simple case similar to that of the previous figure but using values from [0, 1]. We can see that it works as an absolute value of the difference applied to each partition of the Venn diagrams, one to another.

2) Metric between evidences based on the Hamming distance principle: The Hamming distance [33] between two strings is the minimum number of substitutions required to change one string into the other. In other words, it is defined by the sum of the absolute values of the differences. From this, with the DSm cardinal [22], and using the continuous XOR mathematical operator, we have developed a new distance, the Hamming Distance of Evidences (HDE). This distance is bounded within normal values, such that 0 ≤ d_HDE ≤ 1. This new distance also respects the properties of equations (24)-(27): non-negativity, identity of indiscernibles, symmetry, and the triangle inequality. The HDE is defined as in equation (30), which uses the super-power-set form of the BPA defined in equation (31), where S^Θ is the super-power set made of the disjoint parts of the Venn diagram of Θ.

d_{HDE}(m_1, m_2) = \tfrac{1}{2} \sum_{X \in S^\Theta} \left| m_1^{S}(X) - m_2^{S}(X) \right|   (30)
m^{S}(X) = \sum_{A \in D^\Theta,\; X \subseteq A} \frac{m(A)}{C_M(A)}, \quad X \in S^\Theta   (31)
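A simplified sketch of the HDE follows. It assumes the two BPAs have already been re-expressed over a common set of disjoint parts (the super-power-set refinement of equation (31)); under that assumption the HDE reduces to half the sum of the absolute mass differences, as stated in equation (30). The part labels are invented for illustration.

```python
def hamming_distance_of_evidences(m1_parts, m2_parts):
    """HDE over BPAs already re-expressed on disjoint parts of the
    super-power set: half the sum of absolute mass differences (eq. 30)."""
    parts = set(m1_parts) | set(m2_parts)
    return 0.5 * sum(abs(m1_parts.get(p, 0.0) - m2_parts.get(p, 0.0)) for p in parts)

# Toy example: masses already distributed over three disjoint parts p1, p2, p3.
m1 = {"p1": 0.6, "p2": 0.4}
m2 = {"p1": 0.2, "p2": 0.4, "p3": 0.4}
print(hamming_distance_of_evidences(m1, m2))  # 0.4, bounded within [0, 1]
```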
The HDE uses the BPA mass distributed among the different parts (sets) of S^Θ that compose the BPA from D^Θ; this transition from D^Θ to S^Θ is done using equation (31). Using the super-power-set version of the BPA gives us a more refined and precise definition of it. Once in the super-power-set framework, we use an adaptation of the Hamming distance, i.e. the continuous XOR operation defined previously. Its implementation is most easily understood as the sum of the absolute differences between the BPAs in S^Θ, divided by 2. (This is equivalent to the symmetric difference expression used to define the XOR operator in the literature [32].) For the BOEs defined in Table II in the previous section, without any constrained set, we get the results given in Table IV. We can then easily compare relative distances against a reliable point of reference; the Jousselme distance is considered to be our distance of reference.

TABLE IV. HAMMING DISTANCE ON EVIDENCES RESULTS.

C. Metric using a degree of shared uncertainty

1) Similarity coefficient of degree of shared uncertainty: The idea behind a similarity coefficient of degree of shared uncertainty is to quantify the degree of shared uncertainty that lies behind a pair of sets. We want to avoid the use of cardinal operators, so we conceived a decision-tree test which evaluates the degree of shared uncertainty. The following equation gives the coefficient of similarity between a pair of sets when using the metric we suggest:

s(A, B) = \begin{cases} 3 & \text{if } A = B \\ 2 & \text{if } A \subset B \text{ or } B \subset A \\ 1 & \text{if } A \cap B \neq \emptyset \text{ and neither set includes the other} \\ 0 & \text{if } A \cap B = \emptyset \end{cases}   (32)

Equation (32) gives a coefficient value of 3 when the pair of sets is equal; the value 2 when one of the sets is included in the other; and 1 when the sets have a non-empty intersection but neither is included in the other nor equal to it. Finally, the coefficient has a value of 0 when the intersection between the pair of sets is the empty set. The maximum value of the coefficient of similarity between sets A and B is therefore 3.

2) Metric between evidences based on a degree of shared uncertainty: From the similarity coefficient of degree of shared uncertainty defined above, we get the following distance, noted d_DSU and defined in equation (33). In that equation, the factor η is a normalization factor required to bound the distance. The summation goes over the matrix of every possible pair of sets taken from the focal elements of the two BOEs:

d(m_1, m_2) = 1 - \frac{1}{\eta} \sum_{A \in F_1} \sum_{B \in F_2} s(A, B)   (33)

Even if we consider (33) as the distance using the similarity coefficient, we might want to consider building one that uses only a triangular matrix out of the matrix domain of the summation. However, since commutativity is a built-in property, that measure would contain some useless redundancy. Equation (33) can be expressed in the simple form 1 - s, where s is a similarity factor. Since distances use dissimilarity factors (so that a distance of 0 means that the two BOEs are identical), the subtraction from 1 is required. However, the idea of a distance solely based on equation (33) is not enough. One should weight the similarities with the mass values from the BPAs in order to really represent the distance between bodies of evidence and not only between combinations of sets. We propose (34) as a final equation for that reason:

d_{DSU}(m_1, m_2) = 1 - \frac{1}{\eta} \sum_{A \in F_1} \sum_{B \in F_2} s(A, B)\, m_1(A)\, m_2(B)   (34)

Table V uses a simple case to show the inner workings of this method. The first matrix shown in the table is a computation matrix with the degree of shared uncertainty, defined in (32), and the product of the masses of the pair of sets. The second matrix gives the weighted similarity values. Finally, the last part of Table V indicates the sum of the values from the previous matrix, i.e. the value of the sum in equation (34), the normalization factor, and finally the Distance of Shared Uncertainty (DSU). This distance could be qualified as discrete in the sense that not all values will be possible for the DSU in a given distance measurement; however, that is true only for fixed values of the BPA.

TABLE V. SIMPLE CASE OF METRIC BASED ON SHARED UNCERTAINTY DEGREE.
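The decision-tree test of equation (32) translates directly into code. The sketch below is an illustrative Python version of the coefficient only; weighting each coefficient by the product of the masses of the pair of sets, as described for Table V, is then a straightforward extra multiplication.

```python
def shared_uncertainty_coefficient(A, B):
    """Similarity coefficient of degree of shared uncertainty (eq. 32):
    3 for equal sets, 2 when one set includes the other, 1 for a partial
    overlap, 0 for disjoint sets. No cardinality computation is needed."""
    if A == B:
        return 3
    if A <= B or B <= A:
        return 2
    if A & B:
        return 1
    return 0

A, B = frozenset({"t1", "t2"}), frozenset({"t2", "t3"})
print(shared_uncertainty_coefficient(A, B))   # 1: overlapping, no inclusion
print(shared_uncertainty_coefficient(A, A))   # 3: identical sets
```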
Since BPA values are continuous in [0, 1], the DSU can nonetheless take values from a continuous range. Table VI shows the results of the metric based on the degree of shared uncertainty, measured on the same BOEs described in Table II and previously used in Table III for the Jousselme distance and Table IV for the Hamming distance on evidences.

TABLE VI. METRIC BASED ON SHARED UNCERTAINTY DEGREE RESULTS.

IV. EXAMPLES AND PERFORMANCES

This section explores the metrics presented in the previous section. These metrics will be used as distance measurements. We have implemented a DST/DSmT combination system within Matlab; the details explaining how DSmT was implemented appear in [34, 35]. Functions have been added to that system for the computation of the various metrics.

A. A few simple examples

1) Exploration case 1: Using the same bodies of evidence as presented in Table II, we obtained the distances and execution times, in seconds, given in Table VII for the same inputs given to the three distances presented previously: the Jousselme distance, the HDE and the DSU. Based only on this data, it is difficult to choose which metric is best. However, we can already see, as expected, that the Jousselme distance would be difficult to use in real-time complex cases due to the computation time it requires.

TABLE VII. DISTANCE AND TIME OF EXECUTION VALUES FOR CASE 1 (distance and execution time for each of the Jousselme distance, HDE and DSU).

2) Exploration case 2: This case further explores the behavior of the distance metrics. We use two bodies of evidence. The first BOE is held fixed. For the second BOE, we successively increment the mass of one focal element nine times, reducing the mass of the second focal element by the same amount. The results of this exploration case are given in Table VIII. We can notice from that table that the DSU is not able to correctly account for the mass distributions; obviously, this is an undesirable behavior occurring for a pair of BOEs with identical sets. We can also see that the HDE and the Jousselme distance respond in a symmetric manner to the symmetric mass distribution around equal BOEs; in other words, symmetric steps give equal values, as they should. For the step where the two BOEs are equal, all metrics give the proper distance of zero.

TABLE IX. EXECUTION TIME VALUES FOR CASE 2.

Table IX shows that both the HDE and the DSU demonstrate a clear advantage over the Jousselme distance in terms of execution times.

3) Exploration case 3: Figure 3 shows the 7 possible partitions of a size-3 case.

Figure 3. Venn diagram with the 7 partitions of a size-3 case.

This case proceeds a little differently from the previous two. Instead of keeping identical BOEs with varying masses, the BOEs themselves are now varied; third and fourth focal elements are introduced in some of the BOEs for that purpose. The first BOE is always the same, and the BOEs used as the second one in the pairwise distances are labelled A through G. The results of this case are given in Table X. As expected, we can observe an increase in distance variation for several pairs of cases; the notation used signifies that the observed distance variation going from case X to case Y is increasing. Cases F and G are particularly interesting: the difference between F and G lies in which focal element receives the transferred mass, a disjunction in G and mainly an intersection in F.
The DSU metric gives the same value for case F as for case G, while all the other metrics give smaller values for case G than for case F. Similar conclusions are obtained when comparing the metrics for the pairs of cases (A, C) and (B, D): for a similar mass redistribution, giving the mass to a disjunction results in a smaller distance than distributing it to an intersection.

3) Exploration conclusions: In general, it is better for identical sets to have the lowest distance. Otherwise, a minimal number of sets will minimize the distribution of mass onto unshared partitions. With no identical partitions in common, it is preferable to have a higher mass on disjunctive sets which have more common partitions. It is also better to have disjunctive sets that are as specific as possible, in other words of lowest cardinality. Hence, giving too much mass to a set that has too many partitions uncommon with the targeted ID or ground truth must be avoided. To obtain smaller distance values, the masses of a BOE need to be distributed on sets that share a higher ratio of common partitions with the other BOE. Finally, the use of either the Jousselme distance (adapted to DSmT) or the HDE, which is much quicker, is recommended.

TABLE VIII. DISTANCE VALUES FOR CASE 2.
TABLE X. DISTANCE VALUES FOR CASE 3.

V. CONCLUSIONS

This paper introduced two new distances between evidences, for both the Dempster-Shafer Theory and the Dezert-Smarandache Theory, to replace the Jousselme distance. When the size of the discernment frame gets large, the Jousselme distance calculation becomes too heavy to handle in a reasonable amount of time; in time-critical systems, it would be better to use the Hamming distance of evidences. For the distance using the degree of shared uncertainty, the DSU, further studies must be done: a correction may be required so that it properly takes the masses into account when facing identical bodies of evidence. Future work would include the use of DSmT [22, 23, 24] and its hierarchical information representation abilities, in conjunction with approximation-of-belief-functions algorithms, in Information Retrieval.

ACKNOWLEDGMENT

The authors wish to thank the reviewers for their comments. This work was carried out as part of Pascal Djiknavorian's doctoral research at Université Laval. Pascal Djiknavorian's studies were partly funded by RDDC.

REFERENCES

[1] G. Shafer, A Mathematical Theory of Evidence. Princeton University Press, Princeton, NJ, USA, 1976.
[2] A. Dempster, "Upper and lower probabilities induced by multivalued mapping", The Annals of Mathematical Statistics, vol. 38, pp. 325-339, 1967.
[3] M. C. Florea and E. Bosse, "Dempster-Shafer Theory: combination of information using contextual knowledge", in Proceedings of the 12th International Conference on Information Fusion, Seattle, WA, USA, July 6-9, 2009, pp. 522-528.
[4] S. Le Hegarat-Mascle, I. Bloch, and D. Vidal-Madjar, "Application of Dempster-Shafer evidence theory to unsupervised classification in multisource remote sensing", IEEE Transactions on Geoscience and Remote Sensing, vol. 35, no. 4, pp. 1018-1031, August 1997.
[5] J. Klein and O. Colot, "Automatic discounting rate computation using a dissent criterion", in Proceedings of the Workshop on the Theory of Belief Functions, Brest, France, April 1-2, 2010.
[6] P. Orponen, "Dempster's rule of combination is #P-complete", Artificial Intelligence, vol. 44, no. 1-2, pp. 245-253, 1990.
[7] B. Tessem, "Approximations for efficient computation in the theory of evidence", Artificial Intelligence, vol. 61, pp. 315-329, June 1993.
[8] M. Bauer, "Approximation Algorithms and Decision Making in the Dempster-Shafer Theory of Evidence — An Empirical Study", International Journal of Approximate Reasoning, vol. 17, no. 2-3, pp. 217-237, 1997.
[9] D. Boily and P. Valin, "Truncated Dempster-Shafer Optimization and Benchmarking", in Proceedings of Sensor Fusion: Architectures, Algorithms, and Applications IV, SPIE Aerosense 2000, Orlando, Florida, April 24-28, 2000, vol. 4051, pp. 237-246.
[10] P. Djiknavorian, P. Valin and D. Grenier, "Approximations of belief functions for fusion of ESM reports within the DSm framework", in Proceedings of the 13th International Conference on Information Fusion, Edinburgh, UK, 2010.
[11] P. Djiknavorian, A. Martin, P. Valin and D. Grenier, "Étude comparative d'approximation de fonctions de croyances généralisées / Comparative Study of Approximations of Generalized Belief Functions", in Proceedings of Logique Floue et ses Applications (LFA 2010), Lannion, France, November 2010.
[12] A.-L. Jousselme, D. Grenier, and E. Bosse, "A new distance between two bodies of evidence", Information Fusion, vol. 2, no. 2, pp. 91-101, June 2001.
[13] X. Rui, N. Yu, T. Wang, and M. Li, "A Search-Based Web Image Annotation Method", in Proceedings of the IEEE International Conference on Multimedia and Expo, 2007, pp. 655-658.
[14] M. Lalmas, "Dempster-Shafer's theory of evidence applied to structured documents: modelling uncertainty", in Proceedings of the 20th Annual International ACM SIGIR Conference, Philadelphia, PA, USA, pp. 110-119, 1997. DOI: 10.1145/258525.258546
[15] Y. Chiaramella, P. Mulhem and F. Fourel, "A Model for Multimedia Information Retrieval", Technical Report, Basic Research Action FERMI 8134, 1996.
[16] F. Raja, M. Rahgozar, and F. Oroumchian, "Using Dempster-Shafer Theory in XML Information Retrieval", Proceedings of World Academy of Science, Engineering and Technology, vol. 14, August 2006.
[17] M. Nagy, M. Vargas-Vera and E. Motta, "Uncertain Reasoning in Multi-agent Ontology Mapping on the Semantic Web", in Proceedings of the Sixth Mexican International Conference on Artificial Intelligence – Special Session, MICAI 2007, November 4-10, 2007, pp. 221-230. DOI: 10.1109/MICAI.2007.11
[18] K.R. Chowdhary and V.S. Bansal, "Information Retrieval using probability and belief theory", in Proceedings of the 2011 International Conference on Emerging Trends in Networks and Computer Communications (ETNCC), 2011, pp. 188-191. DOI: 10.1109/ETNCC.2011.5958513
[19] A. Verikas, A. Lipnickas, K. Malmqvist, M. Bacauskiene, and A. Gelzinis, "Soft combination of neural classifiers: A comparative study", Pattern Recognition Letters, vol. 20, pp. 429-444, 1999.
[20] A.M. Fard and H. Kamyar, Intelligent Agent based Grid Data Mining using Game Theory and Soft Computing, Bachelor of Science Thesis, Ferdowsi University of Mashhad, September 2007.
[21] I. Ruthven and M. Lalmas, "Using Dempster-Shafer's Theory of Evidence to Combine Aspects of Information Use", Journal of Intelligent Information Systems, vol. 19, no. 3, pp. 267-301, 2002.
[22] F. Smarandache and J. Dezert (eds.), Advances and Applications of DSmT for Information Fusion, vol. 1, American Research Press, 2004.
[23] F. Smarandache and J. Dezert (eds.), Advances and Applications of DSmT for Information Fusion, vol. 2, American Research Press, 2006.
[24] F. Smarandache and J. Dezert (eds.), Advances and Applications of DSmT for Information Fusion, vol. 3, American Research Press, 2009.
[25] Ph. Smets, "Data fusion in the Transferable Belief Model", in Proceedings of the 3rd International Conference on Information Fusion (FUSION 2000), Paris, July 10-13, 2000, pp. PS21-PS33.
[26] J. Diaz, M. Rifqi, and B. Bouchon-Meunier, "A Similarity Measure between Basic Belief Assignments", in Proceedings of the 9th International Conference on Information Fusion, Florence, Italy, July 10-13, 2006.
[27] P. Jaccard, "Étude comparative de la distribution florale dans une portion des Alpes et des Jura", Bulletin de la Société Vaudoise des Sciences Naturelles, vol. 37, pp. 547-579, 1901.
[28] E.W. Weisstein, "Distance", From MathWorld — A Wolfram Web Resource, July 2012. http://mathworld.wolfram.com/Distance.html
[29] M. Fréchet, "Sur quelques points du calcul fonctionnel", Rendiconti del Circolo Matematico di Palermo, vol. 22, pp. 1-74, 1906.
[30] E.W. Weisstein, "OR", From MathWorld — A Wolfram Web Resource, July 2012. http://mathworld.wolfram.com/OR.html
[31] R. Germundsson and E.W. Weisstein, "XOR", From MathWorld — A Wolfram Web Resource, July 2012. http://mathworld.wolfram.com/XOR.html
[32] E.W. Weisstein, "Symmetric Difference", From MathWorld — A Wolfram Web Resource, July 2012. http://mathworld.wolfram.com/SymmetricDifference.html
[33] R.W. Hamming, "Error detecting and error correcting codes", Bell System Technical Journal, vol. 29, no. 2, pp. 147-160, 1950.
[34] P. Djiknavorian and D. Grenier, "Reducing DSmT hybrid rule complexity through optimization of the calculation algorithm", in Advances and Applications of DSmT for Information Fusion, F. Smarandache and J. Dezert (eds.), American Research Press, 2006.
[35] P. Djiknavorian, "Fusion d'informations dans un cadre de raisonnement de Dezert-Smarandache appliquée sur des rapports de capteurs ESM sous le STANAG 1241", Master's Thesis, Université Laval, 2008.

Pascal Djiknavorian received a B.Eng. in computer engineering and a certificate in business administration in 2005 from Laval University. He also completed, in 2008, an M.Sc. in electrical engineering on information fusion within the Dezert-Smarandache theory framework applied to ESM reports under STANAG 1241. He is currently a Ph.D. student in information fusion at Laval University, supervised by Professor Dominic Grenier and Professor Pierre Valin. He has a dozen publications as book chapters, journal articles and conference papers. His research interests include evidential theory, Dezert-Smarandache theory, approximation algorithms, and optimization methods. Mr. Djiknavorian is a graduate student member of the IEEE.

Dominic Grenier received the M.Sc. and Ph.D. degrees in electrical engineering in 1985 and 1989, respectively, from Université Laval, Quebec City, Canada. From 1989 to 1990, he was a Postdoctoral Fellow in the radar division of the Defense Research Establishment in Ottawa (DREO), Canada. In 1990, he joined the Department of Electrical Engineering at Université Laval, where he has been a Full Professor since 2000. He was also co-editor of the Canadian Journal of Electrical and Computer Engineering for 6 years. Recognized by the undergraduate students in electrical and computer engineering at Université Laval as the electromagnetism and RF specialist, his excellence in teaching has resulted in his being awarded the "Best Teacher Award" many times.
In 2009, he obtained a special fellowship from the Quebec Minister of Education. His research interests include inverse synthetic aperture radar imaging, signal array processing for high-resolution direction of arrival, and data fusion for identification. Prof. Grenier has 32 publications in refereed journals and 75 more in conference proceedings. In addition, 33 graduate students have completed their theses under his direction since 1992. Prof. Grenier is a registered professional engineer in the Province of Quebec (OIQ), Canada.

Pierre Valin received a B.Sc. in honours physics (1972) and an M.Sc. degree (1974) from McGill University, then a Ph.D. in theoretical high energy physics from Harvard University (1980), under the supervision of the 1979 Nobel Laureate Dr. Sheldon Glashow. He was a faculty lecturer at McGill and an Associate Professor of Physics in New Brunswick, at Moncton and Fredericton, before joining Lockheed Martin Canada (then called Paramax) in 1993 as a Principal Member of R&D. In 2004, he became a defence scientist at Defence R&D Canada (DRDC) at Valcartier, where he currently leads a research group in Future C2 Concepts & Structures. He has been thrust leader for Air Command at DRDC since 2007. He has been particularly active in the International Society of Information Fusion (ISIF) through the organization of FUSION 2001 and 2007. He has been an ISIF board member since 2003 and VP Membership since 2004, and was president in 2006. He is also an associate editor for the Journal of Advances in Information Fusion (JAIF). Dr. Valin's interests focus mainly on the following topics: Multi-Sensor Data Fusion (MSDF) requirements and design, C2 systems, algorithmic benchmarking, use of a priori information databases, imagery classifiers (EO/IR and SAR) and their fusion, neural networks, fuzzy logic, information processing and uncertainty representation, reasoning techniques for recognition and identification (Bayes, Dempster-Shafer, Dezert-Smarandache), SAR image processing, Network Centric Warfare, distributed information fusion, dynamic resource management, as well as theoretical and mathematical physics.

Bringing location to IP Addresses with IP Geolocation
Jamie Taylor, Joseph Devlin, Kevin Curran
School of Computing and Intelligent Systems, University of Ulster, Magee Campus, Northland Road, Northern Ireland, UK
Email: kj.curran@ulster.ac.uk

Abstract - IP Geolocation allows us to assign a geographical location to an IP address, building up a picture of the person behind that IP address. This can have many potential benefits for business and other types of application. Because the IP address of a device is unique to that device, its location can be narrowed down from the continent to the country and, in some cases, even to the street address of the device. This method of tracking can produce very broad results, and sometimes an accurate result can only be obtained with some input from the user about their location. In some countries, laws are in place stating that a service can only track a user as far as their country without consent. If the user consents, the service can consult the ISP's logs and locate the user as accurately as possible. The ability to determine the exact location of a person connecting over the Internet can not only lead to innovative location-based services but can also dramatically optimise the shipment of data from end to end.
In this paper we will look at applications and methodologies (both traditional and more recent) for IP Geolocation.

I. INTRODUCTION

IP Geolocation is the process of obtaining the geographical location of an individual or party starting out with nothing more than an IP address [1]. The uses (both current and potential) of IP Geolocation are many. Already, the technology is being used in advertising, sales and security. Geolocation is the identification of the real-world geographic location of an Internet-connected computer, mobile device, website visitor or other endpoint. IP address Geolocation data can include information such as country, region, city, postal/zip code, latitude, longitude and time zone [1]. Geolocation may refer to the practice of assessing the location, to the actual assessed location, or to locational data. Geolocation is increasingly being implemented to ensure web users around the world are successfully navigated to content that has been localised for them. Due to the '.com' dilemma, most companies are finding that more than half of the visitors to their global (.com) home pages are based outside of their home markets. The majority of these users do not find the country site that has been developed for them. Companies such as Amazon have introduced geolocation as a method of dealing with this problem [2].

There are organisations that are responsible for allocating IP addresses. The Internet Assigned Numbers Authority (IANA) is responsible for allocating large blocks of IP addresses to the following five Regional Internet Registries (RIRs) that serve specific regions of the world: AfriNIC (Africa), APNIC (Asia/Pacific), ARIN (North America), LACNIC (Latin America) and RIPE NCC (Europe, the Middle East and Central Asia). These RIRs then allocate blocks of IP addresses to Internet Service Providers (ISPs), which in turn allocate IP addresses to businesses, organizations and individual consumers. Using the above information, IP addresses can be broken down into geographical locations within a few steps, but to get a more accurate result than that the user may have to provide additional details to aid the process. In some cases, this practice becomes more efficient the more it is used: a user's location can be estimated by closely matching their IP address with a neighbouring IP address that has already been located. Many businesses have been started up just by hosting large databases of IP addresses to allow services to apply this technology with varying degrees of efficiency and accuracy. There are many methods of tracking a device, such as GPS and cell phone triangulation, but IP Geolocation, the least accurate, is becoming popular among website owners and government bodies alike.

The foundation for geolocation is the Internet Protocol (IP) address, a numeric string assigned to every device attached to the Internet. When you surf the web, your computer sends out this IP address to every website you visit. IP addresses are not like mailing addresses. That is, most are not fixed to a specific geographic location. And knowing that a particular ISP (Internet Service Provider) is based in a particular city is no guarantee that you'll know where its customers are located [3]. That is where geolocation service providers come in. Geolocation service providers build massive databases that link each IP address to a specific location. Some geolocation databases are available for sale, and some can also be searched for free online.
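To make the database-driven lookup concrete, the following is a minimal sketch of how such a table of address blocks mapped to locations can be searched for the block containing a given IP. The sample prefixes and location records are invented for illustration only; real providers maintain millions of such entries and far richer fields.

```python
# A minimal sketch of a geolocation-database lookup: a table of address
# blocks mapped to locations, searched for the block containing a given IP.
# The prefixes and locations below are invented for illustration only.
import ipaddress

GEO_DB = [
    ("203.0.113.0/24",  {"country": "GB", "region": "Northern Ireland", "city": "Derry"}),
    ("198.51.100.0/24", {"country": "US", "region": "New York",         "city": "New York"}),
    ("192.0.2.0/24",    {"country": "FR", "region": "Ile-de-France",    "city": "Paris"}),
]

def geolocate(ip_string):
    """Return the location record of the first block containing the IP, else None."""
    ip = ipaddress.ip_address(ip_string)
    for prefix, location in GEO_DB:
        if ip in ipaddress.ip_network(prefix):
            return location
    return None

print(geolocate("203.0.113.57"))   # {'country': 'GB', ...}
print(geolocate("8.8.8.8"))        # None: block not present in this toy table
```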
As the IP system is in a constant state of flux, many providers update their databases on a daily or weekly basis. Some geolocation vendors report a 5-10% change in IP address locations each week. Geolocation can provide much more than a geographic location. Many geolocation providers supply up to 30 data fields for each IP address that can help to further determine whether users really are where they say they are. These may include country, region, state, city, ZIP code and area code; latitude/longitude; time zone; network connection type; and domain name and type (i.e. .com or .edu). Not every IP address accurately represents the location of the web user. For example, some multinational companies route Internet traffic from their many international offices through a few IP addresses, which may create the impression that some Internet users are in, say, the UK when they are actually based in France. If someone is using a dial-up connection from Ireland back to their ISP in France, it will appear as if they are in France. There are also proxy services that allow web users to cloak their identities online; a few geolocation providers, however, have introduced technology that can look past these proxy servers to determine the user's true location. In addition, some providers can now locate, down to a city-street level, people connecting to the Internet via mobile phones or public Wi-Fi networks. This is accomplished through cell tower and Wi-Fi access point triangulation [4]. Here, we will be looking at this technology in more detail and at what it could mean for us and our lives going forward.

This paper is structured as follows. In Section 2, we look in more depth at the applications of IP Geolocation (both current and potential). Section 3 then presents a number of IP Geolocation methods, starting with more 'traditional' methods before progressing to those more recent and 'hybrid' in nature. In Section 4, we outline some methods for avoiding IP geolocation, and we conclude our discussion in Section 5.

II. IP GEOLOCATION USAGE

Localization is the process of adapting a product or service to target a specific group of users. These changes can include the look and feel of the product, the language and even fundamental changes in how the service or product works. Many global organisations would like to be able to tailor the experience of a website to the types of users viewing it, as it can have a significant impact on whether or not a user will use the service. The ability to gather useful metrics increases when you add in the fact that you can tell where your customers are from. Take, for example, Google. Google provides localized versions of its search engine to almost every country in the world. Using Geolocation, it can select the correct language for each user and alter their search results to reflect more accurately what the user is actually searching for. If Google so chose, it could even start to omit certain results to comply with national laws. Google ads use this feature heavily by making sure that local businesses can reach people in their area so as to increase the impact of the advertising. This localization of websites is becoming increasingly popular, and Geolocation is a tool that makes it easy to find out which version of a website to show.
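A minimal sketch of geolocation-driven localization as just described: the visitor's country code (obtained from a geolocation lookup such as the one sketched earlier) is mapped to a localized site version and language. The mapping table and the choose_site_version() helper are hypothetical, shown only to make the selection step concrete.

```python
# A minimal sketch of choosing a localized site version from a visitor's
# country code. The table and helper below are hypothetical.
LOCALIZED_SITES = {
    "GB": ("google.co.uk", "en-GB"),
    "FR": ("google.fr",    "fr"),
    "DE": ("google.de",    "de"),
}
DEFAULT_SITE = ("google.com", "en")

def choose_site_version(country_code):
    """Pick the localized domain and language for a visitor's country."""
    return LOCALIZED_SITES.get(country_code, DEFAULT_SITE)

domain, language = choose_site_version("FR")
print(domain, language)   # google.fr fr
```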
Other websites use localisation in the opposite way. Instead of attempting to increase the use of the site by accommodating worldwide users, some websites use a user's location to ensure that certain users cannot access the website or its content. This practice is most common on sites that host copyrighted content such as movies, TV shows or music. An example of this is the BBC iPlayer. This service cannot be accessed in the USA, for example, as the BBC iPlayer will not allow anyone with an IP address outside the UK to view the content. Online gaming/gambling websites use Geolocation tools to ensure that they are not committing crimes in countries where gambling is illegal [5]. An example of this is www.WilliamHill.com. This website filters out American users to avoid breaking laws in that country. In Italy, a country where unlicensed gambling is illegal, a gambling website will only be granted a licence if it applies Geolocation tools to restrict who can access the site. In 2012, MegaUpload.com was involved in a legal dispute with regard to its facilitation of copyright infringement. To try to avoid such charges, the company, which held all its assets in Hong Kong, made sure to use Geolocation tools to filter out anyone in Hong Kong from using its services. This meant that MegaUpload.com was committing copyright infringement in every country in the world except Hong Kong.

IP Geolocation has a vast array of both current and potential uses and areas of application. Of course, the accuracy (or granularity) needed varies from application to application. Through the use of IP Geolocation, advertisements can be specifically tailored to an individual based on their geographical location. For example, a user in London will see adverts relevant to the London area, a user in New York will see adverts relevant to the New York area, and so forth. Additional information such as local currency, pricing and tax can also be presented. A real-life example of this would be Google AdSense. As one may imagine, the accuracy needed for this is considerable; we would need a town or (even better) a street, as opposed to, say, a country or state, in order to provide accurate information to the user.

As the online space continues to become the place to do business, issues once thought to be solved now rear their heads again. For example, DVD drives were region-locked to prevent media being played outside the intended region, but problems exist in combating this resurging issue in the online space. Other examples of content restriction include the enforcement of blackout restrictions for broadcasting, blocking illegal downloads and the filtering of material based on culture. Content localisation, on the other hand, works to ensure that only relevant information is displayed to the user. The accuracy needed for an application that shows a visitor from Miami, Florida beach attire instead of parkas is lower than that needed for advertising. Here, geolocation at the country level is normally sufficient to ensure that users from one country cannot access content exclusive to another country, for example.
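The country-level restriction just described (an iPlayer-style check) can be sketched as follows. The geolocate() lookup is assumed to come from a geolocation database such as the one sketched earlier; the allowed-country set and the stub lookup table are illustrative, not any broadcaster's actual rules.

```python
# A minimal sketch of country-level content restriction based on geolocation.
# ALLOWED_COUNTRIES and the stub lookup are illustrative only.
ALLOWED_COUNTRIES = {"GB"}

def can_stream(ip_string, geolocate):
    """Return True if the IP geolocates to an allowed country."""
    location = geolocate(ip_string)
    return location is not None and location.get("country") in ALLOWED_COUNTRIES

# Example with a stub lookup table standing in for a real geolocation service.
stub_db = {"203.0.113.57": {"country": "GB"}, "198.51.100.20": {"country": "US"}}
lookup = stub_db.get
print(can_stream("203.0.113.57", lookup))   # True  (UK address, allowed)
print(can_stream("198.51.100.20", lookup))  # False (US address, blocked)
```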
Businesses can often struggle to adhere to national and regional laws due to the degree of variance between them. Failure to comply with these laws, however, can result in financial penalties or even prison time. Advertising, for instance, can be subject to tight control over what can be advertised where and when, and whether the product or service in question can be advertised in a particular location at all. Indeed, even the above examples of content restriction are often carried out to comply with legal requirements. In addition, there is the need to avoid trading with countries, groups and individuals black-listed by governments. Quova (www.quova.com) gives the OFAC (Office of Foreign Assets Control – United States) and the need to comply with its economic and trade sanctions as an example of this. IP Geolocation offers a powerful tool to help comply with these legal requirements. However, to use IP Geolocation effectively in this scenario, we would need state-level accuracy, as laws can vary from state to state.

IP Geolocation also has much to offer in security. It is used as a security measure by financial institutions to help protect against fraud by checking the geographical location of the user and comparing it with common trends. In the field of sales, user location can be compared with the billing address, for example. MaxMind (www.maxmind.com) is one such group, offering products such as minFraud, which provides relevant information about an IP's historic behaviour, legitimate and suspicious, and attempts to detect potential fraud by analysing the differences between the user location and the billing address.

III. METHODS OF IP GEOLOCATION

A common approach to IP Geolocation is to create and manually maintain a database containing the relevant data. These non-automated methods (i.e. those relying on some form of human interaction or contribution) can be undesirable. Problems include the fact that IP addresses are dynamically assigned rather than static, so the database requires frequent updating (potentially at considerable financial cost and with the risk of human error). The switch from IPv4 (2^32 possible addresses) to IPv6 (2^128 possible addresses) increases the challenge exponentially.

Delay Based Methods

One approach is to rely on delay measurements in order to geolocate a target. It should be noted, however, that these approaches rely on a set of 'landmarks', where a landmark is a point whose location is already known. A common way to construct this set of landmarks is to take a subset of nodes from the PlanetLab network (www.planet-lab.org), which consists of more than 1000 nodes.

Constraint Based Geolocation (CBG) is a delay-based method employing multilateration (estimating a position using some fixed points) [6]. The ability of CBG to create and maintain a dynamic relationship between IP address and geographical location is one of the method's key contributions to the IP Geolocation process, since most preceding work relied on a static IP-address-to-location relationship. To calibrate the distance-to-delay relationship, each landmark measures its delay to all other landmarks. A bestline is then created, where a bestline is the least distorted relationship between geographic distance and network delay. A circle then emanates from each landmark, the radius of which represents the target's estimated distance (calculated as above) from that landmark. The area of intersection is the region in which the target is believed to reside; CBG will commonly guess that the target is at the centroid of this region. The area of this region is an indication of confidence: the smaller the area, the more confident CBG is in its answer, while a larger area implies a lower level of confidence.

Speed of Internet (SOI) [7] can be viewed as a simplification of CBG. Whereas CBG calculates a distance-to-delay conversion value for each landmark, SOI uses a general conversion value across all landmarks. This value is 4/9c (where c is the speed of light in a vacuum). Numerous delays (such as circuitous paths and packetization) prevent data from travelling through fibre optic cables at its highest potential speed (2/3c). It is therefore reasoned that 4/9c can be used to safely narrow the region of intersection without sacrificing location accuracy.

Shortest Ping is the simplest delay-based technique. In this approach a target is simply mapped to the closest landmark based on round-trip time (also known as ping time).

Delay-based methods rely on the distance between the target and its nearest landmark, which is a good predictor of the estimation error. Round-trip time is also a good indication of the error; delay-based methods work well when the RTT is small, and performance deteriorates as the RTT increases. Having to take the network as it is, feeling our way around with delay measurements rather than being able to map it out, is something we would be keen to overcome in order to improve accuracy. As we will see, using topology information and other forms of external information can greatly increase accuracy.
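The following is a minimal sketch of the delay-based, CBG-style idea described above, simplified to a flat 2-D plane rather than the real globe. The landmark coordinates, RTTs and the delay-to-distance conversion factor are invented for illustration; a real system would calibrate per-landmark bestlines from landmark-to-landmark measurements rather than use one fixed factor.

```python
# A minimal, illustrative sketch of CBG-style multilateration on a 2-D plane.
# Landmark coordinates, RTTs and the conversion factor are invented.
import itertools

# (x_km, y_km, measured_rtt_ms) for three hypothetical landmarks.
LANDMARKS = [(0.0, 0.0, 8.0), (300.0, 0.0, 6.0), (0.0, 300.0, 7.0)]
KM_PER_MS = 100.0  # crude one-way delay-to-distance upper bound

def estimate_position(landmarks, km_per_ms=KM_PER_MS, step=5.0):
    """Centroid of grid points lying inside every landmark's distance circle."""
    radii = [(x, y, (rtt / 2.0) * km_per_ms) for x, y, rtt in landmarks]
    feasible = []
    for gx, gy in itertools.product(
            [i * step for i in range(-100, 101)], repeat=2):
        if all((gx - x) ** 2 + (gy - y) ** 2 <= r ** 2 for x, y, r in radii):
            feasible.append((gx, gy))
    if not feasible:
        return None
    n = len(feasible)
    return (sum(p[0] for p in feasible) / n, sum(p[1] for p in feasible) / n)

print(estimate_position(LANDMARKS))  # centroid of the intersection region
```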
Topology-based Geolocation (TBG)

The methods here attempt to go beyond using delay measurements as their sole metric. Some seek to combine traditional delay measurements with additional information such as knowledge of network topology; some even attempt to recast the problem entirely. The reliance of delay-based methods upon a carefully chosen set of landmarks is a problem [7]. Topology-based Geolocation (TBG), however, uses topology in addition to delay-based measurements to increase consistency and accuracy. This topology is the combination of the set of measurements between landmarks, the set of measurements between landmarks and the target (both obtained by traceroute) and structural observations about collocated interfaces. The target is then located using this topology in conjunction with end-to-end delays and per-hop latency estimates. When presented with a number of potential locations for a target, TBG will map the target to the location of the last constrained router. It should be noted, however, that TBG incurs some overheads that simple delay-based methods do not. TBG must first construct its topology information, and an additional overhead can be found in refreshing this information to ensure it is up to date and accurate. However, the authors point out that this topology information can be used for multiple targets, so this overhead need not necessarily apply to every measurement one may wish to make. There are three main variants of TBG: 1) TBG-pure, using active landmarks only; 2) TBG-passive, using active and passive landmarks; and 3) TBG-undns, using active and passive landmarks in conjunction with verified hints. Once successfully located, intermediate routers can be used as additional landmarks to help locate other network entities.
Figure 1: Identifying and clustering multiple network interfaces [7]

In order to accurately determine the locations of these intermediate routers with confidence, and to be able to use them effectively, we must record position estimates for all routers encountered so that we can base our final position estimate on as much information as possible. For instance, in trying to geolocate a router that is one hop from a given point and multiple hops from another given point, we need to record all routers we encounter, which allows us to determine its position with more accuracy. In other words, a geolocation technique has to simultaneously geolocate the targets as well as the routers encountered [7]. Discovering that a router has multiple network interfaces is a common occurrence. Normally these interfaces are then grouped (or clustered) together (this process of identification and resolution is also known as IP aliasing); otherwise we falsely inflate and complicate our topology information. Part (a) of Figure 1 shows two routers u and v which are in fact multiple interfaces of the same physical router. In part (b) we see how the topology has been simplified by identifying u and v and clustering them.

Web Parsing Approach

Another approach to improving upon delay-based methods is through the use of additional external information beyond inherently inaccurate delay-based measurements [8]. This can be achieved by parsing additional information from the web. Prime candidates are therefore those organisations that publish their geographic location on their website. The method seeks to extract, verify and utilize this information to improve accuracy. The overall system is made up of two main components: the first is a three-tier measurement methodology, which seeks to obtain a target's location; the second is a methodology for extracting and verifying information from the web, which is then used to create web-based landmarks. The three-tier measurement methodology uses a slightly modified version of CBG (where 4/9c is used as an upper bound rather than 2/3c) to obtain a rough starting point. Tiers 2 and 3 then bring in information from the web, obtained by the second component, to increase the accuracy of the final result. The information extraction and verification methodology relies on websites having a geographical address (primarily a ZIP code). This ZIP code, combined with a keyword such as university or business, is passed to a public mapping service. If this produces multiple IPs within the domain name, they are grouped together and refined during the verification process. Assuming one has used a public search tool as suggested, the first stage of verification is to remove results from the search if their ZIP code does not match that in the original query. In cases where the use of a shared hosting technique or a CDN (Content Delivery Network) results in an IP address being used for multiple domain names, the landmark (and subsequently the IP) is discarded. Finally, in the case where a branch office assumes the IP of its headquarters, the ZIP codes are compared again to confirm its identity as a branch and it is subsequently removed.
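A minimal sketch of the three verification rules just described. The candidate records, their field names and the is_branch_of_headquarters flag are hypothetical stand-ins; the original method applies these checks to results returned by a public mapping/search service.

```python
# A minimal sketch of the web-landmark verification rules described above.
def verify_landmarks(candidates, query_zip):
    """Keep only candidate web landmarks that pass the three checks above."""
    # Group the domains seen for each IP so shared-hosting/CDN cases can be spotted.
    domains_per_ip = {}
    for c in candidates:
        domains_per_ip.setdefault(c["ip"], set()).add(c["domain"])

    verified = []
    for c in candidates:
        if c["zip"] != query_zip:                   # rule 1: ZIP must match the original query
            continue
        if len(domains_per_ip[c["ip"]]) > 1:        # rule 2: IP shared by several domains
            continue
        if c.get("is_branch_of_headquarters"):      # rule 3: branch office using the HQ's IP
            continue
        verified.append(c)
    return verified

candidates = [
    {"domain": "example-university.edu", "ip": "198.51.100.10", "zip": "44240"},
    {"domain": "cdn-hosted-blog.example", "ip": "203.0.113.5",  "zip": "44240"},
    {"domain": "another-site.example",    "ip": "203.0.113.5",  "zip": "44240"},
    {"domain": "example-shop.com",        "ip": "192.0.2.7",    "zip": "90210"},
]
print(verify_landmarks(candidates, "44240"))
# Only example-university.edu survives: the two domains sharing 203.0.113.5
# are dropped by rule 2, and example-shop.com fails the ZIP check (rule 1).
```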
As with TBG described above, the method presented here also incurs certain overheads that delay-based methods are able to avoid. The measurement stage has a delay of 1-2 seconds for each measurement made. This is the result of 8 RTTs (round-trip times): 2 of which are performed in the first tier and 3 in each of the second and third tiers. The verification stage incurs an overhead for each ZIP code considered, as all landmarks for each ZIP are cached. However, this will only require occasional updates and thus does not affect each and every search.

The authors of [9] attempt to improve the accuracy of IP Geolocation by broadening the scope of information considered, casting IP Geolocation as a machine-learning classification problem. Here a Naive Bayes classifier is used along with a set of latency, hop count and population density measurements. Each of these metrics/variables can be assigned a weight to affect how it influences and informs the classifier. Results are classed in quintiles, with each quintile representing 20% of the target IPs and a level of confidence in the results within that quintile.
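To make the classification framing concrete, the following is a minimal sketch of a Naive Bayes classifier over latency, hop count and population-density features. The training rows, feature values and region labels are entirely invented, and the use of scikit-learn's GaussianNB is an assumption for illustration; the original work in [9] uses real measurements and its own classifier implementation.

```python
# A minimal sketch of a learning-based geolocation framing: Naive Bayes over
# latency, hop count and population-density features. All data are invented.
from sklearn.naive_bayes import GaussianNB

# Each row: [latency_ms_to_nearest_landmark, hop_count, population_density]
X_train = [
    [ 5.0,  4, 5200.0],   # labelled "city_A"
    [ 7.5,  5, 4800.0],   # labelled "city_A"
    [42.0, 14,  310.0],   # labelled "region_B"
    [38.5, 12,  280.0],   # labelled "region_B"
]
y_train = ["city_A", "city_A", "region_B", "region_B"]

model = GaussianNB()
model.fit(X_train, y_train)

# Classify an unlocated target IP from its measured features.
target_features = [[6.3, 5, 5000.0]]
print(model.predict(target_features))         # ['city_A']
print(model.predict_proba(target_features))   # per-class confidence estimates
```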
IV. GEOLOCATION EVASION (CYBERTRAVEL)

With Geolocation restrictions becoming more popular, internet users are finding ways to evade these restrictions. Every country has its own laws that it applies to cases involving Geolocation, but those laws were not written with the technology in mind. Cybertravel is a phrase, admittedly almost unknown but apt, that refers to evading Geolocation, GPS and other similar tracking technologies by pretending to be in a real-world location that you are not. Cybertravel is not the same as making yourself anonymous: the latter is about making your location unknown, while the former is about providing an incorrect location. One way to do this is to alter your IP address to make it seem that you are from another region. Many people use this evasion technique to access content that is restricted; it is popular with people who are trying to access websites hosting copyrighted TV shows. Another, less popular, way to cybertravel is to gain remote access to another device that is physically in the region from which you want to access the internet. In this way you are not actually altering your IP; it is as if you had physically travelled to that region and accessed the internet from there. Services like TOR (https://www.torproject.org/) provide internet anonymity. Actively trying to hide your identity or location can mean that websites cannot determine your region and thus may not allow you access to their content at all, which is why cybertravel is the method of choice for accessing region-restricted content. Services exist that allow a user to pay a monthly fee in return for an IP address from a particular region. An example of this is www.myexpatnetwork.co.uk. This company allows users from outside the UK to gain a UK IP address. The company only deals in the GBP currency and is marketed to UK residents. The company is not breaking any laws by 'leasing' these IP addresses, and because it is marketing to UK residents who are abroad it may believe it is covered from the charge of advertising a Geolocation evasion tool. This service description may not stand up in court, as a large portion of its customers are likely non-UK residents looking to access UK-only content. Evasion of Geolocation has not become a major issue at the moment. As with many issues like this, most organisations do not care until services like myexpatnetwork become popular and so easy to use that a serious financial loss looms. Governments are starting to pay attention to this issue now as they begin to understand the difficulty of enforcing their laws against companies and people outside their jurisdiction who commit crimes on the internet.

V. CONCLUSION

We have provided an overview of IP Geolocation applications and methodologies, both traditional and those that attempt to push the envelope. The methodologies presented here vary both in their complexity and their accuracy; as such, we cannot claim any one method as the ideal solution. The optimal approach is therefore highly sensitive to the type of application being developed.

REFERENCES

[1] Lassabe, F. (2009) Géolocalisation et prédiction dans les réseaux Wi-Fi en intérieur. PhD thesis, Université de Franche-Comté, Besançon.
[2] Brewster, S., Dunlop, M. (2002) Mobile Computer Interaction. ISBN: 978-3-540-23086-1. Springer.
[3] Furey, E., Curran, K., Lunney, T., Woods, D. and Santos, J. (2008) Location Awareness Trials at the University of Ulster, Networkshop 2008 - The JANET UK International Workshop on Networking 2008, The University of Strathclyde, 8th-10th April 2008.
[4] Furey, E., Curran, K. and McKevitt, P. (2010) Predictive Indoor Tracking by the Probabilistic Modelling of Human Movement Habits. IERIC 2010 - Intel European Research and Innovation Conference 2010, Intel Ireland Campus, Leixlip, Co. Kildare, 12-14th October 2010.
[5] Sawyer, S. (2011) EU Online Gambling and IP Geolocation, Neustar IP Intelligence, http://www.quova.com/blog2/4994/
[6] Gueye, B., Ziviani, A., Crovella, M. and Fdida, S. (2004) Constraint Based Geolocation of internet hosts. In IMC '04: Proceedings of the 4th ACM SIGCOMM conference on Internet measurement, pp. 288-293.
[7] Katz-Bassett, E., John, J., Krishnamurthy, A., Wetherall, D., Anderson, T. and Chawathe, Y. (2006) Towards IP Geolocation Using Delay and Topology Measurements. In IMC '06: Proceedings of the 6th ACM SIGCOMM conference on Internet measurement, pp. 71-84.
[8] Wang, Y., Burgener, D., Flores, M., Kuzmanovic, A. and Huang, C. (2011) Towards Street-Level Client-Independent IP Geolocation. In NSDI'11: Proceedings of the 8th USENIX conference on networked systems design and implementation, pp. 27-36.
[9] Eriksson, B., Barford, P., Sommers, J. and Nowak, R. (2010) A Learning-based Approach for IP Geolocation. In PAM'10: Proceedings of the 11th international conference on Passive and active measurement, pp. 171-180.

On the Network Characteristics of the Google's Suggest Service
Zakaria Al-Qudah, Mohammed Halloush, Hussein R. Alzoubi, Osama Al-kofahi
Yarmouk University, Dept. of Computer Engineering, Irbid, Jordan
Email: {zakaria.al-qudah, mdhall, halzoubi, osameh}@yu.edu.jo

Abstract— This paper investigates application- and transport-level characteristics of Google's interactive Suggest service by analyzing a passively captured packet trace from a campus network. In particular, we study the number of HTTP GET requests involved in user search queries, the inter-request times, the number of HTTP GET requests per TCP connection, the number of keyword suggestions that Google's Suggest service provides to users, and how often users utilize these suggestions in enhancing their search queries. Our findings indicate, for example, that nearly 40% of Google search queries involve five or more HTTP GET requests. For 36% of these requests, Google returns no suggestions, and 57% of the time users do not utilize returned suggestions.
Furthermore, we find that some HTTP characteristics, such as the inter-request generation time, differ for an interactive search application from those of traditional Web applications. These results confirm the findings of other studies that examined interactive applications and reported that such applications are more aggressive than traditional Web applications.

Index Terms— Ajax, Web 2.0, Network measurements, Performance, Google search engine

I. INTRODUCTION

Interactive Web applications have become extremely common today. The majority of Web sites, including major Web-based email services (e.g., Gmail, Yahoo mail, etc.), map services, social networks, and web search services, support an interactive user experience. One of the enabling technologies for this interactive, user-engaging experience is Asynchronous Javascript and XML (AJAX) [1]. AJAX allows the web client (browser) to asynchronously fetch content from a server without the need for typical user interactions such as clicking a link or a button. Due to this asynchronous nature, these interactive applications exhibit traffic characteristics that might be different from those of classical applications. In classical (non-interactive) applications, requests for content are usually issued in response to human actions such as clicking a link or submitting a Web form. Thus, the human factor is the major factor in traffic generation. With interactive applications, however, requests can be issued in response to user interactions that typically would not generate requests, such as filling a text field or hovering the mouse over a link or an image. Furthermore, these requests can be made even without user intervention at all, as in the case of fetching new email messages with Gmail or updating news content on a news Web site. Therefore, traffic generation is not necessarily limited by the human factor.

In this paper we focus on one such interactive application: the Google interactive search engine. The Google search engine has many interactive features, implemented using the AJAX technology, that are aimed at providing a rich search experience [2]. For example, Google Suggest (or Autocomplete) [3] provides suggested search phrases to users as they type their query (see Fig. 1). The user can select a suggested search phrase, optionally edit it, and submit a search query for that search phrase. Suggestions are created using a prediction algorithm to help users find what they are looking for. The Google Instant Search service [4] streams continuously updated search results to the user as they type their search phrases (see Fig. 2). This is hoped to guide users' search process even if they do not know exactly what they are looking for. The other interactive feature that Google provides is Instant Previews [5]. With Instant Previews (shown in Fig. 3), users can see a preview of the web pages returned in the search results by simply hovering over these search results. This service is aimed at providing users with the ability to quickly compare results and pinpoint relevant content in the results web page.

Figure 1. A snapshot of the Google Suggest feature

In this paper, we study the characteristics of the Google Suggest service by analyzing a passively captured packet trace from the Yarmouk University campus in Jordan.
We look into the number of HTTP GET requests a search query generates, the inter-request generation time, the number of suggestions Google typically returns for a request, and the percentage of time the returned suggestions are actually utilized by users. We have already begun to study the Google Instant feature and plan to study Instant Previews in the future.

Figure 2. A snapshot of the Google Instant search feature. Note that the search results for the first suggestion are displayed before the user finishes typing his/her query

Figure 3. A snapshot of the Google Instant Preview feature. Note the displayed preview of the web page of the first search result entry "Yarmouk University"

The rest of this paper is organized as follows. Section II highlights some background information related to our work. Section III motivates our work. Section IV presents the related work. Section V describes the packet trace capturing environment and its characteristics. Section VI presents our results and discusses our findings. We conclude and present our future work plans in Section VII.

II. BACKGROUND

When browsing the web, one normally uses web search engines several times a day to find the required information. Web search engines are therefore visited by a huge number of people every day. Web search can use query-based, directory-based, or phrase-based query-reformulation-assisted search methods. Google is considered among the most popular search engines on the web. The Google search engine uses the standard Internet query search method [6], [7].

In 2010 Google announced Google Instant, which live-updates search results interactively as users type their queries. Every time the user hits a new character, the search results are changed accordingly based on what the search engine thinks the user is looking for. This can save substantial user time since, most of the time, the results that a user is looking for are returned before they finish typing. Another advantage of Google Instant is that users are less likely to type misspelled keywords because of the instant feedback. The public generally provided positive feedback towards this new feature [8].

Google Instant Preview is another feature provided by Google. This feature allows users to get snapshots of web pages for the search results without leaving the search results page. This feature enhances the searcher's experience and satisfaction. Google Instant Preview provides an image of the web page in addition to the extracted text. Previews are dynamically generated because content is continuously changing. Google users are reported to be 5% more satisfied with this new feature [9].

III. MOTIVATION

One motivation of this study is that we believe that measuring such services is extremely important due to the recent popularity of interactive web features. Characterizing new trends in network usage helps the research community and network operators update their mental model of network usage. Another motivation is that such characterization is quite important for building simulators and performance benchmarks and for designing new services and enhancing existing ones. Moreover, Google interactive features may produce a large amount of information, which may result in a bad experience for users on mobile devices or over low-speed Internet connections [8].
With the prevalence of browsing the web via mobile devices today, we believe that characterizing these services is vital to understanding their performance. To the best of our knowledge, this is the first attempt at characterizing the interactive features of a search engine from the application- and transport-level perspectives.

IV. RELATED WORK

The AJAX technology suite enables automated HTTP requests without human intervention by allowing web browsers to make requests asynchronously. This has been made possible through the use of advanced features of HTTP 1.1 such as prefetching data from servers, HTTP persistent connections, and pipelining. These features mask network latency and give end users a smoother experience of web applications. Therefore, AJAX creates interactive web applications and increases speed and usability [10]. The authors of [10] performed a traffic study of a number of Web 2.0 applications and compared their characteristics to traditional HTTP traffic through statistical analysis. They collected HTTP traces from two networks, the Munich Scientific Network in Munich, Germany and the Lawrence Berkeley National Laboratories (LBNL) in Berkeley, USA, and classified traffic into Web 2.0 application traffic and conventional application traffic. They used packet-level traces from large user populations and then reconstructed HTTP request-response streams. They identified the 500 most popular web servers that used AJAX-enabled Web 2.0 applications. Google Maps is one of the first applications that used AJAX; therefore, the authors focused on Google Maps traffic. The findings of this study show that Web 2.0 traffic is more aggressive and bursty than classical HTTP traffic. This is due to the active prefetching of data, which means many more automatic HTTP requests and consequently a greater number of bytes transferred. Moreover, they found that sessions in AJAX applications last longer and are more active than conventional HTTP traffic. Furthermore, AJAX inter-request times within a session are very similar and much shorter, because requests are more frequent than in all other HTTP traffic.

Besides [10], some work exists in the literature on characterizing HTTP traffic generated by popular Web 2.0 websites. In [11], for example, the authors examined traces of Web-based service usage from an enterprise and a university. They examined methodologies for analyzing Web-based service classes and identifying service instances, service providers, and brands, pointing to the strengths and weaknesses of the techniques used. The authors also studied the evolution of Web workloads over the past decade, where they found that although Web services have significantly changed over time, the underlying object-level properties have not. The authors of [12] studied HTTP traffic from their campus network related to map applications. Their work examined the traffic from four map web sites: Google Maps, Yahoo Maps, Baidu Maps, and Sogou Maps. In their paper, they proposed a method for analyzing the mash-up (combining data from multiple sources) characteristics of Google Maps traffic. They found that 40% of Google Maps sessions come from mash-ups on other websites and that caching is still useful in web-based map applications. Li et al. [13] studied the evolution of HTTP traffic and classified its usage.
The results provided are based on a trace collected in 2003 and another collected in 2006. The total bytes in each HTTP traffic class in the two traces were compared. The authors found that overall HTTP traffic increased by 180%, while Web browsing and Crawler traffic both increased by 108%. However, Web apps, File download, Advertising, Web mail, Multimedia, News feeds and IM showed a sharp rise. Maier et al. [14] presented a study of residential broadband Internet traffic using packet-level traces from a European ISP. The authors found that session durations are quite short. They also found that HTTP, not peer-to-peer, carries most of the traffic. They observed that Flash Video contributes 25% of all HTTP traffic, followed by RAR archives, while peer-to-peer contributes only 14% of the overall traffic. Moreover, most DSL lines fail to utilize their available bandwidth, and connections from client-server applications achieve higher throughput per flow than P2P connections. In [15], a study of user sessions on YouTube was conducted. The results obtained from the study indicate longer user think times and longer inter-transaction times. The results also show that, in terms of content, large video files are transferred. Finally, in [16], [17], the authors proposed AJAXTRACKER, a tool for mimicking human interaction with a web service and collecting traces. The proposed tool captures measurements by imitating mouse events that result in the exchange of messages between the client and the Web server. The generated traces can be used for studying and characterizing different applications such as mail and maps.

V. DATA SET AND METHODOLOGY

As mentioned, this study is conducted based on a packet-level trace captured at the edge of the engineering building at Yarmouk University, Jordan. The engineering building contains roughly 180 hosts that are connected through a typical 100 Mbps Ethernet. The trace was collected over a period of five business days and contains a total of 31490 HTTP transactions that are related to Google search. To extract the transactions that are related to Google search, the URL or the "HOST:" HTTP request header has to contain the word "google". The search query is contained in the URL in the form "q=xyz", where "xyz" is the query. After identifying an HTTP request as a Google search request, the corresponding HTTP response is also extracted. The returned suggestions are extracted from these HTTP responses. We also collect the type of the returned HTTP response in order to separate queries from one another, as explained later in Section VI.
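A minimal sketch of the extraction step just described: filtering Google-search transactions out of a list of HTTP requests and pulling the partial query from the "q=" URL parameter. The record format (dicts with "url" and "host" fields) is a hypothetical stand-in; the study itself works on a raw packet trace.

```python
# A minimal sketch of filtering Google search transactions from an HTTP trace
# and extracting the "q=" query parameter. The record format is illustrative.
from urllib.parse import urlparse, parse_qs

def is_google_search(request):
    """A transaction is kept if 'google' appears in the URL or Host header."""
    return "google" in request["url"].lower() or "google" in request["host"].lower()

def extract_query(request):
    """Return the value of the q= parameter, or None if absent."""
    params = parse_qs(urlparse(request["url"]).query)
    values = params.get("q")
    return values[0] if values else None

trace = [
    {"host": "www.google.com", "url": "/complete/search?q=you&client=hp"},
    {"host": "www.example.org", "url": "/index.html"},
]
searches = [r for r in trace if is_google_search(r)]
print([extract_query(r) for r in searches])   # ['you']
```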
VI. RESULTS

In this section, we measure various parameters related to Google search queries. To identify the boundaries of a search query, we manually analyzed a portion of the collected trace. We found that throughout the time a user is typing the search phrase, the browser generates HTTP GET requests. For these HTTP GET requests, the type of the HTTP response is either "text/xml" or "text/javascript". When the user hits the Return key (to obtain the search results), the browser generates another HTTP GET request, for which the type of the returned HTTP response is "text/html". We verified this observation by actively performing a number of search queries and observing the captured traffic.

In our trace, however, we found a number of occurrences of a scenario where a series of HTTP GET requests from a user appears to be related to two different queries, yet this series of HTTP GET requests is not split by a "text/html" response separating the boundaries of the two queries. There are a number of usage scenarios that could result in such behaviour. For example, a user might type in a search phrase and get interrupted for some reason; the search query then ends without the Return key being hit. Furthermore, the TCP connection that is supposed to carry the last HTTP response might get disrupted after the user hits the Return key and before the HTTP response is delivered back to the user. To handle such cases, we consider two HTTP GET requests that are not split by a "text/html" response to belong to two different search queries if the time separation between the two requests is greater than t seconds. To find a suitable value for this parameter, we plot the percentage of queries identified using the time separation, relative to the overall number of search queries, for different values of t in Fig. 4.

Figure 4. Setting of parameter t

The figure suggests that, in general, the percentage of queries identified using the time-separation heuristic is insensitive to the setting of the parameter t when t is above 30 seconds, and t = 60 is an appropriate value since the percentage of search queries identified using the time-spacing heuristic remains stable around this value. Therefore, we choose this value throughout our evaluation below.

A. HTTP GET Requests

This subsection investigates the number of HTTP GET requests a search query typically involves. We identify a total of 7598 search queries. We plot the Cumulative Distribution Function (CDF) of the number of HTTP GET requests in a query in Fig. 5.

Figure 5. No. of HTTP GET requests per search query

As shown, over 40% of search queries involve only one HTTP GET request. The possible reasons for the existence of these queries include (i) users not turning on Google Suggest and (ii) users copying search phrases, pasting them into Google and hitting the Return key. The figure also shows that around 30% of search queries involve five or more HTTP GET requests, with some search queries involving over 90 HTTP GET requests. These results show that the "chatty" nature of AJAX-based applications reported in [10] for the map and email applications also applies to the Google Suggest application. Among the nearly 60% of search queries for which we believe users are enabling Suggest (i.e., the number of GET requests is greater than one), the vast majority of queries seem not to utilize the suggestions for the first few characters, because users continue to type despite the returned suggestions. A possible reason for this might be that for a small number of characters of the search phrase, Google returns quite general suggestions that are usually not selected by the user. We believe there is room for improvement in the service design by not returning suggestions for the initial few characters of the query.
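The query-boundary rule described above can be sketched as follows: a new query starts after a "text/html" response (the user hit Return) or after a gap of more than t = 60 seconds between consecutive GET requests. The request tuples (timestamp in seconds, response content type) are illustrative, not the paper's actual data format.

```python
# A minimal sketch of the query-boundary heuristic described above.
T_SECONDS = 60.0

def split_into_queries(requests, t=T_SECONDS):
    """Group a user's time-ordered GET requests into per-query lists."""
    queries, current, last_ts, boundary = [], [], None, False
    for timestamp, content_type in requests:
        gap = last_ts is not None and (timestamp - last_ts) > t
        if current and (boundary or gap):
            queries.append(current)
            current = []
        current.append((timestamp, content_type))
        boundary = (content_type == "text/html")   # Return key pressed
        last_ts = timestamp
    if current:
        queries.append(current)
    return queries

requests = [(0.0, "text/xml"), (0.4, "text/xml"), (1.1, "text/html"),
            (2.0, "text/xml"), (90.0, "text/xml")]
print(len(split_into_queries(requests)))   # 3 queries
```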
B. Inter-Request Times

Next, we turn our attention to investigating the time spacing between HTTP GET requests within the same search query. Fig. 6 shows the results.

Figure 6. HTTP GET inter-request times within a search query

As shown, 64% of HTTP GET requests involved in a search query are separated by less than one second. We contrast these results with our mental model of traditional HTTP interactions. Normally, a number of HTTP requests are made to download a Web page along with its embedded objects; a think time then elapses before new requests are made to download a new page [18]. The interactive search application generates a radically different pattern of HTTP GET requests. This is due to the fact that requests for a new set of Google suggestions are automatically made while the user types the search query. This also confirms the results of [10] indicating that inter-request times are shorter in AJAX-based applications than in traditional applications. We believe, however, that the traffic characteristics of these interactive applications are generally application-dependent and not technology-dependent. That is, the characteristics of the traffic generated by an application employing the AJAX technology depend on the type of the application and not on the fact that it uses AJAX. This is because AJAX enables the application to generate traffic automatically without user intervention (or in response to user actions that typically do not generate traffic, such as hovering over a link); however, it is up to the application logic to decide whether to generate HTTP requests and when to generate them.

C. TCP Connections

In our trace, we find that each HTTP GET request is carried over its own TCP connection. To verify whether this is a result of the deployed HTTP proxy, we performed a number of search queries from the authors' houses (i.e., using residential broadband network connections). This experiment involved performing a number of search queries with different web browsers (Microsoft Internet Explorer, Mozilla Firefox, and Google Chrome) on a Microsoft Windows 7 machine. We captured and examined the packet trace of these search queries. Our findings indicate that, contrary to what we find in our trace, various HTTP GET requests can be carried over the same TCP connection. We note here that HTTP proxies are commonly deployed in institutional networks. Therefore, our network setting is not necessarily unique, and we believe it is legitimate to assume that many other institutions employ similar network settings. We note that having a separate TCP connection per request might have a significant impact on the performance of this service. In particular, each new TCP connection requires the TCP three-way handshake, which might add a significant delay. Furthermore, if an HTTP request needs to be split over many packets, the rate at which these packets are transmitted to the server is limited by the TCP congestion control mechanisms. These added delays might therefore limit the usefulness of the Suggest service, since suggestions usually become obsolete once the user types new text.
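A rough, illustrative estimate of the avoidable handshake delay discussed in the TCP Connections subsection above. The RTT value, the request count and the helper name are invented; real numbers depend on the client's network path.

```python
# A rough estimate of extra delay from opening a fresh TCP connection per
# suggestion request instead of reusing one persistent connection.
def extra_handshake_delay(num_requests, rtt_ms):
    """One additional round trip (SYN / SYN-ACK) per new connection,
    compared with reusing a single persistent connection."""
    return (num_requests - 1) * rtt_ms  # the first connection is needed either way

# Example: a 7-character phrase generating 7 suggestion requests over a 60 ms path.
print(extra_handshake_delay(7, 60))   # 360 ms of avoidable handshake latency
```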
We note here that HTTP proxies are commonly deployed in institutional networks. Therefore, our network setting is not necessarily unique, and we believe that it is totally legitimate to assume that many other institutions are employing similar network settings. We note that having a separate TCP connection per request might have significant impact on the performance of this service. In particular, each new TCP connection requires the TCP three-way handshake which might add a significant delay. Furthermore, if an HTTP request needs to be split over many packets, the rate at which these packets are transmitted to the server is limited by the TCP congestion control mechanisms. Therefore, these added delays might limit the usefulness of the Suggest service since suggestions usually become obsolete when the user types new text. 0.6 0.6 0.4 0.2 0 0.1 Percentage (logscale) 1 Figure 8. Percentage of time users actually use the returned suggestions of the search query Next, we investigate the percentage of time users do actually use the returned suggestions during the search process. To assess this, within a search query, we assume that the user has utilized the returned suggestions if the search phrase in the current request matches one of the suggestions appeared in the response for the previous request. To illustrate this, consider the following scenario from our trace. An HTTP GET request was sent to Google with “you” as a partial search phrase. Google responded with “youtube, you, youtube downloader, yout, youtube to mp3, youtu, youtube download, youtube music, you top, you born” as search suggestions. The next HTTP GET request was sent to Google with “youtube” as the search phrase. In this case, we assume that the user has utilized the returned suggestion since the current HTTP GET request involves one of the suggestions that were provided as a response to the previous HTTP GET request. That is, the user is asking for “youtube” which was one of the suggestions made by Google in the previous response. JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 We note that this is an upper limit on the usage of this service because a search phrase in the current request may match the suggestions in the previous request, yet, the user might have typed the phrase instead of selecting it from the list of returned suggestions. The result is plotted in Fig. 8. As shown, nearly 58% of the time, users do not use the returned suggestions at all. On the other hand, nearly 10% of queries are constructed with complete guidance of the Suggest service. The following scenario from our trace illustrates a case where a query can be constructed with complete help of Google suggestions. A user typed “f” which triggered a request for suggestions to Google. Google responsed with “facebook,face,fa,friv,fac,firefox,faceboo,farfesh,factjo,fatafeat” as suggestions. The next and final request was for “facebook” which is among the suggested search phrases. In this case the user selected a search phrase from the first set of returned suggestion to complete the search query. This means suggestions are fully utilized. Hence, full utilization of google suggest service is acheived when the user selects a suggested phrase from each returned list of suggestions for a particular search query. VII. 
C ONCLUSIONS AND F UTURE W ORK In this paper, we have investigated the applicationand transport-level characteristics of Google’s Suggest interactive feature as observed in a passively captured packet trace from a campus network. We find that a large number of HTTP GET requests could be issued to obtain suggestions for a search query. Interestingly, the characteristics of the HTTP GET requests deviate significantly from those of HTTP GET requests issued for classical Web interactions. In particular, while classical Web interactions are limited by the human factor (thinktime), interactive applications are not necessarily limited by this factor. Furthermore, we have characterized the number and usefulness of suggestions made by Google. To this end, we have found that Google responds to the majority of requests for suggestions with either zero or 10 suggestions (the number 10 is the maximum number of suggestions returned per request). However, nearly 58% of users do not utilize the returned suggestions at all. We have already begun to investigate the characteristics of other Google interactive search features such as Google instant search and plan to evaluate the Google instant preview as well. R EFERENCES [1] J. J. Garrett, “Ajax: A new approach to web applications,” http://adaptivepath.com/ideas/essays/archives/000385.php, February 2005, [Online; Stand 18.03.2008]. [Online]. Available: http://adaptivepath.com/ideas/essays/archives/000385.php [2] “Ajax:A New Approach to Web Applications,” http://adaptivepath.com/ideas/ajax-new-approach-webapplications. [3] “Google Suggest (or Autocomplete),” http://www.google.com/support/websearch/bin/static.py?hl= en&page=guide.cs&guide=1186810&answer=106230&rd=1. © 2012 ACADEMY PUBLISHER 283 [4] “Google Instant,” http://www.google.com/instant/. [5] “Google Instant Previews,” http://www.google.com/landing/instantpreviews/#a. [6] P. Bruza, R. McArthur, and S. Dennis, “Interactive internet search: keyword, directory and query reformulation mechanisms compared,” in SIGIR’00, 2000, pp. 280–287. [7] S. Dennis, P. Bruza, and R. McArthur, “Web searching: A process-oriented experimental study of three interactive search paradigms,” Journal of the American Society for Information Science and Technology, vol. 53, issue 2, pp. 120–130, 2002. [8] http://dejanseo.com.au/google-instant/. [9] http://dejanseo.com.au/google-instant-previews/. [10] F. Schneider, S. Agarwal, T. Alpcan, and A. Feldmann, “The new web: Characterizing ajax traffic.” in PAM’08, 2008, pp. 31–40. [11] P. Gill, M. Arlitt, N. Carlsson, A. Mahanti, and C. Williamson, “Characterizing organizational use of webbased services: Methodology, challenges, observations and insights,” ACM Transactions on the Web, 2011. [12] S. Lin, Z. Gao, and K. Xu, “Web 2.0 traffic measurement: analysis on online map applications,” in Proceedings of the 18th international workshop on Network and operating systems support for digital audio and video, ser. NOSSDAV ’09. New York, NY, USA: ACM, 2009, pp. 7–12. [Online]. Available: http://doi.acm.org/10.1145/1542245.1542248 [13] W. Li, A. W. Moore, and M. Canini, “Classifying http traffic in the new age,” 2008. [14] G. Maier, A. Feldmann, V. Paxson, and M. Allman, “On dominant characteristics of residential broadband internet traffic,” in Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference, ser. IMC ’09. New York, NY, USA: ACM, 2009, pp. 90–102. [Online]. Available: http://doi.acm.org/10.1145/1644893.1644904 [15] P. Gill, M. Arlitt, Z. Li, and A. 
Zakaria Al-Qudah received his B.S. degree in Computer Engineering from Yarmouk University, Jordan, in 2004. He received his M.S. and Ph.D. degrees in Computer Engineering from Case Western Reserve University, USA, in 2007 and 2010, respectively. He is currently an assistant professor of Computer Engineering at Yarmouk University. His research interests include the Internet, content distribution networks, and security.

Mohammed Halloush received the B.S. degree from Jordan University of Science and Technology, Irbid, Jordan, in 2004, and the M.S. and Ph.D. degrees in Electrical Engineering from Michigan State University, East Lansing, MI, USA, in 2005 and 2009, respectively. He is currently an assistant professor in the Department of Computer Engineering at Yarmouk University, Irbid, Jordan. His research interests include network coding, multimedia communications, wireless communications, and networking.

Hussein Al-Zoubi received his M.S. and Ph.D. degrees in Computer Engineering from the University of Alabama in Huntsville, USA, in 2004 and 2007, respectively. Since 2007, he has been working with the Department of Computer Engineering, Hijjawi Faculty for Engineering Technology, Yarmouk University, Jordan. He is currently an associate professor. His research interests include computer networks and their applications: wireless and wired networks, security, multimedia, queuing analysis, and high-speed networks.

Osameh Al-Kofahi received his B.S. degree in Electrical and Computer Engineering from Jordan University of Science and Technology, Irbid, Jordan, in 2002. He received his Ph.D. degree from Iowa State University, USA, in 2009. His research interests include wireless networks, especially Wireless Sensor Networks (WSNs), Wireless Mesh Networks (WMNs) and ad hoc networks, survivability and fault tolerance in wireless networks, and practical network coding.

Review of Web Personalization
Zeeshan Khawar Malik, Colin Fyfe
School of Computing, University of The West of Scotland
Email: {zeeshan.malik,colin.fyfe}@uws.ac.uk

Abstract— Today the internet is in its personalized phase, in which each user is able to view content that matches his or her interests and needs. Web users now rely on the internet for nearly all of the problems they face in their daily lives.
If someone wants to find a job, he or she will look on the internet; similarly, if someone wants to buy a product, the internet is the preferred platform. Because of the large number of users and the large amount of data on the internet, people have come to prefer platforms where they can find what they need in as little time as possible. The only way to make the web intelligent in this sense is through personalization. Web personalization was introduced more than a decade ago, and many researchers have contributed to making this strategy as efficient and as convenient for the user as possible. Web personalization research draws on many related areas, including AI, machine learning, data mining and natural language processing. This report describes the development of web personalization, with a description of the processes that have made the technique popular and widespread. It also highlights the importance of this strategy, the benefits and limitations of the methods introduced within it, and how the approach has made the internet easier to use.

Index Terms— Web Personalization, Learning, Matching and Recommendation

(This work was supported by University of The West of Scotland.)

I. INTRODUCTION

In the early days of internet technology, people struggled to browse and find data matching their interests and needs because of the sheer richness of information available online. The concept of web personalization has to a very large extent enabled internet users to find the most appropriate information for their interests. It is one of the major contributions on the internet derived from the earlier concept of Adaptive Hypermedia, which became popular through its contribution to adaptive web-based hypermedia in teaching systems [1], [2] and [3]. Adaptive Hypermedia arose from observing the browsing habits of different users on the internet, where people faced great difficulty in choosing among the many links available at one time. Based on this observation, adaptive hypermedia was introduced to provide the most appropriate links to users based on their browsing habits. The concept became more popular when it was introduced in the area of educational hypermedia [4]. Web personalization [11], [22], [72], [16] is closely linked with adaptive hypermedia in that the former usually works on open corpus hypermedia, whereas the latter mostly worked on, and became popular with, closed corpus hypermedia. The basic objective of personalization is similar to that of adaptive hypermedia: to help users by giving them the most appropriate information for their needs. The reason web personalization has become more popular than adaptive hypermedia is its frequent implementation in commercial applications. Very few areas of the internet remain that this concept has not reached. Most areas have adopted it, including e-business [5], e-tailing [6], e-auctioning [7], [8] and others [9].
User adaptive services and personalization features both are basically designed for enabling users to reach their targeted needs without spending much time in searching. Web Personalization is divided into three main phases 1) Learning [10] 2) Matching [11] and 3) Recommendation [5], [12] as shown in Fig. 1 and Fig. 2 in detail. Learning is further subdivided into two types 1) Explicit Learning and 2) Implicit Learning. There is one more type of learning method mentioned most frequently nowadays by different researchers called behavioural learning [13] that also comes under the Implicit Learning category. The next stage is the matching phase. There is more than one type of matching or filtration techniques proposed by different researchers which primarily include 1) Content-Based Filtration [14] 2) Collaborative Filtration [15], [16] [17], [18] 3) Rule-Based Filtration [19] and 4) Hybrid Filtration [20] These prime categories further include sub-categories mentioned later that are based on the prior mentioned categories but are used to further enhance the performance and efficiency of this phase. There is still a lot of weaknesses to the efficiency and performance of this phase. Many new ideas are currently being proposed all over the world to further improve the performance in finding the nearest neighbours in the shortest possible time and producing more accurate results. The last phase is the recommendation [21] phase which is responsible for displaying the closest match to the interest and personalized choice of users. In this report a detailed review of web personalization is made by taking into account the following major points. 286 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 Figure 1. Three Stages of Web Personalization. Figure 2. Web Personalization Process. 1) What is Web Personalization? What are the main Building Blocks of Web Personalization? 2) What are the major techniques that are involved in each phase of web personalization? 3) Description of each phase with complete overview of all the major contributions that are made in each phase of web personalization. II. WHAT IS WEB PERSONALIZATION? Web Personalization can be defined as a process of helping users by providing customized or relevant information on the basis of Web Experience to a particular user or set of users [22]. A Form of user-to-system interactivity that uses a set of technological features to adapt the content, delivery, and arrangement of a communication to individual users explicitly registered and/or implicitly determined preferences [23]. One of the first and foremost companies who had introduced this concept of personalization was Yahoo in 1996 [24]. Yahoo has introduced this feature of personalizing the user needs and requirement by providing different facilitating products to its users like Yahoo Companion, Yahoo Personalized Search and Yahoo Modules. Yahoo experienced quite a number of challenges which include scalability issues, usability issues and large-scale personalization issues but summing up as whole find it quite a successful feature as far as the user needs and requirements were concerned. © 2012 ACADEMY PUBLISHER Similarly Amazon, one of the biggest companies in the internet market, summarizes the recommendation system with three common approaches 1) Traditional Collaborative Filtering 2) Cluster Modelling and 3) Search-Based Methods as described in [25]. 
Amazon has also incorporated this method of web personalization and the most well-known use of collaborative filtering is also done by Amazon as well. Amazon.com, the poster child of personalization, will start recommending needlepoint books to you as soon as you order that ideal gift for your great aunt.(http://www.shorewalker.com) Web Personalization is the art of customizing items responding to the needs of users. Due to the large amount of data on the internet, people often get so confused in reaching their correct destination and spend so much time in searching and browsing the internet that in the end they get disappointed and prefer to do their work using traditional means. The only way to help internet users is by providing an organized look to the data and personalizing the whole decoration of items to satisfy the individual’s desire and in doing this the only way is to embed features of web personalization. Everyday a user has a different mood when browsing the internet and based on that day’s particular interest the user browses the internet, but definitely a time comes when the interest starts becoming redundant day by day and at that particular situation if the historical transactional record [5], [12] is maintained properly and the user behaviour is recorded [13] properly then the company can take benefit in filtering the record based on a single user or a group of users and can recommend useful links according to the interest of the user. Web Personalization can also be defined as a recommendation system that performs information filtering. The most important layer on which this feature is strongly dependent is the data layer [26]. This layer plays a very important role in recommendation. The system which is capable of storing data from more than one dimension is able to personalize the data in a much better way. Hence the feature of web personalization has a pretty closed relationship with web mining Web Personalization is normally offered as an implicit facility to the user: whereas some websites considered it optional for the user, most websites do it implicitly without asking the user. The issues that are considered very closely while offering web personalization is the issue of high-scalability of data [27], lack of performance issues [19], correct recommendation issues , black box filtration issues [28], [29] and other privacy issues [30]. Black box filtration is defined as a scenario where the user cannot understand the reason behind the recommendation and is unable to control the recommendation process. It is very difficult to cover the filtration process for a large amount of data which includes pages and products while maintaining a correct prediction and performance JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 accuracy and this normally happens due to the sparsity of data and the incremental cost of correlation among the users [31], [32]. This feature has a strong effect on internet marketing as well. Personalizing users needs is a much better way of selling items without wasting much time. This feature further pushes the sales ratio and helps merchants convince their customers without confusing them and puzzling them [33]. The internet has now become a strong source of earning money. 
The first step towards selling any item or generating revenue involves marketing of that item and convincing the user that the items which are being offered are of a superior quality and nobody can give them this item with such a high quality and at such a low price. In order to make the first step closer to the user, one way is by personalizing the items for each user regarding his/her area of interest. It means personalization can easily be used to reduce the gap between any two objects which can be a user and a product, a user and a user, a merchant and consumer, a publisher and an advertiser [34], a friend and an enemy and all the other combinations that are currently operating with each other on the internet. In a recent survey conducted by [23] in City University London, it is found that personalization as a whole is becoming really very popular in news sites as well. Electronic News platforms such as WSJ.com, NYTimes.com, FT.com, Guardian.co.uk, BBC News online, WashingtonPost.com, News.sky.com, Telegraph.co.uk, theSun.co.uk, TimesOnline.co.uk and Mirror.co.uk which has almost completely superseded traditional news organizations are right now considered to have one of the highest user viewership platforms globally. Today news sites are highly looking towards these personalization features and trying to adopt both explicit and implicit ways that includes email newsletters, one-to-one collaborative filtering, homepage customization, homepage edition, mobile editions and apps, my page, my stories, RSS feeds, SMS alerts, Twitter feeds and widgets as a former and contextual recommendations/ aggregations, Geo targeted editions, aggregated collaborative filtering, multiple metrics and social collaborative filtering as ways for personalizing the information just to further attract a users attention and to enable users to view specific information according to their interest. Due to the increasing number of viewers day by day these news platforms are becoming one of the biggest sources of internet marketing as well and most of the advertisers from all over the world are trying very hard to offer maximum percentage in terms of PPC (Pay Per Click) and PPS (Pay Per Sale) strategy to place their advertisement on these platforms to increase their sales and to generate revenue. So it is once again proved that personalization is one of the most important features that give a very high support to internet advertising as well. While discussing internet advertising the most popular and fastest way to promote any product or item on the internet is through affiliate marketing [35]. Affiliate Marketing offers different methods as discussed in [36] © 2012 ACADEMY PUBLISHER 287 for the affiliates to generate revenue from the merchants by selling or promoting their items. Web Personalization is playing an important role in reducing the gap between affiliates and advertisers by facilitating affiliates and providing them an easy way of growing with the merchant by making their items sell in a personalized and specific way. With the growing nature of this feature it is proved as confirmed by [37] that the era of personalization has begun and further states that people what they want is a brittle and shallow civic philosophy. It is hard to guess what people really want but still researchers are trying to reach as close as possible. Further in this report the basic structure of web personalization is explained in detail. III. LEARNING This phase is considered one of the compulsory phases of web personalization. 
Learning is the first step towards the implementation of web personalization. The next two phases are totally dependent on this phase. The better this phase is executed, the better and more accurate the next two phases will execute. Different researchers have proposed different methods for learning such as Web Watcher in [38] which learns the user’s interest using reinforcement learning. Similarly Letizia in [39] behaves as an assistant for browsing the web and learns the user’s web behaviour in a conventional web browser. A system in [40] is described as a system that learns user profiles and analyses user behaviour to perform filtered net-news. Similarly in [41] the author uses re-inforcement learning to analyse and learn a user’s preferences and web browsing behaviour. Recent research in [42] proposed a method of semantic web personalization which works on the content structure and based on the ontology terms learns to recognize patterns from the web usage log files. Learning is primarily the process of data collection defined in two different categories as mentioned earlier 1) Explicit Learning and 2) Implicit Learning Which are further elaborated below:A. IMPLICIT LEARNING Implicit learning is a concept which is beneficial since there is no extra time consumption from the user point of view. In this category nobody will ask the user to give feedback regarding the product’s use, nobody will ask the user to insert product feedback ratings, nobody will ask the user to fill feedback forms and in fact nobody will ask the user to spend extra time in giving feedback anywhere and in any form. The system implicitly records different kinds of information related to the user which shows the user’s interest and personalized choices. The three most important sources that are considered while getting implicit feedback for a user includes 1) Reading time of the user at any web page 2) Scrolling over the same page again and again and 3) behavioural interaction with the system. 288 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 1) GEO LOCATIONS: Geolocation technology helps in finding the real location of any object. This is very beneficial as an input to a personalization system and hence most of the popular portals like Google implicitly store geographical location of each user using Google search engine and then personalize the search results for each user according to the geographical location of that user. This concept is becoming very popular in other areas of the internet as well which primarily includes internet advertising [43]. Due to the increase in the mobility of internet spatial information is also becoming pervasive on the web. These mobile devices help in collecting additional information such as context information, location and time related to a particular user’s transaction on the web [44]. Intelligent techniques [45], [46] and [47] are proposed by researchers to record the spatial information in a robust manner and this further plays an additional role of accuracy in personalizing the record of the user. It is evident from the fact that many services on the internet require collection of spatial information in order to become more effective with respect to the needs of the user. Services such as restaurant finders, hospital finders, patrol station finders and post office finders on the internet require spatial information for giving effective recommendations to the users. 
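To make the role of spatial information concrete, here is a minimal sketch (assumed, illustrative code, not taken from the cited systems) of how a location-aware recommender such as a restaurant finder might pre-filter candidate items by distance from the user before any further personalization. The haversine formula and the 5 km cut-off are arbitrary illustrative choices.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (latitude, longitude) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def nearby_items(user_location, items, radius_km=5.0):
    """Keep only candidate items within `radius_km` of the user, closest first,
    so that a later ranking step works on a location-filtered list."""
    lat, lon = user_location
    in_range = [(haversine_km(lat, lon, i["lat"], i["lon"]), i) for i in items]
    return [i for d, i in sorted(in_range, key=lambda pair: pair[0]) if d <= radius_km]

# Toy data: two candidate restaurants and a user position.
restaurants = [
    {"name": "Cafe A", "lat": 48.2086, "lon": 16.3739},
    {"name": "Diner B", "lat": 48.3100, "lon": 16.4000},
]
print([r["name"] for r in nearby_items((48.2082, 16.3738), restaurants)])  # ['Cafe A']
```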
2) BEHAVIORAL LEARNING: In this category the individual behaviour of a user is recorded by taking into consideration the click count of the user at a particular link, the time spent on each page and the search text frequency [13], [40]. Social networking sites are nowadays found to extract the behaviour of each individual and this information is used by many online merchants to personalize pages in accordance with the extracted information retrieved from social sites. An adaptive web is mostly preferred nowadays which changes with time. In order to absorb the change, the web should be capable enough to record user’s interest and can easily adapt the ever increasing changes with respect to the user’s interest in terms of buying or any other activity on the web. Many interesting techniques have been proposed to record user’s behaviour [48], [49] and adapt with respect to the changes by observing the dynamic behaviour of the user. 3) CONTEXTUAL RELATED INFORMATION: There are many organizations like ChoiceStream, 7 Billion People, Inc, Mozenda and Danskin that are working as product development companies and are producing web personalization software that can help online merchants filter records on the basis of this software to give personalized results to their users. Some of these companies are gathering contextual related information from various blogs, video galleries, photo galleries and tweets and based on these aggregated data are producing personalized results. Apart from this since the origin of Web 2.0 the data related to users is becoming very sparse and many learning techniques are proposed by different researchers to extract useful information from this high amount of data by taking into account the tagging behaviour of the user, the collaborative ratings of the user and to record © 2012 ACADEMY PUBLISHER social bookmarking and blogging activities of the user [50], [51]. 4) SOCIAL COLLABORATIVE LEARNING: Online Social Networking and Social Marketing Sites [52] are the best platforms to derive a user’s interest and to analyse user behaviour. Social Collaborative filtering records social interactions among people of different cultures and communities involved together in the form of groups in social networking sites. This clustering of people shows close relationship among people in terms of nature and compatibility among people. Social Collaborative Learning systems learn a user’s interests by taking into account the collaborative attributes of people lying in the same group and give benefit to their users from these socially collaborative data by personalizing their needs on the basis of the filtered information they extract from these social networking sites. This social networking site introduces many new concepts that portray the feature of web personalization like facebook Beacon introduced by Facebook but removed due to privacy issues [53]. 5) SIMULATED FEEDBACKS: This is the latest concept discussed by [54] and [55] in which the researchers have proposed a method for search engine personalization based on web query logs analysis using prognostic search methods to generate implicit feedback. This concept is the next generation personalization method which the popular search engines like Google and yahoo can use to extract implicitly simulated feedbacks from their user’s query logs using AI methods and can personalize their retrieval process. This concept is divided into four steps 1) query formulation 2) searching 3) browsing the results and 4) generating clicks. 
The query formulation works by selecting a search session from user’s historical data and sending the queries sequentially to the search engine.The second steps involves retrieval of data based on the query selected in the previous step.The browsing result session is the most important step in which the patience factor of the user is learned based on the number of clicks per session, maximum page rank clicked in a session, time spent in a session and number of queries executed in each session.The last step is the scoring phase based on the number of clicks the user made on each link in every session. This is one of the dynamic ways proposed to get simulated feedback based on insight from query logs and using artificial methods to generate feedbacks. B. EXPLICIT LEARNING Explicit Learning methods are considered more expensive in terms of time consumption and less efficient in terms of user dependency. This method includes all possible ways that merchants normally adopt to explicitly get their user’s feedback in the form of email newsletter, registration process, user rating, RSS Twitter feeds, blogs, forums and getting feedbacks through widgets. Through explicit learning sometimes the chance of error becomes greater. Error arises because sometimes the user is not in a mood to give feedback and therefore enters bogus information into the explicit panel [56]. JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 1) EMAIL NEWSLETTER: This strategy of being in touch with your registered users is getting very popular day by day [23] and [57]. The sign up process for this strategy will help the merchant find their user’s interest by knowing which product’s update the user want in his/her mailbox regularly. This strategy is the best way of electronic marketing as well as finding the interest of your customers. There are many independent companies like Aweber and Getresponse that are offering this service to most merchants on the internet and people are getting a lot of benefits in terms of revenue generation and building a close personalized relationship with their customers. Tools like iContact [58]have a functionality to do message personalization as well. Message personalization is a strategy through which certain parameters in the email’s content can be generalized and is one more quick and personalized way of explicitly getting feedback by just writing one generic email for all the users. 2) PREFERENCE REGISTRATION: This concept is incorporated by content providing sites such as news sites to get user preferences through registration for the recommendation of content. Every person has his own choice of content view so these news sites have embedded a content preference registration module where a user can enter his/her preference about the content so that the system can personalize the page in accordance with the preferences entered by the users. Most web portals create user profiles using a preference registration mechanism by asking questions of the user during registrations that identify their interest and reason for registering but on the other hand these web portals also have to face various security issues in the end as well [59]. The use of a web mining strategy has reduced this technique of preference registration system [60]. 3) SMS REGISTRATION: Mobile SMS service is being used in many areas starting from digital libraries [61] up to behavioral change intervention in health services [62] as well. 
Today mobile technology is getting popular day by day and people prefer to get regular updates on mobiles instead of their personal desktops inbox. Buyers who till now only expect location-based services through mobile are also expecting time and personalization features in mobile as well [63]. Most websites like Minnesota West are offering SMS registration through which they can get personalized interests of their users explicitly and can send regular updates through SMS on their mobiles regarding the latest news of their products and packages. 4) EXPLICIT USER RATING: Amazon, one of the most popular e-commerce based companies on the internet has incorporated three kinds of rating methods 1) A Star Rating 2) A Review and 3) A Thumbs Up/ Down Rating. The star rating helps the customers judge the quality of the product. A Review rating shows the review of existing customer after buying the product and a Thumbs Up/Down Rating gives the customer’s feedback after reading the reviews of other people related to that product. These explicit user rating methods are one of the biggest sources to judge customer’s needs and desires © 2012 ACADEMY PUBLISHER 289 about the product and Amazon is using this information for personalization purposes. Explicit user rating plays a vital role in identifying user’s need but extra time consumption of this process means that sometimes the user feels very uncomfortable to do it or sometimes the user feels very reluctant in doing it unless and until some benefit is coming out of it [64]. However still websites have incorporated this method to gather data and identify user’s interest. 5) RSS TWITTER FEEDS: RDF Site Summary is used to give regular updates about the blog entries, news headlines, audio and video in a standard format. RSS Feeds help customers get updated information about the latest updates on the merchant’s site. Users sometimes feel very tired searching for their interest related articles and this RSS Feeding feature help users by updating them about the articles of interest.To them this feature of RSS Feeding is very popular among content-oriented sites such as News sites and researchers are trying to evolve techniques to extract feedbacks from these RSS feeds for recommendations [65]. This concept is also being used by many merchants for the personalization process by getting user’s interests with regards to the updates a user requires in the form of RSS. Similarly the twitter social platform is becoming very popular in enabling the user to get updated about the latest information. Most merchants’ sites are offering integration with a user’s twitter account to get the latest feeds of those merchants’ product on the individual’s twitter accounts. Almost 1000+ tweets are generated by more than 200 million people in one second which in itself is an excellent source for recommender systems [66].These two methods are also used by many site owners especially news sites so that they can use this information for personalizing the user page. 6) SOCIAL FEEDBACK PAGES: Social feedback pages are those pages which companies usually build on social-networking sites to get comments from their customers related to the discussion of their products. These product pages are also explicitly used by the merchants to derive personalized interest of their users and to know the emotions of their customers with their products [67]. 
It has now became a trend that every brand, either small or large before introducing itself into the market, first uses the social web to get feedback about their upcoming brand directly from the user and then based on the feedback introduce their own brand into the market [68]. Although the information on the page seems to be very large and raw but still it is considered a very useful way to extract user’s individual perception regarding any product or service. 7) USER FEEDBACK: User Feedback plays a vital role to get a customer’s feedback about the company’s quality of services, quality of products offered and many other things. This information is collected by most merchants to gather a user’s interests so that they can give a personalized view of information to that user next time when the user visits their site.It is identified in [69] that most of the user feedbacks are differentiated in terms of 290 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 explicitness, validity and acquisition costs. It is identified that especially for new users explicit customer requirements as in [70] are also considered as a useful source of user feedbacks for personalization. Overall user feedbacks plays a very important role for recommendation but it is proved that in most of the systems, gradually the explicit user feedbacks decreases with time and sometimes it shows a very negative effect on a user’s behaviour [71]. 8) BLOGS AND FORUMS: Blogs and forums play a vital role in creating a discussion platform where a user can share his views about the product or services he has purchased online. Most e-companies offer these platforms to their customers where customers explicitly give their feedback regarding the products by participating in the forum or by giving comments on articles posted by the vendor related to the products or services. This information is used by the merchant for personalizing their layouts on the basis of user feedbacks from these additional platforms. Semantic Blogging Agent as in [72] is one of the agents proposed by researchers that works as a crawler and extracts semantic related information from the blogs using natural language processing methods to provide personalized services. Blogging is also very popular among mobile users as well. Blogs not only contains the description of various products, services, places or any interesting subject but also contains user’s comments on each article and with mobile technology the participation ratio has increased a lot. Researchers have proposed various content recommendation techniques in blogs for mobile phone users as in [73] and [74]. IV. MATCHING The matching module is another important part of web personalization. The matching module is responsible for extracting the recommendation list of data for the target user by using an appropriate matching technique. Different researchers have proposed more than one matching criterias but all of them lie under three basic categories of matching 1) Content-Based Filtration Technique 2) Rule-Based Reasoning Technique and 3) Collaborative Filtration Technique. A. CONTENT-BASED FILTRATION TECHNIQUE Content-Based filtration approach filters data based on a user’s previous liking based stored data. There are different approaches for the content-based filtration technique. Some merchants have incorporated a rating system and ask customers to rate the content and based on the rating of the individual, filter the content next time for that individual [75]. 
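As a rough illustration of this rating-driven content filtration, the sketch below (hypothetical code; the rating threshold, tokenization and scoring are simplifications) builds a term profile from the items a user rated highly and scores unseen items by how many profile terms they contain. A real system would also remove stop words, apply stemming and weight terms, as discussed next.

```python
from collections import Counter

def build_profile(rated_items, threshold=4):
    """Aggregate the words of items rated at or above `threshold`
    into a term-frequency profile representing the user's interests."""
    profile = Counter()
    for text, rating in rated_items:
        if rating >= threshold:
            profile.update(text.lower().split())
    return profile

def score(profile, text):
    """Score an unseen item by how strongly its words overlap the profile."""
    return sum(profile[w] for w in set(text.lower().split()))

rated = [("city team wins the football cup final", 5),
         ("parliament debates the new tax bill", 1)]
profile = build_profile(rated)
candidates = ["football transfer news for the city team", "tax bill passes parliament"]
print(max(candidates, key=lambda c: score(profile, c)))  # 'football transfer news for the city team'
```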
There is more than one content-based page segmentation approach introduced by researchers through which the page is divided into smaller units using different methods. The content is filtered in each segment and then the decision is made whether this segment of the page is incorporated in the filtered page or not [76], [77]. Content-based filtration technique is feasible only if there is something stored on the basis of content that © 2012 ACADEMY PUBLISHER shows the user’s interest for e.g. it is easy to give a recommendation for the joke about a horse out of many horse related jokes stored in the database on the basis of a user’s previous liking but it is impossible to extract the funniest joke out of all the jokes related to horses; for that one has to use collaborative filtration technique. In order to perform content filtration the text should be structured but for both structured and unstructured data one has to incorporate the process of stemming [78] especially news sites which contains news articles which are example of unstructured data. There are different approaches used for content filtration as mentioned in figure 3. Figure 3. Methods Used in Content Filtration Technique. 1) USER PROFILE: The profile of user plays a vital role in content filtration [79]. The profiles mainly consist of two important pieces of information. 1) The first consists of the user’s preferred choice of data. A user profile contains all the data that shows a user’s interest. The record contains all the data that shows a user’s preference model. 2) Secondly it contains the historical record of the user’s transactions. It contains all the information regarding the ratings of users, the likes and dislikes of the users and all the queries typed by the user for record retrieval. These profiles are used by the content filtration system [80] for displaying a user’s preferred data which will be personalized according to the user’s interest. 2) DECISION TREE: A decision tree is another method used for content filtration. Decision tree is created by recursively partitioning the training data as in [81]. In decision trees a document or a webpage is divided into subgroups and it will be continuously subdivided until a single type of a class is left. Using decision trees it is possible to find the interests of a user but it works well on structured data and in fact it is not feasible for unstructured text classification [82]. 3) RELEVANCE FEEDBACK: Relevance feedback [83] and [84] is used to help users refine their queries on the basis of previous search results. This method is also used for content filtration in which a user rates the documents returned by the retrieval system with respect to their interest. The most common algorithm that is used for relevance feedback purposes is Rocchio’s algorithm [85]. Rocchio’s algorithm maintains the weights for both relevant and non-relevant documents retrieved after the execution of the query and on the basis of a weighted sum incrementally move the query vector towards the cluster of relevant documents and away from irrelevant documents. 4) LINEAR CLASSIFICATION: There are numerous linear classification methods [86], [87] that are used for JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 text categorization purposes. In this method the document is represented in a vector space. 
The learning process will produce an output of n-dimensional weight vector whose dot product with an n-dimensional instance produces a numeric score prediction that leads to a linear regression strategy. The most important benefit of these linear classification approaches is that they can be easily learned on an incremental basis and can easily be deployed on web. 5) PROBABILISTIC METHODS: This is one more technique used for text classification and the method primarily used in it is the Naive Bayesian Classifier [88]. The two most common methods of Bayesian Classifier that are used for text classification are the multinomial model and multivariate Bernoulli as described in [89]. Some probabilistic models are called generative Models. B. COLLABORATIVE FILTERING Most online shops store records related to the buying of products by different customers. It is true that many products can be bought by many customers and it is also true that a single product can be bought by more than one customer but in order to predict which product the new customer should buy it is important to know the number of products that have been bought by other customers with the same background and choice and for this purpose collaborative filtration is performed. Collaborative filtration [27], [90], [91] is the process through which one can predict based on collaborative information from multiple users the list of items for the new users. Collaborative Filtration has some limitations as well that come with the increase in the number of items because it is very difficult to scale this technique to high volume of data while maintaining a reasonable prediction accuracy however apart from these limitations collaborative filtering is the most popular technique that is incorporated by most merchants for personalization. Many collaborative systems are designed on the basis of datasets on which these systems have to be implemented. The collaborative system designed for one dataset where there are more users than items may not work properly for any other type of datasets. The researchers in [92] perform a complete evaluation of collaborative systems with respect to the datasets being used, the methods of prediction and also perform a comparative evaluation of several different evaluation metrics on various nearestneighbour based collaborative filtration algorithms. There are different approaches used for collaborative filtration as mentioned in Fig. 4. Figure 4. Methods Used in Collaborative Filtration Technique. © 2012 ACADEMY PUBLISHER 291 1) MODEL-BASED APPROACH: Model-based approaches such as [93] classify the data based on probabilistic hidden semantic associations among co-occurring objects. Model-based approaches divide the data into multiple segments and based on a user’s likelihood [94] move the specific data into atleast one segment based on the probability and threshold value. Most of the modelbased approaches are computationally very expensive but most of them gather user’s interest and classify them into multiple segments [95], [96] and [97]. 2) MEMORY-BASED APPROACH: Clustering Algorithms such as K-means [98] are considered as the basis for memory-based approaches. The data is clustered and classified based on local centroid of each cluster. Most of the collaborative filtration techniques such as [99] work on user profiles based on their navigational patterns. Similarly [100] performs clustering based on sliding window of time in active sessions and [101] presents a fuzzy concept for clustering. 
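The following minimal sketch (toy data and cosine similarity chosen for illustration, not a reproduction of the cited systems) shows the memory-based idea just described: a target user's likely interest in an unseen item is predicted as a similarity-weighted average of the ratings given by the most similar users.

```python
from math import sqrt

ratings = {  # user -> {item: rating}; toy data for illustration only
    "alice": {"camera": 5, "laptop": 4, "novel": 1},
    "bob":   {"camera": 4, "laptop": 5, "tablet": 4},
    "carol": {"novel": 5, "tablet": 2},
}

def cosine(u, v):
    """Cosine similarity computed over the items two users have both rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    return dot / (sqrt(sum(u[i] ** 2 for i in common)) * sqrt(sum(v[i] ** 2 for i in common)))

def predict(user, item, k=2):
    """Similarity-weighted average of the ratings that the k most similar
    neighbours who have rated `item` gave to that item."""
    neighbours = [(cosine(ratings[user], r), r[item])
                  for name, r in ratings.items() if name != user and item in r]
    neighbours = sorted(neighbours, reverse=True)[:k]
    total_sim = sum(s for s, _ in neighbours)
    return sum(s * r for s, r in neighbours) / total_sim if total_sim else None

print(round(predict("alice", "tablet"), 2))  # ~2.99: a weighted blend of bob's 4 and carol's 2
```

In practice the neighbourhood is found over sparse, clustered profiles (for example via K-means as cited above) rather than by comparing every pair of users, which is what makes the approach scale.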
3) CASE-BASED APPROACH: Most of the times one problem has one solution which represents a case in the case-based reasoning approach. In case-based reasoning [102] if a new customer comes and needs a solution to his/her problem then depending upon the previously stored problems that are linked with at least one case solution, the one which is nearest to the customer’s problem will be considered as the case solution to his/her problem. Case-based recommender are closer to user requirements and work more efficiently and intelligently than normal filtering approaches in a way that every case works as a perfect match for a subset of users and so the data for consideration becomes less as compared to normal filtration approaches which resulted in an increase in performance as well as accuracy. Overall case-based reasoning always helps in improving the quality of recommendations [103]. 4) TAG-BASED APPROACH: A Tag-based approach as in [104] was introduced in collaborative filtering to increase the accuracy of the CF process. Usually two persons like one item based on different reasons such as one person may like a product as he is finding that product funny whereas another user likes that item as he is finding that product entertaining, so a tag is an extra-facility to write a user’s views in one or two short words in the form of a tag that shows his/her reason for his interest and will help in finding the similarity and dissimilarity among user’s interest using collaborative filtration. tagbased filtration sometimes are dependent on additional factors such as popularity of tag, representation of tag and affinity between user and tags [105]. 5) PERSONALITY BASED APPROACH: A Personality-based approach [106] was introduced to add the emotional attitude of the users to the collaborative filtration process which became further useful in reducing the high computational processing in calculating the similarity matrix for all users. User attitude plays a vital role in deriving the likes and dislikes of users so by using a big five personality model [106] the researcher explicitly derive the interest of the user that makes the 292 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 collaborative filtration process more robust and accurate. 6) RULE-BASED FILTRATION: This approach is one more method that is used for personalization purposes. The concept of rule-based approach is elaborated as all the business rules that are created by merchants either on the basis of transactions or on the basis of expert policies to further facilitate or create attraction in their online business. Rule-based approach such as a merchant offers gold membership, silver membership or bronze membership to its customers based on specific rules. Similarly a merchant offers discount coupons to its customers who make purchases on weekends. These rule-based approaches [107] are created in different ways as templatedriven rule-based filtering approach, interestingness-based rule filtering approach, similarity-based rule-filtering approach and incremental profiling approach. Rules are also identified using mining rules as Apriori [108] which is used to discover association rules; similarly Cart is a decision tree [109] used to identify classification rules. The only limitation in the rule-based approach is the creation of invalid or unreasonable rules just on one or two transactions which makes the data very sparse and complex to understand. 
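To illustrate how such rules can be mined from transactions, the sketch below (illustrative code only; it enumerates single-item antecedents, whereas Apriori-style algorithms such as [108] handle arbitrary itemsets) derives association rules that clear minimum support and confidence thresholds. The support threshold is exactly what guards against the limitation noted above, namely rules backed by only one or two transactions.

```python
from itertools import combinations
from collections import Counter

transactions = [  # toy purchase baskets, for illustration only
    {"camera", "memory card"}, {"camera", "memory card", "tripod"},
    {"camera", "tripod"}, {"laptop", "mouse"}, {"laptop", "mouse"},
]

def association_rules(baskets, min_support=0.4, min_confidence=0.7):
    """Return rules A -> B (single-item antecedent and consequent) whose
    support and confidence clear the given thresholds; rules supported by
    very few baskets are discarded by the support threshold."""
    n = len(baskets)
    item_count = Counter(i for b in baskets for i in b)
    pair_count = Counter(frozenset(p) for b in baskets for p in combinations(sorted(b), 2))
    rules = []
    for pair, c in pair_count.items():
        if c / n < min_support:
            continue  # too few supporting transactions to trust the rule
        a, b = tuple(pair)
        for ante, cons in ((a, b), (b, a)):
            confidence = c / item_count[ante]
            if confidence >= min_confidence:
                rules.append((ante, cons, c / n, confidence))
    return rules

for ante, cons, sup, conf in association_rules(transactions):
    print(f"{ante} -> {cons}  support={sup:.2f} confidence={conf:.2f}")
```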
A rule-based approach is very much dependent on the business rules and a sudden change in any rule will have a very high impact on the whole data as well. 7) HYBRID APPROACHES: A single technique is not considered enough to give a recommendation taking into account all the dynamic scenarios for each user. It is true that each user has his own historical background and his own list of likes and dislikes. Sometimes one method of filtration is not enough for one particular case for example collaborative filtration process is not beneficial for a new user with not enough historical background but is proved excellent in other scenarios, similarly a contentbased filtration process is not feasible where a user has not enough data associated with it that shows his likes or dislikes. Taking into account these scenarios researchers have proposed different hybrid methods [26] that include more than one technique [110] for filtration to be used for personalization purpose which could be used on the basis of a union or intersection for recommendation. WEIGHTED APPROACH: In this approach [111] the results of more than one method for filtration are calculated numerically for recommendation. MIXED APPROACH: In this approach [112] the results of more than one approach are displayed based on ranking and the recommender’s confidence of recommendation. SWITCHING APPROACH: In this approach [21] more than one method for filtrations is used in a way that if one is unable to recommend with high confidence the system will be switched to the second method for filtration and if the second as well is unable to recommend with high confidence, the system will switch to the third recommender. FEATURE AUGMENTATION: In this approach [113] a contributing recommendation system is augmented with the actual recommender to increase its performance in terms of recommendation. © 2012 ACADEMY PUBLISHER CASCADING: In this approach [114] the primary and secondary recommenders are organized in a cascading way such that on each retrieval both recommenders break ties with each other for recommendation. V. RECOMMENDATION Recommendation is considered the final phase of personalization whose performance and work is dependent wholly upon the previous two stages. Recommendation is the retrieval process which functions in accordance with the learning and matching phase. The review of all the methods which are discussed in learning and matching phase recommendation is primarily and conclusively based on four main methods that include contentbased recommendation, collaborative-based recommendation, knowledge-based recommendation and based on user-demographics or user demographic profiles. VI. FUTURE DIRECTIONS The overall objective of reviewing the whole era of web personalization is to realize its importance in terms of the facilities it provides to the end-users as well as giving a precise overview of the list of almost all the methods that have been introduced in each of its phases. One more important aim of this review is to give a brief overview of web personalization to those researchers working in other areas of the internet so that they are able to use this feature to evolve some intelligent solutions which match human needs in their areas as well. Some of the highlighted areas of the internet for future directions with respect to web personalization are:1)Internet marketing is the first step towards any product or service recognition on the internet. 
Through web personalization one is able to judge to some extent the browsing needs of the user and if a person is able to see advertisements of those products or services which he/she is looking for then the chances of that person’s interest in buying or even clicking that advertisement’s link will rise. Researchers are already trying to use personalizing features for doing improved social web marketing as in [52] and helping customers in decision making using web personalization [115]. 2)Internet of Things [69] is a recent development of internet. The internet of things will make all the identifiable things communicate with each other wirelessly. This concept of web personalization can offer many applications to IoT (Internet of Things) like personalizing things to control and communicate as per users interest, helping the customer in selecting the shop within a pre-selected shopping list, guidance in interacting with things of the user related to their interest and enabling the things learn from users personalized behaviour. 3)Affiliate Networks are the key platform for both the publishers and advertisers to interact with each other. There is a huge gap [34] between the publisher and advertiser in terms of selecting the most appropriate choice based on similarity. This gap can be reduced JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 using web personalization by collaboratively filtering the transactional profiles of publisher and advertiser and giving recommendations on the basis of best match to both of them. 4)The future of mobile networking also requires personalization, ambient awareness and adaptability [116] in its services. All services need to be personalized for each individual in his or her environment and in accordance with his or her preferences and different services should be adapted on a real time basis. In other words in every field of life which includes aerospace and aviation, automotive systems, telecommunications, intelligent buildings, medical technology, independent living, pharmaceutical, retail, logistics, supply chain management, processing industries, safety, security and privacy requires personalization in them to enable these technologies more user specific and in compatible with human needs. VII. C ONCLUSION Every second there is an increment of data on the web. With this increase of data and information on the web the adoption of web personalization will continue to grow unabated. This trend has now become a need and with the passage of time this trend will enter every field of our life and so in the future we will be provided with everything that we actually require. In this paper we have briefly describe the various research carried out in the area of web personalization. This paper also states how the adoption of web personalization is essential for users to facilitate, organize, personalize and to provide exactly needed data ACKNOWLEDGMENT The authors would like to gratefully acknowledge the careful reviewing of an earlier version of this paper which has greatly improved the paper. R EFERENCES [1] P. Brusilovsky and C. Peylo, “Adaptive and intelligent web-based educational systems,” Int. J. Artif. Intell. Ed., vol. 13, no. 2-4, pp. 159–172, Apr. 2003. [2] W. N. Nicola Henze, “Adapdibility in the kbs hyperbook system,” in Proceedings of the 2nd Workshop on Adaptive Systems and User Modeling on the WWW, 1999. [3] N. Henze and W. 
Zeeshan Khawar Malik is currently a PhD candidate at the University of the West of Scotland. He received his MS and BS (honors) degrees from the University of Central Punjab, Lahore, Pakistan, in 2003 and 2006, respectively. He is an Assistant Professor at the University of the Punjab, Lahore, Pakistan, currently on leave for his PhD studies.

Colin Fyfe is a Personal Professor at the University of the West of Scotland. He has published more than 350 refereed papers and has been Director of Studies for 23 PhDs. He is on the editorial boards of 6 international journals and has been a Visiting Professor at universities in Hong Kong, China, Australia, Spain, South Korea and the USA.

An International Comparison on Need Analysis of Web Counseling System Design
Takaaki GOTO
Graduate School of Informatics and Engineering, The University of Electro-Communications, Japan
Email: gototakaaki@uec.ac.jp
Chieko KATO, Futoshi SUGIMOTO, Kensei TSUCHIDA
Department of Information Sciences and Arts, Toyo University, Japan
Email: {kato-c, f_sugi, kensei}@toyo.jp
doi:10.4304/jetwi.4.3.297-300

Abstract— More and more people are working abroad, and the number of them who suffer from mental health problems is increasing. Because it is often difficult for them to find a place to get treatment, online counseling is becoming increasingly popular. We are building an online web counseling system to support the mental health of people working abroad. The design of such a site matters: the screen is all that users can see, and visitors who may be wary or anxious should feel comfortable enough to try online counseling, so the colors shown on the screen must be chosen with care. We surveyed how online counseling is perceived in different countries and found that the preferred colors for a web counseling site differ depending on the user's country. For web counseling, color is clearly important.

Index Terms— web counseling, design, analysis of variance, international comparison

I. INTRODUCTION

According to figures released by the Ministry of Foreign Affairs, the number of Japanese people living abroad has topped one million since 2005 and is increasing year by year [1]. There are also many foreigners living in and visiting Japan, especially from the U.S. and China [2], [3]. People living in foreign countries face many problems, such as language, culture, human relationships and having to live alone away from their families, and these problems can cause mental health issues. We organized a project team to support people working abroad. Our team includes professionals who help with legal matters, computer technology, psychological problems and nursing [4], [5]. Online counseling is very useful because it can be accessed by people living far away and in their own language. For example, Chinese and English speaking people living in Japan can get help even if they don't speak Japanese.
Japanese people living abroad in remote areas where there are no psychiatrists or psychologists can also seek counseling online.

There are two reasons why design is so important, and both must be taken into account when designing a web counseling system. The first is that counseling is different from other online activities such as shopping or playing games. Counseling users have different needs from ordinary shoppers or casual online users, so the system and its screens must be designed differently; a conventional web site design may not be suitable for counseling. Past research also shows that people with mental health problems look for a particular kind of design, different from the designs that attract other users [6]. Online shopping sites, for example, use designs meant to attract people instantly and only for a moment, whereas counseling is not a spur-of-the-moment activity: people need to feel assured that they can continue it for a certain period of time, and they need to feel confident that they can trust their counselors. The second reason is that there are many foreigners both in Japan and in other countries, and what users consider an effective design differs from country to country. In particular, preferred colors have been found to differ by country. In a study covering 20 countries including Japan, China, Germany, and the U.S., people were asked which colors they like best and feel most familiar with [7]. The color Japanese and American people like best is bright purple, followed by bright red. Chinese people like white best, then bright purple. The color most familiar to Japanese people is bright red, then white; Chinese people are most familiar with bright red, then bright orange; and American people are most familiar with bright red and bright purple. Countries thus have different color preferences. We therefore conducted a survey in each country to find out which colors are appropriate for online counseling. In Japan, China and the U.S., both the background color and the text color were investigated, and the data were analyzed statistically.

II. DETAILS OF SURVEY

There were 122 subjects aged 18 to 25: 39 Japanese, 52 Chinese and 31 Americans. The survey was conducted in January 2010 in Japan, June 2010 in China, and September 2010 in the U.S. In the survey, people were shown five different colors for backgrounds and for text, and their impressions were rated on 6 levels from very good to very bad. Figures 1 and 2 show what the survey looked like.

Figure 1. An example of background color of the Web counseling system (beige).
Figure 2. An example of background color of the Web counseling system (blue).

The subjects were asked which colors they would prefer if they were actually going to use online counseling. They looked at all five colors at the same time, printed on paper, with the colors on the left side and the rankings from very good to very bad on the right side. An analysis was also made of the size of the letters. The background and text colors were selected for their effectiveness in helping people feel relaxed ([8], [9]). The colors were represented as follows: Red: #de424c, Blue: #006b95, Purple: #5f3785, Beige: #ceb59f, Green: #008f59.
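The survey materials and responses lend themselves to a very simple data encoding. The sketch below is not from the paper: it only illustrates the five stimulus colors with the hex values listed above and one hypothetical long-format record type for a single 6-point judgment, with all field names being illustrative assumptions.

# Minimal sketch (not the authors' code): the stimulus palette and one way to record
# a single judgment so that responses can later be grouped by country and color.
from dataclasses import dataclass

# Hex values as listed in the survey description above.
PALETTE = {
    "red": "#de424c",
    "blue": "#006b95",
    "purple": "#5f3785",
    "beige": "#ceb59f",
    "green": "#008f59",
}

@dataclass
class Rating:
    country: str    # "Japan", "China", or "U.S."
    subject: int    # anonymous subject id (illustrative)
    stimulus: str   # "background" or "text"
    color: str      # key into PALETTE
    score: int      # 1 (very bad) .. 6 (very good)

# Example record: a Japanese subject rating the beige background as good.
sample = Rating(country="Japan", subject=1, stimulus="background", color="beige", score=5)
print(sample.color, PALETTE[sample.color], sample.score)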
The intended effects of these colors are summarized in Table I ([8], [9], [10]).

TABLE I. AN EXAMPLE OF THE MEANING OF EACH COLOR.
Red: makes people feel more energetic and gives them a better feeling, but it can also make them feel nervous and more aggressive.
Blue: helps people feel relaxed and calm, but it can also make them feel cold and lonely.
Green: helps people feel peaceful and has a healing effect, but it can also make them feel selfish and lazy.
Purple: gives people a noble feeling; it is a mysterious color that makes people think deeply, but it is not realistic and can make people feel vague and uneasy.
Beige: helps people relax and is thought of as sincere, but many people feel that beige is conservative and unattractive.

III. ANALYSIS AND CONSIDERATION OF RESULTS

Background color and country were used as the two factors in a two-way ANOVA layout. Only the colors that showed a significant difference are considered here. Although no main effect of country was found, the main effect of background color (F(4, 460) = 36.42, p < .01) and the interaction of country and background color (F(8, 460) = 5.09, p < .01) were significant. We therefore examined whether each background color differed among the three countries. For red, the overall impression was not good. People in the U.S. (M = 2.27, SD = 0.94) rated red even worse than people in Japan (M = 3.19, SD = 1.33; the difference between the U.S. and Japan was significant in the multiple comparison, p < .05) or China (M = 3.76, SD = 1.43; the difference between the U.S. and China was significant, p < .01). Figure 3 shows the results for red.

Figure 3. Evaluation for background color red.

Red was a familiar color in all three countries, but as a background color for online counseling it was not preferred, which does not match previous research. Especially in the U.S., red was not suitable as a background color for online counseling.

Next, the analysis for text color is explained. Text color and country were used as the two factors in a two-way ANOVA layout. The main effect of country (F(2, 116) = 4.68, p < .05), the main effect of text color (F(4, 464) = 54.20, p < .01) and the interaction of text color and country (F(8, 464) = 4.82, p < .01) were all significant. We therefore examined whether each text color differed among the three countries. Significant differences were found for text written in red and in green. Red text was rated best in the U.S. (M = 5.0, SD = 0.93) compared to Japan (M = 3.6, SD = 1.14; the difference between the U.S. and Japan was significant, p < .01) and China (M = 4.72, SD = 1.10; the difference between China and Japan was significant, p < .05). Figure 4 shows the results for text written in red.

Figure 4. Evaluation for text color red.

Text written in green was rated best in China (M = 4.57, SD = 1.04; the difference between China and Japan was significant, p < .05) and the U.S. (M = 4.55, SD = 1.04), compared to Japan (M = 3.76, SD = 1.36; the difference between the U.S. and Japan was significant, p < .05). Figure 5 shows the results for green.

Figure 5. Evaluation for text color green.

An analysis was also made of letter size, but no significant difference was found.
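As a rough guide to reproducing this kind of analysis, the sketch below runs a two-way ANOVA with an interaction term and a follow-up pairwise comparison using statsmodels. It is only an illustration: the file name and column names ("country", "color", "rating") are assumptions, the repeated-measures structure of the paper's design is ignored for brevity, and Tukey's HSD stands in for whichever multiple-comparison procedure the authors actually used.

# Minimal sketch (not the authors' analysis code): two-way ANOVA of rating by country
# and color, plus a per-color pairwise comparison between countries.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Assumed long-format table: one row per (subject, color) judgment with columns
# country, color, rating (1..6).
data = pd.read_csv("ratings.csv")

# Main effects of country and background color plus their interaction.
model = smf.ols("rating ~ C(country) * C(color)", data=data).fit()
print(sm.stats.anova_lm(model, typ=2))

# Follow-up comparison of countries for the red background only, mirroring the
# per-color country comparisons reported above (Tukey HSD used here as a stand-in).
red = data[data["color"] == "red"]
print(pairwise_tukeyhsd(red["rating"], red["country"], alpha=0.05))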
IV. CONCLUSION

A survey was conducted to evaluate designs for online counseling, comparing three countries: Japan, the U.S. and China. The results were analyzed statistically. Color preferences for online counseling differ from color preferences in general, and when red is used as a background or text color, the differences among countries must be considered carefully. In future research, we would like to survey many more people living in different countries. The desirable amount of time spent on online counseling, as well as age and gender, will also be considered when designing an online counseling web site.

REFERENCES
[1] The Ministry of Foreign Affairs of Japan, "Annual report of statistics on Japanese nationals overseas," (in Japanese). [Online]. Available: http://www.mofa.go.jp/mofaj/toko/tokei/hojin/index.html
[2] The Ministry of Justice, "About number of foreign residents," (in Japanese). [Online]. Available: http://www.moj.go.jp/nyuukokukanri/kouhou/nyuukokukanri01_00013.html
[3] Japan National Tourism Organization, "Changing numbers of foreign visitors," (in Japanese). [Online]. Available: http://www.jnto.go.jp/jpn/reference/tourism_data/pdf/marketingdata_tourists_after_vj.pdf
[4] Chieko Kato, Yasunori Shiono, Takaaki Goto, and Kensei Tsuchida, "Development of online counseling system and usability evaluation," Journal of Emerging Technologies in Web Intelligence, vol. 3, no. 21, pp. 146–153, 2011.
[5] Takaaki Goto, Chieko Kato, and Kensei Tsuchida, "GUI for online counseling system," Journal of the Visualization Society of Japan, vol. 30, no. 117, pp. 90–95, 2010, (in Japanese).
[6] Akito Kobori, Chieko Kato, Nobuo Takahashi, Kensei Tsuchida, and Heliang Zhuang, "Investigation and analysis of effective images for on-line counseling system," Proceedings of the 2009 IEICE Society Conference, p. 149, 2009, (in Japanese).
[7] Hideaki Chijiiwa, "International comparison of emotion for colors," Journal of Japanese Society for Sensory Evaluation, vol. 6, no. 1, pp. 15–19, 2002, (in Japanese).
[8] Yoshinori Michie, "Psychology and color for comfort," Re, vol. 26, no. 1, pp. 26–29, 2004, (in Japanese).
[9] Keiko Yamawaki, Yokuwakaru Shikisai Shinri. Natsumesha, 2005, (in Japanese).
[10] Jonathan Dee and Lesley Taylor, Color Therapy. Sunchoh Publishing, 2006, (in Japanese).

Takaaki GOTO received his M.E. and Dr. Eng. degrees from Toyo University in 2003 and 2009, respectively. In 2009 he joined the University of Electro-Communications as a Project Assistant Professor at the Center for Industrial and Governmental Relations, and he has since been a Project Assistant Professor in the Graduate School of Informatics and Engineering at the University of Electro-Communications. His main research interests are applications of graph grammars, visual languages, and software development environments. He is a member of IPSJ, IEICE Japan and IEEE.

Chieko KATO graduated from the Faculty of Literature, Shirayuri Women's University in 1997, and received her M.A. from Tokyo University and her Dr. Eng. degree from Hosei University in 1999 and 2007, respectively. She served from 2003 to 2006 as an Assistant Professor at the Oita Prefectural Junior College of Arts and Culture. She currently teaches at Toyo University, which she joined in 2006 as an Assistant Professor, and was promoted to Associate Professor in 2007.
Her research areas include clinical psychology and psychological statistics. She is a member of IEICE Japan, the Design Research Association, and the Japanese Society of Psychopathology of Expression and Arts Therapy.

Futoshi SUGIMOTO received his B.S. degree in communication systems engineering and M.S. degree in management engineering from the University of Electro-Communications, Tokyo, Japan, in 1975 and 1978, respectively, and his Ph.D. degree in computer science from Toyo University, Tokyo, Japan, in 1998. In 1978 he joined Toyo University as a Research Associate in the Department of Information and Computer Sciences. From 1984 to 1999 he was an Assistant Professor, from 2000 to 2005 an Associate Professor, and from 2006 to 2008 a Professor in the same department. Since 2009 he has been a Professor in the Department of Information Sciences and Arts. From April 2000 to March 2001 he was an exchange fellow at the University of Montana, USA. His current research interests are in cognitive engineering and human interfaces. Dr. Sugimoto is a member of the Institute of Image Information and Television Engineers, IPSJ, and the Human Interface Society (Japan).

Kensei TSUCHIDA received his M.S. and D.S. degrees in mathematics from Waseda University in 1984 and 1994, respectively. He was a member of the Software Engineering Development Laboratory, NEC Corporation, from 1984 to 1990. From 1990 to 1992 he was a Research Associate in the Department of Industrial Engineering and Management at Kanagawa University. In 1992 he joined Toyo University, where he was an Instructor until 1995, an Associate Professor from 1995 to 2002, and a Professor from 2002 to 2009 in the Department of Information and Computer Sciences; since 2009 he has been a Professor in the Faculty of Information Sciences and Arts. He was a Visiting Associate Professor in the Department of Computer Science at Oregon State University from 1997 to 1998. His research interests include software visualization, human interfaces, graph languages, and graph algorithms. He is a member of IPSJ, IEICE Japan and the IEEE Computer Society.

Call for Papers and Special Issues

Aims and Scope

Journal of Emerging Technologies in Web Intelligence (JETWI, ISSN 1798-0461) is a peer-reviewed and indexed international journal that aims at gathering the latest advances on various topics in web intelligence and reporting how organizations can gain competitive advantage by applying emergent techniques in real-world scenarios. Papers and studies that couple intelligence techniques and theories with specific web technology problems are mainly targeted. Survey and tutorial articles that emphasize the research and application of web intelligence in a particular domain are also welcomed.
These areas include, but are not limited to, the following:
• Web 3.0
• Enterprise Mashup
• Ambient Intelligence (AmI)
• Situational Applications
• Emerging Web-based Systems
• Ambient Awareness
• Ambient and Ubiquitous Learning
• Ambient Assisted Living
• Telepresence
• Lifelong Integrated Learning
• Smart Environments
• Web 2.0 and Social Intelligence
• Context Aware Ubiquitous Computing
• Intelligent Brokers and Mediators
• Web Mining and Farming
• Wisdom Web
• Web Security
• Web Information Filtering and Access Control Models
• Web Services and Semantic Web
• Human-Web Interaction
• Web Technologies and Protocols
• Web Agents and Agent-based Systems
• Agent Self-organization, Learning, and Adaptation
• Agent-based Knowledge Discovery
• Agent-mediated Markets
• Knowledge Grid and Grid Intelligence
• Knowledge Management, Networks, and Communities
• Agent Infrastructure and Architecture
• Cooperative Problem Solving
• Distributed Intelligence and Emergent Behavior
• Information Ecology
• Mediators and Middlewares
• Granular Computing for the Web
• Ontology Engineering
• Personalization Techniques
• Semantic Web
• Web based Support Systems
• Web based Information Retrieval Support Systems
• Web Services, Services Discovery & Composition
• Ubiquitous Imaging and Multimedia
• Wearable, Wireless and Mobile e-interfacing
• E-Applications
• Cloud Computing
• Web-Oriented Architectures

Special Issue Guidelines

Special issues feature specifically aimed and targeted topics of interest contributed by authors responding to a particular Call for Papers or by invitation, edited by guest editor(s). We encourage you to submit proposals for creating special issues in areas that are of interest to the Journal. Preference will be given to proposals that cover some unique aspect of the technology and ones that include subjects that are timely and useful to the readers of the Journal. A Special Issue is typically made of 10 to 15 papers, with each paper 8 to 12 pages in length.

The following information should be included as part of the proposal:
• Proposed title for the Special Issue
• Description of the topic area to be focused upon and justification
• Review process for the selection and rejection of papers
• Name, contact, position, affiliation, and biography of the Guest Editor(s)
• List of potential reviewers
• Potential authors for the issue
• Tentative time-table for the call for papers and reviews

If a proposal is accepted, the guest editor will be responsible for:
• Preparing the "Call for Papers" to be included on the Journal's Web site.
• Distributing the Call for Papers broadly to various mailing lists and sites.
• Getting submissions, arranging the review process, making decisions, and carrying out all correspondence with the authors. Authors should be informed of the Instructions for Authors.
• Providing us the completed and approved final versions of the papers formatted in the Journal's style, together with all authors' contact information.
• Writing a one- or two-page introductory editorial to be published in the Special Issue.

Special Issue for a Conference/Workshop

A special issue for a Conference/Workshop is usually released in association with the committee members of the Conference/Workshop, such as the general chairs and/or program chairs, who are appointed as the Guest Editors of the Special Issue. A Special Issue for a Conference/Workshop is typically made of 10 to 15 papers, with each paper 8 to 12 pages in length.
Guest Editors are involved in the following steps in guest-editing a Special Issue based on a Conference/Workshop:
• Selecting a title for the Special Issue, e.g. "Special Issue: Selected Best Papers of XYZ Conference".
• Sending us a formal "Letter of Intent" for the Special Issue.
• Creating a "Call for Papers" for the Special Issue, posting it on the conference web site, and publicizing it to the conference attendees. Information about the Journal and Academy Publisher can be included in the Call for Papers.
• Establishing criteria for paper selection/rejection. Papers can be nominated based on multiple criteria, e.g. rank in the review process plus the evaluation from the Session Chairs and the feedback from the Conference attendees.
• Selecting and inviting submissions, arranging the review process, making decisions, and carrying out all correspondence with the authors. Authors should be informed of the Author Instructions. Usually, the Proceedings manuscripts should be expanded and enhanced.
• Providing us the completed and approved final versions of the papers formatted in the Journal's style, together with all authors' contact information.
• Writing a one- or two-page introductory editorial to be published in the Special Issue.

More information is available on the web site at http://www.academypublisher.com/jetwi/.