OSWIR 2005
Open Source Web Information Retrieval
First Workshop on Open Source Web Information Retrieval
Compiègne, France, September 19, 2005
Edited by Michel Beigbeder and Wai Gen Yee
In conjunction with the 2005 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology
ISBN: 2-913923-19-4

Organization

Workshop chairs:
Michel Beigbeder, G2I department, École Nationale Supérieure des Mines de Saint-Étienne, France
Wai Gen Yee, Department of Computer Science, Illinois Institute of Technology, USA

Program Committee:
Abdur Chowdhury, America Online Search and Navigation, USA
Ophir Frieder, Illinois Institute of Technology, USA
David Grossman, Illinois Institute of Technology, USA
Donald Kraft, Louisiana State University, USA
Clement Yu, University of Illinois at Chicago, USA

Reviewers:
Jefferson Heard, Illinois Institute of Technology, USA
Dongmei Jia, Illinois Institute of Technology, USA
Linh Thai Nguyen, Illinois Institute of Technology, USA

Open Source Web Information Retrieval

The World Wide Web has grown to be a primary source of information for millions of people. Due to the size of the Web, search engines have become the major access point for this information. However, "commercial" search engines use hidden algorithms that put the integrity of their results in doubt, so there is a need for open source Web search engines. On the other hand, the Information Retrieval (IR) research community has a long history of developing ideas, models and techniques for finding results in data sources, but finding one's way through all of them is not an easy task. Moreover, their applicability to the Web search domain is uncertain. The goal of the workshop is to survey the fundamentals of the IR domain and to determine the techniques, tools, or models that are applicable to Web search.

This first workshop was organized by Michel Beigbeder from École Nationale Supérieure des Mines de Saint-Étienne (http://www.emse.fr/), France, and Wai Gen Yee from Illinois Institute of Technology (http://www.iit.edu/), USA. It was held on September 19th, 2005 at UTC, the Compiègne University of Technology (http://www.hds.utc.fr/), in conjunction with WI-IAT 2005 (http://www.hds.utc.fr/WI05/), the 2005 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology. We want to thank all the authors of the submitted papers, the members of the program committee (Abdur Chowdhury, Ophir Frieder, David Grossman, Donald Kraft and Clement Yu) and the reviewers: Jefferson Heard, Dongmei Jia, and Linh Thai Nguyen.
Michel Beigbeder and Wai Gen Yee

Table of contents

Pre-processing Text for Web Information Retrieval Purposes by Splitting Compounds into their Morphemes (Sven Abels and Axel Hahn)
Fuzzy Querying of XML documents – The Minimum Spanning Tree (Abdeslame Alilaouar and Florence Sedes)
Link Analysis in National Web Domains (Ricardo Baeza-Yates and Carlos Castillo)
Web Document Models for Web Information Retrieval (Michel Beigbeder)
Static Ranking of Web Pages, and Related Ideas (Wray Buntine)
WIRE: an Open Source Web Information Retrieval Environment (Carlos Castillo and Ricardo Baeza-Yates)
Nutch: an Open-Source Platform for Web Search (Doug Cutting)
Towards Contextual and Structural Relevance Feedback in XML Retrieval (Lobna Hlaoua and Mohand Boughanem)
An Extension to the Vector Model for Retrieving XML Documents (Fabien Laniel and Jean-Jacques Girardot)
Do Search Engines Understand Greek or User Requests "Sound Greek" to them? (Fotis Lazarinis)
Use of Kolmogorov Distance Identification of Web Page Authorship, Topic and Domain (David Parry)
Searching Web Archive Collections (Michael Stack)
XGTagger, an Open-Source Interface Dealing with XML Contents (Xavier Tannier, Jean-Jacques Girardot and Mihaela Mathieu)
The Lifespan, Accessibility and Archiving of Dynamic Documents (Katarzyna Wegrzyn-Wolska)
SYRANNOT: Information Retrieval Assistance System on the Web by Semantic Annotations Re-use (Wiem Yaiche Elleuch, Lobna Jeribi and Abdelmajid Ben Hamadou)
Search in Peer-to-Peer File-Sharing System: Like Metasearch Engines, But Not Really (Wai Gen Yee, Dongmei Jia and Linh Thai Nguyen)

Pre-processing text for web information retrieval purposes by splitting compounds into their morphemes

Sven Abels, Axel Hahn
Department of Business Information Systems, University of Oldenburg, Germany
{ abels | hahn } @ wi-ol.de

Abstract
In web information retrieval, the interpretation of text is crucial.
In this paper, we describe an approach to ease the interpretation of compound words (i.e. words that consist of other words, such as "handshake" or "blackboard"). We argue that in the web information retrieval domain a fast decomposition of these words is necessary, together with a way to split as many words as possible, while we believe that a small error rate is acceptable in return. Our approach allows the decomposition of compounds within a very reasonable amount of time. It is language independent and currently available as an open source realization.

1. Motivation

In web information retrieval, it is often necessary to interpret natural text. For example, in the area of web search engines, a large amount of text has to be interpreted and made available for requests. In this context, it is beneficial not only to provide a full text search of the text as it appears on the web page but also to analyze the text of a website in order to, e.g., provide a classification of web pages in terms of defining its categories (see [1], [2]). A major problem in this domain is the processing of natural language (NLP; see e.g. [3], [4] for some detailed descriptions). In most cases, text is preprocessed before it is analyzed. A very popular method is text stemming, which creates a basic word form out of each word; for example, the word "houses" is replaced with "house", etc. (see [5]). A popular approach for performing text stemming is the Porter stemmer [6]. Apart from stemming, the removal of so-called "stop words" is another popular approach for pre-processing text (see [7]). It removes all unnecessary words such as "he", "well", "to", etc. While both approaches are well established and quite easy to implement, the problem of splitting compounds into their morphemes is more difficult and less often implemented.

Compounds are words that consist of two or more other words (morphemes). They can be found in many of today's languages. For example, the English word "handshake" is composed of the two morphemes "hand" and "shake". In German, the word "Laserdrucker" is composed of the words "Laser" (laser) and "Drucker" (printer). In Spanish we can find "Ferrocarril" (railway), consisting of "ferro" (iron) and "carril" (lane). Splitting compounds into their morphemes is extremely useful when preparing text for further analysis (see e.g. [10]). This is especially valid for the area of web information retrieval because in those cases you often have to deal with a huge amount of text information, located on a large number of websites. Splitting compounds helps to detect the meaning of a word more easily. For example, when looking for synonyms, a decomposition of compound words will help because it is usually easier to find word relations and synonyms for the morphemes than for the compound word. Another important advantage of splitting compounds is the capability of stemming compounds. Most stemming algorithms are able to stem compounds correctly if their last morpheme differs in its grammatical case or number; for example, "sunglasses" will correctly be stemmed to "sunglass" by most algorithms. However, there are cases where stemming does not work correctly for compounds. This happens whenever the first morpheme changes its case or number. For example, the first word in "Götterspeise" (English: jelly) is plural, and the compound will therefore not be stemmed correctly. Obviously, this will result in problems when processing the text for performing, e.g., text searches.

A nice side effect of splitting compounds is that you can use the decomposition to get a unified way of processing words that might be spelled differently. For example, one text might use "containership", while another one uses "container ship" (two words). Using a decomposition will lead to a unified representation.
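As a concrete illustration of the two standard pre-processing steps mentioned in the motivation, the following minimal Python sketch combines stemming and stop-word removal. It is not the authors' code: it assumes the NLTK package is installed for its Porter stemmer, and the tiny stop-word list is a stand-in for a real one.

```python
from nltk.stem import PorterStemmer   # assumes the nltk package is installed

# A tiny illustrative stop-word list; real systems use much larger ones.
STOP_WORDS = {"he", "well", "to", "the", "a", "is", "of"}

stemmer = PorterStemmer()

def preprocess(text):
    """Lowercase the text, drop stop words and stem the remaining tokens."""
    tokens = text.lower().split()
    return [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("He sells houses well"))
```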
2. Difficulties and Specific Requirements

Splitting compounds in the domain of web information retrieval has some specific requirements. Since web information retrieval tasks usually have to deal with a large amount of text information, an approach for splitting compounds into their morphemes has to provide a high speed in order to keep the approach applicable in practice. Additionally, a low failure rate is of course crucial for success. Hence, it is necessary to use an approach that provides a very high speed with an acceptable amount of errors. Another specific requirement for applying a decomposition of words in web information retrieval is the high number of proper names and nouns that are concatenated to proper nouns or figures. For example, we might find words such as "MusicFile128k" or "FootballGame2005". Because of that, it is necessary to use an approach that can deal with unknown strings at the end or at the beginning of a compound.

The main difficulty in splitting compounds is of course the detection of the morphemes. However, even when detecting a pair of morphemes, it does not mean that we have a single splitting that is correct for a word. For example, the German word "Wachstube" has two meanings and could be decomposed into "Wachs" and "Tube" ("wax tube") or into "Wach" (guard) and "Stube" (room), which means "guardhouse". Obviously, both decompositions are correct but have a different meaning. In the following section, we will describe an approach for realizing a decomposition of compounds into morphemes which is designed for dealing with a large amount of text in order to be suited for web information retrieval. In this approach, we focus on providing a high speed with a small failure rate, which we believe is acceptable.

3. Description of the Approach

In our approach, we sequentially split compound words in three phases: (1) direct decomposition of the compound, (2) truncation of the word from left to right, and (3) truncation of the word from right to left. In the first phase, we try to directly split the composed word by using a recursive method findTupel, which aims at detecting the morphemes of the word and returns them as an ordered list. In case we are not able to completely decompose the word, we truncate the word by removing characters starting at the left side of the word. After removing a character, we repeat the recursive findTupel method. If this does not lead to a successful decomposition, we use the same methodology in the third phase to truncate the word from right to left. This enables us to successfully split the word "HouseboatSeason2005" into the tokens { "House", "Boat", "Season", "2005" }, as discussed in the last section. Before starting with the analysis of the word, all non-alphanumeric characters are removed and the word is transformed into lower case.

The main task of our approach is performed in a recursive way. It is realized as a single findTupel method with one parameter, which is the current compound that should be decomposed into morphemes. In case this word is shorter than 3 characters (or null), we simply return the word as it is. In all other cases, it is decomposed into a left part and a right part in a loop. Within each loop iteration, the right part gets one character longer. For example, the word "houseboat" will be handled like this:

Loop No   Left part   Right part
1         Houseboa    t
2         Housebo     at
3         Houseb      oat
4         House       boat
...       ...         ...

Table 1. Decomposition

Within each loop iteration, it is checked whether the right part is a meaningful word that appears in a language-specific wordlist or not. In our implementation, we provided a wordlist of 206,877 words containing different words in singular and plural. In case the right part represents a word of this wordlist, it is checked whether the left part can still be decomposed. In order to do this, the findTupel method is called again with the left part as a new parameter (recursively). In case the right part never represents a valid word, the method returns a null value. If the recursive call returns a value different from null, its result is added to a resulting list, together with the right part; else the loop continues. This ensures that the shortest decomposition of the compound is returned. For some languages, compounds are composed by adding a connecting character between the morphemes. For example, in the German language, one can find an "s" as a connection between words. In order to consider those connecting characters, they are removed when checking if the right part is a valid word or not.
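The three-phase procedure and the findTupel recursion can be summarized in a short sketch. The following Python fragment is not the jWordSplitter code itself but a loose reconstruction of the described behaviour; the tiny wordlist, the function names and the hard-coded German-style connecting "s" are stand-ins chosen for illustration, and the minimum-length rule from the text is folded into the top-level function.

```python
# A small illustrative wordlist; the real jWordSplitter ships a much larger one.
WORDLIST = {"hand", "shake", "house", "boat", "season", "container", "ship"}
MIN_LENGTH = 3          # compounds shorter than this are returned unchanged
CONNECTORS = ("s",)     # e.g. the German linking "s" between morphemes

def find_tupel(word):
    """Recursively decompose `word` into known morphemes, or return None."""
    if word in WORDLIST:
        return [word]
    # Grow the right part one character at a time, as in Table 1.
    for split in range(len(word) - 1, 0, -1):
        left, right = word[:split], word[split:]
        # Strip a connecting character before the dictionary lookup.
        for c in CONNECTORS:
            if right.startswith(c) and right[len(c):] in WORDLIST:
                right = right[len(c):]
        if right in WORDLIST:
            left_parts = find_tupel(left)
            if left_parts is not None:
                return left_parts + [right]
    return None

def split_compound(word):
    """Phase 1 (direct split), then truncation from the left and the right."""
    cleaned = "".join(ch for ch in word if ch.isalnum()).lower()
    if len(cleaned) < MIN_LENGTH:
        return [cleaned]
    parts = find_tupel(cleaned)
    if parts:
        return parts
    for i in range(1, len(cleaned) - 1):            # phase 2: cut from the left
        tail = find_tupel(cleaned[i:])
        if tail:
            return [cleaned[:i]] + tail
    for i in range(len(cleaned) - 1, 1, -1):        # phase 3: cut from the right
        head = find_tupel(cleaned[:i])
        if head:
            return head + [cleaned[i:]]
    return [cleaned]

print(split_compound("handshake"))            # ['hand', 'shake']
print(split_compound("HouseboatSeason2005"))  # ['house', 'boat', 'season', '2005']
```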
4. Managing problems

Basically, the success rate is highly influenced by the quality and completeness of the language-specific word list. The approach benefits from a large number of words. In order to ensure a high speed, we use a hash table for managing the wordlist. A major problem of this approach is the existence of very small words in a language that might lead to a wrong decomposition. For example, the word "a" in English or the word "er" (he) in German can lead to results that change the meaning of the word. For instance, our approach would decompose the word "Laserdrucker" into "Las", "Er", "Druck", "Er" (read-he-print-he) instead of "Laser", "Drucker" (laser printer). In order to avoid this, we use a minimum length for words, which in the current implementation is a length of 4. This made the problem almost disappear in practical scenarios.

5. Related work

There are of course several other approaches that can be used for decomposing compounds. One example is the software Machinese Phrase Tagger from Connexor (http://www.connexor.com), which is a "word tagger" used for identifying the type of a word. It can, however, also be used to identify the morphemes of a word, but it is rather slow on large texts. Another example is SiSiSi (Si3) as described in [8] and [9]. It was not developed for decomposing compounds but for performing hyphenation. It does, however, identify the main hyphenation areas of each word, which are in most cases identical with the morphemes of a compound. More examples can be found in [10] and [11]. (Note that pure hyphenation is not the same as word splitting; it is only equivalent in those cases where each root word has only one syllable.) Existing solutions were, however, not developed for usage in the web information retrieval domain. This means that many of them have a low failure rate but also need a lot of time compared to our approach. In the following section we will therefore perform an evaluation, analyzing the time and quality of our approach.

6. Realization and Evaluation

The approach was realized in Java under the name jWordSplitter and was published as an open source solution using the GPL license. We used the implementation to perform an evaluation of the approach. We analyzed the implementation based on (i) its speed and (ii) its quality, since we think that both are important for web information retrieval.

The speed of our approach was measured in three different cases:
• Using compounds with a large number of morphemes (i.e. consisting of 5 or more morphemes). In this case, our approach was able to split about 50,000 words per minute.
• Using compounds that consist of 1 or 2 morphemes (e.g. "handwriting"). In this case, our approach was able to split about 150,000 words per minute.
• Using words that do not make any sense and cannot be decomposed at all (e.g. "Gnuavegsdfweeerr"). In this test, jWordSplitter was able to process 120,000 words per minute.

In order to test the quality of the approach, we took a list of 200 randomly chosen compounds, consisting of 456 morphemes. The average time for splitting a word was about 80 milliseconds. Within this test set, jWordSplitter was unable to split about 5% of the words completely. Another 6% were decomposed incorrectly. Hence, 89% were decomposed completely and without any errors, and about 94% were either decomposed correctly or at least not decomposed incorrectly. We performed the same test with SiSiSi, which took about twice as long and which was unable to split 16% of the words. However, its failure rate was a bit lower (3%).
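The speed and quality figures above come from a straightforward measurement loop. Below is a hedged sketch of such a harness; split_compound refers to the illustrative splitter sketched earlier (not the real jWordSplitter API), and the gold-standard format of (compound, expected morphemes) pairs is an assumption made for illustration, not the paper's actual test set.

```python
import time

def evaluate(splitter, gold):
    """Time a splitter and score it against gold decompositions."""
    start = time.perf_counter()
    correct = not_split = incorrect = 0
    for word, expected in gold:
        parts = splitter(word)
        if len(parts) <= 1:
            not_split += 1
        elif [p.lower() for p in parts] == [e.lower() for e in expected]:
            correct += 1
        else:
            incorrect += 1
    elapsed = time.perf_counter() - start
    n = len(gold)
    return {"words_per_minute": 60 * n / elapsed,
            "correct": correct / n,
            "not_split": not_split / n,
            "incorrect": incorrect / n}

gold = [("handshake", ["hand", "shake"]),
        ("containership", ["container", "ship"])]
# stats = evaluate(split_compound, gold)   # splitter from the earlier sketch
```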
7. Conclusion

We have presented an approach which, we argue, is suited as a method for preparing text information in web information retrieval scenarios. The approach offers a good compromise between failure rate, speed and the ability to split words. We think that in this domain it is most important to split as many words as possible in a short period of time, while we believe that a small amount of incorrect decompositions is acceptable for achieving this. Our approach can be used to ease the interpretation of text. It could, for example, be used in search engines and classification methods.

8. Further Research and Language Independence

In order to test the effectiveness of our approach, we intend to integrate it into the "Apricot" project. It proposes to offer a complete open source based solution for finding product information on the internet. We therefore provide a way of analyzing product data automatically in order to allow a fast search of products and in order to classify the discovered information. The integration of jWordSplitter in this real-world project will help to evaluate its long-term application and will hopefully also lead to an identification of problematic areas of the approach.

An interesting question is the language independence of the approach. jWordSplitter itself is designed to be fully language independent. It is obvious that its benefit does, however, vary between languages. While word splitting is very important for languages with many compounds, it might lead to fewer advantages in other languages. We therefore intend to extend Lucene (http://lucene.apache.org), an open source search engine, by preprocessing text with jWordSplitter. Afterwards, we will rate its search results before and after the integration. We intend to repeat this test for different languages in order to get a language-dependent statement about the benefits of jWordSplitter.

9. Acknowledgement and Sources

The current implementation contains a German wordlist. Since the approach itself is language independent, it can be used for other languages as well if a wordlist is provided. The sources for jWordSplitter are available online as an open source project under the GPL at: http://www.wi-ol.de/jWordSplitter
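The planned search-engine integration boils down to indexing a compound both as a whole and as its morphemes. The toy inverted index below illustrates that idea in plain Python; it is not Lucene's API, and index_documents and split_compound are illustrative names (the latter from the earlier sketch).

```python
from collections import defaultdict

def index_documents(docs, split_compound):
    """Toy inverted index: compounds are indexed whole and as morphemes."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
            for morpheme in split_compound(token):
                index[morpheme].add(doc_id)
    return index

docs = {1: "cheap containership for sale", 2: "container ship schedules"}
# index = index_documents(docs, split_compound)  # splitter from the earlier sketch
# A query for "ship" would then match both documents.
```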
10. References

[1] Eliassi-Rad, T.; Shavlik, J. W.: Using a Trained Text Classifier to Extract Information. Technical Report, University of Wisconsin, 1999.
[2] Jones, R.; McCallum, A.; Nigam, K.; Riloff, E.: Bootstrapping for Text Learning Tasks. In: IJCAI-99 Workshop on Text Mining: Foundations, Techniques and Applications, pp. 52-63, 1999.
[3] Wilcox, A.; Hripcsak, G.: Classification Algorithms Applied to Narrative Reports. In: Proceedings of the AMIA Symposium, 1999.
[4] Harman, D.; Schäuble, P.; Smeaton, A.: Document Processing. In: Survey of the State of the Art in Human Language Technology, Cambridge Univ. Press, 1998.
[5] Jones, S.; Willet, K.; Willet, P.: Readings in Information Retrieval. Morgan Kaufmann, 1997.
[6] Porter, M. F.: An Algorithm for Suffix Stripping. Program, 14(3), 1980.
[7] Heyer, G.; Quasthoff, U.; Wolff, C.: Möglichkeiten und Verfahren zur automatischen Gewinnung von Fachbegriffen aus Texten. In: Proceedings des Innovationsforums Content Management, 2002.
[8] Kodydek, G.: A Word Analysis System for German Hyphenation, Full Text Search, and Spell Checking, with Regard to the Latest Reform of German Orthography. In: Proceedings of the Third International Workshop on Text, Speech and Dialogue (TSD 2000), Springer-Verlag, 2000.
[9] Kodydek, G.; Schönhacker, M.: Si3Trenn and Si3Silb: Using the SiSiSi Word Analysis System for Pre-Hyphenation and Syllable Counting in German Documents. In: Proceedings of the 6th International Conference on Text, Speech and Dialogue (TSD 2003), Springer-Verlag, 2003.
[10] Andersson, L.: Performance of Two Statistical Indexing Methods, with and without Compound-word Analysis. www.nada.kth.se/kurser/kth/2D1418/uppsatser03/LindaAndersson_compound.pdf, 2003.
[11] Neumann, G.; Piskorski, J.: A Shallow Text Processing Core Engine. In: Proceedings of Online 2002, 25th European Congress Fair for Technical Communication, 2002.
Fuzzy Querying of XML documents – The Minimum Spanning Tree

Abdeslame Alilaouar and Florence Sedes
Link Analysis in National Web Domains

Ricardo Baeza-Yates
ICREA Professor, University Pompeu Fabra
ricardo.baeza@upf.edu

Carlos Castillo
Department of Technology, University Pompeu Fabra
carlos.castillo@upf.edu

Abstract
The Web can be seen as a graph in which every page is a node, and every hyper-link between two pages is an edge. This Web graph forms a scale-free network: a graph in which the distribution of the degree of the nodes is very skewed. This graph is also self-similar, in the sense that a small part of the graph shares most properties with the entire graph. This paper compares the characteristics of several national Web domains by studying the Web graph of large collections obtained using a Web crawler; the comparison unveils striking similarities between the Web graphs of very different countries.

1 Introduction

Large samples from specific communities, such as national domains, have a good balance between diversity and completeness. They include pages inside a common geographical, historical and cultural context that are written by diverse authors in different organizations. National Web domains also have a moderate size that allows good accuracy in the results; because of this, they have attracted the attention of several researchers.

In this paper, we study eight national domains. The collections studied include four collections obtained using WIRE [3]: Brazil (BR domain) [18, 15], Chile (CL domain) [1, 8, 4], Greece (GR domain) [12] and South Korea (KR domain) [7]; three collections obtained from the Laboratory of Web Algorithmics (Dipartimento di Scienze dell'Informazione, Università degli Studi di Milano, http://law.dsi.unimi.it/): Indochina (KH, LA, MM, TH and VN domains), Italy (IT domain) and the United Kingdom (UK domain); and one collection obtained using Akwan [10]: Spain (ES domain) [6]. Our 104-million page sample is less than 1% of the indexable Web [13] but presents characteristics that are very similar to those of the full Web.

Table 1 summarizes the characteristics of the collections. The number of unique hosts was measured by the Internet Systems Consortium (ISC) domain survey (http://www.isc.org/ds/); the last column is the number of pages actually downloaded.

Table 1. Characteristics of the collections.
Collection    Year   Available hosts [mill] (rank)   Pages [mill]
Brazil        2005   3.9 (11th)                      4.7
Chile         2004   0.3 (42nd)                      3.3
Greece        2004   0.3 (40th)                      3.7
Indochina     2004   0.5 (38th)                      7.4
Italy         2004   9.3 (4th)                       41.3
South Korea   2004   0.2 (47th)                      8.9
Spain         2004   1.3 (25th)                      16.2
U.K.          2002   4.4 (10th)                      18.5
By observing the number of available hosts and the downloaded pages in each collection, we consider that most of them have a high coverage. The collections of Brazil and the United Kingdom are smaller samples in comparison with the others, but their sizes are large enough to show results that are consistent with the others.

Zipf's law: the graph representing the connections between Web pages has a scale-free topology. Scale-free networks, as opposed to random networks, are characterized by an uneven distribution of links. For a page p, we have Pr(p has k links) ∝ k^(−θ). We find this distribution on the Web in almost every aspect, and it is the same distribution found by economist Vilfredo Pareto in 1896 for the distribution of wealth in large populations, and by George K. Zipf in 1932 for the frequency of words in texts. This distribution later turned out to be applicable to several domains [19] and was called by Zipf the law of minimal effort.

Section 2 studies the Web graph, and section 3 the hostgraph. The last section presents our conclusions.

2 Web graph

2.1 Degree

The distributions of the indegree and outdegree are shown in Figure 1; both are consistent with a power-law distribution. When examining the distribution of outdegree, we found two different curves: one for smaller outdegrees (less than 20 to 30 out-links) and another one for larger outdegrees. They both show a power-law distribution and we estimated the exponents for both parts separately. For the in-degree, the average power-law exponent θ we observed was 1.9 ± 0.1; this can be compared with the value of 2.1 observed by other authors [9, 11] in samples of the global Web. For the out-degree, the exponent was 0.6 ± 0.2 for small outdegrees, and 2.8 ± 0.8 for large out-degrees; the latter can be compared with the parameters 2.7 [9] and 2.2 [11] found for samples of the global Web.

Figure 1. Histograms of the indegree and outdegree of Web pages, including a fit for a power-law distribution.

2.2 Ranking

One of the main algorithms for link-based ranking of Web pages is PageRank [16]. We calculated the PageRank distribution for several collections and found a power-law in the distribution of the obtained scores, with average exponent 1.86 ± 0.06. In theory, the PageRank exponent should be similar to the indegree exponent [17] (the value they measured for the exponent was 2.1), and this is indeed the case. The distribution of PageRank values can be seen in Figure 2. We also calculated a static version of the HITS scores [14], counting only external links and calculating the scores in the whole graph, instead of only on a set of pages. The tail of the distribution of authority scores also follows a power law. In the case of hub scores, it is difficult to assert that the data follows a power law because the frequencies seem to be much more dispersed. The average exponent observed was 3.0 ± 0.5 for hub score, and 1.84 ± 0.01 for authority score.

Figure 2. Histograms of the scores using PageRank (top), hubs (middle) and authorities (bottom).
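The exponents quoted above can be estimated in several ways; one common quick estimate is a least-squares fit on the log-log degree histogram. The Python sketch below illustrates that idea on synthetic data; it is only an illustration of the concept, not the procedure the authors used, and it assumes numpy is available.

```python
import numpy as np
from collections import Counter

def power_law_exponent(degrees, min_count=10):
    """Estimate theta in Pr(degree = k) ~ k**(-theta) with a log-log fit.

    Degree values seen fewer than `min_count` times are ignored so the
    noisy tail of the histogram does not dominate the fit.
    """
    counts = Counter(int(d) for d in degrees if d > 0)
    total = sum(counts.values())
    ks = np.array([k for k in sorted(counts) if counts[k] >= min_count], dtype=float)
    freqs = np.array([counts[int(k)] / total for k in ks])
    slope, _ = np.polyfit(np.log(ks), np.log(freqs), 1)
    return -slope

# Synthetic check: degrees drawn from a Zipf law with exponent near 2.1.
rng = np.random.default_rng(42)
print(round(power_law_exponent(rng.zipf(2.1, size=200_000)), 2))  # close to 2.1
```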
3 Hostgraph

We studied the hostgraph [11], that is, the graph created by replacing all the nodes representing Web pages of the same Web site by a single node representing the Web site. The hostgraph is a graph in which there is a node for each Web site, and two nodes A and B are connected iff there is at least one link on site A pointing to a page in site B. In this section, we consider only the collections for which we have a hostgraph.
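Collapsing the page graph into the hostgraph described above is mechanical: map every URL to its host and keep each host-to-host arc once. A minimal sketch using only the Python standard library is shown below; it is not the WIRE implementation, and the example URLs are invented.

```python
from urllib.parse import urlsplit

def build_hostgraph(page_links):
    """page_links: iterable of (source_url, target_url) pairs.

    Returns a set of (source_host, target_host) arcs, keeping at most one
    arc per pair of sites and dropping links inside a single site.
    """
    arcs = set()
    for src, dst in page_links:
        src_host = urlsplit(src).hostname
        dst_host = urlsplit(dst).hostname
        if src_host and dst_host and src_host != dst_host:
            arcs.add((src_host, dst_host))
    return arcs

links = [
    ("http://www.example.cl/a.html", "http://www.example.cl/b.html"),   # internal
    ("http://www.example.cl/a.html", "http://portal.example.br/"),
    ("http://www.example.cl/c.html", "http://portal.example.br/news"),  # same arc
]
print(build_hostgraph(links))   # one arc from www.example.cl to portal.example.br
```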
3.1 Degree

The average indegree per Web site (the average number of different Web sites inside the same country linking to a given Web site) was 3.5 for Brazil, 1.2 for Chile, 1.6 for Greece, 37.0 for South Korea and 1.5 for Spain. The histogram of indegrees is consistent with a Zipfian distribution, with parameter 1.8 ± 0.3.

The average outdegree per Web site (the average number of different Web sites inside the same country linked by a given Web site) was 2.2 for Brazil, 2.4 for Chile, 4.8 for Greece, 16.5 for South Korea and 11.2 for Spain. The distribution of outdegree also exhibits a power law with parameter 1.6 ± 0.3.

We also measured the number of internal links, that is, links going to pages inside the same Web site. We normalized this by the number of pages in each Web site, to be able to compare values. We observed a combination of two power-law distributions: one for Web sites with up to 10 internal links per Web page on average, and one for Web sites with more internal links per Web page. For the sites with fewer than 10 internal links per page on average, the parameter for the power law was 1.1 ± 0.3, and for sites with more internal links per page on average, 3.0 ± 0.3.

3.2 Web structure

Broder et al. [9] proposed a partition of the Web graph based on the relationship of pages with the largest strongly connected component (SCC) of the graph. The pages in the largest strongly connected component belong to the category MAIN. All the pages reachable from MAIN by following links forwards belong to the category OUT, and those reachable by following links backwards to the category IN. The rest of the Web that is weakly connected (disregarding the direction of links) to MAIN is in a component called TENDRILS.

By manual inspection we observed that in Brazil and especially in South Korea there is a significant use, and abuse, of DNS wildcarding. DNS wildcarding is a way of configuring DNS servers so that they reply with the same IP address no matter which host name is used in a DNS query.

In [2] we showed that this macroscopic structure is similar at the hostgraph level: the hostgraphs we examined are scale-free networks and have a giant strongly connected component. The distribution of the sizes of their strongly connected components is shown in Figure 3.

Figure 3. Histograms of the sizes of SCCs.

The parameter for the power-law distribution was 2.7 ± 0.7. In Chile, Greece and Spain, a sole giant SCC appears, having at least two orders of magnitude more Web sites than the second largest SCC. In the case of Brazil, there are two giant SCCs. The larger one is a "natural" one, containing Web sites from different domains. The second largest is an "artificial" one, containing only Web sites under a domain that uses DNS wildcarding to create a "link farm" (a strongly connected community of mutual links). In the case of South Korea, we detected at least 5 large link farms.

Regarding the Web structure, the distribution between sites in general gives the component called OUT a large share. If we do not consider sites that are weakly connected to MAIN, IN has on average 8% of the sites, MAIN 28%, OUT 58% and TENDRILS 6%. The sites that are disconnected from MAIN are 40% on average, but contribute less than 10% of the pages.

4 Conclusions

Even though the collections were obtained from countries with different economical, historical and geographical contexts, and speaking different languages, we observed that the results across different collections are always consistent when the observed characteristic exhibits a power law in one collection. In this class we include the distribution of degrees, link-based scores, internal links, etc. Besides links, we are working on a detailed account of the characteristics of the contents and technologies used in several collections [5].

Acknowledgments: We worked with Vicente López in the study of the Spanish Web, with Efthimis N. Efthimiadis in the study of the Greek Web, with Felipe Ortiz, Bárbara Poblete and Felipe Saint-Jean in the studies of the Chilean Web, and with Felipe Lalanne in the study of the Korean Web. We also thank the Laboratory of Web Algorithmics for making their Web collections available for research.

References

[1] R. Baeza-Yates and C. Castillo. Caracterizando la Web Chilena. In Encuentro chileno de ciencias de la computación, Punta Arenas, Chile, 2000. Sociedad Chilena de Ciencias de la Computación.
[2] R. Baeza-Yates and C. Castillo. Relating Web characteristics with link based Web page ranking. In Proceedings of String Processing and Information Retrieval SPIRE, pages 21–32, Laguna San Rafael, Chile, 2001. IEEE CS Press.
[3] R. Baeza-Yates and C. Castillo. Balancing volume, quality and freshness in Web crawling.
In Soft Computing Systems Design, Management and Applications, pages 565–572, Santiago, Chile, 2002. IOS Press Amsterdam. [4] R. Baeza-Yates and C. Castillo. Caracter´ısticas de la Web Chilena 2004. Technical report, Center for Web Research, University of Chile, 2005. [5] R. Baeza-Yates and C. Castillo. Characterization of national Web domains. Technical report, Universitat Pompeu Fabra, July 2005. [6] R. Baeza-Yates, C. Castillo, and V. L´opez. Caracter´ısticas de la Web de España. Technical report, Universitat Pompeu Fabra, 2005. [7] R. Baeza-Yates and F. Lalanne. Characteristics of the Korean Web. Technical report, Korea–Chile IT Cooperation Center ITCC, 2004. [8] R. Baeza-Yates and B. Poblete. Evolution of the Chilean Web structure composition. In Proceedings of Latin American Web Conference, pages 11–13, Santiago, Chile, 2003. IEEE CS Press. [9] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the Web: Experiments and models. In Proceedings of the Ninth Conference on World Wide Web, pages 309–320, Amsterdam, Netherlands, May 2000. ACM Press. [10] A. S. da Silva, E. A. Veloso, P. B. Golgher, , A. H. F. Laender, and N. Ziviani. CoBWeb - A crawler for the Brazilian Web. In Proceedings of String Processing and Information Retrieval (SPIRE), pages 184–191, Cancun, Mxico, 1999. IEEE CS Press. [11] S. Dill, R. Kumar, K. S. Mccurley, S. Rajagopalan, D. Sivakumar, and A. Tomkins. Self-similarity in the web. ACM Trans. Inter. Tech., 2(3):205–223, 2002. [12] E. Efthimiadis and C. Castillo. Charting the Greek Web. In Proceedings of the Conference of the American Society for Information Science and Technology (ASIST), Providence, Rhode Island, USA, November 2004. American Society for Information Science and Technology. [13] A. Gulli and A. Signorini. The indexable Web is more than 11.5 billion pages. In Poster proceedings of the 14th international conference on World Wide Web, pages 902–903, Chiba, Japan, 2005. ACM Press. [14] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632, 1999. [15] M. Modesto, ´a. Pereira, N. Ziviani, C. Castillo, and R. BaezaYates. Un novo retrato da Web Brasileira. In Proceedings of SEMISH, São Leopoldo, Brazil, 2005. [16] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: bringing order to the Web. Technical report, Stanford Digital Library Technologies Project, 1998. [17] G. Pandurangan, P. Raghavan, and E. Upfal. Using Pagerank to characterize Web structure. In Proceedings of the 8th Annual International Computing and Combinatorics Conference (COCOON), volume 2387 of Lecture Notes in Computer Science, pages 330–390, Singapore, August 2002. Springer. [18] E. A. Veloso, E. de Moura, P. Golgher, A. da Silva, R. Almeida, A. Laender, R. B. Neto, and N. Ziviani. Um retrato da Web Brasileira. In Proceedings of Simposio Brasileiro de Computacao, Curitiba, Brasil, 2000. [19] G. K. Zipf. Human behavior and the principle of least effort: An introduction to human ecology. Addison-Wesley, Cambridge, MA, USA, 1949. 18 Web Document Models for Web Information Retrieval Michel Beigbeder G2I department École Nationale Supérieure des Mines 158, cours Fauriel F 42023 SAINT ETIENNE CEDEX 2 Abstract Different Web document models in relation to the hypertext nature of the Web are presented. The Web graph is the most well known and used data extracted from the Web hypertext. 
The ways it has been used in works related to information retrieval are surveyed. Finally, some considerations about the integration of these works in a Web search engine are presented.

1. Web document models

Flat independent pages. The immediate reuse of long-lived Information Retrieval (IR) techniques led to the simplest model of Web documents. It was used by the first search engines: Excite (1993), Lycos (1994), AltaVista (1994), etc. In this model, HTML pages are converted to plain text by removing the tags and keeping the text between the tags. The content of some tags can easily be ignored. Then pages are indexed as flat plain text. The prevailing IR model used with this document model is the vector model, though AltaVista introduced a combination with a Boolean model to select a set of documents which is then ranked with a vector model. The main advantage of this model is that many of the traditional IR tools and techniques can be used straightforwardly.

Structured independent pages. The enhancement over the first model is that some structure of the pages is kept either in the index or considered in the indexing step. For instance, with Boolean-like models, words could be looked for only in the title tag. Such capabilities have been proposed for some time but, like other Boolean capabilities, did not get much public success. With the vector model, the words appearing in the title or sectioning tags (for instance) could receive a greater weight than others. Some search engines mentioned this peculiarity, but, as far as I know, no details were given and no experiments were conducted to prove the effectiveness of these different weighting schemes. These uses of the internal structure of Web documents are very weak compared to the strong internal structure allowed by HTML. But the documents found on the Web are not strongly structured, because many structural elements are misused to obtain page layouts. So the works in IR on structured documents are not useful on the current Web.

Linked pages. In this model the hypertext links represented by the <a href="..."> tags are used to build a directed graph: the Web graph. The nodes of the graph are the pages themselves, and there is one arc from the node P to the node P' iff there is somewhere in the HTML code of P an href link to the page P'. Note that this is a simplification of what is really coded in the HTML, because if there are many href links from P to P', there is still only one arc (otherwise, we would define a multigraph). But the most difficult point here is to define precisely what the nodes are: pages, URLs, or sets of pages. Let us make this choice precise. The pages are identified by their URL, and URLs themselves are composed of nine fields:

<scheme>://<user>:<passwd>@<host>:<port>/<path>;<parameters>?<query>#<fragment>

While the user and passwd fields can be safely ignored, what to do with the parameters and query fields is not trivial. By ignoring them to define the nodes, a graph with fewer nodes and more connectivity is obtained, but the question is then which of the many contents is to be associated with the node. Moreover, using the fragment field would lead either to considering the page as composed of smaller units or to considering these smaller units as the documents to be returned by the search engine. However, due to the poor use of HTML, many of the opening <A NAME="..."> tags are not closed with a </A>, so many fragments are not fully delimited. I therefore think that this field should be ignored.
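The node-definition policy just discussed can be made concrete with a small normalization routine. The Python sketch below is one possible reading of it (drop user/password and fragment, and in this variant also parameters and query), not the paper's prescription; the deduplicated arc set reflects the "one arc even for many href links" simplification.

```python
from urllib.parse import urlsplit

def node_id(url):
    """Normalize a URL into a node identifier for the Web graph.

    Policy assumed here: keep scheme://host/path, drop user/password,
    port handling aside, and drop parameters, query and fragment.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    return f"{parts.scheme}://{host}{parts.path or '/'}"

def build_web_graph(href_pairs):
    """href_pairs: (page_url, target_url) pairs; returns deduplicated arcs."""
    return {(node_id(p), node_id(t)) for p, t in href_pairs}

pairs = [
    ("http://user:pw@www.emse.fr/index.html#top", "http://rim.emse.fr/"),
    ("http://www.emse.fr/index.html?lang=fr", "http://rim.emse.fr/"),   # same arc
]
print(build_web_graph(pairs))   # a single arc from www.emse.fr to rim.emse.fr
```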
Another difficulty is the replication of pages, either actual replication on different servers or replication through different names on a single server. As an example of the second case, both http://rim. Page gathering The page ranking algorithms can be used on emse.fr/ and http://www.emse.fr/fr/transfert/ g2i/depscientifiques/rim/index.html point on the same page. When it is possible to recognize this replication, I think it is better to merge the different URL in a single node because a graph with higher density is obtained and as they refer to the same content there is not the problem of choosing or building such a content. Given some choices regarding the quoted questions, the directed graph can be built. It has been extensively studied [3] [10] and used for information retrieval in particular. We will review some of its usages in section 2. to categorize their neighbors, this idea has been used in combination with the content analysis of the pages [4]. Anchor linked pages This model takes into account more of the HTML code. Each anchor, delimited with a <a href="..."> tag and the corresponding </a> tag, is used to index the page pointed by the href attribute. This idea is still in use in some search engines. Moreover in the Web context where spidering is an essential part of the information retrieval system (IRS) to keep the index up to date, it allows the association of an index to a document (a page) before it is actually loaded. Variations consist in heuristics that take into account not only the anchor text itself but also its neighboring. Note that this is not very different from the first point exposed in section 2 about relevance propagation. 2. Link usage The Web graph between pages is used by many works. In relation to IR it has been used for different goals. Index enhancement and relevance propagation One of the first ideas tested in hypertext environments [8] consists in using the index of neighbors of a node either to index the node in the indexing step, or to use the relevance score values (RSV) of these nodes in the querying step. Both of these methods are based on the idea that the text in a node (a page in the Web context) is not self contained and that the text of the neighbors can give either a context or some precision to the text of the nodes. Savoy conducted many experiments to test this idea. He reports that effectiveness improvements are low with vector and probabilistic models [16] and higher with the Boolean model [17]. Marchiori uses a propagation with some fading for fuzzy metadata [13]. The same scheme could be applied to the term weights in the vector model. Page ranking: PageRank [2] and HITS [9] We will not describe once more here these two methods. The first one attribute a (popularity) score to every page, the second one attributes two (hubbiness and authority) scores to them. The key point is that these scores are independent of the words used either in the documents or in the query. any graph, and hence on any subgraph of the Web. The PageRank algorithm has been used to focus gathering on a given topic [5]. Page categorization If some pages are categorized, it can help Page classification Classification is different from categorization in the sense that classes are not predefined. A method based on co-citation, which was first used in library science [18], is presented in the Web context by Prime et al. [15], it aims to semiautomatically qualify Web pages with metadata. Similar page discovery Dean et al. [6] proposed two solutions to this problem. 
The first one is based on the HITS algorithm and the second one is based on co-citation [18]. Replica discovery Bharat et al. present a survey of techniques to find replicas on the Web [1]. One of them is based on the link structure of the Web. Logical Units Discovery The idea here is akin to that of index enhancement: if pages are not self contained, they need to be indexed or searched with other ones. But here, the context is not built with a breadth first search algorithm on the Web graph, but with other algorithms. Three methods are aimed at augmenting the recall, with the idea that not all the concepts of a conjunctive query are present in a page, but some of them are in neighbor pages [7] [19] [11]. Note that the Dyreson’s method [7] does not use the Web graph but a graph derived from it by taking into account the directory hierarchy coded in the URL. These three methods share the drawback that they take place in a boolean framework. Tajima et al. [20] propose to discover the logical units of information by clustering. To take into account the structure, the similarity between two clusters is zero when there are no links between any page of one cluster and any page of the second cluster, otherwise the similarity is computed with Salton’s model. So there is not a strong use of the link structure. Communities discovery Another approach by Li et al. [12] attempts to discover logical domains — as opposed to the physical domains tied to the host addresses. These domains are of a greater granularity than the logical units of the previous paragraph. Their goal is to cluster the results of a search task. In order to build these domains, the first step consists in finding k (an algorithm parameter) entry points with criteria that take into account the title tag content, the textual content of the URL1 , the in and out degree within the Web graph, etc. In the second step, pages that are not entry points are linked to the first entry point located above considering their URL path (as a result, some pages may stay orphan). Moreover some conditions — minimal size of a domain, radius — influence the constitution of domains. 1 Some 20 words such as welcome, home, people, user, etc. are important. 3. Link tools There are rather few basic methods used in the link usage: 1. graph search (mainly breadth first search); 2. PageRank and HITS algorithms (which are matrix based); 3. co-citation (building the co-citation data is also a matrix; manipulation) 4. clustering (many methods can be used). 4. URL use We already note that Dyreson [7] does use the URL data to discover logical units. In the study conducted by Mizuuchi et al. [14] the URL coded paths are used to discover for every page P one (and only one) entry page, i.e. a page by which a reader is supposed to get through before arriving at P . A page tree is defined by these entry pages. This tree is used to enhance the index of a page with the content of some tags of the ancestors of P . 5. Conclusion and proposition IR integration The works quoted above are not all dealing directly with the IR problem. Many of them were not tested with test collections which are standard in the IR community such those of TREC 2 . So some work has to be done on how to integrate and test these methods in a search engine. Precision enhancement Now, I present some qualitative considerations. Many of these methods are aimed at dealing with the huge size of the Web: everything about some kind of classification or categorization are of this kind. 
Most often, these methods can be applied either before the query as a preprocessing step or on the results of a query. While not explicitly in this direction the PageRank algorithm can be considered of this kind. Due to the very huge size of the Web, many queries, especially the very short queries submitted by the Web users, have many, many answers. The polysemy is much higher than in traditional IR collections. So the use of clues external to the vocabulary can be seen as a discrimination factor to select documents when the collection is very large. Recall enhancement The other usages (Index enhancement and Logical Units Discovery) are aimed on the contrary to enhance recall, which is not often required, or not a priority when too many irrelevant answers are given to the queries. Though, as for me, the Logical Units Discovery methods can be considered in an IR point of view as trying to access to different levels of granularity of documents in the Web space. If we consider that an IR system returns pointers to documents, the notion of document is what is returned by the IR system. So if an IR system returns a Logical Unit which is composed of several pages, this is a higher level of granularity. 2 http://trec.nist.gov/ Proposition: a hierarchical presentation of the Web Many of the queries submitted to search engines have many many answers. The IR traditional relevance and the popularity produce lists of answers. But presenting the results as an ordered list increases the likelihood of missing important, and in some sense rarer, information. This is true especially if the ranking is only done with popularity as this has the effect that the best ranked documents have the more likelihood to get better ranked. I suggest that the results should be presented by clusters, with a number of clusters manageable by the user (from ten to one hundred, it could be a user preference). With iterative clustering, any document would be at a log(n) distance from the root rather than to be at a n distance from the beginning of a sorted list. To help to do that, many possibilities can be considered: • some of the clustering techniques could be applied either on the Web, or on the results of a query; These clustering could be done with similarity based on different clues according to the user information need (text similarity, co-citation similarity, co-occurrence, etc.) • some categorization could be used (particularly open ones 3 ); • Entry Points Discovery and Logical Units Discovery could be used to merge several URL in a single node in the graph; Merging several URLs in a single node has two beneficial effects: it both reduces the size of the graph and the resulting graph has a higher density. Reducing the size of the graph has an influence on the run time of the algorithms, which is important due to the size of the Web and the complexity of some algorithms (clustering for example). Increasing the density is important because the Web graph is rather sparse, and a few proportion of pages are cited (and even fewer are co-cited). So the benefit of the algorithms based on the links is not well spread. • recall enhancement methods could be used when queries give no answers. References [1] K. Bharat, A. Broder, J. Dean, and M. R. Henzinger. A comparison of techniques to find mirrored hosts on the www. IEEE Data Engineering Bulletin, 23(4):21–26, 2000. [2] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1–7):107–117, 1998. [3] A. 
References
[1] K. Bharat, A. Broder, J. Dean, and M. R. Henzinger. A comparison of techniques to find mirrored hosts on the WWW. IEEE Data Engineering Bulletin, 23(4):21–26, 2000.
[2] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1–7):107–117, 1998.
[3] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the web: experiments and models. In 9th International World Wide Web Conference, The Web: The Next Generation, May 2000.
[4] S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks. In L. M. Haas and A. Tiwary, editors, Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 307–318. ACM Press, 1998.
[5] J. Cho, H. Garcia-Molina, and L. Page. Efficient crawling through URL ordering. Computer Networks and ISDN Systems, 30(1–7):161–172, 1998.
[6] J. Dean and M. R. Henzinger. Finding related pages in the world wide web. Computer Networks, 31(11–16):1467–1479, 1999.
[7] C. E. Dyreson. A jumping spider: Restructuring the WWW graph to index concepts that span pages. In A.-M. Vercoustre, M. Milosavljevic, and R. Wilkinson, editors, Proceedings of the Workshop on Reuse of Web Information, held in conjunction with the 7th WWW Conference, pages 9–20, 1998. CSIRO Report Number CMIS 98-11.
[8] M. E. Frisse. Searching for information in a hypertext medical handbook. Communications of the ACM, 31(7):880–886, 1988.
[9] J. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632, 1999.
[10] R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal. The web as a graph. In Proceedings of the Nineteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 2000.
[11] W.-S. Li, K. S. Candan, Q. Vu, and D. Agrawal. Retrieving and organizing web pages by "information unit". In M. R. Lyu and M. E. Zurko, editors, Proceedings of the Tenth International World Wide Web Conference, 2001.
[12] W.-S. Li, O. Kolak, Q. Vu, and H. Takano. Defining logical domains in a web site. In HYPERTEXT '00, Proceedings of the eleventh ACM conference on Hypertext and hypermedia, pages 123–132, 2000.
[13] M. Marchiori. The limits of web metadata and beyond. Computer Networks and ISDN Systems, 30(1–7):1–9, 1998.
[14] Y. Mizuuchi and K. Tajima. Finding context paths for web pages. In HYPERTEXT '99, Proceedings of the tenth ACM Conference on Hypertext and hypermedia: returning to our diverse roots, pages 13–22, 1999.
[15] C. Prime-Claverie, M. Beigbeder, and T. Lafouge. Transposition of the co-citation method with a view to classifying web pages. Journal of the American Society for Information Science and Technology, 55(14):1282–1289, 2004.
[16] J. Savoy. Citation schemes in hypertext information retrieval. In M. Agosti and A. Smeaton, editors, Information Retrieval and Hypertext, pages 99–120. Kluwer Academic Publishers, 1996.
[17] J. Savoy. Ranking schemes in hybrid boolean systems: a new approach. Journal of the American Society for Information Science, 48(3):235–253, 1997.
[18] H. Small. Co-citation in the scientific literature: a new measure of the relationship between two documents. Journal of the American Society for Information Science, 24(4):265–269, 1973.
[19] K. Tajima, K. Hatano, T. Matsukura, R. Sano, and K. Tanaka. Discovery and retrieval of logical information units in web. In R. Wilensky, K. Tanaka, and Y. Hara, editors, Proc. of Workshop on Organizing Web Space (in conjunction with ACM Conference on Digital Libraries '99), pages 13–23, 1999.
[20] K. Tajima, Y. Mizuuchi, M. Kitagawa, and K. Tanaka. Cut as a querying unit for WWW, netnews, e-mail.
In HYPERTEXT '98, Proceedings of the ninth ACM conference on Hypertext and hypermedia: links, objects, time and space–structure in hypermedia systems, pages 235–244, 1998.

Static Ranking of Web Pages, and Related Ideas

Wray Buntine
Complex Systems Computation Group, Helsinki Institute for Information Technology
P.O. Box 9800, FIN-02015 HUT, Finland
wray.buntine@hiit.fi

Abstract
This working paper reviews some different ideas in link-based analysis for search. First, results about static ranking of web pages based on the so-called random-surfer model are reviewed and presented in a unified framework. Second, a topic-based hubs and authorities model using a discrete component method (a variant of ICA and PCA) is developed, and illustrated on the 500,000-page English language Wikipedia collection. Third, a proposal is presented to the community for a Links/Ranking consortium, extracted from the Web Intelligence paper Opportunities from Open Source Search.

1 Introduction
PageRank™, used by Google, and the Hypertext-Induced Topic Selection (HITS) model developed at IBM [9] are the best known of the ranking models, although they represent a very recent part of a much older bibliographic literature (for instance, discussed in [5]). PageRank ranks all pages in a collection and is then used as a static (i.e., query-free) part of query evaluation, whereas HITS is intended to be applied to just the subgraph of pages retrieved with a query, and perhaps some of their neighbors. There is nothing, however, to stop HITS being applied like PageRank to a full collection rather than just query results. PageRank is intended to measure the authority of a webpage on the basis that high-authority pages have other high-authority pages linked to them. HITS is also referred to as the hubs and authorities model: a hub is a web page that is viewed as a reliable source for links to other web pages, whereas an authority is viewed as a reliable content page itself. Generally speaking, good hubs should point to good authorities and vice versa. The literature about these methods is substantial [2, 1]. Here I review these two models, and then discuss their use in an Open Source environment.

2 Random Surfers versus Random Seekers
The PageRank model is based on the notion of an idealised random surfer. The random surfer starts off by choosing from some selection of pages according to an initial probability vector s. When at a new page, the surfer can take one of the outgoing links from the current page, or with some smaller probability restart afresh at a completely new page, again using the initial probability vector s. The general start-restart process is depicted in the graph in Figure 1, where the initial state is labelled start, and the pages themselves form a subgraph T.

Figure 1. Start-Restart for the Random Surfer

Every page in the collection has a link to a matching restart state leading directly to start, and start links back to those pages in the collection with a non-zero entry in s. Note the restart states could be eliminated, but are retained for comparison with the later model. This represents a Markov model once we attach probabilities to outgoing arcs, and the usual analysis of Markov chains and linear systems (see for instance [12]) applies [1]. The computed static rank is the long-run probability of visiting any page in the collection according to the Markov model.
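As a concrete illustration of this computation, the following minimal sketch (a standard power iteration with a restart vector s; our code, not the paper's) estimates the long-run visit probabilities for a toy collection:

import numpy as np

def static_rank(out_links, s, restart=0.15, iters=50):
    """Long-run visit probabilities of the start-restart random surfer:
    with probability `restart` jump according to s, otherwise follow a
    uniformly chosen out-link (pages without out-links always restart)."""
    n = len(s)
    P = np.zeros((n, n))
    for j, links in out_links.items():
        for k in links:
            P[j, k] = 1.0 / len(links)
    pi = np.array(s, dtype=float)
    for _ in range(iters):
        follow = (1 - restart) * (pi @ P)
        # restart mass (plus mass lost at dangling pages) is redistributed via s
        pi = follow + (1 - follow.sum()) * np.array(s)
    return pi

# A tiny 4-page collection: 0 -> 1,2 ; 1 -> 2 ; 2 -> 0 ; 3 -> 2
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
s = [0.25, 0.25, 0.25, 0.25]
print(static_rank(links, s).round(3))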
Extensions to the model include making the initial probability vector s dependent on topic [7, 11], providing a back button so the surfer can reject a new page based on its unsuitability of topic [11, 10], and handling the way in which pages with no outgoing links are dealt with [6, 1]. These extensions make the idealised surfer more realistic, yet no real analysis of the Markov models on real users has been attempted.

A fragment of a graph illustrating the Markov model, from the point of view of surfing from one page, is given in Figure 2.

Figure 2. Local view of Primitive States

From a page j, the surfer decides either to restart with probability r_j, or to click on a link to a new page. Once they decide to click, they try different pages k with probability given by the matrix p, where

\[ p_{j,k} = \begin{cases} 0 & \text{page } j \text{ has no out-link to } k \\ 1/L & \text{page } j \text{ has } L \text{ out-links, one of them to } k \end{cases} \]

but they have a one-time opportunity (to check) either to accept the new page k, given by a_k, or to try again and go back to the intermediate click state. Folding in the various intermediate states (click and the check states) and just keeping the pages and the start and restart states yields a transition matrix starting from a page j:

\[ p(\text{state} \mid \text{page } j) = \begin{cases} r_j & \text{state} = \text{restart} \\ (1 - r_j)\,\dfrac{p_{j,k}\,a_k}{\sum_k p_{j,k}\,a_k} & \text{state} = \text{page } k \end{cases} \qquad (1) \]

Note that in this formulation, if a page j has no outgoing links, then r_j = 1 necessarily. This has the parameters summarised in the following table.

s : initial probabilities for pages, normalised
r : restart probabilities for pages
a : acceptance probabilities for pages

With an appropriate choice of these, all of the common models in the literature can be handled [7, 11, 1].

A new model proposed by Amati, Ounis, and Plachouras [13] is the static absorbing model for the web. The absorbing model is instead based on the notion of a random seeker. The random seeker again surfs the web, but instead of continuously surfing, can "find" a page and thus stop. The general model comparable to Figure 1 is now given by Figure 3.

Figure 3. Start-Stop for the Random Seeker

In the random seeker model, the computed static rank is the long-run probability of stopping at ("finding") any given page. It is thus given by the probabilities for the absorbing states in the Markov model, and again the usual analysis applies. The page-to-page transition probabilities, however, can otherwise be modelled in various ways using Equation (1).

The structure of the graphs suggests that these two models (random surfer versus random seeker) should have a strong similarity in their results. We can work out the exact probabilities by folding the transition matrices. The following lemmas do this.

Lemma 1. Given the random seeker model with parameters s, r, a and p, where r_j = 1 for any page j without outgoing links, let P denote the transition matrix

\[ P_{j,k} = \begin{cases} 0 & j \text{ not linked to } k \\ p_{j,k}\,a_k \,/\, \sum_k p_{j,k}\,a_k & \text{page } j \text{ linked to } k \end{cases} \]

and let D_r denote the diagonal matrix formed using the entries of r. The total probability of the stop states for paths of length less than or equal to n + 2 is given by

\[ D_r \left( I + \sum_{i=1}^{n} \big( (I - D_r)\,P \big)^i \right) s \qquad (2) \]

This can be proven by straightforward enumeration of states. Equation (2) is evaluated in practice using a recurrence relation such as q_0 = s, p_{i+1} = p_i + D_r q_i and q_{i+1} = (I − D_r) P q_i.
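As a concrete illustration, the recurrence just given can be run directly; the following small numpy sketch (our code, not the paper's, and only an illustration under the assumptions of Lemma 1) accumulates the stop probabilities of the random seeker for a toy collection:

import numpy as np

def absorbing_rank(P, r, s, steps=200):
    """Total stop ("finding") probabilities of the random seeker, computed with
    the recurrence q_0 = s, p_{i+1} = p_i + D_r q_i, q_{i+1} = (I - D_r) P q_i.
    P is the row-stochastic page-to-page matrix of Lemma 1 (dangling pages have
    zero rows), r the stop/restart probabilities (r_j = 1 for dangling pages),
    s the initial distribution. Mass is propagated forward, i.e. through P's
    transpose; the column-vector notation above is equivalent up to that choice."""
    n = len(r)
    Dr = np.diag(r)
    step = P.T @ (np.eye(n) - Dr)   # continue with prob. 1 - r_j, then follow a link
    q = s.copy()
    p = np.zeros(n)
    for _ in range(steps):
        p += Dr @ q                 # mass that stops at the current page
        q = step @ q                # mass that keeps seeking
    return p                        # sums towards 1 as steps grows (geometric tail)

# toy collection: page 2 has no out-links, so r[2] = 1
P = np.array([[0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
r = np.array([0.15, 0.15, 1.0])
s = np.full(3, 1 / 3)
p = absorbing_rank(P, r, s)
print(p.round(4), p.sum().round(4))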
Lemma 2. Given the random seeker model with parameters s, r, a and p, where r_j = 1 for any page j without outgoing links, and let P and D_r be defined as above. Assume r_j > 0 for all pages j. The total absorbing probability of the stop states is given by

\[ D_r \left( I - (I - D_r)\,P \right)^{-1} s \qquad (3) \]

The matrix inverse exists. Moreover, the L1 norm of this minus Equation (2) is less than (1 − r_0)^{n+1}/r_0, where r_0 = min_j r_j > 0.

Note that in the standard PageRank interpretation, r_0 = 1 − α, so the remainder is α^{n+1}/(1 − α), the same order as for the PageRank calculation [1].

Proof. Consider q_i = ((I − D_r) P)^i s, and prove the recursion ||q_i||_1 ≤ (1 − r_0)^i. Since P is a probability matrix with some rows zero, ||P q_i||_1 ≤ ||q_i||_1 and hence ||q_{i+1}||_1 ≤ (1 − r_0)^{i+1}. Consider

\[ q_{n,m} = \left( \sum_{i=n+1}^{m} \big( (I - D_r)\,P \big)^i \right) s . \]

Hence ||q_{n,m}||_1 ≤ Σ_{i=n+1}^{m} (1 − r_0)^i, which is ((1 − r_0)^{n+1} − (1 − r_0)^m)/r_0. Thus q_{n,∞} is well defined, and has an upper bound of (1 − r_0)^{n+1}/r_0. Thus the total absorbing probability is given by Equation (2) as n → ∞, with L1 error after n steps bounded by (1 − r_0)^{n+1}/r_0. Since the sum is well defined and converges, it follows that (I − (I − D_r) P)^{-1} exists.

Lemma 3. Given the random surfer model with parameters s, r, a and p, where r_j = 1 for any page j without outgoing links, and let P and D_r be defined as above. Assume r_j > 0 and s_j > 0 for all pages j. Then the long-run probability over pages exists independently of the initial probability over pages and is proportional to

\[ \left( I - (I - D_r)\,P \right)^{-1} s \qquad (4) \]

Proof. Eliminate the start and restart states; then the transition matrix becomes as follows: given a probability over pages of p_i, at the next cycle

\[ p_{i+1} = s\,(r^{\dagger} p_i) + (I - D_r)\,P\,p_i . \]

Since r and s are strictly positive, the Markov chain is ergodic and irreducible [12], and thus the long-run probability over pages exists independently of the initial probability over pages. Consider the fixed point of these equations. Make a change of variables to p′ = D_r p / (r^† p). This is always well defined since the positivity constraints on r ensure r^† p > 0. Then

\[ p' = D_r s + D_r (I - D_r)\,P\,D_r^{-1} p' . \]

Rewriting,

\[ D_r \left( I - (I - D_r)\,P \right) D_r^{-1} p' = D_r s . \]

We know from above that the inverse of the middle matrix expression exists. Thus

\[ p' = D_r \left( I - (I - D_r)\,P \right)^{-1} s . \]

Substituting back for p yields the result.

Note the usual recurrence relation for computing this is p_{i+1} = s (r^† p_i) + (I − D_r) P p_i, and due to the correspondence between Equations (3) and (4), the alternative recurrence for the absorbing model could be adapted as well. The recurrence relation holds: q_0 = s, p_{i+1} = p_i + q_i and q_{i+1} = (I − D_r) P q_i, noting that the final estimate p_{i+1} so obtained needs to be normalised. This can, in fact, be supported on a graphical basis as well.

This correspondence gives us insight into how to improve these models. How might we make the Markov models more realistic? Could the various parameters be learned from click-stream data? While in the surfing model r corresponds to the probability of restarting, in the seeking model it is the probability of accepting a page and stopping. One is more likely to use the back button on such pages, and thus perhaps the acceptance probabilities a should be modified. Some versions are suggested in [6].

3 Probabilistic Hubs and Authorities
A probabilistic authority model for web pages, based on PLSI [8], was presented by [5].
By using the GammaPoisson version of Discrete PCA [4, 3], a generalisation of PLSI using independent Gamma components, this can be extended to a probabilistic version of the hubs and authorities model. The method is topic based in that hubs and authorities are produced for K different topics. An authority matrix Θ gives the authority score for each page j for the k-th topic, θj,k , normalised for each topic. Each page j is a potential hub, with hub scores lj,k for topic k taken from the hub matrix l. The links in a page are modelled independently using the Gamma(1/K, 1) distribution. The occurrences of link j in page i are then Poisson distributed with a mean given by authority scores for the link P weighted by the hub scores for the page, Poisson( k li,k θj,k ). More details of the model, and the estimation of the authority matrix and hub matrix are at [3], To investigate this model, the link structure of the English language Wikipedia from May 2005 was used as data. The output of this analysis is given at http://cosco.hiit.fi/search/MPCA/HA100.html. This is about 500,000 documents and K = 100 hub and authority topics are given. The authority scores are the highest values for a topic k from the authority matrix Θ, and the hub scores are the highest component estimates for topic k for lj,k for a page j. Note a variety of hub and authority models have been investigated in the context of query evaluation [2]. It is not clear if this is the right approach for using these models. Nevertheless, these represent another family of link-based systems than can be used in a search engine, and an alternative definition of authority to the previous section. 25 4 A Trust/Reputation Consortium for Open Source Ranking Having reviewed some methods for link analysis, let use now consider their use. Opportunities for their use abound once the right infrastructure is in place for open source search. Here I describe one general kind of system that could exist in the framework, intended either as an academic or commercial project. On Google the ranking of pages is influenced by the PageRank of websites. Sites appearing in the first page of results for popular and commercially relevant queries get a significant boost in viewership, and thus PageRank has become critical for marketing purposes. This method for computing authority for a web page borrows from early citation analysis, and the broader fields of trust, reputation, and social networks (which blog links could be interpreted to represent) provide new opportunities for this kind of input to search. Analysis of large and complex networks such as the Internet is readily done on todays grid computing networks. What are some scenarios for the use of new kinds of data about authority, trust and reputation, standards set up by a consortium perhaps. A related example is the new OpenID1 , a distributed identity system. ACM could develop a ”computer science site rank” that gives web sites an authority ranking according to ”computer science” relevance and reputation. In this ranking the BBC Sports website would be low, Donald Knuth’s home page high, and Amazon’s Computer Science pages perhaps medium. Our search engines can then incorporate this authority ranking into their own scores when asked to do so. ACM might pay for the development and maintenance of this ranking as a service to its members, possibly incorporating its rich information about citations as well, thus using a sophisticated reputation model well beyond simple PageRank. 
In an open source search network, consumers of these kinds of organisational or professional ranks could be found. To take advantage of such a system, a user could choose to search Australian university web sites via a P2P universities search engine and then enrol with the ACM ranking in order to help rank their results. Yahoo could develop a vendor web site classification that records all websites according to whether they primarily or secondarily perform retail or wholesale services, product information, or product service, extending its current Mindset demonstration2 . This could be coupled with a vendor login service so that venders can manage their entries, and trust capabilities so that some measure of authority exists about the classifications. Using this, search engines then have a trustworthy way of placing web pages into different product genres, and thus commercial and product search could be far more predictable. To take advantage of this, a user could search for product details, but enrol with the Yahoo service classification to restrict their search to relevant pages. Network methods for trust, reputation, community groups, and so forth, could all be invaluable to small local search engines, that cannot otherwise gain a global perspective on their content. They would also serve as a rich area for business potential. References [1] L. A.N., , and M. C.D. Deeper inside pagerank. Internet Mathematics, 1(3):335–400, 2004. [2] A. Borodin, G. Roberts, J. Rosenthal, and P. Tsaparas. Link analysis ranking: algorithms, theory, and experiments. ACM Trans. Inter. Tech., 5(1):231–297, 2005. [3] W. Buntine. Discrete principal component analysis. submitted, 2005. [4] J. Canny. GaP: a factor model for discrete data. In SIGIR 2004, pages 122–129, 2004. [5] D. Cohn and H. Chang. Learning to probabilistically identify authoritative documents. In Proc. 17th International Conf. on Machine Learning, pages 167–174. Morgan Kaufmann, San Francisco, CA, 2000. [6] N. Eiron, K. McCurley, and J. Tomlin. Ranking the web frontier. In WWW ’04: Proceedings of the 13th international conference on World Wide Web, pages 309–318, 2004. [7] T. Haveliwala. Topic-specific pagerank. In 11th World Wide Web, 2002. [8] T. Hofmann. Probabilistic latent semantic indexing. In Research and Development in Information Retrieval, pages 50–57, 1999. [9] J. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632, 1999. [10] F. Mathieu and M. Bouklit. The effect of the back button in a random walk: application for pagerank. In WWW Alt. ’04: Proceedings of the 13th international World Wide Web conference on Alternate track papers & posters, pages 370–371, 2004. [11] M. Richardson and P. Domingos. The intelligent surfer: Probabilistic combination of link and content information in pagerank. In NIPS*14, 2002. [12] S. Ross. Introduction to Probability Models. Academic Press, fourth edition, 1989. [13] I. O. V. Plachouras and G. Amati. The static absorbing model for hyperlink analysis on the web. Journal of Web Engineering, 4(2):165–186, 2005. 1 http://www.openid.net/ 2 Search for Mindset at Yahoo Research 26 WIRE: an Open Source Web Information Retrieval Environment Carlos Castillo Center for Web Research, Universidad de Chile carlos.castillo@upf.edu Ricardo Baeza-Yates Center for Web Research, Universidad de Chile ricardo.baeza@upf.edu Abstract In this paper, we describe the WIRE (Web Information Retrieval Environment) project and focus on some details of its crawler component. 
The WIRE crawler is a scalable, highly configurable, high-performance, open-source Web crawler which we have used to study the characteristics of large Web collections.

1. Introduction
At the Center for Web Research (http://www.cwr.cl/) we are developing a software suite for research in Web Information Retrieval, which we have called WIRE (Web Information Retrieval Environment). Our aim is to study the problems of Web search by creating an efficient search engine. Search engines play a key role on the Web, as searching currently generates more than 13% of the traffic to Web sites [1]. Furthermore, 40% of the users arriving at a Web site for the first time clicked a link from a search engine's results [14].

The WIRE software suite generated several sub-projects, including some of the modules depicted in Figure 1. So far, we have developed an efficient general-purpose Web crawler [6], a format for storing the Web collection, a tool for extracting statistics from the collection and generating reports, and a search engine based on SWISH-E using PageRank with non-uniform normalization [3].

Figure 1. Some of the possible sub-projects of WIRE, highlighting the completed parts.

In some sense, our system is aimed at a specific segment: our objective was to use it to download and analyze collections having on the order of 10^6 to 10^7 documents. This is bigger than most Web sites, but smaller than the complete Web, so we worked mostly with national domains (ccTLDs: country-code top-level domains such as .cl or .gr). The main characteristics of the WIRE crawler are:

High performance and scalability: It is implemented using about 25,000 lines of C/C++ code and designed to work with large volumes of documents and to handle up to a thousand HTTP requests simultaneously. The current implementation would require further work to scale to billions of documents (e.g., process some data structures on disk instead of in main memory). Currently, the crawler is parallelizable, but unlike [8], it has a central point of control.

Configurable and open-source: Most of the parameters for crawling and indexing can be configured, including several scheduling policies. Also, all the programs and the code are freely available under the GPL license. The details about commercial search engines are usually kept as business secrets, but there are a few examples of open-source Web crawlers, for instance Nutch (http://lucene.apache.org/nutch/). Our system is designed to focus more on evaluating page quality, using different crawling strategies, and generating data for Web characterization studies.

Due to space limitations, in this paper we describe only the crawler in some detail. Source code and documentation are available at http://www.cwr.cl/projects/WIRE/. The rest of this paper is organized as follows: section 2 details the main programs of the crawler and section 3 how statistics are obtained. The last section presents our conclusions.

2 Web crawler
In this section, we present the four main programs that are run in cycles during the crawler's execution: manager, harvester, gatherer and seeder, as shown in Figure 2.

Figure 2. Modules of the crawler.

2.1 Manager: long-term scheduling
The "manager" program generates the list of K URLs to be downloaded in this cycle (we used K = 100,000 pages by default). The procedure for generating this list is based on maximizing the "profit" of downloading a page [7].

Figure 3. Operation of the manager program.

The current value of a page depends on an estimation of its intrinsic quality and an estimation of the probability that it has changed since it was crawled. The process for selecting the pages to be crawled next includes (1) filtering out pages that were downloaded too recently, (2) estimating the quality of Web pages, (3) estimating the freshness of Web pages, and (4) calculating the profit for downloading each page. This balances the process of downloading new pages and updating the already-downloaded ones. For example, in Figure 3, the behavior of the manager for K = 2 is depicted. In the figure, it should select pages P1 and P3 for this cycle, as they give the highest profit.

2.2 Harvester: short-term scheduling
The "harvester" program receives a list of K URLs and attempts to download them from the Web. The politeness policy chosen is to never open more than one simultaneous connection to a Web site, and to wait a configurable number of seconds between accesses (default 15). For the larger Web sites, over a certain quantity of pages (default 100), the waiting time is reduced (to a default of 5 seconds).

As shown in Figure 4, the harvester creates a queue for each Web site and opens one connection to each active Web site (sites 2, 4, and 6). Some Web sites are "idle", because they have transferred pages too recently (sites 1, 5, and 7) or because they have exhausted all of their pages for this batch (3). This is implemented using a priority queue in which Web sites are inserted according to a time-stamp for their next visit.

Figure 4. Operation of the harvester program.

Our first implementation used Linux threads and did blocking I/O on each thread. It worked well, but was not able to go over 500 threads even on PCs with 1 GHz processors and 1 GB of RAM. It seems that the entire thread system was designed for only a few threads at the same time, not for higher degrees of parallelization. Our current implementation uses a single thread with non-blocking I/O over an array of sockets. The poll() system call is used to check for activity in the sockets. This is much harder to implement than the multi-threaded version, as in practical terms it involves programming context switches explicitly, but the performance is much better, allowing us to download from over 1000 Web sites at the same time with a very lightweight process.

2.3 Gatherer: parsing of pages
The "gatherer" program receives the raw Web pages downloaded by the harvester and parses them. In the current implementation, only HTML and plain text pages are accepted by the harvester. The parsing of HTML pages is done using an events-oriented parser. An events-oriented parser (such as SAX [12] for XML) does not build a structured representation of the documents: it just generates function calls whenever certain conditions are met. We found that a substantial amount of pages were not well-formed (e.g., tags were not balanced), so the parser must be very tolerant to malformed markup.

The contents of Web pages are stored in variable-sized records indexed by document-id. Insertions and deletions are handled using a free-space list with first-fit allocation. This data structure also implements duplicate detection: whenever a new document is stored, a hash function of its contents is calculated. If there is another document with the same hash function and length, the contents of the documents are compared. If they are equal, the document-id of the original document is returned, and the new document is marked as a duplicate.
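A minimal sketch of this duplicate-detection scheme (illustrative only; the in-memory structures here stand in for WIRE's on-disk records):

import hashlib

class DocumentStore:
    """Store page contents keyed by doc-id, detecting exact duplicates by
    (hash, length) before falling back to a byte-wise comparison."""
    def __init__(self):
        self.contents = {}            # doc_id -> bytes
        self.by_signature = {}        # (sha1, length) -> doc_id of the original

    def store(self, doc_id, content: bytes):
        sig = (hashlib.sha1(content).hexdigest(), len(content))
        original = self.by_signature.get(sig)
        if original is not None and self.contents[original] == content:
            return original, True     # duplicate: return the original doc-id
        self.by_signature[sig] = doc_id
        self.contents[doc_id] = content
        return doc_id, False

store = DocumentStore()
print(store.store(1, b"<html>hello</html>"))   # (1, False)
print(store.store(2, b"<html>hello</html>"))   # (1, True) -> marked as duplicate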
The process for storing a document, given its contents and document-id, is depicted in Figure 5. For storing a document, the crawler has to check first if the document is a duplicate, then search for a place in the free-space list, and then write the document to disk.

Figure 5. Storing the contents of a document.

This module requires support to create large files, as for large collections the disk storage grows over 2 GB and the offset cannot be stored in a variable of type "long". In Linux, the LFS standard [10] provides offsets of type "long long" that are used for disk I/O operations. The usage of continuous, large files for millions of pages, instead of small files, can save a lot of disk seeks, as noted also by Patterson [16].

2.4 Seeder: URL resolver
The "seeder" program receives a list of URLs found by the gatherer, and adds some of them to the collection, according to criteria given in the configuration file. These criteria include patterns for accepting, rejecting, and transforming URLs. Patterns for accepting URLs include domain name and file name patterns. The domain name patterns are given as suffixes (e.g., .cl, .uchile.cl, etc.) and the file name patterns are given as file extensions. Patterns for rejecting URLs include substrings that appear in the parameters of known Web applications (e.g., login, logout, register, etc.) that lead to URLs which are not relevant for a search engine. Finally, to avoid duplicates arising from session ids, patterns for transforming the URLs are used to remove known session-id variables such as PHPSESSID from the URLs.

The structure that holds the URLs is highly optimized for the most common operations during the crawling process: given the name of a Web site, obtain a site-id; given the site-id of a Web site and a local link, obtain a document-id; and given a full URL, obtain both its site-id and document-id. The process for converting a full URL is shown in Figure 6. This process is optimized to exploit the locality of Web links, as most of the links found in a page point to other pages co-located in the same Web site. For this, the implementation uses two hash tables: the first for converting Web site names into site-ids, and the second for converting "site-id + path name" into a doc-id.

Figure 6. For checking a URL: (1) the host name is searched in the hash table of Web site names. The resulting site-id (2) is concatenated with the path and filename (3) to obtain a doc-id (4).

3 Obtaining statistics
To run the crawler on a large collection, the user must specify the site suffix(es) that will be crawled (e.g., .kr or .upf.edu), and has to provide a starting list of "seed" URLs. Also, the crawling limits have to be provided, including the maximum number of pages per site (the default is 25,000) and the maximum exploration depth (the default is 5 levels for dynamic pages and 15 for static pages). There are several configurable parameters, including the amount of time the crawler waits between accesses to a Web site (which can be fine-tuned by distinguishing between "large" and "small" sites), the number of simultaneous downloads, and the timeout for downloading pages, among many others. On a standard PC with a 1 GHz Intel 4 processor and 1 GB of RAM, using standard IDE disks, we usually download and parse about 2 million pages per day.

WIRE stores as much metadata as possible about Web pages and Web sites during the crawl, and includes several tools for extracting this data and for obtaining statistics.
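As a small illustration of the kind of aggregation these tools perform, per-site counts and simple histograms can be computed along the following lines (the metadata fields shown are hypothetical, not WIRE's actual format):

from collections import Counter
from urllib.parse import urlparse

# hypothetical crawl metadata: one record per downloaded page
crawl = [
    {"url": "http://a.example.cl/",       "depth": 0, "http_status": 200},
    {"url": "http://a.example.cl/x.html", "depth": 1, "http_status": 200},
    {"url": "http://b.example.cl/",       "depth": 0, "http_status": 404},
]

pages_per_site = Counter(urlparse(rec["url"]).netloc for rec in crawl)
status_histogram = Counter(rec["http_status"] for rec in crawl)

print(pages_per_site)       # Counter({'a.example.cl': 2, 'b.example.cl': 1})
print(status_histogram)     # Counter({200: 2, 404: 1})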
The analysis includes running link analysis algorithms such as Pagerank [15] and HITS [11], aggregating this information by documents and sites, and generating histograms for almost every property that is stored by the system. It also includes a module for detecting the language of a document based on a dictionary of stopwords in several languages that is included with WIRE. The process for generating reports includes the analysis of the data, its extraction, the generation of gnuplot scripts for plotting, and the compilation of automated reports using LATEX. The generated reports include: distribution of language, histograms of in- and out-degree, link scores, page depth, HTTP response codes, age (including per-site average, minimum and maximum), summations of link scores per site, histogram of pages per site and bytes per site, an analysis by components in the Web structure [5], the distribution of links to multimedia files, and of links to domains that are outside the delimited working set for the crawler. 4 Conclusions So far, we have used WIRE to study large Web collections including the national domains of Brazil [13], Chile [2], Greece [9] and South Korea [4]. We are currently developing a module for supporting multiple text encodings including Unicode. While downloading a few thousands pages from a bunch of Web sites is relatively easy, building a Web crawler that has to deal with millions of pages and also with misconfigured Web servers and bad HTML coding requires solving a lot of technical problems. The source code and the documentation of WIRE, including step-by-step instructions for running a Web crawl and analysing the results, are available at http://www.cwr.cl/projects/WIRE/doc/. References [1] Search Engine Referrals Nearly Double Worldwide. http://websidestory.com/pressroom/pressreleases.html?id=181, 2003. [2] R. Baeza-Yates and C. Castillo. Caracterı́sticas de la Web Chilena 2004. Technical report, Center for Web Research, University of Chile, 2005. [3] R. Baeza-Yates and E. Davis. Web page ranking using link attributes. In Alternate track papers & posters of the 13th international conference on World Wide Web, pages 328– 329, New York, NY, USA, 2004. ACM Press. [4] R. Baeza-Yates and F. Lalanne. Characteristics of the Korean Web. Technical report, Korea–Chile IT Cooperation Center ITCC, 2004. [5] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the Web: Experiments and models. In Proceedings of the Ninth Conference on World Wide Web, pages 309–320, Amsterdam, Netherlands, May 2000. ACM Press. [6] C. Castillo. Effective Web Crawling. PhD thesis, University of Chile, 2004. [7] C. Castillo and R. Baeza-Yates. A new crawling model. In Poster proceedings of the eleventh conference on World Wide Web, Honolulu, Hawaii, USA, 2002. [8] J. Cho and H. Garcia-Molina. Parallel crawlers. In Proceedings of the eleventh international conference on World Wide Web, pages 124–135, Honolulu, Hawaii, USA, 2002. ACM Press. [9] E. Efthimiadis and C. Castillo. Charting the Greek Web. In Proceedings of the Conference of the American Society for Information Science and Technology (ASIST), Providence, Rhode Island, USA, November 2004. American Society for Information Science and Technology. Large File Support in Linux. [10] A. Jaeger. http://www.suse.de/aj/linux lfs.html, 2004. [11] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632, 1999. [12] D. Megginson. 
Simple API for XML (SAX 2.0). http://sax.sourceforge.net/, 2004. [13] M. Modesto, á. Pereira, N. Ziviani, C. Castillo, and R. Baeza-Yates. Un novo retrato da Web Brasileira. In Proceedings of SEMISH, São Leopoldo, Brazil, 2005. [14] J. Nielsen. Statistics for Traffic Referred by Search Engines and Navigation Directories to Useit. http://www.useit.com/about/searchreferrals.html, 2003. [15] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: bringing order to the Web. Technical report, Stanford Digital Library Technologies Project, 1998. [16] A. Patterson. Why writing your own search engine is hard. ACM Queue, April 2004. 30 Nutch: an Open-Source Platform for Web Search Doug Cutting Internet Archive doug@nutch.org Abstract Nutch is an open-source project providing both complete Web search software and a platform for the development of novel Web search methods. Nutch is built on a distributed storage and computing foundation, such that every operation scales to very large collections. Core algorithms crawl, parse and index Web-based data. Plugins extend functionality at various points, including network protocols, document formats, indexing schemas and query operators. 1. Introduction Nutch is an open-source project hosted by the Apache Software Foundation [1]. Nutch provides a complete, high-quality Web search system, as well as a flexible, scalable platform for the development of novel Web search engines. Nutch includes: • a Web crawler; • parsers for Web content; • a link-graph builder; • schemas for indexing and search; • distributed operation, for high scalability; • an extensible, plugin-based architecture. Nutch is implemented in Java and thus runs on many operating systems and a wide variety of hardware. 2. Architecture Nutch has a set of core interfaces implemented by plugins. Plugins implement such things as network protocols, document formats and indexing schemas. Generic algorithms combine the plugins to create a complete system. These algorithms are implemented on a distributed computing platform, making the entire system extremely scalable. 3. Distributed Operation Distribution operation is built in two layers, storage and computation. 3.1 Nutch Distributed File System (NDFS) Storage is provided by the the Nutch Distributed File System (NDFS) which is modeled after the Google File System [2] (GFS). NDFS provides reliable storage across a network of PCs. Files are stored as a sequence of blocks. Each block is replicated on multiple hosts. Replication and fail-over are handled automatically, providing applications with an easy-to-manage, efficient file system that scales to multi-petabyte installations. For small deployments, without large storage requirements, Nutch is easily configured to simply use a local hard drive for all storage, in place of NDFS. 3.2 MapReduce MapReduce is Nutch's distributed computing layer, again inspired by Google [3]. MapReduce, as its name implies, is a two-step operation, map followed by reduce. Input and output data are files containing sequences of keyvalue pairs. During the map step, input data is split into contiguous chunks that are processed on separate nodes. A user-supplied map function is applied to each datum, producing an intermediate data set. Each intermediate datum is then sent to a reduce node, based on a user-supplied partition function. Partitioning is typically a hash function, so that all equivalently keyed intermediate data are all sent to a single reduce node. 
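Before the URL-keyed example that follows, here is a tiny single-process sketch of the map/partition/sort/reduce flow just described (illustrative Python, not Nutch's Java API; all names are ours):

from itertools import groupby

def map_reduce(records, map_fn, reduce_fn, n_reducers=2):
    """Run map over all input records, partition intermediate pairs by key hash,
    sort each partition by key, and apply reduce once per key."""
    partitions = [[] for _ in range(n_reducers)]
    for record in records:
        for key, value in map_fn(record):
            partitions[hash(key) % n_reducers].append((key, value))   # partition step
    output = []
    for part in partitions:                       # each partition = one "reduce node"
        part.sort(key=lambda kv: kv[0])           # reduce nodes sort their input
        for key, group in groupby(part, key=lambda kv: kv[0]):
            output.append(reduce_fn(key, [v for _, v in group]))
    return output

# count incoming links per URL from (source, target) pairs
links = [("a", "b"), ("c", "b"), ("a", "c")]
print(map_reduce(links,
                 map_fn=lambda edge: [(edge[1], 1)],
                 reduce_fn=lambda url, ones: (url, sum(ones))))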
For example, if a map function outputs URL-keyed data, then partitioning by URL hash code sends intermediate data associated with a given URL to a single reduce node. Reduce nodes sort all their input data, then apply a user-supplied reduce function to this sorted map output, producing the final output for the MapReduce operation. All entries with a given key are passed to the reduce function at once. Thus, with URL-keyed data, all data associated with a URL is passed to the reduce function and may be used to generate the final output. The MapReduce system is robust in the face of machine failures and application errors. Thus one may reliably run long-lived applications on tens, hundreds or even thousands of machines in parallel. A single-threaded, in-process implementation of MapReduce is also provided. This is useful not just for debugging, but also to simplify small, single-machine installations of Nutch. 4. Plugins An overview of the primary plugin interfaces is provided below. 4.1 URL Normalizers and Filters These are called on each URL as it enters the system. A URL normalizer transforms URLs to a standard form. Basic implementations perform operations such as lowercasing protocol names (since these are case-independent) and removing default port numbers (e.g., port 80 from HTTP URLs). If an application has more knowledge of particular URLs, then it can easily implement things such as removal of session ids within a URL normalizer. URL filters are used to determine whether a URL is permitted to enter Nutch. One may, for example, wish to exclude queries with query parameters, since these are likely to be dynamically generated content. Or one may use a URL filter to restrict crawling to particular domains, to implement an intranet or vertical search engine. 31 Nutch provides regular-expression based implementations of both URL normalizer and URL filter. Thus most applications need only modify a configuration file containing regular expressions in order to alter URL normalization and filtering. However, if, e.g., an application needs to consult an external database in order to process URLs, that may easily be implemented as a plugin. 4.2 Protocol Plugins A protocol plugin is invoked to retrieve the content of URLs with a given scheme, e.g., HTTP, FTP, FILE, etc. A protocol implementation, given a URL, returns the raw, binary content of that URL, along with metadata (e.g., protocol headers). 4.3 Parser Plugins Parser plugins, given the output of a protocol plugin (raw content and metadata), extract text, links and metadata (author, title, etc.). Links are represented as a pair of strings: the URL that is linked to; and the “anchor” text of the link. Nutch includes parsers for formats such as HTML, PDF, Word, RTF, etc. Since Web content is frequently malformed, robust parsers are required. Nutch currently uses the NekoHTML [4] parser for HTML, which can successfully parse most pages, even those with mismatched tags, those which are truncated, etc. The HTML parser also produces a XML DOM parse tree of each page's content. Plugins may be specified to process this parse tree. For example, a Creative Commons plugin scans this parse tree for Creative Commons license RDF embedded within the HTML. If found, the license characteristics are added to the metadata for the parse so that they may subsequently be indexed and searched. 4.4 Indexing and Query Plugins Nutch uses Lucene for indexing and search. When indexing, each parsed page (along with a list of incoming links, etc.) 
is passed to a sequence of indexing plugins in order to generate a Lucene document to be indexed. Thus plugins determine the schema used; which fields are indexed and how they are indexed. By default, the content, URL and incoming anchor texts are indexed, but one may enable other plugins to index such things as date modified, content-type, language, etc. Queries in Nutch are parsed into an abstract syntax tree, then passed to a sequence of query plugins, in order to generate the Lucene query that is executed. The default indexing plugin generates queries that search the content, URL and anchor fields. Other plugins permit field-specific search, e.g., searching within the URL only, date-range searching, restricting results to particular document types and/or languages, etc. 5. Algorithms Generic algorithms are implemented in terms of the plugins outlined above, in order to perform user-level tasks such as crawling, indexing etc. Each algorithm, except search, is implemented as one or more MapReduce operations. All persistent data may be stored in NDFS for completely distributed operation. 5.1 Crawling The crawling state is kept in a data structure called the crawldb. It consists of a mapping from URLs to a CrawlDatum record. Each CrawlDatum contains a date to next fetch the URL, the status of the URL (fetched, unfetched, gone, etc.), the number of links found to this URL, etc. The crawldb is bootstrapped by inserting a few root URLs. The Nutch crawler then operates in a cycle: 1. generate URLs to fetch from crawldb; 2. fetch these URLs; 3. parse the fetched content; 4. update crawldb with results of fetch and new URLs found when parsing. These steps are repeated. Each step is described in more detail below. 5.1.1 Generate URLs are generated which are due to be fetched (status is not 'gone' and next fetch date is before now). This set of URLs may further be limited so that only the top most linked pages are requested, and so that only a limited number of URLs per host are generated. 5.1.2 Fetch The fetcher is a multi-threaded application that employs protocol plugins to retrieve the content of a set of URLs. 5.1.3 Parse Parser plugins are employed to extract text links and other metadata from the raw binary content. 5.1.4 Update The status of each URL fetched along with the list of linked URLs discovered while parsing are merged with the previous version of the crawldb to generate a new version. URLs which were successfully fetched are marked as such, incoming link counts are updated, and new URLs to fetch are inserted. 5.2 Link Inversion All of the parser link outputs are processed in a single MapReduce operation to generate, for each URL, the set of incoming anchor texts. Associating incoming anchor text with URLs has been demonstrated to dramatically increase the quality of search results. [5] 5.3 Indexing A MapReduce operation is used to combine all information known about each URL: page text, incoming anchor text, title, metadata, etc. This data is passed to the indexing plugins to create a Lucene document that is then added to a Lucene index. 5.4 Search Nutch implements a distributed search system, but, unlike other algorithms, search does not use MapReduce. Separate indexes are constructed for partitions of the collection. Indexes are deployed to search nodes. Each query is broadcast to all search nodes. The top-scoring results over all indexes are presented to the user. 32 6. Status Nutch has an active set of users and developers. 
Many sites are using Nutch today, for both intranet and vertical search applications, scaling to tens of millions of pages. [6] Nutch's search quality rivals that of commercial alternatives [7] at considerably lower costs. [8] Soon we hope that Nutch's public deployments will include multi-billion page search engines. The MapReduce-based version of Nutch described here is under active development. In the course of Summer 2005 we expect to index a billion-page collection using Nutch at the Internet Archive. [6] http://wiki.apache.org/nutch/PublicServers [7] http://www.nutch.org/twiki/Main/Evaluations/OSU_Querie s.pdf [8] http://osuosl.org/news_folder/nutch 7. Acknowledgments The author wishes to thank The Internet Archive, Yahoo!, Michael Cafarella and all who contribute to Nutch. 8. References [1] http://lucene.apache.org/nutch/ [2] Ghemawat, Gobioff, and Leung, The Google File System, 19th ACM Symposium on Operating Systems Principles, Lake George, NY, October, 2003, http://labs.google.com/papers/gfs.html [3] Dean and Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December, 2004, http://labs.google.com/papers/mapreduce.html [4] Clark, CyberNeko HTML Parser, http://people.apache.org/~andyc/neko/doc/html/index.html [5] Craswell, Hawking, and Robertson, Effective site finding using link anchor information, Proceedings of ACM SIGIR 2001, http://research.microsoft.com/users/nickcr/pubs/craswell_s igir01.pdf 33 34 Towards Contextual and Structural Relevance Feedback in XML Retrieval Lobna Hlaoua and Mohand Boughanem IRIT-SIG, 118 route de Narbonne 31068 Toulouse Cedex 4, France {hlaoua,bougha}@irit.fr - Abstract XML Retrieval is a process whose objective is to give the most exhaustive and specific information for given query. Relevance Feedback in XML retrieval has been recently investigated it consists in considering both content and structural information extracted from elements judged relevant, in query reformulation. In this paper, we describe a preliminary approach to select the most expressive keywords and the most appropriate generative structure to be added to the user query. 1. Introduction The Relevance Feedback (RF) is an interactive and evaluative process. It usually consists in enriching an initial request by adding terms extracted from documents judged as relevant by the user. Recently, the new standards of document representation have appeared, in particular, XML (eXtensible Markup Language) developed by W3C [10]. By exploring the characteristics of these new standard, traditional Information Retrieval (IR) that treats a document like only one atomic unit, has been extended to better manage this kind of documents. Indeed due to the structure of XML documents XML retrieval approaches try to select the most relevant part, represented by an XML element, instead of the whole document. As consequence XML retrieval systems offer two type of query expression, the CO (Content Only) query where user express his needs with simple key word, and the CAS (Content And Structure) query where user can add structural. Due to the structure of XML document, the traditional RF task becomes more complicated. Indeed, the RF in traditional IR consists in adding the most expressive keywords extracted from of the relevant document In XML retrieval the situation is quite different. 
The two main questions are: first, how to extract the best terms from elements that have different (semantic) roles; and second, how to select the best generative structure to be added to the query. In this paper we present preliminary work on how one can incorporate the content and the structural information when reformulating the user query. We first give a brief overview of related works in RF and XML retrieval, then we present our approach in section 3. The proposed approach treats the content and the structure separately. In the last section we describe how we will evaluate our approaches in the framework of INEX.

2. Previous Works
In traditional Information Retrieval, RF consists of reformulating the original query according to the user's judgment, or automatically (so-called blind RF). It has been applied in different IR models: the vector space model presented by Rocchio [7]; Tamine [9] defined RF in a connectionist model; and Croft and Harper [1] described RF in an alternative probabilistic model.

In XML retrieval, the number of RF works is still small. Most works are presented within the framework of INEX [2] (INitiative for the Evaluation of XML retrieval). The working group of V. Mihajlovic and G. Ramirez [6] proposed a reformulation strategy applied to the TIJAH model [3]. This model has the same architecture as a database system: it is composed of three levels, conceptual, logical and physical. At the conceptual level, the authors adopted the query language Narrowed Extended XPath (NEXI) proposed by INEX in 2003. The logical level is based on the "score area algebra", in which documents are regarded as sequences of tokens. At the physical level, MonetDB is used to compute the similarity, which is based on three measures: tf (term frequency), cf (collection frequency) and lp (length prior). Reformulation is carried out in two stages: the first consists in extracting from the documents the most relevant elements. This information represents the journal in which the most relevant element is found, the tag of the element, and the size one wishes to retrieve. Another proposal for reformulation was presented by the IBM group [4]. This proposal adapted the Rocchio algorithm [7] to the vector model [5], whose vectors are composed of sub-vectors, each one representing a level of granularity. They applied the Lexical Affinity method to separate the relevant documents from the non-relevant ones.
3. Relevance Feedback in XML documents
Up until now, in Information Retrieval, simple keywords have been used in query expansion. But XML retrieval offers the opportunity to express the user's needs with structural information. The main goal of this preliminary work is to present our investigation of CO and CAS queries. More precisely, we discuss how one can introduce structural constraints in a CO query and how one can correct the structural constraints in a CAS query. These two issues are described separately in the following subsections.

3.1. Contextual relevance feedback
According to previous works in traditional Information Retrieval, we have noticed that the most appropriate method to expand a query in the vector space model is to add weighted keywords that represent the most relevant documents and to reject those that express the irrelevant documents. This method is represented by the formula of Rocchio [7]. In the same way, we no longer use the keywords of whole documents but those of the various components of these documents. Our approach is expressed in the following formula:

\[ Q' = Q + \sum_{np} C_p - \sum_{nnp} C_{np} \]

with:
Q: vector of the initial query,
Q': vector of the new query,
C_p (resp. C_np): vector of a relevant (resp. non-relevant) component,
np (resp. nnp): number of components considered relevant (resp. non-relevant).

To apply this method, we have to select the most important keywords: it is clear that if we added the keywords of all the representative elements, this set would be enormous and would contain various concepts that can bring noise to the retrieval result. For this reason, we give a more important weight to the keywords that are repeated in more than one element. The keyword weight is proportional to the number of appearances in the elements judged relevant. The CO query represents a simple application of contextual RF, but for a CAS query, we have to add the keywords to (or reject them from) the most generative structure; we explain in the following how to restitute the most generative structure.
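A minimal sketch of this contextual expansion (our own illustration of the formula above, using term-frequency vectors as stand-ins for the component vectors):

from collections import Counter

def expand_query(query_terms, relevant, non_relevant):
    """Q' = Q + sum of relevant component vectors - sum of non-relevant ones,
    where a component vector is a term-frequency Counter; terms repeated in
    several relevant elements naturally receive a higher weight."""
    q = Counter(query_terms)
    for comp in relevant:
        q.update(comp)
    for comp in non_relevant:
        q.subtract(comp)
    return {t: w for t, w in q.items() if w > 0}

relevant = [Counter({"thread": 2, "java": 1}), Counter({"thread": 1, "scheduler": 1})]
non_relevant = [Counter({"bean": 2})]
print(expand_query(["java"], relevant, non_relevant))
# {'java': 2, 'thread': 3, 'scheduler': 1}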
3.2. Structural Relevance Feedback
We have seen that contextual RF is based on additional keywords, but in structural RF it is not possible simply to add structure. Thus we have to look for an appropriate method to restitute the generative structure that can help the user to get an improvement in retrieval. Our goal is to define and to restitute the appropriate generative structure. We have to notice that the appropriate generative structure should not be the most generative one, because the latter represents the totality of the document. That is why we have to define the smallest common ancestor (sca). If we consider the following example, we notice that the XML document is represented as a tree in which the root is the totality of the document, the nodes represent the different elements of various granularities, and the leaf nodes are the textual information.

Figure 1: Example of XML document representation

Let us consider the tree structure T:
- Anc[n] is the set of ancestors of node n; it is the set of nodes on the path from the root to n.
- Des[n] is the set of descendants of node n in T; it is the set of nodes having n as an ancestor.
- sca(m, n) is the smallest common ancestor of nodes m and n; it is the first node common to the paths from m and from n towards the root.

If we assume that for a given query the IR system returns the nodes 13, 8 and 4, the task is to decide which structure can be introduced in the query: « book/chapter », « book/chapter/section » or « book/chapter/section/para ». We notice that « book/chapter » is more generative, but the criterion that we have to respect in IR is that the information must be exhaustive 1 and specific 2. So, in our approach, we give more advantage to a structure that is represented by a large number of relevant elements, and we take the element scores into account. The function that calculates the score 'SScore' of each candidate structure (i.e. a structure which can be injected into the query) is the following:

\[ SScore = \sum_{i=1}^{n} S_i \cdot \alpha^{d} \]

with: S_i the score of a relevant element having a common base with the candidate structure, n the number of elements judged relevant, α a constant between 0 and 1, and d the distance which separates the returned node from the last node on the left of the candidate structure.

1 An element is judged exhaustive if it contains all the information needed by the user.
2 An element is judged specific if all the information it contains is related to the subject of the user's query.

Example 1
Q is a CO (Content Only) query, composed of simple keywords: "X, Y". We suppose that there are 3 components judged as relevant, having respectively the following structures: « /A/B/C », « /A/B/F/L/P » and « /A/B/ », and having various weights. It is noticed that the structure « /A/B » represents the common factor of the three components; the reformulated query Q' will be a query of the CAS (Content And Structure) type: Q': /A/B[about(X, Y)].

Example 2
Q is a CAS query expressed in the query language of XFIRM [8]: //A[about(..., X)]//ce:B[Y], where A and B are names of tags of XML document components and X and Y are keywords. This query seeks a sub-component B which contains the keyword Y and belongs to the descendants of an A which is about X. There are 3 components considered to be relevant, having respectively the following structures: « /A/K/C/B », « /A/F/L/B » and « /A/K/B/ », whose corresponding elements have respectively the following weights: 0.5, 0.2 and 0.35. We then apply the formula which calculates the score 'SScore' of each candidate structure. The structures which can be candidates are presented in the following table with their scores (α = 0.8). We have chosen this value arbitrarily; it will be varied in the following experiments. We have to notice that if α is smaller, we give advantage to the more specific structures, and if α is bigger, we give advantage to the more generative structures.

Candidate structure    SScore
/A/K/C/B/              0.5
/A/F/L/B/              0.2
/A/K/B/                0.35
/A/                    0.58
/A/K/                  0.6

Table 1: Measurement of the candidate structure scores

SScore_{/A/K/} = 0.5 · 0.8^2 + 0.35 · 0.8^1 = 0.6
SScore_{/A/} = 0.5 · 0.8^3 + 0.2 · 0.8^3 + 0.35 · 0.8^2 = 0.58

According to this table, we notice that the structure which can be inserted is « /A/K/ ». To introduce it into the structured query we use the aggregation function already used for CAS queries in the XFIRM model [8]. We have to notice that if the structure having the most important weight is the same as the structure of the initial query, we consider the structure ranked second for the aggregation. If we suppose that N and M are two different elements, the node resulting from the aggregation of N and M and its relevance are represented by the pair (l, r_l), where l is the nearest common ancestor and

\[ r_l = aggr_{and}(r_n, r_m, dist(l,n), dist(l,m)) \]

with

\[ aggr_{and}(r_n, r_m, dist(l,n), dist(l,m)) = \frac{r_n}{dist(l,n)} + \frac{r_m}{dist(l,m)} \]

where dist(x, y) is the distance in depth which separates x and y, and r_i is the relevance value of element i. The final reformulated query is the result of the aggregated structure, where the content condition is the initial keywords together with the expansion given by contextual RF.

4. Experiments
The reformulation is applied on XFIRM, a flexible information retrieval model for the storage and querying of XML documents developed within our team. It is based on a data storage scheme and a simple query language, allowing the user to formulate his need using simple keywords or, in a more precise way, by integrating structural constraints on the documents. The similarity measure is based on tf (term frequency) and ief (inverse element frequency). To evaluate the results of our contribution, we have resorted to the company of INEX (INitiative for the Evaluation of XML Retrieval) [2].
The purpose of this campaign is to evaluate XML retrieval systems by providing test collections of XML documents, evaluation procedures and a forum; it allows the participating organizations to compare their results. The test collections for XML retrieval evaluation deal with elements of various granularities. The corpus is composed of papers from the IEEE Computer Society marked up in XML; it constitutes a collection of approximately 750 MB, containing more than 13000 articles published between 1995 and 2004 in 21 journals. An average article is composed of approximately 1500 XML nodes. The evaluation is based on two criteria, exhaustiveness and specificity, and it is the participants' judgment that decides the degree of these two criteria. We have implemented our approach, which will be evaluated at INEX 2005. The final results will be given in November 2005 and, since this is our first participation in the RF task, we do not yet have an official result.

5. Conclusion

In this paper we have presented our research work in XML retrieval. It represents a new approach to the Relevance Feedback task, in which we apply a new strategy of contextual query expansion, and our proposition is to restore the appropriate generative structure in order to obtain the most exhaustive and specific information. In future work we will evaluate our approaches at INEX 2005.

6. References

[1] W. Croft and D. Harper. Using probabilistic models of information retrieval without relevance information. Journal of Documentation, 35(4): 285-295, 1979.
[2] INEX 2004 Workshop Pre-Proceedings. http://inex.is.informatik.uni-duisburg.de:2004/
[3] J. A. List, V. Mihajlovic, A. P. de Vries and G. Ramirez. The TIJAH XML-IR system at INEX 2003 (draft). Proceedings of the INEX 2003 Workshop: 102-109, 2003.
[4] Y. Mass and M. Mandelbrod. Relevance Feedback for XML Retrieval. INEX 2004 Workshop Pre-Proceedings: 154-157, 2004.
[5] Y. Mass, M. Mandelbrod, E. Amitay, Y. Maarek and A. Soffer. JuruXML - an XML retrieval system at INEX'02. Proceedings of the First Workshop of the INitiative for the Evaluation of XML Retrieval (INEX): 73-80, 2002.
[6] V. Mihajlovic, G. Ramirez, A. de Vries and D. Hiemstra. TIJAH at INEX 2004: Modeling Phrases and Relevance Feedback. INEX 2004 Workshop Pre-Proceedings: 141-148, 2004.
[7] J. J. Rocchio. Relevance feedback in information retrieval. The SMART Retrieval System: Experiments in Automatic Document Processing: 313-323, 1971.
[8] K. Sauvagnat, M. Boughanem and C. Chrisment. Searching XML documents using relevance propagation. SPIRE 2004: 242-254, 2004.
[9] L. Tamine and M. Boughanem. Query Optimization Using An Improved Genetic Algorithm. CIKM 2000: 368-373, 2000.
[10] Extensible Markup Language (XML). http://www.w3.org/TR/1998/REC-xml-19980210.

An extension to the vector model for retrieving XML documents
Fabien LANIEL, Jean-Jacques GIRARDOT
École Nationale Supérieure des Mines de Saint-Étienne
158 Cours Fauriel, 42023 Saint-Étienne CEDEX 2, FRANCE
Email: {laniel,girardot}@emse.fr

Abstract

The information retrieval community has worked a lot on the combination of content and structure for building efficient information retrieval systems. With the development of new standards like XML or DocBook, researchers now have a growing body of data for creating and testing such systems.
Many XML query engines have been proposed, but most of them do not include a ranking system, because among all the criteria that can be extracted from a document it is not easy to know which ones make a document more relevant than another. This paper describes a reverse engineering method to determine which criteria are the best for optimizing the system's effectiveness.

1 Introduction

During the last twenty years, research in Information Retrieval (IR) has concentrated on two domains: flat documents, mainly textual documents, and structured data such as the data managed by relational databases. With the advent of XML [8], a new standard for semi-structured data, and the very fast development of corpora of XML documents, new challenges are offered to the research community. As a matter of fact XML, which offers a very versatile format for exchanging and keeping information and data, can handle a large range of usages, from lightly structured textual documents to strongly typed and structured data. There is however a hidden flaw behind this versatility: while most applications know how to read and write XML documents, there exists no tool that can efficiently search large quantities of XML documents.

Actually, XML documents that are mainly textual with little structural information (like the text of a novel) can easily be handled as flat documents using the IR approach; similarly, very structured documents (like the output of a program) can easily be mapped to relations, represented in a classical relational database, and queried with SQL. Between these extremes, documents that mix textual contents with complex structures are not satisfactorily handled by these approaches. These include most "digital documents", such as literary text transcribed with TEI [7] or Shakespeare's plays [6], scientific documents represented in the DocBook [1] format, and most semi-structured information, like that used to build catalogs of industrial products, food, furniture, travel, etc. In this last case, we expect to use both the contents and the structure of the document to answer a query efficiently.

Many models and methods have been proposed [3, 5, 9, 4], with many criteria (often chosen arbitrarily). So, what criteria should we take into account? The number of appearances of terms? The proximity between them? The relative height between elements? Furthermore, is one criterion more important than the others, and in what proportion? In this paper we present a reverse engineering approach that tries to answer those questions. We first present the context of this work, the INitiative for the Evaluation of XML Retrieval (INEX). Next, we describe the methodology we used. Finally, we show some results and discuss the approach.

2 Context

The INEX [2] test collection consists of a set of XML documents, topics and relevance assessments. The documents are articles of the IEEE, which are quite structured, and the topics relate either to the content only or to both the structure and the content of the documents. INEX has defined a query language, NEXI, which offers an operator that is important for us: about(). For example, the NEXI query:

//article[about(.,java)] //sec[about(.,implementing threads)]

represents the sections about implementing threads of articles which are about java in general. The about() operator is exactly an IR operator; in other words, the query //article[about(.,java)] can be processed with a classical flat model of IR.
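As a small illustration of this last point, an about() clause can be evaluated with an ordinary flat vector-space measure. The sketch below is ours, not the system described in this paper: it scores an element's text against the query terms with a plain bag-of-words cosine similarity, ignoring idf, stemming and structure.

    import math
    from collections import Counter

    def about(query, element_text):
        # crude cosine similarity between the query terms and the element text
        q = Counter(query.lower().split())
        d = Counter(element_text.lower().split())
        dot = sum(q[t] * d[t] for t in q)
        norm = (math.sqrt(sum(v * v for v in q.values()))
                * math.sqrt(sum(v * v for v in d.values())))
        return dot / norm if norm else 0.0

    # e.g. a flat-IR evaluation of //article[about(., java)] on an article's text
    print(about('java', 'java threads and the java virtual machine'))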
If we return to the first example, we can solve the first part with a classical model (//article[about(.,java)]), giving a global relevance for the document, and we can solve the second part too if we then consider all the sections as separate documents. But it would be a mistake to overlook that the sections are descendants of the article, since this may have an impact on the ranking: if two sections have the same score but the articles that contain them have different global relevancies, it seems logical to rank the section in the most relevant article before the other.

3 Our Approach

It is clear that we need an expression of the relevance of a document that takes into account both the contents of the individual elements of the document and the structure of the document itself. If we suppose that (as in flat document retrieval) we can express the relevance Rx of the textual part of any XML element Ex of the document, the relevance of the document is to be expressed as a function of these values Rx that reflects the structure of the document. The relevance of the document to a specific request is therefore of the form R = FD-R(R1, ..., Ri, ..., Rn), where the function FD-R is specific to the document and to the request itself.

In our very simple example, we could think that it is pertinent to select only documents where both conditions, on article and on section, are satisfied. However, relevance is a strange function, and documents that are not detected as talking about java, or that contain no section with the words "implementing threads", can still be judged relevant by the user. Computing relevance is therefore not a matter of just "anding" or "oring" results, but rather a problem of finding a convenient equation with appropriate coefficients.

Starting with a simple topic such as //article[about(., java)]//sec[about(., implementing threads)], where Rai and Rsi,j are the computed relevancies of article i and of each of its sections j, we can say that the relevance of a section is a function of Rai and Rsi,j. Different models have been proposed in the past, combining Rai and Rsi,j with functions such as addition, multiplication, etc. We can note that "and" conditions are typically represented by a multiplication or a minimum. If we use these combinations, the generic equation corresponding to our model is:

Ri = alpha * Rai + beta * Rsi,j + gamma * Rai * Rsi,j + delta * min(Rai, Rsi,j)

The question is: how should we choose the coefficients?

Fortunately, INEX provides us not only with queries but also with assessments of these queries. The idea presented here is that, for a specific topic, we can compute Rai and Rsi,j for each assessed document using some well-established evaluation method (such as the vector model), and state that the result Ri is equal to the user-estimated relevance of the document (for most INEX documents, relevancies have not been estimated; we use only documents for which the relevance has been estimated, and the corresponding values 0, 1, 2 and 3 are normalized to 0, 1/3, 2/3 and 1). In our case this gives a set of 2473 equations with 4 unknown quantities: this over-determined system can be solved with standard mathematical methods (a linear least squares method in our case), giving the values of alpha, beta, gamma and delta that minimize the system. For the chosen model and the specific query, we therefore discover the most appropriate values to represent the relevance of any unrated new document.

With the query and the assessment table:

  article   section   user relevance
  1         1         1/3
  1         2         0
  1         3         0
  ...       ...       ...
  1         m1        1/3
  2         1         1
  2         2         0
  ...       ...       ...
  i         j         k
  ...       ...       ...
  n         mn        2/3
we create the system:

alpha * Ra1 + beta * Rs1,1  + gamma * Ra1 * Rs1,1  + delta * min(Ra1, Rs1,1)  = 1/3
alpha * Ra1 + beta * Rs1,2  + gamma * Ra1 * Rs1,2  + delta * min(Ra1, Rs1,2)  = 0
alpha * Ra1 + beta * Rs1,3  + gamma * Ra1 * Rs1,3  + delta * min(Ra1, Rs1,3)  = 0
...
alpha * Ra1 + beta * Rs1,m1 + gamma * Ra1 * Rs1,m1 + delta * min(Ra1, Rs1,m1) = 1/3
alpha * Ra2 + beta * Rs2,1  + gamma * Ra2 * Rs2,1  + delta * min(Ra2, Rs2,1)  = 1
alpha * Ra2 + beta * Rs2,2  + gamma * Ra2 * Rs2,2  + delta * min(Ra2, Rs2,2)  = 0
...
alpha * Rai + beta * Rsi,j  + gamma * Rai * Rsi,j  + delta * min(Rai, Rsi,j)  = k
...
alpha * Ran + beta * Rsn,mn + gamma * Ran * Rsn,mn + delta * min(Ran, Rsn,mn) = 2/3

4 Results

We used these three similar topics for testing this method:

• Topic 128 (1623 equations): //article[about(., intelligent transport systems)]//sec[about(., on-board route planning navigation system for automobiles)]
• Topic 141 (2473 equations): //article[about(., java)]//sec[about(., implementing threads)]
• Topic 145 (2687 equations): //article[about(., information retrieval)]//p[about(., relevance feedback)]

By solving these three systems we obtain values for the unknown quantities:

  Topic   alpha   beta    gamma    delta
  128     0.051   0.085   -0.753   0.567
  141     0.014   0.059   0.769    0.265
  145     0.009   0.123   0.506    0.134

Now we can reintroduce these values into each system, compute the score of each answer, order the answers by growing score and draw the precision-recall graphs. Figure 1 shows the precision-recall graph for each topic.

Figure 1. Precision-recall graphs for Topics 128, 141 and 145, comparing "Relevance of section only" with "Our method" (precision plotted against recall).

5 Conclusion and Perspective

What conclusions can we draw from these very first experiments? While we have chosen three similar requests, which look like they might be solved by "adding" two conditions, the experimental results only partially validate this hypothesis. However, there are many aspects that impact the results and are difficult to take into account:

• Clearly, the relevance assessment made by the user is rarely a strict interpretation of the NEXI formulation: a user can also make errors, incorrect judgments (indeed, when two INEX experts evaluate the same set of documents, they usually totally disagree about which are relevant and which are not), etc.
• The function that we use to evaluate the relevance of a passage is quite simple, based on the vector model. It does not take into account synonymy or homonymy between words, etc.
• The equation system that we obtain is usually ill conditioned and sometimes gives unstable results.
• Taking into account the profile of the user.

Many more experiments (including different evaluation functions for textual elements) clearly need to be conducted before firm conclusions can be drawn. However, we believe that the approach may lead to progress in many directions, including:

• discovering the best usages of structures for XML information retrieval;
• adapting relevance feedback systems.

More generally, we can expect such an approach to help design acceptable models for the "and" and "or" operations used in typical requests on the structure and contents of XML documents, therefore allowing us to build better information retrieval systems.
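For illustration, here is a minimal Python sketch of the least-squares fit described in Section 3 (ours, not the authors' code). The sample assessments below are invented; the paper solves 1623 to 2687 such equations per topic using the INEX judgments.

    import numpy as np

    def fit_coefficients(samples):
        # samples: (Ra_i, Rs_ij, user relevance in {0, 1/3, 2/3, 1}) per assessed pair
        A = np.array([[ra, rs, ra * rs, min(ra, rs)] for ra, rs, _ in samples])
        b = np.array([rel for _, _, rel in samples])
        coeffs, *_ = np.linalg.lstsq(A, b, rcond=None)   # alpha, beta, gamma, delta
        return coeffs

    # toy data (made up for the sketch)
    toy = [(0.40, 0.70, 1/3), (0.40, 0.10, 0.0), (0.90, 0.80, 1.0),
           (0.20, 0.30, 0.0), (0.60, 0.50, 2/3), (0.10, 0.90, 1/3)]
    print(fit_coefficients(toy))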
References

[1] DocBook. http://www.docbook.org/.
[2] INitiative for the Evaluation of XML Retrieval. http://inex.is.informatik.uni-duisburg.de/.
[3] G. Navarro. A Language for Queries on Structure and Contents of Textual Databases. PhD thesis, University of Chile, 1995.
[4] K. Sauvagnat, M. Boughanem, and C. Chrisment. Searching XML documents using relevance propagation. SPIRE, 2004.
[5] T. Schlieder and H. Meuss. Querying and ranking XML documents. JASIST, 53(6):489-503, 2002.
[6] XML corpus of Shakespeare's plays. http://www.ibiblio.org/xml/examples/shakespeare/.
[7] TEI Consortium. Text Encoding Initiative, 1987. http://www.tei-c.org/.
[8] World Wide Web Consortium (W3C). Extensible Markup Language (XML), February 1998. http://www.w3.org/XML/.
[9] R. Wilkinson. Effective retrieval of structured documents. In Research and Development in Information Retrieval, pages 311-317, 1994.

Do search engines understand Greek or user requests "sound Greek" to them?
Fotis Lazarinis
Department of Technology Education & Digital Systems
University of Piraeus
80 Karaoli & Dimitriou, 185 34 Piraeus, Greece
lazarinf@teimes.gr

Abstract

This paper presents the outcomes of initial Greek Web searching experimentation. The effects of localization support and of standard Information Retrieval techniques such as term normalization, stopword removal and simple stemming are studied in international and local search engines. Finally, evaluation points and conclusions are discussed.

1. Introduction

The Web has rapidly gained popularity and has become one of the most widely used services of the Internet. Its friendly interface and its hypermedia features attract a significant number of users. Finding information that satisfies a particular user need is one of the most common and important operations on the WWW. Data are dispersed across a vast number of locations, so the use of a search engine is necessary. Although international search engines like Google and Yahoo are preferred over the local ones, as they employ better searching mechanisms and interfaces, they do not really value spoken languages other than English. Especially for languages like Greek, which has inflections and accent marks, the majority of the international search engines seem to have no internal (indexing) or external (interface) localization support. Thus users have to devise alternative ways to discover the desired information and must adapt themselves to the search engine's interface.

This paper reports the results of initial experimentation in Greek Web searching. The effect of localization support, upper or lower case queries, stopword removal and simple stemming is studied, and evaluation points are presented. The conclusions could be readily adapted to other spoken languages with characteristics similar to those of Greek.

2. Experimentation and evaluation

Interface simplicity and adaptation is maybe the most important issue influencing user satisfaction and acceptance of Web sites, and thus of search engines [1, 2]. User acceptance obviously increases when a search engine changes its language, and maybe its appearance, to satisfy its diversified user base. This is significant especially for novice users.
Stopword removal, stemming and capitalization or more generally normalization of index and query terms are amongst the oldest and most widely used IR techniques [3]. All academic systems support them. Commercial search engines, like Google, explicitly state that they remove stopwords, while capitalization support is easily inferred. Stemming seems to not be supported though. This may be due to the fact that WWW document collection is so huge and diverse that stemming would significantly increase recall and possibly reduce precision. However simple stemming, like final sigma removal which will be presented later in the paper, may play an important role when seeking information in the Web using Greek query terms. These four issues were examined with respect to the Greek language. For conducting our assessment we used most of the predominately known worldwide .com search engines: Google, Yahoo, MSN, AOL, Ask, Altavista. The .com search engines were selected based on their popularity [4]. Also, for comparison reasons, we considered using some native Greek search engines: In (www.in.gr), Pathfinder (www.pathfinder.gr) and Phantis (www.phantis.gr). 2.1. Interface issues Ten users participated in the interface related experiment and they also constructed some sample queries for the subsequent experiments. Users had varying degrees of computer usage expertise. We needed end users with technical expertise and obviously increased demands over the utilization of web searchers. On the other hand we should measure the difficulties and listen to the people who have just been introduced to search engines. This combination of needs reflect real everyday needs of web “surfers”. The following sub-issues extracted from a more complete evaluation study of user effort when searching the Greek Web space utilizing international search engines [5]. Here we extend (with more users and search 43 engines) and present only the issues connected with whether search engines really value other spoken languages than English, like Greek, or not. 2.1.1. Localization support. The first issue in our study was the importance of a localized interface. All the participants (100%) rated this feature as highly important as many users have basic or no knowledge of English. Although search engines have uncomplicated and minimalist interfaces their adaptation to the local language is essential as users could easily comprehend the available options. From the .com ones only Google automatically detects local settings and adapts to Greek. Altavista allows manual selection of the presentation language with a limited number of language choices and setup instructions in English. Also if you select another language, search is automatically confined to this country’s websites (this must be altered manually again). 2.1.2. Searching capability. In this task users were asked to search using queries with all terms in Greek. All search engines but AOL and Ask were capable of running the queries and retrieving possibly relevant documents. AOL pops-up a new Window when a user requests some information but it cannot correctly pass the Greek terms from the one window to the other. So no results are returned. However, when requests typed directly to the popped-up window then queries are run but presentation of the rank is problematic again. Ask does not retrieve any results, meaning that indexing of Greek documents is not supported. For example zero documents retrieved in all five queries of section 2.2. 
For these reasons AOL and Ask left out of the subsequent tests. 2.1.3. Output presentation. An important point made by the participants is that some of the search engines rank English web pages first, although search requests were in Greek. For example in the query “Ολυµπιακοί αγώνες στην Αθήνα” (Olympic Games in Athens) Yahoo, MSN and Altavista ranked some English pages first. This depends on the internal indexing and ranking algorithm but it is one of the points that increase user effort because one has to scroll down to the list of pages to find the Greek ones. of Recall and Precision [6] are used for comparing the results of the sample queries. Recall refers to the number of retrieved pages, as indicated by search engines, while precision (relevance) was measured in the first 10 results. Table 1. Sample queries. No Q1 Q2 Q3 Q4 Q5 Queries in Greek Μορφές ρύπανσης περιβάλλοντος Εθνική πινακοθήκη Αθηνών Προβλήµατα υγείας από τα κινητά τηλέφωνα Συνέδριο πληροφορικής 2005 Τεστ για την πιστοποίηση των εκπαιδευτικών Queries in English Environmental pollution forms National Art Gallery of Athens Health problems caused by mobile phones Informatics conference 2005 Tests for educators’ certification Table 2 presents the number of recalled pages for each query. From table 2 we realize that In and Pathfinder share the same index and employ exactly the same ranking procedure. The result set was identical both in quantity and order. Their only difference was in output presentation. Altavista and Yahoo had almost the same number of results, ranked slightly differently though. Table 2. Recall in lower case queries. Google Yahoo MSN Altavista In Pathfinder Phantis Q1 867 820 1357 821 251 251 33 Q2 3400 933 1537 939 343 343 63 Q3 805 527 542 515 67 67 22 Q4 15500 11200 6486 11400 689 689 88 Q5 252 186 272 191 49 49 6 In all cases the international search engines returned more results than the native Greek local engines. However, as seen in table 3, relevance of the first 10 results is almost identical in all cases, except Phantis, which maintains either a small index or employs a crude ranking algorithm. Query 4 retrieves so many results because it contains the number (year) 2005. So, documents which contain one of the terms and the number 2005 are retrieved, increasing recall significantly. Table 3. Precision of the top 10 results. Google Yahoo MSN Altavista In Pathfinder Phantis 2.2. Term normalization, Stemming, Stopwords Trying to realize how term normalization, stemming and stopwords affect retrieval we run some sample queries. We used 5 queries (table 1) suggested by the participants of the previous test. They were typed in lower case sentence form with accent marks leaving the default options of each search engine. A modified version 44 Q1 5 5 4 5 5 5 2 Q2 7 7 7 7 7 7 2 Q3 9 8 8 8 8 8 2 Q4 8 7 6 7 6 6 1 Q5 8 8 7 8 8 8 0 We confined the relevance judgment to only the first ten results so to limit the required time and because the first ten results are those with the highest probability to be visited. Relevance was judged upon having visited and inspected each page. The web locations visited had to be from a different domain. So if two consecutive pages were on the same server only one of them was visited. An interesting point to make is that although recall differs substantially among search engines precision is almost the same in all cases. Another point of attention is that the third query shows the maximum precision. 
This is because in this case terms are more normalized, compared to the other queries. This means that they are in the first singular or plural form which is the usual case in words appearing in headings or sub-headings. Consequently a better retrieval performance is exhibited. But, as we will see in section 2.2.3, it contains stopwords which when removed precision is positively affected and reaches 10/10. changes to “µορφεσ”. These observations are at least worrying. What would happen if a searcher were to choose to search only in capital letters or without accent marks? Their quest would simply fail in most of the cases leading novice users to stop their search. In English search there is no differentiation between capital and lower letters. The result sets are identical in both cases so user effort and required “user Web intelligence” is unquestionably less. 2.2.1. Term normalization. We then re-run the same queries but this time in capital letters with no accent marks. Recall (table 4) was dramatically diminished in most of the worldwide search enabling sites while it was left unaffected in two of the three domestic ones (In and Pathfinder). Precision was negatively affected as well (table 5), compared to results presented in table 3. Wrapping up this experiment one can argue that in Greek Web searching the same query should be run both in lower and in capital letters, so as to improve the performance of the search. Sites where there are no accent marks or contain intonation errors will not be retrieved unless variations of the query terms are used. Greek search engines are superior at this point and make information hunting easier and more effective. From the international search engines only Google has recognized these differences and try to improve its searching mechanism. Table 4. Recall in upper case queries. Google Yahoo MSN Altavista In Pathfinder Phantis Q1 22 18 10 18 251 251 4 Q2 3400 229 233 239 343 343 63 Q3 41 2 2 2 67 67 3 Q4 673 116 379 117 689 689 14 Q5 252 8 10 9 49 49 6 These observations are true for Yahoo, MSN and Altavista. Google and Phantis exhibit a somehow unusual behavior. In queries 2 and 5 Google and Phantis retrieve the same number of documents in the same order and have the same precision therefore. Upper case queries 1, 3 and 4 recall only a few documents compared to the equivalent lower case queries. Correlation between results is low and precision differs. Trying to understand what triggers this inconsistency we concluded that it relates to the final sigma existing in some terms of queries 1, 3 and 4. The Greek capital sigma is Σ but lower case sigma is σ when it appears inside a word and ς at the end of the word. Phantis presents the normalized form of the query along with the result set. Indeed it turns out that words ending in capital Σ are transformed to words with the wrong form of sigma, e.g. “ΜΟΡΦΕΣ” (forms) should change to “µορφες” but it Table 5. Precision of the top 10 results. Google Yahoo MSN Altavista In Pathfinder Phantis Q1 4 3 3 3 5 5 0 Q2 7 8 6 8 7 7 2 Q3 3 0 0 0 8 8 0 Q4 10 5 7 5 6 6 0 Q5 8 7 7 7 8 8 0 2.2.2. Stemming. Another factor that influences searching relates to the suffixes of the user request words. For example the phrases “Εθνική πινακοθήκη Αθηνών” or “Εθνική πινακοθήκη Αθήνας” or “Εθνική πινακοθήκη Αθήνα” all mean “National Art Gallery of Athens”. So while they are different they describe exactly the same information need. Each variation retrieves quite different number of pages. 
For example Google returned 3400, 722 and 5420 web pages respectively. Precision is different in these three cases as well, and correlation between results is less than 50% in the first ten results. One could argue that such a difference is rational and acceptable as the queries differ. If we consider these queries solely from a technical point of view then this argument is right. However if the information needed is in the center of the discussion then these subtle differences in queries which merely differ in one ending should have recalled the same web pages. Stemming is an important feature of retrieval systems [3] (p. 167) and its application should be at least studied in spoken languages which have conjugations of nouns and verbs, like in Greek. Google partially supports conjugation of English verbs. 45 2.2.3. Stopwords. Google and other international search engines remove English stopwords so as to not influence retrieval. For instance users are informed that the word of is an ordinary term and is not used in the query “National Art Gallery of Athens”. Removal of stopwords [3] (p. 167) is an essential part of typical retrieval systems. We re-run, in Google, queries 3 and 5 removing the ordinary words. Queries were in lower case and with accent marks so results should be compared with tables 2 and 3. Query 3 recalled 839 pages and precision equals 10 in the first 10 ranked documents. Similarly for the fifth query Google retrieved 275 documents and precision raised from 8 (table 2) to 10. As realized, recall was left unaffected but precision increased by 10% and by 20% respectively. This means that ranking is affected when stopwords are removed. However more intense tests are required to construct a stopword list and to see how retrieval is affected by Greek stopwords 4. Conclusions This paper presents a study regarding utilization of search engines using Greek terms. The issues inspected were the localization support of international search engines and the effect of stopword removal, capitalization and stemming of query terms. Our analysis participants identified as highly important the adaptation of search engines to local settings. Most of the international search engines do not automatically adapt their interface to other spoken languages than English and some of them do not even support other spoken languages. At least these are true for Greek. In order to get an estimate of the internal features of search engines that support Greek, we run some sample queries. International search engines recalled more pages than the local ones and they had a small positive difference in precision as well. However they are case sensitive, apart from Google, hindering retrieval of web pages which contain the query terms in a slightly different form to the requested one. Even if the first letter of a word is a capital letter the results will be different than when the word is typed entirely in lower case. Endings and stopwords are not removed automatically, thus affecting negatively recall of relevant pages. Stopwords are removed from English queries making information hunting easier, looking at it from a user’s perspective. Terms are not stemmed though even in English. However in a language with inclinations, like Greek, simple stemming seems to play an important role in retrieval assisting end users. In any case more intensive tests are needed to realize how endings, stopwords and capitalization affect retrieval. 
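As a small illustration of the kind of query-term normalization these experiments argue for, here is a Python sketch (ours, not from the paper) that strips accent marks and lowercases Greek terms while applying the final-sigma rule that plain case mapping gets wrong; the sample terms come from the queries above.

    import re
    import unicodedata

    def normalize_greek(term):
        # drop accent marks, e.g. 'Μορφές' -> 'Μορφες'
        stripped = ''.join(c for c in unicodedata.normalize('NFD', term)
                           if not unicodedata.combining(c))
        lowered = stripped.lower()              # plain lowercasing maps every Σ to σ
        # Greek special case: a sigma ending a word must be the final form ς
        return re.sub(r'σ\b', 'ς', lowered)

    print(normalize_greek('ΜΟΡΦΕΣ'))                      # μορφες, not the wrong form μορφεσ
    print(normalize_greek('Εθνική πινακοθήκη Αθηνών'))    # εθνικη πινακοθηκη αθηνων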
Trying to answer the question posed in the article’s title it can be definitely argued that international search enabling sites do not value the Greek language and possibly other languages with unusual alphabets. Google is the only one which differs than the others and seems to be in a process of adapting to and assimilating the additional characteristics. 5. References [1] J. Nielsen, R. Molich, C. Snyder, S. Farrel, Search: 29 Design Guidelines for Usable Search http://www.nngroup.com/ reports/ecommerce/search.html,2000. [2] Carpineto, C. et al., “Evaluating search features in public administration websites”, Euroweb2001 Conference, 2001, 167184. [3] Baeza-Yates, R. and Ribeiro-Neto, B., Modern Information Retrieval, Addison Wesley, ACM Press, New York, 1999. [4] D. Sullivan, Nielsen NetRatings: Search Engine Ratings http://searchenginewatch.com/reports/article.php/2156451, 2005. [5] Lazarinis, F., “Evaluating user effort in Greek web searching”, 10th PanHellenic Conference in Informatics, University of Thessaly, Volos, Greece, 2005 (to appear) [6] S. E. Robertson, “The Parameter Description of Retrieval Systems: Overall Measures”, Journal of Documentation, 1969, 25, 93-107. 46 Use of Kolmogorov distance identification of web page authorship, topic and domain David Parry School of Computer and Information Sciences, Auckland University of Technology, Auckland, New Zealand Dave.parry@aut.ac.nz Abstract Recently there has been an upsurge in interest in the use of information entropy measures for identification of similarities and differences between strings. Strings include text document languages, computer programs and biological sequences. This work deals with the use of this technique for author identification in online postings and the identification of WebPages that are related to each other. This approach appears to offer benefits in analysis of web documents without the need for domain specific parsing or document modeling. 1. Introduction Kolmogorov distance measurement involves the use of information entropy calculations to measure the distance between sequences of characters. Information retrieval is a potentially fruitful area of use of this technique, and it has been used for language and authorship identification [1], plagiarism detection in computer programs [2] and biological sequences, such as DNA and amino acids [3]. Authorship, genre identification and measures of relatedness remain an important issue for the verification and identification of electronic documents. Related document searches have been identified as an important tool for users as information retrieval systems [4]. Computers have been used for a long time to try and verify the identity of authors in the humanities [5] and in the filed of software forensics [6]. Various techniques have been used in the past including Bayesian Inference[7], neural networks[8] and more sophisticated methods using support vector machines [9]. However, such approaches tend to be extremely language and context specific although often very effective. Briefly, this approach is based around the concept of the relative information entropy of a document. The concept of the relative information of a document is closely related to that of Shannon [10]. One way of expressing this concept is to view a document as a message that is being encoded over a communication channel. A perfect encoding and compression scheme would produce the minimum length of message. 
In general, a document that can undergo a high degree of shortening by means of a compression algorithm has low information entropy, that is, a large degree of redundancy, whereas one that changes little in size has high information entropy, with little redundant information. A good compression algorithm should never increase the size of the "compressed" document. As the authors of [11] point out, a good zipping algorithm can be considered a sort of entropy meter. The Lempel-Ziv algorithm reduces the size of a file by replacing repeating strings with codes that represent the length and content of these strings [12], and has been shown to be a very effective scheme. To work efficiently, the Lempel-Ziv algorithm "learns" effective substitutions as it examines the document sequentially, finding repeating sequences that can be replaced in order to reduce the file size. This algorithm is the basis of the popular and fast zip software in its various incarnations, including Gzip, Pkzip and WinZip. Importantly, this method relies on a sequential examination of the document to be encoded, so concatenation with other documents can have dramatic effects on the efficiency of zipping, as encoding rules created at the start of the process may be useless at the end. By adding a document of unknown characteristics to one of known properties (for example language, author, genre, etc.), it is suggested that the combined relative entropy is smallest when the two documents are most similar. The work of [1] demonstrated that it was possible to identify the language used in a document by comparison with known documents. This method is therefore complementary to other methods that concentrate on understanding the document, much as handwriting or voice analysis widens the possibilities of author identification even if the content is not distinctive [13].

The rest of this paper describes one implementation of these types of algorithm (Section 2), along with a number of experiments (Section 3). Section 4 discusses the results and Section 5 describes other approaches and draws conclusions about this approach.

2. Algorithms

The Kolmogorov distance is based on the method of [14] and earlier work such as [15], which deals with the identification of minimum pattern length similarities. By using compression algorithms, the following formula for the distance between two objects may be computed. Assume C(A|B) is the compressed size of A using the compression dictionary used in compressing B, and vice versa for C(B|A), while C(A) and C(B) represent the compressed lengths of A and B using their own compression dictionaries. The distance between A and B, D(A, B), is given by:

D(A, B) = [C(A|B) + C(B|A)] / [C(A) + C(B)]

This formula is explicitly derived in [16].
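As a concrete illustration, here is a minimal Python sketch of the zip-concatenation approximation to this distance that is described in the next paragraphs (our sketch, not the author's implementation; it uses zlib rather than a zip archiver, and the file names are hypothetical).

    import zlib

    def zipped_len(data):
        return len(zlib.compress(data, 9))

    def kolmogorov_distance(file1, file2):
        zip1, zip2 = zipped_len(file1), zipped_len(file2)
        zip12 = zipped_len(file1 + file2)   # dictionary learned on file1, applied to file2
        zip21 = zipped_len(file2 + file1)
        return ((zip12 - zip1) + (zip21 - zip2)) / (zip1 + zip2)

    # hypothetical inputs: any two byte strings or file contents
    a = open('posting_known_author.txt', 'rb').read()
    b = open('posting_unknown.txt', 'rb').read()
    print(kolmogorov_distance(a, b))        # smaller values indicate greater similarity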
Various methods of compression have been used for this; for this work a method was used that does not need explicit access to the compression dictionary, so that standard zip programs can be used. Concatenating files and then compressing them allows the compression algorithm to develop its dictionary on the first file and then apply it to the second. The algorithm used is:

Obtain the two files, file1 and file2.
Concatenate them in the two possible orders: file1 + file2 = file12 and file2 + file1 = file21.
Calculate the compressed lengths of file1 as zip1, file2 as zip2, file12 as zip12 and file21 as zip21.

The distance D is then given by:

D(file1, file2) = [(zip12 - zip1) + (zip21 - zip2)] / (zip1 + zip2)

This approach depends on the compression algorithm being lossless. Previous work had demonstrated that if file1 is the same as file2 the distance is minimal.

3. Methods

Three experiments were performed to validate the algorithm used. One used author identification, the second used Web pages from different domains, and the third used different topics within a particular web corpus.

3.1 Experiment One

One particularly rich source of test data is archived newsgroup and list server postings, which often contain particularly relevant information in a concise format. Newsgroup postings provide a rich corpus of material for studying classification schemes, for example the use of readability or other scores [17] to characterize discussion. Postings from an online teaching system, Business On Line [18], were used. A total of 160 initial messages were used. The Kolmogorov distance (KD) was calculated between each initial message and 10 other messages, only one of which was by the same author as the first. The message combination with the shortest KD was then noted; the results are shown in Table 1.

Table 1: Kolmogorov distance for messages

  Status                Percent shortest KD   Percent in sample
  Author1 <> Author2    51.88%                90%
  Author1 = Author2     48.13%                10%

Using chi-squared, this result is significant at the p < 0.001 level (SPSS 11): chi-squared(1, N=160) = 258, p < 0.001. The proportion of messages with common authors having the smallest distance is a great deal higher than expected by chance.

3.2 Experiment Two

4,389 Web pages were downloaded using a web spider from 6 root sites. A similar comparison was done for the domain-based groups, with each of 80 pages compared with one page from the same domain and nine from other domains. The results are shown in Table 2.

Table 2: Kolmogorov distance for domains

  Status              Percent lowest KD   Percent in sample
  Different domain    18.75%              90%
  Same domain         81.25%              10%

Again, using chi-squared, this result is significant at the p < 0.001 level (SPSS 11): chi-squared(1, N=80) = 451, p < 0.001. The proportion of websites from common domains having the smallest distance is a great deal higher than expected by chance.

3.3 Experiment Three

This experiment used the British Medical Journal (BMJ) website, which includes a large number of pages grouped by topic. The process began by selecting those topics that had at least 5 valid pages available for download. For each of these valid topic domains (n=133), 5 initial pages were chosen. One page from the same domain and nine different pages from other domains were then selected, in a manner similar to that described above. Again, the files were selected to be of similar length, and the pages were zipped together using the Kolmogorov-distance-by-zipping algorithm. Self-
This relatedness may be intrinsic to the text, as in the case of content, authorship or language, or related to the structure of the webpage, that is the arrangement of tags or formatting information. Drawbacks to the practical implementation of this method centre around two main areas, combinatorial explosion and confounding similarity. As stated this method requires each file to be compared with each other file, thus the number of calculations needed to find the distance between n documents is given by n! Current work is concentrating on the clustering of documents using this approach. One approach has been to find documents that are close in terms of KD, to use these as cluster centroids, and measure the distance of new examples from these. This approach, by identifying the centroid of a cluster in terms of a limited number of documents would remove the issue of n! comparisons. Work by [3], has emphasized the importance of clustering. Confounding similarity, represents the case where documents have a great deal of similarity that is unrelated to their content – for example in the case of documents converted to HTML by popular editors with supplied templates or conversion programs. This does not seem to be an issue in the case of the BMJ topic corpus, but may become important in other cases. If necessary text extraction and separation from formatting tags could be used. The ultimate length of documents that can be effectively processed in this way should be investigated, it seems reasonable to suppose that extremely long documents of very short documents would not be suitable because of the likelihood of common of repeating motifs in the former case and the absence of repeating motifs in the latter. Other compression techniques, including those where the compression dictionary is stored separately, should be investigated. It is important to note that this approach is generally complimentary to existing ones and has not been compared with other methods – such as comparisons using textual information, This method is attractive in areas where there is difficulty in performing domain specific parsing or there is no knowledge relating to document structure. In terms of open-source implementation, this approach could easily be added as a plug-in to browser technology, allowing individual users to compare new documents to those cached already, or by allowing users to collaboratively compare documents with a central or dispersed repository. The decreasing cost of storage implies that document cache comparison will become increasingly important, and simple, general comparison tools will be important in this regard. 5. Conclusion Comparing electronic documents using the Kolmogorov technique is easily implemented and is not constrained by any proprietary technology. This approach seems particularly useful for short, unstructured documents such as newsgroup postings and emails. Web logs (Blogs) are also becoming more popular and this approach could be used for comparison and validation of these. Use of this technique, in addition to current methods may allow improved characterization of electronic communication and searching of electronic databases. For search engine technology, such approaches may allow improved ranking of results. Particular applications include relatedness and clustering applications, email filtering, fraud and plagiarism detection and genre identification. Further research in this area may increase the value of this approach. 6. References [1] [2] [3] 49 D. Benedetto, E. 
Caglioti, and V. Loreto, "Language Trees and Zipping," Physical Review Letters, vol. 88, pp. 048702-1 to 048702-4, 2002. X. Chen, B. Francia, M. Li, B. McKinnon, and A. Seker, "Shared information and program plagiarism detection," Information Theory, IEEE Transactions on, vol. 50, pp. 1545-1551, 2004. R. Cilibrasi and P. M. B. Vitanyi, "Clustering by compression," Information Theory, IEEE Transactions on, vol. 51, pp. 1523-1545, 2005. [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] B. J. Jansen, A. Spink, J. Bateman, and T. Saracevic, "Real life information retrieval: a study of user queries on the Web," SIGIR Forum, vol. 32, pp. 5-17, 1998. S. Y. Sedelow, "The Computer in the Humanities and Fine Arts," ACM Computing Surveys (CSUR), vol. 2, pp. 89-110, 1970. P. W. Oman and C. R. Cook, "Programming style authorship analysis," pp. 320--326, 1989. Mosteller F. and Wallice D., Applied Bayesian and Classical Inference: the case of the Federalist Papers: Addison-Wesley, 1964. S. T. Singhe, F.J., "Neural networks and disputed authorship: new challenges " in Artificial Neural Networks, 1995., Fourth International Conference on, 1995, pp. 24-28. O. d. Vel, A. Anderson, M. Corney, and G. Mohay, "Mining e-mail content for author identification forensics," ACM SIGMOD Record, vol. 30, pp. 55-64, 2001. Shannon., "Mathematical Theory of Communication," in Bell Systems Technical Journal, 1948. A. Puglisi, D. Benedetto, E. Caglioti, V. Loreto, and A. Vulpiani, "Data compression and learning in time sequences analysis," Physica D: Nonlinear Phenomena, vol. 180, pp. 92-107, 2003. J. L. Ziv, A., "A universal algorithm for sequential data compression," Information Theory, IEEE Transactions on, vol. 23, pp. 337343, 1977. S. N. Srihari and S. Lee, "Automatic handwriting recognition and writer matching on anthraxrelated handwritten mail," in Eighth International Workshop on Frontiers in Handwriting Recognition, 2002, pp. 280-284. A. Kolmogorov, "Logical basis for information theory and probability theory," Information Theory, IEEE Transactions on, vol. 14, pp. 662664, 1968. A. Kolmogorov, "Three Approaches to the quantitive definition of Information," Problems of Information Transmission, vol. 1, pp. 1-17, 1965. M. Li, X. Chen, X. Li, B. Ma, and P. Vitenyi, "The similarity metric," presented at SODA Proceedings of the fourteenth annual ACMSIAM symposium on Discrete algorithms, Baltimore, Maryland,, 2003. P. Sallis and D. Kasabova, "Computer-Mediated Communication, Experiments with e-mail readability," Information Sciences, pp. 43-53, 2000. [18] 50 A. Sallis, G. Carran, and J. Bygrave, "The Development of a Collaborative Learning Environment: Supporting the Traditional Classroom," presented at WWW9, Netherlands, 2000. Searching Web Archive Collections Michael Stack Internet Archive The Presidio of San Francisco 116 Sheridan Ave. San Francisco, CA 94129 stack@archive.org Abstract Web archive collection search presents the usual set of technical difficulties searching large collections of documents. It also introduces new challenges often at odds with typical search engine usage. This paper outlines the challenges and describes adaptation of an open source search engine, Nutch, to Web archive collection search. Statistics and observations indexing and searching small to medium-sized collections are presented. We close with a sketch of how we intend to tackle the main limitation, scaling archive collection search above the current ceiling of approximately 100 million documents. 
Technically, Nutch provides basic search engine capability, is extensible, aims to be cost-effective, and is demonstrated capable of indexing up to 100 million documents with a convincing development story for how to scale up to billions [9]. This paper begins with a listing of challenges searching WACs. This is followed by an overview of Nutch operation to aid understanding of the next section, a description of Nutchwax, the open-source Nutch extensions made to support WAC search. Statistics on indexing rates, index sizes, and hardware are presented as well as observations on the general WAC indexing and search operation. We conclude with a sketch of how we intend to scale up to index collections of billions of documents. 1. Introduction 2. Challenges Searching WACs The Internet Archive (IA)(www.archive.org) is a 501(c)(3) non-profit organization whose mission is to build a public Internet digital library [1]. Since 1996, the IA has been busy establishing the largest public Web archive to date, hosting over 600 terabytes of data. Currently the only public access to the Web archive has been by way of the IA Wayback Machine (WM) [2] in which users enter an URL and the WM displays a list of all instances of the URL archived, distinguished by crawl date. Selecting any date begins browsing a site as it appeared then, and continued navigation pulls up nearest-matches for linked pages. The WM suffers one major shortcoming: unless you know beforehand the exact URL of the page you want to browse, you will not be able to directly access archived content. Current Web URLs and published references to historic URLs may suggest starting points, but offer little help for thorough or serendipitous exploration of archived sites. URL-only retrieval also frustrates users who are accustomed to exhaustive Google-style full text search from a simple query box. What is missing is a full text search tool that works over archived content, to better guide users 'wayback' in time. WACs tend to be large. A WAC usually is an aggregate of multiple, related focused Web crawls run over a distinct time period. For example, one WAC, made by the IA comprises 140 million URLs collected over 34 weekly crawls of sites pertaining to the United States 2004 Presidential election. Another WAC is the complete IA repository of more than 60 billion URLs. (Although this number includes exact or near duplicates, the largest live-Web search engine, Google, only claims to be "[s]earching 8,058,044,651 web pages" as of this writing.) WACs are also large because archives tend not to truncate Web downloads and to fetch all resources including images and streams, not just text-only resources. Nutch [4] was selected as the search engine platform on which to develop Web Archive Collection (WAC) search. "Nutch is a complete open-source Web search engine package that aims to index the World Wide Web as effectively as commercial search services" [5]. A single URL may appear multiple times in a WAC. Each instance may differ radically, minimally or not at all across crawls, but all instances are referred to using the same URL. Multiple versions complicate search query and result display: Do we display all versions in search results? If not, how do we get at each instance in the collection? Do we suppress duplicates? Or do we display the latest with a count of known instances in a corner of the search result summary? A WAC search engine gets no help from the Web-at-large serving search results. 
What we mean by this is that for WAC searching, after a user clicks on a search result hit, there is still work to be done. The search result must 51 refer the user to a viewer or replay utility – a tool like the IA WM – that knows how to fetch the found page from the WAC repository and display it as faithfully as possible. (Since this redisplay is from a server other than the page's original home, on-the-fly content rewriting is often required.) While outside of the purview of collection search, WAC tools that can reassemble the pages of the past are a critical component in any WAC search system. 3. Overview Of Nutch Operation The Nutch search engine indexing process runs in a stepped, batch mode. With notable exceptions discussed later, the intent is that each step in the process can be "segmented" and distributed across machines so no single operation overwhelms as the collection grows. Also, where a particular step fails (machine crash or operator misconfiguration), that step can be restarted. A custom database, the Nutch webdb, maintains state between processing steps and across segments. An assortment of parse-time, index-time, and query-time plugins allows amendment of each processing step. After initial setup and configuration, an operator manually steps through the following cycle indexing: 1. Ask the Nutch webdb to generate a number of URLs t o fetch. The generated list is written to a "segment" directory. 2. Run the built-in Nutch fetcher. During download, an md5 hash of the document content is calculated and parsers extract searchable text. All is saved to the segment directory. 3. Update the Nutch webdb with vitals on URLs fetched. An internal database analysis step computes all in-link anchor text per URL. When finished, the results of the database inlink anchor text analysis are fed back to the segment. Cycle steps 1-3 writing new segments per new URL list until sufficient content has been obtained. 4 . Index each segment's extracted page text and in-link anchor text. Index is written into the segment directory. 5. Optionally remove duplicate pages from the index. 6. Optionally merge all segment indices (Unless the index is large and needs to be distributed). Steps 2 and 4 may be distributed across multiple machines and run in parallel if multiple segments. Steps 1, 3, and 5 require single process exclusive access to the webdb. Steps 3 and 6 require that a single process have exclusive access to all segment data. A step must complete before the next can begin. To query, start the Nutch search Web application. Run multiple instances of the search Web application to distribute query processing. The queried server distributes the query by remotely invoking queries against all query cluster participants. (Each query cluster participant is responsible for some subset of all segments.) Queries are run against Nutch indices and return ranked Google-like search results that include snippets of text from the pertinent page pulled from the segment-extracted text. 4. Nutch Adaptation Upon consideration, WAC search needs to support two distinct modes of operations. First, WAC search should function as a Google-like search engine. In this mode, users are not interested in search results polluted by multiple duplicate versions of a single page. Phase one of the Nutch adaptation focused on this mode of operation. A second mode becomes important when users want to study how pages change over time. 
Here support for queries of the form, "return all archive versions crawled in 1999 sorted by crawl date" is needed. (Satisfying queries of this specific type is what the IA WM does using a sorted flat file index to map URL and date to resource location.) Phase two added features that allow versionand date-aware querying. (All WAC plugin extensions, documentation, and scripts are open source hosted at Sourceforge under the Nutchwax project [8].) 4.1. Phase one Because the WAC content already exists, previously harvested by other means, the Nutch fetcher step had to be recast to pull content from a WAC repository rather than from the live Web. At IA, harvested content is stored in the ARC file format [6]; composite log files each with many collected URLs. For the IA, an ARC-tosegment tool was written to feed ARCs to Nutch parsers and segment content writers. (Adaptation for formats other than IA ARC should be trivial.) Upon completion of phase one, using indices purged of exact duplicates, it was possible to deploy a basic WAC search that used the IA WM as the WAC viewer application. 4.2. Phase two To support explicit date and date range querying using the IA 14-digit YYYYDDMMHHSS timestamp format, an alternate date query operator implementation replaced the native Nutch YYYYMMDD format. To support retrieval of WAC documents by IA WM-like viewer applications, location information -- collection, arcname and arcoffset -- was added to search results as well as an operator to support exact, as opposed to fuzzy, URL querying (exacturl). Nutch was modified to support sorting on arbitrary fields and deduplication at query time (sort, reverse, dedupField, hitsPerDup). Here is the complete list of new query operators: • sort: Field to sort results on. Default is no sort. • reverse: Set to true to reverse sort. Default is false. • dedupField: Field to deduplicate on. Default is 'site'. 52 • hitsPerDup: Count of dedupField matches to show i n search results. Default 2. • date: IA 14-digit timestamps. Ranges specified with '-' delimiter between upper and lower bounds. • arcname: Name of ARC file that containing result found. • arcoffset: Offset into arcname at which result begins. • collection: The collection the search result belongs to. • exacturl: Query for an explicit url. Natively Nutch passes all content for which there is no explicit parser to the text/html parser. Indexing, logs are filled with skip messages from the text/html parser as it passes over audio/*, video/*, and image/* content. Skipped resources get no mention in the index and so are not searchable. An alternate parser-default plugin was created to add at least base metadata of crawl date, arcname, arcoffset, type, and URL. This allows viewer applications, which need to render archived pages that contain images, stylesheets, or audio to ask of the Nutch index the location of embedded resources. Finally, an option was added to return results as XML (RSS) [7]. Upon completion of phase two, both modes of operation were possible using a single non-deduplicated index. 5. Indexing Stats Discussed below are details indexing two WACs: One small, the other medium-sized. All processing was done on machines of the following profile: single processor 2.80GHz Pentium 4s with 1GB of RAM and 4x400GB IDE disks running Debian GNU/Linux. Indexing, this hardware was CPU-bound with light I/O loading. RAM seemed sufficient (no swapping). All source ARC data was NFS mounted. 
Only documents of type text/* or application/* and HTTP status code 200 were indexed.

5.1. Small Collection

This small collection comprised three crawls. Indexing steps were run in series on one machine using a single disk. The collection comprised 206 ARC files, 37.2GB of uncompressed data. 1.07 million of the collection's total of 1.27 million documents were indexed.

Table 1: MIME Types
MIME Type            Size (MB)   % Size   Incidence
text                 25767.67    79.32%   1052103
text/html            22003.55    67.73%   1044250
application           6719.92    20.68%     20969
application/pdf       4837.89    14.89%     16201
application/msword     487.89     1.50%      3306

Table 2: Timings
Segment   Database   Index    Dedup   Merge
16h32m    2h26m      18h44m   0h01m   2h35m

Indexing took 40.3 hours to complete (the per-step times of Table 2 sum to 40h18m). The merged index size was 1.1GB, about 3% the size of the source collection. The index plus the cleaned-up segment data -- cleaning involved removal of the (recalculable) segment-level indices made redundant by the index merge -- occupied 1.1GB + 4.9GB, or about 16% the size of the source collection. Uncleaned segments plus index made up about 40% the size of the source collection.

5.2 Medium-sized Collection

The collection was made up of 1054 ARCs, 147.2GB of uncompressed data. 4.1 million documents were indexed. Two machines were used to do the segmenting step. Subsequent steps were all run in series on a single machine using a single disk.

Table 3: MIME Types
MIME Type            Size (MB)   % Size   Incidence
text                 96882.84    65.32%   3974008
text/html            90319.81    60.81%   3929737
application          50338.40    34.68%    122174
application/pdf      21320.83    14.40%     45427
application/msword    1000.70     0.67%      5468

Table 4: Timings
Segment (two machines)   Database   Index    Dedup   Merge
12h32m, 19h18m           7h23m      55h07m   0h06m   0h31m

Indexing took 99 hours of processing time (or 86.4 hours of elapsed time, because segmenting was split and run concurrently on two machines). The merged index size was 5.2GB, about 4% the size of the source collection. Index plus the cleaned-up segment data occupied 5.2GB + 14.5GB, or about 13.5% the size of the source collection. (Uncleaned segments plus index occupied about 22% the size of the source collection.)

6. Observations

Indexing big collections is a long-running manual process that currently requires intervention at each step to move the process along. The attention required compounds the more distributed the indexing is made. An early indexing of a collection of approximately 85 million documents took more than a week to complete, with segmenting and indexing spread across 4 machines. Steps had to be restarted as disks overfilled and segments had to be redistributed. Little science was applied, so the load was suboptimally distributed, with synchronizations waiting on laggard processes. (Others close to the Nutch project have reported similar experiences [12].) An automated means of efficiently distributing the parsing, update, and indexing work across a cluster needs to be developed. In the way of any such development are at least the following obstacles:

• Some indexing steps are currently single process.
• As the collection grows, with it grows the central webdb of page and link content. Eventually it will grow larger than any available single disk.

We estimate that with the toolset as is, given a vigilant operator and a week of time plus 4 to 5 machines with lots of disk, indexing WACs of about 100 million documents is at the limit of what is currently practical. Adding to a page its in-link anchor text when indexing improves search result quality.
Early indexing experiments were made without the benefit of the Nutch link database — our custom fetcher step failed to properly provide link text for Nutch to exploit. Results were rich in query terms but were not what was 'expected'. A subsequent fix made link-text begin to count. Thereafter, search result quality improved dramatically. The distributed Nutch query clustering works well in our experience, at least for low rates of access: ~1 query per second. (Search access-rates are expected to be lower for WACs than live-Web search engines.) But caches kept in the search frontend to speed querying will turn problematic with regular usage. The base Nutch (Lucene) query implementation uses one byte per document per field indexed. Additions made to support query-time deduplication and sorting share a cache that stores each search result's document URL. Such a cache of (Java) UTF-16 Java strings gets large fast. An alternate smaller memory-footprint implementation needs to be developed. 7. Future Work From inception, the Nutch project has set its sights on operating at the scale of the public web and has been making steady progress addressing the difficult technical issues scaling up indexing and search. The Nutch Distributed File System (NDFS) is modeled on a subset of the Google File System (GFS) [11] and is "...a set of software for storing very large stream-oriented files over a set of commodity computers. Files are replicated across machines for safety, and load is balanced fairly across the machine set" [12]. The intent is to use NDFS as underpinnings for a distributed webdb. (It could also be used storing very large segments.) While NDFS addresses the problem of how to manage large files in a fault-tolerant way, it does not help with the even distribution of processing tasks across a search cluster. To this end, the Nutch project is working on a version of another Google innovation, MapReduce [13], "a platform on which to build scalable computing" [9]. In synopsis, if you can cast the task you wish to run on a cluster into the MapReduce mold -- think of the Python map function followed by reduce function -- then the MapReduce platform will manage the distribution of your task across the cluster in a fault-tolerant way. Mid 2005, core developers of the Nutch project are writing a Java version of the MapReduce platform to use in a reimplementation of Nutch as MapReduce tasks [9]. MapReduce and NDFS combined should make Nutch capable of scaling its indexing step to billions of documents. The IA is moving its collections to the Petabox platform; racks of low power, high storage density, inexpensive rack-mounted computers [14]. The future of WAC search development will be harnessing Nutch MapReduce/NDFS development on Petabox. 8. Acknowledgements Doug Cutting and all members of the IA Web Team: Michele Kimpton, Gordon Mohr, Igor Ranitovic, Brad Tofel, Dan Avery, and Karl Thiessen. The International Internet Preservation Consortium (IIPC) [3] supported the development of Nutchwax. 9. 
References [1] Internet Archive http://www.archive.org [2] Wayback Machine http://www.archive.org/web/web.php [3] International Internet Preservation Consortium http://netpreserve.org [4] Nutch http://lucene.apache.org/nutch/ [5] Nutch: A Flexible and Scalable Open-Source Web Search Engine http://labs.commerce.net/wiki/images/0/06/CN-TR04-04.pdf [6] ARC File Format http://www.archive.org/web/researcher/ArcFileFormat.php [7] A9 Open Search http://opensearch.a9.com/ [8] Nutchwax http:// archiveaccess.archive.org/projects/nutch [9] MapReduce in Nutch, 20 June 2005, Yahoo!, Sunnyvale, CA, USA http://wiki.apache.org/nutchdata/attachments/Presentations/attachments/mapred.pdf [10] The Nutch Distributed File System by Michael Cafarella http://wiki.apache.org/nutch/NutchDistributedFileSystem [11] Google File System http://www.google.com/url?sa=U&start=1&q=http://labs. google.com/papers/gfs-sosp2003.pdf&e=747 [12] “[nutch-dev] Experience with a big index” by Michael Cafarella http://www.mail-archive.com/nutchdevelopers@lists.sourceforge.net/msg02602.html [13] MapReduce: Simplified Data Processing on Large Clusters by Jeffrey Dean and Sanjay Ghemawat http://labs.google.com/papers/mapreduce-osdi04.pdf [14] Petabox http://www.archive.org/web/petabox.php 54 XGTagger, an open-source interface dealing with XML contents Xavier Tannier, Jean-Jacques Girardot and Mihaela Mathieu Ecole Nationale Supérieure des Mines 158, cours Fauriel 42023 Saint-Etienne FRANCE tannier, girardot, mathieu@emse.fr Abstract This article presents an open-source interface dealing with XML contents and simplifying their analysis. This tool, called XGTagger, allows to use any existing system developed for text only, for any purpose. It takes an XML document in input and creates a new one, adding information brought by the system. We also present the concept of “reading contexts” and show how our tool deals with them. 1. Introduction XGTagger1 is a generic interface dealing with text contained by XML documents. It does not perform any analysis by itself, but uses any system S that analyse textual data. It provides S with a text only input. This input is composed of the textual content of the document, taking reading contexts into account. A reading context is a part of text, syntactically and semantically self-sufficient, that a person can read in a go, without any interruption [3]. Document-centric XML contents does not necessary reproduce reading contexts in a linear way. Within this context, we can distinguish three kinds of tags [1]: • Finally hard tags are structural tags, they break the linearity of the text (chapters, paragraphs. . . ). 2. General principle Figure 1 depicts the general functioning scheme of XGTagger. Input XML document is processed and a text is given to the user’s system S. After execution of S, a postprocessing is performed in order to build a new XML document. 2.1. Input As shown by figure 1, if a list of soft and jump tags is given by the user, XGTagger recovers the reading contexts, gathers them (separated by dots) and gives the text T to the system S. In the following example sc (small capitals) and bold are soft tags, since footnote is a jump tag. 
(1) <article> <title>Visit I<sc>stanbul</sc> M<sc>armara</sc> region</title> <par> This former capital of three empires<footnote>Istanbul has successively been the capital of Roman, Byzantine and Ottoman empires</footnote> is now the economic capital of <bold>Turkey</bold> • Soft tags identify significant parts of a text (mostly emphasis tags, like bold or italic text) but are transparent when reading the text (they do not interrupt the reading context); • Jump tags are used to represent particular elements (margin notes, glosses, etc.). They are detached from the surrounding text and create a new reading context inserted into the existing one. 1 http://www.emse.fr/∼tannier/en/xgtagger.html and </par> </article> Considering soft, jump and hard tags allows XGTagger to recognize terms “Istanbul” and “Marmara”, but to distinguish “empires” and “Istanbul” (not separated by a blank character). The text infered is: 55 take the example of POS tagging2, with TreeTagger [2] standing for the system S, the first field of the output is the initial text. Considering our example, words are separated: Initial XML Document Special tag lists Document parsing, reading context recovery text only System S (black box) text only Initial document reconstruction and updating User’s parameters Visit VV visit Istanbul NP Istanbul and CC and Marmara NP Marmara Region NN region . SENT . ... ... ... The user describes S output with parameters3, allowing XGTagger to compose back the initial XML structure and to represent additional information generated by S with XML attributes. In our running example, parameters should specify that fields are separated by tabulations, that the first field represents the initial word, the second field stands for the part-of-speech (pos) and the third one is the lemma (lem). XGTagger treats these parameters and S output and returns the following final XML document: <article> <title> <w id=”1” pos=”VV” lem=”visit”>Visit</w> <w id=”2” pos=”NP” lem=”Istanbul”>I</w> <sc> stylesheet <w id=”2” pos=”NP” lem=”Istanbul”> stanbul</w> Final XML Document </sc> <w id=”3” pos=”CC” lem=”and”>and</w> <w id=”4” pos=”NP” lem=”Marmara”>M</w> <sc> Figure 1. XGTagger general fonctioning scheme. <w id=”4” pos=”NP” lem=”Marmara”> armara</w> </sc> <w id=”5” pos=”NN” lem=”region”>region</w> </title> <par> <w id=”7” pos=”DT” lem=”this”>This</w> <w id=”8” pos=”JJ” lem=”former”>former</w> <w id=”9” pos=”NN” lem=”capital”>capital</w> <w id=”10” pos=”IN” lem=”of”>of</w> <w id=”11” pos=”CD” lem=”three”>three</w> <w id=”12” pos=”NNS” lem=”empire”>empires</w> <footnote> Visit Istanbul and Marmara region . This former capital of three empires is now the economic capital of Turkey . Istanbul has successively been the capital of Roman, Byzantine and Ottoman empires It is not necessary to take care of soft and jump tags if the document or the application do not impose it. If nothing is specified, all tags are considered as hard (in this example, “I” and “stanbul” would have been separated, as well as “M” and “armara” and the footnote would have stayed in the middle of the paragraph). Nevertheless, in applications like natural language processing or indexing, this classification can be very useful. 2.2. Output This output of the system S must contain (among any other information) the repetition of the input text. If we <w id=”21” pos=”NP” lem=”Istanbul”> Istanbul</w> <w id=”22” pos=”VHZ” lem=”have”>has</w> <w id=”23” pos=”RB” lem=”successively”>successively</w> ... 
<w id=”32” pos=”NP” lem=”Ottoman”> Ottoman</w> <w id=”33” pos=”NNS” lem=”empire”> empires</w> 2 A part-of-speech (POS), or word class, is the role played by a word in the sentence (e.g.: noun, verb, adjective. . . ). POS tagging is the process of marking up words in a text with their corresponding roles. 3 These parameters can be specified either through a configuration file or Unix or DOS-like options (the program is written is Java). 56 <w id=”1” pos=”PP” t=”I”>I</w> <w id=”2” pos=”VVD” t=”do”>did</w> <w id=”3” pos=”PP” t=”it”>it</w> <w id=”4” pos=”LOC” t=”in///order///to”> in</w> <w id=”4” pos=”LOC” t=”in///order///to”> order</w> <w id=”4” pos=”LOC” t=”in///order///to”> to</w> <w id=”5” pos=”VV” t=”clarify”>clarify </w> <w id=”6” pos=”NNS” t=”matter”>matters </w> </footnote> <w id=”13” pos=”VBZ” lem=”be”>is</w> <w id=”14” pos=”RB” lem=”now”>now</w> <w id=”15” pos=”DT” lem=”the”>the</w> <w id=”16” pos=”JJ” lem=”economi”>economic</w> <w id=”17” pos=”NN” lem=”capital”>capital</w> <w id=”18” pos=”IN” lem=”of”>of</w> <bold> <w id=”19” pos=”NP” lem=”Turkey”> Turkey</w> </bold> </par> </article> Note that the identifier id allows to keep the reading contexts (see ids 2 and 4, 12 and 13) without any loss of structural information. The initial XML document can be converted back with a simple stylesheet (except for blank characters that S could have added). More details about XGTagger use and functioning can be found in [4] and in the user manual [5]. 3. Examples of uses The first example was part-of-speech tagging, but any kind of treatments can be performed by system S. N.B.: Recall that an important constraint of XGTagger is that at least one field of the user system output must contain the initial text (blank characters excepted). 3.1. POS tagging upgrading: locution handling If the system S is able to detect locutions, XGTagger can deal with that feature, with a special option (called special separator). With this option the user can specify that a sequence of characters represents a separation between words. • Let’s take the following XML element: <sentence>I did it in order matters</sentence> to clarify • XGTagger will input the following text into the system: I did it in order to clarify matters </sentence> Note that the three words composing the locution get the same identifier. 3.2. Syntactic analysis With the same special separator option, a syntactic analysis can be performed. Suppose that S groups together noun phrases of the form “NOUN PREPOSITION NOUN”. • For the following XML element: <english_sentence>He has a taste<gloss>Taste: preference, a strong liking</gloss> for danger</english_sentence> • . . . XGTagger will give this text into the system (considering that ’gloss’ is a jump tag): He has a taste for danger . Taste: preference, a strong liking . • S can perform a simple syntactic analysis and return, by example: He has a taste_for_danger/NP . Taste: preference, a strong liking . • With XGTagger options -i -w 1 -2 pos -f “/” -d “ “ -e “_”, the final output is: <english_sentence> • With the special separator ’///’, S can return: I PP did VVD it PP in///order///to LOC clarify VV matters NNS <w id=”1”>He</w> <w id=”2”>has</w> <w id=”3”>a</w> <w id=”4” pos=”NP”>taste</w> <gloss> <w id=”6”>Taste:</w> <w id=”7”>preference,</w> ... 
<w id=”10”>liking</w> • With appropriate options, XGTagger final output is: <sentence> 57 • S output (same as the input): United States Elections </gloss> <w id=”4” pos=”NP”>for</w> <w id=”4” pos=”NP”>danger</w> • Possible final output: <title> </english_sentence> <w id=”1” rc=”United”>U</w> <sc> 3.3. Lexical enrichment <w id=”1” rc=”United”>nited</w> The user’s system can also return any information about words. For example, a translation of each noun: • XML Input: <sentence>I had brother</sentence> a conversation with </sc> <w id=”2” rc=”States”>S</w> <sc> my <w id=”2” rc=”States”>tates</w> </sc> <w id=”3” rc=”Elections”>E</w> <sc> • S output (suggestion): I had a conversation/entretien/Gespräch with my brother/frère/Bruder • Options: second field is French, third field is German; Output: <sentence> <w>I</w> <w>had</w> <w>a</w> <w french=”entretien” german=”Gespräch”>conversation</w> <w>with</w> <w>my</w> <w french=”frère” german=”Bruder”>brother</w> </sentence> 3.4. Reading Contexts finding Finally, S can just repeat the input text (possibly with a simple separation of punctuation). The result is that words are enclosed between tags, reading contexts are brought together (by ids) and cut words are reassembled. This operation can be particularly interesting for traditional information retrieval; it can represent a first step before indexing XML documents4 or operating researchs taking logical proximity [3] into account. • XML Input: <title>U<sc>nited</sc> S<sc>tates</sc> E<sc>lections</sc></title> 4 An option of XGTagger adds the path of each element as one of its attribute. <w id=”3” rc=”Elections”>lections </w> </sc> </title> 4. Conclusion We have presented XGTagger, a simple software system aimed at simplifying the handling of semi-structured XML documents. XGTagger allows any tool developed for text-only documents, either in the domain of information retrieval, natural language processing or any document engineering field, to be applied to XML documents. References [1] L. Lini, D. Lombardini, M. Paoli, D. Colazzo, and C. Sartiani. XTReSy: A Text Retrieval System for XML documents. In D. Buzzetti, H. Short, and G. Pancalddella, editors, Augmenting Comprehension: Digital Tools for the History of Ideas. Office for Humanities Communication Publications, King’s College, London, 2001. [2] H. Schmid. Probabilistic Part-of-Speech Tagging Using Decision Trees. In International Conference on New Methods in Language Processing, Sept. 1994. [3] X. Tannier. Dealing with XML structure through "Reading Contexts". Technical Report 2005-400-007, Ecole Nationale Supérieure des Mines de Saint-Etienne, Apr. 2005. [4] X. Tannier. XGTagger, a generic interface for analysing XML content. Technical Report 2005-400-008, Ecole Nationale Supérieure des Mines de Saint-Etienne, July 2005. [5] X. Tannier. XGTagger User Manual. http://www.emse.fr/~tannier/XGTagger/Manual/, June 2005. 58 The lifespan, accessibility and archiving of dynamic documents Katarzyna Wegrzyn-Wolska ESIGETEL, Ecole Supérieure d’Ingénieurs en Informatique et Génie des Télécommunications, 77215 Avon-Fontainebleau, France katarzyna.wegrzyn@esigetel.fr Abstract Today most Web documents are created dynamically. These documents don’t exist in reality; they are created automatically and they disappear after consultation. This paper surveys the problems related to the lifespan, accessibility and archiving of these pages. It introduces definitions of the different categories of dynamics documents. 
It describes also the results of our statistical experiments performed to evaluate their lifespan. 1 Introduction Is a dynamic document a real document, or is it only the temporary presentation of data? Is it any document created automatically or is it the document created as a response to the user’s action? The term "dynamic" can be used for the different signification; for the HTML Web page with some dynamic parts like a layers, scripts, etc., but this term is more often used for the pages created on-line by the Web server. This paper deal with the problems related to the documents created on-line. 2 The Lifespan and Age of Dynamic Documents How can the lifespan of dynamic documents be evaluated? These documents disappear immediately from the computers’ memory after theirs consultation. In this paper we define the lifespan of dynamic documents as the period where the given demand results in the same given response. This period is the time observed by the user as a documents lifespan. User when surfing the Web in his browser doesn’t know how the document was created so he doesn’t distinguish the difference between the static and the dynamic document. How to determine age of the dynamic documents? Can we consider the value of the http header Modified and Expired or the value fixed in the HTML file with the META tag Expires to indicate exactly when the document was changed or when it can be considered to have expired? 3 Dynamic Documents Categories We distinguish two kinds of dynamic documents: documents created and modified automatically (news sites, chat sites, weblogs, etc.) and documents created as an answer to the user requests (the results pages given by the Search Engines, the responses obtained by filling in the data form, etc.). We will analyse these documents separately in two categories. The first category is represented by the responsepages obtained from the Search Engines. The second category contains the pages from the different news sites and the Weblogs sites. 3.1 News Published on the Web There are numerous web sites which publish the news. The news sites publish different kinds of information in different presentation forms [3]. News is a very dynamic kind of information, constantly updated. The news sites have to be interrogated frequently so as not to miss any of the news information. On the other hand, it is often possible to reach the old articles from the archival files available on their sites. The archival life is varied on the deferments sites. The updating frequency and the archival life for some news sites is presented in Table 1. This information, which we evaluated was confirmed by the sites administrators. 3.2 The Weblog Sites A weblog, web log or simply a blog, is a web application, which contains periodic posts on a common webpage. It is a kind of online journal or diary frequently updated [1, 2]. 59 Table 1. Updating news frequency and archival life. Service news Update Archiving French Google about 20 min 30 days Google about 20 min 30 days Voila actuality every day 1 week Voila news info instantaneously 1 week Yahoo!News instantaneously 1 week TF1 news instantaneously News now 5 min CategoryNet every day CNN instantaneously Company news about 40 per day never ending 2003, 2004, 2005 archived Figure 1. Visit Frequency of indexing robots. Table 2. Index-database updating frequency. Search Engine Updating frequency Google 4 weeks some pages are updating quasi daily Yahoo! 
3 weeks All the Web vary frequently publishes la date of the robots visit grows very fast. An unfortunate side effect of this continual growth and dynamical modification is that it is impossible to save the totality of Web images. We have compared the data from the GoogleNews and BBC archives presented by the Wayback Machin with our statistical data (Table2, Table1). This comparison shows clearly that this archive is incomplete. since 2004 index-data base together with Yahoo! AltaVista since 2004 index-data base together with Yahoo! 3.3 Search Engines 5 Statistical Evaluation We have carried out the following statistical evaluation: index-databases updating frequency (for Search Engines and Meta Search Engines) and different statistical tests of the News sites and the Weblogs. The Search Engines’ response pages are the dynamic pages created on-line. The lifespan of the same response page (period when the Search Engine answer doesn’t change) depends on the data retrieved from the Search Engine’ index-database. It is evident that this time is correlated with the updating frequency of index-database. Table 2 shows the examples of values of the Search Engines’ index-database updating frequency. To estimate the updating frequency of index-databases we have analysed the differents logs files and we have calculated frequency of access to Search Engines carried out by different indexing robots [5, 6, 7]. Figure 1 shows the example of logs’ data concerning the robots visits. 4 Archiving 5.2 News sites and Web logs Dynamic documents can be printed, saved by the user or put to special caching and archiving systems. There are many Web applications, which store the current web image the Web (example Wayback Machine developed by The Internet Archive1 ). These applications try to retrieve and save all of the visible Web [4]. It is evident that this task is very difficult. The WWW is enormous and it changes and 1 http://www.archive.org/index.php 5.1 Search Engines We have carried out some statistical tests to evaluate the updating frequency (lifespan) [7] of News. The results showed the different behavior of interrogated sites (Figure 6a, Figure 6b, Tableau 3). We have analyzed four categories of sites: - Sportstrategies the sport news service, - News on the site of French television TF1, - News from BBC site, - Weblog site (Slashdot.org). 60 Figure 2. Sportstrategies: News lifespan. a) News 24hours/24 Figure 3. BBC: lifespan of the news. b) News at working hours Sportstrategies is an example of the very regular News site,with a constant update time (every hour : Figure 2). Figure 4. TF1 News: lifespan of the news. BBC News is diffused online, the lifespan is very irregular because the information is updated instantaneously when present (Figure 3). TFI News is updated frequently during the day. On the other hand there are no modifications by night. The lifespan of the News pages is very different in these two cases. We have presented it in two separated graphs. (Figure 4a et Figure 4b). Two high peaks in the extremes of the graph in Figure 4a correspond to the long period without any changes during the night. Slashdot.org Weblog site, represents the last category of site. This collective weblog is one of the more popular blogs oriented on the Open Source. The data changes here very quickly, the new articles are diffused very often and the actual discussions continue without any break. The lifespan of these dynamically changed pages is extremely short (Figure 5); the mean lifespan is equal to 77 sec. 
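The lifespan measurements discussed in this section amount to repeatedly requesting the same URL and recording how long the response stays identical. The following is only an illustrative sketch of such a probe, not the authors' actual measurement scripts; the target URL, the 10-second sampling interval and the use of an MD5 hash are arbitrary assumptions:

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.security.MessageDigest;

// Illustrative lifespan probe: fetch a page periodically and report how long
// the returned content stays unchanged (the "lifespan" as defined in Section 2).
public class LifespanProbe {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://example.org/news");   // hypothetical target page
        String previousHash = null;
        long lastChange = System.currentTimeMillis();
        while (true) {
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            String hash = md5(conn.getInputStream());
            // The Last-Modified header discussed in Section 2, when the server provides it:
            String lastModified = conn.getHeaderField("Last-Modified");
            if (previousHash != null && !hash.equals(previousHash)) {
                long seconds = (System.currentTimeMillis() - lastChange) / 1000;
                System.out.println("Content changed after " + seconds + " s"
                        + (lastModified != null ? " (Last-Modified: " + lastModified + ")" : ""));
                lastChange = System.currentTimeMillis();
            }
            previousHash = hash;
            conn.disconnect();
            Thread.sleep(10000);   // 10-second sampling interval, chosen arbitrarily
        }
    }

    private static String md5(InputStream in) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) > 0) {
            md.update(buffer, 0, read);
        }
        in.close();
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}

A fixed sampling interval obviously bounds the precision of the measured lifespan, which is why very fast-changing sources such as Slashdot call for short intervals.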
(Tableau 3). Updating frequency values. In the next graphs (Figure 6) and Table3 comparatives of updating frequency values for some tested sites are presented. We have found the maximal and minimal values of the updating frequency and calculated the mean. The results confirm that the content of the news sites changes very often. 6 Conclusion Dynamic documents don’t exist in reality, they disappear from the computer memory directly after consultation. Their real lifespan is very short. The sites can be classified into different categories depending on the news-updating period; very regular -with a constant update time, irregular - information updated when present. The sites can be also classified into two categories depending on the refresh time; slow -with a refresh time greater then 10 minutes, fast 61 Table 3. Updating frequency tested site Figure 5. Slashdot lifespan of articles. lifespan mean min. max. Slashdot.org 77 sec 10 sec 22 min BBC News 8,5 min 1 min 66 min TF1 news (24/24) 19,5 min 1 min 502 min TF1 News (working hours) 6,3 min 1 min 49 min Sportsynergies 56 min 9 min 61 min -information refresh even about 10 seconds. Some news sites present periodic activity: ex. the news site of the French television channel TF1 is updated only during working hours. On the other hand, dynamic documents can be stored by special archiving systems and in fact, users can access them for a long time. Management of the archived dynamic document’s lifespan is identical to that of static documents, because the dynamic documents are stored in the same way as static ones. References a) TF1: 24/24 [1] R. Blood. The Weblog Handbook: Pratical Advice on Creating and Mintaining your Blog. 2002. [2] S. Booth. C’est quoi un Weblog. 2002. [3] A. Christophe. Chercher dans l’actualite recente ou les archives d’actualites francaises et internationale, on-line http://c.asselin.free.fr, 2004. [4] S. Lawrence. Online or invisible ? Nature, 411(687):521, Jan 2001. [5] K. Wegrzyn-Wolska. Etude et realisation d’un meta-indexeur pour la recherche sur le Web de documents produits par l’administration francaise. PhD thesis, Ecoles Superieures de Mines de Paris, DEC 2001. [6] K. Wegrzyn-Wolska. Fim-metaindexer: a meta-search engine purpose-bilt for the french civil service and the statistical classification and evaluation of the interrogated search engines using fim-metaindexer. In G. J.T.Yao, V.V.Raghvan, editor, The Second International Workshop on Web-based Support Systems, In Conjunction with IEEE WIC ACM WIIAT’04, pages 163–170. Sainr Mary’s University, Halifax, Canada, 2004. [7] K. Wegrzyn-Wolska. Le document numerique: une etoile filante dans l’espace documentaire. Colloque EBSI-ENSSIB; Montreal 2004, 2004. b) TF1: working hours Figure 6. Updating frequency. 62 SYRANNOT: Information retrieval assistance system on the Web by semantic annotations re-use Wiem YAICHE ELLEUCH1, Lobna JERIBI2, Abdelmajid BEN HAMADOU3, 1,3 LARIM, ISIMS, SFAX, TUNISIE 2 RIADI GDL, ENSI, MANOUBA, TUNISIE 1 Wiem.Yaiche@isimsf.rnu.tn 2 lj@gnet.tn 3 Abdelmajid.BenHamadou@isimsf.rnu.tn Abstract: In this paper, SYRANNOT system implemented in java is presented. Relevant retrieved documents are given to the current user for his query and adapted to his profile. SYRANNOT is based on the mechanism of Case Based Reasoning (CBR). It memorizes the research sessions (user profile, query, annotation, session date) carried out by users, and re-use them when a similar research session arises. 
The first experimental evaluation carried out on SYRANNOT has shown very encouraging results. 1. Introduction The Case Based Reasoning is a problem resolution approach based on the re-use by analogy of previous experiments called cases [AAM 94][KOL 93][SCH 89]. Some works of research assistance systems based on CBR were carried out: RADIX [COR 98], CABRI [SMA 99], COSYDOR [JER 01]. Our approach consists in applying CBR on the semantic annotations coming out of the semantic Web domain. The CBR has various advantages (information transfer between situations, evolutionary systems, etc). Nevertheless, its integration presents some difficulties such as the representation, memorizing, re-use and adaptation of the cases. These four key words constitute the CBR cycle and are the subject of our study. In the following, SYRANNOT system architecture is presented. It integrates the CBR on the semantic annotations. Special attention is given to knowledge modelling of the reasoning, as well as the search algorithms and the similarities calculation functions, in each stage of the cycle of the CBR. 2. SYRANNOT Architecture We propose two scenarios of SYRANNOT use: the first is related to memorizing session research (cases) carried out by the user in RDF data base. Research sessions are RDF statements based on ontologies models in OWL language. The second concerns re-use cases by applying research algorithms and similarity functions to collect the most similar cases to the current one, and to exploit them in order to present to the current user relevant retrieved documents for its query and adapted to its profile. In the following, both scenarios processes are detailed. 2.1 Cases memorizing scenario A user having a given profile, memorized in the user profiles RDF data base in the form of RDF statements based on user ontology, expresses his need of information by formulating a query which he submits to the search engine. It collects and presents to the user the retrieved documents. When the current user finds a document which he considers relevant to his query, he annotates it. The annotation created is memorized in the RDF data base of the annotations in the form of RDF statements based on ontology annotation. The research session (user profile identifier, submitted query, annotation identifier, session date) is memorized in the RDF data base of cases in the form of RDF statements based on the cases ontology. Figure 1 presents the scenario proposed to memorize cases. Search engine Answers documents Cases base Relevant document query annotation annotations Base user User profiles base Figure 1 : Scenario of memorizing case The case memorizing scenario is illustrated by the interfaces figures 2, 4, 5 and 6. Figure 2 shows the user new inscription interface. The user having a single identifier (PID) assigned by the system fills in his name (yaiche), his first name (wiem), his login (wiem), his password (****) and a set of interests which he selects from the ontology domain (Case based reasoning, annotation). The domain ontology is organised in a concepts tree. The user profile created is memorized in RDF data base of user profiles in the form of RDF statements, based on the user ontology 63 modelled in OWL (figure 3). The RDF data base of user profiles will be re-used later. preceded by an icon. When the user finds a document which he considers relevant to his query, he annotates it by clicking on the icon. 
Figure 2: User new inscription interface Figure 5: Google retrieved documents for a query name First name PID possède login password Domain ontology interests Figure 3 : User ontology diagram Figure 6 shows the interface which permits to annotate a document considered by the user to be relevant to his query. The annotation consists on the one hand in determining the standardized properties of Dublin Core such as URL (http://www.scientificamerican....), the title (the semantic Web), the author (Tim BernersLee, James Hendler, Ora Lassila), the date (May 2001) and the language of the document (English), and on the other hand to select a set of concepts from the ontology domain in order to describe the document according to the user point of view (semantic Web definition, ontology definition, annotation definition). Figure 4 shows the SYRANNOT home page. The enrichment of the cases data base consists in submitting queries (semantic Web) on the google search engine, collecting retrieved documents, and annotating the relevant documents for the query. Figure 6: Annotation creation interface Figure 4: Scenario choice interface Figure 5 shows the answers collected by google for the submitted query. It is a list of URL, each one is The annotation created has a single identifier (AID) assigned by the system, and is memorized in the RDF data base of annotations, in the form of RDF statements based on ontology annotation (figure 7). 64 The RDF data base of annotations will be re-used later. Filtered Annotations Cases base Annotations containing the query concepts URL annotations Base title Document concern Users profiles base author Reclassified Annotations Domain ontology AID langage Relevant answers documents contain query date annotation Terms IHM Domain ontology user Figure 9: Cases re-use scenario Figure 7: Annotation ontology diagram The research session (the user identifier, the submitted query, the annotation identifier, session date) is memorized in the RDF data base of cases, in the form of RDF statements based on the ontology cases (figure 8). user date submit query annotation creator In figure 4, the link recherche sur SYRANNOT permits the current user to have retrieved documents from previous similar experiments (similar profiles, similar queries). Figure 10 presents the query formulation interface which allows the user to interrogate the memorized cases via SYRANNOT. The user expresses his need of information by selecting concepts from the domain ontology (semantic Web definition). He can also make an advanced research on the author, the date or the language of the document. Figure 8: Case ontology diagram The scenario presented above corresponds to the stages of representation and memorizing of the CBR cycle. 2.2 Cases re-use Scenario The current user, having a given profile memorized in the RDF data base of user profiles, formulates his query by selecting one or more concepts from the domain ontology. The system scans the RDF data base of annotations and collects those having at least one concept of the query in the annotations terms field. The system filters these annotations by calculating the similarity between the current query and the annotation terms of each annotation in order to retain the 20 most relevant annotations. Then, the system reclassifies them by calculating the similarity between the current user profile and the profile of the user who has created the annotation. 
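The annotations described here (Dublin Core properties plus concepts selected from the domain ontology) are memorized as RDF statements. The following is only a minimal sketch of how one such annotation could be built with the Jena RDF API that the system relies on; the namespace, the property names and the placeholder document URL are hypothetical and do not reproduce the actual SYRANNOT ontologies:

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.rdf.model.Resource;
import com.hp.hpl.jena.vocabulary.DC;

// Sketch only: builds one annotation as RDF statements and prints it as RDF/XML.
public class AnnotationSketch {
    // Hypothetical annotation-ontology namespace (not the real SYRANNOT ontology).
    private static final String ANNOT_NS = "http://example.org/syrannot/annotation#";

    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        Property terms = model.createProperty(ANNOT_NS, "terms");       // concepts chosen from the domain ontology
        Property document = model.createProperty(ANNOT_NS, "document"); // URL of the annotated document

        // The AID would be assigned by the system; "AID42" is a placeholder.
        Resource annotation = model.createResource(ANNOT_NS + "AID42");
        annotation.addProperty(document, "http://example.org/some-document");   // placeholder URL
        annotation.addProperty(DC.title, "The Semantic Web");
        annotation.addProperty(DC.creator, "Tim Berners-Lee, James Hendler, Ora Lassila");
        annotation.addProperty(DC.date, "May 2001");
        annotation.addProperty(DC.language, "English");
        annotation.addProperty(terms, "semantic Web definition");
        annotation.addProperty(terms, "ontology definition");
        annotation.addProperty(terms, "annotation definition");

        // The real system would add these statements to its annotations RDF base;
        // here they are simply serialized to standard output.
        model.write(System.out, "RDF/XML-ABBREV");
    }
}

The user-profile and case statements would follow the same pattern against their respective ontologies.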
Finally, the system extracts and presents to the current user some information about the relevant documents (URL, author, date, etc). Figure 9 illustrates the cases re-use scenario. Figure 10 : Query formulation interface SYRANNOT scans the RDF data base of annotations and collects all the annotations containing at least one element of the query in the annotations terms field. SYRANNOT then filters these annotations in order to retain the 20 most relevant annotations by using API JENA [JENA] developed by the HP company (the objective of JENA is to develop applications for the semantic Web) and by calculating the similarity between the concepts of the current query and the 65 concepts of the field terms of annotations corresponding to each annotation. JENA is used to carry out inferences on the ontologies and on the RDF data bases. The similarity calculation of two sets of concepts is carried out by using the Wu Palmer formula: ⎞ 1⎛ 1 1 Sim( A, B) = ⎜⎜ max(ConSim( Ai, Bi))⎟⎟ ∑ max(ConSim( Ai, Bi)) + | B | Bi∑ Ai∈P1 2 ⎝ | A | Ai∈P1 Bi∈P2 ∈P2 ⎠ With A: set of concepts {Ai}, |A| cardinal of A B: set of concepts {Bi}, |B| cardinal of B ConSim(C1, C2): similarity calculation function between two concepts C1 and C2, in a concepts tree. 3. SYRANNOT tests and evaluations To evaluate the contribution of SYRANNOT, we have initialized the data bases of cases, profiles, and annotations by simulating research sessions. Thus, we have built a corpus including a hundred PDF scientific documents annotated using the domain ontology. First evaluations showed that the fact that the concepts used for the annotations are elements of the current query permits to SYRANNOT to present a significant assistance to the current user. Our current research tasks focus on the comparison of the performances of SYRANNOT to other existing systems based on annotations. 4. Conclusion ConSim(C1, C2) = 2 * depth (C)/(depthc (C1) + depthc (C2)) With C is the smallest generalizing of C1 and C2 in arcs number, depth (C) is the number of arcs which separates C from the root. The system then reclassifies the 20 relevant annotations by calculating the similarity between the current user profile and the profile of the user who has created the annotation by using JENA and the Wu Palmer formula. The system extracts from each annotation and presents to the user the URL, the title, the author, the language of the document, as well as the query submitted to google for a possible reformulation of the user query (figure 11). Figure 11: SYRANNOT Results The scenario presented above corresponds to the stages of re-use and adaptation of the CBR cycle. In this paper, we presented the SYRANNOT system architecture which assists a user in the information retrieval session by presenting relevant retrieved documents for his query and adapted to his profile. SYRANNOT integrates the CBR mechanism in the semantic annotations coming out of the semantic Web field. Ontological models were presented, as well as the research algorithms and the similarity calculation functions proposed in each stage of the CBR cycle. Experimental evaluations have shown very encouraging results in particular when the data base of cases is important and diversified. References [AAM 94] AAMODT, A., PLAZA, E. Case-Based Reasoning : Foundational Issues, Methodological Variations and System Approaches. March 1994, AI Communications, the Europeen journal on AI, 1994, Vol 7, N°1, p. 39-59. [COR 98] CORVAISIER, F., MILLE, A., PINON, J.M. 
Radix 2, assistance à la recherche d'information documentaire sur le web. In IC'98, Ingénierie des Connaissances, Pont-à-Mousson, France, INRIALORIA, Nancy, 1998, p. 153-163. [JENA] jena.sourceforge.net/ [KOL 93] KOLODNER, J. Case based reasoning. San Mateo, CA: Morgan Kaufman, 1993. [JER 01] JÉRIBI, L. Improving Information Retrieval Performance by Experience Reuse. Fifth International ICCC/IFIP conference on Electronic Publishing: '2001 in the Digital Publishing Odyssey' ELPUB2001. Canterbury, United Kingdom, 5-7 July 2001, p.78-92. [SCH 89] SCHANK, R. C., RIESBECK, C. K. Inside Case Based Reasoning. Hillsdale, New Jersey, Usa : Lawrence Erlbaum Associates Publishers, 1989, 423 p. [SMA 99] SMAÏL, M. Recherche de régularités dans une mémoire de sessions de recherche d’information documentaire", InforSID’99, actes des conférences, XVIIème congrès, La Garde, Toulon, 2-4 juin 1999, p. 289-304. 66 Search in Peer-to-Peer File-Sharing System: Like Metasearch Engines, But Not Really Wai Gen Yee, Dongmei Jia, Linh Thai Nguyen Information Retrieval Lab Illinois Institute of Technology Chicago, IL 60616 Email: {yee, jiadong, nguylin}@iit.edu Abstract Peer-to-peer information systems have gained prominence of late with applications such as file sharing systems and grid computing. However, the information retrieval component of these systems is still limited because traditional techniques for searching and ranking are not directly applicable. This work compares search in peer-to-peer information systems to that in metasearch engines, and describes how they are unique. Many works describing advances in peer-to-peer information retrieval are cited. 1 Introduction File-sharing is a major use of peer-to-peer (P2P) technology. CacheLogic estimates that one-third of all Internet bandwidth is consumed by file-sharing applications [2]. Although much of this bandwidth consumption can be attributed to the large sizes of the shared files (i.e., media files), the fact that the usage has been consistent use suggests its popularity. Increasing system size increases the importance of search technology that helps rank query results. The task of effective ranking is exactly the goal of information retrieval (IR). However, traditional IR ranking does not function effectively in a P2P environment. Some work in the area of Web IR, however, is similar: the metasearch engine whose goal is to dispatch queries to other search engines and then rank their results. The goal of this paper is to explain the shortcomings of traditional IR in the P2P environment as well as the similarities and differences between metasearch engines and search in the P2P environment. The impact of improved ranking should be significant in terms of resource usage as well. The popular Gnutellabased P2P file sharing systems basically flood the network with queries, which is bandwidth intensive. By improving ranking effectiveness, fewer queries need to be issued to find a particular data object. Furthermore, effective ranking reduces the likelihood that a user will accidentally find and download interesting, but unrelated data. 2 Peer-to-Peer File Sharing Model Our model is based on that which exists in common P2P file sharing systems, such as Gnutella and Kazaa [8]. Peers of a P2P system collectively share a set of data objects by maintaining local replicas of them. Each replica1 (of a data object) is a file (e.g., a music file), which is identified by a descriptor. A descriptor is a metadata set, which is composed of terms. 
Depending on the implementation, a term may be a single word or a phrase. (A metadata set is technically a bag of terms, because each term may occur multiple times.) A peer acts as a client by initiating a query for a particular data object (as opposed to any one of a category of data objects). A query is also a metadata set, composed of terms that a user thinks best describe the desired data object. A query is routed to all reachable peers, which act as servers. Query results are references to data objects that fulfill the matching criterion DO ⊇ Q, where Q 6= ∅, (1) where DO is the descriptor of data object O, and Q is the query. In other words, by design, the data object’s descriptor must contain all the query terms [14]. A query result contains the data object’s descriptor as well as the identity of the source server. The descriptor helps the user distinguish the relevance of the data object to the query, and the server identity is required to initiate the data object’s download. 1 We 67 use the term replica and data object interchangeably. Once the user selects a result (for download), a local replica of the corresponding data object is made. In addition, the user has the option of manipulating the replica’s descriptor. He may manipulate it for personal identification or to better share it in the P2P system. The set of peers in a P2P file sharing system is connected in a general graph topology. Generally, peers join the system at arbitrary points, creating a random graph, although other topologies are possible and may yield performance benefits [13, 15, 18, 20]. Note that one major variation to the model is in what data are shared. We assume that data are binary objects, and, to be effectively shared, need to be identified via metadata in descriptors. Shared data, however, may be text. In this case, the data are self-describing, containing text that can be searched directly. This distinction may be important because the ranking scores of self-describing data objects are consistent, not being dependent on user-tunable descriptors. Furthermore, self-describing data objects are easier to rank because the information contained in descriptors tends to be more sparse and less consistent. Another variation is in the way the network and data are organized. We assume a random graph, but more structure may be introduced, such as a ring or mesh, as suggested above. Furthermore, in such systems, data are often restricted in where they can be placed. For example, in the DHT described in [20], a ring network topology is enforced, and a data object is placed on a node with a node identifier that most closely matches the data object’s object identifier. Consequently, replication of data is not allowed. dynamism of the P2P environment. 3.1 Source Selection Source selection in metasearch engines is done by maintaining statistics on the contents of each search engine. This is often done by sampling [5]. Terms are extracted from a pre-defined corpus, and the contents of a search engine are deduced based on the results. Source selection in P2P file sharing systems is related to the task of query routing because of the topology of the network. Because of the size of the network, two peers may be connected, but only through intermediate peers. The most general form of source selection, used by Gnutella [8], is through flooding, where queries are routed to all neighbors in a breadth-first fashion, until a certain query time-to-live has expired. 
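The matching criterion of equation (1) in Section 2 is simple enough to state directly in code. Here is a minimal sketch, treating descriptors and queries as plain term sets and ignoring the multiplicity that technically makes a descriptor a bag of terms; the example terms are hypothetical:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch of the matching criterion in equation (1): a shared file is a query
// result only if its descriptor contains every query term and the query is non-empty.
public class MatchingCriterion {
    static boolean matches(Set<String> descriptor, Set<String> query) {
        return !query.isEmpty() && descriptor.containsAll(query);
    }

    public static void main(String[] args) {
        Set<String> descriptor = new HashSet<String>(
                Arrays.asList("artist", "title", "album", "live"));
        Set<String> query = new HashSet<String>(Arrays.asList("artist", "title"));
        System.out.println(matches(descriptor, query));   // true: the descriptor contains every query term
    }
}

Ranking, as the paper stresses, is a separate problem: this predicate only decides membership in the result set.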
Alternatives to flooding include the use previous query responses and the publication of content signatures to intelligently route queries. A peer may learn how responsive its neighbors are based on its responses to past queries and may route future queries accordingly [6, 16, 19]. Another way a peer can control routing is by looking at signatures that servers generate to describe their shared content [4, 9]. Queries are routed to servers whose signatures are the best matches. Finally, many P2P routing algorithms are based on distributed hash tables [13, 15, 20], which efficiently route single keys to nodes with the closest-matching node identifiers. This problem searching for data objects described with multiple terms has been addressed in various ways, such as by generating a query for each term in the query or by using unhashed queries and unhashed signatures to describe content [3, 17, 21]. The sampling technique used by metasearch engines does not in general work for P2P file sharing systems because of the dynamism and topology of the latter. Because all peers are autonomous, they can leave the network at any time. This can render any collected statistics obsolete. 3 Similarity to Metasearch Engines Metasearch engines’ main selling points are their ability to search a larger data repository and return results that are ranked better. These features stem from the fact that different data sources (other search engines) may index different data repositories, and, if their data repositories overlap, they can improve overall ranking by corroborating or contradicting each others’ rankings. The main tasks carried out by a metasearch engine include source selection, query dispatching, result selection, and result merging [11]. Source selection is the process of selecting the search engines to query. Query dispatching is the process of translating a query to the search engine’s local format, preserving the semantics of the final results. Result selection is the selection of the results returned by a search engine for consideration in the final results. Result merging is the ranking of the selected documents. These tasks have analogs in P2P file sharing systems because they and metasearch engines both work in an environment where there are many independent, heterogeneous data sources. The difference, as we shall see below, is in the 3.2 Query Dispatching Query dispatching in metasearch engines has received little attention because it is considered straightforward. For example, to express term weights in a query, certain terms may have to be repeated in the translated query. A certain number of results may be desired from each search engine so that the total number of results returned to the metasearch engine is fixed; this can be adjusted as well. Little attention has been paid in the literature to query dispatching in P2P file sharing systems as well. In general, all peers that have been “selected” are given the same query and are assumed to use the same ranking function. This is 2 68 generally the case in practical P2P file sharing systems. Furthermore, their results are not ranked–results basically have to conform to the matching criterion, described in Section 2. It therefore makes little sense to modify a query. Attempts to improve performance by query transformation have been limited. One attempt is to use query expansion by creating graphs that connect related terms as synonyms [12]. This term graph is generated dynamically using the data stored locally on each peer. 
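The term-graph expansion of [12] is only summarized above. The following rough sketch, with hypothetical class and method names, conveys the general idea of expanding a query from term co-occurrence statistics gathered over locally stored descriptors; it is not the algorithm of [12]:

import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Rough illustration only: a co-occurrence "term graph" built from local descriptors.
public class TermGraphSketch {
    // term -> co-occurring term -> co-occurrence count
    private final Map<String, Map<String, Integer>> graph = new HashMap<String, Map<String, Integer>>();

    // Register one descriptor (the metadata set of a locally stored replica).
    void addDescriptor(Collection<String> terms) {
        for (String a : terms) {
            for (String b : terms) {
                if (a.equals(b)) continue;
                Map<String, Integer> row = graph.get(a);
                if (row == null) {
                    row = new HashMap<String, Integer>();
                    graph.put(a, row);
                }
                Integer count = row.get(b);
                row.put(b, count == null ? 1 : count + 1);
            }
        }
    }

    // Expand a query by adding, for each query term, its most frequent co-occurring term.
    Set<String> expand(Set<String> query) {
        Set<String> expanded = new HashSet<String>(query);
        for (String term : query) {
            Map<String, Integer> row = graph.get(term);
            if (row == null) continue;
            String best = null;
            int bestCount = 0;
            for (Map.Entry<String, Integer> e : row.entrySet()) {
                if (e.getValue() > bestCount) {
                    best = e.getKey();
                    bestCount = e.getValue();
                }
            }
            if (best != null) expanded.add(best);
        }
        return expanded;
    }
}

Whether such expansion helps depends on how representative the local collection is, which is exactly the kind of constraint this section keeps returning to.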
We are currently also using a process we call query masking to grow and shrink queries at the client or server to tune the results that eventually reach the client [24]. The idea behind query masking is to control the recall and precision of query results by selecting a subset of the terms in the query. Applying traditional query transformation techniques in a P2P file sharing environment is also made difficult by the scale of the system. To effectively transform a query, the client must maintain statistical information about each server. The fact that the number of potential servers is in the millions obviates the use of traditional methods. 3.3 Result Selection Result selection in metasearch engines is performed using knowledge of each search engine’s relevance to a particular query. In general, the more relevant search engine is asked to return more results, so that the metasearch engine can tune the final number of results to return to the user. Result selection requires that the search engine rank results so that the top few can be returned. Ranking is generally not supported by servers in P2P file sharing systems. Recent research efforts, however, have incorporated ranking into the servers. In [9], for example, specialized nodes (known as ultrapeers) in the P2P system function as servers for particular content, and all peers that have such content are directly connected with it. All queries are routed to the relevant ultrapeer which, knowing the contents and ranking function of each of its attached peers (K-L divergence, in this case), can perform effective document selection. In [4], the client locally ranks results from servers keeping the top ones for the final result. In general, however, metasearch engine result selection is inapplicable in P2P file sharing environments because it is difficult to control the servers to which a query is sent, not to mention to maintain knowledge of every potential server’s contents and ranking functions. Furthermore, it is difficult to maintain a P2P network that effectively clusters peers based on their shared content. This complicates the implementation of a hybrid architecture containing ultrapeers. 3.4 Result Merging Result merging in metasearch engines generally employs ranking scores returned in the result set of each of the selected search engines. Each of these scores is normalized using knowledge of relevance of each search engine to the query and then a final result set is created containing the results with the highest normalized scores. Alternatively, if results refer to text documents, all top-ranked documents from each search engine can be downloaded by the metasearch engine to perform local ranking. Duplicate results can be handled by maintaining only the maximum, the sum, or the average rank score. Result merging in P2P file sharing systems poses two fundamental problems. First, it assumes that the client has knowledge of the ranking process of the servers, which is unlikely, considering the heterogeneity and dynamism of the system. Second, it assumes that ranking can be done at all by any peer–not a certainty, considering the lack of global statistics. Traditional IR techniques have been adapted to result merging in P2P environments with reasonable levels of effectiveness [4, 7, 10]. In [4], servers are ranked and then sequentially searched until a server’s result set does not affect the current top N results. In [10], semi-supervised learning and Kirsch’s algorithm are used for result merging in ultrapeers. 
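To make the metasearch-style merging described at the start of this subsection concrete, here is a generic, purely illustrative routine (none of the cited systems is reproduced): each server's scores are normalized by that server's maximum score, and duplicate documents are collapsed by keeping the maximum normalized score.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Generic illustration of metasearch-style result merging.
public class ResultMergingSketch {
    // serverResults maps a document identifier to the raw score reported by one server.
    static Map<String, Double> normalize(Map<String, Double> serverResults) {
        double max = 0.0;
        for (double s : serverResults.values()) max = Math.max(max, s);
        Map<String, Double> normalized = new HashMap<String, Double>();
        for (Map.Entry<String, Double> e : serverResults.entrySet()) {
            normalized.put(e.getKey(), max > 0.0 ? e.getValue() / max : 0.0);
        }
        return normalized;
    }

    static List<Map.Entry<String, Double>> merge(List<Map<String, Double>> perServer) {
        Map<String, Double> merged = new HashMap<String, Double>();
        for (Map<String, Double> results : perServer) {
            for (Map.Entry<String, Double> e : normalize(results).entrySet()) {
                Double existing = merged.get(e.getKey());
                // Duplicate handling: keep the maximum normalized score
                // (the sum or the average are the other options mentioned above).
                if (existing == null || e.getValue() > existing) {
                    merged.put(e.getKey(), e.getValue());
                }
            }
        }
        List<Map.Entry<String, Double>> ranked =
                new ArrayList<Map.Entry<String, Double>>(merged.entrySet());
        Collections.sort(ranked, new Comparator<Map.Entry<String, Double>>() {
            public int compare(Map.Entry<String, Double> a, Map.Entry<String, Double> b) {
                return b.getValue().compareTo(a.getValue());
            }
        });
        return ranked;
    }
}

In a P2P setting the catch, as the text notes, is that the client generally cannot assume the servers rank at all, let alone that they share a score scale.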
A novel result in [22], however, is that group size–the number of results that refer to the same data object for a given query–is a better ranking metric than tf-idf. Furthermore, [23] shows that different ranking functions can be effectively used to find data of varying popularity in a P2P file sharing system. 4 Conclusion Information retrieval in P2P file sharing systems is complicated due to the fact that global statistics are hard to collect due to the dynamism of the system in terms of the shared content, the availability of peers, and the topology of the network. One fundamental question therefore is to see whether IR can be performed at all in P2P systems. The works cited in this paper present solutions that put various levels of constraints on the system (e.g., from a random to a fixed network topology). These constraints affect the applicability of a P2P system (e.g., a fixed topology would be appropriate for a grid system, but inappropriate for today’s file sharing systems). As systems become more constrained, it seems, traditional IR becomes more applicable, because more global statistics can be harvested. In unconstrained environments, the work becomes more challenging, as fewer assumptions can be made and less information is available; pre-existing IR techniques lose relevance. In effect, we propose that work be done carefully 69 considering the parameters of the P2P system, including the network topology, the autonomy of the peers, type of data shared, and the distribution of data. Contributions can still be made in constrained environments, but fundamental advances, such as link analysis in Web IR [1], can only be made in unconstrained ones. References [12] K. Nakauchi, Y. Ishikawa, H. Morikawa, and T. Aoyama. Peer-to-peer keyword search using keyword relationship. In Proc. Wkshp. Global and Peerto-Peer Comp. Large Scale Dist. Sys (GP2PC), pages 359–366, Tokyo, Japan, 2003. [13] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker. A scalable content-addressable network. In Proc. ACM SIGCOMM, 2001. [1] S. Brin and L. Page. The anatomy of a large scale hypertextual web search engine. In Proc. World Wide Web Conf., 1998. [14] C. Rohrs. Keyword matching [in gnutella]. Technical report, LimeWire, Dec. 2000. www.limewire.org/techdocs/KeywordMatching.htm. [2] CacheLogic. Cachelogic home page. Web Document. www.cachelogic.com. [15] A. Rowstron and P. Druschel. Storage management and caching in past, a large-scale, persistent, peer-topeer storage utility. In Proc. SOSP, 2001. [3] A. Crainiceanu, P. Linga, J. Gehrke, and J. Shanmugasundaram. Querying peer-to-peer networks using p-trees. In Proc. Wkshp. Web and Database, Paris, France, 2004. [16] Y. Shao and R. Wang. Buddynet:history-based p2p search. y. shao, r. wang. in ecir-05. In Proc. Euro. Conf. on Inf. Ret., 2005. [4] F. M. Cuenca-Acuna and T. D. Nguyen. Text-based content search and retrieval in ad hoc p2p communities. In Proc. Intl. Wkshp Peer-to-Peer Comp, May 2002. [17] S. Shi, G. Yang, D. Wang, J. Yu, S. Qu, and M. Chen. Making peer-to-peer keyword searching feasible using multi-level partitioning. In Intl. Wkshp. on P2P Sys. (IPTPS), 2004. [5] P. G. Ipeirotis and L. Gravano. Distributed search over the hidden web: Hierarchical database sampling and selection. In Proc. VLDB, pages 394–405, 2002. [18] A. Singla and C. Rohrs. Ultrapeers: Another step towards gnutella scalability. Technical report, Limewire, LLC, 2002. rfcgnutella.sourceforge.net/src/Ultrapeers 1.0.html. [6] V. Kalogeraki, D. Gunopulos, and D. 
ZeinalipourYazti. A local search mechanism for peer-to-peer networks. In Proc. ACM Conf. on Information and Knowledge Mgt. (CIKM), 2002. [19] K. Sripanidkulchai, B. Maggs, and H. Zhang. Efficient content location using interest-based locality in peerto-peer systems. In Proc. IEEE INFOCOM, 2003. [7] I. A. Klamponos, J. J. Barnes, and J. M. Jose. Evaluating peer-to-peer networking for information retrieval within the context of meta-searching. In Proc. Euro. Conf. on Inf. Ret., pages 528–536, 2003. [8] T. Klingberg and R. Manfredi. Gnutella protocol 0.6. Web Document, 2002. rfcgnutella.sourceforge.net/src/rfc-0 6-draft.html. [9] J. Lu and J. Callan. Content-based retrieval in hybrid peer-to-peer networks. In Proc. ACM Conf. on Information and Knowledge Mgt. (CIKM), pages 199–206, Nov. 2003. [10] J. Lu and J. Callan. Federated search of text-based digital libraries in hierarchical peer-to-peer networks. In Proc. Euro. Conf. on Inf. Ret., 2005. [11] W. Meng, C. Yu, and K.-L. Liu. Building efficient and effective metasearch engines. ACM Comp. Surveys, 34(1):48–84, Mar. 2002. [20] I. Stoica, R. Morris, D. Karger, F. Kaashoek, and H. Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. In Proc. ACM SIGCOMM, 2001. [21] C. Tang, Z. Xu, and S. Dwarkadas. Peer-to-peer information retrieval using self-organizing semantic overlay networks. In Proc. ACM SIGCOMM, Aug. 2003. [22] W. G. Yee and O. Frieder. On search in peer-to-peer file sharing systems. In Proc. ACM SAC, Santa Fe, NM, Mar. 2005. [23] W. G. Yee, D. Jia, and O. Frieder. Finding rare data objects in p2p file-sharing systems. In Proc. IEEE P2P Conf., Constance, Germany, Sept. 2005. [24] W. G. Yee, L. T. Nguyen, and O. Frieder. Improving search performance in p2p file sharing systems by query masking. In Under Review, June 2005. 70