Report on "Advances in Web Archiving - LiWA
Report on "Advances in Web Archiving - LiWA
European Commission Seventh Framework Programme
Call: FP7-ICT-2007-1, Activity: ICT-1-4.1, Contract No: 216267

Report on "Advances in Web Archiving Technologies"
Deliverable: D6.5, Version 1.0
Editor: EA
Work Package: WP6
Status: Final Version
Date: M22
Dissemination Level: PU

LiWA Project Overview
Project Name: LiWA – Living Web Archives
Call Identifier: FP7-ICT-2007-1
Activity Code: ICT-1-4.1
Contract No: 216267
Partners:
1. Universität Hannover, L3S Research Center, Germany (Coordinator)
2. European Archive Foundation (EA), Netherlands
3. Max-Planck-Institut für Informatik (MPG), Germany
4. Computer and Automation Research Institute, Hungarian Academy of Sciences (MTA SZTAKI), Hungary
5. Stichting Nederlands Instituut voor Beeld en Geluid (BeG), Netherlands
6. Hanzo Archives Limited (HANZO), United Kingdom
7. National Library of the Czech Republic (NLP), CZ
8. Moravian Library (MZK), CZ

Document Control
Title: D6.5 Report on "Advances in Web Archiving Technologies"
Author/Editor: Radu Pop, Julien Masanes (EA); Mark Williamson (HANZO); Andras Benczur (MTA); Marc Spaniol (MPG); Thomas Risse (L3S)

Document History
Version  Date        Author/Editor  Description/Comments
0.1      04/05/2009  Radu Pop       Document plan
0.2      01/06/2009  all            First draft
0.3      22/06/2009  all            Updates
0.4      12/08/2009  all            First version
0.5      22/11/2009  JM             Introduction, Section 2 completed

Legal Notices
The information in this document is subject to change without notice. The LiWA partners make no warranty of any kind with regard to this document, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose. The LiWA Consortium shall not be held liable for errors contained herein or direct, indirect, special, incidental or consequential damages in connection with the furnishing, performance, or use of this material.
Table of Contents
1 Introduction
2 Current Challenges in Web Archiving
  2.1 Archive Fidelity
  2.2 Archive Coherence
  2.3 Archive Interpretability
3 Archive's completeness
  3.1 A new crawling paradigm
  3.2 Capturing Streaming Multimedia
  3.3 State of the Art on the Streaming Capture Software
  3.4 Rich Media Capture Module
  3.5 Integration into the LiWA Architecture
  3.6 Evaluation and Optimizations
  3.7 References
4 Spam Cleansing
  4.1 State of the Art on Web Spam
  4.2 Spam Filter Module
  4.3 Evaluation
  4.4 Integration into the LiWA Architecture
  4.5 References
5 Temporal Coherence
  5.1 State of the Art on Archive Coherence
  5.2 Temporal Coherence Module
  5.3 Evaluation and Visualization
  5.4 Integration into the LiWA Architecture
  5.5 References
6 Semantic Evolution
  6.1 State of the Art on Terminology Evolution
  6.2 Detecting Evolution
  6.3 Terminology Evolution Module
  6.4 Evaluation
  6.5 Integration into the LiWA Architecture
  6.6 References

1 Introduction

Archiving the web is now on the agenda of many organizations. From companies that are required by regulation to preserve their websites or intranets, to national libraries whose collecting mission encompasses entire national domains, to national archives required to preserve the entire governmental web presence, more and more organizations are engaging with preserving Web content.

Yet, Web preservation is still a very challenging task. In addition to the "usual" challenges of digital preservation (media decay, technological obsolescence, authenticity and integrity issues, etc.), Web preservation has its own unique difficulties:
- Rapidly evolving publishing and encoding technologies, which challenge the ability to capture Web content in an authentic and meaningful way that guarantees long-term preservation and interpretability,
- Distribution and temporal properties of online content, with unpredictable aspects such as transient unavailability,
- Huge number of actors (organizations and individuals) contributing to the Web,
- Large variety of needs that Web content preservation will have to serve.

The Living Web Archive (LiWA) project is the first extensive R&D project entirely devoted to addressing some of these challenges. This document presents a report on the advances made in this domain at mid-term of the project. The focus of LiWA is described in Section 2. Section 3 addresses the completeness of archives, explaining the challenges of capturing the entire content of sites and the achievements made so far. Section 4 describes the problem that Web spam raises for Web archives and the methods developed to detect and filter it. Section 5 deals with archive temporal coherence, with methods to measure, evaluate and visualize it. Finally, Section 6 describes the issue of semantic evolution in Web archives and proposes methods to make archives easier to search in the future.

2 Current Challenges in Web Archiving

This section presents the main research challenges that LiWA is addressing. We have grouped them in three main problem areas: archive fidelity, temporal coherence, and interpretability.

2.1 Archive Fidelity

The first problem area is the archive's fidelity and authenticity to the original. Fidelity comprises, on the one hand, the ability to capture all types of content, including non-standard types of Web content such as streaming media, which often cannot be captured at all by existing Web crawler technology. In Web archiving today, state-of-the-art crawlers, based on page parsing for link extraction and human monitoring of crawls, are at their intrinsic limits. Highly skilled and experienced staff and technology-dependent incremental improvement of crawlers are permanently required to keep up with the evolution of the Web; this increases the barrier to entry in this field and often produces dissatisfying results due to poor fidelity. Consequently, this also leads to increased costs of storage and bandwidth due to the unnecessary capture of irrelevant content.
Current crawlers fail to capture all Web content, because the current Web comprises much more than simple HTML pages: dynamically created pages, e.g., based on JavaScript or Flash; multimedia content that is delivered using media-specific streaming protocols; hidden Web content that resides in data repositories and content-management systems behind Web site portals. In addition to the resulting completeness challenges, one also needs to avoid useless content, typically Web spam. Spam classification and page-quality assessment is a difficult issue for search engines; for archival systems, which should ideally filter spam during the crawl process, it is even more challenging, as they lack information about usage patterns (e.g., click profiles) at capture time.

LiWA has developed novel methods for content gathering of high-quality Web archives. They are presented in Sections 3 (on completeness) and 4 (on filtering Web spam) of this report.

2.2 Archive Coherence

The second problem area is a consequence of the Web's intrinsic organization and of the design of Web archives. Current capture methods, for instance, are based on snapshot crawls and "exact duplicate" detection. The archive's integrity and temporal coherence - proper dating of content and proper cross-linkage - is therefore entirely dependent on the temporal characteristics (duration, frequency, etc.) of the crawl process. Without judicious measures that address these issues, proper interpretation of archived content would be very difficult, if possible at all.

Ideally, the result of a crawl is a snapshot of the Web at a given time point. In practice, however, the crawl itself needs an extended time period to gather the contents of a Web site. During this time span, the Web continues to evolve, which may cause incoherencies in the archive. Current techniques for content dating are not sufficient for archival use, and require extensions for better coherence and reduced cost of the gathering process. Furthermore, the desired coherence across repeated crawls, each one operating incrementally, poses additional challenges, but also opens up opportunities for improved coherence, specifically to improve crawl revisit strategies. These issues are addressed in Section 5 of this report (Temporal Coherence).

2.3 Archive Interpretability

The third problem area is related to the factors that will affect Web archives over the long term, such as the evolution of terminology and the conceptualization of the domains underlying and contained by a Web archive collection. This has the effect that users familiar with and relying upon up-to-date terminology and concepts will find it increasingly difficult to locate and interpret older Web content. This is particularly relevant for long-term preservation of Web archives, since it is not sufficient to just be able to store and read Web pages in the long run - a "living" Web archive is required, which will also ensure accessibility and coherent interpretation of past Web content in the distant future. Methods for extracting key terms and their relations from a document collection produce a terminology model at a given time. However, they do not consider the semantic evolution of terminologies over time.
Three challenges have to be tackled to capture this evolution: 1) extending existing models to take the temporal aspect into account, 2) developing algorithms to create relations between terminology snapshots in view of the changing meaning and usage of terms, 3) presenting the semantic evolution to users in an easily comprehensible manner. The availability of temporal information opens new opportunities to produce higher-quality terminology models. Advances in this domain are presented in Section 6 of this report (Semantic Evolution).

3 Archive's completeness

One of the key problems in Web archiving is the discovery of the resources that need to be fetched. Starting from known pages, tools that capture Web content have to discover all linked resources, including embeds (images, CSS, etc.), even when they belong to the same site, since no listing function is implemented in the HTTP protocol. This is traditionally done by 'crawlers', software tools that automatically parse known pages to extract links from the HTML code and add them to a queue, called the frontier. This method was designed at a time when the Web was entirely made of simple HTML pages, and it worked perfectly in that context. When navigational links started to be coded with more sophisticated means, like scripts or executable code, embedded or not in HTML, this method showed its limits.

We can classify navigational links into broadly three categories, depending on the type of code in which they are encoded:
1. Explicit links (source code is available and the full path is explicitly stated)
2. Variable links (source code is available but uses variables to encode the path)
3. Opaque links (source code not available)

Current crawling technologies only address the first category and, partially, the second. For the latter, crawlers use heuristics to append file and path names to reconstitute URLs. Heritrix even has a mode in which every possible combination of path and file name found in embedded JavaScript is assembled and tested. This method has a high cost in terms of the number of fetches. Besides, it still misses the cases where variables are used as parameters to code the URL. For those cases, as well as for the navigational links of the third category, the only solution is to actually execute the code to get the links. This is what LiWA has been exploring. Although the result of this research is proprietary technology, we describe the approach taken at a general level.

3.1 A new crawling paradigm

Executing pages in order to capture sites mainly requires three things. The first is to run an execution environment (HTML plus JavaScript, Flash, etc.) in a controlled manner so that discoverable links can be extracted systematically. Web browsers can provide this functionality, but they are designed to execute code and fetch links one at a time, following the user's interaction. The solution consists in tethering these browsers so that they execute all code containing links and extract these links without directly fetching the linked resources, adding them instead to a list (similar to a crawler frontier).

The second challenge is to encapsulate these headless browsers in a crawler-like workflow, whose main purpose is to systematically explore all the branches of the hypertext tree. The difficulty comes from the fact that contextual variables can be used in places, which makes a simple one-pass execution of the target code (HTML plus JavaScript, Flash, etc.) incomplete. This challenge has been called non-determinism [MBD*07].
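As an illustration of the first two ingredients, the sketch below drives a headless browser so that scripted links are executed and collected into a frontier instead of being fetched immediately. This is a minimal sketch only: LiWA's actual implementation is proprietary, and the use of the HtmlUnit library (and its 2.x API) here is an assumption made purely for illustration. A real workflow would also scope the frontier to the target site and re-execute event handlers to cope with the non-determinism mentioned above.

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Illustrative only: a tethered headless browser that executes page code
// and adds the discovered links to a frontier instead of fetching them right away.
public class BrowserAssistedDiscovery {
    public static void main(String[] args) throws Exception {
        Queue<String> frontier = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        frontier.add("http://www.example.org/");   // hypothetical seed; a real crawl scopes this to the site

        try (WebClient browser = new WebClient()) {
            browser.getOptions().setJavaScriptEnabled(true);            // execute scripts that build links
            browser.getOptions().setThrowExceptionOnScriptError(false); // tolerate broken page scripts

            while (!frontier.isEmpty()) {
                String url = frontier.poll();
                if (!seen.add(url)) continue;                 // already processed
                HtmlPage page = browser.getPage(url);         // page scripts run here
                for (HtmlAnchor a : page.getAnchors()) {
                    // Resolve to an absolute URL and queue it; it is not fetched now.
                    String link = page.getFullyQualifiedUrl(a.getHrefAttribute()).toString();
                    if (!seen.contains(link)) frontier.add(link);
                }
            }
        }
        System.out.println("Discovered " + seen.size() + " URLs");
    }
}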
The last, but not least, of the challenges is to optimize this process so that it can scale to the size required for archiving sites.

In most documented cases, these challenges have been addressed separately in the literature and for different purposes, for instance malware detection [WBJR06, MBD*07, YKLC08], site adaptation [JHBa08] and site testing [BTGH07]. However, to the best of our knowledge, LiWA is the first attempt to address the three together, and to do so for archiving purposes. This is currently being implemented in the new crawler that one of the partners (Hanzo Archives Ltd) has been developing; it is already used in production by them to archive a wide range of sites that cannot be archived by pre-existing crawlers, and it is being tested by another LiWA partner, the European Archive.

3.2 Capturing Streaming Multimedia

The Internet is becoming an important medium for the dissemination of multimedia streams. However, the protocols used for traditional applications were not designed to account for the specificities of multimedia streams, namely their size and real-time needs. At the same time, networks are shared by millions of users and have limited bandwidth, unpredictable delay and availability. The design of real-time protocols for multimedia applications is a challenge that multimedia networking must face.

Multimedia applications need a transport protocol to handle a common set of services. The transport protocol does not have to be as complex as TCP. Its goal is to provide end-to-end services that are specific to multimedia applications and that can be clearly distinguished from conventional data services:
- a basic framing service is needed, defining the unit of transfer, typically common with the unit of synchronization;
- multiplexing (combining two or more information channels onto a common transmission medium) is needed to identify separate media in streams;
- timely delivery is needed;
- synchronization is needed between different media; it is also a common service to networked multimedia applications.

The transfer protocols in streaming technologies are used to carry message packets, and communication takes place only through them.

Despite the growth in multimedia, there have been few studies that focus on characterizing streaming audio and video stored on the Web. Mingzhe Li et al. presented in [LCKN05] the results of an investigation of nearly 30,000 streaming audio and video clips identified on 17 million Web pages from diverse geographic locations. The streaming media objects were analyzed to determine attributes such as media type, encoding format, playout duration, bitrate, resolution, and codec. The streaming media content encountered is dominated by proprietary audio and video formats, with the top four commercial products being RealPlayer, Windows Media Player, MP3 and QuickTime. Like similar Web phenomena, the duration of streaming media follows a power-law distribution.

A more focused study was conducted in [BaSa06], analyzing a crawl sample of the media collection of several Dutch radio and TV Web sites. Three quarters of the streaming media were RealMedia files and almost one quarter were Windows Media files. The detection of streaming objects during the crawl proved to be difficult, as there are no conventions on file extensions and MIME types.
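To make this difficulty concrete, a capture workflow typically has to fall back on heuristics such as the sketch below, which flags candidate streaming resources by URI scheme, file extension or declared content type. The scheme, extension and MIME-type lists are illustrative assumptions for the example, not an exhaustive rule set used by LiWA.

import java.util.Locale;
import java.util.Set;

// Illustrative heuristic for spotting streaming resources during a crawl.
// The scheme, extension and MIME-type lists below are assumptions for the example.
public class StreamingHeuristic {
    private static final Set<String> SCHEMES = Set.of("rtsp", "rtmp", "mms", "mmst");
    private static final Set<String> EXTENSIONS = Set.of(".rv", ".rm", ".wmv", ".asf", ".mov", ".mp4");
    private static final Set<String> MIME_PREFIXES = Set.of("video/", "audio/", "application/vnd.rn-realmedia");

    public static boolean looksLikeStream(String uri, String contentType) {
        String u = uri.toLowerCase(Locale.ROOT);
        if (SCHEMES.stream().anyMatch(s -> u.startsWith(s + "://"))) return true;
        if (EXTENSIONS.stream().anyMatch(u::endsWith)) return true;
        // The content type is only known once the server answers, and is often missing or generic.
        return contentType != null
                && MIME_PREFIXES.stream().anyMatch(contentType.toLowerCase(Locale.ROOT)::startsWith);
    }

    public static void main(String[] args) {
        System.out.println(looksLikeStream("rtsp://rn.groovygecko.net/groovy/epsrc/Pioneers09_hb.rv", null)); // true
        System.out.println(looksLikeStream("http://www.example.org/index.html", "text/html"));               // false
    }
}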
Another extensive data-driven analysis, on the popularity distribution of user-generated video content, is presented by Meeyoung Cha et al. in [CKR*07]. Video content in standard Video-on-Demand (VoD) systems has historically been created and supplied by a limited number of media producers. The advent of User-Generated Content (UGC) has reshaped the online video market enormously, as well as the way people watch video and TV. The paper analyses YouTube, the world's largest UGC VoD system, serving 100 million distinct videos and 65,000 uploads daily. The study focuses on the nature of user behaviour, different cache designs and the implications of different UGC services for the underlying infrastructures. YouTube alone is estimated to carry 60% of all videos online, corresponding to a massive 50-200 Gb/s of server access bandwidth on a traditional client-server model.

3.3 State of the Art on the Streaming Capture Software

There are many tools, usually called streaming media recorders, that allow recording streaming audio and video content from the Internet. Most of them are commercial software, especially running on Microsoft Windows, and few of them are really able to capture all kinds of streams. Several research prototypes related to video streaming applications also include recording or capturing functionalities, but in general the proper capture and storage of the video content is not a central feature of the proposed systems; each prototype typically deals with a particular type of stream distribution or stream analysis. In the following we give a brief overview of existing tools and software related to stream capturing, grouped into commercial software, open-source software and research projects.

3.3.1 Off-the-shelf commercial software

Some commercial software, such as GetASFStream [ASFS] and CoCSoft Stream Down [CCSD], is able to capture streaming content through various streaming protocols. However, this software is usually not free, and when it is, it has often run into legal difficulties, like StreamBox VCR [SVCR], which was taken to court. Some useful information on capturing streaming media is summarised on the following Web sites:
http://all-streaming-media.com
http://www.how-to-capture-streaming-media.com
The most interesting software charted on these sites is the software running on the Linux platform, as it is all freeware, open source and command-line based. This last point is very important, as command-line based software may easily be integrated in simple shell scripts or Java programs, whereas GUIs (most Windows software) cannot.

3.3.2 Open-source software: MPlayer Project

MPlayer [MPlP] is an open-source media player project developed by volunteer programmers around the world. The MPlayer project is also supported by the Swiss Federal Institute of Technology in Zürich (ETHZ), which hosts the www4.mplayerhq.hu mirror, an alias for mplayer.ethz.ch. MPlayer is a command-line based media player, which also comes with an optional GUI. It allows playing and capturing a wide range of streaming media formats over various protocols. As of now, it supports streaming via HTTP/FTP, RTP/RTSP, MMS/MMST, MPST and SDP. In addition, MPlayer can dump streams (i.e. download them and save them to files on disk) and supports the HTTP, RTSP and MMS protocols to record Windows Media, RealMedia and QuickTime video content. Since the MPlayer project is under constant development, new features, modules and codecs are constantly added.
Besides, MPlayer offers good documentation and a manual on its Web site, with continuous help (for bug reports) on the mailing list and its archives. MPlayer runs on many platforms (Linux, Windows and MacOS) and includes a large set of codecs and libraries.

3.3.3 Research projects

Research projects usually focus on one of the following two aspects: the analysis of the streaming content (audio/video encoding codecs, compression and optimizations) or architectures for the distribution or efficient broadcast of the streams (content delivery networks, P2P overlays, network traffic analysis, etc.). However, several research projects provide capturing capabilities for streaming media and deal with the real-time protocols used for the broadcast.

A complex system for video streaming and recording is proposed by the HYDRA (High-performance Data Recording Architecture) project [ZPD*04]. It focuses on the acquisition, transmission, storage, and rendering of high-resolution media such as high-quality video and multiple channels of audio. HYDRA consists of multiple components to achieve its overall functionality. Among these, the data-stream recorder includes two interfaces to interact with data sources: a session manager to handle RTSP communications and multiple recording gateways to receive RTP data streams. A data source connects to the recorder by initiating an RTSP session with the session manager, which performs the following functions: it controls admission for new streams, maintains RTSP sessions with sources, and manages the recording gateways.

Malanik et al. describe a modular system which provides the capability of capturing videos and screencasts from lectures and presentations in any academic or commercial environment [MDDC08]. The system is based on a client-server architecture. The client node sends streams from the available multimedia devices to the local area network. The server provides functions for capturing video from the streams and for distributing the captured video files using torrents.

The FESORIA system [PMV*08] is an analysis tool which is able to process the logs gathered from streaming servers and proxies. It combines the extracted information with other types of data, such as content metadata, content distribution network architecture, user preferences, etc. All this information is analyzed in order to generate reports on service performance, access evolution and users' preferences, and thus to improve the presentation of the services.

With regard to TCP streaming delivered over HTTP, a recent measurement study [WKST08] indicated that a significant fraction of Internet streaming media is currently delivered over HTTP. TCP generally provides good streaming performance when the achievable TCP throughput is roughly twice the media bitrate, with only a few seconds of startup delay.

3.4 Rich Media Capture Module

The Rich Media Capture module is designed to enhance the capturing capabilities of the crawler with regard to different multimedia content types. The current version of Heritrix is mainly based on the HTTP/HTTPS protocol and cannot handle other content transfer protocols widely used for multimedia content (such as streaming). The Rich Media Capture module therefore delegates the multimedia content retrieval to an external application (such as MPlayer) that is able to handle a larger spectrum of transfer protocols. The main performance indicator for this module is therefore related to the number of additionally archived multimedia types.
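Because MPlayer is command-line driven, delegating a capture job to it essentially amounts to spawning a process with the stream URI. The sketch below shows one way such a delegation could look from Java; it is not the LiWA module itself. The output file naming and the timeout value are assumptions for the example, while -dumpstream and -dumpfile are standard MPlayer options. Wrapping the resulting dump into ARC/WARC is a separate step, described in Section 3.5.

import java.io.File;
import java.util.concurrent.TimeUnit;

// Illustrative sketch: delegate the capture of one streaming URI to MPlayer.
public class MplayerCapture {
    public static File capture(String streamUri, File outputDir, long maxMinutes) throws Exception {
        File dump = new File(outputDir, "capture-" + System.currentTimeMillis() + ".dump"); // assumed naming
        Process p = new ProcessBuilder(
                "mplayer",
                "-dumpstream",                       // save the stream instead of playing it
                "-dumpfile", dump.getAbsolutePath(), // where to write the raw dump
                streamUri)
                .inheritIO()
                .start();
        // Safeguard so that a live or stalled stream cannot block the downloader forever.
        if (!p.waitFor(maxMinutes, TimeUnit.MINUTES)) {
            p.destroyForcibly();
        }
        return dump;
    }

    public static void main(String[] args) throws Exception {
        // Example URI taken from the evaluation table in Section 3.6; 30 minutes is an assumed cap.
        capture("rtsp://rn.groovygecko.net/groovy/epsrc/Pioneers09_hb.rv", new File("/tmp"), 30);
    }
}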
3.5 Integration into the LiWA Architecture

The module is constructed as an external plugin for Heritrix. Using this approach, the identification and retrieval of streams is completely decoupled, allowing the use of more efficient tools to analyze video and audio content. At the same time, using external tools helps reduce the burden on the crawling process.

The module is composed of several subcomponents that communicate through messages. We use an open standard communication protocol called Advanced Message Queuing Protocol (AMQP). The integration of the Rich Media Capture module is shown in Figure 3.2, and the workflow of the messages can be summarized as follows. The plugin connected to Heritrix detects the URLs referencing streaming resources and constructs an AMQP message for each of them. This message is passed to a central Messaging Server. The role of the Messaging Server is to decouple the Heritrix crawler from the clustered streaming downloaders (i.e. the external capturing tools). The Messaging Server stores the URLs in queues and, when one of the streaming downloaders is available, sends it the next URL for processing.

In the software architecture of the module we identify three distinct sub-modules:
- a first, control module responsible for accessing the Messaging Server, starting new jobs, stopping them and sending alerts;
- a second module used for stream identification and download (here an external tool is used, such as MPlayer);
- a third module which repacks the downloaded stream into a format recognized by the access tools.

When available, a streaming downloader connects to the Messaging Server to request a new streaming URL to capture. Upon receiving the new URL, an initial analysis is done in order to detect some parameters, among others the type and the duration of the stream. Of course, if the stream is live, a fixed configurable duration may be chosen. After a successful identification the actual download starts. The control module generates a job which is passed to MPlayer along with safeguards to ensure that the download will not take longer than the initial estimation. After a successful capture, the last step consists in wrapping the captured stream into the ARC/WARC format and moving it to the final storage.

Figure 3.2: Streaming capture module interacting with the crawler

3.6 Evaluation and Optimizations

We conducted several test crawls using the new capturing module on the GOV.UK collection. This UK governmental Web site collection is regularly crawled and enriched monthly by the European Archive. During the last three monthly crawls, the capturing module was successfully used to retrieve the multimedia content accessible from these Web sites, which had not been possible to archive with conventional archiving technologies. The table below gives some examples of the discovered URIs. They use the standard RTSP or MMS schemes, but one can notice that the video files are generally hosted on a different Web server than the Web site.

Protocol  Web site                        URI
rtsp      http://www.epsrc.ac.uk          rtsp://rn.groovygecko.net/groovy/epsrc/EPSRC_Aroll_041208_hb.rv
                                          rtsp://rn.groovygecko.net/groovy/epsrc/Pioneers09_hb.rv ...
mms       http://www2.cimaglobal.com      mms://groovyg.edgestreams.net/groovyg/clients/Markettiers4dc/Video%20Features/11725/11725_cima_employers2_HowToTV.wmv ...
mms       http://www.businesslink.gov.uk  mms://msnvideo.wmod.llnwd.net/a392/d1/cmg/prb/Solutions_EPM_Final.wmv
                                          mms://msvcatalog-2.wmod.llnwd.net/a2249/e1/ft/share2/b297/0/Solutions – Sales_OSC_English-1.wmv ...
In the next test round we extended the testbed collection to some particular television Web sites, such as www.swr.de or www.ard.de, where the number of video streams is considerably larger. We performed two complete crawls of the Web sites, including the video collection, as well as several weekly crawls capturing the latest updates of the sites and the newly published video content. The table below gives an insight into the total number of video URIs discovered and captured using the rtsp, rtmp or mms protocols.

Protocol  Date of capture    Number of video content URIs
rtsp      May crawl          7089
mms       May crawl          74
rtmp      May crawl          2636
rtsp      July crawl         5058
rtmp      July crawl         3320
rtsp      20th July update   184
rtmp      20th July update   354
rtmp      24th July update   282
rtmp      30th July update   341
rtmp      6th August update  458

These tests have highlighted several issues raised by video capture on a larger scale, such as for the media collection of a large Web site:
- the number of video/audio URIs is relatively large (in the order of several thousands) and they represent an essential part of the site content. Missing the capture of these resources would greatly impact the quality of the archive.
- the video content changes frequently (several hundred new videos are published per week, replacing older ones); therefore the capturing mechanisms need optimizations in order to ensure the complete capture of all the videos.
- the scheduling policy for video capture should differ from the one used for the other types of resources (html, text, images, etc.).

The size of a video stream, in terms of data, generally ranges from several KB to 100 MB or more, according to the length (in seconds) and quality of the video. The total downloading time is therefore unpredictable at the start of the Web site crawl. Moreover, the downloading speed can vary while capturing a long list of video streams.

A brief reflection on these aspects led us to several possible optimizations of the module. The main issues emerging from the initial tests were related to the synchronization between the crawler and the external capture module. In the case of a large video collection hosted on the Web site, a sequential download of each video would definitely take longer than the crawling process for the text pages. The crawler would therefore have to wait for the external module to finish the video download. A speed-up of the video capture process can indeed be obtained by multiplying the number of downloaders. On the other hand, parallelising this process is limited by the maximum bandwidth available at the streaming server.

A feasible solution for managing the video downloaders would be to completely decouple the video capture module from the crawler and launch it in the post-processing phase. That implies replacing the crawler plugin with a log reader and an independent manager for the video downloaders. The advantages of this approach would be:
- a global view of the total number of video URIs
- better management of the resources (number of video downloaders sharing the bandwidth)

The main drawback of the method is related to the incoherencies that might appear between the crawl time of the Web site and the video capture in the post-processing phase:
- some video content might disappear (during the one or two days of delay);
- the video download is blocked, waiting for the end of the crawl.

There is therefore a trade-off to be found when managing the video downloading, between shortening the time needed for the complete download, error handling (for video content served by slow servers), and optimizing the total bandwidth used by multiple downloaders.

3.7 References

[ASFS] GetASFStream – Windows Media streams recorder. http://yps.nobody.jp/getasf.html
[BaSa06] N. Baly, F. Sauvin. Archiving Streaming Media on the Web, Proof of Concept and First Results. In the 6th International Web Archiving Workshop (IWAW'06), Alicante, Spain, 2006.
[BTGH07] C. Titus Brown, Gheorghe Gheorghiu, and Jason Huggins. An Introduction to Testing Web Applications with twill and Selenium. O'Reilly, 2007.
[CCSD] CoCSoft Stream Down – Streaming media download tool. http://stream-down.cocsoft.com/index.html
[CKR*07] Meeyoung Cha, Haewoon Kwak, Pablo Rodriguez, Yong-Yeol Ahn and Sue Moon. "I tube, you tube, everybody tubes: analyzing the world's largest user generated content video system". In Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement, California, 2007.
[JHBa08] Jeffrey Nichols, Zhigang Hua, and John Barton. Highlight: a system for creating and deploying mobile web applications. In Proceedings of the 21st Annual ACM Symposium on User Interface Software and Technology, 249-258, Monterey, CA, USA: ACM, 2008.
[LCKN05] Mingzhe Li, Mark Claypool, Robert Kinicki and James Nichols. "Characteristics of streaming media stored on the Web". In ACM Transactions on Internet Technology (TOIT), 2005.
[MBD*07] Alexander Moshchuk, Tanya Bragin, Damien Deville, Steven D. Gribble, and Henry M. Levy. SpyProxy: execution-based detection of malicious web content. In Proceedings of the 16th USENIX Security Symposium, 1-16, Boston, MA: USENIX Association, 2007.
[MDDC08] David Malaník, Zdenek Drbálek, Tomáš Dulík and Miroslav Červenka. "System for capturing, streaming and sharing video files". In Proceedings of the 8th WSEAS International Conference on Distance Learning and Web Engineering, Santander, Spain, 2008.
[MPlP] MPlayer Project – http://www.mplayerhq.hu
[PMV*08] Xabiel García Pañeda, David Melendi, Manuel Vilas, Roberto García, Víctor García, Isabel Rodríguez. "FESORIA: An integrated system for analysis, management and smart presentation of audio/video streaming services". In Multimedia Tools and Applications, Volume 39, 2008.
[RFC3550] A Transport Protocol for Real-Time Applications (RTP). IETF Request for Comments 3550: http://tools.ietf.org/html/rfc3550
[RFC2326] Real Time Streaming Protocol (RTSP). IETF Request for Comments 2326: http://tools.ietf.org/html/rfc2326
[SVCR] StreamBox VCR – Video stream recorder. http://www.afterdawn.com/software/audio_software/audio_tools/streambox_vcr.cfm
[WBJR06] Yi-Min Wang, Doug Beck, Xuxian Jiang, and Roussi Roussev. Automated Web Patrol with Strider HoneyMonkeys: Finding Web Sites that Exploit Browser Vulnerabilities. Microsoft Research, 2006. http://research.microsoft.com/apps/pubs/default.aspx?id=70182
[WKST08] Bing Wang, Jim Kurose, Prashant Shenoy, Don Towsley. "Multimedia streaming via TCP: An analytic performance study". In ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP), 2008.
[YKLC08] Yang Yu, Hariharan Kolam, Lap-Chung Lam, and Tzi-cker Chiueh. Applications of a feather-weight virtual machine.
In Proceedings of the Fourth ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, 171-180, Seattle, WA, USA: ACM, 2008.
[ZPD*04] Roger Zimmermann, Moses Pawar, Dwipal A. Desai, Min Qin and Hong Zhu. "High resolution live streaming with the HYDRA architecture". In Computers in Entertainment (CIE), 2004.

4 Spam Cleansing

The ability to identify and prevent spam is a top priority issue for the search engine industry [HMS], but it has been studied less by Web archivists. The apparent lack of widespread dissemination of Web spam filtering methods in the archival community is surprising in view of the fact that, under different measurements and estimates, roughly 10% of Web sites and 20% of individual HTML pages constitute spam. These figures directly translate into a 10–20% waste of archive resources in storage, processing and bandwidth.

Spam filtering is essential in Web archives even if we acknowledge the difficulty of defining the boundary between Web spam and honest search engine optimization. Archives may have to tolerate more spam than search engines in order not to lose content misclassified as spam that users may want to retrieve later. They might also want to keep some representative spam, either to preserve an accurate image of the Web or to provide a spam corpus for researchers. In any case, we believe that the quality of an archive with no spam filtering policy at all will greatly deteriorate, and a significant amount of resources will be wasted as an effect of Web spam.

Spam classification and page-quality assessment is a difficult issue for search engines; for archival systems it is even more challenging, as they lack information about usage patterns (e.g., click profiles) at capture time. We survey the methods that best fit the needs of an archive and that are capable of filtering spam during the crawl process or in a bootstrap sequence of crawls. Our methods combine classifiers based on the terms of a page and on features built from content, linkage and site structure.

Web spam filtering know-how became widespread with the success of the Adversarial Information Retrieval Workshops (airweb.cse.lehigh.edu), held since 2005, which have hosted the Web Spam Challenges since 2007. Our mission is to disseminate this know-how and adapt it to the special needs of archival institutions. This implies putting a particular emphasis on periodic recrawls and on the time evolution of spam, such as the disappearance of quality sites that become parking domains used for spamming purposes, or spam that, once blacklisted, reappears under a new domain. In order to strengthen the ties between the two communities, we intend to provide time-aware Web spam benchmark data sets for future Web Spam Challenges.

4.1 State of the Art on Web Spam

As Web spammers manipulate several aspects of content as well as linkage [GGM2], effective spam hunting must combine a variety of content-based [FMN, FMN2, NNFM] and link-based [GGMP, WGD, BCS] methods.

4.1.1 Content features

The first generation of search engines relied mostly on the classic vector space model of information retrieval. Thus Web spam pioneers manipulated the content of Web pages by stuffing it with keywords repeated several times. A large number of machine-generated spam pages, such as the one in Figure 4.1, are still present in today's Web.
These pages can be characterized as outliers through statistical analysis [FMN] targeting their template-like nature: their term distribution, entropy or compressibility distinguishes them from normal content. Large numbers of phrases that also appear in other Web pages characterize spam as well [FMN2]; sites exhibiting excessive phrase reuse are either template driven or spam employing the so-called stitching technique. Ntoulas et al. [NNFM] describe content spamming characteristics, including an overly large number of words either in the entire page or in the title or anchor text, as well as the fraction of the page drawn from popular words and the fraction of the most popular words that appear in the page.

Figure 4.1: Machine generated page with copied content

As most spammers act for financial gain [GGM], spam target pages are stuffed with a large number of keywords that are either of high advertisement value or highly spammed, including misspelled popular words such as "googel" or "accomodation", as seen among the top hits of a major search engine in Figure 4.2. A page full of Google ads, perhaps with no other content at all, is also a typical spammer technique to misuse Google AdSense for financial gain [BBCS], as seen in Figure 4.3. Similar misuses of eBay or the German Scout24.de affiliate program are also common practice [BCSV]. It is observed in [BBCS] that spam is characterized by its success, in a search engine that does not deploy spam filtering, over popular or monetize-able queries. Lists of such queries can be obtained from search engine query logs or via AdWords, Google's flagship pay-per-click advertising product (http://adwords.google.com).

Figure 4.2: Spam in search engine hit list

Figure 4.3: Parked domain filled with Google ads.

Community content is particularly sensitive to so-called comment spam: responses, posts or tags not related to the topic, containing a link to a target site or an advertisement. This form of spam appears wherever users can add their own content without restriction, such as blogs [MCL], bookmarking systems [KHS] and even YouTube [BRAAZR]. We have experimented with tools based on language model disagreement [MCL].

Based on the existing literature on content spam, a sample of the LiWA baseline features includes:
- the number of pages in the host;
- the number of characters in the host name;
- the number of words in the home page and in the maximum PageRank page;
- average word length, average length of the title;
- precision and recall for frequent and monetize-able queries.

4.1.2 Link features

Following Google's success, all major search engines quickly incorporated link analysis algorithms such as HITS [K] and PageRank [PBMW] into their ranking schemes. The birth of the highly successful PageRank algorithm [PBMW] was indeed partially motivated by the easy spammability of the simple in-degree count. Unfortunately, PageRank (together with probably all known link-based ranking schemes) is prone to spam. Spammers build so-called link farms, large collections of tightly interconnected Web sites over diverse domains that eventually all point to the targeted page. The rank of the target will be large regardless of the ranking method, due to the large number of links and the tightly connected structure.
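Several of the baseline link features listed just below (in-degree, out-degree, PageRank) are computed over the host graph. As an illustration of the kind of signal a link farm inflates, the sketch below runs a plain power-iteration PageRank over a small adjacency list; it is not the LiWA feature-generation code, the host names are hypothetical, and the damping factor of 0.85 and the iteration count are conventional choices.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative power-iteration PageRank on a tiny host graph (dangling hosts ignored).
public class HostPageRank {
    public static Map<String, Double> pageRank(Map<String, List<String>> outLinks, int iterations, double damping) {
        int n = outLinks.size();
        Map<String, Double> rank = new HashMap<>();
        for (String h : outLinks.keySet()) rank.put(h, 1.0 / n);          // uniform start

        for (int it = 0; it < iterations; it++) {
            Map<String, Double> next = new HashMap<>();
            for (String h : outLinks.keySet()) next.put(h, (1.0 - damping) / n);
            for (Map.Entry<String, List<String>> e : outLinks.entrySet()) {
                List<String> targets = e.getValue();
                if (targets.isEmpty()) continue;
                double share = damping * rank.get(e.getKey()) / targets.size();
                for (String t : targets) {
                    if (next.containsKey(t)) next.merge(t, share, Double::sum); // ignore hosts outside the graph
                }
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Hypothetical host graph: two honest hosts and a small "farm" pointing at a target.
        Map<String, List<String>> g = Map.of(
                "a.example", List.of("b.example"),
                "b.example", List.of("a.example"),
                "farm1.example", List.of("target.example"),
                "farm2.example", List.of("target.example", "farm1.example"),
                "target.example", List.of("farm1.example", "farm2.example"));
        pageRank(g, 50, 0.85).forEach((host, r) -> System.out.printf("%s %.3f%n", host, r));
    }
}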
An example of a well-known link farm, in operation for several years now, is the 411Web page collection; the content of these sites is likely not spam (indeed they are not excluded from Google), but they form a strongly optimized sub-graph that illustrates well how a link farm operates.

Based on the existing literature, a sample of the LiWA baseline features includes link-based features for the hosts, measured on both the home page and the page with the maximum PageRank in each host, such as:
- in-degree, out-degree;
- PageRank, TrustRank, Truncated PageRank;
- edge reciprocity, assortativity coefficient;
- estimation of supporters;
along with simple numeric transformations of the link-based features for the hosts.

4.1.3 Stacked Graphical Learning

Recently, several results have appeared that apply rank propagation to extend initial trust or distrust judgments over a small set of seed pages or sites to the entire Web, such as trust propagation [GGMP, WGD], distrust propagation in the neighborhood, their combination [WGD], as well as graph-based similarity measures [BCS]. These methods are based on propagating either trust forward or distrust backwards along the hyperlinks, following the idea that honest pages predominantly point to honest ones or, stated the other way around, that spam pages are backlinked only by spam pages.

Figure 4.4: Schematic idea of stacked graphical learning

Stacked graphical learning, introduced by Kou and Cohen [KC], is a simple implementation of propagation that outperforms the computationally expensive variants. It is performed under the classifier combination framework as follows (see Figure 4.4 above). First the base classifiers are built and combined, giving a prediction p(u) for every unlabeled node u. Next, for each node v we construct new features based on the predicted p(u) of its neighbors and the weight of the connection between u and v, as described in [CBSL], and classify them by a decision tree. Finally, classifier combination is applied to the augmented set of classification results; this procedure is repeated in two iterations, as suggested by [CDGMS].

As new results, we used stacked graphical features based on the "Connectivity Sonar" of Amitay et al. Our new features include the distribution of in- and outlinks labeled spam within the site; the average level of spam in in- and outlinks; and the top- and leaf-level link spamicity. As a novelty, we also tested various edge weights, in particular weights inferred from a generative "linked LDA" model.

4.2 Spam Filter Module

The main objective of spam cleansing is to reduce the amount of fake content the archive will have to deal with. The envisioned toolkit will help prioritize crawls by automatically detecting content of value and excluding artificially generated, manipulative and useless content, possibly based on models built in a bootstrap procedure. In addition to individual solutions for specific archives, LiWA services intend to provide collaboration tools to share known spam hosts and features across participating archival institutions. A common interface to a central knowledge base will be built, in which archive operators may label sites or pages as spam, based on their own experience or as suggested by the spam classifier applied to the local archives.

The purpose of the planned LiWA Web spam assessment interface is twofold:
- It aids the Archive operator in selecting and blacklisting spam sites, possibly in conjunction with an active learning environment where human assistance is requested, for example in case of contradicting outcomes from the classifier ensemble;
- It provides a collaboration tool for the Archives, with a possible centralized knowledge base through which the Archive operators are able to share their labels, comments and observations, as well as start discussions on the behaviour of certain questionable hosts.

The Spam Filter module, described in D3.1 Archive Filtering Technology V1, takes WARC format crawls as input and outputs a list of the sites with a predicted spamicity (strength of similarity in content or behaviour to spam sites) as a value between 0 and 1.

The current LiWA solution is based on the lessons learned from the Web Spam Challenges [CCD]. As it has turned out, the feature set described in [CDGMS] and the bag-of-words representation of the site content [ACC] give a very strong baseline, with only minor improvements achieved by the Challenge participants. We use a combination of the following classifiers, listed in the observed order of their strength: SVM over tf.idf; an augmented set of the statistical spam features of [CDGMS] together with transformed feature variants; graph stacking [CBSL]; and text classification by latent Dirichlet allocation [BSB] as well as by compression [BFCLZ, C].

The LiWA baseline content feature set consists of the following language-independent measures:
- the number of pages in the host;
- the number of characters in the host name, in the text, title, anchor text, etc.;
- the fraction of code vs. text;
- the compression rate and entropy;
- the rank of a page for popular queries.
Whenever a feature refers to a page instead of the host, we select the home page as well as the maximum PageRank page of the host, in addition to host-level averages and standard deviations. We also classify based on the average tf.idf vector of the host. In the LiWA baseline link feature set we use measures for in- and outdegree, reciprocity, assortativity, (truncated) PageRank, TrustRank [GGMP] and neighborhood sizes, together with the logarithm and other derivatives of most values.

Next we describe our main findings related to use by archives. As a key element, archives may possess a large number of different time snapshots of the same domain. In this setup, we observe that our classifiers are:
- stable across snapshots. We may apply an old model to a more recent collection without major deterioration of quality, despite the fact that there is a relatively large change due to the appearance and disappearance of hosts over time.
- unstable across different crawl strategies. The WEBSPAM-UK2007 test data was collected with a very different crawl strategy and contains only 14K sites, whereas all other research snapshots of the .uk domain have more than 100,000. Here the WEBSPAM-UK2007 model fails badly for all other crawls.
In conclusion, we may reuse the same classifier with little modification for a near-future crawl, but applying a model generated by another institution under a different domain or crawling strategy needs further research.

Beyond the state of the art, we were able to improve classification quality by exploiting the time change of features, including their variance, stability, and the use of normalized versions, as well as by the selection of stable hosts for training. As new features we investigate:
- the creation and disappearance of sites and pages;
- the burst and decay of the neighborhood;
- the change in degree and rank;
- the percentage of change in content.
Prior to our research in LiWA, only the statistical characteristics of this collection had been investigated [BSV, BBDSV].

4.3 Evaluation

Novel to the LiWA project, a sequence of periodic recrawls is made available for the purposes of spam filtering development, for the first time outside the major search engine operators. The data set of 13 UK snapshots (UK-2006-05 … UK-2007-05, where the first snapshot is WEBSPAM-UK2006 and the last is WEBSPAM-UK2007), provided by the Laboratory for Web Algorithmics of the Università degli Studi di Milano [BCSV] and supported by the DSI-DELIS project, was processed. The LiWA test bed consists of more than 10,000 manual labels that proved to be useful over this data.

We conducted the test on 16 April 2009 over the WEBSPAM-UK2007 data set, converted into WARC 0.19 by SZTAKI as part of the LiWA test bed. For testing and training we used the predefined labeled subsets of WEBSPAM-UK2007. For testing purposes, the output of the evaluation script is a Weka classifier output that contains a summary of the relevant performance measures over a predefined labelled test set. The results of the test are given in Table 4.1 below.

                         Training set  Test set
Size                     4000          2053
True positive            236           72
True negative            2461          1242
False positive           2             24
False negative           1301          715
Correctly classified     2697          1314
Incorrectly classified   1303*         739*
Precision                0.154*        0.091*
Recall                   0.992         0.75
F1                       0.266         0.163
AUC (ROC area)           0.895         0.756

Table 4.1: Quality measures of spam classification. Starred values may be improved by increasing the threshold and reducing recall.

When comparing to the baseline, we use the AUC measure, since all other measures are sensitive to changing the threshold used to separate the spam and non-spam classes in the prediction. The best performing Web Spam Challenge 2008 participant reached an AUC of 0.85 [GJW], while our result reached 0.80. Some of the research codes still require industry-level, tested implementations and will gradually be added to the LiWA code base. We also expect progress in reducing the resource needs of the feature generation code.

4.4 Integration into the LiWA Architecture

The LiWA Spam Filtering Architecture is summarized in Figure 4.5.
1. The data source is always local, in the form of a WARC archive. When acting as a crawl-time plug-in, the WARC at a checkpoint may be analysed to build a new model with an updated blacklist for the next crawling phase. Local data is typically huge and cannot be transferred to another location.
2. When accessing the raw data, host-level (or, in certain applications, page-level) features are generated. This data portion is small and can easily be stored, retrieved and shared, even across different institutions.
3. The main step of the procedure is the model building and classification step. Training a model is costly and is done in batch between crawls. Models are small and easy to distribute, and they can be applied in crawl-time plug-ins.
4. A key aspect of a successful spam filter is the quality of the manually labelled training data. To this end, the design involves an active learning environment in which the classifier presents cases where the decision is uncertain, so that the largest accuracy gain is achieved by the new labels.

Figure 4.5: Overview of LiWA Spam Filter Architecture.
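Step 3 above, model building and classification, is implemented with Weka classifiers combined by a random forest, as the next subsection explains. The sketch below shows how such a batch step could look; the ARFF file names and the attribute layout (host name in the first column, class label in the last, both files sharing the same header) are assumptions made for the example, not the actual LiWA scripts.

import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

import java.io.PrintWriter;

// Illustrative batch step: train on labelled host features, emit a spamicity score per host.
// File names and attribute layout are assumptions; the class attribute is the last column.
public class BatchSpamicity {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("host-features-labelled.arff");   // hypothetical file
        Instances unlabelled = DataSource.read("host-features-new.arff");   // hypothetical file, same header
        train.setClassIndex(train.numAttributes() - 1);
        unlabelled.setClassIndex(unlabelled.numAttributes() - 1);

        RandomForest model = new RandomForest();   // the combiner used in the LiWA ensemble
        model.buildClassifier(train);

        try (PrintWriter out = new PrintWriter("spamicity.txt")) {
            for (int i = 0; i < unlabelled.numInstances(); i++) {
                // Probability of the "spam" class value, assumed here to be class index 1.
                double spamicity = model.distributionForInstance(unlabelled.instance(i))[1];
                out.printf("%s\t%.3f%n", unlabelled.instance(i).stringValue(0), spamicity);
            }
        }
    }
}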
4.4.1 Batch feature generation and classification

In the LiWA Spam Classifier Ensemble we split the features into related sets and for each set we use the best-fitting classifier. These classifiers are then combined by a random forest, a method that, in our cross-validation experiments, outperformed the logistic regression suggested by [HMS]. We use the classifier implementations of the machine learning toolkit Weka [WF], as it is open source, mature in quality, and gives high-quality implementations of most state-of-the-art classifiers.

The computational resources needed for the filtering procedure are moderate. Content features are generated by reading the collection once. For link features, typically only a host graph has to be built, which is very small even for large batch crawls. Training the classifier for a few hundred thousand sites can be completed within a day on a single CPU on a commodity machine with 4-16 GB RAM; here the costs strongly depend on the classifier implementation. Given the trained classifier, a new site can be classified even at crawl time, if the crawler is able to compute the required feature set for the newly encountered site.

4.4.2 Assessment interface design

While no single Web archive is likely to have spam filtering resources comparable to a major search engine, our envisioned method facilitates collaboration and knowledge sharing between specialized archives, in particular for spam that spans domain boundaries. To illustrate, assume that an archive targets the .uk domain. The crawl encounters the site www.discountchildrensclothes.co.uk, which contains a redirection to the .com domain that further redirects to .it. These .it sites were already flagged as spam by another partner; hence, their knowledge can be incorporated into the current spam filtering procedure.

The LiWA developments are planned to aid the international Web archiving community in building and maintaining a worldwide data set of Web spam. Since most features, and in particular link features, are language independent, a global collection will help all archives regardless of their target top-level domain.

The need for manual labeling is the single most important blocker of high-quality spam filtering. In addition to label sharing, the envisioned solution will also coordinate the labeling efforts in an active learning environment: manual assessment will be supported by a target selection method that proposes sites of a target domain that are ambiguously classified based on the existing common knowledge.

The mockup of the assessment interface is modeled on the Web Spam Challenge 2007 volunteer interface [CCD]. The right side of Figure 4.6 is for browsing in a tabbed fashion. In order to integrate the temporal dimension of an archive, the available crawl times are shown (the so-called access bar). Upon clicking, the page that appears is the one whose crawl date is closest to the crawl date of the linking page. The selected version of the linked page can be either cached at some partner archive or the current version downloaded from the Web. We use Firefox extension techniques similar to Zotero to note and organize information without interfering with rendering, frames and redirection. The possibility to select between a stored page and the currently available one also helps in detecting cloaking. The right side also contains in- and outlinks, as well as a list or sample pages of the site. By clicking on an in- or outlink, we may obtain all possible information in all the subwindows from the central service.
The upper part of the left side is used for the assessment itself. The button “NEXT” links to the next site to be assessed in the active learning framework and “BACK” to the review page. When “NEXT” or “BACK” is pushed, the assigned label is saved. Before saving a “spam” or “borderline” label, a popup window appears requesting an explanation and the spam type. The spam type can be general, link or content, and the appropriate types should be ticked. The ticked types appear as part of the explanation. Although not shown on the figure, a text field is available for commenting on the site. The explanations and comments appear in the review page.

The lower right part of the assessment page contains four windows in a tabbed fashion:
- The labels already assigned to this site (the number of labels of each of the four possible types), with comments if any.
- Various spam classifier scores, and an LDA based content model [BSB].
- Various site attributes selected by the classification model as most appropriate for deciding the label of the site.
- Whois information, with links to other sites of the owner.

In a first implementation we fill this interface with the 12 crawl snapshots of the .uk domain gathered by the UbiCrawler [BCSV] between June 2006 and May 2007 [CCD]. In the “Links and pages” tab we show 12 bits for the presence of the page, while the access bar at the bottom of the assessment page shows “jun, jul, . . . , may” and “now”, color coded for availability and navigation.

Figure 4.6: LiWA Spam Assessment Interface.

The above interface will be part of the LiWA WP10 Community Platform, which forms a centralized service and will also act as a knowledge base for crawler traps, crawling strategies for various content management technologies and other issues related to Web host archival behavior.

4.4.3 Crawl-time filtering infrastructure design

In order to save bandwidth and storage, we have to filter out spam at crawl time. For a host that is already blacklisted, we may simply discard all URIs to save bandwidth. For a yet unseen host, however, we have to obtain a few pages to build a model and then apply our classifier, again at crawl time. From then on, no more URIs will be retrieved from spam hosts. The crawl-time spam filter accepts/rejects URIs and/or domains based on the spam analysis results of either earlier crawls or the previous checkpoint of the current crawl. The crawler(s) continuously write WARC(s) that, at some points in time, the spam filter also reads and processes. We synchronize concurrent access at checkpoints, where the crawlers start writing a new WARC and the earlier ones are then free to be used by the filter.

The main function,

    public class SpamDecideRule extends DecideRule {
        @Override
        protected DecideResult innerDecide(ProcessorURI uri) {
            // consult the black/whitelist; reject the URI if its host is flagged as spam
            ...
        }
    }

simply checks the black and whitelist and commands the crawler to discard the URI if it belongs to a possible spam host. The lists are updated from text files whenever their content changes (push method). The spam filter also interacts with the host queue prioritization of the crawler.

4.5 References

[ACC] J. Abernethy, O. Chapelle, and C. Castillo. WITCH: A New Approach to Web Spam Detection. In Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2008.

[BBC] A. A. Benczúr, I. Bíró, K. Csalogány, and T. Sarlós. Web spam detection via commercial intent analysis.
In Proceedings of the 3rd International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), held in conjunction with WWW2007, 2007.

[BCS] A. A. Benczúr, K. Csalogány, and T. Sarlós. Link-based similarity search to fight web spam. In Proceedings of the 2nd International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), held in conjunction with SIGIR2006, 2006.

[BRAAZR] F. Benevenuto, T. Rodrigues, V. Almeida, J. Almeida, C. Zhang, and K. Ross. Identifying video spammers in online social networks. In AIRWeb '08: Proceedings of the 4th international workshop on Adversarial information retrieval on the Web. ACM Press, 2008.

[BSB] I. Bíró, J. Szabó, and A. A. Benczúr. Latent Dirichlet Allocation in Web Spam Filtering. In Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2008.

[BCSV] P. Boldi, B. Codenotti, M. Santini, and S. Vigna. UbiCrawler: A scalable fully distributed Web crawler. Software: Practice & Experience, 34(8):721–726, 2004.

[BSV] P. Boldi, M. Santini, and S. Vigna. A Large Time-Aware Web Graph. SIGIR Forum, 42, 2008.

[BBDSV] I. Bordino, P. Boldi, D. Donato, M. Santini, and S. Vigna. Temporal evolution of the UK Web, 2008.

[BFCLZ] A. Bratko, B. Filipič, G. Cormack, T. Lynam, and B. Zupan. Spam Filtering Using Statistical Data Compression Models. The Journal of Machine Learning Research, 7:2673–2698, 2006.

[CCD] C. Castillo, K. Chellapilla, and L. Denoyer. Web Spam Challenge 2008. In Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2008.

[CDGMS] C. Castillo, D. Donato, A. Gionis, V. Murdock, and F. Silvestri. Know your neighbors: Web spam detection using the Web topology. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 423–430, 2007.

[C] G. Cormack. Content-based Web Spam Detection. In Proceedings of the 3rd International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2007.

[CBSL] K. Csalogány, A. Benczúr, D. Siklósi, and L. Lukács. Semi-Supervised Learning: A Comparative Study for Web Spam and Telephone User Churn. In Graph Labeling Workshop in conjunction with ECML/PKDD 2007, 2007.

[FMN] D. Fetterly, M. Manasse, and M. Najork. Spam, damn spam, and statistics – Using statistical analysis to locate spam Web pages. In Proceedings of the 7th International Workshop on the Web and Databases (WebDB), pages 1–6, Paris, France, 2004.

[FMN2] D. Fetterly, M. Manasse, and M. Najork. Detecting phrase-level duplication on the World Wide Web. In Proceedings of the 28th ACM International Conference on Research and Development in Information Retrieval (SIGIR), Salvador, Brazil, 2005.

[GJW] G. G. Geng, X. B. Jin, and C. H. Wang. CASIA at Web Spam Challenge 2008 Track III. In Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2008.

[GGM] Z. Gyöngyi and H. Garcia-Molina. Spam: It's not just for inboxes anymore. IEEE Computer Magazine, 38(10):28–34, October 2005.

[GGM2] Z. Gyöngyi and H. Garcia-Molina. Web spam taxonomy. In Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), Chiba, Japan, 2005.

[GGMP] Z. Gyöngyi, H. Garcia-Molina, and J. Pedersen. Combating Web spam with TrustRank. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB), pages 576–587, Toronto, Canada, 2004.

[HMS] M. R. Henzinger, R. Motwani, and C. Silverstein. Challenges in Web search engines. SIGIR Forum, 36(2):11–22, 2002.

[K] J. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632, 1999.

[KC] Z. Kou and W. W. Cohen. Stacked graphical models for efficient inference in Markov random fields. In SDM 07, 2007.

[KHS] B. Krause, A. Hotho, and G. Stumme. The anti-social tagger – detecting spam in social bookmarking systems. In Proc. of the 4th International Workshop on Adversarial Information Retrieval on the Web, 2008.

[MCL] G. Mishne, D. Carmel, and R. Lempel. Blocking blog spam with language model disagreement. In Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), Chiba, Japan, 2005.

[NNMF] A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. Detecting spam web pages through content analysis. In Proceedings of the 15th International World Wide Web Conference (WWW), pages 83–92, Edinburgh, Scotland, 2006.

[PBMW] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the Web. Technical Report 1999-66, Stanford University, 1998.

[WF] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Series in Data Management Systems. Morgan Kaufmann, second edition, June 2005.

[WGD] B. Wu, V. Goel, and B. D. Davison. Propagating trust and distrust to demote Web spam. In Workshop on Models of Trust for the Web, Edinburgh, Scotland, 2006.

5 Temporal Coherence

The coherence of data in terms of proper dating and proper cross-linkage is influenced by the temporal characteristics (duration, frequency, etc.) of the crawl process. Web archiving is commonly understood as a continuous process that aims at archiving the entire Web (broad scope). However, a typical scenario in archiving institutions or companies is to periodically (e.g. monthly) create high quality captures of a certain Web site. These periodic domain scope crawls of Web sites aim at obtaining the best possible representation of a site.

Figure 5.1 contains an abstract representation of such a domain scope crawling process. This Web site consists of n pages (p1,…,pn). Each of them consists of several successive versions, indicated by the horizontal lines (e.g., pn has three different versions in [t;t']). Ideally, the result of a crawl would be a complete and instantaneous snapshot of all pages at a given point of time.

Figure 5.1: Web site crawling process (domain scope)

In reality, one crawl requires an extended time period to gather all pages of a site, which may be modified in parallel, thus causing incoherencies in the archive. The risk of incoherence increases further due to politeness constraints and the need for sophisticated time stamping mechanisms. An ideal approach to Web archiving would be to have captures for every domain at any point in time whenever there is a (small) change in any of the domain's pages. Of course, this is absolutely infeasible given the enormous size of the Web, the high content-production rates in blogs and other Web 2.0 venues, the disk and server costs of a Web archive, and also the politeness rules that Web sites impose on crawlers. We therefore settle for the realistic goal of capturing Web sites at convenient points (whenever the crawler decides to devote resources to the site and the site does not seem to be highly loaded), but when doing so, the capture should be as “authentic” as possible.
In order to ensure an “as of time point x (or interval [x;y])” capture of a Web site, we therefore develop strategies that ensure coherence of crawls with regard to a time point or interval, and identify those contents of the Web site that violate coherence [SDM*09].

5.1 State of the Art on Archive Coherence

The most comprehensive overview of Web archiving is given by Masanès [Masa06]. He describes the various aspects involved in the archiving as well as the subsequent accessing process. The issue of coherence is introduced as well, but only some heuristics on how to measure and improve the archiving crawler's front line are suggested. Other related research mostly focuses on aligning crawlers towards more efficient and fresher Web indexes. B. E. Brewington and G. Cybenko [BrCy00] analyze changes of Web sites and draw conclusions about how often they must be reindexed. The issue of crawl efficiency is addressed by Cho et al. [CGPa98]. They state that the design of a good crawler is important for many reasons (e.g. ordering and frequency of URLs to be visited) and present an algorithm that obtains more relevant pages (according to their definition) first. In a subsequent study, Cho and Garcia-Molina describe the development of an effective incremental crawler [ChGa00]. They aim at improving the collection's freshness by bringing in new pages in a timelier manner. Their studies on effective page refresh policies for Web crawlers [ChGa03a] head in the same direction; here, they introduce a Poisson process based change model of data sources. In another study, they estimated the frequency of change of online data [ChGa03b]. For that purpose, they developed several frequency estimators in order to improve Web crawlers and Web caches. In a similar direction goes the research of Olston and Pandey [OlPa08], who propose a recrawl schedule based on information longevity in order to achieve good freshness. Another study about crawling strategies is presented by Najork and Wiener [NaWi01]. They found that breadth-first search downloads hot pages first, but also that the average quality of the pages decreases over time. Therefore, they suggest performing strict breadth-first search in order to enhance the likelihood of retrieving important pages first. However, aiming at improved crawl performance against the background of assured crawl coherence requires a slightly different alignment. Our task therefore is to achieve both: increase the probability of obtaining largely coherent crawls, and identify those contents violating coherence.

5.2 Temporal Coherence Module

In order to identify contents violating coherence and improve the crawling strategy with respect to temporal coherence, proper dating of Web contents is needed. Hence, techniques for (meta-)data extraction of Web contents have been implemented and the correctness of these methods has been tested. Unfortunately, the reliability of last modified stamps cannot be guaranteed, due to the missing trustworthiness of Web servers. To this end, we will first introduce our strategies to ensure proper dating of Web contents and subsequently introduce our coherence improving crawling strategy on top of these properly dated Web contents.

Proper Dating of Web Contents

Proper dating technologies are required to know how fresh a Web page is – that means, what is the date (and time) of its last modification. The canonical way of time stamping a Web page is to use its Last-Modified HTTP header, which is unfortunately unreliable.
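As an illustration of this first, cheapest check only, the following minimal Java sketch reads the Last-Modified header with the standard HTTP client classes; it is not part of the LiWA crawler code, and a production crawler would issue the request through its own fetch chain instead.

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class LastModifiedProbe {
        /**
         * Returns the Last-Modified timestamp in epoch milliseconds,
         * or 0 if the server does not supply the header.
         */
        public static long lastModified(String url) throws Exception {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestMethod("HEAD");      // header only, no body transfer
            try {
                return conn.getLastModified();   // 0 when the header is absent
            } finally {
                conn.disconnect();
            }
        }
    }

A missing (zero) or implausible value, e.g. a timestamp equal to the request time on every fetch, is exactly the case in which the header cannot be trusted and the more expensive checks described next are needed.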
For that reason, another dating technique is to exploit the content's semantic timestamps. This might be a global timestamp (for instance, a date preceded by “Last modified:” in the footer of a Web page) or a set of timestamps for individual items in the page, such as news stories, blog posts, comments, etc. However, the extraction of semantic timestamps requires the application of heuristics, which imply a certain level of uncertainty. Finally, the most costly – but 100% reliable – method is to compare a page with its previously downloaded version. For cost and efficiency reasons we pursue a potentially multistage change measurement procedure:

1) Check the HTTP timestamp. If it is present and trustworthy, stop here.
2) Check the content timestamp. If it is present and trustworthy, stop here.
3) Compare a hash of the page with the previously downloaded hash.
4) Eliminate non-significant differences (ads, fortunes, request timestamp):
   a) only hash the text content, or the “useful” text content,
   b) compare the distribution of n-grams (shingling),
   c) or even compute the edit distance to the previous version.

On the basis of these dating technologies we are able to develop coherence improving capturing strategies that allow us to reconcile temporal information across multiple captures and/or multiple archives.

Coherence Measurement

Because of the aforementioned unreliability of last modified stamps, caused by the missing trustworthiness of Web servers, the only 100% reliable method is to create a “virtual time stamp” ourselves by comparing the page's etag or content hash with its previously downloaded version. To this end, we introduce an induced coherence measure that allows the crawler to gain full control over the contents being compared. We apply a crawl-revisit sequence σ(c, r), where c denotes a crawl of the Web pages (p1,…,pn) and r a subsequent revisit. In this consecutive revisit process we obtain a second (and potentially different) version of the previously crawled pages, denoted as (p1',…,pn'). Hence, the crawl-revisit sequence σ(c, r) consists of n crawl-revisit tuples τ(pi, pi') with i ∈ {1,…,n}. While the time of downloading page pi is denoted as t(pi) = tj with j ∈ [1; n], the time of revisiting page pi is denoted as t(pi') = tk with k ∈ [n; 2n−1]. Technically, the last crawled page pv with t(pv) = tn is not revisited again, but considered as crawled and revisited page at the same time. Hence, the revisit takes place in the time interval [tn+1; t2n−1]. For convenience, [ts; te] denotes the crawl interval, where ts = t1 is the starting point of the crawl (download of the first page) and te is the ending point of the crawl (download of the last page). Similarly, we denote by [ts'; te'] the revisit interval, where te = ts' = tn is the starting point of the revisit (download of the last visited page, which is at the same time the first revisited page) and te' is the ending point of the revisit (download of the last revisited page). In addition, we define the etag or content hash of a page or a revisited page as h(m) with m ∈ {pi, pi'}. Overall, a complete crawl-revisit sequence σ(c, r) spans the interval [t1; t2n−1]. It starts at ts = t1 with the first download of the crawl and ends at te' = t2n−1 with the last revisit download. Now, coherence of two or more pages exists if there is a time point tcoherence between the visit of the pages t(pi) and the subsequent revisit t(pi') at which the etag or content hash h of the corresponding pages (h(m) with m ∈ {pi, pi'}) has not changed.
This is formally denoted as:

∃ tcoherence : ∀ i ∈ {1,…,n} : tcoherence ∈ [t(pi); t(pi')] ∧ h(pi) = h(pi')

Figure 5.2 highlights the functioning of our coherence measure applied to a Web site consisting of n pages. We assume a download sequence p1,…,p4 spanning the crawl interval [t1;t4] and an inverted subsequent revisit sequence p3,…,p1 spanning the revisit interval [t5;t7]. The figure depicts n successful coherence tests. This results in an assurable coherence statement for the entire Web site, valid at time point tcoherence = t4.

Figure 5.2: Web site crawling with successful coherence tests

By contrast, Figure 5.3 indicates a failed inducible coherence test for the crawl-revisit tuple τ(p2, p2'). In this case, page p2 was modified somewhere between t(p2) = t2 and t(p2') = t6, which results in a failed inducible coherence test. Due to non-existing or non-reliable last modified stamps we are not able to determine the exact time point of the modification. To this end, we are only able to obtain a boolean result from the failed etag or hash comparison for the crawl-revisit tuple τ(p2, p2'). The whole interval is flagged as insecure, even though the modification might have taken place far beyond the aspired coherence time point (tcoherence = t4). Thus, despite being coherent from a global point of view for tcoherence = t4, a real life crawler is not able to figure this out. Consequently, an assurable coherence statement cannot be given for the entire Web site, since there is an insecure time interval with respect to the crawl-revisit tuple τ(p2, p2').

Figure 5.3: Web site crawling with failed coherence test for crawl-revisit tuple τ(p2, p2')

In reality, and against the background of large Web sites, it is almost infeasible to achieve an assurable coherence statement for an entire Web site based on this coherence measure. Still, we might be interested in specifying how “coherent” the remaining parts of our crawl c are. For that purpose, we introduce a metric that allows us to express the quality of a crawl c. The error function f(τ(pi, pi')), which counts the occurring incoherences for the crawl-revisit tuple τ(pi, pi') of the crawl-revisit sequence σ(c, r), is defined as:

f(τ(pi, pi')) = 0 if h(pi) = h(pi'), and 1 otherwise.

The overall quality of a crawl c is then evaluated as:

C(c) = 1 − (1/n) · Σ(i=1..n) f(τ(pi, pi')),  n ≥ 1

Since the risk of a single Web page pi being incoherent heavily depends on its position in the crawl-revisit sequence σ(c, r), we will now introduce our coherence improving capturing strategy.

Coherence Improving Capturing Strategy

In order to increase the overall quality of a crawl, we examine the probability (and thus the risk) of crawling incoherent contents. The probability of a single page pi being incoherent with respect to the reference time point or time interval tcoherence is an important parameter to consider when scheduling a crawl. Incoherence occurs when a page pi is subject to one or more modifications μi* that are in “conflict” with the ongoing crawl. A conflict with respect to coherence occurs if:

∃ μi* : μi* ∈ [t(pi); t(pi')]

That means, a page has been modified at least once between its download during the crawling phase at t(pi) and its revisit at t(pi'). Given a page's change probability λi (which can be statistically predicted based on page type [e.g., MIME types], depth within the site [e.g., distance to site entry points], and URL name [e.g., manually edited user homepages vs. pages generated by content management systems]), its download time t(pi) and its revisit time t(pi'), the probability of conflict π(pi) is given as:

π(pi) = 1 − (1 − λi)^(t(pi') − t(pi))

The potentially conflicting slots when applying inducible coherence are shown in Figure 5.4. In this example, a crawl ordering from top to bottom (p1,…,pn) and revisits from bottom to top (pn−1,…,p1) are applied. The illustration differentiates between those slots in which a change of page pi affects the coherence of crawl c and others that do not. This results in a set of concatenated slots (different in size) that represents (overall) the risk of a crawl being affected by changes.

Figure 5.4: Periled slots in crawl/revisit-pairs applying “virtual time stamping” (slots in which a change of pi enters π(pi) are marked as periled; all remaining slots are “don't care” slots)

Coherence Improving Crawl Scheduling

As a consequence of the previous observations, we can identify two factors that influence the potential incoherence of a page pi with respect to the reference coherence time point tcoherence: the change probability λi of page pi and its download (and revisit) time t(pi) (and t(pi')). Hence, our coherence optimized crawling strategy incorporates both factors. The starting point is a list of pages pi to be crawled, sorted in descending order of their change probabilities λi. As before, the intention is to identify those pages that might overstep the readiness-to-assume-risk threshold θ. Since now all pages need to be scheduled with respect to the reference time point treference = tn, i.e. the last page to be crawled during the crawling phase, we need a different queuing strategy: we try to create a V-like access schedule having the (large) slots of stable pages at the top and the (small) slots of unstable ones at the bottom (cf. Figure 5.4). Again, we start by assigning the uncritical slot (as we assume that changes of Web pages might occur only per time unit immediately before download) with length 0 to the most critical content at the first position (pslot = 1) of our queue. Since, initially, the length of the slot in the “joker” position (tn) to be assigned is zero, the threshold condition does not hold. However, from now on t (and thus the size of the slots) increases stepwise, so that any download bears the risk of being incoherent. To this end, we evaluate the current page's conflict probability π(pi) against the user defined threshold (π(pslot) ≥ θ). As it is rarely possible to include all pages in this V-like structure, we split the download schedule into a promising section and a hopeless section. In case the given threshold is exceeded, we move the page at pslot to the lastpromising position, which is (at this point in time) the first position after those pages not exceeding the conflict threshold θ. Otherwise, the page will be scheduled for download at pslot. This process is continued until all pages pi have been scheduled either in the promising section or in the hopeless section. In the next stage, the crawl itself starts. During the crawling phase, we begin with the most hopeless pages first and then continue with those pages that have been allocated in the promising section. After completion, we directly initiate the revisit phase in the reverse order.
We begin with the first element after the “joker” position (pslot = 2) and continue until the revisit of the remaining pages has been completed. A pseudo code implementation of the described strategy is shown in Figure 5.5.

    input: p1,…,pn – list of pages in descending order of λi,
           θ – readiness-to-assume-risk threshold
    begin
      Start with: slot = 1, lastpromising = n
      while slot ≤ lastpromising do
        if π(pslot) ≥ θ then /* conflict expected! */
          Move pslot to position lastpromising
          Decrease promising boundary: lastpromising --
        else
          Increase slot counter: slot ++
        end
      end
      slot = n
      while slot ≥ 1 do /* visit from hopeless to promising */
        Download page pslot
        Decrease slot counter: slot --
      end
      slot = 2
      while slot ≤ n do /* revisit from promising to hopeless */
        Revisit page pslot
        Increase slot counter: slot ++
      end
    end

Figure 5.5: Pseudo code of coherence improved crawl scheduling

5.3 Evaluation and Visualization

The main performance indicators for the temporal coherence module are the fraction of accurately dated content and the crawl cost measures. The cost of the crawl can be measured by parameters such as the number of downloads, the bandwidth consumed or the crawl duration. As we have outlined before, a full guarantee of properly dated content requires the more sophisticated “virtual time stamping” mechanism (compared with a single access strategy relying on the accuracy of last modified time stamps). This implies that temporal coherence and crawl cost are contradictory objectives. However, it is possible to ensure and evaluate proper dating of contents and reduce the crawl cost in a subsequent step.

Evaluation of Coherence Improved Crawling

In terms of coherence improved crawling, we measure the percentage of content in a Web site that is coherently crawled (that means “as of the same time point or time interval”). Conventional implementations of archiving crawlers are based on a priority-driven variant of the breadth-first-search (BFS) crawling strategy and do not incorporate revisits. However, “virtual time stamping” is unavoidable in order to determine coherence under real life crawling conditions. Therefore, the performance of our algorithms is evaluated against comparable modifications of conventional crawling strategies such as BFS-LIFO (breadth-first-search combined with last-in-first-out) or BFS-FIFO (breadth-first-search combined with first-in-first-out). In addition, we indicate baselines for optimal and worst case crawling strategies, which are obtained from full knowledge about the changes within all pages pi during the entire crawl-revisit interval. Hence, these baselines are only to be considered as theoretically achievable limits of coherence.

Experiments were run on synthetic data in order to investigate the performance of the various crawling strategies within a controlled test environment. All parameters are freely adjustable in order to resemble real life Web site behavior. Each experiment follows the same procedure, but varies in the size of the Web contents and the change rate. We model site changes by Poisson processes with page-specific change rates. These rates can be statistically predicted based on page type (e.g., MIME types), depth within the site (e.g., distance to site entry points), and URL name (e.g., manually edited user homepages vs. pages generated by content management systems). Each page of the data set has a change rate λi.
Based on a Poisson process model, the time between two successive changes of page pi is exponentially distributed with parameter λi:

P[time between changes of pi is less than time unit Δ] = 1 – e^(–λi·Δ)

Equivalently, the probability that pi changes k times in one time unit follows a Poisson distribution:

P[k changes of pi in one time unit] = (λi^k · e^(–λi)) / k!

Within the simulation environment a change history is generated, which registers every change per time unit. The probability that page pi changes at ti then is:

P[pi has at least one change] = 1 – e^(–λi)

In order to resemble real life conditions, we simulated small to medium size crawls of Web sites consisting of 10,000 – 50,000 contents. In addition, we simulated the sites' change behavior to vary from nearly static to almost unstable. All experiments followed the same procedure, but varied in the size of the Web contents and the change rate. Each page of the data set has a change probability λi in the interval [0;1]. Within the simulation environment a change history was generated, which registered every change per time unit. The probability that page pi changed at tj is P(μi) = P[ρ(tj) ≤ λi], where ρ(tj) is a function that generates, per time unit, a uniformly distributed random number in [0;1].

Figure 5.6: Comparison of inducible crawling strategies in a Web site of 10,000 contents

Figure 5.6 depicts the results of our improved inducible crawling strategy compared with its “competitors” BFS-LIFO and BFS-FIFO. Our improved crawling strategy always performs better than the best possible conventional crawling strategy. The experiments are based on a Web site containing 10,000 contents and different readiness-to-assume-risk thresholds θ ranging over [0.45;0.7]. In addition, our strategy performs about 10% better given non-pathological Web site behaviour (neither completely static nor almost unstable). Values of θ in [0;0.45) or (0.7;1] are less effective: they induce an either too “risk-avoidant” (θ ∈ [0;0.45)) or too “risk-ignorant” (θ ∈ (0.7;1]) scheduling with minor (or even zero) performance gain, e.g. when acting “risk-ignorant” in heavily changing sites or “risk-avoidant” in mostly static sites. Comparable results have also been produced for larger (and smaller) Web sites with similar change distributions in numerous experiments.

Figure 5.7: Excerpt of a crawl's automatically generated temporal coherence report

Analysis and Visualization of Crawl Coherence

The analysis of coherence defects measures the quality of a capture either directly at runtime or between two captures. To this end, we have developed methods for automatically generating sophisticated statistics per capture (e.g. the number of defects that occurred, sorted by defect type) as part of our analysis environment. Figure 5.7 contains a screenshot with an excerpt of such a temporal coherence report. In addition, the capturing process is traced and enhanced with statistical data for export in GraphML. Hence, it is also possible to lay out a capture's spanning tree and visualize its coherence defects by applying GraphML compliant software. This visual metaphor is intended as an additional means, next to the automated statistics, for understanding the problems that occurred during capturing. Figure 5.8 depicts a sample visualization of an mpi-inf.mpg.de domain capture (about 65,000 Web contents) with the Visone software (cf. http://visone.info/ for details).
Depending on the nodes' size, shape, and color, the user gets an immediate overview of the success or failure of the capturing process. In particular, a node's size is proportional to the amount of coherent Web content contained in its sub-tree. In the same sense, a node's color highlights its “coherence status”: while green stands for coherence, the signal colors yellow and red indicate content modifications and/or link structure changes. The most serious defect class, missing contents, is colored black. Finally, a node's shape indicates its MIME type, ranging from circles (HTML contents), hexagons (multimedia contents), rounded rectangles (Flash or similar), squares (PDF contents and other binaries) to triangles (DNS lookups). Altogether, the analysis and visualization features developed aim at helping the crawl engineer to better understand the nature of change(s) within or between Web sites and – consequently – to adapt the crawling strategy/frequency for future captures. As a result, this will also help increase the overall archive coherence.

Figure 5.8: Coherence defect visualization of a sample domain capture (mpi-inf.mpg.de) by Visone

5.4 Integration into the LiWA Architecture

Technically, the LiWA Temporal Coherence module is subdivided into a modified version of the Heritrix crawler (including a LiWA coherence processor V1) and its associated (Oracle) database. Here, the (meta-)data extracted within the modified Heritrix crawler are stored and made accessible as distinct capture-revisit tuples. In addition, arbitrary captures can be combined into artificial capture-revisit tuples of “virtually” decelerated captures. In parallel, we created a simulation environment that employs the same algorithms we have developed in the measuring environment, but gives us full control over the content (changes) and allows us to perform extreme tests (in terms of change frequency, crawling speed and/or crawling strategy). Thus, experiments employing our coherence ensuring crawling algorithms can be carried out with different expectations about the status of Web contents and can be compared against ground truth.

Figure 5.9 depicts a flowchart highlighting the main aspects of the LiWA Temporal Coherence processor V1 in Heritrix. The elements in green are unchanged compared with the standard Heritrix crawler. The bluish items represent methods of the existing crawler that have been adapted to our revisit strategy. Finally, the red unit represents an additional processing step of the LiWA temporal coherence module.

Figure 5.9: Flowchart of LiWA Temporal Coherence processor V1 in Heritrix

The second component of the LiWA Temporal Coherence module is the analysis and visualization environment. It serves as a means to measure the quality of a capture either directly at runtime (online) or between two captures (offline). For that purpose, statistical data per capture (e.g. the number of defects that occurred, sorted by defect type) is computed from the associated (Oracle) database after crawl completion, as part of the LiWA Temporal Coherence processor V1 in Heritrix.

5.5 References

[BrCy00] B. E. Brewington and G. Cybenko. Keeping up with the changing Web. Computer, 33(5):52-58, May 2000.

[ChGa00] J. Cho and H. Garcia-Molina. The evolution of the web and implications for an incremental crawler. In VLDB '00: Proc. of the 26th intl. conf. on Very Large Data Bases, pages 200-209, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc.
[ChGa03a] J. Cho and H. Garcia-Molina. Effective Page Refresh Policies for Web Crawlers. ACM Transactions on Database Systems, 28(4), 2003.

[ChGa03b] J. Cho and H. Garcia-Molina. Estimating Frequency of Change. ACM Trans. Inter. Tech., 3(3):256-290, Aug. 2003.

[CGPa98] J. Cho, H. Garcia-Molina, and L. Page. Efficient Crawling through URL Ordering. In WWW7: Proc. of the 7th intl. conf. on World Wide Web, pages 161-172, Amsterdam, The Netherlands, 1998. Elsevier Science Publishers B. V.

[Masa06] J. Masanès. Web Archiving. Springer, New York, Inc., Secaucus, NJ, 2006.

[NaWi01] M. Najork and J. L. Wiener. Breadth-First Search Crawling Yields High-Quality Pages. In Proc. of the 10th intl. World Wide Web conf., pages 114-118, 2001.

[OlPa08] C. Olston and S. Pandey. Recrawl Scheduling Based on Information Longevity. In WWW '08: Proceedings of the 17th intl. conf. on World Wide Web, pages 437-446. ACM, 2008.

[SDM*09] M. Spaniol, D. Denev, A. Mazeika, P. Senellart and G. Weikum. Data Quality in Web Archiving. In Proceedings of the 3rd Workshop on Information Credibility on the Web (WICOW 2009), in conjunction with the 18th World Wide Web Conference (WWW2009), Madrid, Spain, April 20, pp. 19-26, 2009.

6 Semantic Evolution

Preserving knowledge for future generations is a major reason for collecting all kinds of publications, web pages, etc. in archives. However, ensuring the archival of content is just the first step toward “full” content preservation. It also has to be guaranteed that the content can be found and interpreted in the long run. This type of semantic accessibility of content suffers from changes in language over time, especially if we consider time frames beyond ten years. Language changes are triggered by various factors, including new insights, political and cultural trends, new legal requirements, high-impact events, etc. Due to this development of terminology over time, searches with standard information retrieval techniques, using current language or terminology, would not be able to find all relevant content created in the past, when other terms were used to express the same sought content. To keep archives semantically accessible it is necessary to develop methods for automatically dealing with terminology evolution.

6.1 State of the Art on Terminology Evolution

The task of automatically detecting terminology evolution in a corpus can be divided into two subtasks. The first one is to automatically determine, from a large digital corpus, the senses of terms. This task is generally referred to as Word Sense Discrimination. For this task we will present the state of the art and give a description of the different approaches available. The second task takes place once several snapshots of words and their senses have been created using corpora from different periods of time. After having obtained these snapshots, terminology evolution can be detected between any two (or a series of) instances. To our knowledge little or no previous work has been done directly on this topic, and thus we investigate the state of the art in related areas such as the evolution of clusters in dynamic networks and ontology evolution.

6.1.1 Word Sense Detection

In this section we give a general overview and state of the art in Word Sense Discrimination (WSD) as well as related fields. Word Sense Discrimination is a subtask of Word Sense Disambiguation.
The task of Word Sense Disambiguation is, given an occurrence of an ambiguous word and its context (usually the sentence or the surrounding words in a window), to determine which sense is referred to. Usually the senses used in Word Sense Disambiguation come from explicit knowledge sources such as thesauri or ontologies. Word Sense Discrimination, on the other hand, is the task of automatically finding the senses of the words present in a collection. If an explicit knowledge source is not used, Word Sense Discrimination can be considered a subtask of Word Sense Disambiguation.

Using WSD instead of a thesaurus or other explicit knowledge sources has several advantages. Firstly, the method can be applied to domain specific corpora for which few or no knowledge sources can be found. Examples are, for instance, detailed technical data such as biology or chemistry and, on the other end of the spectrum, blogs, where much slang and many gadget names are used. Secondly, man-made thesauri often contain specific word senses which might not be related to the particular corpus. For example, WordNet [Mil95] describes the word “computer” as “an expert in operating calculating machines”, which is definitely a less frequent usage.

The output of WSD is normally a set of terms describing the senses found in a collection. This grouping of terms is derived by clustering. We refer to such an automatically found sense as a cluster, and throughout this document we will use the terms clusters and senses interchangeably. Clustering techniques can be divided into hard and soft clustering algorithms. In hard clustering an element can only appear in one cluster, while soft clustering allows each element to appear in several. Due to the polysemous property of words, soft clustering is most appropriate for Word Sense Discrimination.

6.1.2 Word Sense Discrimination

Word Sense Discrimination techniques can be divided into two major groups, supervised and unsupervised. Due to the vast amounts of data available on the Web and – as a consequence – stored in Web archives, we will focus on unsupervised techniques only.

Automatic Word Sense Discrimination

According to Schütze [Sch98], the basic idea of context group discrimination is to induce senses from contextual similarities. Each occurrence of an ambiguous word in a training set is mapped to a point in word space, called a first order co-occurrence vector. The similarity between two points is measured by cosine similarity. A context vector is then considered as the centroid (or sum) of the vectors of the words occurring in the context. This set of context vectors, also considered second order co-occurrence vectors, is then clustered into a number of coherent clusters or contexts using Buckshot, which is a combination of the Expectation Maximization algorithm and agglomerative clustering. The representation of a sense is the centroid of its cluster. Occurrences of ambiguous words from the test set are mapped in the same way as words from the training set and labelled using the closest sense vector. The method is completely unsupervised since manual tagging is not required. However, its disadvantages are that the clustering is hard and that the number of clusters has to be predetermined.

A systematic comparison of unsupervised WSD techniques for clustering instances of words using both vector and space similarity is conducted by Purandare et al. [PP04]. The authors compare the aforementioned method of Schütze with that of Pedersen et al. [PB97] using first order co-occurrence vectors. The result is twofold: second order context vectors have an advantage over first order vectors for small training data; however, for larger amounts of homogeneous data such as the “Line, Hard and Serve” data [HLS], first order context vector representation with the UPGMA (Unweighted Pair Group Method with Arithmetic Mean) clustering algorithm is the most effective at WSD.

Word Sense Discrimination using dependency triples

The use of dependency triples is one alternative for WSD algorithms, first described in [Lin98]. In this paper a word similarity measure is proposed and an automatically created thesaurus using this similarity is evaluated. The measure is based on one proposed by Lin in 1997 [Lin97]. This method has the restriction of using hard clustering. The author reports the method to work well, but no formal evaluation is done. In 2002, Pantel et al. published “Discovering Word Senses from Text” [PL02]. In the paper a clustering algorithm called Clustering By Committee (CBC) is presented. The paper also proposes a method for evaluating the output of a word sense clustering algorithm against WordNet. Since then, the method has been widely used [Dor07, DfML07, RKZ08, Fer04] (further papers using the Lin measure as implemented by the WordNet::Similarity package are listed at http://www.d.umn.edu/~tpederse/wnsim-bib/). In addition, it has been implemented in the WordNet::Similarity package by Ted Pedersen et al. (a description of this package can be found at http://www.d.umn.edu/~tpederse/similarity.html). Pantel et al. implemented several other algorithms like Buckshot, K-means and Average Link and showed that CBC outperforms all implemented algorithms in both recall and precision.

Graph algorithms for Word Sense Discrimination

“Using curvature and Markov clustering in graphs for lexical acquisition and word sense discrimination” by Dorow et al. [DWL+05] represents the third category of unsupervised Word Sense Discrimination techniques. A word graph G is built using nouns and noun phrases extracted from the British National Corpus [BNC]. Each noun or noun phrase becomes a node in the graph. Edges exist between all pairs of nodes that co-occur in the corpus, more precisely if the terms are separated by “and”, “or” or commas. The curvature curv of a node w is defined as follows:

curv(w) = (# of triangles w participates in) / (# of triangles w could participate in)

A triangle is a set of three nodes, of which w is one, that are all connected to each other. This is also referred to as the clustering coefficient of a node. Curvature is a way of measuring the semantic cohesiveness of the neighbours of a word. If a word has a stable meaning, the curvature value will be high. On the other hand, if a word is ambiguous, the curvature value will be low, because the word is linked to members from different senses which are not interrelated. The results show that the curvature value is particularly suited for measuring the degree of ambiguity of words. A more thorough investigation of the curvature measure and the curvature clustering algorithm is given in [Dor07]. An analysis of the curvature algorithm is made on the BNC corpus and the evaluation method proposed in [PL02] is employed. A high performance of the curvature clustering algorithm is noted, which comes at the expense of low coverage.

Another work related to clustering networks is the paper by Palla et al. [PD+05] on “Uncovering the overlapping community structure of complex networks in nature and society”. They define a cluster as a “k-clique community”, i.e. a union of k-cliques (complete subgraphs of size k) that can be reached from each other through a series of adjacent k-cliques. Adjacent k-cliques share k−1 nodes. They also conclude that relaxing this criterion is often equivalent to decreasing k. The experiments are conducted by clustering three types of graphs: co-authorship, word association and protein interaction. For the first two graphs the average clustering coefficient of the communities found is 0.44 and 0.56, respectively. This leads us to believe that 0.5 could be a good clustering coefficient threshold for curvature clustering, as used in [Dor07, DWL+05].

6.1.3 Summary

We have investigated three main methods for automatically discovering word senses from large corpora. The method presented by Pantel et al. in [PL02] gives clusters where each element has some likelihood of belonging to the cluster. This has the advantage of assigning more significant elements to a cluster. The third method, presented by Dorow et al. in [DWL+05], uses a graph theoretical approach and reports higher precision than the one found in [PL02]. The findings of [PD+05] give us an indication of which value to use as the curvature threshold for the curvature clustering algorithm.

6.2 Detecting Evolution

The analysis of communities and their temporal evolution in dynamic networks has been a well studied field in recent years [LCZ+08, PBV07]. A community can be modelled as a graph where each node represents an individual and each edge represents an interaction among individuals. As an example, in a co-authorship graph each author is considered as a node and collaboration between any two authors is represented by an edge. When it comes to detecting evolution, the traditional approach has been to first detect the community structure for each time slice and then compare these structures to determine correspondence. These methods tend to introduce dramatic evolutions in short periods of time and can hence be less appropriate for noisy data. A different path for detecting evolutions is to model the community structure at the current time by also taking previous structures into account [LCZ+08]. This can help to prevent dramatic changes introduced by noise.

A naïve way of determining cluster evolution in the traditional setting would be to simply consider each cluster from a time slice as a set of terms or nodes and then line it up with all the clusters from the consecutive time slice using a Jaccard similarity. This measures the number of overlapping nodes between two clusters divided by the total number of distinct nodes in the clusters. We could then conclude that the clusters with the highest overlap from two consecutive time slots are connected and that one evolved into the other. A more sophisticated way to detect evolution, which also takes the edge structure within clusters into account, has been proposed by Palla et al. [PBV07].

6.2.1 Quantifying social group evolution

Using the clique percolation method (CPM) [PD+05], communities in a network are discovered.
The communities of each time step from two types of graphs are extracted using the clique percolation method. The communities are then tracked over the time steps. The basic events that can occur in the lifetime of a community are the following:
1. A community can grow or contract
2. A community can merge or split
3. A community can be born while others may disappear

Similar cluster events are used in “MONIC – Modeling and Monitoring Cluster Transitions”, presented in 2006 by Spiliopoulou et al. [SNTS06]. The methods proposed in MONIC are not based on the topological properties of clusters, but on the contents of the underlying data stream. A typification of cluster transitions and a transition detection algorithm are proposed. The approach assumes a hard clustering where clusters are non-overlapping regions described through a set of attributes. In this framework, internal transitions are monitored only for clusters that exist for more than one time point. Size, compactness and shift of the centre point are monitored. The disadvantages of the method are that the algorithm assumes a hard clustering and that each cluster is considered a set of elements without respect to the links between the elements of the cluster.

A method for describing and tracking evolutions can be found in “Discovering Evolutionary Theme Patterns from Text” by Mei et al. [MZ05]. In this paper, discovering and summarizing the evolutionary patterns of themes in a text stream is investigated. The problem is defined as follows:
1. Discover latent themes from text
2. Construct an evolution graph of themes
3. Analyze life cycles of themes

The proposed method is suitable for text streams in which meaningful time stamps exist and is tested in the newspaper domain as well as on abstracts of scientific papers. The theme evolution graphs proposed in this paper seem particularly suitable for describing terminology evolution. The technology used for finding clusters corresponding to word senses will differ from the probabilistic topic model chosen here. Therefore the definitions must be modified to suit our purposes; for example, our clusters cannot be defined in the same way as themes are in this paper, and similarity between clusters cannot be measured like the similarity between two themes.

In FacetNet [LCZ+08], communities are discovered from social network data and an analysis of the community evolution is made in a manner that differs from the view of static graphs. The main difference to the traditional method is that FacetNet discovers communities that jointly maximize the correspondence to the observed data and the temporal evolution. FacetNet discovers the community structure at a given time step t as determined both by the data at t and by the historic community pattern. The method is unlikely to discover community structure that introduces dramatic evolutions in a very short time period. The method also uses a soft community membership where each element belongs to any community with a certain probability. The method is shown to be more robust to noise. It also introduces a scheme in which not all members contribute equally to the evolution.

A third method of finding evolutions in networks is influenced by “Identification of Time-Varying Objects on the Web” by Oyama et al. [OST08]. They propose a method to determine whether data found on the Web originate from the same or from different objects. The method takes into account the possibility of changes in the attribute values over time.
The identification is based on two estimated probabilities: the probability that the observed data are from the same object that has changed over time, and the probability that the observed data are from different objects. Each object is assigned certain attributes like age, job, salary etc. Once the object schema is determined, objects are clustered using an agglomerative clustering approach. Observations in the same cluster are assumed to belong to the same object. The experiments conducted show that the proposed methods improve the precision and recall of object identification compared to methods that regard all values as constants. If the objects are people, the method is able to identify a person even when he/she has aged or changed jobs. The disadvantage of this method is that the attribute types, as well as the probability models and their parameters, must be determined using domain knowledge.

For terminology evolution, once clusters are found from different time periods, each sense can be considered an object. Each cluster found in a snapshot can be considered an observation with added terms, removed terms etc. The terms in the cluster can be considered the attributes. We can then cluster observations from different snapshots in order to determine which senses are likely to belong to the same object and be evolutions of one another. An observation outside of a cluster can be considered similar to the sense represented by the cluster, but not as an evolved version of that sense.

6.2.2 Summary

Several approaches for tracking the evolution of clusters have been investigated. Based on the structure of the terminology graph and its evolution with respect to noise, and on how fast clusters change and appear/disappear, a decision will be made for one of the approaches. If our data contain much noise, we will initially start with the methods introduced in [LCZ+08]; if the clusters seem stable, the methods in [PD+05] seem more suitable. As mentioned in Section 6.1.1, methods that assume hard clustering are not well suited for WSD. Moreover, the internal structure of the clusters should be taken into consideration: it is appropriate to consider two clusters with almost the same elements as different based on the edge structure of the cluster members. On top of the clustering process, a method for describing the overall evolution is needed. For this purpose, the model described in [MZ05] seems particularly suitable. It is likely that the model is not directly applicable for describing terminology evolution, but needs to be modified or extended for our purposes.

6.3 Terminology Evolution Module

The problem of automatically detecting terminology evolution can be split into three different sub-problems: terminology snapshot creation, merging of terminology snapshots, and mapping concepts to terms. First, we need to identify and represent the relation between terms and their intended meanings (concepts) at a given time. We call such a representation a term-concept graph and a set of these a terminology snapshot. Such a snapshot is always based on a given document collection Dδti, which is a set of documents d taken from a domain δ in the time interval [ti−1, ti], where i = 0, . . . , n and ti ∈ T, with T being the set of timestamps.

6.3.1 Terminology Snapshot Creation

Each document dj ∈ Dδti contains a set of terms w ∈ Wδti.
The set Wδti is an ideal domain specific term set containing the complete set of terms, independent of the considered corpora, that were ever used in domain δ from time t0 until time ti. Since Wδti is not known, we define its approximation W'δti. At time t0 the set is empty, and W'δti = W'δti−1 ∪ terms(Dδti) for i = 1, . . . , n, where terms(Dδti) = {w : ∃d (w ∈ d ∧ d ∈ Dδti)}.

To represent the relation between terms and their meanings we introduce the notion of a concept and represent meanings as connections between term and concept nodes in a graph. Let C be the universe of all concepts. The semantics of a term w ∈ Wδti is represented by connecting it to its concepts. The edges between terms and concepts carry a temporal annotation depending on the collection's time point. For every term w ∈ Wδti, at least one term-concept edge has to exist. We introduce the function φ as a representation of term-concept relations as follows:

φ : W × T → (W × P(C × P(T)))

P denotes the power set, i.e. the set of all subsets. Although φ only generates one timestamp for each term-concept relation, we introduce the power set already at this point to simplify terminology snapshot fusion. The term-concept relations defined by φ can be seen as a graph consisting of a term and edges to all its concepts, referred to as a term-concept graph.

6.3.2 Terminology Snapshot Fusion

After we have created several separate terminology snapshots, we want to merge them to detect terminology evolution. A term's meaning has evolved if its concept relations have changed from one snapshot to another. The fusion of two terminology snapshots is (in general) more complicated than a simple graph merging. For example, we might merge two concepts from the source snapshots into a single concept in the target graph. As part of the fusion process we also need to merge the timestamps of the edges. When term and concept are identical in both snapshots, the new annotation is the union of both source annotations. Thus, we represent the concept relations of a term w ∈ W as a set of pairs (ci, {ti1, . . . , tik}). To shorten the notation we define Ti as a set of timestamps tik, and the pairs can be written as (ci, Ti). We note that a concept does not have to be continuously related to a term; instead, the respective term meaning/usage can lose popularity and regain it after some time has passed. Therefore, Ti is not necessarily a set of consecutive timestamps. We introduce the function ψ, which fuses two term-concept graphs and thereby represents the relations between concepts from different snapshots:

ψ : (W × P(C × P(T))) × (W × P(C × P(T))) → (W × P(C × P(T)))

It should be clear that the set of concepts in the graph resulting from ψ does not necessarily have to be a subset of the set of concepts from the source graphs.

6.3.3 Mapping Concepts to Terms

The graph resulting from snapshot fusion allows us to identify all concepts which have been related to a given term over time. We cannot directly exploit these relations for information retrieval; instead, we need to map the concepts back to the terms used to express them. To represent this mapping, we introduce a function χ as follows:

χ : C → P(W × P(T))

For a given concept c, χ returns the set of terms used to express c, together with the timestamp sets which denote when the respective terms were in use. The characteristics of χ clearly depend on the merging operation of the concepts in ψ. For instance, in case two concepts are merged, the term assignment has to reflect this merge.
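For illustration only, the following minimal Java sketch shows one possible in-memory representation of a terminology snapshot and of the functions φ, ψ and χ introduced above. The class name, the nested-map representation and the integer timestamps are assumptions made for this sketch and are not part of the LiWA Terminology Evolution module; in particular, the sketch only covers the easy case of fusion in which identical term-concept edges are united, not the merging of distinct concepts.

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    /** Minimal illustration of a terminology snapshot: term -> concept -> timestamps. */
    public class TerminologySnapshot {

        private final Map<String, Map<String, Set<Integer>>> termConcepts = new HashMap<>();

        /** Records one term-concept edge observed at time ti (the role of φ). */
        public void addRelation(String term, String concept, int ti) {
            termConcepts
                .computeIfAbsent(term, t -> new HashMap<>())
                .computeIfAbsent(concept, c -> new HashSet<>())
                .add(ti);
        }

        /**
         * Simple fusion of two snapshots (the easy case of ψ): when term and concept
         * are identical in both snapshots, the timestamp annotations are united.
         * Real fusion may additionally merge distinct concepts into a new one.
         */
        public static TerminologySnapshot fuse(TerminologySnapshot a, TerminologySnapshot b) {
            TerminologySnapshot result = new TerminologySnapshot();
            for (TerminologySnapshot source : new TerminologySnapshot[] {a, b}) {
                source.termConcepts.forEach((term, concepts) ->
                    concepts.forEach((concept, times) ->
                        times.forEach(ti -> result.addRelation(term, concept, ti))));
            }
            return result;
        }

        /** Reverse mapping χ: all terms that expressed a concept, with their timestamps. */
        public Map<String, Set<Integer>> termsFor(String concept) {
            Map<String, Set<Integer>> terms = new HashMap<>();
            termConcepts.forEach((term, concepts) -> {
                Set<Integer> times = concepts.get(concept);
                if (times != null) terms.put(term, new HashSet<>(times));
            });
            return terms;
        }
    }

With a nested map like this, the reverse mapping χ is a simple scan over all terms; a production implementation would keep an inverted concept-to-term index instead of recomputing it per query.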
6.4 Evaluation

To the best of our knowledge there are no published evaluation methods or benchmark datasets to rely on. Therefore we have developed the following strategy for evaluating the Terminology Evolution Module developed in LiWA. The starting point is the manual creation of a set of example terms that we call the test set. This test set should contain approximately 60-80 terms for which there is evidence of an evolution; example terms are St. Petersburg or fireman/firefighter (for more examples please see [TIRNS08]). As a counter-reference, we will also include in the test set a set of terms for which there is no evidence of evolution. Using the LiWA Terminology Evolution Module we will then determine how many of the evolved terms we are able to find in this automatic fashion. Every term evolution that is found (or not found) in correct correspondence with our knowledge of the term will be considered a success.

The procedure described above aims at discovering the terminology evolution detectable in the archive. Because it does not address information extraction or search, we will also apply the following strategy. For the test set of terms described above, we will manually identify the relevant documents present in the archive; these documents constitute our target set. In order to create a baseline, we use only one of the query terms to search the target set and note how many of the relevant results are returned. Then we extend the query term with additional terms found by the Terminology Evolution Module and query the target set again using this extended set of terms. If we are now able to retrieve more documents than with the baseline, we count the query as a success. We can also measure the percentage of successful queries, an average success rate, and similar statistics.

This evaluation thus measures two related aspects. Firstly, we measure how well our module can model the evolution found in an archive; here the focus lies on finding all evolutions present in the archive. Secondly, we measure to what extent the information found by the module can aid information retrieval: are we able to return a higher number of relevant documents for a query using the information found by the Terminology Evolution Module?
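The retrieval part of this strategy can be expressed compactly in code. The Python sketch below, given a manually identified target set and the additional terms proposed for a query term, compares how many target documents the baseline query and the extended query retrieve. The plain substring matching and the data layout are illustrative assumptions; the actual evaluation would run against the archive's search index.

    def retrieved(target_docs, query_terms):
        """Return the ids of target documents containing at least one query term.

        target_docs: dict doc_id -> document text (the manually built target set).
        query_terms: set of lower-cased terms; substring search stands in for
                     the archive's real index in this sketch.
        """
        return {doc_id for doc_id, text in target_docs.items()
                if any(term in text.lower() for term in query_terms)}

    def evaluate_query(target_docs, term, evolved_terms):
        """Compare baseline retrieval with retrieval using evolved terms.

        Returns (baseline_count, extended_count, success), where success means
        the extended query retrieved strictly more target documents.
        """
        baseline = retrieved(target_docs, {term})
        extended = retrieved(target_docs, {term} | set(evolved_terms))
        return len(baseline), len(extended), len(extended) > len(baseline)

    # Hypothetical example following the fireman/firefighter case:
    target = {1: "the firemen went on strike", 2: "firefighters answered the call"}
    print(evaluate_query(target, "firefighter", ["fireman", "firemen"]))
    # -> (1, 2, True): the extended query also retrieves the older usage.

An average success rate over the whole test set can then be obtained by running such a comparison for every term and averaging the outcomes.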
6.5 Integration into the LiWA Architecture

The Terminology Evolution Module is subdivided into terminology extraction and tracing of terminology evolution. Both sub-modules are integrated via UIMA pipelines, as presented in Figure 6.1. The terminology extraction sub-module is automatically triggered when a crawl or a partial crawl is finished. The terminology evolution sub-module is manually triggered by the archive curator, based on the crawl statistics gathered during terminology extraction. “The Unstructured Information Management Architecture (UIMA) is an architecture and software framework for creating, discovering, composing and deploying a broad range of multi-modal analysis capabilities and integrating them with search technologies.” ([UIMA])

Figure 6.1: Work flow for the Semantic Evolution module V1. (The figure shows crawler post-processing putting finished crawls — crawl id, new WARC files, final flag — into an asynchronous job queue that feeds the UIMA processing chains: WARC extraction, POS tagging, lemmatization and co-occurrence analysis writing to the TermEvolv database and crawl statistics, followed by curator-triggered word sense discrimination, cluster tracking and evolution detection.)

6.5.1 Terminology Extraction Pipeline

The WARC Collection Reader (WARC Extraction) extracts the text and time metadata for each site archived in the input crawl. The POS (Part-Of-Speech) Tagger is an aggregate analysis engine from Dextract ([Dextract]); it consists of a tokenizer, a language-independent part-of-speech tagger and a lemmatizer ([TreeTagger]). In the Term Extraction sub-module, we read the annotated sites and extract the lemmas and the different parts of speech identified for the archived sites. After that, we index the terms in a database (MySQL) index, see Figure 6.2. In the Co-occurrence Analysis we extract lemma or noun co-occurrence matrices for the indexed crawl from the database index.

Figure 6.2: Database Terminology Index

6.5.2 Terminology Evolution Pipeline

After extracting the lemma co-occurrence matrices for crawls captured at different moments in time, we cluster the lemmas of each time interval with a curvature clustering algorithm. The clusters from the different time intervals are then analyzed and compared in order to detect term evolution.
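As an illustration of the clustering step, the following Python sketch applies curvature-based clustering in the spirit of [DWL+05] to a lemma co-occurrence graph: nodes with a low clustering coefficient ("curvature") act as ambiguous hubs linking otherwise unrelated contexts, so removing them lets the remaining connected components emerge as sense clusters. The use of the networkx library, the threshold value and the toy graph are illustrative assumptions; the actual implementation may differ in how curvature is computed and how clusters are selected.

    import networkx as nx

    def curvature_clusters(cooccurrence_edges, min_curvature=0.5):
        """Split a co-occurrence graph into sense clusters via curvature.

        cooccurrence_edges: iterable of (lemma, lemma) pairs.
        Nodes whose local clustering coefficient ("curvature") falls below
        min_curvature are treated as ambiguous hubs and removed; the
        connected components of the remaining graph are the clusters.
        """
        graph = nx.Graph(cooccurrence_edges)
        curvature = nx.clustering(graph)  # clustering coefficient per node
        kept = [n for n, c in curvature.items() if c >= min_curvature]
        return [set(component) for component in
                nx.connected_components(graph.subgraph(kept))]

    # Toy example: 'bank' co-occurs with both financial and river vocabulary.
    edges = [("bank", "money"), ("bank", "loan"), ("money", "loan"),
             ("bank", "river"), ("bank", "shore"), ("river", "shore")]
    for cluster in curvature_clusters(edges):
        print(cluster)
    # -> {'money', 'loan'} and {'river', 'shore'}; the hub 'bank' is removed
    #    because its neighbourhoods are poorly interconnected.

Hub words removed in this way can afterwards be assigned to the clusters of their neighbours, which is roughly how curvature-based approaches recover the different senses of an ambiguous word.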
6.6 References

[BGBZ07] Jordan Boyd-Graber, David Blei, and Xiaojin Zhu. A topic model for word sense disambiguation. 2007.
[BNC] BNC Consortium: British National Corpus, http://www.natcorp.ox.ac.uk/corpus/index.xml.ID=products
[Brown] Brown University: Brown Corpus, http://khnt.hit.uib.no/icame/manuals/brown/INDEX.HTM
[DfML07] Koen Deschacht and Marie-Francine Moens. Text analysis for automatic image annotation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics. East Stroudsburg: ACL, 2007.
[dMD04] Marie-Catherine de Marneffe and Pierre Dupont. Comparative study of statistical word sense discrimination techniques, 2004.
[Dor07] B. Dorow. A Graph Model for Words and their Meanings. PhD thesis, University of Stuttgart, 2007.
[DW03] Beate Dorow and Dominic Widdows. Discovering corpus-specific word senses. In EACL '03: Proceedings of the tenth conference of the European Chapter of the Association for Computational Linguistics, Morristown, NJ, USA, 2003.
[DWL+05] B. Dorow, D. Widdows, K. Ling, J.-P. Eckmann, D. Sergi, and E. Moses. Using curvature and Markov clustering in graphs for lexical acquisition and word sense discrimination. In MEANING-2005, 2nd Workshop organized by the MEANING Project, Trento, Italy, February 3-4, 2005.
[Fer04] Olivier Ferret. Discovering word senses from a network of lexical cooccurrences. In COLING '04: Proceedings of the 20th International Conference on Computational Linguistics, page 1326, Morristown, NJ, USA, 2004. Association for Computational Linguistics.
[HLS] Leacock, Chodorow and Miller: Hard and Serve; Leacock, Towell and Voorhees: Line. http://www.d.umn.edu/~tpederse/data.html
[KH07] Gabriela Kalna and Desmond J. Higham. A clustering coefficient for weighted networks, with application to gene expression data. 2007.
[LADS06] Yaozhong Liang, Harith Alani, David Dupplaw, and Nigel Shadbolt. An approach to cope with ontology changes for ontology-based applications. The 2nd AKT Doctoral Symposium and Workshop, 2006.
[LCZ+08] Yu-Ru Lin, Yun Chi, Shenghuo Zhu, Hari Sundaram, and Belle L. Tseng. FacetNet: a framework for analyzing communities and their evolutions in dynamic networks. In WWW '08: Proceedings of the 17th International Conference on World Wide Web, pages 685-694, New York, NY, USA, 2008. ACM.
[Lin97] Dekang Lin. Using syntactic dependency as local context to resolve word sense ambiguity. In ACL-35: Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics, pages 64-71, Morristown, NJ, USA, 1997. Association for Computational Linguistics.
[Lin98] Dekang Lin. Automatic retrieval and clustering of similar words. In Proceedings of the 17th International Conference on Computational Linguistics, pages 768-774, Morristown, NJ, USA, 1998. Association for Computational Linguistics.
[LSB06] Esther Levin, Mehrbod Sharifi, and Jerry Ball. Evaluation of utility of LSA for word sense discrimination. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, New York City, USA, June 2006. Association for Computational Linguistics.
[Mil95] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 1995.
[MKWC04] Diana McCarthy, Rob Koeling, Julie Weeds, and John Carroll. Finding predominant word senses in untagged text. In ACL '04: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, page 279, Morristown, NJ, USA, 2004. Association for Computational Linguistics.
[MZ05] Qiaozhu Mei and ChengXiang Zhai. Discovering evolutionary theme patterns from text: an exploration of temporal text mining. In KDD '05: Proceedings of the eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, New York, NY, USA, 2005. ACM.
[Nav09] Roberto Navigli. Word sense disambiguation: A survey. ACM Computing Surveys, 2009.
[OST08] Satoshi Oyama, Kenichi Shirasuna, and Katsumi Tanaka. Identification of time-varying objects on the Web. In JCDL '08: Proceedings of the 8th ACM/IEEE-CS Joint Conference on Digital Libraries, New York, NY, USA, 2008. ACM.
[PB97] Ted Pedersen and Rebecca Bruce. Distinguishing word senses in untagged text. In Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, 1997.
[PBV07] Gergely Palla, Albert-Laszlo Barabasi, and Tamas Vicsek. Quantifying social group evolution. Nature, April 2007.
[PD+05] Gergely Palla, Imre Derenyi, Illes Farkas, and Tamas Vicsek. Uncovering the overlapping community structure of complex networks in nature and society. Nature, 2005.
[PL02] Patrick Pantel and Dekang Lin. Discovering word senses from text. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2002.
[PP04] Amruta Purandare and Ted Pedersen. Word sense discrimination by clustering contexts in vector and similarity spaces. In Proceedings of CoNLL-2004, Boston, MA, USA, 2004, pp. 41-48.
[Rap04] Reinhard Rapp. A practical solution to the problem of automatic word sense induction. In Proceedings of the ACL 2004 Interactive Poster and Demonstration Sessions, page 26, Morristown, NJ, USA, 2004. Association for Computational Linguistics.
[RKZ08] Sergio Roa, Valia Kordoni, and Yi Zhang. Mapping between compositional semantic representations and lexical semantic resources: Towards accurate deep semantic parsing. In Proceedings of ACL-08: HLT, Short Papers, Columbus, Ohio, June 2008. Association for Computational Linguistics.
[Sch98] Hinrich Schütze. Automatic word sense discrimination. Computational Linguistics, 1998.
[SemCor] Princeton University: SemCor. http://multisemcor.itc.it/semcor.php
[Senseval] Senseval 1, Senseval 2 and Senseval 3: description at http://www.senseval.org/ ; data for download at http://www.d.umn.edu/~tpederse/data.html
[SNTS06] Myra Spiliopoulou, Irene Ntoutsi, Yannis Theodoridis, and Rene Schult. MONIC: modeling and monitoring cluster transitions. In KDD '06: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 2006. ACM.
[TIRNS08] Nina Tahmasebi, Tereza Iofciu, Thomas Risse, Claudia Niederee, and Wolf Siberski. Terminology evolution in web archiving: Open issues. In Proceedings of the 8th International Web Archiving Workshop, in conjunction with ECDL 2008, Aarhus, Denmark.
[TreeTagger] TreeTagger – a part-of-speech tagger. http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html
[UIMA] UIMA Overview. Apache, http://incubator.apache.org/uima/
[vD00] Stijn van Dongen. Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, May 2000.
[Yar95] David Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, Morristown, NJ, USA, 1995. Association for Computational Linguistics.