Master's Thesis

Ranking Blogs based on Topic Consistency

by Philipp Berger

Potsdam, September 2012

Supervisor: Prof. Dr. Christoph Meinel
Internet-Technologies and Systems Group

Disclaimer

I certify that the material contained in this master's thesis is my own work and does not contain significant portions of unreferenced or unacknowledged material. I also warrant that the above statement applies to the implementation of the project and all associated documentation.

Hiermit versichere ich, dass diese Arbeit selbständig verfasst wurde und dass keine anderen Quellen und Hilfsmittel als die angegebenen benutzt wurden. Diese Aussage trifft auch für alle Implementierungen und Dokumentationen im Rahmen dieses Projektes zu.

Potsdam, September 27, 2012 (Philipp Berger)

Kurzfassung

Gängige Blog-Rankings wie PageRank, Technorati Authority und BI-Impact bevorzugen Blogs, die sich mit einer Vielzahl von Themen auseinandersetzen, da diese ein größeres Publikum und damit mehr Besucher, Links und Kommentare anziehen. Ein Beispiel dafür ist der Blog spreeblick.com, der sich mit Themen rund um Politik, Gesellschaft und IT beschäftigt. Andererseits erreichen Nischenblogs, welche sich auf ein Thema konzentrieren, nur wenig Einfluss. Nischenblogs sind Blogs wie telemedicus.info, der nur Artikel über Datenschutz und Urheberrecht veröffentlicht. Dadurch erhalten diese nur eine niedrige Bewertung von heutigen Blog-Suchmaschinen. Diese Arbeit erörtert, dass die Konsistenz von Blogs, d.h. wie konzentriert ein Autor ein Thema behandelt, ein Zeichen für Expertenwissen ist. Solche Blogs zu finden ist besonders wichtig für andere Experten, damit sie diesen folgen und in einen aktiven Diskurs treten können. Um das Auffinden dieser Blogs zu erleichtern, d.h. sie von der Masse der vielseitig interessierten Blogs zu trennen, wird eine Metrik für Blogs vorgestellt, welche auf der thematischen Konsistenz basiert.
Das Konsistenz-Ranking basiert auf der (1) Intra-Post-, (2) Inter-Post-, (3) Intra-Blog- und (4) Inter-Blog-Konsistenz. Die vorgestellte Metrik wird auf einem Datensatz von 12.000 gesammelten Blogs ausgewertet und somit die Plausibilität dieses Ansatzes demonstriert.

Abstract

Current ranking algorithms, such as PageRank, Technorati authority, and BI-Impact, favor blogs that report on a diversity of topics, since those attract a large audience and thus more visitors, links, and comments. One example is the spreeblick.com blog, which offers articles on politics, society, and IT. On the other hand, niche blogs with a very specific topic attract only a small audience and thus have only a small reach. Niche blogs are blogs like telemedicus.info, which only publishes articles on privacy and copyright. This results in a low ranking from today's blog retrieval systems. This thesis argues that the consistency of a blog, i.e. how focused an author reports on a single topic, is a sign of expert knowledge. Finding these blogs is particularly important for other domain experts who want to identify blogs to follow and stay in active contact with. To ease the retrieval of expert blogs, i.e. to separate them from the mass of blogs that report on random topics, a metric for blogs based on topic consistency is introduced. The consistency ranking is based on four different aspects: (1) intra-post, (2) inter-post, (3) intra-blog, and (4) inter-blog consistency. By evaluating the metric on a test data set of 12,000 crawled blogs, the plausibility of this approach is demonstrated.

Contents

1 Introduction
2 Background
  2.1 Weblogs
  2.2 BlogIntelligence Framework
    2.2.1 Extraction
    2.2.2 Analysis
    2.2.3 Visualization
  2.3 Apache Nutch - the Crawling Framework
  2.4 SAP HANA - the Persistence Layer
  2.5 Clustering and Apache Mahout
3 Related Work
  3.1 General Rankings
  3.2 Blog-Specific Rankings
  3.3 Consistency-Related Rankings
4 Definition of the Topic Consistency Metric
  4.1 Consistency between Posts (Inter-Post)
  4.2 Internal Consistency of Posts (Intra-Post)
  4.3 Consistency between Posts and Classification (Intra-Blog)
  4.4 Consistency of Linking and Linked Blogs (Inter-Blog)
  4.5 Combined Topic Consistency Rank
5 Implementation of Topic Detection
  5.1 Prerequisites
  5.2 Clustering
6 Implementation of the Topic-Consistency Rank
  6.1 Intra-Post Consistency
  6.2 Inter-Post Consistency
  6.3 Intra-Blog Consistency
  6.4 Inter-Blog Consistency
  6.5 BI-Impact Score
7 Evaluation
  7.1 Experimental Setup
  7.2 Clustering
  7.3 Results of the Topic Consistency Sub Ranks
  7.4 Comparison of BI-Impact and Combined Topic Consistency Rank
8 Recommendations for Future Research
  8.1 Enhanced Topic Detection
  8.2 Visualization
  8.3 Full integration with SAP HANA
9 Conclusion
List of Abbreviations
List of Figures
List of Tables
Bibliography

"Blogging is ... to writing what extreme sports are to athletics; more free-form, more accident-prone, less formal, more alive. It is in many ways, writing out loud." - Andrew Sullivan, The Atlantic, "Why I Blog"

1 Introduction

Weblogs, called blogs for short, are one of the most popular "social media tools" of the World Wide Web (WWW) [1]. They are specialized, but easy-to-use, content management systems. Blogs focus on frequently updated content, social interactions, and interoperability with other web authoring systems. Blogs are part of the rise of social media, i.e. the move of the internet towards more user participation and freedom of speech [2]. This is driven by their various application areas: beginning with personal diaries and holiday photo collections, reaching to knowledge management, educational, scientific research, and corporate platforms, and finally to forums for traditional journalists and the emerging concept of citizen journalists, who leaped into fame during the Arab Spring [3, 4, 5]. The actual power of blogs evolves through their common superstructure: a blog integrates itself into a huge think tank of millions of interconnected weblogs, called the blogosphere, which creates an enormous and ever-changing archive of open-source intelligence [6]. Through the various application areas and the immense number of blogs, the diversity of discussed topics continuously increases. As shown in Fig. 1, this diversity ranges from travel and news to politics and gaming.
Blog readers are not able to access all the information of the blogosphere because they are overwhelmed by the enormous number of blogs and their diversity. To handle this information overload, the research and application area of blog retrieval evolved [8]. As in traditional information retrieval (IR) and data mining, the target is to ease the understanding of the causal relations in the blogosphere and the retrieval of the blogs most relevant to the user's information need [9].

Facing this unique challenge, the BlogIntelligence (BI) project [10] was initiated with the objective to map, and ultimately reveal, content-oriented, network-related structures of the blogosphere by using an intelligent crawler and tailor-made analyses for the blogosphere [10].

[Figure 1: Topics blogged about in 2008 [7]. Bar chart of topic shares: Personal blogs 63.5%, Family or friend blogs 38.9%, Music 33.1%, News 29.1%, Opinions on products and brands 26.6%, Film/TV 26.4%, Computers 24.8%, Travel 22.5%, Technology 20.8%, Gaming 18.2%, Sport 16.7%, Science 13.6%, Business 13.5%, Business news 12.1%, Celebrities 9.8%, Other 6.7%.]

Beside normal search engine functionalities, BlogIntelligence has to consider the specific characteristics of the blogosphere and social interaction from other social networks, and to leverage content mining [11]. This thesis originates from the BlogIntelligence project and presents a ranking approach based on the topical consistency of blogs. This ranking aims to ease the retrieval of expert blogs, which are particularly important for users who want to identify blogs to follow and interact with.

The ranking of documents is a common technique in IR [12]. It aims to assess the relevance of documents for a specific user's information need. Current IR systems mainly calculate the ranking or authority of a blog based on its position in the web graph or social graph [13, 14].
Advanced ranking approaches also consider the up-to-dateness of the content and the level of readers' engagement [11]. In contrast to current approaches, the goal of this thesis is to establish topic consistency as the primary ranking factor.

Topical consistency is defined as the degree to which a blog author focuses on a specific set of topics [15]. If blog authors cover several topics, as in random-interest blogs or diaries, they have a low topic consistency and thus cannot create topical thrust. In contrast, a blog has the highest topic consistency if it continuously concentrates on one topic. It is argued that such a blog develops a sufficiently high expertise in this topic [16]. Thus, the content of this blog author is expected to be more relevant to an information need than the content of a topically versatile and influential author. Analogous to frequently cited experts in the real world, blog readers are expected to be more likely to trust and interact with a blog author who has a high topic consistency.

To implement the ranking, it is integrated into the BlogIntelligence framework. BlogIntelligence essentially consists of three components. The data extraction component is the basis of the BI framework; it harvests the web, analyzes each web page, extracts blog-specific information, and stores the harvested data in the storage layer. The analysis component provides prototypical implementations for the detection of trending topics and the ranking of blogs. The third component is the visualization, which communicates the analysis results to the user. To implement the topic consistency rank, it is necessary to integrate a topic detection mechanism into the analysis layer and to calculate the actual ranking based on the detected topics and the crawled data. Further, this thesis introduces an extension of the visualization that communicates the topic consistency of a blog to the user.
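The intuition behind topical thrust can be illustrated with a toy sketch: an author who concentrates on few topics has a peaked topic distribution, which can be measured, for example, by normalized entropy. The function below is a hypothetical illustration only; the thesis's actual metric and its sub-ranks are defined in Sec. 4.

```python
from collections import Counter
from math import log

def topic_focus(post_topics):
    """Illustrative focus score in [0, 1]: 1.0 when all posts share one
    topic, falling to 0 as posts spread evenly over many topics.
    (Hypothetical helper, not the thesis's Sec. 4 metric.)"""
    counts = Counter(post_topics)
    total = sum(counts.values())
    if len(counts) == 1:
        return 1.0
    # Shannon entropy of the topic distribution, normalized by its maximum.
    entropy = -sum((c / total) * log(c / total) for c in counts.values())
    return 1.0 - entropy / log(len(counts))

# A niche blog vs. a random-interest blog:
niche = ["privacy"] * 9 + ["copyright"]
diary = ["travel", "music", "food", "politics", "games"] * 2
print(topic_focus(niche) > topic_focus(diary))  # True: the niche blog is more focused
```

By this measure, the hypothetical niche blog scores around 0.53 while the evenly spread diary scores 0.0, matching the intuition that concentration, not volume, signals expertise.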
In order to evaluate the plausibility of a topic consistency ranking, it is formally defined and prototypically implemented in the course of this thesis. Further, it is tested whether a correlation between the topical consistency of a blog and its influence is observable. This evaluation draws on the BlogIntelligence data set, which currently consists of 12,000 blogs with over 600,000 posts.

The remainder of this thesis is structured as follows. Section 2 outlines the foundations of this thesis. It introduces the reader to the concept of weblogs, to the layers of BlogIntelligence, and to the technique of data clustering. In Section 3, related research concerning ranking approaches and topical content analyses is described. Section 4 presents the formal definition of the topic consistency rank and its sub-ranks. Section 5 outlines the implementation of the underlying topic detection mechanism. Section 6 describes the implementation of the topic consistency rank and its integration into the BlogIntelligence framework. Section 7 discusses the results and the plausibility of the topic consistency rank. Future work is introduced in Section 8. Finally, Section 9 presents the conclusion of this thesis.

2 Background

This Section presents the basic concept of weblogs. The analysis of weblogs is the main goal of the BlogIntelligence framework, which is the foundation of this work. Therefore, the layers of BlogIntelligence are introduced as well. Further, this Section gives an overview of the technologies that are utilized for the topic consistency rank calculation: Apache Nutch, Apache Mahout, and SAP HANA.

2.1 Weblogs

As discussed in Sec. 1, blogs are specialized content management systems (CMS) that enable authors to share content and open discussions. Blog platforms, like Blogger (http://www.blogger.com/), WordPress (http://wordpress.com/), and TypePad (http://www.typepad.com/), provide a unified structure for the published content.
This structure reflects the requirements of a frequently updated and socially active medium. Weblog is a compound of the terms web and log [17]. The entries of this log are called posts and are usually displayed in reverse chronological order, with the most recent entry first. Posts can contain text, images, and videos to express the author's opinion; they are the counterpart to traditional newspaper articles. Each post can be referenced via a URI (Uniform Resource Identifier) in the World Wide Web (WWW). A special kind of URI is the permalink: the durable address of a blog post, guaranteed to be reachable and unique during the lifetime of a blog.

In addition, a blog author can categorize his posts based on two classification mechanisms: categories and tags. Categories offer a hierarchical structure for classifying a blog's contents, similar to traditional libraries. They are frequently used to emphasize distinct discussion streams within a blog. In contrast, tags are unordered keywords attached to a post and do not offer a hierarchy. They summarize the content of a post. Readers use tags to navigate a blog and to find posts related to a very specialized concept [18]. One prominent application is the tag cloud (see Fig. 2, generated by Wordle, http://www.wordle.net/).

[Figure 2: A generated tag cloud.]

A tag cloud visualizes all tags (keywords) of all posts of a blog and gives an impression of the most discussed topics. It has become a popular method to support the navigation and retrieval of posts [19].

The social component of a blog is the readers' ability to write comments [20]. Comments enable blog readers to open an active discussion attached to a post, to communicate their opinions, or to offer help. This enables the users of a blog to communicate in a highly interactive way [21].
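A tag cloud like the one in Fig. 2 is simple to produce: count the tag frequencies over all posts and scale each tag's display size with its count. The linear scaling and pixel bounds below are illustrative assumptions, not how Wordle works internally:

```python
from collections import Counter

def tag_cloud_sizes(tags, min_px=12, max_px=48):
    """Map each tag to a font size proportional to its frequency."""
    counts = Counter(tags)
    lo, hi = min(counts.values()), max(counts.values())
    span = (hi - lo) or 1  # avoid division by zero when all counts are equal
    return {tag: round(min_px + (c - lo) / span * (max_px - min_px))
            for tag, c in counts.items()}

# Hypothetical tags collected from a niche blog's posts:
tags = ["privacy"] * 8 + ["copyright"] * 4 + ["telemedicine"]
print(tag_cloud_sizes(tags))
# the most frequent tag gets max_px, the rarest min_px
```

Real tag cloud generators add layout and collision handling on top of this frequency-to-size mapping, but the mapping itself is the core of the visualization.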
Nevertheless, blog comments are manually moderated by the blog author, because the author is responsible for the content published on his blog. The blog author also wants to control the discussion and to avoid inappropriate comments.

Further, blogs have special technical features that simplify harvesting and analyzing their posts and comments. The most prominent one is the feed publishing format [22]. Feeds present the content of a blog in standardized, XML-based formats (namely RSS and Atom). A feed is an integrated part of the blog system and is always up to date with the blog's content. Feeds ease the machine readability of blogs: they contain all relevant information, like the publishing date, the author, categories, tags, the title, and a short description of a post. Thus, a new kind of application developed, named aggregators. An aggregator requests a user-selected set of feeds and displays the content to the user in a unified, enriched, and compact view. This way, users no longer request a blog directly. Instead, they are provided with the content of their favorite blogs and do not have to actively retrieve it from the WWW.

Concerning the social interaction of blogs, important technical features are blogrolls and linkbacks [10]. A blogroll is noticeably placed on a blog's start page and contains links to other blogs, which are considered followed or friend blogs. Thus, blogrolls form close communities based on mutual linking. Linkbacks are methods a blog author can use to get notified when other authors link to his posts. This enables authors of different blogs to bidirectionally link their discussions. There are three kinds of linkbacks: refback, trackback, and pingback. Refback is not part of the blogging system; instead, it relies on the HTTP protocol and today's browsers.
A refback occurs when a blog reader follows a link and the receiving blog recognizes the HTTP referrer value of the reader's browser. In contrast, trackback and pingback are automated mechanisms of blog systems based on HTTP POST and XML-RPC. The moment blog author A references another blog author B, A's blog system sends a notification to the server of B. The server stores this message, which contains all relevant meta information, like the referencing post's URI and title. Thus, B can display this back reference under his post to lead blog readers to further discussions.

2.2 BlogIntelligence Framework

To exploit the unique features of blogs, the BlogIntelligence project was initiated [10]. This project shows from which perspectives the entirety of weblogs can be analyzed and visualized in order to extract valuable aggregated information. The visualizations and insights are composed into a web portal (http://www.blog-intelligence.com/). To generate the data for this web portal, BI provides a framework consisting of three layers: extraction, analysis, and visualization. An illustration of the complete architecture is shown in Fig. 3.

2.2.1 Extraction

The extraction layer consists of a web harvesting application called a crawler. Web crawlers are computer programs that browse the web in an automatic, methodical manner [23]. They are mainly used to store a copy of each visited page. Search engines and other services analyze and index these pages to provide fast search interfaces. The crawler starts with a fixed set of URIs. After visiting and copying the first pages, the crawler extracts all hyperlinks and continues by visiting the linked pages.

The BI crawler is a tailor-made adaption of Apache Nutch for the special requirements of the blogosphere (see Sec. 2.3). Similar to common crawlers, the BI crawler traverses the link graph of the web to harvest web pages. Two parts of the crawler are adapted to the special needs of harvesting blogs. The first part is the URI selection.
It is responsible for selecting the next set of URIs to crawl from the queued URIs in the joblist (see Fig. 3). It distinguishes between the special types of links present in blogs. These types reflect the position of a link within a blog. Thereby, the crawler prefers links from blogrolls, posts, and comments, and links explicitly marked as feeds.

The second adapted part is the post-processor. This part is responsible for extracting metadata from the downloaded page and attaching it to the persistent data object. The default extraction includes language detection, text content extraction, meta tag extraction, and link extraction. In addition, the BI crawler creates post and comment objects. Content, description, author, publishing date, language, tags, and categories of a post are extracted. The post-processor recognizes the specific blog system, like Blogger, based on hints in the HTML structure. This way, platform-specific information, like trackbacks, is extracted as well. Further, it analyzes the position of links in the content structure of a blog. After the completion of the post-processor, the crawler stores the enriched web page in the persistence layer.

[Figure 3: The BlogIntelligence architecture [10].]

2.2.2 Analysis

The second layer of the framework, the analysis layer, runs while the crawler continuously collects new information. The analysis consists of multiple loosely coupled modules. Each module performs a specific algorithm that delivers data for the third layer of the framework, the visualization layer.
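The link-type preference of the URI selection step described in Sec. 2.2.1 can be sketched as a small priority ordering over queued links. The type names and weights below are illustrative assumptions, not the BI crawler's actual values:

```python
# Sketch of link-type-aware URI selection: links found in blogrolls, posts,
# comments, or explicitly marked as feeds are preferred over generic links.
# The concrete priority values are assumptions for illustration.
PRIORITY = {"feed": 0, "blogroll": 1, "post": 2, "comment": 3, "other": 9}

def select_next(joblist, batch_size):
    """Pick the next URIs to crawl, preferring blog-specific link types."""
    ranked = sorted(joblist, key=lambda job: PRIORITY.get(job["link_type"], 9))
    return [job["uri"] for job in ranked[:batch_size]]

joblist = [
    {"uri": "http://example.org/page", "link_type": "other"},
    {"uri": "http://blog.example.org/feed", "link_type": "feed"},
    {"uri": "http://blog.example.org/post/1", "link_type": "post"},
]
print(select_next(joblist, 2))
# ['http://blog.example.org/feed', 'http://blog.example.org/post/1']
```

A real scheduler would additionally mix in the page scores computed by the updater phase; the point here is only that typed links let blog-relevant URIs jump the queue.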
The current BI prototype includes a ranking, a clustering, and a dimension reduction algorithm. The ranking algorithm is described by Bross et al. [11]. The authors define a complex metric called the BI-Impact score, which combines multiple quality metrics of blogs into one score (see Sec. 3.2). The current prototype implementation runs as an overnight batch job to calculate a new ranking. It only considers the specific link types blog, blogroll, post, and comment. The analysis elements (see Fig. 3), like trend detection, ranking, and community recognition, are prototypically implemented. The details of the ranking are discussed in Sec. 3.2. The integration of the topic consistency rank calculation into this analysis layer is described in Sec. 6.

2.2.3 Visualization

The visualization layer builds on the results of the data analyses. It allows users to browse the preprocessed information of the data analyzers in an unlimited, personalized, and intuitive way. The visualization layer consists of three visualizations that give different insights at different abstraction levels (see Fig. 3). The first visualization is directly integrated into the web portal. It shows the frequently discussed topics (What's up) and trending terms (Trends) of the blogosphere.

[Figure 4: Screenshot of the BlogConnect visualization [24].]

The second visualization, called BlogConnect, is an interactive tool to explore and browse the network of blogs. Essentially, it displays all blogs as bubbles on a 2D canvas (see Fig. 4) [24]. The position of a blog reflects its topical identity; the size of a blog indicates its position in the ranking. Thus, users can orientate themselves in the network and find the most relevant blog for a topic area, called a community. The third visualization, called PostConnect [25], serves as a visualization for blog archives. As shown in Fig.
5, it arranges all posts of a blog in a circle. By activating a post, each topically linked post of the blog archive gets highlighted. Here, a post is topically related if it uses the same categories or tags as the activated post. PostConnect helps users to explore the topical nature of a blog and to identify highly related subsets of posts.

[Figure 5: Screenshot of the PostConnect visualization [25].]

2.3 Apache Nutch - the Crawling Framework

As described by Berger et al. [26], the underlying framework of the BlogIntelligence extraction layer is the open-source web search engine Apache Nutch (http://nutch.apache.org/) [27]. Apache Nutch provides a transparent alternative to private global-scale search services. It comes with an easily extensible and scalable crawler component. Following the MapReduce paradigm [28], Apache Nutch defines four phases for crawling that are executed iteratively: generator, fetcher, parser, and updater [29]. The generator job selects the next URIs to fetch. The fetcher job asynchronously downloads the selected pages. Afterwards, the parser job extracts metadata, links, and the actual text content. In addition, the framework offers an extension point for new parsing algorithms. This functionality is used by the topic detection implementation to integrate the term extraction module (see Sec. 5). Finally, the updater job inserts new links and calculates scores for the parsed web pages. These scores are used to select the next URIs to crawl. Each job works on a large number of pages in parallel.

Apache Nutch is a MapReduce application dedicated to scale-out scenarios, i.e. it runs on a large number of small machines. For example, researchers at Google [28] report using a massive cluster of small machines to crawl the web. Nevertheless, even in scale-up scenarios, i.e. execution on one big machine, MapReduce applications performing as scale-out-in-a-box are more effective than pure scale-up approaches [30].
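One round of the generator, fetcher, parser, and updater phases can be pictured as a sequential toy loop. Real Nutch executes each phase as a distributed MapReduce job; the `fetch_page` callback below is a stand-in for HTTP downloading, and all names are illustrative:

```python
def crawl_iteration(frontier, fetched, fetch_page, top_n=2):
    """One generator -> fetcher -> parser -> updater round, sequentially.
    fetch_page(uri) -> (content, outlinks) stands in for HTTP fetching."""
    # Generator: select the next URIs to fetch.
    batch = [uri for uri in frontier if uri not in fetched][:top_n]
    for uri in batch:
        # Fetcher: download the page (asynchronously in real Nutch).
        content, outlinks = fetch_page(uri)
        # Parser: extract content and links (the extension point where the
        # term extraction module of Sec. 5 would hook in).
        fetched[uri] = content
        # Updater: queue newly discovered links for the next iteration.
        for link in outlinks:
            if link not in fetched and link not in frontier:
                frontier.append(link)
    return frontier, fetched

# Tiny in-memory "web":
web = {"a": ("page a", ["b", "c"]), "b": ("page b", ["c"]), "c": ("page c", [])}
frontier, fetched = ["a"], {}
for _ in range(3):
    frontier, fetched = crawl_iteration(frontier, fetched, lambda u: web[u])
print(sorted(fetched))  # ['a', 'b', 'c']
```

The loop structure is what makes the phases parallelizable: each phase consumes and produces flat lists of records, which is exactly the shape MapReduce jobs operate on.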
This enables us to run the crawler on a large cluster of small machines as well as on a large shared-memory server. In this context, the Hasso Plattner Institute offers a testing platform, the Future SOC Lab (http://www.hpi.uni-potsdam.de/forschung/future_soc_lab.html), which provides researchers access to the latest multi-/many-core hardware. Thus, the crawler implementation currently runs in a scale-up scenario.

2.4 SAP HANA - the Persistence Layer

The persistence layer of the BlogIntelligence framework has a high impact on the performance of the extraction and analysis layers [26]. In addition, the overall target of BI is to provide real-time analytics for the whole blogosphere. Therefore, three different database technologies compete: a row-oriented, disk-based database; a distributed file system; and a column-oriented, in-memory database.

The evaluation considers a traditional row-oriented, disk-based database, namely PostgreSQL (http://www.postgresql.org/). This database makes the data discoverable and easy to query by offering a SQL query API, which eases the implementation of the analysis layer. However, the query performance of PostgreSQL massively decreases with growing data volumes during the extraction phase [26].

An alternative is the distributed file system HDFS (http://hadoop.apache.org/), the original persistence API of Apache Nutch. HDFS is able to handle and process huge amounts of data using commodity hardware [31]. However, it does not provide a query API like SQL. Further, HDFS is not able to take full advantage of today's high-end hardware with massive amounts of memory, because it requires only minimal hardware resources [32]. Since costs for main memory are decreasing and access to data in main memory is extremely fast, it makes sense to store all data mainly in main memory. Thus, an in-memory database, namely SAP HANA (http://www.sap.com/HANA/), is tested.
Although SAP HANA targets enterprise applications, the majority of its analysis capabilities also apply to social media. Because of the effective usage of main memory, the versatile analysis capabilities, and the SQL API, the extraction component was adapted to store all collected data in SAP HANA [33]. To integrate the extraction component with the in-memory database, the persistence layer of Apache Nutch is replaced. The Apache Gora (http://gora.apache.org/) framework already offers an object-relational mapper (ORM) for traditional SQL databases like PostgreSQL. This ORM is adapted to also support the SAP HANA database, because HANA uses a special SQL dialect. Hence, the complete extraction component is currently integrated with SAP HANA.

Caused by the tight coupling of the persistence layer and the analysis layer, this change implies the adaptation of the whole analysis layer. Thereby, most of the algorithms have to be modified for direct integration into SAP HANA. HANA offers various programming interfaces to run analytics directly in memory without transferring the data to the application layer. Besides saving transfer time, the main advantage of the database is its dictionary-encoded, column-oriented in-memory computing, which outruns file-based database solutions [33]. The dictionary encoding saves space and access time for highly redundant tables, like the link or dictionary tables used for the analysis (see Sec. 6). Further, the column orientation performs best on tables with a large number of columns of which only a few are queried. This applies to the main table of BlogIntelligence, called the web page table, which essentially stores all information of a web page, like content, date, and author, in one table. However, HANA is still under development, and the transfer of the analysis algorithms is out of scope for this work.
As a consequence, the clustering needed by the topic consistency rank is outsourced, as described in the following Section.

2.5 Clustering and Apache Mahout

One major foundation of the algorithms in the analysis layer is clustering. This also applies to the topic detection mechanism needed for the topic consistency rank (see Sec. 5). Clustering is an unsupervised classification technique that partitions data items into groups, called clusters, which contain data items that are similar in meaning. Beside density-based clusterings, frequently used clusterings in data mining are distance-based [34, 35].

Essentially, a distance-based clustering works as follows. Each data item has a number of numerical features. The feature vector of a data item is the combination of these features; one can think of the feature vector as a position in an n-dimensional space. The clustering defines a distance metric for the feature vectors. The task of the clustering is to group together feature vectors that have a low distance between each other. Thus, all data items with similar numerical features are grouped together.

The current clustering of BlogIntelligence groups blogs based on the word occurrences in the blogs. Blogs are in one cluster if they contain similar words with a similar frequency and thus are regarded as topically similar. These clusters are visualized in the BlogConnect visualization (see Fig. 4).

The prototypical analysis layer of the BI framework consists of Java implementations of the clustering and other analysis techniques. The logical next step is to integrate it with a well-established framework for information analysis. Such an established framework is Apache Mahout (http://mahout.apache.org/) [36]. Similar to Apache Nutch, it is based on a MapReduce framework. It provides various algorithms for clustering, classification, and collaborative filtering.
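The distance-based grouping described above can be sketched as a minimal k-means-style loop over two-dimensional feature vectors. This is a toy example: the BI clustering operates on high-dimensional word-frequency vectors via Apache Mahout, and the data points here are invented:

```python
def kmeans(points, centroids, iterations=10):
    """Assign points to the nearest centroid by squared Euclidean distance,
    then move each centroid to the mean of its group; repeat."""
    dist = lambda p, c: sum((a - b) ** 2 for a, b in zip(p, c))
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster (keep it if empty).
        centroids = [
            tuple(sum(axis) / len(group) for axis in zip(*group)) if group else c
            for group, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Two obvious groups of "blogs" in a 2-D feature space:
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (5.0, 5.0)])
print([len(group) for group in clusters])  # [3, 2]
```

In the BI setting, each point would be a blog's term-frequency vector and each resulting cluster a topical community, as visualized in BlogConnect.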
In order to provide maximal distribution during the execution, these algorithms are customized for the MapReduce framework. Mahout is primarily built for batch analyses that are able to handle big data. The data has to be present on the distributed file system of Hadoop; hence, the complete data set has to be loaded from the persistence layer. However, the long-term target is to integrate all needed clustering and classification algorithms directly into the persistence layer to avoid high transfer costs. Although first clustering algorithms for HANA are under development, this integration is out of scope of this thesis.

3 Related Work

The related work can be divided into three categories of ranking approaches. The first category consists of general rankings that assess web pages and other documents. The second category includes blog-specific rankings that are specialized for blogs and other social media channels. The last category comprises consistency-related rankings that incorporate the topic consistency of a document or blog into the ranking.

3.1 General Rankings

PageRank is one of the most frequently used algorithms, e.g. by Google [37], for ranking traditional web pages based on the web link graph. It was introduced by Page et al. [13] and is based on the random surfer model. A web page's PageRank is defined as the probability of a random surfer visiting this web page. The random surfer traverses the web by repeatedly choosing between two options: clicking on a random link on the current page or jumping to a random web page. The second option is necessary to ensure that the random surfer also visits pages that have no incoming links and that it is possible to escape from pages that have no outgoing links. The PageRank of a page is calculated by the following equation.
PR(p_i) = \frac{1 - d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}

The probability of clicking on a random link is determined by the damping factor d, and N is the total number of pages. A page p_j is in M(p_i) if p_j has a link to p_i. L(p_j) gives the number of outgoing links of p_j, and PR(p_j) is the previous PageRank of p_j. The PageRank algorithm is iterative and converges after a number of iterations that depends on the implementation used. A very similar algorithm to PageRank is TrustRank [38]. In contrast to PageRank, TrustRank is initialized with a fixed set of trustworthy or untrustworthy web pages. The trust propagates through the web graph analogously to the PageRank algorithm. Another approach is the Hyperlink-Induced Topic Search (HITS) algorithm by Kleinberg [39]. It is based on the concept of hubs and authorities. In the traditional view of the web, hubs are link directories and archives that only refer to information authorities, which actually offer valuable information. The HITS algorithm operates on a subgraph of the web that is related to a specific input query. Each page gets an authority score and a hub score. The authority score is increased based on the hub scores of linking web pages and vice versa. These traditional ranking algorithms are all based on the web link graph. However, traditional web pages show a different linking behavior than blogs. Blogs offer different types of links, e.g. trackbacks or blogroll links, with different semantics (see Sec. 2.1). Furthermore, the blog link graph tends to be rather sparse in comparison to the overall web [40].

3.2 Blog-Specific Rankings

To address the special characteristics of blogs, blog ranking engines and current research introduce tailor-made ranking algorithms for the blogosphere [11]. The most popular platforms ranking the blogosphere are Technorati14 and Spinn3r15 [11]. Other services like BlogPulse, PostRank, or BlogScoop went offline during the last year and got integrated into commercial products.
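The PageRank equation from Sec. 3.1 can be sketched as a simple fixed-point iteration. This is an illustrative sketch on a made-up three-page link graph, not the implementation used by Google or by the BI framework.

```python
# Illustrative sketch of the PageRank iteration: each page distributes
# its current rank over its L(p) outgoing links, damped by d, plus a
# uniform random-jump term (1 - d) / N.

def pagerank(links, d=0.85, iterations=100):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}               # uniform start vector
    for _ in range(iterations):
        nxt = {p: (1 - d) / n for p in pages}      # random-jump term (1-d)/N
        for p, outgoing in links.items():
            for q in outgoing:                     # spread PR(p) over L(p) links
                nxt[q] += d * pr[p] / len(outgoing)
        pr = nxt
    return pr

ranks = pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]})
```

Because every page in this toy graph has outgoing links, the rank mass stays a probability distribution, matching the random-surfer interpretation given above.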
Thus, the free services of Technorati and Spinn3r are described. Technorati established the authority score as its unique ranking. It is calculated based on a blog's linking behavior, categorization, and other associated data over a small period of time [41]. Furthermore, Technorati also calculates its authority score for topical segments of the blogosphere to identify topic-specific opinion leaders. Although Spinn3r is well known for its crawling service, it also provides a simple PageRank and a Social Media Rank. The Social Media Rank is an adaption of the TrustRank algorithm. It incorporates social networks as incoming link providers and uses a fixed number of initially trusted users to prevent spam. Besides these platform-specific rankings, current research also discusses blog-specific ranking approaches. A ranking score, called BlogRank, is introduced by Kritikopoulos et al. [42]. It is a modified version of the PageRank algorithm. The BlogRank score is based on the link graph and different similarity characteristics of weblogs. The authors create an enriched graph of inter-connected weblogs with additional edges and weights representing the specific features of blogs. Mainly, these features are shared authorship and topics. For example, the authors create a pseudo link between two posts that share the same topic, which is identified by category annotations. Bross et al. [11] propose the BlogIntelligence-Impact-Score (BI-Impact) ranking, a more complete approach to successfully rank blogs. Their definition is the basis for the currently implemented scoring algorithm in the BlogIntelligence framework.

14 http://technorati.com/
15 http://spinn3r.com/

Figure 6: Ranking variables of the BI-Impact score [11].

Similar to the above mentioned rankings, they give special weightings to the special link types of the blogosphere. In contrast to BlogRank, their algorithm does not create new links between blogs.
It rather weights the different interaction types of blog authors like links to comments, posts, and to the start page of a blog. Like Spinn3r, they also consider links from outside the blogosphere, such as from Twitter16 and news portals. All used ranking variables are shown in Fig. 6. They distinguish between a post and a blog ranking. The post ranking incorporates the different kinds of links between posts like linkbacks, tweets, and normal links. Further, the content of a post gets rated. In contrast to consistency-related rankings, the authors do not incorporate the topics of a post. Instead, the authors focus on the detection of spam keywords and trend keywords. Trend keywords are terms extracted by a hot topic analyzer, which is also part of the BI framework. The blog ranking combines the ranking of all posts with blog-specific characteristics. Among others, these are the publishing frequency and the blogroll links of a blog. All these variables are combined into one score for a blog and propagated through a PageRank-like algorithm to all linked blogs. The work presented in this thesis introduces a new score that complements the BI-Impact score to foster the retrieval of topically consistent blogs for hot topics. Thereby, users of the BI framework are able to find niche blogs that discuss trending and interesting topics.

3.3 Consistency-Related Rankings

Consistency-related rankings are blog rankings that incorporate the topical consistency of a blog. This topical consistency adds to other factors to form one rank for each blog. A trend detection system, called Social Media Miner, is presented by Schirru et al. [43]. This system extracts topics and the corresponding most relevant posts. The topics are detected using a clustering on word importance vectors (see Sec. 2.5). Their approach is rather simple and does not directly reflect a consistency.

16 http://twitter.com/
They cluster topics for a given period, find relevant terms (or labels), and visualize the term mentions over time as a trend graph. Nevertheless, posts that consistently handle a specific topic have a constant term frequency of topic terms. Thus, topically consistent blogs get a good trend graph, at least for trending topics. Sriphaew et al. [44] discuss how to find blogs that have great content and are worth exploring. They show how to identify these blogs, called cool blogs, based on three assumptions: cool blogs tend to have definite topics, enough posts, and a certain level of consistency among their posts. The level of consistency, called topical consistency, tries to measure whether a blog author focuses on a solid interest. Thus, it favors blogs with stable topics like reviews of mobile devices. The authors measure the consistency based on the similarity of topic probabilities of preceding posts. Eleven indicators of credibility to improve the effectiveness of topical blog retrieval are introduced by Weerkamp et al. [15]. Besides some syntactic indicators, they also present the timeliness of posts and the consistency of blogs. The timeliness of a post is defined as the temporal distance of a blog post to a news portal post on the same topic. Their topical consistency represents the blog's topical fluctuation. The authors define the consistency as a tf*idf-like score over all terms of a blog. Although this measure favors blogs that frequently use rare terms, it does not reflect when a blog author changes the topic from one post to another. In contrast to other related research, the authors do not use the natural ordering of posts. Nevertheless, the authors show that their indicators improve the topical blog retrieval significantly. The detection of spam blogs (splogs) is a frequently discussed topic in ongoing research [45, 46, 47]. However, Liuwei et al.
[48] describe a spam blog filtering technique that also incorporates the writing consistency of a blog author. Similar to Weerkamp et al., the consistency on topic level is defined as the average topical similarity of posts. Each post gets compared with its preceding post. The topical similarity is defined as the distance of the posts' tf*idf word vectors. Thereby, blogs with an extremely high topical consistency are expected to be auto-generated. They integrate their topic consistency into a blog filtering system. Another approach for ranking blogs is introduced by He et al. [49]. They define a coherence score to measure the topical consistency of a blog. The authors define a consistent blog as a blog that contains many coherent posts. A post is coherent to another post if both posts are in the same cluster of the whole collection. The authors integrate the coherence score into a blog ranking to boost topically relevant and topically consistent blogs. Chen et al. [50] present a blog-specific filtering system that measures topic concentration and variation. They assess the quality of blogs via two main aspects: content depth and breadth. In essence, the authors present a score that contains five criteria. Each criterion is based on an external topic model derived from Wikipedia17 articles. For example, the completeness of a blog is defined as the ratio of words used in a blog in comparison to all words assigned to a topic. Further, the authors define the topical consistency of a blog as the mean distance of the topics used in a post. A blog is consistent if it only handles closely related topics. The ordering of posts, which can indicate a topic shift of the author, is not considered. In contrast to related work, the topic consistency rank presented in this thesis calculates the consistency of a blog based on multiple aspects.
Thereby, it measures the topical consistency at four different granularities and thus offers a differentiated view on a blog's consistency. Further, during the calculation of the score, topics are not considered as probability distributions over words. Instead, a topic is defined as a fixed set of words derived from a prior word clustering, an approach also used by Sriphaew et al. [44].

17 http://www.wikipedia.org/

4 Definition of the Topic Consistency Metric

To evaluate the topical consistency of a blog author, four different facets of consistency are defined. First, the consistency between posts defines the inter-post consistency. It investigates whether the contents of the latest posts discuss closely related topics. Next, the internal consistency of a post, called intra-post consistency, is a measure that considers to which extent all paragraphs of a post discuss a similar topic. In contrast to the inter-post consistency, the intra-blog consistency compares the topic space created by each post with the topic space created by the tags and categories of this post. Therefore, it is a measure for the quality of the blog's classification system. The inter-blog consistency measures whether a blog is part of a domain expert community. Hereby, the rank of a blog is increased if blogs handling a similar topic link to it. In addition, a blog is boosted if it links to topically related blogs. Finally, all four facets are combined into the topic consistency rank.

4.1 Consistency between Posts (Inter-Post)

As a first step, the inter-post consistency is formally defined. The inter-post consistency compares the topical distance of succeeding posts. Each post is represented as a topic vector. Each component of this topic vector gives the probability of a post talking about one topic. The sum of all vector components is one, as usual for a probability distribution. Fig. 7 shows the assignment of ten example posts to ten topics. Each column symbolizes the topic vector of a post.
The size of a bubble indicates the probability of a post p to be about topic t. The transient nature of the blogosphere motivates us to only consider the latest posts, i.e. those that lie outside the outdated post area. There are two approaches to define outdated posts: excluding all posts exceeding a specific time span, or including only a specific number of latest posts. The latter solution punishes blogs that frequently publish new content by shrinking the observed time window to a day's work. The time span variant is beneficial for small blogs because only a small part of the content is considered. However, the time span variant is applied because it is assumed to fit the user's perception.

Figure 7: Visualization of post-topic-probabilities (topic ID over post number; bubble sizes encode topic probabilities, and the oldest posts fall into the outdated post area).

Sriphaew et al. [44] calculate the average difference of the topic vectors of posts with the blog's topic centroid. This favors blogs with a central interest, but does not consider the change of a blog's topic over time. As shown in Fig. 7, blogs can have low distances and high distances between posts. Thus, the average difference of the topic vectors of two successive posts serves as indicator for topic consistency. In the following, the formal definition of the inter-post consistency is shown. Before defining the metric, the sets and functions used for the calculation have to be defined. The set Blog contains all blogs of the used data set. Post is a set that contains all posts. The set Post_b with b \in Blog contains all posts of blog b. The function publishedDate(p) with p \in Post returns the publishing time and date of a post. LatestPosts_{b,d} with b \in Blog and d \in Date being a point in time is a set defined in Eq. 1.

LatestPosts_{b,d} = \{ p \in Post_b \mid publishedDate(p) \geq d \}   (1)

Term is the set of all terms.
The set Topic contains all topics discussed in the considered subset of the blogosphere. Similarly to Eguchi et al. [51], the set TT_{tp} \subset Term is defined as the set of all terms of a topic tp \in Topic. All TT_{tp} are pairwise disjoint.

\forall tp \in Topic \; \forall j \in Topic : tp \neq j \Rightarrow TT_{tp} \cap TT_j = \emptyset   (2)

PT_p \subset Term is the set of all terms used in a post p \in Post. The function Prob(p, tp) with p \in Post and tp \in Topic gives the probability of the post p being about the topic tp.

Prob(p, tp) = \frac{\sum_{t \in TT_{tp} \cap PT_p} tf{*}idf(t, p)}{\sum_{t \in PT_p} tf{*}idf(t, p)}   (3)

Salton et al. [52] give an overview of the components of the tf*idf function and its variants. Essentially, it is the product of a term frequency component tf and a collection frequency component idf.

tf{*}idf(t, p) = tf(t, p) \times idf(t, Post)   (4)

tf is the raw term frequency (the number of times a term occurs in a post). idf is the inverse document frequency. Post_t with t \in Term is the set of all posts in which a term is contained.

idf(t, Post) = \log \frac{|Post|}{|Post_t|}   (5)

The function topicalDistance(p_i, p_j) with p_i, p_j \in Post is defined as the Euclidean distance between the topic vectors of both posts (see Eq. 6). The Euclidean distance is a frequently used distance metric and has proven to apply best for text vector comparison [44].

topicalDistance(p_i, p_j) = \sqrt{\sum_{tp \in Topic} (Prob(p_i, tp) - Prob(p_j, tp))^2}   (6)

The function predecessor(p) \in Post returns the direct predecessor of p \in Post. Given these definitions, the inter-post distance is formalized as shown in Eq. 7 with b \in Blog and d \in Date.

interPostDistance(b, d) = \frac{\sum_{p \in LatestPosts_{b,d}} topicalDistance(p, predecessor(p))}{|LatestPosts_{b,d}|}   (7)

interPostDistance(b, d) is the average topical distance of two succeeding posts among the latest posts of a blog. It returns high values for very inconsistent blogs and low values for very consistent blogs.
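The definitions in Eqs. 3, 6, and 7 can be sketched in a few lines. All topic term sets and tf*idf values below are illustrative toy data, not values from the thesis' data set.

```python
# Hedged sketch of Eqs. 3, 6, and 7: topic probabilities from the
# tf*idf mass per topic term set, the Euclidean topical distance,
# and the mean distance of succeeding posts.

import math

def topic_vector(tfidf, topic_terms):
    """Eq. 3: tfidf maps term -> tf*idf value for one post."""
    total = sum(tfidf.values())
    return {tp: sum(w for t, w in tfidf.items() if t in terms) / total
            for tp, terms in topic_terms.items()}

def topical_distance(v1, v2):
    """Eq. 6: Euclidean distance of two topic vectors."""
    return math.sqrt(sum((v1[tp] - v2[tp]) ** 2 for tp in v1))

def inter_post_distance(posts):
    """Eq. 7: posts are topic vectors in chronological order."""
    pairs = list(zip(posts, posts[1:]))
    return sum(topical_distance(prev, cur) for prev, cur in pairs) / len(pairs)

topics = {"law": {"privacy", "copyright"}, "tech": {"linux", "kernel"}}
p1 = topic_vector({"privacy": 3.0, "copyright": 1.0}, topics)
p2 = topic_vector({"linux": 2.0, "kernel": 2.0}, topics)
p3 = topic_vector({"privacy": 2.0, "linux": 2.0}, topics)
```

Inverting such a mean distance, as done for the consistency scores, rewards blogs whose succeeding posts stay close in topic space.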
To give consistent blogs a high inter-post consistency score, the score is defined as the inverse of interPostDistance(b, d), as shown in Eq. 8.

interPostConsistency(b, d) = \frac{1}{interPostDistance(b, d)}   (8)

4.2 Internal Consistency of Posts (Intra-Post)

The intra-post consistency focuses on the inner consistency of one post. It is high if a blog author focuses on one single topic and does not change the subject while writing one single post. Thus, it favors self-contained and complete posts that do not cover several topics. A consistent post should handle just a few topics, but discuss them in more detail. The intra-post consistency is very similar to the inter-post consistency except that it operates on the sections of posts. Each post is subdivided into sections by splitting the post's content at each occurrence of more than one line break or an HTML separator. Each section gets assigned one topic vector. The components of this topic vector represent the probability that a section is about a specific topic. Two additional concepts need to be defined before formalizing the intra-post consistency. Firstly, Section is the set of all sections in the data set and Section_p \subset Section is the set of all sections of one specific post p \in Post. Secondly, predecessor(s) with s \in Section is the function that returns the preceding section of a section s. Further, the function topicalDistance(s_i, s_j) with s_i, s_j \in Section is defined in the same manner as Eq. 6.

intraPostDistance(p) = \frac{\sum_{s \in Section_p} topicalDistance(s, predecessor(s))}{|Section_p|}   (9)

The intra-post distance is also defined for a whole blog. It is the mean of all distance values of the latest posts.

intraPostDistance(b, d) = \frac{\sum_{p \in LatestPosts_{b,d}} intraPostDistance(p)}{|LatestPosts_{b,d}|}   (10)

Thereby, intraPostConsistency(b, d) is defined as the inverse intra-post distance to provide consistent blogs with a high score (see Eq. 11).
intraPostConsistency(b, d) = \frac{1}{intraPostDistance(b, d)}   (11)

4.3 Consistency between Posts and Classification (Intra-Blog)

The intra-blog consistency serves as a measure for the quality of a blog's classification. It evaluates to which extent the content of posts is consistent with the tags and categories that form the classification system of a blog. As discussed in Sec. 2.1, tags and categories are very important for the orientation of a user and the navigation through the blog. It is crucial that blog authors choose tags and categories wisely and appropriately to their content. In addition, spam blogs tend to overuse tags and categories to earn a higher rank in blog search engines for a high number of keywords. These low-quality blogs and spam blogs get a very low intra-blog consistency score. For a high consistency, tags and categories should span an equal topic distribution as the overall content of a blog. The intra-blog consistency is based on the distance between the topic vector of each post and the topic vector of the post's classification. Before defining the intra-blog consistency, additional concepts have to be formally defined. Tag is the set of all tags and Category is the set of all categories in the data set. Further, Tag_p and Category_p with p \in Post are the sets of tags and categories of one post. The set Classification_p is defined as the union of the categories and tags of one post p.

Classification_p = Tag_p \cup Category_p   (12)

Given the classification of each post, Classification_p, and the set of all posts in a blog, Post_b, the intra-blog distance is defined as the average topical distance between each post and its classification (see Eq. 13).

intraBlogDistance(b) = \frac{\sum_{p \in Post_b} topicalDistance(Classification_p, p)}{|Post_b|}   (13)

Finally, intraBlogConsistency(b) is defined as shown in Eq. 14.
intraBlogConsistency(b) = \frac{1}{intraBlogDistance(b)}   (14)

A low value of intraBlogConsistency(b) indicates a mismatch between the classification and the actual content. Thus, the quality of the blog is questionable and it is assigned a lower rank.

4.4 Consistency of Linking and Linked Blogs (Inter-Blog)

Finally, the inter-blog consistency serves as a context-based consistency metric. It measures the consistency between the blog's content and the content of linking and linked blogs. Thus, it measures whether a blog is part of an expert community. An expert community is a set of blogs that focus on one topic and discuss this topic interactively. For example, during the Arab Spring one single blog started the discussion and other blogs built an active discussion around this initial blog [5]. Among other motivations, the followers of blogs have two goals: First, they like to spread the word of the referenced blog author to widen the reach of the message. Second, referencing blog authors want to discuss the message and get into an active discourse with the referenced blog author. Those discourses are the essence of the blogosphere. Similar to Wikipedia, blog authors increase the information quality by evaluating and iterating each other's posts. As already discussed for the BI-Impact score, blogs have a set of special link types, but only a few of them are actual interaction links rather than friendly links or advertisements (see Sec. 2.1). Blogroll links and links that are not located in posts or comments have no evaluating or commenting nature. In contrast, if a blog author links from a post directly to a post of another blog author, he indicates a reply or a similar reaction like a reference. Further, comment authors can also link to other posts; this is formally regarded as a linkback. Linkbacks are also indicators for an active discourse between two blogs. These links, linkbacks and links from posts, are interaction links.
The inter-blog consistency defines the consistency between a blog and the blogs that link to it or are linked by it via an interaction link. The post-linking-post relation (PLP) contains the tuple (p_i, p_j) with p_i, p_j \in Post if p_i has an interaction link to p_j. The set IP_{p_i} of incoming posts with p_i \in Post is defined as follows:

IP_{p_i} = \{ p_j \mid p_j \in Post \land (p_j, p_i) \in PLP \}   (15)

In parallel, the set OP_{p_i} of outgoing posts with p_i \in Post is defined.

OP_{p_i} = \{ p_j \mid p_j \in Post \land (p_i, p_j) \in PLP \}   (16)

Incoming links cannot be controlled by the blog author. Hence, two constants α and β introduce a weighting for incoming and outgoing posts. The function postContextDistance(p) with p \in Post is defined as the weighted sum of the average distance to all incoming and the average distance to all outgoing posts (see Eq. 17).

postContextDistance(p) = \alpha \cdot \frac{\sum_{j \in IP_p} topicalDistance(p, j)}{|IP_p|} + \beta \cdot \frac{\sum_{j \in OP_p} topicalDistance(p, j)}{|OP_p|}   (17)

A typical weighting is α = 0.6 and β = 0.4 to slightly emphasize incoming links for their unbiased nature. interBlogDistance(b, d) with b \in Blog and d \in Date is defined in Eq. 18. The inter-blog distance calculation considers only the latest posts due to the transient nature of the blogosphere.

interBlogDistance(b, d) = \frac{\sum_{p \in LatestPosts_{b,d}} postContextDistance(p)}{|LatestPosts_{b,d}|}   (18)

Analogously to the other three aspects, interBlogConsistency(b, d) is defined as the inverse of interBlogDistance(b, d) (see Eq. 19).

interBlogConsistency(b, d) = \frac{1}{interBlogDistance(b, d)}   (19)

4.5 Combined Topic Consistency Rank

Finally, the topic consistency rank is defined as the combination of all four facets. All facets are combined by calculating a weighted sum for each blog. topicConsistency(b, d) with b \in Blog and d \in Date is defined in Eq. 20. The four constants χ, δ, ε, and γ give a weighting for each component of the topic consistency rank.
topicConsistency(b, d) = \chi \cdot interPostConsistency(b, d) + \delta \cdot intraPostConsistency(b, d) + \epsilon \cdot intraBlogConsistency(b) + \gamma \cdot interBlogConsistency(b, d)   (20)

The weighting can be varied according to the characteristics of the analyzed data set. Because of the low usage of categories and tags in the BlogIntelligence data set and the high usage of content summaries in the posts' content, the weights used in this thesis are: χ = 0.3, δ = 0.2, ε = 0.2, γ = 0.3. The final topic consistency rank is calculated by normalizing the results of the topicConsistency function over all considered blogs. Through this normalization the values lie in the interval [0, 1], which is a common approach for rank normalizations [53].

5 Implementation of Topic Detection

As mentioned in Sec. 4.1, all topic consistency metrics depend on topic term sets. To find topics and assign terms to topic term sets, the topic detection procedure shown in Fig. 8 is implemented.

Figure 8: Flow diagram of the topic detection (BlogIntelligence crawler: 1. download post, 2. parse content, 3. extract terms; SAP HANA database: 4. calculate tf*idf, 5. build word vectors; Apache Mahout analyzer: 6. run k-means, 7. write word clusters).

5.1 Prerequisites

There are several steps necessary before running the actual clustering algorithm, which creates the topic term sets. The preprocessing covers steps 1-5 of the topic detection flow (see Fig. 8).

Step 1. First of all, the BI crawler harvests the blogosphere. It stores all data of blogs into the SAP HANA database. The crawler traverses the blog link graph and downloads every blog post. Immediately after downloading, the crawler parses the downloaded HTML files (see Fig. 8).

Step 2. The parsing includes the removal of non-textual content like images and videos. Further, it removes markup like HTML tags.
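The markup removal of Step 2 can be sketched as a regex-based tag stripper. This is an illustrative stand-in; the actual crawler uses a proper HTML parser rather than regular expressions.

```python
# Illustrative sketch of the markup removal in Step 2: strip HTML tags
# (and with them embedded images/videos) and collapse whitespace to
# obtain the pure text content of a post.

import re

def strip_markup(html):
    text = re.sub(r"<[^>]+>", " ", html)      # drop all tags
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

text = strip_markup("<p>Hello <b>blogosphere</b>!</p><img src='x.png'/>")
```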
After parsing a web page, the crawler stores the pure text content as a character large object (CLOB) in the database.

Step 3. The Nutch crawling cycle is extended by a new component that performs a word extraction on the text of posts. During this extraction, the crawler first segments the text into words. This is done by splitting on non-word characters. Afterwards, the extraction component removes all stop words from the word set. Stop words are the most common words of a language, such as the, is, at, and on. It uses the stop word lists from the Weka18 project. Weka is a collection of machine learning algorithms for data mining tasks. The word set is still redundant: it contains inflected or derived words. Thus, a stemming of words is applied to reduce the words to their stem form. The extraction component incorporates the stemmers of the Weka framework, which provides stemmer classes for various languages like German. The preprocessing of the crawler assigns to each post its set of word stems. This set of words is stored in a separate table in the database, called the dictionary table. The word extraction process is actually a common feature among text search engines like Apache Lucene19. Although SAP HANA already contains a word count matrix, which corresponds to the dictionary table for the topic detection, this matrix is not accessible via an application programming interface (API). In contrast, the next two steps are directly performed in the database.

Step 4. An SQL procedure calculates the tf*idf values for each word. SQL procedures have the advantage that they can directly access the data in memory without transferring it for processing. The implementation follows Eq. 4.

18 http://www.cs.waikato.ac.nz/ml/weka/
19 http://lucene.apache.org

Step 5. Further, the database is used to create the word vectors for each post and the post vectors for each word. The latter are used for the clustering of words that finally produces the desired topics.
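The word extraction of Step 3 can be sketched roughly as follows. A trivial suffix stripper stands in for the Weka stemmers used by the crawler, and the stop word list is abbreviated to a handful of entries.

```python
# Illustrative sketch of Step 3: split on non-word characters, drop
# stop words, and reduce the remaining words to a crude stem.

import re

STOP_WORDS = {"the", "is", "at", "on", "a", "and"}  # abbreviated toy list

def crude_stem(word):
    for suffix in ("ing", "ed", "s"):        # naive stand-in for a real stemmer
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def extract_terms(text):
    words = re.split(r"\W+", text.lower())
    return {crude_stem(w) for w in words if w and w not in STOP_WORDS}

terms = extract_terms("The blogger is posting reviews on mobile devices.")
```

The resulting stem set per post is what the dictionary table described above stores.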
The vectors are computed by an SQL view that directly refers to the basic web page table and the result table of the tf*idf calculation. An example result of the view is shown in Tab. 1.

post id | word id | tf*idf
p4      | w5      | tfidf_{4,5}
p7      | w8      | tfidf_{7,8}
p5      | w5      | tfidf_{5,5}
p8      | w8      | tfidf_{8,8}
...     | ...     | ...

Table 1: Example tf*idf vectors resulting from the SQL view.

With step 5 the preprocessing is completed and all vectors can be loaded into the HDFS file system of Mahout. This is implemented by a tailor-made class for the BlogIntelligence analytics. It uses the adapted object-relational mapper (ORM), Apache Gora, to access the tf*idf vector view of HANA and transfer all vectors to the HDFS file system. These vectors are the word vectors with posts as dimensions. Two example vectors are shown in Tab. 2. Mahout uses a sparse vector implementation. Sparse vectors are specially designed for document-word vectors that are only sparsely filled. Sparsely filled means that most of the vector components are zero because words only appear in a small set of documents compared to the overall collection.

5.2 Clustering

The two last steps are executed by the adapted Mahout framework (see Sec. 2.5). Mahout offers various clustering algorithms like mean shift clustering, spectral clustering, latent Dirichlet allocation, and k-means clustering [36].

   | w5          | w8
p4 | tfidf_{4,5} | 0
p5 | tfidf_{5,5} | 0
p6 | 0           | 0
p7 | 0           | tfidf_{7,8}
...| ...         | ...

Table 2: Sparse word vectors from HDFS.

The current implementation of SAP HANA does not support a clustering applicable for the high number of dimensions created by the word-blog vectors. The total post-word matrix size in L20 is limited to the maximum integer value. This value is too small for the approximately 1,000,000 by 500,000 word-post matrix. Thus, the L API is not applicable for the clustering task. Another alternative is the R21 integration of HANA.
R is a programming language and software environment with a special focus on statistical calculation. Besides clustering algorithms, R supports various techniques like time-series analysis and statistical tests. The roadblock for R is also the massive amount of data. The database needs to transfer all vectors to an external R component. This process also fails due to the high transfer costs. To sum up, until the integration of advanced text analysis algorithms into HANA is completed, the external analysis framework Apache Mahout is used. As discussed in Sec. 4.1, the topic consistency rank relies on a 1:n relation between words and topics. This approach simplifies the prototypical implementation, because it does not require a complex clustering technique based on probability distributions. Advanced, more complex clustering techniques are subject to further research (see Sec. 8).

20 http://wiki.tcl.tk/17068
21 http://www.r-project.org/

Step 6. k-means is a well-known algorithm for clustering objects that creates pairwise distinct clusters. All objects need to be represented as a numerical feature vector. In this case, these objects are the words that are grouped into topic term sets. The components of the feature vector are the tf*idf values of these words in each crawled post. The k in k-means is the user-defined number of clusters, which is also an input for the algorithm. The feature vector represents a vector in an n-dimensional space with n being the number of posts. The algorithm operates as illustrated in Fig. 9. k-means randomly chooses k points in the n-dimensional space that serve as initial centers of the clusters, called centroids (see Fig. 9 A). In the next phase each word is assigned to the closest centroid. The closest centroid is the centroid with the minimal distance to the feature vector of the word (see Fig. 9 B). One can apply various distance measures depending on the data set to be clustered. As discussed in Sec.
4.1, the established Euclidean distance serves as distance measure. After assigning the words to centroids, each cluster gets a new centroid. These centroids are calculated by averaging the feature vectors of all words assigned to one cluster (see Fig. 9 C). This process of assigning words and computing new centroids is repeated until the algorithm converges. Convergence is reached when the centroid movement falls below a predefined threshold.

Figure 9: An example iteration of k-means (∆ - centroids; x - points). A) Random centroids. B) Assign clusters. C) Compute new centroids.

Mahout's version of k-means is implemented by the KMeansDriver class. Esteves et al. [54] describe the performance of this implementation. They highlight that the Mahout implementation scales with increasing data set size and increasing number of computing nodes. After each iteration, the KMeansDriver stores the new centroids in the HDFS. After the completion of all iterations, Mahout runs an extra job that writes the clustered points, i.e. the word-to-topic assignment, to the file system.

Step 7. This assignment is readable by the cluster writer module of Mahout. An additional class, called HANAClusterWriter, is implemented. This class transfers the clustered points to the HANA database. It is not a MapReduce job because it only sequentially transfers the data from the HDFS to the database.

word id | cluster id
4       | 1
8       | 1
2       | 3
...     | ...

Table 3: Resulting cluster table.

An example of the resulting table is shown in Tab. 3. The choice of the feature vector is crucial for the meaning of the clustering results. By selecting the tf*idf values in each post for each word, words that frequently appear in the same posts are grouped together. Thus, words with a similar meaning are assigned to the same cluster [10]. These word groups are the topic term sets used for the calculation of the topical distance.
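The k-means loop of Step 6 can be sketched as follows. This is a hedged, single-machine sketch of the algorithm, not Mahout's distributed KMeansDriver; toy 2-D points stand in for the high-dimensional word vectors.

```python
# Sketch of the k-means loop: assign every vector to its closest
# centroid, recompute each centroid as the mean of its cluster, and
# stop once the centroid movement falls below a threshold.

import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(vectors):
    return tuple(sum(component) / len(vectors) for component in zip(*vectors))

def kmeans(points, k, threshold=1e-9, seed=0):
    centroids = random.Random(seed).sample(points, k)  # A) random centroids
    while True:
        clusters = [[] for _ in range(k)]
        for p in points:                               # B) assign clusters
            best = min(range(k), key=lambda i: euclidean(p, centroids[i]))
            clusters[best].append(p)
        new = [mean(c) if c else centroids[i]          # C) new centroids
               for i, c in enumerate(clusters)]
        if max(euclidean(a, b) for a, b in zip(centroids, new)) < threshold:
            return new, clusters
        centroids = new

centroids, clusters = kmeans([(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)], k=2)
```

The three phases of the loop correspond directly to panels A, B, and C of Fig. 9.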
The granularity of the topics depends on the user-defined number of clusters k. As proposed by Abe et al. [55], the aim is to find clusters with around 100 words per cluster. In the evaluation (see Sec. 7), different settings for k and the number of iterations are tested to achieve an average cluster size of 100.

6 Implementation of the Topic-Consistency Rank

This section presents the details of the implementation of the topic-consistency rank. The rank is completely integrated into the database and relies only on basic SQL constructs. The theoretical foundations of the underlying partial scores have already been discussed in Sec. 4. Each score implementation consists of a combination of SQL views as well as permanent and temporary tables. The combined score of a blog is the weighted sum of the single scores (see Sec. 4.5).

6.1 Intra-Post Consistency

To calculate the intra-post consistency, an additional tf*idf calculation view based on paragraphs is implemented. Analogous to the normal tf*idf view (see Sec. 5.1), this view is based on the dictionary tables. The dictionary tables are the result of the word extraction phase of the topic detection. An example dictionary table is shown in Tab. 4. For each word of a post, a row is created that contains the word, the post id, and the word position.

word    post      position
hello   postid1          0
world   postid1          1
...     ...            ...

Table 4: The dictionary table maps words to the containing posts and positions.

To create tf*idf values based on paragraphs, all words within a specific window are regarded as one paragraph. The size of this window is set to 100, based on the average length of a paragraph, which is 100-150 words [56]. The calculation is a direct implementation of the formal definition (see Sec. 4.2). It creates a join between all succeeding sections. The result of this join are the tf*idf values for each section and each occurring word. Afterwards, these tf*idf values are joined with the cluster table.
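The windowed segmentation just described can be sketched as follows. Note that the actual implementation is a SQL view over the dictionary tables; the simple tf*idf variant used here (raw term frequency times log inverse document frequency over windows) is a common textbook choice and an assumption, not necessarily the exact formula of Sec. 4.2.

```python
import math
from collections import Counter

def window_tfidf(post_words, window=100):
    """Split a post into fixed-size word windows ("paragraphs") and
    compute a simple tf*idf value per window and word."""
    sections = [post_words[i:i + window] for i in range(0, len(post_words), window)]
    n = len(sections)
    # document frequency: in how many windows does each word occur?
    df = Counter(w for sec in sections for w in set(sec))
    scores = []
    for sec in sections:
        tf = Counter(sec)
        # words occurring in every window receive a weight of zero
        scores.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return scores
```

A word that appears in only one window gets a positive weight there, while a word spread over all windows is weighted zero, which is exactly what makes the per-section topic vectors distinguishable.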
The score for each cluster is calculated by summing up the tf*idf values per cluster. Afterwards, the topical differences of the sections are calculated by joining the sections of each post on the topic cluster. The topical distance of two sections is the square root of the sum of the squared differences per cluster. The intra-post distance on post level is the average of the section distances. Based on the post-level distance, the blog-level distance is calculated by averaging the intra-post distance values of all posts. Finally, the intra-post score is computed by inverting the intra-post distance. To sum up, the intra-post score calculation is a combination of nine joins and four aggregations in the database. The mapping from ids to words and URIs and vice versa introduces the most complexity to this operation. Furthermore, the intra-post rank is the most detailed rank with respect to the size of the tf*idf view results.

6.2 Inter-Post Consistency

The inter-post consistency builds upon the tf*idf view based on posts, called post-tf*idf, which is also used by the topic clustering (see Sec. 5.1). Posts are objects in the database and thus do not require an additional segmentation. To obtain succeeding posts, each post is joined with the post that has the minimal following publishing date, i.e. its direct successor. After this join, the topic vector differences of each post and its successor can be computed. By grouping per post, the Euclidean distances between all succeeding posts are calculated. Afterwards, averaging all distances yields the inter-post distance and thus the inter-post consistency score of a blog. This operation is very similar to the intra-post consistency calculation, except that it is based on the latest posts. The selection of the latest posts is implemented as a simple where-condition on the post publishing date.
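The successor join and distance averaging of the inter-post consistency can be sketched as follows. The inversion 1/(1 + d) used to turn the distance into a score is an illustrative choice of this sketch; the thesis itself only states that the distance is inverted.

```python
def inter_post_consistency(posts):
    """posts: list of (publish_date, topic_vector) tuples of one blog.
    Each topic vector holds the summed tf*idf weight per topic cluster."""
    ordered = sorted(posts)  # join each post with its direct successor by date

    def euclid(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    # Euclidean distance between every post and its successor
    dists = [euclid(v1, v2) for (_, v1), (_, v2) in zip(ordered, ordered[1:])]
    inter_post_distance = sum(dists) / len(dists)
    # low average distance -> high consistency (illustrative inversion)
    return 1.0 / (1.0 + inter_post_distance)
```

A blog that repeats the same topic vector from post to post scores 1.0, while a blog jumping between topics is penalized.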
6.3 Intra-Blog Consistency

The intra-blog consistency calculates the distance between the classification of each post and its content. To do so, it uses the post-tf*idf view to get the term importance values for the content. Furthermore, it uses a tf*idf view based on the classification system, called class-tf*idf. This view returns the importance values for each term used in tags or categories. The intra-blog consistency on post level is calculated as the topical distance between the post's classification vector and the post's content vector. Finally, all topical distances are combined by averaging them per blog. To accelerate the calculation, the tf*idf vectors are persisted as temporary column tables. Thereby, a join between vectors can be performed as a column search operation in the SAP HANA database, which is the fastest way of joining [33]. Furthermore, blogs that do not use tags or categories do not get an intra-blog consistency. These blogs are regarded as inconsistent with their non-existing classification system. Thus, they are assigned the minimal score, i.e. zero.

6.4 Inter-Blog Consistency

The context-based consistency of a blog, called inter-blog consistency, is based on its linking and linked blogs. To calculate this score, a join with the biggest table of the data set, the link table (see Tab. 5), is necessary. This table consists of the linking and linked blog URIs and the corresponding link type, which states whether a blog links to another blog via a post or a comment. To calculate the topical distance between all outgoing and incoming links, the blog-topic-probability table is joined with the link table. This is the most costly operation on the data set because the link table is growing rapidly and currently contains around 160 million rows. After the join computation, the post context distances can be calculated.
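In the same spirit, a toy version of the inter-blog computation gathers a blog's linking and linked neighbours from the link table and averages the topical distances. The aggregation into a score via 1/(1 + d) is again an illustrative stand-in for the actual score definition, and the data structures are simplifications of the link and blog-topic-probability tables.

```python
def inter_blog_consistency(blog, links, topic_vectors):
    """links: list of (linking_blog, linked_blog) pairs (the link table);
    topic_vectors maps a blog URI to its topic probability vector."""

    def euclid(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    # neighbours of the blog via outgoing and incoming links
    neighbours = ([dst for src, dst in links if src == blog] +
                  [src for src, dst in links if dst == blog])
    dists = [euclid(topic_vectors[blog], topic_vectors[n])
             for n in neighbours if n in topic_vectors]
    if not dists:
        return 0.0  # no links -> no context to be consistent with
    return 1.0 / (1.0 + sum(dists) / len(dists))
```

A blog whose neighbourhood writes about the same topics scores close to 1.0; linking to or being linked from topically distant blogs pulls the score down.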
By grouping per blog, the inter-blog consistency score is computed as defined in Eq. 19.

linking post            linked post           link type
spreeblick.de?p=22      netzwertig.de?p=31    via post
carta.info?p=12         spreeblick.de?p=26    via comment
promicabana.de?p=76     gesichtet.net?p=3     via post
...                     ...                   ...

Table 5: Example rows of the link table.

6.5 BI-Impact Score

As discussed in Sec. 3.2, BlogIntelligence implements a blog ranking metric called BI-Impact score as a proof-of-concept prototype. In the course of evaluating the topic consistency metrics against a blog-specific ranking, the BI-Impact score is transferred to SAP HANA. The score contains two components: the blog interaction and the post interaction. These components are also calculated as SQL views. The calculation requires numerous joins over the link table to calculate the partial rank for each distinct link type. The BI-Impact score is calculated by a recursive algorithm. It needs multiple iterations until the rank converges. After each iteration, a temporary table stores the ranks of all blogs and serves as input for the next iteration. The whole calculation spans a complex query tree containing about 52 join operations. Although the majority of tables have a low number of rows, the usage of the link table introduces a high complexity.

Listing 1 shows the simplified code of one of the basic views of the rank calculation. This view creates a score for each post based on the scores of all incoming links of blogs. It differentiates between the various link locations or link types of the incoming links. The final rank is calculated as the weighted sum over the different link types [11].

Listing 1: SQL view that creates a post score per link type

CREATE VIEW postScoreByLinkType AS
SELECT post, linktype, AVG(scoreOfIncomingBlogs) AS score
FROM postByIncomingPostAndLinkType AS inBlog
JOIN normalizedBiImpactScore AS score
  ON score.host = inBlog.host
GROUP BY post, linktype;

7 Evaluation

This section discusses the results and the plausibility of the topic consistency rank. To this end, the evaluation presents the results of the partial ranks and the overall rank, and compares them to the results of the BI-Impact score.

7.1 Experimental Setup

For the evaluation of this master's thesis, we activated the BlogIntelligence crawler for one month. The crawler uses an 8-core machine with 24 gigabytes of RAM running Ubuntu Linux. The harvested data is stored in a separate database machine with 32 cores and 1 terabyte of RAM running Suse Linux. This machine also runs the analytical SQL queries. The cluster setup for the topic detection consists of 12 machines with 2 cores and 4 gigabytes of RAM each. These machines are grouped into one Hadoop cluster that is configured to run 50 parallel tasks. The key indicators of the data set are shown in Tab. 6.

Indicator                                Value (approx.)
data set size                            500 GB
crawled web pages                        2.5 million
identified blogs                         12,000
identified posts                         600,000
average words per post                   57.5
average number of categories per post    2.6
average number of tags per post          4.2
number of news portals                   1,300

Table 6: State of the BlogIntelligence data set.

7.2 Clustering

The quality of the underlying clustering is crucial for the quality of the topic consistency rank. Especially the size of the clusters determines whether blogs with a versatile interest wrongly get a good consistency rank. The k-means clustering of the Mahout implementation runs on the cluster setup. The runtime depends on the number of iterations and the number of desired clusters. It varies between 8 and 20 minutes per iteration. However, the topic detection only has to be repeated if the number of words changes significantly. After the term extraction procedure, the data set contains 450 000 words.
The resulting matrix of words and posts consists of 2.7 billion tf*idf values. Most of the values are zero. Therefore, Mahout uses a sparse vector representation that results in a matrix size of only 144 megabytes. For the clustering, four different variants are evaluated. The indicators for the quality of the clusterings are shown in Tab. 7.

                                 Variant 1   Variant 2   Variant 3   Variant 4
Parameters:
  k                                    100      10 000      10 000      20 000
  iterations                            10          10          40          40
Results:
  maximum cluster size             448 546     419 453     187 093      21 234
  minimum cluster size                   1           1           1           1
  number of filtered clusters           52       5 398       4 419      18 546
  minimum filtered cluster size          2           2           2           2
  maximum filtered cluster size         37          83          52         383
  average filtered cluster size       8.73        4.55        3.86        10.1

Table 7: Quality of the tested clustering configurations.

The number of filtered clusters is always below the actual number of clusters calculated by k-means, called k. This is caused by the filtering of too small and too large clusters. The filtering is conservative. It removes clusters with a size of one, which avoids expensive and overly specific word distance calculations. Furthermore, clusters with more than 1,000 words are ignored, because the word diversity of such a cluster harms the validity of the topic consistency rank. Variant 1 creates 100 clusters with a maximum cluster size of 448 546 words. These words cannot be considered, because the cluster size is larger than 1,000. Thereby, only 1 500 words are grouped into meaningful clusters. With an average cluster size of 8.73, there are enough words per cluster to describe a topic. However, variant 1 creates too few clusters. Therefore, the cluster number is increased to 10 000 in variant 2. Although it creates more than 5 000 filtered clusters, the average cluster size halves and the number of unused words in the biggest cluster decreases only negligibly. Hence, variant 3 increases the number of iterations to get a better word distribution among the clusters.
Unexpectedly, the number of filtered clusters decreases for variant 3. The size of the maximum cluster decreases and the average size of the filtered clusters also decreases. Consequently, variant 3 creates more clusters with a size over 1,000 than variant 2. To further increase the number and average size of filtered clusters, variant 4 increases the number of created clusters. Variant 4 gives the best results in the evaluation. It contains over 18,000 filtered clusters and the maximum cluster size decreases to about 20,000. In addition, variant 4 has on average 10 words per cluster, which is a far more promising distribution than in the other three variants. As a consequence of the clustering evaluation, the topic consistency rank calculation uses the filtered clusters of variant 4.

7.3 Results of the Topic Consistency Sub Ranks

The ten best blogs for each of the topic consistency sub ranks are calculated. The BI crawler focuses on crawling the German blogosphere. Therefore, the majority of all blogs are German and the top consistency blogs are German, too. For each of the sub ranks, two highly ranked representatives are introduced in detail. The top ten blogs for the two post-related sub ranks are shown in Tab. 8.

Rank   Intra-Post             Inter-Post
1      promicabana.de         blog.de.playstation.com
2      dsds2011.info          upload-magazin.de
3      blog.beetlebum.de      blog.studivz.net
4      schockwellenreiter.    der-postillon.com
5      hornoxe.com            allfacebook.de
6      netbooknews.de         achgut.com
7      iphoneblog.de          gutjahr.biz
8      carta.info             elmastudio.de
9      blog.studivz.net       netzwertig.com
10     seo.at                 lawblog.de

Table 8: The top ten ranked blogs for intra-post and inter-post consistency.

One example of a high intra-post consistency is the dsds2011.info blog. The intra-post consistency gives the average internal consistency of the posts in a blog. dsds2011.info is a follower blog of a German TV casting show in search of a new superstar. This blog is a fan blog.
Therefore, each post mostly focuses on one person, e.g. the current candidate. Furthermore, some posts discuss the performance of each candidate of a show. As a result, each paragraph of such a post focuses on a different person but uses the same attributes to describe the performances. Another blog with a high intra-post consistency is iphoneblog.de. Obviously, the topics of its posts all relate to news about Apple's iPhone. Each post of this blog contains on average five paragraphs, is carefully researched, and concentrates on one feature, game, or accessory of the iPhone. These special interests are fully explored in a post over several paragraphs. As a consequence, the internal consistency of the posts is high. A representative of a high inter-post consistency is the blog.de.playstation.com blog. This blog has a high topical consistency between its latest published posts. The main focus of this blog is on PlayStation games. Hereby, it frequently publishes posts about the latest games, which are discussed regarding their gameplay, graphics, and story line. Each post presents a game in a similar structure and phrasing. Thus, the topical distance between these posts is very low and the topical consistency is very high. Another highly ranked blog regarding the consistency between posts is allfacebook.de. It publishes posts about new features of the social network, discussions about privacy, and the latest news about Facebook. Although this blog covers these three topics, it usually publishes multiple posts per topic in a row. This decreases the distance between succeeding posts and boosts its inter-post consistency.
Rank   Intra-Blog           Inter-Blog
1      readers-edition.de   innenaussen.com
2      iphoneblog.de        shopblog author.de
3      eisy.eu              nachdenkseiten.de
4      karrierebibel.de     helmschrott.de
5      meinungs-blog.de     blog.studivz.net
6      dsds2011.info        fanartisch.de
7      macerkopf.de         achgut.com
8      kwerfeldein.de       internet-law.de
9      events.ccc.de        scienceblogs.de
10     mobiflip.de          events.ccc.de

Table 9: The top ten ranked blogs for intra-blog and inter-blog consistency.

The top ten blogs for the two blog-related sub ranks are shown in Tab. 9. One example of a high intra-blog consistency rank is, again, the iphoneblog.de blog. This blog uses the post classification in an appropriate way. As mentioned above, the posts of this blog are carefully edited. By investigating the content of the blog, it is observable that each post contains, besides the common categories, at least six content-specific tags. This shows that a blog gains a high consistency ranking for the intra-post and intra-blog consistency by carefully authoring its posts. Another example is the macerkopf.de blog. In contrast to iphoneblog.de, the posts of this blog cover a higher variety of topics and comment more critically. For example, they frequently compare the iPhone with other mobile phones. Hereby, a post covers at least two topics. Nevertheless, categories and tags address each topic of the post, which results in a high quality of the classification and thus in a high intra-blog consistency rank.

The inter-blog consistency measures the consistency of a blog with its linking and linked blogs. The best-ranked blog for the inter-blog consistency is the innenaussen.com blog. This blog writes reviews about diverse beauty products. The blog link graph indicates that this blog mainly links to other product reviews, e.g. to reference another opinion on a product. Furthermore, it is observable that it is also linked by product review blogs on beauty products such as the lipglossladys.com blog.
The scienceblogs.de blog also has a high inter-blog consistency rank. This is caused by its link directory nature. It mainly collects and summarizes posts from other science-related blogs and provides an entry point into a science community. This blog mainly references the original content. Thereby, its summaries are very consistent with the linked content. In addition, comparing all four sub ranks of Tab. 8 and Tab. 9, blog.studivz.net shows high consistency ranks for every subrank except the intra-blog consistency. This blog writes about topics around a German social network called studiVZ. It is a typical corporate blog that describes news and new features of a company and the company's products. Hereby, the blog has highly consistent posts that discuss a topic over multiple paragraphs. It constantly posts about activities of the company and is linked by blogs that spread the news of the company. Nevertheless, the posts of this blog are not tagged and are only categorized as allgemein (German for miscellaneous), which is a common default configuration for blog systems. For each subrank, two representatives of the top ten ranked blogs were analyzed, and the evaluation shows that the sub ranks create plausible results.

7.4 Comparison of BI-Impact and Combined Topic Consistency Rank

The weighted combination of all sub ranks is the combined topic consistency rank. It identifies the topically consistent blogs in the data set. Thereby, it creates a ranking of experts depending on the consistency of their writing. In contrast, the BI-Impact score aims to identify the most influential blog authors with the highest reach and fame. During the evaluation, both ranks are compared against each other to find possible correlations.
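The combination step itself is a plain weighted sum, which can be sketched in a few lines. The equal weights below are placeholders for illustration only; the actual weighting is defined in Sec. 4.5.

```python
def combined_consistency(sub_scores, weights):
    """Weighted sum of the four consistency sub ranks (see Sec. 4.5)."""
    return sum(weights[name] * score for name, score in sub_scores.items())

# hypothetical sub scores of one blog, combined with equal (placeholder) weights
blog = {"intra_post": 0.8, "inter_post": 0.4, "intra_blog": 0.9, "inter_blog": 0.6}
equal = {name: 0.25 for name in blog}
rank_score = combined_consistency(blog, equal)
```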
Blog                     Combined topic consistency rank   BI-Impact
helmschrott.de                                         1          85
gedankendeponie.net                                    2          94
yuccatree.de                                           3         104
upload-magazin.de                                      4          96
nachdenkseiten.de                                      5         117
events.ccc.de                                          6          54
telemedicus.info                                       7         118
bei-abriss-aufstand.de                                 8          90
stereopoly.de                                          9          87
annalist.noblogs.org                                  10          88

Table 10: Top ten ranked blogs for the combined topic consistency rank with their BI-Impact rank.

First, the top ten blogs concerning the combined topic consistency rank are investigated. As shown in Tab. 10, each top ten blog is listed with its ranking position regarding both rankings. The two sample blogs, yuccatree.de and telemedicus.info, have high combined topic consistency ranks. yuccatree.de has a low inter-post consistency value caused by the diversity of its discussed topics. However, it has a high combined consistency score because the remaining three consistency sub ranks are very high. In contrast, the telemedicus.info blog focuses only on privacy and patent right discussions. Thus, it has a very high inter-post consistency, which, in combination with the proper usage of tags, results in a high combined topic consistency rank. In contrast, both have a very low BI-Impact score. Thus, both are not identified as highly influential blogs because their position in the blog link graph does not carry enough influence. The same can be seen for all other blogs of the top ten.

Blog                       Combined topic consistency rank   BI-Impact
fuenf-filmfreunde.de                                    54           1
sistrix.de                                              97           2
elektrischer-reporter.de                               142           3
t3n.de                                                  49           4
scienceblogs.de                                         75           5
fontblog.de                                             37           6
de.engadget.com                                         52           7
achgut.com                                              34           8
schockwellenreiter.de                                   77           9
saschalobo.com                                          35          10

Table 11: Top ten ranked blogs for the BI-Impact rank with their combined topic consistency rank.

Second, the top ten blogs regarding the BI-Impact rank are investigated. As shown in Tab. 11, the blogs are ordered by the BI-Impact rank and listed with their combined topic consistency rank.
By investigating three sample blogs, namely t3n.de, de.engadget.com, and saschalobo.com, it is observed that the most influential blogs deal with a high number of topics. These blogs summarize current events in technology or give their opinions on diverse political discussions. Although these blogs contain high-quality content, the number of discussed topics is very high.

Figure 10: BI-Impact and topic consistency rank for top 100 blogs ordered by topic consistency rank.

Furthermore, the inter-blog consistency decreases through the number of different viewpoints and the wide range of linking blog authors. The intra-post consistency also decreases through the usage of summary posts that aggregate the news of a day. The exemplary analysis of the top ten implies an inverse relation between the topic consistency of a blog and its reach. Thus, the expectation is to find a correlation between the BI-Impact rank and the topic consistency rank. To evaluate this, an analysis of the top 100 ranked blogs is performed. The behavior of both ranks is shown in Fig. 10 and Fig. 11. In Fig. 10, the blogs are ordered by their ranking position in the topic consistency ranking. The best blog gets rank position one. The topic consistency rank is monotonically decreasing with the ranking position. Contrary to the expectation, no correlation is observable between both ranks. However, an accumulation of higher BI-Impact scores can be identified in the area of low consistency ranks. Blogs that cover a higher diversity of topics appear to gain more influence in the blogosphere.
In contrast, the BI-Impact score of the most topically consistent blogs is low.

Figure 11: BI-Impact score and topic consistency rank for top 100 blogs ordered by BI-Impact rank.

Consequently, these blogs have a low impact and a low reach. The assumption is that they form closed expert communities that are less integrated into the blogosphere. The same is observable when looking at the behavior of the topic consistency rank with the blogs ordered by their BI-Impact score. There is an accumulation of high topic consistency ranks at the long tail of the BI-Impact score. In addition, a small accumulation of medium topic consistency ranks at rank positions 3-16 is observable. However, a correlation between both scores cannot be observed.

8 Recommendations for Future Research

The focus of this thesis is to motivate and define a topic consistency rank for blogs. The formal definition and implementation specifically focus on a resource-efficient and fast calculation. Therefore, complex algorithms and dependencies on external resources are avoided. Nevertheless, these should be a focus of future research.

8.1 Enhanced Topic Detection

The central part of our topic consistency rank is the topic detection. As already discussed, k-means clustering detects the topics in the introduced implementation. Nevertheless, the central shortcoming of this approach is that it is highly dependent on the underlying collection. Thus, the rank depends on the crawl coverage of BlogIntelligence. There are several approaches that can circumvent this problem.

Wikipedia. Although the content creation in the blogosphere is highly interactive, it does not aim to provide reliable knowledge. In contrast, Wikipedia offers a great information source of reviewed content.
Wikipedia is fully available for download. The whole set of articles is available online and covers virtually every imaginable topic. Thus, it has to be tested whether a word clustering based on this data can provide more reliable clusters.

Thesauri. Another solution is the usage of thesauri. A thesaurus is a dictionary-like database that additionally contains acronyms, synonyms, and hypernyms. Currently, the most important words are identified by calculating the tf*idf score for each word. By using thesauri, the common hypernyms of the most important words of a post can be collected. These hypernyms can serve as new clusters with all their subordinated words. Thesauri are human-made collections that have been refined over several iterations by linguistic researchers. Thereby, the clustering will have a high quality and an intuitive grouping. One frequently referenced thesaurus is WordNet [57]. WordNet allows the complete download of its database. This enables the analysis to load the complete knowledge in-memory and to perform a fast matching of words and hypernyms. Although this process is expected to be slower than the k-means clustering, the results can be more promising.

Ontologies. A promising solution is the usage of ontologies. "An ontology is an explicit, formal specification of a shared conceptualization. The term is borrowed from philosophy, where an Ontology is a systematic account of existence. For AI systems, what "exists" is everything that can be represented." [58] An ontology holds numerous relations between concepts. Among others, an ontology defines classes of resources and super classes of classes. To use ontologies, the post's content has to be assigned to the concepts present in the ontologies. This is a hard problem and frequently discussed in ongoing research [59, 60, 61]. Hereby, the probability of a word or word group representing a specific concept is needed.
The probability is influenced by the direct context of the word and by the overall collection. Although this results in a hard calculation problem, the data is semantically enriched. These semantics can be used to easily derive clusters of different granularities. Furthermore, they make the results machine-readable and enable more semantic filtering for users.

Sentiments. Besides the quality of blog posts, incorporating the opinion of blog authors into the ranking is a future challenge. For example, a user may want to identify a blog author who constantly writes positively or negatively about a topic such as Apple. Thereby, BlogIntelligence should provide special insights to identify fans and haters of products or persons. Therefore, sentiment analysis should be applied to the posts' content. Sentiment analysis determines the attitude of a writer [62]. The attitude is the emotional state of the author.

Probability distributions. As discussed in Sec. 5.2, a k-means clustering assigns words to topics. Although this gives promising results, another approach is to view topics as probability distributions over words. Thus, each word is assigned to a topic with a specific probability. This probability distribution creates overlapping topic clusters that represent reality in more detail than a distinct assignment of words to topics. Hereby, the word ray (light ray) gets assigned to physics, but also, with a smaller probability, to fishing (ray-bones at the fin of a fish).

Multilingual clustering. The word clustering in this thesis is limited to a German data set. Thereby, the problem of multilingual clustering is circumvented. Due to the future extension of BlogIntelligence to the whole blogosphere, the clustering also has to detect topics across language boundaries. This problem is discussed by Chen et al. [63], who propose to first cluster each language and afterwards merge the resulting topic clusters.
Future work has to integrate this or a similar approach into the topic detection to solve the multilingual clustering problem.

8.2 Visualization

The key component of the BlogIntelligence framework is the visualization. It enables users to understand and use the results of the BI analyses. The topic consistency rank presented in this thesis is a complex calculation that results in a single numerical value per blog. By displaying only this number, the user is not able to relate it to other blogs or to interpret its meaning. Therefore, future work will address the creation of an appropriate visualization. This visualization helps the user to explore and categorize blogs based on their visual perception. As discussed in Sec. 2.2.3, the BlogConnect visualization of BlogIntelligence already provides an exploratory overview of the blogosphere. To integrate the topic consistency rank into this view, another visual dimension is introduced. This dimension has to symbolize the consistency of a blog. The user has to be able to perceive the order of blogs regarding their consistency. Thus, the color value of the blog bubbles serves as the indicator of their topic consistency. The value is hereby a direct mapping of the normalized rank multiplied by a constant parameter. The prototypical BlogConnect 2.0 visualization is shown in Fig. 12.

Figure 12: BlogConnect 2.0 with topic consistency represented as color value.

As shown, the user still controls the set of blogs via a search term in the lower right corner of the visualization. Blogs are only shown if they are related to the search term. Essentially, there are three extensions to the current BlogConnect visualization. First, blog bubbles are now arranged around their assigned topics.
The topic names have to be calculated via a cluster labeling algorithm, which is also subject to future research. Furthermore, the arrangement around the topics is based on a gravitation simulation where the force is determined by the distance of a blog to the cluster's centroid. Second, as mentioned above, the color value of a blog bubble represents its degree of topic consistency. As shown in Fig. 12, blogs with a high consistency shine through the cloud of dark, inconsistent blogs. Hereby, the small light points also help the user to compare less consistent blogs.

Figure 13: BlogConnect with a high minimal topic consistency threshold.

Third, an interactive toolbar with three controls is introduced. The first control regulates the topic granularity of the visualization. Currently, one can see five topics. By raising the granularity, the BlogIntelligence framework calculates a higher number of clusters. This enables the user to explore the blogs in more detail. In addition, the user is able to configure the minimum BI-Impact score. All blogs with a lower score are excluded from the view, leaving the most important blogs for the user. Similarly, the minimum topic consistency can be controlled by the user. Thus, the user can exclude inconsistent blogs from the overview. As shown in Fig. 13, the higher the topic consistency threshold, the fewer blogs are shown. One can see that even big blogs disappear owing to their versatile interests.

8.3 Full integration with SAP HANA

The full integration into SAP HANA is one of the main goals for the future of BlogIntelligence. Hereby, the focus lies on transferring the text analysis foundations into the core of HANA and creating an API for future text analysis algorithms. As discussed in Sec. 5, the tf*idf calculation runs inside SAP HANA.
Although the SQL procedures run entirely on the database, they use an externally extracted dictionary table instead of the database-owned word index. Transportation costs can be decreased by implementing the k-means clustering directly in the database. Although the k-means algorithm already runs in a distributed environment, a full in-memory computation can achieve an additional performance boost. However, this expectation has to be validated by integrating it into SAP HANA. Furthermore, the actual consistency ranking calculation can be adapted to incrementally update the rank of each blog on the insertion of new posts. Through the integration of the text analysis algorithms into SAP HANA, the overall aim of BlogIntelligence, which is to provide real-time analytics of the blogosphere, can be approached.

9 Conclusion

This master's thesis proposed a metric for the topical consistency of a blog, with the goal of identifying domain experts in the blogosphere. It was discussed that current blog ranking approaches focus on finding the most influential blogs, i.e., those that attract a large audience and thus more visitors, links, and comments. Further, it was argued that niche blogs with a very specific topic can only attract a limited audience and thus have only a small reach. For a blog to develop expert knowledge, it should show recurring interest in its topics and therefore concentrate on a small set of topics. Identifying such expert blogs is particularly important for domain experts, who want to find blogs they can observe and interact with. To ease the retrieval of these blogs, four different aspects of topic consistency were defined: (1) intra-post, (2) inter-post, (3) intra-blog, and (4) inter-blog consistency. These aspects define the consistency of a blog at different granularities, from the internal consistency of a post's paragraphs to the global consistency between a blog and its linking and linked blogs. The four aspects are combined into a joint rank, called the topic consistency rank.
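One minimal way to fold four aspect scores into a joint rank is a weighted average. The equal weights below are an assumption for illustration only, not the combination actually used in the thesis:

```python
def topic_consistency_rank(intra_post, inter_post, intra_blog, inter_blog,
                           weights=(0.25, 0.25, 0.25, 0.25)):
    """Combine the four consistency aspects into one joint rank.
    A plain weighted average is assumed here for illustration."""
    aspects = (intra_post, inter_post, intra_blog, inter_blog)
    return sum(w * a for w, a in zip(weights, aspects))

# With equal weights, a blog scoring (0.8, 0.6, 0.4, 0.2) gets rank 0.5.
rank = topic_consistency_rank(0.8, 0.6, 0.4, 0.2)
```

Because each aspect is normalized to [0, 1] and the weights sum to one, the joint rank stays in [0, 1] as well.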
The implementation of the topic consistency rank was introduced. Further, this thesis showed how the topic consistency rank is integrated into the blog analytics framework BlogIntelligence. The foundation of the topic consistency rank is the topic detection, which implements the automatic assignment of words into groups of highly related words; these groups are defined as topics. Using this topic detection, the implementation of the four aspects and of the final rank was described, with a focus on the specifics of the persistence layer, SAP HANA.

The plausibility of the topic consistency rank was evaluated on a real-world data set of 12,000 crawled blogs collected by the BlogIntelligence crawler. The top ten results of each aspect were analyzed, and two representatives were discussed in detail. In addition, the correlation between the topic consistency of a blog and its influence was evaluated. This was done by implementing the BI-Impact score, a measure for the reach and impact of a blog that incorporates blog-specific characteristics. The analysis of the top ten blogs appeared to imply an inverse relation between the topic consistency of a blog and its reach, i.e., the more consistent a blog is, the less influence it can gain in the blogosphere. In contrast, analyzing the distribution of ranks among the top hundred revealed no correlation between the influence and the consistency of blogs. Thus, both metrics are considered to be independent. As a consequence, the topic consistency rank is established as an additional indicator, beside the influence of a blog, to ease blog retrieval for domain experts.

Future work includes the enhancement of the topic detection to provide more specific and accurate topics and to allow words to be part of multiple topics. The influence of this enhancement on the results of the topic consistency rank should be analyzed.
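The independence check between consistency and influence can be reproduced with a rank correlation. The following stdlib-only Spearman sketch uses made-up scores, not the evaluation data set:

```python
def rank_values(xs):
    """Assign 1-based ranks to a list of scores (no tie handling)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0] * len(xs)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation via the difference-of-ranks formula:
    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    n = len(xs)
    rx, ry = rank_values(xs), rank_values(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical scores: here consistency and impact are perfectly inverted,
# so rho comes out as -1; values near 0 would indicate independence.
consistency = [0.9, 0.2, 0.5, 0.7]
impact = [1.0, 8.0, 3.0, 2.0]
rho = spearman(consistency, impact)
```

A rank correlation is preferable to Pearson here because BI-Impact scores are heavy-tailed and only the ordering of blogs matters.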
In addition, the proposed visualization, BlogConnect 2.0, should be integrated into the BlogIntelligence web portal to offer the results of the topic consistency rank to the user.

List of Abbreviations

API: Application Programming Interface
ATOM: Atom Syndication Format
BI: BlogIntelligence
BI-Impact: BlogIntelligence-Impact-Score
Blog: Weblog
HDFS: Hadoop Distributed File System
HITS: Hyperlink-Induced Topic Search
HTTP: Hypertext Transfer Protocol
IR: Information Retrieval
RAM: Random-Access Memory
RPC: Remote Procedure Call
RSS: Rich Site Summary
Splog: Spam Blog
SQL: Structured Query Language
tf*idf: Term Frequency-Inverse Document Frequency
URI: Uniform Resource Identifier
WWW: World Wide Web
XML: Extensible Markup Language

List of Figures

1 Overview of blog topics.
2 An example tag cloud.
3 BlogIntelligence architecture overview.
4 BlogConnect.
5 PostConnect.
6 Ranking variables of the BI-Impact.
7 Visualization of post-topic-probabilities.
8 Topic detection flow diagram.
9 An example iteration of k-means.
10 BI-Impact and topic consistency ordered by topic consistency rank.
11 BI-Impact and topic consistency ordered by BI-Impact.
12 BlogConnect 2.0 with topic consistency represented as color value.
13 BlogConnect 2.0 with a minimal topic consistency threshold.

List of Tables

1 Example tf*idf vector table.
2 Sparse word vector representation.
3 Resulting cluster table.
4 Example of the dictionary table.
5 Example of the link table.
6 State of the BlogIntelligence data set.
7 Clustering quality results.
8 Top 10 blogs for intra-post and inter-post consistency.
9 Top 10 blogs for intra-blog and inter-blog consistency.
10 Top 10 blogs for combined topic consistency rank with BI-Impact.
11 Top 10 blogs for BI-Impact with combined topic consistency rank.