Lecture Notes in Computer Science 6045
Lecture Notes in Computer Science
Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, TU Dortmund University, Germany
Madhu Sudan, Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max Planck Institute for Informatics, Saarbruecken, Germany

John G. Breslin, Thomas N. Burg, Hong-Gee Kim, Tom Raftery, Jan-Hinrik Schmidt (Eds.)

Recent Trends and Developments in Social Software
International Conferences on Social Software
BlogTalk 2008, Cork, Ireland, March 3-4, 2008, and BlogTalk 2009, Jeju Island, South Korea, September 15-16, 2009
Revised Selected Papers

Volume Editors
John G. Breslin, National University of Ireland, Engineering and Informatics, Galway, Ireland. E-mail: john.breslin@nuigalway.ie
Thomas N. Burg, Socialware, Vienna, Austria. E-mail: thomas@burg.cx
Hong-Gee Kim, Seoul National University, Biomedical Knowledge Engineering Laboratory, Seoul, Korea. E-mail: hgkim@snu.ac.kr
Tom Raftery, Red Monk, Seattle, WA, USA. E-mail: tom@redmonk.com
Jan-Hinrik Schmidt, Hans Bredow Institut, Hamburg, Germany. E-mail: j.schmidt@hans-bredow-institut.de

Library of Congress Control Number: 2010936768
CR Subject Classification (1998): H.3.5, C.2.4, H.4.3, H.5
LNCS Sublibrary: SL 3 – Information Systems and Applications, incl. Internet/Web and HCI
ISSN: 0302-9743
ISBN-10: 3-642-16580-X Springer Berlin Heidelberg New York
ISBN-13: 978-3-642-16580-1 Springer Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.
springer.com
© Springer-Verlag Berlin Heidelberg 2010
Printed in Germany
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper

Preface

1 Overview

From its beginnings, the Internet has fostered communication, collaboration and networking between users. However, the first boom at the turn of the millennium was mainly driven by a rather one-sided interaction: e-commerce, portal sites and the broadcast models of mainstream media were introduced to the Web. Over the last six or seven years, new tools and practices have emerged which emphasise the social nature of computer-mediated interaction. Commonly (and broadly) labeled as social software and social media, they encompass applications such as blogs and microblogs, wikis, social networking sites, real-time chat systems, and collaborative classification systems (folksonomies). The growth and diffusion of services like Facebook and Twitter and of systems like WordPress and Drupal has in part been enabled by certain innovative principles of software development (e.g. open APIs, open-source projects, etc.), and in part by empowering the individual user to participate in networks of peers on different scales.
Every year, the International Conference on Social Software (BlogTalk) brings together the different groups of people who use and advance the Internet: technical and conceptual developers, researchers with interdisciplinary backgrounds, and practitioners alike. It is designed to initiate a dialog between users, developers, researchers and others who share, analyse and enjoy the benefits of social software. The focus is on social software as an expression of a culture that is based on the exchange of information, ideas and knowledge. Moreover, we understand social software as a new way of relating people to people and to machines, and vice versa. In the spirit of the free exchange of opinions, links and thoughts, a wide range of participants can engage in this discourse. BlogTalk enables participants to connect and discuss the latest trends and happenings in the world of social software. It consists of a mix of presentations, panels, face-to-face meetings, open discussions and other exchanges of research, with attendees sharing their experiences, opinions, software developments and tools. Developers are invited to discuss technological developments that have been designed to improve the utilisation of social software, as well as to report on the current state of their software and projects. This includes new blog and wiki applications, content-creation and sharing environments, advanced groupware and tools, client-server designs, GUIs, APIs, content syndication strategies, devices, applications for microblogging, and much more. Researchers are asked to focus on their visions and interdisciplinary concepts explaining social software including, but not limited to, viewpoints from social sciences, cultural studies, psychology, education, law and natural sciences.
Practitioners can talk about the practical use of social software in professional and private contexts, around topics such as communication improvements, easy-to-use knowledge management, social software in politics and journalism, blogging as a lifestyle, etc.

2 BlogTalk 2009

The 2009 conference was held on the picturesque Jeju Island in South Korea, and was coordinated locally by the prominent Korean blogger and researcher Channy Yun. This was the first BlogTalk to be held in Asia, and given its success, it will not be the last. The following presentations from BlogTalk 2009 are available in this volume. Philip Boulain and colleagues from the University of Southampton detail their prototype for an open semantic hyperwiki, taking ideas from the hypertext domain that were never fully realised in the Web and applying them to the emerging area of semantic wikis (for first-class links, transclusion, and generic links). Justus Broß and colleagues from the Hasso-Plattner Institute and SAP study the adoption of WordPress MU as a corporate blogging system for the distributed SAP organisation, connecting thought leaders at all levels in the company. Michel Chalhoub from the Lebanese American University analyses areas where the development and use of knowledge exchange systems and social software can be effective in supporting business performance (resulting in a measure for evaluating the benefit of investment in such technologies). Kanghak Kim and colleagues from KAIST and Daum Communications discuss their study on users’ voting tendencies in social news services, in particular, examining users who are motivated to vote for news articles based on their journalistic value. Sang-Kyun Kim and colleagues from the Korea Institute of Oriental Medicine describe research that connects researchers through an ontology-based system that represents information on not just people and groups but projects, papers, interests and other activities.
Yon-Soo Lim, Yeungnam University, describes the use of semantic network analysis to derive structure and classify both style and content types in media law journalistic texts from both blogs and news sources. Makoto Okazaki and Yutaka Matsuo from the University of Tokyo perform an analysis of microblog posts for real-time event notification, focussing on the construction of an earthquake prediction system that targets Japanese tweets. Yuki Sato et al. from the University of Tsukuba, NTT and the University of Tokyo describe a framework for the complementary navigation of news articles and blog posts, where Wikipedia entries are utilised as a fundamental knowledge source for linking news and blogs together. Takayuki Yoshinaka et al. from the Tokyo Denki University and the University of Tokyo describe a method for filtering spam blogs (splogs) based on a machine-learning technique, along with its evaluation results. Hanmin Jung and colleagues from KISTI detail a Semantic Web-based method that resolves author co-references, finds experts on topics, and generates researcher networks, using a data set of over 450,000 Elsevier journal articles from the information technology and biomedical domains. Finally, Jean-Henry Morin from the University of Geneva looks at the privacy issues regarding the sharing and retention of personal information in social networking interactions, and examines the need to augment this information with an additional DRM-type set of metadata about its usage and management.

There were three further peer-reviewed talks that are not published here. Daniele Nascimento and Venkatesh Raghavan from Osaka City University described various trends in the area of social geospatial technologies, in particular, how free and open-source development is shaping the future of geographic information systems.
Myungdae Cho from Sung Kyun Kwan University described various library applications of social networking and other paradigm shifts regarding information organisation in the library field. David Lee, Zenitum, presented on how governments around the world are muzzling the Social Web. BlogTalk has attracted prominent keynote speakers in the past, and 2009 was no exception: Yeonho Oh, founder of Ohmynews, spoke about the future of citizen journalism; and Isaac Mao, Berkman Center for Internet and Society at Harvard, presented on cloud intelligence. The conference also featured a special Korean Web Track: Jongwook Kim from Daum BloggerNews spoke about social ranking of articles; Namu Lee from NHN Corporation talked about the Textyle blogging tool; and Changwon Kim from Google Korea described the Textcube.com social blogging service.

3 BlogTalk 2008

In 2008, BlogTalk was held in Cork City, Ireland, and was sponsored by BT, DERI at NUI Galway, eircom and Microsoft. In these proceedings, we also gather selected papers from the BlogTalk 2008 conference. Uldis Bojars and colleagues from DERI, NUI Galway describe how the SIOC semantic framework can be used for the portability of social media contributions. David Cushman, FasterFuture Consulting, discusses the positives he believes are associated with the multiple complex identities we are now adopting in various online communities. Jon Hoem from Bergen University College describes the Memoz system for spatial web publishing. Hugo Pardo Kuklinski from the University of Vic and Joel Brandt from Stanford University describe the proposed Campus Móvil project for Education 2.0-type services through mobile and desktop environments. José Manuel Noguera and Beatriz Correyero from the Catholic University of Murcia discuss the impact of Politics 2.0 in Spanish social media, by tracking conversations through the Spanish blogosphere.
Antonio Tapiador and colleagues from Universidad Politecnica de Madrid detail an extended identity architecture for social networks, attaching profile information to the notion of distributed user-centric identity. Finally, Mark Bernstein from Eastgate Systems Inc. writes about the parallels between Victorian and Edwardian sensibilities and modern blogging behaviours. Also, but not published here, there were some further interesting presentations at BlogTalk 2008. Joe Lamantia from Keane gave some practical suggestions for handling ethical dilemmas encountered when designing social media. Anna Rogozinska from Warsaw University spoke about the construction of self in weblogs about dieting. Paul Miller from Talis described how existing networks of relationships could be leveraged using semantics to enhance the flow of ideas and discourse. Jeremy Ruston from Osmosoft at BT presented the latest developments regarding the TiddlyWiki system. Jan Blanchard from Tourist Republic and colleagues described plans for a trip planning recommender network. Andera Gadeib from Dialego spoke about the MindVoyager approach to qualitative online research, where consumers and clients come together in an online co-creation process. Martha Rotter from Microsoft demonstrated how to build and mashup blogs using Windows Live Services and Popfly. Robert Mao, also from Microsoft, described how a blog can be turned into a decentralised social network. Brian O’Donovan and colleagues from IBM and the University of Limerick analysed the emerging role of social software in the IBM company intranet. Hak-Lae Kim and John Breslin from DERI, NUI Galway presented the int.ere.st tag-sharing service.
The 2008 conference featured notable keynote speakers from both Silicon Valley and Europe talking about their Web 2.0 experiences and future plans for the emerging Web 3.0: Nova Spivack, CEO, Radar Networks, described semantic social software designed for consumers; Salim Ismail, formerly of Yahoo! Brickhouse, spoke about entrepreneurship and social media; Matt Colebourne, CEO of coComment, presented on conversation tracking technologies; and Michael Breidenbrücker, co-founder of Last.fm, talked about the link between advertising and Web 2.0. There were also two discussion panels: the first, on mashups, microformats and the Mobile Web, featured Sean McGrath, Bill de hÓra, Conor O’Neill and Ben Ward; the second panel, describing the move from blog-style commentary to conversational social media, included Stephanie Booth, Bernard Goldbach, Donncha O Caoimh and Jan Schmidt.

4 Conclusion

We hope that you find the papers presented in this volume to be both stimulating and useful. One of the main motivations for running BlogTalk every year is for attendees to be able to connect with a diverse set of people who are fascinated by and work in the online digital world of social software. Therefore, we encourage you to attend and participate during future events in this conference series. The next BlogTalk conference is being organised for Galway, Ireland, and will be held in autumn 2010.

February 2010
John Breslin
Thomas Burg
Hong-Gee Kim
Tom Raftery
Jan Schmidt

Organization

BlogTalk 2009 was organised by the Biomedical Knowledge Engineering Lab, Seoul National University. BlogTalk 2008 was organised by the Digital Enterprise Research Institute, National University of Ireland, Galway.
2009 Executive Committee
Conference Chairs: John Breslin (NUI Galway), Thomas Burg (Socialware), Hong-Gee Kim (Seoul National University)
Organising Chair: Channy Yun (Seoul National University)
Event Coordinator: Hyun Namgung (Seoul National University)

2009 Programme Committee
Gabriela Avram (University of Limerick)
Anne Bartlett-Bragg (Headshift)
Mark Bernstein (Eastgate Systems Inc.)
Stephanie Booth (Climb to the Stars)
Rob Cawte (eSynapse)
Josephine Griffith (NUI Galway)
Steve Han (KAIST)
Conor Hayes (DERI, NUI Galway)
Jin-Ho Hur (NeoWiz)
Ajit Jaokar (FutureText Publishing)
Alexandre Passant (DERI, NUI Galway)
Robert Sanzalone (pacificIT)
Jan Schmidt (Hans Bredow Institute)
Hideaki Takeda (National Institute of Informatics)

2008 Executive Committee
Conference Chairs: John Breslin (NUI Galway), Thomas Burg (Socialware), Tom Raftery (Tom Raftery IT), Jan Schmidt (Hans Bredow Institute)

2008 Programme Committee
Gabriela Avram (University of Limerick)
Stowe Boyd (/Message)
Dan Brickley (Friend-of-a-Friend Project)
David Burden (Daden Ltd.)
Jyri Engeström (Jaiku, Google)
Jennifer Golbeck (University of Maryland)
Conor Hayes (DERI, NUI Galway)
Ajit Jaokar (FutureText Publishing)
Eugene Eric Kim (Blue Oxen Associates)
Kevin Marks (Google)
Sean McGrath (Propylon)
Peter Mika (Yahoo! Research)
José Luis Orihuela (Universidad de Navarra)
Martha Rotter (Microsoft)
Jeremy Ruston (Osmosoft, BT)
Rashmi Sinha (SlideShare, Uzanto)
Paolo Valdemarin (evectors, Broadband Mechanics)
David Weinberger (Harvard Berkman Institute)

Sponsoring Institutions
BT
DERI, NUI Galway
eircom
Microsoft

Table of Contents

A Model for Open Semantic Hyperwikis (Philip Boulain, Nigel Shadbolt, and Nicholas Gibbins) ..... 1
Implementing a Corporate Weblog for SAP (Justus Broß, Matthias Quasthoff, Sean MacNiven, Jürgen Zimmermann, and Christoph Meinel) ..... 15
Effect of Knowledge Management on Organizational Performance: Enabling Thought Leadership and Social Capital through Technology Management (Michel S. Chalhoub) ..... 29
Finding Elite Voters in Daum View: Using Media Credibility Measures (Kanghak Kim, Hyunwoo Park, Joonseong Ko, Young-rin Kim, and Sangki Steve Han) ..... 38
A Social Network System Based on an Ontology in the Korea Institute of Oriental Medicine (Sang-Kyun Kim, Jeong-Min Han, and Mi-Young Song) ..... 46
Semantic Web and Contextual Information: Semantic Network Analysis of Online Journalistic Texts (Yon Soo Lim) ..... 52
Semantic Twitter: Analyzing Tweets for Real-Time Event Notification (Makoto Okazaki and Yutaka Matsuo) ..... 63
Linking Topics of News and Blogs with Wikipedia for Complementary Navigation (Yuki Sato, Daisuke Yokomoto, Hiroyuki Nakasaki, Mariko Kawaba, Takehito Utsuro, and Tomohiro Fukuhara) ..... 75
A User-Oriented Splog Filtering Based on a Machine Learning (Takayuki Yoshinaka, Soichi Ishii, Tomohiro Fukuhara, Hidetaka Masuda, and Hiroshi Nakagawa) ..... 88
Generating Researcher Networks with Identified Persons on a Semantic Service Platform (Hanmin Jung, Mikyoung Lee, Pyung Kim, and Seungwoo Lee) ..... 100
Towards Socially-Responsible Management of Personal Information in Social Networks (Jean-Henry Morin) ..... 108
Porting Social Media Contributions with SIOC (Uldis Bojars, John G. Breslin, and Stefan Decker) ..... 116
Reed’s Law and How Multiple Identities Make the Long Tail Just That Little Bit Longer (David Cushman) ..... 123
Memoz – Spatial Weblogging (Jon Hoem) ..... 131
Campus Móvil: Designing a Mobile Web 2.0 Startup for Higher Education Uses (Hugo Pardo Kuklinski and Joel Brandt) ..... 143
The Impact of Politics 2.0 in the Spanish Social Media: Tracking the Conversations around the Audiovisual Political Wars (José M. Noguera and Beatriz Correyero) ..... 152
Extended Identity for Social Networks (Antonio Tapiador, Antonio Fumero, and Joaquín Salvachúa) ..... 162
NeoVictorian, Nobitic, and Narrative: Ancient Anticipations and the Meaning of Weblogs (Mark Bernstein) ..... 169
Author Index ..... 177

A Model for Open Semantic Hyperwikis

Philip Boulain, Nigel Shadbolt, and Nicholas Gibbins
IAM Group, School of Electronics and Computer Science, University of Southampton, University Road, Southampton SO17 1BJ, United Kingdom
{prb,nrs,nmg}@ecs.soton.ac.uk
http://users.ecs.soton.ac.uk/

Abstract.
Wiki systems have developed over the past years as lightweight, community-editable, web-based hypertext systems. With the emergence of semantic wikis such as Semantic MediaWiki [6], these collections of interlinked documents have also gained a dual role as ad-hoc RDF [7] graphs. However, their roots lie in the limited hypertext capabilities of the World Wide Web [1]: embedded links, without support for features like composite objects or transclusion. Collaborative editing on wikis has been hampered by redundancy; much of the effort spent on Wikipedia goes into keeping content synchronised and organised [3]. We have developed, prototyped, and are now evaluating a model for a system which reintroduces ideas from the field of hypertext to help alleviate this burden. In this paper, we present a model for what we term an ‘open semantic hyperwiki’ system, drawing from both past hypermedia models and the informal model of modern semantic wiki systems. An ‘open semantic hyperwiki’ is a reformulation of the popular semantic wiki technology in terms of the long-standing field of hypermedia, which highlights and resolves the omissions of hypermedia technology made by the World Wide Web and the applications built around its ideas. In particular, our model supports first-class linking, where links are managed separately from nodes. This is enhanced by the system’s ability to embed links into other nodes and separate them out again, allowing for a user editing experience similar to HTML-style embedded links, while still gaining the advantages of separate links. We add to this transclusion, which allows for content sharing by including the content of one node in another, and edit-time transclusion, which allows users to edit pages containing shared content without having to follow a sequence of indirections to find the actual text they wish to modify.
Our model supports more advanced linking mechanisms, such as generic links, which allow words in the wiki to be used as link endpoints. The development of this model has been driven by our prior experimental work on the limitations of existing wikis and user interaction. We have produced a prototype implementation which provides first-class links, transclusion, and generic links.

Keywords: Open Hypermedia, Semantic Web, Wiki.

J.G. Breslin et al. (Eds.): BlogTalk 2008/2009, LNCS 6045, pp. 1–14, 2010.
© Springer-Verlag Berlin Heidelberg 2010

1 Introduction

Hypermedia is a long-standing field of research into the ways in which documents can expand beyond the limitations of paper, generally in terms of greater cross-referencing and composition (reuse) capability. Bush’s As We May Think [4] introduces a hypothetical early hypertext machine, the ‘memex’, and defines its “essential feature” as “the process of tying two items together”. This linking between documents is the common feature of hypertext systems, upon which other improvements are built. As well as simple binary (two-endpoint) links, hypertext systems have been developed with features including n-ary links (multiple documents linked to multiple others), typed links (links which indicate something about why or how documents are related), generic links (links whose endpoints are determined by matching criteria of the document content, such as particular words), and composite documents (formed by combining a set of other documents). Open Hypermedia extends this with first-class links (and anchors) which are held external to the documents they connect. These allow links to be made to immutable documents, and to be added and removed in sets, often termed ‘linkbases’.
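The distinction between embedded and first-class links can be made concrete with a small sketch. The following Python fragment is illustrative only (the class and function names are ours, not from any implementation described here): it models a linkbase of n-ary, typed, and generic links held separately from document text.

```python
from dataclasses import dataclass

@dataclass
class Link:
    """A first-class link: n-ary, optionally typed, stored outside documents."""
    sources: list
    targets: list
    type: str = ""

@dataclass
class GenericLink:
    """A generic link: any occurrence of `word` acts as a source endpoint."""
    word: str
    targets: list

def resolve_generic(linkbase, text):
    """Find which generic links in the linkbase apply to a document's text."""
    return [(link.word, link.targets)
            for link in linkbase
            if isinstance(link, GenericLink) and link.word in text.split()]

linkbase = [GenericLink("Perl", ["node/Perl"]),
            Link(["node/Phil"], ["node/Perl"], type="Likes")]
print(resolve_generic(linkbase, "I think Perl is elegant"))
# → [('Perl', ['node/Perl'])]
```

Because the links live in the linkbase rather than in the documents, they can be attached to immutable text and added or removed in sets, as described above.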
One of the earliest projects attempting to implement globally distributed hypertext was Xanadu [8], a distinctive feature of whose design was transclusion: including (sections of) a document into another by reference. The World Wide Web, while undeniably successful, only implements a very small subset of these features—binary, embedded links—with more complicated standards such as XLink failing to gain mainstream traction. Since then, applications have been built using the web as an interface, following its same limited capabilities. One of these classes of applications is the semantic wiki, an extension of the community-edited-website concept to cover typed nodes and links, such that the wiki graph structure maps to a meaningful RDF graph. In this paper, we present a model which extends such systems to cover a greater breadth of hypermedia functionality, while maintaining this basic principle of useful graph mapping. We introduce the Open Weerkat system and describe the details of its implementation which relate to the model.

2 Open Weerkat

Open Weerkat is a model and system providing a richer hypertext wiki. The implementation is built upon our previous Weerkat extensible wiki system [2]. From our experimental work [3], we have identified the need for transclusion for content sharing, better support for instance property editing, and generic linking, which requires first-class links. At the same time, we must not prohibit the ‘non-strict’ nature of wikis, as dangling links are used as part of the authoring process. We also wish to preserve the defining feature of semantic wikis: that there is a simple mapping between nodes and links in the wiki, and RDF resources and statements. In the rest of this section, we look at the core components of the model, and some of the implementation considerations.
2.1 Atomic Nodes

The core type of the model is the class which is fundamental to all wiki designs: the individual document, page, article, or component. We use here the general and non-domain-specific term node.

[Fig. 1. Model diagram legend: node title, DOM tree (text nodes, native transclusions, elements, links), and attribute-value pairs]

Components. As we have said, our model draws from both hypermedia and semantic wiki. In particular, we preserve the notion that wiki nodes are parallel to semantic web resources. Because these resources are atomic (RDF cannot perform within-component addressing on them, as that is only meaningful for a representation of a resource), we have carefully designed our wiki model not to rely on link endpoint specifications which go beyond what can be reasonably expressed in application-generic RDF. Anything about which one wishes to make statements, or to which one wishes to link, must have a unique identity in the form of an URI, rather than some form of (URI, within-specifier) pairing. Figure 1 shows the components of a node in the model. (We use this diagram format, which draws inspiration from UML class diagrams, to depict example hypertexts throughout this paper.) Every node has a title which serves as an identifier. Titles are namespaced with full stops (‘.’), which is useful for creating identities for content which nominally belongs within another node. Node content is either a DOM tree of wiki markup, or an atomic object (e.g. an image binary). A notable element in the DOM tree is the ‘native transclusion’, which indicates that another node’s content should be inserted into the tree at that point. This is necessary to support the linking behaviour described below, and is distinct from user-level transclusion using normal links. The bottom of the format shows the attribute-value pairs for the node.
The domain of attributes is other nodes, and the range of values is literals and other nodes. These are effectively very primitive embedded, typed links, and are used to provide a base representation from which to describe first-class links.

Identity, Meta-nodes, and RDF. It is a design goal of the model that the hyperstructure of the wiki is isomorphic to a useful RDF graph. That is, typed links between pages are expressible as an RDF statement relating the pages, and attributes on a page are statements relating that page to the associated value. The link in figure 5 should be presented (via RDF export, a SPARQL endpoint, or such) as the triple (Phil, Likes, Perl), with appropriate URIs. (Note that the anchor is not the subject—we follow the native transclusion back into the owning Phil node.) The attribute of the Perl node in the same figure should be presented as the triple (Perl, syntax, elegant). (For ‘fat’ links with more than two endpoints, there is a triple for each pairing of sources and targets.) For this, we grant each node a URI, namespaced within the wiki. However, ‘the node Perl’ and ‘Perl the programming language’ are separate resources. For example, the node Perl may have an URI of http://wiki.example.org/node/Perl. Yet a typed link from Perl is clearly a statement about Perl itself, not a node about Perl. The statements are about the associated resource http://wiki.example.org/resource/Perl. (This may be owl:sameAs some external URI representing Perl.) In order to describe the node itself (e.g. to express that it is in need of copy-editing), a Perl.meta node represents http://wiki.example.org/node/Perl. This meta node itself has an URI http://wiki.example.org/node/Perl.meta, and could theoretically have a ‘meta meta’ node.
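A minimal sketch of this node/resource split, assuming the `node/` and `resource/` URI prefixes of the examples above (the function names are ours, for illustration):

```python
BASE = "http://wiki.example.org/"

def semantic_uri(identifier):
    """URI a wiki identifier denotes semantically: a typed link from the Perl
    node is about the Perl resource, while Perl.meta is about the node."""
    if identifier.endswith(".meta"):
        return BASE + "node/" + identifier[:-len(".meta")]
    return BASE + "resource/" + identifier

def export_link(source, link_type, target):
    """Render one typed wiki link as an RDF-style triple of URIs."""
    return (semantic_uri(source), semantic_uri(link_type), semantic_uri(target))

print(export_link("Phil", "Likes", "Perl"))
```

A typed link from the Phil node to the Perl node thus exports as a statement about Phil and Perl themselves, while a statement made on Phil.meta attaches to the node URI instead.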
Effectively, there is an ‘offset’ of naming, where the wiki identifier Perl is referring to Perl itself semantically, and the Perl node navigationally; the identifier Perl.meta is referring to the Perl node semantically, and the Perl meta-node navigationally.

Versions. We must give consideration to the identity of versions, or ‘revisions’, of a node. We wish to support navigational linking (including transclusion) to old versions of a node. However, we must also consider our semantic/navigational offset: we may wish to write version 3 of the Perl node, but we do not mean to assert things about a third revision of Perl itself. Likewise, a typed link to version 3 of the Perl node is not a statement about version 3 of Perl: it is a statement about Perl which happens to be directed to a third revision of some content about it. We desire three properties from an identifier scheme for old versions:

Semantic consistency. Considers the version of the content about a resource irrelevant to its semantic identity. All revisions of the Perl node are still about the same Perl.

Navigational identity. Each revision of a node (including meta-nodes) should have distinct identity within the wiki, so that it may be linked to. Intuitively, despite the above, version 3 of the Perl node is Perl_3, not Perl.meta_3.

Semantic identity. Each revision of a node (including meta-nodes) should have a distinct URI, such that people may make statements about them. (Perl_3.meta, writtenBy, Phil) should express that Phil wrote version 3 of the content for the Perl node.

We can achieve this by allowing version number specification both on the node and any meta-levels, and dropping the version specification of the last component to generate the RDF URI. Should somebody wish to make statements about version 4 of the text about version 3 of the text about Perl, they could use the URI Perl;3/meta;4/meta.
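The version-dropping rule can be sketched as follows, assuming the ‘;’-delimited identifier syntax of the Perl;3/meta;4/meta example (this is our reading of the scheme, not code from the system):

```python
def semantic_id(identifier):
    """Drop the version specifier of the last component, so that every
    revision of a node shares one semantic identity."""
    parts = identifier.split("/")
    head, _sep, _version = parts[-1].partition(";")
    parts[-1] = head
    return "/".join(parts)

print(semantic_id("Perl;3"))             # → Perl: all revisions are about the same Perl
print(semantic_id("Perl;3/meta;4"))      # → Perl;3/meta: about version 3 of the Perl node
print(semantic_id("Perl;3/meta;4/meta")) # last component unversioned, so unchanged
```

Note how this yields the three desired properties: Perl and Perl;3 collapse to one semantic URI, while Perl;3/meta;4 still identifies a statement target about version 3 of the Perl node.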
This is consistent with the ‘node is resource itself; meta-node is node’ approach to converting typed links into statements. Additionally, we have no need to attempt to express that Perl and Perl_3 are the same semantic resource, as this mechanism allocates them the same URI. It should be stressed that namespacing components of the node identifier cannot have versions attached, as any versioned namespace content does not affect the node content. For example, Languages_2.Perl_3 is not a valid identifier, and would be isomorphic to Languages.Perl_3 if it were.

Representations. Each node URI should have a representation predicate which specifies a retrievable URL for a representation. (We do not claim to have any authority to provide a representation of Perl itself, merely our node about Perl.) For example, (wiki:node/Perl, representation, http://wiki.example.org/content/Perl.html). There may be more than one, in which case each should have a different MIME type. Multiple representations are derived from the content: for example, rendering a DOM tree of markup to an XHTML fragment. Hence, the range of MIME types is a feature of the rendering components available in the wiki software to convert from the node’s content. Should an HTTP client request the wiki:node/Perl resource itself, HTTP content negotiation should be used to redirect to the best-matching representation. In the spirit of the ‘303 convention’ [9], if the HTTP client requests RDF, they should be redirected to data about the requested URI: i.e. one meta-level higher. This inconsistency is unfortunately a result of the way the convention assumes that all use of RDF must necessarily be ‘meta’ in nature, but we have considered it preferable to be consistent with the convention than to unexpectedly return data (rather than metadata) as RDF in what is now an ambiguous case.
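The dispatch rule just described might be sketched like this; it is a hedged illustration only (the status code and URI prefixes follow the conventions named above, but the function, its signature, and the redirect targets are our assumptions):

```python
RDF_TYPES = {"application/rdf+xml", "text/turtle"}

def respond(identifier, accept):
    """Content negotiation for wiki:node/<identifier>: RDF requests are
    303-redirected one meta-level higher; others go to a rendered form."""
    if accept in RDF_TYPES:
        # RDF request: redirect to data *about* the requested node
        return (303, "wiki:node/" + identifier + ".meta")
    # otherwise, redirect to the best-matching rendered representation
    return (303, "wiki:content/" + identifier + ".html")

print(respond("Perl", "text/html"))    # → (303, 'wiki:content/Perl.html')
print(respond("Perl", "text/turtle"))  # → (303, 'wiki:node/Perl.meta')
```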
Clients which wish to actually request the Perl node’s content itself in an RDF format, should such exist, must find the correct URI for it (e.g. wiki:content/Perl.ttl) via the representation statements. Requests to resource URIs (e.g. wiki:resource/Perl) are only meaningful in terms of the 303 convention, redirecting RDF requests to data about wiki:node/Perl. There are no representations available in the wiki for these base resources—only for nodes about them—so any non-RDF type must therefore be ‘Not Found’.

2.2 Complex Nodes

We can build upon this base to model parametric nodes, whose content may be affected by some input state.

Node identity. MediaWiki’s form of transclusion, ‘templates’, also provides for arguments to be passed to the template, which can then be substituted in. This is in keeping with the general MediaWiki paradigm that templates are solely for macro processing of pages. We propose a generalisation, whereby pages may be instantiated with arbitrary key/value pairs. The range of our links are node identifiers, so we consider these parameters as part of the identity of an instantiation in a (likely infinite) multi-dimensional space of instances.

P. Boulain, N. Shadbolt, and N. Gibbins

Fig. 2. Exemplary parametric node (Template.GoodNode: “This node is featured in ⟨param topic⟩ in particular because of its ⟨param virtue⟩”)

Fig. 3. Instance-space of a parametric node (instances of Template.GoodNode along the topic and virtue axes, e.g. {topic→science, virtue→citations}, {topic→art, virtue→grammar})

Figure 3 shows a subset of the instance space for a node, figure 2, which has parameters topic and virtue. There is assumed to be an instance at any value of these parameters, although evidently all such instances are ‘virtual’, with their content generated from evaluating the parametric Template.GoodNode node.
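The content-negotiation behaviour described above can be pictured as a small dispatch routine. This is a hedged sketch only: the MIME table, the '.meta' suffix for the meta-level redirect, and the URL layout are assumptions based on the examples in the text, not the system's actual API.

```python
# Hypothetical representation statements for one node (assumed layout).
REPRESENTATIONS = {
    "wiki:node/Perl": {
        "text/html": "http://wiki.example.org/content/Perl.html",
        "text/turtle": "http://wiki.example.org/content/Perl.ttl",
    },
}

RDF_TYPES = {"text/turtle", "application/rdf+xml"}

def respond(uri: str, accept: str):
    """Return (status, location) for a request against a node URI."""
    if accept in RDF_TYPES:
        # '303 convention': an RDF request against the node URI is
        # answered with data ABOUT that URI, one meta-level higher,
        # even if an RDF representation of the content exists. Clients
        # wanting the content as RDF must follow the representation
        # statements to its own URI instead.
        return (303, uri + ".meta")
    reps = REPRESENTATIONS.get(uri, {})
    if accept in reps:
        return (303, reps[accept])   # redirect to best-matching representation
    return (404, None)               # no representation: 'Not Found'
```

For instance, a `text/html` request for wiki:node/Perl is redirected to the XHTML rendering, while a `text/turtle` request is redirected to the meta-level, and an unsupported type yields 404.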
We do not use an (identifier, parameters) pair, as this does not fit the Semantic Web model that any resource worth making statements about should have identity. Granting instances in-system identity is useful, as it encapsulates all necessary context into one handle. To guarantee that all isomorphic instantiations of a page use the same identifier, parameters must be sorted by key in the identifier. Note that this is orthogonal to user interface concerns—the restriction is upon the identity used by links to refer to ‘this instance of this node with these parameters’, not upon the display of these parameters when editing such a link. As with revision specifiers, parameters upon namespace components of the identifier are meaningless and forbidden. Within the node’s content, parameters may be used to fill in placeholders in the DOM tree. These placeholders may have default values should the parameter not be provided; the default default is to flag an error. For example, a parameter may be used to fill in a word or two of text, or as the target of a link. User interface operations upon Foo {bar→baz}’s content, such as viewing the history, and editing, should map through to Foo, as the instance has no content of its own to operate upon.

Because we model parameterised nodes as a set of static objects with first-class identity which are simply instantiations of a general node, identifiers which do not map to a valid instantiation of a node could be considered non-existent targets. For example, an identifier which specifies a non-existent parameter.

Resource identity. We must consider whether such instances are separate Semantic Web resources from each other, and from the parametric node from which their content is derived. As with version specifiers, parameters affect the content of a node, not the resource which it describes.
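The sorted-by-key rule can be sketched directly. The `{k→v}` rendering mirrors the notation used in the text; the function itself is an illustrative assumption, not the system's API.

```python
def instance_identifier(node: str, params: dict) -> str:
    """Build a canonical identifier for a parametric node instance.
    Parameters are sorted by key so that all isomorphic
    instantiations of a page share one identifier."""
    if not params:
        return node
    body = ", ".join(f"{k}→{v}" for k, v in sorted(params.items()))
    return f"{node} {{{body}}}"

# Both orderings of the same parameters yield the same identity:
a = instance_identifier("Template.GoodNode",
                        {"virtue": "citations", "topic": "science"})
b = instance_identifier("Template.GoodNode",
                        {"topic": "science", "virtue": "citations"})
assert a == b == "Template.GoodNode {topic→science, virtue→citations}"
```

Because the identifier is canonical, two links written with the parameters in different orders refer to the same instance, as required for instances to serve as link targets with stable identity.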
Because the Perl node represents Perl itself, it follows that Perl {bar→baz} still represents Perl. However, as with version specifiers, these node instances still have distinct identity as nodes. As Perl.meta represents the Perl node, so does Perl {bar→baz}.meta represent the Perl {bar→baz} node. Therefore, we can form a URI for a parametric node instance in exactly the same way we form URIs for specific revisions, defined in section 2.1. In brief, the final set of parameters is dropped. RDF expressions of the hyperstructure should specify that parametric node instances, where used, are derivations of other nodes. For example, remembering that we are making a statement about Perl nodes, not about Perl itself, (Perl {bar→baz}.meta, templatedFrom, Perl.meta).

Eager vs. lazy evaluation. The infinite space of non-link parametric node instances can be considered not to exist until they are specified as a link target, as their existence or non-existence in the absence of explicit reference is irrelevant. However, if we also allow parameter values to substitute into the attributes of a node, we can create parametric links. Parametric node instances which are links have the ability to affect parts of the hyperdocument outside of their own content and relations: this is the nature of first-class links. Hence we must consider whether parametric node instantiation, at least for link nodes, is eager (all possible instances are considered to always exist) or lazy (instances only exist if they are explicitly referred to).

Fig. 4. Free-variable parametric link (Template.FancyLink: type Link; source param(from); target param(to); decoration fancy)

Figure 4 highlights a case where this distinction is particularly significant. With lazy evaluation, this template could be used as a macro, in a ‘classical’ wiki style, to create links.
One would have to create links to instances of this link, which would then cause that particular instance to exist and take effect, linking its from and to parameters. An eager approach to evaluation would, however, treat parametric links as free-variable rules to satisfy. All possible values of from and to would be matched, and linked between. In this case, every node in the hyperdocument would be linked to every other node. Logically, eager evaluation is more consistent, and potentially more useful: free-variable links are of little utility if one has to explicitly provide them with possible values. It would be better to manually link the nodes, with a type of FancyLink which is then defined to be fancy. If there were some content provided by the Template.FancyLink template, it could still be used, but would simply display this content rather than actually functioning as a link.

This is contrary to common practice on Semantic MediaWiki, which has evolved from practice on Wikipedia, where the templating system works via macro evaluation. We argue that this leads to bad ontology modelling, as class definitions end up embedded within display-oriented templates, such as ‘infoboxes’. For example, the common Semantic MediaWiki practice to provide the node about Brazil with a relational link to its capital Brasília would be to include a template in the Brazil node with the parameter capital→Brasília. The template would then contain markup to display a panel of information containing an embedded link of type has capital to the value of the capital parameter.1 The problem is that stating that templates have capitals is clearly not correct, and only results in correct information when they are macro-expanded into place. Statements about the template itself must be ignored, as they are likely intended to be about whichever nodes use that template.
In addition, what could be a statement about the class of countries—that they are the domain of a has capital property—is entangled with the display of this information. A better approach would be to simply assert the capital on the Brazil page, and then transclude a template whose only role is to typeset the information panel, using only the name of the transcluding page as an implicit parameter. This approach emphasises the use of correct semantics, and using these to inform useful display, rather than ‘hijacking’ useful display to try to add semantics.

Templating. Templating can be achieved through the use of parametric nodes and transclusion. Simple macroing functionality, as in contemporary wiki systems, is possible by transcluding a particular instance of a parametric node which specifies the desired parameter values. It should be stressed that parametric nodes are not, however, a macro preprocessing system. As covered in section 2.2, parametric links are eagerly evaluated: i.e. they are treated as rules, rather than macros which must be manually ‘activated’ by using them in combination with an existing node. In general, use of macroing for linking and relations is discouraged, as it is better expressed through classes of relation.

1 This example is closely based upon a real case: http://www.semanticweb.org/wiki/Template:Infobox_Country

Fig. 5. Linking in the model (the node Phil natively transcludes its anchor Phil.anchor.1; the first-class link Phil.link.Perl.1 has type Likes, source Phil.anchor.1, and target Perl)

2.3 Links

Open Weerkat is an open hypermedia system, so links are first-class: all links are nodes. Nodes which have linking attributes are links. To maintain a normal wiki interface, we present links in an embedded form.

Embedded. Figure 5 shows user-level linking.
As presented to the user in an example plaintext markup, the source for the Phil node would be: Likes [link type=Likes to=Perl Perl], an [em elegant] language. We use edit-time transclusion, where transcluded text is displayed in-line even during the editing of a node, to present the user with the familiar and direct model of embedded linking, but map this into an open hypermedia model. The link element, when written, separates out the link text as a separate, ‘anchor’ node, and is replaced with native transclusion. A first-class link is then created from the anchor node to the link target. The identity of this link is largely arbitrary, so long as it is unique. Native transclusion is here an optimisation for creating a named, empty anchor in the DOM, then maintaining a link which transcludes in the node of the same name. It is also considered meronymous: a link involving an anchor is considered to relate the node to which that anchor belongs. Because native transclusion is entirely implicit, only the owning node can natively transclude its anchors. When editing the node again, the anchor is transcluded back into the node, and converted into a link element with targets from all links from it. (Depending on the exact markup language used to express the DOM for editing, this may require multiple, nested link elements.)

This guarantees that each anchor has a full identity (a node title) in the system. It does not, however, immediately provide a solution to ‘the editing problem’—a longstanding issue in hypertext research [5], where changes to a document invalidate external pointers into that document. The anchor names are not here used in the plaintext markup, so ambiguity can arise when they are edited. It should thus be possible to specify the anchor name (as a member of the Node.anchors namespace) for complicated edits: Likes [link anchor=1 type=Likes to=Scheme Scheme]...
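The separation step (splitting an embedded link element out into an anchor node plus a first-class link, and leaving a native transclusion behind) can be sketched as a toy rewriter. The markup syntax follows the Phil example above; the node-naming scheme and the returned record shapes are illustrative assumptions, not Weerkat's actual data model.

```python
import re

# Matches '[link type=T to=Target link text]' as in the example markup.
LINK = re.compile(r"\[link type=(\w+) to=(\w+) ([^\]]+)\]")

def separate_links(owner: str, source: str):
    """Split embedded link elements out of a node's markup, yielding
    anchor nodes and first-class link nodes, and rewriting the markup
    to natively transclude each anchor in place."""
    anchors, links = {}, {}
    counter = [0]

    def replace(match):
        counter[0] += 1
        anchor = f"{owner}.anchor.{counter[0]}"
        link = f"{owner}.link.{match.group(2)}.{counter[0]}"
        anchors[anchor] = match.group(3)              # the separated link text
        links[link] = {"type": match.group(1),
                       "source": anchor,
                       "target": match.group(2)}
        return f"[transclude {anchor}]"               # native transclusion

    return LINK.sub(replace, source), anchors, links

text, anchors, links = separate_links(
    "Phil", "Likes [link type=Likes to=Perl Perl], an elegant language.")
```

Running this on the Phil example produces the structure of figure 5: an anchor node Phil.anchor.1 holding the text “Perl”, and a first-class link Phil.link.Perl.1 with type Likes, source Phil.anchor.1, and target Perl.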
A graphical editor could treat the link elements as objects in the document content which store hidden anchor identities, providing this in all cases. Note that the link’s properties in figure 5 are stored as attributes. Theoretically, in the RDF mapping described in section 2.1, an attribute-value pair (source, Phil.anchor.1) in the node Phil.link.Perl.1 is identical to a link of type source from the link to the anchor. However, such an approach would become infinitely recursive, as the source link’s source could again be described in terms of links. The attribute-value pairs thus provide a base case with which we can record basic properties needed to describe first-class links.

Transclusive. Transclusive links can be used to construct composite nodes. A link is transclusive if its type is a specialisation of Transclusion. A transclusive link replaces the display of its source anchor contents with its target contents. Unlike the ‘native transclusion’ in section 2.1, user-level transclusive links do not imply a part-of relation. This is because any part-of relation would be between the representations of the nodes, not the resources that the nodes represent. To extend the Brazil example in section 2.2, a country information box is not part of Brazil; instead the Infobox Country node is part of the Brazil node. Edit-time transclusion is user-interface specific, although quite similar to the issues already covered in section 2.3 with the native transclusion performed by embedded anchors. For a simple, text serialisation interface, such as a web form, it is possible to serialise the transcluded content in-place with a small amount of surrounding markup; if the returned text differs, this is an edit of the transcluded node. Again, richer, graphical editors can replace this markup with subtler cues.

Open.
To realise first-class links while retaining a standard wiki embedded-style editing interface, we have modified Weerkat to work upon document trees, into which links can be embedded, and from which links can be separated. These embedding and separation routines rewrite documents into multiple nodes as is necessary. Transclusion, be it presented at edit-time, or for viewing, is possible via the same mechanism: including the target content within the link’s now-embedded anchor. To embed a link back into a document, including in order to create an XHTML representation of it for display and web navigation, it must be determined which links are applicable to the document being processed. For this, we have defined a new type of module in the system: a link matcher. Link matchers inspect the endpoints of links and determine if the document matches the endpoint criteria.

For straightforward, literal links, this is a simple case of identity equality between the endpoint’s named document, and the current document. As part of the storage adaptation for first-class linking, we have introduced an attribute cache, which is fundamentally a triple store whose contents are entirely derived from the attributes of each node. As well as eventually being a useful way to interact with the semantic content of the wiki, this allows us to implement link matching in an efficient way, by querying upon the store. For example, in the literal endpoint case, assuming suitable prefixes and subtype inference, we can find such links with a set of simple SPARQL queries, selecting ?l where:

1. { ?l type link . ?l source Scheme . }
2. { ?l type link . ?l source Scheme_5 . }
3. { ?l type link . ?l target Scheme . }
4. { ?l type link . ?l target Scheme_5 . }

The first two queries find links where this node is a source; the latter two, where it is a target.
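The attribute cache can be pictured as a plain set of (subject, predicate, object) triples derived from node attributes, and the four queries above as one membership test over it. The following pure-Python sketch makes that concrete; the example triples and the function are our illustration, not Weerkat's storage layer.

```python
# A toy stand-in for the attribute cache: triples derived from the
# attributes of link nodes. Node and property names follow the queries
# in the text; the concrete links are invented for illustration.
triples = {
    ("Phil.link.Scheme.1", "type", "link"),
    ("Phil.link.Scheme.1", "source", "Phil.anchor.1"),
    ("Phil.link.Scheme.1", "target", "Scheme"),
    ("Doc.link.1", "type", "link"),
    ("Doc.link.1", "source", "Scheme_5"),
    ("Doc.link.1", "target", "Essays"),
}

def matching_links(node: str, version: str):
    """Find links applicable to a document: those of type link whose
    source or target endpoint names either the node itself or the
    specific version of it, mirroring queries 1-4 above."""
    endpoints = {node, version}
    return {s for (s, p, o) in triples
            if (s, "type", "link") in triples
            and p in ("source", "target")
            and o in endpoints}

# Scheme is the target of one link, and Scheme_5 the source of another:
assert matching_links("Scheme", "Scheme_5") == {"Phil.link.Scheme.1", "Doc.link.1"}
```

In the real system the same test is a handful of SPARQL queries against the triple store, which scales better than scanning every link node's attributes on each render.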
We must also find links from or to the specific version of the current node, which is provided by queries two and four. This approach can be extended to deal with endpoints which are not literal, which we consider ‘computed’.

Query. Query endpoints are handled as SPARQL queries, where the union of all values of the selected variables is the set of matched pages. For example, a query endpoint of SELECT ?n WHERE { ?n Paradigm Functional . } would link from or to all functional programming languages. This kind of endpoint can be tested for a specific node via a SPARQL term constraint: SELECT ?n WHERE { ?n Paradigm Functional . FILTER ( ?n = Scheme ) } If multiple variables are selected, the filter should combine each with the logical or operator, so as to retrieve any logically-sound solution to the query, even if some of the variables involved are not the node we are interested in linking with.

Generic. Generic endpoints can be implemented as a filtering step on query endpoints.2 We define a postcondition CONTAINS ( ?n, "term" ) to filter the solutions by those where the node n contains the given term. This postcondition can be implemented efficiently by means of a lexicon cache, from each term used by any generic link, to a set of the nodes using that term. Changes to generic links add or remove items from the lexicon, and changes to any node update the sets for any terms they share with the lexicon. If CONTAINS is used alone, n is implied to be the universal set of nodes, so matching is a simple lexicon lookup. To be useful for generic linking, CONTAINS implies an anchor at the point of the term when it is used as a source endpoint. For example, CONTAINS ( ?n, "Scheme" ) matches the Scheme node, but should link not from the entire node, but from the text “Scheme” within it. For user interface reasons, it is desirable to restrict this only to the first occurrence of the term for non-transclusive links, so that the embedded-link document is not peppered with repeated links. For transclusive links, however, it is more consistent and useful to match all occurrences. While transclusive generic links are a slightly unusual concept, it is possible that users will find innovative applications for them. For example, if it is not possible to filter document sources at a node store level for some reason, a generic, transclusive link could be used to censor certain profane terms. Multiple CONTAINS constraints can be allowed, which require that a node contains all of the terms. Any of the terms are candidates for implicit anchors: i.e. whichever occurs first will be linked, or all will be replaced by transclusion.

2 An alternative approach may be to assert triples of the form (Scheme, containsTerm, term), but this would put a great load on the triplestore for each content edit.

Parametric. We can use SPARQL variables for parametric links. Every SPARQL variable is bound to the parameter element in the node’s DOM tree with the same name: variables and parameters are considered to be in the same namespace. This allows the content to reflect the query result which matched this link. If the query allows OPTIONAL clauses which can result in unbound variables, then they could potentially have values provided by defaults from the parameter definitions in the DOM. Default values are meaningless for parameters which appear as compulsory variables in the query, as the query engine will either provide values, or will not match the link. Parametric links may have interdependent sources and targets, in which case they are simple functional links (the source can be a function of the target, and the target an inverse function of the source). Link matching is performed pairwise for all source and target combinations.
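The lexicon cache described for CONTAINS can be sketched as a small index from terms to node sets, kept current by the two update paths the text names: generic-link changes and node edits. The class shape and naive substring matching are our assumptions for illustration.

```python
from collections import defaultdict

class LexiconCache:
    """Sketch of a lexicon cache: maps each term used by any generic
    link to the set of nodes whose content contains that term."""

    def __init__(self):
        self.terms = defaultdict(set)   # term -> nodes containing it
        self.content = {}               # node -> its current text

    def add_generic_term(self, term):
        """A generic link now uses `term`: index the existing nodes."""
        nodes = self.terms[term]        # ensure the term is tracked
        for node, text in self.content.items():
            if term in text:
                nodes.add(node)

    def node_edited(self, node, text):
        """On any node edit, update the sets for all known terms."""
        self.content[node] = text
        for term, nodes in self.terms.items():
            (nodes.add if term in text else nodes.discard)(node)

    def contains(self, term):
        """CONTAINS used alone: matching is a simple lexicon lookup."""
        return self.terms.get(term, set())

cache = LexiconCache()
cache.node_edited("Scheme", "Scheme is a lambda-based language.")
cache.node_edited("Perl", "Perl is a practical language.")
cache.add_generic_term("lambda")
assert cache.contains("lambda") == {"Scheme"}
```

An edit that adds or removes a cached term immediately updates the affected sets, so generic-link matching never rescans the whole node store.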
For example, consider a link with these select endpoints:

source: ?thing WHERE { ?thing Colour Red . }
target: ?img WHERE { ?img Depicts ?thing . }
target: ?img WHERE { ?img Describes ?thing . }

This would create links from all nodes about things which are red, to all nodes which depict or describe those red things. To perform this match, we union each pair of the clauses into a larger query:

SELECT ?thing ?img WHERE { ?thing Colour Red . ?img Depicts ?thing . FILTER ( ?thing = Scheme || ?img = Scheme ) }

A similar query would also be performed for Describes. Note that we may receive values for the variables used as source or target which are not the current node if it matches in the opposite direction. We must still check that any given result for the endpoint direction we are interested in actually binds the variable to the current node. In this example, the current node Scheme is not Red, so the query will not match, and no link will be created.

The pairwise matching is to be consistent with the RDF representation presented in section 2.1, and the ‘or’ nature of matching with static endpoints: a link must only match the current node to be used, and other endpoints may be dangling. An alternative approach would be to create a ‘grand union’ of all sources and targets, such that all are required to be satisfied. Neither approach is more expressive at an overall level: with a pairwise approach, a single target endpoint can include multiple WHERE constraints to require that all are matched; with a union approach, independent targets can be achieved through use of multiple links (although they would no longer share the same identity). The union approach is more consistent with regard to the interdependence of variables; with the pairwise approach, one matching pair of source/target endpoints may have a different variable binding for a variable of the same name to another.
However, it loses the RDF and static endpoint consistency. Ultimately, the decision is whether the set of targets is a function of the set of sources (and vice versa with the inverse), or if it is the mapping of a function over each source. In lieu of strong use cases for n-ary, interdependent, parametric links (most are better modelled as separate links), we choose the former for its greater consistency, and ability for a single link to provide both behaviours.

Functional. We also give consideration to arbitrarily-functional links. These are computationally expensive to match in reverse (i.e. for target-end linking and backlinks) unless the functions have inverses. We do not currently propose the ability for users to write their own Turing-complete functions, as the complexity and performance implications are widespread. However, we can potentially provide a small library of ‘safe’ functions: those with guaranteed characteristics, such as prompt termination. One such example which would be of use is a ‘concatenate’ function:

source: SELECT ?n WHERE { ?n type ProgLang . }
target: CONCAT( "Discuss.", ?n )

This would be a link from any programming language to a namespaced node for discussing it. However, it highlights the reversibility problem: the inverse of CONCAT has multiple solutions. For example, “ABC” could have been the result of CONCAT(“A”, “BC”), CONCAT(“AB”, “C”), or a permutation with blank strings. Hence, while it is easy to match the source, and then determine the target, it is not practical to start with the target and determine the source. We suggest that any endpoint which is an arbitrary function of others in this manner must therefore only ever be derived. Matching is performed against all other endpoints, and then the functional endpoints are calculated based on the results.
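The reversibility problem is easy to demonstrate concretely: the forward direction of concatenation is a single value, while the inverse is a whole family of splits. A minimal illustration (the helper names are ours):

```python
def concat(a: str, b: str) -> str:
    """The forward direction is trivial: one pair, one result."""
    return a + b

def concat_inverses(result: str):
    """The inverse is one-to-many: every split point of the result
    (including empty halves) is a valid pre-image. This is why a
    CONCAT endpoint can only ever be derived from the other
    endpoints, never matched in reverse."""
    return [(result[:i], result[i:]) for i in range(len(result) + 1)]

assert concat("Discuss.", "Scheme") == "Discuss.Scheme"
# "ABC" has four possible pre-images, so reversing is ambiguous:
assert concat_inverses("ABC") == [
    ("", "ABC"), ("A", "BC"), ("AB", "C"), ("ABC", "")]
```

A string of length n has n+1 pre-images under binary concatenation alone, so backlink matching against a CONCAT target would have to enumerate and test them all, which motivates the derive-only rule above.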
A link from CONCAT( ?n, ".meta" ) to CONTAINS( ?n, "lambda" ) would only ever match as a backlink: showing that any node containing ‘lambda’ would have been linked from its meta-node, without actually showing that link on the meta-node itself. A link with only arbitrarily functional endpoints will never match and is effectively inert.

3 Conclusions

In this paper, we have approached the perceived requirement for a more advanced communally-editable hypertext system. We have presented a solution to this as a model for a ‘semantic open hyperwiki’ system, which blends semantic wiki technology with open hypertext features such as first-class linking. We also offer an approach to implementing the more advanced link types with a mind towards practicality and computational feasibility. Providing users with stronger linking and transclusion capabilities should help improve their efficiency when working on editing wikis such as Wikipedia. Interdocument linking forms a major component of current editing effort, which we hope to help automate with generic links. Content re-use is complicated by surrounding context, but even in cases where texts could be shared, technical usability obstacles with current macro-based mechanisms discourage editors from doing so. We address this with the concept of edit-time transclusion, made possible by the wiki dealing with programmatically manipulatable tree structures. Beyond this, we wish to address other user-study-driven design goals, such as improving versioning support that allows for branching.

References

1. Berners-Lee, T., Cailliau, R., Groff, J.-F., Pollermann, B.: World-Wide Web: The Information Universe. Electronic Networking: Research, Applications and Policy 1(2), 74–82 (1992)
2. Boulain, P., Parker, M., Millard, D., Wills, G.: Weerkat: An extensible semantic wiki. In: Proceedings of 8th Annual Conference on WWW Applications, Bloemfontein, Free State Province, South Africa (2006)
3.
Boulain, P., Shadbolt, N., Gibbins, N.: Studies on Editing Patterns in Large-scale Wikis. In: Weaving Services, Location, and People on the WWW, pp. 325–349. Springer, Heidelberg (2009) (in publication)
4. Bush, V.: As We May Think. The Atlantic Monthly 176, 101–108 (1945)
5. Davis, H.: Data Integrity Problems in an Open Hypermedia Link Service. PhD thesis, ECS, University of Southampton (1995)
6. Krötzsch, M., Vrandečić, D., Völkel, M.: Wikipedia and the semantic web - the missing links. In: Proceedings of WikiMania 2005 (2005), http://www.aifb.uni-karlsruhe.de/WBS/mak/pub/wikimania.pdf
7. Manola, F., Miller, E.: RDF Primer. Technical report, W3C (February 2004)
8. Nelson, T.: Literary Machines, 1st edn. Mindful Press, Sausalito (1993)
9. Sauermann, L., Cyganiak, R., Völkel, M.: Cool URIs for the Semantic Web. Technical Report TM-07-01, DFKI (February 2007)

Implementing a Corporate Weblog for SAP

Justus Broß1, Matthias Quasthoff1, Sean MacNiven2, Jürgen Zimmermann2, and Christoph Meinel1
1 Hasso-Plattner-Institut, Prof.-Dr.-Helmert-Strasse 2-3, 14482 Potsdam, Germany
{Justus.Bross,Matthias.Quasthoff,Office-Meinel}@hpi.uni-potsdam.de
2 SAP AG, Hasso-Plattner-Ring 7, 69190 Walldorf
{Sean.MacNiven,Juergen.Zimmermann}@sap.com

Abstract. After web 2.0 technologies experienced a phenomenal expansion and high acceptance among private users, considerations are now intensifying as to whether they can be equally applicable, beneficially employed and meaningfully implemented in an entrepreneurial context. The fast-paced rise of social software like weblogs or wikis, and the resulting new form of communication via the Internet, is however viewed ambivalently in the corporate environment. This is why the particular choice of the platform or technology to be implemented in this field is strongly dependent on its future business case and field of deployment, and should therefore be carefully considered beforehand, as this paper strongly suggests.
Keywords: Social Software, Corporate Blogging, SAP.

J.G. Breslin et al. (Eds.): BlogTalk 2008/2009, LNCS 6045, pp. 15–28, 2010. © Springer-Verlag Berlin Heidelberg 2010

1 Introduction

The traditional form of controllable, mass-media, uni-directional communication is increasingly being replaced by a highly participative and bi-directional communication in the virtual world, which proves to be essentially harder to direct or control [2][13]. For a considerable share of companies this turns out to be hard to tolerate. The use case of a highly configured standard version of an open source multi-user weblog system for SAP – the market and technology leader in enterprise software – will form the basis for the paper outlined here. SAP requested the Hasso Plattner Institute (HPI) to realize such a weblog to support its global internal communications activities. In the current economic environment, and with the changes in the SAP leadership, an open and direct exchange between employees and the executive board was perceived as being critical to provide utmost transparency into the decisions taken and guidance for the way forward. Recent discussions about fundamental and structural changes within SAP have clearly shown that need for direct interaction. SAP and HPI therefore agreed to share the research, implementation and configuration investments necessary for this project – hereafter referred to as “Point of View”, or shortly POV. The platform went online in June 2009, and is at this moment beginning to gain first acceptance among SAP employees worldwide. To leverage the experiences and expert knowledge gained in the course of the project, we will elaborate upon the following research question from an ex post perspective: “What are critical key success factors for the realization of a corporate weblog in an environment comparable to the setting of SAP’s Point of View?”
This paper will start with a short treatise about social software in general and weblogs in particular in section 2, followed by a more elaborate and thorough analysis in section 3 of the capabilities and challenges of weblogs in the corporate environment, including their forms of deployment (e.g. CEO blog, PR blog, internal communications tool), success factors (e.g. topicality, blogging policies, social software strategies), risks (e.g. media conformity, institutional backup, resource planning, security- and image-related issues) and best-practice examples. Section 4 is dedicated to the use case of the POV project, beginning with an introduction about SAP’s motivation to have such a platform developed and the overall scope of the project. The subsequent paragraph will provide an overview of all technical development, implementation and configuration efforts undertaken in the course of the project. In doing so, it will elaborate upon the precondition of SAP’s works council to have an anonymous rating functionality, the prerequisite to connect the standard blogging software with the authentication systems in place at SAP (LDAP, SSO, etc.), and the precondition to realize the blog on the basis of a multi-user version of the standard blogging software WordPress. Design issues like the integration of the blog into the SAP web portal as well as the CI/CD guidelines of SAP are also mentioned. A conclusion and the obligatory reference list complete the paper.

2 What Is a Weblog?

A weblog – a made-up word composed of the terms “web” and “log” – is no more than a specific website, whose entries, also known as “posts”, are usually written in reverse chronological order with the most recent entry displayed first. Initially, it was meant to be an online diary. Nowadays, there are countless weblogs around, each covering a different range of topics.
Single blog posts combine textual parts with images and other multimedia data, and can be directly addressed and referenced via a URL (Uniform Resource Locator) in the World Wide Web. Readers of a blog post can publish their personal opinion about the topic covered, in a highly interactive manner, by commenting on the post. These comments can however be subject to moderation by the owner of a blog.

2.1 Social Software

While the first blogs around were simple websites that were regularly updated with new posts (or comments), we witnessed the emergence of so-called “Blog Hosting Services” by the end of the ‘90s. Service providers like, for instance, WordPress, Serendipity, MovableType, Blogspot or Textpattern1 offered a user-friendly and ready-made blog service that even allowed non-expert users to generate and publish content accessible to all Internet users. Everybody capable of using a simple text-editor program could thus actively take part in the unconfined exchange of opinions over the web [35]. Nowadays, weblogging systems are more specialized, but still easy-to-use Content Management Systems (CMS) with a strong focus on updatable content, social interaction, and interoperability with other Web authoring systems. The technical solutions agreed upon among developers of weblogging systems are a fine example of how new, innovative conventions and best practices can be developed on top of existing standards set by the World Wide Web Consortium and the community. Applications like these, which offer a simplified mode of participation in today’s Internet in contrast to earlier and traditional web applications, were in the following described as “Web 2.0 applications”. The concurrently developing “Participation Internet” is to the present day referred to as the “Web 2.0” [25].

1 www.wordpress.org; www.s9y.org; www.movabletype.org; www.textpattern.com; www.blogger.com
This cumulative "social" character of the Internet is contrary to traditional mass media like the printing press, television or radio, since these only offer a unidirectional form of communication. The Internet, in turn, offers all its users real interaction, communication and discussion. This is also why blogs – next to podcasts – are referred to as the most frequently used "social media tools" [26].

2.2 Features

One prominent feature of weblogging systems is the so-called feed, an up-to-date table of contents of a weblog's content. Feeds are exchanged in standardized, XML-based formats like RSS or Atom, and are intended to be used by other computer programs rather than being read by humans directly. Such machine-readable tables of contents of websites opened a whole new way for users to consume content from various websites. Rather than having to frequently check different websites for updates, users can subscribe to feeds in so-called aggregators, i.e., software that automatically notifies subscribers upon content updates. Feeds from different sources can even be mixed, resulting in a highly customized subscription to web content from different sources [1]. Such syndicated content can then be consumed as a push medium on top of the pull-oriented World Wide Web architecture. One popular extension of feed formats is the podcast, which has additional media files, such as audio or video broadcasts, attached. Social interaction is another important aspect of weblogging systems, which form a notable part of the so-called Social Web. The most visible method of social interaction is inviting readers to comment on and discuss postings directly within the weblogging system. Weblogs also introduced more subtle, interesting means of interaction. To overcome the limiting factor of HTTP-based systems being aware only of outbound hyperlinks, different types of LinkBacks have been developed.
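One LinkBack variant, the ping-back notification discussed below, amounts to a single XML-RPC call from the weblog that published a link to the weblog being linked to. The following is a hedged sketch using only Python's standard library; the two post URLs are invented for illustration, and a real receiver would additionally fetch the source page to verify the claimed link:

```python
import threading
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCServer

# Receiving side: the linked weblog exposes pingback.ping. A real
# implementation would fetch source_uri and verify that it actually
# contains a hyperlink to target_uri before recording the LinkBack.
def pingback_ping(source_uri, target_uri):
    print(f"incoming link: {source_uri} -> {target_uri}")
    return "Pingback registered."

server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(pingback_ping, "pingback.ping")
threading.Thread(target=server.serve_forever, daemon=True).start()

# Sending side: the weblog that published the new link notifies the target.
host, port = server.server_address
proxy = xmlrpc.client.ServerProxy(f"http://{host}:{port}/")
result = proxy.pingback.ping(
    "http://blog-a.example/posts/42",  # posting that contains the link
    "http://blog-b.example/posts/7",   # posting being linked to
)
server.shutdown()
print(result)  # Pingback registered.
```

The symmetric effect described in the text arises because the receiver, after verification, stores a back-reference to the source posting.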
These will automatically detect incoming hypertext links from one weblog posting to another, and will insert a link from the original link target back to its source, hence making hypertext links symmetrical. Such links can be detected, e.g., using the often disregarded Referer [sic] header of an HTTP transmission, or by actively notifying the link target about the reference. Making hyperlinks symmetrical significantly helps weave a true social web between weblog authors and thus ultimately forms the interconnectivity of the blogosphere.

18 J. Broß et al.

The latter case of weblog systems actively notifying each other is one example of how interoperable weblogging systems are. Many of these systems have an XML-RPC interface, a technology used to control web services using non-browser technologies [36]. This interface can be used to notify about incoming links (so-called ping-backs), but also to author and manage content within the weblogging system, e.g., using mobile phone software. Other promising means of interoperability are upcoming technologies based on Semantic Web standards, such as RDF and SIOC. Using these standards, the structure of a weblog's content and its role in the blogosphere can be expressed and published in a standardized, machine-readable way, which is even more flexible than today's feeds and XML-RPC interfaces [16].

3 Corporate Weblogs – Capabilities and Challenges

Successful enterprises attribute part of their success to effective internal communication, which most employees would describe as direct and open communication with their management. These internal open channels of communication create an atmosphere of respect in which co-worker and manager-employee relationships can flourish; they keep employees excited about their jobs, circulate vital information as quickly as possible and connect employees with the company's goals and vision [5][7][22].
3.1 The Corporate Internal Communications Perspective

While most people consider face-to-face communication the most effective communication tool, it is often too time-consuming, too difficult or too expensive over greater distances of time or space. Print is also no option here, since it is too slow and requires filing as well as complex retrieval and storage systems. Advances in Information and Communication Technologies (ICT) could finally overcome these disadvantages while still allowing for direct and personal interaction. An increasing number of enterprises therefore started to employ weblogs as a complementary tool for their external or internal communications [6][11]. Blogs, however, turned out to be a far more effective tool within the internal corporate environment. Through their application in intranets – enclosed network segments that are owned, operated, controlled and protected by a company – companies could keep track of information and communication more quickly and effectively [23]. Inside the company walls, blogs could furthermore replace an enormous amount of e-mail, spread news more quickly, serve as a knowledge database, or create a forum for collaboration and the exchange of ideas [7][11][12]. Chances are high that companies will become more innovative, transparent, faster and more creative with such instruments [5]. Especially the rather traditional big businesses found the uncontrollable world of the blogosphere hard to tolerate, a world with fundamentally different (unwritten) rules, codes of conduct and pitfalls than they were used to [13][17]. Even traditional hierarchies and models of authority were sometimes questioned when social software projects were initiated [5]. This disequilibrium and radical dissimilarity oftentimes resulted in worst-case scenarios for the public relations departments of major companies that simply did not know how to deal with this new communications tool [2].
However, unlike their equivalents on the public Internet, internal weblogs can be customized to help a company succeed both on the individual and the organizational level.

3.2 Deployment of Corporate Blogs

While first pragmatic systematization efforts of corporate weblogs [20][34] provided a coherent overview of the whole field but lacked a conceptual foundation, Zerfaß and Boelter [33] provided a more applicable reference framework that presents two dimensions in which the distinct forms of corporate weblogs can be located. On the one hand, blogs differ regarding their field of application: internal communications, market communications or PR. On the other hand, blogs can support distinct communication aims, which in turn can be distinguished as informative, persuasive and argumentative processes (see Fig. 1).

Fig. 1. Deployment possibilities for corporate blogs (adapted on the basis of [6])

Since this paper focuses on corporate weblogs applied from the internal communications perspective, we will leave the latter two fields of application, market communications and PR, aside at this point. Knowledge blogs can support a company's knowledge management, because expertise and know-how can be shared on such a platform with fellow employees [12]. A successful collaboration blog like IBM's "Innovation Jam", for instance, brought together employees of IBM's worldwide strategic partners and contractors with IBM's own to spur software innovation [38]. CEOs of major companies like Sun Microsystems, General Motors and Daimler2 or dotcoms like Xing3 are increasingly making use of CEO blogs to address matters of strategic interest and importance for their company's stakeholders [2][4]. While sustainable commitment is highly important for these kinds of blogs, campaigning blogs are temporally limited and rather suited to highly dramaturgical processes of communication.
Topic blogs can, similarly to campaigning blogs, be located within multiple dimensions of Zerfaß' reference framework (see Fig. 1). They are utilized to prove a company's competence in relevant fields of its industry. The graphical position of our use case POV within Zerfaß' framework (see Fig. 1) indicates a profound distinctiveness compared to the other types of weblogs. It ranges over the entire horizontal reach of the framework while being restricted to the "vertical" internal dimension of communication. POV's mandate is, however, biased towards a horizontal coverage similar to that of CEO blogs.

3.3 Success Factors

Even if a blog is professional or oriented towards the company, it is still a fairly loose form of self-expression, since it is the personal attributes of weblogs that make them so effective in the entrepreneurial context [1][7]. Corporate weblogs offer a form of bottom-up approach that stresses the individual and offers a forum for a seamless and eternal exchange of ideas [37]. They allow employees to feel more involved in the company. As usual, however, there is a downside to any killer application. Before a corporate blog is established within a company – no matter whether in an internal or external communications context – the people responsible must address several strategic issues in order to decide on the practicability and meaningfulness of the tool. It is first of all essential to assess whether a blog would be a good fit for the company's values, its corporate culture and its image. The management responsible for any social software project within a company should therefore first of all fully understand the form of communication they are planning to introduce [17]. It might therefore be advisable for the persons in charge to gather personal experience by running their own weblogs, or to at least have other employees test the medium beforehand.
Long-term blog monitoring to systematically observe the formation of opinion in this medium might be of great help here [14]. Furthermore, employees might not enjoy blogging as much as their managers or the communications staff do. To keep internal communications via weblogs as effective as possible, it is essential that all stakeholders commit their time and effort to updating their blogs, keeping them interesting, and encouraging other employees to use them. Social software, after all, only breathes with the commitment of the whole collective. Full and continuing institutional and managerial backup, which can be shaken neither by unexpected nor by sometimes unwanted occurrences, is essential for a successful corporate weblog. However, even if a company decides against a corporate blog, the topic should at all times stay on its agenda [9]. Ignoring the new communications arena of the blogosphere might put your entity at a risk that will grow with the rising importance of the medium [2][3][21]. This holds especially true if your direct competitors are working harder in this direction than you are [19].

2 http://blogs.sun.com/jonathan/; http://fastlane.gmblogs.com/; http://blog.daimler.de/
3 http://blog.xing.com

Blogs can be impossible to control if they are not regulated within certain limitations, codes of conduct and ethics [7][15]. This tightrope walk needs to be managed very carefully, since weblogs tend to be a medium allergic to any kind of regulation or control. But by opening a pipeline to comments from employees without any restrictions, you can reach information glut very quickly, essentially defeating the purpose of the tool [19]. IBM, for instance, was one of the first big businesses to successfully establish a simple and meaningful guideline, known as the "IBM blogging policy", for the proper use of its internal blogs, which was quickly accepted by its employees [24].
Encouraging and guiding your employees to utilize internal blogs may be the most important issue a firm has to address when implementing a blog for its internal communications [8].

4 Case Study: Point of View Platform

Especially in a time of crisis, generating open dialogue is paramount to managing fear and wild speculation, and yet traditional corporate communications remains a largely unidirectional affair. The transition to a new CEO, the global financial crisis and the first lay-offs in the history of the company had generated an atmosphere of uncertainty within SAP. While conversation happened in corridors and coffee corners, there was no way for employees to engage with executives transparently across the company, share their ideas and concerns, and offer suggestions on topics of global relevance; nor was there a consolidated way for executives to gain detailed insight into employee sentiment. But reaching the silent majority and making results statistically relevant requires more than offering the ability to comment on a topic. Knowing the general dynamics of lurkers versus contributors, especially in a risk-averse culture, SAP and the HPI worked together on a customized rating system that would guarantee the anonymity of those participants not yet bold enough to comment under their name, but still encourage them to contribute to the overall direction of the discussion by rating not only the topic itself but also peer comments.

4.1 POV: Scope, Motivation, Vision

To set appropriate expectations, the blog was launched as an online discussion forum rather than as a personal weblog, and was published as a platform for discussion between executives and employees, without placing too much pressure on any one executive to engage.
Launched with the topic of "purpose and values", and as part of the wave of activities around the onboarding of SAP's new CEO, the new platform POV signaled a fundamental shift towards a culture of calculated risk and a culture of dialogue [18]. This culture shift has by now extended well beyond the initial launch of Point of View, with internal blogging becoming one of the hottest topics among executives who want to reach out to their people and identify areas for improvement. Another result of the collaboration has been a fundamental rethinking of the way news is created and published, with the traditional approach of spreading information via HTML e-mail newsletters being challenged by the rollout of SAP's first truly bidirectional newslogs. As employees become more and more acquainted with RSS and the aggregation of feeds, the opportunity to reclaim e-mail for the tasks it was originally designed for is tangibly near. Point of View has been the first step toward ubiquitous dialogue throughout the company, and the approach to facilitating open, transparent dialogue is arguably the single most pivotal enabler of internal cultural transformation at SAP. SAP thus follows the general trend of internationally operating big businesses in Germany, which increasingly employ weblogs in their enterprise (41% of companies with more than 5,000 employees [10]).

4.2 Configuration of the Standard Software to Fit Corporate Requirements

The WordPress MU weblogging system favored as the SAP corporate weblogging system needed, in spite of its long list of features and configuration options, quite some modification to fit the requirements set by the company's plans and corporate policies. Of course, the very central blogging functionality is already provided by WordPress: posts and comments can be created and moderated, and permissions for different user roles can be restricted. Multimedia files can also be embedded in postings.
Postings and comments are by default assigned a permanent, human-readable URI. Furthermore, WordPress already provides basic usage statistics for readers and moderators. One benefit of using a popular weblogging system like WordPress MU, rather than developing a customized system from scratch or using a general-purpose CMS, is that large parts of the actual customizations needed can be achieved using extensions, or plug-ins, to the weblogging system. Using such plug-ins, some of SAP's more specialized requirements could, at least partly, be addressed. One group of plug-ins helped to meet SAP's display-related requirements, e.g., to list comments and replies to comments in a nested (threaded) view. Other plug-ins enable editing of postings and comments even after they have been published, and make it easy to enable or disable discussion for individual postings. Another set of plug-ins was required to highlight specific comments in a dedicated part of the website (see "nested comments" in Fig. 2) and to ensure anonymous voting as demanded by the works council. The last group of plug-ins focused on notifying users about new postings or comments, e.g., via e-mail, and on enhancing WordPress MU's default searching and browsing functionality for postings, comments and tag keywords. The dual-language policy of SAP, which offers intranet web content both in English and German, proved a bigger challenge during development, as all content (postings, comments, category names and tags) and the general layout of the CMS were required to be kept separate by language. The most feasible solution was found to be setting up completely independent weblogs for each language within one shared WordPress MU installation, at the cost of having independent discussions for different languages. Another big issue, which required thorough software development, was fulfilling privacy-related requirements.
Understandably, in a controlled corporate environment with potentially identifiable users, such requirements play a much bigger role than on a publicly available weblogging platform, whose terms of use often reflect technical possibilities rather than corporate policies. Hence, much of the rating and statistics functionality needed adjustments to ensure privacy. Not only were moderators not allowed to see certain figures; it also had to be ensured that such figures were not stored in the database systems at all. This required some changes to the internal logic of otherwise ready-to-use voting and statistics enhancements.

Fig. 2. Seamless Integration of POV in SAP's internal corporate webportal

4.3 Who Are You Really?

Nowhere is it easier to fake one's real identity than in the public space of the Internet, or as Peter Steiner put it in the caption of a cartoon in The New Yorker: "On the Internet, nobody knows you're a dog" [27]. This holds especially true for posts and comments in a blog. Usually, only a valid e-mail address and an arbitrary pseudonym are requested from authors of new posts or comments for their identification. Verification of the e-mail address is, however, limited to its syntax; that is to say, as long as the specified address is in regular form, it is accepted by the system irrespective of the content posted with it. Another customary security mechanism is asking the author of a comment to enter a graphically distorted band of characters, or "captcha", which prevents so-called web robots from automatically disseminating large quantities of content across websites or forums. In some cases, in which relevant political, corporate or societal content is published in weblogs and is therefore potentially available to everybody online, it should not be possible to fake, alter or change the identity of authors. This holds true not only
for identities of general public interest, but sometimes also for the identity of participants in any given content-related discussion [28]. A useful security mechanism in this regard might be digital signatures, which can be used either for user authentication or for verifying the integrity of a blog (post) – thus ensuring the absence of manipulation and alteration [29]. Digital signatures serve a purpose similar to our regular signatures in daily life. By signing a specific document, we express our consent with the content of that document and consequently authorize it. Since every signature holds an individual and unique characteristic, it can be attributed to the respective individual without any doubt. A digital signature incorporates a similar individual characteristic through unique cryptographic keys that link a signed document with the identity of the signer. Neither the content of the signed document nor the identity of the signer can be changed without invalidating the digital signature. Finally, there is a trusted third party (a so-called "certification authority") that confirms the integrity of the document, the author and the corresponding signature. For an internal corporate weblog like POV, a fully functional user authentication likewise had to be realized in order to truly overcome traditionally unidirectional corporate communication and to generate a genuinely open dialogue and the trust necessary to manage fear and wild speculation among the workforce within SAP. Every stakeholder active on the POV platform thus needed the guarantee that every article or comment on the platform was written by exactly the author specified within the platform.
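The integrity property of such signatures can be sketched in a few lines. Since Python's standard library offers no asymmetric signatures, this minimal sketch substitutes an HMAC over author and content as a symmetric stand-in for a real X.509-backed signature; the key, author name and post text are purely illustrative:

```python
import hashlib
import hmac

def sign_post(author_key: bytes, author: str, content: str) -> str:
    """Bind author identity and post content into one verifiable tag."""
    message = f"{author}\n{content}".encode("utf-8")
    return hmac.new(author_key, message, hashlib.sha256).hexdigest()

def verify_post(author_key: bytes, author: str, content: str, tag: str) -> bool:
    """Any change to author or content invalidates the tag."""
    expected = sign_post(author_key, author, content)
    return hmac.compare_digest(expected, tag)

# Illustrative per-author key; in a PKI this role is played by the
# author's private key, vouched for by a certification authority.
key = b"per-author secret issued by a trusted authority"
tag = sign_post(key, "alice", "My point of view on the new strategy.")

assert verify_post(key, "alice", "My point of view on the new strategy.", tag)
# Tampering with the content (or the author field) breaks verification:
assert not verify_post(key, "alice", "A manipulated post.", tag)
```

An asymmetric scheme adds what HMAC cannot: anyone can verify the tag with the public key, while only the certificate holder could have produced it.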
In the specific case of POV, it was furthermore imperative not only to identify single users, but also to clearly mark their affiliation with the major interest groups on the platform: the top management and board of directors on the one hand, and SAP's 60,000 employees and their works council on the other. The WordPress and WordPress MU weblogging systems by default provide their own identity management solution, which requires authors to register using their personal data, and optionally validates e-mail addresses or requires new accounts to be activated by administrators or moderators of the system. As mentioned before, this only partially enables user authentication. As SAP already has a corporate identity management system in place, it was decided to reuse this infrastructure and allow users to authenticate with the weblog system without any username or password, simply using their corporate X.509 client certificates [32] together with the Lightweight Directory Access Protocol (LDAP) directories already in place. There is no ready-to-use extension for WordPress that integrates the WordPress identity management with X.509. Hence, the required functionality had to be developed from scratch and was packaged as a separate WordPress plug-in. Given that user authentication needed to be implemented, it was also imperative to allow employees easy and quick access [31]. Access control across multiple related but independent software systems – also known as Single Sign-On (SSO) – allows SAP's employees to log in once to the well-established internal portal and consequently gain access to all other systems (including the blog) without being prompted to log in again at each of them [30]. The plug-in makes use of the identity information conveyed in the users' TLS client certificates and provides it to the WordPress identity management system.
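The hand-off from certificate to account can be sketched as follows, assuming a certificate in the nested-tuple dictionary form that Python's ssl module returns from SSLSocket.getpeercert(); the field values below are invented for illustration and the mapping is a simplification of what such a plug-in would do:

```python
# Map a TLS client certificate, in the structure delivered by
# ssl.SSLSocket.getpeercert(), to the identity fields a weblog
# account needs (username, e-mail, organizational affiliation).
def identity_from_cert(cert: dict) -> dict:
    # Flatten the subject's relative distinguished names into one dict.
    subject = {k: v for rdn in cert["subject"] for (k, v) in rdn}
    return {
        "username": subject["commonName"],
        "email": subject.get("emailAddress", ""),
        "organization": subject.get("organizationName", ""),
    }

# Illustrative certificate structure (values are invented).
cert = {
    "subject": (
        (("organizationName", "Example Corp"),),
        (("commonName", "jdoe"),),
        (("emailAddress", "jdoe@example.com"),),
    ),
    "notAfter": "Dec 31 23:59:59 2030 GMT",
}

user = identity_from_cert(cert)
print(user["username"])  # jdoe
```

Because the TLS layer has already verified the certificate against the corporate CA, the application can trust these fields without asking for a password, which is what makes the SSO experience possible.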
As a consequence, the authenticated SAP weblog could only be accessed over HTTPS connections. This required some further rewriting of hyperlinks within the system itself, in order to avoid disturbing warning messages in the users' web browsers.

4.4 Seamless Integration

SAP employees, like most information workers, prefer a one-stop-shop approach to information discovery, acquisition and retention, rather than site-hopping (see "SSO" in Section 4.2 and Fig. 2). To improve adoption of the new platform, tight integration into SAP's intranet and the illusion of the platform being a native component of the intranet were required. The design was discussed with Global Communications and SAP IT, and then implemented by HPI to meet the standards of the SAP Corporate Portal style guides (see "CI/CD" in Fig. 2). Feedback has shown that this has been so effective that employees have requested rating functionality for their own pages without even realizing that the entire application is a separate entity (see "integration" in Fig. 2). Seamless integration has also made it possible to install several instances of the same discussion in multiple languages, so that employees can be automatically directed to their default language version based on their personal settings (see "Bilingual" in Fig. 2). As an equal-opportunity employer, SAP regards accessibility as a mandatory consideration for new platforms, and Point of View was tested for compatibility with external screen readers, screen inversion, and standard Windows accessibility functions.

4.5 Meeting Enterprise Standards

Especially in the corporate (non-private) context, it should be regarded as a central aspect of a project to safeguard the blog platform against any kind of failure and to guarantee general system stability at all times.
For an internal communications platform with no intended customer interaction, but many thousands of potential users from the workforce, it could grow into a fair embarrassment for a company if such a platform were not available as planned. Especially for the use case of the POV project, which was announced within the company [30] as a central point of communication between SAP's board and its employees, there was no room for (temporary) failure. This is why the development phase of POV was realized on separate server hardware. Once the blog was fully functional and completely free of bugs, it was moved onto two identical physical machines that guarantee redundancy for POV's lifetime. In the event of a system crash on the currently running production server, traffic can immediately be redirected to the redundant stand-by server. Already published posts and comments could quickly be restored from a database backup. System stability through redundancy on the hardware side should, however, always be accompanied by software stability tests. Since POV was built upon the open-source blogging software WordPress, which is mainly used in private and small-scale contexts, and since its code was furthermore heavily adapted to fit extra requirements, the system's scalability had to be thoroughly tested for implementation in a corporate context with potentially up to 60,000 users as well.

Table 1. POV Load Test: Transaction Summary

Transaction Name   Min.    Avg.    Max.     Std. Dev.  90 %    Pass    Fail  Stop
Logon              1.175   2.735   26.972   2.098      5.031   14,194    58     2
ReadArchive        0       0.063   20.924   0.237      0.13    14,100     0     0
ReadComments       0.409   1.231   13.094   1.081      2.5     13,933    18     1
RecentlyAdded      0.445   1.241   14.16    1.092      2.5     13,660    26     3
Search             0.458   1.248   22.704   1.12       2.406   13,806    15     2

The IT department of SAP therefore conducted load tests with 1,000 concurrent users performing automated read scenarios with a think time set at random between 30 and 90 seconds, and 10 concurrent users carrying out heavy write transactions. The number of concurrent users was determined against benchmarks with similar platforms already in use at SAP, such as forums and wikis, and scaled to ensure sufficient stability for a best-case employee engagement (see Table 1 for the transaction summary). 16 transactions per second were created, along with 50 comments in the space of 15 minutes, resulting in an overall logon, navigation and read transaction response time of less than 3 seconds. This result was comparable to similar systems such as internal forums and wikis, and no major errors were encountered. Of almost 70,000 transactions executed in the test, less than 2% failed or were stopped. The server CPU sustained between 70 and 90% utilization, and RAM consumption was around 500 MB. To date, CPU load in the active system does not exceed 15%. Performance lags in the Americas and Asia Pacific have also now been remedied, resulting in similar response times around the world.

5 Conclusion

Point of View was launched to the company during the launch of SAP's Purpose and Values by the CEO. Initially, participation was slow, and employees waited to see how the channel developed. Following a couple of critical statements, more people felt encouraged to participate, and the platform has begun to take on a life of its own, with 128 comments for the first post alone, and counting, even two months after it was posted. Around 19,000 employees have visited the platform, and it has clocked up 55,000 page views. This far exceeds the initial expectations and shows that the need for feedback was indeed very present.
An increase in access to the blog via tags has also been identified, a trend expected to grow as more content becomes available. We conclude that a weblog is a highly dynamic online communications tool that, if implemented correctly, has the potential to make a company's internal communications more cohesive and vibrant. However, it should also be mentioned here that any social software project – especially when it comes to weblogs – can wreak havoc if the basic success factors discussed above are not fully adhered to. Nonetheless, weblogs inherently incorporate respect for individual self-expression and thus provide an excellent forum for the free exchange and development of ideas, one that can make employees feel more involved in a company and more closely connected to the corporate vision – even in times of crisis. Even though weblogs do not offer a cure-all for corporate communications departments, they can unbind the human minds that make up an organization and make internal communications more effective.

References

1. Ludewig, M., Röttgers, J.: Jedem sein Megaphon – Blogs zwischen Ego-Platform, Nischenjournalismus und Kommerz. C't, Heft 25, Report | Web 2.0: Blogs, 162–165 (2007)
2. Jacobsen, N.: Corporate Blogging – Kommunikation 2.0. Manager Magazin, http://www.manager-magazin.de/it/artikel/0,2828,518180,00.html
3. Klostermeier, J.: Web 2.0: Verlieren Sie nicht den Anschluss. Manager Magazin, http://www.manager-magazin.de/it/ciospezial/0,2828,517537,00.html
4. Tiedge, A.: Webtagebücher: Wenn der Chef bloggt. Manager Magazin, http://www.manager-magazin.de/it/artikel/0,2828,513244,00.html
5. Hamburg-Media.Net: Enterprise 2.0 – Start in eine neue Galaxie. Always on, Ausgabe 9 (February 2009)
6. Koch, M., Richter, A.: Enterprise 2.0: Planung, Einführung und erfolgreicher Einsatz von Social Software in Unternehmen. Oldenbourg Wissenschaftsverlag, München (2008)
7.
Cowen, J., George, A.: An Eternal Conversation within a Corporation: Using Weblogs as an Internal Communications Tool. In: Proceedings of the 2005 Association for Business Communication Annual Convention (2005)
8. Langham, M.: Social Software goes Enterprise. Linux Enterprise (Weblogs, Wikis and RSS Special) 1, 53–56 (2005)
9. Heng, S.: Blogs: The new magic formula for corporate communications? Deutsche Bank Research, Digital Economy (Economics) (53), 1–8 (2005)
10. Leibhammer, J., Weber, M.: Enterprise 2.0 – Analyse zu Stand und Perspektiven in der deutschen Wirtschaft. BITKOM (2008)
11. BerleCon Research: Enterprise 2.0 in Deutschland – Verbreitung, Chancen und Herausforderungen. BerleCon Research im Auftrag der CoreMedia (2007)
12. IEEE: Web Collaboration in Unternehmen. In: Proceedings of the First IEEE EMS Workshop on Web Collaboration in Enterprises, September 28, Munich (2007)
13. Sawhney, M.S.: Angriff aus der Blogosphäre. Manager Magazin 2 (2005), https://www.manager-magazin.de/harvard/0,2828,343644,00.html
14. Zerfaß, A.: Corporate Blogs: Einsatzmöglichkeiten und Herausforderungen, p. 6 ff. (2005), http://www.bloginitiativegermany.de
15. Scheffler, M.: Bloggers beware! Tipps fürs sichere Bloggen im Unternehmensumfeld. Bedrohung Web 2.0, SecuMedia Verlags-GmbH (2007)
16. Wood, L.: Blogs & Wikis: Technologies for Enterprise Applications? The Gilbane Report 12(10), 1–9 (2005)
17. Heuer, S.: Skandal in Echtzeit. BrandEins, Schwerpunkt: Kommunikation Blog vs. Konzern 02/09, 76–79 (2009)
18. Washkuch, F.: Leadership transition comms requires broader strategy in current economy. July 2009, p. 10 (2009), http://PRWEEKUS.com
19. Baker, S., Green, H.: Blogs will change your business. Business Week, May 2, 57–67 (2005), http://www.businessweek.com/magazine/content/05_18/b3931001_mz001.htm
20. Röll, M.: Business Weblogs – a pragmatic approach to introducing weblogs in medium and large enterprises. In: Burg, T.N. (ed.) BlogTalks, Wien 2004, pp.
32–50 (2004)
21. Eck, K.: Substantial reputational risks, PR Blogger, http://klauseck.typepad.com/prblogger/2005/02/pr_auf_der_zusc.html
22. Argenti, P.A.: Corporate Communications. McGraw-Hill/Irwin, New York (2003)
23. O’Shea, W.: Blogs in the workplace, New York Times, July 7 (2003), http://www.nytimes.com/2003/07/07/technology/07NECO.html?ex=1112846400&en=813ac9fbe3866642&ei=5070
24. Snell, J.: Blogging@IBM (2005), http://www-128.ibm.com/developerworks/blogs/dw_blog.jspa?blog=351&roll=-2#81328
25. O’Reilly, T.: Web 2.0 Compact Definition: Trying again (2006), http://radar.oreilly.com/archives/2006/12/web_20_compact.html
26. Cook, T., Hopkins, L.: Social Media or, How I learned to stop worrying and love communication (2007), http://trevorcook.typepad.com/weblog/files/CookHopkinsSocialMediaWhitePaper-2007.pdf
27. Steiner, P.: Cartoon. The New Yorker 69(20) (1993), http://www.unc.edu/depts/jomc/academics/dri/idog.html
28. Bross, J., Sack, H., Meinel, C.: Kommunikation, Partizipation und Wirkungen im Social Web. In: Zerfaß, A., Welker, M., Schmidt, J. (eds.) Kommunikation, Partizipation und Wirkungen im Social Web, Band 2 der Neuen Schriften zur Online-Forschung, Deutsche Gesellschaft für Online-Forschung, pp. 265–280. Herbert von Halem Verlag, Köln (2008)
29. Meinel, C., Sack, H.: WWW – Kommunikation, Internetworking, Webtechnologien. Springer, Heidelberg (2003)
30. Varma, Y.: SSO with SAP enterprise portal, ArchitectSAP Solutions, http://architectsap.wordpress.com/2008/07/14/sso-with-sapenterprise-portal/
31. Secude: How to Improve Business Results through Secure SSO to SAP, http://www.secude.com/fileadmin/files/pdfs/WPs/SECUDE_WhitePaper_BusinessResultsSSOforSAP_EN_090521.pdf
32. The Internet Engineering Task Force (IETF): Internet X.509 Public Key Infrastructure Certificate and CRL Profile, http://www.ietf.org/rfc/rfc2459.txt
33. Zerfaß, A., Boelter, D.: Die neuen Meinungsmacher – Weblogs als Herausforderung für Kampagnen, Marketing, PR und Medien.
Nausner & Nausner Verlag, Graz (2005)
34. Berlecon Research: Weblogs in Marketing and PR (Kurzstudie), Berlin (2004)
35. Leisegang, C., Mintert, S.: Blogging Software, iX (July 2008)
36. Scripting News: XML-RPC Home Page, http://www.xmlrpc.com/
37. Cronin-Lukas, A.: Intranet, blog, and value, The big blog company, http://www.bigblogcompany.net/index.php/weblog/category/C45/
38. Kircher, H.: Web 2.0 – Plattform für Innovation. it – Information Technology 49(1), 63–65 (2007)

Effect of Knowledge Management on Organizational Performance: Enabling Thought Leadership and Social Capital through Technology Management

Michel S. Chalhoub
Lebanese American University, Lebanon
mchalhoub@live.com

Abstract. The present paper studies the relationship between social networks enabled by technological advances in social software, and overall business performance. With the booming popularity of online communication and the rise of knowledge communities, businesses are faced with a challenge as well as an opportunity: should they monitor the use of social software, or encourage it and learn from it? We introduce the concepts of user-autonomy and user-fun, which go beyond the traditional user-friendliness requirement of existing information technologies. We identified 120 entities out of a sample of 164 from Mediterranean countries and the Gulf region to focus on the effect of social exchange information systems on thought leadership.

Keywords: Social capital, social software, human networks, knowledge management, business performance, communities of practice.

1 Introduction

The present paper studies the relationship between social networks enabled by technological advances in social software, and overall business performance. With the booming popularity of online communication and the rise of knowledge communities, businesses are faced with a challenge as well as an opportunity: should they monitor the use of social software, or encourage it and learn from it?
We introduce the concepts of user-autonomy and user-fun, which go beyond the traditional user-friendliness requirement of existing information technologies. We identified 120 entities out of a sample of 164 from Mediterranean countries and the Gulf region to focus on the effect of social exchange information systems on thought leadership. During our exploratory research phase, we put forward that for a company to practice thought leadership, its human resources are expected to contribute continuously to systems that support the development of social capital. A majority of our respondents confirmed that although classical business packages such as enterprise resource planning (ERP) systems have come a long way in supporting business performance, they are distant from the fast-changing challenges that employees face in their daily lives. Respondents favored the use of social software – blogs, wikis, text chats, internet forums, Facebook and the like – to open and conduct discussions that are both intellectual and fun, get advice, share experiences, and connect with communities of similar interests.

J.G. Breslin et al. (Eds.): BlogTalk 2008/2009, LNCS 6045, pp. 29–37, 2010. © Springer-Verlag Berlin Heidelberg 2010

ERPs would continue to focus on business processing while leaving room for social software in building communities of practice. In a second phase, we identified six dimensions along which knowledge systems could be effective in supporting business performance. These dimensions are (X1) training and on-the-job application of social software, (X2) encouraging participative decision-making, (X3) spurring thought leadership and new product development (NPD) systems, (X4) fostering a culture of early technology adoption, (X5) supporting customer-centered practices through social software, and (X6) using search systems and external knowledge management to support thought leadership.
We performed a linear regression analysis and found that (X1) is a learning mechanism that is positively correlated with company performance. (X2), which represents participative decision-making, gives rise to informed decisions and is positively and significantly related to company performance. (X3), the use of social software to support thought leadership, is positively and significantly related to company performance. Most employees indicated that they have increasingly shifted to social participation, innovation, and long-term relationships with fellow employees and partners. (X4), relevant to how social software fosters self-improvement and technology adoption, is found to be statistically insignificant, but this may be due to the company setting related to sampling and requires further research. (X5), which corresponds to supporting customer-centered practices through social software, was found to be positively and significantly related to company performance. (X6), representing the role of social software and advanced search systems that support thought leadership through external knowledge management, is statistically insignificant in relation to company performance. Although this last result requires further research, it is consistent with our early findings that most respondents in the geographic region surveyed rely on direct social interaction rather than information technology applications. In sum, we recommend that social networks and their enabling information systems be integrated into business applications rather than being looked upon by senior management as a distraction from work. Social software grew out of basic human needs to communicate, and is deployed through highly ergonomic tools. It lends itself to integration into business applications.

1.1 Research Rationale

With competitive pressures in globalizing markets, the management of technology and innovation has become a prerequisite for business performance.
New ways to communicate, organize tasks, design processes, and manage people have evolved. Despite the competitive wave of the 1990s, which pushed firms to lower costs and adopt new business models, pressure remained to justify investments in management systems [1]. This justification was further challenged by a lack of measures for knowledge management systems, as knowledge is tacit and mobile [2]. Throughout the last decade, increased demand by users to interact in communities of interest – both intellectual and fun – gave rise to social software. We identified six dimensions along which managers could be pro-active in harnessing social software to enable employees to perform better collectively. These dimensions comprise:

(1) technology training and on-the-job application of social networks,
(2) using enterprise communication mechanisms to increase employee participation in decision-making,
(3) spurring employees towards thought leadership and new product development (NPD) systems. Thought leadership is our terminology for the ability to create, develop, and disseminate knowledge in a brainstorming mode, whereby the group of knowledge workers or team plays a leading role in originating ideas and building intellectual capital,
(4) fostering a culture of technological advancement and pride of being affiliated with the organization,
(5) supporting customer-centered practices and customer relationship management (CRM) systems, and
(6) using social software to support thought leadership through external knowledge management.

The six dimensions above represent initiatives directed at the development of intellectual capital. They call for competitive performance at an organizational level, while contributing to self-improvement at an individual level.
They are all geared towards building a culture that appreciates early adoption of technology to remain up-to-date in the fast-evolving field of information and management systems. Gathering, processing, and using external knowledge is not only about the supporting technology, but mostly about an entire process whereby the employee develops skills in performing research.

2 Organizational and Managerial Initiatives That Support Intellectual Capital Development

2.1 Technology Training and On-the-Job Application

It is common knowledge that employee growth contributes to overall company performance. Several researchers have suggested measurement techniques to link individual growth to company results [3]. Strategic planning can no longer be performed without accounting for technology investment and what the firm needs to address in terms of employee training on new tools and techniques [4], [5], [6], [7]. Technology training has made career paths more complex, characterized by lateral moves into, and out of, technical jobs. But at the same time, it provides room for user-autonomy. It was found that technology training provides intellectual stimulation and encourages employees to apply newly learned techniques on the job, realizing almost immediate application of what they were trained for. In particular, technology training has allowed employees to develop, on their own and with minimal investment, personalized community systems powered by social software [8], [9].

2.2 Enterprise Communication Mechanisms Applied to Participative Decision-Making

Employees strive to be involved in decision-making, as this relates to self-improvement. Over the last few decades, concepts and applications of enterprise-wide collaboration have evolved to show that participation leads to sounder decisions [10], [11]. Most inventions and innovations that have been celebrated at the turn of the century demonstrated the power of collaborative efforts across one or more organizations [12].
Communities that are socially networked thrive on knowledge diffusion and relationships that foster innovation. The concept behind ERP-type applications is to move decisions from manual hard-copy reports to a suite of integrated software modules with common databases to achieve higher business performance [13]. The database collects data from different entities of the enterprise, and from a large number of processes such as manufacturing, finance, sales, and distribution. This mode of operation increases efficiency, and allows management to tap into the system at any time and make informed decisions [9]. But for employees to participate in decisions, they must be equipped with the relevant processes and technology systems to help capture, validate, organize, update, and use data related to both internal and external stakeholders. This approach requires a system that supports social interaction to allow for brainstorming with minimal constraints, allowing employees to enjoy their daily interaction [14], [15]. Modern social systems seek to go beyond the cliché of user-friendly features and more towards user-fun features.

2.3 Supporting Thought Leadership through NPD Systems and Innovation

Novelty challenges the employee to innovate [16]. However, it is important to seek relevant technology within an organizational context, as opposed to chasing any new technology because it is in vogue. It has been argued that new product management is an essential part of the enterprise not only for the sake of external competitive moves, but also to bolster company culture and employee team-orientation – the closer to natural social interaction, the better [17]. In that regard, processes in R&D become necessary to drive innovation systematically rather than rely on accidental findings. Nevertheless, product development is cross-functional by nature and requires a multifaceted free flow of ideas that is best supported by social software [18], [19].
2.4 Fostering a Culture of Technological Advancement and Pride of Organizational Affiliation

Technological advancements have profound effects on company culture [19]. For example, before the internet, supply chain coordination was hindered by the steep challenges of exchanging information smoothly among internal supply chain systems such as manufacturing, purchasing, and distribution, and with external supply chain partners such as suppliers and distributors. Today, enterprise systems provide compatibility and ease of collaboration, while social software facilitates the development of a culture of exchange and sharing. Many respondents expressed “pride” in belonging to an organization that built a technology-enabled collaborative culture and enhanced their professional maturity. This effect has been borne out over the last two decades: the adoption of modern technology is a source of pride to many employees [20], [21].

2.5 Supporting Customer-Centered Practices through CRM Types of Systems

Thought leadership has been illustrated by many examples of firms that drove industrial innovation, such as Cisco, Intel, and Microsoft [22]. Internal enterprise systems have undergone serious development over the last few decades to integrate seamlessly with customer relationship systems, such as CRM-type applications [23]. While ERPs help achieve operational excellence, CRM helps build intimacy with the customer. That intimacy with the customer was identified in past research, and in our exploratory research, as an important part of the employee’s satisfaction on the job [24]. During our interviews, many expressed that the best part of their job is to accomplish something that is praised by the customer. Social software is now making strides in assisting with customer intimacy, including the use of blogs and wikis [25].
2.6 Using Advanced Systems to Support Thought Leadership through External Knowledge Management

Knowledge formation has evolved into a global process through the widespread adoption and dissemination of web technologies [26], [27]. Over the last few decades, firms have grown into more decentralized configurations, and many researchers have argued that it is no longer feasible to operate with knowledge and decisions centralized in a single location [28]. The development of integrated technology management processes became critical to business performance, as they link external and internal knowledge creation, validation, and sharing [29]. Potential business partners are increasingly required to combine competition and cooperation to assist a new generation of managers in configuring alliances and maximizing business opportunities [30], [31]. A range of social software applications are making their way into corporate practice, including internal processes and external supply chain management, to build and sustain partnerships [32], [33].

3 Research Hypotheses

Based on the managerial initiatives above, we state our six hypotheses:

• H1: Practicing technology training with direct on-the-job application is positively correlated with company performance.
• H2: Using enterprise communication mechanisms to apply participative decision-making is positively correlated with company performance.
• H3: Supporting thought leadership through new product development information systems is positively correlated with company performance.
• H4: Fostering a culture of technological advancement and pride of being affiliated with the organization is positively correlated with company performance.
• H5: Supporting customer-centered practices through customer relationship management types of systems is positively correlated with company performance.
• H6: Investing in advanced information technology systems to support external knowledge management is positively correlated with company performance.
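The decision rule behind these hypotheses is simple enough to sketch. The snippet below is a hedged illustration (not part of the study's tooling): it applies the conventional p-value threshold to the six hypotheses, using the significance values reported in the empirical analysis that follows.

```python
# Illustrative significance check for H1-H6. The p-values are the "Sig."
# values reported in the study's regression results; ALPHA restates the
# study's 0.05 significance level (95% confidence interval).
ALPHA = 0.05

p_values = {"H1": 0.003, "H2": 0.016, "H3": 0.005,
            "H4": 0.145, "H5": 0.006, "H6": 0.672}

# A hypothesis is retained when its coefficient is significant (p < ALPHA);
# note the study additionally requires the coefficient to be positive.
accepted = sorted(h for h, p in p_values.items() if p < ALPHA)
rejected = sorted(h for h, p in p_values.items() if p >= ALPHA)

print(accepted)  # ['H1', 'H2', 'H3', 'H5']
print(rejected)  # ['H4', 'H6']
```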
3.1 Results of Empirical Analysis

The linear regression analysis provided a reliable test with R = 0.602 (R² = 0.362), with beta coefficients β1, β2, …, β6 and their relative significance assessed through the p-values at a 95% confidence interval (significance level of 0.05; n = 118).

Table 1. Independent variables representing the use of technology in enabling human resources in knowledge management and thought leadership, beta coefficients, and significance levels in relation to company performance

                                                                  Beta    Sig.
Constant                                                          1.314   0.000
X1  Technology training and on-the-job application of social      0.185   0.003
    software (& user-autonomy)
X2  Enterprise technology systems for participative               0.109   0.016
    decision-making through social networks (& user-fun)
X3  Technological thought leadership in innovation and            0.127   0.005
    product development
X4  Pride in culture and technological advancement                0.067   0.145
X5  Customer relationship management systems for service          0.181   0.006
    support leveraging social software
X6  Advanced technology applications for partnerships and        -0.022   0.672
    external knowledge management

With R = 0.602 (R² = 0.362), the regression is significant at F = 10.5, Sig. = 0.000.

The hypotheses H1, H2, …, H6 were tested using the regression equation:

Y = β0 + β1·X1 + β2·X2 + β3·X3 + β4·X4 + β5·X5 + β6·X6
Y = 1.314 + 0.185·X1 + 0.109·X2 + 0.127·X3 + 0.067·X4 + 0.181·X5 − 0.022·X6

Summary results are presented in Table 1. At the 5% significance level, we find that X1, X2, X3, and X5 are positively and significantly correlated with company performance, while X4 and X6 are not significant. We accept H1, H2, H3, and H5, as there is positive correlation and statistical significance.
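To make the fitted equation concrete, the sketch below evaluates Y for a single hypothetical firm. The coefficients are those reported above; the firm's dimension scores are invented for illustration and are not from the study's data.

```python
# Evaluate Y = b0 + sum(bi * Xi) using the fitted coefficients from the
# regression above. The example firm's scores on X1..X6 are hypothetical.
coeffs = {"const": 1.314, "X1": 0.185, "X2": 0.109,
          "X3": 0.127, "X4": 0.067, "X5": 0.181, "X6": -0.022}

def predict_performance(scores):
    """Predicted performance index for a firm's scores on the six dimensions."""
    return coeffs["const"] + sum(coeffs[x] * v for x, v in scores.items())

firm = {"X1": 3.0, "X2": 4.0, "X3": 3.5, "X4": 2.0, "X5": 4.5, "X6": 3.0}
y = predict_performance(firm)
print(round(y, 3))  # 3.632
```

Note that the negative β6 means a higher X6 score slightly lowers the predicted index, which is why the study treats that coefficient as insignificant rather than meaningful.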
We cannot accept H4 and H6, as the relationship in the regression was found insignificant.

4 Conclusions and Recommendations

The use of social software in the development of communities of practice – sharing common intellectual interests and pro-actively managing knowledge – fosters thought leadership. Our research shows that people on the job are increasingly searching for technology that goes beyond the traditional user-friendly promise towards user-autonomy and user-fun. We also found that individual autonomy and partaking in idea generation while having fun are positively correlated with company performance, as evidenced by the regression analysis of primary data. Decisions about investment in new technologies need to be based on relevance to the human resources’ work environment rather than excitement about novelty. We propose a framework built on six technology management initiatives – which we used as evaluation dimensions – and argue that, if included in the company’s strategic planning, they result in competitive advantage. The six areas provide measurable and manageable variables that could well be used as performance indicators. Our empirical model uses primary data collected from a subset of 120 companies, of a sample of 164 Mediterranean and Gulf entities. The dependent variable is an index of growth, profitability, and customer service quality. The empirical analysis showed that training on technology and its on-the-job application in social networks, the use of enterprise systems for participative decision-making, fostering thought leadership and product development using social software, and the use of relationship systems for customer intimacy are all positively and significantly related to company performance. The cultural component, represented by pride in being part of an organization that promotes the use of modern technologies, was not found significant.
The use of technology to support business partnerships and apply external knowledge management was not found significant either. These two latter results do not indicate that these items are unimportant, but rather that they need to be revisited in more detail. This is especially true in Middle Eastern companies, where company cultures are heavily influenced by owners, and employees are early technology adopters on their own. In such cases, social software is still perceived as somewhat out of scope at work or, better put, as designed for fun and not for business. Nevertheless, this perception is changing as business managers become increasingly aware of the business value of social software. Further, external knowledge management is still practiced through face-to-face interactions and events rather than through technology tools and techniques. Future research could focus on other regions where market data is available for publicly traded companies. The study would then explore the relationship between technology management initiatives and company value on open markets.

References

[1] Tushman, M., Anderson, P. (eds.): Managing Strategic Innovation and Change: A Collection of Readings. Oxford University Press, New York (1997)
[2] Gumpert, D.E.: U.S. Programmers at overseas prices. Business Week Online (December 3, 2003)
[3] Kaplan, R.S., Norton, D.P.: The Strategy-Focused Organization: How Balanced Scorecard Companies Thrive in the New Business Environment. Harvard Business School Publishing Corporation, Cambridge (2001)
[4] Training Industry Inc.: Training Industry Research Report on Training Effectiveness (1999)
[5] Kim, W., Mauborgne, R.: Strategy, value innovation, and the knowledge economy. Sloan Management Review, 41–54 (Spring 1999)
[6] Kim, W., Mauborgne, R.: Charting your company’s future.
Harvard Business Review, 76–83 (June 2002)
[7] Jun, H., King, W.R.: The Role of User Participation in Information Systems Development: Implications from a Meta-Analysis. Journal of Management Information Systems 25(1) (2008)
[8] James, W.: Best HR practices for today’s innovation management. Research Technology Management 45(1), 57–60 (2002)
[9] Liang, H., Saraf, N., Hu, Q., Xue, Y.: Assimilation of enterprise systems: The effect of institutional pressures and the mediating role of top management. MIS Quarterly 31(1) (March 2007)
[10] Miles, R., Snow, C.: Organizations: New concepts for new forms. California Management Review 28(3), 62–73 (1986)
[11] Vanston, J.: Better forecasts, better plans, better results. Research Technology Management 46(1), 47–58 (2003)
[12] Stone, F.: Deconstructing silos and supporting collaboration. Employment Relations Today 31(1), 11–18 (2004)
[13] Ferrer, J., Karlberg, J., Hintlian, J.: Integration: The key to global success. Supply Chain Management Review (March 2007)
[14] Chalhoub, M.S.: Knowledge: The Timeless Asset That Drives Individual Decision-Making and Organizational Performance. Journal of Knowledge Management – Cap Gemini (1999)
[15] Xue, Y., Liang, H., Boulton, W.R.: Information Technology Governance in Information Technology Investment Decision Processes: The Impact of Investment Characteristics, External Environment and Internal Context. MIS Quarterly 32(1) (2008)
[16] Andriopoulos, C., Lowe, A.: Enhancing organizational creativity: The process of perpetual challenging. Management Decision 38(10), 734–749 (2000)
[17] Crawford, C., DiBenedetto, C.: New Products Management, 7th edn. McGraw-Hill, Philadelphia (2003)
[18] Alboher, M.: Blogging’s a Low-Cost, High-Return Marketing Tool. The New York Times, December 27 (2007)
[19] Laudon, K.C., Traver, C.G.: E-Commerce: Business, Technology, Society, 5th edn. Prentice-Hall, Upper Saddle River (2009)
[20] Bennis, W., Mische, M.: The 21st Century Organization.
Jossey-Bass, San Francisco (1995)
[21] Hof, R.D.: Why tech will bloom again. BusinessWeek, 64–70 (August 25, 2003)
[22] Gawer, A., Cusumano, M.: Platform Leadership: How Intel, Microsoft, and Cisco Drive Industry Innovation. Harvard Business School Press, Cambridge (2002)
[23] Goodhue, D.L., Wixom, B.H., Watson, H.J.: Realizing business benefits through CRM: Hitting the right target in the right way. MIS Quarterly Executive 1(2) (June 2002)
[24] Gosain, S., Malhotra, A., El Sawy, O.A.: Coordinating flexibility in e-business supply chains. Journal of Management Information Systems 21(3) (Winter 2005)
[25] Wagner, C., Majchrzak, A.: Enabling Customer-Centricity Using Wikis and the Wiki Way. Journal of Management Information Systems 23(3) (2007)
[26] Sartain, J.: Opinion: Using MySpace and Facebook as Business Tools. Computerworld (May 23, 2008)
[27] Murtha, T., Lenway, S., Hart, J.: Managing New Industry Creation: Global Knowledge Formation and Entrepreneurship in High Technology. Stanford University Press, Palo Alto (2002)
[28] Rubenstein, A.: Managing technology in the decentralized firm. Wiley, New York (1989)
[29] Farrukh, C., Fraser, P., Hadjidakis, D., Phaal, R., Probert, D., Tainsh, D.: Developing an integrated technology management process. Research Technology Management, 39–46 (July–August 2004)
[30] Chalhoub, M.S.: A Framework in Strategy and Competition Using Alliances: Application to the Automotive Industry. International Journal of Organization Theory and Behavior 10(2), 151–183 (2007)
[31] Cone, E.: The Facebook Generation Goes to Work. CIO Insight (October 2007)
[32] Kleinberg, J.: The Convergence of Social and Technological Networks. Communications of the ACM 51(11) (November 2008)
[33] Malhotra, A., Gosain, S., El Sawy, O.A.: Absorptive capacity configurations in supply chains: Gearing for partner-enabled market knowledge creation.
MIS Quarterly 29(1) (March 2005)

Finding Elite Voters in Daum View: Using Media Credibility Measures

Kanghak Kim1, Hyunwoo Park1, Joonseong Ko2, Young-rin Kim2, and Sangki Steve Han1
1 Graduate School of Culture Technology, KAIST, 335 Daejeon, South Korea
{fruitful_kh,shineall,stevehan}@kaist.ac.kr
2 Daum Communications Corp., Jeju, South Korea
{pheony,ddanggle}@daumcorp.com

Abstract. As news media are expected to provide valuable news contents to readers, the credibility of each medium depends on what news contents it has created and delivered. In traditional news media, staff editors look into news articles and arrange news contents to enhance their media credibility. By contrast, in social news services, general users play an important role in selecting news contents through voting behavior, as it is practically impossible for staff editors to go through the thousands of articles sent to the services. However, although social news services have strived to develop news ranking systems that select valuable news contents utilizing users’ participation, these systems still represent popularity rather than credibility, or place too much burden on users. In this paper, we examine whether there is a group of elite users who vote for articles whose journalistic values are higher than others’. To do this, we first assessed the journalistic values of 100 social news contents with a survey. Then, we extracted a group of elite users based on what articles they had voted for. To prove that the elite group shows a tendency to vote for journalistically valuable news contents, we analyzed their voting behavior in another news pool. Finally, we conclude with the promising result that news contents voted for by the elite users show significantly higher credibility scores than other news stories do, while the number of votes from general users is not significantly correlated with the scores.

Keywords: News Ranking System, Media Credibility, Collaborative Filtering, Social Media, Social News Service.
1 Introduction

Since the web emerged, the ecosystem of journalism has gone through huge changes. Given the web, people have become able to publish whatever they want without any cost, and the barrier between professional journalists and general people is no longer clear. The We Media report (2003) from the Media Center named this phenomenon ‘participatory journalism.’ Participatory journalism is defined as the act of a citizen, or group of citizens, playing an important role in collecting, reporting, analyzing and disseminating news and information.

J.G. Breslin et al. (Eds.): BlogTalk 2008/2009, LNCS 6045, pp. 38–45, 2010. © Springer-Verlag Berlin Heidelberg 2010

Fig. 1. Sources of News Consumption in the US (“Where do you get most of your national and international news?”; 2004–2008: TV ≈ 70%, Internet ≈ 40%, Newspaper ≈ 35%)

Social news media like Digg or Reddit help this phenomenon happen. People collect, recommend, and read news contents in social news media. According to the Pew Research Center for the People & the Press, 40 percent of Americans keep up with news about national and international issues through the internet, and the percentage has been rapidly increasing (Fig. 1). For news media, selecting valuable contents has always been considered essential, since it decides their media credibility as information providers. In traditional media, therefore, staff editors look into articles, select some of them, and arrange the selected stories. In contrast, most social news services have been using automated news ranking systems that utilize users’ participation, trying to select credible items for their front page, as it is practically impossible for a small number of staff editors to screen numerous articles from a large number of writers.
For instance, Slashdot, a representative social news service, adopted meta-moderation to enhance news moderators’ credibility, while recently launched services such as NewsCred and NewsTrust aim to disseminate credible and trustworthy news contents by providing different voting methods to users. However, the problem is that their systems often select popular contents rather than credible ones, or otherwise place too much burden on users. This study examines whether there is a group of people who have a tendency to vote for journalistically valuable news contents. If there is, their votes will be not only powerful but also efficient in selecting valuable contents. To this end, we first review research on media credibility measures and ranking systems in Section 2. We then assess the values of news articles based on media credibility and extract the people who voted for journalistically valuable news contents in Section 3, and finally analyze their voting behavior toward other news pools in Section 4. As a result, we show that there is a group of elite users in terms of voting credibility, which is promising in that their votes can be used to enhance the credibility of selected news contents.

2 Related Work

2.1 Research on Journalistic Media Credibility

Research on journalistic media credibility has mainly focused on finding components with which to assess perceived media credibility. Related research started in the early 1950s. Hovland and Weiss suggested trustworthiness and expertise as source credibility factors. Infante (1980) added dynamism to the previous research. Meyer (1988) presented measuring components categorized into two dimensions, “social concern” and “credibility of paper”, adopting Gaziano and McGrath’s (1986) well-known 12 factors¹.
Rimmer and Weaver (1987) suggested another 12 elements, including concern for community well-being and the factual foundations of published information. Researchers then started focusing on finding common or differing measuring components for online news media. Ognianova (1998) used 9 semantic differential elements², while Kiousis (1999) conducted a survey with 4 elements and concluded that online news is more credible than television. The Berkman Center for Internet and Society at Harvard University organized a conference titled "Blogging, Journalism & Credibility: Battleground and Common Ground" in 2005 and discussed which subjects can be better dealt with in online journalism and what components should be considered to measure credibility. This shows how controversial it is to differentiate online news credibility from traditional credibility. Cliff Lampe and R. Kelly Garrett classified measuring components into 2 groups³, normative and descriptive review elements, and examined which of 4 review instruments (normative, descriptive, full, mini review) performs well in terms of accuracy, discriminating credible news from others, and relieving user burden. Thus, numerous studies have been conducted to measure perceived credibility. Although these studies have provided good criteria for measuring credibility, they are not directly applicable to news ranking systems because they are not about forecasting credibility.

2.2 News Ranking Systems

The user-participation based news ranking systems used in representative social news services can be categorized into three groups: simple voting, weighted voting, and rating-based voting. Digg's and Reddit's ranking systems are examples of the simple voting method. They offer Digg/Bury and Upvote/Downvote features to users, and once a news article earns a critical mass of Diggs or Upvotes, it is promoted to the front page.
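The simple voting promotion just described can be sketched as follows. The threshold and vote counts are invented for illustration; they are not Digg's or Reddit's actual parameters.

```python
# Simple voting (Digg/Reddit style): every vote counts equally, and an
# article is promoted to the front page once it earns a critical mass
# of up-votes. The threshold of 50 is illustrative, not a real value.

def score(upvotes, downvotes):
    return upvotes - downvotes

def promoted(upvotes, downvotes, threshold=50):
    return score(upvotes, downvotes) >= threshold

print(promoted(60, 5))   # 55 >= 50 -> True
print(promoted(30, 10))  # 20 >= 50 -> False
```

As the paper notes, such a score reflects popularity: nothing in the aggregation distinguishes a credible vote from any other.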
NewsCred's method is similar to that of Digg and Reddit, except that NewsCred asks users to vote for articles when they find them credible.

¹ These were fairness, bias, telling the whole story, accuracy, respect for privacy, watching out after people's interest, concern for community, separation of fact and opinion, trustworthiness, concern for public interest, factuality, and level of reporter training.
² The 9 semantic differential elements are factual-opinionated, unfair-fair, accurate-inaccurate, untrustworthy-trustworthy, balanced-unbalanced, biased-unbiased, reliable-unreliable, thorough-not thorough, and informative-not informative.
³ Their components include accuracy, credibility, fairness, informativeness, originality, balance, clarity, context, diversity, evidence, objectivity, and transparency.

The simple voting method is powerful in that it can stimulate users' participation. However, it can cause well-known group voting problems such as the Digg Mafia, the Digg Bury Brigade, or the Reddit Downmod Squad. More fundamentally, it represents popularity rather than credibility. Slashdot is an example of the weighted voting method: each news moderator has a different weight in selecting news articles, depending on evaluations from meta-moderators. NewsTrust has tried a rating-based voting system. It encourages users to evaluate news content with a rating instrument involving several rating components such as accuracy or informativeness. Although this turns out to be quite reliable in assessing the journalistic value of news content, its complexity can lower users' participation. Daum View adopted a system that utilizes votes from elite users called Open Editors, but the performance of the system cannot be accurately evaluated due to the lack of criteria for selecting credible news content and elite users. On the other hand, Techmeme relied on a structure-analysis based method.
It analyzed how news sites link to each other, and treated items gathering many inbound links as "news." However, Techmeme gave up on its fully automated algorithm and started allowing manual news selection because of its bad performance. This shows how complex it is to assess credibility through structure analysis.

3 Methods

3.1 Daum View

Daum Communications (Daum) is an internet portal company in Korea. It launched a social news service named Daum View in February 2005, which has become the most representative social news medium in Korea. As of now, the service has more than 100 million page views per month and approximately 150,000 enrolled news bloggers. It categorizes articles into 5 groups: current affairs, everyday life, culture/entertainment, science/technology, and sports. In this research, we focus on the current affairs category because it is the most likely to be subject to news credibility.

3.2 Assessing the Journalistic Credibility of News Set 1

We collected the top 100 popular news items published from August 26 to September 2, 2009 in the Daum View service. To assess the journalistic credibility of this many news items, we conducted a survey over the web. Respondents were asked to assess the credibility of news content using a review instrument named 'normative review,' adopted from Lampe and Garrett (2007), since that instrument performs best in the sense that its results are similar to those from journalism experts. The normative review involves accuracy, credibility, fairness, informativeness, and originality, which are widely used in traditional credibility measures. The survey was conducted during September 2–9, 2009. A total of 369 people participated and assessed the 100 social news items on a Likert-type scale. Besides evaluating journalistic credibility, we asked the subjects to rate the importance of each survey component on a Likert-type scale, in order to take the characteristics of social news content into account in value estimation.
Then we rescaled the weights so that news credibility scores range from 1 to 5, and calculated the credibility scores for the sample news items, weighting each component accordingly. The result shows that people perceive credibility and accuracy as the most important requirements (0.227 and 0.225, respectively), while originality is considered the least important factor (0.148).

Table 1. Weights of Credibility Components for Social News Contents

  Accuracy  Credibility  Fairness  Informativeness  Originality
  0.225     0.227        0.198     0.202            0.148

Finally, we calculated credibility scores for the 100 news articles. About 20 percent of them (22 articles) were considered "good" articles in light of the meaning of the Likert scale (credibility score > 3.5), while another 20 percent (23 articles) were considered "bad" (credibility score < 3.0). Below are examples of news credibility scores.

Table 2. Samples of Journalistic Credibility Scores for the News Contents and the Number of Votes for Them

  Previous Ranking  URL                              Credibility Score
  89                http://v.daum.net/link/3918133   4.109859
  93                http://v.daum.net/link/3930134   3.996898
  76                http://v.daum.net/link/3931410   3.882244
  45                http://v.daum.net/link/3912458   3.856029
  22                http://v.daum.net/link/3931027   3.807791
  47                http://v.daum.net/link/3912861   3.777746
  67                http://v.daum.net/link/3926142   3.705755

3.3 Collecting User Voting Data

We collected user voting data from Daum View. A total of 73,917 votes from 41,698 users were made for the 100 news items. The data show that the number of votes per user follows a power law, with 32% of active users making 80% of the votes. We also identified users' malicious votes, defining a malicious vote as one made within so short a time after the user's previous vote that the user could not have read the whole voted article.
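The weighted scoring of Sect. 3.2 amounts to a weighted average of the five component ratings using the Table 1 weights. A minimal sketch; the component ratings below are invented for illustration, not values from the paper's survey.

```python
# Weighted credibility score for one article from its five component
# ratings (1-5 Likert), using the weights in Table 1. The weights sum
# to 1.0, so the score stays on the 1-5 scale. Ratings are made up.

WEIGHTS = {
    "accuracy": 0.225,
    "credibility": 0.227,
    "fairness": 0.198,
    "informativeness": 0.202,
    "originality": 0.148,
}

def credibility_score(ratings):
    return sum(WEIGHTS[c] * ratings[c] for c in WEIGHTS)

ratings = {"accuracy": 4.2, "credibility": 4.0, "fairness": 3.8,
           "informativeness": 4.1, "originality": 3.5}
score = credibility_score(ratings)
print(round(score, 3))  # 3.952

# Per Sect. 3.2, an article is "good" if score > 3.5, "bad" if < 3.0.
print(score > 3.5)      # True
```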
3.4 Valid Voting Rate

Having obtained each article's credibility score, we were able to calculate each user's valid voting rate (VVR), which is the number of valid votes divided by the total number of votes the user has made. Valid votes are defined as votes for articles whose credibility scores are over 3.5, considering the meaning of the 5-point Likert scale. In this process, we counted malicious votes for credible articles as invalid votes, and also excluded users who made fewer than 3 votes, because they could achieve a high valid voting rate by chance. 36,284 users were excluded in this process and 5,414 remained.

Fig. 2. Distribution

3.5 Assessing the Journalistic Credibility of News Set 2

We collected another set of the top 50 popular current news items published from September 9–16, 2009 in Daum View, along with the number of votes each article gathered. We then again assessed the credibility scores of this second news set. The survey method was the same as for news set 1, except that the number of sample news articles was smaller. The reason is that approximately 50 percent of the news items took about 75 percent of the votes, and that the lower an article ranked, the higher the ratio of malicious votes it tended to gather. We therefore considered that news items with low ranking results have no chance of being voted for, even if their credibility scores are high enough. Finally, 22,488 votes from 14,205 users were gathered.

Fig. 3. (a) Ratio of Malicious Votes to Total Votes per Article. (b) Number of Votes per Article.

3.6 Evaluating Elite Voters' Performance

The Pearson correlation coefficient is used to assess the relation between news credibility scores and votes from elite voters, as well as that between the scores and votes from general users.
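The valid voting rate of Sect. 3.4 and the correlation check of Sect. 3.6 can be sketched together as follows. All article scores and votes below are invented for illustration; the Pearson coefficient is computed from the textbook formula rather than a statistics package.

```python
# VVR (Sect. 3.4): valid votes (non-malicious votes for articles
# scoring above 3.5) divided by total votes; users with fewer than
# 3 votes are excluded. Pearson correlation (Sect. 3.6) relates
# per-article vote counts to credibility scores. Toy data throughout.

from math import sqrt

def valid_voting_rates(votes, scores, min_votes=3, threshold=3.5):
    """votes: (user, article, malicious) tuples; scores: article -> score."""
    totals, valids = {}, {}
    for user, article, malicious in votes:
        totals[user] = totals.get(user, 0) + 1
        if not malicious and scores[article] > threshold:
            valids[user] = valids.get(user, 0) + 1
    return {u: valids.get(u, 0) / n
            for u, n in totals.items() if n >= min_votes}

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

scores = {"a1": 4.1, "a2": 3.2, "a3": 3.9}
votes = [("u1", "a1", False), ("u1", "a2", False), ("u1", "a3", False),
         ("u2", "a1", True),  ("u2", "a2", False), ("u2", "a3", False),
         ("u3", "a1", False)]            # u3 has < 3 votes -> excluded
print(valid_voting_rates(votes, scores))  # u1: 2/3, u2: 1/3

# Correlation between per-article credibility and elite-vote counts:
print(round(pearson([4.1, 3.9, 3.6, 3.2, 3.0], [14, 11, 9, 6, 5]), 2))
```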
In addition, to compare performance among elite user groups, we formed elite user groups using 3 different criteria: (1) elite user group 1 (VVR > 0.5), (2) elite user group 2 (VVR > 0.4), and (3) elite user group 3 (VVR > 0.3).

4 Results

As we assumed, there was a significant correlation between the number of votes from the elite user groups and news credibility scores. Among them, elite user group 2 showed the highest level of correlation (Pearson correlation coefficient 0.326), while elite user groups 1 and 3 showed slightly lower coefficients (0.288 and 0.287, respectively). However, the number of votes from general users turned out not to have any significant correlation with the credibility scores.

Table 3. Pearson Correlation Coefficient of General Users and Elite Users (against News Credibility Score)

  Group          # voters  Pearson Correlation  Sig. (2-tailed)
  General Users  ?         -.016                .872
  User Group 1   273       .302*                .043
  User Group 2   620       .328*                .021
  User Group 3   914       .287*                .043

5 Discussion

A majority of social news media have adopted user-participation based ranking systems. That is not only because of the difficulty of measuring credibility through content analysis, but also because of the social aspect of the web. However, current user-participation based news ranking systems do not show satisfying results in selecting credible news content. Moreover, they have caused other problems such as group voting. James Surowiecki (2005) also claims that the wisdom of crowds does not emerge naturally, but requires a proper aggregation methodology. Wikipedia, a representative example of the wisdom of crowds, dealt with the credibility problem by differentiating users' power in the system.
Its editing model allows administrators, who are considered trustworthy by the Wikipedia community, to have more access to restricted technical tools, including protecting or deleting pages. We assumed that applying this model to social news services could be a good way to enhance credibility without placing too much burden on users. So we first derived criteria for evaluating users' performance in the system from research on media credibility measures, and selected elite users. As a result, votes from the selected user groups showed a significant correlation with credibility scores. Although the correlation coefficients were not very high, the result is still promising, because the number of votes from general users did not show any significant correlation with credibility scores, supporting the fundamental criticism of previous ranking systems that they stand for popularity rather than credibility. This study is initial work characterizing users' particular voting tendencies, and it did not propose an elaborate news ranking model. Research on designing a model that enhances the correlation between users' votes and credibility is needed.

References

1. Bowman, S., Willis, C.: We Media: How Audiences are Shaping the Future of News and Information, p. 9. The Media Center at The American Press Institute (2003)
2. The Pew Research Center for the People & the Press, http://peoplepress.org/reports/pdf/479.pdf
3. Slashdot's meta-moderation, http://slashdot.org/moderation.shtml
4. Hovland, C.I., Weiss, W.: The Influence of Source Credibility on Communication Effectiveness. Public Opinion Quarterly 15, 635–650 (1951)
5. Infante, D.A.: The Construct Validity of Semantic Differential Scales for the Measurement of Source Credibility. Communication Quarterly 28(2), 19–26 (1980)
6. Gaziano, C., McGrath, K.: Measuring the Concept of Credibility. Journalism and Mass Communication Quarterly 63(3), 451–462 (1986)
7. Rimmer, T., Weaver, D.: Different Questions, Different Answers?
Media Use and Media Credibility. Journalism Quarterly 64, 28–44 (1987)
8. Ognianova, E.: The Value of Journalistic Identity on the World Wide Web. Paper presented to the Mass Communication and Society Division, Association for Education in Journalism and Mass Communication, Baltimore (1998)
9. Kiousis, S.: Public Trust or Mistrust? Perceptions of Media Credibility in the Information Age. Paper presented to the Mass Communication and Society Division, Association for Education in Journalism and Mass Communication, New Orleans (1999)
10. Lampe, C., Garrett, R.K.: It's All News to Me: The Effect of Instruments on Rating Provision. Paper presented at the Hawaii International Conference on System Sciences, Waikoloa, Hawaii (2007)
11. Ko, J.S., Kim, K., Kweon, O., Kim, J., Kim, Y., Han, S.: Open Editing Algorithm: A Collaborative News Promotion Algorithm Based on Users' Voting History. In: International Conference on Computational Science and Engineering, pp. 653–658 (2009)
12. Surowiecki, J.: The Wisdom of Crowds. Anchor Books, New York (2005)

A Social Network System Based on an Ontology in the Korea Institute of Oriental Medicine

Sang-Kyun Kim, Jeong-Min Han, and Mi-Young Song
Information Research Center, TKM Information Research Division, Korea Institute of Oriental Medicine, South Korea
{skkim,goal,smyoung}@kiom.re.kr

Abstract. In this paper, we propose a social network based on an ontology at the Korea Institute of Oriental Medicine (KIOM). Using the social network, researchers can find collaborators and share research results with others, so that studies in the Korean medicine field can be stimulated. For this purpose, personal profiles, scholarships, careers, licenses, academic activities, research results, and personal connections for all researchers in KIOM were first collected.
After the relationships and hierarchy among ontology classes and the attributes of the classes were defined by analyzing the collected information, a social network ontology was constructed using FOAF and OWL. This ontology can easily be interconnected with other social networks through FOAF, and supports reasoning based on the OWL ontology. In the future, we will construct a search and reasoning system using the ontology. Moreover, once the social network is active, we will open it to the whole Korean medicine field.

1 Introduction

Recently, throughout the world, Social Network Services (SNS)[1] have been developing at a rapid rate. As a result, numerous SNSs have been created, and people with various purposes are being connected through them. However, with a multitude of SNSs in place, the problem of linkage among the various SNSs arises. Facebook unveiled a social platform at its F8 conference, and Google devised a platform named OpenSocial. These efforts were made in order to standardize the applications offered by SNSs, but sharing is only possible between users of the particular platform. Lately, to solve this problem, semantic social networks[1][2], based on networks between people and objects, have been suggested. Technologies that support semantic social networks include FOAF (Friend of a Friend)[3] and SIOC (Semantically-Interlinked Online Communities)[4]. In fact, MySpace and Facebook are currently using FOAF. This paper constructs a social network using an ontology for the Korea Institute of Oriental Medicine (KIOM) as a case of a semantic social network. The purpose of this paper is to revitalize research on oriental medicine by allowing researchers in KIOM to search for the various researchers who could aid their work, and by enabling KIOM researchers to easily share their research information. The KIOM social network constructed in this study possesses the characteristics mentioned below: J.G. Breslin et al.
(Eds.): BlogTalk 2008/2009, LNCS 6045, pp. 46–51, 2010. © Springer-Verlag Berlin Heidelberg 2010

First, our ontology was modeled using OWL[5], a semantic web ontology language. In particular, for information regarding people and personal contacts, we used FOAF. These methods allow mutual linkage with other social networks and ontologies and, through the use of OWL inference, provide much more intelligent searches than pre-existing ones. Second, we created a closed social network that can be used only within KIOM, in order to make actual usage possible by populating the social network with as much information as possible. If we make it usable only within the Institute, security can be maintained. The advantage of this is that researchers can share personal information and private research content that they cannot post on the internet. This internal system can provide a foundation for expanding the network throughout the oriental medicine community. In fact, Facebook was initially an SNS made for the use of only Harvard University in the U.S.A., and was later opened to the public, expanded, and further developed.

2 Social Network in KIOM

2.1 Construction of the Ontology

The relationships between classes in the social network ontology are shown in the figure below. The figure does not include all classes, but only those with relationships between objects. The focal concepts in the ontology are Institute Personnel and External Personnel. The reason for this is that the constructed ontology is used only internally within the Institute, and thus these two classes have partially different properties.

Fig. 1. Class relationship of social network ontology

In addition, the institute personnel and the external personnel are both connected to the Organization class, but the institute personnel are connected as an instance of KIOM.
The external personnel can belong to diverse organizational structures such as institutions, schools, or enterprises. This diversity is differentiated in the Organization class with a property for the organization type. Moreover, the institute personnel can have links to research information such as papers, patents, and reports; to academics, experiences, attainments, and academic meetings; and to personal contacts via foaf:knows. In particular, paper and patent information is linked through rdf:Seq. The order of the author or inventor names is important in papers and patents; however, because RDF does not preserve order between instances, the order should be expressed explicitly through rdf:Seq.

The Class Hierarchy of the Ontology

The class structure of the social network ontology is shown in the figure below, as seen from TopBraid Composer[6].

Fig. 2. Class hierarchy of social network ontology

Under the highest class, Entity, there are the Abstract and Physical classes. This follows the structure of the top-level ontology called Suggested Upper Merged Ontology (SUMO)[7]. Entities that occupy a place in time and space are regarded as Physical, and those that are not part of Physical are regarded as Abstract. Thus Abstract contains experience, achievement, and academic information, whereas Physical contains classes for the instances that lower classes refer to. Within the Physical class, there are the Agent class and the Object class. Agent signifies those that bring about a certain change or work by themselves, and Object contains all other types.

2.2 The Analysis of the Relationships in the Ontology

This section analyzes the relationships between people and objects in the social network ontology.
The information inferred through this analysis constitutes new relations that are not stated in the current ontology, and can be used in the future for ontology-based inference and search systems.

Expert Relationship

The Expert relationship refers to finding the experts in a related field. For this, in the KIOM social network ontology, one can use information on papers (title, keywords), patents (title), graduation dissertations (title), major, assigned work, and field of interest. For example, an expert on ginseng would have done much research on ginseng, and thus would have written many papers and possess numerous patents related to it, and would most likely have majored in, or written a graduation dissertation on, a topic related to ginseng. In addition, his or her assigned work and field of interest could be related to ginseng. Regarding research information, projects are excluded from the ontology: although projects exist, the participating researchers generally do not take part in the actual research. Furthermore, research topics tend to change with trends as time passes; even for the same topic of ginseng, interest in old research decreases. Therefore, in the case of papers, graduation dissertations, and majors, there needs to be a sorting according to publication date or graduation date.

Mentor Relationship

The Mentor relationship refers to people among the experts who will be useful in a person's research. In other words, mentors are those, including the experts, who can give help or become partners in research. These people are those who, among the experts, 1) have a doctorate, or experience or a position above team leader or professor, or 2) are the first author, a collaborating author, or the corresponding author of an SCI paper.
In the first case, these mentors will be helpful in carrying out and managing the research; in the latter case, they will be able to provide help on technical aspects of the research. In addition, if we divide the mentors into internal mentors and external mentors, we can also infer the relationships below. Internal mentors refer to mentor-mentee relationships that exist within the Institute. In the case of projects, the research director generally becomes the mentor for participating researchers. In the case of papers, the first author, the co-authors, and the corresponding author lead the research with the other authors, but because they take more responsibility than the other authors (excluding appointed positions), they can become mentors for the relevant papers. In the external mentor relationship, the mentors are those outside of the Institute, and the mentees are the researchers in the Institute. The relationship between researchers and their academic advisers tends to continue as a mentor relationship after graduation. Moreover, for a paper, the authors may have received help in writing it from external personnel; we can therefore infer that those external personnel are external mentors.

Relationship of Personal Contact

The inference for personal contact not only searches the linkage of information about people in the social network, but also tries to find latent linkage relationships, that is, how closely people are in contact with each other. The Expert and Mentor relationships also infer latent relationships; this section, however, discusses other inherent relationships aside from them.

Fig. 3.
Example of personal contact relationship

• Senior/Junior or Same School Class Relationship
  - If A's and B's academic advisers are the same, we can infer that A and B are in "a relationship of senior and junior, or of the same school class."
• Intimate Relationship
  - If B, a person who is not part of the Institute, appears among the list of authors of A's paper or patent, we can infer that A and B have a "close relationship."
  - Within a certain project, if A and B are each a research director, a detail subject director, a co-operating research director, or a commissioned research director, we can infer them to have a "close relationship."
• Personal Contact Relationship
  - If A and B both have the experience of working at the same company, or were educated at the same school under the same major, we can infer that "personal contact exists" between the two.
  - If A has experience of working at a company, or has graduated from a school in a major, we can infer that "personal contact exists" between A and people who are currently working at the company or who are related to that major at the school.

3 Conclusion

In this study, we constructed a social network ontology for the Korea Institute of Oriental Medicine. For this construction, we collected the personal information, academic information, experiences, attainments, academic meetings, research information, and personal contact information of all the researchers. With this as a foundation, we used FOAF and OWL to construct the social network ontology. The ontology constructed in this way is able to link to other social networks and provides ontology inference based on OWL. For the ontology inference, this study analyzed the relationships in the ontology and deduced new relationships that were not stated in the ontology itself.
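The adviser and co-authorship rules above can be sketched as simple checks over fact tuples. The toy knowledge base below is invented for illustration; the actual system performs such inferences over the OWL ontology, not plain Python data.

```python
# Rule sketches for two of the relationship inferences described above,
# over a toy knowledge base (invented names, not the real OWL reasoner).

advisers = {"A": "ProfX", "B": "ProfX", "C": "ProfY"}
paper_authors = {"paper1": ["A", "B"], "paper2": ["A", "D"]}
institute_members = {"A", "B", "C"}   # D is external personnel

def same_adviser(x, y):
    """Senior/junior or same-class relationship: shared academic adviser."""
    a, b = advisers.get(x), advisers.get(y)
    return x != y and a is not None and a == b

def close_relationship(x, y):
    """Close relationship: co-authors where one is outside the institute."""
    return any(x in authors and y in authors and
               (x not in institute_members or y not in institute_members)
               for authors in paper_authors.values())

print(same_adviser("A", "B"))        # True: both advised by ProfX
print(same_adviser("A", "C"))        # False: different advisers
print(close_relationship("A", "D"))  # True: co-authors, D is external
```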
These new relationships can be used in the future for building the inference system and an ontology-based search system. The social network in this study takes a closed form, being used internally within KIOM only. Therefore, it has the advantage that it can share much more useful information than ordinary social networks. However, there is the problem that it covers only outgoing links, to those whom the researchers already know, and holds no information on incoming links. In the future, we plan to build a search and inference system based on the constructed ontology and, once the network is firmly established, to open this social network to the field of oriental medicine in order to solve the above problems.

References

[1] Boyd, D.M., Ellison, N.B.: Social Network Sites: Definition, History, and Scholarship. Journal of Computer-Mediated Communication 13(1) (2007)
[2] Breslin, J., Decker, S.: The Future of Social Networks on the Internet. IEEE Internet Computing, 84–88 (2007)
[3] http://www.foaf-project.org/
[4] http://sioc-project.org/
[5] http://www.w3.org/TR/owl-features
[6] http://www.topquadrant.com/products/TB_Composer.html
[7] http://www.ontologyportal.org/

Semantic Web and Contextual Information: Semantic Network Analysis of Online Journalistic Texts

Yon Soo Lim
WCU Webometrics Institute, Yeungnam University
214-1 Dae-dong, Gyeongsan, Gyeongbuk, 712-749, South Korea
yonsoolim@gmail.com

Abstract. This study examines why contextual information is important for actualizing the idea of the semantic web, based on a case study of a socio-political issue in South Korea. For this study, semantic network analyses were conducted on 62 English-language blog posts and 101 news stories on the web. The results indicated differences in meaning structures between blog posts and professional journalism, as well as between conservative journalism and progressive journalism.
From the results, this study ascertains the empirical validity of current concerns about the practical application of the new web technology, and discusses how the semantic web should be developed.

Keywords: Semantic Web, Semantic Network Analysis, Online Journalism.

1 Introduction

The semantic web [1] is expected to mark a new epoch in the development of internet technology. The key property of the semantic web is to provide more useful information by automatically searching the meaning structure of web content. The new web technology focuses on the creation of collective human knowledge rather than a simple collection of web data. The semantic web is thus not only a technological revolution, but also a sign of social change. However, many researchers and practitioners are skeptical about the practicability of the semantic web. They doubt whether the new web technology can transform complex and unstructured web information into well-defined and structured data. Also, the technology may deliver irrelevant or fractional data if it does not consider contextual information. Further, McCool [2] asserted that the semantic web will fail if it ignores the diverse contexts of web information. Although there is much criticism and skepticism, these negative perspectives are rarely based on empirical studies. At this point, this study aims to ascertain why contextual information is important for actualizing the idea of the semantic web, based on an empirical case study of a socio-political issue in South Korea. This study investigates the feasibility of semantic web technology using a semantic network analysis of online journalistic texts. Specifically, it diagnoses whether there are differences in semantic structure among online texts containing different contextual information. Further, this study discusses how the semantic web should be developed.

J.G. Breslin et al. (Eds.): BlogTalk 2008/2009, LNCS 6045, pp. 52–62, 2010.
© Springer-Verlag Berlin Heidelberg 2010

2 Method

2.1 Background

On July 22, 2009, South Korea's National Assembly passed a contentious media reform bill that allows newspaper publishers and conglomerates to own stakes in broadcasting networks. The political event generated a heated controversy in South Korea, and there were differing opinions and information on the web. It appears to have been a global issue, because international blog posts and online news stories about the event, written in English, could easily be found on the web. Major Korean newspaper publishers also provide English-language news stories for global audiences via the internet. Their in-depth news stories could well have prompted global bloggers' discussions of a single nation's event on the web. For this reason, although the research topic is a specific social phenomenon, it may represent the complexity of online information. This study examines the semantic structure of English-language online texts regarding Korea's media reform bill. Semantic network analysis is used to identify the differences in semantic structures between blog posts and professional journalism, as well as between conservative journalism and progressive journalism.

2.2 Data

To identify the main concepts of online journalistic texts regarding the Korean media reform bill, 62 blog posts and 101 news stories were gathered from Google news and blog search engines using the following words: Korea, media, law, bill, reform, revision, regulation. The time period was from June 1st to August 31st, 2009. 24 of the 101 news stories were produced by conservative Korean news publishers, such as Chosun, Joongang, and DongA. 22 news stories were published by a progressive Korean newspaper, Hankyoreh. The unit of analysis is the individual blog post or news story.
2.3 Semantic Network Analysis

Semantic network analysis is a systematic content-analysis technique that uses network analysis to identify the meaning structure of symbols or concepts in a set of documents, including communication message content [3, 4]. A semantic network represents the associations of neurons responding to symbols or concepts that are socially constructed in human brains; that is, it is a relationship of shared understanding of cultural products among members of a social system [3]. In this study, the semantic network analysis of online journalistic texts was conducted using CATPAC [5, 6], which embodies semantic network analysis in "a self-organizing artificial neural network optimized for reading text" [6]. The program identifies the most frequently occurring words in a set of texts and explores the pattern of interconnections based on their co-occurrence in a neural network [6, 7]. Many studies have used the program to analyze diverse types of texts, such as news articles, journals, web content, and conference papers [8-11].

54 Y.S. Lim

2.4 CATPAC Analysis Procedure

In CATPAC, a scanning window reads through fully computerized texts. The window size represents the limited memory capacity associated with reading texts. The default window covers seven words at a time, on the basis of Miller's [12] argument that people's working memory can hold seven meaningful units at a time. After first reading words 1 through 7, the window slides one word further and reads words 2 through 8, and so on. Whenever given words are presented in the window, artificial neurons representing each word are activated in a simulated neural network [5, 6]. The connection between two neurons is strengthened as the number of times they are simultaneously active increases; conversely, it is weakened as the likelihood of their co-occurrence decreases.
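The scanning-window procedure described above can be sketched in a few lines of Python. This is a minimal illustration of the co-occurrence counting idea only, not CATPAC itself (the real program feeds these counts through an artificial neural network); the function and variable names are invented for the example.

```python
from collections import defaultdict

def cooccurrence_counts(words, window=7):
    """Slide a fixed-size window over the token stream and count how often
    each unordered pair of distinct words appears together in a window."""
    counts = defaultdict(int)
    for start in range(max(1, len(words) - window + 1)):
        seen = set(words[start:start + window])
        for a in seen:
            for b in seen:
                if a < b:  # count each unordered pair once per window
                    counts[(a, b)] += 1
    return counts

tokens = "media bill party media bill vote law media".split()
m = cooccurrence_counts(tokens, window=7)
# m[("bill", "media")] is the number of windows containing both words
```

Words that repeatedly fall within the same seven-word window accumulate high pair counts, which is the raw material for the connection strengths in the simulated neural network.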
The program creates a matrix based on the probability of co-occurrence between the neurons representing words or symbols. From this matrix, CATPAC identifies the pattern of their interrelationships using cluster analysis. In this study, the cluster analysis uses Ward's method [13] to minimize the variance within clusters. This method groups the words that have the greatest similarity in the co-occurrence matrix, where each cell shows the likelihood that the occurrence of one word will indicate the occurrence of another. Through the cluster analysis, CATPAC produces a "dendogram," a graphical representation of the resultant clusters within the analyzed texts [5, 6]. In addition to the cluster analysis, a multidimensional scaling (MDS) technique facilitates understanding of the interrelationships among words and clusters in the semantic neural network. The co-occurrence matrix can be transformed into a coordinate matrix for spatial representation through an MDS algorithm [14]. The position of each word in a multidimensional space is determined by the similarities between words, based on the likelihood of their co-occurrence: words with strong connections lie close to each other, whereas words with weak relationships lie far apart. Thus, through MDS, the pattern of the semantic network in a given set of texts can be identified visually. For this analysis, this study used UCINET 6 [15], a program designed to analyze network data.

3 Results

For the semantic network analysis, a list of meaningless words, including articles, prepositions, conjunctions, and transitive verbs, was excluded. Any problematic words that might distort the analysis were also eliminated by the researcher. In addition, similar words were combined into single words to facilitate the analysis. To clarify the major concepts of the online journalistic texts, this study focused on the most frequently occurring words, i.e., those accounting for more than 1% of the total word frequency in each set of texts.
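The clustering-plus-MDS pipeline can be approximated with standard scientific Python. The sketch below uses an invented five-word co-occurrence matrix: Ward's method comes from SciPy, and the scaling step uses the classical Torgerson procedure the paper cites [14]; it is a toy reconstruction of the workflow, not the CATPAC/UCINET implementation.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Invented toy co-occurrence matrix for five words (symmetric; larger = co-occur more)
words = ["media", "bill", "party", "rainbow", "sky"]
co = np.array([
    [0, 9, 8, 1, 0],
    [9, 0, 7, 0, 1],
    [8, 7, 0, 1, 0],
    [1, 0, 1, 0, 9],
    [0, 1, 0, 9, 0],
], dtype=float)

# Convert co-occurrence (similarity) into dissimilarity for clustering and scaling
dist = co.max() - co
np.fill_diagonal(dist, 0)

# Ward's minimum-variance hierarchical clustering (the basis of the dendograms)
Z = linkage(squareform(dist, checks=False), method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")

# Classical (Torgerson) MDS: double-center the squared distances, keep top-2 dimensions
n = len(words)
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (dist ** 2) @ J
vals, vecs = np.linalg.eigh(B)
top = np.argsort(vals)[::-1][:2]
coords = vecs[:, top] * np.sqrt(np.maximum(vals[top], 0))  # one 2-D point per word
```

Words that co-occur often ("media", "bill", "party") fall into one cluster and land close together in the 2-D space, mirroring how the dendograms and MDS maps in the figures are read.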
3.1 Blog vs. Newspaper

As shown in Table 1, in the blog posts the most frequently occurring word was media, which occurred 124 times in 33 (53.2%) posts. Other frequently occurring words were bill, 98 times (26, 41.9%); party, 84 times (23, 37.1%); parliament, 63 times (24, 38.7%); Korea, 57 times (27, 43.5%); opposition, 55 times (18, 29.0%); Korean, 50 times (23, 37.1%); ruling, 49 times (19, 30.6%); law, 48 times (12, 19.4%); lawmaker, 45 times (14, 22.6%); and GNP (Grand National Party), 43 times (14, 22.6%).

Table 1. List of the most frequently mentioned words in 62 blogs

WORD         Freq.  Freq.(%)  Case  Case(%)
MEDIA         124    9.8       33    53.2
BILL           98    7.8       26    41.9
PARTY          84    6.7       23    37.1
PARLIAMENT     63    5.0       24    38.7
KOREA          57    4.5       27    43.5
OPPOSITION     55    4.4       18    29.0
KOREAN         50    4.0       23    37.1
RULING         49    3.9       19    30.6
LAW            48    3.8       12    19.4
LAWMAKER       45    3.6       14    22.6
GNP            43    3.4       14    22.6
GOVERNMENT     39    3.1       19    30.6
NEWSPAPER      39    3.1       13    21.0
BROADCAST      36    2.9       13    21.0
VOTE           35    2.8       23    37.1
OWNERSHIP      29    2.3       16    25.8
ASSEMBLY       27    2.1       10    16.1
COMPANY        25    2.0        7    11.3
NATIONAL       25    2.0       10    16.1
PASS           25    2.0       15    24.2
PUBLIC         24    1.9        7    11.3
DP             23    1.8       11    17.7
NETWORK        23    1.8       13    21.0
MB             21    1.7       13    21.0
PEOPLE         21    1.7       15    24.2
REFORM         21    1.7       15    24.2
FIGHT          16    1.3       10    16.1
CHANGE         15    1.2        7    11.3
CONTROL        15    1.2        7    11.3
VIOLENCE       15    1.2       10    16.1
BRAWL          14    1.1       10    16.1
MEMBERS        14    1.1       10    16.1
POLITICIANS    14    1.1        8    12.9
PRESIDENT      14    1.1       10    16.1
SPEAKER        14    1.1        7    11.3

Table 2 shows the list of the most frequent words in the news articles. In the newspapers, the most frequent word was bill, which occurred 658 times in 100 (99.0%) news stories.
Others were media, 587 times (98, 97.0%); GNP, 481 times (90, 89.1%); party, 470 times (88, 87.1%); DP (Democratic Party), 365 times (76, 75.2%); assembly, 350 times (90, 89.1%); lawmaker, 303 times (77, 76.2%); opposition, 296 times (87, 86.1%); national, 294 times (89, 88.1%); broadcast, 253 times (68, 67.3%); vote, 250 times (71, 70.3%); and law, 209 times (61, 60.4%).

Table 2. List of the most frequently mentioned words in 101 news stories

WORD         Freq.  Freq.(%)  Case  Case(%)
BILL          658    9.7      100    99.0
MEDIA         587    8.7       98    97.0
GNP           481    7.1       90    89.1
PARTY         470    6.9       88    87.1
DP            365    5.4       76    75.2
ASSEMBLY      350    5.2       90    89.1
LAWMAKER      303    4.5       77    76.2
OPPOSITION    296    4.4       87    86.1
NATIONAL      294    4.3       89    88.1
BROADCAST     253    3.7       68    67.3
VOTE          250    3.7       71    70.3
LAW           209    3.1       61    60.4
RULING        199    2.9       75    74.3
NEWSPAPER     170    2.5       68    67.3
SPEAKER       159    2.3       62    61.4
SESSION       150    2.2       55    54.5
PARLIAMENT    136    2.0       54    53.5
PUBLIC        133    2.0       50    49.5
PASS          127    1.9       69    68.3
PASSAGE       117    1.7       50    49.5
REFORM        113    1.7       59    58.4
MB            112    1.7       53    52.5
INDUSTRY      111    1.6       55    54.5
KOREA         109    1.6       60    59.4
MEMBER         93    1.4       55    54.5
AGAINST        82    1.2       53    52.5
PRESIDENT      81    1.2       48    47.5
FLOOR          78    1.2       42    41.6
COMPANY        75    1.1       37    36.6
PEOPLE         74    1.1       39    38.6
LEGISLATION    71    1.0       25    24.8
LEADER         69    1.0       41    40.6

Based on the co-occurrence matrix representing the semantic network of the most frequently occurring words, a cluster analysis was conducted to further examine the underlying concepts. The cluster analysis identified the groupings of words that tend to co-occur in the online journalistic texts. Figure 1 presents the co-occurring clusters for the blog posts and the news stories. The co-occurring clusters of the blog posts were fragmentary, even though one group of words included the most frequently occurring words. Conversely, the dendogram of the news stories showed one large cluster and several small clusters, and most high-frequency words were strongly connected to each other. MDS was then conducted to investigate the interrelationships between the words and the clusters. Figure 2 presents the semantic networks in two-dimensional space.

(a) Blog (b) News
Fig. 1. Co-occurring clusters of blog posts and news stories

The centralization of the blog semantic network was 19.6%. A larger cluster included 20 of the 35 words, with strongly connected words: media, bill, Korea, parliament, newspaper, broadcast, law, public, and pass. In addition, 11 words were isolated. Conversely, the centralization of the news semantic network was 44.2%; 23 of the 32 words were included in a large cluster, and they were tightly associated with each other.

(a) Blog (b) News
Fig. 2. Semantic networks of blog posts and news stories

3.2 Conservative Newspaper vs. Progressive Newspaper

In the same way as the previous analysis, 24 news stories published by conservative newspapers and 22 news articles by a progressive newspaper were examined. As shown in Table 3, in the conservative newspapers the most frequently occurring word was bill, which occurred 162 times in 23 (95.8%) news stories. Other words of high frequency were media, 135 times (22, 91.7%); assembly, 107 times

Table 3. List of the most frequently mentioned words in 24 conservative news stories

WORD          Freq.  Freq.(%)  Case  Case(%)
BILL           162    9.3       23    95.8
MEDIA          135    7.8       22    91.7
ASSEMBLY       107    6.2       21    87.5
GNP            107    6.2       17    70.8
PARTY           99    5.7       20    83.3
NATIONAL        97    5.6       21    87.5
DP              94    5.4       17    70.8
LAWMAKER        85    4.9       18    75.0
VOTE            71    4.1       15    62.5
OPPOSITION      64    3.7       19    79.2
LAW             56    3.2       16    66.7
SESSION         56    3.2       13    54.2
BROADCAST       46    2.7       12    50.0
RULING          46    2.7       15    62.5
INDUSTRY        45    2.6       17    70.8
NEWSPAPER       40    2.3       12    50.0
BROADCASTER     38    2.2       10    41.7
REFORM          36    2.1       15    62.5
SPEAKER         35    2.0       12    50.0
PASS            29    1.7       15    62.5
COMPANY         27    1.6       12    50.0
WORKERS         25    1.4        9    37.5
KOREA           24    1.4       15    62.5
PUBLIC          23    1.3       13    54.2
FLOOR           21    1.2       11    45.8
LEADER          20    1.2       10    41.7
MB              20    1.2        7    29.2
MEMBER          19    1.1       15    62.5
PEOPLE          19    1.1       10    41.7
TIME            19    1.1       12    50.0
END             18    1.0       11    45.8
LEGISLATIVE     18    1.0        7    29.2
MBC             17    1.0        7    29.2
PARLIAMENT      17    1.0        6    25.0

Table 4. List of the most frequently mentioned words in 22 progressive news stories

WORD            Freq.  Freq.(%)  Case  Case(%)
GNP              147    8.0       21    95.5
MEDIA            144    7.8       20    90.9
BROADCAST        128    6.9       20    90.9
BILL             125    6.8       21    95.5
LAW              109    5.9       20    90.9
PARTY             86    4.7       19    86.4
ASSEMBLY          84    4.6       21    95.5
DP                78    4.2       15    68.2
PUBLIC            75    4.1       15    68.2
NATIONAL          73    4.0       20    90.9
LAWMAKER          60    3.3       14    63.6
LEGISLATION       51    2.8       11    50.0
OPPOSITION        46    2.5       17    77.3
PASSAGE           44    2.4       16    72.7
REVISION          41    2.2       14    63.6
PASS              39    2.1       17    77.3
KOREA             38    2.1       18    81.8
PEOPLE            38    2.1       14    63.6
MB                37    2.0       14    63.6
VOTE              36    2.0       14    63.6
OPINION           35    1.9       13    59.1
ADMINISTRATION    34    1.8        9    40.9
NEWSPAPER         33    1.8       13    59.1
RULING            31    1.7       16    72.7
AGAINST           30    1.6       12    54.5
SPEAKER           29    1.6       12    54.5
POLITICAL         24    1.3       13    59.1
TERRESTRIAL       24    1.3        5    22.7
SESSION           22    1.2       11    50.0
STRIKE            22    1.2        7    31.8
COMMENTS          21    1.1       20    90.9
PRESIDENT         20    1.1        8    36.4
QUESTIONS         20    1.1       20    90.9
RESPONDENTS       20    1.1        5    22.7

(21, 87.5%); GNP, 107 times (17, 70.8%); party, 99 times (20, 83.3%); national, 97 times (21, 87.5%); DP, 94 times (17, 70.8%); lawmaker, 85 times (18, 75.0%); vote, 71 times (15, 62.5%); and
opposition, 64 times (19, 79.2%). On the other hand, as presented in Table 4, in the progressive newspaper the most frequent word was GNP, which occurred 147 times in 21 (95.5%) news stories. Other frequently occurring words were media, 144 times (20, 90.9%); broadcast, 128 times (20, 90.9%); bill, 125 times (21, 95.5%); law, 109 times (20, 90.9%); party, 86 times (19, 86.4%); assembly, 84 times (21, 95.5%); DP, 78 times (15, 68.2%); public, 75 times (15, 68.2%); and national, 73 times (20, 90.9%). A cluster analysis was then conducted. Figure 3 presents the dendograms of the conservative newspapers and the progressive newspaper.

(a) Conservative (b) Progressive
Fig. 3. Co-occurring clusters of conservative and progressive newspapers

The two dendograms seemed similar, in that a larger cluster included the majority of the high-frequency words, and most high-frequency words in the large group were identical. However, the words included in the minor clusters differed between the newspapers. In the conservative newspapers, these words were MB (the initials of the Korean president), workers, MBC (a Korean public broadcasting network), parliament, people, public, and reform. Conversely, in the progressive newspaper, they were administration, MB, president, against, strike, people, legislation, and respondents. As shown in Figure 4, the visualized semantic networks present the differences between the two types of newspapers. The network centralization of the conservative newspapers was 46.6%; there was only one large cluster, including 27 of the 34 words, and the words were tightly connected to each other. On the contrary, for the progressive newspaper, the centralization of the semantic network was 20.0%; there were a larger cluster and a small group of words. The larger group included relatively neutral concepts, but the small group contained several negative words, such as against, strike, and questions.

(a) Conservative (b) Progressive
Fig. 4.
Semantic networks of conservative and progressive newspapers

4 Discussion

This study focused on online journalistic texts concerning a specific socio-political issue in South Korea. From a global perspective, the issue itself has a very limited context within a nation-state's boundary. Nevertheless, the results indicated differences among online texts carrying different contextual information, such as journalistic style and tone. The semantic network analyses showed that the semantic structures of blog posts and news stories were different. The semantic network of the blogs was relatively sparse compared to that of the newspapers. The results also reveal that bloggers discussed diverse issues derived from the main event, such as the politicians' fighting and violence. Conversely, professional journalists focused on the main event and reported the facts straightforwardly, such as the passage of the media bill. The semantic networks of conservative and progressive journalism were also different. In this study, the progressive newspaper, Hankyoreh, attended to negative issues, such as people striking against the MB administration, even though it mainly reported the facts of the main event. On the contrary, the conservative newspapers, such as Chosun, Joongang, and DongA, made little of the negative aspects; instead, they focused more on the main event. Additionally, as shown in Table 5, of the main words from the four types of semantic networks, 20 words were commonly used, while the other 37 words differed. Table 5.
List of common words and different words

Common Words (N=20): ASSEMBLY, BILL, BROADCAST, DP, GNP, KOREA, LAW, LAWMAKER, MB, MEDIA, NATIONAL, NEWSPAPER, OPPOSITION, PARTY, PASS, PEOPLE, PUBLIC, RULING, SPEAKER, VOTE

Different Words (N=37): ADMINISTRATION, AGAINST, BRAWL, BROADCASTER, CHANGE, COMMENTS, COMPANY, CONTROL, END, FIGHT, FLOOR, GOVERNMENT, INDUSTRY, KOREAN, LEADER, LEGISLATION, LEGISLATIVE, MBC, MEMBER, NETWORK, OPINION, OWNERSHIP, PARLIAMENT, PASSAGE, POLITICAL, POLITICIANS, PRESIDENT, QUESTIONS, REFORM, RESPONDENTS, REVISION, SESSION, STRIKE, TERRESTRIAL, TIME, VIOLENCE, WORKERS

In this case, if semantic web technology considered only the common words and disregarded other contextual information, a great deal of useful information would be hidden or lost in the web system. Consequently, the new web system would provide only fractional data, which is far from the idea of the semantic web. Thus, this study empirically supports McCool's [2] admonition that the semantic web will fail if it neglects the diverse contextual information on the web. To realize the idea of the semantic web, that is, the creation of collective human knowledge, web ontologies should be defined more carefully, considering the complexity of web information. The semantic web is not only a technological issue but also a social issue: the semantic structure of web information changes and develops on the basis of social interaction among internet users. Computer scientists have led the discussion of the semantic web, and their arguments have usually focused on programming and database structure; in that case, the essentials of the web information can be overlooked. Alternatively, social scientists can provide crucial ideas for identifying how social web content is constructed and developed. Thus, collaborative multi-disciplinary approaches are required for the practical embodiment of the semantic web.
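The common/different split in Table 5 is a plain set computation. The word sets below are abbreviated stand-ins for the four semantic networks (the full lists appear in Table 5 and Tables 1-4); only the set operations are the point here.

```python
# Abbreviated, illustrative word sets for the four semantic networks
blogs        = {"media", "bill", "party", "fight", "violence", "brawl"}
newspapers   = {"media", "bill", "party", "floor", "session"}
conservative = {"media", "bill", "party", "workers", "mbc"}
progressive  = {"media", "bill", "party", "strike", "against"}

networks = [blogs, newspapers, conservative, progressive]
common = set.intersection(*networks)        # words shared by all four networks
different = set.union(*networks) - common   # words tied to particular contexts
```

A system that indexed only `common` would discard exactly the context-bearing vocabulary (fight, strike, workers, ...) that distinguishes blogs from newspapers and conservative from progressive coverage.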
5 Conclusion

The findings of this study offer a starting point for future research. Although this study focused on a specific socio-political issue in South Korea, there were differences in the semantic structures among online texts containing different contextual information. The results illustrate the complexity of web information. To obtain a better understanding of the semantic structure of massive online content, subsequent research with multi-disciplinary collaboration is required.

References

1. Berners-Lee, T., Hendler, J., Lassila, O.: The semantic web. Scientific American 284, 34–43 (2001)
2. McCool, R.: Rethinking the semantic web, part 1. IEEE Internet Computing, 85–87 (2005)
3. Monge, P.R., Contractor, N.S.: Theories of communication networks. Oxford University Press, New York (2003)
4. Monge, P.R., Eisenberg, E.M.: Emergent communication networks. In: Jablin, F.M., Putnam, L.L., Roberts, K.H., Porter, L.W. (eds.) Handbook of organizational communication, pp. 304–342. Sage, Newbury Park (1987)
5. Woelfel, J.: Artificial neural networks in policy research: A current assessment. Journal of Communication 43, 63–80 (1993)
6. Woelfel, J.: CATPAC II user's manual (1998), http://www.galileoco.com/Manuals/CATPAC3.pdf
7. Doerfel, M.L., Barnett, G.A.: A semantic network analysis of the International Communication Association. Human Communication Research 25, 589–603 (1999)
8. Choi, S., Lehto, X.Y., Morrison, A.M.: Destination image representation on the web: Content analysis of Macau travel related websites. Tourism Management 28, 118–129 (2007)
9. Doerfel, M.L., Marsh, P.S.: Candidate-issue positioning in the context of presidential debates. Journal of Applied Communication Research 31, 212–237 (2003)
10. Kim, J.H., Su, T.-Y., Hong, J.: The influence of geopolitics and foreign policy on the U.S. and Canadian media: An analysis of newspaper coverage of Sudan's Darfur conflict. Harvard International Journal of Press/Politics 12, 87–95 (2007)
11.
Rosen, D., Woelfel, J., Krikorian, D., Barnett, G.A.: Procedures for analyses of online communities. Journal of Computer-Mediated Communication 8 (2003)
12. Miller, G.A.: The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review 63, 81–97 (1956)
13. Ward, J.H.: Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association 58, 236–244 (1963)
14. Torgerson, W.S.: Theory and methods of scaling. John Wiley & Sons, New York (1958)
15. Borgatti, S.P., Everett, M.G., Freeman, L.C.: Ucinet 6 for Windows. Analytic Technologies, Harvard (2002)

Semantic Twitter: Analyzing Tweets for Real-Time Event Notification

Makoto Okazaki and Yutaka Matsuo
The University of Tokyo, 2-11-16 Yayoi, Bunkyo-ku, Tokyo 113-8656, Japan

Abstract. Twitter, a popular microblog service, has received much attention recently. An important characteristic of Twitter is its real-time nature. To date, however, the integration of semantic processing with the real-time nature of Twitter has not been well studied. In this paper, we propose an event notification system that monitors tweets (Twitter messages) and delivers semantically relevant tweets if they meet a user's information needs. As an example, we construct an earthquake prediction system targeting Japanese tweets. Because of the numerous earthquakes in Japan, and because of the vast number of Twitter users throughout the country, it is sometimes possible to detect an earthquake by monitoring tweets before the earthquake wave actually arrives. (An earthquake is transmitted through the earth's crust at about 3–7 km/s; consequently, a person has about 20 s before its arrival at a point that is 100 km distant.) Other examples are the detection of rainbows in the sky and the detection of traffic jams in cities.
We first prepare training data and apply a support vector machine to classify tweets into positive and negative classes, corresponding to the detection of a target event. Features for the classification are constructed from the keywords in a tweet, the number of words, the context of event words, and so on. In the evaluation, we demonstrate that every recent large earthquake has been detected by our system; indeed, notification is delivered much faster than the announcements broadcast by the Japan Meteorological Agency.

1 Introduction

Twitter, a popular microblogging service, has received much attention recently. Users of Twitter can post a short text called a tweet: a message of 140 characters or less. A user can follow other users (unless they choose a privacy setting), and her followers can read her tweets. Since its launch in October 2006, the number of Twitter users has increased rapidly; Twitter users are currently estimated at 44.5 million worldwide1. An important characteristic of Twitter is its real-time nature. Although blog users typically update their blogs once every several days, Twitter users write tweets several times in a single day. Because users can see how other users are doing, and often what they are thinking right now, they repeatedly come back to the site to check what other people are doing.

1 http://www.techcrunch.com/2009/08/03/twitter-reaches-44.5-million-people-worldwide-in-june-comscore/

J.G. Breslin et al. (Eds.): BlogTalk 2008/2009, LNCS 6045, pp. 63–74, 2010. © Springer-Verlag Berlin Heidelberg 2010

64 M. Okazaki and Y. Matsuo

Fig. 1. Twitter screenshot

In Japan, more than half a million Twitter users exist, and the number is growing rapidly. The Japanese version of Twitter was launched on 23 April 2008. In February 2008, Japan was the No. 2 country with respect to Twitter traffic2. At the time of this writing, Japan has the 11th largest number of users in the world. Figure 1 presents a screenshot of the Japanese version of Twitter.
Every function is the same as in the original English-language interface, but the user interface is in Japanese. Some studies have investigated Twitter. Java et al. analyzed Twitter as early as 2007; they described the social network of Twitter users and investigated their motivations [1]. Huberman et al. analyzed more than 300 thousand users and discovered that the relation between friends (defined as a person to whom a user has directed posts using an "@" symbol) is the key to understanding interaction in Twitter [2]. Recently, boyd et al. investigated retweet activity, the Twitter equivalent of e-mail forwarding, in which users post messages originally posted by others [3]. On the other hand, many works have investigated Semantic Web technology (or semantic technology in a broader sense). Recently, many works have examined how to integrate linked data on the web [4]. Automatic extraction of semantic data is another approach that many studies have taken; for example, extracting relations among entities from web pages [5] illustrates the use of natural language processing and web mining to obtain Semantic Web data. Extracting events is also an important means of obtaining knowledge from web data. To date, means of integrating semantic processing with the real-time nature of Twitter have not been well studied. Combining these two directions, we can develop various algorithms to process Twitter data semantically. Because we can access numerous texts (and social relations among users) in mere seconds, if we were able to extract the relevant tweets automatically, then we would be able to provide real-time event notification services. As described in this paper, we propose an event notification system that monitors tweets and delivers some of them if they are semantically relevant to users' information needs. As an example, we develop an earthquake reporting system using Japanese tweets.
Because of the numerous earthquakes in Japan and the numerous, geographically dispersed Twitter users throughout the country, it is sometimes possible to detect an earthquake by monitoring tweets. In other words, many earthquake events occur in Japan, and many "sensors" are allocated throughout the country.

2 http://blog.twitter.com/2008/02/twitter-web-traffic-around-world.html

Fig. 2. Twitter user map
Fig. 3. Earthquake map

Figure 2 portrays a map of Twitter users worldwide (obtained from the UMBC eBiquity Research Group); Fig. 3 depicts a map of earthquake occurrences worldwide (using data from the Japan Meteorological Agency (JMA)). It is apparent that the only intersection of the two maps, i.e., a region with both many earthquakes and many Twitter users, is Japan. (Other regions such as Indonesia, Turkey, Iran, Italy, and Pacific US cities such as Los Angeles and San Francisco also roughly intersect, although the density is much lower than in Japan.) Our system detects an earthquake occurrence and sends an e-mail, possibly before the earthquake actually arrives at a certain location: an earthquake propagates at about 3–7 km/s, so a person who is 100 km distant from the epicenter has about 20 s before the arrival of the earthquake wave. Indeed, a blogger has already written about the tweet phenomenon in relation to earthquakes in Japan3:

Japan Earthquake Shakes Twitter Users ... And Beyonce: Earthquakes are one thing you can bet on being covered on Twitter first, because, quite frankly, if the ground is shaking, you're going to tweet about it before it even registers with the USGS and long before it gets reported by the media. That seems to be the case again today, as the third earthquake in a week has hit Japan and its surrounding islands, about an hour ago.

3 http://mashable.com/2009/08/12/japan-earthquake/
The first user we can find who tweeted about it was Ricardo Duran of Scottsdale, AZ, who, judging from his Twitter feed, has been traveling the world, arriving in Japan yesterday.

Another example of an event that can be captured using Twitter is a rainbow: sometimes people tweet about beautiful rainbows in the sky. To detect such target events, we first prepare training data and apply a support vector machine to classify a tweet as belonging to either a positive or a negative class, corresponding to the detection of a target event. Features for the classification are constructed from the keywords in a tweet, the number of words, the context of event words, and so on. In the evaluation, we show that we can send an earthquake notification in less than a minute, which is much faster than the announcements broadcast by the Japan Meteorological Agency. The contributions of the paper can be summarized as follows:

– The paper provides an example of a semantic technology application on Twitter and presents potential uses for Twitter data.
– Many earthquake prediction studies have been conducted from a geological science perspective. This paper presents an innovative social approach, which has not been reported before in the literature.

This paper is organized as follows: in the next section, we explain the concept of our system and show system details. In Section 3, we explain the experiments. Section 4 is devoted to related work and discussion. Finally, we conclude the paper.

2 System Architecture

Figure 4 presents the concept of our system. Generally speaking, the classical mass media provide standardized information to the masses, whereas social media provide real-time information in which individual pieces of information are useful for only a few people. Using semantic technology, we can create an advanced social medium of a new kind: one that provides useful, real-time information to the users who need it.
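The "about 20 s" head start quoted earlier follows from simple arithmetic on crustal wave speeds; a minimal sketch (the function name is ours):

```python
# Seismic waves travel through the crust at roughly 3-7 km/s, so an observer
# some distance from the epicenter has a short window before the shaking arrives.
def seconds_until_arrival(distance_km, wave_speed_km_s):
    return distance_km / wave_speed_km_s

# At 100 km, the 3-7 km/s range gives roughly 14-33 s of warning,
# which is where the paper's "about 20 s" figure comes from.
fastest = seconds_until_arrival(100, 7.0)
slowest = seconds_until_arrival(100, 3.0)
```

This window is what makes sub-minute notification useful: an alert delivered in 20-60 s can still beat the shaking for users sufficiently far from the epicenter.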
We pick earthquake information as an example because Japan has many earthquakes (as is true also of Korea, our conference venue), and because earthquake information is much more valuable if given in real time. We can turn off a stove or heater and hide under a desk or table if we have several seconds' warning before an earthquake actually hits. For that very reason, the Japanese government has allocated a considerable budget to the development of earthquake alert systems. We take a different approach from classical earthquake prediction: by gathering information about earthquakes from Twitter, we can provide useful, real-time information to many people. Figure 5 presents the system architecture. We first search for the set of tweets TQ that include the query string Q, querying Twitter every s seconds. We use a search API4 to search tweets.

4 search.twitter.com or http://pcod.no-ip.org/yats/search

Fig. 4. Use of semantic technology for social media
Fig. 5. System architecture

In our case, we set Q = {"earthquake", "shakes"}5, so TQ is a set of tweets including the query words, and we set s to 5 s. The obtained set of tweets TQ sometimes includes tweets that do not refer to an actual earthquake. For example, a user might mention that someone is "shaking" hands, or that the people in the apartment upstairs are like an "earthquake". Therefore, we must determine whether a tweet t ∈ TQ really refers to an actual earthquake occurring (at least in the sense that the user believes so). To classify a tweet into a positive class (an earthquake is occurring) or a negative class (no earthquake), we build a classifier using a support vector machine (SVM) [6], a popular machine-learning algorithm. By preparing 597 examples as a training set, we obtain a model that classifies tweets into the positive and negative categories automatically.
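The classification step can be sketched with scikit-learn. The training tweets below are invented English stand-ins (the real system used 597 labelled Japanese tweets), and for brevity only the Group B keyword features described next are used; the paper's Group A and C features would be added alongside the bag of words.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy training data: 1 = tweet reports a real earthquake, 0 = it does not
tweets = [
    "earthquake the whole room is shaking",         # positive
    "big earthquake just now still shaking",        # positive
    "felt an earthquake the ground was shaking",    # positive
    "my dog offers a paw instead of a handshake",   # negative
    "the truck rumbling past rattled the windows",  # negative
    "that concert last night was wild",             # negative
]
labels = [1, 1, 1, 0, 0, 0]

# Bag-of-words features (Group B) fed to a linear-kernel SVM, as in the paper
clf = make_pipeline(CountVectorizer(), LinearSVC())
clf.fit(tweets, labels)
```

An incoming tweet from TQ would then be scored with `clf.predict([...])` and, if positive, counted as one "sensor reading" for the event.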
5 Actually, we set Q to the corresponding Japanese words for "earthquake" and "shaking".

Table 1. Performance of classification

(i) earthquake query:
Features    A        B        C        All
Recall      87.50%   87.50%   50.00%   87.50%
Precision   63.64%   38.89%   66.67%   63.64%
F-value     73.69%   53.85%   57.14%   73.69%

(ii) shaking query:
Features    A        B        C        All
Recall      66.67%   86.11%   52.78%   80.56%
Precision   68.57%   57.41%   86.36%   65.91%
F-value     67.61%   68.89%   68.20%   72.50%

The features of a tweet fall into three groups. Morphological analysis is conducted using MeCab6, which separates sentences into a set of words.

Group A (simple statistical features): the number of words in a tweet, and the position of the query word within the tweet.
Group B (keyword features): the words in a tweet.
Group C (context word features): the words before and after the query word.

The classification performance is presented in Table 1. We used two query words, earthquake and shaking; the performance for each query is shown. We used a linear kernel for the SVM. We obtain the highest F-value when we use feature group A alone and when we use all features. Surprisingly, feature groups B and C do not contribute much to the classification performance; when an earthquake occurs, a user is surprised and might produce a very short tweet. It is apparent that recall is not as high as precision, which is attributable to the query words being used in contexts other than the one we intend. Sometimes it is difficult even for humans to judge whether a tweet reports an actual earthquake; for example, a user might write, "Is this an earthquake or a truck passing?" Overall, the classification performance is good, considering that we can use multiple sensor readings as evidence for event detection. After classification yields a positive example, the system quickly sends an e-mail (usually mobile e-mail) to the registered users. It is hoped that the e-mail is received by each user shortly before the earthquake actually arrives.
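The F-values in Table 1 are the usual harmonic mean of precision and recall, which is easy to verify against the reported numbers:

```python
# F-value (F1 score): the harmonic mean of precision and recall
def f_value(precision, recall):
    return 2 * precision * recall / (precision + recall)

# Table 1, earthquake query, feature group A:
# precision 63.64%, recall 87.50%, reported F-value 73.69%
f_a = f_value(0.6364, 0.8750)
# Feature group C: precision 66.67%, recall 50.00%, reported F-value 57.14%
f_c = f_value(0.6667, 0.5000)
```

The harmonic mean explains why group C scores poorly despite its high precision: its low recall drags the F-value down.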
3 Experiments

We have operated a system called Toretter7 since August 8, 2009. The system screenshot is shown in Fig. 6. Users can see the detection of past earthquakes, and can register their e-mail addresses to receive notices of future earthquake detections. To date, about 20 test users have registered to use the system.

6 http://mecab.sourceforge.net/
7 It means "we have taken it" in Japanese.

Fig. 6. Screenshot of Toretter: Earthquake notification system

Table 2. Facts about earthquake detection

  Date    Magnitude  Location     Time      First tweet  #Tweets        JMA
                                            detected     within 10 min  announcement
  Aug 18  4.5        Tochigi      06:58:55  07:00:30     35             07:08
  Aug 18  3.1        Suruga-wan   19:22:48  19:23:14     17             19:28
  Aug 21  4.1        Chiba        08:51:16  08:51:35     52             08:56
  Aug 25  4.3        Uraga-oki    02:22:49  02:23:21     23             02:27
  Aug 25  3.5        Fukushima    22:21:16  22:22:29     13             22:26
  Aug 27  3.9        Wakayama     17:47:30  17:48:11     16             17:53
  Aug 27  2.8        Suruga-wan   20:26:23  20:26:45     14             20:31
  Aug 31  4.5        Fukushima    00:45:54  00:46:24     32             00:51
  Sep 2   3.3        Suruga-wan   13:04:45  13:05:04     18             13:10
  Sep 2   3.6        Bungo-suido  17:37:53  17:38:27     3              17:43

Table 2 presents some facts about earthquake detection by our system. We investigated 10 earthquakes during 18 August – 2 September, all of which were detected by our system. The first tweet about each earthquake appears within a minute or so of its occurrence; the delay results from the time a user takes to post a tweet, the time to index the post, and the time our system takes to issue queries. Every earthquake elicited more than 10 tweets within 10 min, except the one in Bungo-suido, the sea between the two big islands of Kyushu and Shikoku. Our system sent e-mails mostly within a minute, sometimes within 20 s. The delivery time is far earlier than the announcements of the Japan Meteorological Agency (JMA), which are widely broadcast on TV; on average, a JMA announcement is broadcast 6 min after an earthquake occurs.

Table 3. Earthquake detection performance for two months from August 2009

  JMA intensity scale    2 or more    3 or more    4 or more
  Num. of earthquakes    78           25           3
  Detected               70 (89.7%)   24 (96.0%)   3 (100.0%)
  Promptly detected8     53 (67.9%)   20 (80.0%)   3 (100.0%)

Fig. 7. The locations of the tweets about the earthquake
Fig. 8. Number of tweets related to earthquakes

Statistically, we detected 96% of the earthquakes of JMA seismic intensity scale9 3 or more, as shown in Table 3. Figure 8 shows the number of tweets mentioning earthquakes; spikes appear when an earthquake occurs, and the number then gradually decreases. Overall, we detected 53% of earthquakes larger than magnitude 1.0 using our system. Figure 7 shows the locations of the tweets about one earthquake. The color of the balloons indicates the passage of time: red represents early tweets; blue represents later tweets. The red cross marks the earthquake's center.

9 The JMA seismic intensity scale is a measure used in Japan and Taiwan to indicate earthquake strength. Unlike the Richter magnitude scale, the JMA scale describes the degree of shaking at a point on the earth's surface. For example, JMA scale 3 is, by definition, shaking that is "felt by most people in the building. Some people are frightened." It is similar to Modified Mercalli intensity IV, which is used along with the Richter scale in the US.

Dear Alice,
We have just detected an earthquake around Chiba. Please take care.
Best,
Toretter Alert System

Fig. 9. Sample alert e-mail

A sample alert e-mail is presented in Fig. 9. It alerts users and urges them to prepare for the earthquake. The location is obtained from the location registered in the user's profile; it might be wrong because the user might have registered a different place, or might be traveling somewhere.
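The detection latencies in Table 2 are simple time differences; for instance, the Aug 18 Tochigi quake (06:58:55) was first tweeted at 07:00:30, i.e., 95 s later. A small helper can compute them, assuming the quake and the first tweet fall on the same day (which holds for every row of Table 2):

```python
from datetime import datetime

def detection_delay_seconds(quake_time, first_tweet_time):
    """Seconds from the earthquake ('Time' column of Table 2) to the
    first detected tweet, both given as 'HH:MM:SS' on the same day."""
    fmt = "%H:%M:%S"
    delta = (datetime.strptime(first_tweet_time, fmt)
             - datetime.strptime(quake_time, fmt))
    return delta.total_seconds()

detection_delay_seconds("06:58:55", "07:00:30")  # 95.0 (Aug 18 Tochigi row)
detection_delay_seconds("08:51:16", "08:51:35")  # 19.0 (Aug 21 Chiba row)
```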
The precise estimation of location from previous tweets is a subject for future work.

4 Related Work

Twitter is an interesting example of the most recent social media, and numerous studies have investigated it. Aside from the studies introduced in Section 1, several others have been done. Grosseck et al. investigated indicators such as the influence and trust related to Twitter [7]. Krishnamurthy et al. crawled nearly 100,000 Twitter users and examined the number of users each user follows, in addition to the number of users following them. Naaman et al. analyzed the contents of messages from more than 350 Twitter users and manually classified the messages into nine categories [8]. The most numerous categories are "Me now" and "Statements and Random Thoughts"; statements about current events correspond to the latter category. Some studies present applications of Twitter: Borau et al. used Twitter to teach English to English-language learners [9], and Ebner et al. investigated the applicability of Twitter for educational purposes, i.e., mobile learning [10]. The integration of the Semantic Web and microblogging was described in a previous study [11], in which a distributed architecture is proposed and the contents are aggregated. Jansen et al. analyzed more than 150 thousand tweets, particularly those mentioning brands in corporate accounts [12]. In contrast to the small number of academic studies of Twitter, many Twitter applications exist. Some are used for analyses of Twitter data. For example, Tweettronics (http://www.tweettronics.com) provides an analysis of tweets related to brands and products for marketing purposes. It can classify positive and negative tweets, and can identify influential users. The classification of tweets might be done similarly to our algorithm. Web2express Digest (http://web2express.org) is a website that auto-discovers information from Twitter streaming data to find real-time interesting conversations.
It also uses natural language processing and sentiment analysis to discover interesting topics, as we do in our study. Various studies have analyzed web data other than Twitter, particularly addressing the spatial aspect. The study most relevant to ours is one by Backstrom et al. [13]. They use search queries with locations (obtained from IP addresses) and develop a probabilistic framework for quantifying spatial variation. The model is based on a decomposition of the surface of the earth into small grid cells; they assume that for each grid cell x, there is a probability p_x that a random search from this cell equals the query under consideration. The framework finds a query's geographic center and spatial dispersion; examples include baseball teams, newspapers, universities, and typhoons. Although the motivation is very similar, the events to be detected differ: for example, people might not issue the search query "earthquake" while experiencing an earthquake. Therefore, our approach complements their work. Similarly to our work, Mei et al. targeted blogs and analyzed their spatiotemporal patterns [14]. They presented examples for Hurricane Katrina, Hurricane Rita, and the iPod Nano. The motivation of that study is similar to ours, but Twitter data are more time-sensitive, and our study examines even more time-critical events, e.g., earthquakes. Some works have targeted collaborative tagging data, such as Flickr's, from a spatiotemporal perspective. Serdyukov et al. investigated generic methods for placing Flickr photographs on the world map [15]. They used a language model to place photos, and showed that the language model can be estimated effectively through analyses of annotations by users. Rattenbury et al. [16] specifically examined the problem of extracting place and event semantics for tags that are assigned to photographs on Flickr.
They proposed scale-structure identification, a burst-detection method based on scaled spatial and temporal segments.

5 Discussion

We plan to expand our system to detect events of various kinds from Twitter. We have developed another prototype that detects rainbow information. A rainbow might be visible somewhere in the world, and someone might be tweeting about it; our system can find the rainbow tweets using an approach similar to that used for detecting earthquakes. The differences are that the rainbow case is not as time-sensitive as the earthquake case, and that rainbows can be found in various regions simultaneously, whereas two or more earthquakes do not usually occur together. Therefore, we can make a "world rainbow map". As far as we know, no agency reports rainbow information; such a rainbow map is producible only through Twitter. Other plans, as yet undeveloped, include reporting sightings of celebrities. Because people sometimes tweet when they see celebrities in town, by aggregating these tweets we can produce a map of celebrities found in cities. (Here we specifically examine the potential uses of the technology; of course, we should be careful about privacy issues when using such features.) Such real-time reporting offers many possible advantages, as we described herein. By processing tweets using machine learning and semantic technology, we can produce a new type of advanced social medium.
6 Conclusion

As described in this paper, we developed an earthquake prediction system targeting Japanese tweets. Strictly speaking, the system does not predict an earthquake but rather reports one to users very promptly. The search API is integrated with semantic technology; consequently, the system might be designated a "semantic Twitter". This report presents several examples in which our system can produce alerts, and describes potential expansions of the system. Twitter provides a new type of social data, from which we can develop an advanced social medium integrating semantic technology.

References

1. Java, A., Song, X., Finin, T., Tseng, B.: Why we twitter: Understanding microblogging usage and communities. In: Proc. Joint 9th WEBKDD and 1st SNA-KDD Workshop (2007)
2. Huberman, B., Romero, D., Wu, F.: Social networks that matter: Twitter under the microscope. First Monday 14 (2009)
3. boyd, d., Golder, S., Lotan, G.: Tweet, tweet, retweet: Conversational aspects of retweeting on Twitter. In: Proc. HICSS-43 (2010)
4. Bizer, C., Heath, T., Idehen, K., Berners-Lee, T.: Linked data on the web. In: Proc. WWW 2008, pp. 1265–1266 (2008)
5. Matsuo, Y., Mori, J., Hamasaki, M., Nishimura, T., Takeda, H., Hasida, K., Ishizuka, M.: Polyphonet: An advanced social network extraction system from the web. Journal of Web Semantics 5(4) (2007)
6. Joachims, T.: Text categorization with support vector machines. In: Nédellec, C., Rouveirol, C. (eds.) ECML 1998. LNCS, vol. 1398, pp. 137–142. Springer, Heidelberg (1998)
7. Grosseck, G., Holotescu, C.: Analysis indicators for communities on microblogging platforms. In: Proc. eLSE Conference (2009)
8. Naaman, M., Boase, J., Lai, C.: Is it really about me?
Message content in social awareness streams. In: Proc. CSCW 2009 (2009)
9. Borau, K., Ullrich, C., Feng, J., Shen, R.: Microblogging for language learning: Using Twitter to train communicative and cultural competence. In: Spaniol, M., Li, Q., Klamma, R., Lau, R.W.H. (eds.) Advances in Web Based Learning – ICWL 2009. LNCS, vol. 5686, pp. 78–87. Springer, Heidelberg (2009)
10. Ebner, M., Schiefner, M.: Microblogging - more than fun? In: Proc. IADIS Mobile Learning Conference (2008)
11. Passant, A., Hastrup, T., Bojars, U., Breslin, J.: Microblogging: A semantic and distributed approach. In: Proc. SFSW 2008 (2008)
12. Jansen, B., Zhang, M., Sobel, K., Chowdury, A.: Twitter power: Tweets as electronic word of mouth. Journal of the American Society for Information Science and Technology (2009)
13. Backstrom, L., Kleinberg, J., Kumar, R., Novak, J.: Spatial variation in search engine queries. In: Proc. WWW 2008 (2008)
14. Mei, Q., Liu, C., Su, H., Zhai, C.: A probabilistic approach to spatiotemporal theme pattern mining on weblogs. In: Proc. WWW 2006 (2006)
15. Serdyukov, P., Murdock, V., van Zwol, R.: Placing Flickr photos on a map. In: Proc. SIGIR 2009 (2009)
16. Rattenbury, T., Good, N., Naaman, M.: Towards automatic extraction of event and place semantics from Flickr tags. In: Proc. SIGIR 2007 (2007)

Linking Topics of News and Blogs with Wikipedia for Complementary Navigation

Yuki Sato1, Daisuke Yokomoto1, Hiroyuki Nakasaki2, Mariko Kawaba3, Takehito Utsuro1, and Tomohiro Fukuhara4

1 Graduate School of Systems and Information Engineering, University of Tsukuba, Tsukuba, 305-8573, Japan
2 NTT DATA CORPORATION, Tokyo 135-6033, Japan
3 NTT Cyber Space Laboratories, NTT Corporation, Yokosuka, Kanagawa, 239-0847, Japan
4 Center for Service Research, National Institute of Advanced Industrial Science and Technology, Tokyo 315-0064, Japan

Abstract.
We study complementary navigation of news and blogs, where Wikipedia entries are utilized as a fundamental knowledge source for linking news articles and blog feeds/posts. In the proposed framework, given a topic as the title of a Wikipedia entry, the entry's body text is analyzed as a fundamental knowledge source for the topic, and terms strongly related to the topic are extracted. Those terms are then used for ranking news articles and blog posts. In the scenario of complementary navigation from a news article to closely related blog posts, Japanese Wikipedia entries are ranked according to the number of strongly related terms shared by the given news article and each Wikipedia entry. The top-ranked 10 entries are then regarded as indices for further retrieving closely related blog posts, and the retrieved blog posts are finally ranked all together and shown to users as blogs of personal opinions and experiences closely related to the given news article. In a preliminary evaluation, an interface for manually selecting relevant Wikipedia entries improved the rate of successfully retrieving relevant blog posts.

Keywords: IR, Wikipedia, news, blog, topic analysis.

J.G. Breslin et al. (Eds.): BlogTalk 2008/2009, LNCS 6045, pp. 75–87, 2010. © Springer-Verlag Berlin Heidelberg 2010

1 Introduction

We study complementary navigation of news and blogs, where Wikipedia entries are utilized as a fundamental knowledge source for linking news articles and blog feeds/posts. Wikipedia, news, and blogs have been intensively studied in a wide variety of previous research. In the area of IR, Wikipedia has been studied as a rich knowledge source for improving the performance of text classification [1,2] as well as text clustering [3,4,5]. In the area of NLP, it has been studied as a language resource for improving the performance of named
entity recognition [6,7], translation knowledge acquisition [8], word sense disambiguation [9], and lexical knowledge acquisition [10]. In previous works on news aggregation, such as Newsblaster [11], NewsInEssence1 [12], and Google News2, techniques for linking closely related news articles were intensively studied. In addition to those previous works on the use and analysis of Wikipedia and news, blog analysis services have also become popular. Blogs serve as personal journals and as market or product commentaries. While traditional search engines continue to discover and index blogs, the blogosphere has produced custom blog search and analysis engines: systems that employ specialized information retrieval techniques. Among blog analysis services on the Internet, there are several commercial and non-commercial services such as Technorati3, BlogPulse4 [13], kizasi.jp5, and blogWatcher6 [14]. With respect to multilingual blog services, Globe of Blogs7 provides a retrieval function for blog articles across languages, Best Blogs in Asia Directory8 provides a retrieval function for Asian-language blogs, and Blogwise9 also analyzes multilingual blog articles. Compared with those previous studies, the fundamental idea of our complementary navigation is roughly illustrated in Figure 1. In our framework of complementary navigation of news and blogs, Wikipedia entries are retrieved when seeking fundamental background information, news articles are retrieved when seeking precise reports on facts, and blog feeds/posts are retrieved when seeking subjective information such as personal opinions and experiences. In the proposed framework, we regard Wikipedia as a large-scale encyclopedic knowledge base that includes well-known facts and relatively neutral opinions. Its Japanese version includes about 627,000 entries (as of October 2009).
Given a topic as the title of a Wikipedia entry, the entry's body text is analyzed as a fundamental knowledge source for the topic, and terms strongly related to the topic are extracted. Those terms are then used for ranking news articles and blog feeds/posts. This fundamental technique was published in [15,16] and was evaluated on the task of blog feed retrieval from a Wikipedia entry; [15,16] reported that it outperformed the original ranking returned by the "Yahoo! Japan" API. In the first scenario of complementary navigation, given a news article on a certain topic, the system retrieves blog feeds/posts on closely related topics and shows them to users. In the example shown in Figure 1, suppose that a user found a news article reporting that "a long queue appeared in front of a game shop on the day the popular game Dragon Quest 9 was published". Then, through the complementary navigation function of our framework, a closely related blog post, such as one posted by a person who bought the game on the day it was published, is quickly retrieved and shown to the user. In this scenario, first, about 600,000 Japanese Wikipedia entries are ranked according to the number of strongly related terms shared by the given news article and each Wikipedia entry. Then, the top-ranked 10 entries are regarded as indices for further retrieving closely related blog feeds/posts. The retrieved blog feeds/posts are finally ranked all together.

1 http://www.newsinessence.com/nie.cgi
2 http://news.google.com/
3 http://technorati.com/
4 http://www.blogpulse.com/
5 http://kizasi.jp/ (in Japanese)
6 http://blogwatcher.pi.titech.ac.jp/ (in Japanese)
7 http://www.globeofblogs.com/
8 http://www.misohoni.com/bba/
9 http://www.blogwise.com/

Fig. 1. Framework of Complementary Navigation among Wikipedia, News, and Blogs
The retrieved blog feeds/posts are then shown to users as blogs of personal opinions and experiences that are closely related to the given news article. In the second scenario of complementary navigation, which runs in the opposite direction from the first, given a blog feed/post on a certain topic, the system retrieves news articles on closely related topics and shows them to users. This scenario primarily assumes that, given a blog feed/post that refers to a certain news article and includes some personal opinions regarding the news, the system retrieves the news article referred to by the blog feed/post and shows it to the user. Finally, in the third scenario of complementary navigation, given a news article or a blog feed/post on a certain topic, the system retrieves one or more closely related Wikipedia entries and shows them to users. In the example shown in Figure 1, suppose that a user found either a news article reporting the publication of Dragon Quest 9 or a blog post by a person who bought the game on the day it was published. Then, through the complementary navigation function of our framework, the most relevant Wikipedia entry, namely that of Dragon Quest 9, is quickly retrieved and shown to the user. This scenario is intended to show users background knowledge found in Wikipedia, given a news article or a blog feed/post on a certain topic. Building on the overall framework of complementary navigation among Wikipedia, news, and blogs introduced above, this paper focuses on formalizing the first scenario: retrieving closely related blog posts given a news article on a certain topic. Section 2 first describes how to extract, from each Wikipedia entry, terms that are closely related to it. According to the procedure presented in Section 3, those terms are then used to retrieve blog posts that are closely related to each Wikipedia entry.
Based on those fundamental techniques, Section 4 formalizes the similarity measure between the given news article and each blog post, and then presents the procedure for ranking blog posts related to the given news article. Section 5 introduces a user interface for complementary navigation, used for manually selecting Wikipedia entries that are relevant to the given news article and effective in retrieving closely related blog posts; Section 5 also presents the results of evaluating our framework. Section 6 presents a comparison with previous works related to this paper.

2 Extracting Related Terms from a Wikipedia Entry

In our framework of linking news and blogs through Wikipedia entries, we regard terms that are included in a Wikipedia entry's body text and closely related to the entry as conceptual indices of the entry. Those closely related terms are then used for retrieving related blog posts and news articles. More specifically, from the body text of each Wikipedia entry, we extract bold-faced terms, anchor texts of hyperlinks, and the title of a redirect, which is a synonymous term for the title of the target page [15,16,17]. We also extract all noun phrases from the body text of each Wikipedia entry.

3 The Procedure of Retrieving Blog Posts Related to a Wikipedia Entry

This section describes the procedure for retrieving blog posts related to a Wikipedia entry [15,16]. In this procedure, given a Wikipedia entry title, closely related blog feeds are first retrieved; then, from the retrieved blog feeds, closely related blog posts are further selected.

3.1 Blog Feed Retrieval

This section briefly describes how to retrieve blog feeds given a Wikipedia entry title. To collect candidate blog feeds for a given query, we use existing Web search engine APIs, which return a ranked list of blog posts given a topic keyword.
We use the Japanese search engine "Yahoo! Japan" API10. Blog hosts are limited to 11 major hosts11. We employ the following procedure for blog distillation:

i) Given a topic keyword, a ranked list of blog posts is returned by the Web search engine API.
ii) A list of blog feeds is generated from the returned list of blog posts by simply removing duplicated feeds.
iii) The list of blog feeds is re-ranked according to the number of hits of the topic keyword in each blog feed.

The number of hits for a topic keyword in each blog feed is measured by the same search engine API used for collecting blog posts in step i), restricting the URL domain to each blog feed. [15,16] reported that the procedure above outperformed the original ranking returned by the "Yahoo! Japan" API.

3.2 Blog Post Retrieval

From the retrieved blog feeds, we next select blog posts that are closely related to the given Wikipedia entry title. To do this, we use the related terms extracted from the given Wikipedia entry as described in Section 2; more specifically, of the extracted related terms, we use bold-faced terms, anchor texts of hyperlinks, and the title of a redirect, which is a synonymous term for the title of the target page. Then, blog posts that contain the topic name or at least one of the extracted related terms are automatically selected.

4 Similarities of Wikipedia Entries, News, and Blogs

In the scenario of retrieving blog posts closely related to a given news article, the most important component is how to measure the similarity between the given news article and each blog post. This section describes how we design this similarity. The fundamental components are the similarity Sim_{w,n}(E, N) between a Wikipedia entry E and a news article N, and the similarity Sim_{w,b}(E, B) between a Wikipedia entry E and a blog post B.
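The blog feed distillation of steps i)–iii) in Sect. 3.1 can be sketched as follows. This is a minimal sketch: `hit_count_fn` stands in for the search-API hit count restricted to one feed's domain, and mapping a post URL to its feed via the URL's host part is a simplifying assumption of ours.

```python
def distill_blog_feeds(ranked_posts, hit_count_fn):
    """i) take a ranked list of blog post URLs, ii) de-duplicate them into
    a list of blog feeds, iii) re-rank the feeds by the number of hits of
    the topic keyword within each feed."""
    feeds = []
    for post_url in ranked_posts:              # ii) de-duplicate feeds
        feed = post_url.split("/")[2]          # host part of http://host/path
        if feed not in feeds:
            feeds.append(feed)
    return sorted(feeds, key=hit_count_fn, reverse=True)   # iii) re-rank

posts = ["http://a.example/p1", "http://b.example/p9", "http://a.example/p2"]
hits = {"a.example": 3, "b.example": 12}
distill_blog_feeds(posts, hits.get)  # ['b.example', 'a.example']
```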
The similarity measure Sim_{w,n}(E, N) is used when, given a news article on a certain topic, ranking Wikipedia entries according to whether each entry is related to the given news article. The similarity measure Sim_{w,b}(E, B) is used when retrieving blog posts related to any of the highly ranked Wikipedia entries that are closely related to the given news article. Then, based on the similarities Sim_{w,n}(E, N) and Sim_{w,b}(E, B), the overall similarity measure Sim_{n,w,b}(N, B) between the given news article N and each blog post B is introduced. Finally, blog posts are ranked according to this overall similarity measure.

10 http://www.yahoo.co.jp/ (in Japanese)
11 FC2.com, yahoo.co.jp, rakuten.ne.jp, ameblo.jp, goo.ne.jp, livedoor.jp, Seesaa.net, jugem.jp, yaplog.jp, webry.info.jp, hatena.ne.jp

4.1 Similarity of a Wikipedia Entry and a News Article / a Blog Post

The similarities Sim_{w,n}(E, N) and Sim_{w,b}(E, B) are measured in terms of the entry title and the related terms extracted from the Wikipedia entry as described in Section 2. The similarity Sim_{w,n}(E, N) between a Wikipedia entry E and a news article N is defined as a weighted sum of the frequencies of the entry title and the related terms:

    Sim_{w,n}(E, N) = Σ_t w(type(t)) × freq(t)

where freq(t) is the frequency of term t in the news article N, and the weight w(type(t)) is defined as 1 when t is the entry title, the title of a redirect, a bold-faced term, the title of a paragraph, or a noun phrase extracted from the body text of the entry. The similarity Sim_{w,b}(E, B) between a Wikipedia entry E and a blog post B is likewise defined as a weighted sum of the frequencies of the entry title and the related terms:

    Sim_{w,b}(E, B) = Σ_t w(type(t)) × freq(t)

where freq(t) is the frequency of term t in the blog post B, and w(type(t)) is defined as 3 when t is the entry title or the title of a redirect, 2 when t is a bold-faced term, and 0.5 when t is an anchor text of a hyperlink12.
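Both similarities are instances of the same weighted sum, so a single helper suffices. The sketch below assumes the document has already been segmented into term occurrences; the term-type weights shown are the Sim_{w,b} weights defined above.

```python
from collections import Counter

# Term-type weights for Sim_{w,b}: entry title / redirect title 3,
# bold-faced term 2, anchor text 0.5 (as defined above).
BLOG_WEIGHTS = {"title": 3.0, "redirect": 3.0, "bold": 2.0, "anchor": 0.5}

def weighted_sum_similarity(related_terms, doc_terms, weights):
    """Sim(E, D) = sum over t of w(type(t)) * freq(t), where freq(t)
    is the frequency of term t in the document's term list."""
    freq = Counter(doc_terms)
    return sum(weights[term_type] * freq[term]
               for term, term_type in related_terms)

related = [("Kyoto Protocol", "title"), ("greenhouse gas", "bold")]
doc = ["Kyoto Protocol", "greenhouse gas", "Kyoto Protocol"]
weighted_sum_similarity(related, doc, BLOG_WEIGHTS)  # 3*2 + 2*1 = 8.0
```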
4.2 Similarity of a News Article and a Blog Post through Wikipedia Entries

In designing the overall similarity measure Sim_{n,w,b}(N, B) between a news article N and a blog post B through Wikipedia entries, we consider two factors. One is to measure the similarity between a news article and a blog post indirectly, through Wikipedia entries that are closely related to both the news article and the blog post. The other is to measure their similarity directly, based simply on their text contents. In this paper, the first factor is represented as the sum, over Wikipedia entries E, of the similarity Sim_{w,n}(E, N) between the news article N and entry E and the similarity Sim_{w,b}(E, B) between the blog post B and entry E. The second factor is the direct document similarity Sim_{n,b}(N, B) between the news article N and the blog post B, for which we simply use the cosine measure. Finally, based on the argument above, we define the overall similarity measure Sim_{n,w,b}(N, B) between a news article N and a blog post B through Wikipedia entries as the weighted sum of the two factors:

    Sim_{n,w,b}(N, B) = (1 − K_{w,nb}) Sim_{n,b}(N, B) + K_{w,nb} Σ_E ( Sim_{w,n}(E, N) + Sim_{w,b}(E, B) )

where K_{w,nb} is the weighting coefficient.

12 In [17], we applied a machine learning technique to the task of judging whether a Wikipedia entry and a blog feed are closely related, where we incorporated features other than the frequencies of related terms in a blog feed and achieved an improvement. Following the discussion in [15,16], the technique proposed in [17] outperforms the original ranking returned by the "Yahoo! Japan" API. As future work, we plan to apply the technique of [17] to the task of complementary navigation studied in this paper.
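Under these definitions, the overall score combines the direct cosine similarity with the entry-mediated sums. A sketch follows; the `entry_sims` pairs stand for already-computed Sim_{w,n}(E, N) and Sim_{w,b}(E, B) values of the shared entries.

```python
import math
from collections import Counter

def cosine_similarity(words_a, words_b):
    """Direct document similarity Sim_{n,b}: cosine over word counts."""
    a, b = Counter(words_a), Counter(words_b)
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def overall_similarity(news_words, blog_words, entry_sims, k=1.0):
    """Sim_{n,w,b}(N, B) = (1 - K) * Sim_{n,b}(N, B)
       + K * sum over E of (Sim_{w,n}(E, N) + Sim_{w,b}(E, B))."""
    mediated = sum(s_wn + s_wb for s_wn, s_wb in entry_sims)
    return (1 - k) * cosine_similarity(news_words, blog_words) + k * mediated

overall_similarity(["kyoto"], ["kyoto", "offset"], [(6.0, 8.0)], k=1.0)  # 14.0
```

With k = 1, as used in the evaluation, the direct cosine term drops out and only the entry-mediated sums remain.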
In the evaluation of Section 5.2, we show results with the coefficient K_{w,nb} set to 1, since the results with K_{w,nb} = 1 are always better than those with K_{w,nb} = 0.5.

4.3 Ranking Blog Posts Related to News through Wikipedia Entries

Based on the formalization in the previous two sections, this section presents the procedure for retrieving blog posts closely related to a given news article N and then ranking them. First, suppose that the news article N contains the titles of Wikipedia entries E_1, ..., E_n in its body text. Those entries are ranked according to their similarities Sim_{w,n}(E_i, N) (i = 1, ..., n) against the given news article N, and the top-ranked 10 entries E'_1, ..., E'_10 are selected. Next, each E'_i (i = 1, ..., 10) of those top-ranked 10 entries is used to retrieve closely related blog posts according to the procedure presented in Section 3. Finally, the retrieved blog posts B_1, ..., B_m are ranked all together according to their similarities Sim_{n,w,b}(N, B_j) (j = 1, ..., m) against the given news article N.

5 Manually Selecting Wikipedia Entries in Linking News to Related Blog Posts

In this section, we introduce a user interface for complementary navigation with a facility for manually selecting Wikipedia entries that are relevant to the given news article. With this interface, a user can judge whether each candidate Wikipedia entry is effective in retrieving closely related blog posts. We then evaluate the overall framework of complementary navigation and present the evaluation results.

5.1 The Procedure

This section describes the procedure of linking a news article to closely related blog posts, where the measure for ranking related blog posts is based on the formalization presented in Section 4.3. In this procedure, we also use an interface for manually selecting Wikipedia entries that are relevant to the given news article.

Fig. 2.
Interface for Complementary Navigation from News to Blogs through Wikipedia Entries

Snapshots of the interface are shown in Figure 2. First, in the "News Article Browser", a user can browse through a list of news articles and select one for which he/she wants to retrieve related blog posts. Next, for the selected news article, the "Interface for Manually Selecting Relevant Wikipedia Entries" appears. In this interface, following the formalization of Section 4.3, the top-ranked 10 Wikipedia entry titles are shown as candidates for retrieving blog posts related to the given news article. The user can then select any subset of the 10 candidate Wikipedia entry titles to be used for retrieving related blog posts. With the subset of selected Wikipedia entry titles, the "Browser for Relevant Blog Post Ranking" is called, where the retrieved blog posts are ranked according to the formalization of Section 4.3. Finally, the user can browse through the "High Ranked Blog Posts" by simply clicking the links to those blog posts. Table 1 shows a list of four news articles on the "Kyoto Protocol" to be used in the evaluation of the next section. For each news article, the table shows its summary and the top-ranked 10 Wikipedia entry titles, where entry titles judged as relevant to the news article are marked with an asterisk. The table also shows the summary of an example of relevant blog posts.

Table 1. Summaries of News Articles for Evaluation, Candidates for Relevant Wikipedia Entries, and Summaries of Relevant Blog Posts

News article 1 (date: Jan. 25, 2008)
  Summary: Reports on Japan's activities on "carbon offset", reduction of electric power consumption, and prevention of global warming.
  Candidate entries: environmental issues, *Kyoto Protocol, Japan, automobile, *carbon offset, transport, United States, hotel, *carbon dioxide, contribution
  Relevant blog post: "I understand the significance of the Kyoto Protocol, but I think it also has problems." (blogger A)

News article 2 (date: Mar. 31, 2008)
  Summary: Reports on a meeting for "carbon offset".
  Candidate entries: *Kyoto Protocol, *carbon emissions trading, Japan, *post-Kyoto negotiations, *energy conservation, Poland, fluorescent lamp, technology, *greenhouse gases, industry
  Relevant blog post: "Japan has to rely on economic approaches such as carbon offset." (blogger A)

News article 3 (date: Aug. 28, 2008)
  Summary: Reports on issues towards post-Kyoto negotiations.
  Candidate entries: *post-Kyoto negotiations, United Nations, protocol, *carbon dioxide, United States, debate, Kyoto, *greenhouse gases, minister, Poland
  Relevant blog post: Referring to a news article on the World Economic Forum. (blogger B)

News article 4 (date: Jun. 29, 2008)
  Summary: Discussion on global warming, such as issues regarding developing countries and technologies for energy conservation in Japan.
  Candidate entries: Japan, *global warming, environmental issues, United States, politics, resource, *34th G8 summit, India, *fossil fuels, society
  Relevant blog post: Engineers of Japanese electric power companies make progress in research and development. (blogger C)

5.2 Evaluation

The Procedure. To each of the four news articles on the "Kyoto Protocol" listed in Table 1, we apply the procedure of retrieving related blog posts described in the previous section. We then manually judge the relevance of the top-ranked N blog posts at three levels: (i) closely related, (ii) partially related, and (iii) not related. Next, we consider the following two cases in measuring the rate of relevant blog posts:

(a) relevant blog posts = closely related blog posts only (i.e., those judged (i))
(b) relevant blog posts = closely related blog posts + partially related blog posts (i.e., those judged (i) or (ii))

Fig. 3.
Evaluation Results of the Ratio of Relevant Blog Posts (%): Comparison of with / without Manual Selection of Relevant Wikipedia Entries

(a) Only closely related blog posts (judged as (i)) are regarded as relevant. (b) Both closely related blog posts (judged as (i)) and partially related blog posts (judged as (ii)) are regarded as relevant. For both cases, the rate of relevant blog posts is simply defined as below:

rate of relevant blog posts = (the number of relevant blog posts) / N

In the evaluation of this section, we set N to 10.

Evaluation Results. In terms of the rate of relevant blog posts, Figure 3 compares the two cases of with / without manually selecting Wikipedia entries relevant to the given news article through the interface introduced in the previous section. In Figure 3 (a), we regard only closely related blog posts as relevant, where the rates of relevant blog posts improve from 0% to 10∼60%. In Figure 3 (b), we regard both closely and partially related blog posts as relevant, where the rates improve from 0∼10% to 80∼90%. From this result, it is clear that the current formalization presented in this paper has a weakness in the similarity measure for ranking related Wikipedia entries. As can be seen in the lists of top-ranked 10 Wikipedia entry titles in Table 1, as well as in those manually selected out of the 10 entries, general terms and country names such as "automobile", "transport", "Japan", and "United States" are major causes of the low rates of relevancy. Those general terms and country names mostly damage the step of retrieving related blog posts and the final ranking of those retrieved blog posts. However, it is also clearly shown that, once closely related Wikipedia entries are manually selected, the rates of relevant blog posts drastically improve.
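The section 4.3 pipeline and the section 5.2 relevance rate can be sketched compactly. All names here (`rank_related_posts`, `sim_entry_news`, `sim_news_blog`, `retrieve_posts`) are hypothetical stand-ins for the paper's Simw,n and Simn,w,b measures and its blog-post retrieval step, not the authors' implementation.

```python
def rank_related_posts(news, entries, retrieve_posts,
                       sim_entry_news, sim_news_blog, k=10):
    # Rank candidate Wikipedia entries against the news article, keep top k.
    top_entries = sorted(entries, key=lambda e: sim_entry_news(e, news),
                         reverse=True)[:k]
    # Retrieve blog posts for each selected entry, then rank their union.
    posts = {p for e in top_entries for p in retrieve_posts(e)}
    return sorted(posts, key=lambda p: sim_news_blog(news, p), reverse=True)

def relevance_rate(judgements, n=10, include_partial=False):
    # judgements: labels for the top-n posts, each 'closely', 'partially',
    # or 'not'; case (a) counts only 'closely', case (b) also 'partially'.
    ok = {'closely', 'partially'} if include_partial else {'closely'}
    return sum(1 for j in judgements[:n] if j in ok) / n
```

With N = 10, a top-10 list containing one closely and one partially related post yields a rate of 10% in case (a) and 20% in case (b).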
This result clearly indicates that the most important issue to be examined first is how to model the measure for ranking Wikipedia entries which are related to a given news article. We discuss this issue as future work in section 7.

6 Related Works

Among several related works, [18,19] studied linking related news and blogs. Their approaches differ from that proposed in this paper in that our proposed method conceptually links topics of news articles and blog posts based on Wikipedia entry texts. [18] focused on linking news articles and blogs based on citations from blogs to news articles. [19] studied linking news articles to blogs posted within one week after each news article is released, where a document vector space model modified by considering terms closely related to each news article is employed. [20] studied mining comparative differences of concerns in news streams from multiple sources. [21] studied how to analyze sentiment distribution in news articles across 9 languages. Those previous works mainly focus on news streams and documents other than blogs. Techniques studied in previous works on text classification [1,2] as well as text clustering [3,4,5] using Wikipedia knowledge are similar to the method proposed in this paper in that they are based on related terms extracted from Wikipedia, such as hyponyms, synonyms, and associated terms. The fundamental ideas of those previously studied techniques are also applicable to our task. The major difference between our work and those works is that we design our framework to have an intermediate phase of ranking Wikipedia entries related to a given news article.

7 Conclusion

This paper studied complementary navigation of news and blogs, where Wikipedia entries are utilized as a fundamental knowledge source for linking news articles and blog posts. In this paper, we focused on the scenario of complementary navigation from a news article to closely related blog posts.
In our preliminary evaluation, we showed that the rate of successfully retrieving relevant blog posts improved through an interface for manually selecting relevant Wikipedia entries. Future work includes improving the measure for ranking Wikipedia entries which are related to a given news article. So far, we have examined a novel measure which incorporates clustering of Wikipedia entries in terms of the similarity of their body texts. The underlying motivation of this novel measure is to prefer a small number of entries which have quite high similarities with each other, and we have already confirmed that this approach drastically improves the ranking of Wikipedia entries. We are planning to evaluate this measure against a much larger evaluation data set, and the results will be reported in the near future.

References

1. Gabrilovich, E., Markovitch, S.: Overcoming the brittleness bottleneck using Wikipedia: Enhancing text categorization with encyclopedic knowledge. In: Proc. 21st AAAI, pp. 1301–1306 (2006)
2. Wang, P., Domeniconi, C.: Building semantic kernels for text classification using Wikipedia. In: Proc. 14th SIGKDD, pp. 713–721 (2008)
3. Hu, J., Fang, L., Cao, Y., Zeng, H.J., Li, H., Yang, Q., Chen, Z.: Enhancing text clustering by leveraging Wikipedia semantics. In: Proc. 31st SIGIR, pp. 179–186 (2008)
4. Huang, A., Frank, E., Witten, I.H.: Clustering documents using a Wikipedia-based concept representation. In: Proc. 13th PAKDD, pp. 628–636 (2009)
5. Hu, X., Zhang, X., Lu, C., Park, E.K., Zhou, X.: Exploiting Wikipedia as external knowledge for document clustering. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 389–396 (2009)
6. Cucerzan, S.: Large-scale named entity disambiguation based on Wikipedia data. In: Proc. EMNLP-CoNLL, pp. 708–716 (2007)
7. Kazama, J., Torisawa, K.: Exploiting Wikipedia as external knowledge for named entity recognition. In: Proc. EMNLP-CoNLL, pp.
698–707 (2007)
8. Oh, J.H., Kawahara, D., Uchimoto, K., Kazama, J., Torisawa, K.: Enriching multilingual language resources by discovering missing cross-language links in Wikipedia. In: Proc. 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, pp. 322–328 (2008)
9. Mihalcea, R., Csomai, A.: Wikify! Linking documents to encyclopedic knowledge. In: Proceedings of the 16th ACM Conference on Information and Knowledge Management, pp. 233–242 (2007)
10. Sumida, A., Torisawa, K.: Hacking Wikipedia for hyponymy relation acquisition. In: Proc. 3rd IJCNLP, pp. 883–888 (2008)
11. McKeown, K.R., Barzilay, R., Evans, D., Hatzivassiloglou, V., Klavans, J.L., Nenkova, A., Sable, C., Schiffman, B., Sigelman, S.: Tracking and summarizing news on a daily basis with Columbia's Newsblaster. In: Proc. 2nd HLT, pp. 280–285 (2002)
12. Radev, D., Otterbacher, J., Winkel, A., Blair-Goldensohn, S.: NewsInEssence: Summarizing online news topics. Communications of the ACM 48, 95–98 (2005)
13. Glance, N., Hurst, M., Tomokiyo, T.: BlogPulse: Automated trend discovery for weblogs. In: WWW 2004 Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics (2004)
14. Nanno, T., Fujiki, T., Suzuki, Y., Okumura, M.: Automatically collecting, monitoring, and mining Japanese weblogs. In: WWW Alt. 2004: Proc. 13th WWW Conf. Alternate Track Papers & Posters, pp. 320–321 (2004)
15. Kawaba, M., Nakasaki, H., Utsuro, T., Fukuhara, T.: Cross-lingual blog analysis based on multilingual blog distillation from multilingual Wikipedia entries. In: Proceedings of the International Conference on Weblogs and Social Media, pp. 200–201 (2008)
16. Nakasaki, H., Kawaba, M., Yamazaki, S., Utsuro, T., Fukuhara, T.: Visualizing cross-lingual/cross-cultural differences in concerns in multilingual blogs. In: Proceedings of the International Conference on Weblogs and Social Media, pp. 270–273 (2009)
17.
Kawaba, M., Yokomoto, D., Nakasaki, H., Utsuro, T., Fukuhara, T.: Linking Wikipedia entries to blog feeds by machine learning. In: Proc. 3rd IUCS (2009)
18. Gamon, M., Basu, S., Belenko, D., Fisher, D., Hurst, M., Konig, A.C.: Blews: Using blogs to provide context for news articles. In: Proc. ICWSM, pp. 60–67 (2008)
19. Ikeda, D., Fujiki, T., Okumura, M.: Automatically linking news articles to blog entries. In: Proc. 2006 AAAI Spring Symp. Computational Approaches to Analyzing Weblogs, pp. 78–82 (2006)
20. Yoshioka, M.: IR interface for contrasting multiple news sites. In: Proc. 4th AIRS, pp. 516–521 (2008)
21. Bautin, M., Vijayarenu, L., Skiena, S.: International sentiment analysis for news and blogs. In: Proc. ICWSM, pp. 19–26 (2008)

A User-Oriented Splog Filtering Based on a Machine Learning

Takayuki Yoshinaka1, Soichi Ishii1, Tomohiro Fukuhara2, Hidetaka Masuda3, and Hiroshi Nakagawa4
1 School of Science and Technology for Future Life, Tokyo Denki University, 2-2 Kanda Nishikicho, Chiyoda-ku, Tokyo 101-8457, Japan yoshinaka@cdl.im.dendai.ac.jp
2 Research into Artifacts, Center for Engineering, The University of Tokyo, 5-1-5, Kashiwanoha, Kashiwa, Chiba 277-0882, Japan fukuhara@race.u-tokyo.ac.jp
3 School of Science and Technology for Future Life, Tokyo Denki University, 2-2 Kanda Nishikicho, Chiyoda-ku, Tokyo 101-8457, Japan masuda@im.dendai.ac.jp
4 Information Technology Center, The University of Tokyo, 7-3-1 Hongou, Bunkyo-ku, Tokyo 113-0033, Japan nakagawa@dl.itc.u-tokyo.ac.jp

Abstract. A method for filtering spam blogs (splogs) based on a machine learning technique, and its evaluation results, are described. Today, spam blogs (splogs) have become one of the major issues on the Web. The problem with splogs is that the value of a blog site differs from person to person. We propose a novel user-oriented splog filtering method that can adapt to each user's preference for valuable blogs. We use an SVM (Support Vector Machine) for creating a personalized splog filter for each user.
We conducted two experiments: (1) an experiment on individual splog judgement, and (2) an experiment on user-oriented splog filtering. From the former experiment, we found the existence of 'gray' blogs that need to be judged individually by each person. From the latter experiment, we found that we can provide appropriate personalized filters by choosing the best feature set for each user. An overview of the proposed method and its evaluation results are described.

1 Introduction

Today, many people own their own blog sites, on which they publish articles. There are many types of blog sites on the Web, such as blogs that advertise books and commodities, blogs on programming, and blogs on personal diaries. At the same time, a lot of spam blogs (splogs) are created by spam bloggers (sploggers). These splogs form a 'splogosphere' [1]. Splogs cause several problems on the Web. For example, splogs degrade the quality of search results. Although splogs should be removed from search results, it is not easy to identify splogs for each user, because a blog may be marked as a splog by one person but as an authentic (valuable) site by another. Thus, a user-oriented splog filtering method that can adapt to each user's preference for valuable blogs is needed.

J.G. Breslin et al. (Eds.): BlogTalk 2008/2009, LNCS 6045, pp. 88–99, 2010. © Springer-Verlag Berlin Heidelberg 2010

We propose a user-oriented splog filtering method that can adapt to each user's preference. For creating a personalized filter, our method collects individual splog data. Then, personalized filters are created using the support vector machine [2] and this individual splog data. This paper is organized as follows: In section 2, we review the previous work. In section 3, we describe an experiment on individual splog judgement. In section 4, we describe the user-oriented splog filtering method.
In section 5, we describe the evaluation results of the proposed method. In section 6, we discuss the evaluation results. In section 7, we summarize the proposed method and its evaluation results, and describe future work.

2 Previous Work

There is a body of related work on splog filtering. Kolari et al. [1] analyzed the splogosphere and proposed a splog filtering method using SVM. They proposed three feature sets for machine learning: (1) 'words', (2) 'anchor text', and (3) 'urls' appearing in a blog article. Their method succeeded in detecting splogs with an F-measure of about 90%. Regarding Japanese splogs, Ishida analyzed the Japanese splogosphere [3]. He proposed a splog filtering method that uses link structure analysis. His method detects splogs at an F-measure of 80%. These works provide a single common filter for all users, and do not consider user adaptation. Regarding user adaptation in e-mail and web applications, several works consider user adaptation functions. Junejo proposed a user-oriented spam filter for e-mail [4]. Because users receive large numbers of spam e-mails, filtering spam on the user side is not easy. They proposed a server-side spam mail filter that detects spam for each user. The point of this method is that the filter does not require much computational cost on the user side. Jeh's work [5] addresses web spam. They proposed a personalized web spam filter that is based on the PageRank algorithm. This method, however, needs the whole link structure among web pages, and this incurs much cost for user adaptation. We need a simple method that does not require much cost for user adaptation. Therefore, we propose a user-oriented splog filtering method that can adapt to each user's preference and does not require much cost for user adaptation.

3 Experiment of Individual Splog Judgement

We conducted an experiment to understand individual splog judgement by persons.
We asked 50 subjects to judge whether each of 50 blog articles is a splog or an authentic article. For the test data (blog articles), we prepared 'gray' articles that are on the borderline between splogs and authentic articles. We also asked subjects to judge whether each blog article is valuable or not. We describe an overview of the experiment and its results.

90 T. Yoshinaka et al.

3.1 Overview

50 subjects (25 men, 25 women) attended this experiment. The subjects range in age from 21 to 55 years old. Their occupations are mainly 'engineers of information technology' and 'general office worker'. For the test data, we prepared 50 blog articles. The dataset consists of (1) '40 common articles' that are common test articles for all subjects, and (2) '10 individual articles' that are chosen by each subject. For the latter data, we asked subjects to choose 10 blog articles: (1) the five articles that they found most interesting, and (2) the five articles that they found most boring. For the evaluation, we adopt two axes for splog judgement: (1) the spam axis, and (2) the value axis. The spam-axis indicates the degree of spam. The value-axis indicates the degree of value of a blog article for a subject. Both axes consist of four values. The options for the spam-axis are '1: not splog', '2: not splog (weak)', '4: splog (weak)', and '5: splog'. The options for the value-axis are '1: not valuable', '2: not valuable (weak)', '4: valuable (weak)', and '5: valuable'.

3.2 Results of Experiment

Figure 1 shows the result of individual judgement for the '40 common articles'. The total number of judgements is 2,000 (50 subjects × 40 articles = 2,000 judgements). There are three axes in Figure 1: the x-axis is the spam-axis, the y-axis is the value-axis, and the z-axis is the number of judgements (judge count). In Figure 1, a peak of 678 judgements appears at the intersection of spam=5 and value=1. These judgements indicate that those articles are regarded as unnecessary and valueless by most subjects.
On the other hand, in Figure 1, the circled area on the right indicates the existence of gray blogs, which subjects judged as splogs but also as valuable.

Fig. 1. The result of individual judgement for '40 common articles'

Fig. 2. The result of individual judgement for '10 individual articles'

Figure 2 shows the result of individual judgement for the '10 individual articles'. The total number of judgements is 500 (50 subjects × 10 articles = 500 judgements). In Figure 2, the axes are the same as in Figure 1. From Figure 2, we found that the degree of spam judged is low because each subject chose articles that they found most interesting. From these results, we found that user adaptation is needed for splog filtering.

4 User-Oriented Splog Filtering Method

The proposed method accepts individual splog judgement data and feedback from a user, and provides a personalized splog filter for each user. Figure 3 shows an overview of the user-oriented splog filtering method. The figure shows the relation between a user and the user-oriented splog filtering system that provides a personalized filter for this user. At the beginning, a user provides his or her splog judgement data to the system. This data is used for creating an initial user model for that user. The system creates his/her user model by learning from this judgement data. We use LibSVM (version 2.88)1 as the machine learning module in this system. The system shows the user an estimated judgement of a blog article while he or she browses that article. A user can send feedback to the system for updating his or her user model. The system accepts feedback data that consists of a URL and the judgement of that user. When the system accepts feedback from a user, the system collects the HTML file of the URL from the Web, and extracts features that are used in the machine learning.
Because we consider that there is a suitable feature set for each user, the system chooses the best feature set for each user. We will describe the details of the feature sets and their evaluation results in section 5.

1 http://www.csie.ntu.edu.tw/~cjlin/libsvm/

Fig. 3. The concept of the user-oriented splog filtering method

5 Evaluation for User-Oriented Splog Filtering

In this section, we describe the evaluation results of the proposed method. We prepared three types of features: (1) 'Kolari's features', (2) 'Light-weight features', and (3) 'Mixed features'. Using these as evaluation data, we compared the performance of personalized filters across these feature sets. In addition, we performed another evaluation by choosing the best feature set for each user. As an evaluation metric, we used the F-measure [6] described in the following equation:

F-measure = (2 × precision × recall) / (precision + recall)   (1)

We evaluated the performance of each filter based on five-fold cross validation. We used several kernel functions, including the linear kernel, polynomial kernel, RBF (radial basis function) kernel, and sigmoid kernel. As kernel parameters, we used the default values of LibSVM for each kernel.

5.1 Dataset

As the dataset, we used the individual judgement data described in section 3. We use 50 Japanese blog articles.

Table 1. Feature list for Kolari's features [8]

Feature group      Name of feature  Dimension  Value type
Kolari's features  Bag-of-words     9,014      tf*idf score
                   Bag-of-anchors   4,014      binary
                   Bag-of-urls      3,091      binary

5.2 Features for Machine Learning

We used the following sets of features: (1) 'Kolari's features' [1] described in the previous work, (2) 'Light-weight features' [7] that we propose, and (3) 'Mixed features' that are the mix of 'Kolari's features' and 'Light-weight features'.

Kolari's features. Table 1 shows the list of features described in the previous work.
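Equation (1) and the five-fold protocol are standard, and can be sketched in a few lines. The helper names are illustrative, and the classifier itself (LibSVM in the paper) is left abstract; `f_measure` works from raw true-positive / false-positive / false-negative counts.

```python
def f_measure(tp, fp, fn):
    # Equation (1): F = 2 * precision * recall / (precision + recall).
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def k_fold(items, k=5):
    # Yield (train, test) splits for k-fold cross validation
    # by striding over the item list.
    for i in range(k):
        test = items[i::k]
        train = [x for j in range(k) if j != i for x in items[j::k]]
        yield train, test
```

For each subject, one would train an SVM on each `train` split, score the corresponding `test` split, accumulate tp/fp/fn over the five folds, and report `f_measure`.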
We use three types of Kolari's features: 'bag-of-words', 'bag-of-anchors', and 'bag-of-urls'. In our experiment, the 'bag-of-words' consists of morpheme words that are extracted using a Japanese morphological analysis tool called Sen2. The number of dimensions of this feature is 9,014; we use tf*idf [8] values of the morpheme words to create the feature vector. The 'bag-of-anchors' contains morpheme words that are extracted from anchor text enclosed in <A> tags in HTML. The number of dimensions of this feature is 4,014. The values of this vector are binary (1 or 0). The 'bag-of-urls' contains the parts of URL text split by '. (dot)' and '/ (slash)' over all URLs appearing in a blog article. The number of dimensions of this feature is 3,091. Elements of this feature vector are tf*idf values. These features are prepared faithfully following the method in the previous work [1].

Light-weight features. We propose 'Light-weight features' that consist of several simple features appearing in an HTML document. Table 2 shows the list of features. There are 12 features in this feature set. These features have much lower dimensionality than Kolari's features. We explain each feature. 'Number of keywords' is the number of morphemes consisting only of noun words extracted from the body part of a blog article. 'Number of periods' and 'number of commas' are the frequencies of the Japanese period ('。') and comma ('、') in a blog article. 'Number of characters' is the length of the character string of a blog article including HTML tags. 'Number of characters without HTML tags' is the length of the character string of a blog article from which HTML tags have been removed. 'Number of br tags' is the number of <BR> tags in the HTML. 'Number of in-links' is the number of links that connect to the same host (e.g., links to comment pages and archive pages of the same domain). 'Number of out-links' is the number of links to external domains. 'Number of images' is the number of images contained in a blog article.
'Average height of all images' is the average height of images contained in a blog article. 'Average width of all images' is the average width of images contained in a blog article. 'Number of affiliate IDs [9]' is the number of IDs extracted from affiliate links in a blog article.

2 https://sen.dev.java.net

Table 2. The list of features defined in the Light-weight features

1. Number of keywords
2. Number of '。(period)'
3. Number of '、(comma)'
4. Number of characters
5. Number of characters without HTML tags
6. Number of br tags
7. Number of in-links
8. Number of out-links
9. Number of images
10. Average height of all images
11. Average width of all images
12. Number of affiliate IDs

Mixed features. 'Mixed features' is the mix of 'Kolari's features' and 'Light-weight features'. The number of dimensions is 16,131 (16,119 in Kolari's features plus 12 in Light-weight features).

Table 3. Average values of F-measure using each feature

Feature set            Linear  Polynomial  RBF    Sigmoid
Bag-of-words           0.608   0.592       0.533  0.522
Bag-of-anchors         0.603   0.615       0.519  0.533
Bag-of-urls            0.655   0.702       0.530  0.522
Light-weight features  0.573   0.601       0.583  0.548
Mixed features         0.615   0.590       0.526  0.515

5.3 Results

Results of Kolari's features. Table 3 shows the average values of F-measure using 'Kolari's features' for each kernel function. The best score (F-measure 0.702) appears at the intersection of the 'bag-of-urls' row and the polynomial kernel column. Figure 4 shows the F-measure for each user using 'bag-of-urls' and the polynomial kernel. Figure 5 shows the F-measure for each user using 'bag-of-words' and the linear kernel. Figure 6 shows the F-measure for each user using 'bag-of-anchors' and the polynomial kernel. In Figures 4, 5 and 6, the y-axis is the F-measure and the x-axis is the subject ID. Subject IDs are sorted in descending order of the F-measure values of Figure 4. Subject ID 46 shows the best F-measure, 0.947, in Figure 4.
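Two of the feature families above can be illustrated directly. The exact tokenization and tag handling in the authors' system are not specified, so this plain-regex version is only an approximation; `bag_of_urls` and `light_weight_counts` are hypothetical names.

```python
import re

def bag_of_urls(urls):
    # Split every URL on '.' and '/' into bag-of-urls tokens.
    tokens = set()
    for url in urls:
        tokens.update(t for t in re.split(r"[./]", url) if t)
    return tokens

def light_weight_counts(html):
    # A few of the 12 light-weight features, computed directly from HTML.
    text = re.sub(r"<[^>]+>", "", html)  # crude tag stripping
    return {
        "chars": len(html),
        "chars_no_tags": len(text),
        "br_tags": len(re.findall(r"<br\s*/?>", html, re.I)),
        "images": len(re.findall(r"<img\b", html, re.I)),
    }
```

In contrast to the 9,014-dimensional bag-of-words vector, the light-weight vector has only 12 components, which is the source of the efficiency argument made in section 6.2.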
From this result, we found that the pair of 'bag-of-urls' and the polynomial kernel shows good performance in personalized splog filtering.

Fig. 4. F-measure for each subject using the 'bag-of-urls' and the polynomial kernel

Fig. 5. F-measure for each subject using the 'bag-of-words' and the linear kernel

Performance of Light-weight features. Table 3 shows the average values of F-measure using 'Light-weight features' for each kernel. In Table 3, the F-measure 0.601 with the polynomial kernel is the best for this feature set. Figure 7 shows each user's F-measure value using this feature set and the polynomial kernel. Figure 7 shows results similar to those of Kolari's features. The best F-measure, 0.933, appears at subject ID 46 in Figure 7.

Performance of Mixed features. Mixed features is the mix of 'Kolari's features' and 'Light-weight features'. Table 3 shows the average values of F-measure for each kernel using this feature set. In Table 3, the F-measure 0.615 with the linear kernel shows the best score. Figure 8 shows each user's F-measure value using this feature set and the linear kernel. The maximum F-measure, 0.933, appears at subject ID 46 in Figure 8.

Fig. 6. F-measure for each subject using the 'bag-of-anchors' and the polynomial kernel

Fig. 7. F-measure for each subject using the 'Light-weight' features and the polynomial kernel

5.4 Analysis of the Best Feature Set for Each User

We consider that there is a best feature set for each user, and indeed we found one for each user. The candidates for the best feature set are: '1. bag-of-words', '2. bag-of-anchors', '3. bag-of-urls', '4. Light-weight features', and '5. Mixed features'. To find the best feature set, we use the best F-measure value among these feature sets. In addition, when F-measures are tied, the best feature is chosen based on Table 4. Table 4 shows the rank of features.
This table is calculated based on the number of dimensions. The rank column in Table 4 shows the priority used when choosing the best feature; a smaller value means a higher priority. We chose the best feature for each subject based on Table 4. The result is shown in Table 5. Table 5 shows the best feature and the best F-measure value for each subject. The best F-measure is 0.947 for subject ID 47, whose best feature is 'bag-of-urls'. The worst F-measure is 0.316 for subject ID 38, whose best feature is also 'bag-of-urls'. In Table 5, no subject has an F-measure of 0. Although several subjects have F-measure values of 0 in Figures 4 through 8, no subject has a value of 0 once the best feature set is chosen for each subject. We counted the feature IDs in Table 5.

Fig. 8. F-measure for each subject using the 'Mixed features' and the linear kernel

Table 4. Rank of features based on feature dimensions

Feature name              Dimension  Rank
4. Light-weight features  12         1
3. Bag-of-urls            3,091      2
2. Bag-of-anchors         4,014      4
1. Bag-of-words           9,014      3
5. Mixed features         16,131     5

Table 5. Results of the best pair of features and kernel, and its F-measure value

ID  Feat  F      | ID  Feat  F      | ID  Feat  F      | ID  Feat  F
1   4     0.848  | 14  4     0.720  | 27  4     0.692  | 40  3     0.841
2   4     0.679  | 15  4     0.833  | 28  3     0.560  | 41  4     0.831
3   3     0.455  | 16  3     0.667  | 29  2     0.714  | 42  3     0.571
4   4     0.772  | 17  4     0.793  | 30  3     0.571  | 43  1     0.625
5   3     0.933  | 18  4     0.653  | 31  3     0.632  | 44  N/A   N/A
6   3     0.841  | 19  4     0.800  | 32  4     0.667  | 45  3     0.933
7   4     0.831  | 20  3     0.754  | 33  2     0.410  | 46  3     0.750
8   1     0.848  | 21  4     0.847  | 34  3     0.381  | 47  3     0.947
9   3     0.588  | 22  4     0.741  | 35  3     0.857  | 48  3     0.904
10  1     0.839  | 23  3     0.667  | 36  3     0.63   | 49  3     0.604
11  3     0.824  | 24  4     0.857  | 37  4     0.813  | 50  4     0.814
12  4     0.780  | 25  5     0.593  | 38  3     0.316  |
13  4     0.691  | 26  3     0.730  | 39  3     0.904  |

Table 6.
Total number of feature ID

Feature name              Frequency
3. Bag-of-urls            24
4. Light-weight features  19
1. Bag-of-words           3
2. Bag-of-anchors         2
5. Mixed features         1

Table 6 shows the frequency of each feature. The most frequent feature is '3. bag-of-urls', with 24 occurrences. '4. Light-weight features' appears 19 times, so our feature set accounts for 19 of the 50 subjects (about 40%). From these results, we found that there is a best feature for each user.

6 Discussion

We evaluated the performance of the user-oriented filtering method by comparing combinations of several feature sets. From this experiment, we found (1) the effect of 'Kolari's features' on personalized splog filters, and (2) that 'Light-weight features' were effective in user-oriented splog filtering.

6.1 The Effect of 'Kolari's Features'

First, we consider filter performance using 'Kolari's features'. From the results of 'Kolari's features', some subjects achieve splog detection of over 90% in Figures 4 to 6. On the other hand, there are subjects whose performance is not good (subjects enclosed with the circle in Figure 6). These subjects showed very low F-measure values when we used a single common kernel, but when we chose an appropriate feature set for each user, their F-measures improved3.

6.2 The Effect of 'Light-Weight Features'

Second, we consider filter performance using the 'Light-weight features'. Table 3 shows results similar to those of 'Kolari's features' and 'Mixed features'. The point is that the dimensionality of 'Light-weight features' is much lower than that of 'Kolari's features' and 'Mixed features'. We found that increasing the number of dimensions does not improve F-measure values, and that it is sufficient to use a lower-dimensional feature set. Therefore, we consider that Light-weight features are practical compared with 'Kolari's features' and 'Mixed features'. We will evaluate the method using a larger dataset.
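The selection rule behind Table 5 (highest F-measure, ties broken by the Table 4 rank, where a smaller rank wins) can be sketched as follows; `best_feature` is a hypothetical helper, and the ranks are those printed in Table 4.

```python
# Ranks as printed in Table 4 (smaller = higher priority on ties).
RANK = {"light-weight": 1, "bag-of-urls": 2, "bag-of-words": 3,
        "bag-of-anchors": 4, "mixed": 5}

def best_feature(scores):
    # scores: {feature_set_name: f_measure} for one subject.
    # Maximize F-measure; on a tie, prefer the lower-rank feature set.
    return max(scores, key=lambda f: (scores[f], -RANK[f]))
```

For example, a subject whose 'bag-of-urls' and 'light-weight' filters tie would be assigned 'light-weight', since its rank (1) beats 'bag-of-urls' (2).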
3 For example, in Figure 6, the F-measures of the subject enclosed with the circle (subject ID 27) for each kernel are 0.500 with the linear kernel, 0 with the polynomial kernel, 0.604 with the RBF kernel, and 0.642 with the sigmoid kernel.

7 Conclusion

In this paper, we described a user-oriented splog filtering method providing an appropriate personalized filter for each user. We conducted two experiments: (1) an experiment on individual splog judgement, and (2) an evaluation experiment on personalized splog filters. We collected individual splog judgement data from an experiment with 50 subjects. We found that Light-weight features showed the same or better performance compared with Kolari's features. We found that there is a best feature for each user, and we showed that our method is effective. In future work, we will try to select features to improve the F-measure values for each user.

References

1. Kolari, P., Java, A., Finin, T., Oates, T., Joshi, A.: Detecting spam blogs: A machine learning approach. In: Proceedings of the 21st National Conference on Artificial Intelligence (AAAI 2006), pp. 1351–1356 (2006)
2. Drucker, H., Wu, D., Vapnik, V.: Support vector machines for spam categorization. IEEE Transactions on Neural Networks, 1048–1054 (1999)
3. Ishida, K.: Extracting spam blogs with co-citation clusters. In: Proceedings of the 17th International Conference on World Wide Web (WWW 2008), pp. 1043–1044 (2008)
4. Junejo, K.N., Karim, A.: PSSF: A novel statistical approach for personalized service-side spam filtering. In: Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence (WI 2007), pp. 228–234 (2007)
5. Jeh, G., Widom, J.: Scaling personalized web search. In: Proceedings of the 12th International Conference on World Wide Web (WWW 2003), pp. 271–279 (2003)
6. Manning, C.D., Schütze, H.: Foundations of Statistical Natural Language Processing.
MIT Press, Cambridge (1999) 7. Yoshinaka, T., Fukuhara, T., Masuda, H., Nakagawa, H.: A user-oriented splog filtering based on machine learning method- (in japanese). In: Proceedings of The 23rd Annual Conference on the Japanese Society for Artificial Intelligence (JSAI 2009), vol. 2B2-4 (2009) 8. Manning, C.D., Shuetze, H.: Foundations of Statistical Natural Language Processing. MIT Press, Cambridge (1999) 9. Wang, Y.M., Ma, M., Niu, Y., Chen, H.: Spam double-funnel: connecting web spammers with advertisers. In: Proceedings of the 16th International Conference on World Wide Web (WWW 2007), pp. 291–300 (2007) Generating Researcher Networks with Identified Persons on a Semantic Service Platform Hanmin Jung, Mikyoung Lee, Pyung Kim, and Seungwoo Lee KISTI, 52-11 Eueon-dong, Yuseong-gu, Daejeon, Korea 305-806 jhm@kisti.re.kr Abstract. This paper describes a Semantic Web-based method to acquire researcher networks by means of identification scheme, ontology, and reasoning. Three steps are required to realize it; resolving co-references, finding experts, and generating researcher networks. We adopt OntoFrame as an underlying semantic service platform and apply reasoning to make direct relations between far-off classes in ontology schema. 453,124 Elsevier journal articles with metadata and full-text documents in information technology and biomedical domains have been loaded and served on the platform as a test set. Keywords: Semantic Service Platform, OntoFrame, Ontology, Researcher Network, Identity Resolution. 1 Introduction Researcher network, a social network between researchers mainly based on coauthorship and citation relationship, helps for users to discover research trends and behavior of its members. It can also support to indicate key researchers in a researcher group, and further to facilitate finding appropriate contact point for collaboration with ease. Several researcher network services are currently on the Web. 
BiomedExperts (http://www.biomedexperts.com) shows co-publication relationships between researchers, and the researchers related to a selected one, in the biomedical domain [1]. It also provides researchers' metadata and exploratory sessions on the network. Authoratory (http://authoratory.com) is another service focused on co-authorship and article details. ResearchGate (http://www.researchgate.net) provides additional functions for grouping researchers by contacts, publications, and groups. Metadata of a researcher is also offered on every node [2]. Microsoft's network (http://academic.research.microsoft.com) emphasizes attractive visualization as well as detailed co-authorship information. However, none of them is built on a semantic service platform that can support precise information backed by an identification scheme, ontology, and reasoning. As they are typical database applications based on data mining technologies, achieving flexible and precise services that connect knowledge and planning services is difficult for them. In order to surpass the qualitative limits of existing researcher network services, we address three major issues in this paper: resolving co-references for assuring a precise service level, finding experts for a topic, and generating researcher networks triggered by them. The following sections explain how the two types of researcher networks can eventually be acquired from articles. As the first step, we gathered Elsevier journal articles as the service target, since it is easy to recognize their sub-sets in the information technology and biomedical domains, and all of them have metadata and full-text documents, facilitating the application of text mining technology for extracting topics, which act as the crucial connection point between users and the service.

J.G. Breslin et al. (Eds.): BlogTalk 2008/2009, LNCS 6045, pp. 100–107, 2010. © Springer-Verlag Berlin Heidelberg 2010
2 Resolving Co-references on a Semantic Service Platform

OntoFrame is a semantic service platform for easily realizing semantic services regardless of application domain [3]. It is composed of a semantic knowledge manager called OntoURI, a reasoning engine called OntoReasoner, and a commercial search engine. Semantic services based on OntoFrame interact with the two engines using an XML protocol and Web services.

Fig. 1. OntoFrame architecture

The manager transforms metadata gathered from legacy databases into semantic knowledge in the form of RDF triples, referring to an ontology schema (see footnote 1) designed by ontology engineers. The manager then propagates the knowledge to the two engines. The reasoning engine intervenes in the process of loading the repository in order to generate induced triples inferred by user-defined inference rules. The coverage of the engine can roughly be described as RDF++, as it supports the entailments of full RDFS (RDF Schema) and some OWL vocabulary such as 'owl:inverseOf' and 'owl:sameAs'.

1 Currently it is composed of 16 classes and 89 properties using RDF, RDFS, and OWL Lite.

Ontology individuals should be clearly identified; thus the manager has the additional role of resolving co-references between the individuals, as well as converting DB to OWL. The whole process performed in the manager can be viewed as a syntactic-to-semantic process, as shown in Fig. 2. Annotation for generating metadata can be put in front of the DB-to-OWL conversion. The process has two independent sub-processes based on a time criterion: modeling time and indexing time. The former includes ontology schema design and rule editing, and the latter concerns identity resolution and RDF triple generation.

Fig. 2. OntoURI process for generating semantic knowledge

OntoURI applies several managed rules such as URI generation, DB-to-OWL mapping, and identity resolution [4]. For example, it assigns different weights to each clue for resolving ambiguous authors, as shown in Table 1.
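Such weighted-clue resolution can be sketched in a few lines. This is an illustrative sketch, not OntoURI's actual implementation: the clue weights loosely follow Table 1, while the decision threshold and the record layout are our own assumptions.

```python
# Illustrative weighted-clue scoring for author identity resolution,
# loosely following Table 1 (not OntoURI's actual implementation).
# The decision threshold below is an assumption for this example.

WEIGHTS = {"institution": 2.0, "email": 4.0, "coauthor": 1.0}
THRESHOLD = 4.0  # assumed cut-off for declaring two mentions the same person

def same_person(a, b):
    """a, b: dicts describing two author mentions; 'Name' is the pivot."""
    if a["name"] != b["name"]:  # rules fire only when the names match
        return False
    score = 0.0
    if a.get("email") and a.get("email") == b.get("email"):
        score += WEIGHTS["email"]  # rarely shared, hence the highest weight
    if a.get("institution") == b.get("institution"):
        score += WEIGHTS["institution"]
    score += WEIGHTS["coauthor"] * len(set(a["coauthors"]) & set(b["coauthors"]))
    return score >= THRESHOLD

a = {"name": "Jinde Cao", "email": "jc@x.org",
     "institution": "Southeast Univ.", "coauthors": ["A. Smith"]}
b = {"name": "Jinde Cao", "email": "jc@x.org",
     "institution": "Southeast Univ.", "coauthors": ["B. Jones"]}
print(same_person(a, b))  # True: e-mail (4) plus institution (2) pass the threshold
```

Pairs scoring at or above the threshold can then be linked with an 'owl:sameAs' relation rather than merged outright.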
'Name' is the pivot that initiates the resolution; that is, the identity resolution rules are triggered on finding that two authors located in different articles share the same name. The weight of 'E-mail' is the highest because it is very rare for different authors to share the same e-mail address. The property 'hasTopic' is a threshold feature because it is not a binary feature that can clearly determine whether or not two authors are the same person.

Table 1. Rules for resolving co-references between author individuals

Resource        Class   Kind       Match   Relation  Source        Weight
Name            Person  Pivot                                      1
hasInstitution  Person  Feature    Exact   Single    OntoURI       2
E-mail          Person  Feature    Exact   Single    OntoURI       4
hasCoauthor     Person  Feature    Number  Single    OntoReasoner  1
hasTopic        Person  Threshold  Number  Multiple                0.8

Fig. 3 shows a result of the resolution for author individuals named 'Jinde Cao'. Authority data (Table 2 shows an example) is also applied to normalize individual names at the surface level. After resolving co-references between the individuals acquired from 453,124 Elsevier journal articles with metadata and full-text documents in the information technology and biomedical domains, the following identified individuals were loaded into the repository in the form of RDF triples; the total number of triples in the repository is 283,087,518. We leave identified persons as they are, without actively trying to merge them into one, because two different identifiers can always be connected dynamically with a 'sameAs' relation.

- 1,352,220 persons
- 339,947 refined topics
- 91,514 authorized institutions
- 409,575 locations with GPS coordinates

Table 2. Example of authority data

Normalized form  Variant form                                 Kind          Class
IBM              International Business Machines Corporation  Abbreviation  Institution
Microsoft        MS                                           Abbreviation  Institution
Microsoft        마이크로소프트                                    Korean        Institution
London           런던                                           Korean        Location
Academic Inc.    Academic Press Inc, LTD                      Alternative   Publication

The OntoFrame service, including the researcher network, was designed as an academic research information service like Google Scholar. However, it controls individuals with a URI-based (Uniform Resource Identifier) identification scheme and lies on a Semantic Web service platform; that is, it can be empowered by both search and reasoning, in contrast with other similar services. It provides several advanced services: topic trends, showing relevant topics along a timeline; domain experts, recommending dominant researchers for a topic; researcher groups, revealing collaboration behavior among researchers; researcher networks, tracing co-author and citation relationships in a group; and similar researchers, who study topics relevant to those of a given researcher.

3 Finding Experts

Expert finding is very useful when seeking consultants, collaborators, and speakers. Semantic Web technology can be a competent solution for exactly recognizing identified researchers through the underlying identification scheme. Deep analysis of full-text documents is also needed, as only documents topically classified with high precision can ensure that the right persons are recommended for a given topic. Thus we propose an expert-finding method based on identity resolution and full-text analysis.

Fig. 3. Example of identified authors ('Jinde Cao')

Extracting topics from documents is the most basic task in acquiring topic-centric experts; the extracted topics are assigned to each article. The indexer extracts index terms from an input document; after matching the terms against the topics in a topic index DB, successfully matched terms are ranked by frequency, and the top n (currently, five) of them are assigned to the document. The following workflow shows how experts for a given topic can be found [5].

1.
Knowledge expansion through reasoning: make direct relations between far-off classes in the ontology schema to construct a shorter access path.
2. Querying and retrieving researchers: call a SPARQL query with the given topic; convert the query to the corresponding SQL query; exploit the backward-chaining path to retrieve the researchers classified into the topic.
3. Post-processing: group the retrieved researchers; rank them by name or by the number of articles; produce an XML document as the result of expert finding.

As our researcher network service is initiated by finding experts for a topic, the service mandatorily requires person(s) regardless of network type. A topic must also be provided in the case of generating a topic-based network.

4 Generating Researcher Networks

We designed two kinds of researcher networks from the viewpoint of the constraint used to connect researchers in the network. The first type is the topic-constrained network and the second is the person-centric network. A topic-constrained network connects researchers under a given topic; it implies that all of the relationships between researchers share the same topic. The following pseudo code and SPARQL query generate a topic-constrained network. The first step retrieves, through a SPARQL query, the co-author pairs who together wrote an article classified into a given topic <topURI>. The second step searches for a given researcher in the pairs; that is, two arguments, a topic and a researcher, are needed to acquire a topic-constrained network. The last step recursively traces the pairs acquired in the first step, using the co-authors of the seed, i.e. the given researcher, as further seeds.

Fig. 4. A topic-constrained network for topic 'neural network' and researcher 'Jinde Cao'

1. Get co-author pairs for a given topic

SELECT DISTINCT ?person1 ?person2
WHERE {
  ?article aca:yearOfAccomplishment ?year .
  FILTER(?year >= startYear && ?year <= endYear) .
  ?article aca:hasTopicOfArticle <topURI> .
  ?article aca:createdByPerson ?person1 .
  ?article aca:createdByPerson ?person2 .
  FILTER(?person1 < ?person2) .
}

2. Select a target researcher in the pairs
3. Trace the pairs through the seed

A person-centric network connects researchers around a focal researcher without considering the topics shared between them. It is useful for understanding the relationship between a given researcher and his colleagues in detail. The first step acquires, through a SPARQL query, the co-authors who wrote together with a given researcher <perURI>. The second step ranks them by the number of co-authored articles. The ranked results are reflected in the visualized network as the distance from the central researcher.

1. Get co-authors of a target researcher

SELECT ?per1 ?per2
WHERE {
  ?article aca:yearOfAccomplishment ?year .
  FILTER(?year >= startYear && ?year <= endYear) .
  ?article aca:createdByPerson ?per1 .
  ?article aca:createdByPerson ?per2 .
  FILTER(?per1 < ?per2) .
  FILTER(?per1 = <perURI> || ?per2 = <perURI>) .
}

2. Rank them by the number of co-authored articles

Fig. 5. A person-centric network for researcher 'Jinde Cao'

5 Conclusion

This paper presented a solution for three major issues in researcher network services: resolving co-references to assure a precise service level, finding experts for a topic, and generating researcher networks triggered by them. By means of the semantic service platform, two kinds of precise researcher networks were implemented with ease. Resolution rules and authority data were applied to resolve co-references between ontology individuals. SPARQL queries and reasoning were utilized in both finding experts and generating the networks.
We plan to develop a pipelining system that assembles existing semantically-operated services such as 'finding experts' and 'generating person-centric network' to directly acquire a researcher network from a topic query without going through user interactions. This will make it easier for users to access researcher networks.

References

1. Whitaker, I., Shokrollahi, K.: BiomedExperts: Unlocking the Potential of the Internet to Advance Collaborative Research in Plastic and Reconstructive Surgery. Annals of Plastic Surgery 63(2) (2009)
2. Pronovost, S., Lai, G.: Virtual Social Networking and Interoperability in the Canadian Forces Netcentric Environment. Technical report CR 2009-090, Defence R&D Canada (2009)
3. Sung, W., Jung, H., Kim, P., Kang, I., Lee, S., Lee, M., Park, D., Hahn, S.: A Semantic Portal for Researchers Using OntoFrame. In: 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference (2007)
4. Kang, I., Na, S., Lee, S., Jung, H., Kim, P., Sung, W., Lee, J.: On Co-authorship for Author Disambiguation. Information Processing & Management 45(1) (2009)
5. Jung, H., Lee, M., Kang, I., Lee, S., Sung, W.: Finding Topic-Centric Identified Experts Based on Full Text Analysis. In: 2nd International ExpertFinder Workshop at ISWC 2007 + ASWC 2007 (2007)

Towards Socially-Responsible Management of Personal Information in Social Networks

Jean-Henry Morin
University of Geneva – CUI, Department of Information Systems, Route de Drize 7, 1227 Carouge, Switzerland
Jean-Henry.Morin@unige.ch

Abstract. Considering the increasing amount of Personal Information (PI) used and shared in our now common social networked interactions, privacy, retention, and how such information is managed have become important issues. Most approaches rely on one-way disclaimers and policies that are often complex, hard to find, and lacking ease of understanding for ordinary users of such common networks.
This leaves little room for users to actually retain any control over how the released information is used and managed once it has been put online. Additionally, personal information (PI) may include digital artifacts and contributions over which people would legitimately like to retain some rights concerning their use and their lifetime. Of particular interest in this category is the notion of the "right to forget", which we no longer have control over, given the persistent nature of the Internet and its ability to retain information forever. This paper examines this issue from the point of view of the user and of social responsibility, arguing for the need to augment information with an additional set of metadata about its usage and management. We discuss the use of DRM technologies in this context as a possible direction.

Keywords: Personal Information Management, social responsibility, social networks, privacy, right to forget, DRM.

1 Introduction

Social networks and services have now penetrated most if not all of our activities, whether professional or personal. They range from social bookmarking, social tagging, slide sharing, micro-blogging, photo and video sharing, friendships, colleagues, etc. to more elaborate forms of interaction and collaboration through user-centric conversational threads such as Google Wave [1]. Most if not all of these Web 2.0 services have unclear if not unaddressed positions with respect to the management of personal information and the corresponding social responsibility. In a similar way as social responsibility has become an important issue in the corporate world (corporate social responsibility, CSR [2]), we argue, based on recent evidence drawn from an increasing number of deceptive situations in social networks, that there is an urgent need to address Personal Information management in a socially responsible way in order to sustain social networks and services, and consequently to put the user back in "control" of his digital fate.

J.G. Breslin et al. (Eds.): BlogTalk 2008/2009, LNCS 6045, pp. 108–115, 2010. © Springer-Verlag Berlin Heidelberg 2010

While this latter aspect of the lack of concern for users in IT services isn't new, its importance is growing with social networks. Each service has its own policies with respect to how it manages user information, be it personal information or the content users share through the service. Often written in tiny characters (e.g., point 8 being the minimum according to marketing practitioners), such agreements are much too verbose and complex for the vast majority of users. This puts user information at risk without users even being aware of what they actually agreed to when checking the "I agree to the terms and conditions" button or box upon signup. Furthermore, once information is published and shared, users retain very limited to no control whatsoever over it: how it will be used, for how long, etc. Consequently, the information becomes available forever, raising concern over the notion of the "right to forget" or "forgetfulness" [3] and the associated data retention and privacy policies. Often referred to as the panoptic society following the description of Foucault [4], users are increasingly positioned as "prisoners" locked in a one-way panopticon service environment offering a bird's-eye view to the service provider on blindfolded users. The key question in this context becomes, in addition to raising user awareness of the issue, to study the conditions and requirements that should guide the design, implementation and use of social network services providing socially responsible management of personal information in the digital age. This paper is structured as follows. In the next section we describe the problem and the background.
Section 3 describes a set of requirements that should be fulfilled to address the problem and to design future services for a socially and ethically responsible electronic society. In Section 4 we draw a parallel with Kim Cameron's work on the laws of identity, establishing how these may also be applied in the context of Personal Information (PI) and time-bound social information. Finally, Section 5 discusses a proposition of using DRM technologies in this context, before concluding remarks in Section 6.

2 Background and Problem Statement

Let us first consider some key factors behind the problem. Given the global trend of democratization of Internet access and accessibility of broadband networks, more and more people are getting online. The cost of accessing the Internet, even in mobile settings using smart phones, has dropped significantly, to a point where it is becoming accessible to everyone, even with mobile flat-fee unlimited plans. Recent evolutions in social networking services have drawn an impressive number of people to connect in online communities through services such as Facebook, LinkedIn, Friendster, MySpace, etc. This has further led to other social activities around micro-blogging, bookmarking, pictures, presentations, videos, etc. Smart phones, netbooks and tablets are gaining popularity and provide very interesting platforms for mobile location-based services in social networks. Business models surrounding these platforms increasingly subsidize the cost of the hardware by billing the services rather than the devices. The revenue streams associated with service-based billing models are far more stable and exhibit more value for service providers in terms of customer retention. Also to be noted is the spectacular cost reduction of storage and processing power, which, put into the perspective of increasing capabilities for searching and mining, raises new issues in our electronic society.
Existing and future searching and mining capabilities will become instrumental in finding information at very low cost in very little time. While all this may seem fine in terms of infrastructure, there is a growing concern about managing our online identities and the information we increasingly share online. Information becomes accessible globally and indefinitely, ready for searching and mining. Individuals can store information in quantities that nations themselves couldn't have dreamt of in a not so distant past. Increasingly, people are warned about being careful with the information they share and publish, based on the idea of "publish once, publish forever". Assuming these trends are confirmed in the future, this opens up a whole range of issues with respect to Personal Information Management. Most countries now have strict laws addressing Personally Identifiable Information (PII). These regulations impose specific practices on service providers for how to manage such information internally. However, what happens when users themselves share personal information, not necessarily identifiable, on such sites is the focus of this paper, which argues for the need both to enhance awareness and, in particular, to technically allow for the management of such information in terms of usage rights, retention, and scope of sharing. Assuming there is a problem with how PI is managed in social networks, and that individuals should retain a right to control how their information is shared and used, we contend there is a need for Managed Personal Information centered on the user rather than on the service provider, in a social responsibility mindset. As a result, our proposition is that PI should be augmented with an additional layer of metadata governing its usage in a persistent way, thus allowing users to retain control over their information and to manage the policies governing its use in the digital realm.
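To make the proposition concrete, the sketch below models such a metadata layer in a few lines; all names, fields, and the retention default are hypothetical illustrations of the idea, not an existing system.

```python
from datetime import datetime, timedelta

SIX_MONTHS = timedelta(days=182)  # assumed default retention period

def publish(content, published_at, expires_in=SIX_MONTHS, editorial=False):
    """Attach a usage-policy metadata layer to a piece of personal information."""
    return {
        "content": content,
        "published_at": published_at,
        # editorial content (e.g. scientific publications) never expires by default
        "expires_at": None if editorial else published_at + expires_in,
        "recalled": False,
    }

def is_visible(item, now):
    """Serve PI only if it is neither recalled by its owner nor expired."""
    if item["recalled"]:
        return False
    return item["expires_at"] is None or now < item["expires_at"]

post = publish("holiday photo", datetime(2010, 1, 1))
print(is_visible(post, datetime(2010, 3, 1)))  # True: within the retention period
print(is_visible(post, datetime(2010, 9, 1)))  # False: the default policy expired it
```

In such a scheme a service provider would consult `is_visible` before serving any piece of PI, and a recall by the owner simply flips the flag.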
One example of such a generic policy might be a default expiry date for general information, whereby any information, unless published with specific editorial requirements (e.g., scientific publications, cultural heritage, etc.), would become obsolete after six months. Of course, users ultimately retain control over such policies, even to an extent allowing them to "recall" published information for any reason.

3 Requirements

In order to fulfill the above proposition, let us review some of the requirements behind this problem. We consider these requirements essentially from the viewpoint of the user and of the underlying enabling infrastructure, since it is basically the user's information that is at stake.

3.1 User Consent, Awareness and Control

Releasing PI should require explicit consent from the user. The user should be made aware of the conditions under which his information is released, as well as the extent to which he will be able to retain any form of control over it. Every piece of PI released should carry a default usage control policy, set either by the user explicitly or by the service provider in the way most conservative for the user. Usage control policies may involve setting expiry dates on content, allowing PI to be recalled unilaterally, limiting the scope of sharing to specific people or groups, constraining PI usage to specific situations and conditions, etc. In summary, anything deemed necessary by the PI owner, regardless of the infrastructure, service provider, or parties involved in the conversations.

3.2 Infrastructure Requirements

From the infrastructure standpoint, the overall service provisioning relying on PI should be technology- and provider-independent, thus allowing for interoperability and portability of PI across platforms and operators. Releasing PI can be considered as important as releasing identity information or paying.
Infrastructures managing PI should therefore exhibit similar process patterns, with which users are not only aware but also familiar. Moreover, the user should be an explicit acting element in the process. Being involved explicitly in the HCI interaction allows for better awareness upon PI release, preventing PI from being released without consent.

4 From the Laws of Identity to the Laws of Personal Information (PI)

Looking at the work done in the area of Identity Management, we think this area shares many requirements with the issue of PI management. Building on Kim Cameron's laws of identity [5], we propose to examine and transpose these laws into the context of Personal Information (PI) Management. We argue this framework covers some essential requirements for socially responsible management of PI in social networks. Let us briefly review the seven laws, putting them into the perspective of PI Management Systems (PIMS) and what they mean in this context based on the requirements discussed in the previous section.

1. User Control and Consent: users should explicitly consent to the information they release on a PIMS and retain control over this information.
2. Minimal Disclosure for a Constrained Use: by default, and unless otherwise explicitly indicated by the user, any PIMS should reveal the minimum amount of information the user has (explicitly or not) agreed to, and limit its use to what it was intended for.
3. Justifiable Parties: a PIMS should limit and enforce the disclosure of PI to the parties clearly identified within a PIMS relationship (e.g., one-to-one relationship, group relationship, public information, etc.).
4. Directed Identity: should be renamed Directed Information, whereby users can specify the scope of release of the information, either for all to discover or for specific private use.
5. Pluralism of Operators and Technologies: interoperability and portability of PI across platforms and service providers.
6.
Human Integration: people should be part of the human-computer process involving the release of PI.
7. Consistent Experience Across Contexts: releasing and managing PI should exhibit similar patterns of operation, independent of the service provider or the technologies used.

We found Cameron's work on the laws of identity particularly insightful in the context of PI management. Many of the proposed items appear to be directly usable when transposed to PI. At this stage, it is beyond our purpose to decide which ones should be part of a mandatory set and which might belong to an optional category. The first four cover most of the essential user-centered requirements identified in the previous section, while the last three are more oriented towards the enabling infrastructure. Since each of these aspects has a specific criticality level depending on how the PIMS behaves with respect to the issue, they can lead to a simple matrix of readiness levels (e.g., color codes) helping users quickly and unambiguously picture the level of social responsibility of the service provider with respect to PI management. In order to implement some of these features, we have identified DRM technologies as a likely technical approach allowing persistent protection and rights management to be applied to PI.

5 Using DRM to Address Personal Information Management

DRM technologies have mainly been used to address the issue of persistent content protection and distribution in the entertainment industry and the enterprise sector, to safeguard intellectual property (i.e., copyright, trade secrets) and, more recently, to address compliance issues in the corporate world. DRM is the acronym of Digital Rights Management. It is a technology that cryptographically associates usage rules, also called policies, with digital content. These rules govern the usage of the content they are associated with. They have to be interpreted by an enforcement point prior to any access, in order to determine whether or not access can be granted. If they are successfully interpreted, a license is used to decrypt and render the content using a trusted interface (e.g., browser, application, sound or video device, etc.). The content itself being encrypted using strong cryptographic algorithms, it is persistently protected at all times, wherever it resides. The general DRM scenario can be decomposed into the following four main steps:

1. Content preparation and packaging: this step requires the content owner to securely package the content by encrypting it together with its usage rules. The rules are also cryptographically attached to the content, thus allowing superdistribution. Note that the rules could also be dynamically acquired, provided the only attached rule is to acquire them. This is particularly useful to retain some control over the rules and the associated content.
2. Content distribution (and superdistribution): from there on, the content may be freely distributed (superdistributed) and shared through any medium (Web, CD, DVD, email, ftp, removable storage, streaming, etc.) since it is persistently protected.
3. Content usage: this step involves a consumer trying to access and render the content. It typically requires acquiring a license (from a license server) based on the interpretation of the rules attached to the content. If successful, the license is granted and returned to the user's DRM enforcement point for decryption and rendering of the content in a trusted interface. Note that the license server is not necessarily the content owner; this role may be outsourced to external actors such as content "clearing houses" or service providers. This activity is of great importance, as it provides the usage data and metering information to the content owners for marketing and market analysis purposes.
4. Settlement, clearing of transactions and usage metering: finally, this step concerns the financial clearing and settlement of the completed transactions. It is mostly back office and is based on the data collected from license acquisition request transactions.

An example among the most widely known DRM systems is Apple's FairPlay, the DRM technology used for all iTunes content (music, video, apps, etc.). We have now become familiar with the fact that any security system is bound to be broken given enough time and effort. Therefore, security approaches needn't be foolproof military-grade solutions, especially when it comes to mundane usage situations and content. They should strike a balance between an acceptable level of risk and its related cost, assuming most users aren't criminals a priori. This is exactly what Apple did with FairPlay, and one can reasonably admit the approach has proven successful both economically and technically. Considering that personal information (PI) is intellectual property, we contend there is a need for its protection and management in a persistent way. Using DRM technologies, in ways similar to what Apple did with FairPlay, to protect PI would provide users with a much-needed solution to safeguard their own information when shared over social networks. DRM is poised to become a technology no longer reserved to large companies and content publishers, as our society increasingly progresses towards User-Created Content (UCC, UGC) and remix. Ordinary people now need tools and techniques allowing them to factor in their own creativity and intellectual property while preserving the rights of others. This includes the right to determine the conditions and the extent to which one is willing to share and distribute one's information.
As a result, our proposition is to give users access to a Personal Rights Management system allowing them to specify and choose the rules and conditions under which they will release personal information on social networks through third-party service providers. Ultimately, such rules will persistently apply to and govern the use of the released content no matter where it resides, thus providing users with increased confidence in how their PI is used.

There are, however, several limitations to this proposition that need to be mentioned. First and foremost: interoperability and the lack of standards in DRM technologies. This industry has been dominated by proprietary, incompatible solutions, mostly driven by rights holders or key players in the ecosystem. We think the need to bring such functionality to general users may drive new initiatives to harmonize these technologies towards standards. Another key challenge will be to design a policy expression language simple enough for general users to quickly and efficiently recognize and express their needs in terms of PI management, without the burden of entering into complex settings; that burden would defeat the whole purpose, as most people would then ignore the settings altogether. Designing such features will be instrumental to the adoption of PI management by users. The lack of concern shown by users over PI during the growth of social networking services will also be hard to reverse for service providers, who have become accustomed to essentially owning everything their members put online. This will require either a significant, socially responsible change of mindset, acknowledging that users have rights over their PI, or a set of public policies to force providers to understand the criticality of the issue. In the worst case, a legal step might be considered, as in France, for example, where a law on the right to be forgotten is being proposed and discussed.
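How small could such a policy expression language be kept? The sketch below is our own illustration, not an existing rights expression standard: each rule names an audience, the actions permitted, and an optional expiry date, and everything not explicitly granted is denied.

```python
from datetime import datetime, timezone

# Hypothetical, minimal PI usage policy. The item name, audiences, and
# actions are invented for illustration.
POLICY = {
    "holiday-photo.jpg": [
        {"audience": "friends", "actions": {"view", "comment"},
         "expires": "2031-01-01"},
        {"audience": "public", "actions": set(), "expires": None},
    ],
}

def is_allowed(item, audience, action, now=None):
    """Deny-by-default check of an action against the item's rules."""
    now = now or datetime.now(timezone.utc)
    for rule in POLICY.get(item, []):
        if rule["audience"] != audience:
            continue
        if rule["expires"] and now > datetime.fromisoformat(
                rule["expires"]).replace(tzinfo=timezone.utc):
            return False  # rule has expired
        return action in rule["actions"]
    return False  # no matching rule: deny
```

A user-facing interface would only ever expose the three fields; the persistent enforcement described above happens underneath.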
Needless to say, this is arguably a useless path, given that the national, territorial scope of such laws makes them largely inoperative on a global medium such as the Internet. Finally, there is the whole debate about the evil nature of DRM, as argued by several activists who consider DRM to be "Digital Restriction Management" that should be banned altogether from every service, as it limits our fundamental rights. While we agree with many of their objections, their extreme position is highly arguable, as they do not propose any alternative. Our position has been clear for many years now: we have worked on and proposed models [6] to handle exception management in DRM environments, accommodating fair use and other exceptional situations in traceable and accountable ways.

6 Conclusion and Future Work

Personal Information (PI) management in social networks is becoming an area of growing concern, not only for service providers facing social responsibility issues but also for ordinary users who are increasingly unaware of how their information is managed and used when shared. In addition to raising and describing the issue, we have drawn a parallel between Identity Management and PI Management based on Cameron's laws of identity. In this context, and given some preliminary requirements, we argue that DRM technologies should be brought down to the level of ordinary users in order for them to manage the rights to their PI. We are well aware that this paper is prospective in nature and therefore covers initial ideas in terms of a proposed approach to address this issue. Future work will focus on studying and designing a lightweight rights framework optimized for PI management and on implementing a prototype within an existing social network. In addition, work is needed on usability and awareness issues.

References

1.
Bekmann, J., Lancaster, M., Lassen, S., Wang, D.: Google Wave Data Model and Client-Server Protocol, http://www.waveprotocol.org/whitepapers/internal-client-server-protocol (retrieved January 2010)
2. Aaronson, S.A.: Corporate responsibility in the global village: the British role model and the American laggard. Business and Society Review 108(3), 309–338 (2003)
3. Rouvroy, A.: Réinventer l'art d'oublier et de se faire oublier dans la société de l'information? In: Lacour, S. (ed.) La sécurité de l'individu numérisé. Réflexions prospectives et internationales, pp. 249–278. L'Harmattan, Paris (2008) (augmented version of the chapter published under the same title), http://works.bepress.com/antoinette_rouvroy/5
4. Foucault, M.: Surveiller et punir: Naissance de la prison. Gallimard, Paris (1975)
5. Cameron, K.: The Laws of Identity. Identity Weblog (2005), http://www.identityblog.com/stories/2005/05/13/TheLawsOfIdentity.pdf (retrieved January 2010)
6. Morin, J.-H.: Exception Based Enterprise Rights Management: Towards a Paradigm Shift in Information Security and Policy Management. International Journal On Advances in Systems and Measurements 1(1), 40–49 (2008)

Porting Social Media Contributions with SIOC

Uldis Bojars, John G. Breslin, and Stefan Decker
DERI, National University of Ireland, Galway, Ireland
firstname.lastname@deri.org

Abstract. Social media sites, including social networking sites, have captured the attention of millions of users as well as billions of dollars in investment and acquisition. To better enable a user's access to multiple sites, portability between social media sites is required in terms of both (1) the personal profiles and friend networks and (2) a user's content objects expressed on each site. This requires representation mechanisms to interconnect both people and objects on the Web in an interoperable, extensible way.
The Semantic Web provides the required representation mechanisms for portability between social media sites: it links people and objects to record and represent the heterogeneous ties that bind each to the other. The FOAF (Friend-of-a-Friend) initiative provides a solution to the first requirement, and this paper discusses how the SIOC (Semantically-Interlinked Online Communities) project can address the latter. By using agreed-upon Semantic Web formats like FOAF and SIOC to describe people, content objects, and the connections that bind them together, social media sites can interoperate and provide portable data by appealing to some common semantics. In this paper, we will discuss the application of Semantic Web technology to enhance current social media sites with semantics and to address issues with portability between social media sites. It has been shown that social media sites can serve as rich data sources for SIOC-based applications such as the SIOC Browser; in the other direction, we will now show how SIOC data can be used to represent and port the diverse social media contributions (SMCs) made by users on heterogeneous sites.

1 Introduction

"Social network portability" is the term used to describe the ability to reuse one's own profile across various social networking sites. The founder of the LiveJournal blogging community, Brad Fitzpatrick, wrote an article from a developer's point of view about forming a "decentralized social graph", which discusses some ideas for social network portability and for aggregating one's friends across sites. However, it is not just friends that may need to be ported across social networking sites (and across social media sites in general), but content items as well. Soon afterwards, "A Bill of Rights for Users of the Social Web" was authored by Smarr et al. for "social web" sites that wish to guarantee users ownership and control over their own personal information.
As part of this bill, the authors asserted that participating sites should provide social network portability, but that they should also guarantee users "ownership of their own personal information, including the activity stream of content they create", and also stated that "sites supporting these rights shall allow their users to syndicate their own stream of activity outside the site". OpenSocial from Google is another related effort that has gained a lot of attention recently. While, at the time of writing, OpenSocial has been mainly focused on application portability across various social networking sites, the following statement mentions future reuse of data across participating sites: "an OpenSocial app added to your website automatically uses your site's data. However, it is possible to use data from another social network as well, should you prefer."

To enable a person's transition and/or migration across social media sites, there are significant challenges associated with achieving such portability, both in terms of the person-to-person networks and the content objects expressed on each site. As well as requiring APIs to access this data (such as SPARQL endpoints or AtomPub interfaces), representation mechanisms are needed to represent and interconnect people and objects on the Web in an interoperable, extensible way. The Semantic Web [1] provides such representation mechanisms: it links people and objects to record and represent the heterogeneous ties that bind us to each other. By using agreed-upon Semantic Web formats to describe people, content objects, and the connections that bind them together, social media sites can interoperate by appealing to common semantics.

Footnotes:
1. http://bradfitz.com/social-graph-problem/
2. http://opensocialweb.org/2007/09/05/bill-of-rights/
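The two portability requirements map naturally onto the two vocabulary families: FOAF covers people and their connections, SIOC covers content objects and the accounts that created them. A toy triple set (all `ex:` URIs invented for illustration) shows how the statements interlink:

```python
# Illustrative RDF-style triples: a SIOC post links to a user account,
# which links to a FOAF person and their social network.
triples = [
    ("ex:post1", "rdf:type", "sioc:Post"),
    ("ex:post1", "sioc:has_creator", "ex:alice_account"),
    ("ex:alice_account", "rdf:type", "sioc:UserAccount"),
    ("ex:alice_account", "sioc:account_of", "ex:alice"),
    ("ex:alice", "rdf:type", "foaf:Person"),
    ("ex:alice", "foaf:knows", "ex:bob"),
]

def objects(subject, predicate, graph=triples):
    """All objects for a given subject/predicate pair."""
    return [o for s, p, o in graph if s == subject and p == predicate]

# Walk from a content object to the social network of its creator.
account = objects("ex:post1", "sioc:has_creator")[0]
person = objects(account, "sioc:account_of")[0]
```

Because both vocabularies share the same triple model, a site exporting either one can be merged with data from any other site exporting the other.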
Developers are already using Semantic Web technologies to augment the ways in which they create, reuse, and link content on social media sites. Some social networking sites, such as Facebook, are also starting to provide query interfaces to their data, which others can then reuse and link to via the Semantic Web. The Semantic Web is a useful platform for linking, and for performing operations on, diverse person- and object-related data gathered from heterogeneous social media sites. In the other direction, social media sites can serve as rich data sources for Semantic Web applications. As Tim Berners-Lee said in a 2005 podcast, Semantic Web technologies can support online communities even as "online communities ... support Semantic Web data by being the sources of people voluntarily connecting things together". Such semantically-linked data can provide an enhanced view of individual or community activity across social media sites (for example, "show me all the content that Alice has acted on in the past three months"). Social media sites should be able to collect a person's relevant content items and objects of interest and provide some limited data portability (at the very least, for their most highly used or rated items). We will refer to these items as one's social media contributions, or SMCs. Through such portability, the interactions and actions of a person with other users and objects (on systems they are already using) can be used to create new person or content associations when they register for a new social media site. In [2], it was shown that social media sites can serve as rich data sources for SIOC-based applications such as the SIOC Browser.
In the other direction, we will demonstrate in this paper how SIOC data can be used to represent and port the diverse SMCs made by users on heterogeneous sites.

Footnotes:
3. http://code.google.com/apis/opensocial/container.html
4. http://www.sciam.com/article.cfm?articleID=00048144-10D2-1C70-84A9809EC588EF21
5. http://www.openlinksw.com/blog/~kidehen/?id=1237
6. http://www.dcs.shef.ac.uk/~mrowe/foafgenerator.html
7. http://esw.w3.org/topic/IswcPodcast

2 Getting Content Items Using SIOC

The SIOC initiative [3] was initially established to describe and link discussion posts taking place on online community forums such as blogs, message boards, and mailing lists. As discussions begin to move beyond simple text-based conversations to include audio and video content, SIOC has evolved to describe not only conventional discussion platforms but also new Web-based communication and content-sharing mechanisms. In combination with the FOAF vocabulary for describing people and their friends, and the Simple Knowledge Organization Systems (SKOS) model for organising knowledge, SIOC lets developers link posted content items to other related items, to people (via their associated user accounts), and to topics (using specific "tags" or hierarchical categories).

Fig. 1. Porting social media contributions from data providers to import services

Various tools, exporters, and services have been created to expose SIOC data from existing online communities. These include APIs for PHP, Java, and Ruby; data exporters for systems like WordPress, Drupal, and phpBB; data producers for RFC 4155 mailboxes; SIOC converters for Web 2.0 services like Twitter and Jaiku; and commercial products like Talis Engage and OpenLink Virtuoso. A full set of applications that create SIOC data is available online.
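An exporter's output for a single post can be quite small. The snippet below hand-assembles an illustrative Turtle description of a `sioc:Post` (the URIs and values are invented; a real exporter, such as the WordPress one, would emit much richer data, including comments and the container blog):

```python
def sioc_post_turtle(post_uri, title, author_uri, created, content):
    """Build a minimal, illustrative SIOC description of one post in Turtle."""
    def esc(s):
        # Escape backslashes and quotes for Turtle string literals.
        return s.replace('\\', '\\\\').replace('"', '\\"')
    return (
        "@prefix sioc: <http://rdfs.org/sioc/ns#> .\n"
        "@prefix dcterms: <http://purl.org/dc/terms/> .\n\n"
        f"<{post_uri}> a sioc:Post ;\n"
        f'    dcterms:title "{esc(title)}" ;\n'
        f"    sioc:has_creator <{author_uri}> ;\n"
        f'    dcterms:created "{created}" ;\n'
        f'    sioc:content "{esc(content)}" .\n'
    )

doc = sioc_post_turtle(
    "http://blog.example.org/post/1",
    "Hello SIOC",
    "http://blog.example.org/user/alice",
    "2009-09-15T10:00:00Z",
    "First post.",
)
```

Any RDF-aware consumer can then aggregate such descriptions from many sites without per-site parsing code.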
All of these data sources provide accurate structured descriptions of social media contributions (SMCs) that can be aggregated from different sites (e.g., by person via their user accounts, by co-occurring topics, etc.). Figure 1 shows the process of porting SIOC data from various sources to SIOC import mechanisms for WordPress and future applications. We will now describe the SIOC import plugin for WordPress.

3 Importing SIOC Data, with a WordPress Example

The SIOC import plugin for WordPress is an initial demonstrator of social media portability using SIOC. This plugin creates a screen (see Figure 2) in the WordPress administration user interface which allows one to import user-created content in the form of SIOC data.

Fig. 2. SIOC RDF import into WordPress

Data to be imported can be created from a number of different social media sites using SIOC export tools (as described above). For example, a SIOC exporter plugin for a blog engine would create a SIOC RDF representation of every blog post and comment, including information about:

- The content of a post
- The author
- The creation/update date
- Tags and categories
- All comments on the post
- Information about the container blog

Footnotes:
8. http://rdfs.org/sioc/applications/#creating
9. http://wiki.sioc-project.org/w/SIOC_Import_Plugin

The data representation used (RDF) enables us to easily extend this data model with new properties when they become necessary. The import process implemented by the WordPress SIOC import plugin is the following:

- Parse the RDF data (using a generic RDF parser called ARC).
- Find all posts - sioc:Post(s) - which exhibit all of the properties required by the target site.
- For each post found, create a new post using WordPress API calls.

The pilot implementation currently works with a single SIOC file and imports all the posts contained within it. Figure 3 shows an example post imported into WordPress.
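The three import steps can be sketched as follows. This stand-in uses Python's `xml.etree` on a tiny RDF/XML fragment instead of PHP's ARC parser, and a plain callback instead of the WordPress API, so every name outside the SIOC/DC namespaces is illustrative:

```python
import xml.etree.ElementTree as ET

RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
SIOC = "{http://rdfs.org/sioc/ns#}"
DC = "{http://purl.org/dc/elements/1.1/}"

SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                     xmlns:sioc="http://rdfs.org/sioc/ns#"
                     xmlns:dc="http://purl.org/dc/elements/1.1/">
  <sioc:Post rdf:about="http://blog.example.org/post/1">
    <dc:title>Hello SIOC</dc:title>
    <sioc:content>First post.</sioc:content>
  </sioc:Post>
</rdf:RDF>"""

def import_sioc(rdf_xml, create_post):
    """Parse RDF/XML, find each sioc:Post carrying the required
    properties, and hand it to the target site's creation callback."""
    imported = 0
    for post in ET.fromstring(rdf_xml).iter(SIOC + "Post"):
        title = post.findtext(DC + "title")
        content = post.findtext(SIOC + "content")
        if title is None or content is None:
            continue  # missing a property required by the target site
        create_post({"uri": post.get(RDF + "about"),
                     "title": title, "content": content})
        imported += 1
    return imported

created = []
count = import_sioc(SAMPLE, created.append)
```

In the real plugin, `create_post` corresponds to WordPress API calls such as post insertion; the filtering step is where per-site requirements are enforced.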
Fig. 3. Imported post in WordPress

Since SIOC is a universal data format, and not specific to any particular site, this pilot implementation already allows us to move content between different blog engines, or even between different kinds of social media sites. However, the import of a single file shown here is mainly useful for demonstration purposes. We will now describe how a SIOC import tool can be extended to port all user-created content from one social media site to another. By starting from a site's main SIOC profile, we can retrieve machine-readable information about all the content of the site, starting with the forums hosted therein and then retrieving the contained posts, comments, and associated users. This extended SIOC import tool needs to retrieve all SIOC data pages (possibly limited by some user-defined filters) and to recreate all the data found in these SIOC pages on the target social media site. This will result in a replica of the original site, including links between objects (e.g., between posts and their comments). Often, part of the content that a user wants to port is not publicly available. SIOC exporters can also be used in this case, but the user will first need to authenticate at the source site and ensure that they have enough privileges to access all the data that need to be migrated.

Another step in social media portability is keeping two sites synchronised (if required): having the same set of users, posts, comments, category hierarchies, etc. In principle, this can be achieved by importing a full SIOC dataset and then monitoring SIOC data feeds for new items (some SIOC export tools may need to be extended to do this). Implementing this in practice will undoubtedly surface some interesting challenges. Another example of using a complete site import would be for platform migration.
For example, this could occur if a person has been using a mailing list for a particular community, and they then decide that the extended functionality offered by a Web-based message board platform is required. It is not just discussion-type content items that can be ported: using the SIOC Types module, various content types can be collected in sioc:Container(s) and ported in the same way (Sounds, MovingImages, Events, Bookmarks, etc.).

4 The Role of SKOS and FOAF

SIOC allows us to describe most user-created content, but it can also be combined with other RDF vocabularies such as Dublin Core (DC), Friend-of-a-Friend (FOAF), and Simple Knowledge Organisation Systems (SKOS). These vocabularies can be used when there is a need to migrate additional data specific to a particular site. DC provides a basic set of properties and types for annotating documents and resources. DC's "Type" vocabulary also defines various document types, such as MovingImages, Sound, etc., that can be used to describe media elements from social media sites. SKOS is designed for describing taxonomies such as category hierarchies; by exposing categories in SKOS, we ensure the portability of this information to other social media sites. Finally, FOAF is designed for describing information about people and their social relations. This vocabulary is already used together with SIOC to describe information about users, and additional properties from FOAF (e.g., foaf:knows) can be used to describe users' social networks. This can be useful when porting data from a social networking site.

5 Conclusions and Future Work

In this paper, we have shown how SIOC data can be used to represent and port the diverse social media contributions being made by users on various sites.
We began by describing the need and requirements for such portability, then talked about sources of data including various SIOC data producers, and next described how such SIOC data can be imported into a system such as WordPress. We finally talked about how this data can be augmented using other vocabularies such as FOAF. For future work, we mentioned the issue of who should be allowed to reuse certain data on other sites (as spam blogs often duplicate other people's content without authorisation for SEO purposes). As well as collecting a person's relevant content objects, social media sites may need to verify that a person is allowed to reuse data/metadata from these objects in external systems. This could be achieved by using SIOC as a representation format, aggregating a person's created items (through their user accounts) from various site containers, and combining this with some authentication mechanisms to verify that these items can be reused by the authenticated individual on whatever new sites they choose.

Footnote:
10. http://rdfs.org/sioc/types

References

1. Berners-Lee, T., Hendler, J., Lassila, O.: The Semantic Web. Scientific American 284(5), 35–43 (2001)
2. Bojars, U., Breslin, J.G., Passant, A.: SIOC Browser - Towards a Richer Blog Browsing Experience. In: Blogtalk Reloaded: Social Software - Research and Cases, Vienna, Austria (October 2006), ISBN 3833496142
3. Breslin, J.G., Harth, A., Bojars, U., Decker, S.: Towards Semantically-Interlinked Online Communities. In: Gómez-Pérez, A., Euzenat, J. (eds.) ESWC 2005. LNCS, vol. 3532, pp. 500–514. Springer, Heidelberg (2005)

Reed's Law and How Multiple Identities Make the Long Tail Just That Little Bit Longer

David Cushman
Faster Future Consulting, United Kingdom
davidpcushman@gmail.com

Abstract. Reed's Law, or "Group Forming Network Theory" (as Dr. David P. Reed originally and modestly called it), is the mathematical explanation for the power of the network.
As with many great ideas, it is quite simple, easy to understand, and enlightening. This paper sets out to explain what Reed's Law describes, and it includes more recent understandings of the collaborative power of networks which should help make sense of, and give context to, the exponential growth it describes. It also suggests that the multiple complex identities we are adopting in multiple communities are not necessarily a "bad thing". The contention of this paper is that the different modes of thought these actively encourage are to be welcomed when viewed in the context of unleashing the power of self-forming collaborative communities of interest and purpose.

1 Introduction

In the beginning, we had Sarnoff's Law: a mathematical description from a broadcast, mass-media age. It was first applied to cinema screens, and latterly to TV. Sarnoff's Law states that the value of a network grows in proportion to the number of viewers. It is basically a straight line: the more viewers, the more value the network has. Most audience measurement techniques have simply followed this rule ever since. Some (such as unique users/visitor counts) have, inappropriately, continued to apply it to websites and social networks. However, this is a serious underestimation once you move out of broadcast models. Metcalfe's Law offers a better fit, and a better way of measuring the relentless growth of the power of the Internet. This law states that the value of the network grows in proportion to the square of the number of nodes on the network. For example, one fax machine on its own is useless, since it has nothing to connect to; two have more utility (2 squared = 4), and each machine added to the network increases the value of all nodes in the network (3 squared = 9, etc.). If your website is getting 10,000 unique users a month more than a rival, the gap between you and them in terms of potential value created is of the order of 10,000 squared (100 million!).
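The three growth laws discussed in this paper diverge dramatically even at modest network sizes, as a quick computation shows (Reed's refined subgroup count, introduced later in the paper, is used for the third):

```python
def sarnoff(n):
    return n               # value ~ number of viewers

def metcalfe(n):
    return n ** 2          # value ~ square of the number of nodes

def reed(n):
    return 2 ** n - n - 1  # possible subgroups (Reed's refined form)

# At n = 20 the ordering is already dramatic.
values = {"sarnoff": sarnoff(20), "metcalfe": metcalfe(20), "reed": reed(20)}
```

For twenty nodes, Sarnoff gives 20, Metcalfe 400, and Reed over a million possible subgroups.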
Having 10,000 more nodes on a network - even if each is just linking to one other - is much more valuable than having 10,000 passive viewers of a broadcast (assuming you have gone to the trouble of doing more digitally than simply replicating the broadcast model).

Footnote:
1. http://en.wikipedia.org/wiki/Sarnoff's_law

Metcalfe's N-squared value explains why the growth of networks used for one-to-one communication (e.g., phone services, e-mail, instant messaging) follows the pattern it does. Simply put, new users will almost always join the larger network, because they reason that it offers more value to them. Frankly, it usually does: a tipping point is reached, and the floodgates open. For example, if and when a VoIP provider assumes dominance over our mobile identities (as defined by our personal 'number'), then the operators may be in trouble. Metcalfe's Law charts an impressive and rapid rate of growth of utility. However, even this, according to Reed [2], is a gross underestimation. What Metcalfe's Law fails to take account of is that each of the nodes on the network can choose to form groups of their own, of whatever size or complexity they choose, with near neighbours or with distant, initially unrelated, nodes. They can choose small groups, large groups, be part of multiple groups, uber- and sub-groups, etc.

2 Wild Growth but Counterintuitive: Do the Maths

If we add up all of the two-person, three-person, four-person, etc. groups, the utility is 2 to the power of N (Figure 1). This represents astonishing, wild growth potential. Reed quotes the example of a minister who was offered whatever he chose as a reward from his king. He asked for two copper coins on the first square of a chess board, four on the next, eight on the third, and so on, following the progression of 2 to the power of N.
The King thought he had gotten off lightly, until they reached the 13th square, by which point the number of copper coins had reached 8192. If he had owned a big enough abacus, he would have discovered that by the 64th square he would be handing over somewhere in the region of 18 quintillion copper coins, a number rather higher than the number of grains of sand in the world. However, he did not bother, beheading the smart minister instead (not something we can do with our web rivals).

Fig. 1. Various measures of the growth of networks

There are those who argue that Reed's Law [2] is counterintuitive. [1] states that: "Reed's Law says that every new person on a network doubles its value. Adding 10 people, by this reasoning, increases its value a thousandfold (2^10). But that does not even remotely fit our general expectations of network values - a network with 50,010 people can't possibly be worth a thousand times as much as a network with 50,000 people." They argue that n log(n) offers a more accurate interpretation of the growth of networks, as illustrated in a diagram adapted from their well-reasoned argument [1]. Reed has refined his law in the following mathematical terms: the number of possible sub-groups of network participants is 2^N - N - 1, where N is the number of participants (all 2^N subsets, minus the N single-member "groups" and the empty set). Therefore, what Reed's Law describes is the possible number of sub-groups, and this reveals a potential, not an actual, value. That theoretical value gets closer to being realised if the network is used in a particular way - a way which is becoming more and more the norm as the digital-native generation takes charge. Groups offer their greatest potential value when they work together to do much more than chat.

3 Collaboration, Flow Not Focus

In charting the possible number of sub-groups, Reed's Law reveals the collaborative potential of a network.
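The chessboard figures and Reed's refined formula quoted above are easy to verify:

```python
def possible_subgroups(n):
    # Reed's refined formula: all subsets of n participants, minus the
    # n single-member "groups" and the empty set.
    return 2 ** n - n - 1

# The chessboard story: 2 coins on square 1, doubling on each square.
coins = [2 ** k for k in range(1, 65)]

# [1]'s objection restated: adding 10 participants multiplies 2^N by
# 2^10 = 1024, i.e. roughly a thousandfold.
growth_factor = 2 ** 70 / 2 ** 60
```

The 13th square indeed holds 8192 coins, the 64th about 1.8 x 10^19 (the "18 quintillion" in the story), and ten extra participants scale the subgroup count by a factor of 1024.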
When collaboration happens, new value - often immense value - emerges. What do we mean by collaboration? [7] offers: "Collaboration is more than just 'working together' ... Collaboration implies that multiple people produce something that the individuals involved could not have produced acting on their own ... Technology advances have meant that some level of time-shifting and place-shifting is now possible, reducing the simultaneity inherent in the original scenario." Stowe Boyd [3] suggests that the greatest value unleashed by networks comes when the group is not only one which has self-formed with a collaborative purpose, but one in which people are willing to drop everything to join the flow as and when required to by their connections - i.e., to drop everything and act in real time. Boyd believes that, far from being the pipe dream some may consider it, this is the natural place our involvement in networks leads to. He asks us to think of attention (i.e., the demand on our time) as being more about flow than focus [3]:

- "Don't listen to industrial era or information era (the last stage of industrialism) nonsense about personal productivity. Don't listen to the Man."
- "The network is mostly connections. The connections matter, give it value, not the nodes."
- "Time is a shared space - your time is truly not your own."
- "Productivity is second to Connection: network productivity trumps personal productivity."

This belief in the power of the network - and his willingness to subsume personal focus to it - is based on the simple notion that: "I am made greater by the sum of my connections - so are my connections."

4 Don't Just Network for Networking's Sake

When a network is used for simple one-to-one communication, there is little potential for collaboration (other than between one node and another). There is no potential for unleashing the wisdom of crowds [8], for tapping into the notion that none of us is as clever as all of us.
Ross Mayfield, CEO of Socialtext, offered his own equation for the value created by collaboration [5]. In his "Ecosystem of Networks" diagram (Figure 2, Table 1), he argues that the growth and potential value of a network is defined not only by its freedom to self-form into groups, but crucially by what those groups do.

Fig. 2. "Ecosystem of Networks" by Ross Mayfield

Table 1. "Ecosystem of Networks" by Ross Mayfield

Network Layer      Unit Size  Distribution of Links  Social Capital        Weblog Mode
Political Network  1000s      Power Law/Scale-Free   Sarnoff's Law (N)     Publishing
Social Network     150        Random/Bell Curve      Metcalfe's Law (N^2)  Communication
Creative Network   12         Even/Flat              Reed's Law (2^N)      Collaboration

For example, if your blog (your place in the network) is simply about publishing information, its value to you and your readers follows Sarnoff's Law. If you use Facebook to communicate with your friends, then the value derived follows Metcalfe's Law. If your node, your place in the network, is used to collaborate with others - to share information, mash up images, create new ideas, services, and products - then the value growth can follow Reed's Law. In other words, Reed's Law reveals the potential value growth in collaborative networks of shared interest and purpose. It is in this that the runaway value Reed's Law describes is found. It is this which answers [1]'s concern: how can a network with ten more people be worth 1000 times more? Imagine a collaborative network of 1000 scientists who have been seeking the cure for cancer. If ten more join, there are now roughly a thousand times as many potential subgroups (a factor of 2^10). If just one of the new juxtapositions/mergers/mashings of ideas results in that cure for cancer, few would argue against the claim that the network had delivered a value 1000 times greater than in its previous state. Of course, not every single potential group will form.
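Mayfield's table can be read as a lookup from how a node is used to which growth law applies. A toy rendering (the mode names follow the table; the value functions are the laws themselves, not Mayfield's exact figures):

```python
# Toy rendering of Mayfield's "Ecosystem of Networks": the way a node
# is used determines which law governs the value it can create.
LAWS = {
    "publishing":    lambda n: n,        # Sarnoff: political network
    "communication": lambda n: n ** 2,   # Metcalfe: social network
    "collaboration": lambda n: 2 ** n,   # Reed: creative network
}

def network_value(mode, n):
    return LAWS[mode](n)
```

Even at the creative network's small unit size of 12, collaboration already yields 2^12 = 4096, an order of magnitude beyond communication among the same dozen people.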
Not every new idea will deliver cancer-curing value. This much we know. It is why simply enabling collaborative networks will never deliver a 100% replica of the Reed's Law value curve: for each potential group that does not form, a big chunk of subsequent value is lost. What we find harder to know, or even imagine, is the value emerging from those collaborative groups which do form.

5 Self-forming Communities and Multi-fac(et)ed Identities

It seems to me that the pursuit of that emerging value is the best indicator of where our development and investment efforts should be focused if we wish to create sustainable value. Networks which allow self-forming collaborative communities of shared interest and purpose will create value. This is the reason networks of collaboration have real power, create value you could not predict, and are fast becoming the model for a new way of socio-economic life. As networks gain increasing influence on the macroeconomics of life, they are also having a greater and greater influence on our micro-life - on the creation of our own individual identities. Our identities become increasingly complex: the more I collaborate in self-forming groups, the more complex 'I' become. It is a question of psychological self-determinism: "who do I think I am?" Our identity - whether it carries the label of a name or a number - is a work in progress. Ever-shifting, responding to the communities it is part of, your identity is as much (perhaps more) created by those around you as by yourself. The desire and need for psychological self-determinism is working as a powerful adjunct to the growing influence of global, digital networks. When communities were fixed in location, your identity was created by your relationships within that fixed community; your identity was equally fixed. In terms of Group Forming Network Theory, you belonged to just one group, and it was pretty much fixed in size.
In a socially-networked world, the creation of your identity becomes a process contributed to by more people, more often, and from far more varied backgrounds. The community you exist in shapes your identity from its perspective and from your own. Your identity varies from community to community. If once you were the blacksmith's son and village blacksmith-in-waiting, now you have a huge variety of identities, depending on the community you are interacting with at any one time. Our identities become increasingly multi-faceted. For example, on my blog, my identity is relatively serious and thoughtful. On Facebook, it is more playful. I am displaying a different facet of a complex identity. The community I feel I am part of when writing my blog joins in the construction of my serious and thoughtful persona (by its comments and expectations) [6]. The community I feel part of on Facebook also joins in the construction of my persona there - by the way it acts, by its response to what I do, by the tools it offers me. The push and pull of the forces forging my identity in all elements of my life are communal. I interact with a community, therefore I am. Each community creates a different facet of that identity, and in doing so makes a contribution to subtly reshaping the core. As a simple example, becoming a parent changes your personality. You have a new role to play and a new set of relationships - with your child, with your partner (now a parent, too), with other parents, grandparents, etc. Each interaction changes you in small but important ways, and these result in changes at your core. This may be an extreme and emotionally-loaded example, but the co-creation of facets of your personality has more than a superficial impact. This may be why the 'edglings' that Stowe Boyd describes, or the Generation C that [4] describes, have a different set of wants and are not satisfied by the norms of mass production and mass media.
It is through new mobile, fluid, co-creating communities that they find themselves. They find they want to share in, to be part of, and to engage with these communities. They are people for whom collaboration and participation is the norm - for whom Reed's Law is right. Understanding which facets of personalities you seek to engage with, and understanding that you are dealing with personalities created from converged facets: these are real challenges for those marketing and/or creating social media today. Furthermore, the notion that we want just one digital identity is challenged by the emerging value multiple identities offer.

6 An Adjustment to Reed's Law: A Longer Tail

There is an adjustment required for Reed's Law. If each of us is a node on the network, each time one new node is added the value of the network (assuming the caveats described above) doubles. However, if each node has multiple identities, then the potential value must be multiplied by x, where x is the number of identities per node. This can only add to the value in an "uber" network - one in which multiple identities apply, with the Internet itself being the biggest of them all. What is the value of x? The number of social networks regularly used by the average social networker is greater than one. For example, I am a user of five, but a regular user of three. Plenty of people will use just one. Those who use none are not part of any network, and are therefore not part of the Reed's Law value-curve creation of the Internet. So x must be a factor somewhat larger than 1; some estimate the average at 2.5. This goes some way towards restoring the value lost to the Reed's Law calculation when potential groups do not form, and it suggests that the actual 'real' curve may lie somewhere between Metcalfe's and Reed's curves.
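The effect of the identity multiplier can be sketched numerically. The snippet below is my own illustration, not from the paper: it compares the unadjusted Reed curve with one whose exponent carries an identity factor x, using the 2.5 estimate mentioned above:

```python
# Sketch of the identity-adjusted Reed curve: unadjusted Reed's Law
# counts 2^N - N - 1 potential subgroups; with an average of x
# identities per node, the exponent becomes N*x.

def reed(n: int) -> int:
    return 2**n - n - 1

def reed_adjusted(n: int, x: float) -> float:
    return 2 ** (n * x) - n - 1

# With the estimated average of x = 2.5, even small networks show a
# far steeper potential-value curve than the unadjusted one.
for n in (5, 10, 20):
    print(n, reed(n), round(reed_adjusted(n, 2.5)))
```

Setting x = 1 recovers the unadjusted curve, which is why any average above one identity per person lengthens the tail.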
Of course, if the potential of every group were fulfilled and we apply the multiple-identities factor, then the result must be an even steeper curve than Reed predicted. The number of possible subgroups of network participants becomes 2^(N·x) − N − 1, where N is the number of participants and x is the average number of identities per participating node. The theoretical growth of value in participatory and collaborative networks of multiple-identity nodes is thus greater even than Reed predicts. It seems reasonable to challenge this identity complication: why should N not simply be a value which encompasses the total N·x? The distinction is worth making because the plain N value does not reflect the diversity of thought that the multiple identities of one individual node (person) can offer. The thoughts of my Facebook identity may differ from those of my Blogger.com one, because my modes of thought, my openness, my willingness to think differently do vary across contexts and communities. Environment and community count. For example, Twitter thoughts may kick the whole long-winded reasoned argument out of the window, resulting in a different kind of problem-solving thinking, one that short-cuts the logical and makes leaps using instinct.

Fig. 3. The long tail in Reed's Law

Given the notion that the converged identities that make up the network are collaborating to create, it is reasonable to suggest that the supply of what they create should match the demand for what is created. If the network works unfettered, it should make only that for which there is a demand. To discover the demand curve, all we need do is tip Reed's Law on its side (Figure 3).

7 Conclusions

In this paper, we presented an overview of Reed's Law and various related theories for describing the collaborative power of a network.
We examined these laws and suggested that the multiple complex identities we are adopting in various online communities are not necessarily a negative development. We contend that the different modes of thought these actively encourage are to be welcomed when viewed in the context of unleashing the power of self-forming collaborative communities of interest. Adding our identity multiple to Reed's equation makes the long tail just that little bit longer still: the more identities, the longer the tail.

References

1. Briscoe, B., Odlyzko, A., Tilly, B.: Metcalfe's Law is Wrong (2006), http://www.spectrum.ieee.org/print/4109
2. Reed, D.P.: That Sneaky Exponential: Beyond Metcalfe's Law to the Power of Community Building. Context Magazine (1999), http://www.reed.com/Papers/GFN/reedslaw.html
3. Boyd, S.: Overload Shmoverload. Message (2007), http://www.stoweboyd.com/message/2007/03/overload_shmove.html
4. Moore, A., Ahonen, T.: Communities Dominate Brands. Futuretext (2005), http://communities-dominate.blogs.com/
5. Mayfield, R.: Ecosystem of Networks (2003), http://radio.weblogs.com/0114726/2003/04/09.html
6. Cushman, D.: I Am Part of a Community, Therefore I Am. Faster Future (2007), http://fasterfuture.blogspot.com/2007/08/i-am-part-of-community-therefore-i-am.html
7. Rangaswami, J.P.: Maybe It's Because I'm a Calcuttan. Confused of Calcutta (2007), http://confusedofcalcutta.com/2007/08/31/maybe-its-because-im-a-calcuttan/
8. Surowiecki, J.: The Wisdom of Crowds. Random House (2005), http://www.randomhouse.com/features/wisdomofcrowds/excerpt.html
9. Anderson, C.: The Long Tail (2006), http://www.thelongtail.com/

Memoz – Spatial Weblogging

Jon Hoem
Bergen University College
jon.hoem@hib.no

Abstract. The article argues that spatial web publishing influences weblogging, and calls for a revision of the current weblog definition. The weblog genre should be able to incorporate spatial representation, not only the sequential ordering of articles.
The article shows examples of different spatial forms, including material produced in Memoz (MEMory OrganiZer).

Keywords: Memoz, spatial web publishing, spatial montage, spatial weblogging.

1 Introduction

What I call spatial weblogging can be seen as a natural response to a more general development of media at large: from media biased towards time as the most significant factor, towards media that emphasize space. The following discussion of the cultural implications of personal media, and of weblogs and spatial publishing systems in particular, should be understood in relation to two major movements concerning the differences between editorial and conferring media, and between evanescent and positioned media [1]. The first concerns the movement of power, from a situation where central units were in control towards a situation where large parts of the production, distribution, and use of media content happen through collective processes. The second concerns a shift in media concerning time, from a situation where the time between an event and the public mediation of this event was considered important towards an increasing importance of space. This influences production, distribution and use. Editorial media follow a tradition where a relatively small number of people select, produce and redact media content before it is distributed to a public audience in which every individual user is addressed in the same manner. A distinctive mark of editorial media is that production and publishing are controlled by formal procedures before the content is made available to the public. Editorial media are contrasted by conferring media, where there are no formalized procedures for controlling the content before it is published. Those who edit and produce the content are individuals, not part of an organization. Conferring media are also characterized by the users' active participation.
What I choose to call evanescent media are characterized by a close relationship between events, the production of content and the moment of publishing. Evanescent media are both contrasted and complemented by positioned media, that is, media where space becomes more important than time. Examples are digital media where both the production and consumption of mediated content are made dependent on where the user is situated in space. This shift is introduced by mobile devices that enable their users to produce, store and distribute media content combined with specific information about the devices' position. Some personal publishing solutions already include positioning (e.g. by introducing geotagging in weblogs); other solutions let the users make explicit connections between maps and information (like Google's My Maps).

2 Weblog Definition

In a situation where conferring and positioned media become more widely used, it is about time to discuss whether a weblog definition emphasizing chronology as a key feature should be revised. Weblogs are already one of the most important genres of conferring media, but the full impact of mobility and positioning is yet to be seen. Spatial relationships between mediated objects and the events they represent will form sub-genres that we need to include when discussing the features of weblogs. When Jörn Barger coined the term weblog, more than ten years ago, he presented the following description: "A weblogger (sometimes called a blogger or a pre-surfer) logs all the other webpages she finds interesting. /../ The format is normally to add the newest entry at the top of the page, so that repeat visitors can catch up by simply reading down the page until they reach a link they saw on their last visit" [2].
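Barger's format can be expressed as a simple sort over a publication date, which is exactly what spatial publishing discards. The following sketch is my own illustration (the post data and positions are invented), contrasting the two orderings:

```python
# Illustration: a weblog orders posts by time, reverse-chronologically;
# a spatial weblog instead gives each post a position on a surface,
# leaving the reading order to the viewer.
from datetime import date

posts = [
    {"title": "First post",  "published": date(2008, 3, 1), "pos": (40, 300)},
    {"title": "Second post", "published": date(2008, 3, 3), "pos": (220, 80)},
    {"title": "Third post",  "published": date(2008, 3, 4), "pos": (90, 150)},
]

# Classic weblog view: newest entry at the top of the page.
chronological = sorted(posts, key=lambda p: p["published"], reverse=True)
print([p["title"] for p in chronological])  # Third, Second, First

# Spatial view: the same posts, rendered wherever the author placed them.
for p in posts:
    x, y = p["pos"]
    print(f'{p["title"]} at ({x}, {y})')
```

The point of the contrast: removing the sort key loses nothing of the posts themselves, only one particular presentation of them.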
Barger mentions the reverse chronological order as a common feature, but this is probably of minor importance when trying to explain the success of weblogs. Far more important are usability, the ability to publish with a personal voice, and the hypertextual connections (permalinks, trackbacks, linkbacks, etc.) to, and feedback from, other users. This makes the individual part of collaborative efforts that also facilitate the creation of communities. Expressed differently: if one removed the chronological sequencing of posts, one would lose only the most obvious technical characteristic of weblogs, while all the features that have made personal publishing a success would remain. Weblogs can be seen as online diaries, but arguably the most powerful metaphor is the ship's log. The reference to "log" emphasizes that a weblog is always about things and events that belong to the past at the moment of publishing. The log includes different representations of events, the time when the references were made and, finally, references to the locations where the events took place. The latter has not yet developed into a significant feature of today's weblogs. "Log" also implies some assumptions about frequency: in the classical log of a ship one is likely to expect entries on a regular basis, on the occurrence of important events, or both. In the ship's log the relationship to a specific place is arguably just as important as the reference to time. On the Internet a place can be virtual, defined by a URL, or it can be a reference to a physical position, defined by a Geo-URL.

3 Spatial Montage

Until print technology emerged in the latter half of the 1400s, paintings and decorated architecture were the dominant technologies for mediating stories.
In Europe the invention of printing made it possible to organize information more systematically - represented linearly - and because of relatively cheap reproduction, printed text became dominant for almost 500 years. Visual representations were of course developed along with print, but one can argue that print became the preferred way of distributing knowledge. This is still clearly seen in education, where printed books dominate. Spatial, visual representations are still largely the domain of art and entertainment. There is, however, a lot to be learned from the long traditions of storing, structuring and presenting information visually. Even though we do not remember all information as images, information is often easier to remember when connected to familiar spatial forms. In ancient Greece, a predominantly oral culture, techniques were developed to help a speaker, who had no physical storage media available, remember long passages of linked information. Simonides, who is considered the founder of mnemonics (μνημονικός mnemonikos, "of memory"), developed the rhetorical discipline known as memoria (memory). To remember information and the relevant arguments, Simonides recommended that a speaker associate the individual items to be remembered with specific rooms in a house. During his performance the speaker would take a mental walk through the imagined house and recall the items in each room he visited.

Fig. 1. Early maps conveyed values rather than a representation of what the world really looks like. The Psalter map accompanied a 13th-century copy of the Book of Psalms. Being a mappa mundi (world map) it was not designed for navigation; rather, the spatial representation was intended to illustrate a world view: for example, Jerusalem is located in the map's center, in accordance with the contemporary Christian world view.

Fig. 2. Aby Warburg, "Mnemosyne-Atlas" [4], 1924-1929.
Warburg made compilations of texts and illustrations from a variety of sources. The compilations were also photographed, as spatial representations of specific themes, and the photographs could later be used in other juxtapositions. A visual representation can be used to recall a linear sequence of information elements, but a more significant feature is the spatial connections, which can be read in any order - not restricted by a sequence of pages predefined by an author. Spatial representations enable multi-linearity and give more room for individual associations. Lev Manovich uses the term spatial montage to describe situations where more than one visual object shares a single frame at the same time. The objects can be of different sizes and types, and their relationship forms a meaningful juxtaposition that is perceived by the viewer. Where traditional cinematic montage privileges the temporal relationship between images, computer screen interfaces introduce other, spatial and simultaneous, relationships [3]. Our ability to remember information often depends on whether we are able to construct mental maps. Mental maps are essential when we are learning - a process in which we have to make connections between new information and existing knowledge. Using content in new contexts is an essential quality of the compilation of a digital text. In As We May Think, the article that introduced the idea of hypertextual organization in its modern form, Vannevar Bush [5] described a trail blazer, a person whose profession is to construct trails through large complexes of information for others to follow. Bush's concern was how researchers should be able to keep themselves informed in their fields. Today, the amount of available information forces every information user to adopt similar methods. Making new combinations of existing information objects can be such a method.
In this context, spatial forms of publishing seem to provide flexible options for linking web-based resources with existing knowledge.

4 Spatial Web Publishing

Personal publishing on the Web affects all media, and a number of user-friendly, free services contribute to the wide spread of new genres. These include weblogs, wikis, and various services used to produce websites. All the different solutions have in common that the expressions are constantly changing, that they are often played out within various social networks, and that new aesthetic qualities are developed through various forms of interaction between the users. Among the publishing solutions used by young people we find a number of services that encourage visual and spatial expression. Using these solutions, users can easily create their own websites as complex compositions of text, images and video, complemented with music players, chat boxes, guest books, etc.

Fig. 3. A fragment of a typical page at Piczo.com. The visual appearance clearly illustrates who the users are; most are early teenage girls. The publishing concept used by Piczo is, however, interesting to more than teenagers. When editing the page, all elements can be moved freely.

Even though pictures and videos are widely used, most weblogs still present information in ways that have not changed significantly from how they were presented ten years ago. This is contrasted by the most popular personal publishing tools used by youngsters. These webpages are also frequently revised and updated, and other users are able to respond. When it comes to chronological structure, however, this does not seem to be a significant feature; instead, some popular systems let their users structure information spatially. During previous work with weblogs in education, attention was drawn towards systems like Piczo, publishing systems that were well known and widely used by most youngsters a few years ago.
What made Piczo particularly interesting was the fact that the user interface made it possible to place published objects in any position on the screen, not limited by screen size or by whether a specific position was occupied by another object. Spatial publishing systems (systems where media objects can be placed in a position on the screen chosen by the user) allow their users a lot of freedom to express and present themselves. Where weblogs have analogue predecessors like diaries, journals, personal letters and logs, the spatial publishing systems have strong relationships to poster walls and scrapbooks. When these media are taken into the digital domain, new forms occur, where quite extensive reuse of media objects seems to be an integrated part of the publishing culture. Piczo had several shortcomings, especially as a potential tool for learning, but it initiated the idea of a spatial memory organizer - Memoz. Memoz is a publishing environment that lets users publish spatially on a "screen surface" that is not restricted by the physical screen size.

Fig. 4. An example of spatial publishing with Memoz. In the background the user has integrated a map with geotagged pictures, made with Google My Maps. Note how a satellite photo is placed in a position corresponding to the underlying map. When editing the surface, all objects can be moved, scaled and stacked on top of each other.

The design was inspired by some of the features known from commercial systems, but the specific design was developed with education in mind. Memoz' key features are:

• Spatial publishing where there are no restrictions on the size of the publishing surface.
• Easy combination of different media expressions (text, pictures, video, animations, maps, etc.).
• Features that allow collaboration between several users. A user is able to share editing access to a publishing space with other users, giving Memoz some basic wiki-like features.
• Each object can be addressed by a unique URL, a spatial permalink (SPerL), facilitating links between objects on the publishing surface.
• Commenting on individual media objects.
• Open architecture enabling the compilation of a variety of web resources.

A fully functional prototype of Memoz was made during fall 2007, making it possible to publish videos, pictures, maps (using Google Maps) and texts literally side by side. Memoz was then tested in selected schools as part of a research project in 2008.

5 Spatial Weblogging with Memoz

Working in Memoz is closely related to content resources available on the Internet. This gives the teacher a concrete basis and a tool to teach students how they can work with a variety of sources, how to perform source criticism, and how to refer to sources using hyperlinks. In Memoz the spatial publishing can be directly connected to the use of digital maps. Users can organize information objects spatially and visually, and can collaborate on and present information and media elements in relation to a geographical position. In relation to education, the use of spatial publishing follows a long tradition of 'place-based education', in which teaching relates local resources to the curriculum (the local fauna, culture, history, etc.). Location-based teaching is inspired by the desire to bridge the gap between what happens in the classroom and places and events in the learner's surroundings. This perspective reflects a wish for students to care about their local environment and the people living there, and to become more able to take action when local problems occur. The structuring of the learning experience then becomes a way to create informed citizens who become more able to participate, and more interested in participating. A primary objective was to determine whether the students were able to take advantage of their everyday skills related to Internet use in school situations.
Quite a few students managed to draw on experiences with other systems when working with Memoz - for example, knowledge of HTML that they had learned as part of mastering other publishing solutions. (Memoz was designed by Mediesenteret at Bergen University College. The subsequent research project was funded by ITU, the Norwegian National Network for IT Research and Competence in Education.) Many of these students showed great creativity when they faced problems with the technology. The students who master these techniques seem to find that they have skills that are relevant to problem solving in the school situation.

Fig. 5. A webpage made with Memoz where the students have used a huge image of a tree as background in order to show how different authors of crime literature are related to one another. Only a small part of the overall page is shown on the screen at a given time. The page was used to assist an oral presentation, and the students made spatial hyperlinks making it possible to follow a walkthrough. Memoz automatically scrolls the page to show the objects that are linked to.

Memoz seems to have potential as a supplement to existing tools such as presentation tools, Learning Management Systems, weblogs, etc. There are, however, also examples where the students did not use the potential that the tool offers; these students used the publishing surface in Memoz more as decoration than exploration. An important observation, however, is that there is a wide range of activities related to the processes of selecting the content and form of what is presented which often cannot be found in the final product: the objects are transformed, moved and even deleted during the process of compiling the final page. Increasing use of online sources makes it necessary to deal with new questions about knowledge sources.
Students' work in Memoz has been closely linked to online content resources, giving teachers both a justification for, and a tool to motivate, students' reflection on source types, source criticism, and source references. The students can draw on experiences from their private use of media, but this does not mean that these experiences can be brought into education without critical follow-up from teachers. Some guidance is needed to give the application of students' competence a direction, so that the work processes and the media products produced become valid knowledge resources.

6 Digital Bricolage

Traditionally, media users were directed towards practices and cultural understandings that provided relatively clear guidelines for what was considered a good product. Previously, these socialization processes were linked to social institutions, dominated especially by education and mass media, supplemented with experience from the individual's private sphere. New understandings and practices have been developed within the framework of subcultures. Some subcultures have evolved, been transformed, and become part of the established mass culture. Today, individuals or groups of users can produce media products that are easily distributed through social networks, and in some cases to a large audience. The consequence is a rapid proliferation of aesthetic practices, which challenges the traditional arenas of culture formation. Mediated expressions are transformed and recontextualized quickly, and the production and distribution of media texts involve collective processes [6]. Without content with qualities that users find interesting, no information service will ever become successful. However, when looking more closely at digital texts, one sees that one of their major qualities is their ability to provide context. A successful text uses elements from different sources, and supports social relationships to the producers, most often through links and comments.
The process of producing a digital text often involves some kind of copying of content, such as material previously produced by the user on his own computer, or material found on the Internet. In personal publishing, most new texts are made in response to information already available: as comments, as additions to texts published by others, or as autonomous texts connected to other texts through different kinds of hyperlinking. The connections between texts may be characterised as communities. These communities constitute a public without any demands of formal connections between the participants [7]. The production of new digital expressions almost always involves some kind of copying of content, or selection from predefined functions, whether this is copying from material previously produced by the user on his own computer, or found on the Internet. Copying and reuse of media material is an activity that resembles what Claude Lévi-Strauss calls bricolage. Lévi-Strauss introduced le bricoleur as the antithesis of l'ingénieur, referring to modern society's way of thinking in contrast to how people in traditional societies solve problems. A bricoleur collects objects without any specific purpose, not knowing how or whether they might become useful. These objects become part of a repertory which the bricoleur may use whenever a problem needs to be solved [8].

Fig. 6. Part of a page made entirely of existing objects: pictures, videos and texts. The pupils have not contributed any material of their own, but the overall compilation is nevertheless unique.

The copying and reuse of media material enables young publicists to produce new expressions. These media elements are likely to be taken from different, often commercial, presentations and combined into new personal expressions that are shared online.
These activities resemble those of the bricoleur, understood as a person who takes detours to achieve a result that in the end seems like an improvised solution to problems of both a practical and an existential character. The process of bricolage may be highly innovative, as objects often end up being used in contexts very different from the ones they originated in. Thus those who behave as bricoleurs often perform a complex set of aesthetic and practical considerations when using objects from their repertories in new media expressions.

7 Further Research on Spatial Weblogging

Memoz has not been shipped as a service outside education, and the examples shown are from school situations. They were produced within a limited timeframe, and can hardly be seen as examples that can be directly compared to texts produced with ordinary weblogging tools. It is, however, possible to imagine how new posts could be added to a page that grows over time, placed on the screen surface in relation to previous objects covering similar topics. New objects could even be placed over older ones, resembling the development of a poster wall. In Memoz, any sequential relationship between objects has to be created manually, using spatial hyperlinks. This is, however, functionality that could be developed into a system where the sequential ordering can be followed in ways known from tools like Etherpad and Google Wave. In other words: if spatial weblogging is to be considered a genre, it has yet to be further developed. The spatial expressions, which until now have been viewed on traditional computer screens, can also be mediated by devices that in various ways can be connected to the reader's position. Mobile devices such as small computers, cell phones, ebook readers, etc. can establish a connection between the content and the location where the device (and the user) is at the moment of reading.
This creates opportunities for a new, virtual level of information that comes in addition to what we otherwise experience in the physical environment. In the meeting between these spheres one can see a continuous interchange between physical and virtual expression. Space becomes a meeting place where two types of environments influence each other or being built together. It will be interesting to look more into how spatial web publishing can lead to new aesthetic and social practices in other arenas. One can already see a number of examples of how the spatial publishing are related to physical, place-bound aesthetic expression. These practice fields can meet in what is often described as "shifting" or "enhanced" reality (augmented reality). The relationships between space, location and representation are issues that are the basis for architecture and urban planning, but which also have great relevance for understanding the Internet's texts. In large online hypertexts there is no natural beginning or end. Thus the user must develop an understanding of the text's structure through the active use of text. With the increasing complexity of how information is mediated, in both physical and virtual spaces, it becomes particularly relevant to look more into how the spatial screen-based representations can be connected to physical space in different contexts. This related perfectly to Barger's original definition of webloggging and to Vannevar Bush's thoughts about storing and exploring information, both emphasizing the making of connections to past events and experiences for others to follow. 142 J. Hoem References 1. Hoem, J.: Personal Publishing Environments, Doctoral theses at NTNU, 3 (2009), http://infodesign.no/2009/08/personal-publishingenvironments-all.htm 2. Barger, J.: Weblog resources FAQ (1999), http://www.robotwisdom.com/weblogs/ 3. Manovich, L.: The Archeology of Windows and Spatial Montage (2002), http://www.manovich.net/macrocinema.doc 4. 
Frieling, R.: The Archive, the Media, the Map and the Text (2007), http://www.medienkunstnetz.de/works/mnemosyne/
5. Bush, V.: As We May Think. The Atlantic Monthly (July 1945), http://www.theatlantic.com/doc/194507/bush
6. Hoem, J.: Openness in Communication. First Monday, Special Issue on Openness (2006), http://www.firstmonday.org/issues/issue11_7/hoem/index.html
7. Hoem, J., Schwebs, T.: Personal Publishing and Media Literacy. In: IFIP World Conference on Computers in Education (2005), http://infodesign.no/artikler/personal_%20publishing_media_literacy.pdf
8. Lévi-Strauss, C.: The Savage Mind (La Pensée Sauvage). Oxford Univ. Press, Oxford (1962)

Campus Móvil: Designing a Mobile Web 2.0 Startup for Higher Education Uses

Hugo Pardo Kuklinski1 and Joel Brandt2
1 Digital Interactions Research Group, University of Vic, Catalunya, Spain
2 Human-Computer Interaction Group, Stanford University
hugo.pardo@uvic.cat, jbrandt@cs.stanford.edu

Abstract. In the intersection between the mobile Internet, social software and educational environments, Campus Móvil is a prototype of an online application for mobile devices created for a Spanish university community, providing exclusive and transparent access via an institutional email account. Campus Móvil was proposed and developed to address needs not currently being met in a university community due to a lack of ubiquitous services. It also facilitates network access for numerous specialized activities that complement those normally carried out on campus and in lecture rooms using personal computers.

1 Introduction

The synergy between novel technology and use patterns has enabled the convergence of mobile devices and Web 2.0 applications. This synthesis is a new conceptual space called Mobile Web 2.0 [1], leading to an always-on, empowered web consumer.
Handsets are becoming more powerful and sophisticated in terms of: processing power; new multimedia capabilities; more network bandwidth for Internet communications; access to already-available WiFi access points; more efficient web browsers; larger high-resolution screens; novel hybrid mobile applications; and massive online communities. The adoption of 3G mobile devices by hardware manufacturers and operators has made available an infrastructure that promotes connected physical mobility and a new and attractive market for services. The Mobile Web 2.0 concept [1] can be linked to each of the seven principles outlined by O'Reilly [2] in his article describing Web 2.0.

• The Web as a platform. A mobile device has never had as much computational power or storage capacity as its non-mobile counterpart. The Web as a platform emerges as a strong synergizing agent for mobile devices.
• Database management as a core competence. The alliance of mobile and Web 2.0 allows the integration of data with the ease of quick access from any place and at any moment, supporting data ubiquity.
• End of the software release cycle. For mobiles, this can be an advantage given certain system characteristics of these devices, like reduced memory, minimal graphical user interfaces and the security risk associated with the installation of software by third-party developers.
• Lightweight programming models and the search for simplicity. Reduced interfaces, limited storage systems and graphical austerity, as well as the use of application protocols, will be the base of any implementation of Mobile Web 2.0.
• Software above the level of a single device. Software is now being designed for use on multiple hardware platforms, most commonly personal computers and mobile devices.

J.G. Breslin et al. (Eds.): BlogTalk 2008/2009, LNCS 6045, pp. 143–151, 2010. © Springer-Verlag Berlin Heidelberg 2010
• Both rich user experiences and harnessing collective intelligence are the key to future development, given a current environment in which the mobile data industry is based on content provided by the carriers.

The other attributes of Mobile Web 2.0 are: being able to capture the point of inspiration; global nodes and multi-language access; mashups on mobility; location-based services (as an organic use); and mobile search emphasizing context.

In the intersection between the mobile Internet, social software and educational environments, Campus Móvil is the result of a research project with a strong business focus1. The Campus Móvil startup project is a prototype of an online application that will be used via mobile devices in a Spanish university community, with exclusive and transparent access provided through an institutional email account. The Campus Móvil project covers three main knowledge areas: Web 2.0 applications, mobile devices, and teaching innovation policies in the new European Space for Higher Education. European structural changes in education were suggested through the Bologna Process, which includes a special emphasis on innovation in technology usage as an essential value proposition for new pedagogical strategies. In the European Union, different funding sources have been set out for the implementation of Bologna. There is also growing interest from the IT business world in contributing new and innovative ideas for Internet-based applications in the higher education domain.

The Campus Móvil proposal is not entirely different from other web community strategies. The integration of tools as a sort of "cocktail" of existing products from the mobile business market is its main added value. Many of these products are in their initial phases of development and have not previously been targeted at the academic community. This project aims to offer an attractive basic service where the market is currently empty.
Also, various bonus activities (without any added cost to the users) will be proposed.

2 Understanding the Consumer and the Market

It is necessary to take into account the evolution of the "always-on empowered Web consumer", which is now being examined by companies and marketing strategists everywhere. This kind of user has driven the Internet industry in the last few years, especially in the new mobile Internet market and in the almost deserted landscape of mobile Web 2.0 services. In this sense, through Campus Móvil, a consumption and production space is proposed between institutions and students, at both the interaction and pedagogical levels. The exigencies of an academic environment must be taken into account, along with the increasing necessities of connectivity and ubiquity and the need to creatively adapt to these new converging technologies.

Mobile Internet is a market with huge difficulties. These include high costs for connectivity, the slow speeds encountered when web surfing due to broadband limitations, short battery life in mobile devices, and the lack of mobile Internet consumers beyond the iPhone, along with other factors. Nevertheless, expert predictions [1], [5–8] indicate that mobile Internet will be one of the technology and consumer markets with the largest growth during the next few years, especially as mobiles begin to support collaborative applications.

1 This paper was part of the research project "Mobile Web 2.0: A Theoretical-Technical Framework and Developing Trends" carried out by Hugo Pardo Kuklinski (University of Vic), Joel Brandt (Stanford University) and Juan Pablo Puerta (Craigslist.org Inc.).
Given the specific characteristics associated with mobile device usage, some things must be taken into account when promoting a new product such as Campus Móvil: users on the move want to be entertained with short, direct and friendly content items; the offerings should be centered on value-added services related to specific instances (e.g. time and space) based on ubiquity, capturing contexts from the current point in time and using location-based services. These mobile uses can also be synchronized with regular web applications. The key values of the Campus Móvil project are:

• To keep the community informed (showing only the latest news in one's university);
• To enable reciprocity (by providing a useful platform to users within the mobile application, users will respond and consume / create content in return);
• To provide social validation (we advocate the creation of a powerful university-oriented community without external users "polluting" this interaction model);
• To promote a desire to like and use the platform (if a useful service is offered, then users will cooperate and aid the service's growth, leading to new services and content items); and
• To leverage institutional authorities (universities will be integrated into the project in an institutional way, conferring higher prestige on the product).

3 Characteristics and Value Proposition

Campus Móvil2 will be designed for the Spanish university market, with a second expansion phase to other Spanish-speaking countries planned for the third year of the product (2010). The opportunities in focusing on Spanish university users are significant. Spain has a high mobile phone density: with some 48 million mobile lines, there is a penetration level of 107.46 mobile phones per 100 people, and young people are the biggest market. The Spanish university system also stimulates and finances the use of innovative tools in the academic community.
Even though wireless networks have not been developed on a large scale in the Spanish market, universities have their own networks, and the government offers free WiFi coverage on all public university campuses. Some issues that can be addressed by this strategy include: the over-extended use of public transportation; the lack of public-access computers on campus; a scarce density of laptops per student; and a limited number of Web 2.0 applications designed for the Spanish-speaking market (Mobile Web 2.0 applications do not exist in this market).

Campus Móvil will cover three unmet necessities: 1) capturing the point of inspiration in the academic environment; 2) generating snippets which can then be retrieved and reused in other computing environments; 3) taking advantage of dead time without computing availability and network access (public transportation, hours between lecture classes, libraries, public spaces outside campus) to keep connected and interact with the university community, via services providing access to today's news and events or knowledge-management functionality. Campus Móvil will allow users to interact with their university on mobile devices more quickly and in an improved manner.

2 www.campusmovil.net and www.campusmovil.es

4 The Main Concepts around Campus Móvil

Interaction on mobile devices happens in a different context, where the physical environment plays a role in the interface and where much depends on what the user currently has as their primary activity. There must be a balance between functionality and complexity, as discussed in [4]. These are the main concepts that Campus Móvil provides.

Partnership with Spanish public universities. By invitation, we will offer authorities, professors and administrative staff a free implementation of the administrative and academic services available via Campus Móvil, based on the students' necessities explained earlier.
The contents and services proposed for this concept will provide an integration of online campus services adapted to the mobile environment: current campus news in brief (i.e. 15-20 words); absences of professors; examination information; an on-campus and off-campus events agenda; location-based services; marks; brief responses to students' demands, such as FAQs; multimedia services about academic activities; freshmen services; general alerts; and security alerts.

Exclusivity, transparent identity and real profiles. Free network access will be enabled by means of a university email account. We will organize the Campus Móvil community into groups: universities, knowledge areas and faculties. If the user is not a member of any of these groups, access will be restricted to only a few levels of general information (not including personal profiles of students or academic data), unless the user has been authorized by another person to be included in his or her user network. Tags for related sub-communities will also be shown. The contents and services proposed for this concept will allow members of the Campus Móvil community to consume, share and upload four kinds of data: text files, pictures, audio and video (all with headlines, tags and a brief description), from their private page or from public spaces.

Voice equal to value. We will promote the production and consumption of short-duration podcasts and videocasts, especially in academic interactions from professor-to-student and student-to-student. 3G technology is not suitable for transmitting high-quality multimedia content, but it can be used to promote a podcasting service like iTunes U. We help the universities to create a similar tool and to facilitate content production, by means of a common platform and an easy-to-use development pattern.

Producing short texts when mobile.
This concept will promote the production and reading of short texts while mobile, for example taking notes, creating diary entries, and microblogging (à la Twitter or Jaiku). Each Campus Móvil member has a personal page with an interface for entering 20-word answers to questions like "What do I need for tomorrow's lecture?" or "What do I plan to do today after school?"

Production and retrieval of snippets. Mobile devices offer a platform for producing snippets that capture the point of inspiration, and Campus Móvil offers the possibility of retrieving and reusing them (from a regular website with password access) [3]. The contents and services proposed for this concept will include ideas from lecture classes; data produced in public spaces where there is neither access to computers nor any Internet access; help and memory queries for research meetings; and various kinds of snippets for later retrieval in desktop computing applications.

5 Interface Design

To adapt Web 2.0 applications to mobile interfaces that are only 240 pixels wide and have only limited graphical capabilities, the best approach is complementarity with the desktop website. This is similar to other examples of complementarity between desktop and mobile interfaces in Mobile Web 2.0 applications (Mosh, Facebook, Twitter, Dodgeball, Netvibes, MySay, Mindjot, etc.).

Fig. 1. Providing complementarity between the mobile and desktop websites

Campus Móvil will be primarily designed for interactions on mobile devices, although there will also be regular web applications to support and complement the mobile Web with the desktop Web (Figure 1). More complex procedures, such as subscriptions, long-duration consumption, retrieval of snippets generated from mobile applications or channels, and group-to-group communication, will be reserved for use via the desktop Web.
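The capture-and-retrieval flow described above (short snippets produced on the mobile side, retrieved later from the desktop Web) can be sketched in a few lines. This is a hypothetical illustration only: the class and method names, the 20-word limit check, and the sample data are our own, not the actual Campus Móvil implementation.

```python
from datetime import datetime, timezone

class SnippetStore:
    """Hypothetical central store shared by the mobile and desktop interfaces."""

    WORD_LIMIT = 20  # short texts, per the 20-word guideline in the text

    def __init__(self):
        self._snippets = []

    def capture(self, user, text, tags=()):
        """Mobile side: save a short snippet at the point of inspiration."""
        if len(text.split()) > self.WORD_LIMIT:
            raise ValueError("snippet exceeds the 20-word limit")
        self._snippets.append({
            "user": user,
            "text": text,
            "tags": set(tags),
            "captured_at": datetime.now(timezone.utc),
        })

    def retrieve(self, user, tag=None):
        """Desktop side: retrieve a user's snippets, optionally filtered by tag."""
        return [s for s in self._snippets
                if s["user"] == user and (tag is None or tag in s["tags"])]

store = SnippetStore()
store.capture("hugo", "Check Bologna Process funding sources", tags=["lecture"])
print(len(store.retrieve("hugo", tag="lecture")))  # 1
```

The design choice mirrors the complementarity argued for in the text: the constrained operation (capture) lives on the mobile side, while the richer operation (retrieval and reuse) is reserved for the desktop Web.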
The desktop interface (levels shown in Figure 2) will cover three operations: marketing content; more complex interactions; and extending the data available via the mobile version. The personal page (in level 2) will be the main interface in the community.

Fig. 2. Regular desktop Web architecture

The mobile interface will cover four operations (levels shown in Figure 3): producing and reading short texts of less than 20 words; permitting one to carry out easier interactions; producing snippets while on campus; and listening to audio files.

Fig. 3. Mobile architecture

Fig. 4. Sample desktop interface views

Fig. 5. Corresponding mobile interface views

The first prototype of our interface, designed both for the regular (i.e. desktop) Web (Figure 4) and for the mobile Web (Figure 5), is shown here as a work in progress.

6 Conclusions and Future Work

In this short paper, we have described a project which intersects the mobile Internet, social software and educational environments. Campus Móvil is a prototype for an online application that will run on mobile devices and has been created for a Spanish university community. It provides exclusive and transparent access via an institutional email account. The system was proposed and developed to address needs not currently being met in the university community due to a lack of ubiquitous services. Campus Móvil facilitates network access for numerous specialized activities that complement those normally carried out on campus and in lecture rooms using personal computers. The main activity in the next phase is to prototype the interactions, and to evaluate the functionality in focus groups so as to find out where breakdowns will happen.
We will research the interactions carried out via both the desktop and mobile interfaces, looking at how users interact with the system and exploring whether to follow existing trends or to innovate new ones. Following definitions from the original partners, the action lines described here and further research on the prototype, we will finish all the software and content segments of the desktop and mobile applications, keeping the service in beta version for four months and accessible only by invitation. This will allow us to solve any technical and usability problems. We will then identify future development trends in Mobile Web 2.0 applications, which will enable us to create synergies with other mobile Internet applications with a services proposal related to the Campus Móvil project (especially those geared towards academic uses and higher education institutional management). We will then launch the initial commercial phase of Campus Móvil. The market niche will be Spanish public universities. Before the final launch, it will be necessary to secure an agreement with a financial support agency or venture capitalist.

References

1. Jaokar, A., Fish, T.: The Innovator's Guide to Developing and Marketing Next Generation Wireless/Mobile Applications. futuretext (2006)
2. O'Reilly, T.: What Is Web 2.0: Design Patterns and Business Models for the Next Generation of Software (2005), http://tim.oreilly.com/news/2005/09/30/what-is-web-20.html
3. Brandt, J., Weiss, N., Klemmer, S.R.: txt 4 l8r: Lowering the Burden for Diary Studies Under Mobile Conditions. In: CHI 2007 Extended Abstracts on Human Factors in Computing Systems, San Jose, CA, USA (2007)
4. Brandt, J., Weiss, N., Klemmer, S.R.: Designing for Limited Attention. Technical Report CSTR 2007-13 (2007), http://hci.stanford.edu/cstr/reports/2007-13.pdf
5.
Castells, M., Fernandez-Ardevol, M., Linchuan Qiu, J., Sey, A.: Mobile Communication and Society: A Global Perspective. MIT Press, Cambridge (2007)
6. Levinson, P.: Cellphone: The Story of the World's Most Mobile Medium and How It Has Transformed Everything. Palgrave Macmillan, Basingstoke (2004)
7. Steinbock, D.: The Mobile Revolution: The Making of Worldwide Mobile Markets. Kogan Page (2005)
8. Steinbock, D., Noam, E.M.: Competition for the Mobile Internet. Springer, Heidelberg (2003)

The Impact of Politics 2.0 in the Spanish Social Media: Tracking the Conversations around the Audiovisual Political Wars

José M. Noguera and Beatriz Correyero
Journalism Department, Faculty of Communication, UCAM, 30107 Murcia, Spain
jmnoguera@pdi.ucam.edu, bcorreyero@pdi.ucam.edu

Abstract. After the consolidation of weblogs as interactive narratives and producers, audiovisual formats are gaining ground on the Web. Videos are spreading all over the Internet and establishing themselves as a new medium for political propaganda inside social media, with tools as powerful as YouTube. This investigation proceeds in two stages: on the one hand, we examine how these audiovisual formats enjoyed an enormous amount of attention in blogs during the Spanish pre-electoral campaign for the elections of March 2008. On the other hand, this article investigates the social impact of this phenomenon using data from a content analysis of the blog discussion related to these videos, centered on the most popular Spanish political blogs. We also study when the audiovisual political messages (made by politicians or by users) are "born" and "die" on the Web, and by what kind of rules they do so.

Keywords: Political communication, Spanish blogosphere, political blogs, Spanish General Elections 2008, YouTube.
1 Introduction

Since Joe Trippi started a "political campaign 2.0" for Howard Dean in the USA, there have been many examples of how politicians seek to profit from social networks. Virtual worlds like Second Life and networks like Facebook are just some of the platforms that politicians want to explore. And on a World Live Web that is ever more "visual", the audiovisual political wars are an evident reality. Often, the aim of these messages is to obtain more visibility in the mass media, but at the same time the result in the social media might be unexpected and might even turn against the very politicians who designed the message.

Political parties are beginning to measure the potential of social networks, but these "audiovisual wars" on the Web are still in a phase of experimentation. What elements encourage the visibility of political messages on the Web? Is it a collective social credibility? Are the social media able to identify the aims of some political messages? Are users "trained" to make good spaces for political debate? Do the institutional political messages reach these political spaces? The Spanish blogosphere can provide answers to these questions if its behavior is observed during a political campaign.

The methodology of this paper to describe that role begins with a selection of some of the most important Spanish political blogs. According to several ranking tools, web sites like Escolar [1], Periodistas21 [2], Eurogaceta [3], Internet Política [4] and others were collected. The sample gathered different kinds of blogs, written by journalists, teachers, politicians and ordinary Web users. With this sample of blogs, the most popular tags in the political conversations around the audiovisual messages can be selected.

J.G. Breslin et al. (Eds.): BlogTalk 2008/2009, LNCS 6045, pp. 152–161, 2010. © Springer-Verlag Berlin Heidelberg 2010
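The tag-selection step described above could be approximated with a simple frequency count over the sampled posts. The sketch below is purely illustrative: the post records and tag names are invented, and in the actual study the tags would come from the selected blogs and the tracking tools named in the text.

```python
from collections import Counter

# Invented sample: each blog post from the selected blogs carries a set of tags.
sampled_posts = [
    {"blog": "Escolar", "tags": ["zapatero", "video", "psoe"]},
    {"blog": "Periodistas21", "tags": ["video", "pp", "youtube"]},
    {"blog": "Eurogaceta", "tags": ["zapatero", "video", "elecciones"]},
]

def most_popular_tags(posts, n=3):
    """Count tag occurrences across all sampled posts and keep the top n."""
    counts = Counter(tag for post in posts for tag in post["tags"])
    return [tag for tag, _ in counts.most_common(n)]

print(most_popular_tags(sampled_posts, n=2))  # ['video', 'zapatero']
```

The resulting top tags would then serve as the vocabulary for following the conversations around each political video.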
After that, a sample of videos made by the most important Spanish parties (PSOE and PP) was chosen and their contents were tagged. From that moment it is possible to follow how people (the social media) use these messages. In addition, the conversations are studied with tools designed for tracking the buzz, like Technorati, Google Trends, Meneame (the Spanish version of Digg) and YouTube.

The aim is not to evaluate the efficiency of the audiovisual political campaigns on the Web. The real point is to follow the political messages through the blogs; this kind of data could be very interesting for people working on designing political campaigns for the social media on the Internet.

1.1 The Context: New Trends of Modern Political Campaigns

If 2005 was the year of blogging, 2007 was the year of "videocracy". Nowadays, viral propaganda is emerging as one of the main forces, and videos are the best tool for it. Discussion and dialogue are not trapped by the image, but stimulated by it. With tools like YouTube, Google Video or Vimeo, the image is again a first-order political weapon, especially combined with humor.

Nobody doubts that blogs and videos play an important role in presidential campaigns. Politicians all over the world use blogs and video-sharing sites like YouTube to maintain an intense, scheduled campaign in parallel to the campaign in traditional media. There are many examples of the growth of blogs and videos on the political scene during 2007 in the United States, France, the United Kingdom and Spain. For this reason, the main purpose of this article is to find out whether political blogs in Spain have been influential in increasing the Blogosphere's activity and encouraging the conversation about these audiovisual formats. To achieve this goal, the following tasks are required first:

• Explaining the candidates' use of technology and their online behaviour.
• Analysing the impact on Technorati, in terms of reactions and authority, of the videos produced by the political parties on YouTube.
• Tracking the routes that these political videos travel around the Blogosphere.
• Explaining these routes around the Spanish Blogosphere, trying to find "tipping points".

But before that, it is useful to present the Spanish presidential campaign for the elections of 9 March 2008 in order to show, first of all, the attitudes of Spanish politicians towards the new media and the use that they make of modern technologies, blogs, videos and social networks; secondly, the responses of online citizens engaged with politics, especially those engaged with the main political parties; and finally, the impact of the videos uploaded by the parties to YouTube, in order to demonstrate how video is emerging as a vehicle for promoting the political process in the Spanish blogosphere and how political videos have a special way of travelling around the blogs.

Since the beginning of 2007, Spanish parties have been using video on their own web sites or on YouTube, not only to create states of opinion in the sector of the population that surfs the Internet, but also to gain and maintain political and media attention and to shake young people out of their political apathy, which is expressed especially in low indexes of electoral participation. Politicians seem to have found a way to fight against this apathy. The question is whether politicians have real online communication skills, and this is related to entertainment. According to William McGaughey [5], "political campaigns are today a branch of the entertainment culture. Experienced entertainers make successful political leaders". We wonder: if today's political campaigns should entertain, what are the main components to be used to achieve this goal?
In this paper, these elements are the following: videocracy, videopolitics, buzz marketing, the permanent campaign, crowdsourcing, metacoverage and targeting.

Videocracy. Nowadays, the power of visual images over society is clear. The great impact of television, cinema, the Internet and advertising on public opinion and political affairs is a fact. In Italy, for instance, one man has dominated the world of images for more than three decades. In a recent documentary titled "Videocracy", its director, Erik Gandini, explores the power of television in that country, and specifically Berlusconi's personal use of it during the last 30 years.

Videopolitics. Drawn by the power of the image, political leaders around the world are turning to the web to deliver video messages to voters in an effort to win more sympathizers. The video format opens the door to originality and spontaneity and, as everybody knows, "visual images can be more powerful than words". In this sense, George Lakoff [6], the expert in communication and language at the University of California, Berkeley, explains that 98% of thoughts are unconscious and based on "values, images, metaphors and narratives, which are what really can convince voters in one either way".

Do you tube? Hillary does, Obama does, and of course the Spanish leaders, Zapatero and Rajoy, do. During the pre-campaign, the two main parties in Spain, the Populars (PP) and the Socialists (PSOE), started a war of videos in order to create buzz and to reach public conversations. Sometimes popular Spanish TV contests have been the source of inspiration for political videos that aim to ridicule the adversary. For instance, we can see the video of the Socialist Youth promoting the topic of Education for Citizenship [7], a new course that the current Government introduced in schools.
Paradoxically, the slogan of this polemic video was: "Por la igualdad, por la convivencia, Educación para la Ciudadanía SÍ" ("For equality, for coexistence: Education for Citizenship YES"). Audiovisual formats can also be used to promote the merits of the candidates themselves, like Zapatero's video distributed on the Internet called "Con Z de Zapatero" ("With Z of Zapatero") [8], and another video presented by Mariano Rajoy praising the Spanish National Holiday (12th October, "Día de la Hispanidad"). We describe the objectives of these videos later on.

Buzz Marketing. Online political video isn't just for candidates. It is also for citizens who want to reach out to other people and communicate their own points of view about politicians and political issues. Nowadays parties have political war rooms with teams of communications experts who monitor and listen to the media and the public, respond to inquiries, and synthesize opinions to determine the best kind of action for creating buzz (buzz marketing): getting people talking about the candidates and promoting political viewpoints. Clues to the dominant issues among voters are provided by blogs, podcasts, wikis, vlogs and social networks like MySpace. This allows the political parties to target them, too. Art Murray [9] believes that the most important aspect of the modern campaign is "targeting".

Crowdsourcing. Besides that, politicians try to involve the public in political activities to enrich political discussions. They are exploring ways to use Web 2.0 technology to encourage political supporters to participate and, in this way, to collect data from them. They are creating their own social networks. A party's own social network is a great, easy-to-use tool for building community and links between supporters and for segmenting them, and it can be much more customized than activities carried out through more generalist networks like Facebook.
Its basic objective is that people feel engaged with the campaign, but at the same time anyone who wants to participate in a more active way can find there any material or information that is helpful (and the party, in turn, can use what its supporters produce). This basic form of crowdsourcing means taking what your supporters do for the party, hearing what they propose and putting them in contact with each other: creating community. The PP candidate for president of the Government in 2008 saw the benefits of crowdsourcing: on his web site, Mariano Rajoy asked for volunteers to design his election videos.

During the Spanish political campaign, crowdsourcing was used by traditional media and online newspapers too. Spanish public television (the RTVE channel) and YouTube created an official home on YouTube for presidential candidate videos and provided a platform to let people engage in dialogue with the candidates. This format had already been practised in the United States as "You Choose" in March 2007, when YouTube and CNN created a platform for cosponsored presidential debates.

Permanent Campaign. Another important component of the modern political campaign is the "permanent campaign". Patrick Caddell, an advisor to then President-Elect Jimmy Carter in 1976, gave a name, the Permanent Campaign, to a political mind-set that had been developing since the beginning of the television age. According to Time Magazine [10], Caddell wrote: "Essentially, governing with public approval requires a continuing political campaign". Nowadays, candidates for the presidency are in perpetual campaign mode. The frontier between campaigning and governing has almost disappeared. For instance, in Spain, although the official electoral campaign period only lasts for the 15 days before the election (with the exception of the day just before the election), many parties,
Correyero especially the PP and PSOE, start their "pre-campaigns" months in advance, often before having closed their electoral lists members. Finally, the last significant element in modern political campaigns is metacoverage, what it is, the interest of media and political parties in reporting on the strategies of campaign and its design. 2 2008 Electoral Campaign in Spain In the last five years, in Spain, political parties are working on encouraging political dialog in several ways. "In 2004 Spanish General Elections we attend to the political birth of the smart crowd, the emergence of the Policy 3.0 and the triumph of mobile phones as a mobilizing tool. In 2005 regional elections were the starting point of political blogs in Spain". [11] 2008 was the year of video. Mass-media, politicians and political parties have discovered the viral power of video on the Internet but the videocracy prevails against a more open and participatory videopolitic1. These videos are part of the campaign called 2.0 based on the creation of electoral communities where thousands are the volunteers that expand the message of the candidate, to create videos that ridicule to the rival, and constructing their own networks of conversation and support online. The problem is that politicians pay attention to people during election cycles. But once they get elected, they ignore each other until the next election cycle comes back. In order to design their campaigns, Spanish political parties have taken as a starting point these topics: United States presidential elections (2008), the last presidential elections in Europe (UK and France), the use of the new technologies in the electoral strategies and the rise of Web 2.0 tools witch offers the chance to engage interested citizens. 2.1 PSOE’s Electoral Campaign Trough Videos José Luis Rodríguez Zapatero, current Spanish Prime Minister, is the leader of Spanish Socialist Workers Party (PSOE). 
The first phase of the Socialist Party's campaign in the 2008 General Elections ran under the slogan "Con Z de Zapatero" ("With Z of Zapatero"), a joke based on the Prime Minister and socialist candidate's habit of pronouncing words ending in D as if they ended in Z. The campaign linked this to terms like equality (Igualdad-Igualdaz) and solidarity (Solidaridad-Solidaridaz), emphasizing the policies carried out by the current government. We have studied how the conversation about this video travelled through the Spanish blogosphere. The second phase ran under two slogans, "La Mirada Positiva" [12] ("The Positive Outlook") and "Motivos para creer" [13] ("Reasons to Believe"), emphasizing the future government platform.

1 Videoyvoto.tv was the first online journalism initiative in Spain devoted exclusively to video, launched by the Zeta Group to report on the regional and municipal elections of May 2007. It combined the editorial opinion of Zeta Group professionals, video reports supplied by the EFE agency, and citizen journalism, collecting testimonies about the campaign recorded and sent in by the public.

The Socialists built the 2008 election campaign around Spain's social progress under Zapatero (the laws he introduced to protect battered women, to promote sexual equality and to allow gays to marry). The videos created by this party focused on accusing Mr Rajoy of creating problems where there were none. They were meant to be decisive in mobilising PSOE supporters who were thinking of abstaining: tacticians estimate that a party needs a turnout of close to 75 per cent to win enough seats to claim victory and form a government. One of Zapatero's most characteristic features is his eyebrows (they are the sign for his surname in sign language), and making this gesture is a message of support in one of the videos of the main group supporting Zapatero [14].
The Socialists even used instant messaging (IM) in their electoral campaign. They created the nickname iZ (iz@psoe.es), Zapatero's identity on Messenger chat and the first of a political character anywhere in the world. The automated program answered questions about the lists of candidates and the election manifesto.

2.2 PP's Electoral Campaign Through Videos

Mariano Rajoy is the leader of the Popular Party (PP). The PP believed it could weaken the Socialist Party by attacking its management of the economy, which was slowing, and of immigration, which was increasing. For the pre-campaign, the PP used the slogan "Con Rajoy es Posible" ("With Rajoy it is Possible"), usually attached to concrete campaign proposals, such as "Llegar a fin de mes, Con Rajoy es Posible" ("Making it to the end of the month, with Rajoy it is possible"). The PP videos tried to portray Spain as a country on the verge of recession and overrun by immigrants; they criticized Zapatero's government as incompetent at solving economic problems and blamed it for failing to create employment. One of the most visited videos launched on YouTube by the Popular Party was "Zapatero's Big Lie" [15], which accuses the Prime Minister of not telling the truth about the suspension of negotiations with the Basque separatist group ETA. We have studied how the conversation about this video travelled through the Spanish blogosphere. In other videos Rajoy shows the letter C for "canon", the digital tax added when buying a CD, DVD or any information storage device, for which Spain pays the highest prices in Europe; he promised to eliminate this tax if he won the elections.
To conclude this section, we can state that both candidates knew that videos would not convince voters to change parties, but they could convince supporters to go out and vote instead of staying at home.

3 Methodology and Case Study

Research on ICT requires methodological innovation to work with social media (blogs, wikis, video and all kinds of Web 2.0 tools) and to obtain results that explain the new relations of social shaping. The role of traditional political communication has changed with these social and technical innovations. Viral actions have emerged as a new goal for political purposes; however, virality does not always follow the same rules for success. Diffusion based on crowdsourcing, buzz marketing, and the possibility of spreading the same messages across several platforms are part of these rules. The new media ecosystem has problems applying old methods such as content analysis (tag clouds could be regarded as a special, Web-specific version of it), and there are new techniques to develop within this field, related for example to tracking conversations on the Web in real time. In this study, messages and conversations were tracked at the same time. These messages become viral actions thanks to users, and because of this we need to advance the branch of communication research that focuses on the mechanisms explaining how information flows work in the social media of the Web. Our case study is the political information produced by viral videos in Spain. Here, the methodological innovation tries to identify where the influence lies. In the study, influence was identified with a significant change in the volume of conversation in the blogosphere regarding the videos. The first step was gathering a sample of the most relevant viral videos; the second was obtaining their average numbers of reactions and their Technorati authority (Technorati being, together with Google BlogSearch, the most popular blog search engine).
The objective is to show where the influence lies: do we have real tipping points in the blogosphere? To answer this question, on the one hand we gathered a sample of fifteen relevant Spanish blogs maintained by politicians, journalists and other kinds of users. The list was selected according to several rankings (Alianzo [16], Wikio [17], 86400 [18] and Compareblogs [19]), which use criteria such as RSS subscribers and incoming links. Part of the final sample consists of five journalistic/political blogs: Escolar, Periodistas 21, NetoRatón 2.0 [20], Guerra Eterna [21] and La Huella Digital [22]. On the other hand, two political videos were selected for tracking reactions: "Zapatero's Big Lie" (PP) and "With Z of Zapatero" (PSOE). With the advent of Web 2.0, communication research has developed topics such as collaborative publishing, social filters and viral messages, but not yet others like influence, especially the question of whether some nodes of social media can be considered real tipping points on the Web.

3.1 Influence of Journalistic and Political Blogs

Tracking the influence of blogs, in this case on the Spanish blogosphere, cannot be done without a clear issue (the conversation around the selected political videos) and a closed sample period. Using Technorati, the number of reactions to each video was registered according to the hypertext (links) each one generated. The Technorati authority of the blogs was then measured day by day, to see whether the presence of the most important blogs increased the volume of conversation around the videos on particular days. With Web 2.0 we are in a context in which we need to track users' demand for information on the Web; tools like Google Trends show us what people are looking for. In a similar way, Technorati gives us data about how many people are talking in the blogosphere, according to the number of links related to the political videos.
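The day-by-day measurement can be pictured with a small sketch. This is not the tooling used in the study; the dates, authority scores and the threshold that marks a "big node" below are all invented for illustration:

```python
# Illustrative sketch: given dated blog reactions to one video, each with
# the posting blog's authority score, count daily volume and compare the
# number of reactions before and after the first high-authority blog
# joins the conversation. All data is hypothetical.
from collections import Counter
from datetime import date

reactions = [  # (day, authority of the linking blog) -- invented values
    (date(2008, 1, 18), 12), (date(2008, 1, 19), 30),
    (date(2008, 1, 22), 450),  # a "big node" posts here
    (date(2008, 1, 23), 25), (date(2008, 1, 24), 18),
    (date(2008, 1, 30), 9),
]

def daily_volume(reactions):
    """Posts per day linking to the video."""
    return Counter(day for day, _ in reactions)

def split_around_big_node(reactions, threshold=100):
    """Count reactions before vs. after the first post whose author's
    authority exceeds the threshold; None if no such post exists."""
    big_day = min((d for d, a in reactions if a >= threshold), default=None)
    if big_day is None:
        return None
    before = [d for d, a in reactions if d < big_day]
    after = [d for d, a in reactions if d > big_day]
    return len(before), len(after)

volume = daily_volume(reactions)
print(split_around_big_node(reactions))  # (2, 3)
```

If the "after" count does not exceed the "before" count, the high-authority post did not precede a growth in conversation, which is the pattern the study reports.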
On the Web, when a concept is named, it is linked too. The conversation was registered over eighteen days (from 18 January to 4 February). During this time, 47 posts related to the video "Zapatero's Big Lie" were published in the Spanish blogosphere; these posts were considered the first reactions to the video. In this case, the biggest appearance of relevant blogs came around the middle of the period, and after that presence there was no growth in blog publications. In other words, the moments with most reactions did not come after the biggest authorities. The conversation around the second video of our sample was registered over the same number of days, from 18 October to 3 November, during which 112 posts containing links to the video "With Z of Zapatero" were published. During the weeks of conversation around this second video, Technorati authority showed big falls and big peaks with no regularity; because of this, the video's travel through the blogosphere is related more to the power of networks than to the power of rankings (in this case, a ranking of Technorati authority built from links to the political videos).

3.2 Two Ways of Drawing Internet Information Flows

The traditional opinion about how messages circulate on the Web has mostly been tied to a view of the Internet based on rankings: a Web based on the power of big nodes (sites) whose publications and opinions have a relevant influence. In the case of social media, we are talking about turning points made by users (blogs, podcasts, social filters...). In this case study, the main question is whether there were visible big nodes in the political conversation of the Spanish blogosphere. Specifically, we tracked the conversations about the political videos made by the PP and the PSOE during the 2008 campaign.
However, according to the data obtained in this study, the political conversations fit better with a different view of the Web, based on networks and not just on big nodes. As we saw over almost three weeks with each video, many reactions on blogs arrived without a big presence of authority preceding them, and the appearance of big nodes (relevant blogs maintained by influencers) did not bring a bigger volume of conversation: the information flows were the same as, or even smaller than, in the first stages of the period. In this sense, we could describe the Web under a broad paradigm of networks rather than just of relevant sites (tipping points). From this point of view, the big nodes are an echo of the conversations, not the cause of their visibility. The importance of big nodes on the Internet and in social media may therefore be overvalued: the power lies in the interconnected clusters, and political strategists should consider whether the myth of tipping points in social media is real or not.

4 Conclusions

In the 2008 Spanish electoral campaign, the most popular political videos generated conversations in the blogosphere and gave us the opportunity to measure the role of influencers. The main objective was tracking the travel of these videos through one of the most relevant social media: blogs. Tracking the content of the political videos was useful for identifying how parties used this tool to discredit their adversaries. There is a clear problem for political strategists: parties still design their campaigns for the mass media, even those messages aimed at social media on the Web. From the content gathered during the political campaigns, we can conclude that the parties only want to generate enough polemic to turn a video into news and so jump to TV news programmes.
In this way, the main message during part of the political campaign is metacoverage (content about the content): news about the reactions to a video rather than about the storytelling and content inside it. By tracking Technorati reactions to each video, we measured the influence of blogs day by day. If there is no growth in conversation after the appearance of big nodes, we can consider that these supposed tipping points are not a cause; they may be just an echo of the conversation. From this point of view, the most effective way to pitch an idea is to create mass marketing through social media, but if a video is understood as mere political advertising, it will not be useful. Campaign 2.0 is not about using technology to put a funny or polemical video on platforms like YouTube. According to the data gathered, the key is not presence in a particular node; it is a question of a radical change in the tone of the political conversation throughout all its phases. The presence of important nodes (in terms of Technorati authority) is not relevant to the volume of political conversation. Because of this, and as we have underlined, the key to successful campaigns may lie in the capacity to manage interaction with people, listening to them and taking part in a real conversation: the clearest paradigm of Web 2.0.

References

1. Escolar.net, http://www.escolar.net
2. Periodistas 21, http://periodistas21.blogspot.com
3. Eurogaceta, http://eurogaceta.com
4. Internet Política, http://internetpolitica.com
5. McGaughey, W.: Five Epochs of Civilization: World History As Emerging in Five Civilizations. Thistlerose Publication, Canada (2000)
6. Lakoff, G.: The Political Mind: Why You Can't Understand 21st-Century Politics with an 18th-Century Brain.
Viking/Penguin (2008)
7. Educar para la ciudadanía, http://www.youtube.com/watch?v=WjXja39DH1s&feature=fvw
8. Con Z de Zapatero, http://www.youtube.com/watch?v=7rNSY6v01zg
9. Murray, A.: The Most Important Component of The Modern Political Campaign Strategy, http://www.completecampaigns.com/article.asp?articleid=53
10. Time, http://www.time.com/time/columnist/klein/article/0,9565,1124237,00.html
11. Varela, J.: El año de los blogs políticos (2005), http://periodistas21.blogspot.com/2005/03/el-ao-de-losblogs-polticos.html
12. La mirada positiva, http://www.20minutos.tv/video/JoWyCucNeg-lamirada-positiva-del-psoe/0/
13. Motivos para creer, http://www.youtube.com/watch?v=9qiM2os3pgk
14. Vídeo de promoción de la Plataforma de Apoyo a Zapatero (PAZ), http://www.youtube.com/watch?v=RZxjCvq44ng&feature=channel
15. La gran mentira de Zapatero, http://www.youtube.com/watch?v=MiNhMJHRkcQ&feature=player_embedded
16. Alianzo, http://www.alianzo.com
17. Wikio, http://www.wikio.es
18. 86400, http://86400.es
19. Compareblogs, http://www.compareblogs.com
20. NetoRatón, http://www.netoraton.es
21. Guerra Eterna, http://www.guerraeterna.com
22. La Huella Digital, http://lahuelladigital.blogspot.com

Extended Identity for Social Networks

Antonio Tapiador, Antonio Fumero, and Joaquín Salvachúa
Universidad Politécnica de Madrid, ETSI Telecomunicación, Avenida Complutense 30, 28040 Madrid, Spain
{atapiador,amfumero,jsalvachua}@dit.upm.es

Abstract. Nowadays we are experiencing the consolidation of social networks (SN). Although there are trends trying to integrate SN platforms, they remain data silos: information cannot be exchanged between them. In some cases it would be desirable to connect this scattered information in order to build a distributed identity. This contribution proposes an architecture for distributed social networking.
Based on distributed user-centric identity, our proposal extends it by attaching user information, bridging the gap between distributed identity and distributed publishing capabilities.

Keywords: WWW, social networks, social software, digital identity, architecture.

1 Introduction

Social networks (SN) are one of the key words of the Web 2.0 realm, and we are now experiencing the consolidation of SN platforms. The recent launch of the Facebook platform has officially opened the opportunity for the next wave of value-added applications on next-generation SN platforms. When we talk about SN in general terms, we consider a wide scope of web services ranging between content-oriented and contact-oriented social networks, which can be understood as social networks in a wide and a narrow sense. The former are platforms whose main goal is content publication, with social relations arising as a side effect of the interactions. The latter specialize in the creation and management of contacts, and so are social networks in a narrow sense. The trend seems to be contact-oriented platforms integrating content-sharing services, with Facebook again the best example. Meanwhile we still have in place many independent, isolated content-sharing services (e.g. Flickr, blip.tv, YouTube, Slideshare) and contact-management-centred social networks (e.g. Xing, LinkedIn) that we will be using for a while. The social networking services scenario is in a consolidation stage in terms of platforms (e.g. Xing AG has acquired the two major professional networks in Spain) and, at the same time, with the announcement of the OpenSocial initiative from Google, all the actors in this scenario are positioning themselves to start the race for those value-added services. The Web now offers a wide range of social network services, and users develop their personal or professional activities on different platforms.

J.G. Breslin et al. (Eds.): BlogTalk 2008/2009, LNCS 6045, pp. 162–168, 2010. © Springer-Verlag Berlin Heidelberg 2010

A user may use a blogging platform to express her thoughts, a photo platform to publish her images, and a social network platform to stay in contact with her friends. But each of these platforms ignores the activity the same user is developing on the rest of the web sites. This is sometimes desirable for privacy reasons, but in many other cases interoperability is an added value. Things would be easier if images were available in the blogging platform for integration into posts, and the latest posts appeared in the social network user's profile. This contribution describes an architecture for distributed SN that supports this interoperability. It proposes a distributed model for contact and content integration, built around distributed user-centric identity frameworks such as OpenID.

2 Distributed Identity

Such 'new age' services will depend on the basic functionality delivered by open platforms, and one key piece of that functionality is identity. We still use the ancient user/password combination as the de facto standard for identifying ourselves. With the many different services that appeared in the Web 2.0 explosion, we have to manage not only many different passwords but also a growing number of different profiles on distributed platforms all around the world. This situation fragments user activity across all the SNs: there is no coherence between the profiles a user creates in every SN, and no way to combine the digital content created on each platform. The current scenario can be desirable for privacy reasons; sometimes we do not want our activities traced and linked between every place we log in. But at other times,

Fig. 1. Digital identity is distributed among SN platforms

especially when we want to build a coherent identity and reputation, such interoperability would make things easier. In these cases, from the user's point of view, it would be desirable to have a single profile that could be validated against any service she accesses. Decentralized, user-centric, "soft" identity schemes are a way of implementing this requirement. The most popular of these schemes is OpenID [1], which is being implemented by the main development communities of the Web 2.0, e.g. WordPress and Blogger in the blogging-platform segment. Some security problems [2] have been identified, so a different kind of decentralized, user-centric identity framework should be used in cases where the security requirements are harder.

3 Architecture for Extended Identity

The idea behind our proposal is the qualitative leap that comes from using dereferenceable IDs, like OpenID. Until now, using a new SN platform has meant providing a login and a password; in most cases an email address is also required, which the web site verifies by sending an email with a confirmation link. If we analyze the information the SN platform has about us at this point, it is limited to some credentials for authenticating on the web site and an email address for contacting the user. On the other hand, we have authentication frameworks like OpenID, in which we provide the SN platform with a dereferenceable ID, that is, a URI representing a resource that can be located and fetched. If the SN platform can dereference the ID, it can obtain extended information from it: from an OpenID URI we can get an HTML document, which can contain a great deal of information about the user. We can use this ID to register ourselves in a whole range of web services.
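The dereferencing step can be sketched in a few lines. The identifier page, its URLs and its link relations below are invented for illustration; a real relying party would fetch the document over HTTP rather than use an inline string:

```python
# Sketch of dereferencing an OpenID-style identifier: parse the HTML
# document behind the ID and collect the <head> <link> tags that announce
# the identity server and further attached resources. The document and
# all URLs are hypothetical.
from html.parser import HTMLParser

ID_DOCUMENT = """
<html><head>
  <link rel="openid.server" href="https://id.example.org/server">
  <link rel="alternate" type="application/atom+xml"
        href="https://blog.example.org/feed.atom">
  <link rel="meta" type="application/rdf+xml"
        href="https://id.example.org/alice/foaf.rdf">
</head><body>Alice's identifier page</body></html>
"""

class HeadLinkCollector(HTMLParser):
    """Collects the attributes of every <link> tag in the document."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "link":
            self.links.append(dict(attrs))

parser = HeadLinkCollector()
parser.feed(ID_DOCUMENT)
server = next(l["href"] for l in parser.links if l.get("rel") == "openid.server")
print(server)  # https://id.example.org/server
```

The same pass also yields the attached feed and FOAF links, which is the "extended information" a service can discover from the ID.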
These web services could include blogging, photo, slide and video sharing services, and narrow-sense social networking services. Those services obtain more information about us by dereferencing the ID and discovering the information attached to it. We also incorporate information back from the web services into the ID. In this way we are able to connect information to sites: we can, for instance, let our blog know where our photos (or some of them) are, or let our friends know where our videos are.

3.1 Architecture Components

Our proposed architecture is a client-server solution composed of three elements: client agents, identity servers and resources servers.

Client Agent. A client agent (CA) is any application environment on a local machine or device controlled by a user. Examples of CAs are mobile applications and browsers running on a PC. A network connection is assumed. The environment where the CA runs can provide it with API facilities such as authentication, resource fetching or content publishing.

Identity Server. An identity server (IS) provides users with user-centric, decentralized IDs belonging to a given identity framework. Users authenticate with their IS and thereby have a mechanism for identifying themselves in the rest of the SN platforms. It is an "OpenID Provider" in the OpenID world, with extended capabilities. It also supports mechanisms for allowing other parties to access private resources, using protocols like OAuth [3]. The IS acts as a user proxy: it stores the main, authoritative user profile. This profile is composed of links to user resources (such as presence, geolocation, personal data) and collections (e.g. contact lists, blogs, albums, podcasts). It may also provide information for editing these resources and posting new resources to collections.

Fig. 2. Example of the described architecture, including a client agent, an identity server and two resources servers

Profile information is controlled by access control lists: users grant or restrict access to their resources and collections stored in the IS. The IS should provide mechanisms for managing different profile data sets easily, since users typically want to show different kinds of profile to different SNs: the minimum required information to some SNs, but detailed information to other, trusted SNs. The same holds for IDs other than our own: other users are expected to query an IS to obtain more information about somebody when they discover her ID and want to know more about that person. The IDs should therefore be dereferenceable URIs. CAs obtain the main or favourite editable resources and collections. When the user logs into her IS using the CA, she obtains not only authentication capabilities but also publishing information usable by the CA. This information enables the CA to write blog entries and to post photos or videos, even on other SN platforms: the resources servers.

Resources Server. A resources server (RS) is any web service providing resource management. Examples of RSs are content-oriented web services (e.g. blogs, podcasts, social bookmarking), but also contact-oriented ones, since contacts can also be thought of as resources. Any SN platform can be an RS if it allows users to sign up and post resources. An RS supports authentication using a distributed identity framework; it is a "Relying Party" in the OpenID world, relaying authentication to ISs. A user signs up to an RS using the identity framework provided by her IS and, at this step, selects which kind of profile she will show to this new RS. The RS then obtains the user's profile information by querying the IS. RSs publish resource collections owned by users, as the IS may do.
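The profile-and-ACL idea at the IS can be illustrated with a hypothetical data model. This is not the authors' implementation; the class, the collection names and the URIs are invented:

```python
# Illustrative data model: the identity server (IS) stores an
# authoritative profile as named resource collections, and the user
# grants each resources server (RS) access to a chosen subset
# (a "profile data set"). All names and URIs are hypothetical.
class IdentityServer:
    def __init__(self):
        self.collections = {}   # collection name -> list of resource URIs
        self.acl = {}           # rs_domain -> set of visible collections

    def add(self, collection, resource_uri):
        """Bookmark a resource in a named collection of the profile."""
        self.collections.setdefault(collection, []).append(resource_uri)

    def grant(self, rs_domain, *collection_names):
        """Let a given RS see some collections."""
        self.acl.setdefault(rs_domain, set()).update(collection_names)

    def profile_for(self, rs_domain):
        """The view of the profile this RS is allowed to dereference."""
        visible = self.acl.get(rs_domain, set())
        return {name: uris for name, uris in self.collections.items()
                if name in visible}

iserver = IdentityServer()
iserver.add("photos", "https://photos.example.org/alice/1")
iserver.add("contacts", "https://sn.example.org/alice/friends")
iserver.grant("blog.example.org", "photos")  # trusted RS sees photos only
print(iserver.profile_for("blog.example.org"))
```

An untrusted RS that was never granted anything simply receives an empty profile, which matches the "minimum required information" case described above.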
CAs obtain from each RS a complete list of the information the user has generated on that specific RS. Users can bookmark this information in their IS, controlling who can access it; in this way they build their authoritative, main profile, which resides in the IS. The user also controls in the IS which information (e.g. collections and resources) to show to other RSs, so that RSs can discover and mash up the resources and collections of their users. Access to RS resources is controlled by the RS itself. There may be resources announced at the IS but not accessible at the RS, and vice versa; synchronization between IS ACLs and RS ACLs remains to be specified.

4 Information Flows

The main information attached to a user of this architecture is composed of her ID, along with the resources and collections associated with it. The ID is entered by the user when logging in to the RS, which discovers the IS location, as explained before; this procedure is described in the OpenID protocol specification [1]. Later, there may be exchanges of information between the IS and the RSs, which run in two directions from the RS's point of view.

4.1 Pull

The RS obtains more information about the ID by asking the IS. This allows the SN platforms to know more about the user, finding out, for example, the latest entries in her blog or her contact network. The following technologies currently support these information exchanges.

OpenID Attribute Exchange. OpenID Attribute Exchange [4] is an OpenID protocol extension that supports the exchange of key-value pairs between the Relying Party (the RS in our architecture) and the Identity Provider (the IS). This technique is limited by the format of the attributes.

HTML. OpenID identifiers are typically HTTP URLs. The RS can dereference the URL and get the HTML document, which provides information about the ID. Two different formats are available:

1. Microformats [5] are semi-structured information embedded in the HTML markup.
Currently, there are formats defined for personal cards (hCard), events (hCalendar) and tags (rel-tag); other object types are in the definition process.

2. HEAD links: the HTML <head> section supports <link> tags. These tags are already used to provide extended information about the HTML document, e.g. blog subscriptions in RSS or Atom format.

Other data formats. The HTTP protocol supports a mechanism for requesting documents in a specific format, achieved by including the Accept header in the request (the server labels its response with the corresponding Content-Type). This mechanism, together with HEAD links, allows us to obtain representations of the ID in different formats. One example is Atom feeds [6], a format used for content syndication. Another is RDF (Resource Description Framework) and its schema languages (RDFS, OWL), the base of the Semantic Web. FOAF [7] is an RDF-based vocabulary used to describe agents (users, groups, organizations) and their attributes; the experimental property foaf:openid supports associating the user's profile information with her ID. SIOC (Semantically-Interlinked Online Communities) [8] provides a vocabulary for describing resources and resource collections.

4.2 Push

The RS publishes information about the user in her IS. This case is interesting, for example, so that the IS can gather the activity the user generates in the SNs she participates in.

OpenID Attribute Exchange. The OpenID extension works both ways: it can also be used by RSs to store key-value pairs in the IS.

Atom Publishing Protocol. AtomPub [9] is a protocol designed by the IETF for publishing and editing web resources. One of the document types defined by the specification is the Service document. Service documents describe the available Collections, grouped into Workspaces; Collections are sets of resources. The Service document describes what kind of resources can be posted to a Collection.
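A minimal Service document can be parsed as follows. The document itself is invented, but the namespaces are the standard AtomPub (RFC 5023) and Atom ones:

```python
# Sketch: parse a minimal AtomPub Service document to list the
# collections a client agent could post to, with the media types each
# collection accepts. The document content is hypothetical.
import xml.etree.ElementTree as ET

SERVICE_DOC = """<service xmlns="http://www.w3.org/2007/app"
         xmlns:atom="http://www.w3.org/2005/Atom">
  <workspace>
    <atom:title>Alice's site</atom:title>
    <collection href="https://blog.example.org/alice/entries">
      <atom:title>Blog</atom:title>
      <accept>application/atom+xml;type=entry</accept>
    </collection>
    <collection href="https://photos.example.org/alice/">
      <atom:title>Photos</atom:title>
      <accept>image/png</accept>
      <accept>image/jpeg</accept>
    </collection>
  </workspace>
</service>
"""

APP = "{http://www.w3.org/2007/app}"
ATOM = "{http://www.w3.org/2005/Atom}"

def list_collections(xml_text):
    """Return (title, href, accepted media types) for each collection."""
    root = ET.fromstring(xml_text)
    out = []
    for coll in root.iter(APP + "collection"):
        title = coll.find(ATOM + "title").text
        accepts = [a.text for a in coll.findall(APP + "accept")]
        out.append((title, coll.get("href"), accepts))
    return out

for title, href, accepts in list_collections(SERVICE_DOC):
    print(title, href, accepts)
```

From such a listing a CA knows that, for instance, only image media types may be posted to the photo collection.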
AtomPub can be extended, so Collections could also describe what kinds of resources they contain, for better integration with the CA's publishing and resource-management capabilities.

5 Validation

OpenSocial [10] is one of the latest technologies emerging in social software. OpenSocial is a public API launched by Google in late 2007. It provides management support for three kinds of resources attached to a user's personal information: contacts, activities in SN platforms, and persistent data. The OpenSocial proposal fits smoothly into our architecture model: in OpenSocial, users' contacts and activities are exported using Atom feeds, activity publishing uses the Atom protocol, and persistent data support shares the same principles as the OpenID Attribute Exchange extension. To achieve a practical validation of the architecture, we are working on a plugin [11] for the Ruby on Rails web development framework. This plugin provides an application with an authentication framework supporting several authentication schemes, including OpenID; it also provides authorization and content and contact generation. We plan to evaluate the technologies mentioned in the previous section that support information exchanges between IS and RS. This plugin is currently used in several SN platforms, including the VCC [12], a rich web content management system that supports conferences.

6 Conclusions

This article proposes an architecture that addresses the problem of fragmented user identities on SN platforms. The architecture is based on OpenID, a user-centric distributed identity framework. The IS stores the authoritative user information; the RSs use the IS to obtain extended identity information about users, as well as to publish new information about users' activities. Several technologies currently support information flows between IS and RS.
In this sense, we are working on a Ruby on Rails plugin supporting several of these technologies. This plugin is used as the base of SN platforms that will validate the proposed architecture. Finally, the proposed architecture fits with the latest protocols emerging in the field, such as Google's OpenSocial API.

References

1. Recordon, D., Reed, D.: OpenID 2.0: A Platform for User-Centric Identity Management. In: Proceedings of the Second ACM Workshop on Digital Identity Management, pp. 11–16 (2006)
2. Brands, S.: The Problem(s) with OpenID, http://idcorner.org/2007/08/22/the-problems-with-openid/
3. OAuth: An open protocol to allow secure API authentication in a simple and standard method from desktop and web applications, http://oauth.net
4. Hart, D., Bufu, J., Hoyt, J.: OpenID Attribute Exchange 1.0 – Final, http://openid.net/specs/openid-attribute-exchange-1_0.html
5. Khare, R.: Microformats: The Next (Small) Thing on the Semantic Web? IEEE Internet Computing 10(1), 68–75 (2006)
6. Nottingham, M., Sayre, R.: The Atom Syndication Format. RFC 4287 (2005), http://tools.ietf.org/html/rfc4287
7. Brickley, D., Miller, L.: FOAF Vocabulary Specification (2007), http://xmlns.com/foaf/spec/
8. Semantically-Interlinked Online Communities, http://sioc-project.org/
9. Gregorio, J., de hOra, B.: The Atom Publishing Protocol. RFC 5023 (2007), http://tools.ietf.org/html/rfc5023
10. OpenSocial, http://code.google.com/apis/opensocial/
11. Rails Station Engine, http://rstation.wordpress.com
12. Virtual Conference Center, http://vcc.dit.upm.es

NeoVictorian, Nobitic, and Narrative: Ancient Anticipations and the Meaning of Weblogs

Mark Bernstein
Eastgate Systems Inc., 134 Main St, Watertown MA 02472 USA
bernstein@eastgate.com

Abstract. What makes a good weblog, a superb wiki, or an exceptional contribution to a social networking site? Though excellence is a frequent source of anxiety amongst weblog writers, it has not been a concern of weblog scholarship.
In contemporary social software, we encounter once more the deep controversies of 19th century art, reinterpreted to meet the exigencies of time and technology.

1 NeoVictorian Social Software

What makes a good weblog, a superb wiki, or an exceptional contribution to a social networking site1? The readers and the writers of weblogs are so numerous and so diverse that they defy classification. It might be tempting to assert that a good weblog is a popular weblog, or conversely that whatever satisfies the weblog author is, by definition, good. Neither approach is entirely satisfactory, because neither guides our critical appreciation of weblogs or our ability to make them more original, more effective, or more beautiful. We sometimes compare journalistic methodologies or explore the properties of the social graph but, in discussing or prescribing ideals, most contributors to the BlogTalk and WikiSym conferences (and to the monograph literature) have been reluctant to move beyond the box-office measurement of audience size and power. Though excellence is a frequent source of anxiety amongst weblog writers, it has not been a concern of weblog scholarship.

Critics often find in new media a mirror of their dreams and anxieties. The student of the novel finds the shadow of the Victorian serial. The literary quality of blogs arises from a complex negotiation between discrete and often random daily entries and the often invisible arc that they together sketch. [1]

1 In discussing social networking sites, I follow the definition of boyd and Ellison [6]: "We define social network sites as web-based services that allow individuals to (1) construct a public or semi-public profile within a bounded system, (2) articulate a list of other users with whom they share a connection, and (3) view and traverse their list of connections and those made by others within the system. The nature and nomenclature of these connections may vary from site to site."

J.G. Breslin et al.
(Eds.): BlogTalk 2008/2009, LNCS 6045, pp. 169–176, 2010. © Springer-Verlag Berlin Heidelberg 2010

The student of hypertext looks at social media and finds emergent nonlinear narrative [3], and the disciple of the codex book descries a formless void [5]. The very nature of the blogosphere is proliferation and dispersal; it is centrifugal, and it represents a fundamental reversal of the norms of print culture. Faced with such contradictory critical frameworks, what is to guide the aspiring young writer as she sets out to craft her weblog, or the reluctant grandfather who is eager to join Facebook or CyWorld but loath to make a fool of himself? Much can be learned by abandoning the unfounded bias that everyday writing lacks intellectual foundations [14], or that people writing for themselves and their inner circle of acquaintance lack ideas or interests beyond the surface concerns of their prose. In contemporary social software, we encounter once more the deep controversies of 19th century art, reinterpreted to meet the exigencies of time and technology.

Fig. 1. A fire control panel from an early Louis Sullivan skyscraper, now in the Art Institute of Chicago. Form follows function, but "the tall office building should be a proud and soaring thing." In electronic media, everything that can be inscribed will carry meaning [12].

2 Ancient Airs: Of Intellectual, Artistic, Social, and Sexual Concerns

Our grandparents loved manifestos. Das Endziel aller bildnerischen Tätigkeit ist der Bau! The ultimate aim of all creative activity is the building! (Gropius; see [13])

Chastened by the twentieth century's terrors or the shadows of their elders2, artists today seldom declare their intentions, but we should not therefore conclude that artists — even young artists — are witless and rudderless.
We cannot know where any artistic movement is headed until it arrives, but it does not seem difficult to discern in current concerns the shadows and refractions of older movements. What do weblogs want? Critics have argued that weblogs chiefly seek admiration, that they are narcissistic cries for attention from undisciplined techies eager to tell the world about their cheese sandwich lunch [4]. This cannot be right3: too many people choose to read and write weblogs for us to dismiss them as mere childishness, and a sympathetic reading of the best weblogs reveals many marks of skill and talent.

Weblogs want to be right, and to be seen to be right. Examples abound. Blog writers from Matt Drudge to Joshua Micah Marshall stand conspicuously in the forefront of political and social reporting, and the disjoint, ad hoc social networks of Grace Davis and of Kathryn Cramer played significant roles in addressing urgent needs in the wake of Hurricane Katrina. We have seen this concern before: in the 19th century, we called it Realism.

Weblogs want to excel. New bloggers — from classroom bloggers to noted novelists blogging their book tours — frequently express overt concern for their ability to perform well in the blogosphere.

Sorry. What a whinge. What a dreadful self-pitying whine. I do apologise, everyone. But I have the writer's primary vanity, which is to suppose that if I have experienced something and been somewhere then others will have too. [9]

Concern with excellence is the dominant issue in wikis, which seek to deploy many eyes and many pens in order to arrive at a correct and useful statement [7]. When bloggers seek to situate the excellence of their blogs in their own intrinsic wonderfulness, we recognize Romanticism. When the excellence resides in correct reasoning, we see dialectic. And where the excellence resides in brilliant craftsmanship, we see echoes of the Pre-Raphaelites and of Aestheticism.

Weblogs want to be connected.
Bloggers and social networkers alike want to be liked, and well liked, and they are anxious that the world's admiration for them be manifest. At its best, this impulse connects writers to the world and interconnects a world of writers. Popular politics has been decried as an echo chamber before:

In politics: I'm a Vermont Democrat — you know what that is, sort of double-dyed; The News has always been Republican. Fairbanks, he says to me, "Help us this year," meaning by us their ticket. "No," says I, "I can't and won't. You've been in long enough." [8]

2 Or by their knowledge of the intentional fallacy, now taught in the cradle. The impact of the contingency of meaning on the creation of manifestos must not be underestimated. In Europe, the politics of the Bauhaus conventions of slender supports (floating buildings above the corrupt soil) and flat roofs (uncrowned) meant one thing, while in postwar America the same gestures came to mean Lever and Seagram — that is, multinational soap and liquor.

3 Though it cannot be right to disregard weblogs as mere narcissism, the accusation is so commonly made and so generally accepted that we should not dismiss it out of hand. A useful guide might be the strikingly similar accusations that were leveled against several late Victorians — Shaw and Wilde come to mind — whose interests seem particularly congruent with notable blogs, wikis, and blog clusters. Consider, in particular, Shaw's pragmatic political radicalism and critical sparring, as well as Wilde's delight in sexual display and eagerness to engage in very public and destructive flame wars.

The anxiety of weblogs for connection is evidenced by the rich vocabulary that has evolved for describing those who seek links (spammers, splogs, SEO consultants, link whores), by the popular fascination with lengthy friend lists, and by a plethora of tools for estimating readership and influence.
To the extent that social sites have ever expressed a manifesto, it is a message of widespread or universal participation that advocates of the Reform Act or of Suffrage would instantly recognize.

Fig. 2. Interest in NeoVictorian aesthetics of self-(re)presentation is not limited to weblogs, but is a prominent strand of contemporary culture. Photo: Lady Raven Eve, Singapore.

Many weblogs are candidly sexual, not merely in the direct mode of sexual reference, but also in their emphasis on the authentic discussion of the writer's relationship to the quotidian and physical world. The concerns of the body are conspicuous in weblogs, couched in confessional autobiography or expressed in cheese sandwiches. Though the nineteenth century is remembered for prudery and inhibition, the nature of sexuality and the discovery of useful ways to talk about it were among its chief intellectual and artistic projects.

3 Plot or Character?

A defining concern of particular interest to the student of social software4 is the tension between expression of plot and expression of character — between exploring what occurred and describing to whom it happened. This tension is, of course, very much in the air as writers seek to negotiate and assimilate the achievements of late modern fiction and postmodern metafiction. But weblogs and social sites share a further concern: both are performed across a span of time, during which the action unfolds in notionally chronological sequence. Disputes over the status of hypertext and experimental fiction should not lead us to overlook how exceptional and restrictive this condition is [15]: even Homer is replete with temporal excursions, flashbacks, and foreshadowing [11], tools that social writers are currently expected to hide or forego. Weblog criticism has been strangely concerned with the question of whether a weblog accurately depicts its protagonist.
The revelation that some weblog protagonists (Kaycee Nicole, or LonelyGirl15) are partly or entirely fictitious has been greeted with surprising degrees of shock and outrage, even among sophisticated readers who are familiar with fiction and the complexities of (re)presentation and construction of meaning. In social sites, issues of authenticity seem even more hotly contested [6], though here they are often cloaked in concern for the presence of pedophiles. These debates recall earlier scandals over realistic figures (cf. Olympia) and themes (cf. Social Realism), informed here by a postcolonial reinterpretation of class, ethnicity and gender.

The weblog's interest in the discovery and presentation of true or inherent character finds many echoes, both historical and contemporary. Of singular interest, though, is the (mostly Japanese) interest in cosplay — elaborate self-representation in the form of highly formal costumed character styles. While many attributes of the costume are fixed, the cosplay performance, characteristically situated in busy urban streets, foregrounds the artist's craft and personality; the goth lolita does not represent (or parody) Nabokov's character but rather uses that character as a springboard for exploring experience and personality. Similarly, cosplay (and blogs) frequently blur or transgress conventional boundaries of gender and class. We readily apprehend that the artist both is and is not the character, just as Oscar Wilde was not (always) Reginald Bunthorne, Rossetti's model Lizzie Siddal was not Beatrice, and the Divine Sarah sometimes slept.

4 Nobitic Weblogs: Writing for Yourself

Crucially for our understanding of contemporary social media from Facebook to Flickr, from Bento to Tinderbox, the writer's first concern is a nobitic audience, an audience that is notionally "amongst ourselves", intended for the writer's circle of acquaintance5.
Social software's writers write first for themselves and their inner circle, and often regard the prospect of a broad audience of strangers as abhorrent. Will the author's mother see the work? Will their future employer, or their hypothetical grandchild? That is, the writer's chief concern is to satisfy themselves, to ensure that their (re)presentation is artistically honest.

4 Though not yet a concern of wikis, because the wiki tradition has been to avoid narrative.

5 Nobitic audiences need not be inconsequential; the politics of the American and French revolutions were carried on by committees of correspondence, and most early scientific writing took the form of letters and after-dinner speeches.

In his masterful survey of the history of published journals and diaries, Thomas Mallon identifies seven distinct kinds of journal writers: chroniclers, travelers, pilgrims, creators, apologists, confessors, and prisoners [17]. The same taxonomy applies usefully to social software, which serves much the same purpose with regard to the nobitic circle, the general reader, and especially the often-overlooked but crucial question of representational talkback [16], that is, the effect of the author's writing on their own future sense and sensibility. Social media shares with the personal diary an insistence on frequent consultation and writerly reading; the artistic practice of both the diarist and the blogger demands frequent contemplation and equally frequent addition to the work. These media demand spontaneity and immediacy; they resist sentimental attachment to what the writer ought to feel. The same concerns, in 19th century art, gave rise to Impressionism and Expressionism, to the desire to capture the sense of the moment en plein air or on the writer's soul.

Fig. 3. Everyday people face extraordinary tasks all the time. A curator sorts through debris at the National Museum, Baghdad, 2003. Photo: D. Miles Cullen, US Army.
5 The Artisan's Touch

In Abroad, Paul Fussell observed that the travel book emerged from the decline of the audience for books of essays and sermons [10]. Weblogs and social sites, as a form of serious writing, play a very similar role, providing a home for brief, occasional pieces of observation and commentary. The blogger's quest for immediacy, like the painter's, sometimes conflicts with conventionally valorized issues of craft; the painter, seeking to capture the light, was obligated to work swiftly, without taking time to mix complex colors or to efface brushmarks. The work that resulted seemed at first to be casual, slipshod, or unfinished; in time, critics observed that the work was not unfinished but rather raised new questions about the academic idea of "finish". Similarly, the blog valorizes personal voice, even sacrificing grammar and spelling or adopting dialects like Singaporean "Singlish" or SMS-abbreviated, LOLcat-derived l33tspeak. Bad graphic design seems to have become a badge of honor in social sites, where noisy backgrounds, distracting animations and intrusive music tracks abound; here, too, we may see an echo of Fauvist painting or Brutalist architecture.

A key figure in understanding the quality of blogs, wikis, and social sites is the flâneur, that sophisticated observer of the urban scene, unbound by class and unconstrained by convention. Current social software writing is remarkably short on characters; it has voice in abundance, but dialogue is rare, and developed secondary characters and foils — even cardboard cutouts like Mike Royko's Slats Grobnik or Don Marquis' Archy and Mehitabel — are seldom in evidence. Throughout, the weblog spotlight falls on the writer/observer, her struggle to understand or her wry amusement at the absurdities of the daily scene.
We are, then, perhaps at the cusp of a development that parallels Thespis' innovative approach to theater: we may, at the boundaries of social software and the craft of interlinked writing, be prepared to let a second actor, a true character, onto the stage.

References

1. Fitzpatrick, K.: The Pleasure of the Blog: The Early Novel, the Serial, and the Narrative Archive. In: The 4th International Conference on Social Software (BlogTalk 2006), Vienna, Austria (2006)
2. Bernstein, M.: The Social Physics of Weblogs. In: BlogTalk 2, The 2nd International Conference on Social Software, Vienna, Austria (2004)
3. Bernstein, M.: Saving the Blogosphere. BlogTalk Down Under, Sydney, Australia (2005)
4. Bernstein, M.: What Good Is A Weblog? BlogHui, Wellington, New Zealand (2006)
5. Birkerts, S.: Lost in the Blogosphere: Why Literary Blogging Won't Save Our Literary Culture. The Boston Globe (2007)
6. boyd, d.m., Ellison, N.B.: Social Network Sites: Definition, History, and Scholarship. Journal of Computer-Mediated Communication 13(1) (2007)
7. Leuf, B., Cunningham, W.: The Wiki Way: Quick Collaboration on the Web. Addison-Wesley, Reading (2001)
8. Frost, R.: A Hundred Collars. In: North of Boston. Henry Holt and Company, New York (1914)
9. Fry, S.: I Give Up (2007), http://stephenfry.com/blog/?p=21
10. Fussell, P.: Abroad: British Literary Traveling Between the Wars. Oxford University Press, Oxford (1980)
11. Lowe, N.J.: The Classical Plot and the Invention of Western Narrative. Cambridge University Press, Cambridge (2004)
12. Landow, G.P.: Hypertext 3.0: Critical Theory and New Media in an Era of Globalization. Johns Hopkins University Press, Baltimore (2006)
13. Noble, J., Biddle, R.: Notes on Notes on Postmodern Programming. In: OOPSLA 2004, Vancouver, Canada (2004)
14. Rose, J.: The Intellectual Life of the British Working Classes. Yale University Press, New Haven and London (2001)
15. Walker, J.: Piecing Together and Tearing Apart: Finding the Story in Afternoon.
In: Hypertext 1999, pp. 111–118. ACM, New York (1999)
16. Yamamoto, Y., Nakakoji, K., Aoki, A.: Spatial Hypertext for Linear-Information Authoring: Interaction Design and System Development Based on the ART Design Principle. In: Hypertext 2002, pp. 35–44. ACM, New York (2002)
17. Mallon, T.: A Book of One's Own: People and Their Diaries. Penguin (1984)

Author Index

Bernstein, Mark 169
Bojars, Uldis 116
Boulain, Philip 1
Brandt, Joel 143
Breslin, John G. 116
Broß, Justus 15
Chalhoub, Michel S. 29
Correyero, Beatriz 152
Cushman, David 123
Decker, Stefan 116
Fukuhara, Tomohiro 75, 88
Fumero, Antonio 162
Gibbins, Nicholas 1
Han, Jeong-Min 46
Han, Sangki Steve 38
Hoem, Jon 131
Ishii, Soichi 88
Jung, Hanmin 100
Kawaba, Mariko 75
Kim, Kanghak 38
Kim, Pyung 100
Kim, Sang-Kyun 46
Kim, Young-rin 38
Ko, Joonseong 38
Kuklinski, Hugo Pardo 143
Lee, Mikyoung 100
Lee, Seungwoo 100
Lim, Yon Soo 52
MacNiven, Sean 15
Masuda, Hidetaka 88
Matsuo, Yutaka 63
Meinel, Christoph 15
Morin, Jean-Henry 108
Nakagawa, Hiroshi 88
Nakasaki, Hiroyuki 75
Noguera, José M. 152
Okazaki, Makoto 63
Park, Hyunwoo 38
Quasthoff, Matthias 15
Salvachúa, Joaquín 162
Sato, Yuki 75
Shadbolt, Nigel 1
Song, Mi-Young 46
Tapiador, Antonio 162
Utsuro, Takehito 75
Yokomoto, Daisuke 75
Yoshinaka, Takayuki 88
Zimmermann, Jürgen 15