D9.2: Benchmark Report on Structural, Behavioural and Linguistic Signifiers of Community Health

EC Project 257859
Risk and Opportunity management of huge-scale BUSiness communiTy cooperation (ROBUST)

30th October 2012
Version: 1.0

Toby Mostyn (toby@polecat.co)
Polecat Limited, Garden Studios, 71-75 Shelton Street, Covent Garden, London WC2H 9JQ

Dissemination Level: PU – Public

© Copyright Polecat Ltd and other members of the EC FP7 ROBUST project consortium (grant agreement 257859), 2012

Executive Summary

This is a detailed report of deliverable 9.2. Work package 9 focuses on public data, and on technology-orientated fora and wikis. Specifically, D9.2 states: "Benchmark report on structural, lexical and behavioural indicators of community health: a full report including benchmarks on structural, lexical and behavioural indicators of community health."

The report is a deliverable of WP9 task T9.2, which states: "This task will create benchmarks for structural, behavioural and lexical signifiers of health vis-à-vis the desired output (KPIs) of a community, through comparison between multiple communities, to surface trends, patterns and outliers in community interaction set against the context of the purpose of the community." In particular, it will include:

- Identify baseline indicators of community success (KPIs) from community owners, such as thread generation, email receipts, core code changes, or unique users.
- Analyse community discussion for sentiment, opinion and motivation lexical signifiers. Create an ontology of lexical signifiers.
- Analyse the behaviour of community members and assign role motifs using psychographic profiling.
- Produce a report showing the connection between lexical, behavioural and structural signifiers. These connections will be within the context of risk management, productivity, strength of relationship between members, etc.

This delivery will feed WP3 [T3.1, 3.2, 3.3, 3.4] – modelling of motivation incentives to promote healthy community interaction based on healthy behavioural, lexical and structural signifiers.

Table of Contents

1. Introduction
2. Overview of the Document
3. Lexical Signifiers
   3.1. Psycholinguistic signifiers
      3.1.1. Success Language Analysis
      3.1.2. Motivation Language Analysis
      3.1.3. Action Language Analysis
      3.1.4. Indicators of less healthy language
      3.1.5. User type analysis
   3.2. Sentiment Based Signifiers
      3.2.1. Healthy/Unhealthy Language Analysis
      3.2.2. Automated Classification
   3.3. Topic Based Signifiers
      3.3.1. Topic by health (derived)
      3.3.2. Topic by health (explicit)
4. Behavioural Signifiers
   4.1. Identification of health indicators
   4.2. Measurement of user behaviour
   4.3. Discovering user roles
   4.4. Analysing role/health relationship
5. Structural Signifiers
      5.1.1. Community owner feedback on structural signifiers of health
      5.1.2. Analysis of TiddlyWiki structural factors
6. Project Integration
   6.1. The dJST topic by sentiment extraction model
   6.2. Graphic Equalizer
   6.3. Metaphor base visualisation
7. Conclusion
8. Appendix
   8.1. User classification training set errors
   8.2. Sub topic density using cosine similarity (example)
   8.3. Snapshots of the WikiGroup Network
   8.4. WikiGroup Network Statistics
List of Figures
List of Tables
List of Abbreviations
References

1. Introduction

This document examines some of the work done in the ROBUST project around signifiers of health in on-line communities, and provides benchmarks and empirical results for those signifiers where applicable (as the basis for further research). It examines three distinct types of signifiers, defined below:

Linguistic signifiers: These are indications of health that can be gleaned from the content of the interaction between users – in most cases, the posts of those users. The most basic example of a health signifier in this context is the sentiment of the text.

Behavioural signifiers: The behaviour of users is a strong indication of health, and can be expressed in a number of ways. For this reason, these signifiers overlap with both linguistic and structural data. An example of a behavioural signifier of health might be "engagement" – the proportion of all users that a given user has communicated with.
Structural signifiers: Typically, these are statistics that describe the community as a whole and can be analysed statistically. These are the signifiers that community owners generally know most about, and on which they currently assess the health of their community. An example of a structural signifier is the number of users in a community.

Against each of these signifiers, the document examines data from a number of different communities. Much of the analysis was against TiddlyWiki. The TiddlyWiki project (http://www.tiddlywiki.com/) is a development community; for the most part, the members of the community are a geographically disparate collection of programmers writing either the core TiddlyWiki application or plug-ins. TiddlyWiki was chosen in most instances for two reasons. Firstly, the dataset is a single community with a common purpose, so it provides a suitable test set for algorithms before they are run over much larger datasets containing many sub-communities (where verification of results is much more difficult). Secondly, the network structure consists (generally speaking) of a main development core community and sub-communities for plug-in development. Again, this makes it a good test set for network analysis, in part because it provides a useful baseline against which results from more complex communities can be compared.

Other communities used by the research within this document were larger and more complex, involving a myriad of sub-communities: IBM Connections, SAP SCN and _connect. The _connect community (https://connect.innovateuk.org/) is a forum of the Technology Strategy Board, part of the UK government department for Business, Innovation and Skills. The relevance of these communities lies not only in their data, but also in the fact that they are customers of Polecat, and therefore a potential point of exploitation.

The identification and analysis of signifiers in this document is the result of two main approaches. Firstly, Polecat gathered information by interacting directly with the community owners and analysing the content manually using specialist analysts. Secondly, various algorithms and software were tested and developed to extract further information.

2. Overview of the Document

The document is broken up into four main sections. Firstly, the work done around the research and benchmarking of linguistic indicators of health in communities is examined. This includes psycholinguistic profiling of community content, sentiment analysis of that content, and a description of two novel ways to understand the health within a community around a given topic. Secondly, the document examines the behavioural signifiers of health that were identified and the analysis performed around them. The third section focuses on structural signifiers of health, and examines both metrics gleaned from real-life communities and those extracted automatically from community content. Lastly, a section briefly describes some of the integration of this research into the ROBUST platform (and therefore the application of research likely to be exploited by Polecat).

3. Lexical Signifiers

3.1. Psycholinguistic signifiers
Building on previous research carried out in the field of psycholinguistics, and working closely with the TiddlyWiki community and its leaders, Polecat analysed the language of the community and developed a number of standard classes representing the various types of discourse found within it. This analysis was based on analysts within Polecat who have expertise in linguistics – this was not work carried out by algorithms, but the result of expert human judgement.

The analysts worked to the following methodology. Firstly, they examined the goals of both the community and the individuals within it. Oftentimes these aims overlapped; at other times they were observed to be at cross purposes to one another. For example, there were cases of individuals attempting to introduce new functionality to solve a particular problem they had, with little thought for the overall impact on the software as a whole. Secondly, the analysts examined the intent of the language with regard to other users (in relation to the goals identified in stage one). In other words, the analysis discarded any specific technical terms of the discourse, and focused instead solely on generic language around what the author of the post was trying to achieve. The rationale was that the output from an analysis of this type would be conceptually and lexically independent (as far as possible), and could be applied to any community that shared a common goal towards which the users were working. Thirdly, using these techniques, the analysts were able to identify the aforementioned classes and their linguistic properties. These classes were:

1. Language indicative of success
2. Language indicative of motivation
3. Language indicative of action
4. Language indicative of encouragement
5. Language indicative of negativity

Using these as a basis, the classes were refined into four basic lexicons to identify the types of behaviour typically found in such communities: a success lexicon, a motivation lexicon, an action lexicon and a lexicon indicative of negativity. Language indicative of encouragement was, after some analysis, deemed to be a sub-set of the motivation lexicon. Given the analysis, these classes were deemed the most representative of the health of a community, for those communities working towards a common goal (usually involving a tangible output).
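Although the lexicons themselves were hand-built by the analysts, applying them to a post is mechanical. The sketch below shows one plausible way such lexicons might be used to profile a post; the word lists here are illustrative stand-ins, not Polecat's actual lexicons.

```python
# Hypothetical fragments of the four lexicons; the real lists were
# hand-built by Polecat's linguistic analysts.
LEXICONS = {
    "success":    {"achieved", "works", "solved", "released"},
    "motivation": {"excited", "keen", "want", "hoping"},
    "action":     {"build", "tweak", "extend", "refactor"},
    "negativity": {"broken", "useless", "pointless", "annoying"},
}

def lexicon_profile(post):
    """Proportion of a post's tokens that fall in each lexicon."""
    tokens = post.lower().split()
    n = max(len(tokens), 1)
    return {name: sum(t in words for t in tokens) / n
            for name, words in LEXICONS.items()}

print(lexicon_profile("Excited to share the plugin I finally solved and released"))
```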
Below is further information on the analysis carried out to derive these lexicons.

3.1.1. Success Language Analysis

The language of success for communities was identified in three key areas: firstly, the success of the community as a whole, or of sub-sections of the community where relevant; secondly, technical success, such as the achievement of better software, new knowledge or novel applications; and thirdly, personal success for individual users – the achievement of personal challenges, or the meeting of a particular community user's needs. The aspects that make up these types of success can be seen below (Figure 1: Success language patterns). One area in isolation is not enough to define success within a community. For example, personal benefit at the expense of the community or the system would not be a success. Interaction with, and feedback from, the TiddlyWiki community suggested that success was dependent on all three territories, and the lexicon therefore had to reflect this. It should be noted, however, that although there is a linguistic emphasis on the success of a non-hierarchical community, success is also highly dependent on strong leadership and organisation, which ultimately protect the quality of the output. This means that the insight gleaned from analysing text for the language of success is limited, and is best understood in conjunction with other structural factors.

Figure 1: Success language patterns

3.1.2. Motivation Language Analysis

In the case of motivation, it was discovered that the technological, the personal and the collective patterns of communication are all important. Again, these patterns are cross-hatched; motivation is never just technological, personal or collective, but some combination of the three. Thus any lexicon tool needs to span the scope of these motivation types. In the case of technical communities, it was found that the identity of the individual, formed around personal motivations, is always made in relation to the collective identity of the TiddlyWiki community: "I am" what "it is". There is a mixture of high-altitude motivations, such as freedom, emancipation, creativity, affirmation and respect, together with ruthlessly pragmatic motivations around cost saving and software quality.

Figure 2: Motivating language patterns

3.1.3. Action Language Analysis

Action language was found to reveal a clear trajectory between the singular determined act and a more fluid engagement with a light, dynamic, agile, never-ending system. This takes the action language beyond the practical application of core skills: engineering/development in the case of the TiddlyWiki community. The most potent and compelling space lies in fluid engagement, where singular acts are transactional tools in this bigger picture. There is a strong correlation in the action language between the language of fluid engagement and the description of success as learning, adapting and configuring a system to meet your needs (and in the process helping to meet other people's needs, and extending, developing and maintaining the system and community). Action language is also a proxy language of belonging: a rite of passage into the community in which you display your ability through your technical eloquence.

3.1.4. Indicators of less healthy language

Less healthy language was found to be that which communicates closed rather than open positions, that is singular rather than multiple, that lacks a discourse of invention, creativity and innovation, and that is interested in what can be measured rather than what can be tested or tried out. Less healthy language also fails to recognise the dual-benefit principles of communities and focuses instead on altruism.
In a developer community, there also appears to be some debate on the recycling of "hacker" language; some people find it a potent term that aptly characterises the nature of fluid engagement when taken literally, whereas others see it as a derogatory or populist term that fundamentally misunderstands the process. Language that implies hobby/hobbyist can also be seen as derogatory or hierarchical (separating the user from the serious developer) and as negating the positive feedback flow principle.

Figure 3: Less healthy indicators

3.1.5. User type analysis

It should be noted that this user type analysis was not related to the role analysis performed by WP3 (which is described in more detail later in the document). That work was based on behavioural features of community users, whereas the techniques outlined below took a purely linguistic approach. As such, they provide a similar output via differing methods. The linguistic approach consisted, initially, of an identification of the types of users in the community by the Polecat analysts, based primarily on the lexical indicators discovered in the previous section.

Table 1: User types identified by linguistic analysis

Label             Description
Newbie            Members new to a community. They might also be new to online interaction.
Elder             Elders may not be held accountable to the same community norms or scrutiny as the other members. Elders can dominate new members with a few words, regardless of the value of the words of others around them.
Core participants There is usually a small group of people who quickly adapt to online interaction and provide a large proportion of an online group's activity. These individuals visit frequently and post often. They are important members.
Flamer            Flaming is defined as sending hostile, unprovoked messages. Name calling, innuendo and such are the tools of flamers. Flamers can also be the source of new ideas, however.

3.2. Sentiment Based Signifiers

3.2.1. Healthy/Unhealthy Language Analysis

Alongside the analysis of the types of language in postings, Polecat also benchmarked the general "health" of language for a technical community. Traditionally, sentiment ratings consist of negative, neutral and positive ratings (commonly, a Likert scale with a bipolar response to the statement "is this positive or negative?"). However, this approach is usually found lacking because it builds in subjectivity: whether something is positive or negative depends on who is reading the document, and on their reaction to it. By contrast, looking for healthy or unhealthy language makes the measure objective, and therefore more accurate as a single measure for a community. After analysis of the postings by the linguistic analysts, health was split into five bands (or ratings) and the data annotated accordingly:
Table 2: Health bands across communities

5 (v. healthy)   There is evidence of collective success, such as reciprocal trust between the participants and healthy feedback between the user and the developer. There is evidence of technical success and personal success. Often, the participant wants to share newly developed technology within the community. In other cases, the participant is mentoring, delivering full explanations and information along with plenty of encouragement to newer participants.
4 (healthy)      The conversation is full of information, questions and discussions, but the participants are not particularly excited. The success can be a mixture of one or two types of success, such as personal and technical, or technical and collective, etc.
3 (neutral)      The conversation is short (though sometimes a short conversation can be very healthy). The success is generally collective, with evidence of feedback. The conversation lacks enough information.
2 (unhealthy)    The conversation has some hint of dissatisfaction or criticism regarding the technology, a user, a developer and/or the community. The success is generally collective, but can sometimes be personal or technical, depending on the content.
1 (v. unhealthy) The conversation is simply rude and offensive about a participant, the community and/or the technology. There is generally no evidence of success; the conversation is negative.

3.2.2. Automated Classification

Using the analysis above, and the subsequently annotated documents, Polecat then trained several classifiers using different approaches, and benchmarked both the sentiment of the conversation and the user types previously identified. In terms of experimental set-up, the classifier was trained to discover two classes: healthy language and unhealthy language. This was deliberately distinct from classes of positive and negative, because it removes the implicit subjectivity of those terms and instead presents the classification as an objective metric. The training data for both classifications was selected by the Polecat analysts, and consisted of around 300 postings in each class. These postings were taken from the TiddlyWiki community between its inception in 2005 and 2011. A further 182 postings were selected by the Polecat analysts as testing data, and annotated as either healthy or unhealthy. Although the classifier actually returns probabilities of a posting being in one class or the other, the training data simply assigned the documents as either healthy or unhealthy. Because there were only two classes in the classification, a document was considered to belong to a particular class if it had a probability greater than 0.5. The training data was run against six classifiers and the accuracy of each was assessed. These algorithms were: balanced winnow, C4.5, decision tree, maximum entropy, MCMaxEnt and Naive Bayes. The feature set was entirely linguistic and contained no structural information.
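As an illustration of this protocol, the sketch below reproduces the set-up with one of the six algorithms (Naive Bayes), assuming scikit-learn rather than the toolkit actually used; the in-line texts are placeholders for the annotated postings.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score

# Stand-ins for the ~300 annotated postings per class and the 182 test postings.
train_texts = ["thanks, this plugin solved my problem perfectly",
               "this release is broken and nobody ever replies"]
train_labels = [1, 0]                          # 1 = healthy, 0 = unhealthy
test_texts, test_labels = ["really helpful, clear explanation"], [1]

vec = CountVectorizer()                        # purely linguistic features, no structure
clf = MultinomialNB().fit(vec.fit_transform(train_texts), train_labels)

# The classifier returns class probabilities; a post is assigned to a class
# when its probability exceeds 0.5.
probs = clf.predict_proba(vec.transform(test_texts))[:, 1]
pred = (probs > 0.5).astype(int)

healthy_p = precision_score(test_labels, pred, pos_label=1, zero_division=0)
unhealthy_p = precision_score(test_labels, pred, pos_label=0, zero_division=0)
overall = accuracy_score(test_labels, pred)    # the "All" column below
```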
The results of the classification are shown below. The evaluation shows, for each classifier, the precision of the healthy class and the precision of the unhealthy class. Precision has been calculated here as the number of documents classified correctly for the given class, divided by the total number of documents assigned to that class (i.e. true positives / (true positives + false positives)). The "All" column shows the overall accuracy of the classifier: simply the number of correct classifications as a fraction of the number of classifications made.

Table 3: Sentiment classifier results

Classifier        Healthy      Unhealthy    All
Balanced Winnow   1.000000000  0.000000000  0.679558011
C4.5              0.333333333  0.706896552  0.453038674
Decision Tree     0.398373984  0.724137931  0.502762431
MaxEnt            0.650406504  0.844827586  0.712707182
MCMaxEnt          0.943089431  0.379310345  0.762430939
Naïve Bayes       0.918699187  0.448275862  0.767955801

As can be seen from the table above, Naïve Bayes scored best overall in this classification, although it had poor precision for the unhealthy class. By contrast, MaxEnt had lower overall accuracy, but showed better results as an average across the healthy and unhealthy precisions.

This classification was run across a number of the most influential forums collected by Polecat and made available to the ROBUST project. As a benchmark, the results of the classification are shown in the table below, to show the expected positive and negative distributions for on-line forums.

Table 4: Sentiment split for on-line forums

Forum                 Positive  Negative  % Positive  % Negative
ASP.NET               585       693       0.46        0.54
Android Forums        500       535       0.48        0.52
Digital Point Forums  420       521       0.45        0.55
Electronic Arts UK    311       354       0.47        0.53
NoNewbs               86        77        0.53        0.47
Tech Support Guy      513       595       0.46        0.54
TechArena             466       525       0.47        0.53
Ubuntu Forums         369       471       0.44        0.56
iPhone Dev SDK        236       266       0.47        0.53
XDA Developers        613       811       0.43        0.57

Finally, Polecat performed classification against the user types it had identified. Across the TiddlyWiki community, the following classifications were extracted for the various algorithms. For reference, the classification error rates during training are included in the appendix (section 8.1).

Table 5: Classified user types for the TiddlyWiki community

Algorithm        Newbie  Core Participant  Elder
C4.5             2675    159               139
Decision Tree    1       2805              167
Maximum Entropy  268     2557              148
Naive Bayes      787     1884              302

3.3. Topic Based Signifiers

All of the work above is focused on the health of a community. By contrast, Polecat's specific use case means that it is focused also on metrics specifically tailored to the information needs of large corporations and decision makers. On-line community data provide essential channels for companies to monitor the way they and their products are being discussed, allowing them to react accordingly. Moreover, such data also provide essential information around entire sectors and industries, informing strategic decision making. Needless to say, the volume of this data is often prohibitively large, meaning key messages and indicators can be missed. Analysing the health of the community from the perspective of a particular external party, rather than from that of the internal members, is therefore a key metric, providing essential information to these companies. Typically, this is an information retrieval scenario in which the user (in this case the company) wishes to search communities and view the discussion around themselves, to understand the health of this conversation, allowing them to react accordingly.
The work above already allows these users to query community data and get a sense of the health of the discussion. However, users often require more granular results. One important filter is that of topic: what is the health in the community around company A and topic B? There are two approaches to this. The first is the derivation of a topic model from the data, to see the topics that are being discussed and the health of each. The second, less well explored, approach is finding the community health around explicitly defined topics.

3.3.1. Topic by health (derived)

WP3 have proposed a dynamic joint sentiment-topic model (dJST), based on Latent Dirichlet Allocation (LDA), which allows the detection and tracking of views of current and recurrent interests, and of shifts in topic and sentiment. This means that users and readers of a community are able to easily see the health around particular subjects, and thus gain a deeper insight into the discourse. The dJST model makes the assumption that documents at the current epoch are influenced by documents in the past. Therefore, the sentiment-topic word distributions are generated from the word distributions calculated previously by the model. This can be done using three different techniques:

- A sliding window, where the current sentiment-topic word distributions are dependent on the sentiment-topic specific word distributions of the last S epochs.
- A skip model, where historical sentiment-topic word distributions are considered by skipping some epochs in between.
- A multi-scale model, where previous long- and short-timescale distributions are both taken into consideration.

As data is received in a stream, results are buffered until the end of the specified epoch (an epoch is defined as a number of milliseconds, as a configurable parameter of the system). At that point, a model is extracted, in which documents are represented as term vectors. Because of the assumption that documents at the current epoch are influenced by documents in the past, the sentiment-topic specific word distributions at the current epoch are generated according to the word distributions at previous epochs.
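As a rough sketch of the sliding-window variant, the influence of past epochs can be thought of as a weighted mixture of the word distributions from the last S epochs; in the actual model these mixtures parameterise Dirichlet priors rather than being used directly, and the weights shown here are illustrative.

```python
import numpy as np

def sliding_window_prior(past_phis, weights):
    """past_phis: list of S arrays of shape (sentiments, topics, words),
    the sentiment-topic word distributions from the last S epochs.
    weights: mixture weights for those epochs (e.g. decaying with age)."""
    prior = sum(w * phi for w, phi in zip(weights, past_phis))
    return prior / prior.sum(axis=-1, keepdims=True)  # renormalise over words

# Example: three past epochs over 2 sentiment labels, 4 topics, 5 words.
phis = [np.random.dirichlet(np.ones(5), size=(2, 4)) for _ in range(3)]
prior = sliding_window_prior(phis, weights=[0.5, 0.3, 0.2])
```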
Evaluation dataset

The dataset for this evaluation was review documents from the Mozilla Add-ons web site between March 2007 and January 2011. These reviews are about six different add-ons: Adblock Plus, Video DownloadHelper, Firefox Sync, Echofon for Twitter, Fast Dial, and Personas Plus. All text was converted to lowercase and non-English characters were removed. Documents were further pre-processed by stop-word removal (based on a stop-word list) and stemming. The final dataset contains 9,114 documents, 11,652 unique words, and 158,562 word tokens in total. The unit epoch was set to quarterly, giving a total of 16 epochs, and the total number of reviews for each add-on was plotted against the epoch number. It can be observed that at the beginning there were only reviews on Adblock Plus and Video DownloadHelper. Reviews for Fast Dial and Echofon for Twitter started to appear at Epochs 3 and 4 respectively, and reviews on Firefox Sync and Personas Plus only started to appear at Epoch 8. The review occurrences have a strong correlation with the release dates of the various add-ons. We also notice a significantly high volume of reviews about Fast Dial at Epoch 8. As for the other add-ons, reviews on Adblock Plus and Video DownloadHelper peaked at Epoch 6, while reviews on Firefox Sync peaked at Epoch 15.

Figure 4: dJST by number of reviewers

Each review is also accompanied by a user rating on a scale of 1 to 5. This user rating represents the quality of the user's experience using the plug-in, ranging from 1 for a negative experience to 5 for a very positive experience. The average user ratings across all the epochs for Adblock Plus, Video DownloadHelper and Firefox Sync are 5-star, 4-star and 2-star respectively. The reviews of the other three add-ons have an average user rating of 3-star.

Figure 5: dJST by average rating

Word polarity prior information was incorporated into model learning, with polarity words extracted from two sentiment lexicons: the MPQA subjectivity lexicon (http://www.cs.pitt.edu/mpqa/) and the appraisal lexicon (http://lingcog.iit.edu/arc/appraisal_lexicon_2007b.tar.gz). These two lexicons contain lexical words whose polarity orientations have been fully specified. Words with strong positive and negative orientation were extracted; the final sentiment lexicon consists of 1,511 positive and 2,542 negative words.

Evaluation metrics and results

The model was evaluated using two metrics:

- Predictive perplexity. This is defined as the reciprocal geometric mean of the likelihood of a test corpus given a trained model's Markov chain state (see the formulation below). Lower perplexity implies better predictiveness, and hence a better model.
- Sentiment classification. Document-level sentiment classification is based on the probability of a sentiment label given a document. For the data used here, since each review document is accompanied by a user rating, documents rated 4 or 5 stars were considered true positives and other ratings true negatives.
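The deliverable does not reproduce the equation, but in its standard per-word form (consistent with the definition above) the predictive perplexity of a held-out corpus of D documents, with token sequences w_d of length N_d, is:

```latex
\mathrm{perplexity}(D_{\mathrm{test}})
  = \exp\!\left(-\,\frac{\sum_{d=1}^{D} \log p(\mathbf{w}_d)}
                        {\sum_{d=1}^{D} N_d}\right)
```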
Evaluation performed by WP5 as part of D5.2 studied the influence of the topic number settings on the dJST model's performance, in comparison with other models. With the number of time slices fixed at S = 4, the topic number was varied. Figure 6 shows the average per-word perplexity over epochs with different numbers of topics. JST-all has higher perplexities than all the other models, and its perplexity gap with the dJST models increases as the number of topics increases. All the variants of the dJST model have fairly similar perplexity values, and they outperform both JST-all and JST-one. Figure 6 also shows the average document-level sentiment classification accuracy over epochs with different numbers of topics. dJSTs outperform JST-one, with skip-EM and multiscale-EM having similar sentiment classification accuracies to JST-all beyond topic number 1. Also, setting the number of topics to 1 achieves the best classification accuracy for all the models. Increasing the number of topics leads to a slight drop in accuracy, though it stabilises at topic number 10 and beyond for all the models. Nevertheless, the drop in sentiment classification accuracy from modelling more topics is only marginal (about 1%) for sliding-EM and skip-EM.

Figure 6: dJST perplexity and classification

The conclusion from WP5's evaluation was that both the skip model and the multi-scale model achieve similar sentiment classification accuracies to JST-all, but avoid taking all the historical context into account and hence are computationally more efficient. On the other hand, dJST models outperform JST-one in terms of both perplexity values and sentiment classification accuracies, which indicates the effectiveness of modelling dynamics.

3.3.2. Topic by health (explicit)

Users, in the Polecat use case, are often more interested in tracking the health of a community conversation around a specific topic. For example, Polecat recently did some work with the Irish government, who were trying to set an agenda for the Irish Economic Forum. They wanted to understand the main areas of the Irish economy that were being discussed, and the health of these areas, so that they could make the forum as relevant as possible. Querying using the Boolean expression "Ireland and economy" fails here for a number of reasons. Firstly, recall is affected because many documents related to the economy will not mention the term "economy". Secondly, the results give no indication of which specific sub-topics of the economy are most and least important. A derived topic model does not generally add information here; generally speaking, the main topics around which companies are discussed are well known. By contrast, the health around less discussed topics might be of more interest; identifying unhealthy topics allows companies to react accordingly, either by changing policy or by engaging with the community to allay negative sentiment.

The aim of the research, therefore, is the creation of a retrieval system that allows users to query data in the traditional manner, but additionally allows the user to specify a topic and see the density of this topic over the result set, as well as the density of the most pertinent sub-topics. Both topic and query are provided as keywords, and the techniques below can both be thought of as a form of query expansion using one of two techniques: the use of explicit word lists to describe a topic, or the automatic creation of a word list. To meet this use-case requirement of identifying and analysing explicit topics, Polecat developed a number of techniques.

Retrieving documents by topic query

Traditional information retrieval systems use simple queries. Whilst some techniques are often applied to the query before the search is submitted, such as disambiguation or spelling correction, there is little attempt to treat any query word as a topic or subject and retrieve documents that do not directly include the term. This is partly because it is very difficult to anticipate when a user has specified a term as a precise query, or as a topic. However, the Polecat use case shows a customer requirement for topic searching, so treating certain query terms as topics is an essential feature. It should be noted here that this is not a traditional query expansion problem (though it is certainly related).
Query expansion has tended to focus on discovering synonyms, discovering alternative term morphologies, or fixing spelling errors (see http://en.wikipedia.org/wiki/Query_expansion). By contrast, this research aims to retrieve documents for an entire topic; for example, a user interested in the topic of "obesity" may be interested in results concerning "heart disease". Heart disease is a concept related to obesity, but certainly not a synonym. Polecat has found no research to date where query expansion is used to discover a topic.

The initial problem was to define the concept of a "topic", and how a topic could be represented as a bag of words. Research into explicit semantic analysis [1] suggests that single Wikipedia pages can be treated as semantic concepts (referred to hereafter as "concepts"). In that research, the semantic similarity of documents was calculated by finding the Wikipedia pages for the major words in each document, and creating a vector of cosine similarity scores for each document against each of the discovered concepts. The similarity of the vectors thus gave the semantic similarity. Building on the positive results of this research, Polecat created a graph of the entire set of English Wikipedia pages. This graph has the structure shown in Figure 7.

Figure 7: Graph structure for Wikipedia topic service

When a user specifies a topic term, such as "obesity", this term is sent to the topic service. This service then finds the Wikipedia page that best matches the topic term, in the following way:

1. If the topic term matches a concept exactly, return that concept.
2. Disambiguate the topic term (Polecat imported the Wikipedia disambiguation data and created a lookup service). If the disambiguated term matches exactly, return that concept.
3. Look for redirects for the topic term (Polecat imported the Wikipedia redirect data and created a lookup service). If the redirected term matches exactly, return that concept.
4. Search the content of each concept. Return the highest-ranking concept (dependent on the implemented ranking algorithm).

Once the topic has been identified, the service returns the TF-IDF (http://en.wikipedia.org/wiki/Tf%E2%80%93idf) vector for that concept. This vector is used to perform query expansion on the topic term, meaning that the final query submitted to the search is:

<original-query> AND (<topic-term> OR <tf-idf-term1> OR … <tf-idf-termN>)
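Read as pseudocode, the resolution cascade and expansion might look like the sketch below; the `wiki` object and its lookup methods are hypothetical stand-ins for Polecat's imported disambiguation, redirect and content-search services.

```python
def resolve_concept(term, wiki):
    """Resolve a topic term to a Wikipedia concept via the four-step cascade."""
    if wiki.exact_match(term):              # 1. exact title match
        return term
    page = wiki.disambiguate(term)          # 2. disambiguation lookup
    if page:
        return page
    page = wiki.redirect(term)              # 3. redirect lookup
    if page:
        return page
    return wiki.content_search(term)[0]     # 4. best-ranked content match

def expand_query(query, topic_term, wiki, n_terms=10):
    """Build the final Boolean query from the concept's top TF-IDF terms."""
    concept = resolve_concept(topic_term, wiki)
    terms = wiki.top_tfidf_terms(concept, n_terms)
    clause = " OR ".join([topic_term] + terms)
    return f"{query} AND ({clause})"
```

The default of 10 terms here reflects the 10-12 term optimum reported in the evaluation below.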
There is no empirical evidence from previous research into using Wikipedia for query expansion as to how many terms should be used in the expansion itself. Therefore, evaluation was performed for 1 to 30 terms in the expansion, against a gold standard created by the Polecat analyst team on the community data pulled every day by Polecat as part of the MeaningMine product. Abbreviated recall, precision and f-measure statistics are shown below for expansion at increments of 5 up to 30 terms; although the results are shown at these increments, the evaluation was also run using term expansion at each interval between them.

Table 6: Precision for topic query expansion

No. of expansion terms    1     5     10    15    20    25    30
Cardiac innovation        0.17  0.15  0.19  0.16  0.08  0.08  0.08
Offshore gas              1.00  0.53  0.52  0.39  0.39  0.39  0.39
Energy census             0.97  0.32  0.90  0.59  0.59  0.39  0.32
Energy appliances         0.37  0.34  0.15  0.14  0.13  0.10  0.08
Shell arctic              1.00  0.53  0.44  0.37  0.30  0.29  0.28
Cyber security            1.00  0.15  0.14  0.08  0.06  0.04  0.03
CO2 and climate change    0.86  0.26  0.26  0.27  0.22  0.25  0.18
Intellectual property     0.86  0.37  0.32  0.26  0.19  0.17  0.16
Innovation in Ireland     1.00  0.98  0.19  0.18  0.10  0.08  0.06
Aging population          1.00  0.22  0.01  0.01  0.01  0.01  0.01

Table 7: Recall for topic query expansion

No. of expansion terms    1     5     10    15    20    25    30
Cardiac innovation        0.44  0.46  0.58  0.79  0.81  0.78  0.77
Offshore gas              0.92  1.00  1.00  1.00  1.00  1.00  1.00
Energy census             0.01  0.01  0.33  0.35  0.35  0.37  0.37
Energy appliances         0.74  0.77  0.82  0.82  0.84  0.87  0.88
Shell arctic              0.76  0.81  0.82  0.83  0.84  0.84  0.84
Cyber security            0.09  0.52  0.52  0.89  0.89  0.91  0.92
CO2 and climate change    0.12  0.44  0.46  0.48  0.61  0.78  0.88
Intellectual property     0.61  0.37  0.32  0.26  0.19  0.17  0.16
Innovation in Ireland     0.58  0.58  0.60  0.60  0.63  0.66  0.70
Aging population          0.01  0.01  0.18  0.13  0.14  0.14  0.13

Table 8: F-Measure for topic query expansion

No. of expansion terms    1     5     10    15    20    25    30
Cardiac innovation        0.24  0.23  0.29  0.27  0.15  0.15  0.15
Offshore gas              0.96  0.69  0.68  0.56  0.56  0.56  0.56
Energy census             0.01  0.02  0.48  0.44  0.44  0.38  0.35
Energy appliances         0.49  0.47  0.25  0.24  0.22  0.18  0.14
Shell arctic              0.86  0.64  0.57  0.51  0.44  0.43  0.42
Cyber security            0.17  0.24  0.22  0.15  0.12  0.07  0.06
CO2 and climate change    0.22  0.33  0.34  0.35  0.33  0.38  0.30
Intellectual property     0.71  0.37  0.32  0.26  0.19  0.17  0.16
Innovation in Ireland     0.74  0.73  0.29  0.28  0.18  0.15  0.12
Aging population          0.02  0.03  0.02  0.01  0.01  0.01  0.01

As expected, recall tends to increase as the expansion grows (although this is not the case for all queries). For the most part, there is little significant increase in recall beyond an expansion to 15 terms. By contrast, precision falls more rapidly, tending to bottom out for most queries at an expansion of around 10 terms. As a result, identifying the most suitable number of expansion terms has been difficult. As can be seen, the f-measure does not give a clear signal on this, because of the variance between different topics. However, an expansion to between 10 and 12 terms appears to be optimal (because it gives the highest score based on the average variance from the mean of the f-measure for each topic).

Calculating Topic Density

As well as refining the document set in order to give the user a more granular view of community health, Polecat also developed methods to measure the density of topics over a given result set. This allows any health metric to be correctly assessed in terms of its impact: simply seeing the number of matching documents gives no indication of how prevalent the topic is over the documents themselves, nor of how correlated the query and the topic are. This research was performed in two stages.

The first stage used lists of terms that described a topic, rather than extracting a topic automatically as in the technique described above. This allows analysts and users to have full control over the topics they are querying for, by creating relevant word lists tailored to their exact information need.
There are instances where customers are searching for topics that either cross a number of Wikipedia "concepts", or represent a sub-set of one. Given that the user or analyst creates the relevant terms in this scenario, the research challenge was to identify the algorithm that best calculates the density of a topic, given a user query. More formally: given a user query q, how can the density of the topic T (represented by a term vector) be calculated over the documents D, taking into account the relevance (score) of each document to the query?

The density of the terms in this taxonomy was tested with three techniques: simply counting the terms, using a TF-IDF metric, and using an altered implementation of BM25. This density was weighted for each document based on its correlation to the original query. Further insight was added by finding the relative density of the topic, i.e. how dense the topic was for this query compared to how dense it is against the background corpus. To calculate this, the density of the topic was recalculated using the same technique, replacing the original query with a stop word to represent a background distribution. Query scores were profiled, the distribution was found to be exponential, and this distribution was used to calculate the relative density of the topic against the query in comparison to its expected density. This was done because some topics have almost no density, so any mention of them is significant; similarly, other topics are discussed frequently, so a high density does not represent a significant statistic. The background distribution allows the discovered density of a topic to be normalised against its expected density.
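A minimal sketch of this first-stage computation is given below, using the plain term-count variant (the TF-IDF and BM25 variants replace the per-document density line). The exact weighting formula was not preserved in the report, so the score-weighted average used here is an assumption.

```python
def topic_density(results, topic_terms):
    """results: (tokens, score) pairs, where score is the document's
    relevance to the original query q; topic_terms: the word list for T."""
    weighted, total = 0.0, 0.0
    for tokens, score in results:
        hits = sum(1 for t in tokens if t in topic_terms)  # plain count variant
        density = hits / max(len(tokens), 1)
        weighted += score * density                        # weight by query relevance
        total += score
    return weighted / total if total else 0.0

# Relative density: running the same computation with the query replaced by
# a stop word gives the background density; comparing the two normalises the
# topic's observed density against its expected density.
```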
The data used in the evaluation consisted of human judgements by the Polecat analyst teams for ten queries, using two taxonomies, over community data collected by Polecat. This data was all of the community data collected by Polecat in the period 01-05-2012 to 31-05-2012, and included a variety of traditional communities, but did not include social media data. The analysts examined the documents in the result set for the ten queries and, for each query, estimated the total percentage of the conversation that was around the given topic (called here the "topic density") compared to the density of that topic in the background corpus. For example, the analysts suggested that the topic "energy reputation" was discussed in documents matching the query "Nigeria" around 60% more than would be expected for the entire corpus. Figure 8 and Figure 9 show the human judgement scores for the five topic density scores assigned to the two word lists by the analysts, and the results for the different density functions.

Figure 8: Topic density for taxonomy "energy reputation" (human judgement vs. the count, TF-IDF and BM25 density functions, for the five judged scores)

Figure 9: Topic density for taxonomy "financial distress" (human judgement vs. the count, TF-IDF and BM25 density functions, for the five judged scores)

The second stage added further capability to this functionality: instead of using a word list created by the user, it built on the document-searching technique by using the aforementioned Wikipedia service to discover topics. This technique allows the user to measure topic density given only a single topic term, and to find relevant sub-topics and measure the density of these sub-topics. This involves two stages.

a - Finding the most relevant sub-topics: The most relevant sub-topics are discovered by querying the Wikipedia topic service. Firstly, this gets the Wikipedia page that best matches the topic word, using the steps outlined in the previous section. It then uses an algorithm to find the most relevant connected pages, and the TF-IDF vectors from these pages form the sub-topics. Polecat tested various techniques for discovering the best sub-topics: page rank, shared categories, document similarity and number of shared links. The test data set was built from human judgements by the Polecat analyst team. Three analysts were each presented with 15 topic terms (page names or "concepts" in Wikipedia), together with every associated topic. Here an associated topic was either a page linked to from the Wikipedia page of the original topic, or a category that had an associated page. The analysts then selected, for each topic, an ordered list of the twenty most closely associated topics. These judgements were then merged into a single list using a simple scoring technique (any time a sub-topic was selected, it was assigned a ranking score: 20 at position 1, decrementing by one thereafter; each sub-topic then received a score that was the sum of these individual ranking scores, and a new ranked list was compiled; where a tie occurred, the judgement made by the senior member of the team was favoured). This produced a single ranked judgement of the most closely associated sub-topics for each topic.

In order to run the experiment, Polecat then calculated a ranked list of the most closely associated sub-topics for each of the given topics using the four techniques. These ranked lists were then compared against the human judgement using the three metrics shown in the table below: the number of shared elements between the generated lists (column "A" for each algorithm), Spearman's rank correlation coefficient between the two lists (column "B" for each algorithm), and the Jaccard distance between the two lists (column "C" for each algorithm). For example, in the table below, the Jaccard distance between the ranked list of sub-topics extracted by the document similarity algorithm and the ranked list of sub-topics from the human judgement is 0.34.
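For reference, the three comparison measures might be computed as below, in their standard forms (assuming SciPy); the report's exact normalisation of the Jaccard distance, given lists of differing lengths, is not recoverable from the extracted table.

```python
from scipy.stats import spearmanr

def compare_rankings(human, generated):
    """A/B/C measures comparing a generated ranking against the human one."""
    shared = [t for t in generated if t in human]
    a = len(shared)                                  # A: number of shared elements
    b = float("nan")
    if a > 1:                                        # B: Spearman's rho over shared items
        b = spearmanr([human.index(t) for t in shared],
                      [generated.index(t) for t in shared]).correlation
    inter = len(set(human) & set(generated))         # C: Jaccard distance of the two lists
    union = len(set(human) | set(generated))
    c = 1 - inter / union
    return a, b, c

print(compare_rankings(["diet", "heart disease", "bmi"],
                       ["heart disease", "sugar", "bmi"]))
```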
An analysis of the results shows that, for each of the three measures (shared elements, Spearman's rank correlation coefficient, and the Jaccard distance), document similarity had the most accurate results. However, further investigation suggests that the best technique may depend on the type of topic being queried. The next stage of the research looks to classify topic types and select sub-topics using the algorithm most suitable to the class of topic specified.

Table 9: Evaluation of sub-topic retrieval algorithms

[Table: for each of the fifteen test topics (biofuel, health, sustainability, insurance, fraud, hydraulic fracturing, palliative care, pension, nigeria, obesity, old age, bankruptcy, business intelligence, climate change, innovation), three measures are reported for each of the four algorithms (category association, page rank, document similarity, shared links): A = the number of links that the human judgement and the algorithm results share; B = Spearman's rank correlation coefficient between the human judgement and the algorithm results; C = the Jaccard distance between the human judgement and the algorithm results.]

b - Calculating the density of the sub-topics: This was carried out using two techniques. The first, in order to compare with the baseline recorded in the section above, used the amended BM25 algorithm. In an effort to improve on this, and given the use of Wikipedia data, the second built on the results from the original research into explicit semantic analysis and calculated the cosine similarity of each document against the TF-IDF vector for that concept. More formally, the topic T is represented by the TF-IDF vector t of its concept page, and the density of T over a document is the cosine similarity between t and the document's term vector.
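A sketch of this second technique follows, assuming each document is already represented as a vector in the concept's term space; aggregating the per-document similarities by their mean is an assumption, as the report does not state how the per-document scores were combined.

```python
import numpy as np

def cosine_density(doc_vectors, concept_vector):
    """Mean cosine similarity between each document vector and the
    TF-IDF vector t of the Wikipedia concept representing topic T."""
    t = concept_vector / np.linalg.norm(concept_vector)
    sims = [d @ t / np.linalg.norm(d) for d in doc_vectors if np.linalg.norm(d)]
    return float(np.mean(sims)) if sims else 0.0
```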
To date, the human judgements have not yet been completed, so definitive results are not yet available. The experiment is set up as follows: ten topics have been identified, and the ten most closely associated sub-topics were selected using the document similarity technique described above. These sub-topics have been given to the Polecat analyst team, who have been asked to estimate the density of these topics over the document results for a given query. It is then planned that the density calculated by the algorithm described above can be compared with this human judgement, and the efficacy of the technique assessed. It may well be the case that the sub-topics on which the assessment is made are not the best sub-topics to choose given the contents of the documents in the result set. However, the aim of this experiment is not to assess the quality of the selected sub-topics, only to measure the accuracy of the density estimated by the algorithm.

Application and feedback

Alongside the evaluation of algorithms described above, Polecat also developed a test application for topic searching over community data. This allows the user to query data with a standard query plus a topic query (denoted with a leading slash). Results that match this query and topic are retrieved, and the density of the sub-topics is also visualised. A screenshot is shown below (Figure 10).

On the right of the page, the documents that match the query and topic term are displayed, and the user is able to page through the results. On the left, three graphs are shown. At the top is a radar diagram; on its outer axis are each of the selected sub-topics for the topic specified in the query. The red area displays the density of each sub-topic over the documents retrieved by the query. It should be noted, however, that the density score is normalised by the visualisation, so that the largest of these is always shown at maximum density. It therefore gives no indication of actual density, only of density relative to the other sub-topic densities. To display the overall density of the topic relative to the density seen in the background corpus (and therefore put the radar chart in context), a speedometer chart is displayed below it. The axis of this chart runs from -10 (meaning that the topic is not discussed at all), through 0 (meaning the topic is discussed exactly as would be expected), to 10 (meaning that the topic is discussed a great deal more than expected; more precisely, the topic density is at or above 2.5 standard deviations from the mean of the background distribution). A line chart is shown alongside the speedometer; given that the data displayed is temporal, this simply shows the number of posts that match the query for each day in the specified time period.

Figure 10: Feedback prototype for topic density evaluation

Each action from the user is logged in the application, so that an accurate picture of usability is built up. Further, users are able to feed back on a number of aspects of the results, namely:

- whether a given retrieved document is considered irrelevant;
- whether a sub-topic is not relevant;
- whether the density for a given sub-topic looks incorrect.

This feedback will provide further judgements against the results. It is envisaged that the most useful aspect of this will be the assessment of the selected sub-topics. It is the opinion of Polecat that the technique for selecting the most relevant and closely associated sub-topics for any given topic differs depending on the type of topic. Anecdotally, it has been noticed that the degree of document similarity between a topic and its sub-topics varies according to what the topic is, suggesting that this technique is limited in its utility. For example, document similarity has been seen to work poorly for quite technical topics. The topic "computer programming" returns sub-topics such as "Method", which is certainly a topic associated with programming, but not one in which the user is likely to be interested.
To date, the application is still being evaluated. Polecat plans to exploit the capabilities of this application provided the feedback is deemed to have reached a sufficient level of quality. This exploitation will be in two phases. The first stage is to allow the user to query using a topic term (probably using the slash notation); one of the major difficulties users have with MeaningMine is identifying the dataset in which they are interested, so this will provide a significant step forward in unlocking the potential of the product. The second stage is to add the sub-topic densities as an insight into the product. (MeaningMine presents a number of ways of summarising the result-set from a query via various facets of extracted information, such as topic models and sentiment; these are termed "insights" within the product.)

4. Behavioural Signifiers

Identifying behavioural signifiers of health within communities was work predominantly carried out in WP3 (by partner OU) as part of D3.2 [2]. This work focused on detecting and understanding the correlation between a community's social behaviour and its overall health. The assumption was that the type and composition of the behaviour roles exhibited by the members of a community (e.g. experts, novices, initiators) could be used to forecast changes in community health. The main research question was therefore framed as: "can we accurately and effectively detect positive and negative changes in community health from its composition of behaviour roles?" Behavioural health was thus understood in terms of the actions and interactions of users with other community users; the role that a user assumes is the label associated with a given type of behaviour. Roles were identified by a set of behaviours or interactions, such as engagement, contribution, popularity and participation.

4.1. Identification of health indicators

The work identified four health indicators, based on previous research: loyalty, participation, activity and social capital. These correspond to four specific sets of community features (a sketch of their computation follows the list):

1) Churn rate: the number of users that have posted in the community for the final time, as a proportion of all users that posted in the same time period.
2) User count: the number of users that posted in the community at least once in a given time period.
3) Seed/non-seed proportion: the number of seed posts (thread starters) that generate at least one reply, as a proportion of seed posts that generate no replies.
4) Clustering coefficient: the average network degree of users in the graph, as an indication of how inter-connected users are.
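For concreteness, the sketch below (Python) computes the first three indicators from a minimal log of posts. The record structure and the single-window simplification are illustrative assumptions, not the schema used by WP3.

    # Sketch: three of the four health indicators from a simple post log.
    # Each post is a dict with "user", "thread", "is_seed" and "time" keys.
    from collections import defaultdict

    def health_indicators(history, window_start, window_end):
        in_window = [p for p in history
                     if window_start <= p["time"] < window_end]
        users = {p["user"] for p in in_window}
        # Churn rate: users whose final post (over the whole history) falls
        # inside this window, as a proportion of users active in the window.
        last_post = {}
        for p in history:
            last_post[p["user"]] = max(last_post.get(p["user"], p["time"]),
                                       p["time"])
        churned = {u for u in users if last_post[u] < window_end}
        churn_rate = len(churned) / len(users) if users else 0.0
        # Seed/non-seed proportion: seeds with at least one reply against
        # seeds with none.
        replies, seeds = defaultdict(int), set()
        for p in in_window:
            if p["is_seed"]:
                seeds.add(p["thread"])
            else:
                replies[p["thread"]] += 1
        answered = sum(1 for t in seeds if replies[t] > 0)
        unanswered = len(seeds) - answered
        seed_ratio = answered / unanswered if unanswered else float(answered)
        return {"user_count": len(users), "churn_rate": churn_rate,
                "seed_ratio": seed_ratio}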
4.2. Measurement of user behaviour

As well as health indicators, the research examined user behaviour. The rationale for this was: "understanding the behaviour of community users, and how that relates to community health indicators, could provide community managers with information on healthy and unhealthy behavioural traits found in their communities". Behaviour was captured by six key numerical features of the community, computed for each user:

1) Focus dispersion: the forum entropy of the user; a high value indicates that the user is active on a large number of forums within the community.
2) Engagement: the proportion of all users that the user has replied to; a high value indicates wide engagement.
3) Popularity: the proportion of all users in the community that have replied to the user.
4) Contribution: the proportion of all thread replies that were created by the user.
5) Initiation: the proportion of threads that were started by the user; essentially a measure of how active the user is in instigating discussion.
6) Content quality: the average points-per-post awarded to the user. This measure is only applicable to communities where such a points feature is available.

4.3. Discovering user roles

In order to map user behaviour to user roles, the research took the following steps. First, the behaviour features above were analysed pairwise for correlation, since highly correlated features are not useful for describing distinct behaviour. The results suggested that engagement, contribution and popularity were all highly correlated with one another (as might be guessed intuitively), so only the dimensions focus dispersion, initiation, content quality and popularity (the last standing in for the correlated trio) were retained. With this trimmed list of behaviour features, the users were then clustered using three unsupervised techniques: Expectation Maximisation, K-Means and hierarchical clustering. The best result, judged by the cohesion and separation of the clusters, came from K-Means (a minimal sketch of this step is given at the end of section 4.4).

Clusters were labelled using a maximum-entropy decision tree that divided the clusters into branches so as to maximise the dispersion of the dimension levels. This process was repeated until each leaf node contained a single cluster (or a set of previously merged clusters); the path back to the root node was then used to derive the label. This gave the labels in the table below:

Table 10: Derived cluster labels

- Focussed novice
- Focussed expert participant
- Knowledgeable member
- Knowledgeable sink
- Focussed expert initiator
- Mixed novice
- Mixed expert
- Distributed novice
- Distributed expert

4.4. Analysing the role/health relationship

Having identified the health indicators and behavioural roles, it was then possible to look for patterns that explain the relationship between a degradation or improvement in a community's health and the behaviour of its members. This was done using two distinct techniques:
- Health indicator regression: this used the role composition of each community as a predictor for each of the health indicators.
- Health change detection: this performed a binary classification task to detect changes in community health from one time step to the next, investigating whether it is possible to detect changes that could lead a community into poor health.

It should be noted, however, that the research drew no empirical conclusions about the nature of the relationship; it simply demonstrated that a relationship exists.
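As promised above, a minimal sketch of the role-discovery step (Python with numpy and scikit-learn). The feature matrix is toy data, and the 0.8 correlation threshold and the choice of k = 9 (mirroring the nine labels in Table 10) are illustrative assumptions.

    # Sketch: drop highly correlated behaviour features, then cluster users
    # with K-Means; each cluster index is later mapped to a role label.
    import numpy as np
    from sklearn.cluster import KMeans

    features = ["focus_dispersion", "engagement", "popularity",
                "contribution", "initiation", "content_quality"]
    rng = np.random.default_rng(0)
    X = rng.random((500, len(features)))  # one row per user (toy data)

    # Pairwise correlation filter: keep one representative of any highly
    # correlated group of features.
    corr = np.corrcoef(X, rowvar=False)
    keep = []
    for i in range(len(features)):
        if all(abs(corr[i, j]) < 0.8 for j in keep):
            keep.append(i)

    kmeans = KMeans(n_clusters=9, n_init=10, random_state=0).fit(X[:, keep])
    print([features[i] for i in keep], kmeans.labels_[:10])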
5. Structural Signifiers

The investigation of structural signifiers had two main elements. The first was research carried out by Polecat, through discussion with community owners, into the signifiers that best reflect the health of a community. Secondly, WP5 carried out analysis of the TiddlyWiki community to gather initial statistics on the shape and size of that community. The latter is particularly valuable in light of the linguistic signifiers of health identified earlier in the document.

5.1.1. Community owner feedback on structural signifiers of health

In order to provide a benchmark of the structural indicators of health that matter to a community, Polecat talked to community owners from a range of currently running on-line communities and, using a simple questionnaire, asked which statistics they monitor to understand the health of their community at a given time. These communities were: TiddlyWiki, SAP SCN, IBM Connections and _connect. The results are shown in the table below. What is most striking is that, across four vibrant communities, the monitored metrics largely overlap, and there are very few of them. This suggests either that community owners have a limited grasp of the structural metrics that could inform them of their community's health, or that very little is needed to build an adequate picture of how a community is performing.

Table 11: Structural health signifier matrix

User Metrics | TiddlyWiki | IBM | SCN
# logins per day | No | Yes | Yes
Average time spent logged in | No | Yes | Yes
# posts | Yes | Yes | Yes
# posts generating replies | Yes | Yes | Yes
# content views/visitors | Yes | Yes | Yes
# likes | No | Yes | No
Average user connections | No | Yes | No
# new members in time period | Yes | Yes | Yes
# users | Yes | Yes | Yes
Max concurrent users per hour | No | No | Yes

5.1.2. Analysis of TiddlyWiki structural factors

WP5 carried out an initial analysis of the TiddlyWiki community [4] as a benchmark of the structural statistics that contribute to the health of a community, against which other communities can be compared. The analysis covers the TiddlyWiki community from 2005 until the end of 2011, and includes three separate sub-communities:

- WikiGroup: the core TiddlyWiki development community
- WikiDevGroup: a development community for TiddlyWiki software
- WebGroup: a development community focused on associated web technologies

The basic statistics of the community over all time steps are shown in Table 12. These are the number of users in the community ("nodes"), the number of relationships between these users, calculated as occurrences of direct communication between two users ("edges"), and the number of postings in the community ("emails").

Table 12: Basic statistics for the TiddlyWiki groups

Data set | Nodes (users) | Edges (replies) | Emails
WikiGroup | 2774 | 16804 | 51662
WikiDevGroup | 698 | 4345 | 14703
WebGroup | 77 | 464 | 3166

For the analysis of the sub-networks, a 12-month window size turned out to be the most suitable, as it made it possible to detect significant communities in the sequences. Given that the WebGroup data-set spanned only two years and no significant community structure was detected in it, it is not discussed further here. The partitions that were discovered were, for the most part, associated with a relatively low modularity value of around 0.3 (modularity is a quality function whose values approach 1 for good community structure and 0 for poor or completely missing community structure); because of this low quality, the results of that clustering method are not presented here.
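A modularity check of this kind can be sketched as follows (Python with networkx). The greedy partitioning algorithm and the toy graph are stand-ins, as the deliverable does not name the clustering method used by WP5.

    # Sketch: partition a reply graph and report the modularity of the
    # partition. A score around 0.3, as reported above, indicates weak
    # community structure.
    import networkx as nx
    from networkx.algorithms.community import (greedy_modularity_communities,
                                               modularity)

    # Toy reply graph: one node per user, one edge per observed direct
    # communication between two users.
    G = nx.gnm_random_graph(200, 600, seed=1)

    partition = greedy_modularity_communities(G)
    score = modularity(G, partition)
    print(f"{len(partition)} communities, modularity = {score:.2f}")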
Further investigation of the community structure using the OSLOM method, which finds only statistically significant communities, confirmed that there is only weak community structure, with the majority of nodes not being members of any community.

For the WikiGroup, visual inspection of the community structure suggested that there is only a single stable community (see Table 13). All the other communities were significantly smaller, and all of them either dissolved or disappeared completely; in other words, the vertices of the sub-graph disappear from the network in the following time-step graph. This suggests that there is one stable community of core users discussing amongst themselves and a couple of ad-hoc, short-lived communities on the periphery, while the majority of users do not form any significant cluster of frequently and mutually communicating users. The sizes of the WikiGroup and WikiDevGroup communities over the time-slices are shown below.

Table 13: WikiGroup and WikiDevGroup sizes by time-slice

Time Slice | WikiGroup Size | WikiDevGroup Size
2005-06-15 – 2006-06-15 | 602 | 273
2006-06-15 – 2007-06-15 | 693 | 210
2007-06-15 – 2008-06-15 | 715 | 142
2008-06-15 – 2009-06-15 | 691 | 151
2009-06-15 – 2010-06-15 | 574 | 101
2010-06-15 – 2011-06-15 | 285 | 61

By contrast, the WikiDevGroup was constantly shrinking over time. Its community structure is similar to that observed in the WikiGroup data-set: a stable core community with a handful of small, short-lived communities on the periphery, of which the majority of users are not members.

The results from this study form the basis for further work. Having the data annotated from both a network perspective and a linguistic perspective should allow us to examine the correlations between the two. The suggestion is that various forms of language cause adaptations not only in the health of the language used, but also in the way that users congregate into networks of cooperation. We hope to address this in D9.6. Further network statistics for the WikiGroup are shown in the appendix (section 8.4).

6. Project Integration

Many of the outputs from the work described above have fed into software deployed to the ROBUST platform, and are in the process of being exploited by Polecat.

6.1. The dJST topic-by-sentiment extraction model

This is currently being implemented on the ROBUST platform. It is being used by Polecat to monitor a stream of tweets for growth in negative topics surrounding a particular query; in most cases, and certainly in the case of eventually exploitable code, this query pertains to a company or product name. Community content is buffered by a bespoke component. Then, every n hours (a period referred to hereafter as an epoch), a topic model is extracted from that content using the dJST module, which has been deployed as a service. The topic model contains the "size" of each topic, and this metric is sent to a predictive algorithm that uses Gibbs sampling to infer predictions. If any of the negative topics appear likely to grow significantly in the near future, the Gibbs sampler alerts a service of the problem so that appropriate action can be taken.
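The epoch loop can be sketched as follows (Python). The function names, threshold and epoch length are hypothetical placeholders for the bespoke components and deployed services described above.

    # Sketch: buffer content per epoch, extract topics, and raise an alert
    # when a negative topic is predicted to grow. The callables stand in
    # for the deployed dJST, prediction and alerting services.
    import time

    EPOCH_HOURS = 6          # the "n hours" epoch length (assumption)
    GROWTH_THRESHOLD = 1.5   # predicted growth ratio that triggers an alert

    def run_epoch(buffer, extract_topics, predict_size, alert):
        for topic in extract_topics(buffer):   # stands in for the dJST service
            if topic["sentiment"] != "negative":
                continue
            predicted = predict_size(topic)    # stands in for the Gibbs sampler
            if predicted >= GROWTH_THRESHOLD * topic["size"]:
                alert(topic)                   # notify the downstream service
        buffer.clear()

    def main_loop(buffer, services):
        while True:
            run_epoch(buffer, **services)
            time.sleep(EPOCH_HOURS * 3600)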
The flow of the application is shown below:

Figure 11: Flow of negative topic prediction architecture

6.2. Graphic Equalizer

The graphic equalizer (described in D9.3 and D9.4) is software designed to present the health of an on-line community both visually and aurally. It displays a number of different health metrics, from a potentially disparate set of sources, and offers a single view of these. It performs two primary user functions: firstly, it allows users to review the different elements of the health of a community over a set time-frame; secondly, it allows users to monitor, in real time, health as it changes and adapts in the community. It should be noted that, since the health metrics from the different sources are normalised, the software can in theory be used to describe any temporal data (a sketch of such a normalisation is given at the end of this section). The data used by the graphic equalizer to date has been the role-analysis work from WP3, described above under behavioural signifiers, and structural network data from WP5. It uses a selection of remote services from the various partners (WP3 and WP5) to achieve this, utilizing the enterprise service bus. The architecture, and how these technologies have been deployed to the ROBUST image, are shown in the figure below:

Figure 12: Graphic equalizer architecture overview

6.3. Metaphor-based visualisation

Whilst still in development, as part of D9.4 and D9.5 Polecat is currently developing a metaphor-based visualisation to display the health metrics described above. The current prototype works from the same services as the graphic equalizer (indeed, it utilizes the same UI service), but displays the metrics as part of an intuitive, easily understood metaphor.
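Since the equalizer depends on bringing metrics from disparate sources onto a common scale, the transformation can be sketched simply (Python). Min-max scaling is an assumption for illustration; the deliverable does not specify the scheme actually used.

    # Sketch: min-max normalisation of temporal health metrics onto a
    # common 0..1 scale so that disparate sources can share one display.
    def normalise(series):
        lo, hi = min(series), max(series)
        if hi == lo:
            return [0.0] * len(series)   # a flat series carries no signal
        return [(v - lo) / (hi - lo) for v in series]

    metrics = {
        "churn_rate": [0.10, 0.12, 0.30, 0.25],
        "user_count": [540, 610, 580, 400],
        "modularity": [0.31, 0.28, 0.30, 0.22],
    }
    bands = {name: normalise(values) for name, values in metrics.items()}
    print(bands["user_count"])  # [0.666..., 1.0, 0.857..., 0.0]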
7. Conclusion

This document has outlined the work performed to benchmark the structural, behavioural and linguistic signifiers of on-line communities.

Benchmarking linguistic signifiers was approached from three directions. Firstly, a manual analysis of community data was performed by Polecat experts, and the core linguistic aspects were identified. Secondly, the accuracy of traditional classification techniques for measuring the health of language in community data was assessed. Thirdly, in accordance with the specific Polecat use case, an analysis of explicit and implicit topic modelling was described.

The benchmarking of behavioural signifiers followed a more linear path, from the identification of key indicators through to techniques that measured this behaviour, extracted roles and assigned them to users.

Because structural signifiers are the most tangible of the three community metrics, and therefore the least in need of algorithmic investigation, the benchmarking focused on the needs of community leaders and the metrics they already monitor. It also included a benchmark of network statistics against which linguistic correlations can be sought, leading (it is hoped) to insights into how the use of language affects user interaction.

The next phase, building on the results outlined above, will examine statistical and linguistic processes created specifically for social media, and will improve on some of the baselines. This will include research into whether and how the document result-set from an information retrieval query can be truncated so that aggregated results of information extracted from the truncated set are of similar or better quality than those aggregated from the entire result-set.

8. Appendix

8.1. User classification training set errors

Below are the error rates from the classification of the roles derived from Polecat's linguistic analysis of community data. The error rate, in this context, is the percentage of documents that were predicted incorrectly.

Role | C4.5 Decision Tree | MaxEnt | Naïve Bayes
Core | 2.46% | 0.40% | 0.00%
Participant | 0.19% | 13.10% | 2.08%
Newbie | 31.26% | 0.00% | 45.70%
Elder | 50.64% | 54.84% | 1.19%

8.2. Sub-topic density using cosine similarity (example)

Below is an example of five sub-topics of the topic "north sea oil", and the average cosine similarity between each sub-topic and the documents in the result-set. The average cosine similarity is calculated for differing numbers of terms used to describe the sub-topic; "5", for example, means that the sub-topic is treated as 5 terms, and the cosine similarity is computed between these terms and each document in the result-set.

Sub-topic (by number of terms) | 1 | 5 | 10 | 15 | 20 | 25 | 30
Oil platform | 3.30 | 1.94 | 5.03 | 4.24 | 3.18 | 3.89 | 3.31
Offshore oil and gas in the United States | 19.06 | 16.91 | 10.50 | 20.67 | 15.11 | 13.28 | 12.55
Rhum gasfield | 0.00 | 5.64 | 4.82 | 4.25 | 4.91 | 5.88 | 5.03
Bahar oilfield | 0.00 | 1.18 | 10.32 | 11.39 | 16.18 | 18.53 | 18.81
Pallas gas field | 0.00 | 0.09 | 3.16 | 3.74 | 2.77 | 3.23 | 4.43

8.3. Snapshots of the WikiGroup Network

Below are pictorial representations of the WikiGroup network for each of the years of the data that were analysed. [Network snapshots for the yearly slices: 2005-2006, 2006-2007, 2007-2008, 2008-2009, 2009-2010 and 2010-2011.]

8.4. WikiGroup Network Statistics

Shown below are some of the key network statistics extracted by the work of WP5 on the TiddlyWiki data. These statistics are calculated for each year of the community (shown on the x-axis of each graph); the y-axis represents the scale of each individual metric. [Charts: clustering coefficient and density; connected components and average degree; modularity; network distances.]
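The statistics in those charts can be reproduced with standard graph tooling; a minimal sketch follows (Python with networkx), with a toy graph standing in for a yearly WikiGroup slice.

    # Sketch: the per-year network statistics plotted above, computed for
    # one yearly slice of the reply network.
    import networkx as nx

    G = nx.gnm_random_graph(600, 2500, seed=2)  # placeholder yearly slice

    stats = {
        "clustering coefficient": nx.average_clustering(G),
        "density": nx.density(G),
        "connected components": nx.number_connected_components(G),
        "average degree": sum(d for _, d in G.degree()) / G.number_of_nodes(),
    }
    # Average shortest path length ("network distance") is only defined on
    # a connected graph, so compute it over the largest component.
    giant = G.subgraph(max(nx.connected_components(G), key=len))
    stats["average distance"] = nx.average_shortest_path_length(giant)
    for name, value in stats.items():
        print(f"{name}: {value:.3f}")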
List of Figures

Figure 1: Success language patterns
Figure 2: Motivating language patterns
Figure 3: Less healthy indicators
Figure 4: dJST by number of reviewers
Figure 5: dJST by average rating
Figure 6: dJST perplexity and classification
Figure 7: Graph structure for Wikipedia topic service
Figure 8: Topic density for taxonomy "energy reputation"
Figure 9: Topic density for taxonomy "financial distress"
Figure 10: Feedback prototype for topic density evaluation
Figure 11: Flow of negative topic prediction architecture
Figure 12: Graphic equalizer architecture overview

List of Tables

Table 1: User types identified by linguistic analysis
Table 2: Health bands across communities
Table 3: Sentiment classifier results
Table 4: Sentiment split for on-line forums
Table 5: Classified user types for the TiddlyWiki community
Table 6: Precision for topic query expansion
Table 7: Recall for topic query expansion
Table 8: F-measure for topic query expansion
Table 9: Evaluation of sub-topic retrieval algorithms
Table 10: Derived cluster labels
Table 11: Structural health signifier matrix
Table 12: Basic statistics for the TiddlyWiki groups
Table 13: WikiGroup and WikiDevGroup sizes by time-slice

List of Abbreviations

Abbreviation | Explanation
ROBUST | Risk and Opportunity management of huge-scale BUSiness communiTy cooperation
dJST | Dynamic joint sentiment topic model
WP | Work package
TF-IDF | Term frequency - inverse document frequency

References

[1] E. Gabrilovich, S. Markovitch (2007):
Computing semantic relatedness using Wikipedia-based Explicit Semantic Analysis. Proc. 20th International Joint Conference on Artificial Intelligence (IJCAI).
[2] S. Angeletou, M. Rowe, H. Alani (2011): Modelling and analysis of user behaviour in online communities. The Semantic Web - ISWC 2011.
[3] T. Mostyn (2012): Polecat Use Case Data and Requirements. WP9-D9.1Report_v2.0.docx.
[4] V. Belak (2011): Analysis of Community Structure and Dynamics in TiddlyWiki Email Fora. 2011_April_Tiddlywiki_Analysis.pdf.
[5] S. Staab, T. Gottron (2010): ROBUST Description of Work. ROBUST_Description of Work.pdf.
[6] A. Hogan, M. Kunstedt (2012): Suite for behaviour analysis and topic/sentiment tracking. WP5-behaviour-analysis.docx.

Version history

Version | Date | Author | Comments
0.1 | 11/10/2012 | Toby Mostyn | Initial draft
0.2 | 29/10/2012 | Toby Mostyn | Response to feedback
1.0 | 30/10/2012 | Toby Mostyn | Release version

Acknowledgement

The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 257859, ROBUST.