How to Use the PowerPoint Template
Sentiment Analysis of Tweets (Demo with Apache Flume, R, and ORAAH)
Nagaraj Malaiappan, India
Copyright © 2014 Oracle and/or its affiliates. All rights reserved. | Oracle Confidential – Internal

Sentiment Analysis at Airtel

About Airtel
• Bharti Airtel Limited: a leading telecommunications company operating in 20 countries across Asia and Africa (headquartered in New Delhi)
• Products: 2G, 3G, and 4G wireless services, mobile commerce, fixed line, DSL, IPTV, DTH, and others
• Bharti Airtel had almost 287 million customers (December 2013)

Challenges
• There are no tools for measuring how customers perceive the company (positive/negative)
• There are no tools for collecting feedback on new services and products
• "Influencers" must be identified in order to segment target groups for marketing campaigns
• Trends must be detected in order to launch new products or campaigns
• Unusual events (e.g. a network outage) and their effect on customer perception must be detected

Products Used in the Solution
• Oracle Big Data Appliance (BDA)
• Oracle R Advanced Analytics for Hadoop (ORAAH)

Solution Environment
• Phases: Stream, Acquire – Organize – Analyze
• Components: Twitter Streaming API, Oracle Event Processing, Apache Flume (tweets), HDFS, Hive (influencer analysis), Oracle R Advanced Analytics for Hadoop (sentiment and trend analysis)
• Platform: Oracle Big Data Appliance

Solution Details
• Apache Flume is used to stream the Twitter data
• Keywords (user handle or hashtag) are used to pick out the relevant tweets in the Twitter stream
• Register for the Twitter Streaming API
• Call the Twitter API from a Java program
• Register the Java code as an "Apache Flume source"
• Supply the keys and keywords as part of the Flume configuration
• Configure HDFS as the sink in Flume
• Start the Flume agent to collect the tweets
• Flume now streams the data into HDFS

Tweet Stream

Sentiment Analysis
• Oracle R Advanced Analytics for Hadoop (ORAAH)
– Load the positive/negative word lists
– Convert the tweets into a word list
– Apply the positive/negative lists to the tweet words
– Aggregate into an overall sentiment
• The R code for the steps above is executed through ORAAH as a MapReduce job on the data in HDFS

Trend Analysis
• Oracle R Advanced Analytics for Hadoop (ORAAH)
– Tweets are processed with cleansing, stemming, and stopword removal
– A MapReduce job then counts the most important words in HDFS
– The result is displayed as a word cloud

Influencer Analysis
• Oracle R Advanced Analytics (Hive adapter)
– Process the tweets in JSON format -> easy conversion into tabular form
– Create a Hive table containing all tweets
– Run a Hive query:
  • Who sent the most tweets?
  • Who received the most reply tweets to their messages?
retweeted_screen_name   total_retweets   tweet_count
malaysianairlines       493              1
HarvardBiz              362              6
TechCrunch              314              7
analytics               244              10
BigDataBorat            201              6
stephen_wolfram         182              1
CloudExpo               153              28
TheNextWeb              150              1
GonzalezCarmen          121              10
bigdata                 100              37

Results and Benefits of the Solution
Key benefit areas and the benefits for Airtel:
• Business performance
– Improved brand reputation
– Higher customer satisfaction reduces churn
– More, and better targeted, marketing campaigns
– More detailed feedback on products and services
• Profitability
– A lower churn rate increases margin
– Additional revenue from better targeted campaigns
• Competition
– Better customer service becomes possible
– Positive reputation relative to competitors

Sample Code (none of it is rocket science)

Acquiring Tweets
1. Create a developer account with Twitter from here
2. Install Flume from here if it is not installed already
3. Download the jar file that contains the class files for loading data from Twitter into Hadoop from here
4. Edit the flume-env.sh file (located in /etc/flume-ng/conf) and include FLUME_CLASSPATH, e.g.
    FLUME_CLASSPATH="/home/oracle/Downloads/flume-sources-1.0-SNAPSHOT.jar"
5. Create a .conf file that holds the settings for the source (Twitter), the channel (memory channel), and the sink (HDFS)
6. Make sure to set the consumer key, the secret keys, and the keywords to search for on Twitter
7. Save the file as <name>.conf, e.g. flume.conf, in /etc/flume-ng/conf
8.
Entries to go into the conf file:
    TwitterAgent.sources = Twitter
    TwitterAgent.channels = MemChannel
    TwitterAgent.sinks = HDFS
    TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
    TwitterAgent.sources.Twitter.channels = MemChannel
    TwitterAgent.sources.Twitter.consumerKey = <fill details you obtained from step 1>
    TwitterAgent.sources.Twitter.consumerSecret = <fill details you obtained from step 1>
    TwitterAgent.sources.Twitter.accessToken = <fill details you obtained from step 1>
    TwitterAgent.sources.Twitter.accessTokenSecret = <fill details you obtained from step 1>
    TwitterAgent.sources.Twitter.keywords = <User Handle or Hash tag>

Acquiring Tweets (cont.)
    TwitterAgent.sinks.HDFS.channel = MemChannel
    TwitterAgent.sinks.HDFS.type = hdfs
    TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:8020/user/oracle/flume/tweets/%Y/%m/%d/%H/
    TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
    TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
    TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
    TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
    TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
    TwitterAgent.channels.MemChannel.type = memory
    TwitterAgent.channels.MemChannel.capacity = 10000
    TwitterAgent.channels.MemChannel.transactionCapacity = 100
9. Run Flume to collect the tweets. Navigate to the Flume conf directory where you saved the conf file above and run the command below:
    $>flume-ng agent --conf . -f flume.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent

Sentiment Analysis
1. Open RStudio (http://localhost:8787)
2. Install the required libraries and load them (install.packages() takes the package names as one character vector, and library() loads one package per call):
    R>install.packages(c("twitteR","plyr","stringr"))
    R>library(plyr); library(twitteR); library(ORCH); library(stringr)
3.
Download the positive and negative word lists from here, unzip them, and run the commands below (adjust the paths to your download location):
    R>pos.words=scan('C:/Users/nmalaiap/Downloads/opinion-lexicon-English/positive-words.txt', what='character', comment.char=';')
    R>neg.words=scan('C:/Users/nmalaiap/Downloads/opinion-lexicon-English/negative-words.txt', what='character', comment.char=';')
4. Create a function that will be used to score the sentiment of the tweet text:
    R>score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
    {
      require(plyr)
      require(stringr)
      # we got a vector of sentences. plyr will handle a list or a vector as an "l" for us
      # we want a simple array of scores back, so we use "l" + "a" + "ply" = laply:
      scores = laply(sentences, function(sentence, pos.words, neg.words) {
        # clean up sentences with R's regex-driven global substitute, gsub():
        sentence = gsub('[[:punct:]]','',sentence)
        sentence = gsub('[[:cntrl:]]','',sentence)
        sentence = gsub('\\d+','',sentence)
        # and convert to lower case:
        sentence = tolower(sentence)
        # split into words. str_split is in the stringr package
        word.list=str_split(sentence, '\\s+')
        # sometimes a list() is one level of hierarchy too much
        words=unlist(word.list)
        # compare our words to the dictionaries of positive and negative terms
        pos.matches = match(words, pos.words)
        neg.matches = match(words, neg.words)
        # match() returns the position of the matched term or NA
        # we just want a TRUE/FALSE:
        pos.matches = !is.na(pos.matches)
        neg.matches = !is.na(neg.matches)
        # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
        score=sum(pos.matches)-sum(neg.matches)
        return(score)
      }, pos.words, neg.words, .progress=.progress)
      scores.df=data.frame(score=scores, text=sentences)
      return(scores.df)
    }
5.
Run the R code below to compute the sentiment:
    # create an HDFS pointer for R
    tweets.dfs=hdfs.attach("/user/oracle/flume/tweets/2014/04/07/08/",key.sep="'",value.sep=",",key=NULL,force=FALSE,trim=FALSE,data.frame=TRUE,silent=FALSE)
    # sentiment MapReduce job run via ORCH
    air.sentiment <- hadoop.run(
      data = tweets.dfs,  # ORCH HDFS identifier containing the tweets
      export = orch.export(score.sentiment,pos.words,neg.words),  # pass local R objects to the map reduce functions
      # use the init function to load the libraries required for map reduce
      init = function() {
        # If you see an error saying "there is no package called plyr", the library
        # where plyr is installed is not the place where the job looks for packages,
        # so the package needs to be moved to that place.
        # To identify that place, call install.packages("plyr") inside the init function:
        # the job fails with "Permission denied" and reports the path where the
        # library is being installed, e.g. /usr/lib64/R/library/
        # Having identified the library path, move the libraries plyr (and its
        # dependency Rcpp, which you can find by installing plyr in RStudio with the
        # "Install Packages" button) and stringr to that path.
        library(plyr)
        library(stringr)
        orch.dbg.on('all')
        orch.dbg.output(stderr())
      },
      mapper = function(k, v) {
        scores=score.sentiment(as.character(v$val3),pos.words,neg.words, .progress='text')
        orch.keyvals(scores$text,as.numeric(scores$score))
      },
      config = new("mapred.config",
        job.name = "Sentiment Analysis of Tweets",
        map.tasks = 1,
        map.output = data.frame(key="s", val=1)),
      final = function() {
        orch.dbg.output()
        orch.dbg.off()
      }
    )
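The scoring logic inside the mapper can also be tried out locally, without Hadoop. The following is a minimal Python sketch of the same word-list approach (strip punctuation, control characters and digits, lower-case, split into words, count positive minus negative matches). It is not the ORAAH code from the slides: the names score_sentiment and overall_sentiment and the two tiny word lists are hypothetical stand-ins for the downloaded opinion lexicon.

```python
import re
import string

# Hypothetical toy word lists; the slides load a full opinion lexicon instead.
POS_WORDS = {"good", "great", "love"}
NEG_WORDS = {"bad", "outage", "slow"}

# Translation table that deletes ASCII punctuation (similar to [[:punct:]]).
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def score_sentiment(sentence, pos_words, neg_words):
    """Return (number of positive words) - (number of negative words)."""
    cleaned = sentence.translate(_PUNCT_TABLE)         # drop punctuation
    cleaned = re.sub(r"[\x00-\x1f\x7f]", "", cleaned)  # drop control characters
    cleaned = re.sub(r"\d+", "", cleaned)              # drop digits
    words = cleaned.lower().split()                    # split on whitespace
    positives = sum(word in pos_words for word in words)
    negatives = sum(word in neg_words for word in words)
    return positives - negatives


def overall_sentiment(tweets, pos_words=POS_WORDS, neg_words=NEG_WORDS):
    """Score every tweet and aggregate the scores into one overall value."""
    scores = [score_sentiment(tweet, pos_words, neg_words) for tweet in tweets]
    return scores, sum(scores)
```

A positive total suggests a predominantly positive perception, a negative total the opposite. Like the R function, this simple count ignores negation and sarcasm.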

Sentiment Analysis (cont.)
    # move the sentiment results to local R
    air.sentiment.r=hdfs.get(air.sentiment)
    # check out the results
    air.sentiment.r
    # check the column names
    names(air.sentiment.r)
    # plot the distribution of sentiments using a histogram
    hist(air.sentiment.r$val)
    # load the library required for qplot
    library("ggplot2")
    qplot(air.sentiment.r$val)

Trend Analysis
1. Run the R code below to get the list of words in the tweets and their frequencies, and plot the tag cloud:
    # Trend analysis -- working version
    # (you may need to change the path as needed)
    tweets.dfs=hdfs.attach("/user/oracle/flume/tweets/2014/04/07/08/",key.sep="'",value.sep=",",key=NULL,force=FALSE,trim=FALSE,data.frame=TRUE,silent=FALSE)
    tweets.freq <- hadoop.run(
      data = tweets.dfs,
      init = function() {
        library(stringr)
      },
      mapper = function(k, v) {
        # emit (word, 1) for every word in the tweet text
        word.list=str_split(v$text,'\\s+')
        words=unlist(word.list)
        orch.keyvals(words,rep(1,length(words)))
      },
      reducer = function(k, v) {
        # sum the counts per word
        orch.keyval(k, sum(v))
      },
      config = new("mapred.config",
        job.name = "Trend Analysis",
        map.tasks = 2,
        reduce.tasks = 1,
        map.output = data.frame(key="s", val=1),
        reduce.output = data.frame(word="s", freq=1)
      )
    )

Trend Analysis (cont.)
    # move the word frequency results to local R
    tweets.freq.r=hdfs.get(tweets.freq)
    # check out the results
    tweets.freq.r
    # check the column names
    names(tweets.freq.r)
    # plot the word frequencies using wordcloud
    library(wordcloud)
    wordcloud(tweets.freq.r$word, tweets.freq.r$freq, min.freq=3)

Influencer Analysis
1. Install Hive if required, as detailed here
2. Download hive-serdes-1.0-SNAPSHOT.jar from here and move it to the lib directory of Hive, e.g.
    $>sudo mv /home/oracle/Downloads/hive-serdes-1.0-SNAPSHOT.jar /usr/lib/hive/lib/
3. Open the Hive interactive console and add the library required to parse the JSON format:
    hive>ADD JAR /usr/lib/hive/lib/hive-serdes-1.0-SNAPSHOT.jar;
4. Create an external Hive table that points to the tweets written to HDFS by Flume (change the HDFS path in the create script as required):
    hive>CREATE EXTERNAL TABLE tweets (
      id BIGINT,
      created_at STRING,
      source STRING,
      favorited BOOLEAN,
      retweet_count INT,
      retweeted_status STRUCT<
        text:STRING,
        user:STRUCT<screen_name:STRING,name:STRING>>,
      entities STRUCT<
        urls:ARRAY<STRUCT<expanded_url:STRING>>,
        user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
        hashtags:ARRAY<STRUCT<text:STRING>>>,
      text STRING,
      user STRUCT<
        screen_name:STRING,
        name:STRING,
        friends_count:INT,
        followers_count:INT,
        statuses_count:INT,
        verified:BOOLEAN,
        utc_offset:INT,
        time_zone:STRING>,
      in_reply_to_screen_name STRING
    )
    ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
    LOCATION '/user/oracle/flume/tweets/2014/04/07/08';

Influencer Analysis (cont.)
5. Analyze to find the users with the highest influence:
    hive>SELECT t.retweeted_screen_name,
           sum(retweets) AS total_retweets,
           count(*) AS tweet_count
         FROM (SELECT retweeted_status.user.screen_name AS retweeted_screen_name,
                      retweeted_status.text,
                      max(retweet_count) AS retweets
               FROM tweets
               GROUP BY retweeted_status.user.screen_name, retweeted_status.text) t
         GROUP BY t.retweeted_screen_name
         ORDER BY total_retweets DESC
         LIMIT 10;

retweeted_screen_name   total_retweets   tweet_count
malaysianairlines       493              1
HarvardBiz              362              6
TechCrunch              314              7
analytics               244              10
BigDataBorat            201              6
stephen_wolfram         182              1
CloudExpo               153              28
TheNextWeb              150              1
GonzalezCarmen          121              10
bigdata                 100              37

6.
Analyze to find the users with the most followers:
    hive>SELECT user.screen_name, user.followers_count c FROM tweets ORDER BY c DESC;
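
To see what the influence query in step 5 computes, the same two-level aggregation can be sketched in plain Python over a handful of parsed tweet records. This is only an illustration of the logic, not part of the Hive solution: the function name top_influencers is hypothetical, the records are assumed to be dictionaries shaped like the tweets table, and tweets without a retweeted_status are skipped here, whereas the Hive query would collect them in a NULL group.

```python
from collections import defaultdict


def top_influencers(tweets, limit=10):
    """Rank retweeted users by total retweets, mirroring the Hive query."""
    # Inner query: max(retweet_count) per (retweeted user, retweeted text).
    max_retweets = {}
    for tweet in tweets:
        status = tweet.get("retweeted_status")
        if not status:
            continue  # assumption: ignore tweets that are not retweets
        key = (status["user"]["screen_name"], status["text"])
        max_retweets[key] = max(max_retweets.get(key, 0), tweet["retweet_count"])

    # Outer query: sum of those maxima and number of distinct texts per user.
    totals = defaultdict(lambda: [0, 0])
    for (screen_name, _text), retweets in max_retweets.items():
        totals[screen_name][0] += retweets
        totals[screen_name][1] += 1

    ranked = sorted(
        ((name, total, count) for name, (total, count) in totals.items()),
        key=lambda row: row[1],
        reverse=True,
    )
    return ranked[:limit]


def _retweet(screen_name, text, retweet_count):
    """Build a made-up record with the fields used above."""
    return {
        "retweet_count": retweet_count,
        "retweeted_status": {"text": text, "user": {"screen_name": screen_name}},
    }


# Made-up sample records for illustration only.
SAMPLE_TWEETS = [
    _retweet("alice", "post A", 5),
    _retweet("alice", "post A", 10),
    _retweet("alice", "post B", 5),
    _retweet("bob", "post C", 7),
    {"retweet_count": 0, "retweeted_status": None},
]

RESULT = top_influencers(SAMPLE_TWEETS)
```

Taking the maximum per retweeted text first avoids counting the same original tweet once for every retweet that Flume captured; tweet_count is therefore the number of distinct retweeted messages per user.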