How to Use the PowerPoint Template

Sentiment Analysis of Tweets
(Demo with Apache Flume, R, and ORAAH)
Nagaraj Malaiappan, India
Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |
Sentiment Analysis at Airtel
About Airtel
• Bharti Airtel Limited: a leading telecommunications company operating in 20 countries across Asia and Africa (headquarters: New Delhi)
• Products: 2G, 3G, and 4G wireless services, mobile commerce, fixed line, DSL, IPTV, DTH, and others
• Bharti Airtel has almost 287 million customers (Dec 2013)
Challenges
• No tools exist to measure how customers perceive the company (positive/negative)
• No tools exist to gather feedback on new services and products
• Finding the "influencers" to segment target groups for marketing campaigns
• Trends must be detected in order to launch new products or campaigns
• Unusual events (e.g. a network outage) and their effect on customer perception must be detected
Products Used in the Solution
• Oracle Big Data Appliance (BDA)
• Oracle R Advanced Analytics for Hadoop (ORAAH)
Copyright © 2014 Oracle and/or its affiliates. All rights reserved. | Oracle Confidential – Internal
Solution Environment
[Architecture diagram: Acquire – Organize – Analyze on the Oracle Big Data Appliance]
• Stream: Twitter Streaming API, Oracle Event Processing
• Apache Flume (tweets), HDFS
• Oracle R Advanced Analytics for Hadoop (sentiment and trend analysis)
• Hive (influencer analysis)
Solution Details
• Use Apache Flume to stream the Twitter data
• Keywords (user handle or hashtag) are used to identify the relevant tweets in the Twitter stream
• Register for the Twitter Streaming API
• Call the Twitter API from a Java program
• Register the Java code as an "Apache Flume source"
• Use the key and keywords as part of the Flume configuration
• Configure HDFS as the sink in Flume
• Start the Flume agent to collect the tweets
• Flume now streams the data into HDFS
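The keyword step in the list above can be illustrated with a small sketch. It is written in Python purely for illustration (the demo itself uses a Java-based Flume source), and it only mimics the idea of keeping tweets that mention a tracked handle or hashtag. The keywords and sample tweets are hypothetical placeholders, not values from the demo.

```python
import json

# Hypothetical tracking keywords (a user handle and a hashtag); placeholders only.
KEYWORDS = ["@example_handle", "#examplehashtag"]


def matches_keywords(tweet_text, keywords=KEYWORDS):
    """Return True if any keyword occurs in the tweet text (case-insensitive)."""
    text = tweet_text.lower()
    return any(keyword.lower() in text for keyword in keywords)


def filter_stream(json_lines, keywords=KEYWORDS):
    """Yield parsed tweets whose 'text' field mentions one of the keywords."""
    for line in json_lines:
        tweet = json.loads(line)
        if matches_keywords(tweet.get("text", ""), keywords):
            yield tweet


# Made-up sample input: one JSON document per line, as a stream would deliver it.
sample_stream = [
    '{"id": 1, "text": "Loving the new plan from @example_handle"}',
    '{"id": 2, "text": "Nothing to see here"}',
    '{"id": 3, "text": "Outage again? #ExampleHashtag"}',
]
kept_ids = [tweet["id"] for tweet in filter_stream(sample_stream)]
print(kept_ids)
```

In the demo setup the keywords are part of the Flume configuration, so the selection happens before the data reaches HDFS; the sketch only shows the matching logic on already received lines.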
Tweet Stream
Sentiment Analysis
• Oracle R Advanced Analytics for Hadoop (ORAAH)
– Load the positive/negative word lists
– Convert the tweets into a word list
– Apply the positive/negative lists to the tweet words
– Aggregate into an overall sentiment score
• The R code for the steps described above is executed through ORAAH as a MapReduce job on the data in HDFS
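The four steps above amount to a simple lexicon-based score: count the positive words, count the negative words, and subtract. The following Python sketch shows that idea on a few made-up tweets. The word lists here are tiny illustrative stand-ins for a full opinion lexicon, and the function name is an assumption for this example only.

```python
import re

# Tiny illustrative word lists; a real run would load a full opinion lexicon.
POSITIVE_WORDS = {"good", "great", "love", "fast"}
NEGATIVE_WORDS = {"bad", "slow", "outage", "hate"}


def score_sentiment(text):
    """Score one tweet as (number of positive words) - (number of negative words)."""
    cleaned = re.sub(r"[^\w\s]|\d", "", text).lower()  # drop punctuation and digits
    words = cleaned.split()
    positive_hits = sum(word in POSITIVE_WORDS for word in words)
    negative_hits = sum(word in NEGATIVE_WORDS for word in words)
    return positive_hits - negative_hits


# Made-up example tweets.
tweets = [
    "Great network, love the fast 4G!",
    "Another outage... slow and bad service",
    "Good support today",
]
scores = [score_sentiment(tweet) for tweet in tweets]
overall = sum(scores)  # aggregate into one overall sentiment value
print(scores, overall)
```

A score above zero suggests a positive tweet and a score below zero a negative one; the approach ignores negation and sarcasm, which is a known limit of plain word counting.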
Trend Analysis
• Oracle R Advanced Analytics for Hadoop (ORAAH)
– Tweets are processed with cleansing, stemming, and stop-word removal
– A MapReduce job then counts the most important words in HDFS
– The result is displayed as a word cloud
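As a rough illustration of the preprocessing named above, the following Python sketch cleans a few made-up tweets, removes stop words, applies a deliberately crude suffix-stripping step in place of a real stemmer, and counts the remaining words. The stop-word list, the `crude_stem` helper, and the sample tweets are assumptions for this example and are not taken from the demo.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; real lists are much longer.
STOPWORDS = {"the", "is", "a", "and", "to", "of", "in", "for", "on"}


def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean_tokens(text):
    """Lower-case, drop URLs and non-letters, remove stop words, then stem."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    text = re.sub(r"[^a-z#@\s]", " ", text)
    return [crude_stem(word) for word in text.split() if word not in STOPWORDS]


# Made-up example tweets.
tweets = [
    "Streaming tweets to Hadoop is easy http://example.com",
    "The tweets stream in for trending topics",
    "Trends of streams and tweets",
]
word_counts = Counter(token for tweet in tweets for token in clean_tokens(tweet))
print(word_counts.most_common(3))
```

The resulting counts are the kind of input a word cloud needs: each word paired with its frequency.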
Influencer Analysis
• Oracle R Advanced Analytics (Hive adapter)
– Process the tweets in JSON format -> easy to convert into tabular form
– Create a Hive table containing all tweets
– Issue a Hive query:
• Who sent the most tweets?
• Who received the most reply tweets to their messages?

retweeted_screen_name  total_retweets  tweet_count
malaysianairlines      493             1
HarvardBiz             362             6
TechCrunch             314             7
analytics              244             10
BigDataBorat           201             6
stephen_wolfram        182             1
CloudExpo              153             28
TheNextWeb             150             1
GonzalezCarmen         121             10
bigdata                100             37
Results and Benefits of the Solution
Key Benefits
Area: Benefits for Airtel
Business Performance
• Improved brand reputation
• Higher customer satisfaction reduces churn
• More, and better targeted, marketing campaigns
• More detailed feedback on products and services
Profitability
• A lower churn rate increases margin
• Additional revenue from better targeted campaigns
Competition
• Better customer service becomes possible
• Positive reputation compared with competitors
Sample Code
(none of it is rocket science)
Acquiring Tweets
1. Create a developer account with Twitter from here
2. Install Flume from here if it is not already installed
3. Download the jar file containing the classes that load data from Twitter into Hadoop from here
4. Edit the flume-env.sh file (located in /etc/flume-ng/conf) and add the jar to FLUME_CLASSPATH, e.g.
FLUME_CLASSPATH="/home/oracle/Downloads/flume-sources-1.0-SNAPSHOT.jar"
5. Create a .conf file that defines the source (Twitter), the channel (memory channel), and the sink (HDFS)
6. Make sure to set the consumer key and secret, the access token and secret, and the keywords to search for on Twitter
7. Save the file as <name>.conf, e.g. flume.conf, in /etc/flume-ng/conf
8. Add the following entries to the conf file:
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey =<fill details you obtained from step 1>
TwitterAgent.sources.Twitter.consumerSecret =<fill details you obtained from step 1>
TwitterAgent.sources.Twitter.accessToken =<fill details you obtained from step 1>
TwitterAgent.sources.Twitter.accessTokenSecret =<fill details you obtained from step 1>
TwitterAgent.sources.Twitter.keywords = <User Handle or Hash tag>
Acquiring Tweets (Cont)
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:8020/user/oracle/flume/tweets/%Y/%m/%d/%H/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
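A common mistake in a Flume configuration like the one above is a source or sink that points at a channel name the agent never declares. The following Python sketch is one possible sanity check, not part of Flume itself: it parses the key/value lines and reports such mismatches. The helper names are hypothetical, and only the wiring-related keys from the configuration above are reproduced.

```python
def parse_properties(text):
    """Parse 'key = value' lines into a dict, skipping blank and comment lines."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            key, _, value = line.partition("=")
            props[key.strip()] = value.strip()
    return props


def wiring_problems(props, agent):
    """Return descriptions of sources or sinks bound to undeclared channels."""
    declared = set(props.get(f"{agent}.channels", "").split())
    problems = []
    for source in props.get(f"{agent}.sources", "").split():
        for channel in props.get(f"{agent}.sources.{source}.channels", "").split():
            if channel not in declared:
                problems.append(f"source {source} uses undeclared channel {channel}")
    for sink in props.get(f"{agent}.sinks", "").split():
        channel = props.get(f"{agent}.sinks.{sink}.channel", "")
        if channel not in declared:
            problems.append(f"sink {sink} uses undeclared channel {channel!r}")
    return problems


# Only the wiring-related lines of the configuration shown above.
CONFIG = """
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel
"""
ok_problems = wiring_problems(parse_properties(CONFIG), "TwitterAgent")

# Introduce a deliberate typo in the sink's channel name to show a failure.
broken = CONFIG.replace("sinks.HDFS.channel = MemChannel", "sinks.HDFS.channel = Memchannel")
broken_problems = wiring_problems(parse_properties(broken), "TwitterAgent")
print(ok_problems, broken_problems)
```

Note the asymmetry the check relies on: a source lists its channels under the plural key `channels`, while a sink names exactly one under the singular key `channel`, as in the configuration above.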
9. Run Flume to collect tweets
Navigate to the Flume conf directory where you saved the conf file above and run the following command:
$>flume-ng agent --conf . -f flume.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent
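The sink path in the configuration ends in `%Y/%m/%d/%H/`, so the agent writes each tweet into a directory named after the event's year, month, day, and hour. For these four escape sequences the meaning matches Python's `strftime`, which allows a quick sketch of where a given event would land. The timestamp below is an assumed example, and the `hdfs://localhost:8020` prefix is left out for brevity.

```python
from datetime import datetime

# Path portion of hdfs.path from the configuration, without the hdfs://host:port prefix.
HDFS_PATH_PATTERN = "/user/oracle/flume/tweets/%Y/%m/%d/%H/"


def bucket_for(event_time, pattern=HDFS_PATH_PATTERN):
    """Expand the year/month/day/hour escapes for a given event timestamp."""
    return event_time.strftime(pattern)


# Assumed example: an event stamped 7 April 2014, 08:15.
example_path = bucket_for(datetime(2014, 4, 7, 8, 15))
print(example_path)
```

This is why the later R and Hive steps read from a directory such as `/user/oracle/flume/tweets/2014/04/07/08/`: the path has to be adjusted to the hour in which the agent actually ran.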
Sentiment Analysis
1. Open RStudio (http://localhost:8787)
2. Install the required libraries and load them
R>install.packages(c("twitteR","plyr","stringr"))
R>library(plyr); library(twitteR); library(ORCH); library(stringr)
3. Download the positive and negative word lists from here, unzip them, and run the commands below (adjust the paths to your download location)
R>pos.words=scan('C:/Users/nmalaiap/Downloads/opinion-lexicon-English/positive-words.txt', what='character', comment.char=';')
R>neg.words=scan('C:/Users/nmalaiap/Downloads/opinion-lexicon-English/negative-words.txt', what='character', comment.char=';')
4. Create a function that scores the sentiment of tweet text
R> score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
{
  require(plyr)
  require(stringr)
  # we got a vector of sentences. plyr will handle a list or a vector as an "l" for us
  # we want a simple array of scores back, so we use "l" + "a" + "ply" = laply:
  scores = laply(sentences, function(sentence, pos.words, neg.words)
  {
    # clean up sentences with R's regex-driven global substitute, gsub():
    sentence = gsub('[[:punct:]]','',sentence)
    sentence = gsub('[[:cntrl:]]','',sentence)
    sentence = gsub('\\d+','',sentence)
    # and convert to lower case:
    sentence = tolower(sentence)
    # split into words. str_split is in the stringr package
    word.list=str_split(sentence, '\\s+')
    # sometimes a list() is one level of hierarchy too much
    words=unlist(word.list)
    # compare our words to the dictionaries of positive and negative terms
    pos.matches = match(words, pos.words)
    neg.matches = match(words, neg.words)
    # match() returns the position of the matched term or NA
    # we just want a TRUE/FALSE:
    pos.matches = !is.na(pos.matches)
    neg.matches = !is.na(neg.matches)
    # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
    score=sum(pos.matches)-sum(neg.matches)
    return(score)
  }, pos.words, neg.words, .progress=.progress)
  scores.df=data.frame(score=scores, text=sentences)
  return(scores.df)
}
Sentiment Analysis (cont)
5. Run the R code below to compute the sentiment scores and plot them
# create an HDFS pointer for R
tweets.dfs=hdfs.attach("/user/oracle/flume/tweets/2014/04/07/08/",key.sep="'",value.sep=",",key=NULL,force=FALSE,trim=FALSE,data.frame=TRUE,silent=FALSE)
# sentiment MapReduce job run via ORCH
air.sentiment <- hadoop.run(
  data = tweets.dfs, # ORCH HDFS identifier containing the tweets
  export = orch.export(score.sentiment,pos.words,neg.words), # pass local R objects to the MapReduce functions
  # use the init function to load the libraries required by the MapReduce job
  init = function(){
    # If you see the error "there is no package called plyr", the library where plyr is installed
    # is not on the library path searched by the job, so the package must be moved there.
    # To identify that path, call install.packages("plyr") inside the init function: the job fails with
    # "Permission denied" and reports the path where the library is being installed, e.g. /usr/lib64/R/library/
    # Once the library path is known, move the libraries plyr (and its dependency Rcpp, which you can find
    # by installing via the "Install Packages" button in RStudio) and stringr to that path
    library(plyr)
    library(stringr)
    orch.dbg.on('all')
    orch.dbg.output(stderr())
  },
  mapper = function(k, v) {
    scores=score.sentiment(as.character(v$val3),pos.words,neg.words, .progress='text')
    orch.keyvals(scores$text,as.numeric(scores$score))
  },
  config = new("mapred.config",
    job.name = "Sentiment Analysis of Tweets",
    map.tasks = 1,
    map.output = data.frame(key="s", val=1)),
  final = function() {
    orch.dbg.output()
    orch.dbg.off()
  }
)
Sentiment Analysis (cont)
#move the sentiment results to Local R
air.sentiment.r=hdfs.get(air.sentiment)
#check out the results
air.sentiment.r
#Check the column names
names(air.sentiment.r)
#plot the distribution of sentiments using a histogram
hist(air.sentiment.r$val)
#load the library required for qplot
library("ggplot2")
qplot(air.sentiment.r$val)
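What the histogram above conveys is simply how many tweets fall on each sentiment score. As a language-neutral illustration, the following Python sketch computes that distribution and prints a text bar per score. The scores are made up for the example and are not results from the demo.

```python
from collections import Counter


def score_distribution(scores):
    """Count how many tweets received each sentiment score, ordered by score."""
    return dict(sorted(Counter(scores).items()))


# Made-up sentiment scores, one per tweet.
scores = [0, 1, -1, 0, 2, 0, -1]
distribution = score_distribution(scores)

# Print a simple text histogram: one '#' per tweet with that score.
for score, count in distribution.items():
    print(f"{score:>3} | {'#' * count}")
```

A distribution centered on zero with a longer tail on one side is the kind of pattern the plotted histogram makes visible at a glance.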
Trend Analysis
1. Run the below R code to get the list of words in tweets and their frequency and plot the tag cloud
# Trend Analysis --Working version
#(You may need to change the path as needed)
tweets.dfs=hdfs.attach("/user/oracle/flume/tweets/2014/04/07/08/",key.sep="'",value.sep=",",key=NULL,force=FALSE,trim=FALSE,data.frame=TRUE,silent=FALSE)
tweets.freq <- hadoop.run(
  data = tweets.dfs,
  init = function(){
    library(stringr)
  },
  mapper = function(k, v) {
    word.list=str_split(v$text,'\\s+')
    words=unlist(word.list)
    orch.keyvals(words,rep(1,length(words)))
  },
  reducer = function(k, v) {
    orch.keyval(k, sum(v))
  },
  config = new("mapred.config",
    job.name = "Trend Analysis",
    map.tasks = 2,
    reduce.tasks = 1,
    map.output = data.frame(key="s", val=1),
    reduce.output = data.frame(word="s", freq=1)
  ))
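The job above follows the classic word-count pattern: the mapper emits a (word, 1) pair for every word, the framework groups the pairs by word, and the reducer sums each group. The following Python sketch simulates those three phases in memory on made-up input. It is only a model of the data flow, not of how ORAAH or Hadoop execute the job, and the function names are chosen for this example.

```python
from collections import defaultdict


def mapper(tweet_text):
    """Emit a (word, 1) pair for every whitespace-separated word."""
    return [(word, 1) for word in tweet_text.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(word, counts):
    """Sum the counts collected for one word."""
    return word, sum(counts)


# Made-up example tweets.
tweets = ["big data big insights", "data in motion"]
mapped = [pair for tweet in tweets for pair in mapper(tweet)]
word_freq = dict(reducer(word, counts) for word, counts in shuffle(mapped).items())
print(word_freq)
```

Because each reducer call only sees the values for one key, the summing step can run in parallel across words, which is what makes the pattern scale.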
Trend Analysis (cont)
#move the word frequency results to Local R
tweets.freq.r=hdfs.get(tweets.freq)
#check out the results
tweets.freq.r
#Check the column names
names(tweets.freq.r)
#plot the word frequencies as a word cloud
library(wordcloud)
wordcloud(tweets.freq.r$word, tweets.freq.r$freq, min.freq=3)
Influencer Analysis
1. Install Hive if required, as detailed here
2. Download hive-serdes-1.0-SNAPSHOT.jar from here and move it to the Hive lib directory, e.g.
$>sudo mv /home/oracle/Downloads/hive-serdes-1.0-SNAPSHOT.jar /usr/lib/hive/lib/
3. Open the Hive interactive console and add the library required to read the JSON format
hive>ADD JAR /usr/lib/hive/lib/hive-serdes-1.0-SNAPSHOT.jar;
4. Create an external Hive table that points to the tweets written by Flume
hive>CREATE EXTERNAL TABLE tweets (
id BIGINT,
created_at STRING,
source STRING,
favorited BOOLEAN,
retweet_count INT,
retweeted_status STRUCT<
text:STRING,
user:STRUCT<screen_name:STRING,name:STRING>>,
entities STRUCT<
urls:ARRAY<STRUCT<expanded_url:STRING>>,
user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
hashtags:ARRAY<STRUCT<text:STRING>>>,
text STRING,
user STRUCT<
screen_name:STRING,
name:STRING,
friends_count:INT,
followers_count:INT,
statuses_count:INT,
verified:BOOLEAN,
utc_offset:INT,
time_zone:STRING>,
in_reply_to_screen_name STRING
)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/oracle/flume/tweets/2014/04/07/08';
Influencer Analysis (Cont)
Note: change the HDFS path in the CREATE TABLE statement above as required.
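The table definition above works because each tweet is a nested JSON document whose fields map onto columns and structs. To make that mapping concrete, the following Python sketch flattens one tweet into a row with a few of the same fields. The sample tweet is made up, and the selection of fields and the helper name are assumptions for this example; the JSON SerDe performs the equivalent mapping inside Hive.

```python
import json


def flatten_tweet(raw_json):
    """Turn one nested tweet document into a flat row with selected fields."""
    tweet = json.loads(raw_json)
    user = tweet.get("user") or {}
    retweeted = tweet.get("retweeted_status") or {}
    return {
        "id": tweet.get("id"),
        "text": tweet.get("text"),
        "screen_name": user.get("screen_name"),
        "followers_count": user.get("followers_count"),
        "retweet_count": tweet.get("retweet_count"),
        # Only present when the tweet is a retweet; otherwise None.
        "retweeted_screen_name": (retweeted.get("user") or {}).get("screen_name"),
    }


# Made-up sample tweet with the nested structure the table expects.
sample = json.dumps({
    "id": 42,
    "text": "RT @someone: hello",
    "retweet_count": 3,
    "user": {"screen_name": "example_user", "followers_count": 120},
    "retweeted_status": {"text": "hello", "user": {"screen_name": "someone"}},
})
row = flatten_tweet(sample)
plain_row = flatten_tweet('{"id": 7, "text": "no retweet", "user": {"screen_name": "u"}}')
print(row)
```

Fields that are missing from a document come back as `None`, which corresponds to the NULL values Hive returns for absent JSON attributes.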
5. Find the users with the highest influence:
hive>SELECT t.retweeted_screen_name, sum(retweets) AS total_retweets, count(*) AS tweet_count
FROM (SELECT retweeted_status.user.screen_name AS retweeted_screen_name, retweeted_status.text, max(retweet_count) AS retweets
FROM tweets
GROUP BY retweeted_status.user.screen_name, retweeted_status.text) t
GROUP BY t.retweeted_screen_name
ORDER BY total_retweets DESC
LIMIT 10;
retweeted_screen_name  total_retweets  tweet_count
malaysianairlines      493             1
HarvardBiz             362             6
TechCrunch             314             7
analytics              244             10
BigDataBorat           201             6
stephen_wolfram        182             1
CloudExpo              153             28
TheNextWeb             150             1
GonzalezCarmen         121             10
bigdata                100             37
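The influence query works in two stages: the inner query keeps the highest retweet count seen for each (author, text) pair, and the outer query sums those values and counts the distinct tweets per author. The following Python sketch reproduces that logic on made-up rows so the two stages are easier to follow. The field names mirror the query's aliases, while the function name and data are assumptions for this example.

```python
from collections import defaultdict


def top_retweeted(rows, limit=10):
    """Rank authors by total retweets, mirroring the two-stage Hive query."""
    # Stage 1 (inner query): highest retweet_count per (author, text) pair.
    max_per_tweet = {}
    for row in rows:
        key = (row["retweeted_screen_name"], row["retweeted_text"])
        max_per_tweet[key] = max(max_per_tweet.get(key, 0), row["retweet_count"])

    # Stage 2 (outer query): sum the maxima and count distinct tweets per author.
    totals = defaultdict(lambda: [0, 0])  # author -> [total_retweets, tweet_count]
    for (author, _text), retweets in max_per_tweet.items():
        totals[author][0] += retweets
        totals[author][1] += 1

    ranked = sorted(
        ((author, total, count) for author, (total, count) in totals.items()),
        key=lambda entry: entry[1],
        reverse=True,
    )
    return ranked[:limit]


# Made-up rows: the same original tweet can appear several times with growing counts.
rows = [
    {"retweeted_screen_name": "alice", "retweeted_text": "a", "retweet_count": 5},
    {"retweeted_screen_name": "alice", "retweeted_text": "a", "retweet_count": 8},
    {"retweeted_screen_name": "alice", "retweeted_text": "b", "retweet_count": 2},
    {"retweeted_screen_name": "bob", "retweeted_text": "c", "retweet_count": 7},
]
ranking = top_retweeted(rows)
print(ranking)
```

Taking the maximum first avoids double counting: every captured retweet carries the running retweet total of the original tweet, so summing the raw rows would inflate the result.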
6. Find the users with the most followers:
hive>select user.screen_name, user.followers_count c from tweets order by c desc;
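The query above returns one row per tweet, so a user who tweeted several times appears several times in the result. One possible refinement, shown here as a Python sketch on made-up rows rather than as a change to the Hive query, is to keep only the highest follower count seen per user before ranking. The function name and the data are assumptions for this example.

```python
def top_followers(rows, limit=10):
    """Rank users by follower count, keeping one entry per screen name."""
    best = {}
    for row in rows:
        name = row["screen_name"]
        best[name] = max(best.get(name, 0), row["followers_count"])
    return sorted(best.items(), key=lambda item: item[1], reverse=True)[:limit]


# Made-up rows: one per tweet, so active users repeat.
rows = [
    {"screen_name": "carol", "followers_count": 1500},
    {"screen_name": "dave", "followers_count": 90},
    {"screen_name": "carol", "followers_count": 1520},
    {"screen_name": "erin", "followers_count": 40000},
]
leaders = top_followers(rows)
print(leaders)
```

Follower counts change while tweets are being collected, which is why the sketch takes the maximum per user instead of the first value it sees.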