Crime/Event Detection on Twitter
Data Sciences Summer Institute 2011
Multimodal Information Access & Synthesis, University of Illinois at Urbana-Champaign

Our Team
Team members: Elisee Habimana, Jicong Wang, Sridevi Maharaj, Ronald Doku, Mingjia Zhang, Tobias Kin Hou Lei, Ravi Khadiwala, Duber Gomez, Rui Yang
Project leaders: Yizhou Sun, Rui Li

Motivation - Why Twitter?
Wide Coverage
Real Time

Motivation - An Example
• An earthquake struck Chile at 03:34 local time on Sat Feb 27, 2010
• Traditional communication was almost impossible for 2-3 hours; the first video images became available 6-7 hours after the quake
Source: <Information Credibility on Twitter>, by Carlos Castillo et al.

Motivation - Another Example
• Tweet posted at 2:22pm, June 28th, 20 minutes after the shot, while the first news report appeared almost 3 hours later

Motivation
• Twitter reshapes the way people spread and receive information
• Its real-time nature makes Twitter a good source of breaking news
• Official and verified accounts on Twitter provide reliable information
• We propose to build a web application that provides reliable, real-time, crime-related information

Demo

Table of Contents
• Major Challenges
• Crime Focused Crawling
• Tweet Classification
• Event Extraction
• Tweet Ranking
• Clustering
• Tools
• Summary

Major Challenges
• Most tweet content is useless for us
  o Pointless babble – 40%
  o Conversational – 38%
  o Pass-along value – 9%
  o Self-promotion – 6%
  o Spam – 4%
  o News – 4%
  o Crime related – 0.005%
• Roughly 10,000 crime related tweets each day
• Information like location and time is not always explicit
• Display only the most important tweets
• Present results in an organized fashion
Source: <Twitter Study – August 2009> Kelly, Ryan, ed (August 12, 2009)

Project Flowchart

Crime Focused Crawling
Crawling crime related tweets from Twitter
Presented by
Jicong Wang

A Snapshot of Twitter Data
USERID             43893075
ID                 68542312782905344
TEXT               Break shooting scene 1 "No More" with @dindamanda @yuyayuyi http://lockerz.com/s/100883315
LOCATION           GeoLocation latitude=-6.196612, longitude=106.829552
PLACE
TIME               Thu May 12 00:05:35 CDT 2011
URLS               url=http://lockerz.com/s/100883315,
MentionedEntities: 37623286 66072730
Hashtags:
also number of Followers, number of Friends, name of User, etc.

NOT ALL TWEETS ARE CRIME RELATED! ONLY about 0.005%!

Observation

Iteratively Refining Rules
• Repeat the above procedures until an ideal rule is obtained

Problem
However, there are STILL many "fake" crime tweets

Refine the Rules
Single Keyword
e.g. crime, kill, death, police, cop, shot
Combination of Keywords
Key Phrases
  o found shot OR died OR injured OR body
  o armed OR unarmed robbery
  o police on scene of

Result
• Improved crawling result:

  Keyword       Proportion of crime related tweets
  Single        < 5%
  Combination   50% among results from single keywords

• Crawling result: about 25,000 crawled tweets per day, from over 13,000 users per day.

Tweet Classification
Determine whether a tweet describes a crime-related event
Presented by Tobias Kin Hou Lei

Are these tweets related to crime?

A Classification Approach

Feature Engineering - Basic Features
• Concept clusters
  o Natural disaster: {earthquake,tornado, ...}
  o Weapon: {weapon,weapons,gun,guns,gunshot, ...}
  o Injure: {...}
  o Burglar: {...}
  o ...
  o Non-Fire: {hilarious,weather,red,moon,sun, ... ,musician, pizza,cook,music,dance justin bieber}
• Can generalize to unseen words, e.g. a model trained on "tornado warning" can recognize "earthquake warning" because both words belong to the same cluster.

Traditional Classification Features
Only Text Classification
But tweets are short and noisy.
– at most 140 characters
– contain noisy words
– contain URLs and tags

Feature Engineering - Social Features
• Special tags:
  o #hpd
  o #breaking news

Feature Engineering - Social Features
• User as a feature
  o List of verified police departments on Twitter
• URL
• Date
• Number

Feature Engineering - Social Features

Classification Model
• Naive Bayes
  o A simple model that performs well for online classification.
  o With many meaningful features and enough training data, different classification models perform similarly.

Training Data
• Crawled from Twitter over different periods of time
• Manually labeled by our team
• 2000 samples for training, among them:
  o 60% positive samples
  o 40% negative samples
• 1000 samples for testing
  o 65% positive samples
  o 35% negative samples

Summary
• About 100 concept clusters cover different areas of the feature space
• Average accuracy on the test set is 83.788%

Event Extraction
Extracting event information and grouping
Presented by Ravi Khadiwala

Event Extraction
• The text of an individual tweet may contain information not previously found through data crawling
• This information is often useful to the user
  o Allows the user to visualize where a crime occurred
  o Allows the user to filter the view by category
  o Decreases the number of raw tweets the user must read
• This information is also useful for improving performance
  o Ranking
  o Clustering
  o Improves accuracy

The Social Location Web

Temporal/Spatial Information
Five potential sources of locations, listed in descending order of perceived usefulness:
• GPS tagged tweets
  latitude=57.8433342, longitude=12.6506338
• 'Place' tagged tweets
  (57.6190897,12.427637),(57.6190897,12.7635394),(57.8653997,12.7635394),(57.8653997,12.427637)
• User location
• Textual Location Extraction
  o Named Entity Recognition
  o Regular Expressions

Temporal/Spatial Information
• Location information is structured hierarchically based on reliability
• Use Named Entity Recognition
  o Succeeds on: "I just
witnessed a robbery in Champaign"
  o Fails on: "Breaking and entering at 128 Maple St."
• Use regular expressions to recognize common formatting of addresses, highways, etc.
• Time is based on the tweet time

Regex Example
"[0-9]+ ([A-Z][A-Za-z]* )+ (ALLEE|ALLEY|ALLY|ALY|ANEX|ANNEX|ANNX|ANX|ARC|ARCADE|AV|AVE|AVEN|AVENU|AVENUE|AVN|AVNUE|BAYOO|BAYOU|BCH|BEACH|BEND|BND|BLF|BLUF|BLUFF|BLUFFS|BOT|BOTTM|BOTTOM|BTM|BLVD|BOUL|BOULEVARD|BOULV|BR|BRANCH|BRNCH|BRDGE|BRG|BRIDGE|BRK|BROOK|BROOKS|BURG|BURGS|BYP|BYPA|BYPAS|BYPASS|BYPS|CAMP|CMP|CP|CANYN|CANYON|CNYN|CYN|CAPE|CPE|CAUSEWAY|CAUSWAY|CSWY|CEN|CENT|CENTER|CENTR|CENTRE|CNTER|CNTR|CTR|CENTERS|CIR|CIRC|CIRCL|CIRCLE|CRCL|CRCLE|CIRCLES|CLF|CLIFF|CLFS|CLIFFS|CLB|CLUB|COMMON|COR|CORNER|CORNERS|CORS|COURSE|CRSE|COURT|CRT|CT|COURTS|CTS|COVE|CV|COVES|CK|CR|CREEK|CRK|CRECENT|CRES|CRESCENT|CRESENT|CRSCNT|CRSENT|CRSNT|CREST|CROSSING|CRSSING|CRSSNG|XING|CROSSROAD|CURVE|DALE|DL|DAM|DM|DIV|DIVIDE|DV|DVD|DR|DRIV|DRIVE|DRV|DRIVES|EST|ESTATE|ESTATES|ESTS|EXP|EXPR|EXPRESS|EXPRESSWAY|EXPW|EXPY|EXT|EXTENSION|EXTN|EXTNSN|EXTENSIONS|EXTS|FALL|FALLS|FLS|FERRY|FRRY|FRY|FIELD|FLD|FIELDS|FLDS|FLAT|FLT|FLATS|FLTS|FORD|FRD|FORDS|FOREST|FORESTS|FRST|FORG|FORGE|FRG|FORGES|FORK|FRK|FORKS|FRKS|FORT|FRT|FT|FREEWAY|FREEWY|FRWAY|FRWY|FWY|GARDEN|GARDN|GDN|GRDEN|GRDN|GARDENS|GDNS|GRDNS|GATEWAY|GATEWY|GATWAY|GTWAY|GTWY|GLEN|GLN|GLENS|GREEN|GRN|GREENS|GROV|GROVE|GRV|GROVES|HARB|HARBOR|HARBR|HBR|HRBOR|HARBORS|HAVEN|HAVN|HVN|HEIGHT|HEIGHTS|HGTS|HT|HTS|HIGHWAY|HIGHWY|HIWAY|HIWY|HWAY|HWY|HILL|HL|HILLS|HLS|HLLW|HOLLOW|HOLLOWS|HOLW|HOLWS|INLET|INLT|IS|ISLAND|ISLND|ISLANDS|ISLNDS|ISS|ISLE|ISLES|JCT|JCTION|JCTN|JUNCTION|JUNCTN|JUNCTON|JCTNS|JCTS|JUNCTIONS|KEY|KY|KEYS|KYS|KNL|KNOL|KNOLL|KNLS|KNOLLS|LAKE|LK|LAKES|LKS|LAND|LANDING|LNDG|LNDNG|LA|LANE|LANES|LN|LGT|LIGHT|LIGHTS|

Location Disambiguation
• Search extracted locations through a city-to-GPS lookup table
• Many American city names are repeated (Atlanta,IL vs Atlanta,GA)
  o Check for well formatted
locations (city,state)
  o If not, resolve by selecting the matched city with the largest population
• Give preference to other location sources (like user location and GPS) when there are multiple matches

Categorization
• We would like categories with finer granularity than crime or not crime
• Based on keyword partitions corresponding to categories, e.g.:
  o Robbery/Theft: {robbed,robbery,burglar,theft...}
  o Natural Disaster: {tornado,typhoon,earthquake...}
• Keyword-based crawling guarantees the presence of words that convey meaningful category information

Ranking
Scoring and Ordering Tweets based on Importance
Presented by Ravi Khadiwala

Ranking
• We only want to display the best "n" tweets
  o The nature of Twitter may result in an extremely variable amount of data
  o Serves as another way to filter non-crime tweets
  o May be able to highlight important events
• Summarize the most important data points
  o Avoid overwhelming the user with results

Learning to Rank
Goal: Learn a function f: X -> r, where X is a vector of features and r is an importance score
Strategy: Take a pointwise approach and use a sample of manually scored data to find the curve that fits our labeled data
We use linear regression with the simple least squares method to find weights such that
r = w_1 x_1 + w_2 x_2 + w_3 x_3 + ... + w_n x_n

Determine Ranking Features
• Selected from a large pool of potential features
• Social
  o Number of hashtags, URLs, @ (indicates a reply), retweet count
• Contextual
  o Tweet length, category, mentioned locations
• User Credibility
  o Age of user account, friends, followers, status count, verification
• Classifier Confidence

Ranking Features and Weights
• Labeled ~500 tweets with a ranking (integer from 1 to 5)
• Linear regression on all features (normalized)
  o Examined correlation coefficients
  o Examined weights
  o Pruned features
• Repeated until we had an adequate feature set with logical weights

Ranking Features and Weights

  Features        Weights
  category        -0.996904004778
  account age      2.87974471144
  favorites        1.71671010105
  status count     1.17242993534
  followers        2.67005302808
  confidence      -3.97882564778

Clustering
Geographical location: determinant for grouping tweets together
Presented by Ronald Doku

Clustering tweets
• Clustering tweets means grouping overlapping tweets found in the same location into one category.

Why is tweet clustering important?
• Clustered tweets inform the user about where most events are happening at a particular time.
• The sizes of the clusters also convey how relevant or important the tweets are.
• E.g. a user may want to find out how far a wildfire outbreak has spread. Clustered tweets about the wildfire on the map show the user where the fire is or has spread to.

Clustered tweets: high level overview

Clustered tweets: after click (California)

How do we cluster tweets?
By defining at which zoom levels each tweet should appear, we cluster the tweets to reduce the number shown at a time. We call this hierarchical clustering.
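
The slides do not say how tweets are assigned to zoom levels, so the following is only a minimal sketch of one way zoom-dependent clustering could work. It assumes a simple latitude/longitude grid whose cell size halves with each zoom level. The function names, the `base` cell size, and the sample tweets are all hypothetical and are not taken from the project.

```python
from collections import defaultdict

def cell_size_degrees(zoom, base=45.0):
    """Assumed rule (not from the slides): grid cells halve per zoom level."""
    return base / (2 ** zoom)

def cluster_by_zoom(tweets, zoom):
    """Group (lat, lon, text) tuples that fall into the same grid cell.

    Returns one dict per cluster with its centroid, size, and tweet texts,
    sorted so the largest clusters come first.
    """
    size = cell_size_degrees(zoom)
    cells = defaultdict(list)
    for lat, lon, text in tweets:
        # Floor division maps nearby coordinates to the same integer cell key.
        key = (int(lat // size), int(lon // size))
        cells[key].append((lat, lon, text))

    clusters = []
    for members in cells.values():
        clusters.append({
            "lat": sum(m[0] for m in members) / len(members),
            "lon": sum(m[1] for m in members) / len(members),
            "size": len(members),
            "tweets": [m[2] for m in members],
        })
    return sorted(clusters, key=lambda c: c["size"], reverse=True)

# Hypothetical sample data, for illustration only.
SAMPLE = [
    (34.05, -118.24, "wildfire near downtown"),
    (34.10, -118.30, "smoke visible from the hills"),
    (32.72, -117.16, "brush fire reported"),
    (41.88, -87.63, "robbery on Main St"),
]

if __name__ == "__main__":
    for zoom in (0, 3, 10):
        summary = [(c["size"], round(c["lat"], 2), round(c["lon"], 2))
                   for c in cluster_by_zoom(SAMPLE, zoom)]
        print(zoom, summary)
```

At a low zoom level the three California points merge into one marker, while at a high zoom level every tweet appears on its own. This matches the overview-then-click behaviour described above, and the `size` field could drive the marker size on the map.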
Miscellaneous/Tools
Presented by Sridevi Maharaj

Tools

Summary
Conceptual Level
– Detects and monitors crime via a popular 21st-century social media platform
Technical Level
– Developed a crawler to obtain data
– Identified and explored useful features from the social network to rank and classify crime
System Level
– Built a user-friendly system
– Works in real time
– Processes large collections of data
– iPhone interface supported

Questions?