Lecture: Topic Detection
Text Mining Wissensrohstoff Text
Gerhard Heyer
Universität Leipzig
heyer@informatik.uni-leipzig.de
Institut für Informatik
Trend and Topic Detection
Prof. Dr. G. Heyer
Text Mining – Wissensrohstoff Text
Goals and tasks
• Quickly finding current information
• Exploratory search: finding documents that were particularly
topical at a given time
• Tasks
– automatic classification of documents by topic and time
(time-topic matrix)
– discovering and tracking new topics (text mining)
• Previous IR methods are not sufficient:
– keyword search vs. generic queries
• "what happened?"
– level of abstraction: "Arab Spring"
– temporal dimension:
• "what is new?", "how does a topic develop?"
Applications
• Journalism
• Stock and financial market analysis
• Consumer market research
• Politics, crisis detection
• eHumanities
• Private information and entertainment
• Search engines
• Improved translation
Some real life problems
In many cases, the user needs support
• to become familiar with a search domain,
• to identify terms that are of potential interest to the topic
being researched, and
• to follow variant paths to explore the domain of interest
E.g., instances of events that caused critical comments
in the western media on the Iraq war
Exploratory search
The notion of exploratory search has been coined to
cover all cases that go beyond "lookup", such as learning or
investigating [1]
In general, exploratory search is taken to support users
in investigating a data space in depth as well as in
breadth [2]
Terminology (cf. Allan 2000)
• event: "A reported occurrence at a specific time and place, and
the unavoidable consequences. Specific elections, accidents,
crimes, natural disasters.”
• activity: "A connected set of actions that have a common focus
or purpose - campaigns, investigations, disaster relief efforts."
• topic: "A seminal event or activity, plus all derivative (directly
related) facts, events or activities."
• story: "A topically cohesive segment of news that includes two
or more declarative independent clauses about a single event."
Examples
• Hurricane Mitch (Sep./Oct. '98)
– On topic: coverage of the disaster itself; estimates of damage and
reports of loss of life; relief efforts by aid organizations; impact of the
hurricane on the economies of the affected countries.
• Euro Introduced (1.1.1999)
– On topic: stories about the preparation for the common currency
(negotiations about exchange rates and financial standards to be shared
among the member nations); official introduction of the Euro; economic
details of the shared currency; reactions within the EU and around the
world.
Basic approaches
• Counting terms
• Counting particular kinds of terms
(NEs, topics, ...)
• Differential analyses
(tf/idf, reference corpus, measuring surprise)
• Clustering
• Classification
• Information Extraction
• Relation Extraction
• Co-occurrence analysis
• ... ... ...
Previous work
• Relevance of terms measured by multiple document
models and thresholds (Swan and Allan 2000, Kumaran
and Allan 2004)
• Temporal extension of relevant terms modelled by
weighted finite state automaton (Kleinberg 2002)
• Topic detection based on co-occurrence patterns (LDA)
and locality of those patterns over time (Wang and
McCallum 2006)
Two examples (taken from McCallum, 2006): topics
Source: State-of-the-Union addresses 1780 - 2000
[Figure: two example topics, "Panama Canal" and "Cold War"]
TDT Corpora
Topic Detection and Tracking (TDT) is a multi-site
research project, now in its third phase, to develop
core technologies for news understanding
systems. Specifically, TDT systems discover the
topical structure in unsegmented streams of news
reporting as it appears across multiple media and
in different languages.
http://projects.ldc.upenn.edu/TDT/
Latest release: TDT5 (2006)
Approaches
Tasks
Topic Detection
Represent topics as clusters of the stories seen so far
• Single-pass clustering
for the current story S...
– determine the most similar cluster C
– if the similarity is "high", add S to C; otherwise create a new cluster (FSD:
mark S as NEW)
Optimization
– only two clusters: Yes and No (initialized with the corresponding documents
from T)
– determine the similarity of S to Yes and to No
– add S to the most similar cluster
• kNN (nearest neighbour)
– compare S directly with previous stories (time window)
– consider the k most similar stories and their topics
– topic (cluster) of S is decided by "simple majority"
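The single-pass clustering loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: stories are whitespace-tokenised into raw term-frequency vectors, similarity is the cosine, a cluster is represented by the sum of its members' vectors, and the threshold of 0.3 is an arbitrary choice rather than a value from the lecture. The names `cosine` and `single_pass_clustering` are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity of two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def single_pass_clustering(stories, threshold=0.3):
    """Assign each story to its most similar cluster or open a new one.

    Returns (clusters, flags); flags[i] is True when story i opened a
    new cluster, i.e. would be marked NEW in first story detection.
    The threshold is an assumed value, not one from the lecture.
    """
    clusters = []  # each cluster: {"centroid": Counter, "members": [indices]}
    flags = []
    for idx, text in enumerate(stories):
        vec = Counter(text.lower().split())
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["centroid"].update(vec)  # summed term counts act as centroid
            best["members"].append(idx)
            flags.append(False)
        else:
            clusters.append({"centroid": Counter(vec), "members": [idx]})
            flags.append(True)
    return clusters, flags

stories = [
    "hurricane mitch hits honduras",
    "euro introduced as common currency",
    "hurricane mitch damage in honduras grows",
]
clusters, flags = single_pass_clustering(stories)
```

In this toy run the third story joins the cluster opened by the first one, while the second story starts a topic of its own.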
Topic Detection - Single-Pass Clustering with a Language Model [3]
Determine the word distribution for each cluster C
(the probability that a word w occurs in C)
• Cluster most similar to the current story S:
$$ \mathrm{sim}(S, C) = \sum_{i=1}^{N} \left( \log p_c(w_i) - \log p_b(w_i) \right) - t $$
• N = length of S, p_c(w) = probability of w in the cluster, p_b(w) = probability of w in
the background model, t = "time penalty"
• sim is large if:
– the terms in S occur often in C and rarely in the background
– the stories in C are "new"
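The language-model similarity can be tried out with a small Python sketch. It assumes add-alpha smoothed unigram models over a shared vocabulary and a caller-supplied time penalty; the smoothing scheme and the function names `unigram_model` and `lm_similarity` are assumptions made for illustration and are not taken from [3].

```python
import math
from collections import Counter

def unigram_model(texts, vocab, alpha=1.0):
    """Add-alpha smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(w for t in texts for w in t.split())
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def lm_similarity(story, p_cluster, p_background, time_penalty=0.0):
    """Sum of log p_c(w_i) - log p_b(w_i) over the story, minus the penalty t."""
    score = sum(
        math.log(p_cluster[w]) - math.log(p_background[w])
        for w in story.split()
        if w in p_cluster and w in p_background
    )
    return score - time_penalty

cluster_a = ["hurricane mitch honduras", "hurricane damage honduras"]
cluster_b = ["euro currency introduced", "euro exchange rates"]
story = "hurricane honduras relief"

# Shared vocabulary: every word of the toy corpus and of the story.
vocab = sorted({w for t in cluster_a + cluster_b + [story] for w in t.split()})
p_bg = unigram_model(cluster_a + cluster_b, vocab)
p_a = unigram_model(cluster_a, vocab)
p_b = unigram_model(cluster_b, vocab)

sim_a = lm_similarity(story, p_a, p_bg)
sim_b = lm_similarity(story, p_b, p_bg)
```

Here `sim_a` exceeds `sim_b` because the story's terms are frequent in the hurricane cluster relative to the background; a positive `time_penalty` lowers the score of clusters whose stories are older.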
First Story Detection
• determine the similarity of the current story to the "past"
• the story is NEW if the similarity is "low", otherwise OLD
Vector space model:
– represent stories as query vectors
– stemming, stop-word elimination, term weighting
Variants:
– term weights (raw term frequency, tf*idf, ...)
– similarity measures (cosine, weighted sum, ...)
– thresholds for NEW/OLD
– set of comparison stories (time window)
First Story Detection - Single-Pass Clustering [4]
• for the current story S with term vector d:
– build a query q from the N weighted features of S
– determine the base threshold x = sim(q, S)
– compare the queries of previous stories with S
– if x + "time penalty" is exceeded in the process, then OLD(S); otherwise NEW(S)
– optional for OLD: "cluster" formation (associate S with the "trigger query")
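A rough Python sketch of this thresholding scheme follows. It reads the steps as: each earlier story keeps its query and base threshold x, and the current story is OLD as soon as one earlier query matches it more strongly than that query's threshold plus an age-dependent penalty. The cosine measure, the `scale` factor applied to x, and the linear penalty are assumptions chosen for illustration; they are not the parameters used in [4].

```python
import math
from collections import Counter

def tf_vector(text):
    """Raw term-frequency vector of a whitespace-tokenised story."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity of two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def first_story_detection(stories, n_features=3, scale=0.5, time_penalty=0.01):
    """Label each story NEW or OLD in arrival order.

    scale and time_penalty are illustrative assumptions, not values from [4].
    """
    memory = []  # (query, base threshold x, arrival index) of earlier stories
    labels = []
    for j, text in enumerate(stories):
        d = tf_vector(text)
        # An earlier query triggers when its similarity to the current story
        # exceeds its scaled base threshold plus a penalty growing with age.
        old = any(
            cosine(q, d) > scale * x + time_penalty * (j - i)
            for q, x, i in memory
        )
        labels.append("OLD" if old else "NEW")
        q = Counter(dict(d.most_common(n_features)))  # top-N weighted features
        memory.append((q, cosine(q, d), j))
    return labels

stories = [
    "mitch mitch hurricane hurricane honduras flood",
    "euro euro currency currency banks rates",
    "hurricane mitch honduras relief",
]
labels = first_story_detection(stories)
```

Raising `time_penalty` makes old queries harder to trigger, so a late follow-up story is more likely to be reported as NEW again.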
Topic Tracking
Given a training corpus for topic T, the question is: is S "on topic"?
• kNN
– determine the k nearest neighbours of the current story S in the training corpus
– if more of them are labelled "yes" than "no", then YES, otherwise NO
• Decision trees
– build one decision tree per topic T
– represent the training stories for T (labelled "yes" or "no") as queries
– node labels are statements about term weights q_i
– maximize the information gain, i.e. the "purity" of the subtrees
– goal: only "yes" queries or only "no" queries per leaf
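To make the information-gain criterion concrete, the following Python sketch scores candidate node tests of the form "weight of term > threshold" on a handful of labelled training queries. The toy weights, the fixed threshold of 0.0 (a plain presence test) and the helper names are assumptions for illustration; a full decision-tree learner would apply the same criterion recursively to each subtree.

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of "yes"/"no" labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def information_gain(examples, term, threshold):
    """Gain of the test "weight of term > threshold" on labelled queries.

    examples: list of (weights, label), weights being a dict term -> weight.
    """
    labels = [label for _, label in examples]
    left = [label for w, label in examples if w.get(term, 0.0) > threshold]
    right = [label for w, label in examples if w.get(term, 0.0) <= threshold]
    remainder = sum(
        len(part) / len(labels) * entropy(part) for part in (left, right)
    )
    return entropy(labels) - remainder

def best_split(examples, threshold=0.0):
    """Term whose test yields the highest information gain."""
    terms = sorted({t for w, _ in examples for t in w})
    return max(terms, key=lambda t: information_gain(examples, t, threshold))

training = [
    ({"hurricane": 0.8, "honduras": 0.5}, "yes"),
    ({"hurricane": 0.6, "relief": 0.4}, "yes"),
    ({"euro": 0.9, "currency": 0.7}, "no"),
    ({"relief": 0.3, "banks": 0.2}, "no"),
]
root_term = best_split(training)
```

In this toy data the test on "hurricane" separates the "yes" from the "no" queries perfectly, so it is chosen as the root node, whereas "relief" occurs in both classes and yields no gain.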
Topic Tracking - kNN Algorithm [5]
• Parameters: k > 0 and 0 < k1 < k, 0 < k2 < k
• for the current story S determine...
• K(k', m) := the set of the k' stories most similar to S from the
training corpus with label m
• P(S, k1) := K(k1, m), m = "yes"
• N(S, k2) := K(k2, m), m = "no"
• Probability that S is relevant to the given topic:
$$ P(\mathrm{yes} \mid S) = \frac{1}{k_1} \sum_{d \in P(S, k_1)} \cos(d, S) - \frac{1}{k_2} \sum_{d \in N(S, k_2)} \cos(d, S) $$
• Total number of positive training examples per event (<= 16),
e.g. k = 5
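The relevance score can be computed directly from its definition, as in the following Python sketch. It assumes raw term-frequency vectors with cosine similarity and tiny toy training data; the function name `knn_relevance` and the default values of k1 and k2 are hypothetical. Note that the score is a difference of two averages and can therefore be negative.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity of two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_relevance(story, training, k1=2, k2=2):
    """Mean cosine to the k1 nearest "yes" stories minus
    mean cosine to the k2 nearest "no" stories."""
    s = Counter(story.lower().split())

    def top_mean(label, k):
        sims = sorted(
            (cosine(Counter(text.lower().split()), s)
             for text, lab in training if lab == label),
            reverse=True,
        )[:k]
        # len(sims) equals k whenever at least k examples carry this label
        return sum(sims) / len(sims) if sims else 0.0

    return top_mean("yes", k1) - top_mean("no", k2)

training = [
    ("hurricane mitch honduras damage", "yes"),
    ("mitch relief efforts honduras", "yes"),
    ("euro currency introduced", "no"),
    ("stock market rates", "no"),
]
on_topic = knn_relevance("hurricane mitch relief", training)
off_topic = knn_relevance("euro rates", training)
```

A story is then accepted as on topic when its score exceeds a decision threshold, which has to be tuned on held-out data.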
Improvements
Term-centred
• different term weights and similarity measures (vector space)
• use of named entities
• taking context changes into account
Further options
• exploiting...
– text structure (e.g. first / last sentence)
– influence of the topic on the kind of terms: where vs. who (NEs), verbs
• NLP: finding "key sentences"
• probabilistic predictions based on the temporal development of a topic
– crime -> investigation -> trial
Context Volatility
Intuition
Our focus in exploratory search is on the retrieval of what authors
consider "interesting" (for whatever reason)
"Interesting" terms mirror an author's, or society's, view on the events
described. And this view can change over time.
Sketch of the idea [6,7]
• In addition to term frequency (and derived measures),
consider a term's change of context as an additional
dimension for analyzing what people consider
interesting
• Changes in the global context of a term (the set of its
co-occurrences) indicate a change of usage, and hence
may be considered interesting (reporting something
new)
• The rate of change is indicative of how much the
"opinion stakeholders" agree or disagree on the appropriate
usage of a term
Example Co-occurrence Graph "iraq" 1/3
March 2004
[Figure: co-occurrence graph; strong cluster related to the Madrid train bombings]
Example Co-occurrence Graph "iraq" 2/3
May 2004
[Figure: co-occurrence graph; strong cluster related to the scandal at Abu Ghraib prison]
Example Co-occurrence Graph "iraq" 3/3
August 2004
[Figure: co-occurrence graph with clusters related to:]
• skirmishes in and around Najaf
• ceasefire with Muqtada al-Sadr
• installation of Iyad Allawi
Context Volatility
Assumptions
We compute a term's change of context by averaging the
changes in the ranks of its co-occurrences for every time
slice, based on a reference set of all its co-occurrences
occurring over a total span of time (e.g. 20 years of the NYT
corpus with 7,475 time slices)
Context volatility is computed as the average variance of a
term's context changes for some period of time
To avoid problems of data sparseness, we compute the shifts in
rank position on a 30-day average
Identifying interesting terms - Algorithm
1. Compute all significant overall co-occurrences $C_{o,w}$
for term $w$.
2. Compute all significant co-occurrences $C_{t_i,w}$ for
every time slice $t_i$ for term $w$.
3. For every co-occurrence term $c_{o,w,j} \in C_{o,w}$ and for all
time slices $t_i$, compute the series $\mathrm{rank}_i(c_{o,w,j})$, which
represents the ranks of $c_{o,w,j}$ in the different global
contexts of $w$ for every time slice $t_i$.
4. Compute the variance of the rank series $\mathrm{Var}(\mathrm{rank}_i(c_{o,w,j}))$
for every co-occurrence term $c_{o,w,j} \in C_{o,w}$.
5. Compute the average of the variances to obtain a
term's volatility:
$$ \mathrm{Vol}(w) = \operatorname{avg}_j \mathrm{Var}(\mathrm{rank}_i(c_{o,w,j})) = \frac{1}{|C_{o,w}|} \sum_{j} \mathrm{Var}(\mathrm{rank}_i(c_{o,w,j})) $$
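The five steps of the algorithm can be sketched in Python as follows. The sketch assumes that the significant co-occurrences per time slice are already available as ranked lists (most significant first), uses the population variance, and assigns a co-occurrence that is absent from a slice the rank |C_{o,w}| + 1. These choices, the toy data and the function name `context_volatility` are assumptions for illustration; the treatment of missing co-occurrences and the 30-day smoothing used in [6] are not reproduced here.

```python
from statistics import mean, pvariance

def context_volatility(slices, reference):
    """Average variance of the rank series of each reference co-occurrence.

    slices: one ranked co-occurrence list per time slice (best first).
    reference: overall significant co-occurrences C_{o,w} of the term w.
    """
    missing_rank = len(reference) + 1  # assumption: absent terms rank last
    variances = []
    for c in reference:
        # Rank series of co-occurrence c across all time slices (1-based).
        ranks = [s.index(c) + 1 if c in s else missing_rank for s in slices]
        variances.append(pvariance(ranks))
    return mean(variances)

# Toy global contexts of the term "iraq" in three time slices.
stable = [["war", "baghdad", "troops"]] * 3
volatile = [
    ["war", "baghdad", "troops"],
    ["prison", "abuse", "war"],
    ["najaf", "ceasefire", "war"],
]
reference = ["war", "baghdad", "troops", "prison", "abuse", "najaf", "ceasefire"]

vol_stable = context_volatility(stable, ["war", "baghdad", "troops"])
vol_changing = context_volatility(volatile, reference)
```

A term whose co-occurrences keep the same ranks in every slice has volatility 0, while a term whose context is reshuffled from slice to slice gets a clearly positive value, independent of how often the term itself occurs.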
Analogy to financial markets
Stock market     | Topic detection
trading volume   | term frequency
fixing of price  | fixing of global context
• Fixing the meaning of a term can be considered like
fixing the price of a stock
• Analysis of volatility of global contexts can likewise
be employed to detect interesting topics and their
change over time
An example from the financial market
„Interesting“ terms for 2004 (NYT corpus)
presidential debate, election day, the convention, touchdown, caucuses, turnout, the reach, hurricane, abu ghraib,
howard dean, ghraib, abu, new year, fenway, league championship series, john edwards, inning, debates, united states
senate, halloween, republican convention, innings, quarterback, fenway park, battleground, running mate, falluja,
receiver, national convention, iowa caucuses, democratic
convention, the new year, division series, incumbent, incumbents, moderator, gephardt, same-sex marriage, contested,
the terrorists, assists, american people, undecided, tax cuts,
pitches, gay marriage, electoral votes, abortion, counterterrorism, wesley, voter, caucus, the american people,
democratic national convention, end zone, martínez, . . .
Relation to frequency
Comparison to tf/idf (result\2001_top1000_var.PNG)
Overlap grows linearly with the number of "interesting" terms
What does the context volatility measure extract?
Usage of terms
Main associations as reflected by usage
(global contexts)
New aspects (change of associations)
Interesting terms
„Hotly discussed“ topics
Topics and events (in the sense of [Allan 2002])
Time related, cyclic concepts
The converse of context volatility - „stable“ concepts
Topics rather than words
Characteristics of context volatility
• Frequency independent: works equally well with high- and
low-frequency terms
• Scalable: works with large amounts of data
• Interactive: context changes can be explored interactively
for any period of time
• Streaming is possible: does not necessarily require
global knowledge (a representative sub-corpus is sufficient)
Cf. http://aspra23.informatik.uni-leipzig.de:8400/blazeds/volsquares_simple.swf
Interactively exploring 2004
Prof. Dr. G. Heyer
Dagstuhl Workshop on Document Mining, April 2011
References
[1] Marchionini, G.: Exploratory Search: From Finding to Understanding.
Communications of the ACM 49(4), 41-46 (2006)
[2] Waitelonis, J.; Knuth, M.; Wolf, L.; Hercher, J.; Sack, H.: The Path is the
Destination - Enabling a New Search Paradigm with Linked Data. In Proc.
of Linked Data in the Future Internet at the Future Internet Assembly,
Ghent, 16/17 Dec 2010, CEUR Workshop Proceedings, ISSN 1613-0073.
[3] Statistical Models for Tracking and Detection, Dragon Systems, 1999
[4] Papka, Allan: Online New Event Detection using Single-Pass
Clustering, University of Massachusetts, 1997
[5] Yang, Carbonell, Brown: Learning Approaches for Detecting and
Tracking News Events, CMU, 1999
[6] Heyer et al. 2009, KDIR 2009: Proc. of Int. Conf. on Knowledge
Discovery and Information Retrieval, INSTICC Press, 2009
[7] Rohrdantz et al. 2010, Visuelle Textanalyse, Informatikspektrum 2010
References (continued)
[Kumaran & Allan 2004] Kumaran, G.; Allan, J.: Text classification and named
entities for new event detection. In SIGIR '04: Proceedings of the 27th
annual international ACM SIGIR conference on Research and development
in information retrieval, pages 297–304, New York, NY, USA. ACM, 2004.
[Swan & Allan 1999] Swan, R.; Allan, J.: Extracting significant time varying
features from text. In CIKM '99: Proceedings of the eighth international
conference on Information and knowledge management, pages 38–45, New
York, NY, USA. ACM, 1999.
[Wang & McCallum 2006] Wang, X.; McCallum, A.: Topics over time: a
non-Markov continuous-time model of topical trends. In Proceedings of the 12th
ACM SIGKDD international conference on Knowledge discovery and data
mining, KDD '06, pages 424–433, New York, NY, USA. ACM, 2006.
[Heyer et al. 2011] Heyer, G.; Keim, D.; Teresniak, S.; Oelke, D.: Interaktive
explorative Suche in großen Dokumentbeständen. Datenbank-Spektrum
3/11, pp. 195-206, Springer, 2011.