Lecture: Topic Detection
Text Mining: Wissensrohstoff Text
Gerhard Heyer, Universität Leipzig, Institut für Informatik
heyer@informatik.uni-leipzig.de
Trend and Topic Detection

Goals and tasks
• Quickly finding current information
• Exploratory search
  – finding documents that were especially topical at a particular time
• Tasks
  – automatic classification of documents by topic and time (time-topic matrix)
  – discovering and tracking new topics (text mining)
• Existing IR methods are not sufficient:
  – keyword search vs. generic queries
    • “what happened?”
  – level of abstraction: “Arab Spring”
  – temporal dimension:
    • “what is new?”, “how does a topic develop?”

Applications
• Journalism
• Stock and financial market analysis
• Consumer market research
• Politics, crisis detection
• eHumanities
• Private information and entertainment
• Search engines
• Improved translation

Some real-life problems
In many cases, the user needs support
• to become familiar with a search domain,
• to identify terms that are of potential interest to the topic being researched, and
• to follow variant paths to explore the domain of interest
E.g., instances of events that caused critical comments in the Western media on the Iraq war

Exploratory search
The notion of exploratory search has been coined to cover all cases that go beyond “lookup”, like learning or investigating [1].
In general, exploratory search is taken to support users in investigating a data space in depth as well as in breadth [2].

Terms (cf. Allen 2000)
• event: "A reported occurrence at a specific time and place, and the unavoidable consequences. Specific elections, accidents, crimes, natural disasters."
• activity: "A connected set of actions that have a common focus or purpose - campaigns, investigations, disaster relief efforts."
• topic: "A seminal event or activity, plus all derivative (directly related) facts, events or activities."
• story: "A topically cohesive segment of news that includes two or more declarative independent clauses about a single event."

Examples
• Hurricane Mitch (Sep./Oct. '98)
  – On topic: coverage of the disaster itself; estimates of damage and reports of loss of life; relief efforts by aid organizations; impact of the hurricane on the economies of the affected countries.
• Euro Introduced (1.1.1999)
  – On topic: stories about the preparation for the common currency (negotiations about exchange rates and financial standards to be shared among the member nations); official introduction of the Euro; economic details of the shared currency; reactions within the EU and around the world.

Basic approaches
• Counting terms
• Counting particular kinds of terms (NEs, topics, ...)
• Differential analyses (tf/idf, reference corpus, measuring surprise)
• Clustering
• Classification
• Information Extraction
• Relation Extraction
• Co-occurrence analysis
• ...

Previous work
• Relevance of terms measured by multiple document models and thresholds (Swan and Allan 2000, Kumaran and Allan 2004)
• Temporal extension of relevant terms modelled by a weighted finite state automaton (Kleinberg 2002)
• Topic detection based on co-occurrence patterns (LDA) and locality of those patterns over time (Wang and McCallum 2006)

Two examples (taken from McCallum, 2006): topics
Source: State-of-the-Union addresses 1780 - 2000
[Figure: topics “Panama Canal” and “Cold War”]

TDT corpora
Topic Detection and Tracking (TDT) is a multi-site research project, now in its third phase, to develop core technologies for news understanding systems. Specifically, TDT systems discover the topical structure in unsegmented streams of news reporting as it appears across multiple media and in different languages.
http://projects.ldc.upenn.edu/TDT/
Latest status: TDT5 (2006)

Approaches

Tasks

Topic Detection
Represent topics as clusters of stories that have already been seen
• Single-pass clustering for the current story S:
  – determine the most similar cluster C
  – if the similarity is “high”, add S to C; otherwise create a new cluster (FSD: mark S as NEW)
• Optimization
  – only two clusters: Yes and No (initialized with the corresponding documents from T)
  – determine the similarity of S to Yes and to No
  – add S to the most similar cluster
• kNN, nearest neighbour
  – compare S directly with previous stories (time window)
  – consider the k most similar stories and their topics
  – topic (cluster) of S is decided by “simple majority”
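
The single-pass clustering step on the Topic Detection slide can be sketched as follows. This is only an illustration under stated assumptions, not the lecture's implementation: whitespace tokenization, raw term counts, cosine similarity, summed term counts as the cluster representation, and the threshold value 0.3 are all choices made for the example. The names `cosine` and `single_pass_cluster` are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass_cluster(stories, threshold=0.3):
    """Assign each story to its most similar cluster, or open a new one.

    Returns one (cluster_index, is_new) pair per story, in input order.
    The threshold is an assumed value standing in for "high" similarity.
    """
    clusters = []      # one summed term-count vector per cluster
    assignments = []
    for text in stories:
        vec = Counter(text.lower().split())   # naive tokenization (assumption)
        best, best_sim = None, 0.0
        for idx, centroid in enumerate(clusters):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = idx, sim
        if best is not None and best_sim >= threshold:
            clusters[best].update(vec)        # add S to the most similar cluster C
            assignments.append((best, False))
        else:
            clusters.append(Counter(vec))     # new cluster; FSD would mark S as NEW
            assignments.append((len(clusters) - 1, True))
    return assignments


stories = [
    "hurricane mitch hits honduras",
    "euro introduced as common currency",
    "relief efforts after hurricane mitch",
]
print(single_pass_cluster(stories))  # [(0, True), (1, True), (0, False)]
```

The third story shares two terms with the first cluster and is therefore added to it, while the first two stories each open a new cluster. A higher threshold makes the procedure more eager to report new topics.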

Topic Detection - Single-Pass Clustering with a Language Model [3]
Determine the word distribution for each cluster C (the probability that a word w occurs in C)
• Cluster most similar to the current story S:
$$\mathrm{sim}(S, C) = \sum_{i=1}^{N} \bigl( \log p_c(w_i) - \log p_b(w_i) \bigr) - t$$
• N = length of S, p_c(w) = Prob(w) in the cluster, p_b(w) = Prob(w) in the background model, t = “time penalty”
• sim is large if:
  – the terms in S occur often in C and rarely in the background
  – the stories in C are “new”

First Story Detection
• determine the similarity of the current story to the “past”
• the story is NEW if the similarity is “low”, otherwise OLD
Vector space model:
  – represent stories as query vectors
  – stemming, stop word elimination, term weighting
Variants:
  – term weights (raw term frequency, tf*idf, ...)
  – similarity measures (cosine, weighted sum, ...)
  – thresholds for NEW/OLD
  – set of stories to compare against (time window)

First Story Detection - Single-Pass Clustering [4]
• for the current story S with term vector d:
  – build a query q from the N weighted features of S
  – determine the base threshold x = sim(q, S)
  – compare the queries of previous stories with S
  – if x + “time penalty” is exceeded in the process: OLD(S), otherwise NEW(S)
  – optional, for OLD: form “clusters” (associate S with the “trigger query”)

Topic Tracking
Given a training corpus for topic T, the question is: is S “on topic”?
• kNN
  – determine the k nearest neighbours of the current story S in the training corpus
  – if more of them are labelled “yes” than “no”: YES, otherwise NO
• Decision trees
  – build one decision tree per topic T
  – represent the training stories for T (labelled "yes", "no") as queries
  – node labels are statements about the term weights q_i
  – maximize the information gain, i.e. the "purity" of the subtrees
  – goal: each leaf contains only "yes" or only "no" queries

Topic Tracking - kNN Algorithm [5]
• Parameters: k > 0 and 0 < k1 < k, 0 < k2 < k
• for the current story S determine:
• K(k', m) := set of the k' stories from the training corpus that are most similar to S and carry the label m
• P(S, k1) := K(k1, m), m = “yes”
• N(S, k2) := K(k2, m), m = “no”
• Probability that S is relevant to the given topic:
$$P(\text{yes} \mid s) = \frac{1}{k_1} \sum_{d \in P(s, k_1)} \cos(d, s) - \frac{1}{k_2} \sum_{d \in N(s, k_2)} \cos(d, s)$$
• Total number of positive training examples per event (<= 16), e.g. k = 5

Improvements
Term-centred
• different term weights and similarity measures (vector space)
• use of named entities
• taking context changes into account
Further options
• exploiting...
  – text structure (e.g. first / last sentence)
  – the influence of the topic on the kind of terms: where vs. who (NEs), verbs
• NLP: finding “key sentences”
• probabilistic predictions based on the temporal development of a topic
  – crime -> investigation -> trial

Context volatility

Intuition
Our focus in exploratory search is on the retrieval of what authors consider “interesting” (for whatever reason).
“Interesting” terms mirror an author's, or society's, view on the events described. And this view can change over time.
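
Returning to the kNN tracking score from the Topic Tracking slides [5]: the formula averages the cosine similarities of the k1 most similar positive training stories and subtracts the average over the k2 most similar negative ones. A minimal sketch, assuming stories are already given as sparse term-weight dictionaries; the function names and the toy data are hypothetical. Note that the result can be negative, so it behaves as a relevance score rather than a probability in the strict sense.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_score(story, training, k1=2, k2=2):
    """Score of `story` for one topic.

    `training` is a list of (vector, label) pairs with label "yes" or "no".
    Implements (1/k1) * sum over P(s, k1) minus (1/k2) * sum over N(s, k2).
    """
    def top_similarities(label, k):
        sims = sorted(
            (cosine(vec, story) for vec, lab in training if lab == label),
            reverse=True,
        )
        return sims[:k]   # the k most similar stories with this label

    positive = top_similarities("yes", k1)
    negative = top_similarities("no", k2)
    return sum(positive) / k1 - sum(negative) / k2


# Toy training corpus for a hypothetical "hurricane" topic.
training = [
    ({"hurricane": 1, "damage": 1}, "yes"),
    ({"hurricane": 1}, "yes"),
    ({"euro": 1}, "no"),
    ({"damage": 1, "euro": 1}, "no"),
]
story = {"hurricane": 1, "damage": 1}
score = knn_score(story, training)
print("on topic" if score > 0 else "off topic", round(score, 3))
```

Deciding "on topic" at score > 0 is an assumption made for the example; in practice the decision threshold would be tuned on held-out data.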

Sketch of the idea [6,7]
• In addition to term frequency (and derived measures), consider a term's change of context as an additional dimension for analyzing what people consider interesting
• Changes in the global context of a term (the set of its co-occurrences) indicate a change of usage, and hence may be considered interesting (reporting something new)
• The rate of change is indicative of how much the “opinion stakeholders” agree or disagree on the appropriate usage of a term

Example: co-occurrence graph “iraq” 1/3, March 2004
[Figure] strong cluster related to the Madrid train bombings

Example: co-occurrence graph “iraq” 2/3, May 2004
[Figure] strong cluster related to the scandal at Abu Ghraib prison

Example: co-occurrence graph “iraq” 3/3, August 2004
[Figure] skirmishes in and around Najaf; ceasefire with Muqtada al-Sadr; installation of Iyad Allawi

Context Volatility: Assumptions
We compute a term's change of context by averaging the changes in the ranks of its co-occurrences for every time slice, based on a reference set of all its co-occurrences over a total span of time (e.g. 20 years of the NYT corpus with 7,475 time slices).
Context volatility is computed as the average variance of a term's context changes for some period of time.
To avoid problems of data sparseness, we compute the shifts in rank position on a 30-day average.

Identifying interesting terms - Algorithm
1. Compute all significant overall co-occurrences C_{o,w} for term w.
2. Compute all significant co-occurrences C_{t_i,w} for every time slice t_i for term w.
3. For every co-occurrence term c_{o,w,j} ∈ C_{o,w} and for all time slices t_i, compute the series rank_i(c_{o,w,j}), which represents the ranks of c_{o,w,j} in the different global contexts of w for every time slice t_i.
4. Compute the variance of the rank series Var(rank_i(c_{o,w,j})) for every co-occurrence term c_{o,w,j} ∈ C_{o,w}.
5. Compute the average of the variances to obtain a term's volatility:
$$\mathrm{Vol}(w) = \operatorname{avg}_j \, \mathrm{Var}\bigl(\mathrm{rank}_i(c_{o,w,j})\bigr) = \frac{1}{|C_{o,w}|} \sum_{j} \mathrm{Var}\bigl(\mathrm{rank}_i(c_{o,w,j})\bigr)$$

Analogy to financial markets
Stock market        Topic detection
trading volume      term frequency
fixing of price     fixing of global context
• Fixing the meaning of a term can be considered like fixing the price of a stock
• Analysis of the volatility of global contexts can likewise be employed to detect interesting topics and their change over time

An example from the financial market

“Interesting” terms for 2004 (NYT corpus)
presidential debate, election day, the convention, touchdown, caucuses, turnout, the reach, hurricane, abu ghraib, howard dean, ghraib, abu, new year, fenway, league championship series, john edwards, inning, debates, united states senate, halloween, republican convention, innings, quarterback, fenway park, battleground, running mate, falluja, receiver, national convention, iowa caucuses, democratic convention, the new year, division series, incumbent, incumbents, moderator, gephardt, same-sex marriage, contested, the terrorists, assists, american people, undecided, tax cuts, pitches, gay marriage, electoral votes, abortion, counterterrorism, wesley, voter, caucus, the american people, democratic national convention, end zone, martínez, ...
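
Steps 3 to 5 of the volatility algorithm above can be sketched in a few lines. The sketch assumes that steps 1 and 2 (finding significant co-occurrences) have already produced one ranked co-occurrence list per time slice. Two further assumptions are not taken from the slides: a term missing from a slice is given the rank just past the end of that slice's list, and the population variance is used. The 30-day smoothing mentioned on the assumptions slide is omitted. The function name `volatility` is hypothetical.

```python
from statistics import pvariance


def volatility(slices, overall):
    """Average variance of co-occurrence ranks across time slices.

    slices:  list of ranked co-occurrence lists for term w, one per
             time slice, most significant co-occurrence first.
    overall: the overall co-occurrence terms C_{o,w} of w.
    """
    variances = []
    for term in overall:
        ranks = []
        for ranked in slices:
            if term in ranked:
                ranks.append(ranked.index(term) + 1)   # 1-based rank
            else:
                # Assumption: absent terms rank just past the end of the list.
                ranks.append(len(ranked) + 1)
        variances.append(pvariance(ranks))             # step 4
    return sum(variances) / len(variances)             # step 5


overall = ["a", "b", "c"]
stable = [["a", "b", "c"], ["a", "b", "c"], ["a", "b", "c"]]
shifting = [["a", "b", "c"], ["c", "b", "a"], ["b", "a", "c"]]
print(volatility(stable, overall), volatility(shifting, overall))
```

A term whose co-occurrences keep the same ranks in every slice gets volatility 0, while reshuffled contexts yield a positive value, independent of how often the term itself occurs.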

Relation to frequency

Comparison to tf/idf (result\2001_top1000_var.PNG)

Overlap grows linearly with the number of “interesting” terms

What does the context volatility measure extract?
• Usage of terms
• Main associations as reflected by usage (global contexts)
• New aspects (change of associations)
• Interesting terms
• “Hotly discussed” topics
• Topics and events (in the sense of [Allan 2002])
• Time-related, cyclic concepts
• The converse of context volatility: “stable” concepts
• Topics rather than words

Characteristics of context volatility
• Frequency independent – works equally well with high-frequency and low-frequency terms
• Scalable – works with large amounts of data
• Interactive – context changes can be interactively explored for any period of time
• Streaming is possible – does not necessarily require global knowledge (a representative sub-corpus is sufficient)
Cf. http://aspra23.informatik.uni-leipzig.de:8400/blazeds/volsquares_simple.swf

Interactively exploring 2004
(Prof. Dr. G. Heyer, Dagstuhl Workshop on Document Mining, April 2011)

References
[1] Marchionini, G.: Exploratory Search: From Finding to Understanding. Communications of the ACM 49(4), 41–46 (2006)
[2] J. Waitelonis, M. Knuth, L. Wolf, J. Hercher, H. Sack: The Path is the Destination - Enabling a New Search Paradigm with Linked Data, in Proc. of Linked Data in the Future Internet at the Future Internet Assembly, Ghent 16/17 Dec 2010, CEUR Workshop Proceedings, ISSN 1613-0073.
[3] Statistical Models for Tracking and Detection, Dragon Systems, 1999
[4] Papka, Allan: Online New Event Detection using Single-Pass Clustering, University of Massachusetts 1997
[5] Yang, Carbonell, Brown: Learning Approaches for Detecting and Tracking News Events, CMU 1999
[6] Heyer et al. 2009, KDIR 2009: Proc. of Int. Conf. on Knowledge Discovery and Information Retrieval, INSTICC Press, 2009
[7] Rohrdantz et al. 2010, Visuelle Textanalyse, Informatikspektrum 2010

[Kumaran & Allan 2004] Kumaran, G.; Allan, J.: Text classification and named entities for new event detection. In SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, pages 297–304, New York, NY, USA. ACM, 2004.
[Swan & Allan 1999] Swan, R.; Allan, J.: Extracting significant time varying features from text. In CIKM '99: Proceedings of the eighth international conference on Information and knowledge management, pages 38–45, New York, NY, USA. ACM, 1999.
[Wang & McCallum 2006] Wang, X.; McCallum, A.: Topics over time: a non-Markov continuous-time model of topical trends. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '06, pages 424–433, New York, NY, USA. ACM, 2006.
[Heyer et al. 2011] Heyer, G.; Keim, D.; Teresniak, S.; Oelke, D.: Interaktive explorative Suche in großen Dokumentbeständen. Datenbank-Spektrum 3/11, pp. 195–206, Springer 2011.