TMS 09 Text Mining Creating Semantics in the Real World
Transcription
Text Mining: Creating Semantics in the Real World
Prof. Dr. Stefan Wrobel
Dr. Gerd Paass, Andreas Schäfer, Dr. Stefan Eickeler

Fraunhofer IAIS: Intelligent Analysis and Information Systems
• 250 people: scientists, project engineers, technical and administrative staff, students
• Located on Fraunhofer Campus Schloss Birlinghoven/Bonn
• Joint research groups and cooperation with
• Core research areas: machine learning/data mining, multimedia pattern recognition, visual analytics, process intelligence, adaptive robotics, cooperating objects
• Directors: T. Christaller, S. Wrobel (exec.)

"Where is all the knowledge we lost with information?" T. S. Eliot
Thomas Stearns Eliot, OM (September 26, 1888 – January 4, 1965), US-born British poet, dramatist and literary critic (Brainyquote.com)

Outline
Why is Text Mining cool?
• Drowning in Data: The Challenge of Meaning
• Text Mining: Creating Meaning from Large Collections
• Text Mining Markets
What can we do with Text Mining in the Real World? Some case studies
• Document classification: eBay, antiPhish
• Retrieval and Relation Extraction: NZZ, THESEUS CONTENTUS, Fraunhofer Web
• Structuring and Monitoring: EmotionRadar
Conclusion

Internet Trends
• Ubiquitous intelligent systems
• Convergence
• Users as producers

Users as producers
• Web 2.0, Social Web, Crowdsourcing
• Exploding growth of content
• Media providers transform from content to confidence providers, competing with social communities
• Users expect full interactivity and control
• Quality control, confidence, choice and searching are becoming central

Drowning in Data …
Megabytes, Gigabytes, Terabytes, Petabytes, Exabytes
Size of digital universe: 2007: 161 Exabyte; 2010: 998 Exabyte [IDC]

The data iceberg
• 20%: database tables, Excel spreadsheets, other data with fixed structure
• 80%: email, notes, Word documents, PDF, PowerPoint, other text, images, video, audio

Drowning in Unstructured Data …
Megabytes, Gigabytes, Terabytes, Petabytes, Exabytes
… and meaning needed!

Semantics: The need for meaning
• Knowledge will be the driving force of business excellence
• Quality of services increasingly distinguished by amount of knowledge they can use
• Enormous savings if unstructured existing documents could be used
• Without needing to structure them first
• cf. failures of knowledge management!

The challenge of semantics
Diagram labels: very large set of (electronic) documents; manual structuring; intelligent data and text mining technologies; intelligent service

Text Mining is cool, since … the entire world works for us!
• 215,675,903 websites (Netcraft, March 2009)
• 19,200,000,000 webpages (Yahoo, Aug 2005)
• 29,700,000,000 webpages (boutell.com, Jan 2007)
• Google index: 26 million (1998), 1 billion (2000), 1 trillion (1,000,000,000,000) unique URLs (25 July 2008)
=> perhaps quadrillions of words (images, videos) …
And most of them put together meaningfully (somewhat)!
=> smart algorithms can build on that.

The basic idea
If two words occur frequently in the same context (page, paragraph, sentence, part-of-speech), then there must be some semantic relation between them.
Add in a lot of statistics, algorithms, intelligence … AND YOU CAN DO A LOT!
raw material (web, documents, …) + correlations and statistics + intelligent data mining algorithms
You can create (a bit of) semantics!

Automated Clustering [Paass 07]
100 000 documents
Example document (German dpa sports report, kept in the original language and tokenization):
<document>
<title>Bayern München verlor Tabellenführung und Elber beim 1 : 1 in Wolfsburg</title>
<text>Ausgerechnet der VfL Wolfsburg hat den FC Bayern München vom Thron der Fußball - Bundesliga gestoßen . Mit dem 1 : 1 ( 0 : 1 ) gelang den Wolfsburgern am Samstag der erste Punkt im sechsten Spiel gegen den Deutschen Rekordmeister . Durch das Remis und den gleichzeitigen Sieg von Konkurrent Bayer Leverkusen beim TSV 1860 München verlor der FC Bayern die Tabellenführung . Carsten Jancker ( 29 . ) hatte die Gäste in Führung gebracht . Doch vor 20 400 Zuschauern im ausverkauften VfL - Stadion wurden die Bayern für ihre pomadige Spielweise durch den Wolfsburger Ausgleichstreffer von Andrzej Juskowiak ( 60 . ) bestraft . Zudem verlor das Team von Trainer Ottmar Hitzfeld auch noch Stürmer Giovane Elber ( 80 . ) . Er sah wegen einer Tätlichkeit gegen VfL - Abwehrspieler Holger Ballwanz die Rote Karte . Die Bayern gingen ersatzgeschwächt in die Partie . Vor allem das Fehlen des verletzten Regisseurs Stefan Effenberg und des ebenfalls angeschlagenen Mehmet Scholl machte sich bemerkbar . Die Wolfsburger mussten weiter auf die Abwehrspieler Claus Thomsen und Thomas Hengen sowie den gesperrten Waldemar Kryger verzichten . Die Münchener konnten ihre Ausfälle anfangs besser kompensieren . Aus einer gestärkten Deckung , die vor der Pause nur selten von den Wolfsburger Stürmern Juskowiak und Jonathan Akpoborie gefordert wurde , kontrollierten die Bayern die Partie . Mit ihrer Taktik hatten sie nach knapp einer halben Stunde Erfolg : Jancker spitzelte den Ball nach einem abgefälschten Freistoß von Michael Tarnat ins Tor . Der Brasilianer Paolo Sergio ( 14 . ) hätte sogar schon früher sein Team in Führung schießen können . Doch traf er aus 14 m nur die Oberkante der Latte des VfL - Tores . Die Gastgeber besaßen nur eine Möglichkeit in der ersten Halbzeit , als der starke Spielmacher Dorinel Munteanu ( 37 . ) mit einem Schuss an dem großartig reagierenden Nationaltorhüter Oliver Kahn scheiterte . Nach dem Wechsel wurden die Wolfsburger mutiger und munterer . Sie übernahmen langsam das Kommando . Beim Ausgleichstreffer durch Juskowiak half die Bayern - Deckung allerdings mit . Samuel Kuffour verlor den Ball an den polnischen Nationalspieler , Juskowiak zog sofort ab und ließ dem besten Bayern - Spieler Kahn keine Chance . Danach bemühten sich die Münchner noch einmal und erhöhten den Druck . Doch klare Möglichkeiten besaßen sie nicht mehr . In der hektischen Schlussphase verlor Elber die Nerven , so dass die Bayern Glück hatten , in Unterzahl nicht auch noch zu verlieren .</text>
<dpa_TextEndCode>dpa yy ni ce jo</dpa_TextEndCode>
</document>
…

Unsupervised hierarchical term clustering: dpa data
• Team sport: Spiel Bundesliga Team Trainer Sieg Mannschaft Niederlage Samstag Platz Saison Erfolg Punkte Pokal Nationalspieler …
  • Not football: Finale Frankfurt deutsche Meister Hamburg Zuschauer Zuschauern Männer Halle WM Titelverteidiger Final EM …
    • Basketball + Boxing: Basketball Berlin Weltmeister Bonn Kampf K Hagen Trier Würzburg LBA Playoff Runde Box Berliner Klitschko Titel …
    • Handball: Kiel Handball Magdeburg Flensburg HSG VfL TV THW Tore Bad Wuppertal Lemgo Bundesliga Handewitt …
  • Football: FC Trainer Fußball München Spieler Bayern Mannschaft Saison Hertha Stürmer Stadion Spiel SV Dortmund Coach …
    • German League / European League (the two term lists are interleaved in the transcript): Minute Tor VfL Schiedsrichter Bayern League Champions Zuschauer Minuten Führung Fußball UEFA United Hinspiel Tore Hansa Eintracht Schalke Cup Manager Vertrag Leeds Bundesliga Karte Wolfsburg … Club Fans Real Hitzfeld …

Text Mining Market Size
„The text mining market has roughly $50-100 million annual product revenue, and is growing at roughly 40-60% annually.“ [Monash 06, texttechnologies.com]
Sounds small … But then …
• Several research sites devoted to the technology
… So the real market must be somewhere else …

The Text Mining Market … is called “Text Analytics”
Primary areas:
• Web search, site search
• Knowledge management, enterprise portals
• Information collection, extraction, harvesting
• Email handling, security, spam and phishing filtering
• Market research
• Online advertising
• Specialized markets: litigation, juridical; patent search
[cf. Monash/2008]

Application Field Market Research
• Germany 1.6 billion, growing
• Both ad-hoc studies and panels can benefit from text mining
http://www.adm-ev.de/zahlen.html

Enterprise Search as a text mining market
More than 1.2 billion $ in 2010
Year                          2006  2007  2008  2009  2010
Software revenue (million $)   717   860   989  1108  1219
[Gartner 2008]

Companies

Outline
Why is Text Mining cool?
• Drowning in Data: The Challenge of Meaning
• Text Mining: Creating Meaning from Large Collections
• Text Mining Markets
What can we do with Text Mining in the Real World? Some case studies
• Document classification: eBay, antiPhish
• Retrieval and Relation Extraction: NZZ, THESEUS CONTENTUS, Wikinger, Fraunhofer Web
• Structuring and Monitoring: Semantic Map, EmotionRadar
Conclusion

Text Mining Tasks
Document classification, scoring and/or ranking, isolated retrieval
• Assign a class, score or rank to an entire document
In-collection, linked retrieval and organization
• Find documents in a collection
• Link results to other results
Information and relation extraction
• Extract pieces of information, fill particular relations
Overview and monitoring of collections
• Give summary impression of information in a collection or source

Motivation: Spotting Faked Offers at Internet Auctions
Techniques to sell fakes
• Put faked products on an internet auction platform, e.g. eBay
• Describe product as forged, falsified, e.g. “very similar to XXX”
Aspects
• Infringement of registered trade marks
• Violation of patents
• Enormous sales volume
Counter Measures
Use trainable classifiers
• Compile training set of genuine and faked internet auction offers
• Train classifiers to detect these classes; use text, format information, etc. as features
• Use different classifier for different brands / products
• Apply to new internet auction offers
• Ban faked offers from auction
• Update classifiers to new techniques
Figure: fakes and originals plotted over features x1 and x2, separated by a hyperplane.

Results
• A classifier was developed and tested
• Similar techniques as for spam detection
• The German Federal Court of Justice: Internet auction providers have to filter the auctions using appropriate methods to detect faked offers.
• Good results: F-value >> 90%

Motivation: Phishing
E-mail fraud
• Send official-looking email
• Include web link or form
• Ask for confidential information, e.g. password, account details
• Attacker uses information to withdraw money, enter computer system, etc.

AntiPhish Project
• Develop content-based phishing filters
• Include other clues like whitelists
• Trainable and adaptive filters
  → adapt to new phishing attacks
  → anticipate attacks
AntiPhish Consortium
• Fraunhofer IAIS (DE)
• Symantec (GB, IRL)
• Tiscali (IT)
• Nortel (FR)
• K.U. Leuven (BE)

Phishing: Defense Techniques
Workflow
• Obtain training data from email stream
• Extract features
• Estimate and update classifiers and filters
• Integrate new filters into email filtering framework
• Deploy at internet service provider
• Deploy at central wireless packet switch

Approach: Multiple feature sets
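The trainable-classifier approach described for faked offers and phishing mails can be illustrated with a very small sketch. The slides do not say which learning algorithm was used; the multinomial Naive Bayes model below is an assumption chosen because it is common in spam filtering, and the toy offers, the class `NaiveBayesTextClassifier` and the helper `f_measure` are hypothetical names and data invented for illustration. The F-measure function shows how an "F-value" like the one quoted in the results is computed from precision and recall.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            # log prior + sum of smoothed log likelihoods
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                count = self.word_counts[label][tok]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def f_measure(y_true, y_pred, positive):
    """F1 score: harmonic mean of precision and recall for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented toy training set; a real system would use many labelled offers
# and, as the slides suggest, a separate classifier per brand or product.
train_texts = [
    "brand watch replica very similar to original",
    "handbag looks like designer style copy",
    "original brand watch with certificate and receipt",
    "genuine designer handbag bought in official store",
]
train_labels = ["fake", "fake", "original", "original"]
clf = NaiveBayesTextClassifier().fit(train_texts, train_labels)
```

Calling `clf.predict(...)` on a new offer text returns the more likely class; retraining on fresh examples corresponds to the "update classifiers to new techniques" step.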
Basic Features

Dynamic Markov Chains

DMC Details

Latent Topic Models

Class-Specific Topic Models

Feature Processing and Selection

Test Corpora

Overall Result
The THESEUS research program in Germany [theseus-program.com]
Partners:
• Deutsche Nationalbibliothek
• Deutsche Thomson OHG (DTO)
• Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI GmbH)
• empolis GmbH
• Festo AG
• Fraunhofer-Gesellschaft (7 Institutes)
• Friedrich-Alexander-Universität Erlangen
• FZI Forschungszentrum Informatik
• Institut für Rundfunktechnik GmbH (IRT)
• intelligent views gmbh
• Ludwig-Maximilians-Universität (LMU)
• moresophy GmbH
• LYCOS Europe
• mufin GmbH
• ontoprise GmbH
• SAP AG
• Siemens AG
• Technische Universität Darmstadt
• Technische Universität Dresden
• Technische Universität München
• Universität Karlsruhe (TH)
• Verband Deutscher Maschinen- und Anlagebau e.V. (VDMA)
Figure [Wess/07]: axis labels semantic / syntactic and single author / multiple authors.

The THESEUS use cases
• ALEXANDRIA: The Internet Knowledge Platform
• CONTENTUS: Next Generation Digital Libraries for saving our cultural heritage
• MEDICO: Semantic image Search in Medicine
• ORDO: Personal Ordered Knowledge Management
• PROCESSUS: Semantic Business Processes
• TEXO: Business Webs in the Internet Of Things

CONTENTUS - Next Generation Digital Libraries for saving our cultural heritage
Publishers, libraries, broadcasters, etc. are interested in using, distributing and selling their archive content.
In analog form archives are threatened by deterioration, are not linked, difficult to use, and huge.
Goals:
• Digitization, optimization of quality, availability
• Indexing, semantic and social linking and intelligent search, communities
• Rescue of cultural heritage, preventing losses from deterioration
Project duration: until 2012

Showcases: Semantic Digital Libraries
• 225 years Neue Zürcher Zeitung (NZZ)
• GDR music archive
• German National Library

CONTENTUS Workflow
1 Digitization
2 Automated optimization of quality
3 Automated generation of metadata
4 Semantic linking of metadata
5 Open knowledge networks – user augmentation
6 Semantic access to knowledge and content
Quality levels:
• Shell. Data generation: registered users / communities, algorithms with acceptable quality. Quality control: self-control (see Wikipedia).
• Mantle: controlled quality. Data generation: automatically generated through high-quality algorithms. Quality control: training and improvement of algorithms.
• Core: guaranteed quality. Data generation and correction: libraries, museums, universities, experts, etc. Quality control: schooling, rules, advisory boards. Highest stability, highest persistence.
1 Digitization
High-Throughput Methods
• Modern book scanners: thousands of pages per day
• Almost fully automatic
• Data volumes: 70 TB (NZZ), petabytes to exabytes (DNB)

2 Automated optimization of quality
• Development of intelligent algorithms for optimizing print, images, sound & movies
• Automated generation of presentation formats
• Margin removal
• Sharpening, straightening
• Denoising, declicking
• Scratch removal

3 Automated generation of metadata
• Structural and content metadata
• OCR, speech, music, video recognition
• Structure analysis and type recognition
• Linking with current norms & standards

4 Semantic linking of contents
• Link-up with related media
• Incorporation of external knowledge sources (metadata systems, Wikipedia, …)
• Disambiguation, classification, relation extraction
Determining meaning
The words of natural language are often ambiguous.
Über Kohl höhnte Strauß: „Er wird nie Kanzler werden.“ (Die Zeit, 18.7.08)
(English: Strauß sneered about Kohl: "He will never become chancellor." Both "Kohl" and "Strauß" are also ordinary German nouns.)
» For each word / term, find a meaning
» Subproblems:
  » Part of speech recognition: noun, verb, adjective, …
  » Named entity recognition: people, places, organizations, …
  » Assignment of concepts: plant, bird, politician, …

Named entity recognition
» Analyze the surroundings of words
  Über Kohl höhnte Strauß: „Er wird nie Kanzler werden.“
  » “Kohl” in a sentence with “Kanzler” (chancellor) → probably “person”
  » “Kohl” in a sentence with “kochen” (to cook) → probably “vegetable”
» Statistical model for person names
  » Word + surroundings -> word is a person
  » Training using annotated sentences
» Automatic recognition of words / phrases that represent people

Conditional Random Field Model
» Observed words X1,…,Xn
» Category of words Y1,…,Yn

$$p(Y_1,\dots,Y_n \mid X) = \frac{1}{Z(X,\lambda,\mu)} \exp\left( \sum_{t=1}^{N} \left[ \sum_{k=1}^{N_2} \lambda_k \, f_{k,C}(Y_{t-1}, Y_t, X) \right] \right)$$

» Properties f may depend on two subsequent states and on all observed words
Example
» Property f10293 has value 1 if Yt-1 = “PER” and Yt = “PER” and Xt has value “Müller”. Otherwise its value is 0.
[Lafferty, McCallum, Pereira 01]

Modeling of names: features for a CRF model
» Title FirstName Connective LastName
» Properties, recorded for the words xt-2, xt-1, xt, xt+1, xt+2
  » Word, stem, part of speech
  » Prefix, suffix (3 letters)
  » Shape properties: capital character at the beginning, only numbers, contains numbers, mix of capital / non-capital letters, contains hyphens
  » LDA topic model class
  » Contained in list of first names, contained in list of last names
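As a rough illustration of the feature list above, the following sketch computes some of the listed properties for a window of two words to the left and right of a position t. It is not the Contentus implementation: the gazetteer sets `FIRST_NAMES` and `LAST_NAMES`, the function names and the exact definition of "mixed case" are assumptions made for this example, and the stem, part-of-speech and LDA topic features are left out because they need external tools. The last function is an indicator feature in the style of the slide's f10293 example.

```python
# Hypothetical, tiny name lists; a real system would load large gazetteers.
FIRST_NAMES = {"helmut", "stefan"}
LAST_NAMES = {"kohl", "müller"}


def shape_features(word):
    """Shape properties of a single word, as listed on the slide."""
    return {
        "starts_upper": word[:1].isupper(),
        "only_digits": word.isdigit(),
        "has_digit": any(c.isdigit() for c in word),
        # assumption: "mixed" means an upper-case letter after the first
        # character together with at least one lower-case letter
        "mixed_case": any(c.isupper() for c in word[1:])
        and any(c.islower() for c in word),
        "has_hyphen": "-" in word,
    }


def token_features(words, t, window=2):
    """Features for position t, recorded for words t-window .. t+window."""
    feats = {}
    for offset in range(-window, window + 1):
        i = t + offset
        if i < 0 or i >= len(words):
            feats[f"{offset}:pad"] = True  # position outside the sentence
            continue
        w = words[i]
        feats[f"{offset}:word"] = w.lower()
        feats[f"{offset}:prefix3"] = w[:3].lower()
        feats[f"{offset}:suffix3"] = w[-3:].lower()
        for name, value in shape_features(w).items():
            feats[f"{offset}:{name}"] = value
        feats[f"{offset}:in_first_names"] = w.lower() in FIRST_NAMES
        feats[f"{offset}:in_last_names"] = w.lower() in LAST_NAMES
    return feats


def f_per_per_mueller(y_prev, y_cur, words, t):
    """Indicator feature: 1 if both labels are PER and the word is 'Müller'."""
    return 1 if (y_prev == "PER" and y_cur == "PER" and words[t] == "Müller") else 0


sentence = ["Über", "Kohl", "höhnte", "Strauß"]
kohl_features = token_features(sentence, 1)
```

In a CRF, each such feature is multiplied by a learned weight λk and summed over all positions, as in the formula above; the sketch only covers the feature side, not training or decoding.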
Identity of names
» There are several people named “Helmut Kohl”
  » Helmut Kohl, born 1930, chancellor
  » Helmut Kohl, born 1943, referee
  » Helmut Kohl, textile merchant
  » … 99 further hits in the telephone book
» Identification in Wikipedia
  » Compare the words of the Wikipedia article with the text in which “Helmut Kohl” was found
  » Similar words -> similar person
» Automated assignment: person name -> Wikipedia article

Assignment of similarity from the environment
Simple algorithm for assigning people to Wikipedia articles
» Occurrence in text: Helmut Kohl
  » Description using characteristic terms -> u
» Wikipedia article on Kohl
  » Description by characteristic terms -> w
» Comparison using a distance metric, for example cosine distance d(w,u)
» Implemented in a prototype
» Further approach: assignment as a classification task, f(w,u) = 0 or 1 (master's thesis)

Semantic Interpretation
Semantic categories currently assigned in the Contentus prototype:
» Names: people, organizations, points in time, places, …
» Assignment to Wikipedia articles
Under development:
» Hypernyms in ontology (GermaNet): nouns, verbs → supersenses
» Clusters of words with similar meaning: topics
» Relations between names / concepts: “Bertolt Brecht” studied in “München”
» Classes of documents: politics, economy, …

Knowledge store
» Further information for entities that were found in the text
  » Dates, publications
  » Number of inhabitants, topological relationships
» Social networks
  » Who knows whom?
  » Who was at the same place at the same time?
  » Who influenced whom?
Example entries:
Helmut Kohl
  Date of birth    3 April 1930
  Place of birth   Ludwigshafen
  Spouse           Hannelore K.
  Education        historian
  Religion         Catholic
  Party            CDU
Berlin
  Area             891 km²
  Population       3,420,786
  GDP              83.6 billion €
  Elevation        34–115 m
  Latitude         52° 31′ N
  Longitude        13° 25′ E

Knowledge store: Format
» Factual knowledge as logical expressions:
  » <subject> <predicate> <object>
» Semantic Web standards
  » RDF
  » RDFS
  » OWL
» Technical basis
  » Database: MySQL
  » Triple store: Jena + Joseki
» Query language
  » SPARQL
Buchmesse Frankfurt, 10 October 2008

Linking of knowledge
» Semantic integration of data and information from different sources
  » DBpedia: an interpreted form of Wikipedia
  » GeoNames Ontology: all the places in the world
  » Catalogue of the German National Library: books and publications
  → Triple store
» Based on open standards
  » W3C Semantic Web Stack
  » RDF, RDFS, OWL, SPARQL

Knowledge sources
» DBpedia (www.dbpedia.org)
» GeoNames Ontology
  » Already in RDF/OWL format
» Person reference database PND
» Topic reference database SWD
» Online catalogue OPAC
  » Partial export to RDF
» Entities found in the text
  » Identification using Wikipedia
  » Linked to the DBpedia data via link

5 Open knowledge networks – user augmentation
Further annotations from experts and users
• Completions, corrections
• Cooperation with the ALEXANDRIA project in THESEUS
• Suitable measures to assure high quality of data

The Multiple Shell Model
Cf. Wikinger [Bröcker et al. 08]!
• Outer shell: open knowledge network. Data generation: registered users / communities, algorithms. Quality control: self-control (cf. Wikipedia).
• Mantle: controlled quality. Data generation: algorithms of high quality. Quality control: training and improvement of algorithms.
• Core: assured quality. Data generation and correction: libraries, universities, museums, groups of experts, etc. Quality control: fixed rules, committees, maximal stability and persistence.

6 Semantic access to knowledge and content
The knowledge network
• Digital, multimedia data
• Content is semantically linked
• Is enriched from external sources and user groups
Access
• Structure by ontology
• Content relationships become clear
• “Knowledge exploration” is possible

The Contentus Demonstrator
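The "simple algorithm for assigning people to Wikipedia articles" from the Contentus slides (describe the text occurrence by characteristic terms u, describe each candidate article by terms w, compare with cosine distance d(w,u)) can be sketched as follows. This is only an illustration under simplifying assumptions: raw term counts stand in for "characteristic terms", and the candidate descriptions, the function names and the example context are invented, not taken from Wikipedia or the prototype.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts; a stand-in for 'characteristic terms'."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(u, w):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(u[t] * w[t] for t in u if t in w)
    norm_u = math.sqrt(sum(v * v for v in u.values()))
    norm_w = math.sqrt(sum(v * v for v in w.values()))
    if norm_u == 0 or norm_w == 0:
        return 0.0
    return dot / (norm_u * norm_w)


def cosine_distance(u, w):
    """d(w, u) = 1 - cosine similarity; 0 means identical direction."""
    return 1.0 - cosine_similarity(u, w)


def assign_article(context, candidates):
    """Return the candidate whose description is closest to the context."""
    u = term_vector(context)
    return min(
        candidates,
        key=lambda name: cosine_distance(u, term_vector(candidates[name])),
    )


# Invented placeholder descriptions, not real article texts.
candidates = {
    "Helmut Kohl (chancellor)": "politician cdu chancellor government",
    "Helmut Kohl (referee)": "football referee matches league",
}
best = assign_article("Kohl the chancellor and CDU politician spoke", candidates)
```

The classification variant mentioned on the slide, f(w,u) = 0 or 1, would replace the fixed distance with a trained decision function over the same pair of vectors.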
Example of a classified webpage
• Match with the document model = 80%
• Classified as = Projects

Workflow for semantic processing of documents
[Diagram: Crawl, Documents, Pre-processing, Categorization, Entity Recognition, Extracted Metadata, Search index, Search regions, Extracted Knowledge Store; annotated "Using the document model" and "Using the structure model"]
Watch out for the relaunch in summer 2009!!

Emotion Radar
Which issues are important to people, and where are the emotional discussions in blogs and discussion forums?
Goal: market research, …

Emotion Radar example: Looking at two large automobile companies, A and B*
Selection and crawling of discussion forums
• Criteria: search engine ranking, size, activity
• Period used: January 2008 to January 2009
• Storage: 2 GB of data crawled over a period of 7 days
Structure analysis of the discussion forums:
• Manufacturer A:
– Number of postings: 188,487
– Monthly number of new postings: approx. 1,500
– Number of threads: 21,613
– Number of authors: 15,445
• Manufacturer B:
– Number of postings: 406,814
– Monthly number of new postings: approx. 2,700
– Number of threads: 38,758
– Number of authors: 21,919
* anonymized

Case study: Internet postings related to the introduction of a new car model in Germany, 2008*
[Chart with three marked events: "Manufacturer publishes first pictures", "Manufacturer publishes further product features/start of sales", "Cars are delivered"]
* anonymized

Partially automated emotion analysis shows a mood swing from positive to negative ("in love" -> "angry")*
[Chart with the same three events; emotion labels: "in love", "hoping", "interested", "surprised", "turned off", "proud", "angry"]
* anonymized

Topic recognition shows a change in the product features discussed, from design to gasoline consumption*
[Chart; event labels: first photos, test drives, deliveries]
• Topic labels: "giant fish mouth"; vehicle length, design; gearshift, efficiency, technology; prices, car configurator and price list; chrome trim, perceived quality; sunroof, steering wheel, on-board computer, audio system; consumption, folding key, audio; neighbors; consumption
• Emotion labels: "in love", "hoping", "interested", "surprised", "turned off", "proud", "angry"
• Sample posting: "...hard to believe how much gasoline this little car uses!" (Kalle83)
* anonymized

How can manufacturers use these text mining results?*
• E.g. short-term recognition of relevant topics (consumption) and preparation of an appropriate response (fuel-saving training, fuel-efficient tires, proactive communication)
• ...
• Long-term continuous monitoring of emotional topics
* anonymized

Summary
Text Mining is cool!
• Drowning in Data: The Challenge of Meaning
• Text Mining: Creating Meaning from Large Collections
• Text Mining Markets
We can do a lot with Text Mining in the Real World!
• Document classification: eBay, antiPhish
• Retrieval and Relation Extraction: NZZ, THESEUS CONTENTUS, Fraunhofer Web
• Structuring and Monitoring: EmotionRadar

The fine print: Papers and further reading
Horváth, Tamás; Ramon, Jan: Efficient frequent connected subgraph mining in graphs of bounded treewidth. In: Machine learning and knowledge discovery in databases: ECML PKDD 2008. Berlin [etc.]: Springer, 2008. (Machine learning and knowledge discovery in databases 1), pp. 520-535
Kolb, Inke; Deutschland / Bundesbeauftragter für Kultur und Medien; Fraunhofer-Institut IAIS: Auf dem Weg zur Deutschen Digitalen Bibliothek (DDB): erstellt im Auftrag des Beauftragten der Bundesregierung für Kultur und Medien. 2008
Köhler, Joachim; Larson, Martha; Jong, Franciska de; Kraaij, Wessel; Ordelman, Roeland; Association for Computing Machinery / Special Interest Group on Information Retrieval: Proceedings of the ACM SIGIR Workshop "Searching Spontaneous Conversational Speech": held in conjunction with the 31st Annual International ACM SIGIR Conference, 24 July 2008, Singapore. 2008
Krausz, Barbara; Herpers, Rainer: Event detection for video surveillance using an expert system. In: Association for Computing Machinery / Special Interest Group on Multimedia: 1st ACM International Workshop in Analysis and Retrieval of Events/Actions and Workflows in Video Streams (AREA 2008): October 31, 2008, Vancouver, Canada; in conjunction with ACM Multimedia 2008. New York, NY: ACM, 2008, pp. 49-55
Lioma, Christina; Moens, Marie-Francine; Gomez, Juan-Carlos; De Beer, Jan; Bergholz, Andre; Paass, Gerhard; Horkan, Patrick: Anticipating Hidden Text Salting in Emails: extended abstract. In: Lippmann, Richard (Ed.) et al.: Recent advances in intrusion detection: 11th international symposium, RAID 2008, Cambridge, MA, USA, September 15-17, 2008; proceedings. Berlin [etc.]: Springer, 2008.
Anja Pilz, Lukas Molzberger, and Gerhard Paaß: Entity resolution by kernel methods. In: Proc. Sabre TMS, 2009.
Andre Bergholz, Jan De Beer, Sebastian Glahn, Marie-Francine Moens, Gerhard Paass, Siehyun Strobel: New Filtering Approaches for Phishing Email. Accepted for publication in Journal of Computer Security (JCS).
Gerhard Paass and Frank Reichartz (2009): Exploiting Semantic Constraints for Estimating Supersenses with CRFs. Proc. SDM 2009 (accepted for publication).
Paaß, Gerhard; Reinhardt, Wolf; Rüping, Stefan; Wrobel, Stefan: Data mining for security and crime detection. In: Gal, Cecilia S. (Ed.) et al.: Security informatics and terrorism: social and technical problems of detecting and controlling terrorists' use of the World Wide Web; proceedings of the NATO Advanced Research Workshop on Security Informatics and Terrorism - Patrolling the Web, Beer-Sheva, Israel, 4-5 June 2007. Amsterdam [etc.]: IOS Press, 2008. (NATO ASI series: Series D, Information and Communication Security 15), pp. 56-70
Paaß, Gerhard; Kindermann, Jörg: Entity and relation extraction in texts with semi-supervised extensions. In: Gal, Cecilia S. (Ed.) et al.: Security informatics and terrorism: social and technical problems of detecting and controlling terrorists' use of the World Wide Web; proceedings of the NATO Advanced Research Workshop on Security Informatics and Terrorism - Patrolling the Web, Beer-Sheva, Israel, 4-5 June 2007. Amsterdam [etc.]: IOS Press, 2008. (NATO ASI series: Series D, Information and Communication Security 15), pp. 132-141
Frank Reichartz and Gerhard Paaß: Estimating Supersenses with Conditional Random Fields. Workshop on High-Level Information Extraction, ECML/PKDD 2008.
Andre Bergholz, Gerhard Paass, Frank Reichartz, Siehyun Strobel, Marie-Francine Moens and Brian Witten: Detecting Known and New Salting Tricks in Unwanted Emails. Fifth Conference on Email and Anti-Spam, CEAS 2008, Aug 21-22, 2008.
Andre Bergholz, Jeong-Ho Chang, Gerhard Paass, Frank Reichartz and Siehyun Strobel: Improved Phishing Detection using Model-Based Features. Fifth Conference on Email and Anti-Spam, CEAS 2008, Aug 21-22, 2008, Mountain View, CA.
Stefan Eickeler, Lars Bröcker, and Ruth Haener: NZZ: 225 Jahre Old economy vernetzt - Realisierung des digitalen Archivs der Neuen Zürcher Zeitung. In: GI Jahrestagung, pp. 73-77, 2005.