Intelligente Suchmaschinen der Zukunft
Transcription
Intelligente Suchmaschinen der Zukunft
Intelligente Suchmaschinen der Zukunft Gerhard Weikum weikum@mpi-inf.mpg.de http://www.mpi-inf.mpg.de/~weikum/ Beyond Google: Knowledge Queries Turn the Web, Web2.0, and Web3.0 into the world‘s most comprehensive knowledge base („semantic DB“) ! Answer „knowledge queries“ such as: proteins that inhibit both proteases and some other enzymes neutron stars with Xray bursts > 1040 erg s-1 & black holes in 10‘‘ differences in Rembetiko music from Greece and from Turkey connection between Thomas Mann and Goethe market impact of Web2.0 technology in December 2006 sympathy or antipathy for Germany from May to August 2006 Nobel laureate who survived both world wars and his children drama with three women making a prophecy to a British nobleman that he will become king 2/44 Don‘t Let Me Be Misunderstood Keyword query: Max Planck or Keyword query: Greek art Paris or Entity query: Entity query: Person = „Max Planck“ „Greek art“ & Location = „Paris“ Keyword query: Madonna child or Entity-Relation query: Person x = „Madonna“ & HasChild (x,y) 3/44 Semantic Search Today: Example Google Which politicians are also scientists ? What is lacking? • data is not knowledge → extraction and organization • keywords cannot express advanced user intentions → concepts, entities, properties, relations 4/44 Three Roads to Web Knowledge and Intelligent Search Leibniz Approach: Semantic Web from Handcrafted Knowledge Sources and Search over Entities, Relations, Ontologies Planck Approach: Statistical Web from Large-scale Information Extraction & Harvesting and Search over Uncertain Relations Darwin Approach: Social Wisdom from Web 2.0 Communities and Search by Collaborative Recommendation 5/44 Outline 9 Motivation • Leibniz: Semantic Web • Planck: Statistical Web • Darwin: Social Web • Conclusion 6/44 Leibniz Approach: Semantic Web Gottfried Wilhelm Leibniz (1646 - 1716) Handcrafted High-Quality Knowledge: • Ontologies and other Lexical Sources • Build on Rigorous Knowledge Atoms („Characteristica Universalis“) • Semantic Search over Entities, Relations, Ontologies 7/44 High-Quality Knowledge Sources General-purpose thesauri and concept networks: WordNet family woman, adult female – (an adult female person) => amazon, virago – (a large strong and aggressive woman) => donna -- (an Italian woman of rank) => geisha, geisha girl -- (...) => lady (a polite name for any woman) ... => wife – (a married woman, a man‘s partner in marriage) => witch – (a being, usually female, imagined to have special powers derived from the devil) 8/44 High-Quality Knowledge Sources General-purpose thesauri and concept networks: WordNet family 200 000 concepts and relations; can be cast into • description logics or • graph,proteins with weights relation enzyme -- (any of several complex that arefor produced bystrengths cells and from co-occurrence act as catalysts in(derived specific biochemical reactions)statistics) => protein -- (any of a large group of nitrogenous organic compounds that are essential constituents of living cells; ...) => macromolecule, supermolecule ... => organic compound -- (any compound of carbon and another element or a radical) ... => catalyst, accelerator -- ((chemistry) a substance that initiates or accelerates a chemical reaction without itself being affected) => activator -- ((biology) any agency bringing about activation; ...) 9/44 Query Expansion Example From TREC 2004 Robust Track Benchmark: Title: International Organized Crime Description: Identify organizations that participate in international criminal activity, the activity, and, if possible, collaborating organizations and the countries involved. 10/44 Query Expansion Example From TREC 2004 Robust Track Benchmark: Title: International Organized Crime Description: Identify organizations that participate in international criminal activity, the activity, and, if possible, collaborating organizations and the countries involved. Query = {international[0.145|1.00], ~META[1.00|1.00][{gangdom[1.00|1.00], gangland[0.742|1.00], Let us take, for example, the case of Medellin cartel's "organ[0.213|1.00] & crime[0.312|1.00]", camorra[0.254|1.00], boss Pablo Escobar. Will the fact thatmaffia[0.318|1.00], he was eliminated mafia[0.154|1.00], "sicilian[0.201|1.00] & mafia[0.154|1.00]", change anything at all? No, it may perhaps have a "black[0.066|1.00] & hand[0.053|1.00]", mob[0.123|1.00], syndicate[0.093|1.00]}], psychological effect on other drug dealers but, ... organ[0.213|1.00], crime[0.312|1.00], collabor[0.415|0.20], columbian[0.686|0.20], cartel[0.466|0.20], ...}} the illicit export of metals and import ... for organizing of arms. It is extremely difficult for the law-enforcement 135530 sorted accesses in 11.073s. organs to investigate and stamp out corruption among leading officials. Interpol Chief on Fight Against ... Narcotics Economic CounterintelligenceATasks Viewed commission accused Swiss prosecutors parliamentary today ofofdoing little toCrime stop drug and money-laundering Dresden Conference Views Growth Organized in Europe international networks from Region pumping billions of dollars Report on Drug, Weapons Seizures in Southwest Border through Swiss companies. SWITZERLAND CALLED SOFT ON CRIME ... Results: 1. 2. 3. 4. 5. ... 11/44 Exploit Hand-Crafted Knowledge Wikipedia, WordNet, and other lexical sources 12/44 Exploit Hand-Crafted Knowledge Wikipedia, WordNet, and other lexical sources {{Infobox_Scientist | name = Max Planck | birth_date = [[April 23]], [[1858]] | birth_place = [[Kiel]], [[Germany]] | death_date = [[October 4]], [[1947]] | death_place = [[Göttingen]], [[Germany]] | residence = [[Germany]] | nationality = [[Germany|German]] | field = [[Physicist]] | work_institution = [[University of Kiel]]</br> [[Humboldt-Universität zu Berlin]]</br> [[Georg-August-Universität Göttingen]] | alma_mater = [[Ludwig-Maximilians-Universität München]] | doctoral_advisor = [[Philipp von Jolly]] | doctoral_students = [[Gustav Ludwig Hertz]]</br> … | known_for = [[Planck's constant]], [[Quantum mechanics|quantum theory]] | prizes = [[Nobel Prize in Physics]] (1918) … 13/44 Exploit Hand-Crafted Knowledge Wikipedia, WordNet, and other lexical sources {{Infobox_Scientist | name = Max Planck | birth_date = [[April 23]], [[1858]] | birth_place = [[Kiel]], [[Germany]] | death_date = [[October 4]], [[1947]] | death_place = [[Göttingen]], [[Germany]] | residence = [[Germany]] | nationality = [[Germany|German]] | field = [[Physicist]] | work_institution = [[University of Kiel]]</br> [[Humboldt-Universität zu Berlin]]</br> [[Georg-August-Universität Göttingen]] | alma_mater = [[Ludwig-Maximilians-Universität München]] | doctoral_advisor = [[Philipp von Jolly]] | doctoral_students = [[Gustav Ludwig Hertz]]</br> … | known_for = [[Planck's constant]], [[Quantum mechanics|quantum theory]] | prizes = [[Nobel Prize in Physics]] (1918) … 14/44 YAGO Knowledge Base Entities Facts Entity KnowItAll 30 000 subclass SUMO 20 000 60 000 WordNet 120 000 800 000 Person Cyc 300 000 5 Mio. TextRunner n/a 8 Mio. subclass YAGO 1.7 Mio. 15 Mio. Scientist DBpedia 103 Mio. subclass1.9 Mio. [F. Suchanek et al.: WWW’07] Accuracy ≈ 95% subclass subclass subclass Biologist concepts Location subclass City Country Physicist instanceOf instanceOf Erwin_Planck Nobel Prize hasWon October 4, 1947 bornIn Kiel FatherOf diedOn individuals Max_Planck bornOn means means “Max Planck” “Max Karl Ernst Ludwig Planck” April 23, 1858 means “Dr. Planck” words Online access and download at http://www.mpi-inf.mpg.de/~suchanek/yago/ 15/44 NAGA: Graph IR on YAGO [G. Kasneci et al.: WWW’07, ICDE‘08] Graph-based search on YAGO-style knowledge bases with built-in ranking based on confidence and informativeness discovery queries Kiel bornIn $x isa hasWon Nobel prize hasSon $a diedOn $x > scientist $b diedOn $y connectedness queries German novelist isa * Thomas Mann Goethe queries with regular expressions Ling hasFirstName | hasLastName (coAuthor | advisor)* Beng Chin Ooi $x isa scientist worksFor $y locatedIn* Zhejiang 16/44 Outline 9 Motivation 9 Leibniz: Semantic Web • Planck: Statistical Web • Darwin: Social Web • Conclusion 17/44 Planck Approach: Statistical Web Max Planck (1858 - 1947) Information Extraction & Harvesting: • Gather Entities, Relations, Facts • Live with Uncertainty • Search & Rank Knowledge Facts 18/44 Information Extraction (IE): Text to Records Person BirthDate Max Planck 4/23, 1858 Albert Einstein 3/14, 1879 Mahatma Gandhi 10/2, 1869 BirthPlace ... Kiel Ulm Porbandar Person ScientificResult Max Planck Quantum Theory extracted facts often Valueconfidence Dimension have <1 23 Js 6.226×10 → DB with uncertainty (probabilistic DB) Person Collaborator Max Planck Albert Einstein Max Planck Niels Bohr Constant Planck‘s constant Person Organization Max Planck KWG / MPG combine NLP, pattern matching, lexicons, statistical learning 19/44 Knowledge Acquisition from the Web Learn Semantic Relations from Entire Corpora at Large Scale (as exhaustively as possible but with high accuracy) Examples: • all cities, all basketball players, all composers • headquarters of companies, CEOs of companies, synonyms of proteins • birthdates of people, capitals of countries, rivers in cities • which musician plays which instruments • who discovered or invented what • which enzyme catalyzes which biochemical reaction Existing approaches and tools (Snowball [Gravano et al. 2000], KnowItAll [Etzioni et al. 2004], …): almost-unsupervised pattern matching and learning: seeds (known facts) → patterns (in text) → (extraction) rule → (new) facts 20/44 Methods for Web-Scale Fact Extration seeds → text → rules → new facts Example: Example: city in in city (Seattle) (Seattle) in downtown downtown Seattle Seattle in downtown downtown X X city Seattle X city (Seattle) (Seattle) Seattle and and other other towns towns X and and other other towns towns city Las VegasLas andVegas otherand towns and other towns city (Las (Las Vegas) Vegas) otherXtowns X and other towns plays playing Y: Y: … …X X plays (Zappa, (Zappa, guitar) guitar) playing playing guitar: guitar: … … Zappa Zappa playing plays X plays (Davis, (Davis, trumpet) trumpet) Davis Davis … … blows blows trumpet trumpet X… … blows blows Y Y in downtown Beijing Coltrane blows sax city(Beijing) old center of Beijing plays(Coltrane, sax) sax player Coltrane city(Beijing) plays(C., sax) old center of X Y player X Assessment of facts & generation of rules based on statistics Rules can be more sophisticated: playing NN: (ADJ|ADV)* NP & class(NN)=instrument & class(head(NP))=person → plays(head(NP), NN) 21/44 Performance of Web-IE State-of-the-art precision/recall results: relation precision recall corpus systems countries cities scientists CEOs birthdates instanceOf 80% 80% 60% 80% 80% 40% Web Web Web News Wikipedia Web KnowItAll KnowItAll KnowItAll Snowball, LEILA LEILA Text2Onto, LEILA 90% ??? ??? 50% 70% 20% precision value-chain: entities 80%, attributes 70%, facts 60%, events 50% Anecdotic evidence: invented (A.G. Bell, telephone) married (Hillary Clinton, Bill Clinton) isa (yoga, relaxation technique) isa (zearalenone, mycotoxin) contains (chocolate, theobromine) contains (Singapore sling, gin) invented (Johannes Kepler, logarithm tables) married (Segolene Royal, Francois Hollande) isa (yoga, excellent way) isa (your day, good one) contains (chocolate, raisins) plays (the liver, central role) makes (everybody, mistakes) 22/44 Beyond Surface Learning with LEILA Learning to Extract Information by Linguistic Analysis [F. Suchanek et al.: KDD‘06] Limitation of surface patterns: who discovered or invented what “Tesla’s work formed the basis of AC electric power” “Al Gore funded more work for a better basis of the Internet” Almost-unsupervised Statistical Learning with Dependency Parsing (Cologne, Rhine), (Cairo, Nile), … (Cairo, Rhine), (Rome, 0911), (∗, ∗[0..9]*∗), … NP VP PP NP NP PP NP NP NP PP NP VP NP PP NP NP NP Cologne lies on the banks of the Rhine People in Cairo like wine from the Rhine valley Ss MVp DMc Jp Mp Mp Dg Js Js Os Sp AN Mvp Ds Js NP VP VP PP NP NP PP NP NP Paris was founded on an island in the Seine Ss Pv MVp Ds DG (Paris, Seine) Js Js MVp We visited Paris last summer. It has many museums along the banks of the Seine. 23/44 YAGO Enhancement by IE on Text Sources Entity 1.0 subclass 1.0 subclass Person Capture confidence value for each fact subclass Celebrity 0.7 subclass 1.0 subclass 1.0 Mythological Figure City instanceOf 0.6 Location 0.8 instanceOf subclass 1.0 Country 0.9 instanceOf 1.0 instanceOf Paris Hilton Paris(Myth.) Paris(France) means means 0.1 means 0.05 “Paris” locatedIn France 0.95 0.7 means 0.5 means 0.9 “La Grande Nation” “France” ongoing work: harvesting relations by IE tools like GATE, LEILA, ... (e.g.: which enzyme catalyzes which biochemical process, who discovered or invented what, ...) 24/44 Search Results Without Ranking q: Fisher isa scientist Fisher isa $x $@Fisher = Ronald_Fisher $@scientist = scientist_109871938 $X = alumnus_109165182 $@Fisher = Irving_Fisher $@scientist = scientist_109871938 mathematician_109635652 —subClassOf—> scientist_109871938 $X = social_scientist_109927304 Alumni_of_Gonville_and_Caius_College,_Cambridge$@Fisher —subClassOf—> alumnus_1091 = James_Fisher "Fisher" —familyNameOf—> Ronald_Fisher $@scientist = scientist_10981938 Ronald_Fisher —type—> Alumni_of_Gonville_and_Caius_College,_Cambridge $X = ornithologist_109711173 Ronald_Fisher —type—> 20th_century_mathematicians $@Fisher = Ronald_Fisher "scientist" —means—> scientist_109871938 $@scientist = scientist_109871938 $X = theorist_110008610 $@Fisher = Ronald_Fisher $@scientist = scientist_109871938 $X = colleague_109301221 $@Fisher = Ronald_Fisher $@scientist = scientist_109871938 $X = organism_100003226 … 25/44 Ranking with Statistical Language Model q: Fisher isa scientist Fisher isa $x $@Fisher = Ronald_Fisher $@scientist = scientist_109871938 $X = mathematician_109635652 $@Fisher = Ronald_Fisher $@scientist = scientist_109871938 $X = statistician_109958989 $@Fisher = Ronald_Fisher $@scientist = scientist_109871938 Score: 7.184462521168058E-13 mathematician_109635652 —subClassOf—> scientist $X = president_109787431 "Fisher" —familyNameOf—> Ronald_Fisher Ronald_Fisher —type—> 20th_century_mathematicians $@Fisher = Ronald_Fisher "scientist" —means—> scientist_109871938 $@scientist = scientist_109871938 20th_century_mathematicians —subClassOf—> mathematician_109635652 $X = geneticist_109475749 $@Fisher = Ronald_Fisher $@scientist = scientist_109871938 $X = scientist_109871938 → statistical language model for result graphs … Online access at http://www.mpi-inf.mpg.de/~kasneci/naga/ 26/44 Ranking Factors Confidence: Prefer results that are likely to be correct ¾ Certainty of IE ¾ Authenticity and Authority of Sources Informativeness: bornIn (Max Planck, Kiel) from „Max Planck was born in Kiel“ (Wikipedia) livesIn (Elvis Presley, Mars) from „They believe Elvis hides on Mars (Martian Bloggeria) q: isa (Einstein, $y) Prefer results that are likely important May prefer results that are likely new to user ¾ Frequency in answer ¾ Frequency in corpus (e.g. Web) ¾ Frequency in query log isa (Einstein, scientist) isa (Einstein, vegetarian) q: isa ($x, vegetarian) isa (Einstein, vegetarian) isa (Al Nobody, vegetarian) Compactness: Prefer results that are tightly connected isa vegetarian ¾ Size of answer graph Einstein Tom isa Cruise won Nobel Prize won bornIn 1962 Bohr diedIn 27/44 NAGA Ranking Model Following the paradigm of statistical language models (used in speech recognition and modern IR), applied to graphs For query q with fact templates q1 … qn rank result graphs g with facts g1 … gn by decreasing likelihoods: using generative mixture model ex.: bornIn (Goethe, Frankfurt) n P [q | g] = ∏ (1 − α ) ⋅ P [q i | g i ] + α ⋅ P [q i ] i =1 β ⋅ Pconf [qi | gi ] + (1 − β ) ⋅ Pinform [qi | gi ] based on IE accuracy and authority analysis ex.: bornIn ($x, Frankfurt) P[Goethe | bornIn , Frankfurt ] = background model P [Goethe , bornIn , Frankfurt ] ] P [bornIn , Frankfurt ] ne P( x, r, z) P( x, r, z) P ( x | r , z ) = = conf(e) = ∑ acc(e, Pi ) ⋅ trust( Pi ) P(r, z) ∑P( x', r, z) i =1 x' estimated by correlation statistics for qi = (x, r, z) with variable x 28/44 Entity Search: Example NAGA Query: $x isa politician $x isa scientist Results: Benjamin Franklin Paul Wolfowitz Angela Merkel … 29/44 Outline 9 Motivation 9 Leibniz: Semantic Web 9 Planck: Statistical Web • Darwin: Social Web • Conclusion 30/44 Darwin Approach (Social Web) Charles Darwin (1809 - 1882) Social Wisdom & Natural Selection: • Evolution of (Web 2.0) species • Survival of the fittest • Search & rank by collaborative recommendation 31/44 „Wisdom of Crowds“ at Work on Web 2.0 Information enrichment & knowledge extraction by humans: • Collaborative Recommendations & QA • Amazon (product ratings & reviews, recommended products) • Netflix: movie DVD rentals → $ 1 Mio. Challenge • answers.yahoo, iknow.baidu, etc. • Social Tagging and Folksonomies • del.icio.us: Web bookmarks and tags • flickr: photo annotation, categorization, rating • YouTube: same for video • Human Computing in Game Form • ESP and Google Image Labeler: image tagging • Peekaboom: image segmenting and tagging • Verbosity: facts from natural-language sentences • Online Communities • dblife.cs.wisc.edu for database research • www.lt-world.org for language technology • Yahoo! Groups, Myspace, Facebook, etc. etc. 32/44 Social Tagging: Example Flickr (1) 33/44 Social Tagging: Example Flickr (2) 34/44 Social Tagging: Example Flickr (3) 35/44 Social-Tagging Community > 20 Mio. users > 2 Bio. photos > 10 Bio. tags 30% monthly growth 36/44 Source: www.flickr.com ESP Game [Luis von Ahn et al. 2004 ] played against random, anonymous partner on Internet taboo: pyramid Louvre museum Paris art • Game with a purpose • Collects annotations (wisdom) • Can exploit tag statistics (crowds) • Attracts people, fun to play, some play hours • ESP gamehas collected > 10 Mio. tags > 20000 users your partner suggested: myfrom labels: 3 7 11 17 labels labelspeople could tag all photos on reflection • 5000 the Web in 4 weeks water (human computing) Congratulations! You scored 1 point! Mitterand Mona Lisa metro lignes 7, 14 1 Da Vinci code 37/44 Dark Side of Social Wisdom • Spam (Web spam – not just for email anymore): lucky online casino, easy MBA diploma, cheap V!-4-gra, etc.; law suits about „appropriate Google rank“ • Truthiness: degree to which something is truthy (not necessarily facty); truthy := property of something you know from your guts • Disputes: editorial fights over critical Wikipedia articles; Citizendium: new endeavor with "gentle expert oversight" • Dishonesty, Bias, … 38/44 The Wisdom of Crowds: PageRank PageRank (PR): links are endorsements & increase page authority authority is higher if links come from high-authority pages PR( q ) = ε ⋅ j( q ) + ( 1 − ε ) ⋅ ∑ p∈IN ( q ) with PR( p ) ⋅ t( p,q ) Social Ranking t ( p , q ) = 1 / outdegree( p) and j ( q ) = 1 / N equivalent to principal eigenvector: pn×1 = An×n × pn×1 Authority (page q) = stationary prob. of visiting q random walk: uniformly random choice of links + random jumps; add bias to transitions and jumps for personal PR, TrustRank, etc. 39/44 Wisdom of Crowds: Graphs are Everywhere http://www.flickr.com/photos/lukemontague/14038129/ http://www.flickr.com/photos/shopping2null/395271855/ People Opinions Data 40/44 http://datamining.typepad.com/gallery/newblog-crop.png http://datamining.typepad.com/gallery/core.png Wisdom of Crowds: Beyond PR users tags docs Typed graphs: data items, users, friends, groups, postings, ratings, queries, clicks, … with weighted edges → spectral analysis of various graphs 41/44 Outline 9 Motivation 9 Leibniz: Semantic Web 9 Planck: Statistical Web 9 Darwin: Social Web • Conclusion 42/44 Summary Harvesting knowledge & organizing in semantic DB / graph for • scholarly Web, • digital libraries, • enterprise know-how, • online communities, etc. Enable semantic search, with ranking of uncertain facts Three roads to knowledge and intelligent search: • Leibniz / Semantic Web: ontologies, encyclopedia, etc. • Planck / Statistical Web: large-scale IE from text, speech, etc. • Darwin / Social Web: wisdom of crowds, tagging, folksonomies 43/44 Thank you ! 44/44