Intelligent Search Engines of the Future: Trends and Challenges
Gerhard Weikum (weikum@mpi-inf.mpg.de)
May 10, 2006

What Google Can't Do
• professors from Saarbruecken who teach DB or IR and have projects on XML
• drama with three women making a prophecy to a British nobleman that he will become king
• the woman from Paris whom I met at the PC meeting chaired by Renee Miller
• best & latest insights on percolation theory for networks
• pros and cons of dark energy hypothesis
• evolving opinions on EU constitution in different countries
• market impact of XML standards in 2002 vs. 2004
• experienced NLP experts who may be recruited for IT staff
apps in customer support, business analytics, health care, law, etc.
+ multilingual/multicultural, personalized/contextual, multimedia, etc.

What is Beyond Google?
for Advanced Information Requests by "Power Users" (librarians, market analysts, scientists, students, etc.)
• background knowledge → ontologies & thesauri, statistics, continuous learning
• (semi-)structured and "semantic" data → XML, info extraction, annotation & classification
• humans in the loop, wisdom of crowds → collaboration, recommendation, social networks, P2P
• context awareness → personalization, geo & time, user behavior, reality mining

A Broader View of Search Engine Technology (Information Retrieval)
• Intranet and Enterprise Search
• Scholarly Work on Digital Libraries, Web Archives, etc.
• "Vertical" Search: Products, Entertainment, Health, etc.
• Desktop Search / Personal Information Management
• Deep Web Search / Information Integration
• Continuous Queries (PubSub) on News, Blogs, etc.
• Personalized and "Social" Search
• Multimedia Search (Images, Video, Speech, Music, etc.)
• Multilingual and Multicultural Search
• Embedded (Mobile) and Integrated (DB&IR) Applications

Outline
✓ Motivation and Strategic Direction
• Semantic Search (Ontologies, XML, Info Extraction)
• Personalized Search (User-Behavior History)
• Social Search (Communities, P2P)
• Conclusion

Ontologies & Thesauri: Example WordNet
IR & NLP approach, e.g. the WordNet thesaurus (Princeton), with > 100 000 concepts and lexical & linguistic relations:

woman, adult female -- (an adult female person)
  => amazon, virago -- (a large strong and aggressive woman)
  => donna -- (an Italian woman of rank)
  => geisha, geisha girl -- (...)
  => lady -- (a polite name for any woman)
  ...
  => wife -- (a married woman, a man's partner in marriage)
  => witch -- (a being, usually female, imagined to have special powers derived from the devil)

"Semantic" Query Expansion and Execution
Thesaurus/Ontology: concepts, relationships, glosses from WordNet, Gazetteers, Web forms & tables, Wikipedia; relationships quantified by statistical correlation measures

User query: ~c = ~t1 ... ~tm
Example: ~professor and ( ~course = "~IR" )

Term2Concept with WSD (word-sense disambiguation)
Query expansion: $exp(t_i) = \{ w \mid sim(t_i, w) \ge \theta \}$
Weighted expanded query
Example: (professor lecturer (0.749) scholar (0.71) ...) and ( (course class (1.0) seminar (0.84) ...) = ("IR" "Web search" (0.653) ...) )
Efficient top-k search with dynamic expansion
→ better recall, better mean precision for hard queries

Figure (thesaurus graph around "professor"): nodes include alchemist, primadonna, magician, artist, director, wizard, investigator, intellectual, researcher, scientist, scholar, "academic, academician, faculty member", mentor, teacher; edge labels shown include RELATED (0.48) and HYPONYM (0.749).

Query Expansion Example
From TREC 2004 Robust Track Benchmark:
Title: International Organized Crime
Description: Identify organizations that participate in international criminal activity, the activity, and, if possible, collaborating organizations and the countries involved.

Query = {international[0.145|1.00], ~META[1.00|1.00][{gangdom[1.00|1.00], gangland[0.742|1.00], "organ[0.213|1.00] & crime[0.312|1.00]", camorra[0.254|1.00], maffia[0.318|1.00], mafia[0.154|1.00], "sicilian[0.201|1.00] & mafia[0.154|1.00]", "black[0.066|1.00] & hand[0.053|1.00]", mob[0.123|1.00], syndicate[0.093|1.00]}], organ[0.213|1.00], crime[0.312|1.00], collabor[0.415|0.20], columbian[0.686|0.20], cartel[0.466|0.20], ...}}

135530 sorted accesses in 11.073s.

Matching passages shown on the slide:
• "Let us take, for example, the case of Medellin cartel's boss Pablo Escobar. Will the fact that he was eliminated change anything at all? No, it may perhaps have a psychological effect on other drug dealers but, ..."
• "... for organizing the illicit export of metals and import of arms. It is extremely difficult for the law-enforcement organs to investigate and stamp out corruption among leading officials. ..."
• "A parliamentary commission accused Swiss prosecutors today of doing little to stop drug and money-laundering international networks from pumping billions of dollars through Swiss companies."

Results:
1. Interpol Chief on Fight Against Narcotics
2. Economic Counterintelligence Tasks Viewed
3. Dresden Conference Views Growth of Organized Crime in Europe
4. Report on Drug, Weapons Seizures in Southwest Border Region
5. SWITZERLAND CALLED SOFT ON CRIME
...
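
The threshold-based expansion rule exp(t) = {w | sim(t, w) ≥ θ} from the query expansion slide can be sketched as below. This is a minimal Python illustration, not the speaker's implementation: the `THESAURUS` dictionary and the `expand` function are hypothetical names, the weights for lecturer, scholar, class, seminar and "Web search" are the ones quoted in the slide's weighted query, and the researcher entry (0.48) is an assumed pairing added only to show a term falling below the threshold.

```python
from typing import Dict, List, Tuple

# Hypothetical toy thesaurus: term -> {related term: similarity weight}.
# Most weights are quoted from the slide's weighted expanded query;
# "researcher": 0.48 is an assumption for illustration.
THESAURUS: Dict[str, Dict[str, float]] = {
    "professor": {"lecturer": 0.749, "scholar": 0.71, "researcher": 0.48},
    "course": {"class": 1.0, "seminar": 0.84},
    "IR": {"Web search": 0.653},
}


def expand(term: str, theta: float,
           thesaurus: Dict[str, Dict[str, float]] = THESAURUS) -> List[Tuple[str, float]]:
    """Return the weighted expansion of `term`.

    The original term is kept with weight 1.0, followed by every related
    term w with sim(term, w) >= theta, ordered by descending similarity.
    """
    related = thesaurus.get(term, {})
    kept = [(w, s) for w, s in related.items() if s >= theta]
    kept.sort(key=lambda pair: -pair[1])
    return [(term, 1.0)] + kept


if __name__ == "__main__":
    for t in ("professor", "course", "IR"):
        print(t, "->", expand(t, theta=0.6))
```

Raising θ shrinks the expansion toward the original term; lowering it admits weaker relatives. That is the tuning knob the later slides try to avoid with dynamic, on-demand expansion.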
What If The Semantic Web Existed And All Information Were in XML?
Which professors from Saarbruecken (SB) are teaching IR and have research projects on XML?

  <?xml version = '1.0' encoding = 'UTF-8'?>
  <homepage>
  ...
  <professor>
    <name> Gerhard Weikum </name>
    <teaching xlink:href="http://www.uni-saarland.de/..." />
    <address>
      <city> Saarbrücken </city>
      <country> Germany </country>
    </address>
    <research xlink:href="http://www.mpi-inf.mpg.de/…" />
  ...

Figure (document tree):
Professor
  Name: Gerhard Weikum
  Address ... City: SB, Country: Germany
  Teaching: Course
    Title: IR
    Description: Information retrieval ...
    Syllabus ... Book ... Article ...
  Research: Project
    Title: Intelligent Search of XML Data
    Sponsor: German Science Foundation

XML-IR Example (1)
Which professors from Saarbruecken (SB) are teaching IR and have research projects on XML?
(same Professor tree as above)

  //Professor [//* = "Saarbruecken"]
    [//Course [//* = "IR"] ]
    [//Research [//* = "XML"] ]

XML-IR Example (2)
Which professors from Saarbruecken (SB) are teaching IR and have research projects on XML?
Figure: the Professor tree as above, plus a second tree:
Lecturer
  Name: Ralf Schenkel
  Address: Max-Planck Institute for CS, Germany
  Interests: Semistructured Data, IR
  Teaching: Seminar
    Title: Statistical Language Models
    Contents: Ranked Search ...
    Literature: Book ...

Combine DB and IR techniques with logics, statistics, AI, ML, NLP for ranked retrieval of semistructured data (e.g. TopX)

  //~Professor [//* = "~Saarbruecken"]
    [//~Course [//* = "~IR"] ]
    [//~Research [//* = "~XML"] ]

TopX Engine at MPII (1)-(5)
(five screenshot slides)

Efficient Top-k Search [Buckley85, Güntzer et al. 00, Fagin01]
TA: efficient & principled top-k query processing with monotonic score aggr.

Data items: d1, …, dn, each with per-term scores, e.g. for d1: s(t1,d1) = 0.7, …, s(tm,d1) = 0.2
Query: q = (t1, t2, t3)

TA with sorted access only (NRA):
  scan index lists; consider d at pos_i in L_i;
  E(d) := E(d) ∪ {i}; high_i := s(t_i,d);
  worstscore(d) := aggr{s(t_ν,d) | ν ∈ E(d)};
  bestscore(d) := aggr{worstscore(d), aggr{high_ν | ν ∉ E(d)}};
  if worstscore(d) > min-k then
    add d to top-k;
    min-k := min{worstscore(d') | d' ∈ top-k};
  else if bestscore(d) > min-k then
    cand := cand ∪ {d};
  threshold := max{bestscore(d') | d' ∈ cand};
  if threshold ≤ min-k then exit;

Index lists:
  t1         t2         t3
  d78 0.9    d64 0.8    d10 0.7
  d23 0.8    d23 0.6    d78 0.5
  d10 0.8    d10 0.6    d64 0.4
  d1  0.7    d10 0.2    d99 0.2
  d88 0.2    d78 0.1    d34 0.1
  …          …          …

Ex. Google: > 10 million terms, > 8 billion docs, > 4 TB index

Trace for k = 1 (sum as aggregation):

Scan depth 1
  Rank  Doc  Worstscore  Bestscore
  1     d78  0.9         2.4
  2     d64  0.8         2.4
  3     d10  0.7         2.4

Scan depth 2
  Rank  Doc  Worstscore  Bestscore
  1     d78  1.4         2.0
  2     d23  1.4         1.9
  3     d64  0.8         2.1
  4     d10  0.7         2.1

Scan depth 3
  Rank  Doc  Worstscore  Bestscore
  1     d10  2.1         2.1
  2     d78  1.4         2.0
  3     d23  1.4         1.8
  4     d64  1.2         2.0
  STOP!
Probabilistic Pruning of Top-k Candidates [VLDB 04]
TA family of algorithms based on invariant (with sum as aggr):

$\sum_{i \in E(d)} s_i(d) \;\le\; s(d) \;\le\; \sum_{i \in E(d)} s_i(d) + \sum_{i \notin E(d)} high_i$

where the left-hand side is worstscore(d) and the right-hand side is bestscore(d).
• Add d to top-k result, if worstscore(d) > min-k
• Drop d only if bestscore(d) < min-k, otherwise keep in PQ
→ Often overly conservative (deep scans, high memory for PQ)
→ Approximate top-k with probabilistic guarantees:

$p(d) := P\left[ \sum_{i \in E(d)} s_i(d) + \sum_{i \notin E(d)} S_i > \delta \right]$

discard candidates d from queue if p(d) ≤ ε ⇒ E[rel. precision@k] = 1−ε
score predictor can use LSTs & Chernoff bounds, Poisson approximations, or histogram convolution

Figure: score over scan depth, showing bestscore(d), worstscore(d) and min-k, with the point where d is dropped from the priority queue.

Top-k Queries with Query Expansion [SIGIR 05]
consider expandable query "~professor and research = XML" with score

$\sum_{i \in q} \max_{j \in exp(i)} \{ sim(i,j) \cdot s_j(d) \}$

dynamic query expansion with incremental on-demand merging of additional index lists

B+ tree index on tag-term pairs and terms
thesaurus / meta-index: professor → lecturer: 0.7, scholar: 0.6, academic: 0.53, scientist: 0.5, ...

Figure: index lists (doc: score) for research:XML, professor, lecturer (0.7) and scholar (0.6). The four lists shown are:
  92: 0.9, 67: 0.9, 52: 0.9, 44: 0.8, 55: 0.8, ...
  37: 0.9, 44: 0.8, 22: 0.7, 23: 0.6, 51: 0.6, 52: 0.6, ...
  12: 0.9, 14: 0.8, 28: 0.6, 17: 0.55, 61: 0.5, 44: 0.5, ...
  57: 0.6, 44: 0.4, 52: 0.4, 33: 0.3, 75: 0.3

+ much more efficient than threshold-based expansion
+ no threshold tuning
+ no topic drift

Performance Results for .Gov Queries
on .GOV corpus from TREC-12 Web track: 1.25 million docs (html, pdf, etc.)
50 keyword queries, e.g.:
• "Lewis Clark expedition"
• "juvenile delinquency"
• "legalization Marihuana"
• "air bag safety reducing injuries death facts"

speedup by factor 10 at high precision/recall (relative to TA-sorted);
aggressive queue mgt. even yields factor 100 at 30-50 % prec./recall

                      TA-sorted    Prob-sorted (smart)
  #sorted accesses    2,263,652    527,980
  elapsed time [s]    148.7        15.9
  max queue size      10849        400
  relative recall     1            0.69
  rank distance       0            39.5
  score error         0            0.031

Experimental Results: INEX Benchmark
on IEEE-CS journal and conference articles: 12,000 XML docs with 12 million elements, 7.9 GB for all indexes
20 CO queries, e.g.: "XML editors or parsers"
20 CAS queries, e.g.: //article[ .//bibl[about(.//"QBIC")] and .//p[about(.//"image retrieval")] ]

                         Join&Sort    StructIndex   TopX (ε=0.0)   TopX (ε=0.1)
  #sorted accesses @10   9,122,318    761,970       635,507        426,986
  #random accesses @10   0            3,245,068     64,807         59,414
  relative recall @10    1            1             1              0.8
  precision@10           0.34         0.34          0.34           0.32
  MAP@1000               0.17         0.17          0.17           0.17

TopX outperforms Join&Sort by factor > 10 and beats StructIndex by factor > 20 on INEX, factor 2-3 on IMDB

Towards a Statistically Semantic Web
Figure: a Web page about Isaac Newton annotated with tags such as <Person>, <TimePeriod>, <Scientist>, <Publication>, <Painter>.

Information extraction yields:
  Person             TimePeriod
  Sir Isaac Newton   4 Jan 1643 - ...
  Leibniz            ...
  Kneller            ...

  Publication                   Author
  Philosophiae Naturalis ...    Newton

  Publication       Topic
  Philosophia ...   gravitation

  Scientist
  Sir Isaac Newton
  Leibniz

but with confidence < 1
→ Semantic-Web database with uncertainty!
→ ranked retrieval!
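
Ranked retrieval over scored data of this kind relies on the top-k machinery from the "Efficient Top-k Search" slide. The sorted-access-only variant (NRA) can be sketched as follows. This is a simplified Python illustration under stated assumptions, not TopX code: aggregation is a plain sum, lists are scanned round-robin one position at a time, all bounds are recomputed after each round, and the names `nra_top_k` and `SLIDE_LISTS` are hypothetical. The index lists are copied from the slide as transcribed, including the second d10 entry in the t2 list.

```python
from typing import Dict, List, Tuple

IndexList = List[Tuple[str, float]]  # (doc id, score) pairs, sorted by descending score

# Index lists for t1, t2, t3 as they appear in the slide transcript.
SLIDE_LISTS: List[IndexList] = [
    [("d78", 0.9), ("d23", 0.8), ("d10", 0.8), ("d1", 0.7), ("d88", 0.2)],
    [("d64", 0.8), ("d23", 0.6), ("d10", 0.6), ("d10", 0.2), ("d78", 0.1)],
    [("d10", 0.7), ("d78", 0.5), ("d64", 0.4), ("d99", 0.2), ("d34", 0.1)],
]


def nra_top_k(lists: List[IndexList], k: int) -> Tuple[List[Tuple[str, float]], int]:
    """Return (top-k docs with their worstscores, scan depth reached)."""
    m = len(lists)
    seen: Dict[str, Dict[int, float]] = {}  # doc -> {list index i: s(t_i, doc)}
    high = [lst[0][1] if lst else 0.0 for lst in lists]
    worst: Dict[str, float] = {}
    depth = 0
    max_depth = max((len(lst) for lst in lists), default=0)

    while depth < max_depth:
        # One round of sorted accesses: read position `depth` in every list.
        for i, lst in enumerate(lists):
            if depth < len(lst):
                doc, score = lst[depth]
                # Keep the first (highest) score if a doc repeats within a list.
                seen.setdefault(doc, {}).setdefault(i, score)
                high[i] = score
            else:
                high[i] = 0.0  # an exhausted list contributes nothing more
        depth += 1

        # worstscore: sum of known scores; bestscore: add high_i for unseen lists.
        worst = {d: sum(s.values()) for d, s in seen.items()}
        best = {d: worst[d] + sum(high[i] for i in range(m) if i not in seen[d])
                for d in seen}
        ranked = sorted(worst, key=lambda d: -worst[d])
        top, rest = ranked[:k], ranked[k:]
        if k > 0 and len(top) == k:
            min_k = worst[top[-1]]
            # Best possible score of any candidate or any still-unseen document.
            threshold = max([best[d] for d in rest] + [sum(high)])
            if threshold <= min_k:
                break

    top = sorted(worst, key=lambda d: -worst[d])[:k]
    return [(d, worst[d]) for d in top], depth


if __name__ == "__main__":
    result, depth = nra_top_k(SLIDE_LISTS, k=1)
    print(result, "after scan depth", depth)
```

With the slide's lists and k = 1, the scan stops after depth 3 with d10 as the winner (worstscore 2.1), because no other candidate's bestscore exceeds 2.0. The probabilistic variant on the later slide replaces the hard test `threshold <= min_k` with a per-candidate probability estimate.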
Information Extraction from Web Pages
Leading open-source tool: GATE/ANNIE
http://www.gate.ac.uk/annie/

Outline
✓ Motivation and Strategic Direction
✓ Semantic Search (Ontologies, XML, Info Extraction)
• Personalized Search (User-Behavior History)
• Social Search (Communities, P2P)
• Conclusion

Personalized Search & Info Management
Personalized Result Ranking:
• query interpretation depends on personal interests and bias
• need to learn user-specific weights for multi-criteria ranking (relevance, authority, freshness, etc.)
• can exploit user behavior (feedback, bookmarks, query logs, click streams, etc.)
Personal Information Management (PIM):
• manage, annotate, organize, and search all your personal data
• on desktop (mail, files, calendar, etc.)
• at home (photos, videos, music, parties, invoices, tax filing, etc.) and in smart home with ambient intelligence

Google's PageRank [Brin & Page 1998]
Idea: incoming links are endorsements & increase page authority; authority is higher if links come from high-authority pages

$PR(q) = \epsilon \cdot j(q) + (1 - \epsilon) \cdot \sum_{p \in IN(q)} PR(p) \cdot t(p,q)$

with $t(p,q) = 1 / outdegree(p)$ and $j(q) = 1 / N$

Authority (page q) = stationary prob. of visiting q
random walk: uniformly random choice of links + random jumps

Personalized PageRank [Haveliwala et al. 2003]
Idea: random jumps favor designated high-quality pages such as personal bookmarks, frequently visited pages, etc.

$PR(q) = \epsilon \cdot j(q) + (1 - \epsilon) \cdot \sum_{p \in IN(q)} PR(p) \cdot t(p,q)$

with $j(q) = 1 / |B|$ for $q \in B$, and $j(q) = 0$ otherwise

Authority (page q) = stationary prob. of visiting q
random walk: uniformly random choice of links + biased jumps to personal favorites

Exploiting Query Logs and Click Streams
from PageRank: uniformly random choice of links + random jumps
to QRank: + query-doc transitions + query-query transitions + doc-doc transitions on implicit links (w/ thesaurus), with probabilities estimated from log statistics

$PR(q) = \epsilon \cdot j(q) + (1 - \epsilon) \cdot \sum_{p \in IN(q)} PR(p) \cdot t(p,q)$

$QR(q) = \epsilon \cdot j(q) + (1 - \epsilon) \cdot \left( \alpha \sum_{p \in explicitIN(q)} PR(p) \cdot t(p,q) + (1 - \alpha) \sum_{p \in implicitIN(q)} PR(p) \cdot sim(p,q) \right)$

Figure: graph linking query nodes ("max planck", "mpg", "budget", "max planck wissenschaft") and document nodes (A.M., MPII).

Small-Scale Experiments
Setup:
70 000 Wikipedia docs, 18 volunteers posing Trivial-Pursuit queries
ca. 500 queries, ca. 300 refinements, ca. 1000 positive clicks
ca. 15 000 implicit links based on doc-doc similarity
Results (assessment by blind-test users):
• QRank top-10 result preferred over PageRank in 81% of all cases
• QRank has 50.3% precision@10, PageRank has 33.9%
Untrained example query "philosophy":
      PageRank                   QRank
  1.  Philosophy                 Philosophy
  2.  GNU free doc. license      GNU free doc. license
  3.  Free software foundation   Early modern philosophy
  4.  Richard Stallman           Mysticism
  5.  Debian                     Aristotle

Outline
✓ Motivation and Strategic Direction
✓ Semantic Search (Ontologies, XML, Info Extraction)
✓ Personalized Search (User-Behavior History)
• Social Search (Communities, P2P)
• Conclusion

Social Search: Vision & Trends
"Enable people to find, use, share, and expand all human knowledge" (Yahoo!: knowledge fusion)
Collect & harvest the wisdom of crowds:
• bookmarks of users, with content tags
• query logs, click streams, news readings, etc.
• interactions in communities (blogs, e-groups, etc.)
• opinions on products, movies, music, pharmaceuticals, etc.
• photos, annotations, ratings, etc.
all managed by one "super provider" (Yahoo!, MSN, or Google)
→ decentralized & self-organizing peer-to-peer (P2P) networks!
Affects search result ranking: prefer results liked by similar users

Social Search: Yahoo! MyWeb
search engine highly susceptible to spam & manipulation!

Social Search: Yahoo! Flickr
(four screenshot slides)

Peer-to-Peer (P2P) Web Search
Vision: Self-organizing P2P Web Search Engine with Google-or-better functionality
• Scalable & Self-Organizing Data Structures and Algorithms (DHTs, Semantic Overlay Networks, Epidemic Spreading, Distr. Link Analysis, etc.)
• Better Search Result Quality (Precision, Recall, etc.)
• Powerful Search Methods for Each Peer (Concept-based Search, Query Expansion, Personalization, etc.)
• Leverage Intellectual Input at Each Peer (Bookmarks, Feedback, Query Logs, Click Streams, Evolving Web, etc.)
• Collaboration among Peers (Query Routing, Incentives, Fairness, Anonymity, etc.)
• Benefits of Large-Scale Social Networks: Small-World Phenomenon, Breaking Information Monopolies
Foundations pursued in EU Integrated Project DELIS

Minerva System Architecture
Figure: a distributed directory of peer lists (term a: 17, 11, 92, ...; term c: 13, 92, 45, ...; term f: 43, 65, 92, ...; term g: 13, 11, 45, ...; url x: 37, 44, 12, ...; url y: 75, 43, 12, ...; url z: 54, 128, 7, ...), and a query peer P0 with bookmarks B0 and local index X0 that looks up term g: 13, 11, 45, ...

Query routing aims to optimize benefit/cost, driven by distributed statistics on peers' content similarity, content overlap, freshness, authority, trust, performability etc.
Dynamically precompute "good peers" to maintain a Semantic Overlay Network
Exploit community input (bookmarks, etc.)
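
The PageRank and Personalized PageRank formulas from the personalized-search slides differ only in the jump vector j. Both can be computed by power iteration, as in the minimal Python sketch below. The function name `pagerank`, the toy graph, the value ε = 0.15 and the treatment of pages without out-links (their mass follows the jump vector) are assumptions of this illustration; the slides do not specify them. Favorites are assumed to be pages of the graph.

```python
from typing import Dict, List, Optional, Set


def pagerank(links: Dict[str, List[str]], eps: float = 0.15,
             favorites: Optional[Set[str]] = None,
             iterations: int = 100) -> Dict[str, float]:
    """Power iteration for PR(q) = eps*j(q) + (1-eps) * sum_{p in IN(q)} PR(p)/outdegree(p).

    favorites=None  -> uniform jumps, j(q) = 1/N (classic PageRank)
    favorites=B     -> biased jumps,  j(q) = 1/|B| for q in B, else 0
    """
    nodes = sorted(set(links) | {q for outs in links.values() for q in outs})
    n = len(nodes)
    if favorites:
        jump = {q: (1.0 / len(favorites) if q in favorites else 0.0) for q in nodes}
    else:
        jump = {q: 1.0 / n for q in nodes}

    pr = dict(jump)  # start from the jump distribution
    for _ in range(iterations):
        nxt = {q: eps * jump[q] for q in nodes}
        for p in nodes:
            outs = links.get(p, [])
            if outs:
                share = (1.0 - eps) * pr[p] / len(outs)  # t(p,q) = 1/outdegree(p)
                for q in outs:
                    nxt[q] += share
            else:
                # Assumption: a page without out-links jumps according to j.
                for q in nodes:
                    nxt[q] += (1.0 - eps) * pr[p] * jump[q]
        pr = nxt
    return pr


if __name__ == "__main__":
    toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}  # hypothetical graph
    print("uniform:     ", pagerank(toy_web))
    print("favorite = b:", pagerank(toy_web, favorites={"b"}))
```

On the toy graph, bookmarking page b raises b's authority relative to the uniform case while the scores still sum to 1. QRank would extend the same iteration with extra transition terms for implicit links.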
Spam: Not Just for E-mail Anymore
Distortion of search results by "spam farms" (aka. search engine optimization)
Figure: boosting pages (spam farm) linking to the page to be "promoted"

Susceptibility to manipulation and lack of trust model is a major problem:
• 2004 DarkBlue SEO Challenge: "nigritude ultramarine" extremely "successful"
• Pessimists estimate 75 million out of 150 million Web hosts are spam
• Recent example: Ληστές http://www.google.gr/search?hl=el&q=%CE%BB%CE%B7%CF%83%CF%84%CE%AD%CF
unclear borderline between spam and community opinions

Research Challenge:
• Robustness to egoistic and malicious behavior
• Trust/Distrust models and mechanisms

Web Spam Generation
Content spam:
• repeat words (boost tf scores)
• weave words/phrases into copied text
  Example: Remember not only online learning to say the right doctoral degree thing in the right place, but far cheap tuition more difficult still, to leave career unsaid the wrong thing at university the tempting moment.
• manipulate anchor texts
Link spam:
• copy links from Web dir. and distort
• create honeypot page and sneak in links
• infiltrate Web directory
• purchase expired domains
• generate posts to Blogs, message boards, etc.
• build & run spam farm (collusion) + form alliances
Hide/cloak the manipulation:
• masquerade href anchors
  Example: read about my <a href="myonlinecasino.com"> trip to Las Vegas </a>.
• use tiny anchor images with background color
• generate different dynamic pages to browsers and crawlers

Countermeasures: BadRank and TrustRank
BadRank:
start with explicit set B of blacklisted pages
define random-jump vector r by setting $r_i = 1/|B|$ if $i \in B$ and 0 else
propagate BadRank mass to predecessors

$BR(p) = \beta \, r_p + (1 - \beta) \sum_{q \in OUT(p)} BR(q) / indegree(q)$

TrustRank:
start with explicit set T of trusted pages with trust values $t_i$
define random-jump vector r by setting $r_i = t_i / \ldots$ if $i \in T$ and 0 else
propagate TrustRank mass to successors

$TR(q) = \tau \, r_q + (1 - \tau) \sum_{p \in IN(q)} TR(p) / outdegree(p)$

Problems:
maintenance of explicit lists is difficult
difficult to understand (& guarantee) effects

Learning Spam Features [Drost/Scheffer 2005]
Use classifier (e.g. Bayesian predictor, SVM) to predict "spam vs. ham" based on page and page-context features
Most discriminative features are:
• tfidf weights of words in p0 and IN(p0)
• avg. #inlinks of pages in IN(p0)
• avg. #words in title of pages in OUT(p0)
• #pages in IN(p0) that have same length as some other page in IN(p0)
• avg. #inlinks and outlinks of pages in IN(p0)
• avg. #outlinks of pages in IN(p0)
• avg. #words in title of p0
• total #outlinks of pages in OUT(p0)
• total #inlinks of pages in IN(p0)
• clustering coefficient of pages in IN(p0) (#linked pairs / m(m-1) possible pairs)
• total #words in titles of pages in OUT(p0)
• total #outlinks of pages in OUT(p0)
• avg. #characters of URLs in IN(p0)
• #pages in IN(p0) and OUT(p0) with same MD5 hash signature as p0
• #characters in domain name of p0
• #pages in IN(p0) with same IP number as p0
But spammers may learn to adjust to the anti-spam measures. It's an arms race!

Outline
✓ Motivation and Strategic Direction
✓ Semantic Search (Ontologies, XML, Info Extraction)
✓ Personalized Search (User-Behavior History)
✓ Social Search (Communities, P2P)
• Conclusion

Strategic Research Avenues
Exploit the Web's potential for being a knowledge base
• Build large-scale & interesting "Semantic" Web corpora (Wikipedia++, all homepages of CS researchers, etc.)
• Enhance & interconnect Deep-Web databases (digital libraries, scientific data, judicial expertise, etc.)
Semantic search: ontologies, richly structured & annotated (XML) data, info extraction & enrichment
Personalized search: history of user behavior (queries, clicks, etc.) and current context
Social search: wisdom of crowds (recommendations, community behavior, etc.) embedded in P2P network
Data curation, quality control & trust are crucial for effective information search: authenticity, freshness, accuracy, authority, etc.
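
As a concrete illustration of the trust issue, the BadRank recurrence from the countermeasures slide, BR(p) = β r_p + (1−β) Σ_{q∈OUT(p)} BR(q) / indegree(q), can be iterated directly. The Python sketch below is only an illustration under assumptions: the function name `badrank`, the toy link graph, β = 0.15 and the fixed iteration count are hypothetical choices, and blacklisted pages are assumed to be part of the graph.

```python
from typing import Dict, List, Set


def badrank(links: Dict[str, List[str]], blacklist: Set[str],
            beta: float = 0.15, iterations: int = 100) -> Dict[str, float]:
    """Iterate BR(p) = beta*r_p + (1-beta) * sum_{q in OUT(p)} BR(q)/indegree(q).

    Badness flows backwards along links: a page that links to bad pages
    inherits part of their BadRank.
    """
    nodes = sorted(set(links) | {q for outs in links.values() for q in outs})
    indegree = {q: 0 for q in nodes}
    for outs in links.values():
        for q in outs:
            indegree[q] += 1  # every q reached here has indegree >= 1

    # Random-jump vector: r_i = 1/|B| for blacklisted pages, 0 otherwise.
    r = {p: (1.0 / len(blacklist) if p in blacklist else 0.0) for p in nodes}
    br = dict(r)
    for _ in range(iterations):
        br = {
            p: beta * r[p]
            + (1.0 - beta) * sum(br[q] / indegree[q] for q in links.get(p, []))
            for p in nodes
        }
    return br


if __name__ == "__main__":
    # Hypothetical graph: two farm pages boost a blacklisted page.
    toy_web = {"farm1": ["spam"], "farm2": ["spam"], "good": ["news"], "news": []}
    for page, score in badrank(toy_web, blacklist={"spam"}).items():
        print(f"{page}: {score:.5f}")
```

In the toy graph the two farm pages receive equal, non-zero BadRank because they point at the blacklisted page, while pages with no path to it stay at zero. TrustRank follows the same pattern with the propagation direction reversed and a trusted seed set instead of a blacklist.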