Intelligente Suchmaschinen der Zukunft

Transcription

Intelligente Suchmaschinen der Zukunft
Intelligente Suchmaschinen
der Zukunft
Gerhard Weikum
weikum@mpi-inf.mpg.de
http://www.mpi-inf.mpg.de/~weikum/
Beyond Google: Knowledge Queries
Turn the Web, Web2.0, and Web3.0 into the world‘s
most comprehensive knowledge base („semantic DB“) !
Answer „knowledge queries“ such as:
proteins that inhibit both proteases and some other enzymes
neutron stars with Xray bursts > 1040 erg s-1 & black holes in 10‘‘
differences in Rembetiko music from Greece and from Turkey
connection between Thomas Mann and Goethe
market impact of Web2.0 technology in December 2006
sympathy or antipathy for Germany from May to August 2006
Nobel laureate who survived both world wars and his children
drama with three women making a prophecy
to a British nobleman that he will become king
2/44
Don‘t Let Me Be Misunderstood
Keyword query: Max Planck
or
Keyword query: Greek art Paris
or
Entity query:
Entity query:
Person = „Max Planck“ „Greek art“ &
Location = „Paris“
Keyword query: Madonna child
or
Entity-Relation query:
Person x = „Madonna“
& HasChild (x,y)
3/44
Semantic Search Today: Example Google
Which politicians
are also scientists ?
What is lacking?
• data is not knowledge
→ extraction and organization
• keywords cannot express
advanced user intentions
→ concepts, entities, properties,
relations
4/44
Three Roads to Web Knowledge and
Intelligent Search
Leibniz Approach:
Semantic Web from
Handcrafted Knowledge Sources and
Search over Entities, Relations, Ontologies
Planck Approach:
Statistical Web from
Large-scale Information Extraction & Harvesting and
Search over Uncertain Relations
Darwin Approach:
Social Wisdom from Web 2.0 Communities and
Search by Collaborative Recommendation
5/44
Outline
9 Motivation
• Leibniz: Semantic Web
• Planck: Statistical Web
• Darwin: Social Web
• Conclusion
6/44
Leibniz Approach: Semantic Web
Gottfried Wilhelm Leibniz
(1646 - 1716)
Handcrafted High-Quality Knowledge:
• Ontologies and other Lexical Sources
• Build on Rigorous Knowledge Atoms
(„Characteristica Universalis“)
• Semantic Search over
Entities, Relations, Ontologies
7/44
High-Quality Knowledge Sources
General-purpose thesauri and concept networks: WordNet family
woman, adult female – (an adult female person)
=> amazon, virago – (a large strong and aggressive woman)
=> donna -- (an Italian woman of rank)
=> geisha, geisha girl -- (...)
=> lady (a polite name for any woman)
...
=> wife – (a married woman, a man‘s partner in marriage)
=> witch – (a being, usually female, imagined to
have special powers derived from the devil)
8/44
High-Quality Knowledge Sources
General-purpose thesauri and concept networks: WordNet family
200 000 concepts and relations;
can be cast into
• description logics or
• graph,proteins
with weights
relation
enzyme -- (any of several complex
that arefor
produced
bystrengths
cells and
from co-occurrence
act as catalysts in(derived
specific biochemical
reactions)statistics)
=> protein -- (any of a large group of nitrogenous organic compounds
that are essential constituents of living cells; ...)
=> macromolecule, supermolecule
...
=> organic compound -- (any compound of carbon
and another element or a radical)
...
=> catalyst, accelerator -- ((chemistry) a substance that initiates or
accelerates a chemical reaction
without itself being affected)
=> activator -- ((biology) any agency bringing about activation; ...)
9/44
Query Expansion Example
From TREC 2004 Robust Track Benchmark:
Title: International Organized Crime
Description: Identify organizations that participate in international criminal activity,
the activity, and, if possible, collaborating organizations and the countries involved.
10/44
Query Expansion Example
From TREC 2004 Robust Track Benchmark:
Title: International Organized Crime
Description: Identify organizations that participate in international criminal activity,
the activity, and, if possible, collaborating organizations and the countries involved.
Query = {international[0.145|1.00],
~META[1.00|1.00][{gangdom[1.00|1.00],
gangland[0.742|1.00],
Let us take, for example, the case of Medellin cartel's
"organ[0.213|1.00] & crime[0.312|1.00]",
camorra[0.254|1.00],
boss Pablo Escobar.
Will the fact thatmaffia[0.318|1.00],
he was eliminated
mafia[0.154|1.00], "sicilian[0.201|1.00]
& mafia[0.154|1.00]",
change anything
at all? No, it may perhaps have a
"black[0.066|1.00] & hand[0.053|1.00]",
mob[0.123|1.00],
syndicate[0.093|1.00]}],
psychological
effect on other drug
dealers but,
...
organ[0.213|1.00], crime[0.312|1.00], collabor[0.415|0.20],
columbian[0.686|0.20], cartel[0.466|0.20],
...}} the illicit export of metals and import
... for organizing
of arms. It is extremely difficult for the law-enforcement
135530 sorted accesses in 11.073s.
organs to investigate and stamp out corruption among
leading officials.
Interpol Chief on Fight Against
... Narcotics
Economic CounterintelligenceATasks
Viewed commission accused Swiss prosecutors
parliamentary
today ofofdoing
little toCrime
stop drug
and money-laundering
Dresden Conference Views Growth
Organized
in Europe
international
networks
from Region
pumping billions of dollars
Report on Drug, Weapons Seizures
in Southwest
Border
through
Swiss
companies.
SWITZERLAND CALLED SOFT
ON
CRIME
...
Results:
1.
2.
3.
4.
5.
...
11/44
Exploit Hand-Crafted Knowledge
Wikipedia, WordNet, and other lexical sources
12/44
Exploit Hand-Crafted Knowledge
Wikipedia, WordNet, and other lexical sources
{{Infobox_Scientist
| name = Max Planck
| birth_date = [[April 23]], [[1858]]
| birth_place = [[Kiel]], [[Germany]]
| death_date = [[October 4]], [[1947]]
| death_place = [[Göttingen]], [[Germany]]
| residence = [[Germany]]
| nationality = [[Germany|German]]
| field = [[Physicist]]
| work_institution = [[University of Kiel]]</br>
[[Humboldt-Universität zu Berlin]]</br>
[[Georg-August-Universität Göttingen]]
| alma_mater = [[Ludwig-Maximilians-Universität München]]
| doctoral_advisor = [[Philipp von Jolly]]
| doctoral_students =
[[Gustav Ludwig Hertz]]</br>
…
| known_for = [[Planck's constant]],
[[Quantum mechanics|quantum theory]]
| prizes = [[Nobel Prize in Physics]] (1918)
…
13/44
Exploit Hand-Crafted Knowledge
Wikipedia, WordNet, and other lexical sources
{{Infobox_Scientist
| name = Max Planck
| birth_date = [[April 23]], [[1858]]
| birth_place = [[Kiel]], [[Germany]]
| death_date = [[October 4]], [[1947]]
| death_place = [[Göttingen]], [[Germany]]
| residence = [[Germany]]
| nationality = [[Germany|German]]
| field = [[Physicist]]
| work_institution = [[University of Kiel]]</br>
[[Humboldt-Universität zu Berlin]]</br>
[[Georg-August-Universität Göttingen]]
| alma_mater = [[Ludwig-Maximilians-Universität München]]
| doctoral_advisor = [[Philipp von Jolly]]
| doctoral_students =
[[Gustav Ludwig Hertz]]</br>
…
| known_for = [[Planck's constant]],
[[Quantum mechanics|quantum theory]]
| prizes = [[Nobel Prize in Physics]] (1918)
…
14/44
YAGO Knowledge Base
Entities
Facts
Entity
KnowItAll
30 000
subclass
SUMO
20 000
60 000
WordNet
120 000
800 000
Person
Cyc
300 000
5 Mio.
TextRunner
n/a
8 Mio.
subclass
YAGO
1.7 Mio.
15 Mio.
Scientist
DBpedia
103 Mio.
subclass1.9 Mio.
[F. Suchanek et al.: WWW’07]
Accuracy ≈ 95%
subclass
subclass
subclass
Biologist
concepts
Location
subclass
City
Country
Physicist
instanceOf
instanceOf
Erwin_Planck
Nobel Prize
hasWon
October 4, 1947
bornIn
Kiel
FatherOf
diedOn
individuals
Max_Planck
bornOn
means
means
“Max Planck”
“Max Karl Ernst
Ludwig Planck”
April 23, 1858
means
“Dr.
Planck”
words
Online access and download at http://www.mpi-inf.mpg.de/~suchanek/yago/
15/44
NAGA: Graph IR on YAGO
[G. Kasneci et al.: WWW’07, ICDE‘08]
Graph-based search on YAGO-style knowledge bases
with built-in ranking based on confidence and informativeness
discovery queries
Kiel
bornIn $x
isa
hasWon Nobel
prize
hasSon
$a diedOn $x
>
scientist
$b diedOn $y
connectedness queries
German
novelist
isa
*
Thomas Mann
Goethe
queries with regular expressions
Ling
hasFirstName | hasLastName
(coAuthor
| advisor)*
Beng Chin Ooi
$x
isa
scientist
worksFor
$y
locatedIn*
Zhejiang
16/44
Outline
9 Motivation
9 Leibniz: Semantic Web
• Planck: Statistical Web
• Darwin: Social Web
• Conclusion
17/44
Planck Approach: Statistical Web
Max Planck
(1858 - 1947)
Information Extraction & Harvesting:
• Gather Entities, Relations, Facts
• Live with Uncertainty
• Search & Rank Knowledge Facts
18/44
Information Extraction (IE): Text to Records
Person
BirthDate
Max Planck
4/23, 1858
Albert Einstein 3/14, 1879
Mahatma Gandhi 10/2, 1869
BirthPlace ...
Kiel
Ulm
Porbandar
Person
ScientificResult
Max Planck Quantum Theory
extracted facts often
Valueconfidence
Dimension
have
<1
23 Js
6.226×10
→
DB with
uncertainty
(probabilistic DB)
Person
Collaborator
Max Planck Albert Einstein
Max Planck Niels Bohr
Constant
Planck‘s constant
Person
Organization
Max Planck KWG / MPG
combine NLP, pattern matching, lexicons, statistical learning
19/44
Knowledge Acquisition from the Web
Learn Semantic Relations from Entire Corpora at Large Scale
(as exhaustively as possible but with high accuracy)
Examples:
• all cities, all basketball players, all composers
• headquarters of companies, CEOs of companies, synonyms of proteins
• birthdates of people, capitals of countries, rivers in cities
• which musician plays which instruments
• who discovered or invented what
• which enzyme catalyzes which biochemical reaction
Existing approaches and tools
(Snowball [Gravano et al. 2000], KnowItAll [Etzioni et al. 2004], …):
almost-unsupervised pattern matching and learning:
seeds (known facts) → patterns (in text) → (extraction) rule → (new) facts
20/44
Methods for Web-Scale Fact Extration
seeds
→
text
→
rules
→
new facts
Example:
Example:
city
in
in
city (Seattle)
(Seattle)
in downtown
downtown Seattle
Seattle
in downtown
downtown X
X
city
Seattle
X
city (Seattle)
(Seattle)
Seattle and
and other
other towns
towns
X and
and other
other towns
towns
city
Las VegasLas
andVegas
otherand
towns
and other
towns
city (Las
(Las Vegas)
Vegas)
otherXtowns
X and
other towns
plays
playing Y:
Y: …
…X
X
plays (Zappa,
(Zappa, guitar)
guitar) playing
playing guitar:
guitar: …
… Zappa
Zappa playing
plays
X
plays (Davis,
(Davis, trumpet)
trumpet) Davis
Davis …
… blows
blows trumpet
trumpet
X…
… blows
blows Y
Y
in downtown Beijing
Coltrane blows sax
city(Beijing)
old center of Beijing
plays(Coltrane, sax) sax player Coltrane
city(Beijing)
plays(C., sax)
old center of X
Y player X
Assessment of facts & generation of rules based on statistics
Rules can be more sophisticated:
playing NN: (ADJ|ADV)* NP & class(NN)=instrument & class(head(NP))=person
→ plays(head(NP), NN)
21/44
Performance of Web-IE
State-of-the-art precision/recall results:
relation
precision recall
corpus
systems
countries
cities
scientists
CEOs
birthdates
instanceOf
80%
80%
60%
80%
80%
40%
Web
Web
Web
News
Wikipedia
Web
KnowItAll
KnowItAll
KnowItAll
Snowball, LEILA
LEILA
Text2Onto, LEILA
90%
???
???
50%
70%
20%
precision value-chain: entities 80%, attributes 70%, facts 60%, events 50%
Anecdotic evidence:
invented (A.G. Bell, telephone)
married (Hillary Clinton, Bill Clinton)
isa (yoga, relaxation technique)
isa (zearalenone, mycotoxin)
contains (chocolate, theobromine)
contains (Singapore sling, gin)
invented (Johannes Kepler, logarithm tables)
married (Segolene Royal, Francois Hollande)
isa (yoga, excellent way)
isa (your day, good one)
contains (chocolate, raisins)
plays (the liver, central role)
makes (everybody, mistakes)
22/44
Beyond Surface Learning with LEILA
Learning to Extract Information by Linguistic Analysis [F. Suchanek et al.: KDD‘06]
Limitation of surface patterns:
who discovered or invented what
“Tesla’s work formed the basis of AC electric power”
“Al Gore funded more work for a better basis of the Internet”
Almost-unsupervised Statistical Learning with Dependency Parsing
(Cologne, Rhine), (Cairo, Nile), …
(Cairo, Rhine), (Rome, 0911), (∗, ∗[0..9]*∗), …
NP
VP
PP
NP
NP
PP NP
NP
NP
PP
NP
VP
NP
PP
NP
NP
NP
Cologne lies on the banks of the Rhine People in Cairo like wine from the Rhine valley
Ss
MVp
DMc
Jp
Mp
Mp
Dg
Js
Js
Os
Sp
AN
Mvp
Ds
Js
NP
VP
VP
PP
NP
NP
PP NP
NP
Paris was founded on an island in the Seine
Ss
Pv
MVp
Ds
DG
(Paris, Seine)
Js
Js
MVp
We visited Paris last summer. It has many museums along the banks of the Seine.
23/44
YAGO Enhancement by IE on Text Sources
Entity
1.0 subclass 1.0
subclass
Person
Capture
confidence
value
for each fact
subclass
Celebrity
0.7
subclass 1.0
subclass 1.0
Mythological
Figure
City
instanceOf
0.6
Location
0.8
instanceOf
subclass 1.0
Country
0.9
instanceOf
1.0
instanceOf
Paris
Hilton
Paris(Myth.)
Paris(France)
means
means 0.1
means 0.05
“Paris”
locatedIn
France
0.95
0.7
means 0.5 means 0.9
“La Grande
Nation”
“France”
ongoing work: harvesting relations by IE tools like GATE, LEILA, ...
(e.g.: which enzyme catalyzes which biochemical process,
who discovered or invented what, ...)
24/44
Search Results Without Ranking
q: Fisher isa scientist
Fisher isa $x
$@Fisher = Ronald_Fisher
$@scientist = scientist_109871938
$X = alumnus_109165182
$@Fisher = Irving_Fisher
$@scientist = scientist_109871938
mathematician_109635652 —subClassOf—> scientist_109871938
$X = social_scientist_109927304
Alumni_of_Gonville_and_Caius_College,_Cambridge$@Fisher
—subClassOf—>
alumnus_1091
= James_Fisher
"Fisher" —familyNameOf—> Ronald_Fisher
$@scientist = scientist_10981938
Ronald_Fisher —type—> Alumni_of_Gonville_and_Caius_College,_Cambridge
$X = ornithologist_109711173
Ronald_Fisher —type—> 20th_century_mathematicians
$@Fisher = Ronald_Fisher
"scientist" —means—> scientist_109871938
$@scientist = scientist_109871938
$X = theorist_110008610
$@Fisher = Ronald_Fisher
$@scientist = scientist_109871938
$X = colleague_109301221
$@Fisher = Ronald_Fisher
$@scientist = scientist_109871938
$X = organism_100003226
…
25/44
Ranking with Statistical Language Model
q: Fisher isa scientist
Fisher isa $x
$@Fisher = Ronald_Fisher
$@scientist = scientist_109871938
$X = mathematician_109635652
$@Fisher = Ronald_Fisher
$@scientist = scientist_109871938
$X = statistician_109958989
$@Fisher = Ronald_Fisher
$@scientist
= scientist_109871938
Score: 7.184462521168058E-13 mathematician_109635652
—subClassOf—>
scientist
$X
=
president_109787431
"Fisher" —familyNameOf—> Ronald_Fisher
Ronald_Fisher —type—> 20th_century_mathematicians
$@Fisher = Ronald_Fisher
"scientist" —means—> scientist_109871938
$@scientist = scientist_109871938
20th_century_mathematicians —subClassOf—> mathematician_109635652
$X = geneticist_109475749
$@Fisher = Ronald_Fisher
$@scientist = scientist_109871938
$X = scientist_109871938
→ statistical language model
for result graphs
…
Online access at http://www.mpi-inf.mpg.de/~kasneci/naga/
26/44
Ranking Factors
Confidence:
Prefer results that are likely to be correct
¾ Certainty of IE
¾ Authenticity and Authority of Sources
Informativeness:
bornIn (Max Planck, Kiel) from
„Max Planck was born in Kiel“
(Wikipedia)
livesIn (Elvis Presley, Mars) from
„They believe Elvis hides on Mars
(Martian Bloggeria)
q: isa (Einstein, $y)
Prefer results that are likely important
May prefer results that are likely new to user
¾ Frequency in answer
¾ Frequency in corpus (e.g. Web)
¾ Frequency in query log
isa (Einstein, scientist)
isa (Einstein, vegetarian)
q: isa ($x, vegetarian)
isa (Einstein, vegetarian)
isa (Al Nobody, vegetarian)
Compactness:
Prefer results that are tightly connected isa vegetarian
¾ Size of answer graph
Einstein
Tom
isa Cruise
won
Nobel
Prize
won
bornIn
1962
Bohr
diedIn
27/44
NAGA Ranking Model
Following the paradigm of statistical language models
(used in speech recognition and modern IR), applied to graphs
For query q with fact templates q1 … qn
rank result graphs g with facts g1 … gn
by decreasing likelihoods:
using
generative
mixture model
ex.: bornIn (Goethe, Frankfurt)
n
P [q | g] = ∏ (1 − α ) ⋅ P [q i | g i ] + α ⋅ P [q i ]
i =1
β ⋅ Pconf [qi | gi ] + (1 − β ) ⋅ Pinform [qi | gi ]
based on
IE accuracy
and
authority
analysis
ex.: bornIn ($x, Frankfurt)
P[Goethe | bornIn , Frankfurt ] =
background
model
P [Goethe , bornIn
, Frankfurt
]
]
P [bornIn , Frankfurt ]
ne
P( x, r, z) P( x, r, z)
P
(
x
|
r
,
z
)
=
=
conf(e) = ∑ acc(e, Pi ) ⋅ trust( Pi )
P(r, z) ∑P( x', r, z)
i =1
x'
estimated by
correlation
statistics
for qi = (x, r, z) with variable x
28/44
Entity Search: Example NAGA
Query:
$x isa politician
$x isa scientist
Results:
Benjamin Franklin
Paul Wolfowitz
Angela Merkel
…
29/44
Outline
9 Motivation
9 Leibniz: Semantic Web
9 Planck: Statistical Web
• Darwin: Social Web
• Conclusion
30/44
Darwin Approach (Social Web)
Charles Darwin
(1809 - 1882)
Social Wisdom & Natural Selection:
• Evolution of (Web 2.0) species
• Survival of the fittest
• Search & rank by
collaborative recommendation
31/44
„Wisdom of Crowds“ at Work on Web 2.0
Information enrichment & knowledge extraction by humans:
• Collaborative Recommendations & QA
• Amazon (product ratings & reviews, recommended products)
• Netflix: movie DVD rentals → $ 1 Mio. Challenge
• answers.yahoo, iknow.baidu, etc.
• Social Tagging and Folksonomies
• del.icio.us: Web bookmarks and tags
• flickr: photo annotation, categorization, rating
• YouTube: same for video
• Human Computing in Game Form
• ESP and Google Image Labeler: image tagging
• Peekaboom: image segmenting and tagging
• Verbosity: facts from natural-language sentences
• Online Communities
• dblife.cs.wisc.edu for database research
• www.lt-world.org for language technology
• Yahoo! Groups, Myspace, Facebook, etc. etc.
32/44
Social Tagging: Example Flickr (1)
33/44
Social Tagging: Example Flickr (2)
34/44
Social Tagging: Example Flickr (3)
35/44
Social-Tagging Community
> 20 Mio. users
> 2 Bio. photos
> 10 Bio. tags
30% monthly growth
36/44
Source: www.flickr.com
ESP Game
[Luis von Ahn et al. 2004 ]
played against random, anonymous partner on Internet
taboo:
pyramid
Louvre
museum
Paris
art
• Game with a purpose
• Collects annotations (wisdom)
• Can exploit tag statistics (crowds)
• Attracts people, fun to play, some play hours
• ESP
gamehas
collected
> 10 Mio. tags
> 20000 users
your
partner
suggested:
myfrom
labels:
3
7
11
17
labels
labelspeople could tag all photos on
reflection
• 5000
the Web in 4 weeks
water
(human computing)
Congratulations!
You scored 1 point!
Mitterand
Mona Lisa
metro lignes 7, 14
1
Da Vinci code
37/44
Dark Side of Social Wisdom
• Spam (Web spam – not just for email anymore):
lucky online casino, easy MBA diploma, cheap V!-4-gra, etc.;
law suits about „appropriate Google rank“
• Truthiness:
degree to which something is truthy (not necessarily facty);
truthy := property of something you know from your guts
• Disputes:
editorial fights over critical Wikipedia articles;
Citizendium: new endeavor with "gentle expert oversight"
• Dishonesty, Bias, …
38/44
The Wisdom of Crowds: PageRank
PageRank (PR): links are endorsements & increase page authority
authority is higher if links come from high-authority pages
PR( q ) = ε ⋅ j( q ) + ( 1 − ε ) ⋅ ∑
p∈IN ( q )
with
PR( p ) ⋅ t( p,q )
Social Ranking
t ( p , q ) = 1 / outdegree( p)
and j ( q ) = 1 / N
equivalent to
principal eigenvector:
pn×1 = An×n × pn×1
Authority (page q) =
stationary prob. of visiting q
random walk: uniformly random choice of links + random jumps;
add bias to transitions and jumps for personal PR, TrustRank, etc.
39/44
Wisdom of Crowds: Graphs are Everywhere
http://www.flickr.com/photos/lukemontague/14038129/
http://www.flickr.com/photos/shopping2null/395271855/
People
Opinions
Data
40/44
http://datamining.typepad.com/gallery/newblog-crop.png
http://datamining.typepad.com/gallery/core.png
Wisdom of Crowds: Beyond PR
users
tags
docs
Typed graphs: data items, users, friends, groups,
postings, ratings, queries, clicks, …
with weighted edges → spectral analysis of various graphs
41/44
Outline
9 Motivation
9 Leibniz: Semantic Web
9 Planck: Statistical Web
9 Darwin: Social Web
• Conclusion
42/44
Summary
Harvesting knowledge & organizing in semantic DB / graph for
• scholarly Web,
• digital libraries,
• enterprise know-how,
• online communities, etc.
Enable semantic search, with ranking of uncertain facts
Three roads to knowledge and intelligent search:
• Leibniz / Semantic Web: ontologies, encyclopedia, etc.
• Planck / Statistical Web: large-scale IE from text, speech, etc.
• Darwin / Social Web: wisdom of crowds, tagging, folksonomies
43/44
Thank you !
44/44