Introduction - Institute for Web Science and Technologies

Transcription

Introduction - Institute for Web Science and Technologies
Institut für Web Science & Technologien - WeST
Web Retrieval
Steffen Staab, Sergej Sizov
http://west.uni-koblenz.de
Course organization
Lectures:
Š with Summer Academy: 21 June – 14 July 2010
Course materials:
Š http://www.uni-koblenz-landau.de/koblenz/fb4/studying/
summer-academy/webscience/WebRetrieval
Examination:
Š Oral exam at the end of the semester
Homework:
Š No classes or mandatory assignments
Š Examples and think-about questions integrated into lectures!
Announcements:
Š Course Web Site, email
Contact:
Š B-112, Wed 14-16 and on appointment (sizov@uni-koblenz.de)
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
2
What is Web Retrieval all about?
.. discovering useful information from the World-Wide Web
and its usage patterns
using
Text Mining:
Š application of Data Mining techniques to unstructured text
(e.g. Web sites, blog postings, user comments..)
Structure Mining:
Š taking into account structure and relations in Web data
(HTML tags, hyperlinks, friend lists..)
Usage Mining:
Š taking into account user interactions with Web systems
(clickstreams, collaborative filtering, ...)
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
3
Chapter 1
Motivation & Overview
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
4
Major instruments and apps
What can we use for this?
Š
Š
Š
Š
Š
Š
document content & structure analysis
indexing, search, relevance ranking
classification, grouping, segmentation
interaction with knowledge bases
annotation, summarization, visualization
personalized interaction & collaboration
What is this good for?
Š
Š
Š
Š
Š
Web & Deep Web search
intranet & enterprise search
integration of rich data sources
personalized filtering and recommenders
user support in Web 2.0
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
5
Related areas
ƒ Web Mining is related to Information Retrieval, Natural Language
Processing (NLP), Graph Theory, Statistical Machine Learning (ML):
Š
Š
Š
Š
Š
Š
Š
learning predictive models from data
pattern, rule, trend, outlier detection
classification, grouping, segmentation
knowledge discovery in data collections
information extraction from text and Web
Image understanding, speech recognition, video analysis
graph mining (e.g. on Web graph), social network analysis
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
6
Retrospective: Web search engines
The Web happened (1992)
Mosaic/Netscape happened (1993-95)
Crawler happened (1994): M. Mauldin (founded Lycos)
Yahoo founded in 1994 as a directory
Several SEs happened 1994-1996
(InfoSeek, Lycos, Altavista, Excite, Inktomi, …)
Backrub happened 1996 (went Google 1996-1998)
15 years later (2007):
> 11 Billion (109) pages
> 450 Million daily queries
> 8 Billion US $ annual revenue
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
7
Retrospective: Web search engines (2)
Early search engines based on text IR methods
Š Field started in the 1950s
Š Mostly uses statistical methods to analyze text
– Repeated words on a page are important
– Common words are not important (e.g. the, for, …)
Text IR necessary but not sufficient for Web search
Š Doesn’t capture authority
– An article on BBC as good as a copy www.duck.com
Š Doesn’t address Web navigation
– Query “uni koblenz” seeks www.uni-koblenz.de
– www.uni-koblenz.de may look less topic-specific than PR releases
Many alternatives have been tried and exist
Š Link analysis and authority ranking
Š Topics/query suggestion tools (e.g. Vivisimo, Exalead)
Š Graphical, 2-D, 3-D user interfaces, …
ƒ
..simple and clean preferred by users
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
8
Observations: Small Diameter of the Web
Small World Phenomenon (Milgram 1967)
Studies on Internet Connectivity (1999)
Source: Bill Cheswick and Hal Burch,
http://research.lumeta.com/ches/map/index.html
Source: KC Claffy,
http://www.caida.org/outreach/papers/1999/Nae/Nae.html
suggested small world phenomenon: low-diameter graph
( diameter = max {shortest path (x,y) | nodes x and y} )
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
9
The oracle of Bacon
social network of actor collaborations
http://oracleofbacon.org
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
10
Observations: Power-Law In/Out-Degrees (for links)
Study of Web Graph (Broder et al. 2000)
• power-law distributed degrees: P[degree=k] ~ (1/k)α
with α ≈ 2.1 for indegrees and α ≈ 2.7 for outdegrees
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
11
Under the hood: crawling and Indexing
WWW
......
.....
......
.....
Crawling
Metternich:
One of the
famous
Austrian
politicians
Metternich
Austrian
Politicians
...
Extraction
of relevant
words
Linguistic
methods:
stemming
Thesaurus
(Ontology)
metternich
austria
politician
...
Statistically
weighted
features
(terms)
Index
(e.g. B+-tree)
e.g. synonyms,
sub-/superconcepts
Metternich = district of Koblenz ?
Koblenz =city in Rhineland-Palatinate?
RP = federal state in Germany ?
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Indexing
politician
austria
...
URLs
Chapter 1: Introduction
12
Observation: Search engines have different data
Source: A. Gulli, A. Signorini, WWW 2005
Google > 8 Bio., MSN > 5 Bio., Yahoo! > 4 Bio., Ask/Teoma > 2 Bio.
overlap statistics → (surface) Web > 11.5 Bio. pages (> 40 TBytes)
Deep Web (Hidden Web) estimated to have 500 Bio. units (> 10 PBytes)
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
13
Observation: they also show different results
http://ranking.thumbshots.com/
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
14
Ranking: Content Relevance
Ranking by
descending
relevance
Similarity metric:
Search engine
Query
(Set of weighted
features)
Documents are feature vectors
e.g., using:
Summer Academy 2010: Course Web Retrieval
tf*idf
formula
Steffen Staab
Chapter 1: Introduction
15
Ranking: link-based authority
Ranking by
descending
relevance & authority
Search engine
Query
(Set of weighted
features)
Additionally, consider links between Web nodes:
Authority Score (di) := stationary visit probability [di]
in the random walk on the Web
.. reconciliation of relevance and authority by ad hoc weighting
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
16
PageRank in a nutshell
random walk on the Web graph:
uniformly random choice of links + random jumps
Authority (page q) =
stationary prob. of visiting q
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
17
What is our technology used for?
WWW
......
.....
......
.....
crawl
extract
& clean
index
handle
dynamic pages,
detect duplicates,
detect spam
strategies for
crawl schedule and
priority queue for
crawl frontier
search
rank
fast top-k queries,
query logging and
auto-completion
build and analyze
Web graph,
index all tokens
or word stems
present
GUI, user guidance,
personalization
scoring function
over many data
and context criteria
special file system for highindex-entry and query-result
performance storage management caching for fast search
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
18
How users use the Web?
70-80% of users use SE to find sites !
and most users prefer
a few commercial
large-scale
search engines
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
19
Where are the users?
Google: the users are all over the world
• Search engine serves over 100 different languages
• Should not have a catastrophic failure in any
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
20
What are the users asking us for?
Google-style Web search:
• Users give a 2-4 word query
• SE gives a relevance ranked list of web pages
• Most users click only on the first few results
• Few users go below the fold
.. whatever is visible without scrolling
down
• Far fewer ask for the next 10 results
over 200 Million queries a day
noisy inputs
searching over Eight Billion+ documents
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
21
User intentions
classification of queries [Rose/Levinson: WWW 2004]:
ƒ navigational: find specific homepage with unknown URL, e.g. Germanwings
ƒ informational: learn about topic (e.g. Information Retrieval)
ƒ focused, e.g. Fürst von Metternich, soccer world championship qualification
ƒ unfocused, e.g. undergraduate statistics, dark matter, Koblenz
ƒ seeking advice, e.g. help losing weight, low-fat food, marathon training tips
ƒ locating service, e.g. 6M pixel digital camera, taxi service Koblenz
ƒ exhaustive, e.g. Dutch universities, hotel reviews Crete, MP3 players
ƒ transactional: find specific resource, e.g. download Lucene source code,
Sony Cybershot DSC-W5, Mars surface images, hotel beach south Crete August
ƒ embedded in business workflow (e.g. CRM, business intelligence) or
personal agent (in cell phone, MP3 player, or ambient intelligence at home)
with automatically generated queries
ƒ natural-language question answering (QA):
factoids, e.g. where was Fürst von Metternich born, where is the German Corner, etc
ƒ list queries, e.g. in which movies did Johnny Depp play
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
22
Organization of Search Results (1)
large-scale Web search with authority ranking
http://www.google.com
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
23
Organization of Search Results (2)
cluster search results into topic areas
http://www.clusty.com
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
24
Organization of Search Results (3)
show broader context of results
http://www.exalead.com/search
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
25
Organization of Search Results (4)
suggest related search directions
http://www.yebol.com
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
26
Organization of Search Results (5)
auto-complete queries
http://labs.google.com/suggest/
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
27
Evaluation of Result Quality: Basic Measures
ideal measure is user satisfaction !
heuristically approximated by benchmarking measures
(on test corpora with query suite and relevance assessment by experts)
Capability to return only relevant documents:
typically for
r = 10, 100, 1000
Precision =
Capability to return all relevant documents:
typically for
r = corpus size
Recall =
Ideal quality
Precision
Precision
Typical quality
Recall
Summer Academy 2010: Course Web Retrieval
Recall
Steffen Staab
Chapter 1: Introduction
28
Evaluation: Aggregated Measures
Combining precision and recall into F measure
(e.g. with α=0.5 harmonic mean F1):
Precision-recall breakeven point of query q:
point on precision-recall curve p = f(r) with p = r
for a set of n queries q1, ..., qn (e.g. TREC benchmark)
Macro evaluation
(user-oriented) =
of precision
analogous
for recall
and F1
Micro evaluation
(system-oriented) =
of precision
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
29
Problems with IR-like evaluation on the Web
Š Collection is dynamic
10-20% urls change every month
Š Queries are time sensitive
Topics are hot then they are not
Š Spam methods evolve
Algorithms evaluated against last month’s web may not work today
Š Need to keep the collection fresh
Š Need to keep the queries fresh
Š Search space is extremely large
Š Over 100 million unique queries a day
Š To measure a 5% improvement at 95% confidence level
.. one would need 2700 judged queries (raw estimate)
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
30
Moders SE: Where do we stand today?
great for e-shopping, school kids,
scientists, doctors, etc.
high accuracy for simple queries
high scalability
(now >15 Bio. docs,
>10000 queries/sec)
continuously enhanced:
Froogle, Google Scholar, alerts,
multilingual for >100 languages,
query auto-completion,
integrated tools (calculator,
weather, news, etc. etc.)
Summer Academy 2010: Course Web Retrieval
Problem solved ?
Steffen Staab
Chapter 1: Introduction
31
Modern SE: Limitations
Hard queries (disregarding QA, multilingual, multimedia):
- professors from Koblenz who teach DB and have projects on Semantic
Web
- famous 19th century politician that was born in Koblenz
- drama with three women making a prophecy to a British nobleman that he
will become king
- best and latest insights on percolation theory
- pros and cons of dark energy hypothesis
- market impact of XML standards in 2002 vs. 2006
- experienced NLP experts who may be recruited for IT staff
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
32
Looking for:
Q1: famous 19h century
austrian politician born in
Koblenz
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
33
Limitations of Modern SE (2)
Q2: famous 19h century
austrian politician born in
Metternich
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
34
Information Extraction: Example
open-source tool: GATE/ANNIE
http://www.gate.ac.uk/annie/
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
35
Issue: Web spam, advertising, manipulations
..not just for email anymore
Users follow search results
Š Money follows users… Spam follows money…
There is value in getting ranked high
Š Funnel traffic from SEs to Amazon/eBay/…
Make a few bucks
Š Funnel traffic from SEs to a Viagra seller
Make $6 per sale
Š Funnel traffic from SEs to a porn site
Make $20-$40 per new member
Š Affiliate programs
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
36
Web Spam: Motivation
Let’s do the math..
Assume 500M searches/day on the web
All search engines combined
Assume 5% commercially viable
Much more if you include „adult-only“ queries
Š Assume $0.50 made per click (from 5c to $40)
Š $12.5M/day or about $4.5 Billion/year
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
37
Link Spam: google bombs
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
38
Hommingberger Gepardenforelle
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
39
Personalized information management (PIM)
Personalization:
• query interpretation depends on
personal interests and bias
• need to learn user-specific weights for
multi-criteria ranking (relevance, authority, freshness, etc.)
• can exploit user behavior
(feedback, bookmarks, query logs, click streams, etc.)
Personal Information Management (PIM):
• manage, annotate, organize, and search all your personal data
• on desktop (mail, files, calendar, etc.)
• at home (photos, videos, music, parties, invoices, tax filing, etc.)
and in smart home with ambient intelligence
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
40
Exploiting Query Logs and Clickstreams
from PageRank: uniformly random choice of links + random jumps
to QRank: + query-doc transitions + query-query transitions
+ doc-doc transitions on implicit links (w/ thesaurus)
with probabilities estimated from log statistics
ab
a
xyz
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
41
PIM: Semantic Browsing
Source: Alon Halevy, Semex Project, http://www.cs.washington.edu/homes/alon/
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
42
Recommender systems
Scenario: users have a potential interest in certain items
Goal: Provide recommendations for individual users
Examples:
Š recommendations to customers in an on-line store
Š movie recommendations
"If I have 2 million customers on the Web, I should have 2
million stores on the Web" (Jeff Bezos, CEO Amazon.com)
Types of recommendations:
Š display of customer comments
Š personalized recommendations based on buying decisions
Š customers who bought also bought.... (books/authors/artists)
Š email notifications of new items matching pre-specified criteria
Š explicit feedback (rate this item) to get recommendations
Š customers provide recommendation lists for topics
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
43
Recommender Systems: Example
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
44
Recommender Systems (2)
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
45
Web 1.0 vs. Web 2.0
SERVER
CLIENTS
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
46
What makes the difference ?
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
47
Institut für Web Science & Technologien - WeST
Motivation: multi-modal, sparse Web 2.0 data..
Tags
Recommender scenarios:
User
Š Given a user, recommend
photos which may be of interest.
Profile
Š Given a user, recommend users they may like to contact.
Š Given a user, recommend groups they may like to join.
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
49
Formalizing the problem..
ƒCollaborative content sharing framework:
ƒ users
tags
resources
ƒFolksonomy cloud:
Šuser-centric:
Šresource-centric:
Šcommunity-specific
contacts)
Šcollection-specific
Šarbitrary
(e.g. groups or
(e.g. favorites)
ƒ
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
50
The IR background – constructing feature vectors
alice
Koblenz
photo 1
bob
Germany
photo 2
dave
corner
photo 3
Clouds of interest:
- favorites
- groups
- contact lists
- comments on others‘ resources
.. defined analogously to tf·idf
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
51
Results: user-focused favorite recommendation
40 Training / 50 Test favorites, 250 contrast (randomly chosen) docs
Global model
Summer Academy 2010: Course Web Retrieval
Personal model
Steffen Staab
Chapter 1: Introduction
52
..spatial info?
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
53
Same coordinates, different views
German corner?
Old Balduin bridge?
Steamer „Goethe“?
Uni Koblenz?
Nuclear power plant?
Fortress?
What can you see from
Ehrenbreitstein?
Restaurant Ferrari?
Shopping malls?
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
54
Multi-modal Analysis of Social Media
Tags
User
Profile
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
55
GeoFolk step by step
ƒ Define the model
ƒ Import training data
ƒ Estimate model parameters (e.g. MCMC method)
- we use software like JAGS for this..
ƒ Use resulting topic-based feature vectors
for tags and resources in IR-like scenarios
(similarity based classification, clustering, ranked retrieval)
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
56
Tag recommendation
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
57
Web retrieval: course topics
Š Motivation and Overview
basics
Š Technical basics
Š
Š
Š
Š
Link Analysis and Authority Ranking
Text processing and analysis
Advanced IR models
Classification and Clustering
making Web
Mining effektive
and efficient
Š Web Spam and Advertising
Š Web Crawling
common
scenarios
Š interactive social media
novel media
and challenges
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
58
Recommended Literature (1)
ƒ Information Retrieval:
Š Soumen Chakrabarti: Mining the Web: Analyis of Hypertext and SemiStructured Data, Morgan Kaufmann, 2002
Š Christopher Manning, Prabhakar Raghavan, Hinrich Schütze: Introduction
to Information Retrieval, Cambridge University Press, 2007
Š Pierre Baldi, Paolo Frasconi, Padhraic Smyth: Modeling the Internet and
the Web - Probabilistic Methods and Algorithms, Wiley & Sons, 2003.
Š Ricardo Baeza-Yates, Berthier Ribeiro-Neto: Modern Information
Retrieval, Addison-Wesley, 1999.
Š Christopher D. Manning, Hinrich Schütze: Foundations of Statistical
Natural Language Processing, MIT Press, 1999.
Š David A. Grossman, Ophir Frieder: Information Retrieval: Algorithms and
Heuristics, Springer, 2004.
Š Ian H. Witten: Managing Gigabytes: Compressing and Indexing
Documents and Images, Morgan Kaufmann, 1999.
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
59
Recommended Literature (2)
Stochastics
Larry Wasserman: All of Statistics, Springer, 2004.
George Casella, Roger L. Berger: Statistical Inference, Duxbury, 2002.
Arnold Allen: Probability, Statistics, and Queueing Theory with Computer
Science Applications, Academic Press, 1990.
Machine Learning
Richard O. Duda, Peter E. Hart, David G. Stork: Pattern Classification,
Wiley&Sons, 2000.
Trevor Hastie, Robert Tibshirani, Jerome H. Friedman: Elements of Statistical
Learning, Springer, 2001.
Tom M. Mitchell: Machine Learning, McGraw-Hill, 1997.
Tools and Programming
Ian H. Witten, Eibe Frank: Data Mining: Practical Machine Learning Tools
and Techniques, Morgan Kaufmann, 2005.
Erik Hatcher, Otis Gospodnetic: Lucene in Action, Manning Publications,
2004.
Tony Loton: Web Content Mining with Java, John Wiley&Sons, 2002.
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
60
Additional sources
important conferences on IR
(see DBLP bibliography for full detail, http://www.informatik.uni-trier.de/~ley/db/)
SIGIR, ECIR, CIKM, TREC, WWW, KDD, ICDM, ICML, ECML
online portals
DBLP, Google Scholar, SiteSeer search engines
ACM, IEEE portals
Scientific mailing lists (e.g. DBWorld, AK-KDList, SIG-IRList, WebIR, DDLBETAtag, etc.)
evaluation initiatives:
• Text Retrieval Conference (TREC), http://trec.nist.gov
• Cross-Language Evaluation Forum (CLEF), www.clef-campaign.org
• Initiative for the Evaluation of XML Retrieval (INEX),
http://inex.is.informatik.uni-duisburg.de/
• KDD Cup, http://www.kdnuggets.com/datasets/kddcup.html
and http://kdd05.lac.uic.edu/kddcup.html
• Language-Independent Named-Entity Recognition,
www.cnts.ua.ac.be/conll2003/ner/
feel free to contact..
a) lecturer, b) authors of publications, c) members of online communities
and mailing lists
Summer Academy 2010: Course Web Retrieval
Steffen Staab
Chapter 1: Introduction
61