TMS 09 Text Mining Creating Semantics in the Real World

Text Mining: Creating Semantics in the Real World
Prof. Dr. Stefan Wrobel
Dr. Gerd Paass, Andreas Schäfer, Dr. Stefan Eickeler
Fraunhofer IAIS: Intelligent Analysis and Information Systems
▪ 250 people: scientists, project engineers, technical and administrative staff, students
▪ Located on Fraunhofer Campus Schloss Birlinghoven/Bonn
▪ Joint research groups and cooperation with
Core research areas:
▪ Machine learning/data mining
▪ Multimedia pattern recognition
▪ Visual Analytics
▪ Process Intelligence
▪ Adaptive robotics
▪ Cooperating objects
Directors: T. Christaller, S. Wrobel (exec.)
Where is all the knowledge we lost with information?
T. S. Eliot (Brainyquote.com)
Thomas Stearns Eliot, OM (September 26, 1888 – January 4, 1965), US-born British poet, dramatist and literary critic
Outline
▪ Why is Text Mining cool?
• Drowning in Data: The Challenge of Meaning
• Text Mining: Creating Meaning from Large Collections
• Text Mining Markets
▪ What can we do with Text Mining in the Real World? Some case studies
• Document classification: eBay, AntiPhish
• Retrieval and Relation Extraction: NZZ, THESEUS CONTENTUS, Fraunhofer Web
• Structuring and Monitoring: EmotionRadar
▪ Conclusion
Internet Trends
Ubiquitous intelligent systems
Convergence
Users as producers
Users as producers
▪ Web 2.0, Social Web, Crowdsourcing
▪ Exploding growth of content
▪ Media providers transform from content providers to confidence providers, competing with social communities
▪ Users expect full interactivity and control
▪ Quality control, confidence, choice and searching are becoming central
Drowning in Data …
Megabytes
Gigabytes
Terabytes
Petabytes
Exabytes
Size of digital universe [IDC]:
2007: 161 Exabyte
2010: 998 Exabyte
The data iceberg
20%
▪ Database tables
▪ Excel spreadsheets
▪ Other data with fixed structure
80%
▪ Email, Notes
▪ Word documents
▪ PDF, PowerPoint
▪ Other text
▪ Images
▪ Video, audio
Drowning in Unstructured Data …
Megabytes
Gigabytes
Terabytes
Petabytes
Exabytes
… and need meaning!
Semantics: The need for meaning
▪ Knowledge will be the driving force of business excellence
▪ Quality of services is increasingly distinguished by the amount of knowledge they can use
▪ Enormous savings if existing unstructured documents could be used
▪ Without needing to structure them first (cf. failures of knowledge management!)
The challenge of semantics
[Figure labels: Very large set of (electronic) documents; Manual Structuring; intelligent data and text mining technologies; Intelligent Service]
Text Mining is cool, since … the entire world works for us!
▪ 215,675,903 websites (Netcraft, March 2009)
▪ 19,200,000,000 webpages (Yahoo, Aug 2005)
▪ 29,700,000,000 webpages (boutell.com, Jan 2007)
▪ Google index: 26 million (1998), 1 billion (2000), 1 trillion (1,000,000,000,000) unique URLs (25 July 2008)
▪ => perhaps quadrillions of words (images, videos) …
▪ And most of them put together meaningfully (somewhat)!
▪ => smart algorithms can build on that.
The basic idea
▪ If two words occur frequently in the same context
- page, paragraph, sentence, part-of-speech
▪ Then there must be some semantic relation between them
Add in a lot of statistics, algorithms, intelligence…
raw material (web, documents, …)
+ correlations and statistics
+ intelligent data mining algorithms
You can create (a bit of) semantics!
AND YOU CAN DO A LOT!
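The co-occurrence idea can be illustrated with a small sketch. It is not taken from the slides: the function name `cooccurrence_pmi`, the choice of the sentence as context, the use of pointwise mutual information as the association score, and the toy corpus are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_pmi(sentences):
    """Score word pairs that share a sentence by pointwise mutual information."""
    word_counts = Counter()   # number of sentences containing each word
    pair_counts = Counter()   # number of sentences containing both words
    for sentence in sentences:
        words = set(sentence.lower().split())
        word_counts.update(words)
        pair_counts.update(frozenset(p) for p in combinations(sorted(words), 2))
    n = len(sentences)
    scores = {}
    for pair, joint in pair_counts.items():
        a, b = sorted(pair)
        # PMI = log( P(a, b) / (P(a) * P(b)) ), probabilities estimated per sentence
        scores[(a, b)] = math.log((joint / n) / ((word_counts[a] / n) * (word_counts[b] / n)))
    return scores


# Hypothetical toy corpus; each string is one "context".
toy_corpus = [
    "bayern wins bundesliga match",
    "bayern loses bundesliga match",
    "kohl becomes chancellor",
    "chancellor kohl speaks",
]
scores = cooccurrence_pmi(toy_corpus)
```

Pairs with a high score, such as ("bayern", "bundesliga") here, are candidates for a semantic relation; pairs that never share a context get no score at all. On real collections the counts would come from pages, paragraphs or sentences of the crawled material.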
Automated Clustering [Paass 07]
100 000 documents
Example document (German dpa newswire report on a football match):
…
<document>
<title>Bayern München verlor Tabellenführung und Elber beim 1 : 1 in Wolfsburg</title>
<text>Ausgerechnet der VfL Wolfsburg hat den FC Bayern München vom Thron der Fußball - Bundesliga gestoßen . Mit dem 1 : 1 ( 0 : 1 ) gelang den Wolfsburgern am Samstag der erste Punkt im sechsten Spiel gegen den Deutschen Rekordmeister . Durch das Remis und den gleichzeitigen Sieg von Konkurrent Bayer Leverkusen beim TSV 1860 München verlor der FC Bayern die Tabellenführung . Carsten Jancker ( 29 . ) hatte die Gäste in Führung gebracht . Doch vor 20 400 Zuschauern im ausverkauften VfL - Stadion wurden die Bayern für ihre pomadige Spielweise durch den Wolfsburger Ausgleichstreffer von Andrzej Juskowiak ( 60 . ) bestraft . Zudem verlor das Team von Trainer Ottmar Hitzfeld auch noch Stürmer Giovane Elber ( 80 . ) . Er sah wegen einer Tätlichkeit gegen VfL - Abwehrspieler Holger Ballwanz die Rote Karte . Die Bayern gingen ersatzgeschwächt in die Partie . Vor allem das Fehlen des verletzten Regisseurs Stefan Effenberg und des ebenfalls angeschlagenen Mehmet Scholl machte sich bemerkbar . Die Wolfsburger mussten weiter auf die Abwehrspieler Claus Thomsen und Thomas Hengen sowie den gesperrten Waldemar Kryger verzichten . Die Münchener konnten ihre Ausfälle anfangs besser kompensieren . Aus einer gestärkten Deckung , die vor der Pause nur selten von den Wolfsburger Stürmern Juskowiak und Jonathan Akpoborie gefordert wurde , kontrollierten die Bayern die Partie . Mit ihrer Taktik hatten sie nach knapp einer halben Stunde Erfolg : Jancker spitzelte den Ball nach einem abgefälschten Freistoß von Michael Tarnat ins Tor . Der Brasilianer Paolo Sergio ( 14 . ) hätte sogar schon früher sein Team in Führung schießen können . Doch traf er aus 14 m nur die Oberkante der Latte des VfL - Tores . Die Gastgeber besaßen nur eine Möglichkeit in der ersten Halbzeit , als der starke Spielmacher Dorinel Munteanu ( 37 . ) mit einem Schuss an dem großartig reagierenden Nationaltorhüter Oliver Kahn scheiterte . Nach dem Wechsel wurden die Wolfsburger mutiger und munterer . Sie übernahmen langsam das Kommando . Beim Ausgleichstreffer durch Juskowiak half die Bayern - Deckung allerdings mit . Samuel Kuffour verlor den Ball an den polnischen Nationalspieler , Juskowiak zog sofort ab und ließ dem besten Bayern - Spieler Kahn keine Chance . Danach bemühten sich die Münchner noch einmal und erhöhten den Druck . Doch klare Möglichkeiten besaßen sie nicht mehr . In der hektischen Schlussphase verlor Elber die Nerven , so dass die Bayern Glück hatten , in Unterzahl nicht auch noch zu verlieren .</text>
<dpa_TextEndCode>dpa yyni ce jo</dpa_TextEndCode>
</document>
…
Unsupervised hierarchical term Clustering: dpa data
Team sport: Spiel Bundesliga Team Trainer Sieg Mannschaft Niederlage Samstag Platz Saison Erfolg Punkte Pokal Nationalspieler …
  Not football: Finale Frankfurt deutsche Meister Hamburg Zuschauer Zuschauern Männer Halle WM Titelverteidiger Final EM …
    Basketball + Boxing: Basketball Berlin Weltmeister Bonn Kampf K Hagen Trier Würzburg LBA Playoff Runde Box Berliner Klitschko Titel …
    Handball: Kiel Handball Magdeburg Flensburg HSG VfL TV THW Tore Bad Wuppertal Lemgo Bundesliga Handewitt …
  Football: FC Trainer Fußball München Spieler Bayern Mannschaft Saison Hertha Stürmer Stadion Spiel SV Dortmund Coach …
    German League: Minute Tor VfL Schiedsrichter Zuschauer Minuten Führung Tore Hansa Eintracht Schalke Bundesliga Karte Wolfsburg …
    European League: Bayern League Champions Fußball UEFA United Hinspiel Cup Manager Vertrag Leeds Club Fans Real Hitzfeld …
Text Mining Market Size
▪ “The text mining market has roughly $50-100 million annual product revenue, and is growing at roughly 40-60% annually.” [Monash 06, texttechnologies.com]
▪ Sounds small …
▪ But then …
• Several research sites devoted to the technology
…
▪ So the real market must be somewhere else …
The Text Mining Market … is called “Text Analytics”
Primary areas:
▪ Web search, site search
▪ Knowledge management, enterprise portals
▪ Information collection, extraction, harvesting
▪ Email handling, security, spam and phishing filtering
▪ Market research
▪ Online advertising
▪ Specialized markets
• Litigation, legal
• Patent search
[cf. Monash/2008]
Application Field Market Research: Germany 1.6 billion, growing
Both ad-hoc studies and panels can benefit from text mining
http://www.adm-ev.de/zahlen.html
Enterprise Search as a text mining market
▪ More than $1.2 billion in 2010

Year    Software revenue (million $)
2006    717
2007    860
2008    989
2009    1108
2010    1219
[Gartner 2008]
Companies
Outline
▪ Why is Text Mining cool?
• Drowning in Data: The Challenge of Meaning
• Text Mining: Creating Meaning from Large Collections
• Text Mining Markets
▪ What can we do with Text Mining in the Real World? Some case studies
• Document classification: eBay, AntiPhish
• Retrieval and Relation Extraction: NZZ, THESEUS CONTENTUS, Wikinger, Fraunhofer Web
• Structuring and Monitoring: Semantic Map, EmotionRadar
▪ Conclusion
Text Mining Tasks
▪ Document classification, scoring and/or ranking, isolated retrieval
• Assign a class, score or rank to an entire document
▪ In-collection, linked retrieval and organization
• Find documents in a collection
• Link results to other results
▪ Information and relation extraction
• Extract pieces of information, fill particular relations
▪ Overview and monitoring of collections
• Give summary impression of information in a collection or source
Motivation
Spotting Faked Offers at Internet Auctions
Techniques to sell fakes
▪ Put faked products on an internet auction platform, e.g. eBay
▪ Describe product as forged, falsified, e.g. “very similar to XXX”
Aspects
▪ Infringement of registered trademarks
▪ Violation of patents
▪ Enormous sales volume
Counter Measures
Use trainable classifiers
▪ Compile training set of genuine and faked internet auction offers
▪ Train classifiers to detect these classes; use text, format information, etc. as features
▪ Use different classifiers for different brands / products
▪ Apply to new internet auction offers
▪ Ban faked offers from auction
▪ Update classifiers to new techniques
[Figure: hyperplane separating Fakes from Originals in feature space (axes x1, x2)]
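The train-then-apply loop above can be sketched in a few lines. This is only an illustration under stated assumptions: the slides do not say which classifier was used (the figure suggests a separating hyperplane), so a simple multinomial naive Bayes model stands in here, and the class name `NaiveBayesTextClassifier` and the example offers are invented.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes with add-one smoothing over whitespace tokens."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)          # documents per class
        self.word_counts = defaultdict(Counter)      # token counts per class
        self.vocabulary = set()
        for text, label in zip(texts, labels):
            tokens = text.lower().split()
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        tokens = text.lower().split()
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # log prior plus smoothed log likelihood of every token
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training offers; a real training set would be compiled per brand / product.
train_texts = [
    "original watch with certificate and box",
    "genuine brand watch bought at official store",
    "replica watch very similar to brand",
    "looks like original very similar cheap",
]
train_labels = ["original", "original", "fake", "fake"]
classifier = NaiveBayesTextClassifier().fit(train_texts, train_labels)
prediction = classifier.predict("cheap watch very similar to original")
```

Updating the classifier to new selling techniques then amounts to adding newly labeled offers and calling `fit` again. Format information and other non-text features, which the slide also mentions, are left out of this sketch.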
Results
▪ A classifier was developed and tested
▪ Similar techniques as for spam detection
▪ The German Federal Court of Justice: Internet auction providers have to filter the auctions using appropriate methods to detect faked offers.
▪ Good results: F-value >> 90%
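For readers unfamiliar with the F-value: it is commonly computed as the harmonic mean of precision and recall. The sketch below assumes that this balanced variant (often called F1) is what is meant; the function name `f_value` and the labels are made up for illustration and are not the project's evaluation data.

```python
def f_value(true_labels, predicted_labels, positive="fake"):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold labels and classifier output.
truth = ["fake", "fake", "fake", "original", "original"]
predicted = ["fake", "fake", "original", "fake", "original"]
score = f_value(truth, predicted)
```

In this toy case two of three fakes are found and one original is wrongly flagged, so precision and recall are both 2/3 and so is the F-value.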
Phishing
E-mail fraud
▪ Send official-looking email
▪ Include web link or form
▪ Ask for confidential information, e.g. password, account details
▪ Attacker uses information to withdraw money, enter computer system, etc.
AntiPhish
Project AntiPhish
▪ Develop content-based phishing filters
▪ Include other clues like whitelists
▪ Trainable and adaptive filters
→ adapt to new phishing attacks
→ anticipate attacks
Consortium
▪ Fraunhofer IAIS (DE)
▪ Symantec (GB, IRL)
▪ Tiscali (IT)
▪ Nortel (FR)
▪ K.U. Leuven (BE)
Phishing: Defense Techniques
Workflow
▪ Obtain training data from email stream
▪ Extract features
▪ Estimate and update classifiers and filters
▪ Integrate new filters into email filtering framework
▪ Deploy at internet service provider
▪ Deploy at central wireless packet switch
Approach: Multiple feature sets
Basic Features
Dynamic Markov Chains
DMC Details
Latent Topic Models
Class-Specific Topic Models
Feature Processing and Selection
Test Corpora
Overall Result
The THESEUS research program in Germany [theseus-program.com]
▪ Deutsche Nationalbibliothek
▪ Deutsche Thomson OHG (DTO)
▪ Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI GmbH)
▪ empolis GmbH
▪ Festo AG
▪ Fraunhofer-Gesellschaft (7 Institutes)
▪ Friedrich-Alexander-Universität Erlangen
▪ FZI Forschungszentrum Informatik
▪ Institut für Rundfunktechnik GmbH (IRT)
▪ intelligent views gmbh
▪ Ludwig-Maximilians-Universität (LMU)
▪ moresophy GmbH
▪ LYCOS Europe
▪ mufin GmbH
▪ ontoprise GmbH
▪ SAP AG
▪ Siemens AG
▪ Technische Universität Darmstadt
▪ Technische Universität Dresden
▪ Technische Universität München
▪ Universität Karlsruhe (TH)
▪ Verband Deutscher Maschinen- und Anlagenbau e.V. (VDMA)
[Figure labels: semantic / syntactic; single author / multiple authors; Wess/07]
The THESEUS use cases
▪ ALEXANDRIA: The Internet Knowledge Platform
▪ CONTENTUS: Next Generation Digital Libraries for saving our cultural heritage
▪ MEDICO: Semantic Image Search in Medicine
▪ ORDO: Personal Ordered Knowledge Management
▪ PROCESSUS: Semantic Business Processes
▪ TEXO: Business Webs in the Internet of Things
CONTENTUS - Next Generation Digital Libraries for saving our cultural heritage
▪ Publishers, libraries, broadcasters, etc. are interested in using, distributing and selling their archive content
▪ In analog form, archives are threatened by deterioration, are not linked, are difficult to use, and are huge.
Goals:
▪ Digitization, optimization of quality, availability
▪ Indexing, semantic and social linking and intelligent search, communities
▪ Rescue of cultural heritage, preventing losses from deterioration
Project duration: until 2012
Showcases Semantic Digital Libraries
▪ 225 years of Neue Zürcher Zeitung (NZZ)
▪ GDR music archive of the German National Library
CONTENTUS Workflow
1 Digitization
2 Automated optimization of quality
3 Automated generation of metadata
4 Semantic linking of metadata
5 Open knowledge networks – user augmentation
6 Semantic access to knowledge and content
Shell
▪ Data generation: registered users / communities, algorithms with acceptable quality
▪ Quality control: self control (see Wikipedia)
Mantle: Controlled quality
▪ Data generation: automatically generated through high-quality algorithms
▪ Quality control: training and improvement of algorithms
Core: Guaranteed quality
▪ Data generation and correction: libraries, museums, universities, experts, etc.
▪ Quality control: schooling, rules, advisory boards
▪ Highest stability, highest persistence
Step 1: Digitization
▪ High-throughput methods
▪ Modern book scanners: thousands of pages per day
▪ Almost fully automatic
Data volumes: 70 TB (NZZ), peta- to exabytes (DNB)
Step 2: Automated optimization of quality
▪ Development of intelligent algorithms for optimizing print, images, sound & movies
▪ Automated generation of presentation formats
[Figure labels: margin removal; sharpening, straightening; denoising, declicking; scratch removal]
Step 3: Automated generation of metadata
▪ Structural and content metadata
▪ OCR, speech, music, video recognition
▪ Structure analysis and type recognition
▪ Linking with current norms & standards
Step 4: Semantic linking of contents
▪ Link-up with related media
▪ Incorporation of external knowledge sources (metadata systems, Wikipedia, …)
▪ Disambiguation, classification, relation extraction
Determining meaning
The words of natural language are often ambiguous
Über Kohl höhnte Strauß: „Er wird nie Kanzler werden.“ Die Zeit, 18.7.08
(English: Strauß sneered about Kohl: “He will never become chancellor.” As common nouns, “Kohl” means cabbage and “Strauß” means ostrich or bouquet.)
» For each word / term, find a meaning
» Subproblems:
» Part-of-speech recognition: noun, verb, adjective, …
» Named entity recognition: people, places, organizations, …
» Assignment of concepts: plant, bird, politician, …
Named entity recognition
Über Kohl höhnte Strauß: „Er wird nie Kanzler werden.“
» Analyze the surroundings of words
» “Kohl” in a sentence with “Kanzler” (chancellor) → probably “person”
» “Kohl” in a sentence with “kochen” (to cook) → probably “vegetable”
» Statistical model for person names
» Word + surroundings -> word is a person
» Training using annotated sentences
» Automatic recognition of words / phrases that represent people
Conditional Random Field Model
» Observed words X1,…,Xn
» Category of words Y1,…,Yn
\[ p(Y_1,\ldots,Y_n \mid X) = \frac{1}{Z(X,\lambda,\mu)} \exp\left( \sum_{t=1}^{N} \left[ \sum_{k=1}^{N_2} \lambda_k f_{k,C}(Y_{t-1}, Y_t, X) \right] \right) \]
» Properties f may depend on two subsequent states and on all observed words
Example
» Property f10293 has value 1,
- if Yt-1=“PER” and Yt=“PER” and
- Xt has value “Müller”.
Otherwise its value is 0.
[Lafferty, McCallum, Pereira 01]
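To make the role of the feature functions concrete, here is a toy sketch of the exponentiated, weighted feature sum for one label sequence. Only the first feature mirrors the f10293 example; the second feature, both weights, the "START" pseudo-label and the brute-force normalization over the label set {PER, O} are assumptions for illustration. A real CRF learns the weights from annotated sentences and computes Z with dynamic programming.

```python
import math
from itertools import product


def f_per_per_mueller(prev_label, label, words, t):
    """Indicator like f10293: PER followed by PER while the word at t is 'Müller'."""
    return 1.0 if prev_label == "PER" and label == "PER" and words[t] == "Müller" else 0.0


def f_other_lowercase(prev_label, label, words, t):
    """Hypothetical second feature: label O on an all-lower-case word."""
    return 1.0 if label == "O" and words[t].islower() else 0.0


# (lambda_k, f_k) pairs; the weights are invented, not learned.
FEATURES = [(2.0, f_per_per_mueller), (1.0, f_other_lowercase)]


def unnormalized_score(labels, words):
    """exp of the weighted feature sum over all positions t."""
    total = 0.0
    prev = "START"
    for t, label in enumerate(labels):
        total += sum(weight * f(prev, label, words, t) for weight, f in FEATURES)
        prev = label
    return math.exp(total)


def probability(labels, words, label_set=("PER", "O")):
    """Normalize by Z, here computed by enumerating every label sequence."""
    z = sum(unnormalized_score(list(c), words) for c in product(label_set, repeat=len(words)))
    return unnormalized_score(labels, words) / z


words = ["Gerd", "Müller", "spielte"]
good = unnormalized_score(["PER", "PER", "O"], words)
bad = unnormalized_score(["PER", "O", "O"], words)
```

The labeling PER PER O fires both features and therefore scores higher than PER O O, which is the effect the weights are meant to produce.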
Modeling of names: features for a CRF model
» Title FirstName Connective LastName
» Properties, recorded for the words xt-2, xt-1, xt, xt+1, xt+2
» Word, stem, part of speech
» Prefix, suffix (3 letters)
» Shape properties: capital character at the beginning, only numbers, contains numbers, mix of capital / non-capital letters, contains hyphens
» LDA topic model class
» Contained in list of first names, contained in list of last names
(In progress)
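A sketch of how such per-token properties might be collected over the window xt-2 … xt+2. The function name `token_features`, the feature key format and the tiny name lists are assumptions; stem, part of speech and LDA topic class are omitted because they would need external models.

```python
def token_features(tokens, t, first_names=frozenset(), last_names=frozenset()):
    """Collect simple word, affix, shape and gazetteer features in a +/-2 window."""
    features = {}
    for offset in range(-2, 3):
        i = t + offset
        if not 0 <= i < len(tokens):
            continue  # window position falls outside the sentence
        word = tokens[i]
        key = f"[{offset:+d}]"
        features[key + "word"] = word.lower()
        features[key + "prefix3"] = word[:3]
        features[key + "suffix3"] = word[-3:]
        features[key + "init_cap"] = word[:1].isupper()
        features[key + "only_digits"] = word.isdigit()
        features[key + "has_digit"] = any(c.isdigit() for c in word)
        features[key + "mixed_case"] = (
            any(c.isupper() for c in word[1:]) and any(c.islower() for c in word)
        )
        features[key + "has_hyphen"] = "-" in word
        features[key + "in_first_names"] = word in first_names
        features[key + "in_last_names"] = word in last_names
    return features


# Hypothetical sentence and name lists.
tokens = ["Dr.", "Helmut", "Kohl", "sprach", "1990"]
features = token_features(tokens, 2, first_names={"Helmut"}, last_names={"Kohl"})
```

Each key is prefixed with its window offset, so "[-1]in_first_names" describes the word before the current one. Dictionaries of this shape are a common input format for CRF toolkits, although the slides do not name the toolkit used.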
Identity of names
» There are several people named “Helmut Kohl”
» Helmut Kohl, born 1930, Chancellor
» Helmut Kohl, born 1943, Referee
» Helmut Kohl, textile merchant
» … 99 further hits in the telephone book
» Identification in Wikipedia
» Compare words of the Wikipedia article with the text in which “Helmut Kohl” was found
» Similar words -> probably the same person
» Automated assignment: Person name -> Wikipedia article
Assignment of similarity from the environment
Simple algorithm for assigning people to Wikipedia articles
» Occurrence in text: Helmut Kohl
» Description using characteristic terms -> u
» Wikipedia article on Kohl
» Description by characteristic terms -> w
» Comparison using a distance metric: for example cosine distance d(w,u)
» Implemented in a prototype
» Further approach: Assignment as a classification task f(w,u) = 0 or 1
Master Thesis
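The comparison step can be sketched as follows, assuming plain term-count vectors and cosine distance defined as one minus cosine similarity. The helper names, the article labels and the term lists are invented for illustration; the prototype's actual term weighting is not described on the slide.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts as a sparse vector."""
    return Counter(text.lower().split())


def cosine_distance(w, u):
    """1 - cosine similarity of two sparse term vectors."""
    dot = sum(w[term] * u[term] for term in w.keys() & u.keys())
    norm_w = math.sqrt(sum(v * v for v in w.values()))
    norm_u = math.sqrt(sum(v * v for v in u.values()))
    if norm_w == 0 or norm_u == 0:
        return 1.0
    return 1.0 - dot / (norm_w * norm_u)


# Hypothetical characteristic terms w for two candidate Wikipedia articles.
articles = {
    "Helmut Kohl (politician)": term_vector("chancellor cdu politician germany reunification"),
    "Helmut Kohl (referee)": term_vector("referee football austria match whistle"),
}
# Characteristic terms u of the text in which the name was found.
u = term_vector("the chancellor spoke about germany and the cdu")
best_article = min(articles, key=lambda name: cosine_distance(articles[name], u))
```

The mention is assigned to the article with the smallest distance. The classification variant f(w,u) would instead train a model that outputs 0 or 1 for each (article, mention) pair.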
Semantic Interpretation
Semantic categories currently assigned in the Contentus prototype
» Names: people, organizations, points in time, places, …
» Assignment to Wikipedia articles
Under development:
» Hypernyms in ontology (GermaNet): nouns, verbs → supersenses
» Clusters of words with similar meaning: topics
» Relations between names / concepts: “Berthold Brecht” studied in “München”
» Classes of documents: politics, economy, …
Knowledge store
» Further information for entities that were found in the text
» Dates, publications
» Number of inhabitants, topological relationships
» Social networks
» Who knows whom?
» Who was at the same place at the same time?
» Who influenced whom?

Helmut Kohl
Date of birth     30.4.1930
Place of birth    Ludwigshafen
Spouse            Hannelore K.
Education         Historian
Religion          Catholic
Party             CDU

Berlin
Area              891 km²
Population        3,420,786
GDP               €83.6 billion
Elevation         34–115 m
Latitude          52° 31′ N
Longitude         13° 25′ E
Knowledge store: Format
» Factual knowledge as logical expressions:
» <subject> <predicate> <object>
» Semantic Web standards
» RDF
» RDFS
» OWL
» Technical basis
» Database: MySQL
» Triple store: Jena + Joseki
» Query language
» SPARQL
Buchmesse Frankfurt (Frankfurt Book Fair), 10 October 2008
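The <subject> <predicate> <object> format can be illustrated with a minimal in-memory sketch. It does not use Jena, Joseki or SPARQL; the `match` helper and the predicate names are hypothetical, and the facts are the ones shown in the knowledge store example. Wildcard matching of this kind is roughly what a single SPARQL triple pattern does.

```python
def match(triples, subject=None, predicate=None, obj=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


# Facts as (subject, predicate, object); predicate names are made up.
triples = [
    ("Helmut_Kohl", "birthPlace", "Ludwigshafen"),
    ("Helmut_Kohl", "party", "CDU"),
    ("Berlin", "area_km2", "891"),
]
kohl_facts = match(triples, subject="Helmut_Kohl")
parties = [o for (_, _, o) in match(triples, predicate="party")]
```

A production triple store adds indexing, shared vocabularies (RDFS, OWL) and joins across several patterns, which this sketch leaves out.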
Linking of knowledge
» Semantic integration of data and information from different sources
» DBpedia: an interpreted form of Wikipedia
» GeoNames Ontology: all the places in the world
» Catalogue of the German National Library: books and publications
→ Triple store
» Based on open standards
» W3C Semantic Web Stack
» RDF, RDFS, OWL, SPARQL
Knowledge sources
» DBpedia (www.dbpedia.org)
» GeoNames Ontology
» Already in RDF/OWL format
» Person reference database PND
» Topic reference database SWD
» Online catalogue OPAC
» Partial export to RDF
» Entities found in the text
» Identification using Wikipedia
» Linking with DBpedia data via link
Step 5: Open knowledge networks – user augmentation
▪ Further annotations from experts and users
• Completions, corrections
• Cooperation with the ALEXANDRIA project in THESEUS
• Suitable measures to assure high quality of data
The Multiple Shell Model
Cf. Wikinger [Bröcker et al. 08]!
Outer: Open knowledge network
▪ Data generation: registered users / communities, algorithms
▪ Quality control: self control (cf. Wikipedia)
Mantle: Controlled quality
▪ Data generation: algorithms of high quality
▪ Quality control: training and improvement of algorithms
Core: Assured quality
▪ Data generation and correction: libraries, universities, museums, groups of experts, etc.
▪ Quality control: fixed rules, committees, maximal stability and persistence
Step 6: Semantic access to knowledge and content
▪ The knowledge network
• Digital, multimedia data
• Content is semantically linked
• Is enriched from external sources and user groups
▪ Access
• Structure by ontology
• Content relationships become clear
• “Knowledge exploration” is possible
The Contentus Demonstrator
Example of a classified webpage
Match with the document model = 80%
Classification as = Projects
Workflow for semantic processing of documents
[Figure labels: Crawl; Documents; Pre-processing; Categorization; Entity Recognition; Extracted Metadata; Extracted regions; Search index; Knowledge Store; Search; Using the document model; Using the structure model]
Watch out for the relaunch in summer 2009!!
Text Mining: Creating Semantics in the Real World
Emotion Radar
Which issues are important to people, and where are the emotional discussions in blogs and discussion forums?
Goal: market research, …
Emotion Radar example: Looking at two large automobile companies, A and B*
ƒ Selection and crawling of discussion forums
• Criteria: search engine ranking, size, activity
• Period used: January 2008 to January 2009
• Storage: 2 GB of data crawled during a period of 7 days
ƒ Structure analysis of the discussion forums:
• Manufacturer A:
– Number of postings: 188,487
– Monthly number of new postings: ca. 1,500
– Number of threads: 21,613
– Number of authors: 15,445
• Manufacturer B:
– Number of postings: 406,814
– Monthly number of new postings: ca. 2,700
– Number of threads: 38,758
– Number of authors: 21,919
* anonymized
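As a small worked example, the forum figures above can be turned into average postings per thread and per author. These derived ratios are not on the slide; they are only one plausible way to compare the activity of the two forums, and the names `forums` and `ratios` are chosen here for illustration.

```python
# Figures copied from the slide (thousands separators removed).
forums = {
    "A": {"postings": 188_487, "threads": 21_613, "authors": 15_445},
    "B": {"postings": 406_814, "threads": 38_758, "authors": 21_919},
}


def ratios(stats):
    """Average number of postings per thread and per author."""
    return {
        "postings_per_thread": stats["postings"] / stats["threads"],
        "postings_per_author": stats["postings"] / stats["authors"],
    }


summary = {name: ratios(stats) for name, stats in forums.items()}
for name, r in summary.items():
    print(f"Manufacturer {name}: "
          f"{r['postings_per_thread']:.1f} postings/thread, "
          f"{r['postings_per_author']:.1f} postings/author")
```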
Case study: Internet postings related to the introduction of a new car model in Germany in 2008
Figure: chart of postings with three marked events:
• Manufacturer publishes first pictures
• Manufacturer publishes further product features / start of sales
• Cars are delivered
* anonymized
Partially automated emotion analysis shows a mood swing from positive to negative (“in love” -> “angry”)
Figure: chart of postings annotated with emotion labels around three events.
• Events: manufacturer publishes first pictures; manufacturer publishes further product features / start of sales; cars are delivered
• Emotion labels: “interested”, “hoping”, “in love”, “turned off”, “surprised”, “proud”, “angry”
* anonymized
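The slides do not say how the partially automated emotion analysis works. One simple way to approximate the idea is a cue-word lexicon that tags each posting with emotion labels and then reports the most frequent label per phase. In the sketch below, the lexicon, the function names, and the sample postings are all invented for illustration; only the emotion labels come from the slide.

```python
from collections import Counter

# Hypothetical lexicon: cue word -> emotion label (labels taken from the slide).
EMOTION_LEXICON = {
    "love": "in love",
    "gorgeous": "in love",
    "hope": "hoping",
    "curious": "interested",
    "proud": "proud",
    "annoyed": "angry",
    "furious": "angry",
    "ugly": "turned off",
}


def emotions_in(posting):
    """Return the emotion labels whose cue words occur in the posting."""
    words = (w.strip(".,!?") for w in posting.lower().split())
    return [EMOTION_LEXICON[w] for w in words if w in EMOTION_LEXICON]


def dominant_emotion_by_phase(postings):
    """postings: iterable of (phase, text). Returns {phase: most common label}."""
    counts = {}
    for phase, text in postings:
        counts.setdefault(phase, Counter()).update(emotions_in(text))
    return {phase: c.most_common(1)[0][0] for phase, c in counts.items() if c}


# Invented sample postings for two of the phases on the slide.
postings = [
    ("first pictures", "I love the design, gorgeous!"),
    ("first pictures", "Curious about the engine."),
    ("cars delivered", "I am annoyed by the fuel consumption."),
    ("cars delivered", "Furious, it drinks far too much."),
    ("cars delivered", "Still proud of it."),
]
mood = dominant_emotion_by_phase(postings)
print(mood)
```

A real system would need manual review of the lexicon and of ambiguous postings, which may be what "partially automated" refers to.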
Topic recognition shows a change in the product features discussed, from design to gasoline consumption
Figure: chart of postings annotated with topics and emotions.
• Events: first photos; test drives; deliveries
• Topics: vehicle length, design; “giant fish mouth”; gearshift, efficiency, technology; chrome trim, perceived quality; prices, car configurator and price list; sunroof, steering wheel, on-board computer, audio system; consumption, folding key, audio; neighbors; consumption
• Emotions: “interested”, “hoping”, “in love”, “turned off”, “surprised”, “proud”, “angry”
• Sample posting: “... hard to believe how much gasoline this small car uses!” (Kalle83)
* anonymized
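A shift in discussed topics like the one on this slide can be made measurable by comparing, for two phases, the share of postings that mention each topic. The sketch below assumes a hand-made keyword list per topic; the keywords, function names, and sample postings are hypothetical, and the slides do not describe the actual topic recognition method.

```python
from collections import Counter

# Hypothetical topic keywords; the topic names follow the slide.
TOPIC_KEYWORDS = {
    "design": {"design", "length", "chrome"},
    "consumption": {"consumption", "gasoline", "fuel"},
    "price": {"price", "configurator"},
}


def topic_shares(postings):
    """Share of postings that mention each topic at least once."""
    hits = Counter()
    for text in postings:
        words = {w.strip(".,!?").lower() for w in text.split()}
        for topic, keywords in TOPIC_KEYWORDS.items():
            if words & keywords:
                hits[topic] += 1
    total = len(postings)
    if total == 0:
        return {}
    return {topic: hits[topic] / total for topic in TOPIC_KEYWORDS}


def biggest_riser(before, after):
    """Topic whose share grows most from the earlier to the later phase."""
    a, b = topic_shares(before), topic_shares(after)
    return max(TOPIC_KEYWORDS, key=lambda t: b.get(t, 0.0) - a.get(t, 0.0))


# Invented sample postings for an early and a late phase.
early = ["The design is bold.", "Too much chrome?", "What will the price be?"]
late = [
    "Fuel consumption is too high.",
    "Hard to believe how much gasoline it uses!",
    "Nice design though.",
]
riser = biggest_riser(early, late)
print(riser)
```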
How can manufacturers use these text mining results?
Figure: chart of postings annotated with topics and emotions, with two callouts:
• E.g. short-term recognition of relevant topics (consumption) and preparation of an appropriate response (gas-saver training, fuel-efficient tires, proactive communication)
• ... Long-term continuous monitoring of emotional topics
* anonymized
Summary
ƒ Text Mining is cool!
• Drowning in Data: The Challenge of Meaning
• Text Mining: Creating Meaning from Large Collections
• Text Mining Markets
ƒ We can do a lot with Text Mining in the Real World!
• Document classification: eBay, antiPhish
• Retrieval and Relation Extraction: NZZ, THESEUS CONTENTUS, Fraunhofer Web
• Structuring and Monitoring: EmotionRadar