GermaNet - of Verena Henrich

GermaNet:
Applications and Access Structures
Verena Henrich and Erhard Hinrichs
Eberhard Karls Universität Tübingen
Seminar für Sprachwissenschaft
Berlin, 5 December 2011
Outline of the Talk
•  GermaNet: a brief introduction
•  Compounds in GermaNet
•  Aligning senses in GermaNet and Wiktionary
•  Demos: applications and access structures
Introduction to GermaNet
GermaNet: A German Wordnet
•  GermaNet is a lexical semantic network covering the
German base vocabulary
•  Belongs to the family of wordnets modeled after the
Princeton WordNet for English
•  GermaNet is divided into 3 word categories:
-  Adjectives
-  Nouns
-  Verbs
•  Words are ordered according to their meaning
Word Meanings, Lexical Units, and Synsets
•  Word meanings are represented by lexical units
•  Lexical units are grouped into semantic concepts (synsets)
according to their meaning
•  A synset is a set of (near-)synonymous words
[Figure: a synset whose lexical units gelbe Rübe, Mohrrübe, Karotte, and Möhre are linked by synonymy]
Lexical Relations
•  Lexical relations hold between two lexical units
-  Synonymy
-  Antonymy
-  Pertainymy
[Figure: example lexical relations (antonymy, pertainymy); the pertainymy example links geblümt and Blume]
Conceptual Relations
•  Conceptual relations hold between synsets
-  Hypernymy and hyponymy
-  Part-whole relations
-  Entailment
-  Causation
-  Association
[Figure: Ball as hypernym of Fußball, Tennisball, and Volleyball; association between Fußball and Fußballtor]
Part-Whole Relations
•  4 kinds of part-whole relations
-  Component meronymy
-  Portion meronymy
-  Substance meronymy
-  Member meronymy
[Figure: examples of part-whole relations: portion meronymy (1g, 1kg); member meronymy (Schiff, Flotte); substance meronymy (Schnee, Schneemann)]
Size of GermaNet Release 6 (April 2011)
•  Number of lexical units: 93,407
-  Adjectives: 8,582 lexical units
-  Nouns: 71,844 lexical units
-  Verbs: 12,981 lexical units
•  Number of synsets: 69,594
-  Adjectives: 5,991 synsets
-  Nouns: 53,753 synsets
-  Verbs: 9,850 synsets
•  Literals: 85,214
•  Lexical relations: 3,562
•  Conceptual relations: 81,852
Compounds in GermaNet
[Figure: apple + tree = apple tree; rain + bow = rainbow; sun + flower = sunflower; foot + ball = football]
Determining Immediate Constituents
of Compounds in GermaNet
Introduction: Modeling Compounds in GermaNet
•  Goal: systematically link compounds in GermaNet
to their constituent parts
•  Condition: compound splitting needs to be
applied recursively
Kraftfahrzeugsteuer ‘motor vehicle tax’
-  c_modifier: Kraftfahrzeug ‘motor vehicle’
   -  c_modifier: Kraft ‘power’
   -  c_head: Fahrzeug ‘vehicle’
      -  c_modifier: fahren ‘to drive’
      -  c_head: Zeug ‘stuff’
-  c_head: Steuer ‘tax’
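The recursive modifier/head annotation for Kraftfahrzeugsteuer can be sketched in Python. The `SPLITS` table is a hypothetical, hand-written stand-in for GermaNet's stored constituent links (it is not the GermaNet API) and covers only the example from the slide.

```python
# Toy table of immediate constituents: compound -> (c_modifier, c_head).
# Hypothetical stand-in for GermaNet data; only the slide's example is covered.
SPLITS = {
    "Kraftfahrzeugsteuer": ("Kraftfahrzeug", "Steuer"),
    "Kraftfahrzeug": ("Kraft", "Fahrzeug"),
    "Fahrzeug": ("fahren", "Zeug"),
}


def constituent_tree(word):
    """Recursively expand a word into nested c_modifier / c_head entries."""
    if word not in SPLITS:
        return word  # simplex word: nothing left to split
    modifier, head = SPLITS[word]
    return {
        "word": word,
        "c_modifier": constituent_tree(modifier),
        "c_head": constituent_tree(head),
    }


def leaves(tree):
    """Collect the simplex words at the bottom of the tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    return leaves(tree["c_modifier"]) + leaves(tree["c_head"])


tree = constituent_tree("Kraftfahrzeugsteuer")
print(leaves(tree))
```

Only immediate constituents are stored per word; the deeper structure falls out of applying the same lookup to each constituent in turn.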
Compounding in German is Challenging
•  Intervening linking elements (ns, n, e, s, ens, es, en):
-  Blume ‘flower’ + n + Vase ‘vase’ = Blumenvase ‘flower vase’
-  Sieg ‘win’ + es + Wille ‘will’ = Siegeswille ‘will to win’
•  Elision of word-final characters:
-  Hüfte ‘hip’ – e + Schwung ‘swing’ = Hüftschwung ‘hip swing’
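A minimal Python sketch of how a splitter might undo linking elements and elision when looking up the modifier part of a compound. The `LEXICON` set and the function name `modifier_candidates` are hypothetical; a real system would query GermaNet instead, and the only elision handled here is a dropped final e.

```python
# Linking elements listed on the slide, longest first.
LINKING_ELEMENTS = ["ens", "ns", "es", "en", "n", "e", "s"]

# Hypothetical mini-lexicon standing in for GermaNet's word list.
LEXICON = {"Blume", "Vase", "Sieg", "Wille", "Hüfte", "Schwung"}


def modifier_candidates(surface):
    """Return (base form, explanation) pairs for the part before the head."""
    found = []
    if surface in LEXICON:
        found.append((surface, "no change"))
    for link in LINKING_ELEMENTS:
        base = surface[: -len(link)]
        if surface.endswith(link) and base in LEXICON:
            found.append((base, "linking element " + link))
    if surface + "e" in LEXICON:
        found.append((surface + "e", "elided final e"))
    return found


print(modifier_candidates("Blumen"))  # modifier part of Blumenvase
print(modifier_candidates("Sieges"))  # modifier part of Siegeswille
print(modifier_candidates("Hüft"))    # modifier part of Hüftschwung
```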
SMOR and ASV Toolbox Compound Splitter
•  SMOR is a morphological analyzer for German
•  ASV Toolbox Baseforms is a German compound splitter
→ Neither groups immediate constituents
•  Modified versions of the SMOR and ASV Toolbox compound splitters
group immediate constituents
Compound Splitter Incorporating GermaNet (GN-CS)
Example 1: Kohleprodukt ‘coal product’
•  Pattern matching for gathering all potential modifiers and heads
•  In case more than one potential modifier-head composition is
possible: heuristics
-  Correct: Kohle ‘coal’ + Produkt ‘product’
-  False: Kohl ‘cabbage’ + e (linking element) + Produkt ‘product’
•  Matches without linking elements are preferred
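The Kohleprodukt example can be sketched as follows: gather every modifier-head split that matches a word list, then prefer the split without a linking element. `LEXICON` is a hypothetical three-word stand-in for GermaNet, and both function names are illustrative; this is not the GN-CS implementation itself.

```python
# Hypothetical mini-lexicon standing in for GermaNet's word list.
LEXICON = {"Kohle", "Kohl", "Produkt"}
LINKING_ELEMENTS = ["", "e", "n", "s", "es", "en", "ns", "ens"]  # "" = none


def candidate_splits(compound):
    """Gather every (modifier, linking element, head) that matches the lexicon."""
    splits = []
    for i in range(1, len(compound)):
        left, right = compound[:i], compound[i:]
        head = right.capitalize()  # noun heads are capitalized in the lexicon
        if head not in LEXICON:
            continue
        for link in LINKING_ELEMENTS:
            modifier = left[: len(left) - len(link)]
            if left.endswith(link) and modifier in LEXICON:
                splits.append((modifier, link, head))
    return splits


def prefer_no_linking_element(splits):
    """Slide heuristic: a split without a linking element wins."""
    return min(splits, key=lambda split: len(split[1]))


splits = candidate_splits("Kohleprodukt")
print(splits)
print(prefer_no_linking_element(splits))
```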
Compound Splitter Incorporating GermaNet (GN-CS)
Example 2: Flughafengelände ‘airport area’
-  Correct: Flughafen ‘airport’ + Gelände ‘area’
-  False: Flug ‘flight’ + Hafengelände ‘harbor area’
•  No linking elements
•  All existing words in GermaNet
•  Matches with connected constituents are preferred
Compound Splitter Incorporating GermaNet (GN-CS)
Example 3: Nachttischlampe ‘bedside lamp’
-  Correct: Nachttisch ‘bed table’ + Lampe ‘lamp’
-  False: Nacht ‘night’ + Tischlampe ‘table lamp’
•  No linking elements
•  All existing words in GermaNet
•  Both heads are direct/indirect hypernyms of the compound:
Nachttischlampe has hypernym Tischlampe, Tischlampe has hypernym Lampe,
so Nachttischlampe has (indirect) hypernym Lampe
•  Longer hypernym distance is preferred
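A small Python sketch of the hypernym-distance heuristic for Nachttischlampe. The `HYPERNYM` table is a hypothetical toy graph with a single hypernym per word, mirroring the links named on the slide; real GermaNet synsets can have several hypernyms.

```python
# Hypothetical hypernym links (word -> direct hypernym) mirroring the slide.
HYPERNYM = {
    "Nachttischlampe": "Tischlampe",
    "Tischlampe": "Lampe",
}


def hypernym_distance(word, ancestor):
    """Number of hypernym steps from word up to ancestor, or None if unrelated."""
    steps = 0
    while word != ancestor:
        if word not in HYPERNYM:
            return None
        word = HYPERNYM[word]
        steps += 1
    return steps


def prefer_longer_distance(compound, splits):
    """Slide heuristic: pick the split whose head is the more distant hypernym."""
    scored = []
    for modifier, head in splits:
        distance = hypernym_distance(compound, head)
        if distance is not None:
            scored.append((distance, modifier, head))
    return max(scored)[1:] if scored else None


splits = [("Nachttisch", "Lampe"), ("Nacht", "Tischlampe")]
print(prefer_longer_distance("Nachttischlampe", splits))
```

Under this toy graph, Lampe lies two steps above the compound and Tischlampe only one, so the split with head Lampe is chosen.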
Majority Voting
Example: Segelflugzeug ‘glider’
•  SMOR: Segel + Flugzeug ‘sail + plane’ (1 vote)
•  GN-CS: Segel + Flugzeug ‘sail + plane’ (1 vote)
•  ASV: Segelflug + Zeug ‘gliding + stuff’ (1 vote)
•  Result: Segel + Flugzeug ‘sail + plane’ receives 2 votes,
Segelflug + Zeug ‘gliding + stuff’ receives 1 vote
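The vote count for Segelflugzeug can be reproduced with a short Python sketch. The `predictions` dictionary simply restates the three splitter outputs from the slide; the function name `majority_vote` is illustrative.

```python
from collections import Counter

# Splitter outputs for Segelflugzeug as shown on the slide.
predictions = {
    "SMOR": ("Segel", "Flugzeug"),
    "GN-CS": ("Segel", "Flugzeug"),
    "ASV": ("Segelflug", "Zeug"),
}


def majority_vote(predictions):
    """Return the most frequent split and the full vote tally."""
    votes = Counter(predictions.values())
    winner, _count = votes.most_common(1)[0]
    return winner, votes


winner, votes = majority_vote(predictions)
print(winner, dict(votes))
```

With three voters a tie occurs only when all three disagree; this sketch would then return whichever split was counted first, which is one reason additional heuristics are useful.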
Combined Hybrid Compound Splitter
Example: Gesetzmäßigkeit ‘legality’
•  SMOR: Gesetz + Mäßigkeit ‘law + moderateness’ (1 vote)
•  ASV: Gesetz + Mäßigkeit ‘law + moderateness’ (1 vote)
•  GN-CS: Gesetz + Mäßigkeit ‘law + moderateness’ (1 vote)
•  Votes: Gesetz + Mäßigkeit ‘law + moderateness’: 3; no compound predicted: 0
•  Heuristic: adjective gesetzmäßig ‘legal’ + derivation suffix –keit
→ no composition
Combined Hybrid Compound Splitter – Heuristics
•  Suffixes –heit, –keit, –ität, –ung, –tum, etc. indicate derivation
-  False: Allgemeinheit ‘generality’ = All ‘universe’ + Gemeinheit ‘villainy’
-  Correct: Allgemeinheit ‘generality’ = allgemein ‘general’ (adjective) + –heit (derivation suffix); no compound predicted
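The derivation heuristic can be sketched as a check that runs before any split is accepted. `ADJECTIVES` is a hypothetical two-word stand-in for a GermaNet adjective lookup, and requiring the remainder to be a known adjective follows the two slide examples; it is a simplification, not the authors' full rule.

```python
DERIVATION_SUFFIXES = ("heit", "keit", "ität", "ung", "tum")

# Hypothetical adjective list standing in for a GermaNet lookup.
ADJECTIVES = {"allgemein", "gesetzmäßig"}


def is_derivation(word):
    """True if word looks like adjective + derivation suffix, i.e. no compound."""
    lowered = word.lower()
    for suffix in DERIVATION_SUFFIXES:
        if lowered.endswith(suffix) and lowered[: -len(suffix)] in ADJECTIVES:
            return True
    return False


for word in ("Allgemeinheit", "Gesetzmäßigkeit", "Segelflugzeug"):
    print(word, "no compound predicted" if is_derivation(word) else "try splitting")
```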
Combined Hybrid Compound Splitter – Heuristics
•  Lowercase heads are most probably incorrect
-  False: Teppichleger ‘carpet layer’ = Teppich ‘carpet’ + leger ‘informal’ (lowercase head)
-  Correct: Teppichleger ‘carpet layer’ = Teppich ‘carpet’ + Leger ‘layer’ (capitalized head)
•  The head should be an ending substring of the compound
-  False: Weltreise ‘world trip’ = Welt ‘world’ + Reis ‘rice’ (not an ending substring)
-  Correct: Weltreise ‘world trip’ = Welt ‘world’ + Reise ‘trip’ (ending substring)
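Both head heuristics are simple string checks, sketched here in Python. The function names are illustrative, and the candidate splits are the ones shown on the slide.

```python
def plausible_head(compound, head):
    """Apply the two slide heuristics to a proposed head of a noun compound."""
    if not head[:1].isupper():
        return False  # lowercase head, e.g. the adjective "leger"
    if not compound.lower().endswith(head.lower()):
        return False  # head is not an ending substring of the compound
    return True


def filter_splits(compound, splits):
    """Keep only the (modifier, head) pairs whose head passes both checks."""
    return [(m, h) for m, h in splits if plausible_head(compound, h)]


print(filter_splits("Teppichleger", [("Teppich", "leger"), ("Teppich", "Leger")]))
print(filter_splits("Weltreise", [("Welt", "Reis"), ("Welt", "Reise")]))
```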
Sense Alignment of
GermaNet and Wiktionary
Introduction: The Necessity of Sense Descriptions
•  Descriptions illustrate individual word senses in dictionaries
•  For example: Princeton WordNet contains 3 senses for nail
•  Without definitions it is not easy to distinguish senses
Extend GermaNet with Sense Definitions
•  GermaNet senses of Anzeige paired with Wiktionary definitions:
-  ‘advertisement’: Kurze Mitteilungen in den Medien, die der Bekanntmachung oder Werbung dienen (‘Short notices in the media that serve as announcements or advertising’)
-  ‘complaint’: Recht: Bekanntgabe einer Straftat bei einer Behörde (‘Law: report of a crime to an authority’)
-  ‘display’: Technik: eine Vorrichtung zur Signalisierung von Zuständen und Werten (‘Technology: a device for signaling states and values’)
GermaNet-Wiktionary Mapping
[Figure: the GermaNet senses ‘advertisement’, ‘complaint’, and ‘display’ mapped to Wiktionary senses]
Bag of Words from GermaNet Sense Anzeige ‘advert.’
Anzeige, Annonce, Inserat, Ausschreibung, Versandanzeige, Kaufgesuch,
Verkaufsangebot, Familienanzeige, Partnergesuch, Kontaktanzeige,
Stellenanzeige, Stellenangebot, Stellenannonce, Stellengesuch,
Kleinanzeige, Großanzeige, Zeitungsanzeige, Zeitung, Blatt, Gazette
Bag of Words from Wiktionary Sense Anzeige ‘advert.’
Anzeige, Mitteilung, Medien, Bekanntmachung, Werbung, Annonce, Inserat,
Familienanzeige, Geburtstaganzeige, Heiratsanzeige, Hochzeitsanzeige,
Kontaktanzeige, Todesanzeige, Traueranzeige, Verlobungsanzeige,
Werbeanzeige, Kleinanzeige
Word Overlap Example: Anzeige ‘advertisement’
•  Words shared by the bag of words from GermaNet and the bag of words
from Wiktionary:
Anzeige, Annonce, Inserat, Familienanzeige, Kontaktanzeige, Kleinanzeige
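The word overlap for the ‘advertisement’ sense of Anzeige is a set intersection, sketched below in Python. The two bags are copied from the slides; the function name `word_overlap` is illustrative, and how the overlap is turned into a final mapping decision is not shown here.

```python
# Bag of words for the GermaNet sense Anzeige 'advertisement' (from the slide).
germanet_bag = {
    "Anzeige", "Annonce", "Inserat", "Ausschreibung", "Versandanzeige",
    "Kaufgesuch", "Verkaufsangebot", "Familienanzeige", "Partnergesuch",
    "Kontaktanzeige", "Stellenanzeige", "Stellenangebot", "Stellenannonce",
    "Stellengesuch", "Kleinanzeige", "Großanzeige", "Zeitungsanzeige",
    "Zeitung", "Blatt", "Gazette",
}

# Bag of words for the Wiktionary sense Anzeige 'advertisement' (from the slide).
wiktionary_bag = {
    "Anzeige", "Mitteilung", "Medien", "Bekanntmachung", "Werbung",
    "Annonce", "Inserat", "Familienanzeige", "Geburtstaganzeige",
    "Heiratsanzeige", "Hochzeitsanzeige", "Kontaktanzeige", "Todesanzeige",
    "Traueranzeige", "Verlobungsanzeige", "Werbeanzeige", "Kleinanzeige",
}


def word_overlap(bag_a, bag_b):
    """Return the words that occur in both bags."""
    return bag_a & bag_b


overlap = word_overlap(germanet_bag, wiktionary_bag)
print(sorted(overlap))
print(len(overlap))
```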
Coordinated Relations Example: Anzeige ‘advert.’
[Figure: the GermaNet senses ‘advertisement’, ‘complaint’, and ‘display’ aligned with Wiktionary senses through synonyms in common and hyponyms in common]
Different Sense Granularities
•  Wiktionary senses of Archiv:
-  Sammlung historisch oder aus anderen Gründen bedeutsamer Dokumente (‘Collection of documents that are important for historical or other reasons’)
-  Einrichtung, Institution zur Aufbewahrung und Pflege historisch oder aus anderen Gründen bedeutsamer Dokumente (‘Institution for storing and maintaining documents that are important for historical or other reasons’)
-  Gebäude oder Gebäudeteil, der eine Institution zur Aufbewahrung von Dokumenten enthält (‘Building or part of a building that houses an institution for storing documents’)
•  GermaNet senses of Archiv: ‘data repository’, ‘archive’, ‘archived file’
Sense Mapping Editor Using Anzeige
Applications and Tools
for GermaNet
Tools for GermaNet
•  Application Programming Interfaces
-  Java API
-  Perl API
•  Web Application:
http://weblicht.sfs.uni-tuebingen.de/rws/gnet/
•  Web service: as part of WebLicht
•  GermaNet-Explorer: visualisation tool (developed at the University
of Dortmund)
•  GernEdiT: GermaNet editing tool
GernEdiT – The GermaNet Editing Tool
Demo
GermaNet-Explorer (University of Dortmund)
Demo
Web Application
http://weblicht.sfs.uni-tuebingen.de/rws/gnet/
Demo
Thank you.
Verena Henrich & Erhard Hinrichs
Department of Linguistics
University of Tübingen
Wilhelmstr. 19
72074 Tübingen
Germany
verena.henrich@uni-tuebingen.de
http://www.verenahenrich.de
erhard.hinrichs@uni-tuebingen.de
http://www.sfs.uni-tuebingen.de/~eh/
Links & References
•  GermaNet homepage: http://www.sfs.uni-tuebingen.de/GermaNet/
•  GermaNet web application: http://weblicht.sfs.uni-tuebingen.de/rws/gnet/
•  Verena Henrich and Erhard Hinrichs: Determining Immediate Constituents of
Compounds in GermaNet. In Proceedings of Recent Advances in Natural
Language Processing (RANLP 2011), Hissar, Bulgaria, 2011.
http://www.aclweb.org/anthology/R/R11/R11-1058.pdf
•  Verena Henrich, Erhard Hinrichs, and Tatiana Vodolazova: Semi-Automatic
Extension of GermaNet with Sense Definitions from Wiktionary. In
Proceedings of 5th Language & Technology Conference (LTC 2011), Poznań,
Poland, 2011.
•  Verena Henrich and Erhard Hinrichs: GernEdiT - The GermaNet Editing Tool.
In Proceedings of the Seventh Conference on International Language Resources
and Evaluation (LREC 2010), Valletta, Malta, 2010.
http://www.lrec-conf.org/proceedings/lrec2010/pdf/264_Paper.pdf
•  Fellbaum, C. (ed.): WordNet – An Electronic Lexical Database. The MIT Press,
1998.
•  Princeton WordNet web application: http://wordnetweb.princeton.edu/perl/webwn