Projective Methods for Mining Missing Translations in DBpedia
Laurent Jakubina (1), Philippe Langlais (2)
(1) RALI - DIRO, Université de Montréal, jakubinl@iro.umontreal.ca
(2) RALI - DIRO, Université de Montréal, felipe@iro.umontreal.ca
BUCC Workshop 2015
Introduction
Linked (Open) Data in Semantic Web
Fig.: "Classical" Web vs. Semantic Web
Introduction
DBpedia in/and The Semantic Web
Fig.: Concepts and Labels
⇒ A truly multilingual World Wide Web? ...Most labels are currently only in
English. [Gómez-Pérez et al., 2013]
Introduction
Zoom
Fig.: Linking Open Data cloud diagram 2014, by Max Schmachtenberg, Christian Bizer, Anja
Jentzsch and Richard Cyganiak: http://lod-cloud.net/
⇒ In DBpedia? Same problem: only one label in five is in French. (1)
(1) See Data Set Statistics in January 2015: http://wiki.dbpedia.org/Datasets/DatasetStatistics
Introduction
Wikipedia and Goals
Where does that come from?
An rdfs:label property in a given language in DBpedia = the title of the Wikipedia
article that is inter-language linked to the (English) Wikipedia article associated
with the DBpedia concept.
⇒ The root problem comes from Wikipedia.
20% → 100%?
Identifying translations (in French) for (English) Wikipedia article titles.
⇒ Investigating two projective approaches and their parameters, using Wikipedia
and its structure as a comparable corpus.
Approaches
Standard Approach (Stand) - Presentation
Assumption:
If two words co-occur more often than expected by chance in a source language,
then their translations must co-occur more often than expected by chance in the
target language. [Rapp, 1995]
Fig.: Steps of the Standard Approach in a nutshell.
Approaches
Standard Approach (Stand) - Presentation
Parameters:
Contextual Window Size: 2, 6, 14, 30.
Association Measure:
Discontinuous Odds-Ratio (ord) [Evert, 2005, p. 86]
Log-Likelihood Ratio (llr) [Dunning, 1993]
Bilingual Seed Lexicon: one large lexicon comprising 116,354 word pairs
populated from several available resources (in-house, Ergane, Freelang).
Similarity Measure: Cosine Similarity (as in [Laroche and Langlais, 2010]).
Note:
The co-occurring words are extracted from all the source documents of the
comparable corpus in which the term to translate appears. (A sketch of the whole
pipeline follows.)
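To make the pipeline concrete, here is a minimal sketch of the standard approach under simplifying assumptions: naive whitespace tokenization, a smoothed PMI as a stand-in for the ord/llr association measures, and plain dictionaries for the vectors. The function names and data layout are illustrative, not the authors' implementation.

```python
import math
from collections import Counter

def context_vector(docs, term, window=3):
    """Co-occurrence counts of the words appearing within +/- `window` tokens
    of `term`, over every document of the corpus that contains the term."""
    cooc, term_occ = Counter(), 0
    for doc in docs:
        tokens = doc.lower().split()
        for i, tok in enumerate(tokens):
            if tok == term:
                term_occ += 1
                lo = max(0, i - window)
                cooc.update(t for j, t in enumerate(tokens[lo:i + window + 1], lo) if j != i)
    return cooc, term_occ

def association_scores(cooc, term_occ, word_freq, n_tokens):
    """Stand-in association measure (a smoothed PMI); the slides use the
    discontinuous odds-ratio (ord) or the log-likelihood ratio (llr) instead."""
    return {w: math.log((o + 0.5) / (term_occ * word_freq[w] / n_tokens + 0.5))
            for w, o in cooc.items()}

def project(src_vec, seed_lexicon):
    """Translate every context word through the seed lexicon; words the
    lexicon does not cover are simply dropped."""
    return {seed_lexicon[w]: s for w, s in src_vec.items() if w in seed_lexicon}

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def translate(term, src_docs, tgt_docs, tgt_candidates, seed_lexicon, k=20):
    """Rank target-language candidates by comparing their context vectors
    with the projected context vector of the source term."""
    src_tok = [t for d in src_docs for t in d.lower().split()]
    tgt_tok = [t for d in tgt_docs for t in d.lower().split()]
    src_freq, tgt_freq = Counter(src_tok), Counter(tgt_tok)

    cooc, occ = context_vector(src_docs, term)
    projected = project(association_scores(cooc, occ, src_freq, len(src_tok)),
                        seed_lexicon)
    scored = []
    for cand in tgt_candidates:
        c_cooc, c_occ = context_vector(tgt_docs, cand)
        c_vec = association_scores(c_cooc, c_occ, tgt_freq, len(tgt_tok))
        scored.append((cosine(projected, c_vec), cand))
    return sorted(scored, reverse=True)[:k]
```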
Approaches
Neighbourhood variants (lki, lko, cmp and rnd) - Presentation
Idea: Translating Wikipedia titles?
Only consider the occurrences of the term in the article whose title we seek to
translate, thereby avoiding populating the context vector with different senses of
the word to translate.
Idea: Too few occurrences?
Consider neighbourhood functions: each returns a set of Wikipedia articles related
to the one under consideration for translation.
4 functions (and many combinations of them), sketched in code after this list:
lki(a) returns the set of articles that have a link pointing to the article a under
consideration.
lko(a) returns the set of articles to which a points.
cmp(a) returns the set of articles that are the most similar to a (using the
MoreLikeThis method of the Lucene search engine).
rnd() returns random articles (as a sanity check).
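A minimal sketch of the four functions, assuming the link structure is available as an in-memory dictionary (`links_out`, mapping an article title to the titles it links to) and that a Lucene-style `more_like_this` callable is supplied for cmp; these names are assumptions, not the authors' code.

```python
import random
from typing import Callable, Dict, List, Set

def lki(a: str, links_out: Dict[str, Set[str]], limit: int = 1000) -> List[str]:
    """Articles containing a link that points to article `a` (incoming links)."""
    return [b for b, outs in links_out.items() if a in outs][:limit]

def lko(a: str, links_out: Dict[str, Set[str]], limit: int = 1000) -> List[str]:
    """Articles that `a` points to (outgoing links)."""
    return list(links_out.get(a, set()))[:limit]

def cmp(a: str, more_like_this: Callable[[str, int], List[str]], limit: int = 1000) -> List[str]:
    """Articles most similar to `a`, delegated to a MoreLikeThis-style query."""
    return more_like_this(a, limit)

def rnd(all_articles: List[str], limit: int = 1000, seed: int = 0) -> List[str]:
    """Random articles, used as a sanity check."""
    rng = random.Random(seed)
    return rng.sample(all_articles, min(limit, len(all_articles)))
```

The context vector of the term to translate is then built from the article itself plus the articles returned by the chosen neighbourhood function, instead of from the whole corpus.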
Approaches
Neighbourhood variants (lki, lko, cmp and rnd) - Presentation
One New Parameter:
Size of the returned set of articles: 10, 100 or more?
Fig.: Neighbourhood functions with the article "Alternating series"
Approaches
Explicit Semantic Analysis (Esa-B) - Presentation
Approach described in [Bouamor, 2014].
An adaptation of the Explicit Semantic Analysis approach described in
[Gabrilovich and Markovitch, 2007].
Word vectors → Document vectors.
Parameter? Maximum document-vector size (to limit semantic drift).
Bilingual lexicon → Wikipedia inter-language links.
Fig.: Esa-B approach in a nutshell.
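As a rough illustration of the idea, here is a minimal sketch assuming each term is already indexed as a weighted vector of Wikipedia article titles (`en_index`, `fr_index`) and that `interlanguage` maps English article titles to their French counterparts; the names and the truncation strategy are assumptions, not the exact Esa-B implementation of [Bouamor, 2014].

```python
import math
from typing import Dict, List, Tuple

Vector = Dict[str, float]  # article title -> association weight

def esa_vector(term: str, index: Dict[str, Vector], max_size: int = 30) -> Vector:
    """Keep only the `max_size` strongest articles, to limit semantic drift."""
    vec = index.get(term, {})
    return dict(sorted(vec.items(), key=lambda kv: kv[1], reverse=True)[:max_size])

def map_vector(vec: Vector, interlanguage: Dict[str, str]) -> Vector:
    """Replace each English article by its French counterpart (inter-language
    links play the role of the bilingual lexicon); unmapped articles are dropped."""
    return {interlanguage[a]: w for a, w in vec.items() if a in interlanguage}

def cosine(u: Vector, v: Vector) -> float:
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_candidates(term: str, candidates: List[str],
                    en_index: Dict[str, Vector], fr_index: Dict[str, Vector],
                    interlanguage: Dict[str, str], k: int = 20) -> List[Tuple[float, str]]:
    """Score French candidates against the mapped document vector of the term."""
    src = map_vector(esa_vector(term, en_index), interlanguage)
    scored = [(cosine(src, esa_vector(c, fr_index)), c) for c in candidates]
    return sorted(scored, reverse=True)[:k]
```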
Experimental Protocol
Reference List
= a list of English source terms and their reference (French) translations.
Randomly sampled pairs of Wikipedia articles that are inter-language linked
(good translations [Hovy et al., 2013]).
Named-entity filtering with the bilingual lexicon (see Stand) on the English side.
Unigram and special-character filters on both sides.
Random entries drawn from 4 frequency classes (see the table and binning sketch below).
Class         Entries          Example (EN → FR)
[1-25]        74 (8.5%)        myringotomy → paracentèse
[26-100]      267 (30.7%)      syllabification → césure
[101-1000]    259 (29.8%)      numerology → numérologie
[1001+]       269 (30.9%)      entertainment → divertissement
Total         869 (100%)
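For illustration, a tiny sketch of the frequency binning behind the table, assuming the English-side frequency of each term is already known; the function name is hypothetical.

```python
import bisect

def frequency_class(count: int) -> str:
    """Bin a source term into one of the four frequency classes above."""
    bounds = [25, 100, 1000]  # upper bounds of the first three classes
    labels = ["[1-25]", "[26-100]", "[101-1000]", "[1001+]"]
    return labels[bisect.bisect_left(bounds, count)]

assert frequency_class(25) == "[1-25]"
assert frequency_class(26) == "[26-100]"
assert frequency_class(1001) == "[1001+]"
```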
Experimental Protocol
Evaluation
Each approach returns a ranked list of (at most) 20 candidates for each source
English term.
P@1: % of source terms for which the best-ranked candidate is the reference.
MAP@20: Mean Average Precision at rank 20 [Manning et al., 2008].
Example (three source terms A, C, E whose references A’, C’, E’ appear at ranks 1, 3
and 5 of the returned lists):
A → [A’, B’, C’, D’, E’] : precision 1
C → [A’, B’, C’, D’, E’] : precision 1/3
E → [A’, B’, C’, D’, E’] : precision 1/5
MAP@5 = (1 + 1/3 + 1/5)/3 ≈ 0.511
P@1 = 1/3 ≈ 0.333
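A minimal sketch of the two metrics as used here: since each source term has a single reference translation, its average precision reduces to the reciprocal of the rank at which the reference appears (0 if it is absent from the top k). The small test at the end reproduces the example above.

```python
def p_at_1(ranked_lists, references):
    """Fraction of source terms whose top-ranked candidate is the reference."""
    hits = sum(1 for term, cands in ranked_lists.items()
               if cands and cands[0] == references[term])
    return hits / len(ranked_lists)

def map_at_k(ranked_lists, references, k=20):
    """Mean Average Precision at rank k; with one reference per term, the
    average precision of a term is 1/rank of the reference in the top k."""
    total = 0.0
    for term, cands in ranked_lists.items():
        top = cands[:k]
        if references[term] in top:
            total += 1.0 / (top.index(references[term]) + 1)
    return total / len(ranked_lists)

# Reproduces the slide example: references of A, C, E sit at ranks 1, 3 and 5.
ranked = {"A": ["A'", "B'", "C'", "D'", "E'"],
          "C": ["A'", "B'", "C'", "D'", "E'"],
          "E": ["A'", "B'", "C'", "D'", "E'"]}
refs = {"A": "A'", "C": "C'", "E": "E'"}
print(round(map_at_k(ranked, refs, k=5), 3))  # 0.511
print(round(p_at_1(ranked, refs), 3))         # 0.333
```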
Results
Stand

              [1-25]          [26-100]        [101-1000]      [1001+]         [Total]
              P@1    MAP      P@1    MAP      P@1    MAP      P@1    MAP      P@1    MAP
Stand (llr)   0.000  0.003    0.011  0.019    0.019  0.023    0.134  0.154    0.051  0.061
Stand (ord)   0.027  0.057    0.217  0.281    0.425  0.474    0.461  0.506    0.338  0.389

Observations:
From previous experiments, the optimal window size is 6 (3 words on each side, no function words).
ord for the win: about six times higher performance than llr, on average ([Laroche and Langlais, 2010]).
Strong correlation between frequency and performance (a well-known fact, [Prochasson and Fung, 2011]).
Results
Stand - 2
Observations:
Rare words are better ranked in the ord context vector.
Better discriminative power.
Deserves further investigation.
ord                         llr
myringoplasty (16.32)       tube (147.6)
myringa (16.14)             laser (44.90)
laryngotracheal (15.13)     procedure (40.83)
tympanostomy (14.60)        usually (31.86)
laryngomalacia (14.19)      knife (30.13)
patency (13.43)             myringoplasty (29.85)
equalized (11.75)           ear (28.19)
grommet (11.58)             laryngotracheal (27.45)
obstructive (11.09)         tympanostomy (26.39)
incision (10.37)            cold (24.09)
Fig.: Top words in the context vectors computed with ord and llr for the source term
Myringotomy (myringoplasty, laryngotracheal and tympanostomy appear in both).
Results
Neighbourhood variants
              [1-25]          [26-100]        [101-1000]      [1001+]         [Total]
              P@1    MAP      P@1    MAP      P@1    MAP      P@1    MAP      P@1    MAP
lki-1000      0.000  0.002    0.064  0.080    0.124  0.156    0.126  0.155    0.096  0.119
lko-1000      0.000  0.000    0.016  0.022    0.089  0.119    0.033  0.046    0.044  0.058
cmp-1000      0.016  0.022    0.072  0.099    0.131  0.170    0.093  0.120    0.092  0.121
rnd-1000      0.000  0.000    0.000  0.000    0.000  0.000    0.000  0.000    0.000  0.000
Observations:
Meta-parameter: more is better (1000 everywhere).
From a practical point of view: very disappointing, a significant drop in performance.
Asymmetry between source and target context vectors ⇒ context vectors would have to be
computed online for each term. Left as future work.
At least, the variants outperform random sampling.
Results
Esa-B
              [1-25]          [26-100]        [101-1000]      [1001+]         [Total]
              P@1    MAP      P@1    MAP      P@1    MAP      P@1    MAP      P@1    MAP
Stand (ord)   0.027  0.057    0.217  0.281    0.425  0.474    0.461  0.506    0.338  0.389
Esa-B         0.014  0.080    0.056  0.122    0.205  0.300    0.424  0.513    0.211  0.293
Observations:
Maximum document-vector size = 30 in our case (default: 100).
Contrary to [Bouamor et al., 2013], Esa-B under-performs Stand with ord.
The authors filter on nouns, verbs and adjectives, which might interfere with the
previous observations on rare words.
(URLs, spelling mistakes, etc. can be more discriminative; anecdotal.)
Results
All Results
              [1-25]          [26-100]        [101-1000]      [1001+]         [Total]
              P@1    MAP      P@1    MAP      P@1    MAP      P@1    MAP      P@1    MAP
Stand (ord)   0.027  0.057    0.217  0.281    0.425  0.474    0.461  0.506    0.338  0.389
Esa-B         0.014  0.080    0.056  0.122    0.205  0.300    0.424  0.513    0.211  0.293
cmp-1000      0.016  0.022    0.072  0.099    0.131  0.170    0.093  0.120    0.092  0.121
lki-1000      0.000  0.002    0.064  0.080    0.124  0.156    0.126  0.155    0.096  0.119
Stand (llr)   0.000  0.003    0.011  0.019    0.019  0.023    0.134  0.154    0.051  0.061
lko-1000      0.000  0.000    0.016  0.022    0.089  0.119    0.033  0.046    0.044  0.058
rnd-1000      0.000  0.000    0.000  0.000    0.000  0.000    0.000  0.000    0.000  0.000
Tab.: Precision at rank 1 and MAP-20 of some of the variants we tested. Each neighbourhood
function was asked to return (at most) 1000 English articles. The Esa-B variant uses context
vectors of (at most) 30 titles.
Results
Analysis
Combination? Considering the 528 terms that appear over a hundred times...
Stand (ord) = 362 successes (top-20)
Esa-B = 351 successes (top-20)
With an oracle telling us which variant to trust?
We could potentially translate 431 terms (81.6%) correctly.
Failures for the remaining 97 terms?
The English term appears in the French Wikipedia and is proposed by the Stand
approach (e.g. barber; oracle translation: coiffeur).
Stand proposes a morphological variant of the reference (e.g. coudre, a verbal
form, in place of the noun couture for sewing).
The reference translation is wrong or too specific (e.g. the reference translation
of veneration is dulie, while the first translation produced by Stand is
vénération).
Most frequent case: the thesaurus effect of both approaches, where terms related
to the source one are proposed.
Finally, sometimes, it is just... wrong (e.g. noun translated as spora).
Conclusion
Discussion
What have we learned?
Stand performs as well as or better than Esa-B, depending on the parameters.
Combining both might improve results for high-frequency terms.
Well-known bias on infrequent terms = need for dedicated methods ⇒ direct future
work.
Many meta-parameters across the approaches, requiring costly calibration experiments
⇒ the code and resources used in this work will be available at this URL:
http://rali.iro.umontreal.ca/rali/?q=fr/Ressources (WIP)
Bibliography
Bouamor, D. (2014).
Constitution de ressources linguistiques multilingues à partir de corpus de textes
parallèles et comparables.
PhD thesis, Université Paris Sud - Paris XI.
Bouamor, D., Popescu, A., Semmar, N., and Zweigenbaum, P. (2013).
Building specialized bilingual lexicons using large scale background knowledge.
In EMNLP, pages 479–489.
Dunning, T. (1993).
Accurate Methods for the Statistics of Surprise and Coincidence.
Comput. Linguist., 19(1):61–74.
Evert, S. (2005).
The statistics of word cooccurrences.
PhD thesis, Stuttgart University.
Gabrilovich, E. and Markovitch, S. (2007).
Computing semantic relatedness using wikipedia-based explicit semantic analysis.
In Proceedings of the 20th International Joint Conference on Artificial
Intelligence, IJCAI’07, pages 1606–1611, San Francisco, CA, USA. Morgan
Kaufmann Publishers Inc.
Gómez-Pérez, A., Vila-Suero, D., Montiel-Ponsoda, E., Gracia, J., and
Aguado-de Cea, G. (2013).
Guidelines for multilingual linked data.
In Proceedings of the 3rd International Conference on Web Intelligence, Mining
and Semantics, WIMS ’13, pages 3:1–3:12, New York, NY, USA. ACM.
Hovy, E., Navigli, R., and Ponzetto, S. P. (2013).
Collaboratively built semi-structured content and artificial intelligence: The story
so far.
Artificial Intelligence, 194:2–27.
Laroche, A. and Langlais, P. (2010).
Revisiting context-based projection methods for term-translation spotting in
comparable corpora.
In Proceedings of the 23rd International Conference on Computational
Linguistics, COLING ’10, pages 617–625, Stroudsburg, PA, USA. Association for
Computational Linguistics.
Manning, C. D., Raghavan, P., and Schütze, H. (2008).
Introduction to Information Retrieval.
Cambridge University Press, New York, NY, USA.
Prochasson, E. and Fung, P. (2011).
Rare word translation extraction from aligned comparable documents.
In Proceedings of the 49th Annual Meeting of the Association for Computational
Linguistics: Human Language Technologies - Volume 1, HLT ’11, pages
1327–1335, Stroudsburg, PA, USA. Association for Computational Linguistics.
Rapp, R. (1995).
Identifying word translations in non-parallel texts.
In Proceedings of the 33rd Annual Meeting on Association for Computational
Linguistics, ACL ’95, pages 320–322, Stroudsburg, PA, USA. Association for
Computational Linguistics.
Questions?