Slides - Université de Montréal
Projective Methods for Mining Missing Translations in DBpedia
Laurent Jakubina, RALI - DIRO, Université de Montréal, jakubinl@iro.umontreal.ca
Philippe Langlais, RALI - DIRO, Université de Montréal, felipe@iro.umontreal.ca
BUCC Workshop 2015

Introduction - Linked (Open) Data in the Semantic Web
Fig.: "Classical" Web vs. Semantic Web

Introduction - DBpedia in/and the Semantic Web
Fig.: Concepts and Labels
=⇒ A truly multilingual World Wide Web? Most labels are currently only in English. [Gómez-Pérez et al., 2013]

Introduction - Zoom
Fig.: Linking Open Data cloud diagram 2014, by Max Schmachtenberg, Christian Bizer, Anja Jentzsch and Richard Cyganiak: http://lod-cloud.net/
=⇒ And in DBpedia? Same problem: only one label in five is in French (see Data Set Statistics, January 2015: http://wiki.dbpedia.org/Datasets/DatasetStatistics).

Introduction - Wikipedia and Goals
Where does that come from? An rdfs:label property in a given language in DBpedia is the title of the Wikipedia article that is inter-language linked to the (English) Wikipedia article associated with the DBpedia concept.
=⇒ The root problem comes from Wikipedia.
20% → 100%? Identifying (French) translations for (English) Wikipedia article titles.
=⇒ We investigate two projective approaches and their parameters, using Wikipedia and its structure as a comparable corpus.

Approaches - Standard Approach (Stand) - Presentation
Assumption: if two words co-occur more often than expected by chance in a source language, then their translations must co-occur more often than expected by chance in the target language.
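As a toy illustration of this assumption at work, the projection step of the standard approach might be sketched as follows. Everything below is invented for the example: the association weights are assumed to have been computed already, and the three-entry seed lexicon stands in for the 116k-pair lexicon used in the talk.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts word -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = sqrt(sum(w * w for w in u.values()))
    nv = sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def project(src_vec, lexicon):
    """Map a source-language context vector into the target language through
    a bilingual seed lexicon (dict en -> fr); words missing from the lexicon
    are simply dropped."""
    out = {}
    for word, weight in src_vec.items():
        if word in lexicon:
            out[lexicon[word]] = out.get(lexicon[word], 0.0) + weight
    return out

def rank_candidates(src_vec, lexicon, trg_vecs, k=20):
    """Return the k target terms whose context vectors are closest (cosine)
    to the projected source context vector."""
    proj = project(src_vec, lexicon)
    scored = [(cosine(proj, v), t) for t, v in trg_vecs.items()]
    scored.sort(reverse=True)
    return [t for _, t in scored[:k]]

# Toy data (hypothetical weights): context vector of "myringotomy".
lexicon = {"ear": "oreille", "incision": "incision", "tube": "tube"}
src = {"ear": 3.0, "incision": 2.0, "tube": 1.0}
trg = {
    "paracentèse": {"oreille": 2.5, "incision": 2.0},
    "divertissement": {"film": 4.0, "tube": 0.5},
}
print(rank_candidates(src, lexicon, trg))  # "paracentèse" should rank first
```

The ranking step is where the seed lexicon matters: unlexicalized context words carry no signal across languages, which is one reason coverage of the lexicon is a parameter worth tracking.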
This assumption goes back to [Rapp, 1995].
Fig.: Steps of the Standard Approach in a nutshell.

Approaches - Standard Approach (Stand) - Parameters
Contextual window size: 2, 6, 14, 30.
Association measure: discontinuous odds ratio (ord) [Evert, 2005, p. 86] or log-likelihood ratio (llr) [Dunning, 1993].
Bilingual seed lexicon: one large lexicon comprising 116,354 word pairs, populated from several available resources (in-house, Ergane, Freelang).
Similarity measure: cosine similarity (as in [Laroche and Langlais, 2010]).
Note: the co-occurring words are extracted from all the source documents of the comparable corpus in which the term to translate appears.

Approaches - Neighbourhood variants (lki, lko, cmp and rnd) - Presentation
Idea: when translating Wikipedia titles, consider only the occurrences of the term in the article whose title we seek to translate, and avoid polluting the context vector with the different senses of the word to translate.
Idea: too few occurrences? Use neighbourhood functions, each returning a set of Wikipedia articles related to the one under consideration for translation.
4 functions (and many combinations of them):
lki(a) returns the set of articles that have a link pointing to the article a under consideration.
lko(a) returns the set of articles to which a points.
cmp(a) returns the set of articles most similar to a (using the MoreLikeThis method of the Lucene search engine).
rnd() returns random articles (as a sanity check).
One new parameter: the size of the returned set of articles: 10, 100 or more?
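The link-based neighbourhood functions can be sketched over a toy link graph; the article titles and links below are invented, and cmp (which relies on Lucene's MoreLikeThis) is omitted since it needs a search index.

```python
import random

# Toy directed link graph: article -> set of articles it links to.
# Titles are made up for illustration.
links = {
    "Alternating series": {"Series (mathematics)", "Convergence"},
    "Series (mathematics)": {"Convergence"},
    "Convergence": {"Alternating series"},
    "Calculus": {"Alternating series", "Convergence"},
}

def lko(article, n=1000):
    """Articles the given article links out to (at most n of them)."""
    return set(list(links.get(article, set()))[:n])

def lki(article, n=1000):
    """Articles that link in to the given article (at most n of them)."""
    sources = {a for a, outs in links.items() if article in outs}
    return set(list(sources)[:n])

def rnd(n=1000, seed=0):
    """Random articles, used as a sanity check."""
    rng = random.Random(seed)
    pool = list(links)
    return set(rng.sample(pool, min(n, len(pool))))

print(sorted(lko("Alternating series")))  # outgoing links
print(sorted(lki("Alternating series")))  # incoming links
```

The size cut-off (10, 100, 1000, ...) is the meta-parameter studied in the results below; in a real setting the graph would of course come from the Wikipedia link dumps rather than a dictionary literal.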
Fig.: Neighbourhood functions for the article "Alternating series".

Approaches - Explicit Semantic Analysis (Esa-B) - Presentation
Approach described in [Bouamor, 2014], an adaptation of the Explicit Semantic Analysis approach of [Gabrilovich and Markovitch, 2007].
Word vectors → document vectors.
Parameters? The maximum size of the document vectors (semantic drift); the bilingual lexicon → Wikipedia inter-language links.
Fig.: Esa-B approach in a nutshell.

Experimental Protocol
Reference list = a list of English source terms and their reference (French) translations:
Randomly sampled pairs of Wikipedia articles that are inter-language linked (good translations [Hovy et al., 2013]).
Named entities filtered out using the bilingual lexicon (see Stand) on the English side.
Unigram and special-character filters on both sides.
Random entries across 4 frequency classes:

Class        Entries        Example
[1-25]       74 (8.5%)      myringotomy / paracentèse
[26-100]     267 (30.7%)    syllabification / césure
[101-1000]   259 (29.8%)    numerology / numérologie
[1001+]      269 (30.9%)    entertainment / divertissement
Total        869 (100%)

Experimental Protocol - Evaluation
Each approach returns a ranked list of (at most) 20 candidates for each source English term.
P@1: percentage of terms for which the best-ranked candidate is the reference.
MAP@20: Mean Average Precision at rank 20 [Manning et al., 2008].
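Assuming, as in this protocol, exactly one reference translation per source term, the two metrics reduce to a simple reciprocal-rank computation; a minimal sketch:

```python
def average_precision(candidates, reference, k=20):
    """AP for a single source term: with one reference translation, this is
    1/rank if the reference appears in the top-k candidates, else 0."""
    for rank, cand in enumerate(candidates[:k], start=1):
        if cand == reference:
            return 1.0 / rank
    return 0.0

def evaluate(runs, k=20):
    """runs: list of (ranked candidate list, reference translation) pairs.
    Returns (P@1, MAP@k) averaged over all source terms."""
    p1 = sum(1 for cands, ref in runs if cands and cands[0] == ref) / len(runs)
    mapk = sum(average_precision(c, r, k) for c, r in runs) / len(runs)
    return p1, mapk

# The three-term example from the next slide: references A', C' and E'
# are found at ranks 1, 3 and 5 of the same candidate list.
runs = [(["A'", "B'", "C'", "D'", "E'"], ref) for ref in ("A'", "C'", "E'")]
p1, map5 = evaluate(runs, k=5)
print(round(p1, 3), round(map5, 3))  # 0.333 0.511
```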
Example: three source terms whose references A, C and E appear at ranks 1, 3 and 5 of their candidate lists [A', B', C', D', E']:
A → 1, C → 1/3, E → 1/5
MAP@5 = (1 + 1/3 + 1/5) / 3 = 0.511
P@1 = 1/3 = 0.333

Results - Stand
             [1-25]       [26-100]     [101-1000]   [1001+]      [Total]
             P@1   MAP    P@1   MAP    P@1   MAP    P@1   MAP    P@1   MAP
Stand (llr)  0.000 0.003  0.011 0.019  0.019 0.023  0.134 0.154  0.051 0.061
Stand (ord)  0.027 0.057  0.217 0.281  0.425 0.474  0.461 0.506  0.338 0.389

Observations:
In line with previous experiments, the optimal window size is 6 (3 words on each side, no function words).
ord for the win: six times higher performance, on average ([Laroche and Langlais, 2010]).
Strong correlation between frequency and performance (a well-known fact, [Prochasson and Fung, 2011]).

Results - Stand (2)
Observations: rare words are ranked higher in the ord context vector; better discriminative power; deserves further investigation.

ord: myringoplasty (16.32), myringa (16.14), laryngotracheal (15.13), tympanostomy (14.60), laryngomalacia (14.19), patency (13.43), equalized (11.75), grommet (11.58), obstructive (11.09), incision (10.37)
llr: tube (147.6), laser (44.90), procedure (40.83), usually (31.86), knife (30.13), myringoplasty (29.85), ear (28.19), laryngotracheal (27.45), tympanostomy (26.39), cold (24.09)
Fig.: Top words in the context vectors computed with ord and llr for the source term myringotomy. Words in bold appear in both context vectors.

Results - Neighbourhood variants
             [1-25]       [26-100]     [101-1000]   [1001+]      [Total]
             P@1   MAP    P@1   MAP    P@1   MAP    P@1   MAP    P@1   MAP
lki-1000     0.000 0.002  0.064 0.080  0.124 0.156  0.126 0.155  0.096 0.119
lko-1000     0.000 0.000  0.016 0.022  0.089 0.119  0.033 0.046  0.044 0.058
cmp-1000     0.016 0.022  0.072 0.099  0.131 0.170  0.093 0.120  0.092 0.121
rnd-1000     0.000 0.000  0.000 0.000  0.000 0.000  0.000 0.000  0.000 0.000

Observations:
For the set-size meta-parameter, more is better.
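The two association measures contrasted in the myringotomy example earlier can be sketched from 2x2 co-occurrence contingency tables, following the definitions cited above ([Dunning, 1993]; [Evert, 2005, p. 86]); the counts in the usage example are invented.

```python
from math import log

def llr(o11, o12, o21, o22):
    """Log-likelihood ratio [Dunning, 1993] from a 2x2 contingency table:
    o11 = cooccurrences of (x, y), o12 = x without y, o21 = y without x,
    o22 = neither."""
    n = o11 + o12 + o21 + o22
    def term(o, e):
        return o * log(o / e) if o > 0 else 0.0
    e11 = (o11 + o12) * (o11 + o21) / n
    e12 = (o11 + o12) * (o12 + o22) / n
    e21 = (o21 + o22) * (o11 + o21) / n
    e22 = (o21 + o22) * (o12 + o22) / n
    return 2.0 * (term(o11, e11) + term(o12, e12) + term(o21, e21) + term(o22, e22))

def odds_ratio_disc(o11, o12, o21, o22):
    """Discontinuous (log) odds ratio [Evert, 2005, p. 86]: the 0.5 terms
    keep the measure defined when a cell is zero."""
    return log((o11 + 0.5) * (o22 + 0.5) / ((o12 + 0.5) * (o21 + 0.5)))

# Toy counts: a rare but exclusive pair scores a very high odds ratio
# relative to a frequent, loosely associated one.
print(odds_ratio_disc(5, 0, 0, 10000))
print(odds_ratio_disc(1, 4, 4, 10000))
```

This is consistent with the observation above: ord rewards exclusivity of co-occurrence, so rare, highly specific context words float to the top, while llr scores also grow with raw frequency.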
With 1000 articles everywhere, the practical picture is nonetheless disappointing: a significant drop in performance.
The asymmetry between source and target context vectors would require computing a context vector online for each term; left as future work.
At least the neighbourhood variants outperform random sampling.

Results - Esa-B
             [1-25]       [26-100]     [101-1000]   [1001+]      [Total]
             P@1   MAP    P@1   MAP    P@1   MAP    P@1   MAP    P@1   MAP
Stand (ord)  0.027 0.057  0.217 0.281  0.425 0.474  0.461 0.506  0.338 0.389
Esa-B        0.014 0.080  0.056 0.122  0.205 0.300  0.424 0.513  0.211 0.293

Observations:
Document-vector maximum size = 30 in our case (default: 100).
Contrary to [Bouamor et al., 2013], Esa-B under-performs Stand with ord.
The authors filter on nouns, verbs and adjectives, which might interfere with the previous observations on rare words (URLs, spelling mistakes, etc., being more discriminative; anecdotal).

Results - All Results
             [1-25]       [26-100]     [101-1000]   [1001+]      [Total]
             P@1   MAP    P@1   MAP    P@1   MAP    P@1   MAP    P@1   MAP
Stand (ord)  0.027 0.057  0.217 0.281  0.425 0.474  0.461 0.506  0.338 0.389
Esa-B        0.014 0.080  0.056 0.122  0.205 0.300  0.424 0.513  0.211 0.293
cmp-1000     0.016 0.022  0.072 0.099  0.131 0.170  0.093 0.120  0.092 0.121
lki-1000     0.000 0.002  0.064 0.080  0.124 0.156  0.126 0.155  0.096 0.119
Stand (llr)  0.000 0.003  0.011 0.019  0.019 0.023  0.134 0.154  0.051 0.061
lko-1000     0.000 0.000  0.016 0.022  0.089 0.119  0.033 0.046  0.044 0.058
rnd-1000     0.000 0.000  0.000 0.000  0.000 0.000  0.000 0.000  0.000 0.000
Tab.: Precision at rank 1 and MAP-20 of some variants we tested. Each neighbourhood function was asked to return (at most) 1000 English articles. The Esa-B variant makes use of context vectors of (at most) 30 titles.

Results - Analysis
Combination? Considering the 528 terms that appear more than a hundred times:
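An oracle combination of two systems, trusting per term whichever ranks the reference in its top 20, can be sketched as follows; the candidate lists are invented for illustration.

```python
def oracle_success(ref, list_a, list_b, k=20):
    """The oracle trusts whichever system has the reference in its top-k."""
    return ref in list_a[:k] or ref in list_b[:k]

def oracle_coverage(terms):
    """terms: list of (reference, candidates_a, candidates_b) triples.
    Fraction of terms the oracle combination translates correctly."""
    hits = sum(oracle_success(r, a, b) for r, a, b in terms)
    return hits / len(terms)

# Toy data: system A finds the first reference, system B the second,
# and neither finds the third.
terms = [
    ("oreille", ["oreille", "tube"], ["tube"]),
    ("césure", ["syllabe"], ["césure"]),
    ("dulie", ["vénération"], ["vénération"]),
]
print(oracle_coverage(terms))  # 2 of 3 terms covered
```

The oracle only measures complementarity of the two candidate lists; turning it into a real combination would require a selection criterion, which the slide leaves open.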
Stand (ord): 362 successes (top-20); Esa-B: 351 successes (top-20).
With an oracle telling us which variant to trust, we could potentially translate 431 terms correctly (81.6%).
Failures on the remaining 97 terms:
The English term appears in the French Wikipedia and is proposed by the Stand approach (e.g. barber, whose reference translation is coiffeur).
Stand proposes morphological variants of the reference (e.g. the verb form coudre in place of the noun couture for sewing).
Wrong or overly specific reference translations (e.g. the reference translation of veneration is dulie, while the first translation produced by Stand is vénération).
Most frequent case: the thesaurus effect of both approaches, where terms related to the source term are proposed.
Finally, sometimes it is just... wrong (e.g. a noun translated as spora).

Conclusion - Discussion
What have we learned?
Stand performs as well as or better than Esa-B, depending on parameters, and combining both might improve results for high-frequency terms.
The well-known bias against infrequent terms calls for dedicated methods =⇒ direct future work.
The many meta-parameters across approaches require costly calibration experiments =⇒ the code and resources used in this work will be made available at http://rali.iro.umontreal.ca/rali/?q=fr/Ressources (work in progress).

Bibliography
Bouamor, D. (2014). Constitution de ressources linguistiques multilingues à partir de corpus de textes parallèles et comparables. PhD thesis, Université Paris Sud - Paris XI.
Bouamor, D., Popescu, A., Semmar, N., and Zweigenbaum, P. (2013). Building specialized bilingual lexicons using large scale background knowledge. In EMNLP, pages 479-489.
Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61-74.
Evert, S. (2005). The statistics of word cooccurrences.
PhD thesis, Stuttgart University.
Gabrilovich, E. and Markovitch, S. (2007). Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, IJCAI'07, pages 1606-1611, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
Gómez-Pérez, A., Vila-Suero, D., Montiel-Ponsoda, E., Gracia, J., and Aguado-de Cea, G. (2013). Guidelines for multilingual linked data. In Proceedings of the 3rd International Conference on Web Intelligence, Mining and Semantics, WIMS '13, pages 3:1-3:12, New York, NY, USA. ACM.
Hovy, E., Navigli, R., and Ponzetto, S. P. (2013). Collaboratively built semi-structured content and artificial intelligence: The story so far. Artificial Intelligence, 194:2-27.
Laroche, A. and Langlais, P. (2010). Revisiting context-based projection methods for term-translation spotting in comparable corpora. In Proceedings of the 23rd International Conference on Computational Linguistics, COLING '10, pages 617-625, Stroudsburg, PA, USA. Association for Computational Linguistics.
Manning, C. D., Raghavan, P., and Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA.
Prochasson, E. and Fung, P. (2011). Rare word translation extraction from aligned comparable documents. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT '11, pages 1327-1335, Stroudsburg, PA, USA. Association for Computational Linguistics.
Rapp, R. (1995). Identifying word translations in non-parallel texts. In Proceedings of the 33rd Annual Meeting on Association for Computational Linguistics, ACL '95, pages 320-322, Stroudsburg, PA, USA.
Association for Computational Linguistics.

Questions?