Machine Learning for Machine Translation
Transcription
Machine Learning for Machine Translation
Pushpak Bhattacharyya, CSE Dept., IIT Bombay
ISI Kolkata, Jan 6, 2014

Jan 6, 2014 ISI Kolkata, ML for MT 1

NLP Trinity

- Problem: Semantics, Parsing, Part of Speech Tagging, Morph Analysis
- Language: Marathi, French, Hindi, English
- Algorithm: HMM, CRF, MEMM

NLP Layer

Increasing complexity of processing: Morphology → POS tagging → Chunking → Parsing → Semantics Extraction → Discourse and Coreference

Motivation for MT

- MT: NLP complete; NLP: AI complete; AI: CS complete
- How will the world be different when the language barrier disappears?
- The volume of text required to be translated currently exceeds translators' capacity (demand > supply)
- Solution: automation

Taxonomy of MT systems

MT approaches:
- Knowledge based / rule based MT: Interlingua based, Transfer based
- Data driven / machine learning based: Example Based MT (EBMT), Statistical MT

Why is MT difficult?
Language divergence.

Why is MT difficult: Language Divergence

- One of the main complexities of MT: language divergence
- Languages have different ways of expressing meaning: Lexico-Semantic Divergence and Structural Divergence
- Our work on English-IL language divergence, with illustrations from Hindi (Dave, Parikh, Bhattacharyya, Journal of MT, 2002)

Languages differ in expressing thoughts: Agglutination

Finnish: "istahtaisinkohan"
English: "I wonder if I should sit down for a while"
Analysis:
- ist : "sit", verb stem
- ahta : verb derivation morpheme, "to do something for a while"
- isi : conditional affix
- n : 1st person singular suffix
- ko : question particle
- han : a particle for things like reminder (with declaratives) or "softening" (with questions and imperatives)

Language Divergence Theory: Lexico-Semantic Divergences (a few examples)

- Conflational divergence: F: vomir, E: to be sick; E: stab, H: chure se maaranaa (knife-with hit); S: Utrymningsplan, E: escape plan
- Categorial divergence (change is in POS category): "The play is on_PREP" (vs. "The play is Sunday"); "khel chal_rahaa_haai_VM" (vs. "khel ravivaar ko haai")

Language Divergence Theory: Structural Divergences

- SVO → SOV: E: "Peter plays basketball"; H: "piitar basketball kheltaa haai"
- Head swapping divergence: E: "Prime Minister of India"; H: "bhaarat ke pradhaan mantrii" (India-of Prime Minister)

Language Divergence Theory: Syntactic Divergences (a few examples)

- Constituent order divergence: E: "Singh, the PM of India, will address the nation today"; H: "bhaarat ke pradhaan mantrii, singh, …" (India-of PM, Singh, …)
- Adjunction divergence: E: "She will visit here in the summer"; H: "vah yahaa garmii meM aayegii" (she here summer-in will come)
- Preposition-stranding divergence: E: "Who do you want to go with?"; H: "kisake saath aap jaanaa chaahate ho?"
(who with …)

Vauquois Triangle

Kinds of MT Systems (point of entry from source to the target text)

Illustration of transfer, SVO → SOV

English parse: [S [NP [N John]] [VP [V eats] [NP [N bread]]]]
After transfer (SVO → SOV): [S [NP [N John]] [NP [N bread]] [VP [V eats]]]

Universality hypothesis

Universality hypothesis: at the level of "deep meaning", all texts are the "same", whatever the language.

Understanding the Analysis-Transfer-Generation over the Vauquois triangle (1/4)

H1.1: सरकार_ने चुनावो_के_बाद मुंबई में करों_के_माध्यम_से अपने राजस्व_को बढ़ाया ।
T1.1: sarkaar ne chunaawo ke baad Mumbai me karoM ke maadhyam se apne raajaswa ko badhaayaa
G1.1: Government_(ergative) elections_after Mumbai_in taxes_through its revenue_(accusative) increased
E1.1: The Government increased its revenue after the elections through taxes in Mumbai

Interlingual representation: complete disambiguation

"Washington voted Washington to power": the interlingua graph fully disambiguates the nodes, vote@past (is-a: action), Washington (is-a: place), Washington@emphasis (is-a: person), power (is-a: capability), connected by semantic relations such as goal.

Kinds of disambiguation needed for a complete and correct interlingua graph

- N: Name
- P: POS
- A: Attachment
- S: Sense
- C: Co-reference
- R: Semantic Role

Issues to handle

Sentence: I went with my friend, John, to the bank to withdraw some money but was disappointed to find it closed.

Issues (built up over the same sentence, slide by slide):
- Part of speech: "bank", noun or verb?
- NER: "John" is the name of a PERSON
- WSD: financial bank or river bank?
- Co-reference: "it" ↔ "bank"
- Subject drop: pro-drop (subject "I")

Typical NLP tools used

- POS tagger
- Stanford Named Entity Recognizer
- Stanford Dependency Parser
- XLE Dependency Parser
- Lexical resources: WordNet, Universal Word Dictionary (UW++)

System Architecture

Analysis pipeline: NER → Clause Marker → Simplifier → Simple Sentence Analyser (Stanford Dependency Parser / XLE Parser) → Feature Generation → WSD; simple-sentence encodings feed Relation Generation and Attribute Generation, combined by a Merger.

Target Sentence Generation from interlingua

Target sentence generation = Lexical Transfer (word/phrase translation) + Morphological Synthesis (word-form generation) + Syntax Planning (sequencing)

Generation Architecture

Deconversion = Transfer + Generation

Transfer Based MT: Marathi-Hindi

Indian Language to Indian Language Machine Translation (ILILMT)

- Bidirectional machine translation system
- Developed for nine Indian language pairs
- Approach: transfer based
- Modules developed using both rule-based and statistical approaches

Architecture of ILILMT System

Analysis side: Morphological Analyzer, POS Tagger, Chunker, Vibhakti Computation, Named Entity Recognizer, Word Sense Disambiguation. Transfer: Lexical Transfer, Agreement Feature transfer. Generation side: interchunk and intrachunk processing, Word Generator.

M-H MT system: Evaluation

- Subjective evaluation based on machine translation quality
- Accuracy calculated from scores given by linguists
- Score 5: correct translation; Score 4: understandable with minor errors; Score 3: understandable with major errors; Score 2: not understandable; Score 1: nonsense translation
- S5: number of score-5 sentences, S4: number of score-4 sentences, S3: number of score-3 sentences, N: total number of sentences; accuracy is computed from these

Evaluation of Marathi to Hindi MT System

Module-wise precision and recall reported for: Morph Analyzer, POS Tagger, Chunker, Vibhakti Compute, WSD, Lexical Transfer, Word Generator

Evaluation of Marathi to Hindi MT System (contd.)

- Subjective evaluation of translation quality
- Evaluated on 500 web sentences
- Accuracy calculated from scores given according to the translation quality.
Accuracy: 65.32%

Result analysis:
- Morph analyzer, POS tagger, and chunker give more than 90% precision, but the transfer, WSD, and generator modules are below 80%, which degrades MT quality
- Morph disambiguation, parsing, transfer grammar, and FW disambiguation modules are also required to improve accuracy

Statistical Machine Translation

Czech-English data

- [nesu] "I carry"
- [ponese] "He will carry"
- [nese] "He carries"
- [nesou] "They carry"
- [yedu] "I drive"
- [plavou] "They swim"

To translate …

- I will carry.
- They drive.
- He swims.
- They will drive.

Hindi-English data

- [DhotA huM] "I carry"
- [DhoegA] "He will carry"
- [DhotA hAi] "He carries"
- [Dhote hAi] "They carry"
- [chalAtA huM] "I drive"
- [tErte hEM] "They swim"

Bangla-English data

- [bai] "I carry"
- [baibe] "He will carry"
- [bay] "He carries"
- [bay] "They carry"
- [chAlAi] "I drive"
- [sAMtrAy] "They swim"

To translate … (repeated)

- I will carry.
- They drive.
- He swims.
- They will drive.

Foundation

- Data driven approach
- Goal: find the English sentence e, given a foreign language sentence f, for which p(e|f) is maximum
- Translations are generated on the basis of a statistical model
- Parameters are estimated using bilingual parallel corpora

SMT: Language Model

- To detect good English sentences
- The probability of an English sentence w1 w2 … wn can be written as
  Pr(w1 w2 … wn) = Pr(w1) · Pr(w2|w1) · … · Pr(wn|w1 w2 … wn-1)
- Here Pr(wn|w1 w2 … wn-1) is the probability that word wn follows the word string w1 w2 … wn-1: the n-gram model probability; in practice the trigram model probability is used

SMT: Translation Model

P(f|e): probability of some f given the hypothesis English translation e. How to assign the values to p(e|f)?
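Before tackling P(f|e), the language-model side above can be made concrete. Below is a toy sketch of the chain-rule probability truncated to a trigram history, with plain MLE counts on a made-up three-sentence corpus (no smoothing; real systems smooth and train on large monolingual data):

```python
from collections import defaultdict

def train_trigram(corpus):
    """Count trigrams and their bigram histories over tokenized sentences."""
    tri, big = defaultdict(int), defaultdict(int)
    for sent in corpus:
        toks = ["<s>", "<s>"] + sent + ["</s>"]
        for i in range(2, len(toks)):
            tri[(toks[i-2], toks[i-1], toks[i])] += 1
            big[(toks[i-2], toks[i-1])] += 1
    return tri, big

def sentence_prob(sent, tri, big):
    """Pr(w1..wn) approximated as the product of Pr(wi | wi-2, wi-1)."""
    toks = ["<s>", "<s>"] + sent + ["</s>"]
    p = 1.0
    for i in range(2, len(toks)):
        h = (toks[i-2], toks[i-1])
        p *= (tri[(h[0], h[1], toks[i])] / big[h]) if big[h] else 0.0
    return p

corpus = [["he", "carries"], ["they", "carry"], ["he", "will", "carry"]]
tri, big = train_trigram(corpus)
print(sentence_prob(["he", "carries"], tri, big))  # 2/3 * 1/2 * 1 = 1/3
```

In the full noisy-channel setup the decoder combines this with the translation model, choosing e to maximize Pr(e) · Pr(f|e).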
- Sentence level: sentences are infinite, so it is not possible to estimate pairs (e, f) for all sentences
- Word level: introduce a hidden variable a that represents the alignments between the individual words in the sentence pair

Alignment

If the string e = e1^l = e1 e2 … el has l words, and the string f = f1^m = f1 f2 … fm has m words, then the alignment a can be represented by a series a1^m = a1 a2 … am of m values, each between 0 and l, such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj = i, and if it is not connected to any English word then aj = 0.

Example of alignment

English: Ram went to school
Hindi: Raama paathashaalaa gayaa
Raama → Ram, paathashaalaa → school, gayaa → went, and "to" aligned to <Null>

Translation Model: Exact expression

- Choose the length of the foreign language string given e
- Choose the alignment given e and m
- Choose the identity of each foreign word given e, m, a
- Five models for estimating the parameters in the expression [2]: Model-1, Model-2, Model-3, Model-4, Model-5

Proof of Translation Model: Exact expression

Pr(f | e) = Σ_a Pr(f, a | e)                ; marginalization
Pr(f, a | e) = Σ_m Pr(f, a, m | e)          ; marginalization
Pr(f, a, m | e) = Pr(m | e) · Pr(f, a | m, e)
               = Pr(m | e) · Π_{j=1..m} Pr(f_j, a_j | a_1^{j-1}, f_1^{j-1}, m, e)
               = Pr(m | e) · Π_{j=1..m} Pr(a_j | a_1^{j-1}, f_1^{j-1}, m, e) · Pr(f_j | a_1^j, f_1^{j-1}, m, e)

m is fixed for a particular f, hence:
Pr(f, a | e) = Pr(m | e) · Π_{j=1..m} Pr(a_j | a_1^{j-1}, f_1^{j-1}, m, e) · Pr(f_j | a_1^j, f_1^{j-1}, m, e)

Alignment

Fundamental and ubiquitous:
- Spell checking
- Translation
- Transliteration
- Speech to text
- Text to speech

EM for word alignment from sentence alignment: example

English (1): three rabbits (a b);  (2): rabbits of Grenoble (b c d)
French (1): trois lapins (w x);  (2): lapins de Grenoble (x y z)

Initial probabilities (each cell denotes t(a→w), t(a→x), etc.):

        a     b     c     d
  w    1/4   1/4   1/4   1/4
  x    1/4   1/4   1/4   1/4
  y    1/4   1/4   1/4   1/4
  z    1/4   1/4   1/4   1/4

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus. For IBM Model 1 we get the following relationship:

c(w_f | w_e; f, e) = [ t(w_f | w_e) / ( t(w_f | w_e0) + … + t(w_f | w_el) ) ] × #(w_f in f) × #(w_e in e)

where c(w_f | w_e; f, e) is the fractional count of the alignment of w_f with w_e in f and e, t(w_f | w_e) is the probability of w_f being the translation of w_e, #(w_f in f) is the count of w_f in f, and #(w_e in e) is the count of w_e in e.

Example of expected count

c[a→w; (a b)↔(w x)] = [ t(a→w) / ( t(a→w) + t(a→x) ) ] × #(a in "a b") × #(w in "w x")
                    = [ (1/4) / (1/4 + 1/4) ] × 1 × 1 = 1/2

"Counts"

From (a b)↔(w x):          From (b c d)↔(x y z):
        a     b                    b     c     d
  w    1/2   1/2             x    1/3   1/3   1/3
  x    1/2   1/2             y    1/3   1/3   1/3
  (all other cells 0)        z    1/3   1/3   1/3
                             (all other cells 0)

Revised probability: example

t_revised(a→w) = (1/2) / [ (1/2 + 1/2 + 0 + 0) from (a b)↔(w x) + (0 + 0 + 0 + 0) from (b c d)↔(x y z) ]

Revised probabilities table

        a     b      c     d
  w    1/2   1/4     0     0
  x    1/2   5/12   1/3   1/3
  y     0    1/6    1/3   1/3
  z     0    1/6    1/3   1/3

"Revised counts"

From (a b)↔(w x):          From (b c d)↔(x y z):
        a     b                    b     c     d
  w    1/2   3/8             x    5/9   1/3   1/3
  x    1/2   5/8             y    2/9   1/3   1/3
  (all other cells 0)        z    2/9   1/3   1/3
                             (all other cells 0)

Re-revised probabilities table

        a     b        c     d
  w    1/2   3/16      0     0
  x    1/2   85/144   1/3   1/3
  y     0    1/9      1/3   1/3
  z     0    1/9      1/3   1/3

Continue until convergence; notice that the (b, x) binding gets progressively stronger; b = rabbits, x = lapins.

Derivation of EM based Alignment Expressions

V_E = vocabulary of language L1 (say English); V_F = vocabulary of language L2 (say Hindi)

E1: What is in a name?
F1: नाम में क्या है ? (naam meM kya hai ?)
    Gloss: name in what is ?

E2: That which we call rose, by any other name will smell as sweet.
F2: जिसे हम गुलाब कहते हैं, और भी किसी नाम से उसकी खुशबू समान मीठा होगी (Jise hum gulab kahte hai, aur bhi kisi naam se uski khushbu samaan mitha hogii)
    Gloss: that which we rose say, any other name by its smell as sweet

Vocabulary mapping

V_E: what, is, in, a, name, that, which, we, call, rose, by, any, other, will, smell, as, sweet
V_F: naam, meM, kya, hai, jise, hum, gulab, kahte, hai, aur, bhi, kisi, uski, khushbu, saman, mitha, hogii

Key Notations

- English vocabulary: V_E; French vocabulary: V_F
- Number of observations / sentence pairs: S
- The data D, consisting of S observations, looks like:
  e11, e12, …, e1l1 || f11, f12, …, f1m1
  e21, e22, …, e2l2 || f21, f22, …, f2m2
  …
  eS1, eS2, …, eSlS || fS1, fS2, …, fSmS
- Number of words on the English side of the s-th sentence: l_s; on the French side: m_s
- indexE(e_sp) = index of English word e_sp in the English vocabulary/dictionary
- indexF(f_sq) = index of French word f_sq in the French vocabulary/dictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Hidden variables and parameters

Hidden variables (Z): total number of hidden variables = Σ_{s=1..S} l_s · m_s, where each hidden variable is as follows:
- z_pq^s = 1 if, in the s-th sentence, the p-th English word is mapped to the q-th French word
- z_pq^s = 0 otherwise

Parameters (Θ): total number of parameters = |V_E| × |V_F|, where each parameter is as follows:
- P_ij = probability that the i-th word in the English vocabulary is mapped to the j-th word in the French vocabulary

Likelihoods

Data likelihood L(D; Θ); data log-likelihood LL(D; Θ); expected value of the data log-likelihood E(LL(D; Θ))

Constraint and Lagrangian

Σ_{j=1..|V_F|} P_ij = 1, ∀i

Differentiating w.r.t. P_ij

Final E and M steps

Combinatorial considerations

Example

All possible alignments

First fundamental requirement of SMT

Alignment requires evidence of:
- firstly, a translation pair to introduce the POSSIBILITY of a mapping
- then, another pair to establish with CERTAINTY the mapping

For the "certainty"

- We have a translation pair containing the alignment candidates and none of the other words in the translation pair, OR
- We have a translation pair containing all the words in the translation pair except the alignment candidates

Therefore…

If M valid bilingual mappings exist in a translation pair, then an additional M−1 pairs of translations will decide these mappings with certainty.

Rough estimate of data requirement

- SMT system between two languages L1 and L2
- Assume no a-priori linguistic or world knowledge, i.e., no meanings or grammatical properties of any words, phrases or sentences
- Each language has a vocabulary of 100,000 words, which can give rise to about 500,000 word forms through various morphological processes, assuming each word appears in 5 different forms on average
- For example, the word "go" appearing in "go", "going", "went" and "gone"
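The EM updates derived above specialize, for IBM Model 1, to the fractional-count tables walked through on the three rabbits / trois lapins toy corpus. A minimal sketch follows (standard Model 1 as in Brown et al.: each French word's fractional count is normalized over the English words of the pair; the slides' worked "revised counts" normalize per English word over the French side instead, so second-iteration numbers differ slightly, but both drive the rabbits↔lapins binding toward 1):

```python
from collections import defaultdict

# Toy parallel corpus of the running example: a b = three rabbits,
# w x = trois lapins, etc.
corpus = [(["three", "rabbits"], ["trois", "lapins"]),
          (["rabbits", "of", "Grenoble"], ["lapins", "de", "Grenoble"])]

# Uniform initialization t(f|e) = 1/4, as in the initial probabilities table.
t = defaultdict(float)
for es, fs in corpus:
    for e in es:
        for f in fs:
            t[(f, e)] = 0.25

for _ in range(100):                         # EM iterations
    count = defaultdict(float)               # E-step: expected counts c(f|e)
    total = defaultdict(float)
    for es, fs in corpus:
        for f in fs:
            z = sum(t[(f, e)] for e in es)   # normalize over English words
            for e in es:
                count[(f, e)] += t[(f, e)] / z
                total[e] += t[(f, e)] / z
    for (f, e) in count:                     # M-step: renormalize per e
        t[(f, e)] = count[(f, e)] / total[e]

print(round(t[("lapins", "rabbits")], 2))    # approaches 1.0
```

After one iteration this reproduces the revised probabilities table (e.g., t(lapins|rabbits) = 5/12); iterating to convergence pins rabbits↔lapins and three↔trois, while "of" and "Grenoble" share "de"/"Grenoble".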
Reasons for mapping to multiple words

- Synonymy on the target side (e.g., "to go" in English translating to "jaanaa", "gaman karnaa", "chalnaa" etc. in Hindi), a phenomenon called lexical choice or register
- Polysemy on the source side (e.g., "to go" translating to "ho jaanaa" as in "her face went red in anger" → "usakaa cheharaa gusse se laal ho gayaa")
- Syncretism ("went" translating to "gayaa", "gayii", or "gaye": masculine gender, 1st or 3rd person, singular number, past tense, non-progressive aspect, declarative mood)

Estimate of corpora requirement

- Assume that on average a sentence is 10 words long
- An additional 9 translation pairs are needed for getting at one of the 5 mappings, i.e., 10 sentences per mapping per word
- A first approximation puts the data requirement at 5 × 10 × 500,000 = 25 million parallel sentences
- The estimate is not wide of the mark: successful SMT systems like Google and Bing reportedly use hundreds of millions of translation pairs

Our work on factor based SMT

Ananthakrishnan Ramanathan, Hansraj Choudhary, Avishek Ghosh and Pushpak Bhattacharyya, "Case markers and Morphology: Addressing the crux of the fluency problem in English-Hindi SMT", ACL-IJCNLP 2009, Singapore, August 2009.
Case Marker and Morphology crucial in E-H MT

- Order-of-magnitude facelift in fluency and fidelity
- Determined by the combination of suffixes and semantic relations on the English side
- Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations to Hindi suffixes and case markers

Semantic relations + suffixes → case markers + inflections

I ate mangoes
→ I {<agt} ate {eat@past} mangoes {<obj}
→ I {<agt} mangoes {<obj.@pl} {eat@past}
→ mei_ne aam khaa_yaa

Our Approach

- Factored model (Koehn and Hoang, 2007) with the following translation factor: suffix + semantic relation → case marker/suffix
- Experiments with the following relations: dependency relations from the Stanford parser; deeper semantic roles from Universal Networking Language (UNL)

Our Factorization

Experiments

Corpus Statistics

Results: the impact of suffix and semantic factors

Results: the impact of reordering and semantic relations

Subjective evaluation: the impact of reordering and semantic relations

Impact of sentence length (F: fluency; A: adequacy; E: number of errors)

A feel for the improvement: baseline

A feel for the improvement: reorder

A feel for the improvement: semantic relation

A recent study: Pan-Indian SMT

Pan-Indian Language SMT

http://www.cfilt.iitb.ac.in/indic-translator

- SMT systems between 11 languages: 7 Indo-Aryan (Hindi, Gujarati, Bengali, Oriya, Punjabi, Marathi, Konkani), 3 Dravidian (Malayalam, Tamil, Telugu), and English
- Corpus: Indian Language Corpora Initiative (ILCI) corpus, tourism and health domains, 50,000 parallel sentences
- Evaluation with BLEU; METEOR scores also show high correlation with BLEU

SMT Systems Trained

- S1: phrase-based (PBSMT) baseline system
- S2: E-IL PBSMT with source-side reordering rules (Ramanathan et al., 2008)
- S3: E-IL PBSMT with source-side reordering rules (Patel et al., 2013)
- S4: IL-IL PBSMT with transliteration post-editing

Natural partitioning of SMT systems

Baseline PBSMT, % BLEU scores (S1):
- Clear partitioning of translation pairs by language-family pairs, based on translation accuracy
- Shared characteristics within language families make translation simpler
- Divergences among language families make translation difficult

The Challenge of Morphology

Morphological complexity vs BLEU; training corpus size vs BLEU. Vocabulary size is a proxy for morphological complexity. (Note: for Tamil, a smaller corpus was used for computing vocabulary size.)
- Translation accuracy decreases with increasing morphology
- Even if the training corpus is increased, a commensurate improvement in translation accuracy is not seen for morphologically rich languages
- Handling morphology in SMT is critical

Common divergences, shared solutions

Comparison of source reordering methods for E-IL SMT, % BLEU scores (S1, S2, S3):
- All Indian languages have similar word order
- The same structural divergence holds between English and Indian languages: SOV ↔ SVO, etc.
- Common source-side reordering rules improve E-IL translation by 11.4% (generic) and 18.6% (Hindi-adapted)
- Common divergences can be handled in a common framework in SMT systems (this idea has been used for knowledge-based MT systems, e.g., Anglabharati)

Harnessing shared characteristics

PBSMT + transliteration post-editing for E-IL SMT, % BLEU scores (S4):
- Out-of-vocabulary words are transliterated in a post-editing step
- Done using a simple transliteration scheme which harnesses the common phonetic organization of Indic scripts
- Accuracy improvements of 0.5 BLEU points with this simple approach
- Harnessing common characteristics can improve SMT output

Cognition and Translation: Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya, "Automatically Predicting Sentence Translation Difficulty", ACL 2013, Sofia, Bulgaria, 4-9 August 2013.

Scenario

Subjective notion of difficulty:
- "John ate jam": easy
- "John ate jam made from apples": moderate
- "John is in a jam": difficult?

Use behavioural data

Use behavioural data to decipher strong AI algorithms. Specifically:
- For WSD by humans, see where the eye rests for clues
- For the innate translation difficulty of sentences, see how the eye moves back and forth over the sentences

Saccades and Fixations

(Image courtesy: http://www.smashingmagazine.com/2007/10/09/30-usability-issues-to-be-aware-of/)

Eye Tracking data

- Gaze points: position of the eye-gaze on the screen
- Fixations: a long stay of the gaze on a particular object on the screen; fixations have both spatial (coordinates) and temporal (duration) properties
- Saccade: a very rapid movement of the eye between positions of rest
- Scanpath: a path connecting a series of fixations
- Regression: revisiting a previously read segment

Controlling the experimental setup for eye-tracking

Eye movement patterns are influenced by factors like age, working proficiency, environmental distractions etc.
Guidelines for eye tracking:
- Record participants' metadata (age, expertise, occupation etc.)
- Perform a fresh calibration before each new experiment
- Minimize head movement
- Introduce adequate line spacing in the text and avoid scrolling
- Carry out the experiments in a relatively low-light environment

Use of eye tracking

- Used extensively in psychology, mainly to study reading processes
- Seminal work: Just, M.A. and Carpenter, P.A. (1980). A theory of reading: from eye fixations to comprehension. Psychological Review 87(4):329-354
- Used in flight simulators for pilot training

NLP and Eye Tracking research

- Kliegl (2011): predict word frequency and pattern from eye movements
- Doherty et al. (2010): eye-tracking as an automatic machine translation evaluation technique
- Stymne et al. (2012): eye-tracking as a tool for machine translation (MT) error analysis
- Dragsted (2010): coordination of the reading and writing processes during translation
- A relatively new and open research direction

Translation Difficulty Index (TDI)

- Motivation: route sentences to translators with the right competence, as per the difficulty of translating, e.g., on a crowdsourcing platform
- TDI is a function of sentence length (l), degree of polysemy of constituent words (p), and structural complexity (s)

Contributor to TDI: length

What is more difficult to translate?
"John eats jam" vs. "John eats jam made from apples" vs. "John eats jam made from apples grown in orchards" vs. "John eats bread made from apples grown in orchards on black soil"

Contributor to TDI: polysemy

What is more difficult to translate?
"John is in a jam" vs. "John is in difficulty" ("jam" has 4 diverse senses, "difficulty" has 4 related senses)

Contributor to TDI: structural complexity

What is more difficult to translate?
"John is in a jam. His debt is huge. The lenders cause him to shy from them, every moment he sees them."
vs.
"John is in a jam, caused by his huge debt, which forces him to shy from his lenders every moment he sees them."

Measuring translation difficulty through gaze data

Translation difficulty is indicated by:
- the eye staying on segments
- the eye jumping back and forth between segments

Example: The horse raced past the garden fell
बगीचा के पास से दौडाया गया घोड़ा गिर गया (bagiichaa ke pas se doudaayaa gayaa ghodaa gir gayaa)

The translation process will complete the task till "garden", and then backtrack, revise, restart and translate in a different way.

Scanpaths: indicator of translation difficulty (Malsburg et al., 2007)

Sentence 2 is a clear case of "garden pathing", which imposes cognitive load on participants, and they prefer syntactic re-analysis.
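The back-and-forth behaviour above can be quantified directly from a fixation sequence. A toy sketch counting regressive saccades (assuming fixations have already been mapped to word indices; real gaze data first needs the fixation detection and calibration discussed earlier):

```python
def count_regressions(fix_words):
    """Count regressive saccades: transitions that land on an earlier word."""
    return sum(1 for prev, cur in zip(fix_words, fix_words[1:]) if cur < prev)

# "The horse raced past the garden fell": the reader runs up to "fell"
# (index 6), then jumps back to "raced" (index 2): the garden-path effect.
smooth_read = [0, 1, 2, 3, 4, 5, 6]
garden_path = [0, 1, 2, 3, 4, 5, 6, 2, 3, 6]
print(count_regressions(smooth_read), count_regressions(garden_path))  # 0 1
```

A higher regression count over the same sentence is one simple signal of the re-analysis behaviour that scanpaths make visible.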
Translog: a tool for recording Translation Process Data

- Translog (Carl, 2012): a Windows-based program
- Built with the purpose of recording gaze and key-stroke data during translation
- Can be used for other reading- and writing-related studies
- Using Translog, one can: create and customize translation/reading/writing experiments involving eye-tracking and keystroke logging; calibrate the eye-tracker; replay and analyze the recorded log files; manually correct errors in the gaze recording

TPR Database

- The Translation Process Research (TPR) database (Carl, 2012) contains behavioral data for translation activities
- Contains gaze and keystroke information for more than 450 experiments
- 40 different paragraphs are translated from English into 7 different languages by multiple translators, with at least 5 translators per language
- Source and target paragraphs are annotated with POS tags, lemmas, dependency relations etc.
- Easy-to-use XML data format

Experimental setup (1/2)

- Translators translate sentence by sentence, typing into a text box
- The display screen is attached to a remote eye-tracker which constantly records the eye movements of the translator

Experimental setup (2/2)

- Extracted 20 different text categories from the data
- Each piece of text contains 5-10 sentences
- For each category, at least 10 participants translated the text into different target languages

A predictive framework for TDI

Test data → features → regressor → TDI, with the training data labeled through gaze analysis. Direct annotation of TDI is fraught with subjectivity and ad-hocism; we use the translator's gaze data as annotation to prepare training data.

Annotation of TDI (1/4)

First approximation: TDI is equivalent to the "time taken to translate".
- However, the time taken to translate may not be strongly related to translation difficulty
- It is difficult to know what fraction of the total time is spent on translation-related thinking
- It is sensitive to distractions from the environment

Annotation of TDI (2/4)

Instead of the "time taken to translate", consider the "time for which translation-related processing is carried out by the brain". This is called the Translation Processing Time:

T_p = T_comp + T_gen

where T_comp and T_gen are the times for source text comprehension and target text generation respectively.

Annotation of TDI (3/4)

Humans spend time on what they see, and this "time" is correlated with the complexity of the information being processed. With f = fixation, s = saccade, F_s/S_s = fixations/saccades on the source, and F_t/S_t = fixations/saccades on the target:

T_p = Σ_{f ∈ F_s} dur(f) + Σ_{s ∈ S_s} dur(s) + Σ_{f ∈ F_t} dur(f) + Σ_{s ∈ S_t} dur(s)

Annotation of TDI (4/4)

The measured TDI score is T_p normalized over sentence length:

TDI_measured = T_p / sentence_length

Features

- Length: word count of the sentence
- Degree of polysemy: sum of the number of senses of each word in the WordNet, normalized by length
- Structural complexity: if the attachment units lie far from each other, the sentence has higher structural complexity; Lin (1996) defines it as the total length of the dependency links in the dependency structure of the sentence

TDI was measured on the TPR database for 80 sentences.

Experiment and results

- Training data of 80 examples; 10-fold cross validation
- Features computed using Princeton WordNet and the Stanford Dependency Parser
- Support Vector Regression (Joachims et al., 1999) with different kernels
- Error analysis done with the Mean Squared Error estimate
- We also computed the correlation of the predicted TDI with the measured TDI
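The three features can be computed mechanically once a sense inventory and a dependency parse are available. A sketch with stand-in inputs (the sense counts and head indices below are illustrative placeholders, not actual WordNet or Stanford-parser output); the resulting vectors would feed the Support Vector Regressor:

```python
def tdi_features(tokens, senses, heads):
    """Feature vector [length, polysemy, complexity] for TDI prediction.

    tokens : words of the sentence
    senses : dict word -> number of senses (stand-in for WordNet lookups)
    heads  : 1-based head index per token (0 = root), a stand-in for the
             output of a dependency parser
    """
    length = len(tokens)
    # Degree of polysemy: total sense count normalized by sentence length.
    polysemy = sum(senses.get(w, 1) for w in tokens) / length
    # Structural complexity (Lin, 1996): total length of dependency links.
    complexity = sum(abs((i + 1) - h) for i, h in enumerate(heads) if h != 0)
    return [length, polysemy, complexity]

# "John is in a jam", with made-up sense counts and an illustrative parse:
tokens = ["John", "is", "in", "a", "jam"]
senses = {"John": 1, "is": 13, "in": 10, "a": 1, "jam": 4}
heads  = [2, 0, 2, 5, 3]   # John->is, is=root, in->is, a->jam, jam->in
print(tdi_features(tokens, senses, heads))
```

Longer sentences, more ambiguous words, and longer dependency links all push the vector, and hence the predicted TDI, upward.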
Examples from the dataset

Summary

- Covered interlingua-based MT: the oldest approach to MT
- Covered SMT: the newest approach to MT
- Presented some recent studies in the context of Indian languages

Summary (contd.)

- SMT is the ruling paradigm
- But linguistic features can enhance performance, especially factor-based SMT with factors coming from the interlingua
- Large-scale effort sponsored by the Ministry of IT's TDIL program to create MT systems
- Parallel corpora creation is also going on in a consortium mode

Conclusions

- NLP has assumed great importance because of the large amount of text in electronic form
- Machine learning techniques are increasingly applied
- Highly relevant for India, where multilinguality is a way of life
- Machine translation is more fundamental and ubiquitous than just mapping between two languages: utterance → thought, speech-to-speech online translation

Pubs: http://www.cse.iitb.ac.in/~pb
Resources and tools: http://www.cfilt.iitb.ac.in