SYNTAX-TO-MORPHOLOGY MAPPING IN FACTORED PHRASE-BASED STATISTICAL MACHINE TRANSLATION KemalOflazer
Transcription
SYNTAX-TO-MORPHOLOGY MAPPING IN FACTORED PHRASE-BASED STATISTICAL MACHINE TRANSLATION KemalOflazer
SYNTAX-TO-MORPHOLOGY MAPPING IN FACTORED PHRASE-BASED STATISTICAL MACHINE TRANSLATION KemalOflazer (Joint work with ReyyanYeniterzi@CMU-LTI) Turkish Turkish is an Altaic language with over 60 Million speakers ( > 150 M for Turkic Languages: Azeri, Turkoman, Uzbek, Kirgiz, Tatar, etc.) Agglutinative Morphology Morphemes glued together like "beads-on-a-string" Morphophonemic processes (e.g.,vowel harmony) 2 Turkish Morphology Productive inflectional and derivational suffixation. Many derivational suffixes Possibly multiple derivations in a word form Derivations applicable to almost all roots in a POS-class No prefixation, and no productive compounding. With minor exceptions, morphotactics, and morphophonemic processes are very "regular." 3 Turkish Morphology Basic root lexicon has about 30,000 entries ~100,000 roots with proper nouns But each noun/verb root word can generate a very large number of forms Nouns have about 100 different forms w/o any derivations Verbs have about 500 again w/o any derivations Things get out of hand when productive derivations are considered. Hankamer (1989) e.g., estimates few million forms per verbal root (counting derivations and inflections). 4 Some Statistics 5 HasimSak and Murat Saraclar of Bogazici University have recently compiled a 491Mword corpus About 4.1M types Going from 490M to 491M adds about 5,000 new types Most frequent 50K types cover 89% Most frequent 300K types cover 97% 3.4M Types occur less than 10 times 2.0M types occurs once Some Statistics 6 Word Structure A word can be seen as a sequence of inflectional groups (IGs) separated by derivational boundaries (^DB) Root+Infl1^DB+Infl2^DB+…^DB+Infln sağlamlaştırdığımızdaki ( (existing) at the time we caused (something) to become strong. ) sağlam+laş+tır+dığ+ımız+da+ki sağlam+Adj^DB+Verb+Become(+laş) ^DB+Verb+Caus+Pos(+tır) ^DB+Noun+PastPart+P1sg+Loc(+dığ, +ımız,+da) ^DB+Adj+Rel(+ki) Morphemes can have up to 8 allomorphs depending on the phonological context güzelleştirdiğimizdeki 7 How does English become Turkish? 8 if we are going to be able to make [something] become pretty if we are going to be able to make become pretty güzel +leş +tir +ebil güzelleştirebileceksek +ecek +se +k English phrases vs. Turkish words 9 Verb complexes/Adverbial clauses Iwould not be able todo(something) yap+ama+yacak+tı+m if wewillbe able to do (something) yap+abil+ecek+se+k discontinuity when/at the time wehad (someone) have (someone else) do (something) yap+tır+t+tığ+ımız+da English phrases vs. Turkish words 10 Possessive constructions/prepositional phrases my .... magazines dergi+ler+im with your .... magazines dergi+ler+iniz+le due-to theirclumsi+ness sakar+lık+ları+ndan after they were causedtobecome pretty güzel+leş+tir+il+me+leri+nden How bad can it potentially get? 11 Finlandiyalılaştıramadıklarımızdanmışsınızcasına (behaving) as if you have beenone of thosewhomwecouldnotconvertintoaFinn(ish citizen)/someone from Finland Finlandiya+lı+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casına Finlandiya+Noun+Prop+A3sg+Pnon+Nom ^DB+Adj+With/From ^DB+Verb+Become ^DB+Verb+Caus ^DB+Verb+Able+Neg ^DB+Noun+PastPart+A3pl+P1pl+Abl ^DB+Verb+Zero+Narr+A2pl ^DB+Adverb+AsIf But it gets better!-Finnish Numerals 12 Finnish numerals are written as one word and all components inflect and agree morphologically with the head noun they modify. Kahdensienkymmenensienkahdeksansien second tenth eighth Twenty eighth kaksi+Ord+Pl+Genkymmenen+Ord+Pl+Genkahdeksan+Ord+Pl+Gen kahdensi en kymmenens i en kahdeksans i en Example Courtesy of Lauri Karttunen But it gets better! 13 Aymara ch’uñüwinkaskirïyätwa ch’uñu +: +wi +na -ka +si -ka -iri +: +ya:t(a) +wa ch’uñu I was (one who was) N always at‘freeze-dried the place forpotatoes’ making ch’uñu’ +: N>V be/make … +wi V>N place-of +na in (location) -ka N>V be-in (location) +si continuative -ka imperfect -iri V>N one who Example Courtesy of Ken Beesley Polysynthetic Languages 14 Inuktikut uses morphology to combine syntactically related components (e.g. verbs and their arguments) of a sentence together Parismunngaujumaniralauqsimanngittunga Paris+mut+nngau+juma+niraq+lauq+si+ma+nngit+ju n Example Courtesy of Ken Beesley Back to English – Turkish SMT 15 Previous work in English-to-Turkish SMT relied segmenting Turkish into morphemes and translated at the levels of morphemes. (Durgar-El Kahlout and Oflazer (2010)) Selective morpheme segmentation Morpheme and word-based LMs Post-processing to occasionally correct malformed words Mermer et al. (2009, 2011) uses morpheme-based English – Turkish SMT: Problems 16 Sentences get longer for alignment Many sentences getting close to 100 tokens after morpheme segmentation Morphemes attach to incompatible roots; incorrect morphotactics Decoder handles both syntactic reordering and morphotactics using the same statistics Intuitively this did not look right English – Turkish SMT: Highlights 17 Two phrase translations coming together to form a new word Source: promote protection of children's rights in line with eu and international standards . Translation:çocukhak+larh+nhn koru+hn+ma+sh+nhnabveuluslar+arasistandart+lar+y a uygunşekil+dageliş+dhr+hl+ma+sh . Lit. develop protection of children's rights in accordance with eu and international standards . English – Turkish SMT: Highlights 18 Two phrase translations coming together to form a new word Source: promote protection of children's rights in line with eu and international standards . Translation:çocukhak+larh+nhn koru+hn+ma+sh+nhnabveuluslar+arasistandart+lar+y a uygunşekil+dageliş+dhr+hl+ma+sh . Lit. develop protection of children's rights in accordance with eu and international standards . English – Turkish SMT: Highlights 19 Mining the phrase-table, one finds similar interesting phrase pairs like afterexamine +vvg, +acc incele +dhk +abl sonra One can think of this as a structural transfer rulelike afterexamine +vvgNPeng NPturk+acc incele +dhk +abl sonra Now for a completely different approach 20 Examples such as Iwould not be able todo(something) yap+ama+yacak+tı+m yapamayacaktım if wewillbe able to do (something) yap+abil+ecek+se+k yapabileceksek when/at the time wehad (someone) have (someone else) do (something) yap+tır+t+tığ+ımız+da yaptırttığımızda with your .... magazines dergi+ler+iniz+ledergilerimle Now for a completely different approach 21 Instead of segmenting Turkish, can we map syntactic structures in English to complex words in Turkish directly ? Recognize certain local and nonlocal syntactic structures on the English side Package those structures and attach to heads to obtain parallel morphological structures Use factored PB-SMT Essentially, can we transform English to an agglutinative language? Syntax-to-Morphology Mapping 22 ontheireconomicrelations Tagger on+IN their+PRP$ economic+JJ relation+NN_NNS Dependency Parser PMOD POS on+IN their+PRP$ economic+JJ relation+NN_NNS Transformation economic+JJ relation+NN_NNS_their+PRP$_on+IN Syntax-to-Morphology Mapping 23 economic+JJrelation+NN_NNS_their+PRP$_on+IN Syntax-to-morphology mapping ekonomik+Adjilişki+Noun+A3pl+P3pl+Loc Morphological Analyzer/Disambiguator ekonomikilişkilerinde A Constituency View 24 PP PP NP Case-Marked NP Poss-NP NP NP in their economic relations NP economic relations their in NP ekonomik ilişki +leri+nde Align Map Syntax-to-Morphology Mapping 25 On both sides of the parallel data, each token now comprises of three factors: Surface (= Root+Tag) Root The complex tag Local/nonlocal syntax on the English side(+any morphology) economic|economic|+JJ relations|relation|+NN_NNS_their+PRP$_on+IN Full morphology on the Turkish side ekonomik|ekonomik|+Adj iliskilerinde|ilişki|+Noun+A3pl+P3sg+Loc Observations 26 We can identify and reorganize phrases on the English side, to “align” English syntax to Turkish morphology. The length of the English sentences can be dramatically reduced. most function words encoding syntax are now abstracted into complex tags Continuous and discontinuous variants of certain (syntactic) source phrases can be conflated during the SMT phrase extraction process. on their economic relations on their strong economic relations Rest of Talk 27 Another example Experimental Setup Experiments Additional Improvements Constituent Reordering Applications to Turkish-to-English SMT Conclusions Syntax-to-Morphology Mapping 28 ifarequestismadeorallytheauthoritymustmakearecordofit Tagger if+INa+DTrequest+NNbe+VB_VBZmake+VB_VBNorally+RB the+DTauthority+NNmust+MDmake+VBa+DTrecord+NNof+INit+PRP Dependency Parser VMOD NMOD VC if+IN a+DT request+NN be+VB_VBZ make+VB_VBN NMOD VC NMOD orally+RB PMOD the+DT authority+NN must+MD make+VB a+DT record+NN of+IN it+PRP Transformation request+NN_a+DTmake+VB_VBN_be+VB_VBZ_if+INorally+RB authority+NN_the+DTmake+VB_must+MDrecord+NN_a+DTit+PRP_of+IN Capturing Discontinuous Syntax 29 ifarequestismadeorallytheauthoritymustmakearecordofit Tagger if+INa+DTrequest+NNbe+VBVBZmake+VB_VBNorally+RB the+DTauthority+NNmust+MDmake+VBa+DTrecord+NNof+INit+PRP Dependency Parser VMOD VC if+IN be+VB_VBZ make+VB_VBN Transformation make+VB_VBN_be_VB+VBZ_if+IN Syntax-to-Morphology Mapping 30 request+NN_a_DTmake+VB_VBN_be_VB_VBZ_if_INorally+RB authority+NN_the_DTmake+VB_must_MDrecord+NN_a_DTit+PRP_of_IN English side now has less tokens (7 vs 14 originally) istek+Nounsözlü+Adjol+Verb+ByDoingSoyap+Verb+Pass+Narr+Cond yetkili+Adjmakam+Nounbu+Pron+Acckaydet+Verb+Neces+Cop Morphological Analyzer/Disambiguator isteksözlü olarak yapılmışsayetkilimakambunukaydetmelidir Syntax-to-Morphology Mapping 31 We use about 20 linguistically motivated syntax-tomorphology transformations which handle the following cases: Prepositions Possessive pronouns Possessive markers Auxiliary verbs and modals Forms of be used as predicates with adjectival or nominal dependents Forms of be or have used to form passive voice, and forms of be used with -ing verbs to form present continuous verbs Data Preparation 32 Same data that has been used in Durgar-El-Kahlout and Oflazer, 2010 52712 parallel sentences Average of 23 words in English sentences 18 words in Turkish sentences Randomly generated 10 train, test and dev set combinations Data Preparation 33 English POS tagging with Stanford Log-Linear Tagger Dependency parsing with MaltParser Additional stemming with TreeTagger Examples is : be+VB_VBZ, made : make+VB_VBN, Turkish Perform full morphological analysis and morphological disambiguation Remove any morphological features that are not explicitly marked by an overt morpheme kitaplarınızın ofyourbooks kitap-lar-ınız-ın kitap+Noun+P2pl+A3pl+Gen Experiments 34 Moses toolkit to encourage long distance reordering distortion limit of ∞ distortion weight of 0.1 Dual-path decoding SRILM Toolkit Translate surface if you can Translate root and complex tag and conjoin to get the translated surface Large generation table! 3-gram LM initially for all factors Modified Kneser-Ney discounting with interpolation Evaluation Each experiment was repeated over the 10 data sets Baseline Systems 35 Baseline System Surface form of the word 3-gram LM for surface words relation+NN_NNS ilişki+Noun+A3pl Baseline-Factored System Surface | Lemma | ComplexTag Aligned based on Lemma factor Different Surface 3-gram| LMs are| used for each factor ComplexTag Lemma Ave. STD. | relation | +NN_NNS Baselineilişki+Noun+A3pl | ilişki 17.08 0.60 | +Noun+A3pl Max. Min 17.99 15.97 Baseline-Factored Model 19.41 16.80 Experiment relation+NN_NNS 18.61 0.76 Experiments with Transformations 36 Transformations on the English side Nouns and adjectives (Noun+Adj) Verbs (Verb) Prepositions, possessive pronouns and markers, forms of be used as predicates with adjectives etc. Auxiliary verbs, negation markers, modals, passive constructions etc. Adverbs (Adv) Various adverbial clauses formed with if, while, when etc. Transformations on the Turkish side Verbs lexical and adverbs (Verb+Adv) Some postpositions in Turkish corresponds to English prepositions These postpositions are treated as if they were case-markers and attached to the immediately preceding noun (PostP) Experiments with Transformations 37 Experiment Ave. STD. Max. Min Baseline 17.08 0.60 17.99 15.97 Baseline-Factored Model 18.61 0.76 19.41 16.80 Noun+Adj 21.33 0.62 22.27 20.05 Verb 19.41 0.62 20.19 17.99 Adv 18.62 0.58 19.24 17.30 Verb+Adv 19.42 0.59 20.17 18.13 Noun+Adj+Verb+Adv 21.67 0.72 22.66 20.38 Noun+Adj+Verb+Adv+PostP 21.96 0.72 22.91 20.67 28.57% points over baseline 18.00% points over factored baseline Experiments with Transformations 38 Experiment Ave. Baseline-Factored Model 18.61 Noun+Adj 21.33 Verb 19.41 Adv 18.62 Verb+Adv 19.42 Noun+Adj+Verb+Adv 21.67 Noun+Adj+Verb+Adv+PostP 21.96 2.72 BLEU points 0.8 BLEU points BLEU Score vs. Number of Tokens 39 BLEU Score 21.00 1050000 20.00 1000000 19.00 950000 18.00 900000 17.00 850000 16.00 800000 15.00 Correlation : -0.99 Experiments Noun+Adj+Verb+Adv+P ostPC 1100000 Noun+Adj+Verb+PostP C 22.00 Noun+Adj+Verb 1150000 Noun+Adj+Verb+Adv 23.00 Noun+Adj 1200000 Verb+Adv 24.00 Verb 1250000 Adv 25.00 BLEU Scores Turkish 1300000 Baseline-Factored Number of Tokens English n-gram Precision Components of BLEU Scores 40 BLEU for words, roots (BLEU-R) and morphological tags (BLEU-M) 1-gr. 2-gr. 3-gr. 4-gr. BLEU 21.96 55.73 27.86 16.61 10.68 BLEU-R 27.63 68.60 35.49 21.08 13.47 BLEU-M 27.93 67.41 37.27 21.40 13.41 We are getting most of the root words and the complex morphological tags correct, but not necessarily getting the combination equally as good Using longer distance constraints on the morphological tag Experiments with Higher Order LMs 41 Factored phrase-based SMT allows the use of multiple LMs for different factors during decoding Investigate the contribution of higher order n-gram language models (4-grams to 9-grams) for the morphological tag factor Ave. The improvements were consistent up toSTD. 8-gram LM orders Max. Min Surface|Lemma|Tag 3|3|3 21.96 0.72 22.91 20.67 3|3|8 22.61 0.72 23.66 21.37 3|4|8 22.80 0.85 24.07 21.57 3|4|8 + Lexical Reordering 23.76 0.93 25.16 22.49 Augmenting the Training Data 42 Augment the training data with reliable phrase pairs obtained from a previous alignment Extract phrases from phrase table that satisfy 0.9 ≤ p(e|t)/p(t|e) ≤ 1.1 (phrases translate to each other) p(t|e) + p(e|t) ≥ 1.5 (and not much to others) These phrases are added to the training data to further bias Experiment Ave. STD. Max. Min the+alignment process 3|4|8 Lexical Reordering 23.76 0.93 25.16 22.49 Above+Augmentation 24.38 0.81 25.44 23.18 Sentence Length vs Transformations 43 Results after only the transformations (same LMs) English Sentence length 1-10 in the original test set Average BLEU 46.19 Average %Improvement over baseline 3% relative English Sentence length 20-30 in the original test set Average BLEU 20.93 Average %Improvement over baseline 17% Constituent Reordering 44 Syntax to morphology transformations do not perform any constituent level reordering We now reordered the source sentences, to bring English constituent order (SVO) more in line with the Turkish constituent order (SOV) at the top and embedded phrase levels. Constituent Reordering 45 Object reordering (ObjR) Adverbial phrase reordering (AdvR) from English V AdvPto Turkish AdvP V Passive sentence agent reordering (PassAgR) from English SVO to Turkish SOV from English SBJ PassiveVCbyVAgentto Turkish SBJ VagentbyPassiveVC Subordinate clause reordering (SubCR) postnominal relative clauses and prepositional phrase modifiers from English Noun SubCto TurkishSubC Noun Experiments with Reordering 46 Experiment Ave. STD. Max. Min Best Result from Previous Transformations (3-3-3/No-reordering/No Aug.) 21.96 0.72 22.91 20.67 ObjR 21.94 0.71 23.12 20.56 ObjR+AdvR 21.73 0.50 22.44 20.69 ObjR+PassAgR 21.88 0.73 23.03 20.51 ObjR+SubCR 21.88 0.61 22.77 20.92 Although there were some improvements for certain cases, none of the reorderings gave consistent improvements for all the data sets Examination of the alignments produced after these reordering transformations indicated that the resulting root alignments were not necessarily that close to being monotonic as we would have expected Turkish to English Translation 47 Syntax-to-Morphology mapping can be applied in the reverse direction, but The decoded English would have tags encoding syntax which would further have to be post-processed to put various functionrelation+NN_NNS_their+PRP$_on+IN words in their right places. economic+JJ on+IN their+PRP$ economic+JJ relation+NN_NNS Turkish to English Translation 48 Exactly the same set-up as English-to-Turkish system (except for decoding parms) Post-processing with a Transformed English-to-English SMT Train with transformed English train set as the source and the POS-tagged original English as the target language Rule/Heuristics-based transformation undo with coupled with a second SMT system Undo easy cases manually + with heuristics and then undo Turkish-to-English Translation 49 Experiment Ave. STD. Max. Min Factored Baseline (3-3-3) 24.96 0.48 25.82 24.02 Syntax-to-Morphology Transformations (3-3-3)+Rule-based+SMT Undo (3-3-3) 27.59 0.62 28.47 26.72 Syntax-to-Morphology Transformations (3-3-3)+Only SMT Undo (3-3-3) 28.27 0.46 28.99 27.75 Syntax-to-Morphology Transformations (3-4-5)+Only SMT Undo (4-5-7) 29.67 0.61 30.60 28.75 Above + Lexical Reordering 30.31 0.72 31.35 29.34 Sentence Length vs Transformations 50 Results after only the transformations (same LMs) English Sentence length 1-10 in the original test set Average BLEU 43.66 Average %Improvement over baseline 11% relative English Sentence length 20-30 in the original test set Average BLEU 22.48 Average %Improvement over baseline 13% Conclusions: English-to-Turkish SMT 51 A novel approach to map source syntactic structures to target morphological structures by encoding many local and nonlocal source syntactic structures as additional complex tag factors In our experiments, we performed syntax-to-morphology mapping transformations on the source side a very small set of transformations on the target side Overall, with some additional techniques we got about 30% improvement of a factored baseline A lot of the improvement is probably due to reduction in the Conclusions: Source-side Reordering 52 We performed numerous additional syntactic reordering transformations on the source to further bring the constituent order in line with the target order These reorderings did not provide any tangible improvements when averaged over the 10 different data sets Conclusions: Turkish-to-English SMT 53 We obtained similar improvements in the reverse direction using a second straight-forward SMT system to undo transformations. There is still more room there Augmentation LM’s using much larger English data Experiments with reordering Future Work 54 Can we learn transformation rules from a preprocessed / parsed corpora with some minimal additional information about relative morphology? Other languages English-to-Finnish would be interesting Finnish: Some ideas Finnish numerals are written as one word and all components inflect and agree morphologically with the head noun they modify. ...of the twenty eighth olympics …. …. Kahdensienkymmenensienkahdeksansien… Parse English and propagate any features (you can extract) to all components of the ordinal (+other modifiers) as additional complex tags second tenth analyze Finnish eighthnumerals to Morphologically 55 kaksi+Ord+Pl+Genkymmenen+Ord+Pl+Genkahdeksan+Ord+Pl+Gen make component morphology available to the Thanks 56 57 Syntactic Annotation 59 SyntacticAnnotation 60 The intensifier adverbial en (most) modifies the intermediate derived adjective akıl+lı(with intelligence/intelligent)