Statistical Dependency Parsing in Korean
Transcription
Statistical Dependency Parsing in Korean: From Corpus Generation To Automatic Parsing
Workshop on Statistical Parsing of Morphologically-Rich Languages
12th International Conference on Parsing Technologies
Jinho D. Choi & Martha Palmer, University of Colorado at Boulder
October 6th, 2011
choijd@colorado.edu

Dependency Parsing in Korean
• Why dependency parsing in Korean?
  - Korean is a flexible word order language.
  [Figure: two constituent trees for "She still loved him", one in the canonical SOV construction and one with the object scrambled to the front (leaving a trace *T*); the dependency relations SBJ, OBJ, and ADV remain the same under both word orders.]

Dependency Parsing in Korean
• Why dependency parsing in Korean?
  - Korean is a flexible word order language.
  - Rich morphology makes it easy for dependency parsing.
  [Example: 그녀 + 는 (she + auxiliary particle) marks the SBJ of "loved"; 그 + 를 (he + objective case marker) marks the OBJ; "still" attaches as ADV.]

Dependency Parsing in Korean
• Statistical dependency parsing in Korean
  - Sufficiently large training data is required.
    • Not much training data is available for Korean dependency parsing.
• Constituent Treebanks in Korean
  - Penn Korean Treebank: 15K sentences.
  - KAIST Treebank: 30K sentences.
  - Sejong Treebank: 60K sentences.
    • The most recent and largest Treebank in Korean.
    • Contains Penn Treebank style constituent trees.

Sejong Treebank
• Phrase structure
  - Includes phrase tags, POS tags, and function tags.
  - Each token can be broken into several morphemes.
  - Tokens are mostly separated by white spaces.
  [Figure: constituent tree for "She still loved him" with morpheme-level annotation: she = /NP+/JX, still = /MAG, him = /NP+/JKO, loved = /NNG+/XSV+/EP+/EF.]

Sejong Treebank
[Table 2: Phrase-level tags (left) and function tags (right) in the Sejong Treebank.]

  S    Sentence              SBJ  Subject
  Q    Quotative clause      OBJ  Object
  NP   Noun phrase           CMP  Complement
  VP   Verb phrase           MOD  Noun modifier
  VNP  Copula phrase         AJT  Predicate modifier
  AP   Adverb phrase         CNJ  Conjunctive
  DP   Adnoun phrase         INT  Vocative
  IP   Interjection phrase   PRN  Parenthetical

[Table 1: POS tags in the Sejong Treebank (PM: predicate marker, CP: case particle, EM: ending marker, DS: derivational suffix, PR: particle; SF, SP, SS, SE, SO: different types of punctuation).]

  NNG  General noun         MM   Adnoun              EP   Prefinal EM        JX  Auxiliary PR
  NNP  Proper noun          MAG  General adverb      EF   Final EM           JC  Conjunctive PR
  NNB  Bound noun           MAJ  Conjunctive adverb  EC   Conjunctive EM     IC  Interjection
  NP   Pronoun              JKS  Subjective CP       ETN  Nominalizing EM    SN  Number
  NR   Numeral              JKC  Complemental CP     ETM  Adnominalizing EM  SL  Foreign word
  VV   Verb                 JKG  Adnomial CP         XPN  Noun prefix        SH  Chinese word
  VA   Adjective            JKO  Objective CP        XSN  Noun DS            NF  Noun-like word
  VX   Auxiliary predicate  JKB  Adverbial CP        XSV  Verb DS            NV  Predicate-like word
  VCP  Copula               JKV  Vocative CP         XSA  Adjective DS       NA  Unknown word
  VCN  Negation adjective   JKQ  Quotative CP        XR   Base morpheme      SF, SP, SS, SE, SO, SW
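Since each Sejong token is a white-space unit whose annotation concatenates morpheme/POS pairs with "+", a short helper makes the token format above concrete. This is a minimal sketch added for illustration, not part of the original talk; the Morpheme class and the romanized surface forms are assumptions (the slides use Hangul).

```python
from dataclasses import dataclass

@dataclass
class Morpheme:
    form: str  # surface form of the morpheme
    pos: str   # Sejong POS tag from Table 1

def parse_token(annotated: str) -> list[Morpheme]:
    """Split a Sejong-style token such as 'geunyeo/NP+neun/JX'
    (she + auxiliary particle) into its morpheme/POS pairs."""
    morphemes = []
    for piece in annotated.split("+"):
        form, _, pos = piece.rpartition("/")  # split on the last '/'
        morphemes.append(Morpheme(form, pos))
    return morphemes

# 'loved' from the tree above: NNG + XSV + EP + EF
# (romanized forms are illustrative, not from the slides)
print(parse_token("sarang/NNG+ha/XSV+yeoss/EP+da/EF"))
```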
Dependency Conversion
• Conversion steps
  - Find the head of each phrase using head-percolation rules.
    • All other nodes in the phrase become dependents of the head.
  - Re-direct dependencies for empty categories.
    • Empty categories are not annotated in the Sejong Treebank.
    • Skipping this step generates only projective dependency trees.
  - Label (automatically generated) dependencies.
• Special cases
  - Coordination, nested function tags.

Dependency Conversion
• Head-percolation rules
  - Achieved by analyzing each phrase type in the Sejong Treebank.
  - Korean is a head-final language.
    • Except for the quotative clause (Q), all other phrase types find their heads among the rightmost children.
  - No rules to find the head morpheme of each token.

[Table 3: Head-percolation rules for the Sejong Treebank. l/r implies looking for the leftmost/rightmost constituent; * implies any phrase-level tag; | implies a logical OR and ; is a delimiter between tags.]

  S      r  VP;VNP;S;NP|AP;Q;*
  Q      l  S|VP|VNP|NP;Q;*
  NP     r  NP;S;VP;VNP;AP;*
  VP     r  VP;VNP;NP;S;IP;*
  VNP    r  VNP;NP;S;*
  AP     r  AP;VP;NP;S;*
  DP     r  DP;VP;*
  IP     r  IP;VNP;*
  X|L|R  r  *

  (X indicates phrases containing special types of morphemes such as particles and ending markers; L and R indicate phrases containing only left and right brackets, respectively. These tags are also used to determine dependency relations during the conversion.)

Dependency Conversion
• Dependency labels
  - Labels retained from the function tags.
    • When a node with a function tag is determined to be a dependent of some other node by the headrules, the function tag is taken as the dependency label to its head.
  - Labels inferred from constituent relations (Algorithm 1).
  [Figure: the constituent tree for "She still loved him" converted to a dependency tree, using the function tags SBJ and OBJ as dependency labels; "still" receives the inferred label ADV.]

  Algorithm 1: Getting inferred labels.
    input : (c, p), where c is a dependent of p.
    output: a dependency label l for the arc from p to c.
    begin
      if   p = root          then l := ROOT
      elif c.pos = AP        then l := ADV
      elif p.pos = AP        then l := AMOD
      elif p.pos = DP        then l := DMOD
      elif p.pos = NP        then l := NMOD
      elif p.pos = VP|VNP|IP then l := VMOD
      else                        l := DEP
    end

Dependency Conversion
• Coordination
  - Previous conjuncts become dependents of the following conjuncts.
• Nested function tags
  - Nodes with nested f-tags become the heads of their phrases.
  [Figure: dependency tree for "I_and he_and she home left": "I_and" attaches to "he_and" and "he_and" to "she", both with label CNJ; "she" is the SBJ and "home" the OBJ of "left".]
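To make the conversion concrete, here is a minimal sketch of the two mechanisms just described: head selection with the Table 3 rules and label inference with Algorithm 1. Only the rule table and the algorithm come from the slides; the Node structure, the rule encoding, and the function names are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional

# Head-percolation rules from Table 3: (direction, priority list).
HEADRULES = {
    "S":   ("r", ["VP", "VNP", "S", "NP|AP", "Q", "*"]),
    "Q":   ("l", ["S|VP|VNP|NP", "Q", "*"]),
    "NP":  ("r", ["NP", "S", "VP", "VNP", "AP", "*"]),
    "VP":  ("r", ["VP", "VNP", "NP", "S", "IP", "*"]),
    "VNP": ("r", ["VNP", "NP", "S", "*"]),
    "AP":  ("r", ["AP", "VP", "NP", "S", "*"]),
    "DP":  ("r", ["DP", "VP", "*"]),
    "IP":  ("r", ["IP", "VNP", "*"]),
    "X":   ("r", ["*"]), "L": ("r", ["*"]), "R": ("r", ["*"]),
}

@dataclass
class Node:
    tag: str                                   # phrase-level tag, e.g. NP
    ftag: Optional[str] = None                 # function tag, e.g. SBJ
    children: list["Node"] = field(default_factory=list)

def find_head(phrase: Node) -> Node:
    """Pick the head child of a phrase; assumes at least one child.
    Earlier tags in a rule take precedence, l/r sets the search direction."""
    direction, priorities = HEADRULES.get(phrase.tag, ("r", ["*"]))
    kids = phrase.children if direction == "l" else list(reversed(phrase.children))
    for tagset in priorities:
        for kid in kids:
            if tagset == "*" or kid.tag in tagset.split("|"):
                return kid
    return kids[0]

def infer_label(c: Node, p: Optional[Node]) -> str:
    """Algorithm 1: inferred label for dependent c of head p."""
    if p is None:                    return "ROOT"
    if c.tag == "AP":                return "ADV"
    if p.tag == "AP":                return "AMOD"
    if p.tag == "DP":                return "DMOD"
    if p.tag == "NP":                return "NMOD"
    if p.tag in ("VP", "VNP", "IP"): return "VMOD"
    return "DEP"

def label(c: Node, p: Optional[Node]) -> str:
    """Function tags, when present, are retained as dependency labels."""
    return c.ftag if c.ftag else infer_label(c, p)
```

Applied top-down, find_head identifies one head per phrase and makes all sibling nodes its dependents, which yields the projective trees described above when the empty-category step is skipped.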
Dependency Parsing
• Dependency parsing algorithm
  - Transition-based, non-projective parsing algorithm (Choi & Palmer, 2011).
    • Performs transitions from both projective and non-projective dependency parsing algorithms selectively.
    • Linear time parsing speed in practice for non-projective trees.
• Machine learning algorithm
  - Liblinear L2-regularized L1-loss support vector classification (Hsieh et al., 2008), applying c = 0.1 (cost), e = 0.1 (termination criterion), and B = 0 (bias).

Jinho D. Choi & Martha Palmer. 2011. Getting the Most out of Transition-based Dependency Parsing. In Proceedings of ACL:HLT'11.

Dependency Parsing
• Feature selection
  - Each token consists of multiple morphemes (up to 21).
  - POS tag feature of each token?
    • (NNG & XSV & EP & EF & SF) vs. (NNG | XSV | EP | EF | SF)
    • Sparse information vs. lack of information.
  - Happy medium?
  [Figure 5, three example tokens: /NNP+/NNG+/JX (Nakrang + Princess + JX), /NNP+/NNG+/JKO (Hodong + Prince + JKO), /NNG+/XSV+/EP+/EF+./SF (love + XSV + EP + EF + .).]

Dependency Parsing
• Morpheme selection
  - Joining all POS tags within a token into a single tag (e.g., NNP+NNG+JX for the first token in Figure 5) usually causes very sparse feature vectors; as a compromise, certain types of morphemes are selected and only these are used as features.

[Table 6: Types of morphemes in each token used to extract features for the parsing models.]

  FS  The first morpheme
  LS  The last morpheme before JK|DS|EM
  JK  Particles (J* in Table 1)
  DS  Derivational suffixes (XS* in Table 1)
  EM  Ending markers (E* in Table 1)
  PY  The last punctuation, only if there is no other morpheme followed by the punctuation

  - Morphemes extracted from the tokens in Figure 5 (Figure 6): /NNP /NNG /JX; /NNP /NNG /JKO; /NNG /XSV /EF /SF.
  - For unigrams, these morphemes can be used either individually (e.g., the POS tag of JK for the 1st token is JX) or jointly (e.g., a joined feature of POS tags between LS and JK for the 1st token is NNG+JX) to generate features.
  - Features extracted from the JK and EM morphemes are found to be the most useful.
  - For n-grams where n > 1, it is not obvious which combinations of these morphemes across different tokens are useful; see the next slide and the sketch below.
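Building on the Morpheme class from the earlier sketch, the Table 6 selection could be implemented as follows. The function name and the exact tie-breaking (keeping the last particle, derivational suffix, and ending marker seen) are assumptions for illustration.

```python
PUNCTUATION = {"SF", "SP", "SS", "SE", "SO", "SW"}  # punctuation/symbol tags

def select_morphemes(token: list[Morpheme]) -> dict[str, Morpheme]:
    """Select the Table 6 morpheme types (FS, LS, JK, DS, EM, PY)."""
    sel = {"FS": token[0]}                    # FS: the first morpheme
    for i, m in enumerate(token):
        if m.pos.startswith("J"):             # JK: particles (J*)
            sel["JK"] = m
        elif m.pos.startswith("XS"):          # DS: derivational suffixes (XS*)
            sel["DS"] = m
        elif m.pos.startswith("E"):           # EM: ending markers (E*)
            sel["EM"] = m
        else:
            continue
        # LS: the last morpheme before the first JK|DS|EM morpheme
        if "LS" not in sel and i > 0:
            sel["LS"] = token[i - 1]
    if token[-1].pos in PUNCTUATION:          # PY: trailing punctuation only
        sel["PY"] = token[-1]
    return sel
```

On the Figure 5 tokens this keeps exactly the morphemes shown in Figure 6, e.g., NNG, XSV, EF, and SF for the third token (EP is dropped because EF is the last ending marker).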
Dependency Parsing
• Feature extraction
  - Extract features using only important morphemes.
  - Individual POS tag features of the 1st and 3rd tokens:
    • NNP1, NNG1, JK1, NNG3, XSV3, EF3
  - Joined features of POS tags between the 1st and 3rd tokens:
    • NNP1_NNG3, NNP1_XSV3, NNP1_EF3, JK1_NNG3, JK1_XSV3
  - Tokens used: wi, wj, wi±1, wj±1.
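Continuing the sketch, unigram and joined POS-tag features for a token pair (wi, wj) could be generated as follows. The naive cross-product over all selected morphemes is an assumption; as the previous slide notes, which combinations are actually useful is an empirical question.

```python
from itertools import product

def pos_features(tokens: list[list[Morpheme]], i: int, j: int) -> list[str]:
    """POS-tag features for a candidate dependency between tokens wi and wj,
    built only from the morphemes chosen by select_morphemes()."""
    sel_i, sel_j = select_morphemes(tokens[i]), select_morphemes(tokens[j])
    # individual features, e.g. NNP1, JK1, NNG3
    feats = [f"{m.pos}{i + 1}" for m in sel_i.values()]
    feats += [f"{m.pos}{j + 1}" for m in sel_j.values()]
    # joined features across the two tokens, e.g. JK1_NNG3
    feats += [f"{mi.pos}{i + 1}_{mj.pos}{j + 1}"
              for mi, mj in product(sel_i.values(), sel_j.values())]
    return sorted(set(feats))  # de-duplicate (FS and LS may coincide)
```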
Experiments
• Corpora
  - Dependency trees converted from the Sejong Treebank.
  - Consist of 20 sources in 6 genres: Newspaper (NP), Magazine (MZ), Fiction (FI), Memoir (ME), Informative Book (IB), and Educational Cartoon (EC).
  - Development and evaluation sets: one newspaper about art, one fiction text, and one informative book about trans-nationalism; the first half of each is used for development and the second half for evaluation.
  - Evaluation sets are very diverse compared to the training sets.
    • Ensures the robustness of our parsing models, which is important because we hope to use them to parse various texts on the web.

[Table 10: Number of sentences in training (T), development (D), and evaluation (E) sets for each genre.]

       NP      MZ     FI      ME     IB     EC
  T    8,060   6,713  15,646  5,053  7,983  1,548
  D    2,048   -      2,174   -      1,307  -
  E    2,048   -      2,175   -      1,308  -

Experiments
• Morphological analysis
  - Two automatic morphological analyzers are used.
• Intelligent Morphological Analyzer
  - Developed by the Sejong project.
  - Provides the same morphological analysis as the Sejong Treebank.
    • Considered fine-grained morphological analysis.
• Mach (Shim and Yang, 2002)
  - Analyzes 1.3M words per second.
  - Provides more coarse-grained morphological analysis.

Kwangseob Shim & Jaehyung Yang. 2002. A Supersonic Korean Morphological Analyzer. In Proceedings of COLING'02.

Experiments
• Evaluations
  - Gold-standard vs. automatic morphological analysis.
    • Relatively low performance from the automatic system.
  - Fine vs. coarse-grained morphological analysis.
    • Differences are not too significant.
  - Robustness across different genres.

[Table 11: Parsing accuracies achieved by the three models (in %). LAS: labeled attachment score, UAS: unlabeled attachment score, LS: label accuracy score.]

         Gold, fine-grained    Auto, fine-grained    Auto, coarse-grained
         LAS    UAS    LS      LAS    UAS    LS      LAS    UAS    LS
  NP     82.58  84.32  94.05   79.61  82.35  91.49   79.00  81.68  91.50
  FI     84.78  87.04  93.70   81.54  85.04  90.95   80.11  83.96  90.24
  IB     84.21  85.50  95.82   80.45  82.14  92.73   81.43  83.38  93.89
  Avg.   83.74  85.47  94.57   80.43  83.01  91.77   80.14  82.89  91.99

Conclusion
• Contributions
  - Generating a Korean Dependency Treebank from the Sejong Treebank.
  - Selecting important morphemes for dependency parsing.
  - Evaluating the impact of fine vs. coarse-grained morphological analysis on dependency parsing.
  - Evaluating the robustness of the parsing models across different genres.
• Future work
  - Increase the feature span beyond bigrams.
  - Find head morphemes of individual tokens.
  - Insert empty categories.

Acknowledgements
• Special thanks are due to
  - Professor Kong Joo Lee of Chungnam National University.
  - Professor Kwangseob Shim of Sungshin Women's University.
• We gratefully acknowledge the support of the National Science Foundation Grant CISE-IIS-RI-0910992, Richer Representations for Machine Translation. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
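As a closing note on the scores reported in Table 11, here is the conventional computation of LAS, UAS, and LS from per-token (head, label) pairs. This is the standard definition of these metrics, added for reference rather than taken from the talk.

```python
def attachment_scores(gold: list[tuple[int, str]],
                      pred: list[tuple[int, str]]) -> dict[str, float]:
    """LAS/UAS/LS in %, given one (head index, label) pair per token."""
    n = len(gold)
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred))  # head correct
    ls  = sum(g[1] == p[1] for g, p in zip(gold, pred))  # label correct
    las = sum(g == p for g, p in zip(gold, pred))        # both correct
    return {"LAS": 100.0 * las / n,
            "UAS": 100.0 * uas / n,
            "LS":  100.0 * ls / n}
```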