EXTENDING A PERSIAN MORPHOLOGICAL ANALYZER TO BLOGS
Karine Megerdoomian
University of Maryland, College Park
karinem@umiacs.umd.edu
Second Workshop on the Persian Language and Computers, University of Tehran
Talk Outline

Persian Weblogs
– Persian is the 4th largest blog language in the world (~75,000 sites)

Description of a finite-state morphological analyzer for Persian
– System description
– Language issues and implementation

Computational issues in weblogs
Language of Blogs

Contain both formal and informal morphology
Morphology
– Informal text is very different from formal
  formal: مرا گرفته است ; informal: گرفته تم
– Features that don’t exist in formal
  فروشندهه؛ رفتش
– Shortened verbal stems and inflection
  formal: می گویند ; informal: میگن
Language of Blogs

Morphology
– Colloquial pronunciation
  غلطای املایی ؛ این سایتو ؛ دوستامونم ؛ دردناکه ؛ مثل منن
  ازشون ؛ خودتون ؛ نگاه های شان ؛ همسایه اشون
– Spelling errors and non-standard punctuation & spacing
– Emoticons and hyperlinks

Language of Blogs

Lexicon
– Wordforms follow pronunciation
  اوضاش ؛ برام ؛ نگامی کنم ؛ خونه ؛ تمبل ؛ همدیگه ؛ بش گفتم
– Colloquial forms
  تو دانشگاه ؛ واسه استادام
– New words
  لینکدونی ؛ دوستان کامنت گذار
Language of Blogs

Lexicon
– Loan words
  چت روم ؛ آن لاین ؛ دان لود کنین
– Interjections
  آاااخ! ؛ والا ؛ وای ؛ اوووه!
– More idiomatic expressions
  دمش گرم آقا
Language of Blogs

Huge amount of variation!
→ Need for flexible rules
→ Phonological rules to represent colloquial speech
→ Need to disambiguate (statistical component?)

Formal blog text is also different from traditional formal text

Language of Blogs

خوابگرد | BBC
موافق اند | موافقند
بیننده گان | بینندگان
کتاب اش | کتابش
کم تر | کمتر
کافی ست | کافیست
حتا | حتی
Finite-State Transducers (FST)

Two-level network or transducer
– Input = lower-side of arc
– Output = upper-side of arc
[Figure: one path through a transducer. Upper side: b i r d +Noun +Pl; lower side: b i r d s.]
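The two-level idea on this slide can be illustrated with a minimal Python sketch. It is not XFST code: the arc list is an assumption read off the figure residue, with the empty string standing for epsilon, and the names `ARCS` and `analyze` are hypothetical.

```python
# One path through a toy transducer: each arc pairs an upper symbol with a
# lower symbol. "" stands for epsilon (no symbol on that side).
# Assumption: the pairing +Noun:epsilon and +Pl:s is inferred from the slide.
ARCS = [("b", "b"), ("i", "i"), ("r", "r"), ("d", "d"), ("+Noun", ""), ("+Pl", "s")]


def lower_side(arcs):
    """Concatenate the lower (surface) symbols of the path."""
    return "".join(lower for _, lower in arcs)


def upper_side(arcs):
    """Concatenate the upper (analysis) symbols of the path."""
    return "".join(upper for upper, _ in arcs)


def analyze(surface, arcs=ARCS):
    """Return the upper-side string if `surface` matches the lower side."""
    return upper_side(arcs) if surface == lower_side(arcs) else None


print(analyze("birds"))
```

Here the surface word is matched against the lower side and the analysis is read off the upper side, which is the direction the slide describes.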
MA: System Description

Developed on Xerox Finite State Technology (XFST) [Karttunen & Beesley 1992]
Components:
– Lexicon and morphology rules (lexc)
– Phonological rules (regular expressions)
Compiled into an FST (finite-state transducer)
An FST is created separately for each part of speech; these are then composed → final FST for morphological analysis
MA: System Description
[Figure: the Noun FST, Verb FST and Adverb FST are combined by COMPOSITION with the phonology rules into the final FST for morphology, which maps an input string to an output string.]
MA: System Description

Coverage: formal Persian language
– Full verbal conjugation
– Nonverbal inflection
‫مسافرین ؛ فقرا‬
– Productive derivational morphology
‫سرسام آور‬
– ~20 phonological rules
– Proper nouns of people, places, organizations
Inflectional Morphology

LEXICON Root
ktab      Noun ;

LEXICON Noun
+Pl:ha    # ;    ! کتابها
+Pl:_ha   # ;    ! کتاب ها
+Sg:0     # ;    ! کتاب
+Pl:a     # ;    ! کتابا
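How continuation classes chain entries together can be sketched in Python. This is only a toy simulation of the lexc fragment on this slide, not the lexc compiler. The dictionary layout and the function names are assumptions, and `0` from the slide is written as the empty string.

```python
# Each lexicon lists (upper, lower, continuation) entries; "#" ends the word.
# The entries mirror the slide's Root and Noun lexicons.
LEXICONS = {
    "Root": [("ktab", "ktab", "Noun")],
    "Noun": [
        ("+Pl", "ha", "#"),
        ("+Pl", "_ha", "#"),
        ("+Sg", "", "#"),
        ("+Pl", "a", "#"),
    ],
}


def paths(lexicon="Root", upper="", lower=""):
    """Yield every (analysis, surface) pair reachable from `lexicon`."""
    if lexicon == "#":
        yield upper, lower
        return
    for up, low, nxt in LEXICONS[lexicon]:
        yield from paths(nxt, upper + up, lower + low)


def analyze(surface):
    """Return all analyses whose lower side equals `surface`."""
    return [u for u, l in paths() if l == surface]


print(analyze("ktabha"))
print(analyze("ktaba"))
```

Both plural spellings map to the same analysis, which is the point of listing several lower-side variants for one upper-side tag.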
Complex Tokens

Two different POS categories
بعقیده شما ؛ اینکار؛ بهترست - دردفتر ؛ وگفت

بعقیده: bh+Prep<eqydh+Noun+Sg
دردفتر: dr+Prep<dftr+Noun+Sg
کتابهایمان: ktab+Noun+Pl>av+Pron+Pers+Poss+1P+Pl
برادرشه: bradr+Noun+Sg>av+Pron+Pers+Poss+1P+Pl>bvdn+Verb+Ind+Pres+3P+Sg
Verbal Morphology

Two different stems
Infinitive | Present Stem | Past Stem
توانستن | توان | توانست
رفتن | رو | رفت
Verbal Morphology

LEXICON PastStem
tvanst        Infl1 ;
rft           Infl1 ;
xndyd         Infl1 ;

LEXICON PstStemBlog
tvnst         InflBlog1 ;

LEXICON PresentStem
tvanst:tvan   Infl2 ;
rft:rv        Infl2 ;
xndyd:xnd     Infl2 ;

LEXICON PrStemBlog
tvanst:tvn    Infl2 ;
rft:r         Infl2 ;
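The stem lexicons above pair a surface stem (lower side) with the form that appears in the analysis (upper side). A small Python sketch of that lookup follows. The table copies the slide's entries, while the `formal` and `blog` labels and the function name `lookup_stem` are assumptions added for illustration.

```python
# (surface stem, upper-side form, stem type, register)
# Entries without a colon on the slide have identical upper and lower sides.
STEMS = [
    ("tvanst", "tvanst", "past", "formal"),
    ("rft", "rft", "past", "formal"),
    ("xndyd", "xndyd", "past", "formal"),
    ("tvnst", "tvnst", "past", "blog"),
    ("tvan", "tvanst", "present", "formal"),
    ("rv", "rft", "present", "formal"),
    ("xnd", "xndyd", "present", "formal"),
    ("tvn", "tvanst", "present", "blog"),
    ("r", "rft", "present", "blog"),
]


def lookup_stem(surface):
    """Return (upper-side form, stem type, register) for a surface stem."""
    return [(upper, kind, reg) for s, upper, kind, reg in STEMS if s == surface]


print(lookup_stem("rv"))
print(lookup_stem("r"))
```

The shortened blog stems resolve to the same upper-side form as their formal counterparts, so downstream analysis does not need to know which register the token came from.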
Long Distance Dependencies

Some tenses of the verb can only be determined if we take into account the co-occurrence of the prefix and the person inflection / auxiliary
→ problem for linear approaches

Formal | Informal | Segmentation
می گذارد | میذاره | می (Imperf.) + گذار (Present) + د (Pres.3sg)
می گذاشت | میذاشت | می (Imperf.) + گذاشت (Past) + ‘’ (Past.3sg)
می گذاشته است | میذاشته | می (Imperf.) + گذاشت (Past) + ه (Past.3sg) + است (Pres. Aux.3sg)
Long Distance Dependencies

Leads to very complex paths and continuation classes in lexc
Using filters greatly increases the size of the FST
Use flag diacritics for unification (@U.Feature.Value@)
- Keeps FST small
- Can apply constraints between non-adjacent morphemes
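The behaviour of a unification flag can be sketched in a few lines of Python. This simulates only the `@U.Feature.Value@` check along one path and is not how XFST is implemented. The feature name `STEM`, its values, and the romanized morphemes `mi`, `gozar`, `gozasht` and `d` are hypothetical placeholders.

```python
import re

# Matches a unification flag such as @U.STEM.PRES@.
FLAG = re.compile(r"@U\.(\w+)\.(\w+)@")


def accept(path):
    """Return the path's symbols without flags if all U-flags unify, else None."""
    seen = {}   # feature -> value set by the first flag met on the path
    out = []
    for sym in path:
        m = FLAG.fullmatch(sym)
        if m is None:
            out.append(sym)          # ordinary symbol: keep it
            continue
        feature, value = m.groups()
        # First occurrence sets the value; later ones must agree with it.
        if seen.setdefault(feature, value) != value:
            return None
    return "".join(out)


# Present stem followed by an ending that requires a present stem: accepted.
print(accept(["mi", "gozar", "@U.STEM.PRES@", "d", "@U.STEM.PRES@"]))
# Past stem followed by the same ending: the flags clash, so the path is rejected.
print(accept(["mi", "gozasht", "@U.STEM.PAST@", "d", "@U.STEM.PRES@"]))
```

Because the check runs while the path is traversed, the stem and the ending can be constrained against each other without multiplying continuation classes, which is why the slide notes that the FST stays small.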
Phonology Rules

Form of affixes may change based on the ending character of the stem
Formal:
صدایش ؛ همسایه اش/کتابش ؛ چشم هایش
Informal:
صداش ؛ همسایش/کتابش ؛ چشماش

define clitic1 [^NB -> 0 || Cons _ ] ;
define clitic2 [^NB -> y || Vowel _ ] ;
define clitic3 [^NB -> "\u200c" a || e _ ] ;
Optional in informal blog text

Example inputs: ktab^NBš ; Sda^NBš ; hmsaye^NBš
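The effect of the three clitic rules on the example inputs can be sketched in Python. This covers the formal rules only and is not xfst: the vowel set is an assumption (the slide does not list it), rule clitic1 is approximated as every remaining context, and the optional informal behaviour is not modelled.

```python
import re

NB = "^NB"        # morpheme-boundary marker used on the slide
ZWNJ = "\u200c"   # zero-width non-joiner
VOWELS = "a"      # assumption: toy vowel set, not taken from the slide


def attach_clitic(form):
    """Rewrite the boundary marker according to the preceding character."""
    # clitic3: after e, the boundary becomes ZWNJ followed by a
    form = form.replace("e" + NB, "e" + ZWNJ + "a")
    # clitic2: after a vowel, the boundary becomes y
    form = re.sub("([" + VOWELS + "])" + re.escape(NB), r"\1y", form)
    # clitic1: elsewhere (treated here as after a consonant), delete the boundary
    return form.replace(NB, "")


for word in ["ktab^NBš", "Sda^NBš", "hmsaye^NBš"]:
    print(repr(attach_clitic(word)))
```

The rules are applied from most specific to least specific so that the `e` context is handled before the general vowel and consonant cases.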
Evaluation

FST: 178,452 states; 928,982 arcs before optimization
Speed: 20.84 seconds of CPU time for a 10 MB file, on a Sun SPARCstation
Coverage = 97.5%; Accuracy = 95%
Unanalyzed tokens: proper nouns + missing lexicon words
No weblog language rules included yet!
Conclusion

Challenges in morphological analysis of Persian formal text → Solutions in XFST system
New issues and variance due to blog language
Need robust system:
– Lexicon updated with colloquial forms
– Flexible morphological rules + derivational morphology rules
– Transliteration component for loan words
– Statistical approach to disambiguate and to deal with unknowns