Towards Kurdish Information Retrieval
Kyumars Sheykh Esmaili, Nanyang Technological University
Shahin Salavati, University of Kurdistan
Anwitaman Datta, Nanyang Technological University
The Kurdish language is an Indo-European language spoken in Kurdistan, a large geographical region in
the Middle East. Despite having a large number of speakers, Kurdish is among the less-resourced languages
and has not seen much attention from the IR and NLP research communities. This paper reports on the
outcomes of a project aimed at providing essential resources for processing Kurdish texts.
A principal output of this project is Pewan, the first standard Test Collection to evaluate Kurdish Information Retrieval systems. The other language resources that we have built include a light-weight stemmer
and a list of stopwords.
Our second principal contribution is using these newly-built resources to conduct a thorough experimental study on Kurdish documents. Our experimental results show that normalization and, to a lesser extent,
stemming can greatly improve the performance of Kurdish IR systems.
Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and
Retrieval
General Terms: Design, Measurement, Experimentation, Performance
Additional Key Words and Phrases: Kurdish Language, Bi-Standard Languages, Test Collection, Stemming,
Cross-Lingual Information Retrieval
1. INTRODUCTION
With the growing number of non-English web searchers, the efficient handling of non-English documents and user queries is becoming a
major issue for search engines [Lazarinis et al. 2009]. To tackle this issue, a range
of workshops and forums have been launched over the last decade. Three important
examples are CLEF [CLEF 2013] for European languages, NTCIR [NTCIR 2013] for
Asian languages and FIRE [FIRE 2013] for Indian languages.
The Kurdish language is an Indo-European language spoken in Kurdistan, a large
geographical region in the Middle East. Despite having 20 to 30 million native
speakers [Haig and Matras 2002; Hassanpour et al. 2012; Thackston 2006b; 2006a],
Kurdish is among the less-resourced languages. More specifically, in spite of a few
attempts at building a corpus [Gautier 1998] and a lexicon [Walther and Sagot 2010], Kurdish still does not have a large-scale and reliable corpus. Similarly, no test collection
–which is central to Information Retrieval research and development– or stemming
algorithm has been developed for this language.
In recent years, Kurdish autonomy in the area and the consequent rise in importance
of communications in Kurdish in Iraq (and lately in Turkey) as well as the proliferation
of social media and Internet use in both Kurdistan and diaspora [Sheyholislami 2010]
have prompted the need for Kurdish language processing. To cater for this need, we
have recently launched the Kurdish Language Processing Project (KLPP1), aiming at
providing basic tools and techniques for Kurdish text processing. This paper reports
on KLPP’s main outcomes so far2.
1 Project page at http://eng.uok.ac.ir/esmaili/research/klpp/en/main.htm
Authors’ addresses: K. Sheykh Esmaili and A. Datta, School of Computer Engineering, Nanyang Technological University, Singapore; S. Salavati, Computer Engineering Department, University of Kurdistan, Sanandaj, Iran.
Journal, Vol. V, No. N, Article A, Publication date: January YYYY.
Our first primary contribution is Pewan, which to the best of our knowledge is the
first standard test collection to evaluate Kurdish IR systems. To build Pewan, we have
carefully followed the standard test collection construction methodology of TREC [TREC 2013]. More specifically, we first built a relatively large Kurdish text corpus, and then
used a powerful desktop search tool to compile a list of queries. Next, three widely-used open-source IR systems as well as our own implementation of two well-known
retrieval models were used to create result pools for all queries. These pools were then
manually assessed by our team to generate the true list of relevant documents for each
query.
The other language resources that we have constructed are a light-weight stemmer,
a list of affixes, and a list of stopwords. We have also translated Pewan’s queries into
English and Persian, hence making it possible for researchers to investigate the Cross-Lingual Information Retrieval (CLIR) problem as well.
Our second important contribution is a comprehensive experimental study of basic
IR techniques on Kurdish documents. Our experimental results show that normalization and, to a lesser extent, stemming can greatly improve the quality of Kurdish IR
systems. They also highlight the need for further research into the problem of cross-language IR on Kurdish documents.
All of the aforementioned resources are freely accessible and can be obtained
from [Pewan 2013]. We hope that making these resources publicly available will bolster IR research on the Kurdish language.
The rest of the paper is organized as follows. We first give a brief description of
the main challenges in Kurdish text processing in Section 2. Then in Section 3 we
focus on the Sorani branch of Kurdish and explain the process of constructing Pewan’s
text corpus and test collection. Section 4 introduces the other resources that we built on top of Pewan. The results of our experimental study with Pewan are
reported in Section 5. Section 6 reports on how we augmented Pewan with Kurmanji
documents and relevance judgments. Finally, we conclude the paper in Section 7.
2. CHALLENGES IN KURDISH TEXT PROCESSING
The Kurdish language belongs to the Indo-Iranian family of Indo-European languages. Its
closest better-known relative is Persian. Kurdish has two main branches –namely, Sorani and Kurmanji– and is spoken in Kurdistan, a large geographical area spanning
the intersections of Turkey, Iran, Iraq and Syria. It is one of the two official languages
of Iraq and has a regional status in Iran.
Apart from the resource-scarceness problem, we have identified four other challenges in processing Kurdish texts. The first three of these challenges highlight the
problems arising from Kurdish language’s inherent diversity and complexity. The
fourth challenge is posed by the implementational issues in Kurdish’s Arabic-based
writing system.
2.1. Dialect Diversity
The first and foremost challenge in processing the Kurdish language is its dialect diversity [Esmaili 2012]. In this paper we focus on Sorani and Kurmanji, which are the two
most important dialects in terms of number of speakers [Haig and Matras 2002]. Together, they account for more than 75% of native Kurdish speakers [Walther and Sagot
2010].
2 An earlier and shorter version of this paper titled “Building a Test Collection for Sorani Kurdish” [Esmaili et al. 2013] is to appear in the proceedings of the 10th ACS/IEEE Conference on Computer Systems and Applications (AICCSA13).
(a) One-to-One Mappings
#    Arabic-based    Latin-based
1    ا    A
2    ب    B
3    ج    C
4    چ    Ç
5    د    D
6    ێ    Ê
7    ف    F
8    گ    G
9    ژ    J
10    ک    K
11    ل    L
12    م    M
13    ن    N
14    ۆ    O
15    پ    P
16    ق    Q
17    ر    R
18    س    S
19    ش    Ş
20    ت    T
21    وو    Û
22    ڤ    V
23    خ    X
24    ز    Z
(b) One-to-Two Mappings
#    Arabic-based    Latin-based
25    ئ /    I
26    و    U/W
27    ی    Y/Î
28    ه    E/H
(c) One-to-Zero Mappings
#    Arabic-based    Latin-based
29    ڕ    (RR)
30    ڵ    -
31    ع    (E)
32    غ    (X)
33    ح    (H)
Fig. 1: The two standard Kurdish Alphabets.
The features distinguishing these two dialects are morphological as well as phonological. The important morphological differences are [MacKenzie 1961; Haig and Matras 2002]:
(1) Kurmanji is more conservative in retaining both gender (feminine:masculine) and
case opposition (absolute:oblique) for nouns and pronouns, whereas Sorani has largely
abandoned this system and uses the pronominal suffixes to take over the functions
of the cases,
(2) the definite suffix -aka appears only in Sorani,
(3) in the past-tense transitive verbs, Kurmanji has the full ergative alignment but
Sorani, having lost the oblique pronouns, resorts to pronominal enclitics, and
(4) in Sorani, passive and causative can be created exclusively via verb morphology, whereas in
Kurmanji they can also be formed with the verbs hatin (to come) and dan (to give)
respectively.
2.2. Script Diversity
Due to geopolitical reasons [Sheyholislami 2010], each of the two aforementioned dialects has been using its own writing system. In fact, Kurdish is considered a bi-standard language [Gautier 1998], with Sorani almost exclusively written in Arabic-based letters and Kurmanji almost exclusively written in Latin-based letters. Both of
these systems are phonetic [Gautier 1998]; that is, vowels are explicitly represented
and their use is mandatory. In the case of the Arabic-based alphabet, this is a significant
advantage over Arabic-based writing systems in other languages (e.g., Persian, Arabic,
and Urdu) where the use of vowels is optional, causing a range of processing ambiguities3.
Figure 1 shows both the Arabic-based and the Latin-based standard Kurdish alphabets and the mappings between them, which we have categorized into three classes:
— One-to-one mappings (Figure 1a); covering a large subset of the characters,
— One-to-two mappings (Figure 1b); which reflect the inherent discrepancies between
the two writing systems [Barkhoda et al. 2009]. While transliterating between these
two alphabets, the contextual information can provide hints in choosing the right
counterpart.
— One-to-zero mappings (Figure 1c); they can be split further into two distinct categories: (i) the strong L and strong R characters ({ڵ} and {ڕ}) are used only in Sorani
Kurdish (although there are a handful of words with the latter in Kurmanji too) and
demonstrate the inherent phonological differences between Sorani and Kurmanji
dialects, and (ii) the remaining three characters are almost entirely used in Arabic
loanwords in Sorani (in Kurmanji they are approximated with other characters).
3 In fact, as discussed in [Farghaly and Shaalan 2009], the absence of short vowels contributes most significantly to ambiguity in the Arabic language, causing difficulty in homograph resolution, word sense disambiguation, and part-of-speech detection. In Persian, its negative consequence is more visible in detecting the Izafe constructs [Shamsfard 2011].
2.3. Complex Morphology
Kurdish has a relatively complex morphology [Samvelian 2007; Walther 2011]. One of
the driving factors behind this complexity is the wide use of suffixes. Some of the most
common suffixes are (an extended list is given in Section 4):
(1) -i, the Izafe construction marker (approximately corresponding to the English preposition “of”),
(2) -aan, the plural noun marker,
(3) -aka and -ek, the definiteness and indefiniteness markers, and
(4) personal verb endings/possessive pronouns.
For example, as depicted below, the phrase { } naawandakaani dangdaanmaan “our voting centers” consists of a noun ({ } naawand “center”) and a
non-finite verb ({ } dangdaan “voting”) and four different suffixes.
naawand (noun) + aka (definite marker) + aan (plural marker) + i (Izafe marker) + dangdaan (non-finite verb) + maan (possessive pronoun)
To demonstrate the morphological complexity of the Kurdish language empirically, we
carried out an experiment in which the proportion of distinct words to the total number
of words (a variation of Heaps’ law [Heaps 1978]) was computed for both dialects
of Kurdish as well as three other languages: English, Persian, and Arabic. In this
experiment, the English corpus consisted of the Editorial articles of The Guardian
newspaper [Guardian 2013], and the Persian and the Arabic corpora were drawn from the
standard Hamshahri Collection [AleAhmad et al. 2009] and the OSAC Corpus [Saad
and Ashour 2010], respectively. The Kurdish documents were collected mainly from
the Peyamner News Agency [Peyamner 2013].
As the corresponding curves in Figure 2 show, the number of distinct words in both
Sorani and Kurmanji Kurdish is higher than in English and Persian. Moreover, the ratio in Sorani is comparable to that of Arabic, which has a notoriously complex morphology system4. Another interesting observation here is the difference between the ratios
for Sorani and Kurmanji. Two primary sources of this difference are: (i) the inherent linguistic differences between the two dialects, as mentioned earlier in Section 2.1, and
(ii) the writing style differences; more specifically, the use of space-delimited words is less
common in Sorani [Esmaili and Salavati 2013]. For instance, the to-be verb ({ } boon) as well as some of the most common prepositions (e.g., { } ish “too”) are used
as suffixes.
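The distinct-versus-total word measurement described above can be approximated with a short script. The following is only a sketch under stated assumptions: tokens are obtained by whitespace splitting, and the function name `vocabulary_growth` and the sampling `step` are hypothetical choices made for illustration, not details of the original experiment.

```python
from typing import Iterable, List, Tuple


def vocabulary_growth(tokens: Iterable[str], step: int = 1000) -> List[Tuple[int, int]]:
    """Return (total_words, distinct_words) points, sampled every `step` tokens."""
    seen = set()
    points = []
    total = 0
    for total, token in enumerate(tokens, start=1):
        seen.add(token)
        if total % step == 0:
            points.append((total, len(seen)))
    # Record the final point if the stream did not end on a sampling boundary.
    if total and total % step != 0:
        points.append((total, len(seen)))
    return points


# Toy input; a real run would stream the tokens of a whole corpus.
sample = "the cat saw the dog and the dog saw the cat".split()
curve = vocabulary_growth(sample, step=4)
print(curve)
```

Plotting the resulting points for each corpus gives one curve per language, which is the kind of comparison summarized in Figure 2.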
2.4. Text Preprocessing
The Arabic-based writing system of Kurdish language poses a set of challenges in
the text preparation and preprocessing phase. Below, we highlight two types of such
challenges.
4 A similar observation between Arabic and English has been reported by Xu et al. in [Xu et al. 2002].
Fig. 2: Number of Distinct Words for Different Languages (x-axis: Total Number of Words; y-axis: Number of Distinct Words; curves: Arabic, Sorani, Kurmanji, Persian, English)
2.4.1. Normalization. The Unicode assignments of the Arabic-based Kurdish alphabet
have two potential sources of ambiguity which should be dealt with carefully:
— for some letters such as ye and ka there is more than one Unicode representation
({ } versus { } and { } versus { }). During the normalization phase, the occurrences of these multi-code letters should be unified.
— as in Urdu, the isolated { } and final { } forms of the letter ha constitute one letter (pronounced a), whereas the initial { } and medial { } forms of the same letter
constitute another letter (pronounced h), for which a different Unicode encoding is
available [Walther and Sagot 2010; Gautier 1998]. In many electronic texts, these
letters are written using only the ha, differentiated by using the zero-width non-joiner (zwnj) character that prevents a character from being joined to its follower.
This distinction must be taken into account in the normalization phase.
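The two normalization issues above can be illustrated with a small sketch. The concrete code points and the direction of each mapping are assumptions made for illustration (Arabic Yeh U+064A to Farsi Yeh U+06CC, Arabic Kaf U+0643 to Keheh U+06A9, and Heh U+0647 to Ae U+06D5 when it is followed by zwnj or ends a word); the text does not state which representation Pewan treats as canonical, and the function name `normalize` is hypothetical.

```python
import re

ARABIC_YEH = "\u064A"  # one encoding of ye
FARSI_YEH = "\u06CC"   # assumed canonical ye
ARABIC_KAF = "\u0643"  # one encoding of ka
KEHEH = "\u06A9"       # assumed canonical ka
HEH = "\u0647"         # ha, used for both the consonant h and the vowel a
AE = "\u06D5"          # separately encoded vowel (assumed target)
ZWNJ = "\u200C"        # zero-width non-joiner


def normalize(text: str) -> str:
    """Unify multi-code letters and make the vowel use of ha explicit."""
    text = text.replace(ARABIC_YEH, FARSI_YEH).replace(ARABIC_KAF, KEHEH)
    # ha followed by zwnj is displayed in its final form, i.e. the vowel.
    text = text.replace(HEH + ZWNJ, AE)
    # ha not followed by another word character is also in final/isolated form.
    text = re.sub(HEH + r"(?!\w)", AE, text)
    return text


print(normalize("\u0628\u0647\u200C\u0628") == "\u0628\u06D5\u0628")
```

Note that after this step no zwnj remains next to a ha, so a tokenizer that splits on non-word characters no longer breaks such words apart.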
2.4.2. Segmentation. Segmentation refers to the process of recognizing boundaries of
text constituents, including sentences, phrases and words. The Arabic-based Kurdish
alphabet suffers from two segmentation problems that are inherited from the Arabic
writing system:
— it does not have capitalization, which makes it more difficult to recognize sentence
boundaries as well as Named Entities,
— white space is not a deterministic delimiter and boundary sign [Shamsfard 2011];
it may appear within a word or between words, or may be absent between some
sequential words. However, compared to Persian [Shamsfard et al. 2010] and
Urdu [Rehman et al. 2011], this problem is less severe in Kurdish.5
3. PEWAN: A KURDISH TEST COLLECTION
This section reports on our efforts to construct Pewan6 , the first standard test collection
to evaluate Kurdish IR systems. Later, in Section 4, we show how Pewan was leveraged
to build a number of other (smaller) resources for the Kurdish language.
5 In Kurdish, it is primarily caused by using white space instead of zwnj, when the latter is not supported by the typesetting environment. This error is prevalent in the older articles of VOA [VOA 2013b], one of Pewan’s raw text sources.
6 A Kurdish word meaning measurement.
In the following, we first (in Section 3.1) briefly describe the purpose and use of
IR test collections as well as a standard methodology to construct them. Then after
explaining our design decisions in Section 3.2, we present Pewan’s construction details
in Section 3.3. Finally, the performance of the IR systems used in the pooling process
is reported in Section 3.4.
3.1. Background
Test collections are widely used as a standard tool to evaluate the performance of retrieval systems. There is a variety of test collections, tailored for different IR tasks.
The Text REtrieval Conferences (TREC [TREC 2013]) held by NIST have greatly contributed to the construction of large and reliable test collections. Each test collection
has three main components:
— a text corpus,
— a set of queries (or topics),
— a list of relevant documents for each query (or Relevance Judgments).
Among these components, the most critical one is the relevance judgments, which have
a significant impact on the reliability of the collection. For small test collections it is
feasible to judge the relevance of all document-query pairs; in large test collections, however, this approach is not feasible. As an alternative, most modern test collections,
including TREC’s Ad-Hoc tracks, use system pooling [Sparck-Jones and van Rijsbergen
1976]. In system pooling, a number of retrieval systems are used to retrieve and rank
documents and then the top-ranked results from all systems are merged to create a
pool for each query. Later, these pools are manually examined by human assessors to
create the lists of true relevant documents.
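The pooling step can be sketched in a few lines. This is a minimal illustration, assuming each system's run for a query is a ranked list of document identifiers; the function name `build_pool`, the system names, and the document identifiers are hypothetical.

```python
from typing import Dict, List, Set


def build_pool(runs: Dict[str, List[str]], depth: int) -> Set[str]:
    """Merge the top-`depth` documents of every system's ranking for one query."""
    pool: Set[str] = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return pool


# Toy runs for a single query; real runs would come from the retrieval systems.
runs = {
    "system_a": ["d3", "d1", "d7"],
    "system_b": ["d1", "d9", "d2"],
}
pool = build_pool(runs, depth=2)
print(sorted(pool))
```

Because documents retrieved by several systems are stored only once, the pool is usually much smaller than the number of systems multiplied by the depth, which is what keeps manual assessment tractable.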
3.2. Design Decisions
Before presenting the construction details of Pewan, we would like to justify two important decisions that we have made in building Pewan: to build two separate collections for the two main standards of Kurdish, and to attach greater importance to the
Sorani/Arabic-based standard.
3.2.1. Decoupling the Bi-Standard Aspect. As mentioned in Section 2, Kurdish is a bi-standard language with the Sorani dialect written in the Arabic-based alphabet and the Kurmanji
dialect written in the Latin-based alphabet. The mapping between these two dialect/script
branches is not trivial and, as the examples in [Gautier 1998] show, the same word,
when going from Sorani to Kurmanji, may at the same time go through several levels
of change: writing systems, phonology, morphology, and sometimes semantics. This
clearly demonstrates that the mapping between these two dialects/scripts is
more than just transliteration, although its complexity is less than that of translation. This bi-standard nature poses a unique challenge in the construction of test collections for the Kurdish
language.
Our approach is to decouple the mapping problem from the test collection construction problem, meaning we build two test collections, one per standard. As a result,
in addition to the primary use of evaluating Sorani IR and Kurmanji IR systems, Pewan can be also perceived as a mechanism to evaluate mapping proposals between
these standards (since the queries in these two collections are transliterations/translations of each other).
3.2.2. Focusing on the Sorani Dialect/Arabic-based Writing System. In the remaining parts
of this paper, we give greater importance to the Sorani dialect/Arabic-based writing
system. The rationale behind this is two-fold: (i) as a result of the severe restrictions
on the use of the Kurdish language in Turkey –where the majority of Kurmanji speakers live–
there are currently very few sources with a considerable amount of raw Kurmanji text
readily available, and even these sources do not strictly follow its writing standards.
The Sorani dialect, on the other hand, does not suffer from these shortcomings, thanks to
its official status and wider use; (ii) the Arabic-based writing system is more expressive
and far more challenging to process.
Fig. 3: Documents in Pewan. (a) Document Size Distribution in Pewan (x-axis: Document Size (KB); y-axis: Count, logarithmic scale). (b) A Sample Document in Pewan.
Nonetheless, to guarantee the usability and generality of Pewan, we repeated the
construction process for Kurmanji dialect/Latin-based writing system as well. The corresponding details are reported in Section 6.
3.3. Construction of Pewan’s Three Core Components
Within KLPP, we closely followed the TREC’s construction methodology explained in
Section 3.1 and built Pewan’s three core components.
3.3.1. Text Corpus. Like many of the TREC Ad-Hoc tracks (and similar to the existing
Arabic [Saad and Ashour 2010] and Persian [AleAhmad et al. 2009; Esmaili et al. 2007]
corpora), we used news articles to build our text corpus. After surveying all options we
chose two online news agencies: (i) Peyamner [Peyamner 2013], a popular multi-lingual
news agency based in Iraqi Kurdistan, and (ii) the Sorani Kurdish website of Voice Of
America [VOA 2013b]. The main criteria in this process were:
(1) size (number of news articles),
(2) subject diversity,
(3) metadata support (e.g., news category labels),
(4) crawl-friendliness.
For each agency, we developed a crawler to fetch the articles and extract their textual
content. In case of Peyamner, since articles have no language label, we additionally
implemented a simple classifier that decides each page’s language based on the occurrence of language-specific characters.
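A character-based language check of this kind can be sketched as follows. The character sets below are assumptions chosen for illustration (letters that occur in the Kurdish alphabets of Figure 1 but not in standard Arabic, Persian, or English); the actual characters and decision rule used for Peyamner are not given in the text, and the function name `guess_language` is hypothetical.

```python
# Hypothetical marker sets, written as escapes to avoid rendering issues:
# U+0695 (strong R), U+06B5 (strong L), U+06C6 (o), U+06CE (e), U+06A4 (v).
SORANI_CHARS = set("\u0695\u06B5\u06C6\u06CE\u06A4")
# Latin-based Kurdish letters with diacritics: e-, i-, u-circumflex, c-cedilla, s-cedilla.
KURMANJI_CHARS = set("\u00EA\u00EE\u00FB\u00E7\u015F")


def guess_language(text: str) -> str:
    """Label a page by counting occurrences of language-specific characters."""
    lowered = text.lower()
    sorani = sum(ch in SORANI_CHARS for ch in lowered)
    kurmanji = sum(ch in KURMANJI_CHARS for ch in lowered)
    if sorani == 0 and kurmanji == 0:
        return "other"
    return "sorani" if sorani >= kurmanji else "kurmanji"


print(guess_language("\u0628\u06C6"))
```

A production classifier would likely need larger marker sets and a minimum-count threshold, since a single loanword or name can introduce a marker character into a page written in another language.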
Overall, 115340 news articles dated between 2003 and 2012 were collected (96920
from Peyamner and 18420 from VOA). As illustrated in Figure 3a, their sizes range
from 1KB to 154KB (on average 2.8KB). A sample article –about the dust storms in
southwest Iran– from the final corpus is shown in Figure 3b.
3.3.2. Queries. After browsing the Peyamner and VOA websites, each member of our
team compiled a list of possible queries (all members had previously seen sample
queries from TREC 1999 [Voorhees and Harman 1999] and TREC 2004 [Voorhees
2004]). Later, the Google Desktop search tool and our local repository were used to
refine these queries. Eventually, out of the initial 65 queries, 42 were selected to be
used in the pooling/assessment steps. Figure 4 depicts a sample query in Pewan.
Fig. 4: A Sample Query in Pewan: (a) In Sorani, (b) In Kurmanji, (c) In Persian, (d) In English
Queries in Pewan are represented in TREC’s standard format, which has three
elements:
— title, a short version (2-3 keywords) of the query,
— description, a longer and sentence-like version of it,
— narrative, 1-2 paragraphs explaining the scope of expected results (only for human
assessors).
3.3.3. Relevance Judgments. To create the pools we used 5 different systems to generate
5 runs, each containing 500 documents. Three of these systems are widely-used open-source Java retrieval systems –namely MG4J [MG4J ], Apache Lucene [Lucence
2013], and Terrier IR Platform [Terrier 2013]– that we selected based on results of
a study conducted in [Middleton and Baeza-Yates 2007]. After making the necessary
changes to enable them to process Sorani texts (among other things, correct handling
of the zwnj character), we ran all three of them with the OR-ed version of the query description terms (the AND-ed version is extremely selective).
The other two systems were in-house implementations that we have developed and
refined as part of our IR coursework: (i) KLPP-VSM: an implementation of Vector
Space Model based on the specifications given in [Manning et al. 2008], and (ii) KLPP-EBM: an implementation of Extended Boolean Model as described in [Salton et al.
1983]. For the EBM runs, we used the AND-ed version of the query terms (the OR-ed
version’s performance was consistently inferior). In both implementations we used the
TF × IDF weighting scheme recommended in [Manning et al. 2008]:
$$w(t_i, d_j) = \log(1 + f(i, j)) \times \log \frac{|D|}{|\{d \in D : t_i \in d\}|}$$
where $w(t_i, d_j)$ is the total weight of term $t_i$ in the vector representing document $d_j$. In
this formula $f(i, j)$ is the frequency of term $t_i$ in document $d_j$ and $D$ is the collection of
all documents.
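The weighting formula above translates directly into code. The sketch below is illustrative only: documents are assumed to be lists of already-tokenized terms, the natural logarithm is assumed (the base is not specified in the text), and the function name `tfidf_weight` is hypothetical.

```python
import math
from typing import List


def tfidf_weight(term: str, doc: List[str], collection: List[List[str]]) -> float:
    """w(t_i, d_j) = log(1 + f(i, j)) * log(|D| / |{d in D : t_i in d}|)."""
    freq = doc.count(term)                           # f(i, j)
    df = sum(1 for d in collection if term in d)     # documents containing the term
    if freq == 0 or df == 0:
        return 0.0
    return math.log(1 + freq) * math.log(len(collection) / df)


# Toy collection of four tokenized documents.
collection = [["a", "b", "a"], ["b", "c"], ["c", "d"], ["d"]]
w = tfidf_weight("a", collection[0], collection)
print(round(w, 4))
```

A term that occurs in every document receives an IDF factor of log(1) = 0 and therefore contributes nothing to the document vector, which is the intended behavior for very common words.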
The populated pools (depths ranging from 576 to 1671, on average 1147) were then
manually assessed by our team members (all native Kurdish speakers) to generate
the true list of relevant documents for each query. It has been shown [Zobel 1998] that
queries with long lists of relevant documents are more likely to have other relevant
documents missing from the assessment pool. Also, the author of [Hull 1996] has suggested that queries with few relevant documents will tend to have higher variability
than those with many relevant documents. To ensure the reliability of Pewan, out of
the 42 assessed queries, we excluded those with too many (≥ 100) or too few (≤ 10)
relevant documents. The remaining 22 queries are included in the current release of
Pewan. The subject categorization of these queries is presented in Table I.
Category    Query IDs
Politics    3, 9, 11, 12, 13, 15, 20, 23, 30, 36, 39
Health    24, 32, 34
Culture    4, 42
Social    7, 16
Economy    31
Sports    8
Science    27
Media    17
Table I: Subject Categorization of Pewan’s Queries
3.4. Evaluation of Pewan’s Pooling Systems
The results of this evaluation are depicted in Figure 5. Three important observations
are: (i) MG4J consistently outperforms all other systems, (ii) at lower recall values,
Lucene performs quite well; however, the quality of its results degrades faster than that of the
other systems as the recall increases; more precisely, while it is the most precise system
at the first recall point, it has the worst performance for recall points greater than 0.6,
(iii) our implementation of Vector Space Model follows a reverse trend, compared to
that of Lucene; that is, its relative rank among these five systems improves with the
increase in the recall level.
While running these experiments, we also measured the execution times. In short,
while all these systems perform comparably fast in building the inverted lists and
processing the queries, the construction of document vectors in our implementation of
Vector Space Model is by far the most time-consuming computation.7
Based on these results, we chose MG4J –as the best performing system– to be used
in the rest of the experiments (denoted as Baseline hereafter).
4. ENRICHING PEWAN WITH OTHER RESOURCES
On top of the three core test collection components explained in the previous section,
we also built three other IR resources for Kurdish language. Below, they are briefly
introduced8 .
4.1. Stopwords List
Stopwords are words which do not convey meaning on their own and are usually filtered out prior to processing of natural language data. Stopword elimination always
reduces the computational cost (through reducing the index sizes) and in some cases
can marginally improve the quality of the outputs as well.
Although stopwords lists have been built for many languages [Lazarinis et al. 2009],
so far no list has been reported for Kurdish. We used Pewan’s text corpus to build a
list of Kurdish stopwords. To this end, we followed the approach proposed in [Savoy
1999] and first compiled a list of highly-frequent words in Pewan and then manually
examined it to extract stopwords. Our final list contains 282 words and, similar to
other languages, it mainly consists of prepositions. Figure 6 shows the most frequent
stopwords in Pewan’s text corpus.
7 This, however, should be seen in the light of the fact that the other competitors are highly-optimized search engines enjoying advanced and specialized indexes.
8 As noted in [Hasanpoor 1999], a series of linguistics reports on the Kurdish language (e.g., on affixes and stopwords) have been published in The Journal of the Kurdish Academy. However, these reports are in Kurdish and Arabic and have not been electronically archived yet.
Fig. 5: Performance of Pewan’s Pooling Systems (x-axis: Recall; y-axis: Precision; curves: MG4J, Lucene, Terrier, KLPP_VSM, KLPP_EBM)
#    Stopword    Eng. Trans.    Freq.
1    له    from    649337
2    و    and    597696
3    به    with    316692
4    بۆ    for    239346
5    که    which    185279
6    ئهو    that    147520
7    ئهم    this    72354
8    ی    of    53327
9    کرد    made/did    51550
10    لهگهڵ    together    47187
11    لهسهر    about    45435
12    ئهوهی    that which    43617
13    سهر    on/head    39897
14    دوو    two    38212
15    ھهروهھا    also    37895
16    دوای    after    35656
17    لهو    from that    35176
18    دهکات    makes/does    34360
19    چهند    some    31234
20    ھهر    every    31201
Fig. 6: The Top 20 Most-Frequent Words in Pewan’s Text Corpus.
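The first, automatic half of this procedure (ranking words by corpus frequency) is easy to sketch; the manual inspection that follows it cannot be automated in the same way. The snippet assumes whitespace tokenization of already-normalized text, and the function name `top_frequent_words` is hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_frequent_words(documents: Iterable[str], k: int) -> List[Tuple[str, int]]:
    """Return the k most frequent whitespace-delimited words with their counts."""
    counts: Counter = Counter()
    for text in documents:
        counts.update(text.split())
    return counts.most_common(k)


# Toy documents; in practice the input would be the whole text corpus.
docs = ["a b a c", "a b d"]
candidates = top_frequent_words(docs, k=2)
print(candidates)
```

The returned list is only a set of candidates: frequent content words (such as the numeral at rank 14 in Figure 6) still have to be judged by a human before being accepted or rejected as stopwords.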
4.2. Prefixes/Suffixes
Identifying the common prefixes and suffixes is an essential step in processing languages like Kurdish that heavily rely on affixes and their combinations to build new
words. One of the most important uses of such lists is to build rule-based stemmers [Porter 1997]. Pewan’s text corpus was again leveraged to build a list of common
Sorani prefixes and suffixes. To do so, we generated all character-level n-grams for 2 ≤ n ≤ 5
for all documents in Pewan and then manually examined the most frequent n-grams in each case to extract the meaningful affixes. The most widely-used Sorani
affixes are presented in Figure 7 (see footnote 9). As mentioned before, in Kurdish, suffixes are used
more often than prefixes and cover a wider range of functionalities.
#    Suffix    Description
1    ان    plural marker
2    که    definite marker
3    کان    #1 + #2
4    کانی    #1 + #2 + Izafe marker
5    ێک    indefinite marker
6    انی    #1 + Izafe marker
7    يان    third-person plural marker
8    وو    pluperfect marker
9    کرد    auxiliary verb
10    مان    first-person plural marker
11    تان    second-person plural marker
12    کهی    #2 + Izafe marker
13    يش    adverb ("as well")
14    ستان    derivational suffix ("land of")
15    زان    derivational suffix ("master of")
16    ێکی    #5 + Izafe marker
17    بوون    auxiliary verb
18    گه    derivational suffix ("place of")
19    کراو    past participle adjective marker
20    کهوت    auxiliary verb
#    Prefix    Description
I    له    derivational prefix ("from")
II    به    derivational prefix ("with")
III    ده    verb tense marker
IV    نه    negation marker
V    بێ    derivational prefix ("without")
VI    سهر    derivational prefix ("top")
VII    بهر    derivational prefix ("front")
VIII    ناو    derivational prefix ("in")
IX    ڕا    derivational prefix ("up")
X    پێش    derivational prefix ("before")
Fig. 7: The Top Most-Frequent Kurdish Suffixes and Prefixes
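The n-gram counting step that produces affix candidates can be sketched as below. The example uses Latin transliterations in the style of Section 2.3 purely for readability, and both the sample words and the function name `char_ngram_counts` are hypothetical; the real input would be the Arabic-script tokens of the corpus.

```python
from collections import Counter
from typing import Iterable


def char_ngram_counts(words: Iterable[str], n_min: int = 2, n_max: int = 5) -> Counter:
    """Count every character-level n-gram (n_min <= n <= n_max) inside each word."""
    counts: Counter = Counter()
    for word in words:
        for n in range(n_min, n_max + 1):
            for i in range(len(word) - n + 1):
                counts[word[i:i + n]] += 1
    return counts


# Two made-up transliterated words sharing the suffix sequence -aka + -aan.
counts = char_ngram_counts(["naawandakaan", "kurdakaan"])
print(counts.most_common(5))
```

Frequent n-grams are only candidates: as in the procedure described above, they still need to be inspected manually to separate genuine affixes from frequent stems or accidental character sequences.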
4.3. Pewan’s Queries in English and Persian
We translated the title and description parts of all of Pewan’s queries into English and
Persian. Figure 4d and Figure 4c depict the English and Persian equivalents of the sample query presented in Figure 4a. As shown later in Section 5, these translated queries,
coupled with English-to-Sorani and Persian-to-Sorani dictionaries, can be used to run
and evaluate the performance of CLIR systems on Sorani documents using English
and Persian queries.
5. EXPERIMENTS
In this section we present the results of four sets of experiments that we carried out
using Pewan. The first set is on the effectiveness of some of the basic IR techniques on
Sorani documents. In the second and third sets, we investigate the impact of n-grams
and light-stemming respectively. Lastly, in the fourth set, we look into the problem of
CLIR using query translation.
In all experiments, the standard 11-point Precision-Recall curves and Mean Average
Precision (MAP) measures have been used to compare different setups. Moreover, in all
cases –unless explicitly mentioned– the longer version of the queries (the description
part) were used. This follows the works reported in [Voorhees 1994; Xu and Croft 1998;
Hull 1996].
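For readers unfamiliar with these measures, the following sketch computes the 11-point interpolated precision values and the average precision for a single query; MAP is then the mean of the average precision over all queries. It assumes a ranked list of document identifiers and a set of relevant identifiers, and the function names are hypothetical rather than taken from any evaluation tool.

```python
from typing import List, Set


def eleven_point_precision(ranking: List[str], relevant: Set[str]) -> List[float]:
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    hits = 0
    points = []  # (recall, precision) measured at each relevant document
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    levels = [i / 10 for i in range(11)]
    # Interpolation: best precision observed at any recall >= the given level.
    return [max((p for r, p in points if r >= level), default=0.0) for level in levels]


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Mean of the precision values at the ranks of the relevant documents."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


ranking = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
curve = eleven_point_precision(ranking, relevant)
ap = average_precision(ranking, relevant)
print(curve, ap)
```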
The experiments were run on a powerful desktop PC with 4×3.2GHz Xeon Processors and 4GB of RAM, running Windows 7.
5.1. Basic IR Techniques
Here we briefly describe the techniques that are considered in these experiments:
— Normalization: is considered an IR preprocessing task which unifies different representations of multi-code characters. In all of our experiments we have used (and will
use) the normalized version of the text corpus and also treated the zwnj character as
a non-delimiter character. However, in order to quantitatively measure the impact
of text preparation step in Sorani IR, we ran a single experiment on the raw version
of the text corpus in which characters were not normalized and zwnj was treated
like a delimiter.
9 Izafe
is an unstressed vocal -e or -i added between prepositions, nouns and adjectives in a phrase. It
approximately corresponds to the English preposition of. This construct is frequently used in both Persian
and Kurdish languages.
A:12
Fig. 8: Performance Graphs (Precision versus Recall). Panels and plotted series: (a) Basics: Baseline, Stopwords_Removed, Short_Queries, Raw; (b) n-grams: Baseline, 2-grams, 3-grams, 4-grams, 5-grams, 6-grams; (c) Light-Weight Stemming: Baseline, Stemmed_2_16, Stemmed_3_16, Stemmed_4_15; (d) Query Translation: Baseline, CLIR_English, CLIR_Persian.
— Stopword elimination: Pewan’s stopwords list (from Section 4) was used to carry out
the corresponding experiment.
— Query Length: as described in Section 3.3, each Pewan query has a short (title) and a
long (description) version. In all our other experiments, we have used the description part
of the queries. To measure the impact of query length, we ran a single experiment
in which the shorter version of the queries was used (this is particularly important
for the evaluation of Web IR systems).
As the results of these experiments show (Figure 8a), normalization is a critical
step in Sorani text processing. Furthermore, the relatively good performance of the
short queries can be attributed to the fact that the query keywords were chosen
carefully in order to preserve the essence of the user’s information need in the
shorter version of the queries. Finally, elimination of the stopwords slightly improves
the quality of the outputs.
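The difference between the normalized and the raw setups can be illustrated with a short Python sketch. The character mapping below is only an assumed example of unifying multi-code characters (Arabic Kaf and Yeh mapped to the Keheh and Farsi Yeh code points); it is not Pewan's actual normalization table, and the function names are hypothetical.

```python
ZWNJ = "\u200c"  # zero-width non-joiner

# Assumed, illustrative mapping only: the real normalization table may differ.
UNIFY = str.maketrans({
    "\u0643": "\u06a9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
    "\u064a": "\u06cc",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
})


def normalize(text: str) -> str:
    """Map alternative code points of the same letter to a single one."""
    return text.translate(UNIFY)


def tokenize(text: str, zwnj_is_delimiter: bool = False) -> list:
    """Split on whitespace; optionally also split at zwnj (the 'raw' setup)."""
    if zwnj_is_delimiter:
        text = text.replace(ZWNJ, " ")
    # str.split() does not treat zwnj as whitespace, so by default it stays
    # inside the token.
    return text.split()


def prepare(text: str, raw: bool = False) -> list:
    """Normalized setup (default) versus raw setup."""
    if raw:
        return tokenize(text, zwnj_is_delimiter=True)
    return tokenize(normalize(text), zwnj_is_delimiter=False)
```

In the raw setup a word containing zwnj is indexed as two separate tokens, and two spellings of the same word that differ only in code points are indexed as different terms, which is one plausible reading of why the Raw curve in Figure 8a is the lowest.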
5.2. n-grams
Character n-gram tokenization is a popular language-blind retrieval technique which generally improves the
quality of IR systems’ outputs [Mcnamee and Mayfield 2004]. In our experiments, the
following process was repeated 5 times (for 2 ≤ n ≤ 6): first each term in Pewan’s
documents and queries was replaced by its corresponding character-level n-grams (our
n-gram strings do not contain delimiters), then our IR engine was applied to the transformed collection and its effectiveness was measured.
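The term-to-n-gram replacement can be sketched as follows in Python. This is an illustrative sketch rather than our implementation: the function names are hypothetical, and keeping terms shorter than n as a single unit is an assumption, since the handling of short terms is not specified above.

```python
from typing import List


def char_ngrams(term: str, n: int) -> List[str]:
    """Overlapping character n-grams of a single term (never spanning delimiters)."""
    if len(term) <= n:
        # Assumption: terms not longer than n are kept whole.
        return [term]
    return [term[i:i + n] for i in range(len(term) - n + 1)]


def ngram_transform(text: str, n: int) -> List[str]:
    """Replace every whitespace-delimited term of a text by its n-grams."""
    grams: List[str] = []
    for term in text.split():
        grams.extend(char_ngrams(term, n))
    return grams
```

The same transformation is applied to both documents and queries, so that matching happens at the n-gram level instead of the word level.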
The results are depicted in Figure 8b. The main findings can be summarized as
follows: (i) among all n-gram configurations, n = 4 is a clear winner, and (ii) similar
to Persian [AleAhmad et al. 2007] and in contrast to Arabic [Xu et al. 2002], Kurdish
can benefit from n-grams. This is due to the fact that, unlike in Arabic, derivative words
in Persian and Kurdish preserve the root and add only prefixes and suffixes to it.
5.3. Stemming
Stemming is the process of reducing a word to its stem or root form. It allows documents to be found and matched even when they express a term in a morphological form different from the one used in the
query [Lazarinis et al. 2009]. The benefits of stemming are
more pronounced in languages with complex morphology (e.g., German [Braschler and
Ripplinger 2004] and Arabic [Xu et al. 2002]).
We designed a simple rule-based light-weight stemmer for the Kurdish language which
uses Pewan’s prefix/suffix lists. In order to handle the problem of multiple suffixes
(as demonstrated by the example in Section 2.3), our stemmer adopts a recursive approach.
Moreover, to prevent over-stemming (a common error in rule-based stemmers
caused by blindly removing substrings that are part of the word’s stem), we have defined two parameters:
— minimum length (L): this parameter is adopted from Lovins’ stemmer [Lovins 1968]
and guarantees that the length of the final stem is always at least L,
— minimum frequency (F): this parameter only concerns prefix removal steps (as removing prefixes is more likely to result in errors [Hull 1996; Paice 1994]). A prefix
can be removed from a word only if the remaining string has a frequency of at least F
in the corpus.
We performed a sensitivity analysis on our stemmer by varying these two parameters.
A summary of this study is shown in Figure 8c and indicates that: (i) stemming is
beneficial to Kurdish IR, and (ii) the combination of (L = 3, F = 16) has the best
performance on Pewan.
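The interplay of the recursive affix removal and the two parameters can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the stemmer's actual code: the order of operations (suffixes before prefixes, longest affix first) is assumed, the function name is hypothetical, and the toy affix lists in the usage note are English-like placeholders rather than entries from Pewan's lists.

```python
from typing import Iterable, Mapping


def light_stem(word: str,
               prefixes: Iterable[str],
               suffixes: Iterable[str],
               corpus_freq: Mapping[str, int],
               min_len: int = 3,    # parameter L
               min_freq: int = 16   # parameter F
               ) -> str:
    """Recursively strip affixes while respecting the L and F constraints."""
    # Try suffixes first (assumed order), longest first.
    for suffix in sorted(suffixes, key=len, reverse=True):
        if suffix and word.endswith(suffix):
            stem = word[:-len(suffix)]
            if len(stem) >= min_len:
                # Recurse to handle words that carry multiple suffixes.
                return light_stem(stem, prefixes, suffixes,
                                  corpus_freq, min_len, min_freq)
    # Prefix removal is riskier, so it is additionally guarded by F.
    for prefix in sorted(prefixes, key=len, reverse=True):
        if prefix and word.startswith(prefix):
            rest = word[len(prefix):]
            if len(rest) >= min_len and corpus_freq.get(rest, 0) >= min_freq:
                return light_stem(rest, prefixes, suffixes,
                                  corpus_freq, min_len, min_freq)
    return word
```

With the placeholder lists `prefixes = ["re"]` and `suffixes = ["s", "ing", "er"]`, the call `light_stem("reprinters", prefixes, suffixes, {"print": 20})` strips "s" and "er" recursively and then removes the prefix, because the remainder "print" is frequent enough. With an empty frequency table the prefix is kept and the result is "reprint".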
5.4. Cross-Lingual Information Retrieval
In the last set of our experiments, we used our implementation of a simple word-by-word dictionary-based translation approach [Ballesteros and Croft 1996] and conducted two CLIR experiments:
— from English: using Dicto [?], an online, high-quality, bidirectional English-Sorani
dictionary
— from Persian: using Hajir [Abollahpour 2013], a medium-size (around 40,000 entries), unidirectional Persian-to-Sorani dictionary in PDF format
In both cases, we always chose the first word from the first set of translation candidates (a.k.a. sense).
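The selection rule just described can be sketched in Python as follows. The dictionary layout (each source word mapped to a list of senses, each sense being a list of candidate translations) and the function name are assumptions for illustration, and dropping words that have no dictionary entry is also an assumption rather than a documented behavior of our system.

```python
from typing import Dict, List

# Hypothetical layout: source word -> list of senses -> list of candidates.
BilingualDict = Dict[str, List[List[str]]]


def translate_query(query: str, dictionary: BilingualDict) -> List[str]:
    """Word-by-word translation: first candidate of the first sense."""
    translated: List[str] = []
    for word in query.lower().split():
        senses = dictionary.get(word)
        if senses and senses[0]:
            translated.append(senses[0][0])
        # Assumption: out-of-vocabulary words (e.g., named entities) are dropped.
    return translated
```

Under this rule, a query word with several senses always contributes exactly one target term, so no disambiguation takes place, and query words missing from the dictionary contribute nothing to the translated query.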
The results of these experiments are shown in Figure 8d. As expected, the performance of this naive translation approach is inferior to that of the monolingual experiments. In the case of CLIR from Persian, the dictionary’s relatively small size (e.g., it has
no entries for Named Entities) has also contributed to the poor quality of the outputs.
In the future, we plan to improve these results by applying pre- and post-translation
query modifications [Ballesteros and Croft 1996].
5.5. Summary
A summary of our experimental study using the MAP measure is shown in Figure 9.
The noteworthy findings are:
— normalization is a critical step in Kurdish text processing,
Fig. 9: Performance of Different Systems/Setups based on the Mean Average Precision
(MAP) Measure (vertical axis: Mean Average Precision, 0.0 to 0.5)
— Kurdish IR can benefit from light-stemming and n-grams. Moreover, similar to the results reported for European languages [Mcnamee and Mayfield 2004], the best n-gram performance is achieved for n = 4,
— stopword elimination results in both computational and performance gains (although marginal),
— further resources and techniques are needed to improve the performance of cross-lingual IR on Kurdish documents.
6. AUGMENTING PEWAN WITH KURMANJI DOCUMENTS
In order to guarantee the usability and generality of Pewan, we repeated the construction process reported in Section 3.3 for the Kurmanji dialect and its Latin-based writing system
as well. Here are the notable details:
— to build the Kurmanji text corpus, we used Peyamner as well as the Kurmanji Kurdish website of Voice Of America [VOA 2013a]. Overall, 25572 news articles were
collected (19873 from Peyamner and 5699 from VOA). Their sizes range from 1KB
to 42KB.
— queries are the transliterated/translated equivalents of Pewan’s queries.
The Kurmanji/Latin-based alphabet version of the sample query in Figure 4a is
shown in Figure 4b.
— to create the pools we used the same 5 systems as in Section 3.3, although with
shorter runs (100 documents, instead of 500). Then, the populated pools (depths
ranging from 77 to 302, on average 213) were again manually assessed by our team
members and the relevance judgments were generated.
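For a single query, the pooling step in the last item above amounts to taking the union of the top-ranked documents returned by each system. The following Python sketch illustrates this under the assumption that each run is available as a ranked list of document identifiers; the function name and data layout are hypothetical.

```python
from typing import Dict, Sequence, Set


def build_pool(runs: Dict[str, Sequence[str]], depth: int = 100) -> Set[str]:
    """Union of the top-`depth` documents of every system's run for one query."""
    pool: Set[str] = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return pool
```

Because different systems often retrieve the same documents, the pool for a query is usually much smaller than the number of systems times the run length. Each pooled document is then judged manually.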
Similarly, these judgments were used to compare the performance of the pooling
systems. The results of this evaluation are depicted in Figure 10. Comparing
these new results with those in Figure 5, the following observations can be made:
(i) in general, the Kurmanji curves are less steep, meaning worse precision at low
recall levels and better precision at high recall levels. This is partly due to the fact
that relevance judgment lists are shorter in the Kurmanji collection (on average, 12.5
versus 42), and (ii) although MG4J is again the best-performing engine overall, it
is not as dominant as in Figure 5. In fact, Lucene is competitive at high recall levels,
and our implementation of the Vector Space Model is competitive at middle levels.
Fig. 10: Performance of Pewan’s Pooling Systems (on Kurmanji Documents). Precision versus Recall curves for MG4J, Lucene, KLPP_VSM, Terrier and KLPP_EBM.
7. CONCLUSIONS AND FUTURE WORK
In this paper we introduced Pewan, the first standard test collection for the Sorani language, and briefly explained its construction details. Pewan is freely available and can
be obtained from [Pewan 2013]. In addition to the test collection, the compressed file
also contains: (i) a list of prefixes/suffixes, (ii) a list of stopwords, and (iii) the English
translation of Pewan’s queries.
With the help of Pewan, we also conducted an experimental study to investigate the
effectiveness of well-known IR techniques on Kurdish documents. Our results show
that normalization and, to a lesser extent, stemming can greatly improve the quality
of Sorani IR systems.
Pewan and the experimental study presented in this paper are the preliminary outcomes of our KLPP project. There are many avenues to continue this work. First, we
would like to further enrich our test collection by augmenting it with classification
tags. Our plan is to start with retrieving subject labels (e.g., arts, sports, etc.) from the
original source news agencies and then cross-check (and correct, if needed) the labeling
accuracy. This will allow Pewan to be used for classification tasks as well.
In this work, we gave an overview of the diversity problem of the Kurdish language
and highlighted the need for a solution to cater for this diversity. In particular, building
a transliteration/translation system between the Sorani and Kurmanji branches of
Kurdish is an important open problem. As another line of research, we plan to address
this problem.
Finally, we would like to use Pewan for other IR tasks. In particular, we plan to
improve the performance of our CLIR system and also investigate the effectiveness of
statistical stemmers on Kurdish IR.
ACKNOWLEDGMENTS
The authors would like to thank Somayeh Yosefi, Purya Aliabadi, Donya Eliassi, Shownem Hakimi and
Asrin Mohammadi for their help in the manual assessment of the relevance judgments.
REFERENCES
Hajir Abollahpour. 2013. Hajir Dictionary. http://kurmanj.ir/news.php?readmore=76. (2013).
Abolfazl AleAhmad, Hadi Amiri, Ehsan Darrudi, Masoud Rahgozar, and Farhad Oroumchian. 2009.
Hamshahri: A standard Persian Text Collection. Knowledge-Based Systems 22, 5 (2009), 382–387.
Abolfazl AleAhmad, Parsia Hakimian, Farzad Mahdikhani, and Farhad Oroumchian. 2007. N-gram and
Local Context Analysis for Persian Text Retrieval. In Proceedings of ISSPA. 1–4.
Lisa Ballesteros and W. Bruce Croft. 1996. Dictionary Methods for Cross-Lingual Information Retrieval. In
DEXA. 791–801.
Wafa Barkhoda, Bahram ZahirAzami, Anvar Bahrampour, and Om-Kolsoom Shahryari. 2009. A Comparison between Allophone, Syllable, and Diphone based TTS Systems for Kurdish Language. In Signal
Processing and Information Technology (ISSPIT), 2009 IEEE International Symposium on. 557–562.
Martin Braschler and Bärbel Ripplinger. 2004. How Effective is Stemming and Decompounding for German
Text Retrieval? Information Retrieval 7, 3-4 (2004), 291–316.
CLEF. 2013. Conference and Labs of the Evaluation Forum. http://www.clef-initiative.eu/. (2013).
Kyumars Sheykh Esmaili. 2012. Challenges in Kurdish Text Processing. CoRR abs/1212.0074 (2012).
Kyumars Sheykh Esmaili, Hassan Abolhassani, Mahmood Neshati, Ehsan Behrangi, Asreen Rostami, and
Mojtaba Mohammadi Nasiri. 2007. Mahak: A Test Collection for Evaluation of Farsi Information Retrieval Systems. In Proceedings of the 5th IEEE/ACS International Conference on Computer Systems
and Applications (AICCSA ’07). 639–644.
Kyumars Sheykh Esmaili and Shahin Salavati. 2013. Sorani Kurdish versus Kurmanji Kurdish: An Empirical Comparison. In Proceedings of the 51st Annual Meeting of the Association for Computational
Linguistics (ACL’13). 300–305.
Kyumars Sheykh Esmaili, Shahin Salavati, Somayeh Yosefi, Donya Eliassi, Purya Aliabadi, Shownem
Hakimi, and Asrin Mohammadi. 2013. Building a Test Collection for Sorani Kurdish. In Proceedings
of the 10th IEEE/ACS International Conference on Computer Systems and Applications (AICCSA ’13).
Ali Farghaly and Khaled F. Shaalan. 2009. Arabic Natural Language Processing: Challenges and Solutions.
ACM Transactions on Asian Language Information Processing 8, 4 (2009).
FIRE. 2013. Forum for Information Retrieval Evaluation. http://www.isical.ac.in/~clia/. (2013).
Gérard Gautier. 1998. Building a Kurdish Language Corpus: An Overview of the Technical Problems. In
Proceedings of ICEMCO.
Guardian. 2013. The Guardian. www.guardian.co.uk/. (2013).
Geoffrey Haig and Yaron Matras. 2002. Kurdish Linguistics: A Brief Overview. Language Typology and
Universals 55, 1 (2002).
Jafar Hasanpoor. 1999. A Study of European, Persian and Arabic loans in Standard Sorani. Uppsala University.
Amir Hassanpour, Jaffer Sheyholislami, and Tove Skutnabb-Kangas. 2012. Introduction. Kurdish: Linguicide, Resistance and Hope. International Journal of the Sociology of Language 217 (2012), 1–8.
Harold Stanley Heaps. 1978. Information Retrieval: Computational and Theoretical Aspects. Academic
Press, Inc. Orlando, FL, USA.
David A Hull. 1996. Stemming Algorithms: A Case Study for Detailed Evaluation. Journal of the American
Society for Information Science 47, 1 (1996), 70–84.
Fotis Lazarinis, Jesús Vilares, John Tait, and Efthimis N. Efthimiadis. 2009. Current Research Issues and
Trends in Non-English Web Searching. Information Retrieval 12, 3 (2009), 230–250.
Julie B Lovins. 1968. Development of a Stemming Algorithm. MIT Information Processing Group, Electronic
Systems Laboratory.
Lucene. 2013. Apache Lucene. http://lucene.apache.org/. (2013).
David N. MacKenzie. 1961. Kurdish Dialect Studies. Oxford University Press.
Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information
Retrieval. Cambridge University Press, New York, NY, USA.
Paul Mcnamee and James Mayfield. 2004. Character N-Gram Tokenization for European Language Text
Retrieval. Information Retrieval 7, 1-2 (2004), 73–97.
MG4J. Managing Gigabytes for Java. Available at: http://mg4j.dsi.unimi.it/. (????).
Christian Middleton and Ricardo Baeza-Yates. 2007. A Comparison of Open Source Search Engines.
arXiv:1212.0074. (2007).
NTCIR. 2013. NII Test Collection for IR Systems. http://research.nii.ac.jp/ntcir/index-en.html. (2013).
Chris D Paice. 1994. An Evaluation Method for Stemming Algorithms. In Proceedings of ACM SIGIR’94.
42–50.
Pewan. 2013. Pewan’s Download Link. https://dl.dropbox.com/u/10883132/Pewan.zip. (2013).
Peyamner. 2013. Peyamner News Agency. http://www.peyamner.com/. (2013).
M. F. Porter. 1997. An Algorithm for Suffix Stripping. Morgan Kaufmann Publishers Inc., 313–316.
Zobia Rehman, Waqas Anwar, and Usama Ijaz Bajwa. 2011. Challenges in Urdu Text Tokenization and
Sentence Boundary Disambiguation. In Proceedings of the 2nd Workshop on South and Southeast Asian
Natural Language Processing (WSSANLP), IJCNLP 2011. 40–45.
Motaz K. Saad and Wesam Ashour. 2010. OSAC: Open Source Arabic Corpus. In The 6th International
Symposium on Electrical and Electronics Engineering and Computer Science. 557–562.
Gerard Salton, Edward A. Fox, and Harry Wu. 1983. Extended Boolean Information Retrieval. Commun.
ACM 26, 11 (1983).
Pollet Samvelian. 2007. A Lexical Account of Sorani Kurdish Prepositions. In Proceedings of International
Conference on Head-Driven Phrase Structure Grammar. 235–249.
Jacques Savoy. 1999. A Stemming Procedure and Stopword List for General French Corpora. JASIS 50, 10
(1999), 944–952.
Mehrnoush Shamsfard. 2011. Challenges and Open Problems in Persian Text Processing. In Proceedings of
LTC’11.
Mehrnoush Shamsfard, Hoda Sadat Jafari, and Mahdi Ilbeygi. 2010. STeP-1: A Set of Fundamental Tools
for Persian Text Processing. In LREC.
Jaffer Sheyholislami. 2010. Identity, Language, and New Media: the Kurdish Case. Language Policy 9, 4 (2010),
289–312.
Karen Sparck-Jones and C.J. van Rijsbergen. 1976. Information Retrieval Test Collections. Journal of Documentation 32, 1 (1976), 59–72.
Terrier. 2013. Terrier IR Platform. http://terrier.org/. (2013).
Wheeler M. Thackston. 2006a. Kurmanji Kurdish: A Reference Grammar with Selected Readings. Harvard
University.
Wheeler M. Thackston. 2006b. Sorani Kurdish: A Reference Grammar with Selected Readings. Harvard
University.
TREC. 2013. Text REtrieval Conference. http://trec.nist.gov/. (2013).
VOA. 2013a. Voice of America - Kurdish (Kurmanji) . http://www.dengeamerika.com/. (2013).
VOA. 2013b. Voice of America - Kurdish (Sorani). http://www.dengiamerika.com/. (2013).
Ellen M Voorhees. 1994. Query Expansion Using Lexical-Semantic Relations. In Proceedings of ACM SIGIR’94. 61–69.
Ellen M Voorhees. 2004. Overview of the TREC 2004 Robust Retrieval Track. In Proceedings of TREC 2004.
Ellen M Voorhees and Donna Harman. 1999. Overview of the Eighth Text Retrieval Conference (TREC-8).
In Proceedings of TREC, Vol. 8. 1–24.
Géraldine Walther. 2011. Fitting into Morphological Structure: Accounting for Sorani Kurdish Endoclitics.
In The Proceedings of the Eighth Mediterranean Morphology Meeting.
Géraldine Walther and Benoît Sagot. 2010. Developing a Large-scale Lexicon for a Less-Resourced Language. In SaLTMiL’s Workshop on Less-resourced Languages (LREC).
Jinxi Xu and W Bruce Croft. 1998. Corpus-based Stemming Using Cooccurrence of Word Variants. ACM
TOIS 16, 1 (1998), 61–81.
Jinxi Xu, Alexander Fraser, and Ralph Weischedel. 2002. Empirical Studies in Strategies for Arabic Retrieval. In Proceedings ACM SIGIR’02. 269–274.
Justin Zobel. 1998. How Reliable Are the Results of Large-scale Information Retrieval Experiments?. In
Proceedings of ACM SIGIR’98. 307–314.
