Towards Kurdish Information Retrieval

Kyumars Sheykh Esmaili, Nanyang Technological University
Shahin Salavati, University of Kurdistan
Anwitaman Datta, Nanyang Technological University

The Kurdish language is an Indo-European language spoken in Kurdistan, a large geographical region in the Middle East. Despite having a large number of speakers, Kurdish is among the less-resourced languages and has not seen much attention from the IR and NLP research communities. This paper reports on the outcomes of a project aimed at providing essential resources for processing Kurdish texts. A principal output of this project is Pewan, the first standard Test Collection to evaluate Kurdish Information Retrieval systems. The other language resources that we have built include a light-weight stemmer and a list of stopwords. Our second principal contribution is using these newly-built resources to conduct a thorough experimental study on Kurdish documents. Our experimental results show that normalization and, to a lesser extent, stemming can greatly improve the performance of Kurdish IR systems.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval
General Terms: Design, Measurement, Experimentation, Performance
Additional Key Words and Phrases: Kurdish Language, Bi-Standard Languages, Test Collection, Stemming, Cross-Lingual Information Retrieval

1. INTRODUCTION

With the growing number of non-English web searchers, the efficient handling of non-English documents and user queries is becoming a major issue for search engines [Lazarinis et al. 2009]. To tackle this issue, a range of workshops and forums have been launched over the last decade. Three important examples are CLEF [CLEF 2013] for European languages, NTCIR [NTCIR 2013] for Asian languages and FIRE [FIRE 2013] for Indian languages.
The Kurdish language is an Indo-European language spoken in Kurdistan, a large geographical region in the Middle East. Despite having 20 to 30 million native speakers [Haig and Matras 2002; Hassanpour et al. 2012; Thackston 2006b; 2006a], Kurdish is among the less-resourced languages. More specifically, despite a few attempts at building a corpus [Gautier 1998] and a lexicon [Walther and Sagot 2010], Kurdish still does not have a large-scale and reliable corpus. Similarly, no test collection (which is central to Information Retrieval research and development) or stemming algorithm has been developed for this language.

In recent years, Kurdish autonomy in the area and the consequent rise in importance of communications in Kurdish in Iraq (and lately in Turkey), as well as the proliferation of social media and Internet use in both Kurdistan and the diaspora [Sheyholislami 2010], have prompted the need for Kurdish language processing. To cater for this need, we have recently launched the Kurdish Language Processing Project (KLPP)¹, aiming at providing basic tools and techniques for Kurdish text processing. This paper reports on KLPP's main outcomes so far².

¹ Project page at http://eng.uok.ac.ir/esmaili/research/klpp/en/main.htm

Author's addresses: K. Sheykh Esmaili and A. Datta, School of Computer Engineering, Nanyang Technological University, Singapore; S. Salavati, Computer Engineering Department, University of Kurdistan, Sanandaj, Iran.
Journal, Vol. V, No. N, Article A, Publication date: January YYYY.

Our first primary contribution is Pewan, which to the best of our knowledge is the first standard test collection to evaluate Kurdish IR systems. To build Pewan, we have carefully followed the standard test collection construction methodology of TREC [TREC 2013]. More specifically, we first built a relatively large Kurdish text corpus, and then used a powerful desktop search tool to compile a list of queries.
Next, three widely used open-source IR systems as well as our own implementations of two well-known retrieval models were used to create result pools for all queries. These pools were then manually assessed by our team to generate the true list of relevant documents for each query.

The other language resources that we have constructed are a light-weight stemmer, a list of affixes, and a list of stopwords. We have also translated Pewan's queries into English and Persian, hence making it possible for researchers to investigate the Cross-Lingual Information Retrieval (CLIR) problem as well.

Our second important contribution is a comprehensive experimental study of basic IR techniques on Kurdish documents. Our experimental results show that normalization and, to a lesser extent, stemming can greatly improve the quality of Kurdish IR systems. They also highlight the need for further research into the problem of cross-language IR on Kurdish documents.

All of the aforementioned resources are freely accessible and can be obtained from [Pewan 2013]. We hope that making these resources publicly available will bolster IR research on the Kurdish language.

The rest of the paper is organized as follows. We first give a brief description of the main challenges in Kurdish text processing in Section 2. Then in Section 3 we focus on the Sorani branch of Kurdish and explain the process of constructing Pewan's text corpus and test collection. Section 4 introduces the additional resources built on top of Pewan. The results of our experimental study with Pewan are reported in Section 5. Section 6 reports on how we augmented Pewan with Kurmanji documents and relevance judgments. Finally, we conclude the paper in Section 7.

2. CHALLENGES IN KURDISH TEXT PROCESSING

The Kurdish language belongs to the Indo-Iranian family of Indo-European languages. Its closest better-known relative is Persian. Kurdish has two main branches, namely Sorani and Kurmanji, and is spoken in Kurdistan, a large geographical area spanning the intersections of Turkey, Iran, Iraq and Syria.
It is one of the two official languages of Iraq and has a regional status in Iran.

Apart from the resource-scarceness problem, we have identified four other challenges in processing Kurdish texts. The first three of these challenges highlight the problems arising from the Kurdish language's inherent diversity and complexity. The fourth challenge is posed by the implementational issues in Kurdish's Arabic-based writing system.

2.1. Dialect Diversity

The first and foremost challenge in processing the Kurdish language is its dialect diversity [Esmaili 2012]. In this paper we focus on Sorani and Kurmanji, which are the two most important dialects in terms of number of speakers [Haig and Matras 2002]. Together, they account for more than 75% of native Kurdish speakers [Walther and Sagot 2010].

² An earlier and shorter version of this paper titled "Building a Test Collection for Sorani Kurdish" [Esmaili et al. 2013] is to appear in the proceedings of the 10th ACS/IEEE Conference on Computer Systems and Applications (AICCSA13).

(a) One-to-One Mappings
# | Arabic-based | Latin-based
1 | ا | A
2 | ب | B
3 | ج | C
4 | چ | Ç
5 | د | D
6 | ێ | Ê
7 | ف | F
8 | گ | G
9 | ژ | J
10 | ک | K
11 | ل | L
12 | م | M
13 | ن | N
14 | ۆ | O
15 | پ | P
16 | ق | Q
17 | ر | R
18 | س | S
19 | ش | Ş
20 | ت | T
21 | وو | Û
22 | ڤ | V
23 | خ | X
24 | ز | Z

(b) One-to-Two Mappings
# | Arabic-based | Latin-based
25 | ئ / | I
26 | و | U/W
27 | ی | Y/Î
28 | ه | E/H

(c) One-to-Zero Mappings
# | Arabic-based | Latin-based
29 | ڕ | (RR)
30 | ڵ | -
31 | ع | (E)
32 | غ | (X)
33 | ح | (H)

Fig. 1: The two standard Kurdish Alphabets.

The features distinguishing these two dialects are morphological as well as phonological.
The important morphological differences are [MacKenzie 1961; Haig and Matras 2002]:

(1) Kurmanji is more conservative in retaining both gender (feminine:masculine) and case opposition (absolute:oblique) for nouns and pronouns, whereas Sorani has largely abandoned this system and uses the pronominal suffixes to take over the functions of the cases,
(2) the definite suffix -aka appears only in Sorani,
(3) in the past-tense transitive verbs, Kurmanji has the full ergative alignment but Sorani, having lost the oblique pronouns, resorts to pronominal enclitics, and
(4) in Sorani, passive and causative can be created exclusively via verb morphology, whereas in Kurmanji they can also be formed with the verbs hatin (to come) and dan (to give) respectively.

2.2. Script Diversity

Due to geopolitical reasons [Sheyholislami 2010], each of the two aforementioned dialects has been using its own writing system. In fact, Kurdish is considered a bi-standard language [Gautier 1998], with Sorani almost exclusively written in Arabic-based letters and Kurmanji almost exclusively written in Latin-based letters. Both of these systems are phonetic [Gautier 1998]; that is, vowels are explicitly represented and their use is mandatory. In the case of the Arabic-based alphabet, this is a significant advantage over Arabic-based writing systems in other languages (e.g., Persian, Arabic, and Urdu) where the use of vowels is optional, causing a range of processing ambiguities³.

Figure 1 shows both the Arabic-based and the Latin-based standard Kurdish alphabets and the mappings between them, which we have categorized into three classes:

- One-to-one mappings (Figure 1a), covering a large subset of the characters,
- One-to-two mappings (Figure 1b), which reflect the inherent discrepancies between the two writing systems [Barkhoda et al. 2009]. While transliterating between these two alphabets, the contextual information can provide hints in choosing the right counterpart.
- One-to-zero mappings (Figure 1c), which can be split further into two distinct categories: (i) the strong L and strong R characters ({ } and { }) are used only in Sorani Kurdish (although there are a handful of words with the latter in Kurmanji too) and demonstrate the inherent phonological differences between Sorani and Kurmanji dialects, and (ii) the remaining three characters are almost entirely used in Arabic loanwords in Sorani (in Kurmanji they are approximated with other characters).

³ In fact, as discussed in [Farghaly and Shaalan 2009], the absence of short vowels contributes most significantly to ambiguity in the Arabic language, causing difficulty in homograph resolution, word sense disambiguation, and part-of-speech detection. In Persian, its negative consequence is more visible in detecting the Izafe constructs [Shamsfard 2011].

2.3. Complex Morphology

Kurdish has a relatively complex morphology [Samvelian 2007; Walther 2011]. One of the driving factors behind this complexity is the wide use of suffixes. Some of the most common suffixes are (an extended list is given in Section 4):

(1) -i, the Izafe construction marker (approximately corresponds to the English preposition "of"),
(2) -aan, the plural noun marker,
(3) -aka and -ek, the definiteness and indefiniteness markers, and
(4) personal verb endings/possessive pronouns.

For example, as depicted below, the phrase { } naawandakaani dangdaanmaan "our voting centers" consists of a noun ({ } naawand "center") and a non-finite verb ({ } dangdaan "voting") and four different suffixes.
naawand (noun) + aka (definite marker) + aan (plural marker) + i (Izafe marker) + dangdaan (non-finite verb) + maan (possessive pronoun)

To demonstrate the morphological complexity of the Kurdish language empirically, we carried out an experiment in which the proportion of distinct words to the total number of words (a variation of Heaps' law [Heaps 1978]) was computed for both dialects of Kurdish as well as three other languages: English, Persian, and Arabic. In this experiment, the English corpus consisted of the editorial articles of The Guardian newspaper [Guardian 2013], and the Persian and the Arabic corpora were drawn from the standard Hamshahri Collection [AleAhmad et al. 2009] and the OSAC Corpus [Saad and Ashour 2010], respectively. The Kurdish documents were collected mainly from the Peyamner News Agency [Peyamner 2013].

As the corresponding curves in Figure 2 show, the number of distinct words in both Sorani and Kurmanji Kurdish is higher than in English and Persian. Moreover, the ratio in Sorani is comparable to that of Arabic, which has a notoriously complex morphology system⁴.

Another interesting observation here is the difference between the ratios for Sorani and Kurmanji. Two primary sources of these differences are: (i) the inherent linguistic differences between the two dialects as mentioned earlier in Section 2.1, and (ii) the writing style differences; more specifically, use of space-delimited words is less common in Sorani [Esmaili and Salavati 2013]. For instance, the to-be verb ({ } boon) as well as some of the most common prepositions (e.g., { } ish "too") are used as suffixes.

⁴ A similar observation between Arabic and English has been reported by Xu et al. in [Xu et al. 2002].

2.4. Text Preprocessing

The Arabic-based writing system of the Kurdish language poses a set of challenges in the text preparation and preprocessing phase. Below, we highlight two types of such challenges.
Fig. 2: Number of Distinct Words for Different Languages (x-axis: total number of words; y-axis: number of distinct words; curves: Arabic, Sorani, Kurmanji, Persian, English).

2.4.1. Normalization. The Unicode assignments of the Arabic-based Kurdish alphabet have two potential sources of ambiguity which should be dealt with carefully:

- for some letters such as ye and ka there is more than one Unicode representation ({ } versus { } and { } versus { }). During the normalization phase, the occurrences of these multi-code letters should be unified.
- as in Urdu, the isolated { } and final { } forms of the letter ha constitute one letter (pronounced a), whereas the initial { } and medial { } forms of the same letter constitute another letter (pronounced h), for which a different Unicode encoding is available [Walther and Sagot 2010; Gautier 1998]. In many electronic texts, these letters are written using only the ha, differentiated by using the zero-width non-joiner (zwnj) character that prevents a character from being joined to its follower. This distinction must be taken into account in the normalization phase.

2.4.2. Segmentation. Segmentation refers to the process of recognizing boundaries of text constituents, including sentences, phrases and words. The Arabic-based Kurdish alphabet suffers from two segmentation problems that are inherited from the Arabic writing system:

- it does not have capitalization and therefore it is more difficult to recognize sentence boundaries as well as Named Entities,
- white space is not a deterministic delimiter and boundary sign [Shamsfard 2011]; it may appear within a word or between words, or may be absent between some sequential words. However, compared to Persian [Shamsfard et al. 2010] and Urdu [Rehman et al. 2011], this problem is less severe in Kurdish.⁵

⁵ In Kurdish, it is primarily caused by using white space instead of zwnj, when the latter is not supported by the typesetting environment. This error is prevalent in the older articles of VOA [VOA 2013b], one of Pewan's raw text sources.

3. PEWAN: A KURDISH TEST COLLECTION

This section reports on our efforts to construct Pewan⁶, the first standard test collection to evaluate Kurdish IR systems. Later, in Section 4, we show how Pewan was leveraged to build a number of other (smaller) resources for the Kurdish language.

⁶ A Kurdish word meaning measurement.

In the following, we first (in Section 3.1) briefly describe the purpose and use of IR test collections as well as a standard methodology to construct them. Then, after explaining our design decisions in Section 3.2, we present Pewan's construction details in Section 3.3. Finally, the performance of the IR systems used in the pooling process is reported in Section 3.4.

3.1. Background

Test collections are widely used as a standard tool to evaluate the performance of retrieval systems. There is a variety of test collections, tailored for different IR tasks. The Text REtrieval Conferences (TREC [TREC 2013]) held by NIST have greatly contributed to the construction of large and reliable test collections. Each test collection has three main components:

- a text corpus,
- a set of queries (or topics),
- a list of relevant documents for each query (or Relevance Judgments).

Among these components, the most critical one is the relevance judgments, which have a significant impact on the reliability of the collection. For small test collections it is feasible to judge the relevance of all document-query pairs; in large test collections, however, this approach is impractical. As an alternative, most modern test collections, including TREC's Ad-Hoc tracks, use system pooling [Sparck-Jones and van Rijsbergen 1976].
In system pooling, a number of retrieval systems are used to retrieve and rank documents and then the top ranked results from all systems are merged to create a pool for each query. Later, these pools are manually examined by human assessors to create the lists of true relevant documents.

3.2. Design Decisions

Before presenting the construction details of Pewan, we would like to justify two important decisions that we have made in building Pewan: to build two separate collections for the two main standards of Kurdish, and to attach greater importance to the Sorani/Arabic-based standard.

3.2.1. Decoupling the Bi-Standard Aspect. As mentioned in Section 2, Kurdish is a bi-standard language with the Sorani dialect written in the Arabic-based alphabet and the Kurmanji dialect written in the Latin-based alphabet. The mapping between these two dialect/script branches is not trivial and, as the examples in [Gautier 1998] show, the same word, when going from Sorani to Kurmanji, may at the same time go through several levels of change: writing systems, phonology, morphology, and sometimes semantics. This clearly demonstrates that the mapping between these two dialects/scripts is more than just transliteration, although its complexity is less than that of translation.

This bi-standard nature poses a unique challenge in the construction of test collections for the Kurdish language. Our approach is to decouple the mapping problem from the test collection construction problem, meaning we build two test collections, one per standard. As a result, in addition to its primary use of evaluating Sorani IR and Kurmanji IR systems, Pewan can also be perceived as a mechanism to evaluate mapping proposals between these standards (since the queries in these two collections are identical transliterations/translations of each other).

3.2.2. Focusing on the Sorani Dialect/Arabic-based Writing System. In the remaining parts of this paper, we give greater importance to the Sorani dialect/Arabic-based writing system.
The rationale behind this is two-fold: (i) as a result of the severe restrictions on the use of the Kurdish language in Turkey, where the majority of Kurmanji speakers live, there are currently very few sources with a considerable amount of raw Kurmanji text readily available, and even these sources do not strictly follow its writing standards. The Sorani dialect, on the other hand, does not suffer from these shortcomings, thanks to its official status and wider use, and (ii) the Arabic-based writing system is more expressive and far more challenging to process.

Nonetheless, to guarantee the usability and generality of Pewan, we repeated the construction process for the Kurmanji dialect/Latin-based writing system as well. The corresponding details are reported in Section 6.

Fig. 3: Documents in Pewan. (a) Document Size Distribution in Pewan (x-axis: document size in KB; y-axis: count). (b) A Sample Document in Pewan.

3.3. Construction of Pewan's Three Core Components

Within KLPP, we closely followed TREC's construction methodology explained in Section 3.1 and built Pewan's three core components.

3.3.1. Text Corpus. Like many of the TREC Ad-Hoc tracks (and similar to the existing Arabic [Saad and Ashour 2010] and Persian [AleAhmad et al. 2009; Esmaili et al. 2007] corpora), we used news articles to build our text corpus. After surveying all options we chose two online news agencies: (i) Peyamner [Peyamner 2013], a popular multi-lingual news agency based in Iraqi Kurdistan, and (ii) the Sorani Kurdish website of Voice Of America [VOA 2013b]. The main criteria in this process were:

(1) size (number of news articles),
(2) subject diversity,
(3) metadata support (e.g., news category labels),
(4) crawl-friendliness.

For each agency, we developed a crawler to fetch the articles and extract their textual content.
In the case of Peyamner, since articles have no language label, we additionally implemented a simple classifier that decides each page's language based on the occurrence of language-specific characters.

Overall, 115340 news articles dated between 2003 and 2012 were collected (96920 from Peyamner and 18420 from VOA). As illustrated in Figure 3a, their sizes range from 1KB to 154KB (on average 2.8KB). A sample article (about the dust storms in southwest Iran) from the final corpus is shown in Figure 3b.

3.3.2. Queries. After browsing the Peyamner and VOA websites, each member of our team compiled a list of possible queries (all members had previously seen sample queries from TREC 1999 [Voorhees and Harman 1999] and TREC 2004 [Voorhees 2004]). Later, the Google Desktop search tool and our local repository were used to refine these queries. Eventually, out of the initial 65 queries, 42 were selected to be used in the pooling/assessment steps.

Fig. 4: A Sample Query in Pewan. (a) In Sorani (b) In Kurmanji (c) In Persian (d) In English

Figure 4 depicts a sample query in Pewan. Queries in Pewan are represented in TREC's standard format, which has three elements:

- title, a short version (2-3 keywords) of the query,
- description, a longer and sentence-like version of it,
- narrative, 1-2 paragraphs explaining the scope of expected results (only for human assessors).

3.3.3. Relevance Judgments. To create the pools we used 5 different systems to generate 5 runs, each containing 500 documents. Three of these systems are widely used open-source Java retrieval systems, namely MG4J [MG4J ], Apache Lucene [Lucence 2013], and Terrier IR Platform [Terrier 2013], that we selected based on the results of a study conducted in [Middleton and Baeza-Yates 2007].
After making the necessary changes to enable them to process Sorani texts (among other things, correct handling of the zwnj character), we ran all three of them with the OR-ed version of the query description terms (the AND-ed version is extremely selective).

The other two systems were in-house implementations that we have developed and refined as part of our IR coursework: (i) KLPP-VSM: an implementation of the Vector Space Model based on the specifications given in [Manning et al. 2008], and (ii) KLPP-EBM: an implementation of the Extended Boolean Model as described in [Salton et al. 1983]. For the EBM runs, we used the AND-ed version of the query terms (the OR-ed version's performance was consistently inferior). In both implementations we used the TF × IDF weighting scheme recommended in [Manning et al. 2008]:

$$ w(t_i, d_j) = \log\bigl(1 + f(i,j)\bigr) \times \log \frac{|D|}{|\{d \in D : t_i \in d\}|} $$

where w(t_i, d_j) is the total weight of term t_i in the vector representing document d_j. In this formula f(i, j) is the frequency of term t_i in document d_j and D is the collection of all documents.

The populated pools (depths ranging from 576 to 1671, on average 1147) were then manually assessed by our team members (all native Kurdish speakers) to generate the true list of relevant documents for each query.

Table I: Subject Categorization of Pewan's Queries
Category | Query IDs
Politics | 3, 9, 11, 12, 13, 15, 20, 23, 30, 36, 39
Health | 24, 32, 34
Culture | 4, 42
Social | 7, 16
Economy | 31
Sports | 8
Science | 27
Media | 17

It has been shown [Zobel 1998] that queries with long lists of relevant documents are more likely to have other relevant documents missing from the assessment pool. Also, the author of [Hull 1996] has suggested that queries with few relevant documents will tend to have higher variability than those with many relevant documents.
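Returning briefly to the TF × IDF weighting scheme given above for KLPP-VSM and KLPP-EBM, the following is a minimal sketch of how such weights could be computed. It is not the authors' implementation: the function name `tf_idf_weights`, the token-list input format, and the use of the natural logarithm are assumptions made for illustration (the text does not state the logarithm base).

```python
import math
from collections import Counter


def tf_idf_weights(docs):
    """Compute w(t_i, d_j) = log(1 + f(i, j)) * log(|D| / df(t_i)).

    docs: list of documents, each a list of (already normalized) tokens.
    Returns one {term: weight} dictionary per document.
    Assumption: natural logarithm; hypothetical helper, not from the paper.
    """
    n_docs = len(docs)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in docs:
        term_freq = Counter(tokens)  # f(i, j) for this document
        weights.append({
            term: math.log(1 + freq) * math.log(n_docs / doc_freq[term])
            for term, freq in term_freq.items()
        })
    return weights


if __name__ == "__main__":
    toy_docs = [["a", "b", "a"], ["b", "c"]]
    for doc_weights in tf_idf_weights(toy_docs):
        print(doc_weights)
```

Note that a term occurring in every document receives weight zero under this formula, since log(|D| / |D|) = 0.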
To ensure the reliability of Pewan, out of the 42 assessed queries, we excluded those with too many (≥ 100) or too few (≤ 10) relevant documents. The remaining 22 queries are included in the current release of Pewan. The subject categorization of these queries is presented in Table I.

3.4. Evaluation of Pewan's Pooling Systems

The results of this evaluation are depicted in Figure 5. Three important observations are: (i) MG4J consistently outperforms all other systems, (ii) at lower recall values, Lucene performs quite well; however, the quality of its results degrades faster than that of other systems as the recall increases; more precisely, while it is the most precise system at the first recall point, it has the worst performance for recall points greater than 0.6, and (iii) our implementation of the Vector Space Model follows a reverse trend compared to that of Lucene; that is, its relative rank among these five systems improves with the increase in the recall level.

While running these experiments, we also measured the execution times. In short, while all these systems perform comparably fast in building the inverted lists and processing the queries, the construction of document vectors in our implementation of the Vector Space Model is by far the most time-consuming computation.⁷

Based on these results, we chose MG4J, as the best performing system, to be used in the rest of the experiments (denoted as Baseline hereafter).

4. ENRICHING PEWAN WITH OTHER RESOURCES

On top of the three core test collection components explained in the previous section, we also built three other IR resources for the Kurdish language. Below, they are briefly introduced⁸.

4.1. Stopwords List

Stopwords are words which do not convey meaning on their own and are usually filtered out prior to processing of natural language data. Stopword elimination always reduces the computational cost (through reducing the index sizes) and in some cases can marginally improve the quality of the outputs as well.
Although stopwords lists have been built for many languages [Lazarinis et al. 2009], so far no list has been reported for Kurdish. We used Pewan's text corpus to build a list of Kurdish stopwords. To this end, we followed the approach proposed in [Savoy 1999] and first compiled a list of highly-frequent words in Pewan and then manually examined it to extract stopwords. Our final list contains 282 words and, similar to other languages, it mainly consists of prepositions. Figure 6 shows the most frequent stopwords in Pewan's text corpus.

⁷ This, however, should be seen in the light of the fact that the other competitors are highly-optimized search engines enjoying advanced and specialized indexes.

⁸ As noted in [Hasanpoor 1999], a series of linguistics reports on the Kurdish language (e.g., on affixes and stopwords) have been published in The Journal of the Kurdish Academy. However, these reports are in Kurdish and Arabic and have not been electronically archived yet.

Fig. 5: Performance of Pewan's Pooling Systems (precision versus recall curves for MG4J, Lucene, Terrier, KLPP_VSM, and KLPP_EBM).

# | Stopword | Eng. Trans. | Freq.
1 | له | from | 649337
2 | و | and | 597696
3 | به | with | 316692
4 | بۆ | for | 239346
5 | که | which | 185279
6 | ئهو | that | 147520
7 | ئهم | this | 72354
8 | ی | of | 53327
9 | کرد | made/did | 51550
10 | لهگهڵ | together | 47187
11 | لهسهر | about | 45435
12 | ئهوهی | that which | 43617
13 | سهر | on/head | 39897
14 | دوو | two | 38212
15 | ھهروهھا | also | 37895
16 | دوای | after | 35656
17 | لهو | from that | 35176
18 | دهکات | makes/does | 34360
19 | چهند | some | 31234
20 | ھهر | every | 31201

Fig. 6: The Top 20 Most-Frequent Words in Pewan's Text Corpus.

4.2. Prefixes/Suffixes

Identifying the common prefixes and suffixes is an essential step in processing languages like Kurdish that heavily rely on affixes and their combinations to build new words.
One of the most important uses of such lists is to build rule-based stemmers [Porter 1997].

Pewan's text corpus was again leveraged to build a list of common Sorani prefixes and suffixes. To do so, we generated all character-level n-grams for 2 ≤ n ≤ 5 for all documents in Pewan and then manually examined the top most-frequent n-grams in each case to extract the meaningful affixes. The most widely-used Sorani affixes are presented in Figure 7⁹. As mentioned before, in Kurdish, suffixes are used more often than prefixes and cover a wider range of functionalities.

# | Suffix | Description
1 | ان | plural marker
2 | که | definite marker
3 | کان | #1 + #2
4 | کانی | #1 + #2 + Izafe marker
5 | ێک | indefinite marker
6 | انی | #1 + Izafe marker
7 | يان | third-person plural marker
8 | وو | pluperfect marker
9 | کرد | auxiliary verb
10 | مان | first-person plural marker
11 | تان | second-person plural marker
12 | کهی | #2 + Izafe marker
13 | يش | adverb ("as well")
14 | ستان | derivational suffix ("land of")
15 | زان | derivational suffix ("master of")
16 | ێکی | #5 + Izafe marker
17 | بوون | auxiliary verb
18 | گه | derivational suffix ("place of")
19 | کراو | past participle adjective marker
20 | کهوت | auxiliary verb

# | Prefix | Description
I | له | derivational prefix ("from")
II | به | derivational prefix ("with")
III | ده | verb tense marker
IV | نه | negation marker
V | بێ | derivational prefix ("without")
VI | سهر | derivational prefix ("top")
VII | بهر | derivational prefix ("front")
VIII | ناو | derivational prefix ("in")
IX | ڕا | derivational prefix ("up")
X | پێش | derivational prefix ("before")

Fig. 7: The Top Most-Frequent Kurdish Suffixes and Prefixes

4.3. Pewan's Queries in English and Persian

We translated the title and description parts of all Pewan's queries into English and Persian. Figure 4d and Figure 4c depict the English and Persian equivalents of the sample query presented in Figure 4a.
As shown later in Section 5, these translated queries, coupled with English-to-Sorani and Persian-to-Sorani dictionaries, can be used to run and evaluate CLIR systems on Sorani documents using English and Persian queries.

5. EXPERIMENTS

In this section we present the results of four sets of experiments that we carried out using Pewan. The first set examines the effectiveness of some basic IR techniques on Sorani documents. In the second and third sets, we investigate the impact of n-grams and light stemming, respectively. Lastly, in the fourth set, we look into the problem of CLIR using query translation.

In all experiments, the standard 11-point Precision-Recall curves and the Mean Average Precision (MAP) measure have been used to compare different setups. Moreover, in all cases –unless explicitly mentioned– the longer version of the queries (the description part) was used. This follows the works reported in [Voorhees 1994; Xu and Croft 1998; Hull 1996]. The experiments were run on a desktop PC with 4×3.2GHz Xeon processors and 4GB of RAM, running Windows 7.

5.1. Basic IR Techniques

Here we briefly describe the techniques that are considered in these experiments:

— Normalization: an IR preprocessing task which unifies different representations of multi-code characters. In all of our experiments we have used (and will use) the normalized version of the text corpus and also treated the zwnj character as a non-delimiter character. However, in order to quantitatively measure the impact of the text preparation step in Sorani IR, we ran a single experiment on the raw version of the text corpus, in which characters were not normalized and zwnj was treated as a delimiter.
— Stopword elimination: Pewan’s stopwords list (from Section 4) was used to carry out the corresponding experiment.
— Query Length: as described in Section 3.3, each Pewan query has a short (title) and a long (description) version. In all our experiments, we have used the description part of the queries. To measure the impact of query length, we ran a single experiment in which the shorter version of the queries was used (this is particularly important for the evaluation of Web IR systems).

As the results of these experiments show (Figure 8a), normalization is a critical step in Sorani text processing. Furthermore, the relatively good performance of the short queries can be attributed to the fact that our query keywords were chosen carefully in order to preserve the essence of the user’s information need in the shorter version of the queries. Finally, elimination of the stopwords slightly improves the quality of the outputs.

9 Izafe is an unstressed vowel -e or -i added between prepositions, nouns and adjectives in a phrase. It approximately corresponds to the English preposition of. This construct is frequently used in both Persian and Kurdish languages.

[Fig. 8: Performance Graphs. Four precision (vertical axis) versus recall (horizontal axis) plots: (a) Basics, with curves Stopwords_Removed, Baseline, Short_Queries and Raw; (b) n-grams, with curves 4-grams, Baseline, 3-grams, 5-grams, 6-grams and 2-grams; (c) Light-Weight Stemming, with curves Stemmed_3_16, Stemmed_2_16, Stemmed_4_15 and Baseline; (d) Query Translation, with curves Baseline, CLIR_English and CLIR_Persian.]

5.2. n-grams

N-grams is a popular language-blind retrieval technique which generally improves the quality of IR systems’ outputs [Mcnamee and Mayfield 2004].
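As a rough illustration of character-level n-gram tokenization, the sketch below replaces each whitespace-delimited term with its overlapping within-word n-grams. The function names, the whitespace tokenization, and the choice to keep terms shorter than n unchanged are assumptions made for this example and are not details reported in this paper.

```python
def char_ngrams(term, n):
    """Return the overlapping character n-grams of a single term."""
    # Assumption: a term that is not longer than n is kept as a single unit.
    if len(term) <= n:
        return [term]
    return [term[i:i + n] for i in range(len(term) - n + 1)]


def to_ngram_text(text, n):
    """Replace every whitespace-delimited term by its character n-grams."""
    # n-grams never span two terms, so no gram contains a delimiter.
    return " ".join(g for term in text.split() for g in char_ngrams(term, n))
```

Applying `to_ngram_text` with the same n to both documents and queries before indexing gives the retrieval engine n-grams, rather than whole words, as its matching units.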
In our experiments, the following process was repeated 5 times (for 2 ≤ n ≤ 6): first, each term in Pewan’s documents and queries was replaced by its corresponding character-level n-grams (our n-gram strings do not contain delimiters); then our IR engine was applied to the transformed collection and its effectiveness was measured. The results are depicted in Figure 8b. The main findings can be summarized as follows: (i) among all n-gram configurations, n = 4 is a clear winner, and (ii) similar to Persian [AleAhmad et al. 2007] and in contrast to Arabic [Xu et al. 2002], Kurdish can benefit from n-grams. This is due to the fact that, unlike Arabic, derivative words in Persian and Kurdish preserve the root and only add prefixes and suffixes to it.

5.3. Stemming

Stemming is the process of reducing a word to its stem or root form. It allows documents in which a term appears in a different morphological form from the query to be found and matched [Lazarinis et al. 2009]. The benefits of stemming are more pronounced in languages with complex morphology (e.g., German [Braschler and Ripplinger 2004] and Arabic [Xu et al. 2002]).

We designed a simple rule-based light-weight stemmer for the Kurdish language which uses Pewan’s prefix/suffix lists. In order to handle the problem of multiple suffixes (as demonstrated by the example in Section 2.3), our stemmer adopts a recursive approach.
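A minimal sketch of such a recursive affix stripper is given below. It is an illustration under stated assumptions rather than the stemmer evaluated here: the function name, the longest-match-first order, the suffix-before-prefix order, and the Latin-script toy affixes in the test are all hypothetical. The `min_len` and `min_freq` arguments stand for the two safeguards this section defines, the minimum stem length L and the minimum corpus frequency F for prefix removal.

```python
def light_stem(word, suffixes, prefixes, freq, min_len=3, min_freq=16):
    """Strip one affix per call and recurse until no rule applies.

    freq maps a string to its corpus frequency. Affixes are assumed
    to be non-empty strings.
    """
    # Suffixes are tried first, longest candidate first (an assumption).
    for suffix in sorted(suffixes, key=len, reverse=True):
        stem = word[:len(word) - len(suffix)]
        # L: never produce a stem shorter than min_len.
        if word.endswith(suffix) and len(stem) >= min_len:
            return light_stem(stem, suffixes, prefixes, freq, min_len, min_freq)
    for prefix in sorted(prefixes, key=len, reverse=True):
        rest = word[len(prefix):]
        # F: remove a prefix only if the remainder is frequent in the corpus.
        if (word.startswith(prefix) and len(rest) >= min_len
                and freq.get(rest, 0) >= min_freq):
            return light_stem(rest, suffixes, prefixes, freq, min_len, min_freq)
    return word
```

Because each call removes at most one affix and then recurses on the shorter string, stacked suffixes are peeled off one layer at a time, and the recursion stops as soon as no rule satisfies both safeguards.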
Moreover, to prevent over-stemming –a common error in rule-based stemmers caused by blindly removing substrings that are part of the word’s stem– we have defined two parameters:

— minimum length (L): this parameter is adopted from Lovins’ stemmer [Lovins 1968] and guarantees that the final stem’s length will always be greater than or equal to L,
— minimum frequency (F): this parameter only concerns prefix removal steps (as removing prefixes is more likely to result in errors [Hull 1996; Paice 1994]). A prefix can be removed from a word only if the remaining string has a frequency of at least F in the corpus.

We performed a sensitivity analysis on our stemmer by varying these two parameters. A summary of this study is shown in Figure 8c and indicates that: (i) stemming is beneficial to Kurdish IR, and (ii) the combination of (L = 3, F = 16) has the best performance on Pewan.

5.4. Cross-Lingual Information Retrieval

In the last set of our experiments, we used our implementation of a simple word-by-word dictionary-based translation approach [Ballesteros and Croft 1996] and conducted two CLIR experiments:

— from English: using Dicto [?], an online, high-quality, bidirectional English-Sorani dictionary,
— from Persian: using Hajir [Abollahpour 2013], a medium-size (around 40,000 entries), unidirectional Persian-to-Sorani dictionary in PDF format.

In both cases, we always choose the first word from the first set of translation candidates (a.k.a. sense). The results of these experiments are shown in Figure 8d. As expected, the performance of this naive translation approach is inferior to that of the monolingual experiments. In the case of CLIR from Persian, the dictionary’s relatively small size (e.g., it has no entries for Named Entities) has also contributed to the poor quality of the outputs. In the future, we plan to improve these results by applying pre- and post-translation query modifications [Ballesteros and Croft 1996].

5.5. Summary

A summary of our experimental study using the MAP measure is shown in Figure 9. The noteworthy remarks are:

— normalization is a critical step in Kurdish text processing,
— Kurdish IR can benefit from light stemming and n-grams. Moreover, similar to the results reported for European languages [Mcnamee and Mayfield 2004], the best n-gram performance is achieved for n = 4,
— stopword elimination results in both computational and performance gains (although marginal),
— further resources and techniques are needed to improve the performance of cross-lingual IR on Kurdish documents.

[Fig. 9: Performance of Different Systems/Setups based on the Mean Average Precision (MAP) Measure. Vertical axis: Mean Average Precision, from 0.0 to 0.5.]

6. AUGMENTING PEWAN WITH KURMANJI DOCUMENTS

In order to guarantee the usability and generality of Pewan, we repeated the construction process reported in Section 3.3 for the Kurmanji dialect and its Latin-based writing system as well. Here are the notable details:

— to build the Kurmanji text corpus, we used Peyamner as well as the Kurmanji Kurdish website of Voice of America [VOA 2013a]. Overall, 25572 news articles were collected (19873 from Peyamner and 5699 from VOA). Their sizes range from 1KB to 42KB.
— queries are the transliterated/translated equivalents of Pewan’s queries. The Kurmanji (Latin-based alphabet) version of the sample query in Figure 4a is shown in Figure 4b.
— to create the pools we used the same 5 systems as in Section 3.3, although with shorter runs (100 documents, instead of 500).

Then, the populated pools (depths ranging from 77 to 302, on average 213) were again manually assessed by our team members and the relevance judgments were generated. Similarly, these judgments were used to compare the performance of the pooling systems. The results of this evaluation are depicted in Figure 10.
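The pooling step mentioned above can be sketched as follows. This is a generic illustration of TREC-style pooling, not the exact procedure or code used for Pewan; the function name and the dictionary-of-ranked-lists input format are assumptions made for the example.

```python
def build_pool(runs, depth=100):
    """Build the assessment pool for one query.

    runs maps a system name to that system's ranked list of document ids.
    The pool is the union of the top-`depth` documents of every run.
    """
    pool = set()
    for ranked_docs in runs.values():
        # Slicing is safe even when a run returns fewer than `depth` documents.
        pool.update(ranked_docs[:depth])
    return pool
```

With 5 systems and runs of 100 documents, a pool built this way can contain at most 500 distinct documents for a query; the more the systems agree on their top results, the smaller the pool that has to be judged manually.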
While comparing these new results with those in Figure 5, the following observations can be made: (i) in general, the Kurmanji curves are less steep, meaning worse precision at low recall levels and better precision at high recall levels. This is partly due to the fact that the relevance judgment lists are shorter in the Kurmanji collection (on average, 12.5 versus 42), and (ii) although MG4J is again the best-performing engine overall, it is not as dominant as in Figure 5. In fact, Lucene at high recall levels and our implementation of the Vector Space Model at middle recall levels are competitive.

[Fig. 10: Performance of Pewan’s Pooling Systems (on Kurmanji Documents). Precision (vertical axis) versus recall (horizontal axis) curves for MG4J, Lucene, KLPP_VSM, Terrier and KLPP_EBM.]

7. CONCLUSIONS AND FUTURE WORK

In this paper we introduced Pewan, the first standard test collection for the Sorani language, and briefly explained its construction details. Pewan is freely available and can be obtained from [Pewan 2013]. In addition to the test collection, the compressed file also contains: (i) a list of prefixes/suffixes, (ii) a list of stopwords, and (iii) the English translation of Pewan’s queries.

With the help of Pewan, we also conducted an experimental study to investigate the effectiveness of well-known IR techniques on Kurdish documents. Our results show that normalization and, to a lesser extent, stemming can greatly improve the quality of Sorani IR systems.

Pewan and the experimental study presented in this paper are the preliminary outcomes of our KLPP project. There are many avenues to continue this work. First, we would like to further enrich our test collection by augmenting it with classification tags.
Our plan is to start by retrieving subject labels (e.g., arts and sports) from the original source news agencies and then cross-check (and correct, if needed) the labeling accuracy. This will allow Pewan to be used for classification tasks as well.

In this work, we gave an overview of the diversity problem of the Kurdish language and highlighted the need for a solution that caters for this diversity. In particular, building a transliteration/translation system between the Sorani and Kurmanji branches of Kurdish is an important open problem. As another line of research, we plan to address this problem.

Finally, we would like to use Pewan for other IR tasks. In particular, we plan to improve the performance of our CLIR system and also investigate the effectiveness of statistical stemmers on Kurdish IR.

ACKNOWLEDGMENTS

The authors would like to thank Somayeh Yosefi, Purya Aliabadi, Donya Eliassi, Shownem Hakimi and Asrin Mohammadi for their help in the manual assessment of the relevance judgments.

REFERENCES

Hajir Abollahpour. 2013. Hajir Dictionary. http://kurmanj.ir/news.php?readmore=76. (2013).
Abolfazl AleAhmad, Hadi Amiri, Ehsan Darrudi, Masoud Rahgozar, and Farhad Oroumchian. 2009. Hamshahri: A standard Persian Text Collection. Knowledge-Based Systems 22, 5 (2009), 382–387.
Abolfazl AleAhmad, Parsia Hakimian, Farzad Mahdikhani, and Farhad Oroumchian. 2007. N-gram and Local Context Analysis for Persian Text Retrieval. In Proceedings of ISSPA. 1–4.
Lisa Ballesteros and W. Bruce Croft. 1996. Dictionary Methods for Cross-Lingual Information Retrieval. In DEXA. 791–801.
Wafa Barkhoda, Bahram ZahirAzami, Anvar Bahrampour, and Om-Kolsoom Shahryari. 2009. A Comparison between Allophone, Syllable, and Diphone based TTS Systems for Kurdish Language. In Signal Processing and Information Technology (ISSPIT), 2009 IEEE International Symposium on. 557–562.
Martin Braschler and Bärbel Ripplinger. 2004. How Effective is Stemming and Decompounding for German Text Retrieval? Information Retrieval 7, 3-4 (2004), 291–316.
CLEF. 2013. Conference and Labs of the Evaluation Forum. http://www.clef-initiative.eu/. (2013).
Kyumars Sheykh Esmaili. 2012. Challenges in Kurdish Text Processing. CoRR abs/1212.0074 (2012).
Kyumars Sheykh Esmaili, Hassan Abolhassani, Mahmood Neshati, Ehsan Behrangi, Asreen Rostami, and Mojtaba Mohammadi Nasiri. 2007. Mahak: A Test Collection for Evaluation of Farsi Information Retrieval Systems. In Proceedings of the 5th IEEE/ACS International Conference on Computer Systems and Applications (AICCSA ’07). 639–644.
Kyumars Sheykh Esmaili and Shahin Salavati. 2013. Sorani Kurdish versus Kurmanji Kurdish: An Empirical Comparison. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL’13). 300–305.
Kyumars Sheykh Esmaili, Shahin Salavati, Somayeh Yosefi, Donya Eliassi, Purya Aliabadi, Shownem Hakimi, and Asrin Mohammadi. 2013. Building a Test Collection for Sorani Kurdish. In Proceedings of the 10th IEEE/ACS International Conference on Computer Systems and Applications (AICCSA ’13).
Ali Farghaly and Khaled F. Shaalan. 2009. Arabic Natural Language Processing: Challenges and Solutions. ACM Transactions on Asian Language Information Processing 8, 4 (2009).
FIRE. 2013. Forum for Information Retrieval Evaluation. http://www.isical.ac.in/~clia/. (2013).
Gérard Gautier. 1998. Building a Kurdish Language Corpus: An Overview of the Technical Problems. In Proceedings of ICEMCO.
Guardian. 2013. The Guardian. www.guardian.co.uk/. (2013).
Geoffrey Haig and Yaron Matras. 2002. Kurdish Linguistics: A Brief Overview. Language Typology and Universals 55, 1 (2002).
Jafar Hasanpoor. 1999. A Study of European, Persian and Arabic loans in Standard Sorani. Uppsala University.
Amir Hassanpour, Jaffer Sheyholislami, and Tove Skutnabb-Kangas. 2012. Introduction. Kurdish: Linguicide, Resistance and Hope. International Journal of the Sociology of Language 217 (2012), 1–8.
Harold Stanley Heaps. 1978. Information Retrieval: Computational and Theoretical Aspects. Academic Press, Inc. Orlando, FL, USA.
David A Hull. 1996. Stemming Algorithms: A Case Study for Detailed Evaluation. Journal of the American Society for Information Science 47, 1 (1996), 70–84.
Fotis Lazarinis, Jesús Vilares, John Tait, and Efthimis N. Efthimiadis. 2009. Current Research Issues and Trends in Non-English Web Searching. Information Retrieval 12, 3 (2009), 230–250.
Julie B Lovins. 1968. Development of a Stemming Algorithm. MIT Information Processing Group, Electronic Systems Laboratory.
Lucence. 2013. Apache Lucene. http://lucene.apache.org/. (2013).
David N. MacKenzie. 1961. Kurdish Dialect Studies. Oxford University Press.
Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA.
Paul Mcnamee and James Mayfield. 2004. Character N-Gram Tokenization for European Language Text Retrieval. Information Retrieval 7, 1-2 (2004), 73–97.
MG4J. Managing Gigabytes for Java. Available at: http://mg4j.dsi.unimi.it/. (????).
Christian Middleton and Ricardo Baeza-Yates. 2007. A Comparison of Open Source Search Engines. arXiv:1212.0074. (2007).
NTCIR. 2013. NII Test Collection for IR Systems. http://research.nii.ac.jp/ntcir/index-en.html. (2013).
Chris D Paice. 1994. An Evaluation Method for Stemming Algorithms. In Proceedings of ACM SIGIR’94. 42–50.
Pewan. 2013. Pewan’s Download Link. https://dl.dropbox.com/u/10883132/Pewan.zip. (2013).
Peyamner. 2013. Peyamner News Agency. http://www.peyamner.com/. (2013).
M. F. Porter. 1997. An Algorithm for Suffix Stripping. Morgan Kaufmann Publishers Inc., 313–316.
Zobia Rehman, Waqas Anwar, and Usama Ijaz Bajwa. 2011. Challenges in Urdu Text Tokenization and Sentence Boundary Disambiguation. In Proceedings of the 2nd Workshop on South and Southeast Asian Natural Language Processing (WSSANLP), IJCNLP 2011. 40–45.
Motaz K. Saad and Wesam Ashour. 2010. OSAC: Open Source Arabic Corpus. In The 6th International Symposium on Electrical and Electronics Engineering and Computer Science. 557–562.
Gerard Salton, Edward A. Fox, and Harry Wu. 1983. Extended Boolean Information Retrieval. Commun. ACM 26, 11 (1983).
Pollet Samvelian. 2007. A Lexical Account of Sorani Kurdish Prepositions. In Proceedings of International Conference on Head-Driven Phrase Structure Grammar. 235–249.
Jacques Savoy. 1999. A Stemming Procedure and Stopword List for General French Corpora. JASIS 50, 10 (1999), 944–952.
Mehrnoush Shamsfard. 2011. Challenges and Open Problems in Persian Text Processing. In Proceedings of LTC’11.
Mehrnoush Shamsfard, Hoda Sadat Jafari, and Mahdi Ilbeygi. 2010. STeP-1: A Set of Fundamental Tools for Persian Text Processing. In LREC.
Jaffer Sheyholislami. 2010. Identity, Language, and New Media: the Kurdish Case. Language Policy 9, 4 (2010), 289–312.
Karen Sparck-Jones and C.J. van Rijsbergen. 1976. Information Retrieval Test Collections. Journal of Documentation 32, 1 (1976), 59–72.
Terrier. 2013. Terrier IR Platform. http://terrier.org/. (2013).
Wheeler M. Thackston. 2006a. Kurmanji Kurdish: A Reference Grammar with Selected Readings. Harvard University.
Wheeler M. Thackston. 2006b. Sorani Kurdish: A Reference Grammar with Selected Readings. Harvard University.
TREC. 2013. Text REtrieval Conference. http://trec.nist.gov/. (2013).
VOA. 2013a. Voice of America - Kurdish (Kurmanji). http://www.dengeamerika.com/. (2013).
VOA. 2013b. Voice of America - Kurdish (Sorani). http://www.dengiamerika.com/. (2013).
Ellen M Voorhees. 1994. Query Expansion Using Lexical-Semantic Relations. In Proceedings of ACM SIGIR’94. 61–69.
Ellen M Voorhees. 2004. Overview of the TREC 2004 Robust Retrieval Track. In Proceedings of TREC 2004.
Ellen M Voorhees and Donna Harman. 1999. Overview of the Eighth Text Retrieval Conference (TREC-8). In Proceedings of TREC, Vol. 8. 1–24.
Géraldine Walther. 2011. Fitting into Morphological Structure: Accounting for Sorani Kurdish Endoclitics. In The Proceedings of the Eighth Mediterranean Morphology Meeting.
Géraldine Walther and Benoît Sagot. 2010. Developing a Large-scale Lexicon for a Less-Resourced Language. In SaLTMiL’s Workshop on Less-resourced Languages (LREC).
Jinxi Xu and W Bruce Croft. 1998. Corpus-based Stemming Using Cooccurrence of Word Variants. ACM TOIS 16, 1 (1998), 61–81.
Jinxi Xu, Alexander Fraser, and Ralph Weischedel. 2002. Empirical Studies in Strategies for Arabic Retrieval. In Proceedings of ACM SIGIR’02. 269–274.
Justin Zobel. 1998. How Reliable Are the Results of Large-scale Information Retrieval Experiments? In Proceedings of ACM SIGIR’98. 307–314.