Tekstin tallennus- ja hakumenetelmien kehittäminen suomen kielen tulkintaohjelmien avulla: FULLTEXT-projektin loppuraportti
[Developing text storage and retrieval methods with Finnish language analysis programs: final report of the FULLTEXT project]

Riitta Alkula & Timo Honkela

ABSTRACT

The project, Linguistic processing and retrieval techniques in Finnish fulltext databases (FULLTEXT), dealt with the special problems of fulltext databases in the Finnish language. Finnish has a rich inflectional and derivational morphology. Another typical characteristic is the use of compounds; in English these compounds would be multi-word terms. These characteristics of Finnish result in poor system performance when commercial information retrieval systems developed for English are used. To decrease the size of the inverted file and to improve retrieval efficiency, it is reasonable to normalize the inflectional variants of a word to the basic form.

In the FULLTEXT project, natural language analysis modules for Finnish were incorporated into the BASIS and APL-MINTTU retrieval systems, and several test databases were produced. When word forms were normalized to their basic form, the index file was smaller than a traditional index, in which the words are stored in their inflected forms. Even when the components of compound words were added to the basic form index, it still remained smaller than the traditional index.

In the retrieval tests, the best recall was achieved with the index that contained the basic word forms and the components of compound words. Good recall did not result in poor precision: the precision ratio was about as good as in the other indexes. Queries had the best precision in a database where automatically truncated terms were searched in a traditional index and the retrieved index terms were then analyzed and filtered with natural language analysis modules. Unfortunately, in this case the recall ratio was lower than in the other test databases. Problems in the use of natural language modules were also investigated.
When the search terms are given in their basic form, the searcher must pay more attention to derivatives and compounds than when using truncated search terms in traditional indexes. Methods for transforming search terms to their correct basic form should be developed further.

Remarks

The scanned original full text report starts at the 4th page of this document.
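The index variants compared in the abstract can be illustrated with a minimal Python sketch. It is not the project's BASIS or APL-MINTTU implementation: the ANALYSES table is a hypothetical hand-written stand-in for a morphological analyzer, the four one-line documents are invented toy data, and all function names are assumptions made for this illustration. The sketch builds a traditional index of inflected forms, a basic form index, and a basic form index with compound components, and then shows a truncated search whose matching index terms are filtered by their analyzed basic form.

```python
from collections import defaultdict

# Hypothetical stand-in for a morphological analyzer:
# inflected form -> (basic form, components if the word is a compound).
ANALYSES = {
    "talossa": ("talo", []),
    "talon": ("talo", []),
    "talot": ("talo", []),
    "taloudessa": ("talous", []),
    "tiedon": ("tieto", []),
    "tietoa": ("tieto", []),
    "tietokannan": ("tietokanta", ["tieto", "kanta"]),
    "tietokannassa": ("tietokanta", ["tieto", "kanta"]),
}

# Invented toy documents (document id -> text).
DOCS = {
    1: "talossa tietokannan",
    2: "talon tietoa",
    3: "talot tietokannassa tiedon",
    4: "taloudessa tietoa",
}


def surface_keys(token):
    """Traditional index: the inflected form itself is the index term."""
    return [token]


def base_keys(token):
    """Basic form index: every inflected form maps to its basic form."""
    return [ANALYSES.get(token, (token, []))[0]]


def base_and_part_keys(token):
    """Basic form index that also stores the components of compounds."""
    lemma, parts = ANALYSES.get(token, (token, []))
    return [lemma, *parts]


def build_index(docs, key_fn):
    """Build an inverted index: index term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            for key in key_fn(token):
                index[key].add(doc_id)
    return dict(index)


def truncated_search(surface_index, stem, lemma=None):
    """Right-truncated search in a traditional index.

    If lemma is given, a matching index term is kept only when its
    analyzed basic form equals lemma (the filtering step).
    """
    hits = set()
    for key, postings in surface_index.items():
        if not key.startswith(stem):
            continue
        if lemma is not None and ANALYSES.get(key, (key, []))[0] != lemma:
            continue
        hits |= postings
    return hits


def precision_recall(retrieved, relevant):
    """Return (precision, recall) for sets of document ids."""
    found = len(retrieved & relevant)
    precision = found / len(retrieved) if retrieved else 0.0
    recall = found / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    for name, fn in [("traditional", surface_keys),
                     ("basic form", base_keys),
                     ("basic form + components", base_and_part_keys)]:
        print(name, len(build_index(DOCS, fn)), "index terms")
    surface_index = build_index(DOCS, surface_keys)
    print(truncated_search(surface_index, "talo"))
    print(truncated_search(surface_index, "talo", lemma="talo"))
```

In this toy data the traditional index has eight terms, the basic form index four, and the basic form index with compound components five, which mirrors the size ordering reported in the abstract without saying anything about its magnitude on real text. The unfiltered truncated search for "talo" also matches "taloudessa" (basic form "talous"), whereas the filtered search drops it.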
Keywords and search terms

Named entities: VTT, Valtion teknillinen tutkimuskeskus, TEKES, VTKK, KTA-Papyrus, Aamulehti, Länsiväylä-lehti, Tampereen yliopisto, Eeva Palosuo, Juhani Virtanen, Matti Sihto, Kimmo Koskenniemi, Mika Herpiö, Pekka Vuorio, Harri Arnola, Sauli Laitinen, Eero Sormunen, Taru Kuhanen, Sanna Hätönen, Raili Salminen, Markku Kuokkala, Markku Ylinen, Tarja Hjorth, Kaarina Nazarenko, Jaakko Anttila, Kari Martiskainen, Irma Salovaara, Pirjo Valpas, Tarja Heinivaho, Klaus Nurmi, Tuija Tuominen, Kalervo Järvelin, Olli Paavola.

Finnish terms: Tiedonhakujärjestelmä, hakujärjestelmä, suomen kieli, taivutusmuoto, johdos, yhdyssana, homografia, sanaliitto, hakusana, taivutusvartalo, perusmuoto, MINTTU, BASIS, testikysely, hakemisto, TWOL, hakutulos, käyttäjä, perusmuotohaku, automaattinen katkaisu.

English terms: Information retrieval system, database, free-text retrieval, inverted index, index term, stop word, query, Finnish language, inflectional word forms, compound word, automatic truncation, morphological analysis, APL language, C language.

The full report follows in scanned form.