Making a Dictionary in Ulaanbaatar
Making a Dictionary in Ulaanbaatar: Corpus-based Lexicography with Limited Financial and Technical Resources
Stefan Engelberg (Institut für Deutsche Sprache & Universität Mannheim)
Stefan Engelberg (IDS Mannheim), Facultés universitaires Saint-Louis, Bruxelles, October 2007
[Slide 1]

CONTENT
1) Mongolia and its languages
2) Publishing dictionaries in Mongolia
3) The lexicographic workplace: Free corpus-linguistic resources
4) Improving bilingual dictionaries
5) Outlook
[Slide 2]

Mongolia – basic data
population: 2 951 000 (estimate 2007) = 1.9 / km²
capital: Ulaanbaatar (> 1 000 000 inhabitants, fast growing)
government: stable parliamentary democracy
economic basis: agriculture (sheep, cattle, …), mining (copper, gold, coal, …)
Gross National Income (World Bank, measuring GNI per person, 2004): Mongolia: $600 (rank: ca. 132/175; lowest group)
Human Development Index (United Nations Development Programme; measuring rate of literacy and life expectancy, 2007): Mongolia: 0.691 (rank: 116/177; medium group)
tertiary education: about 200 private "colleges" in Ulaanbaatar; 6 universities offer degrees that are acknowledged in Germany
Source: Fischer Weltalmanach 2007. Frankfurt/M.: Fischer 2006.
[Slide 3]

Major languages in Mongolia
Eastern Mongolian languages:
Khalkha Mongolian: ca. 2 400 000 speakers
Kalmyk-Oirat: ca. 210 000 speakers
Buriat: ca. 70 000 speakers
Darkhat: ca. 32 000 speakers
Turkic languages:
Kazakh: ca. 200 000 speakers
Tuvin: ca. 30 000 speakers
Other languages:
Chinese: ca. 35 000 speakers
Russian: ca. 4 000 speakers
(numbers of speakers extrapolated from numbers in Ethnologue relative to recent population growth)
Ethnologue: http://www.ethnologue.com.
[Slide 4]

Eastern Mongolian languages
(Mongolian is considered a branch of the Altaic language family; the internal classification of the Mongolian languages is controversial; in addition there is one Western Mongolian language: Mogholi, ca. 200 speakers in Afghanistan)
Group (languages): speakers in Mongolia / China / Russia
Mongolian proper (Khalkha Mongolian, Peripheral Mongolian): Mongolia ca. 2 400 000; China ca. 3 400 000
Buriat (Mongolian B., Chinese B., Russian B.): Mongolia ca. 70 000; China ca. 70 000; Russia ca. 320 000
Oirat-Kalmyk-Darkhat (Darkhat, Kalmyk-Oirat): Mongolia ca. 240 000; China ca. 140 000; Russia ca. 180 000
Mongour (Kangjia, Tu, Bonan, Dongxiang, East Yugur): China ca. 500 000
Dagur (Daur): China ca. 100 000
Marked on the slide as mutually intelligible: 6 260 000 speakers
Ethnologue: http://www.ethnologue.com.
[Slide 5]

Foreign languages
Russian: formerly the first foreign language; widespread among older Mongolians.
English: first foreign language since 2005.
German: about 30 000 second-language speakers.
Chinese: not widespread.
Six universities offer degrees in German that are acknowledged at German universities.
[Slide 6]

Monsudar publishers
The publisher
• Company: Monsudar publishers as part of the Admon company (printing and publishing).
• Dictionary department: Monsudar dictionary department founded two years ago (head: Bayarsaikhan).
• Publishing plan: a series of bilingual dictionaries (Mongolian – English / German / Chinese / Korean) in cooperation with foreign partners (Oxford, Pons, …).
[Slide 7]

The dictionary department
• Staff: head of department and 2½ staff.
• German-Mongolian dictionary: the head of department, 1 staff member and about 20 part-time freelancers (university lecturers, interpreters, translators, travel guides) are currently working on the German-Mongolian dictionary.
• Equipment: PCs and Internet connection (very slow) available to staff and most freelancers.
• Reference works: a small collection of reference works is available in the editorial office, among them an older German-Mongolian dictionary (Vietze 1981). (Acquisition of the 10-volume "Duden – Großes Wörterbuch der deutschen Sprache" is beyond the financial resources.)
[Slide 8]

The German-Mongolian / Mongolian-German dictionary
• German partner: cooperation with Pons (Klett, Stuttgart).
• Dictionary basis: Pons provides the German part of the German-Mongolian dictionary (identical to the new Pons Deutsch-Englisches Kompaktwörterbuch).
• Procedure German-Mongolian: The German part of the Pons German-English dictionary has been used unaltered by the editorial staff; the Mongolian lexicographers merely add translations.
• Procedure Mongolian-German: The Mongolian side has been compiled on the basis of an older dictionary and a manual collection of neologisms (done by a Mongolist).
[Slide 9]

Main problems of the German-Mongolian / Mongolian-German dictionary
• Dictionary basis: The empirical dictionary basis was insufficient.
• Microstructure: The structure of the articles excluded the use of the dictionary as an active dictionary.
Measures taken:
• Corpus-linguistic foundation: development of a corpus-based lexicographic workplace.
• Training: training of staff and freelancers (connection between dictionary use and dictionary structure).
[Slide 10]

Corpus-based lexicographic workplace
Contents of the CLW
1) Information: Basic information material on (i) dictionary structure, (ii) the major functions of the software installed, (iii) the compilation of own corpora.
2) Corpus analysis software: Installation of corpus analysis software:
• AntConc
• KWICFinder
• Leipzig Corpus Browser
• Co-occurrence Database / COSMAS II (Institut für Deutsche Sprache, link)
3) Corpora: Collection of corpora:
• German newspaper corpus of the Leipzig Corpus Collection (15 million textwords)
• English newspaper corpus of the Leipzig Corpus Collection (21 million textwords)
• Monsudar Mongolian corpus (under construction)
• Corpus-based frequency lists of German words: (i) based on the IDS corpus collection (2000 million textwords), (ii) based on the German LCC corpus
[Slide 11]

Corpus-based lexicographic workplace
Corpus analysis software
Four freely available resources for corpus analysis:
• Corpus analysis software I: AntConc
• Corpus analysis software II: Corpus Browser
• Corpus analysis software III: COSMAS II & CCDB
• Corpus analysis software IV: KWICFinder
[Slide 12]

Corpus analysis software I: AntConc
AntConc
• Developer: Laurence Anthony, Faculty of Science and Engineering, Waseda University, Japan.
• Version: 3.2.1w (Windows), released March 10th, 2007.
• Search: offline.
• Software: installed on a local computer.
• Access: free download.
• Corpora: own (txt files).
• Languages: all (Unicode), e.g. German, English, Romanian, Mongolian.
• URL: http://www.antlab.sci.waseda.ac.jp/antconc_index.html.
[Slide 13]

Corpus analysis software I: AntConc
(I) concordances (KWICs)
(II) frequencies / word lists
(III) clusters
(IV) co-occurrences
[Slide 14]

Search: concordances for geloven in the Dutch corpus of the Leipzig Corpus Collection (newspapers).
Search term (here: geloven)
Sort (here: alphabetically according to the word on the right of the search term)
[Slide 15]

Search: frequency list of all word forms in part of the English corpus of the LCC (newspapers)
Start (no search term)
Sort (here: according to frequency)
[Slide 16]

Search: clusters of 2 words ending in off in part of the English corpus of the LCC
Search term position (here: on right)
Size of cluster (here: clusters of two words)
Search term (here: off)
[Slide 17]

Co-occurrence analysis – the basic idea
1) Assumption: In a certain corpus, word X occurs 1000 times, word Y 100 times, word Z 10 times.
2) Probability: If the words were distributed independently of each other, the combination XY would be ten times as likely as the combination XZ. XY should therefore occur ten times as often as XZ.
3) Observation: Actually, XZ occurs about as often as XY.
4) Conclusion: There is a close linguistic connection between X and Z (closer than expected by chance).
[Slide 18]

Search: co-occurrences for just in part of the English corpus of the LCC.
List of co-occurrence partner words with rank, frequency, and significance measure
Definition of search context (here: up to 2 words after the search term)
Search term (here: just)
Sort (here: according to significance of co-occurrence)
Frequency condition (here: at least 10 tokens)
[Slide 19]

Corpus analysis software I: AntConc
• can be recommended for smaller corpora (up to 20 million text words)
• strengths: sorted concordances, word lists, cluster analyses, keyword analyses
• less useful for co-occurrence analyses (too slow; larger corpora are needed)
[Slide 20]

Corpus analysis software II: Corpus Browser
Corpus Browser
• Developer: Volker Boehlke (University of Leipzig).
• Version: 1.00 (Windows).
• Search: offline.
• Software: locally installed.
• Access: free download.
• Corpora: integrated into the program; own corpora can be created.
• Languages: 14 languages (see next slide).
• URL: http://corpora.informatik.uni-leipzig.de/download.html.
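Returning briefly to the basic idea of co-occurrence analysis and the AntConc settings shown for just (search context of up to 2 words after the search term): the reasoning can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used by AntConc or the CCDB. The function names, the corpus size of 1 000 000 tokens and the observed pair frequency of 8 are hypothetical; the significance score is a standard log-likelihood ratio computed from a 2x2 table, and the table treats every corpus position as one possible pairing, which is a simplification.

```python
import math
from collections import Counter


def count_right_partners(tokens, node, span=2):
    """Count the words found up to `span` tokens after each occurrence of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            counts.update(tokens[i + 1:i + 1 + span])
    return counts


def expected_pairs(freq_a, freq_b, corpus_size):
    """Number of co-occurrences expected if both words were spread independently."""
    return freq_a * freq_b / corpus_size


def log_likelihood(pair, freq_a, freq_b, corpus_size):
    """Log-likelihood ratio (G2) for the 2x2 table: a present/absent x b present/absent."""
    table = [
        [pair, freq_a - pair],
        [freq_b - pair, corpus_size - freq_a - freq_b + pair],
    ]
    row = [sum(r) for r in table]
    col = [table[0][j] + table[1][j] for j in range(2)]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            observed = table[i][j]
            if observed > 0:  # empty cells contribute nothing
                g2 += observed * math.log(observed * corpus_size / (row[i] * col[j]))
    return 2.0 * g2


# Frequencies from the slide: X = 1000, Y = 100, Z = 10.
corpus_size = 1_000_000  # assumption: the slide gives no corpus size
observed = 8             # assumption: XY and XZ are each observed 8 times
for name, freq in (("Y", 100), ("Z", 10)):
    print(name,
          expected_pairs(1000, freq, corpus_size),
          round(log_likelihood(observed, 1000, freq, corpus_size), 1))
```

With equal observed frequencies, the rarer partner Z receives the higher score, which corresponds to the conclusion that the connection between X and Z is closer than expected. A frequency condition such as "at least 10 tokens" would simply filter the counts returned by count_right_partners before scoring.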
[Slide 21]

The corpus size is measured by the number of sentences included in the corpus. When downloaded as plain text files, the corpora can also be used with AntConc.
[Slide 22]

Search term (here: vite)
Results (for word):
• absolute frequency
• frequency class
• corpus examples
• significant left and right neighbors
• co-occurrences
[Slide 23]

Corpus analysis software III: COSMAS II & CCDB
COSMAS II & CCDB
• Developer: Institut für Deutsche Sprache (CCDB: Cyril Belica).
• Version: 1.2.1.
• Search: online.
• Software: installed locally (client) or as web interface.
• Access: free download of the client (registration).
• Corpora: corpora of the IDS.
• Languages: German.
• URL: https://cosmas2.ids-mannheim.de
[Slide 24]

Loading corpora with COSMAS II
[Slide 25]

Example for search in COSMAS II
Looking for: dass-clauses as sentential subject with the verb helfen ('to help').
Assumption: Sentential subjects with helfen mainly occur within the construction <[…] es […] hilft, dass/daß>.
Search: (es /+w3 &helfen) /+w1 (dass oder daß)
Examples:
T04 Der SPD hat es nicht geholfen, dass der Sympathieträger und
B99 Uns könne es nur helfen, dass wir so früh den Weg zu
B02 Vielleicht hat es Metzelder geholfen, dass die Kollegen seinen
E96 Da wird es auch nicht helfen, dass der Publikumsrat
E99 Mir hat es viel geholfen, dass ich Kabuki-Theater
N98 "Uns könnte es helfen, daß gleichzeitig Landtagswahl ist",
P93 Saddam Hussein könnte es helfen, daß Zulieferstaaten ... eine volle
P98 "Wenn es Saddam hilft, daß Unscom von Diplomaten
R99 Was kann es nun helfen, daß inzwischen 13 der 15
[Slide 26]

Question: co-occurrences for bestehen (in particular governed prepositions).
Co-occurrence analysis for bestehen as part of the CCDB (setting: do not ignore function words)
Primary co-occurrence partner of bestehen (here: aus)
Strength of the connection (here: 40683)
Secondary co-occurrence partners of bestehen + aus, here: aus Mitgliedern / Teilen / Ortsteilen bestehen, 'consist of members / parts / suburbs'
Typical syntagmatic patterns in which the words co-occur, e.g. besteht aus […] [zwei|drei] Teilen, 'consists of […] [two|three] parts'
[Slide 27]

Results (among others):
aus: besteht […] aus ('consists of […]'); besteht […] aus […] Mitgliedern ('consists […] of […] members')
darin: besteht […] darin, dass ('is […] that'); die Schwierigkeit […] besteht […] darin, dass ('the difficulty […] is […] that')
darauf: besteht […] darauf, dass ('insists […] that'); er bestand […] darauf, dass ('he insisted […] that')
worin: worin […] besteht; worin […] besteht der Unterschied zwischen ('what […] is the difference between')
• governed prepositions: auf, aus, in
• the prepositions auf and in occur in particular with prepositional complement clauses (darauf, dass; darin, dass)
• the preposition in often occurs in interrogative sentences
[Slide 28]

Corpus analysis software III: COSMAS II & CCDB
• probably the best co-occurrence analysis available; easy access via the co-occurrence database
• very extensive search language for corpora
• working with COSMAS II needs some training
[Slide 29]

Corpus analysis software IV: KWICFinder
KWICFinder
Key Word in Context Research Tool and Concordancer for the Web
• Developer: William Fletcher.
• Version: 0.98.22 (beta version), 11 Dec. 2006 (Windows).
• Search: online.
• Software: locally installed.
• Access: free download.
• Corpora: WWW.
• Languages: ca. 20 languages written in the Latin script.
• URL: http://www.kwicfinder.com/KWiCFinder.html.
[Slide 30]

Corpus analysis software IV: KWICFinder
• produces concordances on the basis of WWW pages
• search can be restricted to pages with particular titles or in particular domains
• can be used to find examples of colloquial language (chat rooms) or examples of special / technical language
[Slide 31]

Employing corpus analysis software in dictionary making
The application domains in detail:
• determining relevant meaning variants (by studying concordances and co-occurrence analyses)
• identifying collocations and other fixed expressions (by evaluating co-occurrence analyses)
• choosing examples and typical contexts of usage (by evaluating cluster analyses and co-occurrence analyses)
• examining the lemma list (by comparing the existing list with frequency lists and lists of keyword searches)
• Example I: Identification of meaning variants and contexts of use
• Example II: Exploration of collocations and fixed expressions
• Example III: Identification of special vocabulary
[Slide 32]

Example I: Identification of meaning variants and contexts of use
Article for abziehen (literally: to pull off) in Vietze's (1981) German-Mongolian dictionary.
[Slide 33]

Structure of the article
Lemma: abziehen
Inflection: <32a>
Grammatical variants
1: tr
  Translations
    general: ('pull off'?)
    specific
      1: Fell ('coat/fur') 'skin'
      2: Flüssigkeit ('liquid') ('bottle'?)
      3: Math ('mathematics') 'subtract'
      4: Typ ('typography') 'run off'
  Examples
    1: das Rasiermesser ~ ('the straight blade razor') 'sharpen'
    2: Rinde ~ ('bark') 'pull off'?
    3: den Schlüssel ~ ('the key') 'take out'
2: intr
  Translations
    specific
      1: sich entfernen 'go away'
      2: sich zurückziehen 'withdraw'
  Examples
    1: unverrichteterdinge ~ ('go away without achieving anything')
[Slide 34]

Step 1: co-occurrence analysis for abziehen (CCDB); function words not considered.
Legend: meanings covered in Vietze / meanings not covered in Vietze
Truppen abziehen, 'to withdraw troops'
unverrichteter Dinge wieder abziehen, 'to go away without having achieved anything'
wurden zwei Punkte abgezogen, 'two points were deducted'
eine Show abziehen, 'to make a scene'
die Haut abziehen, 'peel (fruit), skin (an animal)'
[Slide 35]

vom Einkommen abziehen, 'to deduct from the income'
den Zündschlüssel abziehen, 'to take out the ignition key'
[Slide 36]

aus 20 Metern abziehen, 'to shoot (a ball) vigorously from 20 m distance'
Botschafter (aus …) abziehen, 'to withdraw the ambassador (from …)'
[Slide 37]

Kapital (aus …) abziehen, 'to withdraw capital (from …)'
[Slide 38]

den Rauch abziehen lassen, 'to let the smoke escape'
[Slide 39]

Step 2: using KWICFinder to collect concordances reflecting colloquial German from the internet
Enter search term: abziehen
Search in pages that show "chat" in their title.
[Slide 40]

Results
[Slide 41]

more meanings not covered in Vietze
(1) Die Leute die mich kennen, wissen, daß ich eigentlich eine ganz Friedfertige und Versöhnliche bin. Aber was hier einige Leute abziehen ... echt therapiebedürftig!!!
'[…] what some people are pulling off here […]'
abziehen – 'to pull off something' (coll.)
Was ziehst du hier ab? 'What are you pulling off here?'
(2) Die Suppe mit Salz abschmecken, mit verquirltem Eigelb abziehen und die Spargelstückchen hineingeben.
'[…] thicken the soup with beaten egg yolk […]'
abziehen – 'to thicken' (gastr.)
er zieht die Suppe mit Eigelb ab 'he thickens the soup with egg yolk'
[Slide 42]

(3) ich finde auch den preis etwas niedrig und der ebayer hat auch nur 2 bewertungen,habe deshalb ihn gefragt,ob wir das geschäft über den treuhandservice abwickeln können.jetzt warte ich auf seine antwort.nicht das der mich abziehen will,nur weil vielleicht zu wenig für das board geboten wurde.nicht mein problem.
'[…] that he wants to swindle me […]'
abziehen – 'to swindle / cheat' (coll.)
er versuchte mich abzuziehen 'he tried to swindle me'
(4) Bieretiketten kann mein einfach von der Flasche abziehen.
'[…] Beer labels can be easily pulled off the bottle […]'
abziehen – 'to pull off'
sie zog das Etikett von der Bierflasche ab 'she pulled the label off the beer bottle'
[Slide 43]

Overview (column headings on the slide: meanings covered in Vietze / more meanings not covered in Vietze; the assignment of the entries to the two columns is not preserved in the transcript)
Relevant for a general bilingual dictionary: intr. 'withdraw'; 'take out (key)'; tr. 'withdraw (troops, ambassador)'; 'withdraw (capital)'; 'deduct (points)'; intr. 'go away'; (coll.) 'swindle, cheat'; (math.) 'subtract'
Irrelevant for a general bilingual dictionary: 'skin (coat/fur)'; (coll., neg.) 'do (something)'; intr. 'escape (of smoke)'; 'pull off (label)'; (coll.) 'shoot vigorously'; 'deduct (something from income)'; 'sharpen (a straight blade razor)'; (typogr.) 'run off'; (youth) 'extort'; 'pull off (bark)'; (youth) 'tear off and rob'; (gastr.) 'thicken (soup)'
[Slide 44]

Example II: Exploration of collocations and fixed expressions
Article from the new Monsudar German-Mongolian dictionary (preliminary version).
20 Flaschen à 8 Euro, '20 bottles at 8 Euros each'
[Slide 45]

Concordances for à in a selection of 1 million textwords from the German corpus within the LCC
Fixed expression à la, 'after the fashion of' (5 out of 10 hits)
Fixed expression peu à peu, 'bit by bit' (1 out of 10 hits)
[Slide 46]

Co-occurrence analysis on the basis of the German Reference Corpus (2 billion textwords); COSMAS II web interface
la as the most significant co-occurrence partner of à (log likelihood ratio: 135300)
peu as the second most significant co-occurrence partner of à (log likelihood ratio: 15974)
Both collocations, à la and peu à peu, are missing in the dictionary.
[Slide 47]

Example III: Identification of special vocabulary
Task: The German part of the German-Mongolian dictionary is supposed to contain those words used in German that are specific to Mongolian culture and typically occur in German texts related to Mongolia: Jurte, Airag, …
Step 1: Google search for texts containing "Mongolei".
[Slide 48]

Step 2: Copy all texts that seem suitable at first sight into a txt file.
[Slide 49]

Step 3: Loading the text corpus resulting from this procedure into AntConc.
[Slide 50]

Step 4: Loading a reference corpus (e.g., newspaper texts) under Tool Preferences / Keyword List.
[Slide 51]

Step 5: Starting compilation of a keyword list (i.e., words typical of the special corpus compared to the reference corpus).
[Slide 52]

Step 6: Manual evaluation of the keyword list yields Airag, Airak, Jurte, Nomade, Tugrik, Yak, Khan, Obertongesang, Chuuschuur, Pferdekopfgeige, Schamane, Milchschnaps, etc.
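The comparison behind steps 3 to 5 (words that are typical of the special corpus relative to a reference corpus) can be approximated outside AntConc as well. The following is a minimal sketch, assuming a log-likelihood keyness score on relative frequencies; whether AntConc's keyword list is configured with this or another statistic is not stated on the slides. The names tokenize and keyword_list and the two sample texts are hypothetical and invented for illustration; in practice the special text would be the txt file from step 2 and the reference text a newspaper corpus.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Crude tokenizer: lowercased runs of word characters (so 'Jurte' becomes 'jurte')."""
    return re.findall(r"\w+", text.lower())


def keyword_list(special_text, reference_text, min_freq=2):
    """Rank words that are over-represented in the special corpus by log-likelihood keyness."""
    special = Counter(tokenize(special_text))
    reference = Counter(tokenize(reference_text))
    n_special = sum(special.values())
    n_reference = sum(reference.values())
    total = n_special + n_reference
    keywords = []
    for word, a in special.items():
        if a < min_freq:
            continue
        b = reference[word]  # 0 if the word is absent from the reference corpus
        expected_special = n_special * (a + b) / total
        expected_reference = n_reference * (a + b) / total
        if a <= expected_special:
            continue  # not more frequent than expected in the special corpus
        g2 = a * math.log(a / expected_special)
        if b > 0:
            g2 += b * math.log(b / expected_reference)
        keywords.append((word, a, b, round(2.0 * g2, 2)))
    return sorted(keywords, key=lambda item: item[3], reverse=True)


# Invented toy texts, only to make the sketch runnable.
special_sample = ("Die Jurte steht in der Steppe. In der Jurte gibt es Airag. "
                  "Airag ist vergorene Stutenmilch.")
reference_sample = ("Die Regierung hat in der Sitzung beschlossen, dass es in der Stadt "
                    "mehr Busse gibt. Die Stadt ist groß.")

for row in keyword_list(special_sample, reference_sample):
    print(row)
```

Each row holds the word, its frequency in the special corpus, its frequency in the reference corpus, and the score. The manual evaluation in step 6 remains necessary, since a list like this also contains function words and noise.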
[Slide 53]

Corpus use in lexicography
big, expensive corpus-linguistic solutions (large, annotated corpora, tailor-made analysis software): allow for empirically sound, scientific dictionaries
small, inexpensive corpus-linguistic solutions (small, unannotated, plain-text corpora, free analysis software): much, much better than "corpus-free" dictionaries
[Slide 54]

Copy of the slides (on Monday) under: http://www.ids-mannheim.de/ll/lehre/engelberg/talks/talks.html
engelberg@ids-mannheim.de
[Slide 55]