text corpus

On Data Bases and Text Corpora as Tools for Translation. Lectures within the audiovisual translation studies
Hans-Harry Droessiger
ISBN 978-609-459-308-6
(c) 2013 by Hans-Harry Droessiger

Hans-Harry Droessiger. On Data Bases and Text Corpora as Tools for Translation. Teaching material (lecture notes). Kaunas, 2014. 259 pp. (on electronic media). ISBN 978-609-459-308-6
Reviewed and recommended for publication by the Council of the Kaunas Faculty of Humanities, Vilnius University (26 March 2014, minutes No. 8).
Reviewers: Doc. dr. Jurga Cibulskienė (Lietuvos edukologijos universitetas), Doc. dr. Goda Rumšienė (VU KHF)

Contents
Lecture 1: Aims and Topics of the Course – A Short Overview
Lecture 2: From Text Corpus to Data Base – Definitions, Features, Functions
Lecture 3: Sorts of Corpora
Lecture 4: On the Analysis of Corpus Data. Part 1: Subtopics 1 – 4
Lecture 5: On the Analysis of Corpus Data. Part 2: Subtopics 5 – 7
Lecture 6: Working with a self created corpus. Part 1
Lecture 7: Working with a self created corpus. Part 2

Organisation of the Course
• In every lecture, theoretical material, examples, figures and more are presented.
• In certain cases an internet connection will be used to perform a live search or presentation.
• Every lecture ends with “Summary: Basic Knowledge” (except lecture # 1).
• Every lecture closes with “Tasks and Assignments”, which have to be completed by a deadline given by the lecturer.

Literature (1)
• Usonienė, A.; Grigaliūnienė, J.; Ryvitytė, B.; Būtėnas, L.; Jasionytė, E. (2008): Lietuvių mokslo kalbos tekstynas. – In: Baltistica XLIII (1), p. 101-114
  – online: http://www.baltistica.lt/index.php/baltistica/article/view/1212/1134
• Petkevičiūtė, I.; Tamulynas, B. (2011): Kompiuterinis vertimas į lietuvių kalbą: alternatyvos ir jų lingvistinis vertinimas. – In: Kalbų studijos, Nr. 18, p. 39-46
  – online: http://www.kalbos.lt/archyvas4.html
• Lindquist, Hans (2011): Corpus Linguistics and the Description of English. – Edinburgh: University Press

Literature (2)
• Teubert, Wolfgang; Čermáková, Anna (2007): Corpus Linguistics: A Short Introduction. – London: Continuum. – VUKHF: 800.004 (Lituanistikos katedra)
• Scherer, Carmen (2006): Korpuslinguistik. – Heidelberg: Winter
• Halliday, M.A.K. et al. (2004): Lexicology and Corpus Linguistics: An Introduction. – London. New York: Continuum
• and any other scientific monograph, scientific article or course book about corpus linguistics, text corpora, lexicography, lexicology, data bases, computer aided language or text analysis... you will find in libraries, online libraries, or other sources

Lecture # 1: Aims and Topics of the Course – A Short Overview

Aims of the Course
• Learning basic terms and discussing their definitions and descriptions in order to use them in practice; developing theoretical background knowledge
• Practising searches to collect the linguistic information needed for correct translations and philological research
• Developing awareness of translation problems that are connected with linguistic information

Topics of the course
• Basic terms: database, text corpus ...
• Functions, forms, features, and purposes of databases and text corpora
• Practical work with databases and text corpora to support translations and any other kind of philological or linguistic research

Databases and text corpora: examples
• Languages of the European Union
• British English
• US English
• Lithuanian: http://tekstynas.vdu.lt/
• Term Bank: http://terminai.vlkk.lt/pls/tb/tb.search
• Online Dictionary: http://www.oxforddictionaries.com/, http://dictionary.cambridge.org/, http://www.lkz.lt/
• Online Encyclopaedia: http://lt.wikipedia.org/wiki/Pagrindinis_puslapis

Tasks and assignments
• Set up a list of information resources for the English language (dictionaries, encyclopaedias) containing at least 5 printed and 5 digital entries.
• What sorts of online resources are reliable, and why?
• Set up a list of at least 10 theoretical papers (published in 2000 or later) dealing with the problems of data bases and text corpora (do not use the list presented by the lecturer). Papers may be written in English, Lithuanian and, additionally, in any other language you read and understand.

Lecture # 2: From Text Corpus to Data Base – Definitions, Features, Functions

TEXT CORPUS
• Definition
• Contents of text corpora
• Main features of text corpora
• Use and purposes of text corpora

Definition of text corpus
• a collection of texts or parts of texts, organised and classified by linguistic parameters;
• a text corpus represents a certain part of a language;
• the contents of a text corpus should be stable and reliable;
• its function is to allow and to support linguistic research;
• texts of a corpus are called primary data

Contents of text corpora
• Text corpora can be organised and classified by:
  – use of language:
    • written, spoken texts
  – text sorts or communicative intentions:
    • literary forms; informative text sorts, entertaining text sorts etc.
  – sociolinguistic parameters:
    • texts in standard language, dialect, sociolect, LSP etc.
  – historical aspects:
    • texts of the different phases of the historical development of a language
  – individual aspects:
    • texts by certain authors, e. g. William Shakespeare

Main features of text corpora
• A text corpus has to be representative:
  – the statistical universe of a text corpus has to be clearly defined
    • Example: youth language
      – definition of “youth”
      – age
      – sorts of utterances (written, spoken)
      – discourse features (partners, situations, circumstances)
  – quantity of text sorts: seldom or often used text sorts;
  – actively or passively used text sorts;
  – quality of text sorts, depending on classifying parameters (as mentioned above)
• A text corpus has to be stable and reliable:
  – once created, a text corpus is not changed in content, quantity, or structure
  – the quantity or size of a text corpus is given as the number of “text words”, i. e.
lexical units and/or word tokens:
    • lexical units: house, write, nice, good
    • word tokens: house – houses, write – writes – wrote – written, nice – nicer – nicest, good – better – best
• Examples of text corpora by quantity:
  – The BNC – British National Corpus: http://corpus.byu.edu/bnc/
  – The Collins Corpus & The Bank of English: http://www.mycobuild.com/about-collins-corpus.aspx
• A text corpus contains:
  – texts in their whole length or
  – parts of texts: a certain number of text words
• A text corpus has to have metadata:
  – Metadata is a set of data which describes and presents information about other data:
    • name of the source of the text
    • name, age, sex of author or editor
    • date of publishing or date of recording
    • information about circumstances of publishing and/or recording

Use and purposes of text corpora
• For linguistic research and/or other scientific work
  – linguistics, historical linguistics, sociolinguistics, dialectology etc.;
  – history, politics, sociology, psychology, cultural sciences etc.
• Linguistic purposes
  – Research of linguistic structures and language variation*;
  – Creating dictionaries*;
  – Creating grammars*;
  – Foreign language teaching;
  – Translation works*;
  – Computational linguistics
• Linguistic purposes (1)
  – Research of linguistic structures and language variation:
    • word collocations – quantity, usual collocations, unusual collocations;
    • word distribution – quantity of word use
    • word order in sentences – rules and regularities
• Linguistic purposes (2)
  – Creating dictionaries:
    • printed and electronic / online dictionaries
    • electronic / online dictionaries for easy use – quickly collect
      Ø information about phonological, grammatical, and semantic features of a word,
      Ø the number of meanings of a word
      Ø contextual and collocational information
      Ø information about origin and history of a word
• Linguistic purposes (3)
  – Creating grammars:
    • of special
variations of languages, e. g. LSP – word order in sentences and syntagmata, use of prepositions, word formation in the case of terms etc.
    • of certain periods in the history of a language
• Linguistic purposes (4)
  – Translation works:
    • monolingual dictionaries – information about context and frequency of word use – to find information about seldom used or stylistically distinct words
    • bi- or multilingual dictionaries – search for equivalents in the target language

Summary
• TEXT CORPUS: main features; purposes / use
• DATA BASE: encyclopaedia, dictionaries, thesauri

Digitalised Sources of Information = Data Bases
• Encyclopaedia
• Dictionaries
• Thesauri
• Term banks ...

Encyclopaedia
• A compendium holding a summary of information from either all branches of knowledge or a particular branch of knowledge;
• its articles or entries are organised alphabetically
• entries are longer and more detailed than those in most dictionaries
Example: www.britannica.com

Dictionary
• A collection of words in one specific language (monolingual) or in several languages (bi- or multilingual);
• words are listed alphabetically;
• word information about: etymology, phonetics, pronunciation, morphology, style, semantics etc.
• lexicographic features: information links and relations between data

Thesaurus (plural form Thesauri)
• A reference work that lists words grouped together according to similarity of meaning (synonyms and antonyms)
• in several sciences, specialised thesauri are designed for information retrieval as a type of controlled vocabulary, for indexing or tagging purposes
• in some cases a thesaurus may be referred to as an ontology
Printed Thesauri – an example
Online Thesauri – an example: www.thesaurus.com

Term bank
• A term bank is a sort of data base to organise and manage the terminology of a certain branch of science or field of knowledge.
• The main aim of a term bank is to support the use of correct terms (and thereby to avoid the use of incorrect terms) in the mentioned branches of science while writing academic papers, preparing lectures, and publishing results of scientific research.
• Term banks are established in monolingual or multilingual versions.
Term bank: http://iate.europa.eu/SearchByQueryEdit.do

Use and purposes of data bases
• Translation works – examples:
  • problems using electronic translation programs (electronic bilingual dictionaries):
  • meter maid, fly-on-the-wall, Mr Underhill

• Translation
  – Example 1: meter maid
The song: http://www.youtube.com/watch?v=Se5JLYKQfDU
Translated by http://translation2.paralink.com/English-Lithuanian-Translator

Lovely Rita meter maid
Lovely Rita meter maid
Lovely Rita meter maid
Nothing can come between us
When it gets dark I tow your heart away
Standing by a parking meter
When I caught a glimpse of Rita
Filling in the ticket in her little white book
In a cap she looked much older
And the bag across her shoulder
Made her look a little like a military man
Lovely Rita meter maid
May I inquire discreetly
When are you free to take some tea with me?

Puikus Rita skaitiklio tarnaitė
miela Rita skaitiklio tarnaitė
miela Rita skaitiklio tarnaitė
nieko gali ateiti tarp mūsų
kai jį sutemus vilkti savo širdies toli
stovi parkingowy
kai aš sugauti jo žvilgsnis apie Rita
pildymas bilieto į savo mažai baltųjų knygą
į BŽŪP ji atrodė daug vyresnio amžiaus
ir maišelį per savo pečių
padaryti savo išvaizda šiek kaip karinės vyras
puikus Rita skaitiklio tarnaitė
gali paklausti nepastebimai
kai tu nemokamai pasinaudoti kai arbata su manimi?
• Translation
  – Example 2: fly-on-the-wall
The documentary: http://www.youtube.com/watch?v=-vjItStmovU

• Translation
  – Example 3: Mr. Underhill
“Please remember, said one of them, that the name Baggins must not be mentioned. I am Mr. Underhill, if any name must be given.” (The Lord of the Rings, Chapter 10)

Summary: Basic knowledge
• Terms and their definitions:
  Ø text corpus
  Ø primary data
  Ø metadata
  Ø data base
  Ø encyclopaedia
  Ø dictionary
  Ø thesaurus
  Ø term bank

Tasks and assignments
• How can text corpora be organised and classified? Use examples as illustrations.
• What linguistic purposes are important for using text corpora?
• How would you evaluate online translation programs? Do you have any experience using them?
• What sorts of data bases can be distinguished, and by what criteria? Give an example of each sort of data base (in addition to the examples presented in the lecture).
• Perform a search for legal terms using the IATE term bank or another. Source language is English, target language is Lithuanian: cold case, crime mapping, presumption of innocence, right to silence, grand jury, plea bargain, prosecution, blue-collar crime.

Lecture # 3: Sorts of Corpora

Sorts of corpora
Corpora can be classified by certain criteria:
- form,
- technology,
- structure,
- use / purpose,
- linguistics

1. Criteria of form
- the number of entries can vary from 20,000 entries in corpora for special purposes up to more than a billion in common corpora;
- by finiteness of the corpus we can subdivide into
  - static corpora
  - monitor corpora
Monitor corpora: http://www.ilc.cnr.it/EAGLES/corpustyp/node19.html

2. Criteria of technology
We can subdivide the corpora into
• computer-aided corpora
• non-computer-aided corpora

Computer-aided corpora (electronic corpora)
• can be read by using personal computers; Example: Wikipedia
• can be created by using special computer software; Example: dictionary creating software http://tshwanedje.com/tshwanelex/

Non-computer-aided corpora (print media corpora)
• not yet digitised print media collections
• digitised handwritings, autographs, and incunabula: http://www.bl.uk/manuscripts/FullDisplay.aspx?ref=Harley_MS_3244

3. Criteria of structure
1. By hierarchy we can subdivide the corpora into:
• common / whole / complete corpora
• partial corpora
2. By completeness we can subdivide the corpora into:
• full text corpora
• partial text corpora

4. Criteria of use and purpose
We can subdivide into
• reference corpora
• special corpora

Reference corpora
A reference corpus
• represents a language as a whole
• provides many sorts of linguistic information / data
• includes all variations of a language
• example: the British National Corpus (BNC)
  - 90 % written sources, 10 % spoken sources
  - the written sources include:
    o national and local newspapers,
    o scientific journals, non-scientific journals,
    o literary books, non-literary books
    o private documents ...
http://www.ilc.cnr.it/EAGLES96/corpustyp/node18.html#SECTION00080000000000000000

Special corpora
A special corpus
• represents a certain variation of a national language, e. g. a certain dialect, the sociolect of the youth, languages for special purposes (LSP) ...
• represents a certain sort of text, e. g. newspaper texts, novels, lyrics, variations of the Holy Bible ...
• is used for language learners' purposes
http://www.visualrhymes.com/41932/database
Special corpora – example 1
Special corpora – example 2
Special corpora – example 3, for the Lithuanian language: http://ualgiman.dtiltas.lt/eiledara.html
Special corpora – example 4, for the Lithuanian language: http://www.upc.smm.lt/ekspertavimas/mddb/Kalbinis%20ugdymas/Lietuvi%C5%B3%20kalba%20ir%20literat%C5%ABra/S%C4%85vok%C5%B3%20apibr%C4%97%C5%BEimai%20ir%20pavyzd%C5%BEiai.pdf

5. Criteria of linguistics
- historical aspect
- aspect of annotation
- aspect of use
- number of languages

1. Historical aspect
We can subdivide into corpora of
• contemporary language
• non-contemporary language: http://users.ox.ac.uk/~stuart/english/med/corp.htm

2. Aspect of annotation
The corpus contains not only primary data, but also linguistic information added by tags:
- phonetics / phonology
- pronunciation
- morphology
- POS
- collocation
- stylistics / register
- discourse features

3. Aspect of use
By use of language we can subdivide into corpora of
- written language
- spoken language
- both written and spoken language

4. Number of languages
We can subdivide into
- monolingual
- bilingual
- multilingual
corpora. Bilingual and multilingual corpora = parallel corpora.
4.1. to 4.4. Parallel corpora: examples
http://opus.lingfil.uu.se/
http://opus.lingfil.uu.se/OpenSubtitles_v2.php

Summary: Basic knowledge
• Terms and their definitions:
  Ø monitor corpus
  Ø computer-aided corpus
  Ø reference corpus
  Ø special corpus
  Ø parallel corpus
  Ø annotation

Tasks and assignments (1)
• Describe the common criteria used to distinguish corpora.
• Describe the distinguishing features of the linguistic criteria of corpora.

Tasks and assignments (2)
Translate the following excerpt – a “Bedtime Story” – from the film “Despicable Me” into Lithuanian. You can use a special corpus like those presented in this lecture:

Three little kittens loved to play
they had fun in the sun all day.
Then their mother came out and said:
“Time for kittens to go to bed.”
Three little kittens started to bawl
“Mommy, we’re not tired at all.”
Their mother smiled and said with a purr,
“Fine, but at least you should brush your fur.”
Three little kittens with fur all brushed
said, “We can’t sleep, we feel too rushed!”
Their mother replied, with a voice like silk,
“Fine, but at least you should drink your milk.”
Three little kittens, with milk all gone,
rubbed their eyes and started to yawn.
“We can’t sleep, we can’t even try.”
Then their mother sang a lullaby.
“Good night kittens, close your eyes.
Sleep in peace until you rise.”
“Though while you sleep, we are apart,
your mommy loves you with all her heart.”

Lecture # 4: On the Analysis of Corpus Data. Part 1: Subtopics 1 – 4

On the analysis of corpus data
1. Levels of description
2. Methods of data analysis and evaluation
3. Comparison of data
4. Search by keyword
5. Word concordances
6. Collocational analysis
7. Word lists

1. Levels of description
Presuppositions: the terms of corpus linguistics (1) have to be distinguished from the terms of general linguistics (2):
(1) text word, token, type – (2) word, word form, lexeme

Example 1: If a fly flies behind flies, then a fly flies after flies.
How many units does this sentence contain?
There are at least 3 correct answers:
1) In an orthographical sense, the sentence contains 12 units, separated by spaces and a comma.
2) In a morpho-syntactical sense, the sentence contains 8 formally distinguished units: if, a, fly, flies (verb), behind, flies (noun), then, after.
3) In a semantic-lexical sense, the sentence contains 7 units: if, behind, then, after, fly, a, to fly.

Example 2: Geri vyrai geroj girioj gerą girą gėrė.
How many units does this sentence contain?
There are at least 3 correct answers:
1) In an orthographical sense, the sentence contains 7 units, separated by spaces.
2) In a morpho-syntactical sense, the sentence contains 7 formally distinguished units: geri, vyrai, geroj, girioj, gerą, girą, gėrė.
3) In a semantic-lexical sense, the sentence contains 5 units: geras, vyras, giria, gira, gerti.

But corpus analyses are not limited to lexical units (words)! A corpus analysis can also be performed at the level of phonemes, sentences, texts, and even semantics. For this reason we will use the terms token and type, which can be used independently for any level or component of language.
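The three ways of counting in Example 1 can be reproduced with a small sketch. The word-class and lexeme labels below are hand-made assumptions for illustration (they are not the output of a tagger), and the lexeme count treats "then" as a lexeme of its own.

```python
# Hand-annotated version of Example 1: (form, word class, lexeme).
# The annotation is an assumption made for this sketch.
SENTENCE = [
    ("If", "conjunction", "if"),
    ("a", "article", "a"),
    ("fly", "noun", "fly (noun)"),
    ("flies", "verb", "fly (verb)"),
    ("behind", "preposition", "behind"),
    ("flies", "noun", "fly (noun)"),
    ("then", "adverb", "then"),
    ("a", "article", "a"),
    ("fly", "noun", "fly (noun)"),
    ("flies", "verb", "fly (verb)"),
    ("after", "preposition", "after"),
    ("flies", "noun", "fly (noun)"),
]

# Orthographical level: every running word counts.
orthographic_units = [form for form, _, _ in SENTENCE]

# Morpho-syntactical level: a form counts once per word class,
# so "flies" (verb) and "flies" (noun) stay apart.
word_forms = {(form.lower(), word_class) for form, word_class, _ in SENTENCE}

# Semantic-lexical level: all forms of one lexeme collapse into one unit.
lexemes = {lexeme for _, _, lexeme in SENTENCE}

print(len(orthographic_units), len(word_forms), len(lexemes))
```

The point of the sketch is that the answer depends entirely on which column of the annotation is counted, which is why the level of description has to be fixed before any corpus count is compared with another.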

A TOKEN is the actually used form of a linguistic unit.
A TYPE is an abstraction of all the actually used forms of a linguistic unit.

2. Methods of data analysis and evaluation
Corpora can be analysed and evaluated qualitatively as well as quantitatively.
Example: Search for foreign words in text corpora.

Qualitative analysis and evaluation
- to classify foreign words by word class (noun, verb, adjective);
- to classify foreign words by branch of vocabulary (LSP, common language);
- to interpret these classifications.
Example: Borrowings in the English language: www.dshumphries.com/Germanisms_Thoughts_2008.pdf (see page 6 f.)

Quantitative analysis and evaluation
The aim is, for example, to evaluate the frequency of word use with regard to the relation between tokens and types, with the following consequences:
§ if the number of tokens of a type is very high, we can call this use stereotypical; innovation in word use is not very likely;
§ if the number of tokens of a type is very low, then many types occur very seldom; this could be evidence of creative and/or innovative use of a language;
§ a special case is the 1 to 1 relation between tokens and types.

The special case of 1 to 1
This special case is called HAPAX LEGOMENON (Greek hápax legómenon, “said only once”).
It means that the linguistic unit occurs only once in the corpus. At the same time this is evidence of creative and unique use of language, because the speaker/writer created something new (a word form, a collocation, a sentence structure etc.) that has never been used before.
CONCLUSION: The higher the number of hapax legomena, the higher the quality of language use.
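The relation between tokens and types can be computed directly from a text. The following is a minimal sketch, assuming a very simple tokenisation rule (lower-cased runs of letters) and treating each distinct lower-cased word form as one type; the function names are hypothetical and a real corpus tool would usually lemmatise first.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lower-cased word tokens (runs of letters only)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def type_token_profile(text):
    """Count tokens and types and list the hapax legomena of a text."""
    tokens = tokenize(text)
    counts = Counter(tokens)  # type -> number of tokens of that type
    return {
        "tokens": len(tokens),
        "types": len(counts),
        "type_token_ratio": len(counts) / len(tokens) if tokens else 0.0,
        "hapax_legomena": sorted(t for t, n in counts.items() if n == 1),
    }


profile = type_token_profile(
    "If a fly flies behind flies, then a fly flies after flies."
)
for key, value in profile.items():
    print(key, value)
```

Because this sketch works on spelling alone, the verb "flies" and the noun "flies" fall into the same type; separating them would require the kind of annotation discussed under the aspect of annotation in Lecture 3.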

Example: Analysis and evaluation in the field of syntax
Several syntactic phenomena can be analysed and evaluated:
- the number of sentences in a corpus or in a text;
- the average length of sentences;
- the number of sentences of a certain length;
- the proportion of sentences of a certain length in a corpus or in a text

"This sentence has five words. Here are five more words. Five-word sentences are fine. But several together become monotonous. Listen to what is happening. The writing is getting boring. The sound of it drones. It's like a stuck record. The ear demands some variety. Now listen. I vary the sentence length, and I create music. Music. The writing sings. It has a pleasant rhythm, a lilt, a harmony. I use short sentences. And I use sentences of medium length. And sometimes, when I am certain the reader is rested, I will engage him with a sentence of considerable length, a sentence that burns with energy and builds with all the impetus of a crescendo, the roll of the drums, the crash of the cymbals--sounds that say listen to this, it is important. So write with a combination of short, medium, and long sentences. Create a sound that pleases the reader's ear. Don't just write words. Write music."
(Gary Provost, 100 Ways to Improve Your Writing. Mentor, 1985)
[Source: http://grammar.about.com/od/rs/g/Sentence-Length.htm]

"The young man's judgment was one at which few people with an eye for beauty would have cavilled. When the great revolution against London's ugliness really starts and yelling hordes of artists and architects, maddened beyond endurance, finally take the law into their own hands and rage through the city burning and destroying, Wallingford Street, West Kensington, will surely not escape the torch. Long since it must have been marked down for destruction. For, though it possesses certain merits of a low practical kind, being inexpensive in the matter of rents and handy for the buses and the Underground, it is a peculiarly beastly little street. Situated in the middle of one of those districts where London breaks out into a sort of eczema of red brick, it consists of two parallel rows of semi-detached villas all exactly alike, each guarded by a ragged evergreen hedge, each with coloured glass of an extremely regrettable nature let into the panels of the front door; and sensitive young impressionists from the artists' colony up Holland Park way may sometimes be seen stumbling through it with hands over their eyes, muttering between clenched teeth 'How long? How long?'"
(P.G. Wodehouse, Leave It to Psmith, 1923)
[Source: http://grammar.about.com/od/rs/g/Sentence-Length.htm]

Conclusion
Qualitative and quantitative analyses and evaluations belong together. They are two sides of the same coin. From quantitative analysis we can step forward to qualitative analysis; and qualitative analysis offers more criteria of classification that we can use for quantitative analysis.

3. Comparison of data
The comparison can help to explain certain developments in language:
1) The comparison of a historical corpus with a corpus of child language can show common rules of language development.
2) The comparison of a corpus of a certain variety of language (dialect, LSP) with a reference corpus can show special developments in the mentioned varieties of a language.
Example: http://corpus.byu.edu/coha/
Example: http://www2.anglistik.uni-freiburg.de/institut/lskortmann/FRED/samples.htm (text file and audio file)
Example: http://www.uni-due.de/SVE/

4. Search by keyword
We can search for units like these:
- word, lexeme: house, break, if
- word form or word stem: houses, broken, engl-
- parts of words, esp. in the case of word formation: un-, mis-, -ish, -able

One aim is to find usual and/or unusual
§ word combinations,
§ word formations,
§ syntagmata, and
§ possible changes of word meanings.
Example: park, http://corpus.byu.edu/coha/

Another aim is to find new words in the vocabulary (= neologisms)
- to create newer editions of dictionaries;
- to collect new vocabulary for learners of the language (both as a native language and as a foreign language)

The search for words, word stems or parts of words can easily be done. Type the word and let the search engine work...

But watch out for:
- case-sensitive spellings: upper-case and lower-case letters
  § park, Park; gates, Gates
- change of word stems by inflexion (inflection)
  § sing – sang – sung, get – got, take – took – taken
  § skristi – skrenda – skrido
- transliterations of foreign writing systems: Russian, Greek, Hebrew, Arabic
  § Марьяновская – Mar‘yanovskaya
  § Чайковский – Chaikovskiy

Database for (and not only for) transliterations

Problems of search by keyword: homograph words (sometimes known as homonym words)

Problems of search by keyword: homonym words
(more information: http://usefulenglish.ru/writing/homonyms-short-list )
Each of the following words appears twice in the list, once for each meaning:
- same spelling, same pronunciation: band, bank, bark, bat, lie, light, race, ring, tip, toast, well
- same spelling, different pronunciation: desert [‘dezert] – desert [di‘zert], tear [tiər] – tear [teər], wind [wind] – wind [waind]

Problems of search by keyword: homophone words
Source: Wikipedia

Summary: Basic knowledge
• Terms and their definitions
  Ø token, type
  Ø qualitative analysis, quantitative analysis
  Ø hapax legomenon
  Ø keyword
  Ø neologism
  Ø homograph words, homonym words, homophone words

Tasks and assignments
• Describe the theoretical and practical differences between the terms of corpus linguistics (text word, type, token) and the terms of general linguistics (word, word form, lexeme). Use examples for illustration.
• Perform a keyword search using the COHA corpus for the keyword body.
• Collect a list of at least 10-15 homograph, homonym and homophone words of Lithuanian.

Lecture # 5: On the Analysis of Corpus Data. Part 2: Subtopics 5 – 7

On the analysis of corpus data
1. Levels of description
2. Methods of data analysis and evaluation
3. Comparison of data
4. Search by keyword
5. Word concordances
6. Collocational analysis
7. Word lists

5. Word concordances
A concordance is a list which presents all occurrences of a word in their contexts. Concordances are usually presented as text lines in a format called KWIC = Key Word In Context.
Concordances help to identify the different meanings of a keyword and the different syntactic structures the keyword is used in.
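To make the KWIC format concrete, here is a minimal sketch of a concordancer. It is not the algorithm of any particular tool; the function name `kwic`, the fixed context width in characters, and the sample text are assumptions chosen for illustration. The keyword is matched as a whole word and case-insensitively, so inflected forms such as "travels" are not found.

```python
import re


def kwic(text, keyword, width=30):
    """Return one KWIC line per occurrence of `keyword` in `text`."""
    # Collapse line breaks so that every context fits on one line.
    flat = " ".join(text.split())
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(flat):
        left = flat[max(0, match.start() - width):match.start()]
        right = flat[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}".rstrip())
    return lines


sample = (
    "I rarely travel abroad. My travel agent sold me a travel book. "
    "Spaceships do not travel slowly."
)
for line in kwic(sample, "travel"):
    print(line)
```

Aligning the keyword in one column is what lets the reader scan the left and right contexts and separate, for example, noun uses from verb uses.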
Example: http://www.lextutor.ca/concordancers/concord_e.html

Example: travel
Summary of the KWIC search for travel:
-  a noun in combination with other nouns: travel agent, travel book, a spaceship travel;
-  a verb in combination with adverbs in different syntactic positions: I rarely travel abroad; spaceships don‘t travel slowly

To execute a KWIC search in your own text corpus, you have to:
-  create your own corpus
-  convert the text files into the *.txt file format
Ø  download the free software KWIC for Windows from http://www.chs.nihon-u.ac.jp/eng_dpt/tukamoto/kwic_e.html
Ø  download the free software Concorder Pro for Macintosh from http://www.macupdate.com/app/mac/10475/concorder-pro
-  install the software and run it

6. Collocational analysis
A collocation is the habitual juxtaposition of a particular word with another word or words. The juxtaposed word can stand to the LEFT or to the RIGHT of the particular word (keyword). When two or more words are juxtaposed with the same keyword at a high frequency, we call this a collocation.

One aim of collocational analysis is to state typical occurrences of a keyword and its juxtaposed context words. These typically occurring juxtapositions generate stable word combinations, sometimes in the sense of idiomatic phrases. This helps to understand juxtapositions / collocations in a more formally syntactic way, e. g. word order. See the following example again:

A second aim of this analysis is to collect linguistic knowledge about terms (concepts) in order to reconstruct our (common) knowledge about certain concepts.
Example: the word concept, using http://www.lextutor.ca/concordancers/concord_e.html

concept
all adjectives in alphabetical order in its LEFT juxtaposition
abstract (1), ambiguous (2), apolitical (1), attenuated (1), basic (4), beneficent (1), bourgeois (1), broader (3), central (2), coherent (1), contradictory (1), credible (1), critical (1), cubist (1), developed (1), different (3), economic (1), elusive (1), embracing (2), emerging (2), essential (1), ethical (1), explanatory (1), fashionable (1), formalist (3), former (1), fundamental (1), general (5), given (1), governing (1), human (1), imported (1), impossible (1), independent (1), inflated (1), influential (1), intractable (1), landless (1), larger (1), liberal (1), malleable (1), meaningful (1), metaphysical (2), modern (1), modernist (1), murderous (1), naturalistic (1), new (7), obsolete (1), operative (1), ordering (1), organizing (1), original (1), philosophical (1), precise (1), primitive (1), problematic (2), problematical (1), progressivist (1), rational (1), related (1), relevant (3), revised (1), rudimentary (1), similar (1), simple (1), single (1), sophisticated (1), substantive (2), surprising (1), technical (1), theological (2), totalizing (1), traditional (1), uncriticized (1), universal (1), unknown (1), useful (1), verbal (1), very (12), whole (4)

concept
all adjectives in order by frequency (3 or more) in its LEFT juxtaposition
very (12), new (7), general (5), basic (4), whole (4), broader (3), different (3), formalist (3), relevant (3)
The more general the meaning of the juxtaposed adjective, the more attention has to be paid to the other juxtaposed words. Example: very concept, new concept:

Look at the RIGHT juxtaposed prepositional phrases. What do you see?
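Frequency counts like the ones given for the adjectives to the LEFT of concept can be approximated with a short Python sketch. It only counts the immediately neighbouring word and does not check its part of speech. The function name neighbours, the side parameter, and the sample sentences are assumptions made for illustration, not part of the concordancer used above.

```python
import re
from collections import Counter

def neighbours(text, keyword, side="left"):
    """Count the words directly to the left or right of keyword, sentence by sentence."""
    counts = Counter()
    # Split into sentences first so that neighbours never cross a sentence boundary.
    for sentence in re.split(r"[.!?]+", text):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        for i, token in enumerate(tokens):
            if token != keyword:
                continue
            j = i - 1 if side == "left" else i + 1
            if 0 <= j < len(tokens):
                counts[tokens[j]] += 1
    return counts

# Made-up sample text, used only to demonstrate the counting.
sample = ("The very concept of a new concept is a general concept. "
          "A new concept of art replaced the very concept of style.")

left = neighbours(sample, "concept", "left")
right = neighbours(sample, "concept", "right")
print(left.most_common())   # most frequent LEFT neighbours first
print(right.most_common())  # most frequent RIGHT neighbours first
```

A real analysis would additionally filter the neighbours by word class (for example adjectives only), which requires a tagged corpus.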
We look at the RIGHT juxtaposed phrases:

And now we look at the LEFT juxtaposed words, esp. verbs:

7. Word lists
A word list is a collection of all TOKENS in a text, text corpus, or database. The aim of this kind of search is to gather information about the frequency of the use of TOKENS in a text, text corpus, or database (---> language statistics).
Word lists of different texts, text corpora, or databases allow comparisons of the frequencies of TOKENS, for example:
-  standard language 1 – standard language 2
-  standard language – LSP
-  standard language – dialect
-  standard language – sociolect
-  LSP 1 – LSP 2
-  dialect 1 – dialect 2
-  sociolect 1 – sociolect 2
-  collected works 1 – collected works 2
-  novel 1 – novel 2
-  poem 1 – poem 2
and so on... but the most important conditions are...
-  the same level of language variation (standard language 1 – standard language 2; LSP 1 – LSP 2)
-  a relation of the kind general term – specialised term (standard language – dialect; the collected works of an author – one novel of this author)
-  the same text sort (non-fictional texts)
-  the same genre (fictional texts)
-  the same medium of the text (press, film, internet)...
but...
... we can gather word lists not only to state the differences between the two (or more) compared texts, text corpora, or databases, but also to state what is missing on one of the compared sides and why it is missing. One of the results or consequences of this kind of search is the development of lexicographic work.
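A word list and a simple comparison of two word lists can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names word_list and missing_types and the two sample texts are invented here, and a token is simply taken to be an unbroken run of letters.

```python
import re
from collections import Counter

def word_list(text):
    """Return a frequency list of all tokens (runs of letters, lower-cased)."""
    # [^\W\d_]+ matches letters only, including non-ASCII letters such as š or ą.
    return Counter(re.findall(r"[^\W\d_]+", text.lower()))

def missing_types(list_a, list_b):
    """Return the types that occur in list_a but are missing in list_b."""
    return sorted(set(list_a) - set(list_b))

# Made-up sample texts standing in for two texts or corpora to be compared.
text_a = "The cat sat on the mat and the dog sat on the rug."
text_b = "The dog and the cat sat in the garden."

freq_a = word_list(text_a)
freq_b = word_list(text_b)

print(freq_a.most_common(3))          # the most frequent tokens of text A
print(missing_types(freq_a, freq_b))  # types of text A that text B lacks
```

The comparison only states what is missing on one side; explaining why it is missing remains the linguist's task.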
English
The most frequently used TOKENS are
Source: http://www.duboislc.org/EducationWatch/First100Words.html
Source: http://www.insightin.com/esl/1000.php

Evaluating and comparing these two lists:

                                      Dubois                         Insight
No. 1-7 (are the same)                the, of, and, a, to, in, is    the, of, and, a, to, in, is
first NOUN token                      word – No. 30                  man – No. 62
first VERB token                      is – No. 7                     is – No. 7
first ADJECTIVE token                 each – No. 44                  more – No. 48
dominating tokens among the top 50    prepositions, pronouns         prepositions, pronouns

Lithuanian
The most frequently used TOKENS are:
Source: http://invokeit.wordpress.com/frequency-word-lists/

1. aš, 2. ir, 3. tai, 4. tu, 5. kad, 6. ne, 7. taip, 8. jis, 9. ką, 10. ar,
11. čia, 12. į, 13. kaip, 14. su, 15. man, 16. kas, 17. mes, 18. o, 19. gerai, 20. bet,
21. mano, 22. iš, 23. yra, 24. tik, 25. ji, 26. tau, 27. buvo, 28. jie, 29. jų, 30. savo,
31. tavo, 32. kai, 33. mane, 34. kur, 35. jei, 36. dabar, 37. dar, 38. apie, 39. jo, 40. jį,
41. to, 42. tave, 43. viskas, 44. jau, 45. ten, 46. labai, 47. turi, 48. po, 49. reikia, 50. už

Evaluating this list:

No. 1-7                               aš, ir, tai, tu, kad, ne, taip
first NOUN token                      pone – No. 80
first VERB token                      yra – No. 23
first ADJECTIVE token                 gerai – No. 19
dominating tokens among the top 50    prepositions, pronouns

Comparing English and Lithuanian word lists:

                                      English: Dubois                English: Insight               Lithuanian
No. 1-7 (English: are the same)       the, of, and, a, to, in, is    the, of, and, a, to, in, is    aš, ir, tai, tu, kad, ne, taip
first NOUN token                      word – No. 30                  man – No. 62                   pone – No. 80
first VERB token                      is – No. 7                     is – No. 7                     yra – No. 23
first ADJECTIVE token                 each – No. 44                  more – No. 48                  gerai – No. 19
dominating tokens among the top 50    prepositions, pronouns         prepositions, pronouns         prepositions, pronouns

Explanation: Why do the syntactic words dominate?

Summary: Basic knowledge
•  Terms and their definitions:
   Ø  concordance
   Ø  collocation
   Ø  word list
   Ø  syntactic word
   Ø  KWIC

Tasks and assignments
•  Perform a KWIC search with the keyword love using the “Corpus Concordance English”* as described in the lecture and analyse the KWIC search to describe collocations, combinations and syntactic roles of the given keyword.
* http://www.lextutor.ca/concordancers/concord_e.html

Lecture # 6: Working with a self created corpus. Part 1

1.  Defining aim and objectives of the research
2.  Building up a corpus
3.  Preparing the corpus for examination
4.  Surveying the data from the corpus

1. Defining aim and objectives of the research
•  What kind of linguistic research has to be done?
•  What exactly should be analysed?
•  What kind of language variety is the research to be based on?
•  What linguistic units / items are of interest?
•  Will the research be done by qualitative or quantitative parameters?
•  What time period is of interest? And will this time period be researched synchronically or diachronically?
1. Defining aim and objectives of the research: an example
The aim is to research the appearance and frequency of word formation suffixes in modern English / Lithuanian.

The suffixes in modern English / Lithuanian are:
Lithuanian suffixes to derive names or categories of persons
Source: „Dabartinės lietuvių kalbos gramatika“ – Vilnius, 2006
English suffixes to derive names or categories of persons

The objectives are:
•  to state the sorts of suffixes used to derive agent nouns (nomina agentis),
•  to state the frequencies of the use of these suffixes,
•  to state the interrelations between sorts of word stems and suffixes in order to explain rules and developments of derivation.

The presuppositions of this research are:
•  standard language,
•  belles-lettres and/or newspapers,
•  qualitative and quantitative research,
•  modern language (synchronic research).

2. Building up a corpus
•  What does “self created corpus” mean?
   –  use an existing corpus
   –  create your own corpus
•  Reasons for an existing corpus:
   –  easy to access, e. g. the B.N.C., the Donelaitis corpus of Lithuanian
   –  contains the material of interest, e. g. newspapers of Great Britain, Lithuania
   –  the corpus can be worked on electronically
•  Reasons for a self created corpus:
   –  the material of interest is not available in electronic corpora:
      •  certain authors’ / writers’ works
      •  certain dialects
      •  certain sociolects like youth language
      •  certain time periods of the history of a language
   –  the existing corpora do not fit the aims and objectives of your research
   –  electronic corpora of a language or a variety of a language simply do not exist
   –  sometimes people prefer to work with a printed corpus instead of an electronic corpus
•  How large can a corpus be?
Some calculations:
   –  three editions of a daily newspaper like “The Times” consist of 200,000 text words
   –  a student’s bachelor thesis consists of 25,000 text words
   –  Shakespeare’s sonnets consist of 103,000 text words
•  Suggestions for a self created corpus:
   –  annual essay / annual thesis: 20,000 text words
   –  bachelor thesis: 100,000 text words
   –  master thesis: 200,000 text words
•  But finally, the size of the corpus depends on the aim(s) and objectives of the thesis.

3. Preparing the corpus for examination
1. Use of existing electronic corpora
1.1. Use of an online electronic corpus: The Donelaitis corpus of Lithuanian: http://tekstynas.vdu.lt/
•  Find the corpus on the internet
•  Log in to the corpus (or register as a user, if necessary)
•  Set the parameters for the search
•  Execute the search
•  Keep the search results
1.2. Work with a specialised electronic data base edition (CD-ROM or download versions)
–  purchase the CD-ROM or download the file of your interest
–  a search across the whole data base is always possible
–  some editions are parallel corpora, useful for certain aims and objectives:
   •  example: “The Cornerstone Bible”

2. Creating your own electronic corpus
2.1. Presuppositions
2.1.1. Find sources
2.1.2. Collect texts by converting them into the file format you want (or have) to work with
2.1.3. Organise the texts by using the same parameters (fonts, layout) to allow easy handling
2.2. Using several programs
2.2.1. Work with a KWIC-program
–  download and install a KWIC-program for Windows or for Macintosh
–  convert all files containing text into the txt-file format (KWIC for Windows)
–  using CasualConc for Macintosh you can work with rtf-, doc-, and pdf-files as well
2.2.2. Work with Adobe Acrobat Reader
–  only pdf-files can be used,
–  but Adobe Acrobat Reader simulates a kind of KWIC-search
2.2.3. Work with Microsoft Word (or similar text processing programs)
–  create the documents in the txt-, rtf-, doc-, or docx-file formats
–  a simple search mode is available to detect words and word forms
2.2.4. Organise your data of interest in a data base program
–  Microsoft Excel
–  Microsoft Access
–  spread sheet module of Libre Office
–  data base module of Libre Office
–  Filemaker
–  a lot of smaller programs available as freeware or shareware on the internet

4. Surveying the data from the corpus
1. Existing electronic corpus
Using a corpus of the English language:
http://www.corpora4learning.net/resources/corpora.html#BE
http://view.byu.edu/
Click the British National Corpus and follow the instructions and advice.

Using a sort of parallel corpus of the English language
Example: “The Cornerstone Bible”
The assignment could be to compare certain parts of the text in order to work out common and different features of the given texts. This is often part of philological studies, and of other disciplines such as theology as well. In the following examples from different versions of The Holy Bible watch out for Mary and her marital status...

2. Self created electronic corpus
•  The chosen form of your own corpus follows the ideas, aims, and objectives of your own work.
•  A simple table in MS Word or in MS Excel can contain lexical material for lexicological research.
•  A data base form or template developed by the researcher works with the categories of the intended research.

Example from my research on culture-bound words in the Brothers Grimm's Fairy Tales
(table columns: title of the fairy tale; German culture-bound words)
Table to compare the lexical items in three languages

Summary: Basic knowledge
•  Terms and their definitions:
   Ø  existing corpus
   Ø  self created corpus

Tasks and assignments
•  For what reasons and purposes should we create our own corpus?
•  Define aim and objectives for a corpus research on linguistic features of verbal prefixes in Lithuanian. Execute this research and present your results.

Lecture # 7: Working with an own corpus. Part 2

1.  Annotating the corpus data
2.  Inquiry of the data

1. Annotating the corpus data
We can distinguish two ways of annotating data:
1.  Annotation of the whole corpus
2.  Annotation of certain parts of the corpus, depending on aim and objectives of the research

We can distinguish two techniques of annotating data:
1.  Annotation by hand
2.  Annotation by computer programs

To annotate a corpus means to add additional (secondary) data to the text corpus, while we call the text corpus itself primary data.

We can divide annotation into two sorts:
-  tagging: annotation on word level
-  parsing: annotation on phrase (syntactic) level

Tagging
A sort of annotation on word level. The most frequently executed sort of tagging is to tag the tokens of the corpus with POS labels. An example:

Using simple POS tags:
They took the train.
They_PRO took_VER-AUX the_ART-DEF train_NN
Jie keliavo traukiniu.
Jie_PRO keliavo_VER-AUX traukiniu_NN

Using extended POS and MORPH tags:
They took the train.
They_PRO_3PS_PL_NOM took_VER-AUX_PAST the_ART-DEF_SGL_ACC train_NN_SGL_ACC
Jie keliavo traukiniu.
Jie_PRO_3PS_PL_NOM keliavo_VER-AUX_PL_3PS_PAST traukiniu_NN_SGL_INSTR

Using SEMANTIC or CONCEPTUAL tags:
They took the train.
They_PERSON_NUMBER-UNSPEC took_ACTIVITY the train_MEANS_SORT-UNSPEC

Overview of all sorts of tags (one of a lot of choices):
http://bulba.sdsu.edu/jeanette/thesis/PennTags.html
Tagging software is available using this link:
http://www-nlp.stanford.edu/links/statnlp.html#Taggers
Tagging software can also be used online:
http://ucrel.lancs.ac.uk/claws/

Parsing
A sort of annotation on phrase level:
•  sentence structure,
•  intonation,
•  Theta-Roles,
•  sorts / types of speech acts
http://nlp.cs.nyu.edu/app/

2. Inquiry of the data
Inquiry of the data is a set of procedures for preparing, collecting, organising, structuring, and classifying single data.
Example: virus in “The First Virus”

Line numbering

Highlighting

Excerpts to extract the tokens and the context of the tokens
•  the first virus was conceived (line 1)
•  the name ‘virus’ was thought by... (line 3)
•  the first virus was completed (line 5)
•  the virus was demonstrated (line 7)
•  the virus was implanted (line 12)
•  that the virus would not spread without detection (line 16-17)

Excerpts
•  if our own corpus consists of only one text (as in this example), it is easy to excerpt into a simple table

Group single data by certain criteria
•  the first virus was conceived (line 1)
•  the name ‘virus’ was thought by... (line 3)
•  the first virus was completed (line 5)
•  the virus was demonstrated (line 7)
•  the virus was implanted (line 12)

•  that the virus would not spread without detection (line 16-17)

Classify data and draw conclusions
•  the first virus was conceived (line 1)
•  the name ‘virus’ was thought by... (line 3)
•  the first virus was completed (line 5)
•  the virus was demonstrated (line 7)
•  the virus was implanted (line 12)
•  that the virus would not spread without detection (line 16-17)

Excerpts
•  if we use a larger corpus, it is useful to create a simple data base, because we need more information about the source of the tokens, e. g.
   –  information about each text: title, date / year, author, average length...;
   –  information about each token: correct form, context;
   –  additional ideas, associations...
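The excerpting steps above (line numbering, finding the token, keeping its context and source) can be sketched in Python. This is only one possible layout of such a simple data base: the record fields, the function name excerpt, and the three-line sample text are assumptions for illustration and are not taken from “The First Virus”.

```python
import string
from dataclasses import dataclass

@dataclass
class Excerpt:
    """One record of the simple data base: source, position, token, context."""
    title: str
    line_no: int
    token: str
    context: str

def excerpt(title, text, keyword):
    """Collect every line of text that contains keyword as a separate word."""
    records = []
    # Line numbering starts at 1, as in the excerpts of the lecture example.
    for line_no, line in enumerate(text.splitlines(), start=1):
        for word in line.split():
            # Remove surrounding punctuation such as quotation marks or full stops.
            form = word.strip(string.punctuation)
            if form.lower() == keyword.lower():
                records.append(Excerpt(title, line_no, form, line.strip()))
    return records

# Made-up sample text; each text line stands for one numbered line of a source.
sample = ("The first virus was conceived as an experiment.\n"
          "It was shown at a seminar.\n"
          "The name 'virus' was chosen later.")

for record in excerpt("Sample text", sample, "virus"):
    print(record.line_no, record.token, "|", record.context)
```

Further fields such as date, author, or the researcher's own comments could be added to the record in the same way.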
Excerpts
•  the advantages of a simple data base are:
   –  it has the form of a data sheet, which can be
   –  converted into text and vice versa (this helps to create an appendix of your research)

Data base “Metaphors”
Table mode
Data sheet mode
Table mode – can be converted into text:
Converted into text, so we can easily create an appendix

Final remarks
•  For the purposes of translation in any form, it is useful to use
   –  tables,
   –  data base programs or
   –  programs for creating your own dictionaries, e. g. TRADOS for Windows

Summary: Basic knowledge
•  Terms and their definitions:
   Ø  annotation
   Ø  secondary data
   Ø  tag, tagging
   Ø  parsing, POS label
   Ø  data inquiry
   Ø  grouping and classifying data
   Ø  excerpt

Tasks and assignments
Perform a parsing (by hand or by using a parsing program) with regard to the length of sentences and to the POS labels using the text example of lecture # 4.
"This sentence has five words. Here are five more words. Five-word sentences are fine. But several together become monotonous. Listen to what is happening. The writing is getting boring. The sound of it drones. It's like a stuck record. The ear demands some variety. Now listen. I vary the sentence length, and I create music. Music. The writing sings. It has a pleasant rhythm, a lilt, a harmony. I use short sentences. And I use sentences of medium length. And sometimes, when I am certain the reader is rested, I will engage him with a sentence of considerable length, a sentence that burns with energy and builds with all the impetus of a crescendo, the roll of the drums, the crash of the cymbals--sounds that say listen to this, it is important. So write with a combination of short, medium, and long sentences. Create a sound that pleases the reader's ear. Don't just write words. Write music."
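For the sentence-length part of such a task, a rough Python sketch could look as follows. It is an assumption-laden illustration, not a parsing program: sentences are split at full stops, question marks and exclamation marks, and every run of letters or digits counts as one word, so that a hyphenated form like Five-word counts as two words. The function name sentence_lengths is invented here.

```python
import re

def sentence_lengths(text):
    """Return a list of (number of words, sentence) pairs."""
    # Split after sentence-final punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # \w+ treats hyphens and apostrophes as word boundaries (an assumption).
    return [(len(re.findall(r"\w+", s)), s) for s in sentences]

# The first three sentences of the text example plus one short sentence from it.
sample = ("This sentence has five words. Here are five more words. "
          "Five-word sentences are fine. Now listen.")

for length, sentence in sentence_lengths(sample):
    print(length, sentence)
```

The POS labels still have to be added by hand or with one of the taggers mentioned in the lecture.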