Corpus Linguistics Around the World
Language and Computers: Studies in Practical Linguistics
Series edited by Christian Mair, Charles F. Meyer and Nelleke Oostdijk

Edited by Andrew Wilson, Dawn Archer and Paul Rayson

Amsterdam - New York, NY

Cover design: Pier Post
Cover image: NASA, the Visible Earth (http://visibleearth.nasa.gov)
Online access is included in print subscriptions: see www.rodopi.nl
The paper on which this book is printed meets the requirements of ISO "Information and documentation - Paper for documents - Requirements for permanence".
ISBN: ...836-4 (bound)
Editions Rodopi B.V., Amsterdam - New York, NY
Printed in The Netherlands

Contents

Preface
Andrew Wilson, Dawn Archer and Paul Rayson

Methodology and steps towards the construction of EPEC, a corpus of written Basque tagged at morphological and syntactic levels for automatic processing
Aduriz I., Aranzabe M.J., Arriola J.M., Atutxa A., Díaz de Ilarraza A., Ezeiza N., Gojenola K., Oronoz M., Soroa A., Urizar R.

The mood of the (financial) markets: in a corpus of words and of pictures
Khurshid Ahmad, David Cheng, Tugba Taskaya, Saif Ahmad, Lee Gillam, Pensiri Manomaisupat, Hayssam Traboulsi and Andrew Hippisley

Towards a methodology for corpus-based studies of linguistic change: Contrastive observations and their possible diachronic interpretations in the Korpus 2000 and Korpus 90 General Corpora of Danish
Jørg Asmussen

Synchronic and diachronic variation: the how and why of sociolinguistic corpora
Kate Beeching

Statistical analysis of the source origin of Maltese
Roderick Bovingdon and Angelo Dalli

Discovering regularities in non-native speech
Julie Carson-Berndsen, Ulrike Gut and Robert Kelly

Tracking lexical changes in the reference corpus of Slovene texts
Vojko Gorjanc

Relating linguistic units to socio-contextual information in a spontaneous speech corpus of Spanish
José María Guirao, Antonio Moreno Sandoval, Ana González Ledesma, Guillermo de la Madrid, Manuel Alcántara

An analysis of lexical text coverage in contemporary German
Randall L. Jones

Analysing a semantic corpus study across English dialects: Searching for paradigmatic parallels
Sarah Lee and Debra Ziegeler

The curse and the blessing of mobile phones – a corpus-based study into American and Polish rhetorical conventions
Agnieszka Leńko-Szymańska

Using a dedicated corpus to identify features of professional English usage: What do “we” do in science journal articles?
Judy Noguchi, Thomas Orr and Yukio Tono

Methods and tools for development of the Russian Reference Corpus
Serge Sharoff

A profile-based calculation of region and register variation: the synchronic and diachronic status of the two main national varieties of Dutch
Dirk Speelman, Stefan Grondelaers and Dirk Geeraerts

A multilingual learner corpus in Brazil
Stella E. O. Tagnin

Quantitative or qualitative content analysis? Experiences from a cross-cultural comparison of female students’ attitudes to shoe fashions in Germany, Poland and Russia
Andrew Wilson and Olga Moudraia

Survey and Prospect of China’s Corpus-Based Research
Yang Xiao-jun

Preface

Andrew Wilson, Dawn Archer and Paul Rayson
Lancaster University

The scope of corpus-based research is becoming ever wider. Not so many years ago, the vast majority of corpus-linguistic research was concerned with the grammar and vocabulary of standard language varieties – the latter meaning, in many cases, British or American English. Whilst research on other topics, languages, and varieties was by no means completely absent from the scene, this was the general picture of the field which came across to the interested observer.

Today, things have changed dramatically. As this volume shows, the range of languages, research questions, and, indeed, methodologies which are addressed by corpus linguists has diversified. It is probably true to say that none of the papers published in this volume focuses primarily on standard English as a general variety.
Here we find work not only on English dialects (Lee & Ziegeler) but also on learner language (Carson-Berndsen, Gut & Kelly; Leńko-Szymańska; Tagnin) and on a wide range of world languages: Basque (Aduriz et al.), Chinese (Xiao-jun), Danish (Asmussen), Dutch (Speelman et al.), German (Jones), Maltese (Bovingdon & Dalli), Russian (Sharoff), Slovene (Gorjanc), and Spanish (Guirao et al.).

In terms of the research questions addressed, the more ‘traditional’ areas of corpus linguistics are still well represented, with papers on vocabulary (Jones), spoken language (Carson-Berndsen, Gut & Kelly; Guirao et al.), synchronic and diachronic variation (Asmussen; Beeching; Bovingdon & Dalli; Gorjanc; Lee & Ziegeler; Speelman et al.), Languages for Special Purposes (Noguchi, Orr & Tono), tagging, and corpus development (Aduriz et al.; Sharoff). However, exciting new departures are also present, with corpus-based work now extending into areas such as cross-cultural rhetoric and social psychology (Leńko-Szymańska; Wilson & Moudraia) and even economic forecasting (Ahmad et al.).

The papers published in this volume are but a small selection from the many which were presented at the Corpus Linguistics 2003 conference, held at Lancaster University in March 2003. This was the second Corpus Linguistics conference which we hosted at Lancaster (the first was in 2001), and, like its predecessor, it truly amazed us with the range of corpus-informed work being carried out world-wide.

Computer corpus linguistics continues to thrive and to extend into many areas of inquiry, a number of which would probably have been unimaginable for its pioneers in the 1960s and 1970s. We are sure that it will continue to hold surprises for us. In the meantime, perhaps this collection of recent corpus research around the globe will whet the reader’s appetite to follow up new developments in this fast-moving field.
Andrew Wilson
Dawn Archer
Paul Rayson
Lancaster, August 2004

Methodology and steps towards the construction of EPEC, a corpus of written Basque tagged at morphological and syntactic levels for automatic processing

Aduriz I.*, Aranzabe M.J., Arriola J.M., Atutxa A., Díaz de Ilarraza A., Ezeiza N., Gojenola K., Oronoz M., Soroa A., Urizar R.
University of the Basque Country
* University of Barcelona

Abstract

In this article, we describe the different steps in the construction of EPEC (Reference Corpus for the Processing of Basque). EPEC is a corpus of standard written Basque that has been manually tagged at different levels (morphology, surface syntax, phrases) and is currently being hand-tagged at deep syntax level following the Dependency Structure-based Scheme. It is intended as a "reference" corpus for the development and improvement of several NLP tools for Basque. This corpus has already been used for the construction of some tools such as a morphological analyser, a lemmatiser, and a shallow syntactic analyser.

1. Introduction

When specifying the strategic priorities for the development of language technology in minority languages, Sarasola (2000) stated:

    Language foundations and research are essential to create any tool or application; but in the same way tools and applications will be very helpful in research and improving language foundations. Therefore, these three levels [applications, tools, and language foundations] have to be incrementally developed in a parallel and coordinated way in order to get the best benefit possible.

Moreover, Sarasola (2000) proposes five phases as a general strategy to follow in the processing of a language: (1) laying foundations, (2) basic tools, (3) tools of medium complexity, (4) advanced tools and multilinguality, and (5) general applications. In all the phases proposed, corpora, first raw and then tagged, stand out as an essential language resource.
In this article, we describe the different steps in the construction of EPEC (Reference Corpus for the Processing of Basque). EPEC is a corpus of standard written Basque that has been manually tagged at different levels (morphology, surface syntax, phrases) and is currently being hand-tagged at deep syntax level. It is intended as a “reference” corpus for the development and improvement of several NLP tools for Basque.

In section 2, we explain how the raw corpus was compiled and briefly describe the design of the tagset. In section 3, we account for the morphological disambiguation process carried out manually over the output of MORFEUS (the morphological analyser for Basque). The shallow syntactic tagging and phrase tagging are explained in sections 4 and 5 respectively. Finally, in section 6 we explain the tag system chosen for the dependency-based syntactic analysis and how the treebank is being tagged manually. Figure 1 shows a diagram of the different phases in the construction of EPEC, contrasting the manual tasks (right column) with the computer-based ones (left column), as well as the dependencies between them.

2. The tagged corpus

2.1 Compilation of the corpus

EPEC is a 50,000-word sample collection of standard written Basque. It is a strategic resource for the processing of Basque and it has already been used for the development and improvement of some tools. Half of this collection was obtained from the Statistical Corpus of 20th Century Basque (http://www.euskaracorpusa.net). The other half was extracted from Euskaldunon Egunkaria (http://www.egunero.info), the only daily newspaper written entirely in standard Basque.

The Statistical Corpus of 20th Century Basque is a reference corpus of Basque including 4,658,036 word-forms. It was created by UZEI (http://www.uzei.com), a non-profit organisation devoted to making the Basque language suitable for any specialised field.
The corpus was constructed on the basis of an exhaustive inventory of 20th century Basque publications, from which a random sample was drawn. This corpus has become an invaluable linguistic reference for written Basque of this period. It was classified according to the following criteria: four periods (1900-1939, 1940-1968, 1969-1990, 1991-1999), six dialect categories (Biscayan, Guipuzcoan, Souletin, Labourdin-Navarrese, Standard Basque, and non-classified), and 14 genres (literary prose, poetry, theatre, administration, newspapers...). Each book or article is also recorded with its author(s) and title.

A subcorpus of about 25,000 word-forms was extracted from this corpus in order to build EPEC. Texts written in standard Basque, corresponding to the last period (1991-1999) and belonging to both literary and non-literary prose, were chosen for this purpose.

The second part of EPEC consists of several articles extracted from Euskaldunon Egunkaria, written in the second half of 1999 and in 2000. The articles were chosen so that they covered an assorted range of topics (economics, culture, entertainment, international, local, opinion, politics, sports...).

[Figure 1 (diagram). Labels in the diagram: compilation of texts; morphosyntactic tagging; MORFEUS; manual disambiguation; CG disambiguation (linguistic knowledge); stochastic disambiguation; comparison; shallow syntax tagging; chunker; CG chunker (linguistic knowledge); manual review; parser; CG parser (linguistic knowledge); manual tagging; EPEC; "Future work"; "Forthcoming".]

Figure 1: Sketch of the different steps in the completion of EPEC

2.2 Design of the tagset

Choosing an appropriate tagset is a crucial task since the usefulness of further applications depends on it.
The main problem we found while defining the tagset for Basque was the absence of an exhaustive one for automatic use. Moreover, printed dictionaries of Basque also lacked systematisation of categories.

For the morphosyntactic treatment of Basque texts, the tag system we developed is a four-level system, ranging from the simplest part-of-speech tagging scheme to full morphosyntactic information. At the first level, 20 general categories are included for lexical items (noun, adjective, verb, pronoun, conjunction...). At the second level, each category tag is further refined by subcategory tags. For instance, the category ‘pronoun’ has 6 subcategories: common, emphatic, interrogative, indefinite, reflexive and reciprocal. The third level includes some basic morphosyntactic information such as declension case, number, etc. This morphological information is carried by the dependent morphemes attached to the stem.

The full output of the morphosyntactic analysis constitutes the fourth level of tagging. The only difference between this and the previous level is that, here, all the morphological information is considered along with the tags for syntactic functions. Morphology and syntax are closely related in Basque, so most syntactic functions are provided by the database, along with inflexional morphemes. For instance, the ergative case in Basque marks the subject of a clause (with transitive verbs), whereas the absolutive case may indicate the subject or the predicative (with intransitive verbs), or the direct object. The specification at this level is very detailed and constitutes the input for the morphosyntactic disambiguation process as well as for syntactic and other types of language processing.
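The four-level organisation of the tagset can be illustrated with a small sketch. The Python below is not the authors' implementation: the class name, the field names, and the feature labels are assumptions made for illustration. It only shows how an analysis might be viewed at each level, and why a word-form can be unambiguous at level 3 yet ambiguous at level 4 (as with the absolutive case, which may mark a subject or a direct object).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Analysis:
    """One morphosyntactic reading of a word-form (hypothetical field names)."""
    category: str            # level 1, e.g. "noun"
    subcategory: str         # level 2, e.g. "common"
    features: tuple = ()     # level 3, e.g. (("case", "abs"), ("number", "sg"))
    functions: tuple = ()    # level 4 adds syntactic functions, e.g. ("subject",)


def project(analysis, level):
    """Return the part of an analysis that is visible at tagset level 1 to 4."""
    if level not in (1, 2, 3, 4):
        raise ValueError("level must be between 1 and 4")
    parts = [analysis.category, analysis.subcategory,
             analysis.features, analysis.functions]
    return tuple(parts[:level])


def ambiguity(analyses, level):
    """Count the distinct readings of one word-form at the given level."""
    return len({project(a, level) for a in analyses})
```

Under these assumptions, two readings of an absolutive noun that differ only in syntactic function collapse into one reading at levels 1 to 3 and separate again at level 4, which is the level the disambiguation process takes as input.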
In addition to these four levels, further tags are added to mark verb chains, noun phrases, and postpositional phrases (see sections 5.1 and 5.2).

At present, we are engaged in the syntactic tagging of the corpus, following the Dependency Structure-based Scheme (see section 6). About 31 syntactic tags are being used for this purpose.

3. Morphosyntactic tagging of the corpus

A morphological analyser of words is an indispensable basic tool when defining a general framework for the automatic processing of agglutinative languages like Basque (Aduriz et al., 1998). However, prior to the completion of the morphological analyser MORFEUS, the design of the tagset had to be accomplished (see section 2.2) and a lexical database developed.

3.1 EDBL, a lexical database for Basque

EDBL (Aldezabal et al., 2001) is a general-purpose lexical database used in Basque text-processing tools. This large repository of lexical knowledge is the basis for many different NLP tasks, and provides lexical information for several language tools including, obviously, the morphological analyser. At present, it consists of nearly 80,000 entries divided into (i) dictionary entries (the same found in any conventional dictionary), (ii) inflected verb forms, and (iii) dependent morphemes, all of them with their respective morphological information.

3.2 MORFEUS, automatic morphological analyser

MORFEUS is a robust morphological analyser for Basque. It is a basic tool for current and future work on NLP. The analyser is based on the two-level formalism proposed by Koskenniemi (1983), which has had widespread acceptance due mostly to its general applicability, declarativeness of rules and clear separation between linguistic knowledge and program. The architecture of the analyser was defined using three main modules:

1. The standard analyser, which uses a general lexicon and a user’s lexicon. This module is able to analyse and generate standard-language word-forms.
In our applications for Basque, we defined more than 130 patterns of morphotactics and two rule systems in cascade, the first one for long-distance dependencies among morphemes and the second one for morphophonological changes. These elements are compiled together in the standard transducer.

2. The analysis and normalisation of linguistic variants (dialectal uses and competence errors). Due to non-standard or dialectal uses of the language and competence errors, the standard morphology is not enough to offer good results when analysing real text corpora. This problem becomes critical in languages like Basque in which standardisation is in process and dialectal forms are still in widespread use. For this process, the standard transducer is extended with new lexical entries and phonological rules, producing the enhanced transducer.

3. The guesser or analyser of words without lemmas in the lexicons. In this case, the standard transducer is simplified by removing the lexical entries and allowing the analysis of any string. Therefore, the standard transducer is substituted by a general transducer that describes any combination of characters.

The morphological analyser gives as a result all the possible analyses of each token in the text.

3.3 Manual disambiguation of the corpus

The manual disambiguation of the corpus was performed on the output of MORFEUS. Thus, the whole corpus was morphosyntactically analysed, assigning to each word-form every possible analysis, without taking into account the context in which it appeared.

Once each word-form in the corpus was morphosyntactically analysed, we carried out the manual disambiguation process. Two linguists independently assigned the correct syntactic tag to each word in the corpus, applying the “double blind” method described in Voutilainen & Järvinen (1995). In case no right tag had been automatically assigned, they typed it themselves.
Both linguists’ answers were compared and, when differences occurred, they agreed a single tag. This manually disambiguated corpus was used both to improve a Constraint Grammar disambiguator and to develop a stochastic tagger.

After the corpus was manually disambiguated, we started to construct a grammar of constraint rules that would automatically select the correct syntactic tags in any real corpus. For this purpose, we chose the Constraint Grammar (CG) formalism (Karlsson et al., 1995; Tapanainen & Voutilainen, 1994), which was designed with the aim of being a language-independent and robust tool to disambiguate and analyse unrestricted texts. The CG grammar statements are close to real text sentences and directly address some crucial parsing problems, especially ambiguity. The role of the CG system is to apply a set of linguistic constraints that discard as many alternatives as possible, leaving at the end the most fully disambiguated sentences possible. Each rule produced for this grammar was checked on the manually disambiguated corpus so as to test its quality and improve it iteratively whenever necessary. Moreover, in the cases in which the analyser did not assign any correct analysis to a word-form in the corpus, the linguists contributed greatly to the improvement of the lexical database and the analyser itself.

Besides this, we also developed a stochastic tagger. Statistical methods need little effort and obtain very good results (Church, 1998; Cutting et al., 1992), at least when applied to English. In our case, we selected the TATOO tagger based on Hidden Markov Models (Armstrong et al., 1995). TATOO was designed to be applied to the output of a morphological analyser, and the tagset can be easily switched without changing the input text. However, because Basque is an agglutinative language with free word order, the stochastic tagger turned out to be much less accurate than for English when trained directly on the output of the morphological analyser.
So, we performed a supervised training on the output of the CG grammar. Since the CG disambiguator leaves a relatively low ambiguity rate, the results of TATOO were much better. Currently, we apply a combination of the CG disambiguator with the stochastic tagger and get good results (Ezeiza 2003). The CG disambiguator is applied first, and the remaining ambiguities are then resolved using the results of TATOO.

4. Shallow syntax tagging

After disambiguating the morphological tags in the corpus, the next step was to assign the corresponding syntactic tag to each word-form. Syntactic function tags follow the philosophy of the Constraint Grammar (CG) formalism in the sense that they are based on a functionally labelled dependency syntax (see note 1). By adopting the CG formalism, we express the syntactic functions of words and the interdependencies that exist among them rather than deep structural relations. So the syntactic tags at this level refer to shallow syntactic functions, i.e. they may provide information about the surface structure of verb chains, noun phrases, or postpositional phrases. Therefore, this results in a shallow parsing of the corpus.

As we mentioned before, most syntactic functions are added to the word-forms together with inflectional morphemes. Morphological suffixes and syntactic functions are closely related in Basque and both are included in the database. Thus, the output of the morphological analyser displays most of these shallow syntactic tags. However, some other syntactic tags that are not inherited from the database are added to the analysis through CG mapping rules. These functions are mostly attached to parts of speech, and they are generally assigned to word-forms provided that they comply with some given contextual conditions.
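The idea of a mapping rule, a tag that is added to word-forms of a given part of speech only when a contextual condition holds, can be sketched as follows. This is a simplified Python illustration rather than Constraint Grammar syntax or the authors' grammar: the `Token` and `MappingRule` classes, the condition `preceded_by_noun`, and the tag name `@MOD<` are all invented for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Token:
    form: str
    pos: str
    functions: List[str] = field(default_factory=list)


@dataclass
class MappingRule:
    tag: str                                       # syntactic function tag to add
    target_pos: str                                # part of speech the rule applies to
    context: Callable[[List[Token], int], bool]    # contextual condition


def preceded_by_noun(sentence, i):
    """Hypothetical condition: the previous token is a noun."""
    return i > 0 and sentence[i - 1].pos == "noun"


def apply_mapping_rules(sentence, rules):
    """Add each rule's tag to every matching token whose context satisfies the rule."""
    for i, token in enumerate(sentence):
        for rule in rules:
            if (token.pos == rule.target_pos
                    and rule.tag not in token.functions
                    and rule.context(sentence, i)):
                token.functions.append(rule.tag)
    return sentence
```

With a rule such as `MappingRule("@MOD<", "adjective", preceded_by_noun)`, an adjective that follows a noun would receive the invented modifier tag, while other tokens are left untouched. A real CG grammar expresses such conditions declaratively and can refer to much richer context.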
The syntactic function tags fall into three main groups: main functions (subject, object, indirect object…), modifiers (indicating the direction relative to their head), and verb functions (used to detect verb chains). This distinction of the syntactic functions is essential for the tagging of the different kinds of phrases (see section 5).

The ambiguity rate related to the shallow syntactic tagging is over 22% (see note 2); that is, out of every 100 word-forms, more than 22 are assigned more than one syntactic tag.

4.1 Manual disambiguation and applications

Once each word-form in the corpus was given at least one syntactic tag, we repeated the manual disambiguation process. This method was similar to the one used for the morphological disambiguation in the previous step. Two linguists independently assigned the correct syntactic tag to each word in the corpus or, in case no right tag had been automatically assigned, they typed it themselves. Then, both linguists agreed a single tag when differences occurred.

After the corpus was manually disambiguated, we started to write a grammar of constraint rules that would automatically select the correct syntactic tags in any real corpus. Each rule produced was checked on the manually disambiguated corpus so as to test its quality and improve it if necessary.

5. Tagging phrases

At this stage we have the corpus manually tagged with surface syntactic tags following the CG syntax. No phrase units are marked yet, although, based on this representation, the identification of various kinds of phrase units, such as verb chains, noun phrases, and postpositional phrases, is reasonably straightforward.

5.1 Tags for verb chains

In order to detect verb chains, we use the verb function tags (@+FAUXVERB, @-FAUXVERB, @+FMAINVERB, @-FMAINVERB…; see note 3) and some particles (the negative particle, modal particles…). Based on these elements we are able to detect not only continuous verb chains but also dispersed ones.
To mark up continuous verb chains, the following tags are attached, again using CG mapping rules:

• %VCH: this tag is attached to a verb chain composed of a single element.
• %VCHI: this is attached to the initial element of a complex verb chain.
• %VCHF: this is attached to the final element of a complex verb chain.

The tags used to mark up the dispersed verb chains are:

• %NCVCHI: this tag is attached to the initial element of a non-continuous verb chain.
• %NCVCHC: this tag is attached to the second element of a non-continuous verb chain.
• %NCVCHF: this tag is attached to the final element of a non-continuous verb chain.

5.2 Tags for noun phrases and postpositional phrases

Our assumption is that any word having a modifier function tag is linked to some word with a main syntactic function tag. Moreover, a word with a main syntactic function tag can, by itself, constitute a phrase unit. With this in mind, we established three tags to mark up this kind of phrase unit (noun phrases or postpositional phrases):

• %PHR: this tag is attached to words with main syntactic function tags that constitute a phrase unit by themselves.
• %PHRI: this tag is attached to the initial element of a phrase unit.
• %PHRF: this tag is attached to the final element of a phrase unit.

In order to attach one of these tags to each word-form, we have simultaneously developed two subgrammars containing CG mapping rules. The first subgrammar is aimed at delimiting verb chains whereas the second one marks noun and postpositional phrases.

5.3 Manual tagging and applications

At present, a linguist is checking the tags that the first set of mapping rules marked up in the corpus. Whenever necessary, she adds, removes, or changes the tags automatically assigned. Once this work is finished, the first set of mapping rules that were developed will be tested on the corpus and the results will be used to improve the rules iteratively as well as to develop new ones.

6. Treebank

The next logical stage in the completion of the corpus is deep syntax tagging, in order to build a treebank (Aduriz et al., 2002). Although manually tagging a treebank is an expensive and time-consuming task, it is also an essential step for the development of syntactic tools and applications for Basque. A group of linguists in our research group is currently involved in this arduous task (see note 4).

[Figure 2 (tree diagram). Node labels in the diagram: dependant; structurally case-marked complements, divided into nc and clausal; arg_mod; modifier; auxiliary; conjunction; pred; finite_clause; non-finite_clause; detmod; ncsubj; ncobj; nczobj; ccomp_subj; ccomp_obj; ccomp_zobj; xcomp_subj; xcomp_obj; xcomp_zobj; ncmod; cmod; xmod; ncpred; xpred.]

Figure 2: Hierarchy of grammatical relations.

After considering a number of diverse choices (including Skut et al., 1997; Oflazer, 1999) we decided to follow a dependency-based procedure, for it was, in our opinion, the one that could best deal with the free word order displayed by Basque syntax. The dependency-based analysis describes the relations existing between components (i.e. word-forms). This way, for each sentence in the corpus we explicitly determine the syntactic dependencies between the heads and the dependents.

In order to define the syntactic tagging system, we adopted the framework presented in Carroll et al. (1998, 1999). By following this line of work, we developed a coding system based on hierarchies of grammatical relations, both for lexical and empty elements, such as pro (see figure 2).

As can be seen in figure 2, the hierarchy distinguishes between several general levels, which are further specified in subsequent levels. Thus, for instance, in the general level we find structurally case-marked complements, thematic roles (arg_mod), modifiers, auxiliaries and conjunctions. In turn, structurally case-marked complements, for example, are divided into noun phrases and clauses.
Each successive level is specified further according to grammatical function (e.g. ncsubj, ncobj, and nczobj). Next, we present an example showing some of the grammatical relations specified in the hierarchy:

ncsubj (Case, Head, Head of NP, the Case-marked element within NP, subj)
ncobj (Case, Head, Head of NP, the Case-marked element within NP, obj)
nczobj (Case, Head, Head of NP, the Case-marked element within NP, ind.obj)

These are examples of structurally case-marked complements when complements are nc (non-clausal, Noun Phrases, henceforth NP; on nczobj, see note 5), as, for instance, in the sentence Aitak haurrari sagarra eman dio ‘Father has given an apple to the child’ (literally ‘Father to-child apple given has’):

ncsubj (erg, eman, aitak, aitak, subj)
nczobj (dat, eman, haurrari, haurrari, ind.obj)
ncobj (abs, eman, sagarra, sagarra, obj)

This description is extremely important, since it determines the number and type of tags needed for each relation (number of slots, the characteristics of each one, etc.). This formalisation will be very useful for future treatments, for example, to transfer all this information into XML format (see section 7).

Tagging the corpus manually has enabled us to find solutions to problems that emerge in the analysing process, such as discontinuous constituents, coordination, or comparative clauses. Moreover, it is not unusual for similar phenomena to be treated as distinct by the different linguists tagging the corpus. In these cases, the group of linguists tries to agree a single analysis that will be regarded as correct thereafter.

Consequently, as the tagging process goes on and we find new solutions to arising problems, accuracy, robustness, and speed will improve. Besides, we are currently developing a computational tool designed to make the manual tagging easier and faster.
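Because every relation in the example above has the same five slots, the notation maps naturally onto a fixed record type. The Python sketch below is only an illustration of that point: the `Relation` type, its field names, and the two helper functions are assumptions, not part of the authors' tools. It reads a relation written in the notation used above and collects the dependents of a given head.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    name: str          # relation label, e.g. "ncsubj"
    case: str          # case of the dependent, e.g. "erg"
    head: str          # head word-form, e.g. "eman"
    np_head: str       # head of the NP
    case_marked: str   # case-marked element within the NP
    function: str      # grammatical function, e.g. "subj"


def parse_relation(text):
    """Parse a line such as 'ncsubj (erg, eman, aitak, aitak, subj)'."""
    name, _, rest = text.partition("(")
    slots = [slot.strip() for slot in rest.rstrip().rstrip(")").split(",")]
    if len(slots) != 5:
        raise ValueError("expected five slots, got %d" % len(slots))
    return Relation(name.strip(), *slots)


def dependents_of(relations, head):
    """Map each grammatical function to the NP head that depends on `head`."""
    return {r.function: r.np_head for r in relations if r.head == head}
```

Applied to the three relations given for Aitak haurrari sagarra eman dio, `dependents_of` would group aitak, haurrari and sagarra under the head eman by function. A fixed slot layout of this kind is also what makes a later conversion to XML mechanical.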
All of this work is being carried out within a project that aims at constructing treebanks for Catalan, Spanish, and Basque (Civit & Martí, 2002).

6.1 Applications

When the manual tagging of the corpus is finished, we plan to develop a tool based on linguistic knowledge that will be able to parse real corpora automatically. As in the previous steps of manual tagging, each rule produced for the parser will be tested on the manually tagged corpus in order to assess its effectiveness and improve it accordingly.

In the future, we also plan to apply machine learning methods to the corpus, in order to carry out automatic tagging.

7. Representation of the Corpus using XML

Over the last three years much effort has been made in our research group (Artola et al., 2002) to integrate the NLP tools for Basque described in this chapter. Due to the complexity of the information to be exchanged among the tools, Feature Structures (FSs) are used to represent it. Feature structures are coded following the TEI’s DTD for FSs, and Feature Structure Definition descriptions (FSD) have been thoroughly defined. The documents used as input and output of the different tools contain TEI-P4-conformant feature structures (FS) coded in XML. The use of XML for encoding the I/O streams flowing between programs forces us to describe the mark-up formally, and lets us use software to check that the mark-up holds invariantly in an annotated corpus.

A detailed analysis of the framework for linguistic knowledge representation and integration developed in our group is beyond the scope of this paper, so we only show the output of the tools of the analysis chain. Figure 3 shows different XML-coded representations of the sentence Noizean behin itsaso aldetik Donostiako Ondarreta hondartzara enbata iristen da (‘Once in a while, a storm arrives from the open sea at Donostia’s Ondarreta beach’).

<text id='T1'> ..
<p>Noizean behin itsaso aldetik Donostiako Ondarreta hondartzara enbata iristen da.</p>
</text>
.xml

...
</p>
<p id='w'>
<linkGrp type='MWLU' tagOrder='y'>
<link ID='mwlnk1' targets='Xw1 Xw2'/>
...
<w id='w1' sameAs='Xw1' type='BEG_CAP'>Noizean</w>
<w id='w2' sameAs='Xw2'>behin</w>
<w id='w3' sameAs='Xw3'>itsaso</w>
<w id='w4' sameAs='Xw4'>aldetik</w>
<w id='w5' sameAs='Xw5' type='BEG_CAP'>Donostiako</w>
<w id='w6' sameAs='Xw6' type='BEG_CAP'>Ondarreta</w>
<w id='w7' sameAs='Xw7'>hondartzara</w>
<w id='w8' sameAs='Xw8'>enbata</w>
<w id='w9' sameAs='Xw9'>iristen</w>
<w id='w10' sameAs='Xw10'>da</w>
<w id='w11' sameAs='Xw11' type='PUNCT_FSTOP'>.</w>
</p>
...
w.xml

<p id='linkGrp'>
<linkGrp type='mwlnk-lem' tagOrder='y'>
<link targets='mwlnk1 ADV'/>
<linkGrp type='w-lem' tagOrder='y'>
<link targets='Xw3 COM-NOUN-1'/>
Morphosyntactic Analysis
<link targets='Xw4 COM-NOUN-2'/>

<text id='LemDoc0001'>
....
<fs id="COM-NOUN-2" type="Lemmatisation">
<f name="Form"><str>aldetik</str></f>
<f name="Lemma"><str>alde</str></f>
<f name="morphological-Features">
<fs type="Top-Features-List">
<f name="POS"><sym value="NOUN"/></f>
<f name="SUBCAT"><sym value="COM"/></f>
<f name="DET"><sym value="DET"/></f>
<f name="NUM"><sym value="S"/></f>
<f name="CASE"><sym value="ABL"/></f>
</fs>
</f>
</fs>
</p>
<p>
<fs id="PROP-N-LOC-1" type="Lemmatisation">
<f name="Form"><str>Donostiako</str></f>
<f name="Lemma"><str>Donostia</str></f>
<f name="Top-Features-List">
<fs type="upper-level-features">
<f name="POS"><sym value="NAME"/></f>
<f name="SUBCAT"><sym value="PROP-LOC"/></f>
<f name="DET"><sym value="DET"/></f>
<f name="NUM"><sym value="S"/></f>
<f name="CASE"><sym value="GEL"/></f>
</fs>
</f>
</fs>
</p>
<p>
.lem.xml

...
<link targets='Xw5 PROP-N-LOC-1'/>
<link targets='Xw6 PROP-N-LOC-2'/>
...
.lemlnk.xml

EUSLEM
Chunker

<text id=span0001>...
<p id="linkgrp">
<linkgrp type="span" tagOrder='y'>
<text id=cad0001> ...
<fs id=COM-NOUN-1 type="phrase"> <f name="chain"> <str>itsaso aldetik</str></f> <f name="head"> <str>itsaso</str></f> <f name="POS"><str>NOM</str></f> <f name="SUBCAT"><str>COM</str></f> <f name="SFL" ORG=‘list’> <sym value = "NCMOD" > </f> </fs> ... Dependencies .sint.xml <text id=dep0001> ... <linkgrp type="dep" targorder=“y“ targFunc=“head dependant case-dep-unit" domains=???> <!-- ... --> <link id="dep1" targets="Xw9 span1 span1"> <link id="dep2" targets=“Xw9 Xw7 Xw7"> ... </linkgrp> dep.xml <link id=“span1” targets=“Xw3 Xw4”> <link id=“span2” targets=“Xw5 Xw6 Xw7”>... </linkgrp> </p> .spanlnk.xml </text> <text id=span-sint0001> ... <p id=“linkgrp”> <linkgrp type=“span-sint” tagOrder='y'> <link targets=“span1 NOUN-COM”> ... </linkgrp> </p> </text> .sintlnk.xml ... <linkgrp type="dep-deplib" targorder="yes"> <link targets="dep1 D-NCMOD-ABL"/> <link targets="dep2 D-NCMOD-ALA"/> ... </linkgrp>... Deplnk.xml <fs id="D-NCMOD-ALA“ type="dependency"> ... <f name=“nombre"> <sym value="NCMOD-ABL"/></f> <f name=“CASE"> <sym value="ALA"/></f> .. </fs> Figure3: Output of the different tools coded in XML. Figure 3: Output of the different tools coded in XML Deplib.xml The construction of EPEC 8. 13 Future work The lexical information gathered in the lexical database (EDBL), which is the basis for several NLP tools in our research group, is constantly being renewed. New entries from diverse sources are periodically added to the database. Moreover, new tools such as multiword units, named entities, or postposition recognisers have been developed. These changes must be reflected in the corpus, so we must review it regularly. Therefore, in the near future, we intend to update EPEC with new information. This will be done semiautomatically, so that only the new information needs to be reviewed. 9. 
Acknowledgements This research is supported by the University of the Basque Country (9/UPV00141.226-14601/2002), the Ministry of Industry of the Basque Government (project XUXENG, OD02UN52), the Interministerial Commission for Science and Technology of the Spanish Government (FIT-150500-2002-244), and the European Community (MEANING project, IST-2001-34460). Notes 1 The concept of dependency-syntax has a long tradition in grammatical analyses since the Greco-Roman era. More recently, within the application of formalisms to syntactic theory, among others we find Tèsniere (1959), Hays (1964) and Mel’cuk (1988), the ones who have recovered dependency-syntax in theoretical terms. 2 This ambiguity was estimated taking into account the syntactic functions of a subset of 200 common words. 3 Finite auxiliary verb, non-finite auxiliary, finite main verb, non-finite main verb… 4 In this research line, our group is taking part in the project entitled “The IXA group, tools for an automatic treatment of Basque: creating a database composed of syntactic-semantic trees” (See ‘acknowledgments’) 5 nczobj would be equivalent to the English nciobj (non-clausal indirect object). References Aduriz I., Agirre E., Aldezabal I., Alegria I., Ansa O., Arregi X., Arriola J.M., Artola X., Díaz de Ilarraza A., Ezeiza N., Gojenola K., Maritxalar A., Maritxalar M., Oronoz M., Sarasola K., Soroa A., Urizar R., Urkia M. 14 Aduriz et al. (1998), A Framework for the Automatic Processing of Basque. Proceedings of the First International Conference on Language Resources and Evaluation, Granada. Aduriz I., Aldezabal I., Aranzabe M., Arrieta B., Arriola J., Atutxa A., Díaz de Ilarraza A., Gojenola K., Oronoz M., Sarasola K. (2002), Construcción de un corpus etiquetado sintácticamente para el euskera. Actas del XVIII Congreso de la SEPLN, Valladolid, Spain. Aldezabal I., Ansa O., Arrieta B., Artola X., Ezeiza A., Hernández G., Lersundi M. 
(2001) EDBL: a General Lexical Basis for the Automatic Processing of Basque. IRCS Workshop on Linguistic Databases, Philadelphia, USA.
Alegria I., Aranzabe M., Ezeiza A., Ezeiza N., Urizar R. (2002) Robustness and customisation in an analyser/lemmatiser for Basque. Proceedings of the Workshop on "Customizing knowledge in NLP applications", Third International Conference on Language Resources and Evaluation, Las Palmas de Gran Canaria, Spain.
Armstrong S., Russell G., Petitpierre D., Robert G. (1995) An Open Architecture for Multilingual Text Processing. Proceedings of the 7th Conference of the European Chapter of the Association for Computational Linguistics, Dublin, Ireland, pp. 101-106.
Artola X., Díaz de Ilarraza A., Ezeiza N., Gojenola K., Hernández G., Soroa A. (2002) A Class Library for the Integration of NLP Tools: Definition and implementation of an Abstract Data Type Collection for the manipulation of SGML documents in a context of stand-off linguistic annotation. Proceedings of the Third International Conference on Language Resources and Evaluation, Las Palmas de Gran Canaria, Spain.
Carroll J., Briscoe T., Sanfilippo A. (1998) Parser evaluation: a survey and a new proposal. Proceedings of the International Conference on Language Resources and Evaluation, Granada, Spain, pp. 447-454.
Carroll J., Minnen G., Briscoe T. (1999) Corpus Annotation for Parser Evaluation. Proceedings of the Workshop on Linguistically Interpreted Corpora, EACL'99, Bergen.
Church K. W. (1998) A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text. Proceedings of the Second Conference on Applied Natural Language Processing, Québec, Canada, pp. 136-143.
Civit M., Martí M. (2002) Design Principles for a Spanish Treebank. Proceedings of the Treebanks and Linguistic Theories (TLT2002), Sozopol, Bulgaria.
Cutting D., Kupiec J., Pederson J., Sibun P. (1992) A Practical Part-of-speech Tagger.
Proceedings of the Third Conference on Applied Natural Language Processing, Philadelphia, USA, pp. 133-140.
Ezeiza N. (2003) Corpusak ustiatzeko tresna linguistikoak. Euskararen etiketatzaile sintaktiko sendo eta malgua. PhD thesis, University of the Basque Country.
Hays D. C. (1964) Dependency theory: a formalism and some observations. Language 40, pp. 511-525.
Karlsson F., Voutilainen A., Heikkila J., Anttila A. (1995) Constraint Grammar: Language-independent System for Parsing Unrestricted Text. Mouton de Gruyter, Berlin.
Koskenniemi K. (1983) Two-level Morphology: A general Computational Model for Word-Form Recognition and Production. University of Helsinki, Department of General Linguistics, Publications 11.
Mel’cuk I. (1988) A Dependency Syntax: Theory and Practice. State University of New York Press.
Oflazer K., Zeynep D., Tür H., Tür G. (1999) Design for a Turkish treebank. Proceedings of the Workshop on Linguistically Interpreted Corpora, at EACL, Bergen.
Sarasola K. (2000) Strategic priorities for the development of language technology in minority languages. Proceedings of the Workshop on "Developing language resources for minority languages: re-usability and strategic priorities", Second International Conference on Language Resources and Evaluation, Athens, Greece.
Skut W., Krenn B., Brants T., Uszkoreit H. (1997) An Annotation Scheme for Free Word Order Languages. Fifth Conference on Applied Natural Language Processing, Washington, DC, USA, pp. 88-95.
Tapanainen P., Voutilainen A. (1994) Tagging Accurately: Don't guess if you know. Proceedings of the 4th Conference on Applied Natural Language Processing, Washington.
Tesnière L. (1959) Eléments de Syntaxe Structurale (2nd ed.). Paris, Klincksieck.
Voutilainen A., Järvinen T. (1995) Specifying a shallow grammatical representation for grammatical purposes. Proceedings of the 7th Conference of the European Association of Computational Linguistics, Dublin.
The mood of the (financial) markets: In a corpus of words and of pictures

Khurshid Ahmad, David Cheng, Tugba Taskaya, Saif Ahmad, Lee Gillam, Pensiri Manomaisupat, Hayssam Traboulsi and Andrew Hippisley
Department of Computing, University of Surrey

Abstract

Corpora of texts are typically used to study the structure and function of language. The distribution of the various linguistic units that make up the texts in a corpus is used to make and test hypotheses relevant to different linguistic levels of description. News reports and editorials have been used extensively to populate corpora for studying language, for making dictionaries and for writing grammar books. News reports of financial markets are generally accompanied by time-indexed series of values of shares, currencies and so on, reflecting the change in value over a period of time. A corpus linguistic method for extracting sentiment indicators, e.g. shares going up or a currency falling, is presented together with a technique for correlating the quantitative time series of values with a time series of sentiment indicators. The correlation may be used in the analysis of the movement of shares, currencies and other financial instruments.

1. Introduction

Financial markets are places where financial instruments are bought and sold. These instruments include shares, currencies and bonds: shares are traded for individual organisations, and traders take options (in slang, bets) on the aggregate value of key shares, e.g. the Financial Times Stock Exchange (FTSE) index. Some of these instruments are traded in millions, others in thousands and yet others in hundreds; the prices of instruments change frequently during a single trading session or over a longer trading horizon. A set of buying/selling prices of instruments, ordered in time, is usually referred to as a (quantitative) time series.
In the financial pages of newspapers, and now on specialised web sites, these time series are either displayed independently or as graphical illustrations within (long) texts. The buying and selling of instruments in itself causes changes in their value: too many buyers for a share and its value goes up, too many sellers and the value goes down. The so-called efficient market hypothesis (EMH) suggests that (the trading in) a financial market is the sole arbiter of the price of an instrument.

Despite the preponderance of the EMH, newspapers and financial web sites regularly report the reactions of individuals, acting on their own or on behalf of organisations or governments; some web sites display the results of polls of financial experts. The experts report whether they are ‘bearish’ (shy to buy or sell), ‘bullish’ (eager or aggressive to buy or sell) or neutral. These polls are conducted at regular intervals and the sentiments of the experts are displayed as a time series. Some analysts then ‘correlate’ the time series of bearish/bullish/neutral voting figures with the time series of financial instruments, and use the correlation as a signal for buying or disposing of an instrument.

The sentiment of the market traders, or market sentiment for short, is shaped by, and in turn shapes, the value of a financial instrument, usually in the short term and perhaps in the long term as well. Much as the sentiment of a trader influences others, others also influence him or her. The view of the others is typically communicated through press statements. One can argue that (financial) news stories may affect the trader’s sentiment or, more precisely, his or her attitude towards an instrument. Ergo, positive news stories may persuade people to invest in the market, thereby driving the prices of instruments up; conversely, negative or gloomy stories force prices down.
Note that the prepositions (up/down) used to describe the position or location of a physical object are also used to describe the change in value of an abstract financial instrument during a fixed time period. News stories and people’s conversation about the financial markets extend these spatial metaphors further by talking in terms of state change: one sees a change in the value of an instrument in terms of rising or falling. The use of literary allusions, including bear/bull and vibrant/anaemic, and of more colourful slang, including the phrase dead cat bounce (which likens an upward movement of a stock to a lifeless object that moves merely because of the laws of physics), shows a creative use of language in the specialist field of financial trading (Ahmad 2002). The sentiment of a trader toward the market may change on reading a news story, in that bullish stories may cheer him or her up, and bearish stories may depress him or her and in turn depress the market (Knowles 1996).

We wish to explore whether it is possible to extract the sentiment from a news story through linguistic analysis. It is possible to use the literature on buying/selling in semantic theory (Jackendoff 1991) as a framework for analysing the meaning of the news stories. The literature on natural language processing (Simmons et al 1984) and on knowledge representation suggests that frame semantics has been used to build systems that can, in principle, analyse, extract and disseminate the meaning (intent?) of a specialist news report. Frame semantics has a number of limitations, and a prominent one is the need for a lexicon that is rich and extensive in terms of meaningful data.

What about a purely lexical approach? Recall Quirk et al.’s (1985) observation that relates the frequency of a lexical item to the acceptability of that item by an educated native speaker of the language.
Stretching this dictum to financial markets, one can argue that a higher frequency of phrases that express positive sentiment suggests that all is well in the market. Similarly, a predominance of negative phrases might suggest that all is not well in the market and that it may have fallen or is about to fall. We have analysed three years’ output of Reuters financial news, comprising over 10 million tokens published during 2000-2002.

Figure 1: Monthly variation of “rose”, “fell” and FTSE 100 (normalised value by month, January to December)

Figure 1 shows the variation in the frequency of the two verbs rose and fell over a one-year period (Ahmad et al. 2002). Also plotted is the value of the FTSE-100 at the close of daily trading during 2002. There appears to be an encouraging numerical correlation between the sentiment verbs and the FTSE-100. Exploring further the hypothesis that sentiment may correlate with the frequency of the phrases used to describe positive and negative changes in the value of an instrument requires an understanding of the following:

How do we analyse texts such that frequent phrases hypothetically related to a sentiment are indeed used in a sentence, or in many sentences, to express the sentiment? The ambiguity of language plays a significant part in confusing the sense of these phrases: for example, rose, as in something rising, may be confused with the flower or indeed with the name of a person. We show how simple cues related to the grammatical categories of phrases in the neighbourhood of the potentially sentiment-laden word help to ensure that the word rose does indeed suggest that the value of a financial instrument is rising.
How do we use the time series of sentiment word frequency in conjunction with the time series of the values of an instrument either to predict stability or chaos in the market, or to discover the ‘turning point’ in the value of an instrument? The turning point is the point in time when the value of the instrument stops decreasing and starts to increase, or vice versa.

How do we organise texts in a diachronic corpus such that these texts may be added or omitted according to the pragmatic attributes of the texts? This is important if the market movements and sentiment analysis are required for a specific instrument (e.g. US$, Euro, BT stocks, FTSE derivative) traded in a certain country or group of countries.

2. Sentiment words and their frequency

2.1 A note on up-and-down phrases

Amongst the many different phrases chosen to describe the changes in the actual or potential values of financial instruments, the phrases up, down, rise and fell have an intuitive prominence. Other related phrases and synonyms are also used: growth, slump, jump and drop are good examples. The British National Corpus (BNC) shows the preponderance in the general language of these up-and-down phrases:

Table 1: BNC frequencies obtained from the on-line version of the corpus; for rose and fell we restricted the query to verbs only (NBNC = 100,106,008)

Token     fBNC      Token    fBNC
up        207709    down     92285
growth    12794     fell     9563
rose      5566      slump    632

The BNC shows that there is more ‘positive’ sentiment in the corpus than ‘negative’. However, a much more detailed analysis is required before any serious conclusion can be drawn from these literally raw figures. A similar analysis of financial news texts shows the dominance of the phrases up and down but with a somewhat different distribution of the other four phrases.
Consider a sample from Reuters UK-financial News for the month of November 2002 (comprising 400,000 or so tokens):

Table 2: Frequencies of ‘positive’ and ‘negative’ sentiment phrases based on Reuters UK-financial News, November 2002. This is an untagged sample and homographic conflicts (e.g. rose as noun or verb) have not been resolved (NReuters = 402089)

Token     fReuters    Token    fReuters
up        1435        down     716
growth    650         fell     391
rose      424         slump    73

The rank order of the three ‘positive’ sentiment phrases, and that of the three ‘negative’ ones, is preserved when the register changes from the BNC, a corpus which largely comprises general language texts (c. 62% of its texts are drawn from fiction, leisure, world affairs, the arts, belief and thought), to the specialist financial news, which wholly comprises what the BNC compilers would call commerce and finance texts (the BNC has just under 8% of such texts in its composition). However, the relative distribution of the different phrases within the ‘positive’ and ‘negative’ sentiment categories is substantially different:

Table 3: Relative distribution of ‘positive’ sentiment phrases across the BNC and Reuters UK-financial News, November 2002. N(Positive) is the total frequency of the three positive phrases in the respective corpus.

Token     fBNC      fBNC / N(Positive, BNC)    fReuters    fReuters / N(Positive, Reuters)
up        207709    91.8%                      1435        57.2%
growth    12794     5.66%                      650         25.9%
rose      5566      2.46%                      424         16.9%
Total     226069    100%                       2509        100%

The predominance of up is reduced by about one-third, and the more domain-specific growth and rose increase dramatically, when we change the register from general language to special language: from 6% in the BNC to about 26% in Reuters for growth, and from under 3% to about 17% for rose. Similar results can be obtained for negative sentiment words.

Financial journalists typically express rise and fall in percentage terms: this perhaps gives their story a quantitative and objective look and feel.
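The shares in Table 3 can be recomputed directly from the raw counts, and the same arithmetic gives the corresponding distribution for the negative phrases. The following is a minimal sketch in Python: the counts are copied from Tables 1 and 2, while the function and variable names are our own and purely illustrative.

```python
# Raw frequencies copied from Table 1 (BNC) and Table 2 (Reuters, Nov 2002).
BNC = {"up": 207709, "growth": 12794, "rose": 5566,
       "down": 92285, "fell": 9563, "slump": 632}
REUTERS = {"up": 1435, "growth": 650, "rose": 424,
           "down": 716, "fell": 391, "slump": 73}

POSITIVE = ("up", "growth", "rose")
NEGATIVE = ("down", "fell", "slump")


def relative_distribution(counts, tokens):
    """Return each token's percentage share of the summed counts of `tokens`."""
    total = sum(counts[t] for t in tokens)
    return {t: 100.0 * counts[t] / total for t in tokens}


if __name__ == "__main__":
    for label, tokens in (("positive", POSITIVE), ("negative", NEGATIVE)):
        for name, counts in (("BNC", BNC), ("Reuters", REUTERS)):
            shares = relative_distribution(counts, tokens)
            row = ", ".join(f"{t}: {s:.2f}%" for t, s in shares.items())
            print(f"{label:8s} {name:8s} {row}")
```

Running the script prints one line per corpus and sentiment category, so the positive rows can be checked against Table 3 and the negative rows read off in the same way.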
A concordance of Reuters UK-financial News (November 2002) containing the phrase rose shows a varied usage of the pattern:

rose [ Ø | by | to | over | only/nearly ] X percent

Some examples of these patterns include:

enterprise shares               rose          2.83     percent    to 584 pence in
Volatile mortgage payments ,    rose by       0.1      percent    on the month to
home loan repayments ,          rose to       2.3      percent    in the year to
152 pence, logica               rose over     eight    percent    and cmg climbed 6
Property companies              rose nearly   seven    percent    to 1,235
-last week house prices         rose only     1.4      percent    last month - -

A further analysis of the stories published by Reuters under the UK-financial rubric throughout the calendar year 2002 shows that not all the above patterns are used equally frequently and that one pattern, rose X percent, tends to dominate; in other words, it was used preferentially by the journalists in the year 2002.

Table 4: Variations of sentiment rose across year 2002, where fx denotes the raw frequency of the phrases. Percentages are relative to frose.

Month    frose    rose X %    frose [phrase] %    Proportion of rose [phrase] %
Jan      417      58.0%       284                 68.1%
Feb      369      52.0%       238                 64.5%
Mar      245      49.4%       143                 58.4%
Apr      263      58.2%       174                 66.2%
May      376      57.4%       238                 63.3%
Jun      245      53.1%       149                 60.8%
Jul      427      50.1%       254                 59.5%
Aug      351      51.3%       217                 61.8%
Sep      357      44.8%       190                 53.2%
Oct      342      59.4%       236                 69.0%
Nov      424      50.7%       256                 60.4%
Dec      231      63.6%       175                 75.8%

The less frequent patterns (rose by, to, over, nearly, by over and only X %) each account for well under 10% of the monthly occurrences of rose. For example, in January 2002, 58% of the occurrences of rose were in the pattern rose X percent, followed by 2.2% in rose by X percent and just under 1% in rose over X percent.
During the year 2002, the frequency of the pattern rose X % was above 50% for all months except March and September, where it was 49.4% and 44.8% respectively (the average value was 54%). Table 4 shows the monthly variation in the various patterns comprising the verb rose.

The rise/fall and up/down metaphor may be expressed through other related words and through synonyms. Textbooks on ‘report writing’ suggest that, without losing accuracy, one may use related words or synonyms instead of repeating the same words. Indeed, slump instead of fall, and jump/climb instead of rise, appear to be good candidates not only for improving writing style but also for making the news report more sensational. We looked into Roget’s Thesaurus (1980) and WordNet online to find related words and synonyms for the rise/fall words discussed in Tables 2-4. It appears that the frequency of the related words and synonyms is much lower (Table 5).

Table 5: Related words/synonym sets of rise and fall

Lemma                               fReuters    Related words/Synonyms    fReuters
fall (inc. fell, falling)           875         drop                      155
                                                slump                     69
                                                strike                    53
rise (inc. rose, rising, risen)     1004        jump                      133
                                                climb                     75
                                                lift                      33

3. Sentiments and lexico-grammatical patterns?

The domination of one word form over others with related meanings perhaps shows the precision (some would argue that it comes from a lack of imagination) of the language of financial journalists. This is good news for people interested in information extraction, a branch of computing dedicated to the extraction of meaning from natural language text.
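The dominance of a handful of rose [phrase] X percent patterns means that even a simple pattern matcher can pick out candidate sentiment statements. The following is a minimal sketch in Python using a regular expression; it is an illustration under our own assumptions, not the method used in the study. The modifiers follow the seven patterns of Table 4, whereas the short list of spelled-out numbers and the inclusion of fell alongside rose are illustrative choices.

```python
import re

# Optional modifier between the verb and the quantity, following the seven
# patterns listed in Table 4 ("by over" is listed before "by").
MODIFIER = r"(?:by\s+over|by|to|over|nearly|only)"
# X may be a digit string (2.83) or a spelled-out number (eight); this list of
# number words is an illustrative assumption, not taken from the text.
NUMBER = r"(?:\d+(?:\.\d+)?|one|two|three|four|five|six|seven|eight|nine|ten)"

PATTERN = re.compile(
    rf"\b(?P<verb>rose|fell)\s+(?:(?P<modifier>{MODIFIER})\s+)?"
    rf"(?P<amount>{NUMBER})\s*(?:percent|%)",
    re.IGNORECASE,
)


def find_movements(text):
    """Return (verb, modifier, amount) triples for every match in `text`."""
    return [
        (m.group("verb").lower(), m.group("modifier"), m.group("amount"))
        for m in PATTERN.finditer(text)
    ]


if __name__ == "__main__":
    print(find_movements("enterprise shares rose 2.83 percent to 584 pence"))
    print(find_movements("She picked a rose in the garden"))
```

Because the verb must be followed by a quantity and percent, the noun rose in a sentence about flowers is not matched. The sketch does nothing to resolve what rose or fell, which the surrounding discussion shows still matters for interpreting the sentiment.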
The initial results on the preponderance of one word form over others (Table 5), together with a preponderant lexico-grammatical pattern in which the word form is found (Table 4), suggest to us that when the word forms rise and fall, or rather rose and fell, are used, especially when followed by a number and ‘percent’, they indicate an upward or downward movement in the value of some part of the financial market. This is true of shares, of the aggregate share index of one business sector (cf. property companies, telecommunications) and of house prices. Some ambiguity remains, though: the report telling us that ‘mortgage defaulters fell by X percent’ is bad news only if you are in the debt collection business.

Zellig Harris and his pupil Maurice Gross (1991) have been keen to suggest that in certain types of text one may find local grammars in operation: certain phrase structures that occur more frequently in one type of text, or one set of text fragments, than in the language as a whole. Harris illustrated this point by citing examples of recursive noun phrases used in biochemical literature to refer either to complex biochemical compounds or to complex biochemical processes. Gross (1993) focussed on how we specify time and date, and showed that cardinal numbers used to denote time and calendrical expressions (day/month/year, century) are embedded in their own local grammar. Barnbrook and Sinclair have used this notion to argue that dictionary definitions are also written in a local grammar (1993). Local grammars and the Hallidayan term ‘lexico-grammatical’ patterns have a certain resonance, and this we have used to explore the grammatical environment of words that indicate market sentiment. The sentences comprising rose and fell embedded in the patterns shown in Table 4 were analysed using a reliable part-of-speech tagger, CLAWS.
The local grammar of the two sentiment indicators rose/fell comprises these patterns (see also Table 6):

VVD { Ø | PRP | AV0 | AVP | DT0 } CRD NN0
VVD AV0 { DT0 } CRD NN0
VVD PRP { AT0 } CRD NN0

Here VVD = verb (past tense); CRD = cardinal number; NN0 = noun neutral for number (here, percent); DT0 = determiner; AV0 = adverb; AVP = adverb particle; PRP = preposition.

Table 6: Grammatical properties of the lexico-grammatical patterns

Pattern               Grammar
rose X %              VVD CRD NN0
rose by X %           VVD PRP CRD NN0
rose to X %           VVD PRP CRD NN0
rose over X %         VVD AV0 CRD NN0
rose nearly X %       VVD AV0 CRD NN0
rose by over X %      VVD AVP AV0 CRD NN0
rose only X %         VVD AV0 CRD NN0

4. “Virtual Corpora”

The designers of individual corpora have discussed the organisation of the text files within a corpus. The early corpora divided texts into informative and imaginative types (cf. the Lancaster-Oslo/Bergen and Brown corpora, c. 1960s); the texts were then divided by genre, topic and other pragmatic attributes. The subsequent corpora took two different approaches. First, genre-based classification was used by some, where texts were classified into books, magazines, personal correspondence and so on (cf. Collins-COBUILD, c. 1970-80s). Second, topic-based classification was used by the designers of the Longman-Lancaster corpus (c. 1980s), wherein texts were divided into subject topics (science, world affairs, news and so on). The other pragmatic attributes were included as well. The LOB and Brown corpora were developed for the study of the (English) language in general, and the Collins-COBUILD and Longman-Lancaster corpora for lexicographical purposes. The more ambitious British National Corpus extended the list of pragmatic attributes, and perhaps made the attributes more explicit. The texts can be selected for analysis through a complex query language or by selecting the texts from a file store. For us, there is a hierarchical structure that drives the design of a given text corpus.
The selection of the top-level features drives the selection of the other features; these are “doable” tasks if you know the organisation of the corpus you are using. The user of a computer-based corpus must know how the corpus is structured, whereas what is more desirable is the use of ‘logical’ (attribute-oriented) features of the text. The organisation of a corpus of news reports suggests that sometimes there will be a need to analyse the corpus diachronically, focusing on a given individual, organisation or financial instrument, and at other times a need to conduct a synchronic analysis, for example of news about all organisations within a given industry sector or all members of a political party. The analysis may be based on a query comprising one or more keywords used in the indexation of a set of news reports. The permutations of the pragmatic and lexical attributes of a given text are numerous, and individual users of a text corpus may like to organise the texts according to their own needs.

In order to have a user-configurable corpus, the notion of a virtual corpus was introduced (Holmes et al. 1994) for analysing technical and scientific texts. The notion of a virtual corpus is similar to that of a virtual machine: there is in reality only one corpus, but the users can, for the duration of their use, arrange the text attributes in a hierarchy of their choice based on a physically extant set of texts. This configurable hierarchy has to be made available through a program, within a suite of corpus management programs, that produces the virtual corpus. The notion of a virtual corpus introduces a shift from the usual pre-defined and explicit hierarchical approach to corpus organisation, in that it allows the definition of virtual hierarchies. Texts can then be retrieved by navigating through an organisation, the virtual hierarchy, specified by the user.
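The idea of a virtual hierarchy can be made concrete with a small sketch. The Python code below assumes that each text is described by a flat record of attributes and that a user supplies the order in which those attributes should form the levels of the hierarchy. The records, the attribute names and the function are hypothetical and illustrative only; they are not taken from any existing corpus manager.

```python
from collections import defaultdict

# Hypothetical text records; attribute names and values are illustrative only.
TEXTS = [
    {"id": "t1", "language": "en", "topic": "finance", "year": 2002},
    {"id": "t2", "language": "en", "topic": "finance", "year": 2001},
    {"id": "t3", "language": "en", "topic": "politics", "year": 2002},
]


def build_virtual_hierarchy(texts, attribute_order):
    """Group text ids into a nested dict keyed by the chosen attributes, in order.

    The stored texts are not copied or moved; only their ids are arranged.
    """
    if not attribute_order:
        return [t["id"] for t in texts]
    first, rest = attribute_order[0], attribute_order[1:]
    groups = defaultdict(list)
    for text in texts:
        groups[text[first]].append(text)
    return {value: build_virtual_hierarchy(members, rest)
            for value, members in groups.items()}


if __name__ == "__main__":
    # Two users arranging the same physical texts in different hierarchies.
    print(build_virtual_hierarchy(TEXTS, ["topic", "year"]))
    print(build_virtual_hierarchy(TEXTS, ["year", "topic"]))
```

The same three records yield a topic-first or a year-first hierarchy depending only on the order requested, which is the sense in which there is one physical corpus but many virtual ones.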
We have designed a corpus for use in the automatic extraction of financial information from newspaper texts. The principal source of this corpus is the Reuters News Agency texts in NewsML format, an extension to XML for enriching news stories that was conceived by Reuters and developed and ratified by the International Press and Telecommunications Council (IPTC). Atkins, Clear and Ostler (1992) discussed criteria for corpus design. Based on this, and with the attributes available in NewsML, news can be organised using the six major (pragmatic) attributes shown in Table 7 below.

Table 7: Pragmatic attributes used for organising NewsML texts

Publisher       Name; Place of Publication; Source of Publication; Date of Publication; Date of Origination
Availability    Copyright Status; Copyright Duration; Copyright Owner; Usage Restriction
Text            Title/Headline; Dateline; Text Type (book, newspaper, journal, etc.); Text Mode (written, spoken?); Text Entry (electronic, transcribed?)
Language        Language Name; Regional Variant
Author          Byline; Reporter Nationality; First Language; Editor
Category        Industry Code; Topic Code

From the above set of pragmatic attributes, a news database management system, Virtual Corpus Manager (VCM), was developed at Surrey. VCM can be used to organise texts; to share and retrieve texts; to navigate through the content of the corpus; and to impose integrity and security checks on texts. We have used six different types of constraints. Each constraint allows the user to choose one set of inter-related attributes at a time. For example, users can choose as many major attributes as they wish and, for each attribute, can choose the sub-attributes. One example of the use of VCM is selecting UK-specific financial texts from the Reuters daily news stream for a given year.

5. Visualising the mood of the market

Time series analysis is one of the established and pervasive branches of statistics.
This analysis is used extensively for analysing ‘a sequence of data indexed by time, often comprising uniformly spaced observations’ in science, engineering, economics, commerce, biology and almost every other subject. Financial news reports are usually illustrated with a time series of the instrument that is being reported: the share price of a company at the opening or closing of a day’s trading, plotted over a financial (calendar) year, shows the traders’ perception of the financial health (another metaphor) of the company. There are time series which include both opening/closing values and the day’s highs/lows: the Japanese candlestick patterns, as they are called in the trade. There are time series of the geometric mean of major organisations whose shares are traded in the market: the FTSE-100 share index is the geometric mean of the share prices of 100 leading organisations at the close of daily trading, and there is the FTSE index, which is the geometric mean of all shares traded on the London Stock Exchange at the close of trading. We similarly have the DAX-30 for Germany and the CAC-40 for France, and there is the Dow-Jones Industrial Average (DJIA) in the USA.

Financial traders, having sought opinion from statisticians, tend not to deal with the ‘raw’ data value, but use other statistical measures related to the value of the instrument(s). Typically used indices are those of volatility, a measure based on the standard deviation of the closing price from its average value over the past few days/hours/minutes; the moving average; and the return value, which is the (logarithmic) difference between the value of the instrument at time t-1 and at time t. These other statistical measures of the quantitative changes in the value of instrument(s) are important for us as we attempt to incorporate the use of sentiment indicators, or rather changes in sentiment, in an overall financial analysis framework.
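The three derived measures just described can be sketched in a few lines of Python using only the standard library. This is an illustration under our own assumptions rather than the formulae used by any particular trading system: the return is taken as ln(p_t) - ln(p_{t-1}), volatility as the sample standard deviation over a sliding window, and the prices in the usage example are hypothetical.

```python
import math
import statistics


def log_returns(prices):
    """Logarithmic difference between consecutive values: ln(p_t) - ln(p_{t-1})."""
    return [math.log(curr) - math.log(prev)
            for prev, curr in zip(prices, prices[1:])]


def moving_average(values, window):
    """Mean of each run of `window` consecutive values."""
    return [statistics.mean(values[i - window:i])
            for i in range(window, len(values) + 1)]


def rolling_volatility(values, window):
    """Sample standard deviation of each run of `window` consecutive values.

    `window` must be at least 2 for a sample standard deviation to exist.
    """
    return [statistics.stdev(values[i - window:i])
            for i in range(window, len(values) + 1)]


if __name__ == "__main__":
    closing = [100.0, 102.0, 101.0, 105.0]  # hypothetical closing prices
    print(log_returns(closing))
    print(moving_average(closing, 2))
    print(rolling_volatility(closing, 2))
```

Each function returns a shorter series than its input (one value per complete window or per consecutive pair), so the derived series must be aligned by date before being compared with a sentiment series.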
One such attempt has involved helping the traders to correlate the quantitative signal, either in its raw form or the derived forms (return and volatility measures), with the movement of sentiment-indicating phrases. The traders typically use two sophisticated computer systems almost simultaneously during a trading session: one screen dedicated to the value of financial instruments, sometimes resolved at 50 values per minute, and the other screen dedicated to news streams supplied by Reuters, Bloomberg and others. A typical trader looks from one to the other and then makes his or her decision. This rather simplistic view of financial trading has led to the development of SATISFI, which can simultaneously display, or help to visualise, the news, the value of an instrument, and the changes in the frequency of sentiment indicators. Table 8 and Table 9 below show the two most frequent sentiment words used in generating the positive and negative sentiment time series respectively.

Table 8: Dominant sentiment words rose and up. Relative frequency (×10^-5)
        Jan     Feb     Mar     Apr     May     Jun
Rose    87.80   79.81   57.94   64.51   85.39   58.41
Up     117.48  135.81  109.40   88.97   96.87   69.78
Total  205.28  215.62  167.34  153.48  182.26  128.19

        Jul     Aug     Sep     Oct     Nov     Dec
Rose    63.27   86.09   58.28   62.26   63.67   67.25
Up      99.89  134.09   96.92   94.71   99.98   80.70
Total  163.16  220.18  155.20  156.97  163.65  147.95

Table 9: Dominant sentiment words fell and down. Relative frequency (×10^-5)
        Jan     Feb     Mar     Apr     May     Jun
Fell    62.45   88.53   69.28   27.06   68.89   69.78
Down    84.09  100.94   68.88   74.89   61.71   92.51
Total  146.54  189.47  138.16  101.95  130.60  162.29

        Jul     Aug     Sep     Oct     Nov     Dec
Fell    63.02   75.38   51.53   48.81   51.73   68.40
Down    91.92   87.28  101.22   81.78   67.65   85.31
Total  154.94  162.66  152.75  130.59  119.38  153.71
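The figures in Tables 8 and 9 are relative frequencies in units of 10^-5, which can be read as occurrences per 100,000 tokens. The sketch below shows how such a figure could be obtained for one batch of news text. It is an illustration under assumptions: the tokenisation rule, the function name and the sample sentence are hypothetical, and the paper does not state which token total was used as the denominator.

```python
import re
from collections import Counter


def relative_frequency_per_100k(text, terms):
    """Occurrences of each term per 100,000 tokens of `text`."""
    # Naive tokenisation: lower-case alphabetic runs (an assumption).
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return {term: counts[term] * 100_000 / total for term in terms}


# Hypothetical one-sentence "corpus", for illustration only.
sample = "Shares rose and rose again as the index moved up"
frequencies = relative_frequency_per_100k(sample, ["rose", "up", "fell"])
```

Applied month by month, the per-term values can be summed to give a combined series like the Total rows of the tables.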
Figure 2: SATISFI prototype shown with one-year FTSE index based on monthly data with upward and downward movement indicator series

SATISFI has four major components that have been fully integrated, as shown in figure 2.

i) Time Series Display: SATISFI can display three time series at a time. These time series comprise FTSE-100 close index values, upward movement indicators and downward movement indicators. As discussed above, upward and downward movement indicators are the quantification of the market sentiment expressed in financial news. Over 70 terms each have been identified for conveying ‘good’ and ‘bad’ news. For example, upward movement indicators would contain terms like ‘up, rise, growth’ etc., while downward movement indicators would contain terms like ‘down, fall’ etc. The movement indicator time series are synthesised by counting these movement indicator terms within the financial news published for a particular day. Each time series is normalised for display purposes. SATISFI is capable of displaying the above time series in three forms:
(1) Raw form denotes the original time series.
(2) Return form refers to the logarithmic difference between two consecutive values.
(3) Volatility (historical volatility) is the relative rate at which the time series moves up or down.

ii) Time Series Correlation: Correlation is a measure of the degree of linear relationship between two time series. SATISFI provides the user with the facility of cross-correlating two series in any form (raw, return, volatility). Any series can be shifted forward or backward and the cross-correlation recalculated to determine whether the market is followed by the news or vice versa.

iii) Document Display: This comprises two parts:
(1) Document Titles: Clicking a dot (date) on any of the time series displays the corresponding date’s news titles.
(2) Document Content: The content of any document can be viewed by clicking its news title.

iv) Document Analysis: Whenever a document title is selected from the news list, the extracted sentiment keywords are displayed, along with their frequencies, in the “Document Keywords” area. Positive sentiment keyword analysis details appear under the title “Upward Movement Indicators” and negative sentiment keyword analysis details appear under the title “Downward Movement Indicators”.

6. Finding ‘meaningful’ patterns

6.1 A case study: Movements in 2002

The year 2002 has seen its ups and downs like many other years, and the movements in financial markets worldwide have been in the downward direction. Or, at least, the geometric mean of the value of the shares of the major corporations in North America, Japan, and the European Union, with some exceptions, has reduced substantially. The UK FTSE-100 shows mainly downward movements interspersed with small periods of upward movement, which, unfortunately, could not compensate for the previous reduction in the value of the index. In the time-series analysis literature one finds techniques that tend to separate the so-called trends from cyclical movements in the series: the cyclical movements may be due to factors like holidays, trading patterns that may be seasonal and so on (Tino et al. 2001). The trends, it is claimed, show a change that is caused by a change in the basic structure of the market. Techniques like fractal analysis and the related chaotic systems methods help in disentangling the trends from the cycles. Another related and robust technique is wavelet analysis (Rioul et al. 1991): a wave comprises oscillations of a number of different frequencies and trends, and wavelet analysis suggests ways in which these could be disentangled. We have used wavelet analysis on the FTSE-100 data for 2002 together with the time series of upward and downward indicating phrases in Reuters Financial News for the same year.
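The idea of separating a long-term trend from shorter-term movements can be illustrated with the simplest wavelet, the Haar wavelet, in its unnormalised averaging and differencing form. This is only a sketch under assumptions: the paper does not say which wavelet family or software was used, the function names are hypothetical, and the index values are invented for the example.

```python
def haar_step(series):
    """One level of the unnormalised Haar transform on an even-length series.

    Returns pairwise averages (coarser trend) and pairwise half-differences (detail).
    """
    if len(series) % 2:
        raise ValueError("series length must be even")
    pairs = list(zip(series[0::2], series[1::2]))
    trend = [(a + b) / 2 for a, b in pairs]
    detail = [(a - b) / 2 for a, b in pairs]
    return trend, detail


def haar_trend(series, levels):
    """Trend after `levels` averaging steps, stretched back to the original length."""
    trend = list(series)
    for _ in range(levels):
        trend, _detail = haar_step(trend)
    block = 2 ** levels
    return [value for value in trend for _ in range(block)]


# Hypothetical index values; the length must be divisible by 2 ** levels.
index_values = [8.0, 6.0, 7.0, 3.0, 4.0, 4.0, 2.0, 2.0]
long_term = haar_trend(index_values, levels=2)
# What remains after removing the trend carries the shorter-term movements.
short_term = [x - t for x, t in zip(index_values, long_term)]
```

Using more levels gives a smoother trend and pushes more of the variation into the residual; a smoother wavelet family would avoid the step-like trend that the Haar wavelet produces.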
Figure 3a shows the raw figures for the daily trading data for the FTSE-100. Figure 3b shows the long-term trend in the time series (downwards), while figure 3c shows the short-term trend and hence some cyclical behaviour. Note the turning points in the cyclical data (marked by arrows in figure 3c). There is, as noted earlier, considerable interest in identifying these turning points. The system SATISFI is being extended to generate a textual description of the turning points. Figure 33a shows a time series of the frequency of a number of upward movement indicators, including rise, growth and other less frequent phrases indicating upward movement. There is a corresponding long-term decay in the time series of upward movement indicators (figure 33b), as found in the raw FTSE-100 data (figure 3b). The cyclical movements are much more pronounced (figure 33c), but a comparison by eye suggests that the pattern in figure 33c is similar to that in figure 3c. The downward movement indicators show that for the first six months or so of 2002 the frequency of downward indicators increases rapidly but decays in the latter half of 2002 (figure 33b). The sentiment-indicating time series has to be refined, and much more work is required before we may use it to predict the actual mood of the market. However, our approach is perhaps amongst the first explorations of how the quantitative movements in a financial market are influenced by the news stories, some influencing the market and others showing the influence of the market.

Afterword

We have attempted to build a system from various linguistic, visual and mathematical components that allows us to explore the behaviour of the traders, and to attempt to model this behaviour. The purpose of the system is to assist the trader by reducing the amount of textual and numeric data that the trader needs to assimilate to form a view of the market.
In terms of corpus linguistics, early benefits are shown by the analysis of qualitative data available in collections of news texts organised by descriptive metadata (through XML and Reuters codes), combined with text processing techniques to determine key patterns, proper-noun analysis to determine key entities, and the use of terminology collections both for reducing the understanding overhead necessary for the text and for automatic classification of the text. This work is directly relevant to the potential for Information Extraction techniques to be adopted across domains, an issue that is also being investigated in a related EPSRC-sponsored project, Scene of Crime Information System (SOCIS). The integration of techniques in corpus linguistics with other forms of analysis, including mathematical analysis and image analysis, provides a supporting environment for future projects that do not focus on a single medium of communication. Experts do not generally rely on the text medium alone; this work provides evidence of the kinds of information fusion that financial experts carry out many, many times each day.

Figure 3:

Acknowledgements

The work described in this paper has been supported by the EU IST Programme’s Generic Information Decision Assistant (GIDA IST 2000-31123). The paper is based on papers presented at two workshops: LREC Event Modelling Workshop (Spain 2002) and Financial News Analysis Workshop, 11th International Terminology and Knowledge Engineering Congress (France 2002).

References

Ahmad, K. (2002), Events and the causes of events: the use of metaphor in financial texts, in Proceedings of the workshop at the International Conference on Terminology and Knowledge Engineering. Nancy, France.
Atkins, S., J. Clear and N. Ostler (1992), Corpus Design Criteria, Literary and Linguistic Computing 7(1): 1-16.
Barnbrook, G. and J. Sinclair (1995), Parsing CoBuild Entries, in J. Sinclair, M. Hoelter and C. Peters (eds.), The Languages of Definition: The Formalization of Dictionary Definitions for Natural Language Processing. Luxembourg: Office for Official Publications of the European Communities. pp. 13-58.
Gross, M. (1993), Local Grammars and their Representation by Finite Automata, in M. Hoey (ed.), Data, Description, Discourse: Papers on the English Language in Honour of John McH Sinclair. HarperCollins Publishers. pp. 26-38.
Harris, Z. (1991), A Theory of Language and Information: A Mathematical Approach. Oxford: Clarendon Press.
Holmes-Higgin, P., S. Abidi and K. Ahmad (1994), ‘Virtual’ Text Corpora and their Management, in Proceedings of the Sixth EURALEX International Congress on Lexicography, Amsterdam.
Jackendoff, R. (1991), Semantic Structures. Cambridge (USA) & London: The MIT Press.
Knowles, F. (1996), Lexicographical Aspects of Health Metaphors in Financial Texts, in M. Gellerstam et al. (eds.), Euralex96 Proceedings (Part II). Göteborg, Sweden: Göteborg University. pp. 789-796.
Maybury, M. (1995), Generating Summaries from Event Data, Information Processing and Management 31(5): 735-751.
Quirk, R., S. Greenbaum, G. Leech and J. Svartvik (1985), A Comprehensive Grammar of the English Language. Longman.
Rioul, O. and M. Vetterli (1991), Wavelets and Signal Processing, IEEE Signal Processing Magazine, pp. 14-38.
Robert A. (eds) (1980), Roget’s Thesaurus. Great Britain: Longman.
Simmons, R. F. (1984), Computations from the English: A Procedural Logic Approach for Representing and Understanding English Texts. Englewood Cliffs, NJ: Prentice Hall.
Tino, P., C. Schittenkopf and G. Dorffner (2001), Volatility Trading via Temporal Pattern Recognition in Quantized Financial Time Series, Pattern Analysis and Applications.
Reuters NewsML Showcase: http://about.reuters.com/newsml/
Tagger: http://www.comp.lancs.ac.uk/computing/research/ucrel/claws/trial.html
System Quirk: http://www.computing.surrey.ac.uk/ai/SystemQ
WordNet: http://www.cogsci.princeton.edu/~wn/

Towards a methodology for corpus-based studies of linguistic change: Contrastive observations and their possible diachronic interpretations in the Korpus 2000 and Korpus 90 General Corpora of Danish

Jørg Asmussen 1
Society for Danish Language and Literature

Abstract

The Korpus 90 and Korpus 2000 Corpora of Danish were both designed and compiled at the Society for Danish Language and Literature (DSL). The joint web-based query interface of the two corpora enables immediate comparative studies. This paper focuses on examples of contrastive observations and their possible diachronic interpretations. It discusses whether observable differences of word frequency, inflection, collocation, connotation, and word order reflect real changes in the Danish language, or whether they reflect differently compiled corpora. Finally, the paper proposes prerequisites for a methodology of comparative corpus investigation and the determination of diachronic corpus similarity. In this context, the concept of invariant textual features is introduced.

1. Background: The Korpus 2000 and Korpus 90 General Corpora of Danish

The Korpus 2000 and Korpus 90 General Corpora of Danish were both designed and compiled at the Society for Danish Language and Literature (Det Danske Sprog- og Litteraturselskab, DSL). DSL, which is a type of academy, was founded in 1911 with the aim of publishing scholarly editions of Danish works of linguistic or literary importance, including dictionaries. During the 1990s, the design and compilation of corpora and other electronic linguistic resources of the Danish language has become an important activity for DSL.
Today, DSL is regarded as the main public institution 2 in Denmark within the fields of development and compilation of dictionaries and corpora 3 of the Danish language. DSL’s long-term goal is to combine dictionaries and corpora to yield a unique language information system – a bank of Danish language.

Korpus 2000 (K2000) holds 28 million text words and is composed of text material from the years 1998-2002, covering a wide variety of genres. The objective of the Korpus 2000 project was to compile a general corpus in order to document written Danish language around the turn of the millennium. Korpus 2000 has been made publicly accessible on the Internet, one of the main purposes of the project being to increase laymen’s awareness of the advantages of a corpus as an extension to dictionaries.

Korpus 90 (K90) is a 28-million-word subset of the 40-million-word Corpus of The Danish Dictionary (Den Danske Ordbog). Spoken material, as well as texts with certain copyright restrictions, has been excluded from this subset. The Corpus of The Danish Dictionary, holding text material from the decade 1983-1992, was compiled by DSL in the early 1990s. 4 It served as a major source in the compilation of The Danish Dictionary, an entirely new, written-from-scratch dictionary of contemporary Danish to be published in six volumes by DSL during 2003-2004. As part of the Korpus 2000 project, K90 was made publicly accessible on DSL’s website 5, where it serves as an approximately 10-year-older counterpart to K2000.

Both K2000 and K90 are morphosyntactically tagged with the Constraint Grammar-based DanPars-tagger 6 and are accessible through a joint query interface 7 that enables immediate comparisons of certain linguistic aspects. The following section will provide some examples of contrastive observations made through this interface, as well as possible diachronic interpretations – and misinterpretations – of them.

2. Problem: Contrastive observations and their diachronic interpretation

Query results such as frequencies and collocates are presented in the K2000 interface in contrastive tables showing results for K2000 and K90. Hence the user can compare frequency figures and collocates immediately, and hopefully get an idea of what kinds of changes in vocabulary, inflectional forms, or collocates (semantics) have taken place in the years between the compilation of K90 and K2000. However, the contrastive presentation of query results has its shortcomings as well, since it may seduce some users into making faulty generalisations about language change. Below, we will present some examples of these pitfalls and sketch some possible ways to avoid them, some of which could be implemented in the user interface as a diachronic interpretation facility.

2.1 Vocabulary

A comparison of the frequencies of all words in K2000 and K90 reveals, not surprisingly, that some words are significantly more frequent in one corpus than the other. Assuming that each of the two corpora reflects Danish language usage during a given period of time (1983-1992 and 1998-2002), we might interpret these frequency differences as changes in the incidence of words in the Danish language as a whole.

Figure 1 (below) shows how the frequencies for the noun regn (rain) are presented in the user interface: the first column lists the possible inflectional (and orthographic) forms of the lemma, the second column indicates the frequency of each form in K2000, and the third column the frequency in K90. The form listed in the bottom row is the lemma itself with all its inflectional forms; i.e., it is the total of all of the forms listed above the bottom row. 8 Frequency is not shown in absolute decimal figures, but as a logarithmic score 9 represented by a number (0-7) of bullets.
Experimentally, it has been established that this logarithmic score apparently follows general intuitions on word frequency quite well, such that words with 1-2 bullets are less frequent, e.g. entomologi (entomology), words with 6-7 bullets are very frequent, e.g. i (in) and og (and), and middle-range words with 3-5 bullets are of average frequency, e.g. regn. As the table shows, there do not appear to be any striking differences in the use of these forms between K2000 and K90, with one exception - the indefinite genitive form regns (rain’s), which does not occur at all in K2000, but scores one bullet in K90. The raised thumb indicates that this form is “interestingly” more frequent in K90 than in K2000 - which in this case means that it occurs at least twice as often as the corresponding form in the other corpus. 10 This does not necessarily mean that the difference is statistically significant as well. Especially with low frequency words, the significance may be skewed. However, the phenomenon may have some linguistic relevance, and our intention, therefore, was to highlight this in the frequency table - a practice which may have confused users, as it turns out. Clicking on one of the magnifying glasses yields the KWIC concordance for the corresponding form, together with the number of occurrences. For regns, we get three occurrences in the 28-million-word K90. Figure 1: Frequency table for the noun regn (‘rain’) The total frequency of all inflectional variants of the lemma mobiltelefon (mobile phone) is about 25 times higher in K2000 (1,486 occurrences) than in K90 (59 occurrences). Assuming that the vocabulary of a language reflects general changes in society, and keeping in mind technological changes from the eighties to the late nineties, it seems evident that this frequency observation could be interpreted as a general change in the vocabulary of the Danish language. 
Similar examples are biltelefon (car phone) and benchmarking; biltelefon, which is five times more frequent in K90 (51 occurrences) than in K2000 (9 occurrences), denotes a technical device that by and large has been replaced by mobile phones. Benchmarking does not occur at all in K90, whereas it occurs 34 times in K2000, which might indicate that this word is new to Danish vocabulary. Jarvad (1999), whose dictionary of new words in Danish is based on good old-fashioned “manual” language excerpts, dates the first occurrence of this word in Danish to 1996 - an observation which might support the assumption that this word is new. All these examples appear to indicate that the two corpora reflect changes in language use as a result of general changes in society. A less frequent word such as kambrium (Cambrian) occurs four times in K90, but not at all in K2000, and is therefore marked with a raised thumb in the K90 column. A somewhat naive, but not entirely impossible, interpretation of this could be that the incidence of this word in the Danish language is decreasing. However, a closer look at the word reveals that it only occurs in a single text on geology, a scientific domain which appears to have been covered unevenly in the two corpora. So, in certain cases, raw frequency comparisons turn out to be unreliable: they indicate different corpus compositions rather than diachronic changes in the vocabulary of the language in question. In the case of kambrium, the introduction of some measure of word dispersion in the corpus would probably militate against misinterpretation; thus dispersion could be used to correct the use of raw frequency to measure incidence. However, the disadvantage of this method is that it generally diminishes the weight of linguistically interesting, infrequent words, such as new words.
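To make the notion of dispersion concrete, the sketch below computes one measure known from the corpus-linguistic literature, the "deviation of proportions" (DP). It is a swapped-in illustration, not the measure proposed for the K90/K2000 interface, which the text leaves unspecified; the function name and the four equally sized corpus parts are hypothetical. DP is 0 when a word is spread across the corpus parts exactly in proportion to their sizes and approaches 1 when all occurrences are concentrated in a small part.

```python
def deviation_of_proportions(word_counts, part_sizes):
    """DP = 0.5 * sum(|v_i - s_i|), where v_i is the share of the word's
    occurrences found in part i and s_i is part i's share of the corpus."""
    total_word = sum(word_counts)
    total_size = sum(part_sizes)
    if total_word == 0:
        raise ValueError("the word does not occur in the corpus")
    return 0.5 * sum(
        abs(count / total_word - size / total_size)
        for count, size in zip(word_counts, part_sizes)
    )


# Hypothetical corpus split into four equally sized parts.
part_sizes = [100, 100, 100, 100]
clumped = deviation_of_proportions([4, 0, 0, 0], part_sizes)  # all hits in one text
even = deviation_of_proportions([1, 1, 1, 1], part_sizes)     # evenly spread
```

A word like kambrium, whose four occurrences all come from one text, would score close to the clumped case, which could be used to suppress the raised-thumb marker or to down-weight the raw frequency.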
Yet the implementation of this method in the K90/2000 user interface would prevent users who are unfamiliar with quantitative linguistic methods from misinterpreting words such as kambrium.

2.2 Inflection

A comparison of the inflectional forms of particular words will at first glance show some striking differences between K2000 and K90. Looking at our previous example, regn, we have already noted some differences between K2000 and K90 for the indefinite genitive form regns (rain’s), which occurs three times in K90, but not at all in K2000. 11 For the definite genitive form regnens (the rain’s), the table does not indicate any quantitative differences, but if one looks at the corresponding concordances, one will observe that this genitive form occurs twelve times in K90 and nine times in K2000. The same picture emerges again: fewer genitives in K2000. Elbro (2002) reports on observations from K90 and K2000 that genitives of certain frequently used nouns appear to be less prevalent in K2000 than in K90. This, he suggests, might indicate a general tendency in Danish towards replacing genitive constructions by prepositional phrases, probably as a result of the influence of English. Thus, for example, bilens ejer (the car’s owner), which is considered the canonical norm in Danish, is often replaced by ejeren af bilen (the owner of the car). This assumption is further supported by Elbro’s (2002) observation of increased frequencies of certain prepositions in K2000. Apparently, the absolute number of genitive forms of some nouns in K2000 is significantly lower than in K90. For instance, the noun bil (car) has a total of 393 genitive forms in K2000, as opposed to 586 in K90. Similar results are observed for other nouns denoting common objects, such as cykel (bicycle) and hus (house).
A closer look at bil reveals that the lemma, including all its inflectional variants, occurs 10,360 times in K90, as opposed to 8,354 times in K2000, an observation which hardly - even among laymen - would yield an interpretation such as “the word bil is replaced by other words”, or even “the denoted object is about to disappear from our lives” - interpretations that appeared obvious for a word such as biltelefon, as discussed above. Furthermore, an examination of the relative numbers of genitive forms of the lemma bil in K90 and K2000 shows 5.7% and 4.7% respectively - a difference too weak to support conclusions about radical linguistic change. Other examples that may disprove the hypothesis of decreasing genitive forms include land (land, country) and the proper noun Danmark (Denmark). The lemma land occurs 28,222 times in K2000, the genitive proportion being 22.3%, against 21,478 times in K90, with a genitive proportion of 16.7%, thus showing the opposite tendency of what was reported above. Danmark occurs 30,730 times in K2000, against 22,243 times in K90, with 15.1% genitive forms in K2000 and 15.7% in K90. Once again, no alarming differences in the use of genitives can be observed. However, what might be alarming is the fact that some words, such as land and Danmark, which we intuitively would expect to have a constant prevalence in the language over ten years, show these astonishing frequency differences, despite their identical logarithmic score. This could be another indication of two differently-composed corpora, K2000 probably containing a larger amount of newspaper text than K90. The examples given above suggest that one presumably cannot account for general inflectional change merely by examining randomly-selected frequent words, since the result of the comparison will be too arbitrary to enable us to make any general conclusions. 
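The percentages quoted for bil follow directly from the counts given in the text: 586 genitive forms among 10,360 occurrences in K90, and 393 among 8,354 in K2000. As a small worked check (the function and variable names are hypothetical):

```python
def genitive_share(genitive_count, lemma_count):
    """Genitive forms as a percentage of all occurrences of the lemma."""
    return 100 * genitive_count / lemma_count


# Counts for the lemma bil as reported in the text: (genitives, all forms).
bil = {"K90": (586, 10_360), "K2000": (393, 8_354)}
shares = {
    corpus: round(genitive_share(genitives, total), 1)
    for corpus, (genitives, total) in bil.items()
}
# shares == {"K90": 5.7, "K2000": 4.7}
```

Comparing shares rather than absolute counts removes the effect of the lemma simply being more frequent in one corpus, which is why the drop from 586 to 393 genitives looks far less dramatic once it is normalised.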
If one wishes to investigate this kind of linguistic change, one should examine the phenomenon in question throughout the corpus. Thus, in the case of the genitive forms, one should at least compare the total ratio of genitive forms of all nouns in each corpus, and probably also the relative number of nouns compared with other parts of speech in each corpus. 12

2.3 Collocation

The K2000 query system can display collocates as either frequently or typically co-occurring collocates. The former are mainly made up of co-occurring function words, whereas the latter are determined by means of mutual information, $I(\mathrm{word}_a; \mathrm{word}_b) = \frac{P(\mathrm{word}_a, \mathrm{word}_b)}{P(\mathrm{word}_a)\,P(\mathrm{word}_b)}$. 13 In order to prevent infrequent words from popping up among the typical collocates, some intuitively defined noise-reducing, and process time-reducing, conditions have been set up: the corpus frequencies of both examined words $\mathrm{word}_a$ and $\mathrm{word}_b$ must be greater than 30, and the collocate candidate’s frequency must exceed a threshold $t = (\log_{10} f_k)^2 / 10$, where $f_k$ is the frequency of the collocator, i.e. the word for which we want to determine the collocates. Moreover, the collocation itself must occur at least twice, and the resulting (non-logarithmic) mutual information score must be greater than 20. The collocation window is set to two words to the left and two to the right of the collocator. The resulting collocates are presented in a table consisting of four lists, a left-hand and a right-hand list for each corpus. The lists are sorted in descending order according to their mutual information score, which is not explicitly stated in the list, but instead converted into a degree (1-5) of mutual attraction $d = \mathrm{round}(\log_{10} I - 1)$, where $I$ is the non-logarithmic mutual information score.
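The scoring and filtering rules described above can be sketched as follows. This is a reading of the description rather than the K2000 implementation: the function names are hypothetical, probabilities are estimated as counts divided by the corpus size (an assumption), and "the collocate candidate's frequency" is interpreted as the number of co-occurrences inside the window (also an assumption).

```python
import math


def mutual_information(pair_count, count_a, count_b, corpus_size):
    """Non-logarithmic score P(a, b) / (P(a) * P(b)) with P estimated as count / N.

    Algebraically this equals pair_count * N / (count_a * count_b).
    """
    return pair_count * corpus_size / (count_a * count_b)


def candidate_threshold(collocator_freq):
    """t = (log10(f_k))^2 / 10, where f_k is the collocator's corpus frequency."""
    return math.log10(collocator_freq) ** 2 / 10


def attraction_degree(mi_score):
    """d = round(log10(I) - 1), shown in the interface as a number of bullets."""
    return round(math.log10(mi_score) - 1)


def is_typical_collocate(pair_count, collocator_freq, candidate_freq, corpus_size):
    """Apply the noise-reducing conditions in the order given in the text."""
    if collocator_freq <= 30 or candidate_freq <= 30:
        return False
    if pair_count <= candidate_threshold(collocator_freq):  # assumed reading
        return False
    if pair_count < 2:
        return False
    score = mutual_information(pair_count, collocator_freq, candidate_freq, corpus_size)
    return score > 20


# Hypothetical counts, for illustration only.
score = mutual_information(pair_count=10, count_a=100, count_b=50, corpus_size=1_000_000)
degree = attraction_degree(score)
accepted = is_typical_collocate(10, 100, 50, 1_000_000)
```

Because the score divides by the candidate's corpus frequency, a frequent collocate needs many co-occurrences to pass the threshold of 20, so a drop in the collocator's frequency can push an established collocation below it.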
The resulting degree of mutual attraction is expressed by a corresponding number of bullets in front of each collocate, where one bullet may be interpreted as a weak - but significant - mutual attraction between the given words, and the maximum of five bullets as a very strong attraction. Figure 2 shows, as an example, a collocation table for the lemma terrorist.

Figure 2: Typical collocates for the noun terrorist

In K2000, we find to the left of the noun terrorist adjectival forms that denote either a general quality or state (hensynsløse (ruthless), eftersøgte (wanted), fængslede (imprisoned), islamiske (Islamic), internationale (international)) or a (national) origin or location. On the right-hand side, we note a couple of proper nouns. In K90 we also find eftersøgte (wanted) and palæstinensiske (Palestinian) to the left, and furthermore vesttyske (West German); finally, we find dræbt (killed) to the right. The result may be interpreted as follows: one of the characteristics of a terrorist that remains constant over time is that he is wanted or Palestinian, whereas West German is no longer a significant attribute; instead, we find many other nationalities, a certain religious orientation, or even international. In K2000, terrorists furthermore have names or belong to certain organisations, whereas in K90 they either killed or were killed. The greater number of collocates in K2000 may indicate that terrorist has become a more frequently used word, and, in fact, the lemma occurs nearly twice as often in K2000 (477) as in K90 (253). Or perhaps this is just another indication that the corpora are composed differently, K2000 probably containing more newspaper material than K90. Yet the results and their interpretation seem reasonable to some extent in that they resemble general tendencies in Danish society related to this topic.
Our historical knowledge helps us to understand these collocates and helps us to understand the constants and changes in the collocational behaviour of this word. A word which ought not to have changed its collocational behaviour over ten years is jul (Christmas) because its use is most likely embedded in a strong traditional context, and, accordingly, the majority of the listed collocates are to be found in both corpora, such as glædelig (happy), fejre (celebrate), or, on the right-hand side, nytår (New Year). However, what is conspicuous is that the number of collocates is somewhat greater in K90, indicating that jul occurs more often in this corpus (2,196) than in K2000 (1,275), which might be taken as another indication of compositional differences between the corpora. Thus, for example, hvid (white) does not occur as a collocate to jul in K2000, even if the collocation hvid jul occurs twice in K2000 and, of course, still is a valid collocation in Danish. Unfortunately it has been blocked by the noise-reduction measures. This example shows that incidental differences in frequency of a (normally) quite frequently used word, such as jul, may have a serious impact on the determination of collocates. Thus, comparing collocates derived from words with remarkably different frequencies in the two corpora does not necessarily give a correct impression of changes in their general collocational behaviour. Even if a word is used less frequently, it may still keep its well-established collocates, but they may not appear in collocational statistics any longer. What might appear, in some cases, are candidates that intuitively do not count as collocates. Looking up collocates for the word juletræ (Christmas tree), one will, as expected, find pynte (decorate) and danse [rundt=om] (dance [around]), but in K2000 we also find talende (speaking) - and this as the most significant collocate for this corpus, scoring four bullets on the degree of mutual attraction scale. 
On closer examination, it appears that all examples of the talende juletræ come from one single text, a strange story of a speaking Christmas tree. Thus, also when determining collocations, some kind of dispersion-based correction of the raw frequencies of the words involved would help diminish these odd findings.

2.4 Semantics

Closely related to collocation is the phenomenon of semantic prosody, 14 the ability of words to establish certain flavours of meaning contextually, e.g. positive, negative, ironic, etc. Consider the word sideeffekt (side effect), an English loanword, which is not registered by Jarvad (1999), but occurs 11 times in K90 and 22 times in K2000. One might argue that, as a word, sideeffekt is superfluous in Danish because another word, bivirkning, already exists with exactly the same meaning. Approximately half of the examples in K90 are used in a clearly negative context, indicating that sideeffekt is something unwanted and harmful, and thus the semantics of this word appears to be very close to that of bivirkning. The picture has changed in K2000, where the majority of examples clearly show sideeffekt as something still unwanted, but quite good, and some examples are explicitly preceded by the adjective positiv (positive), as shown in Figure 3.

Figure 3: Indications of changed semantic prosody for sideeffekt in K2000.

The question is whether these examples empirically justify the conclusion that sideeffekt in Danish has indeed changed its semantic prosody and found a semantic niche. How many examples of a type of semantic change do we need in order to exclude corpus compositional bias and to be able to draw conclusions about a language in general?

2.5 Syntax

We will give one example from syntax that shows some topological differences for the negation ikke (not) which one can observe between K90 and K2000.
In main clauses, the negation is placed after the finite verb: Peter drikker ikke te (‘Peter drinks not tea’), whereas the negation is placed before the finite verb in subordinate clauses: Anne serverer kaffe fordi Peter ikke drikker te (‘Anne serves coffee because Peter not drinks tea’). Main clause word order in subordinate clauses is generally considered substandard Danish. However, evidence of this usage can be found in both corpora. Intuitively, we would expect this non-canonical word order to be more frequent in current Danish, and we therefore would expect to find more examples of this in K2000 than in K90. However, a comparison shows more examples of the incorrectly placed ikke in K90, some of which are shown in Figure 4. Does this prove our intuition wrong, or does it indicate that K2000 contains more professional - and thus more ‘correct’ - language? Or is it the case that the corpus evidence is not convincing enough because of methodological shortcomings?

Figure 4: Examples of non-canonical word order in K90.

3. Towards a methodology of corpus-based comparative studies

The examples above demonstrate that our contrastive corpus observations in some cases may yield dubious interpretations and generalisations on linguistic change. In order to improve the quality of our comparative corpus investigations, we need to establish (i) a framework for the way we examine our data, i.e. a declaration of the analytical methods applied, and (ii) a method of describing and classifying our data, i.e. the corpus. To ensure that others can repeat our investigations on other data with analogous results, we would probably even want to standardise our procedures or create a general methodology for corpus-based studies of linguistic change.
To be able to handle different scenarios of investigation, the analytical methods should at least cover the fields of vocabulary, morphology (inflection), collocation, semantic prosody, and syntax. Based on the examples given in section 2, we can list the following preliminary prerequisites for these analytical methods.

3.1 Prerequisites for a methodology of comparative corpus investigation

The underlying word concept is important in comparisons based on vocabulary. In order to make these comparisons, we propose a word definition based on lexicalised units which include fixed multiword units, i.e. a word is defined as a lemma. Hence, what is comparable across corpora, when investigating vocabulary, are lemmas, not types. Furthermore, for vocabulary comparisons, measures other than raw lemma frequencies should be applied. A score based on different quantitative characteristics of a lemma in the corpus appears to be better suited to describing the prevalence of that lemma. These characteristics should at least include raw frequency and dispersion. To date, the field of vocabulary comparison appears to have attracted the most attention in corpus linguistic research, and a considerable number of statistical methods have been proposed.[15]

With regard to morphology, the comparison of the frequency of inflectional variants of a lemma must be handled differently from vocabulary comparison. As has been shown above, comparisons based on type can yield misleading results. First of all, one has to determine whether one wants to investigate a certain inflectional category within a lemma - for example changes in the quantity of plural forms of a noun - or if one wants to examine probable quantitative changes of a grammatical category as a whole throughout the corpus - for example the relative quantity of genitive forms of all nouns in the corpus.
In the first case, the ratio of an inflectional variant of a lemma contrasted with all other inflectional variants should be compared with the ratio of the same inflectional variant of the same lemma in another corpus. To ensure that potential changes are lemma-specific, and not just part of a general diachronic shift, the observations should be contrasted with an investigation of the examined inflectional category as a whole throughout the corpus. In the second case, the inflectional category in question should be investigated throughout the corpus. In both cases, statistical methods of determining the significance of proportional changes must be developed and applied. The comparison of changes in the frequency of collocations in two different corpora appears to be more complex than plain lemma-based vocabulary comparison. Statistical methods of the mutual information type are probably not suitable in a comparison context; their strength is mainly found in detecting statistically salient collocations. Once a collocation has been detected, one can compare its number of occurrences across corpora. However, collocations should not be compared as if they were words by merely applying one of the methods for vocabulary comparison, since the frequencies of the collocations themselves also depend on the words involved. Hence, if one observes significant differences in the frequency of one of the words involved in a collocation when comparing two corpora, these differences will also be reflected in all collocations that involve this word. In these cases, it is not necessarily the frequency of the collocation, or its prevalence in the language, that has changed. Comparisons of collocations might therefore best be treated as composite vocabulary comparisons of the words involved in a collocation, in conjunction with the frequency scores of the collocations themselves. 
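The contrast-ratio comparisons described above (the share of one inflectional variant within a lemma, or the share of a collocation within all occurrences of its node word, compared across two corpora) can be illustrated with a small significance test. The following is a minimal Python sketch, assuming that a log-likelihood (G2) test on a 2x2 contingency table is an acceptable stand-in; the chapter itself leaves the choice of statistic open, so this is one existing candidate rather than the proposed method. The function name `log_likelihood_2x2` is hypothetical. In the usage lines, the node frequencies of jul (2,196 in K90, 1,275 in K2000) and the two occurrences of hvid jul in K2000 come from the text, whereas the K90 count of hvid jul is an invented placeholder.

```python
import math


def log_likelihood_2x2(k1, n1, k2, n2):
    """Compare the proportion k1/n1 (corpus 1) with k2/n2 (corpus 2).

    k = occurrences of the variant or collocation under study,
    n = occurrences of the lemma or node word it is contrasted with.
    Returns (G2, p), where p is the chi-square tail probability for
    one degree of freedom.
    """
    for k, n in ((k1, n1), (k2, n2)):
        if n <= 0 or not 0 <= k <= n:
            raise ValueError("need n > 0 and 0 <= k <= n")

    observed = [[k1, n1 - k1], [k2, n2 - k2]]
    row_totals = [n1, n2]
    col_totals = [k1 + k2, (n1 - k1) + (n2 - k2)]
    total = n1 + n2

    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing to the sum
                expected = row_totals[i] * col_totals[j] / total
                g2 += 2.0 * o * math.log(o / expected)

    g2 = max(g2, 0.0)  # guard against tiny negative rounding errors
    # For one degree of freedom, P(X > g2) = erfc(sqrt(g2 / 2)).
    p = math.erfc(math.sqrt(g2 / 2.0))
    return g2, p


# Share of "hvid jul" within all occurrences of "jul".
# 2196, 1275 and the count 2 are from the text; the count 6 for K90
# is a HYPOTHETICAL placeholder used only to make the call runnable.
g2, p = log_likelihood_2x2(6, 2196, 2, 1275)
print(f"G2 = {g2:.3f}, p = {p:.3f}")
```

Because the test is applied to the ratio within each corpus rather than to raw counts, a mere drop in the frequency of the node word does not by itself register as a change in the collocation. With counts as low as those in the example, any such test has little power, which is the point made above about infrequent phenomena.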
Within the field of semantic prosody, some kind of semantic tagging, or at least grouping of contextual words, is indispensable. Changes will often affect new loanwords, and thus relatively infrequent words that are semantically vague at first, but find a semantic niche after some years. The number of words that show these characteristics is normally quite limited, and hence the comparative statistical methods must be adjusted in these special circumstances.

The number of observable instances will decrease as the complexity of a certain linguistic phenomenon increases, i.e. when it is composed of several words or linguistic constituents. This applies to most observable linguistic phenomena above the word level, e.g. collocations, semantic prosody, and - perhaps most saliently - syntax, and it will have implications for the statistics used in these cases.

For other fields of linguistic comparisons between corpora, similar prerequisites have to be established. Common to all of these prerequisites should be the principles of (i) contrasting a phenomenon with its potential alternatives within one corpus, and comparing this contrast ratio with that of another corpus, and (ii) determining appropriate comparison statistics for linguistic phenomena of varying complexity.

3.2 Prerequisites for a methodology of measuring corpus similarities over time

Apart from the development of analytical methods to be applied, one needs a method to describe and classify the data, i.e. the corpus. Although considerable efforts have been made to describe what is in a corpus and why, and although corpora usually are designed in accordance with some idea of representativeness, the problem remains that corpus descriptions are generally founded on relatively vaguely defined, intuitively applied textual categories.
Classifying texts by intuition is necessarily imprecise because intuitions may differ from compiler to compiler, and a single compiler may even apply different criteria of classification according to extra-textual circumstances. However, this problem does not seem to be critical, as long as the majority of corpus-based linguistic research refers to the same well-established corpus, e.g. the BNC, or, in the case of Danish, K2000 or K90. Statistical methods for determining corpus differences (Garside and Rayson 2000) and corpus similarities have been proposed, mainly based on vocabulary (Kilgarriff 2001). But what happens if one wants to compile a corpus designed exactly like an existing one, e.g. a new BNC, holding a similar mixture of, e.g., texts that are ten years more recent? In this case, the definition of a concept of similarity might become quite intricate because differences are explicitly desired as long as they are the result of diachronic language change, but they must definitely not be the result of differently composed corpora. How can one compile an updated version of a well-established corpus that is absolutely comparable to the existing one in its composition, so that the linguistic differences detected by comparing the corpora truly and only can be explained diachronically - and not as a result of compositional bias? A solution to this problem of corpus composition is a measure, a quantitatively established figure, or - more likely - several figures, characterising the overall textual composition of a corpus. Corpora with the same characteristics, then, would probably hold the same mixture of text types and would hence be comparable. So far, this approach is not different from the (implicitly) synchronic similarity measuring approaches proposed by Kilgarriff 2001. The difference is that the required measure should be based on positively invariant features of language, i.e. features that do not change (significantly) over time. 
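To make the idea of a quantitatively established similarity figure concrete, the following minimal Python sketch compares two corpora by the cosine similarity of their relative word-frequency profiles. It is an illustrative stand-in and not the measure proposed by Kilgarriff (2001); the function names and the toy token lists are assumptions introduced here. Being purely vocabulary-based, it is a synchronic measure of the kind discussed above: it would also register genuine diachronic change, which is why a feature set that stays stable over time is called for.

```python
import math
from collections import Counter


def relative_frequencies(tokens):
    """Map each word to its share of all tokens in the corpus."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def cosine_similarity(profile_a, profile_b):
    """Cosine of the angle between two frequency profiles (0.0 to 1.0)."""
    shared = set(profile_a) & set(profile_b)
    dot = sum(profile_a[w] * profile_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in profile_a.values()))
    norm_b = math.sqrt(sum(v * v for v in profile_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty corpus is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


# Toy token lists (HYPOTHETICAL, for illustration only).
corpus_old = ["jul", "fejre", "jul", "nytår", "te"]
corpus_new = ["jul", "fejre", "kaffe", "nytår", "internet"]

similarity = cosine_similarity(
    relative_frequencies(corpus_old),
    relative_frequencies(corpus_new),
)
print(f"similarity = {similarity:.3f}")
```

A single figure of this kind cannot distinguish compositional differences from language change; restricting the profile to features assumed to be invariant is the step that the remainder of this section argues for.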
Within a synchronic context, the linguistic features to compare may be chosen almost at random. They may be, and often are, based on vocabulary, but other features - such as part-of-speech, syntactic constituents, sentence length, word length, maybe even the distribution of character n-grams, etc. - may to some extent be useful as well. If statistics show that the two corpora have matching features, they may be expected to be of a similar type or composition. Within a diachronic context, however, many of these features may not be suitable because an ongoing general linguistic change might affect one or more of them; hence keeping them constant across corpora and over time might obscure the detection of possible changes.

Linguistic features based on grammar or semantics have the disadvantage of requiring morphosyntactically or semantically tagged corpora. The task of tagging a corpus before one can determine its compositional characteristics introduces a certain amount of interpretation, as one has to apply a grammatical tradition or semantic framework that does not necessarily capture the possible, subtle linguistic changes that occur as a language develops over time. Furthermore, the task of tagging increases the degree of methodological complexity, since one does not only need a tool for determining the compositional characteristics of the corpus, but also tagging tools. To ensure that the measurement of the compositional characteristics can be easily achieved in different settings, all tools that are involved have to be well documented and maintained. Slight modifications in tagging algorithms or tag sets may have far-reaching implications for the reliability of this approach. For these reasons, one should try to keep the number and complexity of the tools involved as low as possible, as well as the setting itself.
We therefore propose that corpus similarity should be determined on untagged corpora only, and that any introduction of linguistic theory into this process should be avoided. The mere physical, orthographic surface of the corpus should suffice, since its recurrent symbols yield structures of varying complexity that resemble the overall textual composition of the corpus.

It is beyond the scope of this chapter to try to determine the required invariant linguistic features. However, a couple of candidates will be briefly discussed. Belica (1998) gives an account of some statistical diachronic investigations of vocabulary and collocations in the German “Wendekorpus”.[16] He seems to be quite aware of the corpus compositional problem dealt with in this paper and points to the distribution of different collocations of selected function words[17] as a candidate for an invariant feature, but does not give any examples or further details. However, it does not seem evident that this type of collocation should remain stable over time.

A more appropriate candidate for an invariant linguistic feature seems to be the frequency and dispersion of words belonging to a defined semantic core vocabulary. Ruus (1995) attempts to determine a core vocabulary for Danish which she believes constitutes the basic lexical norm of the language. The core vocabulary, which Ruus determines, includes semantically vacuous, but highly frequent function words, whereas the semantic core vocabulary, which we shall suggest as a candidate, excludes them. The remainder should be relatively frequent, semantically well-established words of the jul (Christmas) type mentioned above. This semantic core vocabulary can be expected to remain grammatically and semantically stable over time and thus may prove quite useful in our context.
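A first approximation of such a semantic core vocabulary can be sketched in a few lines of Python. The sketch assumes that the material is already split into time slices (for example one list of lemmas per year), that a word qualifies when it reaches a minimum relative frequency in every slice, and that function words are removed with a predefined list. The function name `core_vocabulary`, the threshold, and the toy data are all hypothetical; presence in every slice is used here as a deliberately crude substitute for a proper dispersion measure, and this is not the procedure of Ruus (1995).

```python
from collections import Counter


def core_vocabulary(slices, function_words, min_per_million=50.0):
    """Return words that are frequent in every time slice.

    slices          -- dict mapping a label (e.g. a year) to a list of lemmas
    function_words  -- words to exclude (semantically vacuous items)
    min_per_million -- minimum relative frequency required in each slice
                       (an arbitrary, HYPOTHETICAL threshold)
    """
    core = None
    for tokens in slices.values():
        counts = Counter(tokens)
        total = len(tokens)
        frequent = {
            word
            for word, count in counts.items()
            if total and count * 1_000_000 / total >= min_per_million
        }
        # Keep only words that were also frequent in all earlier slices.
        core = frequent if core is None else core & frequent
    return (core or set()) - set(function_words)


# Toy yearly slices (HYPOTHETICAL data, for illustration only).
yearly = {
    "1998": ["jul", "og", "fejre", "jul", "bil"],
    "1999": ["jul", "og", "fejre", "tog"],
    "2000": ["og", "jul", "fejre", "internet"],
}
print(sorted(core_vocabulary(yearly, function_words={"og"})))
```

Whether a word set derived in this way really remains stable in frequency and meaning would still have to be checked against held-out years before it could serve as an invariant feature.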
Further experimental work will show whether it is possible to derive such a constant semantic core vocabulary from a large amount of text material,[18] and whether it really can serve as an appropriate invariant linguistic feature.

4. Conclusion

In this paper, we have discussed examples of comparative investigations on two general language corpora of Danish, K90 and K2000, the former reflecting Danish language usage around 1990, the latter around 2000. We have argued that some of the interpretations of the observed differences in vocabulary, collocation, semantics, and grammar were not necessarily the result of general changes in language usage, but rather a likely consequence of differently composed corpora.[19] This observation led us to the conclusion that a framework for comparative diachronic corpus investigation is needed, and we sketched some prerequisites for a methodology of diachronic corpus comparison.

With regard to comparative corpus investigation, we found that standardisable approaches are required to account for the complexity of the observed linguistic phenomena, as well as the quantitative relationship between a linguistic realisation and its potential variants. In further work, such approaches should be concretised and evaluated.

With regard to the data material, i.e. the corpora to be compared, we argued that in order to determine the similarity of corpora comprising material from different time periods, the approach must allow a certain degree of dissimilarity for those linguistic characteristics which are likely to change over time. Therefore, we proposed to base this approach on invariant textual features, the semantic core vocabulary of a language being an initial candidate for investigation in further work. Generally, we find that the determination of invariant features appears to be a conceptual, or hermeneutic, challenge rather than a mere statistical one.
Notes

1 Thanks to my colleagues Britt Keson and Allan Ørsnes for their valuable comments on the manuscript.
2 Legally DSL is a semi-public institution under the jurisdiction of the Danish Ministry of Culture, and its activities are financed in part by the Danish Government and in part by private and public foundations.
3 Among other corpora compiled or distributed by DSL are the Danish PAROLE Corpus (compilation and distribution, cf. Keson 1998) and the Bergenholtz DK87-90 Corpus (distribution, cf. Bergenholtz 1988). An overview of Danish corpora is given in Asmussen (2001).
4 A comprehensive account of the background and the design of the Corpus of the Danish Dictionary is given in Asmussen & Norling-Christensen (1998).
5 Accessible at http://www.dsl.dk/korpus2000.
6 Developed by E. Bick under the VISL project, University of Southern Denmark, http://visl.sdu.dk.
7 The query tool being CQP (Christ 1994) in conjunction with a special web interface developed in the Korpus 2000 project. For details on Korpus 2000 and its web-based user interface cf. Andersen et al. (2002).
8 The smiling face indicates that the lemma is spelt in accordance with official Danish orthography.
9 The score s is computed as follows: s = round(0.5 + log10(f)), for f > 0, and s = 0, for f = 0, where f is the computed average number of occurrences in one hundred million words of text.
10 This rule has been arbitrarily defined.
11 Even if this is by no means statistically significant, a user who is not aware of this might make a comparison based on these raw frequencies.
12 As noted by Elbro himself.
13 Cf. Church & Hanks (1989); a modification is that their log2 function is not applied here.
14 For a brief description, some examples, and further references cf. Rundell (2002).
15 For a comprehensive overview and discussion cf. Kilgarriff (2001).
16 Institut für deutsche Sprache, Mannheim, http://www.ids-mannheim.de.
17 "die Verteilung von verschiedenen Kollokationen ausgewählter Funktionswörter" (p. 35).
18 DSL stores approximately 400 million words of text material in its text bank. The material covers a time span of approximately 20 years and constitutes the raw material for other DSL corpora. A part of this material is - to date - seven years of newspaper data from a Danish daily newspaper (Berlingske Tidende), approximately 25 million words per year. A first step could be to determine the common vocabulary to be found in each year (or each month or day), and to use this as a basis for further investigations.
19 In fact, there is a compositional difference between these two corpora as K2000 holds approximately 2/3 of newspaper text as opposed to approximately 1/3 in K90.

References

Andersen, M.S., Asmussen, H., Asmussen, J. (2002), ‘The Project of Korpus 2000 Going Public’, in: A. Braasch and C. Povlsen (eds.) Proceedings of the Tenth EURALEX International Congress, EURALEX (2002), Copenhagen.
Asmussen, J. (2001), Korpus 2000. Et overblik over projektets baggrund, fremgangsmåder og perspektiver. NyS 30. Nydanske studier & almen kommunikationsteori, Copenhagen.
Asmussen, J. and O. Norling-Christensen (1998), ‘The Corpus of The Danish Dictionary’, in: Lexikos 8, Afrilex Series 8:1998, Stellenbosch, pp. 223-242.
Belica, C. (1998), Statistische Analyse von Zeitstrukturen in Korpora. Teubert W. (ed.): Neologie und Korpus. Tübingen.
Bergenholtz, H. (1988), DK87: Et korpus med dansk almensprog. Hermes, Journal of Linguistics 1, Århus, pp. 229-237.
Christ, O. (1994), A modular and flexible architecture for an integrated corpus query system. COMPLEX’94, Budapest.
Church, K. and P. Hanks (1989), Word association norms, mutual information and lexicography. ACL Proceedings, 27th Annual Meeting, Vancouver.
Elbro, C. (2002), Ift, ifm, mht, mhp og andre uspecifikke præpositioner. Mål og Mæle 3:2002, Copenhagen, pp. 17-23.
Garside, R. and P. Rayson (2000), Comparing corpora using frequency profiling. Proceedings of the workshop on Comparing Corpora, held in conjunction with the 38th annual meeting of the Association for Computational Linguistics (ACL 2000). 1-8 October 2000, Hong Kong, pp. 1-6.
Jarvad, P. (1999), Nye Ord. Ordbog over nye ord i dansk 1955-1998. Copenhagen.
Keson, B. (1998), Documentation of The Danish Morphosyntactically Tagged PAROLE Corpus. Society for Danish Language and Literature, DSL, Copenhagen, http://korpus.dsl.dk/e-resurser/parole-doc.rtf
Kilgarriff, A. (2001), Comparing Corpora. International Journal of Corpus Linguistics, vol. 6, no. 1, pp. 97-133.
Rundell, M. (2002), ‘Good Old-fashioned Lexicography: Human Judgment and the Limits of Automation’, in M-H. Corréard (ed.) Lexicography and Natural Language Processing. A Festschrift in Honour of B.T.S. Atkins. EURALEX 2002.
Ruus, H. (1995), Danske Kerneord. Centrale dele af den danske leksikalske norm. Copenhagen.

Synchronic and diachronic variation: the how and why of sociolinguistic corpora

Kate Beeching
University of the West of England

Abstract

This paper aims to illustrate the potential of (spoken) sociolinguistic corpora for research studies in both synchronic and diachronic variation, with reference to French, and to suggest ways in which useful research corpora may be established for future generations of scholars. Spoken corpora and corpus tools are an excellent heuristic in charting distributional frequencies or probabilistic factors. Andersen (2000) suggests that the upsurge of innit and like in the COLT Corpus of adolescent English may be more than age-grading. The present paper will present broad-brush preliminary evidence with respect to the evolution of selected pragmatic particles in French.

1. Introduction

The title of my paper on sociolinguistic corpora has put the how? before the why?
This raises a very interesting methodological issue relating to corpus collection and corpus-based research. Generally speaking, linguistic research design which includes data collection focuses very precisely on the nature of the data required for the research project in question. You start with the why? and move on to the how? A hypothesis is formulated and the data collection is specifically designed to test that hypothesis. The nature of corpus design is quite the opposite: very large amounts of data are collected from a representative sample of the population, both written and spoken, and - as was the case for the Survey of English Usage - the corpus designer decides which genres are to be included as being those which are most typical of the language at any one time.

Inevitably, the composition of the corpus influences the findings which may be made and bias may creep in. For a long time, for example, the spoken language was considered to have been underrepresented in the British National Corpus and, even now, if we compare how much time the average British English speaker spends speaking and listening as opposed to reading or writing, corpora are generally weighted in favour of the written language. The time-consuming nature of transcription plays a major role in the relative paucity of spoken material. Should researchers wish to chart distributional frequencies in the spoken as opposed to the written language, they are at liberty to select sub-corpora. Inevitably, too, the corpus may not include the precise register of the language which the researcher is interested in. In Beeching (1997), I provide an apology for the small home-grown corpus which focuses on, in this case, the type of spoken French most needed by horticulturalists (a group of learners I happened to be teaching at the time and for whose needs no specifically-designed corpus was available).
At the University of the West of England, Bristol, we are currently engaged in a Joint Project collecting samples of learner English/French/German and Spanish at a number of levels. Already the methodological questions are being posed: what level of detail will be required in transcription? Does the method of data collection fit the individual research aims of the staff involved? How best can we collect data which it will be subsequently of interest to interrogate for studies which may be as diverse as morphological development, lexical richness and so on. The how? and the why? questions bedevil the collection of corpus material in a very large number of fields. The key factor about corpus investigations is size: millions of words are considered to be required to gauge the use of lexical items, even those which are not particularly rare. For studies focusing on variation, large is necessary but insufficient. In the case of spoken data requiring transcription, large means long hours of work by trained personnel - in other words, funding. Corpus Linguistics is a field where individual researchers, public and private funding, universities, libraries, publishers and government need to pull together, as has been most spectacularly demonstrated by the success of the BNC. A collaboration of this sort has, unhappily, not occurred for French. For studies focusing on variation, large is necessary, but insufficient, because samples must be accompanied by particular demographic data to allow investigations to be meaningful. And opinions differ on exactly how data should be collected and collated for reasonable conclusions to be drawn. In this paper, my remarks relate to the type of corpus which lends itself to the study of semantico-pragmatic change. On the one hand, the why? logically precedes the how? On the other hand, most empirical research papers place the methodology section before the results section. It is for this reason that I have positioned the how? 
before the why? in this paper, though the latter part of the paper serves as justification for the methods described in the former. The questions to be raised are: How reasonable is it for sociolinguists to ask for specific information about speakers to be included in the data set along with transcriptions when corpora are compiled? And what sort of information and what degree of detail is useful?

2. The how?

Examining data collection methods which have been used in the past is one way to explore best practice for the future. Two outstanding examples of existing sociolinguistic corpora for French are the Etude Sociolinguistique d’Orléans (collected between 1966 and 1970) and the VALIBEL Corpus. The latter is the largest electronic corpus of spoken French, with 400 hours of recordings or about 4 million words, and hopes to be representative of the “communauté linguistique Wallonie-Bruxelles” (Francard et al. 2002). Each interview is accompanied by demographic information covering the following six parameters: sex, age, geographical location, place of birth, educational background and the socioprofessional situation of the informant. More information is accessible at http://valibel.fltr.ucl.ac.be. The VALIBEL centre focuses on variation studies (lexical, syntactic, phonetic and prosodic). Despite the excellence of this corpus, it is of little use for my exploration of hexagonal French, as its focus is entirely on Belgian French.

As yet nothing similar is available for hexagonal French, though the Orléans Corpus is an excellent model. My own very small Bristol Corpus is available on-line and the “reference corpus of spoken French”, currently the responsibility of Mireille Bilger (Université de Perpignan: bilger@univ-perp.fr), holds great promise.
The Orléans Corpus, the collection and transcription of which was supported by grants from the DES, the French Embassy in London, BELC in Paris, the CRDP in Orléans, the CMPP in Orléans, INSEE in Orléans and the French Centre for European Sociology, is an invaluable resource. As the authors explain in their catalogue, the origins of the ESLO (Etude Sociolinguistique d’Orléans) go back to 1966, the era of the ‘audio-visual revolution’ in language teaching. There was an acute need for non-literary and everyday samples of authentic spoken French. The aim, then, of the corpus was to collect a body of recordings of spoken French from an urban environment, the choice of subjects being governed by sociological criteria in order to ensure that the corpus was representative of French speakers of the time. The corpus was to be transcribed and made available to researchers in linguistics, sociology and pedagogy and was to be used as the basis for teaching materials.

The resulting electronic corpus comprises approximately 109 hours of spoken French, 902,755 words of which have been orthographically transcribed with a further 13 hours of phonetic transcription. These transcriptions are available at: http://bach.arts.kuleuven.ac.be/elicop. Each interview is accompanied (in the catalogue) by very full background information, including not only the six parameters favoured by VALIBEL but also the initials of the interviewer, the date of the interview and date and place of birth of the interviewee, the INSEE coding of the interviewee, their marital status, political leanings, the length and quality of the recording and some indication of the topics covered. Most interviews covered a pre-determined set of questions, but the speakers appear relaxed and informal in their mode of response.
This imaginative yet rigorously managed project is a treasure-house of historical evidence on the French language, the first and, as yet, virtually the only French corpus in existence which is freely available on-line.

The Bristol Corpus also had its origins in language pedagogy. Supported by small grants and advances from OUP and CUP, I spent 10 years from 1980 to 1990 as a freelance French textbook writer, collecting, transcribing and preparing pedagogical materials for French (e.g. Beeching, 1985, 1986, 1989, Beeching & Page, 1988, Beeching & Le Guilloux, 1990, 1993). All were based on and had an accompanying cassette containing authentic taped spoken interviews with French people from all walks of life and from a number of urban and provincial parts of France. A selected sub-section of the interviews recorded during these years was transcribed as part of my PhD thesis, an abridged version of which was published in 2002. These interviews now make up the Bristol Corpus which comprises 17.5 hours of orthographically transcribed speech, or 155,000 words, involving 95 speakers, aged 7 to 90 years with a balanced range of educational backgrounds. The Corpus may be accessed at http://www.uwe.ac.uk/facults/hlss/languages/research/staff/CORPUS.pdf. The file contains some demographic information about the speakers: their sex, age (in 3 age bands: 0-20; 21-40; 41+) and level of education (in 3 bands: no bac, bac but no university degree, bac + university degree(s)).

Bilger (2002) presented the “Reference Corpus of spoken French”, the constitution of which was proposed in 1998 by the GARS teams and the CNRS ESA 6060 team directed by C. Blanche-Benveniste. Currently the DELIC team led by J. Veronis is completing the corpus. Bilger regretted that the corpus did not fulfil all of the criteria which Sinclair suggests characterise a ‘reference’ corpus as not all speech situations are included and a number of usages of the language are missing.
She suggests that this is but a first step in a vaster project. An attempt has been made to capture a representative sample of common, current usage. The corpus is 400,000 words long, the interviews were recorded in 40 towns and three speech situations have been favoured: “private” speech, professional speech and public speech. Bilger did not mention in detail what background notes accompany the transcriptions.

To sum up, a reference corpus which is useful for sociolinguistic research must attempt to provide samples which are representative of all members of the society in question talking at a similar level (or levels) of formality in order for comparisons to be made and variation discerned. A minimum of information should accompany each interview: the sex, age and educational background of the speakers.

3. The why?

Reference corpora which are allied with demographic information can be used, not only to look at synchronic variation and the way that social hierarchies and power may be enshrined in language use (see Beeching, forthcoming a) but also to investigate diachronic variation (see Beeching, forthcoming b). This is the most exciting new development. Though change may conceivably occur because of the written language, it is generally recognised that most changes occur in face-to-face interaction. In the past, language change has been studied with some difficulty: written documents of various sorts have been scrutinised to find evidence of what might have been happening in the spoken language at a particular stage (an excellent example of such an investigation at the phonological level is Sampson, 2002). The fact that we now have what amount to historical documents in the form of tape-recordings and, if we are lucky, transcriptions of those tape-recordings, means that, from now on, we will be able to monitor change as it occurs.
Corpus evidence allows us to quantify shifts, gauge reversals, and check distributional frequencies. Electronic search facilities and statistical packages can alleviate much of the drudgery entailed. We can refine not only our descriptions of linguistic change but also our hypotheses about the causes and effects of such changes. Linguists have traditionally favoured language-internal causes of linguistic change. The advent of sociolinguistics and sociolinguistic evidence allows us to put forward a very strong argument for the importance of language-external, social factors. As Croft (2000: 4) remarks, ‘the real entities of language are utterances and speakers’ grammars. Language change occurs via replication of these entities not through inherent change of an abstract system’.

There is considerable evidence that synchronic and diachronic variation are inextricably linked and that language items which begin as one variant of a variable or one sense of a polyseme gradually gain the upper hand. A new form-function configuration can eventually obscure the original one as, becoming more frequent, it is routinised. As Heine, Claudi and Hünnemeyer (1991: 261) remark with reference to grammaticalisation, language change and synchronic variation are inextricably interlinked:

Grammaticalization has to be conceived of as a panchronic process that presents both a diachronic perspective, since it involves change, and a synchronic perspective, since it implies variation that can be described as a system without reference to time.

Jakobson (1952/1963: 37) made a similar point when he said:

Pendant un certain temps, le point de départ et le point d’aboutissement de la mutation se trouvent coexister sous la forme de deux couches stylistiques différentes... Un changement est donc, à ses débuts, un fait synchronique¹.

Grammaticalisation is a term which is generally applied to nouns and verbs which come to serve grammatical functions and then continue to develop new grammatical functions.
Classic examples include the development of Latin-Romance articles, auxiliaries and indefinite pronouns. Whilst much focus has traditionally been placed on what Givón has famously summed up as ‘Today’s morphology is yesterday’s syntax’, it could be argued that the emphasis on morphology reflects the neogrammarians’ preoccupation with this area rather than giving a just view of a process or phenomenon which extends to semantic and pragmatic areas, not solely to purely ‘grammatical’ ones. In order for a lexical or content word to gain in currency and become a function or grammatical word, it must of necessity lose the particular lexical meaning which ties it to specific contexts and become desemanticised. This ‘bleaching’ has been regarded as a somewhat negative fate for words, though some researchers point out that we should not consider desemanticisation as ‘loss’ but as an enriching or multiplying of meaning. In addition, desemanticisation is a sine qua non for a term to become grammaticalised or gain in currency, to change from a ‘content’ to a ‘function’ word, as Haspelmath (1999: 1062) explains:

Semantic generalization or bleaching is usually a prerequisite for use in a basic discourse function, that is, for the increase in frequency that triggers the other changes.

The semantic generalisation which is so often observed in the development of a lexical category to a functional one appears, then, to be not a consequence of routinisation but a pre-requisite for it. This “bleaching” and “gaining in currency” is not a phenomenon which restricts itself to morphology but is omnipresent, not least in cases of pragmatico-semantic change.
In Beeching (forthcoming b) I argue precisely this and I posit that, along with the formal changes which appear to accompany grammaticalisation (desemanticisation, fusion, coalescence), what we are witnessing in all cases of language change is a process of pragmaticalisation, which I define as follows:

Pragmaticalisation is the manner in which words, used in context, shift in meaning or attract a new social semiotic, become habituated in that usage and are propagated because of the new fashion or prestige which is attached to them.

Pragmaticalisation occurs during human interaction, and human interaction is always heavily overlaid with connotations which may be most readily explored using politeness theory. Kasper (1990) distinguishes between two trends in the conceptualisation of politeness: politeness as strategic conflict-avoidance and politeness as social indexing. Both are involved in a consideration of pragmatic particles such as like and innit in British English or enfin, hein or quoi in hexagonal French. All are used to encourage speaker involvement, and they are part of the means whereby speakers downtone assertions and avoid presenting themselves as the expert; they are thus a means of avoiding conflict and managing face needs. But they are often words, too, which are stigmatised in “polite” circles: they are familiar, colloquial, used only in the spoken language, in informal contexts or by an easily identifiable subset of the population. Quotative like, for example, is much more frequent on the lips of younger speakers. Hein and, to an even greater extent, quoi are characteristically used far more by working class speakers. Frequent use of such terms, then, carries a social semiotic, resulting in social indexation.
The underlying processes of language innovation and language diffusion appear to me to reside in the conventionalisation of generalised invited inferences, some of which may be cognitively motivated metaphorical or metonymic interpretations, but many of which are motivated by considerations of politeness in its broadest sense, including both notions of sociability and face and of social indexation (see Eelen, 2001, and Beeching, 2002: 25-27). On the one hand, Croft (2000: 178) intimates the usefulness of politeness theory in helping to explicate the directionality of the transmission of variants, an aspect not accounted for in the Milroys’ weak-tie model. On the other hand, drawing on Levinson (2000), Traugott and Dasher (2002) posit an Invited Inferencing Theory of Semantic Change (IITSC). This theory suggests that historically there is a path from coded meanings (Ms) to utterance-token meanings (IINs) to utterance-type, pragmatically polysemous meanings (GIINs) to new semantically polysemous (coded) meanings. Their arguments for the role of conceptual metonymy are persuasive. My theory of pragmaticalisation builds on the IITSC but places it firmly in the context of politeness theory and of Milroyan sociolinguistic theory and methodology.

Both Croft (2000) and Givón (2002) draw out the parallels between adaptive behaviour in biology and linguistic phenomena, with variation seen as the indispensable tool of learning, change and adaptation, and this in the context of the sociability, co-operation and communicative way of life rooted in a hunting-and-gathering society. The primacy of sociability, the bid for conflict-avoidance in human interaction and the equally prevalent urge for social indexation promote not only a high rate of usage of politeness markers but also their differential usage in different identity groups. Moreover, because such particles are both frequent and highly polysemous, they are subject to change.
Hence the fascination they hold for scholars of synchronic and diachronic variation.

The techniques and tools of Corpus Linguistics are formidable allies when exploring the relative distributional frequencies of old and new forms coexisting in synchrony. Though linguists have generally focused on change and, in particular, on the beginning and end of a change, the far commoner and more ubiquitous situation is that of stable sociolinguistic and stylistic variation, out of which change may, or may not, emerge. As Hopper and Traugott (1993: 95) point out:

Changes do not have to occur. They do not have to go to completion, they do not have to move all the way along a cline [...] the outcome of grammaticalization is quite often a ragged and incomplete subsystem that is not evidently moving in some identifiable direction.

Polysemy and synchronic variation are far commoner than diachronic change and, coupled with social stratification, they constitute the raw material without which change is not possible. Distributional frequencies may fluctuate, depending upon fashion and prestige; a form-function association may dwindle and even die and, in those relatively rare cases, change may be said to have occurred. There appears to be a probabilistic aspect to the spread of a new form, the now famous S-curve, whereby after a slow start a new form surges forward at an accelerated pace and then falls back as it stabilises. In Beeching (forthcoming a), I have attempted to capture the numerous complexities involved in the sociolinguistic promotion and demotion of a new form in the disarmingly simple formula l = p - n, where l is the likelihood of spread, p stands for the positive aspects of the (changed) form-function configuration and n for the negative ones in the mind of the speaker involved. This formula allows for the fact that not all forms are positive for all members of a speech community and that attitudes to a form may change over time.
In highly stratified societies, the n value of a stigmatised form entirely eclipses its p value. Only if strata become less rigid and identification with the life-style associated with the form rises can the n value drop. A measurement of the distributional frequencies of a form in speech samples from similar speech communities at two points in time may indicate the ‘l’ and help us to chart fluctuations over a large speech population. In this respect, the usefulness of large corpora of transcribed spoken language coupled with the demographic information classically associated with sociolinguistic studies (the age, sex and socioeconomic background of speakers) can hardly be exaggerated.

In Beeching (forthcoming b), I survey the evolution of four pragmatic particles, as evidenced in a comparison of their usage in the (1966-1970) Orléans Corpus and the (1980-1990) Bristol Corpus. Many caveats must be issued concerning the conclusiveness of the data presented. The corpora are small by many standards. The Orléans Corpus is restricted to only one town in France, whereas the Bristol Corpus covers a number of different towns and regions. Although a similar amount of speech is examined in each corpus (around 155,000 words), ninety-five speakers are surveyed in the Bristol Corpus while only twelve have been studied in detail in the Orléans Corpus: two men and two women from each of the three education groups, which I have designated here as WC, LMC and MC.

As a means of illustrating the ‘why?’ of sociolinguistic corpora, I wish now to focus on only two of the particles studied in Beeching (forthcoming b): hein and quoi. Hein and quoi both occur in an utterance-terminal or, at the very least, tone-group-terminal position and are mildly stigmatised (they do not occur in formal written discourse and are highly unlikely to occur in formal spoken French). Quoi remains perhaps slightly more stigmatised than hein.
Both, however, could be said to serve social/interactional purposes in maintaining hearer-involvement and in hedging or downtoning remarks which a speaker or hearer might consider over-assertive. Hein is generally translated into English by a tag question or by you know?, while quoi may be rendered ‘as it were’, ‘so to speak’, ‘know what I mean’, ‘like’, ‘sort of’, ‘kind of thing’, ‘you know’ or even ‘of course’.

Oui, peut-être mais ça dépend aussi, hein?
Yes, perhaps, but it depends, too, doesn’t it?

ne pas avoir que des contraintes dans dans sa vie, quoi, hein?
not just to have obligations in in one’s life, as it were, you know?

Hein and quoi appear, thus, to function in both of the manners in which politeness is conceptualised. They serve both to flag social indexation (they are nonstandard, stigmatised forms) and to mediate sociability. While quoi seems to have one core function or meaning (it flags tentativeness concerning the adequacy of one’s expression), Beeching (2002) charted two main functions or meanings of hein: a Hyperbolic function, where hein underscores an emphatic remark, and a Discoursal function, where hein is used as a backchannel device to maintain hearer involvement. Results in Beeching (2002) based on an analysis of the Bristol Corpus suggested a shift in distributional frequencies from the Hyperbolic to the Discoursal usage and, simultaneously, a change in social semiotic: the Discoursal usage of hein was favoured by female speakers.

[Figure 1 plot: y-axis Mean HRANDQR (0 to 140); x-axis CORPUS (Bristol, Orléans); series by CLASS (WC, LMC, MC)]

Figure 1: Mean rates of hein and quoi usage in the Bristol and the Orléans Corpus

As Beeching (forthcoming b) shows, rates of hein usage are similar overall in both corpora. The class distribution of usage differs dramatically, however. In the later, Bristol, Corpus, the middle class speakers have adopted hein and working class rates are proportionately smaller.
Rates of quoi-usage have, by contrast, doubled in the intervening years, with extremely high rates amongst working class speakers but much higher rates, too, amongst middle class speakers.

Figure 1 charts the distributional frequencies of the sum of the means of hein and quoi rates per 10,000 words in the Bristol and Orléans Corpora, subdivided for educational background. (It should be noted that the corpora had to be screened in the case of quoi to focus only on occurrences of it in tone-group-terminal function: quoi has of course a number of other more canonical pronominal uses in interrogative and relative constructions, e.g. de quoi parles-tu? il n’y a pas de quoi rire etc.) The most striking aspect of this figure is the increase in the usage of hein and quoi amongst middle class speakers. There appears to have been a slight dropping off in lower-middle-class usage of these stigmatised forms in the Bristol Corpus and a very slight increase in working class usage, but the increase in usage overall is amongst middle-class speakers. Table 1 demonstrates the role played by the speaker’s sex in the increased rates of hein/quoi usage.

Table 1: Rates of hein/quoi usage per 10,000 words in the Orléans and Bristol Corpora, subdivided by sex and class (N = the sum of the raw number of occurrences of hein and quoi).

Corpus                  Sex      WC               LMC              MC
Orléans Corpus (1966)   Male     65.61 (N=200)    62.54 (N=191)    15.03 (N=37)
Orléans Corpus (1966)   Female   22.94 (N=69)     13.92 (N=51)     15.84 (N=29)
Bristol Corpus (1990)   Male     57.91 (N=271)    27.69 (N=53)     24.33 (N=22)
Bristol Corpus (1990)   Female   46.89 (N=150)    33.96 (N=108)    35.98 (N=54)

In the Orléans Corpus, hein/quoi usage is a predominantly male WC and LMC phenomenon. The female speakers observe the same reticence concerning hein/quoi usage as the MC males. In the Bristol Corpus, however, rates are a great deal more evenly distributed. Though the highest rates belong to WC speakers, this is true of both the men and the women. Indeed, the female LMC and MC rates exceed those of their male equivalents.
It seems that hein and quoi are becoming less stigmatised and that it may be women who are leading this change in social semiotic. In Beeching (forthcoming a), I discuss the relationship between politeness and power and the maintenance of hierarchies through asymmetrical language-usage. It is too early to make any conclusive remarks, and replication of the study, drawing on more data from the invaluable Orléans Corpus and also from the more recent Reference Corpus, is advisable. However, it seems possible that, if middle class speakers are beginning to adopt stigmatised ‘working class’ speech forms, there has been a democratisation, a shift in the hierarchical nature of French society. Our study of distributional frequencies of pragmatic particles may inform us not only about linguistic usage but also about the structure of society. It is the inter-relationship between the two which may bring about linguistic change.

4. Conclusion

In this brief chapter, it has been impossible to do full justice to either of my main foci: the usefulness of sociolinguistic corpora and the way that diachronic change may be charted through differences in distributional frequencies (synchronic variation) noted at different points in time. As Macaulay (2002: 298) points out, there are many variables that affect samples of speech. He claims, however:

Yet we need not despair. One way forward is in replication. As more studies are carried out, the influence of accidental factors may be easier to detect.

Macaulay makes a number of very useful practical suggestions concerning the way that replication may be made more reliable. For one thing, researchers need to show rates of occurrence in a standardised way which will allow comparisons to be made, e.g. how many occurrences per 10,000 words. He recommends the amassing of comparative data.
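The standardisation which Macaulay recommends amounts to a simple normalisation of raw counts by sample size. A minimal Python sketch of the calculation follows; the function name and the figures passed to it are hypothetical and are not drawn from either corpus.

```python
def rate_per_10k(occurrences: int, total_words: int) -> float:
    """Return the number of occurrences of a form per 10,000 running words."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    # Multiply before dividing so that round figures stay exact.
    return occurrences * 10_000 / total_words


# Hypothetical samples of unequal size, for illustration only.
sample_a = rate_per_10k(occurrences=120, total_words=40_000)
sample_b = rate_per_10k(occurrences=45, total_words=10_000)

print(f"sample A: {sample_a:.2f} per 10,000 words")
print(f"sample B: {sample_b:.2f} per 10,000 words")
```

Because both figures are expressed on the same scale, samples of unequal length can be compared directly, although the raw N should still be reported alongside each rate, as in Table 1.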
When collecting corpus material, I urge fellow researchers to append as much information as possible concerning the date and place of collection, the sex, age, place of birth and educational and socioeconomic background of the speaker, and the nature of the relationship between the interviewer and interviewee (or speakers, if a ‘fly-on-the-wall’ recording). Macaulay stresses, too, the usefulness of the computer, a point of view which I whole-heartedly support. One is forced, however, to recognise that linguists do not look set to be put out of their job by a computer: the polysemy and multifunctionality of certain terms, the subtlety of invited inferences and the continuum along which terms travel towards coded meanings pose a formidable challenge to the computer programmer.

Finally, as any researcher who has collected and transcribed spoken material knows only too well, transcription is fantastically time-consuming. It has become obvious to me, however, that when we transcribe, what we are engaged in is not only a record and research base for the study of contemporary forms. What we are engaged in is the creation of a historical archive of spoken language which was impossible for our less technologically-advantaged forebears. To aid our understanding of the nature of language, of the way that it is structured and how that structure reflects social phenomena, of its variation, both synchronic and diachronic, every attempt should be made to take samples which are representative and to accompany our hard-won transcriptions with detailed demographic information. In this way, we allow others to stand on our shoulders and the fruits of our efforts are redoubled.

Notes

1 For a while the departure and arrival point of the change coexist in the form of two different stylistic layers.... a change is thus, in its beginnings, a synchronic phenomenon.

References

Andersen, G. (2000), Pragmatic Markers and Sociolinguistic Variation. A relevance-theoretic approach to the language of adolescents. Amsterdam/Philadelphia: John Benjamins.
Beeching, K. (1985), Vrai de vrai! Oxford: Oxford University Press.
Beeching, K. (1986), A vrai dire. Oxford: Oxford University Press.
Beeching, K. (1989), Actifrance. Oxford: Oxford University Press.
Beeching, K. (1997), French for specific purposes: the case for spoken corpora. Applied Linguistics 18(3): 374-394.
Beeching, K. (2002), Gender, Politeness and pragmatic particles in French. Amsterdam/Philadelphia: John Benjamins.
Beeching, K. (Forthcoming a), Pragmatic particles - polite but powerless? Tone-group terminal hein and quoi in contemporary spoken French. To appear in one of the first two editions of Multilingua 2004.
Beeching, K. (Forthcoming b), Pragmaticalisation, politeness and linguistic change: synchronic evidence from French.
Beeching, K. & I. le Guilloux (1990), Ça se dit et ça s'écrit. Oxford: Oxford University Press.
Beeching, K. & I. le Guilloux (1993), La passerelle. Cambridge: Cambridge University Press.
Beeching, K. & B. Page (1988), Contrastes. Cambridge: Cambridge University Press.
Bilger, M. (2002), Présentation du “corpus de référence de français parlé”. Paper given at the ATALA Conference, “Constitution et exploitation de corpus du français parlé”, 25 May 2002.
Croft, W. (2000), Explaining Language Change. An Evolutionary Approach. Harlow: Longman.
Eelen, G. (2001), A critique of politeness theories. Manchester: St. Jerome Publishing.
Francard, M., G. Geron, V. Giroul, P. Hambye, A. C. Simon and R. Wilmet (2002), Le centre de recherche Valibel: des corpus oraux au service d’un observatoire du français en Belgique. Paper given at the ATALA Conference, “Constitution et exploitation de corpus du français parlé”, 25 May 2002.
Givón, T. (2002), Bio-linguistics. The Santa Barbara lectures. Amsterdam: John Benjamins.
Haspelmath, M. (1999), Why is grammaticalization irreversible? Linguistics 37(6): 1043-1068.
Heine, B., U. Claudi and F. Hünnemeyer (1991), Grammaticalization: a conceptual framework. Chicago: University of Chicago Press.
Hopper, P. J. & E. C. Traugott (1993), Grammaticalization. Cambridge: Cambridge University Press.
Jakobson, R. ([1952] 1963), Essais de linguistique générale. Paris: Éditions de minuit.
Kasper, G. (1990), Linguistic politeness: current research issues. Journal of Pragmatics 14: 193-218.
Levinson, S. (2000), Presumptive Meanings: the theory of Generalized Conversational Implicature. Cambridge, MA: MIT Press, Bradford.
Macaulay, R. (2002), Discourse Variation, in: J. K. Chambers and N. Schilling-Estes (eds.) The handbook of language variation and change. Oxford: Blackwell, pp. 283-305.
Sampson, R. (2002), A transient vowel in early modern French: i nasal, in: R. Sampson and W. Ayres-Bennett (eds.) Interpreting the history of French. A Festschrift for Peter Rickard on the occasion of his eightieth birthday. Amsterdam/New York: Rodopi.
Traugott, E. and R. Dasher (2002), Regularity in Semantic Change. Cambridge: Cambridge University Press.

Statistical analysis of the source origin of Maltese

Roderick Bovingdon
Angelo Dalli
University of Sheffield

Abstract

This paper presents the results of the first ever large-scale statistical analysis of Maltese using the newly formed Maltilex Corpus. Traditional etymological and categorical analyses were supplemented with data mining techniques to provide accurate results with reduced effort. Statistics about the relationship between etymology and word classes were analysed from different viewpoints. Maltese grammar and morphology remain to this day largely Arabic, but with distinct Romance and English morphological accretions. Italian lexical influence upon present-day Maltese has exceeded the Arabic content in a quantitative sense, enriching Maltese from a purely root-based morphology with additional productive Romance features.

1. Introduction

The most recent theory relating to the origin of the Maltese language points to a direct Sicilian-Arabic connection. This claim, by Dionisius Agius (Agius, 1990; Agius, 1993; Agius, 1996) and Joseph Brincat (Brincat, 1994), although supported by significant research findings by other contemporary scholars from different disciplines¹, remains inconclusive (Bonanno, 1988; Borg and Alexander, 1978; Borg and Alexander, 1994). In the light of the prevailing knowledge of his times, Joseph Aquilina was on the right track when he proposed the Arabic of the Muslim Aghlabids of the Maghreb as the most likely original source for the Maltese language (Aquilina, 1959). Even at this early stage Aquilina does not rule out a possible Sicilian-Arabic link (Aquilina, 1988).

In contrast to Aquilina’s thesis, the current Siculo-Arabic school of thought presents a revolutionary twist. It repositions the original source of Maltese from a direct northern African origin to an offshoot of Sicilian Arabic. This theory reinforces the notion that the form of Arabic which was adopted in Malta, as the forerunner of today’s Maltese, was already tainted with non-Arabic language influences from its very inception (Agius, 1996; Brincat, 1995). Whilst considerable language convergence exists between Maltese and Siculo-Arabic (Agius, 1996), there still remain significant unexplained aspects of inquiry into early Maltese². The answers to most unexplained lines of inquiry, as sparse as the remaining evidence may be, may very well be lying dormant within the more subtle and distant aspects of early Maltese. The grammar and morphology of modern Maltese remain to this day Arabic, with distinct Romance and English morphological, and increasingly lexical, accretions (Aquilina, 1979; Mifsud, 1995).
A thorough combing of European, North African and Turkish depositories, as well as of private Maltese collections, has the potential to uncover major revelations, as shown by the relatively recent contributions of Wettinger and Fsadni (1968), Cassola (1992 and 1996) and Brincat (1995). Scholarship on Maltese has taken great strides forward by way of research, both within linguistics and from a purely historical perspective (Cassola, 1992; Mifsud, 1995). Wettinger’s exposure of Vatican Manuscript 411 deserves closer scrutiny as a possible significant link between medieval Maltese and Arabic, as well as being a valuable pointer to the existence of other similar contemporary documents elsewhere, including possible links to Hebrew (Wettinger, 1979). The quantity and significance of recent documentary discoveries relating to medieval Maltese are highly suggestive of even earlier records of written Maltese. Such exciting finds could provide scholars with the elusive missing linguistic links, bridging the gap between the pre-Arabic and Arabic beginnings. This need not necessarily conflict with Agius’ and Brincat’s postulations, as there is already ample evidence of the strong Siculo-Arabic connection. Agius openly claims, in his major work on Siculo-Arabic, that Maltese is a direct offshoot of Siculo-Arabic with no connections to the north African littoral³. Brincat (1994) emphasises the lack of a substrate for the pre-Arabic period. Remnants of a pre-Arabic substrate may nonetheless lie concealed within the later Siculo-Arabic strata.

2. Non-Arabic linguistic influences on contemporary Maltese

Traditionally the Romance element in Maltese was thought to commence with the coming of the Normans from 1049 onwards. Nowadays the year 1127 is perceived as a more realistic date in historical terms (Cassar, 2000). The pre-Norman Arabic content appears to have been heavily Sicilian based (Brincat, 1995).
Brincat’s contention that the lack of a substrate is the strongest pointer to a Siculo-Arabic genus⁴ holds much merit, along with Agius’ extensive approach to such origins.

2.1 Knights of St. John of Jerusalem

The more overt lexical, and the later morphological, Romance additions to Maltese made no significant inroads until the arrival of the religious Order of the Knights of St. John of Jerusalem, circa 1530. The Arabic-oriented populace of Malta, left to its own devices, continued to interact harmoniously for a long period of time within a Muslim⁵, Christian⁶ and Jewish milieu. Under the Knights, direct rule was introduced⁷, with the consequent imposition of the foreigners’ will and culture. During this long period in Malta’s history between 1530 and 1798, the Maltese came under more direct and imposing influence from the Romance element, due to the comparatively large numerical presence of Knights⁸, with an increasing quantity of Romance words and phrases from different regions of the Italian mainland (especially from the northern regions of Italy) being absorbed into Maltese. The increased social interaction with the local populace through this direct rule was an overriding force impinging on every aspect of the indigenous Maltese way of life, not least the language admixture.

The Order’s rule over Malta lasted for over two centuries. During this vibrant period in the archipelago’s history, the considerably increased interaction between the rulers, the Knights, and the general populace consisted for the most part in a master-subordinate relationship. It is important to note this point, as such social interaction between the overlords and the general populace meant that several aspects of the rulers’ culture, not least their predominantly Romance languages, imprinted their influences on the mainly peasant indigenous stock.
After the Order was ousted from the islands in 1798 by the French under Napoleon, who in turn saw their demise after only two years’ occupation, the British took possession for the next century and a half.

2.2 Modern Italian

Ironically, the greatest linguistic inroads from the Italian mainland occurred during the early days of British rule⁹. Owing to the political turmoil on the Italian mainland during the unification of Italy, many political refugees sought and were granted political asylum in the British protectorate of Malta. These émigrés, who included a number of prominent Italian intellectuals and politicians, banded together and formed a strong political lobby of their own. With Malta’s long tradition of strong political, cultural, administrative, religious and linguistic influence from the Italian mainland as well as from Sicily, these émigrés found many sympathisers amongst the local population (Friggieri, 1979). A considerable number of newspapers in the Italian language flourished on the island, stimulating the adoption and spread of modern Italian language influence on both the Maltese idiom and Maltese thought¹⁰. The infamous Language Question of the late 1920s and early 1930s, when the British colonial power squeezed Italian out of officialdom, saw the insertion of Maltese as the language of common parlance, while English took the coveted functionality of Italian. This political manoeuvring marked a phase when Italian linguistic influence suffered a temporary though prolonged lull (Fuccio, 1933; Hull, 1993).

Modern Italian has since made vast inroads into contemporary Maltese, mostly through the influence of Italian television, which enjoys a wide following in Malta. So strongly has this new influx been felt that the large quantity of Italian lexemes entering Maltese has affected the morphology of the language, a phenomenon indicated by the statistics presented in this study.
Contemporary Italian borrowings can be distinguished with ease from those previously adopted during the much earlier pre-British period¹¹. These additional and more significant Italian influences have markedly shaped Maltese with distinctly European traits. This trend emanates from the intellectual class, which consistently borrows new terminology, mostly from Italian, with increasing inclusions from other European languages, these being the dominant language sources whence contemporary intellectual, scientific and technological innovations originate.

This study clearly shows that Italian lexical influence on present-day Maltese, if only by way of numerical representation, has surpassed the Arabic content. Italian morphological influences have taken hold to such an extent that the updating of Maltese grammars to include this aspect is being considered. This development has moved Maltese away from a purely root-based morphology, a typological feature of its Semitic past (Schweiger, 1994), by adding the productive Romance feature of catenation (Mifsud, 1995).

2.3 English

The most recent linguistic influence on Maltese is English. English has steadily and increasingly affected Maltese, adding another language facet to the overall structure of Maltese (Mifsud, 1995). Interestingly enough, despite these recent accretions from English and the vastly different morphological structure of the two languages, the assimilation of English lexemes into a Maltese mould occurs with the least possible disturbance to Maltese morphological structure. By comparison, contemporary Arabic, under similar pressures of modern life and of the ever-changing world political scene, is adopting assimilative patterns similar to those of Maltese when borrowing from English and American English (Holes, 1995).
In the case of Maltese, this is especially so with verbs, where the basic structure of the English word is left intact while other morphemes, mainly prefixes and suffixes, take over (i.e. the stem does not change its internal make-up). Other linguistic devices, also of a Semitic nature, such as gemination of the initial radical in the case of verbs that are frequently preceded by a euphonic i, are also at work (Mifsud, 1995). Furthermore, as a result of this linguistic enrichment and infusion, Romance affixes assimilate naturally with both Semitic and English lexemes as an intrinsic part of standard Maltese grammatical structure.

3. Statistical and analytical results

The data used in this statistical analysis were sampled from the Maltilex Project’s corpus as of November 2001. The Maltilex Project is a joint effort between the Department of Computer Science and AI and the Institute of Linguistics at the University of Malta. Its aim is to produce the first national collection of computerised language resources for Maltese (Rosner et al., 1999; Rosner et al., 2000). The Maltilex corpus is made up of a representative mixture of newspaper articles of different kinds, including local and foreign news coverage, sports articles, political discussions and others, together with transcripts of radio shows, official government publications and some novels. When this statistical analysis was performed the corpus had 1.8 million words consisting of almost 70,000 different word forms.

3.1 Selection methodology and analysis tools

A representative sample of 1,000 words was needed for the purpose of this study. The sample was selected using a strictly random process to ensure the validity of our statistical results. A chaotic function was used to assign a random number to each of the almost 70,000 unique word forms in the corpus, permitting a randomly ordered list of words to be created.
This list was then examined and the following modifications were made:

1. Spelling Mistakes – Spelling mistakes were immediately corrected. Words that were spelled ambiguously were excluded from the list. This kind of modification and filtering is statistically sound since no attempt is made to influence the contents of the list. During the cleaning process the data was also converted into Unicode format to be universally accessible by all analysis tools.

2. Hapax Words – The word occurrence frequency in the corpus was examined for every word and all hapax words were automatically removed. This process removed most superfluous and arcane words that were accidentally inserted in the corpus and that appeared in the sample. This process can be statistically justified since hapax words can be seen as outliers that can be safely removed without affecting the validity of the resulting analysis.

The etymology and word class were noted down for every word in the sample. When a word had more than one class, the word entry was duplicated and a single class was entered for every entry. In order to maintain accurate statistics, a weight was added to every entry, reflecting the number of classes associated with that word: every entry of a word having n classes was assigned a weight of 1/n. A total of 1,034 entries were thus obtained, representing all possible etymology/word class pairs for the sample word forms.

The resulting data matrix was analysed using a custom-written data mining tool to extract statistics about the relationship between etymology and word classes in Maltese. Overall statistics about the source language origins of Maltese, together with the most commonly occurring word classes, were also extracted. The use of a data mining tool enabled us to analyse the data from two different perspectives: word class distribution for every etymological class and vice versa.
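The selection and weighting steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the custom tool used in the study: the function names, the seeded pseudo-random generator standing in for the "chaotic function", and the placeholder word forms `w1` to `w5` are all assumptions made for the example.

```python
import random
from collections import defaultdict
from fractions import Fraction


def sample_word_forms(frequencies, sample_size, seed=0):
    """Randomly order all word forms, drop hapax words, keep the first sample_size."""
    rng = random.Random(seed)  # assumption: stands in for the paper's "chaotic function"
    ordered = sorted(frequencies)
    rng.shuffle(ordered)
    kept = [w for w in ordered if frequencies[w] > 1]  # hapax words occur exactly once
    return kept[:sample_size]


def weighted_entries(annotations):
    """Create one entry per (word, class) pair, weighted 1/n for a word with n classes."""
    entries = []
    for word, (etymology, classes) in annotations.items():
        weight = Fraction(1, len(classes))
        for word_class in classes:
            entries.append((word, etymology, word_class, weight))
    return entries


def class_totals(entries):
    """Sum the weights per word class; the totals add up to the number of words."""
    totals = defaultdict(Fraction)
    for _word, _etymology, word_class, weight in entries:
        totals[word_class] += weight
    return dict(totals)


# Hypothetical toy data: placeholder word forms, not items from the Maltilex corpus.
toy_frequencies = {"w1": 12, "w2": 1, "w3": 4, "w4": 7, "w5": 1}
toy_annotations = {
    "w1": ("It", ["n"]),
    "w3": ("Ar", ["v", "n"]),
    "w4": ("Ar", ["adj", "adv", "prep"]),
}

sample = sample_word_forms(toy_frequencies, sample_size=3)
entries = weighted_entries(toy_annotations)
totals = class_totals(entries)
```

In this toy run three words yield six entries whose weights sum back to three, which is why weighted class counts can be fractional even though the number of sampled words is whole.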
3.2 Etymology

In our data sample, most of the words derived from Italian, Arabic and English. In the analysis these languages are denoted by the codes It, Ar and Eng respectively. There were also some isolated cases of Italian and Dutch etymons¹² finding their way into Maltese through English. These isolated cases are denoted as Eng > It and Eng > Dutch respectively.

[Figure 1: Source Origin of the Maltese Language. Pie chart with slices It 54%, Ar 41%, Eng 4%, Eng>It 1%, Eng>Dutch 0% (Other 5%).]

Figure 1 shows the source origin of the Maltese language, with Italian words forming the majority at 54%, Arabic at 41% and English at 5%. Table 1 shows a summary of the etymological analysis data,¹³ giving the exact number of word forms pertaining to every etymological class.

Table 1: Maltese Language Etymology Summary

Etymology   Ar    Eng   Eng>Dutch   Eng>It   It
Count       411   36    1           9        543

Following this summary analysis we then split up the data set according to the three main source languages and further analysed the source languages’ contribution in terms of word classes. The following abbreviations were used to denote different word classes: adj – adjective; adv – adverb; conj – conjunction; demon – demonstrative; interj – interjection; n – noun; pers – personal; poss – possessive; pp – participle; prep – preposition; pron – pronoun; v – verb. The word classes were further annotated with m, f, pl and dual to denote masculine, feminine, plural and dual forms respectively.

[Figure 2: Arabic Word Classes. Pie chart "Arabic Grouped Word Classes" with slices v 66%, n 21%, pp 4%, adv 2%, adj 2%, pron 2%, prep 2%, conj 1% (Other 9%).]

Figure 2 shows the summary results for Arabic word classes, showing that this source language contributes significantly in terms of verbs and, to a lesser extent, nouns.
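As a cross-check, the percentages quoted for Figure 1 can be recomputed from the Table 1 counts. The short sketch below uses only the five counts given in Table 1; the variable names are illustrative.

```python
# Counts copied from Table 1 (word forms per etymological class).
counts = {"Ar": 411, "Eng": 36, "Eng>Dutch": 1, "Eng>It": 9, "It": 543}

total = sum(counts.values())
shares = {code: 100 * n / total for code, n in counts.items()}

# English in the broad sense: direct borrowings plus etymons that arrived via English.
english_all = shares["Eng"] + shares["Eng>Dutch"] + shares["Eng>It"]

for code, share in shares.items():
    print(f"{code:<10}{share:5.1f}%")
print(f"{'Eng (all)':<10}{english_all:5.1f}%")
```

Rounded to whole percentages, the Italian and Arabic shares match the 54% and 41% reported for Figure 1, and the three English-mediated classes together round to the 5% quoted in the text.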
[Figure 3: Italian Word Classes. Pie chart "Italian Grouped Word Classes" with slices n 48%, v 29%, adj 11%, pp 9%, adv 3%, interj 0%, conj 0% (Other 0%).]

Figure 3 shows the summary results for Italian word classes, showing Italian’s significant contribution in terms of nouns and, to a lesser extent, verbs. This mirrors the pattern of the Arabic contribution, with the roles of nouns and verbs reversed. Figure 4 shows the summary results for English word classes. The relatively low percentage of English words (5% of the total) is consistent with established Maltese literary convention. This result seems to underestimate the percentage of English words in common Maltese parlance.

From a purely lexical viewpoint Standard Maltese comprises 41% from Arabic origins, 54% from Italian and 5% from English, as illustrated in Figure 1. Linguistically this points to a predominantly Italian influence with English slowly edging its way into the Maltese mould. The Arabic content, on the other hand, at surface value appears to be waning to a marked degree, bearing in mind that Maltese is still looked on as belonging to the Semitic fold. A deeper analysis justifies the continued classification of Maltese with the Semitic family of languages.

Table 2 presents percentage data formed from the comparison of the word classes in each of the three source language groups, illustrated in Figures 2, 3 and 4.

[Figure 4: English Word Classes. Pie chart "English Grouped Word Classes" with slices n 63%, v 22%, adj 8%, pp 7%.]

Table 2: Word Class Composition

Class          Arabic   Italian   English
Verbs          66%      29%       22%
Adverbs        2%       3%        0%
Nouns          21%      48%       63%
Adjectives     2%       11%       8%
Participles    4%       9%        7%
Pronouns       2%       0%        0%
Prepositions   2%       0%        0%
Conjunctions   1%       0%        0%

The following insights can be obtained from Table 2:

1. Smaller word classes in the category of pronouns, prepositions and conjunctions are not favoured either from the Italian or the English lexicon.
This trend attests to their stronger adherence to their older Arabic origins. Considering the sparse use of adverbs in the older (Arabic) Maltese, this word class appears to have no preference for English, with a minor inclination towards Italian borrowings.

2. The major word classes impinging on Maltese from both Italian and English are the two main classifications consisting of the Verbs class together with the Nouns/Adjectives/Participles class.

These latter results strongly suggest that Maltese morphology, when borrowing from Italian, exhibits a distinctly stronger preference towards the nominal lexicology than the verbal portion, while English displays quite the opposite. The Nouns/Adjectives/Participles class accounts for 68% of Italian borrowings as opposed to 78% of English borrowings. On the other hand, the Italian verbal borrowings from the Verbs class are not that much higher, at 29%, than their English counterpart of 22%. Considering the far lengthier historical-political connection with the Italian mainland compared to the relatively recent and much shorter British connection, this result shows that verbal borrowing from English is increasing.

3.3 Word Classes

The contribution of the source languages to different word classes in the Maltese language necessarily entailed an etymological analysis. The data mining tool used for the analysis enabled us to split the data set according to word classes. This allowed us to analyse the composition of different word class groups according to the source languages. Figure 5 shows a summary of the word category classes in Maltese, with verbs (43%) and nouns (37%) making up 80% of all words.

[Figure 5: Word Categories of the Maltese Language. Pie chart with slices v 43%, n 37%, adj 7%, pp 7%, adv 3%, conj 1%, prep 1%, pron 1%, interj 0% (Other 5%).]

Table 3 presents the normalized counts of all word classes in Maltese.
The counts are not all integers since the actual counts are based on the 1,034-entry sample that was created by duplicating word entries having multiple word classes, as previously explained. The count was then normalised to 1,000 words.

Table 3: Maltese Word Category Classes – Normalized Counts

Word Class   adj    adv     conj   interj   n       pp   prep   pron   v
Counts       73.5   26.83   5      1        368.5   68   6.83   8      442.34

Table 4 presents the percentage data for the word classes in Standard Maltese, as illustrated in Figure 5. Table 4 shows a preponderance of verbs over nouns in Standard Maltese, the former exceeding the latter by 6 percentage points. As this variation is not strong enough to indicate a distinct characteristic, it suggests that Maltese is a less concrete and a more conceptual language than formerly assumed. It therefore appears that contemporary Maltese, under the influence of its two main language sources, especially of recent times, is developing more abstract means of expression than it was previously able to impart.

Table 4: Maltese Word Category Classes – Percentages

Word Class     Percentage
Verbs          43%
Adverbs        3%
Nouns          37%
Adjectives     7%
Participles    7%
Pronouns       1%
Prepositions   1%
Conjunctions   1%

A comparison of Table 2 and Table 4 points to a consistent trend towards the Maltese linguistic structure relying more heavily upon verbs and nouns than upon other word classes. The shift from a purely root-based morphology to a more diversified form with the addition of concatenation is perhaps the most significant evolutionary device Maltese has adopted in recent times.

Notes

1 Including, amongst others, works by Arnold Cassola, Girolamo Caracausi, Adalgisa De Simone, Alberto Varvaro, Stanley Fiorini and Godfrey Wettinger.

2 A prime example is Alexander Borg’s investigation of the imaala phenomenon in Maltese, whose erratic behaviour is still left without a definitive explanation.
3 “..., my hypothesis is that contemporary Maltese, containing a mixture of Arabic and Romance, is directly linked with the Siculo-Arabic and not with North African dialects as has been so far believed.” (Agius, 1996, p. 432)

4 There still lingers the remote possibility of linking Maltese to origins earlier than the current Siculo-Arabic claim. In this paper, the term Semitic includes, along with Arabic, the Berber element, owing to the long-standing association of the Berber language with Arabic, both in Sicily and, during much earlier times, along the North African littoral, with considerable interchange between the two lexicons. The term is also intentionally applied as an all-inclusive reference to any remote possible language influences from the wider Semitic language group. In a similar manner, the terms Romance and Italian, for the scope of this study, include medieval and modern Sicilian and Italian, with all their dialectal and regional variations, as well as French and Andalusian Spanish with its own Arabo-Berber influences.

5 The population of the Maltese Islands in 1240 consisted of the following family distribution: 836 Saracen (Muslim), 250 Christian and 33 Jewish families (Wettinger, 1968, p. 33).

6 According to the official report by the Apostolic Delegate Dusina in 1574, the Christian population of Malta, including the clergy, was lax in the extreme. Thus, at this relatively late date, one might be inclined to contemplate a religious belief and custom dominated by the numerically stronger Muslim presence.

7 Prior to the Knights, Malta’s affairs were handled from the occupying power’s overseas quarters.

8 Mainly Italian, Portuguese, Spanish and French. The German Knights do not appear to have exerted any influence either upon the Maltese language or the culture of the populace (Aquilina, 1976).

9 Prior to British rule, Maltese had not acquired the status of a literary language.
The main Romance linguistic influence made its entry mainly through the spoken idiom rather than through literary texts or formal learning methods. Hence the Romance lexical material entering Maltese was of the most basic type, enabling it to become moulded within Maltese Semitic morphology with relative ease. In contrast, during the British rule, the learning of Italian was formally imposed upon the populace through the educational system, as well as the general culture of the local ecclesiastical authorities, the administration, the law courts and the press.

10 Dormant notions of nationhood and nationalism were stimulated, resulting in the first formations of formal and popular political agitation with the resultant linguistic bent towards an Italianate mode of linguistic expression.

11 The early Romance element in Maltese became intrinsically integrated into the basic language structure, while the more modern and erudite forms of Italian, with increasingly less input from Sicilian and Southern Italian, tended to resist full assimilation.

12 Like fissuri (fissures) and jott (yacht).

13 In this case, word duplications due to a word having multiple word classes were not included, hence the sample size of 1,000 words.

References

Agius, D. (1990), ‘Il-Miklem Malti: A contribution to Arabic lexical dialectology’, British Society of Middle Eastern Studies Bulletin, 17:2.
Agius, D. (1993), ‘Reconstructing the Medieval Arabic of Sicily’, Languages of the Mediterranean. Msida: University of Malta.
Agius, D. (1996), ‘Siculo Arabic’, Kegan Paul International, 12.
Aquilina, J. (1959), The Structure of Maltese. Valletta: University of Malta.
Aquilina, J. (1979), Maltese-Arabic Comparative Grammar. Libya: Socialist People’s Libyan Arab Jamahiriya Press.
Aquilina, J. (1988), ‘Criteri di etimologia siculo-maltese’, Malta e Sicilia: Contiguita e Continuita Linguistica e Culturale. Catania: Gruppo Linguistico Catanese.
Aquilina, G. (1988), ‘Il lessico agricolo e meteorologico nel maltese e le sue fonti arabe e siciliane’, Journal of Maltese Studies. Malta: University of Malta. 17-18.
Bonanno, A. (1988), ‘Contiguita e continuita culturale e linguistica fra Sicilia e Malta in eta prearaba’, Malta e Sicilia: Contiguita e Continuita Linguistica e Culturale. Catania: Gruppo Linguistico Catanese.
Borg, A. (1978), A historical and comparative phonology and morphology of Maltese. M.A. Thesis, Msida: University of Malta.
Borg, A. (1994), ‘On some Mediterranean influences on the lexicon of Maltese’, Blaustein Institute for Desert Research and Ben Gurion. Israel: University of Beer Sheeva.
Brincat, G. (1994), ‘Gli albori della lingua maltese: il problema del sostrato alla luce delle notizie storiche di al-Himyari sul periodo arabo a Malta (870-1054)’, Languages of the Mediterranean. Msida: University of Malta.
Brincat, J. (1995), ‘Malta 870-1054: Al Himyari’s Account and its Linguistic Implications’. Msida: University of Malta.
Cassola, A. (1992), The Bibliotecha Vallicelliana Regole per la Lingua Maltese. Egypt: Said International.
Cassola, A. (1996), Il Mezzo Vocabolario Maltese-Italiano del ’700. Egypt: Said International.
Cassar, C. (2000), Society, Culture and Identity in Early Modern Malta. Msida: Mireva Publications.
Friggieri, O. (1979), Storja tal-Letteratura Maltija. Malta: Klabb Kotba Maltin.
Fuccio, G. (1933), ‘Il Conflitto Anglo-Maltese’, Quaderni dell’Istituto Nazionale Fascista di Cultura, Treves-Treccani-Tumminelli, 3:8.
Holes, C. (1995), Modern Arabic: Structures, Functions and Varieties. London: Longman Linguistics Library.
Hull, G. (1993), The Malta Language Question. Egypt: Said International.
Mifsud, M. (1995), Loan Verbs - A Descriptive and Comparative Study. Leiden: Brill.
Schweiger, F. (1994), ‘To what extent is Maltese a Semitic Language?’, Languages of the Mediterranean. Malta: University of Malta.
Rosner, M., R. Fabri, J. Caruana, M. Lougraïeb, M. Montebello, D. Galea and G. Mangion. (1999), Maltilex Project. Malta: University of Malta.
Rosner, M., R. Fabri and J. Caruana. (2000), ‘Maltilex: A Computational Lexicon for Maltese’. Msida: University of Malta.
Wettinger, G. (1973), ‘Arabo-Berber Influences in Maltese: Onomastic evidence’, Proceedings of the First Congress on Mediterranean Studies of Arabo-Berber Influence. Msida: University of Malta.
Wettinger, G. (1979), ‘Late Medieval Judaeo-Arabic Poetry in Vatican MS411: Links with Maltese and Sicilian Arabic’, Journal of Maltese Studies. Msida: University of Malta. 13.
Wettinger, G. and M. Fsadni. (1968), Peter Caxaro’s Cantilena: A Poem in Medieval Maltese. Valletta: University of Malta.

Discovering regularities in non-native speech

Julie Carson-Berndsen¹, Ulrike Gut² and Robert Kelly¹

¹ University College Dublin
² University of Bielefeld

Abstract

This paper presents ongoing collaborative research which focuses on the application of computational linguistic techniques to the analysis of a corpus of native and non-native speech. The aim of this research is to use computational tools for modelling phonological acquisition and representation to identify regularities and sub-regularities between different speaker groups. The corpus is being collected and annotated at different levels as part of ongoing research into the acquisition of prosody by non-native speakers at the University of Bielefeld. The computational tools have been designed and implemented at University College Dublin as part of a development environment for modelling, testing and evaluating phonotactic descriptions of lesser-studied languages.

1. Introduction

It is a well-known and easily observable fact that non-native speakers sometimes produce syllables in their speech which violate the phonological rules of the foreign or target language. This can have various causes: the simplest is that a speech sound is produced which does not exist in the language being learned.
Alternatively, a sequence of sounds may be produced which is not permissible in the target language. This constitutes a violation of the phonotactic rules of the language. Many explanations have been put forward for the occurrence of these types of errors, the most popular being the claim that speakers transfer phonological rules of their native language to their productions in the target language. The majority of studies on the acquisition of phonotactic rules are based on small numbers of participants, which reflects the time-consuming nature of a manual analysis of this kind of data.

There are two specific aspects which motivate the work presented here. The first, a more computational linguistic motivation, is primarily concerned with the acquisition and evaluation of phonological structure that can be usefully employed in speech technology. The second, a more theoretical linguistic motivation, is concerned with the application of the comparative methodology in the context of the phonological analysis of non-native speech. Each of these motivations is addressed in turn in sections 2 and 3. Section 4 discusses a specific representation of phonotactic constraints and section 5 presents a set of finite state tools which are used to learn regularities from a phonological corpus. In section 6, a particular corpus containing data from non-native speakers of German with two different native languages, Italian and Polish, is used for a study of phonotactic errors found in non-native speech. Section 7 concludes with a discussion of future work.

2. Ubiquitous Acquisition of Phonotactic Constraints

Ubiquitous language technology concerns the development of language technologies for different purposes on different platforms so that they can be made available to everybody at all times rather than to a select group for specific purposes.
Clearly this is a long-term goal which involves a rethinking of current approaches to speech and language technologies, combined with an enhancement based on information of varying levels of granularity, paving the way for the development of robust multilingual applications which are easily scalable. One immediate prerequisite to this is the provision of a methodology which can be applied to numerous languages in order to accumulate descriptions which can be reused with varying speech technologies.

One particular technology to which this approach is being applied is based on the Time Map Model (Carson-Berndsen, 1998). This model defines the constraints on permissible combinations of sounds in a language in terms of a phonotactic automaton. The advantage of this approach is that each sound in the language is described with respect to the structural context (both preceding and following) in which it can occur (see section 4). The phonotactic automaton thus models all possible syllables of a language and in this way caters for native speaker intuitions about the well-formedness of phonological representations.

This paper presents an innovative methodology for automatic analysis of the phonotactics of spoken language based on a suite of finite state tools which have been developed primarily for use in multilingual ubiquitous speech technology. We are primarily concerned with the phonotactic level of description, which is employed in a computational linguistic approach to speech recognition and synthesis. We present an XML-based representation of various types of information defined with respect to the phonotactic context. This representation can be learned automatically from a phonemically labelled data set and can be processed to provide analyses of the phonological regularities which have been found. We demonstrate how this methodology can be applied to the task of phonotactic analysis of non-native speech.

3. Phonotactics in Non-Native Speech

The term “phonotactics” refers to language-specific rules and constraints for how sounds can be combined in a syllable. A German syllable (σ) consists of three parts: an onset, a nucleus, and a coda (e.g. Wiese 1996). The nucleus constitutes the centre of all syllables and consists of one or two vowels (V). The onset comprises all prevocalic consonants (C) and can be filled with between zero and four consonants. All postvocalic consonants form the coda. In German, a sequence of up to five consonants can occur in this position. Figure 1 illustrates this with the word springt ‘jumps’. It has the three consonants [ʃpʁ] in the onset position, the vowel [ɪ] as the nucleus and the two consonants [ŋt] in the coda position.

Figure 1: The syllable structure of the German word springt.

Two kinds of syllable types can be distinguished in German (Carson-Berndsen, 1998): non-reduced and reduced syllables. Non-reduced syllables have between zero and three consonants in the onset position, a short or a long full vowel (or diphthong) and up to four consonants in the coda position. Reduced syllables contain an optional single initial consonant, a weak vowel /ə/ and an optional single final consonant. Reduced syllables are never accented.

The occurrence and the ordering of consonants and consonant clusters in onset and coda are constrained by the phonotactics. An example of an occurrence constraint in German is that the phoneme /ŋ/ is not permitted in the onset position and that voiced stops and fricatives such as /d/ and /v/ are not permitted in the final consonant position of the coda. Ordering constraints apply to the sequence of consonants within either onset or coda. The consonant cluster /lp/, for example, is not permissible in the onset, but can occur in the coda position, as in the word Kalb [kalp] (see Kohler, 1995).
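The occurrence and ordering constraints just mentioned can be expressed as simple checks over a syllable that has already been split into onset and coda. The following Python sketch is purely illustrative and is not the phonotactic automaton used in this paper: the function name is hypothetical, the set of voiced obstruents is an assumption (the text names only /d/ and /v/), and the velar nasal is written as ŋ.

```python
# Illustrative constraint sets; only /d/, /v/, the velar nasal and /lp/ come from the text.
VOICED_OBSTRUENTS = {"b", "d", "g", "v", "z"}  # assumption: extends "such as /d/ and /v/"
BANNED_IN_ONSET = {"ŋ"}                        # occurrence constraint on the onset
BANNED_ONSET_CLUSTERS = {("l", "p")}           # ordering constraint on the onset


def violations(onset, coda):
    """Return a list of constraint violations for one syllable (lists of phonemes)."""
    problems = []
    for phoneme in onset:
        if phoneme in BANNED_IN_ONSET:
            problems.append(f"/{phoneme}/ is not permitted in the onset")
    for first, second in zip(onset, onset[1:]):
        if (first, second) in BANNED_ONSET_CLUSTERS:
            problems.append(f"/{first}{second}/ is not a permissible onset cluster")
    if coda and coda[-1] in VOICED_OBSTRUENTS:
        problems.append(f"/{coda[-1]}/ is not permitted in coda-final position")
    return problems


kalb = violations(onset=["k"], coda=["l", "p"])               # Kalb [kalp]
springt = violations(onset=["ʃ", "p", "ʁ"], coda=["ŋ", "t"])  # springt
bad_onset = violations(onset=["l", "p"], coda=[])             # /lp/ in the onset
bad_both = violations(onset=["ŋ"], coda=["d"])                # two violations
```

A flat list of checks like this does not scale to a full description of a language, which is one reason for representing the constraints as a finite state automaton instead.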
It has often been observed that learners of a foreign language produce “illegal” consonant clusters in both the onset and the coda position. These differences between native and non-native speakers of a language are systematic and are therefore assumed to form part of the non-native speakers’ interlanguage, i.e. their current representation of the grammar of the target language. Several reasons for systematic phonotactic errors have been proposed: Carlisle (2002) claims that some onset consonant clusters in words are more marked (i.e. occur less frequently in all languages of the world) than others and that foreign language learners always acquire the less marked onsets before the marked ones. Similarly, Eckman (1991) proposed that the reduction of English final consonant clusters by native speakers of Cantonese, Japanese and Korean follows universal principles, that is, they are reduced from more marked to less marked forms. These approaches assume that the non-native speakers’ interlanguage is governed by universal principles.

Other researchers find influence of native language phonotactic rules in the productions of non-native syllables. Broselow (1984) examined the consonant cluster simplifications in English produced by Arabic native speakers and concluded that they directly reflect the phonotactic rules of their native language. The underlying assumption is that the interlanguage of non-native speakers contains phonological rules of their native language.

Due to the time-consuming character of phonetic analyses, many of these studies are based on a small number of speakers only. However, in order to test hypotheses about the nature of interlanguage structure and the cause of systematic errors, large-scale studies are necessary (cf. Carlisle, 2002).
The object of this paper is to present tools for the automatic analysis of non-native speakers’ production of syllable types which allow the processing of large data resources and suggest a model for interlanguage representations.

4. Representing Phonotactic Constraints

The underlying assumption in this paper is that phonotactic constraints for any language can be represented in terms of finite state automata which define acceptable sequences of sounds within a syllable or word domain. A subsection of a phonotactic automaton describing CC-clusters in English syllable onsets is depicted in Figure 2. The complete phonotactic automaton for English syllables allows a distinction to be made between well-formed and illegal structures: although the word blick does not exist in English, the sequence of sounds is permissible and would be recognised as well-formed by a native speaker of the language, whereas the form bnick will always be rejected by native speakers of English as a possible word.

A phonotactic automaton is language-specific and can be developed for each individual language. A multilingual time map (Carson-Berndsen, 2002) extends the functionality of a phonotactic automaton by combining language-specific information at various levels of granularity represented as a multilevel finite state transducer. The advantage of this representation, as with the phonotactic finite state automaton, is that it is declarative, bi-directional, efficient to process, and can be easily learned. The multilevel finite state transducer can be viewed as an extension of the phonotactic automaton to include (at least) the following levels: graphemes, phonemes, allophones, canonical form, features, constraints on overlap relations, average duration, frequency and probability. Each arc specifies information on all of these levels (although some of this information may not be available in all cases; however, the transducer can be readily updated at any time).
This representation serves as the basis for the learning, analysis and comparison tools which are presented in the next section. For the specific case study presented below, only the phoneme and canonical form levels are relevant to the discussion.

Figure 2: Subsection of phonotactic automaton depicting CC- onset in English.

A multilingual time map is represented in XML with a visualisation in terms of a directed graph. A subsection of a German multilingual time map depicting two consonant clusters in onset position is visualised in Figure 3. The full information is shown for only one arc.

Figure 3: CC- onset in German.

5. Finite State Tools

This section presents a suite of finite state tools which have been developed for use in computational phonology applications motivated by the developments in linear and nonlinear finite state phonology (e.g. Kaplan & Kay, 1994; Bird & Ellison, 1994). The suite of tools is centred around two main programs, GTI and PAL, which are described firstly in terms of their general functionality. At the end of this section the specific ways in which these programs underpin the finite state tools for phonotactic analysis are summarised.

5.1 The Generic Transducer Interpreter

The Generic Transducer Interpreter, GTI, is a program designed to read the structure of a finite state transducer (FST) and, using this structure, to test a set of sequences of tokens for acceptance by that FST. FST structures are stored in XML marked-up form. An FST consists of a set of states and state transitions. A single start state is defined and a subset of the states is designated as final states. The state transitions of an FST may have any number of tapes defining different transductions between tapes. A sequence of tokens is accepted by an FST if a path can be traced from the start state of the FST to a final state of the FST for that sequence.
Starting with the initial state as the current state, a path is traced by examining each successive token in the sequence to determine if a state transition can be applied from the current state. A state transition can be applied at a given state for a given token if there is a state transition in the state transition set having a source state matching the current state and transition symbols matching the current token. Test sequences are currently stored in text files in which sequences must be specified in a particular format.

For GTI to test a set of sequences it must have specific tapes of state transitions in the FST defined as input tapes. Any number of input tapes can be specified (up to the number of tapes in the state transitions). Input tapes are those tapes of state transitions against which the tokens of sequences are matched in order to determine if a state transition can be applied. Thus, each token of a test sequence is an item of input for the FST. If L tapes of state transitions have been specified as input tapes then each token in a test sequence must have L tapes also. If GTI detects an incompatible number of tapes in any token of any sequence in the test file then an error is flagged.

It is also possible to specify which tapes of state transitions are to be treated as output tapes. Output tapes are those tapes for which the result of a transduction is required. When a sequence is accepted on the input tapes of a multi-tape FST there will be associated outputs, namely the concatenation of the sequences of symbols present on non-input tapes. It may be necessary to examine the output produced by any one of these output tapes. Again, any number of output tapes can be specified (up to the number of tapes in the state transitions).

The names of both the FST file and the test sequence file together with input and output tapes are passed to GTI via the command line.
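The acceptance procedure and the role of input and output tapes can be illustrated with a small Python sketch. It is not GTI itself and does not read GTI's XML or test file formats: the dictionary layout, the function name `run_fst` and the two-tape toy transducer (a phoneme tape and a grapheme tape for the single form blick) are assumptions made for the example.

```python
def run_fst(fst, sequence, input_tapes, output_tapes):
    """Return, for every accepting path, one tuple of symbols per output tape."""
    for token in sequence:
        if len(token) != len(input_tapes):
            raise ValueError("token has an incompatible number of tapes")
    # Each partial path is (current state, outputs collected so far per output tape).
    paths = [(fst["start"], tuple(() for _ in output_tapes))]
    for token in sequence:
        next_paths = []
        for state, outputs in paths:
            for source, symbols, target in fst["transitions"]:
                matches = all(symbols[tape] == token[i]
                              for i, tape in enumerate(input_tapes))
                if source == state and matches:
                    extended = tuple(out + (symbols[tape],)
                                     for out, tape in zip(outputs, output_tapes))
                    next_paths.append((target, extended))
        paths = next_paths  # a nondeterministic FST may keep several paths alive
    return [outputs for state, outputs in paths if state in fst["finals"]]


# Hypothetical toy transducer: tape 0 holds phonemes, tape 1 holds graphemes.
TOY_FST = {
    "start": 0,
    "finals": {4},
    "transitions": [
        (0, ("b", "b"), 1),
        (1, ("l", "l"), 2),
        (2, ("ɪ", "i"), 3),
        (3, ("k", "ck"), 4),
    ],
}

accepted = run_fst(TOY_FST, [("b",), ("l",), ("ɪ",), ("k",)],
                   input_tapes=[0], output_tapes=[1])
rejected = run_fst(TOY_FST, [("b",), ("n",), ("ɪ",), ("k",)],
                   input_tapes=[0], output_tapes=[1])
```

An empty result means the sequence was rejected; a non-empty result holds one output per accepting path, so a nondeterministic transducer can return several.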
If the FST file is in the correct format and free of errors, GTI will create the FST as specified. As GTI tests each sequence, it reports whether or not the sequence was accepted when applied to the FST on the specified input tapes. If output tapes are specified, the outputs are also displayed for each output tape. Note that if the FST is nondeterministic, more than one acceptance path may be traced. In this case, the outputs for all successful acceptance paths are reported. It is also possible to output the actual trace(s) of state transitions for each accepted sequence.

5.2 The Phonotactic Automaton Learner

The Phonotactic Automaton Learner, PAL, is a program that takes a set of training sequences of symbols and learns the structure of a stochastic finite state transducer (SFST) that accepts the training sequences. The training sequences to be used by PAL are stored in a text file in which sequences must be specified in a particular format. The training sequences may have multiple tapes; in this case all tokens of all sequences must have the same number of tapes specified or an error will be flagged. Once learned, the SFST is represented in XML marked-up form.

The PAL program uses the ALERGIA machine induction algorithm (Carrasco & Oncina, 1999) to learn the structure of the required SFST. The algorithm works in two stages. First, a prefix-tree SFST is built from the specified training sequences. A prefix-tree SFST has a single start state with a single path from the start state to a final state for each of the training sequences. For each token in each training sequence there is a state transition between states along the unique path for that sequence. If two training sequences share a prefix, then there is a single path in the prefix-tree SFST for that prefix. The path diverges into distinct state transitions for the distinct suffixes after the final state transition in the common prefix.
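The first stage, building the prefix tree, can be sketched as follows. This is a minimal single-tape illustration and not PAL's implementation; the function name `build_prefix_tree`, the integer state numbering and the toy training sequences are assumptions made for the example. Each transition carries a count of how many training sequences traverse it.

```python
from collections import defaultdict


def build_prefix_tree(sequences):
    """Build a prefix tree with traversal counts from training sequences.

    Returns (transitions, counts, ends):
      transitions maps (state, symbol) -> next state, with 0 as start state;
      counts maps (state, symbol) -> number of sequences using that transition;
      ends maps state -> number of sequences that end in that state.
    """
    transitions = {}
    counts = defaultdict(int)
    ends = defaultdict(int)
    next_state = 1
    for seq in sequences:
        state = 0
        for symbol in seq:
            key = (state, symbol)
            if key not in transitions:
                # New branch: sequences sharing a prefix reuse earlier states.
                transitions[key] = next_state
                next_state += 1
            counts[key] += 1
            state = transitions[key]
        ends[state] += 1
    return transitions, counts, ends


# Invented training data: three syllables, two of them identical.
training = [("S", "t", "a"), ("S", "p", "a"), ("S", "t", "a")]
transitions, counts, ends = build_prefix_tree(training)
```

In this toy tree the shared prefix `S` is a single transition with count 3, after which the path splits into a `t` branch (count 2) and a `p` branch (count 1).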
Each state transition in the prefix-tree SFST has a frequency of traversal dependent on the number of training sequences that share that state transition. The frequencies of state transitions are then used to identify the canonical SFST based on state merging. The state merging process is the second stage of the ALERGIA algorithm. Two states are merged into a single canonical state if the language generated from that point on is statistically identical (i.e. for each continuation path emanating from the first state there is a matching continuation path emanating from the second, and the two paths have statistically identical frequencies). If two states are merged, then all state transitions in the prefix-tree SFST that refer to either of the two merged states now refer to the newly created merged state. By comparing each state with every other state in the prefix-tree SFST, the canonical SFST is identified.

The ALERGIA algorithm has an associated confidence level. This confidence determines how rigorous the learning of SFSTs is. The lower the confidence, the less likely it will be that the learned SFST is absolutely correct, that is, the less likely it will be that the learned SFST will accept only the training sequences. There is a default confidence which has been found by experimentation to be sufficient for effective learning; however, an alternative confidence can be specified. Through a command line interface, PAL accepts the name of a training sequence file, a destination to which the learned SFST is written, and an optional confidence.

In summary, GTI and PAL are used specifically in the context of the tools for phonotactic analysis of non-native speech as follows:

The Learning Tool uses PAL to learn a deterministic finite state automaton from a phonemically labelled data set. This automaton defines the phoneme level of the multilingual time map and can be output as XML.
A canonicalisation step allows each phoneme on each arc in the multilingual time map to be supplemented by its canonical form in terms of V or C. The final output is thus a two-level transducer.

The Analysis Tool uses GTI to transduce from one level to another (e.g. phoneme to canonical form or vice versa).

The Comparison Tool uses GTI to partition the data into accepted and rejected forms. An analysis summary shows how many forms were accepted or rejected.

6. The study of phonotactic errors in non-native speech

This section is concerned with the application of the computational linguistic tools presented in section 5 to the corpus-based analysis of the phonotactics of non-native speech. The corpus used in this study is being collected and annotated at different levels as part of ongoing research into the acquisition of prosody by non-native speakers at the University of Bielefeld (see Milde & Gut, 2002).

6.1 Participants

The data are taken from the LeaP corpus (see note 3), which consists of prosodically annotated non-native speech of more than 70 speakers. For the study, three data sets were used, consisting of speech produced by German, Italian and Polish native speakers. The German natives are all speakers of Hochdeutsch (Standard High German). The Italian native speakers were between 21 and 31 years old when they were recorded and had been living in Germany for between two months and five years. They all studied German at school or at university level for up to six years prior to their arrival in Germany. The Polish speakers were between 22 and 29 years old at the time of recording and had been living in Germany for a period of between eight months and six years. They all have a university degree in German from their home country, where they studied German for between four and six years. All speakers were intermediate to advanced speakers of German and had no active knowledge of German phonotactic rules.
6.2 The Data

Recordings consisted of three parts. First, a short interview (approximately ten minutes) was conducted with the non-native German speakers, in which various questions were asked about their language learning history, such as age at first contact with German and length of formal instruction. Second, the participants were asked to read out a short story. Third, they re-told the story in their own words and without reference to the written text. All recordings were carried out in Bielefeld in either a sound-treated or a quiet room.

Only the readings were analysed for this study. The first set, German_Read, consists of three readings of a story by German native speakers. The second set, Italian_Read, consists of five readings (total of 624 syllables) of the same story by Italian speakers of German. The third set, Polish_Read, consists of five readings (total of 618 syllables) of the same story by Polish speakers of German.

The acoustic analysis of the data was carried out using ESPS/waves+ and Praat by one trained phonetician and four students with extensive training and experience in phonetic analysis. The data were annotated at a number of linguistic levels, such as the level of the intonational phrase, the word, the syllable, and the skeletal (CV) structure, and comprise annotations of the prosodic structures of intonation and pitch range. The syllable level was then selected from the prosodic annotation as the relevant input for the finite state tools. All syllables were transcribed phonetically in SAMPA. Transcription was fairly broad but included processes such as nasalisation, unreleased stops and aspiration. In the case of ambisyllabic consonants, half of the consonant was considered to belong to the preceding syllable and half to the subsequent one.
6.3 Corpus-based investigation of non-native speech errors

The finite state tools described in section 5 were applied to the data sets described above in order to identify regularities in the phonotactic errors produced by Italian and Polish speakers of German. In each case a specific phonotactic automaton for each data set was constructed using the Learning Tool, which is based on PAL. In order to be able to filter the phonotactic errors, it was necessary to identify which forms produced by the Italian and Polish speakers adhere to the phonotactics of German. For this, a complete manually constructed phonotactic automaton for German (from Carson-Berndsen, 1998) was used; this automaton is referred to as Phono_German_Comp.fsm below.

The procedure involved the application of the Learning Tool, the Analysis Tool and the Comparison Tool as follows. The Learning Tool was applied to the German_Read data set to generate a finite state representation of the phonotactic forms used in the read speech of the German native speakers. The output of this was termed Phono_German_Read.fsm. The phonemic forms contained in the Italian_Read data set were then compared against Phono_German_Read.fsm using the Comparison Tool, and the rejected forms were then compared against Phono_German_Comp.fsm. This ensured that all forms which corresponded to the phonotactics of native speakers of German (the target language) were filtered out and that any remaining forms in the data set corresponded to phonotactically ill-formed structures which had been produced by Italian speakers of German. The Learning Tool was once again applied to these forms to produce an automaton representation of the ill-formed structures, termed Italian_Rejected.fsm. The Analysis Tool was applied to this data in order to map from the phonemic representations to the canonical CV form.
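The two rounds of filtering and the mapping to canonical CV form can be sketched as follows. This is only an illustration under strong simplifying assumptions: the two German models are stood in for by plain sets of accepted syllables rather than finite state automata, the vowel inventory `VOWELS` is a toy subset, and all function names and data are hypothetical.

```python
from collections import defaultdict

# Assumption: a toy vowel inventory; a real SAMPA inventory is much larger.
VOWELS = {"a", "e", "i", "o", "u", "@"}


def filter_rejected(forms, *acceptors):
    """Keep only the forms that every acceptor rejects, applied in order."""
    remaining = list(forms)
    for accepts in acceptors:
        remaining = [form for form in remaining if not accepts(form)]
    return remaining


def canonical(form):
    """Map a sequence of phoneme symbols to its canonical C/V string."""
    return "".join("V" if phoneme in VOWELS else "C" for phoneme in form)


def group_by_canonical(forms):
    """Associate each canonical form with its phonemic realisations."""
    groups = defaultdict(list)
    for form in forms:
        groups[canonical(form)].append(form)
    return dict(groups)


# Stand-ins for the learned and the complete German models (invented data).
german_read = {("S", "t", "a")}
german_comp = {("S", "t", "a"), ("b", "l", "a")}

learner_forms = [("S", "t", "a"), ("b", "l", "a"), ("n", "d", "a")]
ill_formed = filter_rejected(
    learner_forms, german_read.__contains__, german_comp.__contains__
)
by_shape = group_by_canonical(ill_formed)
```

Only the form rejected by both models survives, and grouping by CV shape then shows which realisations share a canonical form.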
This mapping resulted in the data set Italian_Read_Canonical, which characterised the canonical forms of all the phonotactically ill-formed structures in the Italian_Read set. Finally, the Comparison Tool was reapplied to associate all phonemic realisations with a canonical form so that particular phonotactic anomalies and regularities with respect to the target language could be identified. The flowchart for the application of the tools in this task is depicted in Figure 4 with respect to the Italian data.

Figure 4: The application of the finite state tools to the Italian data.

An analogous procedure was followed for the analysis of the Polish_Read corpus. The results are presented in the next section. One point to be noted here is that, whilst the learning tool does, of course, generalise to some extent over these forms, there is no notion of complete coverage as defined in the data set. We are currently investigating projection techniques to cater for gaps caused by the lack of data; these would allow a distinction to be made between idiosyncratic gaps and systematic gaps. Such projection can be based on the notion of natural classes of sounds which function similarly in a particular phonotactic context.

6.4 Results

Table 1 lists the percentage and absolute number of syllables in Italian_Read and Polish_Read which were rejected after comparing them first to Phono_German_Read and then to Phono_German_Comp. Only syllable types were considered, that is, if a rejected syllable occurred more than once it was only counted as one rejection. Thus, after two rounds of application, a total of 54% of the syllables produced by the Italian non-native speakers and 48.7% of the syllables produced by the Polish non-native speakers were rejected.

Table 1: % of rejected syllables.
                                                         Italian             Polish
% of syllable tokens rejected by Phono_German_Read       67% (418 of 624)    59.5% (368 of 618)
% of syllable tokens rejected by Phono_German_Complete   80.6%               81.7%

The results of the application of the Learning Tool to the remaining syllables were then classified manually into
a) phonemic inventory errors (phonemes that do not occur in German were produced)
b) onset consonant cluster errors
c) coda consonant cluster errors

Table 2 summarizes the results for the Italian and Polish speakers. Distinct differences between the phonotactic violations produced by the Italian and the Polish speakers can be seen. On the whole, the Polish speakers produce fewer errors; different types of illegal initial consonant clusters occur in the two speaker groups, and the postvocalic r (which is produced as an a-schwa in German) constitutes the major problematic area for the Italian speakers.

Table 2: Error analysis of Italian and Polish speakers of German.

Italian
  Phoneme inventory errors:       6 consonants, 5 monophthongs, 16 diphthongs
  Onset consonant cluster errors: 10 types, predominantly [nd], [mb]
  Coda consonant cluster errors:  2 types, 14 final voiced consonants, 27 postvocalic r
Polish
  Phoneme inventory errors:       4 consonants, 3 monophthongs, 24 diphthongs
  Onset consonant cluster errors: 6 types, 2 initial consonants [x, N]
  Coda consonant cluster errors:  1 type, 11 final voiced consonants, 1 postvocalic r

7. Conclusion

In this paper we demonstrated how a suite of finite state tools can be applied to study the phonotactics of different varieties of spoken language. The results imply that research in second language acquisition can potentially benefit enormously from this methodology as it allows a rapid analysis of large corpora. While the corpus described in this study was relatively small, to perform a manual analysis of this data would have been a laborious task. Future work involves employing the tools to analyse phonotactic differences among other non-native speech groupings using the LeaP corpus.
Currently this corpus consists of 253 annotated recordings of between 2 and 30 minutes’ duration by 88 different speakers with 21 different native languages. Since the corpus is annotated at a number of linguistic levels as described above, inter-level analyses are possible. For the application of the finite state tools, each level of annotation is viewed as a tier, analogous to the representations of autosegmental phonology (Goldsmith, 1990). Analysis can take place either with respect to individual tiers or with respect to an associated set of tiers. In the latter case, one tier is chosen as the primary tier and the others are associated with it in terms of overlap and precedence relations between the units, as suggested in Carson-Berndsen (1998: 60). Using the computational linguistic tools, finite state automaton and finite state transducer representations of the tiers are extracted automatically from the annotated corpus. Regularities in the data are then identified either with respect to a single tier or with respect to an associated set of tiers. In addition to identifying phonotactic errors with respect to non-native speakers of one particular language, we are also currently applying this methodology to the investigation of the phonotactics of typologically different languages, in particular spoken Yoruba and Igbo.

Notes

1 This work was part funded by an Enterprise Ireland Research Innovation Fund grant (no. IF/2001/021) and an Enterprise Ireland International Collaboration grant (no. IC/2002/053).
2 The LeaP project is funded by the MSWF (Ministry for Education of North-Rhine Westphalia, Germany).
3 http://www.spectrum.uni-bielefeld.de/LeaP/

References

Bird, S. & T. M. Ellison (1994), One-level phonology: autosegmental representations and rules as finite state automata, Computational Linguistics 20: 55-90.
Broselow, E. (1984), An investigation of transfer in second language phonology.
International Review of Applied Linguistics 22 (4), 253-269.
Carlisle (2002), The Acquisition of Two and Three Member Onsets: Time III of a Longitudinal Study. Proceedings of New Sounds 2000, pp. 42-47.
Carrasco, R. C. & Oncina, J. (1999), Learning deterministic regular grammars from stochastic samples in polynomial time, ITA, Vol. 33, No. 1, 1-19.
Carson-Berndsen, J. (2002), Multilingual time maps: portable phonotactic models for speech technology. In Proceedings of the LREC 2002 workshop on Portability Issues in Human Language Technology.
Carson-Berndsen, J. (1998), Time map phonology. Dordrecht: Kluwer.
Eckman, F. (1991), The structural conformity hypothesis and the acquisition of consonant clusters in the interlanguage of ESL learners. Studies in Second Language Acquisition 13, 23-42.
Goldsmith, J. (1990), Autosegmental and Metrical Phonology, Cambridge, Mass: Basil Blackwell.
Kaplan, R. & M. Kay (1994), Regular models of phonological rule systems, Computational Linguistics 20: 331-378.
Kohler, K. (1995), Phonetik des Deutschen. Berlin: Erich Schmidt.
Milde, J.-T. & Gut, U. (2002), A prosodic corpus of non-native speech. In: B. Bel & I. Marlien (eds.) Proceedings of the Speech Prosody 2002 conference, 11-13 April 2002. Aix-en-Provence: Laboratoire Parole et Langage, pp. 503-506.
Wiese, R. (1996), German phonology. Oxford: Clarendon.

Tracking lexical changes in the reference corpus of Slovene texts

Vojko Gorjanc
University of Ljubljana

Abstract

The text focuses on lexical borrowings from English introduced into Slovene in the last decade of the 20th century. Using the FIDA corpus, a reference corpus of Slovene, new lexical items can be tracked over the last decade: by means of corpus analysis, we can determine when a word entered the Slovene language and how it established itself in the language.
Corpus analysis reveals the great creativity of Slovene language speakers; in addition to loan words, original Slovene coinages occur almost invariably. Corpus data show a great deal of variability originating from the desire of the speakers of Slovene to coin new expressions, but after a few years, the data begin to reveal the prevailing variant.

1. Introduction

With the help of a corpus, we can track lexical changes quickly and reliably, and also observe the response of a selected language to new lexical items introduced into it from other languages, e.g. English, or some other language with which the selected language has direct contact; in the case of Slovene, these are Italian, German, Hungarian or Croatian. This paper focuses on lexical items introduced into Slovene in the last decade of the 20th century. The starting point for the comparison with the state in the corpus of the Slovene language is John Ayto’s list of English lexical items from the 1980s and 1990s (Ayto 1999); the subsequent corpus analyses focus on lexical items from the fields of the Internet and computer science. With corpus investigations, we will try to determine how the Slovene language reacts to lexical items introduced into Slovene from English. Using the corpus, we can track the new lexical items through the last decade, and observe their characteristics in the corpus.

2. The Corpus

The Corpus of the Slovene Language, called FIDA, is a reference corpus of Slovene compiled by a consortium of four project partners: University of Ljubljana (Faculty of Arts), Jozef Stefan Institute Ljubljana, DZS Publishing House and Amebis software company (http://www.fida.net). The corpus is composed of contemporary Slovene texts, the majority of which were published in the 1990s. The corpus contains just over 100 million words, encompassing a broad variety of language variants and registers. It is composed of written texts and texts originally produced as written-to-be-spoken.
The transcripts of Slovene parliamentary proceedings are the only spoken component of the corpus. The basic corpus characteristics according to taxonomic corpus parameters are (in %):

Medium                     spoken 1.97      electronic 0.03    written 98.00
Text type                  literary 5.94    technical 18.46    other 75.60
Linguistic proofreading    yes 63.92        no 3.13            unknown 32.95

The FIDA corpus is lemmatised and morpho-syntactically tagged, but all the tagging was done automatically without the possibility of disambiguation in cases where double or even multiple lemmas were possible. Since Slovene is a morphologically complex language, double or triple lemmas are frequent, which makes statistical data from the corpus unreliable to some extent. In the last few years, some significant steps have been taken to solve the problem, both by testing the existing language-independent tools and by developing new ones (Džeroski and Erjavec, 2000; Mladenić, 2002), but the situation is still far from ideal. Although less acute, the lemmatisation of non-lemmatised words is another problem waiting for the improvement of text processing tools for Slovene. The lemmatisation of the FIDA corpus was based on the lexicon developed by the software company involved in the project. Experience shows that in certain cases non-lemmatised words skew the results of statistical analysis, so all of this has to be taken into account when interpreting the corpus data (Gorjanc and Krek, 2001).

3. Methodology

Using a wordlist from the Slovene corpus, we will obtain information on the lexical items from Ayto's list relevant for the Slovene language. By means of corpus analysis, we will determine when a word occurs in the Slovene language and how it establishes itself in the language, and by means of statistical analysis, we will determine the possible collocations of the selected word and their changes from the first occurrence of the word until the end of the decade.
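Tracking when a variant is first attested and how its share develops over time amounts to counting corpus hits per year. The following is a minimal sketch of that bookkeeping, assuming that each corpus hit is available as a (year, variant) pair; the function names and the toy hits are invented for illustration and are not FIDA data.

```python
from collections import Counter, defaultdict


def yearly_shares(hits):
    """Return {year: {variant: percentage of that year's hits}}."""
    per_year = defaultdict(Counter)
    for year, variant in hits:
        per_year[year][variant] += 1
    shares = {}
    for year, counter in sorted(per_year.items()):
        total = sum(counter.values())
        shares[year] = {v: 100.0 * n / total for v, n in counter.items()}
    return shares


def first_attested(hits):
    """Return {variant: earliest year in which it occurs}."""
    first = {}
    for year, variant in hits:
        if variant not in first or year < first[variant]:
            first[variant] = year
    return first


# Invented hits for two competing variants of the same concept.
toy_hits = [
    (1994, "variant_a"), (1995, "variant_a"),
    (1996, "variant_a"), (1996, "variant_b"),
    (1997, "variant_b"), (1997, "variant_b"),
    (1997, "variant_b"), (1997, "variant_a"),
]
shares = yearly_shares(toy_hits)
first = first_attested(toy_hits)
```

Expressing the counts as percentages per year makes variants comparable even when the amount of corpus text differs from year to year.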
Since pairs of synonyms or strings often occur with new expressions, we will try to determine how they occur in the corpus and how they disappear. With the help of markers of semantic relations already identified for the Slovene language by corpus analysis, we will identify pairs of synonyms and strings within the corpus. We will focus on the distribution of synonyms within the corpus regarding the time of their occurrence, and consider when and why one of the synonyms becomes dominant in the language while the other variants disappear.

3.1 Extracting collocations

For extracting collocations, the MI3 value introduced by McEnery et al., with information on the probability of a word pair occurring together or separately, will be used (McEnery, Langé, Oakes and Véronis 1997). The MI3 value has turned out to be sound information for content words in Slovene. On the other hand, with this value it is hard to determine function words as part of collocations, in particular in the case of collocators of verbs and nouns with prepositional phrases. For instance, to detect prepositional words, raw statistics provide more valuable information. After the detection of, for example, a noun + prepositional word pair, the MI3 value for the whole pair is calculated to extract the string of collocators.

The comparison of results between T, MI and MI3 values for the Slovene corpus shows that MI3 is the most effective of the three. MI, introduced in Church and Hanks (1989), is far less effective, since the frequency of corpus elements is underestimated and a single co-occurrence of two elements in the corpus gives high scores which can diminish the importance of more frequent lexical units (Manning and Schütze, 1999). This fact is even more relevant in the case of the FIDA corpus, since specific forms of non-lemmatised words are attributed high MI scores.
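The difference between the two measures can be made concrete with a small calculation. The text does not spell out the formulas, so the sketch below assumes the commonly used definitions: MI = log2(N · f(x,y) / (f(x) · f(y))) and MI3 with the joint frequency cubed, where N is the corpus size in tokens. The function names and all frequencies are invented for illustration.

```python
import math


def mi(f_xy, f_x, f_y, n):
    """Mutual information score of a word pair (assumed standard definition)."""
    return math.log2(f_xy * n / (f_x * f_y))


def mi3(f_xy, f_x, f_y, n):
    """Cubed variant: the joint frequency is raised to the third power."""
    return math.log2(f_xy ** 3 * n / (f_x * f_y))


N = 1_000_000  # hypothetical corpus size in tokens

# A pair of two hapax forms that happen to co-occur once (e.g. a misspelling),
# given as (joint frequency, frequency of x, frequency of y).
rare = (1, 1, 1)
# A pair of frequent words that co-occur often.
frequent = (200, 1000, 500)

mi_rare, mi_frequent = mi(*rare, N), mi(*frequent, N)
mi3_rare, mi3_frequent = mi3(*rare, N), mi3(*frequent, N)
```

With these invented figures MI ranks the single co-occurrence above the frequent pair, whereas MI3 reverses the ranking, which is the effect the comparison above relies on.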
To some extent, MI3 neutralises the effects of the low frequency of a corpus element, which is why it gives better results for collocations.

The noun računalnik (Engl. computer) with collocators according to MI3 and MI values in the FIDA corpus (frame 3)

MI3                                     MI
oseben (Engl. personal)                 =uporablja
prenosen (Engl. portable)               =appleov
na (Engl. on)                           =deskpro
vaš (Engl. your)                        =skreširani
biti (Engl. to be)                      =pomečite
z (Engl. with)                          =nnnn
in (Engl. and)                          =gxi
v (Engl. in)                            =upsajo
poznavanje (Engl. knowing)              =85prenosni
za (Engl. for)                          =megatronski
delo (Engl. work)                       =macintosch
moj (Engl. my)                          =optiplex
uporabljati (Engl. to use)              =sprojektiram
zmogljiv (Engl. high capacity)          =blagajničarska
zagnati (Engl. to start)                =dlančni
internet (Engl. Internet)               =brskalnemu
ki (Engl. which)                        =vseuporabljajo
uporaba (Engl. usage)                   =pc486
=windows*                               izklopitev
delati (Engl. to work)                  =ignororajo

* = non-lemmatised

By using MI3 values, we obtain relevant results for collocations, while the highest MI values are generally assigned to non-lemmatised words, mostly spelling errors. From the point of view of non-lemmatised words the results are interesting, since they reveal some of the problems the speakers of Slovene have with forming new words, e.g. the adjective derived from the name Apple; in Slovene there is an attempt to use the suffix -ov, appleov.

3.2 Identifying markers of semantic relations

Semantically related words (synonyms, hypernyms and hyponyms, abbreviations) often collocate or appear in similar contexts, and it is usually possible to identify the domain of a word on the basis of its textual environment. When we want to explain the relations between concepts within the reality portrayed, we often use explicit linguistic structures or phrases, such as X is defined as Y, X is an instance of Y, There are several types of X, for example A, B, C etc.
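Explicit structures of this kind lend themselves to simple pattern matching. The sketch below is a deliberately crude illustration using two of the English patterns just mentioned; the regular expressions, the `extract_relations` function and the example sentences are assumptions made for this example, and a realistic extractor would need proper noun phrase detection rather than taking the whole left-hand context.

```python
import re

# Hypothetical marker patterns; X and Y are captured as plain word sequences.
PATTERNS = {
    "definition": re.compile(r"(?P<x>\w[\w ]*?) is defined as (?P<y>\w[\w ]*)"),
    "instance": re.compile(r"(?P<x>\w[\w ]*?) is an instance of (?P<y>\w[\w ]*)"),
}


def extract_relations(sentence):
    """Return (relation label, X, Y) triples found in one sentence."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(sentence):
            x = match.group("x").strip()
            y = match.group("y").strip()
            found.append((label, x, y))
    return found


relations_a = extract_relations("A modem is an instance of hardware.")
relations_b = extract_relations("A corpus is defined as a collection of texts")
relations_c = extract_relations("No marker occurs in this sentence.")
```

Each marker contributes one labelled relation, so further markers can be added by extending the `PATTERNS` table.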
If we identify, for example, the pattern “A, also known as B” as indicating complete or near synonymy, we can extract the noun phrases linked by the pattern as pairs of synonyms. Similar methods are proposed by several authors in the field of terminology, either for the purpose of an automatic construction of knowledge databases (Bowden et al., 1996), conceptual sampling for terminography (Meyer et al., 1999) or simply a search for synonyms in a corpus (Pearson, 1998). For Slovene the most frequent markers were identified by analysing the FIDA corpus. For synonymy, these markers are ali (Engl. or), ali tudi (Engl. also), imenujemo tudi (Engl. we also refer to it as), imenovan* tudi, (sinonim _) (Engl. also referred to as), je sinonim za (Engl. is a synonym for), znan* tudi kot (Engl. also known as), znan* tudi pod imenom (Engl. also familiar as), z drugim imenom (Engl. also called), ... (Vintar and Gorjanc 2000). For the purpose of this paper, these findings were used to extract pairs or strings of synonyms of selected words.

Internet ali medmrežje, kot so ga izvirno poslovenili /.../, že dolgo ni več neznanka povprečnemu Slovencu.
Pair of synonyms: internet, medmrežje

4. Corpus analysis

In this analysis, we will focus on some key lexical elements from the field of the Internet which have undergone the process of being accepted into the Slovene language. Keeping the loan words unchanged, the most passive response of recipient-language users, is confirmed as temporary and leaves open the possibility of coining new terms; it turns out that if the new terms are formed in a way which is acceptable to users, no problems arise in introducing the Slovene variant. Let us illustrate this using the example of the term World Wide Web and its Slovene variant, svetovni splet (the graph below is in %).
Figure 1: Occurrences of World Wide Web and its Slovene variant, svetovni splet, in the FIDA corpus between 1994 and 1999 (y-axis: 0 to 100%; x-axis: 1994 to 1999; series: world wide web, svetovni splet)

In the two years after its first appearance, only the loan word occurs in the corpus, but when the Slovene variant appears, it immediately becomes a successful rival and the use of the loan word gradually decreases.

In texts, the dominance of the Slovene synonym over the loan word is even more obvious in the case of another key word from the field of the Internet, i.e. home page. After eliminating corpus noise related to proper names of pages, it turns out that the Slovene term has come to dominate completely (91.8% of corpus occurrences). In addition to the calque domača stran (Engl. home page), there is also a new term predstavitvena stran (Engl. web presentation page) coined in Slovene, but it seems that the calque from English is more acceptable.

The fate of the following term from the field of IT is quite different. The English term screen saver entered the Slovene language in the 1990s.

Figure 2: Occurrences of screen saver and its Slovene variants, varčevalnik zaslona, ohranjevalec zaslona and ohranjevalnik zaslona, in the FIDA corpus (y-axis: 0 to 100; 1 screen saver; 2 varčevalnik zaslona; 3 ohranjevalnik zaslona; 4 ohranjevalec zaslona)

After the loan word, the calque varčevalnik zaslona occurs next, but a later Slovene term formed by using the attribute ohranjeva- (Engl. keep) turns out to be more acceptable. At first, there are two variants, but later the variant with the suffix -ik dominates. While it is true that in Slovene a new word varčevalnik (Engl. saver) is derived from the verb varčevati (Engl. save), it seems that the semantic link is not strong enough for the speakers. In the corpus, the verb varčevati (Engl. save) tends to collocate with words such as banka (Engl. bank), denar (Engl. money); začeti (Engl. start), splačati (Engl.
worth). It thus covers a semantic field which the speakers do not associate with new terminology from the field of IT.

The term Internet itself has become fully integrated into the Slovene language; this is partly due to its everyday use. As a noun, it occurs as a premodifier in noun phrases: e.g., internet storitev (Engl. Internet service), internet naslov (Engl. URL), internet povezava (Engl. Internet connection), internet ponudnik (Engl. Internet service provider), internet stran (Engl. Web page), internet račun (Engl. Internet account), internet protokol (Engl. Internet protocol). In the Slovene language, the new type of noun phrase with a noun functioning as a premodifier is becoming increasingly common. This type of noun phrase is formed under the influence of the English language; in Slovene, the premodifier in a noun phrase had to be an adjective. The noun as the premodifier appears frequently, even though there is also the possibility of forming the noun phrase with an adjective as the premodifier.

The noun Internet happens to be extremely prolific in terms of word formation, since it forms adjectives using the suffixes -ni, -ski and -ov: internetni, internetski, internetovski, internetov in Adjective+Noun combinations; the adverb internetsko using the suffix -o in Adverb+Verb combinations; and newly derived nouns using the suffixes -ar and -ec: internetar, internetovec, as well as new compound nouns, e.g. internetdžanki (Engl. Internet junkie). In adjectives, the relatively high variability indicates that the newly formed words have not yet been fully accepted. The collocations of an individual adjective show that the collocators of the adjectives internetni, internetski and internetovski overlap /service, page, search engine, business, shop, bookseller, service provider.../, so that it is impossible to determine the specific dependent links between words.
Therefore, it seems that the use is very optional and different variants of the adjective are possible with the same headword. In the case of the adjective internetov, which is the least common of the adjectives listed above, the link to the headword is completely dispersed; this indicates that the suffix variant -ov is not integrated and is consequently inappropriate for the classifying character of the adjective, generally expressed by the adjectival suffixes -ni and -ski in Slovene. The frequent use of the classifying adjective with the suffix -ni shows a prevalence of this variant, and its only real rival is the classifying adjective with the suffix -ski.

Figure 3: Occurrences of adjectives derived from Internet using the Slovene suffixes -ni, -ski and -ov in the FIDA corpus (y-axis: 0 to 60; 1 internetski; 2 internetni; 3 internetov; 4 internetovski)

On the other hand, the Slovene term medmrežje, introduced when the term internet was already fully accepted, has very low corpus frequency (corpus occurrences medmrežje : internet = 2.2% : 97.8%) and in terms of word formation it is not productive. Despite the fact that the speakers of Slovene did not accept the Slovene term medmrežje, there are still constant but unsuccessful attempts by normativist linguists to impose medmrežje in place of internet.

In assigning terms to concepts, the Internet has stimulated the formation of two strings of newly formed terms, which seem to be growing in prolificacy with the development of the Internet, i.e., terms of the type e-mail and terms of the type kiber- and cyber-. Among the latter, a great variability in spelling can be observed in Slovene: the new terms can be spelled as two words, e.g. kiber prostor (Engl. cyberspace), cyber kavarna (Engl. cyber-café), or as one word, with or without a hyphen, e.g., kiber-kič (Engl. cyber-kitsch), cyber-kultura (Engl. cyber-culture) and kibersvet (Engl. cyberworld), cyberfolk (Engl. cyberfolk).
The spelling of these terms is currently an important point of dispute among Slovene grammarians, with the question of the influence of the English pattern being particularly prominent. New terms of the type e-Noun, e-Adjective have already been presented and show a very open series (Jakopin, 2001); here, let us consider the trends in Slovene with kiber- and cyber-. At first glance, the corpus results are extremely dispersed; the process is very productive, and both the loan element and the Slovene element are prolific. With the loan element, a tendency to write the new terms as two separate words can be observed; the one-word spelling is reserved for a limited number of complete loan words, e.g. cyberspace, cybersex, cybercash. However, as so many new terms are coined and cyber- generally occurs with a Slovene second element, the new term is generally spelled as two separate words, e.g., cyber otroški vrtec (Engl. cyber-kindergarten), cyber jasli (Engl. cybercrèche), cyber kavarna (Engl. cyber-café), cyber umetnost (Engl. cyber-art). Consequently, the two-word spelling is used with loan words as well, e.g. cyber café, cyber space. The pattern is quite open, as can be seen from the possibility of hyphenated spelling both for complete loan words and for combinations of a loan element and a Slovene element, e.g. cyber-space, cyberpunk; cyber-gostilničar (Engl. cyber-innkeeper), cyber-klobasica (Engl. cybersausage). The hyphenated spelling is particularly common with adjectives, e.g. cyber-totalitarni (Engl. cyber-totalitarian), cyber-kavbojski (Engl. cybercowboy). On the other hand, there is a tendency towards one-word spelling with the Slovene version kiber; the change in spelling thus merely enables the word-formation process in Slovene. The one-word spelling is also a consequence of the fact that the adjectives kibernetični and kibernetski (Engl. cybernetic) assume the attributive function.
The greater assimilation of kiber into the Slovene language is confirmed by the terms coined for the male and female representatives of the cyberworld: kibernetnica, kibernetničar and kibernetničarka; it also occurs in the root of the verb kiberseksati (Engl. to have cybersex). In this case, the almost unbelievable range of terms reveals a dynamic process of assigning terms to concepts in Slovene, while the corpus indicates that the new terms with kiber- prevail (over 60%). Usage also shows that the lexical elements kiber and cyber are extremely popular. It seems that speakers of Slovene accept the lexical element as semantically emptied, as something very fashionable at the moment. Thus, as we have seen, the collocations may be rather unusual.
5. Conclusion
In its observation of the process of accepting new lexical items into Slovene, the corpus analysis reveals the great creativity of Slovene speakers; in addition to loan words, original Slovene expressions occur almost invariably. Corpus data show a great deal of variability, linked above all to the desire for original expressions. However, after a few years, the data begin to reveal the prevailing variant. The question of which variant is eventually fully accepted, though, remains open. Full acceptance of a variant is conditioned by a series of linguistic and, even more so, non-linguistic factors. It is very important that new terms emerge spontaneously in the language community and are not introduced into the language by linguistic intervention. If speakers of Slovene even begin to suspect that a term is an attempt at linguistic intervention, they will probably reject it. For a variant to be fully accepted in Slovene today, it should have an appropriate semantic basis, so that speakers of Slovene can identify it as “sexy” and “cool”.
References
Ayto, J. (1999), 20th Century Words.
Oxford, Oxford University Press.
Bowden P., P. Halstead and T. Rose (1996), Extracting Conceptual Knowledge from Text Using Explicit Relation Markers, in: Proceedings of EKAW-96, Nottingham, pp. 147-162.
Church K. and P. Hanks (1989), Word association norms, mutual information and lexicography, in: Proceedings of the 27th Annual conference of the Association of Computational Linguistics, pp. 76-82.
Džeroski S. and T. Erjavec (2000), Learning to lemmatise Slovene words, in: J. Cussens and S. Džeroski (eds), Learning Language in Logic. Berlin, Springer, pp. 69-88.
Gorjanc V. and S. Krek (2001), A corpus-based dictionary database as the source for compiling Slovene-X dictionaries, in: Proceedings of the COMPLEX 2001 6th Conference on Computational Lexicography and Corpus Research, Birmingham, pp. 41-47.
Jakopin P. (2001), Words and nonwords as basic units of a newspaper text corpus, in: Proceedings of the COMPLEX 2001 6th Conference on Computational Lexicography and Corpus Research, Birmingham, pp. 49-65.
Manning C. and H. Schütze (1999), Foundations of Statistical Natural Language Processing. Cambridge MA, The MIT Press.
McEnery T., J. Langé, M. Oakes and J. Véronis (1997), The exploration of multilingual annotated corpora for term extraction, in: R. Garside, G. Leech and A. McEnery (eds.), Corpus Annotation. Linguistic Information from Computer Text Corpora. London, Longman.
Meyer I., K. Mackintosh, C. Barriere and T. Morgan (1999), Conceptual sampling for terminological corpus analysis, in: Sandrini (ed.), Proceedings of TKE ’99. Vienna, TermNet, pp. 256-267.
Mladenić D. (2002), Automatic word lemmatisation, in: T. Erjavec and J. Gros (eds.), Jezikovne tehnologije, Language Technologies. Ljubljana, Institut Jožef Stefan, pp. 153-159.
Pearson J. (1998), Terms in Context. Amsterdam, John Benjamins.
Vintar Š. and V. Gorjanc (2000), Identifying markers of semantic relations in Slovene.
http://www2.arnes.si/vinta/telri.rtf

Relating linguistic units to socio-contextual information in a spontaneous speech corpus of Spanish
José María Guirao
Universidad de Granada
Antonio Moreno Sandoval, Ana González Ledesma, Guillermo de la Madrid, Manuel Alcántara
Universidad Autónoma de Madrid
Abstract
This chapter shows the application of statistical tests to a corpus of spontaneous spoken Spanish. Our goal is to find representative differences between different parts of the corpus. To this end, we tagged n-grams in the corpus with features related to the speaker (age, gender, etc.) or the context (dialogue, monologue, media, etc.), and applied the log-likelihood test (Dunning, 1993) in order to find the most distinctive lexical or grammatical items for each specific socio-contextual feature. This chapter is divided into three sections. The first presents the characteristics of the spoken corpus. The second explains the computational tool. The third gives a first rough estimate of the results obtained, as well as possible applications of the model.
1. The Spanish corpus of the C-ORAL-ROM project
C-ORAL-ROM is a multi-lingual corpus of spontaneous speech for the four main Romance languages: French, Italian, Portuguese and Spanish (Cresti et al. 2002). The project is funded by the EU under the V Framework Programme (IST-200026228), and the consortium consists of 9 partners, co-ordinated by the University of Florence. The remarkable feature of C-ORAL-ROM is its spontaneity: texts have been recorded in their actual context and without any script. Each subcorpus is made up of 300,000 words with the same text distribution, to ensure comparability and sufficient register representation. The resource will be delivered in several formats: an orthographic transcription, an xml-tagged version, and the aligned audio source.
Partial linguistic annotation will be provided, as well as some programs to handle the resources and quantitative studies. This paper shows preliminary results for the Spanish corpus.
1.1 Differences between a speech database and a corpus of spontaneous speech
When discussing spoken resources, a preliminary distinction has to be made. Most linguistic resources currently available are speech databases: collections of high-quality recordings and detailed phonetic transcriptions of speech set up in controlled environments (typically telephone services). These speech databases are mostly used for training and testing speech systems, and they are developed by and for the language engineering industry. They aim to serve as a basis for recognizing and producing speech in restricted, predictable domains. In most cases, these databases contain many samples of the same word (that is, many tokens of the same type). Usually, the utterances are prepared and pronounced by professional speakers. The acoustic quality of the recording is essential. Speech databases usually provide detailed phonetic descriptions, including disfluencies, noises and other sounds. In general, these databases reflect the standard register, and distant variants (dialects, jargons) are poorly represented. Examples are SpeechDat (LRE-63314, Infrastructure for Spoken Language Resources) and SpeechDat II (LRE2-4001, Speech Databases for the Creation of Voice Driven Teleservices), which have set a standard for this type of resource. On the other hand, corpora of spontaneous speech are typically collections of a wide variety of spoken registers and non-scripted speech. These corpora are collected mainly for linguistic analyses and applications (language teaching, grammars and dictionaries). In such corpora the acoustic quality is not essential. What is important is that the texts reflect as much variation as possible and that the speakers behave in a spontaneous manner.
In some cases, these corpora are only concerned with a given register, for instance, a dialect or children’s speech. An important difference with respect to speech databases is the transcription: spontaneous spoken corpora are usually less precise in the acoustic and phonetic parts. In contrast, they include detailed information about the context and the speakers. These corpora are used mainly for sociolinguistic, text-typological, or psycholinguistic analyses. Examples are CHILDES and London-Lund. C-ORAL-ROM is a corpus of spontaneous speech, but it also shows some distinctive features:
Multilingual: the main goal is to compare the four languages on the same grounds, and to provide comparative studies at different linguistic levels.
Acoustic quality: in order to be re-usable by the speech industry, sufficient samples of digital recordings, media and phone conversations are included.
Alignment of the transcription and the original sound: this is useful both to verify the accuracy of the transcription and for teaching and other applied investigation purposes.
The main limitation of C-ORAL-ROM is its size: 300,000 words per language is not sufficient for establishing classifications or for statistically significant analyses. We believe that this corpus will show the relevance and usefulness of an approach that pays as much attention to the acoustic quality of the recordings as to the linguistic annotation.
1.2 Multi-lingual comparability
Cross-linguistic comparison can provide two complementary perspectives: comparing a given feature or features across languages, and comparing a given register or text type across languages (Biber, 1995). In the C-ORAL-ROM project, different traditions and experiences have interacted. On the practical side, the teams came to an agreement on two basic points: a text distribution (or sampling design) and a unified format for the transcription.
1.2.1 Text distribution
In order to compare the linguistic features of the four languages, the same sampling criteria and the same proportion of each text type are needed in the four subcorpora. There is a long tradition in sociolinguistics and in corpus linguistics (Labov, 1966; Biber, 1988; Biber et al., 1999; Miller & Weinert, 1999) of determining the relevant non-linguistic parameters. Basically, authors agree on a series of socio-situational parameters, such as register and genre variation, sociological features of the speakers (sex, age, education, occupation, origin), and dialogic structure (monologue, dialogue, conversation). The disagreement is over how to combine these parameters. C-ORAL-ROM has chosen the design of the Spoken Dutch Corpus (http://lands.let.kun.nl/cgn/ehome.htm). The sampling design differs between the two sub-corpora. The informal register is organised according to social context (familiar-private vs. public) and dialogic structure (monologue vs. dialogue-conversation). The formal register is organised according to channel (media, telephone, natural context). In addition, media texts and formal texts in natural context are grouped by genre (see table below). Sociological features have not been taken into account for text selection, but they are explicitly marked in the metadata section of the transcription. The male/female distinction has been the only feature to be balanced. With respect to text length, some decisions have been made. Only three sizes are allowed: short (1,500 words), medium (3,000 words) and large (4,500 words). Texts shorter than 1,500 words have been allowed in genre types such as meteorological reports, but always combined into segments of 1,500 words. Tables 1 and 2 show the distribution design of the informal and formal subcorpora for each language.
Table 1: Informal sub-corpus
Private/Familiar Context   113,000 words
    Monologue               33,000 words
    Dialogue                80,000 words
Public Context              37,000 words
    Monologue                6,000 words
    Dialogue                31,000 words

Table 2: Formal sub-corpus
Formal in Natural Context   65,000 words
    Political Speech; Political Debate; Preaching; Teaching; Professional Explanation; Conferences; Business; Law
Formal in Media Context     60,000 words
    News; Meteo; Interviews; Reportage; Scientific Press; Sport; Talk Show Political; Thematic Explanation; Talk Show Culture; Talk Show Science
Telephone                   25,000 words
    Private Dialogues; Phone to Call Services

1.2.2 Common format
To ensure a valid comparison, it is also necessary to use a consistent annotation framework. The consortium developed the C-ORAL-ROM format, which is based on the well-known CHAT format. A conversion to XML is provided. The xml-tagged version guarantees easy interpretation through the corresponding DTD. The combined use of XML and DTD ensures that every text in each corpus complies with the same requirements. In this way, textual uniformity is obtained within and across the four corpora. The format is divided into the header (with the meta-data) and the transcription. Most features in the header are compulsory, so rich information is provided for every text. The transcription is divided into turns, where applicable. Each turn is marked by a three-letter code identifying the speaker. An orthographic transcription is provided, along with some tags marking disfluencies, noises, overlapping, and prosodic units. Morpho-syntactic tagging will be supplied in a separate tier. Figure 1 shows a fragment of a text. A large selection of fragments from the four languages can be consulted on the official webpage of the project, along with the sound source.
@Title: Raquel
@File: efamdl04
@Participants: PAT, Patricia, (woman, B, 2, hairdresser, participant, Madrid)
               ROS, Rosa, (woman, B, 3, English teacher, participant, Madrid)
@Date: 10/03/2001
@Place: Madrid
@Situation: chat between friends at home, not hidden, researcher observer
@Topic: friends, movies and future Use’s works
@Source: C-ORAL-ROM
@Class: informal, familiar/private, dialogue
@Length: 7’ 58’’
@Words: 1509
@Acoustic_quality: A
@Transcriber: Guillermo
@Revisor: Manuel; Guillermo, Jesús and Manuel (prosody)
@Comments:
*PAT: si ya han [/] han decidido ir con ellos / y la conocen ...
*ROS: ya / pero si yo no [/] si a mí me da igual / si yo no digo nada de Use y Nuria / yo digo que la peña es un / poco egoísta //
*PAT: <no> //
*ROS: [<] <y ya está> //
Figure 1: A fragment of C-ORAL-ROM text

1.3 Other relevant aspects
C-ORAL-ROM is compliant with the state of the art in spoken corpora. These aspects are briefly summarised in the following paragraphs.
1.3.1 The legal issue
During the 1990s, legislation on copyright and privacy changed in many European countries. In spoken language corpora, the law applies when recording individuals or using sound documents from the mass media. In the first case, speakers retain their right to privacy, and have to give their express authorisation before their speech can be transcribed and published. In order to preserve spontaneity, which is essential for our purposes, the procedure is to ask each participant to sign an authorisation after the recording. If a speaker refuses his/her consent, the recording is discarded. The right to privacy applies to every recording in a private context, but not to those in a public situation (a lecture, a political speech, a sermon).
On the other hand, many texts in the corpora are copyrighted: not only the media recordings but also those in which the speaker creates knowledge in the form of ideas or structure of contents. Typically, this is true of lectures and professional talks. We obtained written authorisation from the authors or the copyright holders for all the texts included in the Spanish corpus.
1.3.2 The acoustic quality
The Spanish corpus of C-ORAL-ROM has been collected from scratch, although other teams in the project have reused part of their previous texts. In our case, we preferred to make new recordings because, on the one hand, we did not have written consent for our previous texts and, on the other hand, the acoustic quality of the analogue tapes was poor. Most texts have been recorded with a DAT Tascam (model DA-P1) and two unidirectional microphones. The source has been converted into a WAV file (mono, 16 bit, 22,050 Hz) through a SPDIF port in a Sound Blaster Live Platinum 5.1, using the software Creative Recorder. In public places, when possible, the DAT recorder has been connected to the sound system. The media recordings have either been provided directly by the broadcasting station or recorded by a computer connected to the receiver. Acoustic quality is essential for applications in speech technologies and language engineering.
1.3.3 The linguistic annotation
Corpora increase in value depending on the annotation layers provided. Tagging a spontaneous speech corpus differs slightly from tagging a written corpus (Uchimoto et al., 2002). The difference is not in the tagged information but in the lower efficiency of the taggers when applied to spoken corpora. For instance, POS taggers are usually trained on written texts, which show a quite stable and determined word order. In contrast, corpora of spontaneous speech are highly flexible in word order.
In addition, they show repetitions, restarts, overlapping, and other features of spoken syntax for which taggers have to be trained specifically. The lexicon is also different. One can find many words that are not included in printed dictionaries, because they are innovations, or belong to an informal register, or simply because they are mispronunciations. A complete lemmatization and POS tagging is provided. Moreno and Guirao (2003) report the development of a POS tagger and unknown-word recogniser for morphosyntactic annotation of the Spanish corpus. The results provided by the lemmatizer are used in this paper (see section 3).
1.3.4 The validation
Verifying the reliability of data has become a fashionable topic in recent years. Users of linguistic resources want to know how the resources have been collected and how accurate they are. C-ORAL-ROM undergoes two types of evaluation. An internal validation is carried out by the team itself. Each text passes through five steps: transcription, first revision, prosodic tagging, second revision, and sound-text alignment. At least three linguists transcribe or revise each text. A program checks for format errors, blanks, typos, badly formed tags, etc. Content and form have therefore been validated exhaustively, guaranteeing that the transcription is faithful to the sound source. We want to stress that the alignment of sound and text is the best guarantee for the validation of a spoken text: any discrepancy between the actual speech and its transcription will be easily detected. An external validation will be done by experts at the end of the project.
2. The computational tool
We have developed computational tools for transforming the C-ORAL-ROM format into a tagging scheme better suited to relating meta-data to lexical items and computing the appropriate statistics. We divide this section into three subsections.
2.1 Using xml-tagged corpus for relating meta-data and linguistic features
The original C-ORAL-ROM annotation has been designed for registering a wide range of features, including acoustic ones (prosodic marks, noises, etc.) which will be used by the speech technology community. An example of an xml-tagged file is shown:
<Turn>
  <Name>PAT</Name>
  <Says>
    <Utterance Type="interrogation"> y cómo está </Utterance>
    <Notes Type="act"> cough </Notes>
  </Says>
</Turn>
<Turn>
  <Name>ROS</Name>
  <Says>
    <Utterance Type="enunciation"> bueno </Utterance>
    <Utterance Type="enunciation"> no está mal </Utterance>
  </Says>
</Turn>
Our goal in this experiment is to seek out lexical units peculiar to each subcorpus. The first step was to remove the non-lexical information from the original xml tagging. In particular, we wanted to capture two types of information:
i) the words that every speaker says, and
ii) the split of every turn into utterances, in order to prevent ill-formed word clusters.
This task is similar to tokenisation in written corpora. The division into utterances is also needed for delimiting the context that the POS tagger uses for disambiguation. A Perl script generates a new tagged corpus with only two tags: one for TURN, with attributes for speaker and file, and another for UTTERANCE:
<turn speaker="PAT" file="efamcv01">
  <utt> y cómo está </utt>
</turn>
<turn speaker="ROS" file="efamcv01">
  <utt> bueno </utt>
  <utt> no está mal </utt>
</turn>
By this means, every word in the corpus can be related to the speaker and the text. The file keeps all the socio-contextual information in the header. The corpus is partitioned into as many sub-corpora as there are distinct feature values in the headers. For instance, a male sub-corpus, an informal sub-corpus, a telephone sub-corpus, a meteo sub-corpus, etc. are all generated. After partition into sub-corpora, all occurrences (the tokens) of every lexical unit (the types) are counted in each subcorpus.
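The partition-and-count step just described can be illustrated with a short Python sketch. It is not the authors' Perl tool: the function name count_by_feature, the regular expressions, and the speaker-to-feature dictionary are assumptions made for this illustration, and the input is assumed to follow the simplified turn/utt markup shown above. Speakers with no known value are given "X", as in the chapter.

```python
import re
from collections import Counter, defaultdict

# Patterns for the simplified two-tag markup (assumed layout, see lead-in).
TURN_RE = re.compile(r'<turn speaker="([^"]+)" file="([^"]+)">(.*?)</turn>', re.S)
UTT_RE = re.compile(r"<utt>(.*?)</utt>", re.S)


def count_by_feature(simplified_text, speaker_feature):
    """Return {feature value: Counter(type -> number of tokens)}.

    speaker_feature maps a speaker code (e.g. "PAT") to one header value
    (e.g. sex). Unknown speakers fall into the "X" sub-corpus.
    """
    counts = defaultdict(Counter)
    for speaker, _file, body in TURN_RE.findall(simplified_text):
        value = speaker_feature.get(speaker, "X")
        for utterance in UTT_RE.findall(body):
            # Whitespace tokenisation is enough for this toy example.
            counts[value].update(utterance.split())
    return counts


sample = """
<turn speaker="PAT" file="efamcv01"> <utt> y cómo está </utt> </turn>
<turn speaker="ROS" file="efamcv01"> <utt> bueno </utt> <utt> no está mal </utt> </turn>
"""
sex_of_speaker = {"PAT": "woman", "ROS": "woman"}  # taken from a text header
counts = count_by_feature(sample, sex_of_speaker)
print(counts["woman"].most_common(3))
```

Running the same function once per header feature (sex, register, channel, genre, profession) would give one frequency table per sub-corpus, which is the input the later statistical comparison needs.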
Table 3 shows the distribution by sex of speaker.
Table 3: Distribution by sex
Sex     Tokens for the category   Total number of tokens   Percentage
Man     182832                    327044                   55.9 %
Woman   134693                    327044                   41.2 %
X       9519                      327044                   2.9 %
The “X” value is assigned when the sex of the speaker is unknown (typically in a media recording). The procedure can be applied to any type of information derived from the corpus. For instance, we tagged it with POS and lemma, using a POS tagger for spoken Spanish developed in our laboratory (Moreno & Guirao, 2003). We show the previous example after lemmatization and POS tagging. Lemmas are shown in uppercase.
Lemmatisation
<turn speaker="PAT" file="efamcv01">
  <utt> Y CÓMO ESTAR </utt>
</turn>
<turn speaker="ROS" file="efamcv01">
  <utt> BUENO </utt>
  <utt> NO ESTAR MAL </utt>
</turn>
POS tagging
<turn speaker="PAT" file="efamcv01">
  <utt> C P AUX </utt>
</turn>
<turn speaker="ROS" file="efamcv01">
  <utt> MD </utt>
  <utt> ADV AUX ADV </utt>
</turn>
C = conjunction; P = pronoun; AUX = auxiliary; MD = discourse marker; ADV = adverb
In summary, in this experiment we have considered three levels of linguistic data: words, lemmas and POS.
2.2 Extracting word clusters
If we calculate statistics directly on single units, the result will not be correct, since multi-word units will not be included in the count. Discourse markers as frequent as “por ejemplo” (for instance), “es decir” (in other words) or “o sea” (that is) will not appear if we work on single-word units. To solve this, we developed an algorithm based on n-grams in order to extract multi-word candidates. We extracted all n-grams with three or more occurrences, for n = 4, 3, and 2. Next, a filter is applied to discard all n-grams that start or end with a determiner or auxiliary. Finally, multi-words are selected by hand. Every multi-word is regarded as a lexical unit, equivalent to the single words.
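The automatic part of this extraction (counting n-grams, thresholding, and edge filtering) can be sketched as follows. This is only an illustration under stated assumptions: the function name ngram_candidates is hypothetical, utterances are assumed to be lists of (word, POS) pairs, n-grams are assumed not to cross utterance boundaries, and the tags ART and AUX are assumed to stand for determiners and auxiliaries. The final manual selection is not modelled.

```python
from collections import Counter

# Tags whose presence at either edge disqualifies an n-gram (assumed labels).
EDGE_STOP_TAGS = {"ART", "AUX"}


def ngram_candidates(utterances, min_count=3, sizes=(4, 3, 2)):
    """Return {n-gram string: frequency} for multi-word candidates.

    utterances: list of utterances, each a list of (word, pos) pairs.
    """
    counts = Counter()
    for utterance in utterances:
        for n in sizes:
            # Slide a window of length n inside one utterance only.
            for start in range(len(utterance) - n + 1):
                counts[tuple(utterance[start:start + n])] += 1

    candidates = {}
    for gram, freq in counts.items():
        if freq < min_count:
            continue  # keep only n-grams with three or more occurrences
        if gram[0][1] in EDGE_STOP_TAGS or gram[-1][1] in EDGE_STOP_TAGS:
            continue  # drop n-grams starting or ending with a stop tag
        candidates[" ".join(word for word, _pos in gram)] = freq
    return candidates


toy = [[("por", "PREP"), ("ejemplo", "N")]] * 3 + [[("el", "ART"), ("ejemplo", "N")]] * 3
result = ngram_candidates(toy)
print(result)
```

In the toy data both bigrams occur three times, but only "por ejemplo" survives, because "el ejemplo" starts with a determiner. The surviving candidates would then be reviewed by hand, as the text describes.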
2.3 Applying the statistics of surprise
In order to identify the distinctive words, lemmas or POS for a given sub-corpus, we have employed the log-likelihood ratio test proposed by Dunning (1993). This method does not assume normal statistical distributions of units in a corpus. Instead, the log-likelihood ratio λ assumes a binomial distribution, which is more appropriate for rare but distinctive words. “Texts are composed largely of such rare events” (Dunning, 1993). In addition, this test does not need balanced subcorpora for comparison. This method has been successfully applied for finding collocations (Dunning, 1993) and terms (Daille, 1994). In order to test the method for finding distinctive units in specific domains, we can work on two hypotheses:
i) Two registers (or sub-corpora) show no difference in distinctive units (null hypothesis).
ii) For a given sub-corpus, we can find distinctive units (alternative hypothesis).
We applied the test to two well-defined sub-corpora, meteorological reports and law, in order to discard one of the hypotheses. Results are shown in Table 4. The critical value for one degree of freedom is 7.88 (the 0.005 significance level of the chi-square distribution).
Table 4: Dunning test applied to well-defined sub-corpora
Meteo                               Law
-2 log λ  Freq  Lemma               -2 log λ  Freq  Lemma
165       50    norte               200       58    policía
160       38    fuerza              200       289   persona
145       27    viento              116       85    derecho
128       6466  en                  113       45    contrato
97        16    componente          101       22    judicial
91        19    temperatura         84        31    delito
80        19    noroeste            83        26    delincuente
69        12    oeste               78        50    ley
69        12    nube                77        69    determinar
64        92    zona                65        17    cometer
Results confirm the alternative hypothesis and the suitability of the Dunning test for the task. Most of the “top 10” lemmas in both domains have a low frequency, but all are typical terms of their domain.
3. Preliminary results
Our goal is to show a range of possibilities for the application of this method. We will show here a very incomplete set of data.
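For readers who wish to reproduce scores of the kind reported in this chapter, the statistic of section 2.3 can be sketched in Python using the binomial formulation given by Dunning (1993). The function names and the argument layout (k1 occurrences in a sub-corpus of n1 tokens, k2 occurrences in a reference set of n2 tokens) are choices made for this illustration; the chapter does not state exactly which counts the authors compared, so this should be read as a generic sketch rather than their implementation.

```python
import math


def _binomial_log_l(k, n, p):
    """Binomial log-likelihood of k hits in n trials with rate p.

    The binomial coefficient is omitted because it cancels in the ratio.
    Terms with a zero count are skipped, which also avoids log(0).
    """
    total = 0.0
    if k > 0:
        total += k * math.log(p)
    if n - k > 0:
        total += (n - k) * math.log(1.0 - p)
    return total


def log_likelihood_ratio(k1, n1, k2, n2):
    """Return -2 log lambda for a unit seen k1 times in n1 tokens
    versus k2 times in n2 tokens."""
    p1 = k1 / n1                 # rate in the sub-corpus
    p2 = k2 / n2                 # rate in the reference data
    p = (k1 + k2) / (n1 + n2)    # pooled rate under the null hypothesis
    return 2.0 * (
        _binomial_log_l(k1, n1, p1) + _binomial_log_l(k2, n2, p2)
        - _binomial_log_l(k1, n1, p) - _binomial_log_l(k2, n2, p)
    )


CRITICAL_VALUE = 7.88  # one degree of freedom, as quoted in section 2.3

score = log_likelihood_ratio(50, 1000, 5, 1000)   # invented counts
print(score > CRITICAL_VALUE)
```

The score is zero when both rates are equal and grows as they diverge. Because it is two-sided, a high score alone does not say in which sub-corpus the unit is over-represented; comparing p1 and p2 settles the direction.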
Currently, there is a disproportion between social and register features and annotated linguistic features. The linguistic annotation is being carried out this year (2003). Comprehensive results will be delivered with the final version of the four corpora, including a cross-linguistic comparison. In this paper, the only linguistic features that will be taken into account are:
words and multi-words
lemmas
POS tags.
First, we show the 10 most characteristic word types in our corpus of spontaneous speech for three professions: a consumers’ association manager, a football coach and a computer system administrator. Notice that words and multi-words are regarded as equivalent units.
Table 5: Most characteristic words in three professions
Consumers’ association manager    Football coach                 Computer system administrator
-2 log λ  Word                    -2 log λ  Word                 -2 log λ  Word
110       establecimiento         31        directiva            91        tío
107       establecimientos        30        club                 48        grabando
106       encuesta                28        Real Madrid          41        web
103       cesta                   25        rueda de prensa      33        yo qué sé
65        política de precios     18        director general     33        no sé
45        marcas                  17        no                   30        joder
44        precios                 16        estabilidad          30        linux
44        índice                  16        confianza            28        cabrón
42        consumidor              16        hombre yo creo que   28        detectan
41        insisto                 16        Y que                26        barato
The procedure can be extended to any profession registered in the corpus, as a means of detecting sociolectal information. Next, we provide the results of the Dunning test on the formal and informal registers, approximately 150,000 words each:
Table 6: Most characteristic words in formal and informal registers
Formal                        Informal
-2 log λ  Word                -2 log λ  Word
134       de                  422       si
91        es decir            279       ah
62        su                  238       sabes
61        en                  231       claro
60        gobierno            231       me
55        en este momento     222       tía
49        desarrollo          182       yo
49        general             179       no sé
47        nuestra             173       no
46        países              171       ya
Another interesting comparison is to find out which POS are more typical of the male and female registers. Table 7 shows the results.
In our corpus, men prefer to use nouns and women clearly prefer pronouns. Finally, after lemmatization, we can show the 10 most frequent verbs in the general corpus and the most characteristic verbs in the male and female sub-corpora (Table 8).
Table 7: POS in male and female registers
General                     Male               Female
Total occurrences  POS      -2 log λ  POS      -2 log λ  POS
47052              V        515       N        1327      P
42531              N        422       ADJ      524       ADV
38210              PREP     399       ART      243       C
32284              ART      382       PREP     149       MD
31404              ADV      80        DEM      126       INTJ
30737              C        34        Q        28        V
25044              P        34        REL      16        POSS
17418              AUX                         16        AUX
12611              ADJ                         4         NPR
10112              Q
Table 8: Most frequent verb lemmas in male and female registers
General                      Male                    Female
Total occurrences  Lemma     -2 log λ  Lemma         -2 log λ  Lemma
3398               ir        28        escuchar      159       ir
2973               tener     28        recordar      154       decir
2579               decir     26        aparecer      112       saber
2067               hacer     23        llegar        99        venir
1046               poder     23        contemplar    86        dar
1026               saber     23        caminar       47        mirar
995                ver       22        intentar      46        comprar
802                dar       22        amar          42        gustar
779                querer    20        juntar        36        quedar
577                creer     20        superar       36        contar
4. Conclusions and future work
Here, we have shown the relevance of this procedure as an empirical method for the validation of sociolinguistic hypotheses in spoken language, as well as for determining register typology. The method correlates linguistic with socio-contextual data by applying Dunning’s statistics of surprise. To achieve this, a richly annotated corpus and the use of xml have been essential. The preliminary results are promising and have not been shown previously for Spanish. However, drawing conclusions and interpretations from these figures is premature, since the corpus is clearly not large enough. For this reason, we will apply the method to the CORLEC corpus, also developed by the LLI-UAM (see Moreno 2002 for an overview). The combination of the C-ORAL-ROM and CORLEC corpora will contain over 1,500,000 words of spontaneous spoken Spanish.
We also wish to find out more about the correlation between linguistic and socio-contextual features once the complete morphosyntactic annotation is finished. For instance, verb tenses, number, gender, person in pronouns, etc., will be tagged. Biber (1988, 1995) provides a rich catalogue of linguistic features that can be traced. Finally, a cross-linguistic comparison between the four Romance languages in C-ORAL-ROM will be made, based on the same text distribution.
References
Biber D. (1988), Variation across speech and writing. Cambridge: CUP.
Biber D. (1995), Dimensions of register variation. Cambridge: CUP.
Biber D., S. Johansson, G. Leech, S. Conrad and E. Finegan (eds.) (1999), The Longman grammar of spoken and written English. London: Longman.
Cresti E. et al. (2002), ‘The C-ORAL-ROM project. New methods for spoken language archives in a multilingual romance corpus’, in: Proceedings of LREC 2002. Las Palmas de Gran Canaria.
Daille B. (1994), Combined approach for terminology extraction: lexical statistics and linguistic filtering. Ph.D. Thesis, Paris 7.
Dunning T. (1993), ‘Accurate methods for the statistics of surprise and coincidence’. Computational Linguistics 19(1): 61-74.
Labov W. (1966), The social stratification of English in New York City. Washington: Center for Applied Linguistics.
Miller J. and R. Weinert (1999), Spontaneous spoken language. Oxford: Clarendon.
Moreno A. (2002), ‘La evolución de los corpus de habla espontánea: la experiencia del LLI-UAM’, in: Proceedings of II Jornadas de Tecnologías del Habla, Granada, Spain.
Moreno A. and J. M. Guirao (2003), ‘Tagging a spontaneous speech corpus of Spanish’, in: Proceedings of Recent Advances in NLP (RANLP-2003), Borovets, Bulgaria.
Uchimoto K., C. Nobata, A. Yamada, S. Sekine and H. Isahara (2002), ‘Morphological Analysis of the Spontaneous Speech corpus’.
In: Proceedings of Conference of Computational Linguistics (COLING 2002), Taipei, Taiwan.

An analysis of lexical text coverage in contemporary German

Randall L. Jones
Brigham Young University

Abstract

One of the many practical applications of corpus studies is the generation of word frequency information. It makes sense that for the teaching of vocabulary in a second language, lexical frequency should play a significant role in the selection of vocabulary to be included in pedagogical materials. Using the concept of “lexical text coverage”, a study based on the BYU/Leipzig Corpus of Contemporary German has shown that a basic vocabulary of 3,000 high frequency words can account for between 75% and 90% of the words in the text, depending on the register.

1. Vocabulary and second language learning

How many words must a learner know in order to read and understand a German novel, a newspaper, an academic text, or to understand a German television broadcast or a conversation? Which words are most important, i.e. which ones should be learned first? Is the vocabulary used in the above-mentioned registers about the same, or are there substantial differences? And what is a word and what does it really mean to know a word in a second language? These questions lie at the very root of second language vocabulary learning.

To begin with the question of which words should be learned first, it is important to look at vocabulary frequency, as the frequency of the various words in a typical text differs significantly. Some occur numerous times and some occur only once. By focusing first on learning the most frequently occurring words, it would seem that the process of learning to read or comprehend speech in a second language would become more systematic and therefore more efficient. But again the question arises, what are the most frequently occurring words in German and how many does one have to learn?

2.
The concept of lexical text coverage

Work by Nation (2001) and others has helped us immeasurably in understanding the nature of vocabulary frequency and its relation to vocabulary learning in a second language. For a learner to read and have some comprehension of a German text, a minimum or threshold vocabulary is necessary. Nation suggests that a vocabulary at the level of 2,000 to 3,000 word families is a minimum for reading an unedited English text (2001: 146). He has introduced the notion of “word token coverage” and applied it to various types of English texts (2001: 17, 147).

The notion of “word token coverage” means the degree to which a defined level of high-frequency vocabulary from a text covers or accounts for all the words in the text. For example, what percentage of a given text is covered by the 1,000 most frequently occurring words? What about the next 1,000, and how many words does it take to achieve a coverage of, say, 90%? The procedure for determining this is quite straightforward.

i) Generate a word frequency list for a text.
ii) Construct word families from the n most frequent words.
iii) Match that word family frequency list against the original text to determine what percentage is covered by the words in the list.

Nation uses what he calls word families instead of simple words, i.e. all words that are related to and recognisable from a base word. For example, the base word agree also includes inflected verb forms as well as associated nouns and adjectives: agreed, agreeing, agreement, agreements, agrees, agreeable, disagree, disagreements, disagreeable, disagreed, disagreeing, disagreement, disagrees, a total of 14 word tokens in three word classes. It is assumed that by understanding the verb agree, one will also understand other morphological permutations. The decision is, of course, somewhat arbitrary, i.e.
it is assumed that at an early stage of vocabulary learning one understands regular verb conjugation and has a knowledge of the meaning of simple affixes such as dis-, -ment, -able.

Table 1 shows Nation’s analysis of text coverage by the 1,000 and 2,000 most frequent words in four English text registers. The highest level of coverage for 1,000 words was for conversation (84.3%), while the lowest was for academic texts (73.5%). An additional 1,000 words increases the coverage by only a few percentage points. Each additional 1,000 words results in an increasingly smaller addition to the coverage.

It is important to stress that we are dealing with coverage of word tokens, not word types. In a running text of 100,000 words of the conversation sub-corpus for example, about half of the words will occur more than once, and some of them numerous times. If the 2,000 most frequently occurring words in the text were compared with each running word in the text, about 90% of the running words would be accounted for or “covered”.

These statistics are highly interesting and even a bit surprising. The 1,000 most frequently occurring words in the English conversation text covered 84.3% of the words in the text. This means that by knowing this subset of vocabulary, a reader will recognise and have some knowledge of any given word in the text 84.3% of the time. It is also surprising that, by doubling the number of words to 2,000, only 6 percentage points of additional coverage are achieved. How many words are necessary in order to achieve 100% coverage? It is also interesting to speculate on the difference in coverage among the four registers. One would expect that conversation would represent the highest coverage, but why would newspaper English be ten full percentage points lower?

Table 1: Text type and text coverage (tokens) by the most frequent 2,000 words of English in four different text registers (Nation, 2001: 17).
Levels      Conversation   Fiction   Newspapers   Academic texts
1st 1000    84.3%          82.3%     75.6%        73.5%
2nd 1000    6.0%           5.1%      4.7%         4.6%
Total       90.3%          87.4%     80.3%        78.1%

These statistics may convey the message that by learning a relatively small number of words (a small number at least when compared to the total number of words in the English language) one can read and comprehend an average English text. Caution is in order here. We still do not understand the cognitive process of vocabulary learning and it is difficult to define what it means to know a word. In addition, if the reader understands 87% of the words in a text, it also means that he or she does not understand 13% of them, and these may be the most important words for understanding the full meaning. Nevertheless, a threshold vocabulary level is a useful concept and a good beginning for continued learning.

3. Lexical text coverage in German

What happens when we apply this procedure to a language other than English, e.g. German? As is well known, German has a much more complex morphology than English and uses far more compounding of nouns, adjectives and verbs. For example, the German base word arbeit (“work”) could include not only three nouns (Arbeit, Arbeiter, Arbeiterin) and eight verb forms, but also literally hundreds of compound nouns (e.g. Arbeiterfamilie, Arbeitgeber, Schichtarbeit, etc.), verbs (e.g. ausarbeiten, bearbeiten, verarbeiten, etc.), and adjectives (arbeitslos, arbeitsfähig, etc.). Some of these might be recognised on the basis of an understanding of the root word, but many of them would not. It would seem that for German only the most basic morphological forms should be included in the word family.

In spite of the lexical complexities of German, it seems to be possible to calculate lexical coverage in very much the same way as for English. An analysis was made from a subset of approximately 10% of the BYU/Leipzig Corpus of Contemporary German, using each of four registers in the corpus.
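The coverage procedure outlined in Section 2 can be sketched in a few lines. This is a simplified illustration only: it counts plain word forms rather than word families (step ii is skipped), the tokenizer is deliberately naive, and the function names are assumptions of this sketch, not part of the RANGE software.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its runs of letters as tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


def coverage(tokens, n):
    """Share of running words covered by the n most frequent word forms."""
    counts = Counter(tokens)
    covered = sum(freq for _, freq in counts.most_common(n))
    return covered / len(tokens)


def band_coverage(tokens, band_size, bands):
    """Additional coverage contributed by each successive frequency band."""
    cumulative = [coverage(tokens, band_size * (i + 1)) for i in range(bands)]
    return [cumulative[0]] + [
        cumulative[i] - cumulative[i - 1] for i in range(1, bands)
    ]
```

Calling `band_coverage(tokens, 1000, 2)` on a tokenized text would return the two figures reported per register in tables of this kind: the share of tokens covered by the first 1,000 forms and the additional share covered by the second 1,000. Replacing word forms with word-family identifiers before counting would bring the sketch closer to the procedure described.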
The total BYU/Leipzig German Corpus consists of the following:

Spoken German: 1 million words. 700,000 conversation + 300,000 television
Literature: 1 million words. Seven literature genres
Newspapers: 1 million words. Twenty regional and national newspapers
Academic prose: 600,000 words. Six academic areas, secondary & postsecondary
Gebrauchstexte: 400,000 words. Instructions, advice, advertisements, etc.

The texts are taken from the three major German-speaking countries and represent a variety of styles and levels. Most of the texts are from the years 2000-2002. A subset was used for this study because the full corpus was not yet complete.

The 400,000 word German sub-corpus was processed using the RANGE software (available from Paul Nation, School of Linguistics and Applied Language Studies, Victoria University of Wellington in Wellington, New Zealand). First a raw frequency list was generated, then word family lists were constructed for the first 1,000 and 2,000 most frequent words. The results are shown in Table 2.

Table 2: Text type and text coverage (tokens) by the most frequent 2000 words of German in four different text registers.

Levels      Conversation   Literature   Newspaper   Academic
1st 1000    82.6%          72.0%        64.0%       65.4%
2nd 1000    4.4%           5.4%         6.5%        7.8%
Total       87.0%          77.4%        70.5%       73.2%

By comparing the German study with Nation’s English analysis one can make several interesting observations. First, the results are not greatly different for conversation and academic text but quite different for literature and newspaper text. There could be a variety of explanations for this, including external factors relating to the choice of texts used for the respective studies. German national newspapers tend to be erudite and use more compound words. It is also interesting to note the respective differences among the four registers.
Conversation represents the highest coverage in both languages, but whereas academic is the lowest for English, newspaper shows the lowest for German. Again, this difference may be the result of a number of factors. An attempt to account for these differences would be beyond our scope here but it is an intriguing cross-language phenomenon. A final curious statistic is the fact that the percentage difference for the second thousand words was almost exactly opposite of the first one thousand, i.e. lowest for conversation (4.4%) and highest for academic (7.8%), thus slightly decreasing the spread among the four registers.

For the German study an additional 1,000 words were added. The results are shown in Table 3. Again we see that, by increasing the number of words by 50%, the coverage is increased by only 2.5%, 3.4%, 4.2% and 4.6% respectively, but that the percentage is lowest for conversation and highest for academic, thus narrowing the gap even more. Is it possible that, at a certain point, the percentages would all be the same?

Table 3: Text type and text coverage (tokens) by the most frequent 3000 words of German in four different text registers

Levels      Conversation   Literature   Newspaper   Academic
1st 1000    82.6%          72.0%        64.0%       65.4%
2nd 1000    4.4%           5.4%         6.5%        7.8%
3rd 1000    2.5%           3.4%         4.2%        4.6%
Total       89.5%          80.8%        74.7%       77.8%

One of the questions posed at the beginning of this paper was about the difference in vocabulary among the four registers represented. It is obvious that there is a quantitative difference, as is evidenced in the different percentages of coverage. But what degree of lexical overlap exists? In analysing the range of the 1000 most frequent words in a combined list of all four registers, it was interesting to observe that 89.1% of the words occurred in all four sub-corpora, 7.6% occurred in three of the four sub-corpora, 2.4% occurred in two, and only 0.8% occurred in just one.
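A range analysis of this kind can be sketched as follows. The sketch assumes each sub-corpus is available as a list of (lemmatised) tokens; the function name and the toy data in the usage note are hypothetical and are not taken from the RANGE software or from the corpus itself.

```python
from collections import Counter


def range_distribution(subcorpora, top_n):
    """For the top_n most frequent words of the combined corpus, return the
    share of words that occur in exactly k sub-corpora, for k = 1..K.

    subcorpora: dict mapping a register name to its list of tokens.
    """
    combined = Counter()
    for tokens in subcorpora.values():
        combined.update(tokens)
    top_words = [word for word, _ in combined.most_common(top_n)]

    vocabularies = [set(tokens) for tokens in subcorpora.values()]
    ranges = Counter(
        sum(word in vocab for vocab in vocabularies) for word in top_words
    )
    return {
        k: ranges[k] / len(top_words) for k in range(1, len(subcorpora) + 1)
    }
```

For a toy input such as `{"spoken": ["na", "und", "leute"], "newspaper": ["und", "euro", "jedoch"], "academic": ["und", "jedoch"], "literature": ["und", "leute"]}` with `top_n=5`, one word occurs in all four registers, two occur in two, and two occur in only one.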
Table 4a: Words (partial listing) occurring in three of the four sub-corpora

Word        Total   Academic   Spoken   Newspaper   Literature
NA          162     2          141      0           19
JEDOCH      149     65         0        62          22
EURO        96      19         1        76          0
MONTAG      81      0          1        75          5
UHR         77      3          0        53          21
GESICHT     75      3          0        6           66
KRANKHEIT   65      60         0        2           3
ZUDEM       62      18         0        41          3
SOZIAL      60      55         1        4           0
ZUGLEICH    59      35         0        13          11
LEUTE       56      0          35       5           16
URLAUB      56      0          48       2           6
REGIERUNG   55      11         1        43          0

This would suggest a high degree of commonality among the four registers, and has important implications for second language vocabulary learning. While there are language-learning programs that emphasise a specific register, in most cases vocabulary is learned without regard to how it might be encountered at a later time.

Table 4b: Words (partial listing) occurring in two of the four sub-corpora

Word        Total   Academic   Spoken   Newspaper   Literature
SOWIE       108     58         0        50          0
MEDIZIN     105     102        0        3           0
DOLLAR      68      0          0        67          1
JÄHRIG      51      18         0        33          0
MITTWOCH    51      0          4        47          0
DIENSTAG    47      0          2        45          0
QUARTAL     45      0          2        43          0
PFERD       43      0          3        0           40
ERSTMAL     42      0          41       1           0
GARTEN      41      0          23       0           18

4. Conclusion

Vocabulary learning in a second language can be difficult and time consuming, but language educators can contribute to the ease of learning by sequencing new lexical material based on frequency information. It would seem that in German as well as in English a threshold vocabulary of approximately 3,000 words would be suitable for the typical academic learning experience. There are, of course, thorny issues that still need to be addressed such as what it means to learn and know a word and how new vocabulary is best taught. But by being able to establish a frequency-based vocabulary pool, students can benefit from focusing on learning words that will most likely be of the greatest benefit to them.

References

Nation, I.S.P. (2001), Learning Vocabulary in Another Language. Cambridge: Cambridge University Press.
Analysing a semantic corpus study across English dialects: Searching for paradigmatic parallels 1

Sarah Lee and Debra Ziegeler
University of Manchester

Abstract

In this investigation, we conduct a contrastive corpus analysis of the usages of the ‘get’ periphrastic construction, focusing on semantic variation. Our primary interest is in standard Singapore English (SE); British English (BE) and New Zealand English (NZE) were used for comparative purposes. The investigation found that, generally, SE use of the ‘get’ periphrastic construction was similar to that for BE and NZE. However, after conducting a search for paradigmatic parallels, we also found that in certain functional environments, typically filled by the ‘get’ causative in the other two dialects, SE may have gone further in evolving a competing form: the use of speech act verbs, especially ‘ask’ used in a causative sense.

1. Introduction

This paper details the corpus-based approach we adopted to analyse the usages of a grammatical construction, the get periphrastic causative (henceforth get-causative) in an English dialect, and some interesting results of the investigation. We aimed to discover if standard Singapore English (SE) has developed linguistic alternatives to fill the causative function(s) of the get-causative, and/or has derived any indigenised usages of the construction, particularly cognitively-induced ones.

As an English dialect, SE is interesting in that it exists in an ethnolinguistically diverse ecology wherein a number of genetically-unrelated languages compete with English for various functions. Feature selection as a result is at least potentially subject to influence by cognitive and situational factors that are much less likely to permeate largely monocultural monolingual dialects like BE and NZE. Keeping this important factor in mind, a comparative approach was chosen, and BE and NZE used as standards of measure.
Important aspects of the approach included a prior analysis of the construction to determine the prototypical semantics associated with the construction as used in BE and NZE, and to focus the corpus study on finding quantitative and qualitative evidence that might indicate idiosyncratic behaviour related to the get-causative paradigm in SE. An example of the get-causative is given in (1):

(1) No no they did Terry O’Neill’s session and it was such garbage they got him to reshoot (ICE-GB s1a-052)

A particular construction such as the get-causative has, of course, a unique profile comprising a set of functions and semantic features, and a grammatical analysis at a very fine-grained level will reveal that neither the construction nor any of its key components can be replaced by other linguistic item(s) without some loss of information. In a case of paradigmatic replacement, the information that is lost must not include loss of salient function, otherwise, instead of a paradigmatic alternate, the replacement in effect signals an instance of general language change process.

The idea of parallels to a well-defined paradigm is perhaps best represented by the variation shown by sociolinguistically and dialect-defined groups in their patterning preferences in a particular function. Cognitive factors can influence grammatical preferencing strategies. Corpus studies can be very useful in pointing out such systematic associations in dialectal variation by providing a massive amount of natural language for quantitative and qualitative analyses. However, data of significance may not be easy to find. Linguistic items in association patterns (‘the systematic ways in which linguistic features are used in association with other linguistic and non-linguistic features’ (Biber 1998:5)) that are also grammatical may not be arranged syntagmatically as grammatical parallels are often aligned paradigmatically.
As well as quantitative means, this study uses clues from structural constraints of a construction to retrieve grammatical association patterns in the use of the get-causative.

2. Background to methodology

Corpus studies have developed association-based methods to harvest instances of pattern recurrence in language use. More established methods such as collocation (Sinclair, 1991) exploit pairings and groupings of lexical items. Inroads made into the study of language use have shown that words very often do not co-occur at random in a text; rather, pairs or groups of words will consistently appear in the same or similar linear arrangements (though not necessarily directly adjacent) because there is a strong preference in language production to select ‘semi-preconstructed phrases that constitute single choices’ at least to the extent that they are non-random co-occurrences (see, for example, Sinclair’s 1991: 110 ‘idiom principle’). Harvested collocations can be measured and analysed in terms of the degree of association that ranges from a cohesive strength analogous to morphological compounding, e.g., of course (see Sinclair, 1991: 111) to randomness, where the co-occurrence is analysed to have been arbitrarily induced. Collocations can only treat items in essentially syntagmatic arrangements, i.e., collocates must be linearly-patterned words.

More recently, the focus in corpus linguistics has also turned to association patterns of grammatical constructions (e.g., Biber, 2000; Hunston & Francis, 1999), that is, association patterns that are instances of grammatical preferencing (Biber, 2000). There are (at least) two considerations that make retrieval and analysis of such data difficult. First, grammatical association patterns do not necessarily syntagmatically align.
This means that the goal of a search is not for linear patterns, making the search for grammatical association patterns much more difficult than locating collocations. Second, some types of parallelisms, if cognitively or situationally-induced, are not transparently obvious. Transparent associations will be structurally similar, for example, the difference between overt use of that and the omission of it in clause structure preferencing, e.g. (Biber et al., 1998: 103):

(2) a. I thought that she only wore jeans.
    b. She thinks he’s sweet.

The two clause structure types are grammatical patterns and while serving analogous functions, there is an obvious structural difference that can be isolated and studied. But grammatical patterns are not always so easy to detect. A particular grammatical structure is very often the result of cognitive or communicative function, and possible structural representations of a function can vary, i.e., the paradigm is not equivalent to any particular linguistic form sui generis. In (3), for example,

(3) Person A: How did you go to the airport?
    Person B: I ______ Jane (to) drive me.

a paradigmatic focus is interested in finding alternatives either for the cause verb, let us say, get, or for the construction as a whole (perhaps a resultative construction like I got driven there rather than another causative one, for instance). So, paradigmatic parallels of the get-construction may not necessarily have the same structural form as the source construction. At best, a primarily structural approach may yield some alternatives but, more likely, to approach the investigation on the basis of structural similarity would harvest too immense a number of possible alternatives, 2 rendering the analysis too broad to be interesting. The central issue involves how one can predict identifiable structural alternatives for a cognitive or communicative function that are not always realised in structurally similar ways.
Automated processes that can trawl a corpus for items that fit a semantic profile are not yet available commercially. In Section 4, our low-tech approach to finding parallels for the get-causative is detailed.

2.1 Corpora Consulted

The corpora examined were the SE (ICE-SG), NZE (ICE-NZ) and BE (ICE-GB) sub-corpora from the International Corpus of English (ICE) Collection. 3 Each sub-corpus contains approximately one million words of the standard variety of English in each geographic location, with the ratio of the number of spoken to written words at 3:2. The number of texts and registers represented in each ICE sub-corpus is standardised, the consistency enabling greater accuracy for frequency counts in cross-dialectal studies. (For more information about the ICE Collection, see Greenbaum, 1996.)

As ICE-SG is not tagged for grammatical information, one search issue that arose was related to occurrences of ellipted linguistic items that prevented classification of the token as an instance of causative periphrastic. Zero categories such as pro-dropping are found in informal registers in SE, e.g., private dialogues (ICE-SG s1a-001 to s1a-100), as a feature that has transferred over from the influence of Chinese (Bao, 2001). Where instances were found, we needed to look at the extended context to determine whether the construction was actually a periphrastic. If any ambiguity could not be resolved, we did not add the instance to our frequency count, as in this example:

(4) The only problem is that you must get __ to allow you to tape the voices… (ICE-SG s1a-070)

In (4), the ellipted item could have been an instance of pro-dropping (of the causee) or it may be a word like ‘permission’ (i.e., a transitive use of the verb get, plus a verbal complement beginning with the to infinitive serving as an adjunct to the main clause).

3. The periphrastic get-causative construction

Here are some examples of the get-causative:

(5) Get-causatives
a.
Like the doctor in Bristol, she looked at my eyes, she got me to touch fingers and noses, to hop on one leg and saw how I coordinated… (ICE-GB w2b-001)
b. What we are trying to get you to realise is … (ICE-GB s1b-011)
c. Yah or else … how you’re gonna get those middle age people to answer them? (ICE-SG s1a-070)
d. With that kind of problem I got my chief engineer to go out and start the company making test equipment that will solve this type of problem. (ICE-SG s2a-043)
e. Workflow simply means getting the computer to pass information from one person in a work chain to another. (ICE-SG w2c-004)
f. The forecast of rain for the following week finally got him to fix the roof. (Talmy, 1976: 106)

The get-causative is only one of many constructions of the highly polysemous verb get. With respect to usage, the lexical item get historically has had a reputation of being appropriate for use only in informal or colloquial contexts. The prescriptive bias is mentioned in the Collins Cobuild English Usage (1992), American English Usage (1957), 4 and Fowler (1996). Whether or not this stigmatisation extends to all registers, is observed equally across dialects, 5 or applies only to some get constructions, has not been studied as far as we know. It is also not known how strongly the bias has affected today’s speakers. 6 This lack of knowledge meant that it was possible that the stigmatisation is still active today, so that usage preferencing differences across the dialects and across registers, and skewing, both of frequency distribution totals for registers and overall, may result.

3.1 Semantic profile of the get-causative

In the literature, there are a few notes about typical usages of the construction but these have seldom been confirmed by studies that are based on high volume empirical data, and also do not provide a complete semantic profile of the construction.
However, knowledge of individual essential semantic features can assist pre-analysis. Animacy is an important semantic variable in linguistic representations of causativity. For the get-causative, both Causer and Causee are prototypically sentient beings, viz. human.7 The sense of human agency can be extended metaphorically and metonymically, to animals, natural forces and the class of machines that can metaphorically be conceived as a system comprising parts working together like a computer, machine or car, as in (5e) (Givon, 1976: 348). We included extended animacy as instances of human agency in our frequency counts. The get-causative has also been associated with events viewed by the speaker to be difficult to achieve. This is possibly because of the history of the lexical meaning of get, i.e., the association with obtaining a goal via physical means. Although some analyses, e.g. Hollmann (m.s.) and Wierzbicka (2000: 118) have suggested that the difficulty or effort is associated specifically with the construction, it is not clear whether the association is fundamentally a lexicogrammatical one – i.e., a collocation with the lexical item get. In some dialects though, particularly Australian and New Zealand Englishes (from our personal observations), the construction seems to be used just as much in contexts which do not involve difficulty or effort, (see, e.g., 9a). Another semantic note about the get-causative is that it can be used for contexts allowing the achievement of the causative goal via verbal and non-verbal means, and perhaps usefully, without the actual means of conduit being specified. 
In (6) for instance,

(6) With that kind of problem I got my chief engineer to go out and start the company making test equipment that will solve this type of problem (ICE-SG s2a-043)

the speaker does not have to specify whether he verbally instructed his chief engineer or conveyed what he wanted indirectly, by simply exposing him to the problem. This is especially the case for non-human Causers (see, e.g., 5e-f).

3.2 Relation of get-causative to other periphrastic causative constructions

The get-causative can be viewed as a member of the class of directive causative constructions (Shibatani, 1976) in English, in which the doer of the event is overtly specified, typically realised periphrastically. Structurally, these constructions involve the use of an auxiliary cause verb such as cause, get, have, make or force, and the Causer engages an intermediary agent (doer), the Causee, to effect the event of the verb complement. The form of the periphrastic causative constructions is shown in (7):

(7) [NP_CAUSER - CAUSE VERB - NP_CAUSEE - [(to) - VERB COMPLEMENT]]

Examples of periphrastic causatives are listed in (8):

(8) a. No no they did Terry O’Neill’s session and it was such garbage they got him to reshoot (ICE-GB s1a-052)
    b. hm you know some of them bring a client in from a workplace and then have the counsellor sit in and coach them. (ICE-GB s1a-060)
    c. The presence of steam at the top probably caused the beads to fuse and over-expand and subsequently led to shrinkage of the cushion after ejection. (ICE-SG w3a-038)
    d. Inconvenience and loneliness, but mostly loneliness, made me think of home. (ICE-SG w2f-014)

At this level of granularity, the various cause verb causative constructions can be analysed as transparent alternatives in the periphrastic causative paradigm.
While the structure remains constant, each construction has a unique semantic profile, a configuration of a number of salient semantic features and specific communicative functions, such as that detailed for the get-causative above. Variation in semantics can relate to animacy preferences (see Talmy 1979 [2000] for analyses of a number of periphrastics), verb complement type constraints, e.g. constraints on the make causative depending on the directness relation between the two causing and caused events (Kemmer & Verhagen, 1994; Kemmer, 2002), and appropriate communicative situations (register preferencing).

While the known members of the periphrastic causative class are certainly possible candidates for paradigmatic alternatives, 8 it is also true that structurally different forms may also be competitors. In a new study, Ziegeler & Lee (m.s.) argue that the conventionalised scenario construction (9b) is evolving to become a paradigmatic alternative to some functions currently filled by the resultative construction environment (9a) in some dialects of English, especially SE.

(9) a. I got my hair cut.
    b. I cut my hair.

Such functional alternatives are problematic for the use of corpora in extracting frequency data, as the construction is formally indistinguishable from an alternative transitive construction involving no indirect causativity at all. Furthermore, no specific grammatical marking is used to express the implied indirect causativity, which would mean that any corpus retrieval must involve searching for haphazard lexical instances, and some methodology would need to be developed to delimit the pool of likely lexical candidates for retrieval. Such a methodology may involve closer field observation of language items in regular use than is possible through the constructed database of a corpus, no matter how representative it might be.
There are clearly problems to be overcome and methods to be developed in searching for items which are considered to be open-class functional replacements of former closed-class grammatical categories, as they cannot even represent paradigmatic substitutes within a particular construction. In this study, we searched for members of the periphrastic class and not open-class replacements. A pre-analysis of the semantic profile of the construction was an important aspect of the methodology we used.

4. Approach used in the study

The search for paradigmatic parallels is a particularly relevant strategy to studies that involve language change in dialects situated in a complex ecology. As noted, standard SE coexists with a number of languages genetically-unrelated to English (various Chinese dialects, Malay, Tamil, Baba Malay) as well as various Englishes including a creoloid form heavily influenced by Chinese. Most standard SE speakers are bilingual from infancy, and a number of Singaporeans are L2 speakers of English, increasing the possibility of ‘alien’ cognitive-induced features entering the dialect. 9 Previous studies have shown that grammatical features may be sourced from a background language, for example, elements of the aspectual system from Chinese (Bao, 2001) or that wider distribution of grammaticalised features in L1 dialects (hyper-grammaticalisation) may develop as a result of differences in the paths of ontogenetic grammaticalisation (Ziegeler 2000).

The basic approach used here is a comparative one. Quantitative information can be an important indicator of how widespread the use of the get-causative construction is in SE; however, the figures are meaningless without a standard measure. BE and NZE were enlisted to serve as the contrasting dialects.
BE was chosen partly for historical reasons connected to the development of certain registers in SE (styles for more formal registers followed an exonormative standard), and partly because it is a variety developed within a largely monocultural, monolingual environment. However, since BE is possibly much slower than other dialects in dispensing with the prescriptive bias against the use of get,¹⁰ and assuming the get-causative is affected, it seemed methodologically sounder to introduce a third dialect into the analysis. NZE, like BE, meets the monolingual, monocultural constraint but may be less bound by the prescriptive bias.¹¹ NZE would therefore provide a suitable quantitative control for frequency information, in case SE should be found to contain significantly more instances of the get-causative construction than BE.

Another important aspect of our methodological approach was to account systematically for each instance of distributional variation found for the construction and each association pattern found for SE, across registers as well as across dialects. We thought this crucial because the usage of a particular periphrastic causative construction is at least partly influenced dialect-internally, owing to differing causal pressures on usage in the respective dialects. Influences can be situationally induced. It has been shown that linguistic units, whether lexical items or constructions, can often show preferential bias according to certain register-related variables such as formality and (im)personal style (Biber 1999; 1995, and references therein). The ICE corpora contain a number of texts standardised according to register, so distributional variation can conceivably arise from register bias. For our frequency counts, it was crucial that we recognise any skewing due to this factor so that semantically-induced variation can be identified.
To find instances of qualitative variation, we manually analysed each instance found for animacy and overt marking of difficulty. In the case of the latter, as we found that words denoting difficulty or effort were frequently associated with the get-causative, we conducted a supplementary search for periphrastic causative constructions that were possibly paradigmatic alternatives for the same functional environment (see Section 5 below).

5. Distribution patterns for the get-causative in ICE-GB, ICE-NZ and ICE-SG

5.1 General frequency of occurrence

Table 1 below provides a summary of the frequency of occurrences of the get-causative in each corpus.¹² The information provided is the absolute number of instances found in each corpus (n), as well as the totals for the spoken and written modes. As ICE-NZ contained about 30% more words than either of the other two sub-corpora, the table provides totals normed to 100,000 words as well as a ratio based on the ICE-GB results. The last may be a more accurate indicator of frequency across dialects generally.

Table 1: Frequency of occurrence of the get-causative in ICE-GB, ICE-SG and ICE-NZ (in total instances found, per 100,000 words, and relative occurrence in other corpora compared with ICE-GB)

           Spoken               Written              Total (Average)
           n    Per 100,000     n    Per 100,000     n    Per 100,000
ICE-GB     31   4.9 / 1         11   2.6 / 1         42   3.8 / 1
ICE-SG     42   6.6 / 1.3       18   4.5 / 1.7       60   5.6 / 1.5
ICE-NZ     68   9.3 / 1.9       24   4.0 / 1.5       92   6.7 / 1.8

Table 1 shows that the comparative frequency count of instances found in the three corpora supports our personal observation that the get-causative may be used more frequently in NZE than in BE. The frequency count for ICE-SG is higher than for ICE-GB but not ICE-NZ. In terms of mode, the ICE-SG figure for written is slightly higher than in ICE-NZ.
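The norming behind Table 1 can be illustrated with a minimal Python sketch. The function names are illustrative only, and the 600,000-word corpus size in the last line is a hypothetical figure, since the chapter does not report the word counts of the individual modes. The normed spoken-mode rates are the ones given in Table 1.

```python
def per_100k(instances, corpus_words):
    """Frequency normed to 100,000 words."""
    return instances / corpus_words * 100_000


def ratio_to_baseline(rates, baseline="ICE-GB"):
    """Express each normed rate as a ratio of the baseline corpus, to one decimal."""
    return {corpus: round(rate / rates[baseline], 1) for corpus, rate in rates.items()}


# Normed spoken-mode rates as reported in Table 1 (per 100,000 words).
spoken = {"ICE-GB": 4.9, "ICE-SG": 6.6, "ICE-NZ": 9.3}
print(ratio_to_baseline(spoken))  # {'ICE-GB': 1.0, 'ICE-SG': 1.3, 'ICE-NZ': 1.9}

# Hypothetical word count, for illustration only (not taken from the chapter).
print(round(per_100k(31, 600_000), 1))  # 5.2
```

Dividing each normed rate by the ICE-GB rate reproduces the ratios printed after the slash in the spoken column of Table 1, which suggests that the ratios were derived from the normed rates rather than from the raw counts.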
However, mode per se does not provide an accurate measure of language use,¹³ so the separated figures for mode in Table 1 provide a skewed view of the quantitative results. Additionally, the frequency count comparison merely gives the impression that the frequency of the construction in ICE-SG lies somewhere between BE and NZE, and is thus not helpful in isolating any deviation from the other two dialects.

Table 2 breaks down the frequency distribution according to register (normed per 100,000 words). The striking aspect of the comparison is that all dialects are fairly similar in the way in which the get-causative is distributed across registers. Note that registers of the spoken mode do not necessarily contain more instances of the get-causative than the written ones.

Table 2: Frequency distribution of get-causative across registers in ICE-SG/ICE-GB/ICE-NZ per 100,000 words.

ICE-SG ICE-GB ICE-NZ Legal cross-examinations (S) 25. Business Transactions (S) 0 9.7 Social Letters (W) 19.0 Persuasive Writing (W); Social Letters (W) Broadcast Talks (S); 12.
Non-Academic Writing 1 (W) 6.9 Private Dialogues (S) 16.8 Private Dialogues (S); Reportage (W) 9.6 Private Dialogues (S) / Broadcast Discussions (S) 6.8 Business Transactions (S) 15.7 Broadcast Interviews (S) 9.0 Business Correspondence (W) 6.7 Instructional Writing: Skills & Hobbies (W) 10.5 Parliamentary Debates (S) 8.0 Public Dialogues (S) 5.0 Unscripted Speeches (S) 9.7 Unscripted Speeches (S) 6.3 Legal CrossExaminations (S) / Class Lessons (S) 4.7 Demonstrations (S) Broadcast Talks (S) 4.8 Non-broadcast Talks 4.6 Spontaneous Commentary (S) 8.4 Instructional Writing (both types) (W) 4.7 Broadcast Interviews (S) 4.5 Class Lessons (S); Creative Writing (W) 8.0 Broadcast Discussions (S) 4.5 Business Letters (W) Instructional Writing: 3.3 Administrative Writing 7.1 (W) Class Lessons 4.4 Reportage (W) 2.4 Business Transactions (S) 4.2 Creative Writing (W); Academic Writing (W) 2.3 Reportage (W) Creative Writing (W) 2.4 Legal Cross Examinations (S) 4.5 Non-academic Writing (W) 1.2 Non-academic Writing 4.0 Academic Writing (W) 1.1 Parliamentary Debates (S) 4.1 Broadcast Discussions (S) 3.7 Non printed Student Essays (W) 3.1 Academic Writing 1.6 Legal Presentations (S); Non-broadcast talks (S); Demonstrations (S); Spontaneous Commentary (S); Broadcast News (S); Student Essays (W); Exam Scripts (W); Business Letters (W); 0.0 (W)=Writing; (S)=Speech; Class Lessons (S); Parliamentary Debates (S); Broadcast News (S); Spontaneous Commentary (S); Demonstrations (S); Legal Presentations (S); Social Letters (W); Instructional Writing (W); Persuasive Writing (W) 0.0 8.7 Non academic Writing: 6.6 Tech (W) 6.4 Broadcast Interviews (S); Exam Scripts (W); 0.0 Business Letters (W); Persuasive Writing (W)

The data appears to suggest that registers which inherently contain more events with directives and (inter)personal situations (perhaps personal style) are likely to contain more of the constructions.
For example, Business Transactions (which contain many situations where A wants B to do C, e.g., server/customer in a shop) ranked highly for ICE-GB and ICE-NZ, and Private Dialogues ranked highly in all three corpora. This may also explain why Academic Writing,¹⁴ which avoids personal, involved styles, ranked so low across all dialects. The stigmatisation of get may also be a factor contributing to the observation that more formal registers did not contain many instances. There may also be idiosyncratic preferences, which could explain why the register Legal Cross-Examinations in ICE-SG was ranked so high. But this could also be skewing caused by low absolute numbers, as word numbers in some registers were much higher than in others. For example, Private Dialogues contained around 250,000 words, but Legal Cross-Examinations were represented by only 20,000 words.

Across the dialects, it is clear that the get-causative is more widespread across registers in ICE-NZ. The data may also suggest that style is generally more informal and personal across registers in NZE; however, establishing this requires further study beyond the scope of this one. The results show that accuracy of analysis would profit from adopting a register variable-focused approach (Conrad & Reppen, 1998), as there are strong regularities across registers that appear to be consistent across dialects.¹⁵ For our purpose, a higher degree of accuracy and more information about register-related preferences are not required. We merely noted that the environments where paradigmatic parallels are likely to be found are registers that contain a higher rate of directives, (inter)personal relationships and more informal styles of communication.¹⁶ We also concluded that there were no significant differences between SE and the other two dialects in the influence of discourse factors on the use of the construction.
5.2 Register-related patterns

On a related note, there is more evidence that supports the above conclusion about the parity of SE in relation to BE and NZE. We found that all three corpora were consistent in the way in which they used the get-causative in registers where style was impersonal and formal. While instances with human agents in Causer and Causee positions were found, in more impersonal and formal communicative styles, agents can be absent (unspecified), as in (10a), generic (10b) or nominal clausal or phrasal causers (10c).

(10) a. Workflow simply means getting the computer to pass information from one person in a workchain to another. (ICE-SG w2b-039 ‘Non-academic Writing’)
     b. So far, the EDB has played the role of growth advocate to the hilt, pushing hard to get the entire population to go for growth and helping the Government to focus all its policies on promoting growth. (ICE-SG w2e-005d ‘Press Editorials’)
     c. The matchmakers’ persuasive powers getting clients to lower their expectations are also being tested. (ICE-SG w2c-010)

In general, frequency information did not reveal any significant variation in the use of the get-causative across the three corpora and, significantly, did not demonstrate any usages unique to SE. We now turn to the qualitative information the corpus analysis revealed.

5.3 Semantic features of the get-causative

5.4 Agency typology & distribution

While agency is prototypically attributed to sentient entities (usually humans), causative verbs can show a preference for a particular type of entity. We found, for instance, that in all three corpora, cause is used in impersonal registers, mostly in scientific/technical documents, and causer and causee tend to be inanimate entities or clausal/phrasal nominal phrases. It is rare to find an example such as (11c), especially in impersonal registers.

(11)
     a. The presence of steam at the top probably caused the beads to fuse and over-expand and subsequently led to shrinkage of the cushion after ejection. (ICE-SG W2a-038 ‘Academic Writing - Technology’)
     b. A precaution though; fertilizer granules should not be applied too near to its tree trunk as it may cause the stem to rot. (ICE-SG W2d-013 ‘Instructional Writing’)
     c. I caused John to go. (Shibatani 1976:3)

By contrast, the typical configuration of participants in get-causative constructions has animate causers and causees, typically human beings, or metaphoric extensions of essentially sentient attributes (e.g., volition), supporting earlier observations (see Section 3.1.1). Table 4 shows the prototypical animacy profiles of the get-causative in all three corpora (note that an example like (5f) would therefore be a very rare use of the construction). Table 4 shows that ICE-SG contains no significant variation from the other subcorpora with respect to animacy configurations. At least in terms of animacy association patterns, SE appears to show no variation from BE and NZE.

Table 4: Prototypical animacy profile of causers and causees in ICE-GB, ICE-SG and ICE-NZ (in %).

               H + H    H + NH   NH + H   NH + NH
ICE-GB Get     85.7     7.1      4.8      0
ICE-SG Get     91.7     3.3      1.7      1.7
ICE-NZE Get    84.8     3.3      12.0     0

(H=human; NH=nonhuman. H includes extended animacy.)

5.5 Association Patterns

We noted in Section 3.1.1 that the get-causative carries an aspectual suggestion of difficulty, effort or a focus on success in achieving the causative event. We found that the get-causative does occur systematically with linguistic items denoting difficulty and effort, but that the extent of this differed across dialects. Linguistic items marking the difficulty or effort included modals (e.g., must) and catenative verbs, such as try to/and.
Try to/and is the most common collocate and indeed, in all three corpora, the lexical item get is the most frequent verb collocate of try to/and. The list of collocations found for the get-causative is given below:

(12) Collocations denoting difficulty/effort found in ICE-GB, ICE-SG & ICE-NZE (only the lemma of each verb is given)
     try to/and; manage to; it would be very hard to; difficult to; such an effort to; been unable to; cannot; could not; wouldn’t be able to; it is possible to; must; perhaps...can; succeed in; finally; eventually

In some instances, extended discourse context was consulted to determine whether difficulty or effort is suggested, as in (13).

(13) a. like you’d get him to do things he wouldn’t usually do but then, he’d kind of give you some security… (ICE-NZ s1a-056)
     b. Mister Warburton says national insurance doesn’t provide a rebate or any financial incentive to get people to install detectors (ICE-NZ S2b-007)

Neutral usages, e.g., (14), where no difficulty or effort was marked, were also found.

(14) a. I rung this morning. I rung your mother and she was out… and he said ‘can I get her to phone you’ and I said ‘oh no’. (ICE-NZ s1a-007)
     b. With that kind of problem I got my chief engineer to go out and start the company making test equipment that will solve this type of problem (ICE-SG s2a-043)

In (14a), there is no indication that the speaker thinks that it will be difficult to get the mother to telephone the other person; the same holds for (14b). The use of the get-causative is merely for instructional purposes. Table 5 shows the frequency of effort-coded instances in relation to neutral instances across the dialects.

Table 5: Frequency of collocation of get-causative with linguistic items suggesting difficulty or effort in achieving the causative event in ICE-GB, ICE-SG and ICE-NZ (in %).
                ICE-GB   ICE-SG   ICE-NZE
Effort-coded    64.2     41.7     24.0
Neutral         35.8     58.0     76.0

As Table 5 shows, ICE-GB had the highest proportion of instances involving difficulty or effort (64.2%), and ICE-NZ the lowest. The results for ICE-SG were again in between the two other dialects so, once again, ICE-SG shows no sign of deviation from expected usages of the get-causative. While other cause verbs can and do co-occur with items denoting effort and difficulty, none produced quantitative results indicating as strong a degree of association as for get-causatives. We concluded that there was a pattern of grammatical association between the get-causative and events in which the causer is viewed as having to exert some effort in obtaining the caused event.

5.6 Finding paradigmatic parallels to the get-causative

A supplementary search using the effort-coded association pattern was conducted to determine whether other possible cause verbs were also associated with difficulty and effort. We found no deviation from expected usages in ICE-GB and ICE-NZ. In ICE-SG, an instance (15) was found where the speech act verb ask was used in what appeared to be a function typically filled by the get-causative.

(15) T: I didn’t know that last time I bought so such a small policy
     C: Oh yes I am trying to ask you to buy a bigger policy you say no no no (ICE-SG S1b-075)

C is an insurance salesman and T a customer. During the exchange, C is attempting to sell T a more expensive life insurance policy by giving the latter reasons for doing it. T keeps asking questions about the offer. C says ‘…I am trying to ask you to buy…’. There are two interesting points about this instance of ask. The utterance reports on an indirect speech act which was uttered with some apparent effort, and not the effort involved in effecting an indirect causative act.
The same use of ask in (16) refers only to the difficulty of producing the question verbally:

(16) My Italian is not good, I tried to ask you to pardon me.

Another point about this example is that C may have been referring to his efforts in the past and not in this conversation. At least in the segment recorded for use in ICE, there is no record of T having said ‘no no no’. Some SE speakers do not make use of tense-marking if time is marked in some other way, e.g., (17).

(17) That day you know you’re trying to tell me to do both by December…I mean I know it’s necessary but it’s (ICE-SG S1b-029a)

In (17), the event happened in the past, as marked by that day, but the present progressive form are trying is used. If indeed (15) is another instance of this, the evidence is even stronger that ask is developing another function in SE. What is significant is that no instance involving the use of speech verbs as neo-causatives was found in ICE-GB and ICE-NZ.¹⁷ However, since only one instance of causative ask was found, we need to do further work to determine whether this is a development for ask or whether other speech verbs are involved (some instances of causative call were also found).

6. Conclusions

In our investigation of the get-causative in SE, we exploited grammatical association patterns to determine whether paradigmatic parallels to the get-causative can be found in that dialect. As the results showed, the quantitative analysis did not indicate any significant variation in SE use of the get-causative. Register-related usages of the construction in SE also showed no deviation from the findings for BE and NZE. We are cautious in our conclusion that there are no register differences in SE, as the sample size (each corpus is one million words) may have been too small to expose any subtle variations. As our primary interest is in semantic variation, we do not believe this point damages our account of the use of the construction in standard SE.
The more interesting methodological point concerns the paradigmatic parallel we may have found in the speech verb ask. If this is really an instance of paradigmatic replacement, then frequency of occurrence is not necessarily always a true indication of what features are used in a dialect.¹⁸ It was noted by Biber (1995) that a feature that is distinct to a particular dialect need not occur frequently in a corpus to be significant. However, this raises the issue of how one can distinguish a significant occurrence from a merely non-patterned one. Clearly, more work needs to be done on this methodological issue, and on finding paradigmatic parallels in general. With respect to the methodology we did use, we found our objective of accounting systematically for the features found in usages of the get-causative to be a useful approach to cognitive (including semantic-based) investigations using corpora. Of course, much more could be done to fine-tune the methodology we used; however, we hope to have raised some issues relating to the difficulties of finding paradigmatic parallels in corpus study.

Notes

1 The authors would like to acknowledge the UK Economic and Social Research Council for sponsorship of the research project upon which this study is based (grant no. R000223787; PI: Debra Ziegeler).
2 A search for the form [NP1CAUSER-CAUSE VERB-NP2CAUSEE-[(to)VERB COMPLEMENT]].
3 We used the Wellington Corpus (spoken and written) at the presentation given at the Corpus Linguistics 2003 Conference in Lancaster, UK, because ICE-NZE was unavailable to us at the time.
4 This book is based on Fowler’s Modern English Usage.
5 Hundt (cf. Denison 1998) mentions the possibility that the get passive may be more stigmatized in BE than in American English, hence the lower number recorded in her corpus study.
6 Subsequent surveying and comments by speakers of BE suggest that age may be a variable in determining frequency of use. Informants who are generally younger are less likely to observe the prescriptive rule, or are even unaware of its existence.
7 Talmy (2000:531) includes the get-causative as one example of ‘caused agency’ or inducive causatives. Talmy (2000:534) also notes that in the use of the get-causative, the final event which the Causer induces is considered to be desirable to some involved entity, but we did not use this point in our corpus study.
8 A list is given in the Collins COBUILD grammar patterns. - 1: Verbs. (1996).
9 See Ho & Platt (1993), amongst others, for an introduction to Singapore English.
10 Hundt (2001), for example, cites Denison (1998) claiming that the get passive occurs more frequently in AE than in BE, perhaps because of less pressure from prescriptive bias in the former dialect.
11 From personal observation. If true, this observation may hold for Australian English as well. Both claims need substantiation.
12 The total number of instances in each sub-corpus is not high, so one can question whether the quantitative differences can support any statement about cross-dialectal variation. However, fewer instances of other periphrastics (with the exception of the make-causative) were found in all corpora. It may mean that the sample size of one million should be increased for studies involving periphrastic causatives.
13 “Speech and writing are not homogeneous types but stereotypes. Speech is generally informal (face-to-face) and written, (information exposition) more formal.” (Biber 1988:36)
14 For Academic Writing and Non-academic Writing registers, the four subtypes, Humanities, Social Sciences, Natural Sciences and Technology, were collapsed.
15 For the multiple-dimensional approach, see Biber (1995, 1988); Biber et al (1998), which we could refer to if we were interested in register distribution. However, this investigation does not target situationally-induced variation.
16 In subsequent research, for instance, we examined Internet-based communications such as emails and newsgroups, in which participants generally use informal styles.
17 (We use the term neo-causative to refer to new members which may be used within the causative verb paradigm that we have described.) We have found some of these in Internet registers, in informal styles, typically by writers in younger age groups and American English.

References

Bao, Z. (1995), Already in Singapore English. World Englishes. 14, 2, July, 181-188.
Bao, Z. (2001), The Origins of Empty Categories in Singapore English. Journal of Pidgin and Creole Languages. 16, 2, 275-319.
Bao, Z. (2002), The aspectual system of Singapore English and systemic substratist explanation. (unpublished)
Barlow M. and S. Kemmer (1999), Usage-based models of language. Cambridge: Cambridge University Press.
Biber D. (1999), Language use through corpus-based analyses, in: Barlow M. and S. Kemmer (eds.) Usage-based models of language. Cambridge: Cambridge University Press, pp. 287-313.
Biber D. (1995), Dimensions of Register Variation. Cambridge: Cambridge University Press.
Biber D. (1988), Variation Across Speech and Writing. Cambridge: Cambridge University Press.
Biber D., S. Conrad and R. Reppen (1998), Corpus Linguistics. Investigating Language Structure & Use. Cambridge, UK: Cambridge University Press.
Collins COBUILD grammar patterns. - 1: Verbs. (1996), London: HarperCollins [for] the University of Birmingham.
Fowler, H.W. (1996), The New Fowler's modern English usage / first edited by H.W. Fowler. 3rd ed. Oxford: Oxford University Press.
Givón T. (1979), On understanding grammar. New York: Academic Press.
Greenbaum S. (ed.) (1996), Comparing English worldwide. Oxford: Clarendon Press.
Ho, M. and J. Platt (1993), Dynamics of a contact continuum. Singaporean English. Oxford: Clarendon Press.
Hollmann W. m.s. Towards a truly dynamic usage-based model: The case of periphrastic causative get.
Hundt M. (2001), What corpora tell us about the grammaticalisation of voice in get-constructions. Studies in Language. 25:1.
Hunston S. and S. Francis (1999), Pattern grammar. A corpus-driven approach to the lexical grammar of English. Amsterdam/Philadelphia: John Benjamins.
Kemmer, S. and A. Verhagen (1994), The Grammar of Causatives and the Conceptual Structure of Event. Cognitive Linguistics. 5-2, pp. 115-156.
Nicholson, M. (1957), A dictionary of American-English usage, based on Fowler's Modern English Usage. New York.
Shibatani M. (1976a), The Grammar of Causative Constructions: A Conspectus. Chapter in Shibatani M. 1976b.
Shibatani M. (ed.) (1976b), Syntax and Semantics. The Grammar of Causative Constructions. New York: Academic Press.
Sinclair J. (ed.) (1993), Collins COBUILD English language dictionary.
Sinclair J. (1991), Corpus concordance collocation. Oxford: Oxford University Press.
Talmy, L. (2000), Towards a cognitive semantics. Cambridge, Mass.; London: MIT Press.
Talmy L. (1976), Semantic Causative Types. In Shibatani 1976b, pp. 43-116.
Ziegeler, D. (2000), Hypothetical modality: grammaticalisation in an L2 dialect. Amsterdam: Benjamins.

The curse and the blessing of mobile phones – a corpus-based study into American and Polish rhetorical conventions

Agnieszka Leńko-Szymańska
University of Łódź

Abstract

The study reported in this chapter applies a corpus-based methodology to investigate rhetorical strategies used by American and Polish apprentice writers in their argumentative essays. The data was collected from 79 Polish first-year students of English and their 80 American counterparts, freshman non-English majors. The simplest method was chosen to analyse the discrepancies between the two groups of essays: the comparison of wordlists, also known as keyword analysis.
The study revealed interesting textual differences between American and Polish essays which pertain to such rhetorical strategies as the choice of general versus experience-related arguments, the level of formality and the use of structuring devices.

1. Introduction

Learner corpora have recently become an important source of data in second language acquisition studies. Samples of learners’ written and spoken L2 production are collected with the aim of describing as accurately as possible various characteristics of interlanguage. The main interests of researchers revolve around the differences between native and non-native linguistic systems. Thus, investigations focus on such linguistic features as lexical, grammatical and syntactic factors (Meunier, 1998). However, the differences between native and non-native production also occur at the macrolinguistic level and are related to the ways discourse is structured by both groups of language users. Such discrepancies reflect broadly understood cultural differences and are studied within the framework of contrastive rhetoric. The claims of contrastive rhetoric can be summarised as follows:

    Contrastive rhetoric maintains that language and writing are cultural phenomena. As a direct consequence, each language has rhetorical conventions unique to it. (Connor, 1996: 5)

When writing in a foreign language, learners show a tendency to transfer not only the linguistic features of their native tongue but also its rhetorical conventions. These conventions pertain to such factors as the structure or units of texts, explicitness, information structure, politeness and intertextuality (Myers, 2002). As a result, native speakers of a language may find learners’ written discourse ineffective or even incomprehensible.
So far, research in contrastive rhetoric has rarely applied corpus-based and quantitative methods to support its claims (see Anderson, 2001 for a counterexample which, however, is the exception that proves the rule). Its findings are usually derived from the qualitative analysis of example texts. For example, Duszak’s (1994) study of the differences between English and Polish intellectual styles as reflected in the introductions to academic papers is based on a selection of 20 Polish and 20 English articles in the field of linguistics, which are exploited only as a source of examples to support the points the author makes, with no attempt to quantify the results. The qualitative approach undoubtedly has its merits in allowing a detailed scrutiny of the exact rhetorical strategies applied by writers. However, supplementing such methods with quantitative data can make the analysis more reliable and justifiable.

The aim of this paper is to demonstrate that research in contrastive rhetoric can benefit from the application of corpus-based and quantitative methodology. In the same fashion, adopting the text approach rather than the language approach to the analysis of corpus data (Scott, 2000) can lead to totally unexpected and very revealing insights. In the study reported here, such an approach and such a methodology are applied to investigate the differences in the rhetorical strategies used by American and Polish apprentice writers in their argumentative essays.

2. Data

The research reported in this chapter is in fact a by-product of a large-scale project, whose aim is to compile and explore a Polish learner corpus of written English. The corpus consists mainly of argumentative (but also narrative, descriptive and quasi-academic) essays written by Polish advanced learners of English at varying stages of proficiency in L2.
As is the case with most learner corpus studies, the explorations within the project focus on the lexical, grammatical and syntactic characteristics of learners’ interlanguage. For one of the studies within the project, whose initial aim was the investigation of the breadth of learners’ lexical knowledge, data was collected from 79 Polish first-year students of English at the Institute of English Studies, University of Łódź, and their 80 American counterparts, freshman non-English majors at South West University in Marshall, Minnesota. In order to ensure close comparability of the samples, both groups of students were asked to write an essay in almost identical conditions (during one of their first composition classes at the university) and on the same topic: “The mobile phone – the curse or the blessing of the end of the 20th century”. The choice of this topic was fairly arbitrary, as the research questions were not related in any way to the study of attitudes or lexis pertaining to this particular area. However, during the process of coding the data it was noticed that the American and Polish essays differed greatly in their treatment of the topic. Thus, the decision was made to pursue this observation in a more rigorous way.

3. Method

Since the initial observation indicated that, while writing on the same topic, the two groups of students wrote about different issues and problems, it seemed appropriate to turn to the field of content analysis for a methodology to process the data. From among a range of tools widely used by researchers in this area, the simplest method was chosen to analyse the differences between the two groups of essays: the comparison of wordlists, also known as keyword analysis. According to Scott (2000), keywords are a good indicator of the ‘aboutness’ of a text; thus the procedure seemed an appropriate first step to pinpoint the discrepancies between the samples.
Instead of the keyword tool in WordSmith Tools, the decision was made to process the essays with Wmatrix, a corpus-analysis tool developed at Lancaster University. The advantage of this software over WordSmith Tools is that it detects recurring lexical phrases and treats them as single units in the analysis, so the final list of key items consists of both individual words and fixed phrases. Such a list is likely to give a better picture of the main topics surfacing in both groups of essays. Moreover, the generation of a keyword list in Wmatrix is based on a more reliable statistical procedure, log-likelihood chi-square, which has been shown to produce better results in establishing keywords (Rayson, 2003).

For the purpose of the keyword analysis, a wordlist was first created for each group of essays. Next, Wmatrix compared the two wordlists against each other and produced a list of overused words and phrases in both subcorpora, arranged according to the log-likelihood (LL) coefficient. Only the items whose LL coefficient was above 6.6 (p<0.01) were chosen for detailed scrutiny. They made a total of 321 items, which were further sorted into those overused by the American students (145 words and phrases) and those overused by the Polish students (176 items). The complete lists of the key items in the American subcorpus and the Polish subcorpus are in the Appendix.

The cut-off point applied in this study needs some justification. Rayson (2003) claims that since the generation of a list of key items involves multiple statistical procedures, the probability of error multiplies. He recommends adopting a much higher cut-off point, at the level of 15.13 (p<0.0001), for the results to be valid and reliable. However, since the list will not be further analysed quantitatively, this argument is not relevant here.
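The keyword step described above can be sketched in Python as follows. This is a minimal illustration using the standard two-corpus log-likelihood (G2) statistic; whether Wmatrix computes its LL coefficient in exactly this way is an assumption, and the function names, the toy wordlists and the subcorpus sizes are all hypothetical rather than taken from the study.

```python
import math


def log_likelihood(freq_a, freq_b, size_a, size_b):
    """Two-corpus log-likelihood (G2) for a single word or phrase.

    freq_a, freq_b: raw frequencies of the item in subcorpus A and B.
    size_a, size_b: total number of tokens in subcorpus A and B.
    """
    total_freq = freq_a + freq_b
    total_size = size_a + size_b
    # Expected frequencies if the item were spread evenly over both subcorpora.
    expected_a = size_a * total_freq / total_size
    expected_b = size_b * total_freq / total_size
    ll = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll


def key_items(freqs_a, freqs_b, size_a, size_b, cutoff=6.6):
    """Return (item, LL, overusing subcorpus) tuples above the cut-off, highest LL first."""
    rows = []
    for item in set(freqs_a) | set(freqs_b):
        a = freqs_a.get(item, 0)
        b = freqs_b.get(item, 0)
        ll = log_likelihood(a, b, size_a, size_b)
        if ll > cutoff:
            overused = "A" if a / size_a > b / size_b else "B"
            rows.append((item, ll, overused))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical toy wordlists and subcorpus sizes, for illustration only.
polish = {"driving": 5, "the": 1500, "moreover": 40}
american = {"driving": 100, "the": 1800, "moreover": 4}
for item, ll, overused in key_items(polish, american, 25_000, 30_000):
    print(f"{item:10s} LL={ll:7.2f} overused in subcorpus {overused}")
```

In this toy data, an item with the same relative frequency in both subcorpora ("the") receives an LL of zero and is filtered out, while items that are strongly skewed towards one subcorpus pass the cut-off. Raising `cutoff` to 15.13 would apply the stricter threshold attributed to Rayson (2003).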
The cut-off point chosen for the purpose of this study is fairly arbitrary, and a decision could instead have been made to analyse the first 300 items on the list.

Both lists were examined in search of those key items which could point to the recurring themes in the subcorpora (and thus the ‘aboutness’ of the texts) as well as to other characteristic features of the two groups of essays. The examination was supported by the scrutiny of the concordance lines of the key items. Whenever a specific theme could be identified, it was recorded next to the item in the list. In some cases the same key item could belong to two different themes. For example, the examination of the word plans (Figure 1), which was key in the American corpus, revealed that the item is related to two themes, COST and CONTACT.

 1  ith a company that sell rate  plans. The usual rates range from t
 2  y all talk about their great  plans. Oh yes, everything sounds gr
 3  es allow teenagers to set up  plans with their friends and allows
 4         h cell phones are the  plans that are offered. Some of tho
 5    lso in case of a change in  plans with where your family went w
 6  r cell phone the most. These  plans do not cost a tremendous amou
 7  nce charges. Many cell phone  plans include free long distance to
 8   eight o'clock, when all you  plans for the evening have been mad
 9   used for a simple change in  plans. If you didn't have a cell ph
10  ith friends in order to make  plans. With all the things teens ha
11  urn it off after you've made  plans & enjoy the company of the pe
12  inute appointments, and make  plans while on the road. There a
13   make calls for work, dinner  plans, or having important conversa
14  reless companies offer great  plans' which are both convenient an
15  nt of travel. There are also  plans that allow you to have free m
16  y of useful ways. Many phone  plans available are so inexpensive
17  phone calls. Most cell phone  plans have a cheaper rate per minut
18  nes and lastly the different  plans that can be used that will fi
19  f they get the right kind of  plan. The plans keep getting greate
20  e the ability to get certain  plans for your phone. Let's say tha
21  ave with cell phones are the  plans that are offered. Some of tho
22  your family. Many cell phone  plans now have free nationwide long

Figure 1: Concordance lines of the word plans in the American subcorpus

In some other cases the themes could overlap. For example, the keywords related to the motif of DRIVING in the American corpus at the same time pointed to the themes of EMERGENCY or HAZARD. In addition to the thematic grouping, other clusters of key items were also observed and recorded. They were related to the purely linguistic properties of the texts and included pronouns, linking expressions, fixed phrases and the target language variety. The following two tables present the first 15 key items in the two sorted lists.

Table 1: The first 15 key items in the American subcorpus

     word          Year 1        Native         LL       THEMES and other groupings
                   raw    %      raw    %
 1   cell           46   0.17    758   2.30    639.53    American English
 2   phones         81   0.30    475   1.44    242.25
 3   phone         104   0.38    516   1.57    229.04
 4   driving         5   0.02    106   0.32     95.70    DRIVING (EMERGENCY, HAZARD)
 5   minutes         1   0.00     44   0.13     45.36    COST
 6   get_hold_of     0   0.00     36   0.11     43.67    CONTACT 35, EMERGENCY 1, fixed phrase
 7   while          24   0.09    108   0.33     43.67
 8   have          183   0.67    390   1.18     43.66
 9   someone        25   0.09     99   0.30     34.85
10   ways            2   0.01     38   0.12     33.37
11   get            30   0.11    106   0.32     32.35
12   you           342   1.25    595   1.81     30.98    pronoun (you)
13   car            15   0.05     71   0.22     30.17    DRIVING (EMERGENCY, HAZARD)
14   if            104   0.38    232   0.70     29.56
15   road            1   0.00     30   0.09     29.13    DRIVING (EMERGENCY, HAZARD)

Table 2: The first 15 key items in the Polish subcorpus

     word           Year 1        Native         LL       THEMES and other groupings
                    raw    %      raw    %
 1   mobile_phone   321   1.17     98   0.30    168.92    British English, fixed phrase
 2   of             800   2.91    502   1.53    133.55
 3   it             608   2.21    363   1.10    114.90
 4   we             263   0.96    101   0.31    107.06    pronoun (we)
 5   our            170   0.62     46   0.14     99.99    pronoun (we)
 6   nowadays        45   0.16      2   0.01     56.80    GENERAL
 7   mobile          55   0.20      6   0.02     54.73    British English
 8   which          107   0.39     35   0.11     52.48
 9   is             587   2.14    468   1.42     43.70
10   kind            43   0.16      5   0.02     41.75
11   very           124   0.45     55   0.17     41.28
12   fact            42   0.15      5   0.02     40.40
13   moreover        23   0.08      0   0.00     36.24    linking expression
14   invention       42   0.15      8   0.02     31.92    GENERAL
15   SMS             20   0.07      0   0.00     31.52    TEXTING

4. Results and discussion

The analysis led to interesting findings on the content level of the texts as well as on the purely linguistic level. Some of these differences are not surprising and can easily be explained, but some reveal unexpected discrepancies between the groups.

On the content level, the following thematic groupings were recorded for the American and Polish key items (Tables 3 and 4 respectively).
Table 3: The readily-identifiable themes in the American subcorpus

THEME                        Key items pointing to the themes
DRIVING                      driving, car, road, drive, driver, on the road, pull over, vehicle
EMERGENCY                    driving, get hold of, car, road, drive, stranded, 911, emergencies, accidents, on the road, help, ditch, weather, emergency, winter
HAZARD                       driving, road, car, drive, driver, distraction, on the road, security, pull over, vehicle, hazardous
COST                         minutes, plan, distance, service, plans, charges, free, month, roaming, rates, cards, cost, long
CONTACT                      get hold of, drive, on the road, plans, teens, store, kids, reaching
MANNERS                      distraction, service, class, movies, movie, movie theatre
TECHNICAL NETWORK PROBLEMS   service, area, reception, static, tower
FAMILIARITY                  home, family, house, Minnesota, dad, father, mom, Americans
IMAGE                        teens, cool
GADGETS                      id, weather, mail, store, entertainment, games, directions, features
GENERAL                      20th

Table 4: The readily-identifiable themes in the Polish subcorpus

THEME        Key items pointing to the themes
GENERAL      nowadays, invention, inventions, situations, human, influence, life, world, 20th century, civilisation, modern, 21st century, phenomenon, development, cultural, progress, social
TEXTING      SMS, sending, send
HEALTH       health, harmful, scientists, harm, diseases, researches
CONTACT      contact, contacts, holidays, in touch with
MANNERS      cinema, switch off, theatre
IMAGE        show off, newest, youngsters
EMERGENCY    mountains, ambulance

Tables 3 and 4 show some variation between the groups in the choice of arguments. This variation can be attributed to the cultural differences between Poland and the U.S. as well as to the differences in the lifestyle of the two groups of students. Thus, the theme of low COST comes up as one of the advantages of mobile phones in America, especially for making long-distance calls, whereas such an argument does not surface in the Polish data, since mobile phone calls are generally expensive in Poland.
For the same reason the theme of TEXTING appears only in the Polish data, as it is a popular method of using mobile phones due to the relatively high costs of mobile-phone calls. TECHNICAL NETWORK PROBLEMS constitute an important issue for the American students living in a rural and sparsely populated part of Minnesota, where the data was collected. The same issue does not surface in the Polish essays, as the problem is much less acute in Poland, especially in the urban areas where most of the students come from. Also for reasons related to the density of population in the students’ immediate surroundings, the theme of EMERGENCY use of a mobile phone is much more strongly emphasised in the American data and is almost always connected with emergencies on the road. The prototypical emergency use of a mobile phone for the Polish students is associated with accidents in the mountains. The differences in lifestyle are reflected by the fact that in the American data the most frequent theme is DRIVING (an experience that most Polish 19-year-olds do not yet have, at least not on a regular basis), whereas the keyword mountains connected with the theme of EMERGENCY in the Polish subcorpus can be explained by the fact that trekking in the mountains is a popular holiday activity among Polish students.

Although the findings mentioned above specify the differences between the two groups of essays, they hardly reveal any interesting facts about the subcorpora, since such differences could only be expected. There is nothing surprising in the fact that the students’ essays reflect the characteristic features of the setting in which they were written. However, further scrutiny of the keyword lists brings to light a deeper and more intriguing variation between the two subcorpora.
While the American keyword list contains many items pointing to several readily identifiable and recurring themes which can be easily explained by factors related to culture, setting and lifestyle, the Polish list contains very few such identifiable and explainable themes. With the exception of the theme of TEXTING as a convenient way of contacting people, there are few items in the Polish list whose keyness can be attributed to cultural or environmental factors. On the other hand, one of the most frequently recurring themes in the Polish data is the concept of HEALTH, represented by items such as health, harmful, and diseases. Although there is considerable research evidence alerting mobile-phone owners to the potential health risks related to the use of this device, it is hard to assume that the students themselves or the people in their immediate surroundings actually suffer from such health problems. Another frequent theme recurring in the Polish data is labelled GENERAL and is composed of items such as inventions, human, and development. It shows that the Polish students tend to discuss the problem in fairly general terms. On the other hand, the American list includes the names of family members (father, dad, and mom), and other items pointing to FAMILIARITY such as house, home, Minnesota and Americans. The variation in the choice of general versus experience-based arguments seems to be the most striking difference between the Polish and the American data. While the American students answered the essay question by making reference to their own life experience, such as driving or technical network problems, the Polish learners talked about civilisation and health hazards.

The discrepancies discussed in the previous paragraphs are further supported by the examination of another group of keywords found in the subcorpora: pronouns. Table 5 lists the key pronouns occurring in the two lists.
Table 5: Pronouns in the American and Polish keyword lists.

Pronouns in the American keyword list: you, my, I, they, she, her
Pronouns in the Polish keyword list: we, our, us

The American key pronouns indicate that the students talk about their own experience (I, my) or refer to real people (she, her), whereas the Polish students make frequent use of the generic WE (we, our, us) to support the generalisations they make. All these findings point to the fact that the Polish students approach the topic on a more general level, and the American students relate to their own experience in tackling the problem. One explanation of such variation could be that the Polish learners, unlike the American students, did not own mobile phones at the time of writing. Unfortunately, no data is available on this issue, since the essays were collected with an entirely different purpose; yet this explanation seems highly unlikely, because the mobile phone is a commonplace device among Polish teenagers. Thus, it can be claimed that the observed differences reveal the discrepancies in the rhetorical strategies used by American and Polish apprentice writers. These discrepancies pertain to the choice of general versus experience-based arguments. Such an explanation accords with Kaszubski’s (1997) findings. He demonstrated that Polish apprentice writers tend to overuse abstract nouns of reference, which, he concluded, may imply that Poles are “particularly prone to make sweeping generalisations when writing in English” (1997: 155).

Another interesting observation can be made based on the examination of the keyword lists. The keyness of the pronoun you in the American data indicates that the American students adopt an informal style in their essays which allows them to address the reader directly.
On the other hand, the Polish keyword list abounds in linking expressions (Table 6), which could indicate that the Polish students use a more formal style and pay more attention to the structure of their essays. However, there is also an alternative explanation for the high number of linking expressions in the Polish data. Milton (1998) and de Cock (2000) observed that in structuring their essays learners of English tend to over-rely on a small set of linking expressions promoted by textbooks and teachers. Thus, the keyness of linking expressions can also be attributed to the learning strategies rather than the rhetorical strategies employed by the learners. At the moment, it is impossible to establish which explanation – an L2 learning strategy or transfer of an L1 rhetorical strategy – is responsible for the high representation of linking expressions in the Polish data. The examination of comparable essays written by Polish students in their native tongue could be revealing in this respect, but unfortunately such data is unavailable at the moment.

Table 6: Linking expressions in the Polish keyword list.

Linking expressions: moreover, on the other hand, however, apart from, first of all, for instance, thus, furthermore, nevertheless, sum up, what is more, not only, firstly, on the one hand, as far as, for example

5. Other findings

The examination of the keyword lists on the linguistic level also reveals some interesting differences between the subcorpora. One of the discrepancies is related to the standard of English used by both groups of students. The Polish list of keywords contains items such as mobile and cinema, whereas the American list contains items such as cell, cellular and movie theatre.
This points to the fact that Polish learners of English tend to use the British variety of English, which can be explained by the fact that the vast majority of EFL materials available in Poland are produced in Britain and obviously promote the British English standard. It is perhaps worth pointing out that teaching materials appear to have more influence on learners’ language than other sources of authentic language available outside the classroom, such as films or music, which are dominated by the American variety of English.

A further interesting observation is related to the native/non-native language characteristics expected to surface in the keyword lists. It was assumed that the degree of idiomaticity (understood in the broader Sinclairian terms) would be higher in the native data, and that the American keyword list would contain more fixed phrases than the Polish one. However, quite the opposite was found to be true (Table 7).

Table 7: Fixed phrases in the American and Polish keyword lists.

Phrases in the American keyword list: get hold of, on the road, look at, having to, pick up, pull over, some one, movie theatre, going off, too many, going to, so many
Phrases in the Polish keyword list: mobile phone, on the other hand, mobile phones, apart from, first of all, for instance, point of view, by means of, sum up, what is more, in my opinion, switch off, 20th century, not only, 21st century, phone box, no longer, show off, so called, turn out, more and more, of course, such a, on the one hand, thanks to, as far as, in touch with, for example

Such unexpected results are in fact in tune with de Cock’s (2000) findings. Contrary to the popular belief that one of the characteristic features of interlanguage is its lack of idiomaticity, she found that learners overuse two-, three- and four-word expressions both in writing and speech. Moreover, her further qualitative analysis of learners’ fixed phrases revealed that ...
advanced learners’ use of frequently recurring sequences of words displays a complex picture of overuse, underuse, misuse and use of idiosyncratic sequences, which may well play a significant part in the foreign-soundingness of their speech and writing (de Cock, 2000: 65)

A similar complicated pattern of overuse can be observed in the two keyword lists. The Polish list consists mainly of linking expressions whereas most of the American key phrases are phrasal verbs. Thus, this study supports the claim of the complex nature of learners’ use of idiomatic expressions.

6. Conclusions

The study has demonstrated that the corpus-based and quantitative methodology can be of value to the field of contrastive rhetoric producing interesting results and justifiable claims. It has also been shown that adopting the text rather than the language approach to the analysis of corpus data can bring totally unexpected, but very revealing insights. Specifically, it has been demonstrated that the two subcorpora exploited in the study contain more differences than could be anticipated before collecting the data. The study has also revealed important textual differences between American and Polish argumentative essays written in English by apprentice writers. These differences pertain to such rhetorical strategies as the choice of general or experience-related arguments, the level of formality and the use of structuring devices. These differences are not a result of ‘nativeness’ and ‘non-nativeness’ in language use but represent deeper rhetorical conventions existing in the two cultures.

References

Anderson, W. (2001), Discourse-based diversity: a corpus analysis of collocation in European Union and national French Administrative Language. Paper presented at the annual meeting of the British Association for Applied Linguistics. September.
de Cock, S. (2000), Repetitive phrasal chunkiness and advanced EFL speech and writing, in: C. Mair and M. Hundt (eds.) Corpus Linguistics and Linguistic Theory. Amsterdam-Atlanta, GA: Rodopi, pp. 51-68.
Connor, U. (1996), Contrastive Rhetoric: Cross-cultural Aspects of Second Language Writing. Cambridge: Cambridge University Press.
Duszak, A. (1994), Academic discourse and intellectual styles. Journal of Pragmatics 21:191-313.
Kaszubski, P. (1997), Polish student writers – Can corpora help them?, in: B. Lewandowska-Tomaszczyk and P.J. Melia (eds.) PALC’97. Practical Applications of Language Corpora. Łódź: Łódź University Press, pp. 133-158.
Meunier, F. (1998), Computer tools for the analysis of learner corpora, in: S. Granger (ed.) Learner English on Computer. London and New York: Longman, pp. 19-38.
Milton, J. (1998), Exploiting L1 and interlanguage corpora in the design of an electronic language learning and production environment, in: S. Granger (ed.) Learner English on Computer. London and New York: Longman, pp. 186-198.
Myers, G. (2002), Contrastive rhetoric and academic discourses: an institutional view. Paper presented at the 2nd International Contrastive Linguistics Conference. October.
Rayson, P. (2003), Matrix: A statistical method and software tool for linguistic analysis through corpus comparison. Unpublished PhD thesis, Lancaster University.
Scott, M. (2000), Focusing on the text and its key words, in: L. Burnard and T. McEnery (eds.) Rethinking Language Pedagogy from a Corpus Perspective. Frankfurt: Peter Lang, pp. 103-122.

Software:
Wordsmith Tools, Scott, M.
Wmatrix, Rayson, P.

Appendix A: Key items in the American corpus
1. cell  2. phones  3. phone  4. driving  5. minutes  6. get_hold_of  7. while  8. have
9. someone  10. ways  11. get  12. you  13. car  14. if  15. road  16. plan
17. my  18. call  19. drive  20. i  21. driver  22. stranded  23. on  24. now
25. used  26. 911  27. distance  28. emergencies  29. they  30. could  31. out  32. accidents
33. are  34. believe  35. person  36. good  37. then  38. distraction  39. talk  40. a
41. on_the_road  42. service  43. around  44. when  45. use  46. go  47. hassle  48. look_at
49. home  50. plans  51. she  52. there  53. family  54. help  55. area  56. caller
57. security  58. today  59. class  60. house  61. reached  62. convenience  63. seen  64. Minnesota
65. calls  66. charges  67. loved  68. movies  69. positives  70. reception  71. teens  72. many
73. into  74. having_to  75. pick_up  76. been  77. free  78. along  79. college  80. cool
81. ditch  82. downside  83. id  84. location  85. missed  86. month  87. overall  88. pull_over
89. purchase  90. roaming  91. static  92. three  93. vehicle  94. weather  95. talking  96. emergency
97. movie  98. here  99. things  100. another  101. 20th  102. blessings  103. causing  104. contract
105. dad  106. mail  107. movie_theatre  108. rates  109. some_one  110. store  111. women  112. else
113. needs  114. father  115. away  116. cellular  117. entertainment  118. mom  119. though  120. games
121. kids  122. her  123. problem  124. after  125. americans  126. cards  127. cost  128. directions
129. going_off  130. hazardous  131. item  132. negatives  133. reaching  134. responsibility  135. too_many  136. tower
137. winter  138. long  139. allow  140. going_to  141. something  142. regular  143. down  144. features
145. so_many

Appendix B: Key items in the Polish corpus

1. mobile_phone  2. of  3. it  4. we  5. our  6. nowadays  7. mobile  8. which
9. is  10. kind  11. very  12. fact  13. moreover  14. invention  15. SMS  16. mobiles
17. on_the_other_hand  18. us  19. mobile_phones  20. matter  21. however  22. apart_from  23. inventions  24. some
25. situations  26. as  27. health  28. comfortable  29. imagine  30. useful  31. human  32. such
33. small  34. first_of_all  35. for_instance  36. whether  37. short  38. influence  39. mentioned  40. owners
41. contact  42. quite  43. life  44. owner  45. sides  46. addicted  47. point_of_view  48. thanks
49. often  50. necessary  51. world  52. difficult  53. real  54. aware  55. thus  56. unfortunately
57. simply  58. by_means_of  59. cinema  60. concerned  61. furthermore  62. nevertheless  63. sum_up  64. undoubtedly
65. what_is_more  66. businessmen  67. in_my_opinion  68. harmful  69. sending  70. switch_off  71. 20th_century  72. using
73. device  74. scientists  75. its  76. addiction  77. civilisation  78. contacts  79. harm  80. importance
81. not_only  82. lots  83. especially  84. telephone  85. arguments  86. claim  87. advantages  88. modern
89. cells  90. control  91. theatre  92. 21st_century  93. firstly  94. phenomenon  95. phone_box  96. regarded
97. surely  98. whose  99. that  100. send  101. development  102. no_longer  103. living  104. rather
105. under  106. arises  107. basic  108. crucial  109. effect  110. groups  111. hardly  112. irreplaceable
113. knowledge  114. posses  115. present  116. private  117. show_off  118. so_called  119. somehow  120. turn_out
121. youngsters  122. more_and_more  123. of_course  124. not  125. such_a  126. happy  127. nobody  128. normal
129. ordinary  130. mention  131. consider  132. among  133. constant  134. cons  135. pros  136. opinion
137. want  138. appears  139. consideration  140. cultural  141. diseases  142. divided  143. exist  144. famous
145. gadget  146. holidays  147. mountains  148. newest  149. on_the_one_hand  150. opponents  151. precious  152. produce
153. producers  154. regard  155. researches  156. thanks_to  157. forget  158. say  159. against  160. disadvantage
161. possessing  162. progress  163. social  164. various  165. ambulance  166. as_far_as  167. mean  168. in_touch_with
169. makes  170. latest  171. situation  172. important  173. no  174. for_example  175. advantage  176. available

Using a dedicated corpus to identify features of professional English usage: What do “we” do in science journal articles?

Judy Noguchi, Thomas Orr and Yukio Tono

Mukogawa Women’s University, University of Aizu, Meikai University

Abstract

This chapter addresses the problem of inadequate educational materials for the effective training of non-native speakers in the professional English of the scientific community. We claim that one cause of this inadequacy is lack of proper linguistic research based on suitable linguistic research tools. The development of a major international corpus of professional English in the sciences and other fields is described as a significant resource for solving this problem, illustrated with a specific research project examining the use of “we” in professional English from the Corpus of Professional English. Results reveal the value of well-designed dedicated corpora for addressing the specific English instructional needs of non-native speakers of English in the professions.

1. Introduction

Although English is the language of preference for science in international contexts, the majority of the world’s scientists are not native speakers of English and, therefore, they can be said to be considerably disadvantaged in professional English communication. Language teachers may have introduced non-native speakers (NNS) to the sounds, symbols, and structure of English for general purposes such as asking for directions, reading a menu, and writing classroom essays. However, few NNS of English have ever enjoyed the luxury of specialised training in the spoken and written discourse of their profession.
In general, this means that they have had to pick this English up on their own, at considerable effort, with disappointingly limited results (Coates et al. 2002). This situation is terribly unfortunate, for it not only severely marginalises NNS in their scientific communities, but it also prevents the world from benefiting from the creative potential of those who cannot disseminate their ideas persuasively because of poor English.

Basically, there are two reasons for this unfortunate situation. One is the scarcity of language teachers who possess expertise in the English of science, and the second is the scarcity of instructional materials that adequately explain professional English text in scientific contexts. The English of research articles and technical specifications is very different from the English of storybooks and street signs, and mastery of this English does not come easily. Native-speaker intuition or professional work experience alone is not enough to generate proper linguistic insight, and training materials based on “research” of this kind tend to be disappointingly vague and fraught with inaccuracies. Teachers and students of English in the sciences require more substantial materials based on far more rigorous research, enabled by far better research tools. The development of computer-based corpora, along with sophisticated software tools to analyse them properly, has now started to make this kind of research possible.

2. The value of specialised corpora

The advent of computers, computer-based corpora, and the whole new field of corpus linguistics is now beginning to provide some satisfactory solutions to the problems mentioned above. By collecting discourse samples in the linguistic domains requiring study, corpus linguists can now begin to identify features of language that were beyond the scope of thorough observation in the past.
This development is welcome news for teachers and students of scientific English, for scientific genres, with all of their peculiarities, can now be studied on grander scales to generate far more objective and reliable data. General corpora, such as the British National Corpus (BNC), provide the means for studying how English is used in general contexts, while the development of specialised corpora provides the means for studying how English is used in special contexts. For rigorous investigations of English in the theoretical and applied sciences, a very special corpus is required, dedicated to the language of these professional communities. Gathering a large enough collection of suitable discourse to develop a dedicated corpus of this kind, however, has not been an easy task.

The value of corpora of scientific English on a small scale has been suggested in the past (Johns, 1991; Bondi, 2001), one good example being a corpus of physics research articles and their parallel academic conference presentations developed by Umesaki (2000). Using this corpus, Umesaki explored the variety of referents to the writer in academic papers, which she identifies as “one of the difficulties for non-native speakers of English in writing academic papers” requiring greater educational attention based on findings from corpus research. Although work of this nature is being conducted in various disciplinary fields on small scales, there remains a need for a major dedicated corpus of English for studying professional English in the theoretical and applied sciences. This would provide a referent corpus against which these smaller corpora could be compared and also contribute to a better understanding of how English is used by professionals internationally when they communicate with each other.

3. The construction of the Corpus of Professional English, CPE

The creation of a major international corpus of English in the sciences and other professions to aid research in professional discourse is a serious undertaking, and a new organisation has recently been established to take on this challenge. Called the Professional English Research Consortium (PERC), this non-profit academic organisation, headquartered in Tokyo, was established in April 2002 to create such a corpus and generate research in professional English that might be of particular benefit to NNS in the professions as well as to the educators and material developers that support them.

The corpus, named simply the Corpus of Professional English (CPE), aims to become a 100-million word balanced corpus of Professional English. It is designed for various research purposes, ranging from pure research on Professional English (e.g. variations of usage in lexis, syntax, semantics, and discourse across text types) to applied research such as lexicography, language testing, educational materials and program development. The CPE will be serviced by a sophisticated web-based query system so that those who are not familiar with corpora can extract the linguistic features they need in a user-friendly manner.

The design scheme of the CPE is shown in Table 1. The ultimate goal of the CPE is to achieve a 100-million word written corpus, composed of a reasonable balance of text types. The balance of each text type will be further examined in the future, but tentatively it was decided that the following ratio of text types seemed most suitable for representing professional English: academic journals (30%), legal/workplace documents (20%), trade journals (10%), reference books (10%), websites (10%), newsletters (5%), correspondence (5%), manuals (5%) and ephemera (5%).
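As a rough illustration of what this tentative balance implies, the short Python sketch below converts the percentages into target word counts for a 100-million word corpus. The percentages and the overall target come from the text; the function and variable names are hypothetical, and the counts are only the arithmetic consequence of the stated ratios, not figures reported by the project team.

```python
TARGET_WORDS = 100_000_000  # planned size of the CPE, from the text

# Tentative share of each text type in per cent, as listed in the text.
BALANCE = {
    "academic journals": 30,
    "legal/workplace documents": 20,
    "trade journals": 10,
    "reference books": 10,
    "websites": 10,
    "newsletters": 5,
    "correspondence": 5,
    "manuals": 5,
    "ephemera": 5,
}


def target_word_counts(total_words, balance):
    """Translate percentage shares into target word counts per text type."""
    if sum(balance.values()) != 100:
        raise ValueError("text-type shares must add up to 100%")
    return {text_type: total_words * share // 100
            for text_type, share in balance.items()}


for text_type, words in target_word_counts(TARGET_WORDS, BALANCE).items():
    print(f"{text_type:<26}{words:>12,}")
```

The check on the sum of the shares guards against a revised balance that no longer adds up to the whole corpus.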
At present, the project team is focusing on the collection of academic journal articles, for this is the text type for which copyright permission frequently proves most difficult to obtain, and yet it is the one needed most by researchers in PE.

Table 1: Tentative balance of text types for the CPE

Academic journals       30%
Legal/workplace docs    20%
Trade journals          10%
Reference books         10%
Websites                10%
Newsletters              5%
Correspondence           5%
Manuals                  5%
Ephemera                 5%

In order to select the journal texts in an objective way, the project team decided to base content decisions on data obtained from Journal Citation Reports (JCR), which presents quantifiable statistical data for an objective and systematic approach to determining the relative importance of journals within their subject categories. At present, the JCR contains 5,700 journals in the Science Edition and has a unique measure called “Impact Factor”, which provides a way to evaluate or compare a journal’s relative importance to others in the same field. Employing this data, the top 20% of the journals with the highest impact factor in each field were selected for inclusion in the CPE. JCR classifications were also used to define the subject fields.

Acquiring texts for a corpus is always difficult but not impossible. Following procedures established by the creators of the BNC, the CPE project team sent letters to major journal publishers in order to obtain copyright permission for more than 1,500 journals. As is often the case, the first letter did not yield a sufficient response; however, much more favourable responses were obtained after the launching of the PERC website, which explained the CPE in greater detail than the letters. At present, copyright clearance has been obtained for almost 300 journals from over 50 publishers.
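The journal selection step described above (the top 20% of journals by impact factor within each JCR subject field) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the text does not say how fractional counts were rounded or how ties were handled, so rounding up and alphabetical tie-breaking are assumptions made here, and the journal titles and impact factors are invented for the example.

```python
from collections import defaultdict


def select_top_journals(journals, percent=20):
    """Pick the highest-impact journals in each subject field.

    journals: iterable of (title, field, impact_factor) tuples
    percent:  share of each field to keep, rounded up (an assumption)
    """
    by_field = defaultdict(list)
    for title, field, impact_factor in journals:
        by_field[field].append((impact_factor, title))

    selected = {}
    for field, entries in by_field.items():
        # Highest impact factor first; ties broken alphabetically by title.
        entries.sort(key=lambda entry: (-entry[0], entry[1]))
        keep = (len(entries) * percent + 99) // 100  # integer ceiling
        selected[field] = [title for _, title in entries[:keep]]
    return selected


# Hypothetical journals and impact factors, for illustration only.
sample = [
    ("Journal A", "plant science", 6.1),
    ("Journal B", "plant science", 5.2),
    ("Journal C", "plant science", 4.3),
    ("Journal D", "plant science", 3.4),
    ("Journal E", "plant science", 2.5),
    ("Journal F", "plant science", 1.6),
    ("Journal G", "oceanography", 2.0),
]
top = select_top_journals(sample)
print(top)
```

Rounding up means that even a field with a single journal contributes one title, which may or may not match the project team's actual practice.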
To facilitate research on the CPE, a web-based corpus query system with a 3-level interface has also been developed in collaboration with the major PERC member institution, Shogakukan, Inc. For elementary users, the tool provides simple word search and KWIC displays only. For intermediate users, independent word/POS/lemma queries can be made with detailed collocation statistics (raw frequency/T-score/MI-score/log-log). And for experienced researchers, complex search (word/POS/lemma and combinations of the three) is made available with more sophisticated output. Shogakukan plans to launch a portal site for megacorpora such as the BNC, the ANC, and COBUILD-Direct in the future so that ordinary language teachers can access these large corpora without worrying about the installation of a complicated interface.

Currently, the CPE project team is in the process of acquiring the targeted texts as well as cleaning and formatting the existing texts that have already been obtained. A prototype version of the CPE (c.2 million words of copyright-cleared academic journal text) has been set up on this new query system and is beginning to provide an impressive wealth of data that contrasts significantly with that which can be obtained from general corpora such as the BNC.

4. The research project and its results

To demonstrate the value of a corpus specifically dedicated to professional English, we have chosen for this chapter a simple analysis of current usage of the pronoun we as it appears specifically in professional scientific texts in comparison with similar data gathered from corpora devoted to English on a broader, more general scale. We will first explain the methods we employed in this research, then follow this with specific results and a discussion of their significance.

4.1 Methodology

As stated above, the CPE is currently under construction.
The research for this paper was conducted on a prototype corpus from the CPE of 260 journal article texts from 18 journals in fields ranging from multidisciplinary agriculture, biochemistry and molecular biology, cell biology, developmental biology, plant science and forestry to multidisciplinary materials science, mineralogy, oceanography, general and internal medicine, health care sciences and service, orthopaedics medicine, pharmacology and pharmacy, and psychiatry. The prototype corpus included 1,787,484 tokens of almost 60,000 types. The corpus was examined for verbs used after we using Wordsmith Ver. 2 (Oxford University Press). The concordance lines were rearranged with the word to the right of we in alphabetical order, and then the verbs were classified according to the seven major semantic domains described in the Longman grammar of spoken and written English (Biber et al., 1999): “activity verbs, communication verbs, mental verbs, causative verbs, verbs of simple occurrence, verbs of existence or relationship, and aspectual verbs.” A summary of these verb features and some examples are presented in Table 2. 
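The concordancing step described above (find every we, then sort the lines by the word to its right) can be illustrated with a minimal Python sketch. It does not reproduce Wordsmith; the function name concordance, the tokenisation by a simple regular expression and the context width are all assumptions made for the example. Note that matching is case-insensitive, so an abbreviation such as WE would also be retrieved and would need manual filtering.

```python
import re


def concordance(text, node="we", width=4):
    """Return (left context, node, right context) lines for `node`,
    sorted alphabetically by the words to the right of the node."""
    # Naive tokenisation: runs of letters and apostrophes (an assumption).
    tokens = re.findall(r"[A-Za-z']+", text)
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == node and i + 1 < len(tokens):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    # Sorting on the right context orders lines by the first word after
    # the node, then by the following words.
    lines.sort(key=lambda line: line[2].lower())
    return lines


if __name__ == "__main__":
    sample = ("In this study we found that growth slowed. "
              "Previously we examined the cells. We conclude that it works.")
    for left, node, right in concordance(sample):
        print(f"{left:>30} [{node}] {right}")
```

With the lines sorted in this way, the verbs following we fall into alphabetical blocks, which makes the manual classification into semantic domains easier.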
Table 2: Verb semantic domains based on the LGSWE (Biber et al., 1999: 360-364)

Semantic domain: Activity
  Features: Actions and events associated with choice; subject is semantic role of agent
  Examples: bring, buy, carry, come, give, go, leave, work

Semantic domain: Communication
  Features: Subcategory of activity verbs; associated with activities for communication
  Examples: ask, explain, say, suggest

Semantic domain: Mental
  Features: Activities and states experienced by humans; including cognitive, emotional, and perception
  Examples: think, know, love, hate, see, taste, read

Semantic domain: Facilitation or causation
  Features: A new state of affairs is brought about by a person or an inanimate entity
  Examples: allow, cause, enable, require, permit

Semantic domain: Occurrence
  Features: Events happening without volitional activity
  Examples: become, change, happen, develop

Semantic domain: Existence or relationship
  Features: State of relationship between entities
  Examples: be, seem, appear

Semantic domain: Aspectual
  Features: State of progress of an event or activity
  Examples: begin, continue, keep, stop

According to the LGSWE (Biber et al., 1999), the most frequently used verb type is the activity verb. The LGSWE bases its findings on corpus studies which identified the four registers of conversation, fiction, news, and academic prose. Examination of the distribution of commonly-used verbs according to the four registers revealed that activity verbs ranked the highest in three of the four registers. It was only in academic prose that existence verbs display almost equivalent frequency. Another feature of academic prose overall is the use of more causative verbs and occurrence verbs, compared to that of other registers.

With respect to the usage of the personal pronoun we, the LGSWE (Biber et al. 1999: 329-330) recognises its common use “to refer to a single author, a group of authors, to the author and the reader, or to people in general.” Biber et al.
also state that “In some cases, academic authors seem to become confused themselves, switching indiscriminately among the different uses of we.” This highlights the problem, pointed out by Umesaki (2002), faced by the NNS scientist, who often finds author-referent conventions confusing. In this work, we focused on identifying the type of verb following we in the prototype CPE corpus. Knowing what types of verbs are commonly used in academic papers should offer help to the NNS scientist in deciding when and how to use we when writing up research.

4.2 Research data from the CPE

A total of 3,401 instances of we followed by a verb were identified in the CPE prototype corpus. The main verb was identified and classified according to the semantic domain classifications in the LGSWE. Its tense and aspect were also noted. If the verb occurred at least twice, it was included in the count for the distribution of verb types presented in Tables 3a and 3b.

Table 3a: Distribution of verb types used with we according to semantic domain and tense/aspect

Verb type  Total      %   past      %   pres      %   prep      %
Men         1433  45.13    668  46.62    472  32.94    119   8.30
Act         1038  32.69    612  58.96    187  18.02    160  15.41
Com          440  13.86     71  16.14    290  65.91     45  10.23
Exi          168   4.66     63  31.76     70  44.59      6   4.05
Cau           40   1.89     23  65.00      3  11.67     11  18.33
Asp           32   1.01      8  25.00     29  90.63      6  18.75
Occ           24   0.76     14  58.33      3  12.50      5  20.83
Total       3175          1459  45.95   1054  33.20    352  11.09

Table 3b: Distribution of verb types used with we according to semantic domain and tense/aspect

Verb type    mod      %   prec      %   pstp      %
Men          122   8.51     11   0.77     16   1.12
Act           49   4.72     18   1.73     10   0.96
Com           33   7.50      3   0.68      0   0.00
Exi           20  13.51      5   3.38      0   0.00
Cau            3   5.00      0   0.00      0   0.00
Asp            7  21.88      2   6.25      0   0.00
Occ            1   4.17      0   0.00      1   4.17
Total        235   7.40     39   1.23     27   0.85

Verb semantic domains: men = mental, act = activity, com = communication, exi = existence, cau = causative, asp = aspectual, occ = occurrence.
Verb tense and aspect: past = past tense, pres = present tense, prep = present tense perfect aspect, mod = modal auxiliary, prec = present tense continuous aspect, pstp = past tense perfect aspect

The present tense continuous aspect occurred in 5 instances, two with mental verbs and three with activity verbs, but these data are not included in the table.

As can be seen from Tables 3a and 3b, the most frequently used verb type after the personal pronoun we was the mental verb, accounting for 45.13% of the total. This was followed by activity verbs at 32.69% and communicative verbs at 13.86%. The most frequently-used verb tense was the past tense, accounting for 45.95% of all instances catalogued. However, this tense was the predominant one only for the mental, activity, causative, and occurrence verbs. The present tense form was more commonly used for the communicative, existence and aspectual verbs.

A closer examination of the verbs in their semantic domain classifications reveals a more complex picture. As can be seen from Tables 5a and 5b, while the mental verbs find, observe and examine overwhelmingly occur in the past tense, conclude occurs in the present tense form in more than 80% of instances. In the case of activity verbs with the highest frequencies, use, analyse and test are predominantly used in the past form, but show is used in the past form in only 14.29% of the instances observed, while it appears more frequently and almost equally as the present tense perfect aspect (41.27%) and the present tense (40.48%).

Table 4: Number of verbs in each semantic domain and some examples

Verb semantic domain: Mental
  No. of verbs observed: 97
  Examples (No. of instances): find (163), observe (108), examine (103), conclude (58), identify (54), know (47), compare (46), see (45), determine (41), investigate (39)

Verb semantic domain: Activity
  No. of verbs observed: 110
  Examples (No. of instances): use (183), show (126), analyse (51), test (48), demonstrate (46), perform (46), measure (33), calculate (24), obtain (23), do (21)

Verb semantic domain: Communicative
  No. of verbs observed: 34
  Examples (No. of instances): thank (101), report (49), present (34), describe (30), propose (29), note (19), ask (18), suggest (18), acknowledge (14), discuss (11)

Verb semantic domain: Existence
  No. of verbs observed: 12
  Examples (No. of instances): have (71), be (28), exclude (14), include (11), stand (5), have to (4)

Verb semantic domain: Causative
  No. of verbs observed: 5
  Examples (No. of instances): be able to (28), be unable to (20), allow (6), require (3), subject (3)

Verb semantic domain: Aspect
  No. of verbs observed: 8
  Examples (No. of instances): begin (10), continue (5), start (4), undertake (4), initiate (3), achieve (2), enter (2), keep (2)

Verb semantic domain: Occurrence
  No. of verbs observed: 6
  Examples (No. of instances): develop (12), fail (4), modify (2), increase (2), change (2), become (2)

Verbs which appeared after we two or more times were counted.

Table 5a: Verb tense and aspect distribution for most frequently used mental and activity verbs

Verb       Total   past      %   pres      %   prep      %
Mental
find         163    129  79.14     11   6.75     19  11.66
observe      108     83  76.85     16  14.81      7   6.48
examine      103     78  75.73      3   2.91     17  16.50
conclude      58      6  10.34     47  81.03      -   0.00
Activity
use          183    131  71.58     24  13.11     17   9.29
show         126     18  14.29     51  40.48     52  41.27
analyse       51     43  84.31      2   3.92      5   9.80
test          48     39  81.25      1   2.08      6  12.50

Table 5b: Verb tense and aspect distribution for the most frequently used mental and activity verbs

Verb         mod      %   prec      %   pstp      %
Mental
find           3   1.84      -   0.00      1   0.61
observe        -   0.00      -   0.00      1   0.93
examine        3   2.91      -   0.00      -   0.00
conclude       5   8.62      -   0.00      -   0.00
Activity
use            9   4.92      1   0.55      1   0.55
show           4   3.17      -   0.00      1   0.79
analyse        1   1.96      -   0.00      -   0.00
test           1   2.08      -   0.00      -   0.00

Not including one instance of the past perfect for observe

Verb tense and aspect: past = past tense, pres = present tense, prep = present tense perfect aspect, mod = modal auxiliary, prec = present tense continuous aspect, pstp = past tense
perfect aspect

Table 6: Top twenty clusters in the vicinity of we

N    Cluster                     Freq.
1    we found that               76
2    in this study               50
3    we conclude that            39
4    we examined the             39
5    we have shown               35
6    acknowledgments we thank    32
7    we used the                 32
8    have shown that             28
9    in the present              28
10   we did not                  28
11   we do not                   25
12   found that the              23
13   we show that                22
14   we have previously          21
15   the effect of               20
16   the present study           19
17   in this paper               18
18   this study we               18
19   we have found               18
20   we used a                   18

The clusters presented in Table 6 show that three of the top four clusters involve the mental verbs found, conclude, and examined. This suggests that cognitive activities of a dynamic nature (find and examine) are expressed using the past tense, while the more stative conclude appears most frequently in the present tense. The sixth cluster is from the acknowledgements section and indicates an almost formulaic usage of we thank. The repeated references to the work being presented in the paper (in this study, in the present, the present study, in this paper) in the vicinity of we indicate that direct reference to the author(s) appears particularly when attention is being drawn to the study under discussion.

Interestingly, of the 260 texts examined, we was used at least once in 223 texts. The highest rate was 13.94 instances per 1,000 words, with an overall average of 2.12 per 1,000. The average for the top ten texts was 9.74 instances per 1,000 words.

Table 7: Texts with high frequency usage of we

No.   File      Words   Hits   per 1,000   Field
1     079.txt   1,865   26     13.94       Health services
2     078.txt   5,549   70     12.61       Health services
3     072.txt   2,956   34     11.5        Health services
4     070.txt   6,292   68     10.81       Health services
5     047.txt   4,821   50     10.37       Psychiatry
6     035.txt   7,748   73     9.42        Plant sciences
7     195.txt   4,668   38     8.14        Cell biology
8     076.txt   2,693   21     7.8         Health services
9     058.txt   7,696   50     6.5         Plant sciences
10    314.txt   7,921   50     6.31        Oceanography
Ave                            9.74

Table 7 suggests that some journals or fields may display a higher frequency of first-person pronoun usage than others. Also revealing were the negative data of texts in which we was not used even once. Such texts occurred across all fields, but all sixteen texts from one journal in multidisciplinary materials science had no instances of we at all. The only instance that was detected was the abbreviation WE for working electrode.

5. Present applications and future research

The above analyses of the usage of we in the 260 texts in the prototype CPE corpus of scientific academic journal articles from a range of fields reveal the following:

i) The use of we is rather common, occurring at least once in 85.77% of the texts examined.
ii) Some journals and fields tend to display more we usage than others. On the other hand, all texts coming from one of the journals had no instances of we. Thus, there seems to be a need for even further specialization of corpora to illuminate differences among journals and/or fields of study.
iii) The verb type most commonly used with we is the mental verb (45.13%) followed by the activity verb (32.69%).
iv) The most frequently used tense is the past tense, but the distribution of past and present tense usage is reversed for some verb types and even within verb type categories.
v) High-frequency clusters in the vicinity of we tend to be related to mental activities and references to the work under discussion.
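The normalised figures reported above (hits per 1,000 words in Table 7 and the share of texts containing we in item i) are simple ratios, and a short Python sketch shows how they can be recomputed from the raw counts given in the chapter. The function names are hypothetical and rounding to two decimal places is an assumption based on the precision of the published values.

```python
def per_thousand(hits, words):
    """Occurrences per 1,000 running words, rounded to two decimals."""
    return round(hits / words * 1000, 2)


def percentage(part, whole):
    """Share of `part` in `whole` as a percentage, rounded to two decimals."""
    return round(part / whole * 100, 2)


if __name__ == "__main__":
    # First row of Table 7: 26 hits in a text of 1,865 words.
    print(per_thousand(26, 1865))
    # Texts with at least one instance of we: 223 out of 260.
    print(percentage(223, 260))
```

Normalising by text length is what makes short and long articles comparable; the raw hit counts alone would rank the longest texts highest.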
The findings overall point to the need for even further refining of corpus studies of specialised texts in order to reveal features which can be used when planning course materials for English for specific purposes classes. Such comparative studies of texts from different research fields and genres must await the completion of the CPE corpus; however, this research conducted on a small sample of dedicated texts already reveals some interesting things that differ from earlier findings based on general English corpora.

Even in its present state, the CPE corpus prototype dedicated to text in scientific disciplines can prove very helpful as reference material for postgraduate students or NNS scientists who are at the stage of writing up their research. If they are given background instruction in the genre-analysis approach to understanding the framework of moves and steps that compose the research journal article (Swales, 1990; Weissberg and Buker, 1990), a data-driven learning approach to concordancing (Johns, 1991a and b) can serve as a valuable tool for aiding NNS with their professional writing, when it comes to word choice or other writing issues (Hunston, 2002). As the CPE continues to develop, we envisage a system by which NNS scientists can access a website for online support when creating professional documents, which would include access to selections of dedicated corpora in the specific fields for which they are writing.

References

Biber, D., S. Johansson, G. Leech, S. Conrad, and E. Finegan (1999), Longman grammar of spoken and written English. Harlow: Longman.
Bondi, M. (2001), ‘Small corpora and language variation’, in M. Ghadessy, A. Henry, and R. L. Roseberry (eds.), Small corpus studies and ELT. Amsterdam: John Benjamins.
Coates, R., B. Sturgeon, J. Bohannan, and E. Pasini (2002), ‘Language and publication in Cardiovascular research articles’, Cardiovascular research, 53: 279-285.
Hunston, S. (2002), Corpora in applied linguistics. Cambridge: Cambridge University Press.
Johns, T. (1991a), ‘From printout to handout: Grammar and vocabulary teaching in the context of data-driven learning’, in T. Johns and P. King (eds.), Classroom concordancing. Birmingham, UK: Centre for English Language Studies, The University of Birmingham, 27-46.
Johns, T. (1991b), ‘Should you be persuaded: Two examples of data-driven learning’, in T. Johns and P. King (eds.), Classroom concordancing. Birmingham, UK: Centre for English Language Studies, The University of Birmingham, 1-16.
Johns, T. (2002), homepage http://web.bham.ac.uk/johnstf/timeap3.htm
Swales, J. (1990), Genre analysis: English in academic and research settings. Cambridge, UK: Cambridge University Press.
Umesaki, A. (2000), ‘Syntactic differences in the discourse of oral and written papers’, English corpus studies, 7: 39-59.
Umesaki, A. (2002), ‘Reference to the presenter in academic papers’. Paper presented at AILA 2002, Singapore International Convention and Exhibition Centre, Dec. 21, 2002.
Weissberg, R. and S. Buker (1990), Writing up research: Experimental research report writing for students of English. Englewood Cliffs, NJ: Prentice-Hall.

Methods and tools for development of the Russian Reference Corpus 1

Serge Sharoff
University of Leeds

Abstract

The paper discusses the history of development of Russian corpora and presents methods and tools that are used in the ongoing development of the Russian Reference Corpus. Development of the corpus follows the key design principles of the BNC and extends them further by introducing an elaborate model of text typology and by adding lemmatisation and morphosyntactic annotations to POS tagging. The paper also discusses problems in development of the corpus that are related to the Russian language and culture.

1.
The history of development of Russian corpora

It is not too big a generalisation to say that development of Russian computer corpora followed the pattern established by English corpora. The Brown Corpus (Kucera & Francis, 1967) set the standard for the design, size and coverage of general-purpose corpora in other languages, including Russian. In the 1970s a corpus of 1 million words was developed by Zasorina and her colleagues; it consisted of 500 samples of 2,000 words each and covered four genres: mass media, fiction, science (including humanities) and drama (as an attempt to cover the spoken language). The study resulted in a frequency dictionary (Zasorina, 1977), but not in a publicly available resource.

The best-known comprehensive Russian corpus was developed in the 1980s in Uppsala, Sweden; it also resulted in a frequency dictionary (Lönngren, 1993). The Uppsala Corpus (UC) consists of 1 million words in 600 samples equally divided between fiction and non-fiction texts. UC is popular for various reasons, partly because it can be freely accessed via the Internet, but by modern standards it is too small and restricted in genre coverage. It also lacks morphosyntactic annotations and lemmatisation. The lack of lemmatisation hinders searching for multiple forms of a word, which often cannot be found using regular expressions, e.g. the verb vyjti (to leave) in Russian has about 40 forms, including many dissimilar forms like vyjdu, vyshla, vyshedshij. The lack of morphosyntactic annotations hinders even simple searches for grammatical relations, for example, searching for uses of the partitive case or for complements of a particular verb in the dative case.

Another attempt to develop a comprehensive corpus was made in the Soviet Union in the mid 1980s. It is known as the Computer Fund of Russian Language (CFRL). Its aims were similar to those of the British National Corpus (BNC), which was to be developed a few years later.
The main goal was to create a very large corpus of general language and subcorpora for various genres that would help in the development of NLP applications. The set of corpora would also provide resources for studying and teaching the Russian language, including development of dictionaries, grammars, textbooks, etc. (Andryuschenko, 1989). It was also expected that the corpus would include an historical component to cover the development of the Russian language from the earliest available sources (10th century AD). However, the project did not produce the expected outcome: no representative corpus has been collected. Resources available from the CFRL now include Russian literature of the 19th century and samples of newspapers from 1997.

The progress in development of OCR software resulted in multiple ad hoc collections of Russian fiction and reference texts, for instance, Moshkow’s Library (ML), but such collections are neither balanced nor representative. The same applies to collections of newspapers available online.

Currently, corpus studies of Russian are based mostly on the Internet. The Internet can be regarded as the largest Russian corpus, because the amount of Internet documents available for Russian search engines can be estimated at about 250 billion words (1.5 TB of unique texts indexed by Yandex), much larger than any conceivable corpus. However, there are three types of problems that hinder its use for corpus studies.

First, it cannot be claimed that the material is representative and that there is a balance of text types. Texts presented on the Russian Internet are chaotic: their set depends on preferences and interests of a very specific group of Russian language speakers that are active on the Internet.
The recall of search results also cannot be evaluated, because it depends on unknown parameters: which texts are available or not available on the Internet; which texts available on the Internet were not found by the search engine used for the query, etc.

Second, search engines address the needs of information retrieval, rather than linguistic search. Even though search engines provide lemmatisation, so that one can search for all forms of a word, a query cannot be formulated in terms of grammatical features, including tenses, cases or word classes. As for lemmatisation performed by search engines, it is not designed to handle the queries of (corpus) linguists. For example, normal users, who are interested in information retrieval, pay no attention to the aspect of verbs used in their queries and want to get pages corresponding to the verb irrespective of its form. Search engines anticipate the need and index verbs of the perfective and imperfective aspect under one lemma. However, this technique drastically decreases the precision of linguistic searches and leads to some funny results, when pomni used in a query leads to pages with myatyj, because pomyat’ and myat’ form an aspectual pair.

Third, search engines present search results in a way that also does not correspond to the needs of a linguist. The pages are ordered in terms of their information rank that has nothing to do with linguistic criteria. The output also does not form a concordance, because pages in the output are separated by documents, rather than by contexts of their uses. Finally, search results are based on words occurring in titles of pages or keywords or even in other pages that refer to the link being displayed as relevant.

2. The content of the Russian Reference Corpus

From the viewpoint of corpus linguistics, Russian is one of the few major world languages that lack a comprehensive corpus of modern language use.
However, the need for constructing such a corpus is growing in the corpus linguistics community both in Russia and in the rest of the world. The objective of the project presented in this chapter is to develop the Russian equivalent of the BNC, namely the Russian Reference Corpus (BOKR, BOljshoj Korpus Russkogo yazyka). It is designed as a corpus of 100 million words with the proportional coverage of major varieties of texts in modern Russian, with POS annotation and lemmatisation. The annotation scheme (which is based on the TEI) also marks noun phrases and prepositional phrases, because they are important for the resolution of ambiguity and can be reliably detected. The corpus consists of texts originally written or uttered in Russian by native speakers 2 in recent years. The exact diachronic sample depends on the text type and is discussed below.

Table 1: Corpus composition

quantity
  Russian Standard: 10 million words (500 texts)
  BOKR: 100 million words (10,000 texts)

quality
  Russian Standard: a representative sample of Russian fiction written between 1960 and 2002
  BOKR: a representative corpus of modern Russian, balanced according to a text typology

annotation
  Russian Standard: POS tags, morphological and partial syntactic properties with manual disambiguation
  BOKR: POS tags, morphological and partial syntactic properties with automatic disambiguation

access
  Both corpora: public Internet access with a query interface shared between the two corpora (Russian Standard is a subcorpus of BOKR)

BOKR will include the Russian Standard, a subcorpus of 10 million words of modern fiction representative of the standard literary language. The relationship between the two corpora is described in Table 1. The two corpora differ mostly in their foci: on the large size, wide coverage and the balance of genres in BOKR and on selection of culturally-salient modern literary works and manual disambiguation of morphosyntactic annotations in the Russian Standard.
The latter aspect is similar to the design intentions of the hand-corrected core BNC subcorpus (Leech, 1997). The Russian Standard is aimed to be the basic source of information for the development of corpus-based Russian grammars for academic and teaching purposes, while BOKR will provide a complementary source of grammatical information and will be the basic source of lexical information.

In one respect, the design of the Russian Standard is remarkably different from the design of the core BNC subcorpus. The core BNC is based on a proportional selection of texts from the whole set of the BNC files, while the Russian Standard is based on literary texts. This reflects the difference in the cultural status of the language of imaginative writing in British and Russian cultures: in Russian the literary language is treated as the authoritative source, which effectively defines the language used by native speakers. This fact is also the reason for the higher proportion of fiction in the Uppsala Corpus and the corpus used by Zasorina (1977): fiction texts covered about half of their content, much higher than the proportion of fiction in the Brown Corpus (25%) and the BNC (17%), cf. also the balance of genres proposed for BOKR in the discussion below.

2.1 The typology of texts

The balance of genres in BOKR is based on a text typology that is more sophisticated than that of the BNC. The basic principles for describing texts in BOKR follow the EAGLES guidelines (Sinclair, 1996), which distinguish between text-external (E) and text-internal (I) parameters in text classification:

1. E1 (origin) - parameters concerning the origin of the text, i.e. the creation date, the author’s age and sex, the place of his/her origin, other circumstances of text creation that can affect linguistic parameters;
2. E2 (state) - the appearance of the text, in particular, the distinction between written and spoken text modes (including written-to-be-spoken and electronic communication as the two border cases), and between published sources (books, magazines and newspapers), ephemera and correspondence within the written mode;
3. E3 (aims) - matters concerning the reason for making the text and the intended effect it is expected to have, including (1) the size of the audience (and subclasses for private and public speech) and (2) the communicative function of the text, i.e. discussion, information, recommendation, instruction or recreation;
4. I1 (topic) - the main topic of the text, following a shallow classification of knowledge domains similar to classes used in the BNC, e.g. natural sciences, applied sciences, life or politics;
5. I2 (style) - “the patterns of language that are thought to correlate with external parameters” (Sinclair, 1996), such as formal or informal, one-way or interactive, etc.

The changes in the finer classification of parameters in comparison to Sinclair (1996) are based on the experience in development of other representative corpora, such as the Brown Corpus, BNC, and the TEI guidelines (Sperberg-McQueen & Burnard, 2001), as well as considerations from Russian texts. This concerns, for example, the use of an additional mode (written-to-be-spoken), which is borrowed from the BNC (E2), the intended audience age (E3.1), a classification of fiction genres (E3.2) and styles (I2). It was considered helpful to extend the classification of text styles with separate subclasses for fiction and non-fiction texts.
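The five parameters listed above lend themselves to a simple record structure. The following Python sketch is only an illustration of how a text description along these lines might be stored and tallied; the class name TextDescription, the field names and the free-text values are all assumptions, not the actual BOKR annotation scheme (which the chapter says is based on the TEI).

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TextDescription:
    """One corpus text described by the five parameters (hypothetical layout)."""
    origin: str  # E1: e.g. creation date, author's age and sex
    state: str   # E2: e.g. written, spoken, written-to-be-spoken
    aims: str    # E3: e.g. information, instruction, recreation
    topic: str   # I1: e.g. natural sciences, life, politics
    style: str   # I2: e.g. neutral, formal, informal


def coverage(texts):
    """Count texts for each combination of state, aims, topic and style."""
    return Counter((t.state, t.aims, t.topic, t.style) for t in texts)


if __name__ == "__main__":
    # Invented sample descriptions, for illustration only.
    sample = [
        TextDescription("1998, male author", "written", "recreation", "life", "neutral"),
        TextDescription("2001, female author", "written", "recreation", "life", "neutral"),
        TextDescription("2000, male author", "spoken", "discussion", "politics", "informal"),
    ]
    for combination, count in coverage(sample).items():
        print(combination, count)
```

A tally of this kind makes it easy to see which combinations of parameters are thinly represented while a corpus is still being collected.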
The patterns of language detected for fiction include the following styles (some better-known writers that often use the style are also indicated):

1. neutral - the style characteristic for standard literary texts in Russian,
2. regional, derevenskaja proza - an imitation of regional, mostly rural, language varieties, e.g. Astafiev, Rasputin,
3. lowly, snizhennyj - an imitation of the spoken language used by a “lesser educated” population, often slang, e.g. Ju. Aleshkovskij, Limonov,
4. official, socrealism - the official style of the Soviet literature, e.g. Dangulov, Markov,
5. individual - a marked way of language use with significant deviations from the neutral style; this style is typically the result of linguistic or stylistic experiments, e.g. S. Sokolov.

Each style in the list instantiates a specific set of implications on lexicogrammatical properties (with the exception of the individual style, which is often author-specific, but this is exactly the reason to classify a text in this way). Nonfiction is classified according to the following styles: neutral, formal, informal, and academic writing.

Since the project is aimed at a representative sample of modern Russian, all meaningful combinations of parameters should be represented in the corpus by at least a handful of texts, though the number of texts in each group depends on the estimated number of respective texts in the Russian discourse and the availability of their electronic copies.

Text length is another important technical parameter. It is easier to develop a large corpus using longer texts. However, this means that the corpus contains fewer texts, so an idiosyncratic use of language in each text significantly influences lexicogrammatical properties that can be described using the corpus. This is the reason for the balance of texts of various sizes in the two corpora, i.e.
both shorter and longer texts should be included in each category with a greater number of shorter texts to alleviate the influence of longer ones.

The intended coverage of knowledge domains (I1) roughly follows the proportion used in the BNC. The comparison is shown in Table 2 (the data are from the BNC Index by David Lee). Since the typology of texts in the BNC is based on other principles, the comparison presents the content of texts in BOKR, as if they were described in terms used by the BNC. For instance, spoken language is treated as a domain in the BNC, so the figures in Table 2 also include it, even though a spoken discourse can be devoted to any other topic in the list of domains, so it is described as the mode of speech in BOKR (E2). It would be desirable to increase the proportion of spoken language in BOKR at least to the coverage of the BNC, if not to 50% of the total corpus, but the small amount of available transcribed recordings makes the ideal target impractical.

The major departure from the BNC is the (already discussed) higher proportion of fiction texts, which are not considered in our scheme as a knowledge domain of its own (similar to the spoken domain), but as the most important component of the knowledge domain “Life” (cf. respective sections in newspapers, which in the Russian context often include short fiction stories). Note that the corpus is currently under development, so the figures in the third column in Table 2 are approximations for the expected coverage.
Table 2: The proportion of knowledge domains

Domains as in the BNC                               BNC       BOKR
Spoken (not a domain in BOKR)                       10.7 %    5 %
Imaginative Texts (Life in BOKR)                    16.7 %    30 %
Natural Sciences                                    3.8 %     5 %
Applied Sciences                                    7.2 %     10 %
Social Sciences                                     14.2 %    12 %
World Affairs (Politics in BOKR)                    18.9 %    15 %
Commerce                                            7.6 %     5 %
Arts                                                6.8 %     5 %
Belief/Thought (Religion and philosophy in BOKR)    3.1 %     3 %
Leisure                                             11.2 %    10 %

Currently, tools and techniques for working with BOKR and the Russian Standard are tested using a corpus of 40 million words. Its subcorpus of about 1 million words of fiction texts (corresponding to the Russian Standard) has POS annotations that have been automatically assigned and manually inspected. It is also used for correcting the POS tagger used for processing the larger corpus. It is expected that the final release of the corpus will be available by the end of 2004.

2.2 The methodology for achieving the proportional coverage

The costs of compiling a representative corpus are lower now than 10 years ago, when the BNC was collected. Many types of source texts are readily available in electronic form; in particular, fiction and news texts are widely accessible via the Internet and can be made legally available for the corpus. Other types of discourse, like business or private correspondence, are harder to obtain and deposit in a corpus because of legal obstacles. Yet other types of sources, like samples of spontaneous speech, are rare for technical reasons. The proposed solution is to increase the amount of ephemera (including leaflets, junk mail and typed material), correspondence (business and private) and spoken language samples whenever possible, because they reflect everyday language produced and reproduced regularly in the discourse. In any case, various types of published texts will take the rest of the share.
In this respect, the situation is similar to the early time of the BNC: the amount of texts from unpublished sources in the written part of the BNC is about 4.5%. It is unlikely that in BOKR we will have significantly more; even though the majority of source texts are available in electronic form now, their holders are unwilling to share them. For the reasons of protection of privacy, personal and business letters are subjected to an anonymisation procedure with respect to names of persons and companies. Person names are replaced with MX, FX or CX tags (for male, female or child participants respectively) and names of companies with CoX (X is the identification number of a participant in the text; the same practice is also used in the Bank of English). In some cases, text providers manually replace names with codes. In other cases, they provide original texts, but when texts are stored in the corpus, names are replaced automatically using the lists of known given names and surnames of persons and names of companies. Care has been taken so that names of prominent figures and characters from popular books and films have not been replaced, for instance, even though Karamazoff and Putin are valid Russian names, it is much more likely they are not participants in the exchange, so their names are left as they are in texts (given that the corpus lacks private letters from or to prominent figures). We understand that the anonymisation procedure is not completely satisfactory, cf. analysis in (Rock, 2001). First, it does not lead to complete anonymity: contextual clues are left in texts and allow the detection of participants. Second, a text in which names are replaced with codes looks less natural. Third, errors in setting identification numbers of participants are possible; they can lead to problems in discourse studies based on such texts. 
Finally, the anonymisation removes the possibility of studying the frequency of personal names, discourse patterns of their uses, as well as phonological patterns. However, we regard this as the best possible practice for storing private and business letters without violating the privacy of their authors and addressees.

The text description framework is much more elaborate than a list of domains, so the balance of texts should be achieved on the basis of the text typology described above. The typology can be represented in terms of the systemic network of interrelated choices (Martin, 1987). For instance, when a text is described as fiction, it can be described in terms of the style of fiction, such as stylistically neutral, low or regional, and in terms of the genre of fiction, such as general, historical, science fiction, etc., but not in terms of the interaction between the author and the audience, because it is not produced spontaneously in the presence of the audience.

The network of options is traversed using the Systemic Coder. A person who encodes metatextual information about a document has access to its record, including the author, title, year of creation, location of the file and its size in words. Encoding options are selected from a list of categories; for instance, the age of the intended audience is selected from adult, child, teen or x-age, and the age of the author at the time of text creation is selected from child, teen, young, mid or senior (mid is the broadest category, covering authors aged 22 to 55). Even though the typology is elaborate, experience shows that most texts can be described in just a few seconds. Some combinations of features are logically impossible, for instance, a personal letter aimed at a very large audience or a private discussion on TV.
Some other combinations are very unlikely, for instance, books written in the domain of natural sciences in formal style, aimed at a very large female audience for entertainment: the combination of formal style and entertainment or natural sciences and a sex-targeted audience is unlikely. However, if a combination of parameters is meaningful, an effort should be made to cover it in the corpus by, at least, several texts. As an example, consider the set of texts within the knowledge domain “Politics”, subdomain “home affairs” (the parameter I1 in our typology). Variation over other parameters involves selection of texts written in neutral, formal, academic and informal styles (I2), texts created by male or female authors or texts with corporate authorship, texts written within the period of 1990-2000 in different regions of Russia (E1), texts printed in newspapers, magazines, or books, as well as letters and reports, or spoken discussions, on site, on TV and radio (E2), texts aimed at different audiences (general vs. informed vs. professional, public vs. private, etc.), and aimed at various communicative functions, e.g. discussion, recommendation, instruction or entertainment (E3). Each text in the corpus should be described by this set of options, for instance, the Russian Constitution is a text written in formal style (I2) in 1993 in Moscow, the authorship is corporate (E1), it is written material printed as a book of 9,500 words (E2), aimed at a very large audience (even if it is rarely read by the majority of population), with the intended function of recommendation, as a legal document. The typology ensures that every text to be included in BOKR can be described in terms of the parameters listed above. Since texts aimed at more public audiences are easier to obtain, extra efforts are taken to cover texts aimed at more private audiences. The text collection activity could lead to a corpus significantly larger than 100 million words. 
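The description of a single text by this set of options can be pictured as a small record. The following is only a sketch under stated assumptions: the field names, the category labels and the two consistency checks are hypothetical and are not taken from the Systemic Coder; the values for the Russian Constitution follow the example given above, except for the "public" setting, which is assumed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextDescription:
    """One text described by typology parameters (field names are hypothetical)."""
    title: str
    domain: str         # I1: knowledge domain / subdomain
    style: str          # I2: neutral, formal, informal, academic
    year: int           # E1: time of creation
    place: str          # E1: region of creation
    authorship: str     # E1: male, female or corporate
    medium: str         # E2: book, newspaper, personal letter, TV discussion, ...
    size_words: int     # E2: size in words
    audience_size: str  # E3: e.g. small, large, very large
    setting: str        # E3: public or private (assumed label)
    function: str       # E3: discussion, recommendation, instruction, ...


def impossible_combinations(text):
    """Return the logically impossible feature combinations found in a description."""
    problems = []
    if text.medium == "personal letter" and text.audience_size == "very large":
        problems.append("personal letter aimed at a very large audience")
    if text.medium == "TV discussion" and text.setting == "private":
        problems.append("private discussion on TV")
    return problems


constitution = TextDescription(
    title="Russian Constitution",
    domain="Politics / home affairs",
    style="formal",
    year=1993,
    place="Moscow",
    authorship="corporate",
    medium="book",
    size_words=9500,
    audience_size="very large",
    setting="public",  # assumption: not stated explicitly in the text
    function="recommendation",
)

odd_letter = TextDescription(
    title="(invented example)",
    domain="Life",
    style="informal",
    year=1995,
    place="Moscow",
    authorship="female",
    medium="personal letter",
    size_words=300,
    audience_size="very large",
    setting="private",
    function="discussion",
)

constitution_problems = impossible_combinations(constitution)
letter_problems = impossible_combinations(odd_letter)
```

A check of this kind would only flag the logically impossible combinations; the merely unlikely ones discussed above would still need human judgement.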
The next step is to balance the collection. The balance takes into account the proportion of basic genres according to Table 2, as well as the proportion of texts within each parameter. For instance, the classes of the intended outcome of a text include discussion, recommendation, instruction, recreation and information. The exact proportion of intended outcomes in the corpus can hardly be determined, e.g. the allocation of 25% for discussions or 10% for instructions looks fairly arbitrary. However, if texts classified as a recommendation take 90% of the total corpus, this is a clear sign of disproportionate coverage, which should be corrected. The balance will be monitored using statistical tools available in the Systemic Coder, such as Cell Analysis or Significance Tests.

Another problem with sources concerns the choice of diachronic sampling, because the turbulent history of Russia in the 20th century radically affected the language. For instance, according to the frequency list (Zasorina, 1977), which was compiled on the basis of texts from 1930-1960, such words as sovetskij (Soviet) and tovarishch (comrade) belonged to the first hundred of Russian words on a par with function words, but this is no longer true of modern texts. The language of fiction has not been so radically affected. The decisions on the chronological limits of the study are different for different text types: for instance, fiction texts are taken from 1960, scientific texts from 1980, political texts and ephemera from 1990 (earlier ephemera texts are also hard to obtain), and news texts from 1997.

3. The principles of morphosyntactic annotation

English corpora, including the BNC, are annotated with complex tags, like NNS for plural common nouns – a technique that is impossible in a highly inflective language, such as Russian.
For instance, an adjective inflects for case, gender and number, giving 36 basic adjectival categories in total, while a verb in addition to its own 14 basic categories has up to 4 participles, each of which declines for adjectival categories. This leads to thousands of separate tags that cannot be searched effectively. For this reason, each word is annotated with a list of features, which provides a unique identification of its type according to the morphosyntactic codes from EAGLES (Calzolari and McNaught, 1996). For instance, bylo (was) is annotated as “verb,ifve,int,act: n,sg,past”, which stands for verb, imperfective, intransitive, active voice (the features describe the lexical item byt’), followed by a colon with features that describe the form: neuter gender, singular number, past tense. Separate features from a feature bundle associated with each word can be selected in a window of the query interface (Figure 1).

Figure 1: The query interface

3.1 How to resolve the ambiguity

The POS class and the morphological properties of a word are reflected in the flexion, and the probability of deciding the lemma and the POS class from a word form is higher than in, for instance, English. As a result, there are many morphological analysers for Russian, which make decisions on the basis of word forms, but virtually no Russian taggers that take into account the local context. However, if real Russian texts are to be tagged with the intention of making a corpus that includes lemmatisation and morphosyntactic annotation, the level of ambiguity is high: very frequent word forms like stali, shli, or ego can correspond to several lemmas, i.e. stali – stat’ (verb, to become) vs. stal’ (noun, steel), shli – idti (to go) vs. slat’ (to send), ego – ego (possessive) vs. on vs. ono (both are personal pronouns).
The ambiguity of POS classes is relatively rare, but many noun forms have multiple readings; for instance, the word form pole is an instance of four different nouns: pol (floor), pole (field), pola (lap) and Polya (a personal name, when it is capitalised). Also, in many cases the ambiguity concerns the set of morphological features of the same lemma, e.g. knigi is the singular form in the genitive case or the plural form in the nominative or accusative case, while znakomoj is the singular form in the genitive, dative or prepositional case of either a noun or an adjective. According to initial experiments, the frequency of ambiguous detection of lemmas and POS classes in running text is about 25%, while the frequency of ambiguous morphological properties is about 55%. The values are too high for a corpus with morphosyntactic annotations and so the ambiguity should be reduced.

Currently, there are no tools available that allow reliable parsing in a corpus of this size for Russian or any other language. Use of language-independent POS taggers based on statistical models cannot improve the quality of the output, because they are typically based on considerations relating to the word order, which is flexible in Russian, and because the genuine ambiguity between POS classes is relatively rare, and the most frequent type of the ambiguity concerns different readings of a word form (cf. the example with pole) and morphological features (cf. the example with knigi). If word forms or sets of morphological features (e.g. plural+dative+feminine) are treated as POS classes, then their number increases and the quality of language-independent POS tagging declines. However, partial parsing that detects nominal and prepositional phrases is reliable enough and can be used for deciding the reading of ambiguous forms.
In BOKR and Russian Standard we use Dialing, a morphological analyser with simple mechanisms for syntactic and semantic analysis. Since two analyses of the same word form have distinct morphological properties (case, number and gender), the agreement of the noun and the adjective in noun phrases removes some types of ambiguity, e.g. the word combination znakomoj knigi from Otkroem stranitsy etoj horosho znakomoj knigi (Let’s open the pages of this well-known book) can be parsed only as the genitive singular form of both words, and the first word is an adjective. Another simple mechanism that requires only partial parsing is the agreement of the subject and the predicate; it can remove, for instance, the plural reading of a noun in the nominative case (knigi), when the predicate is singular, as in Knigi na polke ne bylo (There was no book on the shelf). Some types of ambiguity of lemmas and POS classes are left after partial parsing: 12% of forms are ambiguous with respect to lemmas, and 22% with respect to morphological properties.

Since the BOKR corpus is annotated without human intervention, ambiguous analyses are subjected to further filtering according to statistical heuristics. For instance, two nouns spina (back) and spin (spin) have several identical word forms, which cannot be separated by means of parsing. However, spin is a term in theoretical physics, so the reading can be ignored in normal texts. A few other cases resist even complete parsing and statistical consideration; for instance, the ambiguity between the two readings of the word form banke (bank vs. banka) in Xranite svoi denjgi v banke (keep your money in a bank/in a jar) can be resolved only on the basis of semantic and pragmatic constraints. Such cases of ambiguity are retained in BOKR. The same applies to the ambiguity of morphological properties left after the syntactic filter. Currently, 3.6% of forms remain with ambiguous lemmas. The ambiguity in the Russian Standard is corrected manually.
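The agreement filter for noun phrases can be illustrated with a minimal sketch. It is not the mechanism implemented in Dialing: the data structure, the function name and the tuple layout are assumptions made for illustration, and the candidate readings are simply those listed above for znakomoj and knigi.

```python
from itertools import product

# Candidate readings as (POS, case, number, gender). The inventory is
# illustrative and restricted to the readings mentioned in the text.
READINGS = {
    "znakomoj": [
        ("adj", "gen", "sg", "f"),
        ("adj", "dat", "sg", "f"),
        ("adj", "prep", "sg", "f"),
        ("noun", "gen", "sg", "f"),
        ("noun", "dat", "sg", "f"),
        ("noun", "prep", "sg", "f"),
    ],
    "knigi": [
        ("noun", "gen", "sg", "f"),
        ("noun", "nom", "pl", "f"),
        ("noun", "acc", "pl", "f"),
    ],
}


def agreeing_readings(modifier, head):
    """Keep only reading pairs in which an adjective agrees with a noun head
    in case, number and gender."""
    kept = []
    for mod, noun in product(READINGS[modifier], READINGS[head]):
        if mod[0] == "adj" and noun[0] == "noun" and mod[1:] == noun[1:]:
            kept.append((mod, noun))
    return kept


surviving = agreeing_readings("znakomoj", "knigi")
```

With these inputs only the genitive singular pair survives, which matches the parse described for znakomoj knigi. A subject-predicate filter could be written along the same lines by comparing the number of a nominative noun with that of the predicate.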
3.2 How to store annotations

The design of the annotation format of the two corpora follows the best practices in corpus development established in the 1990s, namely EAGLES (European Advisory Group on Language Engineering Standards) and TEI (Text Encoding Initiative). Even though XCES (XML Corpus Encoding Standard) is expected to become the international standard for language resources (Ide and Romary, 2002), the annotation scheme based on it is extremely verbose (the size of an annotated file is a hundred times larger than a plain text file and, for a corpus of 100 million words, the size really matters). The XCES scheme is also not suited for querying word uses in the corpus, because information on similar properties is represented at different levels of the XML structure. Thus, we have postponed the use of XCES until the standard is established and there are publicly available software tools for working with the format.

The format of BOKR is based on the TEI scheme and uses standard tags, like <phr>, <s>, <w> for representing phrases, sentences and words. Morphological annotation is stored in <ana> tags that describe word properties in lemma and feats attributes. Ambiguity is represented using multiple <ana> tags.
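A format of this kind can be read with a standard XML parser. The sketch below uses Python's xml.etree.ElementTree; the sample fragment, its id and the helper names are made up for illustration, and the split of the feats attribute at the colon follows the convention described in section 3 (lexical features before the colon, form features after it). It is not a tool distributed with BOKR.

```python
import xml.etree.ElementTree as ET

# A made-up fragment in the style described above (id and content are illustrative).
SAMPLE = """<s id="demo.1">
  <w n="1">часов<ana lemma="час" feats="noun,m,in: pl,gen"/><ana lemma="часы" feats="noun,pl: gen"/></w>
  <w n="2">было<ana lemma="быть" feats="verb,ifve,int,act: n,sg,past"/></w>
</s>"""


def split_feats(feats):
    """Split a feature bundle into (lexical features, form features)."""
    lexical, _, form = feats.partition(":")
    return (
        [f.strip() for f in lexical.split(",") if f.strip()],
        [f.strip() for f in form.split(",") if f.strip()],
    )


def read_words(xml_text):
    """Return one record per <w>, with all analyses found in its <ana> children."""
    root = ET.fromstring(xml_text)
    words = []
    for w in root.iter("w"):
        analyses = [
            (ana.get("lemma"),) + split_feats(ana.get("feats", ""))
            for ana in w.findall("ana")
        ]
        words.append({"n": w.get("n"), "form": (w.text or "").strip(), "analyses": analyses})
    return words


words = read_words(SAMPLE)
ambiguous = [w["form"] for w in words if len(w["analyses"]) > 1]
```

Word forms that carry more than one <ana> child are exactly those whose ambiguity was left unresolved, so a list such as `ambiguous` gives a quick view of what remains to be disambiguated.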
The following is an example from the beginning of a sentence: Mne bylo ochen’ zhalko svoih chasov, … (I was very sorry about losing my watch, …)

    <s id="kozlotur.1476">
      <w n="1">Мне<ana lemma="я" feats="pron,sg,1: dat"/></w>
      <w n="2">было<ana lemma="быть" feats="verb,ifve,int,act: n,sg,past"/></w>
      <phr type="ADV+ADV">
        <w n="3">очень<ana lemma="очень" feats="adv"/></w>
        <w n="4">жалко<ana lemma="жалко" feats="adv"/></w>
      </phr>
      <phr type="ADJ+NOUN">
        <w n="5">своих<ana lemma="свой" feats="pron,poss: pl,gen"/></w>
        <w n="6">часов<ana lemma="час" feats="noun,m,in: pl,gen"/>
          <ana lemma="часы" feats="noun,pl: gen"/></w>
      </phr>
    </s>

The parser was able to resolve the ambiguity between two analyses of mne (the dative or prepositional case), svoih (the genitive, accusative or prepositional case of either a personal pronoun or a possessive pronoun), and zhalko (an adverb or an adjective). However, the ambiguity between two readings of the word form <w n="6"> is left in the output in the two <ana> tags. It can be read as chas (an hour) or chasy (a watch); in the latter case it is a plurale tantum, which is reflected in the position of the number value before the colon.

Metainformation about a document is stored in the header and is also based on the TEI format. In some cases, TEI provides tags for encoding the contextual settings used in the text typology, for instance, <creation> or <textClass>. In other cases, the information on the expected outcome or the size of the audience is expressed using the general framework of taxonomy specifications by means of <catRef> (category reference) tags.

Notes

1 The research presented in the paper has been supported by the Alexander von Humboldt Foundation, Germany, when the author was affiliated with the University of Bielefeld and the Russian Research Institute for Artificial Intelligence.
I am grateful to my Russian colleagues, in particular, to Vladimir Plungian and Katia Rakhilina, who took the leadership in the ongoing development of the Russian Reference Corpus.
2 Development of a translation corpus is considered to be a separate task.

References

Andryuschenko, V.M. (1989). Konzepziya i arhitectura Mashinnogo fonda russkogo jazyka (The concept and design of the Computer Fund of Russian Language). Moskva: Nauka.
Calzolari, N., McNaught, J. (eds.) (1996). Synopsis and Comparison of Morphosyntactic Phenomena Encoded in Lexicons and Corpora. EAGLES document EAG-CLWG-MORPHSYN/R. http://www.ilc.cnr.it/EAGLES96/morphsyn/morphsyn.html
Ide, N., Romary, L. (2002). Standards for language resources. In Proc. of Language Resources and Evaluation Conference (LREC02). May, 2002, Las Palmas, Spain. 59-65.
Leech, G. (1997). A brief users’ guide to the grammatical tagging of the British National Corpus. UCREL, Lancaster University. http://www.hcu.ox.ac.uk/BNC/what/gramtag.html
Lönngren, L. (ed.) (1993). Chastotnyj slovar’ sovremennogo russkogo jazyka. (A Frequency Dictionary of Modern Russian. With a Summary in English.) Acta Universitatis Upsaliensis, Studia Slavica Upsaliensia 32. 188 pp. Uppsala.
Martin, J.R. (1987). The meaning of features in systemic linguistics. In M.A.K. Halliday, R.P. Fawcett (eds.) New Developments in Systemic Linguistics. Vol. 1. London: Pinter Publishers. 14-40.
Rock, F. (2001). Policy and practice in the anonymisation of linguistic data. International Journal of Corpus Linguistics, 6(1).
Sinclair, J. (1996). Preliminary recommendations on text typology. EAGLES Document EAG-TCWG-TTYP/P. http://www.ilc.pi.cnr.it/EAGLES96/texttyp/texttyp.html
Sperberg-McQueen, C. M., Burnard, L. (eds.) (2001). Guidelines for Electronic Text Encoding and Interchange. http://www.hcu.ox.ac.uk/TEI/P4X/index.html
Verbitskaya, L.A., Kazanskij, N.N., Kassevich, V.B. (forthcoming).
Nekotorye problemy sozdanija natsional’nogo korpusa russkogo jazyka. NTI, Series 2. (in Russian)
Zasorina, L.N. (ed.) (1977). Chastotnyj slovar’ russkogo jazyka. Moscow: Russkij Jazyk.

Internet links

BNC Index: http://www.comp.lancs.ac.uk/ucrel/bncindex/
BOKR: Boljshoj Korpus Russkogo yazyka (the Russian Reference Corpus, a description of the project), http://bokrcorpora.narod.ru/
CFRL: the Computer Fund of Russian Language, http://irlras-cfrl.rema.ru/
Coder, a markup and classification tool: http://www.wagsoft.com/Coder/
Dialing, the morphological analyser: http://www.aot.ru/download.htm
Moshkow’s Library: http://lib.ru/
RS: the Russian Standard (online access), http://corpora.yandex.ru/
UC: the Uppsala Corpus, available from the University of Tübingen, http://www.sfb441.uni-tuebingen.de/b1/en/korpora.html
Yandex, the search engine: http://www.yandex.ru/

A profile-based calculation of region and register variation: the synchronic and diachronic status of the two main national varieties of Dutch

Dirk Speelman, Stefan Grondelaers and Dirk Geeraerts
University of Leuven

Abstract

In this paper we present a profile-based method for analysing regional and register variation, and we explain how we apply it to the study of Belgian and Netherlandic Dutch. A ‘profile’ comprises frequency information about the set of alternative synonymous terms for naming a particular concept. Our comparison of subcorpora is based on differences in naming preferences, as revealed by such profiles. Apart from the actual algorithm for the comparison of subcorpora, this paper also addresses some practical issues. In particular the preliminary step of profile selection is addressed: in this context the concept of ‘stable lexical markers’ is introduced.

1. Introduction

1.1 Broader context of the study¹

In recent years our research unit has investigated the internal stratification of Belgian and Netherlandic Dutch.
Although officially the same language, these two national varieties do differ to some degree, especially, but not exclusively, in the context of more informal language use. One important difference in the evolution of both varieties lies in their standardisation: due to a history of foreign occupations, Belgian Dutch has known an interrupted, and therefore slower, standardisation process, which may even be considered not yet fully completed, in the sense that the standard variety is not yet well established and is not felt by many speakers to be the natural variety of choice in many situations.

The overall model of the current internal stratification of Belgian and Netherlandic Dutch that we take as our point of departure can be summarised in one synchronic hypothesis and one diachronic hypothesis. The synchronic hypothesis is that if we look at different registers within the two contemporary national varieties, we will see that the linguistic difference between language use in less formal and in more formal registers is larger in Belgian Dutch than in Netherlandic Dutch. In other words, if we compare situations where informal language use can be expected to situations where more formal language use can be expected, and we repeat this exercise for Belgium and for The Netherlands (looking at the same situations in both cases), we will measure more important differences in Belgium. The diachronic hypothesis is that in the last decades, i.e. in the second half of the twentieth century, Belgian Dutch has been moving closer to Netherlandic Dutch and in doing so has been using the latter as its reference point for (further) standardisation.
In Geeraerts, Grondelaers and Speelman (1999) we report on a study in which a first body of empirical evidence was found to confirm both the synchronic assumption of a different internal stratification, with Belgian Dutch having a more pronounced informal register, and the diachronic assumption of a convergence characterised by Belgian Dutch moving towards Netherlandic Dutch. The dataset used in this study consists of some 40,000 naming events of items related to clothing and football concepts. With respect to the synchronic hypothesis the naming of clothing terms was investigated for two different registers. The more formal situation was the naming of garments in magazines. The more informal situation was the naming of garments in shop windows (which can be expected to target a more local audience and hence to be more informal). The data used to investigate the diachronic hypothesis were the football terms and clothing terms encountered in Belgian journals and magazines from three different points in time (1950, 1970 and 1990).

In subsequent research, a dedicated corpus was compiled to replicate this study for other registers and other linguistic variables. This corpus is the so-called ConDiv corpus, a 40 million token corpus created for measuring convergence and divergence patterns (cf. Grondelaers, Deygers, van Aken, van den Heede and Speelman, 2000). The corpus consists of a diachronic component, which contains newspaper material from 1950, 1970 and 1990, and a more elaborate synchronic component from 1990, which contains texts from 5 different registers: quality newspapers, nation-wide popular newspapers, regional popular newspapers, texts from Usenet newsgroups, and texts from IRC chat channels. Several replication studies based on this corpus corroborated the original findings from Geeraerts, Grondelaers and Speelman (1999).
For instance, Grondelaers, van Aken, Speelman and Geeraerts (2001) report on two replication studies, one in which the original clothing terms study was applied to the ConDiv corpus, and another in which variation in the choice of prepositions was used as linguistic evidence.

1.2 Formal onomasiological variation

This paper focuses on some key methodological choices made in our investigation of the stratification of Belgian and Netherlandic Dutch. The first and most important methodological choice is that among the different types of variation that exist we choose formal onomasiological variation as our basis for measuring linguistic distances. According to the definition we adhere to (cf. Geeraerts, Grondelaers and Bakema, 1994), formal onomasiological variation is that type of onomasiological variation, i.e. of variation in the terms one uses to name an item, that is not motivated by semantic differences, and therefore can be said to be purely ‘formal’. An example would be the variation between the terms ‘underground’ and ‘subway’ to refer to the concept UNDERGROUND. An example of onomasiological variation that is not purely formal would be the variation between the terms ‘garment’ and ‘pants’ to refer to a pair of trousers. In the latter example the choice for a different term is related to a different conceptualisation of the thing that is referred to, which is not the case in the UNDERGROUND example. The motivation for our choice to measure linguistic distance on the basis of formal onomasiological variation is not that this type of variation is more important than other types of variation, but rather that it is exceptionally useful for detecting the type of regional and stylistic differences we are interested in in our study.

The structure of the remainder of this paper is as follows.
In section 2 we introduce profiles and profile-based calculations of linguistic differences as our primary technique for quantifying formal onomasiological variation. After explaining the actual calculations we focus on the most salient feature of this technique, which is the use of profiles. We argue that profiles are a straightforward technique for measuring regional and stylistic differences, while controlling other types of variation. We subsequently illustrate the different outcomes of actual profile-based calculations and non-profile-based calculations (the latter being exemplified by calculations based on key words). The calculations are applied to the synchronic Belgian Dutch part of the ConDiv corpus. In section 3 we introduce a complementary technique, the method of ‘stable lexical markers’, which is a technique for exploring and charting a more complete set of noteworthy, different types of variation in the comparison of corpora. Rather than neutralising certain types of variation in advance by our choice of technique (which is what we do in section 2), we now aim to obtain an overview of the different types of variation, of their relative importance, and of the actual linguistic variables (i.e. the actual terms) that are representative of these different types of variation. In section 4 we summarise the most important characteristics of profile-based calculations and of the technique of stable lexical markers.

2. Profile-based register analysis of Belgian Dutch

2.1 Profiles

A profile is an exhaustive set of synonymous terms for naming the same concept, together with information about the frequency with which each term is used to name the concept². Table 1 shows three profiles, more specifically the profiles for the concept UNCLE in three different subcorpora of the Belgian Dutch (B) synchronic part of the ConDiv corpus.
The subcorpora contain chat material (ircVL), material from a regional popular newspaper (regL1) and material from a quality newspaper (quaSR) respectively. In this case there are two different terms for referring to the concept at hand, namely ‘oom’ and ‘nonkel’ (note that in other profiles there are often more than two alternatives).

Table 1: The profiles for UNCLE in three different subcorpora of the ConDiv corpus

            ircVL (B)     regL1 (B)     quaSR (B)
Oom         0 (0 %)       25 (61 %)     15 (94 %)
Nonkel      55 (100 %)    16 (39 %)     1 (6 %)

This example illustrates how profiles are sensitive to stylistic variation. The term ‘nonkel’ is a somewhat more colloquial term to refer to UNCLE. The term ‘oom’ is the more formal, standard variant. As can be seen in the table, the chat subcorpus shows a pronounced preference for the colloquial variant ‘nonkel’ (all 55 occurrences), the regional paper shows a moderate preference for the standard variant ‘oom’ (25 out of 41 occurrences), and the quality paper shows a pronounced preference for the standard variant ‘oom’ (15 out of 16 occurrences). It is obvious that in a similar vein profiles are sensitive to regional variation, but we will not give examples of this because in this text we will only deal with Belgian material.

Linguistic distance between two corpora on the basis of their profiles for one particular concept is calculated as a city block distance on the basis of the relative frequencies³. For example, the distance between regL1 and quaSR, on the basis of their UNCLE-profiles (cf. Table 1), is calculated as 0.5 * (|0.61 – 0.94| + |0.39 – 0.06|). In other words, we obtain the linguistic distance by summing up the row-wise differences in percentages, and by subsequently dividing that sum by two. The division by two is a normalisation step that guarantees the result is a value between 0 and 1 (i.e. between 0% and 100%).
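The calculation for a single concept can be written out as a short sketch. The function name and the dictionary representation of a profile are assumptions made for illustration; the counts are those of Table 1. The sketch implements only the halved city block distance described above and ignores any refinement that the accompanying note may add.

```python
def profile_distance(profile_a, profile_b):
    """Halved city block distance between two profiles of the same concept.

    Each profile maps a term to its absolute frequency; the distance is
    computed on relative frequencies and lies between 0 and 1.
    """
    total_a = sum(profile_a.values())
    total_b = sum(profile_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("a profile without naming events has no relative frequencies")
    terms = set(profile_a) | set(profile_b)
    return 0.5 * sum(
        abs(profile_a.get(term, 0) / total_a - profile_b.get(term, 0) / total_b)
        for term in terms
    )


# UNCLE profiles from Table 1 (absolute frequencies)
uncle = {
    "ircVL": {"oom": 0, "nonkel": 55},
    "regL1": {"oom": 25, "nonkel": 16},
    "quaSR": {"oom": 15, "nonkel": 1},
}

d_reg_qua = profile_distance(uncle["regL1"], uncle["quaSR"])
d_irc_qua = profile_distance(uncle["ircVL"], uncle["quaSR"])
```

For regL1 and quaSR this yields roughly 0.33, in line with the worked example; the chat material and the quality newspaper lie much further apart on this concept.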
After calculating the distances between two subcorpora on the basis of individual concepts, we want to summarise this information in a global distance between the two subcorpora. Linguistic distance between two corpora on the basis of their profiles for a whole range of concepts is calculated as the average of the distances based on a single concept⁴. After calculating these global distances for all pairs of subcorpora, the final step is to feed these distances into a multidimensional scaling analysis, in which we try to plot all subcorpora in a low-dimensional space in such a way that distances in the plot reflect linguistic distances as well as possible.

We illustrate the procedure by means of an example. The 20 subcorpora we compare in the example are listed in Table 2. They are all part of the Belgian Dutch synchronic part of the ConDiv corpus. This section of the ConDiv corpus consists of material from one quality newspaper (qua), one national popular newspaper (nat), two regional popular newspapers (regL and regA) and several Usenet newsgroups (use) and IRC chat channels (irc). All of these categories are further subdivided, sometimes with (irc, use, nat, qua) and sometimes without (regL, regA) a further topic-wise differentiation.
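The aggregation step can be sketched in the same way. The code below assumes an unweighted average over the concepts attested in both subcorpora, which is one plausible reading of the description above; the accompanying note may specify something more refined. The UNCLE counts come from Table 1, whereas the EACH OTHER counts are invented purely for illustration. The multidimensional scaling itself is not shown; the resulting matrix is the kind of input such an analysis expects.

```python
from itertools import combinations


def profile_distance(profile_a, profile_b):
    """Halved city block distance on relative frequencies (one concept)."""
    total_a = sum(profile_a.values())
    total_b = sum(profile_b.values())
    terms = set(profile_a) | set(profile_b)
    return 0.5 * sum(
        abs(profile_a.get(t, 0) / total_a - profile_b.get(t, 0) / total_b)
        for t in terms
    )


def global_distance(concepts_a, concepts_b):
    """Unweighted mean of per-concept distances (assumption: no weighting)."""
    shared = [
        c for c in concepts_a
        if c in concepts_b
        and sum(concepts_a[c].values()) > 0
        and sum(concepts_b[c].values()) > 0
    ]
    if not shared:
        raise ValueError("no concept is attested in both subcorpora")
    return sum(profile_distance(concepts_a[c], concepts_b[c]) for c in shared) / len(shared)


# UNCLE counts are from Table 1; EACH OTHER counts are invented for this sketch.
SUBCORPORA = {
    "ircVL": {"UNCLE": {"oom": 0, "nonkel": 55}, "EACH OTHER": {"elkaar": 20, "mekaar": 80}},
    "regL1": {"UNCLE": {"oom": 25, "nonkel": 16}, "EACH OTHER": {"elkaar": 70, "mekaar": 30}},
    "quaSR": {"UNCLE": {"oom": 15, "nonkel": 1}, "EACH OTHER": {"elkaar": 90, "mekaar": 10}},
}

names = sorted(SUBCORPORA)
matrix = {a: {b: 0.0 for b in names} for a in names}
for a, b in combinations(names, 2):
    matrix[a][b] = matrix[b][a] = global_distance(SUBCORPORA[a], SUBCORPORA[b])
```

The symmetric matrix with a zero diagonal can then be passed to any multidimensional scaling routine to obtain a plot of the kind discussed below.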
Table 2: The 20 subcorpora of the synchronic Belgian Dutch part of the ConDiv corpus

Name     nr tokens   register                       topic
IrcRE    205560      IRC-material                   regional chat channels
IrcVA    1182849     IRC-material                   varia
IrcLE    1784084     IRC-material                   chat channel “Leuven”
IrcVL    2736111     IRC-material                   chat channel “Flanders”
IrcBE    1686571     IRC-material                   chat channel “Belgium”
UseTE    2486797     Usenet                         technical topics
UseSP    117195      Usenet                         sports
UseSR    2376788     Usenet                         supra-regional topics
regL1    1561362     regional popular newspaper     (no differentiation)
regL2    1450968     regional popular newspaper     (no differentiation)
regL3    1666916     regional popular newspaper     (no differentiation)
regA1    1563799     regional popular newspaper     (no differentiation)
regA2    1504606     regional popular newspaper     (no differentiation)
regA3    1810548     regional popular newspaper     (no differentiation)
NatRE    1945461     national popular newspaper     regional
NatSP    427280      national popular newspaper     sports
NatSR    518670      national popular newspaper     supra-regional interest
QuaSP    994867      national quality newspaper     sports
QuaTE    1431786     national quality newspaper     technical topics
QuaSR    3607513     national quality newspaper     supra-regional topics

Table 3: Terms and concepts for 10 different sorts of profiles

CONCEPT                        Terms
A MIND TO                      goesting, zin
IF                             als, indien
FOR THE PRICE OF (+ amount)    aan (+ amount), tegen (+ amount), voor (+ amount)
UNCLE                          oom, nonkel
IN A MOMENT                    seffens, dadelijk
ONCE AGAIN                     weeral, alweer
TO BE READY                    gereed zijn, klaar zijn
EACH OTHER                     elkaar, mekaar, elkander, mekander
MOMENT                         moment, ogenblik
TO CONTRIBUTE TO               bijdragen aan, bijdragen bij, bijdragen tot

The terms and concepts of the 10 different profile types we use in the example are listed in Table 3. The result of the MDS-analysis is shown in Figure 1. One can see an axis IRC-Usenet-newspapers running from left to right. Moreover, within the newspaper corpora most subcorpora that belong to the same newspaper are rather close to each other (two notable exceptions are ‘natSP’ and ‘quaTE’).
In summary, the recognition of the registers is not impeccable, but seems acceptable in general.

Figure 1: Example of MDS-plot based on profile-based linguistic distance

2.2 Profiles compared to key words

An obvious alternative to working with profiles would be to calculate differences on the basis of isolated individual terms. In fact, this is common practice, and many methods that are currently used are based on individual terms (for an overview, see Kilgarriff, forthcoming). The main reason why we refrain from doing this is that we want to avoid thematic bias in our measurements. Let us illustrate what we mean by this with an example. Suppose that in corpus A ‘oom’ and ‘nonkel’ are used 20 times each. Also suppose that in corpus B ‘oom’ and ‘nonkel’ are used 100 times each. And let us, for the sake of simplicity, assume the corpora are equal in size. Now, if we compared the corpora on the basis of isolated individual terms, there would be a clear risk of misinterpreting the data. One approach would be to look only for items that are known to be colloquial, and to count their occurrences. In this case the colloquial item is ‘nonkel’, so we would count the number of occurrences of ‘nonkel’. We would conclude that corpus B contains more colloquial material than corpus A. Another approach would be to list all items that are more typical of one corpus than of the other. If we did that, we would conclude that both ‘oom’ and ‘nonkel’ are more typical of B than of A. However, if we now move to profiles, we can see that the actual difference between A and B lies in the frequency of the concept UNCLE, and that the actual naming preferences are identical in both corpora (in both cases 50% ‘oom’ and 50% ‘nonkel’). So for our purpose, which is the investigation of regional and stylistic differences, we want to measure no difference in this example.
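The hypothetical corpora A and B from this example can be written out as a small worked calculation. The sketch below is only an illustration, and the variable and function names are assumptions; it contrasts the raw frequency differences, which suggest that B uses both terms more, with the profile-based distance, which is zero.

```python
def relative_profile(counts):
    """Convert raw term counts for one concept into relative frequencies."""
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


# Counts from the example in the text (corpora assumed equal in size).
corpus_a = {"oom": 20, "nonkel": 20}
corpus_b = {"oom": 100, "nonkel": 100}

# Term-by-term view: both terms look more typical of B.
raw_difference = {t: corpus_b[t] - corpus_a[t] for t in corpus_a}

# Profile view: the naming preferences are identical (50% / 50%).
profile_a = relative_profile(corpus_a)
profile_b = relative_profile(corpus_b)
distance = 0.5 * sum(abs(profile_a[t] - profile_b[t]) for t in profile_a)

print(raw_difference)  # {'oom': 80, 'nonkel': 80}
print(distance)        # 0.0
```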
To put it another way, the high frequency of ‘oom’ and ‘nonkel’ in corpus B is ambiguous in terms of popularity of terms and popularity of concepts. The profile-based method disambiguates these two levels (note 5).

Figure 2: MDS-plot based on number of key words

In Figure 2 we give an example of a comparison of corpora based on individual isolated terms. We used the log likelihood ratio test for binomial distributions (cf. Dunning, 1993) to calculate which terms, in the comparison of two corpora A and B, are used significantly more often in A (positive key words of A; note 6) or significantly less often in A (negative key words of A). We then used the total sum of key words (both positive and negative) as a dissimilarity measure, and compared all corpora pairwise (note 7). Finally, we fed these dissimilarities to an MDS-analysis. The result is shown in Figure 2. Apart from the fact that the axis IRC-Usenet-newspapers is a bit more fuzzy (note 8) than in Figure 1, the most interesting observation is that the three sports-related corpora (in the centre of the figure) are very close to each other, in spite of their belonging to different registers. This seems to illustrate that this type of measurement is so sensitive to topical bias that it becomes less useful for our purpose, i.e. the measurement of regional and stylistic variation.

3. Stable lexical markers

3.1 Purpose of using stable lexical markers

Although the plot in Figure 2 indicates that the key words method as applied in 2.2 is less suited to our needs, we believe that a further analysis of the results of this method is worthwhile. Being based on all terms in the corpora, this method contains a wealth of information. If we can obtain a clearer picture of the different sources of variation (such as style, topic, etc.)
that interact in the constellation we see in Figure 2, and if we can obtain information about their relative importance, and if we can also discover which linguistic variables (i.e. which terms) are representative of these different sources of variation, then there are several possible ways to benefit from this knowledge in our analysis, both directly and indirectly. Directly, because it helps us to construct a representative set of profiles for measuring regional and stylistic variation (note 9). Indirectly, because it serves as an additional autonomous analysis of variation in the corpora. Unfortunately a direct analysis of the key words is cumbersome, because in each of the 190 pair-wise comparisons of corpora thousands or even tens of thousands of key words show up (note 10). In order to overcome this practical problem (note 11), we want to simplify the picture and look at a subset of key words that according to some criteria can be called a particularly salient subset. The criterion we use is consistency in being typical of specific groups of corpora. More specifically, we introduce the concept of stable lexical markers or stability as a marker. The concept is straightforward (note 12). If we compare two sets of corpora S1 and S2, then the stability with which a term is a marker for set S1 is equal to the number of couples {X, Y}, with X a member of S1 and Y a member of S2, for which the term is a positive key word for X when X is compared to Y. For instance, if S1 contains three corpora, say S1 is {regL1, regL2, regL3}, and S2 also contains three corpora, say S2 is {regA1, regA2, regA3}, then there are 9 different couples {X, Y} with X a member of S1 and Y a member of S2, so the maximum stability would be 9.

3.2 Applications of stable lexical markers

We illustrate two ways the concept of stable markers can be applied.
In a first application we consider all partitions of our set of 20 corpora into two subsets S1 and S2 with subset S1 having size 3 and subset S2 having size 17. There are 1140 such partitions. We sort this list by number of maximally stable markers of S1. The top of this list is shown in Table 4. Right after the last item displayed here the number of maximally stable markers drops to 18, and then the number rapidly declines to hardly any maximally stable markers. The reason we look at subsets of size 3 is that most groups of corpora with the same source type are of size 3, with the exception of IRC, of which there are 5 items. As we see in Table 4, all these groups appear at the top of the list. Additionally, three mixed-group subsets also appear. Those are the ones with the grey background in the original table (rows 1, 6 and 12).

Table 4: All subsets of size 3 with 30 or more maximally stable markers

nr    maximally stable markers    subset S1
1     202                         {natRE, natSP, quaSP}
2     187                         {ircRE, ircVL, ircBE}
3     186                         {regA1, regA2, regA3}
4     185                         {useTE, useSP, useSR}
5     74                          {regL1, regL2, regL3}
6     66                          {regL3, quaTE, quaSR}
7     55                          {natRE, natSP, natSR}
8     54                          {ircRE, ircVA, ircBE}
9     48                          {ircVA, ircLE, ircVL}
10    36                          {quaSP, quaTE, quaSR}
11    34                          {ircRE, ircLE, ircBE}
12    30                          {natSP, regA3, quaSP}

The second application of stable markers is, of course, to look at the actual list of markers. As a first example, we look at the set of maximally stable markers of {natRE, natSP, quaSP}, the triplet at position 1 in Table 4. In Table 5 we show the first 20 items (note 13) from the list of 202 maximally stable markers. It is clear that the lexicon that sets this triplet apart is topic-related – more precisely sports-related. Applying the same procedure we found that the same is true for triplets 6 and 12 in Table 4. The markers in triplet 6 are related to ‘hot news’ topics. The markers in triplet 12 are once again sports-related.
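The stability count and the enumeration of subsets can be sketched as follows. This is a minimal illustration under stated assumptions, not the tool used for the analysis: the function names are hypothetical, and `positive_keywords` is an assumed data structure that maps an ordered pair (X, Y) to the set of terms that are positive key words of X when compared to Y. The toy corpus names and the key word sets assigned to them are invented.

```python
from itertools import combinations, product
from math import comb


def stability(term, s1, s2, positive_keywords):
    """Number of couples (X, Y), X in s1 and Y in s2, for which
    `term` is a positive key word of X when compared to Y."""
    return sum(
        term in positive_keywords.get((x, y), set())
        for x, y in product(s1, s2)
    )


def maximally_stable_markers(s1, s2, positive_keywords):
    """Terms whose stability equals the maximum, len(s1) * len(s2)."""
    keyword_sets = [
        positive_keywords.get((x, y), set()) for x, y in product(s1, s2)
    ]
    return set.intersection(*keyword_sets) if keyword_sets else set()


# Invented toy data: two 'sports' corpora compared with two others.
positive_keywords = {
    ("sp1", "ot1"): {"doelpunt", "coach"},
    ("sp1", "ot2"): {"doelpunt"},
    ("sp2", "ot1"): {"doelpunt", "coach"},
    ("sp2", "ot2"): {"doelpunt", "beker"},
}
s1, s2 = ["sp1", "sp2"], ["ot1", "ot2"]

print(stability("doelpunt", s1, s2, positive_keywords))     # 4
print(stability("coach", s1, s2, positive_keywords))        # 2
print(maximally_stable_markers(s1, s2, positive_keywords))  # {'doelpunt'}

# Number of ways to choose a subset S1 of size 3 from 20 corpora.
print(comb(20, 3))  # 1140
print(sum(1 for _ in combinations(range(20), 3)))  # 1140
```

For a subset S1 of size 3 and its complement of size 17, the maximum stability is 3 × 17 = 51, so a maximally stable marker in the first application is a positive key word in all 51 comparisons.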
Table 5: The first 20 items in the list of maximally stable markers of {natRE, natSP, quaSP}

aanvoerder (captain), achillespees (Achilles tendon), affluiten (to whistle), aftrap (kick-off), beker (cup), beloften (junior team), beslissende (decisive), bezoekende (visiting), coach (coach), competitie (competition), competitiestart (competition start), dij (thigh), doelkansen (opportunities to score), doelman (keeper), doelpunt (goal), doelpunten (goals), eindstand (final score), finales (finals), fit (in shape), forfait (walkover), …

As a second example of the second application, we look at one very specific subset S1 of size 8: the subset S1 of all 8 computer mediated communication subcorpora, which we compare to the subset S2 of all 12 newspaper subcorpora. The results are shown in Table 6. Maximum stability is 96 in this case. There are 158 maximally stable markers, which we have manually classified in different categories in order to distinguish between different sources or types of variation. The categories, most of which correspond to dimensions of variation as described in Biber (1995), are shown in the first column. The second column shows some examples for each category. The examples are taken from the set of 279 terms that have a stability of 92 or more (i.e. that are at least 95% stable). The last column shows the absolute and relative frequencies of the different categories for the 158 fully stable terms (assigning each term to exactly one category).
Table 6: Stable lexical markers of the computer mediated communication corpora

category                  examples of 95% stable markers                                                                           nr of 100% stable markers
1st + 2nd person          jij (you), uw (your), ik (I), denk (think, 1st person singular), bent (are, 2nd person singular), …      1 (0.63%)
English                   after, again, sorry, stuff, …                                                                            94 (59.49%)
thematic specificity      chat (chat), email (email), comp (computer), …                                                           47 (29.75%)
colloquial style          snap (understand, 1st person singular), tis (it is), niks (nothing), ne (a), zijt (are, 2nd person singular), …   4 (2.53%)
conversational elements   groetjes (greetings), greetz (greetings), bedankt (thanks), hmm (hmm), oeps (oops), …                    11 (6.96%)
downtoners                redelijk (reasonably), …                                                                                 1 (0.63%)

At first sight, one would conclude that the type of variation that interests us most, that of stylistic variation (colloquial style), is relatively rare: 2.53% of the cases, as opposed to 59.49% for English, 29.75% for thematic specificity (i.e. topic) and 6.96% for conversational elements. However, it is obvious that assigning each term to one single category is an oversimplification. For instance, the example ‘greetz’ in conversational elements is also related to English. Likewise, the colloquial style examples ‘snap’ and ‘zijt’ are also examples of 1st + 2nd person. In a similar way we find items that are not classified as colloquial style but that are related to colloquial style, for instance ‘comp’ and ‘greetz’. In summary, sources of variation are often hard to separate and, more particularly, although purely stylistic variation is rare in these stable markers, stylistic variation is often superimposed on other types of variation.

4. Conclusions

In this chapter we have argued that key-word-based methods are less suited for the analysis of stylistic and regional variation, because they are too sensitive to topical bias.
We further argued that profile-based calculations are more suited because they neutralise this topical bias. In order to illustrate this we have used both methods to calculate linguistic distance between 20 subcorpora that represent different registers of Belgian Dutch, some of which are further differentiated by topic. The analysis confirmed that the key word method is too sensitive to topical bias. Next we introduced the concept of stable lexical markers, and showed two applications in which stable markers are used to discover noteworthy patterns in the multitude of information that resides in a classical keyword-based analysis, but is hard to extract from it. On the basis of these applications we illustrated that stable markers are a useful device to disentangle different sources of variation, to learn about the relative importance of these different sources, and to discover which terms are representative of these different sources.

Notes

1 The research reported on in this paper was supported by VNC-grant 205.41.07 3 as well as by OT-project OT 01/05. For the corpus-linguistic procedures and some of the statistical analyses the tool Abundantia Verborum was used [http://wwwling.arts.kuleuven.ac.be/genling/abundant]. For the multidimensional scaling analyses the environment R was used [http://www.r-project.org/].

2 This definition of profiles clearly targets lexical variation. It is possible to broaden the definition in such a way that other types of variation (morphological, syntactic, ...) also fit in. Profiles would then group together competing constructions. But since in this paper we only deal with lexical information, we will stick to the more restricted definition.

3 In practice there is one additional step. We first apply a log likelihood ratio test for multinomial distributions (cf. Dunning 1993) to test whether the two profiles are significantly different from each other.
If they are, we use city block distance to calculate linguistic distance. If they are not, we use (a constant very close to) zero as linguistic distance.

4 In practice there is one more subtlety. Rather than simply averaging over the concept-specific distances, we use a weighted sum of the concept-specific distances. The weight of a concept-specific distance is determined by the sum of all naming events in the union of the two corpora under comparison that pertain to this concept. The purpose is to give more weight to naming choices for concepts that on average are referred to more frequently. To give an extreme example from our analysis of clothing terms, we want to prevent variation in the naming of a rarely referred to concept such as BOWLER HAT from having as much influence on the global linguistic distance as variation in the naming of a frequently referred to concept such as TROUSERS.

5 In theory, one could claim that such thematic bias should be avoided in the design of the corpus. In practice, however, this is virtually impossible.

6 We borrow the term key words from Scott (1997). Actually, the term is older, but initially the procedure was quite different (cf. Williams, 1976, and, for a discussion, Stubbs, 1996). We use the term key word as it is used in Scott (1997). By a positive key word of corpus A (when compared to B) we mean a key word that, in terms of relative frequency, is used more often in A than in B. By a negative key word we mean a key word that, in terms of relative frequency, is used less often.

7 The pair-wise comparison of all corpora is not the most typical use of key words, which normally involves comparing corpora to a reference corpus. However, we believe that in the context of our design (analysing clusters through MDS) a direct comparison of the corpora is more straightforward.

8 It also should be mentioned that in this MDS-analysis the stress is much higher than in the profile-based analysis.
In fact, in the key word based calculations, one actually needs three dimensions to obtain an acceptable fit. Nevertheless, we chose to show the two-dimensional plot because the plot is easier to interpret, and because the main observations are not affected by this simplification. For a more detailed analysis of these key word based calculations we refer the reader to Speelman, Grondelaers and Geeraerts (forthcoming).

9 Currently our compilation of profiles is mainly based on synonym lists that are derived from dictionary information. Automatic or semi-automatic derivation of potentially interesting variables from the corpora serves as a complementary strategy.

10 We applied the method to word forms, not to lemmata.

11 It could be called more than a practical problem. It is often claimed that many methods that are currently used to detect significant differences in the use of a term (most notably chi-square based methods) signal a counter-intuitively large number of significant differences. One of the sources of the problem, as mentioned in Kilgarriff (forthcoming), is the distribution of the lexicon: highly frequent terms are so frequent, and therefore produce so many observations, that even subtle, often linguistically less relevant or distinctive patterns reach significance.

12 The concept is related to the key key concept (cf. Scott, 1997), but stable markers are more generic, because they involve the comparison of two sets of corpora, and are used for different purposes.

13 After filtering out a few proper names.

References

Biber, D. (1995), Dimensions in Register Variation, Cambridge University Press.
Dunning, T. (1993), ‘Accurate Methods for the Statistics of Surprise and Coincidence’, Computational Linguistics, 19(1): 61-74.
Geeraerts, D., S. Grondelaers and P. Bakema (1994), The Structure of Lexical Variation. Meaning, Naming and Context, Berlin: Mouton de Gruyter.
Geeraerts, D., S.
Grondelaers and D. Speelman (1999), Convergentie en divergentie in de Nederlandse woordenschat. Een onderzoek naar kleding- en voetbaltermen, Amsterdam: Meertensinstituut.
Grondelaers S., K. Deygers, H. van Aken, V. van den Heede and D. Speelman (2000), ‘Het ConDiv-corpus geschreven Nederlands’, Nederlandse Taalkunde, 5: 356-363.
Grondelaers S., H. van Aken, D. Speelman and D. Geeraerts (2001), ‘Inhoudswoorden en preposities als standaardiseringsindicatoren. De diachrone en synchrone status van het Belgische Nederlands’, Nederlandse Taalkunde, 6: 179-202.
Kilgarriff, A. (forthcoming), ‘Comparing corpora’, International Journal of Corpus Linguistics.
Scott, M. (1997), ‘Pc analysis of key words – and key key words’, System, 25: 233-245.
Stubbs, M. (1996), Text and Corpus Analysis, Oxford: Blackwell.
Speelman, D., S. Grondelaers and D. Geeraerts (forthcoming), ‘Profile-based Linguistic Uniformity as a Generic Method for Comparing Language Varieties’, Computers and the Humanities.
Williams, R. (1976), Keywords, London: Fontana.

A multilingual learner corpus in Brazil

Stella E. O. Tagnin
University of São Paulo

Abstract

A learner corpus can provide useful data to detect specific difficulties of language learners and consequently inform the production of pedagogic material to address these problem areas. The USP Multilingual Learner Corpus will initially be built in English, German and Spanish. A heading with detailed information about the student (course level, age, sex, mother tongue, etc.) will allow for different types of research. The corpus’s multilingual character will make it possible to look into difficulties common to all Brazilian learners, irrespective of language. Its varied content may also provide insights into the effectiveness of different methodologies.

1. Introduction

One of the problems with textbooks used in Brazil for teaching a foreign language is that most are written by foreign authors unacquainted with Brazilian students’ difficulties. It is a known fact that a learner corpus can provide useful data to detect such specific difficulties and consequently inform the production of pedagogic material to address these problem areas (Leech, 1998). Until recently, the only learner corpus under construction in Brazil was the Br-Icle (Berber Sardinha, 2001), the Brazilian Portuguese part of the ICLE project (Granger, 1993, 1994), which at the time of writing contains 40,000 words of argumentative texts collected at the Catholic University of São Paulo (PUC-SP). In early 2002 the University of São Paulo (USP) joined the project. The PUC-USP partnership in the Br-Icle project triggered an interest in extending the collection of texts to include all genres in which there is student production at our Department of Modern Languages, which is composed of five areas: English, German, Spanish, French and Italian.

This chapter will give an overview of foreign language teaching in Brazil, with a special focus on the state of the art at the University of São Paulo. It will discuss the PUC-USP partnership and how this project motivated teachers from other languages to build their own learner corpora, which will all be brought together in the Multilingual Learner Corpus (MLC). This corpus is part of a larger project – COMET, a Multilingual Corpus for Teaching and Translation, which is also being built at the University of São Paulo under my co-ordination (Tagnin, 2002a, 2002b). Next, it will discuss the design and structure of the corpus. The last part will focus on the research possibilities envisaged by the MLC.

2. Foreign language teaching in Brazil

One of the problems with textbooks used in Brazil for teaching a foreign language is that most are written by foreign authors unacquainted with Brazilian students’ difficulties. The other is that they take no account of Brazilian culture or students’ interests.

2.1 The curriculum at the University of São Paulo

The Department of Modern Languages is divided into five areas, one for each “modern” foreign language taught at the undergraduate level: English, French, German, Italian and Spanish. The curriculum is composed of eight semesters during which students follow courses addressing both the language and the literature components. With the exception of English, where it is taken for granted that students have a fairly good command of the language on entering the course, the other languages start teaching “from scratch” as these languages are not taught regularly in secondary school whereas English is part of the compulsory curriculum.

Due to the high demand for foreign language courses in general, the Department also offers extracurricular courses, mainly aimed at the academic community. These go by the name of English on Campus, Español en el Campus etc. and extend from five to ten semesters, depending on the language. They are taught by postgraduate students under the supervision of a co-ordinator, who is a regular teacher in that language. In section 4.1 we will go into more detail about these courses and how they can contribute to the corpus.

3. Learner corpora in Brazil

To our knowledge, the only learner corpus under construction in Brazil is the Br-Icle. To date it is composed of 40,000 words compiled at the Catholic University of São Paulo under the co-ordination of Tony Berber Sardinha. In line with ICLE requirements it is restricted to argumentative texts and should reach 200,000 words on completion. The project was joined by USP in early 2002, and is due to be completed by the end of 2004.
Although we know of no other “formal” learner corpus, several sets of materials involving the teaching of foreign languages have been assembled by various researchers. A few are in electronic format (diskettes, CD-ROMs), but most are probably not. In any case, the material is not prepared for investigation with the aid of electronic search tools.

The German area has been working on a Contrastive German-Portuguese Grammar project for which it has collected different types of student production. Most of this material is recorded on CD-ROMs or diskettes but is only available for internal use:

• Verbs of transportation. Vol. 1. CAPLE - Corpus of German and Portuguese as Foreign Languages (Blühdorn et al., 1997). This material was collected in several schools engaged in foreign language teaching. In Brazil it came from third and fourth year undergraduate students at USP and intermediate and advanced students at the Goethe Institut in São Paulo; in Germany from undergraduate German students and learners of Portuguese at the University of Erlangen-Nürnberg. It consists of three types of production: a) sentences in which students were required to use verbs of transportation, b) translations of 17 sentences into the foreign language, and c) description of the stories presented in six different sequences of cartoons.
• Compositions in German. Vol. 2. CAPLE - Corpus of German and Portuguese as Foreign Languages (Blühdorn et al., 1999). This material was collected between 1996 and 1998 from 342 informants at three German-Brazilian secondary schools and is divided into three categories: a) Brazilian learners of German, b) Brazilian learners who have both German and Portuguese at home, and c) German native speakers living in Brazil.
• Contrastive Analysis Corpus of Mistakes in Portuguese and German as Foreign Languages (Glenk & Stanich, 2000). This material was collected from second to third year undergraduate learners of Portuguese at the University of Erlangen-Nürnberg (1997) and the University of Vienna (1999 and 2000), and from third and fourth year undergraduate learners of German at USP (1998 and 1999). It consists of descriptive texts based on the cartoons used in the research referred to above, narrative texts and essays.

Other non-contrastive materials are:

• Corpus of letters exchanged between learners of German in Fès (Morocco) and São Paulo (University of São Paulo) (Blühdorn, 1997).
• Studentenzeitung (Blühdorn, 1999). A newspaper written by third year German learners at USP during the second semester of 1999.
• Student writing, different typologies: criteria for text production (Nomura, in preparation). Material produced by second year German learners at USP.

In English and Spanish there are scattered collections of texts as a result of individual research by postgraduate students, mainly intended for contrastive studies. However, they are not in a format that makes them searchable by corpus tools.

4. The Learner Corpus at USP

When the PUC-USP partnership was established, several teachers at USP became interested in corpus studies. Because the Br-Icle requires only argumentative texts, the English teachers decided to build a corpus with the other types of texts produced by their undergraduate students, mainly narratives and essays. However, once the goal of 200,000 words of argumentative texts has been reached for the Br-Icle, this type of text will also be included in the USP Multilingual Learner Corpus. The German and Spanish areas have already joined the MLC project. French and Italian have shown some interest but no official contact has been made as yet. Nevertheless, the project is underway, and it will also be fed with texts from the on campus courses.
4.1 The on campus courses

The participation of the on campus courses at USP opens up other possibilities, such as including other genres of texts and gathering the production of another type of student. As these courses are aimed mainly at the academic community, that is, students, teachers and other employees, one gets a fairly varied audience, both in terms of age and cultural background, which is certain to affect the content of their production. Quite a few undergraduate students attend these courses to “catch up” with the rest of their class, that is, as remedial work to improve their linguistic performance. The teaching is more informal than in the regular undergraduate courses and students feel they have less responsibility when it comes to passing or failing the course.

The English on Campus (EOC) course has a total of ten semesters: Basic 1, 2 and 3, Pre-Intermediate 1 and 2, Intermediate 1 and 2, Advanced 1 and 2 and a semester of Conversation. The course books used are New Interchange: from Introduction up to volume 3, Part B at the Basic, Pre-Intermediate and Intermediate levels, and Passages Parts A & B at the advanced level. Topics for the written assignments are suggested according to the grammar points addressed. For instance, “Write a conversation with a friend in which you describe your apartment or house and ask about his or her living place” when the focus is on the Simple Present, short answers; questions with how many and answers with there is, there are. Due to the high number of students at the English on Campus courses – approximately 700 – and considering that two assignments are submitted per student, it would be possible to collect around 1,000 texts per semester, that is, as long as most learners agree to sign a permission form allowing their assignments to be included in the corpus.

The picture is slightly different for the German on Campus (GOC) course.
It is a five-semester course: Basic 1, 2, 3 and 4 and a semester of Conversation. A German course book, Moment mal!, is used at the first three levels. Basic 4 and Conversation use material prepared by the teachers and based on other German books. The content of the Conversation course varies each semester and students may take it more than once, as it is mainly aimed at giving undergraduate students an opportunity to exercise their oral production. As there are approximately 100 students enrolled, the number of possible texts for inclusion each semester would be around 200, again with two texts per student.

The levels at Spanish on Campus (SOC) are Basic 1, 2, Intermediate 1, 2 and Advanced 1, 2. Three enhancement modules are also offered, each one semester long: a) Culture, b) Literature, and c) Conversation. A Grammar module is in preparation. As opposed to the previous courses, Spanish relies only on material prepared by its own teachers, based on the profile and a needs analysis of its students. There are currently about 460 students enrolled in these courses, which would give a total of 920 texts per semester. The course has also been a source for postgraduate research on exclusion and self-exclusion factors; methodological and ideological analyses of textbooks; error analysis; evaluation of assessment procedures; and theories of language acquisition focusing especially on how or how much learning can contribute to acquisition. One such study took a contrastive approach, comparing how the simple and compound past tenses are used by learners of Spanish and of English. This study was informed by data from both the English on Campus and Spanish on Campus courses.

5. The USP Multilingual Learner Corpus

At present, the USP Multilingual Learner Corpus (MLC) is to be composed of texts produced by undergraduate (UG) and on Campus learners in the areas of English, German and Spanish. As mentioned above, English undergraduate texts of the argumentative type will be fed into Br-Icle until it has reached its goal of 200,000 words (see Figure 1).

[Figure 1: The composition of the USP Learner Corpus. Diagram labels: English, German and Spanish, each with an undergraduate (UG) and an on Campus (EOC, GOC, SOC) component; Br-Icle.]

Each student will be identified by a code and a profile with basic information as to his or her course, level, year of attendance, age, sex, etc. Each text will be stored in its full form and preceded by a header with information as to text type, grammar point covered, topic of assignment, course book (or other materials) in use, etc.

6. Possible areas of research

Learner corpora in various parts of the world have already produced a wealth of research (Granger, 1998b; Granger, 2002; Granger et al., 2002) but to our knowledge there is no multilingual learner corpus, that is, one in which learners with a common mother tongue are learning different foreign languages. This is in contrast with the ICLE project, in which learners with different mother tongues are learning a common language. With this design, and given that each subcorpus is also a stand-alone contrastive learner corpus, in that it allows comparison between productions originating in the two distinct courses offered at USP, it is envisaged that the corpus will allow not only for horizontal studies, comparing student production originating in the same class or at the same level, but also for studies on the vertical axis, assessing student development over a period of time, either individually or collectively (cf. Kaszubski, 2000; Lenko-Szymanska, 2000; among others). Research on student writing strategies such as paraphrasing, and on the (over/under) use or avoidance of certain syntactic structures, vocabulary items, collocations and formulas (cf. Altenberg & Tapper, 1998; Altenberg, 2002; Berber Sardinha, 2001; De Cock, 1998; Granger, 1998a, 1998b; and many others) will also be possible.
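The learner profile, the text header and the horizontal and vertical study designs described above could be represented along the following lines. This is only a minimal sketch: the class names, field names, the `semester` label format and the two helper functions are illustrative assumptions, not the storage format actually adopted for the MLC.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LearnerProfile:
    """Profile attached to a learner code (hypothetical field set)."""
    code: str    # anonymised learner identifier (assumed format)
    course: str  # e.g. "UG" or "EOC"
    age: int
    sex: str


@dataclass(frozen=True)
class TextRecord:
    """One learner text stored in full, together with its header fields."""
    learner: LearnerProfile
    language: str       # "English", "German" or "Spanish"
    level: str          # e.g. "Basic 1"
    semester: str       # sortable label such as "2003-1" (assumed convention)
    text_type: str
    grammar_point: str
    topic: str
    materials: str      # course book or other materials in use
    body: str


def horizontal(records, language, level, semester):
    """Texts from the same language, level and semester (horizontal axis)."""
    key = (language, level, semester)
    return [r for r in records if (r.language, r.level, r.semester) == key]


def vertical(records, learner_code):
    """All texts by one learner in chronological order (vertical axis)."""
    own = [r for r in records if r.learner.code == learner_code]
    return sorted(own, key=lambda r: r.semester)
```

Keeping the profile separate from the text header means that one learner code can be followed across several semesters, which is what a vertical study would need.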
More interesting perhaps is the possibility of cross-linguistic studies, such as detecting problems common to learning a foreign language or problems common to Brazilian learners. Another contrastive area made possible by the design of the corpus lies in the field of methodology, as it will enable researchers to evaluate the effectiveness of different methodologies or materials in the undergraduate and/or the on campus courses.

7. Conclusion

The Multilingual Learner Corpus under construction at the University of São Paulo is currently being fed with student production from two types of courses: the regular undergraduate courses and the extracurricular on campus courses offered by the areas of English, German and Spanish at the Department of Modern Languages. The MLC is not only a promising project in terms of the array of possible research areas, but it has also integrated the different languages taught at the Department by bringing them together to work on a common project.

References

Altenberg, B. (2002), Using bilingual corpus evidence in learner corpus research, in: S. Granger, J. Hung and S. Petch-Tyson (eds.) Computer Learner Corpora, Second Language Acquisition and Foreign Language Teaching. Amsterdam and Philadelphia: Benjamins. 37-54.
Altenberg, B. and M. Tapper (1998), The use of adverbial connectors in advanced Swedish learners’ written English, in: S. Granger (ed.) Learner English on Computer. London and New York: Addison Wesley Longman. 80-93.
Berber Sardinha, T. (2001), O Corpus de Aprendiz Br-Icle, http://lael.pucsp.br/~tony/2001bricle-interc.pdf.
Blühdorn, H., G. Evangelista and M. C. Reckziegel (eds.) (1999), Redações em alemão. Vol. 2. CAPLE - Corpus em alemão e português como línguas estrangeiras. São Paulo, USP.
Blühdorn, H. (ed.) (s/d), Sagen und Legenden, Tänze, Festtagsbräuche und Kochrezepte aus Brasilien.
Eine interkulturelle Korrespondenz zwischen Deutsch-Studenten in São Paulo, Brasilien, und Fès, Marrocos.
Blühdorn, H. (ed.) (1999), Studentenzeitung. São Paulo, USP.
Blühdorn, H., L. F. Dias Moreira and R. F. Silva (eds.) (1997), Verbos de transporte. Vol. 1. CAPLE - Corpus em alemão e português como línguas estrangeiras.
De Cock, S. (1998), A Recurrent Word Combination Approach to the Study of Formulae in the Speech of Native and Non-Native Speakers of English. International Journal of Corpus Linguistics, vol. 3(1): 59-80.
Glenk, E. and K. Stanich (eds.) (2000), Corpus de Análise Contrastiva de Erros em Português e Alemão como Línguas Estrangeiras. São Paulo, USP, setembro de 2000.
Granger, S. (1993), The International Corpus of Learner English, in: Aarts, J., P. de Haan and N. Oostdijk (eds.) English Language Corpora: Design, Analysis and Exploitation. Amsterdam: Rodopi. 57-69.
Granger, S. (1996), From CA to CIA and back: An integrated approach to computerized bilingual and learner corpora, in: Aijmer, Karin, Bengt Altenberg & Mats Johansson (eds.) Languages in Contrast – Papers from a Symposium on Text-based Cross-linguistic Studies, Lund 4-5 March 1994. Lund: Lund University Press. 37-51.
Granger, S. (1998a), Prefabricated patterns in advanced EFL writing: collocations and formulae, in: Cowie, A. (ed.) Phraseology: theory, analysis and applications. Oxford: Oxford University Press. 145-160.
Granger, S. (ed.) (1998b), Learner English on Computer. London & New York: Addison Wesley Longman.
Granger, S. (2002), A Bird’s-eye View of Computer Learner Corpus Research, in: S. Granger, J. Hung and S. Petch-Tyson (eds.) Computer Learner Corpora, Second Language Acquisition and Foreign Language Teaching. Amsterdam and Philadelphia: Benjamins. 3-33.
Granger, S., J. Hung and S. Petch-Tyson (eds.) (2002), Computer Learner Corpora, Second Language Acquisition and Foreign Language Teaching. Amsterdam and Philadelphia: Benjamins.
Kaszubski, P.
(2000), Lexical profiling of English (learner) corpora: can we measure advancement levels?, in: B. Lewandowska-Tomaszczyk and P. Melia (eds.) Łódź Studies in Language. Vol. 1: PALC’99: Practical Applications in Language Corpora [Papers from the International Conference at the University of Łódź, Poland, 15-18 April, 1999]. Frankfurt am Main: Peter Lang. 249-86.
Leech, G. (1998), Learner corpora: what they are and what can be done with them, in: S. Granger (ed.) Learner English on Computer. London & New York: Addison Wesley Longman. xiv-xx.
Lenko-Szymanska, A. (2000), How to trace the growth in learner’s active vocabulary. A corpus-based study. Paper presented at the 4th International Conference on Teaching and Language Corpora. Graz, 19-23 July 2000.
Nomura, M. (org.) (in preparation), Redação de textos de diferentes tipologias: critérios de produção de texto.
Tagnin, S. E. O. (2001), COMET – A Multilingual Corpus for Teaching and Translation, in: Lewandowska-Tomaszczyk, B. (ed.) PALC 2001 – Practical Applications in Language Corpora. Frankfurt am Main: Peter Lang. 535-540.
Tagnin, S. E. O. (2002), Taking off in Brazil: COMET – A Multilingual Corpus for Teaching and Translation. Paper presented at ICAME 2002 – The Theory and Use of Corpora – The 23rd International Conference on English Language Research on Computerized Corpora of Modern and Medieval English, Gothenburg, Sweden, May 22-26, 2002.

Quantitative or qualitative content analysis? Experiences from a cross-cultural comparison of female students’ attitudes to shoe fashions in Germany, Poland and Russia

Andrew Wilson and Olga Moudraia
Lancaster University

Abstract

In order to examine differences in attitudes to shoe fashions between women in Germany, Poland and Russia, we asked three samples of advanced female students of English to write a short English composition in response to the stimulus: “Tell us a little bit about the footwear (shoes, boots, etc.) you own and when you wear it”.
We analysed the results using a manual qualitative content analysis and two forms of quantitative computer content analysis: one using project-specific categories developed from the qualitative content analysis and previous theory, the other using general semantic field categories. Both techniques were successful in highlighting similar between-group differences, suggesting that qualitative content analysis and project-specific categories can largely be dispensed with. Some issues in using non-native student English compositions as data in cross-cultural studies are also considered.

1. Introduction

Appearance is an integral part of communication and miscommunication. If we want proof of this claim, we need cast our minds back no further than the British Conservative Party’s annual conference in 2002, when its new chairperson, Theresa May, took to the platform in kitten-heeled leopardskin shoes. This deviation from a traditional “business” shoe received huge media attention, with close-up photographs in many national newspapers, but it ultimately overshadowed what she had to say: it has been remembered long after the content of her speech has been forgotten. Thus, something as everyday as her choice of shoe can be said to have led to an overall communication failure.¹ It is clear from this example that the messages sent out by clothing and footwear (non-verbal communication) deserve as much attention as those messages transmitted by words (natural language), and it is from this position that we began to work on apparel-based non-verbal communication, and, in particular, on footwear.

1.1 Cross-cultural differences in non-verbal communication

We know that differences in non-verbal communication (hereafter “NVC”) exist between nationalities and cultures (Andersen, 1988). However, although a number of comparative studies in NVC have been carried out (e.g.
Remland et al., 1991), these have tended to focus on the more bodily aspects, such as gesture, facial expression, touch and proxemics. By contrast, a very large proportion of the empirical work on apparel has been carried out in the USA, focussing primarily on American subjects. Comparatively little empirical work of this kind has been carried out in Europe (especially eastern Europe), and even less has had a contrastive orientation. An important consequence of this is that the claims of popular “dress for success” manuals, whatever their empirical and theoretical foundations in one culture, cannot simply be transferred by translation to other cultural contexts. In order to test how, and to what extent, perceptions of apparel differ cross-culturally, we have been collecting data from female students in three European countries: Germany, Poland and Russia.

1.2 Approaching NVC in the respondents’ own words

Although some of the early work on NVC made use of open-ended qualitative data (e.g. Stone, 1962), much recent research in the field has been carried out in an experimental scale-based paradigm (Davis & Lennon, 1988). However, this latter approach, although readily quantified, has the problem of imposing the researchers’ pre-defined categories and interpretive schemata on the experimental subjects: for instance, they have to rate the image of a variously clothed figure as being more or less “friendly and approachable”. A notable exception to this experimental paradigm is the small-scale study by Golliher (1987), who employed a form of open-ended depth interviewing in order to re-construct the categorisation and interpretation of clothing items in his informant’s own terms. This qualitative ethnomethodological approach is valuable, because it can provide much richer data than the experimental approach: it taps into the actual attitudes and perceptions of the informants rather than having them agree or disagree with the researchers’ attitudes.
In this study, we have applied (for logistical reasons) a written variant of open-ended interviewing.

1.3 Analysing open-ended responses

For the analysis of open-ended response data, there are two main paradigms available to the researcher: qualitative analysis and quantitative analysis. One of our goals in this pilot study was to examine the relative merits of a qualitative and a quantitative content analysis of our data.

Qualitative content analysis, as we understand it, is a variant of the grounded theory approach to text analysis – in other words, it is a bottom-up approach to identifying the main ideas within people’s discourse. In the course of careful reading, linguistic units within the text (words and phrases) are annotated and aggregated into broader categories as themes begin to emerge from the data. As its name suggests, qualitative content analysis is not, in itself, a quantitative approach; however, as well as functioning as a stand-alone methodology, it can also feed into dictionary construction for quantitative content analysis.

In contrast to qualitative content analysis, quantitative content analysis attempts to provide a numerical measure of mention which can then be used to make statistical comparisons between respondent groups. Quantitative content analysis is typically fully automated (or at least computer aided) and can take two main forms: a form that relies on the classification of text words according to a predefined set of categories (dictionary-based content analysis) and a form that makes use of the multivariate analysis of word contiguities to derive themes from the set of texts being examined (correlational content analysis) (Hogenraad, McKenzie & Péladeau, 2003).
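The dictionary-based form can be illustrated with a minimal sketch: each word of the running text is looked up in a dictionary of content categories, the matches are counted per category, and the counts are normalised to a frequency per million words. The toy dictionary, the tokeniser and the function names below are assumptions made purely for illustration; they are not the USAS dictionary or its matching procedure, which also handles multi-word units.

```python
import re
from collections import Counter

# Hypothetical toy dictionary mapping words to content categories.
TOY_DICTIONARY = {
    "boots": "BOOT",
    "sandals": "SNDL",
    "comfortable": "COMF",
    "warm": "TEMP",
    "cold": "TEMP",
}


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


def category_counts(texts, dictionary):
    """Return (category counts, total token count) for a group of texts."""
    counts = Counter()
    total = 0
    for text in texts:
        tokens = tokenize(text)
        total += len(tokens)
        # Count one hit for the category of every token found in the dictionary.
        counts.update(dictionary[t] for t in tokens if t in dictionary)
    return counts, total


def per_million(count, total_tokens):
    """Normalise a raw count to a frequency per million words."""
    return count / total_tokens * 1_000_000
```

Calling `category_counts` once per respondent group and passing each count through `per_million` gives figures that can be compared across groups of different sizes.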
In dictionary-based content analysis – the method which we have used in this paper – a dictionary of words and content categories is constructed, and words in the running text are then automatically matched with the dictionary categories. The content analysis program then provides a frequency of use for each category in the study across each group of texts indicated by the researcher. The categories in dictionary-based content analysis may be quite general, semantic-field-type categories, which can be used for a wide range of studies, or they may be specially designed for a particular study, typically on the basis of previous theory and a prior qualitative analysis of at least some of the data. In this study, we have undertaken two quantitative content analyses – one with general semantic field categories and one with project-specific categories – and we have compared both these sets of results with each other and with the outcome of the qualitative analysis.

1.4 Language issues in cross-cultural survey research

One of the major difficulties in conducting cross-cultural research lies in language. If material is collected in several languages (e.g. German, Polish and Russian), it can then prove rather difficult to quantify and compare the results. This is certainly the case if one is working with word-level data within a methodology such as correlational content analysis, since differences in vocabulary range and grammatical structure can themselves lead to differences in the relative frequencies of comparable words. If dictionary-based content analysis is used, then vocabulary range and grammar may prove rather less of a problem, but another problem then arises: the availability of exactly comparable content analysis dictionaries and related processing tools for all the languages in question.
In an attempt to avoid both these sets of problems, we have been collecting our data in a single language – English – from advanced learners of that language in each of the countries covered by the project. In principle, this should enable us to make direct comparisons between national groups. However, at the outset, it was not clear what other issues non-native Englishes might pose for the content analyst, and so this is an issue which we kept in focus during this pilot study and to which we return later in the paper.

2. Data

Our data consisted of compositions written by advanced female students of English at universities in Germany, Poland, and Russia. These were all collected during the summer of 2002 and were provided in computer-readable form. All respondents were natives of the country in question: any students not conforming to this criterion were dropped from the subsequent analysis. In this pilot study, there were 11 compositions from Germany, 34 from Poland, and 13 from Russia.² The students were asked to write on each of the following open-ended stimulus questions:

• What sort of things would you wear to a job interview and why?
• What sort of things would you wear to a friend’s birthday party and why?
• Tell us a little bit about the footwear (shoes, boots, etc.) you own and when you wear it.

In the present study, we focus only on the last of the three questions – the one about footwear.

3. Qualitative analysis

Close reading of the material helps us understand what the writers are telling us in their own words. A special coding frame was constructed to list and classify the words and phrases in our shoe data. We aimed to find the answers to three Wh-questions: What? Why? When?
Qualitative content analysis of the data revealed the following main themes:

• What – shoe types and their physical attributes;
• Why – reasons for wearing different types of shoes such as comfort, practicality including matching outfits, suitability, design, colour, attractiveness and price;
• When – occasions when particular types of shoes are worn: related to weather, outdoor and indoor activities, lifestyle, formal and informal social events, etc.

The examination of Becker and Lißmann’s (1973) two levels of content, primary content (themes and main ideas of the text) and latent content (context information), led us to the following conclusions on the similarities and differences in the subjects’ classifications of shoes. Firstly, the Poles described the widest range of shoe types, with the main emphasis on comfort and elegance, while occasions when particular types of shoes are worn were mainly related to outdoor activities and to formal and informal social events. Secondly, the Russian respondents, quite unsurprisingly, had a bigger variety of boots and generally referred to weather-related occasions for wearing particular types of footwear, with a focus on cold weather conditions; they placed major emphasis on the comfort and attractiveness of shoes. Thirdly, the German subjects wrote less about high heels, placing more emphasis on practicality; occasions when particular types of shoes are worn were varied, including outdoor activities, formal and informal social events, as well as weather-related occasions with a focus on warm weather conditions.

4. Quantitative analysis

For the quantitative content analysis, we used the USAS suite of programs, developed over the past 13 years at Lancaster University (Wilson & Rayson, 1993; Rayson & Wilson, 1996). USAS is a software package for dictionary-based content analysis; we hope to apply correlational content analysis to these data in a future study.
Using a comprehensive dictionary and multi-word-unit list (which we updated to cover the vocabulary of the present data), USAS assigns a basic semantic field code (or “SEMTAG”) to each lexical item or phrase within a text – for example, “Colour”, “Body and Bodyparts”, “Power”, “Similar/Different”, etc. USAS performs this task with approximately 92% accuracy. We shall return to these SEMTAGs later in the paper.

A further feature of USAS is the module called MAPPING. This enables researchers to create their own, project-specific category system (or “CONTAGs”) by conflating, subdividing, or ignoring various SEMTAG categories. For our data analysis, we used MAPPING to create categories based both on previous work on the perception of shoe fashions and on the results of the qualitative analysis.

To examine differences between the three groups of respondents, we used the TMATRIX module of USAS. This module enables the researcher to see the frequencies of words and categories for each group, examine concordances, and carry out log-likelihood or chi-squared tests on word and category frequencies.

4.1 Project-specific category construction

Although shoes have been said “never to lie” about the wearer’s personality (Pond, 1986), in the literature on NVC and first impressions they have received comparatively little attention when set alongside other apparel items such as dresses or suits. Shoes have occasionally been included in broader studies (e.g. Lennon & Miller’s (1984/85) study included a pair of brown boots), but, at the time of carrying out the research, the only detailed study known to us was that of Kaiser et al. (1987).³ It was thus from Kaiser et al.’s study that we set out to create CONTAGs for our project. These CONTAGs would enable us to see whether the European nationalities used the same kinds of criteria as Kaiser et al.’s American sample when writing about footwear and whether or how they differed amongst themselves.
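The three operations that MAPPING offers (conflating, subdividing and ignoring SEMTAG categories) can be pictured with a small sketch. The category labels are taken from the tables in this paper, but the rule tables, the word lists and the `to_contag` function are hypothetical: they do not reproduce the MAPPING file format or the actual rules used in the study.

```python
# Hypothetical mapping rules (illustration only, not the USAS MAPPING format).

# Conflating: several SEMTAGs collapse into one CONTAG.
CONFLATE = {
    "Temperature: cold": "TEMP",
    "Temperature: hot": "TEMP",
    "Weather": "WEAT",
}

# Subdividing: one broad SEMTAG is split by word into finer CONTAGs.
SUBDIVIDE = {
    "Clothing and accessories": {
        "boots": "BOOT",
        "sandals": "SNDL",
        "slippers": "SLPR",
        "trainers": "TRNR",
    },
}

# Ignoring: SEMTAGs that are dropped from the project-specific analysis.
IGNORE = {"Pronouns", "Numbers", "Grammatical words"}


def to_contag(word, semtag):
    """Map one (word, SEMTAG) pair to a CONTAG, or None if it is dropped."""
    if semtag in IGNORE:
        return None
    if semtag in SUBDIVIDE:
        # Words not listed for a subdivided tag are left uncategorised here.
        return SUBDIVIDE[semtag].get(word.lower())
    return CONFLATE.get(semtag)
```

In this sketch a word such as "boots" keeps its general tag in the SEMTAG analysis but receives the finer BOOT category in the project-specific one.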
Prior to carrying out a semantic differential (questionnaire) study with a larger sample of respondents, Kaiser et al. undertook a focus group with five men and five women living in California. The respondents were shown a range of shoe styles, which they then discussed. From these discussions, the following primary dimensions were extracted:

• old-young
• liberal-conservative
• work-leisure
• comfortable-uncomfortable
• unsexy-sexy
• formal-casual
• high-status-low-status
• inexpensive-expensive
• dislike-like
• fashionable-unfashionable

These dimensions were relatively unproblematic to operationalise as content categories, with the exception of liberal-conservative. As an approximation to this category, therefore, we made the distinction classical-modern. In addition to the categories based on Kaiser et al.’s dimensions, we also created categories to represent the main types of footwear mentioned by our respondents. Most of these are self-explanatory; the category “elegant shoes” contains references to stilettos, court shoes, pumps, etc. Finally, we created categories to measure references to the main additional issues that emerged from the qualitative content analysis. Table 1 details the set of content categories that we worked with.

Table 1: Content categories used in the analysis

Shoe styles:
BOOT   Boots
CLOG   Clogs
DOCS   Doc Martens (and similar shoes)
ELSH   Elegant shoes
FLTS   Flat shoes
SLPR   Slippers
SNDL   Sandals
TRNR   Trainers
OTHE   Other shoe styles
Kaiser et al. dimensions:
AGES   Young-old
CLAS   Classical-modern
COMF   Comfortable-uncomfortable
FASH   Fashionable-unfashionable
FORM   Formal-casual
LIKE   Like-dislike
PRIC   Expensive-inexpensive
SEXY   Sexy-unsexy
STAT   High-status-low-status
WORK   Work-leisure

Additional categories from qualitative analysis:
ATTR   Attractive-unattractive
PHYS   Physical attributes
PRAC   Suitable-unsuitable
SEAS   Seasons
TEMP   Temperature
WEAT   Weather

4.2 Results

All of Kaiser et al.’s dimensions were used by our respondents, though to varying degrees: for example, high-status-low-status was relatively little used, whilst work-leisure was a very widely used category. Table 2 shows the results for the Kaiser et al. categories aggregated for the whole sample.

Table 2: Frequencies of Kaiser et al. dimensions (whole sample)

Category   Freq. per million words
WORK                        21,483
LIKE                         9,645
COMF                         7,453
FORM                         2,255
FASH                         1,754
CLAS                           752
PRIC                           626
SEXY                           626
AGES                           564
STAT                           376

Table 3 shows the between-groups differences for all the content categories. For three groups (2 d.f.), log-likelihood (LL) values of 9.21 or greater are significant at p < 0.01 and values of 5.99 or greater are significant at p < 0.05. In terms of between-groups differences, only one of the Kaiser et al. categories (fashionable-unfashionable) distinguished between groups at p < 0.01, with the Russian sample writing the most about this dimension. A further category – classical-modern – distinguished at the lower probability level (p < 0.05), again with the Russian sample writing the most about this dimension.

Table 3: Between-groups differences on content categories (category frequencies per million words)

Category     German    Polish   Russian     LL
PHYS         46,415    29,434    63,778   75.0
BOOT          7,682     2,725    13,628   47.9
SEAS         13,444     7,849    16,626   19.9
TEMP          7,682     2,834     7,086   16.7
WEAT          3,201     2,180     7,086   15.7
FASH          1,280       981     4,088   12.4
SLPR          1,601       436     2,726   11.5
ATTR          4,161     8,721     4,906   10.4
FLTS              0     1,744     1,090    9.5
SNDL          6,722     2,725     3,543    8.8
CLAS              0       654     1,635    7.7
DOCS            320       872         0    5.9
OTHE          2,241     2,180       545    5.4
PRIC              0       763       818    4.4
COMF          4,802     8,285     7,632    4.2
STAT            320       545         0    3.4
ELSH          5,442     7,304     9,267    3.4
CLOG            320         0         0    3.3
WORK         19,846    23,111    18,806    2.8
LIKE          7,682    10,575     8,994    2.3
FORM          2,561     2,507     1,363    1.9
SEXY            320       545     1,090    1.7
PRAC          2,241     3,270     3,816    1.4
AGES            960       545       273    1.4
TRNR            640       327       545    0.6

Taking into account the further content categories developed on the basis of the qualitative content analysis, the following picture of the three national groups emerged. We discuss here only those categories which showed significant between-groups differences.

The Russian sample wrote most about the physical attributes of their shoes, such as size, shape, material and colour. By contrast, the Poles wrote the least about these dimensions overall, although this was still the most frequent category within the Polish sample.

The Russian sample also wrote the most about seasons and the weather as determiners of when certain styles of shoes are worn. The German sample also wrote substantially about seasons; they wrote rather less about weather conditions, but had the highest frequency of references to temperature (these being a mixture of weather references and references to shoes being warm or cold to wear). Again, the Polish sample seemed somewhat less concerned about these three dimensions. Interestingly, seasons, weather and temperature were not included as dimensions of usage in Kaiser et al.’s study, but, at least for some of our sample, they seem important determiners of shoe style choice.
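The log-likelihood comparison behind Table 3 can be illustrated as follows. The sketch assumes the form of the statistic commonly used for corpus frequency comparison, LL = 2 Σ O_i ln(O_i / E_i), where O_i is the observed count of a category in group i and E_i = N_i ΣO / ΣN is the count expected from that group's share of the total words; whether TMATRIX computes exactly this variant is an assumption. The function takes raw counts and group sizes in words, which are not reported in the paper, so any figures passed to it are hypothetical.

```python
import math


def log_likelihood(counts, sizes):
    """Log-likelihood (LL) for one category across several groups.

    counts: raw occurrences of the category in each group
    sizes:  total number of words in each group
    """
    total_count = sum(counts)
    total_size = sum(sizes)
    ll = 0.0
    for observed, size in zip(counts, sizes):
        # Expected count if the category were spread evenly across all words.
        expected = size * total_count / total_size
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll


def significance(ll):
    """Classify an LL value using the 2 d.f. thresholds quoted in the text."""
    if ll >= 9.21:
        return "p < 0.01"
    if ll >= 5.99:
        return "p < 0.05"
    return "n.s."
```

With these thresholds, a value such as the 12.4 reported for FASH falls in the p < 0.01 band and the 7.7 reported for CLAS in the p < 0.05 band, matching the description of Table 3.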
The only non-shoe-style category on which the Polish sample had the highest frequency of use was attractiveness. It seems that the Polish respondents, as a whole, were more concerned with this dimension than the other two national groups. Attractiveness was not included explicitly in Kaiser et al.’s dimensions, but it could be considered to be implied by categories such as like-dislike and fashionable-unfashionable. We treated attractiveness as a separate category, since we felt that many of its component words (such as smart and elegant) did not fit any of the Kaiser et al. categories particularly well. However, some might argue that these words might be classified elsewhere.

In terms of shoe styles, boots were referred to most often by the Russian sample, then by the German sample. The Russians also made the most references to slippers. By contrast, the German sample referred the most often to sandals. This may be a seasonal effect, since the data were collected during the summer, but, if so, it does not explain why this style was not also prominent in the writings of the other two samples. The Poles had the highest frequency of references to flat shoes and “Doc Marten”-style boots.

5. Discussion

5.1 Qualitative or quantitative content analysis?

Both qualitative and quantitative content analysis of the shoe data yielded similar results, revealing the same main themes. Qualitative content analysis can provide an accurate overall picture of a text corpus, although, being interpretive and subjective, it may overlook some specific details. As Mayring (2001) remarks, qualitative content analysis remains an act of interpretation since “relating categories and parts of the material is no automatic technique but a creative act of interpreting meanings in the text” by the content analyst, who puts into the process of analysis all his/her competencies, pre-knowledge and empathic abilities.
Qualitative content analysis defines itself within this framework as an empirical, methodical and controlled approach to the analysis of texts within their context of communication and without precise quantification. In contrast to qualitative analysis, quantitative content analysis (as its name suggests) provides an exact quantification by counting the category frequencies in the texts. In our data, although most themes were identified with equal success by both the qualitative and quantitative analyses, the relative emphasis placed on them by the different nationalities became much clearer in the quantitative analysis: for example, the qualitative analysis identified comfort as a major emphasis of the Polish sample in particular, and, in the quantitative analysis, this category showed a statistically significant between-groups difference.

5.2 CONTAGs or SEMTAGs?

Although we used project-specific content categories (CONTAGs) for our main analysis, the data had initially been tagged, as mentioned earlier, with more general semantic field categories (SEMTAGs). We therefore asked ourselves the question: Would we have obtained the same results if we had not used CONTAGs but had conducted the analysis using only the pre-existing SEMTAGs? If so, this would provide a useful argument for future studies taking the SEMTAGged data “as is”, thus reducing the manual effort needed in developing new CONTAGs for different research projects.

Table 4 shows the SEMTAG frequencies per million words for categories that were significant at p < 0.01 or p < 0.05. As there are more SEMTAGs than CONTAGs, it will be seen that there are a greater number of significant categories in this table.
However, the same themes still predominate: physical attribute categories such as materials, colour, temperature and physical properties, as well as weather and time periods (used for seasons), are more frequent in the Russian and German samples, whilst the evaluative category “judgement of appearance: positive” is more frequent in the Polish sample.

In looking at SEMTAG profiles, it is worth making a distinction between analysis categories and retrieval categories. As SEMTAG attempts to tag all the words and phrases in a text, rather than a selection, not all the tags relate to content items (nouns, verbs, adjectives, and lexical adverbs): many relate to word classes such as degree adverbs, numbers, modal verbs, pronouns, and so on, which are not key concepts within the content analysis and for most purposes can be disregarded. Other categories – such as similar and different – are not very meaningful in themselves as analysis categories but can be useful as retrieval categories. For instance, when examined in context, the category “similar” can be seen to be referring mainly to another key reason for wearing particular shoes: because they go with or match particular clothes.

An advantage of using SEMTAGs is the greater precision that they can provide: for instance, here we are able to see that the Russian sample writes most about cold temperatures whilst the German sample writes most about warm temperatures. On the other hand, when using SEMTAGs, it is sometimes necessary to revert to the lexical frequency list in order to examine the use of detailed, project-specific concepts: for example, in our study, we were interested in different kinds of footwear, but, in the SEMTAGs, all kinds of footwear receive a single tag (clothing and accessories), which is applied to any apparel item. Perhaps a compromise can be found in developing detailed subcategorisations of key, project-specific categories but leaving the remainder of the SEMTAGs unaltered.
Table 4: Between-groups differences on SEMTAGs (category frequencies per million words)

Category description                   German    Polish   Russian     LL
Substances and materials: solid        11,204     5,669    18,534   42.0
Colour and light                       27,209    14,281    26,983   32.0
Numbers                                16,325     7,522    17,171   28.9
Pronouns                              147,567   167,230   128,373   27.6
Buying and selling                        960     6,323     1,635   27.4
Temperature: cold                       1,601       327     3,543   19.4
Clothing and accessories              101,152    93,208   120,741   19.0
Time periods                           19,206    12,101    21,532   17.6
Weather                                 3,201     2,180     7,086   15.7
Degree: maximisers                      1,921     4,470       818   15.2
Judgement of appearance: positive       7,682    16,025    13,900   13.2
Continuous                                320     2,180       273   12.2
Possible                               10,563    12,973     6,269   12.0
Difficult                                   0     1,635       545   10.2
Long                                    1,280     2,289     5,179   10.0
Short                                     960       654     2,998    9.8
Arts and crafts                             0       327     1,635    9.5
Grammatical words                     287,452   255,096   267,920    9.3
Unfriendly                                640     1,417         0    9.2
Young, new                                640     1,417         0    9.2
Degree: approximators                     960     1,308     3,816    9.1
Friendly                                1,601     1,308         0    9.0
Interest                                    0       872         0    8.9
Dry                                         0         0       818    8.8
Exclusivizers/particularizers           8,323     4,579     3,271    8.6
The same                                    0       218     1,363    8.5
Using                                   1,601     1,090         0    8.3
Similar                                 1,601     3,816     1,363    8.3
Showing                                 2,881       763     2,180    8.2
Past                                      960     1,308         0    8.1
Negative                               10,883    14,935     9,267    8.0
Seem, appear                            2,881     4,252     1,363    7.8
Heavy                                       0       763         0    7.8
Location and direction                  3,841     3,161     6,814    7.7
Transport by land                       3,201       981     2,453    7.6
Phoney                                    320         0       818    7.6
Large                                       0     1,090       273    7.3
Different                               9,603     8,503     4,633    7.3
Physical properties: general            1,280     1,744     4,088    7.1
Putting and placing                       640     2,943     2,726    6.9
Temperature: hot                        5,122     2,071     3,271    6.9
Point in time                           3,201     4,797     1,908    6.8
Important                               2,241     3,488     1,090    6.7
Intimate/sexual relationship                0       654         0    6.7
Unusual                                     0       654         0    6.7
Substances and materials: liquid          640         0         0    6.5
Touch                                     640         0         0    6.5
Games                                     640         0         0    6.5
Relationship: general                     320     1,744       545    6.5
Body and body parts                     6,082     7,631    11,447    6.4
General, unspecific                       960       109         0    6.4
Giving                                  1,280       763         0    6.4
Degree: compromisers                    1,921     2,943       818    6.4
Usual                                   3,841     6,759     3,816    6.3
Liking: positive (+++)                      0       763     1,363    6.2

5.3 Using non-native English compositions as data

Returning to our secondary research question – the issues posed by using English as a survey response medium with non-native speakers – we are able to make three main observations.

Firstly, it is possible that, when working with students as respondents, the nature of the exercise may have some effect on its outcome. In the case of the present survey, the Russian exercise was also used as a compulsory graded exercise in an EFL class; the German and Polish exercises, by contrast, were intended as compulsory but did not form part of the assessment for the respective courses, so students would not be penalised for not doing them (or for doing them “badly”). This may account for the fact that the Russian responses were in many ways more detailed, in several cases involving numbered lists of footwear items owned by the writer. An interesting alternative – or perhaps supplementary – explanation for some between-group differences is that they may arise in part from different “rhetorical strategies” in approaching the compositions.⁴

Aside from this consideration, two more specifically lexical issues arise: the use of (often very culture-specific) L1 vocabulary in the English compositions and the non-standard use of English vocabulary by particular groups of respondents. Examples of the first issue in our data were the term glans in the Polish compositions, about which we had long discussions with our Polish colleague (it appears to refer to a kind of “punk rock” style involving lots of black leather and metal studs), and the term kapron in the Russian data, a kind of material. An example of the second issue was the use by the Russian respondents of the term top-boots. They were the only group to use this term, and they used it rather frequently (10 times across 12 respondents).
This is a relatively dated expression in native-speaker English and, according to the Oxford English Dictionary, is properly used for the style of riding boot that has a cuff of differently coloured leather at the top of the boot shaft: these are the style of boots worn by jockeys and show-jumpers, also commonly seen in a female fashion version at the turn of the 1980s/90s. However, the Russian respondents appear to be using this term to refer to any boot with a shaft reaching to the knee of the wearer. The Oxford English Dictionary also recognises this sense, though castigates it as “incorrect”. These lexical issues are perhaps not a large problem in a dictionary-based content analysis, providing the respective words and phrases are properly categorised, but they can potentially be problematic for those approaches to content analysis that work primarily on vocabulary items, such as correlational content analysis or keyword analysis (Scott, 1997, 2001). In such cases, a partial synonymisation process, although frequently dispensed with as uneconomical, might be advisable.

6. Conclusion

Our pilot study has revealed interesting differences between the three nationalities in the way they write about shoe fashions. Although they all made use of the “American” dimensions delineated by Kaiser et al. (1987), they did so to different degrees, and they also made use of other conceptual dimensions not mentioned by Kaiser et al. In general, the Russians seemed to place the most emphasis on the physical and practical characteristics of their shoes. They also seemed to be the strongest “trend-followers”. By contrast, the Poles seemed to emphasise their judgements of the attractiveness of shoes. The Germans tended to fall midway between the Russians and the Poles on most dimensions. As this was a pilot study, these results should for the moment be considered suggestive rather than conclusive.
However, if further work (which we are planning to carry out) does support these findings, then they may have important implications both for cross-cultural apparel-based NVC and for footwear marketing.

On a methodological level, we have shown that a general content analysis dictionary based on semantic fields can deliver the same substantive results as a specially constructed project-specific dictionary. We have also shown that such a dictionary provides broadly the same results as a manual qualitative content analysis, which replicates the findings of Thomas and Wilson (1996) on doctor-patient interaction. We therefore suggest that qualitative analysis and special dictionary construction are costly stages of analysis which can be dispensed with without much effect on the outcome of a study. However, it may be that such general dictionaries require some minor modification (mostly the insertion of more detailed sub-divisions) to cover the fine detail of specific topic areas under examination.

Notes

1 Unless, of course, one of the aims had been to project a trendier image for the party, regardless of the speech content.

2 We are grateful to our colleagues in this project – Amei Koll-Stobbe in Greifswald, Germany; Agnieszka Leńko-Szymańska in Lodz, Poland, and Tatyana Astafurova in Volgograd, Russia – for collecting these data.

3 We have since become aware of further research on shoes carried out by Marianne Herzog in Germany – e.g. Herzog (1995).

4 See Leńko-Szymańska (this volume) for a detailed discussion of this idea in the context of Polish and American English compositions written in response to the same question (about mobile phones).

References

Andersen, P.A. (1988), Explaining intercultural differences in nonverbal communication, in: L.A. Samovar and R.E. Porter (eds.) Intercultural Communication: A Reader. 5th ed. Belmont, CA: Wadsworth, pp. 272-281.

Becker, J. and H.-J.
Lißmann (1973), Inhaltsanalyse – Kritik einer sozialwissenschaftlichen Methode. Arbeitspapiere zur politischen Soziologie 5. München: Olzog.

Davis, L.L. and S.J. Lennon (1988), Social cognition and the study of clothing and human behavior. Social Behavior and Personality 16(2): 175-186.

Golliher, J.M. (1987), The meaning of bodily artefacts: variation in domain structure, communicative functions, and social contexts. Semiotica 65(1/2): 107-127.

Herzog, M. (1995), Auftreten... Mensch und Schuh, in: D. Grünewald (ed.) "Was sind wir Menschen doch!..." Menschen im Bild. Analysen. Festschrift für Hermann Hinkel. Weimar: Verlag und Datenbank für Geisteswissenschaften, pp. 105-114.

Hogenraad, R., D.P. McKenzie, and N. Péladeau (fc.), Force and influence in content analysis: The production of new social knowledge. Quality & Quantity 37(3): 221-238.

Kaiser, S.B., H.G. Schutz and J.L. Chandler (1987), Cultural codes and sex-role ideology – a study of shoes. American Journal of Semiotics 5(1): 13-33.

Leńko-Szymańska, A. (this volume), The curse and the blessing of mobile phones – a corpus-based study into Polish and American rhetoric strategies.

Lennon, S.J. and F.G. Miller (1984/85), Attire, physical appearance, and first impressions: more is less. Clothing and Textiles Research Journal 3(1): 1-8.

Mayring, P. (2001), Combination and integration of qualitative and quantitative analysis. Forum Qualitative Sozialforschung / Forum: Qualitative Social Research [On-line Journal], 2(1). URL: http://www.uni-tuebingen.de/qualitative-psychologie/t-ws01/Mayring_en.htm.

Pond, M. (1986), Shoes Never Lie. London: Grafton.

Rayson, P. and A. Wilson (1996), The ACAMRIT semantic tagging system: progress report, in: L.J. Evett and T.G. Rose (eds.) Proceedings of the AISB Workshop on Language Engineering for Document Analysis and Recognition. Brighton: Faculty of Engineering and Computing, Nottingham Trent University, pp. 13-20.

Remland, M.S., T.S.
Jones and H. Brinkman (1991), Proxemic and haptic behavior in three European countries. Journal of Nonverbal Behavior 15: 215-232.

Scott, M. (1997), PC analysis of key words - and key key words. System 25(1): 1-13.

Scott, M. (2001), Comparing corpora and identifying key words, collocations, and frequency distributions through the WordSmith Tools suite of computer programs, in: M. Ghadessy, A. Henry and R.L. Roseberry (eds.) Small Corpus Studies and ELT: Theory and Practice. Amsterdam: Benjamins, pp. 47-67.

Stone, G.P. (1962), Appearance and the self, in: A. Rose (ed.) Human Behavior and Social Processes. Boston: Houghton Mifflin, pp. 86-118.

Thomas, J.A. and A. Wilson (1996), Methodologies for studying a corpus of doctor-patient interaction, in: J.A. Thomas and M.H. Short (eds.) Using Corpora for Language Research: Studies in Honour of Geoffrey Leech. London: Longman, pp. 92-109.

Wilson, A. and P. Rayson (1993), Automatic content analysis of spoken discourse: a report on work in progress, in: C. Souter and E. Atwell (eds.) Corpus Based Computational Linguistics. Amsterdam: Rodopi, pp. 215-226.

Survey and Prospect of China’s Corpus-Based Research1

Yang Xiao-jun
Hunan University of Science & Technology
Beijing Foreign Studies University

Abstract

This chapter conducts a survey of China’s corpus-based research, focusing on the following aspects: i) the history and development of China’s corpus-based research; ii) corpus compilation; iii) leading research figures and institutions; iv) major academic publications on this research. The author then considers the prospects for this research, focusing on the developing trend of China’s corpus-based research, corpus annotation problems, corpus processing tools and how to apply the corpora to language teaching and translation studies.

1. Introduction

Corpus linguistics is playing an increasingly important role in linguistic research, lexicography and related fields.
Tognini-Bonelli (2001: 1) points out that what we are witnessing is the emergence of a new research enterprise and a new philosophical approach to linguistic enquiry that has a theoretical status and, because of this, is in a position to contribute specifically to other applications such as lexicography, language teaching, translation, stylistics, grammar, gender studies, forensic linguistics and computational linguistics. Hunston (2002: 1) notes that it is no exaggeration to say that corpora, and the study of corpora, have revolutionised the study of language, and of the applications of language over the last few decades, and that the improved accessibility of computers has changed corpus study from a subject for specialists to something that is open to all. In China corpus linguistics has led to a qualitative change in our understanding of language.

In this paper the author first conducts a survey of China’s corpus-based research, focusing on the following aspects: i) the history and development of China’s corpus-based research; ii) corpus compilation; iii) leading research figures and institutions; iv) major academic publications on this research. The author then considers the prospects for this research, focusing on the developing trend of China’s corpus-based research, corpus annotation problems, corpus processing tools and how to apply corpora to language teaching and translation studies.

2. Survey of China’s Corpus-based Research

According to Huang Changning and Li Juanzi (2002: 4), China’s earliest corpus-based research on Chinese dialects has a history of nearly 3,000 years. Even computer corpus-based research on the English and Chinese languages has a history of over 20 years. Wang Jianxin (2001) stated that, in recent years, corpus linguistics in China has made considerable progress in the compilation, annotation and analysis of Chinese corpora, and in the compilation and study of corpora of English as a Foreign Language.
2.1 The history and development of the research

The history of China’s earliest corpus-based research may date back to the Zhou Dynasty (1100-221 B.C.) and the Qin Dynasty (221-206 B.C.). It is reported that Yang Xiong, who showed great interest in studying dialects, spent 27 years interviewing scholars and soldiers who had been summoned from different districts to the capital. He collected enough data and resources manually to compile the first book on Chinese dialects (Huang Changning & Li Juanzi, 2002: 4).

Since the compilation of China’s first English corpus, JDEST, in 1982, many achievements have been made in China’s contemporary corpus-based research. First, fifteen Chinese corpora, eight English corpora and five bilingual corpora (including parallel corpora) have been compiled or are being compiled (to be described in detail in the next section). Secondly, four books on corpus linguistics and four dictionaries on Modern Chinese have been published. Thirdly, eighteen universities and research institutions have been carrying out research programmes on corpus linguistics. Fourthly, the number of regular contributors to major national journals of foreign language research and linguistics is growing year by year, with topics ranging from the application of corpus linguistics to the techniques for compiling and exploiting corpora. A search for papers on corpus linguistics at www.cnki.net finds 107 academic papers on corpus-based research in Chinese academic journals (CAJ); adding the papers found through www.baidu.com brings the total to date to about 140. Besides, there are a number of MSc/MA dissertations and several doctoral theses on corpus-based work.

2.2 China’s Corpora compiled and being compiled

It is during these 20 years that about 15 Chinese corpora, 8 English corpora and 5 bilingual corpora (including parallel corpora) have been compiled or are being compiled.
Some of them are used for general purposes; some are used for specific purposes.

2.2.1 Chinese Corpora

(1) The Modern Chinese Language Corpus (MCLC). This is the biggest representative written Chinese corpus to date, compiled at the Research Institute of Language Application in Beijing in 1995. It aims to contain 70 million contemporary Chinese characters, systematically sampled from among 1.4 billion characters of original texts covering the period 1919 to the 1990s. With regard to contents, 59.6 percent of these texts belong to the humanities, 17.24 percent to natural science, 13.7 percent to newspaper material and 9.37 percent to the miscellaneous class. The MCLC serves potential needs in five areas: (1) information processing in the Chinese language; (2) the unification and standardisation of the Chinese language; (3) academic research; (4) language education; (5) application of the Chinese language (see Wang Jianxin, 2001).

(2) The Si Ku Quan Shu. (This refers to the Complete Library in Four Branches of Literature completed in 1782, which has the world’s longest series of books. The work, comprising the four traditional divisions of Chinese learning (classics, history, philosophy, and belles-lettres), contains 3,503 titles bound into more than 36,000 books with a total of 853,456 pages.) This is the largest electronic Chinese text database so far, composed of the most comprehensive and outstanding classic Chinese works of the 3,000 years up to the end of the 18th century. This database of 800 million Chinese characters in 4.7 million pages was published by the Digital Heritage Publishing Ltd. in Hong Kong in two versions.
One is the complete facsimile version of 183 CD-ROMs with character-retrieving tools, with a total of 4.7 million pages of the original books, and was put on the market in 1998; the other has 168 CD-ROMs with title-retrieving tools, with a total of 800 million characters, and was put on the market in 1998. The database has greatly facilitated the study of the ancient Chinese language, literature, culture and history.

(3) Corpus of the Chinese Language as Interlanguage. This corpus of more than 3.5 million characters was compiled by the Research Institute for Computational Language of Beijing Language and Culture University in 1995. It consists of 5,774 written texts in Chinese as interlanguage by 1,635 overseas students learning Chinese. The students are from 96 countries and regions, studying in nine universities. This corpus is used to enhance the teaching of Chinese to foreign students.

(4) The Academia Sinica Balanced Corpus (version 3.0). This corpus was compiled at the Academia Sinica in Taiwan. It is one of the largest annotated Chinese corpora and has been put on the web with 5 million Chinese characters tokenized, tagged and parsed. This grammar tree bank and the statistics will be very useful for the processing system of the Chinese language.

(5) The Corpus of Chinese Phonology. This corpus was set up by the Institute of Applied Linguistics of the Chinese Academy of Social Sciences (CASS) with 45,000 Chinese words and 1,200 syllables with tones, numeral strings, lightly pronounced words, words pronounced with a rising “/r/” ending sound, bi-syllables, tri-syllables, separate sentences, short passages and dialogues. This corpus has been annotated manually. An automatic phonetic-analysis system of Mandarin Chinese, based on the English phonetic analytical system ToBI, is in progress.

(6) Contemporary Beijing Spoken Chinese Corpus. This corpus was compiled by Lu Bisong, Ren Yuan and six other scholars in 1992.
They interviewed 500 people from six districts of Beijing talking on about 28 topics and sampled 378 recordings. This corpus gives a vivid description of Beijing Spoken Chinese in the 1980s with a total of 1.7 million characters.

(7) Situated Discourse for Spoken Chinese corpus. This corpus is being compiled at the Research Department for Contemporary Linguistics of the Research Institute for Languages and Linguistics of the Chinese Academy of Social Sciences under the guidance of the doctoral supervisor Gu Yueguo. As part of the workplace-related discourse, more than 269 hours of recordings are planned for this corpus, which are being transcribed and annotated to be made available on the Internet in a multimedia form.

(8) Corpus of Chinese Textbooks for Primary and Middle Schools. This corpus of 1 million characters was compiled by Beijing Normal University and is composed of middle-school teaching material on Chinese literature and language.

(9) The People’s Daily Annotated Corpus (PFR). This corpus of about 200 million Chinese characters has been segmented and annotated jointly by the Fujitsu Research Institute of Japan and the Institute of Computational Linguistics (ICL) of Peking University.

(10) Huayu2. This is a balanced corpus of 2 million Chinese characters, entitled Huayu and compiled by both the State Key Laboratory of Intelligent Technology & System (Key Lab), Tsinghua University, and the Language Information Processing Institute, Beijing Language and Culture University, on Mainland China. This corpus has been tokenised and tagged with Cseg & tag 1.0, a segmentation and tagging system developed by them, and then manually proofread. From it a grammar tree bank of 10,000 Chinese sentences is being built to be used as a testbed for Chinese parsers.

(11) Linguistic Variation in Chinese Communities (LIVAC Synchronous Corpus).
This corpus, entitled LIVAC Synchronous Corpus, was compiled by the City University of Hong Kong and is a representative computerised text corpus from Chinese newspapers and electronic media in Mainland China, Hong Kong, Macau, Taiwan and Singapore. LIVAC aims to cover a ten-year period starting from July 1995. This corpus will be very useful for comparative studies and can provide quantitative data for language engineering.

(12) Statistical Corpus of Chinese Word Frequency. This corpus was designed for the specific purpose of compiling statistics on Chinese word frequency and was compiled jointly by 10 universities and research institutes. It has several subcorpora, such as: (a) a 20-million-character corpus of modern Chinese compiled by the Beijing University of Aeronautics and Astronautics in 1983; (b) a 5.27-million-character corpus of modern Chinese compiled by Wuhan University in 1979; (c) a 5-million-character Chinese corpus compiled by Peking University in 1992; (d) a 100-million-character corpus of ancient and modern Chinese compiled by Shanghai Normal University; a modern Chinese corpus of 66,186,297 Chinese characters compiled by Shandong University; (e) a 2.5-million-character Chinese corpus of news compiled by Shanxi University in 1988.

(13) The Corpus of the Contemporary Chinese Language. This corpus was compiled by the Department of Chinese and Bilingual Linguistics at the Hong Kong University of Science and Technology (HKUST) and comprises 6 million characters of contemporary Chinese used in mainland China, Hong Kong and Taiwan. It has been segmented with an automatic algorithm, and considerable research has been conducted using this corpus.

(14) The Hong Kong Cantonese Child Language Corpus (CANCORP).
This corpus grew out of the project “The development of grammatical competence in Cantonese-speaking children”, funded by the Hong Kong Research Grants Council from 1991 to 1993, which was a joint effort of three local universities: the Chinese University of Hong Kong, the Hong Kong Polytechnic University, and the University of Hong Kong. This database contains 171 files coded according to the internationally accepted CHAT format and tagged with 33 part-of-speech labels. The data should be of use to anyone interested in early language development, be they linguists, psychologists, philosophers or educationalists. Queries about the corpus should be directed to Thomas Lee (e-mail: htlee@netvigator.com).

(15) The Electronic Database of the Chinese Documents (Scripta Sinica). This corpus, entitled Scripta Sinica, was compiled collectively by ten research institutes of the Academia Sinica in Taiwan. It is probably the most empirically sound corpus in China today and is fully available on the Internet. Version 2.0 of this database contains 139,940,071 Chinese characters, and each year 10 million characters of new Chinese texts are added to it.

2.2.2 English Corpora

(1) JDEST (Jiaotong University Corpus for EST). This corpus was compiled at Shanghai Jiaotong University under the guidance of Prof. Huang Renjie and Prof. Yang Huizhong in 1982. It comprised approximately 1 million words, and was updated to 4,082,368 words in 2000. It has 2,000 texts of 500 words each, covering 10 majors in both British English and American English. It is an annotated corpus and word frequency lists have been produced. It is designed to meet the needs of students of English used in science and technology. Website: www.sjtu.edu.cn

(2) GPEC (Guangzhou Petroleum English Corpus). This corpus was compiled by Prof. Zhu Qi-bo at Guangzhou Training College of the Chinese Petroleum University in 1986.
It has 700 passages of 500-600 words each, sampled over the period 1975-1986. Its size is 411,612 words in both British English and American English. A package of concordance programmes for the corpus has been developed. This corpus is designed to enable the study of the lexicon of Petroleum English and provide information for comparative language analysis.

(3) Corpus for EST. This corpus was compiled jointly by Guangzhou University of Foreign Studies and the Hong Kong University of Science and Technology in 1998. It comprises 5 million words.

(4) Communicative English Corpus for Chinese Students. This corpus was compiled at the School of Foreign Languages and International Studies, Guangzhou University of Foreign Studies.

(5) English Literature Corpus. This corpus was compiled at the School of Foreign Languages and International Studies, Beijing Foreign Studies University. It comprises 5 million words.

(6) Chinese Learner English Corpus (CLEC). This corpus was compiled in 1999 under the guidance of Prof. Gui Shichun of Guangzhou University of Foreign Studies and Prof. Yang Huizhong of Shanghai Jiaotong University. It comprises 1 million words. The written texts in this corpus are from the compositions and essays of English majors, non-English majors and middle school students (40%, 30% and 30% respectively). A subcorpus, College Learner Corpus, was compiled at the School of Foreign Studies, Henan University under the guidance of Prof. Li Wenzhong.

(7) The HKUST Corpus of Learner English. This corpus was compiled by J. Milton at the Computer Department of the Hong Kong University of Science and Technology. It comprises 25 million words. This corpus was POS tagged and partly error-tagged. All language materials are from Chinese learners.
Kennedy (1998: 42) has made some comments on this corpus as follows:

Interlanguage studies of the written English of mainly Cantonese learners of English will be facilitated by the completion of the five-million-word Hong Kong University of Science and Technology (HKUST) Corpus (Milton & Tong, 1991). The corpus will be probably the largest machine-readable corpus yet produced of the written English of Chinese learners and also one of the largest corpora of any single group of learners. It is intended that it will be available with grammatical and discourse feature tags. The use of this corpus to describe the rule base of written English for learners of Chinese background is intended to inform the development of English teaching materials.

(8) Corpus for Middle School English Education (MSEE). This corpus was compiled by Prof. He An-ping at South China Normal University in 1999. It comprises 2.3 million words in three sections: 1.3 million words of the written and spoken English produced by the secondary school students in Guangdong Province; 0.5 million words of the complete new set of English textbooks used in China’s middle schools, and 0.5 million transcribed words of 130 hours of classroom teaching, matched with video and tape recordings.

2.2.3 Bilingual Corpora (including Parallel Corpora)

(1) CONULEXID (the Commercial Press and Nanjing University Lexical Database). This is a bilingual (English and Chinese) Database and was compiled jointly by Nanjing University and Commercial Publishing House in 2000. It is for the purposes of bilingual dictionary making and publishing.

(2) Chinese/Japanese Parallel Corpus (CJPC). This corpus was compiled at the National Research Centre for Foreign Language Education, Beijing Foreign Studies University, under the guidance of Prof. Xu Yiping. It is the first Chinese/Japanese parallel corpus compiled in China, upon which many significant results have been produced.
(see Xu and Cao, 2002)

(3) Chinese/English Parallel Corpus (CEPC). This corpus is being compiled at the National Research Centre for Foreign Language Education, Beijing Foreign Studies University, under the guidance of Dr. Wang Kefei. The objective of the present project is to create a Chinese/English Parallel Corpus of 30 million words, representative of modern Chinese and English in the twentieth century. The aim is to establish a research platform for comparative studies of Chinese and English that meets the standards of observational and descriptive accuracy and supports translation studies and teaching, statistical analysis (such as frequency of occurrence), machine translation and the compilation of bilingual dictionaries. This corpus comprises four subcorpora: a bilingual corpus of aligned sentences and phrases, a multidisciplinary corpus, a specialised corpus, and a translated texts corpus. All the texts are from written sources. Software such as bilingual text alignment software and bilingual concordance software will be developed for effective and efficient use of this corpus. This corpus has several distinct features: for example, it can be separated for specific purposes or combined as a whole for a general purpose, and the original texts have different translations (see Wang Kefei, 2002b).

(4) English/Chinese Parallel Corpus (ECPC). Under the Statistical Inter-Lingual Conversion (SILC) project, a 60 Mb English/Chinese parallel corpus has been compiled by the Computer Department of the HKUST, containing transcripts of speech recordings from the Hong Kong parliament and their high-quality translations. From the corpus, 29 Mb of English texts (about 5 million words) and 15.5 Mb of Chinese equivalents were used for the experiment. They were all automatically aligned at sentence and paragraph level.

(5) Cantonese-English Bilingual Child Language Corpus.
This is a corpus of Cantonese-English bilingual children’s early language development, which is being compiled by Virginia Yip (CUHK) and Stephen Matthews (HKU). Three simultaneous bilingual subjects will be studied longitudinally for one and a half or two years and video-taped bi-weekly. The resulting transcriptions will form a bilingual corpus, containing English and Cantonese in romanised form for each child. This will be the first Cantonese-English Bilingual Child corpus and will be useful for addressing issues such as the question of differentiation, the degree of balance between the two languages and the possibility of delay relative to monolingual development.

2.3 Leading figures and institutions in the research

A: The leading figures and their contributions to China’s corpus-based research are introduced below. Their names are spelt in Chinese Pinyin, with the surname first.

(1) Yu Shiwen is a doctoral supervisor and the director of the Institute of Computational Linguistics, Department of Computer Science and Technology, Peking University. He has been carrying on research on computational linguistics and natural corpus processing since 1986. He compiled the Grammatical Knowledge Base of Contemporary Chinese (GKBCC). This database was published by Tsinghua University Press in 1997. Based on the GKBCC, he also compiled the database the Grammatical Knowledge Base of Contemporary Chinese: A Complete Specification, which was published by Tsinghua University Press in 1998.

(2) Huang Changning works in the Research Institute for Computational Language of Tsinghua University. He has written five papers on corpus-based research, mainly on Chinese corpora, and one book, Corpus Linguistics (Chinese version), published by the Commercial Press in 2002.
(3) Zhang Pu works in the Research Institute for Computational Language of Beijing Language and Culture University and has written 10 papers on corpus-based research, mainly on Chinese corpora. He is one of the leading researchers behind the compilation of the Modern Chinese Corpus.

(4) Gui Shichun is a linguist working in Guangzhou University of Foreign Studies. He has compiled the Chinese Learner English Corpus (CLEC) and undertaken some corpus-based research.

(5) Yang Huizhong is a doctoral supervisor in the Foreign Studies Department of Shanghai Jiao Tong University. He has compiled two major corpora with other professors: JDEST and CLEC. In 1985 he wrote a paper, The use of computers in English teaching and research in China, and in 2002, with his four doctoral candidates majoring in corpus linguistics, he wrote a book, An Introduction to Corpus Linguistics. He is one of the pioneers of China’s corpus-based research.

(6) Gu Yueguo is a doctoral supervisor at the Research Institute for Languages and Linguistics, the Chinese Academy of Social Science. He is in charge of the research programme for compiling the Situated Discourse for Spoken Chinese corpus, which he is about to complete. In 1998 he edited a special issue on corpus linguistics research as the first issue of Contemporary Linguistics, contributing a paper, Corpora and Language Research, and giving some advice on this kind of research. He has had four doctoral candidates majoring in corpus linguistics.

(7) Wang Kefei is a doctoral supervisor at the National Research Centre for Foreign Language Education, Beijing Foreign Studies University. He is in charge of compiling the Chinese/English Parallel Corpus. He has published four papers on the parallel corpus-based approach to both language and translation studies.
(8) Chen Guohua is a doctoral supervisor at the National Research Centre for Foreign Language Education, Beijing Foreign Studies University, and one of the leading members of the team compiling the Chinese/English Parallel Corpus. He has undertaken some research on corpus-based approaches to lexicography and has one doctoral candidate majoring in corpus linguistics and lexicography.
(9) Pan Yongliang is a doctoral supervisor at PLA Foreign Studies University. He has written two papers on corpus-based research and has several doctoral candidates majoring in corpus linguistics.
(10) Wang Jianxin works in the Foreign Languages Department of the Beijing University of Posts and Telecommunications and takes part in compiling the Chinese/English Parallel Corpus. He has written six papers on corpus-based research. Of these, one was published in the International Journal of Corpus Linguistics (2001, vol. 6, No. 2) and another in ICAME Journal (2000, 4). He studied corpus linguistics at the University of Bergen as a visiting scholar for one year.
(11) He Anping works in the Foreign Studies Department of South China Normal University and attended two conferences: ICAME 2000 (Sydney) and Corpus Linguistics 2001 (Lancaster). She has published more than five papers on corpus-based research and compiled the MSEE corpus.
(12) Li Wenzhong. Dr. Li majored in corpus linguistics under the supervision of Prof. Yang Huizhong. He has compiled a sub-corpus of CLEC, the College Learner Corpus, written some papers on corpus-based research and contributed three chapters to Yang Huizhong's An Introduction to Corpus Linguistics.

B: The leading institutions in this research are listed below, with their addresses, websites and the e-mail addresses of the project researchers.

(1) The State Key Laboratory of Intelligent Technology and Systems, Tsinghua University, Beijing, 100084, P.R. China. E-mail: sms@s1000c.cs.tsinghua.edu.cn; website: www.tsinghua.edu.cn
(2) The Institute of Computational Linguistics, Dept. of Computer Science and Technology, Peking University, Beijing, 100871, P.R. China. E-mail: yusw@pku.edu.cn; website: www.pku.edu.cn
(3) Academia Sinica, Taiwan. Website: www.academia.sinica.edu.tw. Scripta Sinica: e-mail: linshi@sinica.edu.tw; website: academia.sinica.edu.tw/info/index.html#db
(4) The National Research Centre for Foreign Language Education, Beijing Foreign Studies University, Beijing, 100089, P.R. China. E-mail: kfwang@hotmail.com; website: www.sinotefl.com
(5) Language Information Processing Institute, Beijing Language and Culture University, Beijing, 100083, P.R. China. E-mail: zhangpu@blcp.edu.cn; website: www.blcp.edu.cn
(6) The Chinese Academy of Social Sciences, No. 5, Jian Guo Men Wai Da Jie, Beijing, 100732, P.R. China. Website: www.cass.net.cn/jxky/yxsz/kx/index.htm
(7) Beijing University of Aeronautics and Astronautics. Website: www.buaa.edu.cn
(8) Open System & Chinese Information Processing Center, Institute of Software, the Chinese Academy of Sciences, Beijing, 100083, P.R. China. E-mail: idu@sonata.iscas.ac.cn
(9) Shanghai Jiao Tong University, Shanghai, 200030, P.R. China. Website: www.sjtu.edu.cn
(10) Guangdong University of Foreign Studies, Guangdong, 510420, P.R. China. Website: www.gdufs.edu.cn
(11) South China Normal University, Guangdong, 510631, P.R. China. Website: www.scnu.edu.cn
(12) Wuhan University. Website: www.whu.edu.cn
(13) Northeast University, Shenyang, 110006, P.R. China. Website: www.neu.edu.cn
(14) Shanxi University, Taiyuan, 030006, P.R. China. Website: www.sxu.edu.cn
(15) The City University of Hong Kong. Website: www.cityu.edu.hk
(16) Digital Heritage Publishing Ltd. (Hong Kong). Website: www.skqs.com
(17) The Chinese University of Hong Kong. Website: www.cuhk.edu.hk
(18) University of Petroleum-Guangzhou. Website: www.ccem.uiuc.edu/chen/up/guangzhou.html

2.4 Classifications of corpus-based research publications and major academic journals

2.4.1 Dictionaries and books

(1) The Comprehensive Dictionary of the Chinese Language (in electronic form). The original 12 volumes of this dictionary were made available on one CD-ROM in 1998 by the Chinese Dictionary Press and the Commercial Press (Hong Kong Ltd.).
(2) Grammatical Knowledge Base of Contemporary Chinese (GKBCC). This database was compiled at Peking University and published by Tsinghua University Press in 1997. Based on the GKBCC, the database Grammatical Knowledge Base of Contemporary Chinese: A Complete Specification was compiled by Yu Shiwen et al. at the ICL of Peking University and published by Tsinghua University Press in 1998.
(3) List of Common Characters in Modern Chinese. It was published by the Chinese Language Press in 1997.
(4) Frequency Dictionary of the Modern Chinese Language. This database was compiled at the Language Information Processing Institute, Beijing Language and Culture University. It contains frequency statistics for 1,310,000 Modern Chinese characters and was published by Beijing Language and Culture University Press in 1997.
(5) An Introduction to Corpus Linguistics. This book was written by Yang Huizhong et al. and published by Shanghai Foreign Language Education Press in 2002.
(6) An Introduction to Chinese Learner English Corpus (CLEC). This book was written by Yang Huizhong et al. and published by Shanghai Foreign Language Education Press in 2003.
(7) Corpus Linguistics. This book was written by Huang Changning and Li Juanzi and published by the Commercial Press in 2002.
(8) An Introduction to the Corpus for Middle School English Education (MSEE). This book was written by He Anping and published by Guangdong Electronic Press in 1999.

2.4.2 Classifications of corpus-based research papers

As mentioned above, there are about 140 corpus-based research papers to date (this count is by no means exhaustive). They may be classified as follows:

(1) In terms of the language of the corpora: papers on English corpora and bilingual corpora account for about 70% of the total; the remaining 30% are on Chinese corpora.
(2) In terms of research orientation and scope:
Papers introducing English and Chinese corpora, together with book reviews on corpus linguistics, account for about 20% and mainly cover the period 1985-1995, such as Wang Jianxin (1996) and Pan Yongliang (2000).
Papers on theoretical research in corpus linguistics account for about 30% and mainly cover the period 1996-2000.
Papers on the applications of corpora account for 40% and mainly cover the period 1998-2003. They may be further classified into the following categories:
(a) Papers on the application of corpora to translation studies account for about 5%, such as Wang Kefei (2002a, b), Liao Qiyi (2000), Zhang Meifang (2002), etc.
(b) Papers on the application of corpora to language teaching account for about 20%.
(c) Papers on the application of corpora to contrastive linguistic analysis account for about 5%.
(d) Papers on the application of corpora to lexicography account for about 5%.
(e) Papers on the application of corpora to lexical studies account for 5%.
Papers on techniques for corpus annotation and concordancing account for the remaining 10% or so.

2.4.3 Classification of these academic journals

The academic journals that publish papers on corpus-based research in China may be classified into three types: journals of foreign languages and linguistics, journals of Chinese language and linguistics, and journals of information and computer technology.
The major ones are: (a) Contemporary Linguistics (Beijing); (b) Foreign Language Teaching and Research (Beijing); (c) Journal of Foreign Languages (Shanghai); (d) Modern Foreign Languages (Guangzhou); (e) Foreign Languages and Their Teaching (Dalian); (f) Journal of PLA Foreign Studies University (Luoyang); (g) Foreign Language Teaching (Xi'an); (h) Journal of Languages and Writings (Beijing); (i) Journal of Chinese Information Processing (Beijing); (j) Applied Linguistics (Beijing); and (k) Journal of Computer Science (Beijing).

3. Prospect of China's corpus-based research

In this section, we focus on development trends in China's corpus-based research, corpus annotation problems, corpus processing tools, and how to apply corpora to language teaching and translation studies.

3.1 The developing trend of China's corpus-based research

At present, the development of spoken Chinese corpora lags far behind that of written Chinese corpora. Worse still, no spoken English corpus has been built, or is even being built, in China. We should therefore speed up the building of spoken corpora so that they keep pace with written corpora. Meanwhile, it has been suggested that more small corpora be built for specific research purposes, and that more bilingual and parallel corpora be built for language research, translation studies, contrastive analysis and dictionary-making.

3.2 Corpus annotation problems and corpus processing tools

Technology for tagging and parsing should be standardised and further developed. Annotation at the lexical, phonetic, syntactic and phonological levels is far more advanced than annotation at the semantic and pragmatic levels, so more attention should be paid to the latter. Although more and more automatic taggers and parsers are becoming available, supply still cannot meet demand, and most of these corpus processing tools are suitable only for specific corpora.
Technology for alignment is not currently satisfactory. Up to now, we have not developed our own tools to process bilingual corpora. The challenge for corpus analysis tools is to systematise the design of corpora and concordancers so that any concordancer can work on any corpus.

3.3 How to apply the corpora to language teaching and translation studies

There is great potential for these corpora to enrich China's corpus-based research. Here we focus only on how to apply the corpora to language teaching and translation studies.

3.3.1 Corpora and language teaching

Corpus linguistics has a double role in language teaching, entailing a methodological innovation and a theoretical one, because together they will account for a new way of teaching (Tognini-Bonelli, 2001: 14). The development of corpora has the potential for two major effects on the professional life of the language teacher. First, corpora lead to new descriptions of a language, so that the content of what the language teacher is teaching is perceived to change in radical ways (Sinclair, 1991: 100; Stubbs, 1996: 231-232). Secondly, corpora themselves can be exploited to produce language teaching materials, and can form the basis for new approaches to syllabus design and to methodology (Hunston, 2002: 137). In fact, we can apply corpora to teaching nearly all the courses taken by language majors, especially comparative and contrastive studies, vocabulary lessons and writing lessons.

3.3.2 Corpora and translation studies

Translation is an increasingly important application of corpora. Research into corpora and translation tends to focus on two areas: practical and theoretical. In practical terms, the question is: What software can be developed that will enable a translator to exploit corpora as an aid in the day-to-day business of translation?
In theoretical terms, the question is: What does a corpus consisting of translated texts indicate about the process of translation itself? Because corpora can be used to raise awareness about language in general, they are extremely useful in training translators and in pointing up potential problems for translation. Not only can corpora provide evidence of which translations of a given word or phrase are possible, they also provide an insight into the process and the nature of translation itself (Hunston, 2002: 123-128). So our bilingual parallel corpora can serve as a resource and a practical tool for analysing translation and translators' style, and for supporting translation training, teaching and the writing of translation textbooks.

4. Conclusion

In this chapter, a general survey of China's corpus-based research has been conducted from four viewpoints, and the prospects of the research have been considered from three viewpoints. It is clear that corpus linguistics is playing an increasingly important role in China's corpus-based research. In the future, a national association of corpus linguistics studies will be created and a national academic journal of corpus linguistics will come into being. In the end, our motto should never be forgotten: "We are corpus-based, but not corpus-bound."

Notes

1 This paper has been polished by my supervisor, Dr. Wang Kefei, Professor of the National Research Centre for Foreign Language Education, Beijing Foreign Studies University. I am very grateful to him and to Prof. Wang Jianxin, who offered me some information and data on corpora compilation.
2 Here "Hua" refers to Tsinghua University; "Yu" refers to Beijing Language and Culture University.

References

Biber, D., S. Conrad and R. Reppen (1998), Corpus Linguistics. Cambridge: Cambridge University Press.
He, Anping (1999), Introduction to the Corpus for Middle School English Education (MSEE). Guangzhou: Guangdong Electronic Press.
Huang, Changning and Li Juanzi (2002), Corpus Linguistics. Beijing: Commercial Press.
Hunston, S. (2002), Corpora in Applied Linguistics. Cambridge: Cambridge University Press.
Kennedy, G. (1998), An Introduction to Corpus Linguistics. London: Longman.
Liao, Qiyi (2000), 'Corpora and translation studies', Foreign Language Teaching and Research, 5: 380-384.
Pan, Yongliang (2000), 'Introduction to Corpus Linguistics (Biber et al. 1998)', Foreign Language Teaching and Research, 5: 389-392.
Sinclair, J.M. (1991), Corpus, Concordance, Collocation. Oxford: Oxford University Press.
Stubbs, M. (1996), Text and Corpus Analysis. Oxford: Blackwell.
Thomas, J. and M. Short (1996), Using Corpora for Language Research. London: Longman.
Tognini-Bonelli, E. (2001), Corpus Linguistics at Work. Amsterdam/Philadelphia: John Benjamins Publishing Company.
Wang, Jianxin (1996), 'Introduction to three contemporary English corpora', Foreign Language Teaching and Research, 3: 37-40.
Wang, Jianxin (1998), 'Important stages in the development of corpus linguistics', Foreign Language Teaching and Research, 4: 52-57.
Wang, Jianxin (1999), 'Some development of researches on corpus linguistics in China', Foreign Languages and Their Teaching, 3: 18-20.
Wang, Jianxin (2001), 'Recent progress in corpus linguistics in China', International Journal of Corpus Linguistics, 6 (2).
Wang, Kefei (2002a), 'Parallel corpus-based approach to translation studies', Foreign Languages and Their Teaching, 9: 35-39.
Wang, Kefei (2002b), 'Corpus and network for translator and interpreter training', Foreign Language Teaching and Research, 3: 231-232.
Wang, Lidi and Wang Jianxin (2001), 'Implementation plan for a 30-million-word Chinese/English parallel corpus', presentation at the 3rd International Symposium on EFL in China, May, Beijing.
Xu, Yiping and Cao Dafeng (eds.) (2002), The Development and Application of a Chinese/Japanese Parallel Corpus. Beijing: Foreign Language Teaching and Research Press.
Yang, Huizhong (2002), An Introduction to Corpus Linguistics. Shanghai: Shanghai Foreign Language Education Press.
Yang, Xiaojun and Li Saihong (2003), 'Advantages of corpora in lexicography: review on OALD (6th edition)', Foreign Languages and Their Teaching, 4: 47-51.
Zhang, Meifang (2002), 'Exploiting corpora to analyze the stylistic features of the translators', Journal of PLA Foreign Studies University, 3: 10-14.