Corpus Linguistics Around the World

Transcription

Corpus linguistics
around the world
LANGUAGE AND COMPUTERS:
STUDIES IN PRACTICAL LINGUISTICS
No
edited by
Christian Mair
Charles F. Meyer
Nelleke Oostdijk
Corpus linguistics
around the world
Edited by
Andrew Wilson
Dawn Archer
Paul Rayson
Amsterdam - New York, NY
Cover design: Pier Post
Cover image: NASA, the Visible Earth (http://visibleearth.nasa.gov)
Online access is included in print subscriptions:
see www.rodopi.nl
The paper on which this book is printed meets the requirements of
ISO, "Information and documentation - Paper for documents -
Requirements for permanence".
ISBN 836-4 bound
Editions Rodopi B.V., Amsterdam - New York, NY
Printed in The Netherlands
Contents

Preface
Andrew Wilson, Dawn Archer and Paul Rayson

Methodology and steps towards the construction of EPEC, a corpus
of written Basque tagged at morphological and syntactic levels for
automatic processing
Aduriz I., Aranzabe M.J., Arriola J.M., Atutxa A., Díaz de Ilarraza
A., Ezeiza N., Gojenola K., Oronoz M., Soroa A., Urizar R.

The mood of the (financial) markets: in a corpus of words and of
pictures
Khurshid Ahmad, David Cheng, Tugba Taskaya, Saif Ahmad, Lee
Gillam, Pensiri Manomaisupat, Hayssam Traboulsi and Andrew
Hippisley

Towards a methodology for corpus-based studies of linguistic
change: Contrastive observations and their possible diachronic
interpretations in the Korpus 2000 and Korpus 90 General Corpora
of Danish
Jørg Asmussen

Synchronic and diachronic variation: the how and why of
sociolinguistic corpora
Kate Beeching

Statistical analysis of the source origin of Maltese
Roderick Bovingdon and Angelo Dalli

Discovering regularities in non-native speech
Julie Carson-Berndsen, Ulrike Gut and Robert Kelly

Tracking lexical changes in the reference corpus of Slovene texts
Vojko Gorjanc

Relating linguistic units to socio-contextual information in a
spontaneous speech corpus of Spanish
José María Guirao, Antonio Moreno Sandoval, Ana González
Ledesma, Guillermo de la Madrid, Manuel Alcántara

An analysis of lexical text coverage in contemporary German
Randall L. Jones

Analysing a semantic corpus study across English dialects:
Searching for paradigmatic parallels
Sarah Lee and Debra Ziegeler

The curse and the blessing of mobile phones – a corpus-based study
into American and Polish rhetorical conventions
Agnieszka Leńko-Szymańska

Using a dedicated corpus to identify features of professional
English usage: What do “we” do in science journal articles?
Judy Noguchi, Thomas Orr and Yukio Tono

Methods and tools for development of the Russian Reference
Corpus
Serge Sharoff

A profile-based calculation of region and register variation: the
synchronic and diachronic status of the two main national varieties
of Dutch
Dirk Speelman, Stefan Grondelaers and Dirk Geeraerts

A multilingual learner corpus in Brazil
Stella E. O. Tagnin

Quantitative or qualitative content analysis? Experiences from a
cross-cultural comparison of female students’ attitudes to shoe
fashions in Germany, Poland and Russia
Andrew Wilson and Olga Moudraia

Survey and Prospect of China’s Corpus-Based Research
Yang Xiao-jun
Corpus linguistics around the world
Andrew Wilson, Dawn Archer and Paul Rayson
Lancaster University
Preface
The scope of corpus-based research is becoming ever wider.
Not so many years ago, the vast majority of corpus-linguistic research was
concerned with the grammar and vocabulary of standard language varieties – the
latter meaning, in many cases, British or American English. Whilst research on
other topics, languages, and varieties was by no means completely absent from
the scene, this was the general picture of the field which came across to the
interested observer.
Today, things have changed dramatically. As this volume shows, the range of
languages, research questions, and, indeed, methodologies which are addressed
by corpus linguists has diversified. It is probably true to say that none of the
papers published in this volume focuses primarily on standard English as a
general variety. Here we find work not only on English dialects (Lee & Ziegeler)
but also on learner language (Carson-Berndsen, Gut & Kelly; Lenko-Szymanska;
Tagnin) and on a wide range of world languages - Basque (Aduriz et al.), Chinese
(Xiao-jun), Danish (Asmussen), Dutch (Speelman et al.), German (Jones),
Maltese (Bovingdon & Dalli), Russian (Sharoff), Slovene (Gorjanc), and Spanish
(Guirao et al.). In terms of the research questions addressed, the more
‘traditional’ areas of corpus linguistics are still well represented, with papers on
vocabulary (Jones), spoken language (Carson-Berndsen, Gut & Kelly; Guirao et
al.), synchronic and diachronic variation (Asmussen; Beeching; Bovingdon &
Dalli; Gorjanc; Lee & Ziegeler; Speelman et al.), Languages for Special Purposes
(Noguchi, Orr & Tono), tagging, and corpus development (Aduriz et al.; Sharoff).
However, exciting new departures are also present, with corpus-based work now
extending into areas such as cross-cultural rhetoric and social psychology
(Lenko-Szymanska; Wilson & Moudraia) and even economic forecasting (Ahmad et al.).
The papers published in this volume are but a small selection from the many
which were presented at the Corpus Linguistics 2003 conference, held at
Lancaster University in March 2003. This was the second Corpus Linguistics
conference which we hosted at Lancaster (the first was in 2001), and, like its
predecessor, it truly amazed us with the range of corpus-informed work being
carried out world-wide. Computer corpus linguistics continues to thrive and to
extend into so many areas of inquiry, many of which would probably have been
unimaginable for its pioneers in the 1960s and 1970s. We are sure that it will
continue to hold surprises for us. In the meantime, perhaps this collection of
recent corpus research around the globe will whet the reader’s appetite to follow
up new developments in this fast-moving field.
Andrew Wilson
Dawn Archer
Paul Rayson
Lancaster, August 2004
Methodology and steps towards the construction of EPEC, a
corpus of written Basque tagged at morphological and syntactic
levels for automatic processing
Aduriz I.*, Aranzabe M.J., Arriola J.M., Atutxa A., Díaz de Ilarraza A., Ezeiza N.,
Gojenola K., Oronoz M., Soroa A., Urizar R.
University of the Basque Country
* University of Barcelona.
Abstract
In this article, we will describe the different steps in the construction of EPEC (Reference
Corpus for the Processing of Basque). EPEC is a corpus of standard written Basque that
has been manually tagged at different levels (morphology, surface syntax, phrases) and is
currently being hand-tagged at deep syntax level following the Dependency
Structure-based Scheme. It is intended as a "reference" corpus for the development and
improvement of several NLP tools for Basque. This corpus has already been used for the
construction of some tools such as a morphological analyser, a lemmatiser, or a shallow
syntactic analyser.
1. Introduction
When specifying the strategic priorities for the development of language
technology in minority languages, Sarasola (2000) stated:
Language foundations and research are essential to create any tool or
application; but in the same way tools and applications will be very
helpful in research and improving language foundations. Therefore,
these three levels [applications, tools, and language foundations] have
to be incrementally developed in a parallel and coordinated way in
order to get the best benefit possible.
Moreover, Sarasola (2000) proposes five phases as a general strategy to follow in
the processing of a language: (1) laying foundations, (2) basic tools, (3) tools of
medium complexity, (4) advanced tools and multilinguality, and (5) general
applications. In all the phases proposed, corpora, first raw and then tagged, stand
out as an essential language resource.
In this article, we will describe the different steps in the construction of EPEC
(Reference Corpus for the Processing of Basque). EPEC is a corpus of standard
written Basque that has been manually tagged at different levels (morphology,
surface syntax, phrases) and is currently being hand-tagged at deep syntax level.
It is intended as a “reference” corpus for the development and improvement of
several NLP tools for Basque.
In section 2, we explain how the raw corpus was compiled and we briefly
describe the design of the tagset. In section 3, we account for the morphological
disambiguation process carried out manually over the outcome of MORFEUS
(the morphological analyser for Basque). The shallow syntactic tagging and
phrase tagging are explained in sections 4 and 5 respectively. Finally, in section 6
we explain the tag system chosen for the dependency-based syntactic analysis and
how the treebank is being tagged manually.
In figure 1, we can see a diagram showing the different phases in the construction
of EPEC, contrasting the manual tasks (right column) with the computer-based
ones (left column) as well as the dependencies between them.
2. The tagged corpus
2.1 Compilation of the corpus
EPEC is a 50,000-word sample collection of standard written Basque. It is a
strategic resource for the processing of Basque and it has already been used for
the development and improvement of some tools. Half of this collection was
obtained from the Statistical Corpus of 20th Century Basque
(http://www.euskaracorpusa.net). The other half was extracted from Euskaldunon
Egunkaria (http://www.egunero.info), the only daily newspaper written entirely
in standard Basque.
The Statistical Corpus of 20th Century Basque is a reference corpus of Basque
including 4,658,036 word-forms. It was created by UZEI (http://www.uzei.com),
a non-profit organisation devoted to making the Basque language suitable for any
specialised field. The corpus was constructed on the basis of an exhaustive
inventory of 20th century Basque publications, from which a random sampling
was extracted. This corpus has become an invaluable linguistic reference for
written Basque of this period. It was classified taking into account the following
criteria: the publications were divided into 4 periods (1900-1939, 1940-1968,
1969-1990, 1991-1999), 6 different dialects (Biscayan, Guipuzcoan, Souletin,
Labourdin-Navarrese, Standard Basque, and non-classified), and 14 genres
(literary prose, poetry, theatre, administration, newspapers...). Each book or
article also contained information about the author(s) and its title. A subcorpus of
about 25,000 word-forms was extracted from this corpus in order to build EPEC.
Texts written in standard Basque, corresponding to the last period (1991-1999)
and belonging to both literary and non-literary prose, were chosen for this
purpose. The second part of EPEC consists of several articles extracted from the
Euskaldunon Egunkaria written in the second half of 1999 and in 2000. The
articles were chosen so that they covered an assorted range of topics (economics,
culture, entertainment, international, local, opinion, politics, sports...).
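The selection criteria for the first half of EPEC can be pictured as a filter over text metadata. The sketch below is only an illustration under stated assumptions: the `TextRecord` fields, the word budget handling, and the sample records are hypothetical and do not reflect the actual format of the source corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextRecord:
    # Hypothetical metadata fields, named after the classification criteria above.
    title: str
    period: str   # e.g. "1991-1999"
    dialect: str  # e.g. "Standard Basque"
    genre: str    # e.g. "literary prose"
    n_words: int


def select_for_epec(records, budget=25_000):
    """Keep standard-Basque prose from 1991-1999 without exceeding the word budget."""
    chosen, total = [], 0
    for rec in records:
        if rec.period != "1991-1999" or rec.dialect != "Standard Basque":
            continue
        if "prose" not in rec.genre:
            continue
        if total + rec.n_words > budget:
            continue  # skip texts that would overshoot the budget
        chosen.append(rec)
        total += rec.n_words
    return chosen, total


# Invented sample records, for illustration only.
sample = [
    TextRecord("A", "1991-1999", "Standard Basque", "literary prose", 12_000),
    TextRecord("B", "1969-1990", "Standard Basque", "literary prose", 9_000),
    TextRecord("C", "1991-1999", "Biscayan", "poetry", 4_000),
    TextRecord("D", "1991-1999", "Standard Basque", "non-literary prose", 11_000),
    TextRecord("E", "1991-1999", "Standard Basque", "literary prose", 5_000),
]
chosen, total = select_for_epec(sample)
```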
Figure 1: Sketch of the different steps in the completion of EPEC.
[Flow diagram not reproduced. Stage labels: compilation of texts; morphosyntactic tagging; shallow syntax tagging; chunker; parser. Process labels: MORFEUS; CG disambiguation (linguistic knowledge); stochastic disambiguation; manual disambiguation; comparison; CG; chunker (linguistic knowledge); manual review; parser (linguistic knowledge); manual tagging. Status labels: future work; forthcoming. Output: EPEC.]
2.2 Design of the tagset
Choosing an appropriate tagset is a crucial task since the usefulness of further
applications depends on it. The main problem we found while defining the tagset
for Basque was the absence of an exhaustive one for automatic use. Moreover,
printed dictionaries of Basque also lacked systematisation of categories.
For the morphosyntactic treatment of Basque texts, the tag system we developed
is a four level system, ranging from the simplest part-of-speech tagging scheme to
full morphosyntactic information. At the first level, 20 general categories are
included for lexical items (noun, adjective, verb, pronoun, conjunction...). At the
second one, each category tag is further refined by subcategory tags. For instance,
the category ‘pronoun’ has 6 subcategories: common, emphatic, interrogative,
indefinite, reflexive and reciprocal.
The third level includes some basic morphosyntactic information such as
declension case, number, etc. This morphological information is carried by the
dependent morphemes attached to the stem.
The full output of the morphosyntactic analysis constitutes the fourth level of
tagging. The only difference between this and the previous level is that, here, all
the morphological information is considered along with the tags for syntactic
functions. Morphology and syntax are closely related in Basque, so most
syntactic functions are provided by the database, along with inflexional
morphemes. For instance, the ergative case in Basque marks the subject of a
clause with a transitive verb, whereas the absolutive case may indicate either the
subject or the predicative (with intransitive verbs) or the direct object. The specification
at this level is very detailed and constitutes the input for the morphosyntactic
disambiguation process as well as for syntactic and other types of language
processing.
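One way to picture the four-level tag system is as a single record that is viewed at increasing levels of detail. The following is a minimal sketch, not the project's actual data format: the class name `Analysis`, its field names, and the label strings in the example are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Analysis:
    category: str                                   # level 1: general category
    subcategory: str = ""                           # level 2: subcategory
    morph: dict = field(default_factory=dict)       # level 3: case, number, ...
    functions: list = field(default_factory=list)   # level 4 adds syntactic functions

    def at_level(self, level):
        """Return the tag information visible at the given level (1 to 4)."""
        if not 1 <= level <= 4:
            raise ValueError("level must be between 1 and 4")
        view = {"category": self.category}
        if level >= 2:
            view["subcategory"] = self.subcategory
        if level >= 3:
            view["morph"] = dict(self.morph)
        if level == 4:
            view["functions"] = list(self.functions)
        return view


# Illustrative reading of 'aitak' (ergative form); the label strings are invented.
aitak = Analysis("noun", "common", {"case": "erg", "number": "sg"}, ["subject"])
level1 = aitak.at_level(1)
level4 = aitak.at_level(4)
```

Keeping one record and projecting it by level means that a coarse part-of-speech view and the full morphosyntactic view can never drift apart.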
In addition to these four levels, further tags are added to mark verb chains, noun
phrases, and postpositional phrases (see sections 5.1 and 5.2).
We are currently carrying out the syntactic tagging of the corpus, following the
Dependency Structure-based Scheme (see section 6). About 31 syntactic tags are
being used for this purpose.
3. Morphosyntactic tagging of the corpus
A morphological analyser of words is an indispensable basic tool when defining a
general framework for the automatic processing of agglutinative languages like
Basque (Aduriz et al., 1998). However, prior to the completion of the
morphological analyser MORFEUS, the design of the tagset had to be
accomplished (see section 2.2) and a lexical database developed.
3.1 EDBL, a lexical database for Basque
EDBL (Aldezabal et al., 2001) is a general-purpose lexical database used in
Basque text-processing tools. This large repository of lexical knowledge is the
basis for many different NLP tasks, and provides lexical information for several
language tools including, obviously, the morphological analyser. At present, it
consists of nearly 80,000 entries divided into (i) dictionary entries (the same
found in any conventional dictionary), (ii) inflected verb forms, and (iii)
dependent morphemes, all of them with their respective morphological
information.
3.2 MORFEUS, automatic morphological analyser
MORFEUS is a robust morphological analyser for Basque. It is a basic tool for
current and future work on NLP. The analyser is based on the two-level
formalism proposed by Koskenniemi (1983), which has had widespread
acceptance due mostly to its general applicability, declarativeness of rules and
clear separation between linguistic knowledge and program.
The architecture of the analyser was defined using three main modules:
1 The standard analyser, which uses a general lexicon and a user’s lexicon.
This module is able to analyse and generate standard language word-forms.
In our applications for Basque, we defined more than 130 patterns
of morphotactics and two rule systems in cascade, the first for
long-distance dependencies among morphemes and the second for
morphophonological changes. These elements are compiled together in the
standard transducer.
2 The analysis and normalization of linguistic variants (dialectal uses and
competence errors). Due to non-standard or dialectal uses of the language
and competence errors, the standard morphology is not enough to offer
good results when analysing real text corpora. This problem becomes
critical in languages like Basque in which standardisation is in process and
dialectal forms are still in widespread use. For this process, the standard
transducer is extended with new lexical entries and phonological rules
producing the enhanced transducer.
3 The guesser, or analyser of words without lemmas in the lexicons. In this
case, the standard transducer is simplified by removing the lexical entries and
allowing the analysis of any string. The standard transducer is thus
replaced by a general transducer that describes any combination of
characters.
The morphological analyser gives as a result all the possible analyses of each
token in the text.
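The division of labour between the three modules can be sketched as a fallback chain. This is an assumption about control flow made for illustration: the text does not state that the modules are tried strictly in this order, and the toy lexicons, the function names, and the analysis strings below are invented rather than taken from MORFEUS.

```python
def analyse(token, standard, enhanced, guesser):
    """Try the standard and variant modules in turn; fall back to the guesser."""
    for source, module in (("standard", standard), ("variant", enhanced)):
        analyses = module(token)
        if analyses:
            return source, analyses
    return "guess", guesser(token)


# Toy stand-ins for the three transducers (hypothetical data).
STANDARD_LEXICON = {"etxea": ["etxe + a (noun, absolutive, singular)"]}
VARIANT_LEXICON = {"etxia": ["etxe + a (noun, absolutive, singular; variant form)"]}


def standard_module(token):
    return STANDARD_LEXICON.get(token, [])


def variant_module(token):
    return VARIANT_LEXICON.get(token, [])


def guesser_module(token):
    # Accepts any string, so it always returns at least one analysis.
    return [f"{token} (lemma not in lexicon)"]


results = {
    tok: analyse(tok, standard_module, variant_module, guesser_module)
    for tok in ("etxea", "etxia", "zzz")
}
```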
3.3 Manual disambiguation of the corpus
The manual disambiguation of the corpus was performed on the output of
MORFEUS. First, the whole corpus was analysed morphosyntactically, so that
each word-form received every possible analysis, regardless of the context
in which it appeared. We then carried out the manual disambiguation process. Two
linguists independently assigned the correct syntactic tag to each word in the
corpus, applying the “double blind” method described in Voutilainen & Järvinen
(1995). Where no correct tag had been assigned automatically, they entered it
themselves. Both linguists’ answers were compared and, when differences
occurred, they agreed on a single tag.
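The comparison step of the double-blind procedure amounts to aligning the two annotators' tag sequences and listing the positions where they differ. The sketch below assumes one tag per token and uses invented tag names; it is not the tool used by the authors.

```python
def compare_annotations(tags_a, tags_b):
    """Return the agreement ratio and the list of (index, tag_a, tag_b) conflicts."""
    if len(tags_a) != len(tags_b):
        raise ValueError("both annotators must tag the same tokens")
    conflicts = [
        (i, a, b) for i, (a, b) in enumerate(zip(tags_a, tags_b)) if a != b
    ]
    agreement = 1 - len(conflicts) / len(tags_a) if tags_a else 1.0
    return agreement, conflicts


# Hypothetical tags from two annotators for a four-token sentence.
annotator_1 = ["NOUN", "VERB", "ADJ", "NOUN"]
annotator_2 = ["NOUN", "VERB", "NOUN", "NOUN"]
agreement, conflicts = compare_annotations(annotator_1, annotator_2)
```

The conflict list is what the two linguists would then discuss in order to settle on a single tag.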
This manually disambiguated corpus was used both to improve a Constraint
Grammar disambiguator and to develop a stochastic tagger. After the corpus was
manually disambiguated, we started to construct a grammar of constraint rules
that would automatically select the correct syntactic tags in any real corpus. For
this purpose, we chose the Constraint Grammar (CG) formalism (Karlsson et al.,
1995; Tapanainen & Voutilainen, 1994), which was designed with the aim of
being a language-independent and robust tool to disambiguate and analyse
unrestricted texts. The CG grammar statements are close to real text sentences
and directly address some crucial parsing problems, especially ambiguity. The
role of the CG system is to apply a set of linguistic constraints that discard as
many alternatives as possible, leaving at the end the most fully disambiguated
sentences possible.
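The idea of discarding alternatives by contextual constraints can be illustrated with a toy reductionist disambiguator. This is a sketch in the spirit of Constraint Grammar, not the CG formalism or the Basque grammar itself: the rule, the tag names, and the English example are assumptions chosen for brevity.

```python
def remove_rule(target, condition):
    """Build a rule that discards `target` wherever `condition` holds."""
    def apply(sentence, i):
        readings = sentence[i][1]
        # Never remove the last remaining reading of a token.
        if target in readings and len(readings) > 1 and condition(sentence, i):
            readings.discard(target)
            return True
        return False
    return apply


def disambiguate(sentence, rules):
    """Apply all rules repeatedly until no rule removes anything more."""
    changed = True
    while changed:
        changed = False
        for i in range(len(sentence)):
            for rule in rules:
                changed = rule(sentence, i) or changed
    return sentence


def after_determiner(sentence, i):
    # Context test: the previous token is unambiguously a determiner.
    return i > 0 and sentence[i - 1][1] == {"DET"}


# Each token is paired with its set of candidate readings (invented tags).
sentence = [("the", {"DET"}), ("run", {"NOUN", "VERB"})]
result = disambiguate(sentence, [remove_rule("VERB", after_determiner)])
```

Because rules only remove readings and never remove the last one, the loop always terminates and every token keeps at least one analysis.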
Each rule produced for this grammar was checked against the manually disambiguated
corpus to test its adequacy and improve it iteratively whenever necessary.
Moreover, in the cases in which the analyser did not assign any correct analysis to
a word-form in the corpus, the linguists contributed greatly to the improvement of
the lexical database and the analyser itself.
Besides this, we also developed a stochastic tagger. Statistical methods need little
effort and obtain very good results (Church, 1998; Cutting et al., 1992), at least
when applied to English. In our case, we selected the TATOO tagger based on
Hidden Markov Models (Armstrong et al., 1995). TATOO was designed to be
applied to the output of a morphological analyser and the tagset can be easily
switched without changing the input text.
However, because Basque is an agglutinative language with free word order, the
stochastic tagger turned out to be much less accurate than for English when
trained directly on the output of the morphological analyser. We therefore performed
supervised training on the output of the CG grammar. Since the CG
disambiguator leaves a relatively low ambiguity rate, the results of TATOO were
much better. Currently, we apply a combination of the CG disambiguator with the
stochastic tagger and get good results (Ezeiza 2003). The CG disambiguator is
first applied and then the remaining ambiguities are solved using the results of
TATOO.
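The combination described above can be sketched as a two-stage decision per token: keep the reading left by the CG disambiguator when only one remains, and otherwise consult the stochastic tagger. The function name, the tag names, and in particular the fallback used when the tagger's choice is not among the remaining readings are assumptions, not part of the described system.

```python
def combine(cg_readings, stochastic_choice):
    """Resolve each token: CG result if unambiguous, else the stochastic tag."""
    final = []
    for readings, guess in zip(cg_readings, stochastic_choice):
        if len(readings) == 1:
            final.append(next(iter(readings)))
        elif guess in readings:
            final.append(guess)
        else:
            # Assumed fallback: a deterministic pick among the CG survivors.
            final.append(sorted(readings)[0])
    return final


# Hypothetical readings left after CG, and the stochastic tagger's choices.
after_cg = [{"DET"}, {"NOUN", "VERB"}, {"ADJ", "NOUN"}]
from_tagger = ["DET", "NOUN", "VERB"]
final_tags = combine(after_cg, from_tagger)
```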
4. Shallow syntax tagging
After disambiguating the morphological tags in the corpus, the next step was to
assign the corresponding syntactic tag to each word-form. Syntactic function tags
follow the philosophy of the Constraint Grammar (CG) formalism in the sense
that they are based on a functionally labelled dependency syntax.¹ By adopting
the CG formalism, we express the syntactic functions of words and the
interdependencies that exist among them rather than deep structural relations. So
the syntactic tags at this level refer to shallow syntactic functions, i.e. they may
provide information about the surface structure of verb chains, noun phrases, or
postpositional phrases. Therefore, this results in a shallow parsing of the corpus.
As we mentioned before, most syntactic functions are added to the word-forms
together with inflectional morphemes. Morphological suffixes and syntactic
functions are closely related in Basque and both are included in the database.
Thus, the output of the morphological analyser displays most of these shallow
syntactic tags.
However, some other syntactic tags that are not inherited from the database are
added to the analysis through CG mapping rules. These functions are mostly
attached to parts of speech, and they are generally assigned to word-forms
provided that they comply with some given contextual conditions. Mainly, the
syntactic function tags are divided into three groups: main functions (subject,
object, indirect object…), modifiers (indicating the direction relative to their
head), and verb functions (used to detect verb chains). This distinction of the
syntactic functions is essential for the tagging of the different kinds of phrases
(see section 5).
The ambiguity rate of the shallow syntactic tagging is over 22%²; that is,
more than 22 of every 100 word-forms are assigned more than one syntactic tag.
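The ambiguity rate used here is simply the share of word-forms that carry more than one tag. A minimal sketch follows; the tag names `@SUBJ` and `@OBJ` are placeholders and the data are constructed to reproduce a 22% rate, not taken from EPEC.

```python
def ambiguity_rate(readings_per_token):
    """Fraction of tokens that still have more than one candidate tag."""
    if not readings_per_token:
        return 0.0
    ambiguous = sum(1 for readings in readings_per_token if len(readings) > 1)
    return ambiguous / len(readings_per_token)


# Constructed example: 22 ambiguous word-forms out of 100.
tokens = [{"@SUBJ", "@OBJ"}] * 22 + [{"@SUBJ"}] * 78
rate = ambiguity_rate(tokens)
```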
4.1 Manual disambiguation and applications
Once each word-form in the corpus had been given at least one syntactic tag, we
repeated the manual disambiguation process. The method was similar to the one
used for the morphological disambiguation in the previous step. Two linguists
independently assigned the correct syntactic tag to each word in the corpus or,
where no correct tag had been assigned automatically, entered it themselves. Then,
both linguists agreed on a single tag when differences occurred. After the corpus was
manually disambiguated, we started to construct a grammar of constraint rules that
would automatically select the correct syntactic tags in any real corpus. Each rule
produced was checked against the manually disambiguated corpus to test its
adequacy and improve it if necessary.
5. Tagging phrases
At this stage we have the corpus manually tagged with surface syntactic tags
following the CG syntax. No phrase units are marked yet, although based on this
representation, the identification of various kinds of phrase units, such as verb
chains, noun phrases, and postpositional phrases is reasonably straightforward.
5.1 Tags for verb chains
In order to detect verb chains, we use the verb function tags (@+FAUXVERB,
@-FAUXVERB, @+FMAINVERB, @-FMAINVERB³ …) and some particles
(the negative particle, modal particles…). Based on these elements we are able to
detect not only continuous verb chains but also dispersed ones.
To mark up continuous verb chains, the following tags are attached, again using
CG mapping rules:
• %VCH: this tag is attached to a verb chain composed of a single
element.
• %VCHI: this is attached to the initial element of a complex verb chain.
• %VCHF: this is attached to the final element of a complex verb chain.
The tags used to mark up the dispersed verb chains are:
• %NCVCHI: this tag is attached to the initial element of a non-continuous verb chain.
• %NCVCHC: this tag is attached to the second element of a non-continuous verb chain.
• %NCVCHF: this tag is attached to the final element of a non-continuous
verb chain.
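The tag inventory above can be summarised as a function from the token positions of one verb chain to boundary tags. The sketch below is not the CG mapping grammar: it assumes the chain's positions are already known, and it treats every element between the first and last of a non-continuous chain as receiving %NCVCHC, which generalises the "second element" wording.

```python
def chain_tags(positions):
    """Map the token positions of one verb chain to its boundary tags."""
    positions = sorted(positions)
    if not positions:
        return {}
    if len(positions) == 1:
        return {positions[0]: "%VCH"}
    # A chain is continuous when its positions form an unbroken run.
    continuous = positions[-1] - positions[0] == len(positions) - 1
    if continuous:
        return {positions[0]: "%VCHI", positions[-1]: "%VCHF"}
    tags = {positions[0]: "%NCVCHI", positions[-1]: "%NCVCHF"}
    for pos in positions[1:-1]:
        tags[pos] = "%NCVCHC"  # assumption: all inner elements get this tag
    return tags


single = chain_tags([4])
continuous_chain = chain_tags([8, 9])      # two adjacent verb elements
dispersed_chain = chain_tags([1, 2, 5])    # a gap between positions 2 and 5
```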
5.2 Tags for noun phrases and postpositional phrases
Our assumption is that any word having a modifier function tag is linked to some
word with a main syntactic function tag. Moreover, a word with a main syntactic
function tag can, by itself, constitute a phrase unit. With this in mind, we
established three tags to mark up this kind of phrase units (noun phrases or
postpositional phrases):
• %PHR: this tag is attached to words with main syntactic function tags
that constitute a phrase unit by themselves.
• %PHRI: this tag is attached to the initial element of a phrase unit.
• %PHRF: this tag is attached to the final element of a phrase unit.
In order to attach one of these tags to each word-form, we have simultaneously
developed two subgrammars containing CG mapping rules. The first subgrammar
is aimed at delimiting verb chains whereas the second one marks noun and
postpositional phrases.
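The assumption that modifiers attach to a word with a main function suggests a simple grouping procedure for the second subgrammar. The sketch below is a simplification and not the authors' rules: it encodes each token's function as `"mod>"` (modifier whose head lies to the right), `"<mod"` (modifier whose head lies to the left), `"main"`, or `None`, and this encoding is hypothetical.

```python
def phrase_tags(functions):
    """Assign %PHR / %PHRI / %PHRF from a sequence of function classes."""
    tags = [None] * len(functions)
    i, n = 0, len(functions)
    while i < n:
        start = i
        while i < n and functions[i] == "mod>":   # leading modifiers
            i += 1
        if i < n and functions[i] == "main":
            end = i
            while end + 1 < n and functions[end + 1] == "<mod":  # trailing modifiers
                end += 1
            if start == end:
                tags[start] = "%PHR"
            else:
                tags[start], tags[end] = "%PHRI", "%PHRF"
            i = end + 1
        else:
            i = max(i, start + 1)  # no head found: move on
    return tags


example = phrase_tags(["mod>", "main", None, "main", "main", "<mod"])
```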
5.3 Manual tagging and applications
At present, a linguist is checking the tags that the first set of mapping rules
marked up in the corpus. Whenever necessary, she adds, removes, or changes the
tags automatically assigned. Once this work is finished, the first set of mapping
rules that were developed will be tested on the corpus and the results will be used
to improve the rules iteratively as well as to develop new ones.
6. Treebank
The next logical stage in the completion of the corpus is deep syntax tagging, in
order to build a treebank (Aduriz et al., 2002). Although manually tagging a
treebank is an expensive and time-consuming task, it is also an essential step for
the development of syntactic tools and applications for Basque. A group of
linguists in our research group is currently involved in this arduous task.⁴
Figure 2: Hierarchy of grammatical relations.
[Tree diagram not reproduced. Node labels, in the order extracted: dependant; structurally_case-marked complements; nc; arg_mod; clausal; modifier; detmod; ncsubj, ncobj, nczobj; finite_clause, non-finite_clause; ccomp_subj, ccomp_obj, ccomp_zobj; xcomp_subj, xcomp_obj, xcomp_zobj; auxiliary; conjunction; pred; ncmod, cmod, xmod; ncpred, xpred.]
After considering a number of diverse choices (including Skut et al., 1997;
Oflazer, 1999) we decided to follow a dependency-based procedure, for it was, in
our opinion, the one that could best deal with the free word order displayed by
Basque syntax. The dependency-based analysis describes the relations existing
between components (i.e. word-forms). This way, for each sentence in the corpus
we explicitly determine the syntactic dependencies between the heads and the
dependents.
In order to define the syntactic tagging system, we adopted the framework
presented in Carroll et al. (1998, 1999). By following this line of work, we
developed a coding-system based on hierarchies of grammatical relations, both
for lexical and empty elements, such as pro (see figure 2).
As can be seen in figure 2, the hierarchy distinguishes between several general
levels, which are further specified in subsequent levels. Thus, for instance, at the
general level we find structurally case-marked complements, thematic roles
(arg_mod), modifiers, auxiliaries and conjunctions. In turn, structurally
case-marked complements, for example, are divided into noun phrases and clauses.
Each successive level is further specified according to
grammatical function (e.g. ncsubj, ncobj, and nczobj).
Next, we present an example showing some of the grammatical relations
specified in the hierarchy:
ncsubj (Case, Head, Head of NP, the Case-marked element within NP, subj)
ncobj (Case, Head, Head of NP, the Case-marked element within NP, obj)
nczobj⁵ (Case, Head, Head of NP, the Case-marked element within NP, ind.obj)
These are examples of structurally case-marked complements when complements
are nc (non-clausal, Noun Phrases, henceforth NP), as, for instance, in the
sentence Aitak haurrari sagarra eman dio ‘Father has given an apple to the child’
(literally ‘Father to-child apple given has’):
ncsubj (erg, eman, aitak, aitak, subj)
nczobj (dat, eman, haurrari, haurrari, ind.obj)
ncobj (abs, eman, sagarra, sagarra, obj)
This description is extremely important, since it determines the number and type
of tags needed for each relation (number of slots, the characteristics of each one,
etc.). This formalisation will be very useful for future treatments, for example, to
transfer all this information into XML format (see section 7).
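Because each relation has a fixed number of slots, the notation above can be read mechanically into a structured record. The following sketch parses the five-slot notation used in the example; the class name `Relation`, its field names, and the parsing function are illustrative assumptions, not the project's tooling.

```python
import re
from typing import NamedTuple


class Relation(NamedTuple):
    name: str          # e.g. ncsubj, ncobj, nczobj
    case: str          # case of the dependent
    head: str          # head of the relation
    np_head: str       # head of the NP
    case_marked: str   # the case-marked element within the NP
    function: str      # subj, obj, ind.obj


_PATTERN = re.compile(r"^\s*(\w+)\s*\(([^)]*)\)\s*$")


def parse_relation(line):
    """Parse a line such as 'ncsubj (erg, eman, aitak, aitak, subj)'."""
    match = _PATTERN.match(line)
    if not match:
        raise ValueError(f"not a relation: {line!r}")
    slots = [slot.strip() for slot in match.group(2).split(",")]
    if len(slots) != 5:
        raise ValueError(f"expected 5 slots, found {len(slots)}")
    return Relation(match.group(1), *slots)


lines = [
    "ncsubj (erg, eman, aitak, aitak, subj)",
    "nczobj (dat, eman, haurrari, haurrari, ind.obj)",
    "ncobj (abs, eman, sagarra, sagarra, obj)",
]
relations = [parse_relation(line) for line in lines]
dependents_of_eman = [r.np_head for r in relations if r.head == "eman"]
```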
Tagging the corpus manually has enabled us to find solutions to problems that
emerge in the analysing process, such as discontinuous constituents, coordination,
or comparative clauses. Moreover, it is not unusual that similar phenomena are
treated as distinct by the different linguists tagging the corpus. In these cases, the
group of linguists tries to agree on a single analysis that will be regarded as correct
thereafter.
Consequently, as the tagging process goes on and we find new solutions to
arising problems, accuracy, robustness, and speed will improve. Besides, we are
currently developing a computational tool designed to make the manual tagging
easier and faster.
All of this work is being carried out within a project that aims at constructing
treebanks for Catalan, Spanish, and Basque (Civit & Martí, 2002).
6.1 Applications
When the manual tagging of the corpus is finished, we plan to develop a tool
based on linguistic knowledge that will be able to parse real corpora
automatically. As in the previous steps of manual tagging, each rule produced for
the parser will be tested on the manually tagged corpus in order to assess its
effectiveness and improve it accordingly.
In the future, we also plan to apply machine learning methods to the corpus, in
order to carry out automatic tagging.
7. Representation of the Corpus using XML
Over the last three years much effort has been made in our research group (Artola
et al., 2002) to integrate the NLP tools for Basque described in this chapter. Due
to the complexity of the information to be exchanged among the tools, Feature
Structures (FSs) are used to represent it. Feature structures are coded following
the TEI’s DTD for FSs, and Feature Structure Definition descriptions (FSD) have
been thoroughly defined. The documents used as input and output of the different
tools contain TEI-P4-conformant feature structures (FS) coded in XML. The use
of XML for encoding the I/O streams flowing between programs forces us to
describe the mark-up formally, and makes it possible to check by software that
the mark-up remains valid throughout an annotated corpus.
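A feature structure coded with `<fs>`, `<f>`, `<str>` and `<sym>` elements can be read into a nested dictionary with a few lines of standard-library code. The sketch below uses an abridged fragment modelled on the kind of lemmatisation output shown in figure 3; the helper name `fs_to_dict` is an assumption, and the code handles only the element types that appear in this fragment.

```python
import xml.etree.ElementTree as ET

# Abridged feature structure, modelled on the lemmatisation output in figure 3.
FS_XML = """
<fs id="COM-NOUN-2" type="Lemmatisation">
  <f name="Form"><str>aldetik</str></f>
  <f name="Lemma"><str>alde</str></f>
  <f name="morphological-Features">
    <fs type="Top-Features-List">
      <f name="POS"><sym value="NOUN"/></f>
      <f name="CASE"><sym value="ABL"/></f>
    </fs>
  </f>
</fs>
"""


def fs_to_dict(fs):
    """Convert an <fs> element into a (possibly nested) dict of feature values."""
    result = {}
    for feature in fs.findall("f"):
        child = feature[0] if len(feature) else None
        if child is None:
            value = None
        elif child.tag == "str":
            value = child.text
        elif child.tag == "sym":
            value = child.get("value")
        elif child.tag == "fs":
            value = fs_to_dict(child)   # embedded feature structure
        else:
            value = None                # element types not covered by this sketch
        result[feature.get("name")] = value
    return result


features = fs_to_dict(ET.fromstring(FS_XML))
```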
We could analyse in depth the framework for linguistic knowledge representation
and integration developed in our group, but as that is not the goal of this paper, we
will only show the output of the tools of the analysis chain (figure 3).
Different representations of the sentence Noizean behin itsaso aldetik Donostiako
Ondarreta hondartzara enbata iristen da (‘Once in a while, a storm arrives from
the high seas at Donostia’s Ondarreta beach’), coded in XML, are shown in
figure 3.
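The output files are connected by stand-off links: a `<link>` element names a word reference and the identifier of an analysis stored elsewhere. The sketch below resolves such links for two tokens of the example sentence. It is an illustration only: the wrapping `<text>` root, the two-item `targets` layout, and the function name are assumptions based on the fragments in figure 3.

```python
import xml.etree.ElementTree as ET

# Abridged token file and link file, modelled on the fragments in figure 3.
W_XML = (
    "<text>"
    "<w id='w3' sameAs='Xw3'>itsaso</w>"
    "<w id='w4' sameAs='Xw4'>aldetik</w>"
    "</text>"
)
LINK_XML = (
    "<linkGrp type='w-lem'>"
    "<link targets='Xw3 COM-NOUN-1'/>"
    "<link targets='Xw4 COM-NOUN-2'/>"
    "</linkGrp>"
)


def resolve_links(w_xml, link_xml):
    """Pair each word-form with the identifier of its linked analysis."""
    words = {w.get("sameAs"): w.text for w in ET.fromstring(w_xml).iter("w")}
    pairs = []
    for link in ET.fromstring(link_xml).iter("link"):
        word_ref, analysis_id = link.get("targets").split()
        pairs.append((words[word_ref], analysis_id))
    return pairs


pairs = resolve_links(W_XML, LINK_XML)
```

Keeping the analyses in separate files and joining them through links is what allows each tool in the chain to add its own layer without rewriting the tokenised text.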
<text id='T1'> .. Noizean behin itsaso aldetik Donostiako Ondarreta hondartzara
enbata iristen da.</text>
.xml
...


<linkGrp type='MWLU' tagOrder='y'>
<link ID='mwlnk1' targets='Xw1 Xw2'/>
...
<w id='w1' sameAs='Xw1' type='BEG_CAP'>Noizean</w>
<w id='w2' sameAs='Xw2'>behin</w>
<w id='w3' sameAs='Xw3'>itsaso</w>
<w id='w4' sameAs='Xw4'>aldetik</w>
<w id='w5' sameAs='Xw5' type='BEG_CAP'>Donostiako</w>
<w id='w6' sameAs='Xw6' type='BEG_CAP'>Ondarreta</w>
<w id='w7' sameAs='Xw7'>hondartzara</w>
<w id='w8' sameAs='Xw8'>enbata</w>
<w id='w9' sameAs='Xw9'>iristen</w>
<w id='w10' sameAs='Xw10'>da</w>
<w id='w11' sameAs='Xw11' type='PUNCT_FSTOP'>.</w>

...
w.xml

<linkGrp type='mwlnk-lem' tagOrder='y'>
<link targets='mwlnk1 ADV'/>
<linkGrp type='w-lem' tagOrder='y'>
<link targets='Xw3 COM-NOUN-1'/>
Morphosyntactic Analysis
<link targets='Xw4 COM-NOUN-2'/>
<text id='LemDoc0001'>
....
<fs id="COM-NOUN-2" type="Lemmatisation">
<f name="Form"><str>aldetik</str></f>
<f name="Lemma"><str>alde</str></f>
<f name="morphological-Features">
<fs type="Top-Features-List">
<f name="POS"><sym value="NOUN"/></f>
<f name="SUBCAT"><sym value="COM"/></f>
<f name="DET"><sym value="DET"/></f>
<f name="NUM"><sym value="S"/></f>
<f name="CASE"><sym value="ABL"/></f>
</fs>
</f>
</fs>


<fs id="PROP-N-LOC-1" type="Lemmatisation ">
<f name="Form"><str>Donostiako</str></f>
<f name="Lemma"><str>Donostia</str></f>
<f name="Top-Features-List">
<fs type="upper-level-features">
<f name="POS"><sym value="NAME"/></f>
<f name="SUBCAT"><sym value="PROP-LOC"/></f>
<f name="DET"><sym value="DET"/></f>
<f name="NUM"><sym value="S"/></f>
<f name="CASE"><sym value="GEL"/></f>
</fs>
</f>
</fs>


.lem.xml
...
<link targets='Xw5 PROP-N-LOC-1'/>
<link targets='Xw6 PROP-N-LOC-2'/>
...
.lemlnk.xml
EUSLEM
Chunker
<text id=span0001>...

<linkgrp type="span" tagOrder='y'>
<text id=cad0001>
...
<fs id=COM-NOUN-1 type="phrase">
<f name="chain">
<str>itsaso aldetik</str></f>
<f name="head">
<str>itsaso</str></f>
<f name="POS"><str>NOM</str></f>
<f name="SUBCAT"><str>COM</str></f>
<f name="SFL" ORG='list'>
<sym value = "NCMOD" >
</f>
</fs> ...
Dependencies
.sint.xml
<text id=dep0001>
...
<linkgrp type="dep" targorder="y"
targFunc="head dependant
case-dep-unit" domains=???>

<link id="dep1" targets="Xw9 span1 span1">
<link id="dep2" targets="Xw9 Xw7 Xw7">
...
</linkgrp>
dep.xml
<link id="span1" targets="Xw3 Xw4">
<link id="span2" targets="Xw5 Xw6 Xw7">...
</linkgrp>

.spanlnk.xml
</text>
<text id=span-sint0001>
...

<linkgrp type="span-sint" tagOrder='y'>
<link targets="span1 NOUN-COM">
...
</linkgrp>

</text>
.sintlnk.xml
...
<linkgrp type="dep-deplib" targorder="yes">
<link targets="dep1 D-NCMOD-ABL"/>
<link targets="dep2 D-NCMOD-ALA"/>
...
</linkgrp>...
Deplnk.xml
<fs id="D-NCMOD-ALA" type="dependency">
...
<f name="nombre">
<sym value="NCMOD-ABL"/></f>
<f name="CASE">
<sym value="ALA"/></f> ..
</fs>
Deplib.xml
Figure 3: Output of the different tools coded in XML
8.
Future work
The lexical information gathered in the lexical database (EDBL), which is the
basis for several NLP tools in our research group, is constantly being renewed.
New entries from diverse sources are periodically added to the database.
Moreover, new tools, such as recognisers for multiword units, named entities
and postpositions, have been developed. These changes must be reflected in the
corpus, so we must review it regularly. In the near future, therefore, we intend
to update EPEC with new information. This will be done semiautomatically, so
that only the new information needs to be reviewed.
9.
Acknowledgements
This research is supported by the University of the Basque Country
(9/UPV00141.226-14601/2002), the Ministry of Industry of the Basque
Government (project XUXENG, OD02UN52), the Interministerial Commission
for Science and Technology of the Spanish Government (FIT-150500-2002-244),
and the European Community (MEANING project, IST-2001-34460).
Notes
1 Dependency syntax has a long tradition in grammatical analysis, going back
to the Greco-Roman era. More recently, within the application of formalisms
to syntactic theory, Tesnière (1959), Hays (1964) and Mel’cuk (1988), among
others, have revived dependency syntax in theoretical terms.
2 This ambiguity was estimated taking into account the syntactic functions
of a subset of 200 common words.
3 Finite auxiliary verb, non-finite auxiliary, finite main verb, non-finite main
verb…
4 In this line of research, our group is taking part in the project entitled “The
IXA group, tools for an automatic treatment of Basque: creating a database
composed of syntactic-semantic trees” (see Acknowledgements).
5 nczobj would be equivalent to the English nciobj (non-clausal indirect
object).
References
Aduriz I., Agirre E., Aldezabal I., Alegria I., Ansa O., Arregi X., Arriola J.M.,
Artola X., Díaz de Ilarraza A., Ezeiza N., Gojenola K., Maritxalar A.,
Maritxalar M., Oronoz M., Sarasola K., Soroa A., Urizar R., Urkia M.
(1998), A Framework for the Automatic Processing of Basque.
Proceedings of the First International Conference on Language Resources
and Evaluation, Granada.
Aduriz I., Aldezabal I., Aranzabe M., Arrieta B., Arriola J., Atutxa A., Díaz de
Ilarraza A., Gojenola K., Oronoz M., Sarasola K. (2002), Construcción de
un corpus etiquetado sintácticamente para el euskera. Actas del XVIII
Congreso de la SEPLN, Valladolid, Spain.
Aldezabal I., Ansa O., Arrieta B., Artola X., Ezeiza A., Hernández G., Lersundi
M. (2001), EDBL: a General Lexical Basis for the Automatic Processing
of Basque. IRCS Workshop on Linguistic Databases, Philadelphia (USA).
Alegria I., Aranzabe M., Ezeiza A., Ezeiza N., Urizar R. (2002) Robustness and
customisation in an analyser/lemmatiser for Basque. Proceedings of
Workshop on "Customizing knowledge in NLP applications ". Third
International Conference on Language Resources and Evaluation, Las
Palmas de Gran Canaria (Spain).
Armstrong S., Russell G., Petitpierre D., Robert G. (1995) An Open Architecture
for Multilingual Text Processing. Proceedings of the 7th Conference of the
European Chapter of the Association for Computational Linguistics,
Dublin, Ireland, pp. 101-106.
Artola X., Díaz de Ilarraza A., Ezeiza N., Gojenola K., Hernández G., Soroa A.
(2002) A Class Library for the Integration of NLP Tools: Definition and
implementation of an Abstract Data Type Collection for the manipulation
of SGML documents in a context of stand-off linguistic annotation.
Proceedings of the Third International Conference on Language
Resources and Evaluation, Las Palmas de Gran Canaria, Spain.
Carroll J., Briscoe T., Sanfilippo A. (1998) Parser evaluation: a survey and a new
proposal. Proceedings of the International Conference on Language
Resources and Evaluation, Granada, Spain, pp. 447-454.
Carroll J., Minnen G., Briscoe T. (1999) Corpus Annotation for Parser
Evaluation. Proceedings of Workshop on Linguistically Interpreted
Corpora, EACL´99, Bergen.
Church K. W. (1998) A Stochastic Parts Program and Noun Phrase Parser for
Unrestricted Text. Proceedings of the Second Conference on Applied
Natural Language Processing, Québec, Canada, pp. 136-143.
Civit M., Martí M. (2002) Design Principles for a Spanish Treebank. Proceedings
of the Treebanks and Linguistic Theories (TLT2002), Sozopol, Bulgaria.
Cutting D., Kupiec J., Pederson J., Sibun P. (1992) A Practical Part-of-speech
Tagger. Proceedings of the Third Conference on Applied Natural
Language Processing, Philadelphia, USA, pp. 133-140.
Ezeiza, N. (2003) Corpusak ustiatzeko tresna linguistikoak. Euskararen
etiketatzaile sintaktiko sendo eta malgua. PhD thesis, University of the
Basque Country.
Hays D. C. (1964) Dependency theory: a formalism and some observations.
Language 40, pp. 511-525.
Karlsson F., Voutilainen A., Heikkila J., Anttila A. (1995) Constraint Grammar:
Language-independent System for Parsing Unrestricted Text. Mouton de
Gruyter, Berlin.
Koskenniemi K. (1983) Two-level Morphology: A general Computational Model
for Word-Form Recognition and Production. University of Helsinki,
Department of General Linguistics. Publications 11.
Mel’cuk I. (1988) A Dependency Syntax: Theory and Practice. State University
of New York Press.
Oflazer K., Zeynep D., Tür H., Tür G. (1999) Design for a Turkish treebank.
Proceedings of Workshop on Linguistically Interpreted Corpora, at
EACL, Bergen.
Sarasola K. (2000) Strategic priorities for the development of language
technology in minority languages. Proceedings of Workshop on
"Developing language resources for minority languages: re-usability and
strategic priorities". Second International Conference on Language
Resources and Evaluation, Athens, Greece.
Skut W., Krenn B., Brants T., Uszkoreit H. (1997) An Annotation Scheme for
Free Word Order Languages. Fifth Conference on Applied Natural
Language Processing, Washington, DC, USA, pp. 88-95.
Tapanainen P., Voutilainen A. (1994) Tagging Accurately-Don´t guess if you
know. Proceedings of the 4th Conference on Applied Natural Language
Processing, Washington.
Tesnière L. (1959) Eléments de Syntaxe Structurale, (2nd ed.) Paris, Klincksieck.
Voutilainen A., Järvinen T. (1995) Specifying a shallow grammatical
representation for grammatical purposes. Proceedings of the 7th
Conference of European Association of Computational Linguistics,
Dublin.
The mood of the (financial) markets: In a corpus of words and
of pictures
Khurshid Ahmad, David Cheng, Tugba Taskaya, Saif Ahmad, Lee Gillam, Pensiri
Manomaisupat, Hayssam Traboulsi and Andrew Hippisley
Department of Computing, University of Surrey
Abstract
Corpora of texts are typically used to study the structure and function of language. The
distribution of the various linguistic units that make up the texts in a corpus is used to make and
test hypotheses relevant to different linguistic levels of description. News reports and
editorials have been used extensively to populate corpora for studying language, for
making dictionaries and for writing grammar books. News reports of financial markets are
generally accompanied by time-indexed series of values of shares, currencies and so on,
reflecting the change in value over a period of time. A corpus linguistic method for
extracting sentiment indicators, e.g. shares going up or a currency falling down, is
presented together with a technique for correlating the quantitative time-series of values
with a time series of sentiment indicators. The correlation may be used in the analysis of
the movement of shares, currencies and other financial instruments.
1.
Introduction
Financial markets are places where financial instruments are bought and sold.
These instruments include shares, currencies and bonds: shares are traded for
individual organisations, and traders take options (in slang, bets) on the
aggregate value of key shares, e.g. the Financial Times Stock Exchange (FTSE)
index. Some of these instruments are traded in millions, others in thousands and
yet others in hundreds; the prices of instruments change frequently during a
single trading session or over a longer trading horizon. A set of buying/selling
prices of instruments, ordered in time, is usually referred to as a
(quantitative) time series. In the financial pages of newspapers, and now on
specialised web sites, these time series are displayed either independently or
as graphical illustrations within (long) texts.
The buying and selling of instruments in itself causes changes in their value: too
many buyers for a share and its value goes up, too many sellers and the value
goes down. The so-called efficient market hypothesis (EMH) suggests that the
(trading in a) financial market is the sole arbiter of the price of an instrument.
Despite the dominance of the EMH, newspapers and financial web sites
regularly report the reactions of individuals, acting on their own or on behalf of
organisations or governments; some web sites display the results of polls of
financial experts. The experts report whether they are ‘bearish’ (shy to buy or
sell), ‘bullish’ (too eager or aggressive to buy or sell) or neutral. These polls
are conducted at regular intervals and the sentiments of experts are displayed
as a time series. There are also those who ‘correlate’ the time series
comprising the bearish / bullish / neutral voting figures with the time series of
financial instruments. The correlation is then used as a signal for buying or
disposing of an instrument.
The sentiment of the market traders – or market sentiment for short – is shaped
by, and in turn shapes, the value of a financial instrument usually in the short
term and perhaps in the long term as well.
Much as the sentiment of a trader influences others, others also influence him.
The view of the others is typically communicated through press statements. One
can argue that (financial) news stories may affect the trader’s sentiment, or more
precisely, his or her attitude towards an instrument. Ergo, positive news stories
may persuade people to invest in the market thereby driving the prices of
instruments up: conversely negative or gloomy stories force prices down. Note
that the physical prepositions (up/down) used to describe the position/location of
a physical object, are also used to describe the change in value of an abstract
financial instrument during a fixed time period. News stories and people’s
conversation about the financial markets extend these spatial metaphors further
by talking in terms of state change – one sees a change in the value of an
instrument in terms of rising or falling.
The use of literary allusions, including bear/bull and vibrant/anaemic, and of
more colourful slang, such as the phrase dead cat bounce, which likens an upward
movement of a stock to a lifeless object merely moving because of the laws of
gravity, shows a creative use of language in the specialist field of financial
trading (Ahmad 2002). The sentiment of a trader toward the market may change on
reading a news story: bullish stories may cheer him or her up, and bearish
stories may depress him or her and in turn depress the market (Knowles 1996).
We wish to explore whether it is possible to extract the sentiment from a news
story through linguistic analysis. It is possible to use the literature on
buying/selling in semantic theory (Jackendoff 1991) as a framework for analysing
the meaning of the news stories. The literature on natural language processing
(Simmons et al 1984) and on knowledge representation suggests that frame
semantics has been used to build systems that can, in principle, analyse, extract
and disseminate the meaning (intent?) of a specialist news report. Frame
semantics has a number of limitations, and a prominent one is the need for a
lexicon that is rich and extensive in terms of meaningful data.
What about a purely lexical approach? Recall Quirk et al.’s (1985) observation
that relates the frequency of a lexical item to the acceptability of that item to an
educated native speaker of the language. Stretching this dictum to financial
markets, one can argue that higher frequency of phrases that express positive
sentiment suggests that all is well in the market. Similarly, a predominance of
negative phrases might suggest that all is not well in the market and that it may
have fallen or is about to fall.
We have analysed three years’ output of Reuters financial news, comprising over
10 million tokens published during 2000-2002.
[Figure 1: line chart of normalised value (vertical axis, 0.2 to 1.1) against month (horizontal axis, Jan to Dec) for three series: rose, fell and FTSE 100]
Figure 1: Monthly variation of “rose”, “fell” and FTSE 100
Figure 1 shows the variation in the frequency of two verbs rose and fell over a
one-year period (Ahmad et al. 2002). Also plotted is the value of FTSE-100 at
the close of daily trading during 2002. There appears to be an encouraging
numerical correlation amongst the sentiment verbs and FTSE-100.
Exploring further the above hypothesis, namely that sentiment may correlate with
the frequency of phrases that express positive and negative sentiment when
describing changes in the value of an instrument, requires an understanding of
the following:
How do we analyse texts such that frequent phrases hypothetically related to a
sentiment are indeed used in a sentence or in many sentences to express the
sentiment? The ambiguity of language does play a significant part in confusing
the sense of these phrases, for example, rose, as in something rising, with that of
a flower or indeed the name of a person. We show how simple cues related to the
grammatical categories of phrases in the neighbourhood of the potential sentiment
laden word ensure that the word rise quite x suggests that the value of a financial
instrument is rising.
How do we use the time series of sentiment word frequency in conjunction with
the time series of the values of an instrument either to predict stability or chaos in
the market, or to discover the ‘turning point’ in the value of an instrument? The
turning point is the point in time when the value of the instrument stops
decreasing and starts to increase or vice versa.
How do we organise texts in a diachronic corpus such that these texts may be
added or omitted according to the pragmatic attributes of the texts? This is
important if the market movements and sentiment analysis are required for a
specific instrument (e.g. US$, Euro, BT stocks, FTSE derivative) traded in a
certain country or group of countries.
2.
Sentiment words and their frequency
2.1
A note on up-and-down phrases
Amongst the many different phrases chosen to describe the changes in the actual
or potential values of financial instruments the phrases up, down, rise and fell
have an intuitive prominence. Other related phrases and synonyms are also used:
growth, slump, jump and drop are good examples. The British National Corpus
(BNC) shows the preponderance in the general language of these up-and-down
phrases:
Table 1: BNC frequencies obtained from the on-line version of the corpus; for
rose and fell we restricted the query to verbs only (NBNC = 100,106,008)
Token     fBNC      Token     fBNC
up        207709    down      92285
growth    12794     fell      9563
rose      5566      slump     632
The BNC shows that there is more ‘positive’ sentiment in the corpus than, say,
the ‘negative’. However a much more detailed analysis is required before any
serious conclusion can be drawn from these literally raw figures. A similar
analysis of financial news texts shows the dominance of the phrases up and down
but with a somewhat different distribution of the other four phrases. Consider a
sample from Reuters UK-financial News for the month of November 2002
(comprising 400,000 or so tokens):
Table 2: Frequencies of ‘positive’ and ‘negative’ sentiment phrases based on
Reuters UK-financial News November 2002. This is an untagged
sample and homographic conflicts (e.g. rose as noun or verb) have not
been resolved (NReuters = 402089)
Token     fReuters  Token     fReuters
up        1435      down      716
growth    650       fell      391
rose      424       slump     73
The rank order of the three ‘positive’ sentiment phrases and that of the ‘negative’
sentiments is preserved when the register changes from the BNC – a corpus
which largely comprises general language texts (c. 62% texts are drawn from
fiction, leisure, world affairs, the arts, belief and thought) – to the specialist
financial news which wholly comprises, what the BNC compilers would call,
commerce and finance texts (the BNC has just under 8% of such texts in its
composition). However, the relative distribution of the different phrases within
the ‘positive’ and ‘negative’ sentiment categories is substantially different:
Table 3: Relative distribution of “positive” and “negative” sentiment phrases
across BNC and Reuters UK-financial News November 2002. N(Positive, BNC) = 226069
and N(Positive, Reuters) = 2509 are the totals of the three positive phrases in
each corpus.
Token     fBNC      fBNC / N(Positive, BNC)   fReuters   fReuters / N(Positive, Reuters)
up        207709    91.8%                     1435       57.2%
growth    12794     5.66%                     650        25.9%
rose      5566      2.46%                     424        16.9%
Total     226069    100%                      2509       100%
The predominance of up and down is reduced by about one-third and the more
domain-specific growth and rose increase dramatically when we change the
register from general language to special language: from 6% in the BNC to about
26% in Reuters for growth and from under 3% to about 17% for rose. Similar
results can be obtained for negative sentiment words.
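The relative distributions in Table 3 can be recomputed directly from the raw counts in Tables 1 and 2. The sketch below is only an illustration of that arithmetic; the function name `relative_distribution` and the dictionary layout are assumptions made here, not part of the authors' system. It also produces the corresponding shares for the negative phrases, which the text mentions but does not tabulate.

```python
# Raw frequencies copied from Table 1 (BNC) and Table 2 (Reuters, Nov 2002).
BNC = {
    "positive": {"up": 207709, "growth": 12794, "rose": 5566},
    "negative": {"down": 92285, "fell": 9563, "slump": 632},
}
REUTERS = {
    "positive": {"up": 1435, "growth": 650, "rose": 424},
    "negative": {"down": 716, "fell": 391, "slump": 73},
}


def relative_distribution(freqs):
    """Return each phrase's share (in percent) of its category total."""
    total = sum(freqs.values())
    return {phrase: 100.0 * f / total for phrase, f in freqs.items()}


if __name__ == "__main__":
    for name, corpus in (("BNC", BNC), ("Reuters", REUTERS)):
        for polarity, freqs in corpus.items():
            shares = relative_distribution(freqs)
            row = ", ".join(f"{p}: {s:.1f}%" for p, s in shares.items())
            print(f"{name:8s} {polarity:8s} {row}")
```

Keeping the two polarities in the same structure makes it easy to repeat the comparison for other months or other phrase sets.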
Financial journalists typically express rise and fall in percentage terms: this
perhaps gives their story a quantitative and objective look and feel. A
concordance of Reuters UK-financial News (November 2002) for the word rose
shows varied usage of the pattern:
rose [ by | only/nearly | to ] X percent
Some examples of these patterns include:
enterprise shares               rose   2.83 percent          to 584 pence in
Volatile mortgage payments ,    rose   by 0.1 percent        on the month to
home loan repayments ,          rose   to 2.3 percent        in the year to
152 pence, logica               rose   over eight percent    and cmg climbed 6
Property companies -            rose   nearly seven percent  to 1,235
last week house prices          rose   only 1.4 percent      last month - -
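A pattern of this kind can be approximated with a regular expression. The following is a minimal sketch, not the authors' extraction procedure: the modifier list is taken from the patterns named in Table 4, while the list of spelled-out numbers and the names `ROSE_PATTERN` and `classify_rose` are assumptions introduced for illustration.

```python
import re
from collections import Counter

# Spelled-out quantities are an illustrative assumption; extend as needed.
NUMBER_WORDS = "one|two|three|four|five|six|seven|eight|nine|ten"

ROSE_PATTERN = re.compile(
    r"\brose\s+"
    r"(?:(by\s+over|by|to|over|nearly|only)\s+)?"      # optional modifier
    r"(?:\d+(?:[.,]\d+)*|" + NUMBER_WORDS + r")\s*"    # the quantity X
    r"(?:percent|%)",
    re.IGNORECASE,
)


def classify_rose(text):
    """Return one label such as 'rose by X %' per match found in text."""
    labels = []
    for match in ROSE_PATTERN.finditer(text):
        modifier = " ".join((match.group(1) or "").lower().split())
        parts = ["rose"] + ([modifier] if modifier else []) + ["X %"]
        labels.append(" ".join(parts))
    return labels


# Concordance lines quoted in the text above.
CONCORDANCE = [
    "enterprise shares rose 2.83 percent to 584 pence in",
    "Volatile mortgage payments , rose by 0.1 percent on the month to",
    "home loan repayments , rose to 2.3 percent in the year to",
    "152 pence, logica rose over eight percent and cmg climbed 6",
    "Property companies - rose nearly seven percent to 1,235",
    "last week house prices rose only 1.4 percent last month - -",
]

if __name__ == "__main__":
    counts = Counter(label for line in CONCORDANCE for label in classify_rose(line))
    for label, n in counts.most_common():
        print(label, n)
```

Because the expression requires a quantity and a percent sign or the word percent after rose, occurrences of rose as a flower or a personal name are not counted.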
A further analysis of the stories published by Reuters, under the UK-financial
rubric throughout the calendar year 2002, shows that not all the above patterns are
used as frequently and that one pattern, rose X percent, tends to dominate – or, in
other words was used preferentially by the journalists in the year 2002.
Table 4: Variations of sentiment rose across year 2002, where fx denotes the raw
frequency of the phrases
Patterns
frose
X%
by X %
to X %
over X %
Nearly X
%
by over X
%
only X %
frose [phrase] %
Proportion
of
rose
[phase] %
Jan
417
58.0%
2.2%
Feb
369
52.0%
7.0%
4.1%
0.5%
0.3%
284
238
Mar
245
49.4%
0.8%
1.6%
0.8%
Apr
263
58.2%
1.1%
1.5%
0.4%
143
174
May
376
57.4%
1.9%
1.6%
0.8%
Jun
245
53.1%
1.6%
Jul
427
50.1%
1.9%
0.4%
1.4%
0.3%
0.4%
0.5%
0.3%
149
0.2%
254
0.6%
217
238
Aug
351
51.3%
2.8%
0.3%
Sep
357
44.8%
3.6%
0.3%
0.6%
0.3%
190
Oct
342
59.4%
2.9%
2.3%
0.6%
Nov
424
50.7%
3.8%
2.8%
1.4%
0.6%
0.7%
0.0%
0.0%
236
0.2%
0.7%
256
Dec
231
63.6%
3.5%
0.4%
175
68.1% 64.5% 58.4% 66.2% 63.3% 60.8% 59.5% 61.8% 53.2% 69.0% 60.4% 75.8%
For example, in January 2002, of all the patterns rose [phrase] X percent,
58% were rose X percent, followed by 2.2% comprising rose by X percent and
just under 1% comprising rose over X percent. During the year 2002, the
frequency of the pattern rose X % was above 50% for all months except March
and September, where it was 49.4% and 44.8% respectively (the average value
was 54%). Table 4 shows the monthly variation in the various patterns
comprising the verb rose.
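The bottom row of Table 4 can be checked against the monthly counts. The sketch below assumes a particular reading of the table: that the twelve values of frose, the twelve counts of rose [phrase] %, and the twelve shares of rose X % pair with the months in calendar order. That pairing is an assumption made here; it is supported by the fact that it reproduces the printed proportions and the figures quoted in the text.

```python
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Assumed pairing of Table 4 values with months (see the note above).
F_ROSE = [417, 369, 245, 263, 376, 245, 427, 351, 357, 342, 424, 231]
F_ROSE_PATTERN = [284, 238, 143, 174, 238, 149, 254, 217, 190, 236, 256, 175]
ROSE_X_SHARE = [58.0, 52.0, 49.4, 58.2, 57.4, 53.1,
                50.1, 51.3, 44.8, 59.4, 50.7, 63.6]
PRINTED_PROPORTION = [68.1, 64.5, 58.4, 66.2, 63.3, 60.8,
                      59.5, 61.8, 53.2, 69.0, 60.4, 75.8]


def proportions(pattern_counts, totals):
    """Percentage of all occurrences of rose that fall in a rose ... % pattern."""
    return [round(100.0 * p / t, 1) for p, t in zip(pattern_counts, totals)]


def months_below(shares, threshold):
    """Months in which the share of a pattern is below the threshold."""
    return [m for m, s in zip(MONTHS, shares) if s < threshold]


if __name__ == "__main__":
    print(proportions(F_ROSE_PATTERN, F_ROSE))
    print(months_below(ROSE_X_SHARE, 50.0))
    print(round(sum(ROSE_X_SHARE) / len(ROSE_X_SHARE), 1))
```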
The rise/fall and up/down metaphor may be expressed through other related
words and through synonyms. Textbooks on ‘report writing’ suggest that, without
losing accuracy, one may use related words/synonyms instead of repeating the
same words. Indeed, slump instead of fall, and jump/climb instead of rise, appear
to be good candidates not only for improving writing style but also for making
the news report more sensational. We consulted Roget’s Thesaurus (1980) and
WordNet online to find related words/synonyms for the rise/fall words discussed
in Tables 2 to 4. It appears that the frequency of the related words/synonyms
is much lower (Table 5).
Table 5: Related words/synonym sets of rise and fall
Lemma                            fReuters   Related words/Synonyms   fReuters
fall, inc. fell, falling         875        drop                     155
                                            slump                    69
                                            strike                   53
rise, inc. rose, rising, risen   1004       jump                     133
                                            climb                    75
                                            lift                     33
3.
Sentiments and lexico-grammatical patterns?
The domination of one word form over others with related meanings perhaps shows
the precision (some would argue, the lack of imagination) of the language of
financial journalists. This is good news for people interested in
information extraction, a branch of computing dedicated to the extraction of
meaning from natural language text. The initial results on the
preponderance of one word form over others (Table 5), together with a
preponderant lexico-grammatical pattern in which the word form is found (Table
4), suggest to us that when the word forms rise and fall, or rather rose and fell,
are used, especially when followed by a number and ‘percent’, they indicate an
upward or downward movement in the value of some part of the financial market;
this is true of shares, the aggregate share index of a business sector (cf.
Property companies, telecommunications) and house prices. Some ambiguity can
remain, though: a report telling us that ‘mortgage defaulters fell by X
percent’ is good news, despite the verb fell, unless you are in the debt
collection business.
Zellig Harris and his pupil Maurice Gross (1991) have suggested that
in certain types of text one may find local grammars in operation: certain phrase
structures occur more frequently in one type of text, or one set of text
fragments, than in the language as a whole. Harris illustrated this point by citing
examples of recursive noun phrases used in the biochemical literature to refer
either to complex biochemical compounds or to complex biochemical processes. Gross
(1993) focussed on how we specify time and date, and showed that cardinal numbers
used to denote time and calendrical expressions (day / month / year, century) are
embedded in their own local grammar. Barnbrook and Sinclair have used this
notion to argue that dictionary definitions are also written in a local grammar
(1993). Local grammars and the Hallidayan term ‘lexico-grammatical’ patterns
have a certain resonance, and we have used this to explore the grammatical
environment of market-sentiment indicating words.
The sentences comprising rose and fell embedded in the patterns shown in Table
4 were analysed using a reliable part-of-speech tagger, CLAWS. The local
grammar of the two sentiment indicators rose/fell comprises these patterns (also
see Table 6):
VVD { Ø | PRP | AV0 | AVP | DT0 } CRD NN0
VVD AV0 { DT0 } CRD NN0
VVD PRP { AT0 } CRD NN0
Here VVD = verb; CRD = cardinal; NN0 = numeral; DT0 = determiner; AV0 =
adverb; AVP = adverb particle; PRP = preposition.
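Once a text has been tagged, such a local grammar can be applied as a simple match over tag sequences. The sketch below uses only the tag sequences listed in Table 6; the function name `sentiment_matches`, the up/down labels and the hand-tagged example sentences are assumptions made for illustration (the tags were assigned by hand here, not produced by CLAWS).

```python
# Tag sequences as listed in Table 6 for the rose ... X % patterns.
LOCAL_GRAMMAR = [
    ("VVD", "CRD", "NN0"),
    ("VVD", "PRP", "CRD", "NN0"),
    ("VVD", "AV0", "CRD", "NN0"),
    ("VVD", "AVP", "AV0", "CRD", "NN0"),
]

# Hypothetical direction labels for the two sentiment indicators.
SENTIMENT_VERBS = {"rose": "up", "fell": "down"}


def sentiment_matches(tagged):
    """Return (direction, words) for each sentiment verb whose right-hand
    context matches one of the tag sequences in LOCAL_GRAMMAR."""
    found = []
    for i, (word, tag) in enumerate(tagged):
        direction = SENTIMENT_VERBS.get(word.lower())
        if direction is None or tag != "VVD":
            continue  # e.g. "rose" tagged as a noun is skipped
        for sequence in LOCAL_GRAMMAR:
            window = tagged[i:i + len(sequence)]
            if tuple(t for _, t in window) == sequence:
                found.append((direction, [w for w, _ in window]))
                break
    return found


# Hand-tagged examples (tags are assumed, for illustration only).
SHARES = [("shares", "NN2"), ("rose", "VVD"), ("2.83", "CRD"), ("percent", "NN0")]
DEFAULTS = [("defaults", "NN2"), ("fell", "VVD"), ("by", "PRP"),
            ("0.1", "CRD"), ("percent", "NN0")]
FLOWER = [("a", "AT0"), ("rose", "NN1"), ("bloomed", "VVD")]

if __name__ == "__main__":
    for sentence in (SHARES, DEFAULTS, FLOWER):
        print(sentiment_matches(sentence))
```

The part-of-speech check on the verb is what separates rose as a movement in value from rose as a flower or a name.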
Table 6: Grammatical properties of the lexico-grammatical patterns
Pattern             Grammar
rose X %            VVD CRD NN0
rose by X %         VVD PRP CRD NN0
rose to X %         VVD PRP CRD NN0
rose over X %       VVD AV0 CRD NN0
rose nearly X %     VVD AV0 CRD NN0
rose by over X %    VVD AVP AV0 CRD NN0
rose only X %       VVD AV0 CRD NN0
4.
“Virtual Corpora”
The designers of individual corpora have discussed the organisation of the text
files within a corpus. The early corpora divided texts into
informative/imaginative types (cf. the Lancaster-Oslo/Bergen and Brown corpora,
c. 1960s); the texts were then divided by genre, topic and other pragmatic
attributes. The subsequent ones took two different approaches: first, genre-based
classification was used by some, where texts were classified into books,
magazines, personal correspondence and so on (cf. Collins-COBUILD, c. 1970-80s);
second, topic-based classification was used by the designers of the
Longman-Lancaster corpus (c. 1980s), wherein texts were divided into subject
topics (science, world affairs, news and so on). The other pragmatic attributes
were included as well. The LOB and Brown corpora were developed for the study of
(English) language in general and the Collins-COBUILD and Longman-Lancaster
corpora for lexicographical purposes. The more ambitious British National Corpus
extended the list of pragmatic attributes, and perhaps made the attributes more
explicit. The texts can be selected for analysis through a complex query language
or by selecting the texts from a file store. For us, there is a hierarchical structure
that drives the design of a given text corpus. The selection of the top-level
features drives the selection of the other features; these are “doable” tasks if you
know the organisation of the corpus you are using. The user of a computer-based
corpus must therefore know how the corpus is structured; what is more desirable
is the use of ‘logical’ (attribute-oriented) features of the text.
The organisation of a corpus of news reports suggests that sometimes there would
be a need to analyse the corpus diachronically, focusing on a given
individual/organisation, financial instrument, and at other times there is a need to
conduct a synchronic analysis, for example, the analysis of news about all
organisations within a given industry sector or all members of a political party.
The analysis may be required based on a query comprising (a number of)
keyword(s) that may be used in the indexation of a set of news reports. The
permutations of the pragmatic and lexical attributes of a given text are numerous.
It is possible that individual users of a text corpus may like to organise the texts
according to their own needs. In order to have a user-configurable corpus, the
notion of a virtual corpus was introduced (Holmes et al. 1994) for analysing
technical and scientific texts.
The notion of virtual corpus is similar to that of a virtual machine: there is in
reality only one corpus, but the users can arrange the text attributes in a hierarchy
of their choice based on a physically extant set of texts, for the duration of their
use. This configurable hierarchy will have to be made available through the
agency of a program, within a suite of corpus management programs, for
producing this virtual corpus. The notion of virtual corpus introduces a shift from
the usual pre-defined and explicit corpus hierarchical approach, in that it allows
the definition of virtual hierarchies. Texts can then be retrieved by navigating
through an organisation – the virtual hierarchy – specified by the user.
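The idea of a user-defined hierarchy over a single physical set of texts can be sketched as a grouping of metadata records by an ordered list of attributes. This is only an illustration of the concept, not the design of any existing corpus manager; the function `build_virtual_hierarchy`, the record fields and the sample records are hypothetical.

```python
from collections import defaultdict


def build_virtual_hierarchy(texts, attribute_order):
    """Group text records into nested dicts keyed by the chosen attributes,
    in the order the user lists them; the leaves are lists of text ids."""
    if not attribute_order:
        return [t["id"] for t in texts]
    first, rest = attribute_order[0], attribute_order[1:]
    groups = defaultdict(list)
    for t in texts:
        groups[t.get(first, "unknown")].append(t)
    return {value: build_virtual_hierarchy(members, rest)
            for value, members in groups.items()}


# Hypothetical metadata records; the texts themselves are stored only once.
TEXTS = [
    {"id": "t1", "topic": "finance", "year": 2002},
    {"id": "t2", "topic": "finance", "year": 2001},
    {"id": "t3", "topic": "politics", "year": 2002},
]

if __name__ == "__main__":
    # Two users arrange the same texts in two different hierarchies.
    print(build_virtual_hierarchy(TEXTS, ["topic", "year"]))
    print(build_virtual_hierarchy(TEXTS, ["year", "topic"]))
```

The two calls return different trees over the same three records, which is the sense in which the hierarchy is virtual.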
We have designed a corpus for use in the automatic extraction of financial
information from newspaper texts. The principal source of this corpus is the
Reuters News Agency texts in NewsML format, which is an extension to XML
for enriching news stories, conceived by Reuters and developed and ratified by the
International Press Telecommunications Council (IPTC). Atkins, Clear and
Ostler (1992) discussed criteria for corpus design. Based on these, and on the
attributes available in NewsML, news can be organised using the six major
(pragmatic) attributes shown in Table 7 below.
Table 7: Pragmatic attributes used for organising NewsML texts
Publisher: Name; Place of Publication; Source of Publication; Date of Publication; Date of Origination
Availability: Copyright Status; Copyright Duration; Copyright Owner; Usage Restriction
Text: Title/Headline; Dateline; Text Type (book, newspaper, journal, etc.); Text Mode (written, spoken?); Text Entry (electronic, transcribed?)
Language: Language Name; Regional Variant
Author: Byline; Reporter Nationality; First Language; Editor
Category: Industry Code; Topic Code
From the above set of pragmatic attributes, a news database management system,
the Virtual Corpus Manager (VCM), was developed at Surrey. VCM can be used to
organise texts; to share and retrieve texts; to navigate through the (content of
the) corpus; and to impose integrity and security checks on texts. We have used
six different types of constraints. Each constraint allows the user to choose one
set of inter-related attributes at a time. For example, users can choose as many
major attributes as they need and, for each attribute, can choose the
sub-attributes. One example of the use of VCM is selecting UK-specific financial
texts from the Reuters daily news stream for a given year.
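Selection by attribute constraints of this kind can be sketched as a filter over metadata records. The code below is not VCM's interface, which is not described in the text; the function `select_texts`, the field names and the sample records are all hypothetical and only illustrate the kind of query in the example.

```python
from datetime import date


def select_texts(records, **constraints):
    """Return the records whose attributes satisfy every constraint.
    A constraint may be a single value, a collection of allowed values,
    or a predicate applied to the attribute value."""
    def satisfied(value, wanted):
        if callable(wanted):
            return wanted(value)
        if isinstance(wanted, (set, frozenset, list, tuple)):
            return value in wanted
        return value == wanted

    return [r for r in records
            if all(satisfied(r.get(k), w) for k, w in constraints.items())]


# Hypothetical records carrying a few of the Table 7 attributes.
RECORDS = [
    {"id": "n1", "source": "Reuters", "region": "UK", "topic": "finance",
     "published": date(2002, 11, 4)},
    {"id": "n2", "source": "Reuters", "region": "US", "topic": "finance",
     "published": date(2002, 11, 4)},
    {"id": "n3", "source": "Reuters", "region": "UK", "topic": "finance",
     "published": date(2001, 3, 9)},
    {"id": "n4", "source": "Reuters", "region": "UK", "topic": "sport",
     "published": date(2002, 6, 1)},
]

if __name__ == "__main__":
    uk_finance_2002 = select_texts(
        RECORDS, source="Reuters", region="UK", topic="finance",
        published=lambda d: d.year == 2002,
    )
    print([r["id"] for r in uk_finance_2002])
```

Predicates such as the year test assume that every record carries the attribute; a production system would need to handle missing values.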
5.
Visualising the mood of the market
Time series analysis is one of the established and persuasive branches of
statistics. It is used extensively for analysing ‘a sequence of data
indexed by time, often comprising uniformly spaced observations’ in science,
engineering, economics, commerce, biology and almost every other subject.
Financial news reports are usually illustrated with a time series of the instrument
that is being reported: the share price of a company at the opening or closing of a
day’s trading, plotted over a financial (calendar) year, shows the traders’
perception of the financial health (another metaphor) of the company. There are
also time series that include both the opening/closing prices and the day’s
highs/lows: the Japanese candlestick patterns, as they are called in the trade.
There are time series of aggregate indices of major organisations whose shares
are traded in the market: the FTSE-100 share index is a capitalisation-weighted
index of the share prices of 100 leading organisations at the close of daily
trading, and there is the FTSE All-Share index, a corresponding index of shares
traded on the London Stock Exchange at the close of trading. We similarly have
the DAX-30 for Germany and the CAC-40 for France, and there is the Dow-Jones
Industrial Average (DJIA) in
the USA. Financial traders, having sought opinion from statisticians, tend not to
deal with the ‘raw’ data value, but use other statistical measures related to the
value of the instrument(s): typically used indices are those of volatility – a
measure based on the standard deviation of closing price from its average value in
27
the past few days/hours/minutes; the moving average; and return value, which is
the (logarithmic) difference between the value of the instrument at time t-1 and at
time t.
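The derived measures named above can be sketched in a few lines. This is a minimal illustration and not the SATISFI implementation: the function names, the window length and the closing values are invented for the example, and the return is taken in the conventional direction, ln(v_t) - ln(v_t-1).

```python
import math
from statistics import pstdev


def log_returns(values):
    """Logarithmic difference between consecutive values: ln(v_t) - ln(v_{t-1})."""
    return [math.log(curr / prev) for prev, curr in zip(values, values[1:])]


def moving_average(values, window):
    """Mean of each trailing window of `window` observations."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]


def rolling_volatility(values, window):
    """Standard deviation of the values around their mean in each trailing window."""
    return [pstdev(values[i - window + 1:i + 1])
            for i in range(window - 1, len(values))]


# Invented closing values, used only to show the call pattern.
closes = [5200.0, 5180.0, 5235.0, 5210.0, 5150.0]
returns = log_returns(closes)
averages = moving_average(closes, window=3)
volatility = rolling_volatility(closes, window=3)
```

Each derived series is shorter than the raw series (by one value for returns, by `window - 1` for the windowed measures), which matters when series are later aligned by date.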
The use of other statistical measures of the quantitative changes in the value of
instrument(s) is important for us as we attempt to incorporate
sentiment indicators – or rather changes in sentiment – in an overall financial
analysis framework. One such attempt has involved helping the traders to
correlate the quantitative signal, either in its raw form or the derived forms
(return and volatility measures), with the movement of sentiment indicating
phrases. The traders typically use two sophisticated computer systems almost
simultaneously during a trading session: one screen dedicated to the value of
financial instruments, sometimes resolved at 50 values per minute, and the other
screen dedicated to news streams supplied by Reuters, Bloomberg and others. A
typical trader looks from one to the other and then makes his or her decision.
This rather simplistic view of financial trading has led to the development of
SATISFI – which can simultaneously display, or help to visualise the news, the
value of an instrument, and the changes in the frequency of sentiment indicators.
Table 8 and Table 9 below show the two most frequent sentiment words used in
generating the positive and negative sentiment time series respectively.
Table 8: Dominant Sentiment words rose and up
Relative Frequency (10^-5)

         Jan      Feb      Mar      Apr      May      Jun
Rose    87.80    79.81    57.94    64.51    85.39    58.41
Up     117.48   135.81   109.40    88.97    96.87    69.78
Total  205.28   215.62   167.34   153.48   182.26   128.19

         Jul      Aug      Sep      Oct      Nov      Dec
Rose    63.27    86.09    58.28    62.26    63.67    67.25
Up      99.89   134.09    96.92    94.71    99.98    80.70
Total  163.16   220.18   155.2    156.97   163.65   147.95
Table 9: Dominant Sentiment words fell and down

         Jan      Feb      Mar      Apr      May      Jun
Fell    62.45    88.53    69.28    27.06    68.89    69.78
Down    84.09   100.94    68.88    74.89    61.71    92.51
Total  146.54   189.47   138.16   101.95   130.6    162.29

         Jul      Aug      Sep      Oct      Nov      Dec
Fell    63.02    75.38    51.53    48.81    51.73    68.40
Down    91.92    87.28   101.22    81.78    67.65    85.31
Total  154.94   162.66   152.75   130.59   119.38   153.71
Figure 2: SATISFI prototype shown with one-year FTSE index based on
monthly data with upward and downward movement indicator series
SATISFI has four major components that have been fully integrated as shown in
figure 2.
i) Time Series Display: SATISFI can display three time series at a
time. These time series comprise FTSE-100 close index values,
upward movement indicators and downward movement indicators.
As discussed above, upward and downward movement indicators
are the quantification of the market sentiment expressed in financial
news. Over 70 terms each have been identified for conveying ‘good’
and ‘bad’ news. For example upward movement indicators would
contain terms like ‘up, rise, growth’ etc. while downward movement
indicators would contain terms like ‘down, fall’ etc. The movement
indicator time series are synthesized by counting these movement
indicator terms within the financial news published for a particular
day. Each time series is normalised for proper display purposes.
SATISFI is capable of displaying the above time series in three
forms:
(1) Raw form denotes the original time series.
(2) Return form refers to the logarithmic difference between two
consecutive values.
(3) Volatility (historical volatility) is the relative rate at which the
time series moves up or down.
ii) Time Series Correlation: Correlation is a measure of the degree of
linear relationship between two time series. SATISFI provides the
user the facility of cross-correlating two series in any form (raw,
return, volatility). Any series can be shifted forward or backward
and the cross correlation recalculated to determine whether the
market is followed by the news or vice versa.
iii) Document Display: This is comprised of two parts:
(1) Document Titles: Clicking a dot (date) on any of the time
series displays the corresponding date’s news titles.
(2) Document Content: The content of any document title can be
viewed by clicking that news title.
iv) Document Analysis: Whenever a document title is selected from
the news list, the extracted sentiment keywords along with the
frequencies are displayed in “Document Keywords” area. Positive
sentiment keyword analysis details appear under the title “Upward
Movement Indicators” and negative sentiment keyword analysis
details appear under the title “Downward Movement Indicators”.
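Two of the steps described in components (i) and (ii) lend themselves to a short sketch: synthesising a movement indicator count from news text, and recalculating the correlation after shifting one series against the other. The code below is an illustration under stated assumptions, not the SATISFI code: the term sets are a small invented subset of the 70 or so terms mentioned above, tokenisation is a bare regular expression, and the correlation is the ordinary Pearson coefficient.

```python
import math
import re

# Small illustrative subsets; the actual term lists are not given in the text.
UP_TERMS = {"up", "rise", "rose", "growth"}
DOWN_TERMS = {"down", "fall", "fell"}


def count_indicators(text, terms):
    """Count how many tokens of one day's news text belong to the given term set."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(1 for token in tokens if token in terms)


def pearson(xs, ys):
    """Pearson correlation of two equally long series (undefined if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def lagged_correlation(news, market, lag):
    """Correlate news[t] with market[t + lag]; a positive lag means the news leads."""
    if lag > 0:
        news, market = news[:-lag], market[lag:]
    elif lag < 0:
        news, market = news[-lag:], market[:lag]
    return pearson(news, market)
```

Scanning `lag` over a small range of positive and negative values and comparing the coefficients gives a first indication of whether the news series tends to lead or to follow the market series.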
6. Finding ‘meaningful’ patterns
6.1 A case study: Movements in 2002
The year 2002 has seen its ups and downs like many other years and the
movements in financial markets worldwide have been in the downward direction.
Or, at least, the geometric mean of the value of the shares of the major
corporations in North America, Japan, and the European Union, with some
exceptions, has reduced substantially. The UK FTSE-100 shows mainly
downward movements interspersed with small periods of upward movements,
which unfortunately, could not compensate for the previous reduction in the value
of the index. In time-series analysis literature one finds techniques that tend to
separate the so-called trends from cyclical movements in the series: the cyclical
movements may be due to factors like holidays, trading patterns that may be
seasonal and so on (Tino et al. 2001). The trends, it is claimed, show a change
that is caused because the basic structure of the market has changed. Techniques
like fractal analysis and the related chaotic systems methods help in disentangling
the trends from the cycles. Another related and robust technique is the wavelet
analysis (Rioul and Vetterli 1991): a wave comprises oscillations of a number of
different frequencies and trends and wavelet analysis suggests ways in which
these could be disentangled. We have used wavelet analysis on the FTSE-100
data for 2002 together with the time series of upward and downward indicating
phrases, in Reuters Financial News for the same year. Figure 3a shows the raw
figures for the daily trading data for the FTSE-100. Figure 3b shows the long-term
trend in the time series (downwards) while figure 3c shows the short-term
trend and hence some cyclical behaviour. Note the turning points in the cyclical
data (marked by arrows in figure 3c). There is, as noted earlier, a considerable
interest in identifying these turning points. The system SATISFI is being
extended to generate a textual description of the turning points.
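The idea of separating a slowly varying trend from shorter-term movement can be illustrated with the simplest wavelet, the Haar wavelet. The sketch below is not the analysis behind figure 3, whose wavelet family and settings are not stated here; it uses unnormalised pairwise means and half-differences, and assumes a series whose length is divisible by 2 raised to the number of levels.

```python
def haar_step(series):
    """One Haar level: pairwise means (smooth part) and pairwise half-differences (detail)."""
    pairs = list(zip(series[0::2], series[1::2]))
    approx = [(a + b) / 2 for a, b in pairs]
    detail = [(a - b) / 2 for a, b in pairs]
    return approx, detail


def haar_decompose(series, levels):
    """Apply haar_step repeatedly; return the coarse trend and the details per level."""
    details = []
    approx = list(series)
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details


def haar_reconstruct_step(approx, detail):
    """Invert one level: each pair is (mean + half-difference, mean - half-difference)."""
    series = []
    for mean, half_diff in zip(approx, detail):
        series.extend([mean + half_diff, mean - half_diff])
    return series
```

The coarse approximation plays the role of the long-term trend, while the detail coefficients at each level carry the movement at progressively longer time scales; because each step is invertible, nothing is lost in the separation.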
Figure 33a shows a time series of the frequency of a number of upward
movement indicators, including rise, growth and other less frequent phrases
indicating upward movement. There is a corresponding long-term decay in the
time-series of upward movement indicators (figure 33b) as found in the raw
FTSE-100 data (figure 3b). The cyclical movements are much more pronounced
(figure 33c) but a comparison by eye suggests that there is a similar pattern in
figure 33c as in figure 3c. The downward movement indicators show that for the
first six-months or so of 2002 the frequency of downwards indicators increases
rapidly but shows a decay in the latter half of 2002 (figure 33b).
The sentiment indicating time-series has to be refined and much more work is
required before we may use it to predict the actual mood of the market. However,
our approach is perhaps amongst the first explorations that investigate
how the quantitative movements in a financial market are influenced by the news
stories, some influencing the market and others showing the influence of the
market.
Afterword
We have attempted to build a system from various linguistic, visual and
mathematical components that allows us to explore the behaviour of the traders,
and to attempt to model this behaviour. The purpose of the system is to assist the
trader by reducing the amount of textual and numeric data that the trader needs to
assimilate to form a view of the market. In terms of corpus linguistics, the
analysis of qualitative data available in collections of news texts organised by
descriptive metadata (through XML and Reuters codes), combined with text
processing techniques to determine key patterns, proper-noun analysis to
determine key entities and the use of terminology collections for reducing the
understanding overhead necessary for the text, and for automatic classification of
the text, shows early benefits. This work is directly relevant to the potential for
Information Extraction techniques to be adopted across domains, an issue that is
being investigated also in a related EPSRC-sponsored project Scene of Crime
Information System (SOCIS). The integration of techniques in corpus linguistics
with other forms of analysis including mathematical analysis and image analysis
provides a supporting environment for future projects that do not focus on a
single medium of communication. Experts do not generally rely on the text
medium alone; this work provides evidence of the kinds of information fusion
that financial experts carry out many, many times each day.
Figure 3:
Acknowledgements
The work described in this paper has been supported by the EU IST Programme’s
Generic Information Decision Assistant (GIDA IST 2000-31123). Based on
papers presented at two workshops: LREC Event Modelling Workshop (Spain
2002) and Financial News Analysis Workshop, 11th International Terminology
and Knowledge Engineering Congress (France 2002).
References
Ahmad, K. (2002), Events and the causes of events: the use of metaphor in
financial texts, in Proceedings of the workshop at the International
Conference on Terminology and Knowledge Engineering. Nancy, France.
Atkins, S., J. Clear and N. Ostler (1992), Corpus Design Criteria, Literary and
Linguistic Computing 7(1): 1-16.
Barnbrook, G. and J. Sinclair (1995), Parsing CoBuild Entries, in J. Sinclair, M.
Hoelter and C. Peters (eds.), The Languages of Definition: The
Formalization of Dictionary Definitions for Natural Language Processing.
Luxembourg: Office for Official Publications of the European
Communities. pp. 13-58.
Gross, M. (1993), Local Grammars and their Representation by Finite Automata,
in M. Hoey, (ed.), Data, Description, Discourse: Papers on the English
Language in Honour of John McH Sinclair. HarperCollins Publishers. pp.
26-38.
Harris, Z. (1991), A Theory of Language and Information: A Mathematical
Approach. Clarendon Press, Oxford.
Holmes-Higgin, P., S. Abidi and K. Ahmad (1994), ‘Virtual’ Text Corpora and
their Management. In Proc. of Sixth EURALEX International Congress on
Lexicography, Amsterdam.
Jackendoff, R. (1991), Semantic Structures. Cambridge (USA) & London: The
MIT Press.
Knowles, F. (1996), ‘Lexicographical Aspects of Health Metaphors in Financial
Texts’, in: M. Gellerstam et al (eds.), Euralex96 Proceedings (Part II).
Göteborg, Sweden: Göteborg University. pp. 789-796.
Maybury, M. (1995), Generating Summaries from Event Data. Information
Processing and Management 31(5): 735-751.
Quirk, R., S. Greenbaum, G. Leech and J. Svartvik (1985), A comprehensive
Grammar of the English Language. Longman.
Rioul, O. and M. Vetterli (1991), Wavelets and Signal Processing, IEEE Signal
Processing Magazine, pp. 14-38.
Robert A. (eds) (1980), Roget’s Thesaurus. Great Britain, Longman.
Simmons, R. F. (1984), Computations from the English: A Procedural Logic
Approach for Representing and Understanding English Texts. Englewood
Cliffs, NJ: Prentice Hall.
Tino, P., C. Schittenkopf and G. Dorffner (2001), Volatility Trading via
Temporal Pattern Recognition in Quantized Financial Time Series, Pattern
Analysis and Applications.
Reuters NewsML Showcase: http://about.reuters.com/newsml/
Tagger: http://www.comp.lancs.ac.uk/computing/research/ucrel/claws/trial.html
System Quirk: http://www.computing.surrey.ac.uk/ai/SystemQ
WordNet: http://www.cogsci.princeton.edu/~wn/
Towards a methodology for corpus-based studies of linguistic
change: Contrastive observations and their possible diachronic
interpretations in the Korpus 2000 and Korpus 90 General
Corpora of Danish
Jørg Asmussen
1
Society for Danish Language and Literature
Abstract
The Korpus 90 and Korpus 2000 Corpora of Danish were both designed and compiled at
the Society for Danish Language and Literature (DSL). The joint web-based query
interface of the two corpora enables immediate comparative studies.
This paper focuses on examples of contrastive observations and their possible diachronic
interpretations. It discusses whether observable differences of word frequency, inflection,
collocation, connotation, and word order reflect real changes in the Danish language, or
whether they reflect differently compiled corpora.
Finally, the paper proposes prerequisites for a methodology of comparative corpus
investigation and the determination of diachronic corpus similarity. In this context, the
concept of invariant textual features is introduced.
1. Background: The Korpus 2000 and Korpus 90 General Corpora of Danish
The Korpus 2000 and Korpus 90 General Corpora of Danish were both designed
and compiled at the Society for Danish Language and Literature (Det Danske
Sprog- og Litteraturselskab, DSL). DSL, which is a type of academy, was
founded in 1911 with the aim of publishing scholarly editions of Danish works of
linguistic or literary importance, including dictionaries. During the 1990s, the design and compilation of corpora and other electronic linguistic resources of the
Danish language has become an important activity for DSL. Today, DSL is
regarded as the main public institution 2 in Denmark within the fields of
development and compilation of dictionaries and corpora 3 of the Danish
language. DSL’s long-term goal is to combine dictionaries and corpora to yield a
unique language information system – a bank of Danish language.
Korpus 2000 (K2000) holds 28 million text words and is composed of text
material from the years 1998-2002, covering a wide variety of genres. The
objective of the Korpus 2000 project was to compile a general corpus in order to
document written Danish language around the turn of the millennium. Korpus
2000 has been made publicly accessible on the Internet, one of the main purposes
of the project being to increase laymen’s awareness of the advantages of a corpus
as an extension to dictionaries.
Korpus 90 (K90) is a 28-million-word subset of the 40-million-word Corpus of
The Danish Dictionary (Den Danske Ordbog). Spoken material, as well as texts
with certain copyright restrictions, has been excluded from this subset. The
Corpus of The Danish Dictionary, holding text material from the decade 1983-1992,
was compiled by DSL in the early 1990s. 4 It served as a major source in
the compilation of The Danish Dictionary, an entirely new, written-from-scratch
dictionary of contemporary Danish to be published in six volumes by DSL during
2003-2004. As part of the Korpus 2000 project, K90 was made publicly
accessible on DSL’s website 5 , where it serves as an approximately 10-year-older
counterpart to K2000.
Both K2000 and K90 are morphosyntactically tagged with the Constraint
Grammar-based DanPars-tagger 6 and are accessible through a joint query
interface 7 that enables immediate comparisons of certain linguistic aspects. The
following section will provide some examples of contrastive observations made
through this interface, as well as possible diachronic interpretations – and misinterpretations – of them.
2. Problem: Contrastive observations and their diachronic interpretation
Query results such as frequencies and collocates are presented in the K2000
interface in contrastive tables showing results for K2000 and K90. Hence the user
can compare frequency figures and collocates immediately, and hopefully get an
idea of what kinds of changes in vocabulary, inflectional forms, or collocates
(semantics) have taken place in the years between the compilation of K90 and
K2000. However, the contrastive presentation of query results has its
shortcomings as well, since it may seduce some users into making faulty
generalisations about language change. Below, we will present some examples of
these pitfalls and sketch some possible ways to avoid them, some of which could
be implemented in the user interface as a diachronic interpretation facility.
2.1 Vocabulary
A comparison of the frequencies of all words in K2000 and K90 reveals, not
surprisingly, that some words are significantly more frequent in one corpus than
the other. Assuming that each of the two corpora reflects Danish language usage
during a given period of time (1983-1992 and 1998-2002), we might interpret
these frequency differences as changes in the incidence of words in the Danish
language as a whole.
Figure 1 (below) shows how the frequencies for the noun regn (rain) are
presented in the user interface: the first column lists the possible inflectional (and
orthographic) forms of the lemma, the second column indicates the frequency of
each form in K2000, and the third column the frequency in K90. The form listed
in the bottom row is the lemma itself with all its inflectional forms; i.e., it is the
total of all of the forms listed above the bottom row. 8 Frequency is not shown in
absolute decimal figures, but as a logarithmic score 9 represented by a number (0-7)
of bullets. Experimentally, it has been established that this logarithmic score
apparently follows general intuitions on word frequency quite well, such that
words with 1-2 bullets are less frequent, e.g. entomologi (entomology), words
with 6-7 bullets are very frequent, e.g. i (in) and og (and), and middle-range
words with 3-5 bullets are of average frequency, e.g. regn. As the table shows,
there do not appear to be any striking differences in the use of these forms
between K2000 and K90, with one exception - the indefinite genitive form regns
(rain’s), which does not occur at all in K2000, but scores one bullet in K90. The
raised thumb indicates that this form is “interestingly” more frequent in K90 than
in K2000 - which in this case means that it occurs at least twice as often as the
corresponding form in the other corpus. 10 This does not necessarily mean that the
difference is statistically significant as well. Especially with low frequency
words, the significance may be skewed. However, the phenomenon may have
some linguistic relevance, and our intention, therefore, was to highlight this in the
frequency table - a practice which may have confused users, as it turns out.
Clicking on one of the magnifying glasses yields the KWIC concordance for the
corresponding form, together with the number of occurrences. For regns, we get
three occurrences in the 28-million-word K90.
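The gap between "interestingly more frequent" and statistically significant can be made concrete. The sketch below implements the factor-of-two rule as described above and, as an assumption of this illustration, contrasts it with the two-corpus log-likelihood statistic that is widely used in corpus comparison; the K2000 interface is not stated to compute any such test, and the function names are invented.

```python
import math


def is_interesting(freq_here, freq_other):
    """Raised-thumb rule from the text: at least twice as frequent as in the other corpus."""
    return freq_here > 0 and freq_here >= 2 * freq_other


def log_likelihood(freq_a, freq_b, size_a, size_b):
    """Two-corpus log-likelihood (G2) for one word, given the corpus sizes in tokens."""
    total = freq_a + freq_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    g2 = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


CORPUS_SIZE = 28_000_000  # both corpora hold about 28 million words

regns_flag = is_interesting(3, 0)                           # K90 vs K2000
regns_g2 = log_likelihood(3, 0, CORPUS_SIZE, CORPUS_SIZE)
```

For regns the rule raises the thumb, while the statistic comes out at roughly 4.16: above the 3.84 critical value for p < 0.05 with one degree of freedom, but below 6.63 for p < 0.01, and computed from expected counts of only 1.5 per corpus, where such approximations are known to be unreliable.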
Figure 1: Frequency table for the noun regn (‘rain’)
The total frequency of all inflectional variants of the lemma mobiltelefon (mobile
phone) is about 25 times higher in K2000 (1,486 occurrences) than in K90 (59
occurrences). Assuming that the vocabulary of a language reflects general
changes in society, and keeping in mind technological changes from the eighties
to the late nineties, it seems evident that this frequency observation could be
interpreted as a general change in the vocabulary of the Danish language. Similar
examples are biltelefon (car phone) and benchmarking; biltelefon, which is five
times more frequent in K90 (51 occurrences) than in K2000 (9 occurrences),
denotes a technical device that by and large has been replaced by mobile phones.
Benchmarking does not occur at all in K90, whereas it occurs 34 times in K2000,
which might indicate that this word is new to Danish vocabulary. Jarvad (1999),
whose dictionary of new words in Danish is based on good old-fashioned
“manual” language excerpts, dates the first occurrence of this word in Danish to
1996 - an observation which might support the assumption that this word is new.
All these examples appear to indicate that the two corpora reflect changes in
language use as a result of general changes in society.
A less frequent word such as kambrium (Cambrian) occurs four times in K90,
but not at all in K2000, and is therefore marked with a raised thumb in the K90
column. A somewhat naive, but not entirely impossible, interpretation of this
could be that the incidence of this word in the Danish language is decreasing.
However, a closer look at the word reveals that it only occurs in a single text on
geology, a scientific domain which appears to have been covered unevenly in the
two corpora. So, in certain cases, raw frequency comparisons turn out to be
unreliable: they indicate different corpora compositions rather than diachronic
changes in the vocabulary of the language in question.
In the case of kambrium, the introduction of some measure of word dispersion in
the corpus would probably militate against misinterpretation; thus dispersion
could be used to correct the use of raw frequency to measure incidence. However,
the disadvantage of this method is that it generally diminishes the weight of
linguistically interesting, infrequent words, such as new words. Yet the implementation of this method in the K90/2000 user interface would prevent users
who are unfamiliar with quantitative linguistic methods from misinterpreting
words such as kambrium.
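One possible form of such a correction is sketched below using Juilland's D, a classical dispersion coefficient; the choice of measure is an assumption of this illustration, since no particular measure is named above. The corpus is taken to be divided into parts of equal size, and the per-part counts are invented apart from the total of four occurrences reported for kambrium.

```python
import math


def juilland_d(part_freqs):
    """Juilland's D: 1 for perfectly even spread, 0 when all hits fall in one part.

    Assumes the corpus parts are of equal size.
    """
    n = len(part_freqs)
    mean = sum(part_freqs) / n
    if n < 2 or mean == 0:
        return 0.0
    sd = math.sqrt(sum((f - mean) ** 2 for f in part_freqs) / n)
    variation = sd / mean
    return 1 - variation / math.sqrt(n - 1)


def adjusted_frequency(part_freqs):
    """Raw frequency scaled down by dispersion (Juilland's usage coefficient U = D * f)."""
    return juilland_d(part_freqs) * sum(part_freqs)


# Four occurrences confined to a single part, as with kambrium in K90;
# the division into five parts is hypothetical.
kambrium_parts = [4, 0, 0, 0, 0]
evenly_spread = [2, 2, 2, 2, 2]
```

With all four occurrences in one part, D is 0 and the adjusted frequency vanishes, whereas an evenly spread word keeps its full raw frequency. The same property explains the drawback noted above: a genuinely new word that so far appears in a single text would be suppressed just as thoroughly.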
2.2 Inflection
A comparison of the inflectional forms of particular words will at first glance
show some striking differences between K2000 and K90. Looking at our previous
example, regn, we already have noted some differences between K2000 and K90
for the indefinite genitive form regns (rain’s), which occurs three times in K90,
but not at all in K2000. 11 For the definite genitive form regnens (the rain’s), the
table does not indicate any quantitative differences, but if one looks at the
corresponding concordances, one will observe that this genitive form occurs
twelve times in K90 and nine times in K2000. The same picture emerges again:
fewer genitives in K2000. Elbro (2002) reports on observations from K90 and
K2000 that genitives of certain frequently used nouns appear to be less prevalent
in K2000 than in K90. This, he suggests, might indicate a general tendency in
Danish towards replacing genitive constructions by prepositional phrases,
probably as a result of the influence of English. Thus, for example, bilens ejer
(the car’s owner), which is considered the canonical norm in Danish, is often
replaced by ejeren af bilen (the owner of the car). This assumption is further
supported by Elbro’s (2002) observation of increased frequencies of certain
prepositions in K2000.
Apparently, the absolute number of genitive forms of some nouns in K2000 is
significantly lower than in K90. For instance, the noun bil (car) has a total of 393
genitive forms in K2000, as opposed to 586 in K90. Similar results are observed
for other nouns denoting common objects, such as cykel (bicycle) and hus
(house). A closer look at bil reveals that the lemma, including all its inflectional
variants, occurs 10,360 times in K90, as opposed to 8,354 times in K2000, an
observation which hardly - even among laymen - would yield an interpretation
such as “the word bil is replaced by other words”, or even “the denoted object is
about to disappear from our lives” - interpretations that appeared obvious for a
word such as biltelefon, as discussed above. Furthermore, an examination of the
relative numbers of genitive forms of the lemma bil in K90 and K2000 shows
5.7% and 4.7% respectively - a difference too weak to support conclusions about
radical linguistic change. Other examples that may disprove the hypothesis of
decreasing genitive forms include land (land, country) and the proper noun
Danmark (Denmark). The lemma land occurs 28,222 times in K2000, the genitive proportion being 22.3%, against 21,478 times in K90, with a genitive
proportion of 16.7%, thus showing the opposite tendency of what was reported
above. Danmark occurs 30,730 times in K2000, against 22,243 times in K90,
with 15.1% genitive forms in K2000 and 15.7% in K90. Once again, no alarming
differences in the use of genitives can be observed. However, what might be
alarming is the fact that some words, such as land and Danmark, which we
intuitively would expect to have a constant prevalence in the language over ten
years, show these astonishing frequency differences, despite their identical
logarithmic score. This could be another indication of two differently-composed
corpora, K2000 probably containing a larger amount of newspaper text than K90.
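The difference between absolute and relative counts for bil can be recomputed directly from the figures quoted above. The sketch is plain arithmetic; only the function and variable names are introduced for the illustration.

```python
def genitive_share(genitive_count, lemma_count):
    """Genitive forms as a percentage of all occurrences of the lemma."""
    return 100 * genitive_count / lemma_count


def percent_change(old, new):
    """Relative change from the older to the newer corpus, in percent."""
    return 100 * (new - old) / old


# (genitive forms, all forms of the lemma bil) as reported in the text
bil = {"K90": (586, 10360), "K2000": (393, 8354)}

shares = {corpus: round(genitive_share(gen, total), 1)
          for corpus, (gen, total) in bil.items()}
genitive_change = round(percent_change(586, 393), 1)
lemma_change = round(percent_change(10360, 8354), 1)
```

The absolute number of genitives falls by about a third, but the lemma itself is about a fifth less frequent in K2000, so the genitive share moves only from 5.7% to 4.7%.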
The examples given above suggest that one presumably cannot account for
general inflectional change merely by examining randomly-selected frequent
words, since the result of the comparison will be too arbitrary to enable us to
make any general conclusions. If one wishes to investigate this kind of linguistic
change, one should examine the phenomenon in question throughout the corpus.
Thus, in the case of the genitive forms, one should at least compare the total ratio
of genitive forms of all nouns in each corpus, and probably also the relative
number of nouns compared with other parts of speech in each corpus. 12
2.3 Collocation
The K2000 query system can display collocates as either frequently or typically
co-occurring collocates. The former are mainly made up of co-occurring function
words, whereas the latter are determined by means of mutual information,
$I(\text{word}_a; \text{word}_b) = P(\text{word}_a, \text{word}_b) / (P(\text{word}_a)\,P(\text{word}_b))$. 13 In order to prevent
infrequent words from popping up among the typical collocates, some intuitively
defined noise-reducing, and process time-reducing, conditions have been set up:
the corpus frequencies of both examined words $\text{word}_a$ and $\text{word}_b$ must be greater
than 30, the collocate candidate’s frequency must exceed a threshold
$t = (\log_{10} f_k)^2 / 10$, where $f_k$ is the frequency of the collocator, i.e. the word for which
we want to determine the collocates. Moreover, the collocation itself must occur
at least twice, and the resulting (non-logarithmic) mutual information score must
be greater than 20. The collocation window is set to two words to the left and two
to the right of the collocator. The resulting collocates are presented in a table
consisting of four lists, a left-hand and a right-hand list for each corpus. The lists
are sorted in descending order according to their mutual information score, which
is not explicitly stated in the list, but instead converted into a degree (1-5) of
mutual attraction $d = \mathrm{round}(\log_{10}(I) - 1)$, where $I$ is the non-logarithmic mutual
information score. The resulting degree of mutual attraction is expressed by a
corresponding number of bullets in front of each collocate, where one bullet may
be interpreted as a weak - but significant - mutual attraction between the given
words, and the maximum of five bullets as a very strong attraction. Figure 2
shows, as an example, a collocation table for the lemma terrorist.
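The conditions listed above can be collected into a single filter. The following is a sketch under several assumptions rather than the K2000 implementation: probabilities are estimated as relative frequencies over the corpus size, the threshold t is read as applying to the candidate's corpus frequency, the degree is clamped to the 1-5 range mentioned in the text, and the frequency used for the candidate word is invented.

```python
import math


def mutual_information(f_a, f_b, f_pair, corpus_size):
    """Non-logarithmic score I = P(a, b) / (P(a) * P(b)) from relative frequencies."""
    p_a, p_b, p_pair = f_a / corpus_size, f_b / corpus_size, f_pair / corpus_size
    return p_pair / (p_a * p_b)


def attraction_degree(f_collocator, f_candidate, f_pair, corpus_size):
    """Return the bullet degree (1-5), or None if a noise-reduction condition fails."""
    if f_collocator <= 30 or f_candidate <= 30:
        return None                       # both words must occur more than 30 times
    threshold = math.log10(f_collocator) ** 2 / 10
    if f_candidate <= threshold:
        return None                       # candidate frequency must exceed t
    if f_pair < 2:
        return None                       # the collocation must occur at least twice
    score = mutual_information(f_collocator, f_candidate, f_pair, corpus_size)
    if score <= 20:
        return None                       # mutual information must be greater than 20
    degree = round(math.log10(score) - 1)
    return max(1, min(5, degree))         # clamping to 1-5 is an assumption


CORPUS_SIZE = 28_000_000
F_JUL_K2000 = 1275     # frequency of jul in K2000, from the text
F_CANDIDATE = 5000     # hypothetical frequency for an adjective such as hvid

blocked = attraction_degree(F_JUL_K2000, F_CANDIDATE, 2, CORPUS_SIZE)
admitted = attraction_degree(F_JUL_K2000, F_CANDIDATE, 10, CORPUS_SIZE)
```

With these invented figures, a pair that co-occurs only twice scores about 8.8 and is rejected by the mutual information cut-off, while ten co-occurrences would score about 43.9 and earn one bullet. This illustrates how a well-established collocation can drop out of the table simply because one of its members is less frequent in a given corpus.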
Figure 2: Typical collocates for the noun terrorist
In K2000, we find to the left of the noun terrorist adjectival forms that denote
either a general quality or state (hensynsløse (ruthless), eftersøgte (wanted),
fængslede (imprisoned), islamiske (Islamic), internationale (international)) or a
(national) origin or location. On the right-hand side, we note a couple of proper
nouns. In K90 we also find eftersøgte (wanted) and palæstinensiske (Palestinian)
to the left, and furthermore vesttyske (West German); finally, we find dræbt
(killed) to the right.
The result may be interpreted as follows: one of the characteristics of a terrorist
that remains constant over time is that he is wanted or Palestinian, whereas West
German is no longer a significant attribute and has given way to many other
nationalities, a certain religious orientation, or even international. In K2000, terrorists
furthermore have names or belong to certain organisations, whereas in K90 they
either killed or were killed. The greater number of collocates in K2000 may
indicate that terrorist has become a more frequently used word, and, in fact, the
lemma occurs nearly twice as often in K2000 (477) as in K90 (253). Or perhaps
this is just another indication that the corpora are composed differently, K2000
probably containing more newspaper material than K90. Yet the results and their
interpretation seem reasonable to some extent in that they resemble general
tendencies in Danish society related to this topic. Our historical knowledge helps
us to understand these collocates and helps us to understand the constants and
changes in the collocational behaviour of this word.
A word which ought not to have changed its collocational behaviour over ten
years is jul (Christmas) because its use is most likely embedded in a strong
traditional context, and, accordingly, the majority of the listed collocates are to be
found in both corpora, such as glædelig (happy), fejre (celebrate), or, on the
right-hand side, nytår (New Year). However, what is conspicuous is that the
number of collocates is somewhat greater in K90, indicating that jul occurs more
often in this corpus (2,196) than in K2000 (1,275), which might be taken as
another indication of compositional differences between the corpora. Thus, for
example, hvid (white) does not occur as a collocate to jul in K2000, even if the
collocation hvid jul occurs twice in K2000 and, of course, still is a valid
collocation in Danish. Unfortunately it has been blocked by the noise-reduction
measures. This example shows that incidental differences in frequency of a
(normally) quite frequently used word, such as jul, may have a serious impact on
the determination of collocates. Thus, comparing collocates derived from words
with remarkably different frequencies in the two corpora does not necessarily
give a correct impression of changes in their general collocational behaviour.
Even if a word is used less frequently, it may still keep its well-established
collocates, but they may not appear in collocational statistics any longer.
What might appear, in some cases, are candidates that intuitively do not count as
collocates. Looking up collocates for the word juletræ (Christmas tree), one will,
as expected, find pynte (decorate) and danse [rundt=om] (dance [around]), but
in K2000 we also find talende (speaking) - and this as the most significant
collocate for this corpus, scoring four bullets on the degree of mutual attraction
scale. On closer examination, it appears that all examples of the talende juletræ
come from one single text, a strange story of a speaking Christmas tree. Thus,
also when determining collocations, some kind of dispersion-based correction of
the raw frequencies of the words involved would help diminish these odd
findings.
2.4 Semantics
Closely related to collocation is the phenomenon of semantic prosody, 14 the
ability of words to establish certain flavours of meaning contextually, e.g.
positive, negative, ironic, etc. Consider the word sideeffekt (side effect), an
English loanword, which is not registered by Jarvad (1999), but occurs 11 times
in K90 and 22 times in K2000. One might argue that, as a word, sideeffekt is
superfluous in Danish because another word, bivirkning, already exists with
exactly the same meaning. Approximately half of the examples in K90 are used in
a clearly negative context, indicating that sideeffekt is something unwanted and
harmful, and thus the semantics of this word appears to be very close to that of
bivirkning. The picture has changed in K2000, where the majority of examples
clearly show sideeffekt as something still unwanted, but quite good, and some
examples are explicitly preceded by the adjective positiv (positive), as shown in
Figure 3.
Figure 3: Indications of changed semantic prosody for sideeffekt in K2000.
The question is whether these examples empirically justify the conclusion that
sideeffekt in Danish has indeed changed its semantic prosody and found a
semantic niche. How many examples of a type of semantic change do we need in
order to exclude corpus compositional bias and to be able to make conclusions
about a language in general?
2.5 Syntax
We will give one example of syntax that shows some topological differences for
the negation ikke (not) which one can observe between K90 and K2000. In main
clauses, the negation is placed after the finite verb: Peter drikker ikke te (‘Peter
drinks not tea’), whereas the negation is placed before the finite verb in subordinate
clauses: Anne serverer kaffe fordi Peter ikke drikker te (‘Anne serves coffee
because Peter not drinks tea’). Main clause word order in subordinate clauses is
generally considered substandard Danish. However, evidence of this usage can be
found in both corpora. Intuitively, we would expect this non-canonical word
order to be more frequent in current Danish, and we therefore would expect to
find more examples of this in K2000 than in K90. However, a comparison shows
more examples of the incorrectly placed ikke in K90, some of which are shown in
Figure 4. Does this prove our intuition wrong, or does it indicate that K2000
contains more professional - and thus more ‘correct’ - language? Or is it the case
that the corpus evidence is not convincing enough because of methodological
shortcomings?
Figure 4: Examples of non-canonical word order in K90.
3. Towards a methodology of corpus-based comparative studies
The examples above demonstrate that our contrastive corpus observations in
some cases may yield dubious interpretations and generalisations on linguistic
change. In order to improve the quality of our comparative corpus investigations,
we need to establish (i) a framework for the way we examine our data, i.e. a
declaration of the analytical methods applied, and (ii) a method of describing and
classifying our data, i.e. the corpus. To ensure that others can repeat our
investigations on other data with analogous results, we would probably even want
to standardise our procedures or create a general methodology for corpus-based
studies of linguistic change. To be able to handle different scenarios of
investigation, the analytical methods should at least cover the fields of
vocabulary, morphology (inflection), collocation, semantic prosody, and syntax.
Based on the examples given in section 2, we can list the following preliminary
prerequisites for these analytical methods.
3.1 Prerequisites for a methodology of comparative corpus investigation
The underlying word concept is important in comparisons based on vocabulary.
In order to make these comparisons, we propose a word definition based on
lexicalised units which include fixed multiword units, i.e. a word is defined as a
lemma. Hence, what is comparable across corpora, when investigating
vocabulary, are lemmas, not types. Furthermore, for vocabulary comparisons,
measures other than raw lemma frequencies should be applied. A score based on
different quantitative characteristics of a lemma in the corpus appears to be better
suited to describing the prevalence of that lemma. These characteristics should at
least include raw frequency and dispersion. To date, the field of vocabulary
comparison appears to have attracted the most attention in corpus linguistic research, and a considerable number of statistical methods have been proposed.15
With regard to morphology, the comparison of the frequency of
inflectional variants of a lemma must be handled differently from vocabulary
comparison. As has been shown above, comparisons based on type can yield
misleading results. First of all, one has to determine whether one wants to
investigate a certain inflectional category within a lemma - for example changes
in the quantity of plural forms of a noun - or if one wants to examine probable
quantitative changes of a grammatical category as a whole throughout the corpus
- for example the relative quantity of genitive forms of all nouns in the corpus. In
the first case, the ratio of an inflectional variant of a lemma contrasted with all
other inflectional variants should be compared with the ratio of the same
inflectional variant of the same lemma in another corpus. To ensure that potential
changes are lemma-specific, and not just part of a general diachronic shift, the
observations should be contrasted with an investigation of the examined
inflectional category as a whole throughout the corpus. In the second case, the
inflectional category in question should be investigated throughout the corpus. In
both cases, statistical methods of determining the significance of proportional
changes must be developed and applied.
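As an illustration of the first case, the following sketch contrasts the share of one inflectional variant within its lemma across two corpora. The log-likelihood statistic (G2) on a 2x2 table is used here only as one conventional option from corpus comparison; it is an assumption of this sketch, not a method prescribed above, and all counts are invented.

```python
import math


def variant_ratio(variant_count, lemma_count):
    """Share of one inflectional variant among all occurrences of its lemma."""
    return variant_count / lemma_count


def log_likelihood_2x2(a, b, c, d):
    """G2 statistic for the contingency table [[a, b], [c, d]].

    a: variant in corpus 1, b: other forms of the lemma in corpus 1,
    c: variant in corpus 2, d: other forms of the lemma in corpus 2.
    """
    total = a + b + c + d
    cells = (
        (a, a + b, a + c),
        (b, a + b, b + d),
        (c, c + d, a + c),
        (d, c + d, b + d),
    )
    g2 = 0.0
    for observed, row_total, col_total in cells:
        expected = row_total * col_total / total
        if observed > 0:  # a zero cell contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


# Hypothetical counts: plural forms of one noun lemma in two corpora.
ratio_old = variant_ratio(30, 100)
ratio_new = variant_ratio(10, 100)
g2 = log_likelihood_2x2(30, 70, 10, 90)
# With one degree of freedom, values above 3.84 correspond to p < 0.05.
```

To check that a shift is lemma-specific, the same function could be applied to the counts of the inflectional category as a whole in both corpora, and the two results contrasted.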
The comparison of changes in the frequency of collocations in two different
corpora appears to be more complex than plain lemma-based vocabulary
comparison. Statistical methods of the mutual information type are probably not
suitable in a comparison context; their strength is mainly found in detecting
statistically salient collocations. Once a collocation has been detected, one can
compare its number of occurrences across corpora. However, collocations should
not be compared as if they were words by merely applying one of the methods for
vocabulary comparison, since the frequencies of the collocations themselves also
depend on the words involved. Hence, if one observes significant differences in
the frequency of one of the words involved in a collocation when comparing two
corpora, these differences will also be reflected in all collocations that involve
this word. In these cases, it is not necessarily the frequency of the collocation, or
its prevalence in the language, that has changed. Comparisons of collocations
might therefore best be treated as composite vocabulary comparisons of the
words involved in a collocation, in conjunction with the frequency scores of the
collocations themselves.
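The dependency described above can be made visible with a small sketch that reports a collocation alongside the words involved. The function names and all counts are hypothetical; the point is only that a collocation may double in frequency while its share of the node word's occurrences stays constant.

```python
def per_million(count, corpus_size):
    """Relative frequency per million running words."""
    return count * 1_000_000 / corpus_size


def collocation_profile(pair_count, node_count, collocate_count, corpus_size):
    """Collect the figures needed for a composite comparison of a collocation."""
    return {
        "pair_pm": per_million(pair_count, corpus_size),
        "node_pm": per_million(node_count, corpus_size),
        "collocate_pm": per_million(collocate_count, corpus_size),
        # Share of the node word's occurrences that take part in the collocation.
        "pair_share_of_node": pair_count / node_count,
    }


# Invented figures for two corpora of equal size.
older = collocation_profile(50, 500, 2000, 10_000_000)
newer = collocation_profile(100, 1000, 2000, 10_000_000)
```

Here the collocation is twice as frequent in the newer corpus, but so is the node word; the unchanged share suggests that it is the word, not the collocation, whose prevalence has changed.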
Within the field of semantic prosody, some kind of semantic tagging, or at least
grouping of contextual words, is indispensable. Changes will often affect new
loanwords, and thus relatively infrequent words that are semantically vague at
first, but find a semantic niche after some years. The number of words that show
these characteristics is normally quite limited, and hence the comparative
statistical methods must be adjusted in these special circumstances.
The number of observable instances will decrease as the complexity of a certain
linguistic phenomenon increases, i.e. when it is composed of several words or
linguistic constituents. This applies to most observable linguistic phenomena
above the word level, e.g. collocations, semantic prosody, and - perhaps most
saliently - syntax, and it will have implications on the statistics used in these
cases.
For other fields of linguistic comparisons between corpora, similar prerequisites
have to be established. Common to all of these prerequisites should be the
principles of (i) contrasting a phenomenon with its potential alternatives within
one corpus, and comparing this contrast ratio with that of another corpus, and (ii)
determining appropriate comparison statistics for linguistic phenomena of
varying complexity.
3.2 Prerequisites for a methodology of measuring corpus similarities over time
Apart from the development of analytical methods to be applied, one needs a
method to describe and classify the data, i.e. the corpus. Although considerable
efforts have been made to describe what is in a corpus and why, and although
corpora usually are designed in accordance with some idea of representativeness,
the problem remains that corpus descriptions are generally founded on relatively
vaguely defined, intuitively applied textual categories. Classifying texts by
intuition is necessarily imprecise because intuitions may differ from compiler to
compiler, and a single compiler may even apply different criteria of classification
according to extra-textual circumstances. However, this problem does not seem to
be critical, as long as the majority of corpus-based linguistic research refers to the
same well-established corpus, e.g. the BNC, or, in the case of Danish, K2000 or
K90.
Statistical methods for determining corpus differences (Garside and Rayson
2000) and corpus similarities have been proposed, mainly based on vocabulary
(Kilgarriff 2001). But what happens if one wants to compile a corpus designed
exactly like an existing one, e.g. a new BNC, holding a similar mixture of, e.g.,
texts that are ten years more recent? In this case, the definition of a concept of
similarity might become quite intricate because differences are explicitly desired
as long as they are the result of diachronic language change, but they must
definitely not be the result of differently composed corpora. How can one
compile an updated version of a well-established corpus that is absolutely
comparable to the existing one in its composition, so that the linguistic
differences detected by comparing the corpora truly and only can be explained
diachronically - and not as a result of compositional bias?
A solution to this problem of corpus composition is a measure, a quantitatively
established figure, or - more likely - several figures, characterising the overall
textual composition of a corpus. Corpora with the same characteristics, then,
would probably hold the same mixture of text types and would hence be
comparable. So far, this approach is not different from the (implicitly) synchronic
similarity measuring approaches proposed by Kilgarriff (2001). The difference is
that the required measure should be based on positively invariant features of
language, i.e. features that do not change (significantly) over time.
Within a synchronic context, the linguistic features to be compared may be chosen
almost arbitrarily. They may be, and often are, based on vocabulary, but other
features - such as part-of-speech, syntactic constituents, sentence length, word
length, maybe even the distribution of character n-grams, etc. - may to some
extent be useful as well. If statistics show that the two corpora have matching
features, they may be expected to be of a similar type or composition. Within a
diachronic context, however, many of these features may not be suitable because
an ongoing general linguistic change might affect one or more of them; hence
keeping them constant across corpora and over time might obscure the detection
of possible changes.
Linguistic features based on grammar or semantics have the disadvantage of
requiring morphosyntactically or semantically tagged corpora. The task of
tagging a corpus before one can determine its compositional characteristics
introduces a certain amount of interpretation, as one has to apply a grammatical
tradition or semantic framework that does not necessarily capture the possible,
subtle linguistic changes that occur as a language develops over time.
Furthermore, the task of tagging increases the degree of methodological
complexity, since one does not only need a tool for determining the
compositional characteristics of the corpus, but also tagging tools. To ensure that
the measurement of the compositional characteristics can be easily achieved in
different settings, all tools that are involved have to be well documented and
maintained. Slight modifications in tagging algorithms or tag sets may have
far-reaching implications for the reliability of this approach. For these reasons, one
should try to keep the number and complexity of the tools involved, and the
complexity of the setting itself, as low as possible. We therefore propose that corpus similarity
should be determined on untagged corpora only, and that any introduction of
linguistic theory into this process should be avoided. The mere physical,
orthographic surface of the corpus should suffice, since its recurrent symbols
yield structures of varying complexity that resemble the overall textual
composition of the corpus.
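A surface-only comparison of this kind might be sketched as follows, using the distribution of character n-grams mentioned above. The choice of trigrams and of cosine similarity is an assumption made for illustration; no particular measure is proposed here, and whether such a profile is diachronically invariant remains an open question.

```python
import math
from collections import Counter


def char_ngram_profile(text, n=3):
    """Relative frequencies of character n-grams in raw, untagged text."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def cosine_similarity(profile_a, profile_b):
    """Cosine of the angle between two n-gram profiles (1.0 = same direction)."""
    dot = sum(value * profile_b.get(gram, 0.0) for gram, value in profile_a.items())
    norm_a = math.sqrt(sum(value * value for value in profile_a.values()))
    norm_b = math.sqrt(sum(value * value for value in profile_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented miniature "corpora"; real input would be the full corpus text.
sample_a = "vi skal pynte juletræ i dag"
sample_b = "børnene vil pynte juletræ nu"
similarity = cosine_similarity(char_ngram_profile(sample_a),
                               char_ngram_profile(sample_b))
```

The sketch needs no tokeniser, tagger or tag set, which is the property argued for above.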
It is beyond the scope of this chapter to try to determine the required invariant
linguistic features. However, a couple of candidates will be briefly discussed.
Belica (1998) gives an account of some statistical diachronic investigations of
vocabulary and collocations in the German “Wendekorpus”. 16 He seems to be
quite aware of the corpus compositional problem dealt with in this paper and
points at the distribution of different collocations of selected function words 17 as
a candidate for an invariant feature, but does not give any examples or further
details. However, it does not seem evident that this type of collocation should
remain stable over time. A more appropriate candidate for an invariant linguistic
feature seems to be the frequency and dispersion of words belonging to a defined
semantic core vocabulary. Ruus (1995) attempts to determine a core vocabulary
for Danish which she believes constitutes the basic lexical norm of the language.
The core vocabulary, which Ruus determines, includes semantically vacuous, but
highly frequent function words, whereas the semantic core vocabulary, which we
shall suggest as a candidate, excludes them. The remainder should be relatively
frequent, semantically well-established words of the jul (Christmas) type
mentioned above. This semantic core vocabulary can be expected to remain
grammatically and semantically stable over time and thus may prove quite useful
in our context. Further experimental work will show whether it is possible to
derive such a constant semantic core vocabulary from a large amount of text
material, 18 and whether it really can serve as an appropriate invariant linguistic
feature.
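A possible first step towards such a vocabulary is sketched below: the words attested in every period of a text collection are intersected, and a given list of function words is removed. The data format, the function name and the min_count threshold are assumptions of this sketch, and the toy data are invented.

```python
from collections import Counter


def common_vocabulary(tokens_by_period, function_words=frozenset(), min_count=1):
    """Words attested at least `min_count` times in every period,
    excluding the supplied function words."""
    shared = None
    for tokens in tokens_by_period.values():
        counts = Counter(tokens)
        vocabulary = {word for word, count in counts.items() if count >= min_count}
        shared = vocabulary if shared is None else shared & vocabulary
    return (shared or set()) - set(function_words)


# Invented miniature data; real input would be one token list per year.
tokens_by_year = {
    1998: "jul er en fest og jul".split(),
    1999: "en jul og en gave".split(),
    2000: "jul og en tur".split(),
}
core = common_vocabulary(tokens_by_year, function_words={"en", "og", "er"})
```

Raising min_count would restrict the result to relatively frequent words; a dispersion criterion within each period could be added in the same way.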
4. Conclusion
In this paper, we have discussed examples of comparative investigations on two
general language corpora of Danish, K90 and K2000, the former reflecting
Danish language usage around 1990, the latter one around 2000. We have argued
that some of the interpretations of the observed differences in vocabulary,
collocation, semantics, and grammar were not necessarily the result of general
changes in language usage, but rather a likely consequence of differently
composed corpora. 19 This observation led us to the conclusion that a framework
for comparative diachronic corpus investigation is needed, and we sketched some
prerequisites for a methodology of diachronic corpus comparison. With regard to
comparative corpus investigation, we found that standardisable approaches are
required to account for the complexity of the observed linguistic phenomena, as
well as the quantitative relationship between a linguistic realisation and its
potential variants. In further work, such approaches should be concretised and
evaluated. With regard to the data material, i.e. the corpora to be compared, we
argued that in order to determine the similarity of corpora comprising material
from different time periods, the approach must allow a certain degree of
dissimilarity for those linguistic characteristics which are likely to change over
time. Therefore, we proposed to base this approach on invariant textual features,
the semantic core vocabulary of a language being an initial candidate for
investigation in further work. Generally, we find that the determination of
invariant features appears to be a conceptual, or hermeneutic, challenge rather
than a mere statistical one.
Notes
1
Thanks to my colleagues Britt Keson and Allan Ørsnes for their valuable
comments on the manuscript.
2
Legally DSL is a semi-public institution under the jurisdiction of the
Danish Ministry of Culture, and its activities are financed in part by the
Danish Government and in part by private and public foundations.
3
Among other corpora compiled or distributed by DSL are the Danish
PAROLE Corpus (compilation and distribution, cf. Keson 1998) and the
Bergenholtz DK87-90 Corpus (distribution, cf. Bergenholtz 1988). An
overview of Danish corpora is given in Asmussen (2001).
4
A comprehensive account of the background and the design of the Corpus
of the Danish Dictionary is given in Asmussen & Norling-Christensen
(1998).
5
Accessible at http://www.dsl.dk/korpus2000.
6
Developed by E. Bick under the VISL project, University of Southern
Denmark, http://visl.sdu.dk.
7
The query tool being CQP (Christ 1994) in conjunction with a special web
interface developed in the Korpus 2000 project. For details on Korpus
2000 and its web-based user interface cf. Andersen et al. (2002).
8
The smiling face indicates that the lemma is spelt in accordance with
official Danish orthography.
9
The score s is computed as follows: s = round(0.5 + log10(f)), for f>0, and
s = 0, for f=0, where f is the computed average number of occurrences in
one hundred million words of text.
10
This rule has been arbitrarily defined.
11
Even if this is by no means statistically significant, a user who is not aware
of this might make a comparison based on these raw frequencies.
12
As noted by Elbro himself.
13
Cf. Church & Hanks (1989); a modification is that their log2 function is
not applied here.
14
For a brief description, some examples, and further references cf. Rundell
(2002).
15
For a comprehensive overview and discussion cf. Kilgarriff (2001).
16
Institut für deutsche Sprache, Mannheim, http://www.ids-mannheim.de.
17
"die Verteilung von verschiedenen Kollokationen ausgewählter
Funktionswörter" (p. 35).
18
DSL stores approximately 400 million words of text material in its text
bank. The material covers a time span of approximately 20 years and
constitutes the raw material for other DSL corpora. A part of this material
is - to date - seven years of newspaper data from a Danish daily newspaper
(Berlingske Tidende), approximately 25 million words per year. A first
step could be to determine the common vocabulary to be found in each
year (or each month or day), and to use this as a basis for further
investigations.
19
In fact, there is a compositional difference between these two corpora as
K2000 holds approximately 2/3 of newspaper text as opposed to
approximately 1/3 in K90.
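The score defined in note 9 can be sketched as follows. The sketch assumes that round denotes conventional rounding in which halves are rounded up; Python's built-in round() rounds halves to the nearest even number, so the rounding is spelled out explicitly. The function name is hypothetical.

```python
import math


def frequency_score(f):
    """Score s = round(0.5 + log10(f)) for f > 0, and s = 0 for f = 0,
    where f is the average number of occurrences per hundred million words."""
    if f < 0:
        raise ValueError("f must not be negative")
    if f == 0:
        return 0
    # Half-up rounding of x is floor(x + 0.5); here x = 0.5 + log10(f).
    return math.floor(0.5 + math.log10(f) + 0.5)


scores = [frequency_score(f) for f in (0, 1, 9, 10, 99, 100)]
```

Under this reading, the score increases by one with each order of magnitude of f.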
References
Andersen, M.S., Asmussen, H., Asmussen, J. (2002), ‘The Project of Korpus
2000 Going Public’, in: A. Braasch and C. Povlsen (eds.) Proceedings of
the Tenth EURALEX International Congress, EURALEX (2002),
Copenhagen.
Asmussen, J. (2001), Korpus 2000. Et overblik over projektets baggrund,
fremgangsmåder og perspektiver. NyS 30. Nydanske studier & almen
kommunikationsteori, Copenhagen.
Asmussen, J. and O. Norling-Christensen (1998), ‘The Corpus of The Danish
Dictionary’, in: Lexikos 8, Afrilex Series 8:1998, Stellenbosch, pp. 223-242.
Belica, C. (1998), Statistische Analyse von Zeitstrukturen in Korpora. Teubert W.
(ed.): Neologie und Korpus. Tübingen.
Bergenholtz H. (1988), DK87: Et korpus med dansk almensprog. Hermes,
Journal of Linguistics 1, Århus, pp. 229-237.
Christ, O. (1994), A modular and flexible architecture for an integrated corpus
query system. COMPLEX’94, Budapest.
Church, K. and P. Hanks (1989), Word association norms, mutual information
and lexicography. ACL Proceedings, 27th Annual Meeting, Vancouver.
Elbro, C. (2002), Ift, ifm, mht, mhp og andre uspecifikke præpositioner. Mål og
Mæle 3:2002, Copenhagen, pp. 17-23.
Garside, R. and P. Rayson (2000), Comparing corpora using frequency profiling.
Proceedings of the workshop on Comparing Corpora, held in conjunction
with the 38th annual meeting of the Association for Computational
Linguistics (ACL 2000). 1-8 October 2000, Hong Kong, pp. 1 - 6.
Jarvad, P. (1999), Nye Ord. Ordbog over nye ord i dansk 1955-1998.
Copenhagen.
Keson, B. (1998), Documentation of The Danish Morphosyntactically Tagged
PAROLE Corpus. Society for Danish Language and Literature, DSL,
Copenhagen, http://korpus.dsl.dk/e-resurser/parole-doc.rtf
Kilgarriff, A. (2001), Comparing Corpora. International Journal of Corpus
Linguistics, vol. 6, no. 1, pp. 97-133.
Rundell, M. (2002), ‘Good Old-fashioned Lexicography: Human Judgment and
the Limits of Automation’, in M-H. Corréard (ed.) Lexicography and
Natural Language Processing. A Festschrift in Honour of B.T.S. Atkins.
EURALEX 2002.
Ruus, H. (1995), Danske Kerneord. Centrale dele af den danske leksikalske
norm. Copenhagen.
Synchronic and diachronic variation: the how and why of
sociolinguistic corpora.
Kate Beeching
University of the West of England
Abstract
This paper aims to illustrate the potential of (spoken) sociolinguistic corpora for research
studies in both synchronic and diachronic variation, with reference to French, and to
suggest ways in which useful research corpora may be established for future generations
of scholars. Spoken corpora and corpus tools are an excellent heuristic in charting
distributional frequencies or probabilistic factors. Andersen (2000) suggests that the
upsurge of innit and like in the COLT Corpus of adolescent English may be more than
age-grading. The present paper will present broad-brush preliminary evidence with
respect to the evolution of selected pragmatic particles in French.
1. Introduction
The title of my paper on sociolinguistic corpora has put the how? before the why?
This raises a very interesting methodological issue relating to corpus collection
and corpus-based research. Generally speaking, linguistic research design which
includes data collection focuses very precisely on the nature of the data required
for the research project in question. You start with the why? and move on to the
how? A hypothesis is formulated and the data collection is specifically designed
to test that hypothesis.
The nature of corpus design is quite the opposite: very large amounts of data are
collected from a representative sample of the population, both written and
spoken, and - as was the case for the Survey of English Usage - the corpus
designer decides which genres are to be included as being those which are most
typical of the language at any one time. Inevitably, the composition of the corpus
influences the findings which may be made and bias may creep in. For a long
time, for example, the spoken language was considered to have been underrepresented in the British National Corpus and, even now, if we compare how
much time the average British English speaker spends speaking and listening as
opposed to reading or writing, corpora are generally weighted in favour of the
written language. The time-consuming nature of transcription plays a major role
in the relative paucity of spoken material. Should researchers wish to chart
distributional frequencies in the spoken as opposed to the written language, they
are at liberty to select sub-corpora. Inevitably, too, the corpus may not include the
precise register of the language which the researcher is interested in. In Beeching
(1997), I provide an apology for the small home-grown corpus which focuses on,
in this case, the type of spoken French most needed by horticulturalists (a group
of learners I happened to be teaching at the time and for whose needs no
specifically-designed corpus was available). At the University of the West of
England, Bristol, we are currently engaged in a Joint Project collecting samples of
learner English/French/German and Spanish at a number of levels. Already the
methodological questions are being posed: what level of detail will be required in
transcription? Does the method of data collection fit the individual research aims
of the staff involved? How best can we collect data which will subsequently be
of interest to interrogate for studies as diverse as morphological
development, lexical richness and so on? The how? and the why? questions
bedevil the collection of corpus material in a very large number of fields. The key
factor about corpus investigations is size: millions of words are considered to be
required to gauge the use of lexical items, even those which are not particularly
rare. For studies focusing on variation, large is necessary but insufficient. In the
case of spoken data requiring transcription, large means long hours of work by
trained personnel - in other words, funding. Corpus Linguistics is a field where
individual researchers, public and private funding, universities, libraries,
publishers and government need to pull together, as has been most spectacularly
demonstrated by the success of the BNC. A collaboration of this sort has,
unhappily, not occurred for French. For studies focusing on variation, large is
necessary, but insufficient, because samples must be accompanied by particular
demographic data to allow investigations to be meaningful. And opinions differ
on exactly how data should be collected and collated for reasonable conclusions
to be drawn.
In this paper, my remarks relate to the type of corpus which lends itself to the
study of semantico-pragmatic change. On the one hand, the why? logically
precedes the how? On the other hand, most empirical research papers place the
methodology section before the results section. It is for this reason that I have
positioned the how? before the why? in this paper, though the latter part of the
paper serves as justification for the methods described in the former. The
questions to be raised are: How reasonable is it for sociolinguists to ask for
specific information about speakers to be included in the data set along with
transcriptions when corpora are compiled? And what sort of information and
what degree of detail is useful?
2. The how?
Examining data collection methods which have been used in the past is one way
to explore best practice for the future. Two outstanding examples of existing
sociolinguistic corpora for French are the Etude Sociolinguistique d’Orléans
(collected between 1966 and 1970) and the VALIBEL Corpus. The latter is the
largest electronic corpus of spoken French, with 400 hours of recordings or about
4 million words, and hopes to be representative of the “communauté linguistique
Wallonie-Bruxelles” (Francard et al. 2002). Each interview is accompanied by
demographic information covering the following six parameters: sex, age,
geographical location, place of birth, educational background and the
socioprofessional situation of the informant. More information is accessible at
http://valibel.fltr.ucl.ac.be. The VALIBEL centre focuses on variation studies
(lexical, syntactic, phonetic and prosodic). Despite the excellence of this corpus,
it is of little use for my exploration of hexagonal French, as its focus is entirely on
Belgian French. As yet nothing similar is available for hexagonal French though
the Orléans Corpus is an excellent model. My own very small Bristol Corpus is
available on-line, and the “reference corpus of spoken French”, currently the
responsibility of Mireille Bilger (Université de Perpignan: bilger@univ-perp.fr),
holds great promise.
The Orléans Corpus, the collection and transcription of which were supported by
grants from the DES, the French Embassy in London, BELC in Paris, the CRDP
in Orléans, the CMPP in Orléans, INSEE in Orléans and the French Centre for
European Sociology, is an invaluable resource. As the authors explain in their
catalogue, the origins of the ESLO (Etude Sociolinguistique d’Orléans) go back
to 1966, the era of the ‘audio-visual revolution’ in language teaching. There was
an acute need for non-literary and everyday samples of authentic spoken French.
The aim, then, of the corpus was to collect a body of recordings of spoken French
from an urban environment, the choice of subjects being governed by sociological
criteria in order to ensure that the corpus was representative of French speakers of
the time. The corpus was to be transcribed and made available to researchers in
linguistics, sociology and pedagogy and was to be used as the basis for teaching
materials. The resulting electronic corpus comprises approximately 109 hours of
spoken French, 902,755 words of which have been orthographically transcribed
with a further 13 hours of phonetic transcription. These transcriptions are
available at: http://bach.arts.kuleuven.ac.be/elicop. Each interview is
accompanied (in the catalogue) by very full background information, including
not only the six parameters favoured by VALIBEL but the initials of the
interviewer, the date of the interview and date and place of birth of the
interviewee, the INSEE coding of the interviewee, their marital status, political
leanings, the length and quality of the recording and some indication of the topics
covered. Most interviews covered a pre-determined set of questions, but the
speakers appear relaxed and informal in their mode of response. This imaginative
yet rigorously managed project is a treasure-house of historical evidence on the
French language, the first and, as yet, virtually the only French corpus in
existence which is freely available on-line.
The Bristol Corpus also had its origins in language pedagogy. Supported by small
grants and advances from OUP and CUP, I spent 10 years from 1980 to 1990 as a
freelance French textbook writer, collecting, transcribing and preparing
pedagogical materials for French (e.g. Beeching, 1985, 1986, 1989, Beeching &
Page, 1988, Beeching & Le Guilloux, 1990, 1993). All were based on and had an
accompanying cassette containing authentic taped spoken interviews with French
people from all walks of life and from a number of urban and provincial parts of
France. A selected sub-section of the interviews recorded during these years was
transcribed as part of my PhD thesis, an abridged version of which was published
in 2002. These interviews now make up the Bristol Corpus which comprises 17.5
hours of orthographically transcribed speech, or 155,000 words, involving 95
speakers, aged 7 to 90 years with a balanced range of educational backgrounds.
The Corpus may be accessed at http://www.uwe.ac.uk/facults/hlss/languages/
research/staff/CORPUS.pdf. The file contains some demographic information
about the speakers: their sex, age (in 3 age bands: 0-20; 21- 40; 41+) and level of
education (in 3 bands: no bac, bac but no university degree, bac + university
degree(s)).
Bilger (2002) presented the “Reference Corpus of spoken French”, the
constitution of which was proposed in 1998 by the GARS teams and the CNRS
ESA 6060 team directed by C. Blanche-Benveniste. Currently the DELIC team
led by J. Veronis is completing the corpus. Bilger regretted that the corpus did not
fulfil all of the criteria which Sinclair suggests characterise a ‘reference’ corpus,
as not all speech situations are included and a number of usages of the language
are missing. She suggests that this is but a first step in a vaster project. An
attempt has been made to capture a representative sample of common, current
usage. The corpus is 400,000 words long, the interviews were recorded in 40
towns and three speech situations have been favoured: “private” speech,
professional speech and public speech. Bilger did not mention in detail what
background notes accompany the transcriptions.
To sum up, a reference corpus which is useful for sociolinguistic research must
attempt to provide samples which are representative of all members of the society
in question talking at a similar level (or levels) of formality in order for
comparisons to be made and variation discerned. A minimum of information
should accompany each interview: the sex, age and educational background of the
speakers.
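By way of illustration, the minimum information suggested above could be held in a record of the following kind, from which sub-corpora can then be selected. This is a sketch only: the class, the field names and the example speakers are hypothetical, and the age and education bands simply follow those described above for the Bristol Corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Speaker:
    speaker_id: str
    sex: str        # "F" or "M"
    age: int
    education: str  # "no_bac", "bac" or "bac_plus_degree"


def age_band(age):
    """Map an age in years to one of three bands: 0-20, 21-40, 41+."""
    if age <= 20:
        return "0-20"
    if age <= 40:
        return "21-40"
    return "41+"


def select_speakers(speakers, sex=None, band=None, education=None):
    """Return the speakers matching every criterion that is given."""
    return [
        s for s in speakers
        if (sex is None or s.sex == sex)
        and (band is None or age_band(s.age) == band)
        and (education is None or s.education == education)
    ]


# Invented speakers for illustration.
speakers = [
    Speaker("S01", "F", 19, "no_bac"),
    Speaker("S02", "M", 35, "bac"),
    Speaker("S03", "F", 62, "bac_plus_degree"),
]
older_women = select_speakers(speakers, sex="F", band="41+")
```

Storing the exact age rather than only the band keeps the data usable if different bands are wanted later.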
3. The why?
Reference corpora which are allied with demographic information can be used,
not only to look at synchronic variation and the way that social hierarchies and
power may be enshrined in language use (see Beeching, forthcoming a) but also
to investigate diachronic variation (see Beeching, forthcoming b). This is the
most exciting new development. Though change may conceivably occur because
of the written language, it is generally recognised that most changes occur in
face-to-face interaction. In the past, language change has been studied with some
difficulty: written documents of various sorts have been scrutinised to find
evidence of what might have been happening in the spoken language at a
particular stage (an excellent example of such an investigation at the phonological
level is Sampson, 2002). The fact that we now have what amount to historical
documents in the form of tape-recordings and, if we are lucky, transcriptions of
those tape-recordings, means that, from now on, we will be able to monitor
change as it occurs. Corpus evidence allows us to quantify shifts, gauge reversals,
and check distributional frequencies. Electronic search facilities and statistical
packages can alleviate much of the drudgery entailed. We can refine not only our
descriptions of linguistic change but also our hypotheses about the causes and
effects of such changes. Linguists have traditionally favoured language-internal
causes of linguistic change. The advent of sociolinguistics and sociolinguistic
evidence allows us to put forward a very strong argument for the importance of
language-external, social factors. As Croft (2000: 4) remarks ‘the real entities of
language are utterances and speakers’ grammars. Language change occurs via
replication of these entities not through inherent change of an abstract system’.
There is considerable evidence that synchronic and diachronic variation are
inextricably linked and that language items which begin as one variant of a
variable or one sense of a polyseme gradually gain the upper hand. A new form-function configuration can eventually obscure the original one as, becoming more
frequent, it is routinised. As Heine, Claudi and Hünnemeyer (1991:261) remark
with reference to grammaticalisation, language change and synchronic variation
are inextricably interlinked:
Grammaticalization has to be conceived of as a panchronic process
that presents both a diachronic perspective, since it involves change,
and a synchronic perspective, since it implies variation that can be
described as a system without reference to time.
Jakobson (1952/1963:37) made a similar point when he said
Pendant un certain temps, le point de départ et le point
d’aboutissement de la mutation se trouvent coexister sous la forme de
deux couches stylistiques différentes...Un changement est donc, à ses
débuts, un fait synchronique 1 .
Grammaticalisation is a term which is generally applied to nouns and verbs which
come to serve grammatical functions and then continue to develop new
grammatical functions. Classic examples include the development of Latin-Romance articles, auxiliaries and indefinite pronouns. Whilst much focus has
traditionally been placed on what Givón has famously referred to as ‘Today’s
morphology is yesterday’s syntax’, it could be argued that the emphasis on
morphology reflects the neogrammarians’ preoccupation with this area rather
than giving a just view of a process or phenomenon which extends to semantic
and pragmatic areas, not solely purely ‘grammatical’ ones. In order for a lexical
or content word to gain in currency and become a function or grammatical word,
it must of necessity lose the particular lexical meaning which ties it to specific
contexts and become desemanticised. This ‘bleaching’ has been regarded as a
somewhat negative fate for words, though some researchers point out that we
should not consider desemanticisation as ‘loss’ but as an enriching or
multiplying of meaning. In addition, desemanticisation is a sine qua non for a term
to become grammaticalised or gain in currency, to change from a ‘content’ to a
‘function’ word, as Haspelmath (1999:1062) explains:
Semantic generalization or bleaching is usually a prerequisite for use
in a basic discourse function, that is, for the increase in frequency that
triggers the other changes.
The semantic generalisation which is so often observed in the development of a
lexical category to a functional one appears, then, to be not a consequence of
routinisation but a pre-requisite for it. This “bleaching” and “gaining in currency”
is not a phenomenon which restricts itself to morphology but is omnipresent, not
least in cases of pragmatico-semantic change. In Beeching (forthcoming b) I
argue precisely this and I posit that, along with the formal changes which appear
to accompany grammaticalisation (desemanticisation, fusion, coalescence), what
we are witnessing in all cases of language change is a process of
pragmaticalisation which I define as follows:
Pragmaticalisation is the manner in which words, used in context,
shift in meaning or attract a new social semiotic, become habituated in
that usage and are propagated because of the new fashion or prestige
which is attached to them.
Pragmaticalisation occurs during human interaction and human interaction is
always heavily overlaid with connotations which may be most readily explored
using politeness theory. Kasper (1990) distinguishes between two trends in the
conceptualisation of politeness: politeness as strategic conflict-avoidance and
politeness as social indexing. Both are involved in a consideration of pragmatic
particles such as like and innit in British English or enfin, hein or quoi in
hexagonal French. All are used to encourage speaker involvement, and they are
part of the means whereby speakers downtone assertions, avoid presenting
themselves as the expert and thus they are a means of avoiding conflict and
managing face needs. But they are often words, too, which are stigmatised in
“polite” circles: they are familiar, colloquial, and used only in the spoken language, in
informal contexts or by an easily identifiable subset of the population. Quotative
like, for example, is much more frequent on the lips of younger speakers. Hein
and, to an even greater extent, quoi are characteristically used far more often
by working class speakers. Frequent use of such terms, then, carries a
social semiotic, resulting in social indexation.
The underlying processes of language innovation and language diffusion appear
to me to reside in the conventionalisation of generalised invited inferences, some
of which may be cognitively motivated metaphorical or metonymic
interpretations, but many of which are motivated by considerations of politeness
in its broadest sense including both notions of sociability and face and of social
indexation (see Eelen, 2001, and Beeching, 2002: 25-27). On the one hand,
Croft (2000: 178) intimates the usefulness of politeness theory in helping to
explicate the directionality of the transmission of variants, an aspect not
accounted for in the Milroys’ weak-tie model. On the other hand, drawing on
Levinson (2000), Traugott and Dasher (2002) posit an Invited Inferencing Theory
of Semantic Change (IITSC). This theory suggests that historically there is a path
from coded meanings (Ms) to utterance-token meaning (IINs) to utterance-type,
pragmatically polysemous meanings (GIINs) to new semantically polysemous
(coded) meanings. Their arguments for the role of conceptual metonymy are
persuasive. My theory of pragmaticalisation builds on the IITSC but places it
firmly in the context of politeness theory and of Milroyan sociolinguistic theory
and methodology. Both Croft (2000) and Givón (2002) draw out the parallels
between adaptive behaviour in biology and linguistic phenomena, with variation
seen as the indispensable tool of learning, change and adaptation and this in the
context of the sociability, co-operation and communicative way of life rooted in a
hunting-and-gathering society. The primacy of sociability, the bid for conflict-avoidance in human interaction and the equally prevalent urge for social
indexation promote not only a high rate of usage of politeness markers but also
their differential usage in different identity groups. Moreover, because such
particles are both frequent and highly polysemous, they are subject to change.
Hence the fascination they hold for scholars of synchronic and diachronic
variation.
The techniques and tools of Corpus Linguistics are formidable allies when
exploring the relative distributional frequencies of old and new forms coexisting
in synchrony. Though linguists have generally focused on change and, in
particular, on the beginning and end of a change, the far commoner and more
ubiquitous situation is that of stable sociolinguistic and stylistic variation, out of
which change may, or may not, emerge. As Hopper and Traugott (1993: 95) point
out:
Changes do not have to occur. They do not have to go to completion,
they do not have to move all the way along a cline [...] the outcome
of grammaticalization is quite often a ragged and incomplete
subsystem that is not evidently moving in some identifiable direction.
Polysemy and synchronic variation are far commoner than diachronic change
and, coupled with social stratification, they constitute the raw material without
which change is not possible. Distributional frequencies may fluctuate, depending
upon fashion and prestige, a form-function association may dwindle and even die
and, in those relatively rare cases, change may be said to have occurred. There
appears to be a probabilistic aspect to the spread of a new form, the now famous
S-curve, whereby after a slow start a new form surges forward at an accelerated
pace and then falls back as it stabilises. In Beeching (forthcoming a), I have
attempted to capture the numerous complexities involved in the sociolinguistic
promotion and demotion of a new form in the disarmingly simple formula l = p − n, where l is the likelihood of spread, p stands for the positive aspects of the (changed)
form-function configuration and n for the negative ones in the mind of the
speaker involved. This formula allows for the fact that not all forms are positive
for all members of a speech community and that attitudes to a form may change
over time. In highly stratified societies, the n value of a stigmatised form entirely
eclipses its p value. Only if strata become less rigid and identification with the
life-style associated with the form rises can the n value drop. A measurement of
the distributional frequencies of a form in speech samples from similar speech
communities at two points in time may indicate the ‘l’ and help us to chart
fluctuations over a large speech population. In this respect, the usefulness of large
corpora of transcribed spoken language coupled with the demographic
information classically associated with sociolinguistic studies - the age, sex and
socioeconomic background of speakers - can hardly be exaggerated.
In Beeching (forthcoming b), I survey the evolution of four pragmatic particles,
as evidenced in a comparison of their usage in the (1966-1970) Orléans Corpus
and the (1980-1990) Bristol Corpus. Many caveats must be issued concerning the
conclusiveness of the data presented. The corpora are small by many standards.
The Orléans Corpus is restricted to only one town in France, whereas the Bristol
Corpus covers a number of different towns and regions. Although a similar
amount of speech is examined in each corpus (around 155,000 words), ninety-five speakers are surveyed in the Bristol Corpus while only twelve have been
studied in detail in the Orléans Corpus, two men and two women from each of the
three education groups which I have designated here as WC, LMC and MC. As a
means of illustrating the ‘why?’ of sociolinguistic corpora, I wish now to focus
on only two of the particles studied in Beeching (forthcoming b): hein and quoi.
Hein and quoi both occur in an utterance-, or at the very least, tone-group-terminal position and are mildly stigmatised (they do not occur in formal written
discourse and are highly unlikely to occur in formal spoken French). Quoi
remains perhaps slightly more stigmatised than hein. Both, however, could be
said to serve social/interactional purposes in maintaining hearer-involvement and
in hedging or downtoning remarks which a speaker or hearer might consider
over-assertive. Hein is generally translated into English by a tag question or by
you know?, while quoi may be rendered ‘as it were’, ‘so to speak’, ‘know what I
mean’, ‘like’, ‘sort of’, ‘kind of thing’, ‘you know’ or even ‘of course’.
Oui, peut-être mais ça dépend aussi, hein?
Yes, perhaps, but it depends, too, doesn’t it?
ne pas avoir que des contraintes dans dans sa vie, quoi, hein?
not just to have obligations in in one’s life, as it were, you know?
Hein and quoi appear, thus, to function in both of the manners in which politeness
is conceptualised. They serve both to flag social indexation (they are nonstandard, stigmatised forms) and to mediate sociability. While quoi seems to have
one core function or meaning (it flags tentativeness concerning the adequacy of
one’s expression), Beeching (2002) charted two main functions or meanings of
hein: a Hyperbolic function, where hein underscores an emphatic remark, and a
Discoursal function, where hein is used as a backchannel device to maintain
hearer involvement. Results in Beeching (2002) based on an analysis of the
Bristol Corpus suggested a shift in distributional frequencies from the Hyperbolic
to the Discoursal usage and, simultaneously a change in social semiotic: the
Discoursal usage of hein was favoured by female speakers.
[Figure 1 (chart not reproduced): y-axis Mean HRANDQR, scaled 0 to 140; x-axis CORPUS (Bristol, Orléans); series by CLASS (WC, LMC, MC).]
Figure 1: Mean rates of hein and quoi usage in the Bristol and the Orléans
Corpus
As Beeching (forthcoming b) shows, rates of hein usage are similar overall in
both corpora. The class distribution of usage differs dramatically, however. In the
later, Bristol, Corpus, the middle class speakers have adopted hein and working
class rates are proportionately smaller. Rates of quoi-usage have, by contrast,
doubled in the intervening years, with extremely high rates amongst working
class speakers but much higher rates, too, amongst middle class speakers.
Figure 1 charts the distributional frequencies of the sum of the means of hein and
quoi rates per 10,000 words in the Bristol and Orléans Corpora subdivided for
educational background. (It should be noted that the corpora had to be screened in
the case of quoi to focus only on occurrences of it in tone-group-terminal
function: quoi has of course a number of other more canonical pronominal uses in
interrogative and relative constructions - de quoi parles-tu? il n’y a pas de quoi
rire etc.)
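The screening and normalisation just described can be illustrated with a minimal Python sketch. It is not the procedure actually used for the Orléans and Bristol data. It assumes that the transcription marks tone-group boundaries with punctuation, and it treats quoi as tone-group-terminal when it is followed by a punctuation mark and not preceded by a preposition. The function names and the preposition list are hypothetical, and any automatic count of this kind would still need checking by hand.

```python
import re

# Assumption: prepositions that typically introduce pronominal "quoi"
# (e.g. "de quoi parles-tu?"). Illustrative only, not exhaustive.
PREPOSITIONS = {"de", "à", "en", "sur", "par", "pour", "avec", "sans"}
PUNCTUATION = set(".,;:?!")


def tokenize(text):
    """Split text into lower-cased word tokens and single punctuation marks."""
    return re.findall(r"\w[\w'’-]*|[.,;:?!]", text.lower())


def count_words(text):
    """Count word tokens, ignoring punctuation marks."""
    return sum(1 for tok in tokenize(text) if tok not in PUNCTUATION)


def count_terminal_quoi(text):
    """Count "quoi" followed by punctuation and not preceded by a preposition."""
    tokens = tokenize(text)
    hits = 0
    for i, tok in enumerate(tokens):
        if tok != "quoi":
            continue
        prev_tok = tokens[i - 1] if i > 0 else ""
        # Treat the end of the transcript as a boundary.
        next_tok = tokens[i + 1] if i + 1 < len(tokens) else "."
        if next_tok in PUNCTUATION and prev_tok not in PREPOSITIONS:
            hits += 1
    return hits


def rate_per_10k(occurrences, n_words):
    """Normalise a raw count to a rate per 10,000 words."""
    return occurrences * 10_000 / n_words


particle_use = "ne pas avoir que des contraintes dans dans sa vie, quoi, hein?"
pronominal_use = "de quoi parles-tu? il n'y a pas de quoi rire"
print(count_terminal_quoi(particle_use), count_terminal_quoi(pronominal_use))
```

On the two examples quoted in this chapter, the heuristic keeps the particle use and discards the pronominal ones. A quoi that runs directly into a following hein with no intervening punctuation would be missed, which is one reason why manual screening remains necessary.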
The most striking aspect of this figure is the increase in the usage of hein and
quoi amongst middle class speakers. There appears to have been a slight dropping
off in lower-middle-class usage of these stigmatised forms in the Bristol Corpus
and a very slight increase in working class usage but the increase in usage overall
is amongst middle-class speakers.
Table 1 demonstrates the role played by the speaker’s sex in the increased rates of
hein/quoi usage.
Table 1: Rates of hein/quoi usage per 10,000 words in the Orléans and Bristol
Corpora, subdivided by sex and class (N= the sum of the raw number
of occurrences of hein and quoi).
                          Male                          Female
                          WC       LMC      MC       WC       LMC      MC
Orléans Corpus (1966)     65.61    62.54    15.03    22.94    13.92    15.84
                          N=200    N=191    N=37     N=69     N=51     N=29
Bristol Corpus (1990)     57.91    27.69    24.33    46.89    33.96    35.98
                          N=271    N=53     N=22     N=150    N=108    N=54
In the Orléans Corpus, hein/quoi usage is a predominantly male WC and LMC
phenomenon. The female speakers observe the same reticence concerning
hein/quoi usage as the MC males. In the Bristol Corpus, however, rates are a
great deal more evenly distributed. Though the highest rates belong to WC
speakers, this is true of both the men and the women. Indeed, the female LMC
and MC rates exceed those of their male equivalents. It seems that hein and quoi
are becoming less stigmatised and that it may be women who are leading this
change in social semiotic.
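The comparisons drawn from Table 1 can be recomputed with a short Python sketch. The rates are those given in Table 1, read on the assumption that each sex-and-class column lists the Orléans value above the Bristol value. The dictionary layout and function names are hypothetical.

```python
# Rates of hein/quoi usage per 10,000 words, as read from Table 1.
# Assumption: within each sex/class column, Orléans precedes Bristol.
RATES = {
    ("Orléans", "M"): {"WC": 65.61, "LMC": 62.54, "MC": 15.03},
    ("Orléans", "F"): {"WC": 22.94, "LMC": 13.92, "MC": 15.84},
    ("Bristol", "M"): {"WC": 57.91, "LMC": 27.69, "MC": 24.33},
    ("Bristol", "F"): {"WC": 46.89, "LMC": 33.96, "MC": 35.98},
}


def change(sex, cls):
    """Bristol rate minus Orléans rate for one sex/class group."""
    return round(RATES[("Bristol", sex)][cls] - RATES[("Orléans", sex)][cls], 2)


def female_minus_male(corpus, cls):
    """Female rate minus male rate within one corpus and class."""
    return round(RATES[(corpus, "F")][cls] - RATES[(corpus, "M")][cls], 2)


for sex in ("M", "F"):
    for cls in ("WC", "LMC", "MC"):
        print(sex, cls, "change between corpora:", change(sex, cls))

for cls in ("WC", "LMC", "MC"):
    print(cls, "female minus male, Bristol:", female_minus_male("Bristol", cls))
```

A positive value from female_minus_male for the Bristol LMC and MC groups corresponds to the observation that female rates exceed those of their male equivalents in the later corpus. The sketch compares rates only and says nothing about statistical significance, which the small number of speakers makes hard to establish.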
In Beeching (forthcoming a), I discuss the relationship between politeness and
power and the maintenance of hierarchies through asymmetrical language-usage.
It is too early to make any conclusive remarks, and replication of the study,
drawing on more data from the invaluable Orléans Corpus and also from the more
recent Reference Corpus, is advisable. However, it seems possible that, if middle
class speakers are beginning to adopt stigmatised ‘working class’ speech forms,
there has been a democratisation, a shift in the hierarchical nature of French
society. Our study of distributional frequencies of pragmatic particles may inform
us not only about linguistic usage but also about the structure of society. It is the
inter-relationship between the two which may bring about linguistic change.
4. Conclusion
In this brief chapter, it has been impossible to do full justice to either of my main
foci: the usefulness of sociolinguistic corpora and the way that diachronic change
may be charted through differences in distributional frequencies (synchronic
variation) noted at different points in time. As Macaulay (2002: 298) points out,
there are many variables that affect samples of speech. He claims, however:
Yet we need not despair. One way forward is in replication. As more
studies are carried out, the influence of accidental factors may be
easier to detect.
Macaulay makes a number of very useful practical suggestions concerning the
way that replication may be made more reliable. For one thing, researchers need
to show rates of occurrence in a standardised way which will allow comparisons
to be made e.g. how many occurrences per 10,000 words. He recommends the
amassing of comparative data. When collecting corpus material, I urge fellow
researchers to append as much information as possible concerning the date and
place of collection, the sex, age, place of birth and educational and
socioeconomic background of the speaker, the nature of the relationship between
the interviewer and interviewee (or speakers, if a ‘fly-on-the-wall’ recording).
Macaulay stresses, too, the usefulness of the computer, a point of view which I
whole-heartedly support. One is forced, however, to recognise that linguists do
not look set to be put out of their job by a computer: the polysemy and
multifunctionality of certain terms, the subtlety of invited inferences and the
continuum along which terms travel towards coded meanings pose a formidable
challenge to the computer programmer.
Finally, as any researcher who has collected and transcribed spoken material
knows only too well, transcription is fantastically time-consuming. It has become
obvious to me, however, that when we transcribe, what we are engaged in is not
only a record and research base for the study of contemporary forms. What we
are engaged in is the creation of a historical archive of spoken language which
was impossible for our less technologically-advantaged forebears. To aid our
understanding of the nature of language, of the way that it is structured and how
that structure reflects social phenomena, of its variation, both synchronic and
diachronic, every attempt should be made to take samples which are
representative and to accompany our hard-won transcriptions with detailed
demographic information. In this way, we allow others to stand on our shoulders
and the fruits of our efforts are redoubled.
Notes
1
For a while the departure and arrival point of the change coexist in the
form of two different stylistic layers.... a change is thus, in its beginnings,
a synchronic phenomenon.
References
Andersen G. (2000), Pragmatic Markers and Sociolinguistic Variation. A
relevance-theoretic approach to the language of adolescents.
Amsterdam/Philadelphia: John Benjamins.
Beeching, K. (1985), Vrai de vrai! Oxford: Oxford University Press.
Beeching, K. (1986), A vrai dire. Oxford: Oxford University Press.
Beeching, K. (1989), Actifrance. Oxford: Oxford University Press.
Beeching, K. (1997), French for specific purposes: the case for spoken corpora.
Applied Linguistics. (1997). 18, 3, 374-394.
Beeching K. (2002), Gender, Politeness and pragmatic particles in French.
Amsterdam/Philadelphia: John Benjamins.
Beeching K. (Forthcoming a), Pragmatic particles - polite but powerless? Tone-group terminal hein and quoi in contemporary spoken French. To appear
in one of the first two editions of Multilingua 2004.
Beeching, K. (Forthcoming b), Pragmaticalisation, politeness and linguistic
change: synchronic evidence from French.
Beeching K., & I. le Guilloux. (1990), Ça se dit et ça s'écrit. Oxford: Oxford
University Press.
Beeching K. & I. le Guilloux. (1993), La passerelle. Cambridge: Cambridge
University Press.
Beeching K., & B. Page. (1988), Contrastes. Cambridge: Cambridge University
Press.
Bilger M. (2002), Présentation du “corpus de référence de français parlé”. Paper
given at the ATALA Conference, “Constitution et exploitation de corpus
du français parlé” 25 May 2002.
Croft, W. (2000), Explaining Language Change. An Evolutionary Approach.
Harlow: Longman.
Eelen, G. (2001), A critique of politeness theories. Manchester: St. Jerome
Publishing.
Francard M., G. Geron, V. Giroul, P. Hambye, A.C.Simon, and R.Wilmet (2002),
Le centre de recherche Valibel: des corpus oraux au service d’un
observatoire du français en Belgique. Paper given at the ATALA
Conference, “Constitution et exploitation de corpus du français parlé” 25
May 2002.
Givón, T. (2002), Bio-linguistics. The Santa Barbara lectures. Amsterdam: John
Benjamins.
Haspelmath, M. (1999), Why is grammaticalization irreversible? Linguistics, 37(6): 1043-1068.
Heine, B., U. Claudi, and F. Hünnemeyer. (1991), Grammaticalization: a
conceptual framework. Chicago: University of Chicago Press.
Hopper, P. J. & E. C. Traugott. (1993), Grammaticalization. Cambridge:
Cambridge University Press.
Jakobson, R. ([1952] 1963), Essais de linguistique générale. Paris: Éditions de
minuit.
Kasper, G. (1990), Linguistic politeness: current research issues. Journal of
Pragmatics 14: 193-218.
Levinson, S. (2000), Presumptive Meanings: the theory of Generalized
Conversational Implicature. Cambridge, MA: MIT Press, Bradford.
Macaulay, R. (2002), Discourse Variation. In Chambers, J K, Schilling-Estes, N
The handbook of language variation and change. Oxford: Blackwell, pp.
283-305.
Sampson, R. (2002), A transient vowel in early modern French: i nasal, in: R.
Sampson and W. Ayres-Bennett (eds.) Interpreting the history of French.
A Festschrift for Peter Rickard on the occasion of his eightieth birthday.
Amsterdam/New York: Rodopi.
Traugott, E. and R. Dasher. (2002), Regularity in Semantic Change. Cambridge:
Statistical analysis of the source origin of Maltese
Roderick Bovingdon
Angelo Dalli
University of Sheffield
Abstract
This paper presents the results of the first ever large-scale statistical analysis of Maltese
using the newly formed Maltilex Corpus. Traditional etymological and categorical
analyses were supplemented with data mining techniques to provide accurate results with
reduced effort.
Statistics about the relationship between etymology and word classes were analysed from
different viewpoints. Maltese grammar and morphology remain to this day largely Arabic,
but with distinct Romance and English morphological accretions. Italian lexical influence
upon present day Maltese has exceeded the Arabic content in a quantitative sense,
enriching Maltese from a purely root based morphology with additional productive
Romance features.
1. Introduction
The most recent theory relating to the origin of the Maltese language points to a
direct Sicilian-Arabic connection. This claim, by Dionisius Agius (Agius, 1990;
Agius, 1993; Agius, 1996) and Joseph Brincat (Brincat, 1994), although supported
by significant research findings by other contemporary scholars from different
disciplines 1 , remains inconclusive (Bonanno, 1988; Borg and Alexander, 1978;
Borg and Alexander, 1994).
In the light of the prevailing knowledge of his times, Joseph Aquilina was on the
right track when he proposed the Arabic of the Muslim Aghlabids of the
Maghreb, as the most likely original source for the Maltese language (Aquilina,
1959). Even at this early stage Aquilina does not rule out a possible Sicilian-Arabic link (Aquilina, 1988).
In contrast to Aquilina’s thesis, the current Siculo-Arabic school of thought
presents a revolutionary twist. It repositions the original source of Maltese from a
direct northern African origin to an offshoot of Sicilian Arabic. This theory
reinforces the notion that the form of Arabic which was adopted in Malta, as the
forerunner of today’s Maltese, was already tainted with non-Arabic language
influences from its very inception (Agius, 1996; Brincat, 1995). Whilst
considerable language convergence exists between Maltese and Siculo-Arabic
(Agius, 1996), there still remain significant unexplained aspects of inquiry into
early Maltese 2 .
The answers to most unexplained lines of inquiry, as sparse as the remaining
evidence may be, may very well be lying dormant within the more subtle and
distant aspects of early Maltese. The grammar and morphology of modern
Maltese remain to this day Arabic, with distinct Romance and English
morphological, and increasingly lexical, accretions (Aquilina, 1979; Mifsud,
1995).
A thorough combing of European, North African and Turkish depositories as well
as of private Maltese collections, has the potential of uncovering major
revelations, as given in the relatively recent contributions in Wettinger and Fsadni
(1968), Cassola (1992 and 1996), and Brincat (1995). Scholarship of Maltese has
taken great strides forward by way of research, both within the spheres of
linguistics as well as from a purely historical perspective (Cassola, 1992; Mifsud,
1995). Wettinger’s exposure of Vatican Manuscript 411 deserves closer scrutiny
as a possible significant link between medieval Maltese and Arabic, as well as
being a valuable pointer to the existence of other similar contemporary
documents elsewhere, including possible links to Hebrew (Wettinger, 1979).
The quantity and significance of recent documentary discoveries relating to
medieval Maltese, are highly suggestive of even earlier records of written
Maltese. Such exciting finds could provide scholars with the elusive missing
linguistic links, bridging the gap between the pre-Arabic and Arabic beginnings.
This need not necessarily conflict with Agius’ and Brincat’s postulations, as there
is already ample evidence of the strong Siculo-Arabic connection. Agius openly
claims, in his major work on Siculo Arabic, that Maltese is a direct offshoot of
Siculo-Arabic with no connections to the north African littoral 3 . Brincat (1994)
emphasises the lack of a substrate for the pre-Arabic period. The distinct
possibility of remnants of a pre-Arabic substrate concealed within the later
Siculo-Arabic strata is suspected.
2. Non-Arabic linguistic influences on contemporary Maltese
Traditionally the Romance element in Maltese was thought to commence with the
coming of the Normans from 1049 onwards. Nowadays the year 1127 is
perceived as a more realistic date in historical terms (Cassar, 2000). The pre-Norman Arabic content appears to have been heavily Sicilian based (Brincat,
1995). Brincat’s contention that the lack of a substrate is the strongest pointer to a
Siculo-Arabic genus 4 holds much merit along with Agius’ extensive approach to
such origins.
2.1 Knights of St. John of Jerusalem
The more overt lexical and the later morphological Romance additions to
Maltese had to wait for the arrival of the religious Order of the Knights of St.
John of Jerusalem, circa 1530, before they made any significant inroads.
The Arabic-oriented populace of Malta, left to its own devices,
continued to interact harmoniously for a long period of time, within a Muslim 5 ,
Christian 6 and Jewish milieu. Under the Knights, direct rule was introduced 7 ,
with the consequent imposition of the foreigners’ will and culture. During this
long period in Malta’s history between 1530 and 1798, the Maltese came under
more direct and imposing influence from the Romance element, due to the
comparatively large numerical presence of Knights 8 , with an increasing quantity
of Romance words and phrases from different regions of the Italian mainland
(especially from the northern regions of Italy) being absorbed into Maltese. The
increased social interaction with the local populace through this direct rule was an
overriding force impinging on every aspect of the indigenous Maltese way of life,
not least the language admixture.
The Order’s rule over Malta lasted for over two centuries. During this vibrant
period in the archipelago’s history, this considerably increased interaction
between the rulers, the Knights, and the general populace, consisted for the most
part in a master-subordinate relationship. It is important to note this point, as such
social interaction between the overlords and the general populace meant that
several aspects of the rulers’ culture, not least their predominantly Romance
languages, imprinted their influences on the mainly peasant indigenous stock.
After the Order was ousted from the islands by the French under Napoleon in
1798 – who in turn saw their demise after only two years’ occupation – the
British took possession for the next century and a half.
2.2 Modern Italian
Ironically, the greatest linguistic inroads from the Italian mainland occurred
during the early days of the British rule 9 . Owing to the political turmoil aroused
on the Italian mainland, during the unification of Italy many political refugees
sought and were granted political asylum in the British protectorate of Malta.
These émigrés, who included a number of prominent Italian intellectuals and
politicians, banded together and formed a strong political lobby of their own.
With Malta’s long tradition of strong political, cultural, administrative, religious
and linguistic influence from the Italian mainland as well as from Sicily, these
émigrés found many sympathisers amongst the local population (Friggieri, 1979).
A considerable number of newspapers in the Italian language flourished on the
island, stimulating the adoption and spread of modern Italian language influence
on both the Maltese idiom, as well as Maltese thought 10 . The infamous Language
Question of the late 1920s and early 1930s, when the British colonial power
squeezed out Italian from officialdom, saw the insertion of Maltese as the
language of common parlance, while English took the coveted functionality of
Italian. This political manoeuvring marked a phase when Italian linguistic
influence suffered a temporary though prolonged lull (Fuccio, 1933; Hull, 1993).
Modern Italian has since made vast inroads into contemporary Maltese, mostly
through the influence of Italian television, which enjoys a wide following in
Malta. So strongly has this new influx been felt that the large quantity of Italian
lexemes entering Maltese has affected the morphology of the language – a
phenomenon indicated by the statistics presented in this study. Contemporary
Italian borrowings can be distinguished with ease from those previously adopted
during the much earlier pre-British period 11 . These additional and more
significant Italian influences have markedly shaped Maltese with distinctly
European traits. This trend emanates from the intellectual class which
consistently borrows new terminology, mostly from Italian, with increasing
inclusions from other European languages; these being the dominant language
sources whence contemporary intellectual, scientific and technological
innovations originate.
This study clearly shows that Italian lexical influence on present-day Maltese, if
only by way of numerical representation, has surpassed the Arabic content. Italian
morphological influences have taken hold to such an extent that the updating of
Maltese grammars to include this aspect is being considered. This development
has moved Maltese away from a purely root-based morphology, a typological
feature of its Semitic past (Schweiger, 1994), by adding the productive
Romance feature of catenation (Mifsud, 1995).
2.3 English
The most recent linguistic influence on Maltese is English. English has steadily
and increasingly affected Maltese, adding another language facet to the overall
structure of Maltese (Mifsud, 1995).
Interestingly enough, despite these recent accretions from English and the vastly
different morphological structure of the two languages, the assimilation of
English lexemes into a Maltese mould occurs with the least possible disturbance
to Maltese morphological structure. By comparison, contemporary Arabic, under
the influence of similar pressures of modern life, as well as the ever-changing
world political scene, when borrowing from English and American-English, is
adopting similar assimilative patterns as Maltese (Holes, 1995). In the case of
Maltese, this is especially so with verbs, where the basic structure of the English
word is left intact while other morphemes in the form of mainly prefixes and
suffixes take over (i.e. the stem does not change its internal make-up).
Other linguistic devices, also of a Semitic nature, such as gemination of the initial
radical in the case of verbs that are frequently preceded with a euphonic i, are also
at work (Mifsud, 1995). Furthermore, as a result of this linguistic enrichment and
infusion, Romance affixes assimilate naturally with both Semitic and English
lexemes as an intrinsic part of standard Maltese grammatical structure.
3.
Statistical and analytical results
The data used in this statistical analysis has been sampled from the Maltilex
Project’s corpus as of November 2001. The Maltilex Project is a joint effort
between the Department of Computer Science and AI and the Institute of
Linguistics at the University of Malta. Its aim is to produce the first national
collection of computerised language resources for Maltese (Rosner et al., 1999;
Rosner et al., 2000).
The Maltilex corpus is made up of a representative mixture of newspaper articles
of different kinds, including local and foreign news coverage, sports articles,
political discussions and others, together with transcripts of radio shows, official
government publications and some novels. When this statistical analysis was
performed the corpus had 1.8 million words consisting of almost 70,000 different
word forms.
3.1
Selection methodology and analysis tools
A representative sample of 1,000 words was needed for the purpose of this study.
The sample was selected using a strictly random process to ensure the validity of
our statistical results. A chaotic function was used to assign a random number to
all 70,000 unique word forms in the corpus, permitting a randomly ordered list of
words to be created. This list was then examined and the following modifications
were made:
1. Spelling Mistakes – Spelling mistakes were immediately corrected. Words that
were spelled ambiguously were excluded from the list. This kind of
modification and filtering is statistically sound since no attempt is made to
influence the contents of the list. During the cleaning process the data was
also converted into Unicode format to be universally accessible by all
analysis tools.
2. Hapax Words – The word occurrence frequency in the corpus was examined
for every word and all hapax words were automatically removed. This
process removed most superfluous and arcane words that were
accidentally inserted in the corpus and that appeared in the sample. This
process can be statistically justified since hapax words can be seen as
outliers that can be safely removed without affecting the validity of the
resulting analysis.
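The selection procedure above can be sketched roughly as follows. This is a minimal illustration, not the authors' tool: the function name `draw_sample`, the seeded `random.Random` generator (standing in for the "chaotic function") and the toy token list are all assumptions, and the manual spelling correction step is not modelled.

```python
import random
from collections import Counter

def draw_sample(tokens, sample_size, seed=0):
    """Randomly order the unique word forms, drop hapaxes, keep the first sample_size."""
    freq = Counter(tokens)                     # corpus frequency of every word form
    forms = sorted(freq)                       # fixed starting order for reproducibility
    rng = random.Random(seed)                  # assumption: stand-in for the chaotic function
    rng.shuffle(forms)                         # randomly ordered list of unique word forms
    kept = [w for w in forms if freq[w] > 1]   # remove hapax words (frequency 1)
    return kept[:sample_size]

# Hypothetical toy corpus; only "u" and "il-kelb" occur more than once.
tokens = "il-kelb u l-qattus u il-kelb jiġri u jiekol".split()
sample = draw_sample(tokens, sample_size=2)
```

Filtering after shuffling keeps the relative random order of the surviving word forms, so the first 1,000 survivors are still a random selection from the non-hapax forms.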
The etymology and word class were noted down for every word in the sample. When a
word had more than one class, the word entry was duplicated and a single
class was entered for every entry. In order to maintain accurate statistics, a
weight was added to every entry reflecting the number of classes associated
with a particular word. Thus every entry of a word having n classes was assigned
a weight of 1/n. A total of 1,034 entries were thus obtained, representing all
possible etymology/word class pairs for the sample word forms.
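The 1/n weighting can be illustrated with a short sketch. The word labels and class assignments below are hypothetical toy data, and the helper names `expand_entries` and `class_totals` are assumptions introduced only for this example.

```python
from collections import defaultdict
from fractions import Fraction

def expand_entries(word_classes):
    """Turn {word: [classes]} into (word, class, weight) entries with weight 1/n."""
    entries = []
    for word, classes in word_classes.items():
        weight = Fraction(1, len(classes))     # a word with n classes gets weight 1/n
        for cls in classes:
            entries.append((word, cls, weight))
    return entries

def class_totals(entries):
    """Sum the weights per word class."""
    totals = defaultdict(Fraction)
    for _, cls, weight in entries:
        totals[cls] += weight
    return dict(totals)

# Hypothetical toy sample: three words with one, two and three classes.
toy = {"word_a": ["n"], "word_b": ["n", "adj"], "word_c": ["v", "n", "adv"]}
entries = expand_entries(toy)
totals = class_totals(entries)
```

Because the weights of each word sum to 1, the weighted totals always add up to the number of sampled words, however many duplicated entries there are. This is also why weighted class counts need not be integers.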
The data matrix that was obtained was analysed using a custom-written data
mining tool to extract statistics about the relationship between etymology and
word classes in Maltese. Overall statistics about the source language origins of
Maltese together with the most commonly occurring word classes were also
extracted. The use of a data mining tool enabled us to analyse the data from two
different perspectives – word class distribution for every etymological class and
vice-versa.
3.2
Etymology
In our data sample, most of the words derived from Italian, Arabic and English.
Figure 1: Source Origin of the Maltese Language
(Pie chart: It 54%, Ar 41%, Eng 4%, Eng>It 1%, Eng>Dutch 0%, Other 5%.)
There were also some isolated cases of Italian and Dutch etymons 12 finding their
way into Maltese through English. In the analysis, Italian, Arabic and English are
denoted by the codes It, Ar and Eng respectively, and the isolated cases by
Eng > It and Eng > Dutch respectively.
Figure 1 shows the source origin of the Maltese Language, with Italian words
forming the majority of the words at 54%, Arabic at 41% and English at 5%.
Table 1 shows a summary of the etymological analysis data 13 , showing the exact
number of word forms pertaining to every etymological class.
Table 1: Maltese Language Etymology Summary
Etymology    Count
Ar           411
Eng          36
Eng>Dutch    1
Eng>It       9
It           543
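The percentages quoted for Figure 1 can be recomputed from the Table 1 counts. The sketch below assumes that the 5% English figure groups the direct English borrowings with the two "via English" classes; that grouping is an interpretation, not something the text states explicitly.

```python
# Counts copied from Table 1.
counts = {"Ar": 411, "Eng": 36, "Eng>Dutch": 1, "Eng>It": 9, "It": 543}

total = sum(counts.values())                        # size of the sample
shares = {k: 100 * v / total for k, v in counts.items()}

# Assumption: "English" in the running text covers all three Eng classes.
english_all = shares["Eng"] + shares["Eng>Dutch"] + shares["Eng>It"]
```

Under that assumption the English share is 4.6%, which rounds to the 5% reported, while Italian and Arabic round to 54% and 41%.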
Following this summary analysis we then split up the data set according to the
three main source languages and further analysed the source languages’
contribution in terms of word classes. The following abbreviations were used to
denote different word classes: adj – adjective; adv – adverb; conj – conjunction;
demon – demonstrative; interj – interjection; n – noun; pers – personal; poss –
possessive; pp. – participle; prep – preposition; pron – pronoun; v – verb. The
word classes were further annotated with m, f, pl and dual to denote masculine,
feminine, plural and dual forms of word class respectively.
Figure 2: Arabic Word Classes
(Pie chart "Arabic Grouped Word Classes": v 66%, n 21%, pp 4%, Other 9%, where Other comprises adv 2%, adj 2%, pron 2%, prep 2%, conj 1%.)
Figure 2 shows the summary results for Arabic word classes, showing this
language source as a significant contribution in terms of verbs, and to a lesser
extent, nouns.
Figure 3: Italian Word Classes
(Pie chart "Italian Grouped Word Classes": n 48%, v 29%, adj 11%, pp 9%, adv 3%, interj 0%, conj 0%, Other 0%.)
Figure 3 shows the summary results for Italian word classes, showing Italian’s
significant contribution in terms of nouns, and to a lesser extent, verbs. This
further mirrors the significant contribution of Arabic.
Figure 4 shows the summary results for English word classes. The relatively low
percentage of English words (5% of the total) is consistent with established
Maltese literary convention. This result seems to underestimate the percentage of
English words in common Maltese parlance.
From a purely lexical viewpoint Standard Maltese comprises 41% from Arabic
origins, 54% from Italian and 5% from English as illustrated in Figure 1.
Linguistically this points to a predominantly Italian influence with English slowly
edging its way into the Maltese mould. The Arabic content, on the other hand, at
surface value appears to be waning to a marked degree, bearing in mind that
Maltese is still looked on as belonging to the Semitic fold.
A deeper analysis justifies the continued classification of Maltese with the
Semitic family of languages. Table 2 presents percentage data formed from the
comparison of the word classes in each of the three source language groups,
illustrated in Figures 2, 3 and 4.
Figure 4: English Word Classes
(Pie chart "English Grouped Word Classes": n 63%, v 22%, adj 8%, pp 7%.)
Table 2: Word Class Composition
Class          Arabic   Italian   English
Verbs          66%      29%       22%
Adverbs        2%       3%        0%
Nouns          21%      48%       63%
Adjectives     2%       11%       8%
Participles    4%       9%        7%
Pronouns       2%       0%        0%
Prepositions   2%       0%        0%
Conjunctions   1%       0%        0%
The following insights can be obtained from Table 2:
1. Smaller word classes in the category of pronouns, prepositions and
conjunctions are not favoured either from the Italian or the English
lexicon. This trend attests to their stronger adherence to their older Arabic
origins. Considering the sparse use of adverbs in the older (Arabic)
Maltese, this word class appears to have no preference for English with a
minor inclination towards Italian borrowings.
2. The major word classes impinging on Maltese from both Italian as well as
English are the two main classifications consisting of the Verbs class
together with the Nouns/Adjectives/Participles class.
These latter results strongly suggest that Maltese morphology, when borrowing
from Italian, exhibits a distinctly stronger preference towards the nominal
lexicology than the verbal portion, while English displays quite the opposite.
Italian borrowings from the Nouns/Adjectives/Participles class show 68%
nominal borrowings as opposed to English’s 78%. On the other hand, the Italian
verbal borrowings from the Verbs class are not that much higher at 29% than
their English counterpart of 22%. Considering the far lengthier historical-political
connection with Italian mainland compared to the relatively recent and much
shorter British connection, this result shows that verbal borrowing from English
is increasing.
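The 68% and 78% figures follow from adding the noun, adjective and participle rows of Table 2. A small sketch of that arithmetic, with the dictionary layout chosen here purely for illustration:

```python
# Percentages copied from Table 2 (rows that are 0% or absent are omitted where unused).
table2 = {
    "Arabic":  {"v": 66, "n": 21, "adj": 2,  "pp": 4},
    "Italian": {"v": 29, "n": 48, "adj": 11, "pp": 9},
    "English": {"v": 22, "n": 63, "adj": 8,  "pp": 7},
}

# Nominal share = Nouns + Adjectives + Participles for each source language.
nominal = {lang: row["n"] + row["adj"] + row["pp"] for lang, row in table2.items()}
verbal = {lang: row["v"] for lang, row in table2.items()}
```

This gives nominal shares of 68% for Italian and 78% for English, against verbal shares of 29% and 22% respectively.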
3.3
Word Classes
The contribution of the source languages to different word classes in the Maltese
language necessarily entailed an etymological analysis. The data mining tool used
for the analysis enabled us to split the data set according to word classes. This
allowed us to analyse the composition of different word class groups according to
the source languages.
Figure 5 shows a summary of the word category classes in Maltese with verbs
(43%) and nouns (37%) making up 80% of all words.
Figure 5: Word Categories of the Maltese Language
(Pie chart: v 43%, n 37%, adj 7%, pp 7%, adv 3%, conj 1%, prep 1%, pron 1%, interj 0%, Other 5%.)
Table 3 presents the normalized counts of all word classes in Maltese. The counts
are not all integers since the actual counts are based on the 1,034 word sample
that was created by duplicating word entries having multiple word classes, as
previously explained. The count was then normalised to 1,000 words.
Table 4 presents the percentage data for the word classes in standard Maltese, as
illustrated in Figure 5.
Table 4 shows a preponderance of verbs over nouns in Standard
Maltese, exceeding them by 6 percentage points. As such variation is not strong enough to
indicate a distinct characteristic, it suggests that Maltese is a less concrete and a
more conceptual language than formerly assumed. It therefore appears that
contemporary Maltese, under the influence of its two main language sources,
especially of recent times, is developing more abstract means of expression than
it was previously able to impart.
Table 3: Maltese Word Category Classes – Normalized Counts
Word Class   Counts
adj          73.5
adv          26.83
conj         5
interj       1
n            368.5
pp           68
prep         6.83
pron         8
v            442.34
A comparison of Table 2 and Table 4 points to a consistent trend towards the
Maltese linguistic structure relying more heavily upon verbs and nouns than other
word classes. The shifting from a purely root-based morphology to that of a more
diversified form with the addition of concatenation is perhaps the most significant
evolutionary device Maltese has adopted in recent times.
Table 4: Maltese Word Category Classes – Percentages
Word Class     Percentage
Verbs          43%
Adverbs        3%
Nouns          37%
Adjectives     7%
Participles    7%
Pronouns       1%
Prepositions   1%
Conjunctions   1%
Notes
1
Including, amongst others, works by Arnold Cassola, Girolamo Caracausi,
Adalgisa De Simone, Alberto Varvaro, Stanley Fiorini and Godfrey
Wettinger.
2
A prime example is Alexander Borg’s investigation of the imaala
phenomenon in Maltese, whose erratic behaviour is still left without a
definitive explanation.
3
“..., my hypothesis is that contemporary Maltese, containing a mixture of
Arabic and Romance, is directly linked with the Siculo-Arabic and not
with North African dialects as has been so far believed.” (Agius, 1996,
p.432)
4
There still lingers the remote possibility of linking Maltese to earlier
origins than the current Siculo-Arabic claim. In this paper, the term
Semitic includes, along with Arabic, the Berber element, owing to the long
standing association of the Berber language with Arabic, both in Sicily as
well as during much earlier times, along the North African littoral, with
the considerable interchange between the two lexicons. Also such term is
intentionally applied as an all-inclusive reference to any remote possible
language influences from the wider Semitic language group. In similar
manner, the terms Romance and Italian, for the scope of this study, include
medieval and modern Sicilian and Italian, with all their dialectal and
regional variations, as well as French and Andalusian Spanish with its own
Arabo-Berber influences included.
5
The population of the Maltese Islands in 1240 consisted of the following
family distribution: 836 Saracens (Muslims), 250 Christians and 33 Jewish
(Wettinger, 1968, pg. 33).
6
According to the official report by the Apostolic Delegate Dusina in 1574,
the Christian population of Malta, including the clergy, was lax in the
extreme. Thus, at this relatively late date, one might be inclined to
contemplate a religious belief and custom predominated by the
numerically stronger Muslim presence.
7
Prior to the Knights, Malta’s affairs were handled from the occupying
power's overseas quarters.
8
Mainly Italian, Portuguese, Spanish and French. The German Knights do
not appear to have exerted any influence either upon the Maltese language
or the culture of the populace (Aquilina, 1976).
9
Prior to British rule, Maltese had not acquired the status of a literary
language. The main Romance linguistic influence made its entry mainly
through the spoken idiom rather than literary texts or formal learning
methods. Hence the Romance lexical material entering Maltese was of the
most basic type, enabling it to become molded within Maltese Semitic
morphology with relative ease. In contrast, during the British rule, the
learning of Italian was formally imposed upon the populace through the
educational system, as well as the general culture of the local ecclesiastical
authorities, the administration, the law courts and the press.
10
Dormant notions of nationhood and nationalism were stimulated, resulting
in the first formations of formal and popular political agitation with the
resultant linguistic bent towards an Italianate mode of linguistic
expression.
11
The early Romance element in Maltese became intrinsically integrated
into the basic language structure, while the more modern and erudite
forms of Italian, with increasingly less input from Sicilian and Southern
Italian, tended to resist full assimilation.
12
Like fissuri (fissures) and jott (yacht).
13
In this case, word duplications due to a word having multiple word classes
were not included, hence the sample size of 1,000 words.
References:
Agius, D. (1990), ‘Il-Miklem Malti: A contribution to Arabic lexical
dialectology’, British Society of Middle Eastern Studies Bulletin, 17:2.
Agius, D. (1993), ‘Reconstructing the Medieval Arabic of Sicily’, Languages of
the Mediterranean, Msida: University of Malta.
Agius, D. (1996), ‘Siculo Arabic’, Kegan Paul International, 12.
Aquilina, J. (1959), The Structure of Maltese, Valletta: University of Malta.
Aquilina, J. (1979), Maltese-Arabic Comparative Grammar. Libya: Socialist
People’s Libyan Arab Jamahiriya Press.
Aquilina, J. (1988), ‘Criteri di etimologia siculo-maltese’, Malta e Sicilia:
Contiguita e Continuita Linguistica e Culturale. Catania: Gruppo
Linguistico Catanese.
Aquilina, G. (1988), ‘Il lessico agricolo e meteorologico nel maltese e le sue fonti
arabe e siciliane’, Journal of Maltese Studies. Malta: University of Malta.
17-18.
Bonanno, A. (1988), ‘Contiguita e continuita culturale e linguistica fra Sicilia e
Malta in eta prearaba’, Malta e Sicilia: Contiguita e Continuita
Linguistica e Culturale. Catania: Gruppo Linguistico Catanese.
Borg, A. (1978), A historical and comparative phonology and morphology of
Maltese. M.A. Thesis, Msida: University of Malta.
Borg, A. (1994), ‘On some Mediterranean influences on the lexicon of Maltese’,
Blaustein Institute for Desert Research and Ben Gurion. Israel: University
of Beer Sheeva.
Brincat, G. (1994), ‘Gli albori della lingua maltese: il problema del sostrato alla
luce delle notizie storiche di al-Himyari sul periodo arabo a Malta (870-1054)’, Languages of the Mediterranean. Msida: University of Malta.
Brincat, J. (1995), ‘Malta 870-1054: Al Himyari's Account and its Linguistic
Implications’. Msida: University of Malta.
Cassola, A. (1992), The Bibliotecha Vallicelliana Regole per la Lingua Maltese.
Egypt: Said International.
Cassola, A. (1996), Il Mezzo Vocabolario Maltese-Italiano del '700. Egypt: Said
International.
Cassar, C. (2000), Society, Culture and Identity in Early Modern Malta. Msida:
Mireva Publications.
Friggieri, O. (1979), Storja tal-Letteratura Maltija. Malta: Klabb Kotba Maltin.
Fuccio, G. (1933). ‘Il Conflitto Anglo-Maltese’, Quaderni dell'Istituto Nazionale
Fascista di Cultura, Treves-Treccani-Tumminelli, 3:8.
Holes, C. (1995), Modern Arabic: Structures, Functions and Varieties. London:
Longman Linguistics Library.
Hull, G. (1993), The Malta Language Question. Egypt: Said International.
Mifsud, M. (1995), Loan Verbs - A Descriptive and Comparative Study. Leiden:
Brill.
Schweiger, F. (1994), ‘To what extent is Maltese a Semitic Language?’,
Languages of the Mediterranean. Malta: University of Malta.
Rosner, M., R. Fabri, J. Caruana, M. Lougraïeb, M. Montebello, D. Galea and G.
Mangion. (1999), Maltilex Project. Malta: University of Malta.
Rosner, M., R. Fabri and J. Caruana. (2000), ‘Maltilex: A Computational Lexicon
for Maltese’. Msida: University of Malta.
Wettinger, G. (1973), ‘Arabo-Berber Influences in Maltese: Onomastic evidence’,
Proceedings of the First Congress on Mediterranean Studies of Arabo-Berber Influence. Msida: University of Malta.
Wettinger, G. (1979), ‘Late Medieval Judaeo-Arabic Poetry in Vatican MS411:
Links with Maltese and Sicilian Arabic’, Journal of Maltese Studies.
Msida: University of Malta. 13.
Wettinger, G. and M. Fsadni. (1968), Peter Caxaro's Cantilena: A Poem in
Medieval Maltese. Valletta: University of Malta.
Discovering regularities in non-native speech
Julie Carson-Berndsen¹, Ulrike Gut² and Robert Kelly¹
¹ University College Dublin
² University of Bielefeld
Abstract
This paper presents ongoing collaborative research which focuses on the application of
computational linguistic techniques to the analysis of a corpus of native and non-native
speech. The aim of this research is to use computational tools for modelling phonological
acquisition and representation to identify regularities and sub-regularities between
different speaker groups. The corpus is being collected and annotated at different levels as
part of ongoing research into the acquisition of prosody by non-native speakers at the
University of Bielefeld. The computational tools have been designed and implemented at
University College Dublin as part of a development environment for modelling, testing and
evaluating phonotactic descriptions of lesser-studied languages.
1.
Introduction
It is a well-known and easily observable fact that non-native speakers sometimes
produce syllables in their speech which violate the phonological rules of the
foreign or target language. This can have various causes: the simplest is that a
speech sound is produced which does not exist in the language being learned.
Alternatively, a sequence of sounds may be produced which is not permissible in
the target language. This constitutes a violation of the phonotactic rules of the
language. Many explanations have been put forward to explain the occurrence of
these types of errors, of which the claim that speakers transfer phonological rules
of their native language to their productions in the target language is the most
popular one. The majority of studies on the acquisition of phonotactic rules are
based on small numbers of participants, which reflects the time-consuming nature
of a manual analysis of this kind of data.
There are two specific aspects which motivate the work presented here. The first,
a more computational linguistic motivation, is primarily concerned with the
acquisition and evaluation of phonological structure that can be usefully
employed in speech technology. The second motivation, a more theoretical
linguistic motivation, is concerned with the application of the comparative
methodology in the context of the phonological analysis of non-native speech.
Each of these motivations is now addressed in turn in sections 2 and 3. Section 4
discusses a specific representation of phonotactic constraints and section 5
presents a set of finite state tools which are used to learn regularities from a
phonological corpus. In section 6, a particular corpus containing data from non-native
speakers of German with two different native languages, Italian and Polish,
is used for a study of phonotactic errors found in non-native speech. Section 7
concludes with a discussion of future work.
2.
Ubiquitous Acquisition of Phonotactic Constraints
Ubiquitous language technology concerns the development of language
technologies for different purposes on different platforms so that they can be
made available to everybody at all times rather than to a select group for specific
purposes. Clearly this is a long-term goal which involves a rethinking of current
approaches to speech and language technologies combined with an enhancement
based on information of varying levels of granularity paving the way for the
development of robust multilingual applications which are easily scalable. One
immediate prerequisite to this is the provision of a methodology which can be
applied to numerous languages in order to accumulate descriptions which can be
reused with varying speech technologies. One particular technology to which this
approach is being applied is based on the Time Map Model (Carson-Berndsen,
1998). This model defines the constraints on permissible combinations of sounds
in a language in terms of a phonotactic automaton. The advantage of this
approach is that each sound in the language is described with respect to the
structural context (both preceding and following) in which it can occur (see
section 4). The phonotactic automaton thus models all possible syllables of a
language and in this way caters for native speaker intuitions about the well-formedness
of phonological representations.
This paper presents an innovative methodology for automatic analysis of the
phonotactics of spoken language based on a suite of finite state tools which have
been developed primarily for use in multilingual ubiquitous speech technology.
We are primarily concerned with the phonotactic level of description, which is
employed in a computational linguistic approach to speech recognition and
synthesis. We present an XML-based representation of various types of
information defined with respect to the phonotactic context. This representation
can be learned automatically from a phonemically labelled data set and can be
processed to provide analyses of the phonological regularities which have been
found. We demonstrate how this methodology can be applied to the task of
phonotactic analysis of non-native speech.
3.
Phonotactics in Non-Native Speech
The term “phonotactics” refers to language-specific rules and constraints for how
sounds can be combined in a syllable. A German syllable (σ) consists of three
parts: an onset, a nucleus, and a coda (e.g. Wiese 1996). The nucleus constitutes
the centre of all syllables and consists of one or two vowels (V). The onset
comprises all prevocalic consonants (C) and can be filled with between zero and
four consonants. All postvocalic consonants form the coda. In German, a
sequence of up to five consonants can occur in this position. Figure 1 illustrates
this with the word springt ‘jumps’. It has the three consonants [ʃpʁ] in the onset
position, the vowel [ɪ] as the nucleus and the two consonants [ŋt] in the coda
position.
Figure 1: The syllable structure of the German word springt.
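The onset/nucleus/coda division described above can be sketched as a simple split around the vowels of a syllable. The function name `split_syllable`, the small vowel inventory and the broad transcription of springt used here are assumptions made for illustration only.

```python
# Assumed, deliberately tiny vowel inventory for the example.
VOWELS = {"ɪ", "a", "ə", "i", "u", "o", "e"}

def split_syllable(phones, vowels=VOWELS):
    """Split one syllable into (onset, nucleus, coda) around its vowel(s)."""
    vowel_positions = [i for i, p in enumerate(phones) if p in vowels]
    if not vowel_positions:
        raise ValueError("syllable has no nucleus")
    first, last = vowel_positions[0], vowel_positions[-1]
    onset = phones[:first]            # all prevocalic consonants
    nucleus = phones[first:last + 1]  # one or two vowels
    coda = phones[last + 1:]          # all postvocalic consonants
    return onset, nucleus, coda

# Assumed broad transcription of "springt".
onset, nucleus, coda = split_syllable(["ʃ", "p", "ʁ", "ɪ", "ŋ", "t"])
```

For this transcription the split yields three onset consonants, a single-vowel nucleus and two coda consonants, matching the description of the figure.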
Two kinds of syllable types can be distinguished in German (Carson-Berndsen,
1998): non-reduced and reduced syllables. Non-reduced have between zero and
three consonants in the onset position, a short or a long full vowel (or diphthong)
and up to four consonants in the coda position. Reduced syllables contain an
optional single initial consonant, a weak vowel /ə/ and an optional single final
consonant. Reduced syllables are never accented.
The occurrence and the ordering of consonants and consonant clusters in onset
and coda are constrained by the phonotactics. An example of an occurrence
constraint in German is that the phoneme /ŋ/ is not permitted in the
onset position and that voiced stops and fricatives such as /d/ and /v/ are not
permitted in the final consonant position of the coda. Ordering constraints apply
to the sequence of consonants within either onset or coda. The consonant cluster
/lp/ for example is not permissible in the onset, but can occur in the coda position,
as in the word Kalb [kalp] (see Kohler, 1995).
It has often been observed that learners of a foreign language produce “illegal”
consonant clusters in both the onset and the coda position. These differences
between native and non-native speakers of a language are systematic and are
therefore assumed to form part of the non-native speakers’ interlanguage, i.e.
their current representation of the grammar of the target language. Several
reasons for systematic phonotactic errors have been proposed: Carlisle (2002)
claims that some onset consonant clusters in words are more marked (i.e. occur
less frequently in all languages of the world) than others and that foreign
language learners always acquire the less marked onsets before the marked ones.
Similarly, Eckman (1991) proposed that the reduction of English final consonant
clusters by native speakers of Cantonese, Japanese and Korean follows universal
principles, that is, they are reduced from more marked to less marked forms.
These approaches assume that the non-native speakers’ interlanguage is governed
by universal principles. Other researchers find influence of native language
phonotactic rules in the productions of non-native syllables. Broselow (1984)
examined the consonant cluster simplifications in English produced by Arabic
native speakers and concluded that they directly reflect the phonotactic rules of
their native languages. The underlying assumption is that the interlanguage of
non-native speakers contains phonological rules of their native language.
Due to the time-consuming character of phonetic analyses, many of these studies
are based on a small number of speakers only. However, in order to test
hypotheses about the nature of interlanguage structure and the cause of systematic
errors, large-scale studies are necessary (cf. Carlisle, 2002). The object of this
paper is to present tools for automatic analysis of non-native speakers’ production
of syllable types which allow processing of large data resources and suggest a
model for interlanguage representations.
4.
Representing Phonotactic Constraints
The underlying assumption in this paper is that phonotactic constraints for any
language can be represented in terms of finite state automata which define
acceptable sequences of sounds within a syllable or word domain. A subsection
of a phonotactic automaton describing CC-clusters in English syllable onsets is
depicted in Figure 2. The complete phonotactic automaton for English syllables
allows for a distinction to be made between well-formed and illegal structures.
For example, although the word blick does not exist in English, the sequence of sounds is
permissible and would be recognised as well-formed by a native speaker of the
language, whereas the form bnick will always be rejected by native speakers of
English as a possible word. A phonotactic automaton is language-specific and can
be developed for each individual language.
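The blick/bnick contrast can be illustrated with a toy automaton. The fragment below is an assumption made for illustration: it covers only a handful of onsets and a single rhyme, and it is not the automaton shown in Figure 2 or the complete automaton for English.

```python
# Toy deterministic fragment: state 0 = start, 1 = after first onset consonant,
# 2 = after second onset consonant, 3 = after nucleus, 4 = after coda (final).
TRANSITIONS = {
    (0, "b"): 1, (0, "p"): 1,
    (1, "l"): 2, (1, "r"): 2,   # /bl/, /br/, /pl/, /pr/ are licensed; /bn/ is not
    (2, "ɪ"): 3,
    (3, "k"): 4,
}
FINAL_STATES = {4}

def accepts(phones):
    """Return True if the phoneme sequence traces a path to a final state."""
    state = 0
    for phone in phones:
        state = TRANSITIONS.get((state, phone))
        if state is None:        # no transition: the sequence is ill-formed
            return False
    return state in FINAL_STATES

blick_ok = accepts(["b", "l", "ɪ", "k"])
bnick_ok = accepts(["b", "n", "ɪ", "k"])
```

The non-word blick is accepted because every step has a licensed transition, whereas bnick fails at the second onset consonant.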
A multilingual time map (Carson-Berndsen, 2002) extends the functionality of a
phonotactic automaton by combining language-specific information at various
levels of granularity represented as a multilevel finite state transducer. The
advantage of this representation, as with the phonotactic finite state automaton, is
that it is declarative, bi-directional, efficient to process, and can be easily learned.
The multilevel finite state transducer can be viewed as an extension of the
phonotactic automaton to include (at least) the following levels: graphemes,
phonemes, allophones, canonical form, features, constraints on overlap relations,
average duration, frequency and probability. Each arc specifies information on all
of these levels (although some of this information may not be available in all
cases; however, the transducer can be readily updated at any time). This
representation serves as the basis for the learning, analysis and comparison tools
which are presented in the next section. For the specific case study presented
below, only the phoneme and canonical form levels are relevant to the discussion.
Figure 2: Subsection of phonotactic automaton depicting CC- onset in English.
A multilingual time map is represented in XML with a visualisation in terms of a
directed graph. A subsection of a German multilingual time map depicting two
consonant clusters in onset position is visualised in Figure 3. The full information
is shown for only one arc.
Figure 3: CC- onset in German.
5.
Finite State Tools
This section presents a suite of finite state tools which have been developed for
use in computational phonology applications motivated by the developments in
linear and nonlinear finite state phonology (e.g. Kaplan & Kay, 1994; Bird &
Ellison, 1994). The suite of tools is centred around two main programs GTI and
PAL, which are described firstly in terms of their general functionality. At the
end of this section the specific ways in which these programs underpin the finite
state tools for phonotactic analysis are summarised.
5.1
The Generic Transducer Interpreter
The Generic Transducer Interpreter, GTI, is a program designed to read the
structure of a finite state transducer (FST) and, using this structure, to test a set of
sequences of tokens for acceptance by that FST. FST structures are stored in
XML marked-up form. An FST consists of a set of states and state transitions. A
single start state is defined and a subset of the states is designated as final states.
The state transitions of a FST may have any number of tapes defining different
transductions between tapes.
A sequence of tokens is accepted by a FST if a path can be traced from the start
state of the FST to a final state of the FST for that sequence. Starting with the
initial state as the current state a path is traced by examining each successive
token in the sequence to determine if a state transition can be applied from the
current state. A state transition can be applied at a given state for a given token if
there is a state transition in the state transition set having a source state matching
the current state and transition symbols matching the current token. Test
sequences are currently stored in text files in which sequences must be specified
in a particular format. For GTI to test a set of sequences it must have specific
tapes of state transitions in the FST defined as input tapes. Any number of input
tapes can be specified (up to the number of tapes in the state transitions). Input
tapes are those tapes of state transitions used to match the tokens of sequences
against in order to determine if a state transition can be applied. Thus, each token
of a test sequence is an item of input for the FST. If L tapes of state transitions
have been specified as input tapes then each token in a test sequence must have L
tapes also. If GTI detects an incompatible number of tapes in any
token of any sequence in the test file then an error is flagged.
It is also possible to specify which tapes of state transitions are to be treated as
output tapes. Output tapes are those tapes for which the result of a transduction is
required. When a sequence is accepted on the input tapes of a multi-tape FST
there will be associated outputs, namely the concatenation of the sequences of
symbols present on non-input tapes. It may be necessary to examine the output
produced by any one of these output tapes. Again, any number of output tapes can
be specified (up to the number of tapes in the state transitions).
The names of both the FST file and the test sequence file together with input and
output tapes are passed to GTI via the command line. If the FST file is in the
correct format without error then GTI will create the FST as specified. As GTI
tests each sequence, it reports whether the sequence was accepted or not when
applied to the FST on the specified input tapes. Also, if output tapes are specified
then the outputs are displayed for each output tape. Note that if the FST is non-deterministic, more than one acceptance path may be traced. In this case, the
outputs for all successful acceptance paths are reported. It is also possible to
output the actual trace(s) of state transitions for each accepted sequence.
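The acceptance procedure described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the GTI implementation: the class name `FST`, the method names and the in-memory representation of transitions are hypothetical, and XML loading and the command line interface are omitted. Each transition carries one symbol per tape; a token is matched against the input tapes, and the symbols on the output tapes are collected for every accepting path, so a non-deterministic FST may yield several outputs.

```python
from collections import defaultdict


class FST:
    """A tiny multi-tape finite state transducer (illustrative, not GTI itself)."""

    def __init__(self, start, finals, transitions):
        # transitions: iterable of (source, symbols, target); symbols holds
        # one symbol per tape, e.g. ("p", "C") for a two-tape transition.
        self.start = start
        self.finals = set(finals)
        self.arcs = defaultdict(list)
        for source, symbols, target in transitions:
            self.arcs[source].append((tuple(symbols), target))

    def transduce(self, sequence, input_tapes, output_tapes):
        """Return the outputs of every accepting path (empty list = rejected)."""
        results = []

        def walk(state, position, outputs):
            if position == len(sequence):
                if state in self.finals:
                    results.append(tuple(tuple(tape) for tape in outputs))
                return
            token = tuple(sequence[position])
            if len(token) != len(input_tapes):
                raise ValueError("token has the wrong number of tapes")
            for symbols, target in self.arcs[state]:
                # A transition applies if its input-tape symbols match the token.
                if tuple(symbols[t] for t in input_tapes) == token:
                    extended = [out + [symbols[t]]
                                for out, t in zip(outputs, output_tapes)]
                    walk(target, position + 1, extended)

        walk(self.start, 0, [[] for _ in output_tapes])
        return results

    def accepts(self, sequence, input_tapes):
        return bool(self.transduce(sequence, input_tapes, output_tapes=[]))


# Hypothetical two-tape FST: tape 0 holds phonemes, tape 1 their C/V class.
fst = FST(start=0, finals=[2], transitions=[
    (0, ("p", "C"), 1),
    (1, ("a", "V"), 2),
    (2, ("t", "C"), 2),
])
print(fst.transduce([("p",), ("a",), ("t",)], input_tapes=[0], output_tapes=[1]))
print(fst.accepts([("a",), ("p",)], input_tapes=[0]))
```

The recursive walk explores every applicable transition, which is what allows all successful acceptance paths to be reported rather than only the first one found.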
5.2
The Phonotactic Automaton Learner
The Phonotactic Automaton Learner, PAL, is a program that takes a set of
training sequences of symbols and learns the structure of a stochastic finite state
transducer (SFST) that accepts the training sequences. The training sequences to
be used by PAL are stored in a text file in which sequences must be specified in a
particular format. The training sequences may have multiple tapes and in this case
all tokens of all sequences must have the same number of tapes specified or an
error will be flagged. Once learned, the SFST is represented in XML
marked-up form. The PAL program uses the ALERGIA machine induction
algorithm (Carrasco & Oncina, 1999) to learn the structure of the required SFST.
The algorithm works in two stages. First, a prefix-tree SFST is built from the
specified training sequences. A prefix-tree SFST has a single start state with a
single path from the start state to a final state for each of the training sequences.
For each token in each training sequence there is a state transition between states
along the unique path for that sequence. Also, if two training sequences share a
prefix then there is a single path in the prefix-tree SFST for that prefix. The path
diverges into distinct state transitions for the distinct postfixes after the final state
transition in the common prefix. Each state transition in the prefix-tree SFST has
a frequency of traversal dependent on the number of training sequences that share
that state transition. The frequencies of state transitions are then used to identify
the canonical SFST based on state merging.
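The first stage can be illustrated with a short Python sketch. It is a simplified, single-tape reading of the description above and not PAL's own code; the function name `build_prefix_tree` and the dictionary layout are assumptions made for the example. Shared prefixes share a path, and every transition records how many training sequences traverse it.

```python
def build_prefix_tree(sequences):
    """Build a frequency-annotated prefix tree from training sequences.

    Returns (transitions, final_counts, visits), where
    transitions[state][symbol] = [target_state, traversal_frequency],
    final_counts[state] counts the sequences ending in that state and
    visits[state] counts the sequences passing through it.
    State 0 is the single start state.
    """
    transitions = {0: {}}
    final_counts = {0: 0}
    visits = {0: 0}
    for sequence in sequences:
        state = 0
        visits[0] += 1
        for symbol in sequence:
            if symbol not in transitions[state]:
                # First time this prefix is seen: create a new state for it.
                new_state = len(transitions)
                transitions[new_state] = {}
                final_counts[new_state] = 0
                visits[new_state] = 0
                transitions[state][symbol] = [new_state, 0]
            arc = transitions[state][symbol]
            arc[1] += 1
            state = arc[0]
            visits[state] += 1
        final_counts[state] += 1
    return transitions, final_counts, visits


# Hypothetical training syllables, given as lists of phoneme symbols.
tree, finals, visits = build_prefix_tree([
    ["S", "t", "a"],
    ["S", "t", "o"],
    ["S", "p", "a"],
])
print(tree[0]["S"])   # transition shared by all three sequences
print(tree[1]["t"])   # transition shared by the first two sequences
```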
The state merging process is the second stage of the ALERGIA algorithm. Two
states are merged into a single canonical state if the language generated from that
point on is statistically identical (i.e. for each continuation path emanating from
the first state there is a matching continuation path emanating from the second,
both of which have a statistically identical frequency). If two states are merged
then all state transitions in the prefix-tree SFST that refer to either of the two
merged states, now refer to the newly created merged state. By the process of
comparing each state with each other state in the prefix-tree SFST the canonical
SFST is identified.
The ALERGIA algorithm has an associated confidence level. This confidence
determines how rigorous the learning of SFSTs is. The lower the confidence the
less likely it will be that the learned SFST is absolutely correct, that is, the less
likely it will be that the learned SFST will accept only the training sequences.
There is a default confidence which has been found by experimentation to be
sufficient for effective learning; however, an alternative confidence can be
specified. PAL accepts the name of a training sequence file and a destination for
writing the learned SFST to as well as an optional confidence through a command
line interface.
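The statistical comparison behind state merging can be sketched as follows. In the algorithm as published by Carrasco & Oncina, two relative frequencies are treated as different when their gap exceeds a Hoeffding bound controlled by a parameter, here called `alpha`. Whether PAL implements the test in exactly this form, and how its confidence value maps onto `alpha`, is not stated in the text and is an assumption here. The sketch also compares only the outgoing transitions of the two states, whereas the full algorithm repeats the check recursively on successor states.

```python
import math


def frequencies_differ(f1, n1, f2, n2, alpha):
    """Hoeffding-bound test: True if f1/n1 and f2/n2 differ significantly."""
    if n1 == 0 or n2 == 0:
        return False  # no evidence against merging
    bound = math.sqrt(0.5 * math.log(2.0 / alpha)) * (
        1.0 / math.sqrt(n1) + 1.0 / math.sqrt(n2))
    return abs(f1 / n1 - f2 / n2) > bound


def states_compatible(state_a, state_b, alpha):
    """Compare two states given as (visits, final_count, {symbol: frequency}).

    One-level check only; a full implementation would recurse into the
    target states of matching transitions.
    """
    n_a, end_a, arcs_a = state_a
    n_b, end_b, arcs_b = state_b
    if frequencies_differ(end_a, n_a, end_b, n_b, alpha):
        return False
    for symbol in set(arcs_a) | set(arcs_b):
        if frequencies_differ(arcs_a.get(symbol, 0), n_a,
                              arcs_b.get(symbol, 0), n_b, alpha):
            return False
    return True


# Hypothetical counts: same proportions, so the states may be merged.
print(states_compatible((100, 10, {"a": 90}), (50, 5, {"a": 45}), alpha=0.05))
# Plenty of data and disjoint continuations: the states are kept apart.
print(states_compatible((1000, 0, {"a": 1000}), (1000, 0, {"b": 1000}), alpha=0.05))
```

With very few observations the bound is wide, so sparsely attested states tend to be merged; this is one way in which the learned automaton comes to accept more than the training sequences.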
In summary, GTI and PAL are used specifically in the context of the tools for
phonotactic analysis of non-native speech as follows: The Learning Tool uses
PAL to learn a deterministic finite state automaton from a phonemically labelled
data set. This automaton defines the phoneme level of the multilingual time map
and can be output as XML. A canonicalisation step allows each phoneme on each
arc in the multilingual time map to be supplemented by its canonical form in
terms of V or C. The final output is thus a two-level transducer. The Analysis
Tool uses GTI to transduce from one level to another (e.g. phoneme to canonical
form or vice versa). The Comparison Tool uses GTI to partition the data into
accepted and rejected forms. An analysis summary shows how many forms were
accepted or rejected.
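The canonicalisation step can be pictured as a simple symbol mapping. The sketch below is an assumption-laden illustration: the vowel set `SAMPA_VOWELS` is a hypothetical and deliberately incomplete list of SAMPA symbols, and the function names are not taken from the tools themselves.

```python
# Hypothetical, incomplete set of SAMPA vowel symbols used for illustration.
SAMPA_VOWELS = {"a", "a:", "e:", "E", "i:", "I", "o:", "O", "u:", "U",
                "@", "6", "aI", "aU", "OY"}


def canonical_symbol(phoneme):
    """Return "V" for a vowel symbol and "C" for anything else."""
    return "V" if phoneme in SAMPA_VOWELS else "C"


def canonical_form(phonemes):
    """Map a syllable, given as a list of SAMPA symbols, to its CV skeleton."""
    return "".join(canonical_symbol(p) for p in phonemes)


def two_level_arcs(phonemes):
    """Pair each phoneme with its canonical symbol, one pair per transducer arc."""
    return [(p, canonical_symbol(p)) for p in phonemes]


print(canonical_form(["S", "t", "a", "t"]))
print(two_level_arcs(["b", "aU"]))
```

Pairing each phoneme with its canonical symbol is what turns the one-level automaton into a two-level transducer that can be read in either direction.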
6.
The study of phonotactic errors in non-native speech
This section is concerned with the application of the computational linguistic
tools presented in section 5 to the corpus-based analysis of the phonotactics of
non-native speech. The corpus used in this study is being collected and annotated
at different levels as part of ongoing research into the acquisition of prosody by
non-native speakers at the University of Bielefeld (see Milde & Gut, 2002).
6.1
Participants
The data are taken from the LeaP corpus (see note 3), which consists of prosodically
annotated non-native speech of more than 70 speakers. For the study, three data
sets were used consisting of speech produced by German, Italian and Polish
native speakers. The German natives are all speakers of Hochdeutsch (Standard
High German). The Italian native speakers were between 21 and 31 years old
when they were recorded and had been living in Germany for between two
months and five years. They all studied German at school or at University level
for up to six years prior to their arrival in Germany. The Polish speakers were
between 22 and 29 years old at the time of recording and had been living in
Germany for a period of between eight months and six years. They all have a
University degree in German from their home country, where they studied
German for between four and six years. All speakers were intermediate to
advanced speakers of German and had no active knowledge of German
phonotactic rules.
6.2
The Data
Recordings consisted of three parts. First, a short interview (approximately ten
minutes) was conducted with the non-native German speakers, in which various
questions about their language learning history such as age at first contact with
German and length of formal instruction were asked. Second, the participants
were asked to read out a short story. Third, they re-told the story in their own
words and without reference to the written text. All recordings were carried out in
Bielefeld in either a sound-treated or a quiet room. Only the readings were
analysed for this study. The first set, German_Read, consists of three readings
of a story by German native speakers. The second set, Italian_Read, consists
of five readings (total of 624 syllables) of the same story by Italian speakers of
German. The third set, Polish_Read, consists of five readings (total of 618
syllables) of the same story by Polish speakers of German.
The acoustic analysis of the data was carried out using ESPS/waves+ and Praat
and was done by one trained phonetician and four students with extensive
training and experience in phonetic analysis. The data was annotated at a number
of linguistic levels such as the level of the intonational phrase, the word, the
syllable, and the skeletal (CV) structure, and comprises annotations of the
prosodic structures of intonation and pitch range. The syllable level was then
selected from the prosodic annotation as the relevant input for the finite state
tools. All syllables were transcribed phonetically in SAMPA. Transcription was
fairly broad but included processes such as nasalisation, unreleased stops and
aspiration. In case of ambisyllabic consonants, half of the ambisyllabic
consonants was considered to belong to the preceding syllable and half to the
subsequent one.
6.3
Corpus-based investigation of non-native speech errors
The finite state tools described in section 5 were applied to the data sets described
above in order to identify regularities in the phonotactic errors produced by
Italian and Polish speakers of German. In each case a specific phonotactic
automaton for each data set was constructed using the learning tool, PAL. In
order to be able to filter the phonotactic errors, it was necessary to identify which
forms produced by the Italian and Polish speakers adhere to the phonotactics of
German. For this a complete manually constructed phonotactic automaton for
German (from Carson-Berndsen, 1998) was used; this automaton is referred to as
Phono_German_Comp.fsm below. The procedure involved the application of
the learning tool, the analysis tool and the comparison tool as follows.
The Learning Tool was applied to the German_Read data set to generate a
finite state representation of the phonotactic forms used in the read speech of the
German native speakers. The output of this was termed
Phono_German_Read.fsm. The phonemic forms contained in the
Italian_Read data set were then compared against the
Phono_German_Read.fsm using the Comparison Tool and the rejected
forms were then compared against the Phono_German_Comp.fsm. This
ensured that all forms which corresponded to the phonotactics of native speakers
of German (the target language) were filtered and that any remaining forms in the
data set corresponded to phonotactically ill-formed structures which had been
produced by Italian native speakers of German.
The Learning Tool was once again applied to these forms to produce an
automaton representation of the ill-formed structures, termed
Italian_Rejected.fsm. The Analysis Tool was applied to this data in order
to map from the phonemic representations to the canonical CV form. This
resulted in the data set, Italian_Read_Canonical, which characterised the
canonical forms of all the phonotactically ill-formed structures in the
Italian_Read set. Finally, the Comparison Tool was reapplied to associate all
phonemic realisations with a canonical form so that particular phonotactic
anomalies and regularities with respect to the target language could be identified.
The flowchart for the application of the tools in this task is depicted in Figure 4
with respect to the Italian data.
Figure 4: The application of the finite state tools to the Italian data.
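The two filtering rounds of this procedure can be summarised in a small Python sketch. The two acceptors are stood in for by plain membership tests on sets of forms, which is an assumption made purely for illustration; the function name `filter_ill_formed` and all example forms are hypothetical.

```python
def filter_ill_formed(learner_forms, read_accepts, complete_accepts):
    """Two-round filter over distinct syllable forms.

    read_accepts and complete_accepts are callables standing in for the
    acceptors Phono_German_Read.fsm and Phono_German_Comp.fsm.
    """
    # Round 1: keep forms not attested in the native read speech.
    rejected_by_read = [f for f in learner_forms if not read_accepts(f)]
    # Round 2: of those, keep forms the complete German automaton also rejects.
    ill_formed = [f for f in rejected_by_read if not complete_accepts(f)]
    return rejected_by_read, ill_formed


# Hypothetical data: sets of accepted forms play the role of the automata.
native_read = {("S", "t", "a"), ("d", "a")}
complete = native_read | {("b", "l", "aU")}
learner = [("d", "a"), ("b", "l", "aU"), ("z", "d", "a")]

round1, round2 = filter_ill_formed(
    learner, native_read.__contains__, complete.__contains__)
print(round1)
print(round2)
```

Only the forms surviving both rounds would be passed on to the Learning Tool and mapped to their canonical CV forms.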
An analogous procedure was followed for the analysis of the Polish_Read
corpus. The results are presented in the next section. One point to be noted here is
that whilst the learning tool does, of course, generalise to some extent over these
forms, there is no notion of complete coverage as defined in the data set, although
currently we are investigating projection techniques to cater for gaps caused by
the lack of data which would allow for a distinction to be made between
idiosyncratic gaps and systematic gaps. This can be based on the notion of natural
classes of sounds which function similarly in a particular phonotactic context.
6.4
Results
Table 1 lists the percentage and absolute number of syllables in Italian_Read
and Polish_Read which were rejected after comparing them first to
Phono_German_Read and then to Phono_German_Comp. Only syllable
types were considered, that is, if a rejected syllable occurred more than once it
was only counted as one rejection. Thus, after two rounds of application, a total
of 54% of the syllables produced by the Italian non-native speakers and 48.7% of
the syllables produced by the Polish non-native speakers were rejected.
Table 1: % of rejected syllables.
                                                         Italian             Polish
% of syllable tokens rejected by Phono_German_Read       67% (418 of 624)    59.5% (368 of 618)
% of syllable tokens rejected by Phono_German_Complete   80.6%               81.7%
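The overall rejection rates quoted above follow from the two rounds if the second-round percentages are read as shares of the syllables rejected in the first round. That reading is an assumption, but it reproduces the reported totals to within rounding, as this short check shows.

```python
# Figures taken from Table 1; the interpretation of the second row as a
# share of the first-round rejections is an assumption.
italian_first = 418 / 624               # rejected by Phono_German_Read
polish_first = 368 / 618
italian_total = italian_first * 0.806   # also rejected by the complete automaton
polish_total = polish_first * 0.817

print(f"Italian: {italian_first:.1%} after round one, {italian_total:.1%} overall")
print(f"Polish:  {polish_first:.1%} after round one, {polish_total:.1%} overall")
```

The Polish figure computed from the rounded table values comes out marginally below the 48.7% given in the text, which is consistent with the second-round percentage having been rounded.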
The results of the application of the Learning Tool to the remaining syllables
were then classified manually into
a) phonemic inventory errors (phonemes that do not occur in German were
produced)
b) onset consonant cluster errors
c) coda consonant cluster errors
Table 2 summarizes the results for the Italian and Polish speakers. Distinct
differences between the phonotactic violations produced by the Italian and the
Polish speakers can be seen. On the whole, the Polish speakers produce fewer
errors; different types of illegal initial consonant clusters occur in the two speaker
groups, and the postvocalic r (which is produced as an a-schwa in German)
constitutes the major problematic area for the Italian speakers.
Table 2: Error analysis of Italian and Polish speakers of German.
Error type                       Italian                                                 Polish
Phoneme inventory errors         6 consonants, 5 monophthongs, 16 diphthongs             4 consonants, 3 monophthongs, 24 diphthongs
Onset consonant cluster errors   10 types, predominantly [nd], [mb]                      6 types, 2 initial consonants [x, N]
Coda consonant cluster errors    2 types, 14 final voiced consonants, 27 postvocalic r   1 type, 11 final voiced consonants, 1 postvocalic r
7.
Conclusion
In this paper we demonstrated how a suite of finite state tools can be applied to
study the phonotactics of different varieties of spoken language. The results imply
that research in second language acquisition can potentially benefit enormously
from this methodology as it allows a rapid analysis of large corpora. While the
corpus described in this study was relatively small, to perform a manual analysis
of this data would have been a laborious task. Future work involves employing
the tools to analyse phonotactic differences among other non-native speech
groupings using the LeaP corpus. Currently this corpus consists of 253 annotated
recordings of between 2 and 30 minutes’ duration by 88 different speakers with
21 different native languages. Since the corpus is annotated at a number of
linguistic levels as described above, inter-level analyses are possible. For the
application of the finite state tools, each level of annotation is viewed as a tier,
analogous to the representations of autosegmental phonology (Goldsmith, 1990).
Analysis can take place either with respect to individual tiers or with respect to an
associated set of tiers. In the latter case, one tier is chosen as the primary tier and
the others are associated with it in terms of overlap and precedence relations
between the units as suggested in Carson-Berndsen (1998: 60). Using the
computational linguistic tools, finite state automaton and finite state transducer
representations of the tiers are extracted automatically from the annotated corpus.
Regularities in the data are then identified either with respect to a single tier or
with respect to an associated set of tiers. In addition to identifying phonotactic
errors with respect to non-native speakers of one particular language, we are
also currently applying this methodology to the investigation of the phonotactics
of typologically different languages, in particular spoken Yoruba and Igbo.
Notes
1
This work was part funded by an Enterprise Ireland Research Innovation
Fund grant (no. IF/2001/021) and an Enterprise Ireland International
Collaboration grant (no. IC/2002/053).
2
The LeaP project is funded by the MSWF (Ministry for Education of
North-Rhine Westphalia, Germany).
3
http://www.spectrum.uni-bielefeld.de/LeaP/
References
Bird, S. & T. M. Ellison (1994), One-level phonology: autosegmental
representations and rules as finite state automata, Computational
Linguistics 20: 55-90.
Broselow, E. (1984), An investigation of transfer in second language phonology.
International Review of Applied Linguistics 22 (4), 253-269.
Carlisle, (2002), The Acquisition of Two and Three Member Onsets: Time III of
a Longitudinal Study. Proceedings of New Sounds 2000, pp. 42-47.
Carrasco, R. C. & Oncina, J. (1999), Learning deterministic regular grammars
from stochastic samples in polynomial time, ITA, Vol.33, No.1, 1-19.
Carson-Berndsen, J. (2002), Multilingual time maps: portable phonotactic models
for speech technology. In Proceedings of the LREC 2002 workshop on
Portability Issues in Human Language Technology.
Carson-Berndsen, J. (1998), Time map phonology. Dordrecht: Kluwer.
Eckman, F. (1991), The structural conformity hypothesis and the acquisition of
consonant clusters in the interlanguage of ESL learners. Studies in Second
Language Acquisition 13, 23-42.
Goldsmith, J. (1990), Autosegmental and Metrical Phonology, Cambridge, Mass:
Basil Blackwell.
Kaplan, R. & M. Kay (1994), Regular models of phonological rule systems,
Computational Linguistics 20:331-378.
Kohler, K. (1995), Phonetik des Deutschen. Berlin: Erich Schmidt.
Milde, J.-T. & Gut, U. (2002), A prosodic corpus of non-native speech. In: B. Bel
& I. Marlien (eds.) Proceedings of the Speech Prosody 2002 conference,
11-13 April 2002. Aix-en-Provence: Laboratoire Parole et Langage, pp.
503-506.
Wiese, R. (1996), German phonology. Oxford: Clarendon.
Tracking lexical changes in the reference corpus of Slovene
texts
Vojko Gorjanc
University of Ljubljana
Abstract
The text focuses on lexical borrowings from English, introduced into Slovene in the last
decade of the 20th century. Using the FIDA corpus, a reference corpus of Slovene, new
lexical items can be tracked over the last decade: by means of corpus analysis, we can
determine when a word entered the Slovene language and how it established itself in the
language. Corpus analysis reveals great creativity of Slovene language speakers; in
addition to loan words, original Slovene coinages occur almost invariably. Corpus data
shows a great deal of variability originating from the desire of the speakers of Slovene to
coin new expressions, but after a few years, the data begin to reveal the prevailing variant.
1.
Introduction
With the help of a corpus, we can track lexical changes quickly and reliably, and
also observe the response of a selected language to new lexical items introduced
into it from other languages, e.g. English, or some other language with which the
selected language has direct contact; in the case of Slovene, these are Italian,
German, Hungarian and Croatian.
This paper focuses on lexical items introduced into Slovene in the last decade of
the 20th century. The starting point for the comparison with the state in the
corpus of the Slovene language is John Ayto’s list of English lexical items from
the 1980s and 1990s (Ayto 1999); the subsequent corpus analyses focus on
lexical items from the fields of the Internet and computer science. With corpus
investigations, we will try to determine how the Slovene language reacts to
lexical items introduced into Slovene from English. Using the corpus, we can
track the new lexical items through the last decade, and observe their
characteristics in the corpus.
2.
The Corpus
The Corpus of the Slovene Language, called FIDA, is a reference corpus of
Slovene compiled by a consortium of four project partners: University of
Ljubljana (Faculty of Arts), Jozef Stefan Institute Ljubljana, DZS Publishing
House and Amebis software company (http://www.fida.net). The corpus is
composed of contemporary Slovene texts, the majority of which were published
in the 1990s. The corpus contains just over 100 million words, encompassing a
broad variety of language variants and registers. It is composed of written texts
and texts originally produced as written-to-be-spoken. The transcripts of Slovene
parliamentary proceedings are the only spoken component of the corpus.
The basic corpus characteristics according to taxonomic corpus parameters are (in
%):
Parameter                 Category      %
Medium                    spoken        1.97
                          electronic    0.03
                          written       98.00
Text type                 literary      5.94
                          technical     18.46
                          other         75.60
Linguistic proofreading   yes           63.92
                          no            3.13
                          unknown       32.95
The FIDA corpus is lemmatised and morpho-syntactically tagged, but all the
tagging was done automatically without the possibility of disambiguation in cases
where double or even multiple lemmas were possible. Since Slovene is a
morphologically complex language, double or triple lemmas are frequent, which
makes statistical data from the corpus unreliable to some extent. In the last few
years, some significant steps were taken to solve the problem, both by testing the
existing language-independent tools and by developing new ones (Džeroski and
Erjavec, 2000; Mladenić, 2002) but the situation is still far from ideal. Although
less acute, the lemmatisation of non-lemmatised words is another problem
waiting for the improvement of text processing tools for Slovene. The
lemmatisation of the FIDA corpus was based on the lexicon developed by the
software company involved in the project. Experience shows that in certain cases
non-lemmatised words skew the results of statistical analysis, so all of this has to
be taken into account when interpreting the corpus data (Gorjanc and Krek,
2001).
3.
Methodology
Using a wordlist from the Slovene corpus, we will obtain information on the
lexical items from Ayto's list relevant for the Slovene language. By means of
corpus analysis, we will determine when a word occurs in the Slovene language
and how it establishes itself in the language, and by means of statistical analysis,
we will determine the possible collocations of the selected word and their changes
from the first occurrence of the word until the end of the decade. Since pairs of
synonyms or strings often occur with new expressions, we will try to determine
how they occur in the corpus and how they disappear. With the help of markers of
semantic relations already identified for the Slovene language by corpus analysis,
we will identify pairs of synonyms and strings within the corpus. We will focus
on the distribution of synonyms within the corpus regarding the time of their
occurrence, and consider when and why one of the synonyms becomes dominant
in the language while the other variants disappear.
3.1
Extracting collocations
For extracting collocations, the MI3 value introduced by McEnery et al., which
draws on the probability of a word pair occurring together or separately, will
be used (McEnery, Langé, Oakes and Véronis 1997). The MI3 value has turned
out to give sound information for content words in Slovene. On the other hand,
with this value it is hard to identify function words as part of collocations, in
particular in the case of collocators of verbs and nouns with prepositional
phrases. For instance, to detect prepositions, raw frequency statistics provide more
valuable information. Once a pair such as noun + preposition has been detected,
the MI3 value for the whole pair is calculated to extract the string of
collocators.
The comparison of results between T, MI and MI3 values for the Slovene corpus
shows that MI3 is the most effective of the three. MI introduced in Church and
Hanks (1989) is far less effective, since the frequency of corpus elements is
underestimated and a single co-occurrence of two elements in the corpus gives
high scores which can diminish the importance of more frequent lexical units
(Manning and Schütze, 1999). This fact is even more relevant in the case of the
FIDA corpus, since specific forms of non-lemmatised words are attributed high
MI scores. To some extent, MI3 neutralises the effects of low frequency of a
corpus element which is why it gives better results for collocations.
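The contrast between the three measures can be made concrete with a small Python sketch. The formulas are the commonly cited formulations of MI, MI3 and the T-score; the text itself does not spell them out, so their exact form in the tools used for FIDA is an assumption, and all counts below are invented for illustration.

```python
import math


def mi(f_xy, f_x, f_y, n):
    """Mutual information: log2 of observed over expected co-occurrence."""
    return math.log2(f_xy * n / (f_x * f_y))


def mi3(f_xy, f_x, f_y, n):
    """Cubed variant: the joint frequency is raised to the third power."""
    return math.log2(f_xy ** 3 * n / (f_x * f_y))


def t_score(f_xy, f_x, f_y, n):
    """T-score: observed minus expected, scaled by the root of the observed."""
    expected = f_x * f_y / n
    return (f_xy - expected) / math.sqrt(f_xy)


# Hypothetical counts in a corpus of 100 million tokens.
n = 100_000_000
node = 20_000                     # frequency of the node word
rare = (1, node, 1, n)            # a misspelling occurring once, next to the node
frequent = (2_000, node, 50_000, n)  # a frequent, genuine collocate

print(f"MI:  rare {mi(*rare):.2f}  frequent {mi(*frequent):.2f}")
print(f"MI3: rare {mi3(*rare):.2f}  frequent {mi3(*frequent):.2f}")
print(f"T:   rare {t_score(*rare):.2f}  frequent {t_score(*frequent):.2f}")
```

With these invented counts MI ranks the one-off form above the frequent collocate, while MI3 reverses the order, which mirrors the behaviour described above for non-lemmatised forms.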
The noun računalnik (Engl. computer) with collocators according to MI and MI3
values in the FIDA corpus (frame 3)
MI3                               MI
oseben (Engl. personal)           =uporablja
prenosen (Engl. portable)         =appleov
na (Engl. on)                     =deskpro
vaš (Engl. your)                  =skreširani
biti (Engl. to be)                =pomečite
z (Engl. with)                    =nnnn
in (Engl. and)                    =gxi
v (Engl. in)                      =upsajo
poznavanje (Engl. knowing)        =85prenosni
za (Engl. for)                    =megatronski
delo (Engl. work)                 =macintosch
moj (Engl. my)                    =optiplex
uporabljati (Engl. to use)        =sprojektiram
zmogljiv (Engl. high capacity)    =blagajničarska
zagnati (Engl. to start)          =dlančni
internet (Engl. Internet)         =brskalnemu
ki (Engl. which)                  =vseuporabljajo
uporaba (Engl. usage)             =pc486
=windows*                         izklopitev
delati (Engl. to work)            =ignororajo
* = non-lemmatised
By using MI3 values, we obtain relevant results for collocations, while the highest
MI values are generally non-lemmatised words, mostly spelling errors. From the
point of view of non-lemmatised words the results are interesting, since they
reveal some of the problems the speakers of Slovene have with forming new
words, e.g. the adjective derived from the name Apple; in Slovene there is an
attempt to use the suffix -ov, appleov.
3.2
Identifying markers of semantic relations
Semantically related words – synonyms, hyper- and hyponyms, abbreviations – often collocate or appear in similar contexts and it is usually possible to identify
the domain of a word on the basis of its textual environment. When we want to
explain the relations between concepts within the reality portrayed, we often use
explicit linguistic structures or phrases, such as X is defined as Y, X is an instance
of Y, There are several types of X, for example A, B, C etc. So, if we, for example,
identify the pattern “A, also known as B” to indicate complete or near synonymy,
we can extract the noun phrases linked by the pattern as pairs of synonyms.
Similar methods are proposed by several authors in the field of terminology,
either for the purpose of an automatic construction of knowledge databases
(Bowden et al., 1996), conceptual sampling for terminography (Meyer et al.,
1999) or simply a search for synonyms in a corpus (Pearson, 1998).
For Slovene the most frequent markers were identified by analysing the FIDA
corpus. For synonymy, these markers are ali (Engl. or), ali tudi (Engl. also),
imenujemo tudi (Engl. we also refer to it as), imenovan* tudi, (sinonim _) (Engl.
also referred to as), je sinonim za (Engl. is a synonym for), znan* tudi kot (Engl.
also known as), znan* tudi pod imenom (Engl. also familiar as), z drugim
imenom (Engl. also called), ... (Vintar and Gorjanc 2000). For the purpose of this
paper, these findings were used to extract pairs or strings of synonyms of selected
words.
Internet ali medmrežje, kot so ga izvirno poslovenili /.../, že dolgo ni več
neznanka povprečnemu Slovencu.
Pair of synonyms: internet medmrežje
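Marker-based extraction of this kind can be approximated with regular expressions. The sketch below is a simplified illustration and not the procedure of Vintar and Gorjanc: it uses only a few of the markers listed above, treats the wildcard in znan* as an optional word ending, and captures single words rather than full noun phrases. The names `MARKERS`, `PATTERN` and `synonym_pairs` are hypothetical.

```python
import re

# A handful of the synonymy markers; longer alternatives come first so that
# "ali tudi" is preferred over plain "ali".
MARKERS = [
    r"ali tudi",
    r"ali",
    r"znan\w* tudi kot",
    r"znan\w* tudi pod imenom",
    r"z drugim imenom",
]
PATTERN = re.compile(
    r"(\w+),? (?:%s) (\w+)" % "|".join(MARKERS), re.IGNORECASE)


def synonym_pairs(sentence):
    """Return (left, right) single-word candidates linked by a marker."""
    return [(m.group(1).lower(), m.group(2).lower())
            for m in PATTERN.finditer(sentence)]


example = ("Internet ali medmrežje, kot so ga izvirno poslovenili /.../, "
           "že dolgo ni več neznanka povprečnemu Slovencu.")
print(synonym_pairs(example))
```

Because ali is also the ordinary conjunction "or", a pattern this simple over-generates, so in practice the candidate pairs would need to be checked against their contexts.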
4.
Corpus analysis
In this analysis, we will focus on some key lexical elements, from the field of the
Internet, which have undergone the process of being accepted into the Slovene
language. Keeping loan words unchanged, the most passive response of
recipient-language users, proves to be temporary and leaves open the
possibility of coining new terms; it turns out that if the new terms are formed in
a way which is acceptable to users, no problems arise in introducing the Slovene
variant. Let us illustrate this, using the example of the term World Wide Web, and
its Slovene variant, svetovni splet (the graph below is in %).
[Graph, in %; x-axis: 1994 to 1999; series: world wide web, svetovni splet]
Figure 1: Occurrences of World Wide Web, and its Slovene variant, svetovni splet
in the FIDA corpus between 1994 and 1999
In the two years after its first appearance, only the loan word occurs in the corpus,
but when the Slovene variant appears, it immediately becomes a successful rival
and the use of the loan word gradually decreases.
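Percentages of this kind can be derived from dated corpus hits by computing each variant's share of the occurrences per year. The following sketch shows one way to do so; the function name `yearly_shares` is hypothetical and the hit list is invented purely for illustration, so it does not reproduce the FIDA figures.

```python
from collections import Counter, defaultdict


def yearly_shares(hits):
    """hits: iterable of (year, variant) pairs, one per corpus occurrence.

    Returns {year: {variant: percentage of that year's occurrences}}.
    """
    counts = defaultdict(Counter)
    for year, variant in hits:
        counts[year][variant] += 1
    return {
        year: {variant: 100.0 * c / sum(by_variant.values())
               for variant, c in by_variant.items()}
        for year, by_variant in sorted(counts.items())
    }


# Invented occurrences, not taken from the corpus.
hits = ([(1994, "world wide web")] * 3
        + [(1996, "world wide web")] * 2
        + [(1996, "svetovni splet")] * 2)
shares = yearly_shares(hits)
for year, by_variant in shares.items():
    print(year, by_variant)
```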
In texts, the dominance of the Slovene synonym over the loan word is even more
obvious in the case of another key word from the field of the Internet, i.e. home
page. After eliminating corpus noise related to proper names of pages, it turns out
that the Slovene term has dominated completely (91.8% of corpus occurrences).
In addition to the calque domača stran (Engl. home page), there is also a new
term predstavitvena stran (Engl. web presentation page) coined in Slovene, but it
seems that the calque from English is more acceptable.
The fate of the following term from the field of IT is quite different. The English
term screen saver entered the Slovene language in the 1990s.
[Graph, in %; series: 1 screen saver; 2 varčevalnik zaslona; 3 ohranjevalnik zaslona; 4 ohranjevalec zaslona]
Figure 2: Occurrences of screen saver, and its Slovene variants, varčevalnik
zaslona, ohranjevalec zaslona and ohranjevalnik zaslona in the FIDA
corpus
After the loan word, the calque varčevalnik zaslona occurs next, but a later
Slovene term formed using the element ohranjeva- (Engl. keep) turns out to be
more acceptable. At first, there are two variants, but later the form with the
suffix -ik dominates. While it is true that in Slovene a new word varčevalnik
(Engl. saver) is derived from the verb varčevati (Engl. save), it seems that the
semantic link is not strong enough for the speakers. In the corpus, the verb
varčevati (Engl. save) tends to collocate with words such as banka (Engl. bank),
denar (Engl. money); začeti (Engl. start), splačati (Engl. worth). It thus covers a
semantic field which the speakers do not associate with new terminology from the
field of IT.
The term Internet itself has become fully integrated into the Slovene language;
this is partly due to its everyday use. As a noun, it occurs as a premodifier in noun
phrases: e.g., internet storitev (Engl. Internet service), internet naslov (Engl.
URL), internet povezava (Engl. Internet connection), internet ponudnik (Engl.
Internet service provider), internet stran (Engl. Web page), internet račun (Engl.
Internet account), internet protokol (Engl. Internet protocol). In the Slovene
language, the new type of noun phrase with a noun functioning as a premodifier
is becoming increasingly common. This type of noun phrase is formed under the
influence of the English language; traditionally, the premodifier in a Slovene noun phrase
had to be an adjective. The noun as the premodifier appears frequently, even
though there is also the possibility of forming the noun phrase with an adjective
as the premodifier. The noun Internet happens to be extremely prolific in terms of
word formation, since it forms adjectives using the suffixes -ni, -ski and -ov:
internetni, internetski, internetovski, internetov in Adjective+Noun combinations;
the adverb internetsko using the suffix -o in Adverb+Verb combinations, and
newly derived nouns using the suffixes -ar and -ec: internetar; internetovec, as
well as new compound nouns, e.g. internetdžanki (Engl. Internet junkie). In
adjectives, the relatively high variability indicates that the newly formed words
have not yet been fully accepted. The collocations of an individual adjective show
that the collocators of the adjectives internetni, internetski and internetovski
overlap /service, page, search engine, business, shop, bookseller, service
provider.../, so that it is impossible to determine the specific dependent links
between words. Therefore, it seems that usage is largely free and different
variants of the adjective are possible with the same headword. In the case of the
adjective internetov, which is the least common of the adjectives listed above, the
link to the headword is completely dispersed; this indicates that the suffix variant
-ov is not integrated and consequently inappropriate for the classifying character
of the adjective, generally expressed by the adjectival suffixes -ni and -ski in
Slovene. The frequent use of the classifying adjective with the suffix -ni shows a
prevalence of this variant, and its only real rival is the classifying adjective with
the suffix -ski.
[Bar chart: 1 internetski; 2 internetni; 3 internetov; 4 internetovski]
Figure 3: Occurrences of adjectives Internet using the Slovene suffixes -ni, -ski
and -ov in the FIDA corpus
On the other hand, the Slovene term medmrežje, introduced when the term internet
was already fully accepted, has a very low corpus frequency (corpus occurrences
medmrežje : internet = 2.2% : 97.8%), and in terms of word formation it is not
productive. Although speakers of Slovene have not accepted the Slovene term
medmrežje, normativist linguists continue their so far unsuccessful attempts to
impose medmrežje in place of internet.
In assigning terms to concepts, the Internet has stimulated the formation of two
strings of newly formed terms, which seem to be growing in prolificacy with the
development of the Internet, i.e., terms of the type e-mail and terms of the type
kiber- and cyber-. Among the latter, a great variability in spelling can be observed
in Slovene: the new terms can be spelled as two words kiber prostor (Engl.
cyberspace), cyber kavarna (Engl. cyber-café), or as one word, with or without a
hyphen, e.g., kiber-kič (Engl. cyber-kitsch), cyber-kultura (Engl. cyber-culture)
and kibersvet (Engl. cyberworld), cyberfolk (Engl. cyberfolk). The spelling of
these terms is an important point of dispute among Slovene grammarians at the
moment, with the question of the influence of the English pattern being
particularly prominent. New terms of the type e-Noun, e-Adjective have already
been presented and show a very open series (Jakopin, 2001); here, let us consider
the trends in Slovene with kiber- and cyber-.
At first glance, the results of the corpus are extremely dispersed; the process is
very productive, and both the loan element as well as the Slovene element are
prolific. With the loan element, the tendency to write the new terms as two
separate words can be observed; the one-word spelling is reserved for a limited
number of complete loan words, e.g. cyberspace, cybersex, cybercash. However,
as there are so many new terms coined and cyber- generally occurs with a
Slovene second element, the new term is generally spelled as two separate words,
e.g., cyber otroški vrtec (Engl. cyber-kindergarten), cyber jasli (Engl. cyber-crèche), cyber kavarna (Engl. cyber-café), cyber umetnost (Engl. cyber-art).
Consequently, the two-word spelling is used with loan words as well, e.g. cyber
café, cyber space. The pattern is quite open and this can be seen in the possibility
of the hyphenated spelling for both complete loan words as well as for the
combinations of a loan element and a Slovene element, e.g. cyber-space, cyber-punk; cyber-gostilničar (Engl. cyber-innkeeper), cyber-klobasica (Engl. cyber-sausage). The hyphenated spelling is particularly common with adjectives, e.g.
cyber-totalitarni (Engl. cyber-totalitarian), cyber-kavbojski (Engl. cyber-cowboy).
On the other hand, there is a tendency towards one-word spelling with the Slovene
version kiber; the spelling change thus only enables the formation process in
Slovene. The one-word spelling is also a consequence of the fact that the
adjectives kibernetični and kibernetski (Engl. cybernetic) assume the attributive
function. The greater assimilation of kiber into the Slovene language is confirmed
by the terms coined for the male and female representatives of the cyberworld:
kibernetnica, kibernetničar and kibernetničarka, and it also occurs in the root of
the verb kiberseksati (Engl. to have cybersex).
In this case, the almost unbelievable range of terms reveals a dynamic process of
assigning terms to concepts in Slovene, while the corpus indicates that the new
terms with kiber- prevail (over 60%). The use also shows that the lexical elements
kiber and cyber are extremely popular. In Slovene, it seems that the speakers
accept the lexical term as semantically emptied, as something very fashionable at
the moment. Thus, as we have seen, the collocations may be rather unusual.
5. Conclusion
In its observation of the process of accepting new lexical items into Slovene, the
corpus analysis reveals the great creativity of Slovene language speakers; in
addition to loan words, original Slovene expressions occur almost invariably.
Corpus data shows a great deal of variability, linked above all to the desire for
original expressions. However, after a few years, the data begin to reveal the
prevailing variant. The question of which variant is eventually fully accepted,
though, remains open. Full acceptance of a variant is conditioned by a series of
linguistic and even more non-linguistic factors. It is very important that the new
terms emerge spontaneously in the language community and are not introduced
into the language by linguistic intervention. If the speakers of Slovene even begin
to suspect that the term is an attempt at linguistic intervention, they will probably
reject it. In order for a variant to be fully accepted in Slovene today, it should
have an appropriate semantic basis, so that the speakers of Slovene can identify it
as “sexy” and “cool”.
References
Ayto, J. (1999), 20th Century Words. Oxford, Oxford University Press.
Bowden P., P. Halstead and T Rose (1996), Extracting Conceptual Knowledge
from Text Using Explicit Relation Markers. In Proceedings of EKAW-96,
Nottingham, pp. 147-162.
Church K. and P. Hanks (1989), Word association norms, mutual information and
lexicography, in: Proceedings of the 27th Annual conference of the
Association of Computational Linguistics, pp. 76-82.
Džeroski S. and T. Erjavec (2000), Learning to lemmatise Slovene words, in: J.
Cussens and S. Džeroski (eds), Learning Language in Logic. Berlin,
Springer, pp. 69-88.
Gorjanc V. and S. Krek (2001), A corpus-based dictionary database as the source
for compiling Slovene-X dictionaries, in Proceedings of the COMPLEX
2001 6th Conference on Computational Lexicography and Corpus
Research, Birmingham, pp. 41-47.
Jakopin P. (2001), Words and nonwords as basic units of a newspaper text corpus,
in: Proceedings of the COMPLEX 2001 6th Conference on Computational
Lexicography and Corpus Research, Birmingham, pp. 49-65.
Manning C. and H. Schütze (1999), Foundations of Statistical Natural Language
Processing. Cambridge MA: The MIT Press.
McEnery T., J. Langé, Oakes, M. and J. Véronis (1997), The exploration of
multilingual annotated corpora for term extraction, in R. Garside, G.
Leech, A. McEnery (eds.), Corpus Annotation. Linguistic Information
from Computer Text Corpora. London, Longman.
Meyer, I., Mackintosh K., Barriere, C. and T. Morgan (1999), Conceptual
sampling for terminological corpus analysis, in Sandrini (ed.),
Proceedings of TKE ’99. Vienna, TermNet, pp. 256-267.
Mladenić D. (2002), Automatic word lemmatisation, in: T. Erjavec and J. Gros
(eds.), Jezikovne tehnologije, Language Technologies. Ljubljana, Institut
Jozef Stefan, pp. 153-159.
Pearson J. (1998), Terms in Context. Amsterdam, John Benjamins.
Vintar Š. and V. Gorjanc (2000), Identifying markers of semantic relations in
Slovene. http://www2.arnes.si/vinta/telri.rtf
Relating linguistic units to socio-contextual information in a
spontaneous speech corpus of Spanish
José María Guirao
Universidad de Granada
Antonio Moreno Sandoval, Ana González Ledesma, Guillermo de la Madrid,
Manuel Alcántara
Universidad Autónoma de Madrid
Abstract
This chapter shows the application of statistical tests to a corpus of spontaneous spoken
Spanish. Our goal is to find representative differences between different parts of the
corpus. To this end, we tagged n-grams in the corpus with features related to the speaker
(age, gender, etc.), or the context (dialogue, monologue, media, etc.), and applied the log-likelihood test (Dunning, 1993) in order to find the most distinctive lexical or grammatical
items for each specific socio-contextual feature.
This chapter is divided into three sections. In the first, the characteristics of the spoken
corpus are shown. The second section is devoted to the explanation of the computational
tool. In the third section, a first rough estimate of the results obtained is given, as well as
possible applications of the model.
1. The Spanish corpus of the C-ORAL-ROM project
C-ORAL-ROM is a multi-lingual corpus of spontaneous speech for the four main
Romance languages: French, Italian, Portuguese and Spanish (Cresti et al. 2002).
The project is funded by the EU under the V Framework Programme (IST-2000-26228), and the consortium consists of 9 partners, co-ordinated by the University
of Florence. The remarkable feature of C-ORAL-ROM is its spontaneity: texts
have been recorded in their actual context and without any script. Each sub-corpus is made up of 300,000 words with the same text distribution to assure
comparability and sufficient register representation. The resource will be
delivered in several formats: an orthographic transcription, an xml-tagged
version, and the aligned audio source. Partial linguistic annotation will be
provided, as well as some programs to handle the resources and quantitative
studies. This paper shows preliminary results with respect to the Spanish corpus.
1.1 Differences between a speech database and a corpus of spontaneous speech
When discussing spoken resources, a preliminary distinction has to be made.
Most linguistic resources currently available are speech databases: collections of
high-quality recordings and detailed phonetic transcriptions of speech set up in
controlled environments (typically telephone services). These speech databases
are mostly used for training and testing speech systems and they are developed by
and for the language engineering industry. They aim to serve as a basis for
recognizing and producing speech in restricted, predictable domains. In most
cases, those databases contain many samples of the same word (that is, many
tokens of the same type). Usually, the utterances are prepared and pronounced by
professional speakers. The acoustic quality of the recording is essential. Speech
databases usually provide detailed phonetic descriptions, including disfluencies,
noises and other sounds. In general, those databases reflect the standard register,
and distant variants (dialects, jargons) are poorly represented. Examples
are SpeechDat (LRE-63314, Infrastructure for Spoken Language Resources) and
SpeechDat II (LRE2-4001, Speech Databases for the Creation of Voice Driven
Teleservices), which have set a standard for this type of resource.
On the other hand, corpora of spontaneous speech are typically collections of a
wide variety of spoken registers and non-scripted speech. Those corpora are
collected mainly for linguistic analyses and applications (language teaching,
grammars and dictionaries). In such corpora the acoustic quality is not essential.
What is important is that the texts reflect as much variation as possible and the
speaker behaves in a spontaneous manner. In some cases, those corpora are only
concerned with a given register, for instance, a dialect or children’s speech. An
important difference with respect to speech databases is the transcription:
spontaneous spoken corpora usually are less precise in the acoustic and phonetic
parts. On the contrary, they include detailed information about the context and the
speakers. These corpora are used mainly for sociolinguistic, text-typologic, or
psycholinguistic analyses. Examples are CHILDES and London-Lund.
C-ORAL-ROM is a corpus of spontaneous speech, but it also shows some
distinctive features:
Multilingual: the main goal is to compare the four languages, on the same
grounds, and provide comparative studies at different linguistic levels.
Acoustic quality: in order to be re-usable by the speech industry, sufficient
samples of digital recordings, media and phone conversations are
included.
Alignment of the transcription and the original sound: this is useful both to verify
the accuracy of the transcription and for teaching and other applied
investigation purposes.
The main limitation of C-ORAL-ROM is its size: 300,000 words per language is
not enough to establish classifications or to support statistically significant
analyses. Even so, we believe that this corpus will show the relevance and usefulness of
an approach that pays as much attention to the acoustic quality of the recordings as
to the linguistic annotation.
1.2 Multi-lingual comparability
Cross-linguistic comparison can provide two complementary perspectives:
comparing a given feature or features across languages, and comparing a given
register or text type across languages (Biber, 1995). On the C-ORAL-ROM
project different traditions and experiences have interacted. On the practical side,
the teams came to an agreement around two basic points: a text distribution (or
sampling design) and a unified format for the transcription.
1.2.1 Text distribution
In order to compare the linguistic features of the four languages, the same
sampling criteria and the same proportion of each text type are needed in the four sub-corpora. There is a long tradition in sociolinguistics and in corpus
linguistics (Labov, 1966; Biber, 1988; Biber et al., 1999; Miller & Weinert, 1999)
of determining the relevant non-linguistic parameters. Basically, authors agree
on a series of socio-situational parameters, such as register and genre variation,
sociological features of the speakers (sex, age, education, occupation, origin), and
dialogic structure (monologue, dialogue, conversation). The disagreement is in
how to combine these parameters. C-ORAL-ROM has chosen the design of the
Spoken Dutch Corpus (http://lands.let.kun.nl/cgn/ehome.htm). The sampling
design differs between the informal and formal sub-corpora:
Informal register is organised according to social context (familiar-private
vs. public) and dialogic structure (monologue vs. dialogue-conversation).
Formal register is organised according to channel (media, telephone,
natural context). In addition, media texts and formal in natural context are
grouped by genre (see table below).
Sociological features have not been taken into account for text selection, but they
are explicitly marked in the metadata section of the transcription. Male/Female
distinction has been the only feature to be balanced.
Text length is also constrained. Only three sizes are allowed: short (1,500
words), medium (3,000 words) and large (4,500 words). Texts shorter than 1,500
words have been allowed in genre types such as meteorological reports, but they
are always combined into segments of 1,500 words.
Tables 1 and 2 show the distribution design of the informal and formal sub-corpora for each language.
Table 1: Informal sub-corpus
Context             Total words    Monologue       Dialogue
Private/Familiar    113,000        33,000          80,000
Public              37,000         6,000           31,000
Table 2: Formal sub-corpus
Formal in Natural Context (65,000 words): Political Speech; Political Debate; Preaching; Teaching; Professional Explanation; Conferences; Business; Law
Formal in Media Context (60,000 words): News; Meteo; Interviews; Reportage; Scientific Press; Sport; Talk Show Political; Thematic Explanation; Talk Show Culture; Talk Show Science
Telephone (25,000 words): Private Dialogues; Phone to Call Services
1.2.2 Common format
To ensure a valid comparison, it is also necessary to use a consistent annotation
framework. The consortium developed the C-ORAL-ROM format, which is
based on the well-known CHAT format. A conversion to XML is provided. The xml-tagged version guarantees easy interpretation through the corresponding DTD. The
combined use of XML and DTD ensures that every text in each corpus complies
with the same requirements. In this way, textual uniformity is obtained
within and across the four corpora.
The format is divided into the header (with the meta-data) and the transcription.
Most features in the header are compulsory, so rich information is
provided for every text. The transcription is divided into turns, where applicable.
Each turn is marked by a three-letter code identifying the speaker. An
orthographic transcription is provided, along with some tags marking
disfluencies, noises, overlapping, and prosodic units. Morpho-syntactic tagging
will be supplied in a separate tier. Figure 1 shows a fragment of a text. A large
selection of fragments from the four languages can be consulted on the official
webpage of the project, along with the sound source.
@Title: Raquel
@File: efamdl04
@Participants: PAT, Patricia, (woman, B, 2, hairdresser, participant,
Madrid)
ROS, Rosa, (woman, B, 3, English teacher, participant, Madrid)
@Date: 10/03/2001
@Place: Madrid
@Situation: chat between friends at home, not hidden, researcher
observer
@Topic: friends, movies and future Use’s works
@Source: C-ORAL-ROM
@Class: informal, familiar/private, dialogue
@Length: 7’ 58’’
@Words: 1509
@Acoustic_quality: A
@Transcriber: Guillermo
@Revisor: Manuel; Guillermo, Jesús and Manuel (prosody)
@Comments:
*PAT: si ya han [/] han decidido ir con ellos / y la conocen ...
*ROS: ya / pero si yo no [/] si a mí me da igual / si yo no digo nada de
Use y Nuria / yo digo que la peña es un / poco egoísta //
*PAT: <no> //
*ROS: [<] <y ya está> //
Figure 1: A fragment of C-ORAL-ROM text
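The layout in Figure 1 (header fields introduced by "@", turns introduced by "*" and a three-letter speaker code) lends itself to simple line-based parsing. The following Python sketch is an illustration only, not the project's own software: the function name parse_transcript, the regular expressions and the treatment of wrapped lines as continuations are assumptions made for this example.

```python
import re

# Shortened copy of the fragment in Figure 1, including two wrapped lines.
SAMPLE = """\
@Title: Raquel
@File: efamdl04
@Situation: chat between friends at home, not hidden, researcher
observer
@Class: informal, familiar/private, dialogue
@Words: 1509
@Comments:
*PAT: si ya han [/] han decidido ir con ellos / y la conocen ...
*ROS: ya / pero si yo no [/] si a mí me da igual / si yo no digo nada de
Use y Nuria / yo digo que la peña es un / poco egoísta //
*PAT: <no> //
*ROS: [<] <y ya está> //
"""

HEADER = re.compile(r"^@(\w+):\s*(.*)$")      # e.g. "@File: efamdl04"
TURN = re.compile(r"^\*([A-Z]{3}):\s*(.*)$")  # e.g. "*PAT: <no> //"


def parse_transcript(text):
    """Return (metadata dict, list of (speaker, transcription) turns)."""
    meta, turns = {}, []
    last_key = None  # most recent header field, while still in the header
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        header, turn = HEADER.match(line), TURN.match(line)
        if header:
            last_key = header.group(1)
            meta[last_key] = header.group(2)
        elif turn:
            last_key = None
            turns.append([turn.group(1), turn.group(2)])
        elif turns:
            # Assumption: an unmarked line continues the previous turn.
            turns[-1][1] += " " + line
        elif last_key is not None:
            # Assumption: an unmarked line continues the previous header field.
            meta[last_key] = (meta[last_key] + " " + line).strip()
    return meta, [tuple(t) for t in turns]


meta, turns = parse_transcript(SAMPLE)
```

Keeping the header as a dictionary makes it straightforward to attach socio-contextual fields such as @Class to every turn of the same file later on.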
1.3 Other relevant aspects
C-ORAL-ROM is compliant with the state of the art in spoken corpora. These
aspects are briefly summarised in the following paragraphs.
1.3.1 The legal issue
During the 1990s legislation on Copyright and Privacy changed in many
European countries. In spoken language corpora, the law is applied when
recording individuals or using sound documents from the mass media. In the first
case, speakers retain their right to preserve privacy, and have to give their express
authorisation for their speech to be transcribed and published. In order
to preserve spontaneity, which is essential for our purposes, the procedure is to
ask each participant to sign an authorisation after the recording. If a speaker
refuses his/her consent, then the recording is discarded. The right to privacy
applies to every recording in a private context, but not to ones in a public
situation (a lecture, a political speech, a sermon).
On the other hand, many texts in the corpora are copyrighted, not only the media
recordings but also those in which the speaker creates knowledge in the form of
ideas or structure of contents. Typically, this is true of lectures and professional
talks. We obtained the written authorisation from the authors or the copyright
holders for all the texts included in the Spanish corpus.
1.3.2 The acoustic quality
The Spanish corpus of C-ORAL-ROM has been collected from scratch, although
other teams in the project have reused part of their previous texts. In our case, we
preferred to make new recordings because, on the one hand, we did not have
written consent for our previous texts and, on the other hand, the acoustic quality
of the analogue tapes was poor.
Most texts have been recorded with a DAT Tascam (model DA-P1) and two
unidirectional microphones. The source has been converted into a WAV file,
mono, 16 bit, 22,050 Hz, through a SPDIF port in a Sound Blaster Live Platinum
5.1, using the software Creative Recorder. In public places, when possible, the
DAT recorder has been connected to the sound system. The media recordings
either have been provided directly by the broadcasting station or recorded by a
computer connected to the receiver.
Acoustic quality is essential for the application in speech technologies and
language engineering.
1.3.3 The linguistic annotation
Corpora increase in value with the annotation layers provided. Tagging a
spontaneous speech corpus differs slightly from tagging written
corpora (Uchimoto et al., 2002). The difference lies not in the tagged information
but in the lower accuracy of taggers when applied to spoken corpora. For
instance, POS taggers are usually trained on written texts, which show a
quite stable and determined word order. By contrast, corpora of spontaneous
speech are highly flexible in word order. In addition, they show repetitions, restarts, overlaps, and other features of spoken syntax for which taggers have to be
trained specifically.
The lexicon is also different. One can find many words that are not included in
printed dictionaries, because they are innovations, or belong to an informal
register, or simply because they are mispronunciations.
A complete lemmatization and POS tagging is provided. Moreno and Guirao
(2003) report the development of a POS tagger and unknown words recogniser
for morphosyntactic annotation of the Spanish corpus. The results provided by
the lemmatizer are used in this paper (see section 3).
1.3.4 The validation
Verifying the reliability of data has become a prominent topic in recent
years. Users of linguistic resources want to know how the resources have been
collected and how accurate they are. C-ORAL-ROM undergoes two types of evaluation.
An internal validation is carried out by the team itself. Each text passes through
five steps: transcription, first revision, prosodic tagging, second revision, and
sound-text alignment. At least three linguists transcribe or revise each text. A
program checks for format errors, blanks, typos, badly formed tags, etc. Therefore,
content and form have been validated exhaustively, guaranteeing that the
transcription is faithful to the sound source. We want to stress that the alignment
of sound and text is the best guarantee for validation of a spoken text: any
discrepancy between the actual speech and its transcription will be easily
detected.
An external validation will be done by experts at the end of the project.
2. The computational tool
We have developed computational tools for transforming the C-ORAL-ROM
format into a more suitable tagging scheme in order to relate meta-data with
lexical items, and compute the appropriate statistics. This section is divided
into three subsections.
2.1 Using xml-tagged corpus for relating meta-data and linguistic features
The original C-ORAL-ROM annotation has been designed for registering a wide
range of features, including acoustic ones (prosodic marks, noises, etc.) which
will be used by the speech technology community. An example of an xml-tagged
file is shown:
<Turn>
<Name>PAT</Name>
<Says>
<Utterance Type= "interrogation"> y cómo está </Utterance>
<Notes Type= "act"> cough </Notes>
</Says>
</Turn>
<Turn>
<Name>ROS</Name>
<Says> <Utterance Type= "enunciation"> bueno </Utterance>
<Utterance Type= "enunciation"> no está mal </Utterance>
</Says>
</Turn>
Our goal in this experiment is to seek out lexical units peculiar to each sub-corpus. The first step was to remove the non-lexical information from the original
xml tagging. In particular, we wanted to capture two types of information:
i) The words that every speaker says, and
ii) The split of every turn into utterances in order to prevent ill-formed
word clusters. This task is similar to tokenisation in written corpora. This
division into utterances is also needed for delimiting the context that the
POS tagger uses for disambiguation.
A Perl script generates a new tagged corpus with only two tags: one for TURN,
with attributes for speaker and file, and another for UTTERANCE:
<turn speaker="PAT" file="efamcv01"> <utt> y cómo está </utt> </turn>
<turn speaker="ROS" file="efamcv01"> <utt> bueno </utt> <utt> no está mal </utt> </turn>
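The Perl script itself is not reproduced in the chapter. As a rough illustration of the same transformation, the Python sketch below reads turns in the original xml tagging and writes the two-tag format, keeping only the speaker and the utterances. The wrapping <Text> root element, the function name simplify and the file identifier passed as an argument are assumptions introduced for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# The two turns shown earlier, wrapped in a hypothetical <Text> root element
# so that the fragment is a well-formed XML document.
SAMPLE = """<Text>
<Turn>
<Name>PAT</Name>
<Says>
<Utterance Type="interrogation"> y cómo está </Utterance>
<Notes Type="act"> cough </Notes>
</Says>
</Turn>
<Turn>
<Name>ROS</Name>
<Says> <Utterance Type="enunciation"> bueno </Utterance>
<Utterance Type="enunciation"> no está mal </Utterance>
</Says>
</Turn>
</Text>"""


def simplify(xml_text, file_id):
    """Reduce each <Turn> to one line with <turn> and <utt> tags only."""
    root = ET.fromstring(xml_text)
    lines = []
    for turn in root.iter("Turn"):
        speaker = turn.findtext("Name", default="").strip()
        # Only <Utterance> elements are kept; <Notes> and other
        # non-lexical information are dropped.
        utterances = [
            " ".join(u.text.split()) for u in turn.iter("Utterance") if u.text
        ]
        body = " ".join(f"<utt> {escape(u)} </utt>" for u in utterances)
        lines.append(
            f"<turn speaker={quoteattr(speaker)} file={quoteattr(file_id)}> "
            f"{body} </turn>"
        )
    return "\n".join(lines)


simplified = simplify(SAMPLE, "efamcv01")
```

Each utterance is wrapped separately, so later steps such as n-gram extraction never cross an utterance boundary.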
By this means, every word in the corpus can be related to the speaker and the
text. The file keeps all the socio-contextual information in the header. The corpus
is partitioned into as many sub-corpora as there are distinct feature values in the headers.
For instance, a male sub-corpus, an informal sub-corpus, a telephone sub-corpus,
a meteo sub-corpus, etc. are all generated. After partition into sub-corpora, all
occurrences (the tokens) of every lexical unit (the types) are counted in each sub-corpus. Table 3 shows the distribution by sex of speaker.
Table 3: Distribution by sex
Sex      Tokens for the category   Total number of tokens   Percentage
Man      182832                    327044                   55.9 %
Woman    134693                    327044                   41.2 %
X        9519                      327044                   2.9 %
The “X” value is assigned when the sex of the speaker is unknown (typically in a
media recording).
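The partition-and-count step can be sketched in a few lines of Python. This is an illustrative reconstruction under stated assumptions, not the authors' implementation: the input is assumed to be a list of (metadata, tokens) pairs, the function names are hypothetical, and a missing feature value is mapped to "X" as in Table 3.

```python
from collections import Counter, defaultdict


def count_by_feature(turns, feature):
    """Build one frequency list (type -> token count) per feature value.

    turns: iterable of (metadata dict, list of tokens) pairs.
    A turn whose metadata lacks the feature is counted under "X".
    """
    counts = defaultdict(Counter)
    for meta, tokens in turns:
        counts[meta.get(feature, "X")].update(tokens)
    return counts


def token_shares(totals):
    """Percentage of all tokens that falls into each sub-corpus."""
    grand_total = sum(totals.values())
    return {value: round(100 * n / grand_total, 1) for value, n in totals.items()}


# Toy input (hypothetical turns) to show the partition.
toy_turns = [
    ({"sex": "man"}, ["no", "sé"]),
    ({"sex": "woman"}, ["no"]),
    ({}, ["ah"]),
]
by_sex = count_by_feature(toy_turns, "sex")

# Token totals reported in Table 3.
table3 = {"Man": 182832, "Woman": 134693, "X": 9519}
shares = token_shares(table3)
```

Applied to the totals in Table 3, token_shares reproduces the percentages given in the last column.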
The procedure can be applied to any type of information derived from the corpus.
For instance, we tagged it with POS and lemma, using a POS tagger for spoken
Spanish developed in our laboratory (Moreno & Guirao, 2003). We show the
previous example after lemmatization and POS tagging. Lemmas are shown in
uppercase.
Lemmatisation
<turn speaker="PAT" file="efamcv01"> <utt> Y CÓMO ESTAR </utt> </turn>
<turn speaker="ROS" file="efamcv01"> <utt> BUENO </utt> <utt> NO ESTAR MAL </utt> </turn>
POS tagging
<turn speaker="PAT" file="efamcv01"> <utt> C P AUX </utt> </turn>
<turn speaker="ROS" file="efamcv01"> <utt> MD </utt> <utt> ADV AUX ADV </utt> </turn>
C = conjunction; P = pronoun; AUX = auxiliary; MD = discourse marker; ADV = adverb
In summary, in this experiment we have considered three levels of linguistic data:
words, lemmas and POS.
2.2 Extracting word clusters
If we calculate statistics directly on single words, the result will not be correct, since
multi-word units will not be included in the count. Discourse markers as
frequent as “por ejemplo” (for instance), “es decir” (in other words) or “o sea”
(that is) will not appear if we work on single-word units. To solve this, we
developed an algorithm based on n-grams in order to extract multi-word
candidates. We extracted all n-grams with three or more occurrences, for n = 4, 3,
and 2. Next, a filter is applied to discard all n-grams that start or end with a
determiner or auxiliary. Finally, multi-words are selected by hand. Every multi-word is regarded as a lexical unit, equivalent to a single word.
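The automatic part of this procedure (counting n-grams, applying the frequency threshold and filtering on the first and last element) can be sketched as follows. The sketch assumes that each utterance is available as a list of (word, POS) pairs and that determiners and auxiliaries carry the tags ART and AUX; the function name, the tag set used in the filter and the toy data are illustrative assumptions. The final manual selection is not modelled.

```python
from collections import Counter

# Assumed tags for the filter: ART for determiners, AUX for auxiliaries.
STOP_POS = {"ART", "AUX"}


def ngram_candidates(utterances, min_count=3, sizes=(4, 3, 2)):
    """Return {n-gram text: frequency} for multi-word candidates.

    utterances: list of utterances, each a list of (word, pos) pairs.
    N-grams are built inside one utterance only, so they never span
    an utterance boundary.
    """
    counts = Counter()
    for utt in utterances:
        for n in sizes:
            for i in range(len(utt) - n + 1):
                counts[tuple(utt[i:i + n])] += 1

    candidates = {}
    for gram, freq in counts.items():
        if freq < min_count:
            continue  # keep n-grams with three or more occurrences
        if gram[0][1] in STOP_POS or gram[-1][1] in STOP_POS:
            continue  # drop n-grams starting or ending with ART/AUX
        candidates[" ".join(word for word, _ in gram)] = freq
    return candidates


# Hypothetical tagged utterances.
toy = [
    [("por", "PREP"), ("ejemplo", "N"), ("el", "ART"), ("club", "N")],
    [("por", "PREP"), ("ejemplo", "N"), ("la", "ART"), ("web", "N")],
    [("y", "C"), ("por", "PREP"), ("ejemplo", "N"), ("no", "ADV")],
]
found = ngram_candidates(toy)
```

The surviving candidates would then be reviewed by hand, as described above.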
2.3 Applying the statistics of surprise
In order to identify the distinctive words, lemmas or POS for a given sub-corpus,
we have employed the log-likelihood ratio test proposed by Dunning (1993). This
method does not assume that units are normally distributed in a corpus.
Instead, the log-likelihood ratio λ assumes a binomial distribution, which is more
appropriate for rare but distinctive words. “Texts are composed largely of such
rare events” (Dunning, 1993). In addition, this test does not need balanced sub-corpora for comparison.
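For readers who want to experiment with the statistic, the following Python sketch computes -2 log λ for one unit under the binomial formulation of Dunning (1993): the unit occurs k1 times among n1 tokens of a sub-corpus and k2 times among n2 tokens of the comparison corpus. The function names and the counts in the usage line are hypothetical, and the chapter does not state how its own implementation treats zero counts; here a term of the form 0 · log 0 is taken as 0.

```python
import math


def _binomial_log_likelihood(k, n, p):
    """k*log(p) + (n-k)*log(1-p), with 0*log(0) taken as 0."""
    total = 0.0
    if k > 0:
        total += k * math.log(p)
    if n - k > 0:
        total += (n - k) * math.log(1 - p)
    return total


def log_likelihood_ratio(k1, n1, k2, n2):
    """Dunning's -2 log lambda for a unit seen k1/n1 times vs k2/n2 times."""
    p1 = k1 / n1                 # relative frequency in the sub-corpus
    p2 = k2 / n2                 # relative frequency in the comparison corpus
    p = (k1 + k2) / (n1 + n2)    # pooled relative frequency (null hypothesis)
    return 2 * (
        _binomial_log_likelihood(k1, n1, p1)
        + _binomial_log_likelihood(k2, n2, p2)
        - _binomial_log_likelihood(k1, n1, p)
        - _binomial_log_likelihood(k2, n2, p)
    )


# Hypothetical counts: 50 hits in a 2,000-token sub-corpus against
# 5 hits in the remaining 300,000 tokens.
score = log_likelihood_ratio(50, 2000, 5, 300000)
```

The statistic is symmetric in the two corpora, so the direction of the preference has to be read off separately by comparing k1/n1 with k2/n2.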
This method has been successfully applied for finding collocations (Dunning,
1993) and terms (Daille, 1994). In order to test the method for finding distinctive
units in specific domains, we can work on two hypotheses:
i) Two registers (or sub-corpora) show no difference in distinctive units
(Null hypothesis)
ii) For a given sub-corpus, we can find distinctive units (Alternative
hypothesis).
We applied the test to two well-defined sub-corpora, meteorological reports and
law, in order to discard one of the hypotheses. Results are shown in Table 4. The
critical value for one degree of freedom is 7.88 (p < 0.005).
Table 4: Dunning Test applied to well-defined sub-corpora
Meteo                                   Law
-2 log λ   Freq    Lemma                -2 log λ   Freq    Lemma
165        50      norte                200        58      policía
160        38      fuerza               200        289     persona
145        27      viento               116        85      derecho
128        6466    en                   113        45      contrato
97         16      componente           101        22      judicial
91         19      temperatura          84         31      delito
80         19      noroeste             83         26      delincuente
69         12      oeste                78         50      ley
69         12      nube                 77         69      determinar
64         92      zona                 65         17      cometer
Results confirm the alternative hypothesis and the suitability of the Dunning test
for the task. Most of the “top 10” lemmas in both domains have a low frequency,
but all are typical terms in their domains.
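The critical value can be related to a significance level with a short calculation. Since -2 log λ is asymptotically chi-square distributed, and a chi-square variable with one degree of freedom is the square of a standard normal variable, its upper tail probability can be written with the complementary error function. The Python sketch below, whose function and variable names are illustrative, checks that 7.88 corresponds to a level of about 0.005 and that every score in Table 4 exceeds it.

```python
import math


def chi2_sf_1df(x):
    """P(X > x) for a chi-square variable with one degree of freedom."""
    return math.erfc(math.sqrt(x / 2))


CRITICAL = 7.88

# -2 log lambda scores copied from Table 4.
METEO = {"norte": 165, "fuerza": 160, "viento": 145, "en": 128,
         "componente": 97, "temperatura": 91, "noroeste": 80,
         "oeste": 69, "nube": 69, "zona": 64}
LAW = {"policía": 200, "persona": 200, "derecho": 116, "contrato": 113,
       "judicial": 101, "delito": 84, "delincuente": 83, "ley": 78,
       "determinar": 77, "cometer": 65}

p_at_critical = chi2_sf_1df(CRITICAL)
all_significant = all(score > CRITICAL for score in {**METEO, **LAW}.values())
```

Because the lowest score in the table (64) is far above the threshold, the rejection of the null hypothesis does not depend on the exact significance level chosen.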
3. Preliminary results
Our goal is to show a range of possibilities for the application of this method. We
will show here a very incomplete set of data. Currently, there is a disproportion of
social and register features with respect to annotated linguistic features. The
linguistic annotation is being carried out this year (2003). Comprehensive results
will be delivered with the final version of the four corpora, including a cross-linguistic comparison. In this paper, the only linguistic features that will be taken
into account are:
words and multi-words
lemmas
POS tags.
First, we show the 10 most characteristic words in our corpus of spontaneous
speech for three professions: a consumers’ association manager, a football coach
and a computer system administrator. Notice that words and multi-words are
regarded as equivalent units.
Table 5: Most characteristic words in three professions
Consumers’ association manager     Football coach                    Computer system administrator
-2 log λ   Word                    -2 log λ   Word                   -2 log λ   Word
110        establecimiento         31         directiva              91         tío
107        establecimientos        30         club                   48         grabando
106        encuesta                28         Real Madrid            41         web
103        cesta                   25         rueda de prensa        33         yo qué sé
65         política de precios     18         director general       33         no sé
45         marcas                  17         no                     30         joder
44         precios                 16         estabilidad            30         linux
44         índice                  16         confianza              28         cabrón
42         consumidor              16         hombre yo creo que     28         detectan
41         insisto                 16         Y que                  26         barato
The procedure can be extended to any profession registered in the corpus, as a
means of detecting sociolectal information. Now we will provide the results of
the Dunning test on formal and informal registers, approximately 150,000 words
each:
Table 6: Most characteristic words in formal and informal registers
Formal                           Informal
-2 log λ   Word                  -2 log λ   Word
134        de                    422        si
91         es decir              279        ah
62         su                    238        sabes
61         en                    231        claro
60         gobierno              231        me
55         en este momento       222        tía
49         desarrollo            182        yo
49         general               179        no sé
47         nuestra               173        no
46         países                171        ya
Another interesting comparison is to find out which POS are more typical of male
and female registers. Table 7 shows the results.
In our corpus, men prefer to use nouns and women clearly prefer pronouns.
Finally, after lemmatization, Table 8 shows the 10 most frequent verbs in the general
corpus and the most characteristic verbs in the male and female sub-corpora.
Table 7: POS in male and female registers
General                      Male                  Female
Total occurrences   POS      -2 log λ   POS        -2 log λ   POS
47052               V        515        N          1327       P
42531               N        422        ADJ        524        ADV
38210               PREP     399        ART        243        C
32284               ART      382        PREP       149        MD
31404               ADV      80         DEM        126        INTJ
30737               C        34         Q          28         V
25044               P        34         REL        16         POSS
17418               AUX                            16         AUX
12611               ADJ                            4          NPR
10112               Q
Table 8: Most frequent verb lemmas in male and female registers
General                        Male                      Female
Total occurrences   Lemma      -2 log λ   Lemma          -2 log λ   Lemma
3398                ir         28         escuchar       159        ir
2973                tener      28         recordar       154        decir
2579                decir      26         aparecer       112        saber
2067                hacer      23         llegar         99         venir
1046                poder      23         contemplar     86         dar
1026                saber      23         caminar        47         mirar
995                 ver        22         intentar       46         comprar
802                 dar        22         amar           42         gustar
779                 querer     20         juntar         36         quedar
577                 creer      20         superar        36         contar
4. Conclusions and future work
Here, we have shown the relevance of this procedure as an empirical method for
the validation of sociolinguistic hypotheses in spoken language, as well as for
determining register typology.
The method correlates linguistic with socio-contextual data applying Dunning’s
Statistics of Surprise. In order to achieve this, a richly tagged corpus and
the use of xml have been essential. The preliminary results are promising and
have not been shown previously for Spanish. However, drawing conclusions
from these figures or interpreting them is premature, since the corpus is clearly not
large enough. For this reason, we will apply the method to the CORLEC corpus, also
developed by the LLI-UAM (see Moreno 2002 for an overview). The
combination of C-ORAL-ROM and CORLEC corpora will contain over
1,500,000 words of spontaneous spoken Spanish.
We also wish to find out more about the correlation between linguistic and socio-contextual features once the complete morphosyntactic annotation is finished.
For instance, verb tenses, number, gender, persons in pronouns, etc., will be
tagged. Biber (1988, 1995) provides a rich catalogue of linguistic features that
can be traced.
Finally, a cross-linguistic comparison between the four Romance languages in C-ORAL-ROM will be made, based on the same text distribution.
References
Biber, D. (1988), Variation across speech and writing. Cambridge: CUP.
Biber, D. (1995), Dimensions of register variation. Cambridge: CUP.
Biber, D., S. Johansson, G. Leech, S. Conrad and E. Finegan (eds.) (1999), The
Longman grammar of spoken and written English. London: Longman.
Cresti, E. et al. (2002), ‘The C-ORAL-ROM project. New methods for spoken
language archives in a multilingual romance corpus’, in: Proceedings of
LREC 2002. Las Palmas de Gran Canaria.
Daille, B. (1994), Combined approach for terminology extraction: lexical
statistics and linguistic filtering. Ph.D. Thesis, Paris 7.
Dunning, T. (1993), ‘Accurate methods for the statistics of surprise and
coincidence’. Computational Linguistics 19(1): 61-74.
Labov, W. (1966), The social stratification of English in New York City.
Washington: Center for Applied Linguistics.
Miller, J. and R. Weinert (1999), Spontaneous spoken language. Oxford:
Clarendon.
Moreno, A. (2002), ‘La evolución de los corpus de habla espontánea: la
experiencia del LLI-UAM’, in: Proceedings of II Jornadas de Tecnologías
del Habla, Granada, Spain.
Moreno, A. and J. M. Guirao (2003), ‘Tagging a spontaneous speech corpus of
Spanish’, in: Proceedings of Recent Advances in NLP (RANLP-2003),
Borovets, Bulgaria.
Uchimoto, K., C. Nobata, A. Yamada, S. Sekine and H. Isahara (2002),
‘Morphological Analysis of the Spontaneous Speech corpus’, in:
Proceedings of Conference of Computational Linguistics (COLING 2002),
Taipei, Taiwan.
An analysis of lexical text coverage in contemporary German
Randall L. Jones
Brigham Young University
Abstract
One of the many practical applications of corpus studies is the generation of
word frequency information. It makes sense that for the teaching of vocabulary in
a second language, lexical frequency should play a significant role in the
selection of vocabulary to be included in pedagogical materials. Using the
concept of “lexical text coverage”, a study based on the BYU/Leipzig Corpus of
Contemporary German has shown that a basic vocabulary of 3,000 high
frequency words can account for between 75% and 90% of the words in the text,
depending on the register.
1.
Vocabulary and second language learning
How many words must a learner know in order to read and understand a German
novel, a newspaper, an academic text, or to understand a German television
broadcast or a conversation? Which words are most important, i.e. which ones
should be learned first? Is the vocabulary used in the above-mentioned registers
about the same, or are there substantial differences? And what is a word and what
does it really mean to know a word in a second language?
These questions lie at the very root of second language vocabulary learning. To
begin with the question of which words should be learned first, it is important to
look at vocabulary frequency, as the frequency of the various words in a typical
text differs significantly. Some occur numerous times and some occur only once.
By focusing first on learning the most frequently occurring words, it would seem
that the process of learning to read or comprehend speech in a second language
would become more systematic and therefore more efficient. But again the
question arises, what are the most frequently occurring words in German and how
many does one have to learn?
2.
The concept of lexical text coverage
Work by scholars such as Nation (2001) and others has helped us immeasurably
in understanding the nature of vocabulary frequency and its relation to vocabulary
learning in a second language. For a learner to read and have some
comprehension of a German text, a minimum or threshold vocabulary is
necessary. Nation suggests that an understanding of a 2,000 to 3,000 word family
level is a minimum for reading an unedited English text (2001: 146). He has
introduced the notion of “word token coverage” and applied it to various types of
English texts (2001: 17, 147). The notion of “word token coverage” means the
degree to which a defined level of high-frequency vocabulary from a text covers
or accounts for all the words in the text. For example, what percentage of a given
text is covered by the 1,000 most frequently occurring words? What about the
next 1,000, and how many words does it take to achieve a coverage of, say, 90%?
The procedure for determining this is quite straightforward.
i) Generate a word frequency list for a text.
ii) Construct word families from the n most frequent words.
iii) Match that word family frequency list against the original text to
determine what percentage is covered by the words in the list.
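The three steps above can be sketched in Python as follows. This is a minimal illustration and not the RANGE software: the names `tokenize`, `coverage` and `family_of` are hypothetical, and the mapping from a word form to its word family is passed in by the caller, since deciding what belongs to a family is a linguistic judgement rather than something the code can determine.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


def coverage(tokens, n, family_of=lambda word: word):
    """Share of running words (tokens) covered by the n most frequent words.

    family_of maps a word form to its word family; by default every form
    is treated as its own family.
    """
    if not tokens:
        return 0.0
    # i) Generate a word frequency list for the text.
    frequencies = Counter(tokens)
    # ii) Construct word families from the n most frequent words.
    families = {family_of(word) for word, _ in frequencies.most_common(n)}
    # iii) Match the family list against every running word of the text.
    covered = sum(1 for token in tokens if family_of(token) in families)
    return covered / len(tokens)
```

Calling `coverage(tokens, 1000)` and `coverage(tokens, 2000)` on the same token list, and subtracting the first result from the second, gives the additional coverage contributed by the second thousand words.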
Nation uses what he calls word families instead of simple words, i.e. all words
that are related to and recognisable from a base word. For example, the base word
agree also includes inflected verb forms as well as associated nouns and
adjectives: agreed, agreeing, agreement, agreements, agrees, agreeable,
disagree, disagreements, disagreeable, disagreed, disagreeing, disagreement,
disagrees, a total of 14 word forms (counting agree itself) in three word classes. It is assumed that by
understanding the verb agree, one will also understand other morphological
permutations. The decision is, of course, somewhat arbitrary, i.e. it is assumed
that at an early stage of vocabulary learning one understands regular verb
conjugation and has a knowledge of the meaning of simple affixes such as dis-,
-ment, -able.
Table 1 shows Nation’s analysis of text coverage by the 1,000 and 2,000 most
frequent words in four English text registers. The highest level of coverage for
1,000 words was for conversation (84.3%), while the lowest was for academic
texts (73.5%). An additional 1,000 words increases the coverage by only a few
percentage points. Each additional 1,000 words results in an increasingly smaller
addition to the coverage. It is important to stress that we are dealing with
coverage of word tokens, not word types. In a running text of 100,000 words of
the conversation sub-corpus for example, about half of the words will occur more
than once, and some of them numerous times. If the 2,000 most frequently
occurring words in the text were compared with each running word in the text,
about 90% of the running words would be accounted for or “covered”.
These statistics are highly interesting and even a bit surprising. The 1,000 most
frequently-occurring words in the English conversation text covered 84.3% of the
words in the text. This means that by knowing this subset of vocabulary, a reader
will recognise and have some knowledge of any given word in the text 84.3% of
the time. It is also surprising that, by doubling the number of words to 2,000, only
6% additional coverage is achieved. How many words are necessary in order to
achieve 100% coverage? It is also interesting to speculate on the difference in
coverage among the four registers. One would expect that conversation would
represent the highest coverage, but why would newspaper English be ten full
percentage points lower?
Table 1: Text type and text coverage (tokens) by the most frequent 2,000 words
of English in four different text registers (Nation, 2001: 17).

Levels      Conversation   Fiction   Newspapers   Academic texts
1st 1000    84.3%          82.3%     75.6%        73.5%
2nd 1000    6.0%           5.1%      4.7%         4.6%
Total       90.3%          87.4%     80.3%        78.1%
These statistics may convey the message that by learning a relatively small
number of words – a small number at least when compared to the total number of
words in the English language – one can read and comprehend an average
English text. Caution is in order here. We still do not understand the cognitive
process of vocabulary learning and it is difficult to define what it means to know
a word. In addition, if the reader understands 87% of the words in a text, it also
means that he or she does not understand 13% of them, and these may be the
most important words for understanding the full meaning. Nevertheless, a
threshold vocabulary level is a useful concept and a good beginning for continued
learning.
3.
Lexical text coverage in German
What happens when we apply this procedure to a language other than English,
e.g. German? As is well known, German has a much more complex morphology
than English and uses far more compounding of nouns, adjectives and verbs. For
example, the German base word arbeit (“work”) could include not only three
nouns (Arbeit, Arbeiter, Arbeiterin) and eight verb forms, but also literally
hundreds of compound nouns (e.g. Arbeiterfamilie, Arbeitgeber, Schichtarbeit,
etc.), verbs (e.g. ausarbeiten, bearbeiten, verarbeiten, etc.), and adjectives
(arbeitslos, arbeitsfähig, etc.). Some of these might be recognised on the basis of
an understanding of the root word, but many of them would not. It would seem
that for German only the most basic morphological forms should be included in
the word family.
In spite of the lexical complexities of German, it seems to be possible to calculate
lexical coverage in very much the same way as for English. An analysis was
made from a subset of approximately 10% of the BYU/Leipzig Corpus of
Contemporary German, using each of four registers in the corpus. The total
BYU/Leipzig German Corpus consists of the following:
Spoken German: 1 million words. 700,000 conversation + 300,000 television
Literature: 1 million words. Seven literature genres
Newspapers: 1 million words. Twenty regional and national newspapers
Academic prose: 600,000 words. Six academic areas, secondary & post-secondary
Gebrauchstexte: 400,000 words. Instructions, advice, advertisements, etc.
The texts are taken from the three major German-speaking countries and
represent a variety of styles and levels. Most of the texts are from the years 2000-2002. A subset was used for this study because the full corpus was not yet
complete.
The 400,000 word German sub-corpus was processed using the RANGE software
(available from Paul Nation, School of Linguistics and Applied Language
Studies, Victoria University of Wellington in Wellington, New Zealand). First a
raw frequency list was generated, then word family lists were constructed for the
first 1,000 and 2,000 most frequent words. The results are shown in Table 2.
Table 2: Text type and text coverage (tokens) by the most frequent 2000 words
of German in four different text registers.

Levels      Conversation   Literature   Newspaper   Academic
1st 1000    82.6%          72.0%        64.0%       65.4%
2nd 1000    4.4%           5.4%         6.5%        7.8%
Total       87.0%          77.4%        70.5%       73.2%
By comparing the German study with Nation’s English analysis one can make
several interesting observations. First, the results are really not significantly
different for conversation and academic text but quite different for literature and
newspaper text. There could be a variety of explanations for this, including
external factors relating to the choice of texts used for the respective studies.
German national newspapers tend to be erudite and use more compound words. It
is also interesting to note the respective differences among the four registers.
Conversation represents the highest coverage in both languages, but whereas
academic is the lowest for English, newspaper shows the lowest for German.
Again, this difference may be the result of a number of factors. An attempt to
account for these differences would be beyond our scope here but it is an
intriguing cross-language phenomenon. A final curious statistic is the fact that the
percentage difference for the second thousand words was almost exactly opposite
of the first one thousand, i.e. lowest for conversation (4.4%) and highest for
academic (7.8%), thus slightly decreasing the spread among the four registers.
For the German study an additional 1,000 words were added. The results are
shown in Table 3. Again we see that, by increasing the number of words by 50%,
the coverage is increased by only 2.5, 3.4, 4.2 and 4.6 percentage points respectively, but
that the increase is lowest for conversation and highest for academic, thus
narrowing the gap even more. Is it possible that, at a certain point, the percentages
would all be the same?
Table 3: Text type and text coverage (tokens) by the most frequent 3000 words
of German in four different text registers

Levels      Conversation   Literature   Newspaper   Academic
1st 1000    82.6%          72.0%        64.0%       65.4%
2nd 1000    4.4%           5.4%         6.5%        7.8%
3rd 1000    2.5%           3.4%         4.2%        4.6%
Total       89.5%          80.8%        74.7%       77.8%
One of the questions posed at the beginning of this paper was about the difference
in vocabulary among the four registers represented. It is obvious that there is a
quantitative difference, as is evidenced in the different percentages of coverage.
But what degree of lexical overlap exists? In analysing the range of the 1000 most
frequent words in a combined list of all four registers, it was interesting to
observe that 89.1% of the words occurred in all four sub-corpora, 7.6% occurred
in three of the four sub-corpora, 2.4% occurred in two, and only 0.8% occurred in
just one.
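The range analysis described here can be sketched as follows. This is an illustration under stated assumptions, not the RANGE software itself: the function name `range_distribution` and the input format (a dictionary mapping a register name to its list of tokens) are hypothetical.

```python
from collections import Counter


def range_distribution(subcorpora, n):
    """For the n most frequent words of the combined corpus, return the
    share of those words that occur in exactly k sub-corpora, keyed by k.

    subcorpora: dict mapping a register name to its list of tokens.
    """
    # Combined frequency list over all registers.
    combined = Counter()
    for tokens in subcorpora.values():
        combined.update(tokens)
    top_words = [word for word, _ in combined.most_common(n)]
    if not top_words:
        return {}

    # Vocabulary of each register, for fast membership tests.
    vocabularies = [set(tokens) for tokens in subcorpora.values()]

    # Range of a word = number of registers in which it occurs at least once.
    ranges = Counter(
        sum(word in vocabulary for vocabulary in vocabularies)
        for word in top_words
    )
    return {k: ranges[k] / len(top_words) for k in sorted(ranges, reverse=True)}
```

With four registers as input, the returned dictionary has the same shape as the figures reported above: the share of the top-n words found in all four sub-corpora, in three, in two and in only one.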
Table 4a: Words (partial listing) occurring in three of the four sub-corpora

Word        Total   Academic   Spoken   Newspaper   Literature
NA          162     2          141      0           19
JEDOCH      149     65         0        62          22
EURO        96      19         1        76          0
MONTAG      81      0          1        75          5
UHR         77      3          0        53          21
GESICHT     75      3          0        6           66
KRANKHEIT   65      60         0        2           3
ZUDEM       62      18         0        41          3
SOZIAL      60      55         1        4           0
ZUGLEICH    59      35         0        13          11
LEUTE       56      0          35       5           16
URLAUB      56      0          48       2           6
REGIERUNG   55      11         1        43          0
This would suggest a high degree of commonality among the four registers, and
has important implications for second language vocabulary learning. While there
are language-learning programs that emphasise a specific register, in most cases
vocabulary is learned without regard to how it might be encountered at a later
time.
Table 4b: Words (partial listing) occurring in two of the four sub-corpora

Word        Total   Academic   Spoken   Newspaper   Literature
SOWIE       108     58         0        50          0
MEDIZIN     105     102        0        3           0
DOLLAR      68      0          0        67          1
JÄHRIG      51      18         0        33          0
MITTWOCH    51      0          4        47          0
DIENSTAG    47      0          2        45          0
QUARTAL     45      0          2        43          0
PFERD       43      0          3        0           40
ERSTMAL     42      0          41       1           0
GARTEN      41      0          23       0           18

4.
Conclusion
Vocabulary learning in a second language can be difficult and time consuming,
but language educators can contribute to the ease of learning by sequencing new
lexical material based on frequency information. It would seem that in German as
well as in English a threshold vocabulary of approximately 3,000 words would be
suitable for the typical academic learning experience. There are, of course, thorny
issues that still need to be addressed such as what it means to learn and know a
word and how new vocabulary is best taught. But by being able to establish a
frequency-based vocabulary pool, students can benefit from focusing on learning
words that will most likely be of the greatest benefit to them.
References
Nation, I.S.P. (2001), Learning Vocabulary in Another Language. Cambridge:
Cambridge University Press.
Analysing a semantic corpus study across English dialects:
Searching for paradigmatic parallels 1
University of Manchester
Abstract
In this investigation, we conduct a contrastive corpus analysis into the usages of the ‘get’
periphrastic construction focusing on semantic variation. Our primary interest is in
standard Singapore English (SE), and British English (BE) and New Zealand English
(NZE) were used for comparative purposes. The investigation found that, generally, SE use
of the ‘get’ periphrastic construction was similar to that for BE and NZE. However, after
conducting a search for paradigmatic parallels, we also found that in certain functional
environments, typically filled by the ‘get’ causative in the other two dialects, SE may have
gone further in evolving a competing form - the use of speech act verbs, especially ‘ask’
used in a causative sense.
1.
Introduction
This paper details the corpus-based approach we adopted to analyse the usages of
a grammatical construction, the get periphrastic causative (henceforth get-causative) in an English dialect, and some interesting results of the investigation.
We aimed to discover if standard Singapore English (SE) has developed linguistic
alternatives to fill the causative function(s) of the get-causative, and/or has
derived any indigenised usages of the construction, particularly cognitively-induced ones. As an English dialect, SE is interesting in that it exists in an
ethnolinguistically diverse ecology wherein a number of genetically-unrelated
languages compete with English for various functions. Feature selection as a
result is at least potentially subject to influence by cognitive and situational
factors that are much less likely to permeate largely monocultural monolingual
dialects like BE and NZE. Keeping this important factor in mind, a comparative
approach was chosen, and BE and NZE used as standards of measure. Important
aspects of the approach included a prior analysis of the construction to determine
the prototypical semantics associated with the construction as used in BE and
NZE, and to focus the corpus study on finding quantitative and qualitative
evidence that might indicate idiosyncratic behaviour related to the get-causative
paradigm in SE.
An example of the get-causative is given in (1):
(1) No no they did Terry O’Neill’s session and it was such garbage they
got him to reshoot (ICE-GB s1a-052)
A particular construction such as the get-causative has, of course, a unique profile
comprising a set of functions and semantic features, and a grammatical analysis at
a very fine-grained level will reveal that neither the construction nor any of its
key components can be replaced by other linguistic item(s) without some loss of
information. In a case of paradigmatic replacement, the information that is lost
must not include loss of salient function, otherwise, instead of a paradigmatic
alternate, the replacement in effect signals an instance of general language change
process. The idea of parallels to a well-defined paradigm is perhaps best
represented by the variation shown by sociolinguistically and dialect-defined
groups in their patterning preferences in a particular function.
Cognitive factors can influence grammatical preferencing strategies. Corpus
studies can be very useful in pointing out such systematic associations in dialectal
variation by providing a massive amount of natural language for quantitative and
qualitative analyses. However, data of significance may not be easy to find.
Linguistic items in association patterns (‘the systematic ways in which linguistic
features are used in association with other linguistic and non-linguistic features’
(Biber 1998:5)) that are also grammatical may not be arranged syntagmatically as
grammatical parallels are often aligned paradigmatically. As well as quantitative
means, this study uses clues from structural constraints of a construction to
retrieve grammatical association patterns in the use of the get-causative.
2.
Background to methodology
Corpus studies have developed association-based methods to harvest instances of
pattern recurrence in language use. More established methods such as collocation
(Sinclair, 1991) exploit pairings and groupings of lexical items. Inroads made into
the study of language use have shown that words very often do not co-occur at
random in a text; rather, pairs or groups of words will consistently appear in the
same or similar linear arrangements (though not necessarily directly adjacent)
because there is a strong preference in language production to select ‘semi-preconstructed phrases that constitute single choices’ at least to the extent that they
are non-random co-occurrences (see, for example, Sinclair’s 1991: 110 ‘idiom
principle’). Harvested collocations can be measured and analysed in terms of the
degree of association that ranges from a cohesive strength analogous to
morphological compounding, e.g., of course (see Sinclair, 1991: 111) to
randomness, where the co-occurrence is analysed to have been arbitrarily
induced. Collocations can only treat items in essentially syntagmatic
arrangements, i.e., collocates must be linearly-patterned words.
More recently, the focus in corpus linguistics has also turned to association
patterns of grammatical constructions, (e.g., Biber, 2000; Hunston & Francis,
1999), that is, association patterns that are instances of grammatical preferencing
(Biber, 2000). There are (at least) two considerations that make retrieval and
analysis of such data difficult. First, grammatical association patterns do not
necessarily syntagmatically align. This means that the goal of a search is not for
linear patterns, making the search for grammatical association patterns much
more difficult than locating collocations. Second, some types of parallelisms, if
cognitively or situationally-induced, are not transparently obvious. Transparent
associations will be structurally similar, for example, the difference between overt
use of that and the omission of it in clause structure preferencing, e.g. (Biber et
al., 1998: 103):
(2) a. I thought that she only wore jeans.
b. She thinks he’s sweet.
The two clause structure types are grammatical patterns and while serving
analogous functions, there is an obvious structural difference that can be isolated
and studied. But grammatical patterns are not always so easy to detect. A
particular grammatical structure is very often the result of cognitive or
communicative function, and possible structural representations of a function can
vary, i.e., the paradigm is not equivalent to any particular linguistic form sui
generis. In (3), for example,
(3) Person A: How did you go to the airport?
Person B: I ______ Jane (to) drive me.
a paradigmatic focus is interested in finding alternatives either for the cause verb,
let us say, get, or for the construction as a whole (perhaps a resultative
construction like I got driven there rather than another causative one, for
instance). So, paradigmatic parallels of the get-construction may not necessarily
have the same structural form as the source construction. At best, a primarily
structural approach may yield some alternatives but, more likely, to approach the
investigation on the basis of structural similarity would harvest too immense a
number of possible alternatives, 2 rendering the analysis too broad to be
interesting.
The central issue involves how one can predict identifiable structural alternatives
for a cognitive or communicative function that are not always realised in
structurally similar ways. Automated processes that can trawl a corpus for items
that fit a semantic profile are not yet available commercially. In Section 4, our
low-tech approach to finding parallels for the get-causative is detailed.
2.1
Corpora Consulted
The corpora examined were the SE (ICE-SE), NZE (ICE-NZ) and BE (ICE-BE)
sub-corpora from the International Corpus of English (ICE) Collection. 3 Each
sub-corpus contains approximately one million words of the standard variety of
English in each geographic location, with the ratio of the number of spoken to
written words at 3:2. The number of texts and registers represented in each ICE
sub-corpus is standardised, the consistency enabling greater accuracy for
frequency counts in cross-dialectal studies. (For more information about the ICE
Collection, see Greenbaum, 1996.)
As ICE-SE is not tagged for grammatical information, one search issue that arose
was related to occurrences of ellipted linguistic items that prevented classification
of the token as an instance of causative periphrastic. Zero categories such as pro-dropping are found in informal registers in SE, e.g., private dialogues (ICE-SG
s1a-001 to s1a-100), as a feature that has transferred over from the influence of
Chinese (Bao, 2001). Where instances were found, we needed to look at the
extended context to determine whether the construction was actually a
periphrastic. If any ambiguity could not be resolved, we did not add the instance
to our frequency count, as in this example:
(4) The only problem is that you must get __ to allow you to tape the
voices… (ICE-SG s1a-070)
In (4), the ellipted item could have been an instance of pro-dropping (of the
causee) or it may be a word like ‘permission’ (i.e., a transitive use of the verb get,
plus a verbal complement beginning with the to infinitive serving as an adjunct to
the main clause).
3.
The periphrastic get-causative construction
Here are some examples of the get-causative:
(5) Get-causatives
a. Like the doctor in Bristol, she looked at my eyes, she got me to
touch fingers and noses, to hop on one leg and saw how I
coordinated… [ICE-GB w2b 001]
b. What we are trying to get you to realise is … [ICE-GB s1b 011]
c. Yah or else … how you’re gonna get those middle age people
to answer them? (ICE-SG s1a-070)
d. With that kind of problem I got my chief engineer to go out and
start the company making test equipment that will solve this
type of problem. (ICE-SG s2a-043)
e. Workflow simply means getting the computer to pass
information from one person in a work chain to another. (ICE-SG w2c-004)
f. The forecast of rain for the following week finally got him to
fix the roof. (Talmy, 1976:106)
The get-causative is only one of many constructions of the highly polysemous
verb get. With respect to usage, the lexical item get historically has had a
reputation of being appropriate for use only in informal or colloquial contexts.
The prescriptive bias is mentioned in the Collins Cobuild English Usage (1992),
American English Usage (1957), 4 and Fowler (1996). Whether or not this
stigmatisation extends to all registers, is observed equally across dialects, 5 or
applies only to some get constructions, has not been studied as far as we know. It
is also not known how strongly the bias has affected today’s speakers. 6 Given this lack
of knowledge, it is possible that the stigmatisation is still active
today, producing differences in usage preferences across the dialects and across
registers and skewing the frequency distribution totals, both for individual registers and
overall.
3.1.1 Semantic profile of the get-causative
In the literature, there are a few notes about typical usages of the construction but
these have seldom been confirmed by studies that are based on high volume
empirical data, and also do not provide a complete semantic profile of the
construction. However, knowledge of individual essential semantic features can
assist pre-analysis. Animacy is an important semantic variable in linguistic
representations of causativity. For the get-causative, both Causer and Causee are
prototypically sentient beings, viz. human. 7 The sense of human agency can be
extended metaphorically and metonymically, to animals, natural forces and the
class of machines that can metaphorically be conceived as a system comprising
parts working together like a computer, machine or car, as in (5e) (Givon, 1976:
348). We included extended animacy as instances of human agency in our
frequency counts.
The get-causative has also been associated with events viewed by the speaker to
be difficult to achieve. This is possibly because of the history of the lexical
meaning of get, i.e., the association with obtaining a goal via physical means.
Although some analyses, e.g. Hollmann (m.s.) and Wierzbicka (2000: 118) have
suggested that the difficulty or effort is associated specifically with the
construction, it is not clear whether the association is fundamentally a lexico-grammatical one – i.e., a collocation with the lexical item get. In some dialects
though, particularly Australian and New Zealand Englishes (from our personal
observations), the construction seems to be used just as much in contexts which
do not involve difficulty or effort, (see, e.g., 9a).
Another semantic note about the get-causative is that it can be used for contexts
allowing the achievement of the causative goal via verbal and non-verbal means,
and perhaps usefully, without the actual means of conduit being specified. In (6)
for instance,
(6) With that kind of problem I got my chief engineer to go out and
start the company making test equipment that will solve this type of
problem (ICE-SG s2a-043)
the speaker does not have to specify whether he verbally instructed his chief
engineer or conveyed what he wanted indirectly, by simply exposing him to the
problem. This is especially the case for non-human Causers, (see, e.g. 5e-f).
3.2
Relation of get-causative to other periphrastic causative constructions
The get-causative can be viewed as a member of the class of directive causative
constructions (Shibatani, 1976) in English, in which the doer of the event is
overtly specified, typically realised periphrastically. Structurally, these
constructions involve the use of an auxiliary cause verb such as cause, get, have,
make or force, in which the Causer engages an intermediary agent (doer),
the Causee, to effect the event of the verb complement. The form of the
periphrastic causative constructions is shown in (7):
(7) [NP_CAUSER - CAUSE VERB - NP_CAUSEE - [(to) - VERB COMPLEMENT]]
Examples of periphrastic causatives are listed in (8):
(8) a. No no they did Terry O’Neill’s session and it was such
garbage they got him to reshoot (ICE-GB s1a-052)
b. hm you know some of them bring a client in from a
workplace and then have the counsellor sit in and coach
them. (ICE-GB s1a-060)
c. The presence of steam at the top probably caused the beads
to fuse and over-expand and subsequently led to shrinkage
of the cushion after ejection. (ICE-SG W3a-038)
d. Inconvenience and loneliness, but mostly loneliness, made
me think of home. (ICE-SG W2f-014)
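For an untagged corpus, the template in (7) can be approximated by a surface search pattern. The following Python sketch is a hypothetical illustration, not the retrieval procedure used in this study: it looks for a form of get, followed by one to four words standing in for the Causee, followed by to and a further word. The names `GET_FORMS`, `PATTERN` and `find_candidates`, and the four-word limit on the Causee, are assumptions made for the example.

```python
import re

# Inflected forms of the cause verb "get".
GET_FORMS = r"(?:get|gets|getting|got|gotten)"

# get + 1-4 words (candidate Causee) + "to" + one word (candidate verb).
PATTERN = re.compile(
    rf"\b(?P<cause>{GET_FORMS})\s+"
    rf"(?P<causee>(?:[\w'’-]+\s+){{1,4}}?)"
    rf"to\s+(?P<verb>[\w'’-]+)",
    re.IGNORECASE,
)


def find_candidates(text):
    """Return (cause verb, causee string, following word) for each match."""
    return [
        (m.group("cause"), m.group("causee").strip(), m.group("verb"))
        for m in PATTERN.finditer(text)
    ]
```

Such a pattern only yields candidates. It also matches non-causative strings such as get permission to tape, and it misses tokens in which the Causee is ellipted, so every hit still has to be checked by hand in its wider context.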
At this level of granularity, the various cause verb causative constructions can be
analysed as transparent alternatives in the periphrastic causative paradigm. While
the structure remains constant, each construction has a unique semantic profile, a
configuration of a number of salient semantic features and specific
communicative functions, such as that detailed for the get-causative above.
Variation in semantics can relate to animacy preferences (see Talmy 1979 [2000]
for analyses of a number of periphrastics), verb complement type constraints, e.g.
constraints on the make causative depending on the directness relation between
the two causing and caused events (Kemmer & Verhagen, 1994; Kemmer, 2002),
and appropriate communicative situations (register preferencing).
While the known members of the periphrastic causative class are certainly
possible candidates for paradigmatic alternatives, 8 it is also true that structurally
different forms may also be competitors. In a new study, Ziegeler & Lee (m.s.)
argue that the conventionalised scenario construction (9b) is evolving to become
a paradigmatic alternative to some functions currently filled by the resultative
construction environment (9a) in some dialects of English, especially SE.
(9)
a. I got my hair cut.
b. I cut my hair.
Such functional alternatives are problematic for the use of corpora in extracting
frequency data, as the construction is formally indistinguishable from an
alternative transitive construction involving no indirect causativity at all.
Furthermore, no specific grammatical marking is used to express the implied
indirect causativity, which would mean that any corpus retrieval must involve
searching for haphazard lexical instances, and some methodology would need to
be developed to delimit the pool of likely lexical candidates for retrieval. Such a
methodology may involve closer field observation of language items in regular
use than is possible through the constructed database of a corpus, no matter how
representative it might be. There are clearly problems to be overcome and
methods to be developed in searching for items which are considered to be open-class functional replacements of former closed-class grammatical categories, as
they cannot even represent paradigmatic substitutes within a particular
construction.
In this study, we searched for members of the periphrastic class and not open-class replacements. A pre-analysis of the semantic profile of the construction was
an important aspect of the methodology we used.
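The retrieval problem described above can be illustrated with a small sketch. It assumes plain, untagged text and a hypothetical regular expression for the surface string "form of get + short noun phrase + to + verb"; it is not the search procedure used in the study, and any hits would still need manual checking. The two test sentences are taken from examples (14b) and (9a).

```python
import re

# Hypothetical candidate pattern: a form of GET, then one to four tokens
# (a rough stand-in for the causee noun phrase), then "to" and one more word.
GET_CAUSATIVE = re.compile(
    r"\b(?:get|gets|got|getting|gotten)\s+(?:[\w'-]+\s+){1,4}?to\s+[\w'-]+",
    re.IGNORECASE,
)

def find_candidates(text):
    """Return candidate get + NP + to + VERB strings for manual checking."""
    return [match.group(0) for match in GET_CAUSATIVE.finditer(text)]

sample = (
    "With that kind of problem I got my chief engineer to go out. "
    "I got my hair cut."
)
print(find_candidates(sample))
```

The pattern retrieves the to-infinitive causative but not the resultative in (9a), and it has no way of finding a sentence such as (9b), which is the difficulty with open-class alternatives noted above.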
4.
Approach used in the study
The search for paradigmatic parallels is a particularly relevant strategy to studies
that involve language change in dialects situated in a complex ecology. As noted,
standard SE coexists with a number of languages genetically-unrelated to English
(various Chinese dialects, Malay, Tamil, Baba Malay) as well as various
Englishes including a creoloid form heavily influenced by Chinese. Most
standard SE speakers are bilingual from infancy, and a number of Singaporeans
are L2 speakers of English, increasing the possibility of ‘alien’ cognitive-induced
features entering the dialect. 9 Previous studies have shown that grammatical
features may be sourced from a background language, for example, elements of
the aspectual system from Chinese (Bao, 2001) or that wider distribution of
grammaticalised features in L1 dialects (hyper-grammaticalisation) may develop
as a result of differences in the paths of ontogenetic grammaticalisation (Ziegeler
2000).
The basic approach used here is a comparative one. Quantitative information can
be an important indicator of how widespread the use of the get-causative
construction is in SE; however, the figures are meaningless without a standard
measure. BE and NZE were enlisted to serve as the contrasting dialects. BE was
chosen partly because of historical reasons connected to the development of
certain (e.g., more formal) registers in SE, i.e., styles for more formal registers
followed an exonormative standard, and partly because it is a variety developed
within a largely monocultural, monolingual environment. However, since BE is
possibly much slower than other dialects in dispensing with the prescriptive bias
against the use of get, 10 and assuming that the get-causative is affected by it, it seemed
methodologically sounder to introduce a third dialect into the analysis. NZE, like
BE, satisfies the monolingual, monocultural constraint but may be less subject to the
prescriptive bias. 11 NZE would therefore provide a suitable quantitative control
for frequency information, in case SE was found to contain
significantly more instances of the get-causative construction than BE.
Another important aspect of our methodological approach was to systematically
account for each instance of distributional variation found for the construction
and each association pattern found for SE, across registers as well as across
dialects. We considered this crucial because the usage of a particular
periphrastic causative construction is at least partly shaped dialect-internally,
by pressures on usage that differ from dialect to dialect. Influences
can also be situationally induced. It has been shown that linguistic units, whether
lexical items or constructions, often show a preferential bias according to
register-related variables such as formality and (im)personal style (Biber
1999; 1995, and references therein). The ICE corpora contain a number of texts
standardised according to register, so distributional variation can conceivably
arise from register bias. For our frequency counts, it was crucial that we recognise
any skewing due to this factor so that semantically induced variation could be
identified.
To find instances of qualitative variation, we manually analysed each instance
found for animacy and overt marking of difficulty. In the case of the latter, as we
found that words denoting difficulty or effort were frequently associated with the
get-causative, we conducted a supplementary search for periphrastic causative
constructions that were possibly paradigmatic alternatives for the same functional
environment (see Section 5 below).
5.
Distribution patterns for the get-causative in ICE-GB, ICE-NZ and
ICE-SG
5.1
General frequency of occurrence
Table 1 below provides a summary of the frequency of occurrences of the get-causative in each corpus. 12 The information provided is the absolute number of
instances found in each corpus (n), as well as the totals for the spoken and written
modes. As ICE-NZ contained about 30% more words than either of the other
two sub-corpora, the table provides totals normed to 100,000 words as well as a
ratio based on the ICE-GB results. The last may be a more accurate indicator of
frequency across dialects generally.
Table 1: Frequency of occurrence of the get-causative in ICE-GB, ICE-SG and
ICE-NZ (n = total instances found; rate = instances per 100,000 words;
ratio = occurrence relative to ICE-GB)

          Spoken                Written               Total (Average)
          n    rate   ratio     n    rate   ratio     n    rate   ratio
ICE-GB    31   4.9    1         11   2.6    1         42   3.8    1
ICE-SG    42   6.6    1.3       18   4.5    1.7       60   5.6    1.5
ICE-NZ    68   9.3    1.9       24   4.0    1.5       92   6.7    1.8
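The ratio figures in Table 1 can be checked against the normed rates. The following is a minimal sketch in Python, assuming that each ratio is the rate for a corpus divided by the ICE-GB rate for the same mode, rounded to one decimal place; the dictionary and function names are hypothetical.

```python
# Normed rates (per 100,000 words) as reported in Table 1.
RATES = {
    "ICE-GB": {"spoken": 4.9, "written": 2.6, "total": 3.8},
    "ICE-SG": {"spoken": 6.6, "written": 4.5, "total": 5.6},
    "ICE-NZ": {"spoken": 9.3, "written": 4.0, "total": 6.7},
}

def ratio_to_baseline(rates, corpus, mode, baseline="ICE-GB"):
    """Rate of `corpus` relative to the baseline corpus, to one decimal place."""
    return round(rates[corpus][mode] / rates[baseline][mode], 1)

for corpus in RATES:
    row = {mode: ratio_to_baseline(RATES, corpus, mode) for mode in RATES[corpus]}
    print(corpus, row)
```

Under that assumption the computed ratios agree with the reported ones, for example 6.6 / 4.9 for spoken ICE-SG gives 1.3.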
Table 1 shows that the comparative frequency count of instances found in the
three corpora supports our personal observation that the get-causative may be used
more frequently in NZE than in BE. The frequency count for ICE-SG is higher
than for ICE-GB but not ICE-NZ. In terms of mode, the ICE-SG figure for
written is slightly higher than in ICE-NZ. However, mode per se does not provide
an accurate measure of language use, 13 so the separate figures for mode in
Table 1 provide a skewed view of the quantitative results. Additionally, the
frequency count comparison merely gives the impression that the frequency
of the construction in ICE-SG lies somewhere in between BE
and NZE, and is thus not helpful in isolating any deviation from the other two
dialects.
Table 2 breaks down the frequency distribution according to register (normed per
100,000 words).
The striking aspect of comparison here is that all dialects are fairly similar in the
way in which the get-causative is distributed across registers. Note that registers
of the spoken mode do not necessarily contain more instances of the get-causative
than the written ones.
Table 2: Frequency distribution of get-causative across registers in ICE-SG/ICE-GB/ICE-NZ per 100,000 words.

ICE-SG                                             ICE-GB                                                    ICE-NZ
25.0  Legal cross-examinations (S)                  9.7  Business Transactions (S)                           19.0  Social Letters (W)
12.1  Persuasive Writing (W); Social Letters (W)    6.9  Broadcast Talks (S); Non-Academic Writing (W)       16.8  Private Dialogues (S)
 9.6  Private Dialogues (S); Reportage (W)          6.8  Private Dialogues (S) / Broadcast Discussions (S)   15.7  Business Transactions (S)
 9.0  Broadcast Interviews (S)                      6.7  Business Correspondence (W)                         10.5  Instructional Writing: Skills & Hobbies (W)
 8.0  Parliamentary Debates (S)                     5.0  Public Dialogues (S)                                 9.7  Unscripted Speeches (S)
Unscripted Speeches (S)
6.3
Legal CrossExaminations (S) / Class
Lessons (S)
4.7 Demonstrations (S)
Broadcast Talks (S)
4.8 Non-broadcast Talks
4.6
Spontaneous
Commentary (S)
8.4
Instructional Writing
(both types) (W)
4.7
Broadcast Interviews (S)
4.5
Class Lessons (S);
Creative Writing (W)
8.0
(S)
4.5
Business Letters (W)
Instructional Writing:
3.3 Administrative Writing 7.1
(W)
Class Lessons
4.4
Reportage (W)
2.4
Business Transactions (S)
4.2
Creative Writing (W);
Academic Writing (W)
2.3 Reportage (W)
Creative Writing (W)
2.4
Legal Cross
Examinations (S)
4.5
Non-academic Writing
(W)
1.2
Non-academic Writing
4.0
Academic Writing (W)
1.1
(S)
4.1
(S)
3.7
Non printed Student
Essays (W)
3.1
Academic Writing
1.6
Legal Presentations (S);
Non-broadcast talks (S);
Demonstrations (S);
Spontaneous
Commentary (S);
Broadcast News (S);
Student Essays (W);
Exam Scripts (W);
Business Letters (W);
0.0
(W)=Writing; (S)=Speech;
Class Lessons (S);
(S); Broadcast News (S);
Spontaneous
Commentary (S);
Demonstrations (S);
Legal Presentations (S);
Social Letters (W);
Instructional Writing
(W); Persuasive Writing
(W)
0.0
8.7
Non academic Writing:
6.6
Tech (W)
6.4
Broadcast Interviews (S);
Exam Scripts (W);
0.0
Business Letters (W);
Persuasive Writing (W)
The data appear to suggest that registers which inherently contain more
directives and (inter)personal situations (perhaps personal style) are
more likely to contain the construction. For example, Business Transactions
(which contain many situations where A wants B to do C, e.g., server/customer in
a shop) ranked highly for ICE-GB and ICE-NZ, and Private Dialogues in all three
corpora. This may also explain why Academic Writing, 14 which avoids personal,
involved styles, ranked so low across all dialects. The stigmatisation of get may
also be a factor contributing to the observation that more formal registers did not
contain many instances. There may also be idiosyncratic preferences, which
could explain why the register Legal Cross-Examinations
ranked so high in ICE-SG. But this could also be skewing caused by low
absolute numbers, as some registers contained far more words than
others. For example, Private Dialogues contained around
250,000 words but Legal Cross-Examinations were represented by only
20,000 words. Across the dialects, it is clear that the get-causative is more
widespread across registers in ICE-NZ. The data may also suggest that style is
more informal and personal generally across registers in NZE; however,
confirming this would require further study beyond the scope of this one.
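The effect of register size on normed rates can be made concrete with a short calculation. The sketch below, in Python, assumes the approximate register sizes mentioned above (20,000 and 250,000 words) and shows how far a single additional instance moves the rate per 100,000 words in each register; the function and constant names are illustrative.

```python
def per_100k(instances, words):
    """Normed frequency: instances per 100,000 words."""
    return instances * 100_000 / words

# Approximate register sizes quoted in the text (assumed, not exact counts).
LEGAL_CROSS_EXAMINATIONS = 20_000
PRIVATE_DIALOGUES = 250_000

# Change in the normed rate caused by one extra instance in each register.
print(per_100k(1, LEGAL_CROSS_EXAMINATIONS))
print(per_100k(1, PRIVATE_DIALOGUES))
```

On the same assumptions, the figure of 25.0 per 100,000 words that Table 2 appears to give for Legal Cross-Examinations in ICE-SG would correspond to about five instances, so one or two occurrences more or fewer would change the ranking considerably.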
The results show that accuracy of analysis would profit from adopting a register
variable-focused approach (Conrad & Reppen, 1998), as there are strong
regularities across registers that appear to be consistent across
dialects. 15 For our purpose, a higher degree of accuracy and more information
about register-related preferences are not required. We merely noted that the
environments where paradigmatic parallels are likely to be found are registers
that contain a higher rate of directives, (inter)personal relationships and more
informal styles of communication. 16 We also concluded that there were no
significant differences between SE and the other two dialects in the influence of
discourse factors on the use of the construction.
5.2
Register-related patterns
On a related note, there is more evidence that supports the above conclusion
about the parity of SE in relation to BE and NZE. We found that all three corpora
were consistent in the way in which they used the get-causative in registers where
style was impersonal and formal. While instances with human agents in Causer
and Causee positions were found, in more impersonal and formal communicative
styles, agents can be absent (unspecified), as in (10a), generic (10b) or nominal
clausal or phrasal causers (10c).
(10)
a. Workflow simply means getting the computer to pass
information from one person in a workchain to
another. (ICE-SG w2b-039 ‘Non-academic Writing’)
b. So far, the EDB has played the role of growth
advocate to the hilt, pushing hard to get the entire
population to go for growth and helping the
Government to focus all its policies on promoting
growth. (ICE-SG w2e-005d ‘Press Editorials’)
c. The matchmakers’ persuasive powers getting clients to
lower their expectations are also being tested. (ICE-SG
w2c-010)
In general, frequency information did not reveal any significant variation in the
use of the get-causative across the three corpora, and, significantly, did not
demonstrate any usages uniquely SE. We now turn to the qualitative information
the corpus analysis revealed.
5.3
Semantic features of the get-causative
5.4
Agency typology & distribution
While agency is prototypically attributed to sentient entities (usually humans),
causative verbs can show a preference for a particular type of entity. We found,
for instance, that in all three corpora, cause is used in impersonal registers, mostly
in scientific/technical documents, and causer and causee tend to be inanimate
entities or clausal/phrasal nominals. It is rare to find an example such as
(11c), especially in impersonal registers.
(11)
a. The presence of steam at the top probably caused the
beads to fuse and over-expand and subsequently led to
shrinkage of the cushion after ejection. (ICE-SG W2a-038 ‘Academic Writing - Technology’)
b. A precaution though; fertilizer granules should not be
applied too near to its tree trunk as it may cause the
stem to rot. (ICE-SG W2d-013 ‘Instructional Writing’)
c. I caused John to go. (Shibatani 1976:3)
By contrast, the configuration of participants in get-causative constructions is to
have animate causers and causees, typically human beings, or metaphoric
extensions of essentially sentient attributes (e.g., volition), supporting earlier
observations about this (see Section 3.1.1.). Table 4 shows the prototypical
animacy profiles of the get-causative in all three corpora (note then that an
example like (5f) would be a very rare use of the construction).
Table 4 shows that ICE-SG contains no significant variation from the other sub-corpora with respect to animacy configurations. At least in terms of animacy
association patterns, SE appears to show no variation from BE and NZE.
Table 4: Prototypical animacy profile of causers and causees in ICE-GB, ICE-SG
and ICE-NZ (in %).

                  H + H    H + NH    NH + H    NH + NH
ICE-GB (get)      85.7     7.1       4.8       0
ICE-SG (get)      91.7     3.3       1.7       1.7
ICE-NZE (get)     84.8     3.3       12.0      0
(H=human; NH=nonhuman. H includes extended animacy.)
5.5
Association Patterns
We noted in Section 3.1.1. that the get-causative contains the aspectual
suggestion of difficulty, effort or the focus of success in achieving the causative
event. We found that the get-causative does occur systematically with linguistic
items denoting difficulty and effort but that this differed across dialects.
Linguistic items marking the difficulty or effort included modals (e.g., must) and
catenative verbs, such as try to/and. Try to/and is the most common collocate and
indeed, in all three corpora, get is the verb that most frequently
collocates with try to/and. The list of collocations found for the get-causative is
given below:
(12)
Collocations denoting difficulty/effort found in ICE-GB, ICE-SG & ICE-NZE (only the lemma of each verb is given)
try to/and; manage to; it would be very hard to; difficult to;
such an effort to; been unable to; cannot; could not; wouldn’t
be able to; it is possible to; must; perhaps...can; succeed in;
finally; eventually
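The coding in this study was done manually, but a first-pass filter over the collocates in (12) could be sketched as follows. This is a minimal illustration in Python, assuming a small hand-written list of regular expressions for a subset of the items in (12); the pattern list and function name are hypothetical, the first test sentence is constructed, and hits would still need checking against the discourse context.

```python
import re

# Hypothetical patterns for a subset of the effort/difficulty markers in (12).
EFFORT_MARKERS = [
    r"\btr(?:y|ies|ied|ying)\s+(?:to|and)\b",
    r"\bmanag(?:e|es|ed|ing)\s+to\b",
    r"\b(?:hard|difficult)\s+to\b",
    r"\bunable\s+to\b",
    r"\bcannot\b",
    r"\bcould\s+not\b",
    r"\bsucceed(?:s|ed|ing)?\s+in\b",
    r"\b(?:finally|eventually)\b",
]
EFFORT_RE = re.compile("|".join(EFFORT_MARKERS), re.IGNORECASE)

def is_effort_coded(context):
    """True if the context around a get-causative contains an effort marker."""
    return EFFORT_RE.search(context) is not None

# Constructed sentence with an overt marker, then a neutral corpus example.
print(is_effort_coded("We finally got him to sign the form."))
print(is_effort_coded("I got my chief engineer to go out and start the company."))
```

A filter of this kind only catches overt markers; cases where effort is implied by the wider context would still have to be identified by reading.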
In some instances, extended discourse context was consulted, to determine if
difficulty or effort is suggested, as in (13).
(13)
a. like you’d get him to do things he wouldn’t usually do
but then, he’d kind of give you some security… (ICE-NZ s1a-056)
b. Mister Warburton says national insurance doesn’t
provide a rebate or any financial incentive to get people
to install detectors (ICE-NZ S2b-007)
Neutral usages, e.g., (14), where no difficulty or effort was marked were also
found.
(14)
a. I rung this morning. I rung your mother and she was
out… and he said ‘can I get her to phone you’ and I
said ‘oh no’. (ICE-NZ s1a-007)
b. With that kind of problem I got my chief engineer to
go out and start the company making test equipment
that will solve this type of problem (ICE-SG s2a-043)
In (14a), there is no indication that the speaker thinks that it will be difficult to get
the mother to telephone the other person; the same holds for (14b). The use of the get-causative is merely for instructional purposes. Table 5 shows the frequency of
effort-coded in relation to neutral instances across the dialects.
Table 5: Frequency of collocation of get-causative with linguistic items
suggesting difficulty or effort in achieving the causative event in ICE-GB, ICE-SG and ICE-NZ (in %).

                ICE-GB    ICE-SG    ICE-NZE
Effort-coded    64.2      41.7      24.0
Neutral         35.8      58.0      76.0
As Table 5 shows, ICE-GB had the highest proportion of instances involving
difficulty or effort (64.2%), and ICE-NZ the lowest. The results for ICE-SG were
again in between the two other dialects so, once again, ICE-SG shows no sign of
deviation from expected usages of the get-causative.
While other cause verbs can and do co-occur with items denoting effort and
difficulty, none produced quantitative results that indicated a degree of
association as cohesively as for get-causatives. We concluded that there was a
pattern of grammatical association between the get-causative and events in which
the causer is viewed as having to exert some effort in obtaining the caused event.
5.6
Finding paradigmatic parallels to the get-causative
A supplementary search using the effort-coded association pattern was conducted
to determine whether other possible cause verbs were also associated with
difficulty and effort. We found no deviation from expected usages in ICE-GB and
ICE-NZ. In ICE-SG, an instance (15) was found where the speech act verb ask
was used in what appeared to be a function typically filled by the get causative.
(15)
T: I didn’t know that last time I bought so such a small policy
C: Oh yes I am trying to ask you to buy a bigger policy you say no
no no (ICE-SG S1b-075)
C is an insurance salesman and T a customer. During the exchange, C is
attempting to sell T a more expensive life insurance policy by giving the latter
reasons for doing it. T keeps asking questions about the offer. C says ‘…I am
trying to ask you to buy…’. There are two interesting points about this instance of
ask. The utterance reports on an indirect speech act which was uttered with some
apparent effort, and not the effort involved in effecting an indirect causative act.
The same use of ask in (16) refers only to the difficulty of producing the question
verbally:
(16) My Italian is not good, I tried to ask you to pardon me.
Another point about this example is that C may have been referring to his efforts
in the past and not in this conversation. At least in the segment recorded for use in
ICE, there is no record of T having said ‘no no no’. Some SE speakers do not
make use of tense-marking if time is marked in some other way, e.g., (17).
(17) That day you know you’re trying to tell me to do both by
December…I mean I know it’s necessary but it’s (ICE-SG S1b029a)
In (17), the event happened in the past marked by that day, but the present
progressive tense form are trying is used. If indeed (15) is another instance of
this, the evidence is even stronger that the use of ask is developing another
function in SE. What is significant is that no instance involving the use of speech
verbs as neo-causatives was found in ICE-GB and ICE-NZ. 17 However, since
only one instance of causative ask was found, we need to do further work to
determine whether this is a development for ask or whether other speech verbs
are involved (some instances of causative call were also found).
6.
Conclusions
In our investigation of the get-causative in SE, we exploited grammatical
association patterns to determine whether paradigmatic parallels to the get-causative can be found in that dialect. As the results showed, the quantitative
analysis did not indicate any significant variation in the SE use of the get-causative. Register-related usages of the construction in SE also showed no
deviation from the findings for BE and NZE. We are cautious in our conclusion
that there are no register differences in SE as the sample size (each corpus is
one million words) may have been too small to expose any subtle variations. As
our primary interest is in semantic variation, we do not believe this point damages
our account of the use of the construction in standard SE. The more interesting
methodological point concerns the paradigmatic parallel we may have
found in the speech verb ask. If this is really an instance of paradigmatic
replacement, then frequency of occurrence is not necessarily always a true
indication of what features are used in a dialect. 18 It was noted by Biber (1995)
that a feature that is distinct to a particular dialect need not occur frequently in a
corpus to be significant. However, this raises the issue of how one can distinguish
a significant occurrence from merely a non-patterned one. Clearly, more work
needs to be done on this methodological issue, and in finding paradigmatic
parallels in general.
With respect to the methodology we did use, we found our objective to account
systematically for the features found in usages of the get-causative to be a useful
approach to cognitive (including semantic-based) investigations using corpora. Of
course, much more could be done to fine-tune the methodology we used;
however, we hope to have raised some issues relating to the difficulties of finding
paradigmatic parallels in corpus study.
Notes
1
The authors would like to acknowledge the UK Economic and Social
Research Council for sponsorship of the research project upon which this
study is based (grant no. R000223787; PI: Debra Ziegeler).
2
A search for the form [NP1CAUSER-CAUSE VERB-NP2CAUSEE-[(to)VERB COMPLEMENT]].
3
We used the Wellington Corpus (spoken and written) at the presentation
given at the Corpus Linguistics 2003 Conference, in Lancaster, UK
because ICE-NZE was unavailable to us at the time.
4
This book is based on Fowler’s Modern English Usage.
5
Hundt (cf. Denison 1998) mentions the possibility that the get passive may
be more stigmatized in BE than in American English, hence the lower
number recorded in her corpus study.
6
Subsequent surveying and comments by speakers of BE suggest that age
may be a variable in determining frequency of use. Informants who are
generally younger are less likely to observe the prescriptive rule, or even
are unaware of its existence.
7
Talmy (2000:531) includes the get-causative as one example of ‘caused
agency’ or inducive causatives. Talmy (2000:534) also notes that in the
use of the get-causative, the final event which the Causer induces is
considered to be desirable to some involved entity, but we did not use this
point in our corpus study.
8
A list is given in the Collins COBUILD grammar patterns. - 1: Verbs.
(1996).
9
See Ho & Platt (1993) amongst others, for an introduction to Singapore
English.
10
Hundt (2001), for example, cites Denison (1998) claiming that the get
passive occurs more frequently in AE than in BE, perhaps because of less
pressure from prescriptive bias in the former dialect.
11
From personal observation. If true, this observation may hold for
Australian English as well. Both claims need substantiation.
12
The total number of instances in each sub-corpus is not high so one can
question whether the quantitative differences can support any statement
about cross-dialectal variation. However, fewer instances of other
periphrastics (with the exception of make-causative) were found in all
corpora. It may mean that the sample size of one million should be
increased for studies involving periphrastic causatives.
13
“Speech and writing are not homogeneous types but stereotypes. Speech is
generally informal (face-to-face) and written, (information exposition)
more formal.” (Biber 1988:36)
14
For Academic Writing and Non-academic Writing registers, the four subtypes, Humanities, Social Sciences, Natural Sciences and Technology,
were collapsed.
15
For the multiple-dimensional approach, see Biber (1995, 1988); Biber et al
(1998), which we could refer to if we were interested in register
distribution. However, this investigation does not target situationally-induced variation.
16
In subsequent research for instance, we examined Internet-based
communications such as emails and newsgroups, in which participants
generally use informal styles.
17
(We use the term neo-causative to refer to new members which may be
used within the causative verb paradigm that we have described.) We have
found some of these in Internet registers, in informal styles, typically by
writers in younger age groups and American English.
References
Bao, Z. (1995), Already in Singapore English. World Englishes. 14, 2, July, 181-188
Bao, Z. (2001), The Origins of Empty Categories in Singapore English. Journal
of Pidgin and Creole Languages. 16, 2, 275-319
Bao, Z. (2002), The aspectual system of Singapore English and systemic
substratist explanation. (unpublished)
Barlow M and S. Kemmer (1999), Usage-based models of language. Cambridge:
Biber D. (1999), Language use through corpus-based analyses, in: Barlow M and
S. Kemmer (eds.) Usage-based models of language. Cambridge:
Cambridge University Press, pp. 287-313.
Biber D. (1995), Dimensions of Register Variation. Cambridge: Cambridge
University Press
Biber D. (1988), Variation Across Speech and Writing. Cambridge, Cambridge
University Press.
Biber D, S. Conrad and R. Reppen (1998), Corpus Linguistics. Investigating
Language Structure & Use. Cambridge, UK: Cambridge University Press.
Collins COBUILD grammar patterns. - 1: Verbs. (1996), London: HarperCollins
[for] the University of Birmingham.
Fowler, H.W. (1996), The New Fowler's modern English usage / first edited by
H.W. Fowler. - 3rd ed. - Oxford: Oxford University Press.
Givón T. (1979), On understanding grammar. New York: Academic Press.
Greenbaum S. (ed). (1996), Comparing English worldwide. Oxford: Clarendon
Press.
Ho, M. and J. Platt (1993), Dynamics of a contact continuum. Singaporean
English. Oxford: Clarendon Press
Hollmann W. m.s. Towards a truly dynamic usage-based model: The case of
periphrastic causative get.
Hundt M. (2001), What corpora tell us about the grammaticalisation of voice in
get-constructions. Studies in Language. 25:1
Hunston S. and S. Francis (1999), Pattern grammar. A corpus-driven approach to
the lexical grammar of English. Amsterdam/Philadelphia: John
Benjamins.
Kemmer, S. and A. Verhagen (1994), The Grammar of Causatives and the
Conceptual Structure of Event. Cognitive Linguistics. 5-2, pp.115-156
Nicholson, M. (1957), A dictionary of American-English usage, based on
Fowler's Modern English Usa. - New York
Shibatani M. (1976a), The Grammar of Causative Constructions: A Conspectus.
Chapter in Shibatani M. 1976b.
Shibatani M. (ed.) (1976b), Syntax and Semantics. The Grammar of Causative
Constructions. New York: Academic Press.
Sinclair J. (ed.) (1993), Collins COBUILD English language dictionary.
Sinclair J. (1991), Corpus concordance collocation. Oxford: Oxford University
Press.
Talmy, L. (2000), Towards a cognitive semantics. Cambridge, Mass.; London:
MIT Press.
Talmy L. (1976), Semantic Causative Types. In Shibatani 1976. pp.43-116.
Ziegeler, D. (2000), Hypothetical modality: grammaticalisation in an L2 dialect.
Amsterdam: Benjamins
The curse and the blessing of mobile phones – a corpus-based
study into American and Polish rhetorical conventions
University of Łódź
Abstract
The study reported in this chapter applies a corpus-based methodology to investigate
rhetorical strategies used by American and Polish apprentice writers in their
argumentative essays. The data was collected from 79 Polish first-year students of
English, and their 80 American counterparts, freshman non-English majors. The simplest
method was chosen to analyse the discrepancies between the two groups of essays: the
comparison of wordlists also known as the keyword analysis. The study revealed
interesting textual differences between American and Polish essays which pertain to such
rhetorical strategies as the choice of general versus experience-related arguments, the
level of formality and the use of structuring devices.
1.
Introduction
Learner corpora have recently become an important source of data in second
language acquisition studies. Samples of learners’ written and spoken L2
production are collected with the aim to describe as accurately as possible various
characteristics of interlanguage. The main interests of researchers revolve around
the differences between native and non-native linguistic systems. Thus,
investigations focus on such linguistic features as lexical, grammatical and
syntactic factors (Meunier, 1998). However, the differences between native and
non-native production also occur at the macrolinguistic level and are related to
the ways discourse is structured by both groups of language users. Such
discrepancies reflect broadly understood cultural differences and are studied
within the framework of contrastive rhetoric.
The claims of contrastive rhetoric can be summarised as follows:
Contrastive rhetoric maintains that language and writing are cultural
phenomena. As a direct consequence, each language has rhetorical
conventions unique to it. (Connor, 1996: 5)
When writing in a foreign language learners show a tendency to transfer not only
the linguistic features of their native tongue but also its rhetorical conventions.
These conventions pertain to such factors as the structure or units of texts,
explicitness, information structure, politeness and intertextuality (Myers, 2002).
As a result, native speakers of a language may find learners’ written discourse
ineffective or even incomprehensible.
So far, the research in contrastive rhetoric has rarely applied the corpus-based and
quantitative methods to support its claims (see Anderson, 2001 for a
counterexample which, however, is the exception that proves the rule). Its
findings are usually derived from the qualitative analysis of example texts. For
example, Duszak’s (1994) study of the differences between English and Polish
intellectual styles as reflected in the introductions to academic papers is based on
a selection of 20 Polish and 20 English articles in the field of linguistics, which
are exploited only as a source of examples to support the points the author makes,
with no attempt to quantify the results. The qualitative approach undoubtedly has
its merits in allowing a detailed scrutiny of the exact rhetorical strategies applied
by writers. However, supplementing such methods with quantitative data can
make the analysis more reliable and justifiable.
The aim of this paper is to demonstrate that research in contrastive rhetoric can
benefit from the application of the corpus-based and quantitative methodology. In
the same fashion, adopting the text approach rather than the language approach to
the analysis of corpus data (Scott, 2000) can lead to totally unexpected and very
revealing insights. In the study reported here such an approach and such a
methodology are applied to investigate the differences in the rhetorical strategies
used by American and Polish apprentice writers in their argumentative essays.
2.
Data
The research reported in this chapter in fact represents a by-product of a large
scale project, whose aim is to compile and explore a Polish learner corpus of
written English. The corpus consists mainly of argumentative (but also narrative,
descriptive and quasi-academic) essays written by Polish advanced learners of
English at varying stages of proficiency in L2. As is the case with most learner
corpus studies the explorations within the project focus on the areas of lexical,
grammatical and syntactic characteristics of learners’ interlanguage.
For one of the studies within the project, whose initial aim was the investigation
of the breadth of learners’ lexical knowledge, data was collected from 79 Polish
first-year students of English at the Institute of English Studies, University of
Łódź, and their 80 American counterparts, freshman non-English majors at South
West University in Marshall, Minnesota. In order to ensure a close comparability
of the samples, both groups of students were asked to write an essay in almost
identical conditions (during one of their first composition classes at the
university) and on the same topic – “The mobile phone – the curse or the blessing
of the end of the 20th century”. The choice of this topic was fairly arbitrary, as
the research questions were not related in any way to the study of attitudes or
lexis pertaining to this particular area. However, during the process of coding the
data it was noticed that the American and Polish essays differed greatly in their
treatment of the topic. Thus, the decision was made to pursue this observation in
a more rigorous way.
3.
Method
Since the initial observation indicated that, while writing on the same topic, the two
groups of students wrote about different issues and problems, it seemed
appropriate to turn to the field of content analysis for a suitable
methodology to process the data. From among a range of tools widely used by
researchers in this area, the simplest method was chosen to analyse the
differences between the two groups of essays: the comparison of wordlists, also
known as keyword analysis. According to Scott (2000) keywords are a good
indicator of the ‘aboutness’ of a text, thus the procedure seemed an appropriate
first step to pinpoint the discrepancies between the samples.
Instead of using the keyword tool in Wordsmith Tools, the decision was
made to process the essays with Wmatrix, a corpus-analysis tool developed at
Lancaster University. The advantage of this software over Wordsmith Tools is
that it detects recurring lexical phrases and treats them as single units in the
analysis, so the final list of key items consists of both individual words and fixed
phrases. Such a list is likely to give a better picture of the main topics surfacing in
both groups of essays. Moreover, the generation of a keyword list in Wmatrix is
based on a more reliable statistical procedure, the log-likelihood statistic, which
has been shown to produce better results in establishing keywords (Rayson,
2003).
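The idea of treating a recurring phrase as a single unit can be illustrated with a minimal sketch. It is not a description of how Wmatrix works internally: the function name merge_phrases, the greedy longest-match strategy and the short phrase list are assumptions made here for illustration, with the phrases themselves taken from the key items reported in this chapter.

```python
def merge_phrases(tokens, phrases):
    """Join known multiword phrases into single underscore-linked tokens.

    Illustrative only: a greedy longest-match over a fixed phrase list,
    not the procedure used by Wmatrix.
    """
    phrase_set = {tuple(p.lower().split()) for p in phrases}
    max_len = max((len(p) for p in phrase_set), default=1)
    merged, i = [], 0
    while i < len(tokens):
        # Try the longest possible phrase first, down to two words.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in phrase_set:
                merged.append("_".join(candidate))
                i += n
                break
        else:
            # No phrase starts here: keep the single word unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical phrase list and sentence, for illustration only.
PHRASES = ["get hold of", "on the road", "mobile phone"]
TOKENS = "You can get hold of someone on the road".split()
print(merge_phrases(TOKENS, PHRASES))
```

Once phrases are merged in this way, an item such as get_hold_of is counted as one entry in the wordlist and can surface as a key item in its own right.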
For the purpose of the keyword analysis, first a wordlist was created for each
group of essays. Next, Wmatrix compared the two wordlists against each other
and produced a list of the words and phrases overused in each subcorpus, arranged
according to the log-likelihood (LL) coefficient. Only the items whose LL
coefficient was above 6.6 (p<0.01) were chosen for detailed scrutiny. This
yielded a total of 321 items, which were further sorted into those overused by the
American students (145 words and phrases) and those overused by the Polish
students (176 items). The complete lists of the key items in the American
subcorpus and the Polish subcorpus are in the Appendix.
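The comparison described above can be sketched as follows. This is a minimal illustration of the standard two-corpus log-likelihood calculation, not the Wmatrix implementation. The raw frequencies are those given for cell, we and if in Tables 1 and 2; the subcorpus sizes are not stated in the chapter, so the values used here are rough estimates back-calculated from the percentages in the tables and should be read as assumptions.

```python
import math


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Log-likelihood (G2) for one item observed in two corpora."""
    total = freq_a + freq_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    g2 = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2.0 * g2


def key_items(counts_a, size_a, counts_b, size_b, cutoff=6.6, labels=("A", "B")):
    """Return (item, LL, corpus in which it is overused), sorted by LL."""
    rows = []
    for item in set(counts_a) | set(counts_b):
        a, b = counts_a.get(item, 0), counts_b.get(item, 0)
        ll = log_likelihood(a, size_a, b, size_b)
        if ll > cutoff:
            overused_in = labels[0] if a / size_a > b / size_b else labels[1]
            rows.append((item, round(ll, 2), overused_in))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Raw frequencies from Tables 1 and 2; corpus sizes are ASSUMED estimates.
POLISH = {"cell": 46, "we": 263, "if": 104}
AMERICAN = {"cell": 758, "we": 101, "if": 232}
POLISH_SIZE, AMERICAN_SIZE = 27_500, 32_950

for row in key_items(POLISH, POLISH_SIZE, AMERICAN, AMERICAN_SIZE,
                     labels=("Polish", "American")):
    print(row)
```

With these estimated sizes the values come out close to, though not identical with, the LL figures in the tables, which is to be expected given that the true token counts are unknown.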
The cut-off point applied in this study needs some justification. Rayson (2003)
claims that since the generation of a list of key items involves multiple statistical
tests, the probability of error multiplies. He advocates the adoption of a
much higher cut-off point, 15.13 (p<0.0001), for the results to be valid and
reliable. However, since the list will not be further analysed quantitatively, this
argument is not relevant here. The cut-off point chosen for the purpose of this
study is fairly arbitrary; a decision could equally have been made to analyse the
first 300 items on the list.
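The relation between the two cut-off points and their significance levels can be checked with a short sketch. It assumes, as is usual for a two-corpus comparison, that the LL value is referred to a chi-square distribution with one degree of freedom; under that assumption the p-value has a closed form in terms of the complementary error function. The function name is chosen here for illustration.

```python
import math


def chi2_p_value_1df(statistic):
    """Upper-tail probability of a chi-square value with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


for cutoff in (6.63, 15.13):
    print(f"LL = {cutoff:5.2f}  ->  p = {chi2_p_value_1df(cutoff):.5f}")
```

The stricter threshold matters when many items are tested at once: the more comparisons are made at p<0.01, the more items can be expected to pass the threshold by chance alone.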
Both lists were examined in search of those key items which could point to the
recurring themes in the subcorpora (and thus the ‘aboutness’ of the texts) as well
as to other characteristic features of the two groups of essays. The examination
was supported by the scrutiny of the concordance lines of the key items.
Whenever a specific theme could be identified, it was recorded next to the item in
the list. In some cases the same key item could belong to two different themes.
For example, the examination of the word plans (Figure 1), which was key in the
American corpus, revealed that the item is related to two themes, COST and
CONTACT.
 1  ith a company that sell rate  plans. The usual rates range from t
 2  y all talk about their great  plans. Oh yes, everything sounds gr
 3  es allow teenagers to set up  plans with their friends and allows
 4         h cell phones are the  plans that are offered. Some of tho
 5    lso in case of a change in  plans with where your family went w
 6  r cell phone the most. These  plans do not cost a tremendous amou
 7  nce charges. Many cell phone  plans include free long distance to
 8   eight o'clock, when all you  plans for the evening have been mad
 9   used for a simple change in  plans. If you didn't have a cell ph
10  ith friends in order to make  plans. With all the things teens ha
11  urn it off after you've made  plans & enjoy the company of the pe
12  inute appointments, and make  plans while on the road. There a
13   make calls for work, dinner  plans, or having important conversa
14  reless companies offer great  plans' which are both convenient an
15  nt of travel. There are also  plans that allow you to have free m
16  y of useful ways. Many phone  plans available are so inexpensive
17  phone calls. Most cell phone  plans have a cheaper rate per minut
18  nes and lastly the different  plans that can be used that will fi
19  f they get the right kind of  plan. The plans keep getting greate
20  e the ability to get certain  plans for your phone. Let's say tha
21  ave with cell phones are the  plans that are offered. Some of tho
22  your family. Many cell phone  plans now have free nationwide long
Figure 1: Concordance lines of the word plans in the American subcorpus
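Concordance lines of the kind shown in Figure 1 can be produced with a few lines of code. The sketch below is a generic keyword-in-context routine, not the concordancer used in the study; the function name, the context width of 28 characters and the sample text are assumptions made for illustration.

```python
import re


def concordance(text, node, width=28):
    """Return keyword-in-context lines for every occurrence of `node`."""
    pattern = re.compile(r"\b" + re.escape(node) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.start():match.end() + width]
        # Right-align the left context so that the node words line up.
        lines.append(f"{left:>{width}} {right}")
    return lines


# Invented sample text, for illustration only.
SAMPLE = ("Many cell phone plans include free long distance. "
          "We called our friends and made plans for dinner.")
for line in concordance(SAMPLE, "plans"):
    print(line)
```

Reading the aligned lines side by side is what makes it possible to see that one word form, here plans, serves two different themes.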
In some other cases the themes could overlap. For example, the keywords related
to the motif of DRIVING in the American corpus at the same time pointed to the
themes of EMERGENCY or HAZARD.
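The many-to-many relation between key items and themes can be made explicit with a small sketch. The theme lists below are taken from Table 3 (three themes only); the data structure and the function name are assumptions introduced here, since the chapter records themes by hand next to each item.

```python
from collections import defaultdict

# Three of the themes from Table 3, with multiword items written as in the
# Appendix (underscore-linked).
THEMES = {
    "DRIVING": ["driving", "car", "road", "drive", "driver",
                "on_the_road", "pull_over", "vehicle"],
    "COST": ["minutes", "plan", "distance", "service", "plans", "charges",
             "free", "month", "roaming", "rates", "cards", "cost", "long"],
    "CONTACT": ["get_hold_of", "drive", "on_the_road", "plans", "teens",
                "store", "kids", "reaching"],
}


def themes_by_item(theme_items):
    """Invert a theme -> items mapping into item -> sorted list of themes."""
    index = defaultdict(set)
    for theme, items in theme_items.items():
        for item in items:
            index[item].add(theme)
    return {item: sorted(themes) for item, themes in index.items()}


ITEM_THEMES = themes_by_item(THEMES)
# Items recorded under more than one theme.
SHARED = {item: t for item, t in ITEM_THEMES.items() if len(t) > 1}
print(SHARED)
```

Inverting the mapping in this way shows at a glance which items, such as plans or on_the_road, were assigned to more than one theme.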
In addition to the thematic grouping, other clusters of key items were also
observed and recorded. They were related to the purely linguistic properties of the
texts and included pronouns, linking expressions, fixed phrases and the target
language variety. The following two tables present the first 15 key items in the
two sorted lists.
Table 1: The first 15 key items in the American subcorpus
                      Year 1          Native
No.  word            raw     %      raw     %       LL      THEMES and other groupings
1    cell             46   0.17     758   2.30    639.53    American English
2    phones           81   0.30     475   1.44    242.25
3    phone           104   0.38     516   1.57    229.04
4    driving           5   0.02     106   0.32     95.70    DRIVING (EMERGENCY, HAZARD)
5    minutes           1   0.00      44   0.13     45.36    COST
6    get_hold_of       0   0.00      36   0.11     43.67    CONTACT 35, EMERGENCY 1, fixed phrase
7    while            24   0.09     108   0.33     43.67
8    have            183   0.67     390   1.18     43.66
9    someone          25   0.09      99   0.30     34.85
10   ways              2   0.01      38   0.12     33.37
11   get              30   0.11     106   0.32     32.35
12   you             342   1.25     595   1.81     30.98    pronoun (you)
13   car              15   0.05      71   0.22     [...]    DRIVING (EMERGENCY, HAZARD)
14   if              104   0.38     232   0.70     29.56
15   road              1   0.00      30   0.09     [...]    DRIVING (EMERGENCY, HAZARD)
Table 2: The first 15 key items in the Polish subcorpus
                      Year 1          Native
No.  word            raw     %      raw     %       LL      THEMES and other groupings
1    mobile_phone    321   1.17      98   0.30    168.92    British English, fixed phrase
2    of              800   2.91     502   1.53    133.55
3    it              608   2.21     363   1.10    114.90
4    we              263   0.96     101   0.31    107.06    pronoun (we)
5    our             170   0.62      46   0.14     99.99    pronoun (we)
6    nowadays         45   0.16       2   0.01     56.80    GENERAL
7    mobile           55   0.20       6   0.02     54.73    British English
8    which           107   0.39      35   0.11     52.48
9    is              587   2.14     468   1.42     43.70
10   kind             43   0.16       5   0.02     41.75
11   very            124   0.45      55   0.17     41.28
12   fact             42   0.15       5   0.02     40.40
13   moreover         23   0.08       0   0.00     36.24    linking expression
14   invention        42   0.15       8   0.02     31.92    GENERAL
15   SMS              20   0.07       0   0.00     31.52    TEXTING
4. Results and discussion
The analysis led to interesting findings on the content level of the texts as well as
on the purely linguistic level. Some of these differences are not surprising and can
easily be explained, but some reveal unexpected discrepancies between the
groups.
On the content level, the following thematic groupings were recorded for the
American and Polish key items (Tables 3 and 4 respectively).
Table 3: The readily-identifiable themes in the American subcorpus
THEME                        Key items pointing to the themes
DRIVING                      driving, car, road, drive, driver, on the road, pull over, vehicle
EMERGENCY                    driving, get hold of, car, road, drive, stranded, 911, emergencies,
                             accidents, on the road, help, ditch, weather, emergency, winter
HAZARD                       driving, road, car, drive, driver, distraction, on the road, security,
                             pull over, vehicle, hazardous
COST                         minutes, plan, distance, service, plans, charges, free, month,
                             roaming, rates, cards, cost, long
CONTACT                      get hold of, drive, on the road, plans, teens, store, kids, reaching
MANNERS                      distraction, service, class, movies, movie, movie theatre
TECHNICAL NETWORK PROBLEMS   service, area, reception, static, tower
FAMILIARITY                  home, family, house, Minnesota, dad, family, mom, Americans
IMAGE                        teens, cool
GADGETS                      id, weather, mail, store, entertainment, games, directions, features
GENERAL                      20th
Table 4: The readily-identifiable themes in the Polish subcorpus
THEME        Key items pointing to the themes
GENERAL      nowadays, invention, inventions, situations, human, influence, life, world, 20th
             century, civilisation, modern, 21st century, phenomenon, development,
             cultural, progress, social
TEXTING      SMS, sending, send
HEALTH       health, harmful, scientists, harm, diseases, researches
CONTACT      contact, contacts, holidays, in touch with
MANNERS      cinema, switch off, theatre
IMAGE        show off, newest, youngsters
EMERGENCY    mountains, ambulance
Tables 3 and 4 show some variation between the groups in the choice of
arguments. This variation can be attributed to the cultural differences between
Poland and the U.S. as well as the differences in the life style of the two groups of
students. Thus, the theme of low COST comes up as one of the advantages of
mobile phones in America, especially for making long-distance calls, whereas
such an argument does not surface in the Polish data, since mobile phone calls are
generally expensive in Poland. For the same reason the theme of TEXTING
appears only in the Polish data, as it is a popular method of using mobile phones
due to the relatively high costs of mobile-phone calls. TECHNICAL NETWORK
PROBLEMS constitute an important issue for the American students living in a
rural and sparsely populated part of Minnesota, where the data was collected. The
same issue does not surface in the Polish essays, as the problem is much less
acute in Poland, especially in the urban areas where most of the students come
from. Also for the reasons related to the density of population in the students’
immediate surroundings, the theme of EMERGENCY use of a mobile phone is
much more strongly emphasised in the American data and is almost always
connected with the emergency on the road. The prototypical emergency use of a
mobile phone for the Polish students is associated with the accidents in the
mountains. The differences in life styles are reflected by the fact that in the
American data the most frequent theme is DRIVING (the experience that most
Polish 19-year-olds do not have yet, at least not on a regular basis), whereas the
keyword mountains connected with the theme of EMERGENCY in the Polish
subcorpus can be explained by the fact that trekking in the mountains is a popular
holiday activity among Polish students.
Although the findings mentioned above specify the differences between the two
groups of essays, they hardly reveal any interesting facts about the subcorpora,
since such differences could only be expected. There is nothing surprising in the
fact that the students’ essays reflect the characteristic features of the setting in
which they were written. However, further scrutiny of the keyword lists brings to
light a deeper and more intriguing variation between the two subcorpora. While
the American keyword list contains many items pointing to several readily-identifiable
and recurring themes which can be easily explained by the factors
related to the culture, setting and life-style, the Polish list contains very few of
these identifiable and explainable themes. With the exception of the theme of
using TEXTING as a convenient way of contacting people, there are few items in
the Polish list whose keyness can be attributed to cultural or environmental
factors. On the other hand, one of the most frequently recurring themes in the Polish data is the
concept of HEALTH represented by items such as health, harmful, and diseases.
Although there is considerable research evidence alerting mobile-phone owners
to the potential health risks related to the use of this device, it is hard to assume
that the students themselves or the people in their immediate surroundings
actually suffer from such health problems. Another frequent theme recurring in
the Polish data is labelled GENERAL and is composed of items such as
inventions, human, and development. It shows that the Polish students tend to
discuss the problem in fairly general terms. On the other hand, the American list
includes the names of family members (father, dad, and mom), and other items
pointing to FAMILIARITY such as house, home, Minnesota and Americans.
The variation in the choice of general versus experience-based arguments seems
to be the most striking difference between the Polish and the American data.
While the American students answered the essay question by making reference to
their own life experience such as driving or technical network problems, the
Polish learners talked about civilisation and health hazards.
The discrepancies discussed in the previous paragraphs are further supported by
the examination of another group of keywords found in the subcorpora: pronouns.
Table 5 lists the key pronouns occurring in the two lists.
Table 5: Pronouns in the American and Polish keyword lists.
Pronouns in the American keyword list: you, my, I, they, she, her
Pronouns in the Polish keyword list: we, our, us
The American key pronouns indicate that the students talk about their own
experience (I, my) or refer to real people (she, her), whereas the Polish students
make frequent use of the generic WE (we, our, us) to support the generalisations
they make.
All these findings point to the fact that the Polish students approach the topic on a
more general level, and the American students relate to their own experience in
tackling the problem. One explanation of such variation could be that the Polish
learners, contrary to the American students, did not own mobile phones at the
time of writing. Unfortunately, no data is available on this issue, since the essays
were collected with an entirely different purpose; yet this explanation seems
highly unlikely, because the mobile phone is a common-place device among
Polish teenagers. Thus, it can be claimed that the observed differences reveal the
discrepancies in the rhetorical strategies used by American and Polish apprentice
writers. These discrepancies pertain to the choice of general versus experience-based
arguments. Such an explanation accords with Kaszubski’s (1997)
findings. He demonstrated that Polish apprentice writers tend to overuse abstract
nouns of reference, which, he concluded, may imply that Poles are “particularly
prone to make sweeping generalisations when writing in English” (1997: 155).
Another interesting observation can be made based on the examination of the
keyword lists. The keyness of the pronoun you in the American data indicates that
the American students adopt an informal style in their essays which allows them
to address the reader directly. On the other hand, the Polish keyword list abounds
in linking expressions (Table 6), which could indicate that the Polish students use
a more formal style and pay more attention to the structure of their essays.
However, there is also an alternative explanation of the high number of linking
expressions in the Polish data. Milton (1998) and de Cock (2000) observed that in
structuring their essays learners of English tend to over-rely on a small set of
linking expressions promoted by textbooks and teachers. Thus, the keyness of
linking expressions can also be attributed to the learning strategies rather than
the rhetorical strategies employed by the learners. At the moment, it is impossible
to establish which explanation – an L2 learning strategy or transfer of an L1
rhetorical strategy – is responsible for the high representation of linking
expressions in the Polish data. The examination of comparable essays written by
Polish students in their native tongue could be revealing in this respect, but
unfortunately such data is unavailable at the moment.
Table 6: Linking expressions in the Polish keyword list.
Linking expressions: moreover, on the other hand, however, apart from, first of all,
for instance, thus, furthermore, nevertheless, sum up, what is more, not only,
firstly, on the one hand, as far as, for example
5. Other findings
The examination of the keyword lists on the linguistic level also reveals some
interesting differences between the subcorpora. One of the discrepancies is
related to the standard of English used by both groups of students. The Polish list
of keywords contains items such as mobile and cinema, whereas the American list
contains items such as cell, cellular and movie-theatre. This points to the fact that
Polish learners of English tend to use the British variety of English, which can be
explained by the fact that the vast majority of EFL materials available in Poland
are produced in Britain and obviously promote the British English standard.
It is perhaps worth pointing out that teaching materials have more
influence on learners’ language than other sources of authentic language available
outside the classroom, such as films or music, dominated by the American variety
of English.
A further interesting observation is related to the native/non-native language
characteristics expected to surface in the keyword lists. It was assumed that the
degree of idiomaticity (understood in the broader Sinclairian terms) would be
higher in the native data, and that the American keyword list would contain more
fixed phrases than the Polish one. However, quite the opposite was found to be
true (Table 7).
Table 7: Fixed phrases in the American and Polish keyword lists.
Phrases in the American keyword list: get hold of, on the road, look at, having to,
pick up, pull over, some one, movie theatre, going off, too many, going to, so many
Phrases in the Polish keyword list: mobile phone, on the other hand, mobile phones,
apart from, first of all, for instance, point of view, by means of, sum up, what is
more, in my opinion, switch off, 20th century, not only, 21st century, phone box,
no longer, show off, so called, turn out, more and more, of course, such a, on the
one hand, thanks to, as far as, in touch with, for example
Such unexpected results are in fact in line with de Cock’s (2000) findings.
Contrary to the popular belief that one of the characteristic features of
interlanguage is its lack of idiomaticity, she found that learners overuse two-,
three- and four-word expressions both in writing and speech. Moreover, her
further qualitative analysis of learners’ fixed phrases revealed that
... advanced learners’ use of frequently recurring sequences of words
displays a complex picture of overuse, underuse, misuse and use of
idiosyncratic sequences, which may well play a significant part in the
foreign-soundingness of their speech and writing (de Cock, 2000: 65)
A similarly complicated pattern of overuse can be observed in the two keyword
lists. The Polish list consists mainly of linking expressions whereas most of the
American key phrases are phrasal verbs. Thus, this study supports the claim of
the complex nature of learners’ use of idiomatic expressions.
6. Conclusions
The study has demonstrated that a corpus-based, quantitative methodology
can be of value to the field of contrastive rhetoric, producing interesting results
and justifiable claims. It has also been shown that adopting the text rather than the
language approach to the analysis of corpus data can bring totally unexpected, but
very revealing insights. Specifically, it has been demonstrated that the two
subcorpora exploited in the study contain more differences than could be
anticipated before collecting the data.
The study has also revealed important textual differences between American and
Polish argumentative essays written in English by apprentice writers. These
differences pertain to such rhetorical strategies as the choice of general or
experience-related arguments, the level of formality and the use of structuring
devices. These differences are not a result of ‘nativeness’ and ‘non-nativeness’ in
language use but represent deeper rhetorical conventions existing in the two
cultures.
References
Anderson, W. (2001), Discourse-based diversity: a corpus analysis of collocation
in European Union and national French Administrative Language. Paper
presented at the annual meeting of the British Association for Applied
Linguistics. September.
de Cock, S. (2000), Repetitive phrasal chunkiness and advanced EFL speech and
writing, in: C. Mair and M. Hundt (eds.) Corpus Linguistics and Linguistic
Theory. Amsterdam-Atlanta, GA: Rodopi, pp. 51-68.
Connor, U. (1996), Contrastive Rhetoric: Cross-cultural Aspects of Second
Language Writing. Cambridge: Cambridge University Press.
Duszak, A. (1994), Academic discourse and intellectual styles. Journal of
Pragmatics 21:191-313.
Kaszubski, P. (1997), Polish student writers – Can corpora help them?, in: B.
Lewandowska-Tomaszczyk and P.J. Melia (eds.) PALC’97. Practical
Applications of Language Corpora. Łódź: Łódź University Press, pp. 133-158.
Meunier, F. (1998), Computer tools for the analysis of learner corpora, in: S.
Granger (ed.) Learner English on Computer. London and New York:
Longman, pp. 19-38.
Milton, J. (1998), Exploiting L1 and interlanguage corpora in the design of an
electronic language learning and production environment, in: S. Granger
(ed.) Learner English on Computer. London and New York: Longman, pp.
186-198.
Myers, G. (2002), Contrastive rhetoric and academic discourses: an institutional
view. Paper presented at the 2nd International Contrastive Linguistics
Conference. October.
Rayson, P. (2003), Matrix: A statistical method and software tool for linguistic
analysis through corpus comparison. Unpublished PhD thesis, Lancaster
University.
Scott, M. (2000), Focusing on the text and its key words, in: L. Burnard and T.
McEnery (eds.) Rethinking Language Pedagogy from a Corpus
Perspective. Frankfurt: Peter Lang, pp. 103-122.
Software:
Wordsmith Tools, Scott, M.
Wmatrix, Rayson, P.
Appendix A: Key items in the American corpus
1. cell  2. phones  3. phone  4. driving  5. minutes  6. get_hold_of
7. while  8. have  9. someone  10. ways  11. get  12. you
13. car  14. if  15. road  16. plan  17. my  18. call
19. drive  20. i  21. driver  22. stranded  23. on  24. now
25. used  26. 911  27. distance  28. emergencies  29. they  30. could
31. out  32. accidents  33. are  34. believe  35. person  36. good
37. then  38. distraction  39. talk  40. a  41. on_the_road  42. service
43. around  44. when  45. use  46. go  47. hassle  48. look_at
49. home  50. plans  51. she  52. there  53. family  54. help
55. area  56. caller  57. security  58. today  59. class  60. house
61. reached  62. convenience  63. seen  64. Minnesota  65. calls  66. charges
67. loved  68. movies  69. positives  70. reception  71. teens  72. many
73. into  74. having_to  75. pick_up  76. been  77. free  78. along
79. college  80. cool  81. ditch  82. downside  83. id  84. location
85. missed  86. month  87. overall  88. pull_over  89. purchase  90. roaming
91. static  92. three  93. vehicle  94. weather  95. talking  96. emergency
97. movie  98. here  99. things  100. another  101. 20th  102. blessings
103. causing  104. contract  105. dad  106. mail  107. movie_theatre  108. rates
109. some_one  110. store  111. women  112. else  113. needs  114. father
115. away  116. cellular  117. entertainment  118. mom  119. though  120. games
121. kids  122. her  123. problem  124. after  125. americans  126. cards
127. cost  128. directions  129. going_off  130. hazardous  131. item  132. negatives
133. reaching  134. responsibility  135. too_many  136. tower  137. winter  138. long
139. allow  140. going_to  141. something  142. regular  143. down  144. features
145. so_many
Appendix B: Key items in the Polish corpus
1. mobile_phone  2. of  3. it  4. we  5. our  6. nowadays
7. mobile  8. which  9. is  10. kind  11. very  12. fact
13. moreover  14. invention  15. SMS  16. mobiles  17. on_the_other_hand  18. us
19. mobile_phones  20. matter  21. however  22. apart_from  23. inventions  24. some
25. situations  26. as  27. health  28. comfortable  29. imagine  30. useful
31. human  32. such  33. small  34. first_of_all  35. for_instance  36. whether
37. short  38. influence  39. mentioned  40. owners  41. contact  42. quite
43. life  44. owner  45. sides  46. addicted  47. point_of_view  48. thanks
49. often  50. necessary  51. world  52. difficult  53. real  54. aware
55. thus  56. unfortunately  57. simply  58. by_means_of  59. cinema  60. concerned
61. furthermore  62. nevertheless  63. sum_up  64. undoubtedly  65. what_is_more  66. businessmen
67. in_my_opinion  68. harmful  69. sending  70. switch_off  71. 20th_century, using  72. device
73. scientists  74. its  75. addiction  76. civilisation  77. contacts  78. harm
79. importance  80. not_only  81. lots  82. especially  83. telephone  84. arguments
85. claim  86. advantages  87. modern  88. cells  89. control  90. theatre
91. 21st_century  92. firstly  93. phenomenon  94. phone_box  95. regarded  96. surely
97. whose  98. that  99. send  100. development  101. no_longer  102. living
103. rather  104. under  105. arises  106. basic  107. crucial  108. effect
109. groups  110. hardly  111. irreplaceable  112. knowledge  113. posses  114. present
115. private  116. show_off  117. so_called  118. somehow  119. turn_out  120. youngsters
121. more_and_more  122. of_course  123. not  124. such_a  125. happy  126. nobody
127. normal  128. ordinary  129. mention  130. consider  131. among  132. constant
133. cons  134. pros  135. opinion  136. want  137. appears  138. consideration
139. cultural  140. diseases  141. divided  142. exist  143. famous  144. gadget
145. holidays  146. mountains  147. newest  148. on_the_one_hand  149. opponents  150. precious
151. produce  152. producers  153. regard  154. researches  155. thanks_to  156. forget
157. say  158. against  159. disadvantage  160. possessing  161. progress  162. social
163. various  164. ambulance  165. as_far_as  166. mean  167. in_touch_with  168. makes
169. latest  170. situation  171. important  172. no  173. for_example  174. advantage
175. available
Using a dedicated corpus to identify features of professional
English usage: What do “we” do in science journal articles?
Mukogawa Women’s University
University of Aizu
Meikai University
Abstract
This chapter addresses the problem of inadequate educational materials for the effective
training of non-native speakers in the professional English of the scientific community.
We claim that one cause of this inadequacy is lack of proper linguistic research based on
suitable linguistic research tools. The development of a major international corpus of
professional English in the sciences and other fields is described as a significant resource
for solving this problem, illustrated with a specific research project examining the use of
“we” in professional English from the Corpus of Professional English. Results reveal the
value of well-designed dedicated corpora for addressing the specific English instructional
needs of non-native speakers of English in the professions.
1. Introduction
Although English is the language of preference for science in international
contexts, the majority of the world’s scientists are not native speakers of English
and, therefore, they can be said to be considerably disadvantaged in professional
English communication. Language teachers may have introduced non-native
speakers (NNS) to the sounds, symbols, and structure of English for general
purposes such as asking for directions, reading a menu, and writing classroom
essays. However, few NNS of English have ever enjoyed the luxury of
specialised training in the spoken and written discourse of their profession. In
general, this means that they have had to pick this English up on their own, at
considerable effort, with disappointingly limited results (Coates et al. 2002). This
situation is terribly unfortunate, for it not only severely marginalises NNS in their
scientific communities, but it also prevents the world from benefiting from the
creative potential of those who cannot disseminate their ideas persuasively
because of poor English.
Basically, there are two reasons for this unfortunate situation. One is the scarcity
of language teachers who possess expertise in the English of science, and the
second is the scarcity of instructional materials that adequately explain
professional English text in scientific contexts. The English of research articles
and technical specifications is very different from the English of storybooks and
street signs, and mastery of this English does not come easily. Native-speaker
intuition or professional work experience alone are not enough to generate proper
linguistic insight, and training materials based on “research” of this kind tend to
be disappointingly vague and fraught with inaccuracies. Teachers and students of
English in the sciences require more substantial materials based on far more
rigorous research, enabled by far better research tools. The development of
computer-based corpora, along with sophisticated software tools to analyse them
properly, has now started to make this kind of research possible.
2. The value of specialised corpora
The advent of computers, computer-based corpora, and the whole new field of
corpus linguistics is now beginning to provide some satisfactory solutions to the
problems mentioned above. By collecting discourse samples in the linguistic
domains requiring study, corpus linguists can now begin to identify features of
language that were beyond the scope of thorough observation in the past. This
development is welcome news for teachers and students of scientific English, for
scientific genres, with all of their peculiarities, can now be studied on grander
scales to generate far more objective and reliable data. General corpora, such as
the British National Corpus (BNC), provide the means for studying how English
is used in general contexts, while the development of specialised corpora provides
the means for studying how English is used in special contexts.
For rigorous investigations of English in the theoretical and applied sciences, a
very special corpus is required, dedicated to the language of these professional
communities. Gathering a large enough collection of suitable discourse to
develop a dedicated corpus of this kind, however, has not been an easy task. The
value of corpora of scientific English on a small scale has been suggested in the
past (Johns, 1991; Bondi, 2001), one good example being a corpus of physics
research articles and their parallel academic conference presentations developed
by Umesaki (2000). Using this corpus, Umesaki explored the variety of referents
to the writer in academic papers, which she identifies as “one of the difficulties
for non-native speakers of English in writing academic papers” requiring greater
educational attention based on findings from corpus research.
Although work of this nature is being conducted in various disciplinary fields on
small scales, there remains a need for a major dedicated corpus of English for
studying professional English in the theoretical and applied sciences. This would
provide a reference corpus against which these smaller corpora could be compared
and also contribute to a better understanding of how English is used by
professionals internationally when they communicate with each other.
3. The construction of the Corpus of Professional English, CPE
The creation of a major international corpus of English in the sciences and other
professions to aid research in professional discourse is a serious undertaking, and
a new organisation has recently been established to take on this challenge.
Called the Professional English Research Consortium (PERC), this non-profit
academic organisation, headquartered in Tokyo, was established in April 2002 to
create such a corpus and generate research in professional English that might be
of particular benefit to NNS in the professions as well as to the educators and
material developers that support them.
The corpus, named simply the Corpus of Professional English (CPE), aims to
become a 100-million word balanced corpus of Professional English. It is
designed for various research purposes, ranging from pure research on
Professional English (e.g. variations of usage in lexis, syntax, semantics, and
discourse across text types) to applied research such as lexicography, language
testing, educational materials and program development. The CPE will be
serviced by a sophisticated web-based query system so that those who are not
familiar with corpora can extract linguistic features they need in a user-friendly
manner. The design scheme of the CPE is shown in Table 1.
The ultimate goal of the CPE is to achieve a 100-million word written corpus
composed of a reasonable balance of text types. The balance of each text type
will be examined further in the future, but the following tentative ratio of text
types was judged most suitable for representing professional English: academic
journals (30%), legal/workplace documents (20%), trade journals (10%),
reference books (10%), websites (10%), newsletters (5%), correspondence (5%),
manuals (5%) and ephemera (5%). At present, the project team is focusing on the
collection of academic journal articles, because this text type frequently proves
the most difficult to clear for copyright and yet is the one most needed by
researchers in PE.
Table 1: Tentative balance of text types for the CPE
Text type               Share
Academic journals       30%
Legal/workplace docs    20%
Trade journals          10%
Reference books         10%
Websites                10%
Newsletters             5%
Correspondence          5%
Manuals                 5%
Ephemera                5%
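As a rough illustration, the tentative shares in Table 1 can be turned into word targets for a 100-million word corpus. The sketch below only restates the arithmetic; the function name and layout are our own and are not part of the CPE tooling.

```python
# Tentative CPE text-type shares, copied from Table 1 (percent of the corpus).
TEXT_TYPE_SHARE = {
    "Academic journals": 30,
    "Legal/workplace docs": 20,
    "Trade journals": 10,
    "Reference books": 10,
    "Websites": 10,
    "Newsletters": 5,
    "Correspondence": 5,
    "Manuals": 5,
    "Ephemera": 5,
}

TARGET_SIZE = 100_000_000  # the stated 100-million word goal


def word_targets(shares, total_words):
    """Return the number of words each text type would contribute."""
    if sum(shares.values()) != 100:
        raise ValueError("shares must sum to 100 percent")
    # Integer arithmetic avoids floating-point rounding in the targets.
    return {name: total_words * pct // 100 for name, pct in shares.items()}


targets = word_targets(TEXT_TYPE_SHARE, TARGET_SIZE)
for name, words in targets.items():
    print(f"{name:<22}{words:>12,}")
```

On these figures academic journals alone would account for 30 million words, which helps explain why copyright clearance for journal articles is the first priority.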
In order to select the journal texts in an objective way, the project team decided to
base content decisions on data obtained from Journal Citation Reports (JCR),
which presents quantifiable statistical data for an objective and systematic
approach to determining the relative importance of journals within their subject
categories. At present, the JCR contains 5,700 journals in the Science Edition
and has a unique measure called “Impact Factor”, which provides a way to
evaluate or compare a journal’s relative importance to others in the same field.
Using these data, the project team selected the 20% of journals with the highest
impact factors in each field for inclusion in the CPE. JCR classifications were also
used to define the subject fields.
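The selection rule can be sketched as follows. This is a minimal illustration under stated assumptions: the journal titles and impact factors are invented, and rounding the 20% quota up to at least one journal per field is our assumption, since the procedure for fractional quotas is not described.

```python
from collections import defaultdict

# Hypothetical records: (journal title, JCR subject category, impact factor).
# Titles and figures are invented for illustration only.
JOURNALS = [
    ("Journal A", "Cell biology", 9.1),
    ("Journal B", "Cell biology", 4.3),
    ("Journal C", "Cell biology", 2.2),
    ("Journal D", "Cell biology", 1.0),
    ("Journal E", "Cell biology", 0.7),
    ("Journal F", "Oceanography", 3.5),
    ("Journal G", "Oceanography", 1.9),
]


def select_top_share(journals, share_percent=20):
    """Pick the highest-impact journals in each subject category."""
    by_field = defaultdict(list)
    for title, field, impact in journals:
        by_field[field].append((impact, title))

    selected = {}
    for field, entries in by_field.items():
        entries.sort(reverse=True)  # highest impact factor first
        # Assumption: round the quota up, with a minimum of one journal.
        quota = max(1, (len(entries) * share_percent + 99) // 100)
        selected[field] = [title for _, title in entries[:quota]]
    return selected


selection = select_top_share(JOURNALS)
print(selection)
```

Ranking within each category, rather than across the whole list, keeps low-citation fields from being crowded out by fields with generally higher impact factors.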
Acquiring texts for a corpus is always difficult but not impossible. Following
procedures established by the creators of the BNC, the CPE project team sent
letters to major journal publishers in order to obtain copyright permission for
more than 1,500 journals. As is often the case, the first letter did not yield a
sufficient response; however, much more favourable responses were obtained
after the launching of the PERC website which explained the CPE in greater
detail than the letters. At present, copyright clearance has been obtained for
almost 300 journals from over 50 publishers.
To facilitate research on the CPE, a web-based corpus query system with a 3-level interface has also been developed in collaboration with the major PERC
member institution, Shogakukan, Inc. For elementary users, the tool provides
simple word search and KWIC displays only. For intermediate users, independent
word/POS/lemma queries can be made with detailed collocation statistics (raw
frequency/T-score/MI-score/log-log). And for experienced researchers, complex
search (word/POS/lemma and combinations of the three) is made available with
more sophisticated output. Shogakukan plans to launch a portal site for megacorpora such as the BNC, the ANC, and COBUILD-Direct in the future so that
ordinary language teachers can access these large corpora without worrying about
the installation of a complicated interface.
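For readers unfamiliar with the collocation statistics mentioned above, the following sketch shows the commonly cited definitions of the MI-score and T-score. The formulas actually implemented in the query system are not given here, so this is an assumption; the frequencies are invented, and the span size is ignored for simplicity.

```python
import math


def expected_frequency(f_node, f_collocate, corpus_size):
    """Co-occurrences expected if the two words were independent."""
    return f_node * f_collocate / corpus_size


def mi_score(observed, f_node, f_collocate, corpus_size):
    """Mutual information: log2 of observed over expected co-occurrence."""
    expected = expected_frequency(f_node, f_collocate, corpus_size)
    return math.log2(observed / expected)


def t_score(observed, f_node, f_collocate, corpus_size):
    """T-score: (observed - expected) scaled by the square root of observed."""
    expected = expected_frequency(f_node, f_collocate, corpus_size)
    return (observed - expected) / math.sqrt(observed)


# Hypothetical frequencies for a node word and one collocate.
F_NODE, F_COLLOCATE, CORPUS_SIZE, OBSERVED = 1000, 500, 1_000_000, 64

mi = mi_score(OBSERVED, F_NODE, F_COLLOCATE, CORPUS_SIZE)
t = t_score(OBSERVED, F_NODE, F_COLLOCATE, CORPUS_SIZE)
print(f"MI = {mi:.2f}, T = {t:.4f}")
```

The MI-score tends to favour rare but exclusive pairings, whereas the T-score favours frequent ones, which is one reason query tools usually report both.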
Currently, the CPE project team is in the process of acquiring the targeted texts as
well as cleaning and formatting the existing texts that have already been obtained.
A prototype version of the CPE (c.2 million words of copyright-cleared academic
journal text) has been set up on this new query system and is beginning to provide
an impressive wealth of data that contrasts significantly with that which can be
obtained from general corpora such as the BNC.
4.
The research project and its results
To demonstrate the value of a corpus specifically dedicated to professional
English, we have chosen for this chapter a simple analysis of current usage of the
pronoun we as it appears specifically in professional scientific texts in
comparison with similar data gathered from corpora devoted to English on a
broader, more general scale. We will first explain the methods we employed in
this research, then follow this with specific results and a discussion of their
significance.
4.1
Methodology
As stated above, the CPE is currently under construction. The research for this
paper was conducted on a prototype corpus from the CPE of 260 journal article
texts from 18 journals in fields ranging from multidisciplinary agriculture,
biochemistry and molecular biology, cell biology, developmental biology, plant
science and forestry to multidisciplinary materials science, mineralogy,
oceanography, general and internal medicine, health care sciences and service,
orthopaedics medicine, pharmacology and pharmacy, and psychiatry. The
prototype corpus included 1,787,484 tokens of almost 60,000 types.
The corpus was examined for verbs used after we using Wordsmith Ver. 2
(Oxford University Press). The concordance lines were rearranged with the word
to the right of we in alphabetical order, and then the verbs were classified
according to the seven major semantic domains described in the Longman
grammar of spoken and written English (Biber et al., 1999): “activity verbs,
communication verbs, mental verbs, causative verbs, verbs of simple occurrence,
verbs of existence or relationship, and aspectual verbs.” A summary of these
verb features and some examples are presented in Table 2.
Table 2: Verb semantic domains based on the LGSWE (Biber et al., 1999: 360-364)
Semantic domain    Features                                         Examples
Activity           Actions and events associated with choice;       bring, buy, carry, come,
                   subject is semantic role of agent                give, go, leave, work
Communication      Subcategory of activity verbs; associated with   ask, explain, say, suggest
                   activities for communication
Mental             Activities and states experienced by humans;     think, know, love, hate,
                   including cognitive, emotional, and perception   see, taste, read
Facilitation or    A new state of affairs is brought about by a     allow, cause, enable,
causation          person or an inanimate entity                    require, permit
Occurrence         Events happening without volitional activity     become, change, happen,
                                                                    develop
Existence or       State of relationship between entities           be, seem, appear
relationship
Aspectual          State of progress of an event or activity        begin, continue, keep, stop
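The procedure described above (collect the word to the right of we, sort alphabetically, then assign a semantic domain) can be sketched in a few lines. This is an illustration only: the study used WordSmith and manual classification, whereas the tiny lexicon below is built from some of the example verbs in Table 2 and handles base forms only.

```python
import re

# Illustrative lexicon drawn from the Table 2 examples. A real study would
# need a much larger list and lemmatisation of inflected forms.
DOMAIN_LEXICON = {
    "bring": "activity", "buy": "activity", "carry": "activity",
    "ask": "communication", "explain": "communication", "suggest": "communication",
    "think": "mental", "know": "mental", "see": "mental",
    "allow": "causative", "require": "causative",
    "become": "occurrence", "change": "occurrence",
    "be": "existence", "seem": "existence", "appear": "existence",
    "begin": "aspectual", "continue": "aspectual", "stop": "aspectual",
}

# The pronoun "we" followed by whitespace and the next alphabetic word.
WE_PLUS_WORD = re.compile(r"\bwe\s+([a-z]+)", re.IGNORECASE)


def words_after_we(text):
    """Return the words immediately right of 'we', in alphabetical order."""
    return sorted(m.group(1).lower() for m in WE_PLUS_WORD.finditer(text))


def classify(word):
    """Look up a semantic domain, or flag the word for manual inspection."""
    return DOMAIN_LEXICON.get(word, "unclassified")


sample = "We know the method works. Then we begin the assay, and we suggest a cause."
right_words = words_after_we(sample)
labelled = [(word, classify(word)) for word in right_words]
print(labelled)
```

Words left as "unclassified" (adverbs, auxiliaries, inflected forms) are exactly the cases that require the main verb to be identified by hand.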
According to the LGSWE (Biber et al., 1999), the most frequently used verb type
is the activity verb. The LGSWE bases its findings on corpus studies which
identified the four registers of conversation, fiction, news, and academic prose.
Examination of the distribution of commonly-used verbs according to the four
registers revealed that activity verbs ranked the highest in three of the four
registers. It was only in academic prose that existence verbs display almost
equivalent frequency. Another feature of academic prose overall is the use of
more causative verbs and occurrence verbs, compared to that of other registers.
With respect to the usage of the personal pronoun we, the LGSWE (Biber et al., 1999:
329-330) recognises its common use “to refer to a single author, a group of
authors, to the author and the reader, or to people in general.” Biber et al. also
state that “In some cases, academic authors seem to become confused themselves,
switching indiscriminately among the different uses of we.” This highlights the
problem pointed out by Umesaki (2002) faced by the NNS scientist who often
finds author-referent conventions confusing.
In this work, we focused on identifying the type of verb following we in the
prototype CPE corpus. Knowing what types of verbs are commonly used in
academic papers should offer help to the NNS scientist in deciding when and how
to use we when writing up research.
4.2
Research data from the CPE
A total of 3,401 instances of we followed by a verb were identified in the CPE
prototype corpus. The main verb was identified and classified according to the
semantic domain classifications in the LGSWE. Its tense and aspect were also
noted. If the verb occurred at least twice, it was included in the count for the
distribution of verb types presented in Tables 3a and 3b.
Table 3a: Distribution of verb types used with we according to semantic domain
and tense/aspect
Verb type   Total   %       past    %       pres    %       prep   %
Men         1433    45.13   668     46.62   472     32.94   119    8.30
Act         1038    32.69   612     58.96   187     18.02   160    15.41
Com         440     13.86   71      16.14   290     65.91   45     10.23
Exi         168     4.66    63      31.76   70      44.59   6      4.05
Cau         40      1.89    23      65.00   3       11.67   11     18.33
Asp         32      1.01    8       25.00   29      90.63   6      18.75
Occ         24      0.76    14      58.33   3       12.50   5      20.83
Total       3175            1459    45.95   1054    33.20   352    11.09
Table 3b: Distribution of verb types used with we according to semantic domain
and tense/aspect
Verb type   mod    %       prec   %      pstp   %
Men         122    8.51    11     0.77   16     1.12
Act         49     4.72    18     1.73   10     0.96
Com         33     7.50    3      0.68   0      0.00
Exi         20     13.51   5      3.38   0      0.00
Cau         3      5.00    0      0.00   0      0.00
Asp         7      21.88   2      6.25   0      0.00
Occ         1      4.17    0      0.00   1      4.17
Total       235    7.40    39     1.23   27     0.85
Verb semantic domains: men = mental, act = activity, com = communication,
exi = existence, cau = causative, asp = aspectual, occ = occurrence.
Verb tense and aspect: past = past tense, pres = present tense, prep =
present tense perfect aspect, mod = modal auxiliary, prec = present tense
continuous aspect, pstp = past tense perfect aspect.
The present tense continuous aspect occurred in 5 instances, two with
mental verbs and three with activity verbs, but these data are not included
in the table.
As can be seen from Tables 3a and 3b, the most frequently used verb type after
the personal pronoun we was the mental verb accounting for 45.13% of the total.
This was followed by activity verbs at 32.69% and communicative verbs at
13.86%. The most frequently-used verb tense was the past tense, accounting for
45.95% of all instances catalogued. However, this tense was the predominant one
only for the mental, activity, causative, and occurrence verbs. The present tense
form was more commonly used for the communicative, existence and aspectual
verbs.
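The comparison of past and present tense across domains can be checked directly from the counts in Tables 3a and 3b. The sketch below is only a restatement of that arithmetic; the variable names are ours.

```python
# Past and present tense counts per semantic domain, copied from Table 3a.
TENSE_COUNTS = {
    "mental": {"past": 668, "pres": 472},
    "activity": {"past": 612, "pres": 187},
    "communication": {"past": 71, "pres": 290},
    "existence": {"past": 63, "pres": 70},
    "causative": {"past": 23, "pres": 3},
    "aspectual": {"past": 8, "pres": 29},
    "occurrence": {"past": 14, "pres": 3},
}
GRAND_TOTAL = 3175  # all instances catalogued in Tables 3a and 3b


def percent(part, whole):
    """Percentage rounded to two decimals, as in the tables."""
    return round(100 * part / whole, 2)


# The more frequent of the two tenses for each domain.
dominant = {domain: max(counts, key=counts.get)
            for domain, counts in TENSE_COUNTS.items()}

past_total = sum(counts["past"] for counts in TENSE_COUNTS.values())
past_share = percent(past_total, GRAND_TOTAL)
mental_share = percent(1433, GRAND_TOTAL)  # 1433 = total for mental verbs

print(dominant)
print(past_share, mental_share)
```

The result agrees with the description above: past tense dominates for mental, activity, causative and occurrence verbs, and present tense for communicative, existence and aspectual verbs.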
A closer examination of the verbs in their semantic domain classifications reveals
a more complex picture. As can be seen from Tables 5a and 5b, while the mental
verbs find, observe and examine overwhelmingly occur in the past tense,
conclude occurs more than 80% as the present tense form. In the case of activity
verbs with the highest frequencies, use, analyse and test are predominantly used
in the past form, but show is used in the past form in only 14.29% of the instances
observed while it appears more frequently and almost equally as the present tense
perfect aspect (41.27%) and the present tense (40.48%).
Table 4: Number of verbs in each semantic domain and some examples
Verb semantic   No. of verbs   Examples (No. of instances)
domain          observed
Mental          97             find (163), observe (108), examine (103), conclude (58),
                               identify (54), know (47), compare (46), see (45),
                               determine (41), investigate (39)
Activity        110            use (183), show (126), analyse (51), test (48),
                               demonstrate (46), perform (46), measure (33), calculate
                               (24), obtain (23), do (21)
Communicative   34             thank (101), report (49), present (34), describe (30),
                               propose (29), note (19), ask (18), suggest (18),
                               acknowledge (14), discuss (11)
Existence       12             have (71), be (28), exclude (14), include (11), stand (5),
                               have to (4)
Causative       5              be able to (28), be unable to (20), allow (6), require (3),
                               subject (3)
Aspect          8              begin (10), continue (5), start (4), undertake (4), initiate
                               (3), achieve (2), enter (2), keep (2)
Occurrence      6              develop (12), fail (4), modify (2), increase (2), change
                               (2), become (2)
Verbs which appeared after we two or more times were counted.
Table 5a: Verb tense and aspect distribution for most frequently used mental and
activity verbs
Verb        Total   past   %       pres   %       prep   %
Mental
find        163     129    79.14   11     6.75    19     11.66
observe     108     83     76.85   16     14.81   7      6.48
examine     103     78     75.73   3      2.91    17     16.50
conclude    58      6      10.34   47     81.03   0      0.00
Activity
use         183     131    71.58   24     13.11   17     9.29
show        126     18     14.29   51     40.48   52     41.27
analyse     51      43     84.31   2      3.92    5      9.80
test        48      39     81.25   1      2.08    6      12.50
Table 5b: Verb tense and aspect distribution for the most frequently used mental
and activity verbs
Verb        mod   %      prec   %      pstp   %
Mental
find        3     1.84   0      0.00   1      0.61
observe     0     0.00   0      0.00   1      0.93
examine     3     2.91   0      0.00   0      0.00
conclude    5     8.62   0      0.00   0      0.00
Activity
use         9     4.92   1      0.55   1      0.55
show        4     3.17   0      0.00   1      0.79
analyse     1     1.96   0      0.00   0      0.00
test        1     2.08   0      0.00   0      0.00
Not including one instance of the past perfect for observe.
Verb tense and aspect: past = past tense, pres = present tense, prep =
present tense perfect aspect, mod = modal auxiliary, prec = present tense
continuous aspect, pstp = past tense perfect aspect.
Table 6: Top twenty clusters in the vicinity of we
N    Cluster                     Freq.
1    we found that               76
2    in this study               50
3    we conclude that            39
4    we examined the             39
5    we have shown               35
6    acknowledgments we thank    32
7    we used the                 32
8    have shown that             28
9    in the present              28
10   we did not                  28
11   we do not                   25
12   found that the              23
13   we show that                22
14   we have previously          21
15   the effect of               20
16   the present study           19
17   in this paper               18
18   this study we               18
19   we have found               18
20   we used a                   18
The clusters presented in Table 6 show that the top three clusters involve mental
verbs, found, conclude, and examined. This suggests that cognitive activities of
dynamic nature (find and examine) are expressed using the past tense, while the
more stative conclude appears most frequently in the present tense. The sixth
cluster is from the acknowledgements section and indicates an almost formulaic
usage of we thank. The repeated references to the work being presented in the
paper (in this study, in the present, the present study, in this paper) in the vicinity
of we indicate that direct reference to the author(s) appears particularly when
attention is being drawn to the study under discussion.
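Cluster counts of the kind shown in Table 6 can be approximated by counting three-word sequences. The sketch below is a simplification and not the WordSmith procedure: it keeps only clusters that contain we itself (Table 6 also lists neighbouring clusters such as in this study), and it ignores sentence boundaries. The sample text is invented.

```python
import re
from collections import Counter


def clusters_with(node, text, size=3):
    """Count word sequences of the given size that contain the node word."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for start in range(len(tokens) - size + 1):
        window = tokens[start:start + size]
        if node in window:
            counts[" ".join(window)] += 1
    return counts


sample = ("In this study we found that the rate fell. "
          "We found that the effect was small. "
          "We conclude that the method is sound.")

cluster_counts = clusters_with("we", sample)
top = cluster_counts.most_common(1)
print(top)
```

Because the windows run across sentence and section boundaries, sequences such as acknowledgments we thank can appear, as the sixth row of Table 6 illustrates.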
Interestingly, of the 260 texts examined, we was used at least once in 223 texts.
The highest rate was 13.94 instances per 1,000 words, and the overall average
was 2.12 per 1,000 words. The average for the top ten texts was 9.74 instances
per 1,000 words.
Table 7: Texts with high frequency usage of we
No.   File      Words   Hits   per 1,000   Field
1     079.txt   1,865   26     13.94       Health services
2     078.txt   5,549   70     12.61       Health services
3     072.txt   2,956   34     11.50       Health services
4     070.txt   6,292   68     10.81       Health services
5     047.txt   4,821   50     10.37       Psychiatry
6     035.txt   7,748   73     9.42        Plant sciences
7     195.txt   4,668   38     8.14        Cell biology
8     076.txt   2,693   21     7.80        Health services
9     058.txt   7,696   50     6.50        Plant sciences
10    314.txt   7,921   50     6.31        Oceanography
Ave                            9.74
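The per-1,000-word figures in Table 7 are simple normalised frequencies, which allow texts of different lengths to be compared. A minimal sketch, using the first three rows of the table:

```python
# (file, words, hits) copied from the first three rows of Table 7.
ROWS = [
    ("079.txt", 1865, 26),
    ("078.txt", 5549, 70),
    ("072.txt", 2956, 34),
]


def per_thousand(hits, words):
    """Normalised frequency: occurrences per 1,000 running words."""
    return round(hits / words * 1000, 2)


rates = {name: per_thousand(hits, words) for name, words, hits in ROWS}
for name, rate in rates.items():
    print(f"{name}: {rate:.2f} per 1,000 words")
```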
Table 7 suggests that some journals or fields may display a higher frequency of
first-person pronoun usage than others. Also revealing were the negative data:
texts in which we was not used even once. Such texts occurred across all fields,
but all sixteen texts from one journal in multidisciplinary materials science had
no instances of we at all. The only match detected in these texts was WE, used
as an abbreviation for working electrode.
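False hits of this kind can be screened out with a case-sensitive pattern. The sketch below is one possible approach and not the procedure used in the study; the sample sentence is invented, and a pronoun written in all capitals (for example in a heading) would be missed.

```python
import re

# Whole-word "we" or "We", but not the all-capitals abbreviation "WE".
PRONOUN_WE = re.compile(r"\b[Ww]e\b")


def count_pronoun_we(text):
    """Count occurrences of the pronoun, skipping the abbreviation WE."""
    return len(PRONOUN_WE.findall(text))


sample = "We polished the WE before use. Then we recorded the current at the WE."
hits = count_pronoun_we(sample)
print(hits)
```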
5.
Present applications and future research
The above analyses on the usage of we in the 260 texts in the prototype CPE
corpus of scientific academic journal articles from a range of fields reveal the
following:
i) The use of we is rather common, occurring at least once in 85.77% of the
texts examined.
ii) Some journals and fields tend to display more we usage than others. On
the other hand, all texts coming from one of the journals had no
instances of we. Thus, there seems to be a need for even further
specialization of corpora to illuminate differences among journals and/or
fields of study.
iii) The verb type most commonly used with we is the mental verb (45.13%)
followed by the activity verb (32.69%).
iv) The most frequently used tense is the past tense, but the distribution of
past and present tense usage is reversed for some verb types and even
within verb type categories.
v) High-frequency clusters in the vicinity of we tend to be related to mental
activities and references to the work under discussion.
The findings overall point to the need for even further refining of corpus studies
of specialised texts in order to reveal features which can be used when planning
course materials for English for specific purposes classes. Such comparative
studies of texts from different research fields and genres must await the
completion of the CPE corpus; however, this research conducted from a small
sample of dedicated texts already reveals some interesting things that differ from
earlier findings based on general English corpora.
Even in its present state, the CPE corpus prototype dedicated to text in scientific
disciplines can prove very helpful as reference material for postgraduate students or
NNS scientists who are at the stage of writing up their research. If they are given
background instruction in the genre-analysis approach to understanding the
framework of moves and steps that compose the research journal article (Swales,
1990; Weissberg and Buker, 1990), a data-driven learning approach to
concordancing (Johns, 1991a and b) can serve as a valuable tool for aiding NNS
with their professional writing, when it comes to word choice or other writing
issues (Hunston, 2002). As the CPE continues to develop, we envisage a system
by which NNS scientists can access a website for online support when creating
professional documents, which would include access to selections of dedicated
corpora in the specific fields for which they are writing.
References
Biber, D., S. Johansson, G. Leech, S. Conrad, and E. Finegan (1999), Longman
grammar of spoken and written English. Harlow: Longman.
Bondi, M. (2001), ‘Small corpora and language variation’, in M. Ghadessy, A.
Henry, and R. L. Roseberry (eds.), Small corpus studies and ELT.
Amsterdam: John Benjamins.
Coates, R., B. Sturgeon, J. Bohannan, and E. Pasini (2002), ‘Language and
publication in Cardiovascular research articles’, Cardiovascular
research, 53: 279-285.
Hunston, S. (2002), Corpora in applied linguistics. Cambridge: Cambridge
University Press.
Johns, T. (1991a), ‘From printout to handout: Grammar and vocabulary teaching
in the context of data-driven learning’, in T. Johns and P. King (eds.),
Classroom concordancing. Birmingham, UK: Centre for English
Language Studies, The University of Birmingham, 27-46.
Johns, T. (1991b), ‘Should you be persuaded: Two examples of data-driven
learning’, in T. Johns and P. King (eds.), Classroom concordancing.
Birmingham, UK: Centre for English Language Studies, The University of
Birmingham, 1-16.
Johns, T. (2002), homepage http://web.bham.ac.uk/johnstf/timeap3.htm
Swales, J. (1990), Genre analysis: English in academic and research settings.
Cambridge, UK: Cambridge University Press.
Umesaki, A. (2000), ‘Syntactic differences in the discourse of oral and written
papers’, English corpus studies, 7: 39-59.
Umesaki, A. (2002), ‘Reference to the presenter in academic papers’. Paper
presented at AILA 2002, Singapore International Convention and
Exhibition Centre, Dec. 21, 2002.
Weissberg, R. and S. Buker (1990), Writing up research: Experimental research
report writing for students of English. Englewood Cliffs, NJ: Prentice-Hall.
Methods and tools for development of the Russian Reference
Corpus 1
Serge Sharoff
University of Leeds
Abstract
The paper discusses the history of development of Russian corpora and presents methods
and tools that are used in the ongoing development of the Russian Reference Corpus.
Development of the corpus follows the key design principles of the BNC and extends them
further by introducing an elaborate model of text typology and by adding lemmatisation
and morphosyntactic annotations to POS tagging. The paper also discusses problems in
development of the corpus that are related to the Russian language and culture.
1.
The history of development of Russian corpora
It is not too big a generalisation to say that development of Russian computer
corpora followed the pattern established by English corpora. The Brown Corpus
(Kucera & Francis, 1967) set up the standard for the design, size and coverage of
general-purpose corpora in other languages, including Russian. In the 1970s a
corpus of 1 million words was developed by Zasorina and her colleagues; it
consisted of 500 samples of 2,000 words each and covered four genres: mass
media, fiction, science (including humanities) and drama (as an attempt to cover
the spoken language). The study resulted in a frequency dictionary (Zasorina,
1977), but not in a publicly available resource. The best-known comprehensive
Russian corpus was developed in the 1980s in Uppsala, Sweden; it also resulted
in a frequency dictionary (Lönngren, 1993). The Uppsala Corpus (UC) consists of
1 million words in 600 samples equally divided between fiction and non-fiction
texts. UC is popular for various reasons, partly because it can be freely accessed
via the Internet, but by modern standards it is too small and restricted in genre
coverage. It also lacks morphosyntactic annotations and lemmatisation. The lack
of lemmatisation hinders the search of multiple word forms, which often cannot
be found using regular expressions, e.g. the verb vyjti (to leave) in Russian has
about 40 forms, including many dissimilar forms like vyjdu, vyshla, vyshedshij.
The lack of morphosyntactic annotations hinders even simple searches of
grammatical relations, for example, searching for uses of the partitive case or for
complements of a particular verb in the dative case.
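The difference between pattern-based and lemma-based search can be illustrated with the forms of vyjti cited above. This is a toy sketch: the form-to-lemma table is hand-made and stands in for a morphological analyser, and the token list (including the unrelated word vysoko) is invented.

```python
import re

# Hand-made form-to-lemma table using the transliterated forms cited in the
# text. A real corpus would obtain this from a morphological analyser.
FORM_TO_LEMMA = {
    "vyjti": "vyjti",
    "vyjdu": "vyjti",
    "vyshla": "vyjti",
    "vyshedshij": "vyjti",
}


def search_by_lemma(lemma, tokens):
    """Return every token whose lemma matches, whatever its surface form."""
    return [tok for tok in tokens if FORM_TO_LEMMA.get(tok) == lemma]


def search_by_prefix(prefix, tokens):
    """Return tokens that start with a literal prefix (a simple regex search)."""
    pattern = re.compile(re.escape(prefix))
    return [tok for tok in tokens if pattern.match(tok)]


tokens = ["ja", "vyjdu", "ona", "vyshla", "vyshedshij", "vysoko"]

lemma_hits = search_by_lemma("vyjti", tokens)
narrow_hits = search_by_prefix("vyj", tokens)  # misses vyshla, vyshedshij
broad_hits = search_by_prefix("vy", tokens)    # also returns vysoko

print(lemma_hits, narrow_hits, broad_hits)
```

A narrow pattern loses recall and a broad one loses precision, whereas the lemma lookup returns exactly the three forms of the verb.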
Another attempt to develop a comprehensive corpus was made in the Soviet
Union in the mid 1980s. It is known as the Computer Fund of Russian Language
(CFRL). Its aims were similar to those of the British National Corpus (BNC),
which was to be developed a few years later. The main goal was to create a very
large corpus of general language and subcorpora for various genres that would
help in the development of NLP applications. The set of corpora would also
provide resources for studying and teaching the Russian language, including
development of dictionaries, grammars, textbooks, etc (Andryuschenko, 1989). It
was also expected that the corpus would include an historical component to cover
the development of the Russian language from the earliest available sources (10th
century AD). However, the project did not produce the expected outcome: no
representative corpus has been collected. Resources available from the CFRL
now include Russian literature of the 19th century and samples of newspapers
from 1997. The progress in development of OCR software resulted in multiple
ad hoc collections of Russian fiction and reference texts, for instance, Moshkow’s
Library (ML), but such collections are neither balanced nor representative. The same
applies to collections of newspapers available online.
Currently, corpus studies of Russian are based mostly on the Internet. The
Internet can be regarded as the largest Russian corpus, because the amount of
Internet documents available for Russian search engines can be estimated at about
250 billion words (1.5 TB of unique texts indexed by Yandex), much larger than
any conceivable corpus. However, there are three types of problems that hinder
its use for corpus studies.
First, it cannot be claimed that the material is representative and that there is a
balance of text types. Texts presented on the Russian Internet are chaotic: their set
depends on preferences and interests of a very specific group of Russian language
speakers that are active on the Internet. The recall of search results also cannot be
evaluated, because it depends on unknown parameters: which texts are available
or not available on the Internet; which texts available on the Internet were not
found by the search engine used for the query, etc.
Second, search engines address the needs of information retrieval, rather than
linguistic search. Even though search engines provide lemmatisation, so that one
can search for all forms of a word, a query cannot be formulated in terms of
grammatical features, including tenses, cases or word classes.
As for lemmatisation performed by search engines, it is not designed to handle the
queries of (corpus) linguists. For example, normal users, who are interested in
information retrieval, pay no attention to the aspect of verbs used in their queries
and want to get pages corresponding to the verb irrespective of its form. Search
engines anticipate the need and index verbs of the perfective and imperfective
aspect under one lemma. However, this technique drastically decreases the
precision of linguistic searches and leads to some odd results: pomni used in a
query returns pages containing myatyj, because pomyat’ and myat’ form an
aspectual pair.
Third, search engines present search results in a way that also does not
correspond to the needs of a linguist. The pages are ordered in terms of their
information rank that has nothing to do with linguistic criteria. The output also
does not form a concordance, because pages in the output are separated by
documents, rather than by contexts of their uses. Finally, search results are based
on words occurring in titles of pages or keywords or even in other pages that refer
to the link being displayed as relevant.
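By contrast, a concordance lists every occurrence of the search word with its immediate context, so that one document may contribute several lines. A minimal keyword-in-context sketch, with invented documents and a window of three words on each side (both are assumptions for illustration):

```python
def kwic(node, documents, width=3):
    """Return one (doc_id, left, keyword, right) line per occurrence."""
    lines = []
    for doc_id, text in documents.items():
        tokens = text.split()
        for i, token in enumerate(tokens):
            # Compare case-insensitively and ignore trailing punctuation.
            if token.lower().strip(".,;:!?") == node:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                lines.append((doc_id, left, token, right))
    return lines


documents = {
    "d1": "We found that we could repeat it.",
    "d2": "Here we stop.",
}

lines = kwic("we", documents)
# Sorting on the right-hand context groups similar continuations together.
sorted_lines = sorted(lines, key=lambda line: line[3].lower())

for doc_id, left, keyword, right in sorted_lines:
    print(f"{doc_id}  {left:>15} [{keyword}] {right}")
```

Two documents yield three concordance lines here, ordered by linguistic context rather than by any relevance ranking.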
2.
The content of the Russian Reference Corpus
From the viewpoint of corpus linguistics, Russian is one of the few major world
languages that lack a comprehensive corpus of modern language use. However,
the need for constructing such a corpus is growing in the corpus linguistics
community both in Russia and in the rest of the world. The objective of the
project presented in this chapter is to develop the Russian equivalent of the BNC,
namely the Russian Reference Corpus (BOKR, BOljshoj Korpus Russkogo
yazyka). It is designed as a corpus of 100 million words with the proportional
coverage of major varieties of texts in modern Russian, with POS annotation and
lemmatisation. The annotation scheme (which is based on the TEI) also marks
noun phrases and prepositional phrases, because they are important for the
resolution of ambiguity and can be reliably detected. The corpus consists of texts
originally written or uttered in Russian by native speakers 2 in recent years. The
exact diachronic sample depends on the text type and is discussed below.
Table 1: Corpus composition
             Russian Standard                      BOKR
quantity     10 million words (500 texts)          100 million words (10,000 texts)
quality      a representative sample of Russian    a representative corpus of modern
             fiction written between 1960 and      Russian, balanced according to a
             2002                                  text typology
annotation   POS tags, morphological and partial   POS tags, morphological and partial
             syntactic properties with manual      syntactic properties with automatic
             disambiguation                        disambiguation
access       public Internet access with a query interface shared between the two corpora
             (Russian Standard is a subcorpus of BOKR)
BOKR will include the Russian Standard, a subcorpus of 10 million words of
modern fiction representative of the standard literary language. The relationship
between the two corpora is described in Table 1. The two corpora differ mostly
in their foci: on the large size, wide coverage and the balance of genres in BOKR
and on selection of culturally-salient modern literary works and manual
disambiguation of morphosyntactic annotations in the Russian Standard. The
latter aspect is similar to the design intentions of the hand-corrected core BNC
subcorpus (Leech, 1997). The Russian Standard is intended to be the basic source
of information for the development of corpus-based Russian grammars for
academic and teaching purposes, while BOKR will provide a complementary
source of grammatical information and will be the basic source of lexical
information.
In one respect, the design of the Russian Standard is remarkably different from
the design of the core BNC subcorpus. The core BNC is based on a proportional
selection of texts from the whole set of the BNC files, while the Russian Standard
is based on literary texts. This reflects the difference in the cultural status of the
language of imaginative writing in British and Russian cultures: in Russian the
literary language is treated as the authoritative source, which effectively defines
the language used by native speakers. This fact is also the reason for the higher
proportion of fiction in the Uppsala Corpus and the corpus used by Zasorina
(1977): fiction texts covered about half of their content, much higher than the
proportion of fiction in the Brown Corpus (25%) and the BNC (17%), cf. also the
balance of genres proposed for BOKR in the discussion below.
2.1
The typology of texts
The balance of genres in BOKR is based on a text typology that is more
sophisticated than that of the BNC. The basic principles for describing texts in
BOKR follow the EAGLES guidelines (Sinclair, 1996), which distinguish
between text-external (E) and text-internal (I) parameters in text classification:
1. E1 (origin) - parameters concerning the origin of the text, i.e. the creation
   date, the author’s age and sex, the place of his/her origin, other
   circumstances of text creation that can affect linguistic parameters;
2. E2 (state) - the appearance of the text, in particular, the distinction
   between written and spoken text modes (including written-to-be-spoken
   and electronic communication as the two border cases), and between
   published sources (books, magazines and newspapers), ephemera and
   correspondence within the written mode;
3. E3 (aims) - matters concerning the reason for making the text and the
   intended effect it is expected to have, including (1) the size of the audience
   (and subclasses for private and public speech) and (2) the communicative
   function of the text, i.e. discussion, information, recommendation,
   instruction or recreation;
4. I1 (topic) - the main topic of the text, following a shallow classification of
   knowledge domains similar to classes used in the BNC, e.g. natural
   sciences, applied sciences, life or politics;
5. I2 (style) - “the patterns of language that are thought to correlate with
   external parameters” (Sinclair, 1996), such as formal or informal, one-way
   or interactive, etc.
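The five parameters above can be pictured as a metadata record attached to each text, which makes it straightforward to count how many texts fall into each combination of values. The sketch below is hypothetical: the field names, the values and the sample texts are ours and do not reproduce the BOKR annotation scheme.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TextDescription:
    """One text described by external (E) and internal (I) parameters."""
    origin_year: int  # E1 (origin): here reduced to the creation date
    mode: str         # E2 (state): e.g. written, spoken
    function: str     # E3 (aims): e.g. information, recreation
    topic: str        # I1 (topic): knowledge domain
    style: str        # I2 (style): e.g. neutral, formal, academic


# Invented sample descriptions, for illustration only.
texts = [
    TextDescription(1998, "written", "recreation", "life", "neutral"),
    TextDescription(2001, "written", "information", "natural sciences", "academic"),
    TextDescription(2001, "spoken", "discussion", "politics", "informal"),
    TextDescription(1999, "written", "recreation", "life", "neutral"),
]

# Number of texts in each combination of mode, function and style.
cells = Counter((t.mode, t.function, t.style) for t in texts)
for cell, count in cells.items():
    print(cell, count)
```

A tally of this kind shows at a glance which combinations of parameters are still unrepresented in a corpus under construction.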
The changes in the finer classification of parameters in comparison to Sinclair
(1996) are based on the experience in development of other representative
corpora, such as the Brown Corpus, BNC, and the TEI guidelines (Sperberg-McQueen
& Burnard, 2001), as well as considerations from Russian texts. This
concerns, for example, the use of an additional mode (written-to-be-spoken),
which is borrowed from the BNC (E2), the intended audience age (E3.1), a
classification of fiction genres (E3.2) and styles (I2). It was considered helpful to
extend the classification of text styles with separate subclasses for fiction and
non-fiction texts. The patterns of language detected for fiction include the
following styles (some better-known writers that often use the style are also
indicated):
1. neutral: the style characteristic of standard literary texts in Russian,
2. regional (derevenskaja proza): an imitation of regional, mostly rural,
   language varieties, e.g. Astafiev, Rasputin,
3. lowly (snizhennyj): an imitation of the spoken language used by a “lesser
   educated” population, often slang, e.g. Ju. Aleshkovskij, Limonov,
4. official (socrealism): the official style of the Soviet literature, e.g.
   Dangulov, Markov,
5. individual: a marked way of language use with significant deviations
   from the neutral style; this style is typically the result of linguistic or
   stylistic experiments, e.g. S. Sokolov.
Each style in the list instantiates a specific set of implications on
lexicogrammatical properties (with the exception of the individual style, which is
often author-specific, but this is exactly the reason to classify a text in this way).
Nonfiction is classified according to the following styles: neutral, formal,
informal, and academic writing.
Since the project is aimed at a representative sample of modern Russian, all
meaningful combinations of parameters should be represented in the corpus by at
least a handful of texts, though the number of texts in each group depends on the
estimated number of respective texts in the Russian discourse and the availability
of their electronic copies. Text length is another important technical parameter. It
is easier to develop a large corpus using longer texts. However, this means that
the corpus contains fewer texts, so an idiosyncratic use of language in each text
significantly influences lexicogrammatical properties that can be described using
the corpus. This is the reason for the balance of texts of various sizes in the two
corpora, i.e. both shorter and longer texts should be included in each category
with a greater number of shorter texts to alleviate the influence of longer ones.
The intended coverage of knowledge domains (I1) roughly follows the proportion
used in the BNC. The comparison is shown in Table 2 (the data are from the
BNC Index by David Lee). Since the typology of texts in the BNC is based on
other principles, the comparison presents the content of texts in BOKR, as if they
were described in terms used by the BNC. For instance, spoken language is
treated as a domain in the BNC, so the figures in Table 2 also include it, even
though a spoken discourse can be devoted to any other topic in the list of
domains, so it is described as the mode of speech in BOKR (E2).
It would be desirable to increase the proportion of spoken language in BOKR at
least to the coverage of the BNC, if not to 50% of the total corpus, but the small
amount of available transcribed recordings makes the ideal target impractical. The
major departure from the BNC is the (already discussed) higher proportion of
fiction texts, which are not considered in our scheme as a knowledge domain of
its own (similar to the spoken domain), but as the most important component of
the knowledge domain “Life” (cf. respective sections in newspapers, which in the
Russian context often include short fiction stories). Note that the corpus is
currently under development, so the figures in the third column in Table 2 are
approximations for the expected coverage.
Table 2: The proportion of knowledge domains

Domains as in the BNC                               BNC       BOKR
Spoken (not a domain in BOKR)                       10.7 %    5 %
Imaginative Texts (Life in BOKR)                    16.7 %    30 %
Natural Sciences                                    3.8 %     5 %
Applied Sciences                                    7.2 %     10 %
Social Sciences                                     14.2 %    12 %
World Affairs (Politics in BOKR)                    18.9 %    15 %
Commerce                                            7.6 %     5 %
Arts                                                6.8 %     5 %
Belief/Thought (Religion and philosophy in BOKR)    3.1 %     3 %
Leisure                                             11.2 %    10 %

Currently tools and techniques for working with BOKR and the Russian Standard
are tested using a corpus of 40 million words. Its subcorpus of about 1 million
words of fiction texts (corresponding to the Russian Standard) has POS
annotations that have been automatically assigned and manually inspected. It is
also used for correcting the POS tagger used for processing the larger corpus. It
is expected that the final release of the corpus will be available by the end of
2004.
2.2 The methodology for achieving the proportional coverage
The costs of compiling a representative corpus are now lower than 10 years ago,
when the BNC was collected. Many types of source texts are readily available in
electronic form, in particular, fiction and news texts are widely accessible via the
Internet and can be legally available for the corpus. Other types of discourse, like
business or private correspondence, are harder to obtain and deposit in a corpus
because of legal obstacles. Yet other types of sources, like samples of
spontaneous speech, are rare for technical reasons. The proposed solution is to
increase the amount of ephemera (including leaflets, junk mail and typed
material), correspondence (business and private) and spoken language samples
whenever possible, because they reflect everyday language produced and
reproduced regularly in the discourse. In any case, various types of published texts
will take the rest of the share. In this respect, the situation is similar to the early
time of the BNC: the amount of texts from unpublished sources in the written part
of the BNC is about 4.5%. It is unlikely that in BOKR we will have significantly
more; even though the majority of source texts are available in electronic form
now, their holders are unwilling to share them.
To protect privacy, personal and business letters are
subjected to an anonymisation procedure with respect to names of persons and
companies. Person names are replaced with MX, FX or CX tags (for male,
female or child participants respectively) and names of companies with CoX (X is
the identification number of a participant in the text; the same practice is also
used in the Bank of English). In some cases, text providers manually replace
names with codes. In other cases, they provide original texts, but when texts are
stored in the corpus, names are replaced automatically using the lists of known
given names and surnames of persons and names of companies. Care has been
taken not to replace the names of prominent figures and of characters from popular
books and films. For instance, even though Karamazoff and Putin are valid Russian
names, their bearers are very unlikely to be participants in the exchange, so
these names are left as they are in the texts (given that the corpus lacks
private letters from or to prominent figures).
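The automatic replacement step described above can be sketched as follows. This is only an illustration under stated assumptions: the name lists, the protected list and the function name anonymise are hypothetical, the example uses transliterated uninflected names, and a real system for Russian would also have to handle inflected forms.

```python
import re

# Hypothetical name lists; the lists actually used for BOKR are not given in the text.
MALE_NAMES = {"Ivan", "Sergej"}
FEMALE_NAMES = {"Anna", "Olga"}
CHILD_NAMES = {"Vanya"}
COMPANIES = {"Romashka"}
# Names of prominent figures and fictional characters are left untouched.
PROTECTED = {"Putin", "Karamazoff"}


def anonymise(text):
    """Replace known names with MX, FX, CX or CoX codes.

    X is the identification number of a participant within this text,
    so repeated mentions of the same name receive the same code.
    """
    codes = {}  # name -> code assigned in this text

    def code_for(name, prefix):
        if name not in codes:
            codes[name] = f"{prefix}{len(codes) + 1}"
        return codes[name]

    def replace(match):
        word = match.group(0)
        if word in PROTECTED:
            return word
        if word in MALE_NAMES:
            return code_for(word, "M")
        if word in FEMALE_NAMES:
            return code_for(word, "F")
        if word in CHILD_NAMES:
            return code_for(word, "C")
        if word in COMPANIES:
            return code_for(word, "Co")
        return word

    # Only capitalised words are candidates for replacement in this sketch.
    return re.sub(r"\b[A-Z][a-z]+\b", replace, text)
```

For example, anonymise("Ivan wrote to Anna about Romashka. Ivan quoted Putin.") yields "M1 wrote to F2 about Co3. M1 quoted Putin.".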
We understand that the anonymisation procedure is not completely satisfactory,
cf. analysis in (Rock, 2001). First, it does not lead to complete anonymity:
contextual clues are left in texts and allow the detection of participants. Second, a
text in which names are replaced with codes looks less natural. Third, errors in
setting identification numbers of participants are possible; they can lead to
problems in discourse studies based on such texts. Finally, the anonymisation
removes the possibility of studying the frequency of personal names, discourse
patterns of their uses, as well as phonological patterns. However, we regard this
as the best possible practice for storing private and business letters without
violating the privacy of their authors and addressees.
The text description framework is much more elaborate than a list of domains, so
the balance of texts should be achieved on the basis of the text typology described
above. The typology can be represented in terms of the systemic network of
interrelated choices (Martin, 1987). For instance, when a text is described as
fiction, it can be described in terms of the style of fiction, such as, stylistically
neutral, low or regional, and in terms of the genre of fiction, such as general,
historical, science fiction, etc., but not in terms of the interaction between the
author and the audience, because it is not produced spontaneously in the presence
of the audience.
The network of options is traversed using the Systemic Coder. A person that
encodes metatextual information about a document has access to its record,
including the author, title, year of creation, location of the file and its size in
words. Encoding options are selected from a list of categories, for instance, the
age of the intended audience is selected from adult, child, teen or x-age, and the
age of the author at the time of text creation is selected from child, teen, young,
mid, senior (mid-aged authors is the broadest category that covers ages 22 to 55).
Even though the typology is elaborate, experience shows that most texts can be
described in just a few seconds.
Some combinations of features are logically impossible, for instance, a personal
letter aimed at a very large audience or a private discussion on TV. Some other
combinations are very unlikely, for instance, books written in the domain of
natural sciences in formal style, aimed at a very large female audience for
entertainment: the combination of formal style and entertainment or natural
sciences and a sex-targeted audience is unlikely. However, if a combination of
parameters is meaningful, an effort should be made to cover it in the corpus by, at
least, several texts. As an example, consider the set of texts within the knowledge
domain “Politics”, subdomain “home affairs” (the parameter I1 in our typology).
Variation over other parameters involves selection of texts written in neutral,
formal, academic and informal styles (I2), texts created by male or female authors
or texts with corporate authorship, texts written within the period of 1990-2000 in
different regions of Russia (E1), texts printed in newspapers, magazines, or
books, as well as letters and reports, or spoken discussions, on site, on TV and
radio (E2), texts aimed at different audiences (general vs. informed vs.
professional, public vs. private, etc.), and aimed at various communicative
functions, e.g. discussion, recommendation, instruction or entertainment (E3).
Each text in the corpus should be described by this set of options, for instance, the
Russian Constitution is a text written in formal style (I2) in 1993 in Moscow, the
authorship is corporate (E1), it is written material printed as a book of 9,500
words (E2), aimed at a very large audience (even if it is rarely read by the
majority of population), with the intended function of recommendation, as a legal
document.
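The description of the Russian Constitution above can be pictured as a simple record. The following is a minimal sketch, assuming hypothetical field names and value labels (the actual encoding in the Systemic Coder is not shown in the text); the single consistency check encodes one of the logically impossible combinations mentioned earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextDescription:
    # Field names and value labels are illustrative assumptions.
    title: str
    style: str          # I2: neutral, formal, informal, academic, ...
    year: int           # E1: time of creation
    place: str          # E1: place of creation
    authorship: str     # E1: male, female or corporate
    medium: str         # E2: book, newspaper, personal letter, ...
    size_words: int     # E2: size in words
    audience_size: str  # E3: e.g. "very large"
    function: str       # E3: discussion, recommendation, instruction, ...


def is_impossible(description):
    """Flag one logically impossible combination named in the text:
    a personal letter aimed at a very large audience."""
    return (description.medium == "personal letter"
            and description.audience_size == "very large")


CONSTITUTION = TextDescription(
    title="Russian Constitution",
    style="formal",
    year=1993,
    place="Moscow",
    authorship="corporate",
    medium="book",
    size_words=9500,
    audience_size="very large",
    function="recommendation",
)
```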
The typology ensures that every text to be included in BOKR can be described in
terms of the parameters listed above. Since texts aimed at more public audiences
are easier to obtain, extra efforts are taken to cover texts aimed at more private
audiences. The text collection activity could lead to a corpus significantly larger
than 100 million words. The next step is to balance the collection. The balance
takes into account the proportion of basic genres according to Table 2, as well as
the proportion of texts within each parameter. For instance, the classes of the
intended outcome of a text include discussion, recommendation, instruction,
recreation and information. The exact proportion of intended outcomes in the
corpus can hardly be determined, e.g. the allocation of 25% for discussions or
10% for instructions looks fairly arbitrary. However, if texts classified as a
recommendation take 90% of the total corpus, this is a clear sign of
disproportionate coverage, which should be corrected. The balance will be
monitored using statistical tools available in the Systemic Coder, such as Cell
Analysis or Significance Tests.
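A check of this kind can be sketched in a few lines. The sketch below does not reproduce the Systemic Coder tools; the function names and the threshold max_share are assumptions chosen for illustration only.

```python
from collections import Counter


def outcome_shares(texts):
    """Return the share of corpus words per intended outcome.

    texts is an iterable of (outcome, word_count) pairs.
    """
    words = Counter()
    for outcome, word_count in texts:
        words[outcome] += word_count
    total = sum(words.values())
    return {outcome: count / total for outcome, count in words.items()}


def overrepresented(texts, max_share=0.5):
    """List outcomes whose share exceeds max_share (a hypothetical threshold)."""
    shares = outcome_shares(texts)
    return sorted(o for o, share in shares.items() if share > max_share)
```

With a toy collection in which recommendations take 9,000 of 10,000 words, overrepresented returns ["recommendation"], which corresponds to the clear case of disproportionate coverage described above.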
Another problem with sources concerns the choice of diachronic sampling,
because the turbulent history of Russia in the 20th century radically affected the
language. For instance, according to the frequency list (Zasorina, 1977), which
was compiled on the basis of texts from 1930-1960, such words as sovetskij
(Soviet) and tovarishch (comrade) belonged to the first hundred of Russian words
on a par with function words, but this is no longer true of modern texts. The
language of fiction has not been so radically affected. The decisions on the
chronological limits of the study are different for different text types, for instance,
fiction texts are taken from 1960, scientific texts from 1980, political texts and
ephemera from 1990 (earlier ephemera texts are also hard to obtain), while news
texts from 1997.
3. The principles of morphosyntactic annotation
English corpora are annotated with complex tags, like NNS in the Brown Corpus or
NN2 in the BNC for plural common nouns, a technique that is impractical for a
highly inflected language, such as Russian. For instance, an adjective inflects for case, gender and
number, giving 36 basic adjectival categories in total, while a verb in addition to
its own 14 basic categories has up to 4 participles, each of which declines for
adjectival categories. This leads to thousands of separate tags that cannot be
searched effectively. For this reason, each word is annotated with a list of
features, which provides a unique identification of their type according to the
morphosyntactic codes from EAGLES (Calzolari and McNaught, 1996), for
instance, bylo (was) is annotated as “verb,ifve,int,act: n,sg,past”, which stands for
verb, imperfective, intransitive, active voice (the features describe the lexical item
byt’), followed by a colon with features that describe the form: neuter gender,
singular number, past tense. Separate features from a feature bundle associated
with each word can be selected in a window of the query interface (Figure 1).
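To make the structure of such a feature bundle concrete, the following sketch splits an annotation string into lexical features (before the colon) and form features (after it) and checks whether a word matches a set of selected features. The function names are hypothetical and the code is not part of the BOKR query interface.

```python
def parse_feats(feats):
    """Split a bundle like 'verb,ifve,int,act: n,sg,past' into two lists."""
    lexical, _, form = feats.partition(":")

    def split(part):
        return [f.strip() for f in part.split(",") if f.strip()]

    return {"lexical": split(lexical), "form": split(form)}


def matches(feats, required):
    """True if every required feature occurs somewhere in the bundle."""
    parsed = parse_feats(feats)
    available = set(parsed["lexical"]) | set(parsed["form"])
    return set(required) <= available
```

For the example bylo, parse_feats returns the lexical features verb, ifve, int, act and the form features n, sg, past, so a query for past-tense verbs would match it.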
Figure 1: The query interface
3.1 How to resolve the ambiguity
The POS class and the morphological properties of a word are reflected in its
inflection, so the probability of determining the lemma and the POS class from a word
form is higher than in, for instance, English. As a result, there are many
morphological analysers for Russian, which make decisions on the basis of word
forms, but virtually no Russian taggers that take into account the local context.
However, if real Russian texts are to be tagged with the intention of making a
corpus that includes lemmatisation and morphosyntactic annotation, the level of
ambiguity is high: very frequent word forms like stali, shli, or ego can correspond
to several lemmas, i.e. stali – stat’ (verb, to become) vs. stal’ (noun, steel), shli –
idti (to go) vs. slat’ (to send), ego – ego (possessive) vs. on vs. ono (both are
personal pronouns).
The ambiguity of POS classes is relatively rare, but many noun forms have
multiple readings. For instance, the word form pole is an instance of four different
nouns: pol (floor), pole (field), pola (lap) and Polya (a personal name, when it is
capitalised). Also, in many cases the ambiguity concerns the set of morphological
features of the same lemma, e.g. knigi is the singular form in the genitive case or
the plural form in the nominative or accusative case, while znakomoj is the
singular form in the genitive, dative or prepositional case of either a noun or an
adjective. According to initial experiments, the frequency of ambiguous detection
of lemmas and POS classes in running text is about 25%, while the frequency of
ambiguous morphological properties is about 55%. The values are too high for a
corpus with morphosyntactic annotations and so the ambiguity should be reduced.
Currently, there are no tools available that allow reliable parsing in a corpus of
this size for Russian or any other language. Use of language-independent POS
taggers based on statistical models cannot improve the quality of the output,
because they are typically based on considerations relating to the word order,
which is flexible in Russian, and because the genuine ambiguity between POS
classes is relatively rare, and the most frequent type of the ambiguity concerns
different readings of a word form (cf. the example with pole) and morphological
features (cf. the example with knigi). If word forms or sets of morphological
features (e.g. plural+dative+feminine) are treated as POS classes, then their
number increases and the quality of language-independent POS tagging declines.
However, partial parsing that detects nominal and prepositional phrases is
reliable enough and can be used for deciding the reading of ambiguous forms. In
BOKR and Russian Standard we use Dialing, a morphological analyser with
simple mechanisms for syntactic and semantic analysis. Since two analyses of the
same word form have distinct morphological properties (case, number and
gender), the agreement of the noun and the adjective in noun phrases removes
some types of ambiguity, e.g. the word combination znakomoj knigi from
Otkroem stranitsy etoj horosho znakomoj knigi (Let’s open pages of this
well-known book) can be parsed only as the genitive singular form of both words, and
the first word is an adjective. Another simple mechanism that requires only
partial parsing is the agreement of the subject and the predicate; it can remove,
for instance, the plural reading of a noun in the nominative case (knigi), when the
predicate is singular, as in Knigi na polke ne bylo.
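The agreement filter for noun phrases can be illustrated with the readings listed earlier for znakomoj and knigi. This is a simplified sketch, not the Dialing implementation: readings are reduced to (POS, case, number) triples, gender is omitted, and the function name is hypothetical.

```python
from itertools import product

# Candidate readings as (pos, case, number), taken from the examples in the text.
ZNAKOMOJ = [
    ("adj", "gen", "sg"), ("adj", "dat", "sg"), ("adj", "prep", "sg"),
    ("noun", "gen", "sg"), ("noun", "dat", "sg"), ("noun", "prep", "sg"),
]
KNIGI = [("noun", "gen", "sg"), ("noun", "nom", "pl"), ("noun", "acc", "pl")]


def agree_adj_noun(first, second):
    """Keep reading pairs where an adjective agrees with a following noun
    in case and number."""
    return [
        (a, n)
        for a, n in product(first, second)
        if a[0] == "adj" and n[0] == "noun" and a[1:] == n[1:]
    ]
```

Applied to the two words, only one pair survives: the adjective and the noun both in the genitive singular, which is the analysis given above.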
Some types of ambiguity of lemmas and POS classes are left after partial parsing:
12% of forms are ambiguous with respect to lemmas, and 22% with respect to
morphological properties. Since the BOKR corpus is annotated without human
intervention, ambiguous analyses are subjected to further filtering according to
statistical heuristics. For instance, two nouns spina (back) and spin (spin) have
several identical word forms, which cannot be separated by means of parsing.
However, spin is a term in theoretical physics, so the reading can be ignored in
normal texts. A few other cases resist even complete parsing and statistical
consideration, for instance, the ambiguity between the two readings of the word
form banke (bank vs. banka) in Xranite svoi denjgi v banke (keep your money in
a bank/in a jar) can be resolved only on the basis of semantic and pragmatic
constraints. Such cases of ambiguity are retained in BOKR. The same applies to
the ambiguity of morphological properties left after the syntactic filter. Currently,
3.6% of forms remain with ambiguous lemmas. The ambiguity in the Russian
Standard is corrected manually.
3.2 How to store annotations
The design of the annotation format of the two corpora follows the best practices
in corpus development established in the 1990s, namely EAGLES (European
Advisory Group on Language Engineering Standards) and TEI (Text Encoding
Initiative). Even though XCES (XML Corpus Encoding Standard) is expected to
become the international standard for language resources (Ide and Romary,
2002), the annotation scheme based on it is extremely verbose (the size of an
annotated file is a hundred times larger than a plain text file and, for a corpus of
100 million words, the size really matters). The XCES scheme is also not suited for
querying word uses in the corpus, because information on similar properties is
represented at different levels of the XML structure. Thus, we have postponed the
use of XCES until the standard is established and there are publicly available
software tools for working with the format.
The format of BOKR is based on the TEI scheme and uses standard tags, like
<phr>, <s>, <w> for representing phrases, sentences and words. Morphological
annotation is stored in <ana> tags that describe word properties in lemma and
feats attributes. Ambiguity is represented using multiple <ana> tags. The
following is an example from the beginning of a sentence:
Mne bylo ochen’ zhalko svoih chasov, …
(I was very sorry about losing my watch, …)
<s id="kozlotur.1476">
<w n="1">Мне<ana lemma="я" feats="pron,sg,1: dat"/></w>
<w n="2">было<ana lemma="быть" feats="verb,ifve,int,act: n,sg,past"/></w>
<phr type="ADV+ADV">
<w n="3">очень<ana lemma="очень" feats="adv"/></w>
<w n="4">жалко<ana lemma="жалко" feats="adv"/></w>
</phr>
<phr type="ADJ+NOUN">
<w n="5">своих<ana lemma="свой" feats="pron,poss: pl,gen"/></w>
<w n="6">часов<ana lemma="час" feats="noun,m,in: pl,gen"/>
<ana lemma="часы" feats="noun,pl: gen"/></w>
</phr>
</s>
The parser was able to resolve the ambiguity between two analyses of mne (the
dative or prepositional case), svoih (the genitive, accusative or prepositional case
of either a personal pronoun or a possessive pronoun), and zhalko (an adverb or
an adjective). However, the ambiguity between two readings of the word form
<w n="6"> is left in the output in the two <ana> tags. It can be read as chas (an
hour) or chasy (a watch); in the latter case it is a plurale tantum, which is
reflected in the position of the number value before the colon.
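Because ambiguity is encoded as several <ana> children of one <w> element, the remaining ambiguous words can be found with a standard XML parser. The sketch below uses Python's xml.etree.ElementTree on a shortened version of the sample sentence, with the Cyrillic restored from the transliteration; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened sample in the format described above.
SAMPLE = """<s id="kozlotur.1476">
<phr type="ADJ+NOUN">
<w n="5">своих<ana lemma="свой" feats="pron,poss: pl,gen"/></w>
<w n="6">часов<ana lemma="час" feats="noun,m,in: pl,gen"/>
<ana lemma="часы" feats="noun,pl: gen"/></w>
</phr>
</s>"""


def ambiguous_words(sentence_xml):
    """Return (position, word form, lemmas) for words with more than one <ana>."""
    root = ET.fromstring(sentence_xml)
    result = []
    for w in root.iter("w"):
        lemmas = [ana.get("lemma") for ana in w.findall("ana")]
        if len(lemmas) > 1:
            # w.text holds the word form that precedes the first <ana> child.
            result.append((w.get("n"), (w.text or "").strip(), lemmas))
    return result
```

For the sample, only word 6 is reported, with the two lemmas corresponding to chas and chasy.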
Metainformation about a document is stored in the header and is also based on the
TEI format. In some cases, TEI provides tags for encoding the contextual
settings used in the text typology, for instance, <creation> or <textClass>. In other
cases, the information on the expected outcome or the size of the audience is
expressed using the general framework of taxonomy specifications by means of
<catRef> (category reference) tags.
Notes
1 The research presented in the paper has been supported by the Alexander
von Humboldt Foundation, Germany, when the author was affiliated with
the University of Bielefeld and the Russian Research Institute for
Artificial Intelligence. I am grateful to my Russian colleagues, in
particular, to Vladimir Plungian and Katia Rakhilina, who took the
leadership in the ongoing development of the Russian Reference Corpus.
2 Development of a translation corpus is considered to be a separate task.
References
Andryuschenko, V.M. (1989). Konzepziya i arhitectura Mashinnogo fonda
russkogo jazyka (The concept and design of the Computer Fund of
Russian Language). Moskva: Nauka.
Calzolari, N., McNaught, J. (eds.) (1996). Synopsis and Comparison of
Morphosyntactic Phenomena Encoded in Lexicons and Corpora.
EAGLES document EAG-CLWG-MORPHSYN/R
http://www.ilc.cnr.it/EAGLES96/morphsyn/morphsyn.html
Ide, N., Romary, L. (2002). Standards for language resources. In Proc. of
Language Resources and Evaluation Conference (LREC02). May, 2002,
Las Palmas, Spain. 59-65.
Leech, G. (1997). A brief users’ guide to the grammatical tagging of the British
National Corpus, UCREL, Lancaster University.
http://www.hcu.ox.ac.uk/BNC/what/gramtag.html
Lönngren, Lennart (ed.) (1993). Chastotnyj slovar’ sovremennogo russkogo
jazyka. (A Frequency Dictionary of Modern Russian. With a Summary in
English.) Acta Universitatis Upsaliensis, Studia Slavica Upsaliensia 32.
188 pp. Uppsala.
Martin, J.R. (1987). The meaning of features in systemic linguistics. In M.A.K.
Halliday, R.P. Fawcett (eds.) New Developments in Systemic Linguistics.
Vol. 1. London: Pinter Publishers. 14-40.
Rock, F. (2001). Policy and practice in the anonymisation of linguistic data.
International Journal of Corpus Linguistics, 6(1).
Sinclair, J. (1996). Preliminary recommendations on text typology. EAGLES
Document EAG-TCWG-TTYP/P.
http://www.ilc.pi.cnr.it/EAGLES96/texttyp/texttyp.html
Sperberg-McQueen, C. M., Burnard, L. (eds.) (2001). Guidelines for Electronic
Text Encoding and Interchange.
http://www.hcu.ox.ac.uk/TEI/P4X/index.html
Verbitskaya, L.A., Kazanskij, N.N., Kassevich, V.B., (forthcoming). Nekotorye
problemy sozdanija natsional’nogo korpusa russkogo jazyka. NTI, Series
2. (in Russian)
Zasorina, L.N. (ed.) (1977). Chastotnyj slovar’ russkogo jazyka. Moscow:
Russkij Jazyk.
Internet links
BNC Index: http://www.comp.lancs.ac.uk/ucrel/bncindex/
BOKR: Boljshoj Korpus Russkogo yazyka (the Russian Reference Corpus, a
description of the project), http://bokrcorpora.narod.ru/
CFRL: the Computer Fund of Russian Language, http://irlras-cfrl.rema.ru/
Coder, a markup and classification tool: http://www.wagsoft.com/Coder/
Dialing, the morphological analyser: http://www.aot.ru/download.htm
Moshkow’s Library: http://lib.ru/
RS: the Russian Standard (online access), http://corpora.yandex.ru/
UC: the Uppsala Corpus, available from the University of Tübingen,
http://www.sfb441.uni-tuebingen.de/b1/en/korpora.html
Yandex, the search engine: http://www.yandex.ru/
A profile-based calculation of region and register variation: the
synchronic and diachronic status of the two main national
varieties of Dutch
University of Leuven
Abstract
In this paper we present a profile-based method for analysing regional and register
variation, and we explain how we apply it to the study of Belgian and Netherlandic Dutch.
A ‘profile’ comprises frequency information about the set of alternative synonymous terms
for naming a particular concept. Our comparison of subcorpora is based on differences in
naming preferences, as revealed by such profiles. Apart from the actual algorithm for the
comparison of subcorpora this paper also addresses some practical issues. In particular
the preliminary step of profile selection is addressed: in this context the concept of ‘stable
lexical markers’ is introduced.
1. Introduction
1.1 Broader context of the study 1
In recent years our research unit has investigated the internal stratification of
Belgian and Netherlandic Dutch. Although officially the same language, these
two national varieties do differ to some degree, especially, but not exclusively, in
the context of more informal language use. One important difference in the
evolution of both varieties lies in their standardisation: due to a history of foreign
occupations, Belgian Dutch has known an interrupted, and therefore slower,
standardisation process, which may even be considered not yet fully completed,
in the sense that the standard variety is not that well established and many
speakers do not feel it to be the natural variety of choice in many
situations.
The overall model of the current internal stratification of Belgian and
Netherlandic Dutch that we take as our point of departure can be summarised in
one synchronic hypothesis and one diachronic hypothesis. The synchronic
hypothesis is that if we look at different registers within the two contemporary
national varieties, we will see that the linguistic difference between language use
in less formal and in more formal registers is larger in Belgian Dutch than in
Netherlandic Dutch. In other words, if we compare situations where informal
language use can be expected to situations where more formal language use can
be expected, and we repeat this exercise for Belgium and for The Netherlands
(looking at the same situations in both cases), we will measure more important
differences in Belgium. The diachronic hypothesis is that in the last decades, i.e.
in the second half of the twentieth century, Belgian Dutch has been moving closer
to Netherlandic Dutch and in doing so has been using the latter as its reference
point for (further) standardisation.
In Geeraerts, Grondelaers and Speelman (1999) we report on a study in which a
first body of empirical evidence was found to confirm both the synchronic
assumption of a different internal stratification, with Belgian Dutch having a
more outspoken informal register, and the diachronic assumption of a
convergence characterised by Belgian Dutch moving towards Netherlandic
Dutch. The dataset used in this study consists of some 40000 naming events of
items related to clothing and football concepts. With respect to the synchronic
hypothesis the naming of clothing terms was investigated for two different
registers. The more formal situation was the naming of garments in magazines.
The more informal situation was the naming of garments in shop windows (which
can be expected to target a more local audience and hence to be more informal).
The data used to investigate the diachronic hypothesis were the football terms and
clothing terms encountered in Belgian journals and magazines from three
different points in time (1950 - 1970 - 1990). In subsequent research, a dedicated
corpus was compiled to replicate this study for other registers and other linguistic
variables. This corpus is the so-called ConDiv corpus, a 40 million token corpus
created for measuring convergence and divergence patterns (cf. Grondelaers,
Deygers, van Aken, van den Heede and Speelman, 2000). The corpus consists of
a diachronic component, which contains newspaper material from 1950, 1970 and
1990, and a more elaborate synchronic component from 1990, which contains
texts from 5 different registers: quality newspapers, nation-wide popular
newspapers, regional popular newspapers, texts from Usenet newsgroups, and
texts from IRC chat channels. Several replication studies based on this corpus
corroborated the original findings from Geeraerts, Grondelaers and Speelman
(1999). For instance, Grondelaers, van Aken, Speelman and Geeraerts (2001)
reports on two replication studies, one in which the original clothing terms study
was applied to the ConDiv corpus, and another in which variation in the choice of
prepositions was used as linguistic evidence.
1.2 Formal onomasiological variation
This paper focuses on some key methodological choices made in our
investigation of the stratification of Belgian and Netherlandic Dutch. The first and
most important methodological choice is that among the different types of
variation that exist we choose formal onomasiological variation as our basis for
measuring linguistic distances. According to the definition we adhere to (cf.
Geeraerts, Grondelaers and Bakema, 1994), formal onomasiological variation is
that type of onomasiological variation, i.e. variation in
the terms one uses to name an item, that is not motivated by semantic differences
and therefore can be said to be purely ‘formal’. An example would be the
variation between the terms ‘underground’ and ‘subway’ to refer to the concept
UNDERGROUND. An example of onomasiological variation that is not purely
formal would be the variation between the terms ‘garment’ and ‘pants’ to refer to
a pair of trousers. In the latter example the choice for a different term is related to
a different conceptualisation of the thing that is referred to, which is not the case
in the UNDERGROUND example. The motivation for our choice to measure
linguistic distance on the basis of formal onomasiological variation is not that this
type of variation is more important than other types of variation, but rather that it
is exceptionally useful for detecting the type of regional and stylistic differences we
are interested in in our study.
The structure of the remainder of this paper is as follows. In section 2 we
introduce profiles and profile-based calculations of linguistic differences as our
primary technique for quantifying formal onomasiological variation. After
explaining the actual calculations we focus on the most salient feature of this
technique, which is the use of profiles. We argue that profiles are a
straightforward technique for measuring regional and stylistic differences, while
controlling other types of variation. We subsequently illustrate the different
outcomes of actual profile-based calculations and non-profile-based calculations
(the latter being exemplified by calculations based on key words). The
calculations are applied to the synchronic Belgian Dutch part of the ConDiv
corpus. In section 3 we introduce a complementary technique, the method of
‘stable lexical markers’, which is a technique for exploring and charting a more
complete set of noteworthy, different types of variation in the comparison of
corpora. Rather than neutralising certain types of variation in advance by our
choice of technique (which is what we do in section 2), we now aim to obtain an
overview of the different types of variation, of their relative importance, and of
the actual linguistic variables (i.e. the actual terms) that are representative of these
different types of variation. In section 4 we summarise the most important
characteristics of profile-based calculations and of the technique of stable lexical
markers.
2. Profile-based register analysis of Belgian Dutch
2.1 Profiles
A profile is an exhaustive set of synonymous terms for naming the same concept,
together with information about the frequency with which each term is used to
name the concept 2 . Table 1 shows three profiles, more specifically the profiles
for the concept UNCLE in three different subcorpora of the Belgian Dutch (B)
synchronic part of the ConDiv corpus. The subcorpora contain chat material
(ircVL), material from a regional popular newspaper (regL1) and material from a
quality newspaper (quaSR) respectively. In this case there are two different terms
for referring to the concept at hand, namely ‘oom’ and ‘nonkel’ (note that in other
profiles there are often more than two alternatives).
Table 1: The profiles for UNCLE in three different subcorpora of the ConDiv
corpus

          ircVL (B)     regL1 (B)    quaSR (B)
Oom       0 (0 %)       25 (61 %)    15 (94 %)
Nonkel    55 (100 %)    16 (39 %)    1 (6 %)

This example illustrates how profiles are sensitive to stylistic variation. The term
‘nonkel’ is a somewhat more colloquial term to refer to UNCLE. The term ‘oom’
is the more formal, standard variant. As can be seen in the table, the chat
subcorpus shows a marked preference for the colloquial variant ‘nonkel’ (all
55 occurrences), the regional paper shows a moderate preference for the standard
variant ‘oom’ (25 out of 41 occurrences), and the quality paper shows a
marked preference for the standard variant ‘oom’ (15 out of 16 occurrences). In
the same vein, profiles are sensitive to regional variation, but we give no
examples of this because this text deals only with Belgian
material.
Linguistic distance between two corpora on the basis of their profiles for one
particular concept is calculated as a city block distance on the basis of the relative
frequencies 3 . For example, the distance between regL1 and quaSR, on the basis
of their UNCLE-profiles (cf. Table 1), is calculated as 0.5 * (|0.61 - 0.94| +
|0.39 - 0.06|) = 0.5 * (0.33 + 0.33) = 0.33. In other words, we obtain the linguistic
distance by summing up the row-wise differences in percentages, and by
subsequently dividing that sum by two. The division by two is a normalisation
step that guarantees the result is a value between 0 and 1 (i.e. between 0% and
100%).
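The calculation can be sketched in a few lines of Python. This is only an illustration, not the Abundantia Verborum implementation: the function name `profile_distance` and the dictionary layout are assumptions made for the example, and the counts are those of Table 1. The significance test mentioned in note 3 is left out.

```python
def profile_distance(profile_a, profile_b):
    """City block distance between two profiles, halved to fall in [0, 1].

    Each profile maps a term to its absolute frequency for one concept.
    """
    terms = set(profile_a) | set(profile_b)
    total_a = sum(profile_a.values())
    total_b = sum(profile_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("a profile without naming events has no relative frequencies")
    return 0.5 * sum(
        abs(profile_a.get(t, 0) / total_a - profile_b.get(t, 0) / total_b)
        for t in terms
    )


# UNCLE profiles from Table 1
uncle_ircVL = {"oom": 0, "nonkel": 55}
uncle_regL1 = {"oom": 25, "nonkel": 16}
uncle_quaSR = {"oom": 15, "nonkel": 1}

print(round(profile_distance(uncle_regL1, uncle_quaSR), 2))  # 0.33
```

Because relative frequencies are computed inside the function, the two profiles may be based on very different numbers of naming events.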
After calculating the distances between two subcorpora on the basis of individual
concepts, we want to summarise this information in a global distance between the
two subcorpora. Linguistic distance between two corpora on the basis of their
profiles for a whole range of concepts is calculated as the average of the distances
based on a single concept 4 . After calculating these global distances for all couples
of subcorpora, the final step is to feed these distances into a multidimensional
scaling analysis, in which we try to plot all subcorpora on a low-dimensional
space in such a way that distances in the plot reflect linguistic distances as well as
possible. We illustrate the procedure by means of an example.
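Before turning to that example, the averaging step can be sketched in Python. The sketch rests on several assumptions: the names `global_distance` and `distance_matrix` are hypothetical, the MOMENT counts are invented for illustration (only the UNCLE counts come from Table 1), concepts that are unattested in one of the two corpora are simply skipped, and the weighted variant of note 4 is normalised by the sum of the weights. The resulting pairwise distances are what would be passed on to an MDS routine.

```python
from itertools import combinations


def profile_distance(p, q):
    """Halved city block distance between two single-concept profiles."""
    tp, tq = sum(p.values()), sum(q.values())
    return 0.5 * sum(abs(p.get(t, 0) / tp - q.get(t, 0) / tq) for t in set(p) | set(q))


def global_distance(corpus_a, corpus_b, weighted=False):
    """Average the concept-specific distances between two corpora.

    corpus_a and corpus_b map concept -> profile (term -> count).
    With weighted=True each concept is weighted by its number of naming
    events in both corpora together (cf. note 4).
    """
    shared = [c for c in corpus_a if c in corpus_b
              and sum(corpus_a[c].values()) > 0 and sum(corpus_b[c].values()) > 0]
    if not shared:
        raise ValueError("no concept attested in both corpora")
    dists, weights = [], []
    for c in shared:
        dists.append(profile_distance(corpus_a[c], corpus_b[c]))
        n_events = sum(corpus_a[c].values()) + sum(corpus_b[c].values())
        weights.append(n_events if weighted else 1)
    return sum(d * w for d, w in zip(dists, weights)) / sum(weights)


def distance_matrix(corpora, weighted=False):
    """All pairwise global distances, keyed by (name_a, name_b)."""
    return {(a, b): global_distance(corpora[a], corpora[b], weighted)
            for a, b in combinations(sorted(corpora), 2)}


corpora = {
    "regL1": {"UNCLE": {"oom": 25, "nonkel": 16},
              "MOMENT": {"moment": 30, "ogenblik": 10}},   # MOMENT counts invented
    "quaSR": {"UNCLE": {"oom": 15, "nonkel": 1},
              "MOMENT": {"moment": 20, "ogenblik": 20}},   # MOMENT counts invented
}
print(distance_matrix(corpora))
```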
The 20 subcorpora we compare in the example are listed in Table 2. They are all
part of the Belgian Dutch synchronic part of the ConDiv corpus. This section of
the ConDiv-corpus consists of material from one quality newspaper (qua), one
national popular newspaper (nat), two regional popular newspapers (regL and
regA) and several Usenet newsgroups (use) and IRC chat channels (irc). All of
these categories are further subdivided, sometimes with (irc, use, nat, qua) and
sometimes without (regL, regA) a further topic-wise differentiation.
Table 2: The 20 subcorpora of the synchronic Belgian Dutch part of the ConDiv
corpus

Name     nr tokens   register                      topic
IrcRE      205560    IRC-material                  regional chat channels
IrcVA     1182849    IRC-material                  varia
IrcLE     1784084    IRC-material                  chat channel “Leuven”
IrcVL     2736111    IRC-material                  chat channel “Flanders”
IrcBE     1686571    IRC-material                  chat channel “Belgium”
UseTE     2486797    Usenet                        technical topics
UseSP      117195    Usenet                        sports
UseSR     2376788    Usenet                        supra-regional topics
regL1     1561362    regional popular newspaper    (no differentiation)
regL2     1450968    regional popular newspaper    (no differentiation)
regL3     1666916    regional popular newspaper    (no differentiation)
regA1     1563799    regional popular newspaper    (no differentiation)
regA2     1504606    regional popular newspaper    (no differentiation)
regA3     1810548    regional popular newspaper    (no differentiation)
NatRE     1945461    national popular newspaper    regional
NatSP      427280    national popular newspaper    sports
NatSR      518670    national popular newspaper    supra-regional interest
QuaSP      994867    national quality newspaper    sports
QuaTE     1431786    national quality newspaper    technical topics
QuaSR     3607513    national quality newspaper    supra-regional topics

Table 3: Terms and concepts for 10 different sorts of profiles

CONCEPT                        Terms
A MIND TO                      goesting, zin
IF                             als, indien
FOR THE PRICE OF (+ amount)    aan (+ amount), tegen (+ amount), voor (+ amount)
UNCLE                          oom, nonkel
IN A MOMENT                    seffens, dadelijk
ONCE AGAIN                     weeral, alweer
TO BE READY                    gereed zijn, klaar zijn
EACH OTHER                     elkaar, mekaar, elkander, mekander
MOMENT                         moment, ogenblik
TO CONTRIBUTE TO               bijdragen aan, bijdragen bij, bijdragen tot

The terms and concepts of the 10 different types of profiles we use in the example are
listed in Table 3. The result of the MDS-analysis is shown in Figure 1. One can
see an axis IRC-Usenet-newspapers running from left to right. Moreover, within
the newspaper corpora most subcorpora that belong to the same newspaper are
rather close to each other (two notable exceptions are ‘natSP’ and ‘quaTE’). In
summary, the recognition of the registers is not impeccable, but seems acceptable
in general.
Figure 1: Example of MDS-plot based on profile-based linguistic distance
2.2
Profiles compared to key words
An obvious alternative to working with profiles would be to calculate differences
on the basis of isolated individual terms. In fact, this is common practice, and
many methods that are currently used are based on individual terms (for an
overview, see Kilgarriff, forthcoming). The main reason why we refrain from
doing this is that we want to avoid thematic bias in our measurements. Let us
illustrate what we mean by this with an example. Suppose that in corpus A ‘oom’
and ‘nonkel’ are used 20 times each. Also suppose that in corpus B ‘oom’ and
‘nonkel’ are used 100 times each. And let us, for the sake of simplicity, assume
the corpora are equal in size. Now, if we compared the corpora on the basis
of isolated individual terms, there would be a clear risk of misinterpreting the data. One
approach would be to look only for items that are known to be colloquial, and to
count their occurrences. In this case the colloquial item is ‘nonkel’, so we would
count the number of occurrences of ‘nonkel’. We would conclude that corpus B
contains more colloquial material than corpus A. Another approach would be to
list all items that are more typical of one corpus than of the other. If we did
that, we would conclude that both ‘oom’ and ‘nonkel’ are more typical of B than
of A. However, if we now move to profiles, we can see that the actual difference
between A and B lies in the frequency of the concept UNCLE, and that the actual
naming preferences are identical in both corpora (in both cases 50% ‘oom’ and
50% ‘nonkel’). So for our purpose, which is the investigation of regional and
stylistic differences, we want to measure no difference in this example. To put it
another way, the high frequency of ‘oom’ and ‘nonkel’ in corpus B is ambiguous
in terms of popularity of terms and popularity of concepts. The profile-based
method disambiguates these two levels 5 .
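The contrast between the two views can be made concrete with the counts of the hypothetical corpora A and B above. The following is a small Python sketch; the helper name `relative_profile` is an assumption made for the example.

```python
# Corpus A and corpus B from the example above, assumed to be equal in size.
corpus_a = {"oom": 20, "nonkel": 20}
corpus_b = {"oom": 100, "nonkel": 100}


def relative_profile(counts):
    """Turn absolute counts for one concept into relative naming frequencies."""
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


# Term-by-term view: every term is five times as frequent in B as in A.
ratios = {term: corpus_b[term] / corpus_a[term] for term in corpus_a}

# Profile view: the naming preferences are the same in both corpora.
profile_a = relative_profile(corpus_a)
profile_b = relative_profile(corpus_b)
distance = 0.5 * sum(abs(profile_a[t] - profile_b[t]) for t in profile_a)

print(ratios)    # {'oom': 5.0, 'nonkel': 5.0}
print(distance)  # 0.0
```

The term-based view reports a difference for both terms, whereas the profile-based distance is zero, which is the outcome we want for this example.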
Figure 2: MDS-plot based on number of key words
In Figure 2 we give an example of a comparison of corpora based on individual
isolated terms. We used the log likelihood ratio test for binomial distributions (cf.
Dunning, 1993) to calculate which terms, in the comparison of two corpora A and
B, are used significantly more often in A (positive key words 6 of A) or
significantly less often in A (negative key words of A). We then used the total
number of key words (both positive and negative) as a dissimilarity measure, and
compared all corpora pairwise 7 . Finally, we fed these dissimilarities to an
MDS-analysis. The result is shown in Figure 2. Apart from the fact that the axis
IRC-Usenet-newspapers is a bit more fuzzy 8 than in Figure 1, the most interesting
observation is that the three sports-related corpora (in the centre of the figure) are
very close to each other, in spite of their belonging to different registers. This
seems to illustrate that this type of measurement is so sensitive to topical bias that
it becomes less useful for our purpose, i.e. the measurement of regional and
stylistic variation.
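As an illustration of the key word procedure, the following Python sketch computes a log likelihood ratio for a single term from a two-by-two contingency table and counts the key words in a pairwise comparison. It is a sketch under stated assumptions and not the implementation used for Figure 2: the function names are hypothetical, and the cut-off of 3.84 (chi-square with one degree of freedom, p < 0.05) is an assumed significance level, since the level actually used is not stated here.

```python
import math


def log_likelihood(count_a, size_a, count_b, size_b):
    """Log likelihood ratio (G2) for one term in two corpora.

    count_a, count_b: occurrences of the term; size_a, size_b: corpus sizes in tokens.
    """
    observed = [
        [count_a, size_a - count_a],
        [count_b, size_b - count_b],
    ]
    total = size_a + size_b
    row_totals = [size_a, size_b]
    col_totals = [count_a + count_b, total - count_a - count_b]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                e = row_totals[i] * col_totals[j] / total
                g2 += o * math.log(o / e)
    return 2.0 * g2


CRITICAL_VALUE = 3.84  # assumed cut-off: chi-square, 1 degree of freedom, p < 0.05


def keyness(count_a, size_a, count_b, size_b):
    """Return 'positive', 'negative' or None for the term as a key word of A."""
    if log_likelihood(count_a, size_a, count_b, size_b) < CRITICAL_VALUE:
        return None
    return "positive" if count_a / size_a > count_b / size_b else "negative"


def dissimilarity(freq_a, freq_b):
    """Number of key words (positive and negative) in the comparison of A and B."""
    size_a, size_b = sum(freq_a.values()), sum(freq_b.values())
    vocabulary = set(freq_a) | set(freq_b)
    return sum(
        keyness(freq_a.get(w, 0), size_a, freq_b.get(w, 0), size_b) is not None
        for w in vocabulary
    )
```

With two corpora of 1,000 tokens each, a term that occurs 20 times in A and 100 times in B comes out as a negative key word of A, whatever the reason for the difference in frequency.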
3.
Stable lexical markers
3.1
Purpose of using stable lexical markers
Although the plot in Figure 2 indicates that the key words method as applied in
2.2 is less suited to our needs, we believe that a further analysis of the results of
this method is worthwhile. Being based on all terms in the corpora, this method
contains a wealth of information. If we can obtain a clearer picture of the
different sources of variation (such as style, topic, etc.) that interact in the
constellation we see in Figure 2, and if we can obtain information about their
relative importance, and if we can also discover which linguistic variables (i.e.
which terms) are representative of these different sources of variation, then there
are several possible ways to benefit from this knowledge in our analysis, both
directly and indirectly. Directly, because it helps us to construct a representative
set of profiles for measuring regional and stylistic variation 9 . Indirectly, because
it serves as an additional autonomous analysis of variation in the corpora.
Unfortunately a direct analysis of the key words is cumbersome, because in each
of the 190 pair-wise comparisons of corpora thousands or even tens of thousands
of key words show up 10 . In order to overcome this practical problem 11 , we want
to simplify the picture and look at a subset of key words that according to some
criteria can be called a particularly salient subset. The criterion we use is
consistency in being typical of specific groups of corpora. More specifically, we
introduce the concept of stable lexical markers or stability as a marker.
The concept is straightforward 12 . If we compare two sets of corpora S1 and S2,
then the stability with which a term is a marker for set S1 is equal to the number of
couples {X, Y}, with X a member of S1 and Y a member of S2, for which the
term is a positive key word for X when X is compared to Y. For instance, if S1
contains three corpora, say S1 is {regL1, regL2, regL3}, and S2 also contains three
corpora, say S2 is {regA1, regA2, regA3}, then there are 9 different couples {X, Y}
with X a member of S1 and Y a member of S2, so the maximum stability would be 9.
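The definition translates directly into code. In the Python sketch below, the outcome of the pairwise key word analysis is represented by a callable `is_positive_keyword(term, x, y)`; this interface, the function names and the toy terms `w1` and `w2` are assumptions made for illustration.

```python
from itertools import product


def stability(term, set_1, set_2, is_positive_keyword):
    """Number of couples (X, Y), X in set_1 and Y in set_2, for which
    `term` is a positive key word of X when X is compared to Y."""
    return sum(
        1 for x, y in product(set_1, set_2) if is_positive_keyword(term, x, y)
    )


def maximally_stable_markers(terms, set_1, set_2, is_positive_keyword):
    """Terms whose stability equals the maximum, len(set_1) * len(set_2)."""
    maximum = len(set_1) * len(set_2)
    return [t for t in terms
            if stability(t, set_1, set_2, is_positive_keyword) == maximum]


S1 = ["regL1", "regL2", "regL3"]
S2 = ["regA1", "regA2", "regA3"]

# Invented key word outcomes: 'w1' is a positive key word in all 9 couples,
# 'w2' in only 2 of them.
positive = {("w1", x, y) for x in S1 for y in S2}
positive |= {("w2", "regL1", "regA1"), ("w2", "regL1", "regA2")}


def is_positive(term, x, y):
    return (term, x, y) in positive


print(stability("w1", S1, S2, is_positive))  # 9
print(stability("w2", S1, S2, is_positive))  # 2
```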
3.2
Applications of stable lexical markers
We illustrate two ways the concept of stable markers can be applied. In a first
application we consider all partitions of our set of 20 corpora into two subsets S1
and S2 with subset S1 having size 3 and subset S2 having size 17. There are 1140
such partitions. We sort this list by number of maximally stable markers of S1.
The top of this list is shown in Table 4. Right after the last item displayed here
the number of maximally stable markers drops to 18, and then the number rapidly
declines to hardly any maximally stable markers.
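The enumeration behind this first application can be sketched as follows. The function `rank_subsets`, the callable `is_positive_keyword` and the five toy corpora are assumptions made for illustration; with the 20 subcorpora of Table 2 the same loop would visit all 1140 subsets of size 3.

```python
from itertools import combinations, product
from math import comb


def rank_subsets(corpora, terms, is_positive_keyword, size=3):
    """Rank every subset S1 of the given size by its number of maximally
    stable markers against S2, the remaining corpora."""
    ranking = []
    for s1 in combinations(corpora, size):
        s2 = [c for c in corpora if c not in s1]
        n_max = sum(
            all(is_positive_keyword(t, x, y) for x, y in product(s1, s2))
            for t in terms
        )
        ranking.append((n_max, s1))
    ranking.sort(key=lambda item: item[0], reverse=True)
    return ranking


print(comb(20, 3))  # 1140

# Toy data: five invented corpora, of which 'a', 'b' and 'c' share one marker.
toy_corpora = ["a", "b", "c", "d", "e"]
sporty = {"a", "b", "c"}


def toy_keyword(term, x, y):
    return term == "goal" and x in sporty and y not in sporty


toy_ranking = rank_subsets(toy_corpora, ["goal", "the"], toy_keyword)
print(toy_ranking[0])  # (1, ('a', 'b', 'c'))
```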
The reason we look at subsets of size 3 is that most groups of corpora with the
same source type are of size 3, with the exception of IRC, of which there are 5
items. As we see in Table 4, all these groups appear at the top of the list.
Additionally, three mixed-group subsets also appear. Those are the ones with the
grey background.
Table 4: all subsets of size 3 with 30 or more maximally stable markers

      nr maximally stable markers    subset S1
 1*   202                            {natRE, natSP, quaSP}
 2    187                            {ircRE, ircVL, ircBE}
 3    186                            {regA1, regA2, regA3}
 4    185                            {useTE, useSP, useSR}
 5    74                             {regL1, regL2, regL3}
 6*   66                             {regL3, quaTE, quaSR}
 7    55                             {natRE, natSP, natSU}
 8    54                             {ircRE, ircVA, ircBE}
 9    48                             {ircVA, ircLE, ircVL}
10    36                             {quaSP, quaTE, quaSR}
11    34                             {ircRE, ircLE, ircBE}
12*   30                             {natSP, regA3, quaSP}

* mixed-group subset (grey background in the original table)

The second application of stable markers is, of course, to look at the actual list
of markers. As a first example, we look at the set of maximally stable markers of
{natRE, natSP, quaSP}, the triplet at position 1 in Table 4. In Table 5 we show
the first 20 13 items from the list of 202 maximally stable markers. It is clear that
the lexicon that sets this triplet apart is topic-related, more precisely
sports-related. Applying the same procedure we found that the same is true for
triplets 6 and 12 in Table 4. The markers in triplet 6 are related to ‘hot news’
topics. The markers in triplet 12 are once again sports-related.
Table 5: The first 20 items in the list of maximally stable markers of {natRE,
natSP, quaSP}
aanvoerder (captain), achillespees (Achilles tendon), affluiten (to whistle), aftrap
(kick-off), beker (cup), beloften (junior team), beslissende (decisive),
bezoekende (visiting), coach (coach), competitie (competition), competitiestart
(competition start), dij (thigh), doelkansen (opportunities to score), doelman
(keeper), doelpunt (goal), doelpunten (goals), eindstand (final score), finales
(finals), fit (in shape), forfait (walkover), …
As a second example of the second application, we look at one very specific
subset S1 of size 8; we look at the subset S1 of all 8 computer mediated
communication subcorpora, and compare it to the subset S2 of all 12 newspaper
subcorpora. The results are shown in Table 6. Maximum stability is 96 in this
case. There are 158 maximally stable markers, which we have manually classified
in different categories in order to distinguish between different sources or types of
variation. The categories, most of which correspond to dimensions of variation as
described in Biber (1995), are shown in the first column. The second column
shows some examples for each category. The examples are taken from the set of
279 terms that have a stability of 92 or more (i.e. that are at least 95% stable). The
last column shows the absolute and relative frequencies of the different categories
for the 158 fully stable terms (assigning each term to exactly one category).
Table 6: Stable lexical markers of the computer mediated communication
corpora

category                   examples of 95% stable markers                 nr of 100% stable markers
1st + 2nd person           jij (you), uw (your), ik (I), denk (think,     1 (0.63%)
                           1st person singular), bent (are, 2nd
                           person singular), …
English                    after, again, sorry, stuff, …                  94 (59.49%)
thematic specificity       chat (chat), email (email), comp               47 (29.75%)
                           (computer), …
colloquial style           snap (understand, 1st person singular),        4 (2.53%)
                           tis (it is), niks (nothing), ne (a), zijt
                           (are, 2nd person singular), …
conversational elements    groetjes (greetings), greetz (greetings),      11 (6.96%)
                           bedankt (thanks), hmm (hmm), oeps
                           (oops), …
downtoners                 redelijk (reasonably), …                       1 (0.63%)

At first sight, one would conclude that the type of variation that interests us most,
that of stylistic variation (colloquial style), is relatively rare: 2.53% of the cases,
as opposed to 59.49% for English, 29.75% for thematic specificity (i.e. topic) and
6.96% for conversational elements. However, it is obvious that assigning each
term to one single category is an oversimplification. For instance, the example
‘greetz’ in conversational elements is also related to English. Likewise, the colloquial
style examples ‘snap’ and ‘zijt’ are also examples of 1st + 2nd person. In a similar
way we find items that are not classified as colloquial style but that are related to
colloquial style, for instance ‘comp’ and ‘greetz’. In summary, sources of
variation are often hard to separate and, more particularly, although purely
stylistic variation is rare in these stable markers, stylistic variation is often
superimposed on other types of variation.
4.
Conclusions
In this chapter we have argued that key-word-based methods are less suited for
the analysis of stylistic and regional variation, because they are too sensitive to
topical bias. We further argued that profile-based calculations are more suited
because they neutralise this topical bias. In order to illustrate this we have used
both methods to calculate linguistic distance between 20 subcorpora that
represent different registers of Belgian Dutch, but some of which are further
differentiated by topic. The analysis confirmed that the key word method is too
sensitive to topical bias. Next we introduced the concept of stable lexical markers,
and showed two applications in which stable markers are used to discover
noteworthy patterns in the wealth of information that resides in a classical
key-word-based analysis but is hard to extract from it. On the basis of these
applications we illustrated that stable markers are a useful device to disentangle
different sources of variation, to learn about the relative importance of these
different sources, and to discover which terms are representative of these different
sources.
Notes
1
The research reported on in this paper was supported by VNC-grant
205.41.07 3 as well as by OT-project OT 01/05. For the corpus-linguistic
procedures and some of the statistical analyses the tool Abundantia
Verborum was used [http://wwwling.arts.kuleuven.ac.be/genling/abundant].
For the multidimensional scaling analyses the environment R was used
[http://www.r-project.org/].
2
This definition of profiles clearly targets lexical variation. It is possible to
broaden the definition in such a way that other types of variation
(morphological, syntactic, ...) also fit in. Profiles would then group together
competing constructions. But since in this paper we only deal with lexical
information, we will stick to the more restricted definition.
3
In practice there is one additional step. We first apply a log likelihood
ratio test for multinomial distributions (cf. Dunning 1993) to test whether
the two profiles are significantly different from each other. If they are, we
use city block distance to calculate linguistic distance. If they are not, we
use (a constant very close to) zero as linguistic distance.
4
In practice there is one more subtlety. Rather than simply averaging over
the concept-specific distances, we use a weighted sum of the concept-specific
distances. The weight of a concept-specific distance is determined
by the sum of all naming events in the union of the two corpora under
comparison that pertain to this concept. The purpose is to give more
weight to naming choices for concepts that on average are referred to more
frequently. To give an extreme example from our analysis of clothing
terms, we want to prevent that variation in the naming of a rarely referred
to concept such as BOWLER HAT has as much influence on the global
linguistic distance as variation in the naming of a frequently referred to
concept such as TROUSERS.
5
In theory, one could claim that such thematic bias should be avoided in the
design of the corpus. In practice, however, this is virtually impossible.
6
We borrow the term key words from Scott (1997). Actually, the term is
older, but initially the procedure was quite different (cf. Williams, 1976,
and, for a discussion, Stubbs, 1996). We use the term key word as it is
used in Scott (1997). By a positive key word of corpus A (when compared
to B) we mean a key word that, in terms of relative frequency, is used
more often in A than in B. By a negative key word we mean a key word
that, in terms of relative frequency, is used less often.
7
The pair-wise comparison of all corpora is not the most typical use of key
words, which normally involve comparing corpora to a reference corpus.
However, we believe that in the context of our design (analysing clusters
through MDS) a direct comparison of the corpora is more straightforward.
8
It also should be mentioned that in this MDS-analysis the stress is much
higher than in the profile-based analysis. In fact, in the key word based
calculations, one actually needs three dimensions to obtain an acceptable
fit. Nevertheless, we chose to show the two-dimensional plot because the
plot is easier to interpret, and because the main observations are not
affected by this simplification. For a more detailed analysis of these key
word based calculations we refer the reader to Speelman, Grondelaers and
Geeraerts (forthcoming).
9
Currently our compilation of profiles is mainly based on synonym lists
that are derived from dictionary information. Automatic or semi-automatic
derivation of potentially interesting variables from the corpora serves as a
complementary strategy.
10
We applied the method to word forms, not to lemmata.
11
It could be called more than a practical problem. It is often claimed that
many methods that are currently used to detect significant differences in
the use of a term (most notably chi-square-based methods) signal a
counter-intuitively large number of significant differences. One of the
sources of the problem, as mentioned in Kilgarriff (forthcoming), is the
distribution of the lexicon: high-frequency terms are so frequent, and
therefore produce so many observations, that even subtle, often
linguistically less relevant or distinctive patterns reach significance.
12
The concept is related to the concept of key key words (cf. Scott, 1997), but stable
markers are more generic, because they involve the comparison of two sets
of corpora, and are used for different purposes.
13
After filtering out a few proper names.
References
Biber, D. (1995), Dimensions in Register Variation, Cambridge University Press.
Dunning, T. (1993), ‘Accurate Methods for the Statistics of Surprise and
Coincidence’, Computational Linguistics, 19(1): 61-74.
Geeraerts, D., S. Grondelaers and P. Bakema (1994). The Structure of Lexical
Variation. Meaning, Naming and Context. Berlin: Mouton de Gruyter.
Geeraerts, D., S. Grondelaers and D. Speelman (1999). Convergentie en
divergentie in de Nederlandse woordenschat. Een onderzoek naar kleding-
en voetbaltermen. Amsterdam: Meertensinstituut.
Grondelaers S., K. Deygers, H. van Aken, V. van den Heede and D. Speelman
(2000), ‘Het ConDiv-corpus geschreven Nederlands’, Nederlandse
Taalkunde, 5: 356-363.
Grondelaers S., H. van Aken, D. Speelman and D. Geeraerts (2001),
‘Inhoudswoorden en preposities als standaardiseringsindicatoren. De
diachrone en synchrone status van het Belgische Nederlands’,
Nederlandse Taalkunde, 6: 179-202.
Kilgarriff, A. (forthcoming). ‘Comparing corpora’. International Journal of
Corpus Linguistics.
Scott, M. (1997). ‘Pc analysis of key words – and key key words’. System,
25:233-245.
Stubbs, M. (1996), Text and Corpus Analysis, Oxford: Blackwell.
Speelman, D., S. Grondelaers and D. Geeraerts (forthcoming), ‘Profile-based
Linguistic Uniformity as a Generic Method for Comparing Language
Varieties’, Computers and the Humanities.
Williams, R. (1976), Keywords, London: Fontana.
Stella E. O. Tagnin
University of São Paulo
Abstract
A learner corpus can provide useful data to detect specific difficulties of language learners
and consequently inform the production of pedagogic material to address these problem
areas. The USP Multilingual Learner Corpus will initially be built in English, German and
Spanish. A heading with detailed information about the student (course level, age, sex,
mother tongue, etc.) will allow for different types of research. The corpus’s multilingual
character will make it possible to look into difficulties common to all Brazilian learners,
irrespective of language. Its varied content may also provide insights into the effectiveness
of different methodologies.
1.
Introduction
One of the problems with textbooks used in Brazil for teaching a foreign
language is that most are written by foreign authors unacquainted with Brazilian
students’ difficulties. It is a known fact that a learner corpus can provide useful
data to detect such specific difficulties and consequently inform the production of
pedagogic material to address these problem areas (Leech, 1998).
Until recently, the only learner corpus under construction in Brazil was the
Br-Icle (Berber Sardinha, 2001), the Brazilian Portuguese part of the ICLE project
(Granger, 1993, 1994), which at the time of writing contains 40,000 words of
argumentative texts collected at the Catholic University of São Paulo (PUC-SP).
In early 2002 the University of São Paulo (USP) joined the project.
The PUC-USP partnership in the Br-Icle project triggered an interest in extending
the collection of texts to include all genres in which there is student production at
our Department of Modern Languages, which is composed of five areas: English,
German, Spanish, French and Italian.
This chapter will give an overview of foreign language teaching in Brazil, with a
special focus on the state-of-the-art at the University of São Paulo. It will discuss
the PUC-USP partnership and how this project motivated teachers from other
languages to build their own learner corpora, which will all be brought together in
the Multilingual Learner Corpus (MLC). This corpus is part of a larger project –
COMET, a Multilingual Corpus for Teaching and Translation, which is also
being built at the University of São Paulo under my co-ordination (Tagnin,
2002a, 2002b). Next, it will discuss the design and structure of the corpus. The
last part will focus on the research possibilities envisaged by the MLC.
2.
Foreign language teaching in Brazil
One of the problems with textbooks used in Brazil for teaching a foreign
language is that most are written by foreign authors unacquainted with Brazilian
students’ difficulties. The other is that they take no account of Brazilian culture or
students’ interests.
2.1
The curriculum at the University of São Paulo
The Department of Modern Languages is divided into five areas, one for each
“modern” foreign language taught at the undergraduate level: English, French,
German, Italian and Spanish. The curriculum is composed of eight semesters
during which students follow courses addressing both the language and the
literature components. With the exception of English, where it is taken for
granted that students have a fairly good command of the language on entering the
course, the other languages start teaching “from scratch” as these languages are
not taught regularly in secondary school whereas English is part of the
compulsory curriculum.
Due to the high demand for foreign language courses in general, the Department
also offers extracurricular courses, mainly aimed at the academic community.
These go by the name of English on Campus, Español en el Campus etc. and
extend from five to ten semesters, depending on the language. They are taught by
postgraduate students under the supervision of a co-ordinator, who is a regular
teacher in that language. In section 4.1 we will go into more detail about these
courses and how they can contribute to the corpus.
3.
Learner corpora in Brazil
To our knowledge, the only learner corpus under construction in Brazil is the
Br-Icle. To date it is composed of 40,000 words compiled at the Catholic University
of São Paulo under the co-ordination of Tony Berber Sardinha. In line with ICLE
requirements it is restricted to argumentative texts and should reach 200,000
words on completion. The project was joined by USP in early 2002, and is due to
be completed by the end of 2004.
Although we know of no other “formal” learner corpus, several sets of
materials involving the teaching of foreign languages have been assembled by various
researchers. A few are in electronic format (diskettes, CD-ROMs), but most are
probably not. In any case, the material is not prepared for investigation with the
aid of electronic search tools.
The German area has been working on a Contrastive German-Portuguese
Grammar project for which it has collected different types of student production.
Most of this material is recorded on CD-ROMs or diskettes but is only available
for internal use:
• Verbs of transportation. Vol. 1. CAPLE - Corpus of German and
  Portuguese as Foreign Languages (Blühdorn et al., 1997).
This material was collected in several schools engaged in foreign language
teaching. In Brazil it came from third and fourth year undergraduate students at
USP and intermediate and advanced students at the Goethe Institut in São Paulo;
in Germany from undergraduate German students and learners of Portuguese at
the University of Erlangen-Nürnberg.
It consists of three types of production: a) sentences in which students were
required to use verbs of transportation, b) translations of 17 sentences into the
foreign language, and c) description of the stories presented in six different
sequences of cartoons.
• Compositions in German. Vol. 2. CAPLE - Corpus of German and
  Portuguese as Foreign Languages (Blühdorn et al., 1999).
This material was collected between 1996 and 1998 from 342 informants at three
German-Brazilian secondary schools and is divided into three categories: a)
Brazilian learners of German, b) Brazilian learners who have both German and
Portuguese at home, and c) German native speakers living in Brazil.
• Contrastive Analysis Corpus of Mistakes in Portuguese and German as
  Foreign Languages (Glenk & Stanich, 2000).
This material was collected from second and third year undergraduate learners of
Portuguese at the Universities of Erlangen-Nürnberg (1997) and
Vienna (1999 and 2000), and from third and fourth year undergraduate learners of
German at USP (1998 and 1999). It consists of descriptive texts based on the
cartoons used in the research referred to above, narrative texts and essays.
Other non-contrastive materials are:
• Corpus of letters exchanged between learners of German in Fès
  (Morocco) and São Paulo (University of São Paulo) (Blühdorn, 1997).
• Studentenzeitung (Blühdorn, 1999). A newspaper written by third year
  German learners at USP during the second semester of 1999.
• Student writing, different typologies: criteria for text production
  (Nomura, in preparation). Material produced by second year German learners
  at USP.
In English and Spanish there are scattered collections of texts as a result of
individual research by post-graduate students, mainly intended for contrastive
studies. However, they are not in a format that makes them searchable by corpus
tools.
4.
The Learner Corpus at USP
When the PUC-USP partnership was established, several teachers at USP became
interested in corpora studies. Because the Br-Icle requires only argumentative
texts, the English teachers decided to build a corpus with the other types of texts
produced by their undergraduate students, mainly narratives and essays.
However, once the goal of 200,000 words of argumentative texts has been
reached for the Br-Icle, this type of text will also be included in the USP
Multilingual Learner Corpus.
The German and Spanish areas have already joined the MLC project. French and
Italian have shown some interest but no official contact has been made as yet.
Nevertheless, the project is underway, and it will also be fed with texts from the
on campus courses.
4.1
The on campus courses
The participation of the on campus courses at USP opens up other possibilities,
such as including other genres of texts and gathering the production of another
type of students.
As these courses are aimed mainly at the academic community, that is, students,
teachers and other employees, one gets a fairly varied audience, both in terms of
age and cultural background, which is certain to affect the content of their
production. Quite a few undergraduate students attend these courses to “catch up”
with the rest of their class, that is, as remedial work to improve their linguistic
performance. The teaching is more informal than in the regular undergraduate
courses and students feel they have less responsibility when it comes to passing
or failing the course.
The English on Campus (EOC) course has a total of ten semesters: Basic 1, 2 and
3, Pre-Intermediate 1 and 2, Intermediate 1 and 2, Advanced 1 and 2 and a
semester of Conversation. The course books used are New Interchange: from
Introduction up to volume 3, Part B at the Basic, Pre-Intermediate and
Intermediate levels, and Passages Parts A & B at the advanced level. Topics for
the written assignments are suggested according to the grammar points addressed.
For instance, “Write a conversation with a friend in which you describe your
apartment or house and ask about his or her living place” is set when the focus
is on the Simple Present, short answers, questions with how many and answers
with there is, there are.
Due to the high number of students at the English on Campus courses –
approximately 700 – and considering that two assignments are submitted per
student, it would be possible to collect around 1,000 texts per semester, that is, as
long as most learners sign to give permission for their assignments to be
included in the corpus.
The picture is slightly different for the German on Campus (GOC) course. It
offers a five-semester course: Basic 1, 2, 3 and 4 and a semester of Conversation.
A German course book Moment mal! is used at the first three levels. Basic 4 and
Conversation use material prepared by the teachers and based on other German
books. The content of the Conversation course varies each semester and students
may take it more than once as it is mainly aimed at giving undergraduate students
an opportunity to exercise their oral production.
As there are approximately 100 students enrolled, the number of possible texts for
inclusion each semester would be around 200, again with two texts per student.
The levels at Spanish on Campus (SOC) are Basic 1, 2, Intermediate 1, 2 and
Advanced 1, 2. Three enhancement modules are also offered, each one
semester-long: a) Culture, b) Literature, and c) Conversation. A Grammar module is in
preparation.
As opposed to the previous courses, Spanish relies only on material prepared by
its own teachers and based on the profile and a needs analysis of its
students. There are currently about 460 students enrolled in these courses, which
would give a total of 920 texts per semester.
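The per-semester estimates for the three on campus courses follow from simple arithmetic, which can be sketched in Python as below. The enrolment figures and the rule of two assignments per student come from the text; the function name and the `consent_rate` parameter are assumptions introduced here, only to show how the English estimate of around 1,000 texts relates to the 1,400 that full participation would yield.

```python
def expected_texts(students, texts_per_student=2, consent_rate=1.0):
    """Estimate the texts collected per semester for one course."""
    return round(students * texts_per_student * consent_rate)

# Enrolment figures as reported in the text.
enrolment = {"EOC": 700, "GOC": 100, "SOC": 460}

# Upper bounds if every learner gives permission.
upper_bounds = {course: expected_texts(n) for course, n in enrolment.items()}

# Share of EOC learners whose consent would yield about 1,000 texts.
eoc_consent_share = 1000 / upper_bounds["EOC"]

print(upper_bounds)
print(f"EOC consent share needed: {eoc_consent_share:.0%}")
```

On these figures, the estimate of 1,000 English texts corresponds to roughly seven in ten learners giving permission, whereas the German and Spanish totals assume that all learners do.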
The course has also been a source for postgraduate research on exclusion and
self-exclusion factors; methodological and ideological analyses of textbooks;
error analysis; evaluation of assessment procedures; and theories of language
acquisition focusing especially on how or how much learning can contribute to
acquisition. One such study took a contrastive approach, comparing how the
simple and compound past tenses are used by learners of Spanish and of English.
This study was informed by data from both the English on Campus and Spanish
on Campus courses.
5.
The USP Multilingual Learner Corpus
As currently planned, the USP Multilingual Learner Corpus (MLC) will be composed of texts
produced by undergraduate (UG) and on Campus learners in the areas of
English, German and Spanish. As mentioned above, English undergraduate texts
Stella E. O. Tagnin
of the argumentative type will be fed into Br-Icle until it has reached its goal of
200,000 words (see diagram below).
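[Diagram: the USP Learner Corpus branches into English, German and Spanish subcorpora, each fed by undergraduate (UG) texts and by the corresponding on campus course (EOC, GOC, SOC); Br-Icle appears alongside the English branch.]
Figure 1: The composition of the USP Learner Corpus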
Each student will be identified by a code and a profile with basic information as
to his course, level, year of attendance, age, sex etc. Each text will be stored in its
full form and preceded by a header with information as to text type, grammar
point covered, topic of assignment, course book (or other materials) in use, etc.
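The student profile and text header described above could be represented along the following lines. This is only a sketch: the class names, field names and the one-line JSON header format are assumptions made for illustration, since the text lists the kinds of information to be recorded but not the storage format.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class LearnerProfile:
    code: str                # anonymous student identifier
    course: str              # e.g. "UG", "EOC", "GOC", "SOC"
    level: str
    year_of_attendance: int
    age: int
    sex: str

@dataclass
class TextHeader:
    student_code: str        # links the text to a LearnerProfile
    language: str
    text_type: str
    grammar_point: str
    topic: str
    materials: str           # course book or other materials in use

def store_text(header, body):
    """Return the full text preceded by a one-line JSON header (hypothetical format)."""
    return json.dumps(asdict(header), ensure_ascii=False) + "\n" + body

# Hypothetical example values.
profile = LearnerProfile("S001", "EOC", "Basic 1", 2003, 21, "F")
header = TextHeader(profile.code, "English", "conversation",
                    "Simple Present", "Describing your home", "New Interchange")
record = store_text(header, "A: Do you live in a house?\nB: No, I don't.")
print(record.splitlines()[0])
```

Keeping the student code in both records would allow texts by the same learner to be grouped across semesters without storing personal details in each header.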
6.
Possible areas of research
Learner corpora in various parts of the world have already produced a wealth of
research (Granger, 1998b; Granger, 2002; Granger et al., 2002) but to our
knowledge there is no multilingual learner corpus, that is, a corpus of learners with a
common mother tongue learning different foreign languages. This is in contrast
with the ICLE project in which one has learners with different mother tongues
learning a common language.
With this design, and given that each subcorpus is also a stand-alone contrastive
learner corpus, in that it allows comparison between productions originating in
the two distinct courses offered at USP, it is envisaged that the corpus will not
only allow for horizontal studies, comparing student production originating in the
same class or in the same level, but also studies on the vertical axis, assessing
student development over a period of time, either individually or collectively (cf.
Kaszubski, 2000; Lenko-Szymanska, 2000; among others). Research on student
writing strategies like paraphrasing, the (over/under) use or avoidance of certain
syntactic structures, vocabulary items, collocations and formulas (cf. Altenberg &
Tapper, 1998; Altenberg, 2002; Berber Sardinha, 2001; De Cock, 1998; Granger,
1998a, 1998b; and many others) will also be possible.
More interesting perhaps is the possibility of cross-linguistic studies, like
detecting problems common to learning a foreign language or problems common
to Brazilian learners. Another contrastive area made possible by the design of the
corpus lies in the field of methodology, as it will enable researchers to evaluate the
effectiveness of different methodologies or materials in the undergraduate
and/or the on campus courses.
7.
Conclusion
The Multilingual Learner Corpus under construction at the University of São
Paulo is currently being fed with student production from two types of courses:
the regular undergraduate courses and the extracurricular on campus courses
offered by the areas of English, German and Spanish, at the Department of
Modern Languages. The MLC is not only a promising project in terms of the
array of possible research areas, but it has also integrated the different languages
taught at the Department by bringing them together to work under a common
project.
References
Altenberg B. (2002), Using bilingual corpus evidence in learner corpus research,
in: S. Granger, J. Hung and S. Petch-Tyson (eds.) Computer Learner
Corpora, Second Language Acquisition and Foreign Language Teaching.
Amsterdam and Philadelphia: Benjamins, 37-54.
Altenberg B. and M. Tapper (1998), The use of adverbial connectors in advanced
Swedish learners’ written English, in S. Granger (ed.) Learner English on
Computer. London and New York: Addison Wesley Longman. 80-93.
Berber Sardinha T. (2001), O Corpus de Aprendiz Br-Icle,
http://lael.pucsp.br/~tony/2001bricle-interc.pdf.
Blühdorn H., G. Evangelista and M. C. Reckziegel (eds.) (1999), Redações em
alemão. Vol. 2. CAPLE - Corpus em alemão e português como línguas
estrangeiras. São Paulo, USP.
Blühdorn H. (ed.) (s/d), Sagen und Legenden, Tänze, Festtagsbräuche und
Kochrezepte aus Brasilien. Eine interkulturelle Korrespondenz zwischen
Deutsch-Studenten in São Paulo, Brasilien, und Fès, Marrocos.
Blühdorn, H. (ed.) (1999), Studentenzeitung, São Paulo, USP.
Blühdorn, H., L. F. Dias Moreira and R. F. Silva (eds.) (1997), Verbos de
transporte. Vol. 1. CAPLE - Corpus em alemão e português como línguas
estrangeiras.
De Cock, S. (1998), A Recurrent Word Combination Approach to the Study of
Formulae in the Speech of Native and Non-Native Speakers of English.
International Journal of Corpus Linguistics, vol. 3(1): 59-80.
Glenk, E. and K. Stanich (eds.) (2000), Corpus de Análise Contrastiva de Erros
em Português e Alemão como Línguas Estrangeiras. São Paulo, USP,
setembro de 2000.
Granger, S. (1993), The International Corpus of Learner English, in: Aarts, J., P.
de Haan and N. Oostdijk (eds.) English Language Corpora: Design,
Analysis and Exploitation. Amsterdam: Rodopi. 57-69.
Granger, S. (1996), From CA to CIA and back: An integrated approach to
computerized bilingual and learner corpora, in: Aijmer, Karin, Bengt
Altenberg & Mats Johansson (eds). Languages in Contrast – Papers from
a Symposium on Text-based Cross-linguistic Studies. Lund 4-5 March
1994, Lund: Lund University Press. 37-51.
Granger, S. (1998a), Prefabricated patterns in advanced EFL writing: collocations
and formulae, in: Cowie, A. (ed.) Phraseology: theory, analysis and
applications. Oxford: Oxford University Press. 145-160.
Granger, S. (ed.) (1998b), Learner English on Computer. London & New York:
Addison Wesley Longman.
Granger, S. (2002), A Bird’s-eye View of Computer Learner Corpus Research,
in: S. Granger, J. Hung and S. Petch-Tyson (eds.) Computer Learner
Corpora, Second Language Acquisition and Foreign Language Teaching.
Amsterdam and Philadelphia: Benjamins. 3-33.
Granger S., J. Hung and S. Petch-Tyson (eds.) (2002), Computer Learner
Corpora, Second Language Acquisition and Foreign Language Teaching.
Amsterdam and Philadelphia: Benjamins.
Kaszubski P. (2000), Lexical profiling of English (learner) corpora: can we
measure advancement levels?, in: B. Lewandowska-Tomaszczyk and P.
Melia (eds.) Łódź Studies in Language. Vol. 1: PALC’99: Practical
Applications in Language Corpora [Papers from the International
Conference at the University of Łódź, Poland, 15-18 April, 1999].
Frankfurt am Main: Peter Lang, 249-86.
Leech, G. (1998), Learner corpora: what they are and what can be done with
them, in S. Granger (ed.) (1998b), Learner English on Computer. London
& New York: Addison Wesley Longman. xiv-xx.
Lenko-Szymanska, A. (2000), How to trace the growth in learner’s active
vocabulary. A corpus-based study. Paper presented at the 4th International
Conference on Teaching and Language Corpora. Graz, 19-23 July 2000.
Nomura, M. (org.) (in preparation). Redação de textos de diferentes tipologias:
critérios de produção de texto.
Tagnin, S. E. O. (2001), COMET – A Multilingual Corpus for Teaching and
Translation, in: Lewandowska-Tomaszczyk, B. (ed.) PALC 2001 –
Practical Applications in Language Corpora. Frankfurt am Main: Peter
Lang, 535-540.
Tagnin, S. E. O. (2002), Taking off in Brazil: COMET – A Multilingual Corpus
for Teaching and Translation. Paper presented at ICAME 2002 – The
Theory and Use of Corpora – The 23rd International Conference on
English Language Research on Computerized Corpora of Modern and
Medieval English, Gothenburg, Sweden, May 22-26, 2002.
Quantitative or qualitative content analysis?
Experiences from a cross-cultural comparison of female
students’ attitudes to shoe fashions in Germany, Poland and
Russia
Lancaster University
Abstract
In order to examine differences in attitudes to shoe fashions between women in Germany,
Poland and Russia, we asked three samples of advanced female students of English to
write a short English composition in response to the stimulus: “Tell us a little bit about the
footwear (shoes, boots, etc.) you own and when you wear it”. We analysed the results
using a manual qualitative content analysis and two forms of quantitative computer
content analysis: one using project-specific categories developed from the qualitative
content analysis and previous theory, the other using general semantic field categories.
Both techniques were successful in highlighting similar between-group differences,
suggesting that qualitative content analysis and project-specific categories can largely be
dispensed with. Some issues in using non-native student English compositions as data in
cross-cultural studies are also considered.
1.
Introduction
Appearance is an integral part of communication and miscommunication. If we
want proof of this claim, we need cast our minds back no further than the British
Conservative Party’s annual conference in 2002, when its new chairperson,
Theresa May, took to the platform in kitten-heeled leopardskin shoes. This
deviation from a traditional “business” shoe received huge media attention, with
close-up photographs in many national newspapers, but it ultimately
overshadowed what she had to say: it has been remembered long after the content
of her speech has been forgotten. Thus, something as everyday as her choice of
shoe can be said to have led to an overall communication failure. 1 It is clear from
this example that the messages sent out by clothing and footwear (non-verbal
communication) deserve as much attention as those messages transmitted by
words (natural language), and it is from this position that we began to work on
apparel-based non-verbal communication, and, in particular, on footwear.
1.1
Cross-cultural differences in non-verbal communication
We know that differences in non-verbal communication (hereafter “NVC”) exist
between nationalities and cultures (Andersen, 1988). However, although a
number of comparative studies in NVC have been carried out (e.g. Remland et
al., 1991), these have tended to focus on the more bodily aspects, such as gesture,
facial expression, touch and proxemics. By contrast, a very large proportion of
the empirical work on apparel has been carried out in the USA, focussing
primarily on American subjects. Comparatively little empirical work of this kind
has been carried out in Europe (especially eastern Europe), and even less has had
a contrastive orientation. An important consequence of this is that the claims of
popular “dress for success” manuals, whatever their empirical and theoretical
foundations in one culture, cannot simply be transferred by translation to other
cultural contexts. In order to test how, and to what extent, perceptions of apparel
differ cross-culturally, we have been collecting data from female students in three
European countries: Germany, Poland and Russia.
1.2
Approaching NVC in the respondents’ own words
Although some of the early work on NVC made use of open-ended qualitative
data (e.g. Stone, 1962), much recent research in the field has been carried out in
an experimental scale-based paradigm (Davis & Lennon, 1988). However, this
latter approach, although readily quantified, has the problem of imposing the
researchers’ pre-defined categories and interpretive schemata on the experimental
subjects: for instance, they have to rate the image of a variously clothed figure as
being more or less “friendly and approachable”. A notable exception to this
experimental paradigm is the small-scale study by Golliher (1987), who
employed a form of open-ended depth interviewing in order to re-construct the
categorisation and interpretation of clothing items in his informant’s own terms.
This qualitative ethnomethodological approach is valuable, because it can provide
much richer data than the experimental approach: it taps into the actual attitudes
and perceptions of the informants rather than having them agree or disagree with
the researchers’ attitudes. In this study, we have applied (for logistical reasons) a
written variant of open-ended interviewing.
1.3
Analysing open-ended responses
For the analysis of open-ended response data, there are two main paradigms
available to the researcher: qualitative analysis and quantitative analysis. One of
our goals in this pilot study was to examine the relative merits of a qualitative and
quantitative content analysis of our data.
Qualitative content analysis, as we understand it, is a variant of the grounded
theory approach to text analysis – in other words, it is a bottom-up approach to
identifying the main ideas within people’s discourse. In the course of careful
reading, linguistic units within the text (words and phrases) are annotated and
aggregated into broader categories as themes begin to emerge from the data. As
its name suggests, qualitative content analysis is not, in itself, a quantitative
approach; however, as well as functioning as a stand-alone methodology, it can
also feed into dictionary construction for quantitative content analysis.
In contrast to qualitative content analysis, quantitative content analysis attempts
to provide a numerical measure of mention which can then be used to make
statistical comparisons between respondent groups. Quantitative content analysis
is typically fully automated (or at least computer aided) and can take two main
forms: a form that relies on the classification of text words according to a predefined set of categories (dictionary-based content analysis) and a form that
makes use of the multivariate analysis of word contiguities to derive themes from
the set of texts being examined (correlational content analysis) (Hogenraad,
McKenzie & Péladeau, 2003).
In dictionary-based content analysis – the method which we have used in this
paper – a dictionary of words and content categories is constructed, and words in
the running text are then automatically matched with the dictionary categories.
The content analysis program then provides a frequency of use for each category
in the study across each group of texts indicated by the researcher.
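The dictionary-based procedure just described can be illustrated with a minimal Python sketch. It is not the USAS implementation: the tiny word list, the tokeniser and the function names are assumptions made for illustration, and only the category codes (BOOT, SNDL, COMF, TEMP, ATTR) are taken from the project-specific categories used later in this paper.

```python
import re
from collections import Counter

# Hypothetical mini-dictionary: word form -> content category.
DICTIONARY = {
    "boots": "BOOT", "sandals": "SNDL", "comfortable": "COMF",
    "warm": "TEMP", "cold": "TEMP", "elegant": "ATTR",
}

def tokenize(text):
    """Split running text into lower-case word tokens (a deliberately crude tokeniser)."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())

def category_frequencies(texts_by_group, dictionary=DICTIONARY, per=1_000_000):
    """Return {group: {category: frequency per `per` running words}}."""
    result = {}
    for group, texts in texts_by_group.items():
        tokens = [tok for text in texts for tok in tokenize(text)]
        counts = Counter(dictionary[tok] for tok in tokens if tok in dictionary)
        total = len(tokens)
        result[group] = {cat: n * per / total for cat, n in counts.items()} if total else {}
    return result

# Invented sentences standing in for two groups of compositions.
sample = {
    "A": ["I wear warm boots"],
    "B": ["My sandals are comfortable and elegant"],
}
freqs = category_frequencies(sample)
print(freqs["A"])
```

Normalising to a common base (here, per million words) is what makes category frequencies comparable across groups of texts of different sizes.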
The categories in dictionary-based content analysis may be quite general,
semantic-field-type categories, which can be used for a wide range of studies, or
they may be specially designed for a particular study, typically on the basis of
previous theory and a prior qualitative analysis of at least some of the data. In
this study, we have undertaken two quantitative content analyses – one with
general semantic field categories and one with project-specific categories – and
we have compared both these sets of results with each other and with the outcome
of the qualitative analysis.
1.4
Language issues in cross-cultural survey research
One of the major difficulties in conducting cross-cultural research lies in
language. If material is collected in several languages (e.g. German, Polish and
Russian), it can then prove rather difficult to quantify and compare the results.
This is certainly the case if one is working with word-level data within a
methodology such as correlational content analysis, since differences in
vocabulary range and grammatical structure can themselves lead to differences in
the relative frequencies of comparable words. If dictionary-based content
analysis is used, then vocabulary range and grammar may prove rather less of a
problem, but another problem then arises: the availability of exactly comparable
content analysis dictionaries and related processing tools for all the languages in
question. In an attempt to avoid both these sets of problems, we have been
collecting our data in a single language – English – from advanced learners of
that language in each of the countries covered by the project. In principle, this
should enable us to make direct comparisons between national groups. However,
at the outset, it was not clear what other issues non-native Englishes might pose
for the content analyst, and so this is an issue which we kept in focus during this
pilot study and to which we return later in the paper.
2.
Data
Our data consisted of compositions written by advanced female students of
English at universities in Germany, Poland, and Russia. These were all collected
during the summer of 2002 and were provided in computer-readable form. All
respondents were natives of the country in question: any students not conforming
to this criterion were dropped from the subsequent analysis. In this pilot study,
there were 11 compositions from Germany, 34 from Poland, and 13 from
Russia. 2 The students were asked to write on each of the following open-ended
stimulus questions:
• What sort of things would you wear to a job interview and why?
• What sort of things would you wear to a friend’s birthday party and why?
• Tell us a little bit about the footwear (shoes, boots, etc.) you own and
when you wear it.
In the present study, we focus only on the last of the three questions – the one
about footwear.
3.
Qualitative analysis
Close reading of the material helps us understand what the writers are telling us in
their own words. A special coding frame was constructed to list and classify the
words and phrases in our shoe data. We aimed to find the answers to three
Wh-questions: What? Why? When?
Qualitative content analysis of the data revealed the following main themes:
• What – shoe types and their physical attributes;
• Why – reasons for wearing different types of shoes such as comfort,
practicality including matching outfits, suitability, design, colour,
attractiveness and price;
• When – occasions when particular types of shoes are worn: related to
weather, outdoor and indoor activities, lifestyle, formal and informal
social events, etc.
The examination of Becker and Lißmann’s (1973) two levels of content, primary
content (themes and main ideas of the text) and latent content (context
information), led us to the following research conclusions on the similarities and
differences in the subjects’ classifications of shoes. Firstly, the Poles described
the widest range of shoe types with main emphasis on comfort and elegance,
while occasions when particular types of shoes are worn were mainly related to
outdoor activities, and formal and informal social events. Secondly, Russian
respondents, quite unsurprisingly, had a bigger variety of boots, and generally
referred to weather-related occasions for wearing particular types of footwear
with a focus on cold weather conditions; they placed major emphasis on comfort
and attractiveness of shoes. Thirdly, German subjects wrote less about high heels,
placing more emphasis on practicality; occasions when particular types of shoes
are worn were varied, including outdoor activities, formal and informal social
events, as well as weather-related occasions with a focus on warm weather
conditions.
4.
Quantitative analysis
For the quantitative content analysis, we used the USAS suite of programs,
developed over the past 13 years at Lancaster University (Wilson & Rayson,
1993; Rayson & Wilson, 1996). USAS is a software package for dictionary-based
content analysis: we hope to apply correlational content analysis to these
data in a future study.
Using a comprehensive dictionary and multi-word-unit list (which we updated to
cover the vocabulary of the present data), USAS assigns a basic semantic field
code (or “SEMTAG”) to each lexical item or phrase within a text – for example,
“Colour”, “Body and Bodyparts”, “Power”, “Similar/Different”, etc. USAS
performs this task with approximately 92% accuracy. We shall return to these
SEMTAGs later in the paper.
A further feature of USAS is the module called MAPPING. This enables
researchers to create their own, project-specific category system (or
“CONTAGs”) by conflating, subdividing, or ignoring various SEMTAG
categories. For our data analysis, we used MAPPING to create categories based
both on previous work on the perception of shoe fashions and on the results of the
qualitative analysis.
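The three operations that MAPPING supports (conflating, subdividing and ignoring SEMTAG categories) can be sketched as follows. The rule tables and the function are hypothetical and do not reproduce the actual MAPPING file format or the mapping used in this study; the SEMTAG labels and CONTAG codes are borrowed from the tables in this paper purely as examples.

```python
# Hypothetical rules; the real MAPPING format is not shown in the text.
CONFLATE = {                      # several SEMTAGs -> one CONTAG
    "Temperature: cold": "TEMP",
    "Temperature: hot": "TEMP",
    "Weather": "WEAT",
}
SUBDIVIDE = {                     # one SEMTAG -> several CONTAGs, chosen by word
    "Clothing and accessories": {
        "boots": "BOOT", "sandals": "SNDL", "slippers": "SLPR",
    },
}
IGNORE = {"Pronouns", "Numbers", "Grammatical words"}

def map_tag(word, semtag):
    """Return a project-specific category for one tagged word, or None."""
    if semtag in IGNORE:
        return None
    if semtag in SUBDIVIDE:
        # Words not listed (e.g. non-footwear items) stay unmapped.
        return SUBDIVIDE[semtag].get(word.lower())
    return CONFLATE.get(semtag)

tagged = [("Boots", "Clothing and accessories"), ("it", "Pronouns"),
          ("freezing", "Temperature: cold"), ("dress", "Clothing and accessories")]
print([map_tag(word, tag) for word, tag in tagged])
```

Subdivision needs the word form as well as the tag, which is why the function takes both; conflation and ignoring depend on the tag alone.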
To examine differences between the three groups of respondents, we used the
TMATRIX module of USAS. This module enables the researcher to see the
frequencies of words and categories for each group, examine concordances, and
carry out log-likelihood or chi-squared tests on word and category frequencies.
4.1
Project-specific category construction
Although shoes have been said “never to lie” about the wearer’s personality
(Pond, 1986), in the literature on NVC and first impressions they have received
comparatively little attention when set alongside other apparel items such as
dresses or suits. Shoes have occasionally been included in broader studies (e.g.
Lennon & Miller’s (1984/85) study included a pair of brown boots), but, at the
time of carrying out the research, the only detailed study known to us was that of
Kaiser et al. (1987). 3 It was thus from Kaiser et al.’s study that we set out to
create CONTAGs for our project. These CONTAGs would enable us to see
whether the European nationalities used the same kinds of criteria as Kaiser et
al.’s American sample when writing about footwear and whether or how they
differed amongst themselves.
Prior to carrying out a semantic differential (questionnaire) study with a larger
sample of respondents, Kaiser et al. undertook a focus group with five men and
five women living in California. The respondents were shown a range of shoe
styles, which they then discussed. From these discussions, the following primary
dimensions were extracted:
• old-young
• liberal-conservative
• work-leisure
• comfortable-uncomfortable
• unsexy-sexy
• formal-casual
• high-status-low-status
• inexpensive-expensive
• dislike-like
• fashionable-unfashionable
These dimensions were relatively unproblematic to operationalise as content
categories, with the exception of liberal-conservative. As an approximation to
this category, therefore, we made the distinction classical-modern.
In addition to the categories based on Kaiser et al.’s dimensions, we also created
categories to represent the main types of footwear mentioned by our respondents.
Most of these are self-explanatory; the category “elegant shoes” contains
references to stilettos, court shoes, pumps, etc.
Finally, we created categories to measure references to the main additional issues
that emerged from the qualitative content analysis.
Table 1 details the set of content categories that we worked with.
Table 1: Content categories used in the analysis

Shoe styles:
BOOT    Boots
CLOG    Clogs
DOCS    Doc Martens (and similar shoes)
ELSH    Elegant shoes
FLTS    Flat shoes
SLPR    Slippers
SNDL    Sandals
TRNR    Trainers
OTHE    Other shoe styles

Kaiser et al. dimensions:
AGES    Young-old
CLAS    Classical-modern
COMF    Comfortable-uncomfortable
FASH    Fashionable-unfashionable
FORM    Formal-casual
LIKE    Like-dislike
PRIC    Expensive-inexpensive
SEXY    Sexy-unsexy
STAT    High-status-low-status
WORK    Work-leisure

Additional categories from qualitative analysis:
ATTR    Attractive-unattractive
PHYS    Physical attributes
PRAC    Suitable-unsuitable
SEAS    Seasons
TEMP    Temperature
WEAT    Weather
4.2
Results
All of Kaiser et al.’s dimensions were used by our respondents, though to varying
degrees: for example, high-status-low-status was relatively little used, whilst
work-leisure was a very widely used category. Table 2 shows the results for the
Kaiser et al. categories aggregated for the whole sample.
Table 2: Frequencies of Kaiser et al. dimensions (whole sample)

Category    Freq. per million words
WORK        21,483
LIKE         9,645
COMF         7,453
FORM         2,255
FASH         1,754
CLAS           752
PRIC           626
SEXY           626
AGES           564
STAT           376
Table 3 shows the between-groups differences for all the content categories. For
three groups (2 d.f.), log-likelihood (LL) values of 9.21 or greater are significant
at p < 0.01 and values of 5.99 or greater are significant at p < 0.05.
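For readers unfamiliar with the log-likelihood test, the sketch below shows one common way of computing it for a single category across several groups and of applying the critical values just quoted. It is an illustration, not the TMATRIX code: the formula used (twice the sum of observed × ln(observed/expected) over the groups) is the form widely used in corpus comparison and is assumed here, and the counts and corpus sizes in the example are invented.

```python
import math

def log_likelihood(counts, sizes):
    """Log-likelihood (G2) for one category across several groups.

    counts: observed frequency of the category in each group
    sizes:  total number of running words in each group
    """
    total_count, total_size = sum(counts), sum(sizes)
    g2 = 0.0
    for observed, size in zip(counts, sizes):
        expected = size * total_count / total_size
        if observed > 0:              # a zero count contributes nothing
            g2 += observed * math.log(observed / expected)
    return 2 * g2

def significance(g2, critical=((9.21, "p < 0.01"), (5.99, "p < 0.05"))):
    """Label a G2 value using the 2 d.f. critical values quoted in the text."""
    for threshold, label in critical:
        if g2 >= threshold:
            return label
    return "n.s."

# Invented counts for one category in three groups of 1,000 words each.
example = log_likelihood([30, 5, 5], [1000, 1000, 1000])
print(round(example, 1), significance(example))
```

The expected value for each group is simply that group's share of the running words multiplied by the category's total count, so groups of unequal size are handled without separate normalisation.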
In terms of between-groups differences, only one of the Kaiser et al. categories
(fashionable-unfashionable) distinguished between groups at p < 0.01, with the
Russian sample writing the most about this dimension. A further category –
classical-modern – distinguished at the lower probability level (p < 0.05), again
with the Russian sample writing the most about this dimension.
Table 3: Between-groups differences on content categories

            Category freq. per million words
Category    German    Polish   Russian      LL
PHYS        46,415    29,434    63,778    75.0
BOOT         7,682     2,725    13,628    47.9
SEAS        13,444     7,849    16,626    19.9
TEMP         7,682     2,834     7,086    16.7
WEAT         3,201     2,180     7,086    15.7
FASH         1,280       981     4,088    12.4
SLPR         1,601       436     2,726    11.5
ATTR         4,161     8,721     4,906    10.4
FLTS             0     1,744     1,090     9.5
SNDL         6,722     2,725     3,543     8.8
CLAS             0       654     1,635     7.7
DOCS           320       872         0     5.9
OTHE         2,241     2,180       545     5.4
PRIC             0       763       818     4.4
COMF         4,802     8,285     7,632     4.2
STAT           320       545         0     3.4
ELSH         5,442     7,304     9,267     3.4
CLOG           320         0         0     3.3
WORK        19,846    23,111    18,806     2.8
LIKE         7,682    10,575     8,994     2.3
FORM         2,561     2,507     1,363     1.9
SEXY           320       545     1,090     1.7
PRAC         2,241     3,270     3,816     1.4
AGES           960       545       273     1.4
TRNR           640       327       545     0.6
Taking into account the further content categories developed on the basis of the
qualitative content analysis, the following picture of the three national groups
emerged. We discuss here only those categories which showed significant
between-groups differences.
The Russian sample wrote most about the physical attributes of their shoes, such
as size, shape, material and colour. By contrast, the Poles wrote the least about
these dimensions overall, although this was still the most frequent category within
the Polish sample.
The Russian sample also wrote the most about seasons and the weather as
determiners of when certain styles of shoes are worn. The German sample also
wrote substantially about seasons; they wrote rather less about weather
conditions, but had the highest frequency of references to temperature (these
being a mixture of weather references and references to shoes being warm or cold
to wear). Again, the Polish sample seemed somewhat less concerned about these
three dimensions. Interestingly, seasons, weather and temperature were not
included as dimensions of usage in Kaiser et al.’s study, but, at least for some of
our sample, they seem important determiners of shoe style choice.
The only non-shoe-style category on which the Polish sample had the highest
frequency of use was attractiveness. It seems that the Polish respondents, as a
whole, were more concerned with this dimension than the other two national
groups. Attractiveness was not included explicitly in Kaiser et al.’s dimensions,
but it could be considered to be implied by categories such as like-dislike and
fashionable-unfashionable. We treated attractiveness as a separate category,
since we felt that many of its component words (such as smart and elegant) did
not fit any of the Kaiser et al. categories particularly well. However, some might
argue that these words might be classified elsewhere.
In terms of shoe styles, boots were referred to most often by the Russian sample,
then by the German sample. The Russians also made the most references to
slippers. By contrast, the German sample referred the most often to sandals. This
may be a seasonal effect, since the data were collected during the summer, but, if
so, it does not explain why this style was not also prominent in the writings of the
other two samples. The Poles had the highest frequency of references to flat
shoes and “Doc Marten”-style boots.
5.
Discussion
5.1
Qualitative or quantitative content analysis?
Both qualitative and quantitative content analysis of the shoe data yielded similar
results, revealing the same main themes. Qualitative content analysis can provide
an accurate overall picture of a text corpus, although, being interpretive and
subjective, it may overlook some specific details. As Mayring (2001) remarks,
qualitative content analysis remains an act of interpretation since “relating
categories and parts of the material is no automatic technique but a creative act of
interpreting meanings in the text” by the content analyst, who puts into the
process of analysis all his/her competencies, pre-knowledge and empathic
abilities. Qualitative content analysis defines itself within this framework as an
empirical, methodical and controlled approach to the analysis of texts within their
context of communication and without precise quantification.
In contrast to qualitative analysis, quantitative content analysis (as its name
suggests) provides an exact quantification by counting the category frequencies in
the texts. In our data, although most themes were identified with equal success
by both the qualitative and quantitative analyses, the relative emphasis placed on
them by the different nationalities became much clearer in the quantitative
analysis: for example, the qualitative analysis identified comfort as a major
emphasis of the Polish sample in particular, and, in the quantitative analysis, this
category showed a statistically significant between-groups difference.
5.2
CONTAGs or SEMTAGs?
Although we used project-specific content categories (CONTAGs) for our main
analysis, the data had initially been tagged, as mentioned earlier, with more
general semantic field categories (SEMTAGs). We therefore asked ourselves the
question: Would we have obtained the same results if we had not used
CONTAGs but had conducted the analysis only using the pre-existing
SEMTAGs? If so, this would provide a useful argument for future studies
taking the SEMTAGged data “as is” and thus reducing the manual effort needed
in developing new CONTAGs for different research projects.
Table 4 shows the SEMTAG frequencies per million words for categories that
were significant at p < 0.01 or p < 0.05. As there are more SEMTAGs than
CONTAGs, it will be seen that there are a greater number of significant
categories in this table. However, the same themes still predominate: physical
attribute categories such as materials, colour, temperature and physical properties,
as well as weather and time periods (used for seasons), are more frequent in the
Russian and German samples, whilst the evaluative category “judgement of
appearance: positive” is more frequent in the Polish sample.
In looking at SEMTAG profiles, it is worth making a distinction between analysis
categories and retrieval categories. As SEMTAG attempts to tag all the words
and phrases in a text, rather than a selection, not all the tags relate to content
items (nouns, verbs, adjectives, and lexical adverbs): many relate to word classes
such as degree adverbs, numbers, modal verbs, pronouns, and so on, which are
not key concepts within the content analysis and for most purposes can be
disregarded. Other categories – such as similar and different – are not very
meaningful in themselves as analysis categories but can be useful as retrieval
categories. For instance, when examined in context, the category “similar” can
be seen to be referring mainly to another key reason for wearing particular shoes:
because they go with or match particular clothes.
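The distinction between analysis categories and retrieval categories can be illustrated with a small sketch. The tag labels, the two category sets and the sample sentence below are hypothetical and only mirror the examples named in the text; they are not the actual SEMTAG tagset.

```python
# Hypothetical tag labels, used only for illustration.
IGNORED = {"pronouns", "numbers", "degree adverbs", "modal verbs"}
RETRIEVAL = {"similar", "different"}


def split_profile(tag_counts):
    """Split a tag-frequency profile into analysis and retrieval categories.

    Tags for grammatical word classes are dropped altogether.
    """
    analysis, retrieval = {}, {}
    for tag, count in tag_counts.items():
        if tag in IGNORED:
            continue
        if tag in RETRIEVAL:
            retrieval[tag] = count
        else:
            analysis[tag] = count
    return analysis, retrieval


def contexts(tagged_tokens, tag, window=3):
    """Return each token carrying `tag` with up to `window` words either side."""
    words = [word for word, _ in tagged_tokens]
    hits = []
    for i, (_, token_tag) in enumerate(tagged_tokens):
        if token_tag == tag:
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            hits.append((left, words[i], right))
    return hits


# Invented tagged sentence.
sentence = [("these", "pronouns"), ("shoes", "clothing"), ("match", "similar"),
            ("my", "pronouns"), ("dress", "clothing")]
print(contexts(sentence, "similar", window=2))
```

Reading the retrieved contexts, rather than the raw frequency of the tag, is what reveals a theme such as shoes matching particular clothes.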
An advantage of using SEMTAGs is the greater precision that they can provide:
for instance, here we are able to see that the Russian sample writes most about
cold temperatures whilst the German sample writes most about warm
temperatures. On the other hand, when using SEMTAGs, it is sometimes
necessary to revert to the lexical frequency list in order to examine the use of
detailed, project-specific concepts: for example, in our study, we were interested
in different kinds of footwear, but, in the SEMTAGs, all kinds of footwear
receive a single tag (clothing and accessories), which is applied to any apparel
item. Perhaps a compromise can be found in developing detailed
subcategorisations of key, project-specific categories but leaving the remainder of
the SEMTAGs unaltered.
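A minimal sketch of this compromise, assuming the tagged data are available as (word, tag) pairs: only the single general tag of interest is subdivided, and every other tag passes through unchanged. The subcategory labels and the word list are hypothetical and would need to be drawn up for each project.

```python
GENERAL_TAG = "clothing and accessories"

# Hypothetical project-specific subdivision of the general tag.
FOOTWEAR_SUBTYPES = {
    "boots": "footwear: boots",
    "sandals": "footwear: sandals",
    "trainers": "footwear: sports shoes",
}


def refine(tagged_tokens):
    """Subdivide the general clothing tag; leave all other tags unaltered."""
    refined = []
    for word, tag in tagged_tokens:
        if tag == GENERAL_TAG:
            # Fall back to the general tag for items not in the word list.
            tag = FOOTWEAR_SUBTYPES.get(word.lower(), tag)
        refined.append((word, tag))
    return refined


print(refine([("Boots", GENERAL_TAG), ("scarf", GENERAL_TAG), ("cold", "temperature: cold")]))
```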
Table 4: Between-groups differences on SEMTAGs

                                    Category freq. per million words
Category description                   German   Polish  Russian     LL
Substances and materials: solid        11,204    5,669   18,534   42.0
Colour and light                       27,209   14,281   26,983   32.0
Numbers                                16,325    7,522   17,171   28.9
Pronouns                              147,567  167,230  128,373   27.6
Buying and selling                        960    6,323    1,635   27.4
Temperature: cold                       1,601      327    3,543   19.4
Clothing and accessories              101,152   93,208  120,741   19.0
Time periods                           19,206   12,101   21,532   17.6
Weather                                 3,201    2,180    7,086   15.7
Degree: maximisers                      1,921    4,470      818   15.2
Judgement of appearance: positive       7,682   16,025   13,900   13.2
Continuous                                320    2,180      273   12.2
Possible                               10,563   12,973    6,269   12.0
Difficult                                   0    1,635      545   10.2
Long                                    1,280    2,289    5,179   10.0
Short                                     960      654    2,998    9.8
Arts and crafts                             0      327    1,635    9.5
Grammatical words                     287,452  255,096  267,920    9.3
Unfriendly                                640    1,417        0    9.2
Young, new                                640    1,417        0    9.2
Degree: approximators                     960    1,308    3,816    9.1
Friendly                                1,601    1,308        0    9.0
Interest                                    0      872        0    8.9
Dry                                         0        0      818    8.8
Exclusivizers/particularizers           8,323    4,579    3,271    8.6
The same                                    0      218    1,363    8.5
Using                                   1,601    1,090        0    8.3
Similar                                 1,601    3,816    1,363    8.3
Showing                                 2,881      763    2,180    8.2
Past                                      960    1,308        0    8.1
Negative                               10,883   14,935    9,267    8.0
Seem, appear                            2,881    4,252    1,363    7.8
Heavy                                       0      763        0    7.8
Location and direction                  3,841    3,161    6,814    7.7
Transport by land                       3,201      981    2,453    7.6
Phoney                                    320        0      818    7.6
Large                                       0    1,090      273    7.3
Different                               9,603    8,503    4,633    7.3
Physical properties: general            1,280    1,744    4,088    7.1
Putting and placing                       640    2,943    2,726    6.9
Temperature: hot                        5,122    2,071    3,271    6.9
Point in time                           3,201    4,797    1,908    6.8
Important                               2,241    3,488    1,090    6.7
Intimate/sexual relationship                0      654        0    6.7
Unusual                                     0      654        0    6.7
Substances and materials: liquid          640        0        0    6.5
Touch                                     640        0        0    6.5
Games                                     640        0        0    6.5
Relationship: general                     320    1,744      545    6.5
Body and body parts                     6,082    7,631   11,447    6.4
General, unspecific                       960      109        0    6.4
Giving                                  1,280      763        0    6.4
Degree: compromisers                    1,921    2,943      818    6.4
Usual                                   3,841    6,759    3,816    6.3
Liking: positive (+++)                      0      763    1,363    6.2

5.3
Using non-native English compositions as data
Returning to our secondary research question – the issues posed by using English
as a survey response medium with non-native speakers – we are able to make
three main observations.
Firstly, it is possible that, when working with students as respondents, the nature
of the exercise may have some effect on its outcome. In the case of the present
survey, the Russian exercise was also used as a compulsory graded exercise in an
EFL class; the German and Polish exercises, by contrast, were intended as
compulsory but did not form part of the assessment for the respective courses, so
students would not be penalised for not doing them (or for doing them “badly”).
This may account for the fact that the Russian responses were in many ways more
detailed, in several cases involving numbered lists of footwear items owned by
the writer. An interesting alternative – or perhaps supplementary – explanation
for some between-group differences is that they may arise in part from different
“rhetorical strategies” in approaching the compositions. 4
Aside from this consideration, two more specifically lexical issues arise: the use
of (often very culture-specific) L1 vocabulary in the English compositions and
the non-standard use of English vocabulary by particular groups of respondents.
Examples of the first issue in our data were the term glans in the Polish
compositions, about which we had long discussions with our Polish colleague (it
appears to refer to a kind of “punk rock” style involving lots of black leather and
metal studs), and the term kapron in the Russian data, a kind of material.
An example of the second issue was the use by the Russian respondents of the
term top-boots. They were the only group to use this term, and they used it rather
frequently (10 times across 12 respondents). This is a relatively dated expression
in native-speaker English and, according to the Oxford English Dictionary, is
properly used for the style of riding boot that has a cuff of differently coloured
leather at the top of the boot shaft: these are the style of boots worn by jockeys
and show-jumpers, also commonly seen in a female fashion version at the turn of
the 1980s/90s. However, the Russian respondents appear to be using this term to
refer to any boot with a shaft reaching to the knee of the wearer. The Oxford
English Dictionary also recognises this sense, though castigates it as “incorrect”.
These lexical issues are perhaps not a large problem in a dictionary-based content
analysis, providing the respective words and phrases are properly categorised, but
they can potentially be problematic for those approaches to content analysis that
work primarily on vocabulary items, such as correlational content analysis or
keyword analysis (Scott, 1997, 2001). In such cases, a partial synonymisation
process, although frequently dispensed with as uneconomical, might be advisable.
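A partial synonymisation step of this kind can be sketched as a simple mapping applied before word frequencies are counted. The mapping below is hypothetical: it folds the Russian respondents' top-boots into an invented label, so that a vocabulary-based comparison is not driven by one group's idiosyncratic choice of word.

```python
from collections import Counter

# Hypothetical mapping from group-specific variants to a shared label.
SYNONYMS = {
    "top-boots": "knee-high boots",
    "top-boot": "knee-high boots",
}


def synonymise(tokens, mapping=SYNONYMS):
    """Lower-case each token and replace listed variants with their shared label."""
    return [mapping.get(token.lower(), token.lower()) for token in tokens]


def frequency_list(tokens, mapping=SYNONYMS):
    """Word-frequency list computed after partial synonymisation."""
    return Counter(synonymise(tokens, mapping))


print(frequency_list(["Top-boots", "boots", "top-boots", "sandals"]))
```

Only the listed items are altered, which keeps the manual effort small compared with a full dictionary-based categorisation.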
6.
Conclusion
Our pilot study has revealed interesting differences between the three nationalities
in the way they write about shoe fashions. Although they all made use of the
“American” dimensions delineated by Kaiser et al. (1987), they did so to different
degrees, and they also made use of other conceptual dimensions not mentioned by
Kaiser et al. In general, the Russians seemed to place the most emphasis on the
physical and practical characteristics of their shoes. They also seemed to be the
strongest “trend-followers”. By contrast, the Poles seemed to emphasise their
judgements of the attractiveness of shoes. The Germans tended to fall midway
between the Russians and the Poles on most dimensions. As this was a pilot
study, these results should for the moment be considered suggestive rather than
conclusive. However, if further work (which we are planning to carry out) does
support these findings, then they may have important implications both for cross-cultural apparel-based NVC and for footwear marketing.
On a methodological level, we have shown that a general content analysis
dictionary based on semantic fields can deliver the same substantive results as a
specially constructed project-specific dictionary. We have also shown that such a
dictionary provides broadly the same results as a manual qualitative content
analysis, which replicates the findings of Thomas and Wilson (1996) on doctor-patient interaction. We therefore suggest that qualitative analysis and special
dictionary construction are costly stages of analysis which can be dispensed with
without much effect on the outcome of a study. However, it may be that such
general dictionaries require some minor modification (mostly the insertion of
more detailed sub-divisions) to cover the fine detail of specific topic areas under
examination.
Notes
1
Unless, of course, one of the aims had been to project a trendier image for
the party, regardless of the speech content.
2
We are grateful to our colleagues in this project – Amei Koll-Stobbe in
Greifswald, Germany; Agnieszka Leńko-Szymańska in Lodz, Poland; and
Tatyana Astafurova in Volgograd, Russia – for collecting these data.
3
We have since become aware of further research on shoes carried out by
Marianne Herzog in Germany – e.g. Herzog (1995).
4
See Leńko-Szymańska (this volume) for a detailed discussion of this idea
in the context of Polish and American English compositions written in
response to the same question (about mobile phones).
References
Andersen, P.A. (1988), Explaining intercultural differences in nonverbal
communication, in: L.A. Samovar and R.E. Porter (Eds.) Intercultural
Communication: A Reader. 5th ed. Belmont, CA: Wadsworth, pp. 272-281.
Becker, J. and H.-J. Lißmann (1973), Inhaltsanalyse – Kritik einer
sozialwissenschaftlichen Methode. Arbeitspapiere zur politischen
Soziologie 5. München: Olzog.
Davis, L.L. and S.J. Lennon (1988), Social cognition and the study of clothing
and human behavior. Social Behavior and Personality 16(2): 175-186.
Golliher, J.M. (1987), The meaning of bodily artefacts: variation in domain
structure, communicative functions, and social contexts. Semiotica
65(1/2): 107-127.
Herzog, M. (1995), Auftreten... Mensch und Schuh, in: D. Grünewald (ed.) "Was
sind wir Menschen doch!..." Menschen im Bild. Analysen. Festschrift für
Hermann Hinkel. Weimar: Verlag und Datenbank für
Geisteswissenschaften, pp. 105-114.
Hogenraad, R., D.P. McKenzie, and N. Péladeau (fc.), Force and influence in
content analysis: The production of new social knowledge. Quality &
Quantity 37(3): 221-238.
Kaiser S.B., H.G. Schutz and J.L. Chandler (1987), Cultural codes and sex-role
ideology – a study of shoes. American Journal of Semiotics 5(1): 13-33.
Leńko-Szymańska, A. (this volume), The curse and the blessing of mobile phones
– a corpus-based study into Polish and American rhetoric strategies.
Lennon, S.J. and F.G. Miller (1984/85), Attire, physical appearance, and first
impressions: more is less. Clothing and Textiles Research Journal 3(1): 1-8.
Mayring, P. (2001), Combination and integration of qualitative and quantitative
analysis. Forum Qualitative Sozialforschung / Forum: Qualitative Social
Research [On-line Journal], 2(1). URL:
http://www.uni-tuebingen.de/qualitative-psychologie/t-ws01/Mayring_en.htm.
Pond, M. (1986), Shoes Never Lie. London: Grafton.
Rayson, P. and A. Wilson (1996), The ACAMRIT semantic tagging system:
progress report, in: L. J. Evett and T. G. Rose (eds.) Proceedings of the
AISB Workshop on Language Engineering for Document Analysis and
Recognition. Brighton: Faculty of Engineering and Computing,
Nottingham Trent University, pp. 13-20.
Remland, M.S., T.S. Jones and H. Brinkman (1991), Proxemic and haptic
behavior in three European countries. Journal of Nonverbal Behavior 15:
215-232.
Scott, M. (1997), PC analysis of key words - and key key words. System 25(1): 1-13.
Scott, M. (2001), Comparing corpora and identifying key words, collocations,
and frequency distributions through the WordSmith Tools suite of
computer programs, in: M. Ghadessy, A. Henry and R.L. Roseberry (eds.)
Small Corpus Studies and ELT: Theory and Practice. Amsterdam:
Benjamins, pp. 47-67.
Stone, G.P. (1962), Appearance and the self, in: A. Rose (ed.) Human Behavior
and Social Processes. Boston: Houghton Mifflin, pp. 86-118.
Thomas, J.A. and A. Wilson (1996), Methodologies for studying a corpus of
doctor-patient interaction, in: J.A. Thomas and M.H. Short (eds.) Using
Corpora for Language Research: Studies in Honour of Geoffrey
Leech. London: Longman, pp. 92-109.
Wilson, A. and P. Rayson (1993), Automatic content analysis of spoken
discourse: a report on work in progress, in: C. Souter and E. Atwell (eds.)
Corpus Based Computational Linguistics. Amsterdam: Rodopi, pp. 215-226.
Survey and Prospect of China’s Corpus-Based Research1
Yang Xiao-jun
Hunan University of Science & Technology / Beijing Foreign Studies University
Abstract
This chapter conducts a survey of China’s corpus-based research, focusing on the
following aspects: i) the history and development of China’s corpus-based research; ii)
corpus compilation; iii) leading research figures and institutions; iv) major academic
publications on this research. The author then considers the prospects for this research,
focusing on development trends in China’s corpus-based research, corpus annotation
problems, corpus processing tools and the application of corpora to language teaching and
translation studies.
1.
Introduction
Corpus linguistics is playing an increasingly important role in linguistic
research, lexicography and related fields. Tognini-Bonelli (2001: 1) points out that what we
are witnessing is the emergence of a new research enterprise and a new
philosophical approach to linguistic enquiry that has a theoretical status and,
because of this, is in a position to contribute specifically to other applications
such as lexicography, language teaching, translation, stylistics, grammar, gender
studies, forensic linguistics and computational linguistics. Hunston (2002: 1)
notes that it is no exaggeration to say that corpora, and the study of corpora, have
revolutionised the study of language, and of the applications of language over the
last few decades, and that the improved accessibility of computers has changed
corpus study from a subject for specialists to something that is open to all. In
China corpus linguistics has led to a qualitative change in our understanding of
language.
In this paper the author first conducts a survey of China’s corpus-based research,
focusing on the following aspects: i) the history and development of China’s
corpus-based research; ii) corpus compilation; iii) leading research figures and
institutions; iv) major academic publications on this research. The author then
considers the prospects for this research, focusing on development trends in China’s
corpus-based research, corpus annotation problems, corpus processing tools and
the application of corpora to language teaching and translation studies.
2.
Survey of China’s Corpus-based Research
According to Huang Changning and Li Juanzi (2002: 4), China’s earliest corpus-based research on Chinese dialects has a history of nearly 3,000 years. Even
computer corpus-based research on the English and Chinese languages has a
history of over 20 years. Wang Jianxin (2001) stated that, in recent years, corpus
linguistics in China has made considerable progress in the compilation,
annotation and analysis of Chinese corpora, and in the compilation and study of
corpora of English as a Foreign Language.
2.1
The history and development of the research
The history of China’s earliest corpus-based research may date back to the Zhou
Dynasty (1100-221 B.C.) and the Qin Dynasty (221-206 B.C.). It is reported that
Yang Xiong, who showed great interest in studying dialects, spent 27 years
interviewing scholars and soldiers who had been summoned from different
districts to the capital. He collected enough data and resources manually to
compile the first book on Chinese dialects (Huang Changning & Li Juanzi, 2002:
4). Since the compilation of China’s first English corpus, JDEST, in 1982, many
achievements have been made in China’s contemporary corpus-based research.
First, fifteen Chinese corpora, eight English corpora and five bilingual corpora
(including parallel corpora) have been compiled or are being compiled (these are
described in detail in the next section). Secondly, four books on corpus linguistics
and four dictionaries of Modern Chinese have been published. Thirdly, eighteen
universities and research institutions have been carrying out research programmes
on corpus linguistics. Fourthly, the number of regular contributors to
major national journals of foreign language research and linguistics is growing year by year,
and their work covers topics from the applications of corpus linguistics to techniques for
compiling and exploiting corpora. A search for papers on corpus linguistics at
www.cnki.net finds 107 academic papers on
corpus-based research in Chinese academic journals (CAJ); adding the papers
found through www.baidu.com brings the total to about 140. In addition,
there are a number of MSc/MA dissertations and several doctoral theses on
corpus-based work.
2.2
China’s Corpora compiled and being compiled
During these 20 years, about 15 Chinese corpora, 8 English corpora and 5
bilingual corpora (including parallel corpora) have been compiled or are being
compiled. Some are intended for general purposes; others are intended for specific
purposes.
2.2.1 Chinese Corpora
(1) The Modern Chinese Language Corpus (MCLC). The biggest representative
written Chinese corpus to date, compiled at the Research Institute of
Language Application in Beijing in 1995. It aims to contain 70 million
contemporary Chinese characters, systematically sampled from among
1.4 billion characters of original texts covering the period 1919 to the
1990s. With regard to contents, 59.6 percent of these texts belong to the
humanities, 17.24 percent to natural science, 13.7 percent to newspaper
material and 9.37 percent to the miscellaneous class. The MCLC serves
potential needs in five areas: i) information processing in the
Chinese language; ii) the unification and standardisation of the Chinese
language; iii) academic research; iv) language education; v) application
of the Chinese language (see Wang Jianxin, 2001).
(2) The Si Ku Quan Shu. (This refers to the Complete Library in Four Branches
of Literature completed in 1782, which is the world’s longest series of
books. The work, comprising the four traditional divisions of Chinese
learning (classics, history, philosophy, and belles-lettres), contains 3,503
titles bound into more than 36,000 books with a total of 853,456 pages.)
This is the largest electronic Chinese text database so far. It is composed of
the most comprehensive and outstanding classic Chinese works of the
3,000 years before the end of the 18th century. This database of 800
million Chinese characters in 4.7 million pages was published by
Digital Heritage Publishing Ltd. in Hong Kong in two versions. One is
the complete facsimile version on 183 CD-ROMs with character-retrieving tools, with a total of 4.7 million pages of the original books, which
was put on the market in 1998; the other has 168 CD-ROMs with
title-retrieving tools and a total of 800 million characters, and was also put
on the market in 1998. The database has greatly facilitated the study of the
ancient Chinese language, literature, culture and history.
(3) Corpus of the Chinese Language as Interlanguage. This corpus was compiled
by the Research Institute for Computational Language of Beijing
Language and Culture University with more than 3.5 million characters
in 1995. This corpus consists of 5,774 written texts in Chinese as
interlanguage by 1,635 overseas students learning Chinese. The students
are from 96 countries and regions studying in nine universities. This
corpus is used to enhance the teaching of Chinese to foreign students.
(4) The Academia Sinica Balanced Corpus (version 3.0). This corpus was
compiled at the Academia Sinica in Taiwan. It is one of the largest
annotated Chinese corpora and has been put on the web with 5 million
Chinese characters tokenized, tagged and parsed. This grammar tree
bank and the statistics will be very useful for the processing system of
the Chinese language.
(5) The Corpus of Chinese Phonology. This corpus was set up by the Institute of
Applied Linguistics of the Chinese Academy of Social Sciences (CASS)
with 45,000 Chinese words and 1,200 syllables with tones, numeral
strings, lightly pronounced words, words pronounced with a rising
“/r/” ending sound, bi-syllables, tri-syllables, separate sentences, short
passages and dialogues. This corpus has been annotated manually. An
automatic phonetic-analysis system of Mandarin Chinese, based on the
English phonetic analytical system ToBI, is in progress.
(6) Contemporary Beijing Spoken Chinese Corpus. This corpus was compiled by
Lu Bisong, Ren Yuan and six other scholars in 1992. They
interviewed 500 people from six districts of Beijing on about 28
topics and sampled 378 recordings. This corpus gives a vivid
description of Beijing Spoken Chinese in the 1980s with a total of 1.7
million characters.
(7) Situated Discourse for Spoken Chinese corpus. This corpus is being compiled
at the Research Department for Contemporary Linguistics of the
Research Institute for Languages and Linguistics of the Chinese
Academy of Social Sciences under the guidance of the doctoral supervisor
Gu Yueguo. As part of the workplace-related discourse, more than 269
hours of recordings are planned for this corpus, which are being
transcribed and annotated to be made available on the Internet in a
multimedia form.
(8) Corpus of Chinese Textbooks for Primary and Middle Schools. This corpus
was compiled by Beijing Normal University. It contains 1 million characters
and is composed of middle-school teaching material on Chinese literature
and language.
(9) The People’s Daily Annotated Corpus (PFR). This corpus has been
segmented and annotated jointly by the Fujitsu Research Institute of
Japan and the Institute of Computational Linguistics (ICL) of Peking
University with about 200 million Chinese characters.
(10) Huayu2. This is a balanced corpus of 2 million Chinese characters,
compiled jointly by the State Key Laboratory of
Intelligent Technology & System (Key Lab), Tsinghua University, and
the Language Information Processing Institute, Beijing Language and
Culture University, in mainland China. This corpus has been tokenised
and tagged with Cseg & tag 1.0, a segmentation and tagging system
developed by them, and then manually proofread. A grammar
tree bank of 10,000 Chinese sentences is being built from it for use as a test-bed for Chinese parsers.
(11) Linguistic Variation in Chinese Communities (LIVAC Synchronous Corpus).
This corpus was compiled by the
City University of Hong Kong and is a representative computerised
text corpus drawn from Chinese newspapers and electronic media in
Mainland China, Hong Kong, Macau, Taiwan and Singapore. LIVAC
aims to cover a ten-year period starting from July 1995. This corpus
will be very useful for comparative studies and can provide
quantitative data for language engineering.
(12) Statistical Corpus of Chinese Word Frequency. This corpus was designed for
the specific purpose of compiling Chinese word-frequency statistics and was
compiled jointly by 10 universities and research institutes. It has several
subcorpora, such as: (a) a 20-million-character corpus of modern Chinese,
compiled by the Beijing University of Aeronautics and Astronautics
in 1983; (b) a 5.27-million-character corpus of modern Chinese,
compiled by Wuhan University in 1979; (c) a 5-million-character Chinese
corpus, compiled by Peking University in 1992; (d) a 100-million-character corpus of ancient and modern Chinese, compiled by
Shanghai Normal University; (e) a modern Chinese corpus of 66,186,297
Chinese characters, compiled by Shandong University; (f) a 2.5-million-character
Chinese corpus of news, compiled by Shanxi
University in 1988.
(13) The Corpus of the Contemporary Chinese Language. This corpus was
compiled by the Department of Chinese and Bilingual Linguistics at the
Hong Kong University of Science and Technology (HKUST) and
comprises 6 million characters of contemporary Chinese used in
mainland China, Hong Kong and Taiwan. It has been segmented with an
automatic algorithm, and considerable research has been conducted
using this corpus.
(14) The Hong Kong Cantonese Child Language Corpus (CANCORP). This
corpus grew out of the project “The development of grammatical
competence in Cantonese-speaking children”, funded by the Hong Kong
Research Grants Council from 1991 to 1993, which was a joint effort of three
local universities: The Chinese University of Hong Kong, the Hong
Kong Polytechnic University, and the University of Hong Kong. This
database contains 171 files coded according to the internationally
accepted CHAT format and tagged with 33 part-of-speech labels. The
data should be of use to anyone interested in early language
development, be they linguists, psychologists, philosophers or
educationalists. Queries about the corpus should be directed to Thomas
Lee (e-mail: htlee@netvigator.com).
(15) The Electronic Database of the Chinese Documents (Scripta Sinica). This
corpus is probably the most empirically sound one in China today and is
fully available on the Internet. It was
compiled collectively by ten research institutes of the Academia Sinica
in Taiwan. Version 2.0 of this database contains 139,940,071
Chinese characters, and each year 10 million characters of new Chinese
texts are added to it.
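Entry (14) above mentions files coded in CHAT format. As a rough illustration of what reading such a file involves, the sketch below assumes the basic CHAT line conventions: header lines begin with "@", speaker lines with "*" and a speaker code, and dependent tiers such as a morphological tier with "%". The sample lines are invented, continuation lines are not handled, and the actual coding of the Hong Kong Cantonese Child Language Corpus files may differ in detail.

```python
def parse_chat(lines):
    """Collect speaker utterances and their dependent tiers from CHAT-style lines."""
    utterances = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("*"):
            # Main tier: "*CHI:<tab>text"
            speaker, _, text = line[1:].partition(":")
            utterances.append({"speaker": speaker, "text": text.strip(), "tiers": {}})
        elif line.startswith("%") and utterances:
            # Dependent tier: attach it to the most recent utterance.
            name, _, content = line[1:].partition(":")
            utterances[-1]["tiers"][name] = content.strip()
        # "@" header lines and anything else are skipped in this sketch.
    return utterances


# Invented sample, not taken from the corpus.
sample = [
    "@Begin",
    "*CHI:\tmore cookie .",
    "%mor:\tqn|more n|cookie .",
    "@End",
]
for utterance in parse_chat(sample):
    print(utterance["speaker"], utterance["text"], utterance["tiers"])
```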
2.2.2 English Corpora
(1) JDEST (Jiaotong University Corpus for EST). This corpus was compiled at
Shanghai Jiaotong University under the guidance of Prof. Huang Renjie
and Prof. Yang Huizhong in 1982. It comprises approximately 1 million
words, and was expanded to 4,082,368 words in 2000. It has 2,000 texts
of 500 words each, covering 10 majors in both British
English and American English. It is an annotated corpus, and word
frequency lists have been produced. It is designed to meet the needs of
students of English as used in science and technology. The website is
www.sjtu.edu.cn
(2) GPEC (Guangzhou Petroleum English Corpus). This corpus was compiled
by Prof. Zhu Qi-bo at Guangzhou Training College of the Chinese
Petroleum University in 1986. It has 700 passages of 500-600 words
each, sampled over the period 1975-1986. Its size is 411,612
words in both British English and American English. A package of
concordance programs for the corpus has been developed. This
corpus is designed to enable the study of the lexicon of Petroleum
English and provide information for comparative language analysis.
(3) Corpus for EST. This corpus was compiled jointly by Guangzhou University
of Foreign Studies and the Hong Kong University of Science and
Technology in 1998. It comprises 5 million words.
(4) Communicative English Corpus for Chinese Students. This corpus was
compiled at the School of Foreign Languages and International Studies,
Guangzhou University of Foreign Studies.
(5) English Literature Corpus. This corpus was compiled at the School of Foreign
Languages and International Studies, Beijing Foreign Studies University.
It comprises 5 million words.
(6) Chinese Learner English Corpus (CLEC). This corpus was compiled in 1999
under the guidance of Prof. Gui Shichun of Guangzhou University of
Foreign Studies and Prof. Yang Huizhong of Shanghai Jiaotong
University. It comprises 1 million words. The written texts in this corpus
are compositions and essays by English majors, non-English majors and
middle school students (40%, 30% and 30%
respectively). A subcorpus, the College Learner Corpus, was compiled at the
School of Foreign Studies, Henan University, under the guidance of Prof.
Li Wenzhong.
(7) The HKUST Corpus of Learner English. This corpus was compiled by J.
Milton at the Computer Department of the Hong Kong University of
Science and Technology. It comprises 25 million words. This corpus was
POS tagged and partly error-tagged. All language materials are from
Chinese learners. Kennedy (1998: 42) has made some comments on this
corpus as follows: Interlanguage studies of the written English of mainly
Cantonese learners of English will be facilitated by the completion of the
five-million-word Hong Kong University of Science and Technology
(HKUST) Corpus (Milton & Tong, 1991). The corpus will be probably
the largest machine-readable corpus yet produced of the written English
of Chinese learners and also one of the largest corpora of any single
group of learners. It is intended that it will be available with grammatical
and discourse feature tags. The use of this corpus to describe the rule
base of written English for learners of Chinese background is intended to
inform the development of English teaching materials.
(8) Corpus for Middle School English Education (MSEE). This corpus was
compiled by Prof. He An-ping at South China Normal University in
1999. It comprises 2.3 million words in three sections: 1.3 million words
of the written and spoken English produced by the secondary school
students in Guangdong Province; 0.5 million words of the complete new
set of English textbooks used in China’s middle schools, and 0.5 million
transcribed words of 130 hours of classroom teaching, matched with
video and tape recordings.
2.2.3 Bilingual Corpora (including Parallel Corpora)
(1) CONULEXID (the Commercial Press and Nanjing University Lexical
Database). This is a bilingual (English and Chinese) database, compiled
jointly by Nanjing University and the Commercial Press
in 2000. It is intended for bilingual dictionary making and
publishing.
(2) Chinese/Japanese Parallel Corpus (CJPC). This corpus was compiled at the
National Research Centre for Foreign Language Education, Beijing
Foreign Studies University, under the guidance of Prof. Xu Yiping. It is
the first Chinese/Japanese parallel corpus compiled in China, and
many significant results have been produced from it (see Xu and Cao,
2002).
(3) Chinese/English Parallel Corpus (CEPC). This corpus is being compiled at
the National Research Centre for Foreign Language Education, Beijing
Foreign Studies University, under the guidance of Dr. Wang Kefei. The
objective of the present project is to create a Chinese/English Parallel
Corpus of 30 million words, representative of modern Chinese and
English in the twentieth century so as to establish a research platform for
comparative studies of Chinese and English that can meet observational
and descriptive accuracy, translation studies and teaching, statistical
analysis such as frequency of occurrence, machine translation and
compilation of bilingual dictionaries. This corpus comprises four
subcorpora: bilingual corpus of aligned sentences and phrases, multidisciplinary corpus, specialised corpus, and translated texts corpus. All
the texts are from written sources. Software such as bilingual text
alignment software and bilingual concordance software will be
developed for effective and efficient use of this corpus. This corpus has
distinctive features: for example, it can be separated for specific purposes or
combined as a whole for a general purpose, and the original texts have
different translations (see Wang Kefei, 2002b).
(4) English/Chinese Parallel Corpus (ECPC). Under the Statistical Inter-Lingual
Conversion (SILC) project, a 60Mb English/Chinese parallel corpus has
been compiled by the Computer Department of the HKUST, containing
transcripts of speech recordings and their high-quality translations of the
Hong Kong parliament. From the corpus, 29 Mb of English texts (about
5 million words) and 15.5 Mb of Chinese equivalents were used for the
experiment. They were all automatically aligned at sentence and
paragraph level.
(5) Cantonese-English Bilingual Child Language Corpus. This is a corpus of
Cantonese-English bilingual children’s early language development,
which is being compiled by Virginia Yip (CUHK) and Stephen Matthews
(HKU). Three simultaneous bilingual subjects will be studied
longitudinally for one and a half or two years and video-taped bi-weekly.
The resulting transcriptions will form a bilingual corpus, containing
English and Cantonese in romanised form for each child. This will be the
first Cantonese-English Bilingual Child corpus and will be useful for
addressing issues such as the question of differentiation, the degree of
balance between the two languages and the possibility of delay relative
to monolingual development.
2.3
Leading figures and institutions in the research
A: The leading figures and their contributions to China’s corpus-based research
are introduced below. Their names are spelt in Chinese Pinyin, with the surname
followed by the given name.
(1) Yu Shiwen is a doctoral supervisor and the director of the Institute of
Computational Linguistics, Department of Computer Science and
Technology, Peking University. He has been conducting research on
computational linguistics and natural corpus processing since 1986. He
compiled the Grammatical Knowledge Base of Contemporary Chinese
(GKBCC), which was published by Tsinghua University Press in
1997. Based on the GKBCC, he also compiled the database
Grammatical Knowledge Base of Contemporary Chinese: A Complete
Specification, which was published by Tsinghua University Press in
1998.
(2) Huang Changning works at the Research Institute for Computational
Language of Tsinghua University. He has written five papers on corpus-based
research, mainly on Chinese corpora, and one book, Corpus
Linguistics (Chinese version), published by the Commercial Press in 2002.
(3) Zhang Pu works at the Research Institute for Computational Language of
Beijing Language and Culture University. He has written ten papers
on corpus-based research, mainly on Chinese corpora, and is one of the
leading researchers who compiled the Modern Chinese Corpus.
(4) Gui Shichun is a linguist working at Guangzhou University of Foreign
Studies. He has compiled the Chinese Learner English Corpus (CLEC)
and undertaken some corpus-based research.
(5) Yang Huizhong is a doctoral supervisor in the Foreign Studies Department of
Shanghai Jiao Tong University. He has compiled two major corpora with
other professors: JDEST and CLEC. In 1985 he wrote a paper, The use
of computers in English teaching and research in China, and in 2002,
together with his four doctoral candidates majoring in corpus linguistics,
he wrote a book, An Introduction to Corpus Linguistics. He is one of the
pioneers of China’s corpus-based research.
(6) Gu Yueguo is a doctoral supervisor at the Research Institute for Languages and
Linguistics, the Chinese Academy of Social Sciences. He is in charge of
the research programme for compiling the Situated Discourse for
Spoken Chinese corpus, which he is about to complete. In 1998 he edited
a special issue on corpus linguistics research as the first issue of
Contemporary Linguistics, to which he contributed a paper, Corpora and
Language Research, and in which he gave some advice on this kind of
research. He has had four doctoral candidates majoring in corpus
linguistics.
(7) Wang Kefei is a doctoral supervisor at the National Research Centre for
Foreign Language Education, Beijing Foreign Studies University. He is
in charge of compiling the Chinese/English Parallel Corpus. He has
published four papers on parallel corpus-based approaches to both
language and translation studies.
(8) Chen Guohua is a doctoral supervisor at the National Research Centre for
Foreign Language Education, Beijing Foreign Studies University, and
one of the leading members of the team compiling the Chinese/English Parallel
Corpus. He has undertaken some research on corpus-based approaches to
lexicography and has one doctoral candidate majoring in corpus
linguistics and lexicography.
(9) Pan Yongliang is a doctoral supervisor at PLA Foreign Studies University. He
has written two papers on corpus-based research and has several doctoral
candidates majoring in corpus linguistics.
(10) Wang Jianxin works at the Foreign Languages Department of the Beijing
University of Posts and Telecommunications and takes part in compiling
the Chinese/English Parallel Corpus. He has written six papers on
corpus-based research. Among these papers, one was published in the
International Journal of Corpus Linguistics (2001, vol. 6, No. 2); another
was published in ICAME Journal (2000, 4). He studied corpus
linguistics at the University of Bergen as a visiting scholar for one year.
(11) He Anping works at the Foreign Studies Department of South China Normal
University and attended two conferences: ICAME 2000 (Sydney) and
Corpus Linguistics 2001 (Lancaster). She has published more than five
papers on corpus-based research and compiled a corpus, MSEE.
(12) Li Wenzhong. Dr. Li majored in corpus linguistics and was supervised by
Prof. Yang Huizhong. He has compiled a sub-corpus of CLEC, the
College Learner Corpus, and has written some papers on corpus-based research
and three chapters of Yang Huizhong’s An Introduction to Corpus
Linguistics.
B: The following are the leading institutions in this research and their addresses,
websites and the e-mail addresses of the project researchers.
(1) The State Key Laboratory of Intelligent Technology and Systems, Tsinghua
University, Beijing, 100084, P.R. China. E-mail:
sms@s1000c.cs.tsinghua.edu.cn; website: www.tsinghua.edu.cn
(2) The Institute of Computational Linguistics, Dept. of Computer Science and
Technology, Peking University, Beijing, 100871, P.R. China. E-mail:
yusw@pku.edu.cn; website: www.pku.edu.cn
(3) The Academia Sinica of Taiwan. Website: www.academia.sinica.edu.tw.
Scripta Sinica. E-mail: linshi@sinica.edu.tw; website:
academia.sinica.edu.tw/info/index.html#db.
(4) The National Research Center for Foreign Languages Education, Beijing
Foreign Studies University, Beijing, 100089, P.R. China. E-mail:
kfwang@hotmail.com; website: www.sinotefl.com
(5) Language Information Processing Institute, Beijing Language and Culture
University, Beijing, 100083, P.R. China. E-mail: zhangpu@blcp.edu.cn;
website: www.blcp.edu.cn
(6) The Chinese Academy of Social Sciences. No. 5, Jian Guo Men Wai Da Jie,
Beijing, 100732, P.R. China. www.cass.net.cn/jxky/yxsz/kx/index.htm.
(7) Beijing University of Aeronautics and Astronautics. www.buaa.edu.cn
(8) Open System & Chinese Information Processing Center, Institute of Software,
the Chinese Academy of Sciences, Beijing, 100083. P.R. China. E-mail:
idu@sonata.iscas.ac.cn
(9) Shanghai Jiaotong University, Shanghai, 200030, P.R. China. www.sjtu.edu.cn
(10) Guangdong University of Foreign Studies, Guangdong, 510420, P.R. China.
www.gdufs.edu.cn
(11) South China Normal University, Guangdong, 510631, P.R. China.
www.scnu.edu.cn
(12) Wuhan University. www.whu.edu.cn
(13) Northeast University, Shenyang, 110006, P.R. China. www.neu.edu.cn
(14) Shanxi University, Taiyuan, 030006, P.R. China. www.sxu.edu.cn
(15) The City University Of Hong Kong. www.cityu.edu.hk
(16) Digital Heritage Publishing Ltd. (Hong Kong). www.skqs.com
(17) The Chinese University of Hong Kong. www.cuhk.edu.hk
(18) University of Petroleum-Guangzhou.
www.ccem.uiuc.edu/chen/up/guangzhou.html.
2.4
Classifications of corpus-based research publications and major
academic journals
2.4.1 Dictionaries and books
(1) The Comprehensive Dictionary of the Chinese Language (in electronic form).
The original 12 volumes of this dictionary were made available on one
CD-ROM in 1998 by the Chinese Dictionary Press and the Commercial
Press (Hong Kong Ltd.).
(2) Grammatical Knowledge Base of Contemporary Chinese (GKBCC). This
database was compiled at Peking University and published by
Tsinghua University Press in 1997. Based on the GKBCC, the database
Grammatical Knowledge Base of Contemporary Chinese: A
Complete Specification was compiled by Yu Shiwen et al. at the
ICL of Peking University and published by Tsinghua University
Press in 1998.
(3) List of Common Characters in Modern Chinese. It was published by the
Chinese Language Press in 1997.
(4) Frequency Dictionary of the Modern Chinese Language. This database was
compiled at the Language Information Processing Institute, Beijing
Language and Culture University. It gives frequency statistics for
1,310,000 characters of Modern Chinese and was published by
Beijing Language and Culture University Press in 1997.
(5) An Introduction to Corpus Linguistics. This book was written by Yang
Huizhong et al. and published by Shanghai Foreign Language Education
Press in 2002.
(6) An Introduction to Chinese Learner English Corpus (CLEC). This book was
written by Yang Huizhong, et al and published by Shanghai Foreign
Language Education Press in 2003.
(7) Corpus Linguistics. This book was written by Huang Changning and Li Juanzi
and published by the Commercial Press in 2002.
(8) An Introduction to the Corpus for Middle School English Education (MSEE).
This book was written by He Anping and published by Guangdong
Electronic Press in 1999.
2.4.2 Classifications of corpus-based research papers
As mentioned above, about 140 corpus-based research papers have appeared so far
(the count is by no means exhaustive). They may be classified as follows. (1) In
terms of the language of the corpora: papers on English corpora and bilingual
corpora account for about 70% of the total, and the remaining 30% deal with
Chinese corpora. (2) In terms of research orientation and scope: papers
introducing English and Chinese corpora and book reviews on corpus linguistics
account for about 20% and mainly date from the period 1985-1995; examples
include Wang Jianxin (1996) and Pan Yongliang (2000). Papers on theoretical
research in corpus linguistics account for about 30% and mainly date from the
period 1996-2000. Papers on the applications of corpora account for about 40%
and mainly date from the period 1998-2003. They may be further classified into
the following categories: (a) papers on the application of corpora to
translation studies, about 5%, such as Wang Kefei (2002a, b), Liao Qiyi (2000)
and Zhang Meifang (2002); (b) papers on the application of corpora to language
teaching, about 20%; (c) papers on the application of corpora to contrastive
linguistic analysis, about 5%; (d) papers on the application of corpora to
lexicography, about 5%; (e) papers on the application of corpora to lexical
studies, about 5%. Papers on techniques for corpus annotation and concordancing
account for the remaining 10%.
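The proportions above can be checked with a short calculation. The following Python sketch is only an illustration: it treats "about 140" as exactly 140 and the rounded percentages as exact, so the resulting counts are rough estimates rather than figures reported in this survey. The category labels and variable names are chosen for the example.

```python
# Shares by research orientation, in percent, as given in the text.
# ASSUMPTION: "about 140" papers is treated as exactly 140.
TOTAL_PAPERS = 140

by_orientation = {
    "introductions and book reviews": 20,
    "theoretical research": 30,
    "applications of corpora": 40,
    "annotation and concordancing techniques": 10,
}

# Breakdown of the "applications of corpora" share, in percent of all papers.
applications = {
    "translation studies": 5,
    "language teaching": 20,
    "contrastive linguistic analysis": 5,
    "lexicography": 5,
    "lexical studies": 5,
}


def approx_counts(shares, total=TOTAL_PAPERS):
    """Convert percentage shares into rounded, approximate paper counts."""
    return {name: round(total * pct / 100) for name, pct in shares.items()}


orientation_total = sum(by_orientation.values())
applications_total = sum(applications.values())

print("orientation shares sum to", orientation_total, "percent")
print("application sub-shares sum to", applications_total, "percent")
for name, count in approx_counts(by_orientation).items():
    print(f"{name}: roughly {count} papers")
```

The four orientation shares add up to 100%, and the five application sub-categories add up to the 40% given for applications as a whole, so the classification is internally consistent.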
2.4.3 Classification of these academic journals
The academic journals that publish papers on corpus-based research in China
may be classified into three types: journals of foreign languages and linguistics,
journals of Chinese language and linguistics, and journals of information and
computer technology. The major ones are: (a) Contemporary
Linguistics (Beijing); (b) Foreign Language Teaching and Research (Beijing); (c)
Journal of Foreign Languages (Shanghai); (d) Modern Foreign Languages
(Guangzhou); (e) Foreign Languages and Their Teaching (Dalian); (f) Journal of
PLA Foreign Studies University (Luoyang); (g) Foreign Language Teaching
(Xi’an); (h) Journal of Languages and Writings (Beijing); (i) Journal of Chinese
Information Processing (Beijing); (j) Applied Linguistics (Beijing); and (k)
Journal of Computer Science (Beijing).
3.
Prospect of China’s corpus-based research
In this section, we focus on trends in the development of China’s corpus-based
research, corpus annotation problems, corpus processing tools and how to apply
corpora to language teaching and translation studies.
3.1
The developing trend of China’s corpus-based research
At present, the development of spoken Chinese corpora in China lags far behind
that of written Chinese corpora. Worse still, no spoken English corpus has been
built, or is even being built, in China. We should therefore speed up the
building of spoken corpora to restore the balance with written corpora.
Meanwhile, it has been suggested that more small corpora be built for specific
research purposes and more bilingual and parallel corpora be built for language
research, translation studies, contrastive analysis and dictionary-making.
3.2
Corpus annotation problems and corpus processing tools
Technology for tagging and parsing should be standardised and further
developed. Annotation at the lexical, phonetic, syntactic and phonological levels
is far more advanced than annotation at the semantic and pragmatic levels, so
more attention should be paid to the latter. Although more and more automatic
taggers and parsers are becoming available, supply still cannot meet demand, and
most of these corpus processing tools are suitable only for specific corpora.
Technology for alignment is not currently satisfactory. Up to now, we have not
developed our own tools to process bilingual corpora. The challenge for corpus
analysis tools is to systematise the design of corpora and concordancers so that
any concordancer can work on any corpus.
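The idea of a concordancer that does not depend on one particular corpus format can be illustrated with a minimal key-word-in-context (KWIC) sketch in Python. It is an illustration only, not one of the tools surveyed here: it assumes the corpus has already been reduced to plain text, and the function names `tokenize` and `kwic` and the sample sentence are invented for the example.

```python
import re


def tokenize(text):
    """Split plain text into lowercase word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"\w+", text.lower())


def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for every hit of keyword."""
    keyword = keyword.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


# Invented sample text; any corpus reduced to plain text could be used instead.
sample = "The corpus is small. A small corpus can still serve a specific research purpose."
lines = kwic(tokenize(sample), "corpus", width=2)
for left, key, right in lines:
    print(f"{left:>20} [{key}] {right}")
```

Because `kwic` only expects a list of tokens, the corpus-specific work is confined to the tokenizer. This is one simple way to separate corpus design from concordancer design; real annotated corpora would need a richer common interface.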
3.3
How to apply the corpora to language teaching and translation studies
There is great potential for these corpora to enrich China’s corpus-based research.
Here we focus only on how to apply the corpora to language teaching and
translation studies.
3.3.1 Corpora and language teaching
Corpus linguistics has a double role in language teaching, entailing both a
methodological innovation and a theoretical one, because together they
account for a new way of teaching (Tognini-Bonelli, 2001: 14). The development
of corpora has the potential for two major effects on the professional life of the
language teacher. First, corpora lead to new descriptions of a language, so that the
content of what the language teacher is teaching is perceived to change in radical
ways (Sinclair, 1991: 100; Stubbs, 1996: 231-232). Secondly, corpora themselves
can be exploited to produce language teaching materials, and can form the basis
for new approaches to syllabus design and to methodology (Hunston, 2002: 137).
In fact, we can apply corpora to teaching nearly all the courses for language
majors, especially comparative and contrastive studies, vocabulary lessons and
writing lessons.
3.3.2 Corpora and translation studies
Translation is an increasingly important application of corpora. Research into
corpora and translation tends to focus on two areas: practical and theoretical. In
practical terms, the question is: What software can be developed that will enable a
translator to exploit corpora as an aid in the day-to-day business of translation? In
theoretical terms, the question is: What does a corpus consisting of translated
texts indicate about the process of translation itself? Because corpora can be used
to raise awareness about language in general, they are extremely useful in training
translators and in pointing up potential problems for translation. Not only can
corpora provide evidence of which translations of a given word or phrase are
possible, they also provide an insight into the process and the nature of
translation itself (Hunston, 2002: 123-128). Our bilingual parallel corpora can
therefore serve as a resource and a practical tool for analysing translations and
translators’ style, and can support translation training, teaching and the writing
of translation textbooks.
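The practical question of what software would let a translator exploit a parallel corpus can be made concrete with a very small bilingual concordance sketch in Python. It assumes that sentence alignment has already been done and that the corpus is available as (English, Chinese) sentence pairs. The function name `parallel_concordance` and the three sentence pairs are invented for the example and are not taken from any corpus described in this chapter.

```python
def parallel_concordance(pairs, term):
    """Return the aligned (source, target) pairs whose source side contains term."""
    term = term.lower()
    return [(src, tgt) for src, tgt in pairs if term in src.lower()]


# Invented, hand-aligned sample pairs (English source, Chinese target).
aligned = [
    ("The corpus is aligned at sentence level.", "语料库在句子层面对齐。"),
    ("Translators consult the dictionary.", "译者查阅词典。"),
    ("A parallel corpus helps translator training.", "平行语料库有助于译员培训。"),
]

hits = parallel_concordance(aligned, "corpus")
for src, tgt in hits:
    print(src)
    print("   ", tgt)
```

A search for "corpus" returns the first and third pairs, so the translator sees each English occurrence next to its Chinese rendering. Matching here is a plain substring test; a usable tool would add tokenisation, searching on the Chinese side, and alignment below sentence level.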
4.
Conclusion
In this chapter, a general survey of China’s corpus-based research has been
conducted from four viewpoints, and the prospects for this research have
been considered from three viewpoints. It is clear that corpus linguistics is
playing an increasingly important role in China’s language research. In the
future, a national association of corpus linguistics studies will be created and a
national academic journal of corpus linguistics will come into being. In the end,
our motto should never be forgotten: “We are corpus-based, but not corpus-bound.”
Notes
1
This paper has been polished by my supervisor, Dr. Wang Kefei, Professor
of the National Research Centre for Foreign Language Education, Beijing
Foreign Studies University. I am very grateful to him and to Prof. Wang
Jianxin who offered me some information and data on corpora
compilation.
2
Here “Hua” refers to Tsinghua University; “Yu” refers to Beijing
Language and Culture University.
References
Biber, D., S. Conrad and R. Reppen (1998), Corpus Linguistics. Cambridge:
He, Anping. (1999), Introduction to the corpus for middle school English
Education (MSEE). Guangzhou: Guangdong Electronic Press.
Huang, Changning and Li Juanzi. (2002), Corpus Linguistics. Beijing:
Commercial Press.
Hunston, S. (2002), Corpora in Applied Linguistics. Cambridge: Cambridge
University Press.
Kennedy, G. (1998), An Introduction to Corpus Linguistics, London: Longman.
Liao, Qiyi. (2000), ‘Corpora and translation studies’, Foreign Language Teaching
and Research, 5: 380-384.
Pan, Yongliang. (2000), ‘Introduction to Corpus Linguistics (Biber, et al 1998)’,
Foreign Language Teaching and Research, 5: 389-392.
Sinclair, J.M. (1991), Corpus, Concordance, Collocation. Oxford: OUP.
Stubbs, M. (1996), Text and Corpus Analysis, Oxford: Blackwell.
Thomas, J. and M. Short. (1996), Using Corpora for Language Research,
London: Longman.
Tognini-Bonelli, E. (2001), Corpus Linguistics at Work. Amsterdam/
Philadelphia: John Benjamins Publishing Company.
Wang, Jianxin. (1996), ‘Introduction to three contemporary English corpora’,
Foreign Language Teaching and Research 3:37-40.
Wang, Jianxin. (1998), ‘Important stages in the development of corpus
linguistics’, Foreign Language Teaching and Research 4:52-57.
Wang, Jianxin. (1999), ‘Some development of researches on corpus linguistics in
China’, Foreign Languages and Their Teaching 3: 18-20.
Wang, Jianxin. (2001), ‘Recent progress in corpus linguistics in China’,
International Journal of Corpus Linguistics, Volume 6:
Wang, Kefei. (2002a), ‘Parallel corpus-based approach to translation studies’,
Foreign Languages and Their Teaching, 9: 35-39.
Wang, Kefei. (2002b), ‘Corpus and network for translator and interpreter
training’, Foreign Language Teaching and Research, 3:231-232.
Wang, Lidi and Wang Jianxin. (2001), ‘Implementation plan for a 30-million-word
Chinese/English parallel corpus’, Presentation at the 3rd International
Symposium on EFL in China. May, Beijing.
Xu, Yiping and Cao Dafeng (eds.). (2002), The development and application of
a Chinese/Japanese parallel corpus, Beijing: Foreign Language Teaching
and Research Press.
Yang, Huizhong. (2002), An Introduction to Corpus Linguistics, Shanghai:
Shanghai Foreign Language Education Press.
Yang, Xiaojun and Li Saihong. (2003), ‘Advantages of Corpora in
Lexicography: Review on OALD (6th edition)’, Foreign Languages and
Their Teaching, 4: 47-51.
Zhang, Meifang. (2002), ‘Exploiting corpora to analyze the stylistic features of
the translators’, Journal of PLA Foreign Studies University 3: 10-14.