Corpus Linguistics - Institut für deutsche Sprache und Linguistik


Corpus Linguistics
DGfS & GLOW Summer School
Micro- and Macrovariation
Stuttgart, 2006
Anke Lüdeling
anke.luedeling@rz.hu-berlin.de
Anke Lüdeling, DGfS & GLOW Summer School, Stuttgart, Aug 2006

plan
z motivation: corpora and foreign-language teaching
z learner corpora – introduction
z design of learner corpora
{ general issues
{ Falko
z contrastive interlanguage analysis …
plan
z … contrastive interlanguage analysis
z annotation
{ error annotation
z introduction: what are errors, how can error tagsets be developed?
z target hypotheses
z discussion

plan
z architecture
{ introduction: flat annotation vs. multi-level annotation
{ exercise: annotation of learner data using the developed tagsets
z (if we have the time) an exemplary study: complex verbs in Falko
recapitulation: corpora
z many definitions of ‚corpus‘, here:
z „Corpus: A collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language.“
http://www.ilc.cnr.it/EAGLES96/corpintr/corpintr.html
z the kind of data one needs depends on the research question; for some research questions corpus data is helpful
z corpora
{ controlled data
{ frequency data
{ context data
z reproducibility
z qualitative analysis – categorization (header information, token, part-of-speech, lemma, sense, syntactic structure etc.)
¾ dependent on linguistic analysis
z quantitative analysis
{ within a corpus: comparison of categories (words, lemmas, constructions etc.)
{ between corpora: comparison of categories
¾ dependent on qualitative analysis
¾ descriptive statistics vs. inferential statistics

Motivation:
Corpora in Language Teaching
corpora in language teaching
z corpora in language teaching
{ L1 corpora
{ translation corpora
{ learner corpora
¾ many studies, see e.g. Sinclair et al. 1991, Wichmann et al. 1997, Granger et al. 2002, Nesselhauf 2005, Römer 2006

L1 corpora
z to improve teaching materials
{ 'natural' examples in text books and learner dictionaries
{ frequency information to improve word lists/collocation lists
{ grammar
{ teaching strategies (for given grammar problems, Inquiry Based Education)
L1 corpora in language teaching
z following Römer (2006a): direct approach vs. indirect approach
z direct approach
data-driven learning (DDL), discovery learning, corpora in the classroom
“confront the learner as directly as possible with the data, and to make the learner a linguistic researcher” (Johns 2002, 108), see also Bernardini (2002)
→ teachers and students
• ‚free‘ concordances
• controlled exercises
z indirect approach: corpus studies to improve teaching materials
grammars (COBUILD), specific topics (Granger 1999, Nesselhauf 2005, Römer 2005 and many others)
→ researchers, textbook writers

example the vs. zero article
z ‚language‘, example from http://www.eisu.bham.ac.uk/johnstf/def_art.htm
z In Gwynedd, a bedrock of the Welsh language, there are 25 film-making companies.
z We must accept that the salvation of the French language involves learning one or
more of the languages in neighbouring countries.
z The research also showed increases in the frequency of bad language and sex on
television.
z Inspectors said behaviour was generally good, but features "such as free use of
colloquial language and non-attendance at lessons are tolerated much more than in
conventional schools".
z 1. proud of their command of _____ English language and engage in quite of lot of
patting them
z 2. but it does not mean that _____ everyday language is bad: it is simply the way of
things tha
z 3. cluded that cerebral dominance for _____ language is established before the age
of five. Dur
z 4. abulary is one thing and _____ technical language is another, Vocabulary is words,
lists of
z 5. avic-speakers. Orthodoxy and _____ Greek language remain the two markers of
modern Greek ide
example: looking (Römer 2006b, 236)
z one type of study: comparison between ‚real‘ L1
and ‚school‘ L1
z example:
progressive in English (Römer 2005, 2006b),
comparison of spoken parts of BNC and BoE
and two textbook corpora containing ‚spoken‘
language
z progressives are difficult for German learners of
English
{because German does not contain a progressive
{because textbooks do not represent progressives as
they are in ‚real‘ English?
example: looking (Römer 2006b)
example: looking (Römer 2006b, 238)
z the progressive can stand with single actions and with repeated actions (examples from Römer 2006b, 237)
z Well we’re really looking for a vegetarian one aren’t we now (BoE_brspok)
z Yes. I remember that from when we were looking at houses # down there (BoE_brspok)
z consequences of these findings: as yet undecided – studies necessary
{ perhaps none: it might be the case that textbooks use these distributions for a reason (unlikely in advanced studies)
{ use ‚real‘ examples instead of made-up examples
z corpus comparison as a necessary base for further studies

L1 corpora / translation corpora
z „However, despite the progress that has unquestionably been made in the past two or three decades, I would still be hesitant to say that corpora have after all fully ‘arrived’ on the pedagogical landscape.“ (Römer 2006a, 121)
z but: TALC, workshops at other conferences (CL 2005, DGfS 2006, …)
z best resources/studies/corpora/networks for English
z other languages follow very slowly
Learner Corpora:
Introduction
Design
Falko

learner corpora
z "The area of linguistic enquiry known as learner corpus research [...] has created an important link between the two previously disparate fields of corpus linguistics and foreign/second language research. Using the main principles, tools and methods from corpus linguistics, it aims to provide improved descriptions of learner language which can be used for a wide range of purposes in foreign/second language acquisition research and also to improve foreign language teaching." (Granger 2002, 4)
learner data
z introspection – not available
z error collections – mainly useless
z learner corpora
z experimental data

aside: experimental data
z for many research questions one needs experimental data in addition to/instead of corpus data
z learner corpus data is production data (comprehension data only in indirect ways)
z production experiments (c-tests, fill-in tests, …)
z conceptualization experiments (eye-tracking, reaction time etc., Carroll, Stutterheim, Nuese 2004)
goals of learner corpus research
z descriptively: find and classify errors, find patterns in learner language
z theoretically: find out about learner's hypotheses (interlanguage)
z improve teaching material

learner corpora
z controlled collections of learner language
z “Computer learner corpora are electronic collections of authentic FL/SL textual data assembled according to explicit design criteria for a particular SLA/FLT purpose. They are encoded in a standardised and homogeneous way and documented as to their origin and provenance.” (Granger 2002: 7)
learner corpora
z “There is nothing new in the idea of collecting learner data. Both FLT and SLA researchers have been collecting learner output for descriptive and/or theory-building purposes since the disciplines emerged. In view of this, it is justified to ask what added value, if any, can be gained from using learner corpus data.” (Granger 2004: 123f.)

learner corpora: design
z start from a research question; depending on that choose the design criteria
{ number of subjects
{ type of exercise
z essay
z summary
z spoken
z …
{ level (variety) of the learner
z beginner
z basic
z post-basic
z advanced
z …
corpus design (following Granger 2002)
monolingual (one L1) vs. multilingual (several L1s)
general vs. special
synchronic (cross-section) vs. diachronic (longitudinal)
written (text corpus) vs. spoken (speech corpus)

learner corpus examples: ICLE
z ICLE (International Corpus of Learner English)
{ initiative from Louvain-la-Neuve (Sylviane Granger)
{ goal: comparable corpora with different L1s
{ collected in several countries (in April 2006: 19 partners; German: Uni Augsburg)
{ written: essays, advanced learners
{ error tag set (but published version not error tagged)
{ comparable corpus of native speaker essays
¾ large impact, well-known, many publications
learner corpus examples: MAELC
z Multimedia Adult English Learner Corpus
{ Portland (Oregon), Initiative ESOL
{ longitudinal: language classes over four semesters are recorded (video, audio) and partly transcribed
{ several tasks (interactive, spoken)
{ software for transcription, search etc.
{ not really used for linguistic research – perhaps due to unclear design criteria: corpus?

learner corpus examples: ISLE
z ISLE (Interactive Spoken Language Education) (Atwell, Howarth & Souter 2003)
{ international speech technology project (Uni Leeds, Uni Hamburg, Klett, ...)
{ primary goal: non-native speech models for speech recognition
{ secondary goal: description of typical pronunciation errors
{ spoken English by German and Italian learners, middle level, read text (1300 words/speaker), additionally sentences with problematic features (I said bed not bad)
{ phonological annotation
learner corpus examples: LeaP
z Learning Prosody (Milde & Gut 2002)
{ aim: study of phonological features in (very) advanced learners of German
{ spoken language, different L1s, ca. 400 recordings, 4 different speaking styles
{ phonetically and phonologically annotated
{ not very well known (yet), but available, interesting linguistic studies

learner corpora for DaF/DaZ (German as a foreign/second language)
z probably: private collections of learner data everywhere
z but almost no publicly available, well-designed, error-tagged corpora for learner German
{ Belz (2004) – written, not error-tagged (?), not available
{ LeaP – available, spoken
{ Weinberger (2002) – written, error-tagged, not available (perhaps soon)
{ ESF corpora – spoken, spontaneous, available
{ ???
Falko
z Falko stands for fehlerannotiertes Lernerkorpus (error-annotated learner corpus), built at the Humboldt University of Berlin and the Free University of Berlin with help from others
{ relatively new, totally underfunded ;-)
{ written
{ advanced learners
{ error-tagged
{ freely available
{ still small but growing

Falko – core corpus
z summaries of scientific texts about linguistics and literature
z acquired at ‚Zwischenprüfungslevel‘ (after about two years of study) at the Free University
z advanced learners (DSH-Prüfung plus studies in Germany)
z highly controlled, metadata
z at the moment 37,000 tokens, growing
z automatically tagged for lemma & pos
z manually tagged for target hypothesis, agreement, word order, definiteness (not yet totally available)
z control corpus, German native speakers (13,000 tokens)
Falko – extension corpora
z summaries (same procedure as core corpus) from Danish students
z longitudinal data from Georgetown University, several tasks
z essays by students at Humboldt University

Falko – essays
z at the moment we collect essays: here you could help
z fixed topics
z obligatory c-test to assess level of proficiency
z questionnaire for meta-data
z detailed instructions available

Falko – data preprocessing
digitization
z handwritten data
z advantages:
{ Bill Gates doesn‘t help in correcting the data
{ no typos
z problems
{ digitization is time-consuming
{ the digitizer has to make decisions

digitization
[figures not reproduced; recoverable labels: anonymized file number, problem]

digitized text (text 46)
Zusammenfassung. Das europäische Kunstmärchen In diesem
Text geht es um den Begriff des Kunstmärchens. Volker Klotz
wirft zuerst das Problem der Bezeichnung "Kunstmärchen" vor.
Diese Bezeichnung könne zu Fehleinschätzungen verleiten,
denn sie scheine gegenständige Bedeutungen zu enthalten.
Zunächst beschäftigt sich Klotz mit dem Thema, wie man unter
Kunstmärchen verstehen kann. Er behauptet, dass unter
Kunstmärchen man offenbar eine Gruppe poetischer Gebilde
verstehe, die einen Abstand von anderen literarischen Gruppen
halte. Klotz unterscheidet die anderen Gattungen wie Novelle,
Ode und Komödie mit Kunstmärchen. Mit der Vorzeichnung
"Kunst" definieren jene literarischen Gattungen ihre Arte. Sie
werden auch ergänzt und schränkt. Das heisse, die
Vorzeichnung entscheidet die Art der Gattungen. Auf der
anderen Seite hat Kunstmärchen ein zweispältiges Verhältnis in
seinem Wort. […]
summary: learner corpora design
z corpus design starts with a research question
z it is important that the design criteria and the acquisition process are open to the user
z ethical/legal issues: always difficult – get written consent before publication (observer‘s paradox)

Analysis of Learner Corpora:
Contrastive Interlanguage Analysis
learner corpora – what can you do with them
z two main types of studies
{ error analysis (EA) – qualitative and quantitative
{ contrastive interlanguage analysis (CIA) – qualitative and quantitative

quantitative analysis
z descriptive statistics: one describes the data at hand
{ ‚simple‘ counts
{ collocation analyses
{ multivariate analyses
{ …
z inferential statistics - modelling: one uses a description of the data at hand to say something about a larger population
{ extrapolation
{ machine learning
{ …
z know your data – comparability, ways of counting
z different distributions
z Baayen (2001), Biber, Conrad & Reppen (2000), Manning & Schütze (1999), Moisl (to appear), Oakes (1998), …
aside: the ‚r-word‘
z frequent question: when is a learner corpus representative?
{ wrong question!
{ as one does not know the basic population one cannot construct a representative corpus (this is true for all corpora that do not represent a closed variety)
z when is a corpus big enough?
{ this depends on the research question (Falko is still too small for many questions)

aside: the r-word
z the fact that a corpus is not representative does not mean that one cannot extrapolate (mathematically you can)
z but: be very careful in the interpretation of the data!
CIA
z goal: find statistical differences (called overuse or underuse) between two varieties
{ either NS vs. NNS
{ or NNS with two different L1s
{ or NNS with the same L1 but at different points in time
{ …
z problem: find comparable varieties

CIA
z comparison of
{ lexis
{ syntax
{ register
{ errors
{ pronunciation
{ …
z how can you compare?
{ count surface properties
{ count categories
{ use statistical techniques
CIA example: lexis
z do learners use more/fewer words than L1 speakers?
z or different ones?
z or the same ones in a different distribution?
z do learners use longer/shorter words than L1 speakers?
z do learners use more/fewer learned words than L1 speakers?
z are there differences in the distribution of part-of-speech types?
z …

counting
Beim Prinzip der Konventionalität gibt es für jede
Bedeutung eine Form , von der die Menschen
eine Vorstellung haben , dass sie auch
gebraucht wird . Im Gegensatz dazu beinhaltet
das Kontrastprinzip , dass davon ausgegangen
wird , wenn es Unterschiede bezüglich der Form
von Wörtern gibt auch unterschiedliche
Wortbedeutungen existieren . Im frühen
Kindesalter begreifen viele nicht , dass zwei von
der Form verschiedene Wörter trotzdem das
Gleiche bedeuten können
counting
z 3 tokens of the word form type ‚der‘
z prerequisite: tokenization

counting
z 2 tokens of the lemma type ‚Wort‘
z prerequisite: lemmatization

counting
z 16 tokens of the part-of-speech type ‚NN‘
z prerequisite: pos-tagging

counting
z in each counting step one relies on classifications and decisions that have been taken
z this can be problematic because automatic annotation tools (lemmatizers, pos-taggers) are typically trained on newspaper data and not on learner data, which has orthographic errors and word order errors
text 033 (L2)
Der Realismus ist eine überragende geistige und künstlerische Tendenz des 19. Jahrhunderts. Er erstreckt sich als
international weit ausgreifende Epochenströmung bis gegen Ende des Jahrhunderts. In der Literatur tritt er unter
verschiedenen Namen auf. Die Literatur- und Kunstgeschichte einigte sich erst im nachhinein über die Grenzen des
Realismus. Anfangs haben sich die Naturalisten hauptsächlich als Realisten verstanden. Hieraus erkennt man, dass die
genauere Bezeichnung des Realismus nach dem damaligen Sprachgebrauch noch nicht vorhanden war. In der heutigen
Bezeichnung ist sie eine literaturund kunstgeschichtliche Richtung, welche in Poetischer und Bürgerlicher Realismus
eingeteilt wurde, um sie von der Naturalismus Variante unterscheiden zu können. Diese beiden Bezeichnungen werden
vielfach synonym gebraucht. Bei einer stärkeren Unterscheidung würde man den Bürgerlichen Realismus mit der
Gründungsphase der realistischen Bewegung um die Jahrhundertmitte identifizieren, welche auch als Programmatischer
Realismus bezeichnet wird. In den fünfziger Jahren, hat es in Deutschland eine lebhafte literaturtheoretische Debatte um die
Zielsetzung und das Wesen ein realistischer Literatur gegeben. Eines ihrer Zentren hat es, in der von Gustav Freytag und
Julian Schmidt herausgegebenen Zeitschrift "Die Grenzboten". Der populäre Roman "Soll und Haben" von Gustav Freytag gilt
als ein Exempel für die gehegten nationalpädagogischen Erwartungen an einen dezidiert bürgerlichen Realismus. Das
bedeutendste poetolgische Manifest des deutschen Realismus ist Fontanes Aufsatz "Unsere lyrische und epische Poesie seit
1848" (1853). Hier wird das Ideal einer umfassenden Wirklichkeits-Repräsentanz formuliert. Die deutschen Realisten wollten
auf keinen Fall als Kopisten der Wirklichkeit verstanden werden. Durch die Erfindung der Daguerrotypie (Photographie) rückte
die Mimesis der Realität sehr nahe. Somit konnte die herrschende Ästhetik mit Hilfe der Photographie, sich von der
herkömmlichen Kunst distanzieren. Rudolf Gottschall kritisiert an Beispielen aus dem Romanschaffen von Daudet und Zola
die direkte Beziehung der Literatur auf die zeitgenössische Wirklichkeit. Er sieht es als unzulässige Überschreitung der
Grenze zwischen Kunst und Realität. Gleichzeitig kritisiert er auch Fontane, Zolas und Alexander Kiellands "Reportertum". Er
erklärt, dass er an dem exakten Bericht einen ungeheuren Literaturfortschritt erkennt, welches uns auf einen Schlag vom öden
Geschwätz zurückliegender Jahrzente befreit. Weiter stellt er fest, dass sich "Meisterstücke und Berichterstattung" erst dann
zur Höhe des Kunstwerks erheben <>. Damit greift er auf eine Norm der klassischen Ästhetik zurück, dem "Idealrealismus".
Das idealistische Ressentiment ist eines der Gründe für das Zurückbleiben des deutschsprachigen Realismus gegenüber der
Schonungslosigkeit der Gesellschaftskritik eines Dickens oder Balzac und der unbestechlichen Psychologie Flauberts. Der
deutsche Kritiker Emil Homberger kritisiert Flaubert. Für ihn, zergliedert Flaubert in seinen Erzählungen das Seelenleben
seiner Figuren leidenschaftslos. So wie Fontane vermisst auch er die belebende Seele. Er versucht sogar, den französischen
Autor einen Irrtum in der Auffassung von "Objektivität" nachweisen zu können. Denn wer objektiv sein wolle, dürfe sich nicht
einseitig an der Dokumentation einzelner Faktoren berauschen.
z an additional problem in the Falko corpus: copied text – students tend to copy words/sequences of words from the original
z plagiarism detection tool WCopyfind 2.6 (finds only complete copies)
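Copied sequences can be approximated by looking for word n-grams that a summary shares with its source text. The sketch below is not WCopyfind and does not reproduce its behaviour; it assumes whitespace-tokenized text, a freely chosen n-gram length `n`, and two made-up English strings in place of Falko data.

```python
def ngrams(tokens, n):
    """Return the set of all word n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def copied_ngrams(summary, original, n=3):
    """Return the n-grams of the summary that also occur in the original."""
    return ngrams(summary.split(), n) & ngrams(original.split(), n)


# hypothetical toy strings, not taken from the corpus
original = "the quick brown fox jumps over the lazy dog"
summary = "a quick brown fox jumps happily"

shared = copied_ngrams(summary, original, n=3)
print(sorted(shared))
```

A larger `n` only flags longer copied stretches; whether such stretches should be excluded from counts remains a decision for the researcher.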
text 033 – section
z Der Realismus ist eine überragende geistige
und künstlerische Tendenz des 19.
Jahrhunderts. Er erstreckt sich als international
weit ausgreifende Epochenströmung bis gegen
Ende des Jahrhunderts. In der Literatur tritt er
unter verschiedenen Namen auf. Die Literatur- und Kunstgeschichte einigte sich erst im
nachhinein über die Grenzen des Realismus.
Anfangs haben sich die Naturalisten
hauptsächlich als Realisten verstanden. Hieraus
erkennt man, dass die genauere Bezeichnung
des Realismus nach dem damaligen
Sprachgebrauch noch nicht vorhanden war.

text d015 (L1)
Stile & Richtungen: 1. Realismus
Der vorliegende Artikel beschäftigt sich mit der Begriffsbestimmung des literarischen 'Realismus'.
Dieser wird vorerst in der ersten Hälfte d. 20. Jahrhunderts angesetzt und als Strömung
charakterisiert, die sich von der klassisch-romantischen Kunstauffassung abzusetzen versucht. Die
eigentliche Begriffsbestimmung stellt sich allerdings zur Zeit des Realismus anders dar als heutige
Ansätze, insofern dass die Abgrenzung zum Naturalismus noch nicht so stark gezogen wird. Es wir im
Weiteren zwischen Poetischem und Bürgerlichen Realismus unterschieden, woran wiederum die
Definition des Programmatischen Realismus geknüpft ist. Wichtig für die Begriffsabgrenzung sind die
literaturtheoretischen Debatten um den Realismus, die in den 1850er Jahren von Autoren wie Gustav
Freytag, Julian Schmidt aber auch Theodor Fontane - als Hauptvertreter des Realismus -geführt
werden. Diese sind eng an die Ästhetikdebatte und deren Anknüpfung an die idealist. Tradition
geknüpft.Die Unterscheidung zwischen Bloß-Wirklichem & Wahren ist für diese Debatte von
außerordentlicher Wichtigkeit und prägt die heutige Begriffsbestimmung und die Abgrenzung des
Realismus gegen den Naturalismus. Zwar stellt die Dokumentation der Wirklichkeit und der exakte
Bericht ein wichtiges Charakteristikum des Realismus dar, diese dürfen aber nach Meinung Fontanes
ästhetische Prinzipien nicht außen vor lassen, da es sich sonst nicht um Kunst handele. Kunst müsse
von fotografischer Dokumentation der Wirklichkeit abgesetzt werden. In diesem Zusammenhang stellt
sich auch die Frage nach der Auffassung von Objektivität. Homberger ist der Meinung, dass diese
immer auch von der Wirkung auf den Leser abhängig ist, und dass sich die Wirklichkeit insgesamt
vielschichtiger gestaltet als in faktischer Dokumentation dargestellt werden könne. Diese insgesamt
idealistische Voreingenommenheit mag letztendlich für das Zurückbleiben des Realismus hinter dem
plakativeren Naturalismus verantwortlich sein. Vor allem aber weist die Realismusdebatte so bereits
auf Gedanken des marxistischen Kritikers Georg Lukásc voraus und seine Unterscheidung zwischen
'Erzählen' und 'Beschreiben'.
counting
z decision on how to deal with copies: can words in the copied sections be counted in the same way that ‚new‘ words are counted?

CIA example: lexis
z once you know what you want to count (here word-form types) and have the counts for the texts (here: Falko L1 and Falko L2, v 1.2) can you simply compare the two counts?
z the following stems from joint work with Marco Baroni and Stefan Evert
naive comparison 1
z direct comparison of the different word-form types
{ Falko L1: 2485 types
{ Falko L2: 4208 types
z conclusion: learners use more different words than native speakers

different corpus sizes
z Falko L1: 12758 tokens
z Falko L2 (version 1.2): 30749 tokens
z you expect more different words in larger texts
z additional effect: Falko L2 covers more different topics
z you have to normalize
naive comparison 2
z you divide the number of types by the number of tokens: type-token ratio
z then you can normalize to a given number (here 1000) (in effect you calculate the number of types per 1000 tokens)
Falko L1: 2485 / 12758 = 0.1948, i.e. 194.8 / 1000
Falko L2: 4208 / 30749 = 0.1368, i.e. 136.8 / 1000

naive comparison 2
z Falko L1: 194.8 types / 1000 tokens
z Falko L2: 136.8 types / 1000 tokens
z conclusion: native speakers use more different words than learners
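The normalization can be retraced in a few lines. This is only a sketch of the arithmetic on the slide, using the type and token counts given there; the function names are made up for illustration.

```python
def type_token_ratio(n_types, n_tokens):
    """Number of distinct word-form types divided by the number of tokens."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return n_types / n_tokens


def types_per_thousand(n_types, n_tokens):
    """Type-token ratio normalized to 1000 tokens."""
    return 1000 * type_token_ratio(n_types, n_tokens)


# (types, tokens) as reported for the two Falko subcorpora
falko = {
    "Falko L1": (2485, 12758),
    "Falko L2 (version 1.2)": (4208, 30749),
}

for name, (n_types, n_tokens) in falko.items():
    print(f"{name}: ttr = {type_token_ratio(n_types, n_tokens):.4f}, "
          f"{types_per_thousand(n_types, n_tokens):.1f} types / 1000 tokens")
```

The two naive comparisons lead to opposite conclusions, which is why the normalization itself needs a closer look.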
type-token ratio
z why is the comparison of the type-token ratios problematic?
z the direct comparison didn‘t work because we saw that the number of types grows when the corpus grows (it is not a constant but changes with corpus size)
z if one compares the ttr of corpora of different sizes one assumes that the ttr is a constant for a given corpus. is that true?

type-token ratio
Falko L1: 2485 / 12758 = 0.1948, i.e. 194.8 / 1000
Falko L2 (version 1.2): 4208 / 30749 = 0.1368, i.e. 136.8 / 1000
Falko L2 (version 1.1): 3397 / 22891 = 0.1484, i.e. 148.4 / 1000
type-token ratio
z the ttr also depends on corpus size: the larger the corpus, the lower the ttr
z Was versteht man unter der Definition Text ? Diese Frage ist nur schwer zu beantworten . Bei einem Brief oder einem Gedicht ist uns allen klar , was der Text ist .
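How the ttr changes while a text grows can be traced on the short example text. The sketch assumes that the text is already tokenized by whitespace, that punctuation marks count as tokens, and that word forms are compared case-sensitively (so 'Was' and 'was' are different types); other decisions give other numbers.

```python
def running_ttr(tokens):
    """Return the type-token ratio after each token of the sequence."""
    seen = set()
    ratios = []
    for position, token in enumerate(tokens, start=1):
        seen.add(token)
        ratios.append(len(seen) / position)
    return ratios


text = ("Was versteht man unter der Definition Text ? Diese Frage ist nur "
        "schwer zu beantworten . Bei einem Brief oder einem Gedicht ist uns "
        "allen klar , was der Text ist .")
tokens = text.split()
curve = running_ttr(tokens)

# ttr after 10 tokens, after 20 tokens and at the end of the text
for n in (10, 20, len(tokens)):
    print(n, curve[n - 1])
```

In the first two sentences every token is a new type; the repetitions (einem, ist, der, Text, the full stop) only show up later and pull the ratio down.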
type-token ratio
z if the ttr depends on corpus size one cannot simply compare the ttr of Falko L1 and Falko L2
z if one compares the ttr curves of L1 and L2 one notices that the curves are very similar (the ttr for L2 after the first 12758 tokens is 191.3 / 1000 (2441 word-form types)).

repetitions
z what is repeated?
{ function words
{ content words (?)
z what happens to the ttr if one compares only content words (here: verbs)?
‚stable‘ values
z if one compares the proportion of verbs in the learner data one sees that (after a certain amount of data) one gets a stable or constant value
differences between stable and changing values
z stable: finite number of possible values (like pos-tags)
{ after a certain amount of text one has seen all values
{ then it is possible to calculate the proportion of each value
z the addition of more text can make the calculation more precise
z distributions with these properties lead to values that can be compared between corpora of different sizes
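For a closed set of values such as pos-tags, the proportion of each value is a simple relative frequency. The sketch below uses a made-up sequence of STTS-style tags, not Falko data, and only illustrates that proportions, unlike raw counts, do not grow with the amount of text.

```python
from collections import Counter


def tag_proportions(tags):
    """Return the relative frequency of each tag in a tag sequence."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: count / total for tag, count in counts.items()}


# hypothetical toy tag sequence
sample = ["ART", "NN", "VVFIN", "ART", "NN", "APPR", "NN", "VVFIN"]

small = tag_proportions(sample)
large = tag_proportions(sample * 10)  # ten times as much 'text'

print(small["NN"], large["NN"])
```

With real data the proportions of a small and a large sample are of course not identical, but they settle around a stable value once enough text has been seen.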
differences between stable and changing values
z changing: infinitely many (or at least many more than one can encounter) values (like lemmas or word forms)
{ no matter how much text one samples, one can always find unseen word forms if one adds more text
{ some types are very frequent, many types are rare (→ LNRE distribution, Baayen 2001)
{ but: the more text one has seen the less frequently one finds new word-form types – it is not possible to calculate fixed proportions
z the addition of more text changes the results
z distributions with these properties lead to values which cannot easily be compared between corpora of different sizes

LNRE distributions
z how can values from different LNRE distributions be compared?
{ either one cuts the larger corpus down to the size of the smaller corpus (loss of information)
{ or one uses a statistical model (Baayen 2001, Evert 2004) to extrapolate the values of the smaller corpus (fitting the models can be difficult)
summary CIA
z it is crucial to know a lot about the corpora before the comparison
z qualitatively: content of the corpora, pre-processing etc.
→ decision of what to count (be explicit!)
z quantitatively: distributions, statistical methods

Error Analysis
error analysis
z what is an error?
z different ways of classifying errors
z target hypotheses
z inter-annotator agreement

what is an error?
z structural errors (breaking of a rule), can (in theory) be found algorithmically
z non-structural errors, cannot be found algorithmically
{ deviations from some kind of norm ('breaches of code', Corder 1973)
{ quantitative differences (overuse, underuse)
what is an error?
z "A linguistic form, ... which, in the same context would in all likelihood not be produced by the learner's native speaker counterparts." (Lennon 1991, 182)
z errors (competence) vs. lapses/mistakes (performance)

aside: status of ‚error‘ in SLA/FLT
z the status of ‚error‘ is controversial
z error as something that is not to be looked at
z errors as a valuable source of information about the learning process (Cherubim 1980, Corder 1981, Lennon 1991, …)
error annotation
z Ellis 1994
1. collection of samples of learner language
2. identification of errors
3. description of errors
4. explanation of errors

error tags
z classification
{ formal kind of error (insertion, deletion, ...)
{ exponent of error (word, phrase, ...)
{ hypothesis about reason (interference with L1, principle X not understood, ...)
{ linguistic level (morphology, syntax, ...)
difficult to separate definition and explanation

discussion
z starting point: theoretical ↔ descriptive
z level of granularity

target hypothesis
z target hypothesis: “reconstruction of those utterances in the target language” (Ellis 1994: 54)
die Erklärung für <MoArInGn>diese Phänomen ist einfach
the explanation for these phenomenon is simple
Mo – morphology, Ar – article, In – inflection, Gn – gender
(Weinberger 2002, 25)
problem: is there a gender or perhaps a number problem (on the noun)?
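Inline tags of this kind can be split into their components mechanically. The sketch below assumes that a tag consists of concatenated two-letter codes in angle brackets directly in front of the tagged word, as in the example from Weinberger (2002); the legend only contains the four codes glossed there, and the function name is made up.

```python
import re

# legend as glossed in the example; other codes of the tagset are not covered
TAG_LEGEND = {"Mo": "morphology", "Ar": "article", "In": "inflection", "Gn": "gender"}

# one or more two-letter codes in angle brackets, followed by the tagged word
TAG_PATTERN = re.compile(r"<((?:[A-Z][a-z])+)>(\S+)")


def decode_error_tags(text):
    """Return (tagged word, list of category names) for each inline error tag."""
    results = []
    for match in TAG_PATTERN.finditer(text):
        codes = re.findall(r"[A-Z][a-z]", match.group(1))
        labels = [TAG_LEGEND.get(code, f"unknown code {code}") for code in codes]
        results.append((match.group(2), labels))
    return results


example = "die Erklärung für <MoArInGn>diese Phänomen ist einfach"
print(decode_error_tags(example))
```

The tag bundles four decisions at once, which makes it hard to record the alternative analysis (a number problem on the noun) alongside it.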
target hypothesis
<MoNoDv>Studentensleben
student life, correct: Studentenleben
Mo – morphology, No – noun, Dv – derivation
- could also be compounding (linking element)

target hypothesis
z Der Realismus ist eine im 19. Jahrhunder, als Gegenbewegung zu klassisch-romantischen Kunstauffassung literarische Richtung, die sich bis zum Ende des Jahrhunderts international weit erstreckte. (Falko L2)
z Der Realismus ist eine literarische Richtung, die im 19. Jahrhundert als Gegenbewegung zur klassisch-romantischen Kunstauffassung gegründet wurde und sich …
z Der Realismus ist eine im 19. Jahrhundert, als Gegenbewegung zur klassisch-romantischen Kunstauffassung gegründete literarische Richtung, die sich …
z …
target hypothesis
z Der Realismus ist eine im 19. Jahrhunder, als Gegenbewegung zu klassisch-romantischen Kunstauffassung literarische Richtung, die sich bis zum Ende des Jahrhunderts international weit erstreckte. (Falko L2)
z Der Realismus ist eine literarische Richtung, die im 19. Jahrhundert als Gegenbewegung zur klassisch-romantischen Kunstauffassung gegründet wurde und sich …
¾ ORTH: Jahrhundert
¾ WORTST: Relativsatz
¾ AUSLASSUNG: gegründet wurde

target hypothesis
z Der Realismus ist eine im 19. Jahrhunder, als Gegenbewegung zu klassisch-romantischen Kunstauffassung literarische Richtung, die sich bis zum Ende des Jahrhunderts international weit erstreckte. (Falko L2)
z Der Realismus ist eine im 19. Jahrhundert, als Gegenbewegung zur klassisch-romantischen Kunstauffassung gegründete literarische Richtung, die sich …
¾ ORTH: Jahrhundert
¾ ORTH: Komma
¾ AUSLASSUNG: gegründete
discussion
summary: development of error tagsets
z every kind of error marking presupposes a target hypothesis
z inter-annotator agreement (there are statistical measures for it, such as κ)
z the development of error tags is difficult because
{ there is often more than one possible target hypothesis
{ the level of granularity of the tagset depends on the research question
z "generic tagsets" (Dagneaux et al. 1996, Izumi et al. 2005) vs. specific tagsets (Lippert 2005)
z Lüdeling (2006)

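
As a rough illustration of the κ measure mentioned above, the following sketch computes Cohen's κ for two annotators who each assign one error tag to the same items. The slide does not say which κ variant is meant, so the choice of Cohen's κ is an assumption, and the tag sequences `annotator_a` and `annotator_b` are invented for illustration.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: chance overlap given each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Invented annotations of six errors (hypothetical data).
annotator_a = ["ORTH", "ORTH", "WORTST", "AUSLASSUNG", "ORTH", "WORTST"]
annotator_b = ["ORTH", "WORTST", "WORTST", "AUSLASSUNG", "ORTH", "ORTH"]

print(round(cohen_kappa(annotator_a, annotator_b), 3))
```

The two annotators agree on four of six items, but κ is lower than that raw proportion because part of the agreement would be expected by chance.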
Architecture
flat architecture vs. multi-level architecture

annotation and corpus architecture
z general
{ flat architecture
{ multi-level architecture
z for learner corpora
{ flat architecture
{ multi-level architecture (Falko)

annotation: data model
z most corpora have a flat data structure: structural and positional annotation is attached to tokens or placed between tokens
z BNC as one example, XML-like coding
<s n=11> <w NN1>Difficulty <w VBZ>is <w VBG>being <w VVN>expressed <w PRP>with <w AT0>the <w NN1>method <w TO0>to <w VBI>be <w VVN>used <w TO0>to <w VVI>launch <w AT0>the <w NN1>scheme<c PUN>. </s>
(example from BNC, http://www.natcorp.ox.ac.uk/World/HTML/cdifbase.html)
[The same BNC example is shown again on two further slides, once with a callout marking the structural markup and once with a callout marking the part-of-speech tags.]

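
To show what "annotation connected to tokens" means in practice, here is a minimal sketch that reads the BNC example into (token, part-of-speech) pairs. The regular expression covers only the tag shapes visible in this one sentence; it is an assumption for illustration, not a parser for the BNC format, and the names `TOKEN` and `parse_flat` are invented.

```python
import re

bnc_sentence = (
    "<s n=11> <w NN1>Difficulty <w VBZ>is <w VBG>being <w VVN>expressed "
    "<w PRP>with <w AT0>the <w NN1>method <w TO0>to <w VBI>be <w VVN>used "
    "<w TO0>to <w VVI>launch <w AT0>the <w NN1>scheme<c PUN>. </s>"
)

# <w TAG> opens a word, <c TAG> opens a punctuation mark; the text runs
# up to the next "<". Structural tags such as <s n=11> are not matched.
TOKEN = re.compile(r"<([wc]) ([A-Z0-9]+)>([^<]+)")

def parse_flat(sentence):
    """Return a list of (token, pos_tag) pairs from the flat encoding."""
    return [(m.group(3).strip(), m.group(2)) for m in TOKEN.finditer(sentence)]

pairs = parse_flat(bnc_sentence)
for token, pos in pairs:
    print(f"{token}\t{pos}")
```

Each token carries exactly one tag here, which is the property that the following slides identify as a limitation of flat models.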
problems with respect to flat annotation
z smallest annotation unit: token
z conflicting hierarchies
z conflation of different kinds of information into one tag

a different model for annotation: multi-level standoff annotation
z developed for multi-modal corpora where speech signal, transcription, gestures or other modes of communication are aligned
z a 'timeline' or one level of text serves as a reference
z each mode and each annotation level can be stored in a separate file with pointers to the timeline – markup is separated from text/timeline

[Figure, built up over four slides: the token sequence "word parola . mot palavra woord ord ." is laid out on a timeline t1 … t8; "sentence 1", "line 1" and "line 2" are marked as spans over that timeline on separate annotation levels.]

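
The timeline idea can be sketched as a small data model: tokens sit on a timeline, and every annotation is a span that points to timeline positions rather than being written into the text. The span boundaries below are read off the garbled figure and are partly assumed; in particular "sentence 2" does not appear in the transcript and is added to show a conflicting hierarchy. `Span` and `crossing` are illustrative names.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass(frozen=True)
class Span:
    level: str   # annotation level, e.g. "sentence" or "line"
    label: str
    start: int   # index of the first token (inclusive)
    end: int     # index after the last token (exclusive)

# The timeline: one position per token (t1 … t8 in the figure).
tokens = ["word", "parola", ".", "mot", "palavra", "woord", "ord", "."]

# Standoff annotation: spans refer to the timeline, the text is untouched.
annotations = [
    Span("sentence", "sentence 1", 0, 3),
    Span("sentence", "sentence 2", 3, 8),  # assumed, not in the transcript
    Span("line", "line 1", 0, 4),
    Span("line", "line 2", 4, 8),
]

def crossing(a, b):
    """True if two spans overlap without one containing the other."""
    return a.start < b.start < a.end < b.end or b.start < a.start < b.end < a.end

conflicts = [(a.label, b.label) for a, b in combinations(annotations, 2) if crossing(a, b)]
print(conflicts)
```

The second sentence starts on line 1 and ends on line 2, so sentences and lines cannot be nested in a single tree; with separate levels over a shared timeline both can be stored side by side.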
flat annotation in learner corpora
z most learner corpora use flat models
z a typical error-annotated learner corpus is a corpus of advanced learners of German compiled and annotated in Lancaster (Weinberger 2002)
z error classification in four levels – conflated into one tag, exponent: token
{ level 1: formal, lexical, morphology, ...
{ level 2: pos
{ level 3: (subcategories for level 1 categories); for formal: spelling, umlaut, ß, caps/small, punctuation
{ level 4: target modification and further specification: addition, choice, case, gender, number, ...

tabular model
z die Erklärung für <MoArInGn> diese Phänomen ist
Mo – morphology, Ar – article, In – inflection, Gn – gender
z <LxPhCh> Es gibt eine veränderte Gesellschaft und
Lx – lexical, Ph – phrase, Ch – incorrect choice
(Weinberger 2002)
¾ not possible to code extension of error
¾ not possible to code competing analyses
¾ not possible to code several errors on the same token
¾ in this case: target hypothesis implicit

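
The conflation of several levels into one tag can be illustrated with a short sketch that splits a tag such as `<MoArInGn>` into its two-letter codes. The code table contains only the abbreviations glossed on these slides; the full tagset of Weinberger (2002) is not reproduced here, and `CODES`, `decode` and `tagged_tokens` are invented names.

```python
import re

# Abbreviations as glossed on the slides (not the complete tagset).
CODES = {
    "Mo": "morphology", "Lx": "lexical",
    "Ar": "article", "No": "noun", "Ph": "phrase",
    "In": "inflection", "Dv": "derivation",
    "Gn": "gender", "Ch": "incorrect choice",
}

def decode(tag):
    """Split a conflated tag into its two-letter codes and gloss each one."""
    parts = re.findall(r"[A-Z][a-z]", tag.strip("<>"))
    return [CODES.get(part, "unknown") for part in parts]

def tagged_tokens(text):
    """Return (tag, token) pairs: each tag is attached to the token after it."""
    return re.findall(r"<([A-Za-z]+)>\s*(\S+)", text)

example = "die Erklärung für <MoArInGn>diese Phänomen ist einfach"
for tag, token in tagged_tokens(example):
    print(token, decode(tag))
```

The tag is bound to exactly one token and only names the error; the intended correction has to be inferred, which is why the target hypothesis stays implicit in this model.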
tree models
z I belong to two baseball <n_num crr="teams">team</n_num>
(Izumi, Uchimoto & Isahara 2005)
¾ extension of error can be coded
¾ not possible to code competing analyses
¾ problems with conflicting hierarchies
¾ in this case: target hypothesis explicit

problems
z definition and explanation of error are conflated into one tag
z error tag is attached to one word – often it is difficult to decide where an error lies
z difficult to infer the target hypothesis
z difficult to add tags or change tags

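
In the tree model example from Izumi, Uchimoto & Isahara (2005) the target hypothesis is explicit because the correction is stored in the `crr` attribute. The sketch below reads both the learner version and the corrected version from that one annotated string. The enclosing `<s>` element is an assumption added so that the fragment parses as XML, and the function handles only non-nested error tags.

```python
import xml.etree.ElementTree as ET

# The slide's example, wrapped in a hypothetical <s> element.
learner_xml = '<s>I belong to two baseball <n_num crr="teams">team</n_num></s>'

def read_versions(xml_string):
    """Return (learner text, target text, list of (tag, original, correction))."""
    root = ET.fromstring(xml_string)
    learner, target, errors = [root.text or ""], [root.text or ""], []
    for element in root:
        original = element.text or ""
        correction = element.get("crr", original)
        learner.append(original)
        target.append(correction)
        errors.append((element.tag, original, correction))
        # Text following the closing tag belongs to both versions.
        learner.append(element.tail or "")
        target.append(element.tail or "")
    return "".join(learner), "".join(target), errors

learner_text, target_text, errors = read_versions(learner_xml)
print(learner_text)
print(target_text)
print(errors)
```

Because each element holds a single `crr` value, a second, competing correction for the same stretch of text has no place in this encoding.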
multi-level standoff error annotation
z text remains separate from all annotation levels
z it is possible to add as many annotation levels as desired
z a tag can span several tokens
z conflicting hierarchies can be coded

multi-level standoff error annotation: Falko
z coding and storage done by EXMARaLDA (Universität Hamburg)
z a search tool is currently under development (many searches already functioning, Siemen, Lüdeling & Müller 2006)
z explicit target hypothesis (interpretation)
z error tags coded by linguistic level (orthography, word formation, word order, agreement etc.)
z for each level three sublevels
{ identification
{ description
{ explanation

Falko - EXMARaLDA

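
A minimal sketch of how competing target hypotheses and several errors on one token might be stored side by side in a multi-level model. It does not reproduce the EXMARaLDA format or the actual Falko annotation: the layer keys, the span boundaries (especially for WORTST) and the placement of the omissions are assumptions chosen for illustration, based on the tags listed for the Realismus example.

```python
from collections import defaultdict

# Beginning of the learner sentence (Falko L2), tokenised by hand.
tokens = ["Der", "Realismus", "ist", "eine", "im", "19.", "Jahrhunder", ",",
          "als", "Gegenbewegung", "zu", "klassisch-romantischen",
          "Kunstauffassung", "literarische", "Richtung"]

# One layer per (target hypothesis, linguistic level).
# Each entry is (start, end, value) with end exclusive; a zero-width
# span (start == end) marks an omission between two tokens.
layers = defaultdict(list)
layers[("TH1", "ORTH")].append((6, 7, "Jahrhundert"))
layers[("TH1", "WORTST")].append((4, 13, "Relativsatz"))        # assumed extent
layers[("TH1", "AUSLASSUNG")].append((13, 13, "gegründet wurde"))
layers[("TH2", "ORTH")].append((6, 7, "Jahrhundert"))
layers[("TH2", "ORTH")].append((7, 8, "Komma"))
layers[("TH2", "AUSLASSUNG")].append((13, 13, "gegründete"))

def levels_on_token(hypothesis, index):
    """Return the linguistic levels with an error tag covering this token."""
    found = set()
    for (hyp, level), spans in layers.items():
        if hyp == hypothesis and any(start <= index < end for start, end, _ in spans):
            found.add(level)
    return sorted(found)

position = tokens.index("Jahrhunder")
print("TH1:", levels_on_token("TH1", position))
print("TH2:", levels_on_token("TH2", position))
```

Under the first hypothesis the token `Jahrhunder` is covered by two error tags at once, under the second by one; both analyses coexist because neither is written into the text itself.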
Falko
aside: automatic error analysis
zautomatic error analysis for CALL/ICALL
applications (e.g. Reuer 2003)
zerror corpora for automatic grammar
checking (e.g. FLAG)

summary
z error tagging problematic because of
{ several possible target hypotheses
{ the need for several tagsets for the same phenomena
{ inter-annotator agreement
z some of these problems can be dealt with in the corpus architecture

summary 2
z multi-level standoff annotation for learner corpora
{ error exponent can be more than one token
{ makes the clear separation of identification, description and explanation (a desideratum of error research) possible
{ makes the addition of different target hypotheses possible

summary 3 (the mantra)
z first: the research question
z if using a corpus, be aware of the fact that there
are interpretations/categorizations on all levels:
“Data is ontologically different from the world.
The world is as it is; data is an interpretation of it
for the purpose of scientific study. […] A text
corpus is not the linguist’s data – measurements
of such things as average sentence length are.”
(Moisl, to appear)