Experiments with Tokenization
and Part-of-speech Tagging
for German CMC Discourse
Thomas Bartz, Michael Beißwenger, Angelika Storrer
Processing and Annotating CMC discourse
Desiderata from the perspective of linguists who want to do
corpus-based analyses of language use in CMC and of linguistic variation across genres (CMC vs. other written genres):
a) annotation of phenomena specific to CMC (e.g., structure of
CMC documents; „netspeak“ elements such as emoticons,
interaction words, addressing terms)
b) part-of-speech and syntactic annotation of postings
c) automation of (a) and (b) in order to be able to analyze
large sets of data and to build large annotated CMC corpora
Desideratum from the perspective of Automatic Language
Processing / Language Technology:
Increase the accuracy of automatic linguistic processing and
annotation of web corpora through a better handling of phenomena
which typically occur in the CMC shares of these corpora.
Our goals & motivation
Evaluation of the performance of standard NLP tools (which are usually
trained on newspaper texts) in analyzing linguistic phenomena which
do not conform to the standard and which are either specific to or
frequent in CMC discourse.
→ goal: identification of problem types as a basis for a discussion of
possible solutions
→ basis: small, manually compiled data set with typical CMC
phenomena
Project background:
DeRiK – German Reference Corpus of Computer-Mediated Communication
http://www.dwds.de
http://www.chatkorpus.tu-dortmund.de
KobRA – Corpus-based Linguistic Research and Analysis Using Data Mining
http://www.kobra.tu-dortmund.de
Scientific network: Empirical Research on Internet-based Communication
http://www.empirikom.net
Experiments with tokenization and POS tagging
Manually compiled data set with postings that display selected features
typical for CMC discourse.
Source: Dortmund Chat Corpus (http://www.chatkorpus.tu-dortmund.de/) and German Wikipedia talk pages (http://de.wikipedia.org/)

Feature                                      Wikipedia talk pages   Chat      DWDS
written colloquial language I:               W1: 20                 C1: 20    20
  colloquial spellings of words
written colloquial language II:              W2: 20                 C2: 20
  contractions of verb form + pronoun
occasional or CMC-specific acronyms          W3: 20                 C3: 20
„netspeak“ elements I: emoticons             W4: 20                 C4: 20
„netspeak“ elements II: interaction words    W5: 20                 C5: 20
Postings in total:                           100                    100       (together: 200)

DWDS: canonical spellings of the words given in W1 and C1;
source: text corpora of the DWDS project, http://www.dwds.de/
Written colloquial language – e.g.: word spellings
DWDS (canonical lang.):
  Ja, Meg Ryan habe tatsächlich ...
  ... ja, der Brownie koste zwei Dollar.
  ... Prinzipiell ja, auch wenn ...
  ... ja, in ihm offenbare sich der ...
Wikipedia talk pages:
  Jo, gute Vorbereitung ist ...
  Joh, da hast Du sicher nicht ...
  Jap, geht klar!
  Jupp, aber Hinweise zu ...
Chat:
  jo, mach das mal...
  japp tom, stimmt. ...
  jepp zora, das bin ich ; )
  jau das auto fährt ...

DWDS (canonical lang.):
  Goethe-Jahr? Aber nein: eine...
  ... dann sage ich: nein.
  ... Nein, nein: Der normale ...
Wikipedia talk pages:
  Nee dann müsste ich ja ...
  Ach nee, jetze isses ...
  Nööö (Zitat Benutzer:Orientalist)...
Chat:
  nope,die 10000 gesamt sind ...
  @quaki, nee,bin ...
  nöö is er nich

DWDS (canonical lang.):
  ... Okay, okay, sie ist ein ...
Wikipedia talk pages:
  okidoki, sag Bescheid, wenn du ...
Chat:
  oki...mach`s gut

DWDS (canonical lang.):
  ..., droht jetzt der Bankrott. ...
Wikipedia talk pages:
  Ach nee, jetze isses plötzlich ...

DWDS (canonical lang.):
  Gut machst du das!, ruft ...
  ... gefroren ist, das ist schon ...
Chat:
  nö,dat ebste findeste ...
  dat ist donald duck

DWDS (canonical lang.):
  ... Die Rose verblüht ihm nicht.
Chat:
  ich mag net wissen wie ...

DWDS (canonical lang.):
  ... mit dieser Spende nichts zu tun...
  ... Bergtouren nichts anderes als ...
Wikipedia talk pages:
  ...kann man hier nischt mehr ...
Chat:
  und sagt nix, der sack

DWDS (canonical lang.):
  Darum kann ich es ...
  ..., wenn ich täglich einige...
Wikipedia talk pages:
  Isch ja gut, es hier noch ...
Chat:
  ... isch hab bestanden
  mach isch glatt :)

DWDS (canonical lang.):
  ... Mein richtiger Vater war ...
Wikipedia talk pages:
  ... mit Vadder is hier Kim Il Sung...

DWDS (canonical lang.):
  ..., aber auch mit großem Aufwand...
Chat:
  ich auch aba bei mir ...
Written colloquial language – e.g.: contracted forms
STANDARD:                                      NON-STANDARD:
verb form + pronoun (2nd pers. sg.)            verb form + reduced form of pronoun, contracted
meinst du = ‚think you‘, (do) you think        meinste : meinst + de (< du)
hast du = did you / have you                   haste : hast + de (< du)
bist du = are you                              biste : bist + de (< du)
kommst du = ‚come you‘, (do) you come          kommste : kommst + de (< du)

Examples:
na klar,wat meinste wohl wieso die hälfte da nichtanwesend war?*G*
  right, so why do ya think half of them weren’t there?*G*
was haste denn kaputt gemacht?
  what didya break?
wie alt biste und wo kommste her
  how old are ya and where’d ya come from
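
The decomposition shown above (meinste : meinst + de < du) can be sketched as a lookup-based splitting step. This is only an illustration under stated assumptions: the function name `split_contraction` and the small verb table are hypothetical, and the slides do not say whether such forms should be split before tagging or receive a category of their own.

```python
from typing import List, Optional, Tuple

# Hypothetical mini-lexicon: 2nd pers. sg. verb forms from the slide,
# paired with the STTS tag of the finite verb.
SECOND_PERSON_VERBS = {
    "meinst": "VVFIN",
    "kommst": "VVFIN",
    "hast": "VAFIN",
    "bist": "VAFIN",
}

def split_contraction(token: str) -> Optional[List[Tuple[str, str]]]:
    """Split a 'verb + de (< du)' contraction into verb and pronoun.

    Returns None when the token is not a known contraction, so that
    ordinary words ending in -ste (e.g. 'erste') are left alone.
    """
    lowered = token.lower()
    verb = lowered[:-1]
    if lowered.endswith("e") and verb in SECOND_PERSON_VERBS:
        return [(verb, SECOND_PERSON_VERBS[verb]), ("du", "PPER")]
    return None

for form in ["meinste", "haste", "biste", "kommste", "erste"]:
    print(form, "->", split_contraction(form))
```

A closed lookup is used instead of a suffix rule because many German words end in -ste without being contractions.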
Occasional or CMC-specific acronyms
Examples:
IMHO     in my humble opinion
bspw.    beispielsweise (for example)
b.t.w.   by the way
Btw.     by the way
vllt     vielleicht (maybe)
evt.     eventuell (possibly)
mE       meines Erachtens (in my opinion)
zB       zum Beispiel (e.g.)
Thx      thanks
jmd      jemand(en) (anybody / somebody)
LG       Liebe Grüße (lots of love)
POV      point of view (frequently used in Wikipedia discussions, even in the German Wikipedia)

In the test data set, occurrences of acronyms are given in context (size = one posting):

Tut mir leid, -die Bilder sind ja soweit gut, sie sollten jedoch IMO zur Abwechslung auch mal vereinzelt links stehen
http://de.wikipedia.org/wiki/Diskussion:Schw%C3%A4bische_Alb/Archiv/1

Positionen zu Umweltpolitik. Jmd fleißig genug, die zu finden und einzuarbeiten?
http://de.wikipedia.org/wiki/Diskussion:Peter_Altmaier
„Netspeak“ elements – e.g.: emoticons, interaction words
Emoticons:
:)  :O)  (:  :P  :-)  :-P  :-))  8)  :-)))  =o)  :o)  ^^  ;-)  -.-  ;-))))  o_O  :(  O-O  :-(

Interaction words:
*freu*  *lach*  *lächel*  *grins*  *fiesgrins*  *wink*  Gähn  *Seufz*  *werb*  *wunder*  *stotter*  *rotwerd*
*einrück*  lol  LOL  *lol*  *rofl*  *Grummel*  *kopfschüttel*  *duck*  *g*  *ggg*  *lernenmuss*
*feuerzeug an reb weiterreich*
cf. EN: *giggles*  *smiles*  *smirks*
Experiments with tokenization and POS tagging
Automatic analysis of the test data set with selected NLP tools for
German – using WebLicht.
WebLicht („Web-based Linguistic Chaining Tool“) is an execution environment
for automatic annotation of text corpora. Linguistic tools such as tokenizers,
part-of-speech taggers, and parsers are encapsulated as web services, which
the user can combine into custom processing chains. The resulting
annotations can then be visualized in an appropriate way, for example in a table or
tree format.
Tool chain 1: Combined tokenizer and
sentencizer + TreeTagger (IMS) using
the STTS POS tagset for German
Tool chain 2: Combined tokenizer and
sentencizer + Tagger from the
OpenNLP project (SfS) using the
STTS POS tagset for German
The processing pipeline (schematic)
data → tokenizer, sentence splitter → tokens → POS tagger → annotated tokens
The POS tagger assigns POS categories from the tagset (STTS).
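
The chain in the schematic can be mimicked in a few lines. This is a toy sketch, not the WebLicht services: `whitespace_tokenizer`, `lexicon_tagger` and the four-word lexicon are hypothetical stand-ins, chosen only to show how an error in the tokenizer stage propagates to the tagger stage.

```python
from typing import Callable, List, Tuple

Tagged = Tuple[str, str]

def whitespace_tokenizer(text: str) -> List[str]:
    """Stand-in tokenizer: splits on whitespace only."""
    return text.split()

def lexicon_tagger(tokens: List[str]) -> List[Tagged]:
    """Stand-in tagger: looks tokens up in a tiny hand-made lexicon
    with STTS tags; everything else is marked UNKNOWN."""
    lexicon = {"woher": "PWAV", "weißt": "VVFIN", "du": "PPER", "das": "PDS"}
    return [(tok, lexicon.get(tok.lower(), "UNKNOWN")) for tok in tokens]

def run_chain(text: str,
              tokenizer: Callable[[str], List[str]],
              tagger: Callable[[List[str]], List[Tagged]]) -> List[Tagged]:
    """data -> tokenizer -> tokens -> POS tagger -> annotated tokens"""
    return tagger(tokenizer(text))

result = run_chain("woher weißt du das?", whitespace_tokenizer, lexicon_tagger)
print(result)
# The last token is 'das?': the tagger knows 'das', but never sees it.
```

Because the question mark stays attached to the last word, the tagger cannot match a token it would otherwise handle.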
Problem type I: tokenization
Problem type I: tokenization problems: The data consist of linguistic
units for which the POS tagger has adequate categories of analysis, but
the tokenizer produces tokens that the POS tagger
can’t identify as instances of those categories.
→ Reasons (1): Non-canonical use of whitespace and punctuation marks.
Problem type I: tokenization
stoeps     stoeps war gestern sooooo vergesslich
TomcatMJ   wieso stoeps?biste losgerannt einkaufen udn ahst vergessen dich anzuziehen vorher?*G*
stoeps     man tom...woher weißt du das?
TomcatMJ   *hehe*
TomcatMJ   shortnews.de machts möglich wenn die supermarktwebcams reinverlinkt werden:-)
Dortmund Chat Corpus, document No. 2221007
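
The missing whitespace in postings like those above (stoeps?biste, tom...woher) could be repaired before tokenization. The following is a minimal sketch, assuming that only a few punctuation marks and runs of dots are treated; the function name `separate_glued_punctuation` is hypothetical and the rules are not those of the SfS or IMS tokenizers.

```python
import re

# Marks that are followed by a space in standard orthography,
# here found glued between two word characters.
GLUED_MARK = re.compile(r"(?<=\w)([?!,;])(?=\w)")
# Runs of two or more dots glued between two word characters.
# A single dot is left alone so that 'shortnews.de' stays intact.
GLUED_ELLIPSIS = re.compile(r"(?<=\w)(\.{2,})(?=\w)")

def separate_glued_punctuation(posting: str) -> str:
    """Insert spaces around punctuation that is glued between words."""
    posting = GLUED_MARK.sub(r" \1 ", posting)
    posting = GLUED_ELLIPSIS.sub(r" \1 ", posting)
    return posting

print(separate_glued_punctuation("wieso stoeps?biste losgerannt").split())
# ['wieso', 'stoeps', '?', 'biste', 'losgerannt']
print(separate_glued_punctuation("man tom...woher weißt du das?").split())
# ['man', 'tom', '...', 'woher', 'weißt', 'du', 'das?']
print(separate_glued_punctuation("shortnews.de machts möglich").split())
# ['shortnews.de', 'machts', 'möglich']
```

Sentence-final punctuation is left to the regular tokenizer; the sketch targets only marks glued between words.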
Problem type I: tokenization
[Figure: output of the SfS Tokenizer and of the IMS Tokenizer for the chat excerpt above (Dortmund Chat Corpus, document No. 2221007)]
Problem type I: sentence splitting
Problem type I: tokenization
→ Reasons (2): CMC-specific new types of tokens that the tokenizer
doesn’t know and that consist of special characters which the tokenizer,
in part, seems to recognize as other types of tokens.
Problem type I: tokenization
Example: Tokenization results from the IMS Tokenizer:
the segmentation of 27 out of 40 emoticon tokens is incorrect.
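
One way to keep emoticons intact is to try an emoticon pattern before the generic word and punctuation rules. The sketch below is an assumption-laden illustration: the pattern list is hand-written to cover the emoticons in the test data set, the name `tokenize` is hypothetical, and nothing here reflects how the IMS or SfS tokenizers work internally.

```python
import re
from typing import List

EMOTICON_PATTERNS = [
    r"[:;=8][-o]?[)(PDO]+",   # :) ;-) :-))) =o) :o) 8) :P :-P :O) :( :-(
    r"\(:",                   # (:
    r"\^\^", r"-\.-", r"o_O", r"O-O",
]

# Emoticons are tried first; then word characters; then any single
# non-space character as a punctuation token.
TOKEN = re.compile("(?:" + "|".join(EMOTICON_PATTERNS) + r")|\w+|[^\w\s]")

def tokenize(posting: str) -> List[str]:
    return TOKEN.findall(posting)

print(tokenize("supermarktwebcams reinverlinkt werden:-)"))
# ['supermarktwebcams', 'reinverlinkt', 'werden', ':-)']
```

The first pattern is deliberately loose and will produce false positives (for instance a colon directly followed by a capital D or P), so it would need tuning against real data.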
Manual Normalization of Tokenization Results
In order to be able to test the performance of the POS taggers, we
manually normalized the output of the tokenization process.
Problem type II: classification
Problem type II: classification problems: The tokens created by the
tokenizer are occurrences of categories which exist in the POS tagset, but
the POS tagger can’t classify them correctly.
→ Reason: non-standard spelling of word forms that can’t be mapped to a
standard form; occasional (non-lexicalized) abbreviations that can’t be
mapped to an expanded form.
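
Mapping abbreviations to an expanded form before tagging could look roughly like this. The sketch assumes a hand-made table built from the glosses on the acronym slide; the table, the function name `expand_acronyms` and the lowercase matching are illustrative choices, not part of the tools that were tested.

```python
from typing import List

# Hypothetical expansion table, keyed by the lowercased abbreviation.
ACRONYMS = {
    "bspw.": "beispielsweise",
    "vllt": "vielleicht",
    "evt.": "eventuell",
    "zb": "zum Beispiel",
    "jmd": "jemand",
}

def expand_acronyms(tokens: List[str]) -> List[str]:
    """Replace known abbreviations by their expansion.

    Multi-word expansions are split into several tokens so that the
    tagger sees ordinary word forms.
    """
    expanded: List[str] = []
    for tok in tokens:
        expansion = ACRONYMS.get(tok.lower())
        expanded.extend(expansion.split() if expansion else [tok])
    return expanded

print(expand_acronyms(["ich", "bspw.", "finde", "die", "jetzige", "Lösung"]))
# ['ich', 'beispielsweise', 'finde', 'die', 'jetzige', 'Lösung']
```

A fixed table only helps with recurring abbreviations; occasional, non-lexicalized ones remain a problem.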
Problem type II: classification – colloquial word spellings
Correctly classified examples ...    TreeTagger   OpenNLP Tagger
... from the DWDS corpus:            18 (20)      15 (20)
... from Wikipedia talk pages:       1 (20)       1 (20)
... from chats:                      2 (20)       3 (20)
VVIMP: Guck Dir genau den kompletten Vereinsnamen an.
VVFIN: ... wenn du gar nich suchst sondern einfach guckst was da ist...
VVFIN: sowatt, kütt, vüür
PTKNEG: nöö is er nich
PIS: und sagt nix, der sack
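
The counts for correctly classified colloquial spellings translate directly into accuracy figures. The snippet below only restates the numbers from the table above as proportions; the variable and function names are illustrative.

```python
# Correctly classified examples out of 20 per source, as in the table.
CORRECT = {
    "DWDS corpus": {"TreeTagger": 18, "OpenNLP Tagger": 15},
    "Wikipedia talk pages": {"TreeTagger": 1, "OpenNLP Tagger": 1},
    "chats": {"TreeTagger": 2, "OpenNLP Tagger": 3},
}
EXAMPLES_PER_SOURCE = 20

def accuracy(correct: int, total: int = EXAMPLES_PER_SOURCE) -> float:
    """Share of correctly classified examples."""
    return correct / total

for source, by_tagger in CORRECT.items():
    for tagger, correct in by_tagger.items():
        print(f"{source:22} {tagger:15} {accuracy(correct):.0%}")
```

For the canonical DWDS spellings this gives 90% and 75%; for the colloquial spellings from both CMC sources the figures lie between 5% and 15%.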
Problem type II: classification – acronyms
20 postings with instances of acronyms from the Wikipedia talk pages:

results for TreeTagger (IMS):
POS cat.   #    correct?
NE         8    0
NN         5    0
ADJA       3    0
VVFIN      2    0
ADJD       1    0
TRUNC      1    0
Total:     20   0

results for OpenNLP Tagger (SfS):
POS cat.   #    correct?
NE         8    0
ADV        5    1
VVFIN      2    0
XY         2    0
ADJD       1    1
APPR       1    0
NN         1    0
Total:     20   2

20 postings with instances of acronyms from the Dortmund Chat Corpus:

results for TreeTagger (IMS):
POS cat.   #    correct?
NN         10   0
ADJA       4    0
NE         4    0
VVFIN      2    0
Total:     20   0

results for OpenNLP Tagger (SfS):
POS cat.   #    correct?
NE         9    0
ADJA       7    0
XY         2    0
ADJD       1    0
VVFIN      1    0
Total:     20   0
Attraktivität liegt immer im Auge des Betrachters; ich bspw. finde die jetzige Lösung sehr viel weniger augenkrebserregend.
http://de.wikipedia.org/wiki/Diskussion:FC_Schalke_04/Archiv/1

Sollen wir evt. nicht gleich anfangen wir 2000-2010?
http://de.wikipedia.org/wiki/Diskussion:FC_Bayern_M%C3%BCnchen/Archiv/2012
Problem type III – categories (tagset)
Problem type III: category problem: The output of the tokenizer is
correct, but the POS tagger can’t classify the tokens correctly because the
tagset provides no adequate categories.
→ Reasons:
(1) The tokens under observation are not (prototypical) word tokens;
(2) the tokens are occurrences of phenomena which are not yet
systematically covered by existing POS tagsets.
Problem type III: categories – contractive forms
Test data set: Contractive forms of the type
VVFIN + PPER: findeste, magste, meinste, denkste, machs, machts, kenns, gehts, sags, schreibs
VVFIN + PPER + PPER: machstes < machst + (t)e + (e)s „make you it“ (you do/make it)
VAFIN + PPER: haste, habs, hats, biste, isses, wärs, wirds
VMFIN + PPER: könnteste, kanns

Contractive forms       Wikipedia talk pages:       chat:
classified as ...       TreeTagger   OpenNLP        TreeTagger   OpenNLP
VVFIN / VAFIN:          8 (20)       7 (20)         7 (20)       10 (20)
NN:                     8 (20)       1 (20)         6 (20)       0 (20)
ADJA / ADJD:            4 (20)       5 (20)         7 (20)       4 (20)
ADV:                    0 (20)       0 (20)         0 (20)       3 (20)
other:                  0 (20)       7 (20)         3 (20)       3 (20)
Problem type III: categories – emoticons
:)  (:  :-)  :-))  :-)))  :o)  ;-)  ;-))))  :(  :-(  :O)  :P  :-P  8)  =o)  ^^  -.-  o_O  O-O

Tagging result for IMS TreeTagger after manual normalization of
tokenization results (chat and Wikipedia examples taken together):
tagged as:   #
NN           20
ADJD         12
ADJA         6
NE           1
VVFIN        1
Total:       40

Note: No results for the STTS category ITJ (interjection), even though
emoticons share functional and positional properties with interjections.
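
Since the tagger spreads emoticons over NN, ADJD, ADJA, NE and VVFIN, one conceivable workaround is to overwrite the tag of every token that matches an emoticon pattern. The sketch below assumes a placeholder tag `EMO`, which is hypothetical and not part of STTS; the pattern is hand-written for the emoticons in the test data set and `retag_emoticons` is an illustrative name.

```python
import re
from typing import List, Tuple

EMOTICON = re.compile(r"(?:[:;=8][-o]?[)(PDO]+|\(:|\^\^|-\.-|o_O|O-O)")

def retag_emoticons(tagged: List[Tuple[str, str]],
                    emoticon_tag: str = "EMO") -> List[Tuple[str, str]]:
    """Overwrite the tagger's label for tokens that are emoticons.

    'EMO' is a placeholder label, not an STTS category.
    """
    return [
        (tok, emoticon_tag if EMOTICON.fullmatch(tok) else tag)
        for tok, tag in tagged
    ]

# Tagger output as it might look after manual normalization.
print(retag_emoticons([("gut", "ADJD"), (":)", "NN"), ("^^", "ADJA")]))
# [('gut', 'ADJD'), (':)', 'EMO'), ('^^', 'EMO')]
```

Such post-processing only hides the symptom; which category emoticons should receive is a question for the tagset itself.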
Problem type III: categories – interaction words
*freu*  *lach*  *lächel*  *grins*  *fiesgrins*  *wink*  Gähn  *Seufz*  *werb*  *wunder*  *stotter*  *rotwerd*
*einrück*  lol  LOL  *lol*  *rofl*  *Grummel*  *kopfschüttel*  *duck*  *g*  *ggg*  *lernenmuss*
*feuerzeug an reb weiterreich*

Tokenizing + tagging, 1st pass (IMS TreeTagger):
in most cases, the tokenizers treat an interaction word and its asterisks as one
token.
Tokenizing + tagging, 2nd pass (IMS TreeTagger):
manual normalization of the tokenization result before tagging: (1)
elimination of all asterisk characters; (2) even multi-word expressions
with whitespace (e.g. *feuerzeug an reb weiterreich*) are counted
as one token.

Tagging results, 1st pass:
tagged as:   #
NN           20
ADJD         13
ADJA         4
NE           2
VVIMP        1
Total:       40

Tagging results, 2nd pass:
tagged as:   #
NN           21
VVIMP        9
NE           3
ADJA         5
ADJD         1
PTKVZ        1
Total:       40

Observation, 2nd pass:
In 14 cases, the paradigm of the verb used as interaction
word includes an imperative form which is homonymous
with the non-inflected form.
In 8 of these cases, the interaction word is classified as an
imperative verb form (VVIMP).
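
The two normalization steps of the 2nd pass (removing the asterisks, keeping multi-word expressions as one token) were done manually. An automated approximation might look like the sketch below; the function name `normalize_interaction_words` is hypothetical, and the sketch only covers asterisk-delimited expressions, not bare forms such as lol or Gähn.

```python
import re
from typing import List

# An interaction word: anything between a pair of asterisks.
ASTERISK_EXPR = re.compile(r"\*([^*]+)\*")

def normalize_interaction_words(posting: str) -> List[str]:
    """Tokenize on whitespace, but emit each *...* expression as a
    single token without its asterisks."""
    tokens: List[str] = []
    pos = 0
    for match in ASTERISK_EXPR.finditer(posting):
        tokens.extend(posting[pos:match.start()].split())
        tokens.append(match.group(1).strip())
        pos = match.end()
    tokens.extend(posting[pos:].split())
    return tokens

print(normalize_interaction_words("na klar *feuerzeug an reb weiterreich* danke"))
# ['na', 'klar', 'feuerzeug an reb weiterreich', 'danke']
```

Stripping the asterisks also removes the one surface cue that marks these forms as interaction words, so a real pipeline would probably record that information elsewhere.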
Conclusion and Outlook:
Towards a Linguistic Annotation of CMC
CMC discourse exhibits diverse features of non-standardized writing (e.g., speedwriting phenomena,
colloquial spellings) and contains elements which the
NLP tools tested in our small experiment can’t handle yet
and which are not yet covered by tagsets designed for
the annotation of parts of speech (e.g., interaction signs
such as emoticons and interaction words).
Agenda:
1) Adaptation/training of existing tools for the processing of CMC,
e.g., on the basis of a manually coded gold standard for different types
of phenomena which frequently occur in CMC discourse
→ Tokenizers
→ Sentencizers
→ POS taggers
Ideal constellation: cooperation between linguists and researchers
from the field of NLP
2) Integration of CMC-specific elements into existing POS classifications
(cf. suggestion for interaction signs in TEI schema) and extension of POS
tagsets (e.g., within the workgroup for the revision of STTS, formed in 2012).