Experiments with Tokenization and Part-of-speech Tagging for German CMC Discourse
Thomas Bartz, Michael Beißwenger, Angelika Storrer

Processing and Annotating CMC discourse

Desiderata from the perspective of linguists who want to do corpus-based analyses of language use in CMC and of linguistic variation across genres (CMC vs. other written genres):
a) annotation of phenomena specific to CMC (e.g., structure of CMC documents; „netspeak“ elements such as emoticons, interaction words, addressing terms)
b) part-of-speech and syntactic annotation of postings
c) automatization of (a) and (b) in order to be able to analyze large sets of data and to build large annotated CMC corpora

Desideratum from the perspective of Automatic Language Processing / Language Technology:
Increase the accuracy of automatic linguistic processing and annotation of web corpora through better handling of phenomena which typically occur in the CMC shares of these corpora.

Our goals & motivation

Evaluation of the performance of standard NLP tools (which are usually trained on newspaper texts) for analyzing linguistic phenomena which do not conform to the standard and which are either specific to or frequent in CMC discourse.
goal: identification of problem types as a basis for a discussion of possible solutions
basis: small, manually compiled data set with typical CMC phenomena

Project background:
DeRiK – German Reference Corpus of Computer-Mediated Communication
  http://www.dwds.de
  http://www.chatkorpus.tu-dortmund.de
KobRA – Corpus-based Linguistic Research and Analysis Using Data Mining
  http://www.kobra.tu-dortmund.de
Scientific network: Empirical Research on Internet-based Communication
  http://www.empirikom.net

Experiments with tokenization and POS tagging

Manually compiled data set with postings that display selected features typical for CMC discourse.
Source: Dortmund Chat Corpus (http://www.chatkorpus.tu-dortmund.de/) and German Wikipedia talk pages (http://de.wikipedia.org/)

Feature                                                               Wikipedia talk pages   Chat     DWDS
written colloquial language I: colloquial spellings of words          W1  20                 C1  20   20
written colloquial language II: contractions of verb form + pronoun   W2  20                 C2  20
occasional or CMC-specific acronyms                                   W3  20                 C3  20
„netspeak“ elements I: emoticons                                      W4  20                 C4  20
„netspeak“ elements II: interaction words                             W5  20                 C5  20
Postings in total:                                                    100                    100
Wikipedia talk pages + chat: 200 postings

DWDS: canonical spellings of the words given in W1 and C1; source: text corpora of the DWDS project, http://www.dwds.de/

Written colloquial language – e.g.: word spellings

Sources: DWDS (canonical lang.), Wikipedia talk pages, Chat

Ja, Meg Ryan habe tatsächlich ...
... ja, der Brownie koste zwei Dollar. ...
Prinzipiell ja, auch wenn ...
... ja, in ihm offenbare sich der ...
Jo, gute Vorbereitung ist ...
Joh, da hast Du sicher nicht ...
Jap, geht klar!
Jupp, aber Hinweise zu ...
jo, mach das mal...
japp tom, stimmt. ...
jepp zora, das bin ich ; )
jau das auto fährt ...

Goethe-Jahr? Aber nein: eine...
... dann sage ich: nein. ...
Nein, nein: Der normale ...
Nee dann müsste ich ja ...
Ach nee, jetze isses ...
Nööö (Zitat Benutzer:Orientalist)...
nope,die 10000 gesamt sind ...
@quaki, nee,bin ...
nöö is er nich ...

Okay, okay, sie ist ein ...
okidoki, sag Bescheid, wenn du ...
oki...mach`s gut

..., droht jetzt der Bankrott.
... Ach nee, jetze isses plötzlich ...

Gut machst du das!, ruft ...
... gefroren ist, das ist schon ...
nö,dat ebste findeste ...
dat ist donald duck ...

Die Rose verblüht ihm nicht.
ich mag net wissen wie ...

... mit dieser Spende nichts zu tun...
... Bergtouren nichts anderes als ...
...kann man hier nischt mehr ...
und sagt nix, der sack

Darum kann ich es ...
..., wenn ich täglich einige...
Isch ja gut, es hier noch ...

... Mein richtiger Vater war ...
... mit Vadder is hier Kim Il Sung...

..., aber auch mit großem Aufwand... ...

isch hab bestanden
mach isch glatt :)
ich auch aba bei mir ...

Written colloquial language – e.g.: contracted forms

STANDARD: verb form + pronoun (2nd pers. sg.)    NON-STANDARD: verb form + reduced form of pronoun, contracted
meinst du = ‚think you‘, (do) you think          meinste : meinst + de (< du)
hast du = did you / have you                     haste : hast + de (< du)
bist du = are you                                biste : bist + de (< du)
kommst du = ‚come you‘, (do) you come            kommste : kommst + de (< du)

na klar,wat meinste wohl wieso die hälfte da nichtanwesend war?*G*
  right, so why do ya think half of them weren’t there?*G*
was haste denn kaputt gemacht?
  what didya break?
wie alt biste und wo kommste her
  how old are ya and where’d ya come from
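The contraction pattern shown for the 2nd person singular (verb form + reduced pronoun, e.g. meinste : meinst + de) can be illustrated with a small sketch. The regular expression and the function name split_contraction are assumptions made for illustration; they are not part of the tool chains examined in these slides.

```python
import re

# Assumed pattern (not from the slides): a finite verb form ending in "-st"
# followed by "e", i.e. the reduced pronoun "de" (< "du") merged with the verb's "t".
CONTRACTED_2SG = re.compile(r"^(?P<verb>\w+st)e$", re.IGNORECASE)


def split_contraction(token):
    """Return [verb, pronoun] for forms like 'meinste', otherwise [token]."""
    match = CONTRACTED_2SG.match(token)
    if match is None:
        return [token]
    return [match.group("verb"), "du"]


# The four contracted forms from the slide, plus one word that is left alone.
for form in ["meinste", "haste", "biste", "kommste", "wohl"]:
    print(form, "->", split_contraction(form))
```

A purely orthographic rule like this over-generates: it would also split ordinary words ending in -ste, such as "beste" or "Liste". A usable version would probably need a verb lexicon or context from the tagger.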
Occasional or CMC-specific acronyms

Examples:
IMHO     in my humble opinion
bspw.    beispielsweise (for example)
b.t.w.   by the way
Btw.     by the way
vllt     vielleicht (maybe)
evt.     eventuell (possibly)
mE       meines Erachtens (in my opinion)
zB       zum Beispiel (e.g.)
Thx      thanks
jmd      jemand(en) (anybody / somebody)
LG       Liebe Grüße (lots of love)
POV      point of view (frequently used in Wikipedia discussions, even in the German Wikipedia)

In the test data set, occurrences of acronyms are given in context (size = one posting):

Tut mir leid, -die Bilder sind ja soweit gut, sie sollten jedoch IMO zur Abwechslung auch mal vereinzelt links stehen
  http://de.wikipedia.org/wiki/Diskussion:Schw%C3%A4bische_Alb/Archiv/1

Positionen zu Umweltpolitik. Jmd fleißig genug, die zu finden und einzuarbeiten?
  http://de.wikipedia.org/wiki/Diskussion:Peter_Altmaier
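For the acronyms listed above, one conceivable preprocessing step is a lookup that replaces a known acronym with its expanded form before tagging. The following is only a sketch under that assumption: the expansions are copied from the example list, while the helper expand_acronyms and the case-insensitive matching are hypothetical and not something the slides propose. Acronyms that are not in the list pass through unchanged.

```python
# Expansions copied from the example list above ("jmd" is given there as
# jemand(en); the sketch picks "jemand").
EXPANSIONS = {
    "IMHO": "in my humble opinion",
    "bspw.": "beispielsweise",
    "b.t.w.": "by the way",
    "Btw.": "by the way",
    "vllt": "vielleicht",
    "evt.": "eventuell",
    "mE": "meines Erachtens",
    "zB": "zum Beispiel",
    "Thx": "thanks",
    "jmd": "jemand",
    "LG": "Liebe Grüße",
    "POV": "point of view",
}

# Assumption: matching ignores case, so "Jmd" and "jmd" are treated alike.
LOOKUP = {key.lower(): value for key, value in EXPANSIONS.items()}


def expand_acronyms(tokens):
    """Replace listed acronyms by their expansion, one output token per word."""
    expanded = []
    for token in tokens:
        replacement = LOOKUP.get(token.lower())
        expanded.extend(replacement.split() if replacement else [token])
    return expanded


posting = "Jmd fleißig genug, die zu finden und einzuarbeiten?"
print(expand_acronyms(posting.split()))
```

Ignoring case is convenient for forms such as Jmd at the start of a posting, but it can also produce wrong expansions when a short acronym coincides with an ordinary token.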
„Netspeak“ elements – e.g.: emoticons, interaction words

Emoticons:
:) :O) (: :P :-) :-P :-)) 8) :-))) =o) :o) ^^ ;-) -.- ;-)))) o_O :( O-O :-(

Interaction words:
*freu* *lach* *lächel* *grins* *fiesgrins* *wink* Gähn *Seufz* *werb* *wunder* *stotter* *rotwerd* *einrück* lol LOL *lol* *rofl* *Grummel* *kopfschüttel* *duck* *g* *ggg* *lernenmuss* *feuerzeug an reb weiterreich*
cf. EN: *giggles* *smiles* *smirks*

Experiments with tokenization and POS tagging

Automatic analysis of the test data set with selected NLP tools for German, using WebLicht.

WebLicht („Web-based Linguistic Chaining Tool“) is an execution environment for automatic annotation of text corpora.
Linguistic tools such as tokenizers, part-of-speech taggers, and parsers are encapsulated as web services, which can be combined by the user into custom processing chains. The resulting annotations can then be visualized in an appropriate way, such as in a table or tree format.

Tool chain 1: Combined tokenizer and sentencizer + TreeTagger (IMS), using the STTS POS tagset for German
Tool chain 2: Combined tokenizer and sentencizer + Tagger from the OpenNLP project (SfS), using the STTS POS tagset for German

The processing pipeline (schematic)
[Figure: data -> tokenizer, sentence splitter -> tokens -> POS tagger -> annotated tokens; the tagset (STTS) provides the POS categories]

Problem type I: tokenization

Tokenization problems: The data consists of linguistic units for which the POS tagger has adequate categories of analysis, but the tokenizing process creates tokens that the POS tagger can’t identify as instances of those categories.

Reasons (1): Non-canonical use of whitespace and punctuation marks.

stoeps     stoeps war gestern sooooo vergesslich
TomcatMJ   wieso stoeps?biste losgerannt einkaufen udn ahst vergessen dich anzuziehen vorher?*G*
stoeps     man tom...woher weißt du das?
TomcatMJ   *hehe*
TomcatMJ   shortnews.de machts möglich wenn die supermarktwebcams reinverlinkt werden:-)
Dortmund Chat Corpus, document No. 2221007

[Figure: output of the SfS Tokenizer and of the IMS Tokenizer for this excerpt]

Problem type I: sentence splitting
[Figure: output of the SfS Tokenizer and of the IMS Tokenizer]

Reasons (2): CMC-specific new types of tokens that the tokenizer doesn’t know and that consist of special characters which the tokenizer seems, in part, to recognize as other types of tokens.

Example: Tokenization results from the IMS Tokenizer: segmentation of 27 out of 40 emoticon tokens incorrect.
[Figure: output of the SfS Tokenizer and of the IMS Tokenizer for the chat excerpt above (Dortmund Chat Corpus, document No. 2221007)]

Manual Normalization of Tokenization Results

In order to be able to test the performance of the POS taggers, we manually normalized the output of the tokenization process.

Problem type II: classification

Classification problems: The tokens created by the tokenizer are occurrences of categories which exist in the POS tagset, but the POS tagger can’t classify them correctly.

Reason: non-standard spelling of word forms that can’t be mapped to a standard form; occasional (non-lexicalized) abbreviations that can’t be mapped to an expanded form.

Problem type II: classification – colloquial word spellings

Correctly classified examples ...   TreeTagger   OpenNLP Tagger
... from the DWDS corpus:           18 (20)      15 (20)
... from Wikipedia talk pages:       1 (20)       1 (20)
... from chats:                      2 (20)       3 (20)

VVIMP: Guck Dir genau den kompletten Vereinsnamen an.
VVFIN: ... wenn du gar nich suchst sondern einfach guckst was da ist...
VVFIN: sowatt kütt vüür
PTKNEG: nöö is er nich
PIS: und sagt nix, der sack

Problem type II: classification – acronyms

20 postings with instances of acronyms from the Wikipedia talk pages:

results for TreeTagger (IMS):      results for OpenNLP Tagger (SfS):
POS cat.   #    correct?           POS cat.   #    correct?
NE         8    0                  NE         8    0
NN         5    0                  ADV        5    1
ADJA       3    0                  VVFIN      2    0
VVFIN      2    0                  XY         2    0
ADJD       1    0                  ADJD       1    1
TRUNC      1    0                  APPR       1    0
                                   NN         1    0
Total:     20   0                  Total:     20   2

20 postings with instances of acronyms from the Dortmund Chat Corpus:

results for TreeTagger (IMS):      results for OpenNLP Tagger (SfS):
POS cat.   #    correct?           POS cat.   #    correct?
NN         10   0                  NE         9    0
ADJA       4    0                  ADJA       7    0
NE         4    0                  XY         2    0
VVFIN      2    0                  ADJD       1    0
                                   VVFIN      1    0
Total:     20   0                  Total:     20   0

Examples:

Attraktivität liegt immer im Auge des Betrachters; ich bspw. finde die jetzige Lösung sehr viel weniger augenkrebserregend.
  http://de.wikipedia.org/wiki/Diskussion:FC_Schalke_04/Archiv/1

Sollen wir evt. nicht gleich anfangen wir 2000-2010?
  http://de.wikipedia.org/wiki/Diskussion:FC_Bayern_M%C3%BCnchen/Archiv/2012

Problem type III: categories (tagset)

Category problem: The output of the tokenizer is correct, but the POS tagger can’t classify the tokens correctly because the tagset provides no adequate categories.

Reasons: (1) The tokens under observation are not (prototypical) word tokens; (2) the tokens are occurrences of phenomena which are not yet systematically covered by existing POS tagsets.

Problem type III: categories – contractive forms

Test data set: Contractive forms of the type
VVFIN + PPER          findeste, magste, meinste, denkste, machs, machts, kenns, gehts, sags, schreibs
VVFIN + PPER + PPER   machstes < machst + (t)e + (e)s, „make you it“ (you do/make it)
VAFIN + PPER          haste, habs, hats, biste, isses, wärs, wirds
VMFIN + PPER          könnteste, kanns

Contractive forms      Wikipedia talk pages:     chat:
classified as ...      TreeTagger   OpenNLP      TreeTagger   OpenNLP
VVFIN / VAFIN:         8 (20)       7 (20)       7 (20)       10 (20)
NN:                    8 (20)       1 (20)       6 (20)       0 (20)
ADJA / ADJD:           4 (20)       5 (20)       7 (20)       4 (20)
ADV:                   0 (20)       0 (20)       0 (20)       3 (20)
other:                 0 (20)       7 (20)       3 (20)       3 (20)

Problem type III: categories – emoticons

:) (: :-) :-)) :-))) :o) ;-) ;-)))) :( :-( :O) :P :-P 8) =o) ^^ -.- o_O O-O

Tagging result for IMS TreeTagger after manual normalization of tokenization results (chat and Wikipedia examples taken together):

tagged as:   #
NN           20
ADJD         12
ADJA         6
NE           1
VVFIN        1
Total:       40

Note: No results for the STTS category ITJ (interjection), even though emoticons share functional and positional properties with interjections.

Problem type III: categories – interaction words

*freu* *lach* *lächel* *grins* *fiesgrins* *wink* Gähn *Seufz* *werb* *wunder* *stotter* *rotwerd* *einrück* lol LOL *lol* *rofl* *Grummel* *kopfschüttel* *duck* *g* *ggg* *lernenmuss* *feuerzeug an reb weiterreich*

Tokenizing + tagging, 1st pass (IMS TreeTagger): tokenizers treat interaction words and asterisks in most cases as one token.

Tokenizing + tagging, 2nd pass (IMS TreeTagger): manual normalization of the tokenization result before tagging: (1) elimination of all asterisk characters; (2) even multi-word expressions with whitespace (e.g. *feuerzeug an reb weiterreich*) are accounted for as one token.

Tagging results, 1st pass:     Tagging results, 2nd pass:
tagged as:   #                 tagged as:   #
NN           20                NN           21
ADJD         13                VVIMP        9
ADJA         4                 NE           3
NE           2                 ADJA         5
VVIMP        1                 ADJD         1
                               PTKVZ        1
Total:       40                Total:       40

Observation, 2nd pass: In 14 cases, the paradigm of the verb used as interaction word comprises an imperative form which is homonymous with the non-inflected form. In 8 of these cases, the interaction word is classified as an imperative verb form (VVIMP).

Conclusion and Outlook: Towards a Linguistic Annotation of CMC

CMC discourse exhibits diverse features of non-standardized writing (e.g., speedwriting phenomena, colloquial spellings) and contains elements which the NLP tools tested in our small experiment cannot yet handle and which are not yet covered by tagsets designed for the annotation of parts of speech (e.g., interaction signs such as emoticons and interaction words).

Agenda:
1) Adaptation/training of existing tools (tokenizers, sentencizers, POS taggers) for the processing of CMC, e.g., on the basis of a manually coded gold standard for different types of phenomena which frequently occur in CMC discourse. Ideal constellation: cooperation between linguists and researchers from the field of NLP.
2) Integration of CMC-specific elements into existing POS classifications (cf. suggestion for interaction signs in TEI schema) and extension of POS tagsets (e.g., within the workgroup for the revision of STTS, formed 2012).
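As one illustration of agenda item 1 for tokenizers, a rule-based token pattern could try CMC-specific token types (asterisk expressions, emoticons) before the generic word and punctuation rules. The sketch below is an assumption made for illustration only: it is neither the IMS nor the SfS tokenizer, the pattern covers only some of the emoticons in the test data, and the name tokenize is hypothetical.

```python
import re

# Hypothetical token pattern. Alternatives are tried left to right, so the
# CMC-specific token types come before the generic word and punctuation rules.
TOKEN = re.compile(r"""
    \*[^*\n]+\*                # asterisk expression, e.g. *hehe* or *G*
  | [:;=8][-o]?[)(DPO]+        # emoticons such as :-) ;-)))) :P =o) 8)
  | \^\^ | -\.- | o_O | O-O    # a few further emoticons from the test data
  | \w+(?:\.\w+)*              # word, optionally with inner dots (shortnews.de)
  | \.{2,}                     # ellipsis
  | [^\w\s]                    # any other single punctuation character
""", re.VERBOSE)


def tokenize(posting):
    """Split a posting into tokens, keeping emoticons and *...* together."""
    return TOKEN.findall(posting)


# Postings from the chat excerpt (Dortmund Chat Corpus, document No. 2221007).
for posting in [
    "wieso stoeps?biste losgerannt einkaufen udn ahst vergessen dich anzuziehen vorher?*G*",
    "man tom...woher weißt du das?",
    "shortnews.de machts möglich wenn die supermarktwebcams reinverlinkt werden:-)",
]:
    print(tokenize(posting))
```

Hand-written rules of this kind over-match in some contexts, for example "(8)" or "Diskussion:Peter", which is one reason to evaluate any adapted tokenizer against a manually coded gold standard.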