Annotating Syntactic Information on
5.5 Billion Word Corpus of
Japanese Blogs
Michal Ptaszynski 1, Rafal Rzepka 2, Kenji Araki 2,
Yoshio Momouchi 3
1) JSPS Research Fellow / Hokkai-Gakuen University, High-Tech Research Center
2) Hokkaido University, Graduate School of Information Science and Technology
3) Hokkai-Gakuen University, Department of Electronics and Information Engineering
Presentation Outline
• Introduction
• YACIS Corpus Description
• Syntax/Morphology Annotation
• Corpus Statistics
• Conclusions and Future Work
Introduction
• Corpora are important in many NLP tasks
– Text normalization
– Lexicon generation
– Sentiment analysis
– Dialog agent development
–…
Introduction
• There are some (somewhat) large corpora for
Japanese
– KOTONOHA: BCCWJ (Balanced Corpus of
Contemporary Written Japanese) (4,800,000w)
– Aozora Bunko (more than 10,000 books)
– Mainichi Shinbun (200,000 articles)
– Asahi Shinbun (130,000 articles)
Introduction
• A good source of casual language:
– INTERNET
• BLOGS
Introduction
• Internet based corpora for Japanese
– KWIC on WEB (http://languagecraft.jp/kwic/)
• 2,000,000 pages
– JpWaC (http://trac.sketchengine.co.uk/wiki/Corpora/JpWaC)
• 49,000 pages
–…
Introduction
• Internet based corpora for Japanese
– Problems:
• Robots.txt
• Duplicates
• Language detection
• Encoding
• No specific domain (multi-domain)
Introduction
• Blog based corpora for Japanese
– jBlogs
• 28,000 pages, 62 mil words
– KNB
• 249 pages, 67,000 words
jBlogs: M. Baroni and M. Ueyama, “Building General- and Special-Purpose Corpora by Web
Crawling”, In: Proceedings of the 13th NIJL International Symposium on Language Corpora: Their
Compilation and Application, 2006, www.tokuteicorpus.jp/result/pdf/2006 004.pdf
KNB: Chikara Hashimoto, Sadao Kurohashi, Daisuke Kawahara, Keiji Shinzato and Masaaki
Nagata, “Construction of a Blog Corpus with Syntactic, Anaphoric, and Sentiment Annotations”
[in Japanese], Journal of Natural Language Processing, Vol. 18, No. 2, pp. 175-201, 2011.
Introduction
• Written language corpora vs. Internet corpora vs. Blog corpora (corpus: scale)
• Written language corpora
– KOTONOHA: 4,800,000 words
– Aozora Bunko: >10,000 books
– Mainichi Shinbun: ~200,000 articles
– Asahi Shinbun: ~130,000 articles
• Internet corpora
– KWIC on WEB: 2,000,000 pages
– JpWaC: 49,000 pages
• Blog corpora
– jBlogs: 28,000 pages / 62 mil words
– KNB: 249 pages / 67,000 words
• Need a BIG corpus of blogs for Japanese
• Looked through a number of blog services
• Ameba (www.ameba.jp/) has a clear structure
．．．
<div class="contents">
<div class="subContents">

ずいぶん前になりますが岡山に行ってきました<img alt="0" src="http://emoji.ameba.jp/img/user/ts/tsumegaeru/606746.gif" /><img alt="きら
きら" src="http://emoji.ameba.jp/img/user/ch/chocodeblog/1956233.gif" />
 
なんと人生初の一人旅(ﾟ∀ﾟ*)
 
そしてこれはその時に買ったお土産です<img alt="パンダ" src="http://emoji.ameba.jp/img/user/ch/chocodeblog/1114039.gif" /> 
<a id="i10537038426" class="detailOn" href="http://ameblo.jp/0ppe1zryshan/image-10619583023-10537038426.html"><img
height="330" alt="○●ようのうまいもん日記●○"
src="http://stat.ameba.jp/user_images/20100511/19/0ppe1zryshan/42/f4/j/t02200330_0800120010537038426.jpg" width="220" border="0" /></a>
 
『白桃のロイヤルガレット』
 
桃系はやっぱり買っとかなければ!!! ってことで買いました(￣∀￣)
 
 
 
<a id="i10537038429" class="detailOn" href="http://ameblo.jp/0ppe1zryshan/image-10619583023-10537038429.html"><img alt="○●ようのうま
いもん日記●○" src="http://stat.ameba.jp/user_images/20100511/19/0ppe1zryshan/5b/a7/j/t02200147_0800053410537038429.jpg" border="0"
/></a>
．．．
• Extract the text (e.g. 『白桃のロイヤルガレット』) from between the HTML tags
Get rid of:
- Pictures
- Irrelevant HTML tags
- Emoji (but leave kaomoji)
- Non-Japanese pages
- … (Other post-processing)
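The clean-up steps listed above could be sketched roughly as follows. This is only an illustration using Python's standard `html.parser`, not the tool actually used to build the corpus; the class name `BlogTextExtractor` and the function `extract_text` are hypothetical. Dropping every tag removes pictures and image-based emoji, while kaomoji survive because they are ordinary text.

```python
from html.parser import HTMLParser


class BlogTextExtractor(HTMLParser):
    """Collect text nodes only; every tag (including <img> emoji) is dropped."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Keep non-empty text segments; kaomoji are plain text and stay intact.
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_text(html):
    """Return the list of text segments found between the HTML tags."""
    parser = BlogTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Shortened sample modelled on the blog excerpt shown earlier.
sample = (
    '<div class="subContents">'
    'なんと人生初の一人旅(ﾟ∀ﾟ*)<br />'
    '<img alt="パンダ" src="http://emoji.ameba.jp/img/user/ch/chocodeblog/1114039.gif" />'
    '『白桃のロイヤルガレット』'
    '</div>'
)
print(extract_text(sample))
```

Language filtering and the other post-processing steps are not covered by this sketch.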
• YACIS:
– Yet Another Corpus of Internet Sentences
• Corpus compilation:
– 2009 Dec 3-24
• Ameba blogs
• Only one query to Google:
“site:ameblo.jp” (take 1000 links)
→ crawl from page to page
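The "seed links, then crawl from page to page" idea can be sketched as a breadth-first crawl. The slides do not describe the actual crawler, so everything here is an assumption: the function `crawl`, the `fetch` callback (a caller-supplied function mapping a URL to its HTML, or `None`), the `limit` parameter, and the fake pages used in the demo.

```python
import re
from collections import deque
from urllib.parse import urljoin, urlparse

HREF = re.compile(r'href="([^"]+)"')


def crawl(seeds, fetch, allowed_host="ameblo.jp", limit=100):
    """Breadth-first crawl from seed URLs, following links on one host only.

    `fetch` is a hypothetical callback: url -> HTML string, or None on failure.
    """
    queue = deque(seeds)
    seen = set(seeds)
    pages = {}
    while queue and len(pages) < limit:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        for href in HREF.findall(html):
            link = urljoin(url, href)
            # Stay inside the blog service and never queue a URL twice.
            if urlparse(link).hostname == allowed_host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


# Demo on an in-memory fake site instead of the network.
fake_site = {
    "http://ameblo.jp/a/": '<a href="/b/">b</a> <a href="http://example.com/x">x</a>',
    "http://ameblo.jp/b/": '<a href="/a/">a</a> <a href="/c/">c</a>',
    "http://ameblo.jp/c/": "no links",
}
pages = crawl(["http://ameblo.jp/a/"], fake_site.get)
print(sorted(pages))
```

Passing the fetch function in from outside keeps the sketch testable offline; a real crawler would also need politeness delays and error handling.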
• Corpus scale comparison (corpus: scale)
• Written language corpora
– KOTONOHA: 4,800,000 words
– Aozora Bunko: >10,000 books
– Mainichi Shinbun: ~200,000 articles
– Asahi Shinbun: ~130,000 articles
• Internet corpora
– KWIC on WEB: 2,000,000 pages
– JpWaC: 49,000 pages
• Blog corpora
– jBlogs: 10 bil char, 28,000 pages / 62 mil words
– KNB: 249 pages / 67,000 words
– YACIS: 12 mil. pages, 28 bil. char, 5.6 bil. words
• What we have:
– Original URL
– Extraction time
– Sentence ID
– Tags:
• <doc> one blog page
• <post> one post in blog
• <s> one sentence
• <comments> all comments
• <cmt> one comment
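To make the tag set concrete, here is a minimal sketch that builds one document with Python's standard `xml.etree.ElementTree`. Only the tag names come from the slide. The attribute names (`url`, `time`, `id`), the nesting of `<comments>` directly under `<doc>`, the example URL, and the sentences are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# <doc>: one blog page; attribute names are hypothetical.
doc = ET.Element(
    "doc",
    {"url": "http://ameblo.jp/example/entry-1.html", "time": "2009-12-03T00:00:00"},
)

# <post>: one post, split into <s> sentences with running IDs.
post = ET.SubElement(doc, "post")
for i, sentence in enumerate(["First sentence.", "Second sentence."], start=1):
    s = ET.SubElement(post, "s", {"id": str(i)})
    s.text = sentence

# <comments>: all comments; <cmt>: one comment (placement under <doc> is assumed).
comments = ET.SubElement(doc, "comments")
cmt = ET.SubElement(comments, "cmt")
cs = ET.SubElement(cmt, "s", {"id": "3"})
cs.text = "A comment sentence."

xml_text = ET.tostring(doc, encoding="unicode")
print(xml_text)
```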
• What we want:
– Tokenization
– Lemmatization
– POS
– Dependency structure
– Named Entities
– Emoticons
– Emotive expressions
– Emotive sentences
– Emotion classes
– Positive/Negative
– Emotion objects
• Tokenization (T), Lemmatization (L), POS tagging (POS):
– ChaSen, MeCab, Juman
• Dependency structure (DS), Named entity recognition (NER):
– Cabocha, KNP
[Figure: speed comparison (slower to faster) of POS taggers (Juman, ChaSen, MeCab) and dependency parsers (KNP, Cabocha)]
* Subjective evaluation on a small test set, no benchmarks
** Test with “time” command in Linux
• Cool features of MeCab
– POS prediction
– Use of two dictionaries (ipadic, jumandic)
• Cool features of Cabocha
– Works with MeCab
– IREX (NER standard)
• Annotation time:
– MeCab: 2 days
– Cabocha: 7 days
• File size:
– Raw (only text, no HTML): 27 GB
– Tokenization: 32 GB
– POS/ipadic, Lemma, etc.: 286 GB
– POS/jumandic, Lemma: 286 GB
– DS with NER: 86 GB
• Annotation example
Corpus Statistics
• Evaluation
– Manual:
• Impossible (5.6 bil. words, 350 mil. sentences)
– 1 sentence in 1 sec. = 4050 days (11 years)
• MeCab and Cabocha are standard tools (reliable)
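The manual-evaluation estimate above is simple arithmetic and can be checked in a few lines. The sentence count and the one-second-per-sentence rate come from the slide; round-the-clock checking with no breaks and a 365-day year are assumptions.

```python
SENTENCES = 350_000_000        # about 350 mil. sentences in YACIS
SECONDS_PER_SENTENCE = 1       # optimistic manual checking speed
SECONDS_PER_DAY = 24 * 60 * 60

# Total checking time, assuming non-stop work.
days = SENTENCES * SECONDS_PER_SENTENCE / SECONDS_PER_DAY
years = days / 365

print(f"{int(days)} days, about {years:.1f} years")
```

With eight-hour working days instead of non-stop work, the estimate would be three times longer.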
Corpus Statistics
• Evaluation
– Automatic: comparison of general features
• Now mostly for POS
1. Ipadic vs. jumandic
2. YACIS vs. other Japanese corpora
3. YACIS vs. other language corpora
Corpus Statistics
Ipadic vs. jumandic
[Figure panels: YACIS-ipadic, YACIS-jumandic]
Differences in dictionaries.
For example: いやー (interjection)
Ipadic: いやー【感動詞 (interjection)】
Jumandic: いや【感動詞 (interjection)】＋ー【記号・特殊 (symbol, special)】
Corpus Statistics
YACIS vs. jBlogs and JENAAD
jBlogs: blog corpus from 2006
61 mil. words, 30 thousand blog docs
JENAAD: news corpus from 2003
4.7 mil. words, Yomiuri (1989-2001)
jBlogs: M. Baroni and M. Ueyama, “Building General- and Special-Purpose Corpora by Web
Crawling”, In: Proceedings of the 13th NIJL International Symposium on Language Corpora:
Their Compilation and Application, 2006, www.tokuteicorpus.jp/result/pdf/2006 004.pdf
JENAAD: Masao Utiyama and Hitoshi Isahara. (2003) “Reliable Measures for Aligning Japanese-English News Articles and Sentences”. ACL-2003, pp. 72–79.
Corpus Statistics
[Figure panels: YACIS-ipadic, YACIS-jumandic]
Corpus Statistics
Spearman's ρ
• YACIS-ipadic vs. YACIS-jumandic: 0.88
• YACIS-ipadic vs. jBlogs: 0.96
• YACIS-ipadic vs. JENAAD: 1
• YACIS-jumandic vs. jBlogs: 0.79
• YACIS-jumandic vs. JENAAD: 0.85
(jBlogs, JENAAD: ChaSen / ipadic)
Statistically, part-of-speech distribution is similar for ipadic across all three corpora:
– YACIS (large): 5,600,000,000 words
– jBlogs (medium): 61,000,000 words
– JENAAD (small): 4,700,000 words
It doesn't mean MeCab/ipadic is better.
It means: It is consistent regardless of corpus size. And even if it has errors, the errors appear consistently. *
* Within limited evaluation range: Comparison of POS distribution
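A rank correlation of this kind can be computed as sketched below. This is a generic, pure-Python Spearman's ρ (Pearson correlation of average ranks), not the authors' evaluation script. The POS percentages in the demo are invented for illustration, and the sketch assumes both corpora have already been mapped to the same set of POS categories.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical POS shares (percent) for the same seven categories in two corpora.
pos_tags = ["noun", "particle", "verb", "auxiliary", "adjective", "adverb", "interjection"]
corpus_a = [40, 25, 15, 10, 5, 3, 2]
corpus_b = [38, 20, 22, 9, 6, 2, 4]

rho = spearman_rho(corpus_a, corpus_b)
print(f"rho = {rho:.3f}")
```

Because only the ordering of categories matters, two corpora can reach ρ = 1 even when their raw percentages differ.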
Corpus Statistics
Japanese vs. British English and Italian
• British English: ukWaC
– 2 bil. words, .uk domain, POS, lemma
• Italian: itWaC
– 2 bil. words, .it domain, POS, lemma
• Size comparable to YACIS (>1 bil. words)
• Both from WaCky (Web-As-Corpus Kool Yinitiative)
http://wacky.sslmit.unibo.it/doku.php
Corpus Statistics
Japanese vs. British English and Italian
[Figure panels: YACIS-ipadic, YACIS-jumandic, and two panels marked *; labelled categories: Noun, Verb, Adjective]
What does it mean?
Corpora of comparable size, but different languages (Japanese vs. 2 European languages) have different POS distribution.
If we do not argue about POS definitions, this could be a small hint for a proof that POS
distribution is not universal across languages.
*) Marco Baroni, Silvia Bernardini, Adriano Ferraresi, Eros Zanchetta, “The WaCky Wide Web: A Collection of
Very Large Linguistically Processed Web-Crawled Corpora”, Kluwer Academic Publishers, Netherlands, 2008.
Conclusions
• Gathered YACIS
– 5.6 bil. word corpus of Japanese blogs
• Annotated YACIS with Syntactic/Morphological
information
– POS, Tokenization, Lemma, Dependency Structure,
Named Entities
• Evaluated YACIS by comparing to other
corpora
Conclusions
• Corpora of the same language, but different
size have similar POS distribution
= POS tagging is consistent
• Corpora of comparable size, but different
languages have different POS distribution
= POS distribution IS NOT universal across
languages
Future Work
• Online interface!
• More detailed evaluation (e.g. of dependency)
• Lexicon generation
• N-gram version for download without limitations
• Applications
Thank you for your attention!
Michal Ptaszynski
ptaszynski@ieee.org
Discussion
• Copyrights
– YACIS will not be put on sale
– Only for scientific purposes
– Usage of the corpus will need a two-sided agreement
• Gathering of the corpus is similar to what search
engines do
– If YACIS were illegal, Google, Yahoo,… would be
even more illegal.
