FS-NER: A Lightweight Filter-Stream Approach to Named Entity

Transcription

FS-NER: A Lightweight Filter-Stream Approach to Named Entity
FS-NER: A Lightweight Filter-Stream
Approach to Named Entity
Recognition on Twitter Data
Diego Marinho de OliveiraƩ, Alberto H. F. LaenderƩ,
Adriano VelosoƩ, Altigran S. da SilvaƱ
ƩDepartamento
de Ciência da Computação
Universidade Federal de Minas Gerais
{dmoliveira, laender, adrianov}@dcc.ufmg.br
ƱInstituto
de Computação
Universidade Federal do Amazonas
alti@icomp.ufam.edu.br
Diego Marinho de Oliveira (dmoliveira@dcc.ufmg.br) – MSM@WWW - 2013
Introduction
• Microblogging activity is reshaping the way people
communicate
• Dominant NER approaches are either based on
linguistic grammar-based techniques or on
statistical models
• They are ill suited when it comes to Twitter data
[Gimpel et al., 2011; Ritter et al., 2011]
2
Diego Marinho de Oliveira (dmoliveira@dcc.ufmg.br) – MSM@WWW - 2013
Introduction
• We propose a novel NER approach, called FS-NER
(Filter Stream Named Entity Recognition)
• FS-NER is characterized by the use of filters that
process Twitter messages
• We show that FS-NER performs 3% better than a
CRF-based baseline, besides being orders of
magnitude faster and much more practical
3
Diego Marinho de Oliveira (dmoliveira@dcc.ufmg.br) – MSM@WWW - 2013
Named Entity Recognition
• Aims to recognize and extract determined
terms known as entity (e.g., person,
organization, location)
Gates was born in Seattle, Washington and
was the founder of Microsoft
4
Diego Marinho de Oliveira (dmoliveira@dcc.ufmg.br) – MSM@WWW - 2013
Named Entity Recognition
• Aims to recognize and extract determined
terms known as entity (e.g., person,
organization, location)
[PER Gates] was born in [LOC Seattle],
[LOC Washington] and was the founder of
[ORG Microsoft]
5
Diego Marinho de Oliveira (dmoliveira@dcc.ufmg.br) – MSM@WWW - 2013
Twitter and it challenges
It is an online social networking service that allow users to
write and read text-based messages up to 140 characters
•
•
•
•
•
•
Large volume of data
Lack of formalism
Language diversity
Real-time nature
Lack of contextualization
Data stream orientation
6
Diego Marinho de Oliveira (dmoliveira@dcc.ufmg.br) – MSM@WWW - 2013
Related Work
• Ritter et al., 2011
Considered techniques usually applied to traditional NER
and adapted them to Twitter
• Liu et al., 2011
Use the k-Nearest Neighbors algorithm and a CRF-based
approach for composing a semi-supervised system
• Li et al., 2012
Have proposed a two-step, unsupervised NER approach
targeted to Twitter data, called TwiNER
Diego Marinho de Oliveira (dmoliveira@dcc.ufmg.br) – MSM@WWW - 2013
7
Proposed Approach
Filter Stream Named Entity
Recognition (FS-NER)
8
Diego Marinho de Oliveira (dmoliveira@dcc.ufmg.br) – MSM@WWW - 2013
Filter-Stream Named Entity Recogntion
• Let S = <m1,m2,…> be a stream of messages (i.e., tweets)
each mi in S is expressed by a pair (X,Y)
X =[x1, x2, …, xn] and Y = [y1, y2, …, yn], yi assumes one value of
{Beginning, Inside, Last, Outside, UnitToken}
While X is known in advance for all messages in S, the values for the
labels in Y are unknown and must be predicted
For example
“I love NEW YORK” → ([x1=I, x2=love, x3=NEW, x4=YORK],
[y1=Outside, y2=Outside, y3=Beginning, y4=Last]).
9
Diego Marinho de Oliveira (dmoliveira@dcc.ufmg.br) – MSM@WWW - 2013
Filter-Stream Named Entity Recogntion
• Filter
Likelihood
10
Diego Marinho de Oliveira (dmoliveira@dcc.ufmg.br) – MSM@WWW - 2013
Filter-Stream Named Entity Recogntion
• Recognition
– The set {l, ϴl} is used to choose the most likely
label for the term xi
– In isolation, filters may not capture specific
patterns that can be used for recognition
11
Diego Marinho de Oliveira (dmoliveira@dcc.ufmg.br) – MSM@WWW - 2013
Filter-Stream Named Entity Recogntion
• Recognition
12
Diego Marinho de Oliveira (dmoliveira@dcc.ufmg.br) – MSM@WWW - 2013
Filter Engineering
• Term
The term filter estimates the probability of a
certain term being an entity
Distribution P(Yi|X=“new”^FT)
Distribution P(Yi|X=“Nashville”^FT)
0,9
0,7
0,8
0,6
0,7
0,5
0,6
0,5
0,4
0,4
0,3
0,3
0,2
0,2
0,1
0,1
0
0
B
I
L
O
U
B
I
L
O
U
13
Diego Marinho de Oliveira (dmoliveira@dcc.ufmg.br) – MSM@WWW - 2013
Filter Engineering
• Context
The context filter is able to capture unknown
entities
Apple → It is an entity or not?
14
Diego Marinho de Oliveira (dmoliveira@dcc.ufmg.br) – MSM@WWW - 2013
Filter Engineering
• Context
The context filter is able to capture unknown
entities
Apple fail should be a lesson for Microsoft
[ORG]
15
Diego Marinho de Oliveira (dmoliveira@dcc.ufmg.br) – MSM@WWW - 2013
Filter Engineering
• Affix
The affix filter uses the fragments of an
observation xi to infer if it is an entity
Examples of affixes
-ville: Nashville, Knoxville, Jacksonville
-an (-ian): Italian, urban, African
-ese: Chinese, Congolese, Vietnamese
16
Diego Marinho de Oliveira (dmoliveira@dcc.ufmg.br) – MSM@WWW - 2013
Filter Engineering
• Dictionary
The dictionary filter uses a list of names of
correlated entities to infer whether the
observed term is an entity
?
TRAINING
SET
TEST
SET
DICTIONARY
17
Diego Marinho de Oliveira (dmoliveira@dcc.ufmg.br) – MSM@WWW - 2013
Filter Engineering
• Noun
The noun filter only considers terms that have just
the first letter capitalized to infer if the observed
term is an entity
Capitalized
John
Apple
Nintendo
Not capitalized
bill
new york
microsoft
18
Diego Marinho de Oliveira (dmoliveira@dcc.ufmg.br) – MSM@WWW - 2013
FS-NER in Action: Example
19
Diego Marinho de Oliveira (dmoliveira@dcc.ufmg.br) – MSM@WWW - 2013
FS-NER in Action: Example
• Recognition model
20
Diego Marinho de Oliveira (dmoliveira@dcc.ufmg.br) – MSM@WWW - 2013
FS-NER in Action: Example
• Training
Term filter
Context filter
Considering only Xi where ϴI > 0,00
Diego Marinho de Oliveira (dmoliveira@dcc.ufmg.br) – MSM@WWW - 2013
Noun filter
21
FS-NER in Action: Example
• Test
Precision 0.88
Recall 1.00
Diego Marinho de Oliveira (dmoliveira@dcc.ufmg.br) – MSM@WWW - 2013
F1 0.93
22
Experimental Evaluation
• Baseline
CRF-based approach from Sarawagi1
• Metrics
NRC
NRC
2 PR
P
; R
; F1 
NAA
NES
PR
23
1. Avaible at http://crf.sourceforge.net/
Diego Marinho de Oliveira (dmoliveira@dcc.ufmg.br) – MSM@WWW - 2013
Experimental Evaluation
• Collections
 OW Tweets are related to soccer teams playing in
the Brazilian National League
Entities of interest: Player, Venue and Teams
2.000 Tweets; language Portuguese
24
Diego Marinho de Oliveira (dmoliveira@dcc.ufmg.br) – MSM@WWW - 2013
Experimental Evaluation
• Collections
 ETZ Tweets were randomly crawled and are all in
English
Entities of interest: Company, Place and Person
2.400 Tweets; language English
25
Diego Marinho de Oliveira (dmoliveira@dcc.ufmg.br) – MSM@WWW - 2013
Experimental Evaluation
• Collections
 WT Tweets semi-manually labeled and supplied by
the WePS3 task 2
Entity of interest: Organization
~44.000 Tweets; language English, Spanish, Japanese, ...
26
Diego Marinho de Oliveira (dmoliveira@dcc.ufmg.br) – MSM@WWW - 2013
Performance Analysis
• Analysis Individual Filters
27
Diego Marinho de Oliveira (dmoliveira@dcc.ufmg.br) – MSM@WWW - 2013
Performance Analysis
• Analysis Individual Filters
 The best precise filters
Term, context and dictionary
 The best recall filters
Affix and noun
28
Diego Marinho de Oliveira (dmoliveira@dcc.ufmg.br) – MSM@WWW - 2013
Performance Analysis
• Analysis of Specific Filter Combinations
Recognition centered on the term filter (TRM)
29
Diego Marinho de Oliveira (dmoliveira@dcc.ufmg.br) – MSM@WWW - 2013
Performance Analysis
• Analysis of Specific Filter Combinations
Recognition centered on the term filter with
generalization (GTRM)
30
Diego Marinho de Oliveira (dmoliveira@dcc.ufmg.br) – MSM@WWW - 2013
Performance Analysis
• Analysis of Specific Filter Combinations
Recognition centered on the noun filter (NON)
31
Diego Marinho de Oliveira (dmoliveira@dcc.ufmg.br) – MSM@WWW - 2013
Performance Analysis
• Analysis of Specific Filter Combinations
Recognition centered on the context filter (CTX)
32
Diego Marinho de Oliveira (dmoliveira@dcc.ufmg.br) – MSM@WWW - 2013
Performance Analysis
• Analysis of Specific Filter Combinations
33
Diego Marinho de Oliveira (dmoliveira@dcc.ufmg.br) – MSM@WWW - 2013
Performance Analysis
• Comparison with CRF-based approaches
34
Diego Marinho de Oliveira (dmoliveira@dcc.ufmg.br) – MSM@WWW - 2013
Performance Analysis
• Runtime Comparison
35
Diego Marinho de Oliveira (dmoliveira@dcc.ufmg.br) – MSM@WWW - 2013
Conclusion
• We have introduced FS-NER (Filter-Stream
Named Entity Recognition)
• The proposed approach is based on a efficient
structure composed of lightweight filters
• We show that FS-NER performs 3% better than a
CRF-based baseline, besides being orders of
magnitude faster and much more practical
36
Diego Marinho de Oliveira (dmoliveira@dcc.ufmg.br) – MSM@WWW - 2013
Future Work
• Alleviate the dependence on manually
annotated data
• Automate the process of filter combination
• Identify other application environments in
which FS-NER is suitable
37
Diego Marinho de Oliveira (dmoliveira@dcc.ufmg.br) – MSM@WWW - 2013
ThAnks!
This work was partially funded by InWeb - The Brazilian National Institute of Science and
Technology for the Web (grant MCT/CNPq 573871/2008-6), and by the authors’ individual grants
from CNPq, FAPEMIG and FAPEAM.
38
Diego Marinho de Oliveira (dmoliveira@dcc.ufmg.br) – MSM@WWW - 2013
References
• K. Gimpel, N. Schneider, B. O'Connor, D. Das, D. Mills, J. Eisenstein, M.
Heilman, D. Yogatama, J. Flanigan, and N. A. Smith. Part-of-Speech Tagging
for Twitter: Annotation, Features, and Experiments. In Proc. of ACL (Short
Papers), pages 42-47, 2011
• A. Ritter, S. Clark, Mausam, and O. Etzioni. Named Entity Recognition in
Tweets: An Experimental Study. In Proc. of EMNLP, pages 1524-1534, 2011
• C. Li, J. Weng, Q. He, Y. Yao, A. Datta, A. Sun, and B.-S. Lee. TwiNER: named
entity recognition in targeted twitter stream. In Proc. of SIGIR, pages 721730, 2012.
• X. Liu, S. Zhang, F. Wei, and M. Zhou. Recognizing Named Entities in
Tweets. In Proc. of ACL, pages 359-367, 2011.
39
Diego Marinho de Oliveira (dmoliveira@dcc.ufmg.br) – MSM@WWW - 2013