FS-NER: A Lightweight Filter-Stream
Approach to Named Entity
Recognition on Twitter Data
Diego Marinho de Oliveira¹, Alberto H. F. Laender¹,
Adriano Veloso¹, Altigran S. da Silva²
¹Departamento de Ciência da Computação
Universidade Federal de Minas Gerais
{dmoliveira, laender, adrianov}@dcc.ufmg.br
²Instituto de Computação
Universidade Federal do Amazonas
alti@icomp.ufam.edu.br
Diego Marinho de Oliveira (dmoliveira@dcc.ufmg.br) – MSM@WWW - 2013
Introduction
• Microblogging activity is reshaping the way people
communicate
• Dominant NER approaches are either based on
linguistic grammar-based techniques or on
statistical models
• Both are ill-suited to Twitter data
[Gimpel et al., 2011; Ritter et al., 2011]
Introduction
• We propose a novel NER approach, called FS-NER
(Filter Stream Named Entity Recognition)
• FS-NER is characterized by the use of filters that
process Twitter messages
• We show that FS-NER performs 3% better than a
CRF-based baseline, besides being orders of
magnitude faster and much more practical
Named Entity Recognition
• Aims to recognize and extract specific
terms known as entities (e.g., persons,
organizations, locations)
Gates was born in Seattle, Washington and
was the founder of Microsoft
[PER Gates] was born in [LOC Seattle],
[LOC Washington] and was the founder of
[ORG Microsoft]
Twitter and its challenges
It is an online social networking service that allows users to
write and read text-based messages of up to 140 characters
• Large volume of data
• Lack of formalism
• Language diversity
• Real-time nature
• Lack of contextualization
• Data stream orientation
Related Work
• Ritter et al., 2011
Took techniques usually applied to traditional NER
and adapted them to Twitter
• Liu et al., 2011
Combined the k-Nearest Neighbors algorithm with a CRF-based
approach to build a semi-supervised system
• Li et al., 2012
Proposed TwiNER, a two-step, unsupervised NER approach
targeted at Twitter data
Proposed Approach
Filter Stream Named Entity
Recognition (FS-NER)
Filter-Stream Named Entity Recognition
• Let S = <m1, m2, …> be a stream of messages (i.e., tweets);
each mi in S is expressed by a pair (X, Y)
X = [x1, x2, …, xn] and Y = [y1, y2, …, yn], where each yi takes one value from
{Beginning, Inside, Last, Outside, UnitToken}
While X is known in advance for all messages in S, the values of the
labels in Y are unknown and must be predicted
For example
“I love NEW YORK” → ([x1=I, x2=love, x3=NEW, x4=YORK],
[y1=Outside, y2=Outside, y3=Beginning, y4=Last]).
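The labeling scheme above can be illustrated with a minimal Python sketch. The function name `encode_labels` and the representation of entities as (start, end) token spans with an exclusive end are assumptions made for this example, not part of the slides.

```python
from typing import List, Tuple

# The five label values listed on the slide.
LABELS = ("Beginning", "Inside", "Last", "Outside", "UnitToken")


def encode_labels(tokens: List[str], entities: List[Tuple[int, int]]) -> List[str]:
    """Build Y for a message X, given entity spans as (start, end), end exclusive."""
    labels = ["Outside"] * len(tokens)
    for start, end in entities:
        if end - start == 1:
            # A single-token entity gets the UnitToken label.
            labels[start] = "UnitToken"
        else:
            labels[start] = "Beginning"
            for i in range(start + 1, end - 1):
                labels[i] = "Inside"
            labels[end - 1] = "Last"
    return labels


tokens = "I love NEW YORK".split()
print(list(zip(tokens, encode_labels(tokens, [(2, 4)]))))
```

For the slide's example, the span (2, 4) covers NEW and YORK, so the sketch yields Outside, Outside, Beginning, Last.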
• Filter
Likelihood
• Recognition
– The set {l, ϴl} is used to choose the most likely
label for the term xi
– In isolation, filters may not capture specific
patterns that can be used for recognition
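One way to picture the recognition step is the Python sketch below. The exact decision rule did not survive in this transcript, so the sketch assumes the simplest reading: every filter that fires returns a pair (l, ϴl), and the label with the highest likelihood wins. The filter interface, the `recognize` function, the default label and the toy likelihood values are all hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# A prediction pairs a label l with its estimated likelihood theta_l.
Prediction = Tuple[str, float]
# Hypothetical filter interface: inspect token i of a message and return a
# prediction, or None when the filter has nothing to say.
Filter = Callable[[List[str], int], Optional[Prediction]]


def recognize(tokens: List[str], i: int, filters: List[Filter],
              default: str = "Outside") -> str:
    """Pick the label with the highest likelihood among the filters that fired."""
    candidates = []
    for f in filters:
        prediction = f(tokens, i)
        if prediction is not None:
            candidates.append(prediction)
    if not candidates:
        return default  # assumption: unmatched terms are treated as Outside
    return max(candidates, key=lambda p: p[1])[0]


# Toy filters with made-up likelihoods, for illustration only.
def toy_term_filter(tokens, i):
    return ("Beginning", 0.7) if tokens[i] == "NEW" else None


def toy_noun_filter(tokens, i):
    return ("UnitToken", 0.4) if tokens[i][:1].isupper() else None


tokens = "I love NEW YORK".split()
print(recognize(tokens, 2, [toy_term_filter, toy_noun_filter]))  # Beginning
```

Taking the maximum over individual filters is only one possible rule; the second bullet suggests that combining filters matters because a single filter can miss patterns.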
Filter Engineering
• Term
The term filter estimates the probability of a
certain term being an entity
[Figure: two bar charts over the labels B, I, L, O, U, showing the distributions P(Yi|X=“new”^FT) and P(Yi|X=“Nashville”^FT)]
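A term filter of this kind can be sketched in Python as a table of label counts per term, from which a distribution over labels is read off. This is a minimal sketch under assumptions: the function names, the lowercasing of terms and the three toy training messages are illustrative and not taken from the slides.

```python
from collections import Counter, defaultdict


def train_term_filter(messages):
    """Count how often each term carries each label in the training messages."""
    counts = defaultdict(Counter)
    for tokens, labels in messages:
        for token, label in zip(tokens, labels):
            counts[token.lower()][label] += 1  # lowercasing is an assumption
    return counts


def label_distribution(counts, term):
    """Relative frequency of each label for a term; empty if the term is unseen."""
    seen = counts.get(term.lower())
    if not seen:
        return {}
    total = sum(seen.values())
    return {label: n / total for label, n in seen.items()}


# Toy training data, made up for illustration.
training = [
    (["I", "love", "NEW", "YORK"], ["Outside", "Outside", "Beginning", "Last"]),
    (["a", "new", "phone"], ["Outside", "Outside", "Outside"]),
    (["Nashville", "rocks"], ["UnitToken", "Outside"]),
]
counts = train_term_filter(training)
print(label_distribution(counts, "new"))
print(label_distribution(counts, "Nashville"))
```

In this toy data, "new" is split evenly between Beginning and Outside, while "Nashville" is always a UnitToken, which mirrors the contrast between an ambiguous term and an unambiguous one.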
Filter Engineering
• Context
The context filter is able to capture unknown
entities
Apple → Is it an entity or not?
Apple fail should be a lesson for Microsoft
[ORG]
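A context filter can be sketched as follows, assuming the context of a term is simply the pair (previous token, next token). The slides do not define the context window, so this choice, the boundary marker, the function names and the "Nokia" training message are all hypothetical.

```python
from collections import Counter, defaultdict

BOUNDARY = "<s>"  # hypothetical marker for the start or end of a message


def context_of(tokens, i):
    """Context of token i: the lowercased tokens immediately before and after it."""
    prev_tok = tokens[i - 1].lower() if i > 0 else BOUNDARY
    next_tok = tokens[i + 1].lower() if i + 1 < len(tokens) else BOUNDARY
    return (prev_tok, next_tok)


def train_context_filter(messages):
    """Count which labels occur in each context, ignoring the term itself."""
    counts = defaultdict(Counter)
    for tokens, labels in messages:
        for i, label in enumerate(labels):
            counts[context_of(tokens, i)][label] += 1
    return counts


def predict_from_context(counts, tokens, i):
    """Most frequent label for the context of token i, with its relative frequency."""
    seen = counts.get(context_of(tokens, i))
    if not seen:
        return None
    label, n = seen.most_common(1)[0]
    return label, n / sum(seen.values())


# Made-up training message: "Nokia" is labeled as an entity before "fail".
training = [(["Nokia", "fail", "again"], ["UnitToken", "Outside", "Outside"])]
counts = train_context_filter(training)
tweet = "Apple fail should be a lesson for Microsoft".split()
print(predict_from_context(counts, tweet, 0))
```

Because the lookup key never includes the term itself, "Apple" can be recognized even though it never appears in the training data.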
Filter Engineering
• Affix
The affix filter uses the fragments of an
observation xi to infer if it is an entity
Examples of affixes
-ville: Nashville, Knoxville, Jacksonville
-an (-ian): Italian, urban, African
-ese: Chinese, Congolese, Vietnamese
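The affix idea can be sketched in Python by counting how often each word ending occurs on entity and non-entity terms. This sketch looks only at suffixes of 2 to 5 characters and uses a binary entity/non-entity flag; both simplifications, the function names and the non-entity example "table" are assumptions for illustration.

```python
from collections import Counter, defaultdict


def suffixes(token, max_len=5):
    """Lowercased suffixes of length 2..max_len that are shorter than the token."""
    token = token.lower()
    return [token[-k:] for k in range(2, max_len + 1) if len(token) > k]


def train_affix_filter(examples, max_len=5):
    """Count, per suffix, how many entity and non-entity terms end with it."""
    counts = defaultdict(Counter)
    for token, is_entity in examples:
        for suffix in suffixes(token, max_len):
            counts[suffix][is_entity] += 1
    return counts


def entity_rate(counts, suffix):
    """Fraction of training terms with this suffix that were entities."""
    seen = counts.get(suffix)
    if not seen:
        return None
    return seen[True] / (seen[True] + seen[False])


# The -ville names come from the slide; "table" is a made-up non-entity.
examples = [("Nashville", True), ("Knoxville", True),
            ("Jacksonville", True), ("table", False)]
counts = train_affix_filter(examples)
print(entity_rate(counts, "ville"))  # 1.0
print(entity_rate(counts, "le"))     # 0.75
```

The longer suffix "ville" is a much stronger signal than the short suffix "le", which also ends ordinary words.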
Filter Engineering
• Dictionary
The dictionary filter uses a list of names of
correlated entities to infer whether the
observed term is an entity
[Figure: diagram relating the training set, the test set and the dictionary]
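A dictionary lookup of this kind can be sketched in Python as a case-insensitive search for the longest name that starts at each position. The function names, the three-word limit and the two dictionary entries are assumptions for this example; the slides do not say how the dictionary is matched.

```python
def build_dictionary(names):
    """Store known entity names in lowercase for case-insensitive lookup."""
    return {name.lower() for name in names}


def dictionary_matches(tokens, dictionary, max_words=3):
    """Return (start, end) token spans, end exclusive, found in the dictionary.

    At each position the longest candidate is tried first.
    """
    spans = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            if " ".join(tokens[i:i + n]).lower() in dictionary:
                spans.append((i, i + n))
                i += n  # skip past the matched name
                break
        else:
            i += 1  # no name starts here
    return spans


# Hypothetical dictionary entries.
dictionary = build_dictionary(["New York", "Microsoft"])
print(dictionary_matches("I love NEW YORK".split(), dictionary))  # [(2, 4)]
```

A match only says that the term appears in the list; how much weight that carries would still have to be estimated from training data.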
Filter Engineering
• Noun
The noun filter only considers terms that have just
the first letter capitalized to infer if the observed
term is an entity
Capitalized: John, Apple, Nintendo
Not capitalized: bill, new york, microsoft
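The capitalization test behind the noun filter can be written as a short Python check. The function name is hypothetical, and treating all-uppercase tokens such as "NEW" as not matching is this sketch's reading of "just the first letter capitalized".

```python
def first_letter_only_capitalized(token: str) -> bool:
    """True when the first character is uppercase and no later character is."""
    return token[:1].isupper() and not any(c.isupper() for c in token[1:])


for term in ["John", "Apple", "Nintendo", "bill", "new york", "microsoft", "NEW"]:
    print(term, first_letter_only_capitalized(term))
```

The first three terms pass the check and the remaining four do not.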
FS-NER in Action: Example
• Recognition model
• Training
Term filter
Context filter
Considering only xi where ϴl > 0.00
Noun filter
• Test
Precision 0.88
Recall 1.00
F1 0.93
Experimental Evaluation
• Baseline
CRF-based approach from Sarawagi (available at http://crf.sourceforge.net/)
• Metrics
$P = \frac{N_{RC}}{N_{AA}}$; $R = \frac{N_{RC}}{N_{ES}}$; $F_1 = \frac{2PR}{P + R}$
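The three metrics are the usual precision, recall and F1, and can be computed as in the Python sketch below. The slide does not expand its count names, so the sketch assumes N_RC is the number of correctly recognized entities, N_AA the number of entities the recognizer returned, and N_ES the number of entities in the evaluation set. The counts in the usage line are made up.

```python
def precision_recall_f1(n_rc: int, n_aa: int, n_es: int):
    """Precision, recall and F1 from three counts.

    n_rc: correctly recognized entities (assumed meaning of N_RC)
    n_aa: entities returned by the recognizer (assumed meaning of N_AA)
    n_es: entities in the evaluation set (assumed meaning of N_ES)
    """
    p = n_rc / n_aa if n_aa else 0.0
    r = n_rc / n_es if n_es else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Hypothetical counts: 8 entities returned, 7 of them correct, 7 in the set.
p, r, f1 = precision_recall_f1(n_rc=7, n_aa=8, n_es=7)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")  # P=0.875 R=1.000 F1=0.933
```

The zero checks only guard against division by zero when nothing is returned or the set is empty.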
• Collections
– OW: tweets related to soccer teams playing in
the Brazilian National League
Entities of interest: Player, Venue and Teams
2,000 tweets; language: Portuguese
– ETZ: tweets randomly crawled, all in
English
Entities of interest: Company, Place and Person
2,400 tweets; language: English
– WT: tweets semi-manually labeled and supplied by
the WePS3 task 2
Entity of interest: Organization
~44,000 tweets; languages: English, Spanish, Japanese, ...
Performance Analysis
• Analysis of Individual Filters
– Filters with the best precision: term, context and dictionary
– Filters with the best recall: affix and noun
• Analysis of Specific Filter Combinations
– Recognition centered on the term filter (TRM)
– Recognition centered on the term filter with generalization (GTRM)
– Recognition centered on the noun filter (NON)
– Recognition centered on the context filter (CTX)
• Comparison with CRF-based approaches
• Runtime Comparison
Conclusion
• We have introduced FS-NER (Filter-Stream
Named Entity Recognition)
• The proposed approach is based on an efficient
structure composed of lightweight filters
• We show that FS-NER performs 3% better than a
CRF-based baseline, besides being orders of
magnitude faster and much more practical
Future Work
• Alleviate the dependence on manually
annotated data
• Automate the process of filter combination
• Identify other application environments in
which FS-NER is suitable
Thanks!
This work was partially funded by InWeb - The Brazilian National Institute of Science and
Technology for the Web (grant MCT/CNPq 573871/2008-6), and by the authors’ individual grants
from CNPq, FAPEMIG and FAPEAM.
References
• K. Gimpel, N. Schneider, B. O'Connor, D. Das, D. Mills, J. Eisenstein, M.
Heilman, D. Yogatama, J. Flanigan, and N. A. Smith. Part-of-Speech Tagging
for Twitter: Annotation, Features, and Experiments. In Proc. of ACL (Short
Papers), pages 42-47, 2011.
• A. Ritter, S. Clark, Mausam, and O. Etzioni. Named Entity Recognition in
Tweets: An Experimental Study. In Proc. of EMNLP, pages 1524-1534, 2011.
• C. Li, J. Weng, Q. He, Y. Yao, A. Datta, A. Sun, and B.-S. Lee. TwiNER: named
entity recognition in targeted twitter stream. In Proc. of SIGIR, pages 721-730, 2012.
• X. Liu, S. Zhang, F. Wei, and M. Zhou. Recognizing Named Entities in
Tweets. In Proc. of ACL, pages 359-367, 2011.
