European Seventh Framework Programme
FP7-218086-Collaborative Project
D9.35 Framework for combining user supplied
knowledge from diverse sources
© INDECT Consortium, www.indect-project.eu
The INDECT Consortium
AGH — University of Science and Technology, AGH, Poland
Gdansk University of Technology, GUT, Poland
InnoTec DATA GmbH & Co. KG, INNOTEC, Germany
IP Grenoble (Ensimag), INP, France
MSWiA — General Headquarters of Police (Polish Police), GHP, Poland
Moviquity, MOVIQUITY, Spain
Products and Systems of Information Technology, PSI, Germany
Police Service of Northern Ireland, PSNI, United Kingdom
Poznan University of Technology, PUT, Poland
Universidad Carlos III de Madrid, UC3M, Spain
Technical University of Sofia, TU-SOFIA, Bulgaria
University of Wuppertal, BUW, Germany
University of York, UoY, Great Britain
Technical University of Ostrava, VSB, Czech Republic
Technical University of Kosice, TUKE, Slovakia
X-Art Pro Division G.m.b.H., X-art, Austria
Fachhochschule Technikum Wien, FHTW, Austria
Copyright © 2013, the Members of the INDECT Consortium
Document Information
Contract Number: 218086
Deliverable Name: Framework for combining user supplied knowledge from diverse sources
Deliverable number: D9.35
Editor(s): Suresh Manandhar, University of York, suresh@cs.york.ac.uk
Author(s): Suraj Jung Pandey, University of York, suraj@cs.york.ac.uk
Reviewer(s):
  Ethical Review: Andreas Pongratz (X-art)
  Security Review: Petr Machnik (VSB)
  End-users review: Dmitrijs Apanasovics, Michael Ross (PSNI)
  Scientific Review: Nick Pears (UoY)
Dissemination level: Public
Contractual date of delivery: December 2012
Delivery date: July 2013
Status: Final version
Keywords: Entity resolution, Kernel methods
This project is funded under the 7th Framework Programme.
Contents
Document Information
1 Executive Summary
2 Introduction
2.1 Objectives
2.2 List of participants & roles
2.3 Ethical Issues
3 General Steps for Entity Linking
3.1 Generating similar entities
3.2 Comparing entities
3.2.1 Building a model representing entity
3.3 Entity Selection
4 Datasets and Results
List of Figures
1 Framework for combining user supplied knowledge.
2 Both superstars are often referred as ’MJ’. Source: http://en.wikipedia.org/wiki/Michael_Jackson Source: http://en.wikipedia.org/wiki/Michael_Jordan
3 Summary of steps required to ensure ethical use of WP4 software
4 Example pipeline for the task of entity linking
5 Model creation by training with positive and negative examples
6 Entities associated with unknown entity ’MJ’, Michael Jackson and Michael Jordan. In this case, MJ is probably Michael Jordan.
7 Document 1 for entity "SFAX"
8 Document 2 for entity "SFAX"
9 Assigning named entity labels using Stanford’s NER. The labels are either PERSON or ORGANISATION or LOCATION.
10 Pipeline for feature vector creation.
11 Example of dimensionality reduction. Semantically similar words are collapsed to single feature label.
List of Tables
1 Necessary security measures for use of software within WP4 and their benefits.
2 Document Updates
1. Executive Summary
Security is becoming a weak point of energy and communications infrastructures, commercial stores,
conference centers, airports and sites with high person traffic in general. Practically any crowded place
is vulnerable, and the risks should be controlled and minimised as much as possible. Access control and
rapid response to potential dangers are properties that every security system for such environments
should have. The INDECT project is aiming to develop new tools and techniques that will help the
potential end users in improving their methods for crime detection and prevention thereby offering
more security to the citizens of the European Union.
In the context of the INDECT project, Work Package 4 (WP4) is responsible for the Extraction of
Information for Crime Prevention by Unstructured Data. One of the tasks in WP4 is combining user
supplied knowledge.
Combining user supplied knowledge essentially means maintaining a database of records, or knowledge
base, which contains descriptions of different entities (persons, locations, organisations). First, it is
necessary to group the documents within the knowledge base so that each group refers to the same
unique entity, for example separating all documents referring to Michael Jordan the basketball player
from documents about other people named Michael Jordan. Next, the user can add or delete records
in the database. The new information could be a single document or itself a database of records.
During the addition of new information, the consistency of the knowledge base should be maintained:
if the entity already exists in the knowledge base, the new information should be merged with the
existing information about that entity; if the entity does not exist, a new ID for it should be created
in the knowledge base. Figure 1 shows the framework for combining user supplied knowledge, with
’entity disambiguation’ working as the central control of such a system.
[Figure 1 diagram. Labels: Knowledge Base; Single document; Database Records; Entity Disambiguation
System; Entities (Entity 1 to Entity 4, each with a Description); Add/Delete Records;
Add/Delete/Update Records]
Figure 1: Framework for combining user supplied knowledge.
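The update rule described in the Executive Summary (merge into an existing entity, otherwise create a new ID) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's implementation: the `KnowledgeBase` class, its `add_document` method and the `overlap_resolver` function (a crude word-overlap stand-in for the entity disambiguation system) are hypothetical names introduced here.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical resolver type: returns the ID of the matching entity, or None.
Resolver = Callable[[str, Dict[int, List[str]]], Optional[int]]


class KnowledgeBase:
    """Toy knowledge base: maps an entity ID to the documents describing it."""

    def __init__(self, resolve: Resolver) -> None:
        self.entities: Dict[int, List[str]] = {}
        self._resolve = resolve
        self._next_id = 1

    def add_document(self, document: str) -> int:
        """Merge the document into an existing entity or create a new ID."""
        entity_id = self._resolve(document, self.entities)
        if entity_id is None:
            # Entity not found: allocate a fresh ID in the knowledge base.
            entity_id = self._next_id
            self._next_id += 1
            self.entities[entity_id] = []
        self.entities[entity_id].append(document)
        return entity_id


def overlap_resolver(document, entities, min_shared=2):
    """Assumed stand-in for disambiguation: match on shared words."""
    words = set(document.lower().split())
    for entity_id, docs in entities.items():
        for doc in docs:
            if len(words & set(doc.lower().split())) >= min_shared:
                return entity_id
    return None
```

In this sketch the resolver is passed in as a function, so the word-overlap heuristic could be replaced by a real disambiguation component without touching the update logic.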
2. Introduction
In this document, we describe an entity resolution/linking/disambiguation method. Entity resolution
can be defined as a method that identifies whether two entities are the same or not. Entities can be
mentions of persons, places or organisations.
Entity resolution has a variety of applications. For instance, entity resolution is essential from a
security point of view, where the goal is to identify the different entity mentions (first names,
surnames, nicknames) of persons or organisations that are the subject of an investigation. Additionally,
it can be used to remove duplicates from a large database by merging identical entities.
The disambiguation of an entity is a difficult task for two main reasons. Firstly, an entity can be
mentioned in a variety of ways. For example, many organisations and people are often referred to by
their initials, e.g. BA for British Airways or MJ for Michael Jackson. Secondly, and most importantly,
entity mentions are ambiguous, i.e. they might refer to different entities. For example, BA might also
refer to Bosnian Airlines, while MJ might also refer to Michael Jordan. In the same vein, Washington
might refer to the capital of the USA, a newspaper (Washington daily), or George Washington.
In this document we describe a system that can identify whether a given entity is present in the
database or not. The most important feature for disambiguating two entities is the set of relations
that each entity has with other entities. For example, as can be seen in Figure 2, the two ’MJ’s can be
disambiguated from information about their occupation, birth date, place etc.
Figure 2: Both superstars are often referred to as ’MJ’. (a) Michael Jackson’s profile; (b) Michael Jordan’s profile.
Source: http://en.wikipedia.org/wiki/Michael_Jackson
Source: http://en.wikipedia.org/wiki/Michael_Jordan
2.1. Objectives
The objective of this report is to provide an introduction to, and motivation for, entity resolution.
The report also briefly explains the steps necessary for entity resolution in a knowledge base (database)
and provides key insights into the method developed by UoY for entity resolution.
2.2. List of participants & roles
This report has been produced by the University of York (UoY).
2.3. Ethical Issues
Since entity resolution creates relations between different entities, its misuse may affect citizens’
privacy. Hence, care must be taken to prevent improper use of such software. For the purpose of
developing our scientific methods we have explicitly avoided any prejudicial stereotypes and have only
used publicly available data from various sources. Any use of the methodology in real situations will
require careful monitoring to ensure compliance with ethical and legal standards.
Organisations that use the methods and software described or produced within WP4, and the INDECT
project in general, should be aware of the security risks associated with their use. Software and methods
produced within WP4 can be employed to automatically analyse text documents or web pages to extract
names of people, organisations, locations, dates etc., extract their relationships and identify
behavioural patterns. Such software can process massive amounts of data continuously.
At least the following set of security measures needs to be in place for any software described or
produced within WP4: (1) use of the software in a secure environment, (2) secure management
of data sources by encrypting the knowledge bases used, (3) a secure access policy, (4) secure storage
of access logs and (5) anonymisation of IDs (email, URL, username etc.) in the data.
The security risk associated with using software within WP4 can be minimised by following important
security protocols. A carefully designed security protocol prevents critical software, like behavioural
profiling software, from being misused and compromised. Table 1 provides a summary of some of the
benefits associated with the corresponding security measures.
Security Measures | Benefits
Use of software in secure environment | Safeguards critical application and data.
Encrypting data | Protects sensitive information in case of data misplacement. Also, access of data over network becomes secure.
Secure access policy | Includes authentication protocols which provides necessary restriction for untrusted use. Also maintains logs of software access.
Anonymisation of Ids | Protects privacy of the owner of the document.
Table 1: Necessary security measures for use of software within WP4 and their benefits.
[Figure 3 flowchart, Steps 1 to 13. Labels: Input data (chat to be analysed, seed suspicious websites
and textual data for relation mining); Internet; De-anonymisation and search; Id (email, URL, username
etc.) anonymisation; WP4 software; Encrypted analysed content (suspicious websites, suspicious chat,
relation graphs); Secure authorisation?; Human expert; Encrypted access log; Human verification;
Potential data of interest; Encrypted human verified data; Encrypted browsing and verification log;
Legal authorisation?; User Id de-anonymisation]
Figure 3: Summary of steps required to ensure ethical use of WP4 software
Figure 3 shows the stepwise use of WP4 software in a secure environment. A brief explanation of each
of the steps is given below:
Step 1 Before supplying the software with input data, any user identification present within the data
is anonymised. This step is necessary to protect the identity of the user or website in the
subsequent steps.
Step 2 The WP4 software processes input data.
Step 3 When additional data relating to an anonymised user or website needs to be fetched from the
World Wide Web, de-anonymisation is necessary. However, the de-anonymisation and search module
is separated from the WP4 software. The de-anonymisation and search is performed in a secure
location (e.g. police servers). The newly extracted content is again fed back to Step 1 for
anonymisation. The WP4 software only has access to anonymised content, which provides an
additional layer of security.
Step 4 The output of the software can be suspicious websites, suspicious chat or relation graphs,
depending on the type of software used. The output is stored in encrypted storage. The
encryption of the output is necessary as it prevents any unauthorised access to processed data.
Step 5 A human expert uses a secure access mechanism to view the processed data.
Step 6 The access logs are stored in a secure encrypted storage to prevent unauthorised viewing of
access logs.
Step 7 Although the WP4 software exhibits reasonable accuracy, human verification is necessary to
remove false positives, i.e. data that has been incorrectly classified by the software as
being potential data of interest.
In this step, a human expert will analyse all processed data and identify potential data of interest.
Step 8 The human verified data is stored in a secure storage.
Step 9 All activities of the human expert are logged and these are stored in a secure storage.
Step 10 At the point of need, a legal authorisation is given to a human expert to de-anonymise some of
the data requiring serious investigation.
Step 11 The human expert gains legally authorised access and this is logged in a secure storage.
Step 12 The selected data is de-anonymised.
Step 13 The de-anonymised data is revealed to the human expert for further investigation.
The above steps for the usage of the software are needed to minimise the risk associated with
unauthorised use of the software and to ensure that the use of the software is strictly within the law.
Steps 5, 7, 10 and 12 are flagged as critical steps as they involve human access to sensitive data. These
steps should only be executed once appropriate legal authorisation has been obtained as indicated above.
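As an illustration of Step 1 (anonymisation) and Step 12 (de-anonymisation), the following minimal sketch replaces e-mail addresses and URLs with keyed pseudonyms and keeps the reverse mapping separately. It rests on several assumptions that are not taken from the report: the use of HMAC-SHA256 pseudonyms, the deliberately simplified regular expression, and the function names `anonymise` and `deanonymise` are all hypothetical.

```python
import hashlib
import hmac
import re

# Simplified, illustrative pattern (assumption): e-mail addresses and URLs only.
ID_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+|https?://\S+")


def anonymise(text, secret_key, mapping):
    """Replace e-mail addresses and URLs with keyed pseudonyms.

    `mapping` (pseudonym -> original) is filled in place; in the workflow
    above it would be kept in the secure location, away from the analysis
    software.
    """
    def substitute(match):
        original = match.group(0)
        digest = hmac.new(secret_key, original.encode("utf-8"), hashlib.sha256)
        pseudonym = "ID_" + digest.hexdigest()[:12]
        mapping[pseudonym] = original
        return pseudonym

    return ID_PATTERN.sub(substitute, text)


def deanonymise(text, mapping):
    """Restore the original identifiers (only after legal authorisation)."""
    for pseudonym, original in mapping.items():
        text = text.replace(pseudonym, original)
    return text
```

Because the pseudonym is derived from a keyed hash, the same identifier always maps to the same pseudonym, so relations between mentions survive anonymisation while the identifier itself stays hidden from the analysis step.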
3. General Steps for Entity Linking
Given a target entity mention, its corresponding article and the database containing real-world data,
the task of entity linking can be divided into three sub-processes:
• Generating similar entities,
• Comparing entities, and
• Entity Selection.
[Figure 4 diagram: the input (MJ, Basketball) drives a targeted search of the database for candidate
generation (Michael Jordan, Sports; MJ, politics; Michael Jackson, Musician); candidate ranking and
entry selection then returns the matching entity (Michael Jordan, Sports).]
Figure 4: Example pipeline for the task of entity linking
An example pipeline is shown in Figure 4. Given a target entity mention and its corresponding article,
the proposed method performs a targeted search in the database. The keywords for the search are
acquired from the given article. The relatively small list of candidates generated is then ranked
by comparing each candidate with the input article. Finally, according to a threshold parameter, the
highest ranked candidate is either selected or rejected as a match for the input entity. If the candidate
with the highest rank is rejected then we conclude that the input entity is not present in the database.
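The three sub-processes can be sketched end to end as below. This is a simplified outline rather than the method developed by UoY: the `link_entity` function, the word-sharing search and the Jaccard score are illustrative assumptions standing in for the targeted search and the per-entity models described in the following sections.

```python
def jaccard(text_a, text_b):
    """Assumed stand-in score: word overlap between two articles."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def shared_word_search(article, database):
    """Assumed stand-in search: candidates share at least one word."""
    words = set(article.lower().split())
    return [eid for eid, text in database.items()
            if words & set(text.lower().split())]


def link_entity(article, database, search, score, threshold):
    """Return the ID of the matching database entry, or None if absent.

    search(article, database) -> iterable of candidate IDs
    score(article, candidate_article) -> float, higher means more similar
    """
    candidates = list(search(article, database))      # candidate generation
    if not candidates:
        return None
    scored = [(score(article, database[cid]), cid) for cid in candidates]
    best_score, best_id = max(scored, key=lambda pair: pair[0])  # ranking
    # Entity selection: accept the top candidate only above the threshold.
    return best_id if best_score >= threshold else None
```

Returning `None` corresponds to the conclusion that the input entity is not present in the database.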
3.1. Generating similar entities
As we already know from Section 2, if two entities are the same then their corresponding articles will
have similar information content. Thus, to determine whether a given entity is present in the database
or not, we need to match the given article with articles in the database. For an input entity there may
be only a few entities in the database which show similarity to it. For example, Michael Jordan and
Michael Jackson may share some similarity (initials, common celebrity friends) but they will not
share similarity with Dennis Ritchie (the father of the C programming language). Thus, it is not
necessary to match the given article with all articles in the database. The match can be done with only
selected articles. These selected articles can be generated by a targeted search.
A search is performed in the database by using keywords extracted from the given article. Only those
documents which match the keywords are selected for further processing. For keywords we can
use the given entity, other entities in the given article and the relations between the given entity and
other entities. The extraction of relations is discussed in Section 3.2.1. Many freely available tools
are optimised to query a large database and return results in a few seconds. In our context
we use Lucene (see footnote 1) for searching the database.
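The idea behind the targeted search can be illustrated with a tiny inverted index. Note that this is a pure-Python stand-in and does not use or imitate the Lucene API; the function names `build_index` and `targeted_search`, and the rule that one matching keyword is enough, are assumptions made for the example.

```python
from collections import defaultdict


def build_index(database):
    """Map each lowercased word to the set of article IDs containing it."""
    index = defaultdict(set)
    for article_id, text in database.items():
        for word in text.lower().split():
            index[word].add(article_id)
    return index


def targeted_search(index, keywords):
    """Return the IDs of articles that contain at least one keyword."""
    hits = set()
    for keyword in keywords:
        # .get avoids creating empty entries for unseen keywords.
        hits |= index.get(keyword.lower(), set())
    return hits
```

The index is built once for the whole database, after which each query only touches the articles that mention one of its keywords, which is what keeps the candidate list small.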
3.2. Comparing entities
After we generate candidate articles from the database we need to assign a confidence value to each
article which indicates whether the article represents the same entity as the given input entity.
To generate such a classification we associate each entity in the database with a model. Such a model,
when presented with new articles representing some entity, is able to classify whether the new
articles represent the same entity as the one represented by the model.
3.2.1. Building a model representing entity
As already discussed, each unique entity in the knowledge base is represented by a different model. A
model represents an entity through various features, collected from the articles linked with the entity.
A model is created for each entity by training a classifier on the documents representing that
entity and also on documents representing other entities, as shown in Figure 5. Thus, the model
stores cases where a given entity is the same as the one linked with the model and also cases where it
is not. Each model is then used for the classification of new entities. A positive classification result
asserts that the new entity is the same as the one represented by the model. Model creation can be
divided into two steps, namely Feature Generation and Learning.
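The arrangement of training data per entity can be sketched as follows. The sketch only shows how positive and negative examples might be assembled for each model; it does not implement the classifier itself, and the function name `build_training_sets` and the 1/0 labels are assumptions made for illustration.

```python
def build_training_sets(articles_by_entity):
    """For each entity, label its own articles 1 and all other articles 0.

    articles_by_entity: dict mapping an entity name to a list of articles.
    Returns a dict mapping each entity to a list of (article, label) pairs
    that could be passed to any binary classifier.
    """
    training_sets = {}
    for entity, own_articles in articles_by_entity.items():
        # Positive examples: articles linked with this entity.
        examples = [(article, 1) for article in own_articles]
        # Negative examples: articles linked with every other entity.
        for other, other_articles in articles_by_entity.items():
            if other != entity:
                examples.extend((article, 0) for article in other_articles)
        training_sets[entity] = examples
    return training_sets
```

Each entity thus receives its own binary training set, so adding a new entity to the knowledge base means building one more set and training one more model.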
Feature Generation
An entity can be disambiguated by inspecting how it relates to other entities. If two entities are
the same then they exhibit similar relations to similar entities. As shown in Figure 6, the unknown
entity ’MJ’ has more entities in common with ’Michael Jordan’.
[Figure 5 diagram: articles for Entity A are provided as positive examples and articles for entities
other than A as negative examples; training on both produces the model for Entity A.]
Figure 5: Model creation by training with positive and negative examples
Footnote 1: http://lucene.apache.org/core/
In our setting, each entity is represented by its corresponding documents. For example, Figure 7 and
Figure 8 are two different documents for the same entity "SFAX". Features are generated by extracting
the words that link the given entity and another entity in the document. Disambiguation is performed
by matching the entities attached to the same feature. For example, the relation ’Sfax goals Hamza
Younes’ occurs in both Figure 7 and Figure 8, which increases the probability that "SFAX" in the two
documents represents the same entity, as the feature ’goals’ contains the same entity ’Hamza Younes’
in both. It should be noted that a relation does not need to be expressed by adjoining words; as in the
previous example, relations can span non-contiguous words within a sentence. Such relations can be
extracted through sub-sequence extraction.
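A minimal sketch of sub-sequence extraction is given below. It assumes the sentence has already been tokenised and that the positions of the two entities are known; the function name `relation_subsequences` and the `max_len` cap on sub-sequence length are illustrative assumptions, not details from the report.

```python
from itertools import combinations


def relation_subsequences(tokens, start, end, max_len=2):
    """Ordered (possibly non-contiguous) word sub-sequences between entities.

    tokens[start] is the target entity and tokens[end] the other entity;
    the words strictly between them are the relation candidates.
    """
    between = tokens[start + 1:end]
    subsequences = []
    for length in range(1, max_len + 1):
        # combinations() keeps the original word order within each pick.
        for picked in combinations(between, length):
            subsequences.append(" ".join(picked))
    return subsequences
```

For a simplified token list such as `["Sfax", "via", "goals", "from", "Hamza_Younes"]`, the candidates include the single word `goals` as well as non-contiguous pairs such as `via from`, which is how a relation like ’Sfax goals Hamza Younes’ can be recovered even when the words are not adjacent.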
Following the above intuition, feature generation in our setting begins with pre-processing
using Stanford’s Named Entity Recognition (NER) tool [1], Stanford’s coreference tagger [2, 3, 4] and
Stanford’s Part of Speech (PoS) tagger [5]. Next, by extracting sub-sequences we extract the relations
between the target entity and other entities in the document. Then, we create a feature vector from
the extracted sub-sequences. Figure 9 shows the application of NER and Figure 10 shows the pipeline
of the feature extraction phase. The feature extraction process generates a feature vector for each
entity. The value of each feature is the linking entity.
Learning a model
Once the feature vectors have been created we need to learn a model for each target entity. Learning is
done by matching the values of common relations (features) and then creating a generalised model. The
most straightforward form of matching is lexical matching, i.e. two relations are the same if they have
exactly the same string. Although in some ideal cases lexical matching will yield accurate results, it
will mostly fail since perfect lexical matches rarely occur. Additionally, the problem is compounded
because semantically similar but lexically different words are not matched. For example, in Figure 7
and Figure 8, even though words like "outclassed" and "overcame" are similar in meaning, they
are represented as two different features. Thus, to overcome the limitations of lexical matching we use
similarity matching, where two words are considered the same if they show a high similarity score.
MJ. Associated Entities: NBA, Basketball, Chicago Bulls, Nike
Michael Jackson. Associated Entities: Thriller, Songs, Indiana, Sony
Michael Jordan. Associated Entities: Basketball, Bulls, Nike, New York
Figure 6: Entities associated with unknown entity ’MJ’, Michael Jackson and Michael Jordan.
In this case, MJ is probably Michael Jordan.
Sfaxien outclassed Astres Douala of Cameroon 3-0 in the Mediterranean town of Sfax via goals from
Congolese Blaise 'Lelo' Mbele, Ivorian Blaise Kouassi and Hamza Younes after leading 1-0 at half-time.
Figure 7: Document 1 for entity "SFAX"
Sfaxien overcame Astres 3-0 in the Mediterranean city of Sfax Saturday through goals from Blaise
'Lelo' Mbele, Blaise Kouassi and Hamza Younes to remain top of the standings with 10 points, one
more than Mazembe.
Figure 8: Document 2 for entity "SFAX"
Sfaxien outclassed Astres Douala of Cameroon [ORGANISATION] 3-0 in the Mediterranean [LOCATION]
town of Sfax [LOCATION] via goals from Congolese Blaise [ORGANISATION] 'Lelo' Mbele [PERSON],
Ivorian Blaise Kouassi [PERSON] and Hamza Younes [PERSON] after leading 1-0 at half-time.
Figure 9: Assigning named entity labels using Stanford’s NER. The labels are either PERSON or
ORGANISATION or LOCATION.
Figure 10: Pipeline for feature vector creation.
The similarity score can be obtained from WordNet [6]. A group of words which show high
similarity to each other can then be represented by a single feature.
Figure 11 shows an example where many semantically similar words are collapsed to a single feature
label. This process reduces the number of relations we use and is termed dimensionality reduction.
We use a popular algorithm called Multidimensional Scaling (MDS) [7] to group words with high
similarity to each other.
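The effect of collapsing similar words into one feature label can be illustrated with the sketch below. It deliberately substitutes a much simpler technique for the one named above: instead of WordNet scores and MDS, it takes any pairwise similarity function and groups words by thresholding (connected components via union-find). The function name `group_similar_words`, the threshold and the label format `d1`, `d2`, ... are assumptions.

```python
def group_similar_words(words, similarity, threshold):
    """Group words whose pairwise similarity reaches the threshold.

    Words are linked transitively, so each group is a connected component.
    Returns a dict mapping every word to a feature label such as 'd1'.
    """
    parent = {word: word for word in words}

    def find(word):
        while parent[word] != word:
            parent[word] = parent[parent[word]]  # path halving
            word = parent[word]
        return word

    for i, first in enumerate(words):
        for second in words[i + 1:]:
            if similarity(first, second) >= threshold:
                parent[find(second)] = find(first)

    labels, feature_of = {}, {}
    for word in words:
        root = find(word)
        # Assign the next free label the first time a group is seen.
        labels.setdefault(root, "d%d" % (len(labels) + 1))
        feature_of[word] = labels[root]
    return feature_of
```

With a similarity function that rates "outclassed" and "overcame" as close, both words receive the same label, so the two documents in Figure 7 and Figure 8 would share that feature.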
[Figure 11 diagram: six groups of semantically similar words, each collapsed to one of the labels
Feature 1 to Feature 6:
• outclass, outdistance, outdo, outhustle, outmatch, outpace, outperform, outplay, outrank, outrival, outrun, outshine, outstrip
• accumulation, aggregation, assemblage, assembly, association, assortment, band, batch, battery, bevy, body, bunch, bundle, cartel, category
• affection, angle, article, aspect, attribute, character, component, constituent, detail, differential, element, facet, factor, gag, gimmick
• aboard, conjoin, consociate, correlate, couple, equate, fasten, get into, hitch on, hook on, hook up, interface, join, join up with, marry, meld with, network with, plug into, relate
• appellation, appellative, class, classification, cognomen, compellation, denomination, description, epithet, identification, key word
• bookish, college, collegiate, erudite, intellectual, learned, pedantic, scholarly, scholastic, studious, university]
Figure 11: Example of dimensionality reduction. Semantically similar words are collapsed to single
feature label.
Matching values of features
The next step in learning a model is to compare the values of the same feature. Suppose we have
relations for Entity 1 as {lives->UK, stays->Fulford}. If ’lives’ and ’stays’ have been reduced to a
feature d7, then <d7:(UK, Fulford)> is the feature vector. For Entity 2, if the relations are
{lives->UK, stays->Broadway}, then <d7:(UK, Broadway)> is the feature vector.
The model learns the similarity between Entity 1 and Entity 2 by comparing the value sets of
feature ’d7’, i.e. (UK, Fulford) and (UK, Broadway). Here too, perfect lexical matches will be
rare, and semantically similar values will be ignored if they do not match lexically.
Thus, to overcome this problem we define a function that returns the similarity score of a matching
pair.
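The report does not spell out the similarity function at this point, so the sketch below is only one possible shape for it. It assumes a soft match in which every value is paired with its best match in the other set and the scores are averaged; the character-based `difflib.SequenceMatcher` ratio is used purely as a placeholder for a semantic similarity measure, and both function names are hypothetical.

```python
from difflib import SequenceMatcher


def string_similarity(a, b):
    """Placeholder similarity in [0, 1]; character-based, not semantic."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def value_set_similarity(values_a, values_b):
    """Average, over both sets, of each value's best match in the other set."""
    if not values_a or not values_b:
        return 0.0
    best_a = [max(string_similarity(a, b) for b in values_b) for a in values_a]
    best_b = [max(string_similarity(b, a) for a in values_a) for b in values_b]
    return (sum(best_a) + sum(best_b)) / (len(best_a) + len(best_b))
```

For the value sets of feature ’d7’ above, the shared value UK contributes a perfect match while Fulford and Broadway contribute only a partial one, so the score lies between that of identical sets and that of unrelated sets instead of collapsing to a strict match or mismatch.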
3.3. Entity Selection
This is the final step, where we decide whether the given entity is present in the database or not. The
article representing the input entity is classified by the model of each candidate article. We can say
that the input entity is present in the database if it is classified as ’true’ by only one candidate model
and ’false’ by the rest of the models. If all of the models give a false classification then the given
entity is not present in the database.
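The selection rule can be written down directly, as in the sketch below. The function name `select_entity` and the returned status strings are hypothetical; in particular, the case where more than one model answers ’true’ is not covered by the text above, so reporting it as ’ambiguous’ is an assumption.

```python
def select_entity(classifications):
    """Apply the selection rule to {candidate_id: bool} classifier outputs.

    Returns (status, candidate_id): 'present' with the single accepting
    candidate, 'absent' if every model rejects, 'ambiguous' otherwise
    (the last case is an assumption, not specified in the text).
    """
    accepted = [cid for cid, is_same in classifications.items() if is_same]
    if len(accepted) == 1:
        return "present", accepted[0]
    if not accepted:
        return "absent", None
    return "ambiguous", None
```

An ’absent’ result is the point at which a new ID would be created for the input entity in the knowledge base.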
4. Datasets and Results
The model is trained, using the features and the process described above, on the TAC KBP
dataset [8]. The dataset consists of 3904 entity mentions, of which 560 are distinct entities. On
the test dataset the system showed an average accuracy of 69.25% for correctly identifying the entities.
The average accuracy of identifying entities not present in the dataset is 61%; this is mainly due to
the lack of training examples for these entities.
References
[1] J. R. Finkel, T. Grenager, and C. Manning, “Incorporating non-local information into
information extraction systems by Gibbs sampling,” in Proceedings of the 43rd Annual
Meeting on Association for Computational Linguistics, ser. ACL ’05. Stroudsburg, PA,
USA: Association for Computational Linguistics, 2005, pp. 363–370. [Online]. Available:
http://dx.doi.org/10.3115/1219840.1219885
[2] H. Lee, A. Chang, Y. Peirsman, N. Chambers, M. Surdeanu, and D. Jurafsky, “Deterministic
coreference resolution based on entity-centric, precision-ranked rules,” Computational Linguistics,
39(4), 2013.
[3] H. Lee, Y. Peirsman, A. Chang, N. Chambers, M. Surdeanu, and D. Jurafsky, “Stanford’s
multi-pass sieve coreference resolution system at the CoNLL-2011 shared task,” in Proceedings of the
CoNLL-2011 Shared Task, 2011.
[4] K. Raghunathan, H. Lee, S. Rangarajan, N. Chambers, M. Surdeanu, D. Jurafsky, and C. Manning,
“A multi-pass sieve for coreference resolution,” in EMNLP-2010, Boston, USA, 2010.
[5] K. Toutanova, D. Klein, C. Manning, and Y. Singer, “Feature-rich part-of-speech tagging with a
cyclic dependency network,” in HLT-NAACL 2003, pp. 252–259, 2003.
[6] G. A. Miller, “WordNet: A lexical database for English,” Communications of the ACM, vol. 38,
no. 11, pp. 39–41, 1995.
[7] F. Wickelmaier, “An introduction to MDS,” Reports from the Sound Quality Research Unit
(SQRU), No. 7, 2003.
[8] P. McNamee and H. T. Dang, “Overview of the TAC 2009 knowledge base population track,”
in Proceedings of the 2009 Text Analysis Conference. National Institute of Standards and
Technology, Nov. 2009.
Document Updates
Table 2: Document Updates
Version | Date | Updates and Revision History | Author
20130508 | 09/05/2013 | Introduction, Evaluation and conclusion | Suraj Jung Pandey
20130530 | 30/05/2013 | Changes as suggested by Ethical and Security Reviewers | Suraj Jung Pandey
20130612 | 12/06/2013 | Addition of figures and changes throughout | Suraj Jung Pandey
20130619 | 19/06/2013 | Changes throughout as Suggested by Suresh Manandhar | Suraj Jung Pandey
20130702 | 02/07/2013 | Updated list of reviewers | Suraj Jung Pandey