Poster
Transcription
Boosting Cross-lingual Knowledge Linking via Concept Annotation

Cross-lingual knowledge linking is to automatically discover cross-lingual links (CLs) between wikis, which can largely enrich the cross-lingual knowledge and facilitate sharing knowledge across different languages. Seed CLs and the inner link structures are two important factors for finding new CLs. A challenging problem is how to find large-scale CLs between wikis (e.g., English Wikipedia and Baidu Baike) when there are insufficient seed CLs and inner links.

2.2 Concept Annotation

Important concepts in a wiki article are usually annotated by editors with hyperlinks to their corresponding articles in the same wiki database. These inner links guide readers to articles that provide related information about the subject of the current article. The linkage structure was often used to estimate similarity between articles in traditional knowledge linking approaches, e.g., [Wang et al., 2012]. However, there might be missing inner links in the wiki articles, especially when the articles are newly created or have fewer editors. In this circumstance, link-based features cannot accurately assess the structure-based similarity between articles. Therefore, our basic idea is to first use a concept annotator to enrich the inner links in wikis before finding new CLs.
Recently, several approaches have been proposed for annotating documents with concepts in Wikipedia [Mihalcea and Csomai, 2007; Milne and Witten, 2008; Kulkarni et al., 2009; Shen et al., 2012]. These approaches aim to identify important Wikipedia concepts in the given documents and link them to the corresponding articles in Wikipedia. The concept annotator in our approach provides a similar function to existing approaches. However, instead of discovering links from an external document to a wiki, our concept annotator focuses on enriching the inner links in a wiki. Therefore, there will be some existing links in the articles before annotating. In our approach, the inputs of the concept annotator are two wikis and a set of CLs between them. The concept annotation process consists of two basic steps: (1) concept extraction, and (2) annotation.
Concept Extraction. In this step, possible concepts that might refer to other articles in the wiki are identified in an input article. The concepts are extracted by matching all the n-grams in the input article with the elements in a controlled vocabulary. The vocabulary contains the titles and the anchor texts of all the articles in the input wiki. For an input article a, the results of concept extraction contain a set of concepts C_a = {c_i} and the sets of candidate articles {T_c1, T_c2, ..., T_ck} that each concept might link to.
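The n-gram matching step above can be sketched as follows. This is a minimal illustration, not the poster's implementation: the token-based matching, the function name, and the `vocabulary` dict (surface form → candidate articles) are our assumptions.

```python
def extract_concepts(tokens, vocabulary, max_n=4):
    """Match every n-gram of an article against a controlled vocabulary.

    `vocabulary` maps a surface form (an article title or anchor text)
    to its set of candidate destination articles T_c.
    Returns {concept: candidate articles}, i.e. the pair (C_a, {T_c}).
    """
    candidates = {}
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            if gram in vocabulary:
                candidates[gram] = vocabulary[gram]
    return candidates
```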
Annotation. In the annotation step, links from the concepts to the destination articles are identified. Here we use two metrics, the Link Probability and the Semantic Relatedness, to decide the correct links for each concept. The Link Probability measures the possibility that an article is the destination of a given concept; the Semantic Relatedness measures the relatedness between the candidate destination article and the surrounding context of the concept. Formally, these two metrics are defined as follows.
Definition 1 Link Probability. Given a concept c identified in an article, the Link Probability from this concept c to another article a is defined by the approximated conditional probability of a given c:

    LP(a|c) = count(a, c) / count(c)    (1)

where count(a, c) denotes the number of times that c links to a, and count(c) denotes the number of times that c appears in the whole wiki.
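Equation (1) is a ratio of two corpus counts; a minimal sketch, where the function and argument names are ours and the zero-count guard is our addition:

```python
def link_probability(count_ac, count_c):
    """LP(a|c) = count(a, c) / count(c), Eq. (1).

    count_ac: number of times the concept c links to article a in the wiki;
    count_c: number of times c appears in the whole wiki (linked or not).
    """
    return count_ac / count_c if count_c > 0 else 0.0
```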
Definition 2 Semantic Relatedness. Given a concept c, let N_c be the existing annotations in the same article with c. The Semantic Relatedness between a candidate destination article a and the context annotations of c is computed as:

    SR(a, c) = (1 / |N_c|) · Σ_{b ∈ N_c} r(a, b)    (2)

    r(a, b) = 1 − [log(max(|I_a|, |I_b|)) − log(|I_a ∩ I_b|)] / [log(|W|) − log(min(|I_a|, |I_b|))]    (3)

where I_a and I_b are the sets of inlinks of article a and article b, respectively, and W is the set of all articles in the input wiki.
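Equations (2)-(3) can be sketched over plain Python sets as follows; the `inlinks` dict (article → inlink set) and the guard for disjoint inlink sets are our assumptions, not the poster's code.

```python
import math

def relatedness(inlinks_a, inlinks_b, total_articles):
    """r(a, b) of Eq. (3), computed from the two articles' inlink sets."""
    common = len(inlinks_a & inlinks_b)
    if common == 0:
        return 0.0  # log(0) is undefined; treat disjoint inlinks as unrelated
    larger = max(len(inlinks_a), len(inlinks_b))
    smaller = min(len(inlinks_a), len(inlinks_b))
    return 1 - (math.log(larger) - math.log(common)) / (
        math.log(total_articles) - math.log(smaller))

def semantic_relatedness(article, context, inlinks, total_articles):
    """SR(a, c) of Eq. (2): mean relatedness between the candidate article
    and the context annotations N_c of the concept."""
    if not context:
        return 0.0
    return sum(relatedness(inlinks[article], inlinks[b], total_articles)
               for b in context) / len(context)
```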
The Semantic Relatedness metric has been used in several approaches for linking documents to Wikipedia. The inner link structures in the wiki are used to compute the Semantic Relatedness. One thing worth mentioning is that when there are insufficient inner links, this metric might not be accurate enough to reflect the relatedness between articles. To this end, our approach first utilizes the known CLs to merge the link structures of two wikis, which results in an Integrated Concept Network; the Semantic Relatedness is then computed based on the links in the Integrated Concept Network, which is formally defined as follows.

Definition 3 Integrated Concept Network. Given two wikis W_1 = (A_1, E_1) and W_2 = (A_2, E_2), where A_1 and A_2 are sets of articles and E_1 and E_2 are sets of links in W_1 and W_2, respectively, let CL = {(a_i, b_i) | a_i ∈ B_1, b_i ∈ B_2} be the set of cross-lingual links between W_1 and W_2, where B_1 ⊆ A_1 and B_2 ⊆ A_2. The Integrated Concept Network of W_1 and W_2 is ICN(W_1, W_2) = (V, L), where V = A_1 ∪ A_2 and L is a set of links established as follows:

    (v, v') ∈ E_1 ∨ (v, v') ∈ E_2 → (v, v') ∈ L
    (v, v') ∈ E_1 ∧ (v, b) ∈ CL ∧ (v', b') ∈ CL → (b, b') ∈ L
    (v, v') ∈ E_2 ∧ (a, v) ∈ CL ∧ (a', v') ∈ CL → (a, a') ∈ L
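The three link-formation rules of Definition 3 amount to keeping both wikis' native links and projecting each link across the seed CLs. A sketch under the assumption that the CLs form a one-to-one dict from W_1 articles to their W_2 counterparts:

```python
def build_icn(edges1, edges2, cl):
    """Integrated Concept Network of Definition 3.

    edges1, edges2: sets of (source, target) article pairs in W1 and W2;
    cl: dict mapping a W1 article to its cross-lingual counterpart in W2.
    """
    inverse_cl = {b: a for a, b in cl.items()}
    links = set(edges1) | set(edges2)      # rule 1: keep all native links
    for u, v in edges1:                    # rule 2: project W1 links into W2
        if u in cl and v in cl:
            links.add((cl[u], cl[v]))
    for u, v in edges2:                    # rule 3: project W2 links into W1
        if u in inverse_cl and v in inverse_cl:
            links.add((inverse_cl[u], inverse_cl[v]))
    return links
```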
In the Integrated Concept Network, the link structures of the two wikis are merged by using CLs. The Semantic Relatedness computed based on the Integrated Concept Network is more reliable than that computed on a single wiki. In order to balance the Link Probability and the Semantic Relatedness, our approach uses an algorithm that greedily selects the articles that maximize the product of the two metrics, based on the already known annotations in a wiki. Algorithm 1 outlines the algorithm of the concept annotator.

Algorithm 1: Concept annotation algorithm.
Input: A wiki W = (A, E)
Output: Annotated wiki W' = (A, E')
for each article a ∈ A do
    Extract a set of concepts C_a in a;
    Get the existing annotations N_a in a;
    for each c_i ∈ C_a do
        Find the set of candidate articles T_ci of c_i;
        Get b* = arg max_{b ∈ T_ci} LP(b|c_i) × SR(b, N_a);
        N_a = N_a ∪ {b*};
    end
    for each b_j ∈ N_a do
        if <a, b_j> ∉ E then
            E = E ∪ {<a, b_j>}
        end
    end
end
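The inner loop of Algorithm 1 can be sketched per article as below; `lp` and `sr` stand for the two metrics and are passed in as callables, and the skip of zero-scoring concepts is our addition to keep the sketch self-contained.

```python
def annotate_article(concepts, candidates, existing_annotations, lp, sr):
    """Greedy annotation of one article, following Algorithm 1.

    concepts: the extracted concepts C_a;
    candidates: dict concept -> candidate articles T_c;
    existing_annotations: N_a, the article's current link targets;
    lp(b, c), sr(b, context): Link Probability and Semantic Relatedness.
    Returns the chosen destination article for each annotated concept.
    """
    context = set(existing_annotations)   # N_a grows as links are chosen
    chosen = {}
    for c in concepts:
        scored = [(lp(b, c) * sr(b, context), b) for b in candidates.get(c, ())]
        if not scored:
            continue
        score, best = max(scored)
        if score > 0:                     # guard added here
            chosen[c] = best
            context.add(best)
    return chosen
```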
2.3 Cross-lingual Links Prediction

[Figure 2: system architecture. The input wikis feed the Concept Annotator (concept extraction by n-gram matching against a concept-article vocabulary, then annotation), whose output feeds the CL Predictor (training, regression, incremental predicting), which produces new cross-lingual links. Example entities shown: 中国 (China), 清华大学 (Tsinghua University), 北京 (Beijing), 故宫 (Forbidden City), 郭金龙 (Guo Jinlong).]

As shown in Figure 2, the CL predictor takes two wikis and a set of seed CLs between them as inputs, and then predicts a set of new CLs. This subsection first introduces the definitions of the different similarity features, and then describes the learning method for predicting new CLs.

Language-independent Features. Six features are defined to assess the similarities between articles by using different kinds of information. All of these features are based on link structures and therefore are language-independent.
Feature 1: Outlink similarity. Outlinks of an article correspond to a set of other articles that it links to. The outlink similarity computes the similarities between articles by comparing elements in their outlinks. Given two articles a ∈ W_1 and b ∈ W_2, let O(a) and O(b) be the outlinks of a and b respectively. The outlink similarity is computed as

    f_1(a, b) = 2 · |φ_{1→2}(O(a)) ∩ O(b)| / (|φ_{1→2}(O(a))| + |O(b)|)    (4)

where φ_{1→2}(·) is a function that maps articles in W_1 to their corresponding articles in W_2 if there are CLs between them.
Feature 2: Outlink similarity (abstracts and infoboxes). Given two articles a ∈ W_1 and b ∈ W_2, let O+(a) and O+(b) be the outlinks in the abstracts and infoboxes of a and b respectively. The outlink similarity is computed as

    f_2(a, b) = 2 · |φ_{1→2}(O+(a)) ∩ O+(b)| / (|φ_{1→2}(O+(a))| + |O+(b)|)    (5)
Feature 3: Inlink similarity. Inlinks of an article correspond to a set of other articles linking to it. Given two articles a ∈ W_1 and b ∈ W_2, let I(a) and I(b) denote the inlinks of the two articles. The inlink similarity is computed as

    f_3(a, b) = 2 · |φ_{1→2}(I(a)) ∩ I(b)| / (|φ_{1→2}(I(a))| + |I(b)|)    (6)
Given a concept c, let is computed as d categories of two articles, category similarity is computed + + (I (a)) \ I (b)| old ! to decide whether an article pair should have a CL. For 1!2 two articles’ ⇤ 0 thethe Algorithm 1: Concept annotation algorithm. + Annota)on p redic)on f (a, b) = (5) 2 · | (O(a)) \ O(b)| Let I (a) and I (b) denote inlinks that are different similarity features, and then describes learning are archived in August 2012, English Wikipedia contains challeng 2 1!2 where the N = N [ {b }; probability of a given c: f (a, b) = (4) f (a, b) = (7) egories of a and b respectively. Category similarity coma a seed CLs were extracted from the rest of 23 + + 4 1 Given two articles a 2 W and b 2 W , let O(a) and O(b) 2 · | (I(a)) \ I(b)| The CL predictor in our approach computes the weighted sum N be the existing annotations in the same article with c, the 1 2 1!2 + + | (O (a))| + |O (b)| c Definition 2 Semantic Relatedness. Given a concept c, let as + f (a, b) = (4) 1!2 cle are usually annotated by | (I (a))| + |I (b)| this purpose, a scoring function is defined: 1 end | (O(a))| + |O(b)| 1!2 method for predicting new CLs. 4 million articles, and the Chinese Wikipedia contains 499 f (a, b) = (6) from other articles’ abstracts and infoboxes, the inlink simput y is a Here, 1!2 2| (E(c )) \ E(c )| 3 i CLs between English Wikipedia and Chinese putes similarities between categories by using CLs between Input: A wiki W = (A, E) 1!2 1 2 | (O(a))| + |O(b)| be the outlinks of a and b respectively. The outlink similarity Semantic Relatedness between a candidate destination article different between articles, and applies a(cthresh~ Ncinbethe the existing annotations in the same the of(O(a)) 1!2 | 1!2 (I(a))| + |I(b)| 8 2 · |c, 1!2 \ similarities O(b)| thousand count(a, c) article with orresponding Feature 3: Inlink similarity ly annotated byarticles b) = ! + ! 
Feature 5: Category similarity. Categories are tags attached to articles, which represent the topics of the articles' subjects. Let C(a) and C(b) denote the categories of the two articles. The category similarity is computed as

    f_5(a, b) = 2 · |φ_{1→2}(C(a)) ∩ C(b)| / (|φ_{1→2}(C(a))| + |C(b)|)    (8)

Here φ_{1→2}(·) maps categories from one wiki to another wiki by using the CLs between categories, which can be obtained from Wikipedia.
Feature 6: Category similarity (via article CLs). Given two articles a and b, let C(a) and C(b) be the categories of a and b respectively. This feature computes similarities between categories by using the CLs between articles, and averages them over all category pairs:

    f_6(a, b) = (1 / nm) · Σ_{i=1..n} Σ_{j=1..m} σ(c_i^a, c_j^b)    (9)

where c_i^a ∈ C(a), c_j^b ∈ C(b), n = |C(a)|, and m = |C(b)|. Let E(c_1) and E(c_2) be the sets of articles belonging to categories c_1 and c_2 respectively; the similarity between two categories is computed as

    σ(c_1, c_2) = 2 · |φ_{1→2}(E(c_1)) ∩ E(c_2)| / (|φ_{1→2}(E(c_1))| + |E(c_2)|)    (10)
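Equations (9)-(10) compose as below. The `extensions1`/`extensions2` dicts (category → member articles) and the treatment of empty category sets are our framing, not the poster's code.

```python
def category_pair_similarity(c1, c2, extensions1, extensions2, phi):
    """sigma(c1, c2) of Eq. (10): Dice similarity of the categories'
    article sets E(c1), E(c2), after mapping E(c1) through the article CLs."""
    mapped = {a2 for a1 in extensions1.get(c1, set())
              if (a2 := phi.get(a1)) is not None}
    members2 = extensions2.get(c2, set())
    denominator = len(mapped) + len(members2)
    if denominator == 0:
        return 0.0
    return 2 * len(mapped & members2) / denominator

def f6(cats_a, cats_b, extensions1, extensions2, phi):
    """Feature 6, Eq. (9): average sigma over all category pairs."""
    if not cats_a or not cats_b:
        return 0.0
    total = sum(category_pair_similarity(ca, cb, extensions1, extensions2, phi)
                for ca in cats_a for cb in cats_b)
    return total / (len(cats_a) * len(cats_b))
```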
Regression-based Model for Predicting New CLs. The CL predictor in our approach computes the weighted sum of the different similarities between articles, and applies a threshold to decide whether an article pair should have a CL. For this purpose, a scoring function is defined:

    s(a, b) = ω_0 + ω⃗ · f⃗_{a,b} = ω_0 + ω_1 × f_1(a, b) + ... + ω_6 × f_6(a, b)    (11)

For each article a in W_1, the article b* in W_2 that maximizes the score function s(a, b*) and satisfies s(a, b*) > 0 is predicted as the corresponding article of a.
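The scoring and prediction rule can be sketched as follows, assuming the six feature values for each candidate pair are precomputed (the container names are ours):

```python
def predict_new_cls(candidate_features, weights, w0):
    """For each article a in W1, score every candidate b in W2 with Eq. (11)
    and keep the highest-scoring b* whose score satisfies s(a, b*) > 0.

    candidate_features: dict (a, b) -> feature vector [f1, ..., f6].
    """
    best = {}
    for (a, b), features in candidate_features.items():
        s = w0 + sum(w * f for w, f in zip(weights, features))
        if s > 0 and (a not in best or s > best[a][0]):
            best[a] = (s, b)
    return {a: b for a, (_, b) in best.items()}
```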
cir- +Annotated 1 2 + + f (a, b) = (7) d Output: W = (A, E ) Find the set of candidate articles T of c ; log(max(|I |, |I |)) log(|I \ I |) 4 topics of the articles’ subjects. Let C(a) and denote F1-me mizes the score function s(a, b ) and satisfies s(a, b ) > 0 c i with the SVM classification model, which was CL prediction is simple and straightforward, but how to apFeature 6 Category similarity a b log(|W b a b i + + categories of two entities, the category similarity is computed + + respectively. The outlink similarity is computed as |) log(min(|I |, |I |)) a b Feature 2: Outlink similarity |and (I (a))| + |I (b)|buta how Feature method, 4: is Inlink similarity For thea)evaluation of the Annotation we ranr(a, (3)b⇤ = arg O outlinks (b) be the outlinks in theGet abstracts and infoboxes of and ors. In accurately this cir- assess 1!2 ⇤Concept annot theb) = 1 be the put1 sh yali of a and b respectively. The outlink similarity CL prediction simple straightforward, to apwhich of different similarities between articles, and applies threshmax LP (b|c ⇥ SR(b, N ); ous knowledge linking approaches. Table For each article a in W , the article b in W that maxi2| (E(c )) \ E(c )| b2T i a Regression M odel f or P redicting N ew C Ls log(|W |) log(min(|I |, |I |)) + + 1 2 is predicted as the corresponding article of a. The idea of + + + Given two articles a and b, let C(a) and C(b) be the catc 1!2 1 2 propriately set the weights of different similarity features is a a b(a)) as + O (b)| i domly selected 1000 articles from Chinese Wikipedia. There Let I (a) and I (b) denote two articles’ inlinks that(10) are Given two articles a 2 W and b 2 W , let O (a) and ately assess the 2 · | (O \ en articles. Therefore, our 1 2 (c , c ) = 1!2 b respectively. 
The outlink similarity is computed as ⇤ ⇤ Feature 5 Category similarity mode S R(a,c) i s c omputed b ased o n t he l inks i n t he I ntegrated C oncept N etwork ⇤ where I and I are the sets of inlinks of article a and article 1 2 + imental results. a b + + propriately set the weights of different similarity features is a for each article a 2 A do mizes the score function s(a, b ) and satisfies s(a, b ) > 0 is computed as N = N [ {b }; + old ! to decide whether an article pair should have a CL. For egories of aarticles’ and b respectively. Category similarity comf (a, b) = (5) a2 · | 1!2 a | (E(c ))| + |E(c )| CL prediction is simple and straightforward, but how to ap2 · | (C(a)) \ C(b)| 0 challenging problem. (O (a)) \ O (b)| 2 are 12,723 manually created inner links in these selected arti1!2 1 2 Therefore, ourto 1!2 from other abstracts and infoboxes, the inlink simO+ (b) beenrich the outlinks the abstracts and infoboxes of a and cept annotator the where Ia andin I are the sets of inlinks of article a and article + + Feature 1: Outlink similarity For each ar)cle a in W , the ar)cle b* in W that maximizes the following b, brespectively; and| W1!2 is the set of all articles in predicted the input Categories are tags attached to articles, which represent the tures. (O (a))| + |O (b)| 1 2 f (a, b)(= (8) Extract a set of concepts C in a; Get the existing In each group of the experiment, our re f (a, b) = (5) is as the corresponding article of a. The idea of 5 end challenging problem. a 2 putes similarities between categories by using CLs between + cles. 70% of the existing inner links in the articles were ranor to enrich the this purpose, a scoring function is defined: + + ilarity is computed as propriately set the weights of different similarity features is a + + Here, we propose a regression-based model to learn the b, respectively; and W is the set of all articles in the input g new CLs. a|C(b)| b b respectively. 
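The merging rules of Definition 3 and the inlink-based relatedness of Eq. (3) can be sketched in code. The following is a minimal, illustrative Python sketch (function and variable names are our own, not from the paper's implementation): sets of link pairs stand in for the wikis, and a dict stands in for the CL mapping.

```python
import math

def integrated_links(E1, E2, CL):
    """Build the link set L of ICN(W1, W2) per the three rules of
    Definition 3. E1, E2 are sets of (article, article) inner links;
    CL is a set of (a, b) cross-lingual pairs with a in W1, b in W2."""
    to2 = dict(CL)                      # phi_{1->2} as a lookup table
    to1 = {b: a for a, b in CL}         # phi_{2->1}
    L = set(E1) | set(E2)               # rule 1: keep both wikis' links
    # rule 2: project W1 links onto W2 wherever both endpoints have CLs
    L |= {(to2[v], to2[w]) for v, w in E1 if v in to2 and w in to2}
    # rule 3: symmetrically, project W2 links back onto W1
    L |= {(to1[v], to1[w]) for v, w in E2 if v in to1 and w in to1}
    return L

def relatedness(Ia, Ib, n_articles):
    """Eq. (3): r(a, b) from the inlink sets Ia, Ib collected over the
    Integrated Concept Network; n_articles is |W|. Returns 0 when the
    articles share no inlinks (the log would be undefined)."""
    shared = len(Ia & Ib)
    if shared == 0:
        return 0.0
    num = math.log(max(len(Ia), len(Ib))) - math.log(shared)
    den = math.log(n_articles) - math.log(min(len(Ia), len(Ib)))
    return 1.0 - num / den
```

SR(a, c) is then the average of `relatedness` over the context annotations N_c, as in Eq. (2).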
In the Integrated Concept Network, the link structures of the two wikis are merged by using CLs. The Semantic Relatedness computed based on the Integrated Concept Network is more reliable than that computed on a single wiki. In order to balance the Link Probability and the Semantic Relatedness, our approach uses an algorithm that greedily selects the articles that maximize the product of the two metrics; Algorithm 1 outlines the algorithm in the concept annotator. For each concept c_i identified in an article a, the candidate destination article is chosen as

b* = argmax_{b ∈ T_{c_i}} LP(b | c_i) × SR(b, N_a)

One advantage of our concept annotation method is to leverage the cross-lingual information to enhance the concept annotation.

2.3 Cross-lingual Link Prediction

As shown in Figure 2, the CL predictor takes two wikis and a set of seed CLs between them as inputs, then it predicts a set of new CLs. This subsection first introduces the definitions of the similarity features, and then the regression-based prediction model. For each article a in W1, the article b* in W2 that maximizes the score function s(a, b*) and satisfies s(a, b*) > 0 is predicted as the corresponding article of a. The idea of CL prediction is simple and straightforward, but how to appropriately set the weights of the different similarity features is a challenging problem.

Six features are defined to assess the similarities between articles.

Feature 1: Outlink similarity. Given two articles a ∈ W1 and b ∈ W2, let O⁺(a) and O⁺(b) be the outlinks in the abstracts and infoboxes of a and b, respectively. The outlink similarity is computed as

f1(a, b) = 2 · |φ_{1→2}(O⁺(a)) ∩ O⁺(b)| / (|φ_{1→2}(O⁺(a))| + |O⁺(b)|)   (4)

where φ_{1→2}(·) is a function that maps articles in W1 to their corresponding articles in W2 if there are CLs between them.

Feature 2: Outlink similarity. Outlinks of an article correspond to the set of other articles that it links to; the outlink similarity computes the similarities between articles by comparing elements in their outlinks. Given two articles a ∈ W1 and b ∈ W2, let O(a) and O(b) be the outlinks of a and b, respectively:

f2(a, b) = 2 · |φ_{1→2}(O(a)) ∩ O(b)| / (|φ_{1→2}(O(a))| + |O(b)|)   (5)
Feature 3: Inlink similarity. Inlinks of an article correspond to the set of other articles linking to it. Let I(a) and I(b) denote the inlinks of two articles; the inlink similarity is computed as

f3(a, b) = 2 · |φ_{1→2}(I(a)) ∩ I(b)| / (|φ_{1→2}(I(a))| + |I(b)|)   (6)

Feature 4: Inlink similarity. Let I⁺(a) and I⁺(b) denote the two articles' inlinks that are from other articles' abstracts and infoboxes; the inlink similarity is computed as

f4(a, b) = 2 · |φ_{1→2}(I⁺(a)) ∩ I⁺(b)| / (|φ_{1→2}(I⁺(a))| + |I⁺(b)|)   (7)
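Features 1 through 4 share one shape: a Dice coefficient between two link sets after the wiki-1 side has been mapped into wiki-2 through the seed CLs. A minimal sketch, assuming the mapping φ_{1→2} is given as a plain dict (an illustration, not the paper's implementation):

```python
def link_similarity(links_a, links_b, phi):
    """Common form of features 1-4 (Eqs. 4-7): Dice coefficient between
    two articles' link sets. `phi` is an illustrative dict standing in
    for phi_{1->2}; wiki-1 articles without a CL have no image and are
    simply dropped from the comparison."""
    mapped = {phi[x] for x in links_a if x in phi}
    if not mapped and not links_b:
        return 0.0                      # avoid 0/0 when both sets are empty
    return 2 * len(mapped & links_b) / (len(mapped) + len(links_b))
```

Calling it with the O⁺/O/I/I⁺ sets of an article pair yields f1 through f4, respectively.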
Feature 5: Category similarity. Categories are tags attached to articles, which represent the topics of the articles' subjects. Let C(a) and C(b) denote the categories of a and b; the category similarity is computed as

f5(a, b) = 2 · |φ_{1→2}(C(a)) ∩ C(b)| / (|φ_{1→2}(C(a))| + |C(b)|)   (8)

Here φ_{1→2}(·) maps categories from one wiki to the other wiki by using the CLs between categories, which can be obtained from Wikipedia.

Feature 6: Category similarity. Given two articles a and b, let C(a) and C(b) be the categories of a and b, respectively. This feature computes similarities between the categories by using the CLs between categories:

f6(a, b) = (1 / nm) · Σ_{i=1}^{n} Σ_{j=1}^{m} σ(c_i, c_j)   (9)

where c_i ∈ C(a), c_j ∈ C(b), n = |C(a)| and m = |C(b)|. Let E(c_1) and E(c_2) be the sets of articles belonging to categories c_1 and c_2, respectively; the similarity between two categories is computed as

σ(c_1, c_2) = 2 · |φ_{1→2}(E(c_1)) ∩ E(c_2)| / (|φ_{1→2}(E(c_1))| + |E(c_2)|)   (10)

All the features are based on the link structures of the wikis and therefore are language-independent.

The CL predictor in our approach computes the weighted sum of the different similarities between articles, and applies a threshold to decide whether an article pair should have a CL. For this purpose, a scoring function is defined:

s(a, b) = ω0 + ω⃗ · f⃗_{a,b} = ω0 + ω1 × f1(a, b) + ... + ω6 × f6(a, b)   (11)
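The double sum of Eqs. (9)–(10) can be sketched as follows; a minimal illustration, assuming a dict `members` gives each category's article set E(c) and a dict `phi` stands in for φ_{1→2} on articles (all names are our own):

```python
def category_similarity(cats_a, cats_b, members, phi):
    """Feature 6 (Eqs. 9-10): average the pairwise category similarity
    sigma over C(a) x C(b). `members` maps a category c to its article
    set E(c); `phi` is an illustrative dict for phi_{1->2}."""
    def sigma(c1, c2):                  # Eq. (10): Dice over member sets
        mapped = {phi[x] for x in members[c1] if x in phi}
        if not mapped and not members[c2]:
            return 0.0
        return 2 * len(mapped & members[c2]) / (len(mapped) + len(members[c2]))

    if not cats_a or not cats_b:
        return 0.0                      # an article may carry no category
    total = sum(sigma(ci, cj) for ci in cats_a for cj in cats_b)
    return total / (len(cats_a) * len(cats_b))
```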
Regression-based Model for Predicting New CLs

Here, we propose a regression-based model to learn the weights of the features based on a set of known CLs. Given a set of CLs {(a_i, b_i)}_{i=1}^n as training data, let A = {a_i}_{i=1}^n and B = {b_i}_{i=1}^n; our model tries to find the optimal weights that ensure:

∀ a_i ∈ A, ∀ b′ ∈ (B − {b_i}):  s(a_i, b_i) − s(a_i, b′) > 0

which also means

ω⃗ · (f⃗_{a_i,b_i} − f⃗_{a_i,b′}) > 0

Therefore, we generate a new dataset D = {(x_i, y_i)}, where the input vector x_i = (f⃗_{a_i,b_i} − f⃗_{a_i,b_{j≠i}}) and the target output y_i is always set to 1. Then we train a linear regression model on the dataset D to get the weights of the different features. The threshold ω0 is set to the value that maximizes the F1-measure on the training CLs.
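The weight-learning step can be sketched with ordinary least squares; a minimal illustration in the spirit of the model above (the data layout `F[i][j]` holding the feature vector f(a_i, b_j) is our own assumption, not the paper's code):

```python
import numpy as np

def learn_weights(F):
    """For each training CL (a_i, b_i), pair it against every wrong
    article b_j (j != i), form the difference x = f(a_i, b_i) - f(a_i, b_j),
    and regress all differences onto the constant target y = 1."""
    X, y = [], []
    n = len(F)
    for i in range(n):
        for j in range(n):
            if j != i:
                X.append(np.asarray(F[i][i]) - np.asarray(F[i][j]))
                y.append(1.0)
    w, *_ = np.linalg.lstsq(np.asarray(X), np.asarray(y), rcond=None)
    return w                            # omega_1 .. omega_k

def score(w, w0, f):
    """Eq. (11): s(a, b) = omega_0 + w . f(a, b); predict a CL iff s > 0.
    The intercept w0 is tuned to maximize F1 on the training CLs."""
    return w0 + float(np.dot(w, np.asarray(f)))
```

Larger differences on a feature that separates correct from wrong pairings receive larger weights, which is exactly the ordering constraint the model asks for.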
Figure 4ofshows the n which also means is predicted as the corresponding article The idea #Seed CLs RelatedModel Precision Reccall F1-measure Precision Recall F1-measure 60%) y and the Semantic 100,000" + + + mantic Relatedness computed based on Integrated Concept Network is more ness a computed based on Integrated Concept Network is more + wikis are sparse b respectively. The outlink similarity is computed as an article pair should have a CL. For discovered CLs in each run. cles 2 W and b 2 W , let O(a) and O(b) be the outlinks Let I (a) and I (b) denote two articles’ inlinks that are Feature 4: Inlink similarity old ! to decide whether 1 2 Blue bars corres 0 50%) 0.05 Mil. CLs SVM 92.1 35.0 50.7 78.5 37.2 50.5 which also means for The Link CL prediction is simple and straightforward, but how to apcept.each Theconcept. Link reliable than on a single wiki. In order to balance the Link reliable than on a single wiki. In order to balance the 80,000" + Link + + Let I (a) and I (b) denote two articles’ inlinks that are of a and b respectively. The outlink similarity is computed as results without concept annotation, and red ba 40%) lingual links are this purpose, a scoring function is defined: from other articles’ abstracts and infoboxes, the inlink simRM 93.3 36.0 52.0 92.4 38.6 54.5 thatis an is the nilities article thearticle + + Probability and the Semantic Relatedness, propriately! set· (the weightsf~ of different similarity features is a Probability and the Semantic Relatedness, our approachour usesapproach2uses ~ + 60,000" · | (O (a)) \ O (b)| 30%) 1!2 ~ f ) > 0 sults when the concept annotation was perform from other articles’ abstracts and infoboxes, the inlink sima ,b a ,b all the possible cr i i i ‘ Mil. 
CLsan based SVM 79.7 35.0 48.6 86.9 50.4 63.8 which is0.10 approximated ~ ~ oximated based f (a, b) = (5) an algorithm that greedily selects the articles that maximize algorithm that greedily the (O(a)) articles that maximize ilarity is 2selects computed as ilarity challenging problem. 2 ! ~ · ( f f ) > 0 20%) · | 1!2 \ O(b)| ai ,bi ai ,b‘ + (a))| + |O + (b)| 40,000" ~ is computed as | (O According to the results, new CLs can be 1!2 s(a, ~10%)· fa,b RM of 84.6 37.4 65.3b) = !0 + ! nsThe in aSemantic wiki. The Semantic f1the (a, b) metrics. =of the (4) the product two Algorithm outlines the96.6 the product two metrics.151.9 Algorithm 1 aloutlines49.3 the althis problem, w n Here, we propose a regression-based model to learn the n (11) Therefore, we generate a new dataset D = {(x , y )} , | (O(a))| + |O(b)| 20,000" i i each run no matter the concept annotation wa Therefore, we generate a new dataset D = {(x , y )} , i=1 1!2 es candidate desbetween candidate desi i 0.15the Mil. CLsgorithm SVM 80.9 35.9 49.7 88.1 57.3 69.5 0%) in the conceptinannotator. gorithm the concept annotator.2Feature + + i=1 3: Inlink similarity + + concept annotati = !0(b)| + !1 ⇥ fPrecision) ... + !6of⇥ f6 (a, b)based~ on~ anot. features set ~ of known CLs. Given a set Recall) F1/measure) · | (I (a)) \ I (b)| 1 (a, b) + weights 2 · | (I (a)) \ I 1!2 1!2 However, there were more CLs discove ~ of concept. 0" ingthe context of the concept. RM 93.5 38.2 54.2 93.7 56.2 70.2 where the input vector x = ( f f ), the target outn n input vector xi=1 fannotation ),was the outInlinks of an b) article to a set+ (7) ofNo)CLs)) other where articles f4 (a, =W1 correspond (7) the of 82.1%) 70.5%) 75.9%) i(f ,b 1"j6j6= 2"{(atarget 3" 4" f (a, b) = i =as ai ,baicept aia,b i ,bi data, ilet =A ii = CLs {(a , b )} training )} and where (·) is a function that maps a sets of articles in 4 edge linking. O + i i i 1!2 done. In the fourth run, m i=1 + + 2.3 Cross-lingual Link Prediction lows. 
| (I (a))| + |I (b)| 2.3 Cross-lingual Link Prediction efined0.20 as follows. Mil. CLs SVM 84.7 37.3 51.8 88.8 68.1 77.1 1!2 ⇤ 0.1)Mil.)CLs) 83.6%) 71.4%) 77.0%) No"Annota,on" 34,876" 57,128" 76,932" 67,440" | (I (a))| + |I (b)| put y is always set to 1. Then we train a linear regression nW i {(b linking to it, let I(a) and I(b) denote inlinks of two articles, 1!2 For each article a in W , the article b in that maxiput y is always set to 1. Then we train a linear regression B = )} , our model tries to find the optimal weights to 1 2 to their corresponding articles in W if there are CLs between i discovered than in the third run when c i wiki article with 2 i=1 Annota,on" 0.2)Mil.)CLs) 45,002" 78,334" 132,210" 152,223"using 85.6%) 73.7%) 79.2%) RM 94.5 37.9 54.1 95.9 67.2 79.0 As shown in Figure 2, the CL predictor takes two wikis and a ⇤ ⇤ Feature 5 Category similarity model on the dataset D to get the weights of different feaAs shown in Figure 2, the CL predictor takes two wikis and a ept c identified thesimilarity inlink similarity is computed as the score functionmodel Given a concept c identified mizes s(a, b ) and satisfies s(a, b ) > 0 is ensure: tion; but less CLs were cles, found feain the fourth r them. Feature 5 Category on the dataset D to get the weights of different so that new set of seed CLs between them as inputs, then it predicts a set seed CLs between them as inputs, then itare predicts a set oncept c rtoegression anCategories tags attached to articles, which represent the tures. The threshold !0 is set to the value that maximizes the Our odel is cof ompared SVM classifica)on model. y from this concept cmto an-(RM) set +with R esults o f c oncept a nnotation R esults o f i ncremental l inking third run without concept annotation. In sum predicted as the corresponding article of a. The idea of CL Feature 2: Outlink similarity of new CLs. 
This subsection first introduces the definitions of Categories are tags attached to articles, which represent the tures. The threshold ! is set to the value that maximizes the cross-lingua of new CLs. This subsection firsttopics introduces thearticles’ definitions FigureCLs. 4: Results of Incrementally new Knowledge Linking ted conditional Results of Concept Annotation 0 training of the subjects. Let C(a) andFigure C(b)3:denote F1-measure on the 2of· | (I(a)) \ I(b)| Zhichun Wang , Juanzi Li , Jie Tang Beijing Normal University, Tsinghua University Mo)va)on Proposed Framework ? Annota)ng Concepts Within Each Wiki Predic)ng New Cross-‐lingual Links Between Wikis wiki ar)cle Boosting Number'of'discovered'CLs Experiments