LINDEN: Linking Named Entities with Knowledge Base via Semantic Knowledge

Wei Shen†, Jianyong Wang†, Ping Luo‡, Min Wang‡
†Tsinghua University, Beijing, China
‡HP Labs China, Beijing, China
WWW 2012
Presented by Tom Chao Zhou, July 17, 2012
Outline
- Motivation
- Problem Definition
- Previous Methods
- LINDEN Framework
- Experiments
- Conclusion
Motivation
- Many large-scale knowledge bases have emerged: DBpedia, YAGO, Freebase, etc. (www.freebase.com)
Motivation
- Many large-scale knowledge bases have emerged: DBpedia, YAGO, Freebase, etc.
- As the world evolves, new facts come into existence and are expressed digitally on the Web
- Maintaining and growing the existing knowledge bases requires integrating the extracted facts with the knowledge base
- Challenges:
  - Name variation: "National Basketball Association" vs. "NBA"; "New York City" vs. "Big Apple"
  - Entity ambiguity: "Michael Jordan" may refer to the NBA player or the Berkeley professor
Problem Definition
The entity linking task:
- Input: a textual named entity mention m, already recognized in the unstructured text
- Output: the corresponding real-world entity e in the knowledge base
- If no matching entity e for mention m exists in the knowledge base, return NIL for m
Entity linking task: an example
"German Chancellor Angela Merkel and her husband Joachim Sauer went to Ulm, Germany."
Each mention is linked to its entity in YAGO, or to NIL when no matching entity exists.
Figure 1: An example of YAGO
Source: From Information to Knowledge: Harvesting Entities and Relationships from Web Sources. PODS'10.
Previous Methods
An essential step of entity linking: define a similarity measure between the text around the entity mention and the document associated with the entity.
Bag-of-words model:
- Represent the context as a term vector
- Measure the co-occurrence statistics of terms
- Cannot capture semantic knowledge
Example: for the text "Michael Jordan wins NBA champion.", the bag-of-words model cannot work well, because the surrounding terms alone do not identify which Michael Jordan is meant.
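A tiny sketch of the bag-of-words comparison described above, with toy candidate descriptions invented for illustration: both candidates for "Michael Jordan" receive nearly the same cosine score, which is exactly why the model fails here.

```python
from collections import Counter
import math

def cosine_sim(text_a: str, text_b: str) -> float:
    """Cosine similarity between bag-of-words term vectors."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

mention_ctx = "Michael Jordan wins NBA champion"
player_doc = "Michael Jordan played basketball for the Chicago Bulls"
prof_doc = "Michael Jordan is a professor of machine learning at Berkeley"

sim_player = cosine_sim(mention_ctx, player_doc)
sim_prof = cosine_sim(mention_ctx, prof_doc)
# Both candidate documents get nearly identical scores, so co-occurrence
# statistics alone cannot tell the two Michael Jordans apart.
```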
LINDEN Framework
1. Candidate Entity Generation
   - For each named entity mention m, retrieve the set of candidate entities E_m
2. Named Entity Disambiguation
   - For each candidate entity e ∈ E_m, define a scoring measure and rank E_m
3. Unlinkable Mention Prediction
   - For e_top, the candidate with the highest score in E_m, validate whether e_top is the target entity for mention m
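The three modules can be sketched as a pipeline. This is an illustrative skeleton only: the dictionary shape, the placeholder `score` function, and the 0.5 threshold are assumptions, not LINDEN's actual implementation.

```python
from typing import Optional

def score(entity: str, mention: str, document: str) -> float:
    # Toy stand-in for LINDEN's weighted feature combination:
    # prefer the candidate whose name words also appear in the document.
    words = set(entity.lower().split())
    ctx = set(document.lower().split())
    return len(words & ctx) / len(words)

def link_entity(mention: str, document: str,
                dictionary: dict[str, list[str]],
                threshold: float = 0.5) -> Optional[str]:
    # 1. Candidate Entity Generation: look up the surface form.
    candidates = dictionary.get(mention, [])
    if not candidates:
        return None  # NIL: no candidate entity exists
    # 2. Named Entity Disambiguation: score and rank the candidates.
    ranked = sorted(candidates,
                    key=lambda e: score(e, mention, document),
                    reverse=True)
    top = ranked[0]
    # 3. Unlinkable Mention Prediction: validate the top candidate.
    return top if score(top, mention, document) >= threshold else None
```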
Candidate Entity Generation
- Intuitively, the candidates in E_m should have the surface form of m as one of their names.
- We build a dictionary that contains a vast amount of information about the surface forms of entities: name variations, abbreviations, confusable names, spelling variations, nicknames, etc.
- The dictionary leverages four structures of Wikipedia:
  - Entity pages
  - Redirect pages
  - Disambiguation pages
  - Hyperlinks in Wikipedia articles
Candidate Entity Generation (Cont'd)
- For each mention m, search for it among the surface forms in the dictionary.
- If a hit is found, add all target entities of that surface form to the set of candidate entities E_m.
Table 1: An example of the dictionary
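The lookup above presupposes a surface-form dictionary mapping each name to the entities it can refer to. A minimal sketch of building one; the sample entries are invented for illustration, not taken from actual Wikipedia dumps.

```python
from collections import defaultdict

def build_dictionary(pairs):
    """Map each surface form to the set of entities it can refer to."""
    d = defaultdict(set)
    for surface, entity in pairs:
        d[surface].add(entity)
    return d

pairs = [
    ("NBA", "National Basketball Association"),   # abbreviation
    ("NBA", "Nepal Basketball Association"),      # confusable name
    ("Big Apple", "New York City"),               # nickname (redirect page)
    ("Michael Jordan", "Michael J. Jordan"),      # disambiguation page
    ("Michael Jordan", "Michael I. Jordan"),
]
dictionary = build_dictionary(pairs)
```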
Named Entity Disambiguation
Goal: rank the candidate entities according to their scores, using four features:
- Feature 1: Link probability — based on the count information in the dictionary
- Semantic-network-based features:
  - Feature 2: Semantic associativity — based on the Wikipedia hyperlink structure
  - Feature 3: Semantic similarity — derived from the taxonomy of YAGO
  - Feature 4: Global coherence — global document-level topical coherence among entities
Link Probability
Feature 1: link probability LP(e|m) for candidate entity e, where count_m(e) is the number of links that point to entity e and have the surface form m.
Table 1: An example of the dictionary (the LP column lists values such as 0.81 and 0.05)
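A plausible reconstruction of the link-probability formula from this definition (the slide renders the equation as an image): the count for entity e is normalized by the total count over all candidates for the same surface form.

```latex
LP(e \mid m) = \frac{count_m(e)}{\sum_{e' \in E_m} count_m(e')}
```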
Semantic Network Construction
- Recognize all the Wikipedia concepts Γ_d in the document d, using the open-source toolkit Wikipedia-Miner (http://wikipedia-miner.sourceforge.net/index.htm)
- Example: "The Chicago Bulls' player Michael Jordan won his first NBA championship in 1991."
  - Set of entity mentions: {Michael Jordan, NBA}
  - Candidate entities:
    - Michael Jordan → {Michael J. Jordan, Michael I. Jordan}
    - NBA → {National Basketball Association, Nepal Basketball Association}
  - Γ_d: {NBA All-Star Game, David Joel Stern, Charlotte Bobcats, Chicago Bulls}
- The network combines the hyperlink structure of Wikipedia articles with the taxonomy of concepts in YAGO
Figure 2: An example of the constructed semantic network
Semantic Associativity
Feature 2: semantic associativity SA(e) for each candidate entity e, computed over the constructed semantic network.
Figure 2: An example of the constructed semantic network
Semantic Associativity (Cont'd)
Given two Wikipedia concepts e_1 and e_2, the Wikipedia Link-based Measure (WLM) [1] defines the semantic associativity between them, where E_1 and E_2 are the sets of Wikipedia concepts that hyperlink to e_1 and e_2 respectively, and W is the set of all concepts in Wikipedia.
[1] D. Milne and I. H. Witten. An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. In Proceedings of WIKIAI, 2008.
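Written out following Milne and Witten's definition [1] (the slide shows the equation as an image), the associativity between two concepts is:

```latex
SA(e_1, e_2) = 1 - \frac{\log\bigl(\max(|E_1|, |E_2|)\bigr) - \log\bigl(|E_1 \cap E_2|\bigr)}
                        {\log\bigl(|W|\bigr) - \log\bigl(\min(|E_1|, |E_2|)\bigr)}
```

Intuitively, the more in-linking concepts two articles share relative to their total in-link counts, the higher their associativity.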
Semantic Similarity
Feature 3: semantic similarity SS(e) for each candidate entity e, where Θ_k is the set of k context concepts in Γ_d that have the highest semantic similarity with entity e (the figure uses k = 2).
Figure 2: An example of the constructed semantic network
Semantic Similarity (Cont'd)
- Given two Wikipedia concepts e_1 and e_2, let the sets of their super-classes be Φ_e1 and Φ_e2.
- For each class C_1 in the set Φ_e1, assign a target class ε(C_1) in the other set Φ_e2 as the class that maximizes sim(C_1, C_2), where sim(C_1, C_2) is the semantic similarity between classes C_1 and C_2.
- To compute sim(C_1, C_2), adopt the information-theoretic approach introduced in [2], where C_0 is the lowest common ancestor of class nodes C_1 and C_2 in the hierarchy, and P(C) is the probability that a randomly selected object belongs to the subtree rooted at C in the taxonomy.
[2] D. Lin. An information-theoretic definition of similarity. In Proceedings of ICML, pages 296–304, 1998.
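Written out from the definitions above (reconstructed; the slide shows both equations as images), the target-class assignment and Lin's class similarity [2] take the form:

```latex
\varepsilon(C_1) = \operatorname*{arg\,max}_{C_2 \in \Phi_{e_2}} sim(C_1, C_2),
\qquad
sim(C_1, C_2) = \frac{2 \log P(C_0)}{\log P(C_1) + \log P(C_2)}
```

Since P(C_0) ≥ P(C_1), P(C_2) in the taxonomy, the ratio lies in [0, 1] and reaches 1 when the two classes coincide.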
Semantic Similarity (Cont'd)
- Calculate the semantic similarity from one set of classes Φ_e1 to another set of classes Φ_e2
- Define the semantic similarity between Wikipedia concepts e_1 and e_2 based on these set-to-set similarities
Global Coherence
- Feature 4: global coherence GC(e) for each candidate entity e, measured as the average semantic associativity of candidate entity e to the mapping entities of the other mentions, where e_m' is the mapping entity of mention m'.
- Since the true mapping entities are unknown at this point, substitute the most likely assigned entity for the mapping entity in Formula 9: the most likely assigned entity e'_m' for mention m' is the candidate entity with the maximum link probability in E_m'.
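One way to write the global coherence consistent with this description, where M denotes the set of mentions in the document (the averaging over |M| − 1 other mentions is an assumption reconstructed from "average semantic associativity ... to the other mentions"):

```latex
GC(e) = \frac{1}{|M| - 1} \sum_{m' \in M \setminus \{m\}} SA\bigl(e,\ e'_{m'}\bigr)
```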
Global Coherence (Cont'd)
Figure 2: An example of the constructed semantic network
Candidates Ranking
- Generate a feature vector F_m(e) for each e ∈ E_m.
- Calculate Score_m(e) for each candidate e using a weight vector w, which gives a different weight to each feature element in F_m(e).
- Rank the candidates and pick the top candidate as the predicted mapping entity for mention m.
- To learn w, we use a max-margin technique on the training data set: assume Score_m(e*) is larger than any other Score_m(e) with a margin, and minimize over ξ_m ≥ 0 and w the resulting objective.
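In a standard ranking-SVM form consistent with this description (the trade-off constant C and the unit margin are assumptions; the slide shows the objective as an image):

```latex
Score_m(e) = \vec{w}^{\,T} F_m(e)
```

```latex
\min_{\vec{w},\ \xi_m \ge 0} \ \frac{1}{2}\|\vec{w}\|^2 + C \sum_m \xi_m
\quad \text{s.t.} \quad
Score_m(e^*) \ge Score_m(e) + 1 - \xi_m \quad \forall e \in E_m \setminus \{e^*\}
```

The slack variable ξ_m allows some training mentions to violate the margin at a cost, as in soft-margin SVMs.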
Unlinkable Mention Prediction
Predict mention m as an unlinkable mention if either:
- The size of E_m generated by the Candidate Entity Generation module is zero, or
- Score_m(e_top) is smaller than the learned threshold τ
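A minimal sketch of this decision rule, assuming the candidate scores and the threshold τ are supplied by the earlier ranking stage:

```python
from typing import Optional

def predict(candidate_scores: dict[str, float],
            tau: float) -> Optional[str]:
    """Return the linked entity, or None (NIL) for an unlinkable mention."""
    if not candidate_scores:      # E_m is empty: no candidate at all
        return None
    top_entity = max(candidate_scores, key=candidate_scores.get)
    if candidate_scores[top_entity] < tau:  # low confidence: predict NIL
        return None
    return top_entity
```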
Experiment Setup
- Data sets:
  - CZ data set: newswire data used by Cucerzan [3]
  - TAC-KBP2009 data set: used in the Knowledge Base Population (KBP) track at the Text Analysis Conference (TAC) 2009
- Parameter learning: 10-fold cross validation
[3] S. Cucerzan. Large-Scale Named Entity Disambiguation Based on Wikipedia Data. In Proceedings of EMNLP-CoNLL, pages 708–716, 2007.
Results over the CZ data set
Results on the TAC-KBP2009 data set
Conclusion
LINDEN:
- A novel framework to link named entities in text with YAGO
- Leverages the rich semantic knowledge derived from Wikipedia and the taxonomy of YAGO
- Significantly outperforms the state-of-the-art methods in terms of accuracy
Thanks! Q&A