Using Encyclopedic Knowledge for Named Entity Disambiguation Razvan Bunescu Machine Learning Group Department of Computer Sciences University of Texas.

Presentation transcript:

Using Encyclopedic Knowledge for Named Entity Disambiguation Razvan Bunescu Machine Learning Group Department of Computer Sciences University of Texas at Austin Marius Pasca Google Inc Amphitheatre Parkway Mountain View, CA

Introduction: Disambiguation
Some names denote multiple entities:
– “John Williams and the Boston Pops conducted a summer Star Wars concert at Tanglewood.” John Williams → John Williams (composer)
– “John Williams lost a Taipei death match against his brother, Axl Rotten.” John Williams → John Williams (wrestler)
– “John Williams won a Victoria Cross for his actions at the battle of Rorke’s Drift.” John Williams → John Williams (VC)

Introduction: Normalization
Some entities have multiple names:
– John Williams (composer) → John Williams
– John Williams (composer) → John Towner Williams
– John Williams (wrestler) → John Williams
– John Williams (wrestler) → Ian Rotten
– Venus (planet) → Venus
– Venus (planet) → Morning Star
– Venus (planet) → Evening Star

Introduction: Motivation
Web searches
– Queries about Named Entities (NEs) constitute a significant portion of popular web queries.
– Ideally, search results are clustered such that:
  – In each cluster, the queried name denotes the same entity.
  – Each cluster is enriched by querying the web with alternative names of the corresponding entity.
Web-based Information Extraction (IE)
– Aggregating extractions from multiple web pages can lead to improved accuracy in IE tasks (e.g. extracting relationships between NEs).
– Named entity disambiguation is essential for performing a meaningful aggregation.

Introduction: Approach
Build a dictionary D of named entities
– Use information from a large-coverage encyclopedia: Wikipedia.
– Each name d ∈ D is mapped to d.E, the set of entities that d can refer to in Wikipedia.
Design a method that takes as input a proper name in its document context, and can be trained to:
1) Detect when a proper name refers to an entity from D. [Detection]
2) Find the named entity referred to in that context. [Disambiguation]

Introduction: Example
[Figure: the dictionary links the names John Williams, John Towner Williams, and Ian Rotten to the entities John Williams (composer), John Williams (VC), John Williams (wrestler), and John Williams (other). The document mention in “… this past weekend. John Williams and the Boston Pops conducted a summer Star Wars concert at Tanglewood …” has to be resolved to one of these entities.]

Outline Introduction Wikipedia Structures –Named Entity Dictionary –Disambiguation Dataset Disambiguation & Detection Experimental Evaluation Future Work Conclusions

Wikipedia – A Wiki Encyclopedia
Wikipedia – a free online encyclopedia written collaboratively by volunteers, using wiki software.
200 language editions, with varying levels of coverage.
Very dynamic and quickly growing resource:
– May 2005: 577,860 articles
– Sep. 2005: 751,666 articles

Wikipedia Articles & Titles
Each article describes a specific entity or concept.
An article is uniquely identified by its title.
– Usually, the title is the most common name used to denote the entity described in the article.
– If the title name is ambiguous, it may be qualified with an expression between parentheses.
– Example: John Williams (composer)
Notation:
– E = the set of all named entities from Wikipedia.
– e ∈ E = an arbitrary named entity.
  – e.title = the title name
  – e.T = the text of the article

Wikipedia Structures
In general, there is a many-to-many relationship between names and entities, captured in Wikipedia through:
– Redirect articles.
– Disambiguation articles.
Hyperlinks: An article may contain links to other articles in Wikipedia.
Categories: each article belongs to at least one Wikipedia category.

Redirect Articles
A redirect article exists for each alternative name used to refer to an entity in Wikipedia.
Example: The article titled John Towner Williams consists of a pointer to the article John Williams (composer).
Notation:
– e.R = the set of all names that redirect to e.
Example:
– e.title = United States.
– e.R = {USA, US, Estados Unidos, Untied States, Yankee Land, …}.

Disambiguation Articles
A disambiguation article lists all Wikipedia entities (articles) that may be denoted by an ambiguous name.
Example: The article titled John Williams (disambiguation) lists 22 entities (articles).
Notation:
– e.D = the set of names whose disambiguation pages contain a link to e.
Example:
– e.title = Venus (planet).
– e.D = {Venus, Morning Star, Evening Star}.

Named Entity Dictionary
Named Entities = entities with a proper name title.
All Wikipedia titles begin with a capital letter ⇒ 3 heuristics for detecting proper name titles:
1) If e.title is a multiword title, then e is a named entity only if all content words are capitalized (e.g. The Witches of Eastwick).
2) If e.title is a one-word title that contains at least two capital letters, then e is a named entity (e.g. NATO).
3) If at least 75% of the title occurrences inside the article are capitalized, then e is a named entity.
Notation:
– d ∈ D is a proper name entry in the dictionary D (≈ 500K entries).
– d.E is the set of entities that may be denoted by d in Wikipedia.
– e ∈ d.E ⇔ d = e.name ∨ d ∈ e.R ∨ d ∈ e.D (e.name = e.title without the expression between parentheses)
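The three heuristics and the definition of d.E can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the small `FUNCTION_WORDS` stop list standing in for "non-content words", the choice to apply the heuristics to the title without its parenthesized qualifier, and the input layout (a dict from title to its redirect names `R`, disambiguation names `D`, and article text `T`) are all hypothetical and not taken from the presentation.

```python
import re
from collections import defaultdict

# Hypothetical stop list standing in for "non-content words" (an assumption).
FUNCTION_WORDS = {"a", "an", "the", "of", "and", "in", "on", "for", "to", "at", "by"}

def strip_qualifier(title):
    """e.name: the title without a trailing parenthesized expression."""
    return re.sub(r"\s*\([^)]*\)\s*$", "", title)

def is_named_entity(title, article_text=""):
    """Apply the three proper-name heuristics to a title (qualifier removed)."""
    name = strip_qualifier(title)
    words = name.split()
    if not words:
        return False
    if len(words) > 1:
        # Heuristic 1: multiword title, all content words capitalized.
        content = [w for w in words if w.lower() not in FUNCTION_WORDS]
        return bool(content) and all(w[0].isupper() for w in content)
    # Heuristic 2: one-word title with at least two capital letters.
    if sum(ch.isupper() for ch in name) >= 2:
        return True
    # Heuristic 3: at least 75% of the occurrences in the article are capitalized.
    # Simplification: sentence-initial occurrences are counted like any other.
    hits = re.findall(r"\b" + re.escape(name) + r"\b", article_text, flags=re.IGNORECASE)
    if not hits:
        return False
    return sum(h[0].isupper() for h in hits) / len(hits) >= 0.75

def build_dictionary(entities):
    """Map each name d to d.E, following e in d.E <=> d = e.name or d in e.R or d in e.D."""
    dictionary = defaultdict(set)
    for title, info in entities.items():
        if not is_named_entity(title, info.get("T", "")):
            continue
        names = {strip_qualifier(title)} | info.get("R", set()) | info.get("D", set())
        for d in names:
            dictionary[d].add(title)
    return dict(dictionary)
```

With the John Williams entries from the introduction as input, the name "John Williams" would map to both the composer and the wrestler, while "Ian Rotten" would map only to the wrestler.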

Hyperlinks
Mentions of entities in Wikipedia articles are often linked to their corresponding article, by using links or piped links.
Wiki source: The [[Vatican City|Vatican]] is now an enclave surrounded by [[Rome]].
Display string: The Vatican is now an enclave surrounded by Rome.
Here [[Vatican City|Vatican]] is a piped link and [[Rome]] is a link.

Disambiguation Dataset
Hyperlinks in Wikipedia provide disambiguated named entity queries q.
The [[Vatican City|Vatican]] is now an enclave surrounded by [[Rome]].
– Query q_1, [[Vatican City|Vatican]]: the title is Vatican City and the display name is Vatican.
– Query q_2, [[Rome]]: display name = title.
Notation:
– q.E = the set of entities that are associated in the dictionary D with the display name from the link.
– q.e ∈ q.E = the true entity associated with the query, given by the title included in the link.
– q.T = the text contained in a window of size 55 words [Gooi & Allan, 2004] centered on the link.
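A minimal sketch of how such queries could be pulled out of wiki source is given below. The function name `extract_queries`, the whitespace tokenization, and the way the 55-word window is centered on the first word of the display name (27 words on each side) are assumptions made for illustration; the presentation does not specify them.

```python
import re

# Matches [[Title]] and [[Title|Display]] links.
LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]|]+))?\]\]")

def extract_queries(wiki_source, window=55):
    """Return one query per link: display name, true title (q.e), context words (q.T)."""
    words, links = [], []
    # Tokenize, keeping each [[...]] link as a single token.
    for tok in re.findall(r"\[\[[^\[\]]+\]\]|[^\s\[\]]+", wiki_source):
        m = LINK.fullmatch(tok)
        if m:
            title = m.group(1).strip()
            display = (m.group(2) or title).strip()
            links.append((len(words), title, display))
            words.extend(display.split())  # the context uses the display string
        else:
            words.append(tok)
    half = window // 2
    queries = []
    for pos, title, display in links:
        start = max(0, pos - half)
        queries.append({"display": display, "title": title,
                        "T": words[start:pos + half + 1]})
    return queries

source = "The [[Vatican City|Vatican]] is now an enclave surrounded by [[Rome]]."
for q in extract_queries(source):
    print(q["display"], "->", q["title"])
```

The candidate set q.E would then be looked up in the dictionary D under the display name.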

Disambiguation Dataset
Every entity e_k ∈ q.E contributes a disambiguation example, labeled 1 if and only if e_k = q.e.
Example query q: “… this past weekend. [[John Williams]] and the Boston Pops conducted a summer Star Wars concert at Tanglewood …”
Label | Query Text (q.T) | Entity Title (e_k.title)
1 | Boston Pops conducted concert Star Wars … | e_1: John Williams (composer)
0 | Boston Pops conducted concert Star Wars … | e_2: John Williams (wrestler)
0 | Boston Pops conducted concert Star Wars … | e_3: John Williams (VC)
Total: 1,783,868 queries
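The labeling rule can be illustrated with a short sketch. The function name `make_examples`, the query layout (a dict with `display`, `title`, and `T`), and the decision to skip queries whose true entity is missing from the dictionary are assumptions for the sake of the example.

```python
def make_examples(query, dictionary):
    """Return (label, q.T, e_k.title) triples, with label 1 iff e_k is the true entity q.e."""
    candidates = sorted(dictionary.get(query["display"], set()))
    if query["title"] not in candidates:
        return []  # assumption: drop queries whose true entity is not in q.E
    return [(1 if e_k == query["title"] else 0, query["T"], e_k) for e_k in candidates]

dictionary = {"John Williams": {"John Williams (composer)",
                                "John Williams (wrestler)",
                                "John Williams (VC)"}}
query = {"display": "John Williams",
         "title": "John Williams (composer)",
         "T": ["Boston", "Pops", "conducted", "concert", "Star", "Wars"]}
for label, context, title in make_examples(query, dictionary):
    print(label, title)
```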

Categories
Each article in Wikipedia is required to be associated with at least one category.
Categories form a directed acyclic graph, which allows multiple categorization schemes to co-exist.
59,759 categories in Wikipedia taxonomy.
Notation:
– e.C = the set of categories to which e belongs (ancestors included).
Example:
– e.title = Venus (planet).
– e.C = {Venus, Planets of the Solar Systems, Planets, Solar System}.


NE Disambiguation: Two Approaches
1) Classification:
– Train a classifier for each proper name in the dictionary D.
– Not feasible: 500K proper names ⇒ need 500K classifiers!
2) Ranking:
– Design a scoring function score(q, e_k) that computes the compatibility between the context of the proper name occurring in a query q, and any of the entities e_k ∈ q.E that may be referred to by that proper name.
– For a given named entity query q, select the highest ranking entity:
$\hat{e} = \arg\max_{e_k \in q.E} \mathrm{score}(q, e_k)$

Context-Article Similarity
NE disambiguation as a ranking problem.
Use cosine similarity between query context and article, based on the tf x idf formulation:
$\mathrm{score}(q, e_k) = \cos(q.T, e_k.T) = \frac{q.T}{\|q.T\|} \cdot \frac{e_k.T}{\|e_k.T\|}$
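As a rough sketch of this baseline, the snippet below ranks candidate articles by tf x idf cosine similarity with the query context. The idf variant (natural log of N over document frequency), the tiny token lists, and the helper names are assumptions chosen for illustration; a real system would compute document frequencies over the whole encyclopedia.

```python
import math
from collections import Counter

def tfidf_vector(tokens, doc_freq, n_docs):
    """Sparse tf x idf vector; words never seen in the collection are dropped."""
    tf = Counter(tokens)
    return {w: tf[w] * math.log(n_docs / doc_freq[w])
            for w in tf if doc_freq.get(w, 0) > 0}

def cosine(u, v):
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_by_cosine(query_tokens, articles):
    """articles: dict title -> token list. Returns (title, score) pairs, best first."""
    n_docs = len(articles)
    doc_freq = Counter(w for toks in articles.values() for w in set(toks))
    q_vec = tfidf_vector(query_tokens, doc_freq, n_docs)
    scores = {title: cosine(q_vec, tfidf_vector(toks, doc_freq, n_docs))
              for title, toks in articles.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

articles = {
    "John Williams (composer)": "composer film score star wars boston pops conductor".split(),
    "John Williams (wrestler)": "wrestler death match taipei rotten".split(),
    "John Williams (VC)": "soldier victoria cross rorke drift".split(),
}
query = "boston pops conducted concert star wars".split()
print(rank_by_cosine(query, articles)[0][0])
```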

Word-Category Correlations
Problem: In many cases, given a query q, the true entity q.e fails to rank first because cue words from the query context do not occur in q.e’s article.
– The article may be too short, or incomplete.
– Relevant concepts from the query context are captured in the article through synonymous words or phrases.
Approach: Use correlations between words in the query context w ∈ q.T and categories to which the named entity belongs c ∈ e.C.

Word-Category Correlations
[Figure: the context word “conducted” in “John Williams and the Boston Pops conducted a summer Star Wars concert at Tanglewood.” is matched against the categories of the candidates John Williams (composer) and John Williams (wrestler). Categories shown: Film score composers, Composers, Musicians, Professional wrestlers, Wrestlers, People known in connection with sports and hobbies, People by occupation.]

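Ranking Formulation
Redefine q.E = the set of named entities from D that may be denoted by the display name in the query, plus an out-of-Wikipedia entity e_out.
Use a linear ranking function:
$\mathrm{score}(q, e_k) = \mathbf{w} \cdot \Phi(q, e_k)$
One feature for the context-article similarity:
$\phi_{\cos}(q, e_k) = \cos(q.T, e_k.T)$
Each word-category pair ⟨w,c⟩ ∈ V × C is translated into a feature:
$\phi_{w,c}(q, e_k) = 1$ if $w \in q.T$ and $c \in e_k.C$, and 0 otherwise
One special feature for out-of-Wikipedia entities:
$\phi_{out}(q, e_k) = 1$ if $e_k = e_{out}$, and 0 otherwise
$\Phi = [\phi_{\cos} \mid \phi_{w,c} \mid \phi_{out}]$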

Ranking Formulation: Example
q = “… this past weekend. John Williams and the Boston Pops conducted a summer Star Wars concert at Tanglewood.”
Candidate entities: e_1 = John Williams (composer), e_2 = John Williams (wrestler)
[Figure: the context word “conducted” and the categories Film score composers, Composers, Musicians, Professional wrestlers, Wrestlers, People known in connection with sports and hobbies, People by occupation.]
q.T = {past, weekend, Boston, Pops, conducted, summer, Star, Wars, concert, Tanglewood, …}
e_1.C = {Film score composers, Composers, Musicians, People by occupation, …}
e_out.C = ∅
$\phi_{w,c}(q, e_1) = 1$ if $(w,c) \in q.T \times e_1.C$, and 0 otherwise.
$\phi_{w,c}(q, e_{out}) = 0$
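The feature map and the linear score can be sketched as follows. Everything here is illustrative: the sparse-dict representation, the function names, and the hand-set weights are assumptions, and the out-of-Wikipedia feature is modeled as an indicator that fires only for e_out, which is consistent with e_out having no categories but is not spelled out in the transcript.

```python
def features(q_T, e_C, cos_sim, is_out=False):
    """Sparse Phi(q, e_k): one cosine feature, one indicator per (word, category) pair."""
    if is_out:
        # e_out has no article and no categories; only the special feature fires (assumption).
        return {"out": 1.0}
    phi = {"cos": cos_sim}
    for w in set(q_T):
        for c in e_C:
            phi[("wc", w, c)] = 1.0
    return phi

def score(weights, phi):
    """Linear ranking function w . Phi(q, e_k) over sparse features."""
    return sum(weights.get(name, 0.0) * value for name, value in phi.items())

def disambiguate(weights, q_T, candidates):
    """candidates: dict title -> (category set, cosine similarity). e_out is added here."""
    scored = {title: score(weights, features(q_T, cats, cos))
              for title, (cats, cos) in candidates.items()}
    scored["e_out"] = score(weights, features(q_T, set(), 0.0, is_out=True))
    return max(scored, key=scored.get)

# Hand-set weights for illustration only; in the presentation they are learned.
weights = {"cos": 1.0, "out": 0.2,
           ("wc", "conducted", "Composers"): 1.5,
           ("wc", "conducted", "Wrestlers"): -0.5}
candidates = {"John Williams (composer)": ({"Composers", "Musicians"}, 0.1),
              "John Williams (wrestler)": ({"Wrestlers"}, 0.1)}
print(disambiguate(weights, ["Boston", "Pops", "conducted"], candidates))
```

Materializing every word-category pair explicitly becomes expensive as the vocabulary and category set grow, which is one reason a kernel formulation is attractive.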

NE Disambiguation: Overview 1
[Figure: Data Structures. Components shown: Redirect Pages, Disambig Pages, NE Dictionary, Hyperlinks, Disambiguation Dataset.]

NE Disambiguation: Overview 2
[Figure: Training. Components shown: Disambiguation Dataset, Ranking Examples with features Φ(q, e_k), SVM training, Ranking Model with weights w.]
[Figure: Testing. Components shown: NE query q, NE Dictionary, Ranking Instances with features Φ(q, e_k), Ranking Model with weights w, Answer.]


Experimental Evaluation
The normalized ranking kernel is trained and evaluated against cosine similarity in 4 scenarios:
1) Disambiguation between entities with different categories in the set of 110 top-level categories under People by Occupation.
2) Disambiguation between entities with different categories in the set of 540 most popular (size > 200) categories under People by Occupation.
3) Disambiguation between entities with different categories in the set of 2847 most popular (size > 20) categories under People by Occupation.
4) Detection & Disambiguation between entities with different categories in the set of 540 most popular (size > 200) categories under People by Occupation.
Use SVMlight with the max-margin ranking approach from [Joachims 2002].
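In the max-margin ranking setup, each query contributes preference constraints requiring the true entity to outscore every other candidate. The sketch below builds the corresponding difference vectors Φ(q, q.e) − Φ(q, e_k) from sparse feature dicts; the function name and input layout are assumptions, and the optimization itself is left to the SVM package. A query with n candidates yields n − 1 constraints, which agrees with the reported statistics (for example, 55,452 pairs − 17,970 queries = 37,482 constraints in scenario 2).

```python
def ranking_constraints(examples_by_query):
    """examples_by_query: list of (true_phi, [other_phi, ...]) with sparse dict features.

    Returns one difference vector d per (true, other) pair; the ranking SVM
    then asks for w . d >= 1 - slack for each of them.
    """
    constraints = []
    for true_phi, others in examples_by_query:
        for other_phi in others:
            diff = dict(true_phi)
            for name, value in other_phi.items():
                diff[name] = diff.get(name, 0.0) - value
            # Drop features that cancel out to keep the vector sparse.
            constraints.append({k: v for k, v in diff.items() if v != 0.0})
    return constraints

data = [({"cos": 0.5, "a": 1.0}, [{"cos": 0.25}, {"a": 1.0, "b": 1.0}])]
print(ranking_constraints(data))
```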

Experimental Evaluation: S_2
The set of Wikipedia categories is restricted to:
C_2 = the 540 categories under People by Occupation that have at least 200 articles
Train & Test only on ambiguous queries ⟨q, e_k⟩ such that:
– e_k.C ∩ C_2 ≠ ∅ (i.e. matching entities have categories in C_2)
– e_k.C ∩ C_2 ≠ q.e.C ∩ C_2 (i.e. the true entity does not have exactly the same categories as other matching entities)
Statistics & Results:
#Cat | Training #Queries | Training #Pairs | Training #Constr. | Test #Queries | Test #Pairs | Test Accuracy (Kernel) | Test Accuracy (Cosine)
540 | 17,970 | 55,452 | 37,482 | 70,468 | 235,… | …% | 55.8%

Experimental Evaluation: S_4
The set of Wikipedia categories is restricted to:
C_4 = the 540 categories under People by Occupation that have at least 200 articles.
Train & Test:
– Consider out-of-Wikipedia all entities that are not under People by Occupation.
– Randomly select queries such that 10% have true answer out-of-Wikipedia.
Statistics & Results:
#Cat | Training #Queries | Training #Pairs | Training #Constr. | Test #Queries | Test #Pairs | Test Accuracy (Kernel) | Test Accuracy (Cosine)
540 | 38,726 | 102,553 | 63,827 | 80,386 | 191,… | …% | 82.3%

Future Work
Use weight vector w explicitly – reduce its dimensionality by considering only features occurring frequently in training data.
Augment article text with context from hyperlinks that point to it.
Use correlations between categories and traditional WSD features such as (syntactic) bigrams and trigrams centered on the ambiguous proper name.

Conclusion
A novel approach to Named Entity Disambiguation based on knowledge encoded in Wikipedia.
Learned correlations between Wikipedia categories and context words substantially improve disambiguation accuracy.
Potential applications:
– Clustering results of web searches for popular named entities.
– NE disambiguation is essential for aggregating corpus-level results from Information Extraction.

Questions?

Ranking Kernel
The corresponding kernel is: [equation not preserved in the transcript]
The normalized version: [equation not preserved in the transcript]
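Since the slide's own formulas did not survive, the following is only a derivation sketch of what the inner product of two feature vectors looks like, assuming the three feature groups described under Ranking Formulation (a cosine feature, word-category indicators with q.T treated as a set of words, and an out-of-Wikipedia indicator). It is not a transcription of the slide, and the normalized form is not reconstructed here.

```latex
\begin{align*}
K\big((q,e_k),(q',e'_k)\big)
  &= \Phi(q,e_k)\cdot\Phi(q',e'_k) \\
  &= \phi_{\cos}(q,e_k)\,\phi_{\cos}(q',e'_k)
   + \sum_{(w,c)\in V\times C}\phi_{w,c}(q,e_k)\,\phi_{w,c}(q',e'_k)
   + \phi_{out}(q,e_k)\,\phi_{out}(q',e'_k) \\
  &= \cos(q.T,e_k.T)\,\cos(q'.T,e'_k.T)
   + |q.T\cap q'.T|\cdot|e_k.C\cap e'_k.C|
   + \phi_{out}(q,e_k)\,\phi_{out}(q',e'_k)
\end{align*}
```

The middle term follows because a word-category indicator is 1 in both vectors exactly when the word occurs in both contexts and the category belongs to both entities, so the sum counts the pairs in the product of the two intersections. This lets the kernel be computed without enumerating all of V × C.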

Experimental Evaluation: S_1
The set of Wikipedia categories is restricted to:
C_1 = the 110 top-level categories under People by Occupation.
Train & Test only on ambiguous queries ⟨q, e_k⟩ such that:
– e_k.C ∩ C_1 ≠ ∅ (i.e. matching entities have categories in C_1)
– e_k.C ∩ C_1 ≠ q.e.C ∩ C_1 (i.e. the true entity does not have exactly the same categories as other matching entities)
Statistics & Results:
#Cat | Training #Queries | Training #Pairs | Training #Constr. | Test #Queries | Test #Pairs | Test Accuracy (Kernel) | Test Accuracy (Cosine)
110 | 12,288 | 39,880 | 27,592 | 48,661 | 147,… | …% | 61.5%

Experimental Evaluation: S_3
The set of Wikipedia categories is restricted to:
C_3 = the 2847 categories under People by Occupation that have at least 20 articles
Train & Test only on ambiguous queries ⟨q, e_k⟩ such that:
– e_k.C ∩ C_3 ≠ ∅ (i.e. matching entities have categories in C_3)
– e_k.C ∩ C_3 ≠ q.e.C ∩ C_3 (i.e. the true entity does not have exactly the same categories as other matching entities)
Statistics & Results:
#Cat | Training #Queries | Training #Pairs | Training #Constr. | Test #Queries | Test #Pairs | Test Accuracy (Kernel) | Test Accuracy (Cosine)
2847 | …,185 | 64,560 | 43,375 | 75,190 | 261,… | …% | 55.4%