Relational Inference for Wikification

Slides:



Advertisements
Similar presentations
Arnd Christian König Venkatesh Ganti Rares Vernica Microsoft Research Entity Categorization Over Large Document Collections.
Advertisements

Location Recognition Given: A query image A database of images with known locations Two types of approaches: Direct matching: directly match image features.
Specialized models and ranking for coreference resolution Pascal Denis ALPAGE Project Team INRIA Rocquencourt F Le Chesnay, France Jason Baldridge.
Linking Entities in #Microposts ROMIL BANSAL, SANDEEP PANEM, PRIYA RADHAKRISHNAN, MANISH GUPTA, VASUDEVA VARMA INTERNATIONAL INSTITUTE OF INFORMATION TECHNOLOGY,
October 2014 Paul Kantor’s Fusion Fest Workshop Making Sense of Unstructured Data Dan Roth Department of Computer Science University of Illinois at Urbana-Champaign.
Textual Relations Task Definition Annotate input text with disambiguated Wikipedia titles: Motivation Current state-of-the-art Wikifiers, using purely.
Relational Entity Linking with Cross Document Coreference Xiao Cheng, Bingling Chen, Rajhans Samdani, Kai-Wei Chang, Zhiye Fei and Dan Roth University.
Engeniy Gabrilovich and Shaul Markovitch American Association for Artificial Intelligence 2006 Prepared by Qi Li.
Overview of the KBP 2013 Slot Filler Validation Track Hoa Trang Dang National Institute of Standards and Technology.
NYU ANLP-00 1 Automatic Discovery of Scenario-Level Patterns for Information Extraction Roman Yangarber Ralph Grishman Pasi Tapanainen Silja Huttunen.
Query Dependent Pseudo-Relevance Feedback based on Wikipedia SIGIR ‘09 Advisor: Dr. Koh Jia-Ling Speaker: Lin, Yi-Jhen Date: 2010/01/24 1.
GENERATING AUTOMATIC SEMANTIC ANNOTATIONS FOR RESEARCH DATASETS AYUSH SINGHAL AND JAIDEEP SRIVASTAVA CS DEPT., UNIVERSITY OF MINNESOTA, MN, USA.
Encyclopaedic Annotation of Text.  Entity level difficulty  All the entities in a document may not be in reader’s knowledge space  Lexical difficulty.
Global and Local Wikification (GLOW) in TAC KBP Entity Linking Shared Task 2011 Lev Ratinov, Dan Roth This research is supported by the Defense Advanced.
Wei Shen †, Jianyong Wang †, Ping Luo ‡, Min Wang ‡ † Tsinghua University, Beijing, China ‡ HP Labs China, Beijing, China WWW 2012 Presented by Tom Chao.
NaLIX: A Generic Natural Language Search Environment for XML Data Presented by: Erik Mathisen 02/12/2008.
Gimme’ The Context: Context- driven Automatic Semantic Annotation with CPANKOW Philipp Cimiano et al.
Learning to Extract Form Labels Nguyen et al.. The Challenge We want to retrieve and integrate online databases We want to retrieve and integrate online.
Page 1 Generalized Inference with Multiple Semantic Role Labeling Systems Peter Koomen, Vasin Punyakanok, Dan Roth, (Scott) Wen-tau Yih Department of Computer.
Multi-view Exploratory Learning for AKBC Problems Bhavana Dalvi and William W. Cohen School Of Computer Science, Carnegie Mellon University Motivation.
Scalable Text Mining with Sparse Generative Models
Enhance legal retrieval applications with an automatically induced knowledge base Ka Kan Lo.
Named Entity Disambiguation Based on Explicit Semantics Martin Jačala and Jozef Tvarožek Špindlerův Mlýn, Czech Republic January 23, 2012 Slovak University.
Artificial Intelligence Research Centre Program Systems Institute Russian Academy of Science Pereslavl-Zalessky Russia.
Longbiao Kang, Baotian Hu, Xiangping Wu, Qingcai Chen, and Yan He Intelligent Computing Research Center, School of Computer Science and Technology, Harbin.
Personalization in Local Search Personalization of Content Ranking in the Context of Local Search Philip O’Brien, Xiao Luo, Tony Abou-Assaleh, Weizheng.
C OLLECTIVE ANNOTATION OF WIKIPEDIA ENTITIES IN WEB TEXT - Presented by Avinash S Bharadwaj ( )
Exploiting Wikipedia as External Knowledge for Document Clustering Sakyasingha Dasgupta, Pradeep Ghosh Data Mining and Exploration-Presentation School.
1 Wikification CSE 6339 (Section 002) Abhijit Tendulkar.
Modeling Documents by Combining Semantic Concepts with Unsupervised Statistical Learning Author: Chaitanya Chemudugunta America Holloway Padhraic Smyth.
Reyyan Yeniterzi Weakly-Supervised Discovery of Named Entities Using Web Search Queries Marius Pasca Google CIKM 2007.
A Two Tier Framework for Context-Aware Service Organization & Discovery Wei Zhang 1, Jian Su 2, Bin Chen 2,WentingWang 2, Zhiqiang Toh 2, Yanchuan Sim.
A Simple Unsupervised Query Categorizer for Web Search Engines Prashant Ullegaddi and Vasudeva Varma Search and Information Extraction Lab Language Technologies.
Coreference Resolution with Knowledge Haoruo Peng March 20,
Enhancing Cluster Labeling Using Wikipedia David Carmel, Haggai Roitman, Naama Zwerdling IBM Research Lab (SIGIR’09) Date: 11/09/2009 Speaker: Cho, Chin.
Indirect Supervision Protocols for Learning in Natural Language Processing II. Learning by Inventing Binary Labels This work is supported by DARPA funding.
Page 1 March 2011 Local and Global Algorithms for Disambiguation to Wikipedia Lev Ratinov 1, Dan Roth 1, Doug Downey 2, Mike Anderson 3 1 University of.
Page 1 INARC Report Dan Roth, UIUC March 2011 Local and Global Algorithms for Disambiguation to Wikipedia Lev Ratinov & Dan Roth Department of Computer.
Named Entity Disambiguation on an Ontology Enriched by Wikipedia Hien Thanh Nguyen 1, Tru Hoang Cao 2 1 Ton Duc Thang University, Vietnam 2 Ho Chi Minh.
Inference Protocols for Coreference Resolution Kai-Wei Chang, Rajhans Samdani, Alla Rozovskaya, Nick Rizzolo, Mark Sammons, and Dan Roth This research.
Information Retrieval using Word Senses: Root Sense Tagging Approach Sang-Bum Kim, Hee-Cheol Seo and Hae-Chang Rim Natural Language Processing Lab., Department.
Evaluating Answer Validation in multi- stream Question Answering Álvaro Rodrigo, Anselmo Peñas, Felisa Verdejo UNED NLP & IR group nlp.uned.es The Second.
Mining Dependency Relations for Query Expansion in Passage Retrieval Renxu Sun, Chai-Huat Ong, Tat-Seng Chua National University of Singapore SIGIR2006.
Support Vector Machines and Kernel Methods for Co-Reference Resolution 2007 Summer Workshop on Human Language Technology Center for Language and Speech.
Exploiting Background Knowledge for Relation Extraction Yee Seng Chan and Dan Roth University of Illinois at Urbana-Champaign 1.
LINDEN : Linking Named Entities with Knowledge Base via Semantic Knowledge Date : 2013/03/25 Resource : WWW 2012 Advisor : Dr. Jia-Ling Koh Speaker : Wei.
Date: 2012/5/28 Source: Alexander Kotov. al(CIKM’11) Advisor: Jia-ling, Koh Speaker: Jiun Jia, Chiou Interactive Sense Feedback for Difficult Queries.
Multilingual Information Retrieval using GHSOM Hsin-Chang Yang Associate Professor Department of Information Management National University of Kaohsiung.
2016/3/11 Exploiting Internal and External Semantics for the Clustering of Short Texts Using World Knowledge Xia Hu, Nan Sun, Chao Zhang, Tat-Seng Chu.
Overview of Statistical NLP IR Group Meeting March 7, 2006.
Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,
Static model noOverlaps :: ArgumentCandidate[] candidates -> discrete[] types for (i : (0.. candidates.size() - 1)) for (j : (i candidates.size()
GoRelations: an Intuitive Query System for DBPedia Lushan Han and Tim Finin 15 November 2011
University Of Seoul Ubiquitous Sensor Network Lab Query Dependent Pseudo-Relevance Feedback based on Wikipedia 전자전기컴퓨터공학 부 USN 연구실 G
2016/9/301 Exploiting Wikipedia as External Knowledge for Document Clustering Xiaohua Hu, Xiaodan Zhang, Caimei Lu, E. K. Park, and Xiaohua Zhou Proceeding.
Automatically Labeled Data Generation for Large Scale Event Extraction
Concept Grounding to Multiple Knowledge Bases via Indirect Supervision
Exploiting Wikipedia as External Knowledge for Document Clustering
Part 2 Applications of ILP Formulations in Natural Language Processing
Improving a Pipeline Architecture for Shallow Discourse Parsing
GLOW- Global and Local Algorithms for Disambiguation to Wikipedia
X Ambiguity & Variability The Challenge The Wikifier Solution
Lecture 24: NER & Entity Linking
Compact Query Term Selection Using Topically Related Text
Relational Inference for Wikification
A Machine Learning Approach to Coreference Resolution of Noun Phrases
A Machine Learning Approach to Coreference Resolution of Noun Phrases
CS246: Information Retrieval
Entity Linking Survey
Topic: Semantic Text Mining
Presentation transcript:

Relational Inference for Wikification Xiao Cheng and Dan Roth University of Illinois at Urbana-Champaign

Wikification Blumenthal (D) is a candidate for the U.S. Senate seat now held by Christopher Dodd (D), and he has held a commanding lead in the race since he entered it. But the Times report has the potential to fundamentally reshape the contest in the Nutmeg State. Blumenthal (D) is a candidate for the U.S. Senate seat now held by Christopher Dodd (D), and he has held a commanding lead in the race since he entered it. But the Times report has the potential to fundamentally reshape the contest in the Nutmeg State. Let’s start with what is wikification Read wiki titles slower

Applications Knowledge Acquisition via Grounding Coreference Resolution Learning-based multi-sieve co-reference resolution with knowledge (Ratinov et al. 2012) Information Extraction Unsupervised relation discovery with sense disambiguation (Yao et al. 2012) Automatic Event Extraction with Structured Preference Modeling (Lu and Roth, 2012 ) Text Classification Gabrilovich and Markovitch, 2007; Chang et al., 2008 Entity Linking Generally allows us to inject knowledge into text models, with access to rich structured concept data Entity Linking is very similar and we will actually evaluate on an EL task, except for the restriction on linking entities only and to cluster NIL entities

Challenges Ambiguity Concepts outside of Wikipedia (NIL) Variability Blumenthal (D) is a candidate for the U.S. Senate seat now held by Christopher Dodd (D), and he has held a commanding lead in the race since he entered it. But the Times report has the potential to fundamentally reshape the contest in the Nutmeg State. Ambiguity Concepts outside of Wikipedia (NIL) Blumenthal ? Variability Scale Millions of labels Times The New York Times The Times Connecticut CT The Nutmeg State This is done with mostly with stats before What if there is no correct answer for Blumenthal in Wikipedia? How do we link those mentions to NILs even if we know Blumenthal could refer to something we know

Challenges Blumenthal (D) is a candidate for the U.S. Senate seat now held by Christopher Dodd (D), and he has held a commanding lead in the race since he entered it. But the Times report has the potential to fundamentally reshape the contest in the Nutmeg State. State-of-the-art systems (Ratinov et al. 2011) can achieve the above with local and global statistical features Reaches bottleneck around 70%~ 85% F1 on non-wiki datasets What is missing? Quite impressive, bottleneck: adding more sophisticated contextual features does not help much We do not want to level off here

Motivating Example What are we missing with Bag of Words (BOW) models? Mubarak, the wife of deposed Egyptian President Hosni Mubarak, … , the of deposed , … Mubarak wife Egyptian President Hosni Mubarak What are we missing with Bag of Words (BOW) models? Who is Mubarak? Constraining interaction between concepts (Mubarak, wife, Hosni Mubarak) Main contribution here We will show that by identifying

Relational Inference for Wikification Mubarak, the wife of deposed Egyptian President Hosni Mubarak, … (Mubarak, wife, Hosni Mubarak) Our contribution Identify key textual relations for Wikification A global inference framework to incorporate relational knowledge Significant improvement over state-of-the-art systems Main contribution here We will show that by identifying

Talk Outline Why Wikification? Introduction Approach Evaluation Motivation Approach Wikification Pipeline Formulation Relational Analysis Evaluation Result Entity Linking We will first go over the motivating example for our work How we approach Wikification by giving its mathematical formulation, as well as how to instantiate our model with relational analysis over the text Finally we will present our result and show its robustness in an Entity Linking task

Wikification Mention Segmentation Candidate Generation Candidate Ranking Determine NILs Mention Segmentation Candidate Generation Candidate Ranking NIL Linking To incorporate the improvements, we first have to take a look at the system layout We will give a quick overview of the terminology of wikification pipeline

Wikification Pipeline 1 - Mention Segmentation ...ousted long time Yugoslav President Slobodan Milošević in October. Mr. Milošević's Socialist Party… sub-NP (Noun Phrase) chunks NER Regular expressions Generate robust nested mentions

Wikification Pipeline 1 - Mention Segmentation ...ousted long time Yugoslav President Slobodan Milošević in October. Mr. Milošević's Socialist Party…

Wikification Pipeline 2 - Candidate Generation ...ousted long time Yugoslav President Slobodan Milošević in October. Mr. Milošević's Socialist Party… k ek3 1 Slobodan_Milošević 2 Milošević_(surname) 3 Boki_Milošević 4 Alexander_Milošević … k ek4 1 Socialist_Party_(France) 2 Socialist_Party_(Portugal) 3 Socialist_Party_of_America 4 Socialist_Party_(Argentina) …

Wikification Pipeline 3 - Candidate Ranking ...ousted long time Yugoslav President Slobodan Milošević in October. Mr. Milošević's Socialist Party… k ek3 sk3 1 Slobodan_Milošević 0.7 2 Milošević_(surname) 0.1 3 Boki_Milošević 4 Alexander_Milošević 0.05 … k ek4 sk4 1 Socialist_Party_(France) 0.23 2 Socialist_Party_(Portugal) 0.16 3 Socialist_Party_of_America 0.07 4 Socialist_Party_(Argentina) 0.06 … Assign score to candidate entities to express the model’s preference based on contextual coherence Local and global statistical features

Wikification Pipeline 4 – Determine NILs ...ousted long time Yugoslav President Slobodan Milošević in October. Mr. Milošević's Socialist Party… k ek3 sk3 1 Slobodan_Milošević 0.7 k ek4 sk4 1 Socialist_Party_(France) 0.23 We classify the whether the top candidate belongs to Wikipedia Usually a binary classifier optimizes for evaluation metric Is the top candidate really what the text referred to?

Talk Outline Why Wikification? Introduction Approach Evaluation Motivation Approach Wikification Pipeline Formulation Relational Analysis Evaluation Result Entity Linking Next we will introduce

Formulation (0) Intuition Mubarak, the wife of deposed Egyptian President Hosni Mubarak, … (Mubarak, wife, Hosni Mubarak) Intuition Promote pairs of concepts coherent with textual relations

Formulation (1) Formulate as an Integer Linear Program (ILP): If no relation exists, collapse to the non-structured decision weight to output 𝑒 𝑖 𝑘 Whether to output 𝑘th candidate of the 𝑖th mention weight of a relation 𝑟 𝑖𝑗 (𝑘,𝑙) Whether a relation exists between 𝑒 𝑖 𝑘 and 𝑒 𝑗 𝑙 Formalize intuition, weights induced from data, small constraints Objective function collapse to the original output when there is no relation, so our model is negatives-proof True-negatives and false-negatives will not negatively impact performance of baseline

Formulation (2) ...ousted long time Yugoslav President Slobodan Milošević in October. Mr. Milošević's Socialist Party… k ek3 sk3 1 Slobodan_Milošević 0.7 2 Milošević_(surname) 0.1 3 Boki_Milošević 4 Alexander_Milošević 0.05 … k ek4 sk4 1 Socialist_Party_(France) 0.23 2 Socialist_Party_(Portugal) 0.16 3 Socialist_Party_of_America 0.07 4 Socialist_Party_(Argentina) 0.06 … r(1,2)34 r(4,3)34 Here we give a concrete example of how the variables in the objective function is instantiated eki: whether a concept is chosen ski : score of a concept r(k,l)ij: whether a relation is present w(k,l)ij : score of a relation

Relation Identification Overall Approach Wikification Candidate Generation Candidate Ranking Determine NILs Relation Analysis Relation Identification Relation Retrieval Relational Inference Next we will describe our approach in detail by walk through an example Our contribution – relation analysis helps all stages of wikification

Relation Identification ACE style in-document coreference Extract named entity-only coreference relations with high precision Syntactico-Semantic relations (Chan & Roth ‘10) Easy to extract with high precision Aim for high recall, as false-positives will be verified and discarded Sparse, but covers ~80% relation instances in ACE2004 Type Example Premodifier Iranian Ministry of Defense Possessive NYC’s stock exchange Formulaic Chicago, Illinois Preposition President of the US Here we identify the textual relations

Relation Identification ...ousted long time Yugoslav President Slobodan Milošević in October. Mr. Milošević's Socialist Party… Argument 1 Relation Type Argument 2 Yugoslav President apposition Slobodan Milošević coreference Milošević possessive Socialist Party There are quite some relations in this short sentence Overall it is sparse but important in text understanding, as error propagates in structured inference We do not actually care about relation types too much, as the relation will be constrained by our knowledge

Relation Identification Overall Approach Wikification Candidate Generation Candidate Ranking Determine NILs Relation Analysis Relation Identification Relation Retrieval Relational Inference Next we will describe our approach in detail by walk through an example Our contribution – relation analysis helps all stages of wikification

Relation Retrieval ...ousted long time Yugoslav President Slobodan Milošević in October. Mr. Milošević's Socialist Party… Current approach Collect known mappings from Wikipedia page titles, hyperlinks… Limit to top-K candidates based on frequency of links (Ratinov et al. 2011) What concepts can “Socialist Party” refer to? Top-k gives a very strong baseline, single most important feature in wikification Include both explicit relations and Wikipedia hyperlinks as weak implicit relations in the form of (concept1,relationType,concept2)

Uninformative Mentions Using all possible candidates will comprise our system’s tractability

Relation Retrieval ...ousted long time Yugoslav President Slobodan Milošević in October. Mr. Milošević's Socialist Party… What concepts can “Socialist Party” refer to? More robust candidate generation Identified relations are verified against a knowledge base (DBPedia) Retrieve relation arguments matching “(Milošević ,?,Socialist Party)” as our new candidates Instead of the mention to concept mapping space, we go to the concept to concept relational space

Relation Retrieval ...ousted long time Yugoslav President Slobodan Milošević in October. Mr. Milošević's Socialist Party… k ek3 sk3 1 Slobodan_Milošević 0.7 … k ek4 sk4 1 Socialist_Party_(France) 0.23 … Query Pruning Only 2 queries per pair necessary due to strong baseline. Relation space too large, needs pruning Query result noisy Constrain one of the arguments to be known candidates Chances of both top answers are wrong is actually very small q1=(Socialist Party of France,?, *Milošević*) q2=(Slobodan Milošević,?,*Socialist Party*)

Relation Retrieval Argument 1 Relation Type Argument 2 Milošević possessive Socialist Party

Relation Retrieval ...ousted long time Yugoslav President Slobodan Milošević in October. Mr. Milošević's Socialist Party… k ek3 sk3 1 Slobodan_Milošević 0.7 2 Milošević_(surname) 0.1 3 Boki_Milošević 4 Alexander_Milošević 0.05 … k ek4 sk4 1 Socialist_Party_(France) 0.23 2 Socialist_Party_(Portugal) 0.16 3 Socialist_Party_of_America 0.07 4 Socialist_Party_(Argentina) 0.06 … 21 Socialist_Party_of_Serbia 0.0 Top-k gives a very strong baseline, single most important feature in Wikification 𝑟 34 (1,21) =1

Relation Identification Overall Approach Wikification Candidate Generation Candidate Ranking Determine NILs Relation Analysis Relation Identification Relation Retrieval Relational Inference Next we will describe our approach in detail by walk through an example Our contribution – relation analysis helps all stages of wikification

𝑤 34 (1,21) =? Relational Inference ...ousted long time Yugoslav President Slobodan Milošević in October. Mr. Milošević's Socialist Party… k ek3 sk3 1 Slobodan_Milošević 0.7 2 Milošević_(surname) 0.1 3 Boki_Milošević 4 Alexander_Milošević 0.05 … k ek4 sk4 1 Socialist_Party_(France) 0.23 2 Socialist_Party_(Portugal) 0.16 3 Socialist_Party_of_America 0.07 4 Socialist_Party_(Argentina) 0.06 … 21 Socialist_Party_of_Serbia 0.0 Top-k gives a very strong baseline, single most important feature in wikification 𝑟 34 (1,21) =1 𝑤 34 (1,21) =?

Retrieved relation tuple Relation scoring Query scoring as a tie-breaker between multiple relations Explicit relations are stronger than a hyperlink relations Normalize score for each pair of mention to [0,1] Relation query Retrieved relation tuple 𝑤= 1 𝑍 𝑓 𝑞,𝜎

𝑤 34 (1,21) =1 Relational Inference ...ousted long time Yugoslav President Slobodan Milošević in October. Mr. Milošević's Socialist Party… k ek3 sk3 1 Slobodan_Milošević 0.7 2 Milošević_(surname) 0.1 3 Boki_Milošević 4 Alexander_Milošević 0.05 … k ek4 sk4 1 Socialist_Party_(France) 0.23 2 Socialist_Party_(Portugal) 0.16 3 Socialist_Party_of_America 0.07 4 Socialist_Party_(Argentina) 0.06 … 21 Socialist_Party_of_Serbia 0.0 Top-k gives a very strong baseline, single most important feature in wikification 𝑟 34 (1,21) =1 𝑤 34 (1,21) =1

Relational Inference - coreference ...ousted long time Yugoslav President Slobodan Milošević in October. Mr. Milošević's Socialist Party… k ek3 sk3 1 Slobodan_Milošević 0.7 2 Milošević_(surname) 0.1 3 Boki_Milošević 4 Alexander_Milošević 0.05 … 𝑟 23 (1,1) =1 k ek2 sk2 1 Slobodan_Milošević 1.0 The semantics of coreference relation demands we assign the same output for all the mentions in the coreferent cluster

Relation Identification Overall Approach Wikification Candidate Generation Candidate Ranking Determine NILs Relation Analysis Relation Identification Relation Retrieval Relational Inference Next we will describe how we handle the NIL candidates

Determine unknown concepts (NILs) Dorothy Byrne, a state coordinator for the Florida Green Party,… nominal mention k ek1 sk1 1 Dorothy_Byrne_(British_Journalist) 0.6 2 Dorothy_Byrne_(mezzo-soprano) 0.4 k ek2 sk2 1 Green_Party_of_Florida 1.0 Top-k gives a very strong baseline, single most important feature in wikification NIL entities tends to be introduced with nominal mentions How to capture the fact: “Dorothy Byrne” does not refer to any concept in Wikipedia Identify coreferent nominal mention relations Generate better features for NIL classifier

Determine unknown concepts (NILs) Dorothy Byrne, a state coordinator for the Florida Green Party,… nominal mention k ek1 sk1 NIL 1.0 1 Dorothy_Byrne_(British_Journalist) 0.6 2 Dorothy_Byrne_(mezzo-soprano) 0.4 k ek2 sk2 1 Green_Party_of_Florida 1.0 To incorporate NIL classification into our framework for inference, we make up a NIL candidate Create NIL candidate for propagation

Talk Outline Why Wikification? Introduction Approach Evaluation Motivation Approach Wikification Pipeline Formulation Relational Analysis Evaluation Result Entity Linking Next we will introduce

Wikification Performance Result And we show that by incorporating these relational inference into the disambiguation framework, our system achieves significant improvements over the previous methods

Evaluation – TAC KBP Entity Linking Task Definition Links a named entity in a document to either a TAC Knowledge Base (TAC KB) node or NIL Cluster NIL entities Relevant Tasks Wikification Cross-document coreference Explain NIL clustering

Evaluation – TAC KBP Entity Linking Run Relational Inference (RI) Wikifier “as-is”: No retraining using TAC data Without retraining on TAC data, we improve our baseline system to a degree that is statistically indistinguishable from the top systems *Median of top 14 systems

Thank you! Conclusion Importance of linguistic and world knowledge Identification of relational information benefits Wikification Introduced an inference framework to leverage better language understandings Future work Accumulate what we know about NIL concepts Joint entity typing, coreference and disambiguation Incorporate more relations Show milosevic Thank you! *Demo will be updated in a week at: http://cogcomp.cs.illinois.edu/demo/wikify Download at: http://cogcomp.cs.illinois.edu/page/download_view/Wikifier

Back up slides BACK UP slides

Massive Textual Information How can we know more from large volumes of possibly unfamiliar raw texts?

Massive Textual Information We naturally “look up” concepts and accumulate knowledge.

Challenges Tuesday marks the 142nd anniversary of an event that forever altered the course of Chicago's development as a then-young American city ? It’s a version of Chicago – the standard classic Macintosh menu font… By the time the New Orleans Saints kicked off at Soldier Field on Sunday afternoon, their woeful history in the Windy City was fully understood.

? Challenges Ambiguity Variety Concepts outside of Wikipedia (NIL) Tuesday marks the 142nd anniversary of an event that forever altered the course of Chicago's development as a then-young American city Ambiguity Variety Concepts outside of Wikipedia (NIL) Chicago Chicago font ? Chicago The Windy City It’s a version of Chicago – the standard classic Macintosh menu font… Similar to WSD, but we have to deal with NILs By the time the New Orleans Saints kicked off at Soldier Field on Sunday afternoon, their woeful history in the Windy City was fully understood.

Organizing knowledge Wikipedia as a source of “common sense” knowledge Naturally bridges rich structured knowledge and text data Comprehensive for most purposes ~4.3 million English articles as of today Cross-lingual Estimated size of a printed version of Wikipedia as of August 2010. *Picture courtesy of Wikipedia

Motivating Example Relatively sparse, but act as hard constraints Intuitively, we need to “de-coreference” this pair of mention Opens a new dimension in text understanding helps all stages of Wikification Mubarak Hosni Mubarak Egyptian President wife Mubarak, the wife of deposed Egyptian President Hosni Mubarak, …

Wikification Approach ...ousted long time Yugoslav President Slobodan Milošević in October. Mr. Milošević's Socialist Party… Mention Segmentation Use Shallow Parsing, NER and regular expression to generate likely mentions of concepts. Match nested mentions using dictionary. Discard unknown mentions. A traditional steps

Ranking Features What are we losing? Local features from Bag of Words (BOW) representation, such as various TFIDF windows. Global features from Bag of Concepts (BOC) representation. What are we losing? Mubarak Hosni Mubarak Egyptian President wife What are we losing?

Relation Extraction In document coreference Extract entity-only coreference relations with high precision Syntactico-Semantic relations (Chan & Roth ‘10) Easy to extract with high precision Deep parsing too slow Premodifier Iranian Ministry of Defense Possessive NYC’s stock exchange Formulaic Chicago, Illinois Preposition President of the US What are the useful relations

Relational Inference Coreference ...ousted long time Yugoslav President Slobodan Milošević in October. Mr. Milošević's Socialist Party Possessive Calls for the need for looking up knowledge

Relational Inference ...ousted long time Yugoslav President Slobodan Milošević in October. Mr. Milošević's Socialist Party How to look up knowledge and use these relations?

Wikification Approach Recall concepts Rank concepts Determine unknown concepts (NILs) Move on first

Determine NILs Dorothy Bryne, a state coordinator for the Florida Green Party Leverage nominal mention relations to construct limited context

Relational Inference Example ...like Chicago O'Hare and [La Guardia] in [New York],... Wikipedia Page Ski LaGuardia_Airport 0.36 La_Guardia 0.22 Fiorello_La_Guardia 0.18 La_Guardia,_Spain 0.11 Wikipedia Page Ski New_York 0.53 New_York_City 0.44 New_York_(magazine) 0.01 New_York_Knicks

Relational Inference Scenario 1 ...like Chicago O'Hare and [La Guardia] in [New York],... Wikipedia Page Score LaGuardia_Airport 0.36 La_Guardia 0.22 Fiorello_La_Guardia 0.18 La_Guardia,_Spain 0.11 Wikipedia Page Score New_York 0.53 New_York_City 0.44 New_York_(magazine) 0.01 New_York_Knicks Γ = 0.18+0.53 = 0.71

Relational Inference Scenario 2 ...like Chicago O'Hare and [La Guardia] in [New York],... Wikipedia Page Score LaGuardia_Airport 0.36 La_Guardia 0.22 Fiorello_La_Guardia 0.18 La_Guardia,_Spain 0.11 Wikipedia Page Score New_York 0.53 New_York_City 0.44 New_York_(magazine) 0.01 New_York_Knicks Γ = 0.36+0.44 = 0.8

Relational Inference Conclusion ...like Chicago O'Hare and [La Guardia] in [New York],... Pr(Γ= ) > Pr(Γ= )

Constraint scaling 𝑞: query 𝜎: a retrieved relation triple Relation scaling factor Query lexical similarity Normalization factor Lexical similarity as a tie-breaker between multiple relations Explicit relations are stronger than a hyperlink relations Normalize score for each pair of mention to [0,1]

Wikification Pipeline 1 - Mention Segmentation ...ousted long time Yugoslav President Slobodan Milošević in October. Mr. Milošević's Socialist Party… Sub noun phrase chunks and NER segments (Ratinov et al. 2011) Regular expression capturing longer composite mentions President of the United States of America Generate robust nested mentions