COLLECTIVE ANNOTATION OF WIKIPEDIA ENTITIES IN WEB TEXT - Presented by Avinash S Bharadwaj (1000663882)

ABSTRACT
The aim of the paper:
- Annotate open-domain, unstructured web text with uniquely identified entities from a social media resource such as Wikipedia.
- Use the annotations for search and mining tasks.

WHAT IS ENTITY DISAMBIGUATION?
- An entity is something that is real and has a distinct existence.
- Wikipedia articles can be considered entities.
- Entity disambiguation is the task of resolving the correspondence between mentions of entities in natural language and real-world entities.
- In this paper, disambiguation is carried out between mentions in web pages and Wikipedia articles.

ENTITY DISAMBIGUATION EXAMPLE

PREVIOUS WORK IN DISAMBIGUATION
SemTag:
- First web-scale disambiguation system.
- Annotated about 250 million web pages with IDs from the Stanford TAP.
- Preferred high precision over recall, with an average of two annotations per page.
Wikify!:
- Performed both keyword extraction and disambiguation.
- Could not achieve collective disambiguation across spots.
Milne and Witten (M&W):
- A form of collective disambiguation that gives better results than Wikify!.
- M&W achieves an F1 measure of 0.83, whereas Wikify! has an F1 measure of 0.53.
Cucerzan’s algorithm:
- Each entity is represented as a high-dimensional feature vector.
- Annotates sparingly: about 4.5% of all possible tokens are annotated.

TERMINOLOGIES
- Spot: an occurrence of text on a page that can possibly be linked to a Wikipedia article.
- Attachment: a possible entity in Wikipedia to which a spot can be linked.
- Annotation: the process of making an attachment to spots on a page.
- Gamma list: the list of all possible annotations.

TERMINOLOGIES ILLUSTRATED
Figure labels: spots, attachment, Gamma list.

COLLECTIVE ENTITY DISAMBIGUATION
- Sometimes disambiguation cannot be carried out using a single spot on a page.
- Multiple spots on a page are required to disambiguate an entity.
- All spots in an article are considered to be related.

COLLECTIVE ENTITY DISAMBIGUATION EXAMPLE

CALCULATING RELATEDNESS BETWEEN WIKIPEDIA ENTITIES
- Relatedness between two entities is defined as r(γ, γ’) = g(γ) · g(γ’).
- Cucerzan’s proposal defines relatedness between entities based on the cosine measure.
- Milne et al.’s proposal: c = number of Wikipedia pages; g(γ)[p] = 1 if page p links to page γ, 0 otherwise.
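The formula for the Milne et al. proposal is not present in this transcript. The following is a minimal sketch, assuming the commonly cited form of the Milne and Witten link-based measure computed from sets of in-linking pages. The function names and the toy page IDs are hypothetical, and the exact variant used in the paper may differ.

```python
import math

def dot_relatedness(inlinks_a, inlinks_b):
    """g(a) . g(b) for 0/1 in-link indicator vectors.

    For binary vectors the dot product equals the number of pages
    that link to both entities.
    """
    return len(set(inlinks_a) & set(inlinks_b))

def mw_relatedness(inlinks_a, inlinks_b, total_pages):
    """Assumed Milne-Witten style relatedness in [0, 1].

    inlinks_a / inlinks_b: IDs of pages linking to each entity.
    total_pages: c, the number of Wikipedia pages.
    """
    a, b = set(inlinks_a), set(inlinks_b)
    common = a & b
    if not common:
        return 0.0  # no shared in-links: treat as unrelated
    big, small = max(len(a), len(b)), min(len(a), len(b))
    denom = math.log(total_pages) - math.log(small)
    if denom <= 0:
        return 1.0  # degenerate case: every page links to both entities
    distance = (math.log(big) - math.log(len(common))) / denom
    return max(0.0, 1.0 - distance)

# Toy usage with made-up page IDs.
score = mw_relatedness({1, 2, 3, 4}, {3, 4, 5, 6}, total_pages=100)
```

Working with sets of in-linking page IDs avoids materialising a vector of length c for every entity.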

CONTRIBUTIONS OF THIS PAPER
- The paper proposes posing entity disambiguation as an optimization problem.
- The paper provides a single optimization objective, solved either with integer linear programs or with heuristics for approximate solutions.
- The paper also describes rich node features with systematic learning.
- The paper also describes a back-off strategy for controlled annotations.

MODELING COMPATIBILITY BETWEEN WIKIPEDIA ARTICLES
- Entities are modeled using a feature vector f_s(γ).
- The feature vector expresses local textual compatibility between (the context of) spot s and candidate label γ.
Components of the feature vector:
- Spot side: context of the spot.
- Wikipedia side: snippet, full text, anchor text, anchor text with context.
- Similarity measures: dot product, cosine similarity, Jaccard similarity.
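The three similarity measures can be sketched on simple bag-of-words vectors. This is an illustration only: the whitespace tokenisation and the example strings are assumptions, and the paper's actual feature extraction is not described in these slides.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Lower-cased whitespace tokens with their counts (a simplifying assumption)."""
    return Counter(text.lower().split())

def dot_product(u, v):
    """Sum of products of counts over shared tokens."""
    return sum(count * v[token] for token, count in u.items() if token in v)

def cosine_similarity(u, v):
    """Dot product normalised by the vector lengths."""
    norm_u = math.sqrt(dot_product(u, u))
    norm_v = math.sqrt(dot_product(v, v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot_product(u, v) / (norm_u * norm_v)

def jaccard_similarity(u, v):
    """Shared distinct tokens divided by all distinct tokens."""
    a, b = set(u), set(v)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical spot context versus a hypothetical Wikipedia snippet.
spot_context = bag_of_words("Michael Jordan played basketball")
wiki_snippet = bag_of_words("Jordan basketball Bulls")
features = [
    dot_product(spot_context, wiki_snippet),
    cosine_similarity(spot_context, wiki_snippet),
    jaccard_similarity(spot_context, wiki_snippet),
]
```

Pairing each Wikipedia-side field (snippet, full text, anchor text, anchor text with context) with each measure would give one feature per combination, which is one plausible way to fill the components of f_s(γ).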

METHODS FOR EVALUATING THE MODEL
The authors use two scores for evaluating the model: node score and clique score.
- Node score: defined as a function of a weight vector w, where w is learned from training data using a linear adaptation of rank SVM.
- Clique score: uses the relatedness measure of Milne and Witten.
- Total objective: the node scores and the clique score combined.

BACK-OFF METHOD
- Not all spots in a web page may be tagged.
- Uses a special tag “NA” for spots that cannot be tagged.
- Spots in the web page marked “NA” do not contribute to the clique potential.
- A factor called “RNA” controls the aggressiveness of the tagging algorithm.

IMPLEMENTATION
- Integer linear program (ILP) based formulation: cast the problem as a 0/1 integer linear program, then relax it to an LP.
- Simpler heuristics: hill climbing for optimization.
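The hill-climbing heuristic can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the un-normalised objective, the parameter name `na_reward` (standing in for the RNA factor), and the example scores. The paper's exact objective and search procedure are not given in these slides.

```python
from itertools import combinations

NA = "NA"

def objective(assignment, node_score, relatedness, na_reward):
    """Toy objective: node scores of tagged spots, a fixed reward per NA spot,
    plus pairwise relatedness among the tagged entities (NA spots excluded)."""
    total = 0.0
    tagged = []
    for spot, label in assignment.items():
        if label == NA:
            total += na_reward
        else:
            total += node_score[spot][label]
            tagged.append(label)
    for a, b in combinations(tagged, 2):
        total += relatedness.get(frozenset((a, b)), 0.0)
    return total

def hill_climb(candidates, node_score, relatedness, na_reward=0.0):
    """Change one spot's label at a time, keeping any change that raises the objective."""
    assignment = {spot: NA for spot in candidates}  # start with nothing tagged
    best = objective(assignment, node_score, relatedness, na_reward)
    improved = True
    while improved:
        improved = False
        for spot, labels in candidates.items():
            for label in list(labels) + [NA]:
                if label == assignment[spot]:
                    continue
                trial = dict(assignment)
                trial[spot] = label
                score = objective(trial, node_score, relatedness, na_reward)
                if score > best + 1e-12:
                    assignment, best, improved = trial, score, True
    return assignment, best

# Hypothetical example: "Jordan" is ambiguous until "Bulls" is considered too.
candidates = {"Jordan": ["Michael Jordan", "Jordan (country)"], "Bulls": ["Chicago Bulls"]}
node_score = {
    "Jordan": {"Michael Jordan": 0.5, "Jordan (country)": 0.6},
    "Bulls": {"Chicago Bulls": 0.9},
}
relatedness = {
    frozenset(("Michael Jordan", "Chicago Bulls")): 0.8,
    frozenset(("Jordan (country)", "Chicago Bulls")): 0.05,
}
labels, score = hill_climb(candidates, node_score, relatedness, na_reward=0.1)
```

In this toy setup the local node score alone prefers the country, and the pairwise term is what moves "Jordan" to the basketball player once "Bulls" is tagged. Raising `na_reward` makes the sketch tag fewer spots. Hill climbing only finds a local optimum, which is the trade-off against the ILP formulation.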

EVALUATING THE ALGORITHM
Evaluation measures used:
- Precision: number of spots tagged correctly out of the total number of spots tagged.
- Recall: number of spots tagged correctly out of the total number of spots in the ground truth.
- F1: the harmonic mean of precision and recall, F1 = 2 × Precision × Recall / (Precision + Recall).
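The measures can be computed directly from two spot-to-entity mappings. A minimal sketch follows, assuming that untagged (NA) spots are simply left out of the predicted mapping; the function name and the toy data are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """predicted, gold: dicts mapping spot -> entity.

    A predicted spot counts as correct only if the ground truth
    contains the same spot with the same entity.
    """
    correct = sum(1 for spot, entity in predicted.items() if gold.get(spot) == entity)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy data: three spots tagged, two of them correctly, four spots in the ground truth.
predicted = {"s1": "A", "s2": "B", "s3": "C"}
gold = {"s1": "A", "s2": "X", "s3": "C", "s4": "D"}
p, r, f1 = precision_recall_f1(predicted, gold)
```

Here precision is 2/3, recall is 2/4, and F1 works out to 4/7.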

DATASETS USED FOR EVALUATION
- Web pages crawled and stored in the IITB database.
- Publicly available data from Cucerzan’s experiments (CZ).

EXPERIMENTAL RESULTS

NAMED ENTITY DISAMBIGUATION IN WIKIPEDIA
- The name ambiguity problem has created demand for efficient, high-quality disambiguation methods.
- This is not a trivial task: the application should be capable of deciding whether a group of name occurrences belongs to the same entity.
- Traditional methods of named entity disambiguation use the bag-of-words (BOW) model.

WIKIPEDIA AS A SEMANTIC NETWORK
- Wikipedia is an open database covering most of the useful topics in the world.
- The title of a Wikipedia article describes the content of the article.
- Titles may sometimes be noisy; these are filtered using rules from Hu et al.

SEMANTIC RELATIONS BETWEEN WIKIPEDIA CONCEPTS
- Wikipedia contains rich relation structures within its pages.
- Relatedness is represented by links between Wikipedia pages.

WORKING OF NAMED ENTITY DISAMBIGUATION USING WIKIPEDIA
- Uses vectors to represent a Wikipedia entity.
- Similarity between vectors is measured for named entity disambiguation.

MEASURING SIMILARITY BETWEEN TWO WIKIPEDIA ENTITIES
- The similarity measure takes into account the full semantic relations indicated by hyperlinks in Wikipedia.
- The algorithm works in three steps, described as follows.

STEP 1
- To measure the similarity between two vector representations, the correspondence between the concepts of one vector and those of the other has to be defined.
- Semantic relations between articles are used to match the articles.

STEP 2
- Compute the semantic relatedness from one concept vector representation to another.
- Using the alignments shown in the previous step, SR(MJ1→MJ2) is computed as (0.42×0.47× ×0.51× ×0.51×0.65)/(0.42× × ×0.51) = 0.62, and SR(MJ2→MJ1) is computed as (0.47×0.42× ×0.54× × 0.51 × × 0.54 × 0.66 )/(0.47× × × × 0.54) = 0.60.

STEP 3
- Compute the similarity between two concept vector representations.
- SIM(MJ1, MJ2) is computed as (0.62 + 0.60)/2 = 0.61, SIM(MJ2, MJ3) is computed as 0.10, and SIM(MJ1, MJ3) is computed as 0.0.
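The reported SIM(MJ1, MJ2) of 0.61 is consistent with averaging the two directional relatedness scores from step 2, 0.62 and 0.60. A minimal sketch under that assumption follows; the function name is hypothetical.

```python
def symmetric_similarity(sr_a_to_b, sr_b_to_a):
    """Average the two directional semantic relatedness scores.

    Averaging makes the result independent of which vector is
    treated as the source.
    """
    return (sr_a_to_b + sr_b_to_a) / 2.0

# Directional scores taken from step 2 of the slides.
sim_mj1_mj2 = symmetric_similarity(0.62, 0.60)
```

Because the directional scores are generally not equal, some symmetric combination like this is needed before the values can be used as a similarity between two name occurrences.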

RESULTS

QUESTIONS?