Learning from Negative Examples in Set-Expansion. Authors: Prateek Jindal and Dan Roth, Dept. of Computer Science, UIUC. Presented at ICDM 2011.

Presentation Plan Introduction Centroid-Based Approach to Set-Expansion Incorporating Negative Examples in Centroid-Based Approach Inference-Based Approach to Set-Expansion Experimental Results

Set Expansion Set-expansion has been viewed as the problem of generating an extensive list of instances of a concept of interest, given a few examples of the concept as input. For example, if the seed-set is {Steffi Graf, Martina Hingis, Serena Williams}, the system should output an extensive list of female tennis players. We focus on set-expansion from free text, as opposed to web-based approaches that build on existing lists.

Importance of Negative Examples Most of the work on set-expansion has used only positive examples. For example, to produce a list of female tennis players, a few names of female tennis players are given as input to the system. However, just specifying a few female tennis players doesn't define the concept precisely enough. Set-expansion systems therefore tend to output some male tennis players along with female tennis players. Specifying a few names of male tennis players as negative examples defines the concept more precisely.

Positive Examples are NOT Sufficient We used 7 positive examples to generate this list using state-of-the-art techniques which accept only positive examples. The errors have been underlined and italicized. The output is corrupted by male tennis players.

Negative Examples Help Adding only one negative example to the seed-set improves the list quality significantly. The second column contains no errors.

Presentation Plan Introduction Centroid-Based Approach to Set-Expansion Incorporating Negative Examples in Centroid-Based Approach Inference-Based Approach to Set-Expansion Experimental Results

Finding the Neighbours We compute the similarity between any two entities using the cosine coefficient. The NBRLIST of an entity is the list of other entities ranked in decreasing order of their similarity to it.
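As a rough illustration (not the authors' code), the similarity computation and neighbour ranking might look like the following Python sketch; the dictionary-based vector representation and the helper names cosine_similarity and nbrlist are illustrative assumptions:

import math

def cosine_similarity(x, y):
    # x, y: sparse feature vectors, e.g. {context_feature: weight}
    dot = sum(x[f] * y.get(f, 0.0) for f in x)
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return dot / (norm_x * norm_y)

def nbrlist(query_vec, entity_vecs):
    # rank all entities by decreasing cosine similarity to query_vec
    scored = [(e, cosine_similarity(query_vec, v)) for e, v in entity_vecs.items()]
    return sorted(scored, key=lambda p: p[1], reverse=True)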

Centroid Based Approach to Set-Expansion In the centroid-based approach, the centroid is computed by first averaging the frequency vectors of the entities in the seed-set and then computing the discounted PMI of the resulting frequency vector. Next, the NBRLIST of the centroid is computed, and the system outputs the first M members of this NBRLIST.
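A minimal sketch of this procedure, reusing the nbrlist helper above; the discounted-PMI re-weighting is passed in as an assumed callable (pmi_reweight) since its details are not shown in this sketch:

def centroid_expand(seed_set, freq_vecs, pmi_reweight, M):
    # freq_vecs: entity -> {feature: raw count}
    # average the frequency vectors of the seed entities
    avg = {}
    for e in seed_set:
        for f, count in freq_vecs[e].items():
            avg[f] = avg.get(f, 0.0) + count / len(seed_set)
    # re-weight the averaged vector with discounted PMI (assumed helper)
    centroid = pmi_reweight(avg)
    # rank the remaining entities by similarity to the centroid, keep the top M
    candidates = {e: pmi_reweight(v) for e, v in freq_vecs.items() if e not in seed_set}
    return [e for e, _ in nbrlist(centroid, candidates)[:M]]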

Presentation Plan Introduction Centroid-Based Approach to Set-Expansion Incorporating Negative Examples in Centroid-Based Approach Inference-Based Approach to Set-Expansion Experimental Results

Incorporating Negative Examples Not all features are equally important. To incorporate this knowledge into set-expansion, we associate a weight term with each entry in the vocabulary. A higher weight means that a particular word is more relevant to the underlying concept. Incorporating these weights into the cosine-similarity formula yields a new, weighted similarity measure.
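One plausible form of the weighted similarity (an assumption; the paper's exact formula may differ) multiplies each feature's contribution by its weight in both the dot product and the norms:

import math

def weighted_cosine(x, y, w):
    # w: vocabulary feature -> learned relevance weight (default 1.0)
    dot = sum(w.get(f, 1.0) * x[f] * y.get(f, 0.0) for f in x)
    nx = math.sqrt(sum(w.get(f, 1.0) * v * v for f, v in x.items()))
    ny = math.sqrt(sum(w.get(f, 1.0) * v * v for f, v in y.items()))
    return dot / (nx * ny) if nx and ny else 0.0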

Incorporating Negative Examples We wish to learn a weight vector w such that (i) the similarity between the positive examples and the centroid is above a pre-specified threshold, and (ii) the similarity between the negative examples and the centroid is below a pre-specified threshold. We accomplish this objective by solving a linear program over w.
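A simplified Python sketch of such a program, assuming the weighted similarity is linear in w (e.g. a weighted dot product over pre-normalized dense vectors) and using SciPy's linprog; the thresholds t_pos and t_neg and the minimize-total-weight objective are illustrative choices rather than the paper's exact formulation:

import numpy as np
from scipy.optimize import linprog

def learn_weights(pos_vecs, neg_vecs, centroid, t_pos=0.5, t_neg=0.2):
    # pos_vecs, neg_vecs: lists of dense, pre-normalized np.ndarray vectors
    d = centroid.shape[0]
    cost = np.ones(d)           # minimize total weight mass
    A_ub, b_ub = [], []
    for x in pos_vecs:          # sim_w(x, centroid) >= t_pos  ->  -(x*centroid) @ w <= -t_pos
        A_ub.append(-(x * centroid))
        b_ub.append(-t_pos)
    for x in neg_vecs:          # sim_w(x, centroid) <= t_neg  ->   (x*centroid) @ w <= t_neg
        A_ub.append(x * centroid)
        b_ub.append(t_neg)
    res = linprog(cost, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(0.0, 1.0)] * d, method="highs")
    return res.x if res.success else np.ones(d)  # fall back to uniform weights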

Presentation Plan Introduction Centroid-Based Approach to Set-Expansion Incorporating Negative Examples in Centroid-Based Approach Inference-Based Approach to Set-Expansion Experimental Results

Inference Based Approach to Set-Expansion We do not compute the centroid of the positive examples. The new approach is based on the intuition that the positive and negative examples can complement each other's decisions to better represent the underlying concept. Each example can be thought of as an expert which provides positive or negative evidence regarding the membership of any entity in the underlying concept. We develop a mechanism to combine the suggestions of such experts.

Inference Based Approach to Set-Expansion First, we compute the NBRLIST of each positive and negative example. Entities that have high similarity to the positive examples are more likely to belong to the underlying concept, while entities that have high similarity to the negative examples are unlikely to belong to it. We associate a reward (or penalty) with each entity in these lists based on the rank of the entity. Our reward (or penalty) function is based on the effective length, ℒ, of a list.
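A rough sketch of how this evidence combination might be implemented, reusing the nbrlist helper above; the linear rank-based reward (L - rank) / L and the +1/-1 signs are illustrative assumptions, not the paper's exact function:

def inference_expand(pos_examples, neg_examples, entity_vecs, M, list_factor=2.0):
    L = int(M * list_factor)               # effective length of each expert's list
    scores = {}

    def add_evidence(example, sign):
        # each example acts as an expert: reward (+1) or penalize (-1)
        # entities according to their rank in the example's NBRLIST
        ranked = nbrlist(entity_vecs[example], entity_vecs)[:L]
        for rank, (e, _) in enumerate(ranked):
            scores[e] = scores.get(e, 0.0) + sign * (L - rank) / L

    for p in pos_examples:
        add_evidence(p, +1)
    for n in neg_examples:
        add_evidence(n, -1)
    for s in pos_examples + neg_examples:  # do not return the seeds themselves
        scores.pop(s, None)
    ranked_out = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [e for e, _ in ranked_out[:M]]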

Inference Based Approach to Set-Expansion

Presentation Plan Introduction Centroid-Based Approach to Set-Expansion Incorporating Negative Examples in Centroid-Based Approach Inference-Based Approach to Set-Expansion Experimental Results

Effect of List Factor on List Quality The effective length, ℒ, of a list is computed by multiplying the required list length (or cut-off) by a list factor, ℱ. If M is the specified cut-off, then ℒ = M × ℱ. For example, with a cut-off of M = 100 and ℱ = 2, the effective length is ℒ = 200.

Dataset Used For Experiments We used the AFE section of the English Gigaword Corpus for our experiments. This is a comprehensive archive of newswire text data in English.

Experimental Results Notation Used: 1. SEC - Set Expansion system using Centroid. 2. SECW - Set Expansion system using Centroid where Weights are associated with the vocabulary terms. This system can learn from negative examples. 3. SEI - Set Expansion system using Inference. SEC and SECW serve as baseline systems.

Experimental Results We compare the performance of SEI with the two baselines on 5 different concepts as mentioned below: 1. Female Tennis Players (FTP) 2. Indian Politicians (IP) 3. Athletes (ATH) 4. Film Actors (FA) 5. Australian Cricketers (AC)

Experimental Results

How to Choose Good Negative Examples Good negative examples are closely related to the true instances of the desired concept.

Conclusion Negative examples help set-expansion. Unlike the centroid-based approach, the inference-based approach easily allows the incorporation of negative examples. Good negative examples are closely related to true instances of the desired concept. Thank You! Questions!