1 Helping Editors Choose Better Seed Sets for Entity Set Expansion Vishnu Vyas, Patrick Pantel, Eric Crestan CIKM ’ 09 Speaker: Hsin-Lan, Wang Date: 2010/05/10.

Slides:



Advertisements
Similar presentations
Learning to Suggest: A Machine Learning Framework for Ranking Query Suggestions Date: 2013/02/18 Author: Umut Ozertem, Olivier Chapelle, Pinar Donmez,
Advertisements

Date: 2014/05/06 Author: Michael Schuhmacher, Simon Paolo Ponzetto Source: WSDM’14 Advisor: Jia-ling Koh Speaker: Chen-Yu Huang Knowledge-based Graph Document.
DQR : A Probabilistic Approach to Diversified Query recommendation Date: 2013/05/20 Author: Ruirui Li, Ben Kao, Bin Bi, Reynold Cheng, Eric Lo Source:
Multi-Document Person Name Resolution Michael Ben Fleischman (MIT), Eduard Hovy (USC) From Proceedings of ACL-42 Reference Resolution workshop 2004.
Mustafa Cayci INFS 795 An Evaluation on Feature Selection for Text Clustering.
Cognitive Modelling – An exemplar-based context model Benjamin Moloney Student No:
Date : 2013/05/27 Author : Anish Das Sarma, Lujun Fang, Nitin Gupta, Alon Halevy, Hongrae Lee, Fei Wu, Reynold Xin, Gong Yu Source : SIGMOD’12 Speaker.
A Phrase Mining Framework for Recursive Construction of a Topical Hierarchy Date : 2014/04/15 Source : KDD’13 Authors : Chi Wang, Marina Danilevsky, Nihit.
Problem Semi supervised sarcasm identification using SASI
Sequence Clustering and Labeling for Unsupervised Query Intent Discovery Speaker: Po-Hsien Shih Advisor: Jia-Ling Koh Source: WSDM’12 Date: 1 November,
LEDIR : An Unsupervised Algorithm for Learning Directionality of Inference Rules Advisor: Hsin-His Chen Reporter: Chi-Hsin Yu Date: From EMNLP.
Searchable Web sites Recommendation Date : 2012/2/20 Source : WSDM’11 Speaker : I- Chih Chiu Advisor : Dr. Koh Jia-ling 1.
Mining Query Subtopics from Search Log Data Date : 2012/12/06 Resource : SIGIR’12 Advisor : Dr. Jia-Ling Koh Speaker : I-Chih Chiu.
Person Name Disambiguation by Bootstrapping Presenter: Lijie Zhang Advisor: Weining Zhang.
Semantic text features from small world graphs Jure Leskovec, IJS + CMU John Shawe-Taylor, Southampton.
Unsupervised Information Extraction from Unstructured, Ungrammatical Data Sources on the World Wide Web Mathew Michelson and Craig A. Knoblock.
Longbiao Kang, Baotian Hu, Xiangping Wu, Qingcai Chen, and Yan He Intelligent Computing Research Center, School of Computer Science and Technology, Harbin.
Tag Clouds Revisited Date : 2011/12/12 Source : CIKM’11 Speaker : I- Chih Chiu Advisor : Dr. Koh. Jia-ling 1.
1 Context-Aware Search Personalization with Concept Preference CIKM’11 Advisor : Jia Ling, Koh Speaker : SHENG HONG, CHUNG.
Date: 2012/10/18 Author: Makoto P. Kato, Tetsuya Sakai, Katsumi Tanaka Source: World Wide Web conference (WWW "12) Advisor: Jia-ling, Koh Speaker: Jiun.
By : Garima Indurkhya Jay Parikh Shraddha Herlekar Vikrant Naik.
 An important problem in sponsored search advertising is keyword generation, which bridges the gap between the keywords bidded by advertisers and queried.
Presented by Tienwei Tsai July, 2005
Università degli Studi di Modena and Reggio Emilia Dipartimento di Ingegneria dell’Informazione Prototypes selection with.
Bayesian Sets Zoubin Ghahramani and Kathertine A. Heller NIPS 2005 Presented by Qi An Mar. 17 th, 2006.
COMMON EVALUATION FINAL PROJECT Vira Oleksyuk ECE 8110: Introduction to machine Learning and Pattern Recognition.
PAUL ALEXANDRU CHIRITA STEFANIA COSTACHE SIEGFRIED HANDSCHUH WOLFGANG NEJDL 1* L3S RESEARCH CENTER 2* NATIONAL UNIVERSITY OF IRELAND PROCEEDINGS OF THE.
1 Discovering Authorities in Question Answer Communities by Using Link Analysis Pawel Jurczyk, Eugene Agichtein (CIKM 2007)
Eric H. Huang, Richard Socher, Christopher D. Manning, Andrew Y. Ng Computer Science Department, Stanford University, Stanford, CA 94305, USA ImprovingWord.
Incident Threading for News Passages (CIKM 09) Speaker: Yi-lin,Hsu Advisor: Dr. Koh, Jia-ling. Date:2010/06/14.
GA-Based Feature Selection and Parameter Optimization for Support Vector Machine Cheng-Lung Huang, Chieh-Jen Wang Expert Systems with Applications, Volume.
Classification and Ranking Approaches to Discriminative Language Modeling for ASR Erinç Dikici, Murat Semerci, Murat Saraçlar, Ethem Alpaydın 報告者:郝柏翰 2013/01/28.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 A Comparison of SOM Based Document Categorization Systems.
Intent Subtopic Mining for Web Search Diversification Aymeric Damien, Min Zhang, Yiqun Liu, Shaoping Ma State Key Laboratory of Intelligent Technology.
One-class Training for Masquerade Detection Ke Wang, Sal Stolfo Columbia University Computer Science IDS Lab.
A Bootstrapping Method for Building Subjectivity Lexicons for Languages with Scarce Resources Author: Carmen Banea, Rada Mihalcea, Janyce Wiebe Source:
Enhancing Cluster Labeling Using Wikipedia David Carmel, Haggai Roitman, Naama Zwerdling IBM Research Lab (SIGIR’09) Date: 11/09/2009 Speaker: Cho, Chin.
Wikipedia as Sense Inventory to Improve Diversity in Web Search Results Celina SantamariaJulio GonzaloJavier Artiles nlp.uned.es UNED,c/Juan del Rosal,
Erasmus University Rotterdam Introduction Content-based news recommendation is traditionally performed using the cosine similarity and TF-IDF weighting.
LOGO Summarizing Conversations with Clue Words Giuseppe Carenini, Raymond T. Ng, Xiaodong Zhou (WWW ’07) Advisor : Dr. Koh Jia-Ling Speaker : Tu.
A Scalable Machine Learning Approach for Semi-Structured Named Entity Recognition Utku Irmak(Yahoo! Labs) Reiner Kraft(Yahoo! Inc.) WWW 2010(Information.
21/11/20151Gianluca Demartini Ranking Clusters for Web Search Gianluca Demartini Paul–Alexandru Chirita Ingo Brunkhorst Wolfgang Nejdl L3S Info Lunch Hannover,
Date : 2013/03/18 Author : Jeffrey Pound, Alexander K. Hudek, Ihab F. Ilyas, Grant Weddell Source : CIKM’12 Speaker : Er-Gang Liu Advisor : Prof. Jia-Ling.
Iterative Translation Disambiguation for Cross Language Information Retrieval Christof Monz and Bonnie J. Dorr Institute for Advanced Computer Studies.
Learning from Negative Examples in Set-Expansion Authors: Prateek Jindal and Dan Roth Dept. of Computer Science, UIUC Presenting in: ICDM 2011.
1 A Web Search Engine-Based Approach to Measure Semantic Similarity between Words Presenter: Guan-Yu Chen IEEE Trans. on Knowledge & Data Engineering,
LOGO Identifying Opinion Leaders in the Blogosphere Xiaodan Song, Yun Chi, Koji Hino, Belle L. Tseng CIKM 2007 Advisor : Dr. Koh Jia-Ling Speaker : Tu.
Exploiting Ontologies for Automatic Image Annotation Munirathnam Srikanth, Joshua Varner, Mitchell Bowden, Dan Moldovan Language Computer Corporation SIGIR.
1 1 COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani.
Date: 2013/6/10 Author: Shiwen Cheng, Arash Termehchy, Vagelis Hristidis Source: CIKM’12 Advisor: Jia-ling Koh Speaker: Chen-Yu Huang Predicting the Effectiveness.
Date: 2012/11/29 Author: Chen Wang, Keping Bi, Yunhua Hu, Hang Li, Guihong Cao Source: WSDM’12 Advisor: Jia-ling, Koh Speaker: Shun-Chen, Cheng.
LINDEN : Linking Named Entities with Knowledge Base via Semantic Knowledge Date : 2013/03/25 Resource : WWW 2012 Advisor : Dr. Jia-Ling Koh Speaker : Wei.
Date: 2011/1/11 Advisor: Dr. Koh. Jia-Ling Speaker: Lin, Yi-Jhen Mr. KNN: Soft Relevance for Multi-label Classification (CIKM’10) 1.
26/01/20161Gianluca Demartini Ranking Categories for Faceted Search Gianluca Demartini L3S Research Seminars Hannover, 09 June 2006.
Topical Clustering of Search Results Date : 2012/11/8 Resource : WSDM’12 Advisor : Dr. Jia-Ling Koh Speaker : Wei Chang 1.
Personalization Services in CADAL Zhang yin Zhuang Yuting Wu Jiangqin College of Computer Science, Zhejiang University November 19,2006.
Advisor: Hsin-Hsi Chen Reporter: Chi-Hsin Yu Date: From Word Representations:... ACL2010, From Frequency... JAIR 2010 Representing Word... Psychological.
MMM2005The Chinese University of Hong Kong MMM2005 The Chinese University of Hong Kong 1 Video Summarization Using Mutual Reinforcement Principle and Shot.
A Maximum Entropy Language Model Integrating N-grams and Topic Dependencies for Conversational Speech Recognition Sanjeev Khudanpur and Jun Wu Johns Hopkins.
Scalable Learning of Collective Behavior Based on Sparse Social Dimensions Lei Tang, Huan Liu CIKM ’ 09 Speaker: Hsin-Lan, Wang Date: 2010/02/01.
2016/3/11 Exploiting Internal and External Semantics for the Clustering of Short Texts Using World Knowledge Xia Hu, Nan Sun, Chao Zhang, Tat-Seng Chu.
Summarizing Contrastive Viewpoints in Opinionated Text Michael J. Paul, ChengXiang Zhai, Roxana Girju EMNLP ’ 10 Speaker: Hsin-Lan, Wang Date: 2010/12/07.
Maximum Entropy techniques for exploiting syntactic, semantic and collocational dependencies in Language Modeling Sanjeev Khudanpur, Jun Wu Center for.
Semi-Supervised Recognition of Sarcastic Sentences in Twitter and Amazon -Smit Shilu.
Reputation-aware QoS Value Prediction of Web Services Weiwei Qiu, Zhejiang University Zibin Zheng, The Chinese University of HongKong Xinyu Wang, Zhejiang.
1 Clustering Web Queries John S. Whissell, Charles L.A. Clarke, Azin Ashkan CIKM ’ 09 Speaker: Hsin-Lan, Wang Date: 2010/08/31.
From Frequency to Meaning: Vector Space Models of Semantics
Queensland University of Technology
Decision Making Based on Cohort Scores for
Enriching Taxonomies With Functional Domain Knowledge
Presentation transcript:

1 Helping Editors Choose Better Seed Sets for Entity Set Expansion Vishnu Vyas, Patrick Pantel, Eric Crestan CIKM ’ 09 Speaker: Hsin-Lan, Wang Date: 2010/05/10

2 Outline Introduction Impact of Seed Sets Systems Prototype Removal Clustering Minimum Overlap Criterion (MOC) Experiment Conclusions

3 Introduction Collections of named entities are used in many commercial and research applications. Semi-supervised methods (set expansion): Pattern based techniques Distributional techniques

4 Introduction Problem: The quality of the expansion can vary greatly based on the nature of the concept and the seed set. Human editors generate widely varying sets and poor expansion quality.

5 Introduction In this paper: Employ a seed set expansion system to study the impact of different seed sets. Propose several algorithms for improving the seed sets by human editors. Identify three factors of seed set composition that affect the expansion quality.

6 Impact of Seed Sets Seed Set Composition

7 Impact of Seed Sets Do humans generate good seed sets?

8 Impact of Seed Sets Factors in Seed Set Composition Prototypicality superordinate concept {dog, cat} -> pets (not animal) Ambiguity polysemy {mercury} -> elements and planets Coverage seed set shares in common with semantic space {iron, boron, nitrogen} vs. {helium, argon, xenon}

9 Systems Prototype Removal Clustering Minimum Overlap Criterion (MOC)

10 Prototype Removal A prototype is a common and unambiguous instance from a concept. sort: based on prototypicality score remove: the most prototypical seeds

11 Clustering Ambiguous seed instances belong to more than one concept. They tend to be less similar to any particular concept than their non- ambiguous counterparts.

12 Clustering distributional feature vector weight: point-wise mutual information average-link clustering Chose the tightest cluster as candidate seed set.

13 Clustering PMI(w) = (pmi w1, pmi w2, …, pmi wm ) c wf : the frequency of feature f occurring for term w n: the number of unique terms N: the total number of features for all terms

14 Minimum Overlap Criterion seeds can best represent a concept C: maximum information minimum redundancy Represent the concept with the set of features which are shared between a minimum of two seeds in the seed set.

15 Minimum Overlap Criterion joint information

16 Datasets and Baseline Select nine lists from Wikipedia ’ s List of pages which were considered complete and treated as the gold standard. Three of the lists were designated as the development set. The remaining sets were used to test the expansion performance of the seed sets generated by the three methods.

17 Experimental Setup Created trial seed sets from the original seed sets provided to us by the editors trial seed sets for each list total of 9216 trails training the parameters prototype removal: remove 3 seeds MOC: remove 4 seeds clustering: k=2

18 Experimental Results Overall Analysis

19 Experimental Results Intrinsic Analysis of Prototype Removals, Clustering and MOC

20 Experimental Results

21 Experimental Results MOC ’ s high performance: compare to prototype: Minimize semantic overlap between the seed sets. Seeds which are prototypical tend to overlap semantically with almost all seeds in a seed set. compare to ambiguous words: Ambiguous words do not share a lot of highly informative distributional features with the concept.

22 Conclusions Showed that the composition of seed sets can significantly affect the performance of set expansion. Showed that an average editor does not produce seed sets that result in high quality expansions.

23 Conclusions Identified three important factors in seed set composition – prototypicality, ambiguity and coverage. Proposed three algorithms, each one tackling a different factor affecting seed set composition.