Published by Roy Pearson. Modified over 8 years ago.
1 Helping Editors Choose Better Seed Sets for Entity Set Expansion
Vishnu Vyas, Patrick Pantel, Eric Crestan
CIKM ’09
Speaker: Hsin-Lan, Wang
Date: 2010/05/10
2 Outline
- Introduction
- Impact of Seed Sets
- Systems
  - Prototype Removal
  - Clustering
  - Minimum Overlap Criterion (MOC)
- Experiment
- Conclusions
3 Introduction
- Collections of named entities are used in many commercial and research applications.
- Semi-supervised methods (set expansion):
  - Pattern-based techniques
  - Distributional techniques
4 Introduction
Problem:
- The quality of the expansion can vary greatly depending on the nature of the concept and on the seed set.
- The seed sets that human editors generate vary widely and often yield poor expansion quality.
5 Introduction
In this paper:
- Employ a seed set expansion system to study the impact of different seed sets.
- Propose several algorithms for improving the seed sets provided by human editors.
- Identify three factors of seed set composition that affect the expansion quality.
6 Impact of Seed Sets Seed Set Composition
7 Impact of Seed Sets Do humans generate good seed sets?
8 Impact of Seed Sets
Factors in Seed Set Composition
- Prototypicality (superordinate concept): {dog, cat} -> pets (not animals)
- Ambiguity (polysemy): {mercury} -> elements and planets
- Coverage (how much of the semantic space the seed set shares with the concept): {iron, boron, nitrogen} vs. {helium, argon, xenon}
9 Systems
- Prototype Removal
- Clustering
- Minimum Overlap Criterion (MOC)
10 Prototype Removal
A prototype is a common and unambiguous instance of a concept.
- Sort: order the seeds by prototypicality score.
- Remove: drop the most prototypical seeds.
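The sort-and-remove step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the slide does not say how the prototypicality score is computed, so the `score` callable, the `remove_prototypes` name, and the toy scores are all assumptions.

```python
from typing import Callable, List


def remove_prototypes(seeds: List[str],
                      score: Callable[[str], float],
                      k: int) -> List[str]:
    """Return the seeds with the k highest-scoring (most prototypical) ones removed.

    `score` is a hypothetical prototypicality function supplied by the caller;
    the slides do not define how it is computed.
    """
    # Sort by prototypicality, most prototypical first.
    ranked = sorted(seeds, key=score, reverse=True)
    dropped = set(ranked[:k])
    # Keep the surviving seeds in their original order.
    return [s for s in seeds if s not in dropped]


# Toy scores, invented purely for illustration.
toy_scores = {"dog": 0.9, "cat": 0.8, "ferret": 0.3, "iguana": 0.1}
remaining = remove_prototypes(["ferret", "dog", "iguana", "cat"], toy_scores.get, k=2)
```

With these toy scores the two most prototypical seeds, dog and cat, are dropped and the remaining seeds keep their original order.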
11 Clustering
- Ambiguous seed instances belong to more than one concept.
- They tend to be less similar to any particular concept than their non-ambiguous counterparts.
12 Clustering
- Represent each seed as a distributional feature vector.
- Weight: point-wise mutual information
- Average-link clustering
- Choose the tightest cluster as the candidate seed set.
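A rough sketch of this step, under stated assumptions: seeds are sparse feature vectors stored as dicts, similarity is cosine, and "tightest" is taken to mean the highest mean pairwise similarity inside a cluster. The slide fixes none of these choices, and every function name here is hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(f, 0.0) for f, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def avg_link(a, b, vecs):
    """Average-link similarity: mean similarity over all cross-cluster pairs."""
    return sum(cosine(vecs[x], vecs[y]) for x in a for y in b) / (len(a) * len(b))


def cluster_seeds(vecs, k):
    """Agglomerative average-link clustering, stopping at k clusters."""
    clusters = [[w] for w in vecs]
    while len(clusters) > k:
        # Merge the pair of clusters with the highest average-link similarity.
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda p: avg_link(clusters[p[0]], clusters[p[1]], vecs))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return clusters


def tightness(cluster, vecs):
    """Mean pairwise similarity; singletons rank last (an assumption)."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return float("-inf")
    return sum(cosine(vecs[a], vecs[b]) for a, b in pairs) / len(pairs)


def tightest_cluster(vecs, k):
    """Pick the tightest of the k clusters as the candidate seed set."""
    return max(cluster_seeds(vecs, k), key=lambda c: tightness(c, vecs))


# Toy vectors, invented for illustration only.
toy = {
    "helium": {"gas": 1.0, "noble": 1.0},
    "argon": {"gas": 1.0, "noble": 0.9},
    "mercury": {"planet": 1.0, "metal": 0.5},
}
candidate = tightest_cluster(toy, k=2)
```

In the toy data the ambiguous seed mercury shares no features with the two noble gases, so it ends up alone and the helium/argon cluster is selected.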
13 Clustering
PMI(w) = (pmi_{w1}, pmi_{w2}, …, pmi_{wm})
- c_{wf}: the frequency of feature f occurring for term w
- n: the number of unique terms
- N: the total number of features for all terms
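The symbols defined on this slide match the usual point-wise mutual information weighting, in which pmi_{wf} is the log of (c_{wf}/N) divided by the product of the feature's marginal (the sum of c_{if} over all n terms, over N) and the term's marginal (the sum of c_{wj} over all m features, over N). The sketch below assumes that standard form and assumes N is the sum of all counts; the `pmi_vectors` name and the input layout are hypothetical.

```python
import math
from collections import defaultdict


def pmi_vectors(counts):
    """Build a PMI-weighted feature vector for every term.

    `counts[w][f]` is c_wf, the frequency of feature f for term w.
    Assumes the standard PMI form and N = sum of all counts.
    """
    N = sum(c for feats in counts.values() for c in feats.values())
    # Marginal total per term: sum over features j of c_wj.
    term_total = {w: sum(feats.values()) for w, feats in counts.items()}
    # Marginal total per feature: sum over terms i of c_if.
    feat_total = defaultdict(int)
    for feats in counts.values():
        for f, c in feats.items():
            feat_total[f] += c
    return {
        w: {
            f: math.log((c / N) / ((feat_total[f] / N) * (term_total[w] / N)))
            for f, c in feats.items() if c > 0
        }
        for w, feats in counts.items()
    }


# Two terms with disjoint features: each observed pair is twice as likely
# as independence would predict, so its PMI is log 2.
vectors = pmi_vectors({"a": {"x": 1}, "b": {"y": 1}})
```

Unobserved (term, feature) pairs are simply left out of the sparse vector, which is one common convention rather than something the slide specifies.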
14 Minimum Overlap Criterion
The seeds that best represent a concept C provide:
- maximum information
- minimum redundancy
Represent the concept with the set of features that are shared by at least two seeds in the seed set.
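The concept representation described above (features shared by at least two seeds) can be sketched as below. This covers only that representation, not the seed-selection step of MOC, and the `shared_features` name, the input layout, and the toy features are assumptions.

```python
from collections import Counter


def shared_features(seed_features, min_seeds=2):
    """Return the features that occur for at least `min_seeds` different seeds.

    `seed_features` maps each seed to an iterable of its features.
    """
    # Count each feature once per seed, even if a seed lists it twice.
    seen_in = Counter(f for feats in seed_features.values() for f in set(feats))
    return {f for f, n in seen_in.items() if n >= min_seeds}


# Toy features, invented for illustration only.
concept = shared_features({
    "iron": {"element", "metal"},
    "boron": {"element", "metalloid"},
    "nitrogen": {"element", "gas"},
})
```

Here only "element" is shared by two or more seeds, so it alone represents the concept; features seen for a single seed are ignored.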
15 Minimum Overlap Criterion joint information
16 Datasets and Baseline
- Selected nine lists from Wikipedia’s "List of" pages that were considered complete and treated them as the gold standard.
- Three of the lists were designated as the development set.
- The remaining lists were used to test the expansion performance of the seed sets generated by the three methods.
17 Experimental Setup
- Created trial seed sets from the original seed sets provided to us by the editors.
  - 1024 trial seed sets for each list
  - total of 9216 trials
- Training the parameters
  - prototype removal: remove 3 seeds
  - MOC: remove 4 seeds
  - clustering: k=2
18 Experimental Results Overall Analysis
19 Experimental Results Intrinsic Analysis of Prototype Removals, Clustering and MOC
20 Experimental Results
21 Experimental Results
MOC’s high performance:
- Compared to prototype removal: MOC minimizes semantic overlap between the seeds in a seed set. Prototypical seeds tend to overlap semantically with almost all seeds in a seed set.
- Compared to ambiguous words: ambiguous words do not share many highly informative distributional features with the concept.
22 Conclusions
- Showed that the composition of seed sets can significantly affect the performance of set expansion.
- Showed that an average editor does not produce seed sets that result in high-quality expansions.
23 Conclusions
- Identified three important factors in seed set composition: prototypicality, ambiguity and coverage.
- Proposed three algorithms, each one tackling a different factor affecting seed set composition.