Slide 1: Discriminating Word Senses Using McQuitty's Similarity Analysis
Amruta Purandare, University of Minnesota, Duluth
Advisor: Dr. Ted Pedersen
Research supported by a National Science Foundation (NSF) Faculty Early Career Development Award (#0092784)
Slide 2: Discriminating "line"
- They will begin line formation before ceremony
- Connect modem to any jack on your line
- Quit printing after the last line of each file
- Your line will not get tied while you are connected to net
- Stand balanced and comfortable during line up
- Lines that do not fit a page are truncated
- New line service provides reliable connections
- Pages are separated by line feed characters
- They stand far right when in line formation
Slide 3: The same contexts, grouped
- They will begin line formation before ceremony
- Stand balanced and comfortable during line up
- They stand far right when in line formation

- Your line will not get tied while you are connected to net
- Connect modem to any jack on your line
- New line service provides reliable connections

- Quit printing after the last line of each page
- Lines that do not fit a page are truncated
- Pages are separated by line feed characters
Slide 4: Introduction
- What is Word Sense Discrimination?
- Unsupervised learning: Training -> Features; Test -> Feature Vectors -> Clusters
Slide 5: Representing context
Features (from training):
- Bigrams
- Unigrams
- Second Order Co-occurrences / SOCs (Schütze, 1998)
- Mixture
Feature vectors (binary)
Measuring similarity:
- Cosine
- Match
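For binary context vectors, the two measures on this slide reduce to a shared-feature count (Match) and its length-scaled version (Cosine). A minimal sketch, assuming vectors are Python lists of 0/1; the function names are illustrative, not the toolkit's API:

```python
import math

def match_similarity(u, v):
    # Match: number of features shared by the two binary vectors.
    return sum(a & b for a, b in zip(u, v))

def cosine_similarity(u, v):
    # Cosine: the match count scaled by the vector lengths, so contexts
    # with many features are not favored automatically.
    denom = math.sqrt(sum(u)) * math.sqrt(sum(v))
    return match_similarity(u, v) / denom if denom else 0.0

# Two hypothetical contexts of "line" over a 6-feature vocabulary.
ctx1 = [1, 1, 0, 1, 0, 0]
ctx2 = [1, 0, 0, 1, 1, 1]
print(match_similarity(ctx1, ctx2))             # 2
print(round(cosine_similarity(ctx1, ctx2), 3))  # 0.577
```

Cosine divides the same shared-feature count by the geometric mean of the two vectors' sizes, which is the scaling effect credited in the conclusions slide.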
Slide 6: Feature examples for "line" (Unigram, Bigram, SOCs)
Slide 7: McQuitty's method (Pedersen & Bruce, 1997)
- Agglomerative
- UPGMA / Average Link
- Stopping rules: number of clusters; score cutoff
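The clustering step can be sketched as plain average-link agglomeration over a similarity matrix, with both stopping rules from this slide. This is an illustrative re-implementation under those assumptions, not the actual PDL/SenseClusters code:

```python
from itertools import combinations

def average_similarity(c1, c2, sim):
    # Average pairwise similarity between the members of two clusters.
    return sum(sim[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

def mcquitty_cluster(sim, k=1, cutoff=0.0):
    """Average-link agglomerative clustering over a similarity matrix,
    with two stopping rules: target number of clusters k, and a
    similarity score cutoff."""
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > k:
        # Find the most similar pair of clusters.
        (a, b), best = max(
            (((a, b), average_similarity(clusters[a], clusters[b], sim))
             for a, b in combinations(range(len(clusters)), 2)),
            key=lambda pair_score: pair_score[1])
        if best < cutoff:           # score-cutoff stopping rule
            break
        clusters[a] += clusters[b]  # merge the most similar pair
        del clusters[b]
    return clusters

# Four contexts whose pairwise similarities suggest two senses.
sim = [[1.0, 0.9, 0.1, 0.2],
       [0.9, 1.0, 0.2, 0.1],
       [0.1, 0.2, 1.0, 0.8],
       [0.2, 0.1, 0.8, 1.0]]
print(mcquitty_cluster(sim, k=2))   # [[0, 1], [2, 3]]
```

With `k=1` and `cutoff=0.5` on the same matrix, clustering stops at the same two clusters because the best remaining merge scores only 0.15.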
Slide 8: Evaluation

        S1   S2   S3   S4
C1      10    0    3    2
C2       1    1    7    1
C3       2    1    1    6
C4       2   15    1    2
Slide 9: Evaluation (sense columns reordered)

        S1   S3   S4   S2   Total
C1      10    3    2    0     15
C2       1    7    1    1     10
C3       2    1    6    1     10
C4       2    1    2   15     20
Total   15   12   11   17     55
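The reordering on this evaluation slide maps each cluster to one sense so that the diagonal captures as many instances as possible. For a small table this can be found by brute force over column permutations; a sketch using the counts from the slide:

```python
from itertools import permutations

# Cluster-by-sense counts from the evaluation slide
# (rows C1..C4, columns S1..S4).
table = [[10,  0, 3,  2],
         [ 1,  1, 7,  1],
         [ 2,  1, 1,  6],
         [ 2, 15, 1,  2]]

def best_assignment(table):
    """Try every ordering of the sense columns and keep the one whose
    diagonal (cluster Ci answering the sense in column i) is largest."""
    n = len(table[0])
    return max((sum(row[p[i]] for i, row in enumerate(table)), p)
               for p in permutations(range(n)))

correct, order = best_assignment(table)
total = sum(map(sum, table))
print(correct, total)                # 38 55
print([f"S{j + 1}" for j in order])  # ['S1', 'S3', 'S4', 'S2']
```

The recovered column order S1, S3, S4, S2 is exactly the one shown on the slide, giving 38/55 (about 69%) correct.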
Slide 10: Majority Sense Classifier
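This baseline labels every instance with the most frequent sense, so its accuracy is simply that sense's share of the instances. With the sense totals from the evaluation slide (15, 12, 11, 17 over 55 instances):

```python
def majority_baseline(sense_counts):
    # Label every instance with the most frequent sense; accuracy is
    # simply that sense's share of all instances.
    return max(sense_counts) / sum(sense_counts)

# Sense totals from the evaluation slide: 15, 12, 11, 17 (55 instances).
print(round(majority_baseline([15, 12, 11, 17]), 3))   # 0.309
```

Any discrimination result is judged against this figure: the 38/55 (0.691) from the reordered table clears it comfortably.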
Slide 11: Experimental Data

            Line               Senseval-2
#Senses     6                  Variable; selected top 5
#Instances  4146 (1200:600)    120/word, 73 words (100-150:50-100)
Slide 12: Scope of the experiments
584 experiments (73 words x 4 features x 2 measures)
- 73 words: 72 Senseval-2 words plus LINE
- 4 features: Bigrams, Unigrams, SOCs, Mix
- 2 similarity measures: Match, Cosine
Window = 5 (for Bigrams and SOCs)
Frequency cutoff = 2
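Window-based bigram extraction with a frequency cutoff, as configured above (window = 5, cutoff = 2), can be sketched as follows. In the experiments this is done by the NSP package; the code below is only an illustration, and its window semantics (a pair's second word falls within the next 4 positions) may differ in detail from NSP's:

```python
from collections import Counter

def window_bigrams(tokens, window=5):
    """Count ordered word pairs that co-occur within `window` tokens
    of each other (the first word of the pair comes first)."""
    pairs = Counter()
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window, len(tokens))):
            pairs[(w, tokens[j])] += 1
    return pairs

def select_features(counts, cutoff=2):
    # Frequency cutoff: keep only pairs seen at least `cutoff` times.
    return {pair for pair, n in counts.items() if n >= cutoff}

# A tiny hypothetical training fragment.
tokens = "connect the modem to the line and check the line".split()
feats = select_features(window_bigrams(tokens, window=5), cutoff=2)
print(sorted(feats))
```

The cutoff discards pairs seen only once, which is how the huge space of possible bigrams is trimmed to a usable feature set.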
Slide 13: Senseval-2 Results, POS-wise. Number of words of each POS for which the experiment obtained accuracy above the Majority classifier, by feature (SOC, BI, UNI) and measure (COS, MAT).
Slide 14: Senseval-2 Results, feature-wise, broken down by POS (N, V, ADJ) and measure (COS, MAT).
Slide 15: Senseval-2 Results, measure-wise, broken down by POS (N, V, ADJ) and feature (SOC, BI, UNI).
16
160.250.230.190.18 0.210.20 COSMAT SOCBI UNI Line Results On uniform distribution of 6 senses
Slide 17: Sample confusion table for "fine" (fine.soc.cos). Senses: S0 = elegant, S1 = small grained, S2 = superior, S3 = satisfactory, S4 = thin.
Slide 18: Conclusions
- A small set of SOCs was powerful: half the number of unigrams/bigrams
- Scaling done by Cosine helps!
- Need more training data
- Need to improve feature selection (tests of association), extraction (stemming), and matching (fuzzy matching) strategies for bigrams
- Explore new features: POS, collocations
Slide 19: Recent work
- PDL implementation
- Cluto clustering toolkit (http://www-users.cs.umn.edu/~karypis/cluto): 6 clustering methods, 12 merging criteria
- Plans: comparing clustering in similarity space vs. vector space (Schütze, 1998); stopping rules
Slide 20: Sense labeling
- They will begin line formation before ceremony
- Stand balanced and comfortable during line up
- They stand far right when in line formation

- Your line will not get tied while you are connected to net
- Connect modem to any jack on your line
- New line service provides reliable connections

- Quit printing after the last line of each file
- Lines that do not fit a page are truncated
- Pages are separated by line feed characters
Slide 21: Software Packages
- SenseClusters (our discrimination toolkit): http://www.d.umn.edu/~tpederse/senseclusters.html
- PDL (used to implement clustering algorithms): http://pdl.perl.org/
- NSP (used for extracting features): http://www.d.umn.edu/~tpederse/nsp.html
- SenseTools (used for preprocessing, feature matching): http://www.d.umn.edu/~tpederse/sensetools.html
- Cluto (clustering toolkit): http://www-users.cs.umn.edu/~karypis/cluto