1
Measures of Distributional Similarity
Presenter: Cosmin Adrian Bejan
Lillian Lee, Department of Computer Science, Cornell University
2
2 Overview
Goal: improve probability estimation for unseen cooccurrences.
Contributions of this paper: an empirical comparison of a broad range of measures; a classification of similarity functions based on the information that they incorporate; a new function for evaluating proxy distributions.
3
3 Introduction
How can we estimate the conditional cooccurrence probability P(v|n) of an unseen word pair (n,v) drawn from some finite set N × V?
Standard approaches: Katz back-off method; Jelinek-Mercer interpolation method.
An alternative approach: distance-weighted averaging,
$\hat{P}(v|n) = \frac{\sum_{m \in S(n)} \mathrm{sim}(n,m)\, P(v|m)}{\sum_{m \in S(n)} \mathrm{sim}(n,m)}$,
where S(n) is a set of candidate similar words and sim(n,m) is a function of the similarity between n and m.
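A minimal sketch of distance-weighted averaging, assuming the usual normalized form in which the estimate of P(v|n) is the similarity-weighted mean of P(v|m) over the neighbors m in S(n). The function name, the dictionary layout, and the toy nouns, verbs, and similarity scores are all hypothetical and chosen only for illustration.

```python
def distance_weighted_average(v, n, cond_prob, neighbors, sim):
    """Estimate P(v|n) from the neighbors of n.

    cond_prob[m][v] holds P(v|m) for seen pairs (hypothetical layout);
    neighbors plays the role of S(n); sim(n, m) is any similarity function.
    """
    total_weight = sum(sim(n, m) for m in neighbors)
    if total_weight == 0:
        return 0.0  # no usable neighbors: fall back to zero in this sketch
    weighted = sum(sim(n, m) * cond_prob[m].get(v, 0.0) for m in neighbors)
    return weighted / total_weight


# Toy data (made up): "juice" was never seen with "drink".
cond_prob = {
    "beer": {"drink": 0.8, "brew": 0.2},
    "wine": {"drink": 0.6, "pour": 0.4},
}
toy_sim = {("juice", "beer"): 1.0, ("juice", "wine"): 3.0}

estimate = distance_weighted_average(
    "drink", "juice", cond_prob, ["beer", "wine"],
    lambda n, m: toy_sim[(n, m)],
)
```

Here the estimate is (1.0 · 0.8 + 3.0 · 0.6) / 4.0, so the more similar neighbor "wine" dominates the average.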
4
4 Distributional Similarity Functions
Notation:
N: set of nouns
V: set of transitive verbs
(n,v): cooccurrence pair, where n is the head of the direct object of v
n, m: two nouns whose distributional similarity is to be determined
q(v) ~ P(v|n)
r(v) ~ P(v|m)
Functions: Euclidean distance (1); L1 norm (2); cosine (3); Jaccard's coefficient (4)
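The four functions named on this slide can be sketched as follows, assuming their standard textbook definitions and representing q and r as dictionaries that map verbs to probabilities. The function names and the toy distributions are hypothetical.

```python
import math


def euclidean(q, r):
    """Euclidean (L2) distance between two verb distributions."""
    verbs = set(q) | set(r)
    return math.sqrt(sum((q.get(v, 0.0) - r.get(v, 0.0)) ** 2 for v in verbs))


def l1_norm(q, r):
    """L1 distance: sum of absolute differences over all verbs."""
    verbs = set(q) | set(r)
    return sum(abs(q.get(v, 0.0) - r.get(v, 0.0)) for v in verbs)


def cosine(q, r):
    """Cosine of the angle between q and r viewed as vectors."""
    dot = sum(p * r.get(v, 0.0) for v, p in q.items())
    norm_q = math.sqrt(sum(p * p for p in q.values()))
    norm_r = math.sqrt(sum(p * p for p in r.values()))
    return dot / (norm_q * norm_r)


def jaccard(q, r):
    """Jaccard's coefficient on the supports (verbs with nonzero probability)."""
    support_q = {v for v, p in q.items() if p > 0}
    support_r = {v for v, p in r.items() if p > 0}
    return len(support_q & support_r) / len(support_q | support_r)


# Toy distributions (made up) sharing one verb out of three.
q = {"drink": 0.5, "brew": 0.5}
r = {"drink": 0.5, "pour": 0.5}
```

Note that the first two are distances (smaller means more similar), while cosine and Jaccard's coefficient are similarities (larger means more similar).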
5
5 Distributional Similarity Functions
Jensen-Shannon divergence; Kullback-Leibler divergence; confusion probability; Kendall's τ
Equations (5) to (7)
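As an illustration of the two divergences on this slide, here is a minimal sketch assuming the standard definitions: D(q‖r) = Σ q(v) log(q(v)/r(v)), and the Jensen-Shannon divergence as the mean of the KL divergences from q and r to their average. Function names and toy data are hypothetical; confusion probability and Kendall's τ are not covered.

```python
import math


def kl_divergence(q, r):
    """Kullback-Leibler divergence D(q || r); infinite if r(v) = 0 where q(v) > 0."""
    total = 0.0
    for v, qv in q.items():
        if qv == 0:
            continue  # 0 * log(0 / x) is taken to be 0
        rv = r.get(v, 0.0)
        if rv == 0:
            return math.inf
        total += qv * math.log(qv / rv)
    return total


def jensen_shannon(q, r):
    """Jensen-Shannon divergence: average KL divergence to the midpoint of q and r."""
    verbs = set(q) | set(r)
    avg = {v: 0.5 * (q.get(v, 0.0) + r.get(v, 0.0)) for v in verbs}
    return 0.5 * (kl_divergence(q, avg) + kl_divergence(r, avg))


# Toy distributions (made up): r gives zero probability to "brew".
q = {"drink": 0.5, "brew": 0.5}
r = {"drink": 0.5, "pour": 0.5}
```

With these toy inputs the KL divergence is infinite because r assigns zero probability to a verb that q supports, whereas the Jensen-Shannon divergence stays finite. This is one reason KL divergence is awkward to apply directly to sparse cooccurrence data.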
6
6 The Evaluation Method
Evaluation of similarity functions: on a binary decision task.
Data = verb-object cooccurrence pairs involving the 1000 most frequent nouns.
Training/testing split = 80% / 20%.
Testing set:
discard the pairs occurring in the training data;
split the remaining pairs into five partitions;
replace each $(n, v_1)$ with a triple $(n, v_1, v_2)$ such that $P(v_1) \approx P(v_2)$.
The task = reconstruct which of $(n, v_1)$ and $(n, v_2)$ was the original cooccurrence.
The error rate measured for test-set performance:
$\frac{1}{T}\left(\#\text{ of incorrect choices} + \frac{\#\text{ of ties}}{2}\right)$,
where T is the number of test triple tokens in the set.
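A small sketch of the error-rate computation, assuming that a tie between the two verb alternatives counts as half an error and that T is the total number of test triples. The outcome labels and the function name are hypothetical.

```python
def error_rate(outcomes):
    """Error rate over test triples.

    outcomes is a list with one label per test triple:
    "correct", "incorrect", or "tie" (hypothetical labels).
    A tie is counted as half an error (assumption of this sketch).
    """
    total = len(outcomes)
    if total == 0:
        raise ValueError("no test triples")
    incorrect = sum(1 for o in outcomes if o == "incorrect")
    ties = sum(1 for o in outcomes if o == "tie")
    return (incorrect + ties / 2) / total


# Toy run (made up): one wrong decision and two ties out of four triples.
rate = error_rate(["correct", "incorrect", "tie", "tie"])
```

Counting ties as half an error matches the expected error of breaking each tie with a fair coin flip.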
7
7 The Evaluation Method
Incorporate the similarity function into a decision rule as follows:
$(n, v_1, v_2)$ = test instance
$S_{f,k}(n)$ = the k most similar words to n according to f
evidence $E_{f,k}(n, v_1)$ for $v_1$ = the number of neighbors $m \in S_{f,k}(n)$ such that $P(v_1|m) > P(v_2|m)$
decision rule = choose the verb alternative with the greatest evidence
For two functions f and g: if $E_{f,k}(n, v_1) > E_{g,k}(n, v_1)$, then the k most similar words according to f are on the whole better predictors than the k most similar words according to g; hence f induces an inherently better similarity ranking for distance-weighted averaging.
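The decision rule above can be sketched as a simple vote among the k nearest neighbors. This is an illustrative reading of the slide, with hypothetical function names and toy data; f is assumed to return larger values for more similar nouns, so a distance function would need to be negated first.

```python
def k_most_similar(n, candidates, f, k):
    """S_{f,k}(n): the k candidates most similar to n, where larger f means closer."""
    ranked = sorted(candidates, key=lambda m: f(n, m), reverse=True)
    return ranked[:k]


def evidence(v1, v2, cond_prob, neighbors):
    """Number of neighbors m with P(v1|m) > P(v2|m)."""
    return sum(
        1 for m in neighbors
        if cond_prob[m].get(v1, 0.0) > cond_prob[m].get(v2, 0.0)
    )


def decide(n, v1, v2, cond_prob, candidates, f, k):
    """Return the verb with more evidence, or None on a tie."""
    neighbors = k_most_similar(n, candidates, f, k)
    e1 = evidence(v1, v2, cond_prob, neighbors)
    e2 = evidence(v2, v1, cond_prob, neighbors)
    if e1 > e2:
        return v1
    if e2 > e1:
        return v2
    return None


# Toy data (made up): which verb took "juice" as its direct object?
cond_prob = {
    "beer": {"drink": 0.8, "throw": 0.1},
    "wine": {"drink": 0.6},
    "rock": {"throw": 0.9},
}
scores = {"beer": 0.9, "wine": 0.8, "rock": 0.1}
choice = decide("juice", "drink", "throw", cond_prob,
                list(cond_prob), lambda n, m: scores[m], k=2)
```

With k = 2 the neighbors are "beer" and "wine", both of which prefer "drink", so the rule selects "drink".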
8
8 Similarity Metric Performance
9
9
10
10 The Skew Divergence
Remark: it is desirable to have a similarity function that focuses on the verbs that cooccur with both of the nouns being compared.
α-skew divergence:
$s_\alpha(q, r) = D(r \,\|\, \alpha \cdot q + (1 - \alpha) \cdot r)$
The skew divergence is asymmetric; $s_\alpha$ depends only on the verbs in $V_{qr}$.
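A minimal sketch of the α-skew divergence, assuming the definition s_α(q, r) = D(r ‖ α·q + (1 − α)·r), where D is the KL divergence. The function name and the toy distributions are hypothetical, and no particular value of α is implied by the slide.

```python
import math


def skew_divergence(q, r, alpha):
    """s_alpha(q, r) = D(r || alpha * q + (1 - alpha) * r), with 0 <= alpha <= 1."""
    total = 0.0
    for v, rv in r.items():
        if rv == 0:
            continue  # terms with r(v) = 0 contribute nothing
        mix = alpha * q.get(v, 0.0) + (1.0 - alpha) * rv
        if mix == 0:
            return math.inf  # only possible when alpha == 1 and q(v) == 0
        total += rv * math.log(rv / mix)
    return total


# Toy distributions (made up).
q = {"drink": 0.9, "brew": 0.1}
r = {"drink": 0.5, "pour": 0.5}

forward = skew_divergence(q, r, alpha=0.5)
backward = skew_divergence(r, q, alpha=0.5)
```

For any α below 1 the mixture keeps a share of r, so the result is finite even when q gives zero probability to a verb that r supports. Swapping the arguments gives a different value, which illustrates the asymmetry noted on the slide.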
11
11 Performance of the Skew Divergence