Measures of Distributional Similarity
Presenter: Cosmin Adrian Bejan
Lillian Lee, Department of Computer Science, Cornell University

2 Overview
- Goal: improve probability estimation for unseen cooccurrences.
- Contributions of this paper:
  - an empirical comparison of a broad range of measures;
  - a classification of similarity functions based on the information that they incorporate;
  - a new function for evaluating proxy distributions.

3 Introduction
- How can we estimate the conditional cooccurrence probability P(v|n) of an unseen word pair (n,v) drawn from some finite set N×V?
- Standard approaches:
  - Katz back-off method;
  - Jelinek-Mercer interpolation method.
- An alternative approach: distance-weighted averaging, which estimates P(v|n) by averaging P(v|m) over similar words m, weighted by similarity. Here S(n) is a set of candidate similar words and sim(n,m) is a function of the similarity between n and m.
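The slide's formula for distance-weighted averaging did not survive in this transcript. A minimal sketch of the idea, assuming the estimate is a similarity-weighted average of P(v|m) over m in S(n), normalized by the total similarity; the function name, the toy nouns, and the toy weights are all hypothetical:

```python
def distance_weighted_estimate(v, n, neighbors, sim, cond_prob):
    """Estimate P(v|n) as a similarity-weighted average of P(v|m), m in S(n).

    neighbors : the candidate set S(n)
    sim       : function sim(n, m) returning a non-negative similarity weight
    cond_prob : dict mapping each noun m to a dict {verb: P(verb|m)}
    """
    total = sum(sim(n, m) for m in neighbors)
    if total == 0:
        return 0.0  # no usable neighbors (assumption: fall back to zero)
    weighted = sum(sim(n, m) * cond_prob[m].get(v, 0.0) for m in neighbors)
    return weighted / total


# Toy data (made up for illustration only).
cond_prob = {
    "apple": {"eat": 0.6, "peel": 0.4},
    "pear": {"eat": 0.5, "peel": 0.5},
}
weights = {("plum", "apple"): 3.0, ("plum", "pear"): 1.0}


def sim(n, m):
    return weights[(n, m)]


estimate = distance_weighted_estimate("eat", "plum", ["apple", "pear"], sim, cond_prob)
```

Here "plum" has no observed verbs of its own, so its estimate for "eat" is borrowed from "apple" and "pear" in proportion to their similarity weights.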

4 Distributional Similarity Functions
Notation:
- N: set of nouns
- V: set of transitive verbs
- (n,v): cooccurrence pair where n is the head of the direct object of v
- n, m: two nouns whose distributional similarity is to be determined
- q(v) ~ P(v|n)
- r(v) ~ P(v|m)
Functions [equations (1)-(4) not preserved in this transcript]:
- Euclidean distance
- L1 norm
- cosine
- Jaccard's coefficient
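The transcript names these four functions without their equations. A minimal sketch using their standard definitions, which may differ in detail from the slide's lost formulas; distributions are represented as dicts from verb to probability, and Jaccard's coefficient is assumed to compare the sets of verbs with nonzero probability:

```python
import math


def euclidean(q, r):
    """Euclidean (L2) distance between two verb distributions."""
    verbs = set(q) | set(r)
    return math.sqrt(sum((q.get(v, 0.0) - r.get(v, 0.0)) ** 2 for v in verbs))


def l1_norm(q, r):
    """L1 distance: sum of absolute differences."""
    verbs = set(q) | set(r)
    return sum(abs(q.get(v, 0.0) - r.get(v, 0.0)) for v in verbs)


def cosine(q, r):
    """Cosine of the angle between q and r viewed as vectors."""
    dot = sum(qv * r.get(v, 0.0) for v, qv in q.items())
    norm_q = math.sqrt(sum(x * x for x in q.values()))
    norm_r = math.sqrt(sum(x * x for x in r.values()))
    return dot / (norm_q * norm_r)


def jaccard(q, r):
    """Shared verbs divided by all verbs seen with either noun (assumed form)."""
    support_q = {v for v, p in q.items() if p > 0}
    support_r = {v for v, p in r.items() if p > 0}
    return len(support_q & support_r) / len(support_q | support_r)


# Toy distributions (made up): the two nouns share only the verb "eat".
q = {"eat": 0.5, "peel": 0.5}
r = {"eat": 0.5, "buy": 0.5}
```

Note that the first two are distances (smaller means more similar), while cosine and Jaccard's coefficient are similarities (larger means more similar).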

5 Distributional Similarity Functions
Functions, continued [equations (5)-(7) not preserved in this transcript]:
- Jensen-Shannon divergence
- Kullback-Leibler divergence
- confusion probability
- Kendall's τ
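The equations for these functions are also missing from the transcript. The sketch below assumes commonly used forms: KL divergence D(q‖r), Jensen-Shannon as the average KL divergence to the midpoint distribution, confusion probability as P(m) · Σ_v q(v) r(v) / P(v), and Kendall's τ without tie correction. The marginals `p_m` and `p_v` and all toy values are hypothetical:

```python
import math
from itertools import combinations


def kl(q, r):
    """D(q || r); infinite when r(v) = 0 for some verb with q(v) > 0."""
    total = 0.0
    for v, qv in q.items():
        if qv == 0:
            continue
        rv = r.get(v, 0.0)
        if rv == 0:
            return math.inf
        total += qv * math.log(qv / rv)
    return total


def jensen_shannon(q, r):
    """Average KL divergence of q and r to their midpoint distribution."""
    verbs = set(q) | set(r)
    avg = {v: 0.5 * (q.get(v, 0.0) + r.get(v, 0.0)) for v in verbs}
    return 0.5 * (kl(q, avg) + kl(r, avg))


def confusion(q, r, p_m, p_v):
    """Assumed form: P(m) * sum over v of q(v) * r(v) / P(v)."""
    return p_m * sum(qv * r.get(v, 0.0) / p_v[v] for v, qv in q.items())


def kendall_tau(q, r, verbs):
    """Concordant minus discordant verb pairs, over all pairs (no tie correction)."""
    def sign(x):
        return (x > 0) - (x < 0)

    pairs = list(combinations(verbs, 2))
    score = sum(
        sign((q.get(a, 0.0) - q.get(b, 0.0)) * (r.get(a, 0.0) - r.get(b, 0.0)))
        for a, b in pairs
    )
    return score / len(pairs)


# Toy distributions and marginals (made up for illustration).
q = {"eat": 0.5, "peel": 0.5}
r = {"eat": 0.5, "buy": 0.5}
p_v = {"eat": 0.5, "peel": 0.25, "buy": 0.25}
js = jensen_shannon(q, r)
conf = confusion(q, r, 0.2, p_v)
tau = kendall_tau(q, r, ["eat", "peel", "buy"])
```

On this toy pair the KL divergence is infinite because r assigns zero probability to "peel", while the Jensen-Shannon divergence stays finite; that contrast is one reason smoothed or averaged variants are attractive for sparse data.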

6 The Evaluation Method
- Similarity functions are evaluated on a binary decision task.
- Data = verb-object cooccurrence pairs involving the 1000 most frequent nouns.
- Training/testing split = 80% / 20%.
- Testing set:
  - discard the pairs occurring in the training data;
  - split the remaining pairs into five partitions;
  - replace each (n,v1) with an (n,v1,v2) triple such that P(v1) ≈ P(v2).
- The task = reconstruct which of (n,v1) and (n,v2) was the original cooccurrence.
- Test-set performance is measured by the error rate [formula not preserved in this transcript], where T is the number of test triple tokens in the set.
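The error-rate formula itself is missing from the transcript. A minimal sketch, assuming the convention that each tie counts as half an error and that the total is divided by T; the function name and the use of `None` to mark a tie are hypothetical choices:

```python
def error_rate(predictions, gold):
    """Return (incorrect + ties / 2) / T for a list of binary decisions.

    predictions : chosen verb per test triple, or None when the rule ties
    gold        : the verb that actually appeared in the original pair
    """
    T = len(gold)
    incorrect = sum(1 for p, g in zip(predictions, gold) if p is not None and p != g)
    ties = sum(1 for p in predictions if p is None)
    return (incorrect + ties / 2) / T


# Toy run (made up): one wrong choice and one tie out of four triples.
predictions = ["v1", "v2", None, "v1"]
gold = ["v1", "v1", "v1", "v1"]
rate = error_rate(predictions, gold)
```

Under this convention a rule that always ties scores 0.5, the same as random guessing on a binary task.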

7 The Evaluation Method
- Incorporate a similarity function f into a decision rule as follows:
  - (n,v1,v2) = test instance;
  - S_{f,k}(n) = the k most similar words to n according to f;
  - evidence E_{f,k}(n,v1) for v1 = the number of neighbors m ∈ S_{f,k}(n) such that P(v1|m) > P(v2|m);
  - the decision rule = choose the verb alternative with the greatest evidence.
- For two functions f and g: if E_{f,k}(n,v1) > E_{g,k}(n,v1), then the k most similar words according to f are on the whole better predictors than the k most similar words according to g; hence f induces an inherently better similarity ranking for distance-weighted averaging.
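The decision rule above can be sketched as follows. This is an illustration under stated assumptions, not the presenter's code: `sim` is assumed to return larger values for more similar nouns (a distance function would need to be negated), ties return `None`, and all names and toy values are hypothetical:

```python
def k_most_similar(n, candidates, sim, k):
    """S_{f,k}(n): the k candidates ranked most similar to n by sim."""
    ranked = sorted((m for m in candidates if m != n),
                    key=lambda m: sim(n, m), reverse=True)
    return ranked[:k]


def evidence(neighbors, v1, v2, cond_prob):
    """Number of neighbors m with P(v1|m) > P(v2|m)."""
    return sum(1 for m in neighbors
               if cond_prob[m].get(v1, 0.0) > cond_prob[m].get(v2, 0.0))


def decide(n, v1, v2, candidates, sim, k, cond_prob):
    """Choose the verb with the greater evidence; None signals a tie."""
    neighbors = k_most_similar(n, candidates, sim, k)
    e1 = evidence(neighbors, v1, v2, cond_prob)
    e2 = evidence(neighbors, v2, v1, cond_prob)
    if e1 == e2:
        return None
    return v1 if e1 > e2 else v2


# Toy data (made up): "car" is a poor neighbor for "plum" and is cut off at k=2.
cond_prob = {
    "apple": {"eat": 0.6, "drive": 0.0},
    "pear": {"eat": 0.5, "drive": 0.1},
    "car": {"eat": 0.0, "drive": 0.9},
}
scores = {"apple": 0.9, "pear": 0.8, "car": 0.1}


def sim(n, m):
    return scores[m]


choice = decide("plum", "eat", "drive", ["apple", "pear", "car"], sim, 2, cond_prob)
```

Because only the ranking of neighbors matters, any two similarity functions that order the candidates identically produce the same decisions under this rule.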

8 Similarity Metric Performance

10 The Skew Divergence
- Remark: it is desirable to have a similarity function that focuses on the verbs that cooccur with both of the nouns being compared.
- α-skew divergence [formula not preserved in this transcript]:
  - the skew divergence is asymmetric;
  - s_α depends only on the verbs in V_qr.
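The defining equation is missing from the transcript. The sketch below assumes the form s_α(q,r) = D(r ‖ α·q + (1−α)·r), that is, the KL divergence of r from a mixture of q and r; the function name, the restriction to 0 ≤ α < 1, and the toy distributions are assumptions for illustration:

```python
import math


def skew_divergence(q, r, alpha):
    """Assumed form: D(r || alpha * q + (1 - alpha) * r), for 0 <= alpha < 1."""
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must satisfy 0 <= alpha < 1")
    total = 0.0
    for v, rv in r.items():
        if rv == 0:
            continue
        # The mixture is strictly positive wherever r(v) > 0, so the log is finite.
        mix = alpha * q.get(v, 0.0) + (1.0 - alpha) * rv
        total += rv * math.log(rv / mix)
    return total


# Toy distributions (made up) showing the asymmetry noted on the slide.
q = {"eat": 1.0}
r = {"eat": 0.5, "buy": 0.5}
s_qr = skew_divergence(q, r, 0.5)
s_rq = skew_divergence(r, q, 0.5)
```

Mixing a little of r into the second argument keeps the divergence finite even when q assigns zero probability to a verb that r has seen, which plain KL divergence cannot do.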

11 Performance of the Skew Divergence
