Frequency Estimates for Statistical Word Similarity Measures
Presenter: Cosmin Adrian Bejan
Egidio Terra and C.L.A. Clarke
School of Computer Science, University of Waterloo
2 Introduction
A comparative study of two methods for estimating the word co-occurrence frequencies required by word similarity measures, applied to solving human-oriented language tests.
Examples of such tests:
determine the best synonym in a set of alternatives A = {A_1, A_2, A_3, A_4} for a specific target word TW in a context C = {w_1’, w_2’, …, w_n’} \ TW;
determine the best synonym when no context is available.
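As a minimal sketch of the context-free synonym test, assuming some word similarity function is already available: the answer is the alternative that scores highest against the target word. The names `best_synonym` and `toy_similarity`, and the scores in `TOY_SCORES`, are hypothetical and only illustrate the selection step.

```python
from typing import Callable, Sequence


def best_synonym(
    target: str,
    alternatives: Sequence[str],
    similarity: Callable[[str, str], float],
) -> str:
    """Return the alternative with the highest similarity to the target word."""
    if not alternatives:
        raise ValueError("at least one alternative is required")
    return max(alternatives, key=lambda alt: similarity(target, alt))


# Hypothetical similarity scores, invented purely for illustration.
TOY_SCORES = {
    ("big", "large"): 2.1,
    ("big", "small"): 0.7,
    ("big", "green"): 0.1,
    ("big", "fast"): 0.3,
}


def toy_similarity(w1: str, w2: str) -> float:
    """Look up a made-up score; unknown pairs get 0.0."""
    return TOY_SCORES.get((w1, w2), 0.0)


print(best_synonym("big", ["large", "small", "green", "fast"], toy_similarity))
```

Any of the similarity measures listed on the following slides could be passed in place of `toy_similarity`.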
3 Measuring Word Similarity
The notion of co-occurrence of two words can be depicted by a contingency table:
each dimension represents a discrete random variable W_i with range A = {w_i, ¬w_i} (the word is present or absent);
each cell represents the joint frequency f_{w_1,w_2} = N_max × P(w_1, w_2), where N_max is the maximum number of co-occurrences.
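The four cells of such a table can be derived from three counts and N_max. The sketch below assumes that `f_w1` and `f_w2` count the units (windows or documents) containing each word, that `f_w1w2` counts the units containing both, and that `n_max` is the total number of units. The function name and the cell layout are illustrative choices.

```python
def contingency_table(f_w1, f_w2, f_w1w2, n_max):
    """Build the 2x2 co-occurrence table for two words.

    Rows: w1 present / w1 absent. Columns: w2 present / w2 absent.
    """
    both = f_w1w2
    w1_only = f_w1 - f_w1w2
    w2_only = f_w2 - f_w1w2
    neither = n_max - f_w1 - f_w2 + f_w1w2
    if min(both, w1_only, w2_only, neither) < 0:
        raise ValueError("counts are inconsistent with each other")
    return [[both, w1_only], [w2_only, neither]]


table = contingency_table(f_w1=30, f_w2=20, f_w1w2=10, n_max=100)
print(table)
```

The cells always sum to `n_max`, which is a quick sanity check on the input counts.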
4 Similarity between two words
Pointwise Mutual Information
χ²-test
Likelihood ratio
Average Mutual Information
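Two of these measures can be sketched from the contingency counts using their standard textbook definitions, which may differ in detail (for example the logarithm base) from the forms used in the study. The function names and argument conventions are assumptions: `f_w1` and `f_w2` are the numbers of units containing each word, `f_w1w2` the number containing both, and `n` the total number of units.

```python
import math


def pmi(f_w1, f_w2, f_w1w2, n):
    """Pointwise mutual information: log2 of P(w1, w2) / (P(w1) * P(w2)).

    The ratio is computed as f_w1w2 * n / (f_w1 * f_w2), which is the same
    quantity with the shared factor n cancelled.
    """
    if f_w1 <= 0 or f_w2 <= 0 or n <= 0:
        raise ValueError("marginal counts and n must be positive")
    if f_w1w2 == 0:
        return float("-inf")  # the words never co-occur
    return math.log2(f_w1w2 * n / (f_w1 * f_w2))


def chi_square(f_w1, f_w2, f_w1w2, n):
    """Pearson chi-square statistic over the four cells of the 2x2 table."""
    if not (0 < f_w1 < n and 0 < f_w2 < n):
        raise ValueError("each word must occur in some, but not all, units")
    observed = [
        [f_w1w2, f_w1 - f_w1w2],                    # w1 present
        [f_w2 - f_w1w2, n - f_w1 - f_w2 + f_w1w2],  # w1 absent
    ]
    row_totals = [f_w1, n - f_w1]
    col_totals = [f_w2, n - f_w2]
    total = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under independence of the two words.
            expected = row_totals[i] * col_totals[j] / n
            total += (observed[i][j] - expected) ** 2 / expected
    return total


print(pmi(40, 20, 16, 100))
print(chi_square(30, 20, 10, 100))
```

Both functions return their neutral value (0 for PMI, 0 for chi-square) when the words co-occur exactly as often as independence predicts.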
5 Context-supported similarity
Cosine of Pointwise Mutual Information
L1 norm
Contextual Average Mutual Information
Contextual Jensen-Shannon Divergence
Pointwise Mutual Information of Multiple words
6 Window-oriented approach
f_{w_i} – frequency of w_i
f_{w_1,w_2} – co-occurrence frequency of w_1 and w_2
N – size of the corpus in words
P(w_i) = f_{w_i} / N
f_{w_1,w_2} is estimated by the number of windows in which the two words co-occur.
N_wt – number of windows of size t
P(w_1, w_2) = f_{w_1,w_2} / N_wt
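The window-oriented estimates can be sketched as follows. The slide does not say how windows are laid out, so this sketch assumes sliding windows shifted one token at a time, which gives N_wt = N - t + 1 for a corpus of N tokens. The function name and the returned dictionary keys are hypothetical.

```python
def window_estimates(tokens, w1, w2, t):
    """Estimate P(w1), P(w2) and P(w1, w2) with sliding windows of size t."""
    n = len(tokens)
    if t < 1 or t > n:
        raise ValueError("window size must be between 1 and the corpus size")
    n_wt = n - t + 1  # assumed: one window starting at each valid position

    # Single-word probabilities use plain token frequencies over N.
    f_w1 = tokens.count(w1)
    f_w2 = tokens.count(w2)

    # Joint frequency: number of windows containing both words.
    f_w1w2 = 0
    for start in range(n_wt):
        window = tokens[start:start + t]
        if w1 in window and w2 in window:
            f_w1w2 += 1

    return {
        "n_wt": n_wt,
        "f_w1w2": f_w1w2,
        "p_w1": f_w1 / n,
        "p_w2": f_w2 / n,
        "p_w1w2": f_w1w2 / n_wt,
    }


corpus = "a b c a d b".split()
print(window_estimates(corpus, "a", "b", t=3))
```

Rescanning every window is quadratic in the worst case; it keeps the sketch short but would need an index-based approach for a large corpus.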
7 Document-oriented approach
df_{w_i} – document frequency of a word w_i: the number of documents in which the word appears
D – the number of documents
P(w_i) = df_{w_i} / D
df_{w_1,w_2} – co-occurrence frequency of two words: the number of documents in which both words occur
P(w_1, w_2) = df_{w_1,w_2} / D
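The document-oriented estimates follow directly from the definitions above. In this minimal sketch each document is assumed to be a list of tokens; the function name and the returned dictionary keys are hypothetical.

```python
def document_estimates(documents, w1, w2):
    """Estimate P(w1), P(w2) and P(w1, w2) from document frequencies."""
    d = len(documents)
    if d == 0:
        raise ValueError("at least one document is required")

    # Repeated occurrences inside one document count once.
    vocabularies = [set(doc) for doc in documents]
    df_w1 = sum(1 for vocab in vocabularies if w1 in vocab)
    df_w2 = sum(1 for vocab in vocabularies if w2 in vocab)
    df_w1w2 = sum(1 for vocab in vocabularies if w1 in vocab and w2 in vocab)

    return {
        "p_w1": df_w1 / d,
        "p_w2": df_w2 / d,
        "p_w1w2": df_w1w2 / d,
    }


docs = [["a", "b"], ["a", "c"], ["b"], ["a", "b", "b"]]
print(document_estimates(docs, "a", "b"))
```

Unlike the window-oriented approach, all three probabilities here share the same denominator D.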
8 Results for TOEFL test set
9 Results for TS1 and context