Distributional Clustering of English Words
Fernando Pereira, AT&T Bell Laboratories, 600 Mountain Ave.
Naftali Tishby, Dept. of Computer Science, Hebrew University
Lillian Lee, Dept. of Computer Science, Cornell University
Presenter: Juan Ramos, Dept. of Computer Science, Rutgers University, juramos@cs.rutgers.edu
Overview
Purpose: evaluate a method for clustering words according to their distribution in particular syntactic contexts.
Methodology: find the lowest-distortion set of word clusters, which then serves as a model of word co-occurrence.
Applications
Scientific point of view: lexical acquisition of word senses.
Practical point of view: word classification addresses data sparseness in grammar models.
Clusters are computed over a large corpus of documents.
Definitions
Context: the function a given word serves in its sentence.
–E.g., a noun appearing as a direct object
Sense class: hidden model describing word association tendencies
–A mix of clusters and cluster probabilities given a word
Cluster: probabilistic realization of a sense class
Problem Setting
Restrict the problem to verbs (V) and nouns (N) in the main verb-direct object relationship.
f(v, n) = frequency of occurrence of the verb-noun pair (v, n)
–Text must be pre-processed to extract these pairs
For a given noun n, the conditional distribution over verbs is p_n(v) = f(v, n) / Σ_{v'} f(v', n)
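To make the estimation step concrete, here is a minimal sketch (my own code, not the authors'): it builds the conditional distributions p_n(v) from raw verb-noun pair counts; the pair list is a hypothetical toy example.

```python
from collections import Counter, defaultdict

def conditional_distributions(pairs):
    """pairs: iterable of (verb, noun) tokens from pre-processed text.
    Returns p with p[n][v] = f(v, n) / sum_{v'} f(v', n)."""
    f = Counter(pairs)              # f[(v, n)] = co-occurrence frequency
    totals = defaultdict(float)     # totals[n] = sum over v of f(v, n)
    for (v, n), count in f.items():
        totals[n] += count
    p = defaultdict(dict)
    for (v, n), count in f.items():
        p[n][v] = count / totals[n]
    return p

# Toy data (hypothetical): two senses of "fire" show up in the counts.
pairs = [("fire", "gun"), ("fire", "missile"), ("fire", "employee"),
         ("dismiss", "employee")]
p = conditional_distributions(pairs)
print(p["employee"])  # {'fire': 0.5, 'dismiss': 0.5}
```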
Problem Setting cont.
Goal: create a set C of clusters and membership probabilities p(c|n).
Each c in C is associated with a cluster centroid p_c, a distribution over V
–p_c is the membership-weighted average of the noun distributions p_n
Distributional Similarity
Given two distributions p, q, the KL distance is D(p || q) = Σ_x p(x) log(p(x) / q(x))
–D(p || q) = 0 if and only if p = q
–Small D(p || q) implies the two distributions are likely instances of the same centroid p_c
D(p_n || p_c) measures the information lost by using the centroid p_c in place of p_n (see the sketch below)
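For reference, a one-function sketch of the KL distance as defined above, with distributions stored as dicts; it assumes q(x) > 0 wherever p(x) > 0 (true later on, when q is a centroid that mixes every noun distribution).

```python
import math

def kl(p, q):
    """D(p || q) = sum_x p(x) * log(p(x) / q(x)).
    Assumes q[x] > 0 wherever p[x] > 0 (e.g., q is a smoothed centroid)."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)
```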
Theoretical Foundation
Given unstructured sets V and N, and training data of X independent verb-noun pairs.
Problem: learn the joint distribution of pairs from the training data
Not quite unsupervised, not quite supervised
–The pairs have no internal structure
–The underlying distribution must be learned
Distributional Clustering
Approximately decompose each noun's verb distribution as p'_n(v) = Σ_{c in C} p(c|n) · p_c(v)
–p(c|n) = membership probability of n in c
–p_c(v) = p(v|c) = probability of v under the centroid of c
Assuming the marginals p(n) and p'(n) coincide, the joint model takes the symmetric form p'(n, v) = Σ_{c in C} p(c) · p(n|c) · p(v|c)
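A toy rendering of the decomposition (indexing conventions are mine): given membership probabilities p(c|n) and centroids p_c, the model's conditional verb distribution is a simple mixture.

```python
def decomposed_prob(v, memb_n, centroids):
    """p'_n(v) = sum_c p(c|n) * p_c(v).
    memb_n[j] = p(c_j | n); centroids[j] is a dict giving p_{c_j}(v)."""
    return sum(w * p_c.get(v, 0.0) for w, p_c in zip(memb_n, centroids))
```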
Maximum Likelihood Cluster Centroids
Used to maximize the goodness of fit between the data and the model p'(n, v)
For a sequence S of pairs, the model log-probability of S is l(S) = Σ_{(v,n) in S} log p'(n, v)
–Maximize with respect to p(n|c) and p(v|c)
–At the maximum, the variation of l(S) with respect to these distributions vanishes
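Building on the decomposition sketch above, the conditional log-probability of a pair sequence S can be evaluated directly; memb (a map from each noun to its membership vector) is a name of my choosing.

```python
import math

def log_likelihood(S, memb, centroids):
    """l(S) = sum over pairs (v, n) in S of log p'_n(v)."""
    return sum(math.log(decomposed_prob(v, memb[n], centroids))
               for (v, n) in S)
```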
Maximum Entropy Cluster Membership
Assume independence between the variations of p(n|c) and p(v|c).
–The Bayes inverses p(n|c) can then be recovered from the memberships p(c|n) (a sketch follows)
–The p(v|c) that maximize l(S) also minimize the average distortion between the cluster model and the data
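A small sketch of the Bayes inversion step, assuming we invert the memberships p(c|n) against a noun prior p(n); the matrix indexing is mine, not the paper's.

```python
def bayes_invert(memb, p_n):
    """p(n_i | c_j) = p(c_j | n_i) * p(n_i) / p(c_j),
    where p(c_j) = sum_i p(c_j | n_i) * p(n_i).
    memb[i][j] = p(c_j | n_i); p_n[i] = prior probability of noun i."""
    k = len(memb[0])
    p_c = [sum(memb[i][j] * p_n[i] for i in range(len(p_n)))
           for j in range(k)]
    return [[memb[i][j] * p_n[i] / p_c[j] for j in range(k)]
            for i in range(len(p_n))]
```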
Entropy Cluster Membership cont.
Average cluster distortion: ⟨D⟩ = Σ_n Σ_c p(c|n) D(p_n || p_c)
Entropy: H = −Σ_n Σ_c p(c|n) log p(c|n)
Entropy Cluster Membership cont.
Class and membership distributions take a Gibbs form, e.g. p(c|n) = exp(−β D(p_n || p_c)) / Z(n)
–Z(c) and Z(n) are normalization sums
Substituting these distributions simplifies the log-likelihood
At the maximum, the variation vanishes
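A sketch of the membership computation in the Gibbs form given above, reusing kl from the earlier snippet; Z(n) is the normalization sum over clusters.

```python
import math

def memberships(p_n, centroids, beta):
    """p(c|n) = exp(-beta * D(p_n || p_c)) / Z(n)."""
    weights = [math.exp(-beta * kl(p_n, p_c)) for p_c in centroids]
    z_n = sum(weights)
    return [w / z_n for w in weights]
```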
KL Distortion
Minimize the KL distortion by setting the variation of the KL distances to zero:
–The resulting centroid is a membership-weighted average of the noun distributions (see the sketch below).
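The weighted-average centroid update, sketched with the same conventions as the earlier snippets; for simplicity it weights nouns by p(c|n) alone, i.e. it assumes a uniform noun prior.

```python
from collections import defaultdict

def update_centroid(nouns_p, memb_c):
    """p_c(v) = sum_n p(c|n) * p_n(v) / sum_n p(c|n).
    nouns_p: list of noun distributions p_n; memb_c[i] = p(c | n_i)."""
    total = sum(memb_c)
    p_c = defaultdict(float)
    for p_n, w in zip(nouns_p, memb_c):
        for v, pv in p_n.items():
            p_c[v] += w * pv / total
    return dict(p_c)
```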
Free Energy Function
Combined minimum distortion and maximum entropy is equivalent to minimizing the free energy F = ⟨D⟩ − H/β
F determines ⟨D⟩ and H through its partial derivatives with respect to β
The minimum of F balances the disordering effect of maximum entropy against the ordering effect of minimum distortion
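Putting the previous quantities together, F can be evaluated directly; this follows my reading of the two preceding slides (⟨D⟩ and H as defined there) and reuses the earlier sketches.

```python
import math

def free_energy(nouns_p, membs, centroids, beta):
    """F = <D> - H / beta, with
    <D> = sum_n sum_c p(c|n) * D(p_n || p_c) and
    H   = -sum_n sum_c p(c|n) * log p(c|n)."""
    avg_d = sum(w * kl(p_n, p_c)
                for p_n, m in zip(nouns_p, membs)
                for w, p_c in zip(m, centroids))
    h = -sum(w * math.log(w) for m in membs for w in m if w > 0)
    return avg_d - h / beta
```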
Hierarchical Clustering
The number of clusters is determined through a sequence of increases of β.
–Higher β implies more local influence of a noun on the definition of centroids
Start with low β and a single c in C
–Search for the lowest β that splits c into two or more leaf clusters
–Repeat until |C| reaches the desired size (see the annealing sketch below)
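A compact annealing loop in the spirit of this slide, chaining the earlier sketches. It is deliberately simplified: rather than searching for the exact critical β at which a cluster splits, it duplicates and perturbs every centroid at each stage and lets the updates decide whether the twins drift apart. All constants and helper names are illustrative, not the paper's.

```python
import random

def perturb(p_c, eps=1e-3):
    """Slightly noisy, renormalized copy of a centroid, so twin copies can separate."""
    noisy = {v: pv * (1.0 + eps * random.random()) for v, pv in p_c.items()}
    z = sum(noisy.values())
    return {v: pv / z for v, pv in noisy.items()}

def anneal(nouns_p, beta0=1.0, rate=1.5, max_clusters=4, iters=25):
    """Grow the cluster set while raising beta, alternating the membership
    and centroid updates sketched on the previous slides."""
    centroids = [update_centroid(nouns_p, [1.0] * len(nouns_p))]  # one cluster
    beta = beta0
    while len(centroids) < max_clusters:
        # Duplicate-and-perturb each centroid; past its split point the
        # twins converge to distinct leaf clusters.
        centroids = [perturb(c) for c in centroids for _ in range(2)]
        for _ in range(iters):
            membs = [memberships(p_n, centroids, beta) for p_n in nouns_p]
            centroids = [update_centroid(nouns_p, [m[j] for m in membs])
                         for j in range(len(centroids))]
        beta *= rate
    return centroids
```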
Experimental Results
Classify 64 nouns appearing as direct objects of the verb 'fire' in 1988 Associated Press documents, with |V| = 2147.
For the first splits, show the four words most similar to each cluster centroid and their KL distances.
–Split 1: a cluster of 'fire' as discharging weapons vs. a cluster of 'fire' as dismissing employees
–Split 2: weapons as projectiles vs. weapons as guns
Clustering on Verb ‘fire’
Evaluation
Evaluation cont.
Conclusions
Clustering is efficient, informative, and yields good predictions
Future work
–Make the clustering method more rigorous
–Introduce human judgment, i.e. a more supervised approach
–Extend the model to other word relationships