Switch to Top-down
Top-down, or move-to-nearest: partition documents into k clusters.
Two variants:
- "Hard": 0/1 assignment of documents to clusters.
- "Soft": documents belong to clusters with fractional scores.
Termination: when the assignment of documents to clusters ceases to change much, or when the cluster centroids move negligibly over successive iterations.
How to Find a Good Clustering?
Minimize the sum of distances within clusters.
[Figure: example data points partitioned into clusters C1-C6]
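The objective is not written out on the slide; a standard way to formalize it (assuming Euclidean distance, with cluster centroids \mu_j) is the within-cluster sum of squares:

J(C_1,\dots,C_k) = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2 ,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i

K-means alternates between assigning points to their nearest centroid and recomputing the centroids, which locally minimizes J.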
How to Efficiently Cluster Data?
K-means for Clustering
K-means:
- Start with a random guess of the cluster centers.
- Determine the membership of each data point.
- Adjust the cluster centers.
K-means
1. Ask the user how many clusters they'd like (e.g., k = 5).
2. Randomly guess k cluster center locations.
3. Each data point finds out which center it is closest to (thus each center "owns" a set of data points).
4. Each center finds the centroid of the points it owns, and moves there.
5. Repeat steps 3 and 4 until termination (assignments stop changing, or centroids move negligibly).
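The slides describe these steps pictorially; a minimal sketch of the same loop in Python (NumPy; the function name, iteration cap, and random initialization scheme are illustrative assumptions, not from the slides) might look like:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means sketch: X is an (n, d) array of data points."""
    rng = np.random.default_rng(seed)
    # Step 2: randomly guess k cluster center locations (here: k random data points).
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: each data point finds the center it is closest to.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: each center moves to the centroid of the points it owns.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 5 / termination: stop when the centroids move negligibly.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Example usage on random 2-D data:
# labels, centers = kmeans(np.random.rand(200, 2), k=5)
```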
Problem with K-means (Sensitive to the Initial Cluster Centroids)
So far in k-means we have used distance to measure similarity. Other similarity measures are possible, e.g., kernel functions.
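The slide does not name a specific kernel; one common choice (an assumption here, not from the slide) is the Gaussian / RBF kernel, which turns a distance into a similarity in (0, 1]:

K(x, y) = \exp\!\left( -\frac{\lVert x - y \rVert^2}{2\sigma^2} \right)

where \sigma controls how quickly similarity decays with distance.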
Problem with K-means Binary cluster membership
Improve: Soft Membership
A per-feature scale (weight) parameter in the membership function indicates the importance of each feature.
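The slide's membership formula did not survive scraping; one common form of soft membership (an assumption, written here with a stiffness parameter \beta and centroids \mu_j) gives document x_i a fractional weight for every cluster:

r_{ij} = \frac{\exp(-\beta \lVert x_i - \mu_j \rVert^2)}{\sum_{j'=1}^{k} \exp(-\beta \lVert x_i - \mu_{j'} \rVert^2)}

so that \sum_j r_{ij} = 1, and the centroids are then updated as weighted means \mu_j = \sum_i r_{ij} x_i / \sum_i r_{ij}.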
Self-Organizing Map (SOM)
Like soft k-means:
- Determine an association between clusters and documents.
- Associate a representative vector with each cluster and iteratively refine it.
Unlike k-means:
- Embed the clusters in a low-dimensional space right from the beginning.
- A large number of clusters can be initialized, even if many of them eventually remain devoid of documents.
Self-Organizing Map (SOM)
Each cluster can be a slot in a square/hexagonal grid. The grid structure defines the neighborhood N(c) for each cluster c.
It also involves a proximity function between pairs of clusters, which decays with their distance on the grid.
SOM: Update Rule
Like a neural network: a data item d activates a neuron (its closest cluster) as well as the neighboring neurons.
E.g., a Gaussian neighborhood function over grid positions:

h(\gamma, c) = \exp\!\left( -\frac{\lVert \gamma - c \rVert^2}{2\sigma^2(t)} \right)

The update rule for node \gamma under the influence of d is:

\mu_\gamma \leftarrow \mu_\gamma + \eta \, h(\gamma, c_d)\,(d - \mu_\gamma)

where c_d is the neuron closest to d, \sigma(t) is the neighborhood width, and \eta is the learning rate parameter.
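A minimal sketch of one such update step in Python (NumPy; the function name, grid representation, learning rate, and width are illustrative assumptions):

```python
import numpy as np

def som_update(doc, weights, grid, eta=0.1, sigma=1.0):
    """One SOM update: doc is a (d,) vector, weights an (m, d) array of
    cluster representative vectors, grid an (m, 2) array of grid positions."""
    # The winning neuron: the cluster whose vector is closest to the document.
    winner = np.linalg.norm(weights - doc, axis=1).argmin()
    # Gaussian neighborhood function over grid distance to the winner.
    grid_dist2 = np.sum((grid - grid[winner]) ** 2, axis=1)
    h = np.exp(-grid_dist2 / (2 * sigma ** 2))
    # Move every cluster vector toward the document, weighted by h.
    return weights + eta * h[:, None] * (doc - weights)
```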
SOM: Example I
A SOM computed from over a million documents taken from 80 Usenet newsgroups. Light areas have a high density of documents.
SOM: Example II Another example of SOM at work: the sites listed in the Open Directory have been organized within a map of Antarctica at http://antarcti.ca/.
Multidimensional Scaling (MDS)
Goal: represent documents as points in a low-dimensional space such that the Euclidean distance between any pair of points is as close as possible to the distance between them specified by the input.
Given an a priori (user-defined) measure of distance or dissimilarity d_{ij} between documents i and j, let \hat{d}_{ij} be the Euclidean distance between documents i and j in the embedding picked by our MDS algorithm.
Minimize the Stress
The stress of the embedding is given by:

\text{stress} = \frac{\sum_{i,j} \left( \hat{d}_{ij} - d_{ij} \right)^2}{\sum_{i,j} d_{ij}^2}

Iterative stress relaxation is the most commonly used strategy to minimize the stress.
Important Issues
Stress is not easy to optimize, so use iterative hill climbing:
1. Points (documents) are assigned initial coordinates at random or by an external heuristic.
2. Points are moved by a small distance in the direction of locally decreasing stress.
For n documents, each point takes O(n) time to be moved, giving O(n^2) time per relaxation pass in total.
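A minimal sketch of this hill-climbing loop in Python (NumPy; the function name, step size, iteration count, and the use of an unnormalized squared-error gradient are assumptions for illustration, with constant factors folded into the step size):

```python
import numpy as np

def mds_relax(D, dim=2, n_iters=200, step=0.01, seed=0):
    """Iterative stress relaxation: D is an (n, n) matrix of target
    dissimilarities d_ij; returns an (n, dim) embedding."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    X = rng.normal(size=(n, dim))                 # step 1: random initial coordinates
    for _ in range(n_iters):
        diff = X[:, None, :] - X[None, :, :]      # pairwise coordinate differences
        dist = np.linalg.norm(diff, axis=2) + 1e-9  # current Euclidean distances d̂_ij
        # Gradient of sum_ij (d̂_ij - d_ij)^2 with respect to each point's coordinates.
        grad = ((dist - D) / dist)[:, :, None] * diff
        X -= step * grad.sum(axis=1)              # step 2: move against the gradient
    return X
```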
A Probabilistic Framework for Information Retrieval
Three fundamental questions:
- What statistics should be chosen to describe the characteristics of documents?
- How do we estimate these statistics?
- How do we compute the likelihood of generating queries given the statistics?
Multivariate Binary Model
A document event is just a bit-vector over the vocabulary W; the bit corresponding to a term t is flipped on with probability \phi_t.
Assume that term occurrences are independent events and that term counts are unimportant.
The probability of generating d is then given by:

\Pr(d \mid \phi) = \prod_{t \in d} \phi_t \prod_{t \in W,\, t \notin d} (1 - \phi_t)
Multinomial Model
Takes term counts into account, but does NOT fix the term-independence assumption.
The length of a document is determined by a random variable drawn from a suitable length distribution.
The parameter set \Theta contains all parameters needed to capture the length distribution, together with a term probability \theta_t for each term t.
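The generation probability itself is not on the scraped slide; under the standard multinomial document model (an assumption here, with document length \ell_d, term counts tf_t, and term probabilities \theta_t) it takes the form:

\Pr(d \mid \Theta) = \Pr(\ell_d)\, \frac{\ell_d!}{\prod_{t \in d} tf_t!} \prod_{t \in d} \theta_t^{\,tf_t}

where \Pr(\ell_d) comes from the assumed length distribution.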
Mixture Models
Suppose the corpus has m topics (clusters), with a probability distribution \{\pi_1, \dots, \pi_m\} over topics.
Given a topic z, documents are generated by a binary/multinomial distribution with parameter set \Theta_z.
For a document belonging to topic z, we would expect the generation probability \Pr(d \mid \Theta_z) to be high.
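The slide's formula did not survive scraping; under the notation assumed above (topic priors \pi_z and per-topic parameters \Theta_z), the mixture likelihood of a document is:

\Pr(d) = \sum_{z=1}^{m} \pi_z \, \Pr(d \mid \Theta_z)

i.e., a weighted sum of the per-topic generation probabilities.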
Unigram Language Model
Observation: d = \{tf_1, tf_2, \dots, tf_n\}
Unigram language model: \theta = \{p(w_1), p(w_2), \dots, p(w_n)\}
Maximum likelihood estimation:

\theta^{*} = \arg\max_{\theta} \Pr(d \mid \theta) = \arg\max_{\theta} \prod_{i=1}^{n} p(w_i)^{tf_i}
Unigram Language Model
Probabilities for single words: \theta = \{p(w) \text{ for any word } w \text{ in vocabulary } V\}
Estimating a unigram language model by simple counting: given a document d, count the term frequency c(w, d) for each word w; then p(w) = c(w, d) / |d|.
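A minimal sketch of this counting estimate in Python (function and variable names are illustrative):

```python
from collections import Counter

def unigram_mle(doc_tokens):
    """Estimate a unigram language model by simple counting:
    p(w) = c(w, d) / |d| for each word w in document d."""
    counts = Counter(doc_tokens)   # term frequencies c(w, d)
    total = len(doc_tokens)        # document length |d|
    return {w: c / total for w, c in counts.items()}

# Example usage:
# unigram_mle("to be or not to be".split())
# -> {'to': 1/3, 'be': 1/3, 'or': 1/6, 'not': 1/6}
```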
Statistical Inference
C1: h, h, h, t, h, h (bias b1 = 5/6)
C2: t, t, h, t, h, h (bias b2 = 1/2)
C3: t, h, t, t, t, h (bias b3 = 1/3)
Why does counting provide a good estimate of the coin bias?
Maximum Likelihood Estimation (MLE)
Observation: o = \{o_1, o_2, \dots, o_n\}
Maximum likelihood estimation: choose the bias b that maximizes \Pr(o \mid b), i.e. b^{*} = \arg\max_b \Pr(o \mid b).
E.g., for o = \{h, h, h, t, h, h\}: \Pr(o \mid b) = b^5 (1 - b)
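The worked step answering the previous slide's question (not shown on the slide itself) is to maximize this likelihood directly:

\frac{d}{db}\, b^5 (1 - b) = 5b^4 - 6b^5 = b^4 (5 - 6b) = 0
\;\Rightarrow\; b^{*} = \tfrac{5}{6}

which is exactly the counting estimate: 5 heads out of 6 tosses.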
Maximum A Posteriori Estimation
Consider a special case: we only toss each coin twice.
C1: h, t (b1 = 1/2)
C2: h, h (b2 = 1)
C3: t, t (b3 = 0 ?)
MLE is poor when the number of observations is small. This is called the "sparse data" problem!
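The slide stops at posing the problem; a common remedy (an assumption here, not spelled out on this slide) is to place a Beta(\alpha, \beta) prior on the bias and take the posterior mode instead of the MLE:

b_{\text{MAP}} = \arg\max_b \Pr(b \mid o)
= \frac{\#\text{heads} + \alpha - 1}{\#\text{tosses} + \alpha + \beta - 2}

With \alpha = \beta = 2 (a mild preference for fair coins), two heads out of two tosses give b_{\text{MAP}} = 3/4 rather than 1, softening the sparse-data problem.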