Switch to Top-down
- Top-down, or move-to-nearest: partition documents into k clusters
- Two variants:
  - "Hard": 0/1 assignment of documents to clusters
  - "Soft": documents belong to clusters with fractional scores
- Termination:
  - when the assignment of documents to clusters ceases to change much, OR
  - when cluster centroids move negligibly over successive iterations
How to Find a Good Clustering?
- Minimize the sum of distances within clusters
[Figure: example clusters C1, C2, C3, C4, C6]
How to Efficiently Cluster Data?
K-means for Clustering
- Start with a random guess of cluster centers
- Determine the membership of each data point
- Adjust the cluster centers
K-means
1. Ask the user how many clusters they'd like (e.g., k = 5).
2. Randomly guess k cluster center locations.
3. Each datapoint finds out which center it's closest to (thus each center "owns" a set of datapoints).
4. Each center finds the centroid of the points it owns.
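The steps above can be sketched in a few lines of NumPy. This is an illustrative toy, not a reference implementation; the optional `init` argument is an addition for reproducibility.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0, init=None):
    rng = np.random.default_rng(seed)
    # Step 2: randomly guess k cluster center locations (or take given ones).
    centers = (np.asarray(init, dtype=float) if init is not None
               else X[rng.choice(len(X), size=k, replace=False)].astype(float))
    for _ in range(iters):
        # Step 3: each datapoint finds the center it is closest to
        # (a hard 0/1 assignment, as in the "hard" variant above).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: each center moves to the centroid of the points it owns.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Terminate when the centroids move negligibly.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

Steps 3 and 4 alternate until neither assignments nor centroids change, matching the termination rule stated earlier.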
Problem with K-means (Sensitive to the Initial Cluster Centroids)
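One widely used remedy (not from these slides) is the k-means++ seeding rule of Arthur and Vassilvitskii: pick each new initial center with probability proportional to its squared distance from the nearest center chosen so far, so that the starting centroids are well spread out. A minimal sketch:

```python
import numpy as np

def kmeanspp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]  # first center: uniform at random
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen center.
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        # Next center: sampled with probability proportional to d2.
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```

On two well-separated groups of points, the second center is guaranteed to come from the group the first center missed, since points in the first group have zero distance to it.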
So far in k-means we have used distance to measure similarity
- Other similarity measures are possible, e.g., kernel functions
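For instance, a kernel K induces a feature-space distance that can replace the Euclidean one, since the squared distance in feature space is K(x,x) + K(y,y) − 2K(x,y). A sketch with the RBF kernel (the choice of gamma is illustrative):

```python
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    # K(x, y) = exp(-gamma * ||x - y||^2)
    return np.exp(-gamma * np.sum((x - y) ** 2))

def kernel_distance_sq(x, y, gamma=1.0):
    # Squared distance in the kernel's feature space:
    # K(x,x) + K(y,y) - 2 K(x,y); for the RBF kernel this is 2 - 2 K(x,y).
    return (rbf_kernel(x, x, gamma) + rbf_kernel(y, y, gamma)
            - 2 * rbf_kernel(x, y, gamma))
```

The kernel distance is bounded (at most 2 for RBF), unlike Euclidean distance, which changes which points count as "far".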
Problem with K-means: binary (hard) cluster membership
Improvement: Soft Membership
- A per-feature weight indicates the importance of each feature
Self-Organizing Map (SOM)
- Like soft k-means:
  - Determine the association between clusters and documents
  - Associate a representative vector with each cluster and iteratively refine it
- Unlike k-means:
  - Clusters are embedded in a low-dimensional space right from the beginning
  - A large number of clusters can be initialized, even if many eventually remain devoid of documents
Self-Organizing Map (SOM)
- Each cluster can be a slot in a square/hexagonal grid
- The grid structure defines the neighborhood N(c) for each cluster c
- Also involves a proximity function between clusters and documents (each data item activates its closest cluster)
SOM: Update Rule
- Like a neural network: a data item d activates a neuron (its closest cluster) as well as the neighboring neurons
- E.g., a Gaussian neighborhood function h(c, c_d) = exp(−‖c − c_d‖² / (2σ²)), where c_d is the cluster closest to d and the distance is measured on the grid
- The update rule for node μ_c under the influence of d is:
  μ_c(t+1) = μ_c(t) + η h(c, c_d) (d − μ_c(t))
  where σ is the neighborhood width and η is the learning rate parameter
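Assuming the update rule above with a Gaussian neighborhood, a single SOM step on a toy one-dimensional grid might look like this (all sizes and constants are illustrative):

```python
import numpy as np

def som_update(weights, grid_pos, d, eta=0.5, sigma=1.0):
    # Winner: the neuron whose representative vector is closest to d.
    winner = np.argmin(np.linalg.norm(weights - d, axis=1))
    # Gaussian neighborhood h(c, c_d) over *grid* distance, not feature distance.
    h = np.exp(-np.sum((grid_pos - grid_pos[winner]) ** 2, axis=1)
               / (2 * sigma ** 2))
    # mu_c(t+1) = mu_c(t) + eta * h * (d - mu_c(t))
    return weights + eta * h[:, None] * (d - weights)
```

Note that the winner moves most, and its grid neighbors move less, in proportion to their neighborhood activation.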
SOM: Example I
- A SOM computed from over a million documents taken from 80 Usenet newsgroups
- Light areas have a high density of documents
SOM: Example II
- Another example of SOM at work: the sites listed in the Open Directory have been organized within a map of Antarctica
Multidimensional Scaling (MDS)
- Goal: represent documents as points in a low-dimensional space such that the Euclidean distance between any pair of points is as close as possible to the distance between them specified by the input
- Let d_ij be the a priori (user-defined) measure of distance or dissimilarity between documents i and j
- Let d̂_ij be the Euclidean distance between documents i and j picked by our MDS algorithm
Minimize the Stress
- The stress of the embedding is given by:
  stress = Σ_ij (d̂_ij − d_ij)² / Σ_ij d_ij²
- Iterative stress relaxation is the most commonly used strategy to minimize the stress
Important Issues
- Stress is not easy to optimize
- Iterative hill climbing:
  1. Points (documents) are assigned random coordinates by an external heuristic
  2. Points are moved by a small distance in the direction of locally decreasing stress
- For n documents, each point takes O(n) time to be moved, giving O(n²) time per relaxation pass
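The two-step hill climbing above can be sketched as plain gradient descent on the (unnormalized) stress; the step size `lr` and iteration count are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def stress(X, D):
    # stress = sum (dhat_ij - d_ij)^2 / sum d_ij^2, with D the input distances.
    dhat = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    return ((dhat - D) ** 2).sum() / (D ** 2).sum()

def relax(D, dim=2, iters=2000, lr=0.02, seed=0):
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    X = rng.normal(size=(n, dim))  # step 1: random initial coordinates
    for _ in range(iters):
        diff = X[:, None] - X[None, :]   # pairwise coordinate differences
        dist = np.linalg.norm(diff, axis=2)
        np.fill_diagonal(dist, 1.0)      # avoid dividing by zero on the diagonal
        dist = np.maximum(dist, 1e-9)
        # Step 2: move each point a small distance against the local
        # stress gradient (each move touches all n neighbors: O(n) per point).
        g = ((dist - D) / dist)[:, :, None] * diff
        X -= lr * g.sum(axis=1)
    return X
```

Each sweep moves all n points against the gradient, and each move sums over the other n − 1 points, which is the O(n²)-per-relaxation cost noted above.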
A Probabilistic Framework for Information Retrieval
- Three fundamental questions:
  - What statistics should be chosen to describe the characteristics of documents?
  - How do we estimate these statistics?
  - How do we compute the likelihood of generating queries given the statistics?
Multivariate Binary Model
- A document event is just a bit-vector over the vocabulary W
- The bit corresponding to a term t is flipped on with probability φ_t
- Assumptions: term occurrences are independent events; term counts are unimportant
- The probability of generating d is given by:
  Pr(d | φ) = ∏_{t ∈ d} φ_t · ∏_{t ∈ W, t ∉ d} (1 − φ_t)
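The generation probability is just a product over the vocabulary: on-bits contribute φ_t, off-bits contribute (1 − φ_t). A tiny sketch with a made-up three-term vocabulary:

```python
def binary_doc_prob(doc_terms, phi):
    # phi maps each vocabulary term t to the probability its bit is on.
    p = 1.0
    for t, phi_t in phi.items():
        p *= phi_t if t in doc_terms else (1.0 - phi_t)
    return p
```

Summed over all 2^|W| possible bit-vectors, these probabilities total 1, so the model is a proper distribution over document events.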
Multinomial Model
- Takes term counts into account, but does NOT fix the term-independence assumption
- The length of the document is determined by a random variable drawn from a suitable distribution
- Parameter set Θ = { the parameters needed to capture the length distribution, and θ_t for all terms t }
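Given the document length, the term counts follow a multinomial distribution: the probability of a count vector is the multinomial coefficient times ∏_t θ_t^tf(t). A sketch (toy probabilities are illustrative):

```python
from math import factorial, prod

def multinomial_prob(counts, theta):
    # counts: term -> term frequency; theta: term -> probability (sums to 1).
    n = sum(counts.values())
    coef = factorial(n)  # multinomial coefficient n! / prod(tf_t!)
    for c in counts.values():
        coef //= factorial(c)
    return coef * prod(theta[t] ** c for t, c in counts.items())
```

Unlike the binary model, documents with the same terms but different counts now get different probabilities.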
Mixture Models
- Suppose the corpus has m topics (clusters) with probability distribution {π_1, …, π_m}
- For a given topic z, documents are generated by a binary/multinomial distribution with parameter set θ_z
- For a document belonging to topic z, we would expect that topic's distribution to assign the document a high likelihood
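Under this generative story, the probability of a document is the prior-weighted sum Σ_z π_z Pr(d | θ_z). A toy sketch with two made-up unigram-style topics (all names and numbers are illustrative):

```python
def doc_prob(tokens, pi, topic_models):
    # Pr(d) = sum_z pi_z * Pr(d | theta_z), with a unigram-style Pr(d | theta_z).
    total = 0.0
    for weight, theta in zip(pi, topic_models):
        p = 1.0
        for w in tokens:
            p *= theta[w]
        total += weight * p
    return total
```

A document whose terms all fit one topic gets most of its probability mass from that component, which is exactly the behavior the slide describes.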
Unigram Language Model
- Observation: d = {tf_1, tf_2, …, tf_n}
- Unigram language model: θ = {p(w_1), p(w_2), …, p(w_n)}
- Maximum likelihood estimation: p(w_i) = tf_i / Σ_j tf_j
Unigram Language Model
- Probabilities for single words: θ = {p(w) for any word w in vocabulary V}
- Estimating a unigram language model by simple counting:
  - Given a document d, count the term frequency c(w, d) for each word w
  - Then p(w) = c(w, d) / |d|
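The counting estimate p(w) = c(w, d) / |d| fits in a few lines (the example document is made up):

```python
from collections import Counter

def unigram_mle(tokens):
    # p(w) = c(w, d) / |d| -- the simple counting estimate from the slide.
    counts = Counter(tokens)
    n = len(tokens)
    return {w: c / n for w, c in counts.items()}
```

The estimates sum to 1 by construction, so the result is a valid probability distribution over the words observed in d.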
Statistical Inference
- C1: h, h, h, t, h, h → bias b1 = 5/6
- C2: t, t, h, t, h, h → bias b2 = 1/2
- C3: t, h, t, t, t, h → bias b3 = 1/3
- Why does counting provide a good estimate of coin bias?
Maximum Likelihood Estimation (MLE)
- Observation: o = {o_1, o_2, …, o_n}
- Maximum likelihood estimation: choose the b that maximizes Pr(o | b)
- E.g.: o = {h, h, h, t, h, h} → Pr(o | b) = b^5 (1 − b)
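A quick numerical check that counting maximizes the likelihood: for the observation h, h, h, t, h, h the likelihood is Pr(o | b) = b^5 (1 − b), and a grid search over candidate biases lands at b = 5/6, exactly the fraction of heads.

```python
def likelihood(b, heads=5, tails=1):
    # Pr(o | b) for 5 heads and 1 tail, i.e. b^5 * (1 - b).
    return b ** heads * (1 - b) ** tails

# Grid search over candidate biases in [0, 1] with step 0.001.
best = max((i / 1000 for i in range(1001)), key=likelihood)
```

The same answer falls out analytically: setting d/db [b^5(1 − b)] = 0 gives 5b^4 − 6b^5 = 0, i.e. b = 5/6.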
Unigram Language Model
- Observation: d = {tf_1, tf_2, …, tf_n}
- Unigram language model: θ = {p(w_1), p(w_2), …, p(w_n)}
- Maximum likelihood estimation: p(w_i) = tf_i / Σ_j tf_j
Maximum A Posteriori Estimation
- Consider a special case: we toss each coin only twice
  - C1: h, t → b1 = 1/2
  - C2: h, h → b2 = 1 ?
  - C3: t, t → b3 = 0 ?
- MLE estimation is poor when the number of observations is small. This is called the "sparse data" problem!
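One standard fix, in the spirit of the slide's title, is to place a Beta prior on the bias and take the posterior mode: with a Beta(a, a) prior and h heads in n tosses, the MAP estimate is (h + a − 1) / (n + 2a − 2). With a = 2 this is add-one (Laplace) smoothing, which keeps two-toss estimates away from the extremes 0 and 1. A sketch (the choice a = 2 is illustrative):

```python
def map_bias(heads, n, a=2):
    # MAP estimate under a Beta(a, a) prior: (h + a - 1) / (n + 2a - 2).
    # With a = 2 this is (h + 1) / (n + 2), i.e. add-one smoothing.
    return (heads + a - 1) / (n + 2 * (a - 1))
```

Applied to the two-toss coins above, C2 (h, h) is estimated at 3/4 rather than 1, and C3 (t, t) at 1/4 rather than 0.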