
1 Clustering
KAIST CS LAB Oh Jong-Hoon

2 Content
What is clustering
Clustering methods:
Hierarchical clustering (single-link, complete-link, group-average, ...)
Non-hierarchical clustering (K-means, EM algorithm)

3 What is clustering?
[Figure: a set of objects grouped into red, yellow, and blue clusters]

4 What is clustering
Definition: partition a set of objects into groups (clusters).
In Statistical NLP, clustering serves two purposes:
EDA (exploratory data analysis)
Generalization: if Monday, Tuesday, ..., Sunday cluster together but the training data has no entry for "Friday", the cluster lets us generalize to it.
Sorts of clustering:
Hierarchical vs. flat (non-hierarchical)
Hard (1:1) vs. soft (1:n, with a degree of membership)


6 Hierarchical Clustering (bottom-up)
1. Determine all inter-object dissimilarities.
2. Form a cluster from the two closest objects or clusters.
3. Redefine the dissimilarities between the new cluster and the other objects.
4. Return to step 2 until all objects are in one cluster (sketched in code below).
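A minimal single-link sketch of this loop, assuming NumPy, Euclidean dissimilarity, and a points array of shape (n, d); the function name and the returned merge log are illustrative, not from the slides.

    import numpy as np
    from itertools import combinations

    def agglomerate(points):
        """Bottom-up clustering sketch: repeatedly merge the two closest
        clusters (single-link) until only one cluster remains."""
        # Step 1: determine all inter-object dissimilarities (Euclidean).
        dist = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
        clusters = {i: [i] for i in range(len(points))}  # one cluster per object
        merges = []
        while len(clusters) > 1:
            # Step 2: find the two closest clusters, where cluster distance
            # is the distance between their two closest members.
            a, b = min(combinations(clusters, 2),
                       key=lambda p: dist[np.ix_(clusters[p[0]], clusters[p[1]])].min())
            # Steps 3-4: merge and repeat; dissimilarities to the new cluster
            # are redefined implicitly, since we always scan member pairs.
            merges.append((tuple(clusters[a]), tuple(clusters[b])))
            clusters[a] = clusters[a] + clusters.pop(b)
        return merges  # the merge history encodes the dendrogram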

7 Example (objects A, B, C, D, E, F)
Calculate the inter-object dissimilarities; for each object, find the closest object.
The closest pair is A and F, so select A, F => C1.
Form the cluster and recalculate the dissimilarities between C1 and the remaining objects B, C, D, E, then repeat.

8 Three methods of hierarchical clustering
Single-link: similarity of the two most similar members
Complete-link: similarity of the two least similar members
Group-average: average similarity between members

9 Single-link: similarity of the two most similar members.

10 Complete-link: similarity of the two least similar members.

11 Group-average: average similarity between members.
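For concreteness, the three criteria can be written as functions over a precomputed pairwise distance matrix. A minimal sketch, assuming dist is an n x n NumPy array and A, B are lists of member indices (the function names are mine). Stated over distances, single-link takes the minimum (the two most similar members) and complete-link the maximum (the two least similar members).

    import numpy as np

    def single_link(dist, A, B):
        # Distance between the two most similar members: smallest pairwise distance.
        return dist[np.ix_(A, B)].min()

    def complete_link(dist, A, B):
        # Distance between the two least similar members: largest pairwise distance.
        return dist[np.ix_(A, B)].max()

    def group_average(dist, A, B):
        # Average pairwise distance between members of the two clusters.
        return dist[np.ix_(A, B)].mean()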

12 Comparison
Single-link: relatively efficient; long straggly clusters; ellipsoidal clusters; loosely bound clusters
Complete-link: tightly bound clusters
Group-average: intermediate between single-link and complete-link

13 Language Model
Clustering can improve a language model by way of generalization:
For rare events (which do not have enough training data), it yields more accurate predictions.
Example: machine translation (S: source language, T: target language)
나는 지난 일요일에 학교에 갔다. -> I went to school last Sunday.
나는 지난 토요일에 학교에 갔다. -> I went to school last Saturday.
A model that clusters the weekday words can generalize from the first sentence to the second when computing P(T).
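One standard way to make this generalization concrete (not spelled out on the slide) is a class-based model in the style of Brown et al., where each word w belongs to a cluster c(w) and a bigram probability is factored through the clusters:

    P(w_i \mid w_{i-1}) \approx P(w_i \mid c(w_i)) \, P(c(w_i) \mid c(w_{i-1}))

A rare word such as 토요일 (Saturday) then borrows statistics from its weekday cluster, which is exactly what is needed to score the second sentence when computing P(T).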

14 Top-down
Split a cluster into objects by a measure of cohesion:
1. Determine all inter-object cohesions within the cluster (initially only one cluster exists).
2. Split the cluster into two clusters using the cohesion measure.
3. Recalculate the cohesion for all clusters.
4. Return to step 2 until each cluster contains one object (sketched below).
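A minimal top-down sketch in the same setting as before (NumPy, a precomputed n x n distance matrix). The slide's cohesion measure is unspecified, so as a simple stand-in this splits on the most distant pair of members; all names are illustrative.

    import numpy as np

    def split_cluster(dist, members):
        """Split one cluster in two: seed with its two most distant
        (least cohesive) members, assign the rest to the nearer seed."""
        sub = dist[np.ix_(members, members)]
        i, j = np.unravel_index(sub.argmax(), sub.shape)  # farthest pair
        left, right = [members[i]], [members[j]]
        for k, m in enumerate(members):
            if k not in (i, j):
                (left if sub[k, i] <= sub[k, j] else right).append(m)
        return left, right

    def divisive(dist):
        """Top-down clustering: keep splitting until every cluster
        holds a single object, recording each split."""
        clusters, splits = [list(range(len(dist)))], []
        while any(len(c) > 1 for c in clusters):
            c = max(clusters, key=len)  # split the largest cluster next
            clusters.remove(c)
            left, right = split_cluster(dist, c)
            clusters += [left, right]
            splits.append((left, right))
        return splits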

15 Non-Hierarchical Clustering
Start from randomly selected seeds, then iteratively reallocate objects until a stopping criterion is met.
Stopping criteria: maximum likelihood, or heuristics such as the number of clusters, cluster size, and so on.
Finding the optimal solution is difficult.

16 K-means (algorithm)
1. The initial centers of the clusters are randomly selected.
2. Assign each object to a cluster using the distance between the center and the object.
3. Re-compute the center of each cluster.
4. Return to step 2 until the stopping criterion is satisfied (sketched below).
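A minimal NumPy sketch of these four steps; the stopping criterion (centers no longer move, capped at n_iter rounds) and the seeding via random index choice are my assumptions.

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # Step 1: randomly select k objects as the initial centers.
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # Step 2: assign each object to the cluster with the nearest center.
            # (argmin breaks exact ties by lower index; slide 17 discusses
            # other ways of breaking ties.)
            d = np.linalg.norm(X[:, None] - centers[None, :], axis=-1)
            labels = d.argmin(axis=1)
            # Step 3: re-compute each center as the mean of its cluster.
            new_centers = np.array([X[labels == j].mean(axis=0)
                                    if np.any(labels == j) else centers[j]
                                    for j in range(k)])
            # Step 4: stop once the centers no longer move.
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return labels, centers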

17 K-means is hard clustering
Issue: how to break ties when D(x, Ci) = D(x, Cj), i.e., object x is equidistant from two centers:
Assign the object randomly to one of the candidate clusters (this can cause the algorithm not to converge), or
Perturb objects slightly so that their new positions do not give rise to ties.

18 EM algorithm (1)
Goal: determine the most likely estimates for the parameters of the distribution.
Z: unobserved data set, Z_i = (z_i1, z_i2, ..., z_ik), where z_ij = 1 if object i is a member of cluster j, and 0 otherwise.
X: observed (incomplete) data set, X = {x_1, ..., x_n}, each x_i a vector with m dimensions.

19 EM algorithm (2)
[Figure: objects distributed among clusters 1, 2, 3, ..., k-1, k]
We want to calculate the probability P(c_j | x_i).
Assume that each cluster j has a normal distribution, and maximize a likelihood of the form below.
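Under the normal assumption the posterior follows from Bayes' rule; a reconstruction of the form the slide refers to, writing \pi_j = P(c_j) for the cluster prior and \mathcal{N} for a normal density:

    h_{ij} = P(c_j \mid \vec{x}_i)
           = \frac{\pi_j \, \mathcal{N}(\vec{x}_i; \vec{\mu}_j, \Sigma_j)}
                  {\sum_{l=1}^{k} \pi_l \, \mathcal{N}(\vec{x}_i; \vec{\mu}_l, \Sigma_l)}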

20 EM algorithm (3)
Find the parameters θ that maximize the log likelihood given in the equation below.
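For a mixture of k Gaussians, that log likelihood has the standard form (a reconstruction consistent with the definitions on the previous slides):

    l(\theta \mid X) = \sum_{i=1}^{n} \log \sum_{j=1}^{k}
        \pi_j \, \mathcal{N}(\vec{x}_i; \vec{\mu}_j, \Sigma_j)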

21 Procedure of EM
Expectation step (E): compute h_ij, the expectation of z_ij.
Maximization step (M): re-compute the parameters θ (the mean, variance, and prior of each normal distribution).
Both steps are sketched in code below.
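These two steps map directly onto code. A sketch for a mixture of Gaussians, assuming NumPy and SciPy (scipy.stats.multivariate_normal); the initialization choices, the fixed iteration count, and the small ridge added to the covariances for numerical stability are mine.

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_gmm(X, k, n_iter=50, seed=0):
        rng = np.random.default_rng(seed)
        n, m = X.shape
        # Initialize theta: uniform priors, random objects as means,
        # the global covariance for every cluster.
        pi = np.full(k, 1.0 / k)
        mu = X[rng.choice(n, size=k, replace=False)]
        cov = np.array([np.atleast_2d(np.cov(X.T)) + 1e-6 * np.eye(m)
                        for _ in range(k)])
        for _ in range(n_iter):
            # E step: h[i, j] = P(c_j | x_i), the expectation of z_ij.
            h = np.array([pi[j] * multivariate_normal.pdf(X, mu[j], cov[j])
                          for j in range(k)]).T          # shape (n, k)
            h /= h.sum(axis=1, keepdims=True)
            # M step: re-estimate prior, mean, and covariance per cluster.
            Nj = h.sum(axis=0)                           # soft counts
            pi = Nj / n
            mu = (h.T @ X) / Nj[:, None]
            for j in range(k):
                d = X - mu[j]
                cov[j] = (h[:, j, None] * d).T @ d / Nj[j] + 1e-6 * np.eye(m)
        return pi, mu, cov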


23 Advantages and disadvantages of EM
Advantages: simple and easy to implement; its memory requirements are reasonable.
Disadvantages: slow linear convergence; convergence to the global maximum is not guaranteed.

24 Properties of hierarchical and flat clustering
Hierarchical clustering:
Preferable for detailed data analysis; provides more information than flat clustering.
No single best algorithm (the choice depends on the application).
Less efficient than flat clustering (an N x N similarity matrix is required).
Flat clustering:
Preferable when efficiency matters or the data sets are very large.
K-means is the conceptually simplest method, but it assumes a Euclidean representation space and so cannot be used for many data sets; in such cases the EM algorithm is chosen.

25 Summary
Hierarchical clustering:
Bottom-up: single-link, complete-link, group-average
Top-down
Non-hierarchical clustering: K-means, the EM algorithm

