k-Means and DBSCAN Erik Zeitler Uppsala Database Laboratory
k-Means
Input:
- M (set of points)
- k (number of clusters)
Output:
- µ_1, …, µ_k (cluster centroids)
k-Means partitions the points of M into k clusters S_i, i = 1, …, k, by minimizing the squared error function
    \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2
where µ_i is the centroid of all x_j ∈ S_i.
k-Means algorithm
select (m_1 … m_k) randomly from M        % initial centroids
do
    (µ_1 … µ_k) = (m_1 … m_k)
    all clusters C_i = {}
    for each point p in M
        % compute cluster membership of p
        [µ_i, i] = min(dist(µ, p))
        % assign p to the corresponding cluster:
        C_i = C_i ∪ {p}
    end
    for each cluster C_i                   % recompute the centroids
        m_i = avg(x in C_i)
    end
while exists m_i ≠ µ_i                     % convergence criterion
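A minimal Python sketch of the loop above, assuming points are equal-length tuples of numbers and squared Euclidean distance as the distance measure. The names `k_means`, `dist2` and `mean`, the `seed` and `max_iter` parameters, and the rule that an empty cluster keeps its previous centroid are assumptions added for illustration; the slides do not specify them.

```python
import random


def dist2(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise average of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(M, k, seed=0, max_iter=100):
    """Return (centroids, clusters) for the list of points M."""
    rng = random.Random(seed)
    m = rng.sample(M, k)                      # initial centroids drawn from M
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):                 # assumed safety cap on iterations
        mu = m
        clusters = [[] for _ in range(k)]
        for p in M:
            # cluster membership of p: index of the nearest centroid
            i = min(range(k), key=lambda j: dist2(mu[j], p))
            clusters[i].append(p)
        # recompute centroids; an empty cluster keeps its old centroid (assumption)
        m = [mean(c) if c else mu[i] for i, c in enumerate(clusters)]
        if m == mu:                           # convergence: no centroid moved
            break
    return m, clusters
```

Calling `k_means(points, 2)` on two well-separated groups of points should return one cluster per group; a different `seed` changes only the initial centroids.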
k-Means on three clusters
I’m feeling unlucky: bad initial points
Eat this! Non-spherical clusters
k-Means in practice
How to choose initial centroids:
- select randomly among the data points
- generate completely randomly
How to choose k:
- study the data
- run k-Means for different k
- measure squared error for each k
Run k-Means many times!
- Get many choices of initial points
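The advice above (measure the squared error, rerun with different initial points, compare several k) could be sketched in Python as follows. This is an illustration under assumptions: `cluster_fn` is a hypothetical callable `cluster_fn(M, k, seed)` returning `(centroids, clusters)`, points are tuples of numbers, and the helper names `squared_error`, `best_of_restarts` and `error_per_k` are invented here.

```python
def squared_error(centroids, clusters):
    """Sum of squared Euclidean distances from each point to its centroid."""
    return sum(
        sum((x - c) ** 2 for x, c in zip(p, mu))
        for mu, cluster in zip(centroids, clusters)
        for p in cluster
    )


def best_of_restarts(cluster_fn, M, k, runs=10):
    """Run cluster_fn(M, k, seed) for several seeds; keep the lowest-error run.

    Returns a tuple (error, centroids, clusters).
    """
    best = None
    for seed in range(runs):
        centroids, clusters = cluster_fn(M, k, seed)
        err = squared_error(centroids, clusters)
        if best is None or err < best[0]:
            best = (err, centroids, clusters)
    return best


def error_per_k(cluster_fn, M, ks, runs=10):
    """Best squared error found for each candidate k in ks."""
    return {k: best_of_restarts(cluster_fn, M, k, runs)[0] for k in ks}
```

The squared error never increases when k grows under optimal clustering, so the table from `error_per_k` is usually read by looking for the k after which the error stops dropping sharply, not for its minimum.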
k-Means pros and cons
Pros (+):
- Easy
- Fast
- Scalable?
Cons (-):
- Works only for ”well-shaped” clusters
- Sensitive to outliers
- Sensitive to noise
- Must know k a priori
Questions
- Euclidean distance results in spherical clusters. What cluster shape does the Manhattan distance give? Think of other distance measures too. What cluster shapes will those yield?
- Assuming that the k-Means algorithm converges in I iterations, with N points and X features for each point, give an approximation of the complexity of the algorithm expressed in K, I, N, and X.
- Can the k-Means algorithm be parallelized? How?
DBSCAN
Density Based Spatial Clustering of Applications with Noise
Basic idea:
- If an object p is density connected to q, then p and q belong to the same cluster
- If an object is not density connected to any other object, it is considered noise
Definitions
- ε-neighborhood: the neighborhood within a radius ε of an object
- core object: an object is a core object iff there are more than MinPts objects in its ε-neighborhood
- directly density reachable (ddr): an object p is ddr from q iff q is a core object and p is inside the ε-neighborhood of q
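The three definitions above translate almost directly into code. The Python sketch below assumes Euclidean distance, points given as tuples, and that an object counts as a member of its own ε-neighborhood; the function names and the `eps` and `min_pts` parameter names are chosen here for illustration.

```python
import math


def neighborhood(M, p, eps):
    """All objects of M within distance eps of p (p itself included, by assumption)."""
    return [q for q in M if math.dist(p, q) <= eps]


def is_core(M, q, eps, min_pts):
    """q is a core object iff its eps-neighborhood holds more than min_pts objects."""
    return len(neighborhood(M, q, eps)) > min_pts


def ddr(M, p, q, eps, min_pts):
    """True iff p is directly density reachable from q."""
    return is_core(M, q, eps, min_pts) and math.dist(p, q) <= eps
```

Note that `ddr` is not symmetric: a point on the edge of a dense region can be ddr from a core object without being a core object itself.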
Reachability and Connectivity
- density reachable (dr): an object q is dr from p iff there exists a chain of objects p_1 … p_n s.t. p_1 is ddr from p, p_2 is ddr from p_1, p_3 is ddr from …, and q is ddr from p_n
- density connected (dc): p is dc to q iff there exists an object o such that p is dr from o and q is dr from o
Recall…
Basic idea:
- If an object p is density connected to q, then p and q belong to the same cluster
- If an object is not density connected to any other object, it is considered noise
DBSCAN, take 1
i = 0
do
    take a point p from M
    find the set of points P which are density connected to p    % HOW?
    if P = {}
        M = M \ {p}
    else
        C_i = P
        i = i + 1
        M = M \ P
    end
while M ≠ {}
DBSCAN, take 2
i = 0
find the set of core points CP in M
do
    take a point p from CP
    find the set of points P which are density reachable from p  % HOW? Let’s call this function dr(p)
    if P = {}
        M = M \ {p}
    else
        C_i = P
        i = i + 1
        M = M \ P
    end
while M ≠ {}
Implement dr
create function dr(p) -> P
    C = {p}
    P = {p}
    do
        remove a point p' from C
        find all points X that are ddr(p')
        C = C ∪ (X \ (P ∩ X))    % add newly discovered points to C
        P = P ∪ X
    while C ≠ {}
    result P
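Putting dr together with the "take 2" loop, a rough Python sketch might look like this. It assumes Euclidean distance, points as hashable tuples, and an ε-neighborhood that includes the object itself. It also deviates from the pseudocode in a few ways, all of them assumptions made here: the loop runs until no unprocessed core point remains (not until M is empty), points left unassigned at the end are returned as noise, and a border point reachable from two clusters stays with the first cluster that claims it.

```python
import math


def neighborhood(M, p, eps):
    """Set of objects of M within distance eps of p (p itself included)."""
    return {q for q in M if math.dist(p, q) <= eps}


def dr(M, p, eps, min_pts):
    """Points density reachable from p, using the worklist scheme above."""
    C = {p}                      # discovered but not yet expanded
    P = {p}                      # everything reached so far
    while C:
        p2 = C.pop()
        N = neighborhood(M, p2, eps)
        # the set of points ddr from p2 is empty unless p2 is a core object
        X = N if len(N) > min_pts else set()
        C |= X - P               # add newly discovered points to C
        P |= X
    return P


def dbscan(M, eps, min_pts):
    """Return (clusters, noise) for an iterable of points M."""
    points = set(M)
    core = {p for p in points if len(neighborhood(points, p, eps)) > min_pts}
    unassigned = set(points)
    clusters = []
    while core:
        p = next(iter(core))
        # neighborhoods are always computed on the full data set
        P = dr(points, p, eps, min_pts) & unassigned
        clusters.append(P)
        core -= P
        unassigned -= P
    return clusters, unassigned  # whatever is left is noise
```

Each call to `neighborhood` scans all points, so this sketch costs on the order of N² distance computations; a spatial index would be the usual way to speed that up.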
DBSCAN pros and cons
Pros (+):
- Clusters of arbitrary shape
- Robust to noise
- Does not need an a priori k
- Scalable?
Cons (-):
- Requires connected regions of sufficiently high density
- Data sets with varying densities are problematic
Questions
- Why is the dc criterion useful to define a cluster, instead of dr or ddr?
- For which points is density reachability symmetric, i.e. for which p, q do both dr(p, q) and dr(q, p) hold?
- Express, using only core objects and ddr, which objects will belong to a cluster.