k-Means and DBSCAN Erik Zeitler Uppsala Database Laboratory

2015-06-03 Erik Zeitler
k-Means
Input
  M (set of points)
  k (number of clusters)
Output
  µ_1, …, µ_k (cluster centroids)
k-Means clusters the points in M into k clusters S_i, i = 1, …, k, by minimizing the squared error function
  $\sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2$
where µ_i is the centroid of all x_j ∈ S_i.
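
As a small illustration, the squared error function can be evaluated for a given clustering roughly as follows. This is only a sketch: the names `centroid` and `squared_error`, the Euclidean distance, and the representation of points as tuples are assumptions made for the example, not something the slides prescribe.

```python
import math


def centroid(cluster):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(xs) / len(cluster) for xs in zip(*cluster))


def squared_error(clusters):
    """Sum over all clusters S_i of the squared distances ||x_j - mu_i||^2."""
    total = 0.0
    for s in clusters:
        mu = centroid(s)
        total += sum(math.dist(x, mu) ** 2 for x in s)
    return total


# Two made-up clusters: centroids are (1, 0) and (10, 12).
clusters = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 10.0), (10.0, 14.0)]]
print(squared_error(clusters))  # 1 + 1 + 4 + 4 = 10.0
```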

k-Means algorithm
select (m_1 … m_K) randomly from M      % initial centroids
do
    (µ_1 … µ_K) = (m_1 … m_K)
    all clusters C_i = {}
    for each point p in M
        % compute cluster membership of p
        [µ_i, i] = min(dist(µ, p))
        % assign p to the corresponding cluster:
        C_i = C_i ∪ {p}
    end
    for each cluster C_i
        % recompute the centroids
        m_i = avg(x in C_i)
    end
while exists m_i ≠ µ_i                  % convergence criterion
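
The pseudocode above could be turned into runnable Python along these lines. It is a minimal sketch, not a reference implementation: the function name `kmeans`, the `seed` and `max_iter` parameters, the Euclidean distance, and the rule that an empty cluster keeps its old centroid are all assumptions added for the example.

```python
import math
import random


def kmeans(points, k, seed=0, max_iter=100):
    """k-Means on a list of equal-length tuples; returns (centroids, clusters)."""
    rng = random.Random(seed)
    m = rng.sample(points, k)                 # initial centroids, drawn from M
    for _ in range(max_iter):                 # max_iter is a safety cap (assumption)
        mu = m
        clusters = [[] for _ in range(k)]
        for p in points:
            # compute cluster membership of p: index of the closest centroid
            i = min(range(k), key=lambda j: math.dist(mu[j], p))
            clusters[i].append(p)
        # recompute the centroids; an empty cluster keeps its old one (assumption)
        m = [
            tuple(sum(xs) / len(c) for xs in zip(*c)) if c else mu[i]
            for i, c in enumerate(clusters)
        ]
        if m == mu:                           # convergence: no centroid moved
            break
    return m, clusters


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

Fixing the seed makes the run repeatable; changing it gives a different choice of initial centroids.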

K-Means on three clusters

I’m feeling Unlucky
Bad initial points

Eat this!
Non-spherical clusters

k-Means in practice
How to choose initial centroids
  select randomly among the data points
  generate completely randomly
How to choose k
  study the data
  run k-Means for different k
  measure squared error for each k
Run k-Means many times!
  Get many choices of initial points
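
The advice above (several values of k, many restarts, compare the squared error) might be scripted as in the following sketch. The helper names `lloyd`, `total_squared_error` and `best_error_per_k`, the number of restarts, and the toy data are assumptions chosen for illustration; initial centroids are selected randomly among the data points.

```python
import math
import random


def lloyd(points, k, rng, max_iter=100):
    """One k-Means run from random initial centroids chosen among the points."""
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(centroids[c], p))
            clusters[nearest].append(p)
        updated = [
            tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters


def total_squared_error(centroids, clusters):
    """Squared error of one run: squared distance of every point to its centroid."""
    return sum(math.dist(mu, x) ** 2
               for mu, c in zip(centroids, clusters) for x in c)


def best_error_per_k(points, ks, restarts=10, seed=0):
    """For each k, run k-Means several times and keep the smallest squared error."""
    rng = random.Random(seed)
    return {
        k: min(total_squared_error(*lloyd(points, k, rng)) for _ in range(restarts))
        for k in ks
    }


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
          (20.0, 0.0), (20.0, 1.0), (21.0, 0.0)]
for k, err in best_error_per_k(points, ks=[1, 2, 3, 4]).items():
    print(k, round(err, 2))
```

Plotting or eyeballing the error against k is one common way to pick k: the error always shrinks as k grows, so the interesting value is where the improvement levels off.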

k-Means pros and cons
+
  Easy
  Fast
  Scalable?
-
  Works only for ”well-shaped” clusters
  Sensitive to outliers
  Sensitive to noise
  Must know k a priori

Questions
Euclidean distance results in spherical clusters.
  What cluster shape does the Manhattan distance give?
  Think of other distance measures too. What cluster shapes will those yield?
Assuming that the K-means algorithm converges in I iterations, with N points and X features for each point,
  give an approximation of the complexity of the algorithm expressed in K, I, N, and X.
Can the K-means algorithm be parallelized?
  How?

DBSCAN
Density Based Spatial Clustering of Applications with Noise
Basic idea:
  If an object p is density connected to q, then p and q belong to the same cluster
  If an object is not density connected to any other object, it is considered noise

Definitions
ε-neighborhood
  The neighborhood within a radius ε of an object
core object
  An object is a core object iff there are more than MinPts objects in its ε-neighborhood
directly density reachable (ddr)
  An object p is ddr from q iff q is a core object and p is inside the ε-neighborhood of q
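
The three definitions can be written down almost literally in Python. The sketch below assumes Euclidean distance and counts an object as a member of its own neighborhood; the function names `neighborhood`, `is_core` and `is_ddr` and the sample points are made up for the example. The core test uses "more than MinPts", as stated on the slide.

```python
import math


def neighborhood(points, q, eps):
    """All objects within distance eps of q (q itself included, by assumption)."""
    return [p for p in points if math.dist(p, q) <= eps]


def is_core(points, q, eps, min_pts):
    """Core object: more than min_pts objects in its eps-neighborhood."""
    return len(neighborhood(points, q, eps)) > min_pts


def is_ddr(points, p, q, eps, min_pts):
    """True if p is directly density reachable from q."""
    return is_core(points, q, eps, min_pts) and math.dist(p, q) <= eps


points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5), (1.2, 0.5), (5.0, 5.0)]
inner = (0.5, 0.5)   # five objects within eps, so it is a core object
edge = (1.2, 0.5)    # only two objects within eps, so it is not
print(is_ddr(points, edge, inner, eps=0.8, min_pts=3))  # True
print(is_ddr(points, inner, edge, eps=0.8, min_pts=3))  # False
```

The two calls show that ddr depends on which of the two objects is the core object.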

Reachability and Connectivity
density reachable (dr)
  An object q is dr from p iff there exists a chain of objects p_1 … p_n s.t.
  - p_1 is ddr from p,
  - p_2 is ddr from p_1,
  - p_3 is ddr from …
  and q is ddr from p_n
density connected (dc)
  p is dc to q iff
  - there exists an object o such that p is dr from o
  - and q is dr from o

Recall…
Basic idea:
  If an object p is density connected to q, then p and q belong to the same cluster
  If an object is not density connected to any other object, it is considered noise

DBSCAN, take 1
i = 0
do
    take a point p from M
    find the set of points P which are density connected to p    % HOW?
    if P = {}
        M = M \ {p}
    else
        C_i = P
        i = i + 1
        M = M \ P
    end
while M ≠ {}

DBSCAN, take 2
i = 0
find the set of core points CP in M
do
    take a point p from CP
    find the set of points P which are density reachable from p    % HOW? Let’s call this function dr(p)
    if P = {}
        M = M \ {p}
    else
        C_i = P
        i = i + 1
        M = M \ P
    end
while M ≠ {}

Implement dr
create function dr(p) -> P
    C = {p}
    P = {p}
    do
        remove a point p' from C
        find all points X that are ddr(p')
        C = C ∪ (X \ (P ∩ X))    % add newly discovered points to C
        P = P ∪ X
    while C ≠ {}
    result P
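
Put together, dr(p) and the "take 2" loop might look like the following Python sketch. Several details are assumptions rather than part of the slides: Euclidean distance, an object counting towards its own ε-neighborhood, the names `ddr_from` and `dbscan`, iterating over the points in input order, and giving a border object shared by two clusters to the cluster found first. Objects never reached from a core object are returned as noise.

```python
import math


def ddr_from(points, q, eps, min_pts):
    """Objects ddr from q: its eps-neighborhood if q is a core object, else nothing."""
    near = {p for p in points if math.dist(p, q) <= eps}
    return near if len(near) > min_pts else set()


def dr(points, p, eps, min_pts):
    """Worklist expansion of everything density reachable from p (p included)."""
    candidates = {p}                          # C on the slide
    reached = {p}                             # P on the slide
    while candidates:
        current = candidates.pop()
        found = ddr_from(points, current, eps, min_pts)
        candidates |= found - reached         # add newly discovered points to C
        reached |= found
    return reached


def dbscan(points, eps, min_pts):
    """Grow one cluster per unassigned core object; what is left over is noise."""
    remaining = set(points)
    clusters = []
    for p in points:
        if p in remaining and ddr_from(points, p, eps, min_pts):
            cluster = dr(points, p, eps, min_pts) & remaining
            clusters.append(cluster)
            remaining -= cluster
    return clusters, remaining


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
          (50.0, 50.0)]
clusters, noise = dbscan(points, eps=1.5, min_pts=3)
print(clusters)
print(noise)
```

Each call to `ddr_from` scans all points, so this sketch takes quadratic time overall. A spatial index would be the usual way to speed up the neighborhood queries.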

DBSCAN pros and cons
+
  Clusters of arbitrary shape
  Robust to noise
  Does not need an a priori k
  Scalable?
-
  Requires connected regions of sufficiently high density
  Data sets with varying densities are problematic

Questions
Why is the dc criterion useful to define a cluster, instead of dr or ddr?
For which points is density reachability symmetric, i.e. for which p, q do both dr(p, q) and dr(q, p) hold?
Express, using only core objects and ddr, which objects will belong to a cluster.