K-Means and DBSCAN Erik Zeitler Uppsala Database Laboratory.

1 k-Means and DBSCAN Erik Zeitler Uppsala Database Laboratory

2 k-Means (2015-06-03, Erik Zeitler)
• Input
  - M (set of points)
  - k (number of clusters)
• Output
  - µ_1, …, µ_k (cluster centroids)
• k-Means clusters the points in M into k clusters S_i, i = 1, …, k, by minimizing the squared error function
  $E = \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2$
  where µ_i is the centroid of all x_j ∈ S_i.
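The squared error function can be evaluated directly from a given clustering. The following is a minimal sketch, assuming points are tuples of numbers and distances are Euclidean; the names `centroid` and `squared_error` are illustrative and do not come from the slides.

```python
from math import dist  # Euclidean distance, available from Python 3.8

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def squared_error(clusters):
    """Sum over all clusters S_i of the squared distances from x_j to mu_i."""
    total = 0.0
    for cluster in clusters:
        mu = centroid(cluster)
        total += sum(dist(x, mu) ** 2 for x in cluster)
    return total

# Two clusters: the first has centroid (1, 0), the second is a single point.
clusters = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 10.0)]]
error = squared_error(clusters)
```

Each point in the first cluster lies at distance 1 from its centroid and the second cluster contributes nothing, so the total error is the sum of two unit squares.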

3 k-Means algorithm
select (m_1 … m_K) randomly from M      % initial centroids
do
    (µ_1 … µ_K) = (m_1 … m_K)
    all clusters C_i = {}
    for each point p in M
        % compute cluster membership of p
        [µ_i, i] = min(dist(µ, p))
        % assign p to the corresponding cluster:
        C_i = C_i ∪ {p}
    end
    for each cluster C_i                % recompute the centroids
        m_i = avg(x in C_i)
    end
while exists m_i ≠ µ_i                  % convergence criterion
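The pseudocode translates fairly directly into Python. This is a sketch under a few assumptions that are not in the slides: points are tuples, distance is Euclidean, an empty cluster keeps its previous centroid, and `max_iter` caps the number of iterations as a safety net. The `seed` parameter only makes the random initial choice repeatable.

```python
import random
from math import dist

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, seed=None, max_iter=100):
    """Return (centroids, clusters) for a list of point tuples."""
    rng = random.Random(seed)
    new_centroids = rng.sample(points, k)      # select m_1..m_K randomly from M
    clusters = []
    for _ in range(max_iter):                  # assumption: cap on iterations
        centroids = new_centroids              # (mu_1..mu_K) = (m_1..m_K)
        clusters = [[] for _ in range(k)]      # all clusters C_i = {}
        for p in points:
            # index of the centroid closest to p
            i = min(range(k), key=lambda j: dist(centroids[j], p))
            clusters[i].append(p)              # C_i = C_i U {p}
        # recompute the centroids; an empty cluster keeps its old centroid
        new_centroids = [centroid(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:         # no m_i differs from mu_i
            break
    return new_centroids, clusters

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, clusters = k_means(points, k=2, seed=0)
```

On well-separated data like this the two groups are recovered regardless of the initial choice; on harder data the result depends on the initial centroids.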

4 k-Means on three clusters

5 I’m feeling Unlucky: bad initial points

6 Eat this! Non-spherical clusters

7 k-Means in practice
• How to choose initial centroids
  - select randomly among the data points
  - generate completely randomly
• How to choose k
  - study the data
  - run k-Means for different k and measure the squared error for each k
• Run k-Means many times!
  - Get many choices of initial points
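Both pieces of advice, restarting with different initial points and comparing the squared error across k, can be sketched together. The code below is illustrative only: `k_means` is a compact stand-in implementation, and `best_of_restarts`, `error_per_k`, `n_runs`, and `max_iter` are hypothetical names and parameters chosen for this example.

```python
import random
from math import dist

def k_means(points, k, rng, max_iter=100):
    """Compact k-means sketch; returns (centroids, clusters)."""
    cents = rng.sample(points, k)          # initial centroids among the data points
    clusters = []
    for _ in range(max_iter):              # assumption: safety cap on iterations
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: dist(cents[j], p))
            clusters[nearest].append(p)
        # mean of each cluster; an empty cluster keeps its old centroid
        new = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else cents[i]
               for i, cl in enumerate(clusters)]
        if new == cents:
            break
        cents = new
    return cents, clusters

def squared_error(cents, clusters):
    """Sum of squared distances from every point to its cluster centroid."""
    return sum(dist(x, mu) ** 2 for mu, cl in zip(cents, clusters) for x in cl)

def best_of_restarts(points, k, n_runs=10, seed=0):
    """Run k-means n_runs times and keep the run with the lowest squared error."""
    rng = random.Random(seed)
    runs = [k_means(points, k, rng) for _ in range(n_runs)]
    return min(runs, key=lambda run: squared_error(*run))

def error_per_k(points, k_values, n_runs=10, seed=0):
    """Best squared error found for each candidate k."""
    return {k: squared_error(*best_of_restarts(points, k, n_runs, seed))
            for k in k_values}

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
errors = error_per_k(points, [1, 2, 3])
```

The error drops sharply from k = 1 to k = 2 and only marginally after that, which is the kind of pattern to look for when studying the error for different k.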

8 k-Means pros and cons
+ Easy
+ Fast
+ Scalable?
- Works only for "well-shaped" clusters
- Sensitive to outliers
- Sensitive to noise
- Must know k a priori

9 Questions
• Euclidean distance results in spherical clusters.
  - What cluster shape does the Manhattan distance give?
  - Think of other distance measures too. What cluster shapes will those yield?
• Assume that the k-Means algorithm converges in I iterations, with N points and X features for each point. Give an approximation of the complexity of the algorithm expressed in K, I, N, and X.
• Can the k-Means algorithm be parallelized? How?

10 DBSCAN
• Density Based Spatial Clustering of Applications with Noise
• Basic idea:
  - If an object p is density connected to q, then p and q belong to the same cluster
  - If an object is not density connected to any other object, it is considered noise

11 Definitions
• ε-neighborhood
  - The neighborhood within a radius ε of an object
• core object
  - An object is a core object iff there are more than MinPts objects in its ε-neighborhood
• directly density reachable (ddr)
  - An object p is ddr from q iff q is a core object and p is inside the ε-neighborhood of q
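These three definitions map onto three small predicates. The sketch below assumes Euclidean distance, a closed neighborhood (distance ≤ ε), and that an object counts as a member of its own neighborhood; the slides do not settle any of these choices. It follows the slide's "more than MinPts" wording; implementations that use "at least MinPts" would change `>` to `>=`. The function names are illustrative.

```python
from math import dist

def eps_neighborhood(points, p, eps):
    """All objects within radius eps of p (assumption: p itself is included)."""
    return [q for q in points if dist(p, q) <= eps]

def is_core(points, p, eps, min_pts):
    """Core object as on the slide: more than min_pts objects in the neighborhood."""
    return len(eps_neighborhood(points, p, eps)) > min_pts

def is_ddr(points, p, q, eps, min_pts):
    """True iff p is directly density reachable from q."""
    return is_core(points, q, eps, min_pts) and dist(p, q) <= eps

points = [(0, 0), (1, 0), (2, 0), (5, 0)]
eps, min_pts = 1.0, 2

core_flags = {p: is_core(points, p, eps, min_pts) for p in points}
forward = is_ddr(points, (0, 0), (1, 0), eps, min_pts)   # (0,0) from core (1,0)
backward = is_ddr(points, (1, 0), (0, 0), eps, min_pts)  # (1,0) from non-core (0,0)
```

Here only (1, 0) is a core object, so `forward` holds while `backward` does not: ddr is not symmetric when one of the two objects is not a core object.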

12 Reachability and Connectivity
• density reachable (dr)
  - An object q is dr from p iff there exists a chain of objects p_1 … p_n such that
    p_1 is ddr from p, p_2 is ddr from p_1, p_3 is ddr from …, and q is ddr from p_n
• density connected (dc)
  - p is dc to q iff there exists an object o such that p is dr from o and q is dr from o

13 Recall…
• Basic idea:
  - If an object p is density connected to q, then p and q belong to the same cluster
  - If an object is not density connected to any other object, it is considered noise

14 DBSCAN, take 1
i = 0
do
    take a point p from M
    find the set of points P which are density connected to p    % HOW?
    if P = {}
        M = M \ {p}
    else
        C_i = P
        i = i + 1
        M = M \ P
    end
while M ≠ {}

15 DBSCAN, take 2
i = 0
find the set of core points CP in M
do
    take a point p from CP
    find the set of points P which are density reachable from p  % HOW? Let’s call this function dr(p)
    if P = {}
        M = M \ {p}
    else
        C_i = P
        i = i + 1
        M = M \ P
    end
while M ≠ {}

16 Implement dr
create function dr(p) -> P
    C = {p}
    P = {p}
    do
        remove a point p' from C
        find all points X that are ddr(p')
        C = C ∪ (X \ (P ∩ X))     % add newly discovered points to C
        P = P ∪ X
    while C ≠ {}
    result P
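Combining dr with the loop over core points gives a small runnable DBSCAN. This is a sketch, not the slides' exact procedure: it assumes points are hashable tuples, Euclidean distance, a neighborhood that includes the object itself, and the "more than MinPts" core criterion used in the definitions. Unlike the pseudocode, the outer loop runs over the core points and stops when they are exhausted, so whatever is left unassigned is reported as noise; core status is computed once on the full data set so that removing a finished cluster cannot change it.

```python
from math import dist

def dr(p, hoods, core, unassigned):
    """Objects density reachable from core object p, among the unassigned ones."""
    candidates = {p}                 # C in the pseudocode
    reached = {p}                    # P in the pseudocode
    while candidates:
        p_prime = candidates.pop()
        if p_prime not in core:
            continue                 # nothing is ddr from a non-core object
        x = hoods[p_prime] & unassigned
        candidates |= x - reached    # add newly discovered points to C
        reached |= x
    return reached

def dbscan(points, eps, min_pts):
    """Return (clusters, noise) as a list of sets and a set."""
    points = set(points)
    # eps-neighborhood of every object (assumption: the object itself counts)
    hoods = {p: {q for q in points if dist(p, q) <= eps} for p in points}
    core = {p for p in points if len(hoods[p]) > min_pts}
    unassigned = set(points)
    clusters = []
    for p in sorted(core):           # sorted only to make the output repeatable
        if p in unassigned:
            cluster = dr(p, hoods, core, unassigned)
            clusters.append(cluster)
            unassigned -= cluster
    return clusters, unassigned      # leftover objects are noise

points = [(0, 0), (1, 0), (2, 0), (3, 0),
          (10, 0), (11, 0), (12, 0), (50, 50)]
clusters, noise = dbscan(points, eps=1.0, min_pts=2)
```

The example yields two clusters, one per run of neighboring points, and leaves the isolated point as noise. Precomputing all neighborhoods costs a pairwise distance scan, which is fine for a demonstration but not for large data sets.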

17 DBSCAN pros and cons
+ Clusters of arbitrary shape
+ Robust to noise
+ Does not need an a priori k
+ Scalable?
- Requires connected regions of sufficiently high density
- Data sets with varying densities are problematic

18 Questions
• Why is the dc criterion useful to define a cluster, instead of dr or ddr?
• For which points is density reachability symmetric? That is, for which p, q do both dr(p, q) and dr(q, p) hold?
• Express, using only core objects and ddr, which objects will belong to a cluster.
