
1 How to cluster data Algorithm review Extra material for DAA++ 18.2.2016 Prof. Pasi Fränti Speech & Image Processing Unit School of Computing University of Eastern Finland Joensuu, FINLAND

2 University of Eastern Finland Joensuu Joki = a river Joen = of a river Suu = mouth Joensuu = mouth of a river

3 Research topics. Clustering methods: clustering algorithms, clustering validity, graph clustering, Gaussian mixture models. Voice biometric: speaker recognition, voice activity detection, applications. Location-based applications: mobile data collection, route reduction and compression, photo collections and social networks, location-aware services & search engine. Image processing & compression: lossless compression and data reduction, image denoising, ultrasonic, medical and HDR imaging.

4 Research achievements Voice biometric Clustering methods Location-based application State-of-the-art algorithms! 4 PhD degrees 5 Top publications Results used by companies in Finland State-of-the-art algorithms in niche areas 6 PhD degrees 8 Top publications Image processing & compression NIST SRE submission ranked #2 in four categories in NIST SRE 2006. Top-1 most downloaded publication in Speech Communication Oct-Dec 2009 Results used in Forensics

5 Application example 1: Color reconstruction (image with compression artifacts vs. image with original colors).

6 Application example 2: Speaker modeling for voice biometrics. Training data → feature extraction and clustering → speaker models (Matti, Mikko, Tomi); an unknown sample (Tomi? Matti? Mikko?) → feature extraction → best match: Matti.

7 Speaker modeling: speech data and the result of clustering.

8 Application example 3: Image segmentation. Image with 4 color clusters; normalized color plot according to the red and green components.

9 Application example 4: Quantization. Approximation of continuous-range values (or a very large set of possible discrete values) by a small set of discrete symbols or integer values. Figure: original signal vs. quantized signal.

10 Color quantization of images: color image → RGB samples → clustering.

11 Application example 5 Clustering of spatial data

12 Clustered locations of users

13 Clustering of photos Timeline clustering

14 Clustering GPS trajectories Mobile users, taxi routes, fleet management

15 Conclusions from clusters Cluster 1: Office Cluster 2: Home

16 Part I: Clustering problem

17 Definitions and data. Set of N data points: X = {x_1, x_2, …, x_N}. Set of M cluster prototypes (centroids): C = {c_1, c_2, …, c_M}. Partition of the data: P = {p_1, p_2, …, p_N}, where p_i ∈ [1, M] is the index of the cluster to which x_i is assigned.

18 K-means algorithm. X = data set, C = cluster centroids, P = partition.

K-Means(X, C) → (C, P)
REPEAT
  C_prev ← C;
  FOR all i ∈ [1, N] DO p_i ← FindNearest(x_i, C);   (optimal partition)
  FOR all j ∈ [1, M] DO c_j ← average of the x_i with p_i = j;   (optimal centroids)
UNTIL C = C_prev
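A minimal NumPy sketch of the loop above; the helper structure (FindNearest realized as an argmin over distances) is an illustration, not the original course code.

```python
import numpy as np

def kmeans(X, C, max_iter=100):
    """Plain K-means. X: (N, D) data, C: (M, D) initial centroids."""
    P = None
    for _ in range(max_iter):
        C_prev = C.copy()
        # Optimal partition: p_i <- index of the nearest centroid of x_i.
        dists = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
        P = np.argmin(dists, axis=1)
        # Optimal centroids: c_j <- average of the x_i with p_i = j.
        C = np.array([X[P == j].mean(axis=0) if np.any(P == j) else C_prev[j]
                      for j in range(len(C))])
        if np.allclose(C, C_prev):          # UNTIL C = C_prev
            break
    return C, P
```

For example, kmeans(X, X[np.random.choice(len(X), M, replace=False)]) runs it from a random initial codebook.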

19 Distance and cost function: Euclidean distance between data vectors, and mean square error (MSE) over the whole data set as the cost function.
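The formulas themselves appear in the slide figure; reconstructed here from the standard definitions (D is the dimensionality, p_i the cluster index of x_i from the previous slide):

```latex
d(x_i, c_j) = \sqrt{\sum_{k=1}^{D} (x_{i,k} - c_{j,k})^2},
\qquad
\mathrm{MSE}(C, P) = \frac{1}{N} \sum_{i=1}^{N} \lVert x_i - c_{p_i} \rVert^2 .
```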

20 Clustering result as partition. Cluster prototypes illustrated by a Voronoi diagram; partition of the data illustrated by convex hulls.

21 Duality of partition and centroids: centroids as prototypes; partition by nearest-prototype mapping.

22 Challenges in clustering: incorrect cluster allocation and incorrect number of clusters (one cluster missing, several clusters missing, or too many clusters).

23 How to solve? Solve the clustering: given input data X of N data vectors and the number of clusters M, find the clusters; the result is given as a set of prototypes or as a partition. Solve the number of clusters: define an appropriate cluster validity function f, repeat the clustering algorithm for several M, and select the best result according to f. Solve the problem efficiently. (These are listed as the algorithmic, mathematical, and computer science sides of the problem.)
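A small sketch of the "solve the number of clusters" recipe; the clustering routine and the validity function f are passed in as parameters, since the slide leaves both open (an illustration, not the course code).

```python
def choose_number_of_clusters(X, M_range, cluster, validity):
    """Repeat the clustering for several M and keep the best result by f.

    cluster(X, M)     -> (centroids, partition)
    validity(X, C, P) -> score, assumed here to be 'lower is better'
    """
    best = None
    for M in M_range:
        C, P = cluster(X, M)
        score = validity(X, C, P)
        if best is None or score < best[0]:
            best = (score, M, C, P)
    return best   # (score, best M, centroids, partition)
```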

24 Part II: Clustering algorithms

25 Algorithm 1: Split P. Fränti, T. Kaukoranta and O. Nevalainen, "On the splitting method for vector quantization codebook generation", Optical Engineering, 36 (11), 3043-3051, November 1997.

26 Divisive approach. Motivation: efficiency of the divide-and-conquer approach; a hierarchy of clusters as a result; useful when solving the number of clusters. Challenges: design problem 1, which cluster to split; design problem 2, how to split; sub-optimal local optimization at best.

27 Split-based (divisive) clustering

28 Select cluster to be split. Heuristic choices: the cluster with the highest variance (MSE), or the cluster with the most skew distribution (3rd moment). Locally optimal choice (use this!): tentatively split all clusters and select the one that decreases the MSE most. Complexity of the choice: the heuristics take time to compute the measure, but the optimal choice takes only twice (2×) as much time, because the measures can be stored and only the two new clusters created at each step need to be recalculated.

29 Selection example: the cluster with the biggest MSE (11.6) is not necessarily the best choice; dividing another cluster (11.2) decreases the overall MSE more. (Cluster MSE values in the figure: 11.2, 11.6, 7.5, 8.2, 4.3, 6.5.)

30 Selection example (continued): only two new values (4.1 and 6.3) need to be calculated; the stored values 7.5, 8.2, 4.3, 6.5 and 11.6 remain valid.

31 How to split. Centroid methods: heuristic 1, replace the centroid c by c−ε and c+ε; heuristic 2, use the two furthest vectors; heuristic 3, use two random vectors. Partition according to the principal axis: calculate the principal axis, select a dividing point along the axis, divide by a hyperplane, and calculate the centroids of the two sub-clusters.

32 Splitting along principal axis, pseudo code: Step 1: Calculate the principal axis. Step 2: Select a dividing point. Step 3: Divide the points by a hyperplane. Step 4: Calculate the centroids of the new clusters.
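A NumPy sketch of Steps 1-4, under the assumption that the principal axis is taken from the cluster covariance and the dividing point is simply the mean projection; the optimal dividing point of Step 2 is refined on the following slides.

```python
import numpy as np

def split_cluster(X):
    """Split one cluster along its principal axis.
    Returns the two new centroids and a mask of the 'left' sub-cluster."""
    mean = X.mean(axis=0)
    # Step 1: principal axis = covariance eigenvector with the largest eigenvalue.
    _, eigvecs = np.linalg.eigh(np.cov(X - mean, rowvar=False))
    axis = eigvecs[:, -1]
    # Step 2: dividing point (here simply the mean, i.e. projection value 0).
    proj = (X - mean) @ axis
    left = proj <= 0
    # Steps 3-4: divide by the hyperplane and compute the new centroids.
    return X[left].mean(axis=0), X[~left].mean(axis=0), left
```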

33 Example of dividing: principal axis and dividing hyperplane.

34 Optimal dividing point, pseudo code of Step 2: Step 2.1: Calculate the projections on the principal axis. Step 2.2: Sort the vectors according to the projection. Step 2.3: FOR each vector x_i DO: divide using x_i as the dividing point and calculate the distortion of the subsets D_1 and D_2. Step 2.4: Choose the point minimizing D_1 + D_2.

35 Finding the dividing point: the error for the next candidate dividing point and the updated centroids can be computed incrementally, in O(1) time per candidate.
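A sketch of how the O(1) update can be realized (an assumption about the implementation, using the identity SSE = Σ‖x‖² − ‖Σx‖²/n): after sorting by projection, running sums let every candidate dividing point be evaluated in constant time.

```python
import numpy as np

def optimal_dividing_point(X, axis):
    """Return (order, i) so that the points order[:i] and order[i:]
    minimize the total distortion D1 + D2 along the given axis."""
    order = np.argsort(X @ axis)                 # Steps 2.1-2.2
    Xs = X[order]
    sq = np.sum(Xs ** 2, axis=1)
    tot_sum, tot_sq = Xs.sum(axis=0), sq.sum()

    left_sum, left_sq = np.zeros(X.shape[1]), 0.0
    best_cost, best_i = np.inf, 1
    for i in range(1, len(Xs)):                  # Step 2.3: O(1) per candidate
        left_sum += Xs[i - 1]
        left_sq += sq[i - 1]
        right_sum, right_sq = tot_sum - left_sum, tot_sq - left_sq
        d1 = left_sq - left_sum @ left_sum / i
        d2 = right_sq - right_sum @ right_sum / (len(Xs) - i)
        if d1 + d2 < best_cost:
            best_cost, best_i = d1 + d2, i       # Step 2.4
    return order, best_i
```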

36 Sub-optimality of the split

37 Example of splitting process: principal axis and dividing hyperplane; 2 clusters → 3 clusters.

38 Example of splitting process: 4 clusters → 5 clusters.

39 Example of splitting process: 6 clusters → 7 clusters.

40 Example of splitting process: 8 clusters → 9 clusters.

41 Example of splitting process: 10 clusters → 11 clusters.

42 Example of splitting process: 12 clusters → 13 clusters.

43 Example of splitting process: 14 clusters → 15 clusters; MSE = 1.94.

44 K-means refinement: result directly after split, MSE = 1.94; result after re-partition, MSE = 1.39; result after K-means, MSE = 1.33.

45 Time complexity. Number of processed vectors, assuming that the clusters are always split into two equal halves; and assuming an unequal split into sub-clusters of sizes n_max and n_min.
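For the equal-halves case the missing count can be reconstructed as follows (a reconstruction, not the slide's own formula): every level of the split hierarchy touches all N vectors once, and M clusters are reached after log₂ M levels, so

```latex
N + 2\cdot\tfrac{N}{2} + 4\cdot\tfrac{N}{4} + \dots
  \;=\; N \log_2 M \ \text{processed vectors in total.}
```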

46 Time complexity (continued): number of vectors processed; at each step, sorting the vectors is the bottleneck.

47 P. Fränti, T. Kaukoranta, D-F. Shen and K-S. Chang, "Fast and memory efficient implementation of the exact PNN", IEEE Trans. on Image Processing, 9 (5), 773-777, May 2000. Algorithm 2: Pairwise Nearest Neighbor

48 Agglomerative clustering. Single link: minimize the distance of the nearest vectors. Complete link: minimize the distance of the two furthest vectors. Ward's method: minimize the mean square error; in vector quantization, known as the Pairwise Nearest Neighbor (PNN) method.

49 PNN algorithm [Ward 1963: Journal of the American Statistical Association]. Merge cost and local optimization strategy: always merge the pair of clusters with the smallest merge cost. Nearest neighbor search is needed for (1) finding the cluster pair to be merged and (2) updating the NN pointers.
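The merge cost is Ward's criterion; written out (the exact slide formula is in the figure, so treat this as a reconstruction), merging clusters a and b with sizes n_a, n_b and centroids c_a, c_b increases the total squared error by

```latex
d_{a,b} \;=\; \frac{n_a\, n_b}{\,n_a + n_b\,}\; \lVert c_a - c_b \rVert^{2}.
```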

50 Pseudo code
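The pseudo code itself is shown as a figure; below is a compact, deliberately naive O(N³) Python sketch of the same idea using the merge cost above, without the distance-matrix, heap, or NN-pointer speed-ups discussed on the following slides.

```python
import numpy as np

def pnn(X, M):
    """Pairwise Nearest Neighbor: start from N singleton clusters and
    merge the cheapest pair until only M clusters remain."""
    centroids = [x.astype(float) for x in X]
    sizes = [1] * len(X)

    def merge_cost(a, b):
        diff = centroids[a] - centroids[b]
        return sizes[a] * sizes[b] / (sizes[a] + sizes[b]) * (diff @ diff)

    while len(centroids) > M:
        # Find the pair whose merge increases the total squared error the least.
        _, a, b = min((merge_cost(a, b), a, b)
                      for a in range(len(centroids))
                      for b in range(a + 1, len(centroids)))
        na, nb = sizes[a], sizes[b]
        centroids[a] = (na * centroids[a] + nb * centroids[b]) / (na + nb)
        sizes[a] = na + nb
        del centroids[b], sizes[b]
    return np.array(centroids)
```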

51 Overall example of the process: M = 5000 → 4999 → 4998 → … → 50 → … → 16 → 15.

52 Detailed example of the process

53 Example - 25 clusters: MSE ≈ 1.01×10⁹

54 Example - 24 clusters: MSE ≈ 1.03×10⁹

55 Example - 23 clusters: MSE ≈ 1.06×10⁹

56 Example - 22 clusters: MSE ≈ 1.09×10⁹

57 Example - 21 clusters: MSE ≈ 1.12×10⁹

58 Example - 20 clusters: MSE ≈ 1.16×10⁹

59 Example - 19 clusters: MSE ≈ 1.19×10⁹

60 Example - 18 clusters: MSE ≈ 1.23×10⁹

61 Example - 17 clusters: MSE ≈ 1.26×10⁹

62 Example - 16 clusters: MSE ≈ 1.30×10⁹

63 Example - 15 clusters: MSE ≈ 1.34×10⁹

64 Example of distance calculations

65 Storing the distance matrix. Maintain the distance matrix and update rows only for the changed cluster. The number of distance calculations per step reduces from O(N²) to O(N), but the search of the minimum pair still requires O(N²) time, so the total remains O(N³). It also requires O(N²) memory.
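A sketch of the bookkeeping described above (the names and the merge_cost callback are assumptions): after clusters a and b have been merged into a, only the row and column of a are recomputed and those of b are removed.

```python
import numpy as np

def update_distance_matrix(D, a, b, merge_cost):
    """D is the symmetric matrix of pairwise merge costs; cluster b was merged
    into cluster a. Only O(N) entries need to be recomputed."""
    D = np.delete(np.delete(D, b, axis=0), b, axis=1)   # drop row/column of b
    if a > b:
        a -= 1                                          # index of a after the removal
    for j in range(D.shape[0]):                         # refresh row/column of a
        if j != a:
            D[a, j] = D[j, a] = merge_cost(a, j)
    return D
```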

66 Heap structure for fast search [Kurita 1991: Pattern Recognition]. The search reduces from O(N) to O(log N); in total O(N² log N).

67 Maintain nearest neighbor (NN) pointers [Fränti et al., 2000: IEEE Trans. Image Processing]. The time complexity reduces from O(N³) towards Ω(τN²), where τ is the number of NN pointers that must be updated per merge.

68 Processing time comparison With NN pointers

69 Combining PNN and K-means (figure: number of clusters reduced from N, or from an intermediate random solution of M₀ clusters, down to M; standard PNN compared with random initialization + PNN + K-means).

70 Further improvements:
P. Fränti, O. Virmajoki and V. Hautamäki, "Fast agglomerative clustering using a k-nearest neighbor graph", IEEE Trans. on Pattern Analysis and Machine Intelligence, 28 (11), 1875-1881, November 2006.
P. Fränti and O. Virmajoki, "Iterative shrinking method for clustering problems", Pattern Recognition, 39 (5), 761-765, May 2006.
T. Kaukoranta, P. Fränti and O. Nevalainen, "Vector quantization by lazy pairwise nearest neighbor method", Optical Engineering, 38 (11), 1862-1868, November 1999.
O. Virmajoki, P. Fränti and T. Kaukoranta, "Practical methods for speeding-up the pairwise nearest neighbor method", Optical Engineering, 40 (11), 2495-2504, November 2001.

71 Algorithm 3: Random Swap P. Fränti and J. Kivijärvi, "Randomised local search algorithm for the clustering problem", Pattern Analysis and Applications, 3 (4), 358-369, 2000.

72 Random swap algorithm (RS)
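The RS pseudo code is given as a figure; the sketch below follows its overall structure (random swap, repartition, a couple of K-means iterations, accept if the MSE improves). For brevity it repartitions all points instead of doing the local repartition shown on slides 74-75, so it is an illustration rather than the reference implementation.

```python
import numpy as np

def mse(X, C, P):
    return np.mean(np.sum((X - C[P]) ** 2, axis=1))

def nearest(X, C):
    return np.argmin(np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2), axis=1)

def random_swap(X, M, T=5000, kmeans_iters=2, seed=0):
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), M, replace=False)].astype(float)
    P = nearest(X, C)
    best = mse(X, C, P)
    for _ in range(T):
        C_new = C.copy()
        # Swap: replace a random centroid by a randomly chosen data vector.
        C_new[rng.integers(M)] = X[rng.integers(len(X))]
        P_new = nearest(X, C_new)
        # Fine-tune by a couple of K-means iterations.
        for _ in range(kmeans_iters):
            for j in range(M):
                if np.any(P_new == j):
                    C_new[j] = X[P_new == j].mean(axis=0)
            P_new = nearest(X, C_new)
        if mse(X, C_new, P_new) < best:
            C, P, best = C_new, P_new, mse(X, C_new, P_new)
    return C, P
```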

73 Demonstration of the algorithm

74 Centroid swap

75 Local repartition

76 Fine-tuning by K-means 1st iteration

77 Fine-tuning by K-means 2nd iteration

78 Fine-tuning by K-means 3rd iteration

79 Fine-tuning by K-means 16th iteration

80 Fine-tuning by K-means 17th iteration

81 Fine-tuning by K-means 18th iteration

82 Fine-tuning by K-means 19th iteration

83 Fine-tuning by K-means Final result after 25 iterations

84 Implementation of the swap: 1. Random swap (replace a randomly chosen centroid by a randomly chosen data vector). 2. Re-partition the vectors from the old (removed) cluster. 3. Create the new cluster (attract the vectors that are closer to the new centroid).
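A sketch of steps 2 and 3 (the formulas are in the figure; the names here are illustrative): only the points of the removed cluster and the points attracted by the new centroid change label, which is what makes the swap cheap.

```python
import numpy as np

def local_repartition(X, C, P, old):
    """C[old] has just been moved to its new location.
    Step 2: points that belonged to the removed cluster find a new nearest centroid.
    Step 3: any point closer to the new centroid than to its current one joins it."""
    orphans = np.where(P == old)[0]
    if orphans.size:
        d = np.linalg.norm(X[orphans, None, :] - C[None, :, :], axis=2)
        P[orphans] = np.argmin(d, axis=1)
    d_new = np.sum((X - C[old]) ** 2, axis=1)
    d_cur = np.sum((X - C[P]) ** 2, axis=1)
    P[d_new < d_cur] = old
    return P
```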

85 Independence of initialization: results for T = 5000 iterations on the Bridge data set (initial, worst, and best results shown).

86 Probability of a good swap. Select a proper centroid for removal: there are M clusters in total, so p_removal = 1/M. Select a proper new location: there are N choices, p_add = 1/N, but only M of them are significantly different, so p_add = 1/M. In total: M² significantly different swaps, and the probability of each is p_swap = 1/M². Open question: how many of these are good?

87 Expected number of iterations. Probability of not finding a good swap in T iterations; estimated number of iterations needed.
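Reconstructed from the swap probability p of the previous slide and a tolerated failure probability q (the standard geometric-trials argument; the slide's own formula is in the figure):

```latex
\Pr[\text{no good swap in } T \text{ iterations}] = (1 - p)^{T} \le q
\;\;\Longrightarrow\;\;
T \;\ge\; \frac{\ln q}{\ln (1 - p)} \;\approx\; \frac{-\ln q}{p}.
```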

88 Estimated number of iterations depending on T (data sets S1-S4). Observed = the number of iterations needed in practice. Estimated = the estimated number of iterations needed for the given q value.

89 Probability of success (p) depending on T

90 Probability of failure (q) depending on T

91 Observed probabilities depending on dimensionality

92 Bounds for the number of iterations Upper limit: Lower limit similarly; resulting in:

93 Multiple swaps (w). Probability of performing fewer than w good swaps; expected number of iterations.
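Again a reconstruction rather than the slide's formula: with success probability p per iteration, the number of good swaps in T iterations is binomial, giving

```latex
\Pr[\text{fewer than } w \text{ good swaps}]
  = \sum_{i=0}^{w-1} \binom{T}{i} p^{i} (1-p)^{T-i},
\qquad
E[T] \approx \frac{w}{p}.
```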

94 Efficiency of the random swap. Total time to find the correct clustering = time per iteration × number of iterations. Time complexity of a single step: swap, O(1); remove cluster, 2M × N/M = O(N); add cluster, 2N = O(N); update centroids, 2 × (2N/M) + 2 + 2 = O(N/M); (fast) K-means iteration, 4αN = O(αN)* (*see Fast K-means for the analysis).

95 Observed K-means iterations

96 K-means iterations

97 Time complexity and the observed number of steps

98 Total time complexity. Time complexity of a single step: t = O(αN). Number of iterations needed: T. Total time: T × t.

99 Time complexity: conclusions. 1. Logarithmic dependency on q. 2. Linear dependency on N. 3. Quadratic dependency on M (with a large number of clusters it can be too slow, and a faster variant might be needed). 4. Inverse dependency on α (worst case α = 2); the higher the dimensionality, the faster the method.

100 References.
Random swap algorithm:
P. Fränti and J. Kivijärvi, "Randomised local search algorithm for the clustering problem", Pattern Analysis and Applications, 3 (4), 358-369, 2000.
P. Fränti, J. Kivijärvi and O. Nevalainen, "Tabu search algorithm for codebook generation in VQ", Pattern Recognition, 31 (8), 1139-1148, August 1998.
Pseudo code: http://cs.uef.fi/sipu/
Efficiency of the random swap algorithm:
P. Fränti, O. Virmajoki and V. Hautamäki, "Efficiency of random swap based clustering", IAPR Int. Conf. on Pattern Recognition (ICPR'08), Tampa, FL, Dec 2008.

101 Part III: Efficient solution

102 Stopping criterion? Both the divisive and the agglomerative approach end up in a local minimum.

103 Strategies for efficient search using random swap. Brute force: solve the clustering separately for every possible number of clusters. Stepwise: as in brute force, but start from the previous solution and iterate less. Criterion-guided search: integrate the validity criterion directly into the cost function.

104 Brute force search strategy: search for each number of clusters separately (100 % of the work).

105 Stepwise search strategy: start from the previous result (30-40 % of the work).

106 Criterion-guided search: integrate the validity criterion with the cost function (3-6 % of the work).

107 Conclusions. Define the problem: a cost function f that measures the goodness of the clusters, or alternatively the (dis)similarity between two objects. Solve the problem: select the best algorithm for minimizing f. Homework. Number of clusters: Q. Zhao and P. Fränti, "WB-index: a sum-of-squares based index for cluster validity", Data & Knowledge Engineering, 92: 77-89, 2014. Validation: P. Fränti, M. Rezaei and Q. Zhao, "Centroid index: cluster level similarity measure", Pattern Recognition, 47 (9), 3034-3045, Sept. 2014.

108 Thank you Time for questions!

