1
How to cluster data – Algorithm review
Extra material for DAA++, 18.2.2016
Prof. Pasi Fränti
Speech & Image Processing Unit, School of Computing
University of Eastern Finland, Joensuu, FINLAND
2
University of Eastern Finland, Joensuu
Joki = a river; Joen = of a river; Suu = mouth; Joensuu = mouth of a river
3
Research topics
Clustering methods: clustering algorithms, clustering validity, graph clustering, Gaussian mixture models
Voice biometric: speaker recognition, voice activity detection, applications
Location-based applications: mobile data collection, route reduction and compression, photo collections and social networks, location-aware services & search engine
Image processing & compression: lossless compression and data reduction, image denoising, ultrasonic, medical and HDR imaging
4
Research achievements
Clustering methods: state-of-the-art algorithms! 4 PhD degrees, 5 top publications.
Location-based applications: results used by companies in Finland.
Image processing & compression: state-of-the-art algorithms in niche areas; 6 PhD degrees, 8 top publications.
Voice biometric: NIST SRE submission ranked #2 in four categories in NIST SRE 2006; top-1 most downloaded publication in Speech Communication Oct-Dec 2009; results used in forensics.
5
Application example 1: color reconstruction. Image with compression artifacts vs. image with original colors.
6
Application example 2: speaker modeling for voice biometrics. Training data → feature extraction and clustering → speaker models (Matti, Mikko, Tomi). An unknown sample goes through feature extraction and is matched against the models (best match: Matti!).
7
Speaker modeling: speech data and the result of clustering.
8
Application example 3: image segmentation. Normalized color plots according to the red and green components; image with 4 color clusters.
9
Application example 4: quantization. Approximation of continuous-range values (or a very large set of possible discrete values) by a small set of discrete symbols or integer values (original vs. quantized signal).
10
Color quantization of images: color image → RGB samples → clustering.
11
Application example 5 Clustering of spatial data
12
Clustered locations of users
13
Clustering of photos Timeline clustering
14
Clustering GPS trajectories Mobile users, taxi routes, fleet management
15
Conclusions from clusters Cluster 1: Office Cluster 2: Home
16
Part I: Clustering problem
17
Definitions and data
Set of N data points: X = {x_1, x_2, …, x_N}
Set of M cluster prototypes (centroids): C = {c_1, c_2, …, c_M}
Partition of the data: P = {p_1, p_2, …, p_N}, where p_i is the index of the cluster to which x_i is assigned.
18
K-means algorithm
X = data set, C = cluster centroids, P = partition

K-Means(X, C) → (C, P)
REPEAT
  C_prev ← C
  FOR all i ∈ [1, N] DO p_i ← FindNearest(x_i, C)            // optimal partition
  FOR all j ∈ [1, M] DO c_j ← average of all x_i with p_i = j // optimal centroids
UNTIL C = C_prev
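A minimal runnable sketch of the above loop in Python/NumPy (not from the slides; the empty-cluster handling is an added assumption):

```python
import numpy as np

def kmeans(X, C, max_iter=100):
    """Plain K-means: X is an (N, d) data array, C an (M, d) array of initial centroids.
    Returns the final centroids C and the partition P (cluster index of each point)."""
    for _ in range(max_iter):
        C_prev = C.copy()
        # Optimal partition: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
        P = np.argmin(dists, axis=1)
        # Optimal centroids: average of the points assigned to each cluster
        C = np.array([X[P == j].mean(axis=0) if np.any(P == j) else C[j]
                      for j in range(len(C))])
        if np.allclose(C, C_prev):
            break
    return C, P
```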
19
Distance and cost function
Euclidean distance of data vectors: d(x_i, c_j) = ||x_i − c_j|| = sqrt( Σ_k (x_ik − c_jk)² )
Mean square error: MSE = (1/N) Σ_i ||x_i − c_{p_i}||²
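A small helper matching the MSE definition above (illustrative, not from the slides):

```python
import numpy as np

def mse(X, C, P):
    """Mean square error of a clustering: average squared Euclidean
    distance from each point to the centroid of its assigned cluster."""
    return np.mean(np.sum((X - C[P]) ** 2, axis=1))
```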
20
Clustering result as partition: the partition of the data can be illustrated by convex hulls, and the cluster prototypes by a Voronoi diagram.
21
Duality of partition and centroids: centroids serve as cluster prototypes, and the partition of the data is obtained by nearest-prototype mapping.
22
Challenges in clustering: incorrect number of clusters (one cluster missing, several clusters missing, or too many clusters) and incorrect cluster allocation.
23
How to solve?
Solve the clustering (algorithmic problem): given input data X of N data vectors and the number of clusters M, find the clusters; the result is given as a set of prototypes or as a partition.
Solve the number of clusters (mathematical problem): define an appropriate cluster validity function f, repeat the clustering algorithm for several values of M, and select the best result according to f.
Solve the problem efficiently (computer science problem).
24
Part II: Clustering algorithms
25
Algorithm 1: Split P. Fränti, T. Kaukoranta and O. Nevalainen, "On the splitting method for vector quantization codebook generation", Optical Engineering, 36 (11), 3043-3051, November 1997.
26
Divisive approach
Motivation: efficiency of the divide-and-conquer approach; a hierarchy of clusters as a result; useful when solving the number of clusters.
Challenges: design problem 1: which cluster to split? design problem 2: how to split? Sub-optimal, local optimization at best.
27
Split-based (divisive) clustering
28
Select cluster to be split
Heuristic choices: the cluster with the highest variance (MSE), or the cluster with the most skewed distribution (3rd moment).
Locally optimal choice (use this!): tentatively split all clusters and select the one whose split decreases the MSE the most.
Complexity of the choice: the heuristics take time to compute their measure, and the optimal choice takes only about twice (2×) as much time, because the measures can be stored and only the two new clusters created at each step need to be recalculated.
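A sketch of the locally optimal selection described above; `tentative_split` is a hypothetical helper that returns the MSE decrease a split of the given cluster would achieve:

```python
def select_cluster_to_split(clusters, tentative_split):
    """Tentatively split every cluster and pick the one whose split decreases
    the total MSE the most. In practice the decrease values are cached, so
    after a real split only the two new clusters need to be re-evaluated."""
    best_idx, best_decrease = None, -float("inf")
    for idx, cluster in enumerate(clusters):
        decrease = tentative_split(cluster)  # MSE reduction if this cluster were split
        if decrease > best_decrease:
            best_idx, best_decrease = idx, decrease
    return best_idx
```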
29
Selection example: the cluster with the biggest MSE is not necessarily the best choice; dividing another cluster may decrease the total MSE more.
30
Selection example (continued): after a split, only the two new clusters' values need to be calculated.
31
How to split
Centroid methods: Heuristic 1: replace C by C− and C+. Heuristic 2: take the two furthest vectors. Heuristic 3: take two random vectors.
Partition according to the principal axis: calculate the principal axis, select a dividing point along the axis, divide by a hyperplane, and calculate the centroids of the two sub-clusters.
32
Splitting along principal axis (pseudo code)
Step 1: Calculate the principal axis.
Step 2: Select a dividing point.
Step 3: Divide the points by a hyperplane.
Step 4: Calculate the centroids of the new clusters.
33
Example of dividing: principal axis and dividing hyperplane.
34
Optimal dividing point (pseudo code of Step 2)
Step 2.1: Calculate the projections on the principal axis.
Step 2.2: Sort the vectors according to the projection.
Step 2.3: FOR each vector x_i DO: divide using x_i as the dividing point; calculate the distortions D_1 and D_2 of the two subsets.
Step 2.4: Choose the point minimizing D_1 + D_2.
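A sketch of Steps 1-4 including the dividing-point search, using a straightforward O(n²) evaluation of D_1 + D_2 (the O(1) incremental update of the next slide is omitted; names are illustrative):

```python
import numpy as np

def split_along_principal_axis(X):
    """Split cluster X (an (n, d) array) by a hyperplane orthogonal to its
    principal axis, choosing the dividing point that minimizes D1 + D2."""
    # Step 1: principal axis = direction of largest variance
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    axis = Vt[0]
    # Steps 2.1-2.2: project the points onto the axis and sort them
    order = np.argsort((X - mean) @ axis)
    sse = lambda A: float(np.sum((A - A.mean(axis=0)) ** 2))
    # Steps 2.3-2.4: try every dividing point, keep the one minimizing D1 + D2
    best_cut, best_cost = 1, float("inf")
    for cut in range(1, len(X)):
        cost = sse(X[order[:cut]]) + sse(X[order[cut:]])
        if cost < best_cost:
            best_cut, best_cost = cut, cost
    # Steps 3-4: divide by the hyperplane and compute the new centroids
    left, right = X[order[:best_cut]], X[order[best_cut:]]
    return (left, left.mean(axis=0)), (right, right.mean(axis=0))
```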
35
Finding the dividing point: the error for the next candidate dividing point and the updated centroids can be computed incrementally, in O(1) time per candidate.
36
Sub-optimality of the split
37
Example of splitting process: 2 clusters → 3 clusters (principal axis and dividing hyperplane shown).
38
Example of splitting process: 4 clusters → 5 clusters.
39
Example of splitting process: 6 clusters → 7 clusters.
40
Example of splitting process: 8 clusters → 9 clusters.
41
Example of splitting process: 10 clusters → 11 clusters.
42
Example of splitting process: 12 clusters → 13 clusters.
43
Example of splitting process: 14 clusters → 15 clusters (MSE = 1.94).
44
K-means refinement. Result directly after split: MSE = 1.94. Result after re-partition: MSE = 1.39. Result after K-means: MSE = 1.33.
45
Time complexity. Number of processed vectors, assuming that clusters are always split into two equal halves; and assuming an unequal split into sizes n_max and n_min.
46
Time complexity (continued). Number of vectors processed; at each step, sorting the vectors is the bottleneck.
47
Algorithm 2: Pairwise Nearest Neighbor P. Fränti, T. Kaukoranta, D-F. Shen and K-S. Chang, "Fast and memory efficient implementation of the exact PNN", IEEE Trans. on Image Processing, 9 (5), 773-777, May 2000.
48
Agglomerative clustering
Single link: minimize the distance of the nearest vectors.
Complete link: minimize the distance of the two furthest vectors.
Ward's method: minimize the mean square error; in vector quantization, known as the pairwise nearest neighbor (PNN) method.
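For reference, all three linkage criteria are available in SciPy; a minimal illustrative example on synthetic data (not the slides' own implementation):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(100, 2)                        # synthetic 2-D data
Z = linkage(X, method='ward')                     # also: 'single', 'complete'
labels = fcluster(Z, t=15, criterion='maxclust')  # cut the dendrogram into 15 clusters
```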
49
PNN algorithm [Ward 1963: Journal of the American Statistical Association]
Merge cost and local optimization strategy. Nearest neighbor search is needed for (1) finding the cluster pair to be merged and (2) updating the NN pointers.
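The merge cost referred to above is, in Ward's formulation, the increase in total squared error caused by merging two clusters; a small helper (variable names are illustrative):

```python
import numpy as np

def merge_cost(n_a, c_a, n_b, c_b):
    """Increase in total squared error when merging clusters a and b
    (sizes n_a, n_b; centroids c_a, c_b): n_a*n_b/(n_a+n_b) * ||c_a - c_b||^2."""
    diff = np.asarray(c_a, dtype=float) - np.asarray(c_b, dtype=float)
    return n_a * n_b / (n_a + n_b) * float(np.sum(diff ** 2))
```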
50
Pseudo code
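Since the pseudo code image is not reproduced here, a naive O(N³) sketch of the PNN loop, without the distance matrix, heap, or NN-pointer speedups discussed below:

```python
import numpy as np

def pnn(X, M):
    """Naive pairwise nearest neighbor clustering: start with every point as
    its own cluster and repeatedly merge the pair with the smallest Ward
    merge cost until M clusters remain. Returns cluster sizes and centroids."""
    sizes = [1.0] * len(X)
    cents = [np.asarray(x, dtype=float) for x in X]

    def cost(i, j):  # Ward merge cost of clusters i and j
        return sizes[i] * sizes[j] / (sizes[i] + sizes[j]) * np.sum((cents[i] - cents[j]) ** 2)

    while len(cents) > M:
        # find the cheapest pair to merge
        i, j = min(((a, b) for a in range(len(cents)) for b in range(a + 1, len(cents))),
                   key=lambda ab: cost(*ab))
        n = sizes[i] + sizes[j]
        cents[i] = (sizes[i] * cents[i] + sizes[j] * cents[j]) / n  # merged centroid
        sizes[i] = n
        del cents[j], sizes[j]
    return sizes, cents
```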
51
Overall example of the process: merging proceeds from M = 5000 through M = 4999, 4998, … down to M = 50, 16, 15, …
52
Detailed example of the process
53
Example - 25 clusters, MSE ≈ 1.01×10⁹
54
Example - 24 clusters, MSE ≈ 1.03×10⁹
55
Example - 23 clusters, MSE ≈ 1.06×10⁹
56
Example - 22 clusters, MSE ≈ 1.09×10⁹
57
Example - 21 clusters, MSE ≈ 1.12×10⁹
58
Example - 20 clusters, MSE ≈ 1.16×10⁹
59
Example - 19 clusters, MSE ≈ 1.19×10⁹
60
Example - 18 clusters, MSE ≈ 1.23×10⁹
61
Example - 17 clusters, MSE ≈ 1.26×10⁹
62
Example - 16 clusters, MSE ≈ 1.30×10⁹
63
Example - 15 clusters, MSE ≈ 1.34×10⁹
64
Example of distance calculations
65
Storing the distance matrix: maintain the distance matrix and update only the rows of the changed cluster. The number of distance calculations per step reduces from O(N²) to O(N), but searching for the minimum pair still requires O(N²) time, so the total remains O(N³). It also requires O(N²) memory.
66
Heap structure for fast search [Kurita 1991: Pattern Recognition]: the search reduces from O(N) to O(log N); in total O(N² log N).
67
Maintain nearest neighbor (NN) pointers [Fränti et al., 2000: IEEE Trans. on Image Processing]: the time complexity reduces to between Ω(N²) and O(N³).
68
Processing time comparison With NN pointers
69
Combining PNN and K-means: instead of running the standard PNN from N clusters down to M, start from a random clustering of M₀ clusters (M < M₀ < N), reduce it to M with PNN, and fine-tune with K-means.
70
Further improvements
P. Fränti, O. Virmajoki and V. Hautamäki, "Fast agglomerative clustering using a k-nearest neighbor graph", IEEE Trans. on Pattern Analysis and Machine Intelligence, 28 (11), 1875-1881, November 2006.
P. Fränti and O. Virmajoki, "Iterative shrinking method for clustering problems", Pattern Recognition, 39 (5), 761-765, May 2006.
T. Kaukoranta, P. Fränti and O. Nevalainen, "Vector quantization by lazy pairwise nearest neighbor method", Optical Engineering, 38 (11), 1862-1868, November 1999.
O. Virmajoki, P. Fränti and T. Kaukoranta, "Practical methods for speeding-up the pairwise nearest neighbor method", Optical Engineering, 40 (11), 2495-2504, November 2001.
71
Algorithm 3: Random Swap P. Fränti and J. Kivijärvi, "Randomised local search algorithm for the clustering problem", Pattern Analysis and Applications, 3 (4), 358-369, 2000.
72
Random swap algorithm (RS)
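A compact sketch of the random swap loop (an assumption-laden simplification: it re-partitions all vectors instead of only the two affected clusters, and uses two full K-means iterations as the fine-tuning step):

```python
import numpy as np

def random_swap(X, M, T=5000, kmeans_iters=2, seed=0):
    """Random swap clustering: start from random centroids; at each iteration
    replace one randomly chosen centroid by a randomly chosen data point,
    fine-tune briefly with K-means, and keep the trial only if MSE improves."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), M, replace=False)].copy()

    def partition(C):
        return np.argmin(np.linalg.norm(X[:, None] - C[None], axis=2), axis=1)

    def mse(C, P):
        return np.mean(np.sum((X - C[P]) ** 2, axis=1))

    def refine(C):
        for _ in range(kmeans_iters):
            P = partition(C)
            C = np.array([X[P == j].mean(axis=0) if np.any(P == j) else C[j]
                          for j in range(M)])
        return C, partition(C)

    P = partition(C)
    best = mse(C, P)
    for _ in range(T):
        C_trial = C.copy()
        C_trial[rng.integers(M)] = X[rng.integers(len(X))]  # the centroid swap
        C_trial, P_trial = refine(C_trial)                   # repartition + fine-tuning
        cost = mse(C_trial, P_trial)
        if cost < best:
            C, P, best = C_trial, P_trial, cost
    return C, P
```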
73
Demonstration of the algorithm
74
Centroid swap
75
Local repartition
76
Fine-tuning by K-means 1st iteration
77
Fine-tuning by K-means 2nd iteration
78
Fine-tuning by K-means 3rd iteration
79
Fine-tuning by K-means 16th iteration
80
Fine-tuning by K-means 17th iteration
81
Fine-tuning by K-means 18th iteration
82
Fine-tuning by K-means 19th iteration
83
Fine-tuning by K-means Final result after 25 iterations
84
Implementation of the swap:
1. Random swap: replace a randomly chosen centroid by a randomly chosen data vector.
2. Re-partition the vectors of the removed (old) cluster to their nearest remaining centroids.
3. Create the new cluster: vectors closer to the new centroid than to their current one join it.
85
Independence of initialization: results for T = 5000 iterations on the Bridge data set (initial, worst and best results).
86
Probability of good swap
Select a proper centroid for removal: there are M clusters in total, so p_removal = 1/M.
Select a proper new location: there are N choices, so p_add = 1/N; but only M of them are significantly different, so effectively p_add = 1/M.
In total there are M² significantly different swaps, and the probability of each is p_swap = 1/M².
Open question: how many of these are good?
87
Expected number of iterations. Probability of not finding a good swap after T iterations, and the estimated number of iterations needed.
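The formulas themselves are not reproduced above; under the assumption that each iteration independently finds a good swap with probability p, the usual relations are:

```latex
% Probability that none of T independent iterations finds a good swap:
q = (1 - p)^{T}
% Solving for the number of iterations needed to reach failure probability q:
T = \frac{\ln q}{\ln(1 - p)} \approx \frac{1}{p}\,\ln\frac{1}{q}
```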
88
Estimated number of iterations depending on T, for data sets S1–S4. Observed = number of iterations needed in practice. Estimated = estimated number of iterations needed for the given q value.
89
Probability of success (p) depending on T
90
Probability of failure (q) depending on T
91
Observed probabilities depending on dimensionality
92
Bounds for the number of iterations Upper limit: Lower limit similarly; resulting in:
93
Multiple swaps (w): probability of performing fewer than w swaps, and the expected number of iterations.
94
Efficiency of the random swap
Total time to find the correct clustering = time per iteration × number of iterations.
Time complexity of a single step:
Swap: O(1)
Remove cluster: 2M · N/M = O(N)
Add cluster: 2N = O(N)
Centroids: 2 · (2N/M) + 2 + 2 = O(N/M)
(Fast) K-means iteration: 4αN = O(αN) * (* see Fast K-means for the analysis)
95
Observed K-means iterations
96
K-means iterations
97
Time complexity and the observed number of steps
98
Total time complexity. Time complexity of a single step (t): t = O(αN). Number of iterations needed (T). Total time: T × t.
99
Time complexity: conclusions
1. Logarithmic dependency on q.
2. Linear dependency on N.
3. Quadratic dependency on M (with a large number of clusters it can be too slow, and a faster variant may be needed).
4. Inverse dependency on α (worst case α = 2): the higher the dimensionality, the faster the method.
100
References
Random swap algorithm:
P. Fränti and J. Kivijärvi, "Randomised local search algorithm for the clustering problem", Pattern Analysis and Applications, 3 (4), 358-369, 2000.
P. Fränti, J. Kivijärvi and O. Nevalainen, "Tabu search algorithm for codebook generation in VQ", Pattern Recognition, 31 (8), 1139-1148, August 1998.
Pseudo code: http://cs.uef.fi/sipu/
Efficiency of the random swap algorithm:
P. Fränti, O. Virmajoki and V. Hautamäki, "Efficiency of random swap based clustering", IAPR Int. Conf. on Pattern Recognition (ICPR'08), Tampa, FL, Dec 2008.
101
Part III: Efficient solution
102
Stopping criterion? Both divisive and agglomerative approaches end up in a local minimum.
103
Strategies for efficient search using random swap:
Brute force: solve the clustering separately for every possible number of clusters.
Stepwise: as in brute force, but start from the previous solution and iterate less (see the sketch below).
Criterion-guided search: integrate the validity criterion directly into the cost function.
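A rough sketch of the stepwise strategy; `cluster`, `split_one_cluster`, and `clustering_quality` are hypothetical helpers, and the warm-start details are assumptions rather than the slides' exact method:

```python
def stepwise_search(X, M_max, cluster, split_one_cluster, clustering_quality):
    """Stepwise search over the number of clusters: instead of solving each M
    from scratch, reuse the previous solution as the starting point and run
    only a short refinement, then pick the M with the best validity value
    (assuming lower clustering_quality is better)."""
    results = {}
    C = None
    for M in range(1, M_max + 1):
        # warm start: previous centroids with one cluster split (M = 1 starts from scratch)
        init = split_one_cluster(C) if C is not None else None
        C = cluster(X, M, init=init, iterations=100)  # short refinement only
        results[M] = (clustering_quality(X, C), C)
    best_M = min(results, key=lambda m: results[m][0])
    return best_M, results[best_M][1]
```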
104
Brute force search strategy: search for each number of clusters separately (100 % of the work).
105
Stepwise search strategy: start from the previous result (30-40 % of the work).
106
Criterion-guided search: integrate with the cost function (3-6 % of the work).
107
Conclusions
Define the problem: a cost function f that measures the goodness of the clusters, or alternatively the (dis)similarity between two objects.
Solve the problem: select the best algorithm for minimizing f.
Homework
Number of clusters: Q. Zhao and P. Fränti, "WB-index: a sum-of-squares based index for cluster validity", Data & Knowledge Engineering, 92: 77-89, 2014.
Validation: P. Fränti, M. Rezaei and Q. Zhao, "Centroid index: cluster level similarity measure", Pattern Recognition, 47 (9), 3034-3045, Sept. 2014.
108
Thank you Time for questions!