
1 How to cluster data Algorithm review Extra material for DAA++ 18.2.2016 Prof. Pasi Fränti Speech & Image Processing Unit School of Computing University of Eastern Finland Joensuu, FINLAND

2 University of Eastern Finland Joensuu Joki = a river Joen = of a river Suu = mouth Joensuu = mouth of a river

3 Research topics. Clustering methods: clustering algorithms, clustering validity, graph clustering, Gaussian mixture models. Voice biometric: speaker recognition, voice activity detection, applications. Location-based applications: mobile data collection, route reduction and compression, photo collections and social networks, location-aware services & search engine. Image processing & compression: lossless compression and data reduction, image denoising, ultrasonic, medical and HDR imaging.

4 Research achievements Voice biometric Clustering methods Location-based application State-of-the-art algorithms! 4 PhD degrees 5 Top publications Results used by companies in Finland State-of-the-art algorithms in niche areas 6 PhD degrees 8 Top publications Image processing & compression NIST SRE submission ranked #2 in four categories in NIST SRE 2006. Top-1 most downloaded publication in Speech Communication Oct-Dec 2009 Results used in Forensics

5 Application example 1: Color reconstruction (image with compression artifacts vs. image with original colors).

6 Application example 2: Speaker modeling for voice biometrics. Training data → feature extraction and clustering → speaker models (Matti, Mikko, Tomi); an unknown sample (Tomi? Matti? Mikko?) → feature extraction → best match: Matti.

7 Speaker modeling: speech data and the result of clustering.

8 Application example 3: Image segmentation. Image with 4 color clusters; normalized color plot according to the red and green components.

9 Application example 4: Quantization. Approximation of continuous-range values (or a very large set of possible discrete values) by a small set of discrete symbols or integer values. Figure: original signal vs. quantized signal.

10 Color quantization of images: color image → RGB samples → clustering.

11 Application example 5 Clustering of spatial data

12 Clustered locations of users

13 Clustering of photos Timeline clustering

14 Clustering GPS trajectories Mobile users, taxi routes, fleet management

15 Conclusions from clusters Cluster 1: Office Cluster 2: Home

16 Part I: Clustering problem

17 Definitions and data. Set of N data points: X = {x_1, x_2, …, x_N}. Set of M cluster prototypes (centroids): C = {c_1, c_2, …, c_M}. Partition of the data: P = {p_1, p_2, …, p_N}, where p_i ∈ [1, M] is the index of the cluster to which x_i is assigned.

18 K-means algorithm. X = data set, C = cluster centroids, P = partition.

K-Means(X, C) → (C, P)
REPEAT
  C_prev ← C;
  FOR all i ∈ [1, N] DO p_i ← FindNearest(x_i, C);   (optimal partition)
  FOR all j ∈ [1, M] DO c_j ← average of the x_i with p_i = j;   (optimal centroids)
UNTIL C = C_prev
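A minimal NumPy sketch of the loop above; the helper structure (FindNearest realized as an argmin over distances) is an illustration, not the original course code.

```python
import numpy as np

def kmeans(X, C, max_iter=100):
    """Plain K-means. X: (N, D) data, C: (M, D) initial centroids."""
    P = None
    for _ in range(max_iter):
        C_prev = C.copy()
        # Optimal partition: p_i <- index of the nearest centroid of x_i.
        dists = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
        P = np.argmin(dists, axis=1)
        # Optimal centroids: c_j <- average of the x_i with p_i = j.
        C = np.array([X[P == j].mean(axis=0) if np.any(P == j) else C_prev[j]
                      for j in range(len(C))])
        if np.allclose(C, C_prev):          # UNTIL C = C_prev
            break
    return C, P
```

For example, kmeans(X, X[np.random.choice(len(X), M, replace=False)]) runs it from a random initial codebook.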

19 Distance and cost function: Euclidean distance between data vectors, and mean square error (MSE) over the whole data set as the cost function.
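The formulas themselves appear in the slide figure; reconstructed here from the standard definitions (D is the dimensionality, p_i the cluster index of x_i from the previous slide):

```latex
d(x_i, c_j) = \sqrt{\sum_{k=1}^{D} (x_{i,k} - c_{j,k})^2},
\qquad
\mathrm{MSE}(C, P) = \frac{1}{N} \sum_{i=1}^{N} \lVert x_i - c_{p_i} \rVert^2 .
```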

20 Clustering result as partition. Cluster prototypes illustrated by a Voronoi diagram; partition of the data illustrated by convex hulls.

21 Duality of partition and centroids: centroids as prototypes; partition by nearest-prototype mapping.

22 Challenges in clustering: incorrect cluster allocation and incorrect number of clusters (one cluster missing, several clusters missing, or too many clusters).

23 How to solve? Solve the clustering: given input data X of N data vectors and the number of clusters M, find the clusters; the result is given as a set of prototypes or as a partition. Solve the number of clusters: define an appropriate cluster validity function f, repeat the clustering algorithm for several M, and select the best result according to f. Solve the problem efficiently. (These are listed as the algorithmic, mathematical, and computer science sides of the problem.)
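A small sketch of the "solve the number of clusters" recipe; the clustering routine and the validity function f are passed in as parameters, since the slide leaves both open (an illustration, not the course code).

```python
def choose_number_of_clusters(X, M_range, cluster, validity):
    """Repeat the clustering for several M and keep the best result by f.

    cluster(X, M)     -> (centroids, partition)
    validity(X, C, P) -> score, assumed here to be 'lower is better'
    """
    best = None
    for M in M_range:
        C, P = cluster(X, M)
        score = validity(X, C, P)
        if best is None or score < best[0]:
            best = (score, M, C, P)
    return best   # (score, best M, centroids, partition)
```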

24 Part II: Clustering algorithms

25 Algorithm 1: Split P. Fränti, T. Kaukoranta and O. Nevalainen, "On the splitting method for vector quantization codebook generation", Optical Engineering, 36 (11), 3043-3051, November 1997.

26 Divisive approach. Motivation: efficiency of the divide-and-conquer approach; a hierarchy of clusters as a result; useful when solving the number of clusters. Challenges: design problem 1, which cluster to split; design problem 2, how to split; sub-optimal local optimization at best.

27 Split-based (divisive) clustering

28 Select cluster to be split. Heuristic choices: the cluster with the highest variance (MSE), or the cluster with the most skew distribution (3rd moment). Locally optimal choice (use this!): tentatively split all clusters and select the one that decreases the MSE most. Complexity of the choice: the heuristics take time to compute the measure, but the optimal choice takes only twice (2×) as much time, because the measures can be stored and only the two new clusters created at each step need to be recalculated.

29 Selection example: the cluster with the biggest MSE (11.6) is not necessarily the best choice; dividing another cluster (11.2) decreases the overall MSE more. (Cluster MSE values in the figure: 11.2, 11.6, 7.5, 8.2, 4.3, 6.5.)

30 Selection example (continued): only two new values (4.1 and 6.3) need to be calculated; the stored values 7.5, 8.2, 4.3, 6.5 and 11.6 remain valid.

31 How to split. Centroid methods: heuristic 1, replace the centroid c by c−ε and c+ε; heuristic 2, use the two furthest vectors; heuristic 3, use two random vectors. Partition according to the principal axis: calculate the principal axis, select a dividing point along the axis, divide by a hyperplane, and calculate the centroids of the two sub-clusters.

32 Splitting along principal axis, pseudo code: Step 1: Calculate the principal axis. Step 2: Select a dividing point. Step 3: Divide the points by a hyperplane. Step 4: Calculate the centroids of the new clusters.
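A NumPy sketch of Steps 1-4, under the assumption that the principal axis is taken from the cluster covariance and the dividing point is simply the mean projection; the optimal dividing point of Step 2 is refined on the following slides.

```python
import numpy as np

def split_cluster(X):
    """Split one cluster along its principal axis.
    Returns the two new centroids and a mask of the 'left' sub-cluster."""
    mean = X.mean(axis=0)
    # Step 1: principal axis = covariance eigenvector with the largest eigenvalue.
    _, eigvecs = np.linalg.eigh(np.cov(X - mean, rowvar=False))
    axis = eigvecs[:, -1]
    # Step 2: dividing point (here simply the mean, i.e. projection value 0).
    proj = (X - mean) @ axis
    left = proj <= 0
    # Steps 3-4: divide by the hyperplane and compute the new centroids.
    return X[left].mean(axis=0), X[~left].mean(axis=0), left
```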

33 Example of dividing: principal axis and dividing hyperplane.

34 Optimal dividing point, pseudo code of Step 2: Step 2.1: Calculate the projections on the principal axis. Step 2.2: Sort the vectors according to the projection. Step 2.3: FOR each vector x_i DO: divide using x_i as the dividing point and calculate the distortion of the subsets D_1 and D_2. Step 2.4: Choose the point minimizing D_1 + D_2.

35 Finding the dividing point: the error for the next candidate dividing point and the updated centroids can be computed incrementally, in O(1) time per candidate.
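A sketch of how the O(1) update can be realized (an assumption about the implementation, using the identity SSE = Σ‖x‖² − ‖Σx‖²/n): after sorting by projection, running sums let every candidate dividing point be evaluated in constant time.

```python
import numpy as np

def optimal_dividing_point(X, axis):
    """Return (order, i) so that the points order[:i] and order[i:]
    minimize the total distortion D1 + D2 along the given axis."""
    order = np.argsort(X @ axis)                 # Steps 2.1-2.2
    Xs = X[order]
    sq = np.sum(Xs ** 2, axis=1)
    tot_sum, tot_sq = Xs.sum(axis=0), sq.sum()

    left_sum, left_sq = np.zeros(X.shape[1]), 0.0
    best_cost, best_i = np.inf, 1
    for i in range(1, len(Xs)):                  # Step 2.3: O(1) per candidate
        left_sum += Xs[i - 1]
        left_sq += sq[i - 1]
        right_sum, right_sq = tot_sum - left_sum, tot_sq - left_sq
        d1 = left_sq - left_sum @ left_sum / i
        d2 = right_sq - right_sum @ right_sum / (len(Xs) - i)
        if d1 + d2 < best_cost:
            best_cost, best_i = d1 + d2, i       # Step 2.4
    return order, best_i
```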

36 Sub-optimality of the split

37 Example of splitting process: principal axis and dividing hyperplane; 2 clusters → 3 clusters.

38 Example of splitting process: 4 clusters → 5 clusters.

39 Example of splitting process: 6 clusters → 7 clusters.

40 Example of splitting process: 8 clusters → 9 clusters.

41 Example of splitting process: 10 clusters → 11 clusters.

42 Example of splitting process: 12 clusters → 13 clusters.

43 Example of splitting process: 14 clusters → 15 clusters; MSE = 1.94.

44 K-means refinement: result directly after split, MSE = 1.94; result after re-partition, MSE = 1.39; result after K-means, MSE = 1.33.

45 Time complexity. Number of processed vectors, assuming that the clusters are always split into two equal halves; and assuming an unequal split into sub-clusters of sizes n_max and n_min.
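For the equal-halves case the missing count can be reconstructed as follows (a reconstruction, not the slide's own formula): every level of the split hierarchy touches all N vectors once, and M clusters are reached after log₂ M levels, so

```latex
N + 2\cdot\tfrac{N}{2} + 4\cdot\tfrac{N}{4} + \dots
  \;=\; N \log_2 M \ \text{processed vectors in total.}
```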

46 Time complexity (continued): number of vectors processed; at each step, sorting the vectors is the bottleneck.

47 P. Fränti, T. Kaukoranta, D-F. Shen and K-S. Chang, "Fast and memory efficient implementation of the exact PNN", IEEE Trans. on Image Processing, 9 (5), 773-777, May 2000. Algorithm 2: Pairwise Nearest Neighbor

48 Agglomerative clustering. Single link: minimize the distance of the nearest vectors. Complete link: minimize the distance of the two furthest vectors. Ward's method: minimize the mean square error; in vector quantization, known as the Pairwise Nearest Neighbor (PNN) method.

49 PNN algorithm [Ward 1963: Journal of the American Statistical Association]. Merge cost and local optimization strategy: always merge the pair of clusters with the smallest merge cost. Nearest neighbor search is needed for (1) finding the cluster pair to be merged and (2) updating the NN pointers.
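The merge cost is Ward's criterion; written out (the exact slide formula is in the figure, so treat this as a reconstruction), merging clusters a and b with sizes n_a, n_b and centroids c_a, c_b increases the total squared error by

```latex
d_{a,b} \;=\; \frac{n_a\, n_b}{\,n_a + n_b\,}\; \lVert c_a - c_b \rVert^{2}.
```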

50 Pseudo code
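The pseudo code itself is shown as a figure; below is a compact, deliberately naive O(N³) Python sketch of the same idea using the merge cost above, without the distance-matrix, heap, or NN-pointer speed-ups discussed on the following slides.

```python
import numpy as np

def pnn(X, M):
    """Pairwise Nearest Neighbor: start from N singleton clusters and
    merge the cheapest pair until only M clusters remain."""
    centroids = [x.astype(float) for x in X]
    sizes = [1] * len(X)

    def merge_cost(a, b):
        diff = centroids[a] - centroids[b]
        return sizes[a] * sizes[b] / (sizes[a] + sizes[b]) * (diff @ diff)

    while len(centroids) > M:
        # Find the pair whose merge increases the total squared error the least.
        _, a, b = min((merge_cost(a, b), a, b)
                      for a in range(len(centroids))
                      for b in range(a + 1, len(centroids)))
        na, nb = sizes[a], sizes[b]
        centroids[a] = (na * centroids[a] + nb * centroids[b]) / (na + nb)
        sizes[a] = na + nb
        del centroids[b], sizes[b]
    return np.array(centroids)
```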

51 Overall example of the process: M = 5000 → 4999 → 4998 → … → 50 → … → 16 → 15.

52 Detailed example of the process

53 Example - 25 clusters: MSE ≈ 1.01×10⁹

54 Example - 24 clusters: MSE ≈ 1.03×10⁹

55 Example - 23 clusters: MSE ≈ 1.06×10⁹

56 Example - 22 clusters: MSE ≈ 1.09×10⁹

57 Example - 21 clusters: MSE ≈ 1.12×10⁹

58 Example - 20 clusters: MSE ≈ 1.16×10⁹

59 Example - 19 clusters: MSE ≈ 1.19×10⁹

60 Example - 18 clusters: MSE ≈ 1.23×10⁹

61 Example - 17 clusters: MSE ≈ 1.26×10⁹

62 Example - 16 clusters: MSE ≈ 1.30×10⁹

63 Example - 15 clusters: MSE ≈ 1.34×10⁹

64 Example of distance calculations

65 Storing the distance matrix. Maintain the distance matrix and update rows only for the changed cluster. The number of distance calculations per step reduces from O(N²) to O(N), but the search of the minimum pair still requires O(N²) time, so the total remains O(N³). It also requires O(N²) memory.
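A sketch of the bookkeeping described above (the names and the merge_cost callback are assumptions): after clusters a and b have been merged into a, only the row and column of a are recomputed and those of b are removed.

```python
import numpy as np

def update_distance_matrix(D, a, b, merge_cost):
    """D is the symmetric matrix of pairwise merge costs; cluster b was merged
    into cluster a. Only O(N) entries need to be recomputed."""
    D = np.delete(np.delete(D, b, axis=0), b, axis=1)   # drop row/column of b
    if a > b:
        a -= 1                                          # index of a after the removal
    for j in range(D.shape[0]):                         # refresh row/column of a
        if j != a:
            D[a, j] = D[j, a] = merge_cost(a, j)
    return D
```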

66 Heap structure for fast search [Kurita 1991: Pattern Recognition]. The search reduces from O(N) to O(log N); in total O(N² log N).

67 Maintain nearest neighbor (NN) pointers [Fränti et al., 2000: IEEE Trans. Image Processing]. The time complexity reduces from O(N³) towards Ω(τN²), where τ is the number of NN pointers that must be updated per merge.

68 Processing time comparison With NN pointers

69 Combining PNN and K-means (figure: number of clusters reduced from N, or from an intermediate random solution of M₀ clusters, down to M; standard PNN compared with random initialization + PNN + K-means).

70 Further improvements:
P. Fränti, O. Virmajoki and V. Hautamäki, "Fast agglomerative clustering using a k-nearest neighbor graph", IEEE Trans. on Pattern Analysis and Machine Intelligence, 28 (11), 1875-1881, November 2006.
P. Fränti and O. Virmajoki, "Iterative shrinking method for clustering problems", Pattern Recognition, 39 (5), 761-765, May 2006.
T. Kaukoranta, P. Fränti and O. Nevalainen, "Vector quantization by lazy pairwise nearest neighbor method", Optical Engineering, 38 (11), 1862-1868, November 1999.
O. Virmajoki, P. Fränti and T. Kaukoranta, "Practical methods for speeding-up the pairwise nearest neighbor method", Optical Engineering, 40 (11), 2495-2504, November 2001.

71 Algorithm 3: Random Swap P. Fränti and J. Kivijärvi, "Randomised local search algorithm for the clustering problem", Pattern Analysis and Applications, 3 (4), 358-369, 2000.

72 Random swap algorithm (RS)
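The RS pseudo code is given as a figure; the sketch below follows its overall structure (random swap, repartition, a couple of K-means iterations, accept if the MSE improves). For brevity it repartitions all points instead of doing the local repartition shown on slides 74-75, so it is an illustration rather than the reference implementation.

```python
import numpy as np

def mse(X, C, P):
    return np.mean(np.sum((X - C[P]) ** 2, axis=1))

def nearest(X, C):
    return np.argmin(np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2), axis=1)

def random_swap(X, M, T=5000, kmeans_iters=2, seed=0):
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), M, replace=False)].astype(float)
    P = nearest(X, C)
    best = mse(X, C, P)
    for _ in range(T):
        C_new = C.copy()
        # Swap: replace a random centroid by a randomly chosen data vector.
        C_new[rng.integers(M)] = X[rng.integers(len(X))]
        P_new = nearest(X, C_new)
        # Fine-tune by a couple of K-means iterations.
        for _ in range(kmeans_iters):
            for j in range(M):
                if np.any(P_new == j):
                    C_new[j] = X[P_new == j].mean(axis=0)
            P_new = nearest(X, C_new)
        if mse(X, C_new, P_new) < best:
            C, P, best = C_new, P_new, mse(X, C_new, P_new)
    return C, P
```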

73 Demonstration of the algorithm

74 Centroid swap

75 Local repartition

76 Fine-tuning by K-means 1st iteration

77 Fine-tuning by K-means 2nd iteration

78 Fine-tuning by K-means 3rd iteration

79 Fine-tuning by K-means 16th iteration

80 Fine-tuning by K-means 17th iteration

81 Fine-tuning by K-means 18th iteration

82 Fine-tuning by K-means 19th iteration

83 Fine-tuning by K-means Final result after 25 iterations

84 Implementation of the swap: 1. Random swap (replace a randomly chosen centroid by a randomly chosen data vector). 2. Re-partition the vectors from the old (removed) cluster. 3. Create the new cluster (attract the vectors that are closer to the new centroid).
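A sketch of steps 2 and 3 (the formulas are in the figure; the names here are illustrative): only the points of the removed cluster and the points attracted by the new centroid change label, which is what makes the swap cheap.

```python
import numpy as np

def local_repartition(X, C, P, old):
    """C[old] has just been moved to its new location.
    Step 2: points that belonged to the removed cluster find a new nearest centroid.
    Step 3: any point closer to the new centroid than to its current one joins it."""
    orphans = np.where(P == old)[0]
    if orphans.size:
        d = np.linalg.norm(X[orphans, None, :] - C[None, :, :], axis=2)
        P[orphans] = np.argmin(d, axis=1)
    d_new = np.sum((X - C[old]) ** 2, axis=1)
    d_cur = np.sum((X - C[P]) ** 2, axis=1)
    P[d_new < d_cur] = old
    return P
```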

85 Independence of initialization: results for T = 5000 iterations on the Bridge data set (initial, worst, and best results shown).

86 Probability of a good swap. Select a proper centroid for removal: there are M clusters in total, so p_removal = 1/M. Select a proper new location: there are N choices, p_add = 1/N, but only M of them are significantly different, so p_add = 1/M. In total: M² significantly different swaps, and the probability of each is p_swap = 1/M². Open question: how many of these are good?

87 Expected number of iterations. Probability of not finding a good swap in T iterations; estimated number of iterations needed.
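Reconstructed from the swap probability p of the previous slide and a tolerated failure probability q (the standard geometric-trials argument; the slide's own formula is in the figure):

```latex
\Pr[\text{no good swap in } T \text{ iterations}] = (1 - p)^{T} \le q
\;\;\Longrightarrow\;\;
T \;\ge\; \frac{\ln q}{\ln (1 - p)} \;\approx\; \frac{-\ln q}{p}.
```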

88 Estimated number of iterations depending on T (data sets S1-S4). Observed = the number of iterations needed in practice. Estimated = the estimated number of iterations needed for the given q value.

89 Probability of success (p) depending on T

90 Probability of failure (q) depending on T

91 Observed probabilities depending on dimensionality

92 Bounds for the number of iterations Upper limit: Lower limit similarly; resulting in:

93 Multiple swaps (w). Probability of performing fewer than w good swaps; expected number of iterations.
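Again a reconstruction rather than the slide's formula: with success probability p per iteration, the number of good swaps in T iterations is binomial, giving

```latex
\Pr[\text{fewer than } w \text{ good swaps}]
  = \sum_{i=0}^{w-1} \binom{T}{i} p^{i} (1-p)^{T-i},
\qquad
E[T] \approx \frac{w}{p}.
```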

94 Efficiency of the random swap. Total time to find the correct clustering = time per iteration × number of iterations. Time complexity of a single step: swap, O(1); remove cluster, 2M × N/M = O(N); add cluster, 2N = O(N); update centroids, 2 × (2N/M) + 2 + 2 = O(N/M); (fast) K-means iteration, 4αN = O(αN)* (*see Fast K-means for the analysis).

95 Observed K-means iterations

96 K-means iterations

97 Time complexity and the observed number of steps

98 Total time complexity. Time complexity of a single step: t = O(αN). Number of iterations needed: T. Total time: T × t.

99 Time complexity: conclusions. 1. Logarithmic dependency on q. 2. Linear dependency on N. 3. Quadratic dependency on M (with a large number of clusters it can be too slow, and a faster variant might be needed). 4. Inverse dependency on α (worst case α = 2); the higher the dimensionality, the faster the method.

100 References.
Random swap algorithm:
P. Fränti and J. Kivijärvi, "Randomised local search algorithm for the clustering problem", Pattern Analysis and Applications, 3 (4), 358-369, 2000.
P. Fränti, J. Kivijärvi and O. Nevalainen, "Tabu search algorithm for codebook generation in VQ", Pattern Recognition, 31 (8), 1139-1148, August 1998.
Pseudo code: http://cs.uef.fi/sipu/
Efficiency of the random swap algorithm:
P. Fränti, O. Virmajoki and V. Hautamäki, "Efficiency of random swap based clustering", IAPR Int. Conf. on Pattern Recognition (ICPR'08), Tampa, FL, Dec 2008.

101 Part III: Efficient solution

102 Stopping criterion? Both the divisive and the agglomerative approach end up in a local minimum.

103 Strategies for efficient search using random swap. Brute force: solve the clustering separately for every possible number of clusters. Stepwise: as in brute force, but start from the previous solution and iterate less. Criterion-guided search: integrate the validity criterion directly into the cost function.

104 Brute force search strategy: search for each number of clusters separately (100 % of the work).

105 Stepwise search strategy: start from the previous result (30-40 % of the work).

106 Criterion-guided search: integrate the validity criterion with the cost function (3-6 % of the work).

107 Conclusions. Define the problem: a cost function f that measures the goodness of the clusters, or alternatively the (dis)similarity between two objects. Solve the problem: select the best algorithm for minimizing f. Homework. Number of clusters: Q. Zhao and P. Fränti, "WB-index: a sum-of-squares based index for cluster validity", Data & Knowledge Engineering, 92: 77-89, 2014. Validation: P. Fränti, M. Rezaei and Q. Zhao, "Centroid index: cluster level similarity measure", Pattern Recognition, 47 (9), 3034-3045, Sept. 2014.

108 Thank you Time for questions!

