Slide 1: Scalable Clustering using Multiple GPUs
K Wasif Mohiuddin, P J Narayanan
Center for Visual Information Technology, International Institute of Information Technology (IIIT), Hyderabad
Slide 2: Introduction
- Clustering classifies data into groups for a meaningful representation; the data within a cluster ideally shares common traits.
- It is unsupervised learning for finding hidden structure in data.
- Applications in data mining and computer vision include image classification and document retrieval.
- Focus here: the simple K-Means algorithm.
Slide 3: Need for High Performance Clustering
- Exact clustering has time complexity O(n^(dk+1) log n), where n is the number of input vectors, d the dimension, and k the number of centers.
- A fast, efficient clustering implementation is needed to deal with large data, high dimensionality, and many centers.
- In computer vision, 128-dimensional SIFT and 512-dimensional GIST descriptors are common, and feature counts can run into several millions.
- Example: Bag-of-Words vocabulary generation from SIFT vectors [Lowe, IJCV 2004].
Slide 4: Challenges and Contributions
Challenges:
- Data: a storage format that supports quick and repeated access.
- Computational: O(n^(dk+1) log n) complexity.
Contributions: a complete GPU-based implementation with
- Exploitation of intra-vector parallelism
- Efficient mean evaluation
- Careful data organization
- A multi-GPU framework
Slide 5: Related Work
General improvements:
- KD-trees [Moore et al., SIGKDD 1999]
- Triangle inequality [Elkan, ICML 2003]
Pre-CUDA GPU efforts:
- Fragment shaders [Hart et al., SIGGRAPH 2004]
Slide 6: Related Work (cont.)
Recent GPU efforts:
- Mean on CPU [Che et al., JPDC 2008]
- Mean on CPU + GPU [Hong et al., WCCSIE 2009]
- GPU Miner [Ren et al., HP Labs 2009]
- HPK-Means [Wu et al., UCHPC 2009]
- Divide & Rule [Li et al., ICCIT 2010]
Limitations of these approaches:
- Parallelism is not exploited within a data object.
- Mean evaluation on the GPU lacks efficiency.
- The proposed techniques are parameter dependent.
Slide 7: K-Means Objective Function
- Objective: minimize the total squared distance of points to their assigned centers, J = Σ_{j=1..k} Σ_{i=1..n} ‖ x_i^(j) − c_j ‖², where x_i^(j) denotes an input vector assigned to center c_j, 1 ≤ i ≤ n, 1 ≤ j ≤ k.
- Distance measure: Euclidean distance (L2 norm).
- Steps per iteration:
  - Membership evaluation
  - New mean evaluation
  - Convergence check
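Written out in full (standard k-means notation, with S_j denoting the set of input vectors currently assigned to center c_j), the objective and the two per-iteration steps are:

```latex
% Objective: total squared L2 distance of every point to its assigned center
J = \sum_{j=1}^{k} \sum_{x_i \in S_j} \lVert x_i - c_j \rVert^2

% Membership evaluation: each point joins the cluster of its nearest center
S_j = \{\, x_i : \lVert x_i - c_j \rVert^2 \le \lVert x_i - c_l \rVert^2 \ \text{for all } 1 \le l \le k \,\}

% New mean evaluation: each center moves to the mean of its cluster
c_j = \frac{1}{\lvert S_j \rvert} \sum_{x_i \in S_j} x_i
```

Iterating the two steps never increases J; the loop stops once memberships (or the centers) no longer change.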
Slide 8: Algorithm
- k random centers are initially chosen from the input.
- The data is partitioned into k clusters: each observation belongs to the cluster with the nearest mean.
- The new centers are re-evaluated and the process continues until convergence is attained.
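For concreteness, a minimal single-threaded reference sketch of this loop is shown below (plain host C++ with the row-major layout used later in the talk; an illustrative baseline only, not the authors' code):

```cuda
#include <algorithm>
#include <cfloat>
#include <cstddef>
#include <vector>

// Minimal Lloyd-iteration reference sketch. Vectors are stored row-major:
// component c of vector i lives at data[i*d + c].
void kmeansReference(const std::vector<float>& data, int n, int d, int k,
                     std::vector<float>& centers,   // k*d, seeded with k input vectors
                     std::vector<int>& labels,      // n, output memberships
                     int maxIters)
{
    std::vector<float> sums((size_t)k * d);
    std::vector<int> counts(k);

    for (int iter = 0; iter < maxIters; ++iter) {
        // Membership evaluation: nearest center by squared L2 distance.
        for (int i = 0; i < n; ++i) {
            float best = FLT_MAX;
            int bestJ = 0;
            for (int j = 0; j < k; ++j) {
                float dist = 0.f;
                for (int c = 0; c < d; ++c) {
                    float diff = data[(size_t)i * d + c] - centers[(size_t)j * d + c];
                    dist += diff * diff;
                }
                if (dist < best) { best = dist; bestJ = j; }
            }
            labels[i] = bestJ;
        }
        // New mean evaluation: average the members of each cluster.
        std::fill(sums.begin(), sums.end(), 0.f);
        std::fill(counts.begin(), counts.end(), 0);
        for (int i = 0; i < n; ++i) {
            counts[labels[i]]++;
            for (int c = 0; c < d; ++c)
                sums[(size_t)labels[i] * d + c] += data[(size_t)i * d + c];
        }
        for (int j = 0; j < k; ++j)
            if (counts[j] > 0)
                for (int c = 0; c < d; ++c)
                    centers[(size_t)j * d + c] = sums[(size_t)j * d + c] / counts[j];
        // A convergence test (unchanged labels or negligible center movement)
        // would normally break out of the loop early; omitted for brevity.
    }
}
```

The two phases inside the loop, membership evaluation and mean evaluation, are exactly the parts the following slides map onto the GPU.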
Slide 9: K-Means on GPU
Membership evaluation:
- Involves distance and minimum evaluation.
- A single thread handles each component of a vector: the d components of the input and center vectors, stored in row-major format, are processed in parallel.
- Log-step summation for distance evaluation.
- For each input vector, all centers are traversed, with center data served from the L2 cache.
Slide 10: K-Means on GPU (cont.)
Membership evaluation:
- Data objects are stored in row-major format, which provides coalesced access.
- Distance evaluation uses shared memory.
- Square-root evaluation is avoided: memberships are decided by comparing squared distances.
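A simplified CUDA kernel in the spirit of these two slides, assuming one input vector per block (the talk packs two) and a block size padded to the next power of two at or above d: it writes per-component squared differences into shared memory, sums them in log2(blockDim.x) steps, and keeps the minimum squared distance over all k centers, so no square root is ever taken. An illustrative sketch, not the authors' tuned kernel:

```cuda
#include <cfloat>
#include <cstddef>

// Membership evaluation sketch: one block per input vector, one thread per
// vector component. blockDim.x must be a power of two >= d; threads with
// tid >= d contribute zeros to the reduction.
__global__ void assignLabels(const float* __restrict__ data,     // n x d, row major
                             const float* __restrict__ centers,  // k x d, row major
                             int* labels, int n, int d, int k)
{
    extern __shared__ float partial[];           // blockDim.x floats
    const int vec = blockIdx.x;
    const int tid = threadIdx.x;
    if (vec >= n) return;

    // Each thread keeps its own component of the input vector in a register.
    const float x = (tid < d) ? data[(size_t)vec * d + tid] : 0.f;

    float best = FLT_MAX;                        // only thread 0's copy is used
    int bestCenter = 0;

    for (int j = 0; j < k; ++j) {
        // Per-component squared difference to center j.
        const float diff = (tid < d) ? x - centers[(size_t)j * d + tid] : 0.f;
        partial[tid] = diff * diff;
        __syncthreads();

        // Log-step (tree) summation across the d components.
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s) partial[tid] += partial[tid + s];
            __syncthreads();
        }
        // partial[0] is the squared L2 distance; comparing squared distances
        // is enough to pick the nearest center, so sqrtf() is never called.
        if (tid == 0 && partial[0] < best) { best = partial[0]; bestCenter = j; }
        __syncthreads();                         // shared memory reused for next center
    }
    if (tid == 0) labels[vec] = bestCenter;
}
```

A matching launch would look like `assignLabels<<<n, blockSize, blockSize * sizeof(float)>>>(d_data, d_centers, d_labels, n, d, k)`, with `blockSize` the padded power of two (e.g. 128 for d = 128); d_data, d_centers, and d_labels are assumed device buffers.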
Slide 11: K-Means on GPU (cont.)
Mean evaluation issues:
- Rearranging the data on the CPU according to membership is time consuming.
- Concurrent writes.
- Random reads and writes.
- Non-uniform distribution of labels across data objects.
Slide 12: Mean Evaluation on GPU
- Labels and indices are stored together in 64-bit records.
- Data objects with the same membership are grouped using the Splitsort operation, splitting with the labels as keys [Splitsort: Suryakant & Narayanan, IIIT-H TR 2009].
- A gather primitive rearranges the input in order of labels.
- This yields the sorted global indices of the input vectors.
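The talk's grouping step relies on the authors' Splitsort primitive; as a rough, readily available stand-in, the same effect, grouping vector indices by their cluster label, can be sketched with Thrust by packing (label, index) pairs into 64-bit keys and sorting them (illustrative only, and not performance-equivalent to Splitsort):

```cuda
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>
#include <thrust/transform.h>

// Pack (label, original index) into one 64-bit record: label in the high 32
// bits, index in the low 32 bits. Sorting these keys groups equal labels
// together while keeping the original order inside each group.
struct PackLabelIndex {
    __host__ __device__
    unsigned long long operator()(int label, int index) const {
        return ((unsigned long long)(unsigned)label << 32) |
                (unsigned long long)(unsigned)index;
    }
};

// Produce the permutation 'order' that lists vector indices grouped by label.
void sortIndicesByLabel(const thrust::device_vector<int>& labels,  // n memberships
                        thrust::device_vector<int>& order)         // n, output
{
    const int n = (int)labels.size();
    order.resize(n);
    thrust::sequence(order.begin(), order.end());                  // 0, 1, ..., n-1

    thrust::device_vector<unsigned long long> keys(n);
    thrust::transform(labels.begin(), labels.end(),                // first input: labels
                      order.begin(),                               // second input: indices
                      keys.begin(), PackLabelIndex());
    thrust::sort_by_key(keys.begin(), keys.end(), order.begin());
}
```

`order` then holds, for every output position, which input vector should be gathered there; the slides apply a gather (plus the transpose on the next slides) so that the subsequent mean evaluation reads the rearranged vectors with coalesced accesses.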
Slide 13: Splitsort & Transpose Operation
(Figure illustrating the split-sort and transpose of the input vectors; no further text on this slide.)
Slide 14: Mean Evaluation on GPU (cont.)
- Row-major storage of vectors enables coalesced access.
- A CUDPP segmented scan followed by a compact operation gives the histogram counts.
- A transpose operation is applied before rearranging the input vectors.
- A second segmented scan evaluates the mean of the rearranged vectors per label.
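The counting and averaging described here uses CUDPP segmented scans plus a compact; purely to illustrate what those steps compute, the sketch below obtains per-cluster counts and per-dimension sums with thrust::reduce_by_key over the label-sorted, transposed (dimension-major) data. It is an assumption-laden stand-in, not the talk's implementation, and the host-side division loop is written for clarity rather than speed:

```cuda
#include <cstddef>
#include <thrust/device_vector.h>
#include <thrust/iterator/constant_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/reduce.h>

// sortedLabels : n labels in ascending order (output of the grouping step)
// rearranged   : d x n, dimension-major copy of the input, rows permuted into
//                label order (the "transpose" of the previous slides)
// centers      : k x d, row major; overwritten with the new means
void updateMeans(const thrust::device_vector<int>&   sortedLabels,
                 const thrust::device_vector<float>& rearranged,
                 thrust::device_vector<float>&       centers,
                 int n, int d, int k)
{
    thrust::device_vector<int> uniqueLabels(k), counts(k);
    thrust::device_vector<float> sums(k);

    // Histogram: how many vectors landed in each (non-empty) cluster.
    auto ends = thrust::reduce_by_key(sortedLabels.begin(), sortedLabels.end(),
                                      thrust::constant_iterator<int>(1),
                                      uniqueLabels.begin(), counts.begin());
    const int numNonEmpty = (int)(ends.first - uniqueLabels.begin());

    // One segmented sum per dimension: thanks to the dimension-major layout,
    // the values of dimension c for consecutive label-sorted vectors are
    // contiguous in memory.
    for (int c = 0; c < d; ++c) {
        thrust::reduce_by_key(sortedLabels.begin(), sortedLabels.end(),
                              rearranged.begin() + (size_t)c * n,
                              thrust::make_discard_iterator(), sums.begin());
        for (int g = 0; g < numNonEmpty; ++g) {       // element-wise host copies,
            const int   j   = uniqueLabels[g];        // for clarity only
            const float sum = sums[g];
            const int   cnt = counts[g];
            centers[(size_t)j * d + c] = sum / (float)cnt;
        }
    }
    // Clusters that received no members keep their previous centers.
}
```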
Slide 15: Implementation Details
Tesla:
- 2 vectors per block, 2 centers at a time.
- Centers accessed via shared memory.
Fermi:
- 2 vectors per block, 4 centers at a time.
- Centers accessed from global memory through the L2 cache.
- More shared memory available for distance evaluation.
- Occupancy of 83% using 5,136 bytes of shared memory per block on Fermi.
Slide 16: Issues
- Too many distance evaluations.
- Convergence depends strongly on the cluster centers chosen initially.
- Prior seeding with K-Means++ can reduce the number of iterations.
- Besides the input size, parameters such as the dimension and the number of cluster centers affect performance.
Slide 17: Limitations of the GPU Device
- The algorithm is highly compute and memory intensive.
- Global and shared memory on a GPU device are limited.
- The computational load must be divided if more than one device is available.
- Every available resource should be utilized.
- The algorithm must remain scalable.
Slide 18: Multi-GPU Approach
- Partition the input data into chunks proportional to the number of cores.
- Broadcast the k centers to all nodes.
- Each GPU performs membership evaluation and partial mean evaluation on the chunk sent to its node.
- The nodes send their partial sums to the master node.
- The master node evaluates the new means for the next iteration.
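The per-iteration combine step on the master can be sketched as follows (host-side only; the per-GPU partial sums and counts are assumed to have already been copied back, and all names and layouts here are illustrative assumptions rather than the authors' interfaces):

```cuda
#include <cstddef>
#include <vector>

// Master-side combine: GPU g contributes per-cluster partial sums (k x d,
// row major) and counts (k) for its chunk; the master accumulates them and
// divides to obtain the centers for the next iteration, which are then
// broadcast back to all nodes/GPUs.
void combinePartialMeans(const std::vector<std::vector<float>>& partialSums,   // [numGPUs][k*d]
                         const std::vector<std::vector<int>>&   partialCounts, // [numGPUs][k]
                         std::vector<float>& centers,                          // k*d, updated in place
                         int k, int d)
{
    std::vector<double>    sums((size_t)k * d, 0.0);
    std::vector<long long> counts((size_t)k, 0);

    for (size_t g = 0; g < partialSums.size(); ++g) {
        for (int j = 0; j < k; ++j) {
            counts[j] += partialCounts[g][j];
            for (int c = 0; c < d; ++c)
                sums[(size_t)j * d + c] += partialSums[g][(size_t)j * d + c];
        }
    }
    for (int j = 0; j < k; ++j)
        if (counts[j] > 0)                        // empty clusters keep old centers
            for (int c = 0; c < d; ++c)
                centers[(size_t)j * d + c] = (float)(sums[(size_t)j * d + c] / counts[j]);
}
```

Only k x d partial sums and k counts travel per GPU per iteration, independent of n, which is what lets the multi-GPU runs scale to the 32 million vectors reported later.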
Slide 19: Results
- Generated Gaussian and SIFT vectors; the parameters n, d, k were varied.
- Performance measured on a CPU (32-bit, 2.7 GHz), Tesla T10, and GTX 480, tested up to n_max = 4 million, k_max = 8,000, d_max = 256.
- Multi-GPU (4x T10 + GTX 480): n_max = 32 million, k_max = 8,000, d_max = 256.
- Comparison with previous GPU implementations.
Slide 20: Overall Results
Running times of K-Means (in seconds) on CPU and GPUs for d = 128:

N, K        CPU      Tesla T10   GTX 480   4x T10
10K, 80     1.3      0.119       0.18      0.097
50K, 800    71.3     2.73        1.73      0.891
125K, 2K    463.6    14.18       7.71      2.47
250K, 4K    1320     38.5        27.7      7.45
1M, 8K      28936    268.6       170.6     68.5
Slide 21: Overall Performance
- Mean evaluation is reduced to 6% of the total time for large inputs of high-dimensional data.
- The multi-GPU setup provided near-linear speedup.
- Speedup of up to 170 on the GTX 480.
- 6 million vectors of dimension 128 clustered in just 136 seconds per iteration.
- Up to twice the speedup of the best previous GPU implementation.
Slide 22: Performance vs. n
(Figure: near-linear performance as n varies, with d = 128 and k = 4,000.)
Slide 23: Performance vs. d
(Figure: performance as d varies, with n = 1M and k = 8,000.)
Slide 24: Performance vs. k
(Figure: near-linear performance as k varies, with n = 50K and d = 128.)
Slide 25: Comparison
Running time of K-Means (in seconds) on a GTX 280:

N          K     D     Li et al   Wu et al   Our K-Means
2 Million  400   8     1.23       4.53       1.27
4 Million  100   8     0.689      4.95       0.734
4 Million  400   8     2.26       9.03       2.4
51,200     32    64    1320       -          0.191
51,200     32    128   28936      -          0.282
Slide 26: Performance on GPUs
(Figure: performance of the 8600, Tesla, and GTX 480 for d = 128 and k = 1,000.)
Slide 27: Conclusions
- Achieved a speedup of over 170 on a single NVIDIA Fermi GPU.
- A complete GPU-based implementation.
- High performance for large d, since each vector is processed in parallel.
- Scalable in problem size (n, d, k) and in the number of cores.
- Operations such as Splitsort and transpose are used for coalesced memory access.
- Memory limitations are overcome with the multi-GPU framework.
- Code will be available soon at http://cvit.iiit.ac.in
Slide 28: Thank You
Questions?