1
Scalable Data Clustering with GPUs. Andrew D. Pangborn. Thesis Defense, Rochester Institute of Technology, Computer Engineering Department. Friday, May 14th, 2010
2
Thesis Objectives Develop high performance parallel implementations of data clustering algorithms leveraging the computational power of GPUs and the CUDA framework – Make clustering flow cytometry data sets practical on a single lab machine Use OpenMP and MPI for distributing work to multiple GPUs in a grid computing or commodity cluster environment
3
Outline Overview of the application domain GPU Architecture, CUDA Parallel Implementation Results Multi-GPU architecture More Results
4
Data Clustering A form of unsupervised learning that groups similar objects into relatively homogeneous sets called clusters How do we define similarity between objects? – Depends on the application domain, implementation Not to be confused with data classification, which assigns objects to predefined classes
5
Data Clustering Algorithms Clustering Taxonomy from “Data Clustering: A Review”, by Jain et al. [1]
6
Example: Iris Flower Data
7
Flow Cytometry Technology used by biologists and immunologists to study the physical and chemical characteristics of cells Example: Measure T lymphocyte counts to monitor HIV infection [2]
8
Flow Cytometry Cells in a fluid pass through a laser Measure physical characteristics with scatter data Add fluorescently labeled antibodies to measure other aspects of the cells
9
Flow Cytometer
10
Flow Cytometry Data Sets Multiple measurements (dimensions) for each event – Upwards of 6 scatter dimensions and 18 colors per experiment On the order of 10^5 – 10^6 events ~24 million values that must be clustered Lots of potential clusters Example: 10^6 events, 100 clusters, 24 dimensions – C-means: O(NMD) = 2.4 x 10^9 – Expectation Maximization: O(NMD^2) = 5.7 x 10^10
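A quick check of the per-iteration operation counts for the example above:

```latex
% Per-iteration work for N = 10^6 events, M = 100 clusters, D = 24 dimensions
\text{C-means: } N \cdot M \cdot D = 10^6 \cdot 100 \cdot 24 = 2.4 \times 10^{9}
\qquad
\text{EM: } N \cdot M \cdot D^2 = 10^6 \cdot 100 \cdot 24^2 = 5.76 \times 10^{10}
```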
11
Parallel Processing Clustering can take many hours on a single CPU Data growth is accelerating Performance gains of single-threaded applications slowing down Fortunately many data clustering algorithms lend themselves naturally to parallel processing
12
Multi-core Current trends: – Adding more cores – More SIMD SSE3/AVX – Application specific extensions VT-x, AES-NI – Point-to-Point interconnects, higher memory bandwidths
13
GPU Architecture Trends Figure based on an Intel Larrabee presentation at SuperComputing 2009, charting CPU and GPU designs along two axes: fixed function to partially programmable to fully programmable, and multi-threaded to multi-core to many-core (Intel Larrabee, NVIDIA CUDA)
14
GPU vs. CPU Peak Performance
15
Tesla GPU Architecture
16
Tesla Cores
17
GPGPU General Purpose computing on Graphics Processing Units Past – Programmable shader languages: Cg, GLSL, HLSL – Use textures to store data Present: – Multiple frameworks using traditional general purpose systems and high-level languages
18
CUDA: Software Stack Image from [5]
19
CUDA: Program Flow Application start → search for CUDA devices → load data on host → allocate device memory → copy data to device → launch device kernels to process data → copy results from device to host memory. (Figure: host side with CPU and main memory, device side with GPU cores and device memory, connected by PCI-Express.)
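The host-side flow above maps almost one-to-one onto CUDA runtime API calls. A minimal sketch under those steps (the kernel, data, and sizes are placeholders, not the thesis code):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void processData(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;   // placeholder computation
}

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);                 // search for CUDA devices
    if (deviceCount == 0) { printf("No CUDA device found\n"); return 1; }

    const int N = 1 << 20;
    float* hostData = new float[N];                   // load data on host (placeholder data)
    for (int i = 0; i < N; ++i) hostData[i] = 1.0f;

    float* deviceData = nullptr;
    cudaMalloc(&deviceData, N * sizeof(float));       // allocate device memory
    cudaMemcpy(deviceData, hostData, N * sizeof(float),
               cudaMemcpyHostToDevice);               // copy data to device

    processData<<<(N + 511) / 512, 512>>>(deviceData, N);  // launch device kernel

    cudaMemcpy(hostData, deviceData, N * sizeof(float),
               cudaMemcpyDeviceToHost);               // copy results back to host
    cudaFree(deviceData);
    delete[] hostData;
    return 0;
}
```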
20
CUDA: Streaming Multiprocessors Image from [3]
21
CUDA: Thread Model Kernel – A device function invoked by the host computer – Launches a grid with multiple blocks, and multiple threads per block Blocks – Independent tasks comprised of multiple threads – no synchronization between blocks SIMT: Single-Instruction, Multiple-Thread – Multiple threads executing the same instruction on different data (SIMD), able to diverge if necessary Image from [3]
22
CUDA: Memory Model Image from [3]
23
When is CUDA worthwhile? High computational density – Worthwhile to transfer data to separate device Both coarse-grained and fine-grained SIMD parallelism – Lots of independent tasks (blocks) that don’t require frequent synchronization – Within each block, lots of individual SIMD threads Contiguous memory access patterns Frequently/Repeatedly used data small enough to fit in shared memory
24
C-means Minimizes the squared error between data points and cluster centers using Euclidean distance Alternates between computing membership values and updating cluster centers Time complexity: O(N·M·D·I), where N = vectors, M = clusters, D = dimensions per vector, I = number of iterations
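For reference, the standard fuzzy c-means formulation this slide summarizes, with fuzziness parameter p > 1 (the notation here is illustrative, not quoted from the thesis):

```latex
% Objective minimized over memberships u_{ij} and centers c_j
J = \sum_{i=1}^{N} \sum_{j=1}^{M} u_{ij}^{\,p}\, \| x_i - c_j \|^2
% Alternating updates: memberships from current centers, then centers from memberships
u_{ij} = \left( \sum_{k=1}^{M} \left( \frac{\| x_i - c_j \|}{\| x_i - c_k \|} \right)^{\frac{2}{p-1}} \right)^{-1},
\qquad
c_j = \frac{\sum_{i=1}^{N} u_{ij}^{\,p}\, x_i}{\sum_{i=1}^{N} u_{ij}^{\,p}}
```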
25
C-means Example
26
C-means Program Flow Host: read input data, copy data to GPU, choose initial centers, copy centers to GPU. Device kernels: Distances >>, Memberships >>, Center Numerators >>, Center Denominators >>. Host: compute centers, then check Error < ε? If no, loop back to copying centers to the GPU; if yes, output results. (Figure legend distinguishes host steps from device kernels.)
27
C-means Distance Kernel Inputs: [D x N] data matrix, [M x D] centers matrix Outputs: [M x N] distance matrix Kernel Grid: [N/512 x M] blocks, 512 threads per block (Figure: each block covers one cluster and one 512-event span of the [M x N] distance matrix; all threads in a block share a [1 x D] center in shared memory and read the [D x N] data matrix from global memory.)
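A minimal sketch of a distance kernel matching the layout described above; this is illustrative, not the thesis kernel:

```cuda
// Launch with grid = dim3(N/512, M), 512 threads per block,
// and D * sizeof(float) bytes of dynamic shared memory.
__global__ void distanceKernel(const float* data,     // [D x N], element (d, n) at d*N + n
                               const float* centers,  // [M x D], element (m, d) at m*D + d
                               float* distances,      // [M x N], element (m, n) at m*N + n
                               int N, int D) {
    extern __shared__ float center[];                 // this block's [1 x D] cluster center
    int m = blockIdx.y;                                // cluster index
    int n = blockIdx.x * blockDim.x + threadIdx.x;     // event index

    // Cooperatively stage the cluster center in shared memory.
    for (int d = threadIdx.x; d < D; d += blockDim.x)
        center[d] = centers[m * D + d];
    __syncthreads();

    if (n < N) {
        float sum = 0.0f;
        for (int d = 0; d < D; ++d) {
            // Consecutive threads read consecutive events: coalesced access.
            float diff = data[d * N + n] - center[d];
            sum += diff * diff;
        }
        distances[m * N + n] = sqrtf(sum);
    }
}
```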
28
C-means Membership Kernel Kernel Grid: [N/512] blocks, 512 threads per block Transforms the [M x N] distance matrix into the [M x N] membership matrix (in place) – Each thread makes two passes through the distance matrix: the first to compute the normalization sum, the second to compute each membership value (Figure: N/512 blocks of 512 threads, each covering a contiguous span of columns of the [M x N] distance matrix in global memory.)
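A sketch of the two-pass, in-place membership update, assuming the standard fuzzy c-means membership formula with fuzziness p and nonzero distances (illustrative, not copied from the thesis):

```cuda
// Launch with N/512 blocks of 512 threads; one thread per event column n.
__global__ void membershipKernel(float* matrix,   // [M x N] distances in, memberships out
                                 int N, int M, float p) {
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (n >= N) return;

    float expo = 2.0f / (p - 1.0f);

    // Pass 1: normalization sum over all clusters for this event.
    float sum = 0.0f;
    for (int m = 0; m < M; ++m)
        sum += powf(1.0f / matrix[m * N + n], expo);

    // Pass 2: overwrite each distance with its membership value.
    for (int m = 0; m < M; ++m)
        matrix[m * N + n] = powf(1.0f / matrix[m * N + n], expo) / sum;
}
```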
29
C-means Centers Kernel Kernel Grid: [M/4 x D] blocks, 256 threads per block The 256 threads cycle through the N events; each value read is re-used 4 times (once per cluster in the block's group of 4) Partial sums are accumulated in a [4 x 256] shared-memory array, then reduced to [4 x 1] with a butterfly sum (Figure: inputs are the [D x N] data matrix and the [M x N] membership matrix; the kernel grid computes an [M x D] matrix.)
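The butterfly sum mentioned above is the usual shared-memory tree reduction. In the thesis it happens at the end of the centers kernel itself; the standalone kernel below is only an illustrative sketch of that step:

```cuda
// Each block reduces its [4 x 256] partial sums (one row per cluster in the
// block's group of 4) down to [4 x 1] with a butterfly/tree sum.
__global__ void reducePartialSums(const float* partialIn,  // [numBlocks x 4 x 256]
                                  float* sumsOut) {        // [numBlocks x 4]
    __shared__ float partial[4][256];
    for (int c = 0; c < 4; ++c)
        partial[c][threadIdx.x] = partialIn[(blockIdx.x * 4 + c) * 256 + threadIdx.x];
    __syncthreads();

    // Halve the number of active threads each step; consecutive threads touch
    // consecutive words, and warps only diverge once stride drops below 32.
    for (int stride = 128; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            for (int c = 0; c < 4; ++c)
                partial[c][threadIdx.x] += partial[c][threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        for (int c = 0; c < 4; ++c)
            sumsOut[blockIdx.x * 4 + c] = partial[c][0];
}
```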
30
Expectation Maximization with a Gaussian mixture model Data described by a mixture of M Gaussian distributions Each Gaussian has 3 parameters
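The three parameters per Gaussian are its mixing weight, mean vector, and covariance matrix; the mixture density takes the standard form (notation here is illustrative):

```latex
p(x_n) = \sum_{m=1}^{M} \pi_m \, \mathcal{N}(x_n \mid \mu_m, \Sigma_m),
\qquad
\mathcal{N}(x \mid \mu, \Sigma) =
\frac{1}{(2\pi)^{D/2} \, |\Sigma|^{1/2}}
\exp\!\left( -\tfrac{1}{2} (x - \mu)^{\mathsf T} \Sigma^{-1} (x - \mu) \right)
```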
31
E-step Compute likelihoods based on current model parameters – O(NMD^2) Convert likelihoods into membership values – O(NM)
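Converting likelihoods into membership values is the usual normalization of the weighted component likelihoods (standard EM notation, not quoted from the thesis):

```latex
\gamma_{nm} = \frac{\pi_m \, \mathcal{N}(x_n \mid \mu_m, \Sigma_m)}
                   {\sum_{k=1}^{M} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}
```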
32
M-step Update model parameters
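The parameter updates are the standard maximum-likelihood formulas given the memberships from the E-step:

```latex
N_m = \sum_{n=1}^{N} \gamma_{nm}, \qquad
\pi_m = \frac{N_m}{N}, \qquad
\mu_m = \frac{1}{N_m} \sum_{n=1}^{N} \gamma_{nm}\, x_n, \qquad
\Sigma_m = \frac{1}{N_m} \sum_{n=1}^{N} \gamma_{nm}\, (x_n - \mu_m)(x_n - \mu_m)^{\mathsf T}
```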
33
EM Program Flow Host: read input data and transpose, copy data to GPU, initialize models, copy models to GPU. Device kernels: Likelihoods >>, Memberships >>, Constants >>, Means >>, Covariance >>, N >>. Host: if Δ likelihood ≥ ε, iterate again; once converged, check whether the desired number of clusters has been reached; if not, combine the 2 closest Gaussians and continue, otherwise output results. (Figure legend distinguishes host steps from device kernels.)
34
EM: Likelihood Kernel Kernel Grid: [M x 16] blocks, 512 threads per block (Figure: each block holds its cluster's mean vector and covariance matrix in shared memory, reads all dimensions of its 1/16th of the events from the [D x N] data matrix in global memory, and writes 1/16th of one row of the [M x N] likelihood matrix in global memory.)
35
EM: Covariance Kernel Kernel Grid: [M/6 x D(D+1)/2] blocks, 256 threads per block Loop unrolled by 6 clusters per block (limited by resource constraints) (Figure: inputs are the [D x N] data matrix and the [M x N] membership matrix; each block accumulates [6 x 256] partial sums in shared memory, reduced to [6 x 1] with a butterfly sum.)
36
Performance Tuning Global Memory Coalescing – 8000/9000 (1.0/1.1) series devices: all 16 threads in a half-warp access consecutive elements, with the starting address aligned to 16*sizeof(element) – GT200 (1.2/1.3) series devices: 32B/64B/128B aligned transactions; minimize the number of transactions by accessing consecutive elements Images from [4]
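As a concrete illustration of the access patterns above (a sketch, not taken from the thesis kernels): with data stored as [D x N], indexing by the thread's event keeps consecutive threads on consecutive addresses.

```cuda
// Coalesced: consecutive threads (one per event n) read consecutive addresses
// within each row of a row-major [D x N] matrix.
__global__ void coalescedRead(const float* data, float* out, int N, int D) {
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (n >= N) return;
    float sum = 0.0f;
    for (int d = 0; d < D; ++d)
        sum += data[d * N + n];      // threads n, n+1, ... hit adjacent words
    out[n] = sum;
}

// Uncoalesced: the same data stored as [N x D] makes consecutive threads
// read addresses D floats apart, multiplying the number of transactions.
__global__ void stridedRead(const float* data, float* out, int N, int D) {
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (n >= N) return;
    float sum = 0.0f;
    for (int d = 0; d < D; ++d)
        sum += data[n * D + d];      // threads n, n+1, ... are D floats apart
    out[n] = sum;
}
```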
37
Performance Tuning Partition Camping – Global memory has 6 to 8 partitions (64-bit wide DRAM channels) – Organized into 256-byte wide blocks – Want an even distribution of accesses to the different channels
38
Performance Tuning Occupancy – Ratio of active warps to maximum allowed by a multiprocessor – Multiple active warps hides register dependencies, global memory access latencies – # of warps restricted by device resources Registers required per thread Shared memory per block Total number of threads per block
39
Performance Tuning Block Count – More small/simple blocks often better than larger blocks with loops
40
Performance Tuning CUBLAS – CUDA Basic Linear Algebra Subprograms – SGEMM more scalable solution for some of the kernels – Makes good use of shared memory – Poor blocking on rectangular matrices
41
Testing Environments Oak: 2 GPU Server @ RIT – Tesla C870: G80 architecture, 128 cores, 76.8 GB/sec – GTX260: GT200 architecture, 192 cores, 112 GB/sec – CUDA 2.3, Ubuntu Server 8.04 LTS, GCC 4.2, OpenMP Lincoln, a TeraGrid HPC resource – Cluster located in the National Center for Supercomputing Applications @ University of Illinois – 192 nodes each with 8 CPU cores and 16GB of memory connected with Infiniband networking – 92 Tesla S1070 accelerator units, for a total of 368 GPUs (2 per node) Each has 240 cores, 102 GB/sec – CUDA 2.3, RHEL4, GCC 4.2, OpenMP, MVAPICH2 for MPI
42
Results – Speedup: C-means and Expectation Maximization
43
Results – Overhead
44
Comparisons to Prior Work C-means – Order of magnitude improvement over our previous publication [6] – 1.04x to 4.64x improvement on data sizes provided in Anderson et al. [7]
45
Comparisons to Prior Work Expectation Maximization – 3.72x to 10.1x improvement over Kumar et al. [8] – [8] only supports diagonal covariance; the thesis implementation is capable of full covariance – Andrew Harp's EM implementation [9] did not provide raw execution times, only speedup, reporting 170x on a GTX285. Using his CPU source code, raw CPU times were obtained on comparable hardware to compute the speedup of the GTX260 implementation in this thesis: 446x over his CPU reference, effectively at least a 2.6x improvement
46
Multi-GPU Strategy 3 Tier Parallel hierarchy
47
Changes from Single GPU Very little impact on GPU kernel implementations Normalization for some results (C-means cluster centers, EM M-step kernels) done on the host after collecting results from each GPU – very low overhead Majority of the overhead is the initial distribution of input data and the final collection of event-level membership results to nodes in the cluster – Asynchronous MPI sends from the host instead of each node reading the input file from the data store – Need to transpose the membership matrix and then gather data from each node
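A minimal sketch of how the three tiers (MPI across nodes, OpenMP across the GPUs in a node, CUDA on each GPU) can be combined; the function names, data split, and sizes here are illustrative assumptions, not the thesis code:

```cuda
#include <mpi.h>
#include <omp.h>
#include <cuda_runtime.h>

// Placeholder for the per-GPU clustering kernels described earlier.
__global__ void clusterChunk(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i];   // stand-in for the real per-event work
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, numNodes = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numNodes);

    int gpusPerNode = 0;
    cudaGetDeviceCount(&gpusPerNode);
    if (gpusPerNode == 0) { MPI_Finalize(); return 1; }

    // Tier 1: each MPI rank owns roughly 1/numNodes of the events
    // (the actual asynchronous distribution from the root is omitted).
    int eventsPerNode = 100000;   // illustrative size

    // Tier 2: one OpenMP thread per GPU in this node.
    #pragma omp parallel num_threads(gpusPerNode)
    {
        int gpu = omp_get_thread_num();
        cudaSetDevice(gpu);                       // bind this host thread to one GPU

        // Tier 3: each GPU processes its share of the node's events.
        int eventsPerGpu = eventsPerNode / gpusPerNode;
        float* deviceData = nullptr;
        cudaMalloc(&deviceData, eventsPerGpu * sizeof(float));
        clusterChunk<<<(eventsPerGpu + 511) / 512, 512>>>(deviceData, eventsPerGpu);
        cudaDeviceSynchronize();
        cudaFree(deviceData);
    }

    // Partial results (e.g., center numerators/denominators) would be combined
    // across nodes here, e.g., with MPI_Allreduce, before the host normalizes them.
    MPI_Finalize();
    return 0;
}
```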
48
Multi-GPU Strategy
49
Multi-GPU Analysis Fixed Problem Size Analysis – i.e. Amdahl’s Law, Strong Scaling, True Speedup – Kept input size and parameters the same – Increased the number of nodes on Lincoln cluster Time-constrained Analysis – i.e. Gustafson’s Law, Weak Scaling, Scaled Speedup – Problem size changed such that execution time would remain the same with ideal speedup Execution time proportional to N/p Problem size scaled to N x p
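The two analyses correspond to the usual strong- and weak-scaling metrics; one common way to write them (standard definitions, not quoted from the thesis):

```latex
% Strong scaling: fixed problem size N, p nodes
S_{\text{fixed}}(p) = \frac{T(N, 1)}{T(N, p)}
% Weak scaling: problem size grown to N \cdot p so ideal runtime stays constant
S_{\text{scaled}}(p) = p \cdot \frac{T(N, 1)}{T(N \cdot p,\; p)}
```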
50
Fixed Problem Size Analysis
51
Overhead in Parallel Implementation
52
Time-Constrained Analysis
53
Synthetic Data Results: C-means result and EM Gaussian result
54
Flow Cytometry Results Combination of Mouse blood and Human blood cells 21 dimensions – 6 Scatter, 7 human, 7 mouse, 1 in common Attempt to distinguish mouse from human cells with clustering Data courtesy of Ernest Wang, Tim Mossman, James Cavenaugh, Iftekhar Naim, and others from the Flowgating group at the University of Rochester Center for Vaccine Biology and Immunology
56
99.3% of the 200,000 cells were properly grouped into mouse or human clusters with a 50/50 mixture Reducing the mixture to 10,000 human and 100,000 mouse cells still properly grouped 98.99% of all cells With only 1% human cells, they start getting harder to distinguish: 62 (6.2%) of the human cells were grouped into predominantly mouse clusters, and 192 mouse cells were grouped into the predominantly human clusters
57
Conclusions Both C-means and Expectation Maximization with Gaussians have abundant data parallelism that maps well to massively parallel many-core GPU architectures – Nearly 2 orders of magnitude improvement over single-threaded algorithms on comparable CPUs GPUs allow clustering of large data sets, like those found in flow cytometry, to be practical (a few minutes instead of a few hours) on a single lab machine GPU co-processors and the CUDA framework can be combined with traditional parallel programming techniques for efficient high performance computing – Over 6000x speedup compared to a single CPU with only 64 server nodes and 128 GPUs Parallelization strategies used in this thesis applicable to other clustering algorithms
58
Future Work Apply CUDA, OpenMP, MPI to other clustering algorithms or other parts of the workflow – Skewed t mixture model clustering – Accelerated data preparation, visualization, statistical inference between data sets Improvements to current implementations – A CUDA SGEMM with better performance on highly rectangular matrices could replace some kernels – Try MAGMA, CULAtools – 3rd-party BLAS libraries for GPUs Investigate performance on new architectures and frameworks – NVIDIA Fermi and Intel Larrabee architectures – OpenCL, DirectCompute, PGI frameworks/compilers Improvements to multi-node implementation – Remove the master-slave paradigm for data distribution and final result collection (currently the root node needs enough memory to hold it all – not scalable to very large data sets) – Dynamic load balancing for heterogeneous environments – Use CPU cores along with the GPU for processing portions of the data, instead of idling during kernels
59
Questions?
60
References
1. A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: a review," ACM Comput. Surv., vol. 31, no. 3, pp. 264–323, 1999.
2. H. Shapiro, J. Wiley, and W. InterScience, Practical Flow Cytometry. Wiley-Liss, New York, 2003.
3. NVIDIA, "NVIDIA CUDA Programming Guide 2.3." [Online]. Available: http://developer.nvidia.com/object/cuda_2_3_downloads.html
4. NVIDIA, "NVIDIA CUDA C Programming Best Practices Guide." [Online]. Available: http://developer.nvidia.com/object/cuda_2_3_downloads.html
5. NVIDIA, "NVIDIA CUDA Architecture Introduction & Overview." [Online]. Available: http://developer.nvidia.com/object/cuda_2_3_downloads.html
6. J. Espenshade, A. Pangborn, G. von Laszewski, D. Roberts, and J. Cavenaugh, "Accelerating partitional algorithms for flow cytometry on GPUs," in Parallel and Distributed Processing with Applications, 2009 IEEE International Symposium on, Aug. 2009, pp. 226–233.
7. D. Anderson, R. Luke, and J. Keller, "Speedup of fuzzy clustering through stream processing on graphics processing units," Fuzzy Systems, IEEE Transactions on, vol. 16, no. 4, pp. 1101–1106, Aug. 2008.
8. N. Kumar, S. Satoor, and I. Buck, "Fast parallel expectation maximization for Gaussian mixture models on GPUs using CUDA," in 11th IEEE International Conference on High Performance Computing and Communications (HPCC '09), 2009, pp. 103–109.
9. A. Harp, "EM of GMMs with GPU acceleration," May 2009. [Online]. Available: http://andrewharp.com/gmmcuda