Scalable Data Clustering with GPUs
Andrew D. Pangborn
Thesis Defense, Rochester Institute of Technology, Computer Engineering Department
Friday, May 14th, 2010

Thesis Objectives
Develop high-performance parallel implementations of data clustering algorithms, leveraging the computational power of GPUs and the CUDA framework
– Make clustering of flow cytometry data sets practical on a single lab machine
Use OpenMP and MPI to distribute work to multiple GPUs in a grid computing or commodity cluster environment

Outline
– Overview of the application domain
– GPU architecture and CUDA
– Parallel implementation
– Results
– Multi-GPU architecture
– More results

Data Clustering
A form of unsupervised learning that groups similar objects into relatively homogeneous sets called clusters
How do we define similarity between objects?
– It depends on the application domain and the implementation
Not to be confused with data classification, which assigns objects to predefined classes

Data Clustering Algorithms Clustering Taxonomy from “Data Clustering: A Review”, by Jain et al. [1]

Example: Iris Flower Data

Flow Cytometry
Technology used by biologists and immunologists to study the physical and chemical characteristics of cells
Example: measuring T lymphocyte counts to monitor HIV infection [2]

Flow Cytometry
– Cells in a fluid pass through a laser
– Physical characteristics are measured with scatter data
– Fluorescently labeled antibodies are added to measure other aspects of the cells

Flow Cytometer

Flow Cytometry Data Sets
Multiple measurements (dimensions) for each event
– Upwards of 6 scatter dimensions and 18 colors per experiment
On the order of 10^5 – 10^6 events
~24 million values that must be clustered, with lots of potential clusters
Example: 10^6 events, 100 clusters, 24 dimensions
– C-means: O(NMD) = 2.4 x 10^9
– Expectation Maximization: O(NMD^2) = 5.7 x 10^10

Parallel Processing
– Clustering can take many hours on a single CPU
– Data growth is accelerating
– Performance gains of single-threaded applications are slowing down
– Fortunately, many data clustering algorithms lend themselves naturally to parallel processing

Multi-core
Current trends:
– Adding more cores
– More SIMD: SSE3/AVX
– Application-specific extensions: VT-x, AES-NI
– Point-to-point interconnects, higher memory bandwidths

GPU Architecture Trends
[Figure: CPUs and GPUs converging from fixed-function and multi-threaded/multi-core designs toward fully programmable many-core architectures (Intel Larrabee, NVIDIA CUDA); based on an Intel Larrabee presentation at SuperComputing 2009]

GPU vs. CPU Peak Performance

Tesla GPU Architecture

Tesla Cores

GPGPU
General-Purpose computing on Graphics Processing Units
Past:
– Programmable shader languages: Cg, GLSL, HLSL
– Textures used to store data
Present:
– Multiple frameworks using traditional general-purpose systems and high-level languages

CUDA: Software Stack Image from [5]

CUDA: Program Flow
The host (CPU and main memory) and device (GPU cores and device memory) communicate over PCI-Express:
– Application start
– Search for CUDA devices
– Load data on host
– Allocate device memory
– Copy data to device
– Launch device kernels to process data
– Copy results from device to host memory
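A minimal host-side sketch of this flow in CUDA C (illustrative only: the kernel, sizes, and data are hypothetical stand-ins, and error checking is omitted):

#include <cuda_runtime.h>
#include <stdlib.h>

// Hypothetical placeholder kernel: doubles each element in place.
__global__ void processData(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main(void) {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);       // search for CUDA devices
    if (deviceCount == 0) return 1;
    cudaSetDevice(0);

    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float* h_data = (float*)malloc(bytes);  // load data on host
    for (int i = 0; i < n; i++) h_data[i] = (float)i;

    float* d_data = NULL;
    cudaMalloc((void**)&d_data, bytes);     // allocate device memory
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);  // copy to device

    processData<<<n / 512, 512>>>(d_data, n);  // launch device kernel

    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);  // copy results back
    cudaFree(d_data);
    free(h_data);
    return 0;
}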

CUDA: Streaming Multiprocessors Image from [3]

CUDA: Thread Model
Kernel
– A device function invoked by the host computer
– Launches a grid with multiple blocks, and multiple threads per block
Blocks
– Independent tasks comprised of multiple threads
– No synchronization between blocks
SIMT: Single-Instruction Multiple-Thread
– Multiple threads executing the same instruction on different data (SIMD), able to diverge if necessary
Image from [3]
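A small sketch of how a kernel addresses this hierarchy (block and grid dimensions are illustrative, not taken from the thesis):

// Each thread derives its own position from block and thread indices.
__global__ void gridExample(float* out, int N) {
    int n = blockIdx.x * blockDim.x + threadIdx.x;  // index within the N events
    int m = blockIdx.y;                             // one row of blocks per cluster
    if (n < N) out[m * N + n] = 0.0f;               // each thread owns one element
}

// Host-side launch of an [N/512 x M] grid with 512 threads per block:
// dim3 grid(N / 512, M);
// gridExample<<<grid, 512>>>(d_out, N);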

CUDA: Memory Model Image from [3]

When is CUDA worthwhile?
– High computational density: worthwhile to transfer data to the separate device
– Both coarse-grained and fine-grained SIMD parallelism: lots of independent tasks (blocks) that do not require frequent synchronization, and lots of individual SIMD threads within each block
– Contiguous memory access patterns
– Frequently/repeatedly used data small enough to fit in shared memory

C-means
– Minimizes the squared error between data points and cluster centers using Euclidean distance
– Alternates between computing membership values and updating cluster centers
– Time complexity: O(N·M·D·I), where N = vectors, M = clusters, D = dimensions per vector, I = number of iterations
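The slide does not reproduce the update equations; in the standard fuzzy c-means formulation with fuzziness parameter m, the two alternating steps for memberships u_{ij} and centers c_j are:

u_{ij} = \left[ \sum_{k=1}^{M} \left( \frac{\lVert x_i - c_j \rVert}{\lVert x_i - c_k \rVert} \right)^{2/(m-1)} \right]^{-1},
\qquad
c_j = \frac{\sum_{i=1}^{N} u_{ij}^{\,m} \, x_i}{\sum_{i=1}^{N} u_{ij}^{\,m}}

Each iteration touches all N·M point-center pairs in all D dimensions, which is where the O(N·M·D·I) total comes from.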

C-means Example

C-means Program Flow
Host: read input data and copy it to the GPU; choose initial centers and copy them to the GPU
Device kernels: Distances >> Memberships >> Center Numerators >> Center Denominators >>
Host: compute the new centers; if error < ε, output results; otherwise copy the updated centers to the GPU and repeat

C-means Distance Kernel
– Inputs: [D x N] data matrix (global memory), [M x D] centers matrix
– Outputs: [M x N] distance matrix
– Kernel grid: [N/512 x M] blocks, 512 threads per block
– All threads in a block share their cluster's [1 x D] center in shared memory
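A hedged sketch of a kernel with this structure (names and the MAX_D bound are hypothetical; the thesis kernel may differ in detail). Because the data matrix is stored [D x N], consecutive threads read consecutive elements of each dimension row:

#define MAX_D 24  // assumes D <= 24, matching the flow cytometry data described earlier

__global__ void distanceKernel(const float* data,     // [D x N], global memory
                               const float* centers,  // [M x D]
                               float* dist,           // [M x N] output
                               int N, int D) {
    __shared__ float center[MAX_D];
    int m = blockIdx.y;                             // cluster handled by this block row
    int n = blockIdx.x * blockDim.x + threadIdx.x;  // event handled by this thread

    // All threads cooperate to stage this cluster's [1 x D] center in shared memory.
    for (int d = threadIdx.x; d < D; d += blockDim.x)
        center[d] = centers[m * D + d];
    __syncthreads();

    if (n < N) {
        float sum = 0.0f;
        for (int d = 0; d < D; d++) {               // data[d*N + n]: coalesced reads
            float diff = data[d * N + n] - center[d];
            sum += diff * diff;
        }
        dist[m * N + n] = sqrtf(sum);
    }
}

// Launch: dim3 grid(N / 512, M); distanceKernel<<<grid, 512>>>(d_data, d_centers, d_dist, N, D);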

C-means Membership Kernel
– Kernel grid: [N/512] blocks, 512 threads per block
– Transforms the [M x N] distance matrix into the [M x N] membership matrix (in place)
– Each thread makes two passes through the distance matrix: first to compute the sum, then to compute each membership
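A sketch of the two-pass, in-place transform, assuming fuzziness m = 2 so the exponent 2/(m-1) is 2 (the actual kernel may parameterize this and guard against zero distances):

__global__ void membershipKernel(float* dist, int N, int M) {
    int n = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per event
    if (n >= N) return;

    // Pass 1: accumulate the normalization sum over all M clusters.
    float sum = 0.0f;
    for (int m = 0; m < M; m++) {
        float d = dist[m * N + n];
        sum += 1.0f / (d * d);
    }
    // Pass 2: overwrite each distance with its membership value.
    for (int m = 0; m < M; m++) {
        float d = dist[m * N + n];
        dist[m * N + n] = (1.0f / (d * d)) / sum;
    }
}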

C-means Centers Kernel
– Kernel grid: [M/4 x D] blocks, 256 threads per block
– Computes the [M x D] centers matrix from the [D x N] data matrix and the [M x N] membership matrix
– 256 threads cycle through the N events; each data value is re-used 4 times
– [4 x 256] partial sums in shared memory are reduced to [4 x 1] with a butterfly sum
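The butterfly sum is the interesting piece. A self-contained sketch of the pattern on a single row of 256 partial sums (simplified to plain summation; the thesis kernel reduces 4 rows of membership-weighted sums at once):

__global__ void butterflySum(const float* in, float* out, int N) {
    __shared__ float sums[256];
    int tid = threadIdx.x;

    // Each of the 256 threads accumulates a strided partial sum over N values.
    float acc = 0.0f;
    for (int i = tid; i < N; i += 256)
        acc += in[i];
    sums[tid] = acc;
    __syncthreads();

    // Butterfly reduction: after log2(256) = 8 steps, every thread holds the total.
    for (int offset = 128; offset > 0; offset >>= 1) {
        float partner = sums[tid ^ offset];
        __syncthreads();          // all reads finish before anyone writes
        sums[tid] += partner;
        __syncthreads();          // all writes visible before the next read
    }
    if (tid == 0) out[0] = sums[0];
}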

Expectation Maximization with a Gaussian Mixture Model
– Data described by a mixture of M Gaussian distributions
– Each Gaussian has 3 parameters
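Spelling out the model in standard notation (the slide does not give the formulas): the three parameters per component are the mixture weight \pi_m, mean vector \mu_m, and covariance matrix \Sigma_m, and the mixture density over D-dimensional events is:

p(x_i) = \sum_{m=1}^{M} \pi_m \, \mathcal{N}(x_i \mid \mu_m, \Sigma_m),
\qquad
\mathcal{N}(x \mid \mu, \Sigma) = \frac{\exp\!\left( -\tfrac{1}{2} (x - \mu)^{T} \Sigma^{-1} (x - \mu) \right)}{(2\pi)^{D/2} \, \lvert \Sigma \rvert^{1/2}}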

E-step
– Compute likelihoods based on current model parameters: O(NMD^2)
– Convert likelihoods into membership values: O(NM)
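In the same notation, the membership (responsibility) of cluster m for event x_i is the normalized weighted likelihood; the D x D quadratic form inside each Gaussian is what makes the likelihood step O(NMD^2):

u_{im} = \frac{\pi_m \, \mathcal{N}(x_i \mid \mu_m, \Sigma_m)}{\sum_{k=1}^{M} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}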

M-step
Update model parameters
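The standard update equations, which line up with the N >>, Means >>, and Covariance >> kernels in the program flow that follows, are:

N_m = \sum_{i=1}^{N} u_{im}, \qquad
\pi_m = \frac{N_m}{N}, \qquad
\mu_m = \frac{1}{N_m} \sum_{i=1}^{N} u_{im} \, x_i, \qquad
\Sigma_m = \frac{1}{N_m} \sum_{i=1}^{N} u_{im} \, (x_i - \mu_m)(x_i - \mu_m)^{T}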

EM Program Flow
Host: read input data and transpose it; copy data to the GPU; initialize models and copy them to the GPU
Device kernels: Likelihoods >> Memberships >> N >> Means >> Covariance >> Constants >>
Host: repeat while Δ likelihood ≥ ε; once converged, if more than the desired number of clusters remain, combine the 2 closest Gaussians and iterate again; otherwise output results

EM: Likelihood Kernel
– Kernel grid: [M x 16] blocks, 512 threads per block
– Each block stages its cluster's mean vector and covariance matrix in shared memory
– Each block reads all dimensions of its 1/16th of the events from the [D x N] data matrix in global memory
– Each block writes 1/16th of one row of the [M x N] likelihood matrix in global memory

EM: Covariance Kernel
– Kernel grid: [M/6 x D(D+1)/2] blocks, 256 threads per block
– Loop unrolled to 6 clusters per block (limited by resource constraints)
– Reads the [D x N] data matrix and the [M x N] membership matrix
– [6 x 256] partial sums in shared memory are reduced to [6 x 1] with a butterfly sum

Performance Tuning: Global Memory Coalescing
– 8000/9000 (compute 1.0/1.1) series devices: all 16 threads in a half-warp must access consecutive elements, with the starting address aligned to 16*sizeof(element)
– GT200 (compute 1.2/1.3) series devices: 32B/64B/128B aligned transactions; minimize the number of transactions by accessing consecutive elements
Images from [4]
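This is why the data matrix is stored [D x N] rather than [N x D]. A small sketch of the difference (the loop over dimensions is illustrative):

__global__ void sumDims(const float* data, float* out, int N, int D) {
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (n >= N) return;
    float sum = 0.0f;
    for (int d = 0; d < D; d++) {
        sum += data[d * N + n];    // [D x N] layout: threads n, n+1, ... read
                                   // consecutive words -> coalesced
        // sum += data[n * D + d]; // [N x D] layout: each thread strides by D
                                   // words -> uncoalesced on 1.0/1.1 devices
    }
    out[n] = sum;
}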

Performance Tuning: Partition Camping
– Global memory has 6 to 8 partitions (64-bit wide DRAM channels), organized into 256-byte wide blocks
– Want an even distribution of accesses across the different channels

Performance Tuning: Occupancy
– Ratio of active warps to the maximum allowed by a multiprocessor
– Multiple active warps hide register dependencies and global memory access latencies
– The number of warps is restricted by device resources: registers required per thread, shared memory per block, and total number of threads per block
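A worked example of the resource arithmetic, using GT200-era limits and a hypothetical kernel: a compute 1.3 multiprocessor has 16,384 registers and allows at most 32 active warps (1,024 threads). A kernel needing 32 registers per thread with 512-thread blocks consumes 512 x 32 = 16,384 registers per block, so only one block fits per multiprocessor: 16 active warps out of 32, or 50% occupancy. Cutting register use to 16 per thread would let a second block become resident and reach full occupancy.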

Performance Tuning: Block Count
– More small/simple blocks are often better than larger blocks with loops

Performance Tuning: CUBLAS
– CUDA Basic Linear Algebra Subprograms
– SGEMM is a more scalable solution for some of the kernels and makes good use of shared memory
– However, it blocks poorly on highly rectangular matrices

Testing Environments
Oak: a 2-GPU machine at RIT
– Tesla C870: G80 architecture, 128 cores, 76.8 GB/sec
– GTX 260: GT200 architecture, 192 cores, 112 GB/sec
– CUDA 2.3, Ubuntu Server 8.04 LTS, GCC 4.2, OpenMP
Lincoln: a TeraGrid HPC resource
– Cluster located at the National Center for Supercomputing Applications, University of Illinois
– 192 nodes, each with 8 CPU cores and 16 GB of memory, connected with InfiniBand networking
– 92 Tesla S1070 accelerator units, for a total of 368 GPUs (2 per node); each GPU has 240 cores and 102 GB/sec memory bandwidth
– CUDA 2.3, RHEL4, GCC 4.2, OpenMP, MVAPICH2 for MPI

Results – Speedup
[Figures: speedup for C-means (left) and Expectation Maximization (right)]

Results – Overhead

Comparisons to Prior Work: C-means
– Order-of-magnitude improvement over our previous publication [6]
– 1.04x to 4.64x improvement on the data sizes provided in Anderson et al. [7]

Comparisons to Prior Work: Expectation Maximization
– 3.72x to 10.1x improvement over Kumar et al. [8]
– [8] only supports diagonal covariance; the thesis implementation is capable of full covariance
– Andrew Harp's EM implementation [9] did not provide raw execution times, only speedup, reporting 170x on a GTX 285
– Using his CPU source code, we obtained raw CPU times on comparable hardware and computed the speedup of the thesis implementation on a GTX 260: 446x over his CPU reference, effectively at least a 2.6x improvement

Multi-GPU Strategy
A 3-tier parallel hierarchy: MPI across cluster nodes, OpenMP across the GPUs within each node, and CUDA within each GPU

Changes from Single GPU
– Very little impact on the GPU kernel implementations
– Normalization of some results (C-means cluster centers, EM M-step kernels) is done on the host after collecting partial results from each GPU – very low overhead
– The majority of the overhead is the initial distribution of input data and the final collection of event-level membership results to nodes in the cluster
– Asynchronous MPI sends from the host, instead of each node reading the input file from the data store
– Need to transpose the membership matrix and then gather data from each node
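A minimal sketch of the three-tier pattern, assuming two GPUs per node as on Lincoln (the function names and the per-GPU work are hypothetical placeholders, not the thesis code):

#include <mpi.h>
#include <omp.h>
#include <cuda_runtime.h>

// Hypothetical per-GPU work: the real code would launch the clustering kernels.
static void gpuClusterStep(float* events, int count, float* partial) {
    (void)events; (void)count; partial[0] = 0.0f;  // placeholder
}

// Each MPI rank owns one node; one OpenMP thread drives each of its GPUs.
void clusterNode(float* events, int nLocal, int nGpus) {
    float partial[2] = {0.0f, 0.0f}, total[2];
    #pragma omp parallel num_threads(nGpus)
    {
        int g = omp_get_thread_num();
        cudaSetDevice(g);                  // bind this host thread to GPU g
        int chunk = nLocal / nGpus;
        gpuClusterStep(events + g * chunk, chunk, &partial[g]);
    }
    // Combine per-node partial results (e.g., center numerators and
    // denominators) across the cluster.
    MPI_Allreduce(partial, total, 2, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
}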

Multi-GPU Strategy

Multi-GPU Analysis
Fixed problem size analysis
– i.e., Amdahl's Law, strong scaling, true speedup
– Kept the input size and parameters the same while increasing the number of nodes on the Lincoln cluster
Time-constrained analysis (formalized below)
– i.e., Gustafson's Law, weak scaling, scaled speedup
– Problem size changed such that execution time would remain the same under ideal speedup: execution time is proportional to N/p, so the problem size is scaled to N x p
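Stated as formulas, with T(N, p) the runtime for problem size N on p nodes (standard definitions, not taken verbatim from the thesis):

S_{\text{fixed}}(p) = \frac{T(N, 1)}{T(N, p)},
\qquad
S_{\text{scaled}}(p) = \frac{T(pN, 1)}{T(pN, p)}

Under ideal scaled speedup, T(pN, p) = T(N, 1): the wall-clock time stays constant as nodes are added.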

Fixed Problem Size Analysis

Overhead in Parallel Implementation

Time-Constrained Analysis

Synthetic Data Results
[Figures: C-means result (left) and EM Gaussian result (right)]

Flow Cytometry Results
– Combination of mouse blood and human blood cells
– 21 dimensions: 6 scatter, 7 human, 7 mouse, 1 in common
– Attempt to distinguish mouse from human cells with clustering
– Data courtesy of Ernest Wang, Tim Mossman, James Cavenaugh, Iftekhar Naim, and others from the Flowgating group at the University of Rochester Center for Vaccine Biology and Immunology

– 99.3% of the 200,000 cells were properly grouped into mouse or human clusters with a 50/50 mixture
– Reducing the mixture to 10,000 human and 100,000 mouse cells, 98.99% of all cells were still properly grouped
– With only 1% human cells, they start getting harder to distinguish: 62 (6.2%) of the human cells were grouped into predominantly mouse clusters, and 192 mouse cells were grouped into the predominantly human clusters

Conclusions
– Both C-means and Expectation Maximization with Gaussians have abundant data parallelism that maps well to massively parallel many-core GPU architectures: nearly 2 orders of magnitude improvement over single-threaded algorithms on comparable CPUs
– GPUs make clustering of large data sets, like those found in flow cytometry, practical (a few minutes instead of a few hours) on a single lab machine
– GPU co-processors and the CUDA framework can be combined with traditional parallel programming techniques for efficient high-performance computing: over 6000x speedup compared to a single CPU with only 64 server nodes and 128 GPUs
– The parallelization strategies used in this thesis are applicable to other clustering algorithms

Future Work
Apply CUDA, OpenMP, and MPI to other clustering algorithms or other parts of the workflow
– Skewed t mixture model clustering
– Accelerated data preparation, visualization, and statistical inference between data sets
Improvements to the current implementations
– A CUDA SGEMM with better performance on highly rectangular matrices could replace some kernels; try MAGMA and CULAtools, 3rd-party BLAS libraries for GPUs
Investigate performance on new architectures and frameworks
– NVIDIA Fermi and Intel Larrabee architectures
– OpenCL, DirectCompute, and PGI frameworks/compilers
Improvements to the multi-node implementation
– Remove the master-slave paradigm for data distribution and final result collection (currently the root node needs enough memory to hold it all – not scalable to very large data sets)
– Dynamic load balancing for heterogeneous environments
– Use CPU cores along with the GPU for processing portions of the data, instead of idling during kernels

Questions?

References
1. A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: a review," ACM Comput. Surv., vol. 31, no. 3, pp. 264–323, 1999.
2. H. Shapiro, J. Wiley, and W. InterScience, Practical Flow Cytometry. Wiley-Liss, New York.
3. NVIDIA, "NVIDIA CUDA Programming Guide 2.3." [Online].
4. NVIDIA, "NVIDIA CUDA C Programming Best Practices Guide." [Online].
5. NVIDIA, "NVIDIA CUDA Architecture Introduction & Overview." [Online].
6. J. Espenshade, A. Pangborn, G. von Laszewski, D. Roberts, and J. Cavenaugh, "Accelerating partitional algorithms for flow cytometry on GPUs," in Parallel and Distributed Processing with Applications, 2009 IEEE International Symposium on, Aug. 2009, pp. 226–.
7. D. Anderson, R. Luke, and J. Keller, "Speedup of fuzzy clustering through stream processing on graphics processing units," Fuzzy Systems, IEEE Transactions on, vol. 16, no. 4, Aug. 2008.
8. N. Kumar, S. Satoor, and I. Buck, "Fast parallel expectation maximization for Gaussian mixture models on GPUs using CUDA," in 11th IEEE International Conference on High Performance Computing and Communications (HPCC '09), 2009, pp. 103–.
9. A. Harp, "EM of GMMs with GPU acceleration," May 2009. [Online].