Scalable Data Clustering with GPUs Student: Andrew D. Pangborn 1 Advisors: Dr. Muhammad Shaaban 1, Dr. Gregor von Laszewski 2, Dr. James Cavenaugh 3, Dr. Roy Melton 1 1- RIT 2 – Indiana University 3 – University of Rochester Many scientific and business applications produce massive data sets. This data must often be classified into meaningful subsets in order for data analysts to draw meaningful conclusions from it. Data clustering is the broad field of statistical analysis that groups (classifies) similar objects into relatively homogenous sets called clusters. Data clustering has a history in a wide variety of fields, such as, data mining, machine learning, geology, astronomy, and bioinformatics, to name a few. There has been a myriad of different techniques developed over the past 50 to 60+ years to tackle the problem of clustering data. These algorithms typically have computational complexities that increase combinatorially as the dimensions of the input and the number of classes (clusters) increases. In many fields the amount of data requiring clustering is growing rapidly, yet the performance gains from single CPU cores has declined over the past few years. The computational demands of multivariate clustering grow rapidly and therefore large data sets are very time consuming on a single CPU. Fortunately these techniques lend themselves naturally to large scale parallel processing. To address the computational demands graphics processing units, specifically NVIDIA’s CUDA framework and Tesla architecture, were investigated as a low-cost, high performance solution to a number of clustering algorithms. C-means and Expectation Maximization with Gaussian mixture models were implemented using the CUDA framework. The algorithm implementations use a hybrid of CUDA, OpenMP, and MPI to scale to many GPUs on multiple nodes in a distributed memory cluster environment. This framework is envisioned as part of a larger cloud- based workflow service where scientists can apply multiple algorithms and parameter sweeps to their data sets and quickly receive a thorough set of results that can be further analyzed by experts. Using a cluster of 128 GPUs, a speedup of 6,300 with 72% parallel efficiency was observed compared to a 2.5 GHz Intel Xeon E5420 CPU. The paradigm in general purpose computing over the past few years has been shifting from individual processors getting faster, to getting wider (more threads and cores), and will likely continue in upcoming years. On the other side of the spectrum, graphics processing units (GPUs), which were parallel fixed-function co-processors are becoming increasingly programmable and therefore suitable for more general purpose applications. CPU GPU Fixed Function Fully Programmable Partially Programmable Multi-threadedMulti-coreMany-core Intel Larabee NVIDIA CUDA GPUs provide significantly higher floating point performance (nearly an order of magnitude) and more memory bandwidth than CPUs. Very high performance/cost and performance/watt ratios (becoming increasingly important for data centers due to heat and energy costs) with a GPU. Modern NVIDIA GPUs are many-core systems with multiprocessors capable of general purpose computing applications. The Tesla architecture has up to 30 multiprocessors for a total of 240 processing cores, while NVIDIA’s new Fermi architecture (launched in March, 2010) boasts up to 512 processing cores. Independent tasks within a kernel map to individual multiprocessors. Threads within a multiprocessor map to the processing cores and can communicate via shared memory. These multiprocessors each handle hundreds of concurrent threads executing single-instruction multiple- data (SIMD) operations. Threads can diverge to perform different instructions when necessary, at the cost of throughput. NVIDIA calls this model single-instruction multiple-thread (SIMT), which is more flexible for writing general purpose applications than strict SIMD. CUDA is a framework that enables developers to target GPUs with general purpose algorithms. This research uses C and the CUDA runtime to accelerate two clustering algorithms. CUDA programs consist of code that runs on the host CPU and device kernels that run on the GPU. The host code is written in ANSI C and compiled with gcc. The kernels are written in C for CUDA and then compiled to a hardware independent pseudo-assembly language called PTX by the nvcc compiler. The CUDA driver has a JIT compiler that converts the PTX code to the machine code for the target hardware at runtime and handles communication between the host and device. Multi-Tier Parallelization Hierarchy MPI – Internode communication. One process per server node OpenMP – Intranode communication. One CPU thread per GPU CUDA – Kernel grids map to many-core GPU architecture. Thousands of lightweight threads compute fine-grained parallel tasks The data sets for clustering contain N independent vectors which are evenly distributed to each server node. If the nodes have multiple GPUs, the vectors are further divided amongst multiple host thread+GPU pairs (one OpenMP thread and CUDA context per GPU). After distributing the work, the host thread invokes a CUDA kernel. Within the GPU large independent tasks, such as computing the covariance between different pairs of dimensions are mapped to different GPU multiprocessors. The individual threads then execute fine-grained parallel SIMD operations on the data elements. Results are copied from the GPU back to the host CPU. Reduction occurs across the different threads and server nodes to form a final result. The process repeats for all kernels until the algorithm converges to a final answer. TeraGrid Resources – Lincoln The TeraGrid is a high performance computing infrastructure bringing together resources from 11 different sites across the United States. Lincoln is the name of one machine on the TeraGrid located in the National Center for Supercomputing Applications at the University of Illinois Urbana-Champaign. Lincoln contains 92 NVIDIA Tesla S1070 accelerator units for a total of 368 GPUs with 88,320 cores. This research uses Lincoln as a testing environment for developing parallel data clustering algorithms using multiple GPUs, supported in part by the National Science Foundation under TeraGrid grant number TG-MIP NVIDIA CUDA 3.0 Programming Guide. [Online] available: 2.NVIDIA CUDA Architecture Introduction & Overview. [Online] available: 3.National Center for Supercomputing Applications, “Ncsa intel 64 tesla linux cluster lincoln technical summary,” [Online]. Available: 4.J. Espenshade, A. Pangborn, G. von Laszewski, D. Roberts, and J. Cavenaugh, “Accelerating partitional algorithms for flow cytometry on gpus,” in Parallel and Distributed Processing with Applications, 2009 IEEE International Symposium on, Aug. 2009, pp. 226– Invitrogen, “Fluorescence tutorials: Intro to flow cytometry,” [Online]. Available: The single GPUs provide 1.8 to 2.0 orders of magnitude speedup over an optimized C version running on a single core of a comparable modern Intel CPU (2.5 GHz Xeon E5420 Quad core). This is a large improvement at a cost significantly less than a single desktop machine (the GTX260 currently retails for about $200). The speedup is expected to be even greater compared to a CPU with NVIDIA’s new Fermi GPUs which have more than twice the cores, significantly higher memory bandwidth, and more memory caching capabilities than the Tesla-based cards. Testing was also performed on the Lincoln supercomputer. The algorithm achieved near ideal speedup with 32 GPUs, and 72% parallel efficiency with 128 GPUs for a total speedup over a CPU of 6,300. This allows the clustering of a typical flow cytometry file in about 5 seconds rather than 10 hours. Queuing time on a shared resource such as Lincoln may make it impractical to use for individual FCS files, but it is certainly viable for achieving high throughput with a large number of flow cytometry files in a clinical study. Flow cytometry is a mainstay technology used by biologists and immunologists for analyzing cells suspended in a fluid. It is used in a variety of clinical and biological research areas. The conventional analysis of flow cytometry data is sequential bivariate gating (where an operator manually draws bins around groups of cells in 2D plots), but there has been a surge of research in recent years trying to use automated multivariate clustering algorithms to analyze flow data. Flow cytometry generates large data sets on the order of 10 6 vectors, 24 dimensions, and many possible clusters. Data sets of this size and complexity have huge computational requirements. This makes clustering slow at best, and often simply impractical on a single CPU. Flow cytometry data was the inspiration for the GPU- based acceleration in this research. Pictured below is an overview of a typical flow cytometer [5]. A laser hits cells flowing through a tube one at a time (ideally) and the light is measured by various detectors forming a multidimensional vector for each cell. On the right is a density plot of two dimensions (side light scatter and forward light scatter) of a flow cytometry data file. The above figure is based on an Intel Larabee presentation slide at Supercomputing 09 in Portland, Oregon