
1 Scalable Data Clustering with GPUs
Andrew D. Pangborn
Thesis Defense
Rochester Institute of Technology, Computer Engineering Department
Friday, May 14th, 2010

2 Intro
– Overview of the application domain
– Trends in computing architecture
– GPU architecture and CUDA
– Parallel implementation

3 Data Clustering
– A form of unsupervised learning that groups similar objects into relatively homogeneous sets called clusters
– How do we define similarity between objects? It depends on the application domain and the implementation
– Not to be confused with data classification, which assigns objects to predefined classes

4 Data Clustering Algorithms
Clustering taxonomy from "Data Clustering: A Review" by Jain et al. [1]

5 Example: Iris Flower Data

6 Flow Cytometry
– Technology used by biologists and immunologists to study the physical and chemical characteristics of cells
– Example: measuring T lymphocyte counts to monitor HIV infection [2]

7 Flow Cytometry
– Cells in a fluid pass through a laser
– Physical characteristics are measured with scatter data
– Fluorescently labeled antibodies are added to measure other aspects of the cells

8 Flow Cytometer

9 Flow Cytometry Data Sets
– Multiple measurements (dimensions) for each event: upwards of 6 scatter dimensions and 18 colors per experiment
– On the order of 10⁵–10⁶ events
– ~24 million values that must be clustered (e.g., 10⁶ events × 24 dimensions)
– Many potential clusters
– Clustering can take many hours on a CPU

10 Parallel Computing
– Fortunately, many data clustering algorithms lend themselves naturally to parallel processing
– Typically run on clusters of commodity CPUs
– Common APIs: MPI (Message Passing Interface) and OpenMP (Open Multi-Processing)

11 Multi-core
Current trends:
– Adding more cores
– Application-specific extensions (SSE3/AVX, VT-x, AES-NI)
– Point-to-point interconnects and higher memory bandwidth

12 GPU Architecture Trends
[Figure based on the Intel Larrabee presentation at SuperComputing 2009: processors have evolved from fixed-function, through partially programmable, to fully programmable designs, and from multi-threaded to multi-core to many-core architectures (Intel Larrabee, NVIDIA CUDA).]

13 Tesla GPU Architecture

14 Tesla Cores

15 GPGPU
General-Purpose computing on Graphics Processing Units
– Past: programmable shader languages (Cg, GLSL, HLSL); textures used to store data
– Present: multiple frameworks using traditional general-purpose systems and high-level languages

16 CUDA: Software Stack

17 CUDA: Streaming Multiprocessors

18 CUDA: Thread Model
– Kernel: a device function invoked by the host computer; launches a grid with multiple blocks, and multiple threads per block
– Blocks: independent tasks comprised of multiple threads; no synchronization between blocks
– SIMT (Single-Instruction Multiple-Thread): multiple threads execute the same instruction on different data (SIMD), and can diverge if necessary
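
A minimal SIMT sketch (hypothetical kernel and names, not the thesis code) of how a grid of blocks and threads maps onto an array of data:

```cuda
#include <cuda_runtime.h>

// Each thread handles one element; threads in a warp execute the same
// instruction, and the bounds check is where divergence can occur.
__global__ void scaleKernel(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                     // last block may be partially full
        data[i] *= factor;
}

// Host side: launch a grid of independent blocks, 256 threads per block.
void launchScale(float *d_data, int n)
{
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;      // ceil(n / threads)
    scaleKernel<<<blocks, threads>>>(d_data, 2.0f, n);
}
```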

19 CUDA: Memory Model

20 CUDA: Program Flow
– Application starts
– Search for CUDA devices
– Load data on the host
– Allocate device memory
– Copy data to the device
– Launch device kernels to process the data
– Copy results from device memory back to host memory
Data moves between CPU main memory and GPU device memory over PCI-Express.
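
The same flow in code, as a hedged sketch (the kernel and sizes are placeholders, and error checking is omitted for brevity):

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void processKernel(float *d, int n)   // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * d[i];
}

int main(void)
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);            // search for CUDA devices
    if (deviceCount == 0) return 1;

    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_data = (float *)malloc(bytes);      // load data on the host
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    float *d_data;
    cudaMalloc(&d_data, bytes);                  // allocate device memory
    cudaMemcpy(d_data, h_data, bytes,            // copy data to the device
               cudaMemcpyHostToDevice);          // (over PCI-Express)

    int threads = 256;
    processKernel<<<(n + threads - 1) / threads, threads>>>(d_data, n);

    cudaMemcpy(h_data, d_data, bytes,            // copy results back to host
               cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    free(h_data);
    return 0;
}
```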

21 When is CUDA worthwhile?
– High computational density: worthwhile to transfer data to a separate device
– Both coarse-grained and fine-grained SIMD parallelism: lots of independent tasks (blocks) that don't require frequent synchronization map to different multiprocessors on the GPU; within each block, lots of individual SIMD threads
– Contiguous memory access patterns
– Frequently/repeatedly used data small enough to fit in shared memory

22 C-means
– Minimizes the squared error between data points and cluster centers using Euclidean distance
– Alternates between computing membership values and updating cluster centers
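
For the membership step, the fuzzy c-means update is u_ij = 1 / Σ_k (d_ij / d_kj)^(2/(m−1)). Below is a hedged sketch (not the thesis kernel; names are hypothetical) with fuzziness m = 2, one thread per event, and a precomputed M × N distance matrix stored cluster-major so consecutive threads read consecutive addresses:

```cuda
// dist: M x N Euclidean distances from each of M centers to N events.
// u:    M x N membership values (output). Assumes no distance is zero.
__global__ void membershipKernel(const float *dist, float *u, int N, int M)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // event index
    if (j >= N) return;
    for (int i = 0; i < M; ++i) {
        float sum = 0.0f;
        for (int k = 0; k < M; ++k) {
            float r = dist[i * N + j] / dist[k * N + j];
            sum += r * r;                            // (d_ij / d_kj)^2, m = 2
        }
        u[i * N + j] = 1.0f / sum;   // membership of event j in cluster i
    }
}
```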

23 C-means Parallel Implementation

24

25 EM with a Gaussian Mixture Model
– Data described by a mixture of M Gaussian distributions
– Each Gaussian has 3 parameters: a mixture weight, a mean vector, and a covariance matrix

26 E-step
– Compute likelihoods based on the current model parameters
– Convert likelihoods into membership values
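
A hedged sketch of the E-step (diagonal covariances are assumed here for brevity; a full-covariance implementation would precompute each inverse covariance and log-determinant). One thread per event accumulates the log-likelihood under each Gaussian; normalizing these across the M components then yields the membership values:

```cuda
__global__ void eStepKernel(const float *x,      // N x D events, row-major
                            const float *mean,   // M x D means
                            const float *var,    // M x D diagonal variances
                            const float *logPi,  // M log mixture weights
                            float *logL,         // M x N log-likelihoods (out)
                            int N, int D, int M)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // event index
    if (j >= N) return;
    for (int m = 0; m < M; ++m) {
        float s = logPi[m];
        for (int d = 0; d < D; ++d) {
            float diff = x[j * D + d] - mean[m * D + d];
            s -= 0.5f * (diff * diff / var[m * D + d]
                         + logf(2.0f * 3.14159265f * var[m * D + d]));
        }
        logL[m * N + j] = s;   // normalized later into membership values
    }
}
```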

27 M-step Update model parameters
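
For example, each mean becomes the membership-weighted average of the events, μ_m = Σ_j w_mj · x_j / Σ_j w_mj. A hedged sketch with hypothetical names (a real implementation would use per-block parallel reductions rather than a serial loop over events):

```cuda
__global__ void mStepMeans(const float *x,   // N x D events, row-major
                           const float *w,   // M x N membership values
                           float *mean,      // M x D means (output)
                           int N, int D, int M)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= M * D) return;                // one thread per (cluster, dim)
    int m = idx / D, d = idx % D;
    float num = 0.0f, den = 0.0f;
    for (int j = 0; j < N; ++j) {
        num += w[m * N + j] * x[j * D + d];  // weighted sum of coordinates
        den += w[m * N + j];                 // total membership in cluster m
    }
    mean[m * D + d] = num / den;
}
```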

28 EM Parallel Implementation

29

30 Performance Tuning: Global Memory Coalescing
– Compute capability 1.0/1.1 vs. 1.2/1.3 devices
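
To illustrate the idea (hypothetical kernel, not the thesis code): with events stored dimension-major (a D × N layout), consecutive threads touch consecutive addresses and the accesses coalesce; an N × D layout would make each thread stride D floats apart, which 1.0/1.1 devices serialize into many memory transactions:

```cuda
__global__ void sumDims(const float *data,  // D x N, dimension-major
                        float *out, int N, int D)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // event index
    if (j >= N) return;
    float sum = 0.0f;
    for (int d = 0; d < D; ++d)
        sum += data[d * N + j];  // threads in a warp read adjacent floats
    out[j] = sum;
}
```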

31 Performance Tuning Partition Camping

32 Performance Tuning CUBLAS

33 Multi-GPU Strategy
3-tier parallel hierarchy

34 Multi-GPU Strategy

35 Multi-GPU Implementation
– Very little impact on the GPU kernel implementations; only their inputs and grid dimensions change
– Host-code changes discussed below

36 Data Distribution
Asynchronous MPI sends from the host instead of each node reading the input file from the data store
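
A hedged sketch of that distribution step (hypothetical names and layout): rank 0 holds the full data set and posts non-blocking sends of each remote rank's slice, so the input file is read from the data store only once:

```cuda
#include <mpi.h>

void distribute(float *data, int eventsPerNode, int D,
                int rank, int numRanks)
{
    if (rank == 0) {
        MPI_Request reqs[64];                 // assumes numRanks <= 65
        for (int r = 1; r < numRanks; ++r)
            MPI_Isend(data + (size_t)r * eventsPerNode * D,  // rank r's slice
                      eventsPerNode * D, MPI_FLOAT, r, 0,
                      MPI_COMM_WORLD, &reqs[r - 1]);
        MPI_Waitall(numRanks - 1, reqs, MPI_STATUSES_IGNORE);
    } else {
        MPI_Recv(data, eventsPerNode * D, MPI_FLOAT, 0, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}
```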

37 Results - Kernels Speedup figures

38 Results - Kernels Speedup figures

39 Results – Overhead
Time breakdown for I/O, GPU memory copies, etc.

40 Multi-GPU Results
Amdahl's Law vs. Gustafson's Law:
– Strong vs. weak scaling
– Fixed problem size vs. fixed time
– True speedup vs. scaled speedup
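
For reference, with p the parallelizable fraction of the work and N processors, the two laws can be written as:

```latex
% Amdahl's Law (fixed problem size, "true" speedup)
S_{\mathrm{Amdahl}}(N) = \frac{1}{(1 - p) + p/N}

% Gustafson's Law (fixed time, "scaled" speedup)
S_{\mathrm{Gustafson}}(N) = (1 - p) + pN
```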

41 Fixed Problem Size Analysis

42 Time-Constrained Analysis

43 Conclusions

44 Future Work

45 Questions?

46 References
1. A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: a review," ACM Computing Surveys, vol. 31, no. 3, pp. 264–323, 1999.
2. H. M. Shapiro, Practical Flow Cytometry, 4th ed. New York: Wiley-Liss, 2003.

