Scalable Data Clustering with GPUs
Andrew D. Pangborn
Thesis Defense
Rochester Institute of Technology, Computer Engineering Department
Friday, May 14th, 2010
Data Clustering
Data Clustering Cont.
Example
Flow Cytometry
Flow Cytometry Cont.
Flow Cytometry Data Sets Size of the data, motivation for GPUs / parallel processing
Parallel Computing Trend toward multi-core and many-core architectures
GPU Architecture Trends
Tesla GPU Architecture
GPGPU
CUDA Software Stack
CUDA Programming Model
CUDA Kernel Grids / Blocks / Threads
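To make the grid / block / thread decomposition concrete, here is a minimal CUDA kernel sketch; the names (scale_events, d_data, num_events) are hypothetical, not the thesis code:

// Minimal sketch: each thread handles one event (one array element).
__global__ void scale_events(float* d_data, int num_events, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < num_events)                             // guard the tail block
        d_data[i] *= factor;
}

// Host side: a 1-D grid of 256-thread blocks covering all events.
int threads = 256;
int blocks  = (num_events + threads - 1) / threads;
scale_events<<<blocks, threads>>>(d_data, num_events, 2.0f);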
CUDA Memory
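A short sketch of the main CUDA memory spaces (global, shared, constant); identifiers and sizes are illustrative only:

#define MAX_CLUSTERS 32                         // assumed upper bound

__constant__ float c_centers[MAX_CLUSTERS];     // cached, read-only constant memory

__global__ void stage_tile(const float* d_events) {  // d_events: global memory
    __shared__ float s_tile[256];               // on-chip, visible to one block
    s_tile[threadIdx.x] = d_events[blockIdx.x * blockDim.x + threadIdx.x];
    __syncthreads();                            // tile now visible to all threads
    // ... compute with the fast s_tile and c_centers instead of global memory ...
}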
CUDA Program Flow
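Putting the pieces together, the canonical host-side flow, reusing the hypothetical scale_events kernel above:

int n = 1 << 20;                                  // example problem size
size_t bytes = n * sizeof(float);
float* h_data = (float*)malloc(bytes);            // 1. allocate + fill host input
float* d_data;
cudaMalloc(&d_data, bytes);                       // 2. allocate device memory
cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);   // 3. host -> device
scale_events<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);     // 4. launch kernel
cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);   // 5. device -> host
cudaFree(d_data); free(h_data);                   // 6. clean up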
C-means
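For reference, the standard fuzzy c-means updates that each iteration computes, with N events $x_i$, C clusters, centers $c_j$, memberships $u_{ij}$, and fuzziness parameter $m$:

\[
u_{ij} = \left[ \sum_{k=1}^{C} \left( \frac{\lVert x_i - c_j \rVert}{\lVert x_i - c_k \rVert} \right)^{\frac{2}{m-1}} \right]^{-1},
\qquad
c_j = \frac{\sum_{i=1}^{N} u_{ij}^{m}\, x_i}{\sum_{i=1}^{N} u_{ij}^{m}}
\]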
C-means Parallel Implementation
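A sketch of the natural data-parallel decomposition, one thread per event computing that event's distance to every center (kernel and buffer names are hypothetical):

__global__ void compute_distances(const float* d_events,   // N x D, row-major
                                  const float* d_centers,  // C x D, row-major
                                  float* d_dist,           // N x C output
                                  int N, int D, int C) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per event
    if (i >= N) return;
    for (int j = 0; j < C; ++j) {                   // distance to each center
        float sum = 0.0f;
        for (int d = 0; d < D; ++d) {
            float diff = d_events[i * D + d] - d_centers[j * D + d];
            sum += diff * diff;
        }
        d_dist[i * C + j] = sqrtf(sum);
    }
}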
EM with a Gaussian mixture model
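The E- and M-step equations being parallelized, in their standard form, with responsibilities $\gamma_{ij}$, mixing weights $\pi_j$, means $\mu_j$, and covariances $\Sigma_j$ over $M$ components:

\[
\text{E-step:}\quad
\gamma_{ij} = \frac{\pi_j\, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}{\sum_{k=1}^{M} \pi_k\, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}
\]
\[
\text{M-step:}\quad
N_j = \sum_{i=1}^{N} \gamma_{ij}, \quad
\pi_j = \frac{N_j}{N}, \quad
\mu_j = \frac{1}{N_j} \sum_{i=1}^{N} \gamma_{ij}\, x_i, \quad
\Sigma_j = \frac{1}{N_j} \sum_{i=1}^{N} \gamma_{ij}\,(x_i - \mu_j)(x_i - \mu_j)^{\top}
\]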
EM Parallel Implementation
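The E-step parallelizes like the C-means membership kernel: one thread per event normalizes that event's component likelihoods. A hypothetical sketch, assuming the weighted log-densities are already in d_loglik:

__global__ void e_step(const float* d_loglik,  // N x M: log(pi_j) + log N(x_i | mu_j, Sigma_j)
                       float* d_gamma, int N, int M) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;
    float maxval = -1e30f;                     // log-sum-exp for stability
    for (int j = 0; j < M; ++j)
        maxval = fmaxf(maxval, d_loglik[i * M + j]);
    float denom = 0.0f;
    for (int j = 0; j < M; ++j)
        denom += expf(d_loglik[i * M + j] - maxval);
    for (int j = 0; j < M; ++j)                // responsibilities, rows sum to 1
        d_gamma[i * M + j] = expf(d_loglik[i * M + j] - maxval) / denom;
}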
Performance Tuning Global Memory Coalescing – compute capability 1.0/1.1 vs. 1.2/1.3 devices
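To illustrate coalescing: with a structure-of-arrays layout, consecutive threads touch consecutive addresses. Compute 1.0/1.1 devices split misaligned or strided half-warp accesses into many transactions, while 1.2/1.3 devices coalesce anything that falls within one memory segment. Hypothetical layouts:

// Array-of-structures: neighboring threads read addresses sizeof(Event) apart,
// poorly coalesced (especially on compute 1.0/1.1 parts).
struct Event { float x, y, z, w; };
__global__ void aos_read(const Event* ev, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = ev[i].x;
}

// Structure-of-arrays: a warp reads one contiguous, aligned segment,
// fully coalesced on all devices.
__global__ void soa_read(const float* xs, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = xs[i];
}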
Performance Tuning Partition Camping
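Partition camping is when concurrently active blocks all land in the same global-memory partition. One common mitigation, assumed here rather than taken from the thesis, is to pad the leading dimension so successive rows start in different partitions:

// GT200-class GPUs have 8 partitions, each 256 bytes wide; a row pitch that is
// a multiple of the combined partition stride funnels rows into one partition.
int pad   = 16;                                 // extra floats per row (assumed)
int pitch = num_events + pad;                   // padded leading dimension
float* d_membership;
cudaMalloc(&d_membership, num_clusters * pitch * sizeof(float));
// Index through the padded pitch: d_membership[j * pitch + i]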
Performance Tuning CUBLAS
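The M-step's weighted sums reduce to matrix products that can be handed to CUBLAS. A sketch with the modern v2 API (a 2010 implementation would have used the legacy cublasSgemm interface); buffers are assumed column-major:

#include <cublas_v2.h>

cublasHandle_t handle;
cublasCreate(&handle);
const float alpha = 1.0f, beta = 0.0f;
// sums (M x D) = gamma^T (M x N) * events (N x D)
cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
            M, D, N,
            &alpha, d_gamma,  N,   // gamma:  N x M, lda = N
                    d_events, N,   // events: N x D, ldb = N
            &beta,  d_sums,   M);  // sums:   M x D, ldc = M
cublasDestroy(handle);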
Multi-GPU Strategy 3-tier parallel hierarchy – MPI, OpenMP, CUDA
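A sketch of how the three tiers could be wired together: MPI ranks across nodes, one OpenMP thread per GPU within a node, and CUDA kernels on each GPU (structure inferred from the slide, not verbatim thesis code):

#include <mpi.h>
#include <omp.h>
#include <cuda_runtime.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);                    // tier 1: one rank per node
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int num_gpus;
    cudaGetDeviceCount(&num_gpus);
    #pragma omp parallel num_threads(num_gpus) // tier 2: one host thread per GPU
    {
        cudaSetDevice(omp_get_thread_num());   // bind this thread to its GPU
        // tier 3: launch CUDA kernels on this thread's slice of the data
    }

    MPI_Finalize();
    return 0;
}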
Multi-GPU Strategy MapReduce-style data distribution and reduction
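The reduce half of that pattern might look like the following: each rank forms partial numerators and denominators for the center update over its local events, then an all-reduce combines them so every rank computes identical new centers (buffer names and bounds are hypothetical):

enum { MAX_C = 32, MAX_D = 24 };                // assumed upper bounds
float local_num[MAX_C * MAX_D], global_num[MAX_C * MAX_D];
float local_den[MAX_C],         global_den[MAX_C];
// ... copy this rank's partial sums from its GPUs into local_num / local_den ...

// Element-wise sums across ranks; every rank receives the totals.
MPI_Allreduce(local_num, global_num, C * D, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
MPI_Allreduce(local_den, global_den, C,     MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

// New centers: centers[j * D + d] = global_num[j * D + d] / global_den[j]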
Multi-GPU Implementation Very little impact on the GPU kernel implementations, only their inputs and grid dimensions; discuss host-code changes
Data Distribution Asynchronous MPI sends from the host instead of each node reading the input file from the data store
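A sketch of that distribution step: the root rank reads the file once and posts non-blocking sends of each slice, rather than every node hitting the shared data store (names and slice sizes are hypothetical):

if (rank == 0) {                                 // root reads the input file once
    MPI_Request reqs[num_ranks - 1];
    for (int r = 1; r < num_ranks; ++r)          // ship each rank its slice
        MPI_Isend(h_data + r * slice_len, slice_len, MPI_FLOAT,
                  r, 0, MPI_COMM_WORLD, &reqs[r - 1]);
    MPI_Waitall(num_ranks - 1, reqs, MPI_STATUSES_IGNORE);
} else {                                         // workers receive their slice
    MPI_Recv(h_local, slice_len, MPI_FLOAT, 0, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}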
Results – Kernels Speedup figures
Results – Kernels Cont. Speedup figures
Results – Overhead Time breakdown for I/O, GPU memcpy, etc.
Multi-GPU Results Amdahl's Law vs. Gustafson's Law – i.e. strong vs. weak scaling, fixed problem size vs. fixed time, true speedup vs. scaled speedup
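The two laws compared, with serial fraction $s$ and $P$ processors:

\[
\text{Amdahl (fixed problem size):}\quad S(P) = \frac{1}{s + \frac{1 - s}{P}}
\qquad\quad
\text{Gustafson (fixed time):}\quad S(P) = s + (1 - s)\,P
\]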
Fixed Problem Size Analysis
Time-Constrained Analysis
Conclusions
Future Work
Questions?
References