Scalable Data Clustering with GPUs Andrew D. Pangborn Thesis Defense Rochester Institute of Technology Computer Engineering Department Friday, May 14th, 2010
Intro – Overview of the application domain – Trends in computing architecture – GPU architecture and CUDA – Parallel implementation
Data Clustering A form of unsupervised learning that groups similar objects into relatively homogeneous sets called clusters. How do we define similarity between objects? – Depends on the application domain and implementation. Not to be confused with data classification, which assigns objects to predefined classes.
Data Clustering Algorithms Clustering Taxonomy from “Data Clustering: A Review”, by Jain et al. [1]
Example: Iris Flower Data
Flow Cytometry Technology used by biologists and immunologists to study the physical and chemical characteristics of cells Example: Measure T lymphocyte counts to monitor HIV infection [2]
Flow Cytometry Cells in a fluid pass through a laser Measure physical characteristics with scatter data Add fluorescently labeled antibodies to measure other aspects of the cells
Flow Cytometer
Flow Cytometry Data Sets Multiple measurements (dimensions) for each event – Upwards of 6 scatter dimensions and 18 colors per experiment On the order of 10^5–10^6 events ~24 million values that must be clustered (e.g., 10^6 events × 24 dimensions) Lots of potential clusters Clustering can take many hours on a CPU
Parallel Computing Fortunately many data clustering algorithms lend themselves naturally to parallel processing Typically with clusters of commodity CPUs Common APIs: – MPI: Message Passing Interface – OpenMP: Open Multi-processing
Multi-core Current trends: – Adding more cores – Application-specific extensions (SSE3/AVX, VT-x, AES-NI) – Point-to-point interconnects, higher memory bandwidths
GPU Architecture Trends [Figure based on Intel Larrabee presentation at SuperComputing 2009: CPU and GPU designs placed on a spectrum from fixed function through partially programmable to fully programmable, and from multi-threaded through multi-core to many-core, e.g., Intel Larrabee and NVIDIA CUDA GPUs]
Tesla GPU Architecture
Tesla Cores
GPGPU General-Purpose computing on Graphics Processing Units Past – Programmable shader languages: Cg, GLSL, HLSL – Use textures to store data Present – Multiple frameworks using traditional general-purpose systems and high-level languages
CUDA: Software Stack
CUDA: Streaming Multiprocessors
CUDA: Thread Model Kernel – A device function invoked by the host computer – Launches a grid with multiple blocks, and multiple threads per block Blocks – Independent tasks comprised of multiple threads – no synchronization between blocks SIMT: Single-Instruction Multiple-Thread – Multiple threads execute the same instruction on different data (SIMD) and can diverge if necessary
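A minimal CUDA C illustration of the thread model (the kernel below is a toy example, not the thesis code): each thread derives a global index from its block and thread IDs and operates on one element; threads that fail the bounds check simply diverge and do nothing.

    // Toy kernel: scale each element of an array by alpha.
    __global__ void scale(float* data, float alpha, int n) {
        // Global index built from block and thread IDs (SIMT model)
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)              // threads past the end of the array diverge here
            data[i] *= alpha;
    }

    // Host side: launch a grid of blocks, each with 256 threads, e.g.
    // scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);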
CUDA: Memory Model
CUDA: Program Flow Application start → search for CUDA devices → load data on host → allocate device memory → copy data to device → launch device kernels to process the data → copy results from device back to host memory [Diagram: CPU and main memory connected to GPU cores and device memory over PCI-Express]
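A minimal host-side sketch of this flow in CUDA C; the kernel, sizes, and variable names are placeholders, not the thesis code.

    #include <cuda_runtime.h>
    #include <stdlib.h>

    // Placeholder kernel standing in for the clustering kernels
    __global__ void process_kernel(float* d, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] = d[i] * d[i];   // stand-in computation
    }

    int main() {
        int n = 1 << 20;
        size_t bytes = n * sizeof(float);
        float *h_data = (float*)malloc(bytes);               // load data on host
        for (int i = 0; i < n; i++) h_data[i] = (float)i;
        float *d_data;
        cudaMalloc((void**)&d_data, bytes);                   // allocate device memory
        cudaMemcpy(d_data, h_data, bytes,
                   cudaMemcpyHostToDevice);                   // copy data to device over PCI-Express
        process_kernel<<<(n + 255) / 256, 256>>>(d_data, n);  // launch device kernel
        cudaMemcpy(h_data, d_data, bytes,
                   cudaMemcpyDeviceToHost);                   // copy results back to host memory
        cudaFree(d_data);
        free(h_data);
        return 0;
    }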
When is CUDA worthwhile? High computational density – makes it worthwhile to transfer data to the separate device. Both coarse-grained and fine-grained SIMD parallelism – Lots of independent tasks (blocks) that don’t require frequent synchronization map to different multiprocessors on the GPU – Within each block, lots of individual SIMD threads. Contiguous memory access patterns. Frequently/repeatedly used data small enough to fit in shared memory.
C-means Minimizes squared error between data points and cluster centers using Euclidean distance. Alternates between computing membership values and updating cluster centers.
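For reference, the standard fuzzy c-means objective and update rules, in notation assumed here rather than transcribed from the slides: N events x_i, C centers c_j, fuzziness parameter m > 1.

    J_m = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^{m} \, \lVert x_i - c_j \rVert^2

    u_{ij} = \left[ \sum_{k=1}^{C} \left( \frac{\lVert x_i - c_j \rVert}{\lVert x_i - c_k \rVert} \right)^{\frac{2}{m-1}} \right]^{-1},
    \qquad
    c_j = \frac{\sum_{i=1}^{N} u_{ij}^{m} \, x_i}{\sum_{i=1}^{N} u_{ij}^{m}}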
C-means Parallel Implementation
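A minimal sketch of how the membership step might map onto CUDA, assuming one thread per event and fuzziness m = 2; the kernel and buffer names are illustrative, not the thesis kernels.

    // One thread per event: compute squared distances to all C centers,
    // then the fuzzy membership values for that event (m = 2 assumed).
    __global__ void membership_kernel(const float* events,   // N x D, row-major
                                      const float* centers,  // C x D, row-major
                                      float* memberships,    // N x C
                                      int N, int D, int C) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= N) return;
        float sum_inv = 0.0f;
        for (int j = 0; j < C; j++) {
            float dist = 0.0f;
            for (int d = 0; d < D; d++) {
                float diff = events[i * D + d] - centers[j * D + d];
                dist += diff * diff;
            }
            memberships[i * C + j] = dist;            // stash squared distance temporarily
            sum_inv += 1.0f / (dist + 1e-30f);        // epsilon guards a zero distance
        }
        // For m = 2: u_ij = (1/d^2_ij) / sum_k (1/d^2_ik)
        for (int j = 0; j < C; j++)
            memberships[i * C + j] =
                (1.0f / (memberships[i * C + j] + 1e-30f)) / sum_inv;
    }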
EM with a Gaussian mixture model Data described by a mixture of M Gaussian distributions. Each Gaussian has 3 parameters.
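In standard notation (assumed here), the mixture density of an event x is

    p(x) = \sum_{m=1}^{M} \pi_m \, \mathcal{N}(x \mid \mu_m, \Sigma_m)

so the three parameters of each component are its mixing weight \pi_m, mean vector \mu_m, and covariance matrix \Sigma_m.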
E-step Compute likelihoods based on current model parameters Convert likelihoods into membership values
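In the usual GMM/EM notation (a reference formulation, not transcribed from the slides), the membership of component m for event x_n is the normalized weighted likelihood:

    r_{nm} = \frac{\pi_m \, \mathcal{N}(x_n \mid \mu_m, \Sigma_m)}{\sum_{k=1}^{M} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}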
M-step Update model parameters
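The corresponding standard M-step updates, in the same notation as above:

    N_m = \sum_{n=1}^{N} r_{nm}, \qquad
    \pi_m = \frac{N_m}{N}, \qquad
    \mu_m = \frac{1}{N_m} \sum_{n=1}^{N} r_{nm} \, x_n, \qquad
    \Sigma_m = \frac{1}{N_m} \sum_{n=1}^{N} r_{nm} \, (x_n - \mu_m)(x_n - \mu_m)^{\mathsf{T}}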
EM Parallel Implementation
Performance Tuning Global Memory Coalescing – compute capability 1.0/1.1 vs. 1.2/1.3 devices
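An illustrative (not thesis-specific) contrast between coalesced and strided access for an N x D event array; on compute capability 1.0/1.1 devices the strided pattern breaks into many separate memory transactions, while 1.2/1.3 devices coalesce more access patterns.

    // Coalesced: dimension-major (D x N) layout, so consecutive threads in a
    // warp read consecutive addresses: events[d * N + i].
    __global__ void coalesced_read(const float* events, float* out, int N, int D) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= N) return;
        float acc = 0.0f;
        for (int d = 0; d < D; d++)
            acc += events[d * N + i];     // warp touches contiguous words
        out[i] = acc;
    }

    // Strided: row-major (N x D) layout, so a warp's addresses are D floats
    // apart and do not coalesce on 1.0/1.1 hardware.
    __global__ void strided_read(const float* events, float* out, int N, int D) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= N) return;
        float acc = 0.0f;
        for (int d = 0; d < D; d++)
            acc += events[i * D + d];
        out[i] = acc;
    }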
Performance Tuning Partition Camping – concurrent accesses concentrated on a subset of the global-memory partitions, serializing memory requests
Performance Tuning CUBLAS
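A minimal sketch of offloading a matrix product to CUBLAS; it uses the handle-based v2 API and placeholder matrices, so it is illustrative rather than the thesis code.

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    // C (m x n) = A (m x k) * B (k x n), all column-major and already on the device.
    void sgemm_example(const float* d_A, const float* d_B, float* d_C,
                       int m, int n, int k) {
        cublasHandle_t handle;
        cublasCreate(&handle);
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    m, n, k, &alpha, d_A, m, d_B, k, &beta, d_C, m);
        cublasDestroy(handle);
    }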
Multi-GPU Strategy 3-tier parallel hierarchy
Multi-GPU Strategy
Multi-GPU Implementation Very little impact on the GPU kernel implementations themselves – only their inputs and grid dimensions change. Most of the changes are in the host code.
Data Distribution Asynchronous MPI sends from host instead of each node reading input file from data store
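A rough sketch of the idea, assuming rank 0 holds the full event array and posts asynchronous sends of each rank's slice; the function and variable names are hypothetical, and MPI_Init is assumed to have been called already.

    #include <mpi.h>
    #include <string.h>

    void distribute_events(const float* all_events, float* my_events,
                           int events_per_node, int dims, int rank, int nprocs) {
        int count = events_per_node * dims;
        if (rank == 0) {
            MPI_Request reqs[64];                    // assumes nprocs <= 64 for this sketch
            for (int r = 1; r < nprocs; r++)         // asynchronous sends of each slice
                MPI_Isend(all_events + (size_t)r * count, count, MPI_FLOAT,
                          r, 0, MPI_COMM_WORLD, &reqs[r]);
            memcpy(my_events, all_events, count * sizeof(float));  // rank 0 keeps first slice
            for (int r = 1; r < nprocs; r++)
                MPI_Wait(&reqs[r], MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(my_events, count, MPI_FLOAT, 0, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }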
Results - Kernels Speedup figures
Results - Kernels Speedup figures
Results – Overhead Time-breakdown for I/O, GPU memcpy, etc
Multi-GPU Results Amdahl’s Law vs. Gustafson’s Law – i.e. Strong vs. Weak Scaling – i.e. Fixed Problem Size vs. Fixed-Time – i.e. True Speedup vs. Scaled Speedup
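For reference, the standard forms of the two laws, with serial fraction s, parallel fraction p = 1 - s, and N processors:

    \text{Amdahl (fixed problem size):}\quad S_{\text{fixed}}(N) = \frac{1}{s + \frac{p}{N}}

    \text{Gustafson (fixed time, scaled problem):}\quad S_{\text{scaled}}(N) = s + pN = N - (N-1)\,s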
Fixed Problem Size Analysis
Time-Constrained Analysis
Conclusions
Future Work
Questions?
References
1. A. K. Jain, M. N. Murty, and P. J. Flynn, “Data clustering: a review,” ACM Comput. Surv., vol. 31, no. 3, pp. 264–323, 1999.
2. H. Shapiro, Practical Flow Cytometry. New York: Wiley-Liss, 2003.