Scalable Data Clustering with GPUs. Andrew D. Pangborn. Thesis Defense, Rochester Institute of Technology, Computer Engineering Department. Friday, May 14th, 2010.

Presentation transcript:

Scalable Data Clustering with GPUs. Andrew D. Pangborn. Thesis Defense, Rochester Institute of Technology, Computer Engineering Department. Friday, May 14th, 2010

Data Clustering

Data Clustering Cont.

Example

Flow Cytometry

Flow Cytometry Cont.

Flow Cytometry Data Sets: size of the data, motivation for GPUs / parallel processing

Parallel Computing

Trend toward multi-core, many-core architectures

GPU Architecture Trends

Tesla GPU Architecture

GPGPU

CUDA Software Stack

CUDA Programming Model

CUDA Kernel Grids / Blocks / Threads

CUDA Memory

CUDA Program Flow

C-means

C-means Parallel Implementation

EM with a Gaussian mixture model

EM Parallel Implementation

Performance Tuning: Global Memory Coalescing – 1.0/1.1 vs 1.2/1.3 devices
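The difference between the two device generations can be illustrated with a small counting model. This is a simplified sketch, not the thesis's measurements: it assumes 4-byte elements, a half-warp of 16 threads, and the coalescing rules as commonly described for those compute capabilities. All function names are hypothetical.

```python
HALF_WARP = 16  # threads whose accesses are coalesced together on these devices
WORD = 4        # bytes per element (assumed: float or int)


def half_warp_addresses(base, stride_words):
    """Byte addresses read by the 16 threads for a given base and stride."""
    return [base + k * stride_words * WORD for k in range(HALF_WARP)]


def transactions_cc_1_0(addresses):
    """1.0/1.1 devices (simplified): a single transaction only if thread k
    reads word k of a 64-byte-aligned segment; otherwise one per thread."""
    base = addresses[0]
    aligned = base % (HALF_WARP * WORD) == 0
    in_order = all(a == base + k * WORD for k, a in enumerate(addresses))
    return 1 if aligned and in_order else len(addresses)


def transactions_cc_1_2(addresses):
    """1.2/1.3 devices (simplified): one transaction per distinct
    128-byte segment touched by the half-warp."""
    return len({a // 128 for a in addresses})


if __name__ == "__main__":
    patterns = {
        "contiguous, aligned": half_warp_addresses(0, 1),
        "contiguous, shifted by one word": half_warp_addresses(4, 1),
        "stride of 32 words": half_warp_addresses(0, 32),
    }
    for name, addrs in patterns.items():
        print(name, transactions_cc_1_0(addrs), transactions_cc_1_2(addrs))
```

In this model a misaligned but contiguous read costs 16 transactions on the older devices and only one on the newer ones, while a large stride is expensive on both.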

Performance Tuning: Partition Camping

Performance Tuning: CUBLAS

Multi-GPU Strategy: 3-tier parallel hierarchy (MPI, OpenMP, CUDA)

Multi-GPU Strategy: MapReduce-style data distribution and reduction
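A MapReduce-style update can be sketched in a few lines. This is an illustration under stated assumptions, not the thesis's implementation: it uses 1-D data, a hard nearest-center assignment as a stand-in for the real clustering kernels, and plain Python lists in place of GPUs. All function names are hypothetical.

```python
from functools import reduce


def split(data, parts):
    """Distribute: cut the data into `parts` nearly equal contiguous chunks."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def map_partial(chunk, centers):
    """Map: each worker accumulates [sum, count] per nearest center."""
    partial = [[0.0, 0] for _ in centers]
    for x in chunk:
        k = min(range(len(centers)), key=lambda j: abs(x - centers[j]))
        partial[k][0] += x
        partial[k][1] += 1
    return partial


def combine(a, b):
    """Reduce: add two workers' partial results element-wise."""
    return [[sa + sb, ca + cb] for (sa, ca), (sb, cb) in zip(a, b)]


def update_centers(data, centers, workers):
    """One iteration: distribute, map on each chunk, reduce, then average."""
    partials = [map_partial(chunk, centers) for chunk in split(data, workers)]
    total = reduce(combine, partials)
    # Keep the old center if no point was assigned to it.
    return [s / n if n else c for (s, n), c in zip(total, centers)]


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
    print(update_centers(data, [0.0, 9.0], workers=3))
```

Because only the small per-cluster sums and counts cross worker boundaries, the result does not depend on how many workers the data is split across.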

Multi-GPU Implementation: very little impact on GPU kernel implementations, just their inputs / grid dimensions. Discuss host-code changes.

Data Distribution: asynchronous MPI sends from the host instead of each node reading the input file from the data store

Results - Kernels: speedup figures

Results - Kernels: speedup figures

Results – Overhead: time breakdown for I/O, GPU memcpy, etc.

Multi-GPU Results: Amdahl’s Law vs. Gustafson’s Law, i.e. strong vs. weak scaling, fixed problem size vs. fixed time, true speedup vs. scaled speedup
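The two laws named on this slide can be compared numerically. The sketch below is a minimal illustration, not the thesis's analysis: the function names and the 0.9 parallel fraction are assumptions chosen for the example.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Fixed problem size (strong scaling): the serial part caps the speedup.
    `parallel_fraction` is measured on the single-worker run."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / workers)


def gustafson_speedup(parallel_fraction, workers):
    """Fixed time (weak scaling): the parallel work grows with the workers.
    `parallel_fraction` is measured on the parallel run."""
    serial = 1.0 - parallel_fraction
    return serial + parallel_fraction * workers


if __name__ == "__main__":
    # Assumed example value: 90% of the run time is parallelizable.
    for n in (1, 2, 4, 8, 16):
        print(n, round(amdahl_speedup(0.9, n), 2), round(gustafson_speedup(0.9, n), 2))
```

With a fixed problem size the speedup can never exceed 1 / (1 - parallel_fraction), whereas the scaled speedup keeps growing linearly as workers and problem size increase together.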

Fixed Problem Size Analysis

Time-Constrained Analysis

Conclusions

Future Work

Questions?

References