Thilina Gunarathne, Bimalee Salpitkorala, Arun Chauhan, Geoffrey Fox


Optimizing OpenCL Kernels for Iterative Statistical Applications on GPUs
Thilina Gunarathne, Bimalee Salpitkorala, Arun Chauhan, Geoffrey Fox
{tgunarat,ssalpiti,achauhan,gcf}@cs.indiana.edu
2nd International Workshop on GPUs and Scientific Applications, Galveston Island, TX

Iterative Statistical Applications
- Consist of iterative computation and communication steps
- A growing set of applications: clustering, data mining, machine learning & dimension reduction
- Driven by the data deluge & emerging computational fields
[Figure: iteration cycle of Compute → Communication → Reduce/barrier → New Iteration]

Iterative Statistical Applications
[Figure: iteration cycle of Compute → Communication → Reduce/barrier → New Iteration]
- Data intensive, with large loop-invariant data
- Smaller loop-variant delta between iterations: the result of each iteration is broadcast to all the workers of the next iteration
- High ratio of memory accesses to floating-point operations
The input data is large, loop-invariant, and can be reused across iterations; the loop-variant results are orders of magnitude smaller. The software-controlled memory hierarchy and high memory bandwidth of GPUs allow us to optimize these applications. We restrict ourselves to problem sizes that fit in GPU memory.

Motivation
- Important set of applications
- Increasing power and availability of GPGPU computing
- Cloud computing and iterative MapReduce technologies
- GPGPU computing in clouds
These types of applications are widely used today, and their use cases are growing fast with data analytics workloads. Frameworks such as iterative MapReduce already take advantage of these characteristics. Can we achieve the same improvements for GPGPU programs, and can we use them in a distributed fashion with the MapReduce frameworks mentioned above? (Image from http://aws.amazon.com/ec2/)

Motivation: A sample bioinformatics pipeline
[Figure: Gene Sequences → Pairwise Alignment & Distance Calculation (O(N×N)) → Distance Matrix → Clustering (→ Cluster Indices) and Multi-Dimensional Scaling (→ 3D Coordinates) → Visualization (3D plot)] http://salsahpc.indiana.edu/

Overview
Three iterative statistical kernels implemented using OpenCL:
- KMeans Clustering
- Multi-Dimensional Scaling
- PageRank
Optimized by:
- Reusing loop-invariant data
- Utilizing different memory levels
- Rearranging data storage layouts
- Dividing work between CPU and GPU
In this paper we present our experience implementing these three iterative statistical applications using OpenCL.

OpenCL
- Cross-platform, vendor-neutral, open standard: GPGPU, multi-core CPU, FPGA…
- Supports parallel programming in heterogeneous environments
- Compute kernels: based on C99, the basic unit of executable code
- Work items: a single element of the execution domain, grouped into work groups
- Communication & synchronization within work groups
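To make the kernel and work-item terminology concrete, here is a minimal OpenCL compute kernel sketch (our hypothetical vector-scaling example, not from the paper); each work item processes one element of the execution domain:

    __kernel void scale(__global float *data, const float factor, const int n) {
        int i = get_global_id(0);  // this work item's index in the execution domain
        if (i < n)                 // guard against a padded global size
            data[i] *= factor;     // one element per work item
    }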

OpenCL Memory Hierarchy
[Figure: each compute unit holds work items with their own private memory and a shared local memory; all compute units share global GPU memory and constant memory, which the CPU host also accesses]
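A sketch (ours, not from the paper) of how the four OpenCL address spaces appear in kernel code:

    __kernel void spaces(__global float *in,        // global: visible to all work items
                         __constant float *coeffs,  // constant: read-only, cached
                         __local float *tile) {     // local: shared within one work group
        float acc;                                  // private: per work item
        int lid = get_local_id(0);
        tile[lid] = in[get_global_id(0)];           // stage global data into local memory
        barrier(CLK_LOCAL_MEM_FENCE);               // synchronize the work group
        acc = tile[lid] * coeffs[0];
        in[get_global_id(0)] = acc;
    }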

Environment
NVIDIA Tesla C1060:
- 240 scalar processors
- 4 GB global memory
- 102 GB/sec peak memory bandwidth
- 16 KB shared memory per 8 cores
- CUDA compute capability 1.3
Peak performance:
- 933 GFLOPS single precision with SF
- 622 GFLOPS single precision MAD
- 77.7 GFLOPS double precision
The GPU has 2 instruction issue ports: Port 0 (622 GFLOPS) and Port 1 (311 GFLOPS), which can issue instructions to two Special Function Units (SFUs), each of which can process packed 4-wide vectors. The SFUs perform transcendental operations like sin and cos, or single-precision multiplies (like the Intel SSE instruction MULPS). http://www.beyond3d.com/content/articles/77/1

KMeans Clustering
- Partitions a given data set into disjoint clusters
- Each iteration: a cluster assignment step, then a centroid update step
- Flops per work item: 3DM + M, where D is the number of dimensions and M the number of centroids
http://www.aishack.in/2010/07/k-means-clustering/
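A minimal sketch of the cluster-assignment step (our illustration, not the paper's exact kernel), with one work item per data point; the 3DM + M flop count shows up as three operations per (centroid, dimension) pair plus one comparison per centroid:

    __kernel void assign(__global const float *points,    // N x D data points, row major
                         __global const float *centroids, // M x D centroids, row major
                         __global int *membership,        // output: nearest centroid per point
                         const int N, const int D, const int M) {
        int i = get_global_id(0);            // one work item per data point
        if (i >= N) return;
        float best = FLT_MAX;
        int bestIdx = 0;
        for (int m = 0; m < M; m++) {
            float dist = 0.0f;
            for (int d = 0; d < D; d++) {
                float diff = points[i * D + d] - centroids[m * D + d]; // 1 flop
                dist += diff * diff;                                   // 2 flops
            }
            if (dist < best) { best = dist; bestIdx = m; }             // 1 comparison
        }
        membership[i] = bestIdx;
    }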

Re-using loop-invariant data
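A minimal host-side sketch of this pattern (our illustration, assuming the standard OpenCL C API; clSetKernelArg calls, globalSize/localSize setup, and error handling omitted): the large loop-invariant input is copied to the device once, while only the small loop-variant data crosses the bus each iteration.

    // Loop-invariant data points: copied to the GPU once, before the loop.
    cl_mem dataBuf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                    n * d * sizeof(float), points, &err);
    cl_mem centBuf = clCreateBuffer(ctx, CL_MEM_READ_ONLY, m * d * sizeof(float), NULL, &err);
    cl_mem memberBuf = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(int), NULL, &err);

    for (int iter = 0; iter < maxIter; iter++) {
        // Only the small loop-variant centroids are transferred each iteration.
        clEnqueueWriteBuffer(queue, centBuf, CL_TRUE, 0, m * d * sizeof(float),
                             centroids, 0, NULL, NULL);
        clEnqueueNDRangeKernel(queue, assignKernel, 1, NULL, &globalSize, &localSize,
                               0, NULL, NULL);
        clEnqueueReadBuffer(queue, memberBuf, CL_TRUE, 0, n * sizeof(int),
                            membership, 0, NULL, NULL);
        // Hypothetical helper: recompute centroids from the new assignments.
        update_centroids(points, membership, centroids, n, d, m);
    }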

KMeans Clustering Optimizations: Naïve (with data reuse)

KMeans Clustering Optimizations: Data points copied to local memory

KMeans Clustering Optimizations: Cluster centroid points copied to local memory

KMeans Clustering Optimizations: Local-memory data points in column-major order
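A sketch combining these optimizations (our reconstruction, not the paper's code): the work group cooperatively stages the centroids into __local memory, and its tile of data points is laid out column major so that consecutive work items hit consecutive local-memory banks:

    __kernel void assign_opt(__global const float *points,    // N x D, row major in global memory
                             __global const float *centroids, // M x D
                             __global int *membership,
                             __local float *localCent,        // M x D centroid staging area
                             __local float *localPts,         // work group's points, column major
                             const int N, const int D, const int M) {
        int gid = get_global_id(0);
        int lid = get_local_id(0);
        int lsz = get_local_size(0);

        // Cooperatively copy all centroids into local memory.
        for (int j = lid; j < M * D; j += lsz)
            localCent[j] = centroids[j];

        // Store this work group's points column major: localPts[d * lsz + lid]
        // holds dimension d of work item lid's point.
        for (int d = 0; d < D; d++)
            localPts[d * lsz + lid] = (gid < N) ? points[gid * D + d] : 0.0f;
        barrier(CLK_LOCAL_MEM_FENCE);   // every work item reaches the barrier

        if (gid >= N) return;
        float best = FLT_MAX;
        int bestIdx = 0;
        for (int m = 0; m < M; m++) {
            float dist = 0.0f;
            for (int d = 0; d < D; d++) {
                float diff = localPts[d * lsz + lid] - localCent[m * D + d];
                dist += diff * diff;
            }
            if (dist < best) { best = dist; bestIdx = m; }
        }
        membership[gid] = bestIdx;
    }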

KMeans Clustering Performance: Varying number of clusters (centroids)

KMeans Clustering Performance: Varying number of dimensions

KMeans Clustering Performance: Increasing number of iterations

KMeans Clustering Overhead

Multi-Dimensional Scaling
- Maps a data set in a high-dimensional space to a data set in a lower-dimensional space
- Uses an N×N dissimilarity matrix as the input
- Output is usually in 3D (N×3) or 2D (N×2) space
- Flops per work item: 8DN + 7N + 3D + 1, where D is the target dimension and N the number of data points
- SMACOF MDS algorithm: Scaling by MAjorizing a COmplicated Function
http://salsahpc.indiana.edu/
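A simplified per-point sketch of the SMACOF update, x_i ← (1/N) Σ_{j≠i} (δ_ij / d_ij)(x_i − x_j); this is our reconstruction under the usual unweighted SMACOF formulation, not the paper's exact kernel. The N×N dissimilarity matrix delta is the loop-invariant input:

    __kernel void smacof_update(__global const float *delta, // N x N dissimilarities (loop-invariant)
                                __global const float *X,     // current N x D embedding
                                __global float *Xnew,        // updated embedding
                                const int N, const int D) {
        int i = get_global_id(0);            // one work item per data point
        if (i >= N) return;
        float acc[3] = {0.0f, 0.0f, 0.0f};   // assumes target dimension D <= 3
        for (int j = 0; j < N; j++) {
            if (j == i) continue;
            float dist = 0.0f;
            for (int d = 0; d < D; d++) {
                float diff = X[i * D + d] - X[j * D + d];
                dist += diff * diff;
            }
            dist = sqrt(dist);
            float b = (dist > 0.0f) ? delta[i * N + j] / dist : 0.0f;
            for (int d = 0; d < D; d++)
                acc[d] += b * (X[i * D + d] - X[j * D + d]);
        }
        for (int d = 0; d < D; d++)
            Xnew[i * D + d] = acc[d] / (float)N;
    }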

MDS Optimizations: Re-using loop-invariant data

MDS Optimizations: Naïve (with loop-invariant data reuse)

MDS Performance: Increasing number of iterations

MDS Overhead

PageRank
- Analyzes link information to measure the relative importance of web pages
- Sparse matrix-vector multiplication
- The web graph is very sparse, with a power-law degree distribution

Sparse Matrix Representations: ELLPACK and Compressed Sparse Row (CSR)
http://www.nvidia.com/docs/IO/66889/nvr-2008-004.pdf
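For illustration, a scalar CSR sparse matrix-vector product kernel in the style of the cited NVIDIA report (our sketch, not the paper's implementation); in PageRank, x would hold the current rank vector, with the damping/teleport terms omitted here:

    __kernel void spmv_csr(__global const int *rowPtr,  // N+1 row offsets into colIdx/vals
                           __global const int *colIdx,  // column index of each nonzero
                           __global const float *vals,  // nonzero values
                           __global const float *x,     // input vector (current ranks)
                           __global float *y,           // output vector (updated ranks)
                           const int N) {
        int row = get_global_id(0);          // one work item per matrix row
        if (row >= N) return;
        float dot = 0.0f;
        for (int k = rowPtr[row]; k < rowPtr[row + 1]; k++)
            dot += vals[k] * x[colIdx[k]];   // irregular, data-dependent gather from x
        y[row] = dot;
    }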

PageRank implementations

Lessons
- Reusing loop-invariant data
- Leveraging local memory
- Optimizing data layout
- Sharing work between CPU & GPU

OpenCL Experience
- Flexible programming environment
- Support for work-group-level synchronization primitives
- Lack of debugging support
- Lack of dynamic memory allocation
- More a compilation target than a user programming environment?

Future Work
- Extending the kernels to distributed environments
- Comparing with CUDA implementations
- Exploring more aggressive CPU/GPU sharing
- Data reuse across the pipeline
- Studying more application kernels, to identify common patterns or high-level constructs

Acknowledgements
This work was started as a class project for CSCI-B649: Parallel Architectures (spring 2010) at the IU School of Informatics and Computing. Thilina was supported by National Institutes of Health grant 5 RC2 HG005806-02. We thank Seung-Hee Bae, Bingjing Zhang, Li Hui, and the SALSA group (http://salsahpc.indiana.edu/) for algorithmic insights.

Questions

Thank You!

KMeans Clustering Optimizations: Data in global memory coalesced