Paragon: Collaborative Speculative Loop Execution on GPU and CPU
Mehrzad Samadi (1), Amir Hormati (2), Janghaeng Lee (1), and Scott Mahlke (1)
(1) University of Michigan - Ann Arbor, Electrical Engineering and Computer Science
(2) Microsoft Research

Amdahl's Law
A GPGPU kernel may achieve up to a 100x speedup, but...
[Figure: an application in which only 50% of the execution is GPU-executable; the rest has no GPU utilization]
Even a 1000x speedup on the GPU-executable half does NOT bring more than 2x improvement in overall execution time.
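For reference, this is the standard Amdahl's Law bound, using the 50% GPU-executable fraction from the figure:

    Overall speedup = 1 / ((1 - p) + p / s)
                    = 1 / (0.5 + 0.5 / 1000)     with p = 0.5, s = 1000
                    ≈ 1.998x

Even with an infinitely fast GPU, the bound is 1 / (1 - p) = 2x.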

General-Purpose Computing on GPUs
GPU execution is effectively limited to code with:
– Massive data parallelism
– Linear array accesses
– NO indirect array accesses
– NO pointers
This leaves GPUs underutilized; they are not truly general purpose.
How can GPUs be made more GENERAL?

Motivation – More Generalization
Goal: reduce the sections with NO GPU utilization, i.e., loops with:
– Non-linear array accesses
– Indirect array accesses
– Array accesses through pointers
Such loops are difficult for programmers to verify by hand because of possible loop-carried dependencies.

Non-linear array access:
    for (y = 0; y < ny; y++)
      for (x = 0; x < nx; x++) {
        xr = x % squaresize[XUP];
        yr = y % squaresize[YUP];
        i = xr + yr;
        lattice[i].x = x;
        lattice[i].y = y;
      }

Indirect array access:
    for (i = 1; i < m; i++)
      for (j = iaL[i]; j < iaL[i+1] - 1; j++)
        x[i] = x[i] - aL[j] * x[jaL[j]];

Array access through pointers:
    for (int i = 0; i < n; i++) {
      *c = *a + *b;
      a++; b++; c++;
    }

Paragon Execution (no conflict)
[Timeline figure: sequential code runs on the CPU; Loop 1, a DO-ALL loop, runs on the GPU; Loop 2, a possibly-parallel loop, runs speculatively on the GPU while the CPU simultaneously executes its sequential version; a conflict check follows the speculative kernel, and since no conflict is found the GPU result is committed and execution proceeds to the remaining sequential code and Loop 3 on the CPU]

Paragon Execution with Conflict
[Timeline figure: as before, Loop 2 runs speculatively on the GPU while the CPU runs it sequentially; this time the conflict check detects a conflict, so the speculative GPU result is discarded and execution continues from the CPU's sequential result; the only cost of mis-speculation is the wasted GPU work]
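A minimal host-side sketch of this execution model, assuming CUDA plus a standard C++ thread (the kernel body, buffer names, and the host-side conflict scan are illustrative stand-ins, not Paragon's actual generated code): the GPU runs the speculative version with write logging while the CPU runs the safe sequential version, and the conflict check decides which result to keep.

    #include <cuda_runtime.h>
    #include <cstdio>
    #include <functional>
    #include <thread>
    #include <vector>

    __global__ void speculative_loop(float* d_C, int* d_wr_log, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) {
        d_C[i] = 2.0f * i;              // stand-in body for the possibly-parallel loop
        atomicAdd(&d_wr_log[i], 1);     // log the store for the later conflict check
      }
    }

    static void sequential_loop(std::vector<float>& C) {
      for (size_t i = 0; i < C.size(); ++i)
        C[i] = 2.0f * i;                // safe sequential version run by the CPU
    }

    int main() {
      const int n = 1 << 20;
      std::vector<float> h_cpu(n), h_result(n);
      float* d_C = nullptr;
      int* d_wr_log = nullptr;
      cudaMalloc(&d_C, n * sizeof(float));
      cudaMalloc(&d_wr_log, n * sizeof(int));
      cudaMemset(d_wr_log, 0, n * sizeof(int));

      // GPU runs the speculative version while the CPU runs the sequential version.
      speculative_loop<<<(n + 255) / 256, 256>>>(d_C, d_wr_log, n);
      std::thread cpu_worker(sequential_loop, std::ref(h_cpu));
      cudaDeviceSynchronize();
      cpu_worker.join();                // Paragon would instead cancel whichever copy loses

      // Simplified host-side conflict check (Paragon runs the check as a GPU kernel).
      std::vector<int> wr_log(n);
      cudaMemcpy(wr_log.data(), d_wr_log, n * sizeof(int), cudaMemcpyDeviceToHost);
      bool conflict = false;
      for (int w : wr_log)
        if (w > 1) { conflict = true; break; }

      if (!conflict)
        cudaMemcpy(h_result.data(), d_C, n * sizeof(float), cudaMemcpyDeviceToHost);
      else
        h_result = h_cpu;               // mis-speculation: discard GPU work, keep CPU result

      std::printf("%s\n", conflict ? "conflict: CPU result kept"
                                   : "no conflict: GPU result committed");
      cudaFree(d_C);
      cudaFree(d_wr_log);
      return 0;
    }

In the real system the conflict check itself runs as a GPU kernel (later slides), and the losing copy of the loop is cancelled rather than joined.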

Paragon Process Flow
[Flow diagram: sequential code is the input; offline compilation performs loop classification and instrumentation and emits CUDA + pThread code; at runtime, the kernel management layer chooses between profiling and execution without profiling, with a conflict management unit handling speculation checks]

Offline Compilation
Loop classification (examples of each class are sketched below):
– Sequential loops: dependences are determined at compile time; assigned to the CPU statically
– DO-ALL loops: assigned to the GPU statically
– Possible DO-ALL loops: dependences can be determined only at RUNTIME
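Illustrative examples of the three classes (these particular loops are my own, not taken from the paper), written as a plain C/C++ fragment:

    // Hypothetical helper just to give the loops a context; the loop bodies are what matter.
    void loop_class_examples(int n, float* a, float* b, float* c, const int* idx) {
      // Sequential loop: the loop-carried dependence through a[i-1] is visible at
      // compile time, so it is assigned to the CPU statically.
      for (int i = 1; i < n; i++)
        a[i] = a[i] + a[i - 1];

      // DO-ALL loop: every iteration writes a distinct element, provable statically,
      // so it is assigned to the GPU statically.
      for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];

      // Possible DO-ALL loop: whether iterations conflict depends on the runtime
      // contents of idx[], so it is run speculatively and checked at runtime.
      for (int i = 0; i < n; i++)
        c[idx[i]] = a[i] + b[i];
    }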

Runtime Profiling
Spawns two threads on the CPU:
– A sequential execution thread
– A monitoring thread that keeps track of the loop's memory footprint
Marks the loop as:
– Sequential, if there are many conflicts
– Parallelizable, if there are no or few conflicts; such loops are then assigned to both CPU and GPU
A rough sketch of this classification follows below.
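A rough sketch of the classification decision (my own simplification with assumed data structures; the slides do not specify these details): given the per-iteration read and write address sets observed by the monitoring thread, count cross-iteration overlaps and compare against a threshold.

    #include <cstdint>
    #include <unordered_set>
    #include <vector>

    enum class LoopClass { Sequential, Parallelizable };

    // reads[i] / writes[i] hold the addresses touched by iteration i of the profiled loop.
    LoopClass classify_loop(const std::vector<std::unordered_set<std::uintptr_t>>& reads,
                            const std::vector<std::unordered_set<std::uintptr_t>>& writes,
                            int conflict_threshold) {
      std::unordered_set<std::uintptr_t> written_so_far;
      int conflicts = 0;
      for (std::size_t i = 0; i < writes.size(); ++i) {
        for (std::uintptr_t a : reads[i])
          if (written_so_far.count(a)) ++conflicts;   // cross-iteration read-after-write
        for (std::uintptr_t a : writes[i])
          if (written_so_far.count(a)) ++conflicts;   // cross-iteration write-after-write
        written_so_far.insert(writes[i].begin(), writes[i].end());
      }
      return conflicts > conflict_threshold ? LoopClass::Sequential
                                            : LoopClass::Parallelizable;
    }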

Conflict Detection - Logging
Lazy conflict detection: logging memory is allocated when the kernel executes
– a "write-set" for stores
– a "read-set" for loads

Original loop:
    for (i = 0; i < N; i++) {
      idx = I[i];
      C[idx] = A[idx] + B[idx];
    }

GPU version with logging (log arrays plus per-thread striding):
    int  C_wr_log[sizeof_C];
    bool C_rd_log[sizeof_C];

    for (i = tid; i < N; i += ThreadCnt) {
      idx = I[i];
      C[idx] = A[idx] + B[idx];
      AtomicInc(C_wr_log[idx]);
    }
A complete kernel form of this instrumentation is sketched below.
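A CUDA kernel form of the instrumented loop above (the signature and log types are assumed for illustration; Paragon's compiler generates the actual instrumentation, and both log arrays must be zeroed, e.g. with cudaMemset, before the launch):

    __global__ void spec_loop_logged(const int* I, const float* A, const float* B,
                                     float* C, unsigned int* C_wr_log, bool* C_rd_log,
                                     int N) {
      int tid = blockIdx.x * blockDim.x + threadIdx.x;
      int thread_cnt = gridDim.x * blockDim.x;      // ThreadCnt from the slide
      for (int i = tid; i < N; i += thread_cnt) {
        int idx = I[i];
        C[idx] = A[idx] + B[idx];
        atomicInc(&C_wr_log[idx], 0xFFFFFFFFu);     // write-set: count stores to C[idx]
        // If this loop also read C (e.g. C[idx] += ...), the read-set would be marked too:
        //   C_rd_log[idx] = true;
      }
    }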

Conflict Detection - Checking
Done in parallel in a kernel that follows the speculative kernel. An element is a conflict if:
– its address was written more than once, or
– its address was both read and written at least once
[Figure: each checking thread examines one entry of C_wr_log and C_rd_log; an entry with a write count of 2 is flagged as a conflict, while entries written at most once and not read are OK]
A sketch of such a checking kernel follows below.
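A sketch of the checking kernel (again with assumed names; the real generated code may differ): one thread per log entry, and any conflicting entry sets a global flag that the host reads back before deciding whether to commit the GPU result.

    __global__ void check_conflicts(const unsigned int* C_wr_log, const bool* C_rd_log,
                                    int* conflict_flag, int sizeof_C) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < sizeof_C) {
        bool multiple_writes = C_wr_log[i] > 1;                   // written more than once
        bool read_and_write  = C_rd_log[i] && C_wr_log[i] >= 1;   // read and written
        if (multiple_writes || read_and_write)
          atomicExch(conflict_flag, 1);                           // any conflict aborts the speculation
      }
    }

If the flag stays clear the GPU result is committed; otherwise it is discarded and the CPU's sequential result is used.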

Experimental Setup
CPU: Intel Core i7
GPU: NVIDIA GTX 560 with 2 GB GDDR5
Benchmarks:
– Loops with pointer access: FDTD, Seidel, Jacobi2D, GEMM, TMV
– Indirect/non-linear access: Saxpy, House, Ipvec, Ger, Gemver, SOR, FWD

Results for Pointer Access
[Chart: speedup of the pointer-access benchmarks; the best case reaches 36x]

Results for Indirect Access
[Chart: speedup of the indirect/non-linear-access benchmarks]

Conclusion
Paragon improves performance:
– More GPU utilization
– Possibly-parallel loops are run speculatively on the GPU
No performance penalty on mis-speculation:
– The CPU runs the sequential version at the same time
– Conflict checking is done on the GPU

Q & A

Overhead Breakdown (backup)
[Chart of runtime overhead components; not recoverable from the transcript]