Understanding Software Approaches for GPGPU Reliability
Martin Dimitrov*, Mike Mantor†, Huiyang Zhou*
*University of Central Florida, Orlando  †AMD, Orlando

Motivation
Soft-error rates are predicted to grow exponentially in future process generations, and hard errors are gaining importance. Current GPUs do not provide hardware support for detecting soft or hard errors, and near-future GPUs are unlikely to address these reliability challenges, because GPU designs are still largely driven by the video-game market.

Our Contributions
Propose and evaluate three different software-only approaches for providing redundant computation on GPUs:
– R-Naive
– R-Scatter
– R-Thread
Evaluate the trade-offs of adding hardware support (parity protection in memory) to our software approaches.

Presentation Outline
Motivation
Our Contributions
Background on GPU architectures
Proposed Software Redundancy Approaches
– R-Naive
– R-Scatter
– R-Thread
Experimental Methodology
Experimental Results
Conclusions

GPU Architectures
NVIDIA G80
16 compute units, each with:
– 8 streaming processors
– 8K-entry register file
– 16 kB shared memory
– 8 kB constant memory
A large number of threads can be assigned to a compute unit, up to 512 (8K registers / 512 threads = 16 registers per thread).
Threads are scheduled in "warps" of 32.
[Figure: compute-unit diagram showing the streaming processors (SP), 32 kB register file, 16 kB shared memory, and 8 kB constant memory.]

GPU Architectures
ATI R670
4 compute units (CU), each with:
– 80 streaming processors
– 256 kB register file
Threads are organized in wavefronts (similar to warps).
Instructions are grouped into VLIW words of five.
[Figure: compute-unit diagram showing the stream processing units (SP), branch unit, and registers.]

Proposed Software Redundancy Approaches
Our goal is to provide 100% redundancy on the GPU:
– Duplicate the CPU-GPU-CPU memory copies (spatial redundancy for GPU memory).
– Duplicate the kernel executions (temporal redundancy for the computational logic and communication links).

Proposed Software Redundancy Approaches
R-Naive

(a) Original code:
    StreamRead(in)
    Kernel(in, out)
    StreamWrite(out)

(b) Redundant code without overlap:
    StreamRead(in)
    StreamRead(in_R)
    Kernel(in, out)
    Kernel(in_R, out_R)
    StreamWrite(out)
    StreamWrite(out_R)

(c) Redundant code with overlap:
    StreamRead(in)
    Kernel(in, out)
    StreamRead(in_R)
    Kernel(in_R, out_R)
    StreamWrite(out)
    StreamWrite(out_R)

(d) Redundant code back-to-back:
    StreamRead(in)
    Kernel(in, out)
    StreamWrite(out)
    StreamRead(in_R)
    Kernel(in_R, out_R)
    StreamWrite(out_R)
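For readers more familiar with CUDA than with the Brook+-style StreamRead/StreamWrite calls above, the following is a minimal sketch of variant (c), issuing the redundant copy/launch pair in a second stream. All names (kernel, runRedundant, d_in, h_in, and so on) are hypothetical, the host buffers are assumed page-locked, and actual copy/kernel overlap depends on what the hardware supports; this is an illustrative translation, not the authors' code.

    #include <cuda_runtime.h>

    // Hypothetical kernel: reads 'in', writes 'out'.
    __global__ void kernel(const float *in, float *out);

    // R-Naive variant (c): the redundant copy/launch pair goes into a
    // second stream so it can overlap with the original. h_in, h_out,
    // and h_out_R must be page-locked (cudaMallocHost) for the
    // asynchronous copies to overlap.
    void runRedundant(const float *h_in, float *h_out, float *h_out_R,
                      float *d_in, float *d_out,
                      float *d_in_R, float *d_out_R,
                      size_t n, dim3 grid, dim3 block)
    {
        cudaStream_t s0, s1;
        cudaStreamCreate(&s0);
        cudaStreamCreate(&s1);

        cudaMemcpyAsync(d_in, h_in, n * sizeof(float),
                        cudaMemcpyHostToDevice, s0);
        kernel<<<grid, block, 0, s0>>>(d_in, d_out);      // original
        cudaMemcpyAsync(d_in_R, h_in, n * sizeof(float),
                        cudaMemcpyHostToDevice, s1);
        kernel<<<grid, block, 0, s1>>>(d_in_R, d_out_R);  // redundant
        cudaMemcpyAsync(h_out, d_out, n * sizeof(float),
                        cudaMemcpyDeviceToHost, s0);
        cudaMemcpyAsync(h_out_R, d_out_R, n * sizeof(float),
                        cudaMemcpyDeviceToHost, s1);

        cudaStreamSynchronize(s0);
        cudaStreamSynchronize(s1);
        cudaStreamDestroy(s0);
        cudaStreamDestroy(s1);
    }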

Proposed Software Redundancy Approaches
R-Naive
Hard errors: it is desirable for the original and redundant input/output streams to use different communication links and compute cores.
Solutions:
– For some applications this can be achieved at the application level by rearranging the input data.
– For other applications it is desirable to have a software-controllable interface for assigning hardware resources.

Proposed Software Redundancy Approaches
R-Naive
Advantages/disadvantages of R-Naive:
+ Easy to implement
+ Predictable performance
– 100% performance overhead

Proposed Software Redundancy Approaches
R-Scatter
Take advantage of unused instruction-level parallelism.
[Figure: original vs. R-Scatter VLIW instruction schedules, with original and redundant operations marked. Each VLIW instruction is mapped to a stream processing unit.]

Proposed Software Redundancy Approaches
R-Scatter

(a) Original code (7 VLIW words):

    kernel void mat_mult(float width, float M[][], float N[][], out float P<>) {
        float2 vPos = indexof(P).xy;  // obtain position in the output stream
        float4 index = float4(vPos.x, 0.0f, 0.0f, vPos.y);
        float4 step = float4(0.0f, 1.0f, 1.0f, 0.0f);
        float sum = 0.0f;
        for (float i = 0; i < width; i = i + 1) {
            sum += M[index.zw] * N[index.xy];  // accessing input streams
            index += step;
        }
        P = sum;
    }

(b) R-Scatter code (11 VLIW words):

    kernel void mat_mult(float width, float M[][], float M_R[][],
                         float N[][], float N_R[][],
                         out float P<>, out float P_R<>) {
        float2 vPos = indexof(P).xy;  // obtain position in the output stream
        float4 index = float4(vPos.x, 0.0f, 0.0f, vPos.y);
        float4 step = float4(0.0f, 1.0f, 1.0f, 0.0f);
        float sum = 0.0f;
        float sum_R = 0.0f;
        for (float i = 0; i < width; i = i + 1) {
            sum += M[index.zw] * N[index.xy];        // original computation
            sum_R += M_R[index.zw] * N_R[index.xy];  // redundant computation
            index += step;
        }
        P = sum;
        P_R = sum_R;
    }

The redundant operations are inherently independent. Note, however, that an error to the shared loop counter "i" will affect both the original and redundant computation.

Proposed Software Redundancy Approaches
R-Scatter
Advantages/disadvantages of R-Scatter:
+ Better-utilized VLIW schedules
+ Reused instructions (such as the for-loop)
+ Overlapped memory accesses
– Extra registers or shared memory used per kernel may affect thread-level parallelism

Proposed Software Redundancy Approaches
R-Thread
Take advantage of unused thread-level parallelism (unused compute units): allocate double the number of thread blocks, and let the extra thread blocks perform the redundant computation.

(a) Original code:

    float Pvalue = 0;
    for (int k = 0; k < Block_Size; ++k) {
        float m = M[addr_md];
        float n = N[addr_nd];
        Pvalue += m * n;
    }
    P[ty*Width + tx] = Pvalue;

(b) R-Thread code:

    if (by >= NumBlocks) {   // extra blocks run the redundant copy
        M = M_R;
        N = N_R;
        P = P_R;
        by = by - NumBlocks;
    }
    float Pvalue = 0;
    for (int k = 0; k < Block_Size; ++k) {
        float m = M[addr_md];
        float n = N[addr_nd];
        Pvalue += m * n;
    }
    P[ty*Width + tx] = Pvalue;
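To make the block remapping concrete, here is a minimal self-contained sketch of an R-Thread style matrix-multiply kernel and its doubled launch. It omits the shared-memory tiling of the original benchmark, and names such as matMulRT follow the snippet above but are otherwise hypothetical; this is an illustration of the technique, not the authors' kernel.

    // R-Thread sketch: launch 2x the thread blocks in the y-dimension.
    // Blocks with by >= NumBlocks redirect themselves to the redundant
    // buffers and re-derive their logical block index.
    __global__ void matMulRT(const float *M, const float *N, float *P,
                             const float *M_R, const float *N_R, float *P_R,
                             int Width, int NumBlocks)
    {
        int by = blockIdx.y;
        const float *Msrc = M, *Nsrc = N;
        float *Pdst = P;
        if (by >= NumBlocks) {            // redundant half of the grid
            Msrc = M_R; Nsrc = N_R; Pdst = P_R;
            by -= NumBlocks;
        }
        int row = by * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        float Pvalue = 0.0f;
        for (int k = 0; k < Width; ++k)
            Pvalue += Msrc[row * Width + k] * Nsrc[k * Width + col];
        Pdst[row * Width + col] = Pvalue;
    }

    // Host side: Width/16 is the original number of block rows, so the
    // grid's y-dimension is doubled to cover the redundant computation.
    dim3 block(16, 16);
    dim3 grid(Width / 16, 2 * (Width / 16));
    matMulRT<<<grid, block>>>(d_M, d_N, d_P, d_M_R, d_N_R, d_P_R,
                              Width, Width / 16);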

Proposed Software Redundancy Approaches
R-Thread
Advantages/disadvantages of R-Thread:
+ Easy to implement
+ Overhead may drop below 100% if there is not enough thread-level parallelism to fill the GPU
– 100% performance overhead if enough thread-level parallelism is already present

Hardware Support for Error Detection in Off-Chip and On-Chip Memories
Protecting off-chip global memory:
– Benefits all proposed approaches
– Eliminates the need for a redundant CPU-to-GPU transfer
Protecting on-chip caches and shared memory:
– Benefits R-Scatter on the G80
– Required on the R670 to obtain the benefit of protecting off-chip memory, due to implicit caching
Results are compared on the CPU, so we still need the redundant GPU-to-CPU transfer.
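Because results are verified on the CPU, error detection itself reduces to comparing the two output buffers once they have been copied back. A minimal sketch with hypothetical names follows; an exact comparison is appropriate here since both copies execute identical arithmetic on identical inputs.

    #include <stddef.h>

    // Compare original and redundant results on the host. Any mismatch
    // indicates an error during kernel execution or data transfer;
    // bitwise-identical outputs are expected in the error-free case.
    int outputsDiffer(const float *h_out, const float *h_out_R, size_t n)
    {
        for (size_t i = 0; i < n; ++i)
            if (h_out[i] != h_out_R[i])
                return 1;   // error detected
        return 0;
    }

On a detected mismatch, the application can recover by re-running the kernel or falling back to CPU execution.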

Experimental Methodology
Machine Setup
Brook+ experiments:
– Brook+ 1.0 Alpha, Windows XP
– 2.4 GHz Intel Core 2 Quad CPU, 3.25 GB of RAM
– ATI R670 card with 512 MB memory
CUDA experiments:
– CUDA SDK 1.1, Fedora 8 Linux
– 2.3 GHz quad-core Intel Xeon, 2 GB of RAM
– NVIDIA GeForce 8800 GTX card with 768 MB memory
Both machines have PCIe x16, providing 3.2 GB/s bandwidth between GPU and CPU.

Experimental Methodology
Evaluated Applications

Benchmark             | Description                                                     | Application Domain
Matrix Multiplication | Multiplying two 2K by 2K matrices                               | Mathematics
Convolution           | Applying a 5x5 filter to a 2K by 2K image                       | Graphics
Black-Scholes         | Computing the prices of 8 million stock options                 | Finance
Mandelbrot            | Obtaining a Mandelbrot set from a quadratic recurrence equation | Mathematics
Bitonic Sort          | A parallel sorting algorithm; sorting 2^20 elements             | Computer Science
1D FFT                | Fast Fourier Transform on a 4K array                            | Mathematics

Experimental Results
R-Naive
– Consistently 2x the original execution time.
– Some applications see no benefit from hardware protection of off-chip memory.
– Memory transfer from GPU to CPU is slower than from CPU to GPU.
– Hardware support for off-chip memory protection yields 5%-7% performance gains.

Experimental Results
R-Scatter on R670
– Applications with compacted schedules generally see a benefit from R-Scatter (FFT, Bitonic Sort).
– Some applications are still dominated by memory-transfer time (Convolution, Black-Scholes).
– On average, R-Scatter runs at 195% of the original execution time (185% with hardware memory protection).

Experimental Results
R-Thread on G80
– Performance overhead is uniformly close to 100%, because our benchmarks have enough thread-level parallelism.
– When the input size is reduced (leaving some thread-level parallelism unused), there are clear benefits.

Conclusions
– We proposed three software redundancy approaches with different trade-offs.
– Compiler analysis should be able to utilize some of the unused resources and provide reliability automatically.
– We conclude that, for our current software approaches, hardware support provides very limited benefit.

Questions

Experimental Results
R-Scatter (backup)
VLIW instruction schedules for Bitonic Sort on R670. Each word may contain up to 5 instructions: x, y, z, w, t.

(a) Original code:

    16  x: MUL_e  ____, T1.w, T3.z
        y: FLOOR  ____, T0.z
        z: SETGE  ____, T0.y, |KC0[5].x|
    17  x: CNDE   T1.x, PV16.z, T0.y, T0.w
        y: FLOOR  T1.y, PV16.x
        z: ADD    T0.z, PV16.y, 0.0f
    18  x: MOV    T0.x, |PV17.y|
        y: ADD    ____, |KC0[5].x|, PV17.x
        w: MOV/2  ____, |PV17.y|
    19  z: TRUNC  ____, PV18.w
        w: CNDGT  ____, -T1.x, PV18

(b) R-Scatter code:

    16  x: SETGE  ____, PS15, |KC0[5].x|
        y: ADD    ____, T1.w, KC0[2].x
        z: MULADD T2.z, -T0.y, T2.x, T1.x
        w: ADD    ____, -|KC0[5].x|, PS15
        t: ADD    ____, T1.w, KC0[8].x
    17  x: ADD    ____, -|KC0[11].x|, PV16.z
        y: SETGE  ____, PV16.z, |KC0[11].x|
        z: CNDE   T3.z, PV16.x, T1.y, PV16.w
        w: FLOOR  ____, PV16.y
        t: FLOOR  ____, PS16
    18  x: ADD    R2.x, PV17.w, 0.0f
        y: ADD    ____, |KC0[5].x|, PV17.z
        z: ADD    R1.z, PS17, 0.0f
        w: CNDE   T0.w, PV17.y, T2.z, PV17.x
        t: MUL_e  ____, T1.w, T2.y
    19  x: FLOOR  R0.x, PS18
        y: MUL_e  ____, T1.w, T3.x
        z: ADD    ____, |KC0[11].x|, PV18.w
        w: CNDGT  ____, -T3.z, PV18.y, T3.z