A Case Against Small Data Types in GPGPUs
Ahmad Lashgar and Amirali Baniasadi
ECE Department, University of Victoria

Evaluation of Datatype Size on GPGPUs
Key Observation: the smallest datatypes always improve memory efficiency, but they may degrade performance.
Key Question: why may smaller datatypes degrade performance?
Key Finding: Miss Status Holding Registers (MSHRs) are the contended resource under smaller datatypes.

Outline
Motivation
Background
  GPU architecture
  Miss status handling structure
Case Study: 1D Stencil
  Memory pattern
Methodology
Evaluation

Impact of Datatype Size
Performance vs. accuracy: we relax the accuracy constraint and evaluate performance.
Datatype sizes: 4-byte int, 2-byte short, 1-byte char
Basic element of each application:
  Matrix multiplication: element in a matrix
  1D stencil: element in an array (case study)
  Stereo matching: label in a disparity map

GPU Architecture
[Block diagram: multiple SMs, each with a warp pool, warp scheduler, register file, ALU/SFU/LSU pipelines, and an L1$ holding data, tags, and MSHRs, connected through the interconnection network to the L2$ slices and memory controllers (MCtrl). Each MSHR entry records the block address of an in-flight cache line (e.g. cache line L0 - $ID, block address 0x0A - ADDR) plus merger fields such as W0 - MW0 and W1 - MW1.]
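To make the MSHR terminology concrete, the C-style sketch below shows one plausible organization of a merging MSHR file; the field names, widths, and entry/merger counts are illustrative assumptions, not the actual GTX480 hardware or GPGPU-Sim data structures.

  #include <stdbool.h>
  #include <stdint.h>

  #define NUM_MSHR_ENTRIES 64   /* illustrative; the actual per-L1$ count is configuration-dependent */
  #define MERGERS_PER_ENTRY 8   /* merger fields per entry, as in the simulated configuration */

  /* One merger field: a pending request satisfied when the in-flight line returns. */
  struct merger_field {
      unsigned warp_id;      /* warp that issued the merged request (e.g. W0, W1, ...) */
      unsigned dst_reg;      /* destination register to fill on return */
      unsigned line_offset;  /* byte offset of the requested word within the cache line */
  };

  /* One MSHR entry tracks a single in-flight cache-line miss. */
  struct mshr_entry {
      bool     valid;
      uint64_t block_addr;                             /* address of the missing line (e.g. 0x0A) */
      struct merger_field mergers[MERGERS_PER_ENTRY];  /* requests merged into this miss */
      unsigned num_mergers;
  };

  /* The per-SM L1 MSHR file. A new miss allocates a free entry; a miss to a block
   * that already has an entry only consumes a merger field; when an entry's merger
   * fields are all full, the load/store unit stalls. */
  struct mshr_file {
      struct mshr_entry entries[NUM_MSHR_ENTRIES];
  };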

Outstanding Memory Accesses
Limited by the L1$ and MSHR capacity (bounds sketched below):
  Without merging capability:
  Best-case merging-enabled:
  Worst-case merging-enabled:
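Writing N_mshr for the MSHR entries per L1$ and N_merge for the merger fields per entry, my reading of the slide's bounds is the following (treat the exact expressions, and whether an entry's first request counts against its merger fields, as assumptions rather than the paper's definitions):

  Without merging capability:    max outstanding misses   = N_mshr
  Best-case merging-enabled:     max outstanding accesses = N_mshr x N_merge   (every entry fully merged)
  Worst-case merging-enabled:    max outstanding accesses = N_mshr             (every miss targets a distinct line, so no merging occurs)

For example, a hypothetical 64-entry MSHR file with 8 merger fields per entry would support 64 distinct in-flight lines and up to 512 merged accesses in the best case.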

Case Study: 1D Stencil
Algorithm: each output element is the average of the element and its two neighbours.
CUDA code:
  int i = threadIdx.x + blockIdx.x*blockDim.x;
  if( i > 0 ){
      a_dst[i] = (a_src[i-1] + a_src[i] + a_src[i+1]) / 3;
  }
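For reference, a self-contained variant of the kernel is sketched below, with an explicit upper-bound guard and the element type pulled out so the int, short, and char versions can be built from the same source; the names, sizes, and launch configuration are illustrative rather than the authors' benchmark code.

  // Element type under study: change to short or char to shrink the datatype.
  typedef int elem_t;

  __global__ void stencil1d(const elem_t *a_src, elem_t *a_dst, int n)
  {
      int i = threadIdx.x + blockIdx.x * blockDim.x;
      // Guard both ends so a_src[i-1] and a_src[i+1] stay in bounds.
      if (i > 0 && i < n - 1) {
          a_dst[i] = (a_src[i-1] + a_src[i] + a_src[i+1]) / 3;
      }
  }

  int main()
  {
      const int n = 1 << 20;
      elem_t *src, *dst;
      cudaMalloc(&src, n * sizeof(elem_t));
      cudaMalloc(&dst, n * sizeof(elem_t));
      cudaMemset(src, 1, n * sizeof(elem_t));   // arbitrary fill, just to have defined data

      int block = 256;
      int grid  = (n + block - 1) / block;
      stencil1d<<<grid, block>>>(src, dst, n);
      cudaDeviceSynchronize();

      cudaFree(src);
      cudaFree(dst);
      return 0;
  }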

Methodology
Real hardware: NVIDIA GeForce GTX480
Simulated hardware: GPGPU-Sim v3.2.2, GTX480 configuration
  Per-L1$ MSHR entries, with 8 merger fields per MSHR entry
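In GPGPU-Sim v3.x the L1 data cache and its MSHRs are configured through the -gpgpu_cache:dl1 option in gpgpusim.config; the third comma-separated group gives the MSHR type, entry count, and maximum merges per entry. The line below is only a sketch of where those two knobs live; the surrounding field values and the entry count are assumptions and should be checked against the shipped GTX480 configuration rather than read as the authors' exact setup.

  -gpgpu_cache:dl1  32:128:4,L:L:m:N:H,A:64:8,8

Here A:64:8 would mean an associative MSHR with 64 entries and up to 8 merged requests per entry; the sensitivity experiments at the end vary these two values along with the cache sets and ways.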

Performance
Stencil 1D under the real and the simulated GTX480

Memory Efficiency
Smaller datatypes consistently improve memory efficiency metrics:
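As a back-of-the-envelope illustration (my arithmetic under standard Fermi parameters, not the paper's measurements), with 128-byte L1 cache lines and 32-thread warps accessing consecutive elements:

  4-byte int:   128 B / 4 B = 32 elements per line; one warp's accesses cover exactly 1 line
  2-byte short: 128 B / 2 B = 64 elements per line; 2 consecutive warps share each line
  1-byte char:  128 B / 1 B = 128 elements per line; 4 consecutive warps share each line

A fixed-capacity L1 therefore holds 2x (short) or 4x (char) as many elements, and every byte moved over the memory system carries more useful elements, which is why the efficiency metrics improve.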

Stall Breakdown
Smaller datatypes stall on MSHR merger fields.
Larger datatypes stall on coalescing.
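A rough way to see why the merger fields become the bottleneck for small datatypes (illustrative arithmetic assuming 128-byte lines, 32-thread warps, unit-stride accesses, and 8 merger fields per MSHR entry; the real count of requests per entry depends on how the coalescer groups a warp's accesses):

  int (4 B):  a warp touches 32 x 4 B = 128 B = 1 full line, so each in-flight line serves
              roughly one warp and rarely needs many merger fields.
  char (1 B): a warp touches 32 x 1 B = 32 B, so 4 consecutive warps fall into the same line;
              their misses all try to merge into one MSHR entry, the 8 merger fields fill up,
              and further warps stall in the load/store unit until the line returns.

Larger datatypes instead spread each warp over more lines, which matches the breakdown above: they stress coalescing rather than merging.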

Conclusion
Real-hardware evaluation and simulation to observe the impact of datatypes on:
  Performance of GPUs
  Effective cache capacity, memory latency/bandwidth/demand
  Coalescing, cache, and MSHR stalls
Smaller datatypes improve memory efficiency.
Depending on the memory access pattern, smaller datatypes may increase MSHR merger stalls.
Future work: micro-benchmarking to understand the GPU MSHR structure.
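One possible shape for such a micro-benchmark is sketched below: every thread issues one load to a distinct 128-byte line so that no merging can occur, the host times the kernel, and sweeping the thread count looks for the point where latency jumps because the in-flight misses exceed the MSHR capacity. This is an assumed design with illustrative names and parameters, not the authors' planned micro-benchmark, and real-hardware timing would need extra care (warm-up runs, restricting the launch to one SM, and so on).

  #include <cstdio>

  // Each thread reads one int from a distinct 128-byte line (stride of 32 ints),
  // so every access is an independent miss that needs its own MSHR entry.
  __global__ void probe_misses(const int *data, int stride, int *sink)
  {
      int tid = threadIdx.x + blockIdx.x * blockDim.x;
      int v = data[tid * stride];
      if (v == 123456789) *sink = v;   // data dependence keeps the load from being optimized away
  }

  int main()
  {
      const int max_threads = 1024;
      const int stride = 32;           // 32 ints x 4 B = 128 B: one cache line per thread
      int *data, *sink;
      cudaMalloc(&data, max_threads * stride * sizeof(int));
      cudaMalloc(&sink, sizeof(int));
      cudaMemset(data, 0, max_threads * stride * sizeof(int));

      cudaEvent_t start, stop;
      cudaEventCreate(&start);
      cudaEventCreate(&stop);

      // Sweep the number of concurrent misses within one thread block (one SM).
      for (int threads = 32; threads <= max_threads; threads *= 2) {
          cudaEventRecord(start);
          probe_misses<<<1, threads>>>(data, stride, sink);
          cudaEventRecord(stop);
          cudaEventSynchronize(stop);
          float ms = 0.0f;
          cudaEventElapsedTime(&ms, start, stop);
          printf("%4d concurrent misses: %.3f ms\n", threads, ms);
      }

      cudaEventDestroy(start);
      cudaEventDestroy(stop);
      cudaFree(data);
      cudaFree(sink);
      return 0;
  }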

Thank you! Questions?

Example: Worst-Case Scenario
[Animated block diagram of the SM (warp pool, warp scheduler, register file, ALU/SFU/LSU, L1$ data/tags/MSHRs): warps W0-W2 issue to the ALU while W3-W7 queue at the LSU; a request to block address 0x0A finds the matching MSHR entry's merger fields already exhausted (marked X), so the remaining warps stall until the in-flight line returns.]

Methodology (2)

Methodology (3)

Sensitivity
Varying MSHRs, merger fields, sets, and ways