A Micro-benchmark Suite for AMD GPUs
Ryan Taylor, Xiaoming Li

Motivation

To understand the behavior of major kernel characteristics:
– ALU:Fetch Ratio
– Read Latency
– Write Latency
– Register Usage
– Domain Size
– Cache Effects
Use micro-benchmarks as guidelines for general optimizations. Few, if any, useful micro-benchmarks exist for AMD GPUs. Look at multiple generations of AMD GPUs (RV670, RV770, RV870).

Hardware Background

Current AMD GPUs:
– Scalable SIMD (compute) engines, each with multiple thread processors
  (RV770 and RV870 => 16 TPs/SIMD engine)
– 5-wide VLIW processors (compute cores)
– Threads run in wavefronts; the number of threads per wavefront depends on the architecture
  (RV770 and RV870 => 64 threads/wavefront)
– Threads are organized into quads per thread processor
– Two wavefront slots per SIMD engine (odd and even)
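As a rough illustration of these numbers, a minimal C sketch; the two constants follow the slide above, while the domain size and names are hypothetical, not from the paper:

    #include <stdio.h>

    /* Illustrative mapping of a thread domain onto RV770/RV870-class
       hardware. The two constants follow the slide above; everything
       else is a made-up example. */
    #define THREADS_PER_WAVEFRONT 64   /* RV770/RV870 */
    #define TPS_PER_SIMD          16   /* thread processors per SIMD engine */

    int main(void) {
        long domain_x = 1024, domain_y = 1024;   /* hypothetical domain size */
        long threads = domain_x * domain_y;
        long wavefronts = (threads + THREADS_PER_WAVEFRONT - 1)
                          / THREADS_PER_WAVEFRONT;
        /* Each thread processor runs a quad (4 threads), so one wavefront
           (64 threads) spans 64 / 4 = 16 TPs, i.e. one full SIMD engine. */
        printf("%ld threads -> %ld wavefronts\n", threads, wavefronts);
        return 0;
    }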

AMD GPU Arch. Overview [figures: Thread Organization; Hardware Overview]

Software Overview

Example ISA dump: a fetch (TEX) clause followed by an ALU clause.

    00 TEX: ADDR(128) CNT(8) VALID_PIX
         0  SAMPLE R1, R0.xyxx, t0, s0  UNNORM(XYZW)
         1  SAMPLE R2, R0.xyxx, t1, s0  UNNORM(XYZW)
         2  SAMPLE R3, R0.xyxx, t2, s0  UNNORM(XYZW)
    01 ALU: ADDR(32) CNT(88)
         8  x: ADD ____, R1.w, R2.w
            y: ADD ____, R1.z, R2.z
            z: ADD ____, R1.y, R2.y
            w: ADD ____, R1.x, R2.x
         9  x: ADD ____, R3.w, PV1.x
            y: ADD ____, R3.z, PV1.y
            z: ADD ____, R3.y, PV1.z
            w: ADD ____, R3.x, PV1.w
        14  x: ADD T1.x, T0.w, PV2.x
            y: ADD T1.y, T0.z, PV2.y
            z: ADD T1.z, T0.y, PV2.z
            w: ADD T1.w, T0.x, PV2.w
    02 EXP_DONE: PIX0, R0
    END_OF_PROGRAM
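A rough C sketch of what this clause pair computes, assuming three input textures sampled at the same coordinate; the float4 type and function name are illustrative, not part of CAL/IL:

    typedef struct { float x, y, z, w; } float4;  /* hypothetical 4-component type */

    /* Rough C equivalent of the clause pair above: the fetch clause samples
       three inputs at one coordinate; the ALU clause sums them component-wise.
       (Instructions 10-13 of the dump, which fold in further sampled values
       via T0, are elided on the slide, so they are omitted here too.) */
    float4 sum3(const float4 *t0, const float4 *t1, const float4 *t2, int idx) {
        float4 r1 = t0[idx], r2 = t1[idx], r3 = t2[idx];   /* fetch clause */
        float4 pv;                                         /* ALU clause */
        pv.x = r1.x + r2.x + r3.x;
        pv.y = r1.y + r2.y + r3.y;
        pv.z = r1.z + r2.z + r3.z;
        pv.w = r1.w + r2.w + r3.w;
        return pv;
    }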

Code Generation

The benchmarks are written in CAL/IL (Compute Abstraction Layer / Intermediate Language):
– CAL: API interface to the GPU
– IL: intermediate language with virtual registers
– A low-level, programmable GPGPU solution for AMD GPUs
– Gives greater control over the ISA the CAL compiler produces
– Gives greater control over register usage
Each benchmark uses the same pattern of operations (register usage differs slightly).

Code Generation – Generic

Generic pattern:

    Reg0 = Input0 + Input1
    while (INPUTS)   Reg[i] = Reg[i-1] + Input[i]
    while (ALU_OPS)  Reg[i] = Reg[i-1] + Reg[i-2]
    Output = Reg[last]

Example expansion:

    R1 = Input1 + Input2;
    R2 = R1 + Input3;
    R3 = R2 + Input4;
    R4 = R3 + R2;
    R5 = R4 + R3;
    ...
    R15 = R14 + R13;
    Output1 = R15 + R14;
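A minimal C sketch of such a generator, emitting the chain as text; the function and parameter names are illustrative, and the authors' actual tool emits AMD IL rather than C:

    #include <stdio.h>

    /* Emit the generic add-chain kernel body described above:
       first consume all inputs, then issue a chain of dependent ALU ops. */
    void emit_kernel(int num_inputs, int num_alu_ops) {
        int r = 1;
        printf("R%d = Input1 + Input2;\n", r);
        for (int i = 3; i <= num_inputs; i++) {      /* input-consuming adds */
            r++;
            printf("R%d = R%d + Input%d;\n", r, r - 1, i);
        }
        for (int i = 0; i < num_alu_ops; i++) {      /* dependent ALU chain */
            r++;
            printf("R%d = R%d + R%d;\n", r, r - 1, r - 2);
        }
        printf("Output1 = R%d + R%d;\n", r, r - 1);
    }

    int main(void) {
        emit_kernel(4, 10);   /* e.g., 4 inputs, 10 extra ALU ops */
        return 0;
    }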

Clause Generation – Register Usage

Register Usage layout:

    Sample(32)
    ALU_OPs clause (use the first 32 sampled)
    Sample(8), ALU_OPs clause (use the 8 sampled here)
    Sample(8), ALU_OPs clause (use the 8 sampled here)
    Sample(8), ALU_OPs clause (use the 8 sampled here)
    Sample(8), ALU_OPs clause (use the 8 sampled here)
    Output

Clause layout:

    Sample(64)
    ALU_OPs clause (use the first 32 sampled)
    ALU_OPs clause (use the next 8)
    Output

ALU:Fetch Ratio

The "ideal" ALU:Fetch ratio is 1.00:
– 1.00 means a perfect balance of the ALU and fetch units
– Ideal GPU utilization includes full use of BOTH the ALU units and the memory (fetch) units
A reported ALU:Fetch ratio of 1.0 is not always optimal utilization; it depends on memory access types and patterns, cache hit ratio, register usage, latency hiding... among other things.
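In these benchmarks the ratio is steered by the generated instruction mix; a rough sketch of that relationship (a simplification: as noted above, the hardware-reported ratio also reflects cache behavior, latency hiding, and so on):

    #include <stdio.h>

    /* First-order approximation for the generated kernels:
       ALU:Fetch ~ ALU instructions per fetch instruction. */
    int main(void) {
        int fetches = 16;                  /* e.g., 16 input samples */
        int alu_ops[] = {16, 64, 160};     /* hypothetical sweep values */
        for (int i = 0; i < 3; i++)
            printf("%3d ALU ops / %d fetches -> ALU:Fetch ~ %.2f\n",
                   alu_ops[i], fetches, (double)alu_ops[i] / fetches);
        return 0;
    }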

ALU:Fetch – 16 Inputs, 64x1 Block Size, Samplers [chart; callout: lower cache hit ratio]

ALU:Fetch – 16 Inputs, 4x16 Block Size, Samplers [chart]

ALU:Fetch – 16 Inputs, Global Read and Stream Write [chart]

ALU:Fetch – 16 Inputs, Global Read and Global Write [chart]

Input Latency – Texture Fetch, 64x1 [chart; callouts: reduction in cache hit ratio when ALU ops < 4*inputs; the otherwise linear increase can be affected by the cache hit ratio]

Input Latency – Global Read [chart; callout: ALU ops < 4*inputs; generally linear increase with the number of reads]

Write Latency – Streaming Store [chart; callout: ALU ops < 4*inputs; generally linear increase with the number of writes]

Write Latency – Global Write [chart; callout: ALU ops < 4*inputs; generally linear increase with the number of writes]

Domain Size – Pixel Shader [chart; ALU:Fetch = 10.0, inputs = 8]

Domain Size – Compute Shader [chart; ALU:Fetch = 10.0, inputs = 8]

Register Usage – 64x1 Block Size [chart; callout: overall performance improvement]

Register Usage – 4x16 Block Size [chart; callout: cache thrashing]

Cache Use – ALU:Fetch, 64x1 [chart; callout: slight impact on performance]

Cache Use – ALU:Fetch, 4x16 [chart; callout: cache hit ratio not much affected by the number of ALU operations]

Cache Use – Register Usage, 64x1 [chart; callout: too many wavefronts]

Cache Use – Register Usage, 4x16 [chart; callout: cache thrashing]

Conclusion / Future Work

Conclusion:
– Attempt to understand behavior based on program characteristics rather than a specific algorithm; this gives guidelines for more general optimizations
– Look at major kernel characteristics
– Some features may be driver- or compiler-limited rather than hardware-limited, and can vary somewhat from driver to driver or compiler to compiler
Future Work:
– More details, such as the effects of the Local Data Store, block size, and wavefronts
– Analyze more configurations
– Build predictable micro-benchmarks for a higher-level language (e.g., OpenCL)
– Continue to update behavior with current drivers