HPEC 2007
Norm Rubin, Fellow, AMD Graphics Products Group
norman.rubin at amd.com


Overview
– Why GPU?
– What is the difference between GPU and CPU?
– 2900 introduction
– AMD programming model

GPGPU
– GPGPU means using the GPU for something other than rendering (no visual output)
– GPUs are getting closer to CPU functionality: branches, float operations, etc.
– GPUs have huge floating-point performance (2 × 500 GFLOPS)
– GPUs are cheaper per MFLOP (less than 1/4 the cost)
– Once you need to recode for multi-core anyway, any crazy idea starts to look promising; improving clock speed is over
– GPUs have new programming models, new languages, new issues with optimization

GPU Programming Systems
– OpenGL or DirectX: Cg / HLSL / OpenGL Shading Language
– GPGPU languages:
  – Accelerator (Microsoft Research)
  – Brook (Stanford)
  – CTM (AMD)
  – CUDA (NVIDIA)
  – RapidMind et al.

Published results (kernels)
– Large matrix/vector operations (BLAS)
– Protein folding (molecular dynamics)
– Finance modeling
– FFT (SETI, signal processing)
– Ray tracing
– Physics simulation (cloth, fluid, collision, …)
– Sequence matching (hidden Markov models)
– Speech/image recognition (hidden Markov models, neural nets)
– Databases
– Sort/search
– Medical imaging (image segmentation, processing)
– And many, many, many more…

Speedup switching to GPU (kernel code)
– Simple ports of data-parallel programs get 5-10 times faster: the algorithm is data parallel to start with, and the problem fits on the machine
– Smart ports (making use of tiling/compression, recoding with an understanding of the GPU, changing the algorithm) get substantially faster

Why are GPUs getting faster so fast?
– Arithmetic intensity: the specialized nature of GPUs makes it easier to use additional transistors for computation
– Economics: the multi-billion-dollar video game market drives innovation
– Games and images do not need the same kind of accurate results as GPGPU, so lots of fast but odd arithmetic, e.g. no denormals

GPU vs CPU – what is the difference?

[Die photo comparison: Radeon HD 2900 vs. 65 nm Barcelona]

CPU vs GPU

CPU                                  | GPU
-------------------------------------|-------------------------------------------
5% of area is ALU                    | 40% of area is ALU
Memory: low latency (1/10 of GPU)    | Memory: high bandwidth (10 times CPU)
Big cache (10 times GPU)             | Small cache
Full IEEE + doubles                  | Partial IEEE
4 cores                              | 64+ cores
Just loads/stores                    | Fancy memory: tiling, arithmetic in memory
                                     | 10 times the FLOPS of the CPU

Chip design point

CPU:
– Lots of instructions, little data → out-of-order execution, branch prediction
– Reuse and locality
– Task parallel
– Needs an OS
– Complex sync
– Backward compatible
– Little functional change
– Agreement on instructions

GPU:
– Few instructions, lots of data → SIMD, hardware threading
– Little reuse
– Data parallel
– No OS
– Simple sync
– Not backward compatible
– Fast/frequent functional change

CPU vs GPU performance
Peak performance = 1 float op per cycle
Program:
– a series of loads (1-6):
    r1 = load(index1)
    r2 = load(index2)
    r3 = r1 + r2
– a series of mul-adds (1-100):
    r4 = r3 * r3 + r3
    r5 = r4 * r4 + r4
Run over a very large (out-of-cache) data set (stream computation).
Can you get peak performance on a single core, multi-core, or a cluster?
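The micro-benchmark above can be sketched as a scalar loop; this is an illustrative Python rendering of the slide's pseudocode, not actual GPU code, with one load pair and two mul-adds per element:

```python
def stream_kernel(a, b):
    """One pass of the slide's micro-benchmark: a few loads, then a
    short chain of multiply-adds, applied to every element of a large
    (out-of-cache) data set."""
    out = []
    for x, y in zip(a, b):     # r1 = load(index1); r2 = load(index2)
        r3 = x + y             # r3 = r1 + r2
        r4 = r3 * r3 + r3      # mul-add chain (1-100 such ops per item)
        r5 = r4 * r4 + r4
        out.append(r5)
    return out
```

On a CPU, each iteration stalls on the loads; the question on the slide is whether any amount of cores can hide that.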

CPU operation
– Waits on memory; gaps prevent peak performance
– Gap size varies dynamically
– Hard to tolerate latency
– One iteration at a time
– Single CPU unit
– Cannot reach 100% utilization
– Hard to prefetch
– Multi-core does not help
– A cluster does not help
– Limited number of outstanding fetches

GPU threads (lower clock – different scale)
– Overlapped fetch and ALU
– Many outstanding fetches
– Lots of threads
– Fetch unit + ALU unit
– Fast thread switch
– In-order finish
– ALU units reach 100% utilization
– Hardware sync for final output
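A back-of-the-envelope model of why many threads reach full ALU utilization. The 500-cycle memory latency comes from a later slide; the 8-cycle ALU burst and the round-robin formula are illustrative simplifications, not AMD's actual scheduler:

```python
def alu_utilization(threads, alu_cycles=8, mem_latency=500):
    """Fraction of cycles the ALU stays busy when `threads` contexts
    round-robin: each runs alu_cycles of math, then waits mem_latency
    for its next fetch. With enough threads in flight, every wait is
    hidden behind some other thread's arithmetic."""
    busy = threads * alu_cycles
    return min(1.0, busy / (alu_cycles + mem_latency))
```

With one thread the ALU is almost always idle; with 64 threads the latency is fully hidden, which is the effect the slide describes.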

GPU performance (reaches peak) – chart from the GPUBench website

Implications
– Compute is cheap, but you need lots of parallelism to keep all those ALUs busy (graphics shading is highly parallel)
– Bandwidth can be expensive (500-cycle memory latency)
– Compute goes up by 70% a year, but bandwidth goes up by only 25% a year, and latency goes down by only 5% a year
– GPU wins when arithmetic intensity is high (lots of ALU ops per read)
– GPU wins when streaming (little reuse, lots of data)
– saxpy, FFT: sequential performance

What does a real machine look like?

AMD Radeon HD™ 2900 Highlights

Technology leadership:
– Clock speed: 742 MHz
– Transistors: 700 million
– Process: TSMC 80nm HS
– Power: ~215 W
– Die size: 420 mm² (20 mm x 21 mm)

2nd-generation unified architecture:
– Scalar ALU design with 320 stream processing units
– 475 GFLOPS of (MulAdd) compute
– 47.5 GigaPixels/sec and 742 Mtri/sec
– 106 GB/sec bandwidth
– Optimized for dynamic game computing and accelerated stream processing

DirectX® 10:
– Massive shader and geometry processing performance
– Shader Model 4.0 with integer support
– Enabling the next generation of visual effects

Cutting-edge image quality features:
– Advanced anti-aliasing and texture filtering capabilities
– Fast high-dynamic-range rendering
– Programmable tessellation unit

ATI Avivo™ HD technology:
– Delivering The Ultimate Visual Experience™ for HD video
– HD display and audio connectivity
– HD DVD and Blu-ray capable

Native CrossFire™ technology:
– Superior multi-GPU support
– Scales up rendering performance and image quality with 2 or more GPUs

AMD Radeon HD 2900 Graphics Unit
– 2nd-generation unified shader architecture: developed from the proven and successful XBOX 360 graphics
– New dispatch processor handling thousands of simultaneous threads
– Instruction cache and constant cache for unlimited program size
– Up to 320 discrete, independent stream processing units
– VLIW ALU implementation
– Dedicated branch execution units
– Three dedicated fetch units: texture cache, vertex cache, load/store cache
– Full support for DirectX 10.0, Shader Model 4.0

Shader Processing Units (SPU)
– Arranged as 5-way VLIW stream processors
– Co-issue up to 5 scalar FP MADs (multiply-adds)
– Up to 5 integer operations supported (cmp, logical, add)
– One of the 5 stream processing units additionally handles:
  – transcendental instructions (SIN, COS, LOG, EXP, RCP, RSQ)
  – integer multiply and shift operations
– 32-bit floating-point precision (round to nearest even)
– Branch execution units handle flow control and conditional operations:
  – condition-code generation for full branching
  – predication supported directly in the ALU
– General-purpose registers: 1 MByte of GPR space for fast register access

Active thread count
– Assume a thread needs 5 registers
– Each SIMD has 256 sets of 64 vector registers
– 256/5 = 51, so 51 × 64 threads per SIMD
– 4 SIMD engines, so 51 × 64 × 4 = 13,056 active threads
– Each register holds 4 × 32-bit values
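The slide's arithmetic, spelled out (all numbers are from the slide; 51 is the integer part of 256/5):

```python
regs_per_thread = 5       # assumed register footprint per thread
reg_sets_per_simd = 256   # register sets per SIMD engine
threads_per_set = 64      # each set holds registers for 64 threads
simd_engines = 4

# how many sets one thread's registers leave room for, times 64 threads
threads_per_simd = (reg_sets_per_simd // regs_per_thread) * threads_per_set
active_threads = threads_per_simd * simd_engines
```

Note the trade-off this exposes: a kernel that needs more registers per thread directly reduces the number of threads available to hide memory latency.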

CTM Introduction
– CTM is the Close-To-the-Metal interface
– CTM lets developers see a data-parallel machine, not a vertex/pixel processor
– CTM removes all unnecessary driver overhead, but exposes all the special features of the processor
– CTM is a system developer interface, not an end-user view
– AMD is building an application interface on top of CTM

AMD Accelerated Computing Software

Random number generation
– Pseudorandom number generation on the GPU, Sussman et al., Graphics Hardware 2006
– Combined explicit inversive congruential generator
– Some issues:
  – Explicit, so no state needs to be saved
  – Slow to compute
  – Not proven that the parallel random generators are uncorrelated
– 10 times CPU speed (but slower than a CPU using a more standard algorithm)

Hardware issues that determined the algorithm:
– 16 outputs per thread (removed)
– No integer operations (removed)
– Inexact division (partially removed – no denormals)
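For reference, a minimal scalar sketch of a single explicit inversive congruential generator (EICG); the parameters here are illustrative, and this is not the combined generator of the Sussman et al. paper. Because the n-th output depends only on the index n, no state needs to be saved, which is the "explicit" property the slide calls out:

```python
def eicg(n, a=7, c=3, p=2147483647):
    """Explicit inversive congruential generator: the n-th output is
    the modular inverse of (a*n + c) mod p for a prime p (here the
    Mersenne prime 2**31 - 1), scaled to [0, 1). Stateless: any index
    can be computed directly, which suits a massively parallel GPU."""
    x = (a * n + c) % p
    if x == 0:
        return 0.0
    # modular inverse via Fermat's little theorem: x**(p-2) mod p
    return pow(x, p - 2, p) / p
```

The inversion step is the expensive part, which matches the slide's "slow to compute" remark; on 2006 hardware it also had to be done without integer operations.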

One year later
– Can implement the Mersenne twister algorithm, which requires integer ops + saved state
– The parallel version can be proved uncorrelated
– 16k parallel generators
– Approximately 100 times faster (GPU vs. CPU) on current hardware
– Current limitation: the PCI bus – getting data into and out of the GPU

Observations
– Random number generation moved from an odd-ball technique to what might be overkill for simulation
– In one generation the GPU got both faster and better!
– Unlike CPUs, GPU systems are rapidly changing; you may need to recode algorithms each hardware revision
– What remained constant: you need massive parallelism + lots of arithmetic intensity

Disclaimer & Attribution

DISCLAIMER
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes. AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2007 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, ATI, the ATI logo, Avivo, Catalyst, Radeon, The Ultimate Visual Experience and combinations thereof are trademarks of Advanced Micro Devices, Inc. DirectX, Microsoft, Windows and Vista are registered trademarks of Microsoft Corporation in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.