ATI Stream Computing ATI Radeon™ HD 3800/4800 Series GPU Hardware Overview Micah Villmow May 30, 2008.

| ATI Stream Computing Update | Confidential | ATI Stream Computing – ATI Radeon™ HD 3800/4800 Series GPU Hardware Overview

Outline
- ATI Radeon™ HD 3800 Series GPU: what changed
- ATI Radeon™ HD 3400/3600 Series and X2 GPU variants
- ATI Radeon™ HD 4800: a new architecture?
- Compute Shader: a new paradigm

ATI Radeon™ HD 3800 Series GPU – What Changed
- Double Precision
- Memory Controller Modifications
- Tex Modifications
- Linear Memory
- Global Buffer support
- Limited Render backends

ALU Hardware – Double Precision
- Combine thin pipes together to produce double
- Combines two F32 components for F64
  - MSB is in y/w component
  - LSB is in x/z component
- Two pipe instructions:
  - DADD
  - Double-Float Conversion Ops
  - DLDEXP
  - DFRAC
  - Double Comparison Ops
- Four pipe instructions:
  - DFREXP
  - DMUL
  - DMAD

IL:
    dmad r10.xy__, r0.xy, r5.xy, r10.xy
ISA:
    21 x: MULADD_64 T0.x, R5.y, R1.y, T0.y
       y: MULADD_64 T0.y, R5.y, R1.y, T0.y
       z: MULADD_64 ____, R5.y, R1.y, T0.y
       w: MULADD_64 ____, R5.x, R1.x, T0.x
       t: MULADD R4.y, R5.z, R3.z, T0.z

IL:
    dadd r10.xy__, r0.xy, r5.xy
    dadd r10.__zw, r0.zw, r5.zw
ISA:
    20 x: ADD_64 T3.x, R3.y, R1.y
       y: ADD_64 T3.y, R3.x, R1.x
       z: ADD_64 T3.z, R3.w, R1.w VEC_120
       w: ADD_64 T3.w, R3.z, R1.z
       t: MULADD T0.w, R4.y, R1.x, T0.w
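The component packing described on this slide (MSB word in y/w, LSB word in x/z) can be illustrated on the host side. The following is a minimal Python sketch, not AMD code: the helper names `split_double` and `join_double` are hypothetical, and it assumes the F64 value is a standard IEEE-754 double split into two 32-bit words.

```python
import struct

def split_double(value):
    """Return (lsb_word, msb_word) for a 64-bit IEEE-754 double.

    Assumption for illustration: lsb_word corresponds to the x (or z)
    component and msb_word to the y (or w) component.
    """
    bits = struct.unpack("<Q", struct.pack("<d", value))[0]
    return bits & 0xFFFFFFFF, bits >> 32

def join_double(lsb_word, msb_word):
    """Rebuild the double from its two 32-bit halves."""
    bits = (msb_word << 32) | lsb_word
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

x, y = split_double(1.0)
print(hex(x), hex(y))  # prints: 0x0 0x3ff00000
```

The sign, exponent, and top mantissa bits all land in the MSB word, which is consistent with the y component appearing in most of the `_64` operand slots above.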

Memory Hardware – Memory Controller
- Die-shrink from 80nm to 55nm
- 512-bit ring bus, 256r/256w
- 72 GB/s bandwidth peak
- 32-bit memory channels
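A peak bandwidth figure like the 72 GB/s above is normally the product of the external bus width and the per-pin transfer rate. The sketch below shows that arithmetic only. The 256-bit external interface and the 2.25 GT/s data rate are assumptions chosen for illustration (they are not stated on the slide), and `peak_bandwidth_gb_s` is a hypothetical helper.

```python
def peak_bandwidth_gb_s(bus_width_bits, transfers_per_second):
    """Peak bandwidth in GB/s: bytes per transfer times transfers per second."""
    return bus_width_bits / 8 * transfers_per_second / 1e9

# Assumed values, not from the slide: 256-bit interface at 2.25 GT/s.
print(peak_bandwidth_gb_s(256, 2.25e9))  # prints: 72.0
```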

Memory Hardware – Texture Unit
- Four 32KB four-way associative L1 caches
- L1 cache size is 4x8KB per SIMD engine
- Data is split across all four 8KB L1 caches
- L1 cache line is 128 bytes, or 2 quads of data
- 256KB unified cache over all SIMDs

Memory Hardware – Linear Layout

Tiled Layout
[Figure: surface with pitch, width, and height marked]
- Possible wasted space between width and pitch
- Euclidean coordinates for addressing
- Macro-micro tiling format is non-linear
- Outputs through color buffer backend

Linear Layout
[Figure: surface with pitch and height marked]
- Addressable space is pitch * height
- No wasted space in allocated texture
- Linear macro tiling format
- Outputs through SMX
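The difference between width and pitch can be made concrete with a little arithmetic. This is a rough Python sketch under stated assumptions: the 64-element pitch alignment is hypothetical, and the function names are illustrative rather than part of any ATI API.

```python
def padded_pitch(width, alignment):
    """Round width up to the next multiple of alignment (assumed padding rule)."""
    return -(-width // alignment) * alignment

def wasted_elements(width, pitch, height):
    """Elements allocated between width and pitch over the whole surface."""
    return (pitch - width) * height

def linear_offset(x, y, pitch):
    """Element offset of (x, y) in a linear layout: rows are pitch elements apart."""
    return y * pitch + x

pitch = padded_pitch(100, 64)          # hypothetical 64-element alignment
print(pitch)                           # prints: 128
print(wasted_elements(100, pitch, 10)) # prints: 280
print(linear_offset(3, 2, pitch))      # prints: 259
```

In the linear case the whole pitch * height range is addressable, so a kernel that indexes the buffer directly can use the padding instead of leaving it idle.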

Memory Hardware – RB Changes
[Figure: connection between the memory controller and the DPP array, ATI Radeon™ HD 2900 Series GPU versus ATI Radeon™ HD 3800 Series GPU]

ATI Radeon™ HD 4800 Series GPUs - Improvements
- 2.5x more floating point compute power than ATI Radeon™ HD 3800 Series GPUs
- Includes all the features added to ATI Radeon™ HD 3800 Series GPUs
- Higher bandwidths w/ GDDR5 memory
- 115GB/s memory bandwidth
- 1.2 Teraflops peak ALU performance
- New compute shader paradigm
- Inter- and Intra- thread sharing

ATI Radeon™ HD 4800 Series GPUs – Architecture Features

ALU Improvements
- 10 SIMD engines
- 16 TPs per SIMD
- 5 streaming cores per TP
- 800 total streaming cores
- Shared global registers

TEX Improvements
- 4 TEX units per SIMD
- 40 total TEX units
- Local data share
- Global data share

MEM Improvements
- 8KB L1 cache per SIMD
- 480 GB/s L1 BW
- 4 32KB L2 caches
- 384 GB/s L2->L1 BW
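The totals on this slide follow directly from the per-SIMD counts, and the 1.2 teraflops peak quoted for this series can be approximated the same way. The Python sketch below is illustrative only: the 750 MHz engine clock and the convention of counting one multiply-add as two floating point operations are assumptions, not values from the slides.

```python
SIMD_ENGINES = 10
TPS_PER_SIMD = 16
CORES_PER_TP = 5
TEX_UNITS_PER_SIMD = 4

def total_streaming_cores():
    return SIMD_ENGINES * TPS_PER_SIMD * CORES_PER_TP

def total_tex_units():
    return SIMD_ENGINES * TEX_UNITS_PER_SIMD

def peak_gflops(cores, clock_mhz, flops_per_core_per_clock=2):
    """Peak GFLOPS, assuming each core retires one multiply-add (2 flops) per clock."""
    return cores * clock_mhz * flops_per_core_per_clock / 1000

print(total_streaming_cores())                    # prints: 800
print(total_tex_units())                          # prints: 40
print(peak_gflops(total_streaming_cores(), 750))  # prints: 1200.0
```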

ATI Radeon™ HD 4800 Series GPUs - Hardware Layout
- Optimized for distributed memory layout and GDDR5
- Various sections (color key for the die layout figure):
  - ALU: Red
  - TEX: Brown
  - MEM: Orange
  - RAM: Green
  - PCIE: Blue
  - Display: Yellow

ATI Radeon™ HD 4800 Series GPUs - ALU Units
- Same ALUs as ATI Radeon™ HD 3800 Series GPUs, just more of them
- Integer shifts on all streaming cores
- Improved double and integer performance
- 16KB on-chip local data share with a write-private, read-anywhere memory model
- Global R/W registers per SIMD
- 32KB on-chip global data share

ATI Radeon™ HD 4800 Series GPUs – Memory Hardware – TEX Units

ATI Radeon™ HD 4800 Series GPU – Memory Hardware – Memory Controller

ATI Radeon™ HD 4800 Series GPU – Memory Hardware – Render Backends
- 4 render backends
- 256-bit memory lines
- Write combining cache
- Global buffer via DB instead of SMX
- Scratch buffer bandwidth doubled
- Scatter bandwidth in line with color writes

Compute Shader – A New Paradigm
- A general approach to the compute paradigm
- Disconnect the output domain from the problem domain
- Gives more control to the shader writer
- Read anywhere, write anywhere
- The new terminology: threads and groups
- Data sharing: shared registers and local data share
- Linear memory format

Compute Shader – A New Paradigm (cont’d)
- Removes graphics-centric terminology and ideas
- An array of parallel processing elements
- Removes graphics pipeline from the picture (no ES, PS, GS, VS etc.)
- Inputs and outputs are disconnected from the output domain
- Domain is now specified by the number of threads to run in a 2D fashion

Compute Shader - Terminology
- Thread: A single invocation of the kernel
- Group: A set number of threads that can share data and run together on a single SIMD. Multiple groups can run on a single SIMD if registers allow
- Shared Registers: Registers that are global to a SIMD
- Local Data Share: 16KB on-chip memory per SIMD shared between threads in a group
- Wavefront: A group of 64 threads run concurrently on a SIMD
- Fence: Synchronization mechanism for threads within a group
  - _threads: Generic barrier that synchronizes all threads to a point
  - _memory: Synchronize threads on global memory accesses
  - _sr: Synchronize on Shared Register access
  - _lds: Synchronize on local data share
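Since a wavefront is 64 threads and a group runs on a single SIMD, the number of wavefronts a group occupies is a ceiling division. A minimal Python sketch of that relationship follows; `wavefronts_per_group` is a hypothetical helper, and treating a partially filled wavefront as a whole one is an assumption.

```python
WAVEFRONT_SIZE = 64  # threads per wavefront, from the terminology above

def wavefronts_per_group(threads_per_group):
    """Wavefronts needed for one group (a partial wavefront counts as one)."""
    return -(-threads_per_group // WAVEFRONT_SIZE)

print(wavefronts_per_group(1024))  # prints: 16
print(wavefronts_per_group(100))   # prints: 2
```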

Data Sharing and Synchronization
- SR: Globally shared registers
  - Sharing between all wavefronts in a SIMD
  - Column sharing on the SIMD
  - Persistent registers
  - Atomicity guaranteed in same instruction
- LDS: Local Data Share
  - Write local, read global system
  - Share between all threads in a group
  - Synchronization required
- New Indexing Values: No more vPos/vWinCoord
  - vTid: ID of thread within a group
  - vaTid: ID of thread within a domain
  - vTgroupid: ID of group within a domain
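The three indexing values are related to each other. The Python sketch below shows one plausible relationship for a one-dimensional domain; it is an assumption made for illustration (the slides do not give the formula), and the function names simply mirror the IL register names.

```python
def absolute_thread_id(group_id, thread_id_in_group, threads_per_group):
    """vaTid from vTgroupid and vTid, assuming a flattened 1D domain."""
    return group_id * threads_per_group + thread_id_in_group

def decompose(absolute_id, threads_per_group):
    """Inverse mapping: (vTgroupid, vTid) from vaTid under the same assumption."""
    return divmod(absolute_id, threads_per_group)

print(absolute_thread_id(2, 5, 64))  # prints: 133
print(decompose(133, 64))            # prints: (2, 5)
```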

Shared Registers
[Figure: Wavefronts 0 through 7 on each of SIMD 0 through SIMD N, with one set of shared registers per SIMD]
- Data is shared between columns of a wavefront per SIMD
- Accesses in the same ALU clause are atomic, indexing is not allowed
- Shared registers are carved out of the register pool
- Same as accessing normal registers

IL SR Usage

    il_cs_2_0
    dcl_cb cb0[1]
    dcl_shared_temp sr1
    add sr0, sr0, r
    mov g[vaTid0.x], sr0
    ret
    end

Atomic Read-Modify-Write Uses:
- Reductions
  - Max
  - Min
  - Sum
  - Average
- Order Agnostic Data Updates
  - Histogram
  - Global Counters
  - Semaphores
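The uses listed above share one property: the final value does not depend on the order in which wavefronts apply their updates, as long as each update is atomic. The following Python sketch models that idea sequentially. It is a simplification, not a hardware model: `accumulate` is a hypothetical helper, each loop iteration stands in for one atomic read-modify-write, and integer data is assumed so that addition is exactly order independent.

```python
import random

def accumulate(contributions, order):
    """Apply one atomic add per wavefront to a shared register, in the given order."""
    shared_register = 0
    for wavefront in order:
        # Modeled as indivisible: read, add, write back.
        shared_register += contributions[wavefront]
    return shared_register

contributions = [3, 1, 4, 1, 5, 9, 2, 6]  # one partial sum per wavefront
in_order = list(range(len(contributions)))
shuffled = in_order[:]
random.shuffle(shuffled)

# Any arrival order yields the same total.
print(accumulate(contributions, in_order) == accumulate(contributions, shuffled))  # prints: True
```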

Local Data Share
- 16KB of memory per SIMD, 4 banks of 1k Dwords, max 128 per thread
- Write address is based on thread ID, and offsets are static
- Reads are done by thread ID + offset
- Dispatches one write command every cycle
- Dispatches read over four cycles with waterfall cycle latency that needs to be hidden by ALU
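The 16KB figure is consistent with the bank organization: 4 banks of 1k Dwords at 4 bytes each. The Python sketch below checks that arithmetic and whether a given group configuration fits. It assumes the per-thread allocation is counted in Dwords and that a single group has the whole LDS to itself; `fits_in_lds` is a hypothetical helper.

```python
LDS_BANKS = 4
DWORDS_PER_BANK = 1024
BYTES_PER_DWORD = 4

def lds_capacity_bytes():
    return LDS_BANKS * DWORDS_PER_BANK * BYTES_PER_DWORD

def fits_in_lds(threads_per_group, dwords_per_thread):
    """True if one group's allocation fits (assumes sizes are in Dwords)."""
    needed = threads_per_group * dwords_per_thread * BYTES_PER_DWORD
    return needed <= lds_capacity_bytes()

print(lds_capacity_bytes())    # prints: 16384
print(fits_in_lds(1024, 4))    # prints: True
print(fits_in_lds(1024, 8))    # prints: False
```

Under these assumptions a group of 1024 threads with 4 Dwords each uses the LDS exactly.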

LDS
[Figure: on each of SIMD 0 through SIMD N, Group 1 and Group 2 each contain Wavefronts 0 through 3 and use that SIMD's LDS memory. One panel is labeled "Write self only", the other "Read Any".]
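The "write self only, read any" rule can be modeled in a few lines. This is a toy Python model, not the hardware behavior: the class `LocalDataShare` and its methods are hypothetical, and the barrier between the write phase and the read phase is represented simply by finishing all writes before any read.

```python
class LocalDataShare:
    """Toy LDS: each thread owns a fixed slice, any thread may read any slice."""

    def __init__(self, num_threads, dwords_per_thread):
        self.dwords_per_thread = dwords_per_thread
        self.mem = [0] * (num_threads * dwords_per_thread)

    def write(self, thread_id, offset, value):
        # A thread can only address its own slice: base comes from its own ID.
        if not 0 <= offset < self.dwords_per_thread:
            raise IndexError("offset outside this thread's slice")
        self.mem[thread_id * self.dwords_per_thread + offset] = value

    def read(self, owner_thread_id, offset):
        # Reads may target any thread's slice.
        return self.mem[owner_thread_id * self.dwords_per_thread + offset]

lds = LocalDataShare(num_threads=4, dwords_per_thread=1)
for tid in range(4):                 # write phase: each thread writes its own slot
    lds.write(tid, 0, tid * 10)
# (a fence on the LDS would go here on real hardware)
neighbors = [lds.read((tid + 1) % 4, 0) for tid in range(4)]  # read phase
print(neighbors)  # prints: [10, 20, 30, 0]
```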

IL LDS Usage

    il_cs_2_0
    dcl_cb cb0[1]
    dcl_num_thread_per_group 1024
    dcl_lds_size_per_thread 4
    dcl_lds_sharing_mode _wavefrontRel
    dcl_literal l0, 0x0, 0x04, 0x8, 0x1
    mov r0, cb0[0].xxxx
    lds_write_vec mem, vTid0.x
    iadd r0, r0, vTid0.x0xx
    lds_read_vec_sharingMode(abs) r2, r0.x0
    mov g[vaTid0.x], r2
    ret
    end

Disclaimer & Attribution

DISCLAIMER
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes. AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2010 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, ATI, the ATI Logo, Radeon, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other names are for informational purposes only and may be trademarks of their respective owners.