1
GPGPUs - Data Parallel Accelerators - Part 1. Dezső Sima, February 2011. © Dezső Sima 2011. Ver. 1.0 (updated 10/02/2011)
2
1. Introduction (1) Aim: an introduction and overview in a very limited time
3
Contents
1. Introduction
2. Basics of the SIMT execution
3. Overview of GPGPUs
4. Overview of data parallel accelerators
5. Microarchitecture of GPGPUs (examples)
5.1 AMD/ATI RV870 (Cypress)
5.2 Nvidia Fermi
6. Integrated CPU/GPUs
6.1 AMD Fusion APU line
6.2 Intel Sandy Bridge
7. References
4
1. The emergence of GPGPUs
5
Representation of objects by triangles: vertices, edges, surfaces. Vertices have three spatial coordinates and carry supplementary information necessary to render the object, such as color, texture, reflectance properties, etc. 1. Introduction (2)
6
Main types of shaders in GPUs:
- Vertex shaders: transform each vertex's 3D position in the virtual space to the 2D coordinate at which it appears on the screen
- Pixel shaders (fragment shaders): calculate the color of the pixels
- Geometry shaders: can add or remove vertices from a mesh
1. Introduction (3)
7
Table: Pixel/vertex shader models (SM) supported by subsequent versions of DirectX and MS's OSs [18], [21]

DirectX version | Pixel SM | Vertex SM | Supporting OS
8.0 (11/2000) | 1.0, 1.1 | 1.0, 1.1 | Windows 2000
8.1 (10/2001) | 1.2, 1.3, 1.4 | 1.0, 1.1 | Windows XP / Windows Server 2003
9.0 (12/2002) | 2.0 | 2.0 |
9.0a (3/2003) | 2_A, 2_B | 2.x |
9.0c (8/2004) | 3.0 | 3.0 | Windows XP SP2
10.0 (11/2006) | 4.0 | 4.0 | Windows Vista
10.1 (2/2008) | 4.1 | 4.1 | Windows Vista SP1 / Windows Server 2008
11 (10/2009) | 5.0 | 5.0 | Windows 7

DirectX: Microsoft's API set for multimedia/3D. 1. Introduction (4)
8
Convergence of important features of the vertex and pixel shader models. Subsequent shader models typically introduce a number of new or enhanced features; below, the differences between the vertex and pixel shader models in subsequent shader models concerning precision requirements, instruction sets and programming resources.

Shader model 2 [19]
- Different precision requirements: vertex shader FP32 (coordinates), pixel shader FX24 (3 colors x 8)
- Different instructions
- Different resources (e.g. registers)

Shader model 3 [19]
- Unified precision requirements for both shaders (FP32), with the option to specify partial precision (FP16 or FP24) by adding a modifier to the shader code
- Different instructions
- Different resources (e.g. registers)

1. Introduction (3)
9
Shader model 4 (introduced with DirectX 10) [20]
- Unified precision requirements for both shaders (FP32), with the possibility to use new data formats
- Unified instruction set
- Unified resources (e.g. temporary and constant registers)

Shader architectures of GPUs prior to SM4: GPUs prior to SM4 (DirectX 10) have separate vertex and pixel units with different features. Drawback of having separate units for vertex and pixel shading: inefficiency of the hardware implementation (vertex shaders and pixel shaders often have complementary load patterns [21]).

1. Introduction (3)
10
Unified shader model (introduced in SM 4.0 of DirectX 10.0): the same (programmable) processor can be used to implement all shaders: the vertex shader, the pixel shader and the geometry shader (a new feature of SM 4). A unified, programmable shader architecture. 1. Introduction (5)
11
Figure: Principle of the unified shader architecture [22] 1. Introduction (6)
12
Based on their FP32 computing capability and the large number of FP units available, unified shaders are prospective candidates for speeding up HPC! GPUs with unified shader architectures are also termed GPGPUs (General Purpose GPUs) or cGPUs (computational GPUs). 1. Introduction (7)
13
Figure: Peak SP FP performance of Nvidia's GPUs vs Intel's P4 and Core2 processors [11] 1. Introduction (8)
14
Figure: Bandwidth values of Nvidia's GPUs vs Intel's P4 and Core2 processors [11] 1. Introduction (9)
15
Figure: Contrasting the utilization of the silicon area in CPUs and GPUs [11] 1. Introduction (10)
16
2. Basics of the SIMT execution
17
Main alternatives of data parallel execution

- SIMD execution: one-dimensional data parallel execution, i.e. it performs the same operation on all elements of given FX/FP input vectors. Needs an FX/FP SIMD extension of the ISA. E.g. 2nd and 3rd generation superscalars.
- SIMT execution: two-dimensional data parallel execution, i.e. it performs the same operation on all elements of given FX/FP input arrays (matrices); is massively multithreaded, and provides data dependent flow control as well as barrier synchronization. Needs an FX/FP SIMT extension of the ISA and the API. E.g. GPGPUs, data parallel accelerators.

Figure: Main alternatives of data parallel execution

2. Basics of the SIMT execution (1)
18
Scalar, SIMD and SIMT execution

- Scalar execution: domain of execution is single data elements
- SIMD execution: domain of execution is elements of vectors
- SIMT execution: domain of execution is elements of matrices (at the programming level)

Figure: Domains of execution in case of scalar, SIMD and SIMT execution

Remark: SIMT execution is also termed SPMD (Single_Program Multiple_Data) execution (Nvidia).

2. Basics of the SIMT execution (2)
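To make the three execution domains concrete, here is a minimal sketch (illustrative only, not from the slides) of the same element-wise addition written for scalar, SIMD (host-side SSE intrinsics) and SIMT (CUDA) execution; all function names are hypothetical.

    #include <xmmintrin.h>   // SSE intrinsics for the host-side SIMD variant

    // Scalar execution: operates on single data elements, one per iteration.
    void add_scalar(float *c, const float *a, const float *b, int n) {
        for (int i = 0; i < n; ++i)
            c[i] = a[i] + b[i];
    }

    // SIMD execution: the same add applied to a 4-element vector by one instruction.
    void add_simd4(float *c, const float *a, const float *b) {
        _mm_storeu_ps(c, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
    }

    // SIMT execution: one thread per element of a 2-dimensional array;
    // the grid of threads spans the whole matrix.
    __global__ void add_simt(float *c, const float *a, const float *b,
                             int width, int height) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;   // column index
        int y = blockIdx.y * blockDim.y + threadIdx.y;   // row index
        if (x < width && y < height) {
            int i = y * width + x;
            c[i] = a[i] + b[i];                          // same operation on every element
        }
    }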
19
Key components of the implementation of SIMT execution:
- Data parallel execution
- Massive multithreading
- Data dependent flow control
- Barrier synchronization

2. Basics of the SIMT execution (3)
20
Data parallel execution: performed by SIMT cores. SIMT cores execute the same instruction stream on a number of ALUs (i.e. all ALUs of a SIMT core typically perform the same operation). SIMT cores are the basic building blocks of GPGPUs or data parallel accelerators. During SIMT execution, 2-dimensional matrices are mapped to blocks of SIMT cores.

Figure: Basic layout of a SIMT core (a common Fetch/Decode unit feeding a number of ALUs)

2. Basics of the SIMT execution (4)
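As a hedged illustration of this mapping (tile size and variable names assumed, not from the slides): a host-side launcher that tiles an H x W matrix into 16x16 thread blocks, each of which gets scheduled onto a SIMT core, reusing the hypothetical add_simt kernel from the sketch above.

    // Hypothetical host-side launcher: tiles an H x W matrix into 16x16 blocks.
    void launch_add(float *d_c, const float *d_a, const float *d_b, int W, int H) {
        dim3 block(16, 16);                      // 256 threads share one instruction stream
        dim3 grid((W + block.x - 1) / block.x,   // ceil(W/16) blocks across the matrix
                  (H + block.y - 1) / block.y);  // ceil(H/16) blocks down the matrix
        add_simt<<<grid, block>>>(d_c, d_a, d_b, W, H);  // one thread per matrix element
    }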
21
Remark 1: Different manufacturers designate SIMT cores differently, such as streaming multiprocessor (Nvidia), superscalar shader processor (AMD), wide SIMD processor or CPU core (Intel).

2. Basics of the SIMT execution (5)
22
Each ALU is allocated a working register set (RF).

Figure: Main functional blocks of a SIMT core (Fetch/Decode unit, ALUs, per-ALU register files)

2. Basics of the SIMT execution (6)
23
SIMT ALUs typically perform RRR operations, that is, ALUs take their operands from, and write the calculated results to, the register set (RF) allocated to them.

Figure: Principle of operation of the SIMT ALUs

2. Basics of the SIMT execution (7)
24
Remark 2: Actually, the register sets (RF) allocated to the individual ALUs are given parts of a single, large enough register file.

Figure: Allocation of distinct parts of a large register set as workspaces of the ALUs

2. Basics of the SIMT execution (8)
25
Basic operation of recent SIMT ALUs. They
- are pipelined, capable of starting a new operation every clock cycle (more precisely, every shader clock cycle),
- execute basically SP FP MADD (single precision, i.e. 32-bit, Multiply-Add) instructions of the form a×b+c,
- need a few clock cycles, e.g. 2 or 4 shader cycles, to present the results of the SP FP MADD operations to the RF.

That is, without further enhancements their peak performance is 2 SP FP operations/cycle (a MADD counts as two FP operations: one multiply and one add).

2. Basics of the SIMT execution (9)
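A minimal CUDA sketch of this basic operation (kernel name hypothetical): each thread issues one SP FP multiply-add of the form a×b+c; fmaf() is the standard device function that maps to the ALU's multiply-add datapath.

    __global__ void madd_kernel(float *d, const float *a, const float *b,
                                const float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            d[i] = fmaf(a[i], b[i], c[i]);   // one MADD = 2 FP operations (mul + add)
    }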
26
Additional operations provided by SIMT ALUs:
- FX operations and FX/FP conversions,
- DP FP operations,
- trigonometric functions (usually supported by special functional units).

2. Basics of the SIMT execution (10)
27
Massive multithreading

Aim: to speed up computations by increasing the utilization of available computing resources in case of stalls (e.g. due to cache misses).

Principle: suspend stalled threads and allocate ready-to-run threads for execution. When a large enough number of threads is available, long stalls can be hidden; an illustrative calculation follows below.

2. Basics of the SIMT execution (11)
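As an illustrative back-of-the-envelope calculation (numbers assumed, not from the slides): if a memory access stalls a thread for roughly 400 cycles and a thread executes about 20 instructions between such accesses, then about 400 / 20 = 20 ready threads must be available per ALU to cover each stall, i.e. the core has to keep an order of magnitude more threads in flight than it has ALUs.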
28
Multithreading is implemented by creating and managing parallel executable threads for each data element of the execution domain; the same instructions are issued for all data elements.

Figure: Parallel executable threads for each element of the execution domain

2. Basics of the SIMT execution (12)
29
Multithreading is implemented effectively if thread switches, called context switches, do not cause cycle penalties. This is achieved by
- providing separate contexts (register space) for each thread, and
- implementing a zero-cycle context switch mechanism.

2. Basics of the SIMT execution (13)
30
Figure: Providing separate thread contexts for each thread allocated for execution in a SIMT ALU (the register file (RF) is partitioned into per-thread contexts; a context switch merely redirects the ALU to the actual context)

2. Basics of the SIMT execution (14)
31
Data dependent flow control: implemented by SIMT branch processing. In SIMT processing both paths of a branch are executed one after the other, such that for each path the prescribed operations are executed only on those data elements which fulfill the data condition given for that path (e.g. x_i > 0). Example (see the figures and the sketch below):

2. Basics of the SIMT execution (15)
32
Figure: Execution of branches [24]. The given condition is checked separately for each thread.

2. Basics of the SIMT execution (16)
33
Figure: Execution of branches [24]. First, all ALUs meeting the condition execute the prescribed three operations; then all ALUs missing the condition execute the next two operations.

2. Basics of the SIMT execution (17)
34
Figure: Resuming instruction stream processing after executing a branch [24] 2. Basics of the SIMT execution (18)
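The divergence pattern of the preceding figures corresponds to a kernel of the following shape (a hedged sketch; the operations and names are hypothetical): threads satisfying the condition execute the first path while the others idle, then the roles are reversed, and the common instruction stream resumes after the branch.

    __global__ void branch_demo(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            if (x[i] > 0.0f) {
                x[i] = x[i] * 2.0f + 1.0f;   // executed first, only by threads meeting the condition
            } else {
                x[i] = -x[i];                // executed next, by the remaining threads
            }
            x[i] += 1.0f;                    // all threads resume the common stream here
        }
    }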
35
Barrier synchronization: makes all threads wait until every thread has completed all prior instructions before executing the next instruction. Implemented e.g. in AMD's Intermediate Language (IL) by the fence threads instruction [10].

Remark: In the R600 ISA this instruction is encoded by setting the BARRIER field of the Control Flow (CF) instruction format [7].

2. Basics of the SIMT execution (19)
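For comparison (not the AMD IL mechanism the slide refers to): in CUDA the corresponding intra-block barrier is __syncthreads(). A minimal sketch, assuming 256-thread blocks; no thread proceeds past the barrier until every thread of its block has completed the preceding writes.

    __global__ void barrier_demo(float *out, const float *in) {
        __shared__ float tile[256];          // one element per thread of the block
        int i = threadIdx.x;
        int base = blockIdx.x * blockDim.x;
        tile[i] = in[base + i];              // each thread writes its own element
        __syncthreads();                     // barrier: wait for all prior writes
        out[base + i] = tile[255 - i];       // safe: the whole tile is populated
    }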
36
Principle of SIMT execution: the host invokes kernels on the device, e.g. kernel0<<<...>>>() followed by kernel1<<<...>>>(); each kernel invocation executes all of its thread blocks (Block(i,j)).

Figure: Hierarchy of threads [25]

2. Basics of the SIMT execution (20)
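A hedged host-side sketch of this hierarchy (kernel bodies, grid size and variable names are hypothetical): each invocation launches a grid in which all thread blocks Block(i,j) are executed, and kernels issued on the same (default) stream run one after the other.

    // Hypothetical kernels standing in for kernel0/kernel1 in the figure.
    __global__ void kernel0(float *d, int w) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        d[y * w + x] *= 2.0f;                    // Block(i,j) covers one 16x16 tile
    }
    __global__ void kernel1(float *d, int w) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        d[y * w + x] += 1.0f;
    }

    void run(float *d_data) {                    // d_data: a 128x128 device array
        dim3 grid(8, 8), block(16, 16);          // 8x8 blocks Block(i,j), 256 threads each
        kernel0<<<grid, block>>>(d_data, 128);   // executes all 64 thread blocks
        kernel1<<<grid, block>>>(d_data, 128);   // starts after kernel0 completes
        cudaDeviceSynchronize();                 // host waits for the device
    }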
37
3. Overview of GPGPUs
38
Basic implementation alternatives of the SIMT execution

- GPGPUs: programmable GPUs with appropriate programming environments. E.g. Nvidia's 8800 and GTX lines, AMD's HD 38xx and HD 48xx lines. Have display outputs.
- Data parallel accelerators: dedicated units supporting data parallel execution with an appropriate programming environment. E.g. Nvidia's Tesla lines, AMD's FireStream lines. No display outputs; have larger memories than GPGPUs.

Figure: Basic implementation alternatives of the SIMT execution

3. Overview of GPGPUs (1)
39
GPGPUs Nvidia’s line AMD/ATI’s line Figure: Overview of Nvidia’s and AMD/ATI’s GPGPU lines 90 nm G80 65 nm G92G200 Shrink Enhanced arch. 80 nm R600 55 nm RV670RV770 Shrink Enhanced arch. 3. Overview of GPGPUs (2) 40 nm GF100 (Fermi) Shrink 40 nm RV870 Shrink Enhanced arch. Enhanced arch.
40
Figure: Overview of GPGPUs (1): cores, cards and programming environments, 2005-2008

Nvidia:
- G80 (11/06, 90 nm/681 mtrs): 8800 GTS (96 ALUs, 320-bit), 8800 GTX (128 ALUs, 384-bit)
- G92 (10/07, 65 nm/754 mtrs): 8800 GT (112 ALUs, 256-bit)
- GT200 (6/08, 65 nm/1400 mtrs): GTX260 (192 ALUs, 448-bit), GTX280 (240 ALUs, 512-bit)
- CUDA: Version 1.0 (6/07), Version 1.1 (11/07), Version 2.0 (6/08)

AMD/ATI:
- R500 (11/05, Xbox)
- R600 (5/07, 80 nm/700 mtrs): HD 2900XT (320 ALUs, 512-bit)
- R670 (11/07, 55 nm/666 mtrs): HD 3850 (320 ALUs, 256-bit), HD 3870 (320 ALUs, 256-bit)
- RV770 (5/08, 55 nm/956 mtrs): HD 4850 (800 ALUs, 256-bit), HD 4870 (800 ALUs, 256-bit)
- Brook+ (11/07), Brook+ 1.2 (9/08), Brook+ 1.3 (12/08)
- RapidMind: support for the 3870 (6/08)
- OpenCL: OpenCL 1.0 specification (12/08)

3. Overview of GPGPUs (3)
41
Figure: Overview of GPGPUs (2): cores, cards and programming environments, 2009-2011

Nvidia:
- GF100 (Fermi) (3/10, 40 nm/3000 mtrs): GTX 470 (448 ALUs, 320-bit), GTX 480 (480 ALUs, 384-bit)
- GF104 (Fermi) (7/10, 40 nm/1950 mtrs): GTX 460 (336 ALUs, 192/256-bit)
- GF110 (Fermi) (11/10, 40 nm/3000 mtrs): GTX 580 (512 ALUs, 384-bit), GTX 560 Ti (1/11)
- CUDA: Version 2.1, Version 2.2 (5/09), Version 3.0 (3/10), Version 3.1 (6/10), Version 3.2 (1/11)

AMD/ATI:
- RV870 (9/09, 40 nm/2100 mtrs): HD 5850/70 (1440/1600 ALUs, 256-bit)
- Barts Pro/XT (10/10, 40 nm/1700 mtrs): HD 6850/70 (960/1120 ALUs, 256-bit)
- Brook+ 1.4 (3/09)
- RapidMind: Intel bought RapidMind (8/09)
- OpenCL: OpenCL 1.1 (6/10)

3. Overview of GPGPUs (4)
42
3. Overview of GPGPUs (5) Feature support in CUDA versions 1.x/2.x Source: CUDA Wiki Consolidation of the programming environment?
43
3. Overview of GPGPUs (6) CUDA specs 1.x/2.x Source: CUDA Wiki
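A small runtime-API sketch (device index assumed to be 0) for reading the compute capability version these 1.x/2.x tables refer to:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // query device 0
        std::printf("Compute capability %d.%d\n", prop.major, prop.minor);
        return 0;
    }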
44
Table: Main features of Nvidia's GPGPUs

Feature | 8800 GTS | 8800 GTX | 8800 GT | GTX 260 | GTX 280
Core | G80 | G80 | G92 | GT200 | GT200
Introduction | 11/06 | 11/06 | 10/07 | 6/08 | 6/08
IC technology | 90 nm | 90 nm | 65 nm | 65 nm | 65 nm
Nr. of transistors | 681 mtrs | 681 mtrs | 754 mtrs | 1400 mtrs | 1400 mtrs
Die area | 480 mm2 | 480 mm2 | 324 mm2 | 576 mm2 | 576 mm2
Core frequency | 500 MHz | 575 MHz | 600 MHz | 576 MHz | 602 MHz
No. of ALUs | 96 | 128 | 112 | 192 | 240
Shader frequency | 1.2 GHz | 1.35 GHz | 1.512 GHz | 1.242 GHz | 1.296 GHz
No. of FP32 inst./cycle | 3* | 3* | 3* | 3 | 3
Peak FP32 performance | 346 GFLOPS | 512 GFLOPS | 508 GFLOPS | 715 GFLOPS | 933 GFLOPS
Peak FP64 performance | – | – | – | – | 77.76 GFLOPS
Mem. transfer rate (eff.) | 1600 Mb/s | 1800 Mb/s | 1800 Mb/s | 1998 Mb/s | 2214 Mb/s
Mem. interface | 320-bit | 384-bit | 256-bit | 448-bit | 512-bit
Mem. bandwidth | 64 GB/s | 86.4 GB/s | 57.6 GB/s | 111.9 GB/s | 141.7 GB/s
Mem. size | 320 MB | 768 MB | 512 MB | 896 MB | 1.0 GB
Mem. type | GDDR3 | GDDR3 | GDDR3 | GDDR3 | GDDR3
Mem. channels | 5x64-bit | 6x64-bit | 4x64-bit | 7x64-bit | 8x64-bit
Mem. controller | Crossbar | Crossbar | Crossbar | Crossbar | Crossbar
Multi-GPU techn. | SLI | SLI | SLI | SLI | SLI
Interface | PCIe x16 | PCIe x16 | PCIe 2.0 x16 | PCIe 2.0 x16 | PCIe 2.0 x16
MS DirectX | 10 | 10 | 10 | 10.1 subset | 10.1 subset

* but only in a few issue cases

3. Overview of GPGPUs (7)
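As a worked check of the peak FP32 column: the GTX 280 entry follows from 240 ALUs × 1.296 GHz × 3 FP32 operations per cycle ≈ 933 GFLOPS, the factor 3 counting the MADD (two operations) plus the co-issued MUL the table's footnote alludes to.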
45
Table: Main features of AMD/ATI's GPGPUs

Feature | HD 2900XT | HD 3850 | HD 3870 | HD 4850 | HD 4870
Core | R600 | R670 | R670 | RV770 | RV770
Introduction | 5/07 | 11/07 | 11/07 | 5/08 | 5/08
IC technology | 80 nm | 55 nm | 55 nm | 55 nm | 55 nm
Nr. of transistors | 700 mtrs | 666 mtrs | 666 mtrs | 956 mtrs | 956 mtrs
Die area | 408 mm2 | 192 mm2 | 192 mm2 | 260 mm2 | 260 mm2
Core frequency | 740 MHz | 670 MHz | 775 MHz | 625 MHz | 750 MHz
No. of ALUs | 320 | 320 | 320 | 800 | 800
Shader frequency | 740 MHz | 670 MHz | 775 MHz | 625 MHz | 750 MHz
No. of FP32 inst./cycle | 2 | 2 | 2 | 2 | 2
Peak FP32 performance | 473.6 GFLOPS | 429 GFLOPS | 496 GFLOPS | 1000 GFLOPS | 1200 GFLOPS
Peak FP64 performance | – | – | – | 200 GFLOPS | 240 GFLOPS
Mem. transfer rate (eff.) | 1600 Mb/s | 1660 Mb/s | 2250 Mb/s | 2000 Mb/s | 3600 Mb/s (GDDR5)
Mem. interface | 512-bit | 256-bit | 256-bit | 256-bit | 256-bit
Mem. bandwidth | 105.6 GB/s | 53.1 GB/s | 72.0 GB/s | 64 GB/s | 118 GB/s
Mem. size | 512 MB | 256 MB | 512 MB | 512 MB | 512 MB
Mem. type | GDDR3 | GDDR3 | GDDR4 | GDDR3 | GDDR3/GDDR5
Mem. channels | 8x64-bit | 8x32-bit | 8x32-bit | 4x64-bit | 4x64-bit
Mem. controller | Ring bus | Ring bus | Ring bus | Crossbar | Crossbar
Multi-GPU techn. | CrossFire | CrossFire X | CrossFire X | CrossFire X | CrossFire X
Interface | PCIe x16 | PCIe 2.0 x16 | PCIe 2.0 x16 | PCIe 2.0 x16 | PCIe 2.0 x16
MS DirectX | 10 | 10.1 | 10.1 | 10.1 | 10.1

3. Overview of GPGPUs (8)
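Similarly, for the AMD table: the HD 4870 entry follows from 800 ALUs × 750 MHz × 2 FP32 operations per cycle (one MADD) = 1200 GFLOPS; note that the FP64 figures of the RV770 cards are one fifth of the corresponding FP32 values.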
46
Price relations (as of 10/2008): Nvidia GTX 260 ~$300, GTX 280 ~$600; AMD/ATI HD 4850 ~$200, HD 4870 n/a. 3. Overview of GPGPUs (9)