1 Prepared 10/11/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.

2 To understand the major factors that dictate performance when using a GPU as a compute accelerator for the CPU: the feeds and speeds of the traditional CPU world, and the feeds and speeds when employing a GPU. To form a solid knowledge base for performance programming on modern GPUs: knowing yesterday, today, and tomorrow; the PC world is becoming flatter; outsourcing of computation is becoming easier… Architectural Considerations – Slide 2

3 Topic 1 (next): The GPU as Part of the PC Architecture Topic 2: Threading Hardware in the G80 Topic 3: Memory Hardware in the G80 Architectural Considerations – Slide 3

4 Global variables declaration; function prototypes – __global__ void kernelOne(…). Main(): allocate memory space on the device – cudaMalloc(&d_GlblVarPtr, bytes); transfer data from host to device – cudaMemcpy(d_GlblVarPtr, h_Gl…); execution configuration setup; kernel call – kernelOne<<<execution configuration>>>(args…); transfer results from device to host – cudaMemcpy(h_GlblVarPtr, …); optional: compare against golden (host-computed) solution; repeat as needed. Kernel – void kernelOne(type args, …): variables declaration (__local__, __shared__); automatic variables transparently assigned to registers or local memory; __syncthreads()… Architectural Considerations – Slide 4
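A minimal compilable sketch of this skeleton follows; the names h_data, d_data, the size N and the kernel body are illustrative placeholders, not part of the slide.

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void kernelOne(float *data, int n) {
    __shared__ float tile[256];                    // per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = data[i];               // stage element in shared memory
        __syncthreads();                           // barrier within the block
        data[i] = tile[threadIdx.x] * 2.0f;        // trivial placeholder computation
    }
}

int main(void) {
    const int N = 1024, bytes = N * sizeof(float);
    float h_data[N];
    for (int i = 0; i < N; ++i) h_data[i] = (float)i;

    float *d_data;
    cudaMalloc(&d_data, bytes);                                  // allocate on the device
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);   // host -> device

    kernelOne<<<N / 256, 256>>>(d_data, N);                      // execution configuration

    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);   // device -> host
    cudaFree(d_data);
    printf("h_data[1] = %f\n", h_data[1]);                       // compare against host result if desired
    return 0;
}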

5 The bandwidth between key components ultimately dictates system performance. This is especially true for massively parallel systems processing massive amounts of data. Tricks like buffering, reordering and caching can temporarily defy the rules in some cases, but ultimately performance falls back to what the “speeds and feeds” dictate. Architectural Considerations – Slide 5

6 The northbridge connects three components that must communicate at high speed: CPU, DRAM and video. Video also needs first-class access to DRAM. Previous NVIDIA cards were connected to AGP, up to 2 GB/sec transfers. The southbridge serves as a concentrator for slower I/O devices. Architectural Considerations – Slide 6

7 Connected to the southbridge. Originally 33 MHz, 32-bit wide, 132 MB/sec peak transfer rate; more recently 66 MHz, 64-bit, 512 MB/sec peak. Upstream bandwidth remains slow for devices (256 MB/sec peak). Shared bus with arbitration: the winner of arbitration becomes bus master and can connect to the CPU or DRAM through the southbridge and northbridge. Architectural Considerations – Slide 7

8 PCI device registers are mapped into the CPU’s physical address space and accessed through loads/stores (kernel mode). Addresses are assigned to the PCI devices at boot time; all devices listen for their addresses. Architectural Considerations – Slide 8

9 Switched, point-to-point connection. Each card has a dedicated “link” to the central switch; no bus arbitration. Packet-switched messages form virtual channels. Packets are prioritized for quality of service, e.g. real-time video streaming. Architectural Considerations – Slide 9

10 Each link consists of one or more lanes. Each lane is 1-bit wide (4 wires; each 2-wire pair can transmit 2.5 Gb/sec in one direction). Upstream and downstream transfers are now simultaneous and symmetric. Each link can combine 1, 2, 4, 8, 12 or 16 lanes: x1, x2, etc. Architectural Considerations – Slide 10

11 Each link consists of one or more lanes. Each byte of data is 8b/10b encoded into 10 bits with an equal number of 1’s and 0’s; the net data rate is 2 Gb/sec per lane each way. Thus the net data rates are 250 MB/sec (x1), 500 MB/sec (x2), 1 GB/sec (x4), 2 GB/sec (x8) and 4 GB/sec (x16), each way. Architectural Considerations – Slide 11
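The arithmetic behind those figures can be checked with a small illustrative program (plain C; the function name pcie_gen1_MBps is a made-up helper, not a real API):

#include <stdio.h>

/* Net PCIe 1.x bandwidth per direction for a given lane count.
 * Each lane signals at 2.5 Gb/s; 8b/10b encoding leaves 2.0 Gb/s of
 * payload, i.e. 250 MB/s per lane each way. Illustrative only. */
static double pcie_gen1_MBps(int lanes) {
    double raw_gbps = 2.5 * lanes;            /* raw signaling rate       */
    double net_gbps = raw_gbps * 8.0 / 10.0;  /* after 8b/10b encoding    */
    return net_gbps * 1000.0 / 8.0;           /* Gb/s -> MB/s             */
}

int main(void) {
    int widths[] = {1, 2, 4, 8, 16};
    for (int i = 0; i < 5; ++i)
        printf("x%-2d : %6.0f MB/s each way\n", widths[i], pcie_gen1_MBps(widths[i]));
    return 0;   /* prints 250, 500, 1000, 2000, 4000 */
}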

12 PCIe forms the interconnect backbone: the northbridge and southbridge are both PCIe switches. Some southbridge designs have a built-in PCI-PCIe bridge to allow old PCI cards. Some PCIe cards are PCI cards with a PCI-PCIe bridge. Architectural Considerations – Slide 12

13 FSB connection between the processor and the northbridge (82925X), the memory control hub. The northbridge handles the “primary” PCIe to video/GPU and DRAM; PCIe x16 bandwidth is 8 GB/sec (4 GB/sec each direction). The southbridge (ICH6RW) handles other peripherals. Architectural Considerations – Slide 13

14 Bensley platform: the Blackford Memory Control Hub (MCH) is now a PCIe switch that integrates the northbridge and southbridge. FBD (Fully Buffered DIMMs) allow simultaneous R/W transfers at 10.5 GB/sec per DIMM. PCIe links form the backbone. Architectural Considerations – Slide 14 Source: http://www.2cpu.com/review.php?id=109

15 Bensley platform: PCIe device upstream bandwidth is now equal to downstream. The workstation version has an x16 GPU link via the Greencreek MCH. Architectural Considerations – Slide 15 Source: http://www.2cpu.com/review.php?id=109

16 Two CPU sockets, with a Dual Independent Bus to the CPUs; each is basically an FSB. The CPU feeds at 8.5–10.5 GB/sec per socket, compared to 6.4 GB/sec for the current front-side bus. PCIe bridges to legacy I/O devices. Architectural Considerations – Slide 16 Source: http://www.2cpu.com/review.php?id=109

17 The AMD HyperTransport™ Technology bus replaces the front-side bus architecture. HyperTransport™ similarities to PCIe: packet-based switching network; dedicated links for both directions. Shown in a 4-socket configuration, 8 GB/sec per link. Architectural Considerations – Slide 17

18 The northbridge/HyperTransport™ is on die. Glueless logic to DDR, DDR2 memory. PCI-X/PCIe bridges (usually implemented in the southbridge). Architectural Considerations – Slide 18

19 “Torrenza” technology allows licensing of coherent HyperTransport™ to 3rd-party manufacturers to make socket-compatible accelerators/co-processors. Architectural Considerations – Slide 19

20 “Torrenza” technology allows 3rd-party PPUs (Physics Processing Units), GPUs and co-processors to access main system memory directly and coherently. Architectural Considerations – Slide 20

21 “Torrenza” technology could make the accelerator programming model easier to use than, say, the Cell processor, where each SPE cannot directly access main memory. Architectural Considerations – Slide 21

22 Primarily a low-latency direct chip-to-chip interconnect; supports mapping to board-to-board interconnects such as PCIe. Architectural Considerations – Slide 22 Courtesy HyperTransport™ Consortium. Source: “White Paper: AMD HyperTransport Technology-Based System Architecture”

23 HyperTransport™ 1.0 Specification: 800 MHz max, 12.8 GB/s aggregate bandwidth (6.4 GB/s each way). Architectural Considerations – Slide 23 Courtesy HyperTransport™ Consortium. Source: “White Paper: AMD HyperTransport Technology-Based System Architecture”

24 HyperTransport™ 2.0 Specification: added PCIe mapping; 1.0–1.4 GHz clock, 22.4 GB/s aggregate bandwidth (11.2 GB/s each way). Architectural Considerations – Slide 24 Courtesy HyperTransport™ Consortium. Source: “White Paper: AMD HyperTransport Technology-Based System Architecture”

25 HyperTransport™ 3.0 Specification: 1.8–2.6 GHz clock, 41.6 GB/s aggregate bandwidth (20.8 GB/s each way); added AC coupling to extend HyperTransport™ over long distances as a system-to-system interconnect. Architectural Considerations – Slide 25 Courtesy HyperTransport™ Consortium. Source: “White Paper: AMD HyperTransport Technology-Based System Architecture”

26 Example GPU board details: 256 MB/256-bit DDR3 at 600 MHz (8 pieces of 8M×32), 16x PCI-Express, SLI connector, DVI ×2, sVideo TV out, single-slot cooling. Architectural Considerations – Slide 26

27 Single-Program Multiple-Data (SPMD): a CUDA application is an integrated CPU + GPU C program. Serial C code executes on the CPU; parallel kernel C code executes on GPU thread blocks. Architectural Considerations – Slide 27

28 CPU serial code, then GPU parallel kernel KernelA<<<…>>>(args); launching Grid 0; more CPU serial code, then GPU parallel kernel KernelB<<<…>>>(args); launching Grid 1. Architectural Considerations – Slide 28
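A hedged sketch of this alternation is shown below; KernelA, KernelB, nBlk and nTid are placeholder names, and the empty kernel bodies stand in for real work.

#include <cuda_runtime.h>

__global__ void KernelA(float *d, int n) { /* grid 0 work (omitted) */ }
__global__ void KernelB(float *d, int n) { /* grid 1 work (omitted) */ }

// Serial host code and parallel device kernels alternate in time.
void run(float *d_data, int n, int nBlk, int nTid) {
    // ... serial CPU code ...
    KernelA<<<nBlk, nTid>>>(d_data, n);   // grid 0: blocks of threads on the GPU
    cudaDeviceSynchronize();              // wait before dependent serial code
    // ... more serial CPU code ...
    KernelB<<<nBlk, nTid>>>(d_data, n);   // grid 1
    cudaDeviceSynchronize();
}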

29 A kernel is executed as a grid of thread blocks. All threads share the global memory space. Architectural Considerations – Slide 29

30 A thread block is a batch of threads that can cooperate with each other by synchronizing their execution using a barrier and by efficiently sharing data through a low-latency shared memory. Two threads from two different blocks cannot cooperate. A small example of both mechanisms follows. Architectural Considerations – Slide 30
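The block-local reversal kernel below is an illustrative example, not from the slides; it assumes blockDim.x <= 256 and n a multiple of blockDim.x.

// Threads within one block cooperate: stage data in low-latency shared
// memory, synchronize at the barrier, then read what other threads wrote.
__global__ void reverseWithinBlocks(float *d, int n) {
    __shared__ float buf[256];
    int base = blockIdx.x * blockDim.x;
    int tid  = threadIdx.x;
    buf[tid] = d[base + tid];                    // each thread writes one element
    __syncthreads();                             // barrier: all writes visible to the block
    d[base + tid] = buf[blockDim.x - 1 - tid];   // read an element another thread wrote
}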

31 The programmer declares a (thread) block: block size of 1 to 512 concurrent threads; block shape 1D, 2D or 3D; block dimensions in threads. Architectural Considerations – Slide 31 (Figure: CUDA thread block, thread IDs 0, 1, 2, 3, …, m running the thread program. Courtesy: John Nickolls, NVIDIA)

32 All threads in a block execute the same thread program. Threads share data and synchronize while doing their share of the work. Threads have thread ID numbers within the block; the thread program uses the thread ID to select work and address shared data. Architectural Considerations – Slide 32 (Courtesy: John Nickolls, NVIDIA)
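For example, a thread program typically combines blockIdx and threadIdx to pick its element; a generic SAXPY-style sketch (illustrative only):

// Each thread uses its ID to select its share of the work and to
// address the data it owns.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // unique global thread index
    if (i < n)                                       // guard for a partial last block
        y[i] = a * x[i] + y[i];
}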

33 (Figure: G80 hardware hierarchy – the Streaming Processor Array is built from Texture Processor Clusters (TPC, with TEX units); each Streaming Multiprocessor (SM) contains instruction fetch/dispatch, instruction L1 and data L1 caches, shared memory, SPs and SFUs.) Architectural Considerations – Slide 33

34 SPA: Streaming Processor Array (variable across the GeForce 8 series; 8 TPCs in the GeForce 8800). TPC: Texture Processor Cluster (2 SMs + TEX). SM: Streaming Multiprocessor (8 SPs); a multi-threaded processor core, the fundamental processing unit for a CUDA thread block. SP: Streaming Processor; scalar ALU for a single CUDA thread. Architectural Considerations – Slide 34

35 Streaming Multiprocessor (SM): 8 Streaming Processors (SP), 2 Special Function Units (SFU). Multi-threaded instruction dispatch: 1 to 512 threads active; shared instruction fetch per 32 threads; covers latency of texture/memory loads. Architectural Considerations – Slide 35

36 20+ GFLOPS; 16 KB shared memory; texture and global memory access. Architectural Considerations – Slide 36

37 The future of GPUs is programmable processing, so build the architecture around the processor. Architectural Considerations – Slide 37 (Figure: G80 graphics-mode pipeline – host, input assembler, vertex/geometry/pixel thread issue, setup/raster/ZCull, thread processor array of SPs with L1 and TF, L2 caches and frame buffer partitions.)

38 Processors execute computing threads: an alternative operating mode specifically for computing, which generates thread grids based on kernel calls. Architectural Considerations – Slide 38

39 A grid is launched on the streaming processor array (SPA). Thread blocks are serially distributed to all the streaming multiprocessors (SMs), potentially more than one thread block per SM. Each SM launches warps of threads: two levels of parallelism. Architectural Considerations – Slide 39 (Figure: the host launches Kernel 1 and Kernel 2 as grids of blocks of threads on the device.)

40 The SM schedules and executes warps that are ready to run. As warps and thread blocks complete, resources are freed and the SPA can distribute more thread blocks. Architectural Considerations – Slide 40

41 Threads are assigned to SMs at block granularity: up to 8 blocks per SM, as resources allow. An SM in G80 can take up to 768 threads; this could be 256 threads/block × 3 blocks, or 128 threads/block × 6 blocks, etc. Architectural Considerations – Slide 41

42 Threads run concurrently: the SM assigns and maintains thread ID numbers, and manages and schedules thread execution. Architectural Considerations – Slide 42

43 Each thread block is divided into 32-thread warps. This is an implementation decision, not part of the CUDA programming model. Warps are the scheduling units in an SM. Architectural Considerations – Slide 43

44 If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in an SM? Each block is divided into 256/32 = 8 warps, so there are 8 × 3 = 24 warps. At any point in time, only one of the 24 warps will be selected for instruction fetch and execution. Architectural Considerations – Slide 44

45 SM hardware implements zero-overhead warp scheduling. Warps whose next instruction has its operands ready for consumption are eligible for execution; eligible warps are selected for execution on a prioritized scheduling policy. All threads in a warp execute the same instruction when selected. Architectural Considerations – Slide 45

46 Four clock cycles are needed to dispatch the same instruction for all threads in a warp in G80. If one global memory access is needed for every 4 instructions, a minimum of 13 warps is needed to fully tolerate a 200-cycle memory latency. Architectural Considerations – Slide 46
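The 13 follows from dividing the 200-cycle latency by the 16 cycles of useful work (4 independent instructions × 4 dispatch cycles) each warp provides; a small illustrative program makes the arithmetic explicit:

#include <stdio.h>

/* Minimum warps needed to hide a memory stall: each ready warp keeps the SM
 * busy for (independent instructions) x (dispatch cycles per instruction)
 * cycles, so ceil(latency / that product) warps are needed. Illustrative only. */
static int min_warps_to_hide(int latency, int indep_instrs, int cycles_per_instr) {
    int busy = indep_instrs * cycles_per_instr;   /* 4 x 4 = 16 cycles */
    return (latency + busy - 1) / busy;           /* ceiling division  */
}

int main(void) {
    printf("%d warps\n", min_warps_to_hide(200, 4, 4));   /* prints 13 */
    return 0;
}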

47 Fetch one warp instruction per cycle from the instruction L1 cache into any instruction buffer slot. Issue one “ready-to-go” warp instruction per cycle from any warp-instruction buffer slot; operand scoreboarding is used to prevent hazards. Issue selection is based on round-robin/age of warp. The SM broadcasts the same instruction to the 32 threads of a warp. Architectural Considerations – Slide 47

48 All register operands of all instructions in the instruction buffer are scoreboarded: an instruction becomes ready after the needed values are deposited, which prevents hazards; cleared instructions are eligible for issue. Architectural Considerations – Slide 48

49 Decoupled memory/processor pipelines: any thread can continue to issue instructions until scoreboarding prevents issue, allowing memory and processor ops to proceed in the shadow of other waiting memory/processor ops. Architectural Considerations – Slide 49

50 For matrix multiplication, should I use 4×4, 8×8, 16×16 or 32×32 tiles? For 4×4, we have 16 threads per block. Since each SM can take up to 768 threads, the thread capacity allows 48 blocks. However, each SM can only take up to 8 blocks, so there will be only 128 threads in each SM! There are 8 warps, but each warp is only half full. Architectural Considerations – Slide 50

51 For matrix multiplication, should I use 4×4, 8×8, 16×16 or 32×32 tiles? For 8×8, we have 64 threads per block. Since each SM can take up to 768 threads, it could take up to 12 blocks. However, each SM can only take up to 8 blocks, so only 512 threads will go into each SM! There are 16 warps available for scheduling in each SM; each warp spans four slices in the y dimension. Architectural Considerations – Slide 51

52 For matrix multiplication, should I use 4×4, 8×8, 16×16 or 32×32 tiles? For 16×16, we have 256 threads per block. Since each SM can take up to 768 threads, it can take up to 3 blocks and achieve full capacity, unless other resource considerations overrule. There are 24 warps available for scheduling in each SM; each warp spans two slices in the y dimension. For 32×32, we have 1024 threads per block: not even one block can fit into an SM! Architectural Considerations – Slide 52
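Under those G80 limits, 16×16 is the natural tile size; a hedged launch-configuration sketch (matrixMulKernel, its arguments and Width being a multiple of TILE_WIDTH are assumptions, not code from the deck):

#define TILE_WIDTH 16

__global__ void matrixMulKernel(float *Md, float *Nd, float *Pd, int Width) {
    // tiled multiply body omitted; see the sketches accompanying slides 67 and 70
}

void launchMatrixMul(float *Md, float *Nd, float *Pd, int Width) {
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);                 // 256 threads/block: 3 blocks fill the 768-thread SM
    dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);  // one block per output tile
    matrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
}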

53 Review: CUDA device memory spaces. Each thread can read/write per-thread registers and local memory, read/write per-block shared memory, read/write per-grid global memory, and read per-grid constant and texture memories. The host can read/write global, constant and texture memories. Architectural Considerations – Slide 53
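A sketch that touches each of these spaces (variable names are illustrative; it assumes blockDim.x <= 128):

__constant__ float coeff[16];                 // per-grid constant memory: host writes, device reads

__global__ void touchEachSpace(float *gOut)   // gOut points into per-grid global memory
{
    __shared__ float sBuf[128];               // per-block shared memory
    float local;                              // automatic variable: a register, or local memory if spilled

    sBuf[threadIdx.x] = coeff[threadIdx.x % 16];
    __syncthreads();
    local = sBuf[threadIdx.x];
    gOut[blockIdx.x * blockDim.x + threadIdx.x] = local;
}

// Host side (illustrative): cudaMemcpyToSymbol(coeff, h_coeff, sizeof(coeff));
// plus ordinary cudaMemcpy for global memory, as in the slide-4 skeleton.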

54 Uses of shared memory: inter-thread communication within a block; caching data to reduce global memory accesses; avoiding non-coalesced access. Organization: 16 banks of 32-bit width (Tesla) or 32 banks of 32-bit width (Fermi); successive 32-bit words belong to different banks. Architectural Considerations – Slide 54
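A sketch of the bank mapping on the 16-bank (Tesla-class) layout described above; the kernel and its access pattern are illustrative and assume a single 16×16 thread block:

// On a 16-bank part, 32-bit word i of shared memory lives in bank (i % 16);
// threads of a half-warp that touch distinct banks are serviced in parallel.
__global__ void bankAccess(float *out) {
    __shared__ float s[16][16];
    int tx = threadIdx.x, ty = threadIdx.y;

    s[ty][tx] = (float)tx;        // conflict-free: within a half-warp, consecutive tx hit consecutive banks
    __syncthreads();

    // By contrast, s[tx][ty] would stride by 16 words, so every thread of a
    // half-warp would land in the same bank: a 16-way serialized access.
    out[ty * 16 + tx] = s[ty][tx];
}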

55 Performance: 32 bits per bank per 2 clocks per multiprocessor. Shared memory accesses are per 16 threads (half-warp). Serialization: if n threads (out of 16) access the same bank, the n accesses are executed serially. Broadcast: n threads accessing the same word are served in one fetch. Architectural Considerations – Slide 55

56 Local memory: per thread. Private per thread; auto variables, register spill. Shared memory: per block. Shared by threads of the same block; inter-thread communication. Architectural Considerations – Slide 56

57 Global memory: per application. Shared by all threads; inter-grid communication (sequential grids in time share global memory). Architectural Considerations – Slide 57

58 Threads in a block share data and results, in memory and in shared memory, and synchronize at barrier instructions. Architectural Considerations – Slide 58

59 Per-block shared memory allocation keeps data close to the processor and minimizes trips to global memory. Shared memory is dynamically allocated to blocks and is one of the limiting resources. Architectural Considerations – Slide 59

60 Register file (RF): 32 KB (8K entries) for each SM in G80. The TEX pipe can also read/write the RF (2 SMs share 1 TEX), as can the load/store pipe. Architectural Considerations – Slide 60

61 There are 8192 registers in each SM in G80. This is an implementation decision, not part of CUDA. Registers are dynamically partitioned across all blocks assigned to the SM. Once assigned to a block, a register is NOT accessible by threads in other blocks; each thread in the same block only accesses registers assigned to itself. Architectural Considerations – Slide 61

62 If each block has 16 × 16 threads and each thread uses 10 registers, how many threads can run on each SM? Each block requires 10 × 256 = 2560 registers; 8192 = 3 × 2560 + change, so three blocks can run on an SM as far as registers are concerned. What if each thread increases its use of registers by 1? Each block now requires 11 × 256 = 2816 registers; 8192 < 2816 × 3, so only two blocks can run on an SM: a one-third reduction of thread-level parallelism! Architectural Considerations – Slide 62
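A small illustrative program makes this register-limited occupancy arithmetic explicit (G80's 8192 registers and 8-block cap are assumed; the 768-thread cap is ignored here for brevity):

#include <stdio.h>

/* Blocks that fit on a G80 SM when limited by the register file and the
 * 8-block cap. Illustrative only. */
static int blocks_per_sm(int threads_per_block, int regs_per_thread) {
    int by_registers = 8192 / (threads_per_block * regs_per_thread);
    return by_registers < 8 ? by_registers : 8;
}

int main(void) {
    printf("10 regs/thread: %d blocks\n", blocks_per_sm(256, 10));  /* 3 */
    printf("11 regs/thread: %d blocks\n", blocks_per_sm(256, 11));  /* 2 */
    return 0;
}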

63 Dynamic partitioning gives more flexibility to compilers and programmers. One can run a smaller number of threads that require many registers each, or a large number of threads that require few registers each. This allows for finer-grain threading than traditional CPU threading models: the compiler can trade off between instruction-level parallelism (ILP) and thread-level parallelism (TLP). Architectural Considerations – Slide 63

64 Assume that a kernel has 256-thread blocks, 4 independent instructions for each global memory load in the thread program, and 10 registers per thread, and that global loads take 200 cycles; then 3 blocks can run on each SM. If the compiler can use one more register to change the dependence pattern so that 8 independent instructions exist for each global memory load, only two blocks can run on each SM. Architectural Considerations – Slide 64

65 However, one only needs 200/(8 × 4) = 7 warps to tolerate the memory latency. Two blocks have 16 warps, so the performance can actually be higher! Architectural Considerations – Slide 65

66 An increase in per-thread performance, but fewer threads: lower overall performance. Architectural Considerations – Slide 66

67 One could double buffer the computation, getting a better instruction mix within each thread. This is classic software pipelining in ILP compilers.

Without prefetching:
Loop {
    Load current tile to shared memory
    syncthreads()
    Compute current tile
    syncthreads()
}

With prefetching:
Load next tile from global memory
Loop {
    Deposit current tile to shared memory
    syncthreads()
    Load next tile from global memory
    Compute current tile
    syncthreads()
}

Architectural Considerations – Slide 67
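A hedged CUDA sketch of the prefetched loop for tiled matrix multiplication; the 16×16 tile size and the index arithmetic are assumptions chosen to match the earlier slides, not code from the deck. Widths are assumed to be multiples of 16.

// Double-buffered tile loop: load the next tile into registers while the
// current tile (already in shared memory) is being consumed.
__global__ void prefetchedMatMul(const float *A, const float *B, float *C, int width) {
    __shared__ float As[16][16], Bs[16][16];
    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * 16 + ty, col = blockIdx.x * 16 + tx;

    float a_next = A[row * width + tx];        // prefetch first tile into registers
    float b_next = B[ty * width + col];
    float acc = 0.0f;

    for (int t = 0; t < width / 16; ++t) {
        As[ty][tx] = a_next;                   // deposit the prefetched tile
        Bs[ty][tx] = b_next;
        __syncthreads();

        if (t + 1 < width / 16) {              // prefetch the next tile into registers
            a_next = A[row * width + (t + 1) * 16 + tx];
            b_next = B[(ty + (t + 1) * 16) * width + col];
        }
        for (int k = 0; k < 16; ++k)           // compute on the current tile
            acc += As[ty][k] * Bs[k][tx];
        __syncthreads();                       // keep the tile intact until all threads finish
    }
    C[row * width + col] = acc;
}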

68 Deposit the blue tile from registers into shared memory; syncthreads(); load the orange tile into registers; compute the blue tile; deposit the orange tile into shared memory; …. Architectural Considerations – Slide 68 (Figure: tiled matrix multiplication of Md × Nd = Pd with TILE_WIDTH tiles.)

69 There are very few multiplications or additions between branches and address calculations; loop unrolling can help.

Before unrolling:
for (int k = 0; k < BLOCK_SIZE; ++k)
    Pvalue += Ms[ty][k] * Ns[k][tx];

After unrolling:
Pvalue += Ms[ty][k] * Ns[k][tx] + … + Ms[ty][k+15] * Ns[k+15][tx];

Architectural Considerations – Slide 69
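The same unrolling can also be requested from the compiler; a hedged sketch using #pragma unroll (the tileDot wrapper and BLOCK_SIZE = 16 are illustrative assumptions, not the deck's code):

#define BLOCK_SIZE 16

// Same inner product as above, but letting nvcc unroll it. BLOCK_SIZE must be
// a compile-time constant for full unrolling.
__device__ float tileDot(float Ms[BLOCK_SIZE][BLOCK_SIZE],
                         float Ns[BLOCK_SIZE][BLOCK_SIZE],
                         int ty, int tx) {
    float Pvalue = 0.0f;
    #pragma unroll                       // replicate the body BLOCK_SIZE times
    for (int k = 0; k < BLOCK_SIZE; ++k)
        Pvalue += Ms[ty][k] * Ns[k][tx];
    return Pvalue;
}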

70 Tiled version:

Ctemp = 0;
for (…) {
    __shared__ float As[16][16];
    __shared__ float Bs[16][16];

    // load input tile elements
    As[ty][tx] = A[indexA];
    Bs[ty][tx] = B[indexB];
    indexA += 16;
    indexB += 16 * widthB;
    __syncthreads();

    // compute results for tile
    for (i = 0; i < 16; i++) {
        Ctemp += As[ty][i] * Bs[i][tx];
    }
    __syncthreads();
}
C[indexC] = Ctemp;

Unrolled version:

Ctemp = 0;
for (…) {
    __shared__ float As[16][16];
    __shared__ float Bs[16][16];

    // load input tile elements
    As[ty][tx] = A[indexA];
    Bs[ty][tx] = B[indexB];
    indexA += 16;
    indexB += 16 * widthB;
    __syncthreads();

    // compute results for tile
    Ctemp += As[ty][0] * Bs[0][tx];
    …
    Ctemp += As[ty][15] * Bs[15][tx];
    __syncthreads();
}
C[indexC] = Ctemp;

Unrolling removes branch instructions and address calculations. Does this use more registers? Architectural Considerations – Slide 70

71 Long-latency operations: avoid stalls by executing other threads. Stalls and bubbles in the pipeline: barrier synchronization, branch divergence. Shared resource saturation: global memory bandwidth, local memory capacity. Architectural Considerations – Slide 71

72 Based on original material from Jon Stokes, “PCI Express: An Overview,” http://arstechnica.com/articles/paedia/hardware/pcie.ars, and from David Kirk and Wen-mei W. Hwu of the University of Illinois at Urbana-Champaign. Revision history: last updated 10/11/2011; previous revision 9/13/2011. Architectural Considerations – Slide 72

