1
GPUs
2
Performance Advantage of GPUs
A widening peak performance advantage:
– Calculation: 1 TFLOPS vs. 100 GFLOPS
– Memory bandwidth: 100-150 GB/s vs. 32-64 GB/s
– A GPU in every PC and workstation: massive volume and potential impact
Courtesy: John Owens
3
Heart of Blue Waters: Two Chips
AMD Interlagos: 157 GF peak performance
– Features: 2.3-2.6 GHz, 8 core modules, 16 threads
– On-chip caches: L1 (I: 8 x 64 KB; D: 16 x 16 KB), L2 (8 x 2 MB)
– Memory subsystem: four memory channels, 51.2 GB/s bandwidth
NVIDIA Kepler: 1,300 GF peak performance
– Features: 15 streaming multiprocessors (SMX); each SMX has 192 CUDA SPs, 64 DP units, 32 special function units
– On-chip caches: L1 cache/shared memory (64 KB, up to 48 KB shared), L2 cache (1,536 KB)
– Memory subsystem: six memory channels, 250 GB/s bandwidth
4
[Diagram: a CPU die dominated by control logic and cache around a few ALUs, next to a GPU die packed with many small ALUs; both attached to DRAM]
CPUs and GPUs have fundamentally different design philosophies.
5
CPUs: Latency-Oriented Design
Large caches
– Convert long-latency memory accesses into short-latency cache accesses
Sophisticated control
– Branch prediction for reduced branch latency
– Data forwarding for reduced data latency
Powerful ALUs
– Reduced operation latency
6
GPUs: Throughput-Oriented Design
Small caches
– To boost memory throughput
Simple control
– No branch prediction
– No data forwarding
Energy-efficient ALUs
– Many, long-latency but heavily pipelined for high throughput
Requires a massive number of threads to tolerate latencies
7
Winning Applications Use Both CPU and GPU
CPUs for sequential parts where latency matters
– CPUs can be 10+X faster than GPUs for sequential code
GPUs for parallel parts where throughput wins
– GPUs can be 10+X faster than CPUs for parallel code
8
GPU Architecture
[Diagram: one FETCH/DECODE/ISSUE front end feeding N lanes, each a register file (RF) paired with a functional unit (FU), lanes 0 through N]
Good for data-parallel code. Bad when threads diverge.
9
Control Divergence
[Diagram: control-flow graph A -> {B, C} -> D; Thread 0 follows A-B-D while Thread 1 follows A-C-D, so each path executes while the other thread's lanes sit idle]
Threads must execute in lockstep, but different threads execute different instruction sequences.
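As a minimal CUDA sketch of the A-B/C-D pattern above (a hypothetical kernel, not from the original slides): even and odd lanes of a warp take different branches, so the hardware executes both paths one after the other, with half the lanes masked off each time.

// Control divergence: lanes 0,2,4,... take path B while lanes 1,3,5,...
// take path C, so the warp runs B and then C serially before reconverging at D.
__global__ void divergent_kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // A: common entry
    if (i < n) {
        if (i % 2 == 0)
            data[i] *= 2.0f;   // B: executed with odd lanes masked off
        else
            data[i] += 1.0f;   // C: executed with even lanes masked off
        data[i] -= 0.5f;       // D: all lanes reconverge here
    }
}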
10
Memory Divergence
[Diagram: SMs with L1 caches connected through an interconnect to L2 slices and DRAM memory controllers; Load 0 and Load 1 from the same warp travel to different memory partitions]
Dependent instructions must stall until the longest memory request finishes.
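A sketch of why that stall happens (hypothetical kernels, assuming 32-thread warps and 4-byte floats): consecutive lanes loading consecutive addresses coalesce into a few wide transactions, while a strided pattern scatters one warp's loads across many DRAM bursts, and every dependent instruction waits for the slowest of them.

// Coalesced: neighboring lanes read neighboring addresses,
// so a warp's 32 loads merge into a few wide transactions.
__global__ void copy_coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Divergent: lanes scatter across memory by `stride`, generating
// many separate transactions that the warp must wait on.
__global__ void copy_strided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        out[i] = in[i * stride];
}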
11
Massive Parallelism Requires Regularity
12
Undesirable Patterns
– Serialization due to conflicting use of critical resources
– Oversubscription of global memory bandwidth
– Load imbalance among parallel threads
13
Conflicting Data Accesses Cause Serialization and Delays
– Massively parallel execution cannot afford serialization
– Contention in accessing critical data causes serialization
14
A Simple Example
A naïve inner product of two vectors of one million elements each:
– All multiplications can be done in one time unit (in parallel)
– Additions into a single accumulator take one million time units (serial)
[Diagram: parallel multiplications feeding a serial chain of additions over time]
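A CUDA sketch of the two extremes (hypothetical kernel names, single precision): funneling every product into one accumulator with atomicAdd reproduces the serial chain of additions, while a shared-memory tree reduction needs only log2(blockDim.x) addition steps per block plus one atomic per block.

// Naive: one global accumulator; the additions serialize under contention.
__global__ void dot_naive(const float *a, const float *b, float *sum, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(sum, a[i] * b[i]);
}

// Tree reduction: each block halves its partial sums log2(blockDim.x) times,
// so only one atomicAdd per block touches the shared accumulator.
__global__ void dot_tree(const float *a, const float *b, float *sum, int n) {
    extern __shared__ float partial[];                 // blockDim.x floats
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    partial[threadIdx.x] = (i < n) ? a[i] * b[i] : 0.0f;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {     // assumes power-of-2 block size
        if (threadIdx.x < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        atomicAdd(sum, partial[0]);
}

Launched as dot_tree<<<blocks, threads, threads * sizeof(float)>>>(a, b, sum, n), so each block gets one shared-memory slot per thread.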
15
How much can conflicts hurt?
Amdahl's Law:
– If fraction X of a computation is serialized, the speedup cannot be more than 1/X
In the previous example, X = 50%:
– Half the calculations are serialized
– No more than 2X speedup, no matter how many computing cores are used
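Written out (the standard form of Amdahl's Law, with serialized fraction X and N cores):

\text{speedup}(N) = \frac{1}{X + (1-X)/N}, \qquad \lim_{N \to \infty} \text{speedup}(N) = \frac{1}{X} = \frac{1}{0.5} = 2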
16
Load Balance
The total time to complete a parallel job is limited by the thread that takes the longest to finish.
[Figure: a well-balanced ("good") schedule next to an imbalanced ("bad") one]
17
How bad can it be?
Assume a job takes 100 units of time for one person to finish.
– If we break the job into 10 parts of 10 units each and have 10 people work in parallel, we get a 10X speedup.
– If we break it into parts of 50, 10, 5, 5, 5, 5, 5, 5, 5, 5 units, the same 10 people take 50 units to finish, with 9 of them idle most of the time. We get no more than a 2X speedup.
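The same point as a formula (a standard identity, not from the slides): with per-worker times t_1, ..., t_10, the parallel finish time is the maximum, so

\text{speedup} = \frac{\sum_i t_i}{\max_i t_i} = \frac{100}{10} = 10 \ \text{(balanced)}, \qquad \frac{100}{50} = 2 \ \text{(imbalanced)}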
18
How does imbalance come about?
Non-uniform data distributions:
– Highly concentrated spatial data areas
– Astronomy, medical imaging, computer vision, rendering, …
If each thread processes the input data of a given spatial volume unit, some threads will do a lot more work than others.
19
Global Memory Bandwidth
[Figure: "Ideal" memory access pattern contrasted with the "Reality"]
20
Global Memory Bandwidth
Many-core processors have limited off-chip memory access bandwidth compared to their peak compute throughput. Fermi:
– 1.5 TFLOPS SPFP peak throughput
– 0.75 TFLOPS DPFP peak throughput
– 144 GB/s peak off-chip memory access bandwidth
   – 36 G SPFP operands per second
   – 18 G DPFP operands per second
– To achieve peak throughput, how many operations must a program perform for each operand?
   – 1,500/36 = ~42 SPFP arithmetic operations for each operand value fetched from off-chip memory
   – How about DPFP?
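Working the same arithmetic for both precisions (4-byte SPFP and 8-byte DPFP operands over the same 144 GB/s):

\frac{144\ \text{GB/s}}{4\ \text{B/operand}} = 36\ \text{G SPFP operands/s}, \qquad \frac{1500\ \text{GFLOPS}}{36\ \text{G/s}} \approx 42\ \text{ops/operand}

\frac{144\ \text{GB/s}}{8\ \text{B/operand}} = 18\ \text{G DPFP operands/s}, \qquad \frac{750\ \text{GFLOPS}}{18\ \text{G/s}} \approx 42\ \text{ops/operand}

So the DPFP ratio comes out the same: roughly 42 arithmetic operations per operand fetched from off-chip memory.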