Published by Ethel Parker. Modified over 9 years ago.
1
Why GPUs?
Robert Strzodka
2
Overview
– Computation / Bandwidth / Power
– CPU – GPU Comparison
– GPU Characteristics
3
Data Processing in General
[Diagram: data flows from IN through memory to the processor and back through memory to OUT. Labeled bottlenecks: memory wall, lack of parallelism]
4
Old and New Wisdom in Computer Architecture
– Old: Power is free, transistors are expensive
– New: “Power wall”, power is expensive, transistors are free (can put more transistors on a chip than one can afford to turn on)
– Old: Multiplies are slow, memory access is fast
– New: “Memory wall”, multiplies are fast, memory is slow (200 clocks to DRAM memory, 4 clocks for an FP multiply)
– Old: Increasing instruction-level parallelism (ILP) via compilers and innovation (out-of-order, speculation, VLIW, …)
– New: “ILP wall”, diminishing returns on more ILP hardware (explicit thread and data parallelism must be exploited)
– New: Power Wall + Memory Wall + ILP Wall = Brick Wall
Slide courtesy of Christos Kozyrakis
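The memory-wall figures quoted on this slide (200 clocks to DRAM, 4 clocks per FP multiply) allow a rough back-of-the-envelope estimate. The sketch below assumes a deliberately naive serial cost model with no caching and no overlap of memory access and computation; the function names and the model itself are illustrative assumptions, not part of the slides.

```c
#include <assert.h>

/* Cycle costs quoted on the slide. */
enum { DRAM_CLOCKS = 200, FMUL_CLOCKS = 4 };

/* Naive serial model (an assumption for illustration): every DRAM access
   and every multiply is paid in full, with no caching and no overlap. */
long naive_cycles(long dram_accesses, long multiplies)
{
    assert(dram_accesses >= 0 && multiplies >= 0);
    return dram_accesses * DRAM_CLOCKS + multiplies * FMUL_CLOCKS;
}

/* Number of multiplies that take as long as one DRAM access: 200 / 4. */
long multiplies_per_dram_access(void)
{
    return DRAM_CLOCKS / FMUL_CLOCKS;
}
```

Under this model a kernel needs on the order of 50 multiplies per DRAM access before computation, rather than memory, dominates the run time.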
5
Uniprocessor Performance (SPECint)
[Chart annotated “3X”. From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006]
Sea change in chip design: multiple “cores” or processors per chip
Slide courtesy of Christos Kozyrakis
6
Instruction-Stream-Based Processing
[Diagram labels: processor, instructions, cache, memory, data]
7
Instruction- and Data-Streams
Addition of 2D arrays: C = A + B

Instruction stream processing data:

    for(y=0; y<HEIGHT; y++)
      for(x=0; x<WIDTH; x++) {
        C[y][x] = A[y][x] + B[y][x];
      }

Data streams undergoing a kernel operation:

    inputStreams(A,B);
    outputStream(C);
    kernelProgram(OP_ADD);
    processStreams();
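The data-stream formulation of C = A + B can be emulated on a CPU to make the idea concrete. This is a minimal sketch: `process_streams`, `kernel_fn`, `op_add` and `add_arrays` are hypothetical names standing in for the slide's pseudo-calls and do not belong to any real stream API.

```c
#include <assert.h>
#include <stddef.h>

/* A kernel maps one element from each input stream to one output element. */
typedef float (*kernel_fn)(float a, float b);

/* Counterpart of the slide's OP_ADD kernel. */
float op_add(float a, float b) { return a + b; }

/* Hypothetical stream runtime: applies the kernel to every element.
   Elements are independent, so the iteration order does not matter and
   a real stream processor could handle many of them in parallel. */
void process_streams(const float *in_a, const float *in_b,
                     float *out, size_t count, kernel_fn kernel)
{
    assert(in_a && in_b && out && kernel);
    for (size_t i = 0; i < count; i++)
        out[i] = kernel(in_a[i], in_b[i]);
}

/* C = A + B for a height x width array stored contiguously. */
void add_arrays(const float *a, const float *b, float *c,
                size_t height, size_t width)
{
    process_streams(a, b, c, height * width, op_add);
}
```

The caller no longer writes the loop nest; it only names the streams and the kernel, which leaves the traversal order to the runtime.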
8
Data-Stream-Based Processing
[Diagram labels: processor, memory, pipeline, data, configuration]
9
Architectures: Data – Processor Locality
Field Programmable Gate Array (FPGA)
– Compute by configuring Boolean functions and local memory
Processor Array / Multi-core Processor
– Assemble many (simple) processors and memories on one chip
Processor-in-Memory (PIM)
– Insert processing elements directly into RAM chips
Stream Processor
– Create data locality through a hierarchy of memories
10
Overview
– Computation / Bandwidth / Power
– CPU – GPU Comparison
– GPU Characteristics
11
The GPU is a Fast, Parallel Array Processor
Input Arrays: 1D, 3D, 2D (typical)
Vertex Processor (VP)
– Kernel changes index regions of output arrays
Rasterizer
– Creates data streams from index regions
– Stream of array elements, order unknown
Fragment Processor (FP)
– Kernel changes each datum independently, reads more input arrays
Output Arrays: 1D, 3D (slice), 2D (typical)
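The three stages on this slide can be mimicked in plain C to show how they divide the work. This is a CPU emulation sketch under simplifying assumptions: the index region is an axis-aligned rectangle, and `region`, `fragment_kernel`, `scale_by_two` and `run_pass` are hypothetical names, not a graphics API.

```c
#include <assert.h>

/* Index region of the output array, as selected by the vertex stage.
   Half-open rectangle: [x0, x1) x [y0, y1). */
typedef struct { int x0, y0, x1, y1; } region;

/* Fragment kernel: computes one output element from its index. It may
   read any input element but produces only the element it was called for. */
typedef float (*fragment_kernel)(const float *input, int width, int x, int y);

/* Example kernel: double the input value at the same position. */
float scale_by_two(const float *input, int width, int x, int y)
{
    return 2.0f * input[y * width + x];
}

/* Emulates rasterizer plus fragment stage: the loops enumerate the
   elements of the region (a real GPU gives no ordering guarantee),
   and the kernel is applied to each element independently. */
void run_pass(region r, fragment_kernel kernel,
              const float *input, float *output, int width, int height)
{
    assert(0 <= r.x0 && r.x0 <= r.x1 && r.x1 <= width);
    assert(0 <= r.y0 && r.y0 <= r.y1 && r.y1 <= height);
    for (int y = r.y0; y < r.y1; y++)
        for (int x = r.x0; x < r.x1; x++)
            output[y * width + x] = kernel(input, width, x, y);
}
```

Elements outside the region are left untouched, which is how the choice of index region restricts where a pass writes.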
12
Index Regions in Output Arrays
Output region: Quads and Triangles
– Fastest option
Output region: Line segments
– Slower, try to pair lines to 2xh, wx2 quads
Output region: Point Clouds
– Slowest, try to gather points into larger forms
13
High Level Graphics Language for the Kernels
Float data types:
– half 16-bit (s10e5), float 32-bit (s23e8)
Vectors, structs and arrays:
– float4, float vec[6], float3x4, float arr[5][3], struct {}
Arithmetic and logic operators:
– +, -, *, /; &&, ||, !
Trigonometric, exponential functions:
– sin, asin, exp, log, pow, …
User-defined functions:
– float max3(float a, float b, float c) { return max(a,max(b,c)); }
Conditional statements, loops:
– if, for, while, dynamic branching in PS3
Streaming and random data access
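The notation s10e5 for the half type means 1 sign bit, 10 mantissa bits and 5 exponent bits. The sketch below decodes such a value in C, assuming the layout follows the IEEE 754 binary16 convention (exponent bias 15, implicit leading one, subnormals, infinities and NaNs); the slide itself does not spell out these details, and `half_to_float` is a hypothetical helper name.

```c
#include <assert.h>
#include <math.h>    /* INFINITY and NAN macros only */
#include <stdint.h>

/* Decode a 16-bit s10e5 value into a 32-bit float.
   Assumes IEEE 754 binary16 semantics (bias 15). */
float half_to_float(uint16_t h)
{
    int sign     = (h >> 15) & 0x1;
    int exponent = (h >> 10) & 0x1F;
    int mantissa = h & 0x3FF;
    float value;

    if (exponent == 0x1F) {
        /* All exponent bits set: infinity or NaN. */
        value = mantissa ? NAN : INFINITY;
    } else {
        int power;
        if (exponent == 0) {
            /* Subnormal: no implicit leading one, fixed exponent -14. */
            value = (float)mantissa / 1024.0f;
            power = -14;
        } else {
            value = 1.0f + (float)mantissa / 1024.0f;
            power = exponent - 15;
        }
        /* Scale by 2^power without calling into libm. */
        for (; power > 0; power--) value *= 2.0f;
        for (; power < 0; power++) value *= 0.5f;
    }
    return sign ? -value : value;
}
```

With 10 mantissa bits the format carries roughly three decimal digits of precision, which is worth keeping in mind when choosing between half and float in a kernel.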
14
Input and Output Arrays
CPU
– Input and output arrays may overlap
GPU
– Input and output arrays must not overlap
[Diagrams: Input and Output arrays for the CPU and the GPU case]
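Because the input and output arrays of a GPU pass must not overlap, iterative algorithms cannot update an array in place. A common workaround, not stated on the slide, is to alternate between two buffers ("ping-pong"). The sketch below illustrates it on the CPU with a hypothetical three-point smoothing kernel; all names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* One pass: each output element is the average of its input neighbours
   (edges reuse the edge value). Reads only from `in`, writes only to
   `out`; the two arrays must not overlap. */
static void smooth_pass(const float *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        float left  = in[i > 0 ? i - 1 : i];
        float right = in[i + 1 < n ? i + 1 : i];
        out[i] = (left + in[i] + right) / 3.0f;
    }
}

/* Runs `passes` passes, swapping the roles of the two buffers after
   each one. Returns the buffer that holds the final result. */
float *smooth(float *buf_a, float *buf_b, size_t n, int passes)
{
    assert(buf_a != buf_b);
    float *in = buf_a, *out = buf_b;
    for (int p = 0; p < passes; p++) {
        smooth_pass(in, out, n);
        float *tmp = in;   /* output of this pass is input of the next */
        in = out;
        out = tmp;
    }
    return in;
}
```

The swap costs nothing but a pointer exchange, so the restriction mainly doubles the memory footprint rather than the run time.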
15
Native Memory Layout – Data Locality
CPU
– 1D input
– 1D output
– Higher dimensions with offsets
GPU
– 1D, 2D, 3D input
– 2D output
– Other dimensions with offsets
[Diagrams: Input and Output arrays, color-coded by locality: red (near), blue (far)]
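"With offsets" means that an index in a non-native dimension is folded into the native one by arithmetic. The sketch below shows the usual row-major offset for 1D CPU memory, plus one possible way to place the slices of a 3D array side by side in a 2D output array. The tiled layout and all names are assumptions for illustration; the slide does not prescribe a particular scheme.

```c
#include <assert.h>
#include <stddef.h>

/* CPU: a 3D index folded into a 1D offset (row-major order). */
size_t offset_3d(size_t x, size_t y, size_t z, size_t width, size_t height)
{
    assert(x < width && y < height);
    return (z * height + y) * width + x;
}

/* Position inside a 2D array. */
typedef struct { size_t x, y; } pos2d;

/* GPU (hypothetical layout): slice z of a 3D array is stored as a tile
   in a 2D array, with `tiles_per_row` slices placed side by side. */
pos2d tile_3d_in_2d(size_t x, size_t y, size_t z,
                    size_t width, size_t height, size_t tiles_per_row)
{
    assert(x < width && y < height && tiles_per_row > 0);
    pos2d p;
    p.x = (z % tiles_per_row) * width + x;
    p.y = (z / tiles_per_row) * height + y;
    return p;
}
```

In the 1D layout, neighbours along y and z end up far apart in memory, while the 2D layout keeps x and y neighbours close and only separates neighbours along z.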
16
Data-Flow: Gather and Scatter
CPU
– Arbitrary gather
– Arbitrary scatter
GPU
– Arbitrary gather
– Restricted scatter
[Diagrams: Input and Output arrays for gather and scatter on the CPU and the GPU]
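Gather means the output element decides where to read; scatter means the input element decides where to write. When scatter is restricted, a scatter whose target indices form a permutation can be re-expressed as a gather by inverting the index map. The sketch below shows this special case only; the function names are illustrative, and general scatters with colliding targets need other techniques.

```c
#include <assert.h>
#include <stddef.h>

/* Gather: each output element chooses where to read from. */
void gather(const float *in, const size_t *src, float *out, size_t n)
{
    for (size_t j = 0; j < n; j++)
        out[j] = in[src[j]];
}

/* Scatter: each input element chooses where to write to. */
void scatter(const float *in, const size_t *dst, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[dst[i]] = in[i];
}

/* If dst is a permutation of 0..n-1, src[dst[i]] = i yields the index
   map of an equivalent gather. The inversion is itself a scatter, so in
   this sketch it is assumed to run once on the CPU as a precomputation. */
void invert_permutation(const size_t *dst, size_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        assert(dst[i] < n);
        src[dst[i]] = i;
    }
}
```

The inversion pays off when the same index pattern is reused over many passes, since the gather version then runs without any write conflicts.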
17
Overview
– Computation / Bandwidth / Power
– CPU – GPU Comparison
– GPU Characteristics
18
1) Computational Performance
[Chart: GFLOPS, courtesy of John Owens. Labeled data point: ATI R520]
Note: Sustained performance is usually much lower and depends heavily on the memory system!
19
2) Memory Performance
CPU
– Large cache
– Few processing elements
– Optimized for spatial and temporal data reuse
GPU
– Small cache
– Many processing elements
– Optimized for sequential (streaming) data access
[Chart courtesy of Ian Buck: GeForce 7800 GTX and Pentium 4. Memory access types: Cache, Sequential, Random]
20
3) Configuration Overhead
[Chart courtesy of Ian Buck. Regions labeled: configuration limited, computation limited]
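The two regimes named on this slide can be illustrated with a simple cost model. The sketch below assumes that a pass costs a fixed configuration time plus a time proportional to the number of processed elements; this linear model and all names are assumptions for illustration, and the chart itself may rest on different measurements.

```c
#include <assert.h>

/* Assumed linear model: fixed setup cost plus per-element cost. */
double pass_time(double config_time, double time_per_element,
                 double elements)
{
    assert(config_time >= 0.0 && time_per_element >= 0.0 && elements >= 0.0);
    return config_time + time_per_element * elements;
}

/* A pass is configuration limited when the setup cost exceeds the
   time spent computing, and computation limited otherwise. */
int is_configuration_limited(double config_time, double time_per_element,
                             double elements)
{
    return config_time > time_per_element * elements;
}
```

In this model the break-even point lies at `config_time / time_per_element` elements, so small arrays cannot amortize the setup cost no matter how fast the kernel is.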
21
Conclusions
– Parallelism is now indispensable to further increase performance
– Both memory-dominated and processing-element-dominated designs have pros and cons
– Mapping algorithms to the appropriate architecture allows enormous speedups
– Many of the GPU’s restrictions are crucial for parallel efficiency (Eat the cake or have it)