GPU Architecture Overview

GPU Architecture Overview Patrick Cozzi University of Pennsylvania CIS 565 - Fall 2014

Acknowledgements CPU slides – Varun Sampath, NVIDIA. GPU slides – Kayvon Fatahalian, CMU; Mike Houston, NVIDIA.

CPU and GPU Trends FLOPS – FLoating-point OPerations per Second. GFLOPS – one billion (10^9) FLOPS. TFLOPS – 1,000 GFLOPS. FLOPS are not a perfect metric because of memory access, instruction mix, branches, etc. Peak FLOPS are idealistic and rarely reached in practice. An FMA (fused multiply-add) is a single instruction that counts as two floating-point operations.
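A back-of-the-envelope sketch of how peak FLOPS are computed (the core count, clock, and SIMD width below are illustrative assumptions, not figures from the slides): peak FLOPS = cores × clock × FLOPs per cycle, counting each FMA as two operations.

    #include <stdio.h>

    int main(void) {
        /* Illustrative numbers only (not from the slides): a hypothetical
           4-core CPU at 3.0 GHz with 8-wide SIMD FMA units. */
        double cores      = 4.0;
        double clock_ghz  = 3.0;
        double simd_width = 8.0;   /* floats per vector register */
        double fma_flops  = 2.0;   /* one FMA = multiply + add */

        double peak_gflops = cores * clock_ghz * simd_width * fma_flops;
        printf("Peak: %.0f GFLOPS\n", peak_gflops);   /* prints 192 */
        return 0;
    }

Real chips rarely sustain this: memory stalls and a non-FMA instruction mix keep achieved FLOPS well below peak.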

CPU and GPU Trends Moore’s Law – the number of transistors on a chip doubles about every two years (not 18 months; that figure referred to performance). Chart from: http://docs.nvidia.com/cuda/cuda-c-programming-guide/

CPU and GPU Trends Compute: Intel Core i7 – 4 cores, ~100 GFLOPS; NVIDIA GTX 280 – 240 cores, ~1 TFLOPS. Memory bandwidth: Ivy Bridge system memory – 60 GB/s; NVIDIA 780 Ti – > 330 GB/s; PCI-E Gen3 – 8 GB/s in each direction. Install base: over 400 million CUDA-capable GPUs. The 2012 CPU: 6-8 cores with several MB of on-chip cache.

CPU and GPU Trends Single-core performance growth has been slowing since 2003 due to power and heat limits. GPUs deliver higher FLOPS per watt, per dollar, and per mm² of silicon. As of 2012, GPU peak FLOPS were about 10x CPU peak FLOPS. See http://www.gotw.ca/publications/concurrency-ddj.htm

CPU Review What are the major components in a CPU die? Cache, branch predictor, out-of-order logic, ALUs, …

CPU Review Desktop applications are lightly threaded, with lots of branches and lots of memory accesses. Software likes “serial,” hardware likes “parallel.” Branch prediction accuracy: 96.7% for vim, 98.9% for ls (profiled with psrun on ENIAC). Slide from Varun Sampath

A Simple CPU Core Fetch → Decode → Execute → Memory → Writeback. The core is clocked, and all of these steps occur in one clock period, so the clock period is bound by the full datapath delay. This gives bad utilization: most of the datapath sits idle at any moment. [Figure: single-cycle datapath with PC, instruction cache, register file, ALU, and data cache.] Image: Penn CIS501

Pipelining Capitalize on instruction-level parallelism (ILP): divide the datapath into stages of about 12-25 gate delays each, plus 3-5 gates for the latch. + Significantly reduced clock period. – Slight latency and area increase (pipeline latches). ? Dependent instructions. ? Branches. Alleged pipeline lengths: Core 2 – 14 stages; Pentium 4 (Prescott) – more than 20 stages; Sandy Bridge – in between. Pipelining allows much higher clock speeds. Pipeline length source: Agner Fog’s Microarchitecture Guide. [Figure: the single-cycle datapath split into pipeline stages separated by latches.] Image: Penn CIS501
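A small worked example of the clock-period arithmetic (the delays below are illustrative assumptions, not measurements from the slides): splitting a 5 ns single-cycle datapath into 5 stages with 0.2 ns of latch overhead per stage shortens the clock period to 1.2 ns, a ~4.2x speedup rather than the ideal 5x.

    #include <stdio.h>

    int main(void) {
        /* Illustrative delays only: a 5 ns single-cycle datapath split
           into 5 equal stages, plus 0.2 ns of latch overhead per stage. */
        double t_single = 5.0;                 /* ns */
        int    stages   = 5;
        double t_latch  = 0.2;                 /* ns per stage */

        double t_stage = t_single / stages + t_latch;
        printf("clock period: %.1f ns -> %.1f ns\n", t_single, t_stage);
        printf("speedup: %.2fx (ideal: %dx)\n", t_single / t_stage, stages);
        return 0;
    }

The latch overhead is why ever-deeper pipelines give diminishing returns: as stages shrink, the fixed latch delay dominates the clock period.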

Branches We don’t know what instructions will execute after a branch until the branch itself executes. We can either stall until it does, or “speculate” and flush the pipeline on a misprediction. [Figure: a jeq loop instruction in the pipeline with unknown (???) instructions fetched behind it.] Image: Penn CIS501

Branch Prediction + Modern predictors exceed 90% accuracy, raising performance and energy efficiency (96.7% on the vim example earlier). – Area increase. – Potential fetch-stage latency increase. Flushing the pipeline wastes all speculatively executed instructions. The area increase is potentially even greater for tournament predictors. Predication avoids branches completely, since the instruction is always fetched and conditionally executed.
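A minimal C sketch of the predication idea (both functions are illustrative; compilers often turn the second form into a conditional-move instruction): the branchless version computes both candidate results and selects one, so there is nothing to mispredict.

    /* Branchy version: a hard-to-predict branch can flush the pipeline. */
    int max_branchy(int a, int b) {
        if (a > b)
            return a;
        return b;
    }

    /* Predicated style: evaluate both "paths" and select the result,
       so there is no branch to mispredict. */
    int max_branchless(int a, int b) {
        int take_a = (a > b);              /* 0 or 1 */
        return take_a * a + (1 - take_a) * b;
    }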

Memory Hierarchy Memory: the larger it gets, the slower it gets. Rough numbers:

                       Latency     Bandwidth    Size
    SRAM (L1, L2, L3)  1-2 ns      200 GB/s     1-20 MB
    DRAM (memory)      70 ns       20 GB/s      1-20 GB
    Flash (disk)       70-90 µs    200 MB/s     100-1000 GB
    HDD (disk)         10 ms       1-150 MB/s   500-3000 GB

SRAM and DRAM latency, and DRAM bandwidth, for Sandy Bridge from LostCircuits. Flash and HDD latencies from AnandTech; Flash and HDD bandwidth from AnandTech Bench. SRAM bandwidth guesstimated.

Caching Keep the data we need close. Exploit temporal locality (a chunk just used is likely to be used again soon) and spatial locality (the next chunk to use is likely close to the previous one).
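A minimal C sketch of spatial locality in action (the array size is an illustrative assumption): C stores 2-D arrays row-major, so row-order traversal walks consecutive cache lines while column-order traversal jumps a full row between accesses.

    #define N 1024
    static float a[N][N];

    /* Row-order traversal: consecutive elements share cache lines,
       so most accesses hit (good spatial locality). */
    float sum_rows(void) {
        float sum = 0.0f;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];
        return sum;
    }

    /* Column-order traversal: successive accesses are N floats apart,
       so nearly every access can miss (poor spatial locality). */
    float sum_cols(void) {
        float sum = 0.0f;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += a[i][j];
        return sum;
    }

Both functions do identical arithmetic; only the access order, and therefore the cache behavior, differs.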

Cache Hierarchy Hardware-managed: L1 instruction/data caches, L2 unified cache, L3 unified cache. Software-managed: main memory and disk (the OS manages virtual memory). Levels grow larger but slower moving away from the core. Caches use LRU-esque eviction algorithms and can be inclusive, exclusive, or neither.

Intel Core i7 3960X Look at the caches: 24.7% of the die is the 15 MB L3 cache; each core has a 32 KB I$, a 32 KB D$, and 256 KB of L2. The memory controller takes 15% of the die: 4 channels, each supporting DDR3-1600 at 12.8 GB/s, for 51.2 GB/s total. Source: www.lostcircuits.com

Improving IPC IPC (instructions per cycle) is bottlenecked at 1 instruction per clock in a scalar pipeline. Superscalar – increase the pipeline width by duplicating pipeline registers, execution resources, and ports. Image: Penn CIS371

Scheduling Consider these instructions: xor r1,r2 -> r3 add r3,r4 -> r4 sub r5,r2 -> r3 addi r3,1 -> r1 xor and add are dependent (Read-After-Write, RAW). sub and addi are dependent (RAW). xor and sub are not truly dependent – they merely write the same register (Write-After-Write, WAW), a false dependence.

Register Renaming How about this instead: xor p1,p2 -> p6 add p6,p4 -> p7 sub p5,p2 -> p8 addi p8,1 -> p9 xor and sub can now execute in parallel

Out-of-Order Execution Reorder instructions to maximize throughput. The new pipeline: Fetch → Decode → Rename → Dispatch → Issue → Register-Read → Execute → Memory → Writeback → Commit. Three principal data structures: the Reorder Buffer (ROB), which tracks the status of in-flight instructions; the Physical Register File (PRF); and the Issue Queue/Scheduler, which chooses the next instruction(s) to execute. Power usage is high because the fetch/decode logic is always active and is not powered down like other functional units. “Dispatch logic scales somewhere between quadratically and exponentially with the issue-width” – http://www.lighterra.com/papers/modernmicroprocessors/

Parallelism in the CPU Covered so far: Instruction-Level Parallelism (ILP) extraction – superscalar, out-of-order. Up next: Data-Level Parallelism (DLP) – vectors; Thread-Level Parallelism (TLP) – Simultaneous Multithreading (SMT) and multicore.

Vectors Motivation: for (int i = 0; i < N; i++) A[i] = B[i] + C[i]; How do we make this go fast?

CPU Data-level Parallelism Single Instruction Multiple Data (SIMD) Let’s make the execution unit (ALU) really wide Let’s make the registers really wide too for (int i = 0; i < N; i+= 4) { // in parallel A[i] = B[i] + C[i]; A[i+1] = B[i+1] + C[i+1]; A[i+2] = B[i+2] + C[i+2]; A[i+3] = B[i+3] + C[i+3]; }

Vector Operations in x86 SSE2 – 4-wide packed float and packed integer instructions; Intel Pentium 4 onwards, AMD Athlon 64 onwards. AVX – 8-wide packed float and packed integer instructions; Intel Sandy Bridge, AMD Bulldozer. Sandy Bridge can issue 2 AVX ops per cycle.
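A minimal sketch of the earlier loop written with SSE intrinsics (assuming n is a multiple of 4 and the pointers are 16-byte aligned; this is an illustration, not code from the slides):

    #include <xmmintrin.h>   /* SSE intrinsics */

    /* A[i] = B[i] + C[i], four floats per instruction.
       Assumes n is a multiple of 4 and pointers are 16-byte aligned. */
    void vec_add(float *A, const float *B, const float *C, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 b = _mm_load_ps(&B[i]);   /* load 4 packed floats */
            __m128 c = _mm_load_ps(&C[i]);
            __m128 a = _mm_add_ps(b, c);     /* 4 adds in one instruction */
            _mm_store_ps(&A[i], a);
        }
    }

With AVX the same pattern uses 256-bit __m256 registers and processes 8 floats per instruction.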

Thread-Level Parallelism Thread composition: instruction streams with a private PC, registers, and stack, but shared globals and heap. Threads are created and destroyed by the programmer and scheduled by the programmer or by the OS.
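A minimal pthreads sketch of that composition (the worker count and variable names are illustrative): each thread runs its own instruction stream with a private stack, while the global counter is shared and needs synchronization.

    #include <pthread.h>
    #include <stdio.h>

    int shared_counter = 0;                 /* shared global */
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    void *worker(void *arg) {
        int id = *(int *)arg;               /* local: lives on this
                                               thread's private stack */
        pthread_mutex_lock(&lock);
        shared_counter++;                   /* shared state: synchronized */
        pthread_mutex_unlock(&lock);
        printf("thread %d done\n", id);
        return NULL;
    }

    int main(void) {
        pthread_t threads[4];
        int ids[4];
        for (int i = 0; i < 4; i++) {       /* created by the programmer */
            ids[i] = i;
            pthread_create(&threads[i], NULL, worker, &ids[i]);
        }
        for (int i = 0; i < 4; i++)         /* destroyed (joined) by the
                                               programmer */
            pthread_join(threads[i], NULL);
        printf("counter = %d\n", shared_counter);
        return 0;
    }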

Simultaneous Multithreading Instructions can be issued from multiple threads, which requires partitioning the ROB and other buffers. + Minimal hardware duplication. + More scheduling freedom for out-of-order execution. – Cache and execution-resource contention can reduce single-threaded performance. The register state, return stack buffer, and large-page ITLB are the only replicated resources: http://www.anandtech.com/show/2594/8

Multicore Replicate the full pipeline. Sandy Bridge-E: 6 cores. + Full cores, with no resource sharing other than the last-level cache. + An easier way to take advantage of Moore’s Law. – Utilization.

CPU Conclusions The CPU is optimized for sequential programming: pipelines, branch prediction, superscalar issue, and out-of-order execution reduce execution time through high clock speeds and high utilization. Slow memory is a constant problem. As for parallelism, Sandy Bridge-E is great for 6-12 active threads – but how about 16K? (A GeForce GTX 680 runs 16K threads.) A lot of effort goes into improving IPC and hiding memory latencies; attacking throughput is an entirely different game.

How did this happen?

Graphics Workloads Triangles/vertices and pixels/fragments Right image from http://http.developer.nvidia.com/GPUGems3/gpugems3_ch14.html

Early 90s – Pre-GPU Wolfenstein 3D – 2.5D: no ceiling or floor height changes (or textures), no rolling walls, no lighting; enemies always face the viewer; ray casting with pre-computed texture coordinates. Doom – based on a BSP tree (Bruce Naylor); texture-mapped all surfaces; lighting; varying floor altitudes; rolling walls. Slide from http://s09.idav.ucdavis.edu/talks/01-BPS-SIGGRAPH09-mhouston.pdf

Why GPUs? Graphics workloads are embarrassingly parallel: data-parallel and pipeline-parallel. The CPU and GPU execute in parallel. Dedicated hardware handles texture filtering, rasterization, etc.

Data Parallel Beyond Graphics Cloth simulation, particle systems, matrix multiply. Data parallel means the same program running on multiple data (SPMD). Cloth is modeled with a mass-spring-damper system obeying Hooke’s Law; the force on each mass is computed independently at each time step. Each particle in a particle system has forces like gravity and wind acting on it; integrating from f = ma to position can be done independently at each time step. Can an application have more than one type of parallelism? Image from: https://plus.google.com/u/0/photos/100838748547881402137/albums/5407605084626995217/5581900335460078306
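A minimal C sketch of the particle-system example (the struct layout and gravity constant are illustrative assumptions): every particle runs the same update with no cross-particle dependence, so each loop iteration – like each GPU thread – is independent.

    typedef struct { float px, py, vx, vy; } Particle;

    /* One Euler integration step from f = ma to position. The same code
       updates every particle, and no particle reads another's state, so
       all iterations can run in parallel (SPMD). */
    void step(Particle *p, int n, float dt) {
        const float gx = 0.0f, gy = -9.8f;   /* gravity (illustrative) */
        for (int i = 0; i < n; i++) {
            p[i].vx += gx * dt;
            p[i].vy += gy * dt;
            p[i].px += p[i].vx * dt;
            p[i].py += p[i].vy * dt;
        }
    }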

Highlight fixed and programmable components