GPU Architecture Overview Patrick Cozzi University of Pennsylvania CIS 565 - Fall 2013.

Slides:

Advertisements

Similar presentations

Larrabee Eric Jogerst Cortlandt Schoonover Francis Tan.

Advertisements

Ahmad Lashgar, Amirali Baniasadi, Ahmad Khonsari ECE, University of Tehran, ECE, University of Victoria.

Vectors, SIMD Extensions and GPUs COMP 4611 Tutorial 11 Nov. 26,

Multiprocessors— Large vs. Small Scale Multiprocessors— Large vs. Small Scale.

CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture VLIW Steve Ko Computer Sciences and Engineering University at Buffalo.

Lecture 6: Multicore Systems

Intel Multi-Core Technology. New Energy Efficiency by Parallel Processing – Multi cores in a single package – Second generation high k + metal gate 32nm.

CPU Architecture Overview Varun Sampath CIS 565 Spring 2012.

Microprocessor Microarchitecture Multithreading Lynn Choi School of Electrical Engineering.

Multithreading Peer Instruction Lecture Materials for Computer Architecture by Dr. Leo Porter is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike.

Single-Chip Multiprocessor Nirmal Andrews. Case for single chip multiprocessors Advances in the field of integrated chip processing. - Gate density (More.

Kevin Walsh CS 3410, Spring 2010 Computer Science Cornell University Multicore & Parallel Processing P&H Chapter ,

GPUs. An enlarging peak performance advantage: –Calculation: 1 TFLOPS vs. 100 GFLOPS –Memory Bandwidth: GB/s vs GB/s –GPU in every PC and.

Prof. John Nestor ECE Department Lafayette College Easton, Pennsylvania ECE Computer Organization Lecture 19 - Pipelined.

1 Lecture 11: SMT and Caching Basics Today: SMT, cache access basics (Sections 3.5, 5.1)

Instruction Level Parallelism (ILP) Colin Stevens.

1 Lecture 11: ILP Innovations and SMT Today: out-of-order example, ILP innovations, SMT (Sections 3.5 and supplementary notes)

1 Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections )

EECS 470 Superscalar Architectures and the Pentium 4 Lecture 12.

1 Pipelining for Multi- Core Architectures. 2 Multi-Core Technology Single Core Dual CoreMulti-Core + Cache + Cache Core 4 or more cores.

1 Lecture 10: ILP Innovations Today: handling memory dependences with the LSQ and innovations for each pipeline stage (Section 3.5)

1 Lecture 10: ILP Innovations Today: ILP innovations and SMT (Section 3.5)

CS 7810 Lecture 24 The Cell Processor H. Peter Hofstee Proceedings of HPCA-11 February 2005.

Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU Presented by: Ahmad Lashgar ECE Department, University of Tehran.

GPGPU overview. Graphics Processing Unit (GPU) GPU is the chip in computer video cards, PS3, Xbox, etc – Designed to realize the 3D graphics pipeline.

Single-Chip Multi-Processors (CMP) PRADEEP DANDAMUDI 1 ELEC , Fall 08.

1 Instant replay  The semester was split into roughly four parts. —The 1st quarter covered instruction set architectures—the connection between software.

Intel Architecture. Changes in architecture Software architecture: –Front end (Feature changes such as adding more graphics, changing the background colors,

CuMAPz: A Tool to Analyze Memory Access Patterns in CUDA

Simultaneous Multithreading: Maximizing On-Chip Parallelism Presented By: Daron Shrode Shey Liggett.

CPE 631: Multithreading: Thread-Level Parallelism Within a Processor Electrical and Computer Engineering University of Alabama in Huntsville Aleksandar.

GPU Programming and Architecture: Course Overview Patrick Cozzi University of Pennsylvania CIS Fall 2013.

Multi-core architectures. Single-core computer Single-core CPU chip.

By Michael Butler, Leslie Barnes, Debjit Das Sarma, Bob Gelinas This paper appears in: Micro, IEEE March/April 2011 (vol. 31 no. 2) pp 마이크로 프로세서.

Multi-Core Architectures

POLITECNICO DI MILANO Parallelism in wonderland: are you ready to see how deep the rabbit hole goes? Multithreaded and multicore processors Marco D. Santambrogio:

GPU Programming and Architecture: Course Overview Patrick Cozzi University of Pennsylvania CIS Spring 2012.

Chapter 2 Parallel Architecture. Moore’s Law The number of transistors on a chip doubles every years. – Has been valid for over 40 years – Can’t.

Nicolas Tjioe CSE 520 Wednesday 11/12/2008 Hyper-Threading in NetBurst Microarchitecture David Koufaty Deborah T. Marr Intel Published by the IEEE Computer.

VTU – IISc Workshop Compiler, Architecture and HPC Research in Heterogeneous Multi-Core Era R. Govindarajan CSA & SERC, IISc

GPU Architecture Overview

CUDA Performance Patrick Cozzi University of Pennsylvania CIS Fall

A Closer Look At GPUs By Kayvon Fatahalian and Mike Houston Presented by Richard Stocker.

GPU Programming and Architecture: Course Overview Patrick Cozzi University of Pennsylvania CIS Fall 2012.

SIMULTANEOUS MULTITHREADING Ting Liu Liu Ren Hua Zhong.

Multi-core processors. 2 Processor development till 2004 Out-of-order Instruction scheduling Out-of-order Instruction scheduling.

Processor Level Parallelism. Improving the Pipeline Pipelined processor – Ideal speedup = num stages – Branches / conflicts mean limited returns after.

Hybrid Multi-Core Architecture for Boosting Single-Threaded Performance Presented by: Peyman Nov 2007.

EKT303/4 Superscalar vs Super-pipelined.

GPGPU introduction. Why is GPU in the picture Seeking exa-scale computing platform Minimize power per operation. – Power is directly correlated to the.

Computer Structure 2015 – Intel ® Core TM μArch 1 Computer Structure Multi-Threading Lihu Rappoport and Adi Yoaz.

My Coordinates Office EM G.27 contact time:

Carnegie Mellon /18-243: Introduction to Computer Systems Instructors: Anthony Rowe and Gregory Kesden 27 th (and last) Lecture, 28 April 2011 Multi-Core.

Advanced Computer Architecture 5MD00 / 5Z033 SMT Simultaneously Multi-Threading Henk Corporaal TUEindhoven.

Modern general-purpose processors. Post-RISC architecture Instruction & arithmetic pipelining Superscalar architecture Data flow analysis Branch prediction.

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University.

GPU Architecture Overview Patrick Cozzi University of Pennsylvania CIS Fall 2012.

PipeliningPipelining Computer Architecture (Fall 2006)

1 Lecture 5a: CPU architecture 101 boris.

Fall 2012 Parallel Computer Architecture Lecture 4: Multi-Core Processors Prof. Onur Mutlu Carnegie Mellon University 9/14/2012.

Computer Structure Multi-Threading

/ Computer Architecture and Design

Instructional Parallelism

Hyperthreading Technology

CPE 631: Multithreading: Thread-Level Parallelism Within a Processor

Mattan Erez The University of Texas at Austin

Overview Prof. Eric Rotenberg

Mattan Erez The University of Texas at Austin

Lecture 10: ILP Innovations

CSE 502: Computer Architecture

Presentation transcript:

GPU Architecture Overview Patrick Cozzi University of Pennsylvania CIS Fall 2013

Reminders Course website  Google Group  Signup: GitHub  Create an account:  Change it to an edu account:

Projects Project 0 due Monday 09/09 Project 1  Released Monday 09/09  Due Wednesday 09/18  In-class demos Wednesday 09/18

Today Walkthrough Project 0 CPU and GPU architecture

Course Overview Follow-Up Progress – students surpass teaches I want more than 2% impact

Acknowledgements CPU slides – Varun Sampath, NVIDIA GPU slides  Kayvon Fatahalian, CMU  Mike Houston, AMD

CPU and GPU Trends FLOPS – FLoating-point OPerations per Second GFLOPS - One billion (10 9 ) FLOPS TFLOPS – 1,000 GFLOPS

CPU and GPU Trends Chart from:

CPU and GPU Trends Compute  Intel Core i7 – 4 cores – 100 GFLOP  NVIDIA GTX280 – 240 cores – 1 TFLOP Memory Bandwidth  System Memory – 60 GB/s  NVIDIA GTX680 – 200 GB/s  PCI-E Gen3 – 8 GB/s Install Base  Over 400 million CUDA-capable GPUs

CPU and GPU Trends Single-core performance slowing since 2003  Power and heat limits GPUs delivery higher FLOP per  Watt  Dollar  Mm 2012 – GPU peak FLOP about 10x CPU

CPU Review What are the major components in a CPU die?

CPU Review Desktop Applications –Lightly threaded –Lots of branches –Lots of memory accesses Profiled with psrun on ENIAC Slide from Varun SampathVarun Sampath

A Simple CPU Core Image: Penn CIS501Penn CIS501 Fetch  Decode  Execute  Memory  Writeback PC I$ Register File s1s2d D$ +4+4 control

Pipelining Image: Penn CIS501Penn CIS501 PC Insn Mem Register File s1s2d Data Mem +4+4 T insn-mem T regfile T ALU T data-mem T regfile T singlecycle

Branches Image: Penn CIS501Penn CIS501 jeq loop ??? Register File SXSX s1s2d Data Mem a d IR A B IR O B IR O D IR F/DD/XX/MM/W nop

Branch Prediction + Modern predictors > 90% accuracy o Raise performance and energy efficiency (why?) – Area increase – Potential fetch stage latency increase

Memory Hierarchy Memory: the larger it gets, the slower it gets Rough numbers: LatencyBandwidthSize SRAM (L1, L2, L3) 1-2ns200GBps1-20MB DRAM (memory) 70ns20GBps1-20GB Flash (disk)70-90µs200MBps GB HDD (disk)10ms1-150MBps GB

Caching Keep data weneed close Exploit:  Temporal locality Chunk just used likely to be used again soon  Spatial locality Next chunk to use is likely close to previous

Cache Hierarchy Hardware-managed  L1 Instruction/Data caches  L2 unified cache  L3 unified cache Software-managed  Main memory  Disk I$ D$ L2 L3 Main Memory Disk (not to scale) LargerLarger FasterFaster

Intel Core i7 3960X – 15MB L3 (25% of die). 4-channel Memory Controller, 51.2GB/s total Source:

Improving IPC IPC (instructions/cycle) bottlenecked at 1 instruction / clock Superscalar – increase pipeline width Image: Penn CIS371

Scheduling Consider instructions: xor r1,r2 -> r3 add r3,r4 -> r4 sub r5,r2 -> r3 addi r3,1 -> r1 xor and add are dependent (Read-After- Write, RAW) sub and addi are dependent (RAW) xor and sub are not (Write-After-Write, WAW)

Register Renaming How about this instead: xor p1,p2 -> p6 add p6,p4 -> p7 sub p5,p2 -> p8 addi p8,1 -> p9 xor and sub can now execute in parallel

Out-of-Order Execution Reordering instructions to maximize throughput Fetch  Decode  Rename  Dispatch  Issue  Register-Read  Execute  Memory  Writeback  Commit Reorder Buffer (ROB)  Keeps track of status for in-flight instructions Physical Register File (PRF) Issue Queue/Scheduler  Chooses next instruction(s) to execute

Parallelism in the CPU Covered Instruction-Level (ILP) extraction  Superscalar  Out-of-order Data-Level Parallelism (DLP)  Vectors Thread-Level Parallelism (TLP)  Simultaneous Multithreading (SMT)  Multicore

Vectors Motivation for (int i = 0; i < N; i++) A[i] = B[i] + C[i];

CPU Data-level Parallelism Single Instruction Multiple Data (SIMD)  Let’s make the execution unit (ALU) really wide  Let’s make the registers really wide too for (int i = 0; i < N; i+= 4) { // in parallel A[i] = B[i] + C[i]; A[i+1] = B[i+1] + C[i+1]; A[i+2] = B[i+2] + C[i+2]; A[i+3] = B[i+3] + C[i+3]; }

Vector Operations in x86 SSE2  4-wide packed float and packed integer instructions  Intel Pentium 4 onwards  AMD Athlon 64 onwards AVX  8-wide packed float and packed integer instructions  Intel Sandy Bridge  AMD Bulldozer

Thread-Level Parallelism Thread Composition  Instruction streams  Private PC, registers, stack  Shared globals, heap Created and destroyed by programmer Scheduled by programmer or by OS

Simultaneous Multithreading Instructions can be issued from multiple threads Requires partitioning of ROB, other buffers + Minimal hardware duplication + More scheduling freedom for OoO – Cache and execution resource contention can reduce single-threaded performance

Multicore Replicate full pipeline Sandy Bridge-E: 6 cores + Full cores, no resource sharing other than last-level cache + Easier way to take advantage of Moore’s Law – Utilization

CPU Conclusions CPU optimized for sequential programming  Pipelines, branch prediction, superscalar, OoO  Reduce execution time with high clock speeds and high utilization Slow memory is a constant problem Parallelism  Sandy Bridge-E great for 6-12 active threads  How about 12,000?

How did this happen?

Graphics Workloads Triangles/vertices and pixels/fragments Right image from

Early 90s – Pre GPU Slide from

Why GPUs? Graphics workloads are embarrassingly parallel  Data-parallel  Pipeline-parallel CPU and GPU execute in parallel Hardware: texture filtering, rasterization, etc.

Data Parallel Beyond Graphics  Cloth simulation  Particle system  Matrix multiply Image from: