Prof. Hakim Weatherspoon CS 3410, Spring 2015 Computer Science Cornell University

Project3 Cache Race Games night: Monday, May 4th, 5pm. Come, eat, drink, have fun and be merry! Location: B17 Upson Hall. Prelim2: Thursday, April 30th, in the evening. Time and Location: 7:30pm sharp in Statler Auditorium. Old prelims are online in CMS. Prelim Review Session: TODAY, Tuesday, April 28, 7-9pm in B14 Hollister Hall. Project4: Design Doc due May 5th; bring the design doc to your meeting, May 4-6. Demos: May 12 and 13. Will not be able to use slip days.

Prelim2 Topics Lecture: Lectures 10 to 24 Data and Control Hazards (Chapters ) RISC/CISC (Chapters , 2.21) Calling conventions and linkers (Chapters 2.8, 2.12, Appendix A.1-6) Caching and Virtual Memory (Chapter 5) Multicore/parallelism (Chapter 6) Synchronization (Chapter 2.11) Traps, Exceptions, OS (Chapter 4.9, Appendix A.7, pp ) HW2, Labs 3/4, C-Labs 2/3, PA2/3 Topics from Prelim1 (not the focus, but some possible questions)

GPU: Graphics processing unit. Very basic until about 1999: a specialized device to accelerate display. Then it started changing into a full processor. 2000-…: Frontier times.

CPU: Central Processing Unit GPU: Graphics Processing Unit

GPU parallelism is similar to multicore parallelism. Key: how to gang-schedule thousands of threads on thousands of cores? Hardware multithreading with thousands of register sets. GPU hardware multithreads (like multicore hyperthreads): multi-issue + extra PCs and registers – dependency logic. The result is the illusion of thousands of cores. Fine-grained hardware multithreading makes it easier to keep pipelines full.

Processing is highly data-parallel, and GPUs are highly multithreaded. Thread switching is used to hide memory latency, with less reliance on multi-level caches. Graphics memory is wide and high-bandwidth. Trend toward general-purpose GPUs: heterogeneous CPU/GPU systems, with the CPU for sequential code and the GPU for parallel code. Programming languages/APIs: DirectX, OpenGL; C for Graphics (Cg), High Level Shader Language (HLSL); Compute Unified Device Architecture (CUDA).
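To make the data-parallel model concrete, here is a minimal CUDA sketch of a SAXPY-style kernel (the kernel name, block size, and launch line are illustrative assumptions, not taken from the lecture). Each thread handles one array element, so thousands of threads execute the same instruction stream over different data:

__global__ void saxpy(int n, float a, const float *x, float *y) {
    // One thread per element: compute this thread's global index
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];   // same operation, different data
}

// Host side (hypothetical launch): enough 256-thread blocks to cover n elements
// saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);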

[Graph: peak performance (polygons/sec) vs. year for graphics hardware from the late 1980s through the early 2000s (SGI, HP, Stellar, E&S, Megatek, UNC Pxpl/PixelFlow, Division, 3DLabs, Nvidia TNT/GeForce/G70, ATI Radeon 256, and others), with annotations for flat shading, Gouraud shading, antialiasing, textures, PC graphics, and one-pixel polygons (~10M at 30Hz). Slope ~2.4x/year, versus Moore's Law at ~1.7x/year. Graph courtesy of Professor John Poulton (from Eric Haines).]

Started in 1999. Flexible, programmable (vertex, geometry, fragment shaders), and much faster, of course. 1999 GeForce 256: 0.35 Gigapixel peak fill rate. 2001 GeForce3: 0.8 Gigapixel peak fill rate. 2003 GeForceFX Ultra: 2.0 Gigapixel peak fill rate; ATI Radeon 9800 Pro: 3.0 Gigapixel peak fill rate. 2006 NV60: ... Gigapixel peak fill rate. 2009 GeForce GTX 285: 10 Gigapixel peak fill rate. 2011: GeForce GTX 590: 56 Gigapixel peak fill rate; Radeon HD 6990: 2x; GeForce GTX 690: 62 Gigapixel/s peak fill rate.

Fixed function pipeline

Programmable vertex and pixel processors

nVIDIA Kepler

Parallelism: thousands of cores. Pipelining. Hardware multithreading. Not multi-level caching; streaming caches instead. Throughput, not latency.

Single Instruction Single Data (SISD) Multiple Instruction Single Data (MISD) Single Instruction Multiple Data (SIMD) Multiple Instruction Multiple Data (MIMD)

nVIDIA Kepler

Shuang Zhao, Cornell University, 2014

GPU memory hierarchy: registers (fastest, per-thread); shared memory (faster, per-block); global memory (slower, shared by all threads); constant/texture memory (read-only, cached).
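As a rough sketch of how these memory spaces appear in CUDA code (the kernel and variable names are invented for illustration; the block size is assumed to be at most 256 threads):

__constant__ float coeff[16];                // constant memory: read-only, cached

__global__ void scale(const float *in, float *out, int n) {
    __shared__ float tile[256];              // shared memory: visible to one block
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // i lives in a per-thread register
    if (i < n)
        tile[threadIdx.x] = in[i];           // read from global memory (slowest, device-wide)
    __syncthreads();                         // every thread in the block reaches the barrier
    if (i < n)
        out[i] = tile[threadIdx.x] * coeff[0];   // write result back to global memory
}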

Shuang Zhao, Cornell University, 2014. Host: the CPU and its memory. Device: the GPU and its memory.

Compute Unified Device Architecture (CUDA):
do_something_on_host();
kernel<<<...>>>(args);   // highly parallel
cudaDeviceSynchronize();
do_something_else_on_host();
Shuang Zhao, Cornell University, 2014
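Expanding that skeleton into a self-contained sketch (the kernel, buffer names, and launch configuration below are assumptions for illustration, not taken from the slide): the host allocates device memory, launches the kernel, which runs asynchronously, and synchronizes before reading results back:

#include <cuda_runtime.h>
#include <cstdio>

__global__ void add_one(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;                        // each thread updates one element
}

int main() {
    const int n = 1 << 20;
    int *d_data;                                    // "device": GPU memory
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMemset(d_data, 0, n * sizeof(int));

    add_one<<<(n + 255) / 256, 256>>>(d_data, n);   // launch returns immediately
    cudaDeviceSynchronize();                        // wait for the GPU to finish

    int first;                                      // "host": CPU memory
    cudaMemcpy(&first, d_data, sizeof(int), cudaMemcpyDeviceToHost);
    printf("first element = %d\n", first);          // expect 1
    cudaFree(d_data);
    return 0;
}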

Threads in a block are partitioned into warps. All threads in a warp execute in a Single Instruction Multiple Data (SIMD) fashion, so when a warp hits a conditional branch, every path taken by any of its threads is executed by the whole warp. The warp size varies; many graphics cards use 32. There is NO guaranteed execution ordering between warps. Shuang Zhao, Cornell University, 2014
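For intuition about how threads map onto warps (a small sketch; warpSize is the CUDA built-in device variable, everything else is illustrative):

#include <cstdio>

__global__ void whereami() {
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    int warp = threadIdx.x / warpSize;                  // which warp within this block
    int lane = threadIdx.x % warpSize;                  // position within that warp
    printf("thread %d: block %d, warp %d, lane %d\n", tid, blockIdx.x, warp, lane);
}

// Hypothetical launch: whereami<<<2, 64>>>(); gives 2 blocks of 2 warps each (if warpSize is 32).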

Warp divergence: when threads in one warp execute very different branches, performance is significantly harmed! A simple mitigation: reorder the threads so that all threads in each block (and hence in each warp) are more likely to take the same branch. This is not always possible. Shuang Zhao, Cornell University, 2014
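A hedged sketch of the problem and the reordering idea (the predicates and kernels below are invented for illustration). In the first kernel, even and odd lanes of every warp diverge, so the warp serializes both paths; the second kernel assumes the data has been rearranged so that all 32 elements handled by a warp need the same operation, and branches on the warp index instead, keeping each warp on a single path:

// Divergent: within each 32-thread warp, half the lanes take each branch,
// so the warp executes the two paths one after the other.
__global__ void divergent(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (i % 2 == 0) x[i] = x[i] * 2.0f;   // even lanes
        else            x[i] = x[i] + 1.0f;   // odd lanes
    }
}

// Uniform per warp: after reordering the data so that each group of 32
// consecutive elements needs the same operation, the branch condition is
// identical for every thread in a warp, so neither path is serialized.
__global__ void uniform_per_warp(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int warp = i / 32;                        // assumes warp size 32
    if (i < n) {
        if (warp % 2 == 0) x[i] = x[i] * 2.0f;
        else               x[i] = x[i] + 1.0f;
    }
}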

Streaming multiprocessor: 8 × streaming processors

Streaming processors: single-precision FP and integer units. Each SP is fine-grained multithreaded. Warp: a group of 32 threads, executed in parallel, SIMD style, as 8 SPs × 4 clock cycles. Hardware contexts for 24 warps: registers, PCs, …