Hybrid PC Architecture
Jeremy Sugerman, Kayvon Fatahalian

Trends
• Multi-core CPUs
• Generalized GPUs
  – Brook, CTM, CUDA
• Tighter CPU-GPU coupling
  – PS3
  – Xbox 360
  – AMD "Fusion" (faster bus, but the GPU is still treated as a batch coprocessor)

CPU-GPU coupling
• Important applications (e.g., game engines) exhibit workloads suited to both CPU- and GPU-style cores:
  – GPU friendly: geometry processing, shading, physics (fluids/particles)
  – CPU friendly: IO, AI/planning, collisions, adaptive algorithms

CPU-GPU coupling
• Current: coarse-granularity interaction (sketched below)
  – Control: the CPU launches a batch of work and waits for the results before sending more commands (multi-pass)
  – Necessitates algorithmic changes
• The GPU is a slave coprocessor
  – Limited mechanisms to create new work
  – The CPU must deliver LARGE batches
  – The CPU sends GPU commands via the "driver" model
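
A minimal C++ sketch of this coarse model (our own illustration; the names and the trivial stand-in "kernel" are hypothetical): the CPU submits a large batch through the driver, idles until the GPU drains it, and only then can issue the next pass.

    #include <vector>

    struct Batch { std::vector<int> items; };

    // Stand-in for a driver call: the CPU blocks until the whole batch is done.
    Batch submitBatchAndWait(const Batch& in) {
        Batch out;
        for (int item : in.items) out.items.push_back(item);  // "kernel" runs here
        return out;  // results are only visible after the full round trip
    }

    // Every pass costs a full CPU-GPU round trip; nothing on the GPU can
    // create new work mid-pass.
    Batch multiPass(Batch work, int numPasses) {
        for (int pass = 0; pass < numPasses; ++pass)
            work = submitBatchAndWait(work);
        return work;
    }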

Fundamentally different cores
• "CPU" cores
  – Small number (tens) of HW threads
  – Software (OS) thread scheduling
  – Memory system prioritizes minimizing latency
• "GPU" cores
  – Many HW threads (>1000), hardware scheduled
  – Minimize per-thread state (state kept on-chip): shared PC, wide SIMD execution, small register file, no thread stack
  – Memory system prioritizes throughput
  – (Not yet clear: synchronization, SW-managed memory, isolation, resource constraints)

GPU as a giant scheduler
[Figure: the Direct3D 10 pipeline (IA → VS → GS → RS → PS → OM) drawn as a work scheduler. Stages are connected by on-chip queues; commands and data stream in from off-chip buffers, and results leave as an output stream. Per-stage work amplification ranges from 1-to-1 and 1-to-N (bounded) to 1-to-(0 or X) (X static) and 1-to-N (unbounded).]

GPU as a giant scheduler
[Figure: the same pipeline as hardware. Fixed-function IA, RS, and OM (read-modify-write) units and programmable VS/GS/PS cores are fed by on-chip command, vertex, primitive, and fragment queues; a hardware scheduler with a thread scoreboard dispatches the work, and data lives in off-chip buffers.]

GPU as a giant scheduler
• The rasterizer (plus the input command processor) is a domain-specific HW work scheduler
  – Millions of work items per frame
  – On-chip queues of work
  – Thousands of HW threads active at once
  – CPU threads (via API commands), GS programs, and fixed-function logic all generate work
  – The pipeline describes the dependencies
• What is the work here?
  – Vertices
  – Geometric primitives
  – Fragments
  – In the future: rays?
  – Each category has well-defined resource requirements (see the sketch below)
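
As a rough C++ sketch (our own illustration, not an actual hardware interface; the field names and numbers are hypothetical), the scheduler's view of those categories might look like this:

    #include <cstdint>

    enum class WorkKind { Vertex, Primitive, Fragment, Ray };

    // Per-category requirements, known before any item is dispatched.
    struct WorkClass {
        WorkKind      kind;
        std::uint32_t registersPerThread;  // register file footprint
        std::uint32_t maxOutputBytes;      // bounded amplification per item
        bool          canCreateWork;       // e.g., GS programs emit new primitives
    };

    // Because the requirements are static, the hardware can pack thousands of
    // threads and size its on-chip queues without asking the application.
    constexpr WorkClass kVertexWork{WorkKind::Vertex, 16, 64, false};
    constexpr WorkClass kPrimitiveWork{WorkKind::Primitive, 32, 256, true};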

The project
• Investigate making "GPU" cores first-class execution engines in a multi-core system
• Add:
  – Fine-granularity interaction between cores
  – Work running on any core can create new work (for any other core)
• Hypothesis: scheduling work (actions) is the key problem
  – Keeping state on-chip
• Drive the architecture simulation with an interactive graphics pipeline augmented with ray tracing

Our architecture
• Multi-core processor = some "CPU"-style + some "GPU"-style cores
• Unified system address space
• "Good" interconnect between cores
• Actions (work) on any core can create new work
• Potentially…
  – Software-managed configurable L2
  – Synchronization/signaling primitives across actions

Need a new scheduler
• The GPU HW scheduler leverages highly domain-specific information
  – Knows the dependencies
  – Knows the resources used by threads
• We need to move to a more general-purpose HW/SW scheduler, yet still perform well
• Questions
  – What scheduling algorithms? (one candidate is sketched below)
  – What information is needed to make decisions?
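
One candidate policy, as a hedged C++ sketch (our own illustration, not a settled design): when the queues form a pipeline, preferring the deepest non-empty downstream queue drains state rather than amplifying it, which matches the "minimize on-chip state" goal.

    #include <cstddef>
    #include <vector>

    struct WorkQueue {
        int         stageDepth;  // position in the dependency graph (later = deeper)
        std::size_t occupancy;   // items currently buffered
        bool empty() const { return occupancy == 0; }
    };

    // Pick the non-empty queue furthest downstream; break ties by fullness.
    // Running downstream work consumes items instead of creating more.
    WorkQueue* pickNext(std::vector<WorkQueue>& queues) {
        WorkQueue* best = nullptr;
        for (auto& q : queues) {
            if (q.empty()) continue;
            if (!best || q.stageDepth > best->stageDepth ||
                (q.stageDepth == best->stageDepth && q.occupancy > best->occupancy))
                best = &q;
        }
        return best;
    }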

Programming model = queues
• Model the system as a collection of work queues (see the sketch below)
  – Creating work = enqueue
  – SW-driven dispatch of "CPU" core work
  – HW-driven dispatch of "GPU" core work
  – Application code does not dequeue
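
A minimal C++ sketch of that application-facing surface (hypothetical types and sizes; the real interface is exactly what this project is exploring): application code only ever enqueues, and the dequeue side belongs to the system scheduler.

    #include <array>
    #include <cstddef>

    struct WorkItem {
        int kernelId;  // which kernel/environment consumes this item
        int payload;   // stand-in for real arguments
    };

    class Scheduler;  // SW dispatch on "CPU" cores, HW dispatch on "GPU" cores

    class SystemQueue {
    public:
        // The only operation visible to application code. Bounded: returns
        // false under back-pressure instead of growing without limit.
        bool enqueue(const WorkItem& item) {
            if (count == items.size()) return false;
            items[(head + count++) % items.size()] = item;
            return true;
        }
    private:
        friend class Scheduler;  // dispatch-side dequeue is not public
        bool dequeue(WorkItem& out) {
            if (count == 0) return false;
            out = items[head];
            head = (head + 1) % items.size();
            --count;
            return true;
        }
        std::array<WorkItem, 1024> items{};  // fixed size: state stays on-chip
        std::size_t head = 0, count = 0;
    };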

Benefits of queues
• Describe classes of work
  – Associate queues with environments (see the sketch below): GPU (no gather); GPU + gather; GPU + create work (bounded); CPU; CPU + SW-managed L2
• Opportunity to coalesce/reorder work
  – Fine-grained creation, bulk execution
• Describe dependencies
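
A hedged C++ sketch of that association (the environment names follow the slide; the descriptor fields are our own invention): tagging each queue with the environment its work requires lets the scheduler match work to capable cores and coalesce fine-grained enqueues into bulk dispatches.

    // Which execution environment can service a queue's work.
    enum class Environment {
        GpuNoGather,       // pure streaming kernel
        GpuGather,         // needs memory gather support
        GpuCreateBounded,  // may create new work, with a static bound
        Cpu,               // general-purpose code
        CpuManagedL2,      // relies on a software-managed L2
    };

    struct QueueDescriptor {
        Environment env;              // who can run this work
        int         maxItemsPerBatch; // coalescing: accumulate, then dispatch in bulk
        int         downstreamQueue;  // dependency edge (-1 if none)
    };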

Decisions
• Granularity of work
  – Enqueue single elements or batches?
• "Coherence" of work (batching state changes)
  – Associate kernels/resources with queues (as part of the environment)?
• Constraints on enqueue
  – Fail gracefully in case of work explosion (one approach is sketched below)
• Scheduling policy
  – Minimize state (the size of the queues)
  – How to understand dependencies
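
One way to fail gracefully, as a hypothetical C++ sketch (it assumes any queue type with a bounded enqueue and a dequeue, along the lines of the earlier SystemQueue sketch): when created work explodes and the target queue fills, the producer temporarily acts as a consumer instead of dropping work or spilling state off-chip.

    #include <functional>

    struct WorkItem { int kernelId; int payload; };

    template <typename Queue>
    void enqueueWithBackpressure(Queue& q, const WorkItem& item,
                                 const std::function<void(const WorkItem&)>& runItem) {
        WorkItem head;
        while (!q.enqueue(item)) {
            // Queue full: drain one item to make room. A real scheduler would
            // bound this via the per-environment resource constraints above.
            if (q.dequeue(head)) runItem(head);
        }
    }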

First steps
• Coarse architecture simulation
  – Hello world = run CPU + GPU threads, with GPU threads creating other threads
  – Identify the needed GPU ISA additions
• Establish what information the scheduler needs
  – What are the "environments"?
• Eventually drive the simulation with a hybrid renderer

Evaluation
• Compare against architectural alternatives:
  1. Multi-pass rendering (very coarse-grain) with a domain-specific scheduler
     – Paper: "GPU" microarchitecture comparison with our design
     – Scheduling resources
     – On-chip state / performance tradeoff
     – On-chip bandwidth
  2. Many-core homogeneous CPU

Summary
• Hypothesis: elevating "GPU" cores to first-class execution engines is a better way to build a hybrid system
  – Apps with dynamic/irregular components
  – Performance
  – Ease of programming
• Allow all cores to generate new work by adding to system queues
• Scheduling the work in these queues is the key issue (goal: keep the queues on-chip)

Three fronts
• GPU micro-architecture
  – GPU work creating GPU work
  – A generalization of the DirectX 10 GS
• CPU-GPU integration
  – GPU cores as first-class execution environments (dump the driver model)
  – A unified view of work throughout the machine
  – Any core creates work for any other core
• GPU resource management
  – Ability to correctly manage/virtualize GPU resources
  – Window manager