Heterogeneous Multi-Core Processors
Jeremy Sugerman
GCafe, May 3, 2007

Context
- Exploring the CPU and GPU future relationship
  - Joint work and thinking with Kayvon
  - Much kibitzing from Pat, Mike, Tim, Daniel
- Vision and opinion, not experiments and results
  - More of a talk than a paper
  - The value is more conceptual than algorithmic
  - Wider gcafe audience appeal than our near-term, elbows-deep plans to dive into GPU guts

Outline
- Introduction
- CPU “Special Feature” Background
- Compute-Maximizing Processors
- Synthesis, with Extensions
- Questions for the Audience…

Introduction
- Multi-core is the status quo for forthcoming CPUs
- A variety of emerging (for “general purpose” use) architectures try to offer a discontinuous performance boost over traditional CPUs
  - GPU, Cell SPEs, Niagara, Larrabee, …
- CPU vendors have a history of co-opting special-purpose units for targeted performance wins:
  - FPU, SSE/Altivec, VT/SVM
- CPUs should co-opt entire “compute” cores!

Introduction
- Industry is already exploring hybrid models
  - Cell: 1 PowerPC and 8 SPEs
  - AMD Fusion: Slideware CPU + GPU
  - Intel Larrabee: Weirder, NDA encumbered
- The programming model for communicating deserves to be architecturally defined.
- Tighter integration than the current “host + accelerator” model eases porting and improves efficiency.
- Work queues / buffers allow integrated coordination with decoupled execution.
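To make the last point concrete, here is a minimal sketch (not from the talk; all names are invented) of queue-based coordination with decoupled execution. An ordinary C++ thread stands in for a compute core: the CPU side keeps producing work while the consumer drains the queue at its own pace.

```cpp
// Minimal sketch of queue-based coordination between a "CPU" producer and a
// "compute" consumer. A plain std::thread stands in for a compute-max core;
// WorkQueue / WorkItem are illustrative names, not a real API.
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>

struct WorkItem { int id; float data; };

class WorkQueue {
public:
    void push(WorkItem w) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(w); }
        cv_.notify_one();
    }
    void close() {
        { std::lock_guard<std::mutex> lk(m_); closed_ = true; }
        cv_.notify_all();
    }
    // Blocks until an item arrives, or returns nullopt once closed and drained.
    std::optional<WorkItem> pop() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty() || closed_; });
        if (q_.empty()) return std::nullopt;
        WorkItem w = q_.front(); q_.pop();
        return w;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<WorkItem> q_;
    bool closed_ = false;
};

int main() {
    WorkQueue queue;
    // "Compute core": drains the queue asynchronously, decoupled from the producer.
    std::thread compute([&] {
        while (auto w = queue.pop())
            std::printf("processed item %d -> %f\n", w->id, w->data * 2.0f);
    });
    // "CPU core": keeps generating work without waiting on each result.
    for (int i = 0; i < 8; ++i) queue.push({i, float(i)});
    queue.close();
    compute.join();
    return 0;
}
```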

Outline
- Introduction
- CPU “Special Feature” Background
- Compute-Maximizing Processors
- Synthesis, with Extensions
- Questions for the Audience…

CPU “Special Features”
- CPUs are built for general-purpose flexibility…
- … but have always stolen fixed-function units in the name of performance.
  - Old CPUs had schedulers, malloc burned in!
  - CISC instructions really were faster
  - Hardware-managed TLBs and caches
  - Arguably, all virtual memory support

CPU “Special Features”
- More relevantly, dedicated hardware has been adopted for domain-specific workloads…
- … when the domain was sufficiently large / lucrative / influential
- … and the increase in performance over software implementation / emulation was BIG
- … and the cost in “design budget” (transistors, power, area, etc.) was acceptable.
- Examples: FPUs, SIMD and Non-Temporal accesses, CPU virtualization

Outline
- Introduction
- CPU “Special Feature” Background
- Compute-Maximizing Processors
- Synthesis, with Extensions
- Questions for the Audience…

Compute-Maximizing Processors
- “Important” common apps are FLOP hungry
  - Video processing, Rendering
  - Physics / Game “Physics”
  - Even OS compositing managers!
- HPC apps are FLOP hungry too
  - Computational Bio, Finance, Simulations, …
- All can soak up vastly more compute than current CPUs can deliver.
- All can utilize thread or data parallelism.
- Increased interest in custom / non-“general” processors

Compute-Maximizing Processors
- Or “throughput oriented”
- Packed with ALUs / FPUs
- Application-specified parallelism replaces the focus on single-thread ILP
- Available in many flavours:
  - SIMD
  - Highly threaded cores
  - High numbers of tiny cores
  - Stream processors
- Real-life examples generally mix and match
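A small illustrative sketch of application-specified parallelism (invented for this writeup, not part of the original slides): the program names one independent work item per element, and a throughput-oriented processor is free to spread those items across SIMD lanes or hardware threads rather than extracting ILP from a single instruction stream. Plain C++ threads stand in for the hardware.

```cpp
// Illustrative only: the application expresses per-element parallelism
// explicitly; on a compute-max core each iteration could map onto a SIMD lane
// or a hardware thread instead of being mined for ILP.
#include <cstdio>
#include <thread>
#include <vector>

// The "kernel": one independent work item per element, no cross-element deps.
void saxpy_element(float a, const float* x, float* y, size_t i) {
    y[i] = a * x[i] + y[i];
}

int main() {
    const size_t n = 1 << 16;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);
    unsigned workers = std::thread::hardware_concurrency();
    if (workers == 0) workers = 4;
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([&, w] {
            // Each worker strides over the element space.
            for (size_t i = w; i < n; i += workers)
                saxpy_element(2.0f, x.data(), y.data(), i);
        });
    }
    for (auto& t : pool) t.join();
    std::printf("y[0] = %f\n", y[0]);
    return 0;
}
```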

Compute-Maximizing Processors
- Offer an order-of-magnitude potential performance boost… if the workload sustains high processor utilization
- Mapping / porting algorithms is a labour-intensive and complex effort.
- This is intrinsic: within any design budget, a BIG performance win comes at a cost…
- If it didn’t, the CPU designers would steal it.

Compute-Maximizing Programming
- Generally offered as off-board “accelerators”
  - Data “tossed over the wall” and back
  - Only portions of computations achieve a speedup if offloaded
  - Accelerators mono-task one kernel at a time
- Applications are sliced into successive, statically defined phases separated by resorting, repacking, or converting entire datasets.
- Limited to a single dataset-wide feed-forward pipeline. Effectively back to batch processing.
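A hypothetical sketch of that batch style (all names invented): each phase runs one kernel over the entire dataset, and the host repacks everything before the next phase can begin, so nothing proceeds until the whole previous pass has finished.

```cpp
// Sketch of today's "host + accelerator" batch style: one kernel per phase over
// the whole dataset, with a host-side repack between phases. Plain functions
// stand in for an accelerator offload; names are illustrative.
#include <cstdio>
#include <vector>

struct Element { float value; bool alive; };

// Phase 1: one kernel across the entire dataset ("tossed over the wall").
void phase_transform(std::vector<Element>& data) {
    for (auto& e : data) { e.value *= 2.0f; e.alive = (e.value < 8.0f); }
}

// Host-side repacking between phases: drop dead elements so the next batch
// stays dense. The whole dataset is touched even if little changed.
std::vector<Element> repack(const std::vector<Element>& data) {
    std::vector<Element> packed;
    for (const auto& e : data) if (e.alive) packed.push_back(e);
    return packed;
}

// Phase 2: another full-dataset kernel over the repacked batch.
void phase_accumulate(const std::vector<Element>& data) {
    float sum = 0.0f;
    for (const auto& e : data) sum += e.value;
    std::printf("sum over %zu live elements = %f\n", data.size(), sum);
}

int main() {
    std::vector<Element> data;
    for (int i = 0; i < 8; ++i) data.push_back({float(i), true});
    phase_transform(data);   // batch 1
    data = repack(data);     // full-dataset repack on the host
    phase_accumulate(data);  // batch 2
    return 0;
}
```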

Outline
- Introduction
- CPU “Special Feature” Background
- Compute-Maximizing Processors
- Synthesis, with Extensions
- Questions for the Audience…

Synthesis
- Add at least one compute-max core to CPUs
  - Workloads that use it get a BIG performance win
  - Programmers are struggling to get any performance from having more normal cores
  - Being “on-chip”, architected, and ubiquitous is huge for application use of compute-max
- Compute core exposed as a programmable, independent, multithreaded execution engine
  - A lot like adding (only!) fragment shaders
  - Largely agnostic on hardware “flavour”

Extensions
- Unified address space
  - Coherency is nice, but still valuable without it
- Multiple kernels “bound” (loaded) at a time
  - All part of the same application, for now
- “Work” delivered to compute cores through work queues
  - Dequeuing batches / schedules for coherence, not necessarily FIFO
  - Compute and CPU cores can insert onto remote queues
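A minimal, hypothetical sketch of how this might look to an application, assuming invented names (Runtime, bind_kernel, enqueue): several kernels bound at once, work flowing through per-kernel queues, a dispatcher that pulls from whichever queue has work, and both CPU code and running kernels able to enqueue new items. A unified address space is what makes passing plain pointers and handles around like this reasonable.

```cpp
// Hypothetical sketch of the proposed model; nothing here is a shipping API.
#include <cstdio>
#include <deque>
#include <functional>
#include <vector>

struct Work { int elem; };  // one data element's worth of work

class Runtime {
public:
    using Kernel = std::function<void(Runtime&, Work)>;

    int bind_kernel(Kernel k) {          // returns a handle, like binding a shader
        kernels_.push_back(std::move(k));
        queues_.emplace_back();
        return int(kernels_.size()) - 1;
    }
    void enqueue(int kernel, Work w) {   // legal from CPU code *or* from a kernel
        queues_[kernel].push_back(w);
    }
    // Stand-in for the compute-max core's scheduler: instead of idling when one
    // queue drains, it pulls from whichever bound kernel still has work.
    void run() {
        bool any = true;
        while (any) {
            any = false;
            for (size_t k = 0; k < queues_.size(); ++k) {
                if (queues_[k].empty()) continue;
                Work w = queues_[k].front();
                queues_[k].pop_front();
                kernels_[k](*this, w);
                any = true;
            }
        }
    }
private:
    std::vector<Kernel> kernels_;
    std::vector<std::deque<Work>> queues_;
};

int main() {
    Runtime rt;
    // Two kernels from the same application, bound simultaneously.
    int shade = -1;
    int intersect = rt.bind_kernel([&](Runtime& r, Work w) {
        std::printf("intersect elem %d\n", w.elem);
        if (w.elem % 2 == 0)             // dynamically spawn follow-on work
            r.enqueue(shade, {w.elem});
    });
    shade = rt.bind_kernel([](Runtime&, Work w) {
        std::printf("shade elem %d\n", w.elem);
    });
    for (int i = 0; i < 6; ++i) rt.enqueue(intersect, {i});  // CPU-side producer
    rt.run();
    return 0;
}
```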

Extensions
CLAIM: Queues break the “batch processing” straitjacket and still expose enough coherent parallelism to sustain compute-max utilization.
- The first part is easy:
  - Obvious per-data-element state machine
  - Dynamic insertion of new “work”
  - Instead of being idle as the live thread count in a “pass” drops, a core can pull in “work” from other “passes” (queues).

Extensions
CLAIM: Queues break the “batch processing” straitjacket and still expose enough coherent parallelism to sustain compute-max utilization.
- The second part is more controversial:
  - “Lots” of data quantized into a “few” states should have plentiful, easy coherence.
  - If the workload as a whole has coherence
  - Pigeonhole argument, basically
  - Also mitigates SIMD performance constraints
  - Coherence can be built / specified dynamically
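A toy sketch of the pigeonhole point (all names invented): if many queued items fall into only a few states, binning items by state as they are dequeued recovers full, SIMD-friendly batches from a completely mixed queue.

```cpp
// Illustrative binning dispatcher: mixed work on one queue, grouped by state
// into coherent batches of a pretend SIMD width before being issued.
#include <array>
#include <cstdio>
#include <deque>
#include <vector>

constexpr size_t kNumStates = 3;   // "few" states...
constexpr size_t kBatchWidth = 4;  // pretend SIMD width

struct Item { int id; int state; };

int main() {
    // "Lots" of mixed work arriving on one queue.
    std::deque<Item> queue;
    for (int i = 0; i < 26; ++i) queue.push_back({i, i % int(kNumStates)});

    std::array<std::vector<Item>, kNumStates> bins;
    while (!queue.empty()) {
        Item it = queue.front(); queue.pop_front();
        auto& bin = bins[it.state];
        bin.push_back(it);
        if (bin.size() == kBatchWidth) {  // a full, coherent batch
            std::printf("issue state-%d batch:", it.state);
            for (const auto& b : bin) std::printf(" %d", b.id);
            std::printf("\n");
            bin.clear();
        }
    }
    // Any leftovers run as partially filled (less coherent) batches; with many
    // items and few states, the pigeonhole argument says full batches dominate.
    for (size_t s = 0; s < kNumStates; ++s) {
        if (bins[s].empty()) continue;
        std::printf("issue partial state-%zu batch:", s);
        for (const auto& b : bins[s]) std::printf(" %d", b.id);
        std::printf("\n");
    }
    return 0;
}
```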

Outline
- Introduction
- CPU “Special Feature” Background
- Compute-Maximizing Processors
- Synthesis, with Extensions
- Questions for the Audience…

Audience Participation
- Do you believe my argument conceptually?
  - For the heterogeneous / hybrid CPU in general?
  - For queues and multiple kernels?
- What would persuade you that 3 x86 cores + a compute core is preferable to quad x86?
  - What app / class of apps, and how much of a win? 10x? 5x?
- How skeptical are you that queues can match the performance of multi-pass / batching?
- What would you find a compelling flexibility / expressiveness justification for adding queues?
  - Performance wins from regaining coherence in existing branching/looping shaders?
  - New algorithms if shaders and CPU threads can dynamically insert additional “work”?