Trends in Multicore Architecture

Kevin Skadron, University of Virginia, Dept. of Computer Science, LAVA Lab

Outline
Objective of this segment: explain why this tutorial is relevant to researchers in systems design.
- Why multicore?
- Why GPUs?
- Why CUDA?

Why Multicore? How did we get here?
A combination of the “ILP wall” and the “power wall”:
- ILP wall: wider superscalar and more aggressive out-of-order (OO) execution yield diminishing returns, so the remaining way to boost single-thread performance is to boost frequency.
- Power wall: boosting frequency enough to keep up with Moore’s Law (2X per generation) is expensive (a rough sketch of the arithmetic follows below).
  - Natural frequency growth from technology scaling is only ~20-30% per generation, and that does not require expensive microarchitectures.
  - Faster frequency growth requires aggressive circuits (expensive) and very deep pipelines of 30+ stages (expensive).
  - Power-saving techniques weren’t able to compensate.
  - The result is no longer worth the silicon area, cooling costs, or battery life.
- If there were still good opportunities to raise ILP, we would pursue them, because that would save so much programming headache. But classical OO execution has run out of steam, and manufacturers are unwilling (for unknown reasons) to pursue more aggressive OO designs.
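To make the power-wall argument concrete, here is a rough first-order sketch (my addition, not a slide from the talk) using the standard CMOS dynamic-power relation; the cubic dependence is an approximation for illustration, not a figure from the tutorial:

```latex
% Dynamic power of CMOS logic:
P_{\text{dyn}} = \alpha \, C \, V_{dd}^{2} \, f
% Pushing f beyond the ~20-30% per generation that scaling provides requires
% raising V_dd roughly in proportion to f, so to first order:
V_{dd} \propto f \quad\Longrightarrow\quad P_{\text{dyn}} \propto f^{3}
```

Under that approximation, doubling frequency by brute force costs roughly 8X the dynamic power, which is why deep pipelines and aggressive circuits stopped being worth the silicon, cooling, and battery life.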

[Figure] Single-core Watts/Spec (through 2005), normalized to the same technology node (courtesy Mark Horowitz).

The Multi-core Revolution
- We can’t make a single core much faster, but we still need to maintain profit margins, and adding more and more cache yields diminishing returns.
- “New” Moore’s Law: the same core is 2X smaller each generation, so the number of cores can double.
- Focus on throughput:
  - Use smaller, lower-power cores (even in-order issue).
  - Make cores multi-threaded.
  - Trade single-thread performance for better throughput and lower power density.
- This is what GPUs do!
- An important point: performance in a single, commodity box is the sweet spot.

Where Do GPUs Fit In?
- Big-picture goal for processor designers: improve user benefit with Moore’s Law.
  - Solve bigger problems.
  - Improve the user experience → induce users to buy new systems.
- We need scalable, programmable multicore.
  - Scalable: doubling the number of processing elements (PEs) roughly doubles performance. Multicore achieves this if your program has scalable parallelism (see the note below).
  - Programmable: it must be easy to realize the performance potential.
- GPUs can do this!
  - GPUs provide a pool of cores with general-purpose instruction sets.
  - Graphics has lots of parallelism that scales to large numbers of cores.
  - CUDA leverages this background.
- We also need to maximize performance/mm², and high volume to achieve commodity pricing; GPUs of course leverage the 3D graphics market for volume.
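The “scalable parallelism” caveat is Amdahl’s law; as a brief reminder (my addition, not part of the original slides), with p the parallelizable fraction of the work and N the number of PEs:

```latex
% Amdahl's law: speedup on N processing elements when a fraction p of the
% work parallelizes and (1 - p) remains serial.
S(N) = \frac{1}{(1 - p) + \dfrac{p}{N}}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{1 - p}
```

For example, with p = 0.99, 128 PEs give a speedup of roughly 56; with p = 0.90 the same 128 PEs give only about 9, so “doubling PEs doubles performance” really does require nearly all of the work to parallelize.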

How do GPUs differ from CPUs?
The key metric is performance/mm².
- Emphasize throughput, not per-thread latency.
- Maximize the number of PEs and their utilization.
  - Amortize hardware in time through multithreading: hide latency with computation rather than caching, and spend the area on PEs instead. Latencies are hidden with fast thread switching and many threads per PE; GPUs are much more aggressive here than today’s CPUs, with 24 active threads per PE on G80 (see the sketch below).
  - Exploit SIMD efficiency: amortize hardware in space by sharing fetch/control among multiple PEs (8 in the case of the Tesla architecture). On the SIMD-vector question, NVIDIA’s architecture is “scalar SIMD” (SIMT), while AMD does both; SIMT is discussed in a later segment.
- Provide high bandwidth to global memory to minimize the amount of multithreading needed: the G80 memory interface is 384 bits wide, R600’s is 512 bits.
- Net result: 470 GFLOP/s and ~80 GB/s sustained on G80.
- CPUs seem to be following similar trends.
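As a concrete illustration of “many threads per PE” (a minimal sketch I have added, not code from the tutorial; the array size and launch configuration are arbitrary assumptions), the kernel below launches millions of threads even though a G80-class GPU has only 128 PEs, so the scheduler always has ready warps to cover global-memory latency:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread computes one element of y = a*x + y. The per-thread work is
// tiny; throughput comes from launching far more threads than there are PEs,
// so the hardware can switch to ready warps while others wait on memory.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 22;                     // ~4M elements (illustrative size)
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));
    cudaMemset(y, 0, n * sizeof(float));

    // ~16K blocks of 256 threads: millions of threads multiplexed onto a few
    // hundred PEs, which is how latency is hidden without large caches.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    saxpy<<<blocks, threadsPerBlock>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    printf("launched %d blocks x %d threads for %d elements\n",
           blocks, threadsPerBlock, n);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```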

How do GPUs differ from CPUs? (2)
- Hardware thread creation and management: a new thread is created for each vertex/pixel. On a CPU, thread creation requires kernel or user-level software involvement.
- Virtualized cores: the program is agnostic about the physical number of cores (true for both 3D and general-purpose code). On a CPU, the number of threads is generally a function of the number of cores.
- Hardware barriers.
- These characteristics simplify problem decomposition, scalability, and portability (see the sketch below).
- Nothing prevents non-graphics hardware from adopting these features.
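A minimal CUDA sketch (my example, with hypothetical names such as blockSum) of two of these features: __syncthreads() is the hardware barrier within a block, and the grid size is chosen from the problem size with no reference to how many physical multiprocessors the GPU has:

```cuda
#include <cuda_runtime.h>

// Per-block reduction: each block of 256 threads sums 256 input elements into
// one partial sum (assumes blockDim.x == 256). __syncthreads() is the hardware
// barrier within a block, and the grid size is derived from the problem size n,
// not from the number of physical multiprocessors (the cores are virtualized).
__global__ void blockSum(const float *in, float *partial, int n) {
    __shared__ float buf[256];
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;
    buf[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                           // hardware barrier

    // Tree reduction in shared memory; a barrier separates each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        partial[blockIdx.x] = buf[0];
}
```

A host-side launch such as `blockSum<<<(n + 255) / 256, 256>>>(d_in, d_partial, n);` completes the picture; the same code runs unchanged on a GPU with 2 multiprocessors or 30, because the hardware scheduler assigns blocks to whatever cores exist.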

How do GPUs differ from CPUs? (3)
Specialized graphics hardware (here I only discuss what is exposed through CUDA):
- Texture path: high-bandwidth gather and interpolation.
- Constant memory: even higher-bandwidth access to small read-only data regions.
- Transcendentals (reciprocal square root, trig, log2, etc.).
- A different implementation of atomic memory operations: the GPU handles them in the memory interface, whereas a CPU generally handles them with CPU involvement.
- A local scratchpad in each core (a.k.a. per-block shared memory).
- A memory system that exploits spatial, not temporal, locality.
The pedigree of these features is graphics, but their applicability is broader (see the sketch below).
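A small illustrative sketch (my addition; the kernel and its names are hypothetical) showing how several of these features surface in CUDA source: __constant__ memory for small read-only data, the __shared__ per-block scratchpad, a hardware transcendental, and atomics resolved by the memory system:

```cuda
#include <cuda_runtime.h>

__constant__ float logScale;                   // small read-only value in constant memory
                                               // (set from the host with cudaMemcpyToSymbol)

// Histogram of log-magnitudes, 16 bins. Illustrates the per-block scratchpad
// (__shared__), a hardware transcendental (__log2f), and atomics resolved by
// the memory system rather than by the CPU. Note: shared-memory atomics need
// compute capability 1.2 or later; on older parts, use global atomics only.
__global__ void logHistogram(const float *data, int n, unsigned int *bins) {
    __shared__ unsigned int localBins[16];     // per-block scratchpad
    if (threadIdx.x < 16)
        localBins[threadIdx.x] = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = logScale * __log2f(fabsf(data[i]) + 1.0f);  // special-function unit
        int b = min(15, max(0, (int)v));
        atomicAdd(&localBins[b], 1u);          // atomic in on-chip shared memory
    }
    __syncthreads();

    // Fold this block's partial histogram into the global one.
    if (threadIdx.x < 16)
        atomicAdd(&bins[threadIdx.x], localBins[threadIdx.x]);
}
```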

How do GPUs differ from CPUs? (4)
- The fundamental trends are actually very general: exploit parallelism in time and space.
- Other processor families are following similar paths (multithreading, SIMD, etc.): Niagara, Larrabee, network processors, ClearSpeed, Cell BE, and many others.
- An alternative is heterogeneous designs, e.g. Fusion. (Nomenclature note: asymmetric = single-ISA, heterogeneous = multi-ISA.)

Why is CUDA Important?
- It is a mass-market platform: easy to buy and set up a system.
- It provides a solution for manycore parallelism, not limited to small core counts.
  - Easy-to-learn abstractions for massive parallelism (a minimal example follows below).
  - The abstractions are not tied to a specific platform: they don’t depend on the graphics pipeline and can be implemented on other platforms. Preliminary results suggest that CUDA programs run efficiently on multicore CPUs [Stratton’08].
- It supports a wide range of application characteristics: more general than streaming, and not limited to data parallelism.
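To give a flavor of how small those abstractions are, here is a complete vector addition in CUDA. This is a generic textbook-style sketch I have added for illustration, not code from the tutorial:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// The entire abstraction a beginner needs: a grid of blocks of threads,
// each thread handling one element.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main() {
    const int n = 1024;
    const size_t bytes = n * sizeof(float);
    float ha[n], hb[n], hc[n];
    for (int i = 0; i < n; ++i) { ha[i] = (float)i; hb[i] = 2.0f * i; }

    float *da, *db, *dc;
    cudaMalloc(&da, bytes);
    cudaMalloc(&db, bytes);
    cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    vecAdd<<<(n + 255) / 256, 256>>>(da, db, dc, n);   // 4 blocks of 256 threads

    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[42] = %.1f (expected 126.0)\n", hc[42]);
    cudaFree(da);
    cudaFree(db);
    cudaFree(dc);
    return 0;
}
```

The whole workflow fits on one screen: allocate, copy in, launch a grid of blocks, copy out.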

Why is CUDA Important? (2)
- CUDA + GPUs facilitate multicore research at scale: 16 eight-way SIMD cores = 128 PEs.
  - The simple programming model allows exploration of new algorithms, hardware bottlenecks, and parallel programming features.
  - The whole community can learn from this.
- CUDA + GPUs provide a real platform, today.
  - Results are not merely theoretical.
  - This increases interest from potential users, e.g. computational scientists, and boosts opportunities for interdisciplinary collaboration.
- The underlying ISA can be targeted with new languages: a great opportunity for research in parallel languages.
- CUDA is teachable: undergrads can start writing real programs within a couple of weeks. CUDA cracks the nut.

Thank you. Questions?