Battle of the Accelerator Stars
Pavan Balaji, Computer Scientist
Group Lead, Programming Models and Runtime Systems, Argonne National Laboratory
P2S2 Workshop Panel (09/10/2012)


Accelerators != GPUs
• Anything built to perform a specific type of computation (i.e., not general-purpose computation) faster is an accelerator:
– A vector instruction unit is an accelerator
– The double/quad floating-point units on BG/P and BG/Q are accelerators
– An H.264 media decoder sitting on a processor die is an accelerator
– There is no such thing as a "general purpose" accelerator
• GPUs are one form of accelerator, but not the only one
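To make the vector-unit point concrete, here is a minimal, hypothetical host-code sketch (not from the talk; the function name is only illustrative): the same loop can run on the scalar pipeline or be mapped by the compiler onto the on-die SIMD/vector unit, an accelerator that shares the core's registers, caches, and memory, so no data staging is involved.

    #include <cstddef>

    // SAXPY: y = a*x + y. A vectorizing compiler (e.g., gcc or clang at -O3)
    // will typically map this loop onto the processor's vector unit; the
    // "accelerator" operates on exactly the same memory the scalar core
    // uses, so there is nothing to stage.
    void saxpy(float a, const float *x, float *y, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }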

Divergence in Accelerator Computing?
• Divergence == increasing difference
• No – there are a lot of different models of accelerator computing today
– NVIDIA/AMD GPUs, FPGAs, AMD's fused chip architectures, the Intel MIC architecture, Intel Xeon, Blue Gene (yes, they are accelerators too)
– Broadly classified into:
  Decoupled processing/decoupled memory (think GPUs)
  Coupled processing/decoupled memory (think AMD Fusion)
  Coupled processing/coupled memory (think Intel Xeon/MIC, BG/Q)
• But the trend is not towards increasing difference, but rather towards convergence
– Vendors and researchers are trying out different options to see what works and what does not
You have to kiss many frogs before you can find your prince!

Who will be the last man standing?
– GPUs: decoupled processing, decoupled memory
– Fused processors (e.g., AMD Fusion): coupled processing, decoupled memory
– General purpose processors with accelerator extensions (e.g., Xeon, MIC, BG/P, BG/Q): coupled processing, coupled memory

Quantum mechanical interactions are near-sighted (Walter Kohn)
[Figure: interaction strength vs. distance (range of interactions between particles). Traditional quantum chemistry studies lie within the near-sighted range, where interactions are dense; future quantum chemistry studies expose both short- and long-range interactions. The figures are phenomenological: quantum chemistry methods treat correlation using a variety of approaches and have different short/long-range cutoffs.]
Courtesy Jeff Hammond, Argonne National Laboratory

Wind Turbine and Flight Blade Designs
• Blades are getting larger with every new design
– With larger blades, the additional lift or torque generated comes from the outer regions of the blade
– Air flow over the far-out regions of the blade has lower computational intensity, making the computation more "sparse"
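As an illustration of what "sparse" means for the hardware, consider a compressed-sparse-row matrix-vector product, a common kernel in this kind of flow problem. The CUDA sketch below is hypothetical (not from the talk): each matrix entry loaded contributes roughly one multiply-add, so the kernel is bound by memory traffic rather than raw FLOPS, which is exactly the regime where data-movement and staging costs dominate.

    // CSR sparse matrix-vector product: one thread per row.
    __global__ void spmv_csr(int nrows, const int *row_ptr, const int *col_idx,
                             const double *val, const double *x, double *y) {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < nrows) {
            double sum = 0.0;
            // ~2 flops per 12-16 bytes read: memory-bandwidth bound.
            for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
                sum += val[j] * x[col_idx[j]];
            y[row] = sum;
        }
    }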

Decoupled Processing/Decoupled Memory (GPUs)
• Pros:
– A separate processing unit can be custom-built for acceleration
– Faster memory; better-designed memory and memory controllers for acceleration
• Cons:
– Decoupled from the main processing unit
[Figure: block diagram contrasting a regular CPU core (control unit, ALU, cache, DRAM) with an array of GPU cores]
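A minimal, hypothetical CUDA host sketch of the "decoupled" cost: with separate device memory, every kernel is bracketed by explicit copies across the PCIe link, which is the overhead the cons bullet refers to (function names are illustrative).

    #include <cuda_runtime.h>

    __global__ void scale(float *d, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= a;
    }

    void run_on_gpu(float *host, int n) {
        float *dev = NULL;
        cudaMalloc((void **)&dev, n * sizeof(float));
        cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice); // stage in
        scale<<<(n + 255) / 256, 256>>>(dev, 2.0f, n);                    // compute
        cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost); // stage out
        cudaFree(dev);
    }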

Coupled Processing/Decoupled Memory
• Pros:
– Improved coupling of the processing units allows for much faster synchronization
– Separate memory allows for better-optimized memory and memory controllers
• Cons:
– The need for data staging does not disappear
[Figure: two nodes, each with a CPU and a GPU and with separate CPU memory and GPU memory]
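A rough analogy in today's CUDA, offered only as a sketch and not as AMD Fusion code: mapped ("zero-copy") host memory lets the device dereference host data directly, so the explicit copy disappears, but every access then crosses the slower link. In practice, bandwidth-bound kernels still stage their working set into the separate, faster device memory, which is the "staging does not disappear" point.

    #include <cuda_runtime.h>

    void mapped_memory_sketch(int n) {
        float *h = NULL, *dev_view = NULL;
        // Pinned host allocation that the device can address directly.
        cudaHostAlloc((void **)&h, n * sizeof(float), cudaHostAllocMapped);
        cudaHostGetDevicePointer((void **)&dev_view, h, 0);
        // A kernel could read/write dev_view in place, but each access pays
        // the interconnect cost; copying into cudaMalloc'd device memory
        // first remains the fast path for anything bandwidth-bound.
        cudaFreeHost(h);
    }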

General Purpose Processors with Accelerator Extensions
• Pros:
– Very fine-grained synchronization (no memory synchronization required; processing synchronization only for power constraints)
• Cons:
– Unified memory means that specialization is not possible (either in memory or in memory controllers)
– Single-die memory constraints
[Figure: example architectures: Intel MIC, IBM BG/Q; power-constrained memory consistency: Tilera GX, Godson-T, Intel SCC; extreme specialization and power management: Dally's Echelon, Chien's 10x10]

Towards On-chip Instruction-level Heterogeneity
• Vector units were a form of instruction-level heterogeneity
– Some instructions use vector hardware, some don't
– Vector instruction units processed the same data that other units processed
• Synchronization requirements
– No memory staging requirements
– Theoretically, accelerator units can fit into the same instruction pipeline as general-purpose processing
• But there are some practicality constraints
– The amount of acceleration is so high that not all hardware can be turned on at the same time (dark silicon with power gating will lead the way), so synchronization is not absent, but much more fine-grained (tens of cycles)
– Compilers (with help from users: OpenMP, OpenACC) will have to do some work to coalesce hardware power-gating
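A hypothetical sketch of the user-visible side of that compiler support: an OpenACC region (OpenMP offload directives would look similar) marks exactly where accelerated execution and its data live, which is the information a compiler or runtime would need in order to batch work against power-gated accelerator hardware. The power management itself is not expressible in the directive; this only shows the hint the user provides.

    // Accelerated loop marked with an OpenACC region (compile with an
    // OpenACC-capable compiler such as nvc/pgcc; compilers without OpenACC
    // support simply ignore the pragma and run the loop on the host).
    void scale_region(int n, const float *in, float *out) {
        #pragma acc parallel loop copyin(in[0:n]) copyout(out[0:n])
        for (int i = 0; i < n; ++i)
            out[i] = 2.0f * in[i];
    }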

Summary
• Accelerators are of different kinds; GPUs are just one example
• Decoupled-memory accelerators do not have much of a chance to survive because of data staging requirements
– Fundamentally ill-suited for sparse/fine-grained computations
– Caveat: LINPACK is not a fine-grained computation, so the Top500 might still boast a GPU-like machine
• Fine-grained instruction-level heterogeneity is required
– Many architectures are already going in that direction
– BG/Q and Intel MIC's planned roadmap are in that direction