Rigel: An Architecture and Scalable Programming Interface for a 1000-core Accelerator
Paper Presentation by Yifeng (Felix) Zeng, University of Missouri

Outline: Context & Introduction; Rigel Design Goals; Rigel Architecture; Design Elements; Estimates for the Rigel Design; Conclusion

Context & Introduction
Accelerator (e.g., GPUs): a hardware entity designed to provide advantages for a specific class of applications, such as higher performance, lower power, or lower unit cost compared to a general-purpose CPU.
Accelerators maximize throughput (operations/sec); CPUs minimize latency (sec/operation).

Context & Introduction
Challenges: inflexible programming models; lack of a conventional memory model; irregular parallel apps are hard to scale.
These challenges limit the metrics that matter: operations / area (cost), operations / Watt (power), operations / programmer effort.

Rigel Design Goals
What: support future programming models. Apps and models may not exist yet, so a flexible design is easier to retarget.
How: focus on scalability and programmer effort. Raise the hardware/software interface, and concentrate design effort on five key elements.

Rigel Architecture
Area-optimized, dual-issue, in-order cores; RISC-like ISA (instruction set architecture); single-precision floating point; general-purpose registers.

Rigel Architecture

In 45 nm technology, a 320 mm² Rigel chip (1024 cores) at a frequency of 1.2 GHz delivers a peak throughput of 2.4 TFLOPS.
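The peak figure is simple arithmetic, under one assumption the slide does not state explicitly: each core completes two single-precision FLOPs per cycle (e.g., one multiply-add).

```c
/* Back-of-the-envelope check of the slide's 2.4 TFLOPS peak.
 * Assumption (not stated on the slide): each core completes two
 * single-precision FLOPs per cycle, e.g. one fused multiply-add. */
double peak_tflops(int cores, double freq_ghz, int flops_per_cycle) {
    /* cores * GHz gives giga-FLOPs/s; divide by 1000 for tera-FLOPs/s. */
    return cores * freq_ghz * flops_per_cycle / 1000.0;
}
```

Under that assumption, `peak_tflops(1024, 1.2, 2)` gives 2.4576, which the slide rounds down to 2.4 TFLOPS.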

Design Elements
1. Execution Model: ISA, SIMD vs. MIMD, VLIW vs. OoOE, multithreading
2. Memory Model: caches vs. scratchpad, ordering, coherence
3. Work Distribution: scheduling, spectrum of SW/HW choices
4. Synchronization: scalability, influence on the programming model
5. Locality Management

Element 1: Execution Model
Tradeoff 1: MIMD vs. SIMD (irregular data parallelism, task parallelism)
Tradeoff 2: Latency vs. throughput (simple in-order cores)
Tradeoff 3: Full RISC ISA vs. specialized cores
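The MIMD-vs.-SIMD tradeoff can be made concrete with a toy cycle count. In a SIMD group, a divergent branch serializes both paths across the whole group, while MIMD cores each execute only the path their own task takes. The costs below are illustrative, not Rigel measurements.

```c
/* Toy cost model for a divergent branch: lanes_a lanes/cores take
 * path A (cost_a cycles) and lanes_b take path B (cost_b cycles). */
int simd_cycles(int lanes_a, int lanes_b, int cost_a, int cost_b) {
    /* A SIMD group pays for every path any lane takes, in sequence. */
    int t = 0;
    if (lanes_a > 0) t += cost_a;
    if (lanes_b > 0) t += cost_b;
    return t;
}

int mimd_cycles(int lanes_a, int lanes_b, int cost_a, int cost_b) {
    /* MIMD cores run independently; the group finishes with the
     * slowest path actually taken. */
    int t = 0;
    if (lanes_a > 0 && cost_a > t) t = cost_a;
    if (lanes_b > 0 && cost_b > t) t = cost_b;
    return t;
}
```

With an even split and equal path costs, the SIMD group takes twice as long (both paths serialized) while MIMD pays for only one path per core; this is the irregular-parallelism case the slide points at.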

Element 2: Memory Model
Tradeoff 1: Single vs. multiple address spaces
Tradeoff 2: Hardware caches vs. scratchpads (hardware exploits locality; software manages global sharing)
Tradeoff 3: Hierarchical vs. distributed (cluster-cache/global-cache hierarchy; the ISA provides local and global memory operations; non-uniformity costs programmer effort)
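The cost of the cluster/global hierarchy can be sketched as an average-latency model; the latencies and hit rate below are placeholders, not numbers from the paper.

```c
/* Average memory latency for a two-level hierarchy: accesses that hit
 * the per-cluster cache pay cluster_lat cycles; the rest go to the
 * shared global cache and pay global_lat. (Illustrative model only.) */
double avg_latency(double cluster_hit_rate,
                   double cluster_lat, double global_lat) {
    return cluster_hit_rate * cluster_lat
         + (1.0 - cluster_hit_rate) * global_lat;
}
```

For example, with a 90% cluster hit rate, a 2-cycle cluster cache, and a 30-cycle global cache, the average is 4.8 cycles; keeping sharing local, as the slide's hardware/software split suggests, is what holds the hit rate up.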

Element 3: Work Distribution
Tradeoff (a spectrum): HW vs. SW implementation. Rigel chooses software task management with hierarchical queues, giving flexible policies with little specialized hardware.
The result is the Rigel Task Model.
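A minimal software sketch of the hierarchical-queue idea (a hypothetical structure, not the actual Rigel Task Model API): each core pops from its cluster's local queue first and falls back to a shared global queue.

```c
/* Hypothetical hierarchical task queues; single-threaded sketch.
 * A real implementation needs the synchronization of Element 4. */
#define QCAP 64

typedef struct {
    int tasks[QCAP];
    int head, tail;                 /* pop at head, push at tail */
} queue_t;

static int q_push(queue_t *q, int task) {
    if (q->tail == QCAP) return 0;  /* full (no wraparound) */
    q->tasks[q->tail++] = task;
    return 1;
}

static int q_pop(queue_t *q, int *task) {
    if (q->head == q->tail) return 0;   /* empty */
    *task = q->tasks[q->head++];
    return 1;
}

/* Policy: try the local cluster queue, then the shared global queue. */
int get_task(queue_t *local, queue_t *global_q, int *task) {
    return q_pop(local, task) || q_pop(global_q, task);
}
```

The point of the hierarchy is that most dequeues hit the local queue (cheap, uncontended), while the global queue provides load balancing across clusters.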

Rigel Task Model

Rigel Task Model Evaluation

Element 4: Synchronization
Coherence mechanisms serve two needs: 1. control synchronization, 2. data sharing.
Broadcast update: use cases are flags and barriers; it reduces contention from polling.
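The contention argument can be quantified with a toy message count: under polling, every waiting core re-reads the flag until it is released, while a broadcast update pushes the new value to all sharers once. These are illustrative totals, not measurements from the paper.

```c
/* Traffic at a shared synchronization flag, counted as messages. */
long polling_traffic(long waiters, long polls_until_release) {
    /* Each poll by each waiter is a separate request to the flag. */
    return waiters * polls_until_release;
}

long broadcast_traffic(long waiters) {
    /* One write, fanned out as one update message per sharer. */
    return 1 + waiters;
}
```

With 1023 waiting cores each polling 100 times before release, polling generates 102,300 requests versus 1,024 messages for a broadcast update, which is why flags and barriers are the slide's use cases.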

Area estimates for the Rigel Design

Conclusions
Although Rigel is not yet a physical chip, the design is novel and feasible.
The Rigel design strikes a balance between performance and programmability.
Future work: the fifth design element, locality management.

References
age/Rigel.html
Daniel R. Johnson et al., "Rigel: A Scalable Architecture for Core Accelerators," SAAHPC'09.
John H. Kelm et al. (UIUC), slides presented at the 36th Annual International Symposium on Computer Architecture (ISCA), June 22, 2009.