Rapid Exploration of Accelerator-rich Architectures: Automation from Concept to Prototyping. David Brooks, Yu-Ting Chen, Jason Cong, Zhenman Fang, Brandon Reagen, Yakun Sophia Shao

Presentation transcript:

Rapid Exploration of Accelerator-rich Architectures: Automation from Concept to Prototyping David Brooks, Yu-Ting Chen, Jason Cong, Zhenman Fang, Brandon Reagen, Yakun Sophia Shao

Tutorial Outline

Time                   Topic
9:00 am – 9:30 am      Introduction
9:30 am – 10:10 am     Standalone Accelerator Simulation: Aladdin
10:10 am – 10:30 am    Standalone Accelerator Generation: High-Level Synthesis
10:30 am – 11:00 am    HLS-Based Accelerator-Rich Architecture Simulation: PARADE
11:00 am – 11:30 am    Break
11:30 am – 12:00 pm    Pre-RTL SoC Simulation: gem5-Aladdin
12:00 pm – 12:30 pm    FPGA Prototyping: ARACompiler
12:30 pm – 2:00 pm     Lunch
2:00 pm – 3:00 pm      Panel on Accelerator Research
3:00 pm – 3:30 pm      Accelerator Benchmarks and Workload Characterization
3:30 pm – 4:00 pm      Break
4:00 pm – 5:00 pm      Hands-on Exercise

CMOS Technology Scaling

Technological Fallow Period

…and it’s about time. Golden Age Of Design; Technological Fallow Period [Colwell 2012]: 7nm, ~50B tx

Technology Trends: Technology, Design [Danowitz et al., CACM 04/2012, Figure 1]

Potential for Specialized Architectures [Brodersen and Meng, 2002]
[Figure labels: 16 Encryption; 17 Hearing Aid; 18 FIR for disk read; 19 MPEG Encoder; Baseband]

Beyond Homogeneous Parallelism
[Figure labels: SIMD/SSE; AESDEC; In Core; Out of Core; GPU; H.264; Composable Accelerators; Fixed Function. Axes: Energy Efficiency, Programmability]

Cores, GPUs, and Accelerators: Apple A8 SoC. Out-of-Core Accelerators [Die photo from Chipworks] [Accelerators annotated by Sophia Harvard]

Cores, GPUs, and Accelerators: Apple A8 SoC. Out-of-Core Accelerators: Maltiel Consulting estimates; our estimates [Y. Shao, IEEE Micro 2015]

Challenges in Accelerators
Flexibility
–Fixed-function accelerators are designed only for their target applications.
Design Cost
–Hand-written RTL implementation is inherently tedious and time-consuming.
Programmability
–Today’s accelerators are explicitly managed by programmers.

Composable Customization
Monolithic Hardware Accelerator

Composable Customization
Composed Accelerator with sub-blocks

Composable Customization
Composed Accelerator w/ Architectural Support: Shared Interconnect and Memory Fabric
Example: “Accelerator Store” Lyons et al. TACO’12

Composable Customization
Composed Accelerator w/ Architectural Support: Shared Interconnect and Memory Fabric
Composable Accelerators Provide Application Flexibility

Composable Accelerators with Programmable Fabrics [ISLPED’2013]
Dynamic Resource Allocation of ABBs
Enhancement [ISLPED 2013]: with 20% of the chip area dedicated to programmable fabric, we can achieve more:
–Flexibility: An average 8.2x (up to 146x) speedup in other domains, such as commercial, vision and navigation
–Longevity: 22x speedup on a new application within the medical imaging domain

Composable Accelerators from Accelerator Building Blocks (ABBs)
[Figure: tiled chip layout with a central GAM. Legend: Router; Core; L2 Banks; Accelerator + DMA + SPM; Memory controller]

Static Decomposition into ABBs (Decomposed Denoise LCA)
ABB1, Type = Poly. Input: Mem, Output: ABB2. Function: (x0-x1),(x2-x3),…
ABB2, Type = Poly. Input: ABB1, Output: ABB3. Function: x0*x1+x2*x3+…
ABB3, Type = Sqrt. Input: ABB2, Output: ABB4. Function: sqrt(x0)
ABB4, Type = FInv. Input: ABB3, Output: Mem. Function: 1/x0
Data path: Memory → ABB: Poly1 → ABB: Poly2 → ABB: Sqrt → ABB: Finv → Memory
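The static decomposition above can be sketched in software as a chain of small functions, one per ABB. This is only an illustration of the chaining idea, not the CHARM hardware or its runtime: the function names, the `run_chain` helper, and the choice to pass each ABB's full output list directly to the next ABB are all assumptions made for this sketch.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]

# Each function mirrors one ABB "Function:" entry from the slide.
# Names are hypothetical; they do not come from the tutorial.

def poly_sub(xs: Sequence[float]) -> Vector:
    """ABB1 (Poly): pairwise differences (x0-x1), (x2-x3), ..."""
    return [xs[i] - xs[i + 1] for i in range(0, len(xs) - 1, 2)]

def poly_mac(xs: Sequence[float]) -> Vector:
    """ABB2 (Poly): sum of pairwise products x0*x1 + x2*x3 + ..."""
    return [sum(xs[i] * xs[i + 1] for i in range(0, len(xs) - 1, 2))]

def sqrt_abb(xs: Sequence[float]) -> Vector:
    """ABB3 (Sqrt): element-wise square root."""
    return [math.sqrt(x) for x in xs]

def finv_abb(xs: Sequence[float]) -> Vector:
    """ABB4 (FInv): element-wise reciprocal 1/x0."""
    return [1.0 / x for x in xs]

# Mem -> ABB1 -> ABB2 -> ABB3 -> ABB4 -> Mem, as listed on the slide.
DENOISE_CHAIN: List[Callable[[Sequence[float]], Vector]] = [
    poly_sub, poly_mac, sqrt_abb, finv_abb,
]

def run_chain(chain: List[Callable[[Sequence[float]], Vector]],
              mem_input: Sequence[float]) -> Vector:
    """Feed the memory input through each ABB in order (assumed wiring)."""
    data: Vector = list(mem_input)
    for abb in chain:
        data = abb(data)
    return data

result = run_chain(DENOISE_CHAIN, [5.0, 1.0, 6.0, 2.0])
print(result)  # [0.25]: diffs [4, 4] -> 16 -> 4.0 -> 0.25
```

Because every stage has the same list-in, list-out shape, a different accelerator could be composed by reordering or swapping entries in the chain, which is the flexibility argument the slides make for ABBs.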

Composable Accelerators [ISLPED’2012] Dynamic Resource Allocation of ABBs Cong, Ghodrat, Gill, Grigorian and Reinman. “CHARM: A Composable Heterogeneous Accelerator-Rich Microprocessor.” ISLPED 2012

Results
Enhancement [ISLPED’2013]: with 20% of the chip area dedicated to programmable fabric, we can achieve more:
–Flexibility: An average 12x (up to 146x) speedup in other domains, such as commercial, vision and navigation
–Longevity: 22x speedup on a new application within the medical imaging domain

Results relative to an Intel Core i GHz)
Accelerators are synthesized in 32nm technology

Benchmark      Metric        GPU (NVIDIA Tesla M2075)   FPGA (Xilinx V6)   Monolithic Accelerators   Composable Accelerators
Deblur         Performance   97X                        25X                58X                       107X
Deblur         Energy        19X                        130X               369X                      261X
Denoise        Performance   38X                        12X                26X                       37X
Denoise        Energy        7.5X                       89X                327X                      308X
Segmentation   Performance   52X                        78X                79X                       155X
Segmentation   Energy        2.4X                       371X               201X                      149X
Registration   Performance   32X                        24X                53X                       109X
Registration   Energy        27.8X                      31X                854X                      1102X
Average        Performance   50X                        27X                50X                       90X
Average        Energy        10X                        107X               379X                      338X
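The slide does not say how the Average rows are computed. As a quick check, the sketch below recomputes a geometric mean over the four benchmarks for each platform; the reported averages agree with it to within one unit, so a geometric mean appears consistent with the table, though that is an inference rather than something the slide states. The dictionary layout and helper name are assumptions for this example.

```python
import math
from typing import Dict, List

PLATFORMS = ["GPU", "FPGA", "Monolithic", "Composable"]

# Per-benchmark values copied from the results table, in the order
# Deblur, Denoise, Segmentation, Registration.
PERFORMANCE: Dict[str, List[float]] = {
    "GPU": [97, 38, 52, 32],
    "FPGA": [25, 12, 78, 24],
    "Monolithic": [58, 26, 79, 53],
    "Composable": [107, 37, 155, 109],
}
ENERGY: Dict[str, List[float]] = {
    "GPU": [19, 7.5, 2.4, 27.8],
    "FPGA": [130, 89, 371, 31],
    "Monolithic": [369, 327, 201, 854],
    "Composable": [261, 308, 149, 1102],
}

# "Average" rows as reported on the slide.
REPORTED_PERF = {"GPU": 50, "FPGA": 27, "Monolithic": 50, "Composable": 90}
REPORTED_ENERGY = {"GPU": 10, "FPGA": 107, "Monolithic": 379, "Composable": 338}

def geometric_mean(values: List[float]) -> float:
    """n-th root of the product of n positive values."""
    return math.prod(values) ** (1.0 / len(values))

for name in PLATFORMS:
    perf = geometric_mean(PERFORMANCE[name])
    energy = geometric_mean(ENERGY[name])
    print(f"{name:<11} perf {perf:6.1f}X (slide: {REPORTED_PERF[name]}X)  "
          f"energy {energy:6.1f}X (slide: {REPORTED_ENERGY[name]}X)")
```

An arithmetic mean would not match as well: for the GPU performance column it gives about 54.8X, against the 50X reported.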

Challenges in Accelerators
Flexibility
–Fixed-function accelerators are designed only for their target applications.
Programmability
–Today’s accelerators are explicitly managed by programmers.

Today’s SoC: OMAP 4 SoC
[Figure: block diagram. Labels: ARM Cores; GPU; DSP; System Bus; Secondary Bus (two); Tertiary Bus; DMA; SD; USB; Audio; Video; Face; Imaging]

Challenges in Accelerators
Flexibility
–Fixed-function accelerators are designed only for their target applications.
Programmability
–Today’s accelerators are explicitly managed by programmers.
Design Cost
–Accelerator (and RTL) implementation is inherently tedious and time-consuming.

Some highlights (and pain points) of our research in accelerator architectures
–Event-Driven Architectures For Wireless Sensor Nodes (Hempstead, ISCA’05)
–Accelerator Memory Systems Design: “Accelerator Store” (Lyons, CAL’10) [Figure: Accel Stores (AS) and Accel Cores connected by an OCN]
–Robobee “Brain” System-on-Chip (Zhang, CICC’13, VLSI’15)

Accelerator Research Infrastructure
[Figure labels: Aladdin; gem5-Aladdin; High-Level Synthesis; PARADE; ASIC Flow or FPGA Prototype; RTL. Axes: Modeling, Prototyping; Standalone, System Integration]

Panel: Rapid Exploration of Accelerator-Rich Architectures
Organizers: David Brooks (Harvard) and Jason Cong (UCLA)
Moderator: Jason Cong
Panelists:
–Ameen Akel (Micron)
–Chris Batten (Cornell)
–Derek Chiou (UT-Austin/Microsoft)
–Boris Ginzburg (Intel)
–Michael Kishinevsky (Intel)

Questions to the Panel (and attendees)
What accelerators have you designed or plan to design?
What is the process to select the workloads or kernels for acceleration? How do you estimate the acceleration potential?
What’s your methodology for accelerator design? E.g.
–How do you select the communication scheme between the CPU and the accelerators?
–Do you do design space exploration?
–How do you trade off efficiency and flexibility in accelerator designs?
How do you validate your accelerator design, in terms of both performance and correctness?
