Please do not distribute

Presentation transcript:

Rapid Exploration of Accelerator-Rich Architectures: Automation from Concept to Prototyping
David Brooks, Jason Cong, Zhenman Fang, Yakun Sophia Shao, and Sam Xi
Harvard University & UCLA
5/10/2018
GYW

Tutorial Outline (Time: Topic, Speaker)
8:30 am – 9:00 am: Accelerator Research Infrastructure Overview (Sophia Shao)
9:00 am – 9:30 am: Aladdin: Accelerator Pre-RTL Modeling
9:30 am – 10:00 am: Rapid Hardware Specialization with HLS: Glass Half Full (Prof. Zhiru Zhang)
10:00 am – 10:30 am: PARADE: HLS-Based Accelerator-Rich Architecture Simulation (Zhenman Fang)
10:30 am – 11:00 am: Break
11:00 am – 11:30 am: gem5-Aladdin: Accelerator System Co-Design (Sam Xi)
11:30 am – 12:00 pm: ARAPrototyper: FPGA Prototyping
12:00 pm – 1:30 pm: Lunch
1:30 pm – 2:00 pm: Virtual Machine Setup (Sophia Shao & Sam Xi)
2:00 pm – 2:30 pm: Hands-on: Accelerator Design Space Exploration using Aladdin
2:30 pm – 3:00 pm: Hands-on: SoC Design Space Exploration using gem5-Aladdin

Moore’s Law

CMOS Scaling is Slowing Down
Process nodes shown: 180 nm, 130 nm, 90 nm, 65 nm, 45 nm, 32 nm, 22 nm, 14 nm, 10 nm
Source: http://www.anandtech.com/show/9447/intel-10nm-and-kaby-lake

CMOS Technology Scaling
Technological Fallow Period

Potential for Specialized Architectures [Zhang and Brodersen]
16: Encryption
17: Hearing Aid
18: FIR for disk read
19: MPEG Encoder
20: 802.11 Baseband

Cores, GPUs, and Accelerators: Apple A8 SoC
Out-of-Core Accelerators
Figure legend: Maltiel Consulting estimates; our estimates

Challenges in Accelerators
Flexibility: Fixed-function accelerators are designed only for their target applications.
Programmability: Today’s accelerators are explicitly managed by programmers.

Today’s SoC: OMAP 4 SoC
Block diagram labels: ARM Cores, GPU, DSP, DMA, SD, USB, Audio, Video, Face, Imaging
Buses: System Bus, Secondary Bus, Tertiary Bus

Challenges in Accelerators
Flexibility: Fixed-function accelerators are designed only for their target applications.
Programmability: Today’s accelerators are explicitly managed by programmers.
Design Cost: Accelerator (and RTL) implementation is inherently tedious and time-consuming.

Today’s SoC
Diagram labels: CPU, GPU/DSP, Buses, Mem Interface, Acc

Future Accelerator-Centric Architectures
Diagram labels: Big Cores, Small Cores, GPU/DSP, Sea of Fine-Grained Accelerators, Shared Resources, Memory Interface
How to decompose applications into accelerators?
How to rapidly design lots of accelerators?
How to design and manage the shared resources?
Challenges: Flexibility, Design Cost, Programmability

PARADE: Platform for Accelerator-Rich Architectural Design & Exploration [ICCAD 15]
Extended gem5 (McPAT) for X86 CPU, with OS
Auto-generated accelerators based on HLS (AutoPilot)
Added SPM, DMA, GAM & TLB model
Extended Garnet (DSENT) for NoC
Extended Ruby (CACTI) for coherent cache hierarchy
gem5 memory model [ISPASS 14]

ARAPrototyper: Prototyping an ARA on FPGA
Using Xilinx Zynq SoC (FPGA fabrics + ARM)
Major components of an ARA:
General processor cores
A sea of heterogeneous accelerators
Memory system + interconnects (NoC)

Contributions
WIICA: Accelerator Workload Characterization [ISPASS’13]
MachSuite: Accelerator Benchmark Suite [IISWC’14]
Aladdin: Accelerator Pre-RTL, Power-Performance Simulator [ISCA’14, TopPicks’15]
Accelerator Design w/ High-Level Synthesis [ISLPED’13_1]
gem5-Aladdin: Accelerator-System Co-Design [MICRO’16]
Diagram labels: Big Cores, Small Cores, GPU/DSP, Sea of Fine-Grained Accelerators, Shared Resources, Memory Interface
