Please do not distribute

Presentation transcript:

Rapid Exploration of Accelerator-Rich Architectures: Automation from Concept to Prototyping
David Brooks, Jason Cong, Zhenman Fang, Yakun Sophia Shao, and Sam Xi
Harvard University & UCLA
5/10/2018

Tutorial Outline (time: topic, speaker where listed)
8:30 am – 9:00 am: Accelerator Research Infrastructure Overview (Sophia Shao)
9:00 am – 9:30 am: Aladdin: Accelerator Pre-RTL Modeling
9:30 am – 10:00 am: Rapid Hardware Specialization with HLS: Glass Half Full (Prof. Zhiru Zhang)
10:00 am – 10:30 am: PARADE: HLS-Based Accelerator-Rich Architecture Simulation (Zhenman Fang)
10:30 am – 11:00 am: Break
11:00 am – 11:30 am: gem5-Aladdin: Accelerator System Co-Design (Sam Xi)
11:30 am – 12:00 pm: ARAPrototyper: FPGA Prototyping
12:00 pm – 1:30 pm: Lunch
1:30 pm – 2:00 pm: Virtual Machine Setup (Sophia Shao & Sam Xi)
2:00 pm – 2:30 pm: Hands-on: Accelerator Design Space Exploration using Aladdin
2:30 pm – 3:00 pm: Hands-on: SoC Design Space Exploration using gem5-Aladdin

Moore’s Law

CMOS Scaling is Slowing Down
Process nodes shown: 180 nm, 130 nm, 90 nm, 65 nm, 45 nm, 32 nm, 22 nm, 14 nm, 10 nm
Source: http://www.anandtech.com/show/9447/intel-10nm-and-kaby-lake

CMOS Technology Scaling
Technological Fallow Period

Potential for Specialized Architectures
Chart labels (design number, application): 16 Encryption; 17 Hearing Aid; 18 FIR for disk read; 19 MPEG Encoder; 20 802.11 Baseband
[Zhang and Brodersen]

Cores, GPUs, and Accelerators: Apple A8 SoC
Out-of-Core Accelerators
Figure labels: Maltiel Consulting estimates; our estimates

Challenges in Accelerators
Flexibility: Fixed-function accelerators are only designed for the target applications.
Programmability: Today’s accelerators are explicitly managed by programmers.

Today’s SoC: OMAP 4 SoC
Blocks shown: DMA, ARM Cores, GPU, DSP, SD, USB, Audio, Video, Face, Imaging
Buses shown: System Bus, Secondary Bus, Tertiary Bus

Challenges in Accelerators
Flexibility: Fixed-function accelerators are only designed for the target applications.
Programmability: Today’s accelerators are explicitly managed by programmers.
Design Cost: Accelerator (and RTL) implementation is inherently tedious and time-consuming.

Today’s SoC
Blocks shown: GPU/DSP, CPU, Buses, Mem Interface, Acc

Future Accelerator-Centric Architectures
Blocks shown: GPU/DSP, Big Cores, Small Cores, Shared Resources, Memory Interface, Sea of Fine-Grained Accelerators
How to decompose applications into accelerators?
How to rapidly design lots of accelerators?
How to design and manage the shared resources?
Challenge labels: Flexibility, Design Cost, Programmability

PARADE: Platform for Accelerator-Rich Architectural Design & Exploration [ICCAD 15]
extended gem5 (McPAT) for X86 CPU, with OS
auto-generated accelerators based on HLS (AutoPilot)
added SPM, DMA, GAM & TLB model
extended Garnet (DSENT) for NoC
extended Ruby (CACTI) for coherent cache hierarchy
gem5 memory model [ISPASS 14]

ARAPrototyper: Prototyping an ARA on FPGA
Using Xilinx Zynq SoC (FPGA fabrics + ARM)
Major components of an ARA:
General processor cores
A sea of heterogeneous accelerators
Memory system + interconnects (NoC)

Contributions
WIICA: Accelerator Workload Characterization [ISPASS’13]
MachSuite: Accelerator Benchmark Suite [IISWC’14]
Aladdin: Accelerator Pre-RTL, Power-Performance Simulator [ISCA’14, TopPicks’15]
Accelerator Design w/ High-Level Synthesis [ISLPED’13_1]
gem5-Aladdin: Accelerator-System Co-Design [MICRO’16]
Blocks shown: GPU/DSP, Big Cores, Small Cores, Shared Resources, Memory Interface, Sea of Fine-Grained Accelerators
