Please do not distribute

Presentation transcript:

Integration for Heterogeneous SoC Modeling. Y. Sophia Shao, Sam Xi, Gu-Yeon Wei, David Brooks, Harvard University. 4/21/2017.

More accelerators. Out-of-core accelerators. Maltiel Consulting estimates [Shao, et al., IEEE Micro]. [Die photo from Chipworks] [Accelerators annotated by Sophia Shao @ Harvard]

Accelerator-CPU Integration: Today’s Conventional SoCs. Easy to integrate lots of IP; simple accelerator design. Hard to program and share data. [Figure labels: Core, L2 $, …, L3 $, DMA, On-Chip System Bus, Acc #1, Scratchpad, Acc #n]

Accelerator Integration Trend. Users design application-specific hardware accelerators. System vendors provide a Host Service Layer with virtual memory and cache coherence support. Examples: Intel QuickAssist QPI-Based FPGA Accelerator Platform (QAP); IBM POWER8’s Coherent Accelerator Processor Interface (CAPI). [Figure labels: Main CPU/SoC, FPGA or user-defined ASIC, Core, …, Core, Accelerator, L2 $, L2 $, Acc Agent, Host Service Layer, L3 $]

Aladdin: A pre-RTL, Power-Performance Accelerator Simulator. Inputs: unmodified C code and accelerator design parameters (e.g., # FU, mem. BW). Outputs: power/area and performance. [Figure labels: Shared Memory/Interconnect Models, Private L1/Scratchpad, Aladdin Accelerator-Specific Datapath] “Accelerator Simulator”: design accelerator-rich SoC fabrics and memory systems. “Design Assistant”: understand the algorithmic-HW design space before RTL. [Figure labels: Flexibility, Programmability, Design Cost]

Aladdin Overview. Inputs: C code and accelerator design parameters. Optimization Phase: Optimistic IR, Initial DDDG, Idealistic DDDG. Realization Phase: Program-Constrained DDDG, Resource-Constrained DDDG, Power/Area Models. Outputs: performance, activity, power/area. (DDDG = Dynamic Data Dependence Graph.)
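The step from an initial DDDG to a resource-constrained one can be illustrated with a toy model. The sketch below is not Aladdin's code: the trace format (`dest`, `op`, `sources`), the function names `build_dddg` and `schedule`, the `fu_limits` parameter, and the one-cycle latency are all assumptions made for illustration. It builds read-after-write dependence edges from a dynamic trace, then list-schedules the nodes under a per-operation functional-unit limit to estimate a cycle count.

```python
from collections import defaultdict

def build_dddg(trace):
    """Build dependence sets from a dynamic trace.

    trace: list of (dest, op, sources) tuples (hypothetical format).
    Returns a list where entry i is the set of node indices that node i
    depends on (true read-after-write dependences only).
    """
    last_writer = {}
    deps = []
    for idx, (dest, _op, srcs) in enumerate(trace):
        deps.append({last_writer[s] for s in srcs if s in last_writer})
        last_writer[dest] = idx
    return deps

def schedule(trace, deps, fu_limits, latency=1):
    """Resource-constrained list scheduling; returns total cycles.

    fu_limits maps an op name to the number of units available per cycle
    (an assumed stand-in for an accelerator design parameter).
    """
    if any(limit < 1 for limit in fu_limits.values()):
        raise ValueError("every functional-unit limit must be at least 1")
    finish = {}                      # node index -> cycle its result is ready
    remaining = list(range(len(trace)))
    cycle = 0
    while remaining:
        used = defaultdict(int)      # units consumed in this cycle, per op
        waiting = []
        for n in remaining:
            op = trace[n][1]
            ready = all(d in finish and finish[d] <= cycle for d in deps[n])
            if ready and used[op] < fu_limits[op]:
                used[op] += 1
                finish[n] = cycle + latency
            else:
                waiting.append(n)
        remaining = waiting
        cycle += 1
    return max(finish.values(), default=0)

# A 4-element dot product, fully unrolled, as a dynamic trace.
TRACE = [
    ("p0", "mul", ("a0", "b0")),
    ("p1", "mul", ("a1", "b1")),
    ("p2", "mul", ("a2", "b2")),
    ("p3", "mul", ("a3", "b3")),
    ("s0", "add", ("p0", "p1")),
    ("s1", "add", ("p2", "p3")),
    ("s2", "add", ("s0", "s1")),
]
DEPS = build_dddg(TRACE)
wide = schedule(TRACE, DEPS, {"mul": 4, "add": 2})    # 3 cycles
narrow = schedule(TRACE, DEPS, {"mul": 1, "add": 1})  # 6 cycles
```

With four multipliers and two adders the graph's critical path sets the cycle count; with one unit of each, the resource limit dominates. A real tool would also model memory ports, operation latencies, and power, which this sketch omits.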

Aladdin Take-Away. Validated against HLS and hand-written RTL for SHOC benchmarks and custom accelerator designs: cycle counts within 2%, power within 5%, area within 7%. Large design space exploration (DSE) takes minutes instead of hours or days, using an unmodified C/C++ algorithm description. Limitations: (1) dynamic approach, so Aladdin depends on realistic workload inputs; (2) algorithm dependent, so Aladdin enables DSE and algorithm exploration.
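A design space exploration of the kind described above amounts to sweeping design parameters, evaluating each point, and keeping the Pareto-optimal ones. The following is a minimal sketch under stated assumptions: `evaluate` is a made-up analytical cost model (not Aladdin's), and the workload sizes, power coefficients, and parameter names `num_fus` and `mem_bw` are hypothetical.

```python
from itertools import product
from math import ceil

def evaluate(num_fus, mem_bw, work_ops=1024, mem_words=512):
    """Hypothetical cost model: returns (cycles, power in arbitrary units)."""
    compute_cycles = ceil(work_ops / num_fus)
    memory_cycles = ceil(mem_words / mem_bw)
    cycles = max(compute_cycles, memory_cycles)  # slower side is the bottleneck
    power = 1.0 * num_fus + 0.5 * mem_bw         # assumed linear cost per resource
    return cycles, power

def sweep(fu_options, bw_options):
    """Evaluate every (num_fus, mem_bw) combination."""
    return {(f, b): evaluate(f, b) for f, b in product(fu_options, bw_options)}

def pareto_front(points):
    """Return the design points that no other point dominates.

    A point is dominated if another point is no worse in both cycles and
    power and strictly better in at least one of them.
    """
    front = []
    for name, (c, p) in points.items():
        dominated = any(
            c2 <= c and p2 <= p and (c2 < c or p2 < p)
            for c2, p2 in points.values()
        )
        if not dominated:
            front.append(name)
    return front

POINTS = sweep([1, 2, 4, 8], [1, 2, 4])
FRONT = pareto_front(POINTS)
```

In this toy model, adding functional units without matching memory bandwidth raises power without reducing cycles, so those points fall off the frontier. In practice the evaluator would be a simulator run per design point rather than a closed-form formula.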

Aladdin enables pre-RTL simulation of accelerators with the rest of the SoC. [Figure labels: GPU (GPGPU-Sim); Big Cores and Small Cores (gem5); Memory Interface (DRAMSim2); Shared Resources (Ruby/GARNET); Sea of Fine-Grained Accelerators]

gem5-Aladdin Integration. [Figure labels: CPU, Acc Datapath, Cache, Scratchpad, TLB, DMA Engine, Cache, LLC, DRAM]

gem5-Aladdin Integration. [Figure labels: multiple accelerators, each with Acc Datapath, Scratchpad, TLB, and Cache; CPU, …, Cache, …, DMA Engine, Acc Shared Cache, LLC, DRAM]

[Figure labels: Acc, Cache, Memory; CPU, Cache, Memory]

Heterogeneous SoC Modeling. An increasing number of accelerators are being integrated into both mobile SoCs and servers. gem5-Aladdin integration enables rapid design space exploration of future accelerator-centric platforms. Download Aladdin at http://vlsiarch.eecs.harvard.edu/aladdin