Please do not distribute

Presentation transcript:

Please do not distribute 4/10/2017 A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures Yakun Sophia Shao, Brandon Reagen, Gu-Yeon Wei, David Brooks Harvard University GYW

Beyond Homogeneous Parallelism: General-Purpose Cores (CPU); Programmable Accelerators (DSP, GPU); Application-Specific Accelerators (ASIP, ASIC). Figure labels: Energy Efficiency, Flexibility, Programmability, Design Cost.

Today’s SoC: OMAP 4 SoC

Today’s SoC: OMAP 4 SoC block diagram. Blocks: ARM Cores, GPU, DSP, DMA, SD, USB, Audio, Video, Face, Imaging. Buses: System Bus, Secondary Bus, Tertiary Bus.

Today’s SoC: Apple A7; Harvard VLSI-ARCH Group SoC Tapeout

Today’s SoC: CPU, GPU/DSP, Buses, Mem Interface, Acc

Future Accelerator-Centric Architectures: Big Cores, Small Cores, GPU/DSP, Sea of Fine-Grained Accelerators, Shared Resources, Memory Interface
How to decompose an application to accelerators?
How to rapidly design lots of accelerators?
How to design and manage the shared resources?
Flexibility, Design Cost, Programmability

Aladdin: A pre-RTL, Power-Performance Accelerator Simulator
Inputs: Unmodified C-Code; Accelerator Design Parameters (e.g., # FU, mem. BW)
Models: Accelerator Specific Datapath; Private L1/Scratchpad; Shared Memory/Interconnect Models
Outputs: Power/Area; Performance
“Accelerator Simulator”: Design Accelerator-Rich SoC Fabrics and Memory Systems
“Design Assistant”: Understand Algorithmic-HW Design Space before RTL
Flexibility, Programmability, Design Cost

Future Accelerator-Centric Architecture: Big Cores, Small Cores, GPU/DSP, Sea of Fine-Grained Accelerators, Shared Resources, Memory Interface

Future Accelerator-Centric Architecture: Big Cores, Small Cores, GPU/DSP, Sea of Fine-Grained Accelerators, Shared Resources, Memory Interface
Aladdin can rapidly evaluate a large design space of accelerator-centric architectures.

Aladdin Overview (DDDG: Dynamic Data Dependence Graph)
Optimization Phase: C Code -> Optimistic IR -> Initial DDDG -> Idealistic DDDG
Realization Phase: Idealistic DDDG -> Program Constrained DDDG -> Resource Constrained DDDG (with Acc Design Parameters) -> Performance, Activity
Activity + Power/Area Models -> Power/Area

Aladdin Overview
Optimization Phase: C Code -> Optimistic IR -> Initial DDDG -> Idealistic DDDG
Realization Phase: Idealistic DDDG -> Program Constrained DDDG -> Resource Constrained DDDG (with Acc Design Parameters) -> Performance, Activity
Activity + Power/Area Models -> Power/Area

From C to Design Space
C Code: for(i=0; i<N; ++i) c[i] = a[i] + b[i];

From C to Design Space: IR Dynamic Trace
C Code: for(i=0; i<N; ++i) c[i] = a[i] + b[i];
IR Trace:
0. r0=0 //i = 0
1. r4=load(r0 + r1) //load a[i]
2. r5=load(r0 + r2) //load b[i]
3. r6=r4 + r5
4. store(r0 + r3, r6) //store c[i]
5. r0=r0 + 1 //++i
6. r4=load(r0 + r1) //load a[i]
7. r5=load(r0 + r2) //load b[i]
8. r6=r4 + r5
9. store(r0 + r3, r6) //store c[i]
10. r0=r0 + 1 //++i
…
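A minimal sketch of how the dynamic trace on this slide unrolls from the loop: each iteration contributes the same five operations, so the trace grows with the trip count. The function name `vector_add_trace` is hypothetical, and the IR strings simply copy the slide's notation; this is not Aladdin's actual trace format.

```python
def vector_add_trace(n):
    """Dynamic IR trace of `for(i=0; i<n; ++i) c[i] = a[i] + b[i];` as strings.

    Assumes, as on the slide, r1/r2/r3 hold the base addresses of a/b/c
    and r0 holds the loop index i.
    """
    trace = ["r0=0"]                  # node 0: i = 0
    for _ in range(n):
        trace += [
            "r4=load(r0 + r1)",       # load a[i]
            "r5=load(r0 + r2)",       # load b[i]
            "r6=r4 + r5",             # a[i] + b[i]
            "store(r0 + r3, r6)",     # store c[i]
            "r0=r0 + 1",              # ++i
        ]
    return trace


trace = vector_add_trace(2)
for idx, op in enumerate(trace):
    print(f"{idx}. {op}")
```

With two iterations this prints nodes 0 through 10, matching the numbered part of the slide's trace.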

From C to Design Space: Initial DDDG
C Code: for(i=0; i<N; ++i) c[i] = a[i] + b[i];
IR Trace:
0. r0=0 //i = 0
1. r4=load(r0 + r1) //load a[i]
2. r5=load(r0 + r2) //load b[i]
3. r6=r4 + r5
4. store(r0 + r3, r6) //store c[i]
5. r0=r0 + 1 //++i
6. r4=load(r0 + r1) //load a[i]
7. r5=load(r0 + r2) //load b[i]
8. r6=r4 + r5
9. store(r0 + r3, r6) //store c[i]
10. r0=r0 + 1 //++i
…
Initial DDDG nodes (figure): 0. i=0, 1. ld a, 2. ld b, 3. +, 4. st c, 5. i++, 6. ld a, 7. ld b, 8. +, 9. st c, 10. i++, 11. ld a, 12. ld b, 13. +, 14. st c
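One way to picture how a trace like this becomes a dependence graph is sketched below. It assumes the graph holds only true (read-after-write) dependences carried through registers, which is an assumption of this sketch and not something the slide states; the helper names `vector_add_ops` and `build_dddg` are hypothetical. Registers r1, r2 and r3 are never written in the trace, so they produce no edges.

```python
def vector_add_ops(n):
    """Dynamic ops of the slide's loop as (label, dest_reg, src_regs)."""
    ops = [("i=0", "r0", [])]
    for _ in range(n):
        ops += [
            ("ld a", "r4", ["r0", "r1"]),
            ("ld b", "r5", ["r0", "r2"]),
            ("+", "r6", ["r4", "r5"]),
            ("st c", None, ["r0", "r3", "r6"]),
            ("i++", "r0", ["r0"]),
        ]
    return ops


def build_dddg(ops):
    """Return sorted (producer, consumer) edges for register RAW dependences."""
    last_writer = {}          # register -> node that most recently wrote it
    edges = set()
    for node, (_, dest, srcs) in enumerate(ops):
        for reg in srcs:
            if reg in last_writer:
                edges.add((last_writer[reg], node))
        if dest is not None:
            last_writer[dest] = node
    return sorted(edges)


edges = build_dddg(vector_add_ops(2))
for src, dst in edges:
    print(f"{src} -> {dst}")
```

In this model the store (node 4) depends on the add (node 3), and the second iteration's loads (nodes 6 and 7) depend on the first `i++` (node 5), so the index update serializes the iterations.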

From C to Design Space: Idealistic DDDG
C Code: for(i=0; i<N; ++i) c[i] = a[i] + b[i];
IR Trace:
0. r0=0 //i = 0
1. r4=load(r0 + r1) //load a[i]
2. r5=load(r0 + r2) //load b[i]
3. r6=r4 + r5
4. store(r0 + r3, r6) //store c[i]
5. r0=r0 + 1 //++i
6. r4=load(r0 + r1) //load a[i]
7. r5=load(r0 + r2) //load b[i]
8. r6=r4 + r5
9. store(r0 + r3, r6) //store c[i]
10. r0=r0 + 1 //++i
…
Idealistic DDDG nodes (figure): 0. i=0, 5. i++, 10. i++; 1. ld a, 2. ld b, 6. ld a, 7. ld b, 11. ld a, 12. ld b; 3. +, 8. +, 13. +; 4. st c, 9. st c, 14. st c

From C to Design Space. Optimization Phase: C->IR->DDDG
Include application-specific customization strategies.
Node-Level: Bit-width Analysis; Strength Reduction; Tree-height Reduction
Loop-Level: Remove dependences between loop index variables
Memory Optimization: Memory-to-Register Conversion; Store-Load Forwarding; Store Buffer
Extensible: e.g., model a CAM accelerator by matching nodes in the DDDG
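The loop-level item can be illustrated with a small sketch on the vector-add example from the earlier slides. It assumes the optimization amounts to deleting edges that connect one loop-index node (`i=0`, `i++`) to another, and it measures the effect as the number of dependence levels in the graph. The helper names are hypothetical and the model is a simplification, not Aladdin's implementation.

```python
INDEX_OPS = {"i=0", "i++"}


def vector_add_ops(n):
    """Dynamic ops of the slide's loop as (label, dest_reg, src_regs)."""
    ops = [("i=0", "r0", [])]
    for _ in range(n):
        ops += [
            ("ld a", "r4", ["r0", "r1"]),
            ("ld b", "r5", ["r0", "r2"]),
            ("+", "r6", ["r4", "r5"]),
            ("st c", None, ["r0", "r3", "r6"]),
            ("i++", "r0", ["r0"]),
        ]
    return ops


def raw_edges(ops):
    """Register read-after-write edges (producer, consumer)."""
    last_writer, edges = {}, []
    for node, (_, dest, srcs) in enumerate(ops):
        for reg in srcs:
            if reg in last_writer:
                edges.append((last_writer[reg], node))
        if dest is not None:
            last_writer[dest] = node
    return edges


def drop_index_chain(ops, edges):
    """Remove edges whose producer and consumer are both loop-index nodes."""
    return [(s, d) for s, d in edges
            if not (ops[s][0] in INDEX_OPS and ops[d][0] in INDEX_OPS)]


def depth(num_nodes, edges):
    """Number of dependence levels (longest path, counted in nodes)."""
    level = [0] * num_nodes
    # Producers always precede consumers in trace order, so visiting edges
    # by consumer index sees each producer's final level.
    for src, dst in sorted(edges, key=lambda e: e[1]):
        level[dst] = max(level[dst], level[src] + 1)
    return max(level) + 1


ops = vector_add_ops(4)
initial = raw_edges(ops)
idealistic = drop_index_chain(ops, initial)
print(depth(len(ops), initial), depth(len(ops), idealistic))
```

For four iterations the initial graph has 7 levels because each iteration waits on the previous `i++`; with the index chain removed every iteration collapses to the same 4 levels (index, loads, add, store).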

From C to Design Space: One Design
Acc Design Parameters: Memory BW <= 2, 1 Adder
Idealistic DDDG nodes (figure): 0. i=0, 5. i++, 10. i++, 15. i++; 1. ld a, 2. ld b, 6. ld a, 7. ld b, 11. ld a, 12. ld b, 16. ld a, 17. ld b; 3. +, 8. +, 13. +, 18. +; 4. st c, 9. st c, 14. st c, 19. st c
Schedule (figure): DDDG nodes placed by Cycle, with MEM and + Resource Activity

From C to Design Space: Another Design
Acc Design Parameters: Memory BW <= 4, 2 Adders
Idealistic DDDG nodes (figure): 0. i=0, 5. i++, 10. i++, 15. i++; 1. ld a, 2. ld b, 6. ld a, 7. ld b, 11. ld a, 12. ld b, 16. ld a, 17. ld b; 3. +, 8. +, 13. +, 18. +; 4. st c, 9. st c, 14. st c, 19. st c
Schedule (figure): DDDG nodes placed by Cycle, with MEM and + Resource Activity
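The two designs can be compared with a toy list scheduler, sketched below under several assumptions that are not from the slides: every operation takes one cycle, loads and stores each consume one unit of memory bandwidth, `+` consumes one adder, the index operations are free, and ready nodes are picked greedily in trace order. The function names are hypothetical, and the cycle counts belong to this toy model only, not to Aladdin or to the slide figures.

```python
INDEX_OPS = {"i=0", "i++"}
MEM_OPS = {"ld a", "ld b", "st c"}


def vector_add_ops(n):
    """Dynamic ops of the slide's loop as (label, dest_reg, src_regs)."""
    ops = [("i=0", "r0", [])]
    for _ in range(n):
        ops += [
            ("ld a", "r4", ["r0", "r1"]),
            ("ld b", "r5", ["r0", "r2"]),
            ("+", "r6", ["r4", "r5"]),
            ("st c", None, ["r0", "r3", "r6"]),
            ("i++", "r0", ["r0"]),
        ]
    return ops


def idealistic_preds(ops):
    """Register RAW predecessors per node, minus index-to-index edges."""
    last_writer, preds = {}, {}
    for node, (label, dest, srcs) in enumerate(ops):
        preds[node] = set()
        for reg in srcs:
            src = last_writer.get(reg)
            if src is None:
                continue
            if label in INDEX_OPS and ops[src][0] in INDEX_OPS:
                continue      # loop-index chain removed (idealistic DDDG)
            preds[node].add(src)
        if dest is not None:
            last_writer[dest] = node
    return preds


def schedule_cycles(ops, mem_bw, adders):
    """Greedy cycle-by-cycle schedule; returns the total cycle count."""
    preds = idealistic_preds(ops)
    done_at = {}              # node -> cycle in which it executes
    cycle = 0
    while len(done_at) < len(ops):
        mem_used = add_used = 0
        for node, (label, _, _) in enumerate(ops):
            if node in done_at:
                continue
            # Ready only if every predecessor ran in an earlier cycle.
            if any(done_at.get(p, cycle) >= cycle for p in preds[node]):
                continue
            if label in MEM_OPS:
                if mem_used == mem_bw:
                    continue
                mem_used += 1
            elif label == "+":
                if add_used == adders:
                    continue
                add_used += 1
            done_at[node] = cycle
        cycle += 1
    return cycle


ops = vector_add_ops(4)
print(schedule_cycles(ops, mem_bw=2, adders=1))   # Memory BW <= 2, 1 Adder
print(schedule_cycles(ops, mem_bw=4, adders=2))   # Memory BW <= 4, 2 Adders
```

Under these assumptions the four-iteration example takes 8 cycles with the first parameter set and 5 cycles with the second, which is the kind of performance difference between design points that the two slides contrast.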

From C to Design Space. Realization Phase: DDDG->Estimates
Constrain the DDDG with program and user-defined resource constraints.
Program Constraints: Control Dependence; Memory Ambiguation
Resource Constraints: Loop-level Parallelism; Loop Pipelining; Memory Ports; # of FUs (e.g., adders, multipliers)

From C to Design Space: Power-Performance per Design
Figure (Power vs. Cycle): one point for Acc Design Parameters Memory BW <= 4, 2 Adders; one point for Acc Design Parameters Memory BW <= 2, 1 Adder

From C to Design Space: Design Space of an Algorithm
Figure: Power vs. Cycle
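A common way to read a power-versus-cycle scatter like this one is to keep only the designs that no other design beats on both axes (a Pareto frontier). The slide does not say that Aladdin does this; the sketch below is just one way to post-process such points. The design names and numbers are made-up placeholders, not results from the talk.

```python
def pareto_front(points):
    """Keep designs not dominated in both cycles and power (lower is better).

    `points` is a list of (name, cycles, power) tuples.
    """
    front = []
    # After sorting by cycles (then power), a design survives only if it
    # uses strictly less power than every design kept before it.
    for name, cycles, power in sorted(points, key=lambda p: (p[1], p[2])):
        if not front or power < front[-1][2]:
            front.append((name, cycles, power))
    return front


# Hypothetical design points: (name, cycles, power in arbitrary units).
designs = [
    ("bw2_add1", 8, 1.0),
    ("bw4_add1", 8, 1.5),
    ("bw4_add2", 5, 2.0),
    ("bw8_add4", 5, 4.0),
]
for name, cycles, power in pareto_front(designs):
    print(name, cycles, power)
```

With these placeholder values only `bw4_add2` and `bw2_add1` survive: the other two spend more power without reducing the cycle count.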

Aladdin Validation
Aladdin flow: C Code -> Aladdin -> Power/Area, Performance
RTL flow: Verilog -> ModelSim (Activity), Design Compiler -> Power/Area, Performance

Aladdin Validation
Aladdin flow: C Code -> Aladdin -> Power/Area, Performance
RTL flow: Verilog (from an RTL Designer, or from HLS C Tuning -> Vivado HLS) -> ModelSim (Activity), Design Compiler -> Power/Area, Performance

Aladdin Validation

Aladdin Validation

Aladdin enables rapid design space exploration for accelerators.
Aladdin flow (C Code -> Aladdin -> Power/Area, Performance): 7 mins
RTL flow (RTL Designer, or HLS C Tuning -> Vivado HLS; Verilog -> ModelSim Activity, Design Compiler): 52 hours
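As a quick check of what the two times on this slide imply, the ratio can be computed directly. The variable names are illustrative; the only inputs are the 7 minutes and 52 hours quoted above.

```python
aladdin_minutes = 7
rtl_minutes = 52 * 60            # 52 hours expressed in minutes

speedup = rtl_minutes / aladdin_minutes
print(f"RTL flow takes about {speedup:.0f}x as long as Aladdin")
```

That works out to roughly 446 times, i.e. between two and three orders of magnitude.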

Aladdin enables pre-RTL simulation of accelerators with the rest of the SoC.
GPU: GPGPU-Sim
Big Cores, Small Cores: MARSSx86 ... XIOSim…
Memory Interface: DRAMSim2
Shared Resources: Cacti/Orion2
Sea of Fine-Grained Accelerators: Aladdin

Modeling Accelerators in a SoC-like Environment
Figure labels: Acc, Core, Cache, Memory

Aladdin: A pre-RTL, Power-Performance Accelerator Simulator
Architectures with 1000s of accelerators will be radically different; new design tools are needed.
Aladdin enables rapid design space exploration of future accelerator-centric platforms.
You can find Aladdin at http://vlsiarch.eecs.harvard.edu/aladdin