Design and Modeling of Specialized Architectures
Yakun Sophia Shao
PhD Dissertation Defense, Harvard University
May 9th, 2016
Moore’s Law
CMOS Scaling is Slowing Down
[Figure: CMOS process node progression — 130 nm, 90 nm, 65 nm, 45 nm, 32 nm, 22 nm, 14 nm, 10 nm]
CMOS Technology Scaling
Technological Fallow Period
Potential for Specialized Architectures [Zhang and Brodersen]
[Figure: energy efficiency of specialized designs — Encryption, Hearing Aid, FIR for disk read, MPEG Encoder, Baseband]
Cores, GPUs, and Accelerators: Apple A8 SoC
Out-of-Core Accelerators
Cores, GPUs, and Accelerators: Apple A8 SoC
Out-of-Core Accelerators (die-area breakdown: Maltiel Consulting estimates; our estimates)
Challenges in Accelerators
– Flexibility: fixed-function accelerators are designed only for their target applications.
– Programmability: today's accelerators are explicitly managed by programmers.
Today's SoC: OMAP 4 SoC
Today's SoC: OMAP 4 SoC
[Block diagram: ARM cores, GPU, and DSP on the system bus; secondary and tertiary buses connecting DMA, SD, USB, audio, video, face detection, and imaging blocks]
Challenges in Accelerators
– Flexibility: fixed-function accelerators are designed only for their target applications.
– Programmability: today's accelerators are explicitly managed by programmers.
– Design Cost: accelerator (and RTL) implementation is inherently tedious and time-consuming.
Today's SoC
[Diagram: CPUs, GPU/DSP, and accelerators connected through buses to the memory interface]
Future Accelerator-Centric Architectures
– Flexibility: how to decompose applications into accelerators?
– Design Cost: how to rapidly design lots of accelerators?
– Programmability: how to design and manage the shared resources?
[Diagram: big cores, small cores, GPU/DSP, and a sea of fine-grained accelerators with shared resources and a memory interface]
Contributions
[Diagram: big cores, small cores, GPU/DSP, and a sea of fine-grained accelerators with shared resources and a memory interface]
– WIICA: Accelerator Workload Characterization [ISPASS'13]
– Instruction-Level Energy Model for Xeon Phi [ISLPED'13_2]
– Accelerator Design w/ High-Level Synthesis [ISLPED'13_1]
– Aladdin: Accelerator Pre-RTL, Power-Performance Simulator [ISCA'14, TopPicks'15]
– MachSuite: Accelerator Benchmark Suite [IISWC'14]
– Accelerator-System Co-Design [Under Review]
– Research Infrastructures for Hardware Accelerators [Synthesis Lecture'15]
Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator
Inputs: unmodified C code and accelerator design parameters (e.g., # FUs, memory BW).
Outputs: power, area, performance, and activity estimates for an accelerator-specific datapath, with private L1/scratchpad and shared memory/interconnect models.
– "Design Assistant": understand the algorithmic-hardware design space before RTL (design cost, flexibility).
– "Accelerator Simulator": design accelerator-rich SoC fabrics and memory systems (programmability).
Future Accelerator-Centric Architecture
[Diagram: big cores, small cores, GPU/DSP, and a sea of fine-grained accelerators with shared resources and a memory interface]
Aladdin can rapidly evaluate the large design space of accelerator-centric architectures.
Aladdin Overview
C Code → Optimization Phase → Realization Phase → Power/Area, Performance, Activity
– Optimization Phase: the C code is compiled to an optimistic IR, from which Aladdin builds a Dynamic Data Dependence Graph (DDDG): initial DDDG → idealistic DDDG.
– Realization Phase: the idealistic DDDG is constrained into a program-constrained DDDG and then, with the accelerator design parameters, a resource-constrained DDDG; power/area models produce the final estimates.
From C to Design Space
C Code:
for (i = 0; i < N; ++i)
    c[i] = a[i] + b[i];
From C to Design Space: IR Dynamic Trace
C Code:
for (i = 0; i < N; ++i)
    c[i] = a[i] + b[i];
IR Trace:
0.  r0 = 0              // i = 0
1.  r4 = load(r0 + r1)  // load a[i]
2.  r5 = load(r0 + r2)  // load b[i]
3.  r6 = r4 + r5
4.  store(r0 + r3, r6)  // store c[i]
5.  r0 = r0 + 1         // ++i
6.  r4 = load(r0 + r1)  // load a[i]
7.  r5 = load(r0 + r2)  // load b[i]
8.  r6 = r4 + r5
9.  store(r0 + r3, r6)  // store c[i]
10. r0 = r0 + 1         // ++i
...
From C to Design Space: Initial DDDG
[Graph: one node per dynamic IR instruction (i=0, ld a, ld b, add, st c, i++, ...), with edges for data dependences; the chain of loop-index increments (i++) serializes the iterations]
From C to Design Space: Idealistic DDDG
[Graph: after removing the dependences between loop-index increments, each iteration's chain (ld a, ld b, add, st c) is independent and can execute in parallel]
From C to Design Space
Optimization Phase (C → IR → DDDG): include application-specific customization strategies.
– Node-Level: bit-width analysis, strength reduction, tree-height reduction
– Loop-Level: remove dependences between loop index variables
– Memory Optimization: memory-to-register conversion, store-load forwarding, store buffer
From C to Design Space: One Design
Accelerator design parameters: memory BW <= 2, 1 adder.
[Schedule: with at most 2 memory accesses per cycle, each iteration's two loads issue together, and iterations are spread across successive cycles]
From C to Design Space: Another Design
Accelerator design parameters: memory BW <= 4, 2 adders.
[Schedule: with 4 memory accesses per cycle and 2 adders, two iterations can issue their loads in the same cycle]
From C to Design Space
Realization Phase (DDDG → Estimates): constrain the DDDG with program and user-defined resource constraints.
– Program Constraints: control dependence, memory ambiguation
– Resource Constraints: loop-level parallelism, loop pipelining, memory ports
From C to Design Space: Power-Performance per Design
[Plot: power vs. cycles for the two designs — memory BW <= 2 with 1 adder, and memory BW <= 4 with 2 adders]
From C to Design Space: Design Space of an Algorithm
[Plot: power vs. cycles across the full design space]
Aladdin Validation
[Flow: C code → Aladdin → power/area and performance estimates, validated against Verilog simulated in ModelSim (activity) and synthesized with Design Compiler]
Aladdin Validation
[Flow: C code → Aladdin, compared against an RTL path — hand-written RTL by a designer, or HLS C tuning through Vivado HLS — producing Verilog that is simulated in ModelSim (activity) and synthesized with Design Compiler]
Aladdin Validation
[Validation results: figures comparing Aladdin estimates against RTL]
Algorithm-to-Solution Time

                                 Hand-Coded RTL      C-to-RTL   Aladdin
Programming Effort               High                Medium     N/A
RTL Generation                   Designer Dependent  37 mins
RTL Simulation                                       5 mins
RTL Synthesis                                        45 mins
Time to Solution per Design                          87 mins    1 min
Time to Solution (36 Designs)                        52 hours   7 min
Aladdin enables pre-RTL simulation of accelerators with the rest of the SoC.
[Diagram: big cores and small cores modeled by gem5, GPU by GPGPU-Sim, the sea of fine-grained accelerators by Aladdin, shared resources by Cacti/Orion2, and the memory interface by DRAMSim2]
Accelerator Integration
[Diagram: an accelerator (lanes 0-4 with arrays ARR 0/1 and buffers BUF 0/1 behind a scratchpad interface) and two CPUs with L1/L2 caches on the system bus; a DMA engine with transfer descriptors (source address, destination address, length) and channel selection (CHAN 0-3) moves data through the memory controller to DRAM]
Compute is only a part of the story
Accelerator-System Co-Design
Accelerator Integration
[Diagram: the same accelerator now with a cache interface — a TLB and private cache — as an alternative to the scratchpad/DMA path]
gem5-Aladdin: An SoC Simulator
[Diagram: accelerators with either scratchpad+DMA or TLB+cache interfaces, integrated with the CPUs, caches, system bus, memory controller, and DRAM]
gem5-Aladdin Validation
[Setup: each application's kernel runs through gem5-Aladdin and, via Vivado HLS-generated Verilog, on the FPGA fabric of a Xilinx Zynq SoC (ARM core + DMA IP block); validated metrics include flush latency, DMA latency, and accelerator execution latency]
gem5-Aladdin Validation
[Validation results: figure]
To DMA or To Cache?
Accelerator local memory
DMA or Cache
[Results: figures comparing DMA- and cache-based accelerator memory systems]
Conclusions
– Architectures with 1000s of accelerators will be radically different; new design tools are needed.
– We built Aladdin, an architecture-level power, performance, and area simulator for accelerators.
– We integrated Aladdin with gem5 to model the interactions between accelerators and the rest of the SoC.
– These accelerator infrastructures open up opportunities for innovation in heterogeneous architecture design.
Publications
1. Y.S. Shao, S. Xi, V. Srinivasan, G.-Y. Wei, D. Brooks, "An Holistic Approach to Accelerator-System Co-Design," Under Review.
2. Y.S. Shao and D. Brooks, "Research Infrastructures for Hardware Accelerators," Synthesis Lectures on Computer Architecture, Nov. 2015.
3. Y.S. Shao, B. Reagen, G.-Y. Wei, D. Brooks, "The Aladdin Approach to Accelerator Design and Modeling," IEEE Micro Top Picks, May-June 2015.
4. Y.S. Shao, S. Xi, V. Srinivasan, G.-Y. Wei, D. Brooks, "Toward Cache-Friendly Hardware Accelerators," SCAW'15.
5. B. Reagen, B. Adolf, Y.S. Shao, G.-Y. Wei, D. Brooks, "MachSuite: Benchmarks for Accelerator Design and Customized Architectures," IISWC'14.
6. Y.S. Shao, B. Reagen, G.-Y. Wei, D. Brooks, "Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures," ISCA'14.
7. B. Reagen, Y.S. Shao, G.-Y. Wei, D. Brooks, "Quantifying Acceleration: Power/Performance Trade-Offs of Application Kernels in Hardware," ISLPED'13.
8. Y.S. Shao and D. Brooks, "Energy Characterization and Instruction-Level Energy Model of Intel's Xeon Phi Processor," ISLPED'13.
9. Y.S. Shao and D. Brooks, "ISA-Independent Workload Characterization and its Implications for Specialized Architectures," ISPASS'13.
Acknowledgement
Thanks!