Presentation is loading. Please wait.

Presentation is loading. Please wait.

Please do not distribute

Similar presentations


Presentation on theme: "Please do not distribute"— Presentation transcript:

1 Please do not distribute
4/10/2017 A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures Yakun Sophia Shao, Brandon Reagen, Gu-Yeon Wei, David Brooks Harvard University GYW

2 Beyond Homogeneous Parallelism
General-Purpose Cores (CPU) Programmable Accelerators (DSP, GPU) Application-Specific Accelerator (ASIP, ASIC) Energy Efficiency Flexibility Programmability Design Cost

3 Please do not distribute
4/10/2017 Today’s SoC OMAP 4 SoC GYW

4 Please do not distribute
4/10/2017 Today’s SoC DMA ARM Cores GPU DSP SD USB Audio Video Face Imaging System Bus Secondary Bus Tertiary OMAP 4 SoC GYW

5 Please do not distribute
4/10/2017 Today’s SoC Apple A7 Harvard VLSI-ARCH Group SoC Tapeout GYW

6 Please do not distribute
4/10/2017 Today’s SoC GPU/DSP CPU Buses Mem Inter- face Acc GYW

7 Future Accelerator-Centric Architectures
Please do not distribute 4/10/2017 Future Accelerator-Centric Architectures GPU/DSP Big Cores Shared Resources Memory Interface Sea of Fine-Grained Accelerators Small Cores How to decompose an application to accelerators? How to rapidly design lots of accelerators? How to design and manage the shared resources? Flexibility Design Cost Programmability GYW

8 Please do not distribute
4/10/2017 Aladdin: A pre-RTL, Power-Performance Accelerator Simulator Shared Memory/Interconnect Models Unmodified C-Code Accelerator Design Parameters (e.g., # FU, mem. BW) Private L1/ Scratchpad Aladdin Accelerator Specific Datapath Power/Area Performance “Accelerator Simulator” Design Accelerator-Rich SoC Fabrics and Memory Systems “Design Assistant” Understand Algorithmic-HW Design Space before RTL Flexibility Programmability Design Cost GYW

9 Future Accelerator-Centric Architecture
Please do not distribute 4/10/2017 Future Accelerator-Centric Architecture GPU/DSP Big Cores Shared Resources Memory Interface Sea of Fine-Grained Accelerators Small Cores GYW

10 Future Accelerator-Centric Architecture
Please do not distribute 4/10/2017 Future Accelerator-Centric Architecture GPU/DSP Big Cores Shared Resources Memory Interface Sea of Fine-Grained Accelerators Small Cores Aladdin can rapidly evaluate large design space of accelerator-centric architectures. GYW

11 Please do not distribute
4/10/2017 Aladdin Overview Optimization Phase Realization Phase Optimistic IR Initial DDDG Idealistic C Code Dynamic Data Dependence Graph (DDDG) Program Constrained DDDG Resource Power/Area Models Performance Activity Acc Design Parameters Power/Area GYW

12 Please do not distribute
4/10/2017 Aladdin Overview Optimization Phase Optimistic IR Initial DDDG Idealistic DDDG C Code Performance Activity Program Constrained DDDG Resource Constrained DDDG Acc Design Parameters Power/Area Models Power/Area Realization Phase GYW

13 Please do not distribute
4/10/2017 From C to Design Space C Code: for(i=0; i<N; ++i) c[i] = a[i] + b[i]; GYW

14 From C to Design Space IR Dynamic Trace
Please do not distribute 4/10/2017 From C to Design Space IR Dynamic Trace 0. r0=0 //i = 0 r4=load (r0 + r1) //load a[i] r5=load (r0 + r2) //load b[i] r6=r4 + r5 store(r0 + r3, r6) //store c[i] r0=r //++i r4=load(r0 + r1) //load a[i] r5=load(r0 + r2) //load b[i] r0 = r //++i C Code: for(i=0; i<N; ++i) c[i] = a[i] + b[i]; GYW

15 From C to Design Space Initial DDDG
Please do not distribute 4/10/2017 From C to Design Space Initial DDDG IR Trace: 0. r0=0 //i = 0 1. r4=load (r0 + r1) //load a[i] 2. r5=load (r0 + r2) //load b[i] 3. r6=r4 + r5 4. store(r0 + r3, r6) //store c[i] 5. r0=r //++i 6. r4=load(r0 + r1) //load a[i] 7. r5=load(r0 + r2) //load b[i] 8. r6=r4 + r5 9. store(r0 + r3, r6) //store c[i] 10.r0 = r //++i 0. i=0 1. ld a 2. ld b 3. + 4. st c 5. i++ 6. ld a 7. ld b 8. + 9. st c C Code: for(i=0; i<N; ++i) c[i] = a[i] + b[i]; 10. i++ 11. ld a 12. ld b 13. + 14. st c GYW

16 From C to Design Space Idealistic DDDG
Please do not distribute 4/10/2017 From C to Design Space Idealistic DDDG IR Trace: 0. r0=0 //i = 0 1. r4=load (r0 + r1) //load a[i] 2. r5=load (r0 + r2) //load b[i] 3. r6=r4 + r5 4. store(r0 + r3, r6) //store c[i] 5. r0=r //++i 6. r4=load(r0 + r1) //load a[i] 7. r5=load(r0 + r2) //load b[i] 8. r6=r4 + r5 9. store(r0 + r3, r6) //store c[i] 10.r0 = r //++i 0. i=0 0. i=0 5. i++ 10. i++ 11. ld a 12. ld b 13. + 14. st c 6. ld a 7. ld b 8. + 9. st c 1. ld a 2. ld b 3. + 4. st c 5. i++ 1. ld a 2. ld b C Code: for(i=0; i<N; ++i) c[i] = a[i] + b[i]; 10. i++ 6. ld a 7. ld b 3. + 11. ld a 12. ld b 8. + 4. st c 13. + 9. st c 14. st c GYW

17 From C to Design Space Optimization Phase: C->IR->DDDG
Please do not distribute 4/10/2017 From C to Design Space Optimization Phase: C->IR->DDDG Include application-specific customization strategies. Node-Level: Bit-width Analysis Strength Reduction Tree-height Reduction Loop-Level: Remove dependences between loop index variables Memory Optimization: Memory-to-Register Conversion Store-Load Forwarding Store Buffer Extensible e.g. Model CAM accelerator by matching nodes in DDDG GYW

18 From C to Design Space One Design
Please do not distribute 4/10/2017 From C to Design Space One Design MEM + Resource Activity Idealistic DDDG Cycle 0. i=0 5.i++ 6. ld a 7. ld b 8. + 9. st c 1. ld a 2. ld b 3. + 4. st c 0. i=0 5.i++ 10. i++ 15. i++ 1. ld a 2. ld b 6. ld a 7. ld b 11. ld a 12. ld b 16. ld a 17. ld b 3. + 8. + 13. + 18. + 4. st c 9. st c 14. st c 19. st c Acc Design Parameters: Memory BW <= 2 1 Adder GYW

19 From C to Design Space Another Design
Please do not distribute 4/10/2017 From C to Design Space Another Design MEM + Resource Activity Idealistic DDDG Cycle 0. i=0 5.i++ 10. i++ 11. ld a 12. ld b 13. + 14. st c 7. ld b 8. + 9. st c 1. ld a 2. ld b 3. + 4. st c 15. i++ 16. ld a 17. ld b 18. + 19. st c 6. ld a 0. i=0 5.i++ 10. i++ 15. i++ 1. ld a 2. ld b 6. ld a 7. ld b 11. ld a 12. ld b 16. ld a 17. ld b 3. + 8. + 13. + 18. + 4. st c 9. st c 14. st c 19. st c Acc Design Parameters: Memory BW <= 4 2 Adders GYW

20 From C to Design Space Realization Phase: DDDG->Estimates
Please do not distribute 4/10/2017 From C to Design Space Realization Phase: DDDG->Estimates Constrain the DDDG with program and user-defined resource constraints Program Constraints Control Dependence Memory Ambiguation Resource Constraints Loop-level Parallelism Loop Pipelining Memory Ports # of FUs (e.g., adders, multipliers) GYW

21 From C to Design Space Power-Performance per Design
Please do not distribute 4/10/2017 From C to Design Space Power-Performance per Design Acc Design Parameters: Memory BW <= 4 2 Adders Power Acc Design Parameters: Memory BW <= 2 1 Adder Cycle GYW

22 From C to Design Space Design Space of an Algorithm
Please do not distribute 4/10/2017 From C to Design Space Design Space of an Algorithm Power Cycle GYW

23 Please do not distribute
4/10/2017 Aladdin Validation Aladdin C Code Power/Area Performance ModelSim Design Compiler Verilog Activity GYW

24 Please do not distribute
4/10/2017 Aladdin Validation Aladdin C Code Power/Area Performance RTL Designer Design Compiler Verilog Activity HLS C Tuning Vivado HLS ModelSim GYW

25 Please do not distribute
4/10/2017 Aladdin Validation GYW

26 Please do not distribute
4/10/2017 Aladdin Validation GYW

27 Aladdin enables rapid design space exploration for accelerators.
Please do not distribute 4/10/2017 Aladdin enables rapid design space exploration for accelerators. 7 mins Aladdin C Code Power/Area Performance 52 hours RTL Designer Design Compiler Verilog Activity HLS C Tuning Vivado HLS ModelSim GYW

28 Please do not distribute
4/10/2017 Aladdin enables pre-RTL simulation of accelerators with the rest of the SoC. GPGPU-Sim GPU MARSx86 ... XIOSim… Big Cores Small Cores DRAMSim2 Memory Interface Shared Resources Cacti/Orion2 Sea of Fine-Grained Accelerators GYW

29 Modeling Accelerators in a SoC-like Environment
Please do not distribute 4/10/2017 Modeling Accelerators in a SoC-like Environment Acc Core Cache Memory Core Acc Core Cache Memory GYW

30 Aladdin: A pre-RTL, Power-Performance Accelerator Simulator
Please do not distribute 4/10/2017 Aladdin: A pre-RTL, Power-Performance Accelerator Simulator Architectures with 1000s of accelerators will be radically different; New design tools are needed. Aladdin enables rapid design space exploration of future accelerator-centric platforms. You can find Aladdin at GYW


Download ppt "Please do not distribute"

Similar presentations


Ads by Google