1
Please do not distribute
A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures
Yakun Sophia Shao, Brandon Reagen, Gu-Yeon Wei, David Brooks
Harvard University
4/10/2017 GYW
2
Beyond Homogeneous Parallelism
Compute engines: General-Purpose Cores (CPU), Programmable Accelerators (DSP, GPU), Application-Specific Accelerators (ASIP, ASIC)
Axes: Energy Efficiency, Flexibility, Programmability, Design Cost
3
Today’s SoC
OMAP 4 SoC
4
Today’s SoC
OMAP 4 SoC block diagram: ARM Cores, GPU, DSP, DMA, SD, USB, Audio, Video, Face, Imaging; System Bus, Secondary Bus, Tertiary Bus
5
Today’s SoC
Apple A7; Harvard VLSI-ARCH Group SoC Tapeout
6
Today’s SoC
Block diagram: CPU, GPU/DSP, Buses, Mem Interface, Acc
7
Future Accelerator-Centric Architectures
Block diagram: Big Cores, Small Cores, GPU/DSP, Sea of Fine-Grained Accelerators, Shared Resources, Memory Interface
How to decompose an application into accelerators?
How to rapidly design lots of accelerators?
How to design and manage the shared resources?
Flexibility, Design Cost, Programmability
8
Aladdin: A pre-RTL, Power-Performance Accelerator Simulator
Inputs: Unmodified C-Code; Accelerator Design Parameters (e.g., # FU, mem. BW)
Modeled: Accelerator Specific Datapath; Private L1/Scratchpad; Shared Memory/Interconnect Models
Outputs: Power/Area; Performance
“Accelerator Simulator”: Design Accelerator-Rich SoC Fabrics and Memory Systems
“Design Assistant”: Understand Algorithmic-HW Design Space before RTL
Flexibility, Programmability, Design Cost
9
Future Accelerator-Centric Architecture
Block diagram: Big Cores, Small Cores, GPU/DSP, Sea of Fine-Grained Accelerators, Shared Resources, Memory Interface
10
Future Accelerator-Centric Architecture
Block diagram: Big Cores, Small Cores, GPU/DSP, Sea of Fine-Grained Accelerators, Shared Resources, Memory Interface
Aladdin can rapidly evaluate the large design space of accelerator-centric architectures.
11
Aladdin Overview
Optimization Phase: C Code -> Optimistic IR -> Initial DDDG -> Idealistic DDDG
Realization Phase: Program Constrained DDDG -> Resource Constrained DDDG -> Performance, Activity
Realization Phase inputs: Acc Design Parameters, Power/Area Models
Output: Power/Area
DDDG: Dynamic Data Dependence Graph
12
Aladdin Overview
Optimization Phase: C Code -> Optimistic IR -> Initial DDDG -> Idealistic DDDG
Realization Phase: Program Constrained DDDG -> Resource Constrained DDDG -> Performance, Activity
Realization Phase inputs: Acc Design Parameters, Power/Area Models
Output: Power/Area
13
From C to Design Space
C Code: for(i=0; i<N; ++i) c[i] = a[i] + b[i];
14
From C to Design Space IR Dynamic Trace
C Code: for(i=0; i<N; ++i) c[i] = a[i] + b[i];
IR Dynamic Trace:
0. r0=0 //i = 0
1. r4=load(r0 + r1) //load a[i]
2. r5=load(r0 + r2) //load b[i]
3. r6=r4 + r5
4. store(r0 + r3, r6) //store c[i]
5. r0=r0 + 1 //++i
6. r4=load(r0 + r1) //load a[i]
7. r5=load(r0 + r2) //load b[i]
8. r6=r4 + r5
9. store(r0 + r3, r6) //store c[i]
10. r0=r0 + 1 //++i
…
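The way the loop unrolls into a dynamic trace can be sketched in a few lines of Python. This is only an illustration of the idea, not Aladdin's trace generator: the function name `vector_add_trace` is hypothetical, and the increment is assumed to be written `r0=r0 + 1`.

```python
def vector_add_trace(n):
    """Dynamic IR trace of: for(i=0; i<n; ++i) c[i] = a[i] + b[i];

    Register roles follow the slide: r0 holds i, and r1, r2, r3 hold
    the base addresses of a, b, c.
    """
    trace = ["r0=0"]  # i = 0
    for _ in range(n):
        trace += [
            "r4=load(r0 + r1)",    # load a[i]
            "r5=load(r0 + r2)",    # load b[i]
            "r6=r4 + r5",
            "store(r0 + r3, r6)",  # store c[i]
            "r0=r0 + 1",           # ++i (assumed form of the increment)
        ]
    return trace


# Print the first two iterations, numbered the way the slide numbers them.
for idx, op in enumerate(vector_add_trace(2)):
    print(f"{idx}. {op}")
```

Because the trace is dynamic, every loop iteration contributes its own five entries, so there is no loop left in it at all.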
15
From C to Design Space Initial DDDG
C Code: for(i=0; i<N; ++i) c[i] = a[i] + b[i];
IR Trace:
0. r0=0 //i = 0
1. r4=load(r0 + r1) //load a[i]
2. r5=load(r0 + r2) //load b[i]
3. r6=r4 + r5
4. store(r0 + r3, r6) //store c[i]
5. r0=r0 + 1 //++i
6. r4=load(r0 + r1) //load a[i]
7. r5=load(r0 + r2) //load b[i]
8. r6=r4 + r5
9. store(r0 + r3, r6) //store c[i]
10. r0=r0 + 1 //++i
…
Initial DDDG nodes, in trace order:
0. i=0
1. ld a, 2. ld b, 3. +, 4. st c, 5. i++
6. ld a, 7. ld b, 8. +, 9. st c, 10. i++
11. ld a, 12. ld b, 13. +, 14. st c
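One way to picture how a trace becomes a dependence graph is to connect each instruction to the most recent writer of every register it reads. The sketch below does only that: it tracks read-after-write register dependences and ignores memory dependences, which is a simplifying assumption of this example and not a description of Aladdin's implementation. The names `build_dddg` and `vector_add_ops` are hypothetical.

```python
def build_dddg(ops):
    """ops: list of (dest_register_or_None, [source_registers]) in trace order.

    Returns the set of (producer_index, consumer_index) edges.
    """
    last_writer = {}  # register name -> index of the op that last wrote it
    edges = set()
    for idx, (dest, sources) in enumerate(ops):
        for reg in sources:
            if reg in last_writer:  # registers never written are live-ins
                edges.add((last_writer[reg], idx))
        if dest is not None:
            last_writer[dest] = idx
    return edges


def vector_add_ops(n):
    """Register-level view of the vector-add trace for n iterations."""
    ops = [("r0", [])]  # 0. i = 0
    for _ in range(n):
        ops += [
            ("r4", ["r0", "r1"]),        # ld a
            ("r5", ["r0", "r2"]),        # ld b
            ("r6", ["r4", "r5"]),        # +
            (None, ["r0", "r3", "r6"]),  # st c
            ("r0", ["r0"]),              # i++
        ]
    return ops


edges = build_dddg(vector_add_ops(2))
print(sorted(edges))
```

In this sketch the store in one iteration and the `i++` that follows it are not connected, since the store only reads `r0`; the next iteration's loads depend on the `i++` instead.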
16
From C to Design Space Idealistic DDDG
C Code: for(i=0; i<N; ++i) c[i] = a[i] + b[i];
IR Trace:
0. r0=0 //i = 0
1. r4=load(r0 + r1) //load a[i]
2. r5=load(r0 + r2) //load b[i]
3. r6=r4 + r5
4. store(r0 + r3, r6) //store c[i]
5. r0=r0 + 1 //++i
6. r4=load(r0 + r1) //load a[i]
7. r5=load(r0 + r2) //load b[i]
8. r6=r4 + r5
9. store(r0 + r3, r6) //store c[i]
10. r0=r0 + 1 //++i
…
Initial DDDG nodes, in trace order:
0. i=0
1. ld a, 2. ld b, 3. +, 4. st c, 5. i++
6. ld a, 7. ld b, 8. +, 9. st c, 10. i++
11. ld a, 12. ld b, 13. +, 14. st c
Idealistic DDDG, rows as drawn from top to bottom:
0. i=0
5. i++, 1. ld a, 2. ld b
10. i++, 6. ld a, 7. ld b, 3. +
11. ld a, 12. ld b, 8. +, 4. st c
13. +, 9. st c
14. st c
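The rows of the idealistic graph can be reproduced by placing each node one level below its latest producer, assuming unit latency and unlimited resources. The sketch below is a minimal illustration under those assumptions; `asap_levels` and `vector_add_edges` are hypothetical names, and the edge list encodes only the register dependences of the vector-add example.

```python
from collections import defaultdict


def asap_levels(num_nodes, edges):
    """Earliest level of each node, assuming unit latency.

    Nodes are assumed to be numbered in trace order, so every edge goes
    from a lower index to a higher one.
    """
    level = [0] * num_nodes
    for src, dst in sorted(edges, key=lambda e: e[1]):
        level[dst] = max(level[dst], level[src] + 1)
    return level


def vector_add_edges(iterations):
    """Register dependences of the vector-add trace, numbered as on the slide."""
    edges = []
    for k in range(iterations):
        i, lda, ldb, add, st, inc = (5 * k + off for off in range(6))
        edges += [(i, lda), (i, ldb), (lda, add), (ldb, add),
                  (i, st), (add, st), (i, inc)]
    return edges


# Three iterations, keeping nodes 0..14 as shown on the slide.
edges = [e for e in vector_add_edges(3) if e[1] <= 14]
rows = defaultdict(list)
for node, lvl in enumerate(asap_levels(15, edges)):
    rows[lvl].append(node)

for lvl in sorted(rows):
    print(lvl, rows[lvl])
```

The six levels produced here contain the same nodes as the six rows of the idealistic DDDG, with the index chain (0, 5, 10) setting the pace for the loads that depend on it.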
17
From C to Design Space Optimization Phase: C->IR->DDDG
Include application-specific customization strategies.
Node-Level: Bit-width Analysis, Strength Reduction, Tree-height Reduction
Loop-Level: Remove dependences between loop index variables
Memory Optimization: Memory-to-Register Conversion, Store-Load Forwarding, Store Buffer
Extensible: e.g., model a CAM accelerator by matching nodes in the DDDG
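Tree-height reduction is the easiest of these to show with numbers. For an associative operation, a left-to-right chain of additions has a critical path that grows linearly with the operand count, while a balanced tree grows logarithmically. The sketch below only compares the two depths; the function names are hypothetical and the slides do not describe how Aladdin applies the transformation.

```python
def chain_depth(n_operands):
    """Critical path of ((x0 + x1) + x2) + ... with unit-latency adders."""
    return max(n_operands - 1, 0)


def balanced_depth(n_operands):
    """Critical path when the same sum is regrouped as a balanced tree."""
    depth = 0
    width = n_operands
    while width > 1:
        width = (width + 1) // 2  # each level pairs up the remaining values
        depth += 1
    return depth


for n in (2, 4, 8, 16):
    print(n, chain_depth(n), balanced_depth(n))
```

The regrouping is only valid when the operation is associative, so applying it to floating-point sums changes rounding behavior.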
18
From C to Design Space One Design
Acc Design Parameters: Memory BW <= 2, 1 Adder
Idealistic DDDG, rows as drawn from top to bottom:
0. i=0, 5. i++, 10. i++, 15. i++
1. ld a, 2. ld b, 6. ld a, 7. ld b, 11. ld a, 12. ld b, 16. ld a, 17. ld b
3. +, 8. +, 13. +, 18. +
4. st c, 9. st c, 14. st c, 19. st c
Schedule figure: nodes placed by Cycle, with Resource Activity shown for MEM and +
Nodes visible in the schedule: 0. i=0, 5. i++, 1. ld a, 2. ld b, 3. +, 4. st c, 6. ld a, 7. ld b, 8. +, 9. st c
19
From C to Design Space Another Design
Acc Design Parameters: Memory BW <= 4, 2 Adders
Idealistic DDDG, rows as drawn from top to bottom:
0. i=0, 5. i++, 10. i++, 15. i++
1. ld a, 2. ld b, 6. ld a, 7. ld b, 11. ld a, 12. ld b, 16. ld a, 17. ld b
3. +, 8. +, 13. +, 18. +
4. st c, 9. st c, 14. st c, 19. st c
Schedule figure: nodes placed by Cycle, with Resource Activity shown for MEM and +
Nodes visible in the schedule: all four iterations (nodes 0 through 19)
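The effect of the two parameter sets can be illustrated with a small greedy list scheduler. This is a sketch under several assumptions that are not stated on the slides: every operation takes one cycle, loads and stores share the memory bandwidth limit, the index operations are treated as already resolved and free, and ready nodes are issued in node-number order. The function names are hypothetical, and the resulting cycle counts come from this sketch, not from Aladdin.

```python
def schedule(nodes, edges, limits):
    """Greedy list scheduling with unit latency.

    nodes:  {node_id: resource_kind}
    edges:  iterable of (producer, consumer)
    limits: {resource_kind: max operations per cycle}
    Returns {node_id: cycle}.
    """
    preds = {n: set() for n in nodes}
    for src, dst in edges:
        preds[dst].add(src)
    done = {}
    cycle = 0
    while len(done) < len(nodes):
        used = {kind: 0 for kind in limits}
        issued = []
        for n in sorted(nodes):
            if n in done or not all(p in done for p in preds[n]):
                continue  # already scheduled, or a producer has not finished
            kind = nodes[n]
            if used[kind] < limits[kind]:
                used[kind] += 1
                issued.append(n)
        for n in issued:  # results become visible in the next cycle
            done[n] = cycle
        cycle += 1
    return done


def vector_add_graph(iterations):
    """Loads, adds and stores of the vector-add loop, numbered as on the slide."""
    nodes, edges = {}, []
    for k in range(iterations):
        lda, ldb, add, st = 5 * k + 1, 5 * k + 2, 5 * k + 3, 5 * k + 4
        nodes.update({lda: "mem", ldb: "mem", add: "add", st: "mem"})
        edges += [(lda, add), (ldb, add), (add, st)]
    return nodes, edges


nodes, edges = vector_add_graph(4)
one_design = schedule(nodes, edges, {"mem": 2, "add": 1})
another_design = schedule(nodes, edges, {"mem": 4, "add": 2})
print(max(one_design.values()) + 1, max(another_design.values()) + 1)
```

With these assumptions the four iterations take 7 cycles with Memory BW <= 2 and 1 adder, and 4 cycles with Memory BW <= 4 and 2 adders, which shows how one idealistic DDDG yields different designs under different resource limits.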
20
From C to Design Space Realization Phase: DDDG->Estimates
Constrain the DDDG with program and user-defined resource constraints.
Program Constraints: Control Dependence, Memory Ambiguation
Resource Constraints: Loop-level Parallelism, Loop Pipelining, Memory Ports, # of FUs (e.g., adders, multipliers)
21
From C to Design Space Power-Performance per Design
Plot: Power vs. Cycle, one point per design
Acc Design Parameters: Memory BW <= 4, 2 Adders
Acc Design Parameters: Memory BW <= 2, 1 Adder
22
From C to Design Space Design Space of an Algorithm
Plot: Power vs. Cycle
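A common way to read a power versus cycle scatter like this is to keep only the designs that no other design beats on both axes. The slides do not say that Aladdin does this, so the sketch below is simply one way to post-process such a design space. The function name is hypothetical and the sample points are made-up placeholders, not results from the talk.

```python
def pareto_frontier(points):
    """points: list of (cycles, power) tuples, lower is better on both axes.

    Returns the points that are not dominated by any other point.
    """
    frontier = []
    for p in points:
        dominated = any(
            q != p and q[0] <= p[0] and q[1] <= p[1]
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return sorted(frontier)


# Placeholder (cycles, power) values for illustration only.
designs = [(7, 1.0), (4, 2.5), (6, 3.0), (7, 1.5), (5, 2.0)]
print(pareto_frontier(designs))
```

For these placeholder values the designs at (6, 3.0) and (7, 1.5) are dropped, because another design is at least as fast and uses no more power.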
23
Aladdin Validation
Aladdin flow: C Code -> Aladdin -> Power/Area, Performance
RTL flow: Verilog, ModelSim, Activity, Design Compiler
24
Aladdin Validation
Aladdin flow: C Code -> Aladdin -> Power/Area, Performance
RTL flow: RTL Designer, HLS C Tuning, Vivado HLS, Verilog, ModelSim, Activity, Design Compiler
25
Aladdin Validation
26
Aladdin Validation
27
Aladdin enables rapid design space exploration for accelerators.
Aladdin flow (7 mins): C Code -> Aladdin -> Power/Area, Performance
RTL flow (52 hours): RTL Designer, HLS C Tuning, Vivado HLS, Verilog, ModelSim, Activity, Design Compiler
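The two turnaround times on this slide imply a ratio that is easy to check. The snippet below only does that arithmetic; the variable names are illustrative.

```python
aladdin_minutes = 7  # turnaround of the Aladdin flow, from the slide
rtl_hours = 52       # turnaround of the RTL flow, from the slide

rtl_minutes = rtl_hours * 60
speedup = rtl_minutes / aladdin_minutes
print(f"{rtl_minutes} min / {aladdin_minutes} min = {speedup:.0f}x")
```

That works out to roughly 446 times faster for this exploration.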
28
Aladdin enables pre-RTL simulation of accelerators with the rest of the SoC.
GPU: GPGPU-Sim
Big Cores, Small Cores: MARSSx86 ... XIOSim…
Memory Interface: DRAMSim2
Shared Resources: Cacti/Orion2
Sea of Fine-Grained Accelerators: Aladdin
29
Modeling Accelerators in a SoC-like Environment
Block diagram components: Acc, Core, Cache, Memory
30
Aladdin: A pre-RTL, Power-Performance Accelerator Simulator
Architectures with 1000s of accelerators will be radically different; new design tools are needed.
Aladdin enables rapid design space exploration of future accelerator-centric platforms.
You can find Aladdin at