Please do not distribute 4/10/2017
A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures
  Yakun Sophia Shao, Brandon Reagen, Gu-Yeon Wei, David Brooks
  Harvard University
GYW
Beyond Homogeneous Parallelism
  General-Purpose Cores (CPU)
  Programmable Accelerators (DSP, GPU)
  Application-Specific Accelerators (ASIP, ASIC)
  [Figure axes: Energy Efficiency; Flexibility, Programmability, Design Cost]
Today’s SoC: OMAP 4
  [Figure: OMAP 4 SoC block diagram]
  Blocks: ARM Cores, GPU, DSP, DMA, SD, USB, Audio, Video, Face, Imaging
  Interconnect: System Bus, Secondary Bus, Tertiary
Today’s SoC
  [Figure: Apple A7; Harvard VLSI-ARCH Group SoC Tapeout]
Today’s SoC
  [Figure: SoC block diagram with CPU, GPU/DSP, Buses, Mem Interface, Acc]
Future Accelerator-Centric Architectures
  [Figure: Big Cores, Small Cores, GPU/DSP, Sea of Fine-Grained Accelerators, Shared Resources, Memory Interface]
  How to decompose an application into accelerators?
  How to rapidly design lots of accelerators?
  How to design and manage the shared resources?
  Flexibility, Design Cost, Programmability
Aladdin: A pre-RTL, Power-Performance Accelerator Simulator
  Inputs: Unmodified C-Code; Accelerator Design Parameters (e.g., # FU, mem. BW)
  Models: Accelerator Specific Datapath; Private L1/Scratchpad; Shared Memory/Interconnect Models
  Outputs: Power/Area; Performance
  “Accelerator Simulator”: Design Accelerator-Rich SoC Fabrics and Memory Systems
  “Design Assistant”: Understand Algorithmic-HW Design Space before RTL
  Flexibility, Programmability, Design Cost
Future Accelerator-Centric Architecture
  [Figure: Big Cores, Small Cores, GPU/DSP, Sea of Fine-Grained Accelerators, Shared Resources, Memory Interface]
  Aladdin can rapidly evaluate a large design space of accelerator-centric architectures.
Aladdin Overview
  Optimization Phase: C Code -> Optimistic IR -> Initial DDDG -> Idealistic DDDG
  Realization Phase: Program Constrained DDDG -> Resource Constrained DDDG -> Performance, Activity
  Acc Design Parameters feed the Resource Constrained DDDG; Activity feeds the Power/Area Models, which produce Power/Area.
  DDDG: Dynamic Data Dependence Graph
From C to Design Space
  C Code:
    for(i=0; i<N; ++i)
      c[i] = a[i] + b[i];
From C to Design Space: IR Dynamic Trace
  C Code:
    for(i=0; i<N; ++i)
      c[i] = a[i] + b[i];
  IR Trace:
    0.  r0=0                  //i = 0
    1.  r4=load(r0 + r1)      //load a[i]
    2.  r5=load(r0 + r2)      //load b[i]
    3.  r6=r4 + r5
    4.  store(r0 + r3, r6)    //store c[i]
    5.  r0=r0 + 1             //++i
    6.  r4=load(r0 + r1)      //load a[i]
    7.  r5=load(r0 + r2)      //load b[i]
    8.  r6=r4 + r5
    9.  store(r0 + r3, r6)    //store c[i]
    10. r0=r0 + 1             //++i
    …
From C to Design Space: Initial DDDG
  Nodes, numbered as in the IR trace:
    0.  i=0    1.  ld a    2.  ld b    3.  +    4.  st c
    5.  i++    6.  ld a    7.  ld b    8.  +    9.  st c
    10. i++    11. ld a    12. ld b    13. +    14. st c
  [Figure: initial DDDG over these nodes]
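The step from the IR trace to the initial DDDG can be sketched as a last-writer lookup per register. This is only an illustration under stated assumptions: the `TraceOp` encoding, the `build_dddg` name, and the limits `NUM_REGS` and `MAX_SRC` are hypothetical and not taken from Aladdin, and only register read-after-write dependences are tracked (memory dependences between stores and loads are left out).

```c
#include <assert.h>
#include <stddef.h>

#define NUM_REGS 8    /* r0..r7 cover the registers used in the slide's trace */
#define MAX_SRC  3    /* a store reads index, base and data registers */
#define NONE     (-1) /* "no register" or "no producer node" */

/* One dynamic IR instruction (hypothetical encoding for this sketch). */
typedef struct {
    int dst;           /* register written, or NONE (e.g. a store) */
    int src[MAX_SRC];  /* registers read, NONE for unused slots */
} TraceOp;

/* For trace entry i and source slot k, dep[i][k] receives the index of the
 * most recent earlier entry that wrote that register, or NONE if the value
 * is live-in. Returns the number of dependence edges found. */
int build_dddg(const TraceOp *trace, size_t n, int dep[][MAX_SRC])
{
    int last_writer[NUM_REGS];
    int edges = 0;

    for (int r = 0; r < NUM_REGS; ++r)
        last_writer[r] = NONE;

    for (size_t i = 0; i < n; ++i) {
        /* Resolve sources first so that r0 = r0 + 1 reads the old r0. */
        for (int k = 0; k < MAX_SRC; ++k) {
            int reg = trace[i].src[k];
            dep[i][k] = NONE;
            if (reg == NONE)
                continue;
            assert(reg >= 0 && reg < NUM_REGS);
            dep[i][k] = last_writer[reg];
            if (dep[i][k] != NONE)
                ++edges;
        }
        if (trace[i].dst != NONE) {
            assert(trace[i].dst >= 0 && trace[i].dst < NUM_REGS);
            last_writer[trace[i].dst] = (int)i;
        }
    }
    return edges;
}
```

Fed the first seven trace entries from the slide, this sketch makes the add at entry 3 depend on entries 1 and 2, and the second load of a[i] at entry 6 depend on the index update at entry 5.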
From C to Design Space: Idealistic DDDG
  [Figure: idealistic DDDG; node numbers index the IR trace; nodes shown row by row]
    0.  i=0
    5.  i++     1.  ld a    2.  ld b
    10. i++     6.  ld a    7.  ld b    3.  +
    11. ld a    12. ld b    8.  +       4.  st c
    13. +       9.  st c
    14. st c
From C to Design Space: Optimization Phase (C->IR->DDDG)
  Include application-specific customization strategies.
  Node-Level:
    Bit-width Analysis
    Strength Reduction
    Tree-height Reduction
  Loop-Level:
    Remove dependences between loop index variables
  Memory Optimization:
    Memory-to-Register Conversion
    Store-Load Forwarding
    Store Buffer
  Extensible:
    e.g., model a CAM accelerator by matching nodes in the DDDG
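The effect of the loop-level item, removing dependences between loop index variables, can be illustrated by computing the dependence level of every node with and without the i++ chain. This is a minimal sketch, not Aladdin's implementation: the `Node` layout and the names `vector_add_dddg` and `asap_levels` are hypothetical, and every node is assumed to take one level.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PRED 2
#define NONE (-1)

typedef struct {
    int pred[MAX_PRED]; /* producer node indices, NONE for unused slots */
    int is_index;       /* nonzero for loop-index nodes (i=0, i++) */
} Node;

/* Fill g with the DDDG of `iters` iterations of c[i] = a[i] + b[i].
 * Numbering follows the slides: node 5*j is the index node of iteration j,
 * followed by ld a, ld b, + and st c. g must hold 5*iters nodes. */
void vector_add_dddg(Node *g, int iters)
{
    for (int j = 0; j < iters; ++j) {
        int b = 5 * j;
        g[b]     = (Node){{ j > 0 ? b - 5 : NONE, NONE }, 1}; /* i=0 or i++ */
        g[b + 1] = (Node){{ b, NONE }, 0};                     /* ld a */
        g[b + 2] = (Node){{ b, NONE }, 0};                     /* ld b */
        g[b + 3] = (Node){{ b + 1, b + 2 }, 0};                /* +    */
        g[b + 4] = (Node){{ b, b + 3 }, 0};                    /* st c */
    }
}

/* Writes the earliest level of each node into level[] and returns the number
 * of levels. Nodes must be in trace order (every predecessor has a smaller
 * index). If drop_index_chain is nonzero, edges from one index node to
 * another index node are ignored. */
int asap_levels(const Node *g, size_t n, int drop_index_chain, int *level)
{
    int depth = 0;

    for (size_t i = 0; i < n; ++i) {
        level[i] = 0;
        for (int k = 0; k < MAX_PRED; ++k) {
            int p = g[i].pred[k];
            if (p == NONE)
                continue;
            assert(p >= 0 && (size_t)p < i);
            if (drop_index_chain && g[p].is_index && g[i].is_index)
                continue;
            if (level[p] + 1 > level[i])
                level[i] = level[p] + 1;
        }
        if (level[i] + 1 > depth)
            depth = level[i] + 1;
    }
    return depth;
}
```

For three iterations (15 nodes), the sketch yields six levels with the i++ chain kept and four levels once it is dropped, because every iteration can then start its loads at the same level.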
From C to Design Space: One Design
  Acc Design Parameters: Memory BW <= 2, 1 Adder
  Idealistic DDDG (four iterations):
    0. i=0    5. i++    10. i++    15. i++
    1. ld a   2. ld b   6. ld a    7. ld b    11. ld a   12. ld b   16. ld a   17. ld b
    3. +      8. +      13. +      18. +
    4. st c   9. st c   14. st c   19. st c
  [Figure: the DDDG scheduled by Cycle, with MEM and + Resource Activity]
From C to Design Space: Another Design
  Acc Design Parameters: Memory BW <= 4, 2 Adders
  [Figure: the same idealistic DDDG (nodes 0 to 19) scheduled by Cycle, with MEM and + Resource Activity]
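How the two parameter sets lead to different cycle counts can be sketched with a simple greedy, cycle-by-cycle scheduler over the four-iteration DDDG. The sketch rests on assumptions that the slides do not state: every operation takes one cycle, loads and stores share the memory bandwidth, index nodes use no constrained resource, and ready nodes are picked in trace order. The names `Node`, `vector_add_dddg`, and `schedule` are hypothetical.

```c
#include <assert.h>

#define NONE (-1)

enum Kind { K_INDEX, K_LOAD, K_ADD, K_STORE };

typedef struct {
    enum Kind kind;
    int pred[2];   /* producer node indices, NONE for unused slots */
} Node;

/* DDDG of `iters` iterations of c[i] = a[i] + b[i], with the dependences
 * between index nodes already removed. Node 5*j is the index node of
 * iteration j, followed by ld a, ld b, + and st c. */
void vector_add_dddg(Node *g, int iters)
{
    for (int j = 0; j < iters; ++j) {
        int b = 5 * j;
        g[b]     = (Node){K_INDEX, {NONE, NONE}};
        g[b + 1] = (Node){K_LOAD,  {b, NONE}};
        g[b + 2] = (Node){K_LOAD,  {b, NONE}};
        g[b + 3] = (Node){K_ADD,   {b + 1, b + 2}};
        g[b + 4] = (Node){K_STORE, {b, b + 3}};
    }
}

/* Greedy schedule under mem_bw memory operations and `adders` additions per
 * cycle. cycle_of[i] receives the cycle of node i; returns total cycles. */
int schedule(const Node *g, int n, int mem_bw, int adders, int *cycle_of)
{
    int done = 0, cycle = 0;

    assert(mem_bw > 0 && adders > 0);
    for (int i = 0; i < n; ++i)
        cycle_of[i] = NONE;

    while (done < n) {
        int mem_used = 0, add_used = 0;
        for (int i = 0; i < n; ++i) {
            int ready = 1;
            if (cycle_of[i] != NONE)
                continue;               /* already scheduled */
            for (int k = 0; k < 2; ++k) {
                int p = g[i].pred[k];
                /* A producer must have run in an earlier cycle. */
                if (p != NONE && (cycle_of[p] == NONE || cycle_of[p] >= cycle))
                    ready = 0;
            }
            if (!ready)
                continue;
            if (g[i].kind == K_LOAD || g[i].kind == K_STORE) {
                if (mem_used == mem_bw)
                    continue;           /* memory bandwidth exhausted */
                ++mem_used;
            } else if (g[i].kind == K_ADD) {
                if (add_used == adders)
                    continue;           /* all adders busy */
                ++add_used;
            }
            cycle_of[i] = cycle;
            ++done;
        }
        ++cycle;
    }
    return cycle;
}
```

Under these assumptions, the four iterations take 8 cycles with Memory BW 2 and 1 adder, and 5 cycles with Memory BW 4 and 2 adders. These counts come from the sketch's simplified model, not from the slides.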
From C to Design Space: Realization Phase (DDDG->Estimates)
  Constrain the DDDG with program and user-defined resource constraints.
  Program Constraints:
    Control Dependence
    Memory Ambiguation
  Resource Constraints:
    Loop-level Parallelism
    Loop Pipelining
    Memory Ports
    # of FUs (e.g., adders, multipliers)
From C to Design Space: Power-Performance per Design
  [Figure: Power vs. Cycle, one point per design]
    Acc Design Parameters: Memory BW <= 4, 2 Adders
    Acc Design Parameters: Memory BW <= 2, 1 Adder
From C to Design Space: Design Space of an Algorithm
  [Figure: Power vs. Cycle across the design space]
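One common way to read a Power vs. Cycle scatter like this is to keep only the designs that no other design beats on both axes. The slide itself does not name this filter, so treat the sketch below as an assumption about how such a plot might be post-processed; the `DesignPoint` type and `pareto_front` function are hypothetical, and any values passed in are the caller's own.

```c
#include <assert.h>
#include <stddef.h>

/* One evaluated design: execution time in cycles and power in any one unit. */
typedef struct {
    double cycles;
    double power;
} DesignPoint;

/* a dominates b if it is no worse on both axes and better on at least one. */
static int dominates(const DesignPoint *a, const DesignPoint *b)
{
    return a->cycles <= b->cycles && a->power <= b->power &&
           (a->cycles < b->cycles || a->power < b->power);
}

/* Sets on_front[i] to 1 if no other point dominates pts[i], else 0.
 * Returns the number of non-dominated points. */
size_t pareto_front(const DesignPoint *pts, size_t n, int *on_front)
{
    size_t count = 0;

    assert(pts != NULL && on_front != NULL);
    for (size_t i = 0; i < n; ++i) {
        on_front[i] = 1;
        for (size_t j = 0; j < n; ++j) {
            if (j != i && dominates(&pts[j], &pts[i])) {
                on_front[i] = 0;
                break;
            }
        }
        count += (size_t)on_front[i];
    }
    return count;
}
```

The quadratic scan is adequate for a few thousand design points; a sort by cycles followed by a single sweep would be the usual choice for larger sweeps.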
Aladdin Validation
  Aladdin flow: C Code -> Aladdin -> Power/Area, Performance
  RTL flow: C Code -> RTL Designer, or HLS C Tuning with Vivado HLS -> Verilog -> ModelSim (Activity) and Design Compiler -> Power/Area, Performance
Aladdin enables rapid design space exploration for accelerators.
  Aladdin flow (C Code -> Aladdin -> Power/Area, Performance): 7 mins
  RTL flow (RTL Designer, HLS C Tuning, Vivado HLS, Verilog, ModelSim, Design Compiler): 52 hours
Aladdin enables pre-RTL simulation of accelerators with the rest of the SoC.
  GPU: GPGPU-Sim
  Big Cores, Small Cores: MARSSx86 ... XIOSim…
  Memory Interface: DRAMSim2
  Shared Resources: Cacti/Orion2
  Sea of Fine-Grained Accelerators: Aladdin
Modeling Accelerators in a SoC-like Environment
  [Figure: system diagrams with Acc, Core, Cache, Memory]
Aladdin: A pre-RTL, Power-Performance Accelerator Simulator
  Architectures with 1000s of accelerators will be radically different; new design tools are needed.
  Aladdin enables rapid design space exploration of future accelerator-centric platforms.
  You can find Aladdin at http://vlsiarch.eecs.harvard.edu/aladdin