Download presentation
Presentation is loading. Please wait.
Published byJanis Skinner Modified over 8 years ago
1
Design and Modeling of Specialized Architectures Yakun Sophia Shao May 9 th, 2016 Harvard University P HD D ISSERTATION D EFENSE
2
Harvard University Moore’s Law 2
3
CMOS Scaling is Slowing Down http://www.anandtech.com/show/9447/intel-10nm-and-kaby-lake Harvard University 3 180 nm 130 nm 90 nm 65 nm 45 nm 32 nm 22 nm 14 nm 10 nm
4
CMOS Technology Scaling Technological Fallow Period Harvard University 4
5
Potential for Specialized Architectures [Zhang and Brodersen] 16Encryption 17Hearing Aid 18FIR for disk read 19MPEG Encoder 20802.11 Baseband Harvard University 5
6
Cores, GPUs, and Accelerators: Apple A8 SoC Out-of-Core Accelerators Harvard University 6
7
Cores, GPUs, and Accelerators: Apple A8 SoC Out-of-Core Accelerators Harvard University 7
8
Cores, GPUs, and Accelerators: Apple A8 SoC Out-of-Core Accelerators Maltiel Consulting estimates Our estimates Harvard University 8
9
Challenges in Accelerators Flexibility –Fixed-function accelerators are only designed for the target applications. Programmability –Today’s accelerators are explicitly managed by programmers. 9
10
OMAP 4 SoC Today’s SoC Harvard University 10
11
OMAP 4 SoC Today’s SoC ARM Core s GPU DSP System Bus Secondary Bus Secondary Bus Tertiary Bus DMA SD USB Audio Video Face Imaging USB Harvard University 11
12
Challenges in Accelerators Flexibility –Fixed-function accelerators are only designed for the target applications. Programmability –Today’s accelerators are explicitly managed by programmers. Design Cost –Accelerator (and RTL) implementation is inherently tedious and time-consuming. 12 Harvard University
13
Today’s SoC GPU/ DSP CPU Buses Mem Inter- face Acc CPU Acc Harvard University 13
14
Future Accelerator-Centric Architectures Flexibility Design Cost Programmability How to decompose applications into accelerators? How to rapidly design lots of accelerators? How to design and manage the shared resources? GPU/DS P Big Cores Shared Resources Memory Interface Sea of Fine-Grained Accelerators Small Cores Harvard University 14
15
Harvard University GPU/D SP Big Cores Shared Resources Memory Interface Sea of Fine-Grained Accelerators Small Core s 15 Accelerator-System Co-Design [Under Review] Contributions Research Infrastructures for Hardware Accelerators [Synthesis Lecture’15] Accelerator Design w/ High-Level Synthesis [ISLPED’13_1] Aladdin: Accelerator Pre- RTL, Power-Performance Simulator [ISCA’14, TopPicks’15] MachSuite: Accelerator Benchmark Suite [IISWC’14] WIICA: Accelerator Workload Characterization [ISPASS’13] Instruction-Level Energy Model for Xeon Phi [ISLPED’13_2]
16
Private L1/ Scratchpad Aladdin Accelerator Specific Datapath Shared Memory/Interconnect Models Unmodified C-Code Accelerator Design Parameters (e.g., # FU, mem. BW) Power/Area Performance “Accelerator Simulator” Design Accelerator-Rich SoC Fabrics and Memory Systems Design Cost Flexibility Programmability Aladdin: A pre-RTL, Power- Performance Accelerator Simulator “Design Assistant” Understand Algorithmic-HW Design Space before RTL Harvard University 16
17
GPU/DS P Big Cores Shared Resources Memory Interface Sea of Fine-Grained Accelerators Small Cores Future Accelerator-Centric Architecture Harvard University 17
18
GPU/DS P Big Cores Shared Resources Memory Interface Sea of Fine-Grained Accelerators Small Cores Future Accelerator-Centric Architecture Aladdin can rapidly evaluate large design space of accelerator-centric architectures. Harvard University 18
19
Aladdin Overview C Code Power/Area Performance Activity Acc Design Parameters Optimization Phase Realization Phase Optimistic IR Initial DDDG Idealistic DDDG Program Constrained DDDG Resource Constrained DDDG Power/Area Models Dynamic Data Dependence Graph (DDDG) Harvard University 19
20
Aladdin Overview C Code Optimistic IR Initial DDDG Idealistic DDDG Program Constrained DDDG Resource Constrained DDDG Power/Area Models Optimization Phase Realization Phase Power/Area Performance Activity Acc Design Parameters Harvard University 20
21
From C to Design Space C Code: for(i=0; i<N; ++i) c[i] = a[i] + b[i]; Harvard University 21
22
From C to Design Space IR Dynamic Trace C Code: for(i=0; i<N; ++i) c[i] = a[i] + b[i]; 0. r0=0 //i = 0 1.r4=load (r0 + r1) //load a[i] 2.r5=load (r0 + r2) //load b[i] 3.r6=r4 + r5 4.store(r0 + r3, r6) //store c[i] 5.r0=r0 + 1 //++i 6.r4=load(r0 + r1) //load a[i] 7.r5=load(r0 + r2) //load b[i] 8.r6=r4 + r5 9.store(r0 + r3, r6) //store c[i] 10.r0 = r0 + 1 //++i … Harvard University 22
23
From C to Design Space Initial DDDG 0. i=0 1. ld a2. ld b 3. + 4. st c 5. i++ 6. ld a7. ld b 8. + 9. st c 10. i++ 11. ld a12. ld b 13. + 14. st c C Code: for(i=0; i<N; ++i) c[i] = a[i] + b[i]; IR Trace: 0. r0=0 //i = 0 1. r4=load (r0 + r1) //load a[i] 2. r5=load (r0 + r2) //load b[i] 3. r6=r4 + r5 4. store(r0 + r3, r6) //store c[i] 5. r0=r0 + 1 //++i 6. r4=load(r0 + r1) //load a[i] 7. r5=load(r0 + r2) //load b[i] 8. r6=r4 + r5 9. store(r0 + r3, r6) //store c[i] 10.r0 = r0 + 1 //++i … Harvard University 23
24
0. i=0 5. i++ 10. i++ 11. ld a12. ld b 13. + 14. st c 6. ld a7. ld b 8. + 9. st c 1. ld a 2. ld b 3. + 4. st c C Code: for(i=0; i<N; ++i) c[i] = a[i] + b[i]; IR Trace: 0. r0=0 //i = 0 1. r4=load (r0 + r1) //load a[i] 2. r5=load (r0 + r2) //load b[i] 3. r6=r4 + r5 4. store(r0 + r3, r6) //store c[i] 5. r0=r0 + 1 //++i 6. r4=load(r0 + r1) //load a[i] 7. r5=load(r0 + r2) //load b[i] 8. r6=r4 + r5 9. store(r0 + r3, r6) //store c[i] 10.r0 = r0 + 1 //++i … 0. i=0 5. i++ 10. i++ 11. ld a12. ld b 13. + 14. st c 6. ld a7. ld b 8. + 9. st c 1. ld a 2. ld b 3. + 4. st c From C to Design Space Idealistic DDDG Harvard University 24
25
Include application-specific customization strategies. Node-Level: –Bit-width Analysis –Strength Reduction –Tree-height Reduction Loop-Level: –Remove dependences between loop index variables Memory Optimization: –Memory-to-Register Conversion –Store-Load Forwarding –Store Buffer From C to Design Space Optimization Phase: C->IR->DDDG Harvard University 25
26
From C to Design Space One Design MEM + + + Resource Activity Idealistic DDDG Acc Design Parameters: Memory BW <= 2 1 Adder 0. i=0 5.i++ 10. i++ 11. ld a 12. ld b 13. + 14. st c 6. ld a 7. ld b 8. + 9. st c 1. ld a 2. ld b 3. + 4. st c 15. i++ 16. ld a 17. ld b 18. + 19. st c Cycle 0. i=0 5.i++ 6. ld a 7. ld b 8. + 9. st c 1. ld a 2. ld b 3. + 4. st c Harvard University 26
27
From C to Design Space Another Design MEM + + + + + + + Resource Activity Cycle 0. i=0 5.i++ 10. i++ 11. ld a 12. ld b 13. + 14. st c 7. ld b 8. + 9. st c 1. ld a 2. ld b 3. + 4. st c 15. i++ 16. ld a 17. ld b 18. + 19. st c 6. ld a Acc Design Parameters: Memory BW <= 4 2 Adders Idealistic DDDG 0. i=0 5.i++ 10. i++ 11. ld a 12. ld b 13. + 14. st c 6. ld a 7. ld b 8. + 9. st c 1. ld a 2. ld b 3. + 4. st c 15. i++ 16. ld a 17. ld b 18. + 19. st c Harvard University 27
28
Constrain the DDDG with program and user- defined resource constraints Program Constraints –Control Dependence –Memory Ambiguation Resource Constraints –Loop-level Parallelism –Loop Pipelining –Memory Ports From C to Design Space Realization Phase: DDDG->Estimates Harvard University 28
29
Cycle Power Acc Design Parameters: Memory BW <= 4 2 Adders Acc Design Parameters: Memory BW <= 2 1 Adder From C to Design Space Power-Performance per Design Harvard University 29
30
From C to Design Space Design Space of an Algorithm Cycle Power Harvard University 30
31
Aladdin Validation C Code Power/Area Performance Aladdin ModelSim Design Compiler Verilog Activity Harvard University 31
32
Aladdin Validation C Code Power/Area Performance Aladdin RTL Designer HLS C Tuning Vivado HLS ModelSim Design Compiler Verilog Activity Harvard University 32
33
Aladdin Validation Harvard University 33
34
Aladdin Validation Harvard University 34
35
Algorithm-to-Solution Time Harvard University 35 Hand-Coded RTL C-to-RTL Programming Effort HighMedium RTL Generation Designer Dependent 37 mins RTL Simulation 5 mins RTL Synthesis 45 mins Time to Solution per Design 87 mins Time to Solution (36 Designs) 52 hours
36
Algorithm-to-Solution Time Hand-Coded RTL C-to-RTLAladdin Programming Effort HighMedium N/A RTL Generation Designer Dependent 37 mins RTL Simulation 5 mins RTL Synthesis 45 mins Time to Solution per Design 87 mins1 min Time to Solution (36 Designs) 52 hours7 min Harvard University 36
37
Aladdin enables pre-RTL simulation of accelerators with the rest of the SoC. GPU Shared Resources Memory Interface Sea of Fine-Grained Accelerators Big Cores Small Cores GPGPU- Sim gem5... gem5 … Cacti/Orion2 DRAMSim 2 Harvard University 37
38
Accelerator Integration Harvard University 38 ACC MEM Lane 0Lane 1Lane 2Lane 3 ARR 0ARR 1 BUF 0BUF 1 Lane 4 SPAD Interface CPU 0 CPU 1 L1 $ L2 $ System Bus MC DRAM SRC ADDR DEST ADDR LENGTH Transfer Descriptors CHAN 0 CHAN 3 Channel Selection DMA
39
Compute is only a part of the story Harvard University 39
40
Compute is only a part of the story Harvard University Accelerator-System Co-Design 40
41
Accelerator Integration Harvard University 41 ACC MEM Lane 0Lane 1Lane 2Lane 3 ARR 0ARR 1 BUF 0BUF 1 Lane 4 SPAD Interface CPU 0 CPU 1 L1 $ L2 $ System Bus MC DRAM SRC ADDR DEST ADDR LENGTH Transfer Descriptors CHAN 0 CHAN 3 Channel Selection DMA ACC MEM Lane 0Lane 1Lane 2Lane 3 TLB Cache Cache Interface
42
gem5-Aladdin: An SoC Simulator Harvard University 42 ACC MEM Lane 0Lane 1Lane 2Lane 3 ARR 0ARR 1 BUF 0BUF 1 Lane 4 SPAD Interface CPU 0 CPU 1 L1 $ L2 $ System Bus MC DRAM SRC ADDR DEST ADDR LENGTH Transfer Descriptors CHAN 0 CHAN 3 Channel Selection DMA ACC MEM Lane 0Lane 1Lane 2Lane 3 TLB Cache Cache Interface
43
gem5-Aladdin Validation Harvard University 43 Applicatio n gem5-Aladdin Vivado HLS Verilog Flush Latency DMA Latency Acc Exe Latency DMA IP Block FPGA ARM Core Xilinx Zynq SoC Kernel
44
gem5-Aladdin Validation Harvard University 44
45
To DMA or To Cache? Accelerator local memory Harvard University 45
46
DMA or Cache Harvard University 46
47
DMA or Cache Harvard University 47
48
DMA or Cache Harvard University 48
49
Conclusions Architectures with 1000s of accelerators will be radically different; New design tools are needed. We built Aladdin, an architectural level power, performance, and area simulator for accelerators. We integrated Aladdin with gem5 to model the interactions between accelerators and the rest of the SoC. These accelerator infrastructures open up opportunities for innovation on heterogeneous architecture designs. Harvard University 49
50
Harvard University GPU/D SP Big Cores Shared Resources Memory Interface Sea of Fine-Grained Accelerators Small Core s 50 Accelerator-System Co-Design [Under Review] Contributions Research Infrastructures for Hardware Accelerators [Synthesis Lecture’15] Accelerator Design w/ High-Level Synthesis [ISLPED’13_1] Aladdin: Accelerator Pre- RTL, Power-Performance Simulator [ISCA’14, TopPicks’15] MachSuite: Accelerator Benchmark Suite [IISWC’14] WIICA: Accelerator Workload Characterization [ISPASS’13] Instruction-Level Energy Model for Xeon Phi [ISLPED’13_2]
51
Publications 1.Y.S. Shao, S. Xi, V. Srinivasan, G.-Y. Wei, D. Brooks, “An Holistic Approach to Accelerator- System Co-Design,” Under Review. 2.Y.S Shao and D. Brooks, “Research Infrastructures for Hardware Accelerators,” Synthesis Lectures on Computer Architecture, Nov 2015. 3.Y.S. Shao, B. Reagen, G.-Y. Wei, D. Brooks, “The Aladdin Approach to Accelerator Design and Modeling,” IEEE Micro TopPicks, May-June 2015. 4.Y.S. Shao, S. Xi, V. Srinivasan, G.-Y. Wei, D. Brooks, “Toward Cache-Friendly Hardware Accelerators,” SCAW’15. 5.B. Reagen, B. Adolf, Y.S. Shao, G.-Y. Wei, D. Brooks, “MachSuite: Benchmarks for Accelerator Design and Customized Architectures,” IISWC’14. 6.Y.S. Shao, B. Reagen, G.-Y. Wei, D. Brooks, “Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures,” ISCA’14. 7.B. Reagen, Y.S. Shao, G.-Y. Wei, D. Brooks, “Quantifying Acceleration: Power/Performance Trade-Offs of Application Kernels in Hardware,” ISLPED’13. 8.Y.S. Shao and D. Brooks, “Energy Characterization and Instruction-Level Energy Model of Intel’s Xeon Phi Processor,” ISLPED’13. 9.Y.S. Shao and D. Brooks, “ISA-Independent Workload Characterization and its Implications for Specialized Architectures,” ISPASS’13. Harvard University 51
52
Acknowledgement Harvard University 52
53
Thanks! Harvard University 53
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.