Design and Modeling of Specialized Architectures. Yakun Sophia Shao, May 9th, 2016, Harvard University. PhD Dissertation Defense.


1 Design and Modeling of Specialized Architectures. Yakun Sophia Shao, May 9th, 2016, Harvard University. PhD Dissertation Defense

2 Moore’s Law

3 CMOS Scaling is Slowing Down. Process nodes shown: 180 nm, 130 nm, 90 nm, 65 nm, 45 nm, 32 nm, 22 nm, 14 nm, 10 nm. Source: http://www.anandtech.com/show/9447/intel-10nm-and-kaby-lake

4 CMOS Technology Scaling: Technological Fallow Period

5 Potential for Specialized Architectures [Zhang and Brodersen]. Labeled designs: 16 Encryption; 17 Hearing Aid; 18 FIR for disk read; 19 MPEG Encoder; 20 802.11 Baseband

6 Cores, GPUs, and Accelerators: Apple A8 SoC. Out-of-Core Accelerators

7 Cores, GPUs, and Accelerators: Apple A8 SoC. Out-of-Core Accelerators

8 Cores, GPUs, and Accelerators: Apple A8 SoC. Out-of-Core Accelerators. Legend: Maltiel Consulting estimates; our estimates

9 Challenges in Accelerators
• Flexibility
  – Fixed-function accelerators are designed only for their target applications.
• Programmability
  – Today’s accelerators are explicitly managed by programmers.

10 Today’s SoC: OMAP 4 SoC

11 Today’s SoC: OMAP 4 SoC. Block diagram labels: ARM Cores, GPU, DSP, System Bus, Secondary Bus, Secondary Bus, Tertiary Bus, DMA, SD, USB, Audio, Video, Face, Imaging, USB

12 Challenges in Accelerators
• Flexibility
  – Fixed-function accelerators are designed only for their target applications.
• Programmability
  – Today’s accelerators are explicitly managed by programmers.
• Design Cost
  – Accelerator (and RTL) implementation is inherently tedious and time-consuming.

13 Today’s SoC. Block diagram labels: GPU/DSP, CPU, Buses, Mem Interface, Acc, CPU, Acc

14 Future Accelerator-Centric Architectures
Challenges: Flexibility; Design Cost; Programmability
Questions:
• How to decompose applications into accelerators?
• How to rapidly design lots of accelerators?
• How to design and manage the shared resources?
Diagram labels: GPU/DSP, Big Cores, Small Cores, Sea of Fine-Grained Accelerators, Shared Resources, Memory Interface

15 Contributions
• Accelerator-System Co-Design [Under Review]
• Research Infrastructures for Hardware Accelerators [Synthesis Lecture’15]
• Accelerator Design w/ High-Level Synthesis [ISLPED’13_1]
• Aladdin: Accelerator Pre-RTL, Power-Performance Simulator [ISCA’14, TopPicks’15]
• MachSuite: Accelerator Benchmark Suite [IISWC’14]
• WIICA: Accelerator Workload Characterization [ISPASS’13]
• Instruction-Level Energy Model for Xeon Phi [ISLPED’13_2]
Diagram labels: GPU/DSP, Big Cores, Small Cores, Sea of Fine-Grained Accelerators, Shared Resources, Memory Interface

16 Aladdin: A pre-RTL, Power-Performance Accelerator Simulator
Inputs: Unmodified C-Code; Accelerator Design Parameters (e.g., # FU, mem. BW)
Outputs: Power/Area; Performance
Diagram labels: Accelerator Specific Datapath; Private L1/Scratchpad; Shared Memory/Interconnect Models
• “Accelerator Simulator”: Design Accelerator-Rich SoC Fabrics and Memory Systems
• “Design Assistant”: Understand Algorithmic-HW Design Space before RTL
Challenge labels: Design Cost; Flexibility; Programmability

17 Future Accelerator-Centric Architecture. Diagram labels: GPU/DSP, Big Cores, Small Cores, Sea of Fine-Grained Accelerators, Shared Resources, Memory Interface

18 Future Accelerator-Centric Architecture. Aladdin can rapidly evaluate a large design space of accelerator-centric architectures. Diagram labels: GPU/DSP, Big Cores, Small Cores, Sea of Fine-Grained Accelerators, Shared Resources, Memory Interface

19 Aladdin Overview
Inputs: C Code; Acc Design Parameters
Optimization Phase: Optimistic IR -> Initial DDDG -> Idealistic DDDG
Realization Phase: Program Constrained DDDG -> Resource Constrained DDDG -> Power/Area Models
Other labels: Activity; Power/Area; Performance
DDDG: Dynamic Data Dependence Graph

20 Aladdin Overview
Inputs: C Code; Acc Design Parameters
Optimization Phase: Optimistic IR -> Initial DDDG -> Idealistic DDDG
Realization Phase: Program Constrained DDDG -> Resource Constrained DDDG -> Power/Area Models
Other labels: Activity; Power/Area; Performance

21 From C to Design Space
C Code:
    for(i=0; i<N; ++i) c[i] = a[i] + b[i];

22 From C to Design Space: IR Dynamic Trace
C Code:
    for(i=0; i<N; ++i) c[i] = a[i] + b[i];
IR Dynamic Trace:
    0. r0=0 //i = 0
    1. r4=load(r0 + r1) //load a[i]
    2. r5=load(r0 + r2) //load b[i]
    3. r6=r4 + r5
    4. store(r0 + r3, r6) //store c[i]
    5. r0=r0 + 1 //++i
    6. r4=load(r0 + r1) //load a[i]
    7. r5=load(r0 + r2) //load b[i]
    8. r6=r4 + r5
    9. store(r0 + r3, r6) //store c[i]
    10. r0=r0 + 1 //++i
    …
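
A minimal sketch of how such a dynamic trace could be produced for this one loop. The function name `vector_add_trace` and the tuple layout are illustrative assumptions, not part of Aladdin; the register names follow the slide (r0 is i, r1/r2/r3 are the base addresses of a, b, and c).

```python
def vector_add_trace(n):
    """Dynamic trace of `for (i=0; i<n; ++i) c[i] = a[i] + b[i];`.

    Each entry is (dest, op, sources). This is a toy stand-in for an
    instrumented run, not Aladdin's actual trace format.
    """
    trace = [("r0", "const", [])]                          # 0. r0 = 0
    for _ in range(n):
        trace.append(("r4", "load", ["r0", "r1"]))         # load a[i]
        trace.append(("r5", "load", ["r0", "r2"]))         # load b[i]
        trace.append(("r6", "add", ["r4", "r5"]))          # a[i] + b[i]
        trace.append((None, "store", ["r0", "r3", "r6"]))  # store c[i]
        trace.append(("r0", "add", ["r0"]))                # ++i (immediate 1 omitted)
    return trace


# Two iterations give entries 0..10, the same numbering as the slide.
for idx, (dest, op, srcs) in enumerate(vector_add_trace(2)):
    lhs = f"{dest} = " if dest else ""
    print(f"{idx}. {lhs}{op}({', '.join(srcs)})")
```

Because the trace is dynamic, the loop is already fully unrolled in it: every executed operation appears once, with concrete ordering.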

23 From C to Design Space: Initial DDDG
C Code and IR Trace: as on slide 22.
DDDG nodes: 0. i=0; 1. ld a; 2. ld b; 3. +; 4. st c; 5. i++; 6. ld a; 7. ld b; 8. +; 9. st c; 10. i++; 11. ld a; 12. ld b; 13. +; 14. st c
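
A hedged sketch of how a data dependence graph can be derived from the trace: each node depends on the most recent earlier node that wrote one of its source registers. The helper names are hypothetical, and memory dependences are ignored on the assumption that a, b, and c do not alias.

```python
def make_trace(n):
    """(dest, sources) pairs for n iterations; r0 is the loop index i."""
    trace = [("r0", [])]                    # i = 0
    for _ in range(n):
        trace += [
            ("r4", ["r0", "r1"]),           # ld a[i]
            ("r5", ["r0", "r2"]),           # ld b[i]
            ("r6", ["r4", "r5"]),           # +
            (None, ["r0", "r3", "r6"]),     # st c[i]
            ("r0", ["r0"]),                 # i++
        ]
    return trace


def build_dddg(trace):
    """Return read-after-write edges (producer node, consumer node)."""
    last_writer = {}   # register -> node index that last wrote it
    edges = []
    for node, (dest, srcs) in enumerate(trace):
        for reg in srcs:
            if reg in last_writer:          # r1..r3 are never written: no edge
                edges.append((last_writer[reg], node))
        if dest is not None:
            last_writer[dest] = node
    return edges


edges = build_dddg(make_trace(2))
print(edges)
```

In this toy graph every iteration hangs off the previous `i++`, so the index updates form a serial chain through the whole trace.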

24 From C to Design Space: Idealistic DDDG
C Code and IR Trace: as on slide 22.
DDDG nodes, grouped as drawn:
    0. i=0; 5. i++; 10. i++
    1. ld a; 2. ld b; 3. +; 4. st c
    6. ld a; 7. ld b; 8. +; 9. st c
    11. ld a; 12. ld b; 13. +; 14. st c
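
To illustrate why the idealistic graph exposes more parallelism, the sketch below compares the longest dependence chain before and after dropping the edges between loop index updates. It is a toy model under stated assumptions: unit latency per node, no memory aliasing, and hypothetical helper names.

```python
def make_trace(n):
    """(dest, sources) pairs for n iterations; r0 is the loop index i."""
    trace = [("r0", [])]
    for _ in range(n):
        trace += [("r4", ["r0", "r1"]), ("r5", ["r0", "r2"]),
                  ("r6", ["r4", "r5"]), (None, ["r0", "r3", "r6"]),
                  ("r0", ["r0"])]
    return trace


def raw_edges(trace):
    """Read-after-write edges between trace nodes."""
    last_writer, edges = {}, []
    for node, (dest, srcs) in enumerate(trace):
        edges += [(last_writer[r], node) for r in srcs if r in last_writer]
        if dest is not None:
            last_writer[dest] = node
    return edges


def drop_index_edges(trace, edges, index_reg="r0"):
    """Remove edges whose producer and consumer both update the loop index."""
    is_index = [dest == index_reg for dest, _ in trace]
    return [(s, d) for s, d in edges if not (is_index[s] and is_index[d])]


def depth(num_nodes, edges):
    """Longest chain, in nodes. Edges always point to a later node, so
    visiting them in order of destination finalizes each source first."""
    level = [1] * num_nodes
    for src, dst in sorted(edges, key=lambda e: e[1]):
        level[dst] = max(level[dst], level[src] + 1)
    return max(level)


trace = make_trace(3)
initial = raw_edges(trace)
ideal = drop_index_edges(trace, initial)
print(depth(len(trace), initial), depth(len(trace), ideal))  # prints "6 4"
```

With the index chain removed, the depth no longer grows with the iteration count: every iteration is an independent index, load, add, store chain.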

25 From C to Design Space: Optimization Phase (C->IR->DDDG)
• Include application-specific customization strategies.
• Node-Level:
  – Bit-width Analysis
  – Strength Reduction
  – Tree-height Reduction
• Loop-Level:
  – Remove dependences between loop index variables
• Memory Optimization:
  – Memory-to-Register Conversion
  – Store-Load Forwarding
  – Store Buffer
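
As one concrete case from the list above, tree-height reduction rebalances a chain of associative operations into a tree. The sketch below only counts operator levels for summing n operands; the function names are illustrative, and it assumes reassociation is legal for the operation (exact for integers, not always for floating point).

```python
def chain_depth(n):
    """Levels in a left-leaning chain ((x0 + x1) + x2) + ... over n operands."""
    return max(n - 1, 0)


def balanced_depth(n):
    """Levels in a balanced reduction tree over n operands."""
    levels = 0
    while n > 1:
        n = (n + 1) // 2   # pair up operands; an odd one passes through
        levels += 1
    return levels


for n in (2, 5, 8, 64):
    print(n, chain_depth(n), balanced_depth(n))
```

The same number of adders' worth of work is done in both forms; only the dependence depth changes, which matters once enough functional units are available.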

26 From C to Design Space: One Design
Acc Design Parameters: Memory BW <= 2; 1 Adder
Idealistic DDDG nodes: 0. i=0; 1. ld a; 2. ld b; 3. +; 4. st c; 5. i++; 6. ld a; 7. ld b; 8. +; 9. st c; 10. i++; 11. ld a; 12. ld b; 13. +; 14. st c; 15. i++; 16. ld a; 17. ld b; 18. +; 19. st c
[Figure: DDDG nodes scheduled by Cycle, with Resource Activity shown for MEM and the adder (+)]

27 From C to Design Space: Another Design
Acc Design Parameters: Memory BW <= 4; 2 Adders
Idealistic DDDG nodes: 0. i=0; 1. ld a; 2. ld b; 3. +; 4. st c; 5. i++; 6. ld a; 7. ld b; 8. +; 9. st c; 10. i++; 11. ld a; 12. ld b; 13. +; 14. st c; 15. i++; 16. ld a; 17. ld b; 18. +; 19. st c
[Figure: DDDG nodes scheduled by Cycle, with Resource Activity shown for MEM and the adders (+)]
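
The two designs can be compared with a small resource-constrained scheduler over the idealistic DDDG. This is a toy sketch, not Aladdin's scheduler: it assumes unit latency for every node, greedy selection in node order, and that index nodes (i=0, i++) consume neither a memory port nor an adder. The cycle counts it prints belong to this toy model only.

```python
def make_ideal_dddg(n):
    """Node kinds and edges for n iterations of c[i] = a[i] + b[i]."""
    kinds, edges = [], []
    for k in range(n):
        b = 5 * k   # b: index node, b+1: ld a, b+2: ld b, b+3: +, b+4: st c
        kinds += ["idx", "mem", "mem", "add", "mem"]
        edges += [(b, b + 1), (b, b + 2), (b + 1, b + 3),
                  (b + 2, b + 3), (b, b + 4), (b + 3, b + 4)]
    return kinds, edges


def schedule(kinds, edges, mem_bw, adders):
    """Return the number of cycles needed under the given resource limits."""
    if mem_bw < 1 or adders < 1:
        raise ValueError("need at least one memory port and one adder")
    preds = {v: set() for v in range(len(kinds))}
    for src, dst in edges:
        preds[dst].add(src)
    limit = {"mem": mem_bw, "add": adders}
    done = {}      # node -> cycle in which it executed
    cycle = 0
    while len(done) < len(kinds):
        cycle += 1
        used = {"mem": 0, "add": 0}
        for v in range(len(kinds)):
            if v in done:
                continue
            # A node is ready only if all producers finished in an earlier cycle.
            if not all(p in done and done[p] < cycle for p in preds[v]):
                continue
            kind = kinds[v]
            if kind in limit:
                if used[kind] >= limit[kind]:
                    continue       # resource exhausted this cycle
                used[kind] += 1
            done[v] = cycle
    return cycle


kinds, edges = make_ideal_dddg(4)
print(schedule(kinds, edges, mem_bw=2, adders=1))  # prints 8
print(schedule(kinds, edges, mem_bw=4, adders=2))  # prints 5
```

Changing only the two parameters yields a different schedule from the same graph, which is what makes sweeping many designs cheap.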

28 From C to Design Space: Realization Phase (DDDG->Estimates)
• Constrain the DDDG with program and user-defined resource constraints
• Program Constraints
  – Control Dependence
  – Memory Ambiguation
• Resource Constraints
  – Loop-level Parallelism
  – Loop Pipelining
  – Memory Ports

29 From C to Design Space: Power-Performance per Design
[Plot: axes Cycle and Power; two designs shown: Memory BW <= 4 with 2 Adders, and Memory BW <= 2 with 1 Adder]

30 From C to Design Space: Design Space of an Algorithm
[Plot: axes Cycle and Power]
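
Once each design has a (cycles, power) estimate, the interesting points are the ones no other design beats on both axes. A minimal Pareto-filter sketch follows; the sample points are made-up illustrative values, not results from the talk.

```python
def pareto_front(points):
    """Keep (cycles, power) points not dominated by any other point.

    Lower is better on both axes. A point is dominated if some other
    point is no worse in cycles and no worse in power.
    """
    front = []
    for p in points:
        dominated = any(
            q != p and q[0] <= p[0] and q[1] <= p[1] for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front)


# Hypothetical (cycles, power) estimates for five designs.
designs = [(8, 1.0), (5, 2.1), (6, 2.5), (9, 1.4), (5, 3.0)]
print(pareto_front(designs))  # prints [(5, 2.1), (8, 1.0)]
```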

31 Aladdin Validation. Flow labels: C Code; Aladdin; Verilog; ModelSim; Activity; Design Compiler; Power/Area; Performance

32 Aladdin Validation. Flow labels: C Code; Aladdin; RTL Designer; HLS C Tuning; Vivado HLS; Verilog; ModelSim; Activity; Design Compiler; Power/Area; Performance

33 Aladdin Validation

34 Aladdin Validation

35 Algorithm-to-Solution Time
Flows compared: Hand-Coded RTL; C-to-RTL
Programming Effort: High (Hand-Coded RTL); Medium (C-to-RTL)
RTL Generation: Designer Dependent (Hand-Coded RTL); 37 mins (C-to-RTL)
RTL Simulation: 5 mins
RTL Synthesis: 45 mins
Time to Solution per Design: 87 mins
Time to Solution (36 Designs): 52 hours

36 Algorithm-to-Solution Time
Flows compared: Hand-Coded RTL; C-to-RTL; Aladdin
Programming Effort: High (Hand-Coded RTL); Medium (C-to-RTL); N/A (Aladdin)
RTL Generation: Designer Dependent (Hand-Coded RTL); 37 mins (C-to-RTL)
RTL Simulation: 5 mins
RTL Synthesis: 45 mins
Time to Solution per Design: 87 mins (RTL flow); 1 min (Aladdin)
Time to Solution (36 Designs): 52 hours (RTL flow); 7 min (Aladdin)
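
The table's totals can be checked with a few lines of arithmetic. The sketch assumes the 36 RTL designs are generated, simulated, and synthesized one after another, and that the 87 minutes is the sum of the three C-to-RTL steps; the variable names are illustrative.

```python
rtl_generation_min = 37
rtl_simulation_min = 5
rtl_synthesis_min = 45
num_designs = 36
aladdin_total_min = 7          # from the slide, for all 36 designs

per_design_min = rtl_generation_min + rtl_simulation_min + rtl_synthesis_min
rtl_total_min = num_designs * per_design_min
rtl_total_hours = rtl_total_min / 60

print(per_design_min)                      # prints 87
print(round(rtl_total_hours, 1))           # prints 52.2
print(rtl_total_min / aladdin_total_min)   # ratio of RTL-flow time to Aladdin time
```

The per-design sum and the 36-design total both match the slide (87 mins, about 52 hours), and the last line gives the ratio between the two flows for the full sweep.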

37 Aladdin enables pre-RTL simulation of accelerators with the rest of the SoC. Diagram labels: GPU, Big Cores, Small Cores, Sea of Fine-Grained Accelerators, Shared Resources, Memory Interface. Simulators shown: GPGPU-Sim, gem5, Cacti/Orion2, DRAMSim2

38 Accelerator Integration. Diagram labels: ACC; MEM; Lane 0, Lane 1, Lane 2, Lane 3, Lane 4; ARR 0, ARR 1; BUF 0, BUF 1; SPAD Interface; CPU 0, CPU 1; L1 $; L2 $; System Bus; MC; DRAM; DMA; Channel Selection (CHAN 0, CHAN 3); Transfer Descriptors (SRC ADDR, DEST ADDR, LENGTH)

39 Compute is only a part of the story

40 Compute is only a part of the story: Accelerator-System Co-Design

41 Accelerator Integration. Diagram labels, scratchpad-based accelerator: ACC; MEM; Lane 0, Lane 1, Lane 2, Lane 3, Lane 4; ARR 0, ARR 1; BUF 0, BUF 1; SPAD Interface. Cache-based accelerator: ACC; MEM; Lane 0, Lane 1, Lane 2, Lane 3; TLB; Cache; Cache Interface. System: CPU 0, CPU 1; L1 $; L2 $; System Bus; MC; DRAM; DMA; Channel Selection (CHAN 0, CHAN 3); Transfer Descriptors (SRC ADDR, DEST ADDR, LENGTH)

42 gem5-Aladdin: An SoC Simulator. Diagram labels, scratchpad-based accelerator: ACC; MEM; Lane 0, Lane 1, Lane 2, Lane 3, Lane 4; ARR 0, ARR 1; BUF 0, BUF 1; SPAD Interface. Cache-based accelerator: ACC; MEM; Lane 0, Lane 1, Lane 2, Lane 3; TLB; Cache; Cache Interface. System: CPU 0, CPU 1; L1 $; L2 $; System Bus; MC; DRAM; DMA; Channel Selection (CHAN 0, CHAN 3); Transfer Descriptors (SRC ADDR, DEST ADDR, LENGTH)

43 gem5-Aladdin Validation. Flow labels: Application; Kernel; gem5-Aladdin; Vivado HLS; Verilog; Xilinx Zynq SoC; ARM Core; FPGA; DMA IP Block. Latency labels: Flush Latency; DMA Latency; Acc Exe Latency

44 gem5-Aladdin Validation

45 To DMA or To Cache?
• Accelerator local memory

46 DMA or Cache

47 DMA or Cache

48 DMA or Cache

49 Conclusions
• Architectures with 1000s of accelerators will be radically different; new design tools are needed.
• We built Aladdin, an architecture-level power, performance, and area simulator for accelerators.
• We integrated Aladdin with gem5 to model the interactions between accelerators and the rest of the SoC.
• These accelerator infrastructures open up opportunities for innovation in heterogeneous architecture design.

50 Contributions
• Accelerator-System Co-Design [Under Review]
• Research Infrastructures for Hardware Accelerators [Synthesis Lecture’15]
• Accelerator Design w/ High-Level Synthesis [ISLPED’13_1]
• Aladdin: Accelerator Pre-RTL, Power-Performance Simulator [ISCA’14, TopPicks’15]
• MachSuite: Accelerator Benchmark Suite [IISWC’14]
• WIICA: Accelerator Workload Characterization [ISPASS’13]
• Instruction-Level Energy Model for Xeon Phi [ISLPED’13_2]
Diagram labels: GPU/DSP, Big Cores, Small Cores, Sea of Fine-Grained Accelerators, Shared Resources, Memory Interface

51 Publications
1. Y.S. Shao, S. Xi, V. Srinivasan, G.-Y. Wei, D. Brooks, “An Holistic Approach to Accelerator-System Co-Design,” Under Review.
2. Y.S. Shao and D. Brooks, “Research Infrastructures for Hardware Accelerators,” Synthesis Lectures on Computer Architecture, Nov 2015.
3. Y.S. Shao, B. Reagen, G.-Y. Wei, D. Brooks, “The Aladdin Approach to Accelerator Design and Modeling,” IEEE Micro TopPicks, May-June 2015.
4. Y.S. Shao, S. Xi, V. Srinivasan, G.-Y. Wei, D. Brooks, “Toward Cache-Friendly Hardware Accelerators,” SCAW’15.
5. B. Reagen, B. Adolf, Y.S. Shao, G.-Y. Wei, D. Brooks, “MachSuite: Benchmarks for Accelerator Design and Customized Architectures,” IISWC’14.
6. Y.S. Shao, B. Reagen, G.-Y. Wei, D. Brooks, “Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures,” ISCA’14.
7. B. Reagen, Y.S. Shao, G.-Y. Wei, D. Brooks, “Quantifying Acceleration: Power/Performance Trade-Offs of Application Kernels in Hardware,” ISLPED’13.
8. Y.S. Shao and D. Brooks, “Energy Characterization and Instruction-Level Energy Model of Intel’s Xeon Phi Processor,” ISLPED’13.
9. Y.S. Shao and D. Brooks, “ISA-Independent Workload Characterization and its Implications for Specialized Architectures,” ISPASS’13.

52 Acknowledgement

53 Thanks!

