1
Ph.D. in Computer Science
Compiler and Architecture Design for Coarse-Grained Programmable Accelerators
Mahdi Hamzeh
School of Computing, Informatics, and Decision Systems Engineering
June 26, 2015
2
Trends in Silicon Computing
[Figure: successive trends, each added on top of the previous ones: Technology, μ-architecture, Multi-threading, Multi-cores, Heterogeneity]
6/26/15 Compiler and Architecture Design for CGRAs
3
Why Heterogeneous Computing?
Efficient resource allocation based on run-time information
Each device exhibits interesting features for one class of computation
Applications execute in phases; each phase is a different class of computation
A significant silicon area will be dark
[Figure: power vs. performance for GPU, FPGA, HP Core, LP Core, DSP, and HW ACC]
LP Core: Low-power in-order general-purpose core
HP Core: High-performance out-of-order general-purpose core
HW ACC: Hardware accelerator
4
HW Accelerators are Expensive!
High design, test, and verification cost: HW ACC and FPGA
Engineering cost and time to market: HW ACC
Building a specialized HW ACC is expensive and time-consuming
[Figure: system design cost vs. performance for HW ACC, FPGA, GPU, DSP, HP Core, and LP Core]
5
HW Accelerators: Low Utilization, Limited Programmability
Specialized for one application: HW ACC
Specialized for a class of computation: DSP, GPU
Run-time configuration overhead: FPGA
A HW ACC does well in only one application; it cannot be reused in another application, even for a phase with a closely related class of computation
[Figure: flexibility vs. performance for LP Core, HP Core, FPGA, GPU, DSP, and HW ACC]
6
Software Programmable Accelerators: Opportunities and Challenges
Programmability
Compiler support drives down costs
Software-programmable accelerators (SW ACC) close the cost gap
[Figure: flexibility vs. performance and system design cost vs. performance for HP Core, LP Core, DSP, GPU, FPGA, HW ACC, and SW ACC]
7
Coarse-Grained Reconfigurable Architectures
8
CGRA Designs in Literature
ADRES: 60 GOPS/W
9
CGRA Designs in Literature
TilePro64 192
10
Problems Addressed in This Dissertation
Contribution (what I did in this dissertation):
CGRA compiler problems: problem definition, complexity analysis
CGRA design
CGRA system integration
11
CGRA accelerates loops using modulo scheduling
[Figure: a target application specified in C and its execution trace. Labels: serial region, prolog, repetitive region (loop), serial region, epilog]
12
II (initiation interval) is the performance metric
Modulo Scheduling
[Figure: modulo schedule over time of operations a through g, with iterations 1 to 4 overlapped]
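Why II is the metric can be seen from a small timing sketch. It assumes the usual modulo-scheduling model, in which iteration i starts i * II cycles after the first one; the function names and the example numbers are illustrative and do not come from the slides.

```python
def iteration_start_times(ii: int, iterations: int) -> list[int]:
    """Cycle at which each loop iteration starts when one is issued every `ii` cycles."""
    return [i * ii for i in range(iterations)]


def total_cycles(schedule_length: int, ii: int, iterations: int) -> int:
    """Cycles to finish all iterations of a modulo-scheduled loop.

    schedule_length: cycles needed by a single iteration (hypothetical value).
    The last iteration starts at (iterations - 1) * ii and then needs
    schedule_length more cycles to drain.
    """
    if iterations <= 0:
        return 0
    return (iterations - 1) * ii + schedule_length


if __name__ == "__main__":
    # A 4-cycle iteration issued every 2 cycles vs. issued back to back.
    print(iteration_start_times(2, 4))
    print(total_cycles(4, 2, 100), total_cycles(4, 4, 100))
```

For long-running loops the `(iterations - 1) * ii` term dominates, so halving II roughly halves the execution time regardless of how long a single iteration takes.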
13
CGRA Modulo Scheduling: Problem Definition
Define what a valid mapping is:
Operations are mapped to a subset of the resources
Every data dependency is mapped to a path, under certain conditions
II is minimized
15
Compiler and Architecture Design for CGRAs
Problem Definition
Important characteristics: routing, re-computing, or both
Epimorphism between the computation graph and the resource graph
Identified the list of necessary conditions that a scheduled computation graph must satisfy
Mapping is NP-complete (reduction from the 3-partition problem)
16
Problems Addressed in This Dissertation
Contribution (what I did in this dissertation):
CGRA compiler problems: problem definition, complexity analysis, mapping algorithm
CGRA design
CGRA system integration
17
CGRA Modulo Scheduling Policies
Existing literature addresses this problem using the following policies
[Figure: taxonomy of modulo scheduling policies. Labels: integrated methods, decomposition methods, brute force, edge-centric, node-centric, nature-inspired, partitioning]
18
Assumptions and Limitations
On a memory miss, execution stops
A load/store queue resolves memory dependencies
Only single-assignment instructions are supported
No system calls
No function calls
Single exit condition
19
Compiler and Architecture Design for CGRAs
EPIMap
Decomposition: scheduling, then placement
Constructive
Evolve the computation graph based on the resource graph
Adjust the resource graph (MII)
Efficient placement
How we address it, and why we do it better
20
Compiler and Architecture Design for CGRAs
EPIMap notable features and policies
21
Compiler and Architecture Design for CGRAs
Re-Scheduling
22
Resource Allocation Problem
23
Resource Allocation: Supporting Multi-cycle Operation
24
Resource Allocation: Supporting Pipelined Resources
25
Compiler and Architecture Design for CGRAs
Register Allocation
27
Rotating and Non-Rotating Register Files
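The slide text does not describe the register-file hardware, so the following is only a minimal sketch of the general rotating-register idea used with modulo scheduling: a base pointer is decremented once per II, so a value written under one logical name becomes visible under the next logical name in the following iteration. The class name, method names, and register count are hypothetical.

```python
class RotatingRegisterFile:
    """Toy model of a rotating register file (illustrative only)."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.base = 0                      # rotation base pointer
        self.regs = [None] * size          # physical registers

    def _physical(self, logical: int) -> int:
        # Logical-to-physical renaming: add the base and wrap around.
        return (logical + self.base) % self.size

    def write(self, logical: int, value) -> None:
        self.regs[self._physical(logical)] = value

    def read(self, logical: int):
        return self.regs[self._physical(logical)]

    def rotate(self) -> None:
        # Called once per II: logical register r now names what was r - 1.
        self.base = (self.base - 1) % self.size


if __name__ == "__main__":
    rf = RotatingRegisterFile(4)
    rf.write(0, "iteration 0 value")
    rf.rotate()
    rf.write(0, "iteration 1 value")   # does not overwrite the older value
    print(rf.read(1), "|", rf.read(0))
```

A non-rotating file corresponds to never calling `rotate`, in which case overlapping iterations that write the same logical register would overwrite each other's values.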
28
Problems Addressed in This Dissertation
Contribution (what I did in this dissertation):
CGRA compiler problems: problem definition, complexity analysis, mapping algorithm, control flow acceleration
CGRA design
CGRA system integration
29
Control Flow Acceleration
30
Compiler and Architecture Design for CGRAs
Partial Predication
[Figure: loop body with a conditional and its mapping under partial predication. Node labels: a, b, c, e, f, h, et, ef. Figure annotation: 3]
31
Compiler and Architecture Design for CGRAs
Full Predication
[Figure: loop body with a conditional and its mapping under full predication. Node labels: a, b, c, e, f, h. Figure annotation: 4]
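The figures for these two schemes did not survive extraction, so here is a small sketch of the general difference between them, written as ordinary Python rather than CGRA code. The example conditional and all function names are hypothetical. With partial predication both paths execute unconditionally and a select operation picks the result; with full predication each operation is guarded by a predicate and only commits when the predicate is true.

```python
def with_branch(a: int, b: int) -> int:
    """Original control flow: only one path executes."""
    if a > b:
        return a - b
    return b - a


def partial_predication(a: int, b: int) -> int:
    """Both paths always execute; a select chooses the live value."""
    p = a > b
    x_true = a - b        # true-path operation, executed unconditionally
    x_false = b - a       # false-path operation, executed unconditionally
    return x_true if p else x_false   # select operation


def guarded(pred: bool, op, old):
    """Model of a predicated operation: commits only when pred is true."""
    return op() if pred else old


def full_predication(a: int, b: int) -> int:
    """Each operation carries a predicate and writes the same destination."""
    p = a > b
    x = 0
    x = guarded(p, lambda: a - b, x)
    x = guarded(not p, lambda: b - a, x)
    return x


if __name__ == "__main__":
    for a, b in [(5, 3), (3, 5), (4, 4)]:
        print(with_branch(a, b), partial_predication(a, b), full_predication(a, b))
```

In this model partial predication costs one extra select per value defined on both paths, while full predication needs no select but makes the two guarded operations write the same destination.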
32
Compiler and Architecture Design for CGRAs
Dual-Issue
[Figure: loop body with a conditional under dual-issue. Node labels: a, b, c, e, f, h, et, ef]
33
Mapping with Dual-Issue
[Figure: mapping of the loop body with dual-issue. Node labels: a, b, c, e, f, h. Figure annotation: 2]
34
Compiler and Architecture Design for CGRAs
Hardware Support
35
Compiler and Architecture Design for CGRAs
CGRA Compiler Flow
36
State-of-the-art before EPIMap/REGIMap
DRESC: a simulated-annealing-based mapping algorithm
Integrated mapping policy
Supports multi-cycle operations
Supports pipelined PEs
Extended with register allocation
Has been shown to generate better mappings than other mapping algorithms
37
Compiler and Architecture Design for CGRAs
EPIMap
DRESC: simulated-annealing based
MII = max(ResMII, RecMII)
4 × 4 CGRA
Mesh interconnect
1-cycle latency
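As a rough illustration of the MII bound, the sketch below uses the standard modulo-scheduling definition, in which MII is the larger of the resource-constrained and recurrence-constrained bounds. It assumes every operation occupies one PE for one cycle; the function names and the example operation counts are hypothetical and are not taken from the experiments.

```python
def res_mii(num_ops: int, num_pes: int) -> int:
    """Resource-constrained bound: ceil(operations / processing elements)."""
    return -(-num_ops // num_pes)


def rec_mii(recurrences: list[tuple[int, int]]) -> int:
    """Recurrence-constrained bound.

    Each recurrence is (total latency around the dependence cycle,
    total iteration distance around the cycle); its bound is
    ceil(latency / distance). A loop with no recurrence has a bound of 1.
    """
    return max((-(-lat // dist) for lat, dist in recurrences), default=1)


def mii(num_ops: int, num_pes: int, recurrences: list[tuple[int, int]]) -> int:
    """Minimum II: both bounds must hold, so take the larger one."""
    return max(res_mii(num_ops, num_pes), rec_mii(recurrences))


if __name__ == "__main__":
    # A 4 x 4 CGRA has 16 PEs.
    print(mii(20, 16, [(3, 1)]))   # recurrence-bound loop
    print(mii(35, 16, []))         # resource-bound loop
```

A mapper would typically start its search at this bound and increase II only when no valid mapping is found.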
38
Mapping and Register Allocation-Single Cycle
41
Mapping and Register Allocation-Pipelined PEs
43
Summary of EPIMap/REGIMap vs. DRESC
Configuration            Performance Ratio   Compilation Time Ratio
Single cycle (NO-RA)     1.31X               138X
Single cycle - 2 Regs    1.73X               240X
Single cycle - 4 Regs    1.6X                209X
Single cycle - 8 Regs    1.5X                163X
Pipelined (NO-RA)        1.45X               192X
Pipelined - 2 Regs       1.83X               317X
Pipelined - 4 Regs       1.81X               289X
Pipelined - 8 Regs       1.68X               227X
44
Mapping Loops With Conditional Instructions
45
CGRA Research Framework
46
Compiler and Architecture Design for CGRAs
47
Compiler and Architecture Design for CGRAs
Summary
Problem definition: supports routing and re-computation
Complexity analysis: reduction from the 3-partition problem
Counterintuitive discovery: re-computation can improve performance
Computation graph and necessary conditions
EPIMap: approximates II progressively; effective iterative scheduling algorithm
48
Compiler and Architecture Design for CGRAs
Summary
Placement problem formulation: support of multi-cycle operations, support of pipelined resources, constructive method
REGIMap: integrated placement and register allocation
Support of conditionals: full predication, partial predication, dual-issue
Integration with the LLVM compiler framework
49
Compiler and Architecture Design for CGRAs
Summary
CGRA design: ISA, rotating and non-rotating register files, dual-issue support, RTL implementation and synthesis
CGRA simulation framework: CGRA model in gem5
50
Compiler and Architecture Design for CGRAs
Future Directions
Support of system calls
Mapping with memory optimization
Software prefetching in mapping
Just-in-time compilation of kernels
Offload decision at run time
Speculative execution support for CGRAs
51
Compiler and Architecture Design for CGRAs
Backup
52
Backup-Scheduling Success
53
Clique-Resource Allocation Attempts
54
Compiler and Architecture Design for CGRAs
Step by Step Example