Published by Meagan Higgins. Modified over 6 years ago.
1
CGRA Express: Accelerating Execution using Dynamic Operation Fusion
Yongjun Park, Hyunchul Park, Scott Mahlke
CCCP Research Group, University of Michigan
2
Coarse-Grained Reconfigurable Architecture (CGRA)
Array of PEs connected in a mesh-like interconnect
High throughput with a large number of resources
Distributed hardware offers low cost/power consumption
High flexibility with dynamic reconfiguration
3
CGRA: Attractive Alternative to ASICs
Suitable for running multimedia applications for future embedded systems
High throughput, low power consumption, high flexibility
[Figure labels: Morphosys, SiliconHive, ADRES; viterbi at 80 Mbps, h.264 at 30 fps, 50-60 MOps/mW]
Morphosys: 8x8 array with RISC processor
SiliconHive: hierarchical systolic array
ADRES: 4x4 array with tightly coupled VLIW
4
Performance Bottleneck: Acyclic Code
[Figure: application execution time across Block 0 to Block 5, shown for a normal schedule and with software pipelining. Original: loop region dominant. With software pipelining: acyclic region dominant.]
Acyclic region is substantial! It's time to optimize acyclic code.
5
Key Idea: Chaining Instructions
1. Clock period: longest operation with register file access.
2. CGRA is not VLIW. Register file access is not frequent!
3. Opportunity for instruction chaining.
4. Considerable register access time ≈ arithmetic operation delay (3.5 ns clock, IBM 90 nm)
[Figure labels: Non-critical path: Fast! Critical path: Slow!]

Group           Opcode          Delay (ns)
Multi-cycle op  MUL, LD, ST     1.65
Arith           ADD, SUB        1.74
Shift           LSL, LSR, ASR   1.36
Comp            EQ, NE, LT      0.93
Logic           AND, OR, XOR    0.73
RF access       -               1.61
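The chaining opportunity can be checked with simple arithmetic on the delay table. The sketch below is a simplified model, not the authors' timing analysis: it assumes a chain's latency is the sum of the per-group delays, plus one register file access when the chain touches the RF, and it ignores interconnect and multiplexer delay. The names `chain_delay_ns`, `fits_in_cycle` and `uses_rf` are hypothetical.

```python
# Delays in ns, copied from the slide's table (single-cycle groups only).
DELAY_NS = {
    "ADD": 1.74, "SUB": 1.74,
    "LSL": 1.36, "LSR": 1.36, "ASR": 1.36,
    "EQ": 0.93, "NE": 0.93, "LT": 0.93,
    "AND": 0.73, "OR": 0.73, "XOR": 0.73,
}
RF_ACCESS_NS = 1.61   # register file access delay from the table
CLOCK_NS = 3.5        # clock period quoted on the slide


def chain_delay_ns(ops, uses_rf):
    """Latency of dependent ops executed back to back (assumed additive model)."""
    total = sum(DELAY_NS[op] for op in ops)
    if uses_rf:
        total += RF_ACCESS_NS  # assumption: one RF access per chain
    return total


def fits_in_cycle(ops, uses_rf):
    """True if the whole chain completes within one clock period."""
    return chain_delay_ns(ops, uses_rf) <= CLOCK_NS


if __name__ == "__main__":
    for ops, uses_rf in [(["ADD"], True), (["ADD", "ADD"], True),
                         (["ADD", "ADD"], False), (["AND", "EQ"], True)]:
        print(ops, "RF" if uses_rf else "no RF",
              round(chain_delay_ns(ops, uses_rf), 2), fits_in_cycle(ops, uses_rf))
```

Under these assumptions a single ADD with an RF access nearly fills the 3.5 ns period, while two ADDs that avoid the RF still fit, which is the slack that chaining exploits.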
6
Dynamic Operation Fusion
Execute multiple dependent operations in one cycle
Key benefits
1. Minimal hardware overhead
2. Multiple subgraphs can be executed simultaneously.
3. Dynamic merging of FUs
[Figure: a data-flow graph with inputs A and B, constants 512 and 10, operations ADD, ADD and LSR, and output Out, mapped onto a 4x4 CGRA alongside MUL and LD operations. Current: 3 cycles. Operation fusion: 1 cycle.]
Assumption: instruction time = RF read time = RF write time
7
Hardware Support
Simple bypass network
Small overhead: 3.8% (SRAM), 2.3% (MUX)

              baseline  modified  overhead (%)
control bit   845       877       3.8
area (mm^2)   1.447     1.48      2.3
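The overhead column follows directly from the baseline and modified values. A minimal check, where `overhead_percent` is a hypothetical helper name:

```python
def overhead_percent(baseline, modified):
    """Relative increase of the modified design over the baseline, in percent."""
    return (modified - baseline) / baseline * 100.0


control_bits = overhead_percent(845, 877)   # control bits, baseline vs. modified
area = overhead_percent(1.447, 1.48)        # area in mm^2, baseline vs. modified

print(f"control bits: {control_bits:.1f}%, area: {area:.1f}%")
# prints "control bits: 3.8%, area: 2.3%"
```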
8
Compiler Support
Tick-based scheduling
  Tick: small time unit based on hardware delay information
  Clock cycle = # of ticks
Clock boundary constraint checking
  Resource conflict
  Time conflict
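The clock boundary check can be sketched as follows. This is an illustration under stated assumptions, not the authors' scheduler: the tick size of 50 ps is hypothetical (the slides give no value), delays are rounded up to whole ticks, and only the time conflict is modeled, not the resource conflict. The function names are also hypothetical.

```python
TICK_PS = 50                            # hypothetical tick size, not from the slides
CLOCK_PS = 3500                         # 3.5 ns clock period quoted in the slides
TICKS_PER_CYCLE = CLOCK_PS // TICK_PS   # clock cycle expressed as a number of ticks


def to_ticks(delay_ps):
    """Round a hardware delay up to a whole number of ticks."""
    return -(-delay_ps // TICK_PS)


def crosses_boundary(start_tick, length_ticks):
    """True if an operation starting at start_tick would straddle a clock edge."""
    last_tick = start_tick + length_ticks - 1   # last tick the operation occupies
    return start_tick // TICKS_PER_CYCLE != last_tick // TICKS_PER_CYCLE


def place(ready_tick, length_ticks):
    """Earliest start at or after ready_tick that stays inside one clock cycle."""
    if crosses_boundary(ready_tick, length_ticks):
        # Time conflict: defer the operation to the start of the next cycle.
        return (ready_tick // TICKS_PER_CYCLE + 1) * TICKS_PER_CYCLE
    return ready_tick


if __name__ == "__main__":
    add = to_ticks(1740)     # ADD delay of 1.74 ns
    print(place(add, add))   # second dependent ADD still fits in cycle 0
    print(place(56, add))    # an ADD that becomes ready late is pushed to cycle 1
```

Working in integer picoseconds avoids floating-point rounding when converting delays to ticks.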
9
Dynamic Operation Fusion Example (1)
1. Conventional Scheduling: 5 cycles
[Figure: data-flow graph, CGRA mapping and schedule table. Graph nodes: SUB(0), ADD(1), ADD(2), LSR(3), LSL(4), ADD(5), with inputs RF[0], RF[1] and constants, and output RF[2]. The schedule table has columns FU0 to FU5; the FU assigned to each operation is not recoverable.]

Time  Operations
0     OP 0, OP 1
1     OP 2
2     OP 3
3     OP 4
4     OP 5
10
Dynamic Operation Fusion Example (2)
2. Dynamic Operation Fusion: 3 cycles
[Figure: data-flow graph, CGRA mapping and schedule table. Graph nodes: SUB(0), ADD(1), ADD(2), LSR(3), LSL(4), ADD(5), with inputs RF[0], RF[1] and constants, and output RF[2]. The schedule table has columns FU0 to FU5 and RF; the FU assigned to each operation is not recoverable.]

Time  Operations
0     OP 0, OP 1, OP 2
1     OP 3, OP 4
2     OP 5
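The two schedules can be reproduced with a small tick-based model. The sketch below rests on several assumptions that are not in the slides: the dependence edges are one plausible reading of the figure (OP 2 consumes OP 0 and OP 1, then OP 3, OP 4 and OP 5 form a chain), the tick size is a hypothetical 50 ps, FUs are unlimited, and RF access and routing delays are ignored. Opcode delays come from the delay table on the "Key Idea" slide.

```python
TICK_PS, CLOCK_PS = 50, 3500            # tick size is hypothetical; 3.5 ns clock from the slides
TICKS_PER_CYCLE = CLOCK_PS // TICK_PS
DELAY_PS = {"ADD": 1740, "SUB": 1740, "LSL": 1360, "LSR": 1360}

# op id -> (opcode, predecessor ids); edges are an assumed reading of the figure.
DFG = {
    0: ("SUB", []),
    1: ("ADD", []),
    2: ("ADD", [0, 1]),
    3: ("LSR", [2]),
    4: ("LSL", [3]),
    5: ("ADD", [4]),
}


def schedule(dfg, fuse):
    """Return {op: cycle}. Ops are visited in id order, which is topological here."""
    finish = {}     # op -> tick at which its result becomes available
    cycle_of = {}
    for op, (opcode, preds) in dfg.items():
        length = -(-DELAY_PS[opcode] // TICK_PS)          # delay rounded up to ticks
        ready = max((finish[p] for p in preds), default=0)
        if not fuse:
            # Conventional: a consumer waits for the clock edge after its producers.
            ready = -(-ready // TICKS_PER_CYCLE) * TICKS_PER_CYCLE
        start = ready
        if start // TICKS_PER_CYCLE != (start + length - 1) // TICKS_PER_CYCLE:
            # Clock boundary constraint: the op may not straddle a clock edge.
            start = (start // TICKS_PER_CYCLE + 1) * TICKS_PER_CYCLE
        finish[op] = start + length
        cycle_of[op] = start // TICKS_PER_CYCLE
    return cycle_of


if __name__ == "__main__":
    for fuse in (False, True):
        cycles = schedule(DFG, fuse)
        print("fused" if fuse else "conventional", cycles, max(cycles.values()) + 1)
```

With these assumptions the conventional run takes 5 cycles and the fused run takes 3, with the same grouping of operations per cycle as the two schedule tables.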
11
Experimental Setup
Benchmarks: multimedia applications for embedded systems
  Audio decoding (AAC)
  Video decoding (H.264)
  3D graphics (3D)
Two designs
  baseline: 4x4 heterogeneous CGRA
  express: 4x4 heterogeneous CGRA with bypass network
12
Performance Enhancement
Express achieves a 7-17% reduction in execution time.
Most of the reduction comes from the acyclic code region.
Express also improves the performance of resource-constrained loops.
  The bypass network gives more freedom to the compiler.
13
Detailed Result for 3D Graphics
Target application: 3D graphics
Power consumption: 3% higher than the baseline
Performance enhancement: 17% faster than the baseline
Energy consumption: 15% more efficient

                       baseline  express  ratio
power (mW)             298.26    306.78   102.86%
# of cycles (million)  156.81    130.22   83.04%
energy (mJ)            233.85    199.74   85.42%
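The three rows are linked by energy = power × execution time. Assuming both designs run at the same clock frequency, the energy ratio should equal the power ratio times the cycle ratio. The sketch below checks that, and also derives the clock period implied by the table; that period is a derived value, not one stated on this slide.

```python
# Values copied from the slide's table.
baseline = {"power_mw": 298.26, "mcycles": 156.81, "energy_mj": 233.85}
express = {"power_mw": 306.78, "mcycles": 130.22, "energy_mj": 199.74}

power_ratio = express["power_mw"] / baseline["power_mw"]
cycle_ratio = express["mcycles"] / baseline["mcycles"]
energy_ratio = express["energy_mj"] / baseline["energy_mj"]

# With equal clock frequency, energy scales as power times cycle count.
predicted_energy_ratio = power_ratio * cycle_ratio


def implied_period_ns(design):
    """Clock period implied by energy / (power * cycles); a derived value."""
    seconds = (design["energy_mj"] * 1e-3) / (design["power_mw"] * 1e-3)
    return seconds / (design["mcycles"] * 1e6) * 1e9


print(f"power {power_ratio:.2%}, cycles {cycle_ratio:.2%}, energy {energy_ratio:.2%}")
print(f"predicted energy ratio {predicted_energy_ratio:.2%}")
print(f"implied period: {implied_period_ns(baseline):.2f} ns")
```

The predicted and reported energy ratios agree to within rounding, so the table is internally consistent under the equal-frequency assumption.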
14
Conclusion
Acyclic region becomes the performance bottleneck.
  The run-time for loops decreases by large factors.
Dynamic operation fusion enables back-to-back operations to execute in a single cycle.
  Bypass network
  Tick-based scheduler
Up to 17% faster and 15% more energy efficient with 3% hardware overhead
15
Questions?