1
Architecture and Compilation for Reconfigurable Processors
Jason Cong, Yiping Fan, Guoling Han, Zhiru Zhang
Computer Science Department, UCLA
Nov 22, 2004
2
Outline
- Motivation
- Application-specific instruction set compilation
- Register file data bandwidth problem
- Architecture extension – shadow registers
- Shadow register binding
- Conclusions
3
Reconfigurable Processor Platform
- Reconfigurable processor (RP) core + programmable fabric
  - RP core supports: basic instruction set + customized instructions
- Programmable fabric implements the customized instructions
- Either runtime reconfigurable or pre-synthesized
- Example: Nios / Nios II from Altera
  - Stratix version supported by Nios 3.0 system
  - 5 extended instruction formats
  - Up to 2048 instructions for each format
[Figure: block diagram showing the reconfigurable processor core, CPU, and bus]
4
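Motivational Example
Original code (basic instruction set I):
    t1 = a * b;
    t2 = b * 2;
    t3 = c * 5;
    t4 = t1 + t2;
    t5 = t2 + t3;
    t6 = t5 + t4;
Latencies: * takes 2 clock cycles, + takes 1 clock cycle
Execution time: 9 clock cycles
[Figure: data flow graph of the code with inputs a, b, c and constants 2, 5, three multiplications and three additions; extop1 and extop2 mark the collapsed subgraphs]
Extended instruction set: I ∪ {extop1, extop2}
    t1 = extop1(a, b, 2);
    t2 = extop2(b, c, 2, 5);
    t3 = t1 + t2;
Execution time: 5 clock cycles
Speedup: 1.8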
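The cycle counts on this slide can be checked with a few lines of Python. This is only a sketch: the latencies for * and + come from the slide, while the 2-cycle latency given to extop1 and extop2 is an assumption chosen because it reproduces the stated total of 5 cycles. The names LATENCY and total_cycles are illustrative.

```python
# Per-instruction latencies in clock cycles.
# "*" and "+" are from the slide; the extop latencies are ASSUMED (2 cycles each).
LATENCY = {"*": 2, "+": 1, "extop1": 2, "extop2": 2}

# Instruction sequences of the two versions of the example.
basic_program = ["*", "*", "*", "+", "+", "+"]
extended_program = ["extop1", "extop2", "+"]


def total_cycles(program):
    """Execution time on a single-issue in-order core: latencies simply add up."""
    return sum(LATENCY[op] for op in program)


baseline = total_cycles(basic_program)
extended = total_cycles(extended_program)
print(baseline, extended, baseline / extended)
```

Under these assumptions the script reports 9 cycles, 5 cycles, and a speedup of 1.8.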
5
Problem Statement
Given:
- Application program in CDFG G(V, E)
- A processor with basic instruction set I
- Pattern constraints:
  I. Number of inputs less than N_in
  II. 1 output
  III. Total area no more than A
Objective:
- Generate a pattern library P
- Map G to the extended instruction set I ∪ P, so that the total execution time is minimized
6
Proposed ASIP Compilation Flow
- Extended instruction candidates generation
  - Satisfying I/O constraints
- Extended instruction selection
  - Select a subset to maximize the potential speedup while satisfying the resource constraint
- Code generation
  - Graph covering
  - Minimize the total execution time
[Figure: ASIP compilation flow. Labels: Instruction Implementation / Pattern Generation / ASIP constraints; ASIP Synthesis; Pattern Selection; Application Mapping; Pattern Library; C; Implementation; Mapped CDFG; Compilation; CDFG; Simulation]
7
Step 1. Pattern Enumeration
- Each pattern is an N_in-feasible cone
- Cut enumeration is used to enumerate all the N_in-feasible cones [Cong et al., FPGA'99]
- Basic idea: in topological order, merge the cuts of the fan-ins and discard those cuts that are not N_in-feasible
3-feasible cones:
    n1: {a, b}
    n2: {b, 2}
    n3: {c, 5}
    n4: {n1, n2}, {n1, b, 2}, {n2, a, b}, {a, b, 2}
[Figure: example data flow graph with inputs a, b, c, constants 2 and 5, multiplication nodes n1, n2, n3 and addition nodes n4, n5, n6]
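The merge-and-discard idea can be sketched in Python on the example graph. This is a minimal illustration, not the authors' implementation: the function name enumerate_cuts and the dictionary encoding of the graph are assumptions, and the constants 2 and 5 are treated as ordinary inputs, as in the cone list on the slide.

```python
from itertools import product

# Example DFG: each internal node maps to its fan-ins (constants are plain inputs).
FANINS = {
    "n1": ("a", "b"),
    "n2": ("b", "2"),
    "n3": ("c", "5"),
    "n4": ("n1", "n2"),
    "n5": ("n2", "n3"),
    "n6": ("n4", "n5"),
}
TOPO_ORDER = ["n1", "n2", "n3", "n4", "n5", "n6"]


def enumerate_cuts(fanins, order, k):
    """Return node -> set of k-feasible cuts (each cut is a frozenset of leaves)."""
    cuts = {}
    for node in order:
        options = []
        for fanin in fanins[node]:
            # A fan-in is either a leaf of the cone (trivial cut) or is
            # expanded through one of its own, already computed, cuts.
            options.append([frozenset([fanin])] + list(cuts.get(fanin, ())))
        # Merge one choice per fan-in, then discard cuts with more than k leaves.
        merged = {frozenset().union(*combo) for combo in product(*options)}
        cuts[node] = {cut for cut in merged if len(cut) <= k}
    return cuts


cuts = enumerate_cuts(FANINS, TOPO_ORDER, k=3)
for node in TOPO_ORDER:
    print(node, sorted(sorted(cut) for cut in cuts[node]))
```

For n4 the sketch yields the same four 3-feasible cones listed on the slide.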
8
Step 2. Pattern Selection
- Basic idea: simultaneously consider speedup, occurrence frequency, and area
- Speedup
  - Tsw(p) = total execution time with basic instructions
  - Thw(p) = length of the critical path of scheduled p
  - Speedup(p) = Tsw(p) / Thw(p)
  - Example: pattern * followed by +: Tsw = 3, Thw = 2, Speedup = 1.5
- Occurrence
  - Some pattern instances may be isomorphic
  - Graph isomorphism test [Nauty package]
  - The subgraphs are small, so the isomorphism test is very fast
- Gain(p) = Speedup(p) × Occurrence(p)
- Selection under the area constraint can be formulated as a 0-1 knapsack problem
[Figure: example data flow graph with inputs a, b, c, constants 2 and 5, and nodes n1 to n6]
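The knapsack formulation can be sketched as a standard dynamic program. Everything concrete here is hypothetical: the pattern names, their Tsw, Thw and occurrence counts, the integer areas, and the budget are made-up illustration values, and the slides do not say which knapsack solver the authors used.

```python
def gain(t_sw, t_hw, occurrence):
    """Gain(p) = Speedup(p) x Occurrence(p), with Speedup(p) = Tsw(p) / Thw(p)."""
    return (t_sw / t_hw) * occurrence


def select_patterns(patterns, area_budget):
    """0-1 knapsack over integer areas.

    patterns: list of (name, gain, area). Returns (best total gain, chosen names).
    """
    # best[cap] = best (gain, chosen) using total area at most cap.
    best = [(0.0, [])] * (area_budget + 1)
    for name, pattern_gain, area in patterns:
        updated = best[:]
        for cap in range(area, area_budget + 1):
            candidate = best[cap - area][0] + pattern_gain
            if candidate > updated[cap][0]:
                updated[cap] = (candidate, best[cap - area][1] + [name])
        best = updated
    return best[area_budget]


# Hypothetical candidates: (name, gain, area in arbitrary integer units).
candidates = [
    ("p1", gain(3, 2, 4), 30),
    ("p2", gain(5, 2, 2), 40),
    ("p3", gain(2, 1, 3), 50),
]
total_gain, chosen = select_patterns(candidates, area_budget=80)
print(total_gain, chosen)
```

With these values p1 and p3 fill the budget exactly and beat the cheaper pair p1 and p2.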
9
Step 3. Application Mapping
- Assume execution on an in-order, single-issue processor
- Cover each node in G(V, E) with the extended instruction set to minimize the execution time
  - Trivial pattern: software execution time
  - Nontrivial pattern: hardware execution time
  - Total execution time = sum of the execution times of the pattern instances after application mapping
- Theorem: The application mapping problem is equivalent to the library-based minimum-area technology mapping problem.
10
Speedup and Resource Overhead on Nios

Benchmark  # Extended     Speedup              Resource Overhead
           Instructions   Estimation   Nios    LE            Memory           DSP Block
fft_br     9              3.28         2.65    408 (6.06%)   65,536 (9.79%)   16
iir        7              3.18         3.73    255 (3.79%)   4,736 (0.71%)    40
fir        2              2.40         2.14    51 (0.76%)    1,024 (0.15%)    8
pr         2              1.57         1.75    71 (1.05%)    0 (0.00%)        14
dir        2              3.28         3.02    54 (0.80%)    0 (0.00%)        16
mcm        4              4.75         3.22    186 (2.76%)   0 (0.00%)        56
Average                   3.08         2.75    - (2.54%)     - (1.77%)        -
11
Simulation Environment
- SimpleScalar v3.0
- Benchmarks
  - From the MediaBench suite
- Machine configuration
  - Single-issue in-order processor (ARM-like)
  - DL1: 8KB, 4-way, 1 cycle
  - IL1: 8KB, direct mapped, 1 cycle
  - Unified L2: 256KB, 4-way, 8 cycles
  - Functional units: 2 IntAdd, 1 IntMult, 1 FPAdd, 1 FPMult
  - Reconfigurable units: critical path latency of the collapsed instructions
12
Pattern Distribution
Most of the patterns have fewer than 7 nodes
13
Ideal Speedup under Different Input Size Constraints
15
Register File Bandwidth Problem
- Most of the speedup comes from clusters with more than two inputs
- Embedded processors have a 2-port register file
- Extra cycles are needed to transfer data for extended instructions with more than 2 inputs
- Speedup drops due to the communication overhead
16
Speedup Drop with Different Input Constraints
- Move operation takes one cycle
- 46% speedup drop on average
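A back-of-the-envelope model shows why the speedup drops. It assumes, as a simplification, that every operand beyond the two register file read ports costs one move and that moves cannot overlap with anything else. The function name effective_speedup and the Tsw and Thw values in the example are hypothetical; only the 2-port register file and the one-cycle move come from the slides.

```python
def effective_speedup(t_sw, t_hw, n_inputs, read_ports=2, move_cycles=1):
    """Speedup of one extended instruction once operand transfers are counted.

    ASSUMPTION: each input beyond the available read ports needs one move,
    and moves are executed back to back before the extended instruction.
    """
    extra_moves = max(0, n_inputs - read_ports)
    return t_sw / (t_hw + extra_moves * move_cycles)


# Hypothetical pattern: 6 cycles in software, 2 cycles in hardware.
for n_inputs in (2, 3, 4):
    print(n_inputs, effective_speedup(t_sw=6, t_hw=2, n_inputs=n_inputs))
```

In this toy case the speedup falls from 3.0 with two inputs to 2.0 with three and 1.5 with four.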
18
Architecture Extensions
- Existing solutions
  - Dedicated data link
    - Avoids potential resource contention through the bus
    - Needs extra cycles for communication
    - Employed in MicroBlaze from Xilinx
  - Multiport register file
    - Low utilization when executing basic instructions
    - Area and power grow cubically
  - Register file replication
    - Predetermined one-to-one correspondence
    - Resource waste in terms of area and power
    - Limits compiler optimization
19
Our Approach – Shadow Registers
- Core registers are augmented by an extra set of shadow registers
  - Conditionally written
  - Used only by the custom logic
20
Shadow Registers
- Controlling the shadow register

    Operation            Forward the result     Skip
    Instruction subword  00     01     10       11
    Shadow-reg ID        0      1      2        -

- Advantages and limitations
  - Cost-efficient for a small number of shadow registers
  - Only a few control signals need to be added
  - Opportunity for compiler optimization
  - Requires extra control bits
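The control table can be read as a 2-bit field decoded alongside each instruction. The sketch below mirrors that table only; where the subword sits in the instruction word is not given in the slides, and the function name decode_shadow_subword is hypothetical.

```python
def decode_shadow_subword(subword):
    """Map the 2-bit instruction subword to a shadow register ID.

    00, 01, 10 -> forward the result to shadow register 0, 1, 2.
    11         -> skip (the result is not copied); returned as None.
    """
    if not 0 <= subword <= 0b11:
        raise ValueError("subword must fit in 2 bits")
    return None if subword == 0b11 else subword


for bits in range(4):
    print(format(bits, "02b"), decode_shadow_subword(bits))
```

Three shadow registers plus a skip code is exactly what two control bits can encode, which is the "extra control bits" cost listed among the limitations.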
22
Internal Representation
    i1 = …;
    i2 = ext1(…, i1, …);
    i3 = …;
    i4 = ext2(…, i1, …);
    i5 = ext3(…, i3, …);
    i6 = ext4(…, i3, …);
- 2-level CDFG representation
  - 1st level: control flow graph
  - 2nd level: data flow graph
- Computation node: latency and scheduled time slot
- Data edge: lifetime
- Variable: lifetime
[Figure: scheduled data flow graph over time slots 1 to 6 with data edges e1, e2, e3, e4]
Lifetime of e1 = [2, 2]
Lifetime of e2 = [2, 4]
Lifetime of i1 = [2, 4]
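The lifetimes on this slide can be reproduced with a small calculation. The sketch assumes that instruction ik is scheduled in time slot k, that every instruction has a latency of one slot, and that e3 and e4 are the edges from i3 to i5 and i6. These assumptions are inferred from the figure labels and the listed lifetimes, not stated in the slides.

```python
# ASSUMED schedule: instruction ik runs in time slot k, unit latency.
SCHEDULE = {"i1": 1, "i2": 2, "i3": 3, "i4": 4, "i5": 5, "i6": 6}
LATENCY = 1

# Data edges as (producer, consumer); e3/e4 endpoints are inferred.
DATA_EDGES = {
    "e1": ("i1", "i2"),
    "e2": ("i1", "i4"),
    "e3": ("i3", "i5"),
    "e4": ("i3", "i6"),
}


def edge_lifetime(edge):
    """Interval from when the value is available until its consumer reads it."""
    producer, consumer = DATA_EDGES[edge]
    return (SCHEDULE[producer] + LATENCY, SCHEDULE[consumer])


def variable_lifetime(variable):
    """A variable lives over the union of the lifetimes of its outgoing edges."""
    spans = [edge_lifetime(name)
             for name, (producer, _) in DATA_EDGES.items()
             if producer == variable]
    return (min(start for start, _ in spans), max(end for _, end in spans))


print(edge_lifetime("e1"), edge_lifetime("e2"), variable_lifetime("i1"))
```

Under these assumptions the output matches the three lifetimes given on the slide.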
23
Observation
- 2-port register file
- 3-input extended instruction
- Without shadow register: 4 additional moves
- Binding for 1 register
  - Binding 1: either i1 or i3 in the shadow register, saves 2 moves
  - Binding 2: saves 3 moves
    i1 = …;
    i2 = ext1(…, i1, …);
    i3 = …;
    i4 = ext2(…, i1, …);
    i5 = ext3(…, i3, …);
    i6 = ext4(…, i3, …);
[Figure: scheduled data flow graph over time slots 1 to 6 with data edges e1, e2, e3, e4]
24
Register Binding
- Which operands should be bound?
  - Each input could be a candidate
  - Binding different candidates leads to different savings
  - Unaffordable to try all the combinations
25
One Shadow Register Binding Problem
- Problem formulation:
  - Given: a scheduled DFG and one shadow register
  - Objective: bind variables to the shadow register so as to minimize the number of moves
26
Algorithm for Binding One Shadow Register
- Weighted compatibility graph
  - Vertex: data edge in the DFG
  - Weight: number of moves saved if the value is kept in the register
  - Edge: lifetimes don't overlap
- Theorem: The binding problem is equivalent to finding a maximum weighted chain in the compatibility graph
  - Can be solved optimally in time O(|V'| + |E'|)
- Extension to K shadow registers
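A maximum weighted chain can be found with a longest-path dynamic program over the vertices sorted by lifetime start. The sketch below is an illustration only. The lifetimes of e3 and e4 and all four weights are assumptions inferred from the earlier example (keeping a value until its second use saves two moves), and this simple version checks compatibility pairwise instead of walking a prebuilt graph, so it does not achieve the O(|V'| + |E'|) bound. The name max_weight_chain is hypothetical.

```python
def max_weight_chain(vertices):
    """vertices: name -> (start, end, weight). Returns (total weight, chain).

    Two vertices are compatible when the first lifetime ends strictly before
    the second one starts, so the shadow register can hold them in sequence.
    """
    order = sorted(vertices, key=lambda name: vertices[name][0])
    best = {}  # name -> (weight of the best chain ending here, predecessor)
    for index, current in enumerate(order):
        start, _, weight = vertices[current]
        best[current] = (weight, None)
        for earlier in order[:index]:
            if vertices[earlier][1] < start:
                candidate = best[earlier][0] + weight
                if candidate > best[current][0]:
                    best[current] = (candidate, earlier)
    tail = max(best, key=lambda name: best[name][0])
    total = best[tail][0]
    chain = []
    while tail is not None:
        chain.append(tail)
        tail = best[tail][1]
    return total, chain[::-1]


# ASSUMED vertex data for the running example: (start, end, moves saved).
example = {
    "e1": (2, 2, 1),
    "e2": (2, 4, 2),
    "e3": (4, 5, 1),
    "e4": (4, 6, 2),
}
print(max_weight_chain(example))
```

With these inferred values the best chain is e1 followed by e4, for 3 saved moves, which agrees with the savings quoted for Binding 2 in the earlier observation.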
27
Experimental Results (1) Speedup under different number of shadow registers for 3-input extended instructions
28
Experimental Results (2) Speedup under different number of shadow registers for 4-input extended instructions
29
Conclusions
- Proposed and developed a complete compilation flow
- Observed and quantitatively analyzed the data bandwidth problem
- Proposed a novel architecture extension and an efficient register binding algorithm
30
Thank You