
1 Application Specific Instruction Generation for Configurable Processor Architectures
VLSI CAD Lab, Computer Science Department, UCLA
Led by Jason Cong
Yiping Fan, Guoling Han, Zhiru Zhang
Supported by NSF

2 Outline
- Motivation
- Related Work
- Problem Statement
- Proposed Solutions
- Experimental Results
- Conclusions

3 Motivation

4 Motivation (cont'd)
- Flexibility is required to satisfy different requirements and to avoid potential design errors
- Application Specific Instruction-set Processors (ASIPs) provide a solution to the tradeoff between efficiency and flexibility
  - A general purpose processor + specific hardware resources
  - Base instruction set + customized instructions
  - The specific hardware resources implement the customized instructions
  - Either runtime reconfigurable or pre-synthesized
- Gaining popularity recently: IFX Carmel 20xx, ARM, Tensilica Xtensa, STM Lx, ARC Cores

5 Application Specific Instruction-set Processor
Program with the basic instruction set I:
    t1 = a * b;
    t2 = b * 0xf0;
    t3 = c * 0x12;
    t4 = t1 + t2;
    t5 = t2 + t3;
    t6 = t5 + t4;
(The slide shows the corresponding data-flow graph next to a custom-logic block.)
With *: 2 clock cycles and +: 1 clock cycle, execution time = 3 × 2 + 3 × 1 = 9 clock cycles

6 Application Specific Instruction-set Processor (cont'd)
Extended instruction set: I ∪ {extop1, extop2}
Program with extended instructions:
    t1 = extop1(a, b, 0xf0);
    t2 = extop2(b, c, 0xf0, 0x12);
    t3 = t1 + t2;
(The slide shows the same data-flow graph covered by extop1 and extop2.)
With extops: 2 clock cycles and +: 1 clock cycle, execution time = 2 + 2 + 1 = 5 clock cycles
Speedup: 9 / 5 = 1.8

7 Related Work
- [Kastner et al, TODAES'02]: template generation + covering
  Limitations: the minimum number of templates may not lead to maximum speedup; architecture constraints are ignored
- [Atasu et al, DAC'03]: branch and bound
  Limitations: high complexity; instruction reuse is not considered
- [Peymandoust et al, ICASAP'03]: instruction selection + instruction mapping
  Limitation: minimizes the number of extended instructions rather than the execution time

8 Preliminaries
- Control data flow graph (CDFG)
  - Basic blocks (BBKs): each BBK is a DAG, denoted by G(V, E)
  - Control edges
- Cone
  - A subgraph consisting of a node v and some of its predecessors, such that any path connecting a node in the cone and v lies entirely in the cone
  - K-feasible cone: a cone with at most K inputs
- Pattern: a single-output DAG
  - Trivial pattern (a single operation) or nontrivial pattern
  - Associated with an execution time, a number of I/Os, and an area
(The slide annotates the example data-flow graph with a trivial pattern, showing its execution time and I/O: 2-in, 1-out, and a nontrivial pattern with inputs {a, b, 0xf0}, showing its SW and HW execution times, I/O, and area of 2.)
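To make the cone definition concrete, here is a minimal Python sketch (illustrative only, not part of the original slides) that encodes the internal nodes of the example data-flow graph from slide 5 and checks whether a set of nodes forms a cone rooted at a given node; the helper names are assumptions made for this example.

    # Example DFG from slide 5: preds[v] lists the fan-ins of v (leaf inputs omitted).
    preds = {"n1": [], "n2": [], "n3": [],
             "n4": ["n1", "n2"], "n5": ["n2", "n3"], "n6": ["n4", "n5"]}
    succs = {v: [] for v in preds}
    for v, ps in preds.items():
        for p in ps:
            succs[p].append(v)

    def reaches(u, root):
        stack = [u]
        while stack:
            x = stack.pop()
            if x == root:
                return True
            stack.extend(succs[x])
        return False

    def nodes_on_paths_to(u, root):
        """All nodes lying on some path from u to root (inclusive)."""
        out, stack = set(), [u]
        while stack:
            x = stack.pop()
            if x in out or not reaches(x, root):
                continue
            out.add(x)
            if x != root:
                stack.extend(succs[x])
        return out

    def is_cone(nodes, root):
        """Slide 8 definition: the root plus predecessors such that any path
        connecting a member node and the root lies entirely in the cone."""
        return root in nodes and all(
            reaches(u, root) and nodes_on_paths_to(u, root) <= nodes for u in nodes)

    print(is_cone({"n1", "n2", "n4"}, "n4"))  # True
    print(is_cone({"n1", "n6"}, "n6"))        # False: the path n1 -> n4 -> n6 escapes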

9 Problem Statement
Given:
- G(V, E)
- The basic instruction set I
- Pattern constraints:
  I. Number of inputs: |PI(p_i)| ≤ N_in for every pattern p_i
  II. Number of outputs: |PO(p_i)| = 1 for every pattern p_i
  III. Total area ≤ the area constraint A
Objective:
- Generate a pattern library P
- Map G to the extended instruction set I ∪ P so that the total execution time is minimized

10 Problem Decomposition
Sub-problem 1. Pattern Enumeration: generate all of the patterns S satisfying constraints (I) and (II) from G(V, E).
Sub-problem 2. Instruction Set Selection: select a subset P of S to maximize the potential speedup while satisfying the area constraint.
Sub-problem 3. Application Mapping: map G(V, E) to I ∪ P so that the total execution time of G is minimized.

11 Proposed ASIP Compilation Flow
C source → SUIF / CDFG generator → CDFG
CDFG + ASIP constraints → Pattern Generation / Pattern Selection → Pattern library
CDFG + Pattern library → Application Mapping → Mapped CDFG
Mapped CDFG → Instruction Implementation / ASIP Synthesis → Implementation

12 1. Pattern Enumeration
- All possible application-specific instruction patterns should be enumerated
- Each pattern is a k-feasible cone
- Cut enumeration is used to enumerate all the k-feasible cones [Cong et al, FPGA'99]: in topological order, merge the cuts of the fan-ins and discard those cuts that are not k-feasible (see the sketch below)
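A minimal sketch of this cut-enumeration step in Python (an illustration under stated assumptions, not the authors' implementation); the node names follow the running example, everything else is invented for the sketch.

    from itertools import product

    def enumerate_cuts(preds, topo_order, k):
        """Merge the cuts of each node's fan-ins in topological order and discard
        merged cuts with more than k inputs; cuts[v] then holds the input sets of
        all k-feasible cones rooted at v (including the trivial cut {v})."""
        cuts = {}
        for v in topo_order:
            fanins = preds[v]
            node_cuts = {frozenset([v])}                  # trivial cut of v
            if fanins:
                for combo in product(*(cuts[u] for u in fanins)):
                    merged = frozenset().union(*combo)
                    if len(merged) <= k:                  # keep only k-feasible cuts
                        node_cuts.add(merged)
            cuts[v] = node_cuts
        return cuts

    # Example DFG from slide 5, with primary inputs and constants as leaves.
    preds = {"a": [], "b": [], "0xf0": [], "c": [], "0x12": [],
             "n1": ["a", "b"], "n2": ["b", "0xf0"], "n3": ["c", "0x12"],
             "n4": ["n1", "n2"], "n5": ["n2", "n3"], "n6": ["n4", "n5"]}
    topo = ["a", "b", "0xf0", "c", "0x12", "n1", "n2", "n3", "n4", "n5", "n6"]
    cuts = enumerate_cuts(preds, topo, k=3)
    # cuts["n4"] contains {n1, n2}, {n1, b, 0xf0}, {n2, a, b} and {a, b, 0xf0}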

13 1. Pattern Enumeration (cont'd)
3-feasible cones of the example data-flow graph:
- n1: {a, b}
- n2: {b, 0xf0}
- n3: {c, 0x12}
- n4: {n1, n2}, {n1, b, 0xf0}, {n2, a, b}, {a, b, 0xf0}
- n5: {n2, n3}, {n2, c, 0x12}, {n3, b, 0xf0}, {b, 0xf0, c, 0x12}
- n6: {n4, n5}, {n4, n2, n3}, {n5, n1, n2}

14 2. Pattern Selection (1)
- Resource cost and execution time can be obtained using a high-level estimation tool
- The extended instructions should satisfy the area constraint
- Using all of the enumerated patterns would allow optimal code to be generated, but mapping becomes unaffordable
- Therefore, heuristically select a set of patterns

15 2. Pattern Selection (2)
Basic idea: simultaneously consider speedup, occurrence frequency, and area.
- Speedup:
  Tsw(p) = |V(p)|
  Thw(p) = length of the critical path of the scheduled p
  Speedup(p) = Tsw(p) / Thw(p)
- Occurrence:
  Some pattern instances may be isomorphic, detected with a graph isomorphism test [Nauty package]; for small subgraphs the test is very fast
- Gain(p) = Speedup(p) × Occurrence(p)
Example (the pattern * feeding + in the example graph): Tsw = 3, Thw = 2, Speedup = 1.5
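A small illustration of the gain computation follows. It is a sketch under the assumption that Tsw sums the per-operation software latencies (which reduces to the slide's |V(p)| when every base instruction takes one cycle, and gives Tsw = 3 for the * feeding + example with the latencies from slide 5); Thw and the occurrence count would come from estimation and from the isomorphism grouping.

    # Software latencies of base operations, as on slide 5: '*' = 2 cycles, '+' = 1.
    SW_LAT = {"*": 2, "+": 1}

    def tsw(pattern_ops):
        """Software time of a pattern: sum of its operations' base-instruction
        latencies (equal to |V(p)| when every base op takes one cycle)."""
        return sum(SW_LAT[op] for op in pattern_ops)

    def gain(pattern_ops, thw, occurrence):
        """Gain(p) = Speedup(p) * Occurrence(p), with Speedup(p) = Tsw(p) / Thw(p).
        thw comes from scheduling/estimating the pattern in hardware, and
        occurrence from grouping isomorphic instances (Nauty in the real flow)."""
        return (tsw(pattern_ops) / thw) * occurrence

    # The '* feeding +' pattern from this slide: Tsw = 2 + 1 = 3, Thw = 2.
    print(tsw(["*", "+"]) / 2)                     # 1.5
    print(gain(["*", "+"], thw=2, occurrence=2))   # 3.0 for a pattern seen twice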

16 2. Pattern Selection (3)
Selection under the area constraint can be formulated as a 0-1 knapsack problem:
Given n items (patterns) and a weight limit W (the area constraint A), where the i-th item (pattern) has value (gain) v_i and weight (area) w_i, select a subset of the items that maximizes the total value while the total weight does not exceed W.
This is optimally solvable by a dynamic programming algorithm (sketched below).
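A compact sketch of the selection step as a 0-1 knapsack solved by dynamic programming; the gain and area numbers below are made up for illustration, and the real flow would feed in the estimated values.

    def select_patterns(gains, areas, area_budget):
        """0-1 knapsack DP: value = Gain(p), weight = area(p), capacity = A.
        Returns (best total gain, indices of the selected patterns)."""
        best = [(0.0, frozenset())] * (area_budget + 1)   # best[a]: best gain within area a
        for i in range(len(gains)):
            new_best = list(best)
            for a in range(areas[i], area_budget + 1):
                g, chosen = best[a - areas[i]]
                if g + gains[i] > new_best[a][0]:
                    new_best[a] = (g + gains[i], chosen | {i})
            best = new_best
        return best[area_budget]

    # Hypothetical candidates: gains 3.0, 2.5, 1.5 and areas 2, 2, 1, budget 3.
    print(select_patterns([3.0, 2.5, 1.5], [2, 2, 1], 3))  # (4.5, frozenset({0, 2}))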

17 3. Application Mapping (1)
- Application mapping covers each node in G(V, E) with instructions from the extended instruction set so as to minimize the execution time.
- The execution time of a mapped DAG is defined as the sum of the execution times of the patterns covering the DAG.

18 3. Application Mapping (2)
Theorem: the application mapping problem is equivalent to the minimum-area technology mapping problem.
- Execution time ↔ area
- Total area = sum of the areas of the components; total execution time = sum of the execution times of the patterns
- Minimum-area mapping is NP-hard → application mapping is NP-hard
- Many existing minimum-area technology mapping algorithms can be leveraged

19 Minimum-area technology mapping
- [Keutzer, DAC'87]: tree decomposition + dynamic programming
- [Rudell], [Liao, ICCAD'95]: min-cost binate covering
Min-cost binate covering:
- Given: a Boolean function f with variable set X, and a cost function that maps X to nonnegative integers
- Objective: find an assignment of the variables such that f evaluates to 1 and the sum of the costs of the variables assigned 1 is minimized

20 Binate Covering (1)
Candidate patterns for the example data-flow graph:

  Pattern | Function | Cost | Covers
  p0      | +        | 1    | n6
  p1      | +        | 1    | n5
  p2      | +        | 1    | n4
  p3      | *        | 2    | n3
  p4      | *        | 2    | n2
  p5      | *        | 2    | n1
  p6      | *+       | 2    | n1, n4
  p7      | *+       | 2    | n2, n4
  p8      | *+       | 2    | n2, n5
  p9      | *+       | 2    | n3, n5
  p10     | (*)+(*)  | 2    | n1, n2, n4
  p11     | (*)+(*)  | 2    | n2, n3, n5

21 Binate Covering (2)
(Pattern table and data-flow graph as on the previous slide.)
The fan-ins of the sink node need to be covered by some pattern; p0 is the only pattern rooted at n6.
Covering clause: p0

22 Binate Covering (3)
(Pattern table and data-flow graph as on slide 20.)
The nodes that generate inputs to a chosen pattern p_i must be covered by some other pattern; the patterns rooted at n4 are p2, p6, p7, and p10.
Covering clause: p2 + p6 + p7 + p10

23 Binate Covering (4)
(Pattern table and data-flow graph as on slide 20.)
If p2 is chosen, the nodes generating its inputs (n1 and n2) must also be covered:
p2 → p4 and p2 → p5, i.e. the clauses (¬p2 + p4) and (¬p2 + p5)

24 Binate Covering (5)
(Pattern table and data-flow graph as on slide 20.)
Likewise for the other patterns rooted at n4: if p6 is chosen, n2 must still be covered, and if p7 is chosen, n1 must still be covered:
(¬p6 + p4) and (¬p7 + p5)

25 Binate Covering (6)
Putting the clauses together:
f = p0 (p2 + p6 + p7 + p10) (¬p2 + p4) (¬p2 + p5) (¬p6 + p4) (¬p7 + p5) (p1 + p8 + p9 + p11) (¬p1 + p3) (¬p1 + p4) (¬p8 + p3) (¬p9 + p4)
Min-cost cover: p0, p10, p11, with cost 1 + 2 + 2 = 5
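For an instance this small, the minimum-cost cover can be checked by brute force. The sketch below is illustrative only (not the binate-covering solver used in the flow); it encodes the clauses above with the pattern costs from slide 20.

    from itertools import combinations

    COST = {"p0": 1, "p1": 1, "p2": 1, "p3": 2, "p4": 2, "p5": 2,
            "p6": 2, "p7": 2, "p8": 2, "p9": 2, "p10": 2, "p11": 2}

    # Each clause is a list of literals; (v, True) is v, (v, False) is "not v".
    CLAUSES = [
        [("p0", True)],
        [("p2", True), ("p6", True), ("p7", True), ("p10", True)],
        [("p2", False), ("p4", True)], [("p2", False), ("p5", True)],
        [("p6", False), ("p4", True)], [("p7", False), ("p5", True)],
        [("p1", True), ("p8", True), ("p9", True), ("p11", True)],
        [("p1", False), ("p3", True)], [("p1", False), ("p4", True)],
        [("p8", False), ("p3", True)], [("p9", False), ("p4", True)],
    ]

    def satisfied(chosen):
        return all(any((v in chosen) == positive for v, positive in clause)
                   for clause in CLAUSES)

    def min_cost_cover():
        """Exhaustively try all 2^12 pattern subsets and keep the cheapest
        satisfying one; real mappers use a binate-covering solver instead."""
        best = None
        names = list(COST)
        for r in range(len(names) + 1):
            for subset in combinations(names, r):
                chosen = set(subset)
                if satisfied(chosen) and (best is None or
                                          sum(COST[v] for v in chosen) < best[0]):
                    best = (sum(COST[v] for v in chosen), chosen)
        return best

    print(min_cost_cover())   # cost 5 with {p0, p10, p11}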

26 Experimental Results (1)
- A commercial reconfigurable system, Altera's Nios, is used to implement the ASIPs
  - 5 extended instruction formats
  - Up to 2048 instructions for each format
- Several DSP applications are used as benchmarks
- Altera's Quartus II 3.0 is used to aid the synthesis and physical design of the extended instructions

27 Experimental Results (2) Pattern size vs. number of pattern instances (2-input patterns)

28 Experimental Results (3)
Speedup under different input size constraints, with Speedup = T_basic / T_extended
(Chart annotations: ideal speedup, pipeline hazard, memory impact)

29 Experimental Results (4) Speedup and resource overhead on Nios implementations

30 Conclusions
- Proposed a set of algorithms for ASIP compilation
  - The actual performance metric is used as the optimization objective
  - The instruction mapping problem is reduced to an area-minimization logic covering problem
  - Operation duplication is considered implicitly
- Experiments show encouraging speedups

31 Thank You

