Area-Efficient Instruction Set Synthesis for Reconfigurable System on Chip Designs. Philip Brisk, Adam Kaplan, Majid Sarrafzadeh.


1 Area-Efficient Instruction Set Synthesis for Reconfigurable System on Chip Designs. Philip Brisk, Adam Kaplan, Majid Sarrafzadeh. Embedded and Reconfigurable Systems Lab, Computer Science Department, University of California, Los Angeles. philip@cs.ucla.edu, majid@cs.ucla.edu, kaplan@cs.ucla.edu. DAC '04, June 9, 2004, San Diego Convention Center, San Diego, CA.

2 Outline: Custom Instruction Generation and Selection; Resource Sharing; Algorithm Description with Examples; Datapath Synthesis Techniques; Experimental Methodology and Results; Summary.

3 Custom Instruction Generation and Selection. Custom Instruction Generation: Compiler Profiles Application Code; Extracts Favorable IR Patterns; Synthesizes Patterns as Hardware Datapaths. Custom Instruction Selection: Area Constraints Limit On-Chip Functionality; NP-Hard 0-1 Knapsack Problem; Formulated as an Integer Linear Program (ILP).

4 ILP Formulation for Instruction Selection Problem. For each custom instruction i: Gain(i): Estimated Performance Gain of i; Area(i): Estimated Area of i; Selected(i): 1 if i is Selected, 0 Otherwise. Goal: Maximize Gain of Selected Instructions. Constraint: Area of Selected Instructions < FPGA Area.
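The selection problem on this slide is a 0-1 knapsack, which can be sketched in a few lines. The version below is a minimal illustration, not the authors' ILP: it assumes integer areas, uses a textbook dynamic program instead of an ILP solver, and treats the area constraint as non-strict (the usual knapsack form). The function name `select_instructions` and any gain values passed to it are hypothetical.

```python
def select_instructions(gains, areas, fpga_area):
    """0-1 knapsack by dynamic programming over integer areas.

    Returns (best_gain, selected), where selected[i] is 1 if custom
    instruction i is chosen and 0 otherwise.
    """
    n = len(gains)
    # best[i][a] = max gain using only the first i instructions within area a
    best = [[0] * (fpga_area + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        gain, area = gains[i - 1], areas[i - 1]
        for a in range(fpga_area + 1):
            best[i][a] = best[i - 1][a]  # skip instruction i-1
            if area <= a and best[i - 1][a - area] + gain > best[i][a]:
                best[i][a] = best[i - 1][a - area] + gain  # take it

    # Walk back through the table to recover Selected(i)
    selected = [0] * n
    a = fpga_area
    for i in range(n, 0, -1):
        if best[i][a] != best[i - 1][a]:
            selected[i - 1] = 1
            a -= areas[i - 1]
    return best[n][fpga_area], selected
```

Note that every `Area(i)` is charged in full here, which is exactly the additive estimate the following slides argue against.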

5 What About Resource Sharing? [Figure: Two DFGs, Area = 17 and Area = 25; area costs legend 8, 5, 1, 3.] ILP Area Estimate = 42; My Datapath Area = 28; ratio 1.5.

6 Analysis. 0-1 Knapsack Problem Formulation Over-Estimated Area: the Estimate (42) is 150% of the Shared Datapath Area (28). ILP Solvers Do Not Consider Resource Sharing. How to Remedy This: Develop a Resource Sharing Algorithm; Avoid Additive Area Estimates Based on per-Instruction Costs.
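The gap can be checked from the numbers on the preceding slide alone (DFG areas 17 and 25, shared datapath area 28):

```latex
\begin{align*}
\text{ILP estimate} &= 17 + 25 = 42 \\
\frac{\text{ILP estimate}}{\text{shared datapath area}} &= \frac{42}{28} = 1.5 \\
\frac{42 - 28}{28} &= 0.5
\end{align*}
```

The additive estimate is therefore 150% of the shared-datapath area, that is, 50% larger than it.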

7 Resource Sharing for DFGs. Given: A Set of DFGs G* = {G_1, …, G_n}. Goal: Construct a Consolidation Graph G_C of Minimal Cost. Constraints: G_C Must be Acyclic; G_C Must be a Supergraph of each G_i in G*. That's Life: The Problem is NP-Hard.

8-9 Resource Sharing Overview: Path Based Resource Sharing (PBRS). Decompose Patterns into Input-Output Paths. [Figure: DFGs G_1, G_2, G_3, G_4]
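Decomposing a pattern into input-output paths can be sketched as a depth-first walk from every node with no predecessors to every node with no successors. This is an illustrative sketch only: the adjacency-dict representation and the name `input_output_paths` are assumptions, not the authors' data structures, and the number of paths can grow exponentially with DFG size.

```python
def input_output_paths(dfg):
    """Enumerate every path from an input node (no predecessors) to an
    output node (no successors) in a DAG given as {node: [successors]}."""
    nodes = set(dfg)
    for succs in dfg.values():
        nodes.update(succs)
    has_pred = {s for succs in dfg.values() for s in succs}
    inputs = sorted(n for n in nodes if n not in has_pred)

    paths = []

    def walk(node, prefix):
        prefix = prefix + [node]
        succs = dfg.get(node, [])
        if not succs:  # output node: the path is complete
            paths.append(prefix)
            return
        for nxt in succs:
            walk(nxt, prefix)

    for src in inputs:
        walk(src, [])
    return paths
```

For a DFG with two inputs feeding one node that fans out to two outputs, this yields four paths, one per input-output pair.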

10 Resource Sharing Overview: Use Substring Matching to Share Resources; Merge DFGs Along Matched Nodes. [Figure: DFGs G_1, G_2, G_3, G_4]

11 Resource Sharing Overview: Synthesize G_C; Requires Less Area than Synthesizing G_1 … G_4 Separately. [Figure: G_C alongside G_1, G_2, G_3, G_4]

12 Path-Based Resource Sharing [Figure: two paths, P1 and P2; area costs legend 8, 5, 1, 3]

13 Maximum Area Common Substring (MACStr). Complexity: O(L), where L is the Length of the String. Area of MACStr = 26. [Figure: paths P1 and P2; area costs legend 8, 5, 1, 3]
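A maximum-area common substring can be computed with a small dynamic program over two operator sequences. The sketch below is a simple quadratic-time version, not the O(L) method cited on the slide, and it assumes all area costs are positive. The operator names in the usage note are hypothetical; the slides give only the four costs 8, 5, 1, 3.

```python
def macstr(p1, p2, area):
    """Maximum-area common (contiguous) substring of two operator sequences.

    p1, p2: lists of operator names; area: {operator: positive cost}.
    Runs in O(len(p1) * len(p2)) time. Returns (area, substring).
    """
    best_area, best_end = 0, 0
    prev = [0] * (len(p2) + 1)
    for i in range(1, len(p1) + 1):
        cur = [0] * (len(p2) + 1)
        for j in range(1, len(p2) + 1):
            if p1[i - 1] == p2[j - 1]:
                # Extend the common substring ending at (i-1, j-1)
                cur[j] = prev[j - 1] + area[p1[i - 1]]
                if cur[j] > best_area:
                    best_area, best_end = cur[j], i
        prev = cur

    # Walk back from best_end until the accumulated area matches
    start, total = best_end, 0
    while total < best_area:
        start -= 1
        total += area[p1[start]]
    return best_area, p1[start:best_end]
```

With the hypothetical costs `{"mul": 8, "add": 5, "shl": 3, "not": 1}`, the paths `mul add shl mul add` and `add shl mul not add` share the substring `add shl mul`, with area 16.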

14 Maximum Area Common Subsequence (MACSeq). Complexity: O(L^2 / log L), where L is the Length of the String. Area of MACSeq = 43. [Figure: paths P1 and P2; area costs legend 8, 5, 1, 3]
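The subsequence variant drops the contiguity requirement, so it is a weighted form of the longest-common-subsequence problem. The sketch below uses the textbook O(L^2) table rather than the O(L^2 / log L) method cited on the slide. The function name and the operator names in the usage note are hypothetical.

```python
def macseq(p1, p2, area):
    """Maximum-area common subsequence of two operator sequences.

    p1, p2: lists of operator names; area: {operator: cost}.
    Returns (area, subsequence).
    """
    n, m = len(p1), len(p2)
    # best[i][j] = max area of a common subsequence of p1[:i] and p2[:j]
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(best[i - 1][j], best[i][j - 1])
            if p1[i - 1] == p2[j - 1]:
                matched = best[i - 1][j - 1] + area[p1[i - 1]]
                best[i][j] = max(best[i][j], matched)

    # Trace back through the table to recover one optimal subsequence
    seq = []
    i, j = n, m
    while i > 0 and j > 0:
        if (p1[i - 1] == p2[j - 1]
                and best[i][j] == best[i - 1][j - 1] + area[p1[i - 1]]):
            seq.append(p1[i - 1])
            i, j = i - 1, j - 1
        elif best[i - 1][j] >= best[i][j - 1]:
            i -= 1
        else:
            j -= 1
    seq.reverse()
    return best[n][m], seq
```

With the hypothetical costs `{"mul": 8, "add": 5, "shl": 3, "not": 1}`, the paths `mul add shl mul add` and `add shl mul not add` share the subsequence `add shl mul add`, with area 21. A common subsequence can never have less area than the best common substring of the same two paths, which is consistent with the slides reporting a larger area for MACSeq than for MACStr.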

15 Resource Sharing Algorithm. Global Phase: Determine Which DFGs to Merge and an Initial Path to Merge. Local Phase: Aggressively Apply PBRS to Share Resources Between the DFGs Selected by the Global Phase. Repeat Until all DFGs are Merged, or no Further Resource Sharing is Possible.

16 Resource Sharing Algorithm [Figure: DFGs G_1, G_2, G_3, G_4; area costs legend 8, 5, 1, 3]

17-18 Global Phase [Figure: G_1, G_2, G_3, G_4; MACSeq/MACStr]

19 Entering Local Phase [Figure: G_1, G_2; MACSeq/MACStr]

20-28 Local Phase [Figure, animation frames: G_1 and G_2 merged into G_12 using MACSeq/MACStr]

29 Returning To Global Phase [Figure: G_12, G_3, G_4]

30-31 Global Phase [Figure: G_3, G_4, G_12; MACSeq/MACStr]

32 Entering Local Phase [Figure: G_12, G_4; MACSeq/MACStr]

33-37 Local Phase [Figure, animation frames: G_4 and G_12 merged into G_124 using MACSeq/MACStr]

38-41 A Local Decision [Figure, animation frames: G_4, G_12, G_124; MACSeq/MACStr]

42 Cycles are Illegal [Figure: a merge in G_124 marked ILLEGAL!; MACSeq/MACStr]

43 Cycles are Illegal [Figure: a merge in G_124 marked LEGAL!; MACSeq/MACStr]
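Because the consolidation graph must stay acyclic, each candidate merge of two matched nodes can be tested before it is committed. The sketch below is one way to do that, assuming an adjacency-dict graph and using Kahn's algorithm for the acyclicity test; the helper names `is_acyclic` and `merge_nodes` are hypothetical and are not taken from the slides.

```python
from collections import deque


def is_acyclic(graph):
    """Kahn's algorithm on a directed graph given as {node: [successors]}."""
    nodes = set(graph)
    for succs in graph.values():
        nodes.update(succs)
    indegree = {n: 0 for n in nodes}
    for succs in graph.values():
        for s in succs:
            indegree[s] += 1

    ready = deque(n for n in nodes if indegree[n] == 0)
    visited = 0
    while ready:
        n = ready.popleft()
        visited += 1
        for s in graph.get(n, []):
            indegree[s] -= 1
            if indegree[s] == 0:
                ready.append(s)
    # Nodes on a cycle never reach indegree 0, so they are never visited
    return visited == len(nodes)


def merge_nodes(graph, keep, drop):
    """Return a copy of graph in which node `drop` is folded into `keep`."""
    merged = {}
    for n, succs in graph.items():
        src = keep if n == drop else n
        targets = merged.setdefault(src, [])
        for s in succs:
            dst = keep if s == drop else s
            if dst not in targets:
                targets.append(dst)
    return merged
```

As a usage example, take two paths `a1 -> b1` and `b2 -> a2`, where the `a` nodes and the `b` nodes are the same operator type. Merging `a2` into `a1` leaves the graph acyclic, but additionally merging `b1` into `b2` produces `a1 -> b2 -> a1`, so that second merge would be rejected.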

44 Local Phase [Figure: G_4, G_12, G_124]

45 Returning To Global Phase [Figure: G_3, G_124]

46-49 Global Phase [Figure, animation frames: G_3, G_124, G_1234; MACSeq/MACStr]

50-61 Local Phase [Figure, animation frames: G_3 and G_124 merged into G_1234 using MACSeq/MACStr]

62 We're Done [Figure: G_1, G_2, G_3, G_4 and the consolidation graph G_1234; area costs legend 8, 5, 1, 3]

63 We're Done [Figure: individual DFG areas 17, 25, 14, and 20; Total Area of DFGs = 76; G_1234 Area = 30]
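The saving in this worked example follows directly from the areas shown on the slide:

```latex
\begin{align*}
\text{Total area of DFGs} &= 17 + 25 + 14 + 20 = 76 \\
\frac{\text{Total area of DFGs}}{\text{Area}(G_{1234})} &= \frac{76}{30} \approx 2.53 \\
1 - \frac{30}{76} &\approx 0.605
\end{align*}
```

In this example the additive estimate is about 2.5 times the area of the consolidation graph, so resource sharing removes roughly 60% of the summed area.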

64 Experimental Procedure [Flow diagram, box labels: Machine-SUIF Compiler; Custom Instr. Generation; Set of Patterns; Consolidation Graph Construction Algorithm; Consolidation Graph; Pipeline Synthesis; VLIW Synthesis; Estimate Area]

65 Pipelined Datapath Synthesis. Loop Bodies: 80-90% of Program Execution Time; Parallelism Exists Across Multiple Iterations; Pipelined Datapath Yields Maximal Throughput. [Figure: Compiler; Data Flow Graph; Insert Registers & Muxes]

66 Pipelined Datapath Synthesis [Figure: G_C alongside G_1, G_2, G_3, G_4]

67 VLIW Datapath Synthesis. Non-Loop Computations: Instruction-Level Parallelism; Similar to Latency-Constrained Scheduling in High-Level Synthesis. [Figure: Compiler; Data Flow Graph]

68 Benchmark Suite: MediaBench Benchmark Suite.
Table columns: Exp.; Benchmark; File/Function; Num. Instrs.; Largest Instr. (Operations); Avg. Ops per Instr.
Exp.: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11.
Benchmark: Mesa, PGP, Rasta, Epic, JPEG, MPEG2, Rasta.
File/Function: blend.c, idea.c, mul_mdmd_md.c, collapse_pyr, jpeg_fdct_ifast, jpeg_idct_4x4, jpeg_idct_2x2, idct_col, FR4TR, Lqsolve.c, idct_row.
Remaining numeric cells, in extraction order: 6 14 5 7 21 5 8 7 9 4 10 18 8 6 4 9 17 12 5 30 37 25 5.5 3.2 3.0 4.4 7.0 5.9 3.1 7.2 20.0 7.5

69-70 Experimental Results [Charts: XilinxE-1000 Area]

71 Summary. Area Estimates Based on Resource Sharing: 0-1 Knapsack Problem Formulation Does Not Allow for Resource Sharing Estimates. Resource Sharing Algorithm: PBRS Applied to Data Flow Graphs. Experimental Results: ILP Overestimates Area Costs by as much as 374% and 582% for Pipelined and VLIW Datapaths, Respectively.

