Area-Efficient Instruction Set Synthesis for Reconfigurable System on Chip Designs. Philip Brisk, Adam Kaplan, Majid Sarrafzadeh. Embedded and Reconfigurable Systems Lab, Computer Science Department, University of California, Los Angeles. DAC ’04, June 9, San Diego Convention Center, San Diego, CA
Outline: Custom Instruction Generation and Selection; Resource Sharing; Algorithm Description with Examples; Datapath Synthesis Techniques; Experimental Methodology and Results; Summary
Custom Instruction Generation and Selection. Custom Instruction Generation: Compiler Profiles Application Code; Extracts Favorable IR Patterns; Synthesizes Patterns as Hardware Datapaths. Custom Instruction Selection: Area Constraints Limit on-Chip Functionality; NP-Hard 0-1 Knapsack Problem; Formulated as an Integer Linear Program (ILP)
ILP Formulation for Instruction Selection Problem. For each custom instruction i: Gain(i): Estimated Performance Gain of i; Area(i): Estimated Area of i; Selected(i): 1 if i is Selected, 0 Otherwise. Goal: Maximize Gain of Selected Instructions, $\max \sum_i \mathrm{Gain}(i)\,\mathrm{Selected}(i)$. Constraint: Area of Selected Instructions Must Fit in the FPGA Area, $\sum_i \mathrm{Area}(i)\,\mathrm{Selected}(i) \le \mathrm{FPGA\ Area}$
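The selection problem above can be sketched as a small 0-1 knapsack solver. This is a minimal illustration, not the ILP solver used in the talk: it swaps in textbook dynamic programming, assumes integer areas, and treats the area budget as inclusive. The function name `select_instructions` and the gain values in the usage line are hypothetical; only the four area values come from the slides.

```python
from typing import List, Tuple


def select_instructions(gain: List[float], area: List[int],
                        fpga_area: int) -> Tuple[float, List[int]]:
    """0-1 knapsack by dynamic programming over integer area budgets.

    Returns (best total gain, Selected vector), where Selected[i] is 1 if
    instruction i is chosen and 0 otherwise.
    """
    n = len(gain)
    # best[i][a] = max gain using the first i instructions within budget a
    best = [[0.0] * (fpga_area + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for a in range(fpga_area + 1):
            best[i][a] = best[i - 1][a]  # option 1: skip instruction i-1
            if area[i - 1] <= a:         # option 2: take it if it fits
                take = best[i - 1][a - area[i - 1]] + gain[i - 1]
                if take > best[i][a]:
                    best[i][a] = take
    # Trace back through the table to recover the Selected vector.
    selected = [0] * n
    a = fpga_area
    for i in range(n, 0, -1):
        if best[i][a] != best[i - 1][a]:
            selected[i - 1] = 1
            a -= area[i - 1]
    return best[n][fpga_area], selected


# Areas from the slides; gains and the budget of 40 are made up.
total_gain, selected = select_instructions([10, 12, 6, 9], [17, 25, 14, 20], 40)
print(total_gain, selected)  # prints: 19.0 [1, 0, 0, 1]
```

Note that the solver charges each instruction its full standalone area, which is exactly the additive estimate the following slides call into question.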
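What About Resource Sharing? Two DFGs: Area = 17 and Area = 25. ILP Area Estimate = 17 + 25 = 42. My Datapath: Area = 28. Ratio of ILP Estimate to Datapath Area: 1.5. [Figure: the two DFGs and the datapath that implements both; legend: Area Costs]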
Analysis. 0-1 Knapsack Problem Formulation: Estimated Area is 150% of the Actual Datapath Area (42 vs. 28), a 50% Overestimate; ILP Solvers Do Not Consider Resource Sharing. How to Remedy This: Develop a Resource Sharing Algorithm; Avoid Additive Area Estimates Based on per-Instruction Costs
Resource Sharing for DFGs. Given: A Set of DFGs G* = {G_1, …, G_n}. Goal: Construct a Consolidation Graph G_C of Minimal Cost. Constraints: G_C Must be Acyclic; G_C Must be a Supergraph of each G_i in G*. That’s Life: The Problem is NP-Hard
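The two constraints on the consolidation graph can be checked mechanically. The sketch below is only an illustration under stated assumptions: graphs are plain edge sets, the node names and the `mapping` from DFG nodes to consolidation-graph nodes are hypothetical, and operation-type compatibility between mapped nodes is not checked. `is_acyclic` uses Kahn's algorithm; `is_covered` tests the supergraph condition for one DFG.

```python
from collections import defaultdict, deque
from typing import Dict, Hashable, Iterable, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def is_acyclic(nodes: Iterable[Hashable], edges: Iterable[Edge]) -> bool:
    """Kahn's algorithm: a directed graph is acyclic exactly when every
    node can be removed in topological order."""
    nodes = list(nodes)
    succ = defaultdict(list)
    indeg = {v: 0 for v in nodes}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    ready = deque(v for v in nodes if indeg[v] == 0)
    removed = 0
    while ready:
        u = ready.popleft()
        removed += 1
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return removed == len(nodes)


def is_covered(dfg_edges: Set[Edge], mapping: Dict[Hashable, Hashable],
               gc_edges: Set[Edge]) -> bool:
    """G_C covers a DFG under `mapping` if the mapping is one-to-one and
    every DFG edge lands on an edge of G_C."""
    if len(set(mapping.values())) != len(mapping):
        return False
    return all((mapping[u], mapping[v]) in gc_edges for u, v in dfg_edges)


# Hypothetical example: two tiny DFGs sharing one adder in G_C.
gc_nodes = ["add", "mul", "sub"]
gc_edges = {("add", "mul"), ("add", "sub")}
print(is_acyclic(gc_nodes, gc_edges))                                        # True
print(is_covered({("a1", "m1")}, {"a1": "add", "m1": "mul"}, gc_edges))      # True
print(is_covered({("a2", "s2")}, {"a2": "add", "s2": "sub"}, gc_edges))      # True
```

Checking a given G_C is easy; the NP-hardness noted on the slide lies in finding the minimum-cost G_C and its mappings.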
Resource Sharing Overview: Path Based Resource Sharing (PBRS). Decompose Patterns into Input-Output Paths. [Figure: DFGs G_1, G_2, G_3, G_4]
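Decomposing a pattern into input-output paths amounts to enumerating source-to-sink paths of a DAG. The following is a minimal sketch, assuming the DFG is given as an adjacency list keyed by node name; the function name `input_output_paths` and the example graph are hypothetical and not taken from the talk.

```python
from typing import Dict, List


def input_output_paths(succ: Dict[str, List[str]]) -> List[List[str]]:
    """Enumerate every path from a source (no predecessors) to a sink
    (no successors) in a DAG given as an adjacency list."""
    has_pred = {v for vs in succ.values() for v in vs}
    nodes = set(succ) | has_pred
    sources = sorted(v for v in nodes if v not in has_pred)
    paths: List[List[str]] = []

    def walk(node: str, prefix: List[str]) -> None:
        prefix = prefix + [node]
        nexts = succ.get(node, [])
        if not nexts:  # sink reached: one complete input-output path
            paths.append(prefix)
            return
        for nxt in nexts:
            walk(nxt, prefix)

    for s in sources:
        walk(s, [])
    return paths


# Hypothetical DFG: two loads feed an add, which feeds a multiply and a shift.
dfg = {"ld1": ["add"], "ld2": ["add"], "add": ["mul", "shl"], "mul": ["st"]}
for p in input_output_paths(dfg):
    print(" -> ".join(p))
```

The number of paths can grow exponentially with graph size in the worst case, so a practical implementation would likely enumerate them lazily or bound the search.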
Resource Sharing Overview: Use Substring Matching to Share Resources. Merge DFGs Along Matched Nodes. [Figure: DFGs G_1, G_2, G_3, G_4]
Resource Sharing Overview: Synthesize G_C. Requires Less Area than Synthesizing G_1 … G_4 Separately. [Figure: G_C alongside G_1, G_2, G_3, G_4]
Path-Based Resource Sharing. [Figure: two example paths, P1 and P2; legend: Area Costs]
Maximum Area Common Substring (MACStr). Complexity: O(L), where L is the Length of the String. Area of MACStr = 26. [Figure: paths P1 and P2 with the matched substring; legend: Area Costs]
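To make the substring objective concrete, here is a minimal sketch that finds a maximum-area common substring of two paths by quadratic dynamic programming. It is not the O(L) algorithm cited on the slide; the simpler table-based method is swapped in for readability. The function name `macstr`, the operation names, and the area values are all hypothetical, and areas are assumed to be positive.

```python
from typing import Dict, List, Tuple


def macstr(p1: List[str], p2: List[str],
           area: Dict[str, int]) -> Tuple[int, List[str]]:
    """Maximum-area common substring (contiguous run) of two paths.

    run_area[i][j] is the total area of the longest common run ending at
    p1[i-1] and p2[j-1]; with positive areas, the longest run ending
    there is also the heaviest one ending there.
    """
    best_area, best_end, best_len = 0, 0, 0
    run_area = [[0] * (len(p2) + 1) for _ in range(len(p1) + 1)]
    run_len = [[0] * (len(p2) + 1) for _ in range(len(p1) + 1)]
    for i in range(1, len(p1) + 1):
        for j in range(1, len(p2) + 1):
            if p1[i - 1] == p2[j - 1]:
                run_area[i][j] = run_area[i - 1][j - 1] + area[p1[i - 1]]
                run_len[i][j] = run_len[i - 1][j - 1] + 1
                if run_area[i][j] > best_area:
                    best_area = run_area[i][j]
                    best_end, best_len = i, run_len[i][j]
    return best_area, p1[best_end - best_len:best_end]


# Hypothetical operation areas and paths.
areas = {"add": 2, "mul": 12, "shl": 1, "sub": 2}
path1 = ["add", "mul", "shl", "add"]
path2 = ["sub", "add", "mul", "add"]
print(macstr(path1, path2, areas))  # prints: (14, ['add', 'mul'])
```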
Maximum Area Common Subsequence (MACSeq). Complexity: O(L^2 / log L), where L is the Length of the String. Area of MACSeq = 43. [Figure: paths P1 and P2 with the matched subsequence; legend: Area Costs]
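The subsequence variant relaxes contiguity, so more operations can be shared. Below is a minimal sketch based on the classic weighted longest-common-subsequence table, which runs in O(L^2) rather than the O(L^2 / log L) bound cited on the slide. The function name `macseq`, the operation names, and the areas are hypothetical.

```python
from typing import Dict, List, Tuple


def macseq(p1: List[str], p2: List[str],
           area: Dict[str, int]) -> Tuple[int, List[str]]:
    """Maximum-area common subsequence of two paths (order preserved,
    gaps allowed), via a weighted LCS table."""
    n, m = len(p1), len(p2)
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(best[i - 1][j], best[i][j - 1])
            if p1[i - 1] == p2[j - 1]:
                best[i][j] = max(best[i][j],
                                 best[i - 1][j - 1] + area[p1[i - 1]])
    # Trace back to recover one optimal subsequence.
    seq: List[str] = []
    i, j = n, m
    while i > 0 and j > 0:
        if (p1[i - 1] == p2[j - 1]
                and best[i][j] == best[i - 1][j - 1] + area[p1[i - 1]]):
            seq.append(p1[i - 1])
            i, j = i - 1, j - 1
        elif best[i - 1][j] >= best[i][j - 1]:
            i -= 1
        else:
            j -= 1
    seq.reverse()
    return best[n][m], seq


# Hypothetical operation areas and paths.
areas = {"add": 2, "mul": 12, "shl": 1, "sub": 2}
path1 = ["add", "mul", "shl", "add"]
path2 = ["sub", "add", "mul", "add"]
print(macseq(path1, path2, areas))  # prints: (16, ['add', 'mul', 'add'])
```

On these example paths the subsequence shares area 16, more than any contiguous match can, which mirrors the slide's point that MACSeq (43) recovers more area than MACStr (26).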
Resource Sharing Algorithm. Global Phase: Determine Which DFGs to Merge and an Initial Path to Merge. Local Phase: Aggressively Apply PBRS to Share Resources Between the DFGs Selected by the Global Phase. Repeat Until All DFGs are Merged or No Further Resource Sharing is Possible
Resource Sharing Algorithm: Example. [Figure: four DFGs G_1, G_2, G_3, G_4; legend: Area Costs]
Global Phase: Candidate DFGs G_1, G_2, G_3, G_4. [Figure: MACSeq/MACStr matching across the four DFGs]
Entering Local Phase: G_1 and G_2. [Figure: MACSeq/MACStr match between G_1 and G_2]
Local Phase: G_1 and G_2 Merged Step by Step into G_12. [Figure: successive MACSeq/MACStr matches between G_1 and G_2]
Returning to Global Phase: Remaining DFGs are G_12, G_3, G_4. [Figure: MACSeq/MACStr matching across G_12, G_3, G_4]
Entering Local Phase: G_12 and G_4. [Figure: MACSeq/MACStr match between G_12 and G_4]
Local Phase: G_12 and G_4 Merged Step by Step into G_124. [Figure: successive MACSeq/MACStr matches between G_12 and G_4]
A Local Decision. [Figure: G_4, G_12, and the partially merged G_124 with candidate MACSeq/MACStr matches]
Cycles are Illegal. A Match Between G_4 and G_12 that Creates a Cycle in G_124 is ILLEGAL; a Match that Keeps G_124 Acyclic is LEGAL. [Figure: one illegal and one legal MACSeq/MACStr match]
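The legality test can be sketched as: contract each matched node pair, then look for a cycle in the merged graph. This is only an illustration of the acyclicity constraint, not the check used in the talk. The function name `merge_is_legal`, the node names, and the `matches` dictionary (mapping nodes of one DFG onto nodes of the other) are hypothetical, and node names in the two DFGs are assumed not to collide.

```python
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[str, str]


def merge_is_legal(edges_a: Iterable[Edge], edges_b: Iterable[Edge],
                   matches: Dict[str, str]) -> bool:
    """Return True if identifying each node b of DFG B with matches[b]
    in DFG A leaves the merged graph acyclic."""
    succ: Dict[str, Set[str]] = {}

    def add_edge(u: str, v: str) -> None:
        succ.setdefault(u, set()).add(v)
        succ.setdefault(v, set())

    for u, v in edges_a:
        add_edge(u, v)
    for u, v in edges_b:
        # Matched nodes of B are replaced by their partner in A.
        add_edge(matches.get(u, u), matches.get(v, v))

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the DFS stack, finished
    colour = {v: WHITE for v in succ}

    def has_cycle(u: str) -> bool:
        colour[u] = GREY
        for v in succ[u]:
            if colour[v] == GREY:  # back edge: cycle found
                return True
            if colour[v] == WHITE and has_cycle(v):
                return True
        colour[u] = BLACK
        return False

    return not any(colour[v] == WHITE and has_cycle(v) for v in list(succ))


# Hypothetical DFGs: A computes add -> mul, B computes mul -> add.
edges_a = [("a_add", "a_mul")]
edges_b = [("b_mul", "b_add")]
print(merge_is_legal(edges_a, edges_b, {"b_mul": "a_mul", "b_add": "a_add"}))  # False
print(merge_is_legal(edges_a, edges_b, {"b_mul": "a_mul"}))                    # True
```

Sharing both operators would force the multiplier to feed the adder and the adder to feed the multiplier, so only one of the two matches can be accepted.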
Local Phase. [Figure: G_4, G_12, and the merged graph G_124]
Returning to Global Phase: Remaining DFGs are G_3 and G_124. [Figure: MACSeq/MACStr match between G_3 and G_124; merged graph G_1234 introduced]
Local Phase: G_3 and G_124 Merged Step by Step into G_1234. [Figure: successive MACSeq/MACStr matches between G_3 and G_124]
We’re Done: G_1, G_2, G_3, G_4 Merged into G_1234. Areas of the Four DFGs: 17, 25, 14, 20; Total Area of DFGs = 76. Area of G_1234 = 30. [Figure: the four DFGs and G_1234; legend: Area Costs]
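The gap between the additive estimate and the merged graph follows directly from the numbers on this slide. The short calculation below only restates that arithmetic; the variable names are illustrative.

```python
dfg_areas = [17, 25, 14, 20]  # per-DFG areas reported on the slide
merged_area = 30              # area of G_1234 reported on the slide

additive_estimate = sum(dfg_areas)       # what a per-instruction sum charges
ratio = additive_estimate / merged_area  # additive estimate relative to shared area
saved = additive_estimate - merged_area  # area recovered by resource sharing

print(additive_estimate, round(ratio, 2), saved)  # prints: 76 2.53 46
```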
Experimental Procedure. [Figure: flow from Machine-SUIF Compiler to Custom Instr. Generation to Set of Patterns to Consolidation Graph Construction Algorithm to Consolidation Graph to Pipeline Synthesis or VLIW Synthesis to Estimate Area]
Pipelined Datapath Synthesis. Loop Bodies: 80-90% of Program Execution Time; Parallelism Exists Across Multiple Iterations; Pipelined Datapath Yields Maximal Throughput. [Figure: Compiler, Data Flow Graph, Insert Registers & Muxes]
Pipelined Datapath Synthesis. [Figure: G_C alongside G_1, G_2, G_3, G_4]
VLIW Datapath Synthesis. Non-Loop Computations: Instruction-Level Parallelism; Similar to Latency-Constrained Scheduling in High-Level Synthesis. [Figure: Compiler, Data Flow Graph]
Benchmark Suite: MediaBench. Table Columns: Exp.; Benchmark; File/Function; Num. Instrs.; Largest Instr. (Operations); Avg. Ops per Instr. Benchmarks: Mesa, PGP, Rasta, Epic, JPEG, MPEG2, Rasta. Files/Functions: blend.c, idea.c, mul_mdmd_md.c, collapse_pyr, jpeg_fdct_ifast, jpeg_idct_4x4, jpeg_idct_2x2, idct_col, FR4TR, Lqsolve.c, idct_row
Experimental Results. [Figure: Area on the Xilinx E-1000]
Summary. Area Estimates Based on Resource Sharing: 0-1 Knapsack Problem Formulation Does Not Allow for Resource Sharing Estimates. Resource Sharing Algorithm: PBRS Applied to Data Flow Graphs. Experimental Results: ILP Overestimates Area Costs by as much as 374% and 582% for Pipelined and VLIW Datapaths, Respectively