Area-Efficient Instruction Set Synthesis for Reconfigurable System on Chip Designs. Philip Brisk, Adam Kaplan, Majid Sarrafzadeh.

Area-Efficient Instruction Set Synthesis for Reconfigurable System on Chip Designs
Philip Brisk, Adam Kaplan, Majid Sarrafzadeh
Embedded and Reconfigurable Systems Lab, Computer Science Department, University of California, Los Angeles
philip@cs.ucla.edu, majid@cs.ucla.edu, kaplan@cs.ucla.edu
DAC ’04. June 9, 2004. San Diego Convention Center, San Diego, CA

Outline
- Custom Instruction Generation and Selection
- Resource Sharing
- Algorithm Description with Examples
- Datapath Synthesis Techniques
- Experimental Methodology and Results
- Summary

Custom Instruction Generation and Selection
Custom Instruction Generation
- Compiler Profiles Application Code
- Extracts Favorable IR Patterns
- Synthesizes Patterns as Hardware Datapaths
Custom Instruction Selection
- Area Constraints Limit On-Chip Functionality
- NP-Hard 0-1 Knapsack Problem
- Formulated as an Integer Linear Program (ILP)

ILP Formulation for Instruction Selection Problem
For each custom instruction i:
- Gain(i): Estimated Performance Gain of i
- Area(i): Estimated Area of i
- Selected(i): 1 if i is Selected; 0 Otherwise
Goal: Maximize Gain of Selected Instructions
Constraint: Area of Selected Instructions < FPGA Area
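
The selection problem above can be illustrated with a small sketch. It swaps in a textbook dynamic program for the ILP solver named on the slide, assumes integer areas, and treats the FPGA area as an inclusive budget; the function name `select_instructions` and the gain values in the usage lines are hypothetical.

```python
def select_instructions(gain, area, fpga_area):
    """0-1 knapsack by dynamic programming (a stand-in for an ILP solver).

    gain[i], area[i]: estimated gain and integer area of instruction i.
    Returns (total gain, Selected) where Selected[i] is 1 or 0.
    """
    n = len(gain)
    # best[a] = (max gain, chosen indices) using total area at most a
    best = [(0, [])] * (fpga_area + 1)
    for i in range(n):
        new = list(best)
        for a in range(area[i], fpga_area + 1):
            prev_gain, prev_items = best[a - area[i]]
            if prev_gain + gain[i] > new[a][0]:
                new[a] = (prev_gain + gain[i], prev_items + [i])
        best = new
    total, chosen = best[fpga_area]
    selected = [1 if i in chosen else 0 for i in range(n)]
    return total, selected


# Hypothetical gains; the areas are summed per instruction, which is
# exactly the additive estimate the following slides question.
total, selected = select_instructions([10, 7, 4], [17, 25, 14], 40)
```

Because each area is counted independently, this formulation cannot reward two instructions that could share hardware.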

What About Resource Sharing?
- Two DFGs: Area = 17 and Area = 25
- ILP Area Estimate = 42
- My Datapath: Area = 28
- Area Costs: 8, 5, 1, 3
[Figure: the two DFGs and the shared datapath; the figure carries the label 1.5]

Analysis
0-1 Knapsack Problem Formulation
- Area Estimate Was 150% of the Shared Datapath's Area
- ILP Solvers Do Not Consider Resource Sharing
How to Remedy This
- Develop a Resource Sharing Algorithm
- Avoid Additive Area Estimates Based on per-Instruction Costs
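
As a worked check using the numbers from the preceding slide (two DFGs of area 17 and 25, and a shared datapath of area 28), the additive estimate compares with the shared area as follows. The symbols A_add and A_shared are labels introduced here, not notation from the slides.

```latex
A_{\text{add}} = 17 + 25 = 42, \qquad
A_{\text{shared}} = 28, \qquad
\frac{A_{\text{add}}}{A_{\text{shared}}} = \frac{42}{28} = 1.5
```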

Resource Sharing for DFGs
Given: A Set of DFGs G* = {G_1, …, G_n}
Goal: Construct a Consolidation Graph G_C of Minimal Cost
Constraints:
- G_C Must be Acyclic
- G_C Must be a Supergraph of each G_i in G*
That’s Life: The Problem is NP-Hard
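
The two constraints can be checked mechanically once a candidate consolidation graph is in hand. The sketch below assumes a graph is given as a dict of node to operation name plus a set of directed edges, and that the node mapping from each DFG into the consolidation graph is already known; this representation and the helper names are illustrative, not taken from the slides.

```python
from collections import deque


def is_acyclic(nodes, edges):
    """Kahn's algorithm: True if every node can be topologically ordered."""
    indeg = {n: 0 for n in nodes}
    succ = {n: [] for n in nodes}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    queue = deque(n for n in nodes if indeg[n] == 0)
    seen = 0
    while queue:
        u = queue.popleft()
        seen += 1
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return seen == len(indeg)


def embeds(g_ops, g_edges, gc_ops, gc_edges, mapping):
    """True if `mapping` places the DFG inside the consolidation graph."""
    if len(set(mapping.values())) != len(mapping):
        return False  # two DFG nodes may not land on the same resource
    if any(g_ops[n] != gc_ops[mapping[n]] for n in g_ops):
        return False  # operation types must agree
    return all((mapping[u], mapping[v]) in gc_edges for u, v in g_edges)


# Toy example: mul->add and add->shl both fit inside mul->add->shl.
gc_ops = {"m": "mul", "s": "add", "h": "shl"}
gc_edges = {("m", "s"), ("s", "h")}
ok = is_acyclic(gc_ops, gc_edges) and embeds(
    {"a": "mul", "b": "add"}, {("a", "b")}, gc_ops, gc_edges, {"a": "m", "b": "s"}
)
```

Verifying a given graph is the easy part; finding one of minimal cost is where the NP-hardness noted on the slide comes in.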

Resource Sharing Overview: Path Based Resource Sharing (PBRS)
- Decompose Patterns into Input-Output Paths
[Figure: DFGs G_1, G_2, G_3, G_4]
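
Decomposing a pattern into input-output paths amounts to enumerating every source-to-sink path of the DFG. A minimal sketch, assuming the DFG is an acyclic successor map (the dict layout and the function name `input_output_paths` are illustrative):

```python
def input_output_paths(succ):
    """Enumerate all source-to-sink paths of a DAG.

    succ: dict mapping every node to the list of its successors.
    """
    has_pred = {v for vs in succ.values() for v in vs}
    sources = [n for n in succ if n not in has_pred]
    paths = []

    def walk(node, prefix):
        prefix = prefix + [node]
        if not succ[node]:          # sink reached: one complete path
            paths.append(prefix)
            return
        for nxt in succ[node]:
            walk(nxt, prefix)

    for s in sources:
        walk(s, [])
    return paths


dfg = {"a": ["c"], "b": ["c"], "c": ["d", "e"], "d": [], "e": []}
paths = input_output_paths(dfg)
# paths == [["a","c","d"], ["a","c","e"], ["b","c","d"], ["b","c","e"]]
```

The number of paths can grow exponentially with DFG size in the worst case, so a practical implementation would likely bound or prune the enumeration.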

Resource Sharing Overview
- Use Substring Matching to Share Resources
- Merge DFGs Along Matched Nodes
[Figure: DFGs G_1, G_2, G_3, G_4]

Resource Sharing Overview
- Synthesize G_C
- Requires Less Area than Synthesizing G_1 … G_4 Separately
[Figure: G_1, G_2, G_3, G_4 and the consolidation graph G_C]

Path-Based Resource Sharing
Area Costs: 8, 5, 1, 3
P1: ()
P2: ()

Maximum Area Common Substring (MACStr)
Area Costs: 8, 5, 1, 3
P1: ()
P2: ()
MACStr: ( )
- Running Time: O(L), where L is the Length of the String
- Area of MACStr = 26
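
To make the substring objective concrete, here is a sketch that finds the contiguous run of operations common to two paths with the largest total area. It uses a quadratic dynamic program rather than the O(L) method cited on the slide, and the mapping of the four area costs to operation names (`mul`, `add`, `shl`, `and`) is an assumption, since the slide's symbols did not survive extraction.

```python
def max_area_common_substring(p1, p2, area):
    """Return (area, ops) of the max-area contiguous run shared by p1 and p2."""
    best = (0, 0, 0)  # (area, end index in p1, length)
    prev_area = [0] * (len(p2) + 1)
    prev_len = [0] * (len(p2) + 1)
    for i, op in enumerate(p1, start=1):
        cur_area = [0] * (len(p2) + 1)
        cur_len = [0] * (len(p2) + 1)
        for j in range(1, len(p2) + 1):
            if op == p2[j - 1]:
                # extend the common run that ended at (i-1, j-1)
                cur_area[j] = prev_area[j - 1] + area[op]
                cur_len[j] = prev_len[j - 1] + 1
                if cur_area[j] > best[0]:
                    best = (cur_area[j], i, cur_len[j])
        prev_area, prev_len = cur_area, cur_len
    total, end, length = best
    return total, p1[end - length:end]


AREA = {"mul": 8, "add": 5, "shl": 1, "and": 3}  # assumed op-to-cost mapping
p1 = ["mul", "add", "shl", "mul", "mul"]
p2 = ["add", "mul", "mul", "shl"]
result = max_area_common_substring(p1, p2, AREA)  # (16, ["mul", "mul"])
```

Maximizing area rather than length matters: a short run of expensive operators can be worth more than a long run of cheap ones.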

Maximum Area Common Subsequence (MACSeq)
Area Costs: 8, 5, 1, 3
P1: ()
P2: ()
MACSeq: ( )
- Running Time: O(L^2 / log L), where L is the Length of the String
- Area of MACSeq = 43
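
The subsequence variant drops the contiguity requirement, so it behaves like a weighted longest-common-subsequence problem. The sketch below uses the standard O(L^2) table rather than the O(L^2 / log L) bound cited on the slide; the operation names and their costs are again assumed for illustration.

```python
def max_area_common_subsequence(p1, p2, area):
    """Return (area, ops) of the max-area common subsequence of p1 and p2."""
    n, m = len(p1), len(p2)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
            if p1[i - 1] == p2[j - 1]:
                dp[i][j] = max(dp[i][j], dp[i - 1][j - 1] + area[p1[i - 1]])
    # Walk back through the table to recover one optimal subsequence.
    seq, i, j = [], n, m
    while i > 0 and j > 0:
        op = p1[i - 1]
        if op == p2[j - 1] and dp[i][j] == dp[i - 1][j - 1] + area[op]:
            seq.append(op)
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return dp[n][m], seq[::-1]


AREA = {"mul": 8, "add": 5, "shl": 1, "and": 3}  # assumed op-to-cost mapping
p1 = ["mul", "add", "shl", "mul", "mul"]
p2 = ["add", "mul", "mul", "shl"]
result = max_area_common_subsequence(p1, p2, AREA)  # (21, ["add", "mul", "mul"])
```

On these example paths the subsequence shares area 21 where the best contiguous run shares only 16, which mirrors the slides' own example, where MACSeq (43) exceeds MACStr (26).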

Resource Sharing Algorithm
Global Phase determines:
- Which DFGs to Merge
- An Initial Path to Merge
Local Phase
- Aggressively Apply PBRS to Share Resources Between the DFGs Selected by the Global Phase
Repeat Until all DFGs are Merged, or no Further Resource Sharing is Possible
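
One way to picture the global phase is as a search over all pairs of DFGs and all pairs of their paths for the best-scoring match. The slides do not state the selection criterion, so the highest-score rule below is an assumption, as are the names `global_phase` and `common_prefix_area`; the prefix scorer is a deliberately simple stand-in for MACSeq/MACStr.

```python
from itertools import combinations

AREA = {"mul": 8, "add": 5, "shl": 1, "and": 3}  # assumed op-to-cost mapping


def common_prefix_area(p, q):
    """Toy scorer: area of the shared prefix (stand-in for MACSeq/MACStr)."""
    total = 0
    for x, y in zip(p, q):
        if x != y:
            break
        total += AREA[x]
    return total


def global_phase(dfg_paths, score):
    """Pick the pair of DFGs, and one path in each, with the highest score.

    dfg_paths: dict of DFG name -> list of paths (lists of op names).
    Returns (score, (dfg_a, path_a), (dfg_b, path_b)), or None when no pair
    shares anything, which corresponds to the stopping condition above.
    """
    best = None
    for a, b in combinations(dfg_paths, 2):
        for p in dfg_paths[a]:
            for q in dfg_paths[b]:
                s = score(p, q)
                if s > 0 and (best is None or s > best[0]):
                    best = (s, (a, p), (b, q))
    return best


candidates = {
    "G1": [["mul", "add"]],
    "G2": [["mul", "add", "shl"]],
    "G3": [["and", "shl"]],
}
choice = global_phase(candidates, common_prefix_area)
```

The local phase would then merge the two chosen DFGs along the matched nodes and keep applying path matching between them before control returns to the global phase.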

Resource Sharing Algorithm: Example
Area Costs: 8, 5, 1, 3
[Figure: the four input DFGs G_1, G_2, G_3, G_4]

Global Phase
[Figure, two frames: G_1, G_2, G_3, G_4; the second frame adds the MACSeq/MACStr step]

Entering Local Phase
[Figure: G_1 and G_2 with the MACSeq/MACStr step]

Local Phase
[Figure, six frames: G_1 and G_2 are merged into G_12, with the MACSeq/MACStr step shown in alternating frames; numeric labels 0, 1, and 2 appear in the figure]

Returning To Global Phase
[Figure: G_12, G_3, G_4]

Global Phase
[Figure, two frames: G_3, G_4, G_12; the second frame adds the MACSeq/MACStr step]

Entering Local Phase
[Figure: G_12 and G_4 with the MACSeq/MACStr step]

Local Phase
[Figure, three frames: G_4 and G_12 are merged into G_124; numeric labels 0, 4, and 12 appear in the figure]

A Local Decision
[Figure, three frames: G_4, G_12, and G_124 with the MACSeq/MACStr step]

Cycles are Illegal
[Figure, two frames: in G_124, one candidate merge is marked ILLEGAL! and another is marked LEGAL!]
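
The legality rule can be sketched as a reachability test: merging two nodes of an acyclic graph introduces a cycle exactly when one of them can already reach the other. The successor-map representation and the names `reachable` and `merge_is_legal` are assumptions made for this illustration.

```python
def reachable(succ, src, dst):
    """Depth-first search: True if dst can be reached from src."""
    stack, seen = [src], set()
    while stack:
        n = stack.pop()
        if n == dst:
            return True
        if n in seen:
            continue
        seen.add(n)
        stack.extend(succ.get(n, []))
    return False


def merge_is_legal(succ, u, v):
    """Merging u and v keeps the graph acyclic only if neither reaches the other."""
    return not reachable(succ, u, v) and not reachable(succ, v, u)


# Partially merged graph: a -> bx, bx -> c, bx -> y
g = {"a": ["bx"], "bx": ["c", "y"], "c": [], "y": []}
legal = merge_is_legal(g, "c", "y")     # True: c and y are unordered
illegal = merge_is_legal(g, "a", "y")   # False: a already reaches y
```

A check of this kind would run before each candidate merge in the local phase, so that the consolidation graph stays acyclic throughout.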

Local Phase
[Figure: G_4, G_12, G_124]

Returning To Global Phase
[Figure: G_3, G_124]

Global Phase
[Figure, four frames: G_3 and G_124 with the MACSeq/MACStr step; G_1234 appears in the last two frames]

Local Phase
[Figure, eight frames: G_3 and G_124 are merged into G_1234, with the MACSeq/MACStr step shown in several frames; numeric labels 0, 3, and 124 appear in the figure]

We’re Done
Area Costs: 8, 5, 1, 3
[Figure: G_1, G_2, G_3, G_4 and the consolidation graph G_1234]
- Areas of the four DFGs (in figure order): 17, 25, 14, 20
- Total Area of DFGs = 76
- Area of G_1234 = 30
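
The totals on this slide can be checked directly. The indexing of the four areas as G_1 through G_4 follows the order in which they appear in the figure and is an assumption.

```latex
\sum_{i=1}^{4} \mathrm{Area}(G_i) = 17 + 25 + 14 + 20 = 76, \qquad
\frac{\mathrm{Area}(G_{1234})}{\sum_{i=1}^{4} \mathrm{Area}(G_i)} = \frac{30}{76} \approx 0.39
```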

Experimental Procedure
[Flowchart boxes: Machine-SUIF Compiler; Custom Instr. Generation; Set of Patterns; Consolidation Graph Construction Algorithm; Consolidation Graph; Pipeline Synthesis; VLIW Synthesis; Estimate Area]

Pipelined Datapath Synthesis
- Loop Bodies: 80-90% of Program Execution Time
- Parallelism Exists Across Multiple Iterations
- Pipelined Datapath Yields Maximal Throughput
[Diagram: Compiler; Data Flow Graph; Insert Registers & Muxes]
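
The slides do not spell out how registers are inserted. As a generic illustration only, the sketch below levelizes a DFG with as-soon-as-possible (ASAP) stage assignment and counts one pipeline register for every stage boundary an edge crosses. Unit-latency operations, no register sharing between edges, and the function names are all assumptions; multiplexer insertion is not modeled.

```python
def asap_stages(preds):
    """Assign each node the earliest pipeline stage (sources are stage 1).

    preds: dict mapping every node to the list of its predecessors (a DAG).
    """
    stage = {}

    def visit(n):
        if n not in stage:
            stage[n] = 1 + max((visit(p) for p in preds[n]), default=0)
        return stage[n]

    for n in preds:
        visit(n)
    return stage


def pipeline_registers(preds):
    """Count registers, one per stage boundary crossed by each edge."""
    stage = asap_stages(preds)
    return sum(stage[n] - stage[p] for n in preds for p in preds[n])


dfg = {"a": [], "b": [], "c": ["a", "b"], "d": ["c", "a"]}
stages = asap_stages(dfg)          # {"a": 1, "b": 1, "c": 2, "d": 3}
regs = pipeline_registers(dfg)     # 5: the edge a->d crosses two boundaries
```

Edges that skip a stage, such as a to d here, are the ones that add balancing registers beyond the one-per-edge minimum.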

Pipelined Datapath Synthesis
[Figure: G_C and G_1, G_2, G_3, G_4]

VLIW Datapath Synthesis
- Non-Loop Computations
- Instruction-Level Parallelism
- Similar to Latency-Constrained Scheduling in High-Level Synthesis
[Diagram: Compiler; Data Flow Graph]
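
A common first step in latency-constrained scheduling is to compute, for each operation, the earliest (ASAP) and latest (ALAP) cycle in which it can execute under the latency bound. The sketch below shows that step only, assuming unit-latency operations; it is a generic high-level-synthesis illustration, not the procedure used in the experiments, and `schedule_windows` is a hypothetical name.

```python
def schedule_windows(succ, latency):
    """Return node -> (asap, alap) cycle window under a latency bound.

    succ: dict mapping every node to the list of its successors (a DAG).
    Raises ValueError if the critical path is longer than `latency`.
    """
    preds = {n: [] for n in succ}
    for u, vs in succ.items():
        for v in vs:
            preds[v].append(u)
    asap, alap = {}, {}

    def early(n):
        if n not in asap:
            asap[n] = 1 + max((early(p) for p in preds[n]), default=0)
        return asap[n]

    def late(n):
        if n not in alap:
            alap[n] = min((late(s) for s in succ[n]), default=latency + 1) - 1
        return alap[n]

    for n in succ:
        early(n)
        late(n)
    if any(asap[n] > alap[n] for n in succ):
        raise ValueError("latency bound is shorter than the critical path")
    return {n: (asap[n], alap[n]) for n in succ}


dfg = {"a": ["c"], "b": ["c"], "c": ["d"], "d": [], "e": ["d"]}
windows = schedule_windows(dfg, latency=3)
# e gets the window (1, 2); every other node is on the critical path
```

Operations with a wide window can be moved between cycles to balance functional-unit usage, which is what determines how many units the datapath needs.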

Benchmark Suite: MediaBench
Columns: Exp.; Benchmark; File/Function; Num. Instrs.; Largest Instr. (Operations); Avg. Ops per Instr.
Exp.: 1 2 3 4 5 6 7 8 9 10 11
Benchmark: Mesa, PGP, Rasta, Epic, JPEG, MPEG2, Rasta
File/Function: blend.c, idea.c, mul_mdmd_md.c, collapse_pyr, jpeg_fdct_ifast, jpeg_idct_4x4, jpeg_idct_2x2, idct_col, FR4TR, Lqsolve.c, idct_row
Numeric cells in extraction order (row and column alignment not recoverable): 6 14 5 7 21 5 8 7 9 4 10 18 8 6 4 9 17 12 5 30 37 25 5.5 3.2 3.0 4.4 7.0 5.9 3.1 7.2 20.0 7.5

Experimental Results
[Chart: area results, with the Xilinx E-1000 Area marked]

Summary
Area Estimates Based on Resource Sharing
- 0-1 Knapsack Problem Formulation Does Allow for Resource Sharing Estimates
Resource Sharing Algorithm
- PBRS Applied to Data Flow Graphs
Experimental Results
- ILP Overestimates Area Costs by as much as 374% and 582% for Pipelined and VLIW Datapaths, Respectively
