Data Communication Estimation and Reduction for Reconfigurable Systems. Adam Kaplan, Philip Brisk, Ryan Kastner. Computer Science; Elec. and Computer Engineering.


1 Data Communication Estimation and Reduction for Reconfigurable Systems
Adam Kaplan, Philip Brisk, Ryan Kastner
Computer Science, University of California, Los Angeles; Elec. and Computer Engineering, University of California, Santa Barbara
June 4, 2003

2 From Algorithm to HDL
Flow diagram: application specified in system-level language → compiler → HDL (behavioral, structural) → synthesis and physical design
- We focus our efforts on mapping an application written in a high-level language to a hardware description.
- We desire this mapping to have optimal characteristics (area, latency, etc.).
- In this talk, we focus on the problem of minimizing data communication in the final hardware.

3 Similar Compilation Projects
Hardware compilers
- Reconfigurable architecture
  - PRISM project – synthesize subset of C to FPGA
  - Garp compiler (BRASS) – synthesize C to processor + FPGA platform
  - DEFACTO – synthesize SUIF to FPGA (Wildstar)
- General architecture
  - DeepC compiler – synthesize C to HDL
  - MATCH compiler – synthesize Matlab to HDL
  - PICO – synthesize nested loops into VLIW-like functional unit

4 Our Framework
Flow diagram: C code → SUIF/MachSUIF compiler → Control Data-Flow Graph (CDFG), drawn with Control Nodes 1 to 4 → hardware description
- From the SUIF IR, we construct a CDFG representation.
- Each basic block of the CDFG becomes a separate synthesizable module in the hardware description.

5 Characterizing Data Communication
Two examples of data communication schemes:
- Distributed: Control Nodes 1 to 4 are connected directly; data communication = wire
- Centralized: Control Nodes 1 to 4 share a bus to memory (register bank, RAM); data communication = storage access

6 Identifying Data Communication
- Determine relationship between place(s) where data is defined and where data is used
- Naïve method: all use-points of a variable depend on all definitions of that variable
  - Not all use points “use” a variable
- Need analysis to minimize the amount of data communication
Example (figure): definitions b ← …, a ← …, a ← …, c ← …, b ← …; uses ← a, ← b, ← c. Global Data Communication = 5 variables
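The naïve method on this slide can be sketched in a few lines. The four-node program below (`defs`, `uses`) is a hypothetical example, not the one in the slide's figure, and `naive_edges` is an illustrative name. Because the method ignores control flow, it can report transfers that a reaching-definitions analysis would rule out.

```python
# Hypothetical program: the variables each control node defines and uses.
defs = {1: {"a", "b", "c"}, 2: {"a"}, 3: {"b"}, 4: set()}
uses = {1: set(), 2: set(), 3: set(), 4: {"a", "b", "c"}}


def naive_edges(defs, uses):
    """Link every node defining a variable to every other node using it."""
    edges = set()
    for d_node, d_vars in defs.items():
        for u_node, u_vars in uses.items():
            if d_node == u_node:
                continue  # a value that stays inside one node is not communicated
            for var in d_vars & u_vars:
                edges.add((d_node, u_node, var))
    return edges


edges = naive_edges(defs, uses)
print(sorted(edges))
```

Each tuple is (defining node, using node, variable); under this scheme node 4 receives `a` from both nodes that define it, and likewise for `b`.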

7 Minimizing Data Communication
- Must determine relationship between where data is generated and where data is used
- Problem formulation: minimize the total number of bits communicated between all pairs of control nodes
- SSA (Static Single Assignment)
  - Changes each variable to have a unique definition point
  - Must add φ-nodes to merge definitions
Example (figure), before SSA: b ← …, a ← …, ← a, a ← …, c ← …, b ← …, ← b, ← c
Example (figure), after SSA: b1 ← …, a2 ← …, ← a4, a3 ← …, a1 ← …, c1 ← …, b2 ← …, ← b1, ← c1, a4 ← φ(a2, a3)
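To make "a unique definition point" concrete, here is a minimal sketch of SSA renaming restricted to a single basic block, where no φ-nodes are needed. The statement encoding (a target plus a list of operand names) and the function name `rename_block` are assumptions made for illustration only.

```python
def rename_block(stmts):
    """Give every assignment in one straight-line block a fresh version number.

    stmts is a list of (target, operands) pairs. Operands that were never
    assigned in this block are left unversioned (they come from outside).
    """
    version = {}
    renamed = []
    for target, operands in stmts:
        # Uses refer to the most recent version of each variable.
        new_ops = [f"{v}{version[v]}" if v in version else v for v in operands]
        # Each definition creates a new, unique name.
        version[target] = version.get(target, 0) + 1
        renamed.append((f"{target}{version[target]}", new_ops))
    return renamed


block = [("a", ["x"]), ("b", ["a"]), ("a", ["a", "b"])]
ssa_block = rename_block(block)
print(ssa_block)
```

Across basic blocks the same idea applies, except that a use reached by several definitions must first merge them through a φ-node.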

8 Using SSA to Minimize Data Communication
- SSA algorithms
  - Find location of φ-nodes
  - Rename variables
- Three main SSA algorithms
  - Minimal, Pruned – Cytron et al.
  - Semi-pruned – Briggs et al.
- Differ in number and location of φ-nodes
  - Minimal – insert φ-nodes at iterated dominance frontier (IDF)
  - Semi-pruned – insert φ-node at IDF if variable live outside some basic block
  - Pruned – insert φ-node at IDF if variable live at that time
Example (figure): all three forms contain b1 ← …, a2 ← …, ← a4, a3 ← …, a1 ← …, c1 ← …, b2 ← …, ← b1, ← c1 and differ only in their φ-nodes:
- Minimal: a4 ← φ(a2, a3), c2 ← φ(c1), b3 ← φ(b1, b2)
- Semi-pruned: a4 ← φ(a2, a3), b3 ← φ(b1, b2)
- Pruned: a4 ← φ(a2, a3)
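The difference between the three φ-placement rules can be sketched as follows. This is a simplified illustration, not the published algorithms: it assumes every node is reachable from the entry, it takes the set of non-local names and the live-in sets as given inputs rather than computing them, and the diamond-shaped CFG, the variable names, and all function names are hypothetical.

```python
def dominators(succ, entry):
    """Iteratively compute the set of dominators of every node."""
    nodes = set(succ)
    preds = {n: {p for p in nodes if n in succ[p]} for n in nodes}
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            new = {n} | set.intersection(*(dom[p] for p in preds[n]))
            if new != dom[n]:
                dom[n], changed = new, True
    return dom, preds


def dominance_frontier(succ, entry):
    """DF(x) = nodes y where x dominates a predecessor of y but not strictly y."""
    dom, preds = dominators(succ, entry)
    df = {n: set() for n in succ}
    for y in succ:
        for p in preds[y]:
            for x in dom[p]:
                if x == y or x not in dom[y]:
                    df[x].add(y)
    return df


def iterated_df(df, def_nodes):
    """Close the dominance frontier of the definition nodes under iteration."""
    result, work = set(), list(def_nodes)
    while work:
        for y in df[work.pop()]:
            if y not in result:
                result.add(y)
                work.append(y)  # a phi-node is itself a new definition
    return result


def phi_sites(df, def_nodes, keep=lambda var, node: True):
    """Candidate phi sites per variable, filtered by the rule `keep`."""
    return {v: {n for n in iterated_df(df, nodes) if keep(v, n)}
            for v, nodes in def_nodes.items()}


# Hypothetical diamond CFG: 1 -> {2, 3} -> 4.
succ = {1: [2, 3], 2: [4], 3: [4], 4: []}
def_nodes = {"a": {2, 3}, "b": {1, 3}, "t": {2, 3}}
non_local = {"a", "b"}   # assumed: t is only ever used inside its own block
live_in = {4: {"a"}}     # assumed: only a is live on entry to node 4

df = dominance_frontier(succ, 1)
minimal = phi_sites(df, def_nodes)
semi_pruned = phi_sites(df, def_nodes, lambda v, n: v in non_local)
pruned = phi_sites(df, def_nodes, lambda v, n: v in live_in.get(n, set()))
```

On this input the minimal rule places three φ-nodes at node 4, the semi-pruned rule drops the one for the block-local `t`, and the pruned rule also drops the one for `b`, which is dead at node 4.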

9 Experimental Setup
Flow diagram: CDFG → SSA conversion → CDFG in SSA form → HDL generation → Synopsys Behavioral / Design Compiler

10 MediaBench Benchmark Suite
- A benchmark suite of DSP applications [Lee et al]
- DSP applications well suited to hardware implementation
- Tend to: be parallelizable; be computationally intensive; often have large basic blocks
Sample code: internal filter of an image convolver

    for (y_pos=ygrid_start-y_fmid-1,res_pos=0; y_pos<0; y_pos+=ygrid_step)
      {
      for (x_pos=xgrid_start-x_fmid-1; x_pos<0; x_pos+=xgrid_step,res_pos++)
        {
        (*reflect)(filt,x_fdim,y_fdim,x_pos,y_pos,temp,FILTER);
        sum=0.0;
        for (y_filt_lin=x_fdim,x_filt=y_im_lin=0;
             y_filt_lin<=filt_size;
             y_im_lin+=x_dim,y_filt_lin+=x_fdim)
          for (im_pos=y_im_lin; x_filt<y_filt_lin; x_filt++,im_pos++)
            sum+=image[im_pos]*temp[x_filt];
        result[res_pos] = sum;
        }
      first_col = x_pos+1;
      (*reflect)(filt,x_fdim,y_fdim,0,y_pos,temp,FILTER);

11 Results: SSA for Data Comm. Minimization
- Edge Weight w(i,j) – number of bits communicated from node i to j
- Total Edge Weight (TEW) – corresponds to amount of data communication
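The two definitions on this slide can be illustrated with a short sketch that takes TEW to be the sum of w(i,j) over all ordered pairs of control nodes. The transfer list and the bit widths below are made-up inputs, and `edge_weights` and `total_edge_weight` are illustrative names.

```python
def edge_weights(transfers, width):
    """w(i, j): total bits sent from node i to node j.

    transfers is a list of (source node, destination node, variable);
    width maps each variable to its size in bits.
    """
    w = {}
    for src, dst, var in transfers:
        w[(src, dst)] = w.get((src, dst), 0) + width[var]
    return w


def total_edge_weight(w):
    """TEW: the sum of all edge weights."""
    return sum(w.values())


# Hypothetical transfers and bit widths.
transfers = [(1, 4, "a"), (2, 4, "a"), (1, 4, "c")]
width = {"a": 32, "c": 8}

w = edge_weights(transfers, width)
tew = total_edge_weight(w)
print(w, tew)
```

Weighting by bit width rather than counting variables means that removing one wide transfer can matter more than removing several narrow ones.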

12 Results: SSA for Area Minimization

13 Relationship Between φ-nodes and Data Communication

14 Further Minimizing Data Communication
- Current SSA algorithms place φ-nodes temporally
- In software compilation, live ranges should be short. Appropriate in hardware?
Example (figure): the program b1 ← …, a2 ← …, ← a4, a3 ← …, a1 ← …, c1 ← …, b2 ← …, ← b1, ← c1, a4 ← φ(a2, a3), drawn twice:
- Temporal φ-node distribution: TEW = 4
- Spatial φ-node distribution: TEW = 3

15 Effect of φ-node Distribution
Figure panels: temporal φ-node placement; spatial φ-node placement

16 Spatial φ-nodes Distribution Algorithm
- d – number of uses of φ-node destination
- s – number of φ-node source values
- Number of temporal links
- Number of spatial links
Example (figure): a3 ← φ(a0, a1, a2), ← a3; s = 3, d = 2
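The transcript does not preserve the slide's expressions for the two link counts. The sketch below therefore rests on an assumed model, not on the slide: a temporal φ-node receives one link from each source and sends one link to each use (s + d), while spatial distribution copies the φ-node to every use so that each source links to each use (s × d). The function names are hypothetical.

```python
def temporal_links(s, d):
    """Assumed model: s sources feed one phi-node, which feeds d uses."""
    return s + d


def spatial_links(s, d):
    """Assumed model: a phi-node at each of the d uses, each fed by s sources."""
    return s * d


def prefer_spatial(s, d):
    """Distribute the phi-node spatially only if that needs fewer links."""
    return spatial_links(s, d) < temporal_links(s, d)


# The slide's example has s = 3 sources and d = 2 uses.
print(temporal_links(3, 2), spatial_links(3, 2), prefer_spatial(3, 2))
# A phi-node with two sources and a single use.
print(temporal_links(2, 1), spatial_links(2, 1), prefer_spatial(2, 1))
```

Under this model spatial distribution pays off only when a φ-node has very few sources or uses, which is why a per-φ-node decision is needed rather than a global switch.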

17 Spatial SSA Results – Num. Spatial φ-nodes

18 Spatial SSA Results – TEW after spatial SSA

19 Area after spatial SSA (from Synopsys)

20 Conclusion
- In this work, we demonstrate a mapping from compiler IR (CDFG) to hardware description.
- SSA binds variables to values, which is useful in reducing data communication between control nodes.
- Spatial distribution of φ-nodes can reduce data communication, modeled as total edge weight (TEW), by as much as 20%.
  - However, circuit area sometimes increases…
  - Future research: refine the model using information from later stages of synthesis.
- Compiler techniques applied to hardware design can greatly reduce data communication.

