Layout Driven Data Communication Optimization for High Level Synthesis Ryan Kastner, Wenrui Gong, Xin Hao, Forrest Brewer Dept. of Electrical and Computer.

Presentation transcript:

Layout Driven Data Communication Optimization for High Level Synthesis Ryan Kastner, Wenrui Gong, Xin Hao, Forrest Brewer Dept. of Electrical and Computer Engineering University of California, Santa Barbara Adam Kaplan, Philip Brisk and Majid Sarrafzadeh Computer Science Department University of California, Los Angeles

High Level Synthesis
Input: Application description written in *C (C, SystemC, HandelC, SpecC)

    for (y_pos=ygrid_start-y_fmid-1,res_pos=0; y_pos<0; y_pos+=ygrid_step) {
      for (x_pos=xgrid_start-x_fmid-1; x_pos<0; x_pos+=xgrid_step,res_pos++) {
        (*reflect)(filt,x_fdim,y_fdim,x_pos, y_pos,temp,FILTER);
        sum=0.0;
        for (y_filt_lin=x_fdim,x_filt=y_im_lin=0; y_filt_lin<=filt_size; y_im_lin+=x_dim,y_filt_lin+=x_fdim)
          for (im_pos=y_im_lin; x_filt<y_filt_lin; x_filt++,im_pos++)
            sum+=image[im_pos]*temp[x_filt];
        result[res_pos] = sum;
      }
      first_col = x_pos+1;
      (*reflect)(filt,x_fdim,y_fdim,0,y_pos,temp,FILTER);

Internal filter of an image convolver
Intermediate representation: SSA CDFG
Maximize “performance” (area, latency, power, …) subject to input constraints
Output: “Hardware” (RTL Specification)

Target Architectures
- “Spatial” architectures
  - Local control between data path, global data flow between control nodes
  - Lots of distributed computational units, memory
  - Coarse/fine grained reconfigurable architectures
- Techniques could be used for other architectures
  - May not make sense
  - Our design flow has little resource sharing
[Figure: fine grain configurable platform; coarse grain programmable platform]

Obligatory Design Flow Slide
Tool flow: Application Specification, SUIF (Syntactic & Semantic Analysis), AST, Machine SUIF (Compiler Backend), SSA CDFG, Behavioral Synthesis, Logical & Physical Synthesis
1. Create interface
2. Transform instruction list to dataflow graph
3. Transform dataflow graph to behavioral HDL code
4. Synthesize behavioral HDL code to RTL code
5. Create CFG interface
6. Determine structural control and data communication between basic block entities
7. Generate synthesizable RTL code
8. Synthesize RTL code
Basic Block Entity: entity basic_block is … architecture behavioral of basic_block …
CFG Entity: entity cfg is … architecture behavioral of cfg …
[Figure: dataflow graph of + and * operations; Entity 1 through Entity 4 connected by control and data communication]

Design Example

    int FAST(real *b, int n) {
      real fn;
      int i, in, nn, n2pow, n4pow, nthpo;
      n2pow = fastlog2(n);
      if(n2pow <= 0) return 0;
      nthpo = n;
      fn = nthpo;
      n4pow = n2pow / 2;
      /* radix 2 iteration required; do it now */
      if(n2pow % 2) {
        nn = 2;
        in = n / nn;
        FR2TR(in, b, b + in);
      }
      else nn = 1;
      /* perform radix 4 iterations */
      for(i = 1; i <= n4pow; i++) {
        nn *= 4;
        in = n / nn;
        FR4TR(in, nn, b, b + in, b + 2 * in, b + 3 * in);
      }
      /* perform inplace reordering */
      FORD1(n2pow, b);
      FORD2(n2pow, b);
      /* take conjugates */
      for(i = 3; i < n; i += 2) b[i] = -b[i];
      return 1;
    }

[Figure: control nodes Node 1 through Node 10 of the function]
- “FAST” function from MediaBench
- Some nodes missing: simple computation, merged into others
- Lines below show data communication

Characterizing Data Communication
- Examples of data communication schemes
  - Distributed: Control Nodes 2, 3 and 4 connected directly. Data communication = wire
  - Centralized: Control Nodes 2, 3 and 4 connected through a bus to memory (register bank, RAM). Data communication = memory access

Identifying Data Communication
- Determine relationship between place(s) where data is defined and where data is used
- Naïve method: all use-points of a variable depend on all definitions of that variable
  - Not all use points “use” a variable
  - Need analysis to minimize the amount of data communication
[Figure: example CFG with definitions b ← …, a ← …, a ← …, c ← …, b ← … and uses of a, b and c]
- Global Data Communication = 5 variables

Use of SSA in Compilation
- Must determine relationship between where data is generated and where data is used
- Problem formulations
  - [DAC03]: Minimize the total number of bits communicated between all pairs of control nodes
  - Today: Minimize overall wirelength
- SSA (Static Single Assignment)
  - Changes each variable to have a unique definition point
  - Must add Φ-nodes to merge definitions
[Figure: the example CFG before SSA (b ← …, a ← …, ← a, a ← …, c ← …, b ← …, ← b, ← c) and after SSA (b_1 ← …, a_2 ← …, ← a_4, a_3 ← …, a_1 ← …, c_1 ← …, b_2 ← …, ← b_1, ← c_1, a_4 ← Φ(a_2, a_3))]
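As a minimal sketch of the "unique definition point" idea, the following renames variables in straight-line code only. The function name, the `(dest, sources)` statement encoding and the `_0` suffix for values live on entry are all illustrative assumptions; real SSA construction also needs Φ-nodes at control-flow merges, which this sketch does not handle.

```python
from collections import defaultdict

def rename_straight_line(stmts):
    """Give every definition a unique versioned name (straight-line code only).

    stmts is a list of (dest, [sources]) pairs; this encoding is hypothetical.
    """
    version = defaultdict(int)   # how many times each variable has been defined
    current = {}                 # variable -> its most recent versioned name
    renamed = []
    for dest, sources in stmts:
        # Uses refer to the latest definition; "_0" marks a value live on entry.
        new_sources = [current.get(s, s + "_0") for s in sources]
        version[dest] += 1
        new_dest = f"{dest}_{version[dest]}"
        current[dest] = new_dest
        renamed.append((new_dest, new_sources))
    return renamed

program = [("a", []), ("b", ["a"]), ("a", ["a", "b"])]
ssa = rename_straight_line(program)
```

After renaming, each use names exactly one definition, so the data communication between two points can be read directly off the names.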

SSA Fundamentals
- SSA algorithms
  - Find location of Φ-nodes
  - Rename variables
- Three main SSA algorithms
  - Minimal, Pruned: Cytron et al.
  - Semi-pruned: Briggs et al.
- Differ in number and location of Φ-nodes
  - Minimal: insert Φ-nodes at iterated dominance frontier (IDF)
  - Semi-pruned: insert Φ-node at IDF if variable live outside some basic block
  - Pruned: insert Φ-node at IDF if variable live at that time
[Figure: the example CFG (b_1 ← …, a_2 ← …, ← a_4, a_3 ← …, a_1 ← …, c_1 ← …, b_2 ← …, ← b_1, ← c_1) in three SSA forms]
  - Minimal: a_4 ← Φ(a_2, a_3), c_2 ← Φ(c_1), b_3 ← Φ(b_1, b_2)
  - Semi-pruned: a_4 ← Φ(a_2, a_3), b_3 ← Φ(b_1, b_2)
  - Pruned: a_4 ← Φ(a_2, a_3)
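The iterated dominance frontier used by all three variants can be sketched as follows. This is an illustrative implementation, not the authors' code: it assumes the CFG is given as a successor dictionary in which every node is reachable from the entry, and it uses a simple iterative dominator computation rather than a production-quality algorithm. All function names are hypothetical.

```python
def dominance_frontiers(succ, entry):
    """Return {node: dominance frontier} for a CFG given as {node: [successors]}."""
    nodes = list(succ)
    pred = {n: [] for n in nodes}
    for n in nodes:
        for s in succ[n]:
            pred[s].append(n)

    # Iterative dominator sets: dom[n] = {n} | intersection of dom[p] over preds p.
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes:
            if n == entry:
                continue
            preds = [dom[p] for p in pred[n]]
            new = {n} | (set.intersection(*preds) if preds else set())
            if new != dom[n]:
                dom[n], changed = new, True

    # Strict dominators form a chain; the immediate one has the largest dom set.
    idom = {n: max(dom[n] - {n}, key=lambda d: len(dom[d]))
            for n in nodes if n != entry}

    # Walk up from each predecessor of a join node until reaching its idom.
    df = {n: set() for n in nodes}
    for n in nodes:
        if len(pred[n]) >= 2:
            for p in pred[n]:
                runner = p
                while runner is not None and runner != idom.get(n):
                    df[runner].add(n)
                    runner = idom.get(runner)
    return df

def iterated_dominance_frontier(df, def_blocks):
    """Blocks where minimal SSA places a Φ-node for a variable defined in def_blocks."""
    idf, work = set(), list(def_blocks)
    while work:
        block = work.pop()
        for f in df[block]:
            if f not in idf:
                idf.add(f)
                work.append(f)   # a Φ-node is itself a new definition
    return idf

diamond = {"entry": ["left", "right"], "left": ["join"], "right": ["join"], "join": []}
df = dominance_frontiers(diamond, "entry")
phi_blocks = iterated_dominance_frontier(df, {"left", "right"})
```

Semi-pruned and pruned SSA start from the same candidate set and then discard candidates using liveness information, which is not modeled here.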

Results: SSA for Data Comm. Minimization
- Edge Weight w(i,j): number of bits communicated from node i to node j
- Total Edge Weight (TEW): corresponds to amount of data communication
[Figure: results on “MediaBench” marks]
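The two metrics can be sketched as below. The variable names, bit widths and node numbers are made-up illustration values, not data from the talk; the sketch assumes w(i,j) is the sum of the bit widths of the variables sent from node i to node j and that TEW is the sum of all edge weights.

```python
def edge_weights(communicated, width):
    """w(i, j): total bits of the variables communicated from node i to node j."""
    return {edge: sum(width[v] for v in variables)
            for edge, variables in communicated.items()}

def total_edge_weight(weights):
    """TEW: sum of all edge weights in the control data flow graph."""
    return sum(weights.values())

# Hypothetical example: which variables cross which control-node boundary.
communicated = {(1, 2): ["a", "b"], (2, 3): ["c"]}
width = {"a": 32, "b": 16, "c": 8}

w = edge_weights(communicated, width)
tew = total_edge_weight(w)
```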

Further Minimizing Data Communication
- Current SSA algorithms place Φ-nodes temporally
  - In software compilation, live ranges should be short
  - Appropriate in hardware?
[Figure: the example CFG (b_1 ← …, a_2 ← …, ← a_4, a_3 ← …, a_1 ← …, c_1 ← …, b_2 ← …, ← b_1, ← c_1, a_4 ← Φ(a_2, a_3)) under temporal Φ-node distribution and spatial Φ-node distribution; the two placements shown give TEW = 4 and TEW = 3]

Spatial Φ-nodes Distribution Algorithm
- d: number of uses of Φ-node destination
- s: number of Φ-node source values
- Number of temporal links
- Number of spatial links
- Example: a_3 ← Φ(a_0, a_1, a_2) followed by two uses of a_3 gives s = 3, d = 2
- Optimal assuming “ideal” n-dimensional floorplan
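The link-count formulas did not survive in this transcript, so the sketch below rests on an assumed model rather than on the slide: a temporal Φ-node receives each of its s sources once and fans out to its d uses (s + d links), while a spatially distributed Φ-node is duplicated at every use and each copy receives every source (s × d links). Treat both formulas and the function names as assumptions.

```python
def temporal_links(s, d):
    """Assumed model: s wires into one Φ-node plus d wires out to its uses."""
    return s + d

def spatial_links(s, d):
    """Assumed model: one Φ-node copy per use, each copy fed by all s sources."""
    return s * d

def choose_distribution(s, d):
    """Pick the scheme with fewer links under the assumed model (ties go temporal)."""
    return "spatial" if spatial_links(s, d) < temporal_links(s, d) else "temporal"

# The slide's example: a_3 <- Φ(a_0, a_1, a_2) with two uses of a_3.
s, d = 3, 2
counts = (temporal_links(s, d), spatial_links(s, d))
```

Under this model spatial distribution only reduces the link count when the Φ-node has a single source or a single use, which is consistent with the later slides treating the choice as floorplan dependent.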

Physically Aware Compiler Transforms
- Consider layout information during compilation
  - Modify transforms to consider physical info
- Ideal: full physical synthesis
  - Extremely accurate, but way too time consuming
- Approximate using floorplanning
  - Much faster
  - Gives “good enough” high level physical picture
- Our previous data comm. work
  - No physical information
  - Can lead to negative results
[Figure: application feeding Hardware Compilation, coupled to either Physical Synthesis or a Floorplanner]
Let’s Get Physical!

Physically Aware Data Communication
- Modify placement of Φ-functions to consider wirelength

Φ-Placement Algorithm

    Given a CFG G_cfg(V_cfg, E_cfg)
    perform_ssa(G_cfg)
    calculate_def_use_chains(G_cfg)
    remove_back_edges(G_cfg)
    topological_sort(G_cfg)
    foreach vertex v ∈ V_cfg
      foreach Φ-node φ ∈ v
        s ← φ.sources
        d ← |def_use_chain(φ.dest)|
        IDF ← iterated_dominance_frontier(s)
        PossiblePlacements ← findPlacementOptions(IDF)
        place(φ) ← selectBest(PossiblePlacements)
        distribute/duplicate φ to place(φ)

FindPlacementOptions Algorithm

    Given a set of CFG nodes R
    Φ-options ← ∅
    insert(R) into Φ-options
    foreach instruction i ∈ R
      if (i is a destination of Φ-function f)
        return Φ-options
    temp_Φ-options ← ∅
    foreach non-dominated child c of R
      temp_Φ-options ← crossProductJoin(temp_Φ-options, findPlacementOptions(c))
    return Φ-options ∪ temp_Φ-options
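One possible reading of FindPlacementOptions is sketched below. It is a simplification under stated assumptions: the search starts from a single CFG node rather than a set, the `children` map is supplied by the caller with back edges and dominated children already filtered out, and `use_nodes` is the set of nodes containing an instruction that consumes the Φ-node's destination. The join semantics (union of one option per child) are also an assumption, since the talk does not spell out crossProductJoin.

```python
def cross_product_join(a, b):
    """Combine option sets: every pairing of one option from a with one from b."""
    if not a:
        return set(b)
    return {x | y for x in a for y in b}

def find_placement_options(node, children, use_nodes):
    """Enumerate candidate placements, each a frozenset of nodes holding a Φ copy."""
    options = {frozenset([node])}      # placing the Φ-node here is always an option
    if node in use_nodes:
        return options                 # cannot push the Φ-node below its own use
    pushed_down = set()
    for child in children.get(node, []):
        pushed_down = cross_product_join(
            pushed_down, find_placement_options(child, children, use_nodes))
    return options | pushed_down

# Hypothetical acyclic CFG fragment: J branches to A and B; A branches to C and D.
children = {"J": ["A", "B"], "A": ["C", "D"]}
use_nodes = {"B", "C", "D"}
options = find_placement_options("J", children, use_nodes)
```

Each returned option is a complete placement: either the single temporal position, or a set of duplicated copies that together cover every path toward a use.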

Algorithm in Action
- Evaluate all options for Φ-nodes
- Replicate Φ when necessary
  - Limit amount of replication: most often leads to more wirelength
  - Can play tricks to limit redundant placements
[Figure: the example CFG (b_1 ← …, a_2 ← …, ← a_4, a_3 ← …, a_1 ← …, c_1 ← …, b_2 ← …, ← b_1, ← c_1, a_4 ← Φ(a_2, a_3)) with candidate Φ-node placements labeled Traditional (temporal) and Spatial [DAC03]]
- Any of these options could yield the best wirelength
- Highly dependent on the floorplan
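To illustrate why the best option depends on the floorplan, here is a hedged sketch of a selection step. The cost model is an assumption made for illustration: blocks are points, wires are measured by Manhattan distance, every Φ copy is wired to every source, and every use is wired to its nearest copy. The names and coordinates are hypothetical.

```python
def manhattan(a, b):
    """Manhattan distance between two (x, y) block positions."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def placement_wirelength(option, sources, uses, pos):
    """Assumed cost: sources feed every Φ copy; each use reads its nearest copy."""
    wirelength = 0
    for copy in option:
        wirelength += sum(manhattan(pos[s], pos[copy]) for s in sources)
    for use in uses:
        wirelength += min(manhattan(pos[copy], pos[use]) for copy in option)
    return wirelength

def select_best(options, sources, uses, pos):
    """Return the placement option with the smallest estimated wirelength."""
    return min(options, key=lambda o: placement_wirelength(o, sources, uses, pos))

options = [frozenset({"J"}), frozenset({"U1", "U2"})]
sources, uses = ["S1", "S2"], ["U1", "U2"]

# Floorplan 1: the join block J sits between the sources and the uses.
plan1 = {"S1": (0, 0), "S2": (0, 4), "J": (2, 2), "U1": (6, 0), "U2": (6, 4)}
# Floorplan 2: sources and uses are neighbors, J is far away.
plan2 = {"S1": (0, 0), "S2": (1, 0), "J": (10, 10), "U1": (0, 1), "U2": (1, 1)}

best1 = select_best(options, sources, uses, plan1)
best2 = select_best(options, sources, uses, plan2)
```

With the same CFG and the same candidate placements, the first floorplan favors the single temporal Φ-node and the second favors duplicated copies at the uses.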

Algorithm in Action
- FAST function from MediaBench test suite
[Figure: CFG of FAST with T/F branch edges, nodes N3 and N9, and Φ-node value groups nn_4, i_2 and nn_5, i_3]

Algorithm in Action
[Figure: two further views of the FAST CFG with T/F branch edges, nodes N3 and N9, and Φ-node value groups nn_4, i_2 and nn_5, i_3]

Full Floorplanning Results
- Simple iterative approach
  1. Initial optimization minimizes data communication
  2. Full SA based floorplanning
  3. Reoptimization based on the floorplan
  4. Full SA based floorplanning
- Spectacularly negative results
[Figure: Hardware Compilation iterating with a full floorplanner ahead of Physical Synthesis]

Incremental Floorplanning
- Incremental placement [Coudert et al.]: given an optimized placement and a set of changes to the netlist (e.g., due to technology remapping), modify the placement to improve it.
- Equally applicable to floorplanning
[Figure: an initial floorplan plus perturbations to floorplan modules (e.g., due to Φ-function movement) yields a modified floorplan]

Our Incremental Floorplanner
[Figure: the incremental floorplanner takes an initial floorplan and perturbations and produces a modified floorplan; slicing tree with per-node annotations]

Our Incremental Floorplanner
1. Calculate area & room of each node: bottom up slicing tree traversal
2. Area redistribution
  - Top down traversal
  - Increase area if necessary
    - Not enough space at root
    - Aspect ratios become too distorted
[Figure: slicing tree with per-node annotations; incremental floorplan and modified floorplan]
- Simple, yet effective
- Other more complicated algorithms might work better
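The two traversals could look roughly like the sketch below. The talk does not define "room" or the redistribution rule, so this is a guess for illustration only: "area" is taken as the area a module needs, "room" as the area currently allotted to it, and redistribution hands each child a share of its parent's room proportional to the child's area, growing the root when it is too small. Aspect ratios are ignored. The class and function names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class SliceNode:
    """Slicing-tree node; leaves carry module data, internal nodes are cuts."""
    name: str
    area: float = 0.0          # area the subtree needs (assumed meaning)
    room: float = 0.0          # area currently allotted (assumed meaning)
    children: list = field(default_factory=list)

def accumulate(node):
    """Bottom-up pass: an internal node's area and room are sums over its children."""
    if node.children:
        for child in node.children:
            accumulate(child)
        node.area = sum(c.area for c in node.children)
        node.room = sum(c.room for c in node.children)
    return node

def redistribute(node, room=None):
    """Top-down pass: split the available room among children by required area."""
    if room is None:
        room = max(node.room, node.area)   # grow the root if there is not enough space
    node.room = room
    for child in node.children:
        share = room * child.area / node.area if node.area else 0.0
        redistribute(child, share)
    return node

# Hypothetical perturbation: module A grew from 6 to 10 area units.
a = SliceNode("A", area=10, room=6)
b = SliceNode("B", area=8, room=6)
root = redistribute(accumulate(SliceNode("cut", children=[a, b])))
```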

MediaBench Functions
Table columns: Benchmark, Blocks, Φ Links, Weight, Initial WL
Benchmarks: adpcm coder, adpcm decoder, internal filter, internal expand, compress output, mpeg2dec block, mpeg2dec vector, FAST, FR4TR, det

Incremental Floorplanning Results
[Figure: normalized wirelength per benchmark, with average]
- “Optimal” approach: 12% overall wirelength reduction, 25% Φ-node wirelength reduction
- Our approach: 6% overall wirelength reduction, 8% Φ-node wirelength reduction

Related Work
- Hardware compilation projects using SSA
  - PDG+SSA form [UCSB]
  - CASH [CMU]
  - SA-C [UCR]
  - Sea Cucumber [BYU]
- Physically aware behavioral synthesis techniques
  - SA for scheduling, binding and floorplanning [Prabhakaran97]
  - SA for binding and floorplanning [Yung-Ming94]
  - Scheduling, allocation and binding [Dougherty00]
  - Fasolt: bus topology [Knapp92]
  - High level synthesis [Tarafdar00]
- Incremental CAD
  - Problem overview/challenges [Coudert00]
  - Floorplanning [Crenshaw99]

Conclusions
- It’s been a long strange trip…
- SSA is a nice IR for hardware compilation
  - Explicitly shows data flow
  - Useful for exploiting parallelism
- Compiler techniques applied to hardware design can reduce wirelength
  - They must be aware of physical information
  - They must use incremental floorplanning

Questions? (and cue for applause)