Physically Aware Data Communication Optimization for Hardware Synthesis


Ryan Kastner, Wenrui Gong, Xin Hao, Forrest Brewer
Dept. of Electrical and Computer Engineering, University of California, Santa Barbara
Adam Kaplan, Philip Brisk and Majid Sarrafzadeh
Computer Science Department, University of California, Los Angeles

Hardware Compilation
[Figure: application specified in a high-level language → compiler → HDL (behavioral, structural) → synthesis and physical design → chip, bitstream, …]
• We focus our efforts on mapping an application written in a high-level language to a hardware description.
• We want this mapping to have optimal characteristics (area, latency, etc.).
• In this talk, we focus on the problem of minimizing data communication in the final hardware.

Obligatory Design Flow Slide
[Figure: application specification → SUIF (syntactic & semantic analysis) → AST → Machine SUIF (compiler backend) → SSA CDFG → basic block entities 1–4 and a CFG entity → behavioral synthesis → logical & physical synthesis]
1. Create the interface.
2. Transform the instruction list to a dataflow graph.
3. Transform the dataflow graph to behavioral HDL code (basic block entity: "entity basic_block is … architecture behavioral of basic_block …").
4. Synthesize behavioral HDL code to RTL code (behavioral synthesis).
5. Create the CFG interface (CFG entity: "entity cfg is … architecture behavioral of cfg …").
6. Determine structural control and data communication between basic block entities.
7. Generate synthesizable RTL code.
8. Synthesize RTL code (logical & physical synthesis).
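Steps 1 and 2 of the flow can be illustrated with a minimal sketch. It assumes a basic block arrives as a list of three-address instructions `(destination, operator, operands)`, which is a hypothetical format and not the Machine SUIF representation. The sketch builds dataflow edges between instructions and collects the values that are read before they are defined in the block. Those values would have to enter the basic-block entity through its interface.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical instruction format: (destination, operator, operands)
Instr = Tuple[str, str, Tuple[str, ...]]

def build_dfg(block: List[Instr]) -> Tuple[Set[Tuple[int, int]], Set[str]]:
    """Return dataflow edges (producer index, consumer index) and the block's live-in values."""
    last_def: Dict[str, int] = {}          # variable -> index of its latest definition
    edges: Set[Tuple[int, int]] = set()
    live_in: Set[str] = set()
    for idx, (dest, _op, operands) in enumerate(block):
        for src in operands:
            if src in last_def:
                edges.add((last_def[src], idx))   # value produced earlier in the same block
            elif not src.isdigit():               # assumption: integer literals are constants
                live_in.add(src)                  # value must enter through the entity interface
        last_def[dest] = idx
    return edges, live_in

# Hypothetical basic block: t1 = a + b; t2 = t1 * c; t3 = t1 * t2; a = t3 + 1
block = [("t1", "+", ("a", "b")), ("t2", "*", ("t1", "c")),
         ("t3", "*", ("t1", "t2")), ("a", "+", ("t3", "1"))]
edges, live_in = build_dfg(block)
```

For this block the sketch yields the edges (0, 1), (0, 2), (1, 2) and (2, 3), with `a`, `b` and `c` as live-in values. `a` counts as live-in because its first use precedes its redefinition.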

Characterizing Data Communication  Examples of data communication schemes Control Node 1 Control Node 3 Control Node 2 Control Node 4 Memory (Register Bank, RAM) Control Node 4 Control Node 2 Control Node 3 Control Node 1 Bus DistributedCentralized Data communication = wire Data communication = memory access

Identifying Data Communication  Determine relationship between place(s) where data is defined and where data is used b  … a  …  a a  … c  … b  …  b  c  Naïve method: all use-points of a variable depend on all definitions of that variable  Not all use points “use” a variable Need analysis to minimize the amount of data communication  Global Data Communication = 5 variables

Use of SSA in Compilation
[Figure: the CFG above before and after SSA conversion; b ← …, a ← …, ← a, a ← …, c ← …, b ← …, ← b, ← c become b1 ← …, a2 ← …, ← a4, a3 ← …, a1 ← …, c1 ← …, b2 ← …, ← b1, ← c1, with a4 ← Φ(a2, a3)]
• We must determine the relationship between where data is generated and where data is used.
• Problem formulations:
  - [DAC02]: minimize the total number of bits communicated between all pairs of control nodes.
  - Today: minimize overall wirelength.
• SSA (Static Single Assignment):
  - Gives each variable a unique definition point.
  - Φ-nodes must be added to merge definitions.
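Where SSA must add Φ-nodes can be sketched with the standard dominance-frontier construction. A Φ for a variable goes at the iterated dominance frontier of the nodes that define it. This is a minimal textbook sketch and not the authors' SUIF-based implementation. It assumes every CFG node appears as a key of the successor map and that all nodes are reachable from the entry.

```python
from typing import Dict, List, Set

CFG = Dict[str, List[str]]   # node -> successors; every node must appear as a key

def dominators(succ: CFG, entry: str) -> Dict[str, Set[str]]:
    """Iterative data-flow computation of Dom(n); assumes all nodes are reachable."""
    pred: Dict[str, List[str]] = {n: [] for n in succ}
    for n, targets in succ.items():
        for t in targets:
            pred[t].append(n)
    dom = {n: set(succ) for n in succ}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in succ:
            if n == entry or not pred[n]:
                continue
            new = {n} | set.intersection(*(dom[p] for p in pred[n]))
            if new != dom[n]:
                dom[n], changed = new, True
    return dom

def dominance_frontiers(succ: CFG, entry: str) -> Dict[str, Set[str]]:
    """y is in DF(x) if x dominates a predecessor of y but does not strictly dominate y."""
    dom = dominators(succ, entry)
    df: Dict[str, Set[str]] = {n: set() for n in succ}
    for p, targets in succ.items():
        for y in targets:
            for x in dom[p]:
                if not (x in dom[y] and x != y):
                    df[x].add(y)
    return df

def iterated_dominance_frontier(defs: Set[str], df: Dict[str, Set[str]]) -> Set[str]:
    """Nodes that need a Φ for a variable defined in `defs` (worklist closure of DF)."""
    result: Set[str] = set()
    work = list(defs)
    while work:
        n = work.pop()
        for y in df[n]:
            if y not in result:
                result.add(y)
                work.append(y)
    return result

# Hypothetical CFGs: an if/else diamond, and a loop with header H and body L
diamond = {"entry": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
loop = {"entry": ["H"], "H": ["L", "X"], "L": ["H"], "X": []}
```

A variable defined in both branches `B` and `C` of the diamond needs a Φ at the join `D`. A variable defined in the loop body `L` needs a Φ at the header `H`.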

Physically Aware Compiler Transforms  Consider layout information during compilation  Modify transforms to consider physical info  Ideal: full physical synthesis – extremely accurate, but way too time consuming Physical Synthesis Hardware Compilation application Floor- planner  Approximate using floorplanning  Much faster  Gives “good enough” high level physical picture  Previous data communication work  No physical information  Can lead to negative results Let ’ s Get Physical!

Physically Aware Data Communication  Modify placement of Φ-functions to consider wirelength 1. Given a CFG G cfg (V cfg, E cfg ) 2. perform_ssa(G cfg ) 3. calculate_def_use_chains(G cfg ) 4. remove_back_edges(G cfg ) 5. topological_sort(G cfg ) 6. foreach vertex v  V cfg 7. foreach  -node   v 8. s  .sources 9. d  |def_use_chain( .dest)| 10. IDF  iterated_dominance_fronter(s) 11. PossiblePlacements  findPlacementOptions(IDF) 12. place(  )  selectBest(PossiblePlacements) 13. distribute/duplicate  to place(  )   -Placement Algorithm 1.Given a set of CFG Nodes R 2.  -options   3. insert(R) into  -options 4. foreach instruction i  R 5. if( i is a destination of  -function f ) 6. return  -options 7. temp_  -options   8. foreach non-dominated child c of R 9. temp_  -options  crossProductJoin(temp_  _options, findPlacementOptions(c)) 10. return  -options  temp_  -options FindPlacementOptions Algorithm

Algorithm in Action  FAST function from MediaBench testsuite F T T F N3 nn_4, i_2nn_5, i_3 N9

Algorithm in Action i F T T F nn_4, i_2nn_5, i_3 N3 N9 F T T F N3 nn_4, i_2nn_5, i_3 N9

Full Floorplanning Results
• Simple iterative approach:
  1. Initial optimization minimizes data communication.
  2. Full simulated annealing (SA) based floorplanning.
  3. Reoptimization based on the floorplanning results.
  4. Full SA-based floorplanning.
• Spectacularly negative results.
[Figure: hardware compilation → full floorplanner → physical synthesis]

Incremental Floorplanning  Incremental Placement [Coudert et al]:  Given an optimized placement and a set of changes to the netlist (e.g., due to technology remapping) modify the placement to improve it.  Equally applicable to floorplanning Initial Floorplan Modified Floorplan Perturbations floorplan modules (e.g. due to  -function movement) floorplan

Our Incremental Floorplanner
[Figure: initial floorplan and perturbations → incremental floorplanner → modified floorplan]

Our Incremental Floorplanner
1. Calculate the area and room of each node: bottom-up slicing tree traversal.
2. Area redistribution:
   - Top-down traversal.
   - Increase area if necessary:
     - Not enough space at the root.
     - Aspect ratios become too distorted.
• Simple, yet effective; other, more complicated algorithms might work better.
[Figure: incremental floorplan and modified floorplan]
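The two passes over the slicing tree can be sketched as follows. All of the following are assumptions, not details from the slides:

- The `Slice` node type is hypothetical.
- Rooms are split in proportion to required area.
- The aspect-ratio limit defaults to a made-up 4.0.
- When the root lacks space, the die is grown uniformly.

Uniform growth keeps every aspect ratio unchanged. The sketch therefore only reports distortion rather than repairing it.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Slice:
    """Hypothetical slicing-tree node: a leaf module, or a V/H cut over child slices."""
    cut: Optional[str] = None          # "V" = children side by side, "H" = stacked; None = leaf
    area: float = 0.0                  # leaves: module area after the perturbation (> 0)
    children: List["Slice"] = field(default_factory=list)
    width: float = 0.0                 # room assigned by the top-down pass
    height: float = 0.0

def compute_area(n: Slice) -> float:
    """Bottom-up traversal: an internal node needs the sum of its children's areas."""
    if n.children:
        n.area = sum(compute_area(c) for c in n.children)
    return n.area

def redistribute(n: Slice, width: float, height: float, max_aspect: float) -> bool:
    """Top-down traversal: split the node's room among children in proportion to area.
    Returns False if any room's aspect ratio exceeds max_aspect."""
    n.width, n.height = width, height
    if max(width, height) / min(width, height) > max_aspect:
        return False
    ok = True
    for c in n.children:
        share = c.area / n.area
        w, h = (width * share, height) if n.cut == "V" else (width, height * share)
        ok = redistribute(c, w, h, max_aspect) and ok
    return ok

def incremental_floorplan(root: Slice, width: float, height: float,
                          max_aspect: float = 4.0) -> Tuple[bool, float, float]:
    need = compute_area(root)
    if width * height < need:                       # not enough space at the root:
        scale = math.sqrt(need / (width * height))  # grow the die, keeping its aspect ratio
        width, height = width * scale, height * scale
    return redistribute(root, width, height, max_aspect), width, height

# Hypothetical 4 x 2 floorplan: module A (was 4, grows to 6 after a Φ move) beside B and C (2 each)
a = Slice(area=6.0)
root = Slice(cut="V", children=[a, Slice(cut="H", children=[Slice(area=2.0), Slice(area=2.0)])])
ok, w, h = incremental_floorplan(root, 4.0, 2.0)
```

In this example the perturbed modules need 10 units of area but the root room is only 8. The sketch grows the die to about 4.47 × 2.24 and hands module A a room of area 6.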

MediaBench Functions
Table columns: Benchmark, Blocks, Links, Weight, Initial WL
Benchmarks: adpcm coder, adpcm decoder, internal filter, internal expand, compress output, mpeg2dec block, mpeg2dec vector, FAST, FR4TR, det

Incremental Floorplanning Results
[Figure: normalized wirelength per benchmark]
• "Optimal" approach: 12% overall wirelength reduction; 25% Φ-node wirelength reduction.
• Our approach: 6% overall wirelength reduction; 8% Φ-node wirelength reduction.

Related Work  Hardware compilation projects using SSA  PDG+SSA form [UCSB]  CASH [CMU]  SA-C [UCR]  Sea Cucumber [BYU]  Physically aware behavioral synthesis techniques  SA for scheduling, binding and floorplanning [Prabhakaran97]  SA for binding and floorplanning [Yung-Ming94]  Scheduling, allocation and binding [Dougherty00]  Fasolt: bus topology [Knapp92]  High level synthesis [Tarafdar00]  Incremental CAD  Problem overview/challenges [Coudert00]  Floorplanning [Crenshaw99]

Conclusions  It’s been a long strange trip…  SSA a nice IR for hardware compilation  Explicitly shows data flow  Useful for exploiting parallelism  Compiler techniques applied to hardware design can reduce wirelength  They must be aware of physical information  They must use an incremental floorplanning