1/20 Data Communication Estimation and Reduction for Reconfigurable Systems Adam Kaplan Philip Brisk Ryan Kastner Computer Science Elec. and Computer Engineering.

Presentation transcript:

1/20 Data Communication Estimation and Reduction for Reconfigurable Systems
Adam Kaplan, Philip Brisk: Computer Science, University of California, Los Angeles
Ryan Kastner: Elec. and Computer Engineering, University of California, Santa Barbara
June 4, 2003

2/20 From Algorithm to HDL
Flow: application specified in system-level language, then compiler, then HDL (behavioral, structural), then synthesis and physical design.
We focus our efforts on mapping an application written in a high-level language to a hardware description.
We desire this mapping to have optimal characteristics (area, latency, etc.).
In this talk, we focus on the problem of minimizing data communication in the final hardware.

3/20 Similar Compilation Projects
Hardware compilers
  Reconfigurable Architecture
    PRISM project – synthesize subset of C to FPGA
    Garp compiler (BRASS) – synthesize C to processor + FPGA platform
    DEFACTO – synthesize SUIF to FPGA (Wildstar)
  General Architecture
    DeepC compiler – synthesize C to HDL
    MATCH compiler – synthesize Matlab to HDL
    PICO – synthesize nested loops into VLIW-like functional unit

4/20 Our Framework
Flow: C Code, then SUIF/MachSUIF Compiler, then Control Data-Flow Graph (CDFG), then Hardware Description.
From the SUIF IR, we construct a CDFG representation.
Each basic block of the CDFG becomes a separate synthesizable module in the hardware description.
Diagram labels: Control Node 1, Control Node 2, Control Node 3, Control Node 4.

5/20 Characterizing Data Communication
Two examples of data communication schemes:
  Distributed: Control Nodes 1 to 4 connected to each other; data communication = wire
  Centralized: Control Nodes 1 to 4 connected by a bus to a memory (register bank, RAM); data communication = storage access

6/20 Identifying Data Communication
Determine relationship between place(s) where data is defined and where data is used.
Example statements in the CDFG: b ← …, a ← …, ← a, a ← …, c ← …, b ← …, ← b, ← c
Naïve method: all use-points of a variable depend on all definitions of that variable.
  Not all use points “use” a variable.
Need analysis to minimize the amount of data communication.
Global Data Communication = 5 variables
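The naïve method above can be sketched in a few lines of Python. This is only an illustration: the control nodes `n1` to `n3`, their variables, and the helper `naive_links` are hypothetical and do not come from the slides.

```python
from collections import defaultdict

# Hypothetical control nodes: each lists the variables it defines and uses.
blocks = {
    "n1": {"defs": {"a", "b"}, "uses": set()},
    "n2": {"defs": {"a"}, "uses": {"b"}},
    "n3": {"defs": {"c"}, "uses": {"a"}},
}

def naive_links(blocks):
    """Pair every definition of a variable with every use in another node."""
    def_sites = defaultdict(set)
    for name, info in blocks.items():
        for var in info["defs"]:
            def_sites[var].add(name)
    links = set()
    for name, info in blocks.items():
        for var in info["uses"]:
            for src in def_sites[var]:
                if src != name:
                    links.add((src, name, var))
    return links

links = naive_links(blocks)
for src, dst, var in sorted(links):
    print(f"{src} -> {dst} carries {var}")
```

Here `n3` receives `a` from both `n1` and `n2`, because the naïve method cannot tell which definition actually reaches the use. That over-approximation is what the later analysis is meant to remove.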

7/20 Minimizing Data Communication
Must determine relationship between where data is generated and where data is used.
Problem formulation: minimize the total number of bits communicated between all pairs of control nodes.
SSA (Static Single Assignment)
  Changes each variable to have a unique definition point
  Must add φ-nodes to merge definitions
Before SSA: b ← …, a ← …, ← a, a ← …, c ← …, b ← …, ← b, ← c
After SSA: b1 ← …, a2 ← …, ← a4, a3 ← …, a1 ← …, c1 ← …, b2 ← …, ← b1, ← c1, a4 ← φ(a2, a3)

8/20 Using SSA to Minimize Data Communication
SSA algorithms
  Find location of φ-nodes
  Rename variables
Three main SSA algorithms
  Minimal, Pruned – Cytron et al.
  Semi-pruned – Briggs et al.
Differ in number and location of φ-nodes
  Minimal – insert φ-nodes at iterated dominance frontier (IDF)
  Semi-pruned – insert φ-node at IDF if variable live outside some basic block
  Pruned – insert φ-node at IDF if variable live at that time
Example (statements common to all three forms: b1 ← …, a2 ← …, ← a4, a3 ← …, a1 ← …, c1 ← …, b2 ← …, ← b1, ← c1)
  Minimal: a4 ← φ(a2, a3), c2 ← φ(c1), b3 ← φ(b1, b2)
  Semi-pruned: a4 ← φ(a2, a3), b3 ← φ(b1, b2)
  Pruned: a4 ← φ(a2, a3)
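A rough Python sketch of how the three φ-placement flavors differ. It assumes a hypothetical diamond-shaped CFG (`entry`, `left`, `right`, `join`) and hypothetical variables `a`, `b`, `t`; this is not the CDFG drawn on the slide. The pruned flavor is simplified here to "compute the iterated dominance frontier, then keep only sites where the variable is live", and every non-entry block is assumed to be reachable.

```python
# Hypothetical diamond CFG; block and variable names are illustrative only.
succ = {"entry": ["left", "right"], "left": ["join"], "right": ["join"], "join": []}
defs = {"entry": {"a", "b"}, "left": {"a", "b", "t"}, "right": {"a", "t"}, "join": set()}
# Upward-exposed uses: variables read in a block before any local definition.
uses = {"entry": set(), "left": set(), "right": {"b"}, "join": {"a"}}

def predecessors(succ):
    pred = {n: [] for n in succ}
    for n, outs in succ.items():
        for m in outs:
            pred[m].append(n)
    return pred

def dominators(succ, entry):
    """Iterative data-flow computation of each block's dominator set."""
    pred = predecessors(succ)
    dom = {n: set(succ) for n in succ}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in succ:
            if n == entry:
                continue
            new = {n} | set.intersection(*(dom[p] for p in pred[n]))
            if new != dom[n]:
                dom[n], changed = new, True
    return dom

def dominance_frontiers(succ, entry):
    pred, dom = predecessors(succ), dominators(succ, entry)
    df = {n: set() for n in succ}
    for m in succ:
        for p in pred[m]:
            for n in dom[p]:                   # n dominates a predecessor of m
                if n == m or n not in dom[m]:  # n does not strictly dominate m
                    df[n].add(m)
    return df

def phi_sites(def_blocks, df):
    """Iterated dominance frontier (IDF) of the blocks defining one variable."""
    sites, work = set(), list(def_blocks)
    while work:
        n = work.pop()
        for m in df[n]:
            if m not in sites:
                sites.add(m)
                work.append(m)  # a phi-node is itself a new definition
    return sites

def live_in(succ, defs, uses):
    """Backward liveness: variables live on entry to each block."""
    live = {n: set(uses[n]) for n in succ}
    changed = True
    while changed:
        changed = False
        for n in succ:
            out = set().union(*(live[s] for s in succ[n]))
            new = uses[n] | (out - defs[n])
            if new != live[n]:
                live[n], changed = new, True
    return live

def place_phis(succ, defs, uses, entry, flavor):
    df = dominance_frontiers(succ, entry)
    live = live_in(succ, defs, uses)
    non_local = set().union(*uses.values())  # names live across a block boundary
    placed = set()
    for var in sorted(set().union(*defs.values())):
        if flavor == "semi-pruned" and var not in non_local:
            continue
        def_blocks = [n for n in succ if var in defs[n]]
        for site in phi_sites(def_blocks, df):
            if flavor == "pruned" and var not in live[site]:
                continue
            placed.add((site, var))
    return placed

minimal = place_phis(succ, defs, uses, "entry", "minimal")
semi_pruned = place_phis(succ, defs, uses, "entry", "semi-pruned")
pruned = place_phis(succ, defs, uses, "entry", "pruned")
```

On this toy CFG the minimal form places three φ-nodes at `join`, the semi-pruned form drops the one for the block-local temporary `t`, and the pruned form keeps only the one for `a`, which is the only variable live at `join`. The 3, 2, 1 progression mirrors the counts in the slide's example, although the graph itself is invented.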

9/20 Experimental Setup
Flow: CDFG, then SSA Conversion, then CDFG in SSA form, then HDL Generation, then Synopsys Behavioral / Design Compiler.

10/20 MediaBench Benchmark Suite
A benchmark suite of DSP applications [Lee et al]
DSP applications are well suited to hardware implementation. They tend to:
  be parallelizable
  be computationally intensive
  often have large basic blocks
Sample code: internal filter of an image convolver

  for (y_pos=ygrid_start-y_fmid-1,res_pos=0; y_pos<0; y_pos+=ygrid_step) {
    for (x_pos=xgrid_start-x_fmid-1; x_pos<0; x_pos+=xgrid_step,res_pos++) {
      (*reflect)(filt,x_fdim,y_fdim,x_pos,y_pos,temp,FILTER);
      sum=0.0;
      for (y_filt_lin=x_fdim,x_filt=y_im_lin=0; y_filt_lin<=filt_size; y_im_lin+=x_dim,y_filt_lin+=x_fdim)
        for (im_pos=y_im_lin; x_filt<y_filt_lin; x_filt++,im_pos++)
          sum+=image[im_pos]*temp[x_filt];
      result[res_pos] = sum;
    }
    first_col = x_pos+1;
    (*reflect)(filt,x_fdim,y_fdim,0,y_pos,temp,FILTER);

11/20 Results: SSA for Data Comm. Minimization
Edge Weight w(i,j) – number of bits communicated from node i to j
Total Edge Weight (TEW) – corresponds to amount of data communication
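A minimal sketch of the metric, assuming TEW is the plain sum of w(i,j) over all communicating pairs of control nodes, in line with the problem formulation on slide 7. The node names, variables, and bit widths below are hypothetical.

```python
from collections import defaultdict

# Hypothetical bit widths and inter-node links (source node, destination node, variable).
bit_width = {"a": 32, "b": 16, "c": 8}
links = [("n1", "n2", "a"), ("n1", "n2", "b"), ("n2", "n3", "a"), ("n1", "n3", "c")]

def edge_weights(links, bit_width):
    """w(i, j): total bits communicated from node i to node j."""
    w = defaultdict(int)
    for src, dst, var in links:
        w[(src, dst)] += bit_width[var]
    return dict(w)

def total_edge_weight(w):
    """TEW: sum of w(i, j) over all communicating pairs (assumed definition)."""
    return sum(w.values())

w = edge_weights(links, bit_width)
tew = total_edge_weight(w)
print(w, tew)
```

Removing a link, for example because SSA shows that a definition never reaches a use, lowers w(i,j) for that pair and therefore lowers TEW.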

12/20 Results: SSA for Area Minimization

13/20 Relationship Between φ-nodes and Data Communication

14/20 Further Minimizing Data Communication
Current SSA algorithms place φ-nodes temporally.
  In software compilation, live ranges should be short. Appropriate in hardware?
Example (b1 ← …, a2 ← …, ← a4, a3 ← …, a1 ← …, c1 ← …, b2 ← …, ← b1, ← c1, a4 ← φ(a2, a3)):
  Temporal φ-node distribution: TEW = 4
  Spatial φ-node distribution: TEW = 3

15/20 Effect of φ-node Distribution
Chart series: temporal φ-node placement, spatial φ-node placement

16/20 Spatial φ-nodes Distribution Algorithm
d – number of uses of φ-node destination
s – number of φ-node source values
Number of temporal links
Number of spatial links
Example: a3 ← φ(a0, a1, a2), ← a3; s = 3, d = 2
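The link-count formulas themselves do not survive in this transcript. The sketch below therefore rests on an assumed model, not on the slide: a temporal φ-node sits in one place, so it needs one link per source plus one per use (s + d), while a spatially distributed φ-node is replicated at each use, so every source feeds every use directly (s × d). The function names are hypothetical.

```python
def temporal_links(s, d):
    # Assumed model: one phi-node, fed by s sources and feeding d uses.
    return s + d

def spatial_links(s, d):
    # Assumed model: phi-node replicated at each use; each source feeds each use.
    return s * d

def prefer_spatial(s, d):
    """Distribute the phi-node spatially only when that needs fewer links."""
    return spatial_links(s, d) < temporal_links(s, d)

# Slide example: a3 <- phi(a0, a1, a2) with two uses, so s = 3 and d = 2.
print(temporal_links(3, 2), spatial_links(3, 2), prefer_spatial(3, 2))
# A phi-node with two sources and a single use, such as a4 <- phi(a2, a3).
print(temporal_links(2, 1), spatial_links(2, 1), prefer_spatial(2, 1))
```

Under this assumed model, spatial distribution only wins when a φ-node has a single source or a single use, which would make it a per-φ-node decision, not a global switch.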

17/20 Spatial SSA Results – Num. Spatial φ-nodes

18/20 Spatial SSA Results –  TEW after spatial SSA

19/20  area After Spatial SSA (from Synopsys)

20/20 Conclusion
In this work, we demonstrate a mapping from compiler IR (CDFG) to hardware description.
SSA binds variables to values, which is useful in reducing data communication between control nodes.
Spatial distribution of phi nodes can reduce data communication, modeled as total edge weight (TEW), by as much as 20%.
  However, circuit area sometimes increases…
Future research: refine the model using information from later stages of synthesis.
Compiler techniques applied to hardware design can greatly reduce data communication.