Memory Efficient Software Synthesis from Dataflow Graph
Wonyong Sung, Junedong Kim, Soonhoi Ha
Codesign and Parallel Processing Lab., Seoul National University

Contents
- Introduction
- Code Generation from Block Diagram Specification
- Synchronous Data Flow and the Single Appearance Schedule
- Proposed Strategies
  - Optimization 1: code sharing optimization
  - Optimization 2: minimize the buffer requirement
- Experiments
- Conclusions

Introduction
- Motivations
  - Embedded systems have a limited amount of memory: a large program means memory cost, performance penalties, and power consumption.
  - New trend of software development: high-level design methodology, driven by growing complexity, fast design turn-around times, limited budgets, etc.
- Goal of the research
  - Reduce the code and data size of automatically generated software
  - in an automatic software synthesis environment, where the specification is a dataflow graph with SDF (Synchronous DataFlow) semantics.

Software Synthesis from an SDF Graph

Example graph: A -> B -> C -> D (the per-arc token rates appeared in the original figure).

Possible schedules:
- AABCABACDABABCD (flat)
- (6A)(4B)(3C)(2D)
- (2(3A2B))(3C)(2D)

Code generated for (6A)(4B)(3C)(2D):

    main(){
      for(i=0;i<6;i++){A}
      for(i=0;i<4;i++){B}
      for(i=0;i<3;i++){C}
      for(i=0;i<2;i++){D}
    }

Code generated for (2(3A2B))(3C)(2D):

    main(){
      for(i=0;i<2;i++){
        for(j=0;j<3;j++){A}
        for(j=0;j<2;j++){B}
      }
      for(i=0;i<3;i++){C}
      for(i=0;i<2;i++){D}
    }

The last two are Single Appearance Schedules (SAS): each node appears exactly once in the schedule, so its code block is emitted exactly once.
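The repetition counts in these schedules come from the SDF balance equations: on every arc, the tokens produced per schedule period must equal the tokens consumed. In LaTeX notation (the per-firing rates p and c were arc annotations in the original figure and are not recoverable here):

\[ r_X \cdot p_{X \to Y} = r_Y \cdot c_{X \to Y} \quad \text{for every arc } X \to Y \]

For this graph the smallest positive integer solution is (r_A, r_B, r_C, r_D) = (6, 4, 3, 2), which is exactly the set of loop counts appearing in the schedules above.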

Previous Efforts
- Single Appearance Schedule (SAS): APGAN, RPMC [Bhattacharyya et al., Ptolemy group]
  - an SAS guarantees the minimum code size (without code sharing)
  - APGAN, RPMC: heuristics to find a data-minimized SAS schedule
- ILP formulation for data memory minimization [Ritz et al., Meyr group]
  - flat single appearance schedule + sharing of data buffers
- Rate-optimal compile-time schedule [Govindarajan et al., Gao group]
  - minimizes the buffer requirement using linear programming
- An algorithm to compute the smallest data buffer size [Ade et al., GRAPE group]

Proposed Strategies
- Coding style
  - not tied to a single coding style: a hybrid approach
  - the generated code is a mixture of inlined code blocks and functions
- Optimization 1: code sharing
  - multiple instances of the same kernel are treated as different nodes in an SAS
  - code sharing has a gain (code block size) and a cost (context size)
- Optimization 2: schedule adjustment
  - give up the single appearance schedule to reduce the data size:
  - (1) represent the schedule with the BTLC data structure, (2) find possible locations for adjustment, (3) adjust the schedule

Flowchart of the Optimization Procedure
1. Get an SAS schedule [RPMC, APGAN]
2. Code sharing optimization (driven by code-block size and context size)
3. Build the BTLC
4. Schedule adjustment
5. C code generation

Example of Code Sharing (CD2DAT)

[Figure: the CD2DAT graph, containing two ramp, two sine, and four fir instances feeding an xgraph display.]

Code before sharing:

    for(int i=0;i<2;i++) {
      { /* code for fir1 */
        ...
        out = tap*input[i];
        ...
      }
    }
    /* code for fir2 */
    ...

Code after sharing:

    for(int i=0;i<2;i++) fir(1);
    for(int i=0;i<3;i++) fir(2);
    ...

    void fir(int context){
      ...
      context_FIR[context].out ...
      ...
    }

Context definition:

    typedef struct{
      double *out;
      int output_ofs;
      int output_bs;
      int output_nx;
      ...
      double decimation;
      double tap;
    } context_FIR;
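To make the pattern concrete, here is a minimal, self-contained sketch of the same idea; the field names, coefficient values, and two-instance setup are illustrative, not the tool's actual generated code:

    #include <stdio.h>

    /* per-instance state of a shared FIR kernel (illustrative fields) */
    typedef struct {
        double tap;   /* filter coefficient */
        double out;   /* last output sample */
    } context_FIR;

    static context_FIR fir_ctx[2] = { { 0.50, 0.0 }, { 0.25, 0.0 } };

    /* one shared code block: every state reference goes through the context table */
    static void fir(int context, double input) {
        fir_ctx[context].out = fir_ctx[context].tap * input;
    }

    int main(void) {
        for (int i = 0; i < 2; i++) fir(0, 1.0);  /* instance fir1 fires twice */
        for (int i = 0; i < 3; i++) fir(1, 1.0);  /* instance fir2 fires three times */
        printf("%f %f\n", fir_ctx[0].out, fir_ctx[1].out);
        return 0;
    }

The kernel body exists once; only the small per-instance context structures are replicated.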

Code Size Overhead (on SPARC/Solaris)

    without context:  ..... = value;                                  (4 bytes)
    with context:     ..... = *(context_CGCRamp[context].value);      (40 bytes)

Reference overhead = 36 bytes!

Generated SPARC assembly, without context:

    ldd   [%fp], %o0

With context:

    sethi %hi(0x20800), %o1
    ld    [%o1+0x3c8], %o0
    mov   %o0, %o2
    sll   %o2, 2, %o1
    add   %o1, %o0, %o1
    sll   %o1, 3, %o0
    add   %fp, -424, %o1
    add   %o1, %o0, %o2
    ld    [%o2 + 0x1c], %o0
    ldd   [%o0], %o2

Optimization 1: Code Sharing
- Multiple instances of the same kernel have their own contexts
- The kernel code is transformed into a shared-version function
- Shared version: state is referenced only through the context variable
- Gain and cost of sharing:
  - Gain = (#instances − 1) × (code block size)
  - Cost = (#instances) × (context variable size) + (code block overhead)
- Code sharing is performed only when the gain is larger than the cost
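The test is simple enough to state as code; a minimal sketch of the rule above (function and parameter names are mine; all sizes in bytes):

    /* Share a kernel only if the code saved by deduplicating the block (gain)
       exceeds the added context storage and indirection overhead (cost). */
    int worth_sharing(int n_instances, int block_size,
                      int context_size, int block_overhead) {
        int gain = (n_instances - 1) * block_size;
        int cost = n_instances * context_size + block_overhead;
        return gain > cost;
    }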

Decision Formula

Sharing is beneficial when the gain exceeds the total overhead:

    (n − 1)·β  >  n·Ω_context + Ω_reference

where
(1) Ω = code sharing overhead = Ω_context + Ω_reference
(2) Ω_context = Σ_{p_i ∈ ports} φ(p_i), where φ(x) = 3·sizeof(int) + sizeof(pointer)
(3) Ω_reference = Σ_{t ∈ {S, C, AS, AP}} ν(t)·ω(t), with ν(t) = reference count, ω(t) = unit overhead, and t = type of reference
(4) β = code block size
(5) n = number of instances

Optimization 2: Adjusting the SAS

- Adjusting the single appearance schedule:
  - 2(7A3B)5C  ==> buffer requirement 51
  - 2(7A3B2C)C ==> buffer requirement 39
  - i.e., give up the single appearance schedule to save data memory
- BTLC (Binary Tree with Leaf Chain): a schedule tree whose leaves are additionally linked in a chain; each node is annotated with a triple [input, inside, output]

[Figure: BTLC for 2(7A3B)5C with node annotations such as [6,0,0], [0,0,3], [0,0,21], [7,0,5], [21,0,15].]
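A plausible C declaration for a BTLC node, with fields named after this slide (the actual implementation may differ):

    /* Binary Tree with Leaf Chain: internal nodes are clusters, leaves are
       actor invocations; leaves are also linked together in schedule order. */
    typedef struct btlc_node {
        int input, inside, output;       /* the [input, inside, output] triple */
        int loop_count;                  /* repetition count of this (sub)schedule */
        struct btlc_node *left, *right;  /* children of an internal node */
        struct btlc_node *next;          /* leaf-chain link to the next leaf */
    } btlc_node;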

Computation of Buffer Requirements

For an internal node with left child L and right child R, annotated [I, W, O]:

    W = |O_L ∩ I_R|         (tokens produced by L and consumed by R inside the cluster)
    I = |I_L ∪ I_R| − W     (inputs the cluster still needs from outside)
    O = |O_L ∪ O_R| − W     (outputs the cluster still exposes to the outside)

In general, W is the buffer required on the arcs crossing from L to R. The total buffer requirement of a schedule is the sum of the W values over all nodes; for 2(7A3B)5C this is 21 + 30 = 51.

[Figure: the BTLC of 2(7A3B)5C (edge A --3--> B) annotated bottom-up with triples [0,0,3], [0,0,21], [7,0,5], [21,0,15], [0,21,30], [30,0,0], and root [0,30,0].]
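In the common case where a single buffer crosses between the two children, the set sizes reduce to scalar arithmetic; a minimal sketch of my reading of the rules above (a loop node would additionally scale I and O, but not W, by its loop count):

    typedef struct { int input, inside, output; } triple;

    /* Combine two child clusters into their parent: W tokens are produced
       by the left child and consumed by the right child inside the cluster. */
    triple combine(triple L, triple R) {
        triple g;
        int W = L.output < R.input ? L.output : R.input;   /* |O_L ∩ I_R| */
        g.inside = W;
        g.input  = L.input  + R.input  - W;
        g.output = L.output + R.output - W;
        return g;
    }
    /* e.g. combine([0,21,30], [30,0,0]) yields the root triple [0,30,0]. */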

Flowchart of Schedule Adjustment

1. Construct the BTLC from the SAS schedule
2. Compute the buffer requirements
3. Find a candidate for adjustment
4. If one is found, adjust the schedule (split a chain) and go back to step 2; otherwise done, proceed to code generation

Splitting a Chain

Schedule = 2(7A3B)5C; the split point is the chain between the (7A3B) cluster and 5C.

- Finding a split candidate
  - pick the chain with the largest buffer (W) value
  - in this example, the B-C chain is selected
- Schedule after splitting: 2(7A3B2C)C
- In general, for a schedule with two adjacent clusters aC_a bC_b (a and b are loop counts), the new schedule is
  - a(C_a (b/a)C_b)((b%a)C_b), if a < b
  - ((a%b)C_a) b((a/b)C_a C_b), otherwise
  (integer division; a zero-count term is dropped); see the sketch after this list.
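The rule is mechanical enough to state as code; a small sketch that prints the adjusted schedule for two adjacent clusters (names are mine; a real implementation would drop zero-count factors instead of printing them):

    #include <stdio.h>

    /* Split the chain between clusters a*Ca and b*Cb per the rule above. */
    void split_chain(int a, const char *Ca, int b, const char *Cb) {
        if (a < b)
            printf("%d(%s%d%s)(%d%s)\n", a, Ca, b / a, Cb, b % a, Cb);
        else
            printf("(%d%s)%d(%d%s%s)\n", a % b, Ca, b, a / b, Ca, Cb);
    }

    int main(void) {
        split_chain(2, "7A3B", 5, "C");  /* prints 2(7A3B2C)(1C), i.e. 2(7A3B2C)C */
        return 0;
    }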

Decision Formula

|Cluster| denotes the |W| (inside-buffer) value of the cluster. Re-annotating the BTLC for the new schedule 2(7A3B2C)C gives a gain of 12: the buffer requirement drops from 51 to 39.

[Figure: re-annotated BTLC for 2(7A3B2C)C with triples including [0,0,3], [7,0,5], [0,0,21], [21,0,15], [0,21,15], [12,0,0], [0,12,6], [6,0,0], [0,6,0]; the inside buffers 21, 12, and 6 sum to 39.]

Experiment: CD2DAT

[Figure: BTLC of the CD2DAT graph annotated with [input, inside, output] triples, shown before and after successive chain splits; the root's inside buffer drops from 280 to 35.]

Experimental Result

[Chart: program size after each optimization step (SAS, then code sharing, then schedule adjustment) for CD2DAT and Filter Bank.]

[Chart: memory behavior of CD2DAT on an ARM7: instruction fetches and cache misses after each optimization step.]

Conclusion
- Our environment
  - PeaCE: Ptolemy extension as Codesign Environment
- Optimization techniques in software synthesis
  - for automatic code generation from a dataflow graph
  - joint minimization of code and data size
  - selective application of code sharing and of schedule adjustment to the SAS
- Future work
  - Clustering: merge multiple fine-grain nodes into a larger one to increase the chance of code sharing
  - Buffer sharing: further reduce the buffer size and improve the cache behavior

Thank You!