Derivation of Efficient FSM from Polyhedral Loop Nests
Tomofumi Yuki, Antoine Morvan, Steven Derrien
INRIA / Université de Rennes 1

High-Level Synthesis
- Writing HDL is too costly, which led to the emergence of HLS tools.
- HLS is sensitive to the input source: it must be written in a "HW-aware" manner.
- Source-to-source transformations are common in optimizing compilers; (semi-)automated exploration at the HLS stage further enhances productivity and performance.

HLS-Specific Transformations
- Not all optimizing compiler transformations make sense in an embedded context, and the converse is also true.
- FSM derivation is one example: for loops are preferred in a general-purpose context, while HLS benefits from an explicit FSM (a runnable C sketch of this transformation follows below).

FSM derivation, as sketched on the slide:

    for i
      for j
        S0
      for k
        S1

becomes

    while (…)
      if (…) S0;
      if (…) S1;
      if (…) k = k+1;
      if (…) i = i+1; j = 0;
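To make the idea concrete, here is a minimal, runnable C sketch of the same transformation; the bounds N and M, the phase variable, and the S0/S1 stubs are illustrative assumptions, not taken from the paper:

    #include <stdio.h>

    #define N 4   /* illustrative bounds, not from the paper */
    #define M 3

    static void S0(int i, int j) { printf("S0(%d,%d)\n", i, j); }
    static void S1(int i, int k) { printf("S1(%d,%d)\n", i, k); }

    int main(void) {
        /* Original imperfect nest:
         *   for (i = 0; i < N; i++) {
         *     for (j = 0; j < M; j++) S0(i, j);
         *     for (k = 0; k < M; k++) S1(i, k);
         *   }
         * FSM form: one while loop whose guards play the role of
         * transition conditions.
         */
        int i = 0, j = 0, k = 0;
        int phase = 0;                      /* 0: inside the j loop, 1: inside the k loop */
        while (i < N) {
            if (phase == 0) {               /* transition: execute S0, advance j */
                S0(i, j);
                if (++j == M) { j = 0; phase = 1; }
            } else {                        /* transition: execute S1, advance k */
                S1(i, k);
                if (++k == M) { k = 0; phase = 0; i++; }
            }
        }
        return 0;
    }

In a hardware realization, each branch of the if chain would become a transition of the generated FSM.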

Contributions
- Analytical model of loop pipelining: understanding when to use nested loop pipelining (NLP) w.r.t. single loop pipelining (SLP).
- Derivation of Finite State Machines: handles imperfectly nested loops, based on polyhedral techniques.
- Pipelining of the control-path: computing n states ahead improves the performance of the control-path.

Outline
- Modeling Loop Pipelining
  - Single Loop Pipelining
  - Nested Loop Pipelining
  - NLP vs. SLP
- FSM Derivation
- Evaluation
- Conclusion

Single Loop Pipelining
Overlapped execution of the innermost loop.

    for i=1:M
      for j=1:N
        S(i,j);

The body is split into pipeline stages:

    for i=1:M
      for j=1:N
        stage0(i,j); stage1(i,j); stage2(i,j); stage3(i,j);

and consecutive j iterations are overlapped, so that in steady state all stages are busy:

    for i=1:M
      s0(i,1);
      s1(i,1); s0(i,2);
      s2(i,1); s1(i,2); …
      s3(i,1); s2(i,2); … s0(i,N);
               s3(i,2); … s1(i,N);
                          … s2(i,N);
                             s3(i,N);
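In an HLS flow, single-loop pipelining is typically requested with a pipeline directive on the innermost loop. A minimal sketch in a Vivado-HLS-style dialect (the pragma spelling, the array, and the bounds are assumptions for illustration; the tool decides the actual stage split):

    #define M 64   /* illustrative bounds */
    #define N 64

    void kernel(int A[M][N]) {
        for (int i = 0; i < M; i++) {
            for (int j = 0; j < N; j++) {
    #pragma HLS pipeline II=1   /* overlap iterations of this innermost loop only */
                A[i][j] += 1;   /* placeholder body standing in for S(i,j) */
            }
        }
    }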

Pipeline Flush/Fill Overhead
Overhead is paid for each iteration of the outer loop: at every boundary between i iterations the pipeline stages are under-utilized while the pipeline drains and refills.

    for i=1:M
      for j=1:N
        s0(i,j); s1(i,j); s2(i,j); s3(i,j);

(The slide shows the schedules for i=1, 2, 3 with under-utilized stages at each boundary.)
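A back-of-the-envelope cycle count for this overhead (my sketch under the parameters of the figure, not a formula quoted from the slides): with S pipeline stages, an inner trip count of N, and an outer trip count of M, each outer iteration pays roughly S-1 extra cycles of fill/flush, so

\[
\text{cycles}_{\mathrm{SLP}} \;\approx\; M\,(N + S - 1).
\]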

Nested Loop Pipelining
"Compress" the schedule by pipelining all iterations together.

    for i=1:M
      for j=1:N
        s0(i,j); s1(i,j); s2(i,j); s3(i,j);

The inner-loop control becomes an explicit increment and guard:

    for i=1:M
      j=j+1; j<N
      s0(i,j); s1(i,j); s2(i,j); s3(i,j);

and finally the whole nest becomes a single pipelined loop driven by a next function:

    while (has_next)
      i,j = next(i,j)
      s0(i,j); s1(i,j); s2(i,j); s3(i,j);

(The slide shows the i=1, 2, 3 schedules packed back to back, with no flush/fill gaps between them.)
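Under the same assumptions as the sketch above, nested loop pipelining pays the fill/flush cost only once for the whole nest:

\[
\text{cycles}_{\mathrm{NLP}} \;\approx\; M\,N + S - 1.
\]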

NLP Overhead
- Larger control-path: an FSM for the whole loop nest instead of a single loop (the FSM for SLP is a simple check on the loop bound).
- This hinders the maximum frequency: a complex control-path may take longer than one data-path stage.
- The savings in flush/fill overhead must therefore be greater than the loss in frequency.

Modeling Trade-offs
Important parameters:
- frequency degradation due to NLP
- innermost trip count
- number of pipeline stages

f*: NLP frequency normalized to SLP (f* = 0.9 means 10% degradation in frequency).
α = #stages / trip count (larger α means larger flush/fill overhead).

When is NLP Beneficial?
- f*: higher means less degradation; improving the control-path (and thus f*) is possible.
- α: larger means a smaller innermost trip count; this is a program characteristic that cannot be changed.
- Model the speedup as a function of f* and α (a sketch of such a model follows).
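One plausible way to instantiate such a model, combining the cycle counts sketched earlier with the definitions of f* and α = S/N (again my derivation, not necessarily the paper's exact formula):

\[
\text{speedup} \;=\; \frac{\text{cycles}_{\mathrm{SLP}}}{\text{cycles}_{\mathrm{NLP}} / f^{*}}
\;=\; f^{*}\,\frac{M\,(N + S - 1)}{M\,N + S - 1}
\;\xrightarrow{\;M \gg 1\;}\; f^{*}\Bigl(1 + \alpha - \tfrac{1}{N}\Bigr)
\;\approx\; f^{*}\,(1 + \alpha),
\]

so NLP pays off roughly when f*(1 + α) > 1, i.e. when the relative frequency loss is outweighed by the saved fill/flush fraction.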

Outline
- Modeling Loop Pipelining
- FSM Derivation
  - Polyhedral Representation
  - Computing Transitions
  - State Look Ahead
- Evaluation
- Conclusion

Polyhedral Representation
Represent loops as mathematical objects.

    for i = 0:N
      for j = 0:M
        S

    for i = 0:N
      for j = 0:i
        S0
      for k = 0:N-i
        S1

(The slide shows the corresponding iteration domains: a rectangle for S, and triangular domains for S0 and S1.)
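In set notation (assuming the inclusive 0:N and 0:M bounds written in the snippets above), the iteration domains drawn on the slide are the integer polyhedra

\[
\mathcal{D}_{S}  = \{(i,j) \mid 0 \le i \le N,\ 0 \le j \le M\},\quad
\mathcal{D}_{S0} = \{(i,j) \mid 0 \le i \le N,\ 0 \le j \le i\},\quad
\mathcal{D}_{S1} = \{(i,k) \mid 0 \le i \le N,\ 0 \le k \le N - i\}.
\]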

FSM Derivation
The next function:
- a piece-wise function that gives the immediate successor in lexicographic order
- proposed in 1998 for low-level code generation
Direct application to FSMs:
- each piece = the condition of a transition; the function itself = the transition
- it can be composed with itself to obtain next^n (a C sketch of such a function follows)
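A minimal C sketch of such a piece-wise next function for the rectangular 0:N x 0:M nest used earlier; the state_t type, the valid flag, and the bounds are my own illustrative choices, not the paper's interface:

    #define N 4   /* illustrative bounds */
    #define M 3

    typedef struct { int i, j, valid; } state_t;

    /* Immediate lexicographic successor inside { (i,j) | 0<=i<=N, 0<=j<=M }.
     * Each branch corresponds to one transition condition of the generated FSM. */
    static state_t next(int i, int j) {
        state_t s = { i, j, 1 };
        if (j < M)      { s.j = j + 1; }           /* stay on the same row         */
        else if (i < N) { s.i = i + 1; s.j = 0; }  /* wrap to the start of row i+1 */
        else            { s.valid = 0; }           /* last point: no successor     */
        return s;
    }

    /* next^2: composing next with itself gives the state two steps ahead. */
    static state_t next2(int i, int j) {
        state_t s = next(i, j);
        return s.valid ? next(s.i, s.j) : s;
    }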

State Look Ahead
Pipelining the control-flow: when the data-path is heavily pipelined, the control-path becomes the critical path.
Computing n states ahead allows n-stage pipelining of the control-path.

(The slide contrasts a datapath fed by next, producing (i,j) then (i',j'), with a datapath fed by next^2, producing (i,j), (i',j'), (i'',j'').)
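A rough software sketch of the idea, reusing the state_t and next helpers from the previous snippet (the datapath stub and the one-deep look-ahead register are assumptions for illustration): the control keeps one pre-computed successor in a register, so the evaluation of next can run concurrently with the datapath instead of sitting on its critical path.

    static void datapath(int i, int j) { /* stands in for the pipelined datapath */ }

    static void run(void) {
        state_t cur   = { 0, 0, 1 };
        state_t ahead = next(0, 0);           /* pre-computed one step ahead */
        while (cur.valid) {
            datapath(cur.i, cur.j);           /* works on the current state */
            state_t tmp = ahead;              /* registered look-ahead state */
            if (ahead.valid)                  /* meanwhile compute the state after that, */
                ahead = next(ahead.i, ahead.j); /* i.e. next^2 of the current state */
            cur = tmp;
        }
    }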

Other Optimizations
Merging transitions: the computed next function can have many transitions, and some can be merged by looking at their context.

    next(i,j) = (i,j+1) if i<N
    next(i,j) = (N,j+1) if i=N
  merge into
    next(i,j) = (i,j+1) if i≤N

Common sub-expressions: HLS tools sometimes fail to catch them.

    if (a>b && c>d) A;
    if (a>b && e>f) B;
  becomes
    x = a>b;
    if (x && c>d) A;
    if (x && e>f) B;

Evaluation Methodology
Focus on the control-path:
- empty data-path (incrementing arrays)
- independent iterations
- loops with different shapes
Three versions with different pipelining:
- SLP: innermost loop only
- NLP: all loops
- FSM-LA2: while loop of the FSM with next^2 (look-ahead of 2)

Evaluation: HLS Phase
Maximum target frequency (results chart not reproduced in the transcript).

Evaluation: Synthesized Design
Achieved frequency (results chart not reproduced in the transcript).

Conclusion
- Improved FSM generation from for loops: an example of an HLS-specific transformation, with state look-ahead to pipeline the control-path; HLS tools currently lack such compiler optimizations.
- Applied to Nested Loop Pipelining: enlarges its applicability by reducing its overhead.
- Future directions: other uses of the next function; other HLS-specific transformations.
