Dynamically Scheduled High-level Synthesis


Dynamically Scheduled High-level Synthesis Lana Josipović, Radhika Ghosal, Paolo Ienne École Polytechnique Fédérale de Lausanne (EPFL) February 2018

High-level Synthesis and Static Scheduling
High-level synthesis (HLS) is the future of reconfigurable computing: designing circuits from high-level programming languages. Classic HLS relies on static schedules: each operation executes at a cycle fixed at synthesis time, and scheduling is dictated by compile-time information. This yields maximum parallelism in regular applications, but limited parallelism when information is unavailable at compile time (e.g., latencies, memory or control dependencies).

The Limitations of Static Scheduling
for (i=0; i<N; i++) {
    hist[x[i]] = hist[x[i]] + weight[i];
}
1: x[0]=5 → ld hist[5]; st hist[5];
2: x[1]=4 → ld hist[4]; st hist[4];
3: x[2]=4 → ld hist[4]; st hist[4];   ← RAW dependency between iterations 2 and 3
Static scheduling (any HLS tool): inferior when memory accesses cannot be disambiguated at compile time.
Dynamic scheduling: maximum parallelism; memory accesses are serialized only on actual dependencies.

Towards Dynamic Scheduling
Overcoming the conservatism of statically scheduled HLS:
- Dynamically selecting the schedule at runtime [1]
- Application-specific dynamic hazard detection [2]
- Pipelining particular classes of irregular loops [3]
These approaches add some dynamic behavior, but are still based on static scheduling.
[1] Liu et al. Offline synthesis of online dependence testing: Parametric loop pipelining for HLS. FCCM '15.
[2] Dai et al. Dynamic hazard resolution for pipelining irregular loops in high-level synthesis. FPGA '17.
[3] Tan et al. ElasticFlow: A complexity-effective approach for pipelining irregular loop nests. ICCAD '15.

Dynamically Scheduled Circuits
Efforts in the asynchronous domain [1]: structures of asynchronous handshake components, where each operation executes as soon as all of its inputs arrive.
Dataflow, latency-insensitive, and elastic designs [2][3][4]: synchronous methodologies for achieving dynamic scheduling.
How to create dynamically scheduled circuits from high-level programs?
[1] Budiu et al. Dataflow: A complement to superscalar. ISPASS '05.
[2] Carloni et al. Theory of latency-insensitive design. TCAD '01.
[3] Cortadella et al. Synthesis of synchronous elastic architectures. DAC '06.
[4] Vijayaraghavan et al. Bounded dataflow networks and latency-insensitive circuits. MEMOCODE '09.

Outline Introduction Background: elastic circuits Synthesizing elastic circuits Evaluation Conclusion

Elastic Circuits [1]
Every component communicates with its neighbors through channels that carry data together with a pair of handshake signals: valid (from producer to consumer) and ready (from consumer to producer).
[1] Cortadella et al. Synthesis of synchronous elastic architectures. DAC '06.

Elastic Components [1]
The basic building blocks: elastic buffer, elastic FIFO, fork, join, branch, and merge.
[1] Cortadella et al. Synthesis of synchronous elastic architectures. DAC '06.

Elastic Components
In addition, functional units (e.g., +, −, >, =) and a memory interface: load ports send an address to the memory subsystem and return data; store ports send both an address and data.

Outline Introduction Background: elastic circuits Synthesizing elastic circuits Evaluation Conclusion

Synthesizing Elastic Circuits

Synthesizing Elastic Circuits
From C to an intermediate graph representation, using the LLVM compiler framework; the program is partitioned into basic blocks.
for (i=0; i<N; i++) {
    hist[x[i]] = hist[x[i]] + weight[i];
}

Synthesizing Elastic Circuits: Constructing the datapath
Each operator corresponds to a functional unit: the loop body becomes LD x[i], LD weight[i], LD hist[x[i]], an adder, and ST hist[x[i]], while the loop counter uses an increment (+1) and a comparison with N (<).

Synthesizing Elastic Circuits: Implementing control flow
A merge is inserted for each variable entering the basic block (e.g., i arrives either from Start, where i=0, or from the previous iteration).
A branch is inserted for each variable exiting the basic block (i loops back while i<N and exits when i=N).

Synthesizing Elastic Circuits: Inserting forks
A fork is inserted after every node with multiple successors, so that each successor receives its own copy of the value.

Adding Elastic Buffers
Elastic buffers and circuit functionality: a buffer can be inserted on any channel (e.g., between LD hist[x[i]] and the adder) without affecting circuit functionality.
Elastic buffers and avoiding deadlock: each combinational loop in the circuit needs to contain at least one buffer.

Adding Elastic Buffers
The circuit as constructed contains two combinational loops through the merge of i; inserting a buffer after the merge breaks them.
However, backpressure from slow paths still prevents pipelining.

Increasing Throughput
Insert FIFOs into the slower paths, so that the faster paths are no longer stalled by their backpressure.

Increasing Throughput: Sending multiple values to the next BB
Joining all variables (i, x, y) into a single branch (Huang et al. Elastic CGRAs. FPGA '13) allows pipelining only with modulo scheduling.
Giving each variable its own branch, with the condition distributed by a fork, allows pipelining without modulo scheduling.

Increasing Throughput
The datapath now pipelines, but the RAW dependency through hist is not honored. What about memory?

Connecting to Memory
A load-store queue (LSQ) needs information on the original program order of the memory accesses [1]. The circuit therefore sends tokens to the LSQ in order of basic block execution (e.g., "BB1: LD, ST" when BB1 is starting), and the LSQ uses them to allocate entries for LD hist and ST hist in program order before accessing memory.
[1] Josipović et al. An out-of-order load-store queue for spatial computing. CASES '17.

Connecting to Memory
The result is a dynamically scheduled circuit; a fork on the control path provides the in-order tokens that form the interface to the LSQ.

Outline Introduction Background: elastic circuits Synthesizing elastic circuits Evaluation Conclusion

Results
Elastic designs dynamically resolve memory and control dependencies and produce high-throughput pipelines.
Dynamically scheduled HLS is valuable for emerging FPGA applications.

Thank you!