From C to Elastic Circuits


From C to Elastic Circuits Lana Josipović, Philip Brisk, Paolo Ienne École Polytechnique Fédérale de Lausanne University of California, Riverside October 2017

High-level Synthesis and Static Scheduling
High-level synthesis (HLS) tools are the future of reconfigurable computing: they produce circuits from high-level programming languages. Classic HLS relies on static schedules, in which each operation executes at a cycle fixed at synthesis time and the schedule is dictated by compile-time information. This achieves maximum parallelism in regular applications, but forces pessimistic assumptions whenever memory or control dependencies might occur.

The Limitations of Static Scheduling

for (i=0; i<N; i++) {
  hist[x[i]] = hist[x[i]] + weight[i];
}

1: x[0]=5 → ld hist[5]; st hist[5];
2: x[1]=4 → ld hist[4]; st hist[4];
3: x[2]=4 → ld hist[4]; st hist[4];   ← RAW dependency between iterations 2 and 3

Static scheduling (any HLS tool): inferior when memory accesses cannot be disambiguated at compile time.
Dynamic scheduling: maximum parallelism; memory accesses are serialized only on actual dependencies.
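Whether this RAW hazard exists depends purely on the runtime values of x, which is exactly what a static scheduler cannot see. As an illustration (the helper below is ours, not part of any HLS tool), a simple check over the index array shows which consecutive iterations actually collide:

```c
/* Illustrative only: iteration i depends on iteration i-1 exactly when
   x[i] == x[i-1], i.e. both iterations update the same histogram bin. */
static int count_adjacent_conflicts(const int *x, int n) {
    int conflicts = 0;
    for (int i = 1; i < n; i++)
        if (x[i] == x[i - 1])      /* same bin: the load must wait for the store */
            conflicts++;
    return conflicts;
}
```

For the slide's indices x = {5, 4, 4}, only iterations 2 and 3 collide on hist[4]; for an alias-free index array the count is zero and all iterations could be pipelined.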

Towards Dynamic Scheduling
Overcoming the conservatism of statically scheduled HLS:
Dynamically selecting the schedule at runtime1
Application-specific dynamic hazard detection2
Pipelining particular classes of irregular loops3
These approaches add some dynamic behavior, but are still based on static scheduling.
1 Liu et al. Offline synthesis of online dependence testing: Parametric loop pipelining for HLS. FCCM '15.
2 Dai et al. Dynamic hazard resolution for pipelining irregular loops in high-level synthesis. FPGA '17.
3 Tan et al. ElasticFlow: A complexity-effective approach for pipelining irregular loop nests. ICCAD '15.

Dynamically Scheduled Circuits
Dataflow, latency-insensitive, elastic1, 2, 3: each operation executes as soon as all of its inputs arrive, and operators pass on the data they produce.
How do we create dynamically scheduled circuits from high-level programs?
1 Carloni et al. Theory of latency-insensitive design. TCAD '01.
2 Cortadella et al. Synthesis of synchronous elastic architectures. DAC '06.
3 Vijayaraghavan et al. Bounded dataflow networks and latency-insensitive circuits. MEMOCODE '09.

Outline Introduction Background: elastic circuits Synthesizing elastic circuits Evaluation Conclusion

Elastic Circuits1
Every component communicates with its predecessors and successors through a pair of handshake signals (valid and ready) alongside the data; there is an analogy with Petri nets and tokens. (Figure: two elastic components connected by data, valid, and ready signals.)
1 Cortadella et al. Synthesis of synchronous elastic architectures. DAC '06.
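The valid/ready handshake can be sketched in C as a one-slot elastic buffer: a token moves only when the producer's valid and the consumer's ready are both high. The names below are illustrative assumptions, not the components' actual interfaces:

```c
/* One-slot elastic stage: the full flag doubles as the valid signal
   towards the successor and (negated) as the ready signal upstream. */
typedef struct {
    int data;
    int full;    /* slot occupied -> valid to successor, not ready upstream */
} elastic_buf;

/* Producer side: ready iff the slot can accept a token. */
static int buf_ready(const elastic_buf *b) { return !b->full; }

/* A token enters only when valid && ready; returns 1 on transfer. */
static int buf_push(elastic_buf *b, int valid, int data) {
    if (valid && buf_ready(b)) { b->data = data; b->full = 1; return 1; }
    return 0;
}

/* Consumer side: the token leaves when we hold one and the successor is ready. */
static int buf_pop(elastic_buf *b, int succ_ready, int *out) {
    if (b->full && succ_ready) { *out = b->data; b->full = 0; return 1; }
    return 0;
}
```

Backpressure falls out naturally: while the slot is full, buf_ready deasserts and the predecessor stalls.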

Elastic Components
Elastic buffer (Buff), elastic FIFO, Fork, Join, Branch, Merge, functional units (+, −, >, =), and the memory interface: load ports (address out, data in) and store ports (address and data out) connected to the memory subsystem.
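The firing rules of two of these components can be sketched as predicates — a simplification with our own naming (real implementations also distinguish eager from lazy forks): a Join fires only when every input carries a token, and a lazy Fork accepts a new token only once all successors are ready.

```c
/* Join: the output is valid only when all inputs are valid. */
static int join_fires(const int *input_valid, int n) {
    for (int i = 0; i < n; i++)
        if (!input_valid[i]) return 0;
    return 1;
}

/* Lazy fork: signals ready upstream only when every successor is ready. */
static int fork_ready(const int *succ_ready, int n) {
    for (int i = 0; i < n; i++)
        if (!succ_ready[i]) return 0;
    return 1;
}
```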

Outline Introduction Background: elastic circuits Synthesizing elastic circuits Evaluation Conclusion

Synthesizing Elastic Circuits
From C to an intermediate graph representation, using the LLVM compiler framework. The loop body forms a single basic block:

for (i=0; i<N; i++) {
  hist[x[i]] = hist[x[i]] + weight[i];
}

.loop:                  ; preds = .loop, .start
  i.0 = phi [ i.1, .loop ], [ 0, .start ]
  %1 = load *x, i.0
  %2 = load *weight, i.0
  %3 = load *hist, %1
  %4 = fadd %2, %3
  store %4, *hist, %1
  i.1 = add i.0, 1
  %5 = icmp slt i.1, N
  br %5, label .loop, label .exit

Synthesizing Elastic Circuits
Constructing the datapath: each operator corresponds to a functional unit. (Figure: the datapath with the two adders, the comparator (<), the load ports LD x[i], LD weight[i], LD hist[x[i]], and the store port ST hist[x[i]], fed by i, the constant 1, and N.)
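The one-to-one mapping from IR operators to units can be sketched as a lookup (the enum and function names below are ours; note that phi is not a datapath unit — it is implemented by a Merge in the control-flow step):

```c
#include <string.h>

typedef enum {
    UNIT_LOAD, UNIT_STORE, UNIT_ADD, UNIT_FADD,
    UNIT_CMP, UNIT_MERGE, UNIT_UNKNOWN
} unit_t;

/* Each LLVM-style operator in the basic block gets its own hardware unit. */
static unit_t unit_for(const char *op) {
    if (strcmp(op, "load") == 0)  return UNIT_LOAD;
    if (strcmp(op, "store") == 0) return UNIT_STORE;
    if (strcmp(op, "add") == 0)   return UNIT_ADD;
    if (strcmp(op, "fadd") == 0)  return UNIT_FADD;
    if (strcmp(op, "icmp") == 0)  return UNIT_CMP;
    if (strcmp(op, "phi") == 0)   return UNIT_MERGE;  /* handled as control flow */
    return UNIT_UNKNOWN;
}
```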

Synthesizing Elastic Circuits
Implementing control flow:
A Merge (Mg) for each variable entering the basic block (Start: i=0).
A Branch (Br) for each variable exiting the basic block (Exit: i=N).
Inserting Forks: a Fork after every node with multiple successors.
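Token steering through these control-flow components can be sketched as follows (our naming; the merge shown is the simple one used here, which forwards whichever input holds a token):

```c
/* Merge: forward the token from whichever input currently holds one.
   Input 0 carries the initial value from .start, input 1 the
   loop-carried value. Returns the selected input, or -1 if none. */
static int merge_select(int valid0, int valid1) {
    if (valid0) return 0;
    if (valid1) return 1;
    return -1;
}

/* Branch: steer the data token to the successor chosen by the
   condition (%5 in the IR): back to .loop while i < N, else to .exit. */
static int branch_target(int cond) {
    return cond ? 0 /* .loop */ : 1 /* .exit */;
}
```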

Adding Elastic Buffers
Elastic buffers and circuit functionality: buffer insertion does not affect circuit functionality. (Figure: a Buff placed between LD hist[x] / LD weight[i] and the adder.)
Elastic buffers and avoiding deadlock: each combinational loop in the circuit needs to contain at least one buffer. (Figure: the Mg–Br loop with a Buff inserted after the Mg.)
In the histogram circuit, the two combinational loops through the Merge are broken by a single Buff after the Mg. However, backpressure from slow paths still prevents pipelining.
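The deadlock condition above — every combinational loop must contain a buffer — can be checked with an ordinary cycle search over the non-buffered connections. This is a sketch with our own graph encoding, not the tool's algorithm:

```c
#define MAXN 8

/* DFS over non-buffered edges; state: 0 = unvisited, 1 = on the current
   path, 2 = finished. A back edge to a node on the path is a cycle. */
static int has_cycle_dfs(int adj[MAXN][MAXN], int n, int u, int *state) {
    state[u] = 1;
    for (int v = 0; v < n; v++) {
        if (!adj[u][v]) continue;
        if (state[v] == 1) return 1;                      /* combinational cycle */
        if (state[v] == 0 && has_cycle_dfs(adj, n, v, state)) return 1;
    }
    state[u] = 2;
    return 0;
}

/* adj[u][v] = 1 iff u feeds v with no elastic buffer in between. */
static int has_comb_cycle(int adj[MAXN][MAXN], int n) {
    int state[MAXN] = {0};
    for (int u = 0; u < n; u++)
        if (state[u] == 0 && has_cycle_dfs(adj, n, u, state)) return 1;
    return 0;
}
```

Inserting a Buff on an edge corresponds to deleting that edge from the graph; the circuit is deadlock-free when no cycle remains.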

Increasing Throughput
Insert FIFOs into the slower paths (here, after LD x[i] and in front of LD hist[x[i]]). But what about memory? The RAW dependency is not honored!
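A slower path's FIFO can be sketched as a ring buffer whose full flag provides the backpressure (capacity and names are illustrative):

```c
#define FIFO_CAP 4

typedef struct { int buf[FIFO_CAP]; int head, tail, count; } fifo;

/* Returns 0 when full: the upstream path must stall. */
static int fifo_push(fifo *f, int v) {
    if (f->count == FIFO_CAP) return 0;
    f->buf[f->tail] = v;
    f->tail = (f->tail + 1) % FIFO_CAP;
    f->count++;
    return 1;
}

/* Returns 0 when empty: nothing valid for the downstream path. */
static int fifo_pop(fifo *f, int *v) {
    if (f->count == 0) return 0;
    *v = f->buf[f->head];
    f->head = (f->head + 1) % FIFO_CAP;
    f->count--;
    return 1;
}
```

While the slow path drains tokens at its own rate, the fast path keeps filling the FIFO instead of stalling the whole pipeline.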

Connecting to Memory
Load-store queue (LSQ): it needs information on the original program order of the memory accesses1. Tokens are sent to the LSQ in order of basic block execution: each time the BB starts, its accesses (LD hist, ST hist) are allocated in the queue in program order.
1 Josipović et al. An out-of-order load-store queue for spatial computing. CASES '17.
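The ordering rule the LSQ enforces can be sketched as follows — this is our simplification, not the CASES '17 design: entries are allocated in basic-block order, and a load may issue only when every earlier, uncompleted store is known to touch a different address.

```c
/* One queue entry, allocated in program order when its BB starts. */
typedef struct {
    int is_store;
    int addr_known;   /* has the address token arrived yet? */
    int addr;
    int done;         /* access completed */
} lsq_entry;

/* A load at position load_idx may issue iff no earlier pending store
   might alias it: unknown store addresses conservatively block it. */
static int load_can_issue(const lsq_entry *q, int load_idx) {
    for (int i = 0; i < load_idx; i++) {
        if (!q[i].is_store || q[i].done) continue;
        if (!q[i].addr_known) return 0;               /* might alias */
        if (q[i].addr == q[load_idx].addr) return 0;  /* actual RAW dep */
    }
    return 1;
}
```

This is exactly the histogram case: loads of hist[4] wait for the pending store to hist[4], while accesses to other bins proceed out of order.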

Connecting to Memory
Memory accesses remain completely dynamic, with some restrictions imposed by the LSQ.

Outline Introduction Background: elastic circuits Synthesizing elastic circuits Evaluation Conclusion

Results
Elastic designs dynamically resolve memory and control dependencies and produce high-throughput pipelines. The dynamically scheduled designs are Pareto-dominant over the statically scheduled designs. Dynamically scheduled HLS is valuable for emerging FPGA applications.

Thank you!