Presenter MaxAcademy Lecture Series – V1.0, September 2011 Stream Scheduling.

Presentation transcript:

Overview
– Latencies in stream computing
– Scheduling algorithms
– Stream offsets

Latencies in Stream Computing
Consider a simple arithmetic pipeline: (A + B) + C
Each operation has a latency
– Number of cycles from input to output
– May be zero
– Throughput is still 1 value per cycle; L values can be in flight in the pipeline

[Figure sequence: two chained adders computing (A + B) + C from Input A, Input B and Input C, feeding Output]
Basic hardware implementation
Data propagates through the circuit in "lock step"
Data arrives at the wrong time due to pipeline latency
Insert buffering to correct
Now with buffering
Success!
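The timing problem shown in these slides can be reproduced with a small cycle-level simulation. This is a Python sketch, not MaxCompiler code; it assumes each adder has a latency of 1 cycle, and the names `DelayLine` and `run_pipeline` are invented for illustration.

```python
from collections import deque


class DelayLine:
    """Register chain: a value pushed in cycle t emerges in cycle t + latency."""

    def __init__(self, latency):
        self.latency = latency
        self.regs = deque([None] * latency)

    def step(self, value):
        if self.latency == 0:
            return value
        self.regs.append(value)
        return self.regs.popleft()


def add(x, y):
    """Add two values; None marks 'no valid data this cycle'."""
    return None if x is None or y is None else x + y


def run_pipeline(a, b, c, buffer_c):
    """Stream (A + B) + C through two adders of latency 1 (assumed).

    buffer_c is the number of cycles of buffering placed on input C.
    """
    add1, add2 = DelayLine(1), DelayLine(1)
    buf_c = DelayLine(buffer_c)
    outputs = []
    n = len(a)
    for t in range(n + 2):  # two extra cycles drain the pipeline
        a_t = a[t] if t < n else None
        b_t = b[t] if t < n else None
        c_t = c[t] if t < n else None
        sum1 = add1.step(add(a_t, b_t))           # A + B, one cycle later
        out = add2.step(add(sum1, buf_c.step(c_t)))
        if out is not None:
            outputs.append(out)
    return outputs


a, b, c = [1, 2, 3], [10, 20, 30], [100, 200, 300]
unbuffered = run_pipeline(a, b, c, buffer_c=0)  # C meets the wrong A + B
buffered = run_pipeline(a, b, c, buffer_c=1)    # C is delayed to match
```

Without the buffer, each value of C is paired with the sum from the previous cycle; with one cycle of buffering on C the outputs are the intended `a[i] + b[i] + c[i]`.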

Stream Scheduling Algorithms
A stream scheduling algorithm transforms an abstract dataflow graph into one that produces the correct results given the latencies of the operations
Can be applied automatically to a large dataflow graph (many thousands of nodes)
Can try to optimize for various metrics
– Latency from inputs to outputs
– Amount of buffering inserted → generally the most interesting
– Area (resource sharing)

ASAP: As Soon As Possible
[Figure sequence: the (A + B) + C circuit built up from Input A, Input B and Input C, annotated with latencies]
Build up the circuit incrementally, keeping track of latencies (the inputs are at latency 0, the first adder's output at latency 1)
Input latencies are mismatched
Insert buffering
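The ASAP procedure can be sketched in a few lines of Python. This is an illustrative sketch rather than the lecture's own algorithm: the graph encoding (a dict of `name -> (latency, inputs)` listed in topological order), the adder latency of 1 and the function name `asap_schedule` are all assumptions.

```python
def asap_schedule(nodes):
    """ASAP scheduling sketch.

    nodes maps name -> (latency, [input names]) and must be listed in
    topological order (every node after its inputs).
    Returns (start, buffers): the cycle at which each node consumes its
    inputs, and the buffering needed on each edge that arrives early.
    """
    start, buffers = {}, {}
    for name, (_, inputs) in nodes.items():
        # Cycle at which each input's data becomes available.
        ready = [start[src] + nodes[src][0] for src in inputs]
        # A node fires as soon as its latest input is ready; inputs fire at 0.
        start[name] = max(ready, default=0)
        for src, t in zip(inputs, ready):
            if start[name] > t:
                buffers[(src, name)] = start[name] - t
    return start, buffers


# (A + B) + C with adders of latency 1 (assumed).
graph = {
    "A": (0, []),
    "B": (0, []),
    "C": (0, []),
    "add1": (1, ["A", "B"]),
    "add2": (1, ["add1", "C"]),
    "out": (0, ["add2"]),
}
start, buffers = asap_schedule(graph)
```

Here every input is pinned to cycle 0, so the only mismatch is on the edge from C into the second adder, which receives one cycle of buffering.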

ALAP: As Late As Possible
[Figure sequence: the same circuit built backwards from Output, annotated with latencies between -2 and 0]
Start at the output (latency 0)
Latencies are negative relative to the end of the circuit
Buffering is saved
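A matching Python sketch of ALAP walks the graph from the outputs backwards. As before, the graph encoding, the adder latency of 1 and the name `alap_schedule` are assumptions made for illustration.

```python
def alap_schedule(nodes):
    """ALAP scheduling sketch.

    nodes maps name -> (latency, [input names]) in topological order.
    Outputs are pinned to cycle 0; every other node starts as late as
    its earliest consumer allows, so start times are zero or negative.
    """
    consumers = {name: [] for name in nodes}
    for name, (_, inputs) in nodes.items():
        for src in inputs:
            consumers[src].append(name)

    start = {}
    for name in reversed(list(nodes)):
        latency = nodes[name][0]
        # Data must be ready by the earliest start among the consumers.
        deadline = min((start[dst] for dst in consumers[name]), default=0)
        start[name] = deadline - latency

    buffers = {}
    for name, (_, inputs) in nodes.items():
        for src in inputs:
            slack = start[name] - (start[src] + nodes[src][0])
            if slack > 0:
                buffers[(src, name)] = slack
    return start, buffers


# (A + B) + C with adders of latency 1 (assumed).
graph = {
    "A": (0, []),
    "B": (0, []),
    "C": (0, []),
    "add1": (1, ["A", "B"]),
    "add2": (1, ["add1", "C"]),
    "out": (0, ["add2"]),
}
start, buffers = alap_schedule(graph)
```

Because Input C is allowed to start at cycle -1 instead of being pinned to the same cycle as A and B, no buffering is needed for this graph.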

[Figure: the same circuit with a second output, Output 2, added alongside Output 1]
Sometimes this is suboptimal: what if we add an extra output?
Unnecessary buffering is added
Neither ASAP nor ALAP can schedule this design optimally

Optimal Scheduling
ASAP and ALAP both fix either the inputs or the outputs in place
More complex scheduling algorithms may be able to find a better schedule, e.g. using integer linear programming (ILP)
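One way to phrase buffer-minimizing scheduling as an integer linear program is sketched below. The notation is introduced here and is not taken from the lecture: $t_v$ is the cycle at which node $v$ consumes its inputs, $L_u$ is the latency of node $u$, and $E$ is the set of edges. The sketch assumes every edge's buffering is counted separately and at equal cost, which a real tool may not do.

```latex
\begin{aligned}
\text{minimize}\quad   & \sum_{(u,v)\in E} \bigl(t_v - t_u - L_u\bigr) \\
\text{subject to}\quad & t_v - t_u \ge L_u \quad \text{for all } (u,v)\in E, \\
                       & t_v \in \mathbb{Z} \quad \text{for all nodes } v.
\end{aligned}
```

Each constraint says that a node cannot consume a value before its producer has finished, and each term of the objective is the number of buffer stages needed on that edge. Unlike ASAP and ALAP, neither the inputs nor the outputs are pinned to a fixed cycle.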

Buffering data on-chip
Consider: a[i] = a[i] + (a[i - 1] + b[i - 1])
We can see that we might need some explicit buffering to hold more than one data element on-chip
We could do this explicitly, with buffering elements: a = a + (buffer(a, 1) + buffer(b, 1))

[Figure: Input A and Input B feeding adders and Output through Buffer(1) elements]
The buffer has zero latency in the schedule
This will schedule thus: Buffering = [value lost in extraction]

Buffers and Latency
Accessing previous values with buffers is looking backwards in the stream
This is equivalent to having a wire with negative latency
– Cannot be implemented directly, but can affect the schedule

[Figure: Input A and Input B feeding adders and Output, with an Offset(-1) wire]
Offset wires can have negative latency
This is scheduled with Buffering = 0
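To see why the offset version needs no buffering, an offset can be treated as extra (possibly negative) latency on an edge during scheduling. The Python sketch below assumes the Offset(-1) sits on the output of the first adder and that adders have latency 1; the figure is not reproduced in this transcript, so both the placement and the name `schedule_with_offsets` are assumptions.

```python
def schedule_with_offsets(nodes):
    """ASAP-style scheduling where edges may carry a stream offset.

    nodes maps name -> (latency, [(input name, offset), ...]) in
    topological order. An offset of k behaves like a wire of latency k,
    and k may be negative. Returns (start, total_buffering).
    """
    start = {}
    total_buffering = 0
    for name, (_, inputs) in nodes.items():
        # Arrival time of each input: producer start + latency + offset.
        ready = [start[src] + nodes[src][0] + off for src, off in inputs]
        start[name] = max(ready, default=0)
        # Inputs that arrive early must be buffered until the node fires.
        total_buffering += sum(start[name] - t for t in ready)
    return start, total_buffering


# a[i] + (a[i - 1] + b[i - 1]), written as a + offset(a + b, -1).
graph = {
    "A": (0, []),
    "B": (0, []),
    "add1": (1, [("A", 0), ("B", 0)]),
    "add2": (1, [("A", 0), ("add1", -1)]),
    "out": (0, [("add2", 0)]),
}
start, total = schedule_with_offsets(graph)
```

The adder's latency of 1 and the offset of -1 cancel, so the sum of the previous elements arrives in the same cycle as the current `a[i]` and the total buffering is 0.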

Stream Offsets
A stream offset is just a wire with a positive or negative latency
Negative latencies look backwards in the stream
Positive latencies look forwards in the stream
The entire dataflow graph will re-schedule to make sure the right data value is present when needed
Buffering could be placed anywhere, or pushed into inputs or outputs → better than manual instantiation

[Figure: Input A feeding an adder both directly and through Offset(1), then Output; annotated with latencies 0, 1, 1 and 2]
a = a + stream.offset(a, +1)
a[i] = a[i] + a[i + 1]
Scheduling produces a circuit with 1 buffer
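The single buffer can be illustrated at the stream level: since `a[i + 1]` arrives one cycle after `a[i]`, holding `a[i]` in one register is enough to bring the two together. The Python sketch below ignores the adder's own latency, which only shifts when results appear, and simply produces no result for the last element; boundary handling is not covered in these slides. The function name is invented for illustration.

```python
def add_next(stream):
    """Compute out[i] = a[i] + a[i + 1] from one value per cycle."""
    register = None  # the single buffer: holds the previous cycle's value
    have_previous = False
    results = []
    for value in stream:  # in cycle t, a[t] arrives
        if have_previous:
            # The register holds a[t - 1], so this is out[t - 1].
            results.append(register + value)
        register = value
        have_previous = True
    return results


sums = add_next([1, 2, 3, 4])
```

The result for element `i` becomes available one cycle after `a[i]` arrives, which is the positive latency that the Offset(1) wire contributes to the schedule.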

Exercises
For the questions below, assume that the latency of an addition operation is 10 cycles and a multiply takes 5 cycles, while inputs/outputs take 0 cycles.
1. Write pseudo-code algorithms for ASAP and ALAP scheduling of a dataflow graph.
2. Consider a MaxCompiler kernel with inputs a1, a2, a3, a4 and an output c. Draw the dataflow graph and draw the buffering introduced by ASAP scheduling to:
   a) c = ((a1 + a2) + a3) + a4
   b) c = (a1 + a2) + (a3 + a4)
3. Consider a MaxCompiler kernel with inputs a1, a2, a3, a4 and an output c. Draw the dataflow graph and write out the inequalities that must be satisfied to schedule:
   a) c = ((a1 * a2) + (a3 * a4)) + a1
   b) c = stream.offset(a1, -10)*a2 + stream.offset(a1, -5)*a3 + stream.offset(a1, +15)*a4
   How many values of stream a1 will be buffered on-chip for (b)?