Systolic Architectures: Why is RC fast?
Greg Stitt
ECE Department, University of Florida

Why are microprocessors slow?
Von Neumann architecture:
- "Stored-program" machine
- Memory holds both instructions and data

Von Neumann architecture
Summary:
1) Fetch instruction
2) Decode instruction, fetch data
3) Execute
4) Store results
5) Repeat from 1 until end of program
Problem: this model is inherently sequential.
- Only one instruction executes at a time
- It does not take into account the parallelism of the application
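As a minimal C sketch of this execution model (the toy ISA below is made up purely for illustration, not any real instruction set), note that exactly one instruction is fetched, decoded, and executed per loop iteration:

    #include <stdint.h>

    // Toy von Neumann machine: one instruction per loop iteration.
    enum opcode { OP_ADD, OP_LOAD, OP_STORE, OP_HALT };
    struct instr { enum opcode op; int dst, src1, src2; };

    void run(const struct instr *imem, int32_t *dmem, int32_t *regs) {
        int pc = 0;
        for (;;) {
            struct instr in = imem[pc++];                 // 1) fetch
            switch (in.op) {                              // 2) decode
            case OP_ADD:                                  // 3) execute
                regs[in.dst] = regs[in.src1] + regs[in.src2]; break;
            case OP_LOAD:
                regs[in.dst] = dmem[regs[in.src1]];           break;
            case OP_STORE:                                // 4) store results
                dmem[regs[in.src1]] = regs[in.dst];           break;
            case OP_HALT:
                return;
            }                                             // 5) repeat
        }
    }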

Problem 2: the von Neumann bottleneck
- Constantly reading/writing data for every instruction requires high memory bandwidth
- Performance is limited by the bandwidth of memory
[Figure: von Neumann architecture -- RAM feeding control and datapath; the bandwidth is not sufficient, the "von Neumann bottleneck"]

Improvements
Increase resources in the datapath to execute multiple instructions in parallel:
- VLIW (very long instruction word): the compiler encodes parallelism into "very long" instructions
- Superscalar: the architecture determines parallelism at run time (out-of-order instruction execution)
The von Neumann bottleneck is still a problem.
[Figure: RAM feeding control and a widened datapath]

Why is RC fast?
RC implements custom circuits for an application, and circuits can exploit a massive amount of parallelism.
- VLIW/superscalar parallelism: ~5 instructions/cycle in the best case (which rarely occurs)
- RC: potentially thousands of operations per cycle -- as many as will fit in the device
RC also supports different types of parallelism.

Types of Parallelism: Bit-level

C code for bit reversal:

    x = (x >> 16) | (x << 16);
    x = ((x >> 8) & 0x00ff00ff) | ((x << 8) & 0xff00ff00);
    x = ((x >> 4) & 0x0f0f0f0f) | ((x << 4) & 0xf0f0f0f0);
    x = ((x >> 2) & 0x33333333) | ((x << 2) & 0xcccccccc);
    x = ((x >> 1) & 0x55555555) | ((x << 1) & 0xaaaaaaaa);

Binary compilation for a processor:

    sll $v1, $v0, 0x10
    srl $v0, $v0, 0x10
    or  $v0, $v1, $v0
    srl $v1, $v0, 0x8
    and $v1, $v1, $t5
    sll $v0, $v0, 0x8
    and $v0, $v0, $t4
    or  $v0, $v1, $v0
    srl $v1, $v0, 0x4
    and $v1, $v1, $t3
    sll $v0, $v0, 0x4
    and $v0, $v0, $t2
    ...

On a processor this requires between 32 and 128 cycles. A circuit for bit reversal on an FPGA is just wiring from the original x value to the bit-reversed x value, and requires only 1 cycle -- a speedup of 32x to 128x at the same clock rate.

Types of Parallelism: Arithmetic-level

C code:

    for (i=0; i < 128; i++)
        y += c[i] * x[i];

[Figure: 128 multipliers operating in parallel, feeding an adder tree]

On a processor: 1000's of instructions, several thousand cycles. As a circuit on an FPGA: ~7 cycles. Speedup > 100x at the same clock rate.
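The ~7-cycle figure is consistent with a binary adder tree: all 128 multiplies occur in parallel in one level, and log2(128) = 7 levels of adders reduce the products. Below is a C sketch of that reduction (an assumption about the circuit's structure, not taken from the slides):

    // Sketch: how a circuit evaluates y = sum(c[i]*x[i]) for 128 elements.
    int dot128(const int c[128], const int x[128]) {
        int p[128];
        for (int i = 0; i < 128; i++)       // in hardware: 128 parallel multipliers
            p[i] = c[i] * x[i];
        for (int n = 128; n > 1; n /= 2)    // in hardware: one adder level per cycle,
            for (int i = 0; i < n / 2; i++) // log2(128) = 7 levels total
                p[i] = p[2*i] + p[2*i + 1];
        return p[0];
    }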

Types of Parallelism: Pipeline parallelism

    for (j=0; j < n; j++) {
        y = a[j];
        x = b[j];
        for (i=0; i < 128; i++)
            y += c[i] * x[i];
        // output y
        y = 0;
    }

A new inner loop can start every cycle. After the pipeline fills, the circuit performs 128 multiplies and 128 adds every cycle.

Types of Parallelism: Task-level
e.g., MPEG-2 -- execute each task in parallel.

How to exploit parallelism?
General idea:
- Identify tasks
- Create a circuit for each task
- Communicate between tasks with buffers
How do we create the circuit for each task? We want to exploit bit-level, arithmetic-level, and pipeline-level parallelism.
Solution: systolic architectures (also called systolic arrays or systolic computing).

Systolic Architectures
Systolic (definition): the rhythmic contraction of the heart, especially of the ventricles, by which blood is driven through the aorta and pulmonary artery after each dilation or diastole.
By analogy with the heart pumping blood, we want an architecture that pumps data through efficiently: "Data flows from memory in a rhythmic fashion, passing through many processing elements before it returns to memory." [Hung]

Systolic Architecture
General idea: a fully pipelined circuit, with I/O at the top and bottom levels.
- Local connections: each element communicates only with elements at the same level or the level below
- Inputs arrive every cycle
- Outputs depart every cycle, once the pipeline is full

Systolic Architecture: Simple Example

    for (i=0; i < 100; i++)
        a[i] = b[i] + b[i+1] + b[i+2];

Create a DFG (data flow graph) for the body of the loop, representing the data dependencies of the code.
[Figure: DFG -- b[i] and b[i+1] feed one adder; its output and b[i+2] feed a second adder, producing a[i]]

Simple Example
Add pipeline stages to each level of the DFG.
[Figure: the same DFG with a register after each level]

Simple Example
Allocate one resource (adder, ALU, etc.) for each operation in the DFG. The resulting systolic architecture processes the loop as follows:
- Cycle 1: b[0], b[1], b[2] are read.
- Cycle 2: inputs b[1], b[2], b[3] arrive; stage 1 holds b[0]+b[1] and the delayed b[2].
- Cycle 3: inputs b[2], b[3], b[4]; stage 1 holds b[1]+b[2] and b[3]; stage 2 holds b[0]+b[1]+b[2].
- Cycle 4: inputs b[3], b[4], b[5]; the first output, a[0], appears -- it takes 4 cycles to fill the pipeline.
- Cycle 5: inputs b[4], b[5], b[6]; output a[1]. From this point there is one output per cycle, with 99 more until completion.
Total cycles = 4 (pipeline fill) + 99 = 103.
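The 103-cycle count can be checked with a small, cycle-accurate C model of this two-adder pipeline (a sketch; the register names are illustrative, not part of the slides):

    #include <stdio.h>

    // Cycle-accurate sketch of the two-stage systolic pipeline for
    // a[i] = b[i] + b[i+1] + b[i+2].
    int main(void) {
        int b[102], a[100];
        for (int k = 0; k < 102; k++) b[k] = k;   // sample input data

        int in0 = 0, in1 = 0, in2 = 0;   // input registers
        int s1_sum = 0, s1_delay = 0;    // stage-1 registers
        int s2_out = 0;                  // stage-2 register
        int cycles = 0, outputs = 0;
        for (int i = 0; outputs < 100; i++) {
            cycles++;
            // One clock edge: read old register values, then update them.
            if (cycles >= 4) a[outputs++] = s2_out;  // first output at cycle 4
            s2_out   = s1_sum + s1_delay;            // stage 2
            s1_sum   = in0 + in1;                    // stage 1
            s1_delay = in2;                          // delay register
            if (i < 100) { in0 = b[i]; in1 = b[i+1]; in2 = b[i+2]; }
        }
        printf("total cycles = %d (4 to fill + 99 more)\n", cycles);  // 103
        printf("a[0] = %d, a[99] = %d\n", a[0], a[99]);               // 3, 300
        return 0;
    }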

uP Performance Comparison
Assumptions: 10 instructions for the loop body, CPI = 1.5, uP clock 10x faster than the FPGA.
Total SW cycles: 100 * 10 * 1.5 = 1,500.
RC speedup: (1500/103) * (1/10) = 1.46x.

uP Performance Comparison
What if the uP clock is 15x faster? (e.g., 3 GHz vs. 200 MHz)
RC speedup: (1500/103) * (1/15) = 0.97x -- RC is slightly slower.
But:
- RC requires much less power: several watts vs. ~100 watts.
- SW may be practical on embedded uPs for low power, but then the uP clock may be only 2x faster: (1500/103) * (1/2) = 7.3x faster for RC.
- RC may also be cheaper, depending on the area needed; this example would certainly be cheaper.

Simple Example, Cont.
Improvement to the systolic array: why not execute multiple iterations at the same time? There are no data dependencies between iterations, so the loop can be unrolled:

    for (i=0; i < 100; i++)
        a[i] = b[i] + b[i+1] + b[i+2];

[Figure: unrolled DFG -- iteration i (inputs b[i], b[i+1], b[i+2], output a[i]) next to iteration i+1 (inputs b[i+1], b[i+2], b[i+3], output a[i+1]), and so on]

Simple Example, Cont.
How much to unroll? Unrolling is limited by memory bandwidth and area:
- Must read all inputs once per cycle
- Must write all outputs once per cycle
- Must have sufficient area for all operations in the unrolled DFG

Unrolling Example
Original circuit: the first iteration requires 3 inputs (b[0], b[1], b[2]) and produces a[0].
Each unrolled iteration requires only one additional input: the second copy of the datapath adds b[3] and produces a[1].
Because the iterations' inputs overlap, each cycle brings in 4 inputs (instead of 6) for the two parallel iterations.

Performance after Unrolling
How much unrolling? Assume b[] elements are 8 bits.
- The first iteration requires 3 elements = 24 bits
- Each additional unrolled iteration requires only 1 element = 8 bits, due to the overlapping inputs
Assume memory bandwidth = 64 bits/cycle. We can then perform 6 iterations in parallel (24 + 5*8 = 64 bits).
New performance: the unrolled systolic architecture requires 4 cycles to fill the pipeline plus 100/6 iterations => ~21 cycles.
With unrolling, RC is (1500/21) * (1/15) = 4.8x faster than the 3 GHz microprocessor!

Importance of Memory Bandwidth
Performance with a wider memory (128-bit bus): 14 iterations in parallel.
- 64 extra bits / 8 bits per iteration = 8 more parallel iterations, plus the 6 original unrolled iterations = 14 total
- Total cycles = 4 (pipeline fill) + 100/14 ≈ 11
- Speedup: (1500/11) * (1/15) = 9.1x
Doubling the memory width increased the speedup from 4.8x to 9.1x!
Important point: the performance of hardware is often limited by memory bandwidth. More bandwidth => more unrolling => more parallelism => BIG SPEEDUP.
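Both speedup calculations follow the same bandwidth arithmetic, sketched below in C (the helper max_unroll is made up for illustration); it reproduces the slides' ~21/~11 cycle counts and 4.8x/9.1x speedups up to rounding:

    #include <stdio.h>

    // First iteration needs first_bits of input; each additional
    // overlapped iteration needs only step_bits more.
    int max_unroll(int bus_bits, int first_bits, int step_bits) {
        if (bus_bits < first_bits) return 0;
        return 1 + (bus_bits - first_bits) / step_bits;
    }

    int main(void) {
        int fill = 4, iters = 100, sw_cycles = 1500, clk_ratio = 15;
        for (int bus = 64; bus <= 128; bus *= 2) {
            int u = max_unroll(bus, 24, 8);              // 6, then 14
            double rc_cycles = fill + (double)iters / u; // ~21, then ~11
            printf("bus=%3d bits: unroll=%2d, cycles=~%.0f, speedup=%.1fx\n",
                   bus, u, rc_cycles, sw_cycles / rc_cycles / clk_ratio);
        }
        return 0;
    }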

Delay Registers
A common mistake: forgetting to add registers for values not used during a cycle. Such values must be "delayed" (passed along through a register each cycle) until needed.
[Figure: the incorrect version connects b[i+2] directly to the second adder; the correct version inserts a delay register so b[i+2] arrives in the same cycle as b[i]+b[i+1]]

Delay Registers
Illustration of the incorrect circuit (no delay register on the third input):
- Cycle 1: b[0], b[1], b[2] enter the circuit.
- Cycle 2: stage 1 holds b[0]+b[1], but the b[2] that belongs with it was never registered -- the second adder's other input is undefined (?????).
- Cycle 3: stage 1 holds b[1]+b[2], and stage 2 produces b[0]+b[1]+b[3] -- the wrong result, because the un-delayed wire now carries b[3], the current cycle's input, instead of b[2].
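The mistake can be reproduced in a few lines of C (a sketch; the variable names are illustrative): reading the live input wire where a registered copy is needed pairs b[0]+b[1] with the wrong element.

    #include <stdio.h>

    // One-tick-at-a-time model of the pipeline, with and without the
    // delay register on the third input.
    int main(void) {
        int b[12] = {1,2,3,4,5,6,7,8,9,10,11,12};
        int s1_sum = 0, s1_delay = 0;
        for (int i = 0; i < 8; i++) {
            // Stage 2 reads the stage-1 registers from the previous cycle.
            int bad  = s1_sum + b[i+2];    // WRONG: live wire, one iteration ahead
            int good = s1_sum + s1_delay;  // RIGHT: registered copy
            s1_sum   = b[i] + b[i+1];
            s1_delay = b[i+2];             // the delay register
            if (i == 1)                    // b[0]+b[1]+b[2]=6 vs b[0]+b[1]+b[3]=7
                printf("correct=%d incorrect=%d\n", good, bad);
        }
        return 0;
    }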

Another Example
Your turn.

    short b[1004], a[1000];
    for (i=0; i < 1000; i++)
        a[i] = avg( b[i], b[i+1], b[i+2], b[i+3], b[i+4] );

Steps:
- Build the DFG for the body of the loop
- Add pipeline stages
- Map operations to hardware resources (assume divide takes one cycle)
- Determine the maximum amount of unrolling (memory bandwidth = 128 bits/cycle)
- Determine performance compared to the uP (assume 15 instructions per iteration, CPI = 1.5, uP clock 15x faster than the RC)

Another Example, Cont.
What if the divider takes 20 cycles, but is fully pipelined? Calculate the effect on performance.
In systolic architectures, performance is usually dominated by the throughput of the pipeline, not its latency.

Dealing with Dependencies
op2 is dependent on op1 when an input of op2 is an output of op1.
Problem: dependencies limit arithmetic parallelism and increase latency -- op2 can't execute before op1.
This is a serious problem: FPGAs rely on parallelism for performance. Little parallelism = bad performance.

Dealing with Dependencies
Partial solution: parallelizing transformations, e.g., tree height reduction.
[Figure: a serial chain of adds over a, b, c, d with depth = # of adders, restructured into a balanced tree with depth = log2(# of adders)]
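As a C-level illustration of the transformation, both functions below compute the same sum, but the second exposes two independent adds that hardware can execute in the same cycle:

    // Serial chain: each add depends on the previous one.
    // Depth = 3 adder levels.
    int sum_chain(int a, int b, int c, int d) {
        return ((a + b) + c) + d;
    }

    // Tree height reduction: (a+b) and (c+d) are independent, so both
    // can execute in parallel. Depth = 2 = log2(4) adder levels.
    int sum_tree(int a, int b, int c, int d) {
        return (a + b) + (c + d);
    }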

Dealing with Dependencies
A simple example with an inter-iteration dependency -- a potential problem for systolic arrays, because the pipeline can't be kept full:

    a[0] = 0;
    for (i=1; i < 8; i++)
        a[i] = b[i] + b[i+1] + a[i-1];

Iteration i can't execute until iteration i-1 completes, which limits arithmetic parallelism and increases latency.

Dealing with Dependencies
But systolic arrays also have pipeline-level parallelism, so the latency is less of an issue. Because a[i-1] is simply the previous iteration's result, the unrolled iterations chain together: the adders producing a[1] feed the adders producing a[2], which feed a[3], and so on. Adding pipeline stages to this chain yields a systolic array (outputs not shown in the original figures).
This only works if the loop is fully unrolled, and it requires sufficient memory bandwidth.
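A C sketch of the fully unrolled loop shows the structure the circuit takes -- each line corresponds to an adder stage feeding the next:

    // Fully unrolled recurrence a[i] = b[i] + b[i+1] + a[i-1].
    // In hardware, each partial sum below becomes a pipeline stage;
    // a new set of b[] values can enter every cycle even though each
    // a[i] still depends on a[i-1].
    void unrolled(const int b[9], int a[8]) {
        a[0] = 0;
        a[1] = b[1] + b[2] + a[0];
        a[2] = b[2] + b[3] + a[1];
        a[3] = b[3] + b[4] + a[2];
        a[4] = b[4] + b[5] + a[3];
        a[5] = b[5] + b[6] + a[4];
        a[6] = b[6] + b[7] + a[5];
        a[7] = b[7] + b[8] + a[6];
    }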

Dealing with Dependencies
Your turn:

    char b[1006];
    for (i=0; i < 1000; i++) {
        acc = 0;
        for (j=0; j < 6; j++)
            acc += b[i+j];
    }

Steps:
- Build the DFG for the inner loop (note the dependencies on acc)
- Fully unroll the inner loop (check that memory bandwidth allows it; assume bandwidth = 64 bits/cycle)
- Add pipeline stages
- Map operations to hardware resources
- Determine performance compared to the uP (assume 15 instructions per iteration, CPI = 1.5, uP clock 15x faster than the RC)

Dealing with Control
If statements:

    char b[1006], a[1000];
    for (i=0; i < 1000; i++) {
        if (i % 2 == 0)
            a[i] = b[i] * b[i+1];
        else
            a[i] = b[i+2] + b[i+3];
    }

The pipeline can't wait for the result of the condition -- that would stall it. Instead, convert control into computation ("if conversion"): compute both b[i] * b[i+1] and b[i+2] + b[i+3] every cycle, and let a MUX selected by i % 2 choose which result is written to a[i].
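In C, the if-converted body looks like the ternary form below (a sketch of the transformation, not compiler output); in hardware the ?: operator becomes the MUX:

    // If-converted loop body: both candidate results are always
    // computed, and (i % 2 == 0) drives the selecting MUX.
    void if_converted(char a[1000], const char b[1006]) {
        for (int i = 0; i < 1000; i++) {
            char prod = b[i]   * b[i+1];   // multiply path, always computed
            char sum  = b[i+2] + b[i+3];   // add path, always computed
            a[i] = (i % 2 == 0) ? prod : sum;
        }
    }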

Dealing with Control
If conversion is not always so easy. Here the two branches write to different arrays, so a single output MUX is not enough -- each of a[i] and a2[i] needs its own MUX (or write-enable) driven by i % 2:

    char b[1006], a[1000], a2[1000];
    for (i=0; i < 1000; i++) {
        if (i % 2 == 0)
            a[i] = b[i] * b[i+1];
        else
            a2[i] = b[i+2] + b[i+3];
    }

Other Challenges
Outputs can also limit unrolling. Example: 4 outputs and 1 input, with each output 32 bits. The total output bandwidth for one iteration is 128 bits; with a 128-bit memory bus, we can't unroll at all, even though the inputs use only 32 bits.

    long b[1004], a[1000];
    for (i=0, j=0; i < 1000; i+=4, j++) {
        a[i]   = b[j] + 10;
        a[i+1] = b[j] * 23;
        a[i+2] = b[j] - 12;
        a[i+3] = b[j] * b[j];
    }

Other Challenges
Systolic arrays require streaming data to work well. With a tiny data stream, the pipelining is wasted:

    for (i=0; i < 4; i++)
        a[i] = b[i] + b[i+1];

[Figure: four adders consuming b[0]..b[4] and producing a[0]..a[3]]

Point: systolic arrays work best with repeated computation.

Other Challenges
Memory bandwidth, cont. The values so far are "peak" values, achievable only if all input data is stored sequentially in memory -- often not the case.
Example: two-dimensional arrays.

    long a[100][100], b[100][100];
    for (i=1; i < 99; i++) {
        for (j=1; j < 99; j++) {
            a[i][j] = avg( b[i-1][j], b[i][j-1], b[i+1][j], b[i][j+1] );
        }
    }

Other Challenges
Memory bandwidth, cont. Example 2: multiple array inputs.

    long a[100], b[100], c[100];
    for (i=0; i < 100; i++) {
        a[i] = b[i] + c[i];
    }

b[] and c[] are stored in different locations, so memory accesses may jump back and forth between the two arrays.
Possible solutions:
- Use multiple memories, or a multiported memory (high cost)
- Interleave the data from b[] and c[] in memory (programming effort; without compiler support this requires a manual rewrite)
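A sketch of the interleaving rewrite (the struct layout and names are illustrative): storing each iteration's b and c elements adjacently turns two strided streams into one sequential stream.

    // Interleaved layout: bc[i].b and bc[i].c replace b[i] and c[i],
    // so each iteration reads one sequential two-word chunk.
    struct pair { long b; long c; };

    void add_interleaved(long a[100], const struct pair bc[100]) {
        for (int i = 0; i < 100; i++)
            a[i] = bc[i].b + bc[i].c;   // one sequential stream of pairs
    }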

Other Challenges
Dynamic memory access patterns: the sequence of addresses is not known until run time, and is clearly not sequential.

    int f( int val ) {
        long a[100], b[100], c[100];
        for (i=0; i < 100; i++) {
            a[i] = b[rand()%100] + c[i * val];
        }
    }

Possible solution: something creative enough for a Ph.D. thesis.

Other Challenges
Pointer-based data structures: even when scanning through a list, the data could be scattered all over memory -- very unlikely to be sequential. Pointers can also cause aliasing problems, which greatly limit optimization potential. Solutions are another Ph.D. thesis.
Pointers are fine if they are used as arrays:

    int f( int val ) {
        long a[100], b[100];
        long *p = b;
        for (i=0; i < 100; i++, p++) {
            a[i] = *p + 1;
        }
    }

is equivalent to

    int f( int val ) {
        long a[100], b[100];
        for (i=0; i < 100; i++) {
            a[i] = b[i] + 1;
        }
    }

Other Challenges
Not all code is just one loop -- handling arbitrary code is yet another Ph.D. thesis.
The main point to remember: systolic arrays are extremely fast, but only certain types of code work. What can we do instead of systolic arrays?

Other Options
Try something completely different, or try a slight variation.
Example: 3 inputs, but the memory can only deliver 2 per cycle.

    for (i=0; i < 100; i++)
        a[i] = b[i] + b[i+1] + b[i+2];

The original systolic array is not possible -- it can read only two of its three inputs per cycle.

Variations
Example, cont. Break the previous rules: add extra delay registers, reading two elements on odd cycles and one on even cycles, with "junk" values propagating through the pipeline on the off cycles:
- Cycle 1: read b[0], b[1] (third input is junk).
- Cycle 2: read b[2].
- Cycle 3: read b[1], b[2]; stage 1 holds b[0]+b[1] and b[2].
- Cycle 4: read b[3]; stage 2 holds b[0]+b[1]+b[2].
- Cycle 5: read b[2], b[3]; the first output, a[0], appears after 5 cycles.
- Cycle 6: read b[4]; stage 2 holds b[1]+b[2]+b[3], with junk on the next cycle.
- Cycle 7: the second output, a[1], appears 2 cycles after the first.
A valid output emerges every 2 cycles -- approximately 1/2 the performance of the original array.
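The throughput claim is easy to sanity-check with the cycle counts from the walkthrough (a back-of-the-envelope sketch):

    #include <stdio.h>

    // Half-rate variation: first output at cycle 5, then one every
    // 2 cycles. Original array: first output at cycle 4, then one
    // per cycle.
    int main(void) {
        int half_rate = 5 + 2 * 99;    // 203 cycles for 100 outputs
        int full_rate = 4 + 99;        // 103 cycles for 100 outputs
        printf("half-rate: %d cycles, full-rate: %d cycles (ratio %.2f)\n",
               half_rate, full_rate, (double)half_rate / full_rate);
        return 0;                      // ratio ~1.97, i.e., ~1/2 performance
    }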

Entire Circuit
[Figure: a controller coordinating an input address generator, an input buffer RAM, the datapath, an output buffer RAM, and an output address generator]
The buffers handle differences in speed between the RAM and the datapath.