Buffering Techniques Greg Stitt ECE Department University of Florida.

Buffers
- Purposes
  - Metastability issues
    - Memory clock is likely different from circuit clock
    - Buffer stores data at one speed, circuit reads data at another
  - Stores “windows” of data and delivers them to the datapath
    - A window is the set of inputs needed each cycle by the pipelined circuit
    - Generally more efficient than the datapath requesting needed data
      - i.e., push data into the datapath as opposed to pulling data from memory
  - Conversion between memory and datapath widths
    - E.g., bus is 64 bits, but datapath requires 128 bits every cycle
    - Input to buffer is 64 bits, output from buffer is 128 bits
    - Buffer doesn’t say it has data until it has received a pair of inputs
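
The width conversion described above can be sketched in software. This is a minimal behavioral model and not a hardware design; the type `WidthBuffer` and the `wb_*` functions are hypothetical names introduced only for illustration. The point it shows is that the buffer reports valid data only after a pair of 64-bit inputs has arrived.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a buffer that converts a 64-bit bus to a 128-bit datapath. */
typedef struct {
    uint64_t word[2]; /* word[0] arrives first, word[1] second */
    int count;        /* number of 64-bit words currently held: 0, 1, or 2 */
} WidthBuffer;

void wb_init(WidthBuffer *wb) {
    assert(wb != NULL);
    wb->word[0] = wb->word[1] = 0;
    wb->count = 0;
}

/* Memory side: store one 64-bit word. Fails if a full pair is still waiting. */
bool wb_push(WidthBuffer *wb, uint64_t value) {
    if (wb->count == 2) return false;
    wb->word[wb->count++] = value;
    return true;
}

/* The buffer reports data only after a complete pair has arrived. */
bool wb_valid(const WidthBuffer *wb) { return wb->count == 2; }

/* Datapath side: take the 128-bit value as two 64-bit halves. */
bool wb_pop(WidthBuffer *wb, uint64_t *first, uint64_t *second) {
    if (!wb_valid(wb)) return false;
    *first = wb->word[0];
    *second = wb->word[1];
    wb->count = 0;
    return true;
}
```

After one `wb_push` the buffer is still not valid; after the second, `wb_pop` hands both halves to the datapath at once.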

FIFOs
- FIFOs are a common buffer
- Outputs data in the order it was read from memory

  for (i=0; i < 100; i++)
    a[i] = b[i] + b[i+1] + b[i+2];

[Animation: RAM feeds a FIFO, which feeds a datapath of two adders]
- Step 1: the first window, b[0-2], is read from memory into the FIFO
- Step 2: the first window is pushed to the datapath; the second window, b[1-3], is read from RAM
- Step 3: the third window, b[2-4], is read from RAM; the second window is pushed to the datapath while the first adder stage holds b[0]+b[1] alongside b[2]
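
The FIFO walk-through above can be mimicked with a small software model. This is a sketch under assumptions: the names `Window`, `Fifo`, `fetch_window`, and `run_fifo`, and the FIFO depth of 4, are hypothetical and chosen only for illustration. Each loop pass stands for one cycle in which the datapath pops the oldest window and memory pushes the next one.

```c
#include <assert.h>

#define WINDOW 3 /* elements per window: b[i], b[i+1], b[i+2] */
#define DEPTH  4 /* assumed FIFO depth, in windows */

typedef struct { long elem[WINDOW]; } Window;
typedef struct { Window slot[DEPTH]; int head, tail, count; } Fifo;

/* Read one full window from memory; every element is a separate access. */
Window fetch_window(const long *b, int i, long *accesses) {
    Window w;
    for (int k = 0; k < WINDOW; k++) {
        w.elem[k] = b[i + k];
        (*accesses)++;
    }
    return w;
}

int fifo_push(Fifo *f, Window w) {
    if (f->count == DEPTH) return 0;
    f->slot[f->tail] = w;
    f->tail = (f->tail + 1) % DEPTH;
    f->count++;
    return 1;
}

int fifo_pop(Fifo *f, Window *w) {
    if (f->count == 0) return 0;
    *w = f->slot[f->head];
    f->head = (f->head + 1) % DEPTH;
    f->count--;
    return 1;
}

/* Computes a[i] = b[i] + b[i+1] + b[i+2] for n iterations.
   Returns the number of memory accesses performed. */
long run_fifo(const long *b, long *a, int n) {
    Fifo f;
    Window w;
    long accesses = 0;
    int fetched = 0, done = 0;
    assert(n >= 0);
    f.head = f.tail = f.count = 0;
    while (done < n) {
        /* Datapath side: consume the oldest window, if one is ready. */
        if (fifo_pop(&f, &w))
            a[done++] = w.elem[0] + w.elem[1] + w.elem[2];
        /* Memory side: fetch the next window only if there is room for it. */
        if (fetched < n && f.count < DEPTH) {
            fifo_push(&f, fetch_window(b, fetched, &accesses));
            fetched++;
        }
    }
    return accesses;
}
```

Because every window is fetched in full, the access count grows as 3 accesses per iteration, even though consecutive windows share two elements.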

FIFOs
- Timing issues
  - Memory bandwidth too small
    - Circuit stalls or wastes cycles while waiting for data
  - Memory bandwidth larger than data consumption rate of circuit
    - May happen if circuit area is exhausted

FIFOs
- Memory bandwidth too small

  for (i=0; i < 100; i++)
    a[i] = b[i] + b[i+1] + b[i+2];

[Animation: RAM feeds a FIFO, which feeds a datapath of two adders]
- Step 1: the 1st window, b[0-2], is read from memory into the FIFO
- Step 2: the 1st window is pushed to the datapath; the 2nd window, b[1-3], has been requested from memory but not transferred yet
- Step 3: the 2nd window still has not arrived, so no data is ready (wasted cycles) while the first adder stage holds b[0]+b[1] alongside b[2]
- Alternatively, the 1st window could have been prevented from proceeding (stall cycles); this is necessary if there is feedback in the pipeline

FIFOs
- Memory bandwidth larger than data consumption rate

  for (i=0; i < 100; i++)
    a[i] = b[i] + b[i+1] + b[i+2];

[Animation: RAM feeds a FIFO, which feeds a datapath of two adders]
- Step 1: the 1st window, b[0-2], is read from memory into the FIFO
- Step 2: the 2nd window, b[1-3], is read from memory into the FIFO; the 1st window is pushed to the datapath
- Step 3: the 3rd window, b[2-4], arrives while the 2nd window is still waiting in the FIFO
  - Data arrives faster than the circuit can process it, so the FIFOs begin to fill
  - If the FIFO is full, memory reads stop until it is not full
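
Both timing cases can be explored with a rough cycle-level model. This is a sketch under simplifying assumptions that are not from the slides: memory delivers one window every `mem_period` cycles, the datapath accepts one window every `dp_period` cycles, and a window delivered in a cycle may be consumed in that same cycle. The names `Timing` and `simulate` are hypothetical.

```c
#include <assert.h>

typedef struct {
    long total_cycles;     /* cycles until the last window is consumed */
    long wasted_cycles;    /* datapath was ready but the FIFO was empty */
    long mem_stall_cycles; /* memory had a window ready but the FIFO was full */
} Timing;

Timing simulate(int n_windows, int mem_period, int dp_period, int depth) {
    Timing t = {0, 0, 0};
    int occupancy = 0, delivered = 0, consumed = 0;
    int mem_timer = mem_period; /* cycles until the next window arrives */
    int dp_timer = 0;           /* cycles until the datapath can accept a window */

    assert(n_windows >= 0 && mem_period >= 1 && dp_period >= 1 && depth >= 1);

    while (consumed < n_windows) {
        t.total_cycles++;

        /* Memory side: deliver a window when the timer expires and there is room. */
        if (delivered < n_windows) {
            if (mem_timer > 0) mem_timer--;
            if (mem_timer == 0) {
                if (occupancy < depth) {
                    occupancy++;
                    delivered++;
                    mem_timer = mem_period;
                } else {
                    t.mem_stall_cycles++; /* FIFO full: reads stop until not full */
                }
            }
        }

        /* Datapath side: take a window when ready, otherwise the cycle is wasted. */
        if (dp_timer > 0) dp_timer--;
        if (dp_timer == 0) {
            if (occupancy > 0) {
                occupancy--;
                consumed++;
                dp_timer = dp_period;
            } else {
                t.wasted_cycles++;
            }
        }
    }
    return t;
}
```

Calling `simulate(4, 2, 1, 4)` models memory that is too slow and reports wasted datapath cycles; `simulate(6, 1, 2, 2)` models memory that is too fast and reports cycles in which memory reads had to stop because the FIFO was full.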

Improvements
- Do we need to fetch an entire window from memory for each iteration?
  - Only if windows of consecutive iterations are disjoint
- Commonly, consecutive iterations have overlapping windows
  - Overlap represents “reused” data
  - Ideally, reused data would be fetched from memory just once
- Smart Buffer [Guo, Buyukkurt, Najjar LCTES 2004]
  - Part of the ROCCC compiler
  - Analyzes memory access patterns
  - Detects “sliding windows”, reused data
  - Prevents multiple accesses to same data

  for (i=0; i < 100; i++)
    a[i] = b[i] + b[i+1] + b[i+2];

[Figure: array elements b[0] through b[5], showing the overlap between consecutive windows]

Smart Buffer

  for (i=0; i < 100; i++)
    a[i] = b[i] + b[i+1] + b[i+2];

[Animation: RAM feeds a smart buffer, which feeds a datapath of two adders]
- Step 1: the 1st window, b[0-2], is read from memory into the smart buffer
- Step 2: the 1st window is pushed to the datapath; the buffer continues reading needed data, b[3-5], but does not reread b[1-2]
- Step 3: b[0] is no longer needed and is deleted from the buffer; the 2nd window (b[1], b[2], b[3]) is pushed to the datapath; the buffer continues fetching needed data, b[6-8] (as opposed to windows)
- Step 4: b[1] is no longer needed and is deleted from the buffer; the 3rd window (b[2], b[3], b[4]) is pushed to the datapath; b[9-11] is fetched
- And so on
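
The sliding-window behavior above can be sketched in software. This is a simplified model and not the ROCCC implementation: it reads one element per step rather than several per cycle, and the names `SmartBuffer`, `sb_push`, and `run_smart_buffer` are hypothetical. It shows that each element is read from memory once and that the oldest element is dropped as soon as no later window needs it.

```c
#include <assert.h>

#define WINDOW 3 /* elements per window: b[i], b[i+1], b[i+2] */

typedef struct {
    long data[WINDOW]; /* most recent elements, oldest first */
    int filled;        /* number of valid elements in data */
    long accesses;     /* memory accesses performed so far */
} SmartBuffer;

void sb_init(SmartBuffer *sb) {
    sb->filled = 0;
    sb->accesses = 0;
}

/* Accept one new element read from memory. Once the buffer is full, the oldest
   element is deleted because no later window uses it.
   Returns 1 when a complete window is available. */
int sb_push(SmartBuffer *sb, long value) {
    sb->accesses++;
    if (sb->filled < WINDOW) {
        sb->data[sb->filled++] = value;
    } else {
        for (int k = 0; k + 1 < WINDOW; k++)
            sb->data[k] = sb->data[k + 1];
        sb->data[WINDOW - 1] = value;
    }
    return sb->filled == WINDOW;
}

/* Computes a[i] = b[i] + b[i+1] + b[i+2] for n iterations.
   Returns the number of memory accesses performed. */
long run_smart_buffer(const long *b, long *a, int n) {
    SmartBuffer sb;
    int out = 0;
    assert(n >= 0);
    sb_init(&sb);
    /* n windows of size WINDOW span n + WINDOW - 1 distinct elements. */
    for (int j = 0; j < n + WINDOW - 1 && out < n; j++) {
        if (sb_push(&sb, b[j]))
            a[out++] = sb.data[0] + sb.data[1] + sb.data[2];
    }
    return sb.accesses;
}
```

For the 100-iteration loop, this model reads each of the 102 array elements exactly once.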

Comparison with FIFO
- FIFO: fetches a window for each iteration
  - Reads 3 elements every iteration
  - 100 iterations * 3 accesses/iteration = 300 memory accesses
- Smart Buffer: fetches as much data as possible each cycle; the buffer assembles data into windows
  - Ideally, reads each element once
    - # accesses = array size = 102 memory accesses
  - Circuit performance is equal to latency plus time to read the array from memory
    - No matter how much computation, execution time is approximately equal to the time to stream in data!
    - Note: only true for streaming examples
  - Essentially improves effective memory bandwidth by about 3x (300 vs. 102 accesses)
- How does this help?
  - Smart buffers enable more unrolling

Unrolling with Smart Buffers
- Assume bandwidth = 128 bits
  - We can read 128/32 = 4 array elements each cycle (32-bit elements)
- First access doesn’t save time
  - No data in buffer
  - Same as FIFO, can unroll once

  long b[102];
  for (i=0; i < 100; i++)
    a[i] = b[i] + b[i+1] + b[i+2];

[Figure: b[0] b[1] b[2] b[3]. The first memory access allows only 2 parallel iterations]

Unrolling with Smart Buffers
- After the first window is in the buffer
  - Smart Buffer Bandwidth = Memory Bandwidth + Reused Data
    - 4 elements + 2 reused elements (b[2], b[3])
  - Essentially, provides bandwidth of 6 elements per cycle
  - Can perform 4 iterations in parallel
    - 2x speedup compared to FIFO

  long b[102];
  for (i=0; i < 100; i++)
    a[i] = b[i] + b[i+1] + b[i+2];

[Figure: b[0] through b[7]. Every subsequent access enables 4 parallel iterations, because b[2] and b[3] are already in the buffer]
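
The unrolling arithmetic above can be written as a few helper functions. This is a sketch, assuming a sliding window that advances by one element per iteration and 32-bit array elements as in the 128/32 calculation; the function names are hypothetical. A run of E consecutive elements contains E - W + 1 complete windows of size W, and the smart buffer adds W - 1 reused elements to each memory access.

```c
#include <assert.h>

/* Array elements delivered by memory each cycle. */
int elements_per_cycle(int bandwidth_bits, int element_bits) {
    assert(element_bits > 0 && bandwidth_bits >= element_bits);
    return bandwidth_bits / element_bits;
}

/* Complete windows (parallel iterations) contained in a run of consecutive elements. */
int windows_in(int elements, int window) {
    assert(window >= 1);
    return elements >= window ? elements - window + 1 : 0;
}

/* FIFO: only the freshly read elements are available each cycle. */
int fifo_parallel_iterations(int bandwidth_bits, int element_bits, int window) {
    return windows_in(elements_per_cycle(bandwidth_bits, element_bits), window);
}

/* Smart buffer, after the first access: fresh elements plus window - 1 reused ones. */
int smart_buffer_parallel_iterations(int bandwidth_bits, int element_bits, int window) {
    int fresh = elements_per_cycle(bandwidth_bits, element_bits);
    int reused = window - 1;
    return windows_in(fresh + reused, window);
}
```

With a 128-bit bandwidth, 32-bit elements, and a window of 3, these return 2 parallel iterations for the FIFO and 4 for the smart buffer, matching the 2x speedup stated above.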

Datapath Design w/ Smart Buffers
- Datapath is based on the unrolling enabled by smart buffer bandwidth (not memory bandwidth)
  - Don’t be confused by the first memory access
- Smart buffer waits until the initial windows are in the buffer before passing any data to the datapath
  - Adds a little latency
  - But avoids the 1st iteration requiring different control due to less unrolling

Another Example
- Your turn
  - Analyze memory access patterns
    - Determine window overlap
  - Determine smart buffer bandwidth
    - Assume memory bandwidth = 128 bits/cycle
  - Determine maximum unrolling with and without smart buffer
  - Determine total cycles with and without smart buffer
    - Use previous systolic array analysis (latency, bandwidth, etc.)

  short b[1004], a[1000];
  for (i=0; i < 1000; i++)
    a[i] = avg( b[i], b[i+1], b[i+2], b[i+3], b[i+4] );