Memory Buffering Techniques

Slides:



Advertisements
Similar presentations
Pipelining (Week 8).
Advertisements

Computer Organization and Architecture
CSCI 4717/5717 Computer Architecture
Morgan Kaufmann Publishers The Processor
Data Dependencies Describes the normal situation that the data that instructions use depend upon the data created by other instructions, or data is stored.
Mehmet Can Vuran, Instructor University of Nebraska-Lincoln Acknowledgement: Overheads adapted from those provided by the authors of the textbook.
Intro to Computer Org. Pipelining, Part 2 – Data hazards + Stalls.
Lecture Objectives: 1)Define pipelining 2)Calculate the speedup achieved by pipelining for a given number of instructions. 3)Define how pipelining improves.
Vector Processing. Vector Processors Combine vector operands (inputs) element by element to produce an output vector. Typical array-oriented operations.
Limits on ILP. Achieving Parallelism Techniques – Scoreboarding / Tomasulo’s Algorithm – Pipelining – Speculation – Branch Prediction But how much more.
Memory Hierarchy.1 Review: Major Components of a Computer Processor Control Datapath Memory Devices Input Output.
1  2004 Morgan Kaufmann Publishers Chapter Six. 2  2004 Morgan Kaufmann Publishers Pipelining The laundry analogy.
Pipelining By Toan Nguyen.
CSE378 Pipelining1 Pipelining Basic concept of assembly line –Split a job A into n sequential subjobs (A 1,A 2,…,A n ) with each A i taking approximately.
Systolic Architectures: Why is RC fast? Greg Stitt ECE Department University of Florida.
9.2 Pipelining Suppose we want to perform the combined multiply and add operations with a stream of numbers: A i * B i + C i for i =1,2,3,…,7.
What have mr aldred’s dirty clothes got to do with the cpu
CSE 340 Computer Architecture Summer 2014 Basic MIPS Pipelining Review.
CS.305 Computer Architecture Enhancing Performance with Pipelining Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005, and from.
1 Designing a Pipelined Processor In this Chapter, we will study 1. Pipelined datapath 2. Pipelined control 3. Data Hazards 4. Forwarding 5. Branch Hazards.
Lecture 08: Memory Hierarchy Cache Performance Kai Bu
CSIE30300 Computer Architecture Unit 04: Basic MIPS Pipelining Hsin-Chou Chi [Adapted from material by and
Exploiting Parallelism
LECTURE 7 Pipelining. DATAPATH AND CONTROL We started with the single-cycle implementation, in which a single instruction is executed over a single cycle.
Memory Buffering Techniques Greg Stitt ECE Department University of Florida.
EE3A1 Computer Hardware and Digital Design Lecture 9 Pipelining.
Prefetching Techniques. 2 Reading Data prefetch mechanisms, Steven P. Vanderwiel, David J. Lilja, ACM Computing Surveys, Vol. 32, Issue 2 (June 2000)
Pipelining: Implementation CPSC 252 Computer Organization Ellen Walker, Hiram College.
Buffering Techniques Greg Stitt ECE Department University of Florida.
Buffering Techniques Greg Stitt ECE Department University of Florida.
DICCD Class-08. Parallel processing A parallel processing system is able to perform concurrent data processing to achieve faster execution time The system.
Speed up on cycle time Stalls – Optimizing compilers for pipelining
Computer Architecture Chapter (14): Processor Structure and Function
Computer Organization
Stalling delays the entire pipeline
Concepts and Challenges
Central Processing Unit Architecture
William Stallings Computer Organization and Architecture 8th Edition
The University of Adelaide, School of Computer Science
Chapter 9 a Instruction Level Parallelism and Superscalar Processors
Approaches to exploiting Instruction Level Parallelism (ILP)
Chapter 14 Instruction Level Parallelism and Superscalar Processors
Cache Memory Presentation I
Single Clock Datapath With Control
Greg Stitt ECE Department University of Florida
\course\cpeg323-08F\Topic6b-323
Pipelining Example Cycle 1 b[0] b[1] b[2] + +
Morgan Kaufmann Publishers The Processor
Instruction Level Parallelism and Superscalar Processors
Pipelining and Vector Processing
Lecture 6: Advanced Pipelines
Instruction Level Parallelism and Superscalar Processors
Serial versus Pipelined Execution
Pipelining Basic concept of assembly line
Overview Parallel Processing Pipelining
Control unit extension for data hazards
Morgan Kaufmann Publishers Memory Hierarchy: Cache Basics
Instruction Execution Cycle
Computer Architecture
Chapter 8. Pipelining.
Overview Last lecture Digital hardware systems Today
Pipelining: Basic Concepts
Pipelining Basic concept of assembly line
Pipelining Basic concept of assembly line
Morgan Kaufmann Publishers The Processor
Memory System Performance Chapter 3
ECE 352 Digital System Fundamentals
A relevant question Assuming you’ve got: One washer (takes 30 minutes)
Presentation transcript:

Memory Buffering Techniques Greg Stitt ECE Department University of Florida

Buffers Purposes Metastability issues Memory clock likely different from circuit clock Buffer stores data at one speed, loads data at another Stores “windows” of data, delivers to datapath Window is set of inputs needed each cycle by pipelined circuit Generally, more efficient than datapath requesting needed data i.e. Push data into datapath as opposed to pulling data from memory

FIFOs FIFOs are a common buffer Outputs data in order read from memory RAM for (i=0; i < 100; I++) a[i] = b[i] + b[i+1] + b[i+2]; FIFO FIFO FIFO + +

FIFOs FIFOs are a common buffer Outputs data in order read from memory RAM for (i=0; i < 100; I++) a[i] = b[i] + b[i+1] + b[i+2]; First window read from memory b[0-2] b[0] b[1] b[2] FIFO FIFO FIFO + +

FIFOs FIFOs are a common buffer Outputs data in order read from memory RAM for (i=0; i < 100; I++) a[i] = b[i] + b[i+1] + b[i+2]; b[0-2] b[0] b[1] b[2] + +

FIFOs FIFOs are a common buffer Outputs data in order read from memory RAM for (i=0; i < 100; I++) a[i] = b[i] + b[i+1] + b[i+2]; Second window read from RAM b[1-3] b[1] b[2] b[3] First window pushed to datapath b[0] b[1] b[2] + +

FIFOs FIFOs are a common buffer Outputs data in order read from memory RAM for (i=0; i < 100; I++) a[i] = b[i] + b[i+1] + b[i+2]; b[2-4] b[2] b[3] b[4] b[1] b[2] b[3] + b[0]+b[1] b[2] +

FIFOs Timing issues Memory bandwidth too small Circuit stalls or wastes cycles while waiting for data Memory bandwidth larger than data consumption rate of circuit May happen if area exhausted

FIFOs Memory bandwidth too small RAM for (i=0; i < 100; I++) a[i] = b[i] + b[i+1] + b[i+2]; First window read from memory into FIFO b[0-2] b[0] b[1] b[2] + +

FIFOs Memory bandwidth too small RAM for (i=0; i < 100; I++) a[i] = b[i] + b[i+1] + b[i+2]; 2nd window requested from memory, but not transferred yet 1st window pushed to datapath b[0] b[1] b[2] + +

FIFOs Memory bandwidth too small RAM for (i=0; i < 100; I++) a[i] = b[i] + b[i+1] + b[i+2]; 2nd window requested from memory, but not transferred yet No data ready (wasted cycles) Alternatively, could have prevented 1st window from proceeding (stall cycles) - necessary if feedback in pipeline + b[0]+b[1] b[2] +

FIFOs Memory bandwidth larger than data consumption rate RAM for (i=0; i < 100; I++) a[i] = b[i] + b[i+1] + b[i+2]; 1st window read from memory into FIFO b[0-2] b[0] b[1] b[2] + +

FIFOs Memory bandwidth larger than data consumption rate RAM for (i=0; i < 100; I++) a[i] = b[i] + b[i+1] + b[i+2]; 2nd window read from memory into FIFO. b[1-3] b[1] b[2] b[3] 1st window pushed to datapath b[0] b[1] b[2] + +

FIFOs Memory bandwidth larger than data consumption rate RAM for (i=0; i < 100; I++) a[i] = b[i] + b[i+1] + b[i+2]; b[2-4] b[2] b[3] b[4] Data arrives faster than circuit can process it – FIFOs begin to fill b[1] b[2] b[3] b[0] b[1] b[2] If FIFO full, memory reads stop until not full + +

Improvements Do we need to fetch entire window from memory for each iteration? Only if windows of consecutive iterations are mutually exclusive Commonly, consecutive iterations have overlapping windows Overlap represents “reused” data Ideally, would be fetched from memory just once Smart Buffer [Guo, Buyukkurt, Najjar LCTES 2004] Part of the ROCCC compiler Analyzes memory access patterns Detects “sliding windows”, reused data Prevents multiple accesses to same data for (i=0; i < 100; I++) a[i] = b[i] + b[i+1] + b[i+2]; b[0] b[1] b[2] b[3] b[4] b[5]

Smart Buffer RAM 1st window read from memory into smart buffer + + for (i=0; i < 100; I++) a[i] = b[i] + b[i+1] + b[i+2]; 1st window read from memory into smart buffer b[0-2] b[0] b[1] b[2] + +

Smart Buffer RAM 1st window pushed to datapath + + for (i=0; i < 100; I++) a[i] = b[i] + b[i+1] + b[i+2]; Continues reading needed data, but does not reread b[1-2] b[3-5] b[0] b[1] b[2] b[3] b[4] b[5] 1st window pushed to datapath b[0] b[1] b[2] + +

Smart Buffer RAM 2nd window pushed to datapath + + for (i=0; i < 100; I++) a[i] = b[i] + b[i+1] + b[i+2]; Continues fetching needed data (as opposed to windows) b[6-8] b[0] b[1] b[2] b[3] b[4] b[5] b[6] . . . 2nd window pushed to datapath b[1] b[2] b[3] b[0] no longer needed, deleted from buffer + b[0]+b[1] b[2] +

Smart Buffer RAM 3rd window pushed to datapath + + And so on for (i=0; i < 100; I++) a[i] = b[i] + b[i+1] + b[i+2]; And so on b[9-11] b[0] b[1] b[2] b[3] b[4] b[5] b[6] . . . 3rd window pushed to datapath b[2] b[3] b[4] b[1] no longer needed, deleted from buffer + b[1]+b[2] b[3] + b[1]+b[2]+b[3]

Comparison with FIFO FIFO - fetches a window for each iteration Reads 3 elements every iteration 100 iterations * 3 accesses/iteration = 300 memory accesses Smart Buffer - fetches as much data as possible each cycle, buffer assembles data into windows Ideally, reads each element once # accesses = array size Circuit performance is equal to latency plus time to read array from memory No matter how much computation, execution time is approximately equal to time to stream in data! Note: Only true for streaming examples 102 memory accesses Essentially improves memory bandwidth by 3x How does this help? Smart buffers enable more unrolling

Unrolling with Smart Buffers Assume bandwidth = 128 bits We can read 128/32 = 4 array elements each cycle First access doesn’t save time No data in buffer Same as FIFO, can unroll once long b[102]; for (i=0; i < 100; I++) a[i] = b[i] + b[i+1] + b[i+2]; b[0] b[1] b[2] b[3] First memory access allows only 2 parallel iterations

Unrolling with Smart Buffers After first window is in buffer Smart Buffer Bandwidth = Memory Bandwidth + Reused Data 4 elements + 2 reused elements (b[2],b[3]) Essentially, provides bandwidth of 6 elements per cycle Can perform 4 iterations in parallel 2x speedup compared to FIFO long b[102]; for (i=0; i < 100; I++) a[i] = b[i] + b[i+1] + b[i+2]; b[0] b[1] b[2] b[3] b[4] b[5] b[6] b[7] However, every subsequent access enables 4 parallel iterations (b[2] and b[3] already in buffer)

Datapath Design w/ Smart Buffers Datapath based on unrolling enabled by smart buffer bandwidth (not memory bandwidth) Don’t be confused by first memory access Smart buffer waits until initial windows in buffer before passing any data to datapath Adds a little latency But, avoids 1st iteration requiring different control due to less unrolling

Another Example Your turn Analyze memory access patterns Determine window overlap Determine smart buffer bandwidth Assume memory bandwidth = 128 bits/cycle Determine maximum unrolling with and without smart buffer Determine total cycles with and without smart buffer Use previous systolic array analysis (latency, bandwidth, etc). short b[1004], a[1000]; for (i=0; i < 1000; i++) a[i] = avg( b[i], b[i+1], b[i+2], b[i+3], b[i+4] );