Fetch Directed Prefetching - a Study


CS 752 Project
Gokul Nadathur, Nitin Bahadur, Sambavi Muthukrishnan

Motivation
- Execution engine is limited by fetch bandwidth:
  - effect of memory latency on fetch
  - correlation between i-cache stalls and the branch predictor
  - rate at which the branch predictor and BTB can be cycled
- As ILP increases, fetch performance must increase as well

Fetch Directed Architecture
[Block diagram. Components shown: Branch Predictor, Fetch Target Buffer, Fetch Target Queue, Instruction Fetch, Prefetch Filtration Mechanism, Prefetch Instruction Queue, Prefetch Buffer, L2 Cache]

Decoupled Branch Predictor
- Has its own PC
- Runs independently of the fetch pipeline stage
- Makes a prediction each cycle
- Unaffected by i-cache stalls
- Problem: may not have up-to-date branch history
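The decoupling described above can be illustrated with a small cycle-by-cycle sketch. This is not the project's simulator: the function name `run_decoupled`, the queue capacity, the stall cycles, and the use of a simple counter as a stand-in for predicted fetch addresses are all assumptions made for illustration. The point is only that the predictor keeps producing entries while fetch is stalled, so the queue between them fills up.

```python
from collections import deque

def run_decoupled(cycles, stall_cycles, queue_capacity=8):
    """Return the per-cycle occupancy of the queue between predictor and fetch."""
    queue = deque()
    predictor_pc = 0  # the predictor's own PC, separate from the fetch PC
    occupancy = []
    for cycle in range(cycles):
        # The predictor makes one prediction per cycle (assumed to pause
        # only when the queue is full).
        if len(queue) < queue_capacity:
            queue.append(predictor_pc)
            predictor_pc += 1  # stand-in for "next predicted fetch address"
        # Fetch consumes one entry per cycle unless it is stalled on an
        # i-cache miss during this cycle.
        if cycle not in stall_cycles and queue:
            queue.popleft()
        occupancy.append(len(queue))
    return occupancy

# Fetch is stalled in cycles 2, 3 and 4 (hypothetical i-cache miss).
occupancy = run_decoupled(cycles=8, stall_cycles={2, 3, 4})
print(occupancy)
```

During the stall the occupancy rises by one entry per cycle. Those queued entries are what a prefetcher can work from.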

Fetch Target Buffer and Fetch Target Queue
Fetch Target Buffer (FTB)
- Stores the fall-through and target addresses for taken branches
- Accessed each cycle with a prediction from the branch predictor
- Fills single or multiple cache-line blocks into the FTQ
Fetch Target Queue (FTQ)
- Contains blocks of instruction addresses to be executed next
- FTQ entries are dequeued by the fetch engine
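A minimal sketch of the FTB-to-FTQ flow described above, under stated assumptions: the FTB is modeled as a dictionary keyed by fetch-block start address, the cache-line size is taken to be 32 bytes, and the addresses and the helper names `cache_lines` and `predict_block` are hypothetical. The slides do not specify the actual table organization.

```python
from collections import deque

LINE_BYTES = 32  # assumed cache-line size

# Hypothetical FTB contents:
# fetch-block start -> (fall-through address, taken-target address)
ftb = {
    0x1000: (0x1048, 0x2000),
    0x2000: (0x2010, 0x1000),
}

def cache_lines(start, end):
    """Cache-line addresses covering the byte range [start, end)."""
    first = start - start % LINE_BYTES
    return list(range(first, end, LINE_BYTES))

def predict_block(pc, taken, ftq):
    """Look up one fetch block, enqueue its cache lines, return the next PC."""
    fall_through, target = ftb[pc]
    # One fetch block may span a single cache line or several.
    ftq.extend(cache_lines(pc, fall_through))
    # The branch prediction selects between the two stored addresses.
    return target if taken else fall_through

ftq = deque()
next_pc = predict_block(0x1000, taken=True, ftq=ftq)
print([hex(a) for a in ftq], hex(next_pc))
```

The fetch engine would then consume entries from the head of the queue with `ftq.popleft()`.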

Prefetch Filter and Prefetch Instruction Queue
Prefetch Instruction Queue (PIQ)
- Contains a queue of cache blocks to be prefetched
- The prefetch mechanism dequeues PIQ entries and performs the prefetching
Prefetch Filter
- Takes entries from the FTQ, filters them, and inserts them into the PIQ
- Enables intelligent prefetching
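The slides do not state which filtering criteria were used, so the following is only one plausible sketch. It assumes the filter drops lines that are already in the i-cache or already queued, and that the PIQ has a small fixed capacity. The function name `filter_into_piq` and all parameters are hypothetical.

```python
from collections import deque

def filter_into_piq(ftq_entries, cached_lines, piq, piq_capacity=4):
    """Insert FTQ cache-line addresses into the PIQ, skipping useless ones."""
    for line in ftq_entries:
        if line in cached_lines:      # assumed filter: line already in the i-cache
            continue
        if line in piq:               # assumed filter: line already queued
            continue
        if len(piq) >= piq_capacity:  # PIQ full: stop inserting
            break
        piq.append(line)
    return piq

piq = filter_into_piq(
    ftq_entries=[0x1000, 0x1020, 0x1020, 0x1040],
    cached_lines={0x1000},
    piq=deque(),
)
print([hex(a) for a in piq])
```

Filtering of this kind keeps prefetch bandwidth from being spent on lines that would hit in the cache anyway.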

Stream Buffers
[Diagram. Labels shown: L1 I-cache, L2 I-cache, stream buffer FIFO (head, tail), tag and comparator, cache block]
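As a rough illustration of the structure in the diagram, here is a sketch of a single sequential stream buffer in the classic style: a FIFO of line tags whose head is compared against each L1 miss address. The class name, the depth, the line size, and the flush-and-restart policy are assumptions, and the sketch tracks only tags, not cache-block data.

```python
from collections import deque

class StreamBuffer:
    def __init__(self, depth=4, line_bytes=32):
        self.depth = depth
        self.line_bytes = line_bytes
        self.fifo = deque()    # head is the left end, tail is the right end
        self.next_line = None  # next sequential line to prefetch

    def _refill(self):
        # Prefetch sequential lines until the FIFO is full again.
        while len(self.fifo) < self.depth:
            self.fifo.append(self.next_line)
            self.next_line += self.line_bytes

    def lookup(self, miss_line):
        """Called on an L1 miss; returns True if the head tag matches."""
        if self.fifo and self.fifo[0] == miss_line:
            self.fifo.popleft()  # the block moves to the L1 cache
            self._refill()
            return True
        # No match at the head: flush and restart the stream after this line.
        self.fifo.clear()
        self.next_line = miss_line + self.line_bytes
        self._refill()
        return False

sb = StreamBuffer()
results = [sb.lookup(a) for a in (0x1000, 0x1020, 0x1040, 0x3000)]
print(results)
```

The first miss allocates the stream, the next two sequential misses hit at the head, and the jump to an unrelated address flushes the buffer.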

Prefetching in the Fetch Directed Architecture
- Similar to stream buffers
- Prefetch addresses are supplied by the PIQ

Simulation Results
[Six slides of result plots; the figures are not preserved in this transcript]

Conclusions
- Prefetching definitely helps
- The fetch directed architecture aids prefetching
- Optimal results require a sophisticated memory hierarchy