Luis M. Ramos, José Luis Briz, Pablo E. Ibáñez and Víctor Viñals.


Multi-level Adaptive Prefetching based on Performance Gradient Tracking
Luis M. Ramos, José Luis Briz, Pablo E. Ibáñez and Víctor Viñals
University of Zaragoza (Spain)
DPC-1 - Raleigh, NC – Feb. 15th, 2009

Introduction

Hardware data prefetching
- Effective at hiding memory latency
- No single prefetching method matches every application

Aggressive prefetchers (e.g., SEQT and stream buffers)
- Boost average performance
- But put high pressure on memory and cause performance losses in hostile applications
- Filtering mechanisms help, at a non-negligible hardware cost
- Adaptive mechanisms tune the aggressiveness [Ramos et al. 08]

Correlating prefetchers (e.g., PC/DC)
- More selective
- Tables store the program's memory behaviour (addresses or deltas)
- Cost: large tables and many table accesses
- PDFCM [Ramos et al. 07]

Introduction

Three reasonable targets:
I. Minimize costs
II. Cut losses for every application
III. Boost overall performance

One proposal to address each target, using a common framework:
- Prefetched blocks stored in the caches
- Prefetch filtering techniques
- L1: SEQT with a static degree policy
- L2: SEQT and/or PDFCM with an adaptive degree policy based on the performance gradient

Outline
- Prefetching framework
- Proposals
- Hardware costs
- Results
- Conclusions

Prefetching framework

[Diagram: a prefetch engine with its Degree Controller, followed by the prefetch filters (MSHRs, Cache Lookup, PMAF), feeding the prefetch queue]

Prefetching framework

[Diagram: two-level framework. L1: SEQT engine with Degree Controller, plus Cache Lookup and PMAF filters, feeding the L1 queue. L2: SEQT and/or PDFCM engine (depending on the proposal) with Degree Controller, plus MSHR, Cache Lookup and PMAF filters, feeding the L2 queue]

SEQT Prefetch Engines
- Fed with misses and first uses of prefetched blocks (loads and stores)
- Include a Degree Automaton that generates 1 prefetch per cycle
- The maximum degree is indicated by the Degree Controller
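The SEQT engine above can be sketched as follows. This is an illustrative model, not the authors' hardware: the class and parameter names (`SeqtEngine`, `max_degree`, `BLOCK_SIZE`) are assumptions, and the one-prefetch-per-cycle automaton is modelled as an explicit `cycle()` call.

```python
BLOCK_SIZE = 64  # bytes per cache block (assumed)

class SeqtEngine:
    """Sequential tagged (SEQT) prefetcher with a degree automaton."""

    def __init__(self, max_degree=4):
        self.max_degree = max_degree   # set by the Degree Controller
        self.next_block = None         # next block address to prefetch
        self.remaining = 0             # prefetches left for current trigger

    def trigger(self, addr):
        """Fed with misses and first uses of prefetched blocks."""
        block = addr // BLOCK_SIZE
        self.next_block = block + 1
        self.remaining = self.max_degree

    def cycle(self):
        """Degree automaton: emit at most one prefetch per cycle."""
        if self.remaining == 0:
            return None
        pf = self.next_block * BLOCK_SIZE
        self.next_block += 1
        self.remaining -= 1
        return pf

engine = SeqtEngine(max_degree=4)
engine.trigger(0x1000)                        # miss on the block at 0x1000
prefetches = [engine.cycle() for _ in range(5)]
# degree 4 -> four sequential blocks, then the automaton idles
```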

PDFCM Prefetch Engine
- Delta-correlating prefetcher
- Trained with L2 misses and first uses
- Two structures: a History Table (HT, indexed by the PC, holding a tag, the last address, a delta history and a confidence counter) and a Delta Table (DT, mapping a delta history to a predicted δ)
- Operation: update, predict, degree automaton

PDFCM Operation (worked example)

Training address stream: 20 22 24 30 32 34, current access: 40
Deltas so far: 2 2 6 2 2 …

I. Update
1) Index the HT with the PC, check the tag and read the entry (last @ = 34, history = 2 2 6 2)
2) Check the prediction: predicted δ = 6, actual δ = 40 − 34 = 6 → correct, update the confidence counter
3) Calculate the new history by shifting in the actual δ: 2 6 2 6
4) Update the HT entry (last @ = 40, new history)

II. Predict
- Index the DT with the new history to obtain the predicted δ = 2

III. Degree Automaton
1) Prefetch 40 + 2 = 42, and calculate a speculative history with the predicted δ
2) Index the DT with the speculative history (δ = 2) and prefetch 42 + 2 = 44
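The update/predict/degree-automaton steps can be sketched in a few lines. This is a simplified model, not the real design: it uses unbounded Python dicts for the HT and DT, a history length of 3, a fixed degree of 2, and omits the tag check and confidence counter, all of which are assumptions made so the slide's worked example is reproduced exactly.

```python
HIST_LEN = 3  # deltas kept per history (assumed)

class Pdfcm:
    """Delta-correlating prefetcher: History Table + Delta Table."""

    def __init__(self, degree=2):
        self.ht = {}        # History Table: PC -> (last_addr, history)
        self.dt = {}        # Delta Table: history -> predicted delta
        self.degree = degree

    def access(self, pc, addr):
        """Update on a training access, then run the degree automaton."""
        if pc not in self.ht:
            self.ht[pc] = (addr, (0,) * HIST_LEN)
            return []
        last, hist = self.ht[pc]
        delta = addr - last                    # actual delta (e.g. 40 - 34 = 6)
        self.dt[hist] = delta                  # learn: old history -> delta
        hist = hist[1:] + (delta,)             # shift the actual delta in
        self.ht[pc] = (addr, hist)
        # Degree automaton: follow predicted deltas speculatively.
        prefetches, spec_addr, spec_hist = [], addr, hist
        for _ in range(self.degree):
            d = self.dt.get(spec_hist)
            if d is None:                      # no prediction for this history
                break
            spec_addr += d
            prefetches.append(spec_addr)
            spec_hist = spec_hist[1:] + (d,)   # speculative history
        return prefetches

p = Pdfcm(degree=2)
for a in (20, 22, 24, 30, 32, 34):             # training stream
    p.access(pc=0x400, addr=a)
print(p.access(pc=0x400, addr=40))             # -> [42, 44], as on the slide
```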

L1 Degree Controller: static degree policy
- Degree ranges from 1 to 4
- On a miss → degree 1
- On a first use of a prefetched block → degree 4
- The Degree Controllers monitor the automaton degree of the prefetch engines and implement the degree policy

L2 Degree Controller: Performance Gradient Tracking
- The L2 degree controller is more complex: a 2-state automaton (increasing degree / decreasing degree)
- Every epoch (64 Kcycles), compare performance with the previous epoch:
  + more performance than the previous epoch → keep the current state
  − less performance than the previous epoch → switch state
- Update the degree along the sequence [0, 1, 2, 3, 4, 6, 8, 12, 16, 24, 32, 64]
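A minimal sketch of this epoch-based hill-climbing controller follows. The class name, the starting degree, and the use of a single scalar performance figure per epoch (e.g. instructions committed) are illustrative assumptions; only the degree sequence, the 64 Kcycle epoch and the two-state increase/decrease behaviour come from the slides.

```python
DEGREES = [0, 1, 2, 3, 4, 6, 8, 12, 16, 24, 32, 64]
EPOCH_CYCLES = 64 * 1024  # 64 Kcycles per epoch

class GradientDegreeController:
    """Adaptive degree via performance gradient tracking."""

    def __init__(self, start_index=4):
        self.idx = start_index      # index into DEGREES (4 -> degree 4)
        self.direction = +1         # +1: increasing degree, -1: decreasing
        self.prev_perf = None

    @property
    def degree(self):
        return DEGREES[self.idx]

    def end_epoch(self, perf):
        """Called every EPOCH_CYCLES with the epoch's performance."""
        if self.prev_perf is not None and perf < self.prev_perf:
            self.direction = -self.direction   # performance dropped: reverse
        # otherwise (better, or first epoch): keep the current direction
        self.idx = min(max(self.idx + self.direction, 0), len(DEGREES) - 1)
        self.prev_perf = perf

dc = GradientDegreeController(start_index=4)   # degree 4
dc.end_epoch(perf=100)   # first epoch: keep increasing -> degree 6
dc.end_epoch(perf=120)   # better: keep increasing -> degree 8
dc.end_epoch(perf=110)   # worse: reverse -> degree 6
print(dc.degree)
```

The degree sequence grows roughly geometrically above 4, so the controller can ramp aggressiveness up quickly in friendly phases and back off just as fast when an application is hostile to prefetching.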

Prefetch Filters
- 16 MSHRs in L2 filter secondary misses
- Cache Lookup eliminates prefetches to blocks that are already in the cache
- PMAF: a FIFO holding up to 32 prefetch block addresses that have been issued but not yet serviced
- Filtering matters because redundant prefetches strongly disturb the PDFCM learning process
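The three filters in front of the prefetch queue can be sketched as a simple chain. This is a behavioural model under stated assumptions: the cache and MSHR contents are plain sets rather than real structures, and the class name `PrefetchFilters` is invented for illustration; only the PMAF size (32) and the roles of the three checks come from the slide.

```python
from collections import deque

PMAF_SIZE = 32  # issued-but-unserviced prefetch addresses (from the slide)

class PrefetchFilters:
    """Drop prefetches that are redundant or already in flight."""

    def __init__(self, cache_blocks, mshr_blocks):
        self.cache = cache_blocks            # blocks currently in the cache
        self.mshrs = mshr_blocks             # outstanding demand misses
        self.pmaf = deque(maxlen=PMAF_SIZE)  # FIFO of in-flight prefetches

    def allow(self, block):
        """Return True iff the prefetch should go to the queue."""
        if block in self.cache:   # Cache Lookup: block already present
            return False
        if block in self.mshrs:   # MSHR: a miss to this block is pending
            return False
        if block in self.pmaf:    # PMAF: prefetch already issued
            return False
        self.pmaf.append(block)   # FIFO: oldest entry drops out when full
        return True

f = PrefetchFilters(cache_blocks={0x40}, mshr_blocks={0x41})
print([f.allow(b) for b in (0x40, 0x41, 0x42, 0x42)])
# -> [False, False, True, False]
```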

Three goals, three proposals

Three reasonable targets:
I. Minimize costs
II. Cut losses for every application
III. Boost overall performance

One proposal per target:
- Mincost (1255 bits): SEQT as the L2 prefetch engine
- Minloss (20784 bits): PDFCM as the L2 prefetch engine
- Maxperf (20822 bits): SEQT & PDFCM as the L2 prefetch engine

Common to all three: an L1 SEQT prefetch engine with the static degree policy (degree 1–4), an adaptive degree in L2 obtained by tracking the performance gradient, and the prefetch filters.

Results: the three proposals
- DPC-1 environment, SPEC CPU 2006
- 40 billion instructions of warm-up, 100 million executed

Results: adaptive vs. fixed degree
[Figure: the adaptive degree policy compared with fixed degrees 16, 4 and 1]

Conclusions
- Different targets lead to different designs over a common multi-level prefetching framework
- Three different engines, each targeted to one goal:
  - Mincost: minimize cost (~1 Kbit)
  - Minloss: minimize losses (< 1% in astar; < 2% in povray)
  - Maxperf: maximize performance (11% losses in astar)
- The proposed adaptive degree policy is cheap (131 bits) and effective

Thank you