Combining Local and Global History for High Performance Data Prefetching
Martin Dimitrov and Huiyang Zhou
School of Electrical Engineering and Computer Science, University of Central Florida

Our Contributions
– New localities in the local and global address stream
– A high performance prefetcher design
– Mechanisms for eliminating redundant prefetches
– Advocating for L1-cache data prefetchers

Presentation Outline
– Contributions
– Novel data localities in the address stream
– Proposed data prefetcher
– Filtering of redundant prefetches
– Design space exploration
– Experimental results
– Conclusions

Novel Data Localities: Global Stride
Global stride exists when there is a constant stride between the addresses of two different instructions.
Global address stream:
  Load A: X     Y     Z
  Load B: X+d   Y+d   Z+d
When does it occur?
– Load/store instructions access adjacent elements of a data structure
– Address-Value Delta [MICRO-38] is also a form of global stride
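As a concrete illustration (my own hypothetical example, not from the slides): two static loads that touch adjacent fields of the same array element exhibit a constant global stride between their addresses, even though each load also has its own local stride.

    // Hypothetical code: Load A and Load B are two different static loads whose
    // addresses differ by a constant global stride d on every iteration.
    struct Node { long key; long value; };   // sizeof(Node) == 16 on a typical LP64 system

    long sum_keys_and_values(const Node* nodes, int n) {
        long sum = 0;
        for (int i = 0; i < n; ++i) {
            sum += nodes[i].key;    // Load A: addresses X, X+16, X+32, ...
            sum += nodes[i].value;  // Load B: addresses X+8, X+24, X+40, ... (global stride d = 8)
        }
        return sum;
    }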

Novel Data Localities: Most Common Stride
Most common stride exists when a constant stride pattern is disrupted from time to time.
Local address delta stream:
  Store A: D X D Y D Z D …
When does it occur?
Code example from Soplex:
  for (j = lll = 0; j < psv->size(); ++j) {
      x = psv->value(j);
      if (isNotZero(x, eps)) {
          k = psv->index(j);
          kk = u.row.start[k] + (u.row.len[k]++);
          u.col.idx[m++] = k;
          u.row.idx[kk] = i;
          u.row.val[kk] = x;
          ++lll;
      }
  }
(Figure: local address delta in bytes)

Novel Data Localities: Scalar Stride
Scalar stride exists when the stride (address delta) is multiplied or divided by a constant.
Local address delta stream:
  Load A: 32D 16D 8D 4D 2D D …
When does it occur?
Code example from mcf:
  long cmp;
  while ( ... ) {
      ...
      cmp *= 2;
      if( cmp + 1 < max_residual_new_m )
          if( new[cmp-1].flow < new[cmp].flow )
              cmp++;
  }
(Figure: local address delta in bytes)

Proposed Data Prefetcher
(Diagram: a Global History Buffer (GHB) prefetcher: a PC-indexed Index Table (Tag, Index) points into an N-entry GHB, and a per-PC LDB (FIFO) holds the last address and last matched stride; the Prefetch Function generates prefetch requests, which pass through filtering of redundant prefetches.)
Drawbacks of the GHB organization noted on the slide:
– A few static instructions may occupy the whole GHB
– Requires sequential traversal of the linked list
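To make the organization concrete, here is a minimal sketch of the structures the diagram suggests; the type names, field widths, and the placement of the LDB next to the index table are my assumptions, not the paper's exact layout.

    #include <cstdint>
    #include <deque>

    // Sketch only: a GHB entry records one miss address in global (program) order
    // and links back to the previous entry of the same PC.
    struct GHBEntry {
        uint64_t addr = 0;
        int      prev = -1;           // index of this PC's previous GHB entry, -1 if none
    };

    // Per-PC local delta buffer (LDB), as named on the slide.
    struct LDBEntry {
        std::deque<int64_t> deltas;   // FIFO of the last several local address deltas
        uint64_t last_addr = 0;
        int64_t  last_matched_stride = 0;
        bool     conf = false;        // set once the full prefetch degree has been issued
    };

    // PC-indexed index table entry pointing into the GHB, with its LDB alongside.
    struct IndexTableEntry {
        uint64_t tag = 0;             // PC tag
        int      ghb_index = -1;      // head of this PC's linked list in the GHB
        LDBEntry ldb;
    };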

Prefetch Function: Detecting Global Stride
Global address stream:
  Load A: X     Y     Z
  Load B: X+d   Y+d   Z+d
(Diagram: the N-entry GHB holds … X, Y, Z, …, X+d, Y+d, Z+d; the current global delta is compared against the delta at this PC's previous entry: Match?)
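One plausible reading of the match check (a sketch, not the exact hardware logic): compare the global delta between the current miss and a nearby earlier GHB entry with the delta observed at this PC's previous miss against the correspondingly placed entry. The GHB is simplified here to a plain vector of miss addresses; detect_global_stride and its lookback window are my own names and choices.

    #include <cstdint>
    #include <optional>
    #include <vector>

    // cur: index of this PC's newest miss in the GHB; prev: index of its previous miss.
    std::optional<int64_t> detect_global_stride(const std::vector<uint64_t>& ghb_addrs,
                                                int cur, int prev, int lookback = 4) {
        if (prev < 0) return std::nullopt;
        for (int k = 1; k <= lookback; ++k) {
            if (cur - k < 0 || prev - k < 0) break;
            int64_t d_now  = (int64_t)(ghb_addrs[cur]  - ghb_addrs[cur - k]);
            int64_t d_prev = (int64_t)(ghb_addrs[prev] - ghb_addrs[prev - k]);
            if (d_now != 0 && d_now == d_prev)
                return d_now;   // candidate prefetches: ghb_addrs[cur] + d_now, + 2*d_now, ...
        }
        return std::nullopt;
    }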

Prefetch Function: Detecting Delta Correlation
Local delta stream:
  Load A: a b c d a b c d a b c d ... a b
The last two deltas (a, b) are matched against the history; on a match, the deltas that follow (c, d) generate prefetches.
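The two-delta match can be sketched as follows (an illustrative reconstruction; the function name, the backward search, and the replay cutoff are my own choices): find the most recent earlier occurrence of the last two deltas in this PC's delta history, then replay the deltas that followed it.

    #include <cstdint>
    #include <vector>

    // deltas: this PC's local delta history, oldest first; addr: current address.
    std::vector<uint64_t> delta_correlation_prefetches(const std::vector<int64_t>& deltas,
                                                       uint64_t addr, int degree) {
        std::vector<uint64_t> out;
        int n = (int)deltas.size();
        if (n < 4) return out;                              // need a pair plus some history
        int64_t d1 = deltas[n - 2], d2 = deltas[n - 1];     // the two most recent deltas
        for (int i = n - 3; i >= 1; --i) {                  // search backwards for the pair
            if (deltas[i - 1] == d1 && deltas[i] == d2) {
                uint64_t a = addr;
                for (int j = i + 1; j < n - 2 && (int)out.size() < degree; ++j) {
                    a += deltas[j];                         // replay the deltas that followed
                    out.push_back(a);
                }
                break;
            }
        }
        return out;
    }

With the slide's stream (... a b c d a b c d ... a b), the last pair (a, b) matches an earlier (a, b), and the following deltas c, d produce the prefetch addresses.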

Prefetch Function: Detecting Single Delta Match
Local delta stream:
  Load A: a x c d a z c d a y c d ... a
If no two-delta correlation is found, the last delta (a) alone is matched against an earlier occurrence; the deltas that follow it (x, c, d) generate prefetches.

Prefetch Function
If no delta correlation is detected, generate 2 prefetches:
– Prefetch the last matched stride to approximate the most common stride
– Next-line prefetch
The output of the prefetch function is a buffer (up to the maximum prefetch degree) filled with potential prefetch addresses.
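The fallback step can be sketched as below (my reconstruction of the order described on this slide; correlated is whatever the delta matchers above produced, and the 64-byte line size plus the function name are assumptions):

    #include <cstdint>
    #include <vector>

    constexpr uint64_t CACHE_LINE = 64;   // assumed line size

    std::vector<uint64_t> finalize_prefetches(std::vector<uint64_t> correlated,
                                              uint64_t addr,
                                              int64_t last_matched_stride) {
        if (!correlated.empty())
            return correlated;                                // delta correlation takes priority
        std::vector<uint64_t> out;
        if (last_matched_stride != 0)                         // approximate the most common stride
            out.push_back(addr + (uint64_t)last_matched_stride);
        out.push_back((addr & ~(CACHE_LINE - 1)) + CACHE_LINE);  // next-line prefetch
        return out;                                           // at most 2 prefetches on this path
    }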

Filtering of Redundant Prefetches
Local redundant prefetches (Load A address stream):
  time 1: miss: a              → prefetch: b, c, d, e
  time 2: hit (pref bit ON): b → prefetch: c, d, e, f
  time 3: hit (pref bit ON): c → prefetch: d, e, f, g
Global redundant prefetches (other loads/stores use data in the same cache line as Load A):
– Load B prefetches: a+8, x, y, etc.
– Load C prefetches: b+16, w, z, etc.

Filtering of Redundant Prefetches
Filtering local redundant prefetches:
– Add a confidence bit to each LDB to indicate that we have already prefetched the full prefetch degree
– If the conf bit is set, issue only 1 prefetch
  time 1: miss: a              → prefetch: b, c, d, e
  time 2: hit (pref bit ON): b → prefetch: f   (conf: ON)
Filtering global redundant prefetches:
– Use the MSHRs
– Use a Bloom filter: on a Bloom filter hit, drop the prefetch; reset the Bloom filter periodically
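A compact sketch of the global filter (the Bloom filter size, hash functions, and reset interval are illustrative assumptions, not the paper's parameters):

    #include <bitset>
    #include <cstdint>
    #include <functional>

    // Drop a prefetch if its cache-line address has (probably) been seen recently;
    // reset periodically so stale entries do not suppress useful prefetches forever.
    class PrefetchBloomFilter {
        static constexpr size_t kBits = 4096;
        static constexpr size_t kResetInterval = 8192;
        std::bitset<kBits> bits_;
        size_t inserts_since_reset_ = 0;

        size_t h1(uint64_t line) const { return std::hash<uint64_t>{}(line) % kBits; }
        size_t h2(uint64_t line) const { return (line * 0x9E3779B97F4A7C15ull >> 20) % kBits; }

    public:
        // Returns true if the prefetch should be dropped as redundant.
        bool test_and_insert(uint64_t line_addr) {
            bool hit = bits_[h1(line_addr)] && bits_[h2(line_addr)];
            bits_[h1(line_addr)] = true;
            bits_[h2(line_addr)] = true;
            if (++inserts_since_reset_ >= kResetInterval) {   // periodic reset
                bits_.reset();
                inserts_since_reset_ = 0;
            }
            return hit;
        }
    };

On the local side, the confidence bit needs no extra structure: once it is set for an LDB, the prefetch function simply issues one new address instead of the full degree.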

Design Space Exploration
Prefetch into the L1 or the L2 cache?
We advocate prefetching into the L1 cache:
+ L1-cache hits are better than L2-cache hits
+ More accurate address stream
+ Access to the program counter (PC)
– Latency is more critical

Design Space Exploration
Three prefetcher design points:
– GHB-LDB-v1: highest-performance design, using MSHRs to remove redundant prefetches
– GHB-LDB-v2: scaled-down design, using a Bloom filter to remove redundant prefetches
– LDB-only: a design that is very efficient in both complexity and latency

Design Space Exploration: LDB-only Design
– Each entry in the table is an LDB (a FIFO of the last several deltas, the last address, and a confidence bit)
– Can detect all the stride patterns except global stride
– Latency efficient: no linked-list traversal, quick Bloom filter access
(Diagram: a PC-indexed LDB Table (Tag, LDB) feeds the Prefetch Function; prefetch requests are filtered by a Bloom Filter.)
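A short sketch of the access path this buys (the table size, delta-history depth, and direct-mapped indexing are my assumptions): one table lookup per access, with no GHB list to walk.

    #include <cstdint>
    #include <deque>
    #include <vector>

    struct LdbOnlyEntry {
        uint64_t pc_tag = 0;
        std::deque<int64_t> deltas;   // FIFO of the last several local deltas
        uint64_t last_addr = 0;
        bool conf = false;
    };

    void ldb_only_access(std::vector<LdbOnlyEntry>& table, uint64_t pc, uint64_t addr) {
        LdbOnlyEntry& e = table[pc % table.size()];           // direct-mapped, O(1) lookup
        if (e.pc_tag != pc) {                                 // tag miss: start a fresh entry
            e = LdbOnlyEntry{pc, {}, addr, false};
            return;
        }
        e.deltas.push_back((int64_t)(addr - e.last_addr));    // record the new local delta
        if (e.deltas.size() > 8) e.deltas.pop_front();
        e.last_addr = addr;
        // ...then run the prefetch-function sketches above and the Bloom filter on the output.
    }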

Storage Cost
(Storage cost table.)

Experimental Results
(Chart: speedup for the best-performing design point, GHB-LDB-v1.)
Average speedup for the other two designs: 1.60X and 1.56X

Conclusions
– We introduce a high-performance prefetcher design for prefetching into the L1 cache
– We discover and utilize novel localities in the global and local address streams
– We emphasize the importance of filtering redundant prefetches and propose mechanisms to accomplish the task

Questions?

Backup: Experimental Results
(Tables: speedup for the best-performing design point GHB-LDB-v1, and for GHB-LDB-v1 prefetching into the L2 cache*; three configurations, over bzip2, lbm, mcf, milc, omnetpp, soplex, xalan, and Gmean.)
*Due to a problem with our MSHR implementation while prefetching into the L2 cache, we use a Bloom filter.

Backup: Experimental Results
(Tables: speedup for GHB-LDB-v1 with no filtering of redundant prefetches, and for the original GHB design prefetching into L1 with no filtering of redundant prefetches; three configurations, over bzip2, lbm, mcf, milc, omnetpp, soplex, xalan, and Gmean.)
*Due to a problem with our MSHR implementation when prefetching into the L2 cache, we use a Bloom filter.

Backup: Experimental Results
(Tables: speedup for GHB-LDB-v2 and for the LDB-only design; three configurations, over bzip2, lbm, mcf, milc, omnetpp, soplex, xalan, and Gmean.)