Access Map Pattern Matching Prefetch: Optimization Friendly Method

Presentation transcript:

Access Map Pattern Matching Prefetch: Optimization Friendly Method
Yasuo Ishii (NEC Corporation), Mary Inaba and Kei Hiraki (The University of Tokyo)

Background
The speed gap between processor and memory has kept widening. To hide the long memory latency, many techniques have been proposed, and hardware data prefetching has grown in importance; many HW prefetchers have been proposed.

Conventional Methods
Conventional prefetchers generally predict the prefetch address from the instruction address, the memory access order, and the memory addresses. However, this information is scrambled by optimizations such as out-of-order memory access and loop unrolling.

Limitation of Stride Prefetch [Chen+95]
This is an example of out-of-order memory access. Consider the loop: for (int i = 0; i < N; i++) { load A[2*i]; }. A stride prefetcher tracks the load in a table entry; once the entry reaches a steady state (tag A, address 0xAB04, stride 2, state steady), the access of the 4th iteration can be prefetched. But when the order of the 2nd and 3rd accesses is swapped, the prefetcher cannot detect the correct address correlation: the stride prefetching table is confused by the out-of-order memory access and cannot detect the stride. (Figure: accesses 1 through 4 walking up the memory address space at cache-line granularity, 0xAB00, 0xAB02, 0xAB04, 0xAB06, with accesses 2 and 3 arriving out of order.)
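A minimal sketch of the table-based stride prefetcher described here, in the style of Chen and Baer: the entry mirrors the slide's (tag, address, stride, state) row, while the training rule and all names are illustrative simplifications.

```cpp
#include <cstdint>
#include <unordered_map>

// Per-PC stride table entry, a simplified reference-prediction-table row.
enum class StrideState { Init, Transient, Steady };

struct StrideEntry {
    uint64_t last_addr = 0;
    int64_t  stride    = 0;
    StrideState state  = StrideState::Init;
};

// Train on one access; returns a prefetch address once the same stride
// has been observed twice in a row, 0 otherwise.
uint64_t train(std::unordered_map<uint64_t, StrideEntry>& table,
               uint64_t pc, uint64_t addr) {
    StrideEntry& e = table[pc];
    const int64_t s =
        static_cast<int64_t>(addr) - static_cast<int64_t>(e.last_addr);
    if (e.state != StrideState::Init && s == e.stride) {
        e.state = StrideState::Steady;
    } else {
        // An out-of-order pair (e.g. 0xAB04 arriving before 0xAB02) puts
        // a bogus stride here and knocks the entry out of Steady: the
        // failure mode this slide illustrates.
        e.state  = StrideState::Transient;
        e.stride = s;
    }
    e.last_addr = addr;
    return e.state == StrideState::Steady ? addr + e.stride : 0;
}
```

Training with the in-order sequence 0xAB00, 0xAB02, 0xAB04 reaches Steady and returns 0xAB06; feeding 0xAB00, 0xAB04, 0xAB02 fails to, which is exactly the confusion the slide describes.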

Weakness of Conventional Methods
Out-of-order memory access scrambles the memory access order, so the prefetcher cannot detect address correlations. Loop unrolling requires additional table entries, and each entry is trained slowly. Since both techniques are already in commodity use, an optimization-friendly prefetcher is required.

Access Map Pattern Matching
To realize optimization-friendly prefetching, we propose the Access Map Pattern Matching (AMPM) prefetch. It is based on order-free pattern matching, which makes the prefetch insensitive to these optimizations. The history of AMPM is stored in access maps, map-based histories in which each access map holds a 2-bit state per cache block.

State Diagram for Each Cache Block
In AMPM, a 2-bit state is attached to each memory address at cache-block granularity: Init (initialized), Access (already accessed), Prefetch (prefetch request issued), and Success (prefetched data accessed). All states are initialized to Init. A demand access moves Init to Access; a prefetch request moves Init to Prefetch; and a demand load or store to a prefetched block moves Prefetch to Success.
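These transitions map onto a small state machine. A minimal sketch using only the transitions stated above (the function names are ours):

```cpp
#include <cstdint>

// The slide's 2-bit per-cache-block states.
enum class BlockState : uint8_t { Init, Access, Prefetch, Success };

BlockState on_demand_access(BlockState s) {
    switch (s) {
        case BlockState::Init:     return BlockState::Access;   // first touch
        case BlockState::Prefetch: return BlockState::Success;  // prefetch was useful
        default:                   return s;                    // already Access/Success
    }
}

BlockState on_prefetch_issue(BlockState s) {
    return s == BlockState::Init ? BlockState::Prefetch : s;    // only untouched blocks
}
```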

Memory Access Pattern Map
Each memory access pattern map manages the states of one fixed-size zone of the memory address space at cache-line granularity: every state in the map corresponds to one cache line of the zone. When an actual request accesses the zone, the corresponding map is selected and sent to the pattern matching logic.
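A sketch of one access map, under two assumptions: 256 lines per zone, matching the "256 states / map" configuration given later, and 64-byte cache lines, which this slide does not state.

```cpp
#include <array>
#include <cstdint>

constexpr int LINE_SHIFT = 6;    // assumed 64-byte cache lines
constexpr int ZONE_LINES = 256;  // "256 states / map" from the DPC configuration

// One access map: a fixed-size zone tracked at cache-line granularity,
// one 2-bit state per line (stored one byte each here for simplicity).
struct AccessMap {
    uint64_t zone_tag = 0;                    // address >> (LINE_SHIFT + 8)
    std::array<uint8_t, ZONE_LINES> state{};
};

// Which cache line inside its zone an address falls on.
inline int line_index(uint64_t addr) {
    return static_cast<int>((addr >> LINE_SHIFT) & (ZONE_LINES - 1));
}
```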

Pattern Matching Logic
The pattern matching logic generates the prefetch requests. First, the memory access map is shifted so that the requested line is aligned to the edge of the map. Next, the pattern detector checks all address correlations in the map, and the matching results are stored in a pipeline register as prefetch candidates. Finally, a priority encoder selects an appropriate candidate as the prefetch request; the selected bit is cleared through a feedback path to the pipeline register, so the next candidate is selected in the next cycle. (Figure: access map shifter, pattern detector, pipeline register, and a prefetch selector built from a priority encoder and adder, producing requests such as Addr+2.)
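A software sketch of the detector stage. It assumes the AMPM matching rule from the paper: a stride k is confirmed when the lines at pos-k and pos-2k were both accessed, making pos+k a candidate. The `degree` cap and the ascending-stride order stand in for the priority encoder and its feedback loop.

```cpp
#include <vector>

// accessed[i] says whether line i of the zone was accessed; pos is the
// line just requested. Returns up to `degree` line indices to prefetch.
std::vector<int> match(const std::vector<bool>& accessed, int pos, int degree) {
    std::vector<int> candidates;
    const int n = static_cast<int>(accessed.size());
    for (int k = 1; pos + k < n; ++k) {
        if (static_cast<int>(candidates.size()) >= degree) break;
        if (pos - 2 * k >= 0 && accessed[pos - k] && accessed[pos - 2 * k])
            candidates.push_back(pos + k);  // line index to prefetch
    }
    return candidates;
}
```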

Parallel Pattern Matching
The parallel pattern matching logic detects stride patterns in the memory access map by checking all of its address correlations at once, which lets the pattern detector search the prefetch candidates effectively.
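In hardware, every stride is tested by its own pair of AND gates within the same cycle. A bit-level software analogue, purely illustrative of that parallelism:

```cpp
#include <cstdint>

// Each loop iteration below corresponds to one independent AND-gate pair
// in hardware, so all strides are evaluated in the same cycle; the loop
// is only the software rendering of that combinational logic. Bit i of
// `acc` is 1 iff line (pos - 1 - i) was accessed, i.e. the shifted,
// left-aligned map from the previous slide.
uint64_t candidate_mask(uint64_t acc) {
    uint64_t mask = 0;
    for (int k = 1; k <= 32; ++k) {
        uint64_t hit = (acc >> (k - 1)) & (acc >> (2 * k - 1)) & 1u;
        mask |= hit << (k - 1);  // bit k-1 set => prefetch line pos + k
    }
    return mask;
}
```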

AMPM Prefetch
To summarize AMPM prefetching: the prefetcher divides the memory address space into fixed-size zones and detects the frequently accessed ones as hot zones. Only hot zones are tracked in the memory access map table, which uses LRU replacement. When an actual request accesses main memory, the associated map is selected from the table and sent to the pattern matching logic, which generates prefetch requests and sends them to main memory.
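Tying the pieces together, a hedged sketch of the table described here: fully associative with LRU replacement, capacity 52 per the configuration slide, with only the resident zones (the hot zones) tracked.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <list>

struct AccessMap {                       // repeats the earlier sketch
    uint64_t zone_tag = 0;
    std::array<uint8_t, 256> state{};    // one 2-bit state per cache line
};

// Fully associative table of access maps with LRU replacement.
struct AccessMapTable {
    std::list<AccessMap> maps;           // front = most recently used
    std::size_t capacity = 52;           // 52 maps per the DPC configuration

    AccessMap& lookup(uint64_t zone_tag) {
        for (auto it = maps.begin(); it != maps.end(); ++it) {
            if (it->zone_tag == zone_tag) {
                maps.splice(maps.begin(), maps, it);   // LRU touch
                return maps.front();
            }
        }
        if (maps.size() == capacity) maps.pop_back();  // evict the LRU map
        maps.push_front(AccessMap{zone_tag});          // allocate new hot zone
        return maps.front();
    }
};
```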

Features of the AMPM Prefetcher
The AMPM prefetcher has two features. The first is pattern-matching-based prefetching over a map-based history, which realizes optimization-friendly prefetching. The second is parallel pattern matching, which searches prefetch candidates effectively and permits a complexity-effective implementation.

Configuration for the DPC Competition
Our submission combines the AMPM prefetcher with an adaptive stream prefetcher. AMPM prefetcher: fully associative, 52 maps, 256 states per map. Adaptive stream prefetcher [Hur+ 2006]: 16 histograms, stream length 8. MSHR configuration: 16 entries for demand requests (default) plus 32 additional entries for prefetch requests.
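For reference, the stated parameters as constants; this is only a sketch, and the simulator code that would consume them is not shown.

```cpp
// Parameters as given on this slide.
constexpr int kAmpmMaps        = 52;   // fully associative access maps
constexpr int kStatesPerMap    = 256;  // 2-bit states per map
constexpr int kAspHistograms   = 16;   // adaptive stream prefetcher [Hur+ 2006]
constexpr int kAspStreamLength = 8;
constexpr int kMshrDemand      = 16;   // default entries for demand requests
constexpr int kMshrPrefetch    = 32;   // additional entries for prefetch requests
```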

Budget Count
This slide gives the storage budget of our prefetcher; for the details of the table, please refer to the paper.

Methodology
Simulation environment: the DPC framework, skipping the first 4,000M instructions and evaluating the following 100M instructions. Benchmarks: the SPEC CPU2006 suite, compiled with "-O3 -fomit-frame-pointer -funroll-all-loops".

IPC Measurement
Our prefetcher improves performance by 53% over no prefetching, and it improves performance on every benchmark. (Figure: per-benchmark IPC; red bars are results without our prefetcher, blue bars with it.)

L2 Cache Miss Count
Our prefetcher reduces the L2 cache miss count by 76%. (Figure: per-benchmark L2 miss counts with and without the prefetcher.)

Related Works
Sequence-based prefetching: Sequential Prefetch [Smith+ 1978], Stride Prefetching Table [Fu+ 1992], Markov Predictor [Joseph+ 1997], and Global History Buffer [Nesbit+ 2004]. Adaptive prefetching: AC/DC [Nesbit+ 2004], Feedback Directed Prefetching [Srinath+ 2007], and Focus Prefetching [Manikantan+ 2008].

Conclusion
We proposed the Access Map Pattern Matching (AMPM) prefetch. Its order-free prefetching realizes optimization-friendly prefetching, and its key mechanism, parallel pattern matching, searches prefetch candidates effectively while permitting a complexity-effective implementation. The optimized AMPM prefetcher achieves good performance: it improves IPC by 53% and reduces the L2 cache miss count by 76%.

Q & A
(Figure: a timeline of prefetching research from 1977 to 2009, grouped into sequence-based (order-sensitive), software, adaptive, spatial, and hybrid families: Buffer Block [Gindele 1977], Sequential [Smith+ 1978], Software Support [Mowry+ 1992], Stride Prefetch [Fu+ 1992], Adaptive Sequential [Dahlgren+ 1993], HW/SW Integrated [Gornish+ 1994], RPT [Chen+ 1995], Markov Prefetch [Joseph+ 1997], Hybrid [Hsu+ 1998], Locality Detection [Johnson+ 1998], Tag Correlation [Hu+ 2003], AC/DC [Nesbit+ 2004], GHB [Nesbit+ 2004], Spatial Patterns [Chen+ 2004], Adaptive Stream [Hur+ 2006], SMS [Somogyi+ 2006], FDP [Srinath+ 2007], Feedback-based [Honjo 2009], and AMPM Prefetch [Ishii+ 2009]. Commercial processors with hardware prefetching noted on the timeline: SuperSPARC, PA7200, R10000, Pentium 4, and Power4.)