1
Access Map Pattern Matching Prefetch: Optimization Friendly Method
Yasuo Ishii (NEC Corporation), Mary Inaba and Kei Hiraki (The University of Tokyo)
2
Background
The speed gap between processor and memory keeps widening. To hide the long memory latency, many techniques have been proposed, and the importance of hardware data prefetching has grown: many HW prefetchers have been proposed.
3
Conventional Methods
Conventional prefetchers predict prefetch addresses from the instruction address, the memory access order, and the memory addresses. However, common optimizations scramble this information: out-of-order execution reorders the memory accesses, and loop unrolling replicates the memory instructions.
4
Limitation of Stride Prefetch [Chen+95]: Out-of-Order Memory Access
for (int i=0; i<N; i++) { load A[2*i]; ..... (A) }
This loop walks the memory address space with stride 2: accesses 1-4 touch cache lines 0xAB00, 0xAB02, 0xAB04, 0xAB06. With in-order accesses, the stride table entry for load (A) (tag A, address 0xAB, state steady) reaches the steady state, so the access of the 4th iteration can be prefetched. But when the order of the 2nd and 3rd accesses is swapped by out-of-order execution, the prefetcher cannot detect the correct address correlation: the stride prefetching table is confused and cannot detect the stride.
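The failure mode above can be sketched with a single stride-table entry. This is a minimal, hypothetical simplification of a reference-prediction-table entry (init/training/steady states are an assumption; the real table also has tags and per-PC entries):

```python
# Minimal sketch of one stride-prefetch table entry (hypothetical
# simplification): it only issues a prefetch once the stride repeats.
class StrideEntry:
    def __init__(self):
        self.last_addr = None
        self.stride = None
        self.state = "init"      # init -> training -> steady

    def access(self, addr):
        if self.last_addr is None:
            self.last_addr = addr
            return None
        stride = addr - self.last_addr
        if stride == self.stride:
            self.state = "steady"
        else:
            self.state = "training"
            self.stride = stride
        self.last_addr = addr
        # Only a steady entry issues a prefetch for the next address.
        return addr + stride if self.state == "steady" else None

# In-order accesses with stride 2 train the entry; the 3rd and 4th
# accesses trigger prefetches of the next elements:
e = StrideEntry()
assert [e.access(a) for a in [0, 2, 4, 6]] == [None, None, 6, 8]

# If out-of-order execution swaps the 2nd and 3rd accesses, the observed
# strides become +4, -2, +4: the entry never reaches steady and no
# prefetch is issued at all.
e = StrideEntry()
assert [e.access(a) for a in [0, 4, 2, 6]] == [None, None, None, None]
```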
5
Weakness of Conventional Methods
Out-of-order memory access scrambles the memory access order, so the prefetcher cannot detect address correlations. Loop unrolling requires additional table entries, and each entry is trained slowly. These optimizations are already commodity techniques in modern processors, so an optimization-friendly prefetching method is required.
6
Access Map Pattern Matching
To realize optimization-friendly prefetching, we propose Access Map Pattern Matching (AMPM) prefetch. It is based on order-free pattern matching, so it is insensitive to the optimizations above. The AMPM history is map-based: each access map holds a 2-bit state per cache block.
7
State Diagram for Each Cache Block
In AMPM, a 2-bit state is attached to each cache-block-sized region of memory. All states are initialized to Init. When a block in the Init state is accessed, it moves to the Access state. When the prefetcher issues a request for a block in the Init state, it moves to the Prefetch state. When a block in the Prefetch state is then accessed by an actual load/store instruction, it moves to the Success state.
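The transitions above can be written down directly. A minimal sketch (state names follow the slide; encoding as 0-3 is an assumption):

```python
# Per-cache-block 2-bit state machine, as described on the slide.
INIT, ACCESS, PREFETCH, SUCCESS = range(4)

def on_demand_access(state):
    # A demand load/store marks an untouched block as Accessed; if the
    # block was prefetched, the access confirms the prefetch as a Success.
    if state == INIT:
        return ACCESS
    if state == PREFETCH:
        return SUCCESS
    return state

def on_prefetch_issue(state):
    # Issuing a prefetch only moves an untouched (Init) block to Prefetch.
    return PREFETCH if state == INIT else state

assert on_demand_access(INIT) == ACCESS
assert on_prefetch_issue(INIT) == PREFETCH
assert on_demand_access(PREFETCH) == SUCCESS
```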
8
Memory Access Pattern Map
Each memory access pattern map corresponds to a fixed-size zone of the memory address space, at cache-line granularity: every state in the map is associated with one cache line of the zone. When an actual request accesses a zone, the corresponding map is selected and sent to the pattern matching logic.
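The address-to-map bookkeeping can be sketched as follows. The parameters (64-byte cache lines, 256 blocks per zone, so 16 KB zones) are assumptions for illustration, as are the "I"/"A" state letters:

```python
# Sketch: map a byte address to (zone tag, in-zone block index) and
# record the access in that zone's map.  Assumed parameters:
LINE = 64                      # cache-line size in bytes (assumption)
BLOCKS_PER_ZONE = 256          # states per map, giving 16 KB zones

def locate(addr):
    block = addr // LINE
    return block // BLOCKS_PER_ZONE, block % BLOCKS_PER_ZONE

maps = {}                      # zone tag -> per-block state list

def record_access(addr):
    zone, idx = locate(addr)
    m = maps.setdefault(zone, ["I"] * BLOCKS_PER_ZONE)
    m[idx] = "A"               # Init -> Access for this cache line
    return m

# Two accesses 64 bytes apart fall in the same zone, adjacent blocks:
m = record_access(0x10000)
record_access(0x10040)
assert m[0] == "A" and m[1] == "A"
```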
9
Pattern Matching Logic
The pattern matching logic generates the prefetch requests. First, the access map is shifted so that the requested block is aligned to the edge of the map. Next, the pattern detector checks all address correlations of the map, and the matching results are stored in a pipeline register as prefetch candidates. Finally, a priority encoder selects a candidate and an adder forms the prefetch address (+1, +2, +3, ... relative to the request, e.g. Addr+2). Each selected request is fed back to the pipeline register and its bit is cleared, so the next candidate is selected in the next cycle.
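The priority-encoder/feedback loop can be sketched in software. This is a hedged sketch, not the actual hardware: it assumes bit i of the candidate vector stands for offset +(i+1) from the accessed block:

```python
# Sketch of candidate selection: the pattern detector leaves a bit vector
# of prefetch candidates in the pipeline register; each "cycle" a priority
# encoder picks the lowest set bit, an adder forms the prefetch address,
# and the feedback path clears the chosen bit.
def drain_candidates(base_addr, candidates):
    requests = []
    while candidates:
        offset = (candidates & -candidates).bit_length() - 1  # priority encoder
        requests.append(base_addr + offset + 1)               # adder: +1, +2, ...
        candidates &= candidates - 1                          # feedback: clear bit
    return requests

# Candidates at offsets +2 and +3 relative to the accessed block:
assert drain_candidates(0xAB00, 0b0110) == [0xAB02, 0xAB03]
```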
10
Parallel Pattern Matching
The parallel pattern matcher detects stride patterns from the memory access map. It checks all address correlations in the map in parallel, which lets it search the prefetch candidates effectively.
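One plausible reading of the correlation check (an assumption; the slide does not spell out the rule): for a demand access at block t, every stride k is tested, and if the blocks at t-k and t-2k were both accessed, t+k becomes a prefetch candidate. The hardware evaluates all k at once; this sketch loops over them:

```python
# Order-free pattern matching sketch: two prior accesses at the same
# stride predict the next block at that stride, regardless of the order
# in which the prior accesses occurred.
def match(access_map, t):
    n = len(access_map)
    candidates = []
    for k in range(1, n):              # hardware checks all k in parallel
        if t - 2 * k >= 0 and t + k < n:
            if access_map[t - k] == "A" and access_map[t - 2 * k] == "A":
                candidates.append(t + k)
    return candidates

# Accesses at blocks 0, 2, 4 (stride 2): the access at t=4 predicts block 6,
# and the result is the same whichever order blocks 0, 2, 4 were touched in.
amap = ["I"] * 16
for b in (0, 2, 4):
    amap[b] = "A"
assert match(amap, 4) == [6]
```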
11
AMPM Prefetch
Summary of AMPM prefetching: the prefetcher divides the memory address space into fixed-size zones and detects the frequently accessed zones as hot zones. Only the hot zones are managed in the memory access map table (LRU replacement). When an actual request accesses main memory, the associated map is selected from the table and issued to the pattern matching logic, which generates prefetch requests and sends them to main memory.
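The whole flow can be tied together in one sketch, under the same illustrative assumptions as before (64-byte lines, 256-block zones, "A" = accessed, and the t-k/t-2k matching rule); hot-zone detection and LRU replacement are omitted:

```python
# End-to-end AMPM-style sketch: each demand access updates its zone's
# map, then runs the matcher on that map to produce prefetch addresses.
LINE, BLOCKS = 64, 256         # assumed parameters
zone_maps = {}                 # zone tag -> per-block state list

def prefetch_for(addr):
    block = addr // LINE
    zone, t = divmod(block, BLOCKS)
    m = zone_maps.setdefault(zone, ["I"] * BLOCKS)
    m[t] = "A"
    out = []
    for k in range(1, BLOCKS):
        if t - 2*k >= 0 and t + k < BLOCKS and m[t-k] == "A" and m[t-2*k] == "A":
            out.append((block + k) * LINE)   # candidate as a byte address
    return out

# A stride-2 access stream: the third access yields a prefetch one
# stride ahead of the current block.
assert prefetch_for(0x0000) == []
assert prefetch_for(0x0080) == []
assert prefetch_for(0x0100) == [0x0180]
```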
12
Features of the AMPM Prefetcher
The AMPM prefetcher has two key features. First, pattern-matching-based prefetching on a map-based history realizes optimization-friendly prefetching. Second, parallel pattern matching searches prefetch candidates effectively and admits a complexity-effective implementation.
13
Configuration for the DPC Competition
Our prefetcher for the DPC competition combines the AMPM prefetcher with an adaptive stream prefetcher.
AMPM prefetcher: fully associative, 52 maps, 256 states per map
Adaptive stream prefetcher [Hur+ 2006]: 16 histograms, stream length 8
MSHR configuration: 16 entries for demand requests (default), plus 32 entries for prefetch requests (additional)
14
Budget Count
This is the budget count of our prefetcher; for the details of the table, please refer to the paper.
15
Methodology
Simulation environment: the DPC framework. The simulation skips the first 4000M instructions and evaluates the following 100M instructions.
Benchmarks: the SPEC CPU2006 benchmark suite, compiled with "-O3 -fomit-frame-pointer -funroll-all-loops".
16
IPC Measurement
[Chart: IPC without our prefetcher (red bars) and with it (blue bars).] Our prefetcher improves performance by 53% over no prefetching, and it improves performance on all benchmarks.
17
L2 Cache Miss Count
[Chart: L2 cache miss counts.] Our prefetcher reduces the L2 cache miss count by 76%.
18
Related Works
Sequence-based prefetching: Sequential Prefetch [Smith+ 1978], Stride Prefetching Table [Fu+ 1992], Markov Predictor [Joseph+ 1997], Global History Buffer [Nesbit+ 2004]
Adaptive prefetching: AC/DC [Nesbit+ 2004], Feedback Directed Prefetching [Srinath+ 2007], Focus Prefetching [Manikantan+ 2008]
19
Conclusion
We proposed the Access Map Pattern Matching (AMPM) prefetch. Its order-free prefetching realizes optimization-friendly prefetching, and its key mechanism, parallel pattern matching, searches prefetch candidates effectively and can be implemented with complexity-effective hardware. The optimized AMPM prefetcher achieves good performance: it improves IPC by 53% and reduces the L2 cache miss count by 76%.
20
Q & A
[Backup slide: a timeline of prefetching research, from Buffer Block (Gindele 1977) and Sequential (Smith+ 1978) through Software Support (Mowry+ 1992), Stride Prefetch (Fu+ 1992), Adaptive Sequential (Dahlgren+ 1993), HW/SW Integrated (Gornish+ 1994), RPT (Chen+ 1995), Markov Prefetch (Joseph+ 1997), Hybrid (Hsu+ 1998), Locality Detection (Johnson+ 1998), Tag Correlation (Hu+ 2003), AC/DC and GHB (Nesbit+ 2004), Spatial Patterns (Chen+ 2004), Adaptive Stream (Hur+ 2006), SMS (Somogyi+ 2006), FDP (Srinath+ 2007), and Feedback-based (Honjo 2009) to AMPM Prefetch (Ishii+ 2009), alongside commercial processors such as SuperSPARC, PA7200, R10000, Pentium 4, and Power4.]