IMPROVING THE PREFETCHING PERFORMANCE THROUGH CODE REGION PROFILING
Martí Torrents, Raúl Martínez, and Carlos Molina
Computer Architecture Department, UPC – BarcelonaTech
Outline
- Motivation: prefetching, prefetching in CMPs, prefetch adverse behaviors
- Objective / Proposal: code region granularity, switching the prefetcher off, switching the prefetcher on
- Experimental framework
- Expected results
Motivation
The number of cores on a single chip grows every year:
– Intel Nehalem: 4–6 cores
– Tilera: 64–100 cores
– Intel Polaris: 80 cores
– Nvidia GeForce: up to 256 cores
Prefetching
Reduces memory latency by bringing the data the CPU will need next into a nearer cache, increasing the hit ratio. It is implemented in most commercial processors. Erroneous prefetching may produce:
– Cache pollution
– Resource consumption (queues, bandwidth, etc.)
– Extra power consumption
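As a toy illustration of the idea (not code from the talk), a next-line prefetcher can be sketched in a few lines: on every access, the line after the one being touched is also brought into the cache. The cache model and all names here are assumptions chosen for illustration.

```python
class Cache:
    """Toy fully-associative cache of unbounded size (illustrative only)."""

    def __init__(self, line_size=64):
        self.line_size = line_size
        self.lines = set()   # set of resident line numbers
        self.hits = 0
        self.misses = 0

    def access(self, addr, prefetch_next=True):
        line = addr // self.line_size
        if line in self.lines:
            self.hits += 1
        else:
            self.misses += 1
            self.lines.add(line)
        if prefetch_next:
            # next-line prefetch: fetch the following line early
            self.lines.add(line + 1)


cache = Cache()
for a in range(0, 64 * 100, 64):   # sequential walk over 100 lines
    cache.access(a)
# With next-line prefetching, only the very first access misses.
```

On a sequential walk the prefetcher converts 99 of the 100 accesses into hits; on an erratic access pattern the same mechanism would fetch useless lines, which is exactly the pollution and wasted bandwidth the slide warns about.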
Prefetching in CMPs
Useful prefetches imply more performance:
– Hide network latency
– Reduce memory access latency
Useless prefetches imply less performance:
– More power consumption
– More NoC congestion
– Interference with other cores' requests
Prefetch adverse behaviors
M. Torrents, R. Martínez, C. Molina. "Network Aware Performance Evaluation of Prefetching Techniques in CMPs". Simulation Modeling Practice and Theory (SIMPAT), 2014.
Prefetching in shared memories
The prefetcher is distributed, which entails challenges:
– Distributed memory streams
– Distributed prefetch queue
– Statistics generation and collection points differ
These make the prefetcher's task harder: it is more difficult to prefetch accurately.
M. Torrents, et al. "Prefetching Challenges in Distributed Memories for CMPs". In Proceedings of the International Conference on Computational Science (ICCS'15), Reykjavík, Iceland, June 2015.
Objective
Maximize the prefetching benefit by using the prefetcher only when it is working properly, minimizing its adverse effects.
Proposal
Identify when the prefetcher generates slowdown:
– Identify code regions at several granularities
– Analyze the prefetcher's performance in these regions
– Tag these code regions with statistics
Switch the prefetcher off to:
– Save power
– Avoid network contention
– Avoid cache pollution
Switch it on again when it generates speedup.
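The tagging step can be sketched as follows. The structure, the names, and the use of a basic-block start address as the region identifier are assumptions for illustration, not the talk's actual implementation.

```python
from collections import defaultdict


class RegionTag:
    """Per-region prefetch statistics (illustrative structure)."""

    def __init__(self):
        self.useful_prefetches = 0
        self.total_prefetches = 0
        self.prefetcher_on = True   # regions start with the prefetcher enabled


# Map from region identifier (here: a hypothetical basic-block start PC)
# to its statistics tag.
regions = defaultdict(RegionTag)


def record_prefetch(region_id, was_useful):
    """Account one issued prefetch against the region it came from."""
    tag = regions[region_id]
    tag.total_prefetches += 1
    if was_useful:
        tag.useful_prefetches += 1
    return tag


# Two prefetches issued from the same (hypothetical) region:
record_prefetch(0x400123, True)
record_prefetch(0x400123, False)
```

At every region entry the stored tag would then be consulted to decide whether the prefetcher runs, which is the activate/deactivate step described on the granularity slides.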
Code Region Granularity
Divide the code into code regions: single instructions, basic blocks, etc., or all the code.
Example at the instruction level (every instruction is its own region):

mov ebx, 0
mov eax, 0
mov ecx, 0
_Label_1:
mov ecx, [esi + ebx * 4]
add eax, ecx
inc ebx
cmp ebx, 100
jne _Label_1
Code Region Granularity
The same code, partitioned at the basic-block level: the three initialization instructions form one region, and the loop body starting at _Label_1 forms another.
Code Region Granularity
The same code, with the whole program treated as a single region.
Code Region Granularity
Divide the code into code regions: single instructions, basic blocks, etc., or all the code.
Identify and tag the regions:
– Statically (profiling execution)
– Dynamically (during the warm-up)
Regions are tagged with statistics (accuracy / miss ratio), and the prefetcher is activated or deactivated on every new code region, according to the statistics of the region being entered.
Switching off the prefetcher
Detect when the prefetcher is useless, using either metric:
– Accuracy: useful prefetches / total number of prefetches; switch off when the accuracy drops
– Miss ratio: based on the number of misses
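A minimal sketch of the accuracy-based switch-off rule. The threshold value is an assumption for illustration; the talk does not fix a concrete number.

```python
OFF_THRESHOLD = 0.4  # assumed cut-off; not a value from the talk


def prefetcher_stays_on(useful, total):
    """Keep the prefetcher on while accuracy = useful / total stays high."""
    if total == 0:
        return True  # no evidence yet: keep prefetching
    return useful / total >= OFF_THRESHOLD
```

With this rule a region whose prefetches are half useful keeps its prefetcher, while a region at 10% accuracy has it switched off, saving the power, bandwidth, and cache space listed on the proposal slide.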
Switching on the prefetcher
A switched-off prefetcher generates no statistics, so it cannot be reactivated by an accuracy increase. When to reactivate?
– Based on the miss ratio
– After a certain timeout
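The two reactivation triggers can be sketched like this; both the timeout and the miss-ratio threshold are illustrative values, not taken from the talk.

```python
TIMEOUT_CYCLES = 10_000        # assumed timeout, for illustration
MISS_RATIO_THRESHOLD = 0.2     # assumed threshold, for illustration


def should_reactivate(cycles_since_off, misses, accesses):
    """Turn the prefetcher back on after a timeout or when misses climb.

    Accuracy cannot be used here: an off prefetcher issues no prefetches,
    so only demand-miss statistics and elapsed time are available.
    """
    if cycles_since_off >= TIMEOUT_CYCLES:
        return True  # timeout expired: probe whether prefetching helps again
    miss_ratio = misses / accesses if accesses else 0.0
    return miss_ratio > MISS_RATIO_THRESHOLD
```

The timeout acts as a fallback probe, while a rising miss ratio reactivates the prefetcher early in regions that have started missing heavily.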
Experimental framework
gem5:
– 16 x86 CPUs
– Ruby memory system
– L1 prefetchers
– MOESI coherence protocol
– Garnet network simulator
Benchmarks: PARSEC 2.1
Simulation environment
Expected Results
Power savings without losing performance. Smaller granularity means more accuracy:
– Blocks or super-blocks better than the whole code
– Single instructions more accurate than blocks or super-blocks
But smaller granularity also means:
– More resources
– More complexity
Basic-block granularity should provide good results at a realistic complexity.
Q & A
Backup slides
Prefetching in Distributed Memory Systems
Distributing the L2 increases the complexity of prefetching and poses challenges without trivial solutions. (Figure sequence: an L1 miss for address @ reaches the distributed L2 and triggers prefetches of @ + 2 and @ + 4, illustrating distributed patterns, queue filtering, and dynamic profiling.)