IMPROVING THE PREFETCHING PERFORMANCE THROUGH CODE REGION PROFILING
Martí Torrents, Raúl Martínez, and Carlos Molina
Computer Architecture Department, UPC – BarcelonaTech

Outline
Motivation
– Prefetching
– Prefetching in CMPs
– Prefetch adverse behaviors
Objective
– Proposal
– Code region granularity
– Switch the prefetcher off
– Switch the prefetcher on
Experimental framework
Expected results

Motivation
The number of cores on a single chip grows every year
– Nehalem: 4~6 cores
– Tilera: 64~100 cores
– Intel Polaris: 80 cores
– Nvidia GeForce: up to 256 cores

Prefetching
Reduces memory latency
Brings the data the CPU will need next into a nearer cache
Increases the hit ratio
Implemented in most commercial processors
Erroneous prefetching may produce
– Cache pollution
– Resource consumption (queues, bandwidth, etc.)
– Power consumption
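As a rough illustration of the hit-ratio effect, the sketch below simulates a tiny fully associative LRU cache with and without a next-line prefetcher. The cache size, the line size, the next-line policy, and the function name `simulate` are all assumptions made for this example; the slides do not specify which prefetcher is used.

```python
from collections import OrderedDict


def simulate(addresses, cache_lines=4, line_size=64, prefetch=False):
    """Return the hit ratio of a tiny fully associative LRU cache.

    With prefetch=True, every demand miss also fetches the next line
    (a simple next-line prefetcher, assumed here for illustration).
    """
    cache = OrderedDict()  # line number -> None, ordered by recency
    hits = 0

    def insert(line):
        cache[line] = None
        cache.move_to_end(line)
        if len(cache) > cache_lines:
            cache.popitem(last=False)  # evict the least recently used line

    for addr in addresses:
        line = addr // line_size
        if line in cache:
            hits += 1
            cache.move_to_end(line)
        else:
            insert(line)
            if prefetch:
                insert(line + 1)
    return hits / len(addresses)


# Sequential walk over 32 cache lines, one access per line.
stream = [i * 64 for i in range(32)]
no_pf = simulate(stream, prefetch=False)
with_pf = simulate(stream, prefetch=True)
print(no_pf, with_pf)
```

On this sequential stream every second access hits a prefetched line. On an irregular stream the same policy would mostly evict useful lines, which is the cache pollution case listed above.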

Prefetch in CMPs
Useful prefetching improves performance
– Avoids network latency
– Reduces memory access latency
Useless prefetching degrades performance
– More power consumption
– More NoC congestion
– Interference with other cores' requests

Prefetch adverse behaviors
M. Torrents, R. Martínez, C. Molina. “Network Aware Performance Evaluation of Prefetching Techniques in CMPs”. Simulation Modelling Practice and Theory (SIMPAT), 2014.

Prefetch in shared memories
The prefetcher is distributed
This entails challenges
– Distributed memory streams
– Distributed prefetch queue
– Statistics are generated and collected at different points
These make the prefetcher's task more difficult
Harder to prefetch accurately
M. Torrents, et al. “Prefetching Challenges in Distributed Memories for CMPs”. In Proceedings of the International Conference on Computational Science (ICCS'15), Reykjavík (Iceland), June 2015.

Objective
Maximize the benefit of prefetching
– By using it only when it is working properly
– By minimizing its adverse effects

Proposal
Identify when the prefetcher generates slowdown
– Identify code regions at several granularities
– Analyze the prefetcher performance in these regions
– Tag these code regions with stats
Switch the prefetcher off
– Save power
– Avoid network contention
– Avoid cache pollution
Switch it on again
– When it generates speedup

Code Region Granularity
Divide the code into code regions
– Single instructions, basic blocks, etc., or all the code

    mov ebx, 0
    mov eax, 0
    mov ecx, 0
_Label_1:
    mov ecx, [esi + ebx * 4]
    add eax, ecx
    inc ebx
    cmp ebx, 100
    jne _Label_1

Granularities shown on this example, one per slide:
– Instruction level
– Basic block level
– All the code
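The three granularities can be sketched as a small partitioning routine over the example above. This is only an illustration: the function name `split_regions`, the `BRANCHES` set, and the rule "a label starts a block, a branch ends it" are assumptions for this sketch, not the mechanism used in the work.

```python
CODE = """\
mov ebx, 0
mov eax, 0
mov ecx, 0
_Label_1:
mov ecx, [esi + ebx * 4]
add eax, ecx
inc ebx
cmp ebx, 100
jne _Label_1
"""

# Small illustrative subset of control-flow mnemonics (assumption).
BRANCHES = {"jmp", "jne", "je", "jz", "jnz", "call", "ret"}


def split_regions(code, granularity):
    """Group instructions into code regions at the given granularity."""
    instructions, leaders = [], set()
    for line in code.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.endswith(":"):
            leaders.add(len(instructions))  # a label starts a new basic block
        else:
            instructions.append(line)

    if granularity == "instruction":
        return [[ins] for ins in instructions]
    if granularity == "all":
        return [instructions]
    if granularity != "basic_block":
        raise ValueError(f"unknown granularity: {granularity}")

    regions, current = [], []
    for index, ins in enumerate(instructions):
        if index in leaders and current:
            regions.append(current)
            current = []
        current.append(ins)
        if ins.split()[0] in BRANCHES:  # a branch ends the current block
            regions.append(current)
            current = []
    if current:
        regions.append(current)
    return regions


for level in ("instruction", "basic_block", "all"):
    print(level, len(split_regions(CODE, level)))
```

For this loop the routine yields eight regions at instruction level, two at basic block level (the initialization and the loop body), and one for all the code.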

Code Region Granularity
Divide the code into code regions
– Single instructions, basic blocks, etc., or all the code
Identify and tag the regions
– Statically (profiling execution)
– Dynamically (during the warm-up)
Regions are tagged with statistics
– Accuracy / miss ratio
Activate or deactivate the prefetcher at every new code region
– According to the statistics of the current code region
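A minimal sketch of the static variant, assuming the profiling run leaves a table that maps each region's start address to its measured accuracy. The addresses, the accuracy values, the 0.40 threshold, and the name `prefetcher_enabled` are hypothetical and chosen only for illustration.

```python
# Hypothetical profile: region start address -> prefetch accuracy
# measured in an earlier profiling run (values are made up).
PROFILE = {0x400000: 0.82, 0x400040: 0.12, 0x400090: 0.55}
ACCURACY_THRESHOLD = 0.40  # assumed cut-off, not taken from the slides


def prefetcher_enabled(region_start, profile=PROFILE,
                       threshold=ACCURACY_THRESHOLD):
    """Decide the prefetcher state when execution enters a code region."""
    accuracy = profile.get(region_start)
    if accuracy is None:
        return True  # untagged region: leave the prefetcher on (assumption)
    return accuracy >= threshold


# Sequence of region entries seen during execution.
trace = [0x400000, 0x400040, 0x400090, 0x400040, 0x4000F0]
states = [prefetcher_enabled(region) for region in trace]
print(states)
```

The dynamic variant would fill the same table during the warm-up instead of loading it from a profiling run.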

Switching off the prefetcher
Detect when the prefetcher is useless
Accuracy
– Useful prefetches / total number of prefetches
– Switch off when the accuracy decreases
Miss ratio
– Based on the number of misses
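The two metrics can be written down as per-region counters. The sketch below assumes a fixed accuracy threshold and a minimum sample count before deciding; the slides only say "when the accuracy decreases", so `min_accuracy`, `min_samples`, and the names `RegionStats` and `should_switch_off` are assumptions.

```python
from dataclasses import dataclass


@dataclass
class RegionStats:
    issued: int = 0    # prefetches issued while in this region
    useful: int = 0    # prefetched lines later hit by a demand access
    accesses: int = 0  # demand accesses in this region
    misses: int = 0    # demand misses in this region

    @property
    def accuracy(self):
        """Useful prefetches / total number of prefetches."""
        return self.useful / self.issued if self.issued else 0.0

    @property
    def miss_ratio(self):
        """Demand misses / demand accesses."""
        return self.misses / self.accesses if self.accesses else 0.0


def should_switch_off(stats, min_accuracy=0.4, min_samples=32):
    """Switch off only once enough prefetches were observed (assumed policy)."""
    if stats.issued < min_samples:
        return False
    return stats.accuracy < min_accuracy


poor = RegionStats(issued=100, useful=20, accesses=200, misses=50)
good = RegionStats(issued=100, useful=80)
print(poor.accuracy, poor.miss_ratio, should_switch_off(poor))
print(good.accuracy, should_switch_off(good))
```

The minimum sample count avoids switching off on the first few prefetches of a region, before the accuracy estimate is meaningful.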

Switching on the prefetcher
A switched-off prefetcher does not generate stats
So it cannot be reactivated on an accuracy increase
When to reactivate?
– Based on the miss ratio
– After a certain timeout
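Both reactivation options can be combined in a small controller. This is a sketch under assumptions: the slides do not say how the miss ratio is used, so the code assumes that a high miss ratio while the prefetcher is off is the trigger, and the class name `PrefetcherSwitch`, the timeout, and the limit are hypothetical.

```python
class PrefetcherSwitch:
    """Toy on/off controller; no accuracy stats exist while the prefetcher is off."""

    def __init__(self, timeout=1000, miss_ratio_limit=0.10):
        self.enabled = True
        self.timeout = timeout                    # cycles to stay off at most
        self.miss_ratio_limit = miss_ratio_limit  # assumed reactivation trigger
        self.off_cycles = 0

    def switch_off(self):
        self.enabled = False
        self.off_cycles = 0

    def tick(self, cycles, miss_ratio):
        """Advance time; re-enable on timeout or when the miss ratio is high."""
        if self.enabled:
            return True
        self.off_cycles += cycles
        if (self.off_cycles >= self.timeout
                or miss_ratio > self.miss_ratio_limit):
            self.enabled = True
        return self.enabled


switch = PrefetcherSwitch(timeout=1000, miss_ratio_limit=0.10)
switch.switch_off()
history = [switch.tick(400, miss_ratio=0.02) for _ in range(3)]
print(history)
```

With these values the prefetcher stays off for the first two 400-cycle steps and comes back on the third, when the timeout expires.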

Experimental framework
gem5
– 16 x86 CPUs
– Ruby memory system
– L1 prefetchers
– MOESI coherence protocol
– Garnet network simulator
PARSEC 2.1

Simulation environment

Expected Results
Power savings without losing performance
Smaller granularity, more accuracy
– Blocks or super blocks better than the whole code
– Single instructions more accurate than blocks or super blocks
Smaller granularity also means
– More resources
– More complexity
Basic block granularity should provide good results with a realistic complexity

Q & A

Backup slides

Prefetch in Distributed Memory Systems
Increases the complexity of prefetching
Challenges without trivial solutions
[Figure: distributed L2 memory, built up over several slides]
– L1 miss for @: distributed patterns
– @, @ + 2, @ + 4: queue filtering
– L1 miss for @ + 2: dynamic profiling
