IMPROVING THE PREFETCHING PERFORMANCE THROUGH CODE REGION PROFILING Martí Torrents, Raúl Martínez, and Carlos Molina Computer Architecture Department UPC.

1 IMPROVING THE PREFETCHING PERFORMANCE THROUGH CODE REGION PROFILING Martí Torrents, Raúl Martínez, and Carlos Molina Computer Architecture Department UPC – BarcelonaTech

2 Outline
Motivation
- Prefetching
- Prefetching in CMPs
- Prefetch adverse behaviors
Objective
- Proposal
- Code region granularity
- Switch the prefetcher off
- Switch the prefetcher on
Experimental framework
Expected Results

4 Motivation
The number of cores on a single chip grows every year
– Nehalem: 4 to 6 cores
– Tilera: 64 to 100 cores
– Intel Polaris: 80 cores
– Nvidia GeForce: up to 256 cores

5 Prefetching
Reduces memory latency
Brings the data the CPU will need next into a closer cache
Increases the hit ratio
Implemented in most commercial processors
Erroneous prefetching may cause
– Cache pollution
– Resource consumption (queues, bandwidth, etc.)
– Power consumption
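
The effect on the hit ratio can be illustrated with a toy model. This is only a sketch under strong assumptions that are not in the slides: an unbounded cache, a next-line prefetcher that fires on every miss, and a purely sequential access stream. The function name `hit_ratio` is hypothetical.

```python
def hit_ratio(accesses, prefetch_next_line):
    """Return the fraction of accesses that hit in a toy cache.

    Assumptions (not from the slides): the cache never evicts, and the
    prefetcher fetches block + 1 whenever block misses.
    """
    cache = set()
    hits = 0
    for block in accesses:
        if block in cache:
            hits += 1
        else:
            cache.add(block)              # demand fetch on a miss
            if prefetch_next_line:
                cache.add(block + 1)      # next-line prefetch
    return hits / len(accesses)


sequential = list(range(10))              # blocks 0, 1, ..., 9
without_prefetch = hit_ratio(sequential, prefetch_next_line=False)
with_prefetch = hit_ratio(sequential, prefetch_next_line=True)
```

On this stream every second access becomes a hit once the prefetcher is enabled. With an irregular stream the prefetched blocks would go unused, which is the cache pollution and wasted bandwidth listed on the slide.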

6 Prefetch in CMPs
A useful prefetcher improves performance
– Avoids network latency
– Reduces memory access latency
A useless prefetcher degrades performance
– More power consumption
– More NoC congestion
– Interference with other cores' requests

7 Prefetch adverse behaviors
M. Torrents, R. Martínez, C. Molina. “Network Aware Performance Evaluation of Prefetching Techniques in CMPs”. Simulation Modeling Practice and Theory (SIMPAT), 2014.

8 Prefetch in shared memories
The prefetcher is distributed, which entails challenges
– Distributed memory streams
– Distributed prefetch queue
– Statistics are generated and collected at different points
This complicates the prefetcher's task and makes accurate prefetching harder
M. Torrents, et al. “Prefetching Challenges in Distributed Memories for CMPs”. In Proceedings of the International Conference on Computational Science (ICCS'15), Reykjavík, Iceland, June 2015.

10 Objective
Maximize the benefit of prefetching
– By using it only when it is working properly
– By minimizing its adverse effects

11 Proposal
Identify when the prefetcher generates slowdown
– Identify code regions at several granularities
– Analyze the prefetcher performance in these regions
– Tag these code regions with statistics
Switch the prefetcher off
– Save power
– Avoid network contention
– Avoid cache pollution
Switch it on again
– When it generates speedup

12 Code Region Granularity
Divide the code into code regions
– Single instructions, basic blocks, etc., or all the code
Instruction level:
        mov ebx, 0                  ; ebx = loop index
        mov eax, 0                  ; eax = running sum
        mov ecx, 0
    _Label_1:
        mov ecx, [esi + ebx * 4]    ; load 4-byte element ebx of the array at esi
        add eax, ecx
        inc ebx
        cmp ebx, 100
        jne _Label_1                ; repeat until ebx == 100

13 Code Region Granularity
Divide the code into code regions
– Single instructions, basic blocks, etc., or all the code
Basic block level:
        mov ebx, 0                  ; ebx = loop index
        mov eax, 0                  ; eax = running sum
        mov ecx, 0
    _Label_1:
        mov ecx, [esi + ebx * 4]    ; load 4-byte element ebx of the array at esi
        add eax, ecx
        inc ebx
        cmp ebx, 100
        jne _Label_1                ; repeat until ebx == 100

14 Code Region Granularity
Divide the code into code regions
– Single instructions, basic blocks, etc., or all the code
All the code:
        mov ebx, 0                  ; ebx = loop index
        mov eax, 0                  ; eax = running sum
        mov ecx, 0
    _Label_1:
        mov ecx, [esi + ebx * 4]    ; load 4-byte element ebx of the array at esi
        add eax, ecx
        inc ebx
        cmp ebx, 100
        jne _Label_1                ; repeat until ebx == 100
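
The three granularities on slides 12 to 14 can be made concrete by partitioning the example snippet in software. The following is a minimal sketch, not the authors' mechanism: the function `split_regions`, the granularity names, and the small `BRANCHES` set are all assumptions introduced for illustration, and a basic block is approximated as "ends at a branch, starts at a label".

```python
# The slide's example program, one entry per source line.
PROGRAM = [
    "mov ebx, 0",
    "mov eax, 0",
    "mov ecx, 0",
    "_Label_1:",
    "mov ecx, [esi + ebx * 4]",
    "add eax, ecx",
    "inc ebx",
    "cmp ebx, 100",
    "jne _Label_1",
]

# Assumed subset of control-transfer mnemonics, enough for this example.
BRANCHES = {"jmp", "jne", "je", "jz", "jnz", "call", "ret"}


def is_label(line):
    return line.endswith(":")


def split_regions(program, granularity):
    """Partition a program into code regions at the given granularity."""
    instructions = [line for line in program if not is_label(line)]
    if granularity == "instruction":
        return [[inst] for inst in instructions]
    if granularity == "whole":
        return [instructions]
    if granularity == "basic_block":
        blocks, current = [], []
        for line in program:
            if is_label(line):
                # A label is a branch target, so it starts a new block.
                if current:
                    blocks.append(current)
                    current = []
                continue
            current.append(line)
            if line.split()[0] in BRANCHES:
                # A branch ends the current block.
                blocks.append(current)
                current = []
        if current:
            blocks.append(current)
        return blocks
    raise ValueError(f"unknown granularity: {granularity}")


region_counts = {
    g: len(split_regions(PROGRAM, g))
    for g in ("instruction", "basic_block", "whole")
}
```

For this snippet the partition yields eight single-instruction regions, two basic blocks (the three initializing `mov` instructions and the five-instruction loop body), or one region covering all the code. Finer granularity means more regions to track, which is the resource cost discussed under Expected Results.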

15 Code Region Granularity
Divide the code into code regions
– Single instructions, basic blocks, etc., or all the code
Identify and tag the regions
– Statically (profiling execution)
– Dynamically (during the warm-up)
Regions are tagged with statistics
– Accuracy / miss ratio
Activate or deactivate the prefetcher at every new code region
– According to the statistics of the current code region
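
The per-region decision described on slide 15 amounts to a table lookup on entry to each region. The sketch below is one possible reading, under assumptions that are not in the slides: regions are keyed by an arbitrary identifier, the tag stores prefetch counts, the decision uses a fixed accuracy threshold of 0.5, and untagged regions default to "prefetcher on". The names `RegionStats` and `prefetcher_enabled` are hypothetical.

```python
from dataclasses import dataclass

ACCURACY_THRESHOLD = 0.5  # assumed value, not taken from the slides


@dataclass
class RegionStats:
    """Statistics tag attached to one code region."""
    useful_prefetches: int = 0
    total_prefetches: int = 0

    @property
    def accuracy(self):
        if self.total_prefetches == 0:
            return None  # no prefetches observed in this region yet
        return self.useful_prefetches / self.total_prefetches


def prefetcher_enabled(region_table, region_id, threshold=ACCURACY_THRESHOLD):
    """Decide, on entry to a region, whether the prefetcher should run."""
    stats = region_table.get(region_id)
    if stats is None or stats.accuracy is None:
        return True  # assumption: untagged regions keep the prefetcher on
    return stats.accuracy >= threshold


# Example table, as if filled in by a profiling run (made-up numbers).
region_table = {
    "loop_body": RegionStats(useful_prefetches=90, total_prefetches=100),
    "init": RegionStats(useful_prefetches=1, total_prefetches=10),
}
```

Whether the table is filled statically by a profiling run or dynamically during warm-up only changes when `region_table` is populated; the lookup at region entry stays the same.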

16 Switching off the prefetcher
Detect when the prefetcher is useless
Accuracy
– Useful prefetches / Total number of prefetches
– Switch off when the accuracy decreases
Miss ratio
– Based on the number of misses
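
The accuracy criterion, useful prefetches divided by total prefetches, could be tracked over a sliding window of recent prefetches. This is a sketch only: the window size, the threshold, and the class name `AccuracyMonitor` are assumptions, and the slides do not say how "the accuracy decreases" is detected.

```python
from collections import deque


class AccuracyMonitor:
    """Track prefetch accuracy over the most recent prefetches."""

    def __init__(self, window=8, threshold=0.5):
        # window and threshold are assumed tuning parameters.
        self.outcomes = deque(maxlen=window)
        self.threshold = threshold

    def record(self, useful):
        """Record one completed prefetch; useful=True if it was later used."""
        self.outcomes.append(bool(useful))

    def accuracy(self):
        """Useful prefetches / total prefetches within the window."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def should_switch_off(self):
        # Decide only once the window is full, to avoid reacting to noise.
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        return self.accuracy() < self.threshold


monitor = AccuracyMonitor(window=4, threshold=0.5)
for outcome in (True, True, True, True):
    monitor.record(outcome)
off_after_good_phase = monitor.should_switch_off()

for outcome in (False, False, False):
    monitor.record(outcome)
off_after_bad_phase = monitor.should_switch_off()
```

A windowed measure reacts to phase changes in the program, whereas a lifetime average would let a long accurate phase mask a later inaccurate one.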

17 Switching on the prefetcher
A switched-off prefetcher does not generate statistics
– So it cannot be reactivated based on an accuracy increase
When to reactivate?
– Based on the miss ratio
– After a certain timeout
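
Because accuracy is unavailable while the prefetcher is off, reactivation has to rely on signals that remain observable. The sketch below combines the two options from slide 17 in one hypothetical function; the name `should_switch_on`, the parameter names, and both default limits are assumptions chosen for illustration.

```python
def should_switch_on(cycles_off, misses, accesses,
                     timeout_cycles=10_000, miss_ratio_limit=0.10):
    """Decide whether a switched-off prefetcher should be reactivated.

    cycles_off: cycles elapsed since the prefetcher was switched off.
    misses, accesses: demand cache statistics gathered while it was off.
    timeout_cycles, miss_ratio_limit: assumed tuning parameters.
    """
    # Option 1: give the prefetcher another chance after a fixed timeout.
    if cycles_off >= timeout_cycles:
        return True
    # Option 2: reactivate when the miss ratio grows too high without it.
    if accesses > 0 and misses / accesses > miss_ratio_limit:
        return True
    return False
```

For example, `should_switch_on(500, 30, 100)` returns `True` because the miss ratio of 0.30 exceeds the assumed limit, while `should_switch_on(500, 1, 100)` returns `False` because neither condition holds yet.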

19 Experimental framework
gem5
– 16 x86 CPUs
– Ruby memory system
– L1 prefetchers
– MOESI coherence protocol
– Garnet network simulator
PARSEC 2.1

20 Simulation environment

22 Expected Results
Power savings without losing performance
Smaller granularity gives more accuracy
– Blocks or superblocks are better than the whole code
– Single instructions are more accurate than blocks or superblocks
Smaller granularity also requires
– More resources
– More complexity
Basic block granularity should provide good results with realistic complexity

23 Q & A

25 Backup slides

26-32 Prefetch in Distributed Memory Systems
Distribution increases the complexity of prefetching
Challenges without trivial solutions
– Distributed patterns (slides 27-28; figure labels: DISTRIBUTED L2 MEMORY, L1 MISS for @)
– Queue filtering (slides 29-30; figure labels: @, @ + 2, @ + 4)
– Dynamic profiling (slides 31-32; figure label: L1 MISS for @ + 2)

