1
Prefetching Challenges in Distributed Memories for CMPs Martí Torrents, Raúl Martínez, and Carlos Molina Computer Architecture Department UPC – BarcelonaTech
2
Outline Introduction Naming the challenges Challenge evaluation methodology Experimental framework Challenge Quantification Facing the Challenges Conclusions 2
3
Outline Introduction Naming the challenges Challenge evaluation methodology Experimental framework Challenge Quantification Facing the Challenges Conclusions 3
4
Prefetching Reduces memory latency by bringing the data the CPU will need next into a nearer cache, increasing the hit ratio It is implemented in most commercial processors Erroneous prefetching may produce – Cache pollution – Resource consumption (queues, bandwidth, etc.) – Power consumption 4
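A minimal sketch of the idea, assuming a toy single-level cache with a next-line policy; the sizes, trace and replacement are made-up parameters, not anything from the deck:

```python
# Hypothetical sketch of next-line prefetching into a small "nearest" cache.
BLOCK = 64          # cache block size in bytes (assumed)
L1_CAPACITY = 8     # number of blocks the toy cache can hold (assumed)

l1 = set()          # blocks currently in the nearest cache
hits = misses = prefetches = 0

def block_of(addr):
    return addr // BLOCK

def install(block):
    # Trivial replacement: evict an arbitrary block when full.
    if len(l1) >= L1_CAPACITY:
        l1.pop()
    l1.add(block)

def access(addr):
    global hits, misses, prefetches
    b = block_of(addr)
    if b in l1:
        hits += 1
    else:
        misses += 1
        install(b)          # demand fill
    # Next-line prefetch: bring the following block closer before it is asked for.
    if b + 1 not in l1:
        prefetches += 1
        install(b + 1)

# A sequential walk: after the first miss, every access hits thanks to the prefetch.
for a in range(0, 64 * BLOCK, BLOCK):
    access(a)
print(hits, misses, prefetches)
```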
5
Motivation The number of cores on a single chip grows every year: Nehalem 4~6 cores Tilera 64~100 cores Intel Polaris 80 cores Nvidia GeForce up to 256 cores 5
6
Prefetch in CMPs Useful prefetches imply higher performance – Avoid network latency – Reduce memory access latency Useless prefetches imply lower performance – More power consumption – More NoC congestion – Interference with other cores' requests 6
7
Prefetch adverse behaviors 7 M. Torrents, R. Martínez, C. Molina. “Network Aware Performance Evaluation of Prefetching Techniques in CMPs”. Simulation Modeling Practice and Theory (SIMPAT), 2014.
8
Distributed memories 8 Distribution of the memory access pattern: @, @+2, @+4, @+6, @+8, @+10
9
Distributed memories 9 Distribution of the memory access pattern: @, @+2, @+4, @+6, @+8, @+10, @+12, @+14 (figure: the accesses are spread across TILE 00 – TILE 07)
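A tiny sketch of that distribution, assuming the address space is interleaved across the eight tiles at a granularity of two address units; the granularity is a made-up parameter chosen only so that the output matches the slide's figure:

```python
# How a single core's stream @, @+2, ..., @+14 is scattered over the tiled L2.
NUM_TILES = 8
GRAIN = 2                                   # address units mapped to each bank in turn (assumed)

def home_tile(offset):
    return (offset // GRAIN) % NUM_TILES

for off in range(0, 16, 2):                 # the slide's @, @+2, ..., @+14
    print(f"@+{off:<2} -> TILE {home_tile(off):02d}")
# @+0 -> TILE 00, @+2 -> TILE 01, ..., @+14 -> TILE 07
```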
10
Outline Introduction Naming the challenges Challenge evaluation methodology Experimental framework Challenge Quantification Facing the Challenges Conclusions 10
11
Prefetch Distributed Memory Systems Analysis phase 11 (figure: an L1 MISS for @ reaches the DISTRIBUTED L2 MEMORY, where the access patterns seen by each bank are distributed)
12
Pattern Detection Challenge Distribution of the memory stream: each prefetcher is aware of only a part of the stream, so access patterns and correlations are harder to detect Not all prefetchers are affected – Correlation prefetchers affected: GHB – One Block Lookahead not affected: Tagged 12
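A minimal sketch of the effect, assuming two cores with made-up strided miss streams and block-grain interleaving across eight banks; the strides, block size and mapping are illustrative, not the deck's configuration:

```python
# A per-bank prefetcher only sees the fraction of each core's miss stream that
# is homed at its bank, interleaved with other cores' misses.
BLOCK, NUM_TILES = 64, 8

def home_tile(addr):
    return (addr // BLOCK) % NUM_TILES

def stride_of(addrs):
    """Return the constant stride of an address list, or None if irregular."""
    deltas = {b - a for a, b in zip(addrs, addrs[1:])}
    return deltas.pop() if len(deltas) == 1 else None

# Two cores with simple strided miss streams (hypothetical).
core0 = [0x00000 + i * 2 * BLOCK for i in range(32)]
core1 = [0x40000 + i * 3 * BLOCK for i in range(32)]

print(stride_of(core0), stride_of(core1))       # full streams: clean strides

# Miss stream observed at tile 0, in arrival order (cores interleaved).
at_tile0 = [a for pair in zip(core0, core1) for a in pair if home_tile(a) == 0]
print(stride_of(at_tile0))                      # None: no stable stride is visible locally
```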
13
Prefetch Distributed Memory Systems Request generation phase 13 (figure: prefetch requests for @, @+2, @+4 are generated toward the DISTRIBUTED L2 MEMORY and queued; queue filtering)
14
Prefetch Queue Filtering Challenge Prefetch requests are queued in distributed queues Independent engines generate the requests, so repeated requests can be queued In a centralized queue they would be merged Adverse effects: – Power consumption – Network contention 14
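A minimal sketch of the difference, with hypothetical per-tile engines and block numbers invented for illustration:

```python
# Independent per-tile prefetch engines can enqueue the same block,
# while a single centralized queue would merge the duplicates.
from collections import OrderedDict

# Requests (block ids) generated by three hypothetical per-tile engines.
engine_requests = {
    "tile0": [10, 11, 12],
    "tile3": [11, 12, 13],     # overlaps with tile0's guesses
    "tile5": [12, 14],
}

# Distributed queues: each engine keeps its own queue, duplicates survive.
distributed = [blk for reqs in engine_requests.values() for blk in reqs]

# Centralized queue: requests are filtered on insertion, duplicates merged.
centralized = OrderedDict()
for reqs in engine_requests.values():
    for blk in reqs:
        centralized.setdefault(blk, True)

print("issued from distributed queues:", len(distributed))   # 8 requests
print("issued from centralized queue :", len(centralized))   # 5 requests
```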
15
Prefetch Distributed Memory Systems Evaluation phase 15 (figure: prefetched blocks @, @+2, @+4 sit in the DISTRIBUTED L2 MEMORY; a later L1 MISS for @+2 must be attributed to the prefetcher; dynamic profiling)
16
Dynamic Profiling Challenge Prefetch requests are generated in one tile, but the dynamic profiling information ends up in another tile, so the issuing tile's own profile is erroneous Techniques that use this information may work erroneously – Filtering – Throttling – Specific prefetching engines 16
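A minimal sketch of the accounting problem, assuming block-grain interleaving, a next-line engine and purely local crediting; all of these are illustrative assumptions, not the deck's setup:

```python
# The tile that issues a prefetch cannot see the demand hit that proves it
# useful, because that hit is observed at the prefetched block's home tile.
NUM_TILES = 8

def home(blk):
    return blk % NUM_TILES                # assumed block-grain interleaving

issued = [0] * NUM_TILES                  # prefetches issued per tile
useful_local = [0] * NUM_TILES            # useful prefetches the ISSUER can observe
useful_true = [0] * NUM_TILES             # ground truth (needs global knowledge)

issuer_of = {}                            # prefetched block -> tile that issued it
for b in range(64):                       # a core streams through blocks 0..63
    t = home(b)                           # tile observing the demand access to b
    if b in issuer_of:                    # b arrived early thanks to a prefetch
        useful_true[issuer_of[b]] += 1
        if issuer_of[b] == t:             # the issuer only sees hits at its own bank
            useful_local[issuer_of[b]] += 1
    issued[t] += 1
    issuer_of[b + 1] = t                  # next-line prefetch for b+1, issued at t

for t in range(NUM_TILES):
    print(f"tile {t}: issued {issued[t]}, true useful {useful_true[t]}, "
          f"locally visible useful {useful_local[t]}")
# Nearly every prefetch is used, yet each tile's local profile sees none of it,
# so throttling or filtering driven by that profile would shut the engine down.
```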
17
Outline Introduction Naming the challenges Challenge evaluation methodology Experimental framework Challenge Quantification Facing the Challenges Conclusions 17
18
Challenge evaluation methodology Three environments to test the challenges Pattern Detection Challenge: Ideal Prefetcher – A prefetcher that is aware of the whole memory stream – No extra network contention added to the system – No extra power consumed – Requests classified by their core identifier, to preserve the original stream of each core Prefetcher used to test: Global History Buffer (GHB) 18
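For reference, a highly simplified sketch of GHB-style delta correlation; the class, the delta-pair index and the degree parameter are inventions of this sketch, not the deck's GHB implementation (a real GHB is a bounded FIFO with linked entries):

```python
# Keep a global history of miss addresses, index it by the last pair of deltas,
# and prefetch by replaying the deltas that followed the same pair in the past.
from collections import defaultdict

class GhbDeltaCorrelation:
    def __init__(self, degree=2):
        self.history = []                      # global miss history (addresses)
        self.index = defaultdict(list)         # (delta1, delta2) -> positions in history
        self.degree = degree                   # prefetches generated per match

    def miss(self, addr):
        """Record a demand miss and return the addresses to prefetch."""
        self.history.append(addr)
        h = self.history
        if len(h) < 3:
            return []
        key = (h[-2] - h[-3], h[-1] - h[-2])   # the two most recent deltas
        prefetches = []
        # Most recent earlier occurrence of this delta pair, then replay what followed.
        for pos in reversed(self.index[key]):
            base, out = addr, []
            for nxt in range(pos + 1, min(pos + 1 + self.degree, len(h))):
                base += h[nxt] - h[nxt - 1]
                out.append(base)
            if out:
                prefetches = out
                break
        self.index[key].append(len(h) - 1)     # register the current occurrence
        return prefetches

ghb = GhbDeltaCorrelation()
for a in [0, 2, 4, 6, 8, 10, 12]:              # the slide's @, @+2, @+4, ... stream
    print(a, "->", ghb.miss(a))
# Once the (+2, +2) delta pair repeats, the sketch starts predicting the
# upcoming strided addresses.
```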
19
Pattern Detection Challenge 19
20
Challenge evaluation methodology Three environments to test the challenges Prefetch Queue Filtering: Centralized queue – All the requests sent to a centralized queue – Repeated requests are merged and not issued – No extra network contention added to the system – No extra power consumed Prefetcher used to test: Tagged prefetcher 20
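For reference, a minimal sketch of tagged (one-block-lookahead) prefetching; the cache model, its unlimited capacity and the access trace are illustrative assumptions, not the deck's Tagged configuration:

```python
# Prefetch block b+1 on a demand miss for b, and also on the first demand
# reference to a block that was itself brought in by a prefetch (tag bit set).
class TaggedPrefetchCache:
    def __init__(self):
        self.blocks = {}                 # block -> tag bit (True = prefetched, unused)
        self.prefetches = 0

    def _prefetch(self, blk):
        if blk not in self.blocks:
            self.blocks[blk] = True      # brought in by prefetch: tag set
            self.prefetches += 1

    def access(self, blk):
        """Demand access to a block; returns 'hit' or 'miss'."""
        if blk in self.blocks:
            if self.blocks[blk]:         # first demand use of a prefetched block
                self.blocks[blk] = False # clear the tag ...
                self._prefetch(blk + 1)  # ... and keep the lookahead going
            return "hit"
        self.blocks[blk] = False         # demand fill, tag clear
        self._prefetch(blk + 1)          # one-block lookahead on a miss
        return "miss"

cache = TaggedPrefetchCache()
results = [cache.access(b) for b in range(16)]   # sequential demand stream
print(results.count("miss"), "misses,", cache.prefetches, "prefetches")
# Only the first access misses; tagged lookahead stays one block ahead after that.
```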
21
Prefetch Queue Filtering Challenge 21
22
Challenge evaluation methodology 22
23
Dynamic Profiling Challenge 23
24
Outline Introduction Naming the challenges Challenge evaluation methodology Experimental framework Challenge Quantification Facing the Challenges Conclusions 24
25
Experimental framework gem5 – 64 x86 CPUs – Ruby memory system – L2 prefetchers – MOESI coherence protocol – Garnet network simulator PARSEC 2.1 25
26
Simulation environment 26
27
Outline Introduction Naming the challenges Challenge evaluation methodology Experimental framework Challenge Quantification Facing the Challenges Conclusions 27
28
Pattern Detection Challenge 28
29
Prefetch Queue Filtering Challenge 29
30
Dynamic Profiling Challenge 30
31
Outline Introduction Naming the challenges Challenge evaluation methodology Experimental framework Challenge Quantification Facing the Challenges Conclusions 31
32
Facing the challenges 32 There are two main options – Redesign the entire prefetch philosophy – Adapt the current techniques to work with DSMs Moreover, there are two main directions – Centralize the information, at the cost of increased communication – Distribute the prefetcher, at the cost of having to distribute it smartly
33
Outline Introduction Naming the challenges Challenge evaluation methodology Experimental framework Challenge Quantification Facing the Challenges Conclusions 33
34
Conclusions 34 Three challenges when prefetching in DSMs – Pattern Detection Challenge – Prefetch Queue Filtering Challenge – Dynamic Profiling Challenge Directions for future researchers There are no evident solutions for them Leaving them unsolved limits prefetch performance
35
Q & A 35
36
Prefetching Challenges in Distributed Memories for CMPs Martí Torrents, Raúl Martínez, and Carlos Molina Computer Architecture Department UPC – BarcelonaTech