D. Tam, R. Azimi, L. Soares, M. Stumm, University of Toronto. Appeared in ASPLOS XIV (2009). Reading Group by Theo
CMPs are taking over the world! 2, 4, or k threads on a single chip have to play fair and share the L2 cache …but that doesn't happen
◦ Unfair penalization of apps (40% in some cases)
◦ Priority inversion: the O/S says one app has higher priority, but the H/W doesn't know!
Change the L2 cache size per app on the fly
Multiple H/W proposals exist
◦ But no one implements them
S/W could partition the L2 instead
◦ Where to split?
◦ How to split?
Performance lost to uncontrolled sharing can be reclaimed
◦ And we could save lots of power by reducing L2 misses!*
* Work supported by U.S. Department of Energy
The O/S has some control over where a block is placed in the L2
◦ Through the virtual-to-physical translation
◦ If the L2 is large enough and pages are small enough
Supports horizontal L2 partitions only
[Figure: Power5 physical address breakdown. Physical Page Number (20 bits) | Page Offset (12 bits); Tag | L2 Index (9 bits) | Block Offset (7 bits). The 4 index bits above the page offset are under O/S control; the 5 index bits inside the page offset are beyond O/S control.]
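The arithmetic behind the figure can be sketched in Python. This is an illustrative reconstruction using the bit widths on the slide (4 KB pages, 128-byte blocks, 9 L2 index bits); the constant and function names are mine, not the paper's:

```python
# Illustrative sketch: where the O/S-controlled "color" bits come from,
# using the Power5 bit widths shown on the slide.
PAGE_OFFSET_BITS = 12    # 4 KB pages
BLOCK_OFFSET_BITS = 7    # 128-byte cache blocks
L2_INDEX_BITS = 9        # 512 sets

# The index bits that lie above the page offset are chosen by the O/S
# when it picks a physical page: these are the page-color bits.
COLOR_BITS = BLOCK_OFFSET_BITS + L2_INDEX_BITS - PAGE_OFFSET_BITS  # = 4
NUM_COLORS = 1 << COLOR_BITS                                       # = 16

def page_color(physical_addr: int) -> int:
    """Color = low COLOR_BITS of the physical page number."""
    return (physical_addr >> PAGE_OFFSET_BITS) & (NUM_COLORS - 1)

# Two addresses on the same physical page always share a color,
# so the O/S controls color at page-placement granularity.
assert page_color(0x12345000) == page_color(0x12345F80)
assert NUM_COLORS == 16
```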
[Figure: page coloring. Virtual pages of processes A and B map through DRAM physical pages to L2 cache sets; pages of one color map to the "blue" sets, pages of another to the "red" sets. 16 colors total (4 bits).]
Miss Rate Curves (MRCs) are needed to decide the optimal L2 chunk for each app
[Figure: Miss Rate Curve for MCF]
Apps go through phases that change their MRCs
◦ Offline MRCs are useless
Need an online tool and phase detection
[Figure: data for MCF]
For online MRCs we need:
◦ All L2 accesses
◦ A cache simulator that can simulate multiple cache sizes in a single pass
Speed is crucial
◦ Phases last a few billion cycles
◦ Can't take forever to calculate: limited time to offset the cost
Only applies to fully-associative memories
◦ Originally developed for DRAM management
◦ Cheat: today's L2s are associative enough
Only supports stack (e.g., LRU) replacement
Idea: store all L2 accesses in a stack
◦ Record the distance between each access and the last access to the same block
◦ For each size sᵢ: if distance > sᵢ, then missesᵢ++
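The single-pass stack simulation above can be sketched in Python. This is a minimal, unoptimized version of Mattson's stack algorithm; the trace and cache sizes are toy values, not the paper's:

```python
from collections import OrderedDict

def mattson_mrc(trace, sizes):
    """One-pass LRU stack simulation (Mattson et al.): the stack distance
    of an access is the number of distinct blocks touched since the last
    access to the same block; the access misses in every cache whose
    size (in blocks) is smaller than that distance."""
    stack = OrderedDict()            # most-recently-used block is last
    misses = {s: 0 for s in sizes}
    for block in trace:
        if block in stack:
            # Distance = 1-based position from the top of the stack.
            dist = list(reversed(stack)).index(block) + 1
            del stack[block]
        else:
            dist = float('inf')      # cold miss: misses at every size
        for s in sizes:
            if dist > s:
                misses[s] += 1
        stack[block] = True          # push (or move) to the top
    return misses

# Toy trace: the re-reference of 'A' has stack distance 3, so it misses
# in a 2-block cache but hits in a 4-block cache.
assert mattson_mrc(['A', 'B', 'C', 'A'], [2, 4]) == {2: 4, 4: 3}
```

A single pass over the trace yields the miss count for every candidate size at once, which is exactly why this algorithm suits online MRC construction.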
[Figure: the LRU stack. New L2 accesses are pushed on top; a hit found deep in the stack (e.g., distance = 10k) counts as a miss for smaller cache sizes.]
Stack size: 16k entries
◦ Optimized with a hash map for the linear seek
Warm-up period
◦ Automatically detected by RapidMRC, when the L2 is detected to be full
◦ 50% of the trace length if auto-detection fails (apps with a small working set)
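The warm-up heuristic above can be sketched as follows. The assumption (mine, mirroring the slide's wording) is that the L2 is "full" once as many distinct blocks as the cache holds have appeared in the trace, with the 50% fallback for small working sets:

```python
def warmup_length(trace, l2_blocks):
    """Number of accesses until the L2 is plausibly full, i.e. until
    l2_blocks distinct blocks have been seen (a sketch of RapidMRC's
    auto-detection); fall back to 50% of the trace if the working set
    never fills the cache."""
    seen = set()
    for i, block in enumerate(trace):
        seen.add(block)
        if len(seen) >= l2_blocks:
            return i + 1
    return len(trace) // 2   # small working set: use half the trace

trace = [1, 1, 2, 3, 2, 4, 5, 6]
assert warmup_length(trace, 4) == 6     # 4th distinct block seen at access 6
assert warmup_length(trace, 100) == 4   # never full: 50% of 8 accesses
```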
Simulation
◦ Too expensive
Dynamic binary instrumentation
◦ Would have to instrument every memory access
◦ Simulating the L1 is undesirable
The processor's PMUs (Performance Monitoring Units)
◦ Hmmm…
◦ Not all of them support tracing L2 accesses
Power5 features:
◦ Can profile L1-D misses (the major cause of L2 accesses)
◦ A register recording the address of the last L1-D miss
◦ A counter of L1-D misses
◦ A threshold for the counter: when it is reached, an exception is raised
So, set the threshold to 1 and get all L2 accesses
◦ Not quite!
Hardware prefetches
◦ Don't change the recorded address on Power5 (stride prefetch): just adjust the address manually
◦ Don't appear at all on Power5+
Superscalar execution
◦ A miss may mask other misses: the first exception blocks further exceptions, and when the other loads are reissued they may hit
The actual miss rate is known for one point on the MRC
◦ The miss rate experienced by the app itself, for the partition size it was given (usually ½ of the cache)
◦ V-offset: real miss rate minus estimated miss rate at that size
Vertically shift every point of the MRC by that v-offset
◦ Unclear why an equal shift everywhere is justified: more misses mean greater opportunity for out-of-order execution to hide misses
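The v-offset correction is simple enough to state in a few lines. A sketch, with toy miss rates (the MRC representation as a size-to-rate dict is my choice, not the paper's):

```python
def apply_v_offset(estimated_mrc, partition_size, real_miss_rate):
    """Shift the whole estimated MRC vertically so it agrees with the
    one point we can actually measure: the miss rate the app experienced
    at the partition size it was given."""
    v_offset = real_miss_rate - estimated_mrc[partition_size]
    return {size: rate + v_offset for size, rate in estimated_mrc.items()}

# Estimated miss rates at a few partition sizes (toy numbers); the app
# currently runs with a partition of 8 and really measures 9.0.
est = {4: 12.0, 8: 7.0, 16: 3.0}
assert apply_v_offset(est, partition_size=8, real_miss_rate=9.0) == \
    {4: 14.0, 8: 9.0, 16: 5.0}
```

Note how the +2.0 offset is applied uniformly; this is exactly the "equal shift" assumption the slide questions.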
Apps go through phases
◦ Easily detected through a dramatic miss rate change
RapidMRC must run once per phase
Afterwards, the O/S must change the page table
◦ TLB and caches will be invalidated
◦ Copy cost: ~7.3 μs per 4 KB page
NOT implemented
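The "dramatic miss rate change" heuristic can be sketched as a relative-change threshold over successive sampling intervals. The threshold value and function name are illustrative assumptions, not taken from the paper:

```python
def detect_phase_changes(miss_rates, threshold=0.5):
    """Flag a phase boundary whenever the miss rate changes by more than
    `threshold` (relative) between consecutive sampling intervals; a
    simple stand-in for a dramatic-miss-rate-change detector."""
    boundaries = []
    for i in range(1, len(miss_rates)):
        prev, cur = miss_rates[i - 1], miss_rates[i]
        if prev > 0 and abs(cur - prev) / prev > threshold:
            boundaries.append(i)
    return boundaries

# Miss rates per interval: the jump from 2.1 to 9.0 marks a new phase,
# and RapidMRC would be re-run at that point.
assert detect_phase_changes([2.0, 2.1, 9.0, 8.8]) == [2]
```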
Cores/chip: 2
Frequency: 1.5 GHz
L1-I: 64 KB, 128-byte line, 2-way set-associative, private
L1-D: 32 KB, 128-byte line, 4-way set-associative, private
L2: 1.875 MB, 128-byte line, 10-way set-associative, shared
L3 victim cache: 36 MB, 256-byte line, 12-way set-associative, off-chip
RAM: 8 GB (4 GB in some experiments)
Applications with a normal MRC are predicted very well
Abnormal MRCs are usually mispredicted
For most programs, RapidMRC's overhead is negligible
Significant performance improvements over uncontrolled sharing
◦ Although there is room for improvement
L3 deactivated in some experiments (small working sets)
MRC estimation on the fly
◦ Using a technique designed for fully-associative memories
Ensure minimal interference in multiprogramming workloads
Save energy!
Definitely need to implement migration and phase detection
Different coloring schemes for instructions and data?
A previous paper shows L2 misses are not as important as retirement stalls due to L1 misses
◦ The L3 cache can reverse the results
Continuous online monitoring (e.g., LBA) could achieve significant improvements
Questions?