D. Tam, R. Azimi, L. Soares, M. Stumm, University of Toronto. Appeared in ASPLOS XIV (2009). Reading Group by Theo
CMPs are taking over the world! 2, 4, or k threads on a single chip have to play fair and share the L2 cache …but that doesn't happen
◦ Unfair penalization of apps (40% in some cases)
◦ Priority inversion: the O/S says one app has higher priority, but the H/W doesn't know!
Change the L2 cache size per app on the fly
Multiple H/W proposals exist
◦ But no one implements them
S/W could partition the L2 instead
◦ Where to split?
◦ How to split?
Performance lost to uncontrolled sharing can be reclaimed
◦ And we could save lots of power by reducing L2 misses!*
* Work supported by U.S. Department of Energy
The O/S has some control over where a block is placed in the L2
◦ Through the virtual-to-physical translation
◦ If the L2 is large enough and pages are small enough
Supports horizontal L2 partitions only
[Figure: Power5 physical address breakdown. Physical Page Number (20 bits) | Page Offset (12 bits); Tag | L2 Index (9 bits) | Block Offset (7 bits). The 4 index bits above the page offset are under O/S control; the 5 index bits inside the page offset are beyond O/S control.]
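The arithmetic behind the figure can be sketched in Python. This is an illustrative reconstruction using the bit widths on the slide (4 KB pages, 128-byte blocks, 9 L2 index bits); the constant and function names are mine, not the paper's:

```python
# Illustrative sketch: where the O/S-controlled "color" bits come from,
# using the Power5 bit widths shown on the slide.
PAGE_OFFSET_BITS = 12    # 4 KB pages
BLOCK_OFFSET_BITS = 7    # 128-byte cache blocks
L2_INDEX_BITS = 9        # 512 sets

# The index bits that lie above the page offset are chosen by the O/S
# when it picks a physical page: these are the page-color bits.
COLOR_BITS = BLOCK_OFFSET_BITS + L2_INDEX_BITS - PAGE_OFFSET_BITS  # = 4
NUM_COLORS = 1 << COLOR_BITS                                       # = 16

def page_color(physical_addr: int) -> int:
    """Color = low COLOR_BITS of the physical page number."""
    return (physical_addr >> PAGE_OFFSET_BITS) & (NUM_COLORS - 1)

# Two addresses on the same physical page always share a color,
# so the O/S controls color at page-placement granularity.
assert page_color(0x12345000) == page_color(0x12345F80)
assert NUM_COLORS == 16
```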
[Figure: page coloring. Virtual pages of processes A and B map through DRAM physical pages to L2 cache sets; pages of one color map to the "blue" sets, pages of another to the "red" sets. 16 colors total (4 bits).]
Miss Rate Curves (MRCs) are needed to decide the optimal L2 chunk for each app
[Figure: Miss Rate Curve for MCF]
Apps go through phases that change their MRCs
◦ Offline MRCs are useless
Need an online tool and phase detection
[Figure: data for MCF]
For online MRCs we need:
◦ All L2 accesses
◦ A cache simulator that can simulate multiple cache sizes in a single pass
Speed is crucial
◦ Phases last a few billion cycles
◦ Can't take forever to calculate: limited time to offset the cost
Only applies to fully-associative memories
◦ Originally developed for DRAM management
◦ Cheat: today's L2s are associative enough
Only supports stack (e.g., LRU) replacement
Idea: store all L2 accesses in a stack
◦ Record the distance between each access and the last access to the same block
◦ For each size sᵢ: if distance > sᵢ, then missesᵢ++
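The single-pass stack simulation above can be sketched in Python. This is a minimal, unoptimized version of Mattson's stack algorithm; the trace and cache sizes are toy values, not the paper's:

```python
from collections import OrderedDict

def mattson_mrc(trace, sizes):
    """One-pass LRU stack simulation (Mattson et al.): the stack distance
    of an access is the number of distinct blocks touched since the last
    access to the same block; the access misses in every cache whose
    size (in blocks) is smaller than that distance."""
    stack = OrderedDict()            # most-recently-used block is last
    misses = {s: 0 for s in sizes}
    for block in trace:
        if block in stack:
            # Distance = 1-based position from the top of the stack.
            dist = list(reversed(stack)).index(block) + 1
            del stack[block]
        else:
            dist = float('inf')      # cold miss: misses at every size
        for s in sizes:
            if dist > s:
                misses[s] += 1
        stack[block] = True          # push (or move) to the top
    return misses

# Toy trace: the re-reference of 'A' has stack distance 3, so it misses
# in a 2-block cache but hits in a 4-block cache.
assert mattson_mrc(['A', 'B', 'C', 'A'], [2, 4]) == {2: 4, 4: 3}
```

A single pass over the trace yields the miss count for every candidate size at once, which is exactly why this algorithm suits online MRC construction.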
[Figure: the LRU stack. New L2 accesses are pushed on top; a hit found deep in the stack (e.g., distance = 10k) counts as a miss for smaller cache sizes.]
Stack size: 16k entries
◦ Optimized with a hash map for the linear seek
Warm-up period
◦ Automatically detected by RapidMRC, when the L2 is detected to be full
◦ 50% of the trace length if auto-detection fails (apps with a small working set)
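The warm-up heuristic above can be sketched as follows. The assumption (mine, mirroring the slide's wording) is that the L2 is "full" once as many distinct blocks as the cache holds have appeared in the trace, with the 50% fallback for small working sets:

```python
def warmup_length(trace, l2_blocks):
    """Number of accesses until the L2 is plausibly full, i.e. until
    l2_blocks distinct blocks have been seen (a sketch of RapidMRC's
    auto-detection); fall back to 50% of the trace if the working set
    never fills the cache."""
    seen = set()
    for i, block in enumerate(trace):
        seen.add(block)
        if len(seen) >= l2_blocks:
            return i + 1
    return len(trace) // 2   # small working set: use half the trace

trace = [1, 1, 2, 3, 2, 4, 5, 6]
assert warmup_length(trace, 4) == 6     # 4th distinct block seen at access 6
assert warmup_length(trace, 100) == 4   # never full: 50% of 8 accesses
```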
Simulation
◦ Too expensive
Dynamic binary instrumentation
◦ Would have to instrument every memory access
◦ Simulating the L1 is undesirable
The processor's PMUs (Performance Monitoring Units)
◦ Hmmm…
◦ Not all of them support tracing L2 accesses
Power5 features:
◦ Can profile L1-D misses (the major cause of L2 accesses)
◦ A register recording the address of the last L1-D miss
◦ A counter of L1-D misses
◦ A threshold for the counter: when it is reached, an exception is raised
So, set the threshold to 1 and get all L2 accesses
◦ Not quite!
Hardware prefetches
◦ Don't change the recorded address on Power5 (stride prefetch): just adjust the address manually
◦ Don't appear at all on Power5+
Superscalar execution
◦ A miss may mask other misses: the first exception blocks further exceptions, and when the other loads are reissued they may hit
The actual miss rate is known for one point on the MRC
◦ The miss rate experienced by the app itself, for the partition size it was given (usually ½ of the cache)
◦ V-offset: real miss rate minus estimated miss rate at that size
Vertically shift every point of the MRC by that v-offset
◦ Unclear why an equal shift everywhere is justified: more misses mean greater opportunity for out-of-order execution to hide misses
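The v-offset correction is simple enough to state in a few lines. A sketch, with toy miss rates (the MRC representation as a size-to-rate dict is my choice, not the paper's):

```python
def apply_v_offset(estimated_mrc, partition_size, real_miss_rate):
    """Shift the whole estimated MRC vertically so it agrees with the
    one point we can actually measure: the miss rate the app experienced
    at the partition size it was given."""
    v_offset = real_miss_rate - estimated_mrc[partition_size]
    return {size: rate + v_offset for size, rate in estimated_mrc.items()}

# Estimated miss rates at a few partition sizes (toy numbers); the app
# currently runs with a partition of 8 and really measures 9.0.
est = {4: 12.0, 8: 7.0, 16: 3.0}
assert apply_v_offset(est, partition_size=8, real_miss_rate=9.0) == \
    {4: 14.0, 8: 9.0, 16: 5.0}
```

Note how the +2.0 offset is applied uniformly; this is exactly the "equal shift" assumption the slide questions.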
Apps go through phases
◦ Easily detected through a dramatic miss rate change
RapidMRC must run once per phase
Afterwards, the O/S must change the page table
◦ TLB and caches will be invalidated
◦ Copy cost: ~7.3 μs per 4 KB page
NOT implemented
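The "dramatic miss rate change" heuristic can be sketched as a relative-change threshold over successive sampling intervals. The threshold value and function name are illustrative assumptions, not taken from the paper:

```python
def detect_phase_changes(miss_rates, threshold=0.5):
    """Flag a phase boundary whenever the miss rate changes by more than
    `threshold` (relative) between consecutive sampling intervals; a
    simple stand-in for a dramatic-miss-rate-change detector."""
    boundaries = []
    for i in range(1, len(miss_rates)):
        prev, cur = miss_rates[i - 1], miss_rates[i]
        if prev > 0 and abs(cur - prev) / prev > threshold:
            boundaries.append(i)
    return boundaries

# Miss rates per interval: the jump from 2.1 to 9.0 marks a new phase,
# and RapidMRC would be re-run at that point.
assert detect_phase_changes([2.0, 2.1, 9.0, 8.8]) == [2]
```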
Cores/chip: 2
Frequency: 1.5 GHz
L1-I: 64 KB, 128-byte line, 2-way set-associative, private
L1-D: 32 KB, 128-byte line, 4-way set-associative, private
L2: 1.875 MB, 128-byte line, 10-way set-associative, shared
L3 victim cache: 36 MB, 256-byte line, 12-way set-associative, off-chip
RAM: 8 GB (4 GB in some experiments)
Applications with a normal MRC are predicted very well
Abnormal MRCs are usually mispredicted
For most programs, RapidMRC's overhead is negligible
Significant performance improvements over uncontrolled sharing
◦ Although there is room for improvement
L3 deactivated in some experiments (small working sets)
MRC estimation on the fly
◦ Using a technique designed for fully-associative memories
Ensure minimal interference in multiprogramming workloads
Save energy!
Definitely need to implement migration and phase detection
Different coloring schemes for instructions and data?
A previous paper shows L2 misses are not as important as retirement stalls due to L1 misses
◦ The L3 cache can reverse the results
Continuous online monitoring (e.g., LBA) could achieve significant improvements
Questions?