1
Managing Distributed, Shared L2 Caches through OS-Level Page Allocation
Sangyeun Cho and Lei Jin
Dept. of Computer Science, University of Pittsburgh
MICRO-39, Dec. 13, 2006
2
Multicore distributed L2 caches
- L2 caches are typically sub-banked and distributed
  - IBM Power4/5: 3 banks
  - Sun Microsystems T1: 4 banks
  - Intel Itanium2 (L3): many "sub-arrays"
- (Distributed L2 caches + switched NoC) → NUCA
- Hardware-based management schemes
  - Private caching
  - Shared caching
  - Hybrid caching
[Figure: tiled layout with a processor core, local L2 cache slice, and router per tile]
3
Private caching
1. L1 miss
2. L2 access (local slice) → hit, or miss
3. On a miss, access the directory → either a copy is on chip (fetch from a remote slice) or it is a global miss (go to memory)
+ Short hit latency (always local)
- High on-chip miss rate
- Long miss resolution time
- Complex coherence enforcement
4
Shared caching
1. L1 miss
2. L2 access (home slice) → hit, or miss (go to memory)
+ Low on-chip miss rate
+ Straightforward data location
+ Simple coherence (no replication)
- Long average hit latency
5
Our work
- Placing "flexibility" as the top design consideration
- OS-level data-to-L2-cache mapping
  - Simple hardware based on shared caching
  - Efficient mapping maintenance at page granularity
- Demonstrating the impact using different policies
6
Talk roadmap
- Data mapping, a key property
- Flexible page-level mapping
  - Goals
  - Architectural support
  - OS design issues
- Management policies
- Conclusion and future work
7
Data mapping, the key
- Data mapping = deciding data location (i.e., which cache slice)
- Private caching
  - Data mapping determined by program location
  - Mapping created at miss time
  - No explicit control
- Shared caching
  - Data mapping determined by address: slice number = (block address) mod (N_slice)
  - Mapping is static; cache blocks are installed at miss time
  - No explicit control (run-time can impact location only within a slice)
  - Mapping granularity = block
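As a concrete illustration (not from the slides), the shared-caching rule above is just a modulo on the block address; the block size and slice count below are assumed values:

    #include <stdint.h>

    #define BLOCK_BITS 6   /* assumed 64-byte cache blocks */
    #define N_SLICE    16  /* assumed 16 L2 slices, one per core */

    /* Shared caching: the home slice is fixed by the block address,
     * so software has no say in where a given block lands. */
    static inline unsigned slice_of_block(uint64_t paddr) {
        uint64_t block_addr = paddr >> BLOCK_BITS;
        return (unsigned)(block_addr % N_SLICE);
    }

Raising the granularity to pages (next slides) applies the same idea to the page number, which the OS already controls through physical page allocation.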
8
Changing cache mapping granularity
[Figure: mapping at the granularity of memory blocks vs. memory pages]
- Miss rate?
- Latency?
- Impact on existing techniques (e.g., prefetching)?
9
Observation: page-level mapping
[Figure: OS page allocation directs pages of Program 1 and Program 2 to different cache slices]
- Mapping data to different $$ is feasible
- Key: OS page allocation policies
- Flexible
10
Goal 1: performance management
Proximity-aware data mapping
11
Goal 2: power management
Usage-aware cache shut-off
12
Goal 3: reliability management
On-demand cache isolation
[Figure: faulty cache slices (marked X) are isolated]
13
Goal 4: QoS management
Contract-based cache allocation
14
Architectural support
On an L1 miss, the data address (page_num, page offset) determines the target L2 slice by one of three methods:
- Method 1: "bit selection": slice_num = (page_num) mod (N_slice)
- Method 2: "region table": slice_num_x selected when region_x_low ≤ page_num ≤ region_x_high
- Method 3: "page table (TLB)": page_num ↔ slice_num, with the slice number kept in the page table/TLB entry (vpage_num, ppage_num, slice_num)
- Simple hardware support is enough
- A combined scheme is feasible
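A minimal sketch (not from the slides) of the three lookup methods expressed in software; the constants and the region_entry/tlb_entry structures are illustrative assumptions, and real hardware would perform these lookups alongside address translation:

    #include <stdbool.h>
    #include <stdint.h>

    #define PAGE_BITS 12  /* assumed 4 KB pages */
    #define N_SLICE   16  /* assumed 16 slices */

    /* Method 1: "bit selection" -- slice from low-order page-number bits. */
    static unsigned slice_bit_selection(uint64_t paddr) {
        uint64_t page_num = paddr >> PAGE_BITS;
        return (unsigned)(page_num % N_SLICE);
    }

    /* Method 2: "region table" -- each entry maps a page-number range to a slice. */
    struct region_entry { uint64_t low, high; unsigned slice; };

    static int slice_region_table(uint64_t paddr,
                                  const struct region_entry *tab, int n) {
        uint64_t page_num = paddr >> PAGE_BITS;
        for (int i = 0; i < n; i++)
            if (tab[i].low <= page_num && page_num <= tab[i].high)
                return (int)tab[i].slice;
        return -1;  /* no region covers this page */
    }

    /* Method 3: "page table (TLB)" -- the slice number travels with the
     * virtual-to-physical translation. */
    struct tlb_entry { uint64_t vpage, ppage; unsigned slice; bool valid; };

    static int slice_from_tlb(uint64_t vaddr, const struct tlb_entry *tlb, int n) {
        uint64_t vpage = vaddr >> PAGE_BITS;
        for (int i = 0; i < n; i++)
            if (tlb[i].valid && tlb[i].vpage == vpage)
                return (int)tlb[i].slice;
        return -1;  /* TLB miss: fall back to the page-table walk */
    }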
15
Some OS design issues
- Congruence group CG(i)
  - The set of physical pages mapped to slice i
  - A free list for each i → multiple free lists
- On each page allocation, consider
  - Data proximity
  - Cache pressure
  - e.g., a profitability function P = f(M, L, P, Q, C), where M: miss rates, L: network link status, P: current page allocation status, Q: QoS requirements, C: cache configuration (see the sketch after this list)
- Impact on process scheduling
- Leverage existing frameworks
  - Page coloring – multiple free lists
  - NUMA OS – process scheduling & page allocation
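A sketch (not from the slides) of per-slice free lists and a profitability-driven allocator; the structure names and the profitability() stub are assumptions, with a real kernel folding in the M, L, P, Q, C inputs above:

    #include <stddef.h>

    #define N_SLICE 16  /* assumed 16 cache slices */

    /* One congruence group CG(i): free physical pages that map to slice i. */
    struct page_frame { struct page_frame *next; unsigned long pfn; };
    struct congruence_group { struct page_frame *free_list; };

    static struct congruence_group cg[N_SLICE];

    /* Hypothetical profitability function P = f(M, L, P, Q, C);
     * here reduced to a proximity-only placeholder. */
    static double profitability(int slice, int requesting_core) {
        return (slice == requesting_core) ? 1.0 : 0.5;
    }

    /* Allocate a page for a fault taken by a program on `core`:
     * pick the non-empty free list with the highest profitability. */
    static struct page_frame *alloc_page_for_core(int core) {
        int best = -1;
        double best_p = -1.0;
        for (int i = 0; i < N_SLICE; i++) {
            if (cg[i].free_list && profitability(i, core) > best_p) {
                best_p = profitability(i, core);
                best = i;
            }
        }
        if (best < 0)
            return NULL;  /* no free page in any congruence group */
        struct page_frame *pf = cg[best].free_list;
        cg[best].free_list = pf->next;
        return pf;
    }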
16
Working example
[Figure: 16-tile grid (slices 0–15); for each page allocation the OS ranks candidate slices by profitability, e.g., P(4) = 0.9, P(6) = 0.8, P(5) = 0.7, … or P(1) = 0.95, P(6) = 0.9, P(4) = 0.8, …, and places the page on the top-ranked slice]
- Static vs. dynamic mapping
  - Program information (e.g., profile)
  - Proper run-time monitoring needed
17
Page mapping policies
18
Simulating private caching
- Policy: for a page requested from a program running on core i, map the page to cache slice i
[Chart: L2 cache latency (cycles) vs. L2 cache slice size, SPEC2k INT and SPEC2k FP, private caching vs. OS-based]
- Simulating private caching is simple
- Similar or better performance
19
Simulating "large" private caching
- Policy: for a page requested from a program running on core i, map the page to cache slice i; also spread pages
[Chart: relative performance (time^-1), SPEC2k INT and SPEC2k FP, OS-based vs. private, 512 kB cache slice; peak annotated value 1.93]
20
Simulating shared caching
- Policy: for pages requested from a program running on core i, spread the pages over all cache slices (round-robin, random, …)
[Chart: L2 cache latency (cycles) vs. L2 cache slice size, SPEC2k INT and SPEC2k FP, shared vs. OS-based; outliers annotated at 129 and 106]
- Simulating shared caching is simple
- Mostly similar behavior/performance
- Pathological cases (e.g., applu)
21
Simulating clustered caching
- Policy: for a page requested from a program running on a core of group j, map the page to any cache slice within the group (round-robin, random, …)
[Figure: 16-tile grid partitioned into groups; Chart: relative performance (time^-1) of private, OS-based, and shared; 4 cores used, 512 kB cache slice]
- Simulating clustered caching is simple
- Lower miss traffic than private
- Lower on-chip traffic than shared
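For reference, a sketch (not from the slides) of how the three simulated policies of the last few slides reduce to a choice of target slice at page-allocation time; the slice/group counts and the round-robin counters are assumptions:

    #define N_SLICE    16  /* assumed 16 slices, one per core */
    #define GROUP_SIZE 4   /* assumed 4 slices per cluster */

    /* Private-like: keep a core's pages in its own local slice. */
    static int slice_private(int core) {
        return core;
    }

    /* Shared-like: spread pages over all slices, round-robin. */
    static int slice_shared_rr(void) {
        static int next = 0;
        int s = next;
        next = (next + 1) % N_SLICE;
        return s;
    }

    /* Clustered: spread pages over the slices of the core's group only. */
    static int slice_clustered_rr(int core) {
        static int next = 0;
        int base = (core / GROUP_SIZE) * GROUP_SIZE;
        int s = base + next;
        next = (next + 1) % GROUP_SIZE;
        return s;
    }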
22
Profile-driven page mapping
- Using profiling, collect:
  - Inter-page conflict information
  - Per-page access count information
- Page mapping cost function (per slice), given the program location, the page to map, and the previously mapped pages:
  cost = (# conflicts × miss penalty) + weight × (# accesses × latency)
  (the first term is the miss cost, the second the latency cost)
- weight as a knob: a larger value puts more weight on proximity (than on miss rate)
- Optimize both miss rate and data proximity
- Theoretically important to understand limits
- Can be practically important, too
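A sketch of how the cost function above could drive slice selection; the profile inputs (conflicts[], accesses, latency_to[]) are hypothetical names for the data the slide says profiling collects:

    #define N_SLICE 16  /* assumed 16 slices */

    /* cost = (#conflicts * miss_penalty) + weight * (#accesses * latency)
     * conflicts[s]  : profiled conflicts the page would add on slice s
     * accesses      : profiled access count of the page
     * latency_to[s] : latency from the program's core to slice s
     */
    static double page_cost(int slice, const int conflicts[], int accesses,
                            const int latency_to[], double miss_penalty,
                            double weight) {
        double miss_cost    = conflicts[slice] * miss_penalty;
        double latency_cost = weight * (double)accesses * latency_to[slice];
        return miss_cost + latency_cost;
    }

    /* Map the page to the slice with the lowest total cost. */
    static int best_slice(const int conflicts[], int accesses,
                          const int latency_to[], double miss_penalty,
                          double weight) {
        int best = 0;
        double best_c = page_cost(0, conflicts, accesses, latency_to,
                                  miss_penalty, weight);
        for (int s = 1; s < N_SLICE; s++) {
            double c = page_cost(s, conflicts, accesses, latency_to,
                                 miss_penalty, weight);
            if (c < best_c) { best_c = c; best = s; }
        }
        return best;
    }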
23
Profile-driven page mapping, cont'd
[Chart: breakdown of L2 cache accesses into local hits, remote on-chip hits, and misses as the weight knob varies; 256 kB L2 cache slice]
24
Profile-driven page mapping, cont'd
[Chart: number of pages mapped to each slice relative to the program location, for GCC; 256 kB L2 cache slice]
25
Profile-driven page mapping, cont'd
[Chart: performance improvement over shared caching, up to 108%; 256 kB L2 cache slice]
- Room for performance improvement
- Best of the two (private/shared), or better than both
- Dynamic mapping schemes desired
26
Isolating faulty caches
- When there are faulty cache slices, avoid mapping pages to them
[Chart: relative L2 cache latency vs. number of cache slice deletions, shared vs. OS-based; 4 cores running a multiprogrammed workload, 512 kB cache slice]
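A minimal sketch of the isolation policy under the same assumptions as the earlier sketches; slice_faulty[] is a hypothetical map of deleted slices:

    #include <stdbool.h>

    #define N_SLICE 16  /* assumed 16 slices */

    static bool slice_faulty[N_SLICE];  /* marked true when a slice is deleted */

    /* Round-robin over healthy slices only: faulty slices never receive pages. */
    static int next_healthy_slice(void) {
        static int next = 0;
        for (int tries = 0; tries < N_SLICE; tries++) {
            int s = next;
            next = (next + 1) % N_SLICE;
            if (!slice_faulty[s])
                return s;
        }
        return -1;  /* every slice has been isolated */
    }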
27
Conclusion
- "Flexibility" will become important in future multicores
  - Many shared resources
  - Allows us to implement high-level policies
- OS-level, page-granularity data-to-slice mapping
  - Low hardware overhead
  - Flexible
- Several management policies studied
  - Mimicking private/shared/clustered caching is straightforward
  - Performance-improving schemes
28
Future work
- Dynamic mapping schemes
  - Performance
  - Power
- Performance monitoring techniques
  - Hardware-based
  - Software-based
- Data migration and replication support
29
Managing Distributed, Shared L2 Caches through OS-Level Page Allocation
Sangyeun Cho and Lei Jin
Dept. of Computer Science, University of Pittsburgh
Thank you!
30
Multicores are here
- AMD Opteron dual-core (2005)
- IBM Power5 (2004)
- Sun Micro. T1, 8 cores (2005)
- Intel Core2 Duo (2006); quad cores (2007)
- Intel 80 cores? (2010?)
31
A multicore outlook
???
32
A processor model
- Many cores (e.g., 16)
[Figure: one tile = processor core + local L2 cache slice + router]
- Private L1 I/D-$$: 8 kB~32 kB
- Local unified L2 $$: 128 kB~512 kB, 8~18 cycles
- Switched network: 2~4 cycles/switch
- Distributed directory
  - Scatter hotspots
33
Other approaches
- Hybrid/flexible schemes
  - "Core clustering" [Speight et al., ISCA 2005]
  - "Flexible CMP cache sharing" [Huh et al., ICS 2004]
  - "Flexible bank mapping" [Liu et al., HPCA 2004]
- Improving shared caching
  - "Victim replication" [Zhang and Asanovic, ISCA 2005]
- Improving private caching
  - "Cooperative caching" [Chang and Sohi, ISCA 2006]
  - "CMP-NuRAPID" [Chishti et al., ISCA 2005]