Managing Distributed, Shared L2 Caches through OS-Level Page Allocation
Sangyeun Cho and Lei Jin
Dept. of Computer Science, University of Pittsburgh
Dec. 13 '06 – MICRO-39
Multicore distributed L2 caches
- L2 caches are typically sub-banked and distributed
  - IBM Power4/5: 3 banks
  - Sun Microsystems T1: 4 banks
  - Intel Itanium2 (L3): many "sub-arrays"
- Distributed L2 caches + switched NoC => NUCA
- Hardware-based management schemes
  - Private caching
  - Shared caching
  - Hybrid caching
(Figure: tiled multicore; each tile holds a processor core, a local L2 cache slice, and a router)
Private caching
- Flow: 1. L1 miss -> 2. local L2 access -> hit, or miss -> 3. access directory -> a copy on chip, or global miss
- Pros: short hit latency (always local)
- Cons: high on-chip miss rate, long miss resolution time, complex coherence enforcement
Shared caching
- Flow: 1. L1 miss -> 2. L2 access -> hit or miss
- Pros: low on-chip miss rate, straightforward data location, simple coherence (no replication)
- Cons: long average hit latency
Our work
- Placing "flexibility" as the top design consideration
- OS-level data-to-L2-cache mapping
  - Simple hardware based on shared caching
  - Efficient mapping maintenance at page granularity
- Demonstrating the impact using different policies
Talk roadmap
- Data mapping, a key property
- Flexible page-level mapping
  - Goals
  - Architectural support
  - OS design issues
- Management policies
- Conclusion and future work
Data mapping, the key
- Data mapping = deciding data location (i.e., cache slice)
- Private caching
  - Data mapping determined by program location
  - Mapping created at miss time
  - No explicit control
- Shared caching
  - Data mapping determined by address: slice_number = (block_address) mod N_slice
  - Mapping is static; cache block installation at miss time
  - No explicit control (run-time can impact location within a slice)
- Mapping granularity = block (sketch below)
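To make the granularity difference concrete, here is a minimal C sketch (block size, page size, slice count, and function names are illustrative assumptions, not from the paper) contrasting the block-interleaved slice choice of shared caching with a page-granularity choice the OS can steer:

    #include <stdint.h>

    #define BLOCK_BITS 6      /* 64-byte cache blocks (assumed)      */
    #define PAGE_BITS  12     /* 4kB pages (assumed)                 */
    #define N_SLICE    16     /* number of L2 cache slices (assumed) */

    /* Shared caching: slice fixed by the block address; mapping granularity = block. */
    static unsigned slice_of_block(uint64_t addr)
    {
        return (unsigned)((addr >> BLOCK_BITS) % N_SLICE);
    }

    /* Page-granularity mapping: slice depends only on the page number, so the OS
     * can place a whole page on any slice by choosing which physical page it hands out. */
    static unsigned slice_of_page(uint64_t addr)
    {
        return (unsigned)((addr >> PAGE_BITS) % N_SLICE);
    }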
Changing cache mapping granularity
(Figure: mapping granularity changed from memory blocks to memory pages)
- Miss rate?
- Latency?
- Impact on existing techniques (e.g., prefetching)?
Observation: page-level mapping
(Figure: memory pages of Program 1 and Program 2 steered to different cache slices via OS page allocation)
- Mapping data to different $$ feasible
- Key: OS page allocation policies
- Flexible
Goal 1: performance management
- Proximity-aware data mapping
Goal 2: power management
- Usage-aware cache shut-off
Goal 3: reliability management
- On-demand cache isolation
(Figure: faulty cache slices marked with X and excluded)
Goal 4: QoS management
- Contract-based cache allocation
Dec. 13 ’06 – MICRO-39 page_numpage offset Architectural support L1 miss Method 1: “bit selection” slice_num = ( page_num ) % ( N slice ) other bitsslice_numpage offset data address Method 2: “region table” regionx_low ≤ page_num ≤ regionx_high page_numpage offset region0_lowslice_num0region0_high region1_lowslice_num1region1_high Method 3: “page table (TLB)” page_num «–» slice_num vpage_num0slice_num0ppage_num0 vpage_num1slice_num1ppage_num1 reg_table TLB Method 1: “bit selection” slice number = ( page_num ) % ( N slice ) Method 2: “region table” regionx_low ≤ page_num ≤ regionx_high Method 3: “page table (TLB)” page_num «–» slice_num Simple hardware support enough Combined scheme feasible
Some OS design issues
- Congruence group CG(i): the set of physical pages mapped to slice i -> one free list per i (multiple free lists)
- On each page allocation, consider data proximity and cache pressure (sketch below)
- (e.g.) profitability function P = f(M, L, P, Q, C)
  - M: miss rates
  - L: network link status
  - P: current page allocation status
  - Q: QoS requirements
  - C: cache configuration
- Impact on process scheduling
- Leverage existing frameworks
  - Page coloring -- multiple free lists
  - NUMA OS -- process scheduling & page allocation
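A minimal sketch of page allocation over per-slice free lists; the profitability function P = f(M, L, P, Q, C) is abstracted behind a callback, and all names and structures here are illustrative assumptions rather than the actual OS code:

    #include <stddef.h>

    #define N_SLICE 16                      /* number of L2 slices (assumed) */

    struct page { struct page *next; };     /* physical page descriptor */

    /* One free list per congruence group CG(i). */
    static struct page *cg_free[N_SLICE];

    /* Profitability of placing the next page of 'core' on 'slice';
     * stands in for P = f(M, L, P, Q, C).                            */
    typedef double (*profit_fn)(int slice, int core);

    static struct page *alloc_page(int core, profit_fn P)
    {
        int    best   = -1;
        double best_p = 0.0;

        for (int s = 0; s < N_SLICE; s++) {
            if (cg_free[s] == NULL)         /* CG(s) exhausted: cache pressure */
                continue;
            double p = P(s, core);
            if (best < 0 || p > best_p) {
                best_p = p;
                best   = s;
            }
        }
        if (best < 0)
            return NULL;                    /* no free page anywhere */

        struct page *pg = cg_free[best];    /* pop from CG(best)'s free list */
        cg_free[best] = pg->next;
        return pg;
    }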
Working example
(Figure: candidate slices for a newly touched page ranked by profitability, e.g., P(4) = 0.9, P(6) = 0.8, P(5) = 0.7, ...; for a later allocation, P(1) = 0.95, P(6) = 0.9, P(4) = 0.8, ...; the highest-profitability slice is chosen)
- Static vs. dynamic mapping
  - Static: program information (e.g., profile)
  - Dynamic: proper run-time monitoring needed
Page mapping policies
Simulating private caching
- For a page requested from a program running on core i, map the page to cache slice i
(Chart: L2 cache latency in cycles vs. L2 cache slice size, SPEC2k INT and SPEC2k FP, private caching vs. OS-based)
- Simulating private caching is simple
- Similar or better performance
Simulating "large" private caching
- For a page requested from a program running on core i, map the page to cache slice i; also spread pages
(Chart: relative performance (time^-1), SPEC2k INT and SPEC2k FP, OS-based vs. private, per cache slice size)
Simulating shared caching
- For a page requested from a program running on core i, map the page to all cache slices (round-robin, random, ...)
(Chart: L2 cache latency in cycles vs. L2 cache slice size, SPEC2k INT and SPEC2k FP, shared vs. OS-based)
- Simulating shared caching is simple
- Mostly similar behavior/performance
- Pathological cases (e.g., applu)
Simulating clustered caching
- For a page requested from a program running on a core in group j, map the page to any cache slice within the group (round-robin, random, ...) -- see the sketch below
(Chart: relative performance (time^-1), private vs. OS-based vs. shared; 4 cores used; 512kB cache slice)
- Simulating clustered caching is simple
- Lower miss traffic than private
- Lower on-chip traffic than shared
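The emulation policies on the preceding slides reduce to trivially different page-to-slice functions; a sketch assuming 16 slices and 4-slice clusters (the round-robin cursor is one simple way to "spread pages"; the "large" private variant starts from map_private and additionally spreads pages):

    #define N_SLICE    16     /* total L2 slices (assumed)    */
    #define GROUP_SIZE  4     /* slices per cluster (assumed) */

    static unsigned rr;       /* round-robin cursor for spreading */

    /* Private caching: always the requesting core's local slice. */
    static unsigned map_private(unsigned core)
    {
        return core;
    }

    /* Shared caching: spread pages over all slices (round-robin here;
     * random would do as well).                                        */
    static unsigned map_shared(void)
    {
        return rr++ % N_SLICE;
    }

    /* Clustered caching: spread pages only within the requesting core's group. */
    static unsigned map_clustered(unsigned core)
    {
        unsigned base = (core / GROUP_SIZE) * GROUP_SIZE;
        return base + rr++ % GROUP_SIZE;
    }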
Profile-driven page mapping
- Using profiling, collect:
  - Inter-page conflict information
  - Per-page access count information
- Page mapping cost function (per slice), given the program location, the page to map, and previously mapped pages (sketch below):
  cost = (# conflicts × miss penalty) + weight × (# accesses × latency)
         [miss cost]                             [latency cost]
- weight as a knob: a larger value puts more weight on proximity (than on miss rate)
- Optimize both miss rate and data proximity
- Theoretically important to understand limits; can be practically important, too
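A minimal sketch of the per-slice cost function, with array-based stand-ins for the profiled conflict counts and the hop-dependent latencies (names are illustrative assumptions):

    #define N_SLICE 16   /* number of L2 slices (assumed) */

    /* Pick the slice minimizing
     *   cost(s) = conflicts[s] * miss_penalty + weight * accesses * latency[s]
     * where conflicts[s] counts conflicts with pages already mapped to s and
     * latency[s] is the access latency from the program's core to slice s.  */
    static int best_slice(unsigned accesses, double miss_penalty, double weight,
                          const unsigned conflicts[N_SLICE],
                          const unsigned latency[N_SLICE])
    {
        int    best      = 0;
        double best_cost = conflicts[0] * miss_penalty
                         + weight * (double)accesses * latency[0];

        for (int s = 1; s < N_SLICE; s++) {
            double cost = conflicts[s] * miss_penalty
                        + weight * (double)accesses * latency[s];
            if (cost < best_cost) {
                best_cost = cost;
                best      = s;
            }
        }
        return best;
    }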
Profile-driven page mapping, cont'd
(Chart: breakdown of L2 cache accesses into local hits, remote on-chip hits, and misses as weight varies; 256kB L2 cache slice)
Profile-driven page mapping, cont'd
(Chart: number of pages mapped to each cache slice for GCC, with the program location marked; 256kB L2 cache slice)
Profile-driven page mapping, cont'd
(Chart: performance improvement over shared caching, up to 108%; 256kB L2 cache slice)
- Room for performance improvement
- Matches the better of private and shared caching, or beats both
- Dynamic mapping schemes desired
Isolating faulty caches
- When there are faulty cache slices, avoid mapping pages to them (sketch below)
(Chart: relative L2 cache latency vs. number of cache slice deletions, shared vs. OS-based; 4 cores running a multiprogrammed workload; 512kB cache slice)
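A minimal sketch of the isolation policy, assuming a simple bitmap of deleted slices that the page allocator consults (names are illustrative assumptions):

    #include <stdbool.h>
    #include <stdint.h>

    #define N_SLICE 16                      /* number of L2 slices (assumed) */

    static uint32_t faulty_mask;            /* bit s set => slice s deleted */

    static void mark_slice_faulty(unsigned s) { faulty_mask |= 1u << s; }
    static bool slice_usable(unsigned s)      { return !(faulty_mask & (1u << s)); }

    /* Round-robin over the remaining healthy slices; new pages are never
     * mapped to a deleted slice.  Returns -1 if every slice is deleted.   */
    static int next_usable_slice(void)
    {
        static unsigned rr;
        for (int tries = 0; tries < N_SLICE; tries++) {
            unsigned s = rr++ % N_SLICE;
            if (slice_usable(s))
                return (int)s;
        }
        return -1;
    }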
Conclusion
- "Flexibility" will become important in future multicores
  - Many shared resources
  - Allows us to implement high-level policies
- OS-level page-granularity data-to-slice mapping
  - Low hardware overhead
  - Flexible
- Several management policies studied
  - Mimicking private/shared/clustered caching is straightforward
  - Performance-improving schemes
Future work
- Dynamic mapping schemes
  - Performance
  - Power
- Performance monitoring techniques
  - Hardware-based
  - Software-based
- Data migration and replication support
Managing Distributed, Shared L2 Caches through OS-Level Page Allocation
Sangyeun Cho and Lei Jin
Dept. of Computer Science, University of Pittsburgh
Thank you!
Multicores are here
- AMD Opteron dual-core (2005)
- IBM Power5 (2004)
- Sun Microsystems T1, 8 cores (2005)
- Intel Core2 Duo (2006)
- Quad cores (2007)
- Intel 80 cores? (2010?)
A multicore outlook
- ???
A processor model
- Many cores (e.g., 16); each tile has a processor core, a local L2 cache slice, and a router
- Private L1 I/D-$$: 8kB~32kB
- Local unified L2 $$: 128kB~512kB, 8~18 cycles
- Switched network: 2~4 cycles/switch
- Distributed directory (scatter hotspots)
Other approaches
- Hybrid/flexible schemes
  - "Core clustering" [Speight et al., ISCA 2005]
  - "Flexible CMP cache sharing" [Huh et al., ICS 2004]
  - "Flexible bank mapping" [Liu et al., HPCA 2004]
- Improving shared caching
  - "Victim replication" [Zhang and Asanovic, ISCA 2005]
- Improving private caching
  - "Cooperative caching" [Chang and Sohi, ISCA 2006]
  - "CMP-NuRAPID" [Chishti et al., ISCA 2005]