
1 Managing Distributed, Shared L2 Caches through OS-Level Page Allocation
Sangyeun Cho and Lei Jin, Dept. of Computer Science, University of Pittsburgh

2 Dec. 13 ’06 – MICRO-39 Multicore distributed L2 caches
- L2 caches are typically sub-banked and distributed: IBM Power4/5: 3 banks; Sun Microsystems T1: 4 banks; Intel Itanium2 (L3): many “sub-arrays”
- (Distributed L2 caches + switched NoC) → NUCA
- Hardware-based management schemes: private caching, shared caching, hybrid caching
[Figure: tiled multicore; each tile has a processor core, a local L2 cache slice, and a router]

3 Dec. 13 ’06 – MICRO-39 Private caching
[Figure: 1. L1 miss → 2. L2 access (hit, or miss) → 3. access directory (a copy on chip, or global miss)]
- Pro: short hit latency (always local)
- Cons: high on-chip miss rate; long miss resolution time; complex coherence enforcement

4 Dec. 13 ’06 – MICRO-39 Shared caching
[Figure: 1. L1 miss → 2. L2 access (hit or miss)]
- Pros: low on-chip miss rate; straightforward data location; simple coherence (no replication)
- Con: long average hit latency

5 Dec. 13 ’06 – MICRO-39 Our work
- Placing “flexibility” as the top design consideration
- OS-level data to L2 cache mapping: simple hardware based on shared caching; efficient mapping maintenance at page granularity
- Demonstrating the impact using different policies

6 Dec. 13 ’06 – MICRO-39 Talk roadmap
- Data mapping, a key property
- Flexible page-level mapping: goals, architectural support, OS design issues
- Management policies
- Conclusion and future work

7 Dec. 13 ’06 – MICRO-39 Data mapping, the key
- Data mapping = deciding data location (i.e., cache slice)
- Private caching: data mapping determined by program location; mapping created at miss time; no explicit control
- Shared caching: data mapping determined by address, slice number = (block address) % (N_slice); mapping is static; cache block installation at miss time; no explicit control (run-time can impact location within slice); mapping granularity = block
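
To make the granularity contrast concrete, a minimal C sketch follows (not from the talk; N_SLICE, BLOCK_BITS, and PAGE_BITS are assumed example parameters): shared caching hard-wires the slice from the block address, whereas a page-granularity scheme derives it from the physical page number, which the OS controls through page allocation.

    #include <stdint.h>

    #define N_SLICE    16      /* assumed: 16 cache slices, one per tile */
    #define BLOCK_BITS 6       /* assumed: 64-byte cache blocks          */
    #define PAGE_BITS  12      /* assumed: 4kB pages                     */

    /* Shared caching: the hardware picks the slice from the block address,
       slice number = (block address) % (N slice); granularity = block.    */
    static unsigned slice_from_block(uint64_t paddr) {
        return (unsigned)((paddr >> BLOCK_BITS) % N_SLICE);
    }

    /* Page-granularity mapping: the slice is a function of the physical
       page number, so the OS chooses the slice simply by choosing which
       physical page frame to hand out.                                    */
    static unsigned slice_from_page(uint64_t paddr) {
        return (unsigned)((paddr >> PAGE_BITS) % N_SLICE);
    }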

8 Dec. 13 ’06 – MICRO-39 Changing cache mapping granularity
[Figure: mapping at the granularity of memory blocks vs. memory pages]
- miss rate?
- impact on existing techniques (e.g., prefetching)?
- latency?

9 Dec. 13 ’06 – MICRO-39 Observation: page-level mapping
[Figure: OS page allocation maps the memory pages of Program 1 and Program 2 to different cache slices]
- Mapping data to different $$ feasible
- Key: OS page allocation policies
- Flexible

10 Dec. 13 ’06 – MICRO-39 Goal 1: performance management
- Proximity-aware data mapping

11 Dec. 13 ’06 – MICRO-39 Goal 2: power management
- Usage-aware cache shut-off

12 Dec. 13 ’06 – MICRO-39 Goal 3: reliability management
- On-demand cache isolation

13 Dec. 13 ’06 – MICRO-39 Goal 4: QoS management
- Contract-based cache allocation

14 Dec. 13 ’06 – MICRO-39 Architectural support
On an L1 miss, the slice number is derived from the data address (page_num, page offset) in one of three ways:
- Method 1: “bit selection”: slice_num = (page_num) % (N_slice), i.e., slice_num is a bit field of the page number
- Method 2: “region table”: slice_num_x such that regionx_low ≤ page_num ≤ regionx_high; each reg_table entry holds (region_low, slice_num, region_high)
- Method 3: “page table (TLB)”: page_num ↔ slice_num; each TLB entry holds (vpage_num, slice_num, ppage_num)
- Simple hardware support is enough
- A combined scheme is feasible
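
The three lookup methods can be sketched roughly as below; this is an illustration of the slide’s idea, not the actual hardware, and the structures and field names (reg_table entries, TLB entries) are assumptions.

    #include <stdbool.h>
    #include <stdint.h>

    #define N_SLICE 16                       /* assumed number of cache slices */

    /* Method 1: "bit selection" -- slice_num = (page_num) % (N_slice)        */
    static unsigned bit_selection(uint64_t page_num) {
        return (unsigned)(page_num % N_SLICE);
    }

    /* Method 2: "region table" -- each reg_table entry maps a contiguous
       page-number range [low, high] to a slice.                              */
    struct region_entry { uint64_t low, high; unsigned slice; };

    static bool region_lookup(const struct region_entry *tab, int n,
                              uint64_t page_num, unsigned *slice) {
        for (int i = 0; i < n; i++) {
            if (tab[i].low <= page_num && page_num <= tab[i].high) {
                *slice = tab[i].slice;
                return true;
            }
        }
        return false;                        /* no region matched: fall back   */
    }

    /* Method 3: "page table (TLB)" -- the slice number travels with the
       address translation, so an L1 miss already knows its target slice.     */
    struct tlb_entry { uint64_t vpage_num, ppage_num; unsigned slice_num; };

    static unsigned tlb_slice(const struct tlb_entry *e) {
        return e->slice_num;
    }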

15 Dec. 13 ’06 – MICRO-39 Some OS design issues
- Congruence group CG(i): set of physical pages mapped to slice i; a free list for each i → multiple free lists
- On each page allocation, consider data proximity and cache pressure, e.g., via a profitability function P = f(M, L, P, Q, C), where M: miss rates, L: network link status, P: current page allocation status, Q: QoS requirements, C: cache configuration
- Impact on process scheduling
- Leverage existing frameworks: page coloring (multiple free lists); NUMA OS (process scheduling & page allocation)
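
As a minimal sketch of these ideas (all names, weights, and helpers below are invented for illustration, not the paper’s implementation), the allocator keeps one free list per congruence group and picks the slice with the highest profitability, folding in only proximity and cache pressure:

    #include <stddef.h>

    #define N_SLICE 16                                 /* assumed: one slice per tile */

    struct page { struct page *next; };                /* minimal page-frame descriptor */
    struct free_list { struct page *head; size_t nfree; };

    static struct free_list cg[N_SLICE];               /* CG(i): one free list per slice */

    /* Hypothetical helpers: hop count between a core and a slice, and how
       full a slice already is (0.0 = empty, 1.0 = saturated).              */
    extern int    hop_distance(int core, int slice);
    extern double cache_pressure(int slice);

    /* Toy profitability in the spirit of P = f(M, L, P, Q, C): prefer close,
       lightly loaded slices. A real policy would also fold in miss rates,
       link status, QoS contracts, and cache configuration.                 */
    static double profitability(int core, int slice) {
        return (1.0 - cache_pressure(slice)) / (1.0 + hop_distance(core, slice));
    }

    /* Allocate a physical page for a fault taken by 'core': scan the
       congruence groups, pick the most profitable slice with free pages,
       and pop a frame from that slice's free list.                         */
    struct page *alloc_page_for(int core) {
        int best = -1;
        double best_p = -1.0;
        for (int s = 0; s < N_SLICE; s++) {
            if (cg[s].nfree == 0)
                continue;                              /* this slice's pool is empty */
            double p = profitability(core, s);
            if (p > best_p) { best_p = p; best = s; }
        }
        if (best < 0)
            return NULL;                               /* no free page anywhere */
        struct page *pg = cg[best].head;
        cg[best].head = pg->next;                      /* unlink from CG(best) */
        cg[best].nfree--;
        return pg;
    }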

16 Dec. 13 ’06 – MICRO-39 Working example
[Figure: 16-tile grid (slices 0–15); candidate slices for a program ranked by profitability, e.g., P(4) = 0.9, P(6) = 0.8, P(5) = 0.7, … and later P(1) = 0.95, P(6) = 0.9, P(4) = 0.8, …]
- Static vs. dynamic mapping
- Program information (e.g., profile)
- Proper run-time monitoring needed

17 Dec. 13 ’06 – MICRO-39 Page mapping policies

18 Dec. 13 ’06 – MICRO-39 Simulating private caching
Policy: for a page requested from a program running on core i, map the page to cache slice i.
[Chart: L2 cache latency (cycles) vs. L2 cache slice size, SPEC2k INT and SPEC2k FP, private caching vs. OS-based]
- Simulating private caching is simple
- Similar or better performance
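
A sketch of this policy as an OS-side slice selector (a hypothetical function, not the simulator’s code):

    /* Mimicking private caching purely by page placement: every page a
       program touches is backed by a frame in its own core's slice.     */
    int slice_for_private(int requesting_core) {
        return requesting_core;
    }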

19 Dec. 13 ’06 – MICRO-39 Simulating “large” private caching
Policy: for a page requested from a program running on core i, map the page to cache slice i; also spread pages.
[Chart: relative performance (time^-1) of OS-based vs. private caching, SPEC2k INT and SPEC2k FP, 512kB cache slice; one bar labeled 1.93]

20 Dec. 13 ’06 – MICRO-39 Simulating shared caching
Policy: for a page requested from a program running on core i, map the page to all cache slices (round-robin, random, …).
[Chart: L2 cache latency (cycles) vs. L2 cache slice size, SPEC2k INT and SPEC2k FP, shared vs. OS-based; data points labeled 129 and 106]
- Simulating shared caching is simple
- Mostly similar behavior/performance
- Pathological cases (e.g., applu)
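
A corresponding sketch for the shared-caching policy, spreading pages round-robin over all slices (N_SLICE and the rotation counter are assumptions; random placement would work similarly):

    #define N_SLICE 16
    static unsigned next_slice;              /* simple global rotation counter */

    /* Mimicking shared caching: successive pages go to successive slices,
       approximating the block-interleaved spread at page granularity.     */
    unsigned slice_for_shared(void) {
        unsigned s = next_slice;
        next_slice = (next_slice + 1) % N_SLICE;
        return s;
    }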

21 Dec. 13 ’06 – MICRO-39 Simulating clustered caching
Policy: for a page requested from a program running on a core of group j, map the page to any cache slice within the group (round-robin, random, …).
[Figure: 16-tile grid (slices 0–15) partitioned into groups; chart: relative performance (time^-1) of private, OS-based, and shared; 4 cores used, 512kB cache slice]
- Simulating clustered caching is simple
- Lower miss traffic than private
- Lower on-chip traffic than shared
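
And a sketch for clustered caching (the group size and the contiguous-ID grouping are simplifying assumptions; on the 4x4 mesh a group would more likely be a 2x2 quadrant):

    #define GROUP_SIZE 4                     /* assumed: 4 slices per cluster */
    static unsigned rr;                      /* rotation within the group      */

    /* Mimicking clustered caching: pages from a core are spread only over
       the slices of that core's group.                                     */
    unsigned slice_for_clustered(int requesting_core) {
        unsigned base = (unsigned)(requesting_core / GROUP_SIZE) * GROUP_SIZE;
        return base + (rr++ % GROUP_SIZE);
    }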

22 Dec. 13 ’06 – MICRO-39 Profile-driven page mapping
- Using profiling, collect: inter-page conflict information; per-page access count information
- Page mapping cost function (per slice), given program location, page to map, and previously mapped pages: cost = (# conflicts × miss penalty) + weight × (# accesses × latency), where the first term is the miss cost and the second the latency cost
- weight as a knob: a larger value → more weight on proximity (than miss rate)
- Optimize both miss rate and data proximity
- Theoretically important to understand limits
- Can be practically important, too
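
The cost function lends itself to a direct sketch; conflicts_on(), accesses_of(), and latency_to() stand in for the profile data and the topology and are hypothetical helpers, not the paper’s code:

    #define N_SLICE 16

    /* Hypothetical profile/topology helpers:                                 */
    extern unsigned conflicts_on(int slice, int page);  /* conflicts with pages already on 'slice' */
    extern unsigned accesses_of(int page);              /* per-page access count from the profile  */
    extern unsigned latency_to(int core, int slice);    /* round-trip L2 latency in cycles          */

    /* Evaluate cost = (#conflicts x miss_penalty) + weight x (#accesses x latency)
       for every slice and return the cheapest one.                           */
    int best_slice(int core, int page, double weight, unsigned miss_penalty) {
        int best = 0;
        double best_cost = -1.0;
        for (int s = 0; s < N_SLICE; s++) {
            double miss_cost    = (double)conflicts_on(s, page) * miss_penalty;
            double latency_cost = weight * (double)accesses_of(page) * latency_to(core, s);
            double cost = miss_cost + latency_cost;
            if (best_cost < 0.0 || cost < best_cost) {
                best_cost = cost;
                best = s;
            }
        }
        return best;
    }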

23 Dec. 13 ’06 – MICRO-39 Profile-driven page mapping, cont’d
[Chart: breakdown of L2 cache accesses (on-chip hits, local vs. remote, and misses) as the weight knob varies; 256kB L2 cache slice]

24 Dec. 13 ’06 – MICRO-39 Profile-driven page mapping, cont’d
[Chart: # pages mapped to each cache slice for GCC, with the program location marked; 256kB L2 cache slice]

25 Dec. 13 ’06 – MICRO-39 Profile-driven page mapping, cont’d
[Chart: performance improvement over shared caching, up to 108%; 256kB L2 cache slice]
- Room for performance improvement
- Best of the two or better than the two
- Dynamic mapping schemes desired

26 Dec. 13 ’06 – MICRO-39 Isolating faulty caches
Policy: when there are faulty cache slices, avoid mapping pages to them.
[Chart: relative L2 cache latency vs. # cache slice deletions, shared vs. OS-based; 4 cores running a multiprogrammed workload, 512kB cache slice]
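
A sketch of the isolation policy (faulty[] and the candidate scan are illustrative; profitability() is the toy function from the slide-15 sketch): deleted slices simply drop out of the candidate set, so no new page is ever backed by them.

    #include <stdbool.h>

    #define N_SLICE 16

    static bool faulty[N_SLICE];             /* marked when a slice is diagnosed bad ("deleted") */

    extern double profitability(int core, int slice);

    /* Pick a slice for a new page while skipping deleted slices entirely,
       so faulty cache banks are isolated purely by page placement.        */
    int pick_slice(int core) {
        int best = -1;
        double best_p = -1.0;
        for (int s = 0; s < N_SLICE; s++) {
            if (faulty[s])
                continue;                    /* never map pages to a bad slice */
            double p = profitability(core, s);
            if (p > best_p) { best_p = p; best = s; }
        }
        return best;                         /* -1 only if every slice is faulty */
    }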

27 Dec. 13 ’06 – MICRO-39 Conclusion
- “Flexibility” will become important in future multicores: many shared resources; allows us to implement high-level policies
- OS-level page-granularity data-to-slice mapping: low hardware overhead; flexible
- Several management policies studied: mimicking private/shared/clustered caching is straightforward; performance-improving schemes

28 Dec. 13 ’06 – MICRO-39 Future work
- Dynamic mapping schemes: performance, power
- Performance monitoring techniques: hardware-based, software-based
- Data migration and replication support

29 Dec. 13 ’06 – MICRO-39 Managing Distributed, Shared L2 Caches through OS-Level Page Allocation
Sangyeun Cho and Lei Jin, Dept. of Computer Science, University of Pittsburgh
Thank you!

30 Dec. 13 ’06 – MICRO-39 Multicores are here
- AMD Opteron dual-core (2005)
- IBM Power5 (2004)
- Sun Micro. T1, 8 cores (2005)
- Intel Core2 Duo (2006)
- Quad cores (2007)
- Intel 80 cores? (2010?)

31 Dec. 13 ’06 – MICRO-39 A multicore outlook ???

32 Dec. 13 ’06 – MICRO-39 A processor model
Many cores (e.g., 16); each tile has a processor core, a local L2 cache slice, and a router.
- Private L1 I/D-$$: 8kB~32kB
- Local unified L2 $$: 128kB~512kB, 8~18 cycles
- Switched network: 2~4 cycles/switch
- Distributed directory: scatter hotspots

33 Dec. 13 ’06 – MICRO-39 Other approaches
- Hybrid/flexible schemes: “Core clustering” [Speight et al., ISCA2005]; “Flexible CMP cache sharing” [Huh et al., ICS2004]; “Flexible bank mapping” [Liu et al., HPCA2004]
- Improving shared caching: “Victim replication” [Zhang and Asanovic, ISCA2005]
- Improving private caching: “Cooperative caching” [Chang and Sohi, ISCA2006]; “CMP-NuRAPID” [Chishti et al., ISCA2005]

