Managing Distributed, Shared L2 Caches through OS-Level Page Allocation
Sangyeun Cho and Lei Jin
Dept. of Computer Science, University of Pittsburgh
Dec. 13 '06 – MICRO-39
Multicore distributed L2 caches
- L2 caches are typically sub-banked and distributed
  - IBM Power4/5: 3 banks
  - Sun Microsystems T1: 4 banks
  - Intel Itanium2 (L3): many "sub-arrays"
- Distributed L2 caches + switched NoC => NUCA
- Hardware-based management schemes
  - Private caching
  - Shared caching
  - Hybrid caching
(Figure: tiled multicore; each tile holds a processor core, a local L2 cache slice, and a router)
Private caching
- Flow: 1. L1 miss -> 2. local L2 access -> hit, or miss -> 3. access directory -> a copy on chip, or global miss
- Pros: short hit latency (always local)
- Cons: high on-chip miss rate, long miss resolution time, complex coherence enforcement
Shared caching
- Flow: 1. L1 miss -> 2. L2 access -> hit or miss
- Pros: low on-chip miss rate, straightforward data location, simple coherence (no replication)
- Cons: long average hit latency
Our work
- Placing "flexibility" as the top design consideration
- OS-level data-to-L2-cache mapping
  - Simple hardware based on shared caching
  - Efficient mapping maintenance at page granularity
- Demonstrating the impact using different policies
Talk roadmap
- Data mapping, a key property
- Flexible page-level mapping
  - Goals
  - Architectural support
  - OS design issues
- Management policies
- Conclusion and future work
Data mapping, the key
- Data mapping = deciding data location (i.e., cache slice)
- Private caching
  - Data mapping determined by program location
  - Mapping created at miss time
  - No explicit control
- Shared caching
  - Data mapping determined by address: slice_number = (block_address) mod N_slice
  - Mapping is static; cache block installation at miss time
  - No explicit control (run-time can impact location within a slice)
- Mapping granularity = block (sketch below)
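To make the granularity difference concrete, here is a minimal C sketch (block size, page size, slice count, and function names are illustrative assumptions, not from the paper) contrasting the block-interleaved slice choice of shared caching with a page-granularity choice the OS can steer:

    #include <stdint.h>

    #define BLOCK_BITS 6      /* 64-byte cache blocks (assumed)      */
    #define PAGE_BITS  12     /* 4kB pages (assumed)                 */
    #define N_SLICE    16     /* number of L2 cache slices (assumed) */

    /* Shared caching: slice fixed by the block address; mapping granularity = block. */
    static unsigned slice_of_block(uint64_t addr)
    {
        return (unsigned)((addr >> BLOCK_BITS) % N_SLICE);
    }

    /* Page-granularity mapping: slice depends only on the page number, so the OS
     * can place a whole page on any slice by choosing which physical page it hands out. */
    static unsigned slice_of_page(uint64_t addr)
    {
        return (unsigned)((addr >> PAGE_BITS) % N_SLICE);
    }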
Changing cache mapping granularity
(Figure: mapping granularity changed from memory blocks to memory pages)
- Miss rate?
- Latency?
- Impact on existing techniques (e.g., prefetching)?
Observation: page-level mapping
(Figure: memory pages of Program 1 and Program 2 steered to different cache slices via OS page allocation)
- Mapping data to different $$ feasible
- Key: OS page allocation policies
- Flexible
Goal 1: performance management
- Proximity-aware data mapping
Goal 2: power management
- Usage-aware cache shut-off
Goal 3: reliability management
- On-demand cache isolation
(Figure: faulty cache slices marked with X and excluded)
Goal 4: QoS management
- Contract-based cache allocation
Dec. 13 ’06 – MICRO-39 page_numpage offset Architectural support L1 miss Method 1: “bit selection” slice_num = ( page_num ) % ( N slice ) other bitsslice_numpage offset data address Method 2: “region table” regionx_low ≤ page_num ≤ regionx_high page_numpage offset region0_lowslice_num0region0_high region1_lowslice_num1region1_high Method 3: “page table (TLB)” page_num «–» slice_num vpage_num0slice_num0ppage_num0 vpage_num1slice_num1ppage_num1 reg_table TLB Method 1: “bit selection” slice number = ( page_num ) % ( N slice ) Method 2: “region table” regionx_low ≤ page_num ≤ regionx_high Method 3: “page table (TLB)” page_num «–» slice_num Simple hardware support enough Combined scheme feasible
Some OS design issues
- Congruence group CG(i): the set of physical pages mapped to slice i -> one free list per i (multiple free lists)
- On each page allocation, consider data proximity and cache pressure (sketch below)
- (e.g.) profitability function P = f(M, L, P, Q, C)
  - M: miss rates
  - L: network link status
  - P: current page allocation status
  - Q: QoS requirements
  - C: cache configuration
- Impact on process scheduling
- Leverage existing frameworks
  - Page coloring -- multiple free lists
  - NUMA OS -- process scheduling & page allocation
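A minimal sketch of page allocation over per-slice free lists; the profitability function P = f(M, L, P, Q, C) is abstracted behind a callback, and all names and structures here are illustrative assumptions rather than the actual OS code:

    #include <stddef.h>

    #define N_SLICE 16                      /* number of L2 slices (assumed) */

    struct page { struct page *next; };     /* physical page descriptor */

    /* One free list per congruence group CG(i). */
    static struct page *cg_free[N_SLICE];

    /* Profitability of placing the next page of 'core' on 'slice';
     * stands in for P = f(M, L, P, Q, C).                            */
    typedef double (*profit_fn)(int slice, int core);

    static struct page *alloc_page(int core, profit_fn P)
    {
        int    best   = -1;
        double best_p = 0.0;

        for (int s = 0; s < N_SLICE; s++) {
            if (cg_free[s] == NULL)         /* CG(s) exhausted: cache pressure */
                continue;
            double p = P(s, core);
            if (best < 0 || p > best_p) {
                best_p = p;
                best   = s;
            }
        }
        if (best < 0)
            return NULL;                    /* no free page anywhere */

        struct page *pg = cg_free[best];    /* pop from CG(best)'s free list */
        cg_free[best] = pg->next;
        return pg;
    }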
Working example
(Figure: candidate slices for a newly touched page ranked by profitability, e.g., P(4) = 0.9, P(6) = 0.8, P(5) = 0.7, ...; for a later allocation, P(1) = 0.95, P(6) = 0.9, P(4) = 0.8, ...; the highest-profitability slice is chosen)
- Static vs. dynamic mapping
  - Static: program information (e.g., profile)
  - Dynamic: proper run-time monitoring needed
Page mapping policies
Simulating private caching
- For a page requested from a program running on core i, map the page to cache slice i
(Chart: L2 cache latency in cycles vs. L2 cache slice size, SPEC2k INT and SPEC2k FP, private caching vs. OS-based)
- Simulating private caching is simple
- Similar or better performance
Simulating "large" private caching
- For a page requested from a program running on core i, map the page to cache slice i; also spread pages
(Chart: relative performance (time^-1), SPEC2k INT and SPEC2k FP, OS-based vs. private, per cache slice size)
Simulating shared caching
- For a page requested from a program running on core i, map the page to all cache slices (round-robin, random, ...)
(Chart: L2 cache latency in cycles vs. L2 cache slice size, SPEC2k INT and SPEC2k FP, shared vs. OS-based)
- Simulating shared caching is simple
- Mostly similar behavior/performance
- Pathological cases (e.g., applu)
Simulating clustered caching
- For a page requested from a program running on a core in group j, map the page to any cache slice within the group (round-robin, random, ...) -- see the sketch below
(Chart: relative performance (time^-1), private vs. OS-based vs. shared; 4 cores used; 512kB cache slice)
- Simulating clustered caching is simple
- Lower miss traffic than private
- Lower on-chip traffic than shared
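The emulation policies on the preceding slides reduce to trivially different page-to-slice functions; a sketch assuming 16 slices and 4-slice clusters (the round-robin cursor is one simple way to "spread pages"; the "large" private variant starts from map_private and additionally spreads pages):

    #define N_SLICE    16     /* total L2 slices (assumed)    */
    #define GROUP_SIZE  4     /* slices per cluster (assumed) */

    static unsigned rr;       /* round-robin cursor for spreading */

    /* Private caching: always the requesting core's local slice. */
    static unsigned map_private(unsigned core)
    {
        return core;
    }

    /* Shared caching: spread pages over all slices (round-robin here;
     * random would do as well).                                        */
    static unsigned map_shared(void)
    {
        return rr++ % N_SLICE;
    }

    /* Clustered caching: spread pages only within the requesting core's group. */
    static unsigned map_clustered(unsigned core)
    {
        unsigned base = (core / GROUP_SIZE) * GROUP_SIZE;
        return base + rr++ % GROUP_SIZE;
    }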
Profile-driven page mapping
- Using profiling, collect:
  - Inter-page conflict information
  - Per-page access count information
- Page mapping cost function (per slice), given the program location, the page to map, and previously mapped pages (sketch below):
  cost = (# conflicts × miss penalty) + weight × (# accesses × latency)
         [miss cost]                             [latency cost]
- weight as a knob: a larger value puts more weight on proximity (than on miss rate)
- Optimize both miss rate and data proximity
- Theoretically important to understand limits; can be practically important, too
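A minimal sketch of the per-slice cost function, with array-based stand-ins for the profiled conflict counts and the hop-dependent latencies (names are illustrative assumptions):

    #define N_SLICE 16   /* number of L2 slices (assumed) */

    /* Pick the slice minimizing
     *   cost(s) = conflicts[s] * miss_penalty + weight * accesses * latency[s]
     * where conflicts[s] counts conflicts with pages already mapped to s and
     * latency[s] is the access latency from the program's core to slice s.  */
    static int best_slice(unsigned accesses, double miss_penalty, double weight,
                          const unsigned conflicts[N_SLICE],
                          const unsigned latency[N_SLICE])
    {
        int    best      = 0;
        double best_cost = conflicts[0] * miss_penalty
                         + weight * (double)accesses * latency[0];

        for (int s = 1; s < N_SLICE; s++) {
            double cost = conflicts[s] * miss_penalty
                        + weight * (double)accesses * latency[s];
            if (cost < best_cost) {
                best_cost = cost;
                best      = s;
            }
        }
        return best;
    }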
Profile-driven page mapping, cont'd
(Chart: breakdown of L2 cache accesses into local hits, remote on-chip hits, and misses as weight varies; 256kB L2 cache slice)
Profile-driven page mapping, cont'd
(Chart: number of pages mapped to each cache slice for GCC, with the program location marked; 256kB L2 cache slice)
Profile-driven page mapping, cont'd
(Chart: performance improvement over shared caching, up to 108%; 256kB L2 cache slice)
- Room for performance improvement
- Matches the better of private and shared caching, or beats both
- Dynamic mapping schemes desired
Isolating faulty caches
- When there are faulty cache slices, avoid mapping pages to them (sketch below)
(Chart: relative L2 cache latency vs. number of cache slice deletions, shared vs. OS-based; 4 cores running a multiprogrammed workload; 512kB cache slice)
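A minimal sketch of the isolation policy, assuming a simple bitmap of deleted slices that the page allocator consults (names are illustrative assumptions):

    #include <stdbool.h>
    #include <stdint.h>

    #define N_SLICE 16                      /* number of L2 slices (assumed) */

    static uint32_t faulty_mask;            /* bit s set => slice s deleted */

    static void mark_slice_faulty(unsigned s) { faulty_mask |= 1u << s; }
    static bool slice_usable(unsigned s)      { return !(faulty_mask & (1u << s)); }

    /* Round-robin over the remaining healthy slices; new pages are never
     * mapped to a deleted slice.  Returns -1 if every slice is deleted.   */
    static int next_usable_slice(void)
    {
        static unsigned rr;
        for (int tries = 0; tries < N_SLICE; tries++) {
            unsigned s = rr++ % N_SLICE;
            if (slice_usable(s))
                return (int)s;
        }
        return -1;
    }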
Conclusion
- "Flexibility" will become important in future multicores
  - Many shared resources
  - Allows us to implement high-level policies
- OS-level page-granularity data-to-slice mapping
  - Low hardware overhead
  - Flexible
- Several management policies studied
  - Mimicking private/shared/clustered caching is straightforward
  - Performance-improving schemes
Future work
- Dynamic mapping schemes
  - Performance
  - Power
- Performance monitoring techniques
  - Hardware-based
  - Software-based
- Data migration and replication support
Managing Distributed, Shared L2 Caches through OS-Level Page Allocation
Sangyeun Cho and Lei Jin
Dept. of Computer Science, University of Pittsburgh
Thank you!
Multicores are here
- AMD Opteron dual-core (2005)
- IBM Power5 (2004)
- Sun Microsystems T1, 8 cores (2005)
- Intel Core2 Duo (2006)
- Quad cores (2007)
- Intel 80 cores? (2010?)
A multicore outlook
- ???
A processor model
- Many cores (e.g., 16); each tile has a processor core, a local L2 cache slice, and a router
- Private L1 I/D-$$: 8kB~32kB
- Local unified L2 $$: 128kB~512kB, 8~18 cycles
- Switched network: 2~4 cycles/switch
- Distributed directory (scatter hotspots)
Other approaches
- Hybrid/flexible schemes
  - "Core clustering" [Speight et al., ISCA 2005]
  - "Flexible CMP cache sharing" [Huh et al., ICS 2004]
  - "Flexible bank mapping" [Liu et al., HPCA 2004]
- Improving shared caching
  - "Victim replication" [Zhang and Asanovic, ISCA 2005]
- Improving private caching
  - "Cooperative caching" [Chang and Sohi, ISCA 2006]
  - "CMP-NuRAPID" [Chishti et al., ISCA 2005]