Gaining Insights into Multi-Core Cache Partitioning: Bridging the Gap between Simulation and Real Systems

Jiang Lin (1), Qingda Lu (2), Xiaoning Ding (2), Zhao Zhang (1), Xiaodong Zhang (2), and P. Sadayappan (2)
(1) Department of ECE, Iowa State University
(2) Department of CSE, The Ohio State University

Slide 2: Shared Caches Can Be a Critical Bottleneck in Multi-Core Processors
L2/L3 caches are shared by multiple cores:
- Intel Xeon 51xx (2 cores per L2)
- AMD Barcelona (4 cores per L3)
- Sun T2, ... (8 cores per L2)
Effective cache partitioning is critical to address the bottleneck caused by conflicting accesses in shared caches. Several hardware cache partitioning methods have been proposed, with different optimization objectives:
- Performance: [HPCA’02], [HPCA’04], [Micro’06]
- Fairness: [PACT’04], [ICS’07], [SIGMETRICS’07]
- QoS: [ICS’04], [ISCA’07]
[Diagram: multiple cores sharing one L2/L3 cache]

Slide 3: Limitations of Simulation-Based Studies
- Excessive simulation time: whole programs cannot be evaluated; it would take several weeks or months to complete a single SPEC CPU2006 benchmark. As the number of cores continues to increase, what simulation can cover becomes even more limited.
- Absence of long-term OS activities: interactions between the processor and the OS affect performance significantly.
- Proneness to simulation inaccuracy: bugs in the simulator, and the impossibility of modeling many dynamics and details of a real system.

Slide 4: Our Approach to Address These Issues
- Design and implement OS-based cache partitioning: embed the cache partitioning mechanism in the OS by enhancing the page coloring technique, supporting both static and dynamic cache partitioning.
- Evaluate cache partitioning policies on commodity processors: an execution- and measurement-based study that runs applications to completion and measures performance with hardware counters.

Slide 5: Four Questions to Answer
1. Can we confirm the conclusions made by simulation-based studies?
2. Can we provide new insights and findings that simulation is not able to?
3. Can we make a case for our OS-based approach as an effective option to evaluate multi-core cache partitioning designs?
4. What are the advantages and disadvantages of OS-based cache partitioning?

Slide 6: Outline
- Introduction
- Design and implementation of OS-based cache partitioning mechanisms
- Evaluation environment and workload construction
- Cache partitioning policies and their results
- Conclusion

Slide 7: OS-Based Cache Partitioning Mechanisms
- Static cache partitioning: predetermines the amount of cache blocks allocated to each program at the beginning of its execution. Implemented as a page coloring enhancement: the shared cache is divided into multiple regions, and the regions are partitioned through OS page address mapping.
- Dynamic cache partitioning: adjusts cache quotas among processes dynamically, using page re-coloring: a process's cache usage is changed on the fly through OS page address re-mapping.

Slide 8: Page Coloring
[Diagram: a virtual address (virtual page number + page offset) is translated to a physical address (physical page number + page offset); the cache decomposes the physical address into tag, set index, and block offset. The "page color bits" are the set-index bits that overlap the physical page number, and they are under OS control.]
- A physically indexed cache is divided into multiple regions (colors).
- All cache lines in a physical page are cached in one of those regions (colors).
- The OS can control the page color of a virtual page through address mapping, by selecting a physical page with a specific value in its page color bits.
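To make the arithmetic concrete, here is a small sketch (ours, not the authors' kernel code) that derives a page's color from its physical address. The cache parameters are assumptions matching the Xeon 5160 used later in the talk; with these values there are 64 hardware colors, so the 16 colors used in the evaluation presumably group adjacent hardware colors.

```c
#include <stdio.h>
#include <stdint.h>

#define CACHE_SIZE    (4u << 20)  /* 4 MB total capacity (assumption)  */
#define ASSOCIATIVITY 16u         /* ways per set (assumption)         */
#define LINE_SIZE     64u         /* bytes per cache line (assumption) */
#define PAGE_SIZE     4096u       /* OS page size                      */

/* The page color is the part of the cache set index that lies above
   the page offset; pages with equal color map to the same cache region. */
static unsigned page_color(uint64_t phys_addr)
{
    unsigned num_sets      = CACHE_SIZE / (ASSOCIATIVITY * LINE_SIZE);
    unsigned sets_per_page = PAGE_SIZE / LINE_SIZE;
    unsigned num_colors    = num_sets / sets_per_page;   /* 64 here */
    return (unsigned)((phys_addr / PAGE_SIZE) % num_colors);
}

int main(void)
{
    printf("color(0x00000) = %u\n", page_color(0x00000)); /* color 0 */
    printf("color(0x01000) = %u\n", page_color(0x01000)); /* color 1 */
    printf("color(0x40000) = %u\n", page_color(0x40000)); /* color 0 again:
        0x40000 = 256 KB = cache size / associativity, the color period */
    return 0;
}
```

Two physical pages whose frame numbers differ by a multiple of the color count land in the same cache region; the page bins on the next slide are built from exactly this relation.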

Slide 9: Enhancement for Static Cache Partitioning
[Diagram: physical pages are grouped into page bins according to their page color; OS address mapping directs the pages of Process 1 and Process 2 to disjoint sets of bins, and hence to disjoint regions of the physically indexed cache.]
- The shared cache is partitioned between two processes through address mapping.
- Cost: main memory space needs to be partitioned too (co-partitioning).
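The sketch below is a toy user-level model of those bins; the real mechanism lives in the kernel's physical page allocator, and the pool sizes and the 6:10 split are illustrative. It also makes the co-partitioning cost visible: splitting the 16 colors 6:10 splits the frame pool 6:10 as well.

```c
#include <stdio.h>

#define NUM_COLORS     16   /* page-color bins (an assumption)  */
#define FRAMES_PER_BIN 4    /* toy pool size for the demo       */

/* One free list ("bin") of physical frame numbers per color. */
static unsigned long bins[NUM_COLORS][FRAMES_PER_BIN];
static int bin_count[NUM_COLORS];

/* Each process owns a contiguous color range [lo, hi]: its cache share. */
struct partition { int lo, hi; };

/* Allocate a frame for process `p`, cycling through its colors so its
   pages spread evenly over its share of the cache (no hot colors). */
static long alloc_frame(const struct partition *p, unsigned *rr)
{
    int span = p->hi - p->lo + 1;
    for (int tried = 0; tried < span; tried++) {
        int color = p->lo + (int)((*rr)++ % (unsigned)span);
        if (bin_count[color] > 0)
            return (long)bins[color][--bin_count[color]];
    }
    return -1;   /* this partition's share of memory is exhausted */
}

int main(void)
{
    /* Seed the bins: frame f has color f % NUM_COLORS. */
    for (unsigned long f = 0; f < NUM_COLORS * FRAMES_PER_BIN; f++)
        bins[f % NUM_COLORS][bin_count[f % NUM_COLORS]++] = f;

    struct partition p1 = {0, 5}, p2 = {6, 15};   /* a 6:10 color split */
    unsigned rr1 = 0, rr2 = 0;
    for (int i = 0; i < 4; i++)
        printf("p1 gets frame %ld, p2 gets frame %ld\n",
               alloc_frame(&p1, &rr1), alloc_frame(&p2, &rr2));
    return 0;
}
```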

Slide 10: Dynamic Cache Partitioning
- Why? Programs have dynamic behavior, and most proposed schemes are dynamic.
- How? Page re-coloring.
- How to handle the overhead? Measure it with performance counters and subtract it from the results, to emulate hardware schemes that incur no such overhead.

Slide 11: Dynamic Cache Partitioning through Page Re-Coloring
[Diagram: a page-links table with one linked list of pages per allocated color, 0 ... N-1.]
- The pages of a process are organized into linked lists by their colors. Memory allocation guarantees that pages are evenly distributed across all the lists (colors) to avoid hot spots.
- Page re-coloring: allocate a page in the new color, copy the memory contents, then free the old page.
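A minimal sketch of the re-coloring step itself, using heap memory to stand in for physical frames; in the real system the new frame also has to be wired into the page table and the stale TLB entry flushed, which, together with the copy, is the source of the migration overhead discussed next.

```c
#include <string.h>
#include <stdlib.h>

#define PAGE_SIZE 4096

/* A process's pages live on per-color linked lists (the "page links
   table" on the slide). */
struct page {
    void *data;          /* stand-in for the physical frame */
    int color;
    struct page *next;   /* next page of the same color     */
};

/* Hypothetical allocator that returns a frame of the requested color;
   here it just grabs heap memory and tags it with the color. */
static struct page *alloc_page_with_color(int color)
{
    struct page *p = malloc(sizeof *p);
    p->data  = malloc(PAGE_SIZE);
    p->color = color;
    p->next  = NULL;
    return p;
}

/* Re-color one page: allocate in the new color, copy, free the old. */
static struct page *recolor_page(struct page *old, int new_color)
{
    struct page *fresh = alloc_page_with_color(new_color);
    memcpy(fresh->data, old->data, PAGE_SIZE);   /* copy the contents  */
    free(old->data);                             /* free the old frame */
    free(old);
    return fresh;
}

int main(void)
{
    struct page *p = alloc_page_with_color(3);
    memset(p->data, 0xAB, PAGE_SIZE);
    p = recolor_page(p, 7);          /* move the page into color 7 */
    free(p->data);
    free(p);
    return 0;
}
```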

Slide 12: Control the Page Migration Overhead
- Control the frequency of page migration: frequent enough to capture application phase changes, but not so frequent that page migration overhead becomes large.
- Lazy migration: avoid unnecessary page migration. Observation: not all pages are accessed between two of their migrations. Optimization: do not migrate a page until it is actually accessed.

Slide 13: Lazy Page Migration
[Diagram: the per-color page lists again; pages in colors no longer allocated to the process are only migrated on their next access.]
- Unnecessary page migration is avoided for pages that are never touched again.
- After this optimization, page migration overhead is 2% on average, and up to 7%.
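A toy simulation of the lazy scheme (ours; in the real implementation the marked pages are unmapped so the deferred copy happens in the page-fault handler on first access): repartitioning only marks pages, and the copy runs when a page is next touched.

```c
#include <stdbool.h>
#include <stdio.h>

#define NPAGES 8

/* Repartitioning *marks* pages as stale; the copy is deferred to the
   first access after the mark, so pages never touched again are never
   migrated. */
struct vpage { int color; int target_color; bool stale; };
static struct vpage pages[NPAGES];
static int migrations = 0;

static void repartition(int new_color)       /* cheap: no copying */
{
    for (int i = 0; i < NPAGES; i++)
        if (pages[i].color != new_color) {
            pages[i].target_color = new_color;
            pages[i].stale = true;   /* real code would also unmap the
                                        page so the next access faults */
        }
}

static void touch(int i)                     /* models the fault path */
{
    if (pages[i].stale) {
        pages[i].color = pages[i].target_color;   /* migrate now */
        pages[i].stale = false;
        migrations++;
    }
}

int main(void)
{
    repartition(1);
    touch(0); touch(0); touch(3);   /* only pages 0 and 3 are accessed */
    printf("migrated %d of %d pages\n", migrations, NPAGES); /* 2 of 8 */
    return 0;
}
```

Running it migrates only 2 of the 8 pages, mirroring how pages that are never accessed between repartitionings are never copied.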

Slide 14: Outline
- Introduction
- Design and implementation of OS-based cache partitioning mechanisms
- Evaluation environment and workload construction
- Cache partitioning policies and their results
- Conclusion

Slide 15: Experimental Environment
- Dell PowerEdge 1950: two-way SMP with Intel dual-core Xeon 5160 processors; shared 4MB, 16-way L2 cache; 8GB of Fully Buffered DIMM memory.
- Red Hat Enterprise Linux; the partitioning mechanisms are embedded in its kernel.
- Performance counter tool from HP (Pfmon).
- The L2 cache is divided into 16 colors.

Slide 16: Benchmark Classification
29 benchmarks from SPEC CPU2006, classified by two questions.
Is the benchmark sensitive to L2 cache capacity?
- Red group: IPC(1MB L2) / IPC(4MB L2) < 80%. Giving red benchmarks more cache yields a big performance gain.
- Yellow group: 80% < IPC(1MB L2) / IPC(4MB L2) < 95%. Giving yellow benchmarks more cache yields a moderate gain.
Otherwise, does it access the L2 cache extensively?
- Green group: >= 14 accesses per 1K cycles. Green benchmarks should be given small cache shares.
- Black group: < 14 accesses per 1K cycles. These are cache-insensitive.
A helper that applies these thresholds is sketched below.
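The thresholds come straight from the slide; the function name, parameter names, and sample numbers are ours, for illustration only.

```c
#include <stdio.h>

/* Classify a benchmark by the two thresholds above.
   ipc_1m / ipc_4m: measured IPC with a 1 MB vs. 4 MB L2 cache.
   l2_apki: L2 accesses per 1K cycles. */
static const char *classify(double ipc_1m, double ipc_4m, double l2_apki)
{
    double ratio = ipc_1m / ipc_4m;
    if (ratio < 0.80)  return "red";     /* big gain from more cache      */
    if (ratio < 0.95)  return "yellow";  /* moderate gain from more cache */
    if (l2_apki >= 14) return "green";   /* cache-heavy but insensitive   */
    return "black";                      /* cache-insensitive             */
}

int main(void)
{
    /* Hypothetical numbers, for illustration only. */
    printf("%s\n", classify(0.60, 1.00, 30.0));  /* red    */
    printf("%s\n", classify(0.90, 1.00,  5.0));  /* yellow */
    printf("%s\n", classify(0.99, 1.00, 20.0));  /* green  */
    printf("%s\n", classify(0.99, 1.00,  2.0));  /* black  */
    return 0;
}
```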

Slide 17: Workload Construction
27 representative two-benchmark workloads, one benchmark per core:
- RR (3 pairs), RY (6 pairs), RG (6 pairs)
- YY (3 pairs), YG (6 pairs), GG (3 pairs)

Slide 18: Outline
- Introduction
- OS-based cache partitioning mechanism
- Evaluation environment and workload construction
- Cache partitioning policies and their results: performance and fairness
- Conclusion

Slide 19: Performance – Metrics
Following [PACT’06], we divide metrics into evaluation metrics and policy metrics.
- Evaluation metrics: the optimization objectives; not always available at run time.
- Policy metrics: used to drive dynamic partitioning policies; available at run time. Examples: sum of IPCs, combined cache miss rate, combined cache misses.

Slide 20: Static Partitioning
- The cache has 16 colors in total; each program gets at least two colors.
- Each program gets 1GB of memory to avoid swapping (a consequence of co-partitioning).
- We try all possible partitionings for all workloads: (2:14), (3:13), (4:12), ..., (8:8), ..., (13:3), (14:2), recording the evaluation metrics for each.
- The performance of every partitioning is compared with the performance of the freely shared cache.

Slide 21: Performance – Optimal Static Partitioning
- Confirms that cache partitioning has a significant performance impact.
- Different evaluation metrics yield different performance gains.
- RG-type workloads show the largest gains (up to 47%); the other workload types also gain (2% to 10%).

Slide 22: A New Finding
Workload RG1 pairs 401.bzip2 (red) with 410.bwaves (green). Intuitively, giving more cache space to 401.bzip2 should largely increase its performance while only slightly decreasing the performance of 410.bwaves. However, we observe that giving 401.bzip2 more cache space improves the performance of both programs.

Slide 23: Insight into Our Finding
[Figure-only slide; no text survived transcription.]

Slide 24: Insight into Our Finding (cont.)
- We observe the same behavior in workloads RG4, RG5, and YG5.
- This effect was not observed in simulation-based studies, which did not model the main-memory subsystem in detail and assumed a fixed memory access latency.
- This demonstrates the advantage of our execution- and measurement-based study.

Slide 25: Performance - Dynamic Partition Policy
A simple greedy policy that emulates the policy of [HPCA’02] (a toy sketch follows the list):
1. Initialize the partition to (8:8).
2. Run the current partition (P0:P1) for one epoch; if the workload has finished, exit.
3. Try one epoch for each of the two neighboring partitions, (P0-1 : P1+1) and (P0+1 : P1-1).
4. Choose as the next partitioning the one with the best measured policy metric, then go back to step 2.
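In the sketch below, run_epoch() is a stand-in for actually executing an epoch and reading the policy metric from hardware counters; the metric curve is made up, with its optimum placed at (10:6).

```c
#include <stdio.h>

#define NCOLORS   16
#define MIN_SHARE 2   /* each program keeps at least two colors */

/* Stand-in for running one epoch at partition (p0 : NCOLORS - p0) and
   returning the policy metric (e.g., combined miss rate; lower is
   better). Hypothetical curve with an optimum at p0 = 10. */
static double run_epoch(int p0)
{
    double d = p0 - 10.0;
    return 1.0 + d * d * 0.01;
}

int main(void)
{
    int p0 = NCOLORS / 2;                        /* start at (8:8) */
    for (int epoch = 0; epoch < 20; epoch++) {
        double best_m = run_epoch(p0);
        int best = p0;
        /* Try one epoch at each neighboring partition. */
        for (int cand = p0 - 1; cand <= p0 + 1; cand += 2) {
            if (cand < MIN_SHARE || cand > NCOLORS - MIN_SHARE)
                continue;
            double m = run_epoch(cand);
            if (m < best_m) { best_m = m; best = cand; }
        }
        if (best == p0) break;                   /* local optimum: stay */
        p0 = best;
    }
    printf("converged to (%d:%d)\n", p0, NCOLORS - p0);  /* (10:6) */
    return 0;
}
```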

Slide 26: Performance – Static & Dynamic
Using combined miss rate as the policy metric:
- For RG-type workloads and some RY-type workloads, static partitioning outperforms dynamic partitioning.
- For RR-type workloads and the remaining RY-type workloads, dynamic partitioning outperforms static partitioning.

Slide 27: Fairness – Metrics and Policy [PACT’04]
- Evaluation metric: FM0, the difference in slowdowns between the co-running programs; smaller is better.
- Policy metrics: FM1 through FM5.
- Policy: repartitioning and rollback.
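For reference, a reconstruction of FM0 in the spirit of [PACT’04] (our notation; the slide itself only names the metric): each program's slowdown is its co-scheduled execution time relative to running with the cache to itself, and FM0 sums the pairwise differences in slowdown, so 0 is perfectly fair.

```latex
\[
  X_i = \frac{T_i^{\text{shared}}}{T_i^{\text{alone}}},
  \qquad
  \mathrm{FM0} = \sum_{i} \sum_{j>i} \left| X_i - X_j \right|
\]
```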

Slide 28: Fairness - Results
- Dynamic partitioning can achieve better fairness than static partitioning if FM0 is used as both the evaluation metric and the policy metric.
- None of the practical policy metrics (FM1 to FM5) is good enough to drive the partitioning policy to fairness comparable with static partitioning: none has a consistently strong correlation with FM0.
- A strong correlation was reported in the simulation-based study [PACT’04]. Differences between our setup and theirs: SPEC CPU2006 (ref input) vs. SPEC CPU2000 (test input); trillions of instructions run to completion vs. less than one billion instructions; a 4MB L2 cache vs. a 512KB L2 cache.

Slide 29: Conclusion
- Confirmed some conclusions made by simulation-based studies.
- Provided new insights and findings: giving cache space from one program to another can increase the performance of both; there is poor correlation between evaluation and policy metrics for fairness.
- Made a case for our OS-based approach as an effective option for evaluating multi-core cache partitioning designs.
- Advantages of OS-based cache partitioning: it works on commodity processors, enabling an execution- and measurement-based study.
- Disadvantages: co-partitioning (which may underutilize memory) and page migration overhead.

Slide 30: Ongoing Work
- Reduce migration overhead on commodity processors.
- Cache partitioning at the compiler level: partition the cache at the object level.
- Hybrid cache partitioning methods that remove the cost of co-partitioning and avoid page migration overhead.

Gaining Insights into Multi-Core Cache Partitioning: Bridging the Gap between Simulation and Real Systems
Jiang Lin (1), Qingda Lu (2), Xiaoning Ding (2), Zhao Zhang (1), Xiaodong Zhang (2), and P. Sadayappan (2)
(1) Iowa State University; (2) The Ohio State University
Thanks!

Slide 32: Backup Slides

Slide 33: Fairness - Correlation between Evaluation Metrics and Policy Metrics (as reported by [PACT’04])
A strong correlation between the policy metrics and FM0 was reported in the simulation study [PACT’04].

Slide 34: Fairness - Correlation between Evaluation Metrics and Policy Metrics (our result)
None of the policy metrics has a consistently strong correlation with FM0. Differences from the [PACT’04] setup: SPEC CPU2006 (ref input) vs. SPEC CPU2000 (test input); trillions of instructions run to completion vs. less than one billion instructions; a 4MB L2 cache vs. a 512KB L2 cache.