1 Virtual Private Caches ISCA’07 Kyle J. Nesbit, James Laudon, James E. Smith Presenter: Yan Li.

2 CMP-based System Chip Multiprocessor (CMP)  multiple processor cores implemented on a single chip  Multithreading support Example: Intel Core 2 Duo E6750

3 CMP-based System (2) Resource sharing  Cache capacity/bandwidth, main memory, … Pros: higher resource utilization Cons: inter-thread interference  Unpredictable performance / no QoS! Many applications running on CMP-based systems require Quality of Service (QoS)

4 Quality of Service QoS is required by many applications:  Soft real-time applications e.g., video games  Fine-grain parallel applications Scheduling & synchronization  Server consolidation Hosting services QoS objective in CMP-based systems  provide an upper bound on a thread's execution time regardless of other threads' activity

5 Outline Introduction QoS Framework Virtual Private Cache - VPC Arbiter Virtual Private Cache - Capacity Manager Performance Evaluation Conclusions

6 Overview of VPM Virtual Private Machine: a set of allocated hardware resources  Processors, bandwidth, memory space, … Each thread is allocated a share of hardware resources based on policies set by  Applications & system software Hardware mechanisms enforce the allocated resources

7 [Figure: VPMs mapped onto the system hardware]

8 Objectives of VPM Performance isolation  a thread performs at least as well as it would on a real private machine with the same resources Dynamic distribution of excess resources  Unallocated resources  Allocated but unused resources

9 Virtual Private Cache Microarchitecture-level mechanism Main components  VPC Arbiter: tag & data array bandwidth sharing  VPC Capacity Manager: cache capacity sharing Advantages  Performance isolation  Improved utilization

10 Outline Introduction QoS Framework Virtual Private Cache - VPC Arbiter Virtual Private Cache - Capacity Manager Performance Evaluation Conclusions

11 VPC Arbiter - Implementation (1) Each data & tag array has an arbiter Each arbiter has  a FIFO buffer for each thread  one clock register R.clk: determines request arrival times  registers R.Li & R.Si for each thread i: virtual service time / virtual start time

12 VPC Arbiter - Implementation (2) R.Li: virtual service time of a request from thread i  R.Li = L / φi  L: latency of the shared cache; φi: thread i's fraction of the resource R.Si: virtual start time of the next request from thread i  Time at which the resource becomes available for thread i's next request

13 Fair Queuing Scheduling Request arrival:  virtual start time Si = max(R.clk, R.Si) Virtual finish time:  Fi = Si + R.Li; update R.Si = Fi Arbiter selection:  service the request with the earliest Fi
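The scheduling steps above can be sketched in software. This is a minimal, illustrative model, not the paper's hardware: the class and variable names (`Arbiter`, `shares`, the assumed latency `L = 4`) are my own, chosen to mirror the R.clk / R.Li / R.Si registers from the slides.

```python
L = 4  # assumed latency of the shared cache resource (cycles)

class Arbiter:
    """Sketch of the VPC fair-queuing arbiter for one tag/data array."""

    def __init__(self, shares):
        # shares[i] = phi_i, thread i's allocated fraction of bandwidth
        self.shares = shares
        self.queues = {i: [] for i in shares}  # per-thread FIFO of virtual finish times
        self.start = {i: 0 for i in shares}    # R.Si: next virtual start time
        self.clk = 0                           # R.clk: current cycle

    def arrive(self, thread):
        # Virtual start time: max of arrival time and the thread's R.Si
        s = max(self.clk, self.start[thread])
        # Virtual finish time: start + virtual service time L / phi_i
        f = s + L / self.shares[thread]
        self.start[thread] = f                 # reserve the virtual time slot
        self.queues[thread].append(f)

    def select(self):
        # Service the head-of-queue request with the earliest virtual finish time
        ready = [(q[0], t) for t, q in self.queues.items() if q]
        if not ready:
            return None
        _, t = min(ready)
        self.queues[t].pop(0)
        return t
```

With two threads at φ = 0.5 each, the arbiter alternates between them; with unequal shares (say 0.75 / 0.25), the higher-share thread is serviced proportionally more often, which is the bandwidth guarantee the virtual times encode.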

14 Arbiter Fairness Policy Excess bandwidth is distributed to the threads that have received the least excess bandwidth in the past

15 Outline Introduction QoS Framework Virtual Private Cache - VPC Arbiter Virtual Private Cache - Capacity Manager Performance Evaluation Conclusions

16 VPC Capacity Manager - Implementation Set-associative replacement policy Each thread receives  the same number of sets as the shared cache  at least its allocated fraction φi of the ways Replacement policy  evict the LRU line owned by a thread i that occupies more than its allocated φi ways  otherwise, evict the LRU line owned by the thread requesting the replacement
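The victim-selection rule above can be sketched for a single cache set. This is an illustrative model under my own naming (`pick_victim`, `alloc`), not the paper's exact hardware; it assumes each line tracks its owner thread and that `alloc[t]` is thread t's guaranteed number of ways in the set.

```python
def pick_victim(set_lines, alloc, requester):
    """set_lines: list of owner thread ids for one set, ordered LRU -> MRU.
    alloc: guaranteed ways per thread. Returns the way index to evict."""
    # Count how many ways each thread currently owns in this set
    owned = {}
    for owner in set_lines:
        owned[owner] = owned.get(owner, 0) + 1
    # First choice: the LRU line of any thread holding more ways than its
    # allocation -- reclaiming only excess capacity preserves every
    # thread's guaranteed share.
    for way, owner in enumerate(set_lines):
        if owned[owner] > alloc.get(owner, 0):
            return way
    # Otherwise: the requester's own LRU line, so it cannot displace
    # another thread's guaranteed lines.
    for way, owner in enumerate(set_lines):
        if owner == requester:
            return way
    return None  # requester owns no line and no thread is over-allocated
```

The two-step order is what provides performance isolation: a thread that stays within its allocation can only ever lose lines to itself.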

17 Outline Introduction QoS Framework Virtual Private Cache - VPC Arbiter Virtual Private Cache - Capacity Manager Performance Evaluation Conclusions

18 Experiment Setup Two microbenchmarks to stress performance isolation feature  Loads: load operations with continuous read hits  Stores: store operations with continuous write hits SPEC CPU2000 benchmark suite QoS performance metrics  IPC  Data array utilization

19 Other Arbiters Read over Write  Prioritize reads over writes Read over Write First Come First Served  Prioritize reads over writes  Prioritize oldest requests Round Robin  Interleave requests uniformly and consistently
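The baseline policies listed above can be sketched as simple selectors for comparison with the fair-queuing arbiter. Everything here is illustrative: the request encoding (`(thread, is_read, arrival)` tuples) and tie-breaking rules are assumptions, since the slide does not pin them down.

```python
def read_over_write(pending):
    # Prioritize reads over writes; otherwise keep queue order
    reads = [r for r in pending if r[1]]
    return reads[0] if reads else pending[0]

def row_fcfs(pending):
    # Reads before writes, oldest first within each class
    return min(pending, key=lambda r: (not r[1], r[2]))

class RoundRobin:
    """Interleave threads uniformly, oldest request per thread."""
    def __init__(self, n_threads):
        self.last = -1
        self.n = n_threads

    def select(self, pending):
        # Scan threads circularly, starting after the last-served thread
        for off in range(1, self.n + 1):
            t = (self.last + off) % self.n
            match = [r for r in pending if r[0] == t]
            if match:
                self.last = t
                return min(match, key=lambda r: r[2])
        return None
```

None of these policies account for per-thread bandwidth shares, which is why they cannot provide the isolation guarantees the VPC arbiter targets.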

20 [Figure: Microbenchmark results]

21 [Figure: SPEC CPU2000 results]

22 Conclusions VPC: a hardware mechanism in the VPM QoS framework  VPC arbiter & capacity manager VPC can achieve global QoS objectives Issues:  meeting local QoS objectives assumes performance monotonicity

23 Thank You! & Questions?