Micro-sliced Virtual Processors to Hide the Effect of Discontinuous CPU Availability for Consolidated Systems
Jeongseob Ahn, Chang Hyun Park, and Jaehyuk Huh
Computer Science Department, KAIST
Virtual Time Discontinuity
Virtual CPUs are not always running: they are time-shared on physical CPUs, so virtual time advances discontinuously with respect to physical time.
[Figure: vCPU 0 and vCPU 1 time-sharing pCPU 0; each runs for a time slice, separated by context switches]
Interrupt with Virtualization
When an interrupt arrives for vCPU 0 while vCPU 1 occupies the physical CPU, interrupt processing is delayed until vCPU 0 is scheduled again.
[Figure: interrupt occurs on vCPU 0; its handling is delayed by vCPU 1's time slice on pCPU 0]
Spinlock with Virtualization
If vCPU 0 is preempted while holding a spinlock, vCPU 1 spins to acquire the lock but makes no progress; vCPU 0 can release the lock only the next time it runs, so lock acquisition is delayed by the intervening time slice.
[Figure: vCPU 0 holding a lock is preempted on pCPU 0; vCPU 1 spins until vCPU 0 runs again and releases the lock]
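The lock-holder preemption delay above can be captured with a toy back-of-the-envelope model (the function and its round-robin assumption are ours, not from the talk):

```python
def spin_delay_ms(time_slice_ms: float, waiters_ahead: int = 0) -> float:
    """Lower bound on how long a waiter spins when the lock holder is
    preempted: the holder cannot release the lock until every vCPU
    scheduled ahead of it (including the spinning waiter) uses its slice."""
    return (1 + waiters_ahead) * time_slice_ms

print(spin_delay_ms(30))  # Xen's default 30ms slice -> 30ms of wasted spinning
print(spin_delay_ms(1))   # a 1ms slice caps the waste at 1ms
```

The model makes the deck's point directly: the wasted spin time scales with the time slice, so shrinking the slice shrinks the worst-case lock delay.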
Prior Efforts in the Hypervisor
Prior work has largely modified the hypervisor case by case, adding a separate optimization for each workload class (spinlock optimizations for parallel and HPC workloads, I/O interrupt optimizations for web servers):
– Spin detection buffer [Wells et al., PACT '06]
– Relaxed co-scheduling [VMware ESX]
– Balanced scheduling [Sukwong and Kim, EuroSys '11]
– Preemption delay [Kim et al., ASPLOS '13]
– Preemptable spinlock [Ouyang and Lange, VEE '13]
– Boosting I/O requests [Ongaro et al., VEE '08]
– Task-aware VM scheduling [Kim et al., VEE '09]
– vSlicer [Xu et al., HPDC '12]
– vTurbo [Xu et al., USENIX ATC '13]
Takeaway: keep your hypervisor simple.
Fundamentals of CPU Scheduling
Most CPU schedulers employ time-sharing: each vCPU runs for a time slice, then waits while the other vCPUs sharing the pCPU take their turns. The turnaround time is the interval between the end of one run of a vCPU and the start of its next.
[Figure: vCPU 0, 1, and 2 time-sharing pCPU 0; the turnaround time spans the other vCPUs' slices]
Toward Virtual Time Continuity
To minimize the turnaround time, we propose shorter but more frequent runs: a shortened time slice reduces how long each vCPU waits between runs.
[Figure: the same three vCPUs with a shortened time slice and reduced turnaround time]
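The benefit can be made concrete with a little arithmetic (a sketch under a round-robin assumption; the helper name is ours):

```python
def turnaround_ms(vcpus_sharing: int, time_slice_ms: float) -> float:
    """Worst-case wait between two runs of a vCPU under round-robin:
    every other vCPU on the same pCPU runs one full slice in between."""
    return (vcpus_sharing - 1) * time_slice_ms

# Three vCPUs time-sharing one pCPU, as in the figure:
print(turnaround_ms(3, 30))  # 30ms slice -> 60ms turnaround
print(turnaround_ms(3, 1))   #  1ms slice ->  2ms turnaround
```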
Methodology for the Real System
– 4 physical CPUs (Intel Xeon)
– Xen hypervisor
– 2 VMs, each with 4 vCPUs and 4GB memory (2-to-1 consolidation ratio)
– Ubuntu HVM guests running the Linux kernel
– 1Gb network
Benchmark workloads:
– PARSEC (spinlock and IPI)* – iPerf (I/O interrupt) – SPEC CPU 2006
* [Kim et al., ASPLOS '13]
PARSEC Multi-threaded Applications
Each PARSEC application co-runs with swaptions; the Xen default is a 30ms time slice.
[Figure: runtime improvement with shorter time slices; other PARSEC results appear in our paper]
Mixed VM Scenario*
We evaluate a consolidated scenario of I/O-intensive and multi-threaded workloads:
– VM1: ferret (multi-threaded), iPerf (I/O)
– VM2: vips (multi-threaded)
– VM3: dedup (multi-threaded)
– VM4: 3 × swaptions (single-threaded), streamcluster (single-threaded)
With a 1ms time slice, PARSEC throughput improves: ferret 1.7x, vips 1.9x, dedup 2.3x.
* Scenario from [vTurbo, USENIX ATC '13], [vSlicer, HPDC '12], [Task-aware, VEE '09]
SPEC Single-threaded Applications
Each SPEC CPU 2006 application co-runs with libquantum; page coloring is used to isolate the cache. The Xen default is a 30ms time slice.
[Figure: performance with shorter time slices]
Takeaway: shortening the time slice provides a generalized solution, but incurs the overhead of frequent context switching.
Overheads of Short Time Slices
Frequent context switching pollutes architectural structures such as caches: each time a vCPU resumes, it finds its cached state evicted by the other vCPUs.
[Figure: vCPU 0 and vCPU 1 alternating on pCPU 0, evicting each other's cache contents]
Methodology for the Simulated System
We use the MARSSx86 full-system simulator with DRAMSim2:
– Modified the Linux scheduler to simulate 30ms and 1ms time slices
– Executed mixes of two applications on a single CPU
– Used SPEC CPU 2006 applications
System configuration:
– Processor: out-of-order x86 (Xeon)
– L1 I/D cache: 32KB, 4-way, 64B lines
– L2 cache: 256KB, 8-way, 64B lines
– L3 cache: 2MB, 16-way, 64B lines
– Memory: DDR3-1600, 800MHz
Workload classes: CPU (fits in the L2 cache), CFR (cache friendly), THR (cache thrashing)
Performance Effects of Short Time Slices
[Figure: 30ms vs. 1ms performance for Type-4 (CFR–CFR) and Type-5 (CFR–THR) mixes; CFR: cache friendly, THR: cache thrashing]
Mitigating Cache Pollution
Context prefetcher [Daly and Cain, HPCA '12] [Zebchuk et al., HPCA '13]:
– The cache blocks of a descheduled VM that are evicted by other VMs are logged
– When the VM is rescheduled, the logged blocks are prefetched back
– May cause memory bandwidth saturation and congestion
Context preservation:
– Retains the data of previous contexts with a dynamic insertion policy* that inserts incoming blocks at either the MRU or the LRU position
– A simple time-sampling mechanism picks the winner of the two insertion policies
[Figure: performance vs. cache size under MRU- and LRU-insertion at 1ms and 8ms time slices]
* [Qureshi et al., ISCA '07]
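The MRU-vs-LRU insertion trade-off can be illustrated with a toy fully-associative cache (the function, trace, and parameters are illustrative only, not the simulated hardware): inserting a missing block at the LRU position lets a thrashing scan evict itself instead of flushing a cache-friendly working set.

```python
from collections import deque

def run_cache(trace, ways, insert_at):
    """Toy fully-associative LRU cache; 'insert_at' chooses where a missing
    block enters the recency stack ('mru' = protected, 'lru' = next to go)."""
    stack, misses = deque(), 0               # stack[-1] is the MRU position
    for addr in trace:
        if addr in stack:
            stack.remove(addr)
            stack.append(addr)               # promote to MRU on a hit
        else:
            misses += 1
            if len(stack) == ways:
                stack.popleft()              # evict the LRU block
            if insert_at == "mru":
                stack.append(addr)
            else:
                stack.appendleft(addr)       # LRU-insert: evicted first
    return misses

# A cache-friendly working set {0..3} interrupted by a one-shot scan:
friendly = [0, 1, 2, 3] * 8
scan = list(range(100, 132))
trace = friendly[:16] + scan + friendly[16:]
print(run_cache(trace, ways=4, insert_at="mru"))  # 40 misses
print(run_cache(trace, ways=4, insert_at="lru"))  # 37 misses
```

Under MRU-insertion the scan flushes the working set and the friendly phase re-misses afterwards; under LRU-insertion the scan blocks mostly evict each other, which is the behavior the dynamic insertion policy selects for thrashing co-runners.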
Evaluated Schemes
– Baseline: 30ms time slice (Xen default)
– 1ms: 1ms time slice
– 1ms w/ ctx-prefetch: state-of-the-art context prefetch*
– 1ms w/ DIP: dynamic insertion policy
– 1ms w/ DIP + ctx-prefetch: DIP with context prefetch
– 1ms w/ SIP-best: optimal static insertion policy
* [Zebchuk et al., HPCA '13]
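The ctx-prefetch schemes above log a descheduled VM's evicted lines and replay the log as prefetches when the VM returns; a minimal sketch (class and method names are our own, not from the cited work):

```python
class ContextPrefetcher:
    """Toy model of a context prefetcher in the spirit of Daly and Cain
    [HPCA '12]: lines of a descheduled VM evicted by other VMs are logged,
    and the log is replayed when that VM is rescheduled."""

    def __init__(self):
        self.evicted_log = {}                 # vm_id -> evicted line addresses

    def on_evict(self, vm_id, addr):
        self.evicted_log.setdefault(vm_id, []).append(addr)

    def on_reschedule(self, vm_id, cache):
        # Replaying a large log back-to-back is what can saturate
        # memory bandwidth, as noted on the previous slide.
        for addr in self.evicted_log.pop(vm_id, []):
            cache.prefetch(addr)

class _StubCache:
    """Minimal stand-in for a cache that accepts prefetch requests."""
    def __init__(self):
        self.prefetched = []
    def prefetch(self, addr):
        self.prefetched.append(addr)

# While another VM runs, two of VM 0's lines are evicted; on reschedule
# the prefetcher pulls them back before VM 0 re-misses on them.
pf, cache = ContextPrefetcher(), _StubCache()
pf.on_evict(0, 0x1000)
pf.on_evict(0, 0x2000)
pf.on_reschedule(0, cache)
print(cache.prefetched)
```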
Performance with Cache Preservation
[Figure: Type-5 (CFR–THR) performance across the evaluated schemes; CFR: cache friendly, THR: cache thrashing]
Performance with Cache Preservation
[Figure: Type-4 (CFR–CFR) performance across the evaluated schemes; baseline: 30ms time slice; CFR: cache friendly]
Conclusion
– Investigated unexpected artifacts of CPU sharing in virtualized systems: spinlock and interrupt handling in the kernel
– Shortening the time slice improves runtime for PARSEC multi-threaded applications, and throughput and latency for I/O applications
– A context prefetcher with a dynamic insertion policy minimizes the negative effects of a short time slice and improves performance for cache-sensitive SPEC applications