Reinforcement Learning Based Virtual Cluster Management

Presentation transcript:

Reinforcement Learning Based Virtual Cluster Management Anshul Gangwar Dept. of Computer Science and Automation, Indian Institute of Science

Virtualization: The ability to run multiple operating systems on a single physical system and share the underlying hardware resources*.
Cloud Computing: “The provisioning of services in a timely (near on instant), on-demand manner, to allow the scaling up and down of resources”**.
* VMware white paper, Virtualization Overview
** Alan Williamson, quoted in Cloud BootCamp March 2009

The Traditional Server Concept
[Figure: one application and one OS per server (up to SERVER m and SERVER n): App 1 (30 %) on OS 1, App 2 (40 %) on OS 2, App 3 (25 %) on OS 3, App 4 (30 %) on OS 4, App 5 (20 %) on OS 5, App m (28 %) on OS m, App n (50 %) on OS n. Inset plot: rate of server accesses versus time, contrasting 9 A.M. to 5 P.M. M-F with all other times.]
- Machine provisioning is done for peak demand.
- Processors are underutilized during off-peak hours, which wastes resources.
- We need technology and algorithms that allocate only as many resources as are required.

Server Consolidation Process
[Figure: before consolidation, each application runs on its own server and OS (App 1 30 %, App 2 40 %, App 3 25 %, App 4 30 %, App 5 20 %, App m 28 %, App n 50 %). After consolidation, the same applications run on guest OSes inside VM 1 to VM n, hosted by hypervisors on SERVER 1 to SERVER m.]
- Consolidation allows idle PMs to be shut down, saving operational costs.

Load Distribution
[Figure: VM 1 to VM n on hypervisors across SERVER 1 to SERVER m, shown with two sets of loads. First: App 1 30 %, App 2 40 %, App 3 25 %, App 4 30 %, App 5 20 %, App m 28 %, App n 50 %. Second: App 1 30 %, App 2 20 %, App 3 45 %, App 4 30 %, App 5 25 %, App m 28 %, App n 50 %.]

Live Migration
[Figure: VM 5 (App 5 on Guest OS, 25 %) is migrated between servers. The loads are App 1 30 %, App 2 20 %, App 3 45 %, App 4 30 %, App 5 25 %, App m 28 %, App n 50 %. After the migration the VMs are listed in the order VM 1, VM 2, VM 5, VM 3, VM 4, VM m, VM n across SERVER 1 to SERVER m.]

Dynamic Resource Allocation
A dynamic workload requires dynamic resource management:
- Allocation of resources (CPU, memory, etc.) to VMs in each PM.
- Allocation of VMs to PMs, minimizing the number of operational PMs.
Modern hypervisors (e.g. Xen) allow:
- Resource allocation within a PM.
- Dynamic allocation of VMs to PMs through live migration.
Required: architecture and mechanisms for
- Determining the resource allocation to each VM within a PM.
- Determining the deployment of VMs on PMs.
So that:
- Capital and operational costs are minimized.
- Application performance is maximized.

Two Level Controller Architecture
PMA: Performance Measurement Agent
RAC: Resource Allocation Controller
[Figure: each PM hosts several VMs together with a PMA and a RAC. Labeled flows between the PMs and the VM Placement Controller: RAC-determined resource requirements, performance measures, and migration decisions.]
Note: PMs (physical machines/servers) are assumed to be homogeneous.

Problem Definition
The VM Placement Controller has to make optimal migration decisions at regular intervals that result in:
- Low SLA violations.
- A reduction in the number of busy PMs.
SLAs (Service Level Agreements) are performance guarantees that the data center owner negotiates with the user. These guarantees can include average response time, maximum delay, maximum downtime, etc.
Idle PMs can be switched off or run in low-power mode.

Issues in Server Consolidation/Distribution
Several issues are involved in server consolidation:
- Interference of VMs: bad behavior of one application in a VM adversely affects (degrades the performance of) the other VMs on the same PM.
- Delayed effects: changes to the resource configuration of a VM show their effects only after some delay.
- Migration cost: live migration has a cost (performance degradation).
- Workload pattern: it is not deterministic or known a priori.
These difficulties motivate us to use a Reinforcement Learning approach.

Reinforcement Learning (RL)
[Figure: the agent-environment interaction in RL.]
The goal of the agent is to maximize the cumulative long-term reward based on the immediate reward r_{n+1}.
RL has two major benefits:
- It does not require a model of the system.
- It captures delayed effects in decision making, so it can take action before a problem arises.

Problem Formulation for RL Framework: System Assumptions
- M PMs: assumed to be homogeneous.
- N VMs: each assumed to be running one application whose performance metrics of interest are throughput and response time.
- Response time is considered at the server level only (not as seen by the user).
- The workload per VM is assumed to be cyclic.
- The resource requirement is assumed to be equal to the workload.
- Cyclic workload model: the time period is assumed to be divided into phases.
[Figure: rate of server accesses versus time over one time period, divided into Phase 1 and Phase 2.]
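The cyclic workload assumption can be illustrated with a minimal Python sketch. It assumes, beyond what the slide states, that all phases have equal length and that each phase has one constant load level; the function names and the example load values are hypothetical.

```python
from typing import Sequence


def phase_at(t: float, period: float, num_phases: int) -> int:
    """Return the 1-based phase that is active at time t.

    Assumption (not from the slides): the period is split into
    num_phases phases of equal length.
    """
    position = t % period                  # position inside the current cycle
    phase_length = period / num_phases
    index = int(position // phase_length)  # 0-based phase index
    return min(index, num_phases - 1) + 1  # guard against float rounding


def workload_at(t: float, period: float, phase_loads: Sequence[float]) -> float:
    """Return the load of a VM at time t for a cyclic, per-phase load profile."""
    return phase_loads[phase_at(t, period, len(phase_loads)) - 1]
```

For instance, with a period of 10 time units and five phases, `phase_at(12, 10, 5)` falls in the second phase of the second cycle.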

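Problem Formulation for RL Framework
- N VMs, M PMs (M > 1) and P phases in the cyclic workload model.
- State s_n at time n is (phase of the cyclic workload model, allocation vector of VMs).
- Action a_n at time n is (id of the VM that is migrating, id of the PM to which it is migrating).
- Reward r(s_n; a_n) at time n is defined as

\[
r(s_n; a_n) =
\begin{cases}
\mathrm{score}(n+1) - \mathrm{powercost}(n+1) & \text{if } \mathrm{score}_i(n+1) > 0 \text{ for all } i = 1, \dots, N, \\
-1 & \text{otherwise,}
\end{cases}
\]

\[
\mathrm{powercost}(n) = \mathrm{busy}(n) \cdot \mathrm{fixedcost},
\qquad
\mathrm{score}(n) = \sum_{i=1}^{N} \mathrm{weight}_i \cdot \mathrm{score}_i(n),
\]

\[
\mathrm{score}_i(n) =
\begin{cases}
\dfrac{\text{Throughput of VM } i\,(n)}{\text{Arrival rate of VM } i\,(n)} & \text{if } \text{response time}_i \le \mathrm{SLA}_i, \\
-1 & \text{otherwise,}
\end{cases}
\]

where SLA_i is the maximum allowable response time for VM i.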
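The reward definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the overall score is taken to be the weighted sum of the per-VM scores, the arrival rate is assumed to be positive, and the function and parameter names are hypothetical rather than taken from the authors' CloudSim code.

```python
from typing import Sequence


def vm_score(throughput: float, arrival_rate: float,
             response_time: float, sla: float) -> float:
    """Per-VM score: throughput / arrival rate if the SLA is met, else -1.

    Assumption: arrival_rate > 0.
    """
    if response_time <= sla:
        return throughput / arrival_rate
    return -1.0


def reward(vm_scores: Sequence[float], weights: Sequence[float],
           busy_pms: int, fixed_cost: float) -> float:
    """Reward for one step, computed from quantities observed at time n+1."""
    # Any non-positive per-VM score (e.g. an SLA violation) gives reward -1.
    if any(s <= 0 for s in vm_scores):
        return -1.0
    # Assumption: the overall score is the weighted sum of per-VM scores.
    score = sum(w * s for w, s in zip(weights, vm_scores))
    power_cost = busy_pms * fixed_cost
    return score - power_cost
```

Penalising every busy PM by the same fixed cost matches the assumption that the PMs are homogeneous.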

RL Agent implemented in CloudSim
CloudSim is a Java-based simulation tool for cloud computing. We used it to implement all our algorithms and made the following additions to it:
- Response time and throughput calculations.
- Cyclic workload model.
- Interference model.
- RL agent that takes migration decisions. It
  - implements Q-Learning with an ε-greedy policy:
    - full state representation without batch updates;
    - full state representation with batch updates of batch size 200;
  - implements Q-Learning with function approximation and an ε-greedy policy:
    - function approximation without batch updates;
    - function approximation with batch updates of batch size 200.
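For readers unfamiliar with the method, here is a minimal Python sketch of tabular Q-learning with an ε-greedy policy (the full state representation case, without batch updates). It is the textbook form of the algorithm, not the authors' CloudSim implementation; the class name, the hyperparameter defaults and the encoding of states and actions as tuples are assumptions made for illustration.

```python
import random
from collections import defaultdict


class QLearningAgent:
    """Tabular Q-learning with an epsilon-greedy policy (textbook form)."""

    def __init__(self, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
        self.q = defaultdict(float)   # Q[(state, action)], defaults to 0
        self.alpha = alpha            # learning rate (assumed value)
        self.gamma = gamma            # discount factor (assumed value)
        self.epsilon = epsilon        # exploration probability (assumed value)
        self.rng = random.Random(seed)

    def choose(self, state, actions):
        """Explore with probability epsilon, otherwise act greedily."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(actions)
        return max(actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state, next_actions):
        """One-step update: Q += alpha * (r + gamma * max Q' - Q)."""
        best_next = max((self.q[(next_state, a)] for a in next_actions),
                        default=0.0)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])
```

A state could be encoded as `(phase, allocation)` with the allocation stored as a tuple of tuples so that it is hashable, and an action as `(vm_id, pm_id)`.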

Workload Model Graphs for Experiments
[Figure: two panels, Workload Model 1 (wm1) and Workload Model 2 (wm2).]
The graphs show cyclic workloads with 5 phases that repeat periodically.

Experiment Setup
- 5 VMs, 3 PMs and 5 phases in the cyclic workload model (shown on the previous slide).
- Migration decisions are taken at the end of each phase.
Experiments:
(1) One VM has workload model wm1 and the others have workload model wm2.
(2) Two VMs have workload model wm1 and the others have workload model wm2.
For each of experiments (1) and (2), the following scenarios are considered:
- All costs are negligible except power.
- High interference cost with VMs 4 and 5 interfering (high performance degradation due to interference of VMs).
- High migration cost with VMs 4 and 5 interfering (high performance degradation due to VM migration).
- High migration cost and high interference cost.

Policy Generated after Convergence
Initial state: all VMs on PM 1 in Phase 1, i.e. (1;((1,2,3,4,5),(),())).
Scenario: all costs are negligible except power.
[Figure: utilization (0.1 to 0.5) of VM 1 to VM 5 versus phase (ticks 1 to 5, shown twice), annotated with the actions "Migrate VM 1 to PM 2" and "Migrate VM 1 to PM 1".]

Policy Generated after Convergence
Initial state: all VMs on PM 1 in Phase 1, i.e. (1;((1,2,3,4,5),(),())).
Scenario: high migration cost and high interference cost; VMs 3 and 4 are interfering.
[Figure: utilization (0.1 to 0.5) of VM 1 to VM 5 versus phase (ticks 1 to 5, shown twice), annotated with the action "Migrate VM 1".]

Results with Full State Representation
- The algorithm was verified to converge most of the time in 15000 steps in case 1 and 80 steps in case 2.
- The algorithm was verified to converge every time in 20000 steps in case 1 and 115 steps in case 2.

Feature-Based Function Approximation
- Full state representation leads to state space explosion: 10 PMs, 10 VMs and 5 phases result in a number of states on the order of $10^{11}$.
- This requires huge memory and a long time to converge to a good policy.
- Approximation methods are needed.
- Approximate $Q(s;a)$ as $Q(s;a) \approx \theta^{T} \sigma(s;a)$, where $\sigma(s;a)$ is a d-dimensional feature (column vector) corresponding to the state-action tuple $(s;a)$ and $\theta$ is a d-dimensional tunable parameter.
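A short Python sketch may help make the two points of this slide concrete. It first counts states as P · M^N, assuming PMs are distinguished by id (this gives 5 · 10^10 for the example, the same ballpark as the figure quoted on the slide), and then shows the linear approximation together with the standard gradient-style Q-learning update for a linear approximator. The update rule and all names are assumptions; the slides do not spell out which update the authors used.

```python
from typing import List, Sequence


def count_states(num_pms: int, num_vms: int, num_phases: int) -> int:
    """Number of (phase, allocation) states if every PM id is distinct."""
    return num_phases * num_pms ** num_vms


def q_value(theta: Sequence[float], features: Sequence[float]) -> float:
    """Linear approximation: Q(s;a) ~ theta^T sigma(s;a)."""
    return sum(t * f for t, f in zip(theta, features))


def td_update(theta: Sequence[float], features: Sequence[float],
              reward: float, best_next_q: float,
              alpha: float, gamma: float) -> List[float]:
    """Standard update: theta += alpha * delta * sigma(s;a)."""
    delta = reward + gamma * best_next_q - q_value(theta, features)
    return [t + alpha * delta * f for t, f in zip(theta, features)]
```

The memory needed is then d parameters instead of one table entry per state-action pair.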

Features for Function Approximation
Let there be 5 VMs, 3 PMs and 2 phases in the cyclic workload model.
State = [1,(1,2,3),(4),(5)] and Action = (4,3), so the next-state allocation is [1,(1,2,3),(),(4,5)].
The k features are:
- Phase indicator of the cyclic workload model (utilization level).
- Pairwise indicator of whether VMs (i,j) are allocated on the same PM in the next state (interference).
- Fraction of busy PMs in the next state (power savings).
[Figure: feature vector with column labels 1, 2, (1,2), (1,3), (1,4), …, (3,5), (4,5) and the fraction of busy PMs; the slide shows the entries 1 0 1 1 1 … 0 1 and 0.7; k features in total.]
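The three feature groups listed on this slide can be sketched in Python as follows. The representation is an assumption made for illustration: an allocation is a tuple with one tuple of VM ids per PM, an action is `(vm_id, target_pm_id)` with 1-based ids, and the pairwise indicators are ordered lexicographically. The function names are hypothetical.

```python
from itertools import combinations


def apply_action(allocation, action):
    """Return the next allocation after migrating one VM (None = no migration)."""
    if action is None:
        return allocation
    vm, target_pm = action
    pms = [[v for v in pm if v != vm] for pm in allocation]  # remove the VM
    pms[target_pm - 1].append(vm)                            # place it on the target
    return tuple(tuple(sorted(pm)) for pm in pms)


def build_features(phase, num_phases, next_allocation, num_vms):
    """Phase indicators, pairwise co-location indicators, fraction of busy PMs."""
    phase_ind = [1.0 if p == phase else 0.0 for p in range(1, num_phases + 1)]
    host = {vm: i for i, pm in enumerate(next_allocation) for vm in pm}
    pair_ind = [1.0 if host[a] == host[b] else 0.0
                for a, b in combinations(range(1, num_vms + 1), 2)]
    busy_fraction = sum(1 for pm in next_allocation if pm) / len(next_allocation)
    return phase_ind + pair_ind + [busy_fraction]


# The slide's example: State = [1,(1,2,3),(4),(5)], Action = (4,3).
next_alloc = apply_action(((1, 2, 3), (4,), (5,)), (4, 3))
example = build_features(1, 2, next_alloc, 5)
```

For the slide's example this gives k = 2 + 10 + 1 = 13 features, and the fraction of busy PMs is 2/3, which rounds to the 0.7 shown on the slide.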

Features for Function Approximation
[Figure: the full feature vector is a concatenation of blocks of k features each, one block per action: f1 (Migrate VM 1), f2 (Migrate VM 2), …, f4 (Migrate VM 4), …, f6 (No Migration). In the example only the f4 block (0.7 … 1 … 0) is nonzero, and its start index is marked.]
- The position of the fi block captures the migration cost for migrating VM i.
- In the example, all blocks except f4 are zero vectors.
- Store only the f4 features and their start index.
- Perform multiplication and addition operations only for the k features starting from the start index.
- The k features correspond to the three bullet points on the previous slide.
- This reduces the number of multiplication and addition operations by around five times the number of VMs.
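The start-index idea can be sketched as a sparse dot product in Python. The block layout is an assumption for illustration: block i-1 holds the features for "Migrate VM i" and the last block holds "No Migration". The function names are hypothetical.

```python
from typing import Sequence


def start_index_for(action_slot: int, k: int) -> int:
    """Start of the k-feature block for a 0-based action slot (assumed layout)."""
    return action_slot * k


def sparse_q_value(theta: Sequence[float], block: Sequence[float],
                   start_index: int) -> float:
    """theta^T sigma(s;a) when only one k-feature block of sigma is nonzero.

    Only len(block) multiply-add operations are performed, instead of one
    per entry of the full feature vector.
    """
    return sum(theta[start_index + j] * f for j, f in enumerate(block))
```

The result equals the dot product of `theta` with the full vector in which every other block is zero, so nothing is lost by skipping the zero blocks.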

Results with Function Approximation
Feature-based Q-Learning with function approximation:
- The features are found to be non-differentiating for some state-action (s;a) tuples.
- For example, consider the state-action tuples ((5;(1,2,3,4),(5),()); (1,2)) and ((5;(2,3,4),(1,5),()); (1,1)).
- These tuples are differentiated by pairwise indicators only: (1,5) in the first case and (1,2);(1,3);(1,4) in the second.
- As the next slide shows, action (1,2) is good for state (5;(1,2,3,4),(5),()) while action (1,1) is bad for state (5;(2,3,4),(1,5),()).
- This implies that pairs (1,3) and (1,4) are bad allocations and pair (1,5) is a good one, although they are equivalent in deployment.

Optimal Policy
Initial state: all VMs on PM 1 in Phase 1, i.e. (1;((1,2,3,4,5),(),())).
Scenario: high migration cost and high interference cost; VMs 3 and 4 are interfering.
[Figure: utilization (0.1 to 0.5) of VM 1 to VM 5 versus phase (ticks 1 to 5, shown twice), annotated with the action "Migrate VM 1".]

Conclusion and Future Work
- We conclude from this project that the RL algorithm with full state representation works very well but suffers from a huge state space.
- For the features to work with function approximation, we have to add more features for the interference of VMs. This results in a huge feature set, which is the same problem as before.
- Future work would address the following three issues:
  - The features must differentiate well between (s;a) tuples.
  - Fast convergence of the algorithm.
  - Scalability of the algorithm.

References
- Norman Wilde and Thomas Huber. Virtualization and Cloud Computing. http://uwf.edu/computerscience/seminar/Documents/20090911_VirtualizationAndCloud.ppt
- VCONF: A Reinforcement Learning Approach to Virtual Machines Auto-Configuration. http://portal.acm.org/citation.cfm?id=1555263
- L A Prashanth and Shalabh Bhatnagar. Reinforcement learning with function approximation for traffic signal control.

Thank You !