1 Predicting the Performance of Virtual Machine Migration. Presented by: Eli Nazarov. Authors: Sherif Akoush, Ripduman Sohan, Andrew W. Moore, Andy Hopper, University of Cambridge

2 Agenda  Introduction.  How to migrate?  Defining migration performance.  Performance prediction.  The AVG & HIST models.  Evaluation.  Conclusions.

3 Why does performance prediction matter?  Provision and control computing capacity.  Guarantee performance levels.  Efficient management. Better VM placement. Better resource utilization (e.g. load balancing).

4 How to migrate?  Stop-and-Copy  Minimizes total migration time.  Highest downtime.   On-Demand.  Short downtime.  Very high total migration time. 

5 Pre-Copy migration  Pre-Copy migration involves 6 steps:  Initialization Pre-select a target for migration.  Reservation Reserve resources on the destination host.  Iterative Pre-Copy First Iteration : Send all RAM. Each iteration : Send modified pages.  Stop-and-Copy Stop VM for final transfer.  Commitment Destination host acknowledges that the copy finished correctly.  Activation Re-attachment of resources to VM on the destination host. Pre-copy phase Copy phase

6 Xen Stop Conditions  Fewer than 50 pages were dirtied during the last pre-copy iteration.  Guarantees short downtime.  29 pre-copy iterations have been carried out.  More than 3*|VM| has already been copied. At iteration N-1 we copied 3*|VM| - 1 page.  Forces the Stop-and-Copy stage.
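The three stop conditions can be written as a single predicate. This is a minimal sketch: the function name and its parameters are hypothetical, and only the thresholds (50 pages, 29 iterations, 3*|VM|) come from the slide.

```python
# Thresholds taken from the slide; everything else is illustrative.
MIN_DIRTY_PAGES = 50      # fewer dirtied pages than this ends pre-copy
MAX_ITERATIONS = 29       # pre-copy iteration cap
MAX_SENT_FACTOR = 3       # cap on data sent, as a multiple of VM size


def should_stop_precopy(iteration, dirtied_last_iter, pages_sent, vm_pages):
    """Return True when pre-copy should end and Stop-and-Copy should begin.

    iteration          -- number of pre-copy iterations completed so far
    dirtied_last_iter  -- pages dirtied during the last iteration
    pages_sent         -- total pages sent so far
    vm_pages           -- VM memory size in pages
    """
    return (
        dirtied_last_iter < MIN_DIRTY_PAGES
        or iteration >= MAX_ITERATIONS
        or pages_sent > MAX_SENT_FACTOR * vm_pages
    )
```

The first condition is the one that guarantees a short downtime; the other two only force termination when the workload dirties memory too quickly to converge.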

7 Migration & Down times

8 How To Predict?  Calculate Bounds.
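The slide does not spell out the bound formulas, so the following is a sketch under stated assumptions. It assumes the best case sends the VM memory exactly once, and the worst case sends just under 3*|VM| in pre-copy, one more full iteration, and a full Stop-and-Copy, for roughly 5*|VM| in total (the "- 1 page" term is ignored). Downtime is assumed to range from the post-migration overhead alone to a full memory transfer plus that overhead. The function and parameter names are hypothetical.

```python
def migration_bounds(vm_bytes, link_bytes_per_s, pre_overhead, post_overhead):
    """Return (mt_lower, mt_upper, dt_lower, dt_upper) in seconds.

    Assumed bounds, reconstructed from the Xen stop conditions:
      MT lower: one full copy of VM memory plus overheads
      MT upper: about five full copies of VM memory plus overheads
      DT lower: post-migration overhead only
      DT upper: one full copy of VM memory plus post-migration overhead
    """
    one_copy = vm_bytes / link_bytes_per_s
    mt_lower = pre_overhead + one_copy + post_overhead
    mt_upper = pre_overhead + 5 * one_copy + post_overhead
    dt_lower = post_overhead
    dt_upper = one_copy + post_overhead
    return mt_lower, mt_upper, dt_lower, dt_upper
```

Because the upper bound scales with five transfers and the lower bound with one, the gap between them widens as the VM grows or the link slows down.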

9 Bounds are not enough!  Bounds alone do not give an accurate prediction.  Reason: the gap between the lower and upper bounds is large and varies with link speed and VM size.  Example: for VM size = 1,024 MB (MT = total migration time, DT = total downtime, LB = lower bound, UB = upper bound):

Speed      MT LB    MT UB     DT LB     DT UB
100 Mbps   96.3 s   459.1 s   0.314 s   91.4945 s
1 Gbps     13.3 s   49.9 s    0.314 s   9.4978 s
10 Gbps    5.3 s    10.1 s    0.314 s   1.5187 s

 For larger VM memory sizes the differences are even greater.  We need something more accurate.

10 Parameters affecting migration  Migration link bandwidth.  Higher speed links allow faster transfers.  Pre and Post migration overheads.  Operations that aren’t part of the actual transfer.  Examples: Initializing container in destination host. Reattaching device drivers to the new VM. etc.  Example: 10 Gbps, VM size = 512MB Pre-overhead = 77%

11 Parameters affecting migration (cont.)  Page dirty rate.  The rate at which memory pages in the VM are modified.  Affects the number of pages transferred in each pre-copy iteration.  The relation between page dirty rate and performance is not linear. Reason: link speed.

12 Page dirty rate and link speed  Downtime at low page dirty rate is almost constant and close to lower bound.  Downtime increases to upper bound when page dirty rate is high (reaches link capacity). 10Gbps – Total downtime

13 Page dirty rate and link speed (cont.)  Total migration time increases with page dirty rate.  Total migration time goes back to lower bound for extremely high page dirty rate. Back to pure Stop-and-Copy. 10Gbps – Total migration time 100Mbps – Total migration time

14 What's next?  Prediction using all parameters affecting migration.  Link speed.  Page dirty rate.  VM memory size.  Overheads. AVG – Average Page Dirty Rate. HIST – History Based Page Dirty Rate.

15 The AVG model  Based on the migration logic.  Assumes a constant or average page dirty rate.  Useful when the page dirty rate is stable.  Follows the core functionality of migration in Xen.

16 The AVG model (cont.)  Input parameters:  Link Speed.  Page Dirty Rate. Analytically determinable.  Pre\Post overheads. Time spent during actual transfer – Time to migrate idle VM  VM Size.  Xen functionality:  sim_clean(): returns the set of dirty pages + sets state to “all clean”.  sim_peek(): returns bitmap of dirty pages (no state change).

17 Algorithm - the AVG model  Each Pre-Copy phase:  Get dirty bitmap – sim_peek().  Skip the pages re-dirtied in this iteration  Collect at most 1024 pages – batch.  migration_time +=  if (last_iteration) downtime_time +=  Clean pages status – sim_clean().  Calculate the total times:  total_migration_time = migration_time + pre_overheads + post_overheads.  total_downtime = downtime + post_overheads.
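A heavily simplified sketch of the AVG idea follows. It is not the authors' algorithm: it does not model the dirty bitmap (sim_peek/sim_clean), the 1024-page batches, or the skipping of re-dirtied pages. The per-iteration time formulas are missing from the slide, so the sketch assumes each iteration takes (pages to send) / (link rate), and that the pages dirtied during an iteration are (dirty rate) * (iteration time), capped at the VM size. All names are hypothetical; the constants come from the Xen stop conditions.

```python
MAX_ITERATIONS = 29     # Xen pre-copy iteration cap
MIN_DIRTY_PAGES = 50    # below this, go to Stop-and-Copy
MAX_SENT_FACTOR = 3     # stop once more than 3*|VM| has been sent


def simulate_avg(vm_pages, dirty_pages_per_s, link_pages_per_s,
                 pre_overhead=0.0, post_overhead=0.0):
    """Return (total_migration_time, total_downtime) in seconds,
    assuming a constant page dirty rate."""
    to_send = vm_pages          # first iteration sends all RAM
    pages_sent = 0
    migration_time = 0.0

    for _ in range(MAX_ITERATIONS):
        elapsed = to_send / link_pages_per_s
        migration_time += elapsed
        pages_sent += to_send
        # Pages dirtied while this iteration was being transferred.
        to_send = min(vm_pages, int(dirty_pages_per_s * elapsed))
        if (to_send < MIN_DIRTY_PAGES
                or pages_sent > MAX_SENT_FACTOR * vm_pages):
            break

    # Stop-and-Copy: the VM is paused while the remainder is sent.
    downtime = to_send / link_pages_per_s
    migration_time += downtime

    total_migration_time = migration_time + pre_overhead + post_overhead
    total_downtime = downtime + post_overhead
    return total_migration_time, total_downtime
```

With a zero dirty rate the sketch reduces to a single full copy plus overheads. With a dirty rate far above link capacity it sends five full copies, which matches the worst case implied by the stop conditions.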

18 The HIST model  Used in cases where the dirty page rate is a function of time.  Depends on the history log of page dirty rate.

19 The HIST model (cont.)  Given the start time of migration – t  Predict migration times based on: t+1, t+2, …, t+N  Changed sim_clean() and sim_peek() to return the number of dirty pages at the above points in time from the log.  Use the AVG algorithm with these functions. Observation: For deterministic processes the set of dirtied pages at any point in time will be approximately the same as for previous runs of the same workload running in a similar environment.
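One way to picture the history lookup is a helper that replaces the constant dirty rate with counts read from a log of a previous run. This is a sketch under assumptions: the log format (one dirty-page count per second), the function name, and the linear weighting of partial seconds are all hypothetical. It also sums counts across seconds, so a page dirtied in two different seconds is counted twice; the result is capped at the VM size.

```python
import math


def dirtied_from_history(log, start, duration, vm_pages):
    """Estimate pages dirtied in [start, start + duration).

    log       -- log[k] is the number of pages dirtied during second k
                 of a previous run of the same workload
    start     -- offset in seconds from the start of the workload
    duration  -- length of the interval in seconds
    vm_pages  -- VM memory size in pages (upper cap on the result)
    """
    end = start + duration
    total = 0.0
    k = int(math.floor(start))
    while k < end and k < len(log):
        # Fraction of second k that falls inside the interval.
        overlap = min(end, k + 1) - max(start, k)
        total += log[k] * overlap
        k += 1
    return min(vm_pages, int(total))
```

For example, with log = [100, 200, 300], an interval starting at 0.5 s and lasting 1.5 s covers half of second 0 and all of second 1, giving 250 pages.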

20 Evaluation  Test-bed:  XenServer 5.5.0 (Xen 3.3.1) on 3 servers: 1 pool master, 2 hosts for migration.  Each server: 2 Intel® Xeon™ 2.13 GHz, 6 GB DDR3.  SAN – IBM eServer xSeries 336, 2 GB DIMM, Ultra320 SCSI, Ubuntu 2.6.27-7 kernel.  Compared to:  Actual migration using 2 Solarflare 10 Gbps NICs.

21 Evaluation (Cont.)  Page Modification Micro-Benchmark  Can be used both for AVG & HIST.  Deterministic application.  Writes to memory pages at fixed rates.  High resolution of page modification Up to pages/sec.  Over 25,000 live migrations.

22 Evaluation (cont.) - Results [Figures: AVG vs. real migration; HIST vs. real migration]

23 Results (Cont.) - Results  For |VM|=1024MB, LinkSpeed=10Gbps:  HIST mean deviation from the measurements :  3.3% - total migration time.  6.2% - total downtime.  AVG mean deviation from the measurements:  2.6% - total migration time.  3.3% - total downtime.

24 Evaluation (cont.) – Industry workloads  Comparing against a set of industry-standard workloads.  SPEC CPU: CPU-bound workloads.  SPECweb: web server workloads.  SPECsfs: I/O, MapReduce & non-interactive workloads.

25 Industry workloads - Results (MT = total migration time, DT = total downtime, A = actual measurements, P = HIST prediction)

Workload  MT (A)   MT (P)    MT error  DT (A)    DT (P)    DT error
CPU       5.8 s    5.7 s     2.4%      0.317 s   0.314 s   2.4%
WEB       7.5 s    7.4 s     2.0%      0.449 s   0.42 s    6.4%
SFS       14.8 s   14.9 s    1.5%      0.2176 s  0.2177 s  0.1%
MR        14.9 s   15.13 s   1.4%      0.348 s   0.384 s   0.2%

26 Comments  Presented an accurate model for prediction.  Performed a large scale evaluation.   Very specific to Xen implementation.  Didn’t perform evaluation comparing to other prediction methods.  Didn’t state how to predict with bounds.

27 Questions?

