Sandpiper: Black-box and Gray-box Resource Management for Virtual Machines
Journal: Computer Networks: The International Journal of Computer and Telecommunications Networking, 2009
Vinayak Gagrani
Introduction
An overloaded data center can be handled in two ways:
– Reallocating resources within a physical machine
– Migrating one or more VMs to distribute the load
Manual migration is error-prone and slow to respond
Sandpiper introduces two techniques for monitoring VMs:
– Black box: monitor VMs externally, without knowledge of the applications running inside them
– Gray box: use metrics from the OS and applications inside the VM for more information
Uses prediction to estimate likely future utilization
Uses a greedy approach to decide which VMs to move
Sandpiper Architecture
Resource Provisioning
Need to estimate the additional resources a VM requires
Black box:
– A high percentile of the tail of the usage distribution serves as the initial estimate
– Case 1: the VM is using more than its fair share
– Case 2: the VM is using its fair share completely, so the observed usage understates its requirement
– Scaling: how much should the estimate be scaled up?
Gray box:
– Better provisioning using service rate, response rate, and drop rate
– Applications are modeled as a G/G/1 queuing system
– Allows the allocated memory to be reduced when it is not fully used
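The two estimates above can be sketched roughly as follows. This is a minimal illustration, not Sandpiper's implementation: the function names, the nearest-rank percentile, and the fixed `headroom` used when the VM is saturated are all assumptions (the slide itself leaves "how much to scale" open). The gray-box function uses the standard heavy-traffic bound on mean waiting time in a G/G/1 queue, solved for the request rate that still meets a target mean response time.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile of a non-empty list; p is in (0, 100]."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def black_box_estimate(usage, fair_share, p=95, headroom=0.10):
    """Peak-demand estimate from externally observed usage.

    headroom is a hypothetical bump applied when the VM already consumes
    its whole fair share, because the observed peak is then only a lower
    bound on what the VM actually needs.
    """
    peak = percentile(usage, p)
    if peak >= fair_share:
        return peak * (1.0 + headroom)
    return peak


def gg1_capacity(mean_service_time, target_response_time,
                 var_interarrival, var_service):
    """Request rate a G/G/1 server can sustain within the response target.

    Derived from W <= lambda * (var_a + var_b) / (2 * (1 - lambda * s))
    with target_response_time = s + W.
    """
    slack = target_response_time - mean_service_time
    if slack <= 0:
        raise ValueError("target must exceed the mean service time")
    return 1.0 / (mean_service_time
                  + (var_interarrival + var_service) / (2.0 * slack))
```

Dividing the observed peak request rate by `gg1_capacity(...)` then gives a factor by which the current allocation would have to grow, under the assumption that capacity scales linearly with the allocation.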
Hotspot Detection & Mitigation
Hotspot Detection
– Black box: per physical server; gray box: per virtual server
– A hotspot is flagged only when the threshold is exceeded for a sustained period (k of the last N observations) and the next predicted value also exceeds it
– Conservative or aggressive detection, depending on the choice of k and N
– Future values are predicted with auto-regressive predictors
Hotspot Mitigation
– VM resizing
– Migration
The placement problem is NP-hard
Capturing multi-dimensional loads
– Volume of a server
Migration phase
Swap phase
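The k-of-N detection rule can be sketched as below. This is an illustrative sketch under stated assumptions: the predictor is a first-order auto-regressive model, AR(1), whose coefficient is estimated from the lag-1 autocorrelation of the history, and the function names and thresholds are hypothetical rather than taken from Sandpiper.

```python
def ar1_predict(history):
    """Predict the next value with an AR(1) model: mean + phi * (last - mean)."""
    n = len(history)
    mean = sum(history) / n
    # Lag-1 autocovariance over the variance gives the AR(1) coefficient.
    num = sum((history[i] - mean) * (history[i - 1] - mean)
              for i in range(1, n))
    den = sum((x - mean) ** 2 for x in history)
    phi = num / den if den else 0.0
    return mean + phi * (history[-1] - mean)


def is_hotspot(history, threshold, k, n):
    """Flag a hotspot only if at least k of the last n observations exceed
    the threshold AND the predicted next value exceeds it as well."""
    window = history[-n:]
    if len(window) < n:
        return False  # not enough observations yet
    violations = sum(1 for u in window if u > threshold)
    return violations >= k and ar1_predict(history) > threshold
```

A larger k relative to N makes detection more conservative, since a brief spike such as `[0.3, 0.3, 0.9, 0.3, 0.3]` produces only one violation and is ignored, while a smaller k reacts faster at the cost of more needless migrations.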
Positives and Negatives
Positives
– Very good demonstration of applying one technology (live migration) to another problem (resource management)
– Many figures and graphs to support the text
– Very detailed description; efficient and ready to use
Negatives
– A separate machine must be dedicated to the control plane
– A lot of data must be kept per VM for prediction and profiling
– Possible bottleneck (?)
– The mitigation algorithms could have been more structured
– Does not describe how to determine the ‘k’ lowest-VSR VMs in the swap phase
Points to ponder
Memory resizing in the black-box approach
– Issues and possible solutions
Quantifying the load of a machine
– Problem with the current volume metric
– Alternatives? Experiments?
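As a starting point for the volume question, the sketch below computes the metric as the Sandpiper paper defines it: the product of 1/(1 − u) over CPU, network, and memory utilization, with the volume-to-size ratio (VSR) being volume divided by memory footprint. The function names and the `cap` that keeps the product finite at full utilization are illustrative assumptions.

```python
def volume(cpu, net, mem, cap=0.99):
    """Product of 1 / (1 - u) over the three utilizations, each in [0, 1].

    cap is a hypothetical clamp that avoids division by zero at u = 1.
    """
    vol = 1.0
    for u in (cpu, net, mem):
        u = min(max(u, 0.0), cap)
        vol *= 1.0 / (1.0 - u)
    return vol


def vsr(vol, mem_footprint_mb):
    """Volume-to-size ratio: load moved per megabyte of memory migrated."""
    if mem_footprint_mb <= 0:
        raise ValueError("memory footprint must be positive")
    return vol / mem_footprint_mb
```

With these definitions, `volume(0.9, 0.0, 0.0)` is about 10 while `volume(0.5, 0.5, 0.5)` is 8, so a single nearly saturated dimension outweighs moderate load on all three. Whether that ordering is desirable is one concrete question to test experimentally.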
Future Work
Multiple control planes?
– Use multiple control planes that interact with each other instead of a single one
– Utilize features of distributed computation
– Remove the monitoring bottleneck (?)
– Reduce the risk of failure of the central machine (?)
Any others?