Stand Tall, and Carry a Precision Micrometer: Observations on Creating a Measurement Model for Virtual Machines
David Boyes
Sine Nomine Associates
Introduction
Observations on the problem
– Accuracy
– Fairness (particularly on chargeback information)
Variations in Standard Techniques
A Few Old Things are New Again
Model Approximation Types
Some thoughts on capacity planning and projection
Bringing it all back together
Q&A
Measurement in Virtual Machines
Measurement takes place in multiple locations:
– Within each virtual machine
– At the supporting virtualization system level
Numbers resulting from traditional measurement techniques are false and misleading
Accurate measurement demands:
– Correlation between the virtual machine view and the supporting system view
– Correction of counts to accommodate multiple workloads
Factors in Virtualization Models
Traditional Resource Utilization Factors
– CPU
– I/O
– Storage (RAM and disk)
– Network traffic
Correction Factors (see the sketch below)
– Total/Virtual CPU ratio
– I/O allocation to the specific virtual machine
– Allocation of storage resources and a time element for occupancy
– VLAN and traffic sampling
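To make the two groups of factors concrete, here is a minimal sketch (not from the original deck) of a per-virtual-machine, per-interval record that carries the raw utilization factors alongside the correction factors; all field names and units are illustrative.

```python
from dataclasses import dataclass

@dataclass
class VMIntervalRecord:
    """One rating interval for one virtual machine (all names illustrative)."""
    vm_name: str
    # Traditional resource utilization factors
    cpu_seconds: float       # CPU time reported inside the guest
    io_operations: int       # I/O operations attributed to the guest
    ram_mb_hours: float      # RAM occupancy weighted by time resident
    disk_gb_hours: float     # disk occupancy weighted by time allocated
    net_mb: float            # sampled VLAN traffic for the guest
    # Correction factors supplied by the host-level instrumentation
    tv_ratio: float          # total/virtual CPU ratio for the interval
    io_share: float          # fraction of host I/O attributable to the guest
    storage_share: float     # fraction of shared storage occupancy
    net_sample_rate: float   # sampling rate used for the traffic counts
```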
“Classic” Summary of Samples
Utilization chargeback assigns a unit cost to each element (sketched below)
– Simple arithmetic, right?
– NO!!
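For contrast, a minimal sketch of the “classic” calculation: multiply each sampled element by a unit cost and sum. The unit costs and sample values are invented for illustration; the rest of the deck explains why this arithmetic misleads once the samples come from virtual machines.

```python
# "Classic" chargeback: unit cost x sampled utilization, summed per virtual machine.
# Unit costs and samples below are purely illustrative.
UNIT_COSTS = {
    "cpu_seconds": 0.02,        # $ per CPU-second
    "io_operations": 0.0001,    # $ per I/O operation
    "storage_gb_hours": 0.01,   # $ per GB-hour occupied
    "net_mb": 0.005,            # $ per MB transferred
}

def classic_charge(samples: dict) -> float:
    """Naive bill: assumes the in-guest numbers are the true consumption."""
    return sum(UNIT_COSTS[element] * amount for element, amount in samples.items())

print(classic_charge({"cpu_seconds": 3600, "io_operations": 50000,
                      "storage_gb_hours": 24, "net_mb": 1200}))
```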
Observed Problems
False data from instrumentation
Relative difficulty in building correlation between 1st- and 2nd-level observation
Missing identification of application- or virtual-machine-specific data in accounting and performance data streams
Re-socialization of “shared resources”
Inability of performance tooling to account for external costs
Virtual and Total Resource Measurement Are No Longer the Same
In virtual machines, we have to capture the cost of instruction simulation and the operation of the virtualization environment
– True cost is measured by the difference between CPU measured inside the virtual machine and CPU measured in the hypervisor or “host” (sketch below)
– Requires correlation of the host measurement against the “inside” measurement (clocks don’t always match!)
Also true for all the other factors!
How can we get data for one machine separated from the entire mass?
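One way to make the “difference” measurement concrete, as a sketch: it assumes the host can report per-guest CPU for the same interval and that the two clocks are roughly aligned; the numbers are illustrative.

```python
def simulation_overhead(guest_cpu_s: float, host_cpu_s: float) -> float:
    """CPU spent on instruction simulation and hypervisor work for one guest over
    one interval: host-attributed CPU minus the CPU the guest saw itself use.
    A negative result usually means the two clocks or intervals do not line up."""
    return host_cpu_s - guest_cpu_s

# Example: the host attributes 72 CPU-seconds to the guest; the guest accounted for 65.
overhead = simulation_overhead(guest_cpu_s=65.0, host_cpu_s=72.0)
print(f"simulation + hypervisor cost: {overhead:.1f} CPU-seconds")
```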
Implications for Chargeback and Management
What appears to be a “fair share” does not actually reflect real utilization
– The most critical observation shows up in the relatively non-scalable functions (I/O, network)
– Users want to pay only for what they use
– Direct impact on capacity planning
What’s a lad to do?
Borrowing From the Phone Company
This isn’t a new problem in either the performance world or the billing world; the phone companies have dealt with it for ages in reconciling cross-network charges. Can we borrow some ideas here?
– Rating vs simple measurement
– Peak-leveling models
– Fuzzy correlation
Rating vs Simple Sampling
By using a correction factor based on a correlation period rather than simple sums, we can modify the measurement according to business rules (sketch below)
– Relaxes the requirement for precise timestamping and clock correlation
– Allows workload-costing feedback for management tooling in a shared environment
– User favorite: easy revaluation of data in case of a dispute
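A sketch of the idea, with all rule names and numbers invented: raw samples are grouped into rating periods and a per-period correction (the “business rule”) is applied, so a dispute can be settled by re-rating the same raw stream under different rules rather than re-measuring.

```python
from collections import defaultdict

def rate(samples, period_s, correction):
    """Group (timestamp, value) samples into rating periods of period_s seconds
    and apply a per-period correction function (the business rule).
    Re-running with a different correction re-values the same raw data."""
    buckets = defaultdict(float)
    for ts, value in samples:
        buckets[int(ts // period_s)] += value
    return {period: correction(total) for period, total in sorted(buckets.items())}

# Illustrative samples: (timestamp in seconds, CPU-seconds observed)
samples = [(3, 1.0), (42, 2.5), (75, 1.5), (130, 4.0)]

# Two illustrative business rules applied to the same raw stream.
as_measured   = rate(samples, period_s=60, correction=lambda x: x)
with_overhead = rate(samples, period_s=60, correction=lambda x: x * 1.08)

print(as_measured)
print(with_overhead)
```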
Example
Correction of CPU resource utilization is effected by the T/V ratio (sketch below)
– The assumption still rests on the ability of the host instrumentation to report statistics by virtual machine
Similar technique for the other variables
– Note: the sum of the individual measures should be close to the total amount per interval per processor (on MP systems, > 100%)
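A sketch of the correction for one interval, assuming the host instrumentation reports total and virtual CPU time per guest; variable names, the sanity-check tolerance, and the sample figures are illustrative.

```python
def corrected_cpu(guest_cpu_s: float, total_cpu_s: float, virtual_cpu_s: float) -> float:
    """Scale the guest-reported CPU time by the total/virtual (T/V) ratio for the
    interval, so simulation and hypervisor overhead is charged back to the guest."""
    tv_ratio = total_cpu_s / virtual_cpu_s
    return guest_cpu_s * tv_ratio

def check_interval(per_vm_corrected: dict, interval_s: float, n_processors: int) -> bool:
    """Sanity check: the corrected per-VM figures should sum to roughly the capacity
    of the interval (more than 100% of one processor on MP systems)."""
    return sum(per_vm_corrected.values()) <= interval_s * n_processors * 1.02  # small tolerance

vms = {"LINUX01": corrected_cpu(guest_cpu_s=40.0, total_cpu_s=110.0, virtual_cpu_s=100.0),
       "LINUX02": corrected_cpu(guest_cpu_s=55.0, total_cpu_s=110.0, virtual_cpu_s=100.0)}
print(vms, check_interval(vms, interval_s=60.0, n_processors=2))
```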
Example
Note that current non-zSeries systems are weak on separation of data for individual partitions
– Work is ongoing in DMTF, CIM/SNMP, and WS-I workgroups to address additional granularity for virtual systems
– Competing prototypes in pSeries LPAR and Sun Domain Manager
Projection and Confidence Levels
Goal: ±0.5% nominal
Realistic expectation at this stage: 5-7%
Projection at this point is still weak on data.
Projection and Confidence Levels
The tendency is toward under-correction (i.e., overestimation of consumption)
– Good if you’re a service provider!
– If linked to auto-provisioning (eWLM, Superdome, etc.), it will trigger early provisioning of additional resources
The model may be fine-tuned by adjusting the rating interval:
– Optimum for most transaction-oriented servers lies in second-scale intervals
– Optimum for compute-intensive servers lies in second-scale intervals
Data Correlation
Use of a rating-engine stream allows the correlation requirement to be less stringent (sketch below)
– There is still some requirement for “near” timing, but the buckets are large enough that most virtual machine monitors cannot span an interval.
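A sketch of the relaxed correlation: guest-side and host-side samples only need to land in the same rating bucket, not match timestamps exactly. The bucket size and both sample streams are illustrative.

```python
from collections import defaultdict

def bucket(samples, bucket_s):
    """Sum (timestamp, value) samples into buckets of bucket_s seconds."""
    out = defaultdict(float)
    for ts, value in samples:
        out[int(ts // bucket_s)] += value
    return out

def correlate(guest_samples, host_samples, bucket_s=300):
    """Pair guest and host figures per bucket; exact clock agreement is not needed
    as long as neither monitor's skew spans a whole bucket."""
    g, h = bucket(guest_samples, bucket_s), bucket(host_samples, bucket_s)
    return {b: (g.get(b, 0.0), h.get(b, 0.0)) for b in sorted(set(g) | set(h))}

guest = [(12, 5.0), (180, 6.0), (430, 4.0)]   # guest-reported CPU-seconds
host  = [(15, 5.6), (200, 6.7), (445, 4.5)]   # host-attributed CPU-seconds, slightly skewed clock
print(correlate(guest, host))
```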
Summary
Virtual machine modeling presents a combination of old and new problems
Additional sophistication in instrumentation will be sufficient for a truly representative model
A reasonably accurate approximation can be provided by adjusting measurement based on rated intervals instead of simple accumulation
Q&A
Contact Info
David Boyes
Sine Nomine Associates