Applying Benchmark Data To A Model for Relative Server Capacity. CMG 2013. Joseph Temple, LC-NS Consulting; John J. Thomas, IBM.

Relative Server Capacity
How do I compare machine capacity? What platform is the best fit to deliver a given workload? Simple enough questions, but difficult to answer! Establishing server capacity is complex:
- Different platform design points
- Different machine architectures
- Continuously evolving platform generations
Standard benchmarks (SPECint, TPC-C, etc.) and composite metrics (RPE2, QPI, etc.) help, but may not be sufficient: some platforms do not support these metrics, and they may not be sufficient to decide best fit for a given workload. We need a model to address Relative Server Capacity. See Alternative Metrics for Server RFPs [J. Temple].

Local Factors / Constraints
[Diagram: workload fit, together with local factors and constraints (non-functional requirements, technology adoption, strategic direction, cost models, reference architectures), drives platform choice among System z, System x, and Power.]

Fit for Purpose Workload Types
Type 1, Mixed Workload: scales up. Updates to shared data and work queues; complex virtualization; business intelligence with heavy data sharing and ad hoc queries.
Type 2, Highly Threaded: scales well on large SMP. Web application servers; single instance of an ERP system; some partitioned databases.
Type 3, Parallel Data Structures: scales well on clusters. XML parsing; business intelligence with structured queries; HPC applications.
Type 4, Small Discrete: limited scaling needs. HTTP servers; file and print; FTP servers; small end user apps.
Classification dimensions: application function, data structure, usage pattern, SLA, integration, scale. (Black are design factors; blue are local factors.)

Fitness Parameters in Machine Design
These can be customized to the machines of interest; we need to know the specific comparisons desired. The parameters were chosen to represent the ability to handle parallel, serial, and bulk data traffic. This is based on Greg Pfister's work on workload characterization in In Search of Clusters.

Key Aspects Of The Theoretical Model
Throughput (TP)
- Common concept: units of work done / units of time elapsed.
- The theoretical model defines TP as a function of thread speed: TP = Thread Speed x Threads.
- Thread Speed = Clock Rate x Threading Multiplier / Threads per Core, where the threading multiplier is the increase in throughput due to running multiple threads per core.
Thread Capacity (TC)
- TP gives us an idea of the instantaneous peak throughput rate; to sustain this rate, the load has to keep all threads of the machine busy.
- In the world of dedicated systems, TP is the parameter of interest because it tells us the peak load the machine can handle without causing queues to form.
- In the world of virtualized/consolidated workloads, however, we stack multiple workloads on the threads of the machine; thread capacity estimates how deep these stacks can be.
- The theoretical model defines TC as: TC = Thread Speed x Cache per Thread.
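
A minimal sketch of these two definitions in Python. The 3.6 GHz, 16-core, SMT-4 configuration and the thread multiplier of 2 mirror the POWER7 numbers cited on later slides; the cache-per-thread value is a hypothetical placeholder.

def thread_speed(clock_ghz, thread_multiplier, threads_per_core):
    # Thread Speed = clock rate x threading multiplier / threads per core
    return clock_ghz * thread_multiplier / threads_per_core

def throughput(clock_ghz, thread_multiplier, threads_per_core, cores):
    # TP = Thread Speed x Threads
    threads = cores * threads_per_core
    return thread_speed(clock_ghz, thread_multiplier, threads_per_core) * threads

def thread_capacity(clock_ghz, thread_multiplier, threads_per_core, cache_mb_per_thread):
    # TC = Thread Speed x Cache per Thread
    return thread_speed(clock_ghz, thread_multiplier, threads_per_core) * cache_mb_per_thread

print(throughput(3.6, 2.0, 4, 16))        # 1.8 per thread x 64 threads = 115.2
print(thread_capacity(3.6, 2.0, 4, 2.0))  # 1.8 x 2 MB (assumed) = 3.6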

Throughput, Saturation, Capacity
[Chart: throughput (TP) vs. load and response time, with labels for measured ITR, pure parallel CPU ITR, capacity, ETR, and the effect of other resources and serialization.]

Single Dimension Metrics Do Not Reflect True Capacity
The standard metrics do not leverage cache; this leads to a pure ITR view of relative capacity. Common metrics: ITR (a TP-like measure) and ETR. On pure ITR, Power is advantaged and z is not price competitive. But ETR << ITR unless loads are consolidated, and consolidation accumulates working sets, which advantages Power and z. Cache can also mitigate saturation.

Bridging Two Worlds - I
There appears to be a disconnect between common benchmark metrics and theoretical model metrics like TP. Does this mean metrics like TP are invalid? No. We see the effect of TP/TC in real world deployments: a machine performs better or worse than what a common benchmark metric would have suggested. Does this mean benchmark metrics are useless? No. They provide valuable data points. A better approach is to bridge these two worlds in a meaningful way.

Bridging Two Worlds - II
The theoretical model calculates TP and TC using estimated values for thread speed, based on machine specifications. Example: the TP calculation for POWER7. A key factor in the TP calculation is thread speed, which in turn depends on the value of the thread multiplier. But this factor is only an estimate: we estimated the thread multiplier for POWER7 in SMT-4 mode to be 2. Moreover, using an estimate for thread speed assumes common path length and linear scaling. There is an inherent problem here: these estimates are not measured or specified using any common metric across platforms. For example, should the thread multiplier be the same for POWER7 in SMT-2 mode as for Intel running with Hyper-Threading?
Recommendation: refine the factors in the theoretical model with benchmark results. Instead of using theoretical values for thread speed, path length, etc., plug in benchmark observations.

Two Common Categories Of Benchmarks
Stress tests: measure raw throughput, i.e., the maximum throughput that can be driven through a system, focusing all system resources on this one task.
VM density tests: measure the consolidation ratios (VM density) that can be achieved on a platform. They usually do not try to maximize the throughput of the system; instead they look at how multiple workloads can be stacked efficiently to share the resources of the system while delivering steady throughput.
Adjusting thread speed affects both TP and TC.

Example of a Stress Test, A Misleading One If Used In Isolation!
Online trading WAS ND (TradeLite) workload driven as a stress test:
- 2ch/16co Intel 2.7GHz blade: peak ITR 3467 tps
- Linux on System z, 16 IFLs: peak ITR 3984 tps
This benchmark result is quite misleading: it suggests a z core yields only 15% better ITR (3984 / 3467 is about 1.15), but we know that z has much higher capacity. What is wrong here? The System z design point is to run multiple workloads together, not a single atomic application under stress, and this particular application doesn't seem to leverage many of z's capabilities (cache, I/O, etc.). Can this benchmark result be used to compare capacity?

Use Benchmark Data To Refine Relative Capacity Model
Calculate effective thread speed from measured values:
- What is the benchmarked thread speed?
- Normalizing thread speed and clock to a reference platform allows us to calculate the path length for a given platform.
- This in turn allows us to calculate an effective thread speed.
- Doing this affects both TP and TC.
Plug the effective thread speed values into the relative capacity calculation model.
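
A rough sketch of these steps in Python. The function bodies are my reading of the formula fragments on the results slide below, and the inputs are placeholders (ITR figures reuse the stress-test slide; the thread counts and the 5.5 GHz z clock are assumptions, not figures from the deck).

def benchmarked_thread_speed(peak_itr, threads):
    # Benchmarked thread speed = ITR / threads
    return peak_itr / threads

def path_length(clock_ratio, thread_speed_ratio):
    # Whatever speed difference the clocks do not explain is attributed
    # to path length, after normalizing to a reference platform.
    return clock_ratio / thread_speed_ratio

def relative_capacity(eff_thread_speed, total_threads, cache_per_thread):
    # Relative capacity = effective thread speed x total threads x cache/thread
    return eff_thread_speed * total_threads * cache_per_thread

# Hypothetical comparison of platform B against reference platform A:
ts_a = benchmarked_thread_speed(3467.0, 32)   # e.g. the Intel blade, assuming HT on
ts_b = benchmarked_thread_speed(3984.0, 16)   # e.g. the 16-IFL z configuration
pl_b = path_length(5.5 / 2.7, ts_b / ts_a)    # assumes a 5.5 GHz z clock
print(ts_a, ts_b, pl_b)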

Use Benchmark Data To Refine Relative Capacity Model - Results
Benchmarked thread speed = ITR / Threads. Path length = Clock ratio / Thread-speed ratio. Relative capacity = Effective thread speed x Total threads x Cache per thread.
In this case, System z ends up with a 13.5x relative capacity factor, relative to Intel.

Example of a VM Density Test: Consolidating Standalone VMs With Light CPU Requirements
Online banking WAS ND workloads, each driving 22 transactions per second with light I/O:
- Common x86 hypervisor, 2ch/16co Intel 2.7GHz blade: 48 VMs per IPAS Intel blade
- PowerVM, 2ch/16co POWER7+ 3.6GHz blade: 68 VMs per IPAS POWER7+ blade
- z/VM on zEC12, 16 IFLs: 100 VMs per 16-way z/VM
Consolidation ratios derived from IBM internal studies. Results will vary based on workload profiles/characteristics.

Use Benchmark Data To Refine Relative Capacity Model - Results
Follow a similar exercise to calculate effective thread speed (see the sketch below):
- Each VM drives a certain fixed throughput. This test used a constant injection rate; if throughput varies (for example, when holding a constant think time), you need to adjust for that.
- Calculate the benchmarked thread speed.
- Normalize to a reference platform to get path length.
- Calculate the effective thread speed.
- Plug into the relative server capacity calculation.
In this case, System z ends up with a 22.2x relative capacity factor relative to Intel.
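
A hedged sketch of the first step for the VM-density case in Python: aggregate throughput is VMs x 22 tps (constant injection rate), spread across hardware threads. The per-platform thread counts are my assumptions about SMT modes, not figures from the deck.

tps_per_vm = 22.0
# (VM count, assumed threads): HT on Intel (32), SMT-4 on POWER7+ (64),
# one thread per IFL on zEC12 (16).
platforms = {"intel_blade": (48, 32), "power7_blade": (68, 64), "zvm_zec12": (100, 16)}
for name, (vms, threads) in platforms.items():
    aggregate = vms * tps_per_vm
    print(f"{name}: {aggregate:.0f} tps total, {aggregate / threads:.1f} tps/thread")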

Math Behind Consolidation
Rogers' Equation: Uavg = 1 / (1 + HR(avg)), where HR(avg) = k * c / sqrt(N).
For consolidation:
- N is the number of loads (VMs)
- k is a design parameter (service level)
- c is the variability of the initial load
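
A quick check of the equation against the next four slides, in Python. Here k = 2 corresponds to a 2-sigma (97.7%) service level and c = 2.5 is the coefficient of variation those slides assume.

from math import sqrt

k, c = 2.0, 2.5  # service-level parameter (2 sigma ~ 97.7% SLA) and load variability

for n, avg_demand in [(1, 10), (4, 40), (16, 160), (144, 1440)]:
    hr = k * c / sqrt(n)          # headroom ratio shrinks as sqrt(N)
    u_avg = 1.0 / (1.0 + hr)      # Rogers' Equation: achievable average utilization
    peak_to_avg = 1.0 + hr        # required capacity / average demand
    print(f"N={n:3d}: U={u_avg:.1%}, capacity={avg_demand * peak_to_avg:.0f}/sec "
          f"({peak_to_avg:.2f}x peak-to-average)")
# N=1: 16.7%, 60/sec (6.00x); N=4: 28.6%, 140/sec (3.50x);
# N=16: 44.4%, 360/sec (2.25x); N=144: 70.6%, 2040/sec (1.42x; the deck's
# 2045/sec comes from rounding 1 + HR to 1.42 before multiplying).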

Larger Servers With More Resources Make More Effective Consolidation Platforms
Most workloads experience variance in demand. When you consolidate workloads with variance on a virtualized server, the variance of the sum is smaller relative to its mean (statistical multiplexing). The more workloads you consolidate, the smaller this relative variance becomes. Consequently, bigger servers with the capacity to run more workloads can be driven to higher average utilization levels without violating service level agreements, thereby reducing the cost per workload.

A Single Workload Requires a Machine Capacity Of 6x the Average Demand
Average demand m = 10/sec; server capacity required = 60/sec (6x peak-to-average); server utilization = 17%. Assumes a coefficient of variation of 2.5, required to meet a 97.7% SLA.

Consolidation Of 4 Workloads Requires Server Capacity Of 3.5x Average Demand
Average demand 4*m = 40/sec; server capacity required = 140/sec (3.5x peak-to-average); server utilization = 28%. Assumes a coefficient of variation of 2.5, required to meet a 97.7% SLA.

Consolidation Of 16 Workloads Requires Server Capacity Of 2.25x Average Demand
Average demand 16*m = 160/sec; server capacity required = 360/sec (2.25x peak-to-average); server utilization = 44%. Assumes a coefficient of variation of 2.5, required to meet a 97.7% SLA.

Consolidation Of 144 Workloads Requires Server Capacity Of 1.42x Average Demand
Average demand 144*m = 1440/sec; server capacity required = 2045/sec (1.42x peak-to-average); server utilization = 70%. Assumes a coefficient of variation of 2.5, required to meet a 97.7% SLA.

Let's Look At Actual Customer Data
Large US insurance company: 13 production POWER7 frames, some large servers, some small. Detailed CPU utilization data at 30-minute intervals over one whole week, for each LPAR on the frame and for each frame in the data center. Measure peak, average, and variance.

Detailed Data Example: One Frame

Customer Data Confirms Theory
Servers with more workloads show less variance in their utilization and require less headroom.

Consolidation Observations
There is a benefit to large scale servers:
- The headroom required to accommodate variability grows only as sqrt(n) when n workloads are pooled.
- The larger the shared processor pool, the more statistical benefit you get.
- Large scale virtualization platforms are able to consolidate large numbers of virtual machines because of this.
- Servers with the capacity to run more workloads can be driven to higher average utilization levels without violating service level agreements.

Summary
We need a theoretical model for relative server capacity comparisons, but purely theoretical models need to be grounded in reality. Atomic benchmarks can be quite misleading about overall system capability, so refine theoretical models with benchmark measurements. Real world (customer) data trumps everything: it validates or negates models. Here, customer data validates the sqrt(n) model for consolidation.