Ignacio Cano, Srinivas Aiyar, Arvind Krishnamurthy

Slides:



Advertisements
Similar presentations
Windows IT Pro magazine Datacenter solution with lower infrastructure costs and OPEX savings from increased operational efficiencies. Datacenter.
Advertisements

Profit from the cloud TM Parallels Dynamic Infrastructure AndOpenStack.
What’s in Windows Server 2012 for SQL Server N.Raja / James Crawshaw Technology Specialists - Microsoft DBI332.
Cloud Computing Resource provisioning Keke Chen. Outline  For Web applications statistical Learning and automatic control for datacenters  For data.
Performance Anomalies Within The Cloud 1 This slide includes content from slides by Venkatanathan Varadarajan and Benjamin Farley.
Web-Scale Converged Infrastructure For The Enterprise Nutanix Discovery Kit products Jan Presenter Name Date.
High memory instances Monthly SLA : Virtual Machines Validated & supported Microsoft workloads Price reduction: standard Windows (22%) & Linux (29%)
INTRODUCING COMPELLENT – FASTEST GROWING SAN VENDOR Virtualized storage for enterprises and cloud data centers.
Towards High-Availability for IP Telephony using Virtual Machines Devdutt Patnaik, Ashish Bijlani and Vishal K Singh.
1 Exploring Data Reliability Tradeoffs in Replicated Storage Systems NetSysLab The University of British Columbia Abdullah Gharaibeh Matei Ripeanu.
VIRTUALIZATION AND YOUR BUSINESS November 18, 2010 | Worksighted.
European Organization for Nuclear Research Virtualization Review and Discussion Omer Khalid 17 th June 2010.
Resource Management in Data-Intensive Systems Bernie Acs, Magda Balazinska, John Ford, Karthik Kambatla, Alex Labrinidis, Carlos Maltzahn, Rami Melhem,
Virtualization Infrastructure Administration Cluster Jakub Yaghob.
Microsoft Load Balancing and Clustering. Outline Introduction Load balancing Clustering.
Scalability Module 6.
Virtualization Performance H. Reza Taheri Senior Staff Eng. VMware.
CERN IT Department CH-1211 Genève 23 Switzerland t Next generation of virtual infrastructure with Hyper-V Michal Kwiatek, Juraj Sucik, Rafal.
Deploying Moodle with Red Hat Enterprise Virtualization Brian McSpadden Director of Network Operations Remote-Learner.net.
1 Exploring Data Reliability Tradeoffs in Replicated Storage Systems NetSysLab The University of British Columbia Abdullah Gharaibeh Advisor: Professor.
1 The Virtual Reality Virtualization both inside and outside of the cloud Mike Furgal Director – Managed Database Services BravePoint.
Database Services for Physics at CERN with Oracle 10g RAC HEPiX - April 4th 2006, Rome Luca Canali, CERN.
The Era of the Cloud OS: Transform the Datacentre
Components of Windows Azure - more detail. Windows Azure Components Windows Azure PaaS ApplicationsWindows Azure Service Model Runtimes.NET 3.5/4, ASP.NET,
1 An SLA-Oriented Capacity Planning Tool for Streaming Media Services Lucy Cherkasova, Wenting Tang, and Sharad Singhal HPLabs,USA.
Appendix B Planning a Virtualization Strategy for Exchange Server 2010.
Cloud Computing Energy efficient cloud computing Keke Chen.
Hadoop Hardware Infrastructure considerations ©2013 OpalSoft Big Data.
FlashSystem family 2014 © 2014 IBM Corporation IBM® FlashSystem™ V840 Product Overview.
Never Down? A strategy for Sakai high availability Rob Lowden Director, System Infrastructure 12 June 2007.
Eric Burgener VP, Product Management A New Approach to Storage in Virtual Environments March 2012.
CERN Database Services for the LHC Computing Grid Maria Girone, CERN.
Copyright © 2005 VMware, Inc. All rights reserved. How virtualization can enable your business Richard Allen, IBM Alliance, VMware
Tier-2 storage A hardware view. HEP Storage dCache –needs feed and care although setup is now easier. DPM –easier to deploy xrootd (as system) is also.
Cloud Computing Lecture 5-6 Muhammad Ahmad Jan.
Scalable and elastic Enterprise scale and performance for the largest workloads Shared- nothing live migration Hyper-V Network.
CHARACTERIZING CLOUD COMPUTING HARDWARE RELIABILITY Authors: Kashi Venkatesh Vishwanath ; Nachiappan Nagappan Presented By: Vibhuti Dhiman.
Meeting with University of Malta| CERN, May 18, 2015 | Predrag Buncic ALICE Computing in Run 2+ P. Buncic 1.
U N C L A S S I F I E D LA-UR Leveraging VMware to implement Disaster Recovery at LANL Anil Karmel Technical Staff Member
HYPERCONVERGED INFRASTRUCTURE Adam Savage A Brief history of the data centre The Mainframe Natively converged CPU, memory and storage Had redundancy.
Sql Server Architecture for World Domination Tristan Wilson.
Global Commercial Channels Sales Play Card Customer Challenge Windows Server 2012 feature PowerEdge 12G feature Customer Benefits “I’m unable to grow due.
Commvault and Nutanix October Changing IT landscape Today’s Challenges Datacenter Complexity Building for Scale Managing disparate solutions.
 Session Objectives:  Understand how to upgrade your private cloud: Windows Server 2008 R2  Windows Server 2012 R2 Windows Server 2012  Windows.
Journey to the HyperConverged Agile Infrastructure
Azure.
E2800 Marco Deveronico All Flash or Hybrid system
Course: Cluster, grid and cloud computing systems Course author: Prof
Azure Site Recovery For Hyper-V, VMware, and Physical Environments
Lenovo – DataCore Deployment Ready Offerings
Understanding The Cloud
Organizations Are Embracing New Opportunities
C Loomis (CNRS/LAL) and V. Floros (GRNET)
OpenStack Swift Where do big data go? Eben van Zyl
Curator: Self-Managing Storage for Enterprise Clusters
Running virtualized Hadoop, does it make sense?
Windows Server* 2016 & Intel® Technologies
Windows Azure Migrating SQL Server Workloads
VIDIZMO Deployment Options
Provisioning Performance of name server Software
Azure.
Upgrading to Microsoft SQL Server 2014
20409A 7: Installing and Configuring System Center 2012 R2 Virtual Machine Manager Module 7 Installing and Configuring System Center 2012 R2 Virtual.
Microsoft Virtual Academy
CS 345A Data Mining MapReduce This presentation has been altered.
IBM Power Systems.
TechEd /24/2019 4:29 AM © 2013 Microsoft Corporation. All rights reserved. Microsoft, Windows, and other product names are or may be registered trademarks.
PerformanceBridge Application Suite and Practice 2.0 IT Specifications
OpenStack for the Enterprise
Presentation transcript:

Characterizing Private Clouds: A Large-Scale Empirical Analysis of Enterprise Clusters Ignacio Cano, Srinivas Aiyar, Arvind Krishnamurthy University of Washington – Nutanix Inc. ACM Symposium on Cloud Computing October 2016

Private Clouds

Private Clouds Cloud computing that delivers service to a single organization, as opposed to public clouds, which service many. Direct control of infrastructure and data. Carry management and maintenance costs.

Motivation Increasing trend in the use of private clouds within companies. Private clouds deployments require careful consideration of what will happen in the future: Capacity Failures …

Motivation Need Measurement Data! Research Questions: What are the most common failures? What type of workloads are typically run? How is the storage used? What about CPU usage? How do additional replicas impact data durability? What causes companies to expand their clusters? Need Measurement Data!

Related Work Limited prior work on Private Clouds! Setting \ Study Hardware Failures Storage Compute Desktops HW Failures in PCs [Nightingale et al., EuroSys’11] Metadata in Windows PCs [Agrawal et al., TOS’07] I/O on Apple computers [Harter et al., SOSP’11] Disk/CPU Usage and Load [Bolosky et al., SIGMETRICS’00] Public Clouds HW reliability [Vishwanath et al., SoCC’11] Data Characteristics and Access Patterns [Liu et al., IEEE/ACM CCGrid’13] Workloads characterization [Mishra et al., SIGMETRICS’10] Scheduling on Heterogeneous Clusters [Reiss et al., SoCC’12] Limited prior work on Private Clouds!

In this talk Large-Scale Measurement Study of Private Clouds Lower hardware failure rates Nodes overprovisioned Stable storage and CPU usage Modeling based on the Measurements Each extra replica provides substantial durability improvements Storage needs drive growth more than compute

Outline Large-Scale Measurement Study of Private Clouds Context Cluster Profiles Failure Analysis Workload Characteristics Modeling based on the Measurements Durability Cluster Growth

Outline Large-Scale Measurement Study of Private Clouds Context Cluster Profiles Failure Analysis Workload Characteristics Modeling based on the Measurements Durability Cluster Growth

Nutanix Clusters Random replication VMs migration … Operations interposed at the hypervisor level and redirected to CVMs Nutanix Clusters Random replication VMs migration … Integrated Compute-Storage Global view of cluster state Global view of cluster state

Outline Large-Scale Measurement Study of Private Clouds Context Cluster Profiles Failure Analysis Workload Characteristics Modeling based on the Measurements Durability Cluster Growth

Clusters Summary Statistics Value # of Clusters 2168

Clusters Summary Statistics Value # of Clusters 2168 # of Nodes 13394 6.18 Nodes/Cluster

Clusters Summary Statistics Value # of Clusters 2168 # of Nodes 13394 Cluster Sizes 3 - 40

Clusters Summary Statistics Value # of Clusters 2168 # of Nodes 13394 Cluster Sizes 3 - 40 # of Disks ~ 70K

Mostly homogeneous within a cluster Node Configurations Configuration Storage Compute Memory (GB) SSD (TB) HDD (TB) Cores Clock Rate (GHz) Config-1 1.6 8 24 2.5 384 Config-2 0.8 4 12 2.4 128 Config-3 30 16 256 Storage-heavy Compute-heavy Mostly homogeneous within a cluster

Virtual Desktop Infrastructure Workloads Workload Example Applications Configuration Virtual Desktop Infrastructure Citrix XenDesktop VMware Horizon/View Config-1

Virtual Desktop Infrastructure Workloads Workload Example Applications Configuration Virtual Desktop Infrastructure Citrix XenDesktop VMware Horizon/View Config-1 Server SQL Server Exchange Mail Server Config-2 Config-3

Virtual Desktop Infrastructure Workloads Workload Example Applications Configuration Virtual Desktop Infrastructure Citrix XenDesktop VMware Horizon/View Config-1 Server SQL Server Exchange Mail Server Config-2 Config-3 Big Data Splunk Hadoop

Virtual Desktop Infrastructure Workloads Workload Example Applications Configuration Virtual Desktop Infrastructure Citrix XenDesktop VMware Horizon/View Config-1 Server SQL Server Exchange Mail Server Config-2 Config-3 Big Data Splunk Hadoop Others IT Infrastructure Custom applications Mix

Distribution of VMs per Node Most 2-4 vCPUs Median 21 Highest density Lowest density

Outline Large-Scale Measurement Study of Private Clouds Context Cluster Profiles Failure Analysis Workload Characteristics Modeling based on the Measurements Durability Cluster Growth

Failures We only consider failures that require manual intervention, i.e., human operators annotate the cause of the problem.

Top 3 account for around 50% of HW failures Hardware Failures Top 3 account for around 50% of HW failures

Annual Return Rate Component ARR (%) HDD 0.76 2-9 % prior studies

Enterprise-grade commodity HW Annual Return Rate Component ARR (%) HDD 0.76 SSD 0.72 Lower return rates Enterprise-grade commodity HW 4-10 % prior studies (4 years)

Outline Large-Scale Measurement Study of Private Clouds Context Cluster Profiles Failure Analysis Workload Characteristics Modeling based on the Measurements Durability Cluster Growth

Workload Characteristics Usage over time seems to be stable/predictable: 80% of the clusters use Storage: mean <= 50%, std <= 8% CPU: mean <= 20%, std <= 5% SSDs can generally maintain the working set 80% of nodes use <= 500 GB for the working set

Outline Large-Scale Measurement Study of Private Clouds Context Cluster Profiles Failure Analysis Workload Characteristics Modeling based on the Measurements Durability Cluster Growth

Durability Model Estimate the probability of data loss. Assumptions: replication factor of 2 random replication (replicate to a random node) The time required to create a new replica when a node goes down: Data to be replicated Data transfer rate Remaining live nodes

Durability Model p(∆t) = probability of node failure in ∆t time. We decompose the overall period over which we want to provide the durability guarantee into a sequence of intervals, each of length ∆t. Q = data loss event where two failures occur within ∆t time, i.e. data could not be replicated.

The remaining n-1 nodes do not fail within ∆t time Durability Model Then the probability that there is no data loss in an interval ∆t: The remaining n-1 nodes do not fail within ∆t time No failures Exactly one node fails

# of intervals of ∆t time in a year Durability Model On a yearly-basis, we consider all ∆t intervals in a year. Probability of no data loss within a year is: # of intervals of ∆t time in a year

Durability in Private Clouds Rule of Thumb: each additional replica provides an additional 5 9’s of durability Most clusters have 5 9’s with RF2, and 10 9’s with RF3 Most clusters have 5 9’s with RF2, and 10 9’s with RF3

Outline Large-Scale Measurement Study of Private Clouds Context Cluster Profiles Failure Analysis Workload Characteristics Modeling based on the Measurements Durability Cluster Growth

Cluster Growth Analysis Customers periodically add nodes to their existing clusters. What drives such growth? We resort to machine learning Binary classification problem Logistic Regression with L1 regularization

Cluster Growth Analysis Use 200 clusters than grew at least once in a period of 8 months. 15K examples (70% train, 10% val, 20% test). Train with different combination of features to understand which are important.

Performance Features Fp Cluster Features Fc Description n(nodes) discretized # of nodes n(vms) # of vms per node Storage Features Fs Description r(ssd) ssd usage to ssd capacity ratio r(hdd) hdd usage to hdd capacity ratio r(store) storage usage to total capacity ratio Performance Features Fp Description n(vcpus) # of virtual cpus n(iops) # of iops per node

What drives cluster growth? Upgrades from 3-4 node clusters Cluster Size Storage Needs Compute Needs HDD usage Number of VMs Storage more than compute!

Outline Large-Scale Measurement Study of Private Clouds Context Cluster Profiles Failure Analysis Workload Characteristics Modeling based on the Measurements Durability Cluster Growth

Conclusions Large-Scale Measurement Study of Private Clouds Lower hardware failure rates Nodes overprovisioned Stable storage and CPU usage Modeling based on the Measurements Each extra replica provides substantial durability improvements Storage needs drive growth more than compute

Thanks!