The Performance of Big Data Workloads in Cloud Datacenters

Presentation transcript:

The Performance of Big Data Workloads in Cloud Datacenters
Alexandru Uta, Alexandru Custura, Harry Obaseki (a.uta@vu.nl)
Vrije Universiteit Amsterdam, Massivizing Computer Systems
June 11, 2018

Massivizing Computer Systems
Big Science, Education for Everyone (Online), Business Services, Online Gaming, Grid Computing, Datacenters, Daily Life. In massivizing computer systems we study the complex and non-trivial interactions between applications and the large-scale ecosystems they run on. In this presentation, I focus on one instance of this problem: the interaction between big data processing systems and applications, and the clouds they run on.

Convenient to use big data + cloud
I think we all agree it is very convenient to run big data workloads in the cloud. It is easy to spawn a few VMs and deploy your big data framework on them. Even easier, cloud providers offer big data frameworks as a service (e.g., Elastic MapReduce on AWS). So there is no need to worry anymore about managing a cluster or dealing with hardware. And if it is so easy, ...

Wide variety of frameworks running together
What happens when everybody runs big data in the cloud? There is a large variety of big data applications and frameworks; consider Matt Turck's big data landscape. Most of these systems already run in clouds, either as managed services or deployed independently by users. As a result, clouds see very diverse workloads running next to each other. Image courtesy of mattturck.com.

Resource contention produces performance variability in clouds!
Co-location induces (resource) performance variability. How does resource interference affect performance? Consider a physical machine in a cloud: it quite possibly hosts VMs that perform very different tasks (Hadoop, Spark, TensorFlow). Contention for the machine's shared resources translates directly into performance variability.

Workload variability produces performance variability!
How does workload variability affect performance? Same setup as before, but now assume all co-located applications use the same resource, with highly different utilization patterns over time. Even then, workload variability alone produces performance variability.

Cloud (resource) performance is highly variable! Due to:
- Co-location
- Virtualization
- Workload variability
- Network congestion
It affects all possible resources: this is emergent behavior in large-scale ecosystems! Co-location and workload variability matter because, as in the examples, applications share the same hardware. Virtualization slows VMs down by adding a layer of abstraction between them and the physical hardware. Congestion is mostly about networking, caused by overcommitting resources. The most impacted resources are the most limited ones (disk, network). MSR Cambridge measured network bandwidth variability on 8 different clouds at the infrastructure level: the distributions are very different, both between clouds and within clouds. [Ballani et al., SIGCOMM 2011; Iosup et al., CCGrid 2011]

Convenient to use big data + cloud, but...
Variability entails: poor performance predictions, poor scheduling decisions, over-provisioning, and extra costs. This raises two questions: How to study performance variability? And how to control it?

How to study performance variability?
Traditional performance analysis rests on: (1) trace analysis, (2) benchmarking, (3) performance modeling. But current models and benchmarks do not consider resource variability, and there is no study of resource performance variability for big data, covering variability both within clouds and between clouds (a performance-portability issue). Does resource variability lead to performance variability? And why is performance variability bad? Because ideally we want to be able to predict performance. Think of catching a plane at 10am: we assume the bus to the airport comes on time, and that security checks take no more than an hour. The same holds in computing: we need performance predictability.

A Framework for Studying Performance Variability
(1) Traces: fall back to empirical evaluation based on previous observations. (2) Benchmark: a controlled environment that emulates real-world variability scenarios, running multiple classes of big data applications. (3) Modeling: statistical analysis and performance modeling to understand correlations.

Benchmarking Performance Variability

Quantifying network variability impact on Big Data
We perform a systematic study using the A-H cloud bandwidth distributions: we take these 8 distributions, emulate them on our cluster, and run a series of representative big data applications. We do this on Spark, which has become the de-facto platform for big data analytics.

Cloud network bandwidth emulation
How do we do this? We take each distribution in turn, draw bandwidth values from it (25% of the values in one quartile interval, 25% in the next, and so on), and assign these values to the nodes in our cluster.
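
As an illustration, a minimal sketch of this procedure in Python, assuming SSH access and root rights on the nodes. The quartile boundaries, node names, and the interface name eth0 are made-up placeholders, not the measured A-H distributions; rates are applied with the standard Linux tc token-bucket filter (tc-tbf).

    import random
    import subprocess

    # Hypothetical quartile boundaries (Mbit/s) of one cloud's measured
    # bandwidth distribution; the study uses the real A-H distributions.
    QUARTILES = [(200, 400), (400, 600), (600, 800), (800, 1000)]

    def sample_bandwidths(num_nodes):
        """Draw one bandwidth value per node: 25% of the nodes fall in
        each quartile interval, mirroring the empirical distribution."""
        rates = []
        for i in range(num_nodes):
            lo, hi = QUARTILES[i % len(QUARTILES)]
            rates.append(random.uniform(lo, hi))
        random.shuffle(rates)  # avoid correlating rate with node index
        return rates

    def apply_rate_limit(node, rate_mbit, iface="eth0"):
        """Cap a node's outgoing bandwidth with a token-bucket filter."""
        cmd = ("sudo tc qdisc replace dev {} root tbf rate {:.0f}mbit "
               "burst 32kbit latency 400ms").format(iface, rate_mbit)
        subprocess.run(["ssh", node, cmd], check=True)

    nodes = ["node{:02d}".format(i) for i in range(32)]  # hypothetical names
    for node, rate in zip(nodes, sample_bandwidths(len(nodes))):
        apply_rate_limit(node, rate)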

Big Data Workloads
We use the HiBench benchmark suite (MapReduce-style applications), from which we select 6 representative real-world applications from various domains: Wordcount, Sort, Terasort, Naïve Bayes, K-means, and PageRank. Each application has a different resource-usage profile, which we characterize in the slide's table (++ means high, 0 means medium, -- means low); together they cover a large spectrum of resource usage.
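
To make the experimental procedure concrete, a hypothetical driver loop (not the authors' harness): the application names, the repetition count, and the ./run_hibench.sh wrapper are assumptions for illustration, and the bandwidth configuration stands for the tc-based emulation sketched above.

    import statistics
    import subprocess
    import time

    APPS = ["wordcount", "sort", "terasort", "bayes", "kmeans", "pagerank"]
    DISTRIBUTIONS = list("ABCDEFGH")  # the eight measured distributions
    REPETITIONS = 5  # assumed; repeated runs expose runtime variability

    def configure_cluster_bandwidth(dist):
        """Apply per-node rate limits for one distribution, e.g. with the
        tc-based emulation sketched above (stubbed out here)."""

    def run_app(app):
        """Stand-in launcher; ./run_hibench.sh is a hypothetical wrapper
        around the real HiBench/Spark submit command."""
        subprocess.run(["./run_hibench.sh", app], check=True)

    results = {}
    for dist in DISTRIBUTIONS:
        configure_cluster_bandwidth(dist)
        for app in APPS:
            runtimes = []
            for _ in range(REPETITIONS):
                start = time.time()
                run_app(app)
                runtimes.append(time.time() - start)
            results[(dist, app)] = (statistics.mean(runtimes),
                                    statistics.stdev(runtimes))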

Variable network = Variable Runtime (Terasort)
We ran the Terasort application under each of the 8 bandwidth distributions, plus one experiment on a homogeneous 1 Gbps network. The homogeneous setup shows little to no runtime variability, whereas the heterogeneous networks induce runtime variability as well.

Surprisingly, non-network-intensive Wordcount slowed down
We run the Wordcount application, which is mostly CPU-bound, and compute its slowdown relative to the unlimited 1 Gbps setup. Notice the slowdown induced by two of the cloud setups: bandwidth impacts performance quite a lot, even for an application that is not network-intensive.
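
For clarity, the slowdown metric as a minimal sketch; the runtimes below are made-up placeholder numbers, not measured results.

    def slowdown(runtime_variable_s, runtime_baseline_s):
        """Runtime under an emulated bandwidth distribution divided by the
        runtime on the homogeneous 1 Gbps baseline; > 1 means slower."""
        return runtime_variable_s / runtime_baseline_s

    # Hypothetical example: 312 s under an emulated distribution vs. 240 s
    # on the baseline gives a slowdown of 1.3x.
    print(slowdown(312.0, 240.0))  # 1.3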

Most apps are slowed down on real clouds
All applications suffer quite a significant slowdown, mostly under the A and D bandwidth distributions. The surprising result is that not only the network-intensive applications are slowed down. As a consequence, network variability breaks performance predictability. Moreover, it breaks performance portability: results on cloud A differ significantly from results on cloud F.

Take-home message
Network variability leads to high slowdowns for big data in the cloud, and it also affects performance portability. Surprisingly, even applications that are not network-bound slow down. Future work: in-depth statistical analysis, performance modeling tools, and controlling variability through better scheduling.
Alexandru Uta, a.uta@vu.nl, Vrije Universiteit Amsterdam, Massivizing Computer Systems