The Cloud Workloads Archive: A Status Report

Presentation transcript:

The Cloud Workloads Archive: A Status Report Special thanks to Ion for this opportunity! Alexandru Iosup Rean Griffith, Andrew Konwinski, Matei Zaharia, Ali Ghodsi, Ion Stoica Parallel and Distributed Systems Group, Delft University of Technology, The Netherlands RADLab, University of California, Berkeley, USA April 22, 2017 Berkeley, CA, USA

About the Team
Speaker: Alexandru Iosup
Recent work in performance:
  The Grid Workloads Archive (Nov 2006)
  The Failure Trace Archive (Nov 2009)
  Analysis of Facebook, Yahoo, and Google data center workloads (2009-2010)
  The Peer-to-Peer Trace Archive (Apr 2010)
  Tools: GrenchMark workload-based grid benchmarking, RAIN
Systems work: Tribler (P2P file sharing), Koala (grid scheduling), POGGI and CAMEO (massively multiplayer online gaming)
Performance evaluation of clouds for scientific computing: EC2 and three others
Team of 15+ active collaborators in NL, AT, RO, US
Happy to be in Berkeley until September

Traces: Sine Qua Non in Computer Systems Research
"My system/method/algorithm is better than yours (on my carefully crafted workload)"
  Unrealistic (trivial): prove that "prioritize jobs from users whose name starts with A" is a good scheduling policy
  Realistic? 85% of jobs are short, 15% are long
  A major problem in computer systems research
Workload trace = recording of real activity from a (real) system, often as a sequence of jobs/requests submitted by users for execution
  Main use: compare and cross-validate new job and resource management techniques and algorithms
  Major problem: obtaining and using real workload traces
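To make the definition of a workload trace concrete, here is a minimal sketch of a trace as a sequence of job records, together with the kind of "share of short jobs" figure quoted on the slide. The `Job` fields, the 60-second threshold, and the sample values are illustrative assumptions, not part of any archive format.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical minimal record; real traces carry many more fields.
    job_id: int
    submit_time: float  # seconds since the start of the trace
    run_time: float     # seconds


def short_job_share(trace, threshold_s):
    """Return the fraction of jobs whose run time is below threshold_s."""
    if not trace:
        return 0.0
    short = sum(1 for job in trace if job.run_time < threshold_s)
    return short / len(trace)


# Made-up sample trace: three short jobs and one long job.
trace = [
    Job(1, 0.0, 30.0),
    Job(2, 5.0, 45.0),
    Job(3, 9.0, 7200.0),
    Job(4, 12.0, 20.0),
]
share = short_job_share(trace, threshold_s=60.0)
print(f"short jobs: {share:.0%}")
```

Running the same function over a real trace rather than a hand-crafted one is what lets a claim such as "85% of jobs are short" be checked instead of assumed.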

Previous Data Sharing Efforts
Critical datasets in computer science:
  Grid Workloads Archive
  Failure Trace Archive
  Peer-to-Peer Trace Archive
  Game Trace Archive (soon)
  … PWA, ITA, CRAWDAD, …
1,000s of scientists
From theory to practice
Research question: Are data center workloads unique? (vs GWA, PWA, …)
[Figure: dataset size (1GB, 10GB, 100GB, 1TB, 1TB/yr) by year ('06, '09, '10, '11); labels include P2PTA and GamTA]

Agenda
  Introduction & Motivation
  The Cloud Workloads Archive: What's in a Name?
  Format and Tools
  Contents
  Analysis & Modeling
  Applications
  Take Home Message

The Cloud Workloads Archive (CWA): What's in a Name?
CWA = public collection of cloud/data center workload traces and of tools to process these traces. It allows us to:
  Compare and cross-validate new job and resource management techniques and algorithms across various workload traces
  Determine which (part of a) trace is most interesting for a specific job and resource management technique or algorithm
  Design a general model for data center workloads, and validate it with various real workload traces
  Evaluate the generality of a particular workload trace, to determine if results are biased towards a particular trace
  Analyze the evolution of workload characteristics across long timescales, both intra- and inter-trace

One Format Fits Them All
Flat format
Jobs and tasks: Summary (20 unique data fields) and Detail (60 fields)
Categories of information:
  Shared with GWA, PWA: time, disk, memory, net
  Jobs/tasks that change resource consumption profile
  MapReduce-specific (two-thirds of the data fields)
CWJ, CWJD, CWT, CWTD
A. Iosup, R. Griffith, A. Konwinski, M. Zaharia, A. Ghodsi, I. Stoica, Data Format for the Cloud Workloads Archive, v.3, 13/07/10

CWA Contents: Large-Scale Workloads

Trace ID  System      Size (J/T/Obs)  Period       Notes
CWA-01    Facebook    1.1M/-/-        5m/2009      Time & IO
CWA-02    Yahoo M     28K/28M/-       20d/2009     ~Full detail
CWA-03    Facebook 2  61K/10M/-       10d/2009     Full detail
CWA-04    Facebook 3  ?/?/-           10d/01-2010
CWA-05    Facebook 4                  3m/02+2010
CWA-06    Google 2    25 Aug 2010
CWA-07    eBay        23 Sep 2010
CWA-08    Twitter     Need help!
CWA-09?   Google      9K/177K/4M      7h/2009      Coarse, Period

Tools:
  Convert to CWA format
  Analyze and model automatically -> Report

Types of Analysis
Analysis focus:
  Time-related: run, wait, and response time; bounded slowdown
  Structure-related: number of tasks
  IO-related: IO sizes and ratios
  Status-related, system utilization-related: counts/ratios
Analysis type:
  Basic statistics
  Evolution over time
  Correlations
Data breakdown:
  Overall
  By task type (M/R)
  By application type (ID)
  By user (ID)
  By duration (short)
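The time-related metrics listed here can be illustrated with a short sketch. The slide does not define bounded slowdown, so the code uses the definition common in the job-scheduling literature: response time divided by run time, with the run time bounded below by a threshold and the result bounded below by 1. The 10-second default threshold and the function names are assumptions.

```python
def response_time(wait_s, run_s):
    """Response time is the time spent waiting plus the time spent running."""
    return wait_s + run_s


def bounded_slowdown(wait_s, run_s, tau_s=10.0):
    """Bounded slowdown: max(response / max(run, tau), 1).

    tau_s (an assumed default) keeps very short jobs from producing
    huge slowdown values that would dominate averages.
    """
    return max(response_time(wait_s, run_s) / max(run_s, tau_s), 1.0)


# A 10 s job that waited 90 s has a slowdown of 10.
slow_queue = bounded_slowdown(wait_s=90.0, run_s=10.0)
# A 1 s job that waited 5 s is clamped to 1 because of the threshold.
tiny_job = bounded_slowdown(wait_s=5.0, run_s=1.0)
# A job that never waited always has a slowdown of 1.
no_wait = bounded_slowdown(wait_s=0.0, run_s=100.0)
print(slow_queue, tiny_job, no_wait)
```

The threshold matters most for workloads dominated by very short tasks, which is the regime the MapReduce traces in this talk fall into.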

Types of Analysis: System Utilization, Over Time, By Run Time
  Also 1h, 10mins, … counting intervals
  Study short-/long-range dependence (self-similarity)
  Also job count, running/waiting counts, …
  Study system utilization behavior
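As a rough illustration of utilization over time with counting intervals, the sketch below bins task activity into fixed-size intervals. It assumes each task occupies exactly one slot between its start and end times and that the system has a fixed number of slots; the function name, parameters, and sample values are hypothetical.

```python
import math


def utilization_per_interval(tasks, capacity_slots, interval_s, horizon_s):
    """Return the utilization (0..1) of each counting interval.

    tasks: list of (start_s, end_s) pairs; each task is assumed to use one slot.
    """
    n_bins = math.ceil(horizon_s / interval_s)
    result = []
    for b in range(n_bins):
        lo = b * interval_s
        hi = min(lo + interval_s, horizon_s)
        # Total slot-seconds of task activity that overlap this interval.
        busy = sum(max(0.0, min(end, hi) - max(start, lo)) for start, end in tasks)
        result.append(busy / (capacity_slots * (hi - lo)))
    return result


# Made-up example: three tasks on a 2-slot system, 10-minute intervals.
tasks = [(0.0, 600.0), (300.0, 900.0), (1200.0, 1500.0)]
utilization = utilization_per_interval(
    tasks, capacity_slots=2, interval_s=600.0, horizon_s=1800.0
)
print(utilization)
```

Re-running with `interval_s=3600.0` or another interval size gives the same series at a coarser or finer granularity, which is the starting point for studying dependence across time scales.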

Modeling Process
Steps: well-known probability distributions -> MLE to fit -> goodness-of-fit
Well-known probability distributions: Normal, Exp, LogNormal, Gamma, Weibull, Gen-Pareto
MLE to fit: fit a known distribution to the empirical distribution -> parameters
Goodness-of-fit: assess how good the fit is; select the best-fitting distribution
  Kolmogorov-Smirnov: sensitive to body of distribution + D stat
  Anderson-Darling: sensitive to tails of distribution
  Hybrid method*: works for very large populations
*Kondo et al., Failure Trace Archive, CCGrid'10, Best Paper Award.
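The fit-then-test loop described on this slide can be sketched with the standard library alone. This is a simplified illustration, not the archive's tooling: it covers only two of the listed distributions (Exp and LogNormal, whose maximum-likelihood estimates have closed forms), uses only the Kolmogorov-Smirnov D statistic, and runs on synthetic data generated with an assumed seed and parameters.

```python
import math
import random


def fit_exponential(samples):
    """MLE of the exponential rate: 1 / sample mean."""
    return len(samples) / sum(samples)


def fit_lognormal(samples):
    """MLE of LogNormal parameters: mean and std of the log-samples."""
    logs = [math.log(x) for x in samples]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    return mu, sigma


def exponential_cdf(x, rate):
    return 1.0 - math.exp(-rate * x) if x > 0 else 0.0


def lognormal_cdf(x, mu, sigma):
    if x <= 0:
        return 0.0
    z = (math.log(x) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))


def ks_statistic(samples, cdf):
    """Largest gap between the empirical CDF and the fitted CDF."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


# Synthetic "run times" (assumed parameters), standing in for a real trace.
rng = random.Random(42)
run_times = [rng.lognormvariate(4.0, 1.5) for _ in range(2000)]

rate = fit_exponential(run_times)
mu, sigma = fit_lognormal(run_times)
d_stats = {
    "exponential": ks_statistic(run_times, lambda x: exponential_cdf(x, rate)),
    "lognormal": ks_statistic(run_times, lambda x: lognormal_cdf(x, mu, sigma)),
}
best_name = min(d_stats, key=d_stats.get)
print(best_name, d_stats)
```

Selecting the candidate with the smallest D statistic is one simple selection rule; for very large populations, plain goodness-of-fit tests tend to reject every candidate, which is the situation the hybrid method cited on the slide is meant to address.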

Main Results: Basic Stats

Trace ID  TRunTime [s]  #Tasks/Job  Pk.Arr.Rate/D  # users
CWA-01    165J          n/a         21KJ/-T
CWA-02    512/80med     901/712Map  6KJ/3.2MT
CWA-03    433/86med     153/143Map  8KJ/2MT        18
GWA-T1    370           5-20        -/20KT         332
GWA-T3    89,274                    -/8KT          387
GWA-T6    14,599                    -/22.5KT       206
GWA-T10   31,964                    -/1.6KTph      216
GWA-T11   8,971                     -/22KTph       412

MapReduce vs grid workloads [vs parallel production environments]:
  Massive short tasks vs many long tasks vs few very long tasks
  Fewer users for MapReduce environments?
TODO: Analyse amounts per core

Applications
Mesos running mixtures of workloads
  Workloads: MPI, MapReduce, grid, …
  Find bottlenecks
  Find workloads that are particularly difficult to run
  Improve the system!
  Status: in progress, using cluster in Finland (Petri Savolainen)
All the applications typical of trace-based work: design, validation, and comparison of algorithms, methods, and systems.

Take Home Message
Cloud Workloads Archive
  Datasets
  Tools to convert, analyze, and model the datasets
  Need your help to collect more traces
Converted and analyzed three MapReduce workloads
  Different from grid and parallel production environment workloads (ask about additional proof and let me show a couple more slides)
  Invariants?
Applications
  1: Model of Cloud/MapReduce workloads
  2: Test and improve Mesos

Continuing Our Collaboration
Scheduling mixtures of grid/HPC/cloud workloads
  Scheduling and resource management in practice
  Modeling aspects of cloud infrastructure and workloads
Condor on top of Mesos
Massively Social Gaming and Mesos
  Step 1: Game analytics and social network analysis in Mesos
…

Thank you! Questions? Observations?
Alex Iosup, Rean Griffith, Andrew Konwinski, Matei Zaharia, Ali Ghodsi, Ion Stoica
email: A.Iosup@tudelft.nl
Thanks to all: AliG, Andrew, AndyK, Ari, Beth, Blaine, David, Ion, Justin, Lucian, Matei, Petri, Rean, Tim, …
More information:
  The Grid Workloads Archive: gwa.ewi.tudelft.nl
  The Failure Trace Archive: fta.inria.fr
  The GrenchMark perf. eval. tool: grenchmark.st.ewi.tudelft.nl
  Cloud research: www.st.ewi.tudelft.nl/~iosup/research_cloud.html
  See the PDS publication database at: www.pds.twi.tudelft.nl/
Big thanks to our collaborators: U. Wisc.-Madison, U Chicago, U Dortmund, U Innsbruck, LRI/INRIA Paris, INRIA Grenoble, U Leiden, Politehnica University of Bucharest, Technion, …

Additional Slides

Main Results: Basic Stats

Trace ID  Total IO [MB]  Rd. [MB]  Wr [%]  HDFS Wr [MB]
CWA-01    10,934         6,805     38%     1,538
CWA-02    75,546         47,539    37%     8,563
CWA-03    -
GWA12.1   469            174       63%     n/a
GWA12.2   144            114       21%
GWA12.3   161            130       19%
GWA12.4   389            33        92%
GWA12.5   330            31        91%

MapReduce vs grid workloads:
  IO-intensive vs compute-intensive
  Constant Wr [%] of ~40% of IO for MapReduce traces?
TODO: More MapReduce traces to validate findings
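The slide does not say how the Wr [%] column is computed. The numbers appear consistent with the write share being (Total IO - Rd.) / Total IO, rounded to a whole percent. The sketch below checks that reading against the rows that list all three values; the formula is an inference from the table, not a documented definition, and the function name is hypothetical.

```python
def write_share_percent(total_mb, read_mb):
    """Write share of total IO in whole percent, assuming writes = total - reads."""
    return round(100.0 * (total_mb - read_mb) / total_mb)


# (Total IO [MB], Rd. [MB]) per trace, copied from the table above.
io_rows = {
    "CWA-01": (10934, 6805),
    "CWA-02": (75546, 47539),
    "GWA12.1": (469, 174),
    "GWA12.2": (144, 114),
    "GWA12.3": (161, 130),
    "GWA12.4": (389, 33),
    "GWA12.5": (330, 31),
}
computed = {trace: write_share_percent(total, read) for trace, (total, read) in io_rows.items()}
print(computed)
```

Under this reading the two MapReduce traces land at 38% and 37%, which is the roughly constant ~40% write share the slide asks about, while the grid traces range from 19% to 92%.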

Main Results
Two-mode trace: do NOT analyze it as a whole