Interactive Analytical Processing in Big Data Systems: A Cross-Industry Study of MapReduce Workloads Jackie.

Presentation transcript:


Introduction
- Present an empirical analysis of seven industrial MapReduce workload traces over long durations
- Break down each MapReduce workload into three conceptual components: data, temporal, and compute patterns
- Key findings:
  - A new class of MapReduce workloads for interactive, semi-streaming analysis
  - A wide range of behavior within this workload class
  - Query-like programmatic frameworks account for considerable activity in all workloads
  - Some prior assumptions no longer hold
- Construct a TPC-style big data processing benchmark for MapReduce

Outline
- Hypotheses on workload behavior
- Workload traces overview
- Data access patterns
- Workload variation over time
- Computation patterns
- Towards a big data benchmark
- Summary and conclusions

Hypotheses on Workload Behavior
- For optimizing the underlying storage system:
  - How uniform or skewed are the data accesses?
  - How much temporal locality exists?
- For workload-level provisioning and load shaping:
  - How regular or unpredictable is the cluster load?
  - How large are the bursts in the workload?
- For job-level scheduling and execution planning:
  - What are the common job types?
  - What are the size, shape, and duration of these jobs?
  - How frequently does each job type appear?

Hypotheses on Workload Behavior
- For optimizing query-like programming frameworks:
  - What % of cluster load comes from these frameworks?
  - What are the common uses of each framework?
- For performance comparison between systems:
  - How much variation exists between workloads?
  - Can we distill features of a representative workload?

Workload Traces Overview
- Seven workloads: five from Cloudera, two from Facebook
- Details about these workloads

Data Access Patterns
- Answer the following questions:
  - How uniform or skewed are the data accesses?
  - How much temporal locality exists?
- Per-job data size
- Skews in access frequency
- Access temporal locality

Per-job Data Sizes
- Most jobs have input, shuffle, and output sizes in the MB to GB range
- The Facebook workloads' per-job input and shuffle size distributions shift right, while the per-job output size distribution shifts left

Skews in Access Frequency

Access Temporal Locality

Workload Variation Over Time
- Answer the following questions:
  - How regular or unpredictable is the cluster load?
  - How large are the bursts in the workload?
- Analyze four dimensions: job submission counts, I/O, aggregate map and reduce task times, and utilization
- Weekly time series
- Burstiness
- Time series correlations

Weekly time series

Burstiness
- Extend the concept of peak-to-average ratio to quantify burstiness
- Compute the general nth-percentile-to-median ratio for a workload
- Graph this vector of values, with the nth-percentile-to-median ratio on the x-axis versus n on the y-axis
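The burstiness metric on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the nearest-rank percentile definition, the function names, and the sample `hourly_load` values are all assumptions made for the example.

```python
from statistics import median


def percentile(sorted_values, n):
    """Nearest-rank nth percentile (n in 1..100) of a pre-sorted list."""
    # ceil(n * N / 100) using integer arithmetic only.
    rank = max(1, -(-n * len(sorted_values) // 100))
    return sorted_values[rank - 1]


def burstiness_curve(load, percentiles=range(1, 101)):
    """Return (nth-percentile / median, n) pairs for a load time series."""
    ordered = sorted(load)
    med = median(ordered)
    if med == 0:
        raise ValueError("median load is zero; the ratio is undefined")
    return [(percentile(ordered, n) / med, n) for n in percentiles]


# Hypothetical hourly load samples (for example, task-seconds per hour).
hourly_load = [10, 12, 11, 9, 10, 250, 13, 10, 11, 12]
curve = burstiness_curve(hourly_load)

# The 100th-percentile entry is the peak-to-median ratio.
peak_to_median = curve[-1][0]
```

Plotting the first element of each pair on the x-axis against the second on the y-axis gives the kind of curve the slide describes; a curve that stays close to 1 until high values of n indicates a workload that is mostly steady with rare, large bursts.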

Time Series Correlations
- Correlation values between:
  - jobsSubmitted(t) and dataSizeBytes(t)
  - jobsSubmitted(t) and computeTimeTaskSeconds(t)
  - dataSizeBytes(t) and computeTimeTaskSeconds(t)
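As a rough sketch of how the three pairwise correlations could be computed, the example below uses the Pearson coefficient. The slides do not say which correlation coefficient was used, so Pearson is an assumption, as are the sample values; only the three series names come from the slide.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical hourly samples; the values are invented for illustration.
series = {
    "jobsSubmitted": [5, 8, 6, 20, 7, 9],
    "dataSizeBytes": [1e9, 2e9, 1.5e9, 9e10, 1.2e9, 2.5e9],
    "computeTimeTaskSeconds": [300, 500, 400, 8000, 350, 600],
}

# One coefficient per unordered pair of series, three in total.
correlations = {
    (a, b): pearson(series[a], series[b])
    for a, b in combinations(series, 2)
}
```

A coefficient near 1 for a pair would mean the two measures of load rise and fall together; a low coefficient would mean that, for instance, job counts alone are a poor proxy for data volume or compute time.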

Computation Patterns
- Answer the following questions:
  - Related to optimizing query-like programming frameworks:
    - What % of cluster load comes from these frameworks?
    - What are the common uses of each framework?
  - Job-level scheduling and execution planning:
    - What are the common job types?
    - What are the size, shape, and duration of these jobs?
    - How frequently does each job type appear?
- By job name
- By multi-dimensional job behavior
  - Represent each job as a six-dimensional vector: input size, shuffle size, output size, job duration, map task time, reduce task time
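The six-dimensional job representation can be illustrated with a small sketch. The slide names the six dimensions but not the grouping method, so the k-means loop, the log scaling, the `Job` class, and the sample jobs below are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, astuple
from math import dist, log10


@dataclass(frozen=True)
class Job:
    """One job described by the six dimensions listed on the slide."""
    input_bytes: float
    shuffle_bytes: float
    output_bytes: float
    duration_s: float
    map_task_s: float
    reduce_task_s: float

    def features(self):
        # log10(1 + x) keeps KB-sized and TB-sized jobs on a comparable scale.
        return tuple(log10(1 + v) for v in astuple(self))


def nearest(point, centers):
    """Index of the center closest to the point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: dist(point, centers[i]))


def kmeans(points, centers, iterations=10):
    """Plain Lloyd's algorithm starting from the given centers."""
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else old
            for g, old in zip(groups, centers)
        ]
    return centers, [nearest(p, centers) for p in points]


# Hypothetical jobs: two small ones and two large ones.
jobs = [
    Job(50e3, 0, 20e3, 30, 20, 0),
    Job(80e3, 10e3, 30e3, 45, 30, 5),
    Job(2e12, 5e11, 1e10, 7200, 5e5, 2e5),
    Job(1e12, 3e11, 5e9, 5400, 4e5, 1e5),
]
points = [job.features() for job in jobs]
centers, labels = kmeans(points, centers=[points[0], points[2]])
```

With these inputs the two small jobs end up in one group and the two large jobs in the other. On real traces the number of groups and the starting centers would need to be chosen with more care.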

By Job Names

By Multi-dimensional Job Behavior

Towards A Big Data Benchmark
- Data generation
- Process generation
  - Capture the size, shape, and sequence of jobs, and the aggregate cluster load over time
- Mixing MapReduce and query-like frameworks
  - Include native MapReduce jobs and jobs created by query-like frameworks
- Scaled-down workloads
  - Reproduce workload behavior at production scale
- Empirical models
  - The workload traces are the model
- A true workload perspective
  - Treat a workload as a steady processing stream involving the superposition of many processing types
- Workload suites
  - Identify a small suite of workload classes that cover a large range of behavior
  - A stopgap tool

Summary and Conclusions
- For optimizing the underlying storage system:
  - Skew in data access frequencies ranges between an 80-1 and an 80-8 rule
  - Temporal locality exists, and 80% of data re-accesses occur within minutes to hours
- For workload-level provisioning and load shaping:
  - The cluster load is bursty and unpredictable
  - The peak-to-median ratio in cluster load ranges from 9:1 to 260:1
- For job-level scheduling and execution planning:
  - All workloads contain a range of job types, with the most common being small jobs
  - These jobs are small in all dimensions compared with other jobs in the same workload. They involve 10s of KB to GB of data, exhibit a range of data patterns between the map and reduce stages, and have durations of 10s of seconds to a few minutes
  - The small jobs form over 90% of all jobs for all workloads. The other job types appear with a wide range of frequencies
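To make the "80-1 to 80-8 rule" concrete, the sketch below computes the x in an 80-x rule from per-file access counts. It reads "80-x" as "80% of accesses go to x% of the files", which is an interpretation of the slide; the function name and the sample counts are hypothetical.

```python
def skew_rule(access_counts, share_percent=80):
    """Smallest percentage of files that receives share_percent of all accesses."""
    ordered = sorted(access_counts, reverse=True)
    total = sum(ordered)
    if total == 0:
        raise ValueError("no accesses recorded")
    running = 0
    for files_used, count in enumerate(ordered, start=1):
        running += count
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= share_percent * total:
            return 100.0 * files_used / len(ordered)


# Hypothetical cluster: 2 hot files and 98 files that are read once each.
counts = [450, 450] + [1] * 98
x = skew_rule(counts)  # 2.0, i.e. an "80-2 rule" for this sample
```

A perfectly uniform access pattern would give x = 80, so the reported range of 1 to 8 indicates that a very small fraction of files attracts most of the reads.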

Summary and Conclusions
- For optimizing query-like programming frameworks:
  - These frameworks account for at least 20% and up to 80% of the cluster load
  - The frameworks are generally used for interactive data exploration and semi-streaming analysis. For Hive, the most commonly used operators are "insert" and "select"; "from" is frequently used in only one workload. Additional tracing at the Hive/Pig/HBase level is required
- For performance comparison between systems:
  - A wide variation in behavior exists between workloads, as the above data indicates
  - There is sufficient diversity between workloads that we should be cautious in claiming any behavior as “typical”. Additional workload studies are required