Presentation is loading. Please wait.

Presentation is loading. Please wait.

Interactive Analytical Processing in Big Data Systems: A Cross-Industry Study of MapReduce Workloads Jackie.

Similar presentations


Presentation on theme: "Interactive Analytical Processing in Big Data Systems: A Cross-Industry Study of MapReduce Workloads Jackie."— Presentation transcript:

1 Interactive Analytical Processing in Big Data Systems: A Cross-Industry Study of MapReduce Workloads Jackie

2  Present an empirical analysis of seven industrial MapReduce Workload traces over long-durations  Break down each MapReduce workload into three conceptual components: data, temporal, compute patterns Key findings:  A new class of MapReduce workloads for interactive, semi-streaming analysis  A wide range of behavior within this workload  Query-like programmatic frameworks make up a considerable activity in all workloads  Some prior assumptions no longer hold  Construct a TPC-style big data processing benchmark for MapReduce Introduction

3  Hypotheses on workload behavior  Workload traces overview  Data access patterns  Workload variation over time  Computation patterns  Towards a big Data Benchmark  Summary and conclusions Outline

4  For optimizing the underlying storage system  How uniform or skewed are the data accesses?  How much temporal locality exists?  For workload-level provisioning and load shaping  How regular or unpredictable is the cluster load?  How large are the bursts in the workload?  For job-level scheduling and execution planning  What are the common job types?  What are the size, shape, and duration of these jobs?  How frequently does each job type appear? Hypotheses on Workload Behavior

5  For optimizing query-like programming frameworks  What % of cluster load comes from these frameworks?  What are the common uses of each framework?  For performance comparison between systems  How much variation exists between workloads?  Can we distill features of a representative workload? Hypotheses on Workload Behavior

6  Seven workloads: five from Cloudera, two from Facebook  Details about these workloads Workload Traces Overview

7  Answer the following questions  How uniform or skewed are the data accesses?  How much temporal locality exists?  Per-job data size  Skews in access frequency  Access temporal locality Data Access Patterns

8  Most jobs have input, shuffle, output sizes in the MB to GB range  the Facebook workloads' per- job input and shuffle size distributions shift right, while the per-job output size distribution shifts left Per-job Data Sizes

9 Skews in Access Frequency

10

11 Access Temporal Locality

12

13  Answer the following questions  How regular or unpredictable is the cluster load?  How large are the bursts in the workload?  Proceed in four dimensions: job submission counts, I/O, aggregation map and reduce task times, utilization  Weekly time series  Burstiness  Time series correlations Workload Variation Over Time

14 Weekly time series

15  Extend the concept of peak-to- average ratio to quantify burtiness  Compute general nth- percentile-to-median ratio for a workload  Graph this vector of values, with on the x-axis, versus n on y-axis Burstiness

16  Correlation values: jobsSubmitted(t) and dataSizeBytes(t), jobsSubmitted(t) and computeTimeTaskSeconds(t), dataSizeBytes(t) and computeTimeTaskSeconds(t) Time Series Correlations

17  Answer the following questions Related to optimizing query-like programming frameworks:  What % of cluster load comes from these frameworks?  What are the common uses of each framework? Job-level scheduling and execution planning:  What are the common job types?  What are the size, shape, and duration of these jobs?  How frequently does each job type appear?  By job name  By multi-dimensional job behavior  Represent each job as a six-dimensional vector: input size, shuffle size, output size, job duration, map task time, reduce task time Computation Patterns

18 By Job Names

19 By Multi-dimensional Job Behavior

20  Data generation  Process generation  Capture size, shape, and sequence of jobs, the aggregate cluster load over time  Mixing MapReduce and query-like frameworks  Include native MapReduce jobs and jobs created by query-like frameworks  Scaled-down workloads  Reproduce workload behavior at production scale  Empirical models  The workload traces are the model  A true workload perspective  Treat a workload as a steady processing stream involving the superposition of many processing types  Workload suites  Identify as small suite of workload classes that cover a large range of behavior  A stopgap tool Towards A Big Data Benchmark

21  For optimizing the underlying storage system:  Skew in data accesses frequencies range between an 80-1 and 80-8 rule  Temporal locality exists, and 80% of data re-accesses occur on the range of minutes to hours  For workload-level provisioning and load shaping:  The cluster load is bursty and unpredictable  Peak-to-median ratio in cluster load range from 9:1 to 260:1  For job-level scheduling and execution planning:  All workloads contain a range of job types, with the most common being small jobs  These jobs are small in all dimensions compared with other jobs in the same workload. They involve 10s of KB to GB of data, exhibit a range of data patterns between the map and reduce stages, and have durations of 10s of seconds to a few minutes  The small jobs form over 90% of all jobs for all workloads. The other job types appear with a wide range of frequencies Summary and Conclusions

22  For optimizing query-like programming frameworks:  The cluster load that comes from these frameworks is up to 80% and at least 20%  The frameworks are generally used for interactive data exploration and semi-streaming analysis. For Hive, the most commonly used operators are insert and select; from is frequently used in only one workload. Additional tracing at the Hive/Pig/Hbase level is required  For performance comparison between systems:  A wide variation in behavior exists between workloads, as the above data indicates  There is sufficient diversity between workloads that we should be cautious in claiming any behavior as “typical”. Additional workload studies are required Summary and Conclusions

23


Download ppt "Interactive Analytical Processing in Big Data Systems: A Cross-Industry Study of MapReduce Workloads Jackie."

Similar presentations


Ads by Google