Meeting Service Level Objectives of Pig Programs Zhuoyao Zhang, Ludmila Cherkasova, Abhishek Verma, Boon Thau Loo University of Pennsylvania Hewlett-Packard Labs


Meeting Service Level Objectives of Pig Programs Zhuoyao Zhang, Ludmila Cherkasova, Abhishek Verma, Boon Thau Loo University of Pennsylvania Hewlett-Packard Labs

Cloud Environment
Advantages
▫Large amount of resources
▫Elasticity
▫Pay-as-you-go pricing model
Challenges
▫Distributed resources
▫Error-prone

MapReduce and Pig
MapReduce: a simple and fault-tolerant framework for data processing in the cloud
Pig
▫An advanced MapReduce-based platform
▫Widely used: Yahoo!, Twitter, LinkedIn
▫Pig Latin: a high-level declarative language for expressing data analysis tasks as Pig programs
[Figure: a Pig program compiled into a DAG of MapReduce jobs j1–j7]

Motivation
Latency-sensitive applications
▫Personalized advertising
▫Spam and fraud detection
▫Real-time log analysis
How many resources does an application need to meet its deadline?

Contributions
Performance modeling for Pig programs
▫Given a Pig program, estimate its completion time as a function of the assigned resources
Deadline-driven resource allocation estimates for Pig programs
▫Given a completion time target, determine the amount of resources a Pig program needs to achieve it

Outline Introduction Building block ▫Performance model for single MapReduce jobs Resource allocation for Pig programs Evaluation Conclusion and ongoing work

Theoretical Makespan Bounds
Bounds-based makespan estimates
▫n tasks, k servers
▫avg: average duration of the n tasks
▫max: maximum duration of the n tasks
Lower bound: T_low = (n · avg) / k
Upper bound: T_up = ((n − 1) · avg) / k + max

Illustration
Schedule 1: Makespan = 4, which matches the lower bound
Schedule 2: Makespan = 7, which matches the upper bound
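The bounds and the illustration above can be sketched as follows. The task durations are hypothetical (they are not the ones from the slide's figure), chosen so that a greedy schedule lands between the two bounds:

```python
import heapq

def makespan_bounds(durations, k):
    # Lower bound: perfect load balance across k servers; upper bound:
    # (n - 1) tasks balanced, plus one worst-case straggler task.
    n = len(durations)
    avg, mx = sum(durations) / n, max(durations)
    return n * avg / k, (n - 1) * avg / k + mx

def greedy_makespan(durations, k):
    # List scheduling: each task goes to the server that frees up first.
    servers = [0.0] * k
    heapq.heapify(servers)
    for d in durations:
        heapq.heappush(servers, heapq.heappop(servers) + d)
    return max(servers)

tasks = [2, 2, 1, 3]                     # hypothetical task durations
low, up = makespan_bounds(tasks, k=2)    # (4.0, 6.0)
span = greedy_makespan(tasks, k=2)       # 5.0, within [4.0, 6.0]
```

Any list schedule of these tasks falls between the two bounds, which is what makes them usable as completion time estimates.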

Estimate Completion Time for a Single MR Job
Estimate bounds on the job completion time based on the job profile
▫Most production jobs are executed routinely on new data sets
▫The job profile is built from previous runs
 Map stage: M_avg, M_max, AvgInputSize, Selectivity
 Reduce stage: Sh_avg, Sh_max, R_avg, R_max, Selectivity
▫Predict the completion time of future runs using the profile

Estimate Completion Time for a Single MR Job
Estimate bounds on the duration of the map and reduce stages
Map stage duration depends on:
▫N_M -- the number of map tasks
▫S_M -- the number of map slots
Reduce stage duration depends on:
▫N_R -- the number of reduce tasks
▫S_R -- the number of reduce slots
Job duration T_J^low, T_J^up, T_J^avg
▫Sum of the map and reduce stage durations
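A minimal sketch of this estimate, applying the makespan bounds per stage. The dictionary keys and the profile numbers below are hypothetical placeholders for the slide's profile statistics (M_avg, M_max, Sh_avg, Sh_max, R_avg, R_max):

```python
def stage_bounds(n_tasks, n_slots, t_avg, t_max):
    # Makespan bounds for one stage: n_tasks tasks on n_slots slots.
    low = n_tasks * t_avg / n_slots
    up = (n_tasks - 1) * t_avg / n_slots + t_max
    return low, up

def job_completion_bounds(profile, s_m, s_r):
    # Map stage runs on the map slots; the shuffle and reduce phases
    # both run on the reduce slots.
    m = stage_bounds(profile["n_map"], s_m, profile["m_avg"], profile["m_max"])
    sh = stage_bounds(profile["n_red"], s_r, profile["sh_avg"], profile["sh_max"])
    r = stage_bounds(profile["n_red"], s_r, profile["r_avg"], profile["r_max"])
    t_low = m[0] + sh[0] + r[0]
    t_up = m[1] + sh[1] + r[1]
    return t_low, t_up, (t_low + t_up) / 2   # T_J^low, T_J^up, T_J^avg

profile = {"n_map": 60, "m_avg": 30, "m_max": 40,   # hypothetical numbers
           "n_red": 30, "sh_avg": 10, "sh_max": 15,
           "r_avg": 20, "r_max": 25}
t_low, t_up, t_avg = job_completion_bounds(profile, s_m=20, s_r=10)
```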

Resource Allocation for a Single MR Job
Given a deadline D and the job profile, find the minimal resources to complete the job within D
▫The number of map/reduce tasks is given
▫The statistics come from the job profile
▫Find the values of S_M^J and S_R^J that minimize S_M^J + S_R^J, using Lagrange multipliers
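This optimization has a closed form. If the completion time estimate is written as T(S_M, S_R) = A/S_M + B/S_R + C, where A, B, and C aggregate the profile statistics, then minimizing S_M + S_R subject to T ≤ D with Lagrange multipliers gives the sketch below; the coefficients and the deadline are hypothetical numbers:

```python
import math

def min_slots_for_deadline(a, b, c, deadline):
    # Minimize S_M + S_R subject to a/S_M + b/S_R + c <= deadline.
    # Lagrange multipliers give  S_M = (a + sqrt(a*b)) / (D - c)
    #                            S_R = (b + sqrt(a*b)) / (D - c)
    if deadline <= c:
        raise ValueError("deadline below the fixed cost; not achievable")
    root = math.sqrt(a * b)
    s_m = (a + root) / (deadline - c)
    s_r = (b + root) / (deadline - c)
    return math.ceil(s_m), math.ceil(s_r)   # slots are integers, round up

# 100/6 + 25/3 + 5 = 30, so this allocation meets the deadline exactly
min_slots_for_deadline(a=100, b=25, c=5, deadline=30)   # (6, 3)
```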

Outline Introduction Building block ▫Performance model for single MapReduce jobs Resource allocation for Pig programs Evaluation Conclusion and ongoing work

Performance Model for Pig Programs
Let P = {J_1, J_2, ..., J_N}; extract the job profile of each job contained in P
▫Assign a unique name to each job within a program
The program completion time is the sum of the completion times of all the jobs contained in P
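Under this model the program-level estimate is simply a sum over the per-job bounds, since the jobs in the plan execute one after another. A minimal sketch with hypothetical per-job numbers:

```python
def program_completion_bounds(job_bounds):
    # P = {J1, ..., JN}: the program's bounds are the sums of its
    # jobs' (t_low, t_up) bounds, as the jobs run sequentially.
    t_low = sum(lo for lo, _ in job_bounds)
    t_up = sum(up for _, up in job_bounds)
    return t_low, t_up

program_completion_bounds([(180.0, 255.5), (60.0, 90.0)])   # (240.0, 345.5)
```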

Resource Allocation for Pig Programs
Possible strategy: find an appropriate pair of map and reduce slots for each job in the program
Problem: difficult to implement and manage by the scheduler

Resource Allocation for Pig Programs
A simpler and more elegant solution
▫Allocate the same set of resources to the entire program instead of to each job
Rewrite the previous equations for the whole program and find the minimum set of map and reduce slots (S_M^P, S_R^P) for the entire Pig program
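If each job J_i contributes A_i/S_M + B_i/S_R + C_i to the program's completion time, then with one shared slot pair the per-job coefficients simply sum, and the single-job closed form applies unchanged. A sketch under that assumption, with hypothetical coefficients:

```python
import math

def program_slots_for_deadline(job_coeffs, deadline):
    # job_coeffs: list of (a_i, b_i, c_i) per job. With one shared pair,
    # sum_i (a_i/S_M + b_i/S_R + c_i) = A/S_M + B/S_R + C, so the
    # single-job Lagrange solution is reused on aggregated coefficients.
    a = sum(j[0] for j in job_coeffs)
    b = sum(j[1] for j in job_coeffs)
    c = sum(j[2] for j in job_coeffs)
    if deadline <= c:
        raise ValueError("deadline not achievable")
    root = math.sqrt(a * b)
    return (math.ceil((a + root) / (deadline - c)),
            math.ceil((b + root) / (deadline - c)))

program_slots_for_deadline([(60, 15, 2), (40, 10, 3)], deadline=30)   # (6, 3)
```

The scheduler then has to manage only one (S_M^P, S_R^P) pair per program, which is the simplification the slide argues for.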

Experiment Setup
66-node cluster across 2 racks
▫4 AMD 2.39 GHz cores
▫8 GB RAM
▫two 160 GB hard disks
Configuration
▫1 JobTracker, 1 NameNode, 64 worker nodes
▫2 map slots and 1 reduce slot per node

Benchmark
PigMix benchmark
▫17 programs
▫8 tables as the input data
Dataset
▫Test dataset: generated with the PigMix data generator; total size around 1 TB
▫Experimental dataset: same layout as the test dataset, 20% larger in size

Model Accuracy
How well does our performance model capture Pig program completion times?
Normalized results for predicted and measured completion times

Meeting Deadlines
Do we meet deadlines with our resource allocation model?
PigMix executed on the experimental dataset: do we meet the deadlines?

Conclusion
▫The performance model accurately estimates the completion time of MapReduce workflows
▫It enables automatic resource provisioning for MapReduce workflows with deadlines
Ongoing work
▫Refine the performance model for workflows with concurrent jobs
▫Incorporate failure scenarios into the current model

Thank you