Starfish: A Self-tuning System for Big Data Analytics

Motivation
- As the Hadoop ecosystem has evolved, it has become accessible to a wider and less technical audience
  – E.g., data analysts as opposed to programmers
- However, creating optimal configurations by hand is entirely infeasible
- Starfish seeks to automatically provide an optimal configuration based on the characteristics of the submitted workloads, workflows, and MapReduce jobs

Starfish Architecture
- What-if Engine: answers hypothetical queries from the other Starfish modules
- Just-in-Time Optimizer: recommends job configuration parameter values
- Sampler: generates approximate job profiles based on brief sample executions
- Profiler: monitors the execution of jobs to generate job profiles
- Workflow-aware Scheduler: determines the optimal data layout
- Workload Optimizer: transforms submitted workflows into logically equivalent, but optimized, workflows
- Elastisizer: given a workload and a set of constraints, determines an optimal cluster deployment

Lastword
- Lastword is Starfish's language for reasoning about data and workloads
- Analogous to Microsoft's .NET
  – Not actually used by humans
  – Allows data analysts to write workflows in higher-level languages (e.g., Pig, Hive)

Job-Level Tuning
- In Hadoop, MapReduce jobs are controlled by over 190 configuration parameters
- There is no one-size-fits-all configuration
- Bad performance is often due to poorly tuned configurations
- Even rule-of-thumb configurations can result in very bad performance

Job-Level Tuning
- Just-in-Time Optimizer
  – Automatically selects an optimal execution technique when a job is submitted
  – Assisted by information from the Profiler and Sampler
- Profiler
  – Performs dynamic instrumentation using BTrace (see the probe sketch below)
  – Generates job profiles
- Sampler
  – Collects statistics about input, intermediate, and output data
  – Samples the execution of MapReduce jobs, which enables the Profiler to generate approximate job profiles without complete execution
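For illustration, a BTrace probe in the spirit of the Profiler could look like the minimal sketch below. The traced Hadoop class and method (org.apache.hadoop.mapred.MapTask.run) and the single probe point are assumptions chosen for this example; the real Profiler collects far more detailed, per-phase information.

    import com.sun.btrace.annotations.BTrace;
    import com.sun.btrace.annotations.Duration;
    import com.sun.btrace.annotations.Kind;
    import com.sun.btrace.annotations.Location;
    import com.sun.btrace.annotations.OnMethod;
    import static com.sun.btrace.BTraceUtils.println;
    import static com.sun.btrace.BTraceUtils.str;
    import static com.sun.btrace.BTraceUtils.strcat;

    // Sketch of a BTrace probe: report the wall-clock time of each map task's run().
    // The traced class/method are illustrative assumptions, not Starfish's actual probes.
    @BTrace
    public class MapTaskTimer {
        @OnMethod(clazz = "org.apache.hadoop.mapred.MapTask",
                  method = "run",
                  location = @Location(Kind.RETURN))
        public static void onMapTaskReturn(@Duration long durationNanos) {
            // @Duration supplies the method's duration in nanoseconds at return.
            println(strcat("MapTask.run() took ms: ", str(durationNanos / 1000000)));
        }
    }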

Job Profiles
- Capture information at the task and subtask level
  – Map phase: Reading, Map Processing, Spilling, and Merging
  – Reduce phase: Shuffling, Sorting, Reduce Processing, and Writing
- Expose three views
  – The timings view details how much time is spent in each subtask
  – The data-flow view provides the amount of data processed at each subtask
  – The resource-level view presents usage trends of various resources (CPU, memory, I/O, and network) over the execution of the subtasks
- Using the data provided through job profiles, optimal configurations can be calculated (a data-structure sketch follows)
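To make the three views concrete, here is a hypothetical data-structure sketch of a job profile; the class, enum, and field names are invented for illustration and do not correspond to Starfish's actual profile representation.

    import java.util.EnumMap;
    import java.util.Map;

    // Hypothetical sketch of a job profile and its three views. Class, enum, and
    // field names are invented for illustration, not Starfish's actual classes.
    public class JobProfileSketch {

        enum Subtask { READ, MAP, SPILL, MERGE, SHUFFLE, SORT, REDUCE, WRITE }

        enum Resource { CPU, MEMORY, IO, NETWORK }

        // Timings view: time spent (in ms) in each subtask.
        final Map<Subtask, Long> timingsMs = new EnumMap<>(Subtask.class);

        // Data-flow view: amount of data (in bytes) processed by each subtask.
        final Map<Subtask, Long> bytesProcessed = new EnumMap<>(Subtask.class);

        // Resource-level view: usage samples per resource over the task's execution.
        final Map<Resource, double[]> usageOverTime = new EnumMap<>(Resource.class);
    }

A structure like this is what the What-if Engine would consult when estimating how the same job behaves under a different configuration.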

Job Profiles (figure: example job profile)

Job-Level Tuning
- io.sort.spill.percent: buffer utilization % at which data begins spilling to disk
- io.sort.mb: total amount of buffer memory to use while sorting files, in MB
- io.sort.record.percent: % of io.sort.mb dedicated to tracking record boundaries
- io.sort.factor: # of streams to merge at once while sorting files
- mapred.reduce.tasks: default # of reduce tasks per job
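As a minimal sketch, assuming the old-style Hadoop 0.20/1.x parameter names listed above, these knobs would be set on a job roughly as follows; the values are arbitrary examples, not recommendations.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapred.JobConf;

    // Sketch: setting the knobs listed above through Hadoop's Configuration API
    // (old-style 0.20/1.x parameter names). Values are arbitrary examples only.
    public class ConfigureJobSketch {
        public static void main(String[] args) {
            JobConf conf = new JobConf(new Configuration());
            conf.setFloat("io.sort.spill.percent", 0.80f);  // buffer utilization % at which spilling to disk starts
            conf.setInt("io.sort.mb", 200);                 // total sort buffer memory, in MB
            conf.setFloat("io.sort.record.percent", 0.05f); // share of io.sort.mb tracking record boundaries
            conf.setInt("io.sort.factor", 10);              // # of streams merged at once while sorting
            conf.setNumReduceTasks(4);                      // # of reduce tasks for this job (mapred.reduce.tasks)
            System.out.println("io.sort.mb = " + conf.get("io.sort.mb"));
        }
    }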

Job-Level Tuning (figure slides: job behavior over pairs of these parameters, e.g., io.sort.mb with io.sort.record.percent, and mapred.reduce.tasks with io.sort.record.percent)

Workflow-Level Tuning
- Unbalanced data layouts can result in dramatically degraded performance
- The default HDFS replication scheme, in conjunction with data-locality-aware task scheduling, can result in overloaded servers

Workflow-Level Tuning (figure: unbalanced data layout example)

Workflow-aware Scheduler
- Coordinates with
  – the What-if Engine to determine the optimal data layout
  – the Just-in-Time Optimizer to determine the job execution schedule
- Performs a cost-based search over the following space of choices (see the sketch below):
  – Block placement policy (default HDFS Local Write, or a custom Round-Robin policy)
  – Replication factor
  – Size of file blocks
  – Whether or not to compress output
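The sketch below is a hypothetical rendering of that cost-based search; the Layout record, the candidate values, and the predictCost() stand-in for a What-if Engine call are all invented for illustration and are not Starfish's actual API.

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch of a cost-based search over data-layout choices.
    // predictCost() stands in for a What-if Engine call; the names and the toy
    // cost formula are invented for illustration, not Starfish's actual API.
    public class LayoutSearchSketch {

        record Layout(String placementPolicy, int replication, long blockSizeBytes, boolean compressOutput) {}

        // Stand-in for the What-if Engine: predicted workflow running time for a layout.
        static double predictCost(Layout l) {
            // Toy placeholder formula so the example runs deterministically.
            return 100.0 + 5.0 * l.replication()
                    + l.blockSizeBytes() / (double) (256L << 20)
                    - (l.compressOutput() ? 10.0 : 0.0)
                    - ("RoundRobin".equals(l.placementPolicy()) ? 15.0 : 0.0);
        }

        public static void main(String[] args) {
            // Enumerate the search space: placement policy x replication x block size x compression.
            List<Layout> candidates = new ArrayList<>();
            for (String policy : new String[] {"LocalWrite", "RoundRobin"})
                for (int repl : new int[] {1, 2, 3})
                    for (long blockSize : new long[] {64L << 20, 128L << 20, 256L << 20})
                        for (boolean compress : new boolean[] {false, true})
                            candidates.add(new Layout(policy, repl, blockSize, compress));

            // Pick the candidate with the lowest predicted cost.
            Layout best = null;
            double bestCost = Double.MAX_VALUE;
            for (Layout l : candidates) {
                double cost = predictCost(l);
                if (cost < bestCost) { bestCost = cost; best = l; }
            }
            System.out.println("Chosen layout: " + best + " (predicted cost " + bestCost + ")");
        }
    }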

Workload-Level Tuning
- Starfish implements a Workload Optimizer
  – Translates workloads submitted to the system into equivalent, but optimized, collections of workflows
- Optimizes in three areas
  – Data-flow sharing
  – Materialization
  – Reorganization

Example Workload

Workload-Level Tuning
- Starfish Elastisizer
  – Given multiple constraints, provides an optimal cluster configuration (e.g., complete in a given amount of time while minimizing monetary cost)
  – Performs a cost-based search over various cluster configurations (see the sketch below)
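A hypothetical sketch of that kind of constrained search is below: keep only the candidate clusters predicted to meet a deadline, then pick the cheapest. The instance types, prices, and prediction formulas are invented for illustration and are not the Elastisizer's actual models.

    import java.util.List;

    // Hypothetical sketch of an Elastisizer-style search: among candidate cluster
    // configurations, keep those predicted to meet a deadline and pick the cheapest.
    // Instance types, prices, and prediction formulas are invented for illustration.
    public class ClusterSizingSketch {

        record Cluster(String instanceType, int nodes, double pricePerNodeHour) {}

        // Stand-in for a What-if Engine prediction of workload running time (hours).
        static double predictHours(Cluster c) {
            double speedup = "m1.large".equals(c.instanceType()) ? 2.0 : 1.0;  // toy model
            return 40.0 / (c.nodes() * speedup);
        }

        // Predicted monetary cost: runtime * nodes * hourly price per node.
        static double predictDollars(Cluster c) {
            return predictHours(c) * c.nodes() * c.pricePerNodeHour();
        }

        public static void main(String[] args) {
            double deadlineHours = 2.0;
            List<Cluster> candidates = List.of(
                    new Cluster("m1.small", 10, 0.08),
                    new Cluster("m1.small", 40, 0.08),
                    new Cluster("m1.large", 10, 0.32),
                    new Cluster("m1.large", 20, 0.32));

            Cluster best = null;
            for (Cluster c : candidates) {
                if (predictHours(c) > deadlineHours) continue;  // violates the time constraint
                if (best == null || predictDollars(c) < predictDollars(best)) best = c;
            }
            System.out.println(best == null ? "no feasible cluster" : "Chosen: " + best);
        }
    }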

Feedback and Discussion
- People seem to appreciate the transparent nature and the cross-layer optimizations
- How effective might the Sampler be?
- Is the motivation strong enough?
  – The unbalanced data layout example is a bit contrived
- Missing: an analysis of the overhead introduced by Starfish
- The paper leaves out a lot of details