ReStore: Reusing Results of MapReduce Jobs


ReStore: Reusing Results of MapReduce Jobs Yang Zheng

Outline
- Introduction to MapReduce
- Overview of ReStore
- Implementation Details
- Experiments
Key concepts: (1) MapReduce final results vs. intermediate results; (2) MapReduce jobs vs. sub-jobs.

Introduction to MapReduce This is the workflow I drew for the semi-join algorithm from the paper "A Comparison of Join Algorithms for Log Processing in MapReduce", which Devin presented earlier. I think this figure illustrates well what a MapReduce job, an intermediate result, and a final result are. The query behind this workflow joins a log table and a customer table on CustomerID, in a situation where only a small fraction of the customers are referenced by the log table.
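A minimal in-memory sketch of that semi-join idea (hypothetical data, not the paper's actual jobs): the first job extracts the unique CustomerIDs referenced by the log (an intermediate result), and the second job filters the customer table with that set and joins it back to the log (the final result).

```python
log = [("c1", "view"), ("c1", "buy"), ("c3", "view")]          # (CustomerID, action)
customers = [("c1", "Alice"), ("c2", "Bob"), ("c3", "Carol")]  # (CustomerID, name)

# Job 1: project the join key from the log and deduplicate (intermediate result).
referenced_ids = {cust_id for cust_id, _ in log}

# Job 2: keep only the referenced customers, then join on CustomerID (final result).
small_customers = {cid: name for cid, name in customers if cid in referenced_ids}
joined = [(cid, action, small_customers[cid]) for cid, action in log]

print(joined)  # [('c1', 'view', 'Alice'), ('c1', 'buy', 'Alice'), ('c3', 'view', 'Carol')]
```

Only the small filtered customer set ("c2"/Bob never appears in the log) is shipped to the join, which is the point of the semi-join.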

Introduction to MapReduce What comes to your mind? Keep these intermediate results and final results, and reuse them for future workflows submitted to the system. This is the basic idea behind ReStore. A good example: Facebook stores the result of any query in its MapReduce cluster for seven days so that it can be shared among users.

Overview of ReStore ReStore:
- Rewrites the MapReduce jobs in a submitted workflow to reuse job outputs previously stored in the system
- Stores the outputs of executed jobs for future reuse
- Creates more reuse opportunities by storing the outputs of sub-jobs in addition to whole MapReduce jobs
- Selects which job outputs to keep in the distributed file system and which to delete

Overview of ReStore To illustrate the various result-reuse opportunities, the authors use an example of two queries, Q1 and Q2, written in PigMix. Identifiers are the names of relations (aliases), fields, variables, and so on; each stored output is described by the name of its file or output directory, its identifiers, and its schema.

Overview of ReStore Positional notation

Overview of ReStore Let's go back to this slide, and I will give you an example of Equation 1. Let's first assume Data 1 and Data 2 are obtained by executing Job_01 and Job_02, which take times t01 and t02 respectively.

Overview of ReStore For a given job Job_n in the workflow, the paper divides the total time needed to execute Job_n into two parts: the time needed to execute Job_n itself, ET(Job_n), and the time needed to finish all jobs on which Job_n depends, which is the time needed for the slowest of these jobs to finish (Equation 1):

T_total(Job_n) = ET(Job_n) + max_{i ∈ Y} { T_total(Job_i) }

where Y is the set of jobs on which Job_n depends. This holds because in the MapReduce architecture a job cannot start until all the jobs on which it depends have finished.
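The timing decomposition described above can be computed recursively; here is a small sketch with hypothetical job names and execution times (not from the paper's experiments).

```python
# Per-job execution times ET(Job) and the dependency sets Y for each job.
ET = {"Job_01": 10, "Job_02": 7, "Job_03": 5}
deps = {"Job_01": [], "Job_02": [], "Job_03": ["Job_01", "Job_02"]}

def t_total(job):
    """Total time to execute `job`: its own time plus the slowest dependency."""
    upstream = [t_total(d) for d in deps[job]]
    return ET[job] + (max(upstream) if upstream else 0)

print(t_total("Job_03"))  # 5 + max(10, 7) = 15
```

If the output of Job_01 were already stored, the max term would drop to 7 and the total would fall to 12, which is exactly the kind of saving ReStore targets.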

Overview of ReStore ReStore generates two types of reuse opportunities by materializing the output of:
(1) whole jobs, which reduces max_{i ∈ Y} { T_total(Job_i) } in future workflows;
(2) operators in jobs (sub-jobs), which reduces ET(Job_n) in future workflows.

Overview of ReStore

Overview of ReStore ReStore System Architecture
Input: a workflow of MapReduce jobs generated by a dataflow system for an input query
Output: a modified MapReduce workflow that exploits the outputs of prior jobs executed in the MapReduce system and stored by ReStore, plus a new set of job outputs to store in the distributed file system

Overview of ReStore Repository: for each stored output, ReStore keeps
- the physical query execution plan of the MapReduce job that was executed to produce this output
- the filename of the output in the distributed file system
- statistics about the MapReduce job that produced the output, and the frequency of use of this output by different workflows

Overview of ReStore

Implementation Details Plan Matcher and Rewriter Scans sequentially through the physical plans in the repository and tests whether each plan matches the input MapReduce job. As soon as a match is found, the input MapReduce job is rewritten to use the matched physical plan in the repository. After rewriting, a new sequential scan through the repository is started to look for more matches to the rewritten MapReduce job. If a scan through the repository does not find any matches, ReStore proceeds to matching the next MapReduce job in the workflow.
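The match-rewrite-rescan loop described above can be sketched as follows. This is a deliberate simplification: the real ReStore matches physical query execution plans, while here a "plan" is just a tuple of operator names and all names and paths are illustrative.

```python
# Repository maps a stored (sub-)plan to the path of its materialized output.
repository = {("LOAD", "FILTER"): "hdfs://restore/out_17"}

def rewrite(job_plan):
    """Repeatedly scan the repository, replacing any matched plan prefix
    with a load of the stored output, until no more matches are found."""
    rewritten = True
    while rewritten:
        rewritten = False
        for stored_plan, path in repository.items():   # sequential scan
            n = len(stored_plan)
            if n < len(job_plan) and job_plan[:n] == stored_plan:
                # Replace the matched prefix with a read of the stored output,
                # then restart the scan to look for further matches.
                job_plan = (("LOAD_STORED", path),) + job_plan[n:]
                rewritten = True
                break
    return job_plan

print(rewrite(("LOAD", "FILTER", "JOIN")))
```

Here the LOAD and FILTER operators are replaced by a single read of the previously stored output, leaving only the JOIN to execute.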

Implementation Details Sub-job Enumerator & Enumerated Sub-job Selector
Two types of reuse opportunities:
- Store the output of whole MapReduce jobs
- Store the output of sub-jobs
Which sub-jobs should be considered as candidates? Some physical operators, such as Project and Filter, are known to reduce the size of their input, so their outputs are cheap to store. Other physical operators, such as Join and Group, are known to be expensive, so their outputs are also good sub-job candidates because replacing them with stored output avoids the cost of executing these operators.
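The heuristic above can be sketched as a simple candidate filter over the operators of a job. The operator names and the binary keep/skip rule are illustrative stand-ins for the paper's selector, not its exact logic.

```python
# Operators whose outputs are small (cheap to store) or expensive to
# recompute (valuable to reuse); a sub-job ending in one of these is a
# materialization candidate.
REDUCING = {"FILTER", "PROJECT"}
EXPENSIVE = {"JOIN", "GROUP"}

def subjob_candidates(operators):
    """Return the prefixes of a job's operator list worth materializing."""
    candidates = []
    for i, op in enumerate(operators, start=1):
        if op in REDUCING or op in EXPENSIVE:
            candidates.append(tuple(operators[:i]))
    return candidates

print(subjob_candidates(["LOAD", "FILTER", "JOIN", "STORE"]))
```

For this job, the sub-jobs ending at FILTER (small output) and at JOIN (expensive operator) are enumerated as candidates, while LOAD and STORE are not.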

Implementation Details Managing the ReStore Repository
- Keep a candidate job in the repository only if the size of its output data is smaller than the size of its input data.
- Keep a candidate job in the repository only if Equation 1 indicates a reduction in execution time for workflows reusing this job.
- Evict a job from the repository if it has not been reused within a window of time.
- Evict a job from the repository if one or more of its inputs is deleted or modified.
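Three of the four rules above can be sketched as a single keep/evict predicate over a repository entry. The record format, field names, and seven-day window are hypothetical; the real ReStore also applies the Equation 1 benefit test, which is omitted here.

```python
import time

def should_keep(entry, now, window=7 * 24 * 3600):
    """Decide whether a stored job output stays in the repository."""
    if entry["output_size"] >= entry["input_size"]:
        return False   # output is not smaller than the input: not worth storing
    if now - entry["last_reused"] > window:
        return False   # not reused within the time window: evict
    if entry["input_modified"]:
        return False   # an input was deleted or modified: output is stale
    return True

entry = {"output_size": 100, "input_size": 500,
         "last_reused": time.time(), "input_modified": False}
print(should_keep(entry, time.time()))  # True
```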

Experiments

Experiments

Conclusion The MapReduce model has become widely accepted for analyzing large data sets. In many cases, users express complex analysis tasks not directly in MapReduce but in higher-level SQL-like query languages. It is therefore important to make full use of the intermediate outputs of jobs to gain performance improvements.

Any Questions?