MRShare: Sharing Across Multiple Queries in MapReduce By Tomasz Nykiel (University of Toronto) Michalis Potamias (Boston University) Chaitanya Mishra (University.

Presentation transcript:

MRShare: Sharing Across Multiple Queries in MapReduce
By Tomasz Nykiel (University of Toronto), Michalis Potamias (Boston University), Chaitanya Mishra (University of Toronto, currently Facebook), George Kollios (Boston University), Nick Koudas (University of Toronto)
Presented by Xiaolan Wang and Pengfei Tang

Motivation
Reducing the execution time
Reducing energy consumption
Monetary savings

MRShare – a sharing framework for Map Reduce
MRShare framework:
– Inspired by sharing primitives from the relational domain
– Introduces a cost model for Map Reduce jobs
– Searches for the optimal sharing strategies
– Does not change the Map Reduce computational model

Outline
Introduction
Map Reduce recap.
MRShare – Sharing opportunities in Map-Reduce
Cost model for MapReduce
MRShare – Grouping algorithms
MRShare Implementation and Evaluation
Summary


Map Reduce recap.
[Figure: input splits (I) feeding the Map stage, followed by the Reduce stage]


Sharing opportunities – sharing scans
SELECT COUNT(*) FROM user GROUP BY hometown
SELECT AVG(age) FROM user GROUP BY hometown
Table user: User_id | Hometown | Occupation | Age

MRShare – sharing scans (map).

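MRShare – sharing scans (reduce)
[Table: tagged map output with columns J1 | J2 | J3 | J4 | key | value; the rows are for key Toronto, and the tag and value cells were lost in extraction]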
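The scan-sharing idea on the two slides above can be sketched in a few lines of Python. This is a minimal illustration, not the MRShare implementation: the sample rows, the job tags `J1` and `J2`, and the helper names `meta_map`, `meta_reduce` and `run_shared` are assumptions made for the example. Each input row is read once, every job's map function is applied to it, and each output is tagged with the job that produced it so the reduce side can route it.

```python
from collections import defaultdict

# Hypothetical rows of the `user` table: (user_id, hometown, occupation, age).
USERS = [
    (1, "Toronto", "engineer", 30),
    (2, "Toronto", "teacher", 40),
    (3, "Boston", "student", 20),
]

def map_count(row):
    # Map side of: SELECT COUNT(*) FROM user GROUP BY hometown
    yield row[1], 1

def map_avg(row):
    # Map side of: SELECT AVG(age) FROM user GROUP BY hometown
    yield row[1], row[3]

def reduce_count(values):
    return sum(values)

def reduce_avg(values):
    return sum(values) / len(values)

# Job tag -> (map function, reduce function). The tags are illustrative.
JOBS = {"J1": (map_count, reduce_count), "J2": (map_avg, reduce_avg)}

def meta_map(row):
    # One scan of the row feeds every job; each output carries its job tag.
    for tag, (map_fn, _) in JOBS.items():
        for key, value in map_fn(row):
            yield key, (tag, value)

def meta_reduce(key, tagged_values):
    # Route each value back to the reducer of the job that produced it.
    per_job = defaultdict(list)
    for tag, value in tagged_values:
        per_job[tag].append(value)
    return {tag: JOBS[tag][1](vals) for tag, vals in per_job.items()}

def run_shared(rows):
    shuffled = defaultdict(list)
    for row in rows:  # the input is read only once for both jobs
        for key, tagged in meta_map(row):
            shuffled[key].append(tagged)
    return {key: meta_reduce(key, vals) for key, vals in shuffled.items()}

RESULT = run_shared(USERS)
```

Here `RESULT["Toronto"]` holds the count under `J1` and the average age under `J2`, both computed from a single pass over `USERS`.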

Sharing Map Output
SELECT T.a, sum(T.b) FROM T WHERE T.a>10 AND T.a<[…] GROUP BY T.a
SELECT T.a, avg(T.b) FROM T WHERE […]>10 AND T.c<100 GROUP BY T.a

Sharing Map
SELECT T.c, sum(T.b) FROM T WHERE T.c > 10 GROUP BY T.c
SELECT T.a, avg(T.b) FROM T WHERE T.c > 10 GROUP BY T.a
Same reducing.

Sharing Parts of Map
SELECT T.a, sum(T.b) FROM T WHERE T.c>10 AND T.a<[…] GROUP BY T.a
SELECT T.a, avg(T.b) FROM T WHERE […]>10 AND T.c<100 GROUP BY T.a


Cost model for Map Reduce (single job)
T(J) = T_read(J) + T_sort(J) + T_tr(J)

Cost of executing a group of jobs

Cost without grouping
n MapReduce jobs, J = {J_1, ..., J_n}, read from the same input file F.
n – number of jobs; m – number of map tasks; r – number of reduce tasks
|M_i| – the average output size of a map task
|R_i| – the average input size of a reduce task
|D_i| – the size of the intermediate data of job J_i
|D_i| = |M_i| · m = |R_i| · r

Sorting time

Cost with grouping
A single group G contains all n jobs and is executed as a single job J_G.
m – number of map tasks; r – number of reduce tasks
|X_m| – the average size of the combined output of map tasks
|X_r| – the average size of the combined input of reduce tasks
|X_G| – the size of the intermediate data
|X_G| = |X_m| · m = |X_r| · r

Beneficial conditions
n <= B

Finding the optimal sharing strategy
An optimization problem
“NoShare”
“GreedyShare”

Sharing scans - cost based optimization
Savings come from the reduced number of scans
The sorting cost might change
The costs of copying and writing the output do not change
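The trade-off behind these points can be made concrete with a small sketch of the cost model T(J) = T_read(J) + T_sort(J) + T_tr(J). The transcript does not show how each term is computed, so everything below is an assumption for illustration: reading is taken as linear in the input size, transfer as linear in the intermediate data, the sort cost is a function supplied by the caller, and the grouped job's intermediate data is taken to be the sum of the individual jobs' intermediate data. The constants and function names are hypothetical.

```python
def job_cost(input_size, inter_size, c_read, c_transfer, sort_cost):
    """T(J) = T_read(J) + T_sort(J) + T_tr(J) for one job (assumed forms)."""
    t_read = c_read * input_size        # assumption: linear in |F|
    t_sort = sort_cost(inter_size)      # assumption: caller supplies the sort model
    t_tr = c_transfer * inter_size      # assumption: linear in |D|
    return t_read + t_sort + t_tr

def cost_without_grouping(input_size, inter_sizes, c_read, c_transfer, sort_cost):
    # Every job scans the input file F on its own.
    return sum(
        job_cost(input_size, d, c_read, c_transfer, sort_cost)
        for d in inter_sizes
    )

def cost_with_grouping(input_size, inter_sizes, c_read, c_transfer, sort_cost):
    # One merged job J_G: F is scanned once, and the intermediate data is
    # assumed to be the sum of the individual jobs' data.
    return job_cost(input_size, sum(inter_sizes), c_read, c_transfer, sort_cost)

# Toy sort models with integer arithmetic (both hypothetical).
linear_sort = lambda d: d // 2
superlinear_sort = lambda d: d * d // 20

# Small intermediate data: saving a scan dominates.
SMALL_SEPARATE = cost_without_grouping(100, [10, 10], 1, 1, linear_sort)
SMALL_GROUPED = cost_with_grouping(100, [10, 10], 1, 1, linear_sort)

# Large intermediate data with a superlinear sort: grouping costs more.
LARGE_SEPARATE = cost_without_grouping(100, [80, 80], 1, 1, superlinear_sort)
LARGE_GROUPED = cost_with_grouping(100, [80, 80], 1, 1, superlinear_sort)
```

With these toy numbers, grouping wins in the first case because one scan is saved, and loses in the second because the extra sorting cost outweighs the saved scan.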


SplitJobs – a DP solution for sharing scans.
We reduce the problem of grouping to the problem of splitting a sorted list of jobs – by approximating the cost of sorting.
Using our cost model and the approximation, we employ a DP algorithm to find the optimal split points.

SplitJobs (cont.)
GS(i, l) = GAIN(i, l) − f
c(l) is the savings of the optimal grouping of jobs J_1, …, J_l.
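A dynamic program over split points of a sorted job list can be sketched as follows. The recurrence used here, c(l) = max over 1 ≤ i ≤ l of c(i − 1) + GS(i, l) with c(0) = 0, is an assumption: the transcript defines GS and c(l) but does not show the recurrence itself. The function name `split_jobs`, the per-job gains and the value of `f` are hypothetical toy values, and the toy `gs` treats a singleton group as saving nothing.

```python
def split_jobs(n, gs):
    """Return (best_savings, groups) for jobs 1..n kept in sorted order.

    gs(i, l) is the savings of placing jobs i..l (1-based, inclusive)
    into one group. Groups are returned as (first_job, last_job) pairs.
    """
    c = [0] * (n + 1)       # c[l]: best savings for jobs 1..l, with c[0] = 0
    start = [0] * (n + 1)   # start[l]: first job of the last group in that optimum
    for l in range(1, n + 1):
        # Try every possible start i of the group that ends at job l.
        best_i = max(range(1, l + 1), key=lambda i: c[i - 1] + gs(i, l))
        c[l] = c[best_i - 1] + gs(best_i, l)
        start[l] = best_i
    # Walk the split points backwards to recover the groups.
    groups, l = [], n
    while l > 0:
        groups.append((start[l], l))
        l = start[l] - 1
    groups.reverse()
    return c[n], groups

# Hypothetical per-job gains and fixed cost f, only to exercise the DP.
GAINS = [3, 3, -5, 4, 4]
F = 2

def toy_gs(i, l):
    # Assumption: a singleton group saves nothing; a larger group saves
    # the sum of its members' gains minus f.
    return 0 if i == l else sum(GAINS[i - 1:l]) - F

SAVINGS, GROUPS = split_jobs(len(GAINS), toy_gs)
```

On this toy input the third job, whose gain is negative, is left in a group of its own while its neighbours are grouped in pairs.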

MultiSplitJobs – an improvement of SplitJobs

MultiSplitJobs (cont.)


Implementing MRShare
MRShare is implemented on Hadoop.
First, a batch of jobs is collected from queries arriving within a short time interval T.
Second, MultiSplitJobs is called to compute the optimal grouping of the jobs.
Third, the groups are rewritten using a meta-map and a meta-reduce function. These are MRShare-specific containers, and their functionality relies on tagging.
Finally, the new jobs are submitted for execution.

Tagging for Sharing Only Scans

Tagging for Sharing Map Output
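The tagging slides contain figures only, so the following is a hedged sketch of one way tagging for shared map output could work, not the scheme shown on the slides. It assumes two jobs that group by the same key and aggregate the same column but apply different filters. Each map output tuple is emitted once, with a tag listing the jobs whose filter it passes. The rows, predicates, job tags and helper names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical rows of table T: (a, b, c).
ROWS = [(12, 5, 50), (12, 7, 500), (30, 1, 50), (5, 9, 50)]

# Hypothetical filters; both jobs group by T.a and aggregate T.b.
FILTERS = {
    "J1": lambda a, b, c: 10 < a < 20,
    "J2": lambda a, b, c: c < 100,
}
AGGREGATES = {"J1": sum, "J2": lambda vs: sum(vs) / len(vs)}

def meta_map(row):
    a, b, c = row
    # The tag is the set of jobs whose filter accepts this row.
    tag = frozenset(job for job, keep in FILTERS.items() if keep(a, b, c))
    if tag:
        yield a, (tag, b)   # emitted once, however many jobs want it

def meta_reduce(key, tagged_values):
    # Hand each value to every job named in its tag.
    per_job = defaultdict(list)
    for tag, value in tagged_values:
        for job in tag:
            per_job[job].append(value)
    return {job: AGGREGATES[job](vals) for job, vals in per_job.items()}

def run(rows):
    shuffled = defaultdict(list)
    emitted = 0
    for row in rows:
        for key, tagged in meta_map(row):
            shuffled[key].append(tagged)
            emitted += 1
    results = {key: meta_reduce(key, vals) for key, vals in shuffled.items()}
    return results, emitted

RESULTS, EMITTED = run(ROWS)
```

In this toy run four tuples are shuffled, whereas running the two jobs separately would emit five (two for `J1` and three for `J2`), because the row accepted by both filters is sent only once.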

Evaluation setup
40 EC2 small instance virtual machines
Modified Hadoop engine
30 GB text dataset consisting of blogs
Multiple grep-wordcount queries
– Counts words matching a regular expression
– Allows for variable intermediate data sizes
– Generic aggregation Map Reduce job
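A grep-wordcount query of the kind described above can be sketched as a map function that keeps only words matching a regular expression and a reduce function that sums the counts. This is an in-memory illustration rather than the Hadoop job used in the evaluation; the function names and sample lines are made up for the example.

```python
import re
from collections import defaultdict

def map_fn(line, regex):
    # Emit (word, 1) for every word in the line that matches the pattern.
    for word in line.split():
        if regex.fullmatch(word):
            yield word, 1

def reduce_fn(word, ones):
    # Sum the ones emitted for this word.
    return sum(ones)

def grep_wordcount(lines, pattern):
    regex = re.compile(pattern)
    shuffled = defaultdict(list)
    for line in lines:
        for word, one in map_fn(line, regex):
            shuffled[word].append(one)
    return {word: reduce_fn(word, ones) for word, ones in shuffled.items()}

SAMPLE = ["map reduce map", "merge sort", "reduce"]
COUNTS = grep_wordcount(SAMPLE, r"m.*")
```

A more permissive pattern lets more words through the map phase, which is how such a query yields variable intermediate data sizes.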

Validation of the Cost Model

Evaluation goals
Sharing is not always beneficial.
– ‘GreedyShare’ policy
How much can we save on sharing scans?
– MRShare - MultiSplitJobs evaluation
How much can we save on sharing intermediate data?
– MRShare - γ-MultiSplitJobs evaluation

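Is sharing always beneficial? - ‘GreedyShare’ policy
[Table: Group of jobs | Group size | d = |intermediate data| / |input data|. Three groups labeled H; the surviving bounds read "… < d < 0.7", "… < d" and "… < d"; the remaining values were lost in extraction.]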

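How much we save on sharing scans – MRShare MultiSplitJobs
[Table: Group of jobs | Group size | d = |intermediate data| / |input data|. Five groups labeled G; the surviving bounds read "… < d", "… < d < 0.7", "… < d < 0.2", "… < d < max" and "… < d < max"; the remaining values were lost in extraction.]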

How much we save on sharing Map-output – MRShare MultiSplitJobs

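How much we save on sharing intermediate data - MRShare - γ-MultiSplitJobs
[Table: Group of jobs | Group size | d = |intermediate data| / |input data|. Three groups labeled G; the surviving bounds read "… < d", "… < d < 0.7" and "… < d < 0.2"; the remaining values were lost in extraction.]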

Summary
We introduced MRShare – a framework for automatic work sharing in Map Reduce.
We identified sharing primitives and demonstrated their implementation in a Map-Reduce engine.
We established a cost model and solved several work sharing optimization problems.
We demonstrated vast savings when using MRShare.

Thank you!!! Questions?