MRShare: Sharing Across Multiple Queries in MapReduce
By Tomasz Nykiel (University of Toronto), Michalis Potamias (Boston University), Chaitanya Mishra (University of Toronto, currently Facebook), George Kollios (Boston University), Nick Koudas (University of Toronto)
Presented by Xiaolan Wang and Pengfei Tang

Motivation
– Reducing the execution time
– Reducing energy consumption
– Monetary savings
*http://aws.amazon.com/ec2/#pricing

MRShare – a sharing framework for MapReduce
MRShare framework:
– Inspired by sharing primitives from the relational domain
– Introduces a cost model for MapReduce jobs
– Searches for the optimal sharing strategies
– Does not change the MapReduce computational model

Outline
– Introduction
– MapReduce recap
– MRShare – Sharing opportunities in MapReduce
– Cost model for MapReduce
– MRShare – Grouping algorithms
– MRShare implementation and evaluation
– Summary

MapReduce recap
[Figure: MapReduce dataflow; labels: I (four times), Map, Reduce]

Sharing opportunities – sharing scans
SELECT COUNT(*) FROM user GROUP BY hometown
SELECT AVG(age) FROM user GROUP BY hometown
Table user, columns: User_id, Hometown, Occupation, Age

MRShare – sharing scans (map)

MRShare – sharing scans (reduce)
[Figure: tagged reduce input with columns J1, J2, J3, J4, key, value; example rows for key Toronto]
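The shared scan for the two queries over the user table can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows are made-up sample data, the tags "J1" and "J2" are hypothetical job identifiers, and the names meta_map and meta_reduce borrow the slide terminology without reproducing the Hadoop implementation.

```python
from collections import defaultdict

# Hypothetical rows of the user table: (user_id, hometown, occupation, age).
USERS = [
    (1, "Toronto", "engineer", 30),
    (2, "Toronto", "teacher", 40),
    (3, "Boston", "student", 20),
]

def meta_map(row):
    """Scan one row once and emit tagged pairs for both jobs."""
    _, hometown, _, age = row
    yield hometown, ("J1", 1)    # J1: COUNT(*) GROUP BY hometown
    yield hometown, ("J2", age)  # J2: AVG(age) GROUP BY hometown

def meta_reduce(key, tagged_values):
    """Route each tagged value to the reduce logic of its own job."""
    count = 0
    ages = []
    for tag, value in tagged_values:
        if tag == "J1":
            count += value
        elif tag == "J2":
            ages.append(value)
    return {"J1": count, "J2": sum(ages) / len(ages)}

def run_shared_scan(rows):
    """Simulate map, shuffle, and reduce with a single pass over the input."""
    groups = defaultdict(list)
    for row in rows:
        for key, tagged in meta_map(row):
            groups[key].append(tagged)
    return {key: meta_reduce(key, vals) for key, vals in groups.items()}

result = run_shared_scan(USERS)
```

The input is read once, while each job still sees only the values carrying its own tag.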

Sharing Map Output
SELECT T.a, sum(T.b) FROM T WHERE T.a>10 AND T.a […] GROUP BY T.a
SELECT T.a, avg(T.b) FROM T WHERE […] 10 AND T.c<100 GROUP BY T.a

Sharing Map
SELECT T.c, sum(T.b) FROM T WHERE T.c > 10 GROUP BY T.c
SELECT T.a, avg(T.b) FROM T WHERE T.c > 10 GROUP BY T.a
Same reducing.
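A minimal Python sketch of sharing the map step for these two queries: the common predicate T.c > 10 is evaluated once per row, and the row is then emitted under each job's own grouping key. The sample rows and the job labels "sum_by_c" and "avg_by_a" are hypothetical, and the in-memory shuffle stands in for the real MapReduce runtime.

```python
from collections import defaultdict

# Hypothetical rows of table T with columns a, b, c.
T = [
    {"a": 1, "b": 10, "c": 20},
    {"a": 1, "b": 30, "c": 20},
    {"a": 2, "b": 5, "c": 5},    # filtered out: c is not > 10
    {"a": 2, "b": 7, "c": 11},
]

def shared_map(row):
    """Evaluate the common predicate once, then emit one pair per job."""
    if row["c"] > 10:                            # WHERE T.c > 10 (both jobs)
        yield ("sum_by_c", row["c"]), row["b"]   # GROUP BY T.c
        yield ("avg_by_a", row["a"]), row["b"]   # GROUP BY T.a

def run(rows):
    """Group emitted values by (job, key) and apply each job's aggregate."""
    groups = defaultdict(list)
    for row in rows:
        for key, value in shared_map(row):
            groups[key].append(value)
    out = {}
    for (job, group_key), values in groups.items():
        if job == "sum_by_c":
            out[(job, group_key)] = sum(values)
        else:
            out[(job, group_key)] = sum(values) / len(values)
    return out

shared_result = run(T)
```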

Sharing Parts of Map
SELECT T.a, sum(T.b) FROM T WHERE T.c>10 AND T.a […] GROUP BY T.a
SELECT T.a, avg(T.b) FROM T WHERE […] 10 AND T.c<100 GROUP BY T.a

Cost model for MapReduce (single job)
$T(J) = T_{read}(J) + T_{sort}(J) + T_{tr}(J)$

Cost of executing a group of jobs

Cost without grouping
n MapReduce jobs, $J = \{J_1, \dots, J_n\}$, read from the same input file F.
– n: the number of jobs
– m: the number of map tasks
– r: the number of reduce tasks
– $|M_i|$: the average output size of a map task
– $|R_i|$: the average input size of a reduce task
– $|D_i|$: the size of the intermediate data of job $J_i$
$|D_i| = |M_i| \cdot m = |R_i| \cdot r$

Sorting time

Cost with grouping
A single group G contains all n jobs and is executed as a single job $J_G$.
– m: the number of map tasks
– r: the number of reduce tasks
– $|X_m|$: the average size of the combined output of map tasks
– $|X_r|$: the average size of the combined input of reduce tasks
– $|X_G|$: the size of the intermediate data
$|X_G| = |X_m| \cdot m = |X_r| \cdot r$
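To make the comparison between the two settings concrete, here is a hedged Python sketch. The slides give only the three-term decomposition into read, sort, and transfer time; the specific forms used below are assumptions for illustration: read and transfer costs linear in data size, and a sort cost proportional to the data size times twice an assumed number of merge passes with merge factor B. The constants c_r, c_l, c_t and all function names are hypothetical.

```python
import math

def sort_passes(d_size, m, B):
    """Assumed pass count: map-side passes plus reduce-side merge passes."""
    map_side = max(0, math.ceil(math.log(d_size, B) - math.log(m, B)))
    reduce_side = math.ceil(math.log(m, B))
    return map_side + reduce_side

def job_cost(f_size, d_size, m, B, c_r, c_l, c_t):
    """T(J) = T_read + T_sort + T_tr under the assumptions stated above."""
    t_read = c_r * f_size                                  # scan input F
    t_sort = c_l * d_size * 2 * sort_passes(d_size, m, B)  # read + write per pass
    t_tr = c_t * d_size                                    # move intermediate data
    return t_read + t_sort + t_tr

def cost_without_grouping(f_size, d_sizes, m, B, c_r, c_l, c_t):
    """Each job J_i scans F on its own and sorts its own |D_i|."""
    return sum(job_cost(f_size, d, m, B, c_r, c_l, c_t) for d in d_sizes)

def cost_with_grouping(f_size, x_g, m, B, c_r, c_l, c_t):
    """The group J_G scans F once and sorts the combined data |X_G|."""
    return job_cost(f_size, x_g, m, B, c_r, c_l, c_t)

# Two jobs with small intermediate data relative to the input (arbitrary units).
separate = cost_without_grouping(1000, [50, 50], m=4, B=10, c_r=1, c_l=1, c_t=1)
grouped = cost_with_grouping(1000, 100, m=4, B=10, c_r=1, c_l=1, c_t=1)
```

In this toy setting the grouped job saves one full scan of F. With larger intermediate data the extra sorting passes can outweigh that saving, which is why grouping is not always beneficial.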

Beneficial conditions
$n \le B$

Finding the optimal sharing strategy
– An optimization problem
– “NoShare”
– “GreedyShare”

Sharing scans - cost-based optimization
– Savings come from the reduced number of scans
– The sorting cost might change
– The costs of copying and writing the output do not change

SplitJobs – a DP solution for sharing scans
– We reduce the problem of grouping to the problem of splitting a sorted list of jobs by approximating the cost of sorting.
– Using our cost model and the approximation, we employ a DP algorithm to find the optimal split points.

SplitJobs (cont.)
$GS(i, l) = GAIN(i, l) - f$
$c(l)$ is the savings of the optimal grouping of jobs $J_1, \dots, J_l$.
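The split-point search can be sketched as a standard dynamic program over the sorted job list. The recurrence used here, c(l) = max over 1 ≤ i ≤ l of c(i-1) + GS(i, l) with c(0) = 0, is an assumption in the usual form for this kind of problem. The group-savings function gs is supplied by the caller, and the toy gs in the usage lines is invented purely for illustration.

```python
def split_jobs(n, gs):
    """Return (best_savings, groups) for jobs 1..n in sorted order.

    gs(i, l) gives the savings of running jobs i..l (inclusive,
    1-indexed) as one group; it is supplied by the caller.
    """
    best = [0.0] * (n + 1)   # best[l] plays the role of c(l); best[0] = 0
    start = [0] * (n + 1)    # start[l] = first job of the last group in c(l)
    for l in range(1, n + 1):
        best[l], start[l] = max(
            (best[i - 1] + gs(i, l), i) for i in range(1, l + 1)
        )
    # Walk the recorded split points backwards to recover the groups.
    groups = []
    l = n
    while l > 0:
        groups.append((start[l], l))
        l = start[l] - 1
    groups.reverse()
    return best[n], groups

# Toy savings function (hypothetical): each extra job in a group saves 3,
# but mixing jobs with different values of p costs 4 per unit of difference.
p = [None, 1, 1, 5, 5]       # 1-indexed, already sorted

def toy_gs(i, l):
    return 3 * (l - i) - 4 * (p[l] - p[i])

savings, groups = split_jobs(4, toy_gs)
```

With this toy function the search keeps similar jobs together and splits the list between jobs 2 and 3.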

MultiSplitJobs – an improvement of SplitJobs

MultiSplitJobs (cont.)

Implementing MRShare
MRShare is implemented on Hadoop.
– First, a batch of jobs is acquired from the queries that arrive within a short time interval T.
– Second, MultiSplitJobs is called to compute the optimal grouping of the jobs.
– Third, the groups are rewritten using a meta-map and a meta-reduce function. These are MRShare-specific containers, and their functionality relies on tagging.
– Finally, the new jobs are submitted for execution.

Tagging for Sharing Only Scans

Tagging for Sharing Map Output

Evaluation setup
– 40 EC2 small instance virtual machines
– Modified Hadoop engine
– 30 GB text dataset consisting of blogs
– Multiple grep-wordcount queries
  – Counts words matching a regular expression
  – Allows for variable intermediate data sizes
  – Generic aggregation MapReduce job
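A grep-wordcount query of the kind described in the setup can be sketched in plain Python as follows. This is not the benchmark code: the function name, the whitespace tokenization, and the sample lines are assumptions. It does show why the regular expression controls the amount of intermediate data.

```python
import re
from collections import Counter

def grep_wordcount(lines, pattern):
    """Count occurrences of each word that fully matches the pattern."""
    regex = re.compile(pattern)
    counts = Counter()
    for line in lines:                 # map: emit (word, 1) for matching words
        for word in line.split():
            if regex.fullmatch(word):
                counts[word] += 1      # reduce: sum the ones per word
    return counts

sample = ["the cat sat", "the cart stopped", "a dog"]
selective = grep_wordcount(sample, r"ca.*")   # few words match
broad = grep_wordcount(sample, r".*")         # every word matches
```

A selective pattern lets few (word, 1) pairs through the map step, while a broad pattern passes nearly every word, so the ratio of intermediate data to input data can be tuned per query.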

Validation of the Cost Model

Evaluation goals
Sharing is not always beneficial.
– ‘GreedyShare’ policy
How much can we save on sharing scans?
– MRShare - MultiSplitJobs evaluation
How much can we save on sharing intermediate data?
– MRShare - γ-MultiSplitJobs evaluation

Is sharing always beneficial? - ‘GreedyShare’ policy
Group of jobs   Group size   d = |intermediate data| / |input data|
H1              16           0.3 < d < 0.7
H2              16           0.7 < d
H3              16           0.9 < d

How much we save on sharing scans – MRShare MultiSplitJobs
Group of jobs   Group size   d = |intermediate data| / |input data|
G1              16           0.7 < d
G2              16           0.2 < d < 0.7
G3              16           0.0 < d < 0.2
G4              16           0.0 < d < max
G5              64           0.0 < d < max

How much we save on sharing Map-output – MRShare MultiSplitJobs

How much we save on sharing intermediate data - MRShare - γ-MultiSplitJobs
Group of jobs   Group size   d = |intermediate data| / |input data|
G1              16           0.7 < d
G2              16           0.2 < d < 0.7
G3              16           0.0 < d < 0.2

Summary
– We introduced MRShare, a framework for automatic work sharing in MapReduce.
– We identified sharing primitives and demonstrated their implementation in a MapReduce engine.
– We established a cost model and solved several work-sharing optimization problems.
– We demonstrated vast savings when using MRShare.

Thank you!!! Questions?
