Presentation is loading. Please wait.

Presentation is loading. Please wait.

IMapReduce: A Distributed Computing Framework for Iterative Computation Yanfeng Zhang, Northeastern University, China Qixin Gao, Northeastern University,

Similar presentations


Presentation on theme: "IMapReduce: A Distributed Computing Framework for Iterative Computation Yanfeng Zhang, Northeastern University, China Qixin Gao, Northeastern University,"— Presentation transcript:

1 iMapReduce: A Distributed Computing Framework for Iterative Computation Yanfeng Zhang, Northeastern University, China Qixin Gao, Northeastern University, China Lixin Gao, UMass Amherst Cuirong Wang, Northeastern University, China

2 Iterative Computation Everywhere PageRank Clustering BFS Video suggestion Pattern Recognition

3 Iterative Computation on Massive Data PageRank Clustering BFS Video suggestion Pattern Recognition 100 million videos 700PB human genomics 29.7 billion pages 500TB/yr hospital Google maps 70.5TB

4 Large-Scale Solution MapReduce Programming model –Map: distribute computation workload –Reduce: aggregate the partial results But, no support of iterative processing –Batched processing –Must design a series of jobs, do the same thing many times Task reassignment, data loading, data dumping

5 Outline Introduction Example: PageRank in MapReduce System Design: iMapReduce Evaluation Conclusions

6 PageRank Each node associated with a ranking score R –Evenly to its neighbors with portion d –Retain portion (1-d)

7 PageRank in MapReduce 0 1:2,3 1 1:3 2 1:0,3 3 1:4 4 1:1,2,3 map 2 1/2 3 1/2 reduce 4 1 3 1 1 1/3 2 1/3 3 1/3 0 1/2 3 1/2 1 0.33 3 2.33 0 0.5 2 0.83 4 1 0 p:2,3 1p:3 2 p:0,3 3p:4 4p:1,2,3 0 0.5:2,3 2 0.83:0,3 4 1:1,2,3 1 0.33:3 3 2.33:4 for the next MapReduce job

8 Problems of Iterative Implementation 1. Multiple times job/task initializations 2. Re-shuffling the static data 3. Synchronization barrier between jobs

9 Outline Introduction Example: PageRank System Design: iMapReduce Evaluation Conclusions

10 iMapReduce Solutions 1. Multiple job/task initializations –Iterative processing 2. Re-shuffling the static data –Maintain static data locally 3. Synchronization barrier between jobs –Asynchronous map execution

11 Iterative Processing Reduce  Map –Load/dump once –Persistent map/reduce task –Each reduce task has a correspondent map task –Locally connected

12 Maintain Static Data Locally Iterate state data –Update iteratively Maintain static data –Join with state data at map phase

13 State Data Static Data 0 1 2 1 4 1 1 3 1 map 0map 1 2 1/2 3 1/2 reduce 0reduce 1 4 1 3 1 1 1/3 2 1/3 3 1/3 0 1/2 3 1/2 1 0.33 3 2.33 0 0.5 2 0.83 4 1 for the next iteration 0 2,3 2 0,3 4 1,2,3 1 3 3 4

14 Asynchronous Map Execution Persistent socket reduce -> map Reduce completion directly trigger map Map tasks no need to wait MapReduceiMapReduce time

15 Outline Introduction Example: PageRank System Design: iMapReduce Evaluation Conclusions

16 Experiment Setup Cluster –Local cluster: 4 nodes –Amazon EC2 cluster: 80 small instances Implemented algorithms –Shortest path, PageRank, K-Means Data sets SSSPPageRank

17 Results on Local Cluster PageRank: google webgraph iMapReduce Hadoop Sync. iMapReduce Hadoop ex. initialization

18 Results cont. PageRank: google webgraphPageRank: Berk-Stan webgraph SSSP: Facebook dataset SSSP: DBLP dataset

19 Scale to Large Data Sets On Amazon EC2 cluster Various-size graphs PageRankSSSP

20 Different Factors on Running Time Reduction iMapReduce Shuffling Sync. maps Init time Hadoop

21 Conclusion iMapReduce reduces running time by –Reducing the overhead of job/task initialization –Eliminating the shuffling of static data –Allowing asynchronous map execution Achieve a factor of 1.2x to 5x speedup Suitable for most iterative computations

22 Q & A Thank you!


Download ppt "IMapReduce: A Distributed Computing Framework for Iterative Computation Yanfeng Zhang, Northeastern University, China Qixin Gao, Northeastern University,"

Similar presentations


Ads by Google