Download presentation
Presentation is loading. Please wait.
1
iMapReduce: A Distributed Computing Framework for Iterative Computation Yanfeng Zhang, Northeastern University, China Qixin Gao, Northeastern University, China Lixin Gao, UMass Amherst Cuirong Wang, Northeastern University, China
2
Iterative Computation Everywhere PageRank Clustering BFS Video suggestion Pattern Recognition
3
Iterative Computation on Massive Data PageRank Clustering BFS Video suggestion Pattern Recognition 100 million videos 700PB human genomics 29.7 billion pages 500TB/yr hospital Google maps 70.5TB
4
Large-Scale Solution MapReduce Programming model –Map: distribute computation workload –Reduce: aggregate the partial results But, no support of iterative processing –Batched processing –Must design a series of jobs, do the same thing many times Task reassignment, data loading, data dumping
5
Outline Introduction Example: PageRank in MapReduce System Design: iMapReduce Evaluation Conclusions
6
PageRank Each node associated with a ranking score R –Evenly to its neighbors with portion d –Retain portion (1-d)
7
PageRank in MapReduce 0 1:2,3 1 1:3 2 1:0,3 3 1:4 4 1:1,2,3 map 2 1/2 3 1/2 reduce 4 1 3 1 1 1/3 2 1/3 3 1/3 0 1/2 3 1/2 1 0.33 3 2.33 0 0.5 2 0.83 4 1 0 p:2,3 1p:3 2 p:0,3 3p:4 4p:1,2,3 0 0.5:2,3 2 0.83:0,3 4 1:1,2,3 1 0.33:3 3 2.33:4 for the next MapReduce job
8
Problems of Iterative Implementation 1. Multiple times job/task initializations 2. Re-shuffling the static data 3. Synchronization barrier between jobs
9
Outline Introduction Example: PageRank System Design: iMapReduce Evaluation Conclusions
10
iMapReduce Solutions 1. Multiple job/task initializations –Iterative processing 2. Re-shuffling the static data –Maintain static data locally 3. Synchronization barrier between jobs –Asynchronous map execution
11
Iterative Processing Reduce Map –Load/dump once –Persistent map/reduce task –Each reduce task has a correspondent map task –Locally connected
12
Maintain Static Data Locally Iterate state data –Update iteratively Maintain static data –Join with state data at map phase
13
State Data Static Data 0 1 2 1 4 1 1 3 1 map 0map 1 2 1/2 3 1/2 reduce 0reduce 1 4 1 3 1 1 1/3 2 1/3 3 1/3 0 1/2 3 1/2 1 0.33 3 2.33 0 0.5 2 0.83 4 1 for the next iteration 0 2,3 2 0,3 4 1,2,3 1 3 3 4
14
Asynchronous Map Execution Persistent socket reduce -> map Reduce completion directly trigger map Map tasks no need to wait MapReduceiMapReduce time
15
Outline Introduction Example: PageRank System Design: iMapReduce Evaluation Conclusions
16
Experiment Setup Cluster –Local cluster: 4 nodes –Amazon EC2 cluster: 80 small instances Implemented algorithms –Shortest path, PageRank, K-Means Data sets SSSPPageRank
17
Results on Local Cluster PageRank: google webgraph iMapReduce Hadoop Sync. iMapReduce Hadoop ex. initialization
18
Results cont. PageRank: google webgraphPageRank: Berk-Stan webgraph SSSP: Facebook dataset SSSP: DBLP dataset
19
Scale to Large Data Sets On Amazon EC2 cluster Various-size graphs PageRankSSSP
20
Different Factors on Running Time Reduction iMapReduce Shuffling Sync. maps Init time Hadoop
21
Conclusion iMapReduce reduces running time by –Reducing the overhead of job/task initialization –Eliminating the shuffling of static data –Allowing asynchronous map execution Achieve a factor of 1.2x to 5x speedup Suitable for most iterative computations
22
Q & A Thank you!
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.