iMapReduce: A Distributed Computing Framework for Iterative Computation
Yanfeng Zhang, Northeastern University, China
Qixin Gao, Northeastern University, China
Lixin Gao, UMass Amherst
Cuirong Wang, Northeastern University, China

Iterative Computation Everywhere
– PageRank
– Clustering
– BFS
– Video suggestion
– Pattern recognition

Iterative Computation on Massive Data
– PageRank, Clustering, BFS, Video suggestion, Pattern recognition
– At massive scale: 29.7 billion web pages; 100 million videos; 700 PB of human genomics; 500 TB/yr per hospital; 70.5 TB for Google Maps

Large-Scale Solution: MapReduce
– Programming model
  – Map: distribute the computation workload
  – Reduce: aggregate the partial results
– But no support for iterative processing
  – Batched processing
  – Must design a series of jobs that repeat the same work many times: task reassignment, data loading, data dumping
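
To make the repeated-jobs problem concrete, here is a minimal sketch of how iterative algorithms are typically driven on plain Hadoop; the jar name, HDFS paths, and mapper/reducer scripts are illustrative placeholders, not artifacts from this work.

```python
# One full Hadoop Streaming job per iteration: each pass re-initializes
# tasks and re-loads/re-dumps all data through HDFS.
import subprocess

NUM_ITERATIONS = 10

for i in range(NUM_ITERATIONS):
    # Hypothetical paths and scripts, for illustration only.
    subprocess.run([
        "hadoop", "jar", "hadoop-streaming.jar",
        "-input",  f"/pagerank/iter{i}",      # output of the previous job
        "-output", f"/pagerank/iter{i + 1}",
        "-mapper", "pagerank_map.py",
        "-reducer", "pagerank_reduce.py",
    ], check=True)
    # Every pass pays the full cost of job setup, task scheduling,
    # shuffling the (unchanged) graph structure, and HDFS I/O.
```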

Outline
– Introduction
– Example: PageRank in MapReduce
– System Design: iMapReduce
– Evaluation
– Conclusions

PageRank
– Each node is associated with a ranking score R
– It distributes portion d of its score evenly to its neighbors
– It retains portion (1-d)
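
Written as an update rule, one common form consistent with the slide's description (the talk's exact normalization is not shown, so this is an assumption):

```latex
% Node v keeps (1-d) and receives d-damped shares from each in-neighbor u,
% where |out(u)| is u's out-degree.
R_{k+1}(v) = (1 - d) + d \sum_{u \in \mathrm{in}(v)} \frac{R_k(u)}{|\mathrm{out}(u)|}
```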

PageRank in MapReduce
[Slide figure: a five-node example graph stored as adjacency lists (node 0 → 2,3; node 1 → 3; node 2 → 0,3; node 3 → 4; node 4 → 1,2,3), all ranks initially 1. Map emits a fractional rank contribution per out-link (e.g., 1/2 to each of node 0's two neighbors); reduce sums the contributions per node, and the ranks are re-joined with the adjacency lists to form the input for the next MapReduce job.]
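
A simplified, in-memory rendering of the map and reduce logic the figure depicts (plain Python rather than actual Hadoop code; the damping value is an assumed illustrative choice):

```python
D = 0.8  # damping factor d; 0.8 is an assumed value for illustration

def pagerank_map(node, rank, out_links):
    """Emit an equal share of this node's rank to each out-neighbor,
    plus the adjacency list itself so it reaches the next job."""
    for dst in out_links:
        yield dst, rank / len(out_links)   # rank contribution
    yield node, out_links                  # static graph structure

def pagerank_reduce(node, values):
    """Sum the incoming contributions and re-attach the adjacency list."""
    rank, out_links = 0.0, []
    for v in values:
        if isinstance(v, list):
            out_links = v                  # the graph structure
        else:
            rank += v                      # a rank contribution
    return node, (1 - D) + D * rank, out_links
```

Note that the adjacency lists (the static data) travel through the shuffle on every single iteration, solely so the reducer can re-attach them; this is exactly the overhead the following slides target.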

Problems of the Iterative Implementation
1. Repeated job/task initialization
2. Re-shuffling of the static data
3. Synchronization barrier between jobs

Outline
– Introduction
– Example: PageRank
– System Design: iMapReduce
– Evaluation
– Conclusions

iMapReduce Solutions
1. Repeated job/task initialization
   – Iterative processing
2. Re-shuffling of the static data
   – Maintain static data locally
3. Synchronization barrier between jobs
   – Asynchronous map execution

Iterative Processing (Reduce → Map)
– Load/dump data only once
– Persistent map/reduce tasks
– Each reduce task has a corresponding map task
– The pair is locally connected

Maintain Static Data Locally
– State data: updated iteratively
– Static data: maintained locally and joined with the state data at the map phase

[Slide figure: two persistent map/reduce task pairs. The static data (adjacency lists such as 0 → 2,3) stays local to each map task, while the state data (the rank values) flows from each reduce task back to its connected map task for the next iteration.]
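
A conceptual single-worker sketch of this design (not the real iMapReduce API): the static data is loaded once and kept local, while the state data loops from reduce back to the connected map each iteration; the damping factor is assumed.

```python
def persistent_worker(static_data, state_data, num_iterations, d=0.8):
    """static_data: {node: out_links}, loaded from the DFS exactly once.
    state_data: {node: rank}, looping between reduce and map."""
    for _ in range(num_iterations):
        # Map phase: join the state data with the locally held static data.
        contributions = {}
        for node, rank in state_data.items():
            share = rank / len(static_data[node])
            for dst in static_data[node]:
                contributions.setdefault(dst, []).append(share)
        # Reduce phase: aggregate, then hand the new state straight back
        # to the map task -- no job teardown, no re-shuffling of the graph.
        state_data = {node: (1 - d) + d * sum(contributions.get(node, []))
                      for node in static_data}
    return state_data

# Example: the five-node graph from the earlier slide, ranks all 1.0.
graph = {0: [2, 3], 1: [3], 2: [0, 3], 3: [4], 4: [1, 2, 3]}
ranks = persistent_worker(graph, {n: 1.0 for n in graph}, 10)
```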

Asynchronous Map Execution
– Persistent socket from reduce to map
– Reduce completion directly triggers the map
– Map tasks have no need to wait
[Slide figure: execution timelines contrasting MapReduce and iMapReduce.]
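
A toy illustration of the idea (my assumption of the mechanism, not iMapReduce internals): each reduce output record is pushed over a persistent channel, and the connected map task consumes records as they arrive instead of waiting at a global barrier between iterations.

```python
import queue
import threading

channel = queue.Queue()  # stands in for the persistent reduce->map socket

def reduce_task(results):
    for record in results:
        channel.put(record)   # each completed record triggers the map at once
    channel.put(None)         # end-of-iteration marker

def map_task():
    while True:
        record = channel.get()
        if record is None:
            break
        node, rank = record
        print(f"map starts on node {node} immediately, rank={rank}")

t = threading.Thread(target=map_task)
t.start()
reduce_task([(0, 0.33), (2, 0.33), (3, 0.33)])
t.join()
```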

Outline
– Introduction
– Example: PageRank
– System Design: iMapReduce
– Evaluation
– Conclusions

Experiment Setup
– Cluster
  – Local cluster: 4 nodes
  – Amazon EC2 cluster: 80 small instances
– Implemented algorithms: shortest path (SSSP), PageRank, K-Means
– Data sets: [slide table listing the SSSP and PageRank data sets]

Results on Local Cluster
[Slide figure: PageRank running time on the Google web graph, comparing iMapReduce, synchronous iMapReduce, Hadoop, and Hadoop excluding initialization.]

Results (cont.)
[Slide figures: PageRank on the Google and Berkeley-Stanford web graphs; SSSP on the Facebook and DBLP data sets.]

Scale to Large Data Sets
– On the Amazon EC2 cluster, with graphs of various sizes
[Slide figures: PageRank and SSSP running times as the input graph grows.]

Different Factors in Running Time Reduction
[Slide figure: breakdown of the Hadoop-to-iMapReduce running time gap into the contributions of eliminated static-data shuffling, asynchronous maps, and avoided initialization time.]

Conclusion
iMapReduce reduces running time by:
– Reducing the overhead of job/task initialization
– Eliminating the shuffling of static data
– Allowing asynchronous map execution
It achieves a speedup of 1.2x to 5x over Hadoop and suits most iterative computations.

Q & A Thank you!