1 Fast Failure Recovery in Distributed Graph Processing Systems
Yanyan Shen, Gang Chen, H.V. Jagadish, Wei Lu, Beng Chin Ooi, Bogdan Marius Tudor

2 Graph analytics
Emergence of large graphs
–The web, social networks, spatial networks, …
Increasing demand for querying large graphs
–PageRank, reverse web link analysis over the web graph
–Influence analysis in social networks
–Traffic analysis, route recommendation over spatial graphs

3 Distributed graph processing
MapReduce-like systems
Pregel-like systems
GraphLab-related systems
Others

4 Failures of compute nodes
Increasing graph size → more compute nodes → increase in the number of failed nodes
Failure rate
–# of failures per unit of time
–e.g., 1/200 (hours) per node
Exponential failure probability (see the worked example below)
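As a worked example of why failures become common at scale (the independence assumption and the 1/200-per-hour rate below are modeling assumptions for illustration, not results from the paper): if each of N compute nodes fails independently with exponential rate λ = 1/200 per hour, then the probability of at least one failure during a job of length t hours is

    P(at least one failure within t hours) = 1 − e^(−N·λ·t)

For N = 100 nodes and a one-hour job this gives 1 − e^(−0.5) ≈ 0.39, so a failure during the job is quite likely rather than a rare event.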

5 Outline
Motivation & background
Failure recovery problem
–Challenging issues
–Existing solutions
Solution
–Reassignment generation
–In-parallel recomputation
–Workload rebalance
Experimental results
Conclusions

6 Pregel-like distributed graph processing systems
Graph model
–G=(V,E)
–P: partitions
Computation model (a minimal sketch follows below)
–A set of supersteps
–Invoke the compute function for each active vertex
–Each vertex can: receive and process messages; send messages to other vertices; modify its value, its state (active/inactive), and its outgoing edges
[Figure: an example graph G with vertices A–J, partitioned into subgraphs assigned to compute nodes, with a table mapping vertices to subgraphs]
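To make the vertex-centric model concrete, here is a minimal Python sketch of one superstep over a partition. It follows the description on this slide but is illustrative pseudocode, not the Pregel or Giraph API; the names Vertex, compute, and run_superstep are hypothetical, and the compute rule is a toy minimum-propagation example.

class Vertex:
    def __init__(self, vid, value, out_edges):
        self.id = vid
        self.value = value
        self.out_edges = out_edges      # ids of out-neighbors
        self.active = True
        self.outbox = []                # (target id, message) pairs produced this superstep

    def send(self, target, message):
        self.outbox.append((target, message))

    def vote_to_halt(self):
        self.active = False

def compute(vertex, messages):
    # User-defined logic, invoked once per active vertex per superstep.
    # Toy rule: keep the minimum value seen so far and propagate it.
    if messages:
        vertex.value = min(vertex.value, min(messages))
    for nbr in vertex.out_edges:
        vertex.send(nbr, vertex.value)

def run_superstep(partition, inbox):
    # partition: vertex id -> Vertex; inbox: vertex id -> messages from the previous superstep.
    outgoing = []
    for vid, vertex in partition.items():
        msgs = inbox.get(vid, [])
        if vertex.active or msgs:       # an incoming message reactivates an inactive vertex
            vertex.active = True
            compute(vertex, msgs)
            outgoing.extend(vertex.outbox)
            vertex.outbox = []
    return outgoing                     # delivered by the framework in the next superstep

The framework repeats run_superstep for every partition on every compute node, routes the returned messages to the nodes that own the target vertices, and terminates when all vertices are inactive and no messages are in flight.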

7 Failure recovery problem
Running example
–All the vertices compute and send messages to all their neighbors in every superstep
–N1 fails while the job is executing superstep 12
–Two states: record the superstep each vertex has completed when the failure occurs (S_f) and once the failure is recovered (S_f*)
Problem statement
–For a failure F(N_f, s_f), recover the vertex states from S_f to S_f* (illustrated below)
[Figure and table: the example graph A–J; S_f = A–F: 10, G–J: 12; S_f* = A–J: 12]
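A hedged illustration of the two states in the running example; the dictionary layout is an assumption for exposition, not the paper's data structure.

# S_f: last superstep each vertex has completed when N1 (holding A-F) fails in superstep 12.
S_f = {v: 10 for v in "ABCDEF"}
S_f.update({v: 12 for v in "GHIJ"})
# S_f*: the target state once the failure is recovered.
S_f_star = {v: 12 for v in "ABCDEFGHIJ"}

# Recovery must re-execute, for each vertex v, supersteps S_f[v]+1 .. S_f_star[v].
lost = {v: list(range(S_f[v] + 1, S_f_star[v] + 1)) for v in S_f}
print(lost["A"])   # [11, 12] -> supersteps vertex A still has to redo
print(lost["G"])   # []       -> G is already at its recovered state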

8 Challenging issues
Cascading failures
–New failures may occur during the recovery phase
–How to handle all the cascading failures, if any?
–Existing solution: treat each cascading failure as an individual failure and restart from the latest checkpoint
Recovery latency
–Re-execute lost computations to achieve state S_f*
–Forward messages during recomputation
–Recover cascading failures
–How to perform recovery with minimized latency?

9 Existing recovery mechanisms
Checkpoint-based recovery
–During normal execution: all compute nodes flush their own graph-related information to reliable storage at the beginning of every checkpointing superstep (e.g., C+1, 2C+1, …, nC+1)
–During recovery: let c+1 be the latest checkpointing superstep; healthy nodes replace the failed ones; all compute nodes roll back to the latest checkpoint and re-execute the lost computations since then (i.e., from superstep c+1 to s_f), as sketched below
Pros: simple to implement; can handle cascading failures
Cons: replays lost computations over the whole graph; ignores partially recovered workload
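A minimal sketch of this checkpoint-based scheme, treating the whole graph as a single logical state for brevity; a checkpoint here is an in-memory deep copy, whereas the real system flushes it to reliable storage, and execute is a placeholder for running one superstep. Names are illustrative, not the paper's code.

import copy

def execute(state, superstep):
    # Placeholder for running one superstep over the whole graph.
    state["completed"] = superstep
    return state

def run_with_checkpoints(state, num_supersteps, C):
    checkpoints = {}
    for s in range(1, num_supersteps + 1):
        if s % C == 1:                        # checkpointing supersteps: 1, C+1, 2C+1, ...
            checkpoints[s] = copy.deepcopy(state)
        state = execute(state, s)
    return state, checkpoints

def recover_checkpoint_based(checkpoints, s_f):
    # On a failure in superstep s_f: every node rolls back to the latest
    # checkpointing superstep c+1 <= s_f and the whole graph re-executes c+1 .. s_f.
    c_plus_1 = max(s for s in checkpoints if s <= s_f)
    state = copy.deepcopy(checkpoints[c_plus_1])
    for s in range(c_plus_1, s_f + 1):
        state = execute(state, s)
    return state

The rollback in recover_checkpoint_based touches the entire graph, which is exactly the drawback noted above: healthy partitions also replay supersteps c+1 .. s_f even though their state was never lost.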

10 Existing recovery mechanisms
Checkpoint + log
–During normal execution: besides checkpointing, every compute node logs its outgoing messages at the end of each superstep
–During recovery:
Use healthy nodes (replacements) to replace the failed ones
Replacements: redo the lost computation and forward messages among each other; forward messages to all the nodes in superstep s_f
Healthy nodes: hold their original partitions and help redo the lost computation by forwarding locally logged messages to the failed vertices

11 Existing recovery mechanisms
Checkpoint + log (example)
–Suppose the latest checkpoint is made at the beginning of superstep 11, and N1 (holding A–F) fails at superstep 12
–During recovery (sketched below):
Superstep 11: A–F perform computation and send messages to each other; G–J send their logged messages to A–F
Superstep 12: A–F perform computation and send messages along their outgoing edges; G–J send their logged messages to A–F
Pros: less computation and communication cost; the overhead of local logging is negligible
Cons: limited parallelism, since the replacements handle all the lost computation
[Figure: the example graph with failed partition A–F and healthy partition G–J]
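A hedged sketch of the checkpoint + log recovery in this example (illustrative names and a toy compute rule, not the paper's code): only the failed partition is recomputed, and the healthy nodes contribute by replaying the messages they logged for the failed vertices.

def recover_checkpoint_plus_log(checkpoint, message_log, failed_vertices, c_plus_1, s_f):
    # checkpoint: failed partition's vertex values at the checkpointing superstep c_plus_1.
    # message_log[s][v]: messages that healthy vertices logged for failed vertex v in superstep s.
    state = dict(checkpoint)                      # only the failed partition is restored
    inbox = {v: [] for v in failed_vertices}
    for s in range(c_plus_1, s_f + 1):
        # Healthy nodes replay locally logged messages addressed to the failed vertices.
        for v in failed_vertices:
            inbox[v].extend(message_log.get(s, {}).get(v, []))
        # The replacement node recomputes the failed vertices; messages among them are
        # exchanged directly (simplified here to a broadcast among the failed vertices
        # rather than following actual out-edges). Messages to healthy vertices only
        # matter in superstep s_f and are omitted from this sketch.
        new_inbox = {v: [] for v in failed_vertices}
        for v in failed_vertices:
            state[v] = min([state[v]] + inbox[v])     # toy compute rule: keep the minimum
            for u in failed_vertices:
                new_inbox[u].append(state[v])
        inbox = new_inbox
    return state

All of this recomputation runs on the single replacement node, which is the limited-parallelism drawback listed above.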

12 Outline
Motivation & background
Problem statement
–Challenging issues
–Existing solutions
Solution
–Reassignment generation
–In-parallel recomputation
–Workload rebalance
Experimental results
Conclusions

13 Our solution
Partition-based failure recovery (a sketch of the overall flow follows below)
–Step 1: generate a reassignment for the failed partitions
–Step 2: recompute the failed partitions
Every node is informed of the reassignment
Every node loads its newly assigned failed partitions from the latest checkpoint and redoes the lost computations
–Step 3: exchange partitions to re-balance the workload after recovery
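A hedged sketch of the three-step flow above; generate_reassignment, recompute, and rebalance are assumed callback interfaces used only to show how the steps fit together, not the paper's API.

def partition_based_recovery(failed_partitions, healthy_nodes, checkpoints, s_f,
                             generate_reassignment, recompute, rebalance):
    # Step 1: compute a reassignment of the failed partitions over the healthy nodes.
    reassignment = generate_reassignment(failed_partitions, healthy_nodes)

    # Step 2: every node is informed of the reassignment, loads its newly assigned
    # failed partitions from the latest checkpoint, and redoes the lost supersteps
    # c+1 .. s_f in parallel with the other nodes (healthy partitions only replay
    # their logged messages; they are not recomputed).
    c_plus_1 = max(s for s in checkpoints if s <= s_f)
    recompute(reassignment, checkpoints[c_plus_1], c_plus_1, s_f)

    # Step 3: exchange partitions to re-balance the workload for normal execution.
    rebalance(reassignment)
    return reassignment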

14 Recompute failed partitions

15 Example
N1 fails in superstep 12
–Redo supersteps 11 and 12
[Figures: (1) reassignment of the failed partition's vertices across healthy nodes; (2) in-parallel recomputation]
Less computation and communication cost!

16 Handling cascading failures
N1 fails in superstep 12; N2 fails in superstep 11 during recovery
[Figures: (1) reassignment; (2) recomputation]
No need to recover A and B, since they have already been recovered!
The same recovery algorithm can be used to recover any failure!

17 Reassignment generation
When a failure occurs, how do we compute a good reassignment for the failed partitions?
–Minimize the recovery time
Calculating the recovery time is complicated because it depends on:
–The reassignment for the failure
–Cascading failures
–The reassignment for each cascading failure
No knowledge about cascading failures in advance!

18 Our insight
When a failure occurs (possibly a cascading failure), we prefer a reassignment that benefits the remaining recovery process, taking into account all the cascading failures that have occurred so far
We collect the state S after the failure and measure the minimum time T_low needed to achieve S_f*
–T_low provides a lower bound on the remaining recovery time (an illustrative greedy sketch follows below)
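The slides do not spell out the reassignment algorithm (the conclusions only mention a greedy strategy), so the following sketch is purely illustrative: it assumes the recomputation part of T_low is dominated by the most heavily loaded node and therefore greedily places each failed partition, in decreasing order of estimated cost, on the node with the smallest estimated recovery load. The cost model and names are assumptions, not the paper's method.

import heapq

def greedy_reassignment(failed_partitions, nodes, partition_cost, node_base_load):
    # failed_partitions: iterable of partition ids.
    # partition_cost(p): estimated cost to recompute partition p
    #                    (e.g., its #edges times the number of lost supersteps).
    # node_base_load[n]: estimated recovery work node n already has.
    heap = [(node_base_load[n], n) for n in nodes]
    heapq.heapify(heap)
    assignment = {}
    for p in sorted(failed_partitions, key=partition_cost, reverse=True):
        load, n = heapq.heappop(heap)          # least-loaded node so far
        assignment[p] = n
        heapq.heappush(heap, (load + partition_cost(p), n))
    return assignment

# Example: 4 failed partitions, 2 healthy nodes, cost = partition size.
sizes = {"P1": 40, "P2": 30, "P3": 20, "P4": 10}
print(greedy_reassignment(sizes, ["N2", "N3"], sizes.get, {"N2": 0, "N3": 0}))
# {'P1': 'N2', 'P2': 'N3', 'P3': 'N3', 'P4': 'N2'}  -> loads 50 vs 50

With the toy sizes above, the two nodes end up with balanced recovery loads of 50 each; a real cost function would also need to account for message forwarding and the partially recovered state S.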

19 Estimation of T_low

20 Reassignment generation problem

21 Outline
Motivation & background
Problem statement
–Challenging issues
–Existing solutions
Solution
–Reassignment generation
–In-parallel recomputation
–Workload rebalance
Experimental results
Conclusions

22 Experimental evaluation
Experiment settings
–In-house cluster with 72 nodes; each node has one Intel X GHz processor, 8GB of memory, and two 500GB SATA hard disks, and runs Hadoop and Giraph
Comparisons
–PBR (our proposed solution) vs. CBR (checkpoint-based recovery)
Benchmark tasks
–K-means
–Semi-clustering
–PageRank
Datasets
–Forest
–LiveJournal
–Friendster

23 PageRank results
[Figures: logging overhead; single node failure]

24 PageRank results
[Figures: multiple node failure; cascading failure]

25 PageRank results (communication cost)
[Figures: multiple node failure; cascading failure]

26 Conclusions
Developed a novel partition-based recovery method that parallelizes the failure recovery workload in distributed graph processing
Addressed the key challenges in failure recovery
–Handling cascading failures
–Reducing recovery latency
Formulated the reassignment generation problem and solved it with a greedy strategy

27 Thank You! Q & A