Phoenix: A Substrate for Resilient Distributed Graph Analytics
Roshan Dathathri, Gurbinder Gill, Loc Hoang, Keshav Pingali
Phoenix
- Substrate to recover from fail-stop faults in distributed graph applications
- Tolerates an arbitrary number of failed machines, including cascading failures
- Classifies graph algorithms and uses a class-specific recovery protocol
- No overhead in the absence of faults, unlike checkpointing
- Evaluated on 128 hosts using graphs of up to 1TB
- 24x faster than GraphX, a fault-tolerant distributed graph processing system
- Outperforms checkpointing when up to 16 hosts fail
State of a graph
[Figure: an example graph with nodes A–H, and the state of the graph as an array of per-node labels (e.g., BFS distances), all initialized to ∞]
Distributed execution model
[Figure: the graph is partitioned across hosts h1 and h2 by CuSP [IPDPS'19]; each host computes on its partition using Galois [SoSP'13] (e.g., a BFS operator transitioning node distances from ∞), and hosts communicate to synchronize node state using Gluon [PLDI'18]]
How to recover from crashes or fail-stop faults?
[Figure: a fault is detected during synchronization (the communicate phase); Phoenix preserves the state on the surviving hosts and re-initializes the state of the lost nodes (e.g., BFS distances re-set to ∞)]
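To make the idea concrete for a locally-correcting algorithm such as BFS: surviving nodes keep their distances, lost nodes are re-set to their initial values, and the algorithm simply resumes. Below is a minimal single-machine Python sketch (the actual system is distributed C++; the function names here are ours, not the D-Galois API):

```python
import math

def bfs_round(graph, dist):
    """One round of the BFS operator: relax every edge once.
    Returns True if any distance changed."""
    changed = False
    for u, neighbors in graph.items():
        for v in neighbors:
            if dist[u] + 1 < dist[v]:
                dist[v] = dist[u] + 1
                changed = True
    return changed

def phoenix_recover_bfs(graph, dist, lost_nodes, source):
    """Phoenix-style recovery for BFS (locally-correcting):
    re-initialize only the lost nodes, preserve surviving state,
    and resume the algorithm, which self-corrects."""
    for n in lost_nodes:
        dist[n] = 0 if n == source else math.inf
    while bfs_round(graph, dist):
        pass
    return dist
```

Because distances during BFS execution are always upper bounds on the true distances, the state after re-initialization is valid, and resuming the operator converges to the correct answer.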
States during algorithm execution and recovery
[Figure: All States ⊇ Valid States ⊇ Globally Consistent States; the initial and final states are globally consistent. After a fault, Checkpoint-Restart rolls back to a globally consistent state, whereas Phoenix moves to a valid state]
Classification of graph algorithms
- Self-stabilizing algorithms: converge to the correct answer from any state
- Locally-correcting algorithms: converge to the correct answer from any valid state (e.g., BFS)
- Globally-correcting algorithms: converge to the correct answer from a valid state reached after a global adjustment
- Globally-consistent algorithms: must be restored to a globally consistent state
Classes: examples and recovery
- Self-stabilizing algorithms: collaborative filtering, belief propagation, pull-style pagerank, pull-style graph coloring. Recovery: re-initialize lost nodes
- Locally-correcting algorithms: breadth-first search, connected components, data-driven pagerank, topology-driven k-core. Recovery: re-initialize lost nodes
- Globally-consistent algorithms: betweenness centrality. Recovery: restart from last checkpoint
- Globally-correcting algorithms: residual-based pagerank, data-driven k-core, latent Dirichlet allocation. Recovery: ?
Problem: find the k-core of an undirected graph
- k-core: maximal subgraph in which every node has degree at least k
- Used, e.g., in graph coloring
[Figure: an example graph with nodes A–H and its 3-core, consisting of nodes E, F, G, H]
k-core algorithm (globally-correcting)
- If a node is alive (1) and its degree is less than k, mark it dead (0) and decrement its neighbors' degrees
[Figure: execution of the algorithm on the example graph, showing each node's alive flag and degree as nodes A–D are peeled away]
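The peeling rule above can be sketched as follows (a minimal sequential Python version; the actual implementation is a parallel D-Galois program):

```python
def kcore(graph, k):
    """Iteratively kill alive nodes with degree < k, decrementing
    their neighbors' degrees, until no node changes."""
    alive = {n: True for n in graph}
    degree = {n: len(graph[n]) for n in graph}
    changed = True
    while changed:
        changed = False
        for n in graph:
            if alive[n] and degree[n] < k:
                alive[n] = False          # mark dead (0)
                for m in graph[n]:
                    degree[m] -= 1        # decrement neighbors' degrees
                changed = True
    return {n for n in graph if alive[n]}
```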
Phoenix recovery for the k-core algorithm
- Valid state: the degree of every node equals its number of alive (1) neighbors; any node may be alive (1)
[Figure: after a fault, Phoenix re-initializes lost nodes as alive and re-computes degrees from alive neighbors to restore a valid state; the algorithm then resumes and re-kills any such nodes]
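This recovery can be sketched on top of the same peeling loop: re-initialize lost nodes as alive, re-compute every node's degree from its alive neighbors to restore the valid-state invariant, and resume (a sequential Python sketch with our own names, not the actual D-Galois code):

```python
def phoenix_recover_kcore(graph, alive, lost_nodes, k):
    """Phoenix recovery for k-core (globally-correcting)."""
    # Re-initialization: lost nodes are optimistically marked alive.
    for n in lost_nodes:
        alive[n] = True
    # Re-computation: restore the invariant that each node's degree
    # equals its number of alive neighbors.
    degree = {n: sum(alive[m] for m in graph[n]) for n in graph}
    # Resume the algorithm from this valid state; wrongly revived
    # nodes are killed again by the normal peeling rule.
    changed = True
    while changed:
        changed = False
        for n in graph:
            if alive[n] and degree[n] < k:
                alive[n] = False
                for m in graph[n]:
                    degree[m] -= 1
                changed = True
    return {n for n in graph if alive[n]}
```

Reviving lost nodes is safe because peeling only ever kills nodes: any node that does not belong to the k-core will be killed again when the algorithm resumes.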
Phoenix substrate for recovery
- Phoenix is invoked when a fail-stop fault is detected
- Arguments to Phoenix depend on the algorithm class:
  - Re-initialization function
  - Re-computation function (globally-correcting algorithms only)
- Phoenix recovery:
  1. Re-initialize and synchronize proxies
  2. Re-compute and synchronize proxies (optional; not needed for locally-correcting algorithms)
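Abstractly, the substrate's recovery protocol can be sketched as below (a single-host Python sketch with our own names, not the actual D-Galois API; the proxy synchronization that Gluon would perform after each step is omitted):

```python
def phoenix_recover(state, lost_nodes, reinit, recompute=None):
    """Generic Phoenix recovery, invoked on a detected fail-stop fault.

    reinit(n)       -- returns the fresh value for a lost node n
                       (the re-initialization function).
    recompute(state)-- optional global pass that adjusts state to
                       restore a valid state (globally-correcting
                       algorithms only)."""
    # Step 1: re-initialize lost nodes (then synchronize proxies).
    for n in lost_nodes:
        state[n] = reinit(n)
    # Step 2 (optional): re-compute (then synchronize proxies).
    if recompute is not None:
        recompute(state)
    return state
```

For a locally-correcting algorithm only `reinit` is supplied; for a globally-correcting algorithm like k-core, `recompute` restores the degree invariant before execution resumes.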
Experimental setup
Systems: D-Galois; Phoenix in D-Galois; Checkpoint-Restart (CR) in D-Galois; GraphX [GRADES'13]
Benchmarks (and algorithm classes):
- Connected components (cc): locally-correcting
- k-core (kcore): globally-correcting
- Pagerank (pr): globally-correcting (residual-based)
- Single-source shortest path (sssp): locally-correcting

Inputs:
            twitter   rmat28   kron30   clueweb   wdc12
|V|         51M       268M     1073M    978M      3,563M
|E|         2B        4B       11B      42B       129B
|E|/|V|     38        16       —        44        36
Size (CSR)  16GB      35GB     136GB    325GB     986GB

Clusters:
               Stampede              Wrangler
No. of hosts   128                   32
Machine        Intel Xeon Phi KNL    Intel Xeon Haswell
Threads/host   272 (KNL)             48 (Haswell)
Memory         96GB DDR3             128GB DDR4
Wrangler: fault-free total time on 32 hosts
[Chart: speedup (log scale) of Phoenix over GraphX for each benchmark and input; geometric mean speedup: 24x]
Stampede: fault-free execution time on 128 hosts
[Chart: execution time (s) for each benchmark and input; D-Galois and Phoenix are identical]
- Geometric mean overheads of checkpointing: CR-50: 31%; CR-500: 8%
Stampede: execution time when faults occur on 128 hosts (pr on wdc12)
[Chart: speedup of Phoenix over CR-50 and over CR-500]
- Phoenix can also be used in combination with checkpointing
Stampede: execution time overhead when faults occur
- Recovery time of Phoenix is negligible
- Compared to fault-free execution of Phoenix, when faults occur on 128 hosts:

System    Crashed machines    Avg. execution time overhead
Phoenix   4                   14%
Phoenix   16                  21%
Phoenix   64                  44%
CR-50     Any                 49%
CR-500    Any                 59%
Fail-stop fault-tolerant distributed graph systems
[Table comparing GraphX [GRADES'13], Imitator [DSN'14], Zorro [SoCC'15], CoRAL [ASPLOS'17], and Phoenix on six criteria: support for globally-correcting algorithms, support for globally-consistent algorithms, no fault-free execution overhead, tolerating any number of failed machines, guaranteeing precise results, and requiring no programmer input]
Future Work
- Extend Phoenix to handle data corruption errors or Byzantine faults
- Use compilers to generate Phoenix recovery functions automatically
- Explore Phoenix-style recovery for other application domains
Conclusion
- Phoenix: a substrate to recover from fail-stop faults in distributed graph applications
- Recovery protocols based on a classification of graph algorithms
- Implemented in D-Galois, the state-of-the-art distributed graph system
- Evaluated on 128 hosts using graphs of up to 1TB
- No overhead in the absence of faults, unlike checkpointing
- Outperforms checkpointing when up to 16 hosts crash
Programmer effort for Phoenix
- Globally-correcting kcore and pr: 1 day of programming; 150 lines of code added (to 300 lines of code)
- Locally-correcting cc and sssp: negligible programming effort; 30 lines of code added
Phoenix substrate for recovery: globally-correcting
Stampede: execution time when faults occur on 128 hosts (cc on wdc12)
[Chart: speedup of Phoenix over CR-50 and over CR-500]

Stampede: execution time when faults occur on 128 hosts (kcore on wdc12)
[Chart: speedup of Phoenix over CR-50 and over CR-500]

Stampede: execution time when faults occur on 128 hosts (sssp on wdc12)
[Chart: speedup of Phoenix over CR-50 and over CR-500]