Zorro: Zero-Cost Reactive Failure Recovery in Distributed
Transcription
Zorro: Zero-Cost Reactive Failure Recovery in Distributed Graph Processing
Mayank Pundir and Luke Leslie

Proactive Failure Recovery
Periodic and synchronous disk IO

Overhead of Proactive Recovery
Per-iteration checkpointing slowdown with 16 servers

Desirable Properties
● Zero Overhead [ZO]: No overhead is incurred during failure-free execution.
● Complete Recovery [CR]: Results in the face of failures are fully accurate.
● Fast Recovery [FR]: Recovery after failure is quick and does not require additional iterations.

GAS Model

Reactive Failure Recovery
Out-neighbor Replication
Example: LFGraph, Pregel, Giraph, Hama
All-neighbor Replication
Example: Distributed GraphLab, PowerGraph

Why checkpoint when it’s already replicated?
a. All-Neighbor Replication
b. Out-Neighbor Replication

Analysis
Detailed proofs in paper
● The expected number of recovered vertices and the probability of recovery both depend on the fraction of servers that fail, rather than on the actual number of server failures.
● The probability of vertex recovery converges rapidly to 1 as the number of neighbors increases, or as the fraction of servers that fail decreases.

Three R’s of Zorr(r)o
● Replace: Each failed server is substituted by a new server (replacement server).
● Rebuild: Each replacement server collects state information from surviving servers and rebuilds its local state.
● Resume: Computation restarts from the beginning of the last iteration before failure.

Zorro Recovery Protocol

Evaluation
● Graphs: CA Road (Exponential), Twitter (Power-law), UK Web (Power-law)
● Frameworks: PowerGraph, LFGraph
● Applications: PageRank, Single-source Shortest Paths (SSSP), Connected Components, K-core Decomposition
● Setup: 16 machines, 16 cores, 64 GB RAM

PowerGraph: PageRank Inaccuracy

LFGraph: PageRank Inaccuracy

PowerGraph: SSSP Inaccuracy

LFGraph: SSSP Inaccuracy

Communication Overhead
Relative to total failure-free 10 PageRank iterations
a. PowerGraph
b. LFGraph

Recovery Time
a. PowerGraph
b. LFGraph

Partitioning Functions: PageRank
a. 8 failed servers
b. 4 Failures at last iteration

Partitioning Functions: SSSP
a. 8 failed servers
b. 4 Failures at last iteration

Future Work
● Don’t let failures stop you!
● Asynchronous computation
● Hybrid, tunable proactive + reactive replication
● Delay scheduling for cascading failures prediction model
● Gossip-style propagation among replacements

Conclusion
● Zorro borrows from the rich and gives to the poor!
“We live in a system of approximations.” - Ralph Waldo Emerson

Backup Slides

Trade-off Space
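
Illustrative sketch (not from the slides)
The Rebuild step under out-neighbor replication can be illustrated with a small Python sketch. It assumes vertices are hash-partitioned by vertex ID modulo the number of servers, and that a vertex's state is replicated on every remote server hosting one of its out-neighbors. The names owner, build_replicas and rebuild, and the toy graph, are hypothetical and not taken from the Zorro implementation. How vertices without any surviving replica are handled is left open here.

```python
def owner(vertex, num_servers):
    # Assumed partitioning function: vertex ID modulo the number of servers.
    return vertex % num_servers


def build_replicas(out_edges, state, num_servers):
    # replicas[s][v] is the copy of v's state that server s holds because
    # one of v's out-neighbors lives on s (out-neighbor replication).
    replicas = {s: {} for s in range(num_servers)}
    for v, neighbors in out_edges.items():
        for u in neighbors:
            s = owner(u, num_servers)
            if s != owner(v, num_servers):
                replicas[s][v] = state[v]
    return replicas


def rebuild(failed, state, replicas, num_servers):
    # Rebuild: for each vertex owned by a failed server, look for a copy of
    # its state on any surviving server.
    lost = [v for v in state if owner(v, num_servers) in failed]
    recovered, missing = {}, []
    for v in lost:
        for s in range(num_servers):
            if s not in failed and v in replicas[s]:
                recovered[v] = replicas[s][v]
                break
        else:
            # No surviving replica: this vertex cannot be recovered exactly.
            missing.append(v)
    return recovered, missing


# Toy example: 8 vertices on 4 servers, server 0 fails.
num_servers = 4
out_edges = {0: [1, 2], 1: [2], 2: [3], 3: [0], 4: [0], 5: [6], 6: [7], 7: [4]}
state = {v: float(v + 1) for v in range(8)}

replicas = build_replicas(out_edges, state, num_servers)
recovered, missing = rebuild({0}, state, replicas, num_servers)
print(recovered)  # vertex 0 has replicas on servers 1 and 2
print(missing)    # vertex 4's only out-neighbor is on the failed server
```

In this toy run, vertex 0 is rebuilt from a surviving replica, while vertex 4 is lost because its only out-neighbor shares the failed server. This matches the Analysis slide's point that recovery becomes more likely as a vertex has more neighbors spread across servers.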