FrogWild! Fast PageRank Approximations on Graph Engines
Ioannis Mitliagkas, Michael Borokhovich, Alex Dimakis, Constantine Caramanis

Web Ranking
Given the web graph, find the "important" pages.
Classic approach: rank pages by in-degree.
❖ Susceptible to manipulation by spammer networks.

PageRank [Page et al., 1999]
Page importance is described by a distribution π.
Recursive definition: important pages are pointed to by important pages, which are themselves pointed to by important pages, and so on.
❖ Robust to manipulation by spammer networks.

PageRank: Continuous Interpretation
Start: a gallon of water, distributed evenly over the vertices.
Every iteration:
❖ Each vertex spreads its water evenly to its successors.
❖ A fraction pT = 0.15 of all the water is redistributed evenly over all vertices.
Repeat until convergence; in practice the power iteration is usually employed.

Discrete Interpretation
A frog walks randomly on the graph: the next vertex is chosen uniformly at random among the current vertex's successors.
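The continuous "water" interpretation maps directly onto the power iteration. A minimal sketch in Python; the five-vertex graph is an illustrative assumption, not the exact example drawn on the slides:

```python
# Power iteration for PageRank: each vertex spreads its "water" evenly to
# its successors, then a fraction p_t of all water is redistributed evenly.
P_T = 0.15  # teleportation fraction

# Illustrative directed graph: vertex -> list of successors.
graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C", "E"],
    "E": ["A", "D"],
}

def pagerank(graph, p_t=P_T, iterations=100):
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}        # start: water spread evenly
    for _ in range(iterations):
        nxt = {v: 0.0 for v in graph}
        for v, succs in graph.items():
            share = rank[v] / len(succs)
            for u in succs:                   # spread evenly to successors
                nxt[u] += share
        # redistribute a fraction p_t of all the water evenly
        rank = {v: (1 - p_t) * nxt[v] + p_t / n for v in graph}
    return rank

pi = pagerank(graph)
print(sorted(pi, key=pi.get, reverse=True))   # vertices, heaviest first
```

Note that the total mass is conserved at every step: each vertex distributes all of its water, and the teleportation term only re-mixes a fraction of it.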
Teleportation: at every step, the frog teleports to a uniformly random vertex with probability pT.
Sampling after t steps: the frog's location gives a sample from π.
PageRank vector: release many frogs and estimate the vector π from where they land.

PageRank Approximation
We are looking for the k "heavy" nodes, so we do not need the full PageRank vector.
❖ Random-walk sampling favors heavy nodes.
❖ Captured-mass metric: for a node set S, report π(S).
Example: for k = 2, return the set {E, D}; the captured mass is π({E, D}).

Platform

Graph Engines
❖ The engine splits the graph across the cluster.
❖ A vertex program describes the logic.
GAS abstraction: 1. Gather, 2. Apply, 3. Scatter.
Other approaches: Giraph [Avery, 2011], Galois [Nguyen et al., 2013], GraphX [Xin et al., 2013].

Edge Cuts
❖ Assign vertices to machines.
❖ Cross-machine edges require network communication.
❖ Used by Pregel and GraphLab 1.0.
❖ High-degree nodes generate a large volume of traffic.
❖ Computational load imbalance.

Vertex Cuts
❖ Assign edges to machines.
❖ High-degree nodes are replicated; one replica is designated the master.
❖ Need for synchronization: 1. Gather, 2. Apply [on master], 3. Synchronize mirrors, 4. Scatter.
❖ Used by GraphLab 2.0 (PowerGraph).
❖ Balanced, but the network is still the bottleneck.

Random Walks on GraphLab
❖ The master node decides the frog's next step.
❖ The decision is synced to all mirrors, but only machine M (where the frog goes next) needs it: unnecessary network traffic.
❖ Average replication factor is ~8.

Objective
Faster PageRank approximation on GraphLab.
Idea: only synchronize the mirror that will receive the frog.
Doable, but requires
1. serious engine hacking, and
2. exposing an ugly/complicated API to the programmer.
Simpler: pick mirrors to synchronize at random! Synchronize each mirror independently with probability pS.

FrogWild!
Release N frogs in parallel.
Vertex program:
1. Each frog dies w.p. pT (and gives a sample). Assume K frogs survive.
2. For every mirror, draw a bridge w.p. pS.
3. Spread the surviving frogs evenly among the synchronized mirrors.
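Stripped of the distributed machinery, the sampling idea behind the vertex program can be sketched in a few lines of Python. The pS bridge randomization only matters across machines, so this single-process toy (reusing an illustrative five-vertex graph, not the slides' example) shows just the frog sampler: a frog that dies at step t reports its location, which is a sample from π.

```python
import random
from collections import Counter

P_T = 0.15   # per-step teleport/death probability

# Illustrative directed graph: vertex -> list of successors.
graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C", "E"],
    "E": ["A", "D"],
}
vertices = list(graph)

def release_frog(rng):
    """Walk until the frog 'dies' (teleports); report where it died."""
    v = rng.choice(vertices)        # frogs start uniformly at random
    while rng.random() >= P_T:      # survive this step w.p. 1 - p_T
        v = rng.choice(graph[v])    # hop to a uniform random successor
    return v

def top_k_estimate(n_frogs, k, seed=0):
    """Top-k heavy nodes estimated from the frogs' death locations."""
    rng = random.Random(seed)
    counts = Counter(release_frog(rng) for _ in range(n_frogs))
    return [v for v, _ in counts.most_common(k)]

print(top_k_estimate(n_frogs=20000, k=2))
```

The captured mass of the returned set can then be scored against the exact π, which is what the experiments measure.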
Bridges introduce dependencies!

Contributions
1. An algorithm for approximate PageRank.
2. A modification of GraphLab that exposes a very simple API extension (pS) and allows for randomized synchronization.
3. A speedup of 7-10x.
4. Theoretical guarantees for the solution despite the introduced dependencies.

Theoretical Guarantee
Let S be the top-k set of the estimate obtained from N frogs after t steps, and let OPT be the mass captured by the best set of k nodes. Then the mass captured by S satisfies

  π(S) ≥ OPT − ε   w.p. at least 1 − (k / (2ε²)) · (2/N + (1 − pS) p∩(t)),

where p∩(t), the probability that two frogs meet within the first t steps, satisfies

  p∩(t) ≤ 1/n + t‖π‖∞ / pT.

Experiments
[Experimental results: figures omitted in this transcription.]

Thank you!

References
Page, L., Brin, S., Motwani, R., & Winograd, T. (1999). The PageRank citation ranking: Bringing order to the web.
Malewicz, G., et al. (2010). Pregel: A system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. ACM.
Low, Y., Gonzalez, J., Kyrola, A., Bickson, D., Guestrin, C., & Hellerstein, J. M. (2010). GraphLab: A new framework for parallel machine learning. arXiv preprint arXiv:1006.4990.
Gonzalez, J. E., Low, Y., Gu, H., Bickson, D., & Guestrin, C. (2012). PowerGraph: Distributed graph-parallel computation on natural graphs. In OSDI (Vol. 12, No. 1, p. 2).
Nguyen, D., Lenharth, A., & Pingali, K. (2013). A lightweight infrastructure for graph analytics. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (pp. 456-471). ACM.
Avery, C. (2011). Giraph: Large-scale graph processing infrastructure on Hadoop. Proceedings of Hadoop Summit. Santa Clara, USA.
Xin, R. S., Gonzalez, J. E., Franklin, M. J., & Stoica, I. (2013). GraphX: A resilient distributed graph system on Spark. In First International Workshop on Graph Data Management Experiences and Systems (p. 2). ACM.

Backup Slides

PageRank [Page et al., 1999]
Normalized adjacency matrix: P_ij = 1/d_out(j) for every edge (j, i) ∈ G.
Augmented matrix: Q_ij = (1 − pT) P_ij + pT/n, with pT ∈ [0, 1].
PageRank vector: π = Qπ.
Power method: Q^t p_0 → π.

Here be dragons.
[The remaining backup slides contain only graph figures.]
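The matrix form on the backup slides can be checked numerically: build P column by column from the out-degrees, form Q = (1 − pT)P + (pT/n)·11ᵀ, and iterate. A sketch assuming the same illustrative five-vertex graph used above (not the slides' exact example):

```python
import numpy as np

P_T = 0.15
# Illustrative graph; an edge (j, i) means j links to i.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A"),
         ("D", "C"), ("D", "E"), ("E", "A"), ("E", "D")]
names = ["A", "B", "C", "D", "E"]
idx = {v: i for i, v in enumerate(names)}
n = len(names)

# Normalized adjacency matrix: P[i, j] = 1 / d_out(j) for edge (j, i).
out_deg = {v: sum(1 for j, _ in edges if j == v) for v in names}
P = np.zeros((n, n))
for j, i in edges:
    P[idx[i], idx[j]] = 1.0 / out_deg[j]

# Augmented matrix: adding the scalar p_T / n to every entry is the same
# as adding (p_T / n) * ones((n, n)).
Q = (1 - P_T) * P + P_T / n

p = np.full(n, 1.0 / n)       # uniform start
for _ in range(200):          # power method: Q^t p_0 -> pi
    p = Q @ p
print(dict(zip(names, p.round(4))))
```

At the fixed point the vector satisfies π = Qπ, the eigenvector equation from the backup slides, and its entries sum to one because Q is column-stochastic.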