Slides
FrogWild!
Fast PageRank Approximations
on Graph Engines
Ioannis Mitliagkas
Michael Borokhovich
Alex Dimakis
Constantine Caramanis
Web Ranking
Given web graph
Find “important” pages
Classic Approach
Rank Based on In-degree
Susceptible to manipulation by spammer networks
(Figure: example web graph with pages A-E and spammer pages S)
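To make the susceptibility concrete, here is a tiny hypothetical sketch (the edge list and the helper name `in_degree_rank` are illustrative): ranking by in-degree lets a spammer promote a page S simply by creating cheap pages that link to it.

```python
# A toy illustration of in-degree ranking and why it is easy to manipulate.
from collections import Counter

def in_degree_rank(edges):
    """Rank pages by the number of incoming links."""
    indeg = Counter(dst for _, dst in edges)
    return sorted(indeg, key=indeg.get, reverse=True)

web = [("A", "B"), ("B", "C"), ("C", "A"), ("D", "A"), ("E", "A")]
print(in_degree_rank(web))          # A looks most important

# A spammer adds cheap pages that all link to S.
spam = [(f"spam{i}", "S") for i in range(10)]
print(in_degree_rank(web + spam))   # S now tops the ranking
```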
PageRank [Page et al., 1999]
Page Importance
Described by distribution π
Recursive Definition
❖ Important pages are pointed to by important pages,
❖ which are pointed to by important pages,
❖ which are pointed to by important pages…
Robust to manipulation by spammer networks
(Figure: example graph with pages A-E and their PageRank mass π)
PageRank - Continuous Interpretation
Start: Gallon of water distributed evenly
Every Iteration
Each vertex spreads water evenly to successors
Redistribute evenly a fraction, pT = 0.15, of all water
Repeat until convergence → π
Power iteration usually employed
(Figure: water flowing over the example graph A-E)
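A minimal sketch of the water-spreading iteration above, on an illustrative five-page graph (the edge list and iteration count are assumptions, not the graph in the figure): each round every vertex pushes its water evenly to its successors, and a pT = 0.15 fraction of all water is redistributed evenly.

```python
# Water-spreading (power iteration) sketch for PageRank on a toy graph.
P_T = 0.15  # teleport fraction redistributed evenly each iteration

graph = {   # successors of each page (illustrative edge list)
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A", "D", "E"],
    "D": ["E"],
    "E": ["A"],
}

water = {v: 1.0 / len(graph) for v in graph}  # start: one gallon, spread evenly

for _ in range(50):  # repeat until (approximate) convergence
    spread = {v: 0.0 for v in graph}
    for v, succs in graph.items():
        for s in succs:                         # each vertex spreads its water
            spread[s] += water[v] / len(succs)  # evenly to its successors
    # redistribute a fraction p_T of all water evenly over the vertices
    water = {v: (1 - P_T) * spread[v] + P_T / len(graph) for v in graph}

print(water)  # approximates the PageRank distribution pi
```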
Discrete Interpretation
Frog walks randomly on graph
Next vertex chosen uniformly at random
Teleportation
Every step: teleport w.p. pT
Sampling after t steps
Frog location gives sample from π
PageRank Vector
Many frogs, estimate vector π
(Figure: frog hopping on the example graph A-E; transition probabilities 1/3 out of a degree-3 vertex, 1 out of a degree-1 vertex)
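A minimal Monte Carlo sketch of the frog interpretation, again on an illustrative graph and with assumed parameter values: run many independent teleporting random walks for t steps and histogram the final locations to estimate π.

```python
# Monte Carlo estimate of PageRank via random walks ("frogs") with teleportation.
import random
from collections import Counter

P_T = 0.15          # teleport probability per step
T_STEPS = 20        # walk length t
N_FROGS = 20000     # number of independent walks

graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A", "D", "E"],
    "D": ["E"],
    "E": ["A"],
}
nodes = list(graph)

counts = Counter()
for _ in range(N_FROGS):
    v = random.choice(nodes)               # frog starts at a uniform vertex
    for _ in range(T_STEPS):
        if random.random() < P_T:
            v = random.choice(nodes)       # teleport w.p. p_T
        else:
            v = random.choice(graph[v])    # next vertex uniform among successors
    counts[v] += 1                         # final location is a sample from ~pi

pi_hat = {v: counts[v] / N_FROGS for v in nodes}
print(pi_hat)
```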
PageRank Approximation
Looking for k “heavy nodes”
Do not need full PageRank vector
Random Walk Sampling
Favors heavy nodes
Captured Mass Metric
For node set S: π(S)
Example: k = 2, return set {E, D}; captured mass = π({E, D})
(Figure: example graph with heavy nodes E and D highlighted)
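A small sketch of the captured-mass metric, with made-up PageRank values for illustration: score an estimated top-k set S by the exact mass π(S) it captures, and compare it to the mass of the true top-k set (OPT).

```python
# Captured-mass metric: how much exact PageRank mass the estimated top-k set holds.

def captured_mass(pi_exact, S):
    return sum(pi_exact[v] for v in S)

def top_k(pi, k):
    return set(sorted(pi, key=pi.get, reverse=True)[:k])

pi_exact = {"A": 0.28, "B": 0.07, "C": 0.25, "D": 0.18, "E": 0.22}  # illustrative
pi_hat   = {"A": 0.25, "B": 0.10, "C": 0.22, "D": 0.21, "E": 0.22}  # noisy estimate

k = 2
S = top_k(pi_hat, k)                                  # top-k set from the estimate
opt = captured_mass(pi_exact, top_k(pi_exact, k))     # mass of the true top-k set
print(S, captured_mass(pi_exact, S), "vs OPT =", opt)
```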
Platform
Graph Engines
❖ Engine splits graph across cluster
❖ Vertex program describes logic
GAS abstraction
1. Gather
2. Apply
3. Scatter
Other approaches: Giraph [Avery, 2011], Galois [Nguyen et al., 2013], GraphX [Xin et al., 2013]
(Figure: example graph A-E distributed across the cluster)
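A minimal, single-machine sketch of the Gather-Apply-Scatter pattern for PageRank; the function names (`run_gas`, `build_graph`) and the toy edge list are illustrative, not GraphLab's actual API.

```python
# A toy, synchronous Gather-Apply-Scatter (GAS) loop computing PageRank.
import collections

EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A"),
         ("C", "D"), ("C", "E"), ("D", "E"), ("E", "A")]
P_T = 0.15  # teleport probability

def build_graph(edges):
    out_nbrs, in_nbrs = collections.defaultdict(list), collections.defaultdict(list)
    nodes = set()
    for u, v in edges:
        out_nbrs[u].append(v)
        in_nbrs[v].append(u)
        nodes.update((u, v))
    return sorted(nodes), out_nbrs, in_nbrs

def run_gas(edges, iters=50):
    nodes, out_nbrs, in_nbrs = build_graph(edges)
    rank = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(iters):
        # Gather: each vertex sums rank/out-degree over its in-neighbors.
        acc = {v: sum(rank[u] / len(out_nbrs[u]) for u in in_nbrs[v]) for v in nodes}
        # Apply: combine the gathered value with teleportation.
        rank = {v: P_T / len(nodes) + (1 - P_T) * acc[v] for v in nodes}
        # Scatter: a real engine would signal out-neighbors to recompute;
        # in this synchronous toy, every vertex simply runs each iteration.
    return rank

print(run_gas(EDGES))
```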
Edge Cuts
❖ Assign vertices to machines
❖ Cross-machine edges require network communication
❖ Pregel, GraphLab 1.0
❖ High-degree nodes generate large volume of traffic
❖ Computational load imbalance
(Figure: example graph A-E with vertices assigned to Machines 1-3; cross-machine edges shown)
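A small sketch of why edge cuts hurt on skewed graphs (the hub-and-path graph, machine count, and random placement are assumptions): count the cross-machine edges a random vertex assignment produces, and the per-machine load created by a single high-degree hub.

```python
# Edge cut: vertices are assigned to machines; every edge whose endpoints live
# on different machines requires network communication.
import random
from collections import Counter

edges = [("hub", f"v{i}") for i in range(30)] + \
        [(f"v{i}", f"v{i+1}") for i in range(29)]
vertices = {u for e in edges for u in e}
machines = 3

assign = {v: random.randrange(machines) for v in vertices}   # random vertex placement

cross = sum(1 for u, v in edges if assign[u] != assign[v])   # edges needing network hops
load = Counter(assign[u] for u, v in edges)                  # edges handled per source machine
print("cross-machine edges:", cross, "of", len(edges))
print("per-machine load:", dict(load))                       # the hub's machine is overloaded
```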
Vertex Cuts
❖ Assign edges to machines
❖ High-degree nodes replicated
❖ One replica designated master
❖ Need for synchronization
1. Gather
2. Apply [on master]
3. Synchronize mirrors
4. Scatter
❖ GraphLab 2.0 - PowerGraph
❖ Balanced - Network still bottleneck
(Figure: edges of the example graph assigned to Machines 1-3; high-degree vertex B replicated on each machine)
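A companion sketch for vertex cuts (again with an illustrative hub-and-path graph and random edge placement): assign edges to machines and measure the resulting replication factor; the high-degree hub ends up mirrored almost everywhere. This replication factor is the quantity a later slide quotes as roughly 8 on average.

```python
# Vertex cut: edges are assigned to machines; a vertex is replicated on every
# machine that holds at least one of its edges.
import random
from collections import defaultdict

edges = [("hub", f"v{i}") for i in range(30)] + \
        [(f"v{i}", f"v{i+1}") for i in range(29)]
machines = 3

replicas = defaultdict(set)
for u, v in edges:
    m = random.randrange(machines)      # random edge placement
    replicas[u].add(m)                  # both endpoints get a replica on machine m
    replicas[v].add(m)

rep_factor = sum(len(ms) for ms in replicas.values()) / len(replicas)
print("replication factor:", round(rep_factor, 2))
print("hub replicas:", sorted(replicas["hub"]))   # the hub is on nearly every machine
```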
Random Walks on GraphLab
Master node decides step
Decision synced to all mirrors
Only machine M needs it
Unnecessary network traffic
Average replication factor ~8
(Figure: vertex B replicated on Machines 1-3 and M; the step to Z is synced to all of B's mirrors, though only Machine M, which holds Z, needs it)
Objective
Faster PageRank approximation on GraphLab
Idea
Only synchronize the mirror that will receive the frog
Doable, but requires
1. Serious engine hacking
2. Exposing an ugly/complicated API to programmer
Simpler
Pick mirrors to synchronize at random!
Synchronize independently with probability pS
FrogWild!
Release N frogs in parallel
Vertex Program
1. Each frog dies w.p. pT (gives sample); assume K frogs survive
2. For every mirror, draw bridge w.p. pS, i.e. Ber(pS)
3. Spread frogs evenly among synchronized mirrors
Bridges introduce dependencies!
(Figure: K frogs at B's master on Machine 1; Ber(pS) bridges drawn to B's mirrors; the survivors split K/2 and K/2 between the two synchronized mirrors)
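A sketch of one FrogWild step at a master vertex, following the three numbered points above; the fallback when no mirror draws a bridge, and all names and parameter values, are assumptions rather than the actual GraphLab implementation.

```python
# Sketch of one FrogWild step at a master vertex, per the slide's vertex program.
# The fallback when no bridge is drawn (keep frogs at the master's machine) is an
# assumption, not taken from the talk.
import random

P_T = 0.15   # frog death / teleport probability
P_S = 0.7    # probability of synchronizing ("bridging") each mirror

def frogwild_step(n_frogs, mirror_machines, samples, vertex):
    # 1. Each frog dies w.p. p_T and reports its current vertex as a sample.
    survivors = 0
    for _ in range(n_frogs):
        if random.random() < P_T:
            samples.append(vertex)
        else:
            survivors += 1
    # 2. For every mirror, draw a bridge independently w.p. p_S.
    synced = [m for m in mirror_machines if random.random() < P_S]
    if not synced:                  # assumed fallback: no bridge drawn
        synced = ["master"]
    # 3. Spread the surviving frogs evenly among the synchronized mirrors.
    per_mirror = {m: survivors // len(synced) for m in synced}
    for i in range(survivors % len(synced)):   # distribute the remainder
        per_mirror[synced[i]] += 1
    return per_mirror

samples = []
print(frogwild_step(100, ["machine2", "machine3", "machineM"], samples, "B"))
print("samples collected:", len(samples))
```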
Contributions
1. Algorithm for approximate PageRank
2. Modification of GraphLab
   Exposes very simple API extension (pS).
   Allows for randomized synchronization.
3. Speedup of 7-10x
4. Theoretical guarantees for solution despite introduced dependencies
Theoretical Guarantee
Mass captured by top-k set, S, of estimate from N frogs after t steps:
π(S) ≥ OPT − ε  with high probability
The error ε grows with k, shrinks with the number of frogs N, and contains a term (1 − pS) · p∩(t),
where p∩(t), the probability that two frogs meet during the first t steps, satisfies
p∩(t) ≤ 1/n + t‖π‖∞ / pT
Experiments
Experimental Results
(Figures: experimental results plots)
Thank you!
References
Page, L., Brin, S., Motwani, R., & Winograd, T. (1999). The PageRank citation ranking: Bringing order to the web.
Malewicz, G., et al. (2010). Pregel: A system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. ACM.
Low, Y., Gonzalez, J., Kyrola, A., Bickson, D., Guestrin, C., & Hellerstein, J. M. (2010). GraphLab: A new framework for parallel machine learning. arXiv preprint arXiv:1006.4990.
Gonzalez, J. E., Low, Y., Gu, H., Bickson, D., & Guestrin, C. (2012). PowerGraph: Distributed graph-parallel computation on natural graphs. In OSDI (Vol. 12, No. 1, p. 2).
Nguyen, D., Lenharth, A., & Pingali, K. (2013). A lightweight infrastructure for graph analytics. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (pp. 456-471). ACM.
Avery, C. (2011). Giraph: Large-scale graph processing infrastructure on Hadoop. In Proceedings of Hadoop Summit. Santa Clara, USA.
Xin, R. S., Gonzalez, J. E., Franklin, M. J., & Stoica, I. (2013). GraphX: A resilient distributed graph system on Spark. In First International Workshop on Graph Data Management Experiences and Systems (p. 2). ACM.
Backup Slides
PageRank [Page et al., 1999]
Normalized Adjacency Matrix
P_ij = 1 / d_out(j), for (j, i) ∈ G
Augmented Matrix
Q_ij = (1 − pT) P_ij + pT / n,  with pT ∈ [0, 1]
PageRank Vector
π = Qπ
Power Method
Q^t p_0 → π as t → ∞
(Figure: example graph A-E with edge transition probabilities 1/3 and 1)
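A worked numpy sketch of these backup-slide definitions on an illustrative adjacency matrix: build P from out-degrees, form Q = (1 − pT)·P + pT/n, and run the power method Q^t p_0 → π.

```python
# Power method on the augmented matrix Q, following the definitions above.
import numpy as np

p_T = 0.15
# Illustrative adjacency: A[i, j] = 1 if there is an edge j -> i.
A = np.array([
    [0, 0, 1, 0, 1],   # into A: from C, E
    [1, 0, 0, 0, 0],   # into B: from A
    [1, 1, 0, 0, 0],   # into C: from A, B
    [0, 0, 1, 0, 0],   # into D: from C
    [0, 0, 1, 1, 0],   # into E: from C, D
], dtype=float)
n = A.shape[0]

d_out = A.sum(axis=0)                  # out-degree of each column vertex j
P = A / d_out                          # P_ij = 1 / d_out(j) for edges (j, i)
Q = (1 - p_T) * P + p_T / n            # augmented (teleportation) matrix

pi = np.full(n, 1.0 / n)               # p_0: uniform start
for _ in range(100):                   # Q^t p_0 -> pi
    pi = Q @ pi

print(pi, pi.sum())                    # fixed point pi = Q pi, sums to 1
```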
Here be dragons.
Backup
(Figures: repeated copies of the example graph A-E)