TP3: Large Scale Graph Learning

daniele.calandriello@inria.fr
Tuesday 10th March, 2015
Abstract
The report and the code are due in 2 weeks (deadline 23:59 24/3/2015). You can send them by email to daniele.calandriello@inria.fr, with subject TD3 Graphs In ML Name Surname. Naming the attachments something along the lines of TD3_Graphs_ML_report_name_surname.{pdf,doc} and TD3_Graphs_ML_code_name_surname.{zip,gzip} would be greatly appreciated. All the code related to the TD must be submitted, to provide background for the code evaluation. All submissions that arrive late will be linearly penalized. The maximum score is 100 for all submissions before 23:59 24/3/2015 and 0 for all submissions after 23:59 27/3/2015 (e.g. the maximum score is about 70 at 21:30 25/3/2015).
Material on http://chercheurs.lille.inria.fr/~calandri/
A small preface (especially for those who will not be present at the TD).
All the experiments presented during this TD make use of randomly generated datasets. Because of this, there is always the possibility that a single run will not be representative of the usual outcome. Randomness in the data is common in Machine Learning, and managing this randomness is important. A proper experimental setting calls for repeated experiments and confidence intervals. In this case, it will be sufficient to repeat each experiment multiple times and visually check whether there are huge variations (some experiments are designed exactly to show these variations).
1 Large Scale Semi-Supervised Learning
A large part of the algorithms on graphs that we have seen so far depend on a matrix representation of the graph (e.g. the Laplacian) to compute their results. This means that the execution of these algorithms is strongly influenced by the constraint of representing matrices in memory, and by the computational complexity of executing operations on them.
We will see later how the computational complexity plays a role, but the largest constraint in real-world systems is memory occupation. In particular, storing in memory all the elements of an $n \times n$ matrix will require around 190 MB for $n = 5000$, but will easily cross the GB threshold at $n = 12000$ and the 10 GB threshold at $n = 37000$. At this point it is useless to talk about the number of vertices $n$ anymore, and we have to start to reason in terms of the number of edges $m$.
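As a quick sanity check on these figures, here is a back-of-the-envelope sketch in Python (assuming 8-byte double-precision entries):

# Memory footprint of a dense n x n matrix of 8-byte floats.
for n in (5000, 12000, 37000):
    gib = n * n * 8 / 1024.0**3
    print("n = %5d  ->  %6.2f GiB" % (n, gib))
# roughly 0.19, 1.07 and 10.20 GiB, in line with the figures above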
Another important part of computation at large scale is parallelism. Parallel algorithms have the advantage of distributing the data across nodes to increase memory capacity, as well as computing different parts of the solution concurrently on different nodes. For this TD, we chose to use the Graphlab¹ library.
Graphlab implements a parallel, distributed paradigm for implementing machine learning algorithms, and hides most of the complexity of the parallelization and communication from the user. Nonetheless, it is important to have an idea of how a modern parallel system abstracts its computations.

1.1. Briefly read Section 2 (2.1, 2.2 and 2.3) of http://select.cs.cmu.edu/code/graphlab/abstractiononly.pdf and give a very high-level description of Data Graph, Update function and Sync function.
We will now start working with the library; you can find a working installation of Graphlab in the VM that we provided for the course.
2 Large Scale Label Propagation
One of the main objectives of Graphlab is allowing computation to be carried out not only on datasets that can fit in memory. For this reason, their main structure, the SFrame, stores most of its data on disk, and efficiently maps it into memory when a function has to be applied to it. This allows even normal machines to scale to GB-size datasets.
2.1. What is the main drawback of storing data on the disk? How can we try
to mitigate this drawback when implementing our algorithms?
We will begin by implementing label propagation on a medium-scale dataset. In TD2 we have seen how to do this by computing an HFS solution.
¹ https://dato.com
The Harmonic property that the HFS solution wants to satisfy is

$$f(x_i) = \frac{\sum_{i \sim j} f(x_j)\, w_{ij}}{\sum_{i \sim j} w_{ij}}$$
Computing HFS using linear algebra is not particularly easy to parallelize,
and does not decompose easily over vertices as the Graphlab abstraction would
like.
But the Harmonic property gives us an easy idea for implementing this
algorithm iteratively with subsequent sweeps across the graph.
In particular, if we know the degree of each node, we can compute a single
iteration of label propagation just by knowing the source node, the destination
node and the weight on the edge of the graph.
$$f(x_i) = f(x_i) + \frac{f(x_j)\, w_{ij}}{\sum_{i \sim j} w_{ij}}$$
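To make the sweep concrete, here is a minimal plain-Python/NumPy sketch of one such pass over an edge list (just an illustration of the update above, not the Graphlab implementation you will write later; edges, f and labeled are hypothetical names):

import numpy as np

# One label-propagation sweep: recompute every score as the weighted
# average of its neighbours' scores, keeping the known labels fixed.
# edges: list of (i, j, w_ij) tuples; f: current scores; labeled: boolean mask.
def sweep(edges, f, labeled):
    degree = np.zeros(len(f))
    accum = np.zeros(len(f))
    for i, j, w in edges:
        degree[i] += w
        degree[j] += w
        accum[i] += w * f[j]
        accum[j] += w * f[i]
    new_f = accum / np.maximum(degree, 1e-12)
    new_f[labeled] = f[labeled]
    return new_f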
In order to implement this algorithm in Graphlab, we will need to introduce some concepts from the large-scale abstraction that Graphlab implements. The basic data structure in Graphlab is the SFrame, which is a tabular data type with named columns that can contain several kinds of values, such as integers, floats, lists or arrays.
Graphlab distributes the table according to a key value called __id, so random access on the structure is not recommended. Instead, Graphlab provides the ability to filter an SFrame using the syntax
sf_extract = sf[['col_name1', 'col_name2']]
sf_extract = sf[sf['col_name1'] == 2]
sf_extract = sf[sf['col_name1'] == 2][['col_name1', 'col_name2']]
In the first case we are extracting specific columns from an SFrame, in the second we are extracting specific rows, and in the third we are combining the two actions (do notice that the order of the filtering matters).
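For instance, with made-up column names (assuming GraphLab Create's SFrame constructor from a dictionary of columns):

import graphlab

sf = graphlab.SFrame({'col_name1': [1, 2, 2, 3],
                      'col_name2': ['a', 'b', 'c', 'd']})

# Rows first, then columns: keeps the rows where col_name1 == 2.
sf_extract = sf[sf['col_name1'] == 2][['col_name1', 'col_name2']]

# Columns first: col_name1 is dropped, so it can no longer be used to
# filter the rows afterwards; this is why the order of the filtering matters.
sf_only_col2 = sf[['col_name2']]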
After we have loaded our data in an SFrame, we want to perform operations on this data. For this reason Graphlab provides the apply function, which receives every row of the SFrame and returns an SArray, the datatype that composes the columns of an SFrame. In other words, we can write

sf['new_column'] = sf.apply(foo)
sf['old_column'] = sf.apply(foo)
where in the first case we save the result in a new column, and in the second we update an already existing column. The function foo can access all the data in a single row:
def foo(row):
    if row['col_name2']:
        return row['col_name1']
    else:
        return 0
If instead we filter before calling the apply function, we will only have access to the selected columns. The power of apply comes into play when coupled with anonymous functions, also called lambda functions. In Python these are defined as
foo = lambda x, y, z: x + y + z
The particular property of lambda functions is that they have no state, so for example assignments are not permitted inside these functions. On the other hand, we can write something like
sf['new_column'] = sf.apply(
    lambda row:
        1 if row['__id'] in node_list else 0
)
The expression in the lambda function will be evaluated to either 0 or 1, and the resulting value will be stored in the new column of the SFrame. This example shows the other important aspect of lambda functions: closures. In the example, the lambda function is accessing the node_list variable, but this variable is not declared in the function (lambdas have no state) nor is it provided as a parameter. But because the function is defined in the middle of my code, I can access all local variables available during the definition, and trap them inside my lambda function for later access, creating a closure. It is never a good idea to trap a mutable object, because we do not know if the changes to the object will propagate to the closure.
2.2. Can you think of a simple problem with using mutable variables in a
closure in a distributed framework like Graphlab?
There are two last tools we need for the implementation. SFrames support simple grouping of rows, using the syntax

sf_new = sf.groupby(
    ["group_col"],
    {'new_col_name':
        graphlab.aggregate.OPERATION("col_name")
    }
)

This command will group all rows that have the same group_col value, and will apply the transformation graphlab.aggregate.OPERATION to each group. Simple transformations are concatenation of values, or summation. The result will be stored in a new SFrame with a key column group_col and a second column new_col_name.
The last and most important tool is triple_apply. This function takes an SGraph in input, and returns a new, updated SGraph. An SGraph is composed essentially of two SFrames, vertices and edges. The vertices are identified by a key value __id and the edges by the couple __src_id and __dst_id. Since they are both SFrames, you can store any kind of data on vertices and edges and apply the usual lambda-function updates on them. The triple_apply function instead has the following pattern

def foo(src, edge, dst):
    src['col_name1'] += dst['col_name2']
    edge['col_name3'] = \
        1 if src['__id'] == dst['__id'] else 0
    return (src, edge, dst)
All the updates are carried out in parallel across nodes. For large datasets distributed on a cluster, this is a powerful abstraction. We will exploit it to implement two ML algorithms, label propagation and graph sparsification. For the rest of the basic Python datatypes, and the remaining Graphlab documentation, you can refer to the manuals online.
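As a minimal usage sketch of the pattern above (assuming GraphLab Create's SGraph.triple_apply signature, where mutated_fields lists the vertex and edge fields the function is allowed to modify):

# Apply foo to every (source, edge, destination) triple of the graph g,
# in parallel; 'col_name1' and 'col_name3' are the fields foo mutates.
g = g.triple_apply(foo, mutated_fields=['col_name1', 'col_name3'])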
2.3. Complete label_propagation.py
2.4. Why do we resort to Approximate Nearest Neighbours to compute the similarity graph?
2.5. The implementation provided by NearPy uses Locality-Sensitive Hashing². Try to give a high-level explanation of how LSH works, and why we chose to use it in this problem compared to tree-based ANN.
2.6. Why do we choose not to use a closure when feeding the ANN Engine?
² https://en.wikipedia.org/wiki/Locality-sensitive_hashing
2.7. Why do we need line 126 (in the original file) in label_propagation.py?
2.8. In normal HFS, regularization is added to the Laplacian to simulate absorption at each step in the label propagation. How can you have a similar
regularization in the iterative label propagation?
2.9. Try different combinations of parameters for the neighbours, regularization, and hashing space, and plot the accuracy and running times.
An accuracy of ∼30% can be quickly attained on this data. It might be possible to get better performance by tuning parameters and changing parts of the implementation, and you are invited to try. The goal of this TD, on the other hand, is to introduce large scale approaches to Graph Learning, and final accuracy is secondary.
3 Large Scale and Sparsification
From the computational point of view, the naive implementations of the two most basic matrix operations are $O(n^2)$ for matrix-vector multiplication and $O(n^3)$ for matrix-matrix multiplication. More advanced approaches reduced the latter exponent to $O(n^{2.8})$ for Strassen's algorithm [2] and $O(n^{2.37})$ for the Coppersmith-Winograd algorithm [1].
The running times of these algorithms seem superior to the naive implementation, but they are rarely used in practice. Coppersmith-Winograd has large constants that make it prohibitive for any matrix that would fit in reasonable memory. Strassen's algorithm is a possible practical candidate, and for $n$ in the thousands it is often used in practice. The exact threshold is hard to derive, because modern hardware architectures rely heavily on multiple levels of cache, and therefore perform orders of magnitude worse for Strassen's algorithm, which mostly uses random access to memory, compared to the contiguous access of the naive implementation. In the end, this means that all algorithms that perform matrix-matrix multiplications (for example most kinds of decompositions) will have a computational cost of $O(n^{2.8})$.
The matrix-vector multiplication (in the general case) has an upper and lower bound of $O(n^2)$. This is because we need to examine at least once each of the elements of the matrix. In particular cases, when the matrix has only $m$ non-zero entries, the computational cost becomes $O(m)$. This shift in costs is at the basis of a series of iterative solvers for all kinds of linear algebra problems (decompositions, linear systems) where iterative methods requiring only matrix-vector multiplication are involved.
In the particular case of solving linear problems, the solution that minimizes the residual norm

$$\hat{x} = \arg\min_x \|Ax - b\|$$

can be found iteratively. In the most general and worst case, this requires $O(mn)$, which again translates to $O(n^3)$ for dense matrices. This formulation can be used for many kinds of problems (e.g. finding eigenvectors).
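As an illustration of why this matters for sparse matrices, here is a sketch using SciPy's conjugate gradient solver on a toy path-graph Laplacian (each iteration only needs one sparse matrix-vector product, i.e. $O(m)$ work, instead of the $O(n^2)$ of a dense multiply):

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

# Toy sparse Laplacian of a path graph on n nodes (O(n) non-zero entries).
n = 1000
diag = 2.0 * np.ones(n)
diag[0] = diag[-1] = 1.0
off = -np.ones(n - 1)
L = sp.diags([off, diag, off], [-1, 0, 1], format='csr')

b = np.random.randn(n)
b -= b.mean()                      # keep b in the range of the Laplacian

# Conjugate gradient only touches L through matrix-vector products.
x, info = cg(L, b)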
Given a graph $G$, a $(1 \pm \varepsilon)$ spectral approximation $L_H$ of the Laplacian $L_G$ is defined as

$$(1 - \varepsilon)\, x^T L_G x \le x^T L_H x \le (1 + \varepsilon)\, x^T L_G x \qquad (1)$$

$$(1 - \varepsilon) \le \frac{x^T L_H x}{x^T L_G x} \le (1 + \varepsilon) \qquad (2)$$

$$\frac{1}{1 + \varepsilon}\, x^T L_G^+ x \le x^T L_H^+ x \le \frac{1}{1 - \varepsilon}\, x^T L_G^+ x \qquad (3)$$
Spectral approximations play an important role in many modern linear solvers, and efficient solvers have an impact on a large range of problems. In order to make the solver efficient, we would like to substitute $L_G$ with a good approximation $L_H$, but $L_H$ needs to be sparse. This is a key component to have fast solvers for normal-size problems, and it allows problems that could not be solved exactly for lack of memory to become feasible. In particular you will be implementing the following version of sparsification (Alg. 1).
Algorithm 1 Spielman-Srivastava rejection sampling
Input: $G$
Output: $H$
Initialize $H = \emptyset$
for $e \in G$ do
    for $i = 1$ to $N = O(n \log^2 n / \varepsilon^2)$ do   [run this loop implicitly]
        Accept $e$ with probability $p_e = R_e / (n - 1)$
        Add it with weight $w_e / N = (n - 1)/(N R_e)$
    end for
end for
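A plain-Python sketch of this pseudocode, with the inner loop written out explicitly and the effective resistances R assumed to be precomputed (the triple_apply version asked for below is left to you):

import random

# edges: list of (i, j) pairs of the unweighted graph; R: dict mapping each
# edge to its effective resistance; n: number of nodes; N: coin flips per edge.
def sparsify(edges, R, n, N):
    H = {}                                    # sparsifier: edge -> weight
    for e in edges:
        p_e = R[e] / (n - 1)                  # acceptance probability
        for _ in range(N):                    # question 3.3 asks to avoid this loop
            if random.random() < p_e:
                H[e] = H.get(e, 0.0) + (n - 1) / (N * R[e])
    return H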
Since we have to loop over all edges, this is an excellent candidate for triple_apply.

3.1. Complete sparsification.py
A few implementation details. We will sparsify unweighted graphs, but the sparsifier will be weighted. The final evaluation will be the average ratio between the sparsifier and the original graph on a set of random vectors.
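A possible sketch of that evaluation, assuming dense NumPy Laplacians L_G and L_H are available (the quantity averaged is the quadratic-form ratio of Eq. (2)):

import numpy as np

# Average of x^T L_H x / x^T L_G x over random test vectors.
def average_ratio(L_G, L_H, num_vectors=100):
    n = L_G.shape[0]
    ratios = []
    for _ in range(num_vectors):
        x = np.random.randn(n)
        x -= x.mean()               # stay orthogonal to the all-ones nullspace
        ratios.append((x @ L_H @ x) / (x @ L_G @ x))
    return np.mean(ratios)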
3.2. The definition of effective resistance for an edge is $R_e = b_e^T L_G^+ b_e$, with $b_e = e_i - e_j$, where $e_i, e_j$ are the indicator vectors of the nodes $i, j$ (the order does not matter). How can you exploit the structure of $b_e$ to extract $R_e$ efficiently from $L_G^+$?
3.3. How can you implement the inner loop implicitly? (Hint: the loop flips a coin with probability $p_e$ $N$ times.)
3.4. In the code we are computing the effective resistances using the actual inverse of $L_G$, but computing the inverse is an $O(n^3)$ operation and it is too expensive. If I could provide you with an algorithm that can compute an approximation of $\hat{x} = \arg\min_x \|Ax - b\|$ in $m \log(n)$ time, how could you use it to extract effective resistances?
References
[1] Don Coppersmith and Shmuel Winograd. Matrix multiplication via arithmetic progressions. Journal of Symbolic Computation, 9(3):251-280, 1990.
[2] Volker Strassen. Gaussian elimination is not optimal. Numerische Mathematik, 13(4):354-356, 1969.