TP3: Large Scale Graph Learning
daniele.calandriello@inria.fr
Tuesday 10th March, 2015

Abstract

The report and the code are due in 2 weeks (deadline 23:59 24/3/2015). You can send them by email to daniele.calandriello@inria.fr, with subject "TD3 Graphs In ML Name Surname". Naming the attachments something along the lines of TD3_Graphs_ML_report_name_surname.{pdf,doc} and TD3_Graphs_ML_code_name_surname.{zip,gzip} would be greatly appreciated. All the code related to the TD must be submitted, to provide background for the code evaluation. All submissions that arrive late will be linearly penalized. The maximum score is 100 for all submissions before 23:59 24/3/2015 and 0 for all submissions after 23:59 27/3/2015 (e.g. the maximum score is about 70 at 21:30 25/3/2015). Material on http://chercheurs.lille.inria.fr/~calandri/

A small preface (especially for those that will not be present at the TD). All the experiments that will be presented during this TD make use of randomly generated datasets. Because of this, there is always the possibility that a single run will not be representative of the usual outcome. Randomness in the data is common in Machine Learning, and managing this randomness is important. A proper experimental setting calls for repeated experiments and confidence intervals. In this case, it will be sufficient to repeat each experiment multiple times and visually check whether there are large variations (some experiments are designed exactly to show these variations).

1 Large Scale Semi Supervised Learning

A large part of the algorithms on graphs that we have seen so far depend on a matrix representation of the graph (e.g. the Laplacian) to compute their results. This means that the execution of these algorithms is strongly influenced by the constraints of representing matrices in memory, and by the computational complexity of executing operations on them.
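To get a feel for the memory constraint, the following minimal sketch computes the footprint of a dense $n \times n$ matrix. It assumes 8-byte double-precision entries and reports sizes in binary megabytes and gigabytes; the helper name `dense_matrix_bytes` is made up for illustration.

```python
def dense_matrix_bytes(n, bytes_per_entry=8):
    """Bytes needed to store a dense n x n matrix (assumed 8-byte floats)."""
    return n * n * bytes_per_entry

# Print the footprint for a few graph sizes (number of vertices n).
for n in (5000, 12000, 37000):
    size = dense_matrix_bytes(n)
    print("n = %6d: %8.1f MB (%.2f GB)" % (n, size / 2.0**20, size / 2.0**30))
```

Since the footprint grows quadratically in the number of vertices, doubling $n$ multiplies the memory needed by four.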
We will see later how the computational complexity plays a role, but the largest constraint in real-world systems is memory occupation. In particular, storing in memory all the elements of an $n \times n$ matrix (at 8 bytes per entry) will require around 190 MB for $n = 5000$, but will easily cross the GB threshold at $n = 12000$ and the 10 GB threshold at $n = 37000$. At this point it is useless to talk about the number of vertices $n$ anymore, and we have to start to reason in terms of the number of edges $m$.

Another important part of computation on large scale is parallelism. Parallel algorithms have the advantage of distributing the data across nodes to increase memory capacity, as well as computing different parts of the solution concurrently on different nodes. For this TD, we chose to use the Graphlab¹ library. Graphlab implements a parallel, distributed paradigm for implementing machine learning algorithms, and hides most of the complexity of the parallelization and communication from the user. Nonetheless, it is important to have an idea of how a modern parallel system abstracts its computations.

1.1. Briefly read Section 2 (2.1, 2.2 and 2.3) of http://select.cs.cmu.edu/code/graphlab/abstractiononly.pdf and give a very high level description of Data Graph, Update function and Sync function.

We will now start working with the library. You can find a working installation of Graphlab in the VM that we provided for the course.

2 Large Scale Label Propagation

One of the main objectives of Graphlab is to allow computation to be carried out not only on datasets that can fit in memory. For this reason, its main structure, the SFrame, stores most of its data on disk, and efficiently maps it into memory when a function has to be applied to it. This allows even a normal machine to scale to GB-size datasets.

2.1. What is the main drawback of storing data on the disk? How can we try to mitigate this drawback when implementing our algorithms?

We will begin by implementing label propagation on a middle scale dataset.
In TD2 we have seen how to do this by computing an HFS solution.

¹ https://dato.com

The harmonic property that the HFS solution wants to satisfy is

$$f(x_i) = \frac{\sum_{i \sim j} f(x_j) w_{ij}}{\sum_{i \sim j} w_{ij}}$$

Computing HFS using linear algebra is not particularly easy to parallelize, and does not decompose easily over vertices as the Graphlab abstraction would like. But the harmonic property gives us an easy idea for implementing this algorithm iteratively with subsequent sweeps across the graph. In particular, if we know the degree of each node, we can compute a single iteration of label propagation just by knowing the source node, the destination node and the weight on the edge of the graph:

$$f(x_i) = f(x_i) + \frac{f(x_j) w_{ij}}{\sum_{i \sim j} w_{ij}}$$

This update is applied once for every edge $(i, j)$, and the denominator is the degree of node $i$.

In order to implement this algorithm in Graphlab, we will need to introduce some concepts from the large scale abstraction that Graphlab implements. The basic data structure in Graphlab is the SFrame, which is a tabular data type with named columns that can contain several kinds of values, such as integers, floats, lists or arrays. Graphlab distributes the table according to a key value called __id, so random access on the structure is not recommended. Instead, Graphlab provides the ability to filter an SFrame using the syntax

    sf_extract = sf[['col_name1', 'col_name2']]
    sf_extract = sf[sf['col_name1'] == 2]
    sf_extract = sf[sf['col_name1'] == 2]\
                   [['col_name1', 'col_name2']]

In the first case we are extracting specific columns from an SFrame, in the second we are extracting specific rows, and in the third we are combining the two actions (do notice that the order of the filtering matters).

After we have loaded our data in an SFrame, we want to make operations on this data. For this reason Graphlab provides the apply function, which receives every row of the SFrame and returns an SArray, the datatype that composes the columns of an SFrame. In other words we can write

    sf['new_column'] = sf.apply(foo)
    sf['old_column'] = sf.apply(foo)

where in the first case we save the result in a new column, and in the second we update an old, existing column. The function foo can access all the data in a single row

    def foo(row):
        if row['col_name2']:
            return row['col_name1']
        else:
            return 0

If instead we filter before calling the apply function, we will only have access to the selected columns. The power of apply comes into play when coupled with anonymous functions, also called lambda functions. In Python these are defined as

    foo = lambda x, y, z: x + y + z

The particular property of lambda functions is that their body is a single expression: they have no state, so for example assignments are not permitted inside these functions. On the other hand, we can write something like

    sf['new_column'] = sf.apply(
        lambda row: 1 if row['__id'] in node_list else 0
    )

The expression in the lambda function will be evaluated either to 0 or 1, and the resulting value will be stored in the new column of the SFrame. This example shows the other important aspect of lambda functions: closures. In the example, the lambda function is accessing the node_list variable, but this variable is neither declared in the function (lambdas have no state) nor provided as a parameter. But because the function is defined in the middle of my code, I can access all local variables available during the definition, and trap them inside my lambda function for access, creating a closure. It is never a good idea to trap a mutable object, because we do not know if the changes to the object will propagate to the closure.

2.2. Can you think of a simple problem with using mutable variables in a closure in a distributed framework like Graphlab?

There are two last necessary tools we need for the implementation. SFrame supports simple grouping of variables, using the syntax

    sf_new = sf.groupby(
        ["group_col"],
        {'new_col_name': graphlab.aggregate.OPERATION("col_name")}
    )

This command will group all rows that have the same group_col value, and will apply the transformation graphlab.aggregate.OPERATION to the group. Simple transformations are concatenation of values, or summation. The result will be stored in a new SFrame with a key of group_col and a second column new_col_name.

The last and most important tool is triple_apply. This function takes an SGraph in input, and returns a new updated SGraph. An SGraph is composed essentially of two SFrames, vertices and edges. The vertices are identified by a key value __id and the edges by the pair __src_id and __dst_id. Since they are both SFrames, you can store any kind of data on vertices and edges and apply the usual lambda function updates on them. The triple_apply function instead has the following pattern

    def foo(src, edge, dst):
        src['col_name1'] += dst['col_name2']
        edge['col_name3'] = \
            1 if src['__id'] == dst['__id'] else 0
        return (src, edge, dst)

and all the updates are carried out in parallel across nodes. For large datasets distributed in a cluster, this is a powerful abstraction. We will exploit it to implement two ML algorithms, label propagation and graph sparsification. For the rest of the basic datatypes in Python, and the remaining Graphlab documentation, you can refer to the manuals online.

2.3. Complete label_propagation.py

2.4. Why do we resort to Approximate Nearest Neighbours to compute the similarity graph?

2.5. The implementation provided by NearPy uses Locality-Sensitive Hashing². Try to give a high level explanation of how LSH works, and why we chose to use it in this problem compared to tree-based ANN.

2.6. Why do we choose not to use a closure when feeding the ANN Engine?

² https://en.wikipedia.org/wiki/Locality-sensitive_hashing

2.7. Why do we need line 126 (in the original file) in label_propagation.py?

2.8. In normal HFS, regularization is added to the Laplacian to simulate absorption at each step in the label propagation. How can you have a similar regularization in the iterative label propagation?

2.9.
Try different combinations of parameters for the neighbours, regularization, and hashing space, and plot the accuracy and running times.

An accuracy of about 30% can be quickly attained on this data. It might be possible to get better performance by tuning parameters and changing parts of the implementation, and you are invited to try. The goal of this TD, on the other hand, is to introduce large scale approaches to Graph Learning, and final accuracy is secondary.

3 Large Scale and Sparsification

From the computational point of view, the naive implementations of the two most basic matrix operations are $O(n^2)$ for matrix-vector multiplication and $O(n^3)$ for matrix-matrix multiplication. More advanced approaches to the problem reduced the matrix-matrix exponent to $O(n^{2.8})$ for Strassen's algorithm [2] and $O(n^{2.37})$ for Coppersmith-Winograd's algorithm [1]. The asymptotic running times of these algorithms are superior to the naive implementation, but their practical use is limited. Coppersmith-Winograd has large constants that make it prohibitive for any matrix that would fit in reasonable memory. Strassen's algorithm is a possible practical candidate, and for $n$ in the thousands it is often used in practice. The exact threshold is hard to derive, because modern hardware architectures rely heavily on multiple levels of cache, and therefore perform orders of magnitude worse on Strassen's algorithm, which mostly uses random access to memory, compared to the contiguous access of the naive implementation. In the end, this means that all algorithms that perform matrix-matrix multiplications (for example most kinds of decompositions) will have a computational cost of $O(n^{2.8})$.

The matrix-vector multiplication (in the general case) has an upper and lower bound of $O(n^2)$. This is because we need to examine each of the elements of the matrix at least once. In particular cases, when the matrix has only $m$ non-zero entries, the computation cost becomes $O(m)$.
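The difference between the two costs can be made concrete with a small sketch. It stores a matrix as a list of (row, column, value) triplets, one per non-zero entry, so that the sparse product touches each of the non-zero entries (their number is written $m$ here) exactly once. The coordinate format and the function names are chosen only for illustration and are not part of the TD code.

```python
def dense_matvec(A, x):
    """Dense product: visits all n * n entries of A."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

def sparse_matvec(triplets, x, n):
    """Sparse product: one multiply-add per non-zero entry, O(m) overall."""
    y = [0.0] * n
    for i, j, value in triplets:
        y[i] += value * x[j]
    return y

# Laplacian of the path graph 0 - 1 - 2, in both representations.
L_dense = [[1.0, -1.0, 0.0],
           [-1.0, 2.0, -1.0],
           [0.0, -1.0, 1.0]]
L_sparse = [(i, j, v) for i, row in enumerate(L_dense)
            for j, v in enumerate(row) if v != 0.0]

x = [1.0, 2.0, 4.0]
# Both representations give the same product.
assert dense_matvec(L_dense, x) == sparse_matvec(L_sparse, x, 3)
```

For a graph Laplacian the number of non-zero entries is proportional to the number of edges plus the number of vertices, which is why sparse graphs make this product cheap.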
This shift in costs is at the basis of a series of iterative solvers for all kinds of linear algebra problems (decompositions, linear systems) where iterative methods requiring only matrix-vector multiplication are involved.

In the particular case of solving linear problems, the solution that minimizes the residual norm

$$\hat{x} = \arg\min_x \|Ax - b\|$$

can be found iteratively. In the most general and worst case, this requires $O(mn)$, which again translates to $O(n^3)$ for dense matrices. This formulation can be used for many kinds of problems (e.g. finding eigenvectors).

Given a graph $G$, a $(1 \pm \varepsilon)$ spectral approximation $L_H$ of the Laplacian $L_G$ is defined, for all vectors $x$, as

$$(1 - \varepsilon)\, x^T L_G x \le x^T L_H x \le (1 + \varepsilon)\, x^T L_G x \qquad (1)$$

$$(1 - \varepsilon) \le \frac{x^T L_H x}{x^T L_G x} \le (1 + \varepsilon) \qquad (2)$$

$$\frac{1}{1 + \varepsilon}\, x^T L_G^+ x \le x^T L_H^+ x \le \frac{1}{1 - \varepsilon}\, x^T L_G^+ x \qquad (3)$$

Spectral approximations play an important role in many modern linear solvers, and efficient solvers have an impact on a large range of problems. In order to make the solver efficient, we would like to substitute $L_G$ with a good approximation $L_H$, but $L_H$ needs to be sparse. This is a key component to have fast solvers for normal size problems, and it allows problems that could not be solved exactly for lack of memory to become feasible. In particular you will be implementing the following version of sparsification (Alg. 1).

Algorithm 1 Spielman-Srivastava rejection sampling

    Input: G
    Output: H
    Initialize H = ∅
    for e ∈ G do
        for i = 1 to N = O(n log^2 n / ε^2) [run this loop implicitly] do
            Accept e with probability p_e = R_e / (n − 1)
            Add it with weight w_e / (N p_e) = (n − 1) / (N R_e)   (w_e = 1 for unweighted graphs)
        end for
    end for

Since we have to loop over all edges, this is an excellent candidate for triple_apply.

3.1. Complete sparsification.py

A few implementation details. We will sparsify unweighted graphs, but the sparsifier will be weighted. The final evaluation will be the average ratio between the sparsifier and the original graph on a set of random vectors.
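One way to read this evaluation, consistent with condition (2), is to average the ratio $x^T L_H x / x^T L_G x$ over random vectors $x$. The sketch below is an assumption about what such a check could look like, not the evaluation code shipped with the TD: graphs are hypothetical lists of (i, j, weight) edges, the helper names are made up, and it relies on the identity $x^T L x = \sum_{(i,j)} w_{ij} (x_i - x_j)^2$ so that the Laplacian is never built explicitly.

```python
import random

def laplacian_quadratic_form(edges, x):
    """x^T L x computed edge by edge: sum of w_ij * (x_i - x_j)^2."""
    return sum(w * (x[i] - x[j]) ** 2 for i, j, w in edges)

def average_ratio(edges_G, edges_H, n, n_vectors=100, seed=0):
    """Average of x^T L_H x / x^T L_G x over random Gaussian vectors x."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_vectors):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        total += (laplacian_quadratic_form(edges_H, x)
                  / laplacian_quadratic_form(edges_G, x))
    return total / n_vectors

# Toy example: G is an unweighted triangle, H keeps two edges with weight 1.5.
G = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0)]
H = [(0, 1, 1.5), (1, 2, 1.5)]
print(average_ratio(G, H, n=3))
```

An average close to 1 is only a sanity check: condition (2) must hold for every vector, so looking at the spread of the individual ratios is more informative than the mean alone.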
3.2. The definition of effective resistance for an edge is $R_e = b_e^T L_G^+ b_e$, with $b_e = e_i - e_j$, where $e_i, e_j$ are indicator vectors for the nodes $i, j$ (the order does not matter). How can you exploit the structure of $b_e$ to extract $R_e$ efficiently from $L_G^+$?

3.3. How can you implement the inner loop implicitly? (Hint: the loop is flipping a coin with probability $p_e$ $N$ times.)

3.4. In the code we are computing the effective resistances using the actual (pseudo-)inverse of $L_G$, but computing the inverse is an $O(n^3)$ operation and it is too expensive. If I could provide you with an algorithm that can compute an approximation of

$$\hat{x} = \arg\min_x \|Ax - b\|$$

in $m \log(n)$ time, how could you use it to extract effective resistances?

References

[1] Don Coppersmith and Shmuel Winograd. Matrix multiplication via arithmetic progressions. Journal of Symbolic Computation, 9(3):251-280, 1990.

[2] Volker Strassen. Gaussian elimination is not optimal. Numerische Mathematik, 13(4):354-356, 1969.