Poster - James Fairbanks
Transcription
Discovering Block Structure in Graphs with Approximate Eigenvectors
James Fairbanks (1) and Geoff Sanders (2)
(1) Georgia Institute of Technology, (2) Lawrence Livermore National Laboratory
Contact Information: Email: james.fairbanks@gatech.edu Email: sanders29@llnl.gov

Abstract
Graphs and networks are important in modeling structure across disciplines. We demonstrate that graph partitioning in a minimum cut sense can be improved with an ensemble of low-fidelity eigenvectors, which can both outperform a single high-fidelity eigenvector and be computed faster. Since the individual ensemble members are independent, they can be computed in parallel. This effect arises because of the discrepancy between the solutions to a continuous relaxation of a discrete optimization problem and the solutions to the discrete problem itself.

Introduction
Graphs can represent structure in diverse application areas. Stochastic Block Models (SBM) are controlled models of community structure, where the probability of each edge depends only on the two communities of the vertices. Eigenvectors can recover minimum cut partitions of graphs. SBM parameters can be learned using spectral partitioning [2].

• Adjacency Matrix $A$: $A_{ij}$ counts the edges between vertices $i$ and $j$
• Laplacian: $L = D - A$ where $D_{ii} = \mathrm{degree}(i) = \sum_j A_{ij}$
• Normalized Laplacian: $\hat{L} = I - D^{-1/2} A D^{-1/2}$
• Eigenvalue $\lambda$, Eigenvector $x$: $\hat{L} x = \lambda x$

The eigenvectors of associated matrices can be useful for solving graph problems such as the min cut problem $\min_{b \in \{-1,1\}^n} b^T L b$. The Fiedler vector solves the continuous relaxation $\min_{x \in \mathbb{R}^n,\ x \perp \mathbf{1},\ \|x\| = 1} x^T L x$ [1]. Graphs can be represented by their matrices and visualized to see regular or block structure. If that structure is not known, it can be hard to detect.

[Figure 1: Matrix from Stochastic Block Model (left), and the same matrix permuted to hide structure (right). Panel titles: "Adjacency Matrix of Block Graph", "Adjacency Matrix of Scrambled Block Graph". Axes: source, destination.]

[Figure 2: Small graph showing block structure (left). Graph partitioned according to the Fiedler vector (right). Panel titles: "Small Graph with Spring Force Layout", "Sorting vertices into positive and negative classes". Axes: Vertex number, Fiedler Coordinate.]

If we can solve min-cut with low-fidelity eigenvectors, then eigenvectors can be used in a streaming environment, where new connections are being formed rapidly.

Results
Eigenvector solvers such as ARPACK [3] use random seed vectors to compute a candidate eigenvector. We can see the effects of treating the eigenvector returned from ARPACK as a random variable. For an Erdős-Rényi random graph Laplacian, we see that as the tolerance on the eigenresidual decreases, the spread of the approximate solution vectors also decreases.

[Figure 3: Tighter eigenresidual bounds imply tighter distributions of approximate solutions. Log scale implies that the distributions on the left have smaller variances. Plot title: "Variation in approximate eigenvectors". Axes: $\log|x - \mu_x|$, Frequency. Legend: ε = 0.01, ε = 0.0001, ε = 1e−08.]

When examining a near-bipartite community graph, we see that the cut size is more concentrated for tighter tolerances than for looser tolerances. As we get more digits of accuracy on the eigenvectors, we are more likely to get equivalent partitionings from the vectors. Since the goal is to find the minimum cut of the graph, we can take several approximate eigenvectors and take the best induced cut. Figure 4 shows the distribution of cut size when an ensemble of eigenvectors is computed. The minimum of the distribution is the most important part.

[Figure 4: Distribution of supervised cut size using Fiedler vectors of various accuracy. Plot title: "ArpackFiedler". Axes: number of cut edges, frequency. Legend: ε = 0.01, ε = 0.0001, ε = 1e−08.]

Given a vector, we can split the graph at any threshold value. This gives a set of possible cuts. Evaluating the cut size at every possible split produces an indication of how many true blocks are in the graph. We can also see that a random vector used for splitting vastly underperforms eigenvector-based splitting.

[Figure 5: Comparing high-fidelity eigenvectors to random vectors. Here Evector refers to the absolute value of the maximal eigenvector of the Laplacian, which reveals structure in near-bipartite graphs. Plot title: "All cut edges". Axes: percentile of split, number of cut edges. Legend: Evector, Fiedler, Max CIRand, Min CIRand.]

Conclusions
• Large-residual Fiedler vector approximations can outperform small-residual approximations at finding a minimum cut.
• At large tolerances, randomized eigensolvers have useful random variation.
• Ensembles of weak numerical solutions can outperform strong numerical solutions in data mining.
• Stochastic Block Models reveal the quality of graph partitioning algorithms.

Forthcoming Research
• Further research should examine these techniques in the streaming environment.
• Thorough analysis of the cost of multiple low-fidelity solves against the cost of one high-fidelity solve.
• Using other structured graphs to examine the interaction between approximate numerical solutions and data analysis solutions.
• Extension to other Machine Learning and Data Mining tasks.

References
[1] Fan R. K. Chung. Spectral Graph Theory, volume 92. American Mathematical Soc., 1997.
[2] D. Fishkind, D. Sussman, M. Tang, J. Vogelstein, and C. Priebe. Consistent adjacency-spectral partitioning for the stochastic block model when the model parameters are unknown. SIAM Journal on Matrix Analysis and Applications, 34(1):23–39, 2013.
[3] R. Lehoucq and D. Sorensen. Deflation techniques for an implicitly restarted Arnoldi iteration. SIAM Journal on Matrix Analysis and Applications, 17(4):789–821, 1996.

Acknowledgments
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344, with support from the ASEE-NDSEG fellowship for my PhD at Georgia Tech. LLNL-POST-668628
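Worked note on the min cut objective: the link between $b^T L b$ and the number of cut edges follows from a standard identity for $L = D - A$. The sketch below assumes an undirected graph without self-loops; the set names $S$ and $\bar{S}$ are labels introduced here for the two sides of the split and do not appear on the poster.

```latex
b^T L b \;=\; \sum_{i<j} A_{ij}\,(b_i - b_j)^2
        \;=\; 4 \sum_{i \in S,\; j \in \bar{S}} A_{ij}
        \;=\; 4\,\mathrm{cut}(S, \bar{S}),
\qquad S = \{\, i : b_i = +1 \,\},\quad \bar{S} = \{\, i : b_i = -1 \,\}
```

Each edge inside one side contributes $0$ and each edge across the split contributes $(\pm 2)^2 = 4$. Taken alone, the discrete objective is minimized trivially by $b = \mathbf{1}$ with value $0$, which is why the relaxation imposes $x \perp \mathbf{1}$; rounding a relaxed solution by sign, $b_i = \mathrm{sign}(x_i)$, then gives a candidate cut whose size can be compared across different approximate eigenvectors.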
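The ensemble idea from the Results (several independent low-fidelity solves, keep the best induced cut) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it swaps ARPACK for a plain power iteration on $cI - L$ with a random start, uses an absolute residual tolerance rather than ARPACK's stopping rule, splits on the sign of the vector, and all function names and graph parameters are hypothetical.

```python
import math
import random


def two_block_graph(block_size, p_in, p_out, rng):
    """Adjacency sets of a two-block Stochastic Block Model (hypothetical helper)."""
    n = 2 * block_size
    adj = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            same_block = (i < block_size) == (j < block_size)
            if rng.random() < (p_in if same_block else p_out):
                adj[i].add(j)
                adj[j].add(i)
    return adj


def laplacian_times(adj, x):
    """Return L x for L = D - A, using adjacency sets."""
    return [len(adj[i]) * x[i] - sum(x[j] for j in adj[i]) for i in range(len(adj))]


def approx_fiedler(adj, tol, rng, max_iter=10000):
    """Low-fidelity Fiedler vector (assumed stand-in for an ARPACK solve).

    Power iteration on c*I - L from a random seed vector, kept orthogonal to
    the constant vector, stopped once ||L x - theta x|| <= tol.
    """
    n = len(adj)
    c = 2 * max(len(nbrs) for nbrs in adj)  # Gershgorin bound on eigenvalues of L
    y = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(max_iter):
        mean = sum(y) / n
        x = [yi - mean for yi in y]                  # enforce x ⊥ 1
        norm = math.sqrt(sum(xi * xi for xi in x))
        x = [xi / norm for xi in x]                  # enforce ||x|| = 1
        lx = laplacian_times(adj, x)
        theta = sum(a * b for a, b in zip(x, lx))    # Rayleigh quotient x^T L x
        residual = math.sqrt(sum((a - theta * b) ** 2 for a, b in zip(lx, x)))
        if residual <= tol:
            break
        y = [c * xi - li for xi, li in zip(x, lx)]   # one step of (cI - L) x
    return x


def sign_cut(adj, x):
    """Number of edges cut when vertices are split on the sign of x."""
    side = [xi >= 0 for xi in x]
    return sum(1 for i in range(len(adj)) for j in adj[i]
               if i < j and side[i] != side[j])


def ensemble_cuts(adj, tol, members, seed):
    """Cut sizes from `members` independent solves with different random seeds."""
    return [sign_cut(adj, approx_fiedler(adj, tol, random.Random(seed + k)))
            for k in range(members)]


if __name__ == "__main__":
    graph = two_block_graph(block_size=20, p_in=0.6, p_out=0.05, rng=random.Random(0))
    for tol in (1e-2, 1e-4, 1e-8):
        cuts = ensemble_cuts(graph, tol, members=5, seed=1)
        print(f"tol={tol:g}: best cut {min(cuts)}, range {min(cuts)}..{max(cuts)}")
```

Because each ensemble member depends only on its own seed, the list comprehension in `ensemble_cuts` could be replaced by a parallel map without changing the result. With an ARPACK wrapper such as SciPy's `scipy.sparse.linalg.eigsh`, the corresponding knobs would be its `tol` and `v0` arguments; that substitution is not shown here.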