Big Data Machine Learning

Wu-Jun Li
LAMDA Group
Department of Computer Science and Technology, Nanjing University
liwujun@nju.edu.cn
Oct 31, 2014

Outline
1 Introduction
2 Learning to Hash
  - Isotropic Hashing
  - Supervised Hashing with Latent Factor Models
  - Supervised Multimodal Hashing with SCM
  - Multiple-Bit Quantization
3 Distributed Learning
  - Coupled Group Lasso for Web-Scale CTR Prediction
  - Distributed Power-Law Graph Computing
4 Stochastic Learning
  - Distributed Stochastic ADMM for Matrix Factorization
5 Conclusion

Introduction

Big Data
Big data has attracted much attention from both academia and industry.
- Facebook: 750 million users
- Flickr: 6 billion photos
- Wal-Mart: 267 million items/day; 4 PB data warehouse
- Sloan Digital Sky Survey: the New Mexico telescope captures 200 GB of image data per day

Definition of Big Data
- Gartner (2012): "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization." ("3Vs")
- International Data Corporation (IDC) (2011): "Big data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery, and/or analysis." ("4Vs")
- McKinsey Global Institute (MGI) (2011): "Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze."
Why did it become hot only in recent years? Big data by itself is not enough; it takes cloud computing and big data machine learning to process it and extract its value.

Big Data Machine Learning
Definition: perform machine learning from big data.
Role: the key for big data.
- The ultimate goal of big data processing is to mine value from data.
- Machine learning provides the fundamental theory and computational techniques for big data mining and analysis.
Challenges
- Storage: memory and disk
- Computation: CPU
- Communication: network

Our Contribution
- Learning to hash: reduces memory/disk/CPU/communication cost
- Distributed learning: reduces memory/disk/CPU cost per machine, but increases communication cost
- Stochastic learning: reduces memory/disk/CPU cost

Learning to Hash

Nearest Neighbor Search (Retrieval)
Given a query point q, return the points closest (most similar) to q in the database (e.g., images).
This underlies many machine learning, data mining, and information retrieval problems.
Challenges in big data applications:
- Curse of dimensionality
- Storage cost
- Query speed

Similarity-Preserving Hashing
[Figure]

Reduce Dimensionality and Storage Cost
[Figure]

Querying
Hamming distance:
||01101110, 00101101||_H = 3
||11011, 01011||_H = 1
[Figures illustrating the query procedure]

Fast Query Speed
- With a hashing scheme we can achieve constant or sub-linear search time complexity.
- Exhaustive search also becomes acceptable, because the distance computation on binary codes is cheap.

Two Stages of Hash Function Learning
- Projection stage (dimensionality reduction): given a point x, each projected dimension i is associated with a real-valued projection function f_i(x) (e.g., f_i(x) = w_i^T x).
- Quantization stage: turn the real values into binary codes.

Our Contribution
- Unsupervised hashing [NIPS 2012]: isotropic hashing (IsoHash)
- Supervised hashing [SIGIR 2014]: supervised hashing with latent factor models
- Multimodal hashing [AAAI 2014]: large-scale supervised multimodal hashing with semantic correlation maximization
- Multiple-bit quantization: double-bit quantization (DBQ) [AAAI 2012], Manhattan quantization (MQ) [SIGIR 2012]

Isotropic Hashing

Motivation
Problem: all existing methods use the same number of bits for projected dimensions that have different variances.
Possible solutions:
- Use a different number of bits for different dimensions (unfortunately, no effective way has been found).
- Make the variances isotropic (equal) across all dimensions.

PCA Hashing (PCAH)
To generate a code of m bits, PCAH performs PCA on X and uses the top m eigenvectors of the matrix XX^T as the columns of the projection matrix W ∈ R^{d×m}. The top m eigenvectors are those corresponding to the m largest eigenvalues {λ_k}_{k=1}^m, arranged in non-increasing order λ_1 ≥ λ_2 ≥ ··· ≥ λ_m. Let λ = [λ_1, λ_2, ···, λ_m]^T. Then

Λ = W^T X X^T W = diag(λ).

The hash function is defined as h(x) = sgn(W^T x).
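The two-stage pipeline and the Hamming-distance querying above can be made concrete with a minimal numpy sketch (not the authors' implementation; a real system would pack the bits into machine words and use popcount for the distance, and X is assumed to be zero-centered with rows as points):

```python
import numpy as np

def pca_hash_train(X, m):
    """Learn a d x m projection from the top-m eigenvectors of the data covariance.
    X is n x d with rows as (zero-centered) points, so X.T @ X plays the role of XX^T."""
    eigvals, eigvecs = np.linalg.eigh(X.T @ X)
    order = np.argsort(eigvals)[::-1][:m]     # indices of the m largest eigenvalues
    return eigvecs[:, order]                  # W in R^{d x m}

def pca_hash_encode(X, W):
    """Quantization stage: h(x) = sgn(W^T x), stored here as 0/1 bits."""
    return (X @ W > 0).astype(np.uint8)

def hamming_rank(query_code, db_codes):
    """Rank database items by Hamming distance to the query code."""
    dists = np.count_nonzero(db_codes != query_code, axis=1)
    return np.argsort(dists), dists

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 64))
X -= X.mean(axis=0)
W = pca_hash_train(X, m=16)
codes = pca_hash_encode(X, W)
order, dists = hamming_rank(codes[0], codes)
# The projected dimensions have very different variances -- the motivation for IsoHash below.
print(np.round((X @ W).var(axis=0), 3))
```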
Idea of IsoHash
Learn an orthogonal matrix Q ∈ R^{m×m} that makes Q^T W^T X X^T W Q a matrix with equal diagonal values.
Effect of Q: each projected dimension gets the same variance, while the Euclidean distances between any two points are kept unchanged.

Accuracy (mAP) on CIFAR

# bits   IsoHash   PCAH     ITQ      SH       SIKH     LSH
32       0.2249    0.0319   0.2490   0.0510   0.0353   0.1052
64       0.2969    0.0274   0.3051   0.0589   0.0902   0.1907
96       0.3256    0.0241   0.3238   0.0802   0.1245   0.2396
128      0.3357    0.0216   0.3319   0.1121   0.1909   0.2776
256      0.3651    0.0168   0.3436   0.1535   0.3614   0.3432

Training Time
[Figure: training time (s) versus the number of training data (up to 6×10^4) for IsoHash-GF, IsoHash-LP, ITQ, SH, SIKH, LSH, and PCAH.]

Supervised Hashing with Latent Factor Models (LFH)

Problem Definition
Input:
- Feature vectors x_i ∈ R^D, i = 1, ..., N (compact form: X ∈ R^{N×D}).
- Similarity labels s_ij, i, j = 1, ..., N (compact form: S = {s_ij}); s_ij = 1 if points i and j belong to the same class, and s_ij = 0 if they belong to different classes.
Output:
- Binary codes b_i ∈ {−1, 1}^Q, i = 1, ..., N (compact form: B ∈ {−1, 1}^{N×Q}); when s_ij = 1 the Hamming distance between b_i and b_j should be low, and when s_ij = 0 it should be high.

Motivation
Existing supervised methods suffer from high training complexity and make poor use of the semantic information.

Model
The likelihood of the observed similarity labels S is defined as

p(S | B) = ∏_{s_ij ∈ S} p(s_ij | B),   with p(s_ij | B) = a_ij if s_ij = 1, and 1 − a_ij if s_ij = 0,

where a_ij = σ(Θ_ij), σ(x) = 1 / (1 + e^{−x}), and Θ_ij = (1/2) b_i^T b_j.
Relationship between the Hamming distance and the inner product:

dist_H(b_i, b_j) = (1/2)(Q − b_i^T b_j) = (1/2)(Q − 2Θ_ij).

Relaxation
Re-define Θ_ij as Θ_ij = (1/2) U_{i*} U_{j*}^T, so that p(S | B), p(B), p(B | S) become p(S | U), p(U), p(U | S).
Define a normal prior p(U) = ∏_{d=1}^{Q} N(U_{*d} | 0, βI).
The log posterior of U can then be derived as

L = log p(U | S) = Σ_{s_ij ∈ S} ( s_ij Θ_ij − log(1 + e^{Θ_ij}) ) − (1/(2β)) ||U||_F^2 + c.

Stochastic Learning
Furthermore, if we choose a subset of S by randomly selecting O(Q) of its columns and rows, the time cost can be reduced to O(NQ^2) per iteration.
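A hedged numpy sketch of this stochastic scheme (assumptions: the 0/1 similarity matrix S fits in memory, and plain gradient ascent on the log posterior above is used for illustration, whereas the LFH paper derives a bound-based closed-form update):

```python
import numpy as np

def lfh_stochastic_step(U, S, beta, lr, rng):
    """One stochastic LFH step: sample O(Q) columns of S and update every row of U
    using only those columns, giving O(N Q^2) cost per iteration."""
    N, Q = U.shape
    cols = rng.choice(N, size=Q, replace=False)   # sampled column indices
    Theta = 0.5 * U @ U[cols].T                   # N x Q matrix of Theta_ij
    A = 1.0 / (1.0 + np.exp(-Theta))              # a_ij = sigma(Theta_ij)
    grad = 0.5 * (S[:, cols] - A) @ U[cols] - U / beta
    return U + lr * grad                          # ascend the log posterior L

rng = np.random.default_rng(0)
N, Q = 500, 16
labels = rng.integers(0, 10, size=N)
S = (labels[:, None] == labels[None, :]).astype(float)
U = 0.1 * rng.standard_normal((N, Q))
for _ in range(200):
    U = lfh_stochastic_step(U, S, beta=1.0, lr=0.05, rng=rng)
B = np.sign(U)                                    # relaxed solution -> binary codes
```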
MAP (CIFAR-10)
[Figure: MAP versus code length (8–128 bits) on CIFAR-10 for LFH, KSH, MLH, ITQ, AGH, LSH, PCAH, SH, and SIKH.]

Training Time (CIFAR-10)
[Figure: log training time versus code length (8–128 bits) on CIFAR-10 for the same methods.]

Supervised Multimodal Hashing with SCM

Supervised Multimodal Similarity Search
Given a query that is either an image or a text, return images or texts similar to it in both feature space and semantics (label information).

Motivation and Contribution
Motivation: existing supervised methods are not scalable.
Contribution:
- Avoids explicitly computing the pairwise similarity matrix; linear-time complexity w.r.t. the size of the training set.
- A sequential learning method with a closed-form solution for each bit; no hyper-parameters or stopping conditions are needed.

In matrix form, the problem can be rewritten as

min_{W_x, W_y}  || sgn(X W_x) sgn(Y W_y)^T − cS ||_F^2
s.t.  sgn(X W_x)^T sgn(X W_x) = n I_c,
      sgn(Y W_y)^T sgn(Y W_y) = n I_c.

Sequential Strategy
Assume the projection vectors w_x^(1), ..., w_x^(t−1) and w_y^(1), ..., w_y^(t−1) have been learned; we want to learn the next pair w_x^(t) and w_y^(t).
Define a residue matrix

R_t = cS − Σ_{k=1}^{t−1} sgn(X w_x^(k)) sgn(Y w_y^(k))^T.

The objective function can then be written as

min_{w_x^(t), w_y^(t)}  || sgn(X w_x^(t)) sgn(Y w_y^(t))^T − R_t ||_F^2.

Algorithm 1: Learning algorithm of the SCM hashing method
  C_xy^(0) ← 2 (X^T L̃)(Y^T L̃)^T − (X^T 1_n)(Y^T 1_n)^T
  C_xy^(1) ← c × C_xy^(0)
  C_xx ← X^T X + γ I_{d_x};  C_yy ← Y^T Y + γ I_{d_y}
  for t = 1 → c do
    Solve the generalized eigenvalue problem
      C_xy^(t) C_yy^{−1} [C_xy^(t)]^T w_x = λ^2 C_xx w_x
    and take the optimal solution w_x^(t) corresponding to the largest eigenvalue λ_max;
    w_y^(t) ← C_yy^{−1} [C_xy^(t)]^T w_x^(t) / λ_max;
    h_x^(t) ← sgn(X w_x^(t));  h_y^(t) ← sgn(Y w_y^(t));
    C_xy^(t+1) ← C_xy^(t) − (X^T h_x^(t))(Y^T h_y^(t))^T
  end for
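A dense, unoptimized sketch of one pass through the loop in Algorithm 1 (assumptions: all matrices fit in memory, and the generalized eigenvalue problem is reduced to a standard one via C_xx^{-1} instead of calling a specialized solver):

```python
import numpy as np

def scm_seq_bit(Cxy, Cxx, Cyy, X, Y):
    """Learn one pair of projections (w_x, w_y) as in Algorithm 1 and return them
    together with the residual-updated C_xy for the next bit."""
    # Generalized eigenproblem  Cxy Cyy^{-1} Cxy^T w = lambda^2 Cxx w,
    # solved here by reducing it to a standard eigenproblem on Cxx^{-1} M.
    M = Cxy @ np.linalg.solve(Cyy, Cxy.T)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Cxx, M))
    k = np.argmax(eigvals.real)
    lam = np.sqrt(max(eigvals[k].real, 1e-12))
    wx = eigvecs[:, k].real
    wy = np.linalg.solve(Cyy, Cxy.T @ wx) / lam
    hx, hy = np.sign(X @ wx), np.sign(Y @ wy)
    Cxy_next = Cxy - np.outer(X.T @ hx, Y.T @ hy)
    return wx, wy, hx, hy, Cxy_next
```

Calling this c times, feeding Cxy_next back in each time, yields the c bits sequentially.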
Scalability
Table: Training time (in seconds) on the NUS-WIDE dataset when varying the size of the training set ("-" means the method did not finish).

Method      500     1000    1500    2000    2500    3000    5000    10000   20000
SCM-Seq     276     249     303     222     236     260     248     228     230
SCM-Orth    36      80      85      77      83      76      110     87      102
CCA         25      20      23      22      25      22      28      38      44
CCA-3V      69      57      68      69      62      55      67      70      86
CVH         62      116     123     149     155     170     237     774     1630
CRH         68      253     312     515     760     1076    -       -       -
MLBE        67071   126431  -       -       -       -       -       -       -

Accuracy
Table: MAP results on NUS-WIDE; the best result in each setting is achieved by SCM-Seq.

Task: Image Query vs. Text Database
Method      c = 16   c = 24   c = 32
SCM-Seq     0.4385   0.4397   0.4390
SCM-Orth    0.3804   0.3746   0.3662
CCA         0.3625   0.3586   0.3565
CCA-3V      0.3826   0.3741   0.3692
CVH         0.3608   0.3575   0.3562
CRH         0.3957   0.3965   0.3970
MLBE        0.3697   0.3620   0.3540

Task: Text Query vs. Image Database
Method      c = 16   c = 24   c = 32
SCM-Seq     0.4273   0.4265   0.4259
SCM-Orth    0.3757   0.3625   0.3581
CCA         0.3619   0.3580   0.3560
CCA-3V      0.3801   0.3721   0.3676
CVH         0.3640   0.3596   0.3581
CRH         0.3926   0.3910   0.3904
MLBE        0.3877   0.3636   0.3551

Multiple-Bit Quantization

Double-Bit Quantization (DBQ)
[Figure: point distribution of the real values computed by PCA on the 22K LabelMe data set, and different coding results based on this distribution: (a) single-bit quantization (SBQ); (b) hierarchical hashing (HH); (c) double-bit quantization (DBQ).]

Experiment
Table: mAP on the LabelMe data set.

            32 bits                      64 bits
Method      SBQ      HH       DBQ        SBQ      HH       DBQ
ITQ         0.2926   0.2592   0.3079     0.3413   0.3487   0.4002
SH          0.0859   0.1329   0.1815     0.1071   0.1768   0.2649
PCA         0.0535   0.1009   0.1563     0.0417   0.1034   0.1822
LSH         0.1657   0.105    0.12272    0.2594   0.2089   0.2577
SIKH        0.0590   0.0712   0.0772     0.1132   0.1514   0.1737

            128 bits                     256 bits
Method      SBQ      HH       DBQ        SBQ      HH       DBQ
ITQ         0.3675   0.4032   0.4650     0.3846   0.4251   0.4998
SH          0.1730   0.2034   0.3403     0.2140   0.2468   0.3468
PCA         0.0323   0.1083   0.1748     0.0245   0.1103   0.1499
LSH         0.3579   0.3311   0.4055     0.4158   0.4359   0.5154
SIKH        0.2792   0.3147   0.3436     0.4759   0.5055   0.5325

Manhattan Quantization (MQ)
[Figure illustrating Manhattan quantization.]

Experiment
Table: mAP on the ANN SIFT1M data set. The best result among SBQ, HQ, and 2-MQ under the same setting was marked in bold in the original (it is 2-MQ in every case).

            32 bits                      64 bits                      96 bits
Method      SBQ      HQ       2-MQ       SBQ      HQ       2-MQ       SBQ      HQ       2-MQ
ITQ         0.1657   0.2500   0.2750     0.4641   0.4745   0.5087     0.5424   0.5871   0.6263
SIKH        0.0394   0.0217   0.0570     0.2027   0.0822   0.2356     0.2263   0.1664   0.2768
LSH         0.1163   0.0961   0.1173     0.2340   0.2815   0.3111     0.3767   0.4541   0.4599
SH          0.0889   0.2482   0.2771     0.1828   0.3841   0.4576     0.2236   0.4911   0.5929
PCA         0.1087   0.2408   0.2882     0.1671   0.3956   0.4683     0.1625   0.4927   0.5641
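To make the quantization stage concrete, here is a toy sketch of the 2-bit coding used by DBQ as described in the figure caption above: the three intervals receive the codes 01, 00, and 10, so that neighbouring intervals differ in one bit and the two outer intervals in two. The thresholds here are plain quantiles chosen for illustration only; DBQ itself optimizes them from the point distribution.

```python
import numpy as np

def double_bit_quantize(v, low=None, high=None):
    """Quantize one real-valued projected dimension v (shape (n,)) into 2 bits per point."""
    if low is None or high is None:
        low, high = np.quantile(v, [1.0 / 3, 2.0 / 3])   # illustrative thresholds only
    codes = np.empty((v.shape[0], 2), dtype=np.uint8)
    codes[v <= low] = (0, 1)                     # left interval   -> 01
    codes[(v > low) & (v <= high)] = (0, 0)      # middle interval -> 00
    codes[v > high] = (1, 0)                     # right interval  -> 10
    return codes

v = np.random.default_rng(0).standard_normal(10)
print(double_bit_quantize(v))
```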
Distributed Learning

Definition
Perform machine learning on clusters with several machines (nodes).

Coupled Group Lasso for Web-Scale CTR Prediction

CTR Prediction for Online Advertising
- Online advertising is a multi-billion-dollar business on the web and accounts for the majority of the income of the major internet companies.
- Display advertising is a big part of online advertising.
- Click-through rate (CTR) prediction is the problem of estimating the probability that an ad is clicked when displayed to a user in a specific context.
[Figure: display-ad examples from (a) Google, (b) Amazon, and (c) Taobao.]

Notation
- Impression (instance): ad + user + context
- Training set {(x^(i), y^(i)) | i = 1, ..., N}, with x^T = (x_u^T, x_a^T, x_o^T)
- y ∈ {0, 1}, where y = 1 denotes a click and y = 0 denotes a non-click
- Learn h(x) = h(x_u, x_a, x_o) to predict the CTR

Coupled Group Lasso (CGL)
1. Likelihood:

   h(x) = Pr(y = 1 | x, W, V, b) = g( (x_u^T W)(x_a^T V)^T + b^T x_o ),

   where W ∈ R^{d_u×k}, V ∈ R^{d_a×k}, and (x_u^T W)(x_a^T V)^T = x_u^T (W V^T) x_a.
2. Objective function:

   min_{W,V,b}  Σ_{i=1}^{N} ξ(W, V, b; x^(i), y^(i)) + λ Ω(W, V),

   in which
   ξ(W, V, b; x^(i), y^(i)) = − log( h(x^(i))^{y^(i)} (1 − h(x^(i)))^{1−y^(i)} ),
   Ω(W, V) = ||W||_{2,1} + ||V||_{2,1} = Σ_{i=1}^{d_u} ||W_{i*}||_2 + Σ_{i=1}^{d_a} ||V_{i*}||_2.

Distributed Learning Framework
- Compute the gradient g_p locally on each node p in parallel.
- Compute the global gradient g = Σ_{p=1}^{P} g_p with AllReduce.
- Add the gradient of the regularization term and take an L-BFGS step on the master node.
- Broadcast the updated parameters to each slave node.

Experiment
Experiment environment: an MPI cluster with hundreds of nodes, each a 24-core server with 2.2 GHz CPUs and 96 GB of RAM.
Baseline and evaluation metric:
- Baseline: LR (with L2-norm regularization) [Chapelle et al., 2013]
- Evaluation metric (discrimination): RelaImpr = ( AUC(model) − 0.5 ) / ( AUC(baseline) − 0.5 ) × 100%

Data Sets
Table: Real-world data sets collected from Taobao of Alibaba Group.

Data set   # Instances (billion)   CTR (%)   # Ads     # Users (million)   Storage (TB)
Train 1    1.011                   1.62      21,318    874.7               1.895
Test 1     0.295                   1.70      11,558    331.0               0.646
Train 2    1.184                   1.61      21,620    958.6               2.203
Test 2     0.145                   1.64      6,848     190.3               0.269
Train 3    1.491                   1.75      33,538    1119.3              2.865
Test 3     0.126                   1.70      9,437     183.7               0.233

Performance
[Figure: accuracy and scalability, (c) RelaImpr w.r.t. the baseline and (d) speedup.]
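A minimal numpy sketch of the CGL likelihood and objective defined above (assumptions: the link g is taken to be the logistic function, dense feature vectors are used, and the distributed AllReduce/L-BFGS training loop is omitted):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cgl_predict(xu, xa, xo, W, V, b):
    """h(x) = g( (x_u^T W)(x_a^T V)^T + b^T x_o ) = g( x_u^T (W V^T) x_a + b^T x_o )."""
    return sigmoid((xu @ W) @ (xa @ V) + b @ xo)

def cgl_objective(data, W, V, b, lam):
    """Negative log-likelihood plus the group-lasso penalty ||W||_{2,1} + ||V||_{2,1}."""
    nll = 0.0
    for xu, xa, xo, y in data:                     # data: iterable of (x_u, x_a, x_o, y)
        h = cgl_predict(xu, xa, xo, W, V, b)
        nll -= y * np.log(h + 1e-12) + (1 - y) * np.log(1 - h + 1e-12)
    group_lasso = np.linalg.norm(W, axis=1).sum() + np.linalg.norm(V, axis=1).sum()
    return nll + lam * group_lasso
```

The group-lasso term penalizes whole rows of W and V, which is what drives entire user or ad feature groups to zero and yields the feature selection results shown next.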
Feature Selection

             Important Features                                              Useless Features
Ad part      Women's clothes, Skirt, Dress, Children's wear, Shoes,          Movie, Act, Takeout, Food booking service
             Cellphone
User part    Watch, Underwear, Fur clothing, Furniture                       Stage costume, Flooring, Pencil, Outdoor sock

Feature selection results.

Distributed Power-Law Graph Computing

Graph-Based Machine Learning
- Big graphs emerge in many real applications.
- Graph-based machine learning is a hot research topic with wide applications: relational learning, manifold learning, PageRank, community detection, etc.
- Distributed graph-computing frameworks are needed for big graphs.
[Figure: (a) a social network and (b) a biological network.]

Graph Partitioning
Graph partitioning (GP) plays a key role in the performance of distributed graph computing:
- Workload balance
- Communication cost
[Figure: two strategies for graph partitioning, (a) edge-cut and (b) vertex-cut; shaded vertices are ghosts and mirrors, respectively.]
Theoretical and empirical results show that vertex-cut is better than edge-cut.

Power-Law Graph Partitioning
Natural graphs from the real world typically follow skewed power-law degree distributions: Pr(d) ∝ d^{−α}.
Different vertex-cut methods can result in very different performance.
[Figure: a sample power-law graph, with a bad partitioning and a good partitioning.]

Degree-Based Hashing (DBH)
Existing GP methods, such as Random in PowerGraph (Joseph E. Gonzalez et al., 2012) and Grid in GraphBuilder (Nilesh Jain et al., 2013), do not make effective use of the power-law degree distribution.
We propose a novel GP method called degree-based hashing (DBH):

Algorithm 2: GP with DBH
  Input: the set of edges E; the set of vertices V; the number of machines p.
  Output: the assignment M(e) ∈ {1, ..., p} for each edge e.
  Initialization: count the degree d_i of each vertex i ∈ {1, ..., n} in parallel.
  for all e = (v_i, v_j) ∈ E do (hash each edge in parallel)
    if d_i < d_j then
      M(e) ← vertex_hash(v_i)
    else
      M(e) ← vertex_hash(v_j)
    end if
  end for

Theoretical Analysis
We theoretically prove that DBH can outperform Random and Grid in terms of:
- reducing the replication factor (communication and storage cost)
- keeping good edge balance (workload balance)
Nice property: DBH reduces the replication factor more when the power-law graph is more skewed.
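Algorithm 2 translates almost line for line into code; the sketch below is a sequential, single-process version (the paper counts degrees and hashes edges in parallel):

```python
from collections import Counter

def dbh_partition(edges, num_machines):
    """Assign each edge to a machine by hashing the endpoint with the smaller degree."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    assignment = {}
    for u, v in edges:
        key = u if degree[u] < degree[v] else v   # cut through the high-degree endpoint
        assignment[(u, v)] = hash(key) % num_machines
    return assignment

edges = [(1, 2), (1, 3), (1, 4), (2, 3)]          # vertex 1 is the high-degree vertex
print(dbh_partition(edges, num_machines=2))
```

Because the low-degree endpoint decides the placement, the edges of a high-degree (power-law "hub") vertex are spread over many machines, which is what reduces the replication factor.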
Data Sets
Table: Datasets.

Alias   Graph          |V|     |E|
Tw      Twitter        42M     1.47B
Arab    Arabic-2005    22M     0.6B
Wiki    Wiki           5.7M    130M
LJ      LiveJournal    5.4M    79M
WG      WebGoogle      0.9M    5.1M

Empirical Results
[Figure: experiments on real-world graphs with 48 machines: (d) replication factor of Random, Grid, and DBH on WG, LJ, Wiki, Arab, and Tw; (e) speedup of DBH relative to the baselines (roughly 4%–60% depending on the graph).]
[Figure: on the Twitter graph, with the number of machines ranging from 8 to 64: (a) replication factor and (b) execution time (seconds) for Random, Grid, and DBH.]

Stochastic Learning

Definition
Given a (large) training set {(x_i, y_i) | i = 1, 2, ···, n}, each time we randomly sample one training instance for training.
Many machine learning problems can be formulated as

argmin_w  L(w) = (1/n) Σ_{i=1}^{n} f_i(w | x_i, y_i),

so the computation cost per update can be greatly reduced by stochastic learning.

Distributed Stochastic ADMM for Matrix Factorization

Recommender System
Recommend products (items) to customers (users) by utilizing the customers' historic preferences.
[Figure: recommendation examples from (a) Amazon and (b) Sina Weibo.]

Matrix Factorization
Popular in recommender systems for its promising performance.
[Figure: an incomplete m × n rating matrix R (users × items) is factorized as R ≈ U^T V into user factors U and item factors V.]

min_{U,V}  (1/2) Σ_{(i,j)∈Ω} [ (R_{i,j} − U_{*i}^T V_{*j})^2 + λ_1 U_{*i}^T U_{*i} + λ_2 V_{*j}^T V_{*j} ]

Data Split Strategy
Decouple U and V as much as possible: the rating matrix R is split into P row blocks R^1, ..., R^P; node p holds block R^p, the corresponding block [U^p]^T of user factors, and its own copy V^p of the item factors.
[Figure: the row blocks R^p with their local [U^p]^T and V^p.]

Distributed ADMM
Reformulated MF problem:

min_{U, {V^p}, V}  Σ_{p=1}^{P} Σ_{(i,j)∈Ω^p} (1/2) [ (R_{i,j} − U_{*i}^T V_{*j}^p)^2 + λ_1 U_{*i}^T U_{*i} + λ_2 [V_{*j}^p]^T V_{*j}^p ]
s.t.  V^p − V = 0,  ∀p ∈ {1, 2, ..., P},

where {V^p}_{p=1}^{P} are the local copies and Ω^p denotes the (i, j) indices of the ratings located on node p.

Define

L^p(U, V^p, Θ^p, V) = f^p(U, V^p) + l^p(V^p, V, Θ^p)
                    = Σ_{(i,j)∈Ω^p} f̂_{i,j}(U_{*i}, V_{*j}^p) + (ρ/2) ||V^p − V||_F^2 + tr( [Θ^p]^T (V^p − V) ),

so that we get L(U, {V^p}, {Θ^p}, V) = Σ_{p=1}^{P} L^p(U, V^p, Θ^p, V).

The solution is obtained by repeating the following three steps:

U_{t+1}, V_{t+1}^p ← argmin_{U, V^p} L^p(U, V^p, Θ_t^p, V_t),  ∀p ∈ {1, 2, ..., P}
V_{t+1} ← argmin_V L(U_{t+1}, {V_{t+1}^p}, {Θ_t^p}, V)
Θ_{t+1}^p ← Θ_t^p + ρ (V_{t+1}^p − V_{t+1}),  ∀p ∈ {1, 2, ..., P}

The solution for V_{t+1} is

V_{t+1} = (1/P) Σ_{p=1}^{P} V_{t+1}^p,

which can be calculated efficiently.
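A single-process simulation of the three DS-ADMM steps above (a sketch under strong assumptions: the local (U, V^p) subproblem is only approximated by one SGD sweep over the node's ratings, whereas the paper derives closed-form stochastic updates shown on the next slide, and in the real system step 2 is an AllReduce average across machines):

```python
import numpy as np

def ds_admm_iteration(blocks, U, Vp, Theta, V, rho, lr, lam1, lam2):
    """blocks[p] is the list of (i, j, r) ratings on node p; U is m x k,
    V and each Vp[p], Theta[p] are n x k."""
    P = len(blocks)
    # Step 1: approximate the local argmin over (U, V^p) with one SGD sweep.
    for p, ratings in enumerate(blocks):
        for i, j, r in ratings:
            err = r - U[i] @ Vp[p][j]
            grad_u = err * Vp[p][j] - lam1 * U[i]
            grad_v = (err * U[i] - lam2 * Vp[p][j]
                      - Theta[p][j] - rho * (Vp[p][j] - V[j]))
            U[i] += lr * grad_u
            Vp[p][j] += lr * grad_v
    # Step 2: the global V is the average of the local copies (one synchronization).
    V = sum(Vp) / P
    # Step 3: dual updates.
    for p in range(P):
        Theta[p] += rho * (Vp[p] - V)
    return U, Vp, Theta, V
```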
The remaining problem is how to get U_{t+1} and V_{t+1}^p efficiently.

Stochastic Learning
Batch learning is still not very efficient:

U_{t+1} = U_t − τ_t ∇_U^T f^p(U_t, V_t^p),
V_{t+1}^p = (τ_t / (1 + ρτ_t)) [ V_t^p / τ_t + ρ V_t − Θ_t^p − ∇_{V^p}^T f^p(U_t, V_t^p) ].

Stochastic learning instead updates one rating at a time:

(U_{*i})_{t+1} = (U_{*i})_t + τ_t ( ε_{ij} (V_{*j}^p)_t − λ_1 (U_{*i})_t ),
(V_{*j}^p)_{t+1} = (τ_t / (1 + ρτ_t)) [ ((1 − λ_2 τ_t) / τ_t) (V_{*j}^p)_t + ε_{ij} (U_{*i})_t + ρ (V_{*j})_t − (Θ_{*j}^p)_t ],

where ε_{ij} = R_{ij} − [(U_{*i})_t]^T (V_{*j}^p)_t.

Scheduler Comparison
Baselines: CCD++ (H.-F. Yu et al., ICDM 2012); DSGD (R. Gemulla et al., KDD 2011).
Number of synchronizations needed in one iteration to fully scan all the ratings:
[Figure: scheduler flowcharts. (a) CCD++ synchronizes k times in total (it loops over the k latent dimensions, updating and synchronizing U_{d*} and V_{d*} in parallel); (b) DSGD synchronizes P times in total (it generates a stratum sequence for each node and loops over the P strata); (c) DS-ADMM synchronizes only once (update U^p and V^p in parallel, synchronize to get V, then update Θ^p in parallel).]

Experiment
Experiment environment: an MPI cluster with 20 nodes, each a 24-core server with 2.2 GHz CPUs and 96 GB of RAM; one core and 10 GB of memory are actually used on each node.
Baselines and evaluation metric:
- Baselines: CCD++, DSGD, DSGD-Bias
- Evaluation metric: test RMSE = sqrt( (1/Q) Σ (R_{i,j} − U_{*i}^T V_{*j})^2 ), where Q is the number of test ratings.
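Written out, the evaluation metric is just the RMSE over the Q held-out test ratings (row-major factors are assumed in this sketch, i.e. U[i] plays the role of U_{*i}):

```python
import numpy as np

def test_rmse(test_ratings, U, V):
    """test_ratings: iterable of (i, j, R_ij); U: m x k user factors; V: n x k item factors."""
    sq_errors = [(r - U[i] @ V[j]) ** 2 for i, j, r in test_ratings]
    return np.sqrt(np.mean(sq_errors))
```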
Data Sets
Table: Data sets and parameter settings.

Data set          m           n         #Train        #Test        k    η0/τ0   λ1/λ2   ρ      α       β     P
Netflix           480,190     17,770    99,072,112    1,408,395    40   0.1     0.05    0.05   0.002   0.7   8
Yahoo! Music R1   1,938,361   49,995    73,578,902    7,534,592    40   0.1     0.05    0.05   0.002   0.7   10
Yahoo! Music R2   1,823,179   136,736   699,640,226   18,231,790   40   0.1     0.05    0.1    0.006   0.7   20

Accuracy and Efficiency
[Figure: test RMSE versus running time (s) for CCD++, DSGD, DSGD-Bias, and DS-ADMM on (a) Netflix, (b) Yahoo-Music-R1, and (c) Yahoo-Music-R2, and (d) log running time to reach a fixed test RMSE of 0.922 versus the number of nodes.]

Speedup
[Figure: speedup of CCD++, DSGD, and DS-ADMM versus the number of nodes (2–8), compared with linear speedup.]

Conclusion

Our Contribution
- Learning to hash: reduces memory/disk/CPU/communication cost
- Distributed learning: reduces memory/disk/CPU cost per machine, but increases communication cost
- Stochastic learning: reduces memory/disk/CPU cost

Future Work
- Learning to hash for decreasing communication cost
- Distributed programming models and platforms for machine learning: MPI (fault tolerance, asynchrony), GraphLab, Spark, Parameter Server, MapReduce, Storm, GPU, etc.
- Distributed stochastic learning: communication modeling
- A big data machine learning (big learning) framework

Related Publications (* indicates my students)
- Ling Yan*, Wu-Jun Li, Gui-Rong Xue, Dingyi Han. Coupled Group Lasso for Web-Scale CTR Prediction in Display Advertising. Proceedings of the 31st International Conference on Machine Learning (ICML), 2014.
- Peichao Zhang*, Wei Zhang*, Wu-Jun Li, Minyi Guo. Supervised Hashing with Latent Factor Models. Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2014.
- Cong Xie*, Ling Yan*, Wu-Jun Li, Zhihua Zhang. Distributed Power-Law Graph Computing: Theoretical and Empirical Analysis. To appear in Proceedings of the 28th Annual Conference on Neural Information Processing Systems (NIPS), 2014.
- Zhi-Qin Yu*, Xing-Jian Shi*, Ling Yan*, Wu-Jun Li. Distributed Stochastic ADMM for Matrix Factorization. To appear in Proceedings of the 23rd ACM International Conference on Information and Knowledge Management (CIKM), 2014.
- Dongqing Zhang*, Wu-Jun Li. Large-Scale Supervised Multimodal Hashing with Semantic Correlation Maximization. Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence (AAAI), 2014.
- Weihao Kong*, Wu-Jun Li. Isotropic Hashing. Proceedings of the 26th Annual Conference on Neural Information Processing Systems (NIPS), 2012.
- Weihao Kong*, Wu-Jun Li, Minyi Guo. Manhattan Hashing for Large-Scale Image Retrieval. Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2012.
- Weihao Kong*, Wu-Jun Li. Double-Bit Quantization for Hashing. Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence (AAAI), 2012.

Q&A
Thanks!