Big Data Machine Learning
Wu-Jun Li
LAMDA Group
Department of Computer Science and Technology, Nanjing University
National Key Laboratory for Novel Software Technology
liwujun@nju.edu.cn
Oct 31, 2014
Outline
1 Introduction
2 Learning to Hash
    Isotropic Hashing
    Supervised Hashing with Latent Factor Models
    Supervised Multimodal Hashing with SCM
    Multiple-Bit Quantization
3 Distributed Learning
    Coupled Group Lasso for Web-Scale CTR Prediction
    Distributed Power-Law Graph Computing
4 Stochastic Learning
    Distributed Stochastic ADMM for Matrix Factorization
5 Conclusion
Introduction
Outline
1 Introduction
2 Learning to Hash
    Isotropic Hashing
    Supervised Hashing with Latent Factor Models
    Supervised Multimodal Hashing with SCM
    Multiple-Bit Quantization
3 Distributed Learning
    Coupled Group Lasso for Web-Scale CTR Prediction
    Distributed Power-Law Graph Computing
4 Stochastic Learning
    Distributed Stochastic ADMM for Matrix Factorization
5 Conclusion
Introduction
Big Data
Big data has attracted much attention from both academia and industry.
Facebook: 750 Million users
Flickr: 6 Billion photos
Wal-Mart: 267 Million items/day; 4PB data warehouse
Sloan Digital Sky Survey: New Mexico telescope captures 200 GB image data/day
Introduction
Definition of Big Data
Gartner (2012): “Big data is high volume, high velocity, and/or high variety
information assets that require new forms of processing to enable enhanced
decision making, insight discovery and process optimization.” (“3Vs”)
International Data Corporation (IDC) (2011): “Big data technologies
describe a new generation of technologies and architectures, designed to
economically extract value from very large volumes of a wide variety of data,
by enabling high-velocity capture, discovery, and/or analysis.” (“4Vs”)
McKinsey Global Institute (MGI) (2011): “Big data refers to datasets whose
size is beyond the ability of typical database software tools to capture, store,
manage, and analyze.”
Why did it not become hot until recent years?
Big data
Cloud computing
Big data machine learning
Introduction
Big Data Machine Learning
Definition: perform machine learning from big data.
Role: key for big data
The ultimate goal of big data processing is to mine value from data.
Machine learning provides fundamental theory and computational
techniques for big data mining and analysis.
Introduction
Challenge
Storage: memory and disk
Computation: CPU
Communication: network
Introduction
Our Contribution
Learning to hash: memory/disk/CPU/communication
Distributed learning: memory/disk/CPU,
but increases communication cost
Stochastic learning: memory/disk/CPU
Learning to Hash
Outline
1 Introduction
2 Learning to Hash
    Isotropic Hashing
    Supervised Hashing with Latent Factor Models
    Supervised Multimodal Hashing with SCM
    Multiple-Bit Quantization
3 Distributed Learning
    Coupled Group Lasso for Web-Scale CTR Prediction
    Distributed Power-Law Graph Computing
4 Stochastic Learning
    Distributed Stochastic ADMM for Matrix Factorization
5 Conclusion
Learning to Hash
Nearest Neighbor Search (Retrieval)
Given a query point q, return the points closest (most similar) to q in the database (e.g., images).
Underlies many machine learning, data mining, and information retrieval problems
Challenge in Big Data Applications:
Curse of dimensionality
Storage cost
Query speed
Learning to Hash
Similarity Preserving Hashing
Learning to Hash
Reduce Dimensionality and Storage Cost
Learning to Hash
Querying
Hamming distance:
||01101110, 00101101||H = 3
||11011, 01011||H = 1
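A quick plain-Python sketch (not part of the original slides) of how these Hamming distances can be computed, including the usual XOR-plus-popcount trick for codes packed into integers:

```python
# Hamming distance between binary hash codes.

def hamming_distance(a: str, b: str) -> int:
    """Number of positions where two equal-length bit strings differ."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

def hamming_distance_int(a: int, b: int) -> int:
    """Same distance for codes packed into integers: XOR then popcount."""
    return bin(a ^ b).count("1")

print(hamming_distance("01101110", "00101101"))  # 3
print(hamming_distance("11011", "01011"))        # 1
```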
Learning to Hash
Querying
Learning to Hash
Querying
Learning to Hash
Fast Query Speed
By using a hashing scheme, we can achieve constant or sub-linear search time complexity.
Exhaustive search also becomes acceptable because the distance calculation is now cheap.
Learning to Hash
Two Stages of Hash Function Learning
Projection Stage (Dimension Reduction)
Project with real-valued projection functions.
Given a point x, each projected dimension i is associated with a real-valued projection function f_i(x) (e.g., f_i(x) = w_i^T x).
Quantization Stage
Turn the real values into binary codes.
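For illustration, a minimal sketch of this two-stage scheme with an assumed random projection matrix W and sign quantization; the data and projection here are toy placeholders, not any of the specific methods discussed later:

```python
# Two-stage hashing sketch: linear projection, then sign quantization.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64))   # toy data: 1000 points, 64-dim features

m = 16                            # number of bits
W = rng.normal(size=(64, m))      # projection stage: f_i(x) = w_i^T x

real_proj = X @ W                 # real-valued projections
codes = np.sign(real_proj)        # quantization stage: sgn(.) -> {-1, +1}
codes[codes == 0] = 1             # break ties consistently
```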
Learning to Hash
Our Contribution
Unsupervised Hashing [NIPS 2012]:
Isotropic hashing (IsoHash)
Supervised Hashing [SIGIR 2014]:
Supervised hashing with latent factor models
Multimodal Hashing [AAAI 2014]:
Large-scale supervised multimodal hashing with semantic correlation
maximization
Multiple-Bit Quantization:
Double-bit quantization (DBQ) [AAAI 2012]
Manhattan quantization (MQ) [SIGIR 2012]
Learning to Hash
Isotropic Hashing
Motivation
Problem:
All existing methods use the same number of bits for different projected
dimensions with different variances.
Possible Solutions:
Different numbers of bits for different dimensions
(unfortunately, no effective way has been found)
Isotropic (equal) variances for all dimensions
Learning to Hash
Isotropic Hashing
PCA Hash
To generate a code of m bits, PCAH performs PCA on X and then uses the top m eigenvectors of the matrix XX^T as the columns of the projection matrix W ∈ R^{d×m}. Here the top m eigenvectors are those corresponding to the m largest eigenvalues {λ_k}_{k=1}^m, arranged in non-increasing order λ_1 ≥ λ_2 ≥ ··· ≥ λ_m. Let λ = [λ_1, λ_2, ··· , λ_m]^T. Then

Λ = W^T XX^T W = diag(λ)

Define the hash function

h(x) = sgn(W^T x)
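A hypothetical PCAH-style sketch (not the authors' code): it assumes the rows of X are data points, so X^T X plays the role of XX^T in the slide's column-wise notation.

```python
# PCAH sketch: project onto the top-m principal directions and binarize.
import numpy as np

def pcah(X, m):
    Xc = X - X.mean(axis=0)                  # center the data
    eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc)  # symmetric eigendecomposition
    order = np.argsort(eigvals)[::-1][:m]    # indices of the m largest eigenvalues
    W = eigvecs[:, order]                    # projection matrix, d x m
    codes = np.where(Xc @ W >= 0, 1, -1)     # h(x) = sgn(W^T x)
    return codes, W, eigvals[order]
```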
Learning to Hash
Isotropic Hashing
Idea of IsoHash
Learn an orthogonal matrix Q ∈ R^{m×m} which makes Q^T W^T XX^T W Q a matrix with equal diagonal values.
Effect of Q: to make each projected dimension have the same variance while keeping the Euclidean distances between any two points unchanged.
Learning to Hash
Isotropic Hashing
Accuracy (mAP)
Dataset: CIFAR

Method \ # bits    32        64        96        128       256
IsoHash            0.2249    0.2969    0.3256    0.3357    0.3651
PCAH               0.0319    0.0274    0.0241    0.0216    0.0168
ITQ                0.2490    0.3051    0.3238    0.3319    0.3436
SH                 0.0510    0.0589    0.0802    0.1121    0.1535
SIKH               0.0353    0.0902    0.1245    0.1909    0.3614
LSH                0.1052    0.1907    0.2396    0.2776    0.3432
Learning to Hash
Isotropic Hashing
Training Time
(Figure: training time in seconds vs. number of training data (×10^4) for IsoHash-GF, IsoHash-LP, ITQ, SH, SIKH, LSH, and PCAH.)
Learning to Hash
Supervised Hashing with Latent Factor Models
Problem Definition
Input:
Feature vectors: xi ∈ RD , i = 1, . . . , N .
(Compact form: X ∈ RN ×D )
Similarity labels: sij , i, j = 1, . . . , N .
(Compact form: S = {sij })
sij = 1 if points i and j belong to the same class.
sij = 0 if points i and j belong to different classes.
Output:
Binary codes: bi ∈ {−1, 1}Q , i = 1, . . . , N .
(Compact form: B ∈ {−1, 1}N ×Q )
When sij = 1, the Hamming distance between bi and bj should be
low.
When sij = 0, the Hamming distance between bi and bj should be
high.
Learning to Hash
Supervised Hashing with Latent Factor Models
Motivation
Existing supervised methods:
High training complexity
Semantic information is poorly utilized
Learning to Hash
Supervised Hashing with Latent Factor Models
Model
The likelihood of the observed similarity labels S is defined as:

p(S | B) = ∏_{s_ij ∈ S} p(s_ij | B)

p(s_ij | B) = a_ij if s_ij = 1, and 1 − a_ij if s_ij = 0

where a_ij is defined as a_ij = σ(Θ_ij), with

σ(x) = 1 / (1 + e^{−x}),   Θ_ij = (1/2) b_i^T b_j

Relationship between the Hamming distance and the inner product:

dist_H(b_i, b_j) = (1/2)(Q − b_i^T b_j) = (1/2)(Q − 2Θ_ij)
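A hypothetical sketch (not the authors' implementation) of these quantities for codes b_i, b_j ∈ {−1, +1}^Q:

```python
# LFH likelihood terms: a_ij = sigma(Theta_ij), Theta_ij = b_i^T b_j / 2,
# and the Hamming distance recovered from the inner product.
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

def pair_likelihood(b_i, b_j, s_ij):
    theta = 0.5 * np.dot(b_i, b_j)
    a_ij = sigma(theta)
    return a_ij if s_ij == 1 else 1.0 - a_ij

def hamming_from_inner(b_i, b_j):
    Q = len(b_i)
    return 0.5 * (Q - np.dot(b_i, b_j))   # dist_H = (Q - b_i^T b_j) / 2
```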
Learning to Hash
Supervised Hashing with Latent Factor Models
Relaxation
Re-define Θ_ij as:

Θ_ij = (1/2) U_{i*}^T U_{j*}

p(S | B), p(B), p(B | S) become p(S | U), p(U), p(U | S).

Define a normal distribution for p(U) as:

p(U) = ∏_{d=1}^{Q} N(U_{*d} | 0, βI)

The log posterior of U can be derived as:

L = log p(U | S) = Σ_{s_ij ∈ S} (s_ij Θ_ij − log(1 + e^{Θ_ij})) − (1/(2β)) ||U||_F^2 + c
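A hypothetical sketch (assuming the observed entries of S are given as a list of (i, j, s_ij) triples) of evaluating this log posterior for a relaxed real-valued U:

```python
# LFH log posterior: sum over observed pairs minus a Gaussian prior term.
import numpy as np

def log_posterior(U, pairs, beta):
    """U: N x Q real matrix; pairs: iterable of (i, j, s_ij) with s_ij in {0, 1}."""
    L = 0.0
    for i, j, s in pairs:
        theta = 0.5 * np.dot(U[i], U[j])
        L += s * theta - np.log1p(np.exp(theta))   # s_ij*Theta_ij - log(1 + e^Theta_ij)
    L -= np.sum(U ** 2) / (2.0 * beta)             # -||U||_F^2 / (2*beta)
    return L
```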
Learning to Hash
Supervised Hashing with Latent Factor Models
Stochastic Learning
Furthermore, if we choose the subset of S by randomly selecting O(Q) of its columns and rows, we can further reduce the time cost to O(NQ^2) per iteration.
Learning to Hash
Supervised Hashing with Latent Factor Models
MAP (CIFAR-10)
(Figure: MAP vs. code length (8–128 bits) on CIFAR-10 for LFH, KSH, MLH, ITQ, AGH, LSH, PCAH, SH, and SIKH.)
Learning to Hash
Supervised Hashing with Latent Factor Models
Training Time (CIFAR-10)
(Figure: log training time vs. code length (8–128 bits) on CIFAR-10 for LFH, KSH, MLH, ITQ, AGH, LSH, PCAH, SH, and SIKH.)
Learning to Hash
Supervised Multimodal Hashing with SCM
Supervised Multimodal Similarity Search
Given a query of either image or text, return images or texts similar to
it in both feature space and semantics (label information).
Learning to Hash
Supervised Multimodal Hashing with SCM
Motivation and Contribution
Motivation
Existing supervised methods are not scalable
Contribution
Avoids explicitly computing the pairwise similarity matrix, achieving linear-time complexity w.r.t. the size of the training data.
A sequential learning method with a closed-form solution for each bit; no hyper-parameters or stopping conditions are needed.
Learning to Hash
Supervised Multimodal Hashing with SCM
In matrix form, we can rewrite the problem as follows:

min_{W_x, W_y}  ||sgn(X W_x) sgn(Y W_y)^T − cS||_F^2
s.t.  sgn(X W_x)^T sgn(X W_x) = n I_c
      sgn(Y W_y)^T sgn(Y W_y) = n I_c
Learning to Hash
Supervised Multimodal Hashing with SCM
Sequential Strategy

Assume that the projection vectors w_x^{(1)}, ..., w_x^{(t−1)} and w_y^{(1)}, ..., w_y^{(t−1)} have been learned; we then learn the next projection vectors w_x^{(t)} and w_y^{(t)}.

Define a residue matrix

R_t = cS − Σ_{k=1}^{t−1} sgn(X w_x^{(k)}) sgn(Y w_y^{(k)})^T.

The objective function can be written as

min_{w_x^{(t)}, w_y^{(t)}}  ||sgn(X w_x^{(t)}) sgn(Y w_y^{(t)})^T − R_t||_F^2.
Learning to Hash
Supervised Multimodal Hashing with SCM
Algorithm 1 Learning Algorithm of SCM Hashing Method.
C_xy^(0) ← 2(X^T L̃)(Y^T L̃)^T − (X^T 1_n)(Y^T 1_n)^T;
C_xy^(1) ← c × C_xy^(0);
C_xx ← X^T X + γ I_{d_x};
C_yy ← Y^T Y + γ I_{d_y};
for t = 1 → c do
    Solve the following generalized eigenvalue problem
        C_xy^(t) C_yy^{−1} [C_xy^(t)]^T w_x = λ^2 C_xx w_x,
    to obtain the optimal solution w_x^(t) corresponding to the largest eigenvalue λ_max;
    w_y^(t) ← C_yy^{−1} [C_xy^(t)]^T w_x^(t) / λ_max;
    h_x^(t) ← sgn(X w_x^(t));
    h_y^(t) ← sgn(Y w_y^(t));
    C_xy^(t+1) ← C_xy^(t) − (X^T sgn(X w_x^(t)))(Y^T sgn(Y w_y^(t)))^T;
end for
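A hypothetical sketch of one such sequential step (not the authors' code; it assumes SciPy is available and ignores numerical corner cases such as λ_max = 0):

```python
# One SCM-Seq step: solve A w_x = lambda^2 C_xx w_x with
# A = C_xy C_yy^{-1} C_xy^T, recover w_y, and update the residue C_xy.
import numpy as np
from scipy.linalg import eigh, solve

def scm_step(C_xy, C_xx, C_yy, X, Y):
    A = C_xy @ solve(C_yy, C_xy.T)           # C_xy C_yy^{-1} C_xy^T
    eigvals, eigvecs = eigh(A, C_xx)         # generalized symmetric eigenproblem
    lam_max = np.sqrt(max(eigvals[-1], 0.0)) # largest lambda^2 -> lambda_max
    w_x = eigvecs[:, -1]
    w_y = solve(C_yy, C_xy.T @ w_x) / lam_max
    h_x = np.sign(X @ w_x)                   # binary codes for this bit
    h_y = np.sign(Y @ w_y)
    C_xy_next = C_xy - (X.T @ h_x)[:, None] @ (Y.T @ h_y)[None, :]
    return w_x, w_y, C_xy_next
```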
Learning to Hash
Supervised Multimodal Hashing with SCM
Scalability
Table: Training time (in seconds) on NUS-WIDE dataset by varying the size of
training set.
Method \ Size of Training Set    500      1000     1500    2000    2500    3000    5000    10000   20000
SCM-Seq                          276      249      303     222     236     260     248     228     230
SCM-Orth                         36       80       85      77      83      76      110     87      102
CCA                              25       20       23      22      25      22      28      38      44
CCA-3V                           69       57       68      69      62      55      67      70      86
CVH                              62       116      123     149     155     170     237     774     1630
CRH                              68       253      312     515     760     1076    -       -       -
MLBE                             67071    126431   -       -       -       -       -       -       -
Learning to Hash
Supervised Multimodal Hashing with SCM
Accuracy
Table: MAP results on NUS-WIDE. The best performance is shown in boldface.
Task: Image Query v.s. Text Database
Method      c = 16    c = 24    c = 32
SCM-Seq     0.4385    0.4397    0.4390
SCM-Orth    0.3804    0.3746    0.3662
CCA         0.3625    0.3586    0.3565
CCA-3V      0.3826    0.3741    0.3692
CVH         0.3608    0.3575    0.3562
CRH         0.3957    0.3965    0.3970
MLBE        0.3697    0.3620    0.3540

Task: Text Query v.s. Image Database
Method      c = 16    c = 24    c = 32
SCM-Seq     0.4273    0.4265    0.4259
SCM-Orth    0.3757    0.3625    0.3581
CCA         0.3619    0.3580    0.3560
CCA-3V      0.3801    0.3721    0.3676
CVH         0.3640    0.3596    0.3581
CRH         0.3926    0.3910    0.3904
MLBE        0.3877    0.3636    0.3551
Learning to Hash
Multiple-Bit Quantization
Double Bit Quantization
(Figure: histogram of the real values (sample number vs. projected value X), divided into regions A, B, C, D, together with the binary codes assigned by (a) SBQ, (b) HH, and (c) DBQ.)
Point distribution of the real values computed by PCA on 22K LabelMe
data set, and different coding results based on the distribution:
(a) single-bit quantization (SBQ);
(b) hierarchical hashing (HH);
(c) double-bit quantization (DBQ).
Learning to Hash
Multiple-Bit Quantization
Experiment
mAP on LabelMe data set

          32 bits                      64 bits
Method    SBQ       HH        DBQ      SBQ       HH        DBQ
ITQ       0.2926    0.2592    0.3079   0.3413    0.3487    0.4002
SH        0.0859    0.1329    0.1815   0.1071    0.1768    0.2649
PCA       0.0535    0.1009    0.1563   0.0417    0.1034    0.1822
LSH       0.1657    0.105     0.12272  0.2594    0.2089    0.2577
SIKH      0.0590    0.0712    0.0772   0.1132    0.1514    0.1737

          128 bits                     256 bits
Method    SBQ       HH        DBQ      SBQ       HH        DBQ
ITQ       0.3675    0.4032    0.4650   0.3846    0.4251    0.4998
SH        0.1730    0.2034    0.3403   0.2140    0.2468    0.3468
PCA       0.0323    0.1083    0.1748   0.0245    0.1103    0.1499
LSH       0.3579    0.3311    0.4055   0.4158    0.4359    0.5154
SIKH      0.2792    0.3147    0.3436   0.4759    0.5055    0.5325
Learning to Hash
Multiple-Bit Quantization
Manhattan Quantization
Learning to Hash
Multiple-Bit Quantization
Experiment
Table: mAP on ANN SIFT1M data set. The best mAP among SBQ, HQ and
2-MQ under the same setting is shown in bold face.
          32 bits                       64 bits                       96 bits
Method    SBQ       HQ        2-MQ      SBQ       HQ        2-MQ      SBQ       HQ        2-MQ
ITQ       0.1657    0.2500    0.2750    0.4641    0.4745    0.5087    0.5424    0.5871    0.6263
SIKH      0.0394    0.0217    0.0570    0.2027    0.0822    0.2356    0.2263    0.1664    0.2768
LSH       0.1163    0.0961    0.1173    0.2340    0.2815    0.3111    0.3767    0.4541    0.4599
SH        0.0889    0.2482    0.2771    0.1828    0.3841    0.4576    0.2236    0.4911    0.5929
PCA       0.1087    0.2408    0.2882    0.1671    0.3956    0.4683    0.1625    0.4927    0.5641
Distributed Learning
Outline
1 Introduction
2 Learning to Hash
    Isotropic Hashing
    Supervised Hashing with Latent Factor Models
    Supervised Multimodal Hashing with SCM
    Multiple-Bit Quantization
3 Distributed Learning
    Coupled Group Lasso for Web-Scale CTR Prediction
    Distributed Power-Law Graph Computing
4 Stochastic Learning
    Distributed Stochastic ADMM for Matrix Factorization
5 Conclusion
Distributed Learning
Definition
Perform machine learning on clusters with several machines (nodes).
Distributed Learning
Coupled Group Lasso for Web-Scale CTR Prediction
CTR Prediction for Online Advertising
Online advertising is a multi-billion-dollar business on the web and accounts for the majority of the income of the major internet companies.
Display advertising is a big part of online advertising.
Click through rate (CTR) prediction is the problem of estimating the
probability that an ad is clicked when displayed to a user in a specific
context.
(Figures: (a) Google; (b) Amazon; (c) Taobao.)
Distributed Learning
Coupled Group Lasso for Web-Scale CTR Prediction
Notation
Impression (instance): ad + user + context
Training set {(x^(i), y^(i)) | i = 1, ..., N}
x^T = (x_u^T, x_a^T, x_o^T)
y ∈ {0, 1}, with y = 1 denoting click and y = 0 denoting non-click
Learn h(x) = h(x_u, x_a, x_o) to predict CTR
Distributed Learning
Coupled Group Lasso for Web-Scale CTR Prediction
Coupled Group Lasso (CGL)
1 Likelihood

h(x) = Pr(y = 1 | x, W, V, b) = g((x_u^T W)(x_a^T V)^T + b^T x_o)

where W ∈ R^{d_u×k}, V ∈ R^{d_a×k}, and (x_u^T W)(x_a^T V)^T = x_u^T (WV^T) x_a

2 Objective function

min_{W,V,b}  Σ_{i=1}^N ξ(W, V, b; x^(i), y^(i)) + λ Ω(W, V),

in which

ξ(W, V, b; x^(i), y^(i)) = − log[(h(x^(i)))^{y^(i)} (1 − h(x^(i)))^{1−y^(i)}]

Ω(W, V) = ||W||_{2,1} + ||V||_{2,1} = Σ_{i=1}^{d_u} ||W_{i*}||_2 + Σ_{i=1}^{d_a} ||V_{i*}||_2
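A hypothetical sketch (not the authors' implementation, and assuming g is the logistic function) of the CGL prediction h(x) and the group-lasso regularizer Ω(W, V):

```python
# CGL prediction and group-lasso regularizer.
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def cgl_predict(x_u, x_a, x_o, W, V, b):
    # (x_u^T W)(x_a^T V)^T == x_u^T (W V^T) x_a
    return logistic((x_u @ W) @ (x_a @ V) + b @ x_o)

def group_lasso(W, V):
    # ||W||_{2,1} + ||V||_{2,1}: sum of row-wise L2 norms
    return np.linalg.norm(W, axis=1).sum() + np.linalg.norm(V, axis=1).sum()
```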
Distributed Learning
Coupled Group Lasso for Web-Scale CTR Prediction
Distributed Learning Framework
Compute the local gradient g_p on each node p in parallel.
Compute the full gradient g = Σ_{p=1}^P g_p with AllReduce.
Add the gradient of the regularization term and take an L-BFGS step in the master node.
Broadcast the updated parameters to each slave node.
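A minimal sketch of the AllReduce step, assuming an MPI environment with mpi4py available (the surrounding L-BFGS logic is omitted):

```python
# Sum the local gradients g_p over all P nodes with MPI AllReduce.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def aggregate_gradient(local_grad: np.ndarray) -> np.ndarray:
    total = np.empty_like(local_grad)
    comm.Allreduce(local_grad, total, op=MPI.SUM)
    return total  # every node now holds g = sum_p g_p
```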
Distributed Learning
Coupled Group Lasso for Web-Scale CTR Prediction
Experiment
Experiment Environment
MPI-Cluster with hundreds of nodes, each of which is a 24-core server
with 2.2GHz CPU and 96GB of RAM
Baseline and Evaluation Metric
Baseline: LR (with L2-norm) [Chapelle et al., 2013]
Evaluation Metric (Discrimination)

RelaImpr = (AUC(model) − 0.5) / (AUC(baseline) − 0.5) × 100%
Distributed Learning
Coupled Group Lasso for Web-Scale CTR Prediction
Data Sets
Data set                Train 1      Test 1      Train 2      Test 2      Train 3      Test 3
# Instances (Billion)   1.011        0.295       1.184        0.145       1.491        0.126
CTR (%)                 1.62         1.70        1.61         1.64        1.75         1.70
# Ads                   21,318       11,558      21,620       6,848       33,538       9,437
# Users (Million)       874.7        331.0       958.6        190.3       1119.3       183.7
Storage (TB)            1.895        0.646       2.203        0.269       2.865        0.233
Real-world data sets collected from Taobao of Alibaba Group
Distributed Learning
Coupled Group Lasso for Web-Scale CTR Prediction
Performance
(Figures: accuracy and scalability — (c) RelaImpr w.r.t. baseline; (d) speedup.)
Distributed Learning
Coupled Group Lasso for Web-Scale CTR Prediction
Feature Selection
Feature Selection Results

            Important Features                              Useless Features
Ad Part     Women's clothes, Skirt, Dress,                  Movie, Act, Takeout, Food booking service
            Children's wear, Shoes, Cellphone
User Part   Watch, Underwear, Fur clothing, Furniture       Stage Costume, Flooring, Pencil, Outdoor sock
Distributed Learning
Distributed Power-Law Graph Computing
Graph-based Machine Learning
Big graphs emerge in many real applications
Graph-based machine learning is a hot research topic with wide
applications: relational learning, manifold learning, PageRank,
community detection, etc
Distributed graph computing frameworks for big graphs
(Figures: (a) a social network; (b) a biological network.)
Distributed Learning
Distributed Power-Law Graph Computing
Graph Partitioning
Graph partitioning (GP) plays a key role in the performance of distributed graph computing:
Workload balance
Communication cost
Two strategies for graph partitioning. Shaded vertices are ghosts and
mirrors, respectively.
(a) Edge-Cut
(b) Vertex-Cut
Theoretical and empirical results show vertex-cut is better than edge-cut.
Distributed Learning
Distributed Power-Law Graph Computing
Power-Law Graph Partitioning
Natural graphs from the real world typically follow skewed power-law degree distributions: Pr(d) ∝ d^{−α}.
Different vertex-cut methods can result in different performance
(Figures: (a) a sample graph; (b) a bad partitioning; (c) a good partitioning.)
Distributed Learning
Distributed Power-Law Graph Computing
Degree-based Hashing (DBH)
Existing GP methods, such as Random in PowerGraph (Joseph E. Gonzalez et al., 2012) and Grid in GraphBuilder (Nilesh Jain et al., 2013), do not make effective use of the power-law degree distribution.
We propose a novel GP method called degree-based hashing (DBH):
Algorithm 2 GP with DBH
Input: The set of edges E; the set of vertices V ; number of machines p.
Output: The assignment M (e) ∈ {1, . . . , p} for each edge e.
Initialization: count the degree di for each vertex i ∈ {1, . . . , n} in parallel
for all e = (vi , vj ) ∈ E do
Hash each edge in parallel:
if di < dj then
M (e) ← vertex hash(vi )
else
M (e) ← vertex hash(vj )
end if
end for
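A single-process sketch of DBH (assumptions: Python's built-in hash stands in for vertex_hash, and the parallel degree counting and edge hashing in Algorithm 2 are written as plain loops):

```python
# Degree-based hashing: assign each edge by hashing its lower-degree endpoint.
from collections import Counter

def dbh_partition(edges, p):
    """edges: list of (vi, vj) pairs; p: number of machines."""
    degree = Counter()
    for vi, vj in edges:                 # count degrees (done in parallel in practice)
        degree[vi] += 1
        degree[vj] += 1
    assignment = {}
    for e in edges:                      # hash each edge (done in parallel in practice)
        vi, vj = e
        v = vi if degree[vi] < degree[vj] else vj
        assignment[e] = hash(v) % p      # M(e) <- vertex_hash(v) in {0, ..., p-1}
    return assignment
```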
Distributed Learning
Distributed Power-Law Graph Computing
Theoretical Analysis
We theoretically prove that our DBH method can outperform Random and
Grid in terms of:
reducing replication factor (communication and storage cost)
keeping good edge-balance (workload balance)
Nice property: our DBH reduces the replication factor more when the power-law graph is more skewed.
Distributed Learning
Distributed Power-Law Graph Computing
Data Set
Table: Datasets
Alias    Graph          |V|      |E|
Tw       Twitter        42M      1.47B
Arab     Arabic-2005    22M      0.6B
Wiki     Wiki           5.7M     130M
LJ       LiveJournal    5.4M     79M
WG       WebGoogle      0.9M     5.1M
Distributed Learning
Distributed Power-Law Graph Computing
Empirical Results

(Figure: experiments on real-world graphs (WG, LJ, Wiki, Arab, Tw) with 48 machines — (d) replication factor and (e) speedup (%) relative to the baselines, comparing Random, Grid, and DBH.)
(Figure: (a) replication factor and (b) execution time (sec) on the Twitter graph as the number of machines ranges from 8 to 64, comparing Random, Grid, and DBH.)
Stochastic Learning
Outline
1 Introduction
2 Learning to Hash
    Isotropic Hashing
    Supervised Hashing with Latent Factor Models
    Supervised Multimodal Hashing with SCM
    Multiple-Bit Quantization
3 Distributed Learning
    Coupled Group Lasso for Web-Scale CTR Prediction
    Distributed Power-Law Graph Computing
4 Stochastic Learning
    Distributed Stochastic ADMM for Matrix Factorization
5 Conclusion
Stochastic Learning
Definition
Given a (large) set of training data {(xi , yi )|i = 1, 2, · · · , n}, each time we
randomly sample one training instance for training.
Many machine learning problems can be formulated as:

argmin_w L(w) = (1/n) Σ_{i=1}^n f_i(w | x_i, y_i)
Hence, the computation cost can be greatly reduced by using stochastic
learning.
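A minimal SGD sketch of this idea (grad_f is an assumed per-example gradient function; the step size and stopping rule are placeholders):

```python
# Stochastic learning: sample one instance per update instead of the full sum.
import numpy as np

def sgd(grad_f, X, y, w0, lr=0.01, n_steps=10000, seed=0):
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for _ in range(n_steps):
        i = rng.integers(len(y))          # randomly sample one training instance
        w -= lr * grad_f(w, X[i], y[i])   # w <- w - lr * grad f_i(w | x_i, y_i)
    return w
```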
Stochastic Learning
Distributed Stochastic ADMM for Matrix Factorization
Recommender System
Recommend products (items) to customers (users) by utilizing the
customers’ historic preferences.
(Figures: (a) Amazon; (b) Sina Weibo.)
Stochastic Learning
Distributed Stochastic ADMM for Matrix Factorization
Matrix Factorization
Popular in recommender systems for its promising performance
(Figure: an m × n rating matrix R (m users, n items) is factorized as R ≈ U^T V, with user factors U ∈ R^{k×m} and item factors V ∈ R^{k×n}.)

min_{U,V}  (1/2) Σ_{(i,j)∈Ω} [ (R_{i,j} − U_{*i}^T V_{*j})^2 + λ_1 U_{*i}^T U_{*i} + λ_2 V_{*j}^T V_{*j} ]
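A hypothetical sketch (not the authors' code) of evaluating this objective, with U and V stored column-wise as k × m and k × n arrays:

```python
# Regularized matrix-factorization objective over observed ratings Omega.
import numpy as np

def mf_objective(ratings, U, V, lam1, lam2):
    """ratings: list of (i, j, R_ij); U: k x m user factors; V: k x n item factors."""
    obj = 0.0
    for i, j, r in ratings:
        err = r - U[:, i] @ V[:, j]                       # R_ij - U_*i^T V_*j
        obj += err ** 2 + lam1 * U[:, i] @ U[:, i] + lam2 * V[:, j] @ V[:, j]
    return 0.5 * obj
```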
Stochastic Learning
Distributed Stochastic ADMM for Matrix Factorization
Data Split Strategy
Decouple U and V as much as we can:
(Figure: the rating matrix R is split into blocks R^1, R^2, ..., R^P; node p stores R^p and [U^p]^T, together with a local copy V^p of the item-factor matrix V.)
Stochastic Learning
Distributed Stochastic ADMM for Matrix Factorization
Distributed ADMM
Reformulated MF problem:

min_{U, {V^p}, V}  (1/2) Σ_{p=1}^P Σ_{(i,j)∈Ω_p} [ (R_{i,j} − U_{*i}^T V^p_{*j})^2 + λ_1 U_{*i}^T U_{*i} + λ_2 [V^p_{*j}]^T V^p_{*j} ]

s.t.  V^p − V = 0,  ∀p ∈ {1, 2, ..., P}

where {V^p}_{p=1}^P are the per-node copies of the item factors, and Ω_p denotes the (i, j) indices of the ratings located in node p.
Stochastic Learning
Distributed Stochastic ADMM for Matrix Factorization
Distributed ADMM
Define:

L^p(U, V^p, Θ^p, V) = f^p(U, V^p) + l^p(V^p, V, Θ^p)
                    = Σ_{(i,j)∈Ω_p} f̂_{i,j}(U_{*i}, V^p_{*j}) + (ρ/2) ||V^p − V||_F^2 + tr([Θ^p]^T (V^p − V)),

then we can get

L(U, {V^p}, {Θ^p}, V) = Σ_{p=1}^P L^p(U, V^p, Θ^p, V).
Stochastic Learning
Distributed Stochastic ADMM for Matrix Factorization
Distributed ADMM
Get the solutions by repeating the following three steps:

U_{t+1}, V^p_{t+1} ← argmin_{U, V^p} L^p(U, V^p, Θ^p_t, V_t),  ∀p ∈ {1, 2, ..., P}
V_{t+1} ← argmin_V L(U_{t+1}, {V^p_{t+1}}, {Θ^p_t}, V)
Θ^p_{t+1} ← Θ^p_t + ρ(V^p_{t+1} − V_{t+1}),  ∀p ∈ {1, 2, ..., P}

The solution for V_{t+1} is:

V_{t+1} = (1/P) Σ_{p=1}^P V^p_{t+1}

which can be calculated efficiently.
The problem lies in getting U_{t+1} and V^p_{t+1} efficiently.
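A toy single-process sketch of these three steps (local_solve is a hypothetical placeholder for the per-node (U, V^p) update; in DS-ADMM the P local updates run on P nodes):

```python
# One ADMM round: local updates, global averaging, dual updates.
import numpy as np

def admm_round(local_solve, V_locals, Thetas, V_global, rho):
    P = len(V_locals)
    # Step 1: each node p updates its block of U and its local copy V^p
    for p in range(P):
        V_locals[p] = local_solve(p, Thetas[p], V_global)
    # Step 2: the global variable is the average of the local copies
    V_global = sum(V_locals) / P
    # Step 3: dual updates Theta^p <- Theta^p + rho * (V^p - V)
    for p in range(P):
        Thetas[p] = Thetas[p] + rho * (V_locals[p] - V_global)
    return V_locals, Thetas, V_global
```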
Stochastic Learning
Distributed Stochastic ADMM for Matrix Factorization
Stochastic Learning
Batch learning is still not very efficient:

U_{t+1} = U_t − τ_t ∇_U^T f^p(U_t, V^p_t),
V^p_{t+1} = [τ_t / (1 + ρτ_t)] [ V^p_t / τ_t + ρ V_t − Θ^p_t − ∇_{V^p}^T f^p(U_t, V^p_t) ]

Stochastic Learning

(U_{*i})_{t+1} = (U_{*i})_t + τ_t ( ε_{ij} (V^p_{*j})_t − λ_1 (U_{*i})_t ),
(V^p_{*j})_{t+1} = [τ_t / (1 + ρτ_t)] [ ((1 − λ_2 τ_t) / τ_t) (V^p_{*j})_t + ε_{ij} (U_{*i})_t + ρ (V_{*j})_t − (Θ^p_{*j})_t ],

where ε_{ij} = R_{ij} − [(U_{*i})_t]^T (V^p_{*j})_t.
Stochastic Learning
Distributed Stochastic ADMM for Matrix Factorization
Scheduler Comparison
Baselines: CCD++ (H.-F. Yu et al., ICDM'12); DSGD (R. Gemulla et al., KDD'11)
Number of synchronizations in one iteration to fully scan all the ratings:
(Figure: scheduling comparison — (a) CCD++ synchronizes k times in total; (b) DSGD synchronizes P times in total; (c) DS-ADMM synchronizes once.)
Stochastic Learning
Distributed Stochastic ADMM for Matrix Factorization
Experiment
Experiment environment
MPI-Cluster with 20 nodes, each of which is a 24-core server with
2.2GHz CPU and 96GB of RAM;
One core and 10GB memory for each node are actually used.
Baseline and evaluation metric
Baseline: CCD++, DSGD, DSGD-Bias
Evaluation metric:

test RMSE = sqrt( (1/Q) Σ_{(i,j)} (R_{i,j} − U_{*i}^T V_{*j})^2 )
Stochastic Learning
Distributed Stochastic ADMM for Matrix Factorization
Data Sets
Table: Data sets and parameter settings
Data Set    Netflix       Yahoo! Music R1   Yahoo! Music R2
m           480,190       1,938,361         1,823,179
n           17,770        49,995            136,736
#Train      99,072,112    73,578,902        699,640,226
#Test       1,408,395     7,534,592         18,231,790
k           40            40                40
η0/τ0       0.1           0.1               0.1
λ1/λ2       0.05          0.05              0.05
ρ           0.05          0.05              0.1
α           0.002         0.002             0.006
β           0.7           0.7               0.7
P           8             10                20
Stochastic Learning
Distributed Stochastic ADMM for Matrix Factorization
Accuracy and Efficiency
(Figure: test RMSE vs. time (s) for CCD++, DSGD, DSGD-Bias, and DS-ADMM on (a) Netflix, (b) Yahoo-Music-R1, and (c) Yahoo-Music-R2; (d) log running time to reach test RMSE 0.922 vs. node number.)
Stochastic Learning
Distributed Stochastic ADMM for Matrix Factorization
Speedup
(Figure: speedup vs. node number (2–8) for CCD++, DSGD, and DS-ADMM, compared with linear speedup.)
Conclusion
Outline
1 Introduction
2 Learning to Hash
    Isotropic Hashing
    Supervised Hashing with Latent Factor Models
    Supervised Multimodal Hashing with SCM
    Multiple-Bit Quantization
3 Distributed Learning
    Coupled Group Lasso for Web-Scale CTR Prediction
    Distributed Power-Law Graph Computing
4 Stochastic Learning
    Distributed Stochastic ADMM for Matrix Factorization
5 Conclusion
Conclusion
Our Contribution
Learning to hash: memory/disk/CPU/communication
Distributed learning: memory/disk/CPU,
but increases communication cost
Stochastic learning: memory/disk/CPU
Conclusion
Future Work
Learning to hash for decreasing communication cost
Distributed programming models and platforms for machine learning:
MPI (fault tolerance, asynchronous execution), GraphLab, Spark, Parameter Server, MapReduce, Storm, GPU, etc.
Distributed stochastic learning:
communication modeling
Conclusion
Future Work
Big data machine learning (big learning) framework
Conclusion
Related Publications (1/2) [* indicates my students]
Ling Yan*, Wu-Jun Li, Gui-Rong Xue, Dingyi Han. Coupled Group Lasso for
Web-Scale CTR Prediction in Display Advertising. Proceedings of the 31st
International Conference on Machine Learning (ICML), 2014.
Peichao Zhang*, Wei Zhang*, Wu-Jun Li, Minyi Guo. Supervised Hashing
with Latent Factor Models. Proceedings of the 37th International ACM
SIGIR Conference on Research and Development in Information Retrieval
(SIGIR), 2014.
Cong Xie*, Ling Yan*, Wu-Jun Li, Zhihua Zhang. Distributed Power-law
Graph Computing: Theoretical and Empirical Analysis. To Appear in
Proceedings of the 28th Annual Conference on Neural Information
Processing Systems (NIPS), 2014.
Zhi-Qin Yu*, Xing-Jian Shi*, Ling Yan*, Wu-Jun Li. Distributed Stochastic
ADMM for Matrix Factorization. To Appear in Proceedings of the 23rd
ACM International Conference on Information and Knowledge Management
(CIKM), 2014.
Conclusion
Related Publications (2/2) [* indicates my students]
Dongqing Zhang*, Wu-Jun Li. Large-Scale Supervised Multimodal Hashing with Semantic Correlation Maximization. Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence (AAAI), 2014.
Weihao Kong*, Wu-Jun Li. Isotropic Hashing. Proceedings of the 26th
Annual Conference on Neural Information Processing Systems (NIPS), 2012.
Weihao Kong*, Wu-Jun Li, Minyi Guo. Manhattan Hashing for Large-Scale
Image Retrieval. Proceedings of the 35th International ACM SIGIR
Conference on Research and Development in Information Retrieval (SIGIR),
2012.
Weihao Kong*, Wu-Jun Li. Double-Bit Quantization for Hashing.
Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence
(AAAI), 2012.
Conclusion
Q&A
Thanks!