On how to solve large-scale log-determinant optimization problems

Chengjing Wang∗

January 25, 2014
Abstract
We propose a proximal augmented Lagrangian method and a hybrid method, i.e., a method that employs the proximal augmented Lagrangian method to generate a good initial point and then the Newton-CG augmented Lagrangian method to obtain a highly accurate solution, to solve large-scale nonlinear semidefinite programming problems whose objective functions are a sum of a convex quadratic function and a log-determinant term. We demonstrate that the algorithms can supply a high-quality solution efficiently, even for some ill-conditioned problems.
Keywords: Quadratic programming, Log-determinant optimization problem, Proximal augmented Lagrangian method, Augmented Lagrangian method, Newton-CG method
1 Introduction
In this paper, by defining log 0 := −∞, we consider the following standard primal and dual nonlinear semidefinite programming problems, whose objective functions are a sum of a convex quadratic function and a log-determinant term (QP-Logdet):

(P)   min_X { (1/2)⟨X, Q(X)⟩ + ⟨C, X⟩ − µ log det X : A(X) = b, X ≽ 0 },

(D)   max_{X,y,Z} { −(1/2)⟨X, Q(X)⟩ + b^T y + µ log det Z + nµ(1 − log µ) : −Q(X) + A∗y + Z = C, Z ≽ 0 },
∗ School of Mathematics and Institute of Applied Mathematics, Southwest Jiaotong University, No.999, Xian Road, Chengdu 611756, China (ryan101879@gmail.com). The author's research was supported by the National Natural Science Foundation of China under grant 11201382, the Youth Fund of Humanities and Social Sciences of the Ministry of Education under grant 12YJC910008, the project of the science and technology department of Sichuan province under grant 2012ZR0154, and the Fundamental Research Funds for the Central Universities under grants SWJTU12CX055 and SWJTU12ZT15.
where Q : S n → S n is a given self-adjoint positive semidefinite linear operator, C ∈ S n ,
b ∈ Rm , µ ≥ 0 is a given parameter, A : S n → Rm is a linear map and A∗ : Rm → S n is
the adjoint of A. Note that the linear maps A and A∗ can be expressed, respectively, as
A(X) = [⟨A_1, X⟩, . . . , ⟨A_m, X⟩]^T,   A∗(y) = Σ_{k=1}^{m} y_k A_k,   (1)
where A_k, k = 1, · · · , m, are given matrices in S^n. For the explanation of all other main notation, see Subsection 1.1.
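To make the maps A and A∗ concrete, the following is a minimal NumPy sketch (not the author's Matlab implementation) in which the constraint matrices A_k are stored in a list; all names here are illustrative.

```python
import numpy as np

def A_op(A_list, X):
    """A(X) = [<A_1, X>, ..., <A_m, X>]^T with <A, X> = trace(A X) for symmetric A."""
    return np.array([np.sum(Ak * X) for Ak in A_list])

def A_adj(A_list, y):
    """A*(y) = sum_k y_k A_k, the adjoint of A in the trace inner product."""
    return sum(yk * Ak for yk, Ak in zip(y, A_list))

# quick adjointness check: <A(X), y> should equal <X, A*(y)>
rng = np.random.default_rng(0)
A_list = [np.diag(rng.standard_normal(4)) for _ in range(3)]
X = rng.standard_normal((4, 4)); X = (X + X.T) / 2
y = rng.standard_normal(3)
assert abs(A_op(A_list, X) @ y - np.sum(X * A_adj(A_list, y))) < 1e-10
```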
The perturbed QP-Logdet problem has the form

(P_ε)   min_X { (1/2)⟨X, Q(X)⟩ + ⟨C, X⟩ − µ log det X : A(X) = b, X ≽ εI },

(D_ε)   max_{X,y,Z} { −(1/2)⟨X, Q(X)⟩ + b^T y + ⟨Z, εI⟩ + µ log det(Q(X) + C − A∗y − Z) + nµ(1 − log µ) : −Q(X) + A∗y + Z + µX^{-1} = C, Z ≽ 0 }.
The QP-Logdet problem (P) is itself a classical convex model problem in optimization theory. It can be regarded as an extension of the quadratic semidefinite programming problem (QSDP) and of the log-determinant (Logdet) problem, so it shares the structures of both problems and is clearly worth studying in its own right. The QSDP is a central problem among nonlinear semidefinite programming problems; it has been considered by Toh [35], Toh, Tütüncü and Todd [36, 37], Zhao [45], Jiang, Sun and Toh [14], among others. The Logdet problem has a very important application in covariance selection [5] and has been intensively studied over the past several years, including the work of Dahl, Vandenberghe and Roychowdhury [4], d'Aspremont, Banerjee and El Ghaoui [6], Li and Toh [15], Lu [16, 17], Lu and Zhang [18], Olsen, Oztoprak, Nocedal and Rennie [24], Scheinberg, Ma and Goldfarb [30], Scheinberg and Rish [31], Toh [34], Wang, Sun and Toh [40], Yang, Sun and Toh [41], Yang, Shen, Wonka, Lu and Ye [43], and Yuan [44]. As far as the QP-Logdet problem is concerned, it also arises in many practical applications such as robust simulation of global warming policies [13] and speech recognition [39]. Thus the algorithms developed to solve this kind of problem can potentially find wide applications.
For QP-Logdet problems of small and medium size, the interior-point method (IPM) with a direct solver is certainly an efficient and robust approach; for large-scale problems, however, the IPM becomes impractical because it must compute, store, and factorize Schur complement matrices that are typically dense. In view of this, we need to design new approaches.
The proximal augmented Lagrangian (PAL) method proposed in [7] is a fast primal-dual approach. The key idea of this approach is to apply the proximal technique to the augmented Lagrangian function of the primal problem at every iteration so that the simplified inner problem has an analytical, or at least a semi-analytical, solution which can be computed very fast, and then to update the dual variables. The convergence of the PAL method has been shown in [7]. The biggest advantage of the PAL method is that no linear system of equations needs to be solved to obtain step directions, which makes it a good approach for generating a good initial point. Furthermore, for some ill-conditioned inner problems whose Hessians are of nearly low rank, the proximal technique of the PAL method actually works better than the Newton-CG method, since it can be regarded as a regularization in some sense. Given these advantages, we apply the PAL method to the QP-Logdet problem, especially for generating an initial point.
Although the PAL method is a good choice for the QP-Logdet problem, it is, after all, a gradient-type method; once the iterate is near the optimal solution, we may switch to an approach with a faster local convergence rate. The augmented Lagrangian (AL) method [29] is well suited to play this role. It is a classical method that can be viewed as a special form of the proximal point algorithm (PPA). (The PPA was proposed by Martinet [19] and further studied by Rockafellar [28, 29].) Although the AL method for convex programs is a gradient ascent method applied to the corresponding dual problem [27], Sun, Sun and Zhang [33] revealed that, under the constraint nondegeneracy condition [1], it can locally be regarded as an approximate generalized Newton method applied to a semismooth equation. It is precisely this observation that inspired us to apply the AL method to the QP-Logdet problem. Great successes in applying the AL method to large-scale semidefinite programming problems can also be seen in [46, 40, 41] and the references therein.
If we apply the AL method to problem (P), we have to solve inner problems which have no closed-form solutions. The well-tested, quadratically convergent semismooth Newton method introduced by Qi and Sun [25] is an ideal approach for this task. For computing the semismooth Newton direction, the preconditioned conjugate gradient (PCG) method is a good choice because the Newton system of equations is large and a direct solver is impractical.
Based mainly on Rockafellar's theoretical framework, the global convergence and the local convergence rate of the sequence generated by the Newton-CG AL (NAL) method can be established.
The numerical results demonstrate that the proposed approaches are efficient and robust enough to supply highly accurate solutions to large-scale problems, for synthetic as well as real data problems. In this paper, we solve synthetic QP-Logdet problems with n up to 2,000 and m up to 1,186,173 in about 2 hours.
The remaining parts of this paper are organized as follows. In Section 2, we give a brief introduction to some preliminary material. In Section 3, we describe the algorithms in detail. In Section 4, we establish the convergence theory of the proposed approaches. In Section 5, we discuss some numerical issues concerning the semismooth Newton-CG method. In Section 6, we present the numerical results. Finally, we give some concluding remarks.
1.1 Additional Notations

In this paper, all vectors are assumed to be finite dimensional. The symbol R^n denotes the n-dimensional Euclidean space, and O^n denotes the set of n × n orthogonal matrices. The set of all m × n matrices with real entries is denoted by R^{m×n}, ⟨·, ·⟩ stands for the standard trace inner product in S^n, ∥·∥ denotes the induced Frobenius norm, and S^n_+ and S^n_{++} denote the sets of positive semidefinite matrices and positive definite matrices, respectively. The symbol ◦ denotes the Hadamard product, and ⊗ denotes the Kronecker product. The symbol F_{P_ε} := {X ∈ S^n : A(X) = b, X ≽ εI} denotes the feasible set of the problem (P_ε), and Π_K(·) denotes the metric projector onto the closed convex set K. Let vec : R^{m×n} → R^{mn} be the vectorization operator on matrices, defined by stacking the columns of a matrix into a long vector one by one.
2 Preliminaries
In order to be able to present our ideas more clearly, we first introduce some concepts
related to the AL method based on the classical papers by Rockafellar [28, 29].
2.1 Maximal monotone operators
Let H be a real Hilbert space with an inner product ⟨·, ·⟩. A multifunction T : H ⇒ H is
said to be a monotone operator if
⟨z − z′, w − w′⟩ ≥ 0 whenever w ∈ T(z), w′ ∈ T(z′).   (2)
It is said to be maximal monotone if, in addition, the graph
G(T ) = {(z, w) ∈ H × H| w ∈ T (z)}
is not properly contained in the graph of any other monotone operator T ′ : H ⇒ H.
For example, if T is the subdifferential ∂f of a lower semicontinuous convex function
f : H → (−∞, +∞], f ̸≡ +∞, then T is maximal monotone (see Minty [21] or Moreau
[22]), and the relation 0 ∈ T (z) means that f (z) = min f .
2.2 Representations in terms of maximal monotone operators
The following Karush-Kuhn-Tucker (KKT) conditions are necessary and sufficient for the
optimality of (P ) and (D):
A(X) − b = 0,
Q(X) + C − A∗ y − Z = 0,
XZ = µI, X ≽ 0, Z ≽ 0.
The following KKT conditions are also necessary and sufficient for the optimality of (P ε )
and (Dε ):
A(X) − b = 0,
Q(X) + C − A∗ y − Z − µX −1 = 0,
⟨X − εI, Z⟩ = 0, X − εI ≽ 0, Z ≽ 0.
Throughout this paper, the following conditions for (P ε ) and (Dε ) are assumed to
hold.
Assumption 2.1. Problem (P_ε) satisfies the condition: there exists X_0 ∈ S^n_{++} such that A(X_0) = b and X_0 ≻ εI.
Assumption 2.2. For problem (P_ε), there exists an α ∈ R such that the level set {X | f_0(X) ≤ α, A(X) = b, X ≽ εI} is nonempty and bounded, where f_0(X) stands for the objective function of (P_ε).
Let l(X; y, Z) : S^n × R^m × S^n → R be the ordinary Lagrangian function for (P_ε) in the extended form:

l(X; y, Z) = { L_0(X; y, Z)  if X ∈ F_{P_ε} and (y, Z) ∈ R^m × S^n_+;  −∞  if X ∈ F_{P_ε} and (y, Z) ∉ R^m × S^n_+;  +∞  if X ∉ F_{P_ε} },

where

L_0(X; y, Z) = (1/2)⟨X, Q(X)⟩ + ⟨C, X⟩ − µ log det X − ⟨y, A(X) − b⟩ − ⟨Z, X − εI⟩.
The essential objective function in (P_ε) is given by

f(X) = sup_{y∈R^m, Z∈S^n} l(X; y, Z) = { (1/2)⟨X, Q(X)⟩ + ⟨C, X⟩ − µ log det X  if X ∈ F_{P_ε};  +∞  otherwise },

while the essential objective function in (D_ε) is given by

g(y, Z) = inf_{X∈S^n} l(X; y, Z) = { g_0(y, Z)  if (y, Z) ∈ R^m × S^n_+;  −∞  if (y, Z) ∉ R^m × S^n_+ }.
As in [29], we can define the following operators:

T_l(X, y, Z) = {(V, u, W) ∈ S^n × R^m × S^n | (V, −u, −W) ∈ ∂l(X; y, Z)},  (X, y, Z) ∈ S^n × R^m × S^n,
T_f(X) = {V ∈ S^n | V ∈ ∂f(X)},  X ∈ S^n,
T_g(y, Z) = {(u, W) ∈ R^m × S^n | (−u, −W) ∈ ∂g(y, Z)},  (y, Z) ∈ R^m × S^n.
We can observe that l is a closed proper saddle function in the sense of [26, p.363], and the
mapping Tl is a maximal monotone operator in S n ×Rm ×S n [26, Cor.37.5.2]. Meanwhile,
f is a closed proper convex function, so Tf is a maximal monotone operator in S n [21, 22].
Similarly, g is a closed proper concave function, so Tg is a maximal monotone operator in
Rm × S n .
To discuss the rate of convergence, we introduce some related concepts and conclusions.
Definition 2.1. (cf. [28]) For a maximal monotone operator T, we say that its inverse T^{-1} is Lipschitz continuous at the origin (with modulus a ≥ 0) if there is a unique solution z̄ to 0 ∈ T(z), and for some τ > 0 we have

∥z − z̄∥ ≤ a∥w∥  whenever z ∈ T^{-1}(w) and ∥w∥ ≤ τ.
Assumption 2.3. Let X be the optimal solution to problem (P_ε) and let (y, Z) be the corresponding Lagrangian multiplier. We say that the primal constraint nondegeneracy condition [1] holds at X for problem (P_ε) if

( A )          ( {0}                       )     ( R^m )
( I ) S^n  +   ( lin(T_{S^n_+}(X − εI))    )  =  ( S^n ),

where lin(T_{S^n_+}(X − εI)) denotes the lineality space of T_{S^n_+}(X − εI), and T_{S^n_+}(X − εI) denotes the tangent cone of S^n_+ at X − εI.
Assumption 2.4. Let X be the optimal solution to problem (P_ε) and let (y, Z) be the corresponding Lagrangian multiplier. We say that the strong second order sufficient condition [3] holds at X if

⟨H, ∇²_{XX} L_0(X; y, Z)(H)⟩ + Υ_{(X−εI)}(Z, H) > 0   for all H ∈ aff(C(X)) \ {0},

where the linear-quadratic function Υ_B : S^n × S^n → R is defined by

Υ_B(Γ, A) := 2⟨Γ, A B^+ A⟩,  (Γ, A) ∈ S^n × S^n,

and B^+ is the Moore-Penrose pseudo-inverse of B.
Next we present a direct condition for the Lipschitz continuity of T_g^{-1}, due to Rockafellar [29, Prop. 2].

Proposition 2.1. If the strong second order sufficient condition and the primal constraint nondegeneracy condition are satisfied, then T_l^{-1} is single-valued and continuously differentiable on a neighborhood of the origin, and so are T_f^{-1} and T_g^{-1}. In particular, these mappings are all Lipschitz continuous at the origin.

Remark 2.1. We emphasize that the so-called strong second-order condition in [29] actually includes the primal constraint nondegeneracy condition.
3 The algorithms

3.1 The PAL method
In this subsection, we describe the PAL method in detail. The augmented Lagrangian function of (P) is

L_σ(X; y) = (1/2)⟨X, Q(X)⟩ + ⟨C, X⟩ − µ log det X − ⟨y, A(X) − b⟩ + (σ/2)∥A(X) − b∥²,  X ≽ 0.

At the kth iteration, we solve the following subproblem

argmin { L_{σ_k}(X; y^k) + (1/2)∥X − X^k∥²_{T_k} | X ≽ 0 }
   = argmin { (1/2)⟨X, (Q + σ_k A∗A + T_k)(X)⟩ − ⟨M^k, X⟩ − µ log det X | X ≽ 0 },   (3)

where M^k = A∗y^k + σ_k A∗b + T_k(X^k) − C. Pick T_k ≽ 0 such that Q + σ_k A∗A + T_k = α_k I, with α_k = λ_max(Q + σ_k A∗A); then M^k = −Q(X^k) − C + α_k X^k + A∗(y^k + σ_k(b − A(X^k))), and (3) simplifies to

argmin { (α_k/2)∥X − (1/α_k)M^k∥² − µ log det X | X ≽ 0 }.   (4)
Based on Lemma 2.1 in [40], problem (4) has a closed-form solution

X^{k+1} = ϕ^+_{γ_k}((1/α_k)M^k) := P^k diag(ϕ^+_{γ_k}(d^k))(P^k)^T,   (5)

and further

Z^{k+1} = ϕ^-_{γ_k}((1/α_k)M^k) := P^k diag(ϕ^-_{γ_k}(d^k))(P^k)^T,   (6)

where (1/α_k)M^k = P^k D^k (P^k)^T is the eigenvalue decomposition of (1/α_k)M^k with P^k ∈ O^n and D^k = diag(d^k), d^k ∈ R^n; ϕ^+_{γ_k}((1/α_k)M^k) and ϕ^-_{γ_k}((1/α_k)M^k) are matrix-valued functions, and ϕ^+_{γ_k}(d^k) and ϕ^-_{γ_k}(d^k) are vector-valued functions, whose scalar counterparts are

ϕ^+_{γ_k}(x) = (x + √(x² + 4γ_k))/2   and   ϕ^-_{γ_k}(x) = (−x + √(x² + 4γ_k))/2,

correspondingly, and γ_k = µ/α_k. From (5) and (6), we can see that X^{k+1} and Z^{k+1} are automatically positive definite. Once X^{k+1} is obtained, one may update the dual variable to get y^{k+1}.
As for λmax (Q + σk A∗ A), we may apply the power method to compute it.
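As a small illustration of this step, here is a minimal NumPy power-iteration sketch (not the author's Matlab code) for λ_max(Q + σ_k A∗A), assuming the operators are available as callables Q_op, A_op, A_adj in the style of the earlier snippet.

```python
import numpy as np

def lambda_max_power(Q_op, A_op, A_adj, sigma, n, iters=100, seed=0):
    """Estimate the largest eigenvalue of the PSD operator Q + sigma * A* A
    acting on symmetric n x n matrices, by the power method."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, n))
    X = (X + X.T) / 2.0                       # random symmetric starting point
    X /= np.linalg.norm(X)
    lam = 0.0
    for _ in range(iters):
        Y = Q_op(X) + sigma * A_adj(A_op(X))  # apply the operator once
        lam = np.sum(X * Y)                   # Rayleigh quotient <X, Y>
        nrm = np.linalg.norm(Y)
        if nrm == 0.0:
            break
        X = Y / nrm
    return lam
```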
Now we may summarize the PAL method as below.
Algorithm 1: The PAL method. Input X^0 ∈ S^n_{++}, y^0 ∈ R^m, σ_0 = 10. Iterate the following steps:

Step 1. Compute α_k = λ_max(Q + σ_k A∗A).

Step 2. Set T_k = α_k I − Q − σ_k A∗A and M^k = A∗y^k + σ_k A∗b + T_k(X^k) − C. Compute

X^{k+1} = ϕ^+_{γ_k}((1/α_k)M^k)   and   Z^{k+1} = ϕ^-_{γ_k}((1/α_k)M^k).

Step 3. Set

y^{k+1} = y^k − τσ_k(A(X^{k+1}) − b),

where τ ∈ [1, 2) is a given parameter.

Step 4. Compute

R_P^{k+1} = ∥b − A(X^{k+1})∥ / max{1, ∥b∥},   R_D^{k+1} = ∥Q(X^{k+1}) + C − A∗y^{k+1} − Z^{k+1}∥ / max{1, ∥C∥}.

If max{R_P^{k+1}, R_D^{k+1}} < Tol1, stop; else, choose σ_{k+1}; end.
3.2 The NAL method
In this subsection, we apply the NAL method to the perturbed problem (P_ε), since this method cannot guarantee the positive definiteness of the variable X automatically.
Given a penalty parameter σ > 0, the augmented Lagrangian function for problem (P_ε) is defined as

L_σ(X; y, Z) = (1/2)⟨X, Q(X)⟩ + ⟨C, X⟩ − µ log det X − ⟨y, A(X) − b⟩ + (σ/2)∥A(X) − b∥² + (1/(2σ))∥Π_{S^n_+}(Z − σ(X − εI))∥² − (1/(2σ))∥Z∥².
At the kth iteration, we solve the following subproblem

argmin { L_{σ_k}(X; y^k, Z^k) | X ≽ εI }.   (7)
Since the subproblem (7) has no closed-form solution, it needs some elaboration. We can calculate the first-order derivative of Θ_k(X) := L_{σ_k}(X; y^k, Z^k) with respect to X:

∇Θ_k(X) = Q(X) + C − A∗y^k − µX^{-1} + σ_k A∗(A(X) − b) − Π_{S^n_+}(Z^k − σ_k(X − εI)).

Furthermore, based on [26, Thm. 23.8] and [32, Lem. 2.1], we can calculate the Clarke generalized Jacobian ∂²Θ_k(X) := ∂∇Θ_k(X) as

∂²Θ_k(X)[H] = (Q + σ_k A∗A)[H] + µX^{-1}HX^{-1} + σ_k ∂Π_{S^n_+}(Z^k − σ_k(X − εI))[H],   ∀ H ∈ S^n.
Actually, from [20, Prop. 1], every element in ∂Π_{S^n_+}(·) is positive semidefinite, so every element in ∂²Θ_k(X) is positive definite. Therefore we can apply the semismooth Newton method developed in [25] with a line search technique, which guarantees global convergence, and we may obtain the Newton direction by solving the linear system of equations

V̂^0_{σ_k}[H] = −∇Θ_k(X^k),

where

V̂^0_{σ_k} := Q + σ_k A∗A + µ(X^k)^{-1} ⊗ (X^k)^{-1} + V^0_{σ_k},

and V^0_{σ_k} ∈ ∂Π_{S^n_+}(Z^k − σ_k(X^k − εI)) may be chosen in the same way as in [46]:

V^0_{σ_k}[H] = σ_k P^k(Ω^k ◦ ((P^k)^T H P^k))(P^k)^T,
where P^k ∈ O^n and

Z^k − σ_k(X^k − εI) = P^k Λ^k (P^k)^T,

with Λ^k the diagonal matrix whose diagonal entries are the eigenvalues λ^k_1 ≥ λ^k_2 ≥ · · · ≥ λ^k_n of Z^k − σ_k(X^k − εI). Defining the three index sets

α_k := {i | λ^k_i > 0},   β_k := {i | λ^k_i = 0},   and   γ_k := {i | λ^k_i < 0},

we then have

Ω^k = [ E_{γ̄_k γ̄_k}      ν_{γ̄_k γ_k} ]
      [ (ν_{γ̄_k γ_k})^T        0      ],    ν_{ij} := λ^k_i / (λ^k_i − λ^k_j),  i ∈ γ̄_k, j ∈ γ_k,

where γ̄_k = {1, · · · , n} \ γ_k and E_{γ̄_k γ̄_k} ∈ S^{|γ̄_k|} is the matrix of ones.
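To illustrate this choice of V^0_{σ_k}, the following NumPy sketch (illustrative only, not the paper's Matlab code) builds Ω from the eigenvalues of Z^k − σ_k(X^k − εI) and applies the map H ↦ σ P(Ω ◦ (P^T H P))P^T.

```python
import numpy as np

def omega_matrix(lam):
    """Build Omega from eigenvalues lam: ones on the (>=0, >=0) block,
    nu_ij = lam_i / (lam_i - lam_j) for lam_i >= 0 > lam_j, zero elsewhere."""
    n = lam.size
    Omega = np.zeros((n, n))
    nonneg = lam >= 0            # indices in gamma_bar
    neg = ~nonneg                # indices in gamma
    Omega[np.ix_(nonneg, nonneg)] = 1.0
    li = lam[nonneg][:, None]
    lj = lam[neg][None, :]
    nu = li / (li - lj)
    Omega[np.ix_(nonneg, neg)] = nu
    Omega[np.ix_(neg, nonneg)] = nu.T
    return Omega

def V_apply(H, P, Omega, sigma):
    """Apply V[H] = sigma * P (Omega o (P^T H P)) P^T, with o the Hadamard product."""
    return sigma * P @ (Omega * (P.T @ H @ P)) @ P.T
```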
Now we summarize the algorithm as below
Algorithm 2: The NAL method. Input X^0 ∈ F_{P_ε}, y^0 ∈ R^m, Z^0 ∈ S^n_+, σ_0 = 10. Iterate the following steps:

Step 1. Find an approximate minimizer

X^{k+1} ≈ argmin_{X ∈ S^n_{++}} Θ_k(X).   (8)

Step 2. Compute

y^{k+1} = y^k − σ_k(A(X^{k+1}) − b),   Z^{k+1} = Π_{S^n_+}(Z^k − σ_k(X^{k+1} − εI)).

Step 3. Compute

R_P^{k+1} = ∥b − A(X^{k+1})∥ / max{1, ∥b∥},   R_D^{k+1} = ∥Q(X^{k+1}) + C − A∗y^{k+1} − Z^{k+1} − µ(X^{k+1})^{-1}∥ / max{1, ∥C∥}.

If max{R_P^{k+1}, R_D^{k+1}} < Tol2, stop; else, σ_{k+1} = 2σ_k; end.
In Algorithm 2, the principal computing cost lies in Step 1, i.e., computing the approximate minimizer of the inner problem (8). We therefore describe the algorithm for computing this approximate minimizer in detail below.
Algorithm 3: The semismooth Newton-CG method.

Step 1. Given ζ ∈ (0, 1/2), τ_1, τ_2 ∈ (0, 1), and δ ∈ (0, 1), choose X^0 (≽ εI).

Step 2. For j = 0, 1, 2, . . . :

Step 2.1. Apply the PCG method to find an approximate solution H^j to

(V̂^0_{σ_k} + ϵ_j I)[H] = −∇Θ_k(X^j),   (9)

where ϵ_j := τ_1 min{τ_2, ∥∇Θ_k(X^j)∥}.

Step 2.2. Set α_j = δ^{m_j}, where m_j is the first nonnegative integer m for which

Θ_k(X^j + δ^m H^j) ≤ Θ_k(X^j) + ζδ^m⟨∇Θ_k(X^j), H^j⟩.

Step 2.3. Set X^{j+1} = X^j + α_j H^j.
From Algorithm 3, we can see that the main computing cost lies in solving the linear operator equation (9). The linear operator corresponds to a fourth-order n-dimensional tensor, or equivalently an n² × n² matrix. For large-scale problems n is very large, so direct solvers are not suitable and PCG-type solvers are the ideal approach for (9). Among the various PCG-type solvers, we adopt the symmetric QMR algorithm [9]. We note that, compared with the algorithm in [9], we replace the matrix by the linear operator in (9) and the vectors by the corresponding matrices. As these changes are only a trivial generalization of the symmetric QMR algorithm, we do not go into details.
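The paper's solver is a symmetric QMR method; as a simpler stand-in that conveys the same matrix-free idea, here is a minimal conjugate-gradient sketch that works directly with matrices and a callable for the regularized operator in (9). All names are illustrative, and plain CG replaces QMR here.

```python
import numpy as np

def cg_operator(apply_V, RHS, tol=1e-8, max_iter=200):
    """Matrix-free CG for V[H] = RHS, where apply_V maps a symmetric matrix H
    to V[H] and RHS is a symmetric matrix (here -grad Theta_k)."""
    H = np.zeros_like(RHS)
    R = RHS.copy()                       # residual RHS - V[H]
    D = R.copy()                         # search direction
    rho = np.sum(R * R)
    for _ in range(max_iter):
        if np.sqrt(rho) <= tol * max(1.0, np.linalg.norm(RHS)):
            break
        VD = apply_V(D)
        step = rho / np.sum(D * VD)
        H += step * D
        R -= step * VD
        rho_new = np.sum(R * R)
        D = R + (rho_new / rho) * D
        rho = rho_new
    return H
```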
4 Convergence analyses
The convergence analyses of these two methods can be derived from Fazel, Pong, Sun and
Tseng’s paper [7] and Rockafellar’s paper [28, 29] without many difficulties, respectively.
For the sake of completeness, we also present these results below.
4.1 Convergence analysis for the PAL method
Theorem 4.1. Assume that the solution set of (P) is nonempty and that Assumption 2.1 holds. Assume that Q + σ_k A∗A + T_k is positive definite. Let {X^k, y^k, Z^k} be generated by Algorithm 1. If τ ∈ (0, (1 + √5)/2), then the sequence {X^k} converges to the optimal solution of (P) and {y^k, Z^k} converges to the optimal solution of the dual problem (D).

Proof. Since Q + σ_k A∗A + T_k = λ_max(Q + σ_k A∗A)I ≻ 0, based on Theorem B.1 in [7] and the KKT conditions, the conclusion of the theorem is obvious.
4.2 Convergence analysis for the NAL method
Since we cannot solve the inner problems exactly, we will use the following stopping criteria, considered by Rockafellar [28, 29], for terminating Algorithms 2 and 3:

(A)   Θ_k(X^{k+1}) − inf Θ_k(X) ≤ ϵ_k²/(2σ_k),   ϵ_k ≥ 0,   Σ_{k=0}^∞ ϵ_k < ∞;

(B)   Θ_k(X^{k+1}) − inf Θ_k(X) ≤ (δ_k²/(2σ_k))∥(y^{k+1}, Z^{k+1}) − (y^k, Z^k)∥²,   δ_k ≥ 0,   Σ_{k=0}^∞ δ_k < ∞;

(B′)  ∥∇Θ_k(X^{k+1})∥ ≤ (δ_k′/σ_k)∥(y^{k+1}, Z^{k+1}) − (y^k, Z^k)∥,   0 ≤ δ_k′ → 0.
We directly obtain from [28, 29] the following convergence results.
Theorem 4.2. Let Algorithm 2 be executed with stopping criterion (A). If Assumption 2.1 is satisfied, then the generated sequence {(y^k, Z^k)} is bounded, {(y^k, Z^k)} converges to (y, Z), where (y, Z) is the unique optimal solution to (D_ε), and {X^k} is asymptotically minimizing for (P_ε) with max(D_ε) = inf(P_ε).
The boundedness of {(y^k, Z^k)} under (A) is actually equivalent to the existence of an optimal solution to (D_ε).
If {(y^k, Z^k)} is bounded and Assumption 2.2 is satisfied, then the sequence {X^k} is also bounded, and any accumulation point of the sequence {X^k} is the unique optimal solution to (P_ε).
Theorem 4.3. Let Algorithm 2 be executed with stopping criteria (A) and (B). If Assumption 2.1 holds and T_g^{-1} is Lipschitz continuous at the origin with modulus a_g, then {(y^k, Z^k)} converges to the unique optimal solution (y, Z) with max(D_ε) = inf(P_ε), and for all k sufficiently large,

∥(y^{k+1}, Z^{k+1}) − (y, Z)∥ ≤ θ_k∥(y^k, Z^k) − (y, Z)∥,

where

θ_k = [a_g(a_g² + σ_k²)^{-1/2} + δ_k](1 − δ_k)^{-1} → θ_∞ = a_g(a_g² + σ_∞²)^{-1/2} < 1   as σ_k → σ_∞.

Moreover, the conclusions of Theorem 4.2 about {(y^k, Z^k)} remain valid.
If, in addition to (B) and the condition on T_g^{-1}, one has (B′) and the stronger condition that T_l^{-1} is Lipschitz continuous at the origin with modulus a_l (≥ a_g), then X^k → X, where X is the unique optimal solution to (P_ε), and for all k sufficiently large,

∥X^{k+1} − X∥ ≤ θ_k′∥(y^{k+1}, Z^{k+1}) − (y^k, Z^k)∥,

where θ_k′ = a_l(1 + δ_k′)/σ_k → θ_∞′ = a_l/σ_∞.
Remark 4.1. In Algorithm 2 we can also add the term (1/(2σ_k))∥X − X^k∥² to Θ_k(X). In our Matlab code, one can optionally add this term. This corresponds to the proximal method of multipliers considered in [29, Section 5]. Convergence analysis for this variant can be conducted in a way parallel to that for Algorithm 2.

Note that in the stopping criteria (A) and (B), inf Θ_k(X) is an unknown value. Since Θ̂_k(X) := Θ_k(X) + (1/(2σ_k))∥X − X^k∥² is a strongly convex function with modulus 1/σ_k, one has the estimate

Θ̂_k(X^{k+1}) − inf Θ̂_k(X) ≤ (1/(2σ_k))∥∇Θ̂_k(X^{k+1})∥²,

thus criteria (A) and (B) can be practically modified as follows:

∥∇Θ̂_k(X^{k+1})∥ ≤ ϵ_k,   ϵ_k ≥ 0,   Σ_{k=0}^∞ ϵ_k < ∞;

∥∇Θ̂_k(X^{k+1})∥ ≤ δ_k∥(y^{k+1}, Z^{k+1}) − (y^k, Z^k)∥,   δ_k ≥ 0,   Σ_{k=0}^∞ δ_k < ∞.
Remark 4.2. The condition that T_g^{-1} is Lipschitz continuous at the origin is not easy to verify directly; however, if Assumptions 2.3 and 2.4 are satisfied, then T_l^{-1}, T_f^{-1} and T_g^{-1} are all single-valued and continuously differentiable on a neighborhood of the origin, and in particular they are all Lipschitz continuous at the origin.
5 Numerical issues in the associated semismooth Newton-CG method
When applying Algorithm 3 to solve the inner problem (7), the most expensive step lies in solving the linear system of equations

(M + εI)vec(H) = vec(−∇Θ_k(X)),   (10)

where

M := Q + σA^T A + µX^{-1} ⊗ X^{-1} + σ(P ⊗ P)diag(vec(Ω))(P ⊗ P)^T,   (11)

and Q and A denote the matrix representations of Q and A, to obtain the approximate Newton direction. In order to solve (10) as efficiently as possible, it is necessary to analyze the condition number of the coefficient matrix and to design an efficient preconditioner.
5.1 Conditioning of the coefficient matrix

Let S = X − εI.
For simplicity, we suppose that strict complementarity holds for Z and S, i.e., Z + S ≻ 0. From the fact that ZS = 0, we have

Z − σS = P [ Λ_Z     0    ] P^T,   (12)
             [  0   −σΛ_S ]

where Λ_Z = diag(λ_Z) ∈ R^{r×r} and Λ_S = diag(λ_S) ∈ R^{(n−r)×(n−r)} are the diagonal matrices of the positive eigenvalues of Z and S, respectively. Define the index sets γ̄ := {1, . . . , r} and γ := {r + 1, . . . , n}. Let

Ω = [ E_{γ̄γ̄}       ν_{γ̄γ} ]
    [ (ν_{γ̄γ})^T      0    ],    ν_{ij} := λ_{Z_i} / (λ_{Z_i} + σλ_{S_{j−r}}),  i ∈ γ̄, j ∈ γ,   (13)

and

c_1 = min(λ_Z) / (min(λ_Z)/σ + max(λ_S)),   c_2 = max(λ_Z) / (max(λ_Z)/σ + min(λ_S)) < σ.
Then we have
Proposition 5.1. Suppose that strict complementarity holds for Z and S. Then we have the following bound on the condition number of M:

κ(M) ≤ [λ_max(Q) + σλ_max(A^T A) + µ/λ²_min(X) + σλ_max(P̃_1P̃_1^T + P̃_2P̃_2^T + P̃_3P̃_3^T)] / [λ_min(Q) + σλ_min(A^T A) + µ/λ²_max(X) + c_1λ_min(P̃_1P̃_1^T + P̃_2P̃_2^T + P̃_3P̃_3^T)],   (14)

where P̃_1 = P_γ̄ ⊗ P_γ̄, P̃_2 = P_γ ⊗ P_γ̄, P̃_3 = P_γ̄ ⊗ P_γ, and P_γ̄ and P_γ are the submatrices of P = [P_γ̄  P_γ].
Proof. From (11), (12) and (13), we obtain

M = Q + σA^T A + µX^{-1} ⊗ X^{-1} + σ(P̃_1P̃_1^T + P̃_2D_2P̃_2^T + P̃_3D_3P̃_3^T),

where D_2 = diag(vec(ν_{γ̄γ})) and D_3 = diag(vec((ν_{γ̄γ})^T)). Since

c_1 ≤ σν_{ij} ≤ c_2,   i ∈ γ̄, j ∈ γ,

we can derive

c_1(P̃_1P̃_1^T + P̃_2P̃_2^T + P̃_3P̃_3^T) ≼ σ(P̃_1P̃_1^T + P̃_2D_2P̃_2^T + P̃_3D_3P̃_3^T) ≼ σ(P̃_1P̃_1^T + P̃_2P̃_2^T + P̃_3P̃_3^T).

Furthermore, from [11, Thm. 4.2.12],

(µ/λ²_max(X)) I ≼ µX^{-1} ⊗ X^{-1} ≼ (µ/λ²_min(X)) I.

Hence, from [10, Thm. 4.3.7],

[λ_min(Q) + σλ_min(A^T A) + µ/λ²_max(X) + c_1λ_min(P̃_1P̃_1^T + P̃_2P̃_2^T + P̃_3P̃_3^T)] I ≼ M ≼ [λ_max(Q) + σλ_max(A^T A) + µ/λ²_min(X) + σλ_max(P̃_1P̃_1^T + P̃_2P̃_2^T + P̃_3P̃_3^T)] I,

and the bound (14) on the condition number of M follows.
The upper bound in (14) suggests the following: (i) as σ and 1/c_1 increase, the condition number κ(M) may increase; (ii) when Q, A^T A and (P ⊗ P)diag(vec(Ω))(P ⊗ P)^T are of (nearly) low rank, the term µX^{-1} ⊗ X^{-1} can improve the condition number, because µ/λ²_max(X) I "lifts" the minimal eigenvalue of M; in other words, if Q, A^T A and (P ⊗ P)diag(vec(Ω))(P ⊗ P)^T are of (nearly) low rank, then the condition number κ(M) is likely to grow as µ decreases.
5.2 A diagonal preconditioner

To achieve faster convergence of the PCG method for solving (10), one may select a proper preconditioner. In our implementation, we devise an easy-to-compute diagonal preconditioner by using an idea first developed in [8]. The preconditioner has the following form:

M_D := diag(Q) + σ diag(A^T A) + µ diag(vec(Ψ)) + σ diag(d),

where Ψ ∈ S^n, d ∈ R^{n²}, and

Ψ_{ij} = (X^{-1})_{ii}(X^{-1})_{jj} + ((X^{-1})_{ij})²,   d_{(ij)} = ((P ◦ P)Ω(P ◦ P)^T)_{ij},   1 ≤ i, j ≤ n.

The biggest advantage of the preconditioner M_D is that only O(n³) flops are needed to compute it.
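A minimal NumPy sketch of assembling the diagonal of M_D is given below (illustrative only, not the author's Matlab code; the diagonals of the matrix representations of Q and A^T A are assumed to be precomputed and passed in as vectors of length n², and vec is taken column-wise).

```python
import numpy as np

def diag_preconditioner(Q_diag, AtA_diag, X, P, Omega, mu, sigma):
    """Return the diagonal of M_D = diag(Q) + sigma*diag(A^T A)
    + mu*diag(vec(Psi)) + sigma*diag(d) as a vector of length n^2."""
    Xinv = np.linalg.inv(X)
    # Psi_ij = (X^{-1})_ii (X^{-1})_jj + ((X^{-1})_ij)^2
    Psi = np.outer(np.diag(Xinv), np.diag(Xinv)) + Xinv ** 2
    # d_(ij) = ((P o P) Omega (P o P)^T)_ij, with o the Hadamard product
    PP = P * P
    d = PP @ Omega @ PP.T
    return (Q_diag + sigma * AtA_diag
            + mu * Psi.ravel(order="F") + sigma * d.ravel(order="F"))
```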
6 Numerical experiments
In this section, we present some numerical results to demonstrate the performances of
the approaches with both synthetic and real data. We implemented the approaches in
Matlab R2013a. All runs were performed on a PC (Intel Core 2 Duo 2.60 GHz with 12
GB RAM).
We measure the infeasibilities and optimality of the primal and dual problems by R_P, R_D and R_G, where R_P and R_D are described in Algorithms 1 and 2. In general, we use the PAL method to generate an initial point and then switch to the NAL method to accelerate the local convergence. We stop the PAL method when

max{R_P, R_D} < Tol1,

and stop the NAL method when

max{R_P, R_D} < Tol2,

where Tol1 and Tol2 are pre-specified accuracy tolerances with Tol1 = 5 × 10^{-3} and Tol2 = 10^{-6} as the defaults. In some hard cases, we use the PAL method only and terminate it with Tol1 = 10^{-6}. Furthermore, we also use the relative gap

R_G := |pobj − dobj| / (1 + |pobj| + |dobj|)

to measure the quality of the solution, where pobj and dobj denote the primal and dual objective function values, respectively. For the PAL method, we cap the number of iterations at 1000; for the Newton-CG method, we set the maximal number of outer iterations to 100 and the maximal number of inner iterations to 15. We choose the initial iterate X^0 = I and σ_0 = 10. When the NAL method is applied to the above problems, the inequality constraint X ≽ 0 is replaced by X ≽ εI with ε = 10^{-16}.
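For reference, the stopping measures R_P, R_D and R_G can be computed as in the following small sketch (illustrative NumPy code, not the author's Matlab implementation; pass µ to include the extra −µX^{-1} term used by the NAL method).

```python
import numpy as np

def residuals(X, y, Z, Q_op, A_op, A_adj, b, C, pobj, dobj, mu=None):
    """Relative primal/dual infeasibilities and relative gap."""
    RP = np.linalg.norm(b - A_op(X)) / max(1.0, np.linalg.norm(b))
    D = Q_op(X) + C - A_adj(y) - Z
    if mu is not None:
        D -= mu * np.linalg.inv(X)          # NAL variant of the dual residual
    RD = np.linalg.norm(D) / max(1.0, np.linalg.norm(C))
    RG = abs(pobj - dobj) / (1.0 + abs(pobj) + abs(dobj))
    return RP, RD, RG
```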
6.1 Synthetic experiments I

In this subsection, we focus our numerical experiments on the following special problems:

(A1)  min_X { (1/2)⟨X, H ◦ X⟩ + ⟨C, X⟩ − µ log det X | X_ii = 1, i = 1, 2, · · · , n, X ≽ 0 },

(A2)  min_X { (1/2)⟨X, H ◦ X⟩ + ⟨C, X⟩ − µ log det X | X_ij = X_ji = 0 ∀ (i, j) ∈ E, X_ii = 1, i = 1, 2, · · · , n, X ≽ 0 },

(A3)  min_X { (1/2)⟨X, H ◦ X⟩ + ⟨C, X⟩ − µ log det X | Tr(X) = 1, X ≽ 0 },

(A4)  min_X { (1/2)⟨X, H ◦ X⟩ + ⟨C, X⟩ − µ log det X | X_ij = X_ji = 0 ∀ (i, j) ∈ E, Tr(X) = 1, X ≽ 0 }.

The matrices H and C are generated randomly, and E is a random subset of {(i, j) | 1 ≤ i < j, j = 2, . . . , n}.
The performance of the algorithms is presented in the following tables. For each instance, we report the matrix dimension (n) and the number of equality constraints (m); the number of outer iterations (it) and the total number of inner iterations (itsub); the primal (pobj) and dual (dobj) objective values; the primal infeasibility (R_P), the dual infeasibility (R_D), and the relative gap (R_G); and the time taken (in the format hours:minutes:seconds).
Firstly, we present the performance of the pure PAL method and of the hybrid method (i.e., the PAL method followed by the NAL method) on the problems (A1)-(A4) with µ = 1 in Tables 1 and 2, respectively. For the hybrid method, we only report the details of the Newton-CG phase. From the two tables, we can see that although the PAL method can solve the problems rapidly, the hybrid method is still superior to it for all instances. Especially for instances with relatively few constraints, the hybrid method clearly outperforms the PAL method. The superiority of the hybrid method becomes even more obvious if a highly accurate solution is required.
Table 1: Performances of the PAL method on problems (A1)-(A4) with µ = 1.

Problem | n | m | it | pobj | dobj | RP | RD | RG | Time
rand-A1-0500 | 500 | 500 | 93 | 1.51611644e3 | 1.51612384e3 | 4.2e-7 | 9.4e-7 | 2.4e-6 | 18
rand-A1-1000 | 1000 | 1000 | 88 | 2.88889067e3 | 2.88891640e3 | 6.8e-7 | 9.9e-7 | 4.5e-6 | 1:20
rand-A1-1500 | 1500 | 1500 | 88 | 4.02954208e3 | 4.02959372e3 | 8.9e-7 | 9.9e-7 | 6.4e-6 | 3:41
rand-A1-2000 | 2000 | 2000 | 95 | 5.12194289e3 | 5.12194882e3 | 9.2e-7 | 2.5e-7 | 5.8e-7 | 8:13
rand-A2-0500 | 500 | 12322 | 106 | 1.48549608e3 | 1.48549478e3 | 5.1e-7 | 9.2e-7 | 4.4e-7 | 20
rand-A2-1000 | 1000 | 25086 | 102 | 2.92216880e3 | 2.92216820e3 | 2.7e-7 | 9.6e-7 | 1.0e-7 | 1:47
rand-A2-1500 | 1500 | 38389 | 99 | 4.07660888e3 | 4.07660986e3 | 6.6e-8 | 9.9e-7 | 1.2e-7 | 4:28
rand-A2-2000 | 2000 | 51322 | 95 | 5.14445107e3 | 5.14444160e3 | 3.3e-7 | 9.1e-7 | 9.2e-7 | 8:53
rand-A3-0500 | 500 | 1 | 570 | 3.10830391e3 | 3.10830441e3 | 9.8e-7 | 9.3e-7 | 7.9e-8 | 1:08
rand-A3-1000 | 1000 | 1 | 1000 | 6.90874980e3 | 6.90875463e3 | 4.8e-6 | 4.6e-6 | 3.5e-7 | 10:36
rand-A3-1500 | 1500 | 1 | 1000 | 1.09705221e4 | 1.09708284e4 | 2.0e-4 | 1.9e-4 | 1.4e-5 | 31:24
rand-A3-2000 | 2000 | 1 | 1000 | 1.52001337e4 | 1.52027947e4 | 1.3e-3 | 1.3e-3 | 8.8e-5 | 1:16:31
rand-A4-0500 | 500 | 74080 | 1000 | 3.10830458e3 | 3.10830554e3 | 1.6e-4 | 3.5e-4 | 1.5e-7 | 3:29
rand-A4-1000 | 1000 | 296243 | 1000 | 6.90848660e3 | 6.90875346e3 | 2.8e-4 | 4.7e-4 | 1.9e-5 | 17:55
rand-A4-1500 | 1500 | 666966 | 1000 | 1.09609128e4 | 1.09707760e4 | 6.6e-3 | 7.0e-3 | 4.5e-4 | 46:26
rand-A4-2000 | 2000 | 1186173 | 1000 | 1.51822498e4 | 1.52026740e4 | 1.0e-2 | 7.5e-3 | 6.7e-4 | 1:41:03
Table 2: Performances of the hybrid method on problems (A1)-(A4) with µ = 1.

Problem | n | m | it | itsub | pobj | dobj | RP | RD | RG | Time
rand-A1-0500 | 500 | 500 | 5 | 8 | 1.51610927e3 | 1.51593972e3 | 5.3e-7 | 1.7e-7 | 5.6e-5 | 08
rand-A1-1000 | 1000 | 1000 | 6 | 9 | 2.88889217e3 | 2.88900007e3 | 1.3e-7 | 3.9e-8 | 1.9e-5 | 41
rand-A1-1500 | 1500 | 1500 | 6 | 8 | 4.02952847e3 | 4.02940576e3 | 3.2e-7 | 8.2e-8 | 1.5e-5 | 1:47
rand-A1-2000 | 2000 | 2000 | 6 | 7 | 5.12195286e3 | 5.12224022e3 | 5.1e-7 | 1.4e-7 | 2.8e-5 | 3:44
rand-A2-0500 | 500 | 12322 | 10 | 11 | 1.48549372e3 | 1.48549527e3 | 7.7e-7 | 6.4e-7 | 5.2e-7 | 15
rand-A2-1000 | 1000 | 25086 | 11 | 12 | 2.92217094e3 | 2.92217815e3 | 3.8e-7 | 2.0e-7 | 1.2e-6 | 1:20
rand-A2-1500 | 1500 | 38389 | 10 | 11 | 4.07661066e3 | 4.07661215e3 | 3.8e-7 | 2.3e-7 | 1.8e-7 | 3:26
rand-A2-2000 | 2000 | 51322 | 10 | 11 | 5.14445263e3 | 5.14445533e3 | 4.5e-7 | 2.4e-7 | 2.6e-7 | 7:50
rand-A3-0500 | 500 | 1 | 13 | 15 | 3.10830441e3 | 3.10830442e3 | 3.6e-9 | 5.8e-7 | 2.4e-9 | 35
rand-A3-1000 | 1000 | 1 | 15 | 17 | 6.90875466e3 | 6.90875467e3 | 1.3e-9 | 4.1e-7 | 7.4e-10 | 5:25
rand-A3-1500 | 1500 | 1 | 20 | 22 | 1.09708296e4 | 1.09708296e4 | 2.8e-9 | 9.0e-7 | 9.6e-10 | 22:55
rand-A3-2000 | 2000 | 1 | 16 | 19 | 1.52028038e4 | 1.52028038e4 | 1.2e-9 | 7.9e-7 | 6.6e-10 | 1:07:14
rand-A4-0500 | 500 | 74080 | 32 | 55 | 3.10830709e3 | 3.10830709e3 | 1.4e-10 | 7.6e-7 | 1.5e-12 | 1:37
rand-A4-1000 | 1000 | 296243 | 36 | 63 | 6.90875652e3 | 6.90875652e3 | 5.5e-11 | 7.4e-7 | 4.4e-13 | 13:39
rand-A4-1500 | 1500 | 666966 | 52 | 92 | 1.09708312e4 | 1.09708312e4 | 8.7e-11 | 6.9e-7 | 3.6e-13 | 59:34
rand-A4-2000 | 2000 | 1186173 | 79 | 140 | 1.52028052e4 | 1.52028052e4 | 1.9e-10 | 9.1e-7 | 5.9e-13 | 2:09:36
Secondly, we present the performances of the PAL method and the hybrid method on problem (A1) with various µ. From Tables 3 and 4, we can see that as µ decreases the computing time of the PAL method increases only mildly, whereas the computing time of the hybrid method increases markedly. The rationale is given in the discussion following Proposition 5.1: in this problem, Q, A^T A and (P ⊗ P)diag(vec(Ω))(P ⊗ P)^T are of nearly low rank, and a small µ makes the inner problem ill-conditioned, so the NAL method needs many PCG iterations to solve the linear system of equations, whereas the PAL method directly yields a closed-form solution of a regularized problem.
Table 3: Performances of the PAL method on problem (A1) with various µ.

Problem | n | m | µ | it | pobj | dobj | RP | RD | RG | Time
rand1000-1 | 1000 | 1000 | 1.0000 | 88 | 2.88889067e3 | 2.88891640e3 | 6.8e-7 | 9.9e-7 | 4.5e-6 | 1:20
rand1000-2 | 1000 | 1000 | 0.7500 | 113 | 2.84053903e3 | 2.84053619e3 | 8.2e-7 | 1.9e-7 | 5.0e-7 | 1:42
rand1000-3 | 1000 | 1000 | 0.5000 | 112 | 2.77900646e3 | 2.77901061e3 | 8.8e-7 | 5.5e-7 | 7.5e-7 | 1:41
rand1000-4 | 1000 | 1000 | 0.2500 | 126 | 2.69409906e3 | 2.69412261e3 | 1.6e-7 | 9.9e-7 | 4.4e-6 | 1:57
rand1000-5 | 1000 | 1000 | 0.1250 | 171 | 2.63413549e3 | 2.63413210e3 | 8.6e-7 | 2.2e-7 | 6.4e-7 | 2:45
rand1000-6 | 1000 | 1000 | 0.0625 | 169 | 2.59461152e3 | 2.59461261e3 | 8.8e-7 | 4.7e-7 | 2.1e-7 | 2:41
rand1000-7 | 1000 | 1000 | 0.0313 | 179 | 2.56991656e3 | 2.56991919e3 | 7.7e-7 | 4.9e-7 | 5.1e-7 | 2:44
rand1000-8 | 1000 | 1000 | 0.0157 | 179 | 2.55504884e3 | 2.55505345e3 | 9.4e-7 | 6.4e-7 | 9.0e-7 | 2:47
rand1000-9 | 1000 | 1000 | 0.0078 | 180 | 2.54622819e3 | 2.54623730e3 | 7.7e-7 | 7.2e-7 | 1.8e-6 | 2:45
rand1000-10 | 1000 | 1000 | 0.0039 | 180 | 2.54123460e3 | 2.54124523e3 | 8.2e-7 | 8.0e-7 | 2.1e-6 | 2:45
Table 4: Performances of the hybrid method on problem (A1) with various µ.

Problem | n | m | µ | it | itsub | pobj | dobj | RP | RD | RG | Time
rand1000-1 | 1000 | 1000 | 1.0000 | 6 | 9 | 2.88889217e3 | 2.88900007e3 | 1.3e-7 | 3.9e-8 | 1.9e-5 | 43
rand1000-2 | 1000 | 1000 | 0.7500 | 6 | 9 | 2.84053768e3 | 2.84064543e3 | 1.4e-7 | 3.6e-8 | 1.9e-5 | 45
rand1000-3 | 1000 | 1000 | 0.5000 | 6 | 11 | 2.77900549e3 | 2.77914546e3 | 1.9e-7 | 5.6e-8 | 2.5e-5 | 59
rand1000-4 | 1000 | 1000 | 0.2500 | 12 | 24 | 2.69409749e3 | 2.69409705e3 | 9.5e-8 | 4.7e-7 | 8.2e-8 | 3:18
rand1000-5 | 1000 | 1000 | 0.1250 | 12 | 24 | 2.63412896e3 | 2.63412821e3 | 1.6e-7 | 8.3e-7 | 1.4e-7 | 3:51
rand1000-6 | 1000 | 1000 | 0.0625 | 11 | 22 | 2.59460453e3 | 2.59460367e3 | 1.9e-7 | 9.8e-7 | 1.7e-7 | 5:12
rand1000-7 | 1000 | 1000 | 0.0313 | 11 | 22 | 2.56991042e3 | 2.56990961e3 | 1.8e-7 | 9.3e-7 | 1.6e-7 | 6:59
rand1000-8 | 1000 | 1000 | 0.0157 | 11 | 22 | 2.55504168e3 | 2.55504091e3 | 1.7e-7 | 9.1e-7 | 1.5e-7 | 9:08
rand1000-9 | 1000 | 1000 | 0.0078 | 11 | 26 | 2.54622239e3 | 2.54622168e3 | 1.6e-7 | 8.5e-7 | 1.4e-7 | 12:17
rand1000-10 | 1000 | 1000 | 0.0039 | 11 | 30 | 2.54122850e3 | 2.54122780e3 | 1.5e-7 | 8.4e-7 | 1.4e-7 | 15:50
[Figure 1: Comparison of the performances of two methods on problem (A1) with various µ; the two panels plot time (seconds) against µ on a logarithmic scale.]
6.2 Synthetic experiments II

In this subsection, we focus the numerical experiments on the following problems:

(B1)  min_X { (1/2)∥H ◦ (X − G)∥² − µ log det X | X_ii = 1, i = 1, 2, · · · , n, X ≽ 0 },

(B2)  min_X { (1/2)∥H ◦ (X − G)∥² − µ log det X | X_ij = X_ji = 0 ∀ (i, j) ∈ E, X_ii = 1, i = 1, 2, · · · , n, X ≽ 0 }.

The matrices H and G are generated randomly, where H is a weight matrix whose entries are between 0 and 1, with many entries nearly zero, and E is a random subset of {(i, j) | 1 ≤ i < j, j = 2, . . . , n}.
The objective function of (B1) or (B2) is

(1/2)⟨X, H ◦ H ◦ X⟩ − ⟨H ◦ H ◦ G, X⟩ + (1/2)⟨H ◦ G, H ◦ G⟩ − µ log det X,   (15)
which is a special case of (P), so the algorithms we developed can be applied to solve (B1)-(B2). We list the performances of the PAL method and the hybrid method on problems (B1) and (B2) in Tables 5 and 6 below. We can see that for these two problems the PAL method is much faster than the hybrid method. The reason is that the entries of the weight matrix H in (15) are small, so the entries of H ◦ H are even smaller, which leads to a more ill-conditioned Hessian of the inner problem; in this situation the NAL method has no advantage.
Table 5: Performances of the PAL method on problems (B1)-(B2) with µ = 1.

Problem | n | m | it | pobj | dobj | RP | RD | RG | Time
rand-B1-0500 | 500 | 500 | 147 | 3.90611146e3 | 3.90613831e3 | 8.9e-7 | 3.1e-7 | 3.4e-6 | 29
rand-B1-1000 | 1000 | 1000 | 114 | 1.76566465e4 | 1.76570729e4 | 7.5e-7 | 9.7e-7 | 1.2e-5 | 1:49
rand-B1-1500 | 1500 | 1500 | 106 | 4.38503029e4 | 4.38510787e4 | 2.5e-7 | 9.4e-7 | 8.8e-6 | 4:47
rand-B1-2000 | 2000 | 2000 | 107 | 8.41256796e4 | 8.41269558e4 | 1.6e-7 | 9.6e-7 | 7.6e-6 | 9:47
rand-B2-0500 | 500 | 12322 | 175 | 5.30756232e3 | 5.30756434e3 | 8.6e-7 | 2.1e-7 | 1.9e-7 | 35
rand-B2-1000 | 1000 | 25086 | 143 | 1.91004704e4 | 1.91005117e4 | 4.7e-7 | 9.8e-7 | 1.1e-6 | 2:21
rand-B2-1500 | 1500 | 38389 | 135 | 4.53956279e4 | 4.53957194e4 | 3.4e-8 | 9.9e-7 | 1.0e-6 | 6:05
rand-B2-2000 | 2000 | 51322 | 129 | 8.56585712e4 | 8.56586838e4 | 2.3e-8 | 9.6e-7 | 6.6e-7 | 11:53
Table 6: Performances of the hybrid method on problems (B1)-(B2) with µ = 1.

Problem | n | m | it | itsub | pobj | dobj | RP | RD | RG | Time
rand-B1-0500 | 500 | 500 | 9 | 25 | 3.90611894e3 | 3.90609819e3 | 6.9e-7 | 7.6e-7 | 2.7e-6 | 1:24
rand-B1-1000 | 1000 | 1000 | 7 | 20 | 1.76566628e4 | 1.76566528e4 | 2.7e-7 | 5.2e-7 | 2.9e-7 | 7:22
rand-B1-1500 | 1500 | 1500 | 7 | 19 | 4.38503287e4 | 4.38502532e4 | 3.7e-7 | 6.5e-7 | 8.6e-7 | 22:01
rand-B1-2000 | 2000 | 2000 | 7 | 20 | 8.41257271e4 | 8.41255864e4 | 4.8e-7 | 7.8e-7 | 8.4e-7 | 58:27
rand-B2-0500 | 500 | 12322 | 6 | 17 | 5.30756255e3 | 5.30757122e3 | 3.5e-7 | 9.8e-7 | 8.2e-7 | 1:01
rand-B2-1000 | 1000 | 25086 | 7 | 19 | 1.91004712e4 | 1.91004680e4 | 1.4e-7 | 2.8e-7 | 8.2e-8 | 6:51
rand-B2-1500 | 1500 | 38389 | 7 | 19 | 4.53956281e4 | 4.53956184e4 | 1.7e-7 | 3.1e-7 | 1.1e-7 | 22:13
rand-B2-2000 | 2000 | 51322 | 7 | 19 | 8.56585840e4 | 8.56585294e4 | 2.7e-7 | 4.4e-7 | 3.2e-7 | 51:15
Based on the synthetic experiments in the above two subsections, we conclude that if the Hessian of the inner problem is well-conditioned, we prefer to adopt the hybrid method, but if the Hessian is ill-conditioned, using the PAL method alone is a good choice.
6.3 Real data experiments

In this subsection, we consider the following model problem, which is a special case of problem (P) with Q ≡ 0:

(C)  min_X { ⟨C, X⟩ − log det X | X_ij = X_ji = 0, ∀ (i, j) ∈ E, X ≽ 0 },
where the matrix C is real data coming from one of two gene expression data sets, and E is a predetermined index set. The model (C) finds wide application in covariance selection.
One gene set is the Rosetta Inpharmatics Compendium of gene expression profiles described by Hughes et al. [12]. The data set contains 253 samples with n = 6136 variables. We aim to estimate the covariance matrix of a Gaussian graphical model whose conditional independence structure is unknown. The other gene set is the Iconix microarray data obtained from 255 drug-treated rat livers; see Natsoulis et al. [23] for details. For both data sets, although our method can handle problems with larger matrix dimensions, we test only on a subset of the data. We create 5 subsets by taking the 500, 1000, 1500, 2000 and 2500 variables with the highest variances, respectively. As the variances vary widely, we normalize the sample covariance matrices to have unit variances on the diagonal.
Since the model (C) is a special case of (P) with Q ≡ 0, the Hessian of the inner problem is likely to be ill-conditioned, so we only apply the PAL method to solve it. Furthermore, we also compare the performance of the PAL method with that of the PPA proposed in [40].
Table 7: Performances of the PAL method on problem (C).

Problem | n | m | it | pobj | dobj | RP | RD | RG | Time
Rosetta-0500 | 500 | 999 | 13 | 2.29472323e1 | 2.29473435e1 | 4.3e-7 | 5.1e-7 | 2.4e-6 | 03
Rosetta-1000 | 1000 | 1999 | 12 | 3.06492978e1 | 3.06495129e1 | 4.7e-7 | 8.7e-7 | 3.5e-6 | 16
Rosetta-1500 | 1500 | 2999 | 12 | 3.56384267e1 | 3.56386620e1 | 3.7e-7 | 6.7e-7 | 3.3e-6 | 46
Rosetta-2000 | 2000 | 3999 | 12 | 3.94407754e1 | 3.94410344e1 | 3.3e-7 | 5.9e-7 | 3.2e-6 | 1:49
Rosetta-2500 | 2500 | 4999 | 12 | 4.19969175e1 | 4.19971940e1 | 2.9e-7 | 5.4e-7 | 3.3e-6 | 3:27
Iconix-0500 | 500 | 999 | 28 | 1.30457645e2 | 1.30457705e2 | 3.8e-7 | 7.9e-7 | 2.3e-7 | 08
Iconix-1000 | 1000 | 1999 | 25 | 2.28035658e2 | 2.28035773e2 | 1.3e-7 | 9.0e-7 | 2.5e-7 | 28
Iconix-1500 | 1500 | 2999 | 21 | 2.92514554e2 | 2.92514672e2 | 1.6e-7 | 8.2e-7 | 2.0e-7 | 58
Iconix-2000 | 2000 | 3999 | 21 | 3.42933318e2 | 3.42933507e2 | 6.6e-8 | 9.7e-7 | 2.8e-7 | 2:14
Iconix-2500 | 2500 | 4999 | 16 | 3.84427349e2 | 3.84427544e2 | 7.5e-8 | 9.1e-7 | 2.5e-7 | 3:05
Table 8: Comparison of the PAL method and the PPA on problem (C) with accuracy 10^{-6}.

Problem | n | m | it (PAL) | it (PPA) | pobj (PAL) | pobj (PPA) | Time (PAL) | Time (PPA)
Rosetta-0500 | 500 | 999 | 13 | 2 | 2.29472323e1 | 2.29473468e1 | 03 | 08
Rosetta-1000 | 1000 | 1999 | 12 | 2 | 3.06492978e1 | 3.06495844e1 | 16 | 43
Rosetta-1500 | 1500 | 2999 | 12 | 2 | 3.56384267e1 | 3.56387453e1 | 46 | 2:01
Rosetta-2000 | 2000 | 3999 | 12 | 2 | 3.94407754e1 | 3.94411131e1 | 1:49 | 4:51
Rosetta-2500 | 2500 | 4999 | 12 | 3 | 4.19969175e1 | 4.19979828e1 | 3:27 | 13:36
Iconix-0500 | 500 | 999 | 28 | 3 | 1.30457645e2 | 1.30457676e2 | 08 | 14
Iconix-1000 | 1000 | 1999 | 25 | 3 | 2.28035658e2 | 2.28035671e2 | 28 | 1:20
Iconix-1500 | 1500 | 2999 | 21 | 4 | 2.92514554e2 | 2.92514506e2 | 58 | 5:59
Iconix-2000 | 2000 | 3999 | 21 | 4 | 3.42933318e2 | 3.42933324e2 | 2:14 | 13:27
Iconix-2500 | 2500 | 4999 | 16 | 4 | 3.84427349e2 | 3.84427356e2 | 3:05 | 25:45
Tables 7 and 8 show not only that the PAL method is very efficient at solving the real data problems, but also that the PAL method is about 2 to 7 times faster than the PPA.
Concluding remarks
In this paper, we designed a PAL method and a hybrid method to solve log-determinant
optimization problems. We established the rigorous convergence results based on the
fundamental theoretical frameworks. Extensive numerical experiments conducted on both
synthetic problems and real data problems demonstrated that our methods are very robust
and efficient. Based on the results of the numerical experiments, we conclude that for
general problems, we prefer to adopt the hybrid method; but for some ill-conditioned
problems, the PAL method may be a better choice.
Acknowledgements
I sincerely thank the Institute for Mathematical Sciences, National University of Singapore, for supporting my visit to the institute and my attendance at the workshop "Optimization: Computation, Theory and Modeling" in 2012, which gave me a good opportunity to have fruitful discussions with Professors Defeng Sun and Kim-Chuan Toh on the PAL method. I also thank Dr. Xinyuan Zhao of Beijing University of Technology for many discussions on this topic.
References
[1] F. Alizadeh, J. P. A. Haeberly, and M. L. Overton, Complementarity and nondegeneracy in semidefinite programming, Math. Programming, 77 (1997), 111–128.
[2] O. Banerjee, L. El Ghaoui, A. d’Aspremont, Model selection through sparse maximum
likelihood estimation, J. Mach. Learn. Res., 9 (2008), 485–516.
[3] J. F. Bonnans and A. Shapiro, Perturbation Analysis of Optimization Problems,
Springer, New York, 2000.
[4] J. Dahl, L. Vandenberghe, and V. Roychowdhury, Covariance selection for nonchordal graphs via chordal embedding, Optim. Methods Softw., 23 (2008), 501–520.
[5] A. Dempster, Covariance selection, Biometrics, 28 (1972), 157–175.
[6] A. d’Aspremont, O. Banerjee, and L. El Ghaoui, First-order methods for sparse
covariance selection, SIAM J. Matrix Anal. Appl., 30 (2008), 56–66.
[7] M. Fazel, T.-K. Pong, D. Sun, and P. Tseng, Hankel matrix rank minimization with
applications to system identification and realization, SIAM J. Matrix Anal. Appl., 34
(2013), 946–977.
[8] Y. Gao and D. Sun, Calibrating least squares semidefinite programming with equality
and inequality constraints, SIAM J. Matrix Anal. Appl., 31 (2009), 1432–1457.
[9] R. W. Freund and N. M. Nachtigal, A new Krylov subspace method for symmetric
indefinite linear systems, ORNL/TM-12754, 1994.
[10] R. A. Horn and C. R. Johnson, Matrix Analysis, Cambridge University Press, Cambridge, UK, 1985.
[11] R. A. Horn and C. R. Johnson, Topics in Matrix Analysis, Cambridge University
Press, Cambridge, UK, 1991.
[12] T. R. Hughes, M. J. Marton, A. R. Jones, C. J. Roberts, R. Stoughton, C. D.
Armour, H. A. Bennett, E. Coffey, H. Dai, Y. D. He, M. J. Kidd, A. M. King, M.
R. Meyer, D. Slade, P. Y. Lum, S. B. Stepaniants, D. D. Shoemaker, D. Gachotte,
K. Chakraburtty, J. Simon, M. Bard, and S. H. Friend, Functional discovery via a
compendium of expression profiles, Cell, 102 (2000), 109–126.
[13] Z. Hu, J. Cao, and L. J. Hong, Robust simulation of global warming policies using
the DICE model, Manage. Sci., 58 (2012), 1–17.
[14] K. F. Jiang, D. F. Sun, and K.-C. Toh, An inexact accelerated proximal gradient
method for large scale linearly constrained convex SDP, SIAM J. Optim., 22 (2012),
1042–1064.
[15] L. Li and K.-C. Toh, An inexact interior point method for L1-regularized sparse
covariance selection, Math. Program. Comput., 2 (2010), 291–315.
[16] Z. Lu, Smooth optimization approach for sparse covariance selection, SIAM J. Optim.,
19 (2009), 1807–1827.
[17] Z. Lu, Adaptive first-order methods for general sparse inverse covariance selection,
SIAM J. Matrix Anal. Appl., 31 (2010), 2000–2016.
[18] Z. Lu and Y. Zhang, Penalty decomposition methods for L0-norm minimization,
Proceedings of Neural Information Processing Systems (NIPS), 2011: 46–54.
[19] B. Martinet, Régularisation d'inéquations variationnelles par approximations successives, Rev. Française d'Informat. Recherche Opérationnelle, (1970), 154–159.
[20] F. Meng, D. Sun, and G. Zhao, Semismoothness of solutions to generalized equations
and the Moreau–Yosida regularization, Math. Programming, 104 (2005), 561–581.
[21] G. J. Minty, On the monotonicity of the gradient of a convex function, Pacific J.
Math., 14 (1964), 243–247.
[22] J. J. Moreau, Proximité et dualité dans un espace hilbertien, Bull. Soc. Math. France,
93 (1965), 273–299.
[23] G. Natsoulis, C. I. Pearson, J. Gollub, B. P. Eynon, J. Ferng, R. Nair, R. Idury,
M. D. Lee, M. R. Fielden, R. J. Brennan, A. H. Roter and K. Jarnagin, The liver
pharmacological and xenobiotic gene response repertoire, Mol. Syst. Biol., 175 (2008),
1–12.
[24] P. Olsen, F. Oztoprak, J. Nocedal, and S. Rennie, Newton-like methods for sparse inverse covariance estimation, available at http://www.optimization-online.org/DB_HTML/2012/06/3506.html.
[25] H. Qi and D. Sun, A quadratically convergent Newton method for computing the
nearest correlation matrix, SIAM J. Matrix Anal. Appl., 28 (2006), 360–385.
[26] R. T. Rockafellar, Convex Analysis, Princeton University Press, Princeton, 1970.
[27] R. T. Rockafellar, A dual approach to solving nonlinear programming problems by
unconstrained optimization, Math. Programming 5 (1973), 354–373.
[28] R. T. Rockafellar, Monotone operators and the proximal point algorithm, SIAM J.
Control Optim., 14 (1976), 877–898.
[29] R. T. Rockafellar, Augmented Lagrangians and applications of the proximal point
algorithm in convex programming, Math. Oper. Res., 1 (1976), 97–116.
[30] K. Scheinberg, S. Ma, and D. Goldfarb, Sparse inverse covariance selection via alternating linearization methods, Twenty-Fourth Annual Conference on Neural Information Processing Systems (NIPS), 2010: 2101–2109.
[31] K. Scheinberg and I. Rish, Learning sparse Gaussian Markov networks using a
greedy coordinate ascent approach, in Machine Learning and Knowledge Discovery
in Databases, Lecture Notes in Comput. Sci. 6323, J. L. Balcazar, F. Bonchi, A.
Gionis, and M. Sebag, eds., Springer-Verlag, Berlin, 2010, 196–212.
[32] D. Sun, The strong second order sufficient condition and constraint nondegeneracy
in nonlinear semidefinite programming and their implications, Math. Oper. Res., 31
(2006), 761–776.
[33] D. Sun, J. Sun, and L. Zhang, The rate of convergence of the augmented Lagrangian
method for nonlinear semidefinite programming, Math. Programming, 114 (2008),
349–391.
[34] K.-C. Toh, Primal-dual path-following algorithms for determinant maximization problems with linear matrix inequalities, Comput. Optim. Appl., 14 (1999), 309–330.
[35] K.-C. Toh, An inexact primal-dual path following algorithm for convex quadratic
SDP, Math. Programming, 112 (2008), 221–254.
[36] R. H. Tütüncü, K.-C. Toh, and M. J. Todd, Solving semidefinite-quadratic-linear programs using SDPT3, Math. Programming, 95 (2003), 189–217.
[37] K.-C. Toh, R. H. Tütüncü, and M. J. Todd, Inexact primal-dual path-following algorithms for a special class of convex quadratic SDP and related problems, Pac. J. Optim., 3 (2007), 135–164.
[38] N.-K. Tsing, M. K. H. Fan, and E. I. Verriest, On analyticity of functions involving
eigenvalues, Linear Algebra Appl. 207 (1994), 159–180.
[39] B. Varadarajan, D. Povey, and S. M. Chu, Quick fmllr for speaker adaptation in speech recognition, in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2008.
[40] C. Wang, D. Sun, and K.-C. Toh, Solving log-determinant optimization problems by a
Newton-CG primal proximal point algorithm, SIAM J. Optim., 20 (2010), 2994–3013.
[41] J. Yang, D. Sun, and K.-C. Toh, A proximal point algorithm for log-determinant
optimization with group Lasso regularization, SIAM J. Optim., 23 (2013), 857–893.
[42] K. Yosida, Functional Analysis, Springer Verlag, Berlin, 1964.
[43] S. Yang, X. Shen, P. Wonka, Z. Lu, and J. Ye, Fused multiple graphical Lasso, available at http://people.math.sfu.ca/~zhaosong/ResearchPapers/FMGL.pdf.
[44] X. Yuan, Alternating direction methods for sparse covariance selection, J. Sci. Comput., 51 (2012), 261–273.
[45] X.-Y. Zhao, A Semismooth Newton-CG Augmented Lagrangian Method for Large
Scale Linear and Convex Quadratic SDPs, PhD thesis, National University of Singapore, 2009.
[46] X.-Y. Zhao, D. Sun, and K.-C. Toh, A Newton-CG augmented Lagrangian method
for semidefinite programming, SIAM J. Optim., 20 (2010), 1737–1765.