On Bayesian Computation
Transcription
On Bayesian Computation
Michael I. Jordan
with Elaine Angelino, Maxim Rabinovich, Martin Wainwright and Yun Yang

Previous Work: Information Constraints on Inference
- Minimize the minimax risk under constraints:
  - privacy constraint
  - communication constraint
  - memory constraint
- Yields tradeoffs linking statistical quantities (amount of data, complexity of model, dimension of parameter space, etc.) to "externalities" (privacy level, bandwidth, storage, etc.)

Ongoing Work: Computational Constraints on Inference
- Tradeoffs via convex relaxations
- Tradeoffs via concurrency control
- Bounds via optimization oracles
- Bounds via communication complexity
- Tradeoffs via subsampling
- All of this work has been frequentist; how about Bayesian inference?

Today's Talk: Bayesian Inference and Computation
- Integration rather than optimization
- Part I: Mixing times for MCMC
- Part II: Distributed MCMC

Part I: Mixing Times for MCMC

MCMC and Sparse Regression
- Statistical problem: predict/explain a response variable Y based on a very large number of features X₁, …, X_d.
- Formally: Y = X₁β₁ + X₂β₂ + ⋯ + X_dβ_d + noise, or in matrix form Y_{n×1} = X_{n×d} β_{d×1} + noise, with the number of features d ≫ the sample size n.
- Sparsity condition: the number of non-zero β_j's is ≪ n.
- Goal: variable selection, i.e., identify the influential features.
- Can this be done efficiently (in a computational sense)?

Statistical Methodology
- Frequentist approach: penalized optimization of the objective function
      f(β) = (1/(2n)) ‖Y − Xβ‖₂² + P_λ(β),
  where the first term measures goodness of fit and P_λ(β) is the penalty term.
- Bayesian approach: Monte Carlo sampling from the posterior probability
      P(β | Y) ∝ P(Y | β) × P(β)    (posterior ∝ likelihood × prior).

Computation and Bayesian Inference
      P(β | Y) = C(Y) × P(Y | β) × P(β),
where C(Y) is the normalizing constant, P(Y | β) the likelihood, and P(β) the prior probability.
- The most widely used tool for fitting Bayesian models is the sampling technique based on Markov chain Monte Carlo (MCMC).
- The theoretical analysis of the computational efficiency of MCMC algorithms lags that of optimization-based methods.

Computational Complexity
- Central object of interest: the mixing time t_mix of the Markov chain.
- Bound t_mix as a function of the problem parameters (n, d):
  - rapidly mixing: t_mix grows at most polynomially
  - slowly mixing: t_mix grows exponentially
- It has long been believed by many statisticians that MCMC for Bayesian variable selection is necessarily slowly mixing; efforts have been underway to find lower bounds, but such bounds have not yet been found.

    Linear model:     M_γ: Y = X_γ β_γ + w,  w ∼ N(0, φ⁻¹ I_n)
    Precision prior:  π(φ) ∝ 1/φ
    Regression prior: β_γ | γ ∼ N(0, g φ⁻¹ (X_γᵀ X_γ)⁻¹)
    Sparsity prior:   π(γ) ∝ p^{−κ|γ|} I[|γ| ≤ s₀]

Assumption A (Conditions on β*)
The true regression vector has components β* = (β*_S, β*_{S^c}) that satisfy the bounds
    Full β* condition:          ‖(1/√n) X β*‖₂² ≤ g σ₀² (log p)/n
    Off-support S^c condition:  ‖(1/√n) X_{S^c} β*_{S^c}‖₂² ≤ L̃ σ₀² (log p)/n
for some L̃ ≥ 0.

Assumption B (Conditions on the design matrix)
The design matrix has been normalized so that ‖X_j‖₂² = n for all j = 1, …, p; moreover, letting Z ∼ N(0, I_n), there exist constants ν > 0 and L < ∞ with Lν ≥ 4 such that
    Lower restricted eigenvalue RE(s):  min_{|γ|≤s} λ_min((1/n) X_γᵀ X_γ) ≥ ν
    Sparse projection condition SI(s):  E_Z[ max_{|γ|≤s} max_{k∈[p]∖γ} (1/√n) ⟨(I − Φ_γ) X_k, Z⟩ ] ≤ (1/2) √(L ν log p),
where Φ_γ denotes projection onto the span of {X_j, j ∈ γ}. This is a mild assumption, needed in the information-theoretic analysis of variable selection.
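To fix ideas before the remaining assumptions (C and D below), here is a minimal sketch of the kind of Metropolis-Hastings random walk over models γ ∈ {0,1}^p that this analysis concerns. It uses the standard closed-form marginal likelihood implied by the g-prior and the precision prior above; the single-coordinate-flip proposal and all function names are illustrative simplifications, not the authors' implementation (the chain analyzed in the talk may include additional moves, e.g., swaps).

```python
import numpy as np

def log_marginal(Y, X, gamma, g):
    """log P(Y | gamma) up to a gamma-independent constant, via the closed
    form implied by the g-prior and pi(phi) ~ 1/phi:
    P(Y | gamma) ~ (1+g)^(-|gamma|/2) * (Y'Y - g/(1+g) * Y'P_gamma Y)^(-n/2)."""
    n = len(Y)
    k = int(gamma.sum())
    yty = float(Y @ Y)
    if k == 0:
        rss = yty
    else:
        Xg = X[:, gamma.astype(bool)]
        coef, *_ = np.linalg.lstsq(Xg, Y, rcond=None)  # projects Y onto span(X_gamma)
        rss = yty - (g / (1.0 + g)) * float(Y @ (Xg @ coef))
    return -0.5 * k * np.log(1.0 + g) - 0.5 * n * np.log(rss)

def log_prior(gamma, kappa, p, s0):
    """Sparsity prior pi(gamma) ~ p^(-kappa*|gamma|) on models with |gamma| <= s0."""
    k = int(gamma.sum())
    return -np.inf if k > s0 else -kappa * k * np.log(p)

def mh_variable_selection(Y, X, g, kappa, s0, n_iters, seed=0):
    """Single-flip Metropolis-Hastings walk on the hypercube {0,1}^p."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    gamma = np.zeros(p, dtype=int)
    lp_cur = log_marginal(Y, X, gamma, g) + log_prior(gamma, kappa, p, s0)
    samples = []
    for _ in range(n_iters):
        j = rng.integers(p)            # propose flipping one coordinate
        prop = gamma.copy()
        prop[j] ^= 1
        lp_prop = log_marginal(Y, X, prop, g) + log_prior(prop, kappa, p, s0)
        if np.log(rng.uniform()) < lp_prop - lp_cur:   # MH accept/reject
            gamma, lp_cur = prop, lp_prop
        samples.append(gamma.copy())
    return np.array(samples)
```

The posterior inclusion probability of feature j can then be estimated by the average of the jth coordinate over the (post-burn-in) samples.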
Assumption C (Choices of prior hyperparameters)
The noise hyperparameter g and the sparsity penalty hyperparameter κ > 0 are chosen such that
    g = p^{2α+2}   and   κ + α ≥ C₁ (L + L̃)
for some α > 0 and some universal constant C₁ > 0.

Assumption D (Sparsity control)
For a constant C₀ > 8, one of the two following conditions holds:
    Version D(s*): We set s₀ := p in the sparsity prior, and the true sparsity s* is bounded as
        max{1, s*} ≤ (1/(8 C₀ K)) { n/(log p) − 16 L̃ σ₀² }
    for some constant K ≥ 4 + α + c L̃, where c is a universal constant.
    Version D(s₀): The sparsity parameter s₀ satisfies the sandwich relation
        max{1, (2 ν⁻² ω(X) + 1) s*} ≤ s₀ ≤ (1/(8 C₀ K)) { n/(log p) − 16 L̃ σ₀² },
    where ω(X) := max_γ |||(X_γᵀ X_γ)⁻¹ X_γᵀ X_{γ*∖γ}|||²_op.

Our Results
Theorem (Posterior concentration). Given Assumption A, Assumption B with s = K s*, Assumption C, and Assumption D(s*), if C_β (a lower bound on the magnitude of the smallest non-zero coefficient) satisfies
    C_β² ≥ c₀ ν⁻² (L + L̃ + α + κ) σ₀² (log p)/n,
then π_n(γ* | Y) ≥ 1 − c₁ p⁻¹ with high probability.

Compare to ℓ₁-based approaches, which require an irrepresentable condition:
    max_{k∈[p], S: |S|=s*} ‖X_kᵀ X_S (X_Sᵀ X_S)⁻¹‖₁ < 1.

Our Results
Theorem (Rapid mixing). Suppose that Assumption A, Assumption B with s = s₀, Assumption C, and Assumption D(s₀) all hold. Then there are universal constants c₁, c₂ such that, for any ε ∈ (0, 1), the ε-mixing time of the Metropolis-Hastings chain is upper bounded as
    τ_ε ≤ c₁ p s₀² { c₂ α (n + s₀) log p + log(1/ε) + 2 }
with probability at least 1 − c₃ p^{−c₄}.

High-Level Proof Idea
- A Metropolis-Hastings random walk on the p-dimensional hypercube {0, 1}^p.
- Canonical path ensemble argument: for any model γ ≠ γ*, where γ* is the true model, find a path from γ to γ* along which the acceptance ratios are high.

Part II: Distributed MCMC

Traditional MCMC
- Serial, iterative algorithm for generating samples.
- Slow for two reasons: (1) a large number of iterations is required to converge; (2) each iteration depends on the entire dataset.
- Most research on MCMC has targeted (1).
- Recent threads of work target (2).

Serial MCMC
[Diagram: Data → Single core → Samples]

Data-Parallel MCMC
[Diagram: Data partitioned across parallel cores, each producing "samples"]
Aggregate samples from across partitions, but how?

Factorization motivates a data-parallel approach:
    π(θ | x) ∝ π(θ) π(x | θ) = ∏_{j=1}^{J} [ π(θ)^{1/J} π(x^{(j)} | θ) ]
(posterior ∝ prior × likelihood; each factor π(θ)^{1/J} π(x^{(j)} | θ) is a sub-posterior)
- Partition the data as x^{(1)}, …, x^{(J)} across J cores.
- The jth core samples from a distribution proportional to the jth sub-posterior (a 'piece' of the full posterior).
- Aggregate the sub-posterior samples to form approximate full posterior samples, as in the sketch below.
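A toy sketch of this sub-posterior scheme, assuming a generic log-prior/log-likelihood interface (all names here are illustrative, not from the talk). The J chains are independent, so in practice each rwmh call would run on its own core; they are written serially for brevity.

```python
import numpy as np

def make_subposterior_logpdf(x_shard, log_prior, log_lik, J):
    """Log-density of one sub-posterior: (1/J) * log prior + shard log-likelihood."""
    def logpdf(theta):
        return log_prior(theta) / J + log_lik(x_shard, theta)
    return logpdf

def rwmh(logpdf, theta0, n_iters, step=0.05, seed=0):
    """Generic random-walk Metropolis-Hastings sampler."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    lp = logpdf(theta)
    out = np.empty((n_iters, theta.size))
    for t in range(n_iters):
        prop = theta + step * rng.standard_normal(theta.size)
        lp_prop = logpdf(prop)
        if np.log(rng.uniform()) < lp_prop - lp:
            theta, lp = prop, lp_prop
        out[t] = theta
    return out

# Toy example: unknown mean of a Gaussian, J = 4 shards.
rng = np.random.default_rng(1)
x = rng.normal(1.0, 1.0, size=10_000)
J = 4
shards = np.array_split(x, J)
log_prior = lambda th: -0.5 * float(th @ th)                     # N(0, 1) prior
log_lik = lambda xs, th: -0.5 * float(np.sum((xs - th[0]) ** 2)) # unit noise variance
sub_samples = [
    rwmh(make_subposterior_logpdf(s, log_prior, log_lik, J), [0.0], 2000, seed=j)
    for j, s in enumerate(shards)
]   # one (2000, 1) array of draws per shard, to be aggregated downstream
```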
Aggregation Strategies for Sub-Posterior Samples
- Sub-posterior density estimation (Neiswanger et al., UAI 2014)
- Weierstrass samplers (Wang & Dunson, 2013)
- Weighted averaging of sub-posterior samples:
  - Consensus Monte Carlo (Scott et al., Bayes 250, 2013)
  - Variational Consensus Monte Carlo (Rabinovich et al., NIPS 2015)

Aggregate 'horizontally' across partitions.
[Diagram: Data → Parallel cores → "Samples" → Aggregate]

Naive aggregation = average:
    θ = 0.5 θ^{(1)} + 0.5 θ^{(2)}
Less naive aggregation = weighted average:
    θ = 0.58 θ^{(1)} + 0.42 θ^{(2)}

Consensus Monte Carlo (Scott et al., 2013)
- Weights are inverse covariance matrices.
- Motivated by Gaussian assumptions.
- Designed at Google for the MapReduce framework.

Variational Consensus Monte Carlo
Goal: choose the aggregation function to best approximate the target distribution.
Method: convex optimization via variational Bayes.
    F = aggregation function,  q_F = approximate distribution
    Objective:  L(F) = E_{q_F}[log π(X, θ)] + H[q_F]    (likelihood term + entropy)
    Relaxation: L̃(F) = E_{q_F}[log π(X, θ)] + H̃[q_F]    (likelihood term + relaxed entropy)
No mean-field assumption.
[Diagram: aggregation as a weighted combination]
- Optimize over weight matrices.
- Restrict to valid solutions when the parameter vectors are constrained.

Theorem (Entropy relaxation). Under mild structural assumptions, we can choose
    H̃[q_F] = c₀ + (1/K) ∑_{k=1}^{K} h_k(F),
with each h_k a concave function of F, such that H[q_F] ≥ H̃[q_F]. We therefore have L(F) ≥ L̃(F).

Theorem (Concavity of the variational Bayes objective). Under mild structural assumptions, the relaxed variational Bayes objective
    L̃(F) = E_{q_F}[log π(X, θ)] + H̃[q_F]
is concave in F.
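A rough sketch of the Gaussian-motivated CMC weighting described above: each shard receives a weight matrix proportional to the inverse of its sub-posterior sample covariance, and aligned draws are combined with these weights. The function is illustrative, not the authors' code; VCMC would instead treat the weight matrices as free parameters F and maximize the relaxed objective L̃(F).

```python
import numpy as np

def cmc_aggregate(sub_samples):
    """Consensus Monte Carlo-style aggregation.

    sub_samples: list of J arrays of shape (S, d), aligned by draw index.
    Returns an (S, d) array: theta_t = (sum_j W_j)^(-1) sum_j W_j theta_{j,t},
    with W_j the inverse sample covariance of shard j's draws."""
    weights = [np.linalg.inv(np.atleast_2d(np.cov(s, rowvar=False)))
               for s in sub_samples]
    w_sum_inv = np.linalg.inv(sum(weights))
    combined = sum(W @ s.T for W, s in zip(weights, sub_samples))  # (d, S)
    return (w_sum_inv @ combined).T                                # (S, d)
```

In the two-shard scalar cartoon above, this recovers weights like 0.58/0.42: the shard whose sub-posterior is tighter receives more weight.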
Empirical Evaluation
- Compare three aggregation strategies:
  - uniform average
  - Gaussian-motivated weighted average (CMC)
  - optimized weighted average (VCMC)
- For each algorithm A, report the approximation error of some expectation E_π[f], relative to serial MCMC:
      ε_A(f) = |E_A[f] − E_MCMC[f]| / |E_MCMC[f]|
  (a short sketch of this metric appears at the end of this transcription).

Preliminary Speedup Results
Example 1: High-dimensional Bayesian probit regression; #data = 100,000, d = 300.
[Figure: first-moment estimation error, relative to serial MCMC; error truncated at 2.0]
Example 2: High-dimensional covariance estimation; normal-inverse-Wishart model; #data = 100,000, #dim = 100, hence 5,050 parameters.
[Figure: (L) first-moment estimation error; (R) eigenvalue estimation error]
Example 3: Mixture of eight 8-dimensional Gaussians; error relative to serial MCMC, for the cluster co-membership probabilities of pairs of test data points.
- VCMC reduces the CMC error at the cost of speedup (~2x).
- VCMC speedup is approximately linear.
[Figure: speedup curves for CMC and VCMC]

Discussion
Contributions:
- Convex optimization framework for Consensus Monte Carlo
- Structured aggregation accounting for constrained parameters
- Entropy relaxation
- Empirical evaluation
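Finally, the relative-error metric ε_A(f) referenced in the empirical evaluation is straightforward to estimate from samples; a minimal sketch, with sample means standing in for the expectations and illustrative names:

```python
import numpy as np

def relative_error(f_vals_algo, f_vals_mcmc):
    """Relative error |E_A[f] - E_MCMC[f]| / |E_MCMC[f]|, with each
    expectation estimated by the mean of f over that method's samples."""
    e_algo = np.mean(f_vals_algo, axis=0)
    e_mcmc = np.mean(f_vals_mcmc, axis=0)
    return np.abs(e_algo - e_mcmc) / np.abs(e_mcmc)
```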