On Bayesian Computation
Transcription
On Bayesian Computation
Michael I. Jordan
with Elaine Angelino, Maxim Rabinovich, Martin Wainwright and Yun Yang

Previous Work: Information Constraints on Inference
- Minimize the minimax risk under constraints:
  - privacy constraint
  - communication constraint
  - memory constraint
- Yields tradeoffs linking statistical quantities (amount of data, complexity of model, dimension of parameter space, etc.) to "externalities" (privacy level, bandwidth, storage, etc.)

Ongoing Work: Computational Constraints on Inference
- Tradeoffs via convex relaxations
- Tradeoffs via concurrency control
- Bounds via optimization oracles
- Bounds via communication complexity
- Tradeoffs via subsampling
- All of this work has been frequentist; how about Bayesian inference?

Today's Talk: Bayesian Inference and Computation
- Integration rather than optimization
- Part I: Mixing times for MCMC
- Part II: Distributed MCMC

Part I: Mixing Times for MCMC

MCMC and Sparse Regression
- Statistical problem: predict/explain a response variable Y based on a very large number of features X₁, …, X_d.
- Formally: Y = X₁β₁ + X₂β₂ + ⋯ + X_dβ_d + noise, or in matrix form Y_{n×1} = X_{n×d} β_{d×1} + noise, with the number of features d ≫ the sample size n.
- Sparsity condition: the number of non-zero β_j's is ≪ n.
- Goal: variable selection, i.e., identify the influential features.
- Can this be done efficiently (in a computational sense)?

Statistical Methodology
- Frequentist approach: penalized optimization of the objective function
      f(β) = (1/(2n)) ‖Y − Xβ‖₂² + P_λ(β),
  where the first term measures goodness of fit and P_λ(β) is the penalty term.
- Bayesian approach: Monte Carlo sampling from the posterior probability
      P(β | Y) ∝ P(Y | β) × P(β)    (posterior ∝ likelihood × prior).

Computation and Bayesian Inference
      P(β | Y) = C(Y) × P(Y | β) × P(β),
where C(Y) is the normalizing constant, P(Y | β) the likelihood, and P(β) the prior probability.
- The most widely used tool for fitting Bayesian models is the sampling technique based on Markov chain Monte Carlo (MCMC).
- The theoretical analysis of the computational efficiency of MCMC algorithms lags that of optimization-based methods.

Computational Complexity
- Central object of interest: the mixing time t_mix of the Markov chain.
- Bound t_mix as a function of the problem parameters (n, d):
  - rapidly mixing: t_mix grows at most polynomially
  - slowly mixing: t_mix grows exponentially
- It has long been believed by many statisticians that MCMC for Bayesian variable selection is necessarily slowly mixing; efforts have been underway to find lower bounds, but such bounds have not yet been found.

    Linear model:     M_γ: Y = X_γ β_γ + w,  w ∼ N(0, φ⁻¹ I_n)
    Precision prior:  π(φ) ∝ 1/φ
    Regression prior: β_γ | γ ∼ N(0, g φ⁻¹ (X_γᵀ X_γ)⁻¹)
    Sparsity prior:   π(γ) ∝ p^{−κ|γ|} I[|γ| ≤ s₀]

Assumption A (Conditions on β*)
The true regression vector has components β* = (β*_S, β*_{S^c}) that satisfy the bounds
    Full β* condition:          ‖(1/√n) X β*‖₂² ≤ g σ₀² (log p)/n
    Off-support S^c condition:  ‖(1/√n) X_{S^c} β*_{S^c}‖₂² ≤ L̃ σ₀² (log p)/n
for some L̃ ≥ 0.

Assumption B (Conditions on the design matrix)
The design matrix has been normalized so that ‖X_j‖₂² = n for all j = 1, …, p; moreover, letting Z ∼ N(0, I_n), there exist constants ν > 0 and L < ∞ with Lν ≥ 4 such that
    Lower restricted eigenvalue RE(s):  min_{|γ|≤s} λ_min((1/n) X_γᵀ X_γ) ≥ ν
    Sparse projection condition SI(s):  E_Z[ max_{|γ|≤s} max_{k∈[p]∖γ} (1/√n) ⟨(I − Φ_γ) X_k, Z⟩ ] ≤ (1/2) √(L ν log p),
where Φ_γ denotes projection onto the span of {X_j, j ∈ γ}. This is a mild assumption, needed in the information-theoretic analysis of variable selection.
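To fix ideas before the remaining assumptions (C and D below), here is a minimal sketch of the kind of Metropolis-Hastings random walk over models γ ∈ {0,1}^p that this analysis concerns. It uses the standard closed-form marginal likelihood implied by the g-prior and the precision prior above; the single-coordinate-flip proposal and all function names are illustrative simplifications, not the authors' implementation (the chain analyzed in the talk may include additional moves, e.g., swaps).

```python
import numpy as np

def log_marginal(Y, X, gamma, g):
    """log P(Y | gamma) up to a gamma-independent constant, via the closed
    form implied by the g-prior and pi(phi) ~ 1/phi:
    P(Y | gamma) ~ (1+g)^(-|gamma|/2) * (Y'Y - g/(1+g) * Y'P_gamma Y)^(-n/2)."""
    n = len(Y)
    k = int(gamma.sum())
    yty = float(Y @ Y)
    if k == 0:
        rss = yty
    else:
        Xg = X[:, gamma.astype(bool)]
        coef, *_ = np.linalg.lstsq(Xg, Y, rcond=None)  # projects Y onto span(X_gamma)
        rss = yty - (g / (1.0 + g)) * float(Y @ (Xg @ coef))
    return -0.5 * k * np.log(1.0 + g) - 0.5 * n * np.log(rss)

def log_prior(gamma, kappa, p, s0):
    """Sparsity prior pi(gamma) ~ p^(-kappa*|gamma|) on models with |gamma| <= s0."""
    k = int(gamma.sum())
    return -np.inf if k > s0 else -kappa * k * np.log(p)

def mh_variable_selection(Y, X, g, kappa, s0, n_iters, seed=0):
    """Single-flip Metropolis-Hastings walk on the hypercube {0,1}^p."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    gamma = np.zeros(p, dtype=int)
    lp_cur = log_marginal(Y, X, gamma, g) + log_prior(gamma, kappa, p, s0)
    samples = []
    for _ in range(n_iters):
        j = rng.integers(p)            # propose flipping one coordinate
        prop = gamma.copy()
        prop[j] ^= 1
        lp_prop = log_marginal(Y, X, prop, g) + log_prior(prop, kappa, p, s0)
        if np.log(rng.uniform()) < lp_prop - lp_cur:   # MH accept/reject
            gamma, lp_cur = prop, lp_prop
        samples.append(gamma.copy())
    return np.array(samples)
```

The posterior inclusion probability of feature j can then be estimated by the average of the jth coordinate over the (post-burn-in) samples.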
Assumption C (Choices of prior hyperparameters)
The noise hyperparameter g and the sparsity penalty hyperparameter κ > 0 are chosen such that
    g = p^{2α+2}   and   κ + α ≥ C₁ (L + L̃)
for some α > 0 and some universal constant C₁ > 0.

Assumption D (Sparsity control)
For a constant C₀ > 8, one of the two following conditions holds:
    Version D(s*): We set s₀ := p in the sparsity prior, and the true sparsity s* is bounded as
        max{1, s*} ≤ (1/(8 C₀ K)) { n/(log p) − 16 L̃ σ₀² }
    for some constant K ≥ 4 + α + c L̃, where c is a universal constant.
    Version D(s₀): The sparsity parameter s₀ satisfies the sandwich relation
        max{1, (2 ν⁻² ω(X) + 1) s*} ≤ s₀ ≤ (1/(8 C₀ K)) { n/(log p) − 16 L̃ σ₀² },
    where ω(X) := max_γ |||(X_γᵀ X_γ)⁻¹ X_γᵀ X_{γ*∖γ}|||²_op.

Our Results
Theorem (Posterior concentration). Given Assumption A, Assumption B with s = K s*, Assumption C, and Assumption D(s*), if C_β (a lower bound on the magnitude of the smallest non-zero coefficient) satisfies
    C_β² ≥ c₀ ν⁻² (L + L̃ + α + κ) σ₀² (log p)/n,
then π_n(γ* | Y) ≥ 1 − c₁ p⁻¹ with high probability.

Compare to ℓ₁-based approaches, which require an irrepresentable condition:
    max_{k∈[p], S: |S|=s*} ‖X_kᵀ X_S (X_Sᵀ X_S)⁻¹‖₁ < 1.

Our Results
Theorem (Rapid mixing). Suppose that Assumption A, Assumption B with s = s₀, Assumption C, and Assumption D(s₀) all hold. Then there are universal constants c₁, c₂ such that, for any ε ∈ (0, 1), the ε-mixing time of the Metropolis-Hastings chain is upper bounded as
    τ_ε ≤ c₁ p s₀² { c₂ α (n + s₀) log p + log(1/ε) + 2 }
with probability at least 1 − c₃ p^{−c₄}.

High-Level Proof Idea
- A Metropolis-Hastings random walk on the p-dimensional hypercube {0, 1}^p.
- Canonical path ensemble argument: for any model γ ≠ γ*, where γ* is the true model, find a path from γ to γ* along which the acceptance ratios are high.

Part II: Distributed MCMC

Traditional MCMC
- Serial, iterative algorithm for generating samples.
- Slow for two reasons: (1) a large number of iterations is required to converge; (2) each iteration depends on the entire dataset.
- Most research on MCMC has targeted (1).
- Recent threads of work target (2).

Serial MCMC
[Diagram: Data → Single core → Samples]

Data-Parallel MCMC
[Diagram: Data partitioned across parallel cores, each producing "samples"]
Aggregate samples from across partitions, but how?

Factorization motivates a data-parallel approach:
    π(θ | x) ∝ π(θ) π(x | θ) = ∏_{j=1}^{J} [ π(θ)^{1/J} π(x^{(j)} | θ) ]
(posterior ∝ prior × likelihood; each factor π(θ)^{1/J} π(x^{(j)} | θ) is a sub-posterior)
- Partition the data as x^{(1)}, …, x^{(J)} across J cores.
- The jth core samples from a distribution proportional to the jth sub-posterior (a 'piece' of the full posterior).
- Aggregate the sub-posterior samples to form approximate full posterior samples, as in the sketch below.
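A toy sketch of this sub-posterior scheme, assuming a generic log-prior/log-likelihood interface (all names here are illustrative, not from the talk). The J chains are independent, so in practice each rwmh call would run on its own core; they are written serially for brevity.

```python
import numpy as np

def make_subposterior_logpdf(x_shard, log_prior, log_lik, J):
    """Log-density of one sub-posterior: (1/J) * log prior + shard log-likelihood."""
    def logpdf(theta):
        return log_prior(theta) / J + log_lik(x_shard, theta)
    return logpdf

def rwmh(logpdf, theta0, n_iters, step=0.05, seed=0):
    """Generic random-walk Metropolis-Hastings sampler."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    lp = logpdf(theta)
    out = np.empty((n_iters, theta.size))
    for t in range(n_iters):
        prop = theta + step * rng.standard_normal(theta.size)
        lp_prop = logpdf(prop)
        if np.log(rng.uniform()) < lp_prop - lp:
            theta, lp = prop, lp_prop
        out[t] = theta
    return out

# Toy example: unknown mean of a Gaussian, J = 4 shards.
rng = np.random.default_rng(1)
x = rng.normal(1.0, 1.0, size=10_000)
J = 4
shards = np.array_split(x, J)
log_prior = lambda th: -0.5 * float(th @ th)                     # N(0, 1) prior
log_lik = lambda xs, th: -0.5 * float(np.sum((xs - th[0]) ** 2)) # unit noise variance
sub_samples = [
    rwmh(make_subposterior_logpdf(s, log_prior, log_lik, J), [0.0], 2000, seed=j)
    for j, s in enumerate(shards)
]   # one (2000, 1) array of draws per shard, to be aggregated downstream
```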
Aggregation Strategies for Sub-Posterior Samples
- Sub-posterior density estimation (Neiswanger et al., UAI 2014)
- Weierstrass samplers (Wang & Dunson, 2013)
- Weighted averaging of sub-posterior samples:
  - Consensus Monte Carlo (Scott et al., Bayes 250, 2013)
  - Variational Consensus Monte Carlo (Rabinovich et al., NIPS 2015)

Aggregate 'horizontally' across partitions.
[Diagram: Data → Parallel cores → "Samples" → Aggregate]

Naive aggregation = average:
    θ = 0.5 θ^{(1)} + 0.5 θ^{(2)}
Less naive aggregation = weighted average:
    θ = 0.58 θ^{(1)} + 0.42 θ^{(2)}

Consensus Monte Carlo (Scott et al., 2013)
- Weights are inverse covariance matrices.
- Motivated by Gaussian assumptions.
- Designed at Google for the MapReduce framework.

Variational Consensus Monte Carlo
Goal: choose the aggregation function to best approximate the target distribution.
Method: convex optimization via variational Bayes.
    F = aggregation function,  q_F = approximate distribution
    Objective:  L(F) = E_{q_F}[log π(X, θ)] + H[q_F]    (likelihood term + entropy)
    Relaxation: L̃(F) = E_{q_F}[log π(X, θ)] + H̃[q_F]    (likelihood term + relaxed entropy)
No mean-field assumption.
[Diagram: aggregation as a weighted combination]
- Optimize over weight matrices.
- Restrict to valid solutions when the parameter vectors are constrained.

Theorem (Entropy relaxation). Under mild structural assumptions, we can choose
    H̃[q_F] = c₀ + (1/K) ∑_{k=1}^{K} h_k(F),
with each h_k a concave function of F, such that H[q_F] ≥ H̃[q_F]. We therefore have L(F) ≥ L̃(F).

Theorem (Concavity of the variational Bayes objective). Under mild structural assumptions, the relaxed variational Bayes objective
    L̃(F) = E_{q_F}[log π(X, θ)] + H̃[q_F]
is concave in F.
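A rough sketch of the Gaussian-motivated CMC weighting described above: each shard receives a weight matrix proportional to the inverse of its sub-posterior sample covariance, and aligned draws are combined with these weights. The function is illustrative, not the authors' code; VCMC would instead treat the weight matrices as free parameters F and maximize the relaxed objective L̃(F).

```python
import numpy as np

def cmc_aggregate(sub_samples):
    """Consensus Monte Carlo-style aggregation.

    sub_samples: list of J arrays of shape (S, d), aligned by draw index.
    Returns an (S, d) array: theta_t = (sum_j W_j)^(-1) sum_j W_j theta_{j,t},
    with W_j the inverse sample covariance of shard j's draws."""
    weights = [np.linalg.inv(np.atleast_2d(np.cov(s, rowvar=False)))
               for s in sub_samples]
    w_sum_inv = np.linalg.inv(sum(weights))
    combined = sum(W @ s.T for W, s in zip(weights, sub_samples))  # (d, S)
    return (w_sum_inv @ combined).T                                # (S, d)
```

In the two-shard scalar cartoon above, this recovers weights like 0.58/0.42: the shard whose sub-posterior is tighter receives more weight.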
Empirical Evaluation
- Compare three aggregation strategies:
  - uniform average
  - Gaussian-motivated weighted average (CMC)
  - optimized weighted average (VCMC)
- For each algorithm A, report the approximation error of some expectation E_π[f], relative to serial MCMC:
      ε_A(f) = |E_A[f] − E_MCMC[f]| / |E_MCMC[f]|
  (a short sketch of this metric appears at the end of this transcription).

Preliminary Speedup Results
Example 1: High-dimensional Bayesian probit regression; #data = 100,000, d = 300.
[Figure: first-moment estimation error, relative to serial MCMC; error truncated at 2.0]
Example 2: High-dimensional covariance estimation; normal-inverse-Wishart model; #data = 100,000, #dim = 100, hence 5,050 parameters.
[Figure: (L) first-moment estimation error; (R) eigenvalue estimation error]
Example 3: Mixture of eight 8-dimensional Gaussians; error relative to serial MCMC, for the cluster co-membership probabilities of pairs of test data points.
- VCMC reduces the CMC error at the cost of speedup (~2x).
- VCMC speedup is approximately linear.
[Figure: speedup curves for CMC and VCMC]

Discussion
Contributions:
- Convex optimization framework for Consensus Monte Carlo
- Structured aggregation accounting for constrained parameters
- Entropy relaxation
- Empirical evaluation
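Finally, the relative-error metric ε_A(f) referenced in the empirical evaluation is straightforward to estimate from samples; a minimal sketch, with sample means standing in for the expectations and illustrative names:

```python
import numpy as np

def relative_error(f_vals_algo, f_vals_mcmc):
    """Relative error |E_A[f] - E_MCMC[f]| / |E_MCMC[f]|, with each
    expectation estimated by the mean of f over that method's samples."""
    e_algo = np.mean(f_vals_algo, axis=0)
    e_mcmc = np.mean(f_vals_mcmc, axis=0)
    return np.abs(e_algo - e_mcmc) / np.abs(e_mcmc)
```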