Bayesian Inference in the Sample Selection and Two-Part Models
Martijn van Hasselt
Department of Economics
The University of Western Ontario
February 2008

Abstract

This paper considers two models for an outcome variable that contains a large fraction of zeros: the sample selection model and the two-part model. Whereas the sample selection model allows correlation between the selection and outcome mechanisms, the two-part model assumes conditional independence. Using a fully parametric Bayesian approach, we present Markov Chain Monte Carlo (MCMC) algorithms for the model parameters that are based on data augmentation. With a Gaussian specification of the likelihood the models are, in some sense, nested. In order to determine which model is more appropriate, inference can focus on the correlation coefficient. Alternatively, a Bayes factor can be computed. The Bayesian semiparametric approach uses flexible families of distributions. In particular, we present MCMC schemes based on mixtures of normal distributions and Dirichlet process priors. The various methods are illustrated using simulated data.

This is still a work in progress and feedback is appreciated. I thank Tony Lancaster for his guidance and William McCausland for helpful comments and suggestions. All errors are, of course, my own. Contact: mvanhass@uwo.ca

1 Introduction

In many economic applications a nonnegative outcome variable of interest is characterized by a certain fraction of observations clustered at zero and a distribution of positive values that is highly skewed. Examples are consumer expenditures on durable goods and labor supply functions (hours worked). Our paper considers two models that are commonly used in the literature to analyze this type of data. The essential difference between these models is how they interpret a zero value of the outcome.
In the first model, which we will refer to as a sample selection model (SSM), the data generating process is split up into two stages. The first stage describes the selection mechanism, which determines whether we see a positive outcome or not. Usually, the selection mechanism takes the form of a structural equation for an underlying latent variable such as utility. If the latent variable falls below a certain threshold, the outcome variable is zero; if it exceeds this threshold a positive outcome is observed. The second stage determines the level of the outcome. If the first stage dictates that the outcome should be positive we observe the level determined in the second stage. Otherwise, we observe a zero. Thus, the zeros represent missing data: we do not observe what the positive outcome would have been.

As an example, consider a consumer who is considering the purchase of a car. Depending on her income, preferences, the availability of public transportation and the travel distance to work, she will first decide on whether to buy a car at all. If not, we observe zero expenditures and the potential outcome is unknown. On the other hand, if she does decide to buy, the potential expenditures are realized (i.e. observed) and may vary depending on, say, income, travel distance and individual tastes.

Sample selection models, such as the one used in this paper, typically describe potential outcomes which are only partially observed. In this case the observed positive values of the outcome variable follow a pattern that is derived from the latent structure. One could wonder whether the potential outcome is an interesting quantity to begin with. Regardless, as a modeling device the SSM can be used to describe actual outcomes as well.

More generally, sample selection occurs when the data at hand are not a random sample from the population. When making inferences it is then important to understand the process that generated the sample.
Individuals may select themselves into (or out of) the sample based on a combination of observable quantities and unobserved heterogeneity. If this heterogeneity also affects the outcome variable, inferences using the selected sample alone may be subject to selection bias. Early contributions to the sample selection literature are, among others, Gronau (1974) and Heckman (1979). Gronau's paper analyzes the potential for selection bias in the labor market when observed wages are used to make inference about the distribution of wage offers. Heckman (1979) treats sample selection as a specification error and proposes a two-step estimator that corrects for omitted variable bias. Since our paper takes a Bayesian approach we will not discuss the frequentist literature any further.[1]

The second model in this paper is a two-part model (2PM).[2] One of the first discussions of the 2PM goes back to Cragg (1971). As in the sample selection model, two stages are distinguished in the outcome generating process: a selection mechanism and an outcome level function. The main difference with the SSM is that the 2PM models the observed outcomes directly, rather than specifying a latent structure. In this framework a zero does not represent missing data, but can be interpreted as a corner solution in a consumer optimization problem. An application can be found in Duan et al. (1983), who use the two-part model to analyze individuals' medical expenditures.

There has been some debate in the health literature as to which model is more appropriate for describing medical expenditures. Duan et al. (1983) argue that the 2PM is to be preferred since it models actual as opposed to potential outcomes.

[1] Excellent frequentist surveys are Lee (2003) and Vella (1998). The latter focuses on semiparametric models.
[2] The term hurdle model is also used. Wooldridge (2002) and Cameron and Trivedi (2005) contain good discussions of the 2PM and related models.
Whether we are interested in actual or potential outcomes depends on the particular application at hand. Regardless, the SSM can also be used to analyze actual outcomes because the latent structure implies a model for the observed data. Another claim, e.g. Hay and Olsen (1984), has been that in the parametric (Gaussian) case the 2PM is actually nested within the SSM. This is shown to be incorrect in Duan et al. (1984). Moreover, the model parameters are not directly comparable because they have a different interpretation in each model. Marginal effects on the observed outcomes are directly available in the 2PM, whereas in the SSM they involve some nonlinear transformation of the model parameters and are typically covariate dependent.

The aforementioned debate has prompted many authors to compare the SSM and 2PM. Manning, Duan, and Rogers (1987) conduct an extensive Monte Carlo study and find that overall the 2PM performs very well in terms of predictive ability, even if the SSM is the true model generating the data. Given a joint normality assumption the 2PM is observationally equivalent to the SSM when the cross-equation correlation is zero. In principle the null hypothesis that the 2PM is the true model can then be tested via a classical t-test. Leung and Yu (1996) present simulation evidence suggesting that such a test may perform poorly due to near multicollinearity. Dow and Norton (2003) propose a test based on the difference in empirical mean squared error between the two models.[3]

Our paper takes a Bayesian approach to estimating the sample selection and two-part models. The main contribution is to provide sampling algorithms to approximate the posterior distributions of interest. We start by considering inference in a fully parametric Bayesian model based on multivariate normal distributions. This analysis is then extended to allow for more flexible families of distributions.
In particular, the focus is on mixtures of normal distributions with either a fixed or random number of mixture components. We will refer to this case as a semiparametric Bayesian model.

[3] There are some problems with this test as well. First, the choice of null hypothesis is arbitrary and second, the test suffers from the same power problems as the t-test.

Bayesian inference in limited dependent variable models can proceed by a combination of Gibbs sampling (e.g. Casella and George 1992) and data augmentation (e.g. Tanner and Wong 1987). These methods are useful when the joint posterior is either analytically intractable or difficult to sample from. Gibbs sampling entails generating consecutive draws from the conditional posterior of each parameter, given the remaining ones. Data augmentation treats missing observations as additional parameters and samples new values as part of the algorithm.[4] The combined algorithm yields a Markov chain that, under some regularity conditions, has the posterior as its invariant distribution. Parameter values generated by the chain are then an approximate sample from the posterior. An excellent treatment of Markov Chain Monte Carlo (MCMC) methods can be found in Gilks, Richardson, and Spiegelhalter (1996).

Applications of MCMC are becoming widespread. Discrete choice models are treated in Albert and Chib (1993), McCulloch and Rossi (1994) and McCulloch, Polson, and Rossi (2000). The analysis of the parametric sample selection model in our paper is most closely related to Li (1998), Huang (2001) and Munkin and Trivedi (2003), the common element being a simultaneous equations structure combined with a limited range of the dependent variable. The extension to semiparametric models draws on a large body of literature on nonparametric Bayesian methods and incorporates it into a sample selection framework.[5]

The remainder of this paper is organized as follows.
Section 2 presents the SSM and two Gibbs sampling algorithms based on a bivariate normal likelihood. In section 3 we construct a Gibbs sampler for the 2PM and show how the output may be used to compute Bayes factors for the purpose of model selection and hypothesis testing. Section 4 discusses a simulation example. Sections 5 and 6 present semiparametric versions of the SSM and construct the corresponding MCMC algorithms. Finally, section 7 concludes. Some additional details on the various algorithms can be found in the appendix.

[4] Thus, data augmentation can be viewed as stochastic imputation of missing values.
[5] To quote Müller and Quintana (2004), "Nonparametric Bayesian inference is an oxymoron and a misnomer". The term nonparametric refers to the fact that these methods bear some resemblance to classical nonparametric techniques, such as kernel smoothing.

2 The Sample Selection Model

We use the following version of the SSM, which is sometimes referred to as a type 2 Tobit model (e.g. Amemiya 1985, ch. 10):
\[
\begin{aligned}
I_i &= x_{i1}'\beta_1 + u_{i1}, \qquad s_i = I\{I_i > 0\},\\
m_i &= x_{i2}'\beta_2 + u_{i2},\\
\ln y_i &= \begin{cases} m_i & \text{if } s_i = 1,\\ -\infty & \text{if } s_i = 0. \end{cases}
\end{aligned}
\tag{2.1}
\]
The subscript $i$ denotes the $i$th observation in a sample of size $n$. The vectors $x_{i1}$ and $x_{i2}$ have $k_1$ and $k_2$ elements, respectively. The equation for $I_i$ is a selection equation: if $I_i > 0$, then a positive outcome $y_i$ is observed; $I_i \le 0$ corresponds to $y_i = 0$. The variable $s_i$ is simply the indicator of a positive outcome. The equation for $m_i$ represents the logarithm of the potential outcomes. Potential outcomes are realized only when $s_i = 1$. Thus, $m$ is a partially observed, partially latent variable. If the outcome $y_i$ is zero, and hence $s_i = 0$, then $m_i$ is unobserved. On the other hand, if $y_i$ is positive and $s_i = 1$, the potential outcome $m_i$ $(= \ln y_i)$ is realized.[6][7] To summarize, we observe the vectors $(x_{i1}', x_{i2}', s_i)$ for all $i$ and the values $m_i$ that belong to the set $\{m_i : s_i = 1\}$.
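To make the observation scheme in (2.1) concrete, the following is a minimal sketch that simulates data from the model under bivariate normal errors. The function name `simulate_ssm`, the design (an intercept plus one U(0, 3) regressor per equation) and the parameter values are illustrative assumptions, not part of the paper.

```python
import numpy as np

def simulate_ssm(n, beta1, beta2, sigma1, sigma2, rho, rng):
    """Simulate one sample from the type 2 Tobit model (2.1)."""
    # Hypothetical design: intercept plus one U(0, 3) regressor per equation.
    X1 = np.column_stack([np.ones(n), rng.uniform(0.0, 3.0, n)])
    X2 = np.column_stack([np.ones(n), rng.uniform(0.0, 3.0, n)])
    cov = rho * sigma1 * sigma2
    Sigma = np.array([[sigma1**2, cov], [cov, sigma2**2]])
    u = rng.multivariate_normal(np.zeros(2), Sigma, size=n)
    I = X1 @ beta1 + u[:, 0]              # latent selection variable
    m = X2 @ beta2 + u[:, 1]              # log potential outcome
    s = (I > 0).astype(int)               # indicator of a positive outcome
    m_obs = np.where(s == 1, m, np.nan)   # m_i is realized only when s_i = 1
    y = np.where(s == 1, np.exp(m), 0.0)  # observed outcome, zero if not selected
    return X1, X2, s, m_obs, y

rng = np.random.default_rng(0)
X1, X2, s, m_obs, y = simulate_ssm(
    500, np.array([-0.5, 1.0]), np.array([1.0, 1.0]), 1.0, 1.0, 0.5, rng
)
print("fraction of zeros:", 1.0 - s.mean())
```

The analyst only sees `X1`, `X2`, `s` and the non-missing entries of `m_obs`; the latent `I` and the unrealized values of `m` are discarded, which is exactly what data augmentation later has to impute.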
For the parametric Bayesian analysis of this model it is assumed that the joint distribution of $u_{i1}$ and $u_{i2}$ is bivariate normal:
\[
\begin{pmatrix} u_{i1} \\ u_{i2} \end{pmatrix} \sim N(0, \Sigma), \qquad
\Sigma = \begin{bmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{bmatrix},
\tag{2.2}
\]
where $\rho$ is the correlation coefficient. The random variable $s_i$ has a Bernoulli distribution with
\[
\Pr\{s_i = 1 \mid x_{i1}, \beta_1, \sigma_1\} = \Phi(x_{i1}'\beta_1/\sigma_1),
\]
where $\Phi(\cdot)$ denotes the CDF of the standard normal distribution. Though $\beta_1$ and $\sigma_1$ are not separately identified, we retain both parameters for reasons explained shortly. The likelihood for the $n$ observations can be written as
\[
p_{SSM}(\ln y \mid \beta_1, \beta_2, \Sigma) = \prod_{i=1}^{n} \left[\Phi(x_{i1}'\beta_1/\sigma_1)\right]^{s_i} \left[1 - \Phi(x_{i1}'\beta_1/\sigma_1)\right]^{1-s_i}
\times \prod_{i: y_i > 0} p_{u_2 \mid I > 0}\left(\ln y_i - x_{i2}'\beta_2\right),
\tag{2.3}
\]
where $p_{u_2 \mid I>0}$ is the density of $u_{i2}$ conditional on $I_i > 0$. Let $f_N(a \mid b, c)$ and $F_N(a \mid b, c)$ denote the density and CDF, respectively, of a normal random variable with mean $b$ and variance $c$, evaluated at the point $a$. Let $\phi(\cdot)$ denote the standard normal density function and define $\tilde u_i = \ln y_i - x_{i2}'\beta_2$, so that
\[
\begin{aligned}
p_{u_2 \mid I>0}(\tilde u_i)
&= \frac{\int_0^\infty p_{u_2, I}(\tilde u_i, I)\, dI}{P(I > 0)} \\
&= \frac{p_{u_2}(\tilde u_i)}{\Phi(x_{i1}'\beta_1/\sigma_1)} \int_0^\infty p_{I \mid u_2}(I \mid \tilde u_i)\, dI \\
&= \frac{f_N(\tilde u_i \mid 0, \sigma_2^2)}{\Phi(x_{i1}'\beta_1/\sigma_1)} \left[1 - F_N\!\left(0 \,\Big|\, x_{i1}'\beta_1 + \rho(\sigma_1/\sigma_2)\tilde u_i,\; \sigma_1^2(1-\rho^2)\right)\right] \\
&= \frac{\sigma_2^{-1}\phi(\tilde u_i/\sigma_2)}{\Phi(x_{i1}'\beta_1/\sigma_1)}\; \Phi\!\left(\frac{x_{i1}'\beta_1/\sigma_1 + (\rho/\sigma_2)\tilde u_i}{\sqrt{1-\rho^2}}\right).
\end{aligned}
\]
Plugging this back into (2.3) the likelihood becomes
\[
p_{SSM}(\ln y \mid \beta_1, \beta_2, \Sigma) = \prod_{i: y_i = 0} \left[1 - \Phi(x_{i1}'\beta_1/\sigma_1)\right]
\times \prod_{i: y_i > 0} \frac{1}{\sigma_2}\,\phi\!\left(\frac{\ln y_i - x_{i2}'\beta_2}{\sigma_2}\right) \Phi\!\left(\frac{x_{i1}'\beta_1/\sigma_1 + (\rho/\sigma_2)(\ln y_i - x_{i2}'\beta_2)}{\sqrt{1-\rho^2}}\right).
\tag{2.4}
\]

[6] This version of the sample selection model is widely used in the literature; see, for example, Lee (2003).
[7] The logarithmic transform is very common in this type of model. A discussion of its rationales can be found in Manning (1998).

Gibbs Sampling in the SSM

By inspection of the likelihood (2.4) it appears that no choice of prior for $(\beta_1, \beta_2, \Sigma)$ will yield a tractable posterior distribution. We therefore develop a Gibbs sampling algorithm that simulates draws from the posterior distribution of $(\beta_1, \beta_2, \Sigma)$. The updating step for $\Sigma$ generates new values for $\sigma_1^2$, $\sigma_2^2$ and the covariance $\sigma_{12}$. The implied value of $\rho$ is then computed as $\rho = \sigma_{12}/(\sigma_1\sigma_2)$. Our Gibbs sampler involves the unidentified parameters $(\beta_1, \sigma_1)$. The sampled values of $\beta_1$ and $\sigma_1$ are therefore not informative, in the sense that there is no updating of the prior.[8] The output from the algorithm, however, can be used to approximate the posterior of identified parameters such as $\beta_1/\sigma_1$ and $\rho$. We follow the approach of McCulloch and Rossi (1994), who apply this idea in the context of the multinomial Probit model. The main advantage of retaining the unidentified parameters is that it preserves the natural conjugacy structure in the model, and allows for an easier approximation of the posterior.

[8] When the prior is improper, the generated values of $(\beta_1, \sigma_1)$ are a random walk; see McCulloch and Rossi (1994).

We now turn to the Gibbs samplers for the parametric SSM. Since only the selection indicator $s_i$ is observed, the variable $I_i$ is latent and treated as an additional parameter in the algorithm. The same can be said about the unobserved values of $m_i$. Data augmentation 'completes' the data and generates a sequence of $(\beta_1, \beta_2, \Sigma, I, m)$ values that are approximately drawn from their joint posterior. Discarding the values of $(I, m)$ we then obtain a sample (again, approximately) from the posterior distribution of $(\beta_1, \beta_2, \Sigma)$. In what follows all conditional distributions are to be understood as conditional on the data as well. This conditioning is omitted for notational simplicity.

The SSM in (2.1) can be written as a SUR model. Let $I = (I_1, \ldots, I_n)'$, $m = (m_1, \ldots, m_n)'$, $u_1 = (u_{11}, \ldots, u_{n1})'$ and $u_2 = (u_{12}, \ldots, u_{n2})'$ be $n \times 1$ vectors. Define the following matrices:
\[
X_1 = \begin{bmatrix} x_{11}' \\ \vdots \\ x_{n1}' \end{bmatrix} : n \times k_1, \qquad
X_2 = \begin{bmatrix} x_{12}' \\ \vdots \\ x_{n2}' \end{bmatrix} : n \times k_2, \qquad
X = \begin{bmatrix} X_1 & 0 \\ 0 & X_2 \end{bmatrix} : 2n \times (k_1 + k_2),
\]
\[
W = \begin{bmatrix} I \\ m \end{bmatrix} : 2n \times 1, \qquad
\beta = \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix} : (k_1 + k_2) \times 1, \qquad
u = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix} : 2n \times 1.
\]
Then $W = X\beta + u$, where $E(u) = 0$ and $V(u) = \Sigma \otimes I_n$.
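Although the posterior is not tractable, the likelihood (2.4) derived earlier in this section is straightforward to evaluate numerically. The sketch below computes its logarithm; the function name `ssm_loglik`, the use of `statistics.NormalDist` for the normal CDF, and the small toy data set are assumptions made for illustration only.

```python
import numpy as np
from statistics import NormalDist

# Standard normal CDF applied elementwise (otypes allows empty inputs).
_Phi = np.vectorize(NormalDist().cdf, otypes=[float])

def ssm_loglik(ln_y, s, X1, X2, beta1, beta2, sigma1, sigma2, rho):
    """Log of the SSM likelihood (2.4); ln_y is only read where s == 1."""
    a = X1 @ beta1 / sigma1                          # x_i1' beta_1 / sigma_1
    pos = s == 1
    # Zero outcomes contribute 1 - Phi(x_i1' beta_1 / sigma_1).
    ll_zero = np.sum(np.log(1.0 - _Phi(a[~pos])))
    # Positive outcomes contribute the density of ln y_i times a selection term.
    z = (ln_y[pos] - X2[pos] @ beta2) / sigma2
    log_phi = -0.5 * np.log(2.0 * np.pi) - 0.5 * z**2
    arg = (a[pos] + rho * z) / np.sqrt(1.0 - rho**2)
    ll_pos = np.sum(-np.log(sigma2) + log_phi + np.log(_Phi(arg)))
    return ll_zero + ll_pos

# Hypothetical toy data: four observations, two of them zeros.
X1 = np.array([[1.0, 0.2], [1.0, 1.5], [1.0, 2.4], [1.0, 0.9]])
X2 = np.array([[1.0, 0.7], [1.0, 2.1], [1.0, 1.1], [1.0, 0.3]])
s = np.array([0, 1, 1, 0])
ln_y = np.array([np.nan, 2.9, 2.3, np.nan])
b1 = np.array([-0.5, 1.0])
b2 = np.array([1.0, 1.0])
ll = ssm_loglik(ln_y, s, X1, X2, b1, b2, 1.0, 1.0, 0.5)
print(ll)
```

Only the ratio $\beta_1/\sigma_1$ enters the function, which mirrors the identification remark in the text: scaling `b1` and `sigma1` by the same constant leaves the value unchanged.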
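The likelihood of the normal SUR model is
\[
\begin{aligned}
p(W \mid \beta, \Sigma) &\propto |\Sigma|^{-n/2} \exp\left\{-\tfrac{1}{2}(W - X\beta)'(\Sigma^{-1} \otimes I_n)(W - X\beta)\right\} \\
&\propto |\Sigma|^{-n/2} \exp\left\{-\tfrac{1}{2}\,\mathrm{tr}\!\left(\Sigma^{-1}B\right)\right\},
\end{aligned}
\tag{2.5}
\]
where $\mathrm{tr}(\cdot)$ is the trace of a square matrix and $B$ is defined as
\[
B = \begin{bmatrix}
(I - X_1\beta_1)'(I - X_1\beta_1) & (I - X_1\beta_1)'(m - X_2\beta_2) \\
(m - X_2\beta_2)'(I - X_1\beta_1) & (m - X_2\beta_2)'(m - X_2\beta_2)
\end{bmatrix}.
\tag{2.6}
\]
Starting with the conditional posterior of $\beta$, note that $p(\beta \mid I, m, \Sigma, s) = p(\beta \mid I, m, \Sigma)$ because $s$ is a function of $I$. The likelihood in (2.5) can be rewritten as
\[
p(W \mid \beta, \Sigma) \propto |\Sigma|^{-n/2} \exp\left\{-\tfrac{1}{2}\left[e'S^{-1}e + (\beta - \hat\beta)'X'S^{-1}X(\beta - \hat\beta)\right]\right\},
\]
\[
e = W - X\hat\beta, \qquad S^{-1} = \Sigma^{-1} \otimes I_n, \qquad \hat\beta = (X'S^{-1}X)^{-1}X'S^{-1}W.
\]
Combining this with a normal $N(\beta_0, D_0)$ prior for $\beta$, the posterior is again normal with mean and variance given by
\[
E(\beta \mid W, \Sigma) = \left[D_0^{-1} + X'S^{-1}X\right]^{-1}\left[D_0^{-1}\beta_0 + X'S^{-1}X\hat\beta\right],
\tag{2.7}
\]
\[
V(\beta \mid W, \Sigma) = \left[D_0^{-1} + X'S^{-1}X\right]^{-1}.
\tag{2.8}
\]
To sample $(I_i, m_i)$ we need to distinguish two cases: $s_i = 0$ and $s_i = 1$. Suppose first that $s_i = 1$, so that $m_i$ is observed and $I_i > 0$. From (2.2) it follows that $I_i$, conditional on $m_i$ and $I_i > 0$, has a normal distribution with mean $x_{i1}'\beta_1 + \rho\sigma_1\sigma_2^{-1}(m_i - x_{i2}'\beta_2)$ and variance $\sigma_1^2(1-\rho^2)$, truncated from below at zero:
\[
p(I_i \mid s_i = 1, m_i, \beta_1, \beta_2, \Sigma) = N\!\left(x_{i1}'\beta_1 + \rho\sigma_1\sigma_2^{-1}(m_i - x_{i2}'\beta_2),\; \sigma_1^2(1-\rho^2)\right) I\{I_i > 0\}.
\tag{2.9}
\]
If $s_i = 0$ then it is known that $I_i \le 0$ but the actual values $(I_i, m_i)$ are not observed. A value of $I_i$ can be generated from the $N(x_{i1}'\beta_1, \sigma_1^2)$ distribution truncated from above at zero.[9] The value of $m_i$ is a realization of its conditional distribution given $I_i$:[10]
\[
p(I_i \mid s_i = 0, \beta_1, \beta_2, \Sigma) = N\!\left(x_{i1}'\beta_1,\; \sigma_1^2\right) I\{I_i \le 0\},
\tag{2.10}
\]
\[
p(m_i \mid I_i, \beta_1, \beta_2, \Sigma) = N\!\left(x_{i2}'\beta_2 + \rho\sigma_2\sigma_1^{-1}(I_i - x_{i1}'\beta_1),\; \sigma_2^2(1-\rho^2)\right).
\tag{2.11}
\]

[9] All draws from truncated normal distributions can easily be obtained through the inverse c.d.f. method, e.g. Lancaster (2004, p. 190-191).

Finally it remains to find the conditional posterior of $\Sigma$. By inspection of the SUR likelihood (2.5) it can be seen that the inverse Wishart distribution is the natural conjugate prior.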
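The truncated normal draws in (2.9) and (2.10) rely on the inverse c.d.f. method. As a rough sketch of that method (the helper name `draw_truncated_normal` and the numerical guard are assumptions, and the approach becomes unreliable when the truncation region lies far in the tail):

```python
import random
from statistics import NormalDist

_STD = NormalDist()

def draw_truncated_normal(mean, sd, lower, upper, rng):
    """Inverse-CDF draw from N(mean, sd^2) restricted to (lower, upper).

    Pass None for an unbounded side.
    """
    p_lo = _STD.cdf((lower - mean) / sd) if lower is not None else 0.0
    p_hi = _STD.cdf((upper - mean) / sd) if upper is not None else 1.0
    # Uniform draw on (p_lo, p_hi), mapped back through the normal quantile.
    p = p_lo + rng.random() * (p_hi - p_lo)
    # Guard: inv_cdf is undefined at exactly 0 or 1.
    p = min(max(p, 1e-15), 1.0 - 1e-15)
    return mean + sd * _STD.inv_cdf(p)

rng = random.Random(1)
# s_i = 1: truncated from below at zero, as in (2.9).
draws_pos = [draw_truncated_normal(0.3, 1.0, 0.0, None, rng) for _ in range(1000)]
# s_i = 0: truncated from above at zero, as in (2.10).
draws_neg = [draw_truncated_normal(0.3, 1.0, None, 0.0, rng) for _ in range(1000)]
print(min(draws_pos), max(draws_neg))
```

In a full sampler the `mean` and `sd` arguments would be the conditional moments stated in (2.9) or (2.10) for each observation.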
If $\Sigma$ has an inverse Wishart distribution with parameter matrix $H$ and degrees of freedom $v$, we will write $\Sigma \sim W^{-1}(H, v)$. The density of $\Sigma$ is given by
\[
p(\Sigma \mid H, v) \propto |\Sigma|^{-(v+3)/2} \exp\left\{-\tfrac{1}{2}\,\mathrm{tr}\!\left(\Sigma^{-1}H\right)\right\}, \qquad v \ge 2,
\]
and the prior mean (e.g. Muirhead, page 97) is $(v-3)^{-1}H$. Multiplying this density with the SUR likelihood we get
\[
p(\Sigma \mid \beta, I, m) \propto |\Sigma|^{-(n+v+3)/2} \exp\left\{-\tfrac{1}{2}\,\mathrm{tr}\!\left(\Sigma^{-1}(B + H)\right)\right\},
\tag{2.12}
\]
where $B$ was defined in (2.6). Thus the conditional posterior of $\Sigma$ is $W^{-1}(B + H, n + v)$. The Gibbs sampler can now be summarized as follows:

Algorithm 1 (Unidentified Parameters) For given starting values of $\{\beta_1, \beta_2, \Sigma, I\}$ and $\{m_i : y_i = 0\}$:
1. sample $(\beta_1, \beta_2)$ from a multivariate normal with mean (2.7) and variance (2.8);
2. if $s_i = 1$, sample $I_i$ from (2.9); if $s_i = 0$, sample $I_i$ from (2.10) and $m_i$ from (2.11);
3. sample $\Sigma$ from (2.12);
4. return to step 1 and repeat.

In order to execute step 3 of the algorithm, note that by definition of the inverse Wishart distribution the precision matrix $P$ $(= \Sigma^{-1})$ has a conditional posterior Wishart distribution with parameter matrix $(B + H)^{-1}$ and $(n + v)$ degrees of freedom. A draw of $\Sigma$ can then be obtained by first generating $(n + v)$ vectors $z_j$ from the $N_2(0, (B + H)^{-1})$ distribution and computing $\Sigma = \left(\sum_{j=1}^{n+v} z_j z_j'\right)^{-1}$.

[10] When $s_i = 0$ we could also generate $m_i$ first and then $I_i$ from its conditional distribution given $m_i$, right-truncated at zero.

Algorithm 1 yields a realization of a Markov chain that is informative about the posterior distribution of $(\beta_1/\sigma_1, \beta_2, \sigma_2^2, \rho)$. A disadvantage of using unidentified parameters is that it may be difficult to choose appropriate priors for $\beta_1/\sigma_1$ and $\rho$.[11] Our algorithm cannot be trivially modified to satisfy the restriction $\sigma_1 = 1$.[12] For that reason we also adopt the approach of McCulloch, Polson, and Rossi (2000).

[11] In other words, the induced prior of $(\beta_1/\sigma_1, \rho)$ needs to be checked to ensure it appropriately reflects the researcher's beliefs.
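Step 3 of Algorithm 1 can be sketched directly from the outer-product construction described above. The function name `draw_sigma` and the numerical values of `B`, `H`, `n` and `v` below are hypothetical and only serve to exercise the code.

```python
import numpy as np

def draw_sigma(B, H, n, v, rng):
    """One draw of Sigma from W^{-1}(B + H, n + v), as in (2.12)."""
    scale = np.linalg.inv(B + H)                  # (B + H)^{-1}
    scale = 0.5 * (scale + scale.T)               # remove rounding asymmetry
    # n + v independent N_2(0, (B + H)^{-1}) vectors, stacked as rows.
    z = rng.multivariate_normal(np.zeros(2), scale, size=n + v)
    P = z.T @ z                                   # sum_j z_j z_j', a Wishart draw
    return np.linalg.inv(P)                       # Sigma = P^{-1}

rng = np.random.default_rng(0)
B = np.array([[480.0, 230.0], [230.0, 510.0]])    # hypothetical residual cross-products
H = np.eye(2)                                     # hypothetical prior parameter matrix
draws = np.array([draw_sigma(B, H, 500, 5, rng) for _ in range(300)])
# Implied correlation for each draw: rho = sigma_12 / (sigma_1 * sigma_2).
rho = draws[:, 0, 1] / np.sqrt(draws[:, 0, 0] * draws[:, 1, 1])
print(draws.mean(axis=0), rho.mean())
```

The last two lines illustrate how the unidentified draws are mapped into the identified quantity $\rho$ after each update.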
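In essence this entails a reparameterization of the model and placing priors on the identified parameters directly; see also Koop and Poirier (1997) and Li (1998) for applications of this idea to a regime switching model and a Tobit model with endogeneity, respectively. Given the bivariate normality of $(u_{i1}, u_{i2})$ and imposing the restriction $\sigma_1^2 = 1$, it follows that
\[
\xi^2 \equiv \mathrm{var}(u_{i2} \mid u_{i1}) = \sigma_2^2 - \sigma_{12}^2.
\]
The covariance matrix can now be written as
\[
\Sigma = \begin{bmatrix} 1 & \sigma_{12} \\ \sigma_{12} & \sigma_{12}^2 + \xi^2 \end{bmatrix},
\]
and the likelihood of the SSM becomes
\[
p_{SSM}(\ln y \mid \beta_1, \beta_2, \sigma_{12}, \xi^2) = \prod_{i: y_i = 0} \left[1 - \Phi(x_{i1}'\beta_1)\right]
\times \prod_{i: y_i > 0} \frac{1}{\sqrt{\sigma_{12}^2 + \xi^2}}\, \phi\!\left(\frac{\ln y_i - x_{i2}'\beta_2}{\sqrt{\sigma_{12}^2 + \xi^2}}\right)
\times \prod_{i: y_i > 0} \Phi\!\left(\frac{x_{i1}'\beta_1 + \sigma_{12}(\ln y_i - x_{i2}'\beta_2)/(\sigma_{12}^2 + \xi^2)}{\sqrt{1 - \sigma_{12}^2/(\sigma_{12}^2 + \xi^2)}}\right).
\tag{2.13}
\]
In order to generate draws of $(\sigma_{12}, \xi^2)$ in the Gibbs sampler, we need the conditional posterior $p(\sigma_{12}, \xi^2 \mid I, m, \beta_1, \beta_2)$. Given $(I, m, \beta_1, \beta_2)$, however, the errors $u_1$ and $u_2$ are known and $(\sigma_{12}, \xi^2)$ are the parameters of a normal linear regression model:
\[
u_{i2} = \sigma_{12} u_{i1} + \eta_i, \qquad \eta_i \sim N(0, \xi^2).
\]
Thus, the conditional posterior of interest satisfies
\[
\begin{aligned}
p(\sigma_{12}, \xi^2 \mid I, m, \beta_1, \beta_2) &= p(\sigma_{12}, \xi^2 \mid u_1, u_2, \beta_1, \beta_2) \\
&\propto p(u_1, u_2 \mid \sigma_{12}, \xi^2, \beta_1, \beta_2)\, \pi(\sigma_{12}, \xi^2 \mid \beta_1, \beta_2) \\
&\propto p(u_1, u_2 \mid \sigma_{12}, \xi^2)\, \pi(\sigma_{12}, \xi^2),
\end{aligned}
\]
where we take $(\sigma_{12}, \xi^2)$ a priori independent of $(\beta_1, \beta_2)$. The natural conjugate prior for $(\sigma_{12}, \xi^2)$ is of the normal-inverse gamma form. We say $\xi^2$ has a prior inverse gamma distribution with parameters $(c_0, d_0)$, written as $\xi^2 \sim IG(c_0, d_0)$, if $\xi^{-2}$ has a prior gamma distribution $G(c_0, d_0)$. Then
\[
\pi(\xi^2 \mid c_0, d_0) = \frac{d_0^{c_0}}{\Gamma(c_0)}\, (\xi^2)^{-(c_0+1)}\, e^{-d_0/\xi^2}.
\]
The conditional prior of $\sigma_{12}$ is given by
\[
\pi(\sigma_{12} \mid \xi^2, g, \tau) = N(g, \tau\xi^2).
\]

[12] Although $\Sigma$ has an inverse Wishart distribution, $\Sigma$ conditional on $\sigma_1 = 1$ does not. Nobile (2000) proposes a method to sample the remaining elements of $\Sigma$, conditional on the value of a diagonal element. The algorithm in our paper is based on McCulloch, Polson, and Rossi (2000).

The reason for prior dependence between $\sigma_{12}$ and $\xi^2$ is that the induced prior for the correlation coefficient $\rho$ can be made roughly uniform by an appropriate choice of $\tau$.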
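One way to inspect the induced prior on the correlation is to simulate it, using $\rho = \sigma_{12}/\sqrt{\sigma_{12}^2 + \xi^2}$, which follows from the covariance matrix above. The sketch below does this for a few values of $\tau$; the function name and all hyperparameter values are illustrative assumptions rather than recommendations.

```python
import numpy as np

def induced_rho_prior(c0, d0, g, tau, size, rng):
    """Simulate rho = sigma12 / sqrt(sigma12^2 + xi^2) under the prior."""
    # xi^2 ~ IG(c0, d0) means 1/xi^2 ~ Gamma(shape=c0, rate=d0).
    xi2 = 1.0 / rng.gamma(shape=c0, scale=1.0 / d0, size=size)
    # sigma12 | xi^2 ~ N(g, tau * xi^2).
    sigma12 = rng.normal(loc=g, scale=np.sqrt(tau * xi2))
    return sigma12 / np.sqrt(sigma12**2 + xi2)

rng = np.random.default_rng(0)
for tau in (0.1, 0.5, 2.0):                       # hypothetical values of tau
    rho = induced_rho_prior(c0=3.0, d0=1.0, g=0.0, tau=tau, size=20000, rng=rng)
    # Relative frequencies over ten equal-width bins on (-1, 1).
    freq, _ = np.histogram(rho, bins=10, range=(-1.0, 1.0))
    print(tau, np.round(freq / rho.size, 3))
```

Comparing the printed bin frequencies across values of `tau` gives a quick check of how flat the induced prior is; when $g = 0$ the ratio does not depend on $\xi^2$, so in that case only `tau` matters.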
Here $(c_0, d_0, g, \tau)$ is a set of hyperparameters. It is easy to show that the posteriors take the following form:
\[
\xi^2 \mid I, m, \beta_1, \beta_2, \sigma_{12} \sim IG(\tilde c, \tilde d), \qquad
\tilde c = c_0 + \frac{n+1}{2}, \qquad
\tilde d = d_0 + \frac{(\sigma_{12} - g)^2}{2\tau} + \frac{1}{2}(u_2 - \sigma_{12}u_1)'(u_2 - \sigma_{12}u_1),
\tag{2.14}
\]
and
\[
\sigma_{12} \mid I, m, \beta_1, \beta_2, \xi^2 \sim N\!\left(\frac{g/\tau + u_1'u_2}{\tau^{-1} + u_1'u_1},\; \frac{\xi^2}{\tau^{-1} + u_1'u_1}\right).
\tag{2.15}
\]
The Gibbs sampler with identified parameters, which is similar to Li's (1998) algorithm, can now be summarized as follows:

Algorithm 2 (Identified Parameters) For given starting values of $\{\beta_1, \beta_2, \sigma_{12}, \xi^2, I\}$ and $\{m_i : y_i = 0\}$:
1. sample $(\beta_1, \beta_2)$ from a multivariate normal with mean (2.7) and variance (2.8);
2. if $s_i = 1$, sample $I_i$ from (2.9); if $s_i = 0$, sample $I_i$ from (2.10) and $m_i$ from (2.11);
3. sample $\xi^2$ from (2.14) and $\sigma_{12}$ from (2.15);
4. return to step 1 and repeat.

3 The Two-Part Model

The version of the 2PM we use is
\[
\begin{aligned}
I_i &= x_{i1}'\beta_1 + \varepsilon_{i1}, \qquad s_i = I\{I_i > 0\},\\
\ln(y_i \mid s_i) &= \begin{cases} x_{i2}'\beta_2 + \varepsilon_{i2} & \text{if } s_i = 1,\\ -\infty & \text{if } s_i = 0. \end{cases}
\end{aligned}
\tag{3.1}
\]
For the parametric model we take
\[
\varepsilon_{i1} \sim N(0, \sigma_1^2), \qquad \varepsilon_{i2} \sim N(0, \sigma_2^2).
\]
The selection equation is the same as in the SSM: if $I_i > 0$, then $y_i > 0$ and the logarithm of $y_i$ is well-defined. If $I_i \le 0$ then $y_i = 0$. The main difference between the SSM and the 2PM concerns the errors $\varepsilon_{i2}$ and $u_{i2}$. In the sample selection model $u_{i2}$ is an error that corresponds to potential outcomes. Conditional on $I_i > 0$ the error then has a nonzero mean that depends on $\rho$ and $x_{i1}'\beta_1$. In contrast, $\varepsilon_{i2}$ only affects the logarithm of positive values of expenditures and by construction $E(\varepsilon_{i2} \mid I_i > 0) = 0$. The 2PM is silent about the joint distribution of $(\varepsilon_{i1}, \varepsilon_{i2})$ and assumes that conditional on $\varepsilon_{i1} > -x_{i1}'\beta_1$, the errors $\varepsilon_{i1}$ and $\varepsilon_{i2}$ are independent.[13] The likelihood of the 2PM is
\[
p_{2PM}(\ln y \mid \beta_1, \beta_2, \sigma_1, \sigma_2) = \prod_{i=1}^{n} \left[\Phi(x_{i1}'\beta_1/\sigma_1)\right]^{s_i}\left[1 - \Phi(x_{i1}'\beta_1/\sigma_1)\right]^{1-s_i}
\times \prod_{i: y_i > 0} \frac{1}{\sigma_2}\,\phi\!\left(\frac{\ln y_i - x_{i2}'\beta_2}{\sigma_2}\right).
\tag{3.2}
\]
By comparing (2.4) and (3.2) it is clear that the former reduces to the latter when $\rho = 0$. This suggests that in order to discriminate between the SSM and 2PM we can consider inference on the correlation coefficient.
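Returning to step 3 of Algorithm 2 in section 2, the two conjugate draws in (2.14) and (2.15) can be sketched as follows. The function name `draw_xi2_sigma12`, the hyperparameter values and the simulated residuals are assumptions chosen for illustration; in the actual sampler $u_1$ and $u_2$ would be recomputed from the current $(I, m, \beta_1, \beta_2)$ at every iteration.

```python
import numpy as np

def draw_xi2_sigma12(u1, u2, sigma12, c0, d0, g, tau, rng):
    """Draw xi^2 from (2.14) given sigma12, then sigma12 from (2.15) given xi^2."""
    n = u1.shape[0]
    resid = u2 - sigma12 * u1
    c_tilde = c0 + (n + 1) / 2.0
    d_tilde = d0 + (sigma12 - g) ** 2 / (2.0 * tau) + 0.5 * resid @ resid
    # IG(c, d) draw: the reciprocal of a Gamma(shape=c, rate=d) variate.
    xi2 = 1.0 / rng.gamma(shape=c_tilde, scale=1.0 / d_tilde)
    prec = 1.0 / tau + u1 @ u1
    mean = (g / tau + u1 @ u2) / prec
    sigma12_new = rng.normal(mean, np.sqrt(xi2 / prec))
    return xi2, sigma12_new

# Hypothetical residuals with sigma12 = 0.5 and xi^2 = 0.75 (so sigma_2^2 = 1).
rng = np.random.default_rng(0)
u1 = rng.standard_normal(2000)
u2 = 0.5 * u1 + np.sqrt(0.75) * rng.standard_normal(2000)
xi2, s12 = 1.0, 0.0
for _ in range(200):
    xi2, s12 = draw_xi2_sigma12(u1, u2, s12, c0=3.0, d0=1.0, g=0.0, tau=0.5, rng=rng)
print(xi2, s12, s12 / np.sqrt(s12**2 + xi2))
```

The last printed value is the implied correlation $\rho = \sigma_{12}/\sqrt{\sigma_{12}^2 + \xi^2}$ for the final draw.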
Gibbs Sampling in the 2PM

Approximating the posterior of $(\beta_1, \beta_2, \sigma_1^2, \sigma_2^2)$ in the two-part model is considerably easier because $\rho$ (or $\sigma_{12}$) is absent from the model. Moreover, the likelihood in (3.2) factors into a Probit term involving $(\beta_1, \sigma_1^2)$ and a Gaussian term involving $(\beta_2, \sigma_2^2)$. If we impose independence between $(\beta_1, \sigma_1^2)$ and $(\beta_2, \sigma_2^2)$ in the prior, this is carried over to the posterior. As a consequence we can sample $(\beta_1, \sigma_1^2)$ and $(\beta_2, \sigma_2^2)$ each in their own 'mini Gibbs sampler'. In the Probit part we impose the restriction $\sigma_1 = 1$, because it reduces the number of steps needed in the algorithm.

The Probit algorithm described here is due to Albert and Chib (1993). The parameters are $(I, \beta_1)$ and it remains to find the conditional posteriors $p(I \mid \beta_1, s)$ and $p(\beta_1 \mid I, s) = p(\beta_1 \mid I)$.[14] Since $I = X_1\beta_1 + \varepsilon_1$ and $\varepsilon_1 \sim N(0, I_n)$, it follows that
\[
p(I \mid \beta_1) = (2\pi)^{-n/2} \exp\left\{-\tfrac{1}{2}\left[e'e + (\beta_1 - \hat\beta_1)'X_1'X_1(\beta_1 - \hat\beta_1)\right]\right\},
\qquad e = I - X_1\hat\beta_1, \qquad \hat\beta_1 = (X_1'X_1)^{-1}X_1'I.
\]
Combining a normal $N(b_1, B_1)$ prior distribution for $\beta_1$ with the likelihood of $I$ given above, we get
\[
\beta_1 \mid I \sim N(\bar b_1, \bar B_1),
\tag{3.3}
\]
\[
\bar B_1 = \left[B_1^{-1} + X_1'X_1\right]^{-1}, \qquad
\bar b_1 = \left[B_1^{-1} + X_1'X_1\right]^{-1}\left[B_1^{-1}b_1 + X_1'X_1\hat\beta_1\right].
\]
Since $I_i \mid \beta_1$ has a normal distribution with mean $x_{i1}'\beta_1$ and unit variance, the distribution of $I_i$ given $\beta_1$ and $s_i$ is truncated normal:
\[
\begin{aligned}
p(I_i \mid \beta_1, s_i = 0) &= N(x_{i1}'\beta_1, 1)\, I\{I_i \le 0\}, \\
p(I_i \mid \beta_1, s_i = 1) &= N(x_{i1}'\beta_1, 1)\, I\{I_i > 0\}.
\end{aligned}
\tag{3.4}
\]

[13] This does not imply that $\varepsilon_{i1}$ and $\varepsilon_{i2}$ are independent; see Duan et al. (1984) for an example.
[14] The equality follows because $s$ is a function of $I$.

Inference on $(\beta_2, \sigma_2^2)$ uses only the subsample in which $y_i > 0$. Let $\ln y^+$, $X_2^+$ and $u_2^+ = \ln y^+ - X_2^+\beta_2$ all refer to this subsample of size $n^+$.
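The Probit part of the sampler, (3.3) and (3.4), can be sketched in a few lines. This is a minimal illustration under assumed choices: the function name `probit_gibbs`, the prior values, the simulated data and the inverse-CDF truncated normal draws with a clipping guard are not taken from the paper.

```python
import numpy as np
from statistics import NormalDist

_cdf = np.vectorize(NormalDist().cdf, otypes=[float])
_inv = np.vectorize(NormalDist().inv_cdf, otypes=[float])

def probit_gibbs(s, X1, b1, B1, n_iter, rng):
    """Data-augmentation sampler for the Probit part with sigma_1 = 1."""
    n, k = X1.shape
    B1_inv = np.linalg.inv(B1)
    post_var = np.linalg.inv(B1_inv + X1.T @ X1)   # posterior variance in (3.3)
    chol = np.linalg.cholesky(post_var)
    beta = np.zeros(k)
    out = np.empty((n_iter, k))
    for t in range(n_iter):
        mu = X1 @ beta
        # (3.4): I_i ~ N(mu_i, 1), restricted to (0, inf) if s_i = 1, else (-inf, 0].
        p0 = _cdf(-mu)                              # P(I_i <= 0 | beta)
        u = rng.uniform(size=n)
        p = np.where(s == 1, p0 + u * (1.0 - p0), u * p0)
        p = np.clip(p, 1e-12, 1.0 - 1e-12)          # keep inv_cdf well-defined
        I = mu + _inv(p)
        # (3.3): note that X1'X1 beta_hat simplifies to X1'I.
        post_mean = post_var @ (B1_inv @ b1 + X1.T @ I)
        beta = post_mean + chol @ rng.standard_normal(k)
        out[t] = beta
    return out

# Hypothetical data: intercept plus one U(0, 3) regressor.
rng = np.random.default_rng(0)
X1 = np.column_stack([np.ones(400), rng.uniform(0.0, 3.0, 400)])
beta_true = np.array([-0.5, 1.0])
s = (X1 @ beta_true + rng.standard_normal(400) > 0).astype(int)
chain = probit_gibbs(s, X1, np.zeros(2), 100.0 * np.eye(2), 500, rng)
print(chain[100:].mean(axis=0))
```

Because the posterior variance in (3.3) does not depend on $I$ when $\sigma_1 = 1$, it is computed once outside the loop; only the mean changes from one iteration to the next.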
If the priors are $\sigma_2^2 \sim IG(c_0, d_0)$ and $\beta_2 \sim N(b_2, B_2)$, then standard results for the linear model with normal errors yield
\[
\sigma_2^2 \mid \beta_2, \ln y^+ \sim IG(\tilde c, \tilde d), \qquad
\tilde c = c_0 + \frac{n^+}{2}, \qquad
\tilde d = d_0 + \frac{1}{2}\, u_2^{+\prime} u_2^+,
\tag{3.5}
\]
\[
\beta_2 \mid \sigma_2^2, \ln y^+ \sim N(\bar b_2, \bar B_2),
\tag{3.6}
\]
\[
\bar B_2 = \left[B_2^{-1} + \sigma_2^{-2} X_2^{+\prime}X_2^+\right]^{-1}, \qquad
\bar b_2 = \left[B_2^{-1} + \sigma_2^{-2} X_2^{+\prime}X_2^+\right]^{-1}\left[B_2^{-1}b_2 + \sigma_2^{-2} X_2^{+\prime}X_2^+\hat\beta_2\right], \qquad
\hat\beta_2 = \left(X_2^{+\prime}X_2^+\right)^{-1}X_2^{+\prime}\ln y^+.
\]
The Gibbs sampler in the 2PM can now be summarized as follows:

Algorithm 3 (Two-Part Model) For given starting values of $\{I, \beta_1, \beta_2, \sigma_2^2\}$:
1. sample $\beta_1$ from (3.3) and $I$ from (3.4);
2. sample $\sigma_2^2$ from (3.5) and $\beta_2$ from (3.6);
3. return to step 1 and repeat.

Bayes Factors

The Bayes factor provides a way to compare different models on the basis of their prior predictive distribution.[15] Suppose that two competing models, $M_1$ and $M_2$, are entertained to describe the outcome $y$. A model in this context consists of a prior distribution on the appropriate parameters and a likelihood for the data. Given the prior probabilities $\pi(M_1)$ and $\pi(M_2)$ of the two models, the posterior odds ratio is computed as
\[
\frac{p(M_1 \mid y)}{p(M_2 \mid y)} = \frac{p(y \mid M_1)}{p(y \mid M_2)} \cdot \frac{\pi(M_1)}{\pi(M_2)} = B_{12}\, \frac{\pi(M_1)}{\pi(M_2)},
\]
where $B_{12}$ is the Bayes factor of model 1 versus model 2. In other words, the Bayes factor transforms the prior odds ratio into the posterior odds ratio. The Bayes factor itself is the ratio of the prior predictive distributions or marginal likelihoods. Bayes factors larger than 1 indicate support for model $M_1$. Conversely, a value of $B_{12}$ much smaller than 1 suggests that model $M_2$ is more likely.

[15] Bayes factors have been researched extensively. The article by Kass and Raftery (1995) is an excellent survey.

In what follows we consider two ways to compute the Bayes factor. Let $M_1$ denote the SSM with the restriction $\rho = \sigma_{12} = 0$ imposed and $M_2$ the unrestricted SSM. In the parameterization of the model leading up to algorithm 2 (i.e.
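incorporating the identification constraint $\sigma_1^2 = 1$) the marginal likelihood $m_j \equiv p(\ln y \mid M_j)$ of model $M_j$ $(j = 1, 2)$ is given by
\[
m_j = \int p_j(\ln y \mid \beta, \xi^2, \sigma_{12})\, \pi_j(\beta, \xi^2, \sigma_{12})\, d\beta\, d\xi^2\, d\sigma_{12},
\]
where $p_j(\cdot \mid \cdot)$ and $\pi_j(\cdot)$ are the likelihood and prior of model $j$, respectively. Note that $p_1$ simply follows from $p_2$ by imposing the restriction $\sigma_{12} = 0$. As before, the prior $\pi_2$ for the unrestricted model is:
\[
\beta = (\beta_1', \beta_2')' \sim N(\beta_0, D_0), \qquad \xi^2 \sim IG(c_0, d_0), \qquad \sigma_{12} \mid \xi^2 \sim N(g, \tau\xi^2).
\]
In $M_1$ only the parameters $(\beta_1, \beta_2, \xi^2)$ appear, because $\sigma_{12} = 0$. It is therefore natural to take
\[
\pi_1(\beta, \xi^2) = N(\beta_0, D_0) \times IG(c_0, d_0) = \pi_2(\beta, \xi^2).
\]
Using the arguments in Verdinelli and Wasserman (1995), we can now find an expression for the Bayes factor:
\[
B_{12} = \frac{m_1}{m_2} = p(\sigma_{12} = 0 \mid y) \int \frac{p(\ln y \mid \beta, \xi^2, \sigma_{12} = 0)\, \pi_1(\beta, \xi^2)}{m_2\, p(\sigma_{12} = 0 \mid y)}\, d\beta\, d\xi^2.
\]
Plugging in the relation
\[
\frac{1}{p(\sigma_{12} = 0 \mid y)} = \frac{p(\beta, \xi^2 \mid y, \sigma_{12} = 0)}{p(\beta, \xi^2, \sigma_{12} = 0 \mid y)}
= \frac{p(\beta, \xi^2 \mid y, \sigma_{12} = 0)\, m_2}{p(\ln y \mid \beta, \xi^2, \sigma_{12} = 0)\, \pi_2(\beta, \xi^2, \sigma_{12} = 0)},
\]
we get
\[
\begin{aligned}
B_{12} &= p(\sigma_{12} = 0 \mid y) \int \frac{\pi_1(\beta, \xi^2)}{\pi_2(\beta, \xi^2, \sigma_{12} = 0)}\, p(\beta, \xi^2 \mid y, \sigma_{12} = 0)\, d\beta\, d\xi^2 \\
&= p(\sigma_{12} = 0 \mid y)\, E\!\left[\frac{\pi_1(\beta, \xi^2)}{\pi_2(\beta, \xi^2, \sigma_{12} = 0)}\right] \\
&= \frac{p(\sigma_{12} = 0 \mid y)}{\pi_2(\sigma_{12} = 0)}\, E\!\left[\frac{\pi_1(\beta, \xi^2)}{\pi_2(\beta, \xi^2 \mid \sigma_{12} = 0)}\right],
\end{aligned}
\tag{3.7}
\]
where the expectation is with respect to $p(\beta, \xi^2 \mid y, \sigma_{12} = 0)$. Note that in the special case where $\pi_1(\beta, \xi^2) = \pi_2(\beta, \xi^2 \mid \sigma_{12} = 0)$, the Bayes factor reduces to the ratio $p(\sigma_{12} \mid y)/\pi_2(\sigma_{12})$, evaluated at $\sigma_{12} = 0$. This simple form is usually referred to as the Savage-Dickey density ratio. Our choice of restricted prior is slightly different, so that we need Verdinelli and Wasserman's (1995) more general expression of the Bayes factor.

The expression in (3.7) can be simplified even further:
\[
\begin{aligned}
B_{12} &= p(\sigma_{12} = 0 \mid y)\, E\!\left[\frac{\pi_1(\beta, \xi^2)}{\pi_2(\sigma_{12} = 0 \mid \beta, \xi^2)\, \pi_2(\beta, \xi^2)}\right] \\
&= p(\sigma_{12} = 0 \mid y)\, E\!\left[\frac{1}{\pi_2(\sigma_{12} = 0 \mid \xi^2)}\right],
\end{aligned}
\tag{3.8}
\]
where the last line follows from Bayes' rule, the equality $\pi_1(\beta, \xi^2) = \pi_2(\beta, \xi^2)$ and prior independence of $\sigma_{12}$ and $\beta$. In order to estimate (3.8) from the Gibbs sampler output, note that
\[
p(\sigma_{12} \mid y) = E\!\left[p(\sigma_{12} \mid y, \xi^2, \beta, I, m)\right],
\]
where the expectation is taken with respect to the posterior $p(\beta, \xi^2, I, m \mid y)$.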
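Both factors in (3.8) are expectations of standard densities, so they can be approximated by simple Monte Carlo averages over sampler output. The following sketch assumes that two sets of draws are already available: the conditional mean and variance of $\sigma_{12}$ from (2.15) at each draw of the unrestricted sampler, and $\xi^2$ draws from a second run with $\sigma_{12}$ fixed at zero. The function names and the toy inputs are hypothetical.

```python
import numpy as np

def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x, elementwise."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def bayes_factor_12(cond_mean, cond_var, xi2_restricted, g, tau):
    """Monte Carlo estimate of B12 in (3.8).

    cond_mean, cond_var: mean and variance of (2.15) at each unrestricted draw.
    xi2_restricted: xi^2 draws from the run with sigma12 fixed at zero.
    """
    # First factor: average of the conditional density (2.15) evaluated at zero.
    post_at_zero = np.mean(normal_pdf(0.0, cond_mean, cond_var))
    # Second factor: average of 1 / pi_2(sigma12 = 0 | xi^2), with prior N(g, tau * xi^2).
    inv_prior = np.mean(1.0 / normal_pdf(0.0, g, tau * xi2_restricted))
    return post_at_zero * inv_prior

# Toy check: if the conditional posterior coincides with the prior, B12 equals 1.
B12 = bayes_factor_12(
    cond_mean=np.zeros(5), cond_var=np.full(5, 0.5),
    xi2_restricted=np.ones(7), g=0.0, tau=0.5,
)
print(B12)
```

In an application the inputs would come from the stored Gibbs output rather than from constants, and working on the log scale may be preferable when the densities are very small.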
Given a sample $\{\beta_{(t)}, \xi^2_{(t)}, I_{(t)}, m_{(t)}\}_{t=1}^{T}$ generated by algorithm 2, the value of $p(\sigma_{12} = 0 \mid y)$ can be estimated through
\[
\hat p(\sigma_{12} = 0 \mid y) = \frac{1}{T}\sum_{t=1}^{T} p(\sigma_{12} = 0 \mid y, \xi^2_{(t)}, \beta_{(t)}, I_{(t)}, m_{(t)}),
\]
which requires $T$ evaluations of the density in (2.15) at zero. The expected value in (3.8) is with respect to $p(\beta, \xi^2 \mid y, \sigma_{12} = 0)$. To simulate a sample $\{\beta_{(s)}, \xi^2_{(s)}\}_{s=1}^{S}$ from this distribution, we run algorithm 2 again, but now with $\sigma_{12}$ fixed at zero. The expected value is then estimated through
\[
\hat E\!\left[\frac{1}{\pi_2(\sigma_{12} = 0 \mid \xi^2)}\right] = \frac{1}{S}\sum_{s=1}^{S} \frac{1}{\pi_2(\sigma_{12} = 0 \mid \xi^2_{(s)})}.
\]
We do not consider the Bayes factor for algorithm 1, because when $\Sigma$ has an inverse Wishart prior, the elements of (3.8) are hard to estimate. In particular, the marginal prior and posterior of $\sigma_{12}$ are very complicated. Algorithm 2 has the advantage that all densities that appear in (3.8) are standard, so that $B_{12}$ can easily be estimated.

A second method to compute the Bayes factor is the one proposed by Chib (1995). Output from the Gibbs sampling algorithms 2 and 3 can be used to estimate $m_1$ and $m_2$ separately and compute their ratio.[16] We apply Chib's (1995) method first to the sample selection model. The marginal likelihood $m_2$ can be written as
\[
m_2 = \frac{p(\ln y \mid \beta, \sigma_{12}, \xi^2)\, \pi(\beta, \sigma_{12}, \xi^2)}{p(\beta, \sigma_{12}, \xi^2 \mid y)}.
\]
Note that this equation holds for all parameter values in the support of $p(\beta, \sigma_{12}, \xi^2 \mid y)$. Now pick a specific value $(\beta^*, \sigma_{12}^*, \xi^{2*})$, say the sample mean from the Gibbs output. Taking logarithms we get
\[
\log m_2 = \log p(\ln y \mid \beta^*, \sigma_{12}^*, \xi^{2*}) + \log \pi(\beta^*, \sigma_{12}^*, \xi^{2*}) - \log p(\beta^*, \sigma_{12}^*, \xi^{2*} \mid y).
\]
Using (2.13) and the prior on $(\beta, \sigma_{12}, \xi^2)$ the first two terms on the right-hand side can easily be calculated. It remains to estimate the value of the posterior.
To this end, write
\[
\log p(\beta^*, \sigma_{12}^*, \xi^{2*} \mid y) = \log p(\xi^{2*} \mid y) + \log p(\sigma_{12}^* \mid \xi^{2*}, y) + \log p(\beta^* \mid \sigma_{12}^*, \xi^{2*}, y).
\tag{3.9}
\]
Since
\[
p(\xi^{2*} \mid y) = \int p(\xi^{2*} \mid \beta, \sigma_{12}, I, m, y)\, p(\beta, \sigma_{12}, I, m \mid y)\, d\beta\, d\sigma_{12}\, dI\, dm
\]
and a sample $\{\beta_{(t)}, \sigma_{12(t)}, I_{(t)}, m_{(t)}\}_{t=1}^{T}$ is available from the posterior, we estimate this term by
\[
\hat p(\xi^{2*} \mid y) = \frac{1}{T}\sum_{t=1}^{T} p(\xi^{2*} \mid \beta_{(t)}, \sigma_{12(t)}, I_{(t)}, m_{(t)}, y),
\]
where each term in the sum requires evaluating the inverse-gamma density in (2.14).[17] As for the second term in (3.9), we have
\[
p(\sigma_{12}^* \mid \xi^{2*}, y) = \int p(\sigma_{12}^* \mid \beta, \xi^{2*}, I, m, y)\, p(\beta, I, m \mid \xi^{2*}, y)\, d\beta\, dI\, dm.
\]
In order to estimate this term, we need a sample from the posterior distribution of $(\beta, I, m)$, given $y$ and $\xi^{2*}$. The current sample does not satisfy this condition. Therefore the algorithm needs to be implemented again, this time with $\xi^2$ fixed at the value $\xi^{2*}$. This yields a sequence $\{\beta_{(r)}, I_{(r)}, m_{(r)}\}_{r=1}^{R}$ that (approximately) comes from $p(\beta, I, m \mid \xi^{2*}, y)$. The estimate is constructed as
\[
\hat p(\sigma_{12}^* \mid \xi^{2*}, y) = \frac{1}{R}\sum_{r=1}^{R} p(\sigma_{12}^* \mid \beta_{(r)}, \xi^{2*}, I_{(r)}, m_{(r)}, y),
\]
where each term involves evaluating (2.15).[18] Using similar logic, fix $\sigma_{12} = \sigma_{12}^*$ and $\xi^2 = \xi^{2*}$ and run algorithm 2 again, to obtain a sample $\{I_{(q)}, m_{(q)}\}_{q=1}^{Q}$ from $p(I, m \mid \sigma_{12}^*, \xi^{2*}, y)$. The third term in (3.9) is then estimated as
\[
\hat p(\beta^* \mid \sigma_{12}^*, \xi^{2*}, y) = \frac{1}{Q}\sum_{q=1}^{Q} p(\beta^* \mid \sigma_{12}^*, \xi^{2*}, I_{(q)}, m_{(q)}, y),
\]
where each term in the sum requires evaluating the multivariate normal density with mean (2.7) and variance (2.8) at the point $\beta^*$. Only the mean varies with $q$.

[16] Estimating the marginal likelihood from algorithm 1 yields unstable results. Apparently the lack of full identification in that sampler causes problems.
[17] Note that the parameter $\tilde d$ varies with $t$.

Computations in the 2PM are largely similar, so we will be brief. Again the logarithm of the marginal likelihood $m_1$ is split up into the logarithms of the likelihood, the prior and the posterior, evaluated at $(\beta_1^*, \beta_2^*, \sigma_2^{2*})$. Using (3.2) and the priors for algorithm 3 the first two terms are easy to compute. The value of the posterior is estimated using the Gibbs output.
Unlike the SSM there is no need for additional runs of the sampler. The currently available sample $\{\beta_1^{(t)}, I^{(t)}, \beta_2^{(t)}, \sigma_2^{2(t)}\}_{t=1}^T$ is used to construct the estimates

$$\hat p(\beta_1 \mid y) = \frac{1}{T}\sum_{t=1}^T p(\beta_1 \mid I^{(t)}, y), \qquad \hat p(\sigma_2^2 \mid y) = \frac{1}{T}\sum_{t=1}^T p(\sigma_2^2 \mid \beta_2^{(t)}, y^+),$$

where we evaluate (3.3) and (3.5) $T$ times. Note that $p(\beta_2 \mid \sigma_2^2, y) = p(\beta_2 \mid \sigma_2^2, y^+)$ does not have to be estimated, but only requires a single function evaluation of (3.6).

$^{18}$ The mean and variance of the normal distribution in this case both depend on $r$.

4 Simulation Example

We now compare algorithms 1, 2 and 3 using some simulated data. A sample of size $n = 500$ is generated from the sample selection model in (2.1):

$$\begin{aligned}
I_i &= \beta_{01} + x_{i1}\beta_{11} + u_{i1}, \\
m_i &= \beta_{02} + x_{i2}\beta_{12} + u_{i2}, \\
y_i &= I\{I_i > 0\}\, e^{m_i}, \\
\begin{pmatrix} u_{i1} \\ u_{i2} \end{pmatrix} &\sim N(0, \Sigma), \qquad \beta_{01} = \beta_{02}, \quad \beta_{11} = \beta_{12} = 1, \quad \Sigma = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}.
\end{aligned} \tag{4.1}$$

Note that $\beta_{01}$ controls the probability $p_0$ of observing a zero outcome. We let $x_{i1}, x_{i2} \sim U(0, 3)$, so that $\beta_{01} = -0.58$ and $\beta_{01} = -1.50$ correspond to $p_0 = 0.25$ and $p_0 = 0.50$, respectively (cf. Leung and Yu 1996). With the exclusion restriction $x_{i1} \neq x_{i2}$ the parameter $\beta_{12}$ measures the marginal effect of $x_{i2}$ on the log-outcome (conditional on a positive outcome) in both the SSM and 2PM. Of course, the correlation is absent in the 2PM. We start each algorithm at three different points. The number of iterations is 6,000 with a burn-in period of 3,000. The figures in this section are therefore estimated posterior distributions based on 9,000 random draws.
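The link between the selection intercept and the fraction of zeros in this design can be checked directly. The sketch below assumes the design as read here: a unit slope on a U(0, 3) regressor, a standard normal selection error, and intercept values of -0.58 and -1.50 (the negative signs are an assumption that matches the stated zero fractions). The function name is hypothetical. It uses the fact that t Φ(t) + φ(t) is an antiderivative of Φ(t).

```python
import math


def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def zero_probability(beta01, beta11=1.0, upper=3.0):
    """P(I <= 0) for I = beta01 + beta11 * x + u, x ~ U(0, upper), u ~ N(0, 1).

    Requires beta11 > 0. Computes E[Phi(-beta01 - beta11 * x)] in closed form.
    """
    def antiderivative(t):
        return t * norm_cdf(t) + norm_pdf(t)

    a = -beta01
    width = beta11 * upper
    return (antiderivative(a) - antiderivative(a - width)) / width


for b in (-0.58, -1.50):
    print(b, round(zero_probability(b), 4))
```

Both intercepts give zero fractions close to the 0.25 and 0.50 quoted in the text.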
[Figure 1: algorithm 1; n = 500; p0 = 0.25. Panels: posterior densities of β11, β12 and ρ.]

[Figure 2: algorithm 2; n = 500; p0 = 0.25. Panels: posterior densities of β11, β12 and ρ.]

[Figure 3: algorithm 3; n = 500; p0 = 0.25. Panels: posterior densities of β11 and β12.]

As appears from figures 1-3, all three Gibbs samplers produce posterior approximations centered around the true parameter value. It is interesting to note that Gibbs sampling in the two-part model does locate the coefficient of the selection equation, $\beta_{11}$, and the marginal effect of $x_{i2}$ on the log-outcome, $\beta_{22}$, despite ignoring the cross-equation correlation.

[Figure 4: algorithm 1; n = 500; p0 = 0.50. Panels: posterior densities of β11, β12 and ρ.]

[Figure 5: algorithm 2; n = 500; p0 = 0.50. Panels: posterior densities of β11, β12 and ρ.]

[Figure 6: algorithm 3; n = 500; p0 = 0.50. Panels: posterior densities of β11 and β12.]

The results for more severe selection, p0 = 0.50, are given in figures 4-6. Even with a larger fraction of zero outcomes all posteriors for $\beta_{12}$ and $\beta_{22}$ are centered around the true value, though the posterior variance has increased. Moreover, the posterior of the correlation puts most probability on positive values. In results not reported here, we have found that all algorithms converge rapidly (as judged by the Gelman-Rubin R-statistic; e.g. Gelman, Carlin, Stern, and Rubin 1995, chapter 11). In addition, the Bayes factor for testing the restriction $\rho = 0$ decisively rejects the 2PM in favor of the SSM.
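For readers who want to reproduce the convergence check mentioned above, here is a minimal sketch of the basic Gelman-Rubin potential scale reduction factor for one scalar parameter. It assumes several equal-length chains with burn-in already removed; the function name is illustrative and the chains below are synthetic stand-ins, not output from the samplers in this paper.

```python
import random


def gelman_rubin(chains):
    """Potential scale reduction factor for a list of equal-length chains."""
    m = len(chains)
    n = len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    # Between-chain variance B and average within-chain variance W.
    between = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)
    within = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
                 for c, mu in zip(chains, means)) / m
    var_plus = (n - 1) / n * within + between / n
    return (var_plus / within) ** 0.5


rng = random.Random(0)
# Three chains that target the same distribution: R should be close to 1.
mixed = [[rng.gauss(0.0, 1.0) for _ in range(3000)] for _ in range(3)]
# Three chains stuck in different regions: R should be clearly above 1.
stuck = [[rng.gauss(shift, 1.0) for _ in range(3000)] for shift in (0.0, 1.0, 2.0)]
print(gelman_rubin(mixed), gelman_rubin(stuck))
```

Values near 1 are consistent with convergence; they do not prove it.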
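5 Finite Mixtures and Sample Selection

There are several ways the assumption of bivariate normality in the SSM can be relaxed. Here we first consider using a mixture of normal distributions with a fixed number of mixture components. The sample selection model is given in (2.1) but now

$$\begin{pmatrix} u_{i1} \\ u_{i2} \end{pmatrix} \sim \sum_{j=1}^k \gamma_j\, N(\mu_j, c_u \Sigma_j),$$

where $c_u > 0$ is a scaling factor, the $\gamma_j$'s are nonnegative and $\sum_{j=1}^k \gamma_j = 1$, and

$$\mu_j = \begin{pmatrix} \mu_{j1} \\ \mu_{j2} \end{pmatrix}, \qquad \Sigma_j = \begin{bmatrix} \sigma_{j,1}^2 & \sigma_{j,12} \\ \sigma_{j,12} & \sigma_{j,2}^2 \end{bmatrix}.$$

Mixtures of normals are very flexible distributions and even with small values of $k$ (say, 2 or 3) can display skewness, excess kurtosis and multimodality (e.g. Geweke 2005, chapter 6). Note that the mixture distribution trivially reduces to (2.2) when $k = 1$, $\mu_j = 0$ and $c_u = 1$.

To construct a Gibbs sampler for this mixture model, let $\zeta = (\zeta_1, \ldots, \zeta_n)$ be the vector of component selectors. That is, if $\zeta_i = j$ then the errors of observation $i$ are drawn from the $j$th mixture component:

$$\begin{pmatrix} u_{i1} \\ u_{i2} \end{pmatrix} \Big|\, \zeta_i = j \sim N(\mu_j, c_u \Sigma_j), \qquad j = 1, \ldots, k.$$

Since the values of $\zeta_i$ are not observed, the parameters in the Gibbs sampler are now

$$\beta = (\beta_1', \beta_2')', \quad \{I_i\}_{i=1}^n, \quad \{m_i : y_i = 0\}, \quad \zeta, \quad \gamma = \{\gamma_j\}_{j=1}^k, \quad \mu = \{\mu_j\}_{j=1}^k, \quad \Sigma = \{\Sigma_j\}_{j=1}^k, \quad c_u.$$

We highlight some of the main features of the sampler and leave additional details to appendix A. Recall that in the SUR formulation $W = X\beta + u$. Let $W_{(j)}, X_{(j)}, u_{(j)}$ be the submatrices corresponding to the $j$th mixture component. The SSM likelihood in (2.5) can be written as

$$p(W \mid \beta, \zeta, \mu, \Sigma, c_u) \propto \prod_{j=1}^k |c_u \Sigma_j|^{-n_j/2} \exp\left\{ -\tfrac{1}{2} \big(W_{(j)} - X_{(j)}\beta - \mu_j \otimes \iota_{n_j}\big)' A_j \big(W_{(j)} - X_{(j)}\beta - \mu_j \otimes \iota_{n_j}\big) \right\}, \tag{5.1}$$

where $A_j = c_u^{-1}\big(\Sigma_j^{-1} \otimes I_{n_j}\big)$ and $n_j = |\{i : \zeta_i = j\}|$. Note that once we condition on the component selectors $\zeta$, the completed data likelihood does not depend on $\gamma$. Provided $\gamma$ is prior independent of $\beta$, the same will be true for the conditional posterior of $\beta$. Using straightforward algebra, it can be shown that

$$p(W \mid \beta, \zeta, \mu, \Sigma, c_u) \propto \prod_{j=1}^k |c_u \Sigma_j|^{-n_j/2} \exp\left\{ -\tfrac{1}{2} \sum_{j=1}^k e_{(j)}' A_j e_{(j)} \right\} \exp\left\{ -\tfrac{1}{2} (\beta - \hat\beta)' \Big[ \sum_{j=1}^k X_{(j)}' A_j X_{(j)} \Big] (\beta - \hat\beta) \right\},$$

where $\hat\beta$ is the GLS estimator and $e_{(j)}$ the residual:

$$\hat\beta = \Big[ \sum_{j=1}^k X_{(j)}' A_j X_{(j)} \Big]^{-1} \sum_{j=1}^k X_{(j)}' A_j \big(W_{(j)} - \mu_j \otimes \iota_{n_j}\big), \qquad e_{(j)} = W_{(j)} - \mu_j \otimes \iota_{n_j} - X_{(j)}\hat\beta.$$

Combining the likelihood with a $N(d_0, D_0)$ prior, the conditional posterior is again normal:

$$p(\beta \mid W, \zeta, \mu, \Sigma, c_u) = N(d, D), \tag{5.2}$$

$$d = D\Big[ D_0^{-1} d_0 + \sum_{j=1}^k X_{(j)}' A_j X_{(j)}\, \hat\beta \Big], \qquad D = \Big[ D_0^{-1} + \sum_{j=1}^k X_{(j)}' A_j X_{(j)} \Big]^{-1}.$$

Sampling $\{I_i\}_{i=1}^n$ and $\{m_i : y_i = 0\}$ is similar as before, so we will be brief. For simplicity we do not list every conditioning argument:

$$I_i \mid s_i = 1, \zeta_i = j \sim N\Big( x_{i1}'\beta_1 + \mu_{j1} + \frac{\sigma_{j,12}}{\sigma_{j,2}^2}\big(m_i - x_{i2}'\beta_2 - \mu_{j2}\big),\; c_u \sigma_{j,1}^2 (1 - \rho_j^2) \Big)\, I\{I_i > 0\}, \tag{5.3}$$

$$I_i \mid s_i = 0, \zeta_i = j \sim N\big( x_{i1}'\beta_1 + \mu_{j1},\; c_u \sigma_{j,1}^2 \big)\, I\{I_i \le 0\}, \tag{5.4}$$

$$m_i \mid I_i, \zeta_i = j \sim N\Big( x_{i2}'\beta_2 + \mu_{j2} + \frac{\sigma_{j,12}}{\sigma_{j,1}^2}\big(I_i - x_{i1}'\beta_1 - \mu_{j1}\big),\; c_u \sigma_{j,2}^2 (1 - \rho_j^2) \Big), \tag{5.5}$$

where $\rho_j = \sigma_{j,12}/(\sigma_{j,1}\sigma_{j,2})$ is the correlation implied by $\Sigma_j$.

The component selector $\zeta_i$ has a prior multinomial distribution with $\Pr\{\zeta_i = j \mid \gamma\} = \gamma_j$, $j = 1, \ldots, k$. Write the $i$th observation as the bivariate linear model $w_i = X_i\beta + u_i$, where

$$w_i = (I_i, m_i)', \qquad X_i = \begin{bmatrix} x_{i1}' & 0 \\ 0 & x_{i2}' \end{bmatrix}, \qquad u_i = (u_{i1}, u_{i2})'.$$

The conditional posterior distribution of $\zeta_i$ follows from Bayes' rule:

$$\begin{aligned}
\Pr\{\zeta_i = j \mid w_i, \beta, \gamma, \mu, \Sigma, c_u\} &\propto p(I_i, m_i \mid \beta, \gamma, \mu, \Sigma, c_u, \zeta_i = j)\, \Pr\{\zeta_i = j \mid \gamma\} \\
&\propto \gamma_j\, |c_u \Sigma_j|^{-1/2} \exp\left\{ -\frac{1}{2 c_u} (w_i - X_i\beta - \mu_j)' \Sigma_j^{-1} (w_i - X_i\beta - \mu_j) \right\}.
\end{aligned} \tag{5.6}$$

The parameter $\gamma$ is a set of multinomial probabilities, which suggests using a Dirichlet prior distribution. If $\pi(\gamma_1, \ldots, \gamma_k) = D(\alpha_0, \ldots, \alpha_0)$, it is shown in the appendix that the posterior is given by

$$\gamma \mid \zeta \sim D(\alpha_0 + n_1, \ldots, \alpha_0 + n_k). \tag{5.7}$$

The use of a uniform Dirichlet prior, i.e. one in which $\alpha_0$ is the parameter for each state $j = 1, \ldots, k$, is appropriate here, because the state labels are not identified. This lack of knowledge is reflected by a prior in which the states are exchangeable.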
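The two discrete updates just described, drawing a component selector and drawing the mixture weights, can be sketched as follows. This is an illustration under assumptions rather than the paper's code: components are indexed from 0, each component covariance is passed as a nested 2x2 tuple, the residual is the observation's w minus X times the coefficient vector, and all function names are hypothetical. The selector is drawn by inverting the discrete CDF and the Dirichlet draw uses normalised gamma variates, as in appendix A.

```python
import math
import random


def log_selector_weight(gamma_j, mu_j, sigma_j, c_u, resid):
    """Log of gamma_j * |c_u Sigma_j|^(-1/2) * exp(-quad / (2 c_u)), bivariate case."""
    (s11, s12), (_, s22) = sigma_j
    det = s11 * s22 - s12 * s12
    r1, r2 = resid[0] - mu_j[0], resid[1] - mu_j[1]
    quad = (s22 * r1 * r1 - 2.0 * s12 * r1 * r2 + s11 * r2 * r2) / det
    return math.log(gamma_j) - 0.5 * math.log(c_u * c_u * det) - quad / (2.0 * c_u)


def sample_selector(log_weights, rng):
    """Draw an index with probability proportional to exp(log_weights)."""
    mx = max(log_weights)
    weights = [math.exp(lw - mx) for lw in log_weights]  # stabilised
    u = rng.random() * sum(weights)
    cum = 0.0
    for j, w in enumerate(weights):
        cum += w
        if u < cum:
            return j
    return len(weights) - 1  # guard against rounding at the upper end


def sample_dirichlet(alpha0, counts, rng):
    """Draw from D(alpha0 + n_1, ..., alpha0 + n_k) via normalised gamma draws."""
    z = [rng.gammavariate(alpha0 + n_j, 1.0) for n_j in counts]
    total = sum(z)
    return [zj / total for zj in z]


rng = random.Random(0)
sigma = ((1.0, 0.5), (0.5, 1.0))
lw = [log_selector_weight(g, m, sigma, 1.0, (0.3, 0.2))
      for g, m in ((0.2, (-1.0, -1.0)), (0.8, (0.25, 0.25)))]
print(sample_selector(lw, rng), sample_dirichlet(1.0, [100, 400], rng))
```

Working on the log scale avoids underflow when a residual is far from every component mean.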
In order to find the conditional posterior of $(\mu_j, \Sigma_j)$ we take its prior to be

$$\pi(\Sigma_j) = W^{-1}(H, v), \qquad \pi(\mu_j \mid c_u, \Sigma_j) = N(0, c_u \tau \Sigma_j).$$

The advantage of this choice is that the posterior of $(\mu_j, \Sigma_j)$ conditional on $(\beta, \zeta, c_u)$ is again of the normal-inverse Wishart form. This allows us to sample $(\mu_j, \Sigma_j)$ jointly, which should improve the convergence behavior of the Markov chain.$^{19}$ It follows that

$$p(\Sigma_j \mid W, \beta, \zeta, c_u) = W^{-1}(\tilde H_j, \tilde v_j), \tag{5.8}$$

$$\tilde H_j = H + \frac{1}{c_u} \frac{n_j\, \bar u_j \bar u_j'}{1 + \tau n_j} + \frac{1}{c_u} \sum_{i:\, \zeta_i = j} (u_i - \bar u_j)(u_i - \bar u_j)', \qquad \bar u_j = \frac{1}{n_j} \sum_{i:\, \zeta_i = j} u_i, \qquad \tilde v_j = v + n_j,$$

where $n_j = \sum_{i=1}^n I\{\zeta_i = j\}$ is the size of the subsample that is distributed according to component $j$. Also:

$$p(\mu_j \mid \Sigma_j, W, \beta, \zeta, c_u) = N(b_j, c_u M_j), \qquad b_j = \frac{\tau n_j}{1 + \tau n_j}\, \bar u_j, \qquad M_j = \frac{\tau}{1 + \tau n_j}\, \Sigma_j. \tag{5.9}$$

Allowing for prior dependence between $\mu_j$ and $\Sigma_j$ is often reasonable: if $c_u \Sigma_j$ is large the errors are highly variable, making it harder to pin down their location. The parameter $\tau$ can be thought of as a tuning parameter that can be used to control the potential multimodality in the distribution of $u_i$, independently of the variance.

$^{19}$ Alternatively, if $\mu_j$ is a priori independent of $\Sigma_j$, it would be necessary to sample from $p(\mu_j \mid \Sigma_j, W, \beta, \zeta, c_u)$ and $p(\Sigma_j \mid \mu_j, W, \beta, \zeta, c_u)$ consecutively. This adds a 'block' to the Gibbs sampler, which slows down convergence.

The conditionally conjugate prior distribution for $c_u$ is the inverse gamma distribution. In particular, we take $\pi(c_u) = IG(v/2, d)$. Since $c_u$ enters the prior of $\mu_j$ and the likelihood, it is shown in the appendix that

$$p(c_u \mid W, \beta, \zeta, \gamma, \mu, \Sigma) = IG(\tilde v, \tilde d), \tag{5.10}$$

$$\tilde v = \frac{v}{2} + n + k, \qquad \tilde d = d + \frac{1}{2} \sum_{j=1}^k \tau^{-1} \mu_j' \Sigma_j^{-1} \mu_j + \frac{1}{2} \sum_{i=1}^n (u_i - \mu_{j(i)})' \Sigma_{j(i)}^{-1} (u_i - \mu_{j(i)}),$$

where $j(i)$ is the value of $j \in \{1, \ldots, k\}$ such that $\zeta_i = j$. The Gibbs sampler for the mixture of normals sample selection model can now be summarized as follows:

Algorithm 4 (Mixture SSM) For given starting values of $(\beta, \zeta, \gamma, \mu, \Sigma, c_u, I)$ and $\{m_i : y_i = 0\}$:

1. sample $\beta = (\beta_1', \beta_2')'$ from (5.2);
2. if $s_i = 1$ sample $I_i$ from (5.3); if $s_i = 0$ sample $I_i$ from (5.4) and $m_i$ from (5.5);
3. sample $\zeta_i$ from (5.6) for $i = 1, \ldots, n$;
4. sample $\gamma = (\gamma_1, \ldots, \gamma_k)$ from (5.7);
5. sample $\Sigma_j$ from (5.8) and $\mu_j$ from (5.9) for $j = 1, \ldots, k$;
6. sample $c_u$ from (5.10);
7. return to step 1 and repeat.

A similar algorithm can be constructed for the two-part model. Some simplifications will occur because of the assumed conditional independence between the selection and outcome equations. In (3.1) the distributions of both $\varepsilon_{i1}$ and $\varepsilon_{i2}$ could be modeled as a mixture of normals. Two points should be noted. First, the use of improper priors in a mixture model leads to improper posterior distributions (e.g. Roeder and Wasserman 1997). Second, the state labels $j$ are not identified without further prior information. If the states themselves are the primary focus, for example when a state represents a certain regime or subpopulation, then the algorithm above is not appropriate.$^{20}$ In our case, however, we merely use mixtures as a modeling device, and labeling issues are not a concern.

Simulation Example

To illustrate the Gibbs sampler of algorithm 4 we generate a sample of size $n = 500$ from the SSM in (4.1) with $\beta_{01} = \beta_{02} = -1.50$ and $\beta_{11} = \beta_{12} = 1$. This corresponds to a fraction of zero outcomes of about 50%. The errors are generated according to

$$\begin{pmatrix} u_{i1} \\ u_{i2} \end{pmatrix} \sim \gamma_1 N(\mu_1, \Sigma) + (1 - \gamma_1) N(\mu_2, \Sigma),$$

where $\mu_1 = (-1, -1)$, $\mu_2 = (0.25, 0.25)$, $\gamma_1 = 0.2$ and $\Sigma$ has variances 1 and covariance 0.5. The regressors $x_{i1}$ and $x_{i2}$ are taken from the $U(0, 3)$ distribution. Since in our experience the sampler for the mixture model displays slower convergence rates, we run the Markov chain from three different starting points for 15,000 iterations and discard the first 10,000. The posterior density estimates are thus based on 15,000 sampled values.

$^{20}$ For a discussion, see Geweke (2005, chapter 6) and the references cited therein.
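The data-generating process for this example can be sketched in a few lines. The block below assumes the design as read here: both intercepts equal to -1.50, unit slopes, component means (-1, -1) and (0.25, 0.25) with weight 0.2 on the first, which makes the error mean zero, and a common covariance matrix with unit variances and covariance 0.5. The negative signs and the function name are assumptions, not taken verbatim from the source.

```python
import math
import random


def simulate_mixture_ssm(n, beta01=-1.5, beta02=-1.5, beta11=1.0, beta12=1.0,
                         gamma1=0.2, seed=0):
    """Return a list of (x1, x2, y) tuples from the two-component mixture SSM."""
    rng = random.Random(seed)
    means = ((-1.0, -1.0), (0.25, 0.25))  # assumed component means
    rows = []
    for _ in range(n):
        comp = 0 if rng.random() < gamma1 else 1
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        # Cholesky factor of [[1, 0.5], [0.5, 1]] gives the correlated errors.
        u1 = means[comp][0] + z1
        u2 = means[comp][1] + 0.5 * z1 + math.sqrt(0.75) * z2
        x1, x2 = rng.uniform(0.0, 3.0), rng.uniform(0.0, 3.0)
        latent = beta01 + beta11 * x1 + u1   # selection index I_i
        log_outcome = beta02 + beta12 * x2 + u2  # m_i
        y = math.exp(log_outcome) if latent > 0 else 0.0
        rows.append((x1, x2, y))
    return rows


sample = simulate_mixture_ssm(500)
print(sum(1 for _, _, y in sample if y == 0.0) / len(sample))
```

With a large simulated sample the share of zeros settles close to one half, in line with the fraction quoted in the text.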
[Figure 7: algorithm 4; n = 500; p0 = 0.50. Panels: posterior densities of β11, β12 and ρ.]

[Figure 8: mixture weights (γ1 = 0.2). Panel: posterior density of γ1.]

The estimated posterior of $\beta_{12}$ is centered around its true value with a 95% highest posterior density (HPD) interval of [0.91, 1.17]. Although the correlation is harder to pin down, the posterior puts substantial probability on positive values, with a 95% HPD of [-0.07, 0.61]. The coefficient of $x_{i1}$ in the selection equation is not well-identified, with a 95% HPD of [0.55, 3.16]. Finally, the posterior of $\gamma_1$ is reasonably accurate. The bimodality occurs because of so-called 'label-switching' in the sampler; since the state labels themselves are not identified, it is possible that at some point in the Markov chain the value of $\gamma_1$ jumps from a low probability region, around its true value, to a high probability region around $1 - \gamma_1$. The same then happens with the component specific parameters $\mu_1, \mu_2, \Sigma_1, \Sigma_2$. To ensure that this switching happens in a balanced manner and that the mixture posterior is fully explored, we have incorporated a random permutation step in algorithm 4.$^{21}$

$^{21}$ See Frühwirth-Schnatter (2006, page 82).

6 Dirichlet Mixtures and Sample Selection

The approach discussed above requires one to choose the number of mixture components beforehand. If the econometrician is uncomfortable doing this, he could use various choices of $k$ and compare different models on the basis of their posterior probabilities. Here we explore the use of Dirichlet process priors in modeling sample selection. Building on a large literature that started with Ferguson (1973, 1974) and Antoniak (1974) we show that some of the existing methodology can be readily adapted to the models in this paper.
The main appeal of using a Dirichlet process prior lies in the fact that the errors are modeled as a mixture of normals with a random number of mixture components. Through Bayesian updating we can directly make inference about this number. Also, the prior allows us to center in some sense the semiparametric model around the parametric one. The work in this section is based on results from Escobar (1994) and Escobar and West (1995),$^{22}$ and is closely related to Conley, Hansen, McCulloch, and Rossi (2007), who consider the use of Dirichlet process priors in an instrumental variables model.

The basic setup can be described as follows. In (2.1) let

$$\begin{aligned}
\begin{pmatrix} u_{i1} \\ u_{i2} \end{pmatrix} \Big|\, \theta_i &\sim N(\mu_i, c_u \Sigma_i), \qquad \theta_i = (\mu_i, \Sigma_i), \\
\theta_i \mid G &\sim G, \\
G \mid \alpha, G_0 &\sim DP(\alpha, G_0).
\end{aligned} \tag{6.1}$$

Here $\theta_i$ is simply the set of parameters for the normal distribution (apart from the common scale factor $c_u$). Our discussion of the SSM in section 2 involved $\theta_i = \theta = (\mu, \Sigma)$ for all $i$ and specifying a prior for $\theta$.$^{23}$ The semiparametric model in (6.1) allows each pair $(u_{i1}, u_{i2})$ to have a distinct normal distribution, conditional on $\theta_i$. The parameters $\{\theta_i\}_{i=1}^n$ are i.i.d. draws from a distribution $G$. If $G$ is chosen to be a $N(\mu_0, S_0)$ distribution, possibly augmented with a hyperprior on $(\mu_0, S_0)$, then we have specified a hierarchical normal model, which still imposes a lot of structure on, say, the marginal distribution of the errors. In particular, it would not allow multimodality or skewness. Instead, in (6.1) the distribution $G$ itself is treated as unknown and given a Dirichlet process (DP) prior.$^{24}$ Thus, $G$ can be viewed as a random probability measure. $G_0$ is a distribution that in some sense is a prior guess about $G$. Specifically, the marginal prior distribution of $\theta_i$ is exactly $G_0$ (e.g. Ferguson 1973, Antoniak 1974). The parameter $\alpha$ reflects the prior belief that $G_0$ is the actual distribution of $\theta_i$. This belief becomes stronger as $\alpha \to \infty$. We construct a Gibbs sampler based on fixed values of $(\alpha, G_0)$.
Additional details on the prior of $\alpha$ and the form of $G_0$ are discussed in appendix B. Throughout this section the conditioning on $(\alpha, G_0)$ is implicit.

$^{22}$ Escobar and West (1998), MacEachern (1998) and Müller and Quintana (2004) are excellent reviews of semiparametric modeling with Dirichlet processes.
$^{23}$ To be precise, the prior of $\mu$ was degenerate at zero and $\Sigma \sim W^{-1}(H, v)$.
$^{24}$ Suppose $\Omega$ is the sample space and $\{A_j\}_{j=1}^k$ is any measurable partition. If $G \sim DP(\alpha, G_0)$, then the collection of random probabilities $\{G(A_j)\}_{j=1}^k$ follows a Dirichlet distribution.

The parameters are now

$$\beta, \quad \{I_i\}_{i=1}^n, \quad \{m_i : y_i = 0\}, \quad \theta = \{\theta_i\}_{i=1}^n, \quad G, \quad c_u.$$

A simplification occurs because $G$ can be integrated out of the posterior (e.g. Escobar 1994), so that the Gibbs sampler only involves updating the remaining parameters. The likelihood of $W$, conditional on $\theta = \{\theta_i\}_{i=1}^n$, is

$$p(W \mid \beta, \theta, c_u) \propto \prod_{i=1}^n |c_u \Sigma_i|^{-1/2} \exp\left\{ -\frac{1}{2 c_u} (w_i - X_i\beta - \mu_i)' \Sigma_i^{-1} (w_i - X_i\beta - \mu_i) \right\}.$$

Combining this with a $N(d_0, D_0)$ prior for $\beta$ and collecting terms, it can be shown that

$$\beta \mid W, \theta, c_u \sim N(d, D), \tag{6.2}$$

$$D = \Big[ D_0^{-1} + c_u^{-1} \sum_{i=1}^n X_i' \Sigma_i^{-1} X_i \Big]^{-1}, \qquad d = D\Big[ D_0^{-1} d_0 + c_u^{-1} \sum_{i=1}^n X_i' \Sigma_i^{-1} X_i\, \hat\beta \Big],$$

and $\hat\beta$ is the weighted least squares estimator:

$$\hat\beta = \Big[ \sum_{i=1}^n X_i' \Sigma_i^{-1} X_i \Big]^{-1} \sum_{i=1}^n X_i' \Sigma_i^{-1} (w_i - \mu_i).$$

Sampling $I_i$ when $s_i = 1$ and $(I_i, m_i)$ when $s_i = 0$ is done by generating draws from the distributions in (5.3), (5.4) and (5.5), where we now condition on $\theta_i = (\mu_i, \Sigma_i)$, instead of $\zeta_i$.

Let $\theta_{-i} = \{\theta_1, \ldots, \theta_{i-1}, \theta_{i+1}, \ldots, \theta_n\}$. Blackwell and MacQueen (1973) show that if $\theta_i \mid G \sim G$ and $G \sim DP(\alpha, G_0)$, then the distribution of $\theta_i \mid \theta_{-i}$ with $G$ integrated out is given by

$$\theta_i \mid \theta_{-i} \;\begin{cases} = \theta_j & \text{w. prob. } \dfrac{1}{\alpha + n - 1}, \quad j \neq i, \\[2mm] \sim G_0 & \text{w. prob. } \dfrac{\alpha}{\alpha + n - 1}. \end{cases} \tag{6.3}$$

That is, $\theta_i$ equals one of the other $\theta_j$'s with nonzero probability, or is a new value distributed according to $G_0$. This property is often referred to as the Pólya urn representation of a sample from the Dirichlet process. Using Bayes' rule the posterior distribution takes the following form:

$$\theta_i \mid \theta_{-i}, W, \beta, c_u \;\begin{cases} = \theta_j & \text{w. prob. } c^{-1} p(w_i \mid X_i, \beta, \theta_j, c_u), \quad j \neq i, \\[1mm] \sim p(\theta_i \mid w_i, X_i, \beta, c_u) & \text{w. prob. } c^{-1} \alpha\, p(w_i \mid X_i, \beta, c_u). \end{cases} \tag{6.4}$$

Here $c$ is a normalizing constant, $p(\theta_i \mid w_i, X_i, \beta, c_u)$ is the posterior given prior $dG_0(\theta_i)$ and $p(w_i \mid X_i, \beta, c_u)$ is the marginal likelihood after integrating out $\theta_i$ with respect to $dG_0(\theta_i)$:

$$p(\theta_i \mid w_i, X_i, \beta, c_u) \propto p(w_i \mid X_i, \beta, \theta_i, c_u)\, dG_0(\theta_i), \qquad p(w_i \mid X_i, \beta, c_u) = \int p(w_i \mid X_i, \beta, \theta_i, c_u)\, dG_0(\theta_i).$$

More details are given in appendix B. Finally, updating $c_u$ conditional on the remaining parameters is exactly the same as in the finite mixture case discussed earlier. The Gibbs sampler for the Dirichlet process SSM can now be summarized as follows:

Algorithm 5 (Dirichlet SSM) For given starting values of $(\beta, \theta, I, c_u)$ and $\{m_i : y_i = 0\}$:

1. sample $\beta$ from (6.2);
2. if $s_i = 1$ sample $I_i$ from (5.3); if $s_i = 0$ sample $I_i$ from (5.4) and $m_i$ from (5.5); all draws here are conditional on $\theta_i = (\mu_i, \Sigma_i)$;
3. sample $\theta_i$ from (6.4) for $i = 1, \ldots, n$;
4. sample $c_u$ from (5.10);
5. return to step 1 and repeat.

Algorithm 5 can be extended in several ways. First of all, it is possible to place a prior distribution on $\alpha$. Recall that $\alpha$ represents the prior belief that $G_0$ is the distribution of $\theta_i$. If $\alpha$ is large, then we will see many unique values in $\theta$, which yields a model with a large number of mixture components. Alternatively, if $\alpha$ is small, then $\theta$ will likely contain few unique values. In fact, Antoniak (1974) shows that $k_n$, the number of unique values in a sample of size $n$, satisfies $E(k_n \mid \alpha) \approx \alpha \log((\alpha + n)/\alpha)$. The limit case $\alpha = 0$ results in $\theta_i = \theta$ for all $i$, which leads to the parametric model in section 2. By placing a prior on $\alpha$ it is possible to learn about the number of mixture components, after seeing the data. Escobar and West (1995) use a convenient gamma prior which yields a posterior mixture of two gamma distributions. See appendix B for details. Alternatively, it is possible to specify a prior for $k_n$ explicitly (rather than implicitly through $\alpha$) as in Escobar (1994).
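How the precision parameter of the Dirichlet process drives the number of distinct values can be illustrated by simulating the Pólya urn directly. The sketch below is an illustration under assumptions: it writes the precision parameter as `alpha` (the symbol is an assumption of this sketch), uses hypothetical function names, and relies on the fact that under the urn scheme the draw made after i earlier draws is a new value with probability alpha / (alpha + i). It compares the simulated average, the exact expectation and the logarithmic approximation quoted above.

```python
import math
import random


def expected_clusters(alpha, n):
    """Exact prior mean of the number of distinct values among n urn draws."""
    return sum(alpha / (alpha + i) for i in range(n))


def antoniak_approx(alpha, n):
    """Logarithmic approximation alpha * log((alpha + n) / alpha)."""
    return alpha * math.log((alpha + n) / alpha)


def simulate_clusters(alpha, n, rng):
    """Count distinct values in one Pólya urn sequence of length n."""
    k = 0
    for i in range(n):
        # With i earlier draws, a fresh value from the base measure
        # appears with probability alpha / (alpha + i).
        if rng.random() < alpha / (alpha + i):
            k += 1
    return k


rng = random.Random(0)
for alpha in (0.5, 1.0, 5.0):
    sims = [simulate_clusters(alpha, 500, rng) for _ in range(2000)]
    print(alpha, sum(sims) / len(sims),
          expected_clusters(alpha, 500), antoniak_approx(alpha, 500))
```

Small values of the precision parameter concentrate the prior on a handful of components even with n = 500, which is one reason a prior on this parameter can matter in practice.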
The Markov chain constructed in algorithm 5 may converge slowly if the Gibbs sampler 'gets stuck' at a few fixed values of $\theta_i$. From (6.4) this could happen when the sum (over $j \neq i$) of $c^{-1} p(w_i \mid X_i, \beta, \theta_j, c_u)$ gets large relative to $c^{-1} \alpha\, p(w_i \mid X_i, \beta, c_u)$. It is possible to slightly reparameterize the model and associated Gibbs sampler, such that at each iteration the distinct values in $\theta$ are 'remixed'; see West, Müller, and Escobar (1994), MacEachern (1998) and the appendix.

Simulation Example

Consider the same simulation set-up as in section 5, where we now apply algorithm 5. In principle a density estimate of $(u_{i1}, u_{i2})$, computed over a grid of values, can be constructed by adding an extra step in algorithm 5: after updating $\{\theta_i\}_{i=1}^n$, sample a new value $\theta_{n+1} = (\mu_{n+1}, \Sigma_{n+1})$ from (6.4) and evaluate the density at this new value. Taking an average over all MCMC iterations then yields a density estimate, from which we may learn about the dependence between $u_{i1}$ and $u_{i2}$.$^{25}$ Here we only present the marginal posterior density estimates for the various coefficients.

$^{25}$ See Conley, Hansen, McCulloch, and Rossi (2007) for details.

[Figure 9: algorithm 5; n = 500; p0 = 0.50; mixture of 2 normals. Panels: posterior densities of β01, β11, β02 and β12.]

[Figure 10: histogram of number of mixture components (mixture of 2 normals).]

As before the posterior of $\beta_{12}$ in figure 9 is closely centered around its true value. The likelihood is not very informative about the remaining coefficients. The posterior of the number of mixture components puts the largest probability on a single component. In this case the mixture components are apparently not 'separated' enough. As a second example we generate $n = 500$ observations from (4.1) where

$$\begin{pmatrix} u_{i1} \\ u_{i2} \end{pmatrix} = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix} \begin{pmatrix} v_{i1} \\ v_{i2} \end{pmatrix}$$
where $v_{i1}$ and $v_{i2}$ have independent log-$\chi^2$ distributions with 1 degree of freedom. The coefficients are $\beta_{01} = \beta_{02} = 0.50$ and $\beta_{11} = \beta_{12} = 1$, which again leads to about 50% zero outcomes.

[Figure 11: algorithm 5; n = 500; p0 = 0.50; log-chi2 errors. Panels: posterior densities of β01, β11, β02 and β12.]

[Figure 12: histogram of number of mixture components (log-chi2 errors).]

Interestingly, the posteriors for the coefficients of the selection equation are now more accurate: compare figures 9 and 11. The marginal effect $\beta_{12}$ of $x_{i2}$ on the log-outcome, however, is underestimated. On the other hand the sampler does use a larger number of mixture components to approximate the bivariate log-$\chi^2$ distribution, with the largest posterior probability on 2 and 3 components and zero probability on a single component. In this example there appears to be a trade-off between modeling the error distribution and obtaining an accurate posterior of the coefficients of interest.

7 Conclusion

In this paper we have considered models that can be used to describe nonnegative outcome variables. The sample selection model essentially treats the outcome as a censored quantity, and specifies a structure for the latent process. The two-part model focuses on the observed outcomes directly. Given the strong parametric assumption of normal errors, Bayesian inference in both models can proceed straightforwardly, using a combination of data augmentation and Gibbs sampling. The MCMC schemes can be formulated with or without an identification restriction. We have found there to be hardly any difference in the posterior approximation.$^{26}$ A possible Bayesian semiparametric treatment introduces more flexibility in the (joint) error distribution.
When mixtures of normal distributions are used, together with natural conjugate priors, we can construct a Gibbs sampler by augmenting the parameter space with a set of latent state variables. Within this MCMC algorithm the number of mixture components is fixed. In principle different specifications can be compared on the basis of a Bayes factor. An attractive alternative to comparing many different models is the use of Dirichlet process mixtures. We have modeled the errors as having a bivariate normal distribution whose parameters may differ across observations. In this case the Dirichlet process essentially amounts to using a mixture of normals with an unknown number of mixture components. The data may then be used to make inference about this number. Our paper has constructed an MCMC algorithm for use in the sample selection model. The only requirement for tractability is then the choice of conjugate priors.

Many questions remain that we shall address in future work. First of all, we aim to provide a thorough comparison between the mixture of normals and Dirichlet process models. A potential problem with the finite mixture model is overfitting: when the chosen number of components is too large, problems arise due to a lack of identification. For a number of mixture components the parameters are then sampled from the prior and no Bayesian updating can take place. The Dirichlet mixture approach on the other hand appears to avoid overfitting. Finding and comparing analytical bounds on the convergence rates of each MCMC algorithm should be useful in this context. Second, imposing any type of identification restriction in a general mixture model will typically destroy the natural conjugacy between the prior and likelihood. Reparameterizing the model may then be helpful and we expect that the more general Metropolis-Hastings algorithm is needed to approximate the posteriors.
Finally, Richardson and Green (1997) have proposed an alternative method for using mixtures with an unknown number of components. It remains to contrast their approach with the Dirichlet process model.

$^{26}$ This is not always the case: McCulloch, Polson, and Rossi (2000) find large differences in the autocorrelation and convergence behavior of the chains, in case of the multinomial probit model.

References

Albert, J. H., and S. Chib (1993): “Bayesian Analysis of Binary and Polychotomous Response Data,” Journal of the American Statistical Association, 88(422), 669–679.
Amemiya, T. (1985): Advanced Econometrics. Harvard University Press.
Antoniak, C. E. (1974): “Mixtures of Dirichlet Processes with Applications to Bayesian Nonparametric Problems,” The Annals of Statistics, 2(6), 1152–1174.
Blackwell, D., and J. B. MacQueen (1973): “Ferguson Distributions via Pólya Urn Schemes,” The Annals of Statistics, 1(2), 353–355.
Cameron, A. C., and P. K. Trivedi (2005): Microeconometrics: Methods and Applications. Cambridge.
Casella, G., and E. I. George (1992): “Explaining the Gibbs Sampler,” The American Statistician, 46(3), 167–174.
Chib, S. (1995): “Marginal Likelihood from the Gibbs Output,” Journal of the American Statistical Association, 90(432), 1313–1321.
Conley, T., C. Hansen, R. McCulloch, and P. E. Rossi (2007): “A Semiparametric Bayesian Approach to the Instrumental Variable Problem,” Graduate School of Business, University of Chicago Working Paper.
Cragg, J. G. (1971): “Some Statistical Models for Limited Dependent Variables with Application to the Demand for Durable Goods,” Econometrica, 39(5), 829–844.
Dow, W. H., and E. C. Norton (2003): “Choosing Between and Interpreting the Heckit and Two-Part Models for Corner Solutions,” Health Services and Outcomes Research Methodology, 4, 5–18.
Duan, N., W. G. Manning, C. N. Morris, and J. P.
Newhouse (1983): “A Comparison of Alternative Models for the Demand for Medical Care,” Journal of Business and Economic Statistics, 1(2), 115–126.
Duan, N., W. G. Manning, C. N. Morris, and J. P. Newhouse (1984): “Choosing Between the Sample-Selection Model and the Multi-Part Model,” Journal of Business and Economic Statistics, 2(3), 283–289.
Escobar, M. D. (1994): “Estimating Normal Means With a Dirichlet Process Prior,” Journal of the American Statistical Association, 89(425), 268–277.
Escobar, M. D., and M. West (1995): “Bayesian Density Estimation and Inference Using Mixtures,” Journal of the American Statistical Association, 90(430), 577–588.
Escobar, M. D., and M. West (1998): “Computing Nonparametric Hierarchical Models,” in Practical Nonparametric and Semiparametric Bayesian Statistics, ed. by D. Dey, P. Müller, and D. Sinha, pp. 1–22. Springer.
Ferguson, T. S. (1973): “A Bayesian Analysis of Some Nonparametric Problems,” The Annals of Statistics, 1(2), 209–230.
Ferguson, T. S. (1974): “Prior Distributions on Spaces of Probability Measures,” The Annals of Statistics, 2(4), 615–629.
Frühwirth-Schnatter, S. (2006): Finite Mixture and Markov Switching Models. Springer.
Gelman, A., J. B. Carlin, H. S. Stern, and D. B. Rubin (1995): Bayesian Data Analysis. Chapman & Hall.
Geweke, J. (2005): Contemporary Bayesian Econometrics and Statistics. Wiley.
Gilks, W., S. Richardson, and D. Spiegelhalter (1996): Markov Chain Monte Carlo in Practice. Chapman & Hall.
Gronau, R. (1974): “Wage Comparisons – A Selectivity Bias,” The Journal of Political Economy, 82(6), 1119–1143.
Hay, J. W., and R. J. Olsen (1984): “Let Them Eat Cake: A Note on Comparing Alternative Models of the Demand for Medical Care,” Journal of Business and Economic Statistics, 2(3), 279–289.
Heckman, J. J. (1979): “Sample Selection as a Specification Error,” Econometrica, 47(1), 153–162.
Huang, H.-C. (2001): “Bayesian Analysis of the SUR Tobit Model,” Applied Economics Letters, 8, 617–622.
Kass, R. E., and A. E.
Raftery (1995): “Bayes Factors,” Journal of the American Statistical Association, 90(430), 773–795.
Koop, G., and D. J. Poirier (1997): “Learning About the Across-Regime Correlation in Switching Regression Models,” Journal of Econometrics, 78, 217–227.
Lancaster, T. (2004): An Introduction to Modern Bayesian Econometrics. Blackwell.
Lee, L. F. (2003): “Self-Selection,” in A Companion to Theoretical Econometrics, ed. by B. H. Baltagi, chap. 18. Blackwell Publishing.
Leung, S. F., and S. Yu (1996): “On the Choice Between Sample Selection and Two-Part Models,” Journal of Econometrics, 72, 197–229.
Li, K. (1998): “Bayesian Inference in a Simultaneous Equation Model with Limited Dependent Variables,” Journal of Econometrics, 85, 387–400.
MacEachern, S. N. (1998): “Computational Methods for Mixture of Dirichlet Process Models,” in Practical Nonparametric and Semiparametric Bayesian Statistics, ed. by D. Dey, P. Müller, and D. Sinha, pp. 23–44. Springer.
Manning, W., N. Duan, and W. Rogers (1987): “Monte Carlo Evidence on the Choice Between Sample Selection and Two-Part Models,” Journal of Econometrics, 35, 59–82.
Manning, W. G. (1998): “The Logged Dependent Variable, Heteroscedasticity, and the Retransformation Problem,” Journal of Health Economics, 17, 283–295.
McCulloch, R. E., N. G. Polson, and P. E. Rossi (2000): “A Bayesian Analysis of the Multinomial Probit Model with Fully Identified Parameters,” Journal of Econometrics, 99, 173–193.
McCulloch, R. E., and P. E. Rossi (1994): “An Exact Likelihood Analysis of the Multinomial Probit Model,” Journal of Econometrics, 64, 207–240.
Muirhead, R. J. (1982): Aspects of Multivariate Statistical Theory. Wiley.
Müller, P., and F. A. Quintana (2004): “Nonparametric Bayesian Data Analysis,” Statistical Science, 19(1), 95–110.
Munkin, M. K., and P. K.
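Trivedi (2003): “Bayesian Analysis of a Self-Selection Model with Multiple Outcomes Using Simulation-Based Estimation: An Application to the Demand for Healthcare,” Journal of Econometrics, 114, 197–220.
Nobile, A. (2000): “Comment: Bayesian Multinomial Probit Models with a Normalization Constraint,” Journal of Econometrics, 99, 335–345.
Richardson, S., and P. J. Green (1997): “On Bayesian Analysis of Mixtures with an Unknown Number of Components (with discussion),” Journal of the Royal Statistical Society B, 59(4), 731–792.
Roeder, K., and L. Wasserman (1997): “Practical Bayesian Density Estimation Using Mixtures of Normals,” Journal of the American Statistical Association, 92(439), 894–902.
Tanner, M. A., and W. H. Wong (1987): “The Calculation of Posterior Distributions by Data Augmentation,” Journal of the American Statistical Association, 82, 528–550.
Vella, F. (1998): “Estimating Models with Sample Selection Bias: a Survey,” Journal of Human Resources, 33, 127–169.
Verdinelli, I., and L. Wasserman (1995): “Computing Bayes Factors Using a Generalization of the Savage-Dickey Density Ratio,” Journal of the American Statistical Association, 90(430), 614–618.
West, M., P. Müller, and M. D. Escobar (1994): “Hierarchical Priors and Mixture Models, With Applications in Regression and Density Estimation,” in Aspects of Uncertainty: a Tribute to D.V. Lindley, ed. by P. Freedman et al., pp. 363–386.
Wooldridge, J. M. (2002): Econometric Analysis of Cross Section and Panel Data. MIT Press.

A Mixtures of Normals

Sampling $\zeta_i$

To sample $\zeta_i$ from its posterior distribution (5.6) in the mixture model, we can use the following steps:

1. Calculate, for $j = 1, \ldots, k$,
$$\Pr\{\zeta_i = j \mid w_i, \beta, \gamma, \mu, \Sigma, c_u\} = \frac{\gamma_j |c_u \Sigma_j|^{-1/2} \exp\left\{ -\frac{1}{2 c_u} (w_i - X_i\beta - \mu_j)' \Sigma_j^{-1} (w_i - X_i\beta - \mu_j) \right\}}{\sum_{l=1}^k \gamma_l |c_u \Sigma_l|^{-1/2} \exp\left\{ -\frac{1}{2 c_u} (w_i - X_i\beta - \mu_l)' \Sigma_l^{-1} (w_i - X_i\beta - \mu_l) \right\}};$$
2. Calculate the CDF $\Pr\{\zeta_i \le j \mid w_i, \beta, \gamma, \mu, \Sigma, c_u\}$ for $j = 1, \ldots, k - 1$;
3. Generate a random number $u$ from the $U(0, 1)$ distribution;
4. Find $j^*$, such that $\Pr\{\zeta_i \le j^* - 1 \mid w_i, \beta, \gamma, \mu, \Sigma, c_u\} < u \le \Pr\{\zeta_i \le j^* \mid w_i, \beta, \gamma, \mu, \Sigma, c_u\}$, and set $\zeta_i = j^*$.

Sampling $\gamma$

If $\gamma$ has a Dirichlet prior distribution with parameters $(\alpha_0, \ldots, \alpha_0)$, its density is given by

$$\pi(\gamma) = \frac{\Gamma(k\alpha_0)}{[\Gamma(\alpha_0)]^k}\, \gamma_1^{\alpha_0 - 1} \cdots \gamma_k^{\alpha_0 - 1}\, I\Big\{ \sum_{j=1}^k \gamma_j = 1 \Big\}.$$

The distribution of the component selectors is multinomial:

$$p(\zeta_1, \ldots, \zeta_n \mid \gamma) = \gamma_1^{n_1} \cdots \gamma_k^{n_k},$$

where $n_j = \sum_{i=1}^n I\{\zeta_i = j\}$ and $\sum_{j=1}^k n_j = n$. If $(\beta, \mu, \Sigma, c_u)$ is a priori independent of $(\zeta, \gamma)$, the conditional posterior of $\gamma$ on the completed data $W$ and the remaining parameters satisfies

$$p(\gamma \mid W, \beta, \zeta, \mu, \Sigma, c_u) \propto p(W \mid \beta, \zeta, \gamma, \mu, \Sigma, c_u)\, p(\zeta \mid \gamma)\, \pi(\gamma).$$

Conditional on $\zeta$ the likelihood $p(W \mid \beta, \zeta, \gamma, \mu, \Sigma, c_u)$ does not depend on $\gamma$, so that

$$p(\gamma \mid W, \beta, \zeta, \mu, \Sigma, c_u) = p(\gamma \mid \zeta) \propto p(\zeta \mid \gamma)\, \pi(\gamma),$$

from which (5.7) follows. To generate draws from this distribution, perform the following two steps (e.g. Ferguson 1973): (1) for $j = 1, \ldots, k$, generate $Z_j \sim G(\alpha_0 + n_j, 1)$ and (2) set $\gamma_j = Z_j / \sum_{l=1}^k Z_l$.

Sampling $(\mu_j, \Sigma_j)$

Multiplying the completed data mixture likelihood in (5.1) with the normal-inverse Wishart prior, we have

$$\begin{aligned}
p(\mu_j, \Sigma_j \mid W, \beta, \zeta, c_u) \propto{}& |\Sigma_j|^{-(v+3)/2} \exp\left\{ -\tfrac{1}{2} \mathrm{tr}(\Sigma_j^{-1} H) \right\} \\
&\times |\tau c_u \Sigma_j|^{-1/2} \exp\left\{ -\frac{1}{2 \tau c_u} \mu_j' \Sigma_j^{-1} \mu_j \right\} \\
&\times |c_u \Sigma_j|^{-n_j/2} \exp\Big\{ -\frac{1}{2 c_u} \sum_{i:\, \zeta_i = j} (u_i - \mu_j)' \Sigma_j^{-1} (u_i - \mu_j) \Big\}.
\end{aligned}$$

The exponent involving $\mu_j$ can be rewritten as

$$\exp\Big\{ -\frac{1}{2 c_u} \Big[ (\mu_j - b_j)' M_j^{-1} (\mu_j - b_j) - b_j' M_j^{-1} b_j + \sum_{i:\, \zeta_i = j} u_i' \Sigma_j^{-1} u_i \Big] \Big\},$$

with $b_j$ and $M_j$ defined in (5.9). The posterior of $\mu_j$ given $\Sigma_j$ then follows. As for $\Sigma_j$, we can write

$$\begin{aligned}
p(\mu_j, \Sigma_j \mid W, \beta, \zeta, c_u) &\propto p(\mu_j \mid \Sigma_j, W, \beta, \zeta, c_u)\, |\Sigma_j|^{-(v + n_j + 3)/2} \exp\left\{ -\tfrac{1}{2} \mathrm{tr}(\Sigma_j^{-1} H) \right\} \exp\Big\{ -\frac{1}{2 c_u} \Big[ \sum_{i:\, \zeta_i = j} u_i' \Sigma_j^{-1} u_i - b_j' M_j^{-1} b_j \Big] \Big\} \\
&\propto p(\mu_j \mid \Sigma_j, W, \beta, \zeta, c_u)\, |\Sigma_j|^{-(v + n_j + 3)/2} \exp\Big\{ -\tfrac{1}{2} \mathrm{tr}\Big[ \Sigma_j^{-1} \Big( H + c_u^{-1} \sum_{i:\, \zeta_i = j} u_i u_i' - c_u^{-1} \frac{\tau n_j^2}{1 + \tau n_j} \bar u_j \bar u_j' \Big) \Big] \Big\},
\end{aligned}$$

from which (5.8) follows.

Sampling $c_u$

Given that $\pi(c_u) = IG(v/2, d)$ and $c_u$ enters the prior of $\mu_j$ and the completed data likelihood, we have

$$\begin{aligned}
p(c_u \mid W, \beta, \zeta, \gamma, \mu, \Sigma) \propto{}& c_u^{-(v/2 + 1)} e^{-d/c_u} \prod_{j=1}^k |\tau c_u \Sigma_j|^{-1/2} \exp\left\{ -\frac{1}{2 \tau c_u} \mu_j' \Sigma_j^{-1} \mu_j \right\} \\
&\times \prod_{i=1}^n |c_u \Sigma_{j(i)}|^{-1/2} \exp\left\{ -\frac{1}{2 c_u} (w_i - X_i\beta - \mu_{j(i)})' \Sigma_{j(i)}^{-1} (w_i - X_i\beta - \mu_{j(i)}) \right\} \\
\propto{}& c_u^{-(v/2 + k + n + 1)} \exp\Big\{ -\frac{1}{c_u} \Big[ d + \frac{1}{2} \sum_{j=1}^k \tau^{-1} \mu_j' \Sigma_j^{-1} \mu_j \Big] \Big\} \exp\Big\{ -\frac{1}{c_u} \Big[ \frac{1}{2} \sum_{i=1}^n (u_i - \mu_{j(i)})' \Sigma_{j(i)}^{-1} (u_i - \mu_{j(i)}) \Big] \Big\},
\end{aligned}$$

from which (5.10) follows.

B Dirichlet Mixtures of Normals

Posterior of $\theta_i$

We will use the following shorthand notation for the Pólya urn prior in (6.3):

$$\pi(\theta_i \mid \theta_{-i}) = \frac{\alpha}{\alpha + n - 1}\, dG_0(\theta_i) + \frac{1}{\alpha + n - 1} \sum_{j \neq i} \delta_{\theta_j}(\theta_i),$$

where $\delta_{\theta_j}$ is the measure with unit mass at $\theta_j$. Then

$$\begin{aligned}
p(\theta_i \mid \theta_{-i}, W, \beta, c_u) &\propto \prod_{l=1}^n p(w_l \mid X_l, \beta, \theta_l, c_u)\, \pi(\theta_i \mid \theta_{-i}) \\
&\propto \frac{\alpha}{\alpha + n - 1}\, dG_0(\theta_i)\, p(w_i \mid X_i, \beta, \theta_i, c_u) + \sum_{j \neq i} \frac{p(w_i \mid X_i, \beta, \theta_i, c_u)}{\alpha + n - 1}\, \delta_{\theta_j}(\theta_i) \\
&\propto \alpha\, p(w_i \mid X_i, \beta, \theta_i, c_u)\, dG_0(\theta_i) + \sum_{j \neq i} p(w_i \mid X_i, \beta, \theta_j, c_u)\, \delta_{\theta_j}(\theta_i).
\end{aligned} \tag{B.1}$$

The normalizing constant $c$ satisfies

$$c = \int \Big[ \alpha\, p(w_i \mid X_i, \beta, \theta_i, c_u)\, dG_0(\theta_i) + \sum_{j \neq i} p(w_i \mid X_i, \beta, \theta_j, c_u)\, \delta_{\theta_j}(\theta_i) \Big] d\theta_i = \alpha\, p(w_i \mid X_i, \beta, c_u) + \sum_{j \neq i} p(w_i \mid X_i, \beta, \theta_j, c_u).$$

The posterior (B.1) then becomes

$$\begin{aligned}
p(\theta_i \mid \theta_{-i}, W, \beta, c_u) &= \frac{1}{c} \Big\{ \alpha\, p(w_i \mid X_i, \beta, c_u)\, \frac{p(w_i \mid X_i, \beta, \theta_i, c_u)\, dG_0(\theta_i)}{p(w_i \mid X_i, \beta, c_u)} + \sum_{j \neq i} p(w_i \mid X_i, \beta, \theta_j, c_u)\, \delta_{\theta_j}(\theta_i) \Big\} \\
&= \big( c^{-1} \alpha\, p(w_i \mid X_i, \beta, c_u) \big)\, p(\theta_i \mid w_i, X_i, \beta, c_u) + \sum_{j \neq i} \big( c^{-1} p(w_i \mid X_i, \beta, \theta_j, c_u) \big)\, \delta_{\theta_j}(\theta_i),
\end{aligned}$$

which yields (6.4). In order to sample from this distribution three elements are needed: $p(w_i \mid X_i, \beta, \theta_j, c_u)$, $p(w_i \mid X_i, \beta, c_u)$ and $p(\theta_i \mid w_i, X_i, \beta, c_u)$. The first is simply the completed data likelihood contribution of observation $i$:

$$p(w_i \mid X_i, \beta, \theta_j, c_u) = (2\pi)^{-1} |c_u \Sigma_j|^{-1/2} \exp\left\{ -\frac{1}{2 c_u} (w_i - X_i\beta - \mu_j)' \Sigma_j^{-1} (w_i - X_i\beta - \mu_j) \right\}. \tag{B.2}$$

For the second component, integrating out $\theta_j$ from the equation above yields $p(w_i \mid X_i, \beta, c_u)$.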
In j order to do so analytically, we take dG0 ( i ) = N ( i ; 0; cu i) 1 W ( i ; H; v): (B.3) Thus, the base distribution is the product of an inverse Wishart distribution for H and v) and a conditional normal distribution for (with parameters i (with mean zero and variance cu i i ). This choice of G0 allows us to center prior beliefs around the parametric SSM in section ?? by taking = 0 and = 0. The conditional joint distribution of wi and p(wi jXi ; ; i ; cu )dG0 ( i ) = kv jHjv=2 j ij cu 1 (2 ) 1 cu 1 (2 ) 1 is now (v+3)=2 1 j i ij 1=2 1=2 exp j ij 1 tr 2 exp 1 i 1 2 cu exp 1 (wi 2cu H i 1 0 i Xi i 0 i) 1 i (wi Xi i) where kv is a normalizing constant: kv = 2 v 1=2 v 1 2 v 2 : (B.4) Collecting all constants and rewriting the exponent as a quadratic function in p(wi jXi ; ; i ; cu )dG0 ( i ) = cu 2 k~v j (v+5)=2 ij exp 1 tr( 2 1 i i, it follows that H) 1 0 1 [u ui b0i Mi 1 bi g 2cu i i 1 exp ( bi )0 Mi 1 ( i bi ) ; 2cu i = wi Xi ; expf ui bi = 1+ ui ; Mi = k~v 1+ = (2 ) 38 (B.5) i; 2 1 kv jHjv=2 : ; Integrating out p(wi ; i we …nd i jXi ; ; cu ) = Z p(wi jXi ; ; i ; cu )dG0 ( i )d = cu 2 k~v j ij (v+5)=2 1 0 [u 2cu i expf 1 tr( 2 exp 1 i i 1 i b0i Mi 1 bi g ui kv jHjv=2 j i j (v+4)=2 cu (1 + ) 1 exp tr( i 1 H) + cu 1 u0i 2 = (2 ) = (2 ) where ~i = H + H H) 2 cu j (1 + ) 1=2 ij 1 1 kv jHjv=2 j cu (1 + ) ij 1 (wi cu (1 + ) 1 p(wi jXi ; ; cu ) = i p(wi ; cu 1 b0i Mi 1 bi ui exp 1 tr( 2 1 i ~ i) ; H Xi )0 : Xi )(wi Using the density of the inverse Wishart distribution Z (v+4)=2 1 i can be integrated out, so that i jXi ; ; cu )d = (2 ) 1 kv jHjv=2 cu (1 + ) kv~ ~ v~=2 Hi = (v 1) jHjv=2 ; 2 cu (1 + ) ~ v~=2 Hi i where v~ = v + 1 and kv~ is de…ned in (B.4). A typical choice of vague prior uses H = "0 I2 for some large "0 > 0. In that case the determinants can be explicitly calculated. 
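Using the result that $|I_p + aa'| = 1 + a'a$ for any $a \in \mathbb{R}^p$, it follows that
\[
\begin{aligned}
p(w_i \mid X_i, \beta, c_u) &= \frac{v - 1}{2\pi c_u (1+\tau)} \cdot \frac{\varepsilon_0^{v}}{\varepsilon_0^{v+1} \left( 1 + [\varepsilon_0 c_u (1+\tau)]^{-1} u_i' u_i \right)^{(v+1)/2}} \\
&= \frac{v - 1}{2\pi c_u \varepsilon_0 (1+\tau)} \left[ 1 + \frac{u_i' u_i}{c_u \varepsilon_0 (1+\tau)} \right]^{-(v+1)/2}.
\end{aligned}
\]
Finally, from (6.4), if the value of $\theta_i$ is not equal to any other $\theta_j$ it is distributed according to the posterior arising from $G_0(\theta_i)$. The chosen form of $G_0$ now allows us to generate values of $\theta_i$ in two simple steps: first generate $\Sigma_i$ from its posterior (conditional on $(w_i, \beta, c_u)$) and then generate $\mu_i$ from its posterior conditional on $(\Sigma_i, w_i, \beta, c_u)$. Multiplying the likelihood (B.2) and prior (B.3) and rearranging the exponent, it follows that
\[
p(\mu_i, \Sigma_i \mid w_i, X_i, \beta, c_u) \propto |\Sigma_i|^{-(v+4)/2} \exp\left\{ -\tfrac{1}{2} \mathrm{tr}(\Sigma_i^{-1} \tilde{H}_i) \right\} \times |\Sigma_i|^{-1/2} \exp\left\{ -\frac{1}{2c_u} (\mu_i - b_i)' M_i^{-1} (\mu_i - b_i) \right\},
\]
where $b_i$ and $M_i$ were defined in (B.5). Thus:
\[
\begin{aligned}
\Sigma_i \mid w_i, X_i, \beta, c_u &\sim W^{-1}(\tilde{H}_i, v + 1), \\
\mu_i \mid \Sigma_i, w_i, X_i, \beta, c_u &\sim N\!\left( \frac{\tau}{1+\tau}\, u_i,\; \frac{c_u \tau}{1+\tau}\, \Sigma_i \right).
\end{aligned}
\]

B.1 Remixing Unique Values of $\theta$

To describe the remixing step for $\theta$, let $k$ be the number of unique values in $\theta$, denoted by $\theta^* = \{\theta_j^*\}_{j=1}^{k}$. Define the component selectors $\zeta = \{\zeta_i\}_{i=1}^{n}$ as before:
\[
\zeta_i = j \iff \theta_i = \theta_j^*, \qquad j = 1, \ldots, k.
\]
Let $k_{-i}$ be the number of distinct values in $\theta_{-i}$ and $n_{j,-i} = \sum_{l:\, l \ne i} I\{\zeta_l = j\}$ for $j = 1, \ldots, k_{-i}$. The posterior of $\theta_i$ in (6.4) then becomes
\[
\theta_i \mid \theta_{-i}, W, \beta, c_u
\begin{cases}
= \theta_j^* & \text{w. prob. } c^{-1} n_{j,-i}\, p(w_i \mid X_i, \beta, \theta_j^*, c_u), \quad j = 1, \ldots, k_{-i}, \\
\sim p(\theta_i \mid w_i, X_i, \beta, c_u) & \text{w. prob. } c^{-1} \alpha\, p(w_i \mid X_i, \beta, c_u).
\end{cases}
\qquad \text{(B.6)}
\]
Note that knowledge of $\theta$ is equivalent to knowing $(\theta^*, \zeta, k)$. The remixing algorithm is based on sampling $(\zeta, k)$ from its conditional distribution given $\theta^*$, and $\theta^*$ from its conditional distribution given $(\zeta, k)$. From (B.6) it follows immediately that
\[
\Pr\{\zeta_i = j \mid \zeta_{-i}, W, \beta, \theta^*, c_u\} = c^{-1} n_{j,-i}\, p(w_i \mid X_i, \beta, \theta_j^*, c_u), \qquad j = 1, \ldots, k_{-i}. \qquad \text{(B.7)}
\]
Also, with probability
\[
\Pr\{\zeta_i = 0 \mid \zeta_{-i}, W, \beta, \theta^*, c_u\} = 1 - c^{-1} \sum_{j=1}^{k_{-i}} n_{j,-i}\, p(w_i \mid X_i, \beta, \theta_j^*, c_u),
\]
set $\zeta_i$ equal to zero and generate $\theta_i$ from $p(\theta_i \mid w_i, X_i, \beta, c_u)$.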
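The selector update in (B.7), together with the probability of setting the selector to zero, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `selector_probabilities` and `draw_selector` are hypothetical, and the likelihood values $p(w_i \mid X_i, \beta, \theta_j^*, c_u)$ and the marginal $p(w_i \mid X_i, \beta, c_u)$ are assumed to have been computed beforehand and are passed in as plain numbers.

```python
import random


def selector_probabilities(alpha, marginal, counts, likelihoods):
    """Return [Pr(selector = 0), Pr(selector = 1), ..., Pr(selector = k_{-i})].

    alpha       : precision parameter of the Dirichlet process prior
    marginal    : p(w_i | X_i, beta, c_u), the likelihood integrated over G_0
    counts      : n_{j,-i}, number of other observations sharing unique value j
    likelihoods : p(w_i | X_i, beta, theta*_j, c_u) for each unique value j
    """
    weights = [alpha * marginal] + [n * p for n, p in zip(counts, likelihoods)]
    c = sum(weights)  # the normalizing constant c
    return [w / c for w in weights]


def draw_selector(probs, rng=random):
    """Inverse-CDF draw: 0 means 'generate a new value', j >= 1 reuses value j."""
    u = rng.random()
    cumulative = 0.0
    for j, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return j
    return len(probs) - 1  # guard against floating-point round-off
```

A draw of 0 signals that a fresh value should be generated from the posterior under the base distribution; any other draw assigns observation $i$ to an existing unique value.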
After cycling through for $i = 1, \ldots, n$, and potentially relabeling $\zeta$, we obtain a new value of $(\zeta, k)$. In the prior $\theta^*$ represents $k$ i.i.d. draws from $G_0$ (Antoniak 1974). Then:
\[
p(\theta_1^*, \ldots, \theta_k^* \mid W, \beta, \zeta, k, c_u) \propto \prod_{j=1}^{k} \Big\{ \Big[ \prod_{i:\, \zeta_i = j} p(w_i \mid X_i, \beta, \theta_j^*, c_u) \Big]\, dG_0(\theta_j^*) \Big\}, \qquad \text{(B.8)}
\]
so that the $\theta_j^*$'s are conditionally independent and
\[
p(\theta_j^* \mid W, \beta, \zeta, k, c_u) \propto \Big[ \prod_{i:\, \zeta_i = j} p(w_i \mid X_i, \beta, \theta_j^*, c_u) \Big]\, dG_0(\theta_j^*), \qquad j = 1, \ldots, k. \qquad \text{(B.9)}
\]
Applying this to $(\mu_j, \Sigma_j)$ yields exactly the posterior given by (5.8) and (5.9). Thus, the posterior for $(\mu_j, \Sigma_j)$ in the finite mixture model can be used in the Dirichlet model as a remixing distribution: first update $\{(\mu_i, \Sigma_i)\}_{i=1}^{n}$ and determine the number of unique components, then regenerate the unique components alone and assign the $k$ values across the sample, according to the component selectors $\zeta_i$. Step 3 in algorithm 5 can now be replaced by

3a. Sample $\zeta_i$ from the distribution in (B.7) and (B.8) for $i = 1, \ldots, n$, and determine the number of unique components $k$;
3b. sample $(\mu_j, \Sigma_j)$ from (5.8) and (5.9) for $j = 1, \ldots, k$.

Uncertainty about $\alpha$

It is possible to incorporate uncertainty about $\alpha$ into a Gibbs sampling algorithm, thereby allowing the data to determine how many mixture components are appropriate. A convenient approach, due to Escobar and West (1995), is described here. As before let $\theta^* = \{\theta_j^*\}_{j=1}^{k}$ be the collection of unique values of $\theta_i$. Let $\alpha$ be a priori independent of $(\beta, c_u)$ with prior $\pi(\alpha)$. Note that the likelihood $p(w_i \mid X_i, \beta, \theta^*, c_u, \zeta, k, \alpha)$ does not depend on $\alpha$. The posterior of $\alpha$ then satisfies
\[
\begin{aligned}
p(\alpha \mid W, \beta, \theta^*, c_u, \zeta, k) &\propto p(W \mid \beta, \theta^*, c_u, \zeta, k, \alpha)\, p(\theta^*, \zeta, k \mid \alpha, \beta, c_u)\, \pi(\alpha) \\
&\propto p(\theta^* \mid \zeta, k, \alpha, \beta, c_u)\, p(\zeta \mid k, \alpha, \beta, c_u)\, p(k \mid \alpha, \beta, c_u)\, \pi(\alpha).
\end{aligned}
\]
Given (6.3), the prior distribution of $\zeta$ given $(k, \alpha, \beta, c_u)$ only depends on the sample size and $k$. Also, the random number of mixture components only depends on $\alpha$, so that $p(k \mid \alpha, \beta, c_u) = p(k \mid \alpha)$. Finally, conditional on $k$ and $c_u$ the unique values $\theta^*$ follow the prior distribution $G_0$, which does not depend on $\alpha$ or $\beta$.
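With these conditional independence relations, the posterior simplifies to
\[
p(\alpha \mid W, \beta, \theta^*, c_u, \zeta, k) = p(\alpha \mid k) \propto p(k \mid \alpha)\, \pi(\alpha) \propto \alpha^{k}\, \frac{\Gamma(\alpha)}{\Gamma(\alpha + n)}\, \pi(\alpha).
\]
Using a property of the beta function $B(b_1, b_2)$:
\[
B(b_1, b_2) = \int_0^1 x^{b_1 - 1} (1 - x)^{b_2 - 1}\, dx = \frac{\Gamma(b_1)\, \Gamma(b_2)}{\Gamma(b_1 + b_2)},
\]
the posterior of $\alpha$ can be written as
\[
p(\alpha \mid k) \propto \pi(\alpha)\, \frac{\alpha^{k-1} (\alpha + n)}{\Gamma(n)} \int_0^1 \eta^{\alpha} (1 - \eta)^{n-1}\, d\eta.
\]
The posterior corresponds to a joint posterior of $\alpha$ and a latent variable $\eta \in (0, 1)$, given by
\[
p(\alpha, \eta \mid k) \propto \pi(\alpha)\, \frac{\alpha^{k-1} (\alpha + n)}{\Gamma(n)}\, \eta^{\alpha} (1 - \eta)^{n-1},
\]
from which it is clear that $p(\eta \mid \alpha, k)$ is the beta $B(\alpha + 1, n)$ distribution. The joint posterior suggests using a $G(c_1, c_2)$ distribution as prior for $\alpha$. Then:
\[
\begin{aligned}
p(\alpha \mid \eta, k) &\propto \alpha^{c_1 + k - 1} e^{-\alpha (c_2 - \log \eta)} + n\, \alpha^{c_1 + k - 2} e^{-\alpha (c_2 - \log \eta)} \\
&\propto \frac{\Gamma(c_1 + k)}{[c_2 - \log \eta]^{c_1 + k}}\, G(c_1 + k, c_2 - \log \eta) + n\, \frac{\Gamma(c_1 + k - 1)}{[c_2 - \log \eta]^{c_1 + k - 1}}\, G(c_1 + k - 1, c_2 - \log \eta),
\end{aligned}
\]
which is a mixture of two gamma distributions. The mixing probability $p_\eta$ of the first component, $G(c_1 + k, c_2 - \log \eta)$, satisfies
\[
\frac{p_\eta}{1 - p_\eta} = \frac{[c_2 - \log \eta]^{-(c_1 + k)}\, \Gamma(c_1 + k)}{n\, [c_2 - \log \eta]^{-(c_1 + k - 1)}\, \Gamma(c_1 + k - 1)} = \frac{c_1 + k - 1}{n (c_2 - \log \eta)}. \qquad \text{(B.10)}
\]
We can now augment algorithm 5 with the following steps:

4a. sample $\eta$ from $B(\alpha + 1, n)$;
4b. calculate $p_\eta$ according to (B.10);
4c. with probability $p_\eta$, sample $\alpha$ from $G(c_1 + k, c_2 - \log \eta)$; with probability $1 - p_\eta$, sample $\alpha$ from $G(c_1 + k - 1, c_2 - \log \eta)$;
5. return to step 1 in algorithm 5 and repeat.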
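Steps 4a to 4c can be sketched in Python using only the standard library. This is a minimal illustration, not the author's implementation: the function names `mixing_probability` and `update_alpha` are hypothetical, and the gamma distribution $G(a, b)$ is assumed to be parameterized by shape $a$ and rate $b$, which is what the kernel $\alpha^{a-1} e^{-b\alpha}$ in the derivation implies.

```python
import math
import random


def mixing_probability(c1, c2, k, n, eta):
    """Weight on the G(c1 + k, c2 - log(eta)) component, from the odds in (B.10)."""
    odds = (c1 + k - 1.0) / (n * (c2 - math.log(eta)))
    return odds / (1.0 + odds)


def update_alpha(alpha, k, n, c1, c2, rng=random):
    """One Gibbs update of alpha given k unique values among n observations."""
    # Step 4a: latent variable eta | alpha, k ~ Beta(alpha + 1, n).
    eta = rng.betavariate(alpha + 1.0, n)
    # Step 4b: mixing probability from (B.10).
    p_eta = mixing_probability(c1, c2, k, n, eta)
    # Step 4c: choose one of the two gamma components, then draw alpha.
    # gammavariate takes (shape, scale), so the rate is inverted.
    rate = c2 - math.log(eta)
    shape = c1 + k if rng.random() < p_eta else c1 + k - 1.0
    return rng.gammavariate(shape, 1.0 / rate), eta
```

Calling `update_alpha` once per sweep, with the current number of unique values `k`, would give the new `alpha` to carry into the next iteration; the latent `eta` is returned only for inspection.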