Censored and Sample Selection Models
Li Gan, October 2010

Examples

Censored data examples:

(1) Stadium attendance: $\text{attendance} = \min(\text{attendance}^*, \text{capacity})$. Here the grouping (observations with attendance rates below one versus observations with attendance rates equal to one) is clearly based on observed factors.

(2) Top-coding: $\text{wealth} = \min(\text{wealth}^*, 1\text{ million})$. We are interested in understanding how wealth is determined: $\text{wealth}^* = x\beta + u$, where $u|x \sim$ log-normal. Corner solution / censored regression (no data observables).

(3) Testing scores: The Texas Assessment of Knowledge and Skills (TAKS) typically has 42 or 43 multiple-choice questions. A student who answers all of them correctly gets the maximum score. We are interested in how the test score of student $i$ at school $j$ is affected by, among other things, per-student spending at school $j$:

$$\text{score}_{ij}^* = x_{ij}\beta + \gamma\,\text{PerStudentSpending}_j + u_{ij}$$

Observed scores are:

$$\text{score}_{ij} = \min\{\text{score}_{ij}^*, \text{maximum possible score}\}$$

A portion of students may also get the minimum score.

(4) Testing scores again:

$$\text{score}_{ijt}^* = \rho\,\text{score}_{ij,t-1}^* + x_{ij}\beta + \gamma\,\text{PerStudentSpending}_{jt} + u_{ij}$$

In this case, both the dependent and the independent variables could be censored.

Examples of Sample Selection Models

(5) The classical example of sample selection: we are interested in how $x$ affects wages,

$$\text{wage}^* = x\beta + u, \qquad \text{wage} = \text{wage}^* \text{ if work} = 1.$$

However, we only observe wages for those who are working:

$$\text{work} = 1(z\gamma + v > 0)$$

It is likely that some unobserved heterogeneity (such as earning ability, personal ambition, etc.) affects both the wage equation and the work equation, so the errors $u$ and $v$ are correlated. Intuitively, conditional on observables, the group who work differs from the group who do not work because the two groups may have different unobserved heterogeneity. This is the so-called sample selection.

(6) In surveys, it is typical to see that a large portion of people do not give continuous responses on their wealth values.
For example, when asked about the value of their stock holdings, 45.2% of people gave no amount. Can we ignore these people (treat them as missing values)? This is the same question as whether those non-responses are random or not. In fact, there are two kinds of non-response: DK (Don't Know) and RF (refused to answer this question). The estimated determinants (signs in parentheses) are:

Pr(DK = 1): professional (-), widow (+)
Pr(RF = 1): HS Grad (-1), College (-), professional (-1)

If we allow people to give bracketed responses, a much larger percentage of people are willing to respond.

Differences between censored (or truncated) data and sample selection data

In general, suppose we have a set of information $(x, y, z)$. The population model is:

$$y = x\beta + u, \text{ and } z \text{ are instruments.} \qquad (2)$$

Let $s = 1$ if an observation belongs to one sample and $s = 0$ if it belongs to the other sample. If $y$ is not observed when $s = 0$, this is called a truncated sample.

The key difference between Heckman's sample selection and censored regression is whether the selection is random. In the classical example of the wage regression for women:

$$\log(\text{wage}_i) = x_i\beta + u_i \qquad (3)$$

Case 1: $s = 1$: observe those women who work.
Case 2: $s = 1$: observe wages, where some women are at the minimum wage, $\text{wage}_0$.
Case 3: $s = 1$: observe those born in odd months of the year.
Case 4: $s = 1$: observe those with schooling years > 13.

Consider the case in which there is no unobserved ability (this may be true if we observe AFQT scores). In other words, the error in the wage equation is uncorrelated with all $x_i$.

In each of the four cases, the starting point is similar:

$$E(\log(\text{wage}_i) | s_i = 1) = E(x_i\beta + u_i | s_i = 1) = x_i\beta + E(u_i | s_i = 1)$$

Therefore, the critical issue is whether $E(u_i | s_i = 1) = 0$. This largely depends on whether $u_i$ and $s_i$ are correlated.

Case 1: Sample selection problem:

$$E(\log(\text{wage}_i) | s_i = 1) = E(x_i\beta + u_i | z_i\gamma + v_i > 0) = x_i\beta + E(u_i | v_i > -z_i\gamma) \neq x_i\beta, \text{ if } Cov(u_i, v_i) \neq 0.$$
Therefore, ignoring sample selection would create biased estimates, because $E(u_i | v_i > -z_i\gamma)$ would enter the error term and it is generally correlated with the regressor $x_i$.

Case 2: Censored or truncated data:

$$E(\log(\text{wage}_i) | s_i = 1) = E(x_i\beta + u_i | \log(\text{wage}_i) > \log(\text{wage}_0)) = x_i\beta + E(u_i | u_i > \log(\text{wage}_0) - x_i\beta) \neq x_i\beta.$$

Similarly, ignoring censoring would create biased estimates, because $E(u_i | u_i > \log(\text{wage}_0) - x_i\beta)$ would enter the error term and it is a function of the regressor $x_i$.

Case 3: Random sampling:

$$E(\log(\text{wage}_i) | s_i = 1) = x_i\beta + E(u_i | \text{born-in-odd-month}_i = 1) = x_i\beta,$$

since $u_i$ and the born-in-odd-month dummy are independent. In this case, ignoring the unobserved sample does NOT create any problem in estimation. This is called a random sample.

Case 4: Sampling based on some observables $x_i$:

$$E(\log(\text{wage}_i) | s_i = 1) = x_i\beta + E(u_i | \text{schooling}_i > 13)$$

If the original model is correctly specified, then $cov(x_i, u_i) = 0$. Therefore, $E(u_i | \text{schooling}_i > 13) = 0$. Sampling based on observables typically behaves like random sampling.

However, suppose we only observe whether a person has a college degree or not (a dummy variable), and $s_i = 1$ if $\text{CollegeDegree}_i = 1$. In this case the selection creates a problem, because the selection process has essentially created a missing-variable problem: if $\text{CollegeDegree}_i$ is correlated with the rest of the regressors $x_i$, leaving it out biases the estimates.

For comparison:
(1) A censored (or truncated) sample can be considered a special case of sample selection in which $v_i = u_i$ and $z_i\gamma = x_i\beta$.
(2) Between the sample selection model and the random selection model, the key is whether $s_i$ and $u_i$ are correlated; under random selection they are uncorrelated.

Estimation Methods:

A general censored regression model:

$$y_i^* = x_i\beta + u_i, \qquad y_i = \max(0, y_i^*), \quad \text{or: } s_i = 1(y_i^* > 0).$$

There are two methods to estimate such models.

1.
The regression method: To construct regression models, it is necessary to find the conditional expectation $E(y_i | s_i = 1)$. It is useful to first work out the conditional expectation for the standard normal. Suppose $\varepsilon \sim N(0,1)$ with density $f = \phi$. Then

$$E(\varepsilon | \varepsilon > c) = \int \varepsilon f(\varepsilon | \varepsilon > c)\,d\varepsilon = \int_c^\infty \varepsilon \frac{f(\varepsilon)}{\Pr(\varepsilon > c)}\,d\varepsilon = \frac{1}{1 - \Phi(c)} \int_c^\infty \varepsilon f(\varepsilon)\,d\varepsilon$$

$$= \frac{1}{1 - \Phi(c)} \int_c^\infty \frac{\varepsilon}{\sqrt{2\pi}} \exp\left(-\frac{\varepsilon^2}{2}\right) d\varepsilon = \frac{1}{1 - \Phi(c)} \int_c^\infty \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{\varepsilon^2}{2}\right) d\left(\frac{\varepsilon^2}{2}\right)$$

$$= -\frac{1}{1 - \Phi(c)} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{\varepsilon^2}{2}\right)\Bigg|_c^\infty = \frac{\phi(c)}{1 - \Phi(c)}$$

Similarly, we can work out $E(\varepsilon | \varepsilon < c)$:

$$E(\varepsilon | \varepsilon < c) = \int \varepsilon f(\varepsilon | \varepsilon < c)\,d\varepsilon = \frac{1}{\Phi(c)} \int_{-\infty}^c \varepsilon f(\varepsilon)\,d\varepsilon = -\frac{1}{\Phi(c)} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{\varepsilon^2}{2}\right)\Bigg|_{-\infty}^c = -\frac{\phi(c)}{\Phi(c)}$$

Given this, for $u \sim N(0, \sigma^2)$ we have:

$$E(u | u > c) = \sigma E\left(\frac{u}{\sigma} \,\Big|\, \frac{u}{\sigma} > \frac{c}{\sigma}\right) = \sigma \frac{\phi(c/\sigma)}{1 - \Phi(c/\sigma)}$$

Similarly, one can obtain the expectation $E(u \cdot 1(u > c)) = \sigma\phi(c/\sigma)$. Therefore, using $\phi(-a) = \phi(a)$ and $1 - \Phi(-a) = \Phi(a)$:

$$E(y | x, y > 0) = x\beta + E(u | u > -x\beta) = x\beta + \sigma \frac{\phi(x\beta/\sigma)}{\Phi(x\beta/\sigma)}$$

The previous equation suggests a nonlinear regression method. For those observations with $y_i > 0$:

$$y_i = x_i\beta + \sigma \frac{\phi(x_i\beta/\sigma)}{\Phi(x_i\beta/\sigma)} + u_i. \qquad (4)$$

Discussions:

(1) A somewhat common approach to this problem is a two-step method. In the first step, one estimates a binary probit model and uses the coefficient estimates to calculate the inverse Mills ratio. In the second step, one estimates equation (4). However, this two-step method has several problems.

(a) In this nonlinear model, the parameter vector $\beta$ appears in both the linear part of the model, $x_i\beta$, and the nonlinear part, $\phi(x_i\beta/\sigma)/\Phi(x_i\beta/\sigma)$. Therefore, it is necessary to estimate the linear part and the nonlinear part simultaneously to ensure that these parameter estimates have the same values. Estimating them separately would not guarantee the same parameter estimates.

(b) Second, using the estimated parameters to generate the inverse Mills ratio suffers from the usual "forbidden regression" problem.
(c) Even if one can estimate the model consistently by nonlinear least squares, only a subset of the information is used, so the estimator is less efficient. Fortunately, the usual likelihood function is simple to estimate and it is also efficient.

(2) Suppose we want to use all observations, $y_i$ is censored at $c$, and we observe $x_i$ for all $i$:

$$y_i^* = x_i\beta + u_i, \qquad y_i = \max(c, y_i^*).$$

Therefore,

$$E(y_i | x_i) = E(y_i | x_i, y_i > c) \Pr(y_i > c | x_i) + c \Pr(y_i = c | x_i)$$

$$= \left(x_i\beta + \sigma \frac{\phi\left(\frac{c - x_i\beta}{\sigma}\right)}{1 - \Phi\left(\frac{c - x_i\beta}{\sigma}\right)}\right) \left(1 - \Phi\left(\frac{c - x_i\beta}{\sigma}\right)\right) + c\,\Phi\left(\frac{c - x_i\beta}{\sigma}\right)$$

$$= x_i\beta \left(1 - \Phi\left(\frac{c - x_i\beta}{\sigma}\right)\right) + \sigma\phi\left(\frac{c - x_i\beta}{\sigma}\right) + c\,\Phi\left(\frac{c - x_i\beta}{\sigma}\right)$$

So a nonlinear regression method that uses all data is given by (assume $c = 0$):

$$y_i = x_i\beta\,\Phi\left(\frac{x_i\beta}{\sigma}\right) + \sigma\phi\left(\frac{x_i\beta}{\sigma}\right) + u_i \qquad (5)$$

According to equations (4) and (5), OLS of $y_i$ on $x_i$ in either the sub-sample or the whole sample would lead to biased estimates.

(3) Alternatively, one may apply a Heckman-type two-step least squares.
a. Estimate a Probit of $y = 0$ vs $y > 0$. Let the coefficient be $\hat\gamma$.
b. Estimate a linear regression for the sub-sample $y > 0$:

$$y = x\beta + \lambda \frac{\phi(x\hat\gamma)}{\Phi(x\hat\gamma)} + v$$

According to (4), the coefficients $\beta$ and $\gamma$ should be the same up to the Probit scale normalization (the Probit identifies $\beta/\sigma$, and $\lambda$ corresponds to $\sigma$). This procedure does not guarantee that the two sets of coefficients agree. A specification test of $H_0: \beta = \gamma$ (after rescaling) can be performed here.

2. The Maximum Likelihood Estimation method: Again, consider a censored regression model:

$$y_i^* = x_i\beta + u_i, \qquad y_i = \max(0, y_i^*).$$

The density is given by:

$$f(y_i | x_i) = \left(1 - \Phi\left(\frac{x_i\beta}{\sigma}\right)\right)^{1(y_i = 0)} \left(\frac{1}{\sigma}\phi\left(\frac{y_i - x_i\beta}{\sigma}\right)\right)^{1(y_i > 0)}$$

This is the Tobit model. In the charity example (equation (1)), where $q_i$ is the amount of money given to charities, the likelihood function is given by:
$$f(q_i | z_i, p_i) = \left(1 - \Phi\left(\frac{z_i\gamma - \log(p_i)}{\sigma}\right)\right)^{1(q_i = 0)} \left(\frac{1}{\sigma}\phi\left(\frac{\log(1 + q_i) - z_i\gamma + \log(p_i)}{\sigma}\right)\right)^{1(q_i > 0)}$$

Other Types of Censoring:

(1) Double censoring, with lower limit $a$ and upper limit $b$:

$$y = \begin{cases} a & y^* \le a \\ y^* = x\beta + u & a < y^* < b \\ b & y^* \ge b \end{cases} \qquad (7)$$

The density function for (7) is:

$$f(y_i | x_i) = \Phi\left(\frac{a - x_i\beta}{\sigma}\right)^{1(y_i = a)} \left(1 - \Phi\left(\frac{b - x_i\beta}{\sigma}\right)\right)^{1(y_i = b)} \left(\frac{1}{\sigma}\phi\left(\frac{y_i - x_i\beta}{\sigma}\right)\right)^{1(a < y_i < b)}$$

One can apply MLE to this density function.

(2) Endogenous explanatory variable model with censoring:

$$y_1 = \max(0, z_1\delta_1 + \alpha_1 y_2 + u_1), \qquad y_2 = z\delta_2 + v_2$$

Note that $u_1$ and $v_2$ are correlated. Rewrite $u_1$ as $u_1 = \theta v_2 + \varepsilon_1$ and plug it into the previous equation:

$$y_1 = \max(0, z_1\delta_1 + \alpha_1 y_2 + \theta v_2 + \varepsilon_1)$$

This suggests a two-step procedure (similar to the discrete case), due to Smith and Blundell (1986).
Step 1: estimate the model $y_2 = z\delta_2 + v_2$ and obtain the residual $\hat v_2$.
Step 2: estimate a standard Tobit model of

$$y_1 = \max(0, z_1\delta_1 + \alpha_1 y_2 + \theta\hat v_2 + \varepsilon_1) \qquad (6)$$

This two-step procedure gives consistent coefficient estimators.

Alternatively, one can apply full maximum likelihood:

$$f(y_1, y_2 | z) = f(y_1 | y_2, z) f(y_2 | z)$$

The densities are given by:

$$f(y_1 | y_2, z) = \left(1 - \Phi\left(\frac{z_1\delta_1 + \alpha_1 y_2 + \theta(y_2 - z\delta_2)}{\sigma_\varepsilon}\right)\right)^{1(y_1 = 0)} \left(\frac{1}{\sigma_\varepsilon}\phi\left(\frac{y_1 - z_1\delta_1 - \alpha_1 y_2 - \theta(y_2 - z\delta_2)}{\sigma_\varepsilon}\right)\right)^{1(y_1 > 0)}$$

and

$$f(y_2 | z) = \frac{1}{\sigma_v}\phi\left(\frac{y_2 - z\delta_2}{\sigma_v}\right)$$

Discussions: Note that here we substitute $y_2 - z\delta_2$ for $v_2$. The key reason this works is that in (6) $y_1$ is continuous when $y_1 > 0$, so the continuous part of $y_1$ can be used to identify the variance. For a binary $y_1$, this no longer holds.

Sample selection model:

Consider a classical sample selection model:

$$y_1^* = x_1\beta_1 + u_1, \qquad y_2 = 1[x_2\delta_2 + v_2 > 0], \qquad y_1 = y_1^* \cdot 1[y_2 = 1]$$

We discuss estimation of the model under the following assumptions:
(a) $(x_2, y_2)$ are always observed; $y_1$ is observed only when $y_2 = 1$.
(b) $(u_1, v_2)$ is independent of $x$.
(c) $v_2 \sim N(0,1)$.
(d) $E(u_1 | v_2) = \gamma_1 v_2$.

Note that $x_2$ has to be observed always, while $x_1$ only needs to be observed when $y_2 = 1$.
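The selection model just defined can be simulated to see why assumption (d) matters: when $\gamma_1 \neq 0$, the mean of $u_1$ among selected observations is not zero. The following is a minimal sketch using only the Python standard library. The function names `inverse_mills` and `simulate_selected_mean`, the single fixed selection index, and the choice $u_1 = \gamma_1 v_2 + e$ with $e \sim N(0,1)$ are illustrative assumptions, not part of the notes.

```python
import math
import random


def norm_pdf(z):
    """Standard normal density phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal cdf Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def inverse_mills(c):
    """phi(c) / Phi(c), i.e. E(v2 | v2 > -c) for v2 ~ N(0, 1)."""
    return norm_pdf(c) / norm_cdf(c)


def simulate_selected_mean(index, gamma1, n=200000, seed=1):
    """Average of u1 over draws with y2 = 1[index + v2 > 0].

    Assumed data-generating process (illustrative only):
    v2 ~ N(0, 1), u1 = gamma1 * v2 + e, e ~ N(0, 1) independent of v2,
    so that E(u1 | v2) = gamma1 * v2 as in assumption (d).
    """
    rng = random.Random(seed)
    total, count = 0.0, 0
    for _ in range(n):
        v2 = rng.gauss(0.0, 1.0)
        if index + v2 > 0:  # observation is selected
            total += gamma1 * v2 + rng.gauss(0.0, 1.0)
            count += 1
    return total / count


if __name__ == "__main__":
    idx, g1 = 0.5, 0.8  # hypothetical values of x2*delta2 and gamma1
    print("simulated :", simulate_selected_mean(idx, g1))
    print("theory    :", g1 * inverse_mills(idx))
```

Under these assumptions the simulated mean should be close to $\gamma_1 \phi(x_2\delta_2)/\Phi(x_2\delta_2)$, which is the term that OLS on the selected sample leaves in the error.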
The classic example is women's labor force participation, in which $y_1$ is the wage that a woman gets if she is working and $y_2$ is the labor force participation dummy. We observe factors that affect women's labor force participation, such as number of kids, husband's income, etc., regardless of whether the woman is working or not. However, we only observe the woman's wage if she is working ($y_2 = 1$).

Again, there are two methods to estimate this model:

1. Regression Method:

$$E(y_1 | x, v_2) = x_1\beta_1 + E(u_1 | x, v_2) = x_1\beta_1 + E(u_1 | v_2) = x_1\beta_1 + \gamma_1 v_2$$

If $\gamma_1 = 0$, there is no endogeneity and OLS is fine. However, when $\gamma_1 \neq 0$, since $v_2$ is unobserved, we need to take the conditional expectation (conditioning on $y_2 = 1$):

$$E(v_2 | x, y_2 = 1) = E(v_2 | x, x_2\delta_2 + v_2 > 0) = E(v_2 | x, v_2 > -x_2\delta_2) = \frac{\phi(-x_2\delta_2)}{1 - \Phi(-x_2\delta_2)} = \frac{\phi(x_2\delta_2)}{\Phi(x_2\delta_2)}$$

The last two equalities apply the earlier result for the standard normal, $E(\varepsilon | \varepsilon > c) = \phi(c)/(1 - \Phi(c))$, together with the symmetry of the normal distribution. Therefore,

$$E(y_1 | x, y_2 = 1) = x_1\beta_1 + \gamma_1 E(v_2 | x, y_2 = 1) = x_1\beta_1 + \gamma_1 \frac{\phi(x_2\delta_2)}{\Phi(x_2\delta_2)} \qquad (6)$$

As before, one can use the nonlinear regression method to estimate this model. However, it is computationally more difficult than MLE and has no advantage over MLE.

Heckman suggests a two-step estimator:

Step 1: Estimate a binary probit model, $\Pr(y_2 = 1) = \Phi(x_2\delta_2)$. The estimated parameter is used to construct the inverse Mills ratio $\phi(x_2\hat\delta_2)/\Phi(x_2\hat\delta_2)$.

Step 2: Estimate the following regression:

$$E(y_1 | x, y_2 = 1) = x_1\beta_1 + \gamma_1 \frac{\phi(x_2\hat\delta_2)}{\Phi(x_2\hat\delta_2)}$$

Discussions:

(1) Given that $E(u_1 | v_2) = \gamma_1 v_2$ and $var(v_2) = 1$, it is necessary that $Cov(u_1, v_2) = \gamma_1$, and the correlation coefficient between $u_1$ and $v_2$ is $r = \gamma_1/\sigma_u$.
(2) An important but less noticed fact: when we observe $y_1$ both for $y_2 = 0$ and for $y_2 = 1$, we can develop a specification test:

$$E(y_1 | x, y_2 = 0) = x_1\beta_1 + \gamma_1 E(v_2 | x, y_2 = 0) = x_1\beta_1 + \gamma_1 E(v_2 | x, v_2 < -x_2\delta_2) = x_1\beta_1 + \gamma_1 \left(\frac{-\phi(x_2\delta_2)}{1 - \Phi(x_2\delta_2)}\right) \qquad (7)$$

The last equality applies the earlier result $E(\varepsilon | \varepsilon < c) = -\phi(c)/\Phi(c)$.

Note that the inverse Mills ratio term differs between (6) and (7): it is $\phi(x_2\delta_2)/\Phi(x_2\delta_2)$ in (6) and $-\phi(x_2\delta_2)/(1 - \Phi(x_2\delta_2))$ in (7). However, their coefficient is the same. Therefore, one can estimate both (6) and (7) by applying the Heckman two-step estimator and test whether the coefficients from (6) and (7) are the same. This test can serve as a specification test for the model.

2. Maximum Likelihood Estimator

Since we only observe $y_1$ when $y_2 = 1$, we only need the joint density for the case $(y_1, y_2 = 1)$. The density function for $y_2$ is:

$$f(y_2 | x) = \Phi(x\delta_2)^{y_2} (1 - \Phi(x\delta_2))^{1 - y_2}$$

The conditional density for $y_1$ is given by:

$$f(y_1 | y_2 = 1, x) = \frac{\Pr(y_2 = 1 | y_1, x) f(y_1 | x)}{\Pr(y_2 = 1 | x)}$$

Note that $y_1 | x \sim N(x_1\beta_1, \sigma_1^2)$, and assume that $cov(u_1, v_2) = \sigma_{12}$. We can write:

$$v_2 = \frac{\sigma_{12}}{\sigma_1^2}(y_1 - x_1\beta_1) + e_2, \quad \text{where } e_2 \sim N\left(0, 1 - \frac{\sigma_{12}^2}{\sigma_1^2}\right)$$

Therefore,

$$\Pr(y_2 = 1 | y_1, x) = \Phi\left(\frac{x\delta_2 + (y_1 - x_1\beta_1)\sigma_{12}/\sigma_1^2}{(1 - \sigma_{12}^2/\sigma_1^2)^{1/2}}\right)$$

So the log-likelihood for observation $i$ is given by:

$$l_i(\theta) = \ln\left(\Pr(y_{i2} = 0)^{1(y_{i2} = 0)} \cdot \Pr(y_{i1}, y_{i2} = 1)^{1(y_{i2} = 1)}\right)$$

$$= \ln\left(\Pr(y_{i2} = 0)^{1(y_{i2} = 0)} \cdot \left(\Pr(y_{i1} | y_{i2} = 1)\Pr(y_{i2} = 1)\right)^{1(y_{i2} = 1)}\right)$$

$$= (1 - y_{i2}) \ln(1 - \Phi(x\delta_2)) + y_{i2} \ln\left(\Pr(y_2 = 1 | y_1, x) f(y_1 | x)\right)$$

Applications:

Example: Wage offer function. Only those who have jobs have observed wages. Suppose we are interested in how wage offers are determined, $E(w_i^o | x_i)$. The observation rule for the wage is that $w_i^o = w_i$ if and only if the worker is working. Suppose that $w_i^o = \exp(x_{i1}\beta_1 + u_{i1})$.
The decision rule is that she works if and only if $w_i^o > w_i^r$, where $w_i^r$ is the reservation wage. To study how $w_i^r$ is determined, we consider explicitly a labor supply model:

$$\max_h u(w_i^o h + a_i, h) \quad \text{s.t. } 0 \le h \le 1$$

where $a_i$ is the nonwage income of person $i$. The derivative of utility with respect to hours is

$$\frac{du}{dh} = u_1(w_i^o h + a_i, h) w_i^o + u_2(w_i^o h + a_i, h)$$

where $u_1$ is the marginal utility from income and $u_2$ is the marginal utility from working. The person does not work if $du/dh \le 0$ at $h = 0$:

$$\frac{du}{dh}\bigg|_{h=0} = u_1(a_i, 0) w_i^o + u_2(a_i, 0) \le 0$$

The reservation wage is obtained by setting the previous expression to zero:

$$w_i^r = -\frac{u_2(a_i, 0)}{u_1(a_i, 0)}$$

Therefore, the reservation wage $w_i^r$ depends on non-labor income $a_i$. Let $w_i^r$ be determined by the following equation:

$$w_i^r = \exp(x_{i2}\beta_2 + \gamma_2 a_i + u_{i2})$$

where $u_{i1}$ and $u_{i2}$ are independent of $(x_{i1}, x_{i2}, a_i)$. Here $x_{i1}$ represents the productivity characteristics, $x_{i2}$ are variables that determine the marginal utility of leisure and income, and $a_i$ is the non-wage income. Rewrite the previous equations in logarithms:

$$\log w_i^o = x_{i1}\beta_1 + u_{i1}, \qquad \log w_i^r = x_{i2}\beta_2 + \gamma_2 a_i + u_{i2}$$

The selection rule is given by the difference: $\log w_i^o - \log w_i^r > 0$.

If $w_i^r$ were observed and exogenous and $x_{i1}$ were always observed, this would be a censored regression.
If $w_i^r$ were observed and exogenous but $x_i$ were only observed when $w_i^o$ is observed, this would be a truncated Tobit.
If $w_i^r$ is not observed, as in most cases, we have:

$$\log w_i^o - \log w_i^r = x_{i1}\beta_1 - x_{i2}\beta_2 - \gamma_2 a_i + u_{i1} - u_{i2} = x_i\delta + v_i > 0.$$

Here the selection process and the wage regression are given by:

$$S_i = 1(x_i\delta + v_i > 0), \qquad \log w_i^o = x_{i1}\beta_1 + u_{i1} \text{ if } S_i = 1.$$

Note that in the selection process, $v_i = u_{i1} - u_{i2}$ includes both $u_{i1}$ and $u_{i2}$. Therefore, $u_{i1}$ and $v_i$ are correlated, and the correlation is positive whenever $Cov(u_{i1}, u_{i2}) < Var(u_{i1})$.

Another important point from this model concerns which $x_i$ should be used. It is clear that the selection equation should include all $x_i$.
It is also important to notice that $x_{i1}$ in the wage regression should NOT include $a_i$, the non-labor income of the person.

Example (charitable contributions): we are interested in the demand for charitable donations.

$$\max_{c,q} u_i(c, q) = c_i + a_i\log(1 + q_i), \quad \text{s.t. } c_i + p_i q_i = m_i \text{ and } q_i \ge 0$$

where $c$ is annual consumption, $q$ is annual charitable giving, and $a_i$ governs the marginal utility from giving. In addition, $m_i$ is family income and $p_i$ is the dollar price of a charitable contribution, which depends on the marginal tax rate of the person. For example, for a person with a marginal tax rate of 30%, $p_i$ is 0.7.

Plug the budget constraint into the utility function and take the derivative with respect to $q$. The first-order condition is:

$$-p_i + \frac{a_i}{1 + q_i^*} = 0$$

The solution to this problem is:

$q_i = 0$ if $a_i/p_i \le 1$
$q_i = a_i/p_i - 1$ if $a_i/p_i > 1$

If we are interested in which characteristics determine charitable giving, we model $a_i = \exp(z_i\gamma + u_i)$, which gives our estimation model:

$$\log(1 + q_i) = \max(0, z_i\gamma - \log(p_i) + u_i) \qquad (1)$$

Now we are interested in understanding charitable giving among faculty at Texas A&M University. There are two ways to obtain a sample.

(1) We randomly draw N people from the university's accounting office. The office has detailed information on all faculty members on campus, including their donation amounts.
(2) We randomly draw N faculty members and call them. We ask each faculty member the amount of their charitable giving. Inevitably, some faculty would refuse to answer this question (recorded as $RF_i = 1$).

Censored sampling: In the first case, we know exactly whether a person donates or not. Some faculty may donate zero. This is the classic censored model:

$$\log(1 + q_i) = \max(0, z_i\gamma - \log(p_i) + u_i)$$

The usual censored regression can be applied here.

Selected Sample: First we assume a simplified case in which all faculty with $RF_i = 0$ have a positive donation amount:

$$\log(1 + q_i) = z_i\gamma - \log(p_i) + u_i \quad \text{if } RF_i = 0.$$

Suppose we do observe $z_i$ and $p_i$ for all people.
$$\Pr(RF_i = 0) = \Pr(m_i\eta + \alpha\log(p_i) + v_i < 0).$$

How do we estimate this problem? It is often the case that RF is not random, i.e., $cov(u_i, v_i) \neq 0$. Intuitively, people who refuse to answer are likely to donate less than those who give a response (the response could be zero). Therefore, with $v_i \sim N(0,1)$, we have:

$$E(\log(1 + q_i) | RF_i = 0) = z_i\gamma - \log(p_i) + E(u_i | m_i\eta + \alpha\log(p_i) + v_i < 0)$$

$$= z_i\gamma - \log(p_i) + \rho\,\frac{-\phi(m_i\eta + \alpha\log(p_i))}{1 - \Phi(m_i\eta + \alpha\log(p_i))}$$

A standard two-step Heckman approach can be used here.

Now consider the Maximum Likelihood approach. Here we allow some faculty with $RF_i = 0$ to donate zero. We have three cases: $(q_i > 0, RF_i = 0)$, $(q_i = 0, RF_i = 0)$, and $(RF_i = 1)$. Write $u_i = \rho v_i + \varepsilon_i$.

Case 1: $(q_i > 0, RF_i = 0)$

$$f(q_i, RF_i = 0) = f(q_i, m_i\eta + \alpha\log(p_i) + v_i < 0) = f(q_i, v_i < -m_i\eta - \alpha\log(p_i))$$

$$= \int_{-\infty}^{-m_i\eta - \alpha\log(p_i)} \frac{1}{\sigma_\varepsilon}\phi\left(\frac{\log(1 + q_i) - z_i\gamma + \log(p_i) - \rho v_i}{\sigma_\varepsilon}\right)\phi(v_i)\,dv_i$$

Case 2: $(q_i = 0, RF_i = 0)$

$$\Pr(q_i = 0, RF_i = 0) = \Pr(z_i\gamma - \log(p_i) + \rho v_i + \varepsilon_i < 0,\; m_i\eta + \alpha\log(p_i) + v_i < 0)$$

$$= \Pr(\varepsilon_i < -z_i\gamma + \log(p_i) - \rho v_i,\; v_i < -m_i\eta - \alpha\log(p_i))$$

$$= \int_{-\infty}^{-m_i\eta - \alpha\log(p_i)} \Phi\left(\frac{-z_i\gamma + \log(p_i) - \rho v_i}{\sigma_\varepsilon}\right)\phi(v_i)\,dv_i$$

Case 3: $(RF_i = 1)$

$$\Pr(RF_i = 1) = \Pr(m_i\eta + \alpha\log(p_i) + v_i \ge 0) = \Phi(m_i\eta + \alpha\log(p_i))$$

Therefore, the likelihood for one observation is given by:

$$f(q_i, RF_i) = f(q_i > 0, RF_i = 0)^{1(q_i > 0, RF_i = 0)} \Pr(q_i = 0, RF_i = 0)^{1(q_i = 0, RF_i = 0)} \Pr(RF_i = 1)^{1(RF_i = 1)}$$

A further complication: Assume that $p_i$ is endogenous. The endogeneity of $p_i$ could arise because the choice of $q_i$ can move the person into a different tax bracket and so affect $p_i$, or because $p_i$ is measured with error. Let $\log(p_i) = x_i\beta + w_i$. Let $u_i = \rho_1 w_i + \varepsilon_{1i}$ and $v_i = \rho_2 w_i + \varepsilon_{2i}$.

First consider the case in which, when $RF_i = 0$, all faculty make a positive donation, i.e., $q_i > 0$.
In this case, we apply the regression model:

$$E(\log(1 + q_i) | RF_i = 0) = z_i\gamma - \log(p_i) + E(u_i | m_i\eta + \alpha\log(p_i) + v_i < 0)$$

$$= z_i\gamma - \log(p_i) + E(\rho_1 w_i + \varepsilon_{1i} | m_i\eta + \alpha\log(p_i) + \rho_2 w_i + \varepsilon_{2i} < 0)$$

$$= z_i\gamma - \log(p_i) + \rho_1 E(w_i | m_i\eta + \alpha\log(p_i) + \rho_2 w_i + \varepsilon_{2i} < 0)$$

$$= z_i\gamma - \log(p_i) + \rho_1 E(w_i | \rho_2 w_i < -m_i\eta - \alpha\log(p_i) - \varepsilon_{2i})$$

$$= z_i\gamma - \log(p_i) + \rho_1 \int_{-\infty}^{\infty} \frac{-\phi\left(\frac{m_i\eta + \alpha\log(p_i) + \varepsilon_{2i}}{\rho_2\sigma_w}\right)}{1 - \Phi\left(\frac{m_i\eta + \alpha\log(p_i) + \varepsilon_{2i}}{\rho_2\sigma_w}\right)} \frac{1}{\sigma_2}\phi\left(\frac{\varepsilon_{2i}}{\sigma_2}\right) d\varepsilon_{2i}$$

The last equality is obtained by first conditioning on $\varepsilon_{2i}$ to get the inverse Mills ratio, and then integrating out $\varepsilon_{2i}$ since it is not observed. Therefore, it is not easy to estimate such a model using a two-step approach.

Maximum likelihood is probably the only plausible approach to this problem. There are three endogenous variables, $(q_i, RF_i, p_i)$. We consider the following factorization:

$$f(q_i, RF_i, p_i) = f(q_i, RF_i | p_i) f(p_i)$$

The density $f(p_i)$ is easy, so our focus is on the first part, $f(q_i, RF_i | p_i)$. To write the likelihood function, consider again the three cases as before: $(q_i > 0, RF_i = 0 | p_i)$, $(q_i = 0, RF_i = 0 | p_i)$, and $(RF_i = 1 | p_i)$. (Here we again assume that some faculty members contribute zero.)

Case 1: $(q_i > 0, RF_i = 0)$

$$f(q_i, RF_i = 0) = f(q_i, m_i\eta + \alpha\log(p_i) + v_i < 0 | p_i) = f(q_i, \varepsilon_{2i} < -m_i\eta - \alpha\log(p_i) - \rho_2 w_i | w_i)$$

$$= \frac{1}{\sigma_1}\phi\left(\frac{\log(1 + q_i) - z_i\gamma + \log(p_i) - \rho_1 w_i}{\sigma_1}\right)\left[1 - \Phi\left(\frac{m_i\eta + \alpha\log(p_i) + \rho_2 w_i}{\sigma_2}\right)\right]$$

$$= \frac{1}{\sigma_1}\phi\left(\frac{\log(1 + q_i) - z_i\gamma + \log(p_i) - \rho_1(\log(p_i) - x_i\beta)}{\sigma_1}\right) \cdot \left[1 - \Phi\left(\frac{m_i\eta + \alpha\log(p_i) + \rho_2(\log(p_i) - x_i\beta)}{\sigma_2}\right)\right]$$

Note that in the last equality we replace $w_i$ by $\log(p_i) - x_i\beta$. This is possible because $\log(p_i)$ is continuous and there is no constraint on the error term $w_i$. Note also that conditioning on $p_i$ is equivalent to conditioning on $w_i$.
Case 2: $(q_i = 0, RF_i = 0)$

$$\Pr(q_i = 0, RF_i = 0) = \Pr(z_i\gamma - \log(p_i) + \rho_1 w_i + \varepsilon_{1i} < 0,\; m_i\eta + \alpha\log(p_i) + \rho_2 w_i + \varepsilon_{2i} < 0)$$

$$= \Pr(\varepsilon_{1i} < -z_i\gamma + \log(p_i) - \rho_1 w_i,\; \varepsilon_{2i} < -m_i\eta - \alpha\log(p_i) - \rho_2 w_i)$$

$$= \Phi\left(\frac{-z_i\gamma + \log(p_i) - \rho_1 w_i}{\sigma_1}\right)\Phi\left(\frac{-m_i\eta - \alpha\log(p_i) - \rho_2 w_i}{\sigma_2}\right)$$

$$= \Phi\left(\frac{-z_i\gamma + \log(p_i) - \rho_1(\log(p_i) - x_i\beta)}{\sigma_1}\right)\Phi\left(\frac{-m_i\eta - \alpha\log(p_i) - \rho_2(\log(p_i) - x_i\beta)}{\sigma_2}\right)$$

Case 3: $(RF_i = 1)$

$$\Pr(RF_i = 1) = \Pr(m_i\eta + \alpha\log(p_i) + \rho_2 w_i + \varepsilon_{2i} > 0) = \Phi\left(\frac{m_i\eta + \alpha\log(p_i) + \rho_2 w_i}{\sigma_2}\right) = \Phi\left(\frac{m_i\eta + \alpha\log(p_i) + \rho_2(\log(p_i) - x_i\beta)}{\sigma_2}\right)$$

Finally, we need a density for $w_i$ (in other words, a density for $\log(p_i)$). Therefore, the likelihood is given by:

$$f(q_i, RF_i, p_i) = f(q_i, RF_i | p_i) f(p_i)$$

$$= f(q_i > 0, RF_i = 0)^{1(q_i > 0, RF_i = 0)} \Pr(q_i = 0, RF_i = 0)^{1(q_i = 0, RF_i = 0)} \Pr(RF_i = 1)^{1(RF_i = 1)} \cdot \frac{1}{\sigma_w}\phi\left(\frac{\log p_i - x_i\beta}{\sigma_w}\right)$$

It is interesting to point out that this likelihood function does not involve integration (other than the cdf of the normal density). This property makes the current model easy to estimate.

Summary

In all previous examples, including the examples in the discrete choice part, figuring out the density is the critical step. In general, there are three steps in figuring out the density.

Step 1: Determine the density of "what". The "what" should be all potentially endogenous variables.

Step 2: Determine the relations between the endogenous variables. Often the error term in one of the equations is potentially "contaminated". We need to write the contaminated random variable as a function of independent random variables.

Step 3: Write the joint density as the product of a conditional density and a marginal density. In this step, it is important to keep in mind the ranges in which the error terms may lie.
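As a concrete instance of "figuring out the density", the Tobit density from the Maximum Likelihood section can be coded directly as a log-likelihood, together with the conditional mean $E(y | x, y > 0)$ from the regression method. This is a minimal standard-library sketch, not an estimation routine: the function names `tobit_loglik` and `mean_given_positive` are illustrative, the index $x_i\beta$ is passed in precomputed, and no guard is included against $\log(0)$ for extreme index values.

```python
import math


def norm_pdf(z):
    """Standard normal density phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal cdf Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def tobit_loglik(y, xb, sigma):
    """Log-likelihood of y_i = max(0, x_i*beta + u_i), u_i ~ N(0, sigma^2).

    y  : observed outcomes (zeros are the censored observations)
    xb : index values x_i*beta, one per observation (assumed precomputed)
    """
    total = 0.0
    for yi, xbi in zip(y, xb):
        if yi > 0:
            # uncensored part: (1/sigma) * phi((y - x*beta)/sigma)
            total += math.log(norm_pdf((yi - xbi) / sigma) / sigma)
        else:
            # censored part: Pr(y* <= 0) = 1 - Phi(x*beta/sigma)
            total += math.log(1.0 - norm_cdf(xbi / sigma))
    return total


def mean_given_positive(xb, sigma):
    """E(y | x, y > 0) = x*beta + sigma * phi(x*beta/sigma) / Phi(x*beta/sigma)."""
    a = xb / sigma
    return xb + sigma * norm_pdf(a) / norm_cdf(a)


if __name__ == "__main__":
    # Two hypothetical observations: one censored at zero, one positive.
    print(tobit_loglik([0.0, 1.0], [0.0, 1.0], 1.0))
    print(mean_given_positive(0.0, 1.0))
```

In practice one would maximize `tobit_loglik` over $(\beta, \sigma)$ with a numerical optimizer; the sketch only shows how the censored and uncensored contributions enter.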
Example: Discrete/Continuous Model (Dubin and McFadden, Econometrica, 1984)

Consumers face a choice of $m$ mutually exclusive, exhaustive appliance portfolios, which can be indexed as $i = 1, \ldots, m$. Portfolio $i$ has a rental price (annualized cost) $r_i$. Given $i$, the consumer has a conditional indirect utility function:

$$u = V(i, y - r_i, p_1, p_2, s_i, \varepsilon_i, \eta)$$

where
$p_1$ is the price of electricity,
$p_2$ is the price of alternative energy sources,
$y$ is income,
$s_i$ is the observed attributes of $i$,
$\varepsilon_i$ is the unobserved attributes of $i$,
$r_i$ is the price of $i$,
$\eta$ is the unobserved characteristics of the consumer.

Electricity and alternative energy consumption levels, given $i$, are (by Roy's identity):

$$x_1 = -\frac{\partial V(i, y - r_i, p_1, p_2, s_i, \varepsilon_i, \eta)/\partial p_1}{\partial V(i, y - r_i, p_1, p_2, s_i, \varepsilon_i, \eta)/\partial y}, \qquad x_2 = -\frac{\partial V(i, y - r_i, p_1, p_2, s_i, \varepsilon_i, \eta)/\partial p_2}{\partial V(i, y - r_i, p_1, p_2, s_i, \varepsilon_i, \eta)/\partial y}$$

The probability that portfolio $i$ is chosen is:

$$P_i = \Pr\{(\varepsilon_1, \ldots, \varepsilon_m, \eta): V(i, y - r_i, p_1, p_2, s_i, \varepsilon_i, \eta) > V(j, y - r_j, p_1, p_2, s_j, \varepsilon_j, \eta) \text{ for } j \neq i\}$$

First consider a demand system that is linear in income:

$$x_1 = \alpha_{0i} + \alpha_1 p_1 + \alpha_2 p_2 + w'\gamma + \beta_i(y - r_i) + \eta + v_{1i}$$

The indirect utility that leads to such a demand function is given by:

$$V_i = \left(\alpha_{0i} + \frac{\alpha_1}{\beta} + \alpha_1 p_1 + \alpha_2 p_2 + w'\gamma + \beta_i(y - r_i) + \eta + v_{1i}\right) e^{-\beta p_1} + \varepsilon_i$$

So the probability of choice $i$ is given by:

$$P_i = \Pr(V_i > V_j \text{ for } j \neq i)$$

Estimation process:
(1) estimate a discrete choice model
(2) estimate a continuous demand model

The problem in the second-stage model is that $E(\eta | i)$ is not zero in

$$x_1 = \alpha_{0i} + \alpha_1 p_1 + \alpha_2 p_2 + w'\gamma + \beta_i(y - r_i) + \eta + v_{1i}$$

Example: vehicle choice and vehicle miles traveled (VMT). Let the utility maximization problem for driving be given by:

$$\max_{c_i, VMT_i} u(c_i, VMT_i)$$ s.t.
$$\frac{p_g}{MPG_i} VMT_i + c_i = y - \delta k_i$$

where $c_i$ is consumption, $k_i$ is the cost of owning vehicle bundle $i$, $VMT_i$ is the vehicle miles driven, $p_g$ is the price of gasoline, which does not vary over $i$, and $MPG_i$ is the miles per gallon, which does vary across vehicle bundles. $p_g/MPG_i$ is the cost per mile of driving vehicle bundle $i$. Let the value of the optimal solution, the indirect utility, be denoted:

$$V_i = v\left(\frac{p_g}{MPG_i}, y - \delta k_i; x, \eta\right)$$

where the indirect utility is obtained by maximizing the direct utility under the budget constraint. $\eta$ represents unobserved characteristics, such as preference for driving, distance from work, traffic congestion, etc.

An individual chooses vehicle bundle $i$ if and only if $V_i > V_j$ for all $j \neq i$ (the discrete choice part of the model).

Next we specify VMT. The VMT is obtained by Roy's identity:

$$VMT_i = -\frac{\partial v(mpg_i, y, z_i, \eta)/\partial p_i}{\partial v(mpg_i, y, z_i, \eta)/\partial y} = \alpha_0 + \alpha_{1i} p_i + \beta(y - r_i) + x'\gamma + \eta$$

Since we only observe $VMT_i$ for the chosen vehicle bundle, the conditional expectation of $\eta$ in the observed sample is not zero: $E(\eta | p, y, r, x) \neq 0$. Dubin and McFadden suggest three alternative ways to handle this: one is similar to the Heckman two-stage model, another is to use instrumental variables, and the third is to use a reduced-form estimation method.

(1) Heckman-type method:

$$\ln(VMT_i) = \alpha_0 + \alpha_{1i} p + \beta(y - r_i) + x'\gamma + E(\eta | i \text{ chosen}) \qquad (*)$$

The expectation $E(\eta | i \text{ chosen})$ is not zero. In the Heckman sample selection model, this expectation is the inverse Mills ratio multiplied by a constant. Here it is a function of $\Pr(i \text{ chosen})$ for all $i = 1, \ldots, m$.

(2) IVs: use the predicted $\Pr(i \text{ chosen})$ from the discrete choice part as IVs. One can rewrite (*) as:

$$\ln(VMT) = \alpha_0 + \sum_{i=1}^m D_i\alpha_{1i} p + \beta\sum_{i=1}^m D_i(y - r_i) + x'\gamma + \eta \qquad (**)$$

where $D_i$ is the dummy indicating whether choice $i$ is chosen. Obviously $D_i$ depends on $\eta$. Dubin and McFadden suggest using $\Pr(i \text{ chosen})$ from the discrete choice part as IVs for $D_i$.
(3) Reduced form: Taking the expectation of (**) (not conditional on which choice $i$ is chosen), we have:

$$E(\ln(VMT)) = \alpha_0 + \sum_{i=1}^m P_i\alpha_{1i} p + \beta\sum_{i=1}^m P_i(y - r_i) + x'\gamma \qquad (***)$$

where $P_i$ is the probability that choice $i$ is chosen. $E(\eta) = 0$ because the expectation is taken unconditionally over the full sample (not just the subsample of those who choose bundle $i$). Dubin and McFadden suggest substituting the estimated $\hat P_i$ from the discrete choice part for $P_i$.

Final complication: suppose we only observe the shares of vehicle bundles (instead of individual choices of vehicle bundles):

$$S_{in} = \int_\eta \frac{\exp(V_{in}(\eta))}{\sum_{j}^{K}\exp(V_{jn}(\eta))} f(\eta)\,d\eta,$$

$$\ln(VKT_{in}) = \alpha_0 + \sum_j \alpha_{1j} p_{jn} S_{jn} + \beta y_n + \beta\sum_j (y_n - r_{jn}) S_{jn} + x_n'\gamma + v_{in}.$$
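The share integral above has no closed form in general, but it can be approximated by averaging logit choice probabilities over draws of $\eta$. The following is a minimal standard-library sketch under loudly hypothetical assumptions: utilities are taken to be linear in $\eta$, $V_i(\eta) = \text{base}_i + \text{loading}_i \cdot \eta$, with $\eta \sim N(0,1)$; neither this functional form nor the names `logit_probs` and `simulated_shares` comes from the notes.

```python
import math
import random


def logit_probs(utilities):
    """Choice probabilities exp(V_i) / sum_j exp(V_j), computed stably."""
    m = max(utilities)  # subtract the max to avoid overflow in exp
    expv = [math.exp(v - m) for v in utilities]
    total = sum(expv)
    return [e / total for e in expv]


def simulated_shares(base_utils, eta_loadings, n_draws=5000, seed=0):
    """Approximate S_i = integral of logit_i(eta) f(eta) d(eta) by simulation.

    Hypothetical utility form (an assumption for illustration):
    V_i(eta) = base_utils[i] + eta_loadings[i] * eta, with eta ~ N(0, 1).
    """
    rng = random.Random(seed)
    shares = [0.0] * len(base_utils)
    for _ in range(n_draws):
        eta = rng.gauss(0.0, 1.0)
        utils = [b + l * eta for b, l in zip(base_utils, eta_loadings)]
        for i, p in enumerate(logit_probs(utils)):
            shares[i] += p / n_draws  # running average over draws
    return shares


if __name__ == "__main__":
    # Three hypothetical vehicle bundles.
    print(simulated_shares([0.0, 0.5, 1.0], [1.0, 0.0, -1.0]))
```

The simulated shares can then be plugged into the share-weighted VKT equation in place of $S_{jn}$; with all loadings set to zero the result collapses to the ordinary logit probabilities.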