Censored Regression, Sample Selection, Endogenous Switching, and Treatment-Effect Regression Models

Dan Powers
Soc385K
November 26, 2007

Censored Regression Models

Truncated and censored regression models are useful when the dependent variable is observed in some ranges but not in others. A classical example is Tobin's (1958) study of household expenditures, in which expenditures on durable goods were seen to generally increase with income, but a non-trivial number of households had no expenditures on durable goods. The problem, in this case, is how to model the response.[1]

[1] We will drop individual subscripts and change notation to simplify things a bit. For example, x′iβ = Xβ.

A classical linear model is clearly inappropriate due to the presence of many observations with values of 0. However, a model can be constructed using a latent-variable approach. Suppose that we consider a latent continuous variable Y* that is normally distributed with mean µ and variance σ². The standard Tobit model is given by

    Y* = Xβ + ε
    Y  = Y*   if Y* > 0
    Y  = 0    if Y* ≤ 0.

To estimate this model we need to consider two pieces of the overall likelihood: the probability that Y* is non-masspoint and the probability that Y* is a masspoint (at zero). Combining these into a single expression for the joint probability (likelihood) results in the following expression:

    L = ∏_{y* ≤ 0} [1 − Φ(Xβ/σ)] · ∏_{y* > 0} (1/σ) φ[(Y − Xβ)/σ],

where Φ(·) and φ(·) are, respectively, the cumulative standardized normal distribution function and the standardized normal density function. If we were to use OLS on all observations, we would get biased estimates. OLS applied to only the non-masspoint observations would also yield biased estimates: in this case, the expectation of Y is not equal to Xβ as it would be in a classical linear regression model. In fact,

    E(Y | Y* > 0) = Xβ + E(ε | ε > −Xβ).

The last term is not necessarily equal to zero. If we assume normality of ε, then

    E(Y | Y* > 0) = Xβ + σλ(Xβ/σ),
where λ(·) = φ(·)/Φ(·).[2]

[2] See the Appendix for more details.

Estimation strategies involve simple (two-step) procedures or ML (there are also other possibilities). A two-step strategy involves first fitting a probit model to the complete data, coding Z = 1 if Y > 0 and 0 otherwise:

    probit[Pr(Z = 1)] = Xα.

We then compute λ̂ = φ(Xα̂)/Φ(Xα̂) and use this quantity as an additional regressor in an OLS model for the non-masspoint observations. However, it is generally easier (and results in a more efficient estimator) to estimate this model using maximum likelihood.

Example. This example models the number of hours worked per year as a function of educational level, age, and number of children.

. regress hours edLTHS edCOL age children

      Source |       SS       df       MS              Number of obs =    1126
-------------+------------------------------           F(  4,  1121) =  422.79
       Model |   489867964     4   122466991           Prob > F      =  0.0000
    Residual |   324711042  1121  289661.946           R-squared     =  0.6014
-------------+------------------------------           Adj R-squared =  0.6000
       Total |   814579006  1125  724070.228           Root MSE      =   538.2

------------------------------------------------------------------------------
       hours |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      edLTHS |  -550.7335   43.67641   -12.61   0.000    -636.4302   -465.0368
       edCOL |   687.8252   36.75285    18.71   0.000      615.713    759.9373
         age |  -1.254959   1.239134    -1.01   0.311    -3.686242    1.176324
    children |  -210.2569   8.367137   -25.13   0.000    -226.6739   -193.8399
       _cons |   1198.584    53.0985    22.57   0.000       1094.4    1302.767
------------------------------------------------------------------------------

. tobit hours edLTHS edCOL age children, ll(0)

Tobit estimates                                   Number of obs   =       1126
                                                  LR chi2(4)      =    1257.40
                                                  Prob > chi2     =     0.0000
Log likelihood = -6335.3063                       Pseudo R2       =     0.0903

------------------------------------------------------------------------------
       hours |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      edLTHS |  -837.2152   62.12268   -13.48   0.000    -959.1049   -715.3255
       edCOL |   751.6508   44.09414    17.05   0.000     665.1345     838.167
         age |  -2.339379   1.515065    -1.54   0.123    -5.312059    .6333014
    children |  -433.9989     16.773   -25.87   0.000    -466.9089   -401.0889
       _cons |   1349.601   65.84254    20.50   0.000     1220.413     1478.79
-------------+----------------------------------------------------------------
         _se |   621.3327    16.2205           (Ancillary parameter)
------------------------------------------------------------------------------
Obs. summary:        342  left-censored observations at hours<=0
                     784  uncensored observations

We find that OLS underestimates the effects of all of the predictors in this case. Interpret the results as you would any regression model. The parameter labelled _se is the estimate of σ, the standard deviation of the error term; it is reported as an ancillary parameter rather than as a regression coefficient.

Selection Models

The selection model of Heckman (1976) describes an estimation problem resulting from incomplete data. To motivate this model, consider the problem of estimating a regression for women's earnings. We would not have a problem if we observed earnings data for the majority of working-age women. However, labor-force participation rates in the 1970s, when Heckman worked on this problem, were hovering around 43 percent (compared to more than 60 percent currently), and full-time LFP rates were only about 30 percent in 1970. This is a problem if the goal is to make inferences about the factors that contribute to women's earnings. We can generalize this to any sample. For the data in the example above, we can assume that we would only observe wages for the subsample of labor-force participants. Participation depends on covariates that are assumed to affect the reservation wage, or the wage that would need to be offered to induce a respondent to enter the market.
The econometric model postulates that if a wage offer exceeds the reservation wage, then we will observe a wage for that individual.

Theory. The selection problem is that earnings are observed only for labor-force participants; earnings are unobserved or "latent" for nonparticipants. Economists envision two equations:

    Yo = Xβ + u1   (wage offered)

and

    Yr = Xβ + u2   (reservation wage).

It follows that Y = Yo if Yo ≥ Yr, and Y is missing or 0 otherwise. Selectivity bias refers to the fact that if we were to estimate the earnings function based only on the observations for which we have data, then we would get inconsistent (biased) estimates of the effects of variables affecting wage rates. Heckman's solution was to view the wage equation as the substantive equation of interest,

    Y = Xβ + u1,

along with a participation equation,

    W* = Zγ + u2,

in which a woman works if W* ≥ 0, and Pr(W = 1 | Z) = Φ(Zγ). It can be shown that

    E(u1 | W = 1) = E(u1 | u2 > −Zγ) = σ12 [φ(Zγ)/Φ(Zγ)].

Note, by the symmetry of the normal distribution, we have:

    φ(−z)/[1 − Φ(−z)] = φ(z)/Φ(z)   and   φ(−z)/Φ(−z) = φ(z)/[1 − Φ(z)].

Logit models can also be used. As an exercise, show that for a logit model

    f(z)/F(z) = 1 − p = 1/(1 + exp(z)).

The ratio φ(z)/Φ(z) is referred to as the inverse Mills ratio (or non-selection hazard rate), and φ(z)/[1 − Φ(z)] = φ(z)/Φ(−z) is referred to as the "selection" hazard rate. From the modeling standpoint, we have:

    W* = Zγ + u2,   W = 1 if W* ≥ 0, 0 otherwise
    Y* = Xβ + u1,   Y* = Y if W = 1.

The equation for W is a probit model; Y* is a latent continuous dependent variable. Extensions to the model allow for different possibilities for Y* (censored, binary, etc.). Here, of course, wages are treated as continuous. We assume that u1 and u2 are distributed as bivariate normal, with means zero and (co)variances σ²u1, σ²u2, and σ12.[3] The nature of the selectivity bias can be shown by evaluating the expectation of Y.
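The symmetry identities above, which the derivation below relies on, are easy to check numerically. The following quick check is my own illustration (not part of the original notes), written in Python with scipy:

```python
import numpy as np
from scipy import stats

z = np.linspace(-3, 3, 13)
phi, Phi = stats.norm.pdf(z), stats.norm.cdf(z)

# phi(-z)/[1 - Phi(-z)] = phi(z)/Phi(z): the inverse Mills ratio
assert np.allclose(stats.norm.pdf(-z) / (1 - stats.norm.cdf(-z)), phi / Phi)

# phi(-z)/Phi(-z) = phi(z)/[1 - Phi(z)]: the "selection" hazard rate
assert np.allclose(stats.norm.pdf(-z) / stats.norm.cdf(-z), phi / (1 - Phi))

# Logit exercise: f(z)/F(z) = 1 - F(z) = 1/(1 + exp(z))
F = 1 / (1 + np.exp(-z))     # logistic cdf
f = F * (1 - F)              # logistic density
assert np.allclose(f / F, 1 / (1 + np.exp(z)))
```

The first two identities both reduce to φ(−z) = φ(z) and 1 − Φ(−z) = Φ(z).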
[3] Multivariate normality is tractable; other joint distributions are less convenient.

In this case,

    E(Y | W* > 0) = E(Y | Zγ + u2 > 0)
                  = Xβ + E(u1 | u2 > −Zγ)
                  = Xβ + σ12 φ(Zγ)/Φ(Zγ).

Both the two-step and ML approaches are used for this model. For the two-step approach, we fit a probit model to W, generate λ̂ = φ(Zγ̂)/Φ(Zγ̂), and include this as a regressor. For example,

. * Heckman selection model 2-stage (by 'hand')
. probit works married children education age

Probit estimates                                  Number of obs   =       2000
                                                  LR chi2(4)      =     478.32
                                                  Prob > chi2     =     0.0000
Log likelihood = -1027.0616                       Pseudo R2       =     0.1889

------------------------------------------------------------------------------
       works |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     married |   .4308575    .074208     5.81   0.000     .2854125    .5763025
    children |   .4473249   .0287417    15.56   0.000     .3909922    .5036576
   education |   .0583645   .0109742     5.32   0.000     .0368555    .0798735
         age |   .0347211   .0042293     8.21   0.000     .0264318    .0430105
       _cons |  -2.467365   .1925635   -12.81   0.000    -2.844782   -2.089948
------------------------------------------------------------------------------

. predict Zgamma, xb
. gen Invmills = normden(Zgamma)/norm(Zgamma)
. gen Selhazard = normden(Zgamma)/( 1 - norm(Zgamma) )

. regress wage education age Invmills

      Source |       SS       df       MS              Number of obs =    1343
-------------+------------------------------           F(  3,  1339) =  173.01
       Model |   14904.6806     3  4968.22688           Prob > F      =  0.0000
    Residual |    38450.214  1339  28.7156191           R-squared     =  0.2793
-------------+------------------------------           Adj R-squared =  0.2777
       Total |   53354.8946  1342  39.7577456           Root MSE      =  5.3587

------------------------------------------------------------------------------
        wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   education |   .9825259   .0504982    19.46   0.000     .8834616     1.08159
         age |   .2118695   .0206636    10.25   0.000      .171333     .252406
    Invmills |   4.001616   .5771027     6.93   0.000     2.869492    5.133739
       _cons |   .7340391   1.166214     0.63   0.529    -1.553766    3.021844
------------------------------------------------------------------------------

We can compare these results to a "naive" regression for LF participants.

. regress wage education age

------------------------------------------------------------------------------
        wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   education |   .8965829   .0498061    18.00   0.000     .7988765    .9942893
         age |   .1465739   .0187135     7.83   0.000      .109863    .1832848
       _cons |   6.084875   .8896182     6.84   0.000     4.339679    7.830071
------------------------------------------------------------------------------

It is clear that the naive regression will underestimate the effects of age and education. Using the results from the two-step procedure, we can retrieve the correlation between u1 and u2. In the probit model σu2 = 1, so ρ̂ = λ̂/σ̂u1 = 4.00/5.36 = 0.75, where λ̂ = 4.00 is the coefficient on Invmills and σ̂u1 = 5.36 is the root MSE. Similarly, σ̂12 = ρ̂σ̂u1 = λ̂. We will get identical results with the "canned" two-step procedure in Stata; however, the standard errors will be adjusted for the joint estimation of the two equations. In most cases, it is easiest to use the default ML estimator. However, it is always a good idea to consider two-step estimators when full-information maximum likelihood solutions are not available. That is, there is considerable added value in thinking statistically about joint processes rather than relying on canned or convenient solutions. The ML solution below is not too different from the two-stage results presented earlier.
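As an aside, the two-step logic is easy to reproduce outside Stata. The sketch below is my own toy simulation (the 0.7 error correlation, the coefficients, and the variable names are all made up, and the probit is fit by direct ML with scipy rather than a canned routine); the variable w plays the role of an exclusion restriction such as married or children above.

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(42)
n = 5000

# Simulated data: participation depends on x and w; wages depend on x only.
x = rng.normal(size=n)
w = rng.normal(size=n)
u2 = rng.normal(size=n)
u1 = 0.7 * u2 + np.sqrt(1 - 0.49) * rng.normal(size=n)   # corr(u1, u2) = 0.7
Z = np.column_stack([np.ones(n), x, w])
W = Z @ np.array([0.25, 0.5, 1.0]) + u2 > 0              # participation
X = np.column_stack([np.ones(n), x])
y = X @ np.array([1.0, 2.0]) + u1                        # latent wage

# Step 1: probit for participation, fit by maximum likelihood.
def probit_nll(g):
    xb = Z @ g
    return -(stats.norm.logcdf(xb[W]).sum() + stats.norm.logcdf(-xb[~W]).sum())
ghat = optimize.minimize(probit_nll, np.zeros(3), method="BFGS").x

# Step 2: OLS of y on X plus the inverse Mills ratio, participants only.
zg = Z[W] @ ghat
mills = stats.norm.pdf(zg) / stats.norm.cdf(zg)
coef, *_ = np.linalg.lstsq(np.column_stack([X[W], mills]), y[W], rcond=None)
# coef[:2] estimate beta; coef[2] estimates sigma_12 (0.7 in this design).

naive, *_ = np.linalg.lstsq(X[W], y[W], rcond=None)      # omits the Mills term
# The naive slope is distorted because E(u1 | W = 1) varies with x.
```

With this design the Mills-term coefficient recovers the error covariance, mirroring the retrieval of σ̂12 from the Invmills coefficient above.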
. heckman wage education age, select(works = married children education age)

Heckman selection model                           Number of obs   =       2000
(regression model with sample selection)          Censored obs    =        657
                                                  Uncensored obs  =       1343

                                                  Wald chi2(2)    =     508.44
Log likelihood = -5178.304                        Prob > chi2     =     0.0000

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
wage         |
   education |   .9899537   .0532565    18.59   0.000     .8855729    1.094334
         age |   .2131294   .0206031    10.34   0.000     .1727481    .2535108
       _cons |   .4857752   1.077037     0.45   0.652    -1.625179     2.59673
-------------+----------------------------------------------------------------
works        |
     married |   .4451721   .0673954     6.61   0.000     .3130794    .5772647
    children |   .4387068   .0277828    15.79   0.000     .3842534    .4931601
   education |   .0557318   .0107349     5.19   0.000     .0346917    .0767718
         age |   .0365098   .0041533     8.79   0.000     .0283694    .0446502
       _cons |  -2.491015   .1893402   -13.16   0.000    -2.862115   -2.119915
-------------+----------------------------------------------------------------
     /athrho |   .8742086   .1014225     8.62   0.000     .6754241    1.072993
    /lnsigma |   1.792559    .027598    64.95   0.000     1.738468     1.84665
-------------+----------------------------------------------------------------
         rho |   .7035061   .0512264                      .5885365    .7905862
       sigma |   6.004797   .1657202                       5.68862    6.338548
      lambda |   4.224412   .3992265                      3.441942    5.006881
------------------------------------------------------------------------------
LR test of indep. eqns. (rho = 0):   chi2(1) =    61.20   Prob > chi2 = 0.0000

Switching Regression Models

The switching regressions model is a variant of the classical Heckman selection model. Suppose that instead of observing a truncated distribution for Y, we observe Y1 and Y0 from two truncated distributions. Let Z* denote a social position, status, or other characteristic.
Assume, for now, that this is a latent variable with the following index function:

    Z* = Xγ + v,   Z = 1 if Z* > 0, Z = 0 otherwise,

which can be estimated using models for binary data. Then suppose that we observe the following:

    Y1 = X1β1 + u1   if Z = 1

and

    Y0 = X0β0 + u0   if Z = 0.

That is, we observe Y1 when Z = 1, in which case Y0 is unobserved, latent, or missing. Similarly, we observe Y0 when Z = 0, in which case Y1 is missing. Note that in practice we observe sample respondents in only one state, Z = 1 or Z = 0. In the experimental literature, there is the notion that those assigned to status 1 are identical to those assigned to status 0, so that there is an interchangeability across statuses. It then makes sense to ask what the outcome would be if a respondent in status 1 were assigned to status 0 and vice versa. The model presented above is referred to as an endogenous switching regressions model. It is used to address issues of self-selection and the estimation of treatment effects when there is nonrandom allocation of subjects to treatment and control groups, as is generally the case with observational (as opposed to experimental) data.

Example

Suppose that a vocational/academic training program has two tracks or specializations consisting of demography and theory. Students select into the tracks based on the evaluations of teachers (grades in theory and demography) in addition to their own choices (past courses), as well as tastes and preferences for the subject matter. The probability that a student enters a particular track is not necessarily independent of the expected outcome in that track due to the "self-selection" problem. That is, selection into a track is not determined by a random mechanism as it would be in a controlled experiment. It may be the case that the choice is not simply a function of observed characteristics and that random factors affect the choice of track and the eventual outcome in the track.
If we were to compare differences in outcomes (taken to be earnings in this case) between members of both tracks via a simple difference in means, part of this difference may be due to selectivity, making it difficult to know the "true" effect of the specialization (or track) on the outcome. To summarize, conventional models ignore the possibility that common unmeasured factors affect track assignment and the rewards accruing to individuals in chosen specializations. Individuals may enter a track based on expected or perceived rewards. From a statistical standpoint, this means that the errors in Z are correlated with the errors in Y1 and Y0.

Model Specification

Let the regression function for placement into the demography specialty be

    Z* = Xγ + v,

where Z = 1 if the individual chooses demography and Z = 0 if the individual chooses theory. Let Y1 be the rewards or earnings for demographers and let Y0 be the earnings for theorists. Assume, for simplicity, that the effects of exogenous variables (prestige of university, demand and supply factors, etc.) on earnings in each field are equal, yielding the following earnings functions:

    Y1 = α1 + Xβ + u1   if Z = 1   (demographer)

and

    Y0 = α0 + Xβ + u0   if Z = 0   (theorist).

Because of the selection problem (the failure to observe Y0 when Z = 1 and the failure to observe Y1 when Z = 0), we need to write these outcomes in a selection-equation format, much like the Heckman model. Taking expectations of the outcome equations, we can find the expected earnings for a demographer who self-selected into demography:

    E(Y1 | Z = 1) = E(Y1 | Z* > 0)
                  = E(Y1 | Xγ + v > 0)
                  = E(Y1 | v > −Xγ)
                  = α1 + Xβ + E(u1 | v > −Xγ)
                  = α1 + Xβ + σ1v φ(Xγ)/Φ(Xγ),

which follows due to the truncation of the distribution of Y1 from below.
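This truncated-mean formula can be verified by simulation. The following quick Monte Carlo check is my own (the covariance value 0.8 and the cutoff 0.3 are arbitrary; σv is normalized to 1 as in the probit):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 1_000_000
sigma_1v = 0.8     # cov(u1, v); an arbitrary made-up value (sigma_v = 1)
c = 0.3            # plays the role of X*gamma

v = rng.normal(size=n)
u1 = sigma_1v * v + np.sqrt(1 - sigma_1v**2) * rng.normal(size=n)

simulated = u1[v > -c].mean()                          # E(u1 | v > -c)
analytic = sigma_1v * stats.norm.pdf(c) / stats.norm.cdf(c)
assert abs(simulated - analytic) < 0.01
```

The simulated conditional mean matches σ1v φ(c)/Φ(c) up to Monte Carlo error.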
Similarly, the expected earnings for a theorist who self-selects into theory is

    E(Y0 | Z = 0) = E(Y0 | Z* < 0)
                  = E(Y0 | Xγ + v < 0)
                  = E(Y0 | v < −Xγ)
                  = α0 + Xβ + E(u0 | v < −Xγ)
                  = α0 + Xβ − σ0v φ(Xγ)/[1 − Φ(Xγ)],

which follows from truncation of Y0's distribution from above. You should recognize the additional model terms as the inverse Mills ratio and the selection hazard rate. For two-stage estimation of this model, we substitute these expressions (which are based on a first-stage probit model) into the second-stage regression model. It should be noted that under this specification the outcome becomes two variables for each individual. Of course, we typically only observe the outcome associated with the position (or track) the individual actually enters. However, using this formulation, we can consider the possibility that both the observed outcomes and the hypothetical outcomes that individuals would have if they entered different tracks would affect their assignment to tracks. Direct ML is an alternative to the two-stage approach. Letting αj + Xβ = Xβj for j = 0, 1, the log likelihood for this model is

    log L = Σ_{Z=1} { log Φ(η1) − (1/2) log(2πσ1²) − (1/2)[(Y − Xβ1)/σ1]² }
          + Σ_{Z=0} { log Φ(−η0) − (1/2) log(2πσ0²) − (1/2)[(Y − Xβ0)/σ0]² },

where

    η1 = [Xγ + (Y − Xβ1)σ1v/σ1²] / √(1 − ρ²1v)

and

    η0 = [Xγ + (Y − Xβ0)σ0v/σ0²] / √(1 − ρ²0v).

Note that ρjv = σjv/σj for j = 0, 1.

Interpreting the Covariance Parameters

The covariance parameters tell the direction and magnitude of the selection effects.

• Case 1: If σ0v < 0 and σ1v > 0, this is a case of positive selection into both tracks. This implies that the mean income of those who enter demography is greater than the population average for demographers, and the mean income of those entering theory is higher than the population mean income of theorists. This implies that if demographers were to be placed in (or assigned to) the theory track, they would have lower incomes than those who entered the theory track.
This pattern of covariances is consistent with the notion of two latent dimensions along which individuals can be evaluated, each of which is best suited to one of the positions. In summary, those who enter either program tend to do better than the population average in the respective groups (positive selection).

• Case 2: If σ0v > 0 and σ1v > 0, this is a case where the mean income of those who choose demography is greater than the population average for demographers, but the mean income of those who choose theory is less than the population average for theorists. Those who choose demography are better than average in both demography and theory, but better in demography than in theory. Those who choose theory are below average in both demography and theory, but they are better in theory than in demography. If persons who entered the demography track were placed in the theory track, they would do better than the persons who in fact entered the theory track. This is a case of positive selection into regime 1 and negative selection into regime 0.

• Case 3: If σ0v < 0 and σ1v < 0, this mirrors Case 2.

Self-selectivity. Typically Case 2 occurs when σ1 is very large relative to σ0: individuals with higher skills enter professions with the greater variance in earnings.

Selection and Treatment Effects

We can evaluate the treatment effect (the effect of tracking, program placement, or specialization) as the difference in conditional expectations

    T(Z) = E(Y1 | X, Z = 1) − E(Y0 | X, Z = 0) = α1 − α0,

which is simplified by the constraints placed on β (i.e., that β0 = β1). Due to the presence of the additional model terms φ(·)/Φ(·) and φ(·)/[1 − Φ(·)], our estimates of α1, α0, and β will likely be different than they would be had we not controlled for selectivity. Thus, treatment effects would be biased in the presence of nonrandom selection into regimes (also called endogeneity bias).
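Before turning to estimation in Stata, note that the log likelihood above translates directly into code. The following function is my own sketch in Python (the name switch_loglik and the parameter packing are mine; σv is normalized to 1):

```python
import numpy as np
from scipy import stats

def switch_loglik(theta, y, X, Xg, z):
    """Log likelihood of the endogenous switching model (sigma_v = 1).

    theta packs, in order: beta1 (k), beta0 (k), gamma (m),
    log sigma1, log sigma0, atanh rho1v, atanh rho0v.
    """
    k, m = X.shape[1], Xg.shape[1]
    b1, b0, g = theta[:k], theta[k:2*k], theta[2*k:2*k + m]
    s1, s0 = np.exp(theta[-4]), np.exp(theta[-3])
    r1, r0 = np.tanh(theta[-2]), np.tanh(theta[-1])
    xg = Xg @ g
    e1, e0 = y - X @ b1, y - X @ b0
    # eta_j = [X*gamma + (y - X*beta_j) * sigma_jv / sigma_j^2] / sqrt(1 - rho_jv^2),
    # written here using rho_jv = sigma_jv / sigma_j.
    eta1 = (xg + r1 * e1 / s1) / np.sqrt(1 - r1**2)
    eta0 = (xg + r0 * e0 / s0) / np.sqrt(1 - r0**2)
    ll1 = stats.norm.logcdf(eta1) - 0.5*np.log(2*np.pi*s1**2) - 0.5*(e1/s1)**2
    ll0 = stats.norm.logcdf(-eta0) - 0.5*np.log(2*np.pi*s0**2) - 0.5*(e0/s0)**2
    return np.where(z == 1, ll1, ll0).sum()
```

When ρ1v = ρ0v = 0 the likelihood factors into a probit likelihood for Z and ordinary normal likelihoods for Y within each regime, which provides a useful sanity check.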
Example 1 Using Stata

We now show how to estimate this model using two-stage procedures and a way to trick the Heckman model into estimating a switching regression model. We use the following Stata code to generate the (artificial) data for this example.

clear
*log using switchreg.log, replace
set seed 12309
matrix m=(0,0,0)
matrix c=(1, .5, .5\.5,1,0\.5,0,1)
drawnorm u1 u2 u3, n(1000) means(m) corr(c)
* generate some x-variables
gen x1=uniform() - .50
gen x2=uniform() + .25
* generate latent regime selection equation
gen zstar = -1 + 1.0*x1 + 1.25*x2 + u1
* generate observed counterpart
gen zobs = (zstar > 0)
sum zobs
* generate y's
gen y1s = 20 + 5.75*x1 + 5*u2 if zobs==1
gen y0s = 10 + 5.75*x1 + 4*u3 if zobs==0
* write this data for use by other programs
replace y1s = 0 if y1s == .
replace y0s = 0 if y0s == .
gen y = y1s+y0s
egen id=fill(1 2)
outfile id y x1 x2 zobs using swregsim.raw, replace

We could estimate the model using this code. However, we may want to use the same data to estimate the model using special-purpose software such as aML (Applied Maximum Likelihood), which offers numerous advantages when estimating complex multi-process models. We will provide details later. For now, we will estimate the endogenous switching model in Stata.

clear
infile id y x1 x2 zobs using swregsim.raw
gen y0s = y if zobs==0
gen y1s = y if zobs==1
gen z0 = zobs==0
gen z1 = zobs==1
probit z1 x1 x2
predict xb1, xb
gen m11 = normden(xb1)/(norm(xb1))
gen m00 = -normden(xb1)/(norm(-xb1))
* use usual formula for two-step Heckman model (twice)
regress y x1 m00 if zobs==0
regress y x1 m11 if zobs==1

The two-step procedure gives the following results.
. regress y x1 m00 if zobs==0

      Source |       SS       df       MS              Number of obs =     517
-------------+------------------------------           F(  2,   514) =   63.92
       Model |   1609.68059     2  804.840293           Prob > F      =  0.0000
    Residual |   6471.80983   514  12.5910697           R-squared     =  0.1992
-------------+------------------------------           Adj R-squared =  0.1961
       Total |   8081.49041   516  15.6618031           Root MSE      =  3.5484

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   5.413118   .6954651     7.78   0.000     4.046814    6.779422
         m00 |  -1.408397   .7325275    -1.92   0.055    -2.847513    .0307189
       _cons |    10.5414   .5541568    19.02   0.000     9.452709    11.63009
------------------------------------------------------------------------------

. regress y x1 m11 if zobs==1

      Source |       SS       df       MS              Number of obs =     483
-------------+------------------------------           F(  2,   480) =   16.50
       Model |   629.518624     2  314.759312           Prob > F      =  0.0000
    Residual |   9159.29437   480  19.0818633           R-squared     =  0.0643
-------------+------------------------------           Adj R-squared =  0.0604
       Total |     9788.813   482  20.3087407           Root MSE      =  4.3683

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   5.035281   .8915633     5.65   0.000     3.283431     6.78713
         m11 |   2.429054   .9070122     2.68   0.008     .6468493     4.21126
       _cons |   20.25713   .7405297    27.35   0.000     18.80205    21.71221
------------------------------------------------------------------------------

Estimates of Selection Effects. Letting λ1 = φ(Xγ)/Φ(Xγ) and λ0 = −φ(Xγ)/Φ(−Xγ), the coefficients on these terms (m11 and m00 above) estimate the covariances σv1 and σv0. We can then retrieve the covariances and correlations as follows:

    σ̂v0 = −1.41,   ρ̂v0 = σ̂v0/σ̂0 = −1.41/3.55 = −0.40
    σ̂v1 =  2.43,   ρ̂v1 = σ̂v1/σ̂1 =  2.43/4.37 =  0.56

Thus, we find the case (Case 1) of positive selection into both tracks. The ML solution that uses the heckman procedure twice is similar.
However, we must use a different probit model for each estimation (unlike the two-step approach used above). As a result, we will have to change the sign of ρ in the Y0 equation when interpreting the results. Including the first-stage probit model twice will have some effect on the estimates of the selection parameters. In either case, it is not possible to constrain the effects of X to be the same across separate models. However, it is not difficult to write a Stata program to do this, as we will see later.

* Trick Heckman: Because it uses only phi/PHI, we need to
* change signs of effects involving z0 (including rhoV0)
*
* change sign for rho and Z0 terms
heckman y x1, select(z0 = x1 x2)
* OK
heckman y x1, select(z1 = x1 x2)

Results.

Heckman selection model                           Number of obs   =       1000
(regression model with sample selection)          Censored obs    =        483
                                                  Uncensored obs  =        517

                                                  Wald chi2(1)    =      69.13
Log likelihood = -2013.432                        Prob > chi2     =     0.0000

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
y0s          |
          x1 |   5.561111   .6688425     8.31   0.000     4.250204    6.872018
       _cons |    10.7338   .4810195    22.31   0.000      9.79102    11.67658
-------------+----------------------------------------------------------------
z0           |
          x1 |  -1.042018   .1448909    -7.19   0.000    -1.325999   -.7580367
          x2 |  -1.293231   .1446665    -8.94   0.000    -1.576772   -1.009689
       _cons |   1.030022   .1175535     8.76   0.000     .7996216    1.260423
-------------+----------------------------------------------------------------
     /athrho |   .3244194   .1790397     1.81   0.070    -.0264918    .6753307
    /lnsigma |    1.29397   .0433012    29.88   0.000     1.209101    1.378839
-------------+----------------------------------------------------------------
         rho |   .3134975   .1614435                     -.0264857    .5884755
       sigma |   3.647238   .1579296                      3.350473    3.970289
      lambda |     1.1434   .6243263                     -.0802569    2.367057
------------------------------------------------------------------------------
LR test of indep. eqns. (rho = 0):   chi2(1) =     3.12   Prob > chi2 = 0.0774
------------------------------------------------------------------------------

. heckman y1s x1, select(z1 = x1 x2)

Heckman selection model                           Number of obs   =       1000
(regression model with sample selection)          Censored obs    =        517
                                                  Uncensored obs  =        483

                                                  Wald chi2(1)    =      33.71
Log likelihood = -2021.482                        Prob > chi2     =     0.0000

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
y1s          |
          x1 |     5.2987   .9126277     5.81   0.000     3.509982    7.087417
       _cons |   19.93652   .7172913    27.79   0.000     18.53065    21.34238
-------------+----------------------------------------------------------------
z1           |
          x1 |    1.03737   .1439349     7.21   0.000     .7552631    1.319477
          x2 |   1.279762   .1446453     8.85   0.000     .9962629    1.563262
       _cons |   -1.01894   .1172908    -8.69   0.000    -1.248825   -.7890542
-------------+----------------------------------------------------------------
     /athrho |   .6657136   .2135947     3.12   0.002     .2470756    1.084352
    /lnsigma |   1.581906   .0678032    23.33   0.000     1.449014    1.714798
-------------+----------------------------------------------------------------
         rho |   .5821532    .141207                      .2421677    .7948073
       sigma |   4.864216   .3298095                      4.258912    5.555551
      lambda |   2.831719   .8606427                      1.144891    4.518548
------------------------------------------------------------------------------
LR test of indep. eqns. (rho = 0):   chi2(1) =     8.68   Prob > chi2 = 0.0032

A more convenient approach is to use the add-on procedure movestay, written by M. Lokshin (DECRG, The World Bank) and Z. Sajaia (Stanford University), which was designed specifically for this type of endogenous switching regressions model.

. movestay (y = x1)(y = x1), select(zobs=x1 x2)

Endogenous switching regression model             Number of obs   =       1000
                                                  Wald chi2(1)    =      33.69
Log likelihood = -3408.6387                       Prob > chi2     =     0.0000

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
y_1          |
          x1 |   5.281593   .9098828     5.80   0.000     3.498255     7.06493
       _cons |   19.94938   .7152708    27.89   0.000     18.54747    21.35128
-------------+----------------------------------------------------------------
y_0          |
          x1 |   5.564055   .6678406     8.33   0.000     4.255112    6.872999
       _cons |   10.73037   .4826484    22.23   0.000     9.784397    11.67634
-------------+----------------------------------------------------------------
zobs         |
          x1 |   1.033465   .1439562     7.18   0.000     .7513159    1.315614
          x2 |   1.288041   .1437439     8.96   0.000     1.006308    1.569774
       _cons |  -1.024566   .1167285    -8.78   0.000    -1.253349   -.7957818
-------------+----------------------------------------------------------------
     sigma_1 |   4.859087   .3278679                      4.257157    5.546125
     sigma_2 |   3.647968   .1584643                      3.350238    3.972157
       rho_1 |   .5799307   .1413935                      .2400231    .7931687
       rho_2 |  -.3143474   .1617676                      -.589687    .0264554
------------------------------------------------------------------------------
LR test of indep. eqns. :   chi2(1) =    11.80   Prob > chi2 = 0.0006
------------------------------------------------------------------------------

We note that the results are very similar to running heckman twice. However, this approach uses full-information maximum likelihood, so the standard errors are correct. A drawback with either of these approaches is that it is not possible to constrain covariate effects to be the same across the different regimes (tracks). For example, we have no compelling reason to assume that the effects of X are different for theorists and demographers. To invoke constrained estimation, we need to modify the movestay routine to allow this.
It is convenient to write a new program swreg as a single call to ml as follows:

cap prog drop swreg
prog def swreg
    version 8
    *set trace on
    args todo b lnf
    #delimit ;
    tempname xb1 xb2 xb3 sig1 sig2 rho1 rho2 rr1 rr2;
    mleval `xb1'  = `b', eq(1);
    mleval `xb2'  = `b', eq(2);
    mleval `xb3'  = `b', eq(3);
    mleval `sig1' = `b', eq(4) scalar;
    mleval `sig2' = `b', eq(5) scalar;
    mleval `rho1' = `b', eq(6) scalar;
    mleval `rho2' = `b', eq(7) scalar;
    scalar `rr1' = 1/sqrt(1-`rho1'^2);
    scalar `rr2' = 1/sqrt(1-`rho2'^2);
    tempvar eta1 eta2 lf1 eps1 eps2;
    qui generate double `eta1' = (`xb3' + ($ML_y1 - `xb1')*`rho1'/`sig1')*`rr1';
    qui replace `eta1' = -37 if (`eta1' < -37);
    qui generate double `eta2' = (`xb3' + ($ML_y2 - `xb2')*`rho2'/`sig2')*`rr2';
    qui replace `eta2' =  37 if (`eta2' >  37);
    qui generate double `eps1' = $ML_y1 - `xb1';
    qui generate double `eps2' = $ML_y2 - `xb2';
    tempname const1 const2;
    scalar `const1' = 0.5*ln(2*_pi*`sig1'^2);
    scalar `const2' = 0.5*ln(2*_pi*`sig2'^2);
    qui mlsum `lnf' = cond($ML_y3==1,
        ln(norm( `eta1')) - `const1' - 0.5*(`eps1'/`sig1')^2,
        ln(norm(-`eta2')) - `const2' - 0.5*(`eps2'/`sig2')^2);
    if (`todo'==0 | `lnf'==.) exit;
    #delimit cr
end

*
* Get starting values for b1, b2, b3, sigma1, and sigma2.
*
qui probit zobs x1 x2
matrix b3 = e(b)
predict xg, xb
qui gen lam1 =  normden(xg)/norm(xg)
qui gen lam2 = -normden(xg)/norm(-xg)
qui reg y x1 if zobs==1
matrix b1 = e(b)
scalar s1 = e(rmse)
qui reg y x1 if zobs==0
matrix b2 = e(b)
scalar s2 = e(rmse)
*
* Obtain starting values for rho1 & rho2. [optional]
*
qui reg y x1 lam1 if zobs==1
scalar rho1 = _b[lam1]/s1
qui reg y x1 lam2 if zobs==0
scalar rho2 = _b[lam2]/s2
matrix b = (b1, b2, b3, s1, s2, rho1, rho2)
cons def 1 [y1]x1 = [y0]x1
ml model d0 swreg (y1: y = x1)(y0: y = x1)(z: zobs = x1 x2) ///
    (sigma1:)(sigma2:)(rho1:)(rho2:), const(1)
ml init b, copy
ml maximize

We can now estimate the model with any number of constraints on the parameters.
However, I have made some modifications which, while offering some flexibility, do not guarantee that the σ's and ρ's stay in bounds. This means that we will require good starting values based on previous regressions, as shown above. It is always a good idea to compare the results of complicated models to those from alternative approaches. A comparison is provided below.

Comparison of Results.

               Heckman Twice      movestay          swreg             aML
----------------------------------------------------------------------------------
x0             5.41 (0.695)      5.57 (0.668)      5.46 (0.526)      5.45 (0.546)
x1             5.04 (0.892)      5.28 (0.910)      (constrained)     (constrained)
constant0     10.54 (0.554)     10.73 (0.483)     10.69 (0.453)     10.68 (0.505)
constant1     20.26 (0.740)     19.95 (0.715)     19.86 (0.610)     19.87 (0.601)
sigma0         3.55              3.86 (0.159)      3.66 (0.157)      3.66 (0.167)
sigma1         4.36              4.86 (0.328)      4.89 (0.297)      4.89 (0.296)
rho0          -0.40             -0.31 (0.162)     -0.33 (0.151)     -0.33 (0.168)
rho1           0.56              0.58 (0.141)      0.60 (0.120)      0.60 (0.121)

Example 2 using aML

It is best to use full information ML on this problem. Outside of movestay and the custom-built ML routine above (swreg), there are only a couple of canned programs that can estimate this kind of model. The most user-friendly of these special-purpose programs is aML (Applied Maximum Likelihood), a general-purpose package for models with mixed structures (as well as multiple levels). It is a free package written by researchers at the RAND Corporation. I encourage everyone to learn it, or at least become familiar with the idea of multi-process models. It can do everything from structural equation modeling to multilevel models, and it is the only package that allows users to mix both of these kinds of models. For more information see www.applied-ml.com. Below is the setup for an endogenous switching regression using aML. The script and the results are shown.
Note that the setup requires some additional effort on the part of the user to provide details about the correct joint (trivariate normal) distribution for the error terms and the starting values. The results are similar to those reported above, with minor differences due to the fact that we have constrained the effect of X1 to be the same across regimes. The aML results compare favorably to those from the swreg routine above, which is not surprising since the same parameter constraints are used by both programs. Interpretation of the error correlations is more straightforward in full ML, as these measures reflect the direct association between Y and Pr(Z∗ > 0).

dsn = swregsim.dat;

define regressor set alphaX; var = 1 x1 x2;
define regressor set betaX;  var = x1;
define par mu0;
define par mu1;

define normal distribution; dim = 3;
    number of integration points = 15;
    name = v;
    name = u0;
    name = u1;

probit model;
    outcome = z;
    model = regset alphaX + res(draw=1, ref=v);

continuous model; keep if z==0;
    outcome = y;
    model = par mu0 + regset betaX + res(draw=1, ref=u0);

continuous model; keep if z==1;
    outcome = y;
    model = par mu1 + regset betaX + res(draw=1, ref=u1);

starting values;
Zcons   T  -1.
Zx1     T   1.0
Zx2     T   1.2
Yx1     T   5
mu0     T   10.213598521
mu1     T   19.539468801
sigV    F   1.0
sigU0   T   3.9396435831
sigU1   T   4.6505359691
rhoVU0  T   0
rhoVU1  T   0
rhoU0U1 F   0.0;

We can compare the results to a model that constrains the selection effects. For this, we simply modify the script above and constrain the ρ terms to zero.
                 swregsim0               swregsim
---------------------------------------------------------
Zcons       -1.4434 *** (0.1703)    -1.0195 *** (0.1164)
Zx1          1.4746 *** (0.2079)     1.0420 *** (0.1400)
Zx2          1.8103 *** (0.2102)     1.2812 *** (0.1438)
Yx1          5.2060 *** (0.4319)     5.4452 *** (0.5475)
mu0         11.5075 *** (0.1580)    10.6783 *** (0.5053)
mu1         22.0471 *** (0.2056)    19.8692 *** (0.6010)
sigV         1.0000                  1.0000
sigU0        3.5632 *** (0.1123)     3.6594 *** (0.1666)
sigU1        4.4109 *** (0.1450)     4.8919 *** (0.2956)
rhoVU0       0.0000                 -0.3304 **  (0.1683)
rhoVU1       0.0000                  0.5950 *** (0.1206)
rhoU0U1      0.0000                  0.0000
ln-L        -3418.90                -3408.67

Here (as in all the other models) we find that taking account of self-selection results in a smaller estimated track effect (α̂1 − α̂0), which is 10.55 in the conventional model and 9.19 in the endogenous switching model. Thus, we find that earnings differences are overstated in the conventional model by about 1.4 units.

Next we examine how we might evaluate the expected earnings if individuals were assigned to statuses other than the ones they in fact entered. This might be useful from a policy standpoint. For example, if a sudden paradigm shift makes all past theory obsolete, theorists would be forced to become demographers (as opposed to cab drivers). The expected demography-track outcome for those who in fact entered theory is:

E(Y1 | Z∗ < 0) = α1 + Xβ1 + E(u1 | v < −Xγ)
              = α1 + Xβ1 − σ1v φ(Xγ)/[1 − Φ(Xγ)].

Similarly, the expected theory-track outcome for those who entered demography would be

E(Y0 | Z∗ > 0) = α0 + Xβ0 + E(u0 | v > −Xγ)
              = α0 + Xβ0 + σ0v φ(Xγ)/Φ(Xγ).

Note that for the general model we can allow X to vary by track, as is the default in procedures like heckman-twice and movestay. However, for simplicity, and without a compelling reason to expect that returns to X differ by group, let us assume that respondents do not differ except with regard to mean earnings; that is, both the endowments (X) and the returns to X (β) are identical for a hypothetical demographer and theorist.
Then the only difference is between the means, which are 20 income units for demographers and 10 income units for theorists. Assume, for convenience, that Xγ = 1, σ1v = 5, and σ0v = −4, which are similar to our estimated values. Then

E(Y1 | Z∗ > 0) = E(Y for demographer | in demography track) = 21.44
E(Y0 | Z∗ < 0) = E(Y for theorist    | in theory track)     = 16.10
E(Y0 | Z∗ > 0) = E(Y for theorist    | in demography track) =  8.85
E(Y1 | Z∗ < 0) = E(Y for demographer | in theory track)     = 12.37

Verify these results as an exercise.

Example Using Real Data from the NELS. In this example a switching regression model is used to assess the effect of becoming sexually active prior to the senior year in high school on 12th-grade math achievement test scores using the NELS data. About 70% of the sample had sex before 12th grade. We wish to examine possible endogeneity of first sex (i.e., common unmeasured factors affect both sexual initiation and academic achievement) and measure the "treatment" effect controlling for differential selectivity into "virgin" and "nonvirgin" statuses. In this case, the selection process is best modeled using a hazard model for the risk of first sex at age t, which is the kind of model that I discuss in another course (SOC386L). However, to be consistent with the previous discussion, we will use a probit model instead. This involves a probit selection equation with latent outcome Z∗ and an achievement outcome (math test score) Y:

Z∗ = Xα + v,    Z = 1 if Z∗ > 0, 0 otherwise
Y0 = µ0 + Xβ + u0   if virgin
Y1 = µ1 + Xβ + u1   if nonvirgin,

where v, u1, and u0 follow a trivariate normal distribution. Note that cov(u1, u0) is fixed at 0 because a person cannot be observed in both regimes (virgin or nonvirgin), so this covariance is not identified. The sign and magnitude of cov(v, u1) and cov(v, u0) give the magnitude and direction of the selection bias, and µ1 − µ0 gives the treatment effect.
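The four conditional means in the exercise above (Xγ = 1, σ1v = 5, σ0v = −4, track means 20 and 10) can be checked with a short standard-library Python calculation; the variable names below are mine:

```python
import math

def phi(x):
    # standard normal density
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    # standard normal cdf
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

xg, s1v, s0v, mu1, mu0 = 1.0, 5.0, -4.0, 20.0, 10.0

# Truncated-normal means of v (see the Appendix):
# E(v | v > -Xg) = phi(Xg)/Phi(Xg);  E(v | v < -Xg) = -phi(Xg)/(1 - Phi(Xg))
ev_hi = phi(xg) / Phi(xg)
ev_lo = -phi(xg) / (1.0 - Phi(xg))

e_y1_in  = mu1 + s1v * ev_hi   # demographer outcome, demography track
e_y0_out = mu0 + s0v * ev_lo   # theorist outcome, theory track
e_y0_in  = mu0 + s0v * ev_hi   # theorist outcome, demography track
e_y1_out = mu1 + s1v * ev_lo   # demographer outcome, theory track

# matches the text's 21.44, 16.10, 8.85, 12.37 (to rounding)
print([round(v, 2) for v in (e_y1_in, e_y0_out, e_y0_in, e_y1_out)])
```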
Note that the heterogeneity component v in the first-sex equation cannot be separately identified, because first sex happens only once and we have a single-level data structure. So we fix the heterogeneity variance in sexual initiation at 1, but model the correlations between the random components of the two processes.

aML Model Code. The model is easiest to set up using specialized software (aML).

dsn = ach0.dat;

define regressor set AlphaX; var = 1 black intact;
define regressor set Beta1X; var = black cath pub urb faminc;
define par mu1;
define par mu2;

define normal distribution; dim = 3;
    number of integration points = 10;
    name = v;
    name = u0;
    name = u1;

probit model;
    outcome = (1-censor);
    model = regset AlphaX + res(draw=1, ref=v);

continuous model; keep if censor==1;   /* virgins */
    outcome = y;
    model = par mu1 + regset Beta1X + res(draw=1, ref=u0);

continuous model; keep if censor==0;   /* nonvirgins */
    outcome = y;
    model = par mu2 + regset Beta1X + res(draw=1, ref=u1);

starting values;
Const   T   1.3419743976
Blk     T   .23864170709
Intact  T  -0.5956678435
Black   T  -6.0047121792
CathSch T  -2.8252736269
PubSch  T  -5.5770334811
Urb     T   .24995452413
Inc     T   1.6443668192
Mu 0    T   41.045145274
Mu 1    T   37.437397243
sigv    T   1.0
sigu0   T   12.993118301
sigu1   T   12.871405018
r12     T   0.0
r13     T   0.0
r23     F   0.0 ;

Results

                             Model ach_0             Model ach_1
Probit (Nonvirgin)
  Intercpt                1.5337 *** (0.0639)     1.5101 *** (0.0667)
  Blk                     0.2968 *** (0.1038)     0.3013 *** (0.1039)
  Intact                 -0.7107 *** (0.0706)    -0.6856 *** (0.0734)
  sigv                    1.0000                  1.0000
--------------------------------------------------------------------
Regression (Test Score)
  Black                  -5.7804 *** (0.6328)    -6.2059 *** (0.6717)
  CathSch                -2.8012 *** (0.9320)    -2.7273 *** (0.9324)
  PubSch                 -5.6172 *** (0.8032)    -5.5940 *** (0.8035)
  Urb                     0.3078     (0.4574)     0.2987     (0.4578)
  Inc                     1.6290 *** (0.0811)     1.6667 *** (0.0831)
  Constant (Virgins)     41.2275 *** (1.3009)    40.6370 *** (2.0136)
  Constant (Nonvirgins)  36.7348 *** (1.2628)    37.9446 *** (1.3864)
  sigu0                  12.9935 *** (0.2817)    12.9974 *** (0.2822)
  sigu1                  12.7756 *** (0.1533)    13.0431 *** (0.2918)
  r12                     0.0000                 -0.0188     (0.1268)
  r13                     0.0000                 -0.4255 **  (0.1938)
  r23                     0.0000                  0.0000
  ln-L                 -20446.85               -20446.52

TREATMENT EFFECT (difference):   exog 4.492    endog 2.692

NOTE: Asymptotic standard errors in parentheses; Significance: '*'=10%; '**'=5%; '***'=1%.
NOTE: all standard errors are based on numerical Hessians.

Thus, we find that the gain in 12th-grade math test scores from being a virgin is overstated by a bit under two points under the assumption that first sex is exogenous. The estimate of the correlation (r13) between v and u1 is negative and statistically different from 0, which implies that nonvirgins have lower test scores than would a population of students assigned at random to nonvirgin status.

Other Approaches

As we saw, the switching regressions approach is well suited for estimating treatment effects under endogenous (i.e., self-selected) assignment to treatment and control groups. Next, we illustrate a simpler approach to estimating endogenous treatment effects. You have probably thought that if we are interested in the effect of track membership on earnings, why not simply include a dummy variable in our earnings model, as

Y = α + Xβ + Zδ + u.

Of course, now you know that δ will not measure the value of a specialization in demography because, as we all know, a person who chooses the demography track will have high earnings whether or not they chose demography as a specialty (due to innate abilities that we can dream about, but cannot hope to measure). This means that our estimate of δ will overstate the true treatment effect (of specializing in demography). The appropriate model in this case is

Z∗ = Xγ + v
Y∗ = α + Xβ + Zδ + u.

We assume, as before, that u and v are bivariate normal with zero means and variance-covariance matrix

Σvu = [ σv²   σuv ]
      [ σuv   σu² ].

We can estimate this model using Stata's treatreg procedure.
. treatreg y x1, treat(z1=x1 x2)

Treatment effects model -- MLE                    Number of obs   =       1000
                                                  Wald chi2(2)    =     534.37
Log likelihood = -3429.6537                       Prob > chi2     =     0.0000

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
y            |
          x1 |   5.230804   .5684312     9.20   0.000       4.1167    6.344909
          z1 |   9.754676   .9276166    10.52   0.000     7.936581    11.57277
       _cons |   11.88237   .4639237    25.61   0.000     10.97309    12.79164
-------------+----------------------------------------------------------------
z1           |
          x1 |   1.045804   .1447645     7.22   0.000     .7620707    1.329537
          x2 |   1.286205   .1456226     8.83   0.000      1.00079     1.57162
       _cons |  -1.027875   .1182486    -8.69   0.000    -1.259638   -.7961118
-------------+----------------------------------------------------------------
     /athrho |   .1368113   .1480757     0.92   0.356    -.1534118    .4270344
    /lnsigma |   1.389886   .0250241    55.54   0.000      1.34084    1.438933
-------------+----------------------------------------------------------------
         rho |   .1359641   .1453384                     -.1522195      .40284
       sigma |   4.014394   .1004568                      3.822253    4.216194
      lambda |   .5458133   .5897084                     -.6099939    1.701621
------------------------------------------------------------------------------
LR test of indep. eqns. (rho = 0):  chi2(1) =     0.87   Prob > chi2 = 0.3512
------------------------------------------------------------------------------

Note that our estimate of the treatment effect is 9.75, which compares to an estimate of 10.57 from a naive regression. Heckman suggested a two-stage approach for this model, which is referred to as the dummy endogenous regression model with structural shift. He suggested obtaining predicted values for Pr(Z = 1|X) from a probit model, say p̂Z, and including this as an additional regressor in a regression model. Following this procedure for these data gives a treatment effect of 9.70.
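To see the logic of the overstatement concretely, here is a small self-contained Python simulation of the dummy-endogenous setup. It is a sketch only: the parameter values are made up (not the document's simulated data), X is dropped from the outcome equation for simplicity, and the "naive estimate" is the group-mean difference rather than a full OLS fit.

```python
import math
import random

random.seed(42)

# Simplified endogenous-treatment setup (illustrative values only):
# Z* = x + v, Z = 1(Z* > 0); Y = alpha + delta*Z + u, with corr(u, v) = rho > 0.
alpha, delta, rho, n = 10.0, 10.0, 0.5, 20000

sum1 = sum0 = 0.0
n1 = n0 = 0
for _ in range(n):
    x = random.gauss(0, 1)
    v = random.gauss(0, 1)
    u = rho * v + math.sqrt(1.0 - rho ** 2) * random.gauss(0, 1)
    z = 1 if x + v > 0 else 0
    y = alpha + delta * z + u
    if z:
        sum1 += y; n1 += 1
    else:
        sum0 += y; n0 += 1

# The naive "treatment effect" absorbs E(u|Z=1) - E(u|Z=0),
# which here equals rho * 2 / sqrt(pi) (about 0.56), so it exceeds delta.
naive_delta = sum1 / n1 - sum0 / n0
print(round(naive_delta, 2))  # noticeably above the true delta of 10
```

The same self-selection mechanism is why the naive estimate of 10.57 above exceeds the treatreg estimate.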
The quality of this estimate is likely to depend on the predictive ability of the Z equation (which is too good here, because we simulated these data). Note that the estimate of ρ of 0.136 is not significant, and it differs from the two estimates we obtained using the switching regression approach. This is because this single estimate is a weighted average of ρv0 and ρv1, which have opposite signs. For this reason it is informative about the general issue of endogeneity but not about the direction of selection, except to suggest that there is a slight self-selection effect into demography.

The setup for this model in aML is straightforward, and it gives results identical to treatreg. However, it should be noted that aML will accommodate different types of outcome variables, not simply continuous response variables.

dsn = swregsim.dat;

/* specify regressor set for regime */
define regressor set alphaX; var = 1 x1 x2;

/* specify regressor set for outcome */
define regressor set betaX; var = 1 x1 z;

/* set up equation error structure */
define normal distribution; dim = 2;
    number of integration points = 15;
    name = v;
    name = u;

probit model;
    outcome = z;
    model = regset alphaX + res(draw=1, ref=v);

continuous model;
    outcome = y;
    model = regset betaX + res(draw=1, ref=u);

starting values;
Zcons  T  -1.0
Zx1    T   1.0
Zx2    T   1.2
mu0    T   10.213598521
Yx1    T   5
Yz     T   10
sigV   F   1.0
sigU   T   5
rhoVU  T   0 ;

======================================================================
=               ESTIMATION CONVERGED SUCCESSFULLY                    =
=                   RESULTS OF ESTIMATION                            =
======================================================================

Log Likelihood:  -3429.6545                 BHHH-based, non-corrected
Parameter          Free?    Estimate           Std Err         T-statistic
 1   Zcons           T     -1.0267557313      .12061251389       -8.5128
 2   Zx1             T      1.044635091       .14656887929        7.1273
 3   Zx2             T      1.2845931836      .14862568326        8.6431
 4   mu0             T      11.874803358      .48747752746       24.3597
 5   Yx1             T      5.2317064567      .58004436879        9.0195
 6   Yz              T      9.761195825       .95280957924       10.2446
 7   sigV            F      1.0
 8   sigU            T      4.0137859425      .10137260868       39.5944
 9   rhoVU           T      .13538399022      .15030444659        0.9007
======================================================================

Extensions to Binary Responses

So far we have considered continuous variables as the substantive outcomes. What if the outcome is binary? In this case, we can use variants of bivariate probit models. The bivariate probit consists of two probit equations with correlated errors:

Z∗ = Xα + u
Y∗ = Xβ + v,

where u and v are distributed bivariate normal with zero means, unit variances, and correlation ρ. A complication arises when Z (the realization of Z∗) appears as an endogenous regressor in the Y∗ equation. In this case, we need identifying restrictions: some of the X's in the Z∗ equation must not appear in the Y∗ equation in order to strongly identify the effect of Z. Consider the example

Z∗ = α0 + α1 X1 + α2 X2 + u
Y∗ = β0 + γZ + β1 X2 + v.

We can use ML or a two-stage approach. The two-stage approach uses a probit model for Z∗ and generates a predicted probability p̂Z that is used as a regressor in the Y∗ equation.

. probit z x1 x2

Probit estimates                                  Number of obs   =       1000
                                                  LR chi2(2)      =     149.04
                                                  Prob > chi2     =     0.0000
Log likelihood = -617.56905                       Pseudo R2       =     0.1077

------------------------------------------------------------------------------
           z |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   1.079573   .1479664     7.30   0.000     .7895645    1.369582
          x2 |   1.468242   .1482265     9.91   0.000     1.177724    1.758761
       _cons |  -1.174744   .1200381    -9.79   0.000    -1.410015     -.939474
------------------------------------------------------------------------------

. predict pz
(option p assumed; Pr(z))

. probit y z x2

Probit estimates                                  Number of obs   =       1000
                                                  LR chi2(2)      =     693.33
                                                  Prob > chi2     =     0.0000
Log likelihood = -338.77391                       Pseudo R2       =     0.5058

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           z |   2.545253   .1240325    20.52   0.000     2.302154    2.788352
          x2 |   .3813336    .192569     1.98   0.048     .0039052     .758762
       _cons |  -1.106581   .1469428    -7.53   0.000    -1.394583   -.8185779
------------------------------------------------------------------------------

. probit y pz x2

Probit estimates                                  Number of obs   =       1000
                                                  LR chi2(2)      =      88.95
                                                  Prob > chi2     =     0.0000
Log likelihood = -640.96636                       Pseudo R2       =     0.0649

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          pz |   1.362025   .3685597     3.70   0.000     .6396615    2.084389
          x2 |   .5274081   .2360944     2.23   0.025     .0646715    .9901446
       _cons |  -.8794941   .1194334    -7.36   0.000    -1.113579     -.645409
------------------------------------------------------------------------------

ML Estimation of the Bivariate Probit. For full ML, we need to construct the likelihood function from all possible outcomes. Define:

P11 = Pr(Z = 1, Y = 1) = F(Xα, γ + Xβ, ρ)
P10 = Pr(Z = 1, Y = 0) = F(Xα, −γ − Xβ, −ρ)
P01 = Pr(Z = 0, Y = 1) = F(−Xα, Xβ, −ρ)
P00 = Pr(Z = 0, Y = 0) = F(−Xα, −Xβ, ρ),

where F(·, ·, ρ) is the standard bivariate normal cdf. (Note that γ enters only the Z = 1 cells, since the index of Y∗ contains γZ.) Then,

log L = Σ_{i=1}^{n} [ ZY log P11 + Z(1 − Y) log P10 + (1 − Z)Y log P01 + (1 − Z)(1 − Y) log P00 ].
. biprobit (z x1 x2) (y z x2)

Seemingly unrelated bivariate probit              Number of obs   =       1000
                                                  Wald chi2(4)    =     162.45
Log likelihood = -953.4767                        Prob > chi2     =     0.0000

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
z            |
          x1 |   1.082444   .1475838     7.33   0.000     .7931849    1.371703
          x2 |   1.467911   .1479426     9.92   0.000     1.177949    1.757873
       _cons |  -1.172706   .1198119    -9.79   0.000    -1.407533   -.9378794
-------------+----------------------------------------------------------------
y            |
           z |     1.4282   .4488288     3.18   0.001      .548512    2.307888
          x2 |   .8401728   .2299791     3.65   0.000     .3894219    1.290924
       _cons |  -.9710812   .1523623    -6.37   0.000    -1.269706   -.6724566
-------------+----------------------------------------------------------------
     /athrho |   .6271636   .2447731     2.56   0.010     .1474171     1.10691
-------------+----------------------------------------------------------------
         rho |    .556096   .1690788                      .1463584    .8029675
------------------------------------------------------------------------------
Likelihood-ratio test of rho=0:     chi2(1) =  5.73253   Prob > chi2 = 0.0167

Extension to Switching Regressions. This model can be readily adapted to a switching regressions context. Let the regime equation be

Z∗ = Xα + v

and the structural equations be

Y0∗ = Xβ0 + u0   if Z = 0
Y1∗ = Xβ1 + u1   if Z = 1.

Define:

P1 = Pr(Z = 0, Y0 = 1) = F(−Xα, Xβ0; −ρ0)
P2 = Pr(Z = 0, Y0 = 0) = F(−Xα, −Xβ0; ρ0)
P3 = Pr(Z = 1, Y1 = 1) = F(Xα, Xβ1; ρ1)
P4 = Pr(Z = 1, Y1 = 0) = F(Xα, −Xβ1; −ρ1),

where F(·, ·, ρ) is the standard bivariate normal cdf. Then,

log L = Σ_{i=1}^{n} [ (1 − Z){Y0 log P1 + (1 − Y0) log P2} + Z{Y1 log P3 + (1 − Y1) log P4} ].

This model cannot be estimated directly in Stata. However, it is not too difficult to program. Here we define a new procedure called swbvprob for switching bivariate probit regressions.
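Before turning to the Stata routine, note that the bivariate normal cdf F(·, ·, ρ) appearing in P1 through P4 is the only non-standard ingredient. It can be approximated by one-dimensional quadrature over the conditioning variable; below is a minimal Python sketch (the function names are mine, and this is an illustration rather than Stata's binorm() implementation):

```python
import math

def phi(x):
    # standard normal density
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    # standard normal cdf
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def binorm_cdf(h, k, rho, lo=-8.0, n=4000):
    # F(h, k; rho) = P(X <= h, Y <= k) for a standard bivariate normal,
    # computed by Simpson's rule on the conditioning integral
    #   int_{-inf}^{h} phi(x) * Phi((k - rho*x)/sqrt(1 - rho^2)) dx,
    # truncating the lower limit at `lo` (valid for |rho| < 1).
    s = math.sqrt(1.0 - rho * rho)
    if n % 2:
        n += 1
    step = (h - lo) / n
    total = 0.0
    for i in range(n + 1):
        x = lo + i * step
        w = 1 if i in (0, n) else (4 if i % 2 else 2)
        total += w * phi(x) * Phi((k - rho * x) / s)
    return total * step / 3.0
```

A quick check is the identity Φ2(0, 0, ρ) = 1/4 + arcsin(ρ)/(2π), and the fact that the Z = 1 cell probabilities must sum to Φ(Xα).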
cap prog drop swbvprob
prog def swbvprob
    *set trace on
    args todo b lnf
    tempvar xb0 xb1 xg rho0 rho1 y0 y1 z lf
    mleval `xb0'  = `b', eq(1)
    mleval `xb1'  = `b', eq(2)
    mleval `xg'   = `b', eq(3)
    mleval `rho0' = `b', eq(4)
    mleval `rho1' = `b', eq(5)
    * scalar `rho0' = 0
    * scalar `rho1' = 0
    local y0 "$ML_y1"
    local y1 "$ML_y2"
    local z  "$ML_y3"
    qui gen double `lf' = binorm( `xb0', -`xg', -`rho0') if (`y0'==1 & `z'==0)
    qui replace    `lf' = binorm(-`xb0', -`xg',  `rho0') if (`y0'==0 & `z'==0)
    qui replace    `lf' = binorm( `xb1',  `xg',  `rho1') if (`y1'==1 & `z'==1)
    qui replace    `lf' = binorm(-`xb1',  `xg', -`rho1') if (`y1'==0 & `z'==1)
    qui mlsum `lnf' = log(`lf')
end
*
* run program
*
probit y0 x1
matrix b1 = e(b)
probit y1 x1
matrix b2 = e(b)
probit zobs x1 x2
matrix b3 = e(b)
matrix b = (b1, b2, b3, 0, 0)
cons def 1 [eq1]x1 = [eq2]x1
ml model d0 swbvprob (eq1: y0 = x1)(eq2: y1 = x1)(eq3: zobs = x1 x2) (eq4:)(eq5:), constr(1)
ml init .83 1.65 .83 1.11 1.0 1.3 -1 .2 .5, copy
ml search eq4 0 0.3
ml search eq5 0 0.6
ml maximize

This routine requires good starting values. After a few iterations, we get the following estimates.

                                                  Number of obs   =       1000
                                                  Wald chi2(0)    =          .
Log likelihood = -840.17903                       Prob > chi2     =          .

 ( 1)  [eq1]x1 - [eq2]x1 = 0
------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
eq1          |
          x1 |   .8274907   .2589903     3.20   0.001     .3198791    1.335102
       _cons |   1.648328   .2526571     6.52   0.000     1.153129    2.143527
-------------+----------------------------------------------------------------
eq2          |
          x1 |   .8274907   .2589903     3.20   0.001     .3198791    1.335102
       _cons |   1.122869   .3617985     3.10   0.002     .4137572    1.831981
-------------+----------------------------------------------------------------
eq3          |
          x1 |   1.057045   .1444147     7.32   0.000     .7739977    1.340093
          x2 |     1.2843   .1458508     8.81   0.000     .9984372    1.570162
       _cons |  -1.023984   .1183104    -8.66   0.000    -1.255869   -.7921003
-------------+----------------------------------------------------------------
eq4          |
       _cons |   .1915166   .4360708     0.44   0.661    -.6631665      1.0462
-------------+----------------------------------------------------------------
eq5          |
       _cons |   .5463507   .2407519     2.27   0.023     .0744857    1.018216
------------------------------------------------------------------------------

This model is much easier to estimate in aML. Below is the aML script to estimate this model. Note that we need to fix the variances to 1, as required by the normalization of the probit model. For this model, we constrain the slopes to be equal across regimes. Then the treatment effect can be estimated as a difference in intercepts.
/* data set */
dsn = tvswregsim.dat;

/* specify regressor set for regime */
define regressor set alphaX; var = 1 x1 x2;

/* specify regressor set for outcome */
define regressor set betaX; var = x1;
define par mu0;
define par mu1;

define normal distribution; dim = 3;
    number of integration points = 15;
    name = v;
    name = u0;
    name = u1;

probit model;
    outcome = z;
    model = regset alphaX + res(draw=1, ref=v);

probit model; keep if z==0;
    outcome = y;
    model = par mu0 + regset betaX + res(draw=1, ref=u0);

probit model; keep if z==1;
    outcome = y;
    model = par mu1 + regset betaX + res(draw=1, ref=u1);

starting values;
Zcons   T  -1.0242318948
Zx1     T   1.0453925122
Zx2     T   1.284764978
Yx1     T   .63345352653
mu0     T   1.512191648
mu1     T   1.7127904538
sigV    F   1.0
sigU0   F   1.0
sigU1   F   1.0
rhoVU0  T   0.0
rhoVU1  T   0.0

Log Likelihood:  -840.1798                  BHHH-based, non-corrected

Parameter          Free?    Estimate           Std Err         T-statistic
 1   Zcons           T     -1.0240495122      .11971602174       -8.5540
 2   Zx1             T      1.0566987099      .1457274833         7.2512
 3   Zx2             T      1.2844424075      .14802401634        8.6773
 4   Yx1             T      0.8319107477      .27142110864        3.0650
 5   mu0             T      1.6486932406      .23151806228        7.1212
 6   mu1             T      1.1107510248      .44377621134        2.5030
 7   sigV            F      1.0
 8   sigU0           F      1.0
 9   sigU1           F      1.0
10   rhoVU0          T      .18989034371      .37980849053        0.5000
11   rhoVU1          T      .55400851566      .29657219387        1.8680
12   rhoU0U1         F      0.0

Comparison of Results.

                 swbvprob           aML
x1               0.827 (0.259)     0.832 (0.271)
constant0        1.648 (0.253)     1.649 (0.232)
constant1        1.123 (0.362)     1.111 (0.444)
rho1             0.192 (0.436)     0.190 (0.380)
rho2             0.546 (0.241)     0.554 (0.297)

Appendix

Truncated and Censored Distributions. This material is taken almost verbatim from the great book on limited dependent variables by G.S. Maddala (1983).4 Define F(y) to be the normal distribution function, which is also denoted by Φ(y). Let us also define f(y) to be the normal density function, dF(y)/dy, which is also referred to as φ(y). For a normal distribution with mean µ and variance σ², we scale the arguments of these functions, so that

F(y) → F((y − µ)/σ)   and   f(y) → (1/σ) f((y − µ)/σ).
Mean and Variance of the Truncated Normal. Suppose that Z is a standard normal variate and we consider the truncated distribution of Z given Z > b. We say that the distribution of Z is truncated from below at point b. This is depicted in the first panel of Figure 1. Assuming standardized arguments, the mean and variance of the truncated distribution are given by

E(Z | Z > b) = f(b)/[1 − F(b)] = (ordinate at Z = b)/(right-hand tail area) = M1

and

var(Z | Z > b) = 1 − M1(M1 − b).

In the original metric,

E(Y | Y > µ + bσ) = µ + σM1   and   var(Y | Y > µ + bσ) = {1 − M1(M1 − b)}σ².

Suppose instead that Z is truncated from above at point a, as depicted in the second panel of Figure 1. Then

E(Z | Z < a) = −f(a)/F(a) = M2

and

var(Z | Z < a) = 1 − M2(M2 − a).

4 G.S. Maddala (1983) Limited Dependent and Qualitative Variables in Econometrics. New York: Cambridge University Press.

Figure 1: Left Truncation (from below at b) and Right Truncation (from above at a) of a Standard Normal Distribution

The transformation back to the original metric (Y) is the same as earlier. If Y has a normal distribution with mean µ and variance σ² (instead of N(0,1)), then substitute (Y − µ)/σ, (b − µ)/σ, and (a − µ)/σ for Z, b, and a in the expressions above. It is also possible to have double truncation (from above and below).

Use of the Quantities in Modeling. These quantities are used extensively in selection and switching regression models. For example, we often need to know certain conditional expectations of the form E(u1 | u2 < c) and E(u1 | u2 > c), where u1 and u2 are distributed bivariate normal with mean 0, unit variances, and covariance σ12. Note that E(u1 | u2) = σ12 u2. Thus,

E(u1 | u2 < c) = σ12 E(u2 | u2 < c) = −σ12 f(c)/F(c)

and

E(u1 | u2 > c) = σ12 E(u2 | u2 > c) = σ12 f(c)/[1 − F(c)].

Simulation Checks. You can do a simple simulation to check the analytic expressions against some test data.
Here we find the means for various truncated distributions.

# verifying the properties of truncated distributions
# simulate normal with mean=100 std. dev=20
mu    <- 100
sigma <- 20
n     <- 30000
y <- rnorm(n, mu, sigma)

# suppose that distribution is truncated from below at B and we want E(Y|Y>B)
B <- 80
# standardize
b <- (B - mu)/sigma
# result will be in terms of standardized Y
EY.gt80 <- dnorm(b)/(1 - pnorm(b))
# scale back to original
mu + EY.gt80*sigma
# empirically
mean(y[y > B])

# suppose that the distribution is truncated above at A and we want E(Y|Y<A)
A <- 110
# standardize
a <- (A - mu)/sigma
EY.lt110 <- -dnorm(a)/pnorm(a)
mu + EY.lt110*sigma
mean(y[y < A])

Adapt this program to check the expressions for the variances against the simulated data. For one simulation from a normal distribution with mean 100 and standard deviation 20, I get E(Y | Y > 80) = 105.75 analytically, and 105.73 based on the simulated distribution, with variance 251.87 analytically and 252.49 (simulated).
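For the variance check that the text leaves as an exercise, a direct translation into Python (same setup: µ = 100, σ = 20, truncation from below at 80; simulation size and seed are mine) might look like:

```python
import math
import random

def phi(x):
    # standard normal density
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    # standard normal cdf
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

mu, sigma, B = 100.0, 20.0, 80.0
b = (B - mu) / sigma                      # standardized truncation point

# analytic: var(Y | Y > mu + b*sigma) = {1 - M1*(M1 - b)} * sigma^2
M1 = phi(b) / (1.0 - Phi(b))              # truncated mean of Z given Z > b
var_analytic = (1.0 - M1 * (M1 - b)) * sigma ** 2

# simulate and compare
random.seed(1)
draws = [random.gauss(mu, sigma) for _ in range(200000)]
kept = [y for y in draws if y > B]
m = sum(kept) / len(kept)
var_sim = sum((y - m) ** 2 for y in kept) / (len(kept) - 1)

print(round(var_analytic, 2))  # close to the 251.87 reported in the text
```

The simulated variance should land within sampling error of the analytic value, as in the R exercise.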