Treatment evaluation in the presence of sample selection

Martin Huber
University of St. Gallen, Dept. of Economics

First draft: April 2008
This version: April 2009

Abstract: Sample selection is inherent to a range of treatment evaluation problems, such as the estimation of the returns to schooling or of the effect of school vouchers on the scores of college admissions tests when some students abstain from the test in a non-random manner. Parametric and semiparametric estimators tackling selectivity typically rely on restrictive functional form assumptions that are unlikely to hold in reality. This paper proposes nonparametric weighting and matching estimators of average and quantile treatment effects that are consistent under more general forms of sample selection and incorporate effect heterogeneity with respect to observed characteristics. These estimators control for the double selection problem (i) into the observed population (e.g., working or taking the test) and (ii) into treatment by conditioning on nested propensity scores characterizing either selection probability. Weighting estimators based on parametric propensity score models are shown to be $\sqrt{n}$-consistent and asymptotically normal. Simulations suggest that the proposed methods yield decent results in scenarios where parametric estimators are inconsistent.

Keywords: treatment effects, sample selection, inverse probability weighting, propensity score matching.
JEL classification: C13, C14, C21

I have benefited from comments by Joshua D. Angrist, Eva Deuchert, Markus Frölich, Michael Lechner, Blaise Melly, Rudi Stracke, and by seminar/conference participants in St. Gallen, Engelberg, and Bern. Address for correspondence: Martin Huber, SEW, University of St. Gallen, Varnbüelstrasse 14, 9000 St. Gallen, Switzerland, martin.huber@unisg.ch.
1 Introduction

The sample selection problem, which was discussed by Gronau (1974), Heckman (1974), and Vella (1998), among many others, arises whenever the outcome of interest is only observable for some subpopulation and selection into that subpopulation is non-ignorable conditional on observed characteristics. Potential bias due to selection is an issue for a range of evaluation problems, e.g., when estimating the returns to schooling¹ based on a selective subpopulation of working individuals, or the effect of school vouchers on college admissions tests² given that some students abstain from the test in a non-random manner.

This paper discusses identification and estimation of treatment effects in the presence of sample selection, attrition, and survey non-response related to unobserved characteristics. It considers a sample selection model of rather general form in which two forms of selection appear: firstly, sample selection as discussed above, and secondly, non-random treatment assignment that is selective with respect to observed characteristics. The literature that is based on the conditional independence assumption (see for instance Lechner, 1999, and Imbens, 2004) assumes that treatment effects are identified conditional on observed characteristics jointly related to the treatment (e.g., education) and the outcome (e.g., employment). However, in the framework considered here, we also need to tackle the sample selection problem. Under certain conditions the latter can be controlled for by conditioning on the sample selection propensity score, i.e., the conditional probability of being selected into the observed population. This intuitive result was acknowledged by Angrist (1997), among others, and underlies the estimator of Ahn & Powell (1993) based on matching individuals with similar selection propensity scores. It is also used by Newey (2007) for the identification of nonparametric choice models.
It follows that treatment effects in our framework are identified when conditioning both on the sample selection propensity score and on observed confounders. This paper establishes assumptions sufficient to point identify unconditional average treatment effects (ATEs) and quantile treatment effects (QTEs) for the observed population. Our nonparametric selection model invokes considerably weaker restrictions than parametric and semiparametric specifications encountered in the sample selection literature. In particular, the model allows for effect heterogeneity with respect to the observed confounders and the sample selection propensity score. This allows identifying heterogeneous QTEs at different ranks of the outcome distribution. Furthermore, additivity between observed and unobserved terms is not imposed, in contrast to virtually all models used in empirical applications.

The main contribution of the paper is to propose nonparametric estimators which ‘kill two birds with one stone’ by controlling for selectivity bias (i) in the observed population (e.g., working or taking the test) and (ii) with respect to the treatment assignment, using a nested propensity score characterizing either selection probability. The estimators rely on inverse probability weighting (IPW) and propensity score matching, where the (first stage) sample selection propensity score is included as an additional covariate among other observed factors to compute the (second stage) propensity to receive the treatment. The estimators invoke a minimum of assumptions required for point identification in the presence of the double selection problem into the observed population and into treatment.

¹ See for instance Mulligan & Rubinstein (2008).
² See for instance Angrist, Bettinger & Kremer (2004).
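The weighting step of such a nested procedure can be sketched in a few lines. The function below is a generic normalized IPW contrast and is only an illustration: the paper's own estimators are defined in Section 4 and may differ, for instance in their normalization. The name `ipw_ate` and the argument `pi` are hypothetical; `pi` stands for second-stage scores Pr(D = 1 | X, p(W), S = 1) that are assumed to have been estimated already, with the first-stage selection score p(W) included among the regressors.

```python
def ipw_ate(y, d, pi):
    """Normalized IPW contrast of mean outcomes among selected units (S = 1).

    y  : observed outcomes
    d  : binary treatment indicators (1 = treated, 0 = control)
    pi : hypothetical second-stage scores Pr(D = 1 | X, p(W), S = 1),
         assumed to have been estimated beforehand
    """
    w1 = [di / p for di, p in zip(d, pi)]              # weights for treated units
    w0 = [(1 - di) / (1 - p) for di, p in zip(d, pi)]  # weights for control units
    mean1 = sum(w * yi for w, yi in zip(w1, y)) / sum(w1)
    mean0 = sum(w * yi for w, yi in zip(w0, y)) / sum(w0)
    return mean1 - mean0


y = [2.0, 4.0, 1.0, 3.0]
d = [1, 1, 0, 0]
print(ipw_ate(y, d, [0.5, 0.5, 0.5, 0.5]))    # equal scores: plain difference in means
print(ipw_ate(y, d, [0.25, 0.5, 0.5, 0.75]))  # unequal scores reweight both groups
```

Dividing by the sum of the weights keeps each weighted mean inside the range of the observed outcomes, which is a common choice in small samples but is not implied by the text above.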
Monte Carlo simulations suggest that even for moderate sample sizes, IPW and matching are considerably more accurate than parametric estimators with respect to bias and mean squared error when the data generating process is nonlinear. This paper provides two empirical applications. Firstly, we estimate the wage differentials between individuals with high school graduation and lower education. Selection bias stems from the fact that wages are only observed for the subpopulation of working individuals. Secondly, we check the robustness of the effects of school vouchers on test scores in a school voucher lottery. Selection bias is due to non-random test attendance related to voucher possession. $\sqrt{n}$-consistency and asymptotic normality of the IPW estimator are established in the appendix.

The remainder of the paper is organized as follows. Section 2 reviews the literature on sample selection models and highlights its relations and distinctions to the framework discussed in this paper. Section 3 introduces a general sample selection model and discusses the identifying assumptions. Section 4 presents IPW estimators for average and quantile treatment effects. Section 5 provides simulation results on the finite sample properties of IPW and matching estimators relative to parametric benchmarks. In section 6, the estimators are applied to labor market data and a school voucher lottery. Section 7 concludes.

2 Related literature

The estimation of wages constitutes a prominent example of the sample selection problem in labor economics and was first addressed by Gronau (1974). As unobserved individual characteristics are likely to affect both the probability of working and the potential wage, observed wages, i.e., potential wages conditional on working, will be correlated with the likelihood of employment, which gives rise to selectivity bias.
Heckman (1974) proposed a maximum likelihood (ML) estimator to tackle selectivity bias when covariates are linear and additive in the selection and outcome equation and unobservables are homoscedastic and bivariate Gaussian. Still for the linear and homoscedastic case, Heckman (1976, 1979) suggested a two-step estimation approach known as the heckit estimator. In the first step, the conditional mean of the selection indicator is estimated and used to correct for the selectivity bias in the second step estimation of the outcome equation. From today’s perspective, these estimators are not very appealing as they rely on overly restrictive parametric assumptions and are in general inconsistent when the unobservables’ distribution is misspecified.

The subsequent literature aims at relaxing these restrictions in various directions³. Using ML estimation, Gallant & Nychka (1987) suggest approximating the bivariate density of the unobservables in the selection and outcome equation to circumvent the joint normality assumption. Among two-step estimators, semiparametric alternatives have been proposed for first step and second step estimation, and more recently for both. For the first step, Klein & Spady (1993) suggest an estimator for binary choice models that is semiparametric in the sense that it does not put any restrictions on the distribution of the unobservable term, while the parametric form of the single index still has to be known. The estimator attains the semiparametric efficiency bound of Chamberlain (1986) and Cosslett (1987) and allows for heteroscedasticity of unknown form as long as it is related to the regressors only through the index.

³ An excellent survey on improvements in the sample selection literature is provided by Vella (1998).
Alternatives, although asymptotically less efficient, are Manski’s (1975) maximum score estimator, Horowitz’s (1992) smoothed maximum score estimator, Ichimura’s (1993) semiparametric least squares estimator, and Han’s (1987) maximum rank correlation estimator, among others. Frölich (2006) discusses nonparametric estimators of binary choice models along with their small sample properties. Results suggest that in many empirical settings, local logit estimation is likely to be more appropriate than parametric and semiparametric specifications.

For the second step, various authors suggest allowing for a nonparametric specification of the bias correction function (related to the sample selection bias) in the outcome equation. Cosslett (1991) proposes a two-step procedure where the marginal distribution function of the selection bias is approximated by a step function of J intervals in the first step. In the second step, the outcome equation is estimated by an OLS regression on the regressors and the J indicator (or dummy) variables. The estimator is consistent if J increases with the sample size. Newey (1999) suggests estimating the bias correction function by series expansion after estimating the single index semiparametrically (e.g., using the Klein and Spady estimator). He proposes GMM estimators and argues that efficiency gains can be obtained when additional orthogonality conditions implied by the independence of errors in the bias corrected outcome equation and the covariates are exploited. Pagan & Ullah (1999) discuss the optimal choice of such conditions with respect to efficiency. In contrast to Cosslett and Newey, Powell (1987) estimates the outcome equation conditional on the estimated linear single index without estimating the bias correction function itself.
Powell’s approach is closely related to Robinson’s (1988) semiparametric and partially linear model and is based on the intuition that if two observations have similar values in the first stage single index, subtracting one observation from the other eliminates sample selection bias. This allows consistent estimation of the coefficients in the outcome equation. Powell (1987) therefore suggests an estimator based on pairwise comparisons of all observations in the sample, where the contribution of each comparison is weighted by the difference in the single index values. Still, the parametric form of the single index related to the selection probability has to be known and is assumed to be linear. Ahn & Powell (1993) extend the approach of Powell (1987) to the nonparametric estimation of the index based on kernel regression.

Das, Newey & Vella (2003) suggest estimating both the selection and the outcome equation semiparametrically based on series approximation⁴. The outcome depends on a general function of the covariates and the bias correction term. The selection probability underlying the correction term is a nonparametric function and possibly dependent on multiple indices. However, additivity between the covariates’ function, the correction term, and the unobservables has to be imposed. In contrast, Newey (2007) discusses identification in nonseparable sample selection models where the outcome is an unknown (and potentially non-additive) function of covariates and unobservables. Likewise, the model presented in section 3 does not impose additivity between covariates, the selection probability, and unobservables in order to identify ATEs and QTEs⁵.

Semiparametric estimators have been used in a range of empirical studies, and the selection reviewed below is far from exhaustive. Gerfin (1996) applies various first-step estimators to German and Swiss labor market data to estimate the labor force participation of married women.
He uses a probit specification, Horowitz’s smoothed maximum score estimator, the Klein and Spady estimator, and quasi-maximum likelihood estimation as proposed by Gabler, Laisney & Lechner (1993). Specification tests reject the benchmark probit model for the German data, but not for the Swiss data. Kumar (2006) also considers (first step) estimation of female labor supply, using the 1986 wave of the Panel Study of Income Dynamics (PSID). The author is among the very few who estimate the decision to work nonparametrically. He employs kernel estimators and compares their predictive power to logit, probit, and Manski’s (1975) maximum score estimator. His findings suggest that the predictive power of nonparametric estimators (about 95%) is considerably better than that of parametric and semiparametric specifications (around 76%), which perform particularly poorly in predicting the outcome for non-participants. Newey, Powell & Walker (1990) investigate data on married women’s hours worked previously analyzed by Mroz (1987). They employ semiparametric two-step estimators using the methods of Klein & Spady (1993), Ichimura (1993), Powell (1987), and Newey (1999) and obtain results which are quite similar to parametric two-step estimation. Similarly, Melenberg & van Soest (1996) analyze vacation expenditures of Dutch families and obtain almost identical results using parametric and semiparametric estimators based on Klein and Spady first step estimation. Martins (2001) presents results on parametric and semiparametric estimation of (first step) labor force participation decisions and (second step) wage equations for Portuguese married women. Semiparametric estimation is based on Klein & Spady (1993) and Newey’s (1991) series approximation. In contrast to Newey et al. (1990) and Melenberg & van Soest (1996), specification tests indicate that the estimates obtained by parametric and semiparametric estimation are significantly different.
This is in line with Bhalotra & Sanhueza (2002), who investigate returns to schooling for women in South Africa. The coefficient estimates on schooling are considerably higher when using the semiparametric estimation proposed by Newey (1999) and Klein & Spady (1993) than for the parametric specification based on Heckman (1979). Thus, empirical evidence suggests that tight parametrization might imply inconsistent estimation.

All studies discussed so far estimate conditional mean effects. Comparably few researchers have attempted to estimate conditional quantile effects to identify effect heterogeneity at different points in the conditional outcome distribution. Using data from the US March Current Population Survey, Buchinsky (1998, 2001) estimates female wage equations applying the methods of Ichimura (1993) and Newey (1999) to the quantile regression framework⁶. The problem inherent to this literature is that independence between observed covariates and unobservables conditional on the selection probability, an assumption on which the consistency of virtually all point estimators relies, implies that conditional quantile and mean effects are of the same magnitude. Conditional quantile effects are by assumption equal to conditional mean effects, and quantile regression does not yield more or different information about the effects if independence holds. If effects are found to differ across conditional quantiles in empirical applications, this merely points to the violation of the independence assumption and to the inconsistency of the estimator. If conditional independence is likely to fail or no exclusion restriction is at hand, partial identification⁷ represents a valuable alternative to point identification.

⁴ However, in their empirical application they use a partially linear model.
⁵ Note that all methods discussed so far are concerned with the estimation of conditional (i.e., local) effects, whereas we are interested in unconditional ATEs and QTEs.
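The claim that independence forces conditional quantile and mean effects to coincide can be made explicit in a simple additive special case. The specification below (linear index $x'\beta$, correction function $\lambda$, error $\varepsilon$, selection probability $p$) is an illustrative assumption and is not taken from any of the cited papers.

```latex
% Illustrative additive model with independence given the selection probability p
Y = X'\beta + \lambda(p) + \varepsilon, \qquad \varepsilon \perp X \mid p,\ S = 1 .

% Conditional mean and conditional tau-quantile in the selected sample
E[Y \mid X = x, p, S = 1] = x'\beta + \lambda(p) + E[\varepsilon \mid p, S = 1],
\qquad
Q^{\tau}_{Y \mid X = x, p, S = 1} = x'\beta + \lambda(p) + Q^{\tau}_{\varepsilon \mid p, S = 1} .

% Neither error term depends on x, so both effects equal beta at every rank tau
\frac{\partial}{\partial x} E[Y \mid X = x, p, S = 1]
  = \frac{\partial}{\partial x} Q^{\tau}_{Y \mid X = x, p, S = 1}
  = \beta \quad \text{for all } \tau .
```

In this special case, a slope that varies with $\tau$ can only arise if the distribution of $\varepsilon$ depends on $X$ given $p$, that is, if the independence assumption fails.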
Recent empirical applications of partial treatment effect identification in the presence of sample selection are provided in Lee (2005) and Lechner & Melly (2007). Similarly to the latter, our framework is characterized by the double selection problem (i.e., selection (i) into the observed subpopulation and (ii) into treatment), but the difference is that we impose conditional independence and the availability of an exclusion restriction in order to obtain point identification. Note that sample selection poses similar problems to identification as non-ignorable sample attrition and item non-response⁸ related to unobservables, see for instance Wooldridge (2002). Thus, our estimator can also be applied to attrition and non-response related to unobservables as discussed in Fitzgerald, Gottschalk & Moffitt (1998).

To conclude this section, it is important to highlight the differences between the double selection problem considered in this paper and the control function literature promoted by Heckman & Navarro-Lozano (2004), Heckman & Vytlacil (2005), and Heckman, Urzua & Vytlacil (2006), among others, which constitutes an important strand of the evaluation literature devoted to the identification of local average treatment effects (LATEs). In the latter case, sample selection models have been applied to problems where treatment assignment is endogenous even conditional on observed covariates. The selection equation characterizes the selection into treatment based on a continuous and exogenous instrument that shifts the treatment probability but is independent of the unobservables. This allows identifying the LATE for the subpopulation of compliers, which are defined as those individuals who switch their treatment status due to a shift in the instrument⁹. Vytlacil (2002) shows that this approach is in principle equivalent to the nonparametric LATE framework advocated by Imbens & Angrist (1994). However, the problem considered in this paper is different, despite the use of sample selection models in both set-ups. After all, LATE identification is based on switching regression models where outcomes are not censored and selection bias stems from the endogenous treatment decision. In this paper, it is assumed that the treatment assignment would be ignorable conditional on the covariates if all outcomes were observed. Bias is, however, introduced by non-random selection into the observed subpopulation. Thus, a selection model along with an exclusion restriction is used to characterize the probability of being observed. Effect identification in the observed subpopulation is based on the conditional independence assumption given the covariates and the selection probability. By conditioning on the covariates and the sample selection propensity score, we are back in a quasi-random evaluation set-up which is distinct from the endogenous treatment assumption in the LATE-related control function literature.

⁶ For a discussion of quantile regression, see Koenker and Bassett (1978, 1982) and Koenker (2005).
⁷ See Manski (1989, 1994) for a general discussion of partial identification in the presence of sample selection.
⁸ Identification in the presence of attrition and non-response is discussed in Robins & Rotnitzky (1995), Robins, Rotnitzky & Zhao (1995), and Robins & Rotnitzky (1997), among others.

3 Model and identifying assumptions

In this section, we introduce a nonseparable sample selection model, where the latent outcome is an unknown function of two observed components, the treatment of interest and a vector of observed covariates, and an unobserved term. $Y^*$ denotes the latent outcome that is only partially observed, conditional on selection. Let D denote the treatment, which might be discrete or continuous, and X, U the covariates and the unobserved term, respectively.
Throughout the paper we will assume to have an i.i.d. sample of n units, indexed by i = 1, ..., n. The latent outcome equation can be written as

$$Y_i^* = \varphi(D_i, X_i, U_i), \tag{1}$$

where $\varphi(\cdot)$ is an unknown function. We observe $\{X_i, D_i\}$ for all units in the sample, whereas $Y_i^*$ is only known for some non-random subsample. We denote the observed outcome as Y, which is

$$Y_i = Y_i^* \text{ if } S_i = 1 \text{ and not observed otherwise.} \tag{2}$$

S is a binary selection indicator determined by the unknown function $\zeta(\cdot)$:

$$S_i = I\{\zeta(D_i, X_i, Z_i) \ge V_i\}. \tag{3}$$

$I\{\cdot\}$ denotes the indicator function. In this nonparametric selection equation, Z represents a one- or multidimensional instrument which is observable for all units. Theoretically, Z can be continuous, discrete, or both. In any case it has to be relevant in the sense that it contributes importantly to S such that a ceteris paribus change in Z shifts the selection probability considerably¹⁰. The result that point identification is not ruled out for a discrete Z may seem surprising and is therefore briefly discussed in appendix A.4. V is an unobservable term that is not independent of U. By assumption, S is a function of at least one element that is excluded from $\varphi$. Identification will crucially hinge on the availability of such an exclusion restriction. However, the model is fairly general and does not impose parametric restrictions such as linearity or additivity on D, X, and U.

To identify the causal effects of D, we utilize the potential outcome framework advocated by Rubin (1974), among others. In the subsequent discussion, we will focus on point identification of treatment effects for the observed population, i.e., conditional on being selected. Denote the potential outcome for individual i and some hypothetical treatment D = d as $Y_i^d = \varphi(d, X_i, U_i)$.

⁹ It therefore has to be assumed that the treatment varies monotonically with the instrument.
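To make equations (1) to (3) concrete, the following sketch draws a toy sample from one possible data generating process of this form. Everything specific in it is an assumption made for illustration: the functional forms in `phi` and `zeta`, the normal distributions, the logistic treatment rule, and the correlation `rho` between U and V are hypothetical choices and do not come from the paper.

```python
import math
import random


def phi(d, x, u):
    """Hypothetical nonseparable outcome function: U interacts with D and X."""
    return math.exp(0.4 * d + 0.3 * x + 0.5 * u)


def zeta(d, x, z):
    """Hypothetical selection index; Z enters here but not in phi (exclusion)."""
    return 0.2 + 0.3 * d + 0.3 * x + 1.0 * z


def simulate_sample(n, rho=0.8, seed=0):
    rng = random.Random(seed)
    sample = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)                                    # covariate X
        z = rng.gauss(0.0, 1.0)                                    # instrument Z
        d = 1 if rng.random() < 1.0 / (1.0 + math.exp(-x)) else 0  # D depends on X only
        v = rng.gauss(0.0, 1.0)                                    # unobservable V
        u = rho * v + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)  # U correlated with V
        y_star = phi(d, x, u)                                      # equation (1)
        s = 1 if zeta(d, x, z) >= v else 0                         # equation (3)
        y = y_star if s == 1 else None                             # equation (2)
        sample.append({"x": x, "z": z, "d": d, "s": s, "y_star": y_star, "y": y})
    return sample


sample = simulate_sample(2000)
share = sum(row["s"] for row in sample) / len(sample)
print(f"share of units with observed outcome: {share:.3f}")
```

Because U and V are correlated, the units with observed outcomes are not a random subsample, which is the situation the identifying assumptions of this section are meant to handle.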
In the labor market literature, D typically represents participation in a training program or years of schooling, whereas Y represents observed wages. We want to learn about the unconditional average and quantile treatment effects¹¹ (ATE, QTE) of D on Y by considering the differences in potential outcomes for distinct hypothetical treatments. Both ATEs and QTEs bear intuitive interpretations for policy recommendations. The ATE represents the mean effect for some population and is simply the average difference in potential outcomes for distinct treatments. The QTE is the effect evaluated at a particular point in the population’s potential outcome distribution, e.g., at the median or rank 0.75. It simply measures the horizontal distance between two potential outcome distributions for distinct treatments at the predefined rank. The quantile framework also allows evaluating inequality treatment effects (ITE), which are differences in inequality measures (e.g., the interquartile range) of the potential outcome distributions. This allows analyzing whether a treatment increases or decreases inequality, see Firpo (2007b).

To formalize the discussion, let $d', d''$ with $d'' < d'$ denote two distinct treatments¹². ATEs and QTEs in the observed population are defined as

$$\Delta = E[Y^{d'}] - E[Y^{d''}],$$
$$\Delta_{d'} = E[Y^{d'} \mid D = d'] - E[Y^{d''} \mid D = d'],$$
$$\Delta^{\tau} = Q^{\tau}_{Y^{d'}} - Q^{\tau}_{Y^{d''}},$$
$$\Delta^{\tau}_{d'} = Q^{\tau}_{Y^{d'} \mid D = d'} - Q^{\tau}_{Y^{d''} \mid D = d'}.$$

$\Delta$ denotes the ATE for the observed population, and $\Delta_{d'}$ is the average treatment effect on the treated (ATET), i.e., conditional on being observed and receiving treatment $D = d'$. Analogously, $\Delta^{\tau}$ and $\Delta^{\tau}_{d'}$ represent the QTE and the quantile treatment effect on the treated (QTET) in the observed subpopulation. $\tau \in [0, 1]$ denotes the rank in the potential outcome distribution at which the effects are evaluated. E.g., $\tau = 0.5$ yields the median effect of the treatment. The unconditional quantiles are defined as $Q^{\tau}_{Y^d} = \inf\{y : \Pr(Y^d \le y) \ge \tau\}$ and $Q^{\tau}_{Y^d \mid D = d} = \inf\{y : \Pr(Y^d \le y \mid D = d) \ge \tau\}$. In the remainder of this paper, discussion will focus on the ATE and QTE, as it is straightforward to obtain analogous results for the identification and estimation of the ATET and the QTET.

The problem inherent to any causal analysis is that only one state of the world, i.e., the realized outcome and treatment, is observed. In the sample selection model, even this state is observed only conditional on being selected. In the observed subpopulation the realized outcome is defined as $Y_i = \sum_{d \in \mathcal{D}} Y_i^d I\{D_i = d\}$ for discrete treatments and $Y_i = \int_{d \in \mathcal{D}} Y_i^d I\{D_i = d\}\,dd$ for continuous treatments, respectively, where $\mathcal{D}$ denotes the (nonnegative and finite) space of possible treatments or treatment doses, respectively. To infer the unobserved potential outcomes, any regression method needs to impose untestable identifying assumptions. The assumptions proposed in this paper are less restrictive than those in the parametric and semiparametric sample selection literature, but more restrictive than those in the treatment evaluation literature based on partial identification, as our framework allows for point identification.

¹⁰ Therefore, discrete and even binary instruments might in principle be used if changes from Z = 0 to Z = 1 largely affect the selection probability. However, powerful discrete instruments are most likely impossible to find in reality.
¹¹ By ‘unconditional’ we mean the global effects for the whole population of interest. In contrast, treatment effects for given covariates would be local, i.e., conditional on specific values of the covariates.
¹² In the binary treatment framework, e.g., participation vs. non-participation in a training program, $d' = 1$ and $d'' = 0$.
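The definitions of the ATE and the QTE can be illustrated with a toy calculation in which both potential outcomes are known for every unit, which is never the case in real data. The numbers and the helper names `quantile`, `ate`, and `qte` are hypothetical; `quantile` implements the infimum definition of the unconditional quantile given above for an empirical distribution.

```python
from statistics import fmean


def quantile(values, tau):
    """Q^tau = inf{y : F(y) >= tau} for the empirical cdf of values."""
    ordered = sorted(values)
    n = len(ordered)
    for k, y in enumerate(ordered, start=1):
        if k / n >= tau:   # empirical cdf at the k-th smallest value
            return y
    return ordered[-1]


def ate(y_treated, y_control):
    """Difference in mean potential outcomes."""
    return fmean(y_treated) - fmean(y_control)


def qte(y_treated, y_control, tau):
    """Horizontal distance between the two outcome distributions at rank tau."""
    return quantile(y_treated, tau) - quantile(y_control, tau)


# Hypothetical potential outcomes of four units under d' and d''.
y_d1 = [3.0, 5.0, 8.0, 12.0]
y_d0 = [2.0, 3.0, 4.0, 5.0]

print(ate(y_d1, y_d0))        # mean effect
print(qte(y_d1, y_d0, 0.5))   # median effect
print(qte(y_d1, y_d0, 0.75))  # effect at rank 0.75
```

In this toy example the effect grows with the rank, so the mean effect alone would hide that the treatment widens the spread of outcomes.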
Briefly speaking, identification is based on three key assumptions: (i) conditional independence between potential outcomes and treatments in the total population, (ii) the availability of an exclusion restriction to identify the selection probability into the observed population, and (iii) conditional independence of observables and unobservables given the sample selection propensity score.

Assumption 1: Conditional independence in the total population.
(1a) $Y^{*d'}, Y^{*d''} \perp D \mid X = x$, $\forall x \in \mathcal{X}$ (conditional independence of the latent outcome),
(1b) $0 < \Pr(D = d \mid X) < 1$ $\forall d \in \mathcal{D}$ (common support of D in X),
(1c) stable unit treatment value assumption (SUTVA).

The conditional independence assumption (CIA) or selection on observables assumption is frequently imposed in the treatment evaluation literature, see for instance Heckman, Ichimura & Todd (1997), Lechner (1999), and Wunsch & Lechner (2008). (1a) states that the potential latent outcome is independent of the treatment given the observed covariates X¹³. This implies that all factors jointly affecting the treatment assignment and the latent outcome can be controlled for by conditioning on the covariates. The difference to conventional evaluation studies relying on the CIA is that the outcome is not fully observed. (1b) is the classical common support assumption and states that selection into treatment must not be perfectly predicted by the covariates. (1c) states that the potential outcome for any unit i is stable in the sense that it always takes the same value, independent of treatment allocations in the rest of the population, see Rubin (1990) for further details.

¹³ In appendix A.4, we will briefly discuss identification under random treatment assignment, as in randomized experiments and lotteries, such that one need not condition on X.
Assumption 1 implies that

$$E[Y^{*d'} \mid D = d'', X = x] = E[Y^{*d'} \mid D = d', X = x] = E[Y^* \mid D = d', X = x],$$
$$E[Y^{*d''} \mid D = d', X = x] = E[Y^{*d''} \mid D = d'', X = x] = E[Y^* \mid D = d'', X = x].$$

Thus, the ATE for the total population conditional on X is $\Delta^*(x) = E[Y^{*d'} \mid X = x] - E[Y^{*d''} \mid X = x] = E[Y^* \mid D = d', X = x] - E[Y^* \mid D = d'', X = x]$. If the CIA holds, the effect of D on $Y^*$ could be identified if $Y^*$ were fully observed. As this is not the case, we concentrate on the observed outcome Y. To ease notation, we define $E[Y \mid D = d, X = x] = E[Y^* \mid D = d, X = x, S = 1]$. If selection is ignorable conditional on X, then $E[Y \mid D = d, X = x] = E[Y^* \mid D = d, X = x]$ as $\int E[Y \mid D = d, X = x, U]\,dF_U = \int E[Y^* \mid D = d, X = x, U]\,dF_U$, where F denotes the cdf. This immediately implies that the treatment effect conditional on X and S = 1 is identified by $\Delta(x) = E[Y^{d'} \mid X = x] - E[Y^{d''} \mid X = x]$. By integration over X we obtain the ATE in the observed population, $\Delta = \int \Delta(x)\,dF_{X \mid S = 1}$ ¹⁴.

If the unobservables V and U are not independent conditional on X, the effect of D on Y is confounded in the observed sample. Point identification requires the availability of an instrument Z that predicts selection S but is not related to the potential outcomes conditional on D, X. We therefore make the following assumption.

Assumption 2: Exclusion restriction.
(2a) $\mathrm{Cov}(Z, S \mid X, D) \neq 0$ and $Y^* \perp Z \mid D, X$ (exclusion restriction),
(2b) $0 < \Pr(S = 1 \mid D = d) < 1$, $\forall d \in \mathcal{D}$ (common support of S in D),
(2c) $(U, V) \perp (D, X, Z) \mid \Pr(S = 1 \mid D, X, Z)$ (conditional independence of unobservables and observables given the selection probability),
(2d) $F_V(t)$, the cdf of V, is strictly monotonic in the argument t.

Assumption (2a) states that Z shifts S but is independent of the latent outcome given D, X. Therefore, direct effects of Z on $Y^*$ are ruled out¹⁵. Together with assumption 1, this implies that $F_{(Y^* \mid D = d, X = x)} = F_{(Y^* \mid D = d, X = x, Z)}$ for all values of Z, where $F_{(\cdot \mid \cdot)}$ denotes the conditional cdf.
(2b) rules out that the treatment is a perfect predictor for sample selection. To see the usefulness of this assumption, consider the case that latent realizations with D = d are never selected whatever values X, Z take. Obviously, treatment d cannot be evaluated in the observed population and neither can any counterfactual population be defined upon d. Likewise, perfect positive selection will cause identification problems, as discussed further below.

By (2c), we impose that D, X, Z are jointly independent of the unobservables U, V conditional on the sample selection propensity score. Even though conditional heteroscedasticity of unknown form is still allowed for in this framework, any dependence between observables and unobservables is restricted to be captured by the sample selection propensity score. (2c) is for instance violated if U is related to D in the total population. In this case, the selection bias cannot be controlled for by conditioning on $\Pr(S = 1 \mid D, X, Z)$, as unobserved interaction terms of U and D drive the selection probability¹⁶. Assumptions similar to (2c) are crucial for point identification in any selection model of both parametric and general form. Its violation implies the inconsistency of virtually all point estimators proposed in the literature. Nevertheless, (2c) is considerably weaker than most analogous assumptions made in the literature, as it does not impose parametric restrictions on $\zeta$.

By monotonicity assumption (2d) it holds that $\Pr(S = 1 \mid D, X, Z) = \Pr(\zeta(D, X, Z) \ge V) = F_V(\zeta(D, X, Z))$. Thus, the likelihood to be observed increases monotonically in $\zeta$.

¹⁴ The ATE for the total population is identified, too, given that $f_X$, the pdf of X, is observed for S = 0 and that there is common support in $f_X$ and $f_{X \mid S = 1}$. Then, $\Delta^* = \int \Delta(x)\,dF_X$.
¹⁵ A test for the validity of exclusion restrictions related to discrete instruments was recently proposed by Kitagawa (2008).
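The relation $\Pr(S = 1 \mid D, X, Z) = F_V(\zeta(D, X, Z))$ can be checked numerically for a fixed value of the index. The sketch below assumes, purely for illustration, that V is standard normal, so that $F_V$ is the standard normal cdf; the function name `selection_share` and the chosen index values are hypothetical.

```python
import random
from statistics import NormalDist


def selection_share(zeta_value, n=100_000, seed=1):
    """Simulated share of units with S = 1, i.e. with zeta >= V, for V ~ N(0, 1)."""
    rng = random.Random(seed)
    selected = sum(1 for _ in range(n) if zeta_value >= rng.gauss(0.0, 1.0))
    return selected / n


for zeta_value in (-1.0, 0.0, 0.7):
    simulated = selection_share(zeta_value)
    implied = NormalDist().cdf(zeta_value)   # F_V evaluated at the index
    print(f"zeta = {zeta_value:+.1f}: simulated {simulated:.3f}, F_V(zeta) {implied:.3f}")
```

The simulated shares rise with the index value, in line with the monotonicity statement in (2d).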
Note that monotonicity is implicitly assumed in any linear index restriction frequently used in the sample selection literature.

For notational ease, let $W \equiv (D, X, Z)$ and $\Pr(S = 1 | D, X, Z) \equiv p(W)$. If (2c) and (2d) hold, $U$ and $D$ are independent conditional on $p(W)$ in the observed population. This can be shown by applying the proof of theorem 1 in Newey (2007). Let $a(U)$ denote any bounded function of $U$. Note that $\{S = 1\} = \{F_V^{-1}(p(W)) \geq V\}$. Then,

$$E[a(U) | D, p(W), S = 1] = E\big[E[a(U) | V, D, X, Z] \,\big|\, D, p(W), F_V^{-1}(p(W)) \geq V\big]$$
$$= E\big[E[a(U) | V] \,\big|\, D, p(W), F_V^{-1}(p(W)) \geq V\big]$$
$$= E\big[E[a(U) | V] \,\big|\, p(W), F_V^{-1}(p(W)) \geq V\big]$$
$$= E\big[E[a(U) | V, p(W)] \,\big|\, p(W), S = 1\big] = E[a(U) | p(W), S = 1].$$

If assumptions (1) and (2) are satisfied, selection bias in the observed population can be corrected for by conditioning on $p(W)$. Thus, the identification of treatment effects requires the inclusion of the sample selection propensity score as an additional conditioning variable. To see this, note that the treatment effect in the observed population given $X$ and $p(W)$ is defined as

$$\Delta(x, p(w)) = \int \varphi(d', x, p(w), U)\,dF_{U|X=x,p(W)=p(w),S=1} - \int \varphi(d'', x, p(w), U)\,dF_{U|X=x,p(W)=p(w),S=1}$$
$$= E[Y^{d'} | X = x, p(W) = p(w)] - E[Y^{d''} | X = x, p(W) = p(w)].$$

$E[Y^{d'} | X = x, p(W) = p(w)]$ is the expected potential outcome for a hypothetical treatment $d'$ given $X$ and $p(W)$. By the independence of $U$ and $D$ given $p(W)$ implied by (2c) and (2d), it holds that

$$E[Y^{d'} | X = x, p(W) = p(w)] = \int \varphi(d', x, p(w), U)\,dF_{U|X=x,p(W)=p(w),S=1}$$
$$= \int \varphi(d', x, p(w), U)\,dF_{U|D=d',X=x,p(W)=p(w),S=1} = E[Y | D = d', X = x, p(W) = p(w)].$$

16 Huber & Melly (2008) provide a more detailed discussion of this issue in a semiparametric framework.

Hence, the expected potential outcome is equal to the expected conditional outcome given $D = d'$.
The same applies to $E[Y^{d''} | X = x, p(W) = p(w)]$, so that $E[Y | D = d', X = x, p(W) = p(w)] - E[Y | D = d'', X = x, p(W) = p(w)] = \Delta(x, p(w))$ and

$$E[Y^{d'} | D = d'', X = x, p(W) = p(w)] = E[Y^{d'} | D = d', X = x, p(W) = p(w)] = E[Y | D = d', X = x, p(W) = p(w)],$$
$$E[Y^{d''} | D = d', X = x, p(W) = p(w)] = E[Y^{d''} | D = d'', X = x, p(W) = p(w)] = E[Y | D = d'', X = x, p(W) = p(w)].$$

The ATE $\Delta$ is identified by integrating over the marginal distributions of $X$ and $p(W)$:

$$\int\!\!\int \big[E[Y | D = d', X = x, p(W) = p(w)] - E[Y | D = d'', X = x, p(W) = p(w)]\big]\,dF_{X|p(W)=p(w),S=1}\,dF_{p(W)|S=1}$$
$$= \int\!\!\int \big[E[Y^{d'} | X = x, p(W) = p(w)] - E[Y^{d''} | X = x, p(W) = p(w)]\big]\,dF_{X|p(W)=p(w),S=1}\,dF_{p(W)|S=1}$$
$$= E[Y^{d'}] - E[Y^{d''}] = \Delta. \qquad (4)$$

Identification of QTEs is analogous, but requires that the conditional quantiles of interest are unique. I.e., the density in the neighborhood of the quantiles must be bounded away from zero such that each quantile corresponds to exactly one particular rank in the conditional distribution. Secondly, for an intuitive interpretation of QTEs, the rank stability assumption has to be satisfied across treatments. It states that individuals occupy the same rank in the respective conditional outcome distribution for different treatments, see for instance Firpo (2007a) for further discussion.

Let $Q^{\tau}_{A}$ denote the quantile at rank $\tau \in [0, 1]$ for some variable $A$, $Q^{\tau}_{A} = \inf\{a : F_A(a) \geq \tau\}$. Then, $F_A^{-1}(\tau) = Q^{\tau}_{A}$, i.e. the $\tau$th quantile of $A$ is the inverse of its cdf evaluated at $\tau$. Let $Q^{\tau}_{Y^{d'}}(x, p(w))$ denote the $\tau$th conditional quantile of the potential outcome $Y^{d'}$ given $X = x$, $p(W) = p(w)$, and $S = 1$. By assumption 2,

$$F_{Y|D,X,p(W)}(y | d', x, p(w)) = \int I\{\varphi(d', x, p(w), U) \leq y\}\,dF_{U|D=d',X=x,p(W)=p(w),S=1}$$
$$= \int I\{\varphi(d', x, p(w), U) \leq y\}\,dF_{U|X=x,p(W)=p(w),S=1} = Q^{\tau\,-1}_{Y^{d'}}(x, p(w)),$$

where $Q^{\tau\,-1}$ denotes the inverse of the quantile function, i.e. the cdf of the potential outcome. The unconditional quantile of the potential outcome is identified as the inverse of the integration over the marginal distributions of $X$ and $p(W)$.
$$\int\!\!\int Q^{\tau\,-1}_{Y^{d'}}(x, p(w))\,dF_{X|p(W)=p(w),S=1}\,dF_{p(W)|S=1} = Q^{\tau\,-1}_{Y^{d'}}. \qquad (5)$$

The difference between the quantiles with distinct treatments yields the QTE, $\Delta^{\tau} = Q^{\tau}_{Y^{d'}} - Q^{\tau}_{Y^{d''}}$.

Identification of $\Delta, \Delta^{\tau}$ in the observed population hinges on common support of the treatment in $X$ and $p(W)$. We therefore make a further assumption:

Assumption 3: Common support in the selected sample.
(3a) $c < \Pr(D = d | X, p(W), S = 1) < 1 - c$ $\forall d \in \mathcal{D}$, $c > 0$ (common support of $D$ in $X$ and $p(W)$).

(3) implies that the treatment probability is bounded away from zero in the observed population conditional on the selection probability and observed covariates. It is obvious that (2b) is a necessary condition for (3) to hold. To see this point, consider the case that (2b) is violated by assuming that all individuals receiving treatment $D = d$ are selected, i.e. $D = d$ implies $p(W) = 1$, independent of $X, Z$. Furthermore, let $p(W) < 1$ for any $D \neq d$. It follows that $\Pr(D = d | X = x, p(W) = p(w)) = \Pr(D = d | X = x, p(W) = 1) = 1$ $\forall x \in \mathcal{X}$, such that $p(W) = 1$ perfectly predicts $D$ and the common support assumption fails.

At this point, let us assume that (2b) and (3) are satisfied and consider the special case that there exist some observations with $p(W) = 1$. I.e., even though $0 < \Pr(S = 1 | D = d) < 1$ holds, $\Pr(S = 1 | W) = 1$ for some triple(s) $w = (d, x, z)$. Obviously, selection bias is not an issue for those observations and it follows that $E[Y | D = d, X = x, p(W) = 1] = E[Y^{*} | D = d, X = x, p(W) = 1]$. This allows identifying local treatment effects for the subpopulation with $p(W) = 1$. It remains a priori unclear why this particular population should be of any policy interest. However, if one is willing to impose the strong restriction of treatment effect homogeneity across selection probabilities, i.e. $\Delta(x, p(w)) = \Delta(x)$ $\forall p \in \mathcal{P}$, treatment effects can be identified for other populations as well if there is common support in $X$.
For instance, the ATE for the observed population is $\Delta = \int \Delta(x | p(W) = 1)\,dF_{X|S=1}$ for sufficient overlap in $f_{X|p(W)=1}$ and $f_{X|S=1}$. Identification based on $p(W) = 1$ is known as 'identification at infinity' and was discussed by Heckman (1990) and Andrews & Schafgans (1998). However, in empirical applications, observations with selection probabilities close to one might be rare. Furthermore, effect homogeneity in $p(W)$ is a strong assumption that might not hold in reality. We therefore concentrate on a more general identification strategy using the whole distribution of $p(W)$.

After having established the identifying assumptions, we will now propose expressions for $\Delta, \Delta^{\tau}$ based on inverse probability weighting, which can be used to build the sample analogues required for estimation. Let $\pi_d(X, p(W))$ denote the treatment propensity score, i.e., the probability of receiving treatment $D = d$ conditional on $X$ and $p(W)$, $\pi_d(X, p(W)) \equiv \Pr(D = d | X, p(W))$.¹⁷ To control for selection into treatment, we will henceforth condition on $\pi_d(X, p(W))$ instead of $X$ and $p(W)$. Rosenbaum & Rubin (1983) have shown that conditioning on the treatment propensity score is equivalent to conditioning on the covariates directly, as both are balancing scores in the sense that they adjust the distributions of covariates in the groups of treated and controls. However, conditioning on $\pi_d(X, p(W))$ has the advantage that practical problems related to nonparametric estimation using high dimensional covariates, e.g., empty cells for particular combinations of covariate values, can be circumvented.

17 For a binary treatment, the treatment propensity score is $\pi_1(X, p(W))$ and the nontreatment propensity score is $\pi_0(X, p(W)) = 1 - \pi_1(X, p(W))$.

PROPOSITION 1 (Identification of mean effects).
Under assumptions 1, 2, and 3, the ATE in the observed subpopulation for two treatments $d' \neq d''$ is identified by

$$\Delta = E\left[\frac{S \cdot I\{D = d'\} \cdot Y^{*}}{p(W) \cdot \pi_{d'}(X, p(W))}\right] - E\left[\frac{S \cdot I\{D = d''\} \cdot Y^{*}}{p(W) \cdot \pi_{d''}(X, p(W))}\right]$$
$$= E\left[\frac{I\{D = d'\} \cdot Y}{\pi_{d'}(X, p(W))}\right] - E\left[\frac{I\{D = d''\} \cdot Y}{\pi_{d''}(X, p(W))}\right]. \qquad (6)$$

Proof: See appendix A.1.

The ATE is obtained by reweighting the outcome of each individual in the observed population by the inverse of the conditional treatment probability given $X$ and $p(W)$. Similar results are obtained for QTEs, as both parameters are functions of the distribution of $Y$.

PROPOSITION 2 (Identification of quantiles). Under assumptions 1, 2, and 3, $Q^{\tau}_{Y^{d}}$ is an implicit function of

$$E\left[\frac{S \cdot I\{D = d\}}{p(W) \cdot \pi_d(X, p(W))} \cdot I\{Y^{*} \leq Q^{\tau}_{Y^{d}}\}\right] = E\left[\frac{I\{D = d\}}{\pi_d(X, p(W))} \cdot I\{Y \leq Q^{\tau}_{Y^{d}}\}\right] = F_{Y^{d}}(Q^{\tau}_{Y^{d}}) = \tau. \qquad (7)$$

Proof: See appendix A.2.

It follows that

$$Q^{\tau}_{Y^{d}} = \arg\operatorname{zero}_y\, E\left[\frac{I\{D = d\}}{\pi_d(X, p(W))} \cdot I\{Y < y\} - \tau\right],$$

which is a first order condition to

$$Q^{\tau}_{Y^{d}} = \arg\min_y\, E\left[\frac{I\{D = d\}}{\pi_d(X, p(W))} \cdot \rho_{\tau}(Y - y)\right]. \qquad (8)$$

$\rho_{\tau}(\cdot)$ is the check function, an asymmetric loss function suggested by Koenker & Bassett (1978) for quantile estimation, $\rho_{\tau}(u) = u \cdot (\tau - I\{u < 0\})$. It follows that $\Delta^{\tau} = Q^{\tau}_{Y^{d'}} - Q^{\tau}_{Y^{d''}}$.

Expressions (6) and (8) are quite similar to the identification results obtained by Hirano, Imbens & Ridder (2003)¹⁸ and Firpo (2007a), respectively. The difference is, however, that the latter assume unconfoundedness of the treatment effect conditional on the treatment propensity score with respect to $X$ alone, whereas we have to condition on both $X$ and $p(W)$ to control for selection bias into the observed population and into treatment. We therefore extend the approach of Hirano et al. (2003) and Firpo (2007a) to the case of sample selection by including the selection probability as an additional covariate in the treatment propensity score.

18 The IPW estimator analyzed by Hirano et al. (2003) was first proposed by Horvitz & Thompson (1952).
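The intuition behind the second line of (6) can be sketched with the law of iterated expectations. The following is our own short heuristic, not the proof in appendix A.1; all expectations are taken in the observed population ($S = 1$), and the last step uses the result derived above that the expected potential outcome given $X$ and $p(W)$ equals the expected observed outcome given $D = d$, $X$, and $p(W)$:

$$E\left[\frac{I\{D = d\} \cdot Y}{\pi_d(X, p(W))}\right] = E\left[\frac{E[I\{D = d\} \cdot Y \,|\, X, p(W)]}{\pi_d(X, p(W))}\right] = E\left[\frac{\pi_d(X, p(W)) \cdot E[Y \,|\, D = d, X, p(W)]}{\pi_d(X, p(W))}\right]$$
$$= E\big[E[Y \,|\, D = d, X, p(W)]\big] = E\big[E[Y^{d} \,|\, X, p(W)]\big] = E[Y^{d}].$$

Replacing $Y$ by the indicator $I\{Y \leq y\}$ in the same chain gives $E\left[\frac{I\{D = d\}}{\pi_d(X, p(W))} \cdot I\{Y \leq y\}\right] = F_{Y^{d}}(y)$, which equals $\tau$ at $y = Q^{\tau}_{Y^{d}}$ as stated in (7).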
4 Estimation

Both $p(W)$ and $\pi_d(X, p(W))$ are unknown to the researcher and have to be estimated in order to be used in the weighting functions of the estimators of $\Delta, \Delta^{\tau}$. Let $\hat{p}(W)$, $\hat{\pi}_d(X, \hat{p}(W))$, $\hat{\Delta}$, $\hat{\Delta}^{\tau}$ denote the estimates of the respective true parameters. Our estimation procedure can be described as follows:

1) estimate $\hat{p}(W)$ by regressing $S$ on $D, X, Z$,
2) estimate $\hat{\pi}_d(X, \hat{p}(W))$ by regressing $D$ on $X$ and $\hat{p}(W)$,
3) estimate $\hat{\Delta}, \hat{\Delta}^{\tau}$ by the sample analogues of (6) and (8).

The sample selection propensity score $p(W)$ may be estimated by parametric (e.g., logit or probit), semiparametric (e.g., Klein and Spady, 1993, Ichimura, 1993), or nonparametric estimators. The latter seem attractive if the structural form of $p(W)$ is not known (which is usually the case) and the dimension of the continuous elements in $W$ is not too high. Hirano et al. (2003) suggest estimating $p(W)$ by a logistic power series approximation. I.e., they use a series of functions of $W$ to approximate the log-odds ratio of the selection probability. Another class of nonparametric estimators are kernel methods such as local constant (Nadaraya-Watson) regression, see Ahn & Powell (1993) and Kumar (2006), or local logit.¹⁹ All these methods are conditional mean estimators, but as pointed out by Li, Racine & Wooldridge (2009), conditional probability estimators may also be used when dealing with binary outcomes. This is obvious from the fact that

$$E[S | W = w] = \Pr(S = 1 | W = w) = \frac{f_{S,W}(1, w)}{f_W(w)} = p(w), \qquad (9)$$

where $f(\cdot)$ denotes the pdf. An estimator of the sample selection propensity score is $\frac{\hat{f}_{S,W}(1, w)}{\hat{f}_W(w)}$, where $\hat{f}(\cdot)$ is the estimated pdf. Following Hall, Racine & Li (2004), $\frac{f_{S,W}(s, w)}{f_W(w)}$ can be consistently estimated by $\hat{f}_{S,W}(s, w) = n^{-1} \sum_{i=1}^{n} \kappa(w, W_i, h_n)\,\Lambda(s, S_i, h_n)$ and $\hat{f}_W(w) = n^{-1} \sum_{i=1}^{n} \kappa(w, W_i, h_n)$. $\kappa(\cdot)$ and $\Lambda(\cdot)$ denote generalized kernel functions related to continuous and discrete variables, see Hall et al. (2004) for more details.
$h_n$ denotes the vector of bandwidths for the continuous and discrete elements in $W$ and $S$, respectively, and might be determined by least squares or maximum likelihood cross validation.

19 Frölich (2001) investigates the finite sample properties of (global) logit, local constant, local linear, and local logit estimators for a binary outcome, 4 continuous covariates, and 10 binary regressors. Local logit appears to be substantially more appropriate than (global) logit whenever the model specification is not encompassed by the logit model, whereas local constant and local linear estimation perform worse than logit in the specifications considered. In line with these results, Monte Carlo evidence in Frölich (2006) points to the superiority of local logit compared to Klein and Spady and local constant estimation, at least for the data generating processes considered.

Our framework explicitly allows for multiple treatments as discussed in Imbens (2000) and Lechner (2001) or different treatment doses of a continuous treatment as considered by Hirano & Imbens (2004). Let us assume that there is a finite set of discrete treatment choices, $\mathcal{D} \equiv \{0, 1, \ldots, G\}$ and $G < \infty$. The propensity scores $\hat{\pi}_d(X, \hat{p}(W))$ for all $d \in \mathcal{D}$ might be estimated simultaneously by multinomial probit or logit.²⁰ Alternatively, Lechner (2001) suggests splitting estimation into a series of binomial models, where the propensity score of each treatment relative to every other treatment is estimated by several binary choice models. This procedure is computationally less costly than multinomial probit and also more robust, as a misspecification of one choice model does not spill over to all other specifications. Thus, the methods used for the estimation of $p(\cdot)$ might also be used for the estimation of $\pi_d(\cdot)$.
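Steps 1 and 2 of the estimation procedure can be sketched as follows for a binary treatment. This is a minimal illustration under our own assumptions, not the paper's implementation: a logit fitted by Newton-Raphson stands in for the probit used in the paper's simulations, the second stage is fitted on the selected sample only (in line with the conditioning on $S = 1$ in assumption 3), and the simulated data with their coefficients are hypothetical. The names `fit_logit` and `nested_propensity_scores` are ours.

```python
import numpy as np

def fit_logit(features, y, n_iter=25):
    """Logit fitted by Newton-Raphson; returns fitted probabilities.
    Assumption: a logit stands in for the probit used in the paper."""
    design = np.column_stack([np.ones(len(y)), features])  # intercept + regressors
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-design @ beta))
        score = design.T @ (y - prob)
        hessian = (design * (prob * (1.0 - prob))[:, None]).T @ design
        beta = beta + np.linalg.solve(hessian, score)
    return 1.0 / (1.0 + np.exp(-design @ beta))

def nested_propensity_scores(S, D, X, Z):
    """Step 1: p_hat(W) from S on (D, X, Z), using all observations.
    Step 2: pi1_hat(X, p_hat(W)) from D on (X, p_hat(W)), selected sample only."""
    p_hat = fit_logit(np.column_stack([D, X, Z]), S)
    sel = S == 1
    pi1_hat = fit_logit(np.column_stack([X[sel], p_hat[sel]]), D[sel])
    return p_hat, pi1_hat, sel

# Hypothetical simulated data: Z shifts selection but not treatment.
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=n)
Z = rng.normal(size=n)
D = (0.5 * X + rng.normal(size=n) > 0).astype(float)
S = (0.25 * D + 0.25 * X + 0.5 * Z + rng.normal(size=n) > 0).astype(float)

p_hat, pi1_hat, sel = nested_propensity_scores(S, D, X, Z)
```

The nontreatment score follows as `1 - pi1_hat`. Any of the semiparametric or kernel estimators mentioned above could replace `fit_logit` in either stage without changing the nesting.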
Finally, we use the sample analogue of expression (6) to estimate the ATE for $d' > d''$ by

$$\hat{\Delta} = \frac{1}{\sum_{j=1}^{n} S_j} \sum_{i: S_i = 1} \frac{I\{D_i = d'\} \cdot Y_i}{\hat{\pi}_{d'}(X_i, \hat{p}(W_i))} - \frac{1}{\sum_{j=1}^{n} S_j} \sum_{i: S_i = 1} \frac{I\{D_i = d''\} \cdot Y_i}{\hat{\pi}_{d''}(X_i, \hat{p}(W_i))} \qquad (10)$$
$$= \sum_{i: S_i = 1} \hat{\omega}_{d',i} \cdot Y_i - \sum_{i: S_i = 1} \hat{\omega}_{d'',i} \cdot Y_i = \sum_{i: S_i = 1} \big[(\hat{\omega}_{d',i} - \hat{\omega}_{d'',i}) \cdot Y_i\big],$$

where the weighting function $\hat{\omega}_{d,i}$ is defined as

$$\hat{\omega}_{d,i} = \frac{1}{\sum_{j=1}^{n} S_j} \cdot \frac{I\{D_i = d\}}{\hat{\pi}_d(X_i, \hat{p}(W_i))}.$$

Similarly, the QTE estimator is $\hat{\Delta}^{\tau} = \hat{Q}^{\tau}_{Y^{d'}} - \hat{Q}^{\tau}_{Y^{d''}}$, where

$$\hat{Q}^{\tau}_{Y^{d}} = \arg\min_y \sum_{i: S_i = 1} \hat{\omega}_{d,i} \cdot \rho_{\tau}(Y_i - y). \qquad (11)$$

Thus, $\hat{\Delta}^{\tau}$ can be written as

$$\hat{\Delta}^{\tau} = \arg\min_y \frac{1}{\sum_{j=1}^{n} S_j} \sum_{i: S_i = 1} \frac{I\{D_i = d'\}}{\hat{\pi}_{d'}(X_i, \hat{p}(W_i))} \cdot \rho_{\tau}(Y_i - y) - \arg\min_y \frac{1}{\sum_{j=1}^{n} S_j} \sum_{i: S_i = 1} \frac{I\{D_i = d''\}}{\hat{\pi}_{d''}(X_i, \hat{p}(W_i))} \cdot \rho_{\tau}(Y_i - y)$$
$$= \arg\min_y \sum_{i: S_i = 1} \hat{\omega}_{d',i} \cdot \rho_{\tau}(Y_i - y) - \arg\min_y \sum_{i: S_i = 1} \hat{\omega}_{d'',i} \cdot \rho_{\tau}(Y_i - y). \qquad (12)$$

Again, (10) and (12) look similar to the estimators discussed in Hirano et al. (2003) and Firpo (2007a), for which $\sqrt{n}$-consistency, asymptotic normality, and semiparametric efficiency were shown when using a nonparametrically estimated propensity score. The major difference is that here weighting is based on a nested propensity score that also accounts for the selection into the observed sample. Using a GMM framework, appendix A.3 establishes $\sqrt{n}$-consistency and asymptotic normality of the proposed estimators based on parametric propensity score estimation.

20 Caliendo & Kopeinig (2008) argue that multinomial probit is preferable as it relies on less restrictive assumptions.

Our estimation procedure includes the trimming function $\theta(n)$ that trims out $\hat{\pi}_d(X_i, \hat{p}(W_i))$ which are close to the boundaries 0 and 1. I.e., estimation in (10) and (12) is based on $\hat{\pi}^{\theta}_d(X_i, \hat{p}(W_i)) \equiv \max(\theta(n), \min(1 - \theta(n), \hat{\pi}_d(X_i, \hat{p}(W_i))))$, where $\theta(n)$ is some 'small' number that decreases in the sample size $n$.
$\theta(n)$ guarantees that no observation obtains an arbitrarily large or small weight due to a propensity score estimate close to the boundary, as this could seriously deteriorate the appropriateness of IPW methods in finite samples, see Khan & Tamer (2007) and Busso, DiNardo & McCrary (2008). The estimator remains consistent because $\theta(n) \to 0$ as $n \to \infty$.

Propensity score matching may be used as an alternative method to IPW as both methods rely on the same identifying assumptions, see Lechner (2007). A third possibility consists of estimating the conditional outcomes for various treatments locally, e.g., by local linear kernel regression, and integrating over the distribution of $\hat{\pi}_{d'}(X, \hat{p}(W))$ to identify the unconditional effects. In the selection on observables framework without sample selection, Heckman, Ichimura & Todd (1998) use this approach to estimate ATEs whereas Melly (2006) estimates counterfactual distributions required for QTE estimation.

All these methods allow for effect heterogeneity in $X$ and $p(W)$ and thus, for heterogeneous QTEs across ranks $\tau$. In contrast, parametric and most semiparametric methods impose effect homogeneity in $X$ and $p(W)$, and thus, $\tau$, by making restrictive linearity and additivity assumptions on the treatment, the covariates, and the bias correction term:

$$Y_i^{*} = \alpha D_i + X_i'\beta + U_i, \qquad S_i = W_i'\delta + V_i,$$
$$E[Y | D, X, p(W)] = \alpha D_i + X_i'\beta + \lambda(p(W_i)), \qquad \lambda(p(W)) = E[U | D, X, V > -W'\delta].$$

$\alpha$ denotes the treatment coefficient, $\beta, \delta$ the coefficients on $X$ and $W$, respectively, and $\lambda(p(W))$ the bias correction term. Hence, nonparametric estimators may also be used to construct tests for homogeneous effects in $X, p(W)$ by verifying whether QTEs are constant across $\tau$. If QTE estimates differ significantly at different points of the outcome distribution, parametric methods are inconsistent and we should therefore rely on nonparametric estimators imposing fewer functional form assumptions.²¹
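The sample analogues (10) and (11), together with the trimming rule, can be sketched as follows for a binary treatment. This is a minimal illustration rather than the paper's code: the arrays are assumed to contain the selected observations ($S_i = 1$) only, so that `len(D)` plays the role of $\sum_j S_j$, and the function names are ours. The toy data at the end are hypothetical, contain no sample selection, and use the true treatment probability as a stand-in for $\hat{\pi}_1(X_i, \hat{p}(W_i))$; they only check the arithmetic.

```python
import numpy as np

def trim(pi_hat, theta):
    """Bound scores away from 0 and 1: max(theta, min(1 - theta, pi_hat))."""
    return np.clip(pi_hat, theta, 1.0 - theta)

def ipw_weights(D, pi1_hat, d, theta=0.05):
    """Weights omega_hat_{d,i} of (10), computed on the selected sample."""
    pi_d = trim(pi1_hat if d == 1 else 1.0 - pi1_hat, theta)
    return (D == d) / pi_d / len(D)

def ipw_ate(Y, D, pi1_hat, theta=0.05):
    """Sample analogue of (6): sum_i (omega_hat_{1,i} - omega_hat_{0,i}) * Y_i."""
    w1 = ipw_weights(D, pi1_hat, 1, theta)
    w0 = ipw_weights(D, pi1_hat, 0, theta)
    return np.sum((w1 - w0) * Y)

def weighted_quantile(Y, w, tau):
    """A minimiser of sum_i w_i * rho_tau(Y_i - y), as in (11):
    the smallest Y_i whose normalised cumulative weight reaches tau."""
    order = np.argsort(Y)
    cdf = np.cumsum(w[order]) / np.sum(w)
    return Y[order][np.searchsorted(cdf, tau)]

def ipw_qte(Y, D, pi1_hat, tau, theta=0.05):
    """Difference of the two weighted quantiles, as in (12)."""
    q1 = weighted_quantile(Y, ipw_weights(D, pi1_hat, 1, theta), tau)
    q0 = weighted_quantile(Y, ipw_weights(D, pi1_hat, 0, theta), tau)
    return q1 - q0

# Hypothetical toy data with a homogeneous effect of 1.
rng = np.random.default_rng(0)
n = 20000
X = rng.normal(size=n)
pi1 = 1.0 / (1.0 + np.exp(-X))            # true treatment probability
D = (rng.random(n) < pi1).astype(float)
Y = 1.0 * D + X + rng.normal(size=n)

ate = ipw_ate(Y, D, pi1, theta=0.025)
qte_median = ipw_qte(Y, D, pi1, tau=0.5, theta=0.025)
```

Rescaling the weights does not change the minimiser of the weighted check function, which is why `weighted_quantile` can normalise them to sum to one before inverting the weighted cdf.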
21 Note that even though $\Delta^{\tau}$ is allowed to differ across $\tau$ in the nonparametric framework, the conditional QTE $\Delta^{\tau}(x, p(w))$ is not. Non-constant $\Delta^{\tau}(x, p(w))$ would point to the violation of $U \perp D | p(W)$ and assumption (2c). Then, the proposed estimators and virtually all point estimators suggested in the literature would be inconsistent. The assumption of constant $\Delta^{\tau}(x, p(w))$ is in principle testable, too, albeit very data hungry in a nonparametric framework, in particular when $X$ is high dimensional. Huber & Melly (2008) propose and apply such tests in a semiparametric framework. In the presence of heterogeneous $\Delta^{\tau}(x, p(w))$, point identification is not feasible, but treatment effects might still be bounded in the spirit of Manski (1989). Lee (2005) and Lechner & Melly (2007) present empirical applications of interval estimation of treatment effects in the presence of sample selection.

5 Monte Carlo simulations

This section presents results of linear and nonlinear Monte Carlo simulations to examine the finite sample properties of the proposed IPW and matching estimators relative to parametric maximum likelihood and two-step estimators as well as to the naive estimator (i.e., the difference in the sample means of the observed treated and observed nontreated observations). In all specifications, treatment $D$ is binary and a function of $X$. The first data generating process (DGP) represents a classical linear selection model with bivariate normally distributed errors, the covariance of which is 0.8.

$$Y_i^{*} = \alpha_1 D_i + \alpha_2 X_i + U_i, \quad S_i = I\{\beta_1 D_i + \beta_2 X_i + \beta_3 Z_i + V_i > 0\}, \quad D_i = I\{\gamma_1 X_i + \varepsilon_i > 0\},$$
$$Y_i = Y_i^{*} \text{ if } S_i = 1,$$
$$X, Z \sim N(0, 1), \quad U, V, \varepsilon \sim N(0, 2), \quad Cov(U, V) = 0.8, \quad Cov(U, \varepsilon) = 0,$$
$$\alpha_1 = \alpha_2 = 1, \quad \beta_1 = \beta_2 = 0.25, \quad \beta_3 = \gamma_1 = 0.5.$$

We run 1000 Monte Carlo replications and estimate the median effect and the ATE by IPW for two sample sizes ($n = 700, 2800$). The trimming factor is set to $T_n = 0.05, 0.025$ for the smaller and larger sample, respectively. To estimate the standard errors of the IPW estimators, we draw 199 bootstrap samples with replacement and set the bootstrap block size to the sample size $n$. In addition, the ATE is estimated by two nearest neighbors matching using the R package Matching developed by Sekhon (2007). For the computation of standard errors we use the Abadie & Imbens (2006) estimator based on matching observations within the same treatment group. This estimator is inconsistent as it does not account for the uncertainty related to the estimation of the propensity scores, see also appendix A.3. We nevertheless apply it to assess how severely its accuracy is affected by the inconsistency. The nested propensity scores $p(W), \pi_1(X, p(W))$ are estimated by probit specifications. We compare the IPW and matching results to the parametric ML and heckit two-step estimators for sample selection models.

Table 5.1 displays the point estimates, standard errors (s.e.), and the mean squared errors (MSEs) of the estimators. As expected, the parametric benchmarks are superior to IPW and matching in terms of MSEs due to correct parametric specification. However, the performance of the nonparametric methods is satisfactory for both sample sizes. The IPW mean and matching estimators even outperform the parametric methods in terms of small sample bias. Taking a look at the s.e. estimates ($\hat{\sigma}$), it appears that the bootstrap comes close to the true IPW standard errors, in particular for $n = 2800$. The same applies to the analytical estimates of the ML and two-step standard errors. In contrast, the Abadie & Imbens estimator considerably overestimates the matching standard error.

Table 5.1 Estimates and MSEs for the linear model with Gaussian errors

| | n=700: $\hat{\Delta}, \hat{\Delta}^{\tau}$ | n=700: MSE | n=700: $\hat{\sigma}$ | n=2800: $\hat{\Delta}, \hat{\Delta}^{\tau}$ | n=2800: MSE | n=2800: $\hat{\sigma}$ |
|---|---|---|---|---|---|---|
| IPW median | 0.977 | 0.084 | 0.324 | 0.995 | 0.021 | 0.147 |
| (s.e.) | (0.288) | | (0.067) | (0.145) | | (0.022) |
| IPW mean | 0.997 | 0.063 | 0.267 | 1.000 | 0.015 | 0.127 |
| (s.e.) | (0.251) | | (0.053) | (0.122) | | (0.021) |
| matching | 0.997 | 0.064 | 0.481 | 0.999 | 0.015 | 0.332 |
| (s.e.) | (0.252) | | (0.047) | (0.121) | | (0.015) |
| ML | 0.982 | 0.040 | 0.202 | 0.997 | 0.010 | 0.101 |
| (s.e.) | (0.200) | | (0.012) | (0.100) | | (0.003) |
| two-step | 0.993 | 0.046 | 0.218 | 0.997 | 0.011 | 0.104 |
| (s.e.) | (0.214) | | (0.033) | (0.103) | | (0.006) |
| naive | 1.484 | 0.268 | | 1.495 | 0.255 | |
| (s.e.) | (0.183) | | | (0.096) | | |
| true | 1.000 | | | 1.000 | | |

We now consider the more interesting case of a nonlinear specification and treatment effect heterogeneity in $X$. The DGP is

$$Y_i^{*} = \alpha_1 X_i + \alpha_2 X_i^2 + \alpha_3 X_i^3 + U_i \text{ if } D_i = 1, \qquad Y_i^{*} = \delta_1 X_i + \delta_2 X_i^2 + \delta_3 X_i^3 + U_i \text{ if } D_i = 0,$$
$$S_i = I\{\beta_1 D_i + \beta_2 X_i + \beta_3 Z_i + V_i > 0\}, \quad D_i = I\{\gamma_1 X_i + \varepsilon_i > 0\}, \quad Y_i = Y_i^{*} \text{ if } S_i = 1,$$
$$X, Z \sim N(0, 1), \quad \varepsilon \sim N(0, 1), \quad U, V \sim N(0, 2), \quad Cov(U, V) = 0.8, \quad Cov(U, \varepsilon) = 0,$$
$$\alpha_1 = 2, \ \alpha_2 = 6, \ \alpha_3 = 2, \quad \delta_1 = \delta_2 = \delta_3 = 1, \quad \beta_1 = \beta_2 = 0.25, \quad \beta_3 = \gamma_1 = 0.5.$$

The outcome is a cubic function of $X$ that differs for $D = 0, 1$. We would expect the parametric estimators to be severely biased due to their inconsistency related to model misspecification. In contrast, the semiparametric IPW and matching estimators should still yield decent results. Table 5.2 presents the results for $n = 700, 2800$. All estimates are normalized with respect to the true treatment effect, such that $\Delta = 1$. Again, we estimate $p(W), \pi_1(X, p(W))$ by probit specifications. IPW and matching are considerably more accurate than the parametric benchmarks, the MSEs of which are more than 10 times larger for $n = 2800$. Obviously, the parametric estimators handle the nonlinearity of the outcome in $X$ and $D$ very poorly. The results illustrate the caveats related to restrictive assumptions in sample selection models and demonstrate the merits of a more flexible model specification.

Table 5.2 Estimates and MSEs for the semi-nonlinear model with Gaussian errors

| | n=700: $\hat{\Delta}$ | n=700: MSE | n=700: $\hat{\sigma}$ | n=2800: $\hat{\Delta}$ | n=2800: MSE | n=2800: $\hat{\sigma}$ |
|---|---|---|---|---|---|---|
| IPW mean | 1.016 | 0.038 | 0.214 | 1.007 | 0.010 | 0.114 |
| (s.e.) | (0.194) | | (0.059) | (0.102) | | (0.036) |
| matching | 1.017 | 0.062 | 0.221 | 1.019 | 0.005 | 0.147 |
| (s.e.) | (0.248) | | (0.042) | (0.069) | | (0.014) |
| ML | 0.650 | 0.168 | 0.221 | 0.655 | 0.128 | 0.113 |
| (s.e.) | (0.215) | | (0.052) | (0.092) | | (0.007) |
| two-step | 0.665 | 0.141 | 0.247 | 0.666 | 0.112 | 0.111 |
| (s.e.) | (0.170) | | (0.083) | (0.077) | | (0.010) |
| naive | 1.746 | 0.577 | | 1.757 | 0.577 | |
| (s.e.) | (0.143) | | | (0.070) | | |
| true | 1.000 | | | 1.000 | | |

Table 5.3 presents the results for the same DGP as before with the exception that the unobserved terms $U, V$ are now jointly t-distributed with four degrees of freedom and $\varepsilon$ is t-distributed with four degrees of freedom. We therefore introduce misspecification with respect to the probit models of $p(W), \pi_1(X, p(W))$, where normally distributed errors are assumed. As before, the IPW and matching estimators perform quite well and greatly outperform the parametric methods in terms of bias and MSE. The misspecification of the nested propensity score does not seem to harm the accuracy of the estimators. This is in line with Zhao (2008), who investigates the finite sample properties of propensity score matching estimators and whose simulations suggest that ATE estimates are hardly affected (under conditional independence) when matching on misspecified, but yet balancing propensity scores.

Table 5.3 Estimates and MSEs for the semi-nonlinear model with t-distributed errors

| | n=700: $\hat{\Delta}$ | n=700: MSE | n=700: $\hat{\sigma}$ | n=2800: $\hat{\Delta}$ | n=2800: MSE | n=2800: $\hat{\sigma}$ |
|---|---|---|---|---|---|---|
| IPW mean | 0.999 | 0.029 | 0.174 | 0.996 | 0.008 | 0.095 |
| (s.e.) | (0.169) | | (0.032) | (0.091) | | (0.028) |
| matching | 0.894 | 0.022 | 0.173 | 0.964 | 0.005 | 0.128 |
| (s.e.) | (0.105) | | (0.019) | (0.057) | | (0.008) |
| ML | 0.841 | 0.054 | 0.168 | 0.820 | 0.039 | 0.098 |
| (s.e.) | (0.170) | | (0.075) | (0.083) | | (0.006) |
| two-step | 0.728 | 0.094 | 0.198 | 0.700 | 0.095 | 0.100 |
| (s.e.) | (0.142) | | (0.026) | (0.066) | | (0.008) |
| naive | 1.618 | 0.409 | | 1.657 | 0.436 | |
| (s.e.) | (0.165) | | | (0.068) | | |
| true | 1.000 | | | 1.000 | | |

6 Empirical applications

This section presents two applications. The first one is a classical wage regression using Italian survey data from Ichino, Mealli & Nannicini (2008).
The data set comprises 2030 individuals aged between 18 and 40 without stable jobs (i.e., open-ended contracts or self-employment) in January 2001 and was originally investigated to assess the effectiveness of temporary work assignments. We use it to estimate the returns to schooling for individuals that received secondary education or less. The dependent variable $Y$ is the log hourly wage in November 2002, which is observed conditional on being employed. We are interested in the wage effects of high school graduation ($D = 1$) vs. lower (secondary or primary) education ($D = 0$) for those having received either the one or the other. 1115 individuals in the sample graduated from high school and 637 have a lower education. Wages are observed ($S = 1$) for 747 individuals or 43%, of which 537 are treated and 210 are nontreated. Among other socio-economic information, the data comprise labour market experience, age, gender, regional dummies, and the grade obtained in the last degree (expressed as a fraction of the highest mark), which may be considered as a proxy for unobserved ability. These factors are potential confounders to education in the explanation of log hourly wages and are therefore used as conditioning variables $X$ in the outcome equation. They also enter the selection equation besides education, as they are likely to affect the probability to work. In addition, the marital status and number of children (along with interaction terms with gender), denoted as instruments $Z$, are included in the selection equation but excluded from the outcome equation.

We estimate the ATE of high school graduation by IPW and two nearest neighbor caliper matching and use probit specifications for the sample selection and treatment propensity scores.²² The IPW trimming factor is set to $T_n = 0.05$. However, no treatment propensity score estimate $\hat{\pi}_d(X_i, \hat{p}(W_i))$ actually has to be trimmed, as the maximum is 94.2% and the minimum is 8.2%.
The histograms of $\hat{\pi}_d(X_i, \hat{p}(W_i))$ for $D = 1$ and $D = 0$ presented in figure 6.1 show that the overlap in the treatment propensity scores across treatment states is quite satisfactory.

Figure 6.1 Estimated treatment propensity scores for D = 1 and D = 0
[Two histograms, "Histogram of pi[d == 1]" and "Histogram of pi[d == 0]"; horizontal axis: estimated treatment propensity score, vertical axis: frequency.]

The caliper in the matching algorithm defines the maximally acceptable distance in any match's propensity score in order to eliminate those matches that are not comparable in terms of their treatment probabilities, i.e., lie outside the support. We set the caliper to 1 standard deviation (of the estimated treatment propensity score), but no observations have to be dropped. After-matching balance tests indicate decent balance, suggesting that treated and nontreated matches are comparable with respect to the distribution of $X$ and the estimated sample selection propensity score $\hat{p}$. In addition to the semiparametric procedures we also estimate the ATE nonparametrically by directly matching on $X$ and $\hat{p}$, where the latter is obtained by nonparametric conditional density estimation as discussed in Li et al. (2009). The caliper is again 1 standard deviation and 107 observations (14%) are discarded due to a lack of common support. We also estimate the QTE at the median using IPW. As in the simulations, standard errors of IPW and matching estimators are based on bootstrapping (999 draws) and the Abadie & Imbens (2006) estimator, respectively.

22 The treatment specification includes age, age squared, years unemployed, age*years unemployed, a dummy for Sicily, and the grade obtained in the last degree. The selection equation additionally includes educational dummies, gender, marital status, dummies for 1, 2, and 3 children, and interaction terms between gender and children.
Table 6.1 provides the results for the non- and semiparametric estimators as well as for the parametric ML and two-step (heckit) procedures. The estimates suggest that graduating from high school increases the hourly wage on average by at least 6%. The median estimate is somewhat higher than the mean estimates, but one would generally expect the QTE to diverge from the ATE if effects are heterogeneous with respect to $X$ and $p$. Despite the limited sample size, the IPW effects are significant at the 10% level. Note that the parametric estimates are not too far away from the results obtained by semiparametric or nonparametric estimation, but this need not be the case in other problems. It therefore seems advisable to use both semi-/nonparametric and parametric estimators in empirical applications, as the former are more robust and the latter are generally more precise, given that the estimates obtained by both methods are close.

Table 6.1 Average and median treatment effects (increase of hourly wage in %)

| | IPW mean | match (probit) | direct match* | ML | two-step | IPW median |
|---|---|---|---|---|---|---|
| $\hat{\Delta}, \hat{\Delta}^{\tau}$ | 0.073 | 0.055 | 0.066 | 0.087 | 0.080 | 0.105 |
| (s.e.) | (0.038) | (0.036) | (0.037) | (0.037) | (0.046) | (0.034) |
| p-value | 0.054 | 0.135 | 0.070 | 0.024 | 0.081 | 0.002 |

*107 observations (14%) dropped due to a lack of common support

The methods proposed in this paper may also be used as robustness checks, which we demonstrate in the second application. Angrist et al. (2004) consider the effects of school vouchers on scores achieved in a college admission test based on data from Colombia's PACES program, which covered half the cost of private secondary schooling. Many vouchers were assigned by lottery, which suggests that treatment effects can be evaluated by comparing the test scores of voucher winners and losers just like in an experiment. Experimental results in Angrist et al. (2004) imply that vouchers increase reading test scores on average by roughly 0.7 points and this effect is significant at the 5% level.
However, as only 35% of students in the sample of voucher applicants took the test, selection bias is an issue if test taking is non-random, e.g., if voucher winners were more likely to be tested. Therefore, Angrist et al. (2004) use censored regression and nonparametric bounds²³ to account for potential sample selection. On balance, they still find substantial gains from the PACES program. In what follows, an alternative way to check the effects' robustness will be presented by modeling the relationship between the likelihood to take the test, the (potential) test score, and the incidence of winning a voucher. Thus, the necessity of an exclusion restriction is substituted by imposing more structure on the model. We assume that the probability to be tested is characterized (i) by a linear probability model (LPM) or (ii) by a probit model. The linear model has the form

$$p = \Pr(S = 1) = \beta_1 Y^{*} + \beta_2 D + \eta, \qquad \eta \sim \text{unif}(0.05, 0.45),$$

where $p$ is the probability to take the test, $Y^{*}$ are the potential test scores, and $D$ is winning ($D = 1$) or not winning ($D = 0$) a voucher. $\eta$ is a randomly assigned baseline probability that is assumed to be uniformly distributed between 5 and 45%. Similarly, the probit model is defined as

$$p = \Phi(\beta_1 Y^{*} + \beta_2 D + \eta),$$

where $\Phi(\cdot)$ is the normal cdf. Hence, $p$ is assumed to be related to both the test score and the school voucher. The relation to the test score is due to the assumption that more able students with higher potential test scores are also more likely to take the test. In addition, voucher winners might be more encouraged by their (more often private) schools to take the test, which is one potential reason why $p$ may be related to $D$.

23 Note that there is no instrument for taking the test available in the data.

The sample of test takers consists of 1223 observations²⁴ for which $p$ is computed. The sample average of the test score is 47.356 and the test score's standard deviation is 5.588.
We assess the robustness of the voucher effect estimate for different values of $\beta_1$, $\beta_2$, and $\gamma$ using IPW. In a perfect experiment, the probability to receive a voucher is independent of $p$ and $X$ such that the unconditional treatment probability $\Pr(D = 1)$ (63.7% in the sample) can be used for estimation. In this case, IPW yields an effect estimate of 0.683 (standard error based on 199 bootstrap draws: 0.329). This is the same result as obtained by taking the difference in mean test scores of treated and nontreated or regressing the test score on a treatment dummy. To account for selection, we specify $\pi_1(X, p)$, the propensity score for having received a voucher, as a probit model with the test-taking probability $p$ and other covariates $X$, namely age and age squared, as explanatory variables. It is therefore assumed that $p$ is known and can be controlled for to consistently estimate the voucher effect. By changing $\beta_1$, $\beta_2$, and $\gamma$ over a range of plausible values, the robustness of the voucher effect estimate can be investigated. E.g., for the LPM, $\beta_1 = 0.001$ implies that each additional point in the test comes with an increase in the likelihood to take the test by 0.1 percentage points. $\beta_2 = 0.05$ means that voucher winners have, ceteris paribus, a 5 percentage points higher probability to take the test than losers in the LPM. Results are provided in Table 6.2. As expected, $\hat{\Delta}$ decreases in $\beta_1$ and $\beta_2$, suggesting positive selection bias. Still, the estimates remain positive for most combinations of parameter values, albeit not significantly different from zero at conventional levels in most cases²⁵.

²⁴ One observation was dropped due to a missing value in the reading test score.
²⁵ Standard errors are based on 199 bootstrap draws.
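The experimental benchmark mentioned above, IPW with the unconditional treatment probability, can be checked numerically. The sketch below uses made-up toy data (not the PACES sample) and hypothetical function names. It shows that IPW with a constant propensity reproduces the difference in means, and it computes a bootstrap standard error from 199 draws as in the text. The resampling scheme shown is one simple choice and is an assumption, not necessarily the one used for the reported results.

```python
import random
import statistics


def ipw_constant_propensity(y, d):
    """IPW effect estimate using the sample share of treated as Pr(D = 1)."""
    n = len(y)
    share = sum(d) / n  # unconditional treatment probability
    treated = sum(yi * di / share for yi, di in zip(y, d)) / n
    control = sum(yi * (1 - di) / (1.0 - share) for yi, di in zip(y, d)) / n
    return treated - control


def bootstrap_se(y, d, draws=199, seed=1):
    """Standard deviation of the estimate over resamples with replacement."""
    rng = random.Random(seed)
    n = len(y)
    estimates = []
    while len(estimates) < draws:
        idx = [rng.randrange(n) for _ in range(n)]
        d_b = [d[i] for i in idx]
        if 0 < sum(d_b) < n:  # both groups must appear in the resample
            y_b = [y[i] for i in idx]
            estimates.append(ipw_constant_propensity(y_b, d_b))
    return statistics.stdev(estimates)


# Toy data (hypothetical): test scores and voucher indicators.
d = [1, 1, 1, 0, 0, 1, 0, 1]
y = [48.0, 51.5, 46.0, 45.0, 47.5, 50.0, 44.0, 49.0]

effect = ipw_constant_propensity(y, d)
mean_treated = statistics.mean([yi for yi, di in zip(y, d) if di == 1])
mean_control = statistics.mean([yi for yi, di in zip(y, d) if di == 0])
diff_means = mean_treated - mean_control
se = bootstrap_se(y, d)
print(effect, diff_means, se)
```

For the sensitivity analysis itself, the constant share would be replaced by a fitted propensity $\pi_1(X, p)$; that step is omitted here because it requires a probit fit.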
Table 6.2 IPW based robustness checks, linear probability model and probit model

Entries are estimates $\hat{\Delta}$ with standard errors (s.e.) in parentheses.

linear probability model
                      $\beta_2 = 0.01$    $\beta_2 = 0.03$    $\beta_2 = 0.05$    $\beta_2 = 0.07$
$\beta_1 = 0.001$     0.632 (0.322)       0.519 (0.338)       0.418 (0.348)       0.375 (0.392)
$\beta_1 = 0.003$     0.581 (0.332)       0.388 (0.353)       0.213 (0.338)       0.102 (0.409)
$\beta_1 = 0.005$     0.523 (0.318)       0.260 (0.317)       0.019 (0.350)       -0.154 (0.396)
$\beta_1 = 0.007$     0.463 (0.308)       0.140 (0.306)       -0.155 (0.348)      -0.382 (0.387)

probit model
                      $\beta_2 = 0.01$    $\beta_2 = 0.03$    $\beta_2 = 0.05$    $\beta_2 = 0.07$
$\beta_1 = 0.001$     0.632 (0.324)       0.515 (0.332)       0.398 (0.358)       0.321 (0.386)
$\beta_1 = 0.003$     0.580 (0.325)       0.382 (0.329)       0.189 (0.353)       0.038 (0.377)
$\beta_1 = 0.005$     0.522 (0.316)       0.253 (0.315)       -0.007 (0.338)      -0.221 (0.385)
$\beta_1 = 0.007$     0.461 (0.307)       0.133 (0.325)       -0.180 (0.332)      -0.443 (0.377)

7 Conclusion

This paper discusses point identification and estimation of average and quantile treatment effects in the presence of sample selection, attrition, and non-response related to unobservables. It extends methods discussed by Hirano et al. (2003) and Firpo (2007a) for treatment evaluation in a selection on observables framework to the case of a non-randomly drawn subpopulation related to unobservables. The main contribution of the paper is the proposal of nonparametric estimators which ‘kill two birds with one stone’ by controlling for selectivity bias with respect to (i) sample selection and (ii) treatment assignment, using a nested propensity score characterizing either selection probability. The estimators rely on inverse probability weighting (IPW) and propensity score matching, where the (first stage) sample selection propensity score is included as an additional covariate among other observed factors to compute the (second stage) propensity to receive the treatment.
In contrast to most parametric and semiparametric procedures, the proposed estimators apply to selection models of rather general form and allow for effect heterogeneity in the covariates and in the sample selection propensity score. They constitute an alternative to conventional approaches whenever one is interested in the unconditional effects of a particular treatment variable rather than a broader set of regressors. Neither exact knowledge of the structural relation between the selection probability and the outcome, nor additivity of the unobserved term in the outcome equation is required for consistency. However, as for virtually all methods yielding point identification, joint independence of the observed and unobserved factors in the selection and outcome equations must hold conditional on the sample selection propensity score. Monte Carlo results suggest that IPW and matching estimators are considerably more appropriate than parametric alternatives when the data generating process is nonlinear. The paper also provides two empirical applications: to Italian labor market data, see Ichino et al. (2008), and to a school voucher lottery in Colombia previously analyzed by Angrist et al. (2004). Further research might investigate the finite sample properties of the proposed estimators in more detail and systematically evaluate their performance in terms of bias and mean squared error relative to conventional parametric and semiparametric methods for various specifications of the selection and outcome equations as well as the unobserved terms.

A Appendix

A.1 Proof of proposition 1

$\Delta$, the ATE for the subpopulation with observed outcomes, is identified by

\[
\Delta = E\left[\frac{S \cdot D \cdot Y^*}{p(W) \cdot \pi_{d'}(X, p(W))}\right] - E\left[\frac{S \cdot (1-D) \cdot Y^*}{p(W) \cdot \pi_{d''}(X, p(W))}\right]
= E\left[\frac{D \cdot Y}{\pi_{d'}(X, p(W))}\right] - E\left[\frac{(1-D) \cdot Y}{\pi_{d''}(X, p(W))}\right].
\]

Here $D = I\{D = d'\}$ and $1 - D = I\{D = d''\}$ for a binary treatment with $d' = 1$ and $d'' = 0$; the proof is written in terms of these indicators.

Proof:

\begin{align*}
& E\left[\frac{S \cdot I\{D = d'\} \cdot Y^*}{p(W) \cdot \pi_{d'}(X, p(W))}\right] - E\left[\frac{S \cdot I\{D = d''\} \cdot Y^*}{p(W) \cdot \pi_{d''}(X, p(W))}\right] \\
={}& E_{p(W)}\left[E_X\left[E\left[\frac{S \cdot I\{D = d'\} \cdot Y^*}{p(W) \cdot \pi_{d'}(X, p(W))} - \frac{S \cdot I\{D = d''\} \cdot Y^*}{p(W) \cdot \pi_{d''}(X, p(W))} \,\middle|\, X, p(W)\right] \,\middle|\, p(W)\right]\right] \\
={}& E_{p(W)}\left[E_X\left[E\left[\frac{I\{D = d'\} \cdot Y^*}{p(W) \cdot \pi_{d'}(X, p(W))} - \frac{I\{D = d''\} \cdot Y^*}{p(W) \cdot \pi_{d''}(X, p(W))} \,\middle|\, S = 1, X, p(W)\right] \cdot p(W) \,\middle|\, p(W)\right]\right] \\
={}& E_{p(W)}\left[E_X\left[E\left[\frac{I\{D = d'\} \cdot Y}{\pi_{d'}(X, p(W))} - \frac{I\{D = d''\} \cdot Y}{\pi_{d''}(X, p(W))} \,\middle|\, X, p(W)\right] \,\middle|\, p(W)\right]\right] \\
={}& E_{p(W)}\Bigg[E_X\Bigg[E\left[\frac{Y}{\pi_{d'}(X, p(W))} \,\middle|\, D = d', X, p(W)\right] \cdot \pi_{d'}(X, p(W)) \\
& \qquad\qquad - E\left[\frac{Y}{\pi_{d''}(X, p(W))} \,\middle|\, D = d'', X, p(W)\right] \cdot \pi_{d''}(X, p(W)) \,\Bigg|\, p(W)\Bigg]\Bigg] \\
={}& E_{p(W)}\Big[E_X\Big[E\big[Y \mid D = d', X, p(W)\big] - E\big[Y \mid D = d'', X, p(W)\big] \,\Big|\, p(W)\Big]\Big] = E\big[Y \mid D = d'\big] - E\big[Y \mid D = d''\big] \\
={}& E[Y^{d'}] - E[Y^{d''}] = \Delta.
\end{align*}

A.2 Proof of proposition 2

For the identification of $\Delta_\tau$, the QTE for the subpopulation with observed outcomes, note that $Q^\tau_{Y^d}$, the $\tau$th unconditional quantile of $Y^d$, is an implicit function of the following expression:

\[
E\left[\frac{S \cdot I\{D = d\}}{p(W) \cdot \pi_d(X, p(W))} \cdot I\{Y^* \leq Q^\tau_{Y^d}\}\right]
= E\left[\frac{I\{D = d\}}{\pi_d(X, p(W))} \cdot I\{Y \leq Q^\tau_{Y^d}\}\right]
= F_{Y^d}(Q^\tau_{Y^d}) = \tau.
\]

Proof:

\begin{align*}
& E\left[\frac{S \cdot I\{D = d\}}{p(W) \cdot \pi_d(X, p(W))} \cdot I\{Y^* \leq Q^\tau_{Y^d}\}\right] \\
={}& E_{p(W)}\left[E_X\left[E\left[\frac{S \cdot I\{D = d\}}{p(W) \cdot \pi_d(X, p(W))} \cdot I\{Y^* \leq Q^\tau_{Y^d}\} \,\middle|\, X, p(W)\right] \,\middle|\, p(W)\right]\right] \\
={}& E_{p(W)}\left[E_X\left[E\left[\frac{I\{D = d\}}{p(W) \cdot \pi_d(X, p(W))} \cdot I\{Y^* \leq Q^\tau_{Y^d}\} \,\middle|\, S = 1, X, p(W)\right] \cdot p(W) \,\middle|\, p(W)\right]\right] \\
={}& E_{p(W)}\left[E_X\left[E\left[\frac{I\{D = d\}}{\pi_d(X, p(W))} \cdot I\{Y \leq Q^\tau_{Y^d}\} \,\middle|\, X, p(W)\right] \,\middle|\, p(W)\right]\right] \\
={}& E_{p(W)}\left[E_X\left[E\left[\frac{1}{\pi_d(X, p(W))} \cdot I\{Y \leq Q^\tau_{Y^d}\} \,\middle|\, D = d, X, p(W)\right] \cdot \pi_d(X, p(W)) \,\middle|\, p(W)\right]\right] \\
={}& E_{p(W)}\Big[E_X\Big[E\big[I\{Y \leq Q^\tau_{Y^d}\} \mid D = d, X, p(W)\big] \,\Big|\, p(W)\Big]\Big] = E\big[I\{Y \leq Q^\tau_{Y^d}\} \mid D = d\big] = \tau.
\end{align*}

$\Delta_\tau$ is identified by $Q^\tau_{Y^{d'}} - Q^\tau_{Y^{d''}}$.
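Proposition 2 characterizes each quantile as the value at which the weighted share of outcomes at or below it equals $\tau$. A minimal sample analogue is sketched below. It assumes a binary treatment, takes the treatment propensities among observed outcomes as given, and normalizes the weights to sum to one within each treatment group (a finite-sample choice made here for illustration, not something stated in the proposition). All names and data are hypothetical.

```python
def weighted_quantile(values, weights, tau):
    """Smallest value whose normalized cumulative weight reaches tau."""
    total = sum(weights)
    cumulative = 0.0
    for value, weight in sorted(zip(values, weights)):
        cumulative += weight / total
        if cumulative >= tau:
            return value
    return max(values)


def ipw_qte(y, d, pi1, tau):
    """Difference of IPW-weighted tau-quantiles among observed outcomes.

    pi1[i] stands for the probability of D = 1 given X and p(W) for unit i;
    it is treated as known here, whereas in practice it is estimated.
    """
    w1 = [di / p for di, p in zip(d, pi1)]
    w0 = [(1 - di) / (1.0 - p) for di, p in zip(d, pi1)]
    return weighted_quantile(y, w1, tau) - weighted_quantile(y, w0, tau)


# Toy data (hypothetical): observed outcomes only, constant propensity 0.5.
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
d = [1, 0, 1, 0, 1, 0, 1, 0]
pi1 = [0.5] * 8
median_effect = ipw_qte(y, d, pi1, tau=0.5)
print(median_effect)
```

With a constant propensity the weights are equal within each group, so the estimate reduces to the difference between the unweighted group quantiles.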
A.3 Asymptotic distribution of the IPW estimator using parametric propensity score models

This section shows $\sqrt{n}$-consistency and asymptotic normality of IPW estimators using parametric models for the selection into the observed population and into treatment. The properties are discussed in a GMM framework that is similar to the one considered by Lechner (2009) for dynamic treatment evaluation. It is assumed that the nested propensity scores $p$, $\pi_d$ for sample selection and treatment assignment are known up to a finite number of coefficients $\beta \equiv (\beta_s, \beta_d)$, where $\beta_s$ denotes the coefficients on $W \equiv (D, X, Z)$ in $p = p(W, \beta_s)$ and $\beta_d$ the coefficients on $X$, $p$ in $\pi_d = \pi_d(X, p(W, \beta_s), \beta_d)$. Furthermore, there exists a $\sqrt{n}$-consistent, asymptotically normal estimator $\hat{\beta}$, for instance a two-step ML estimator of a nested probit or logit model with likelihood functions $L_s(s, \beta_s)$, $L_d(d, \beta_d, \beta_s)$. Note that $\hat{\beta}_d$, the coefficient estimates characterizing the treatment probability, are a function of the selection probability implied by $\hat{\beta}_s$ (which is $\sqrt{n}$-consistent) rather than the true value $\beta_s$. Murphy & Topel (1985) show that under certain regularity conditions the two-step ML estimator $\hat{\beta}_d$ is $\sqrt{n}$-consistent and asymptotically normal²⁶.

Let $k$, $g$ denote the score functions, i.e., the first derivatives of the likelihood functions with respect to the coefficients:

\[
\begin{pmatrix} k(x, z, s, d, \beta_s) \\ g(x, z, s, d, \beta) \end{pmatrix}
= \begin{pmatrix} \partial L_s(s, \beta_s) / \partial \beta_s \\ \partial L_d(d, \beta_d, \beta_s) / \partial \beta_d \end{pmatrix}.
\]

Using a GMM framework, the estimators of the unknown values of $\beta_d$, $\beta_s$ satisfy the conditions

\[
\frac{1}{n} \sum_{i=1}^{n} k(X_i, Z_i, S_i, D_i; \hat{\beta}_s) = 0, \qquad
\frac{1}{n} \sum_{i=1}^{n} g(X_i, Z_i, S_i, D_i; \hat{\beta}) = 0.
\]

These conditions allow predicting the sample selection and treatment propensity scores and will serve as one part of the final GMM estimator that will also incorporate a moment condition related to the treatment effects.
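To make the role of the score-based moment conditions concrete, the sketch below solves a single condition of this type for a deliberately simplified model: a logit with one regressor and no intercept. This is not the nested probit or logit specification described above; the data, the bisection solver and all names are illustrative assumptions.

```python
import math


def logistic(v):
    """Logistic cdf."""
    return 1.0 / (1.0 + math.exp(-v))


def mean_score(beta, w, s):
    """Sample moment (1/n) * sum of scores of a one-parameter logit."""
    return sum((si - logistic(beta * wi)) * wi for wi, si in zip(w, s)) / len(w)


def solve_moment_condition(w, s, lo=-20.0, hi=20.0, iterations=200):
    """Bisection on the mean score, which is decreasing in beta
    because the logit log-likelihood is concave."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if mean_score(mid, w, s) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Toy data (hypothetical): one regressor w and a selection indicator s.
w = [-2.0, -1.0, -1.0, -0.5, 0.5, 1.0, 1.0, 2.0]
s = [0, 0, 1, 0, 1, 0, 1, 1]

beta_hat = solve_moment_condition(w, s)
print(beta_hat, mean_score(beta_hat, w, s))
```

In the two-step setting of this section, the fitted selection probabilities from a first step of this kind would enter the second-step model for the treatment as an additional regressor.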
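We therefore reconsider the ATE estimator which we defined as

\[
\hat{\Delta} = \sum_{i | S = 1} \left[(\hat{\omega}_{d',i} - \hat{\omega}_{d'',i}) \cdot Y_i\right]
= \sum_{i=1}^{n} \left[S_i \cdot (\hat{\omega}_{d',i} - \hat{\omega}_{d'',i}) \cdot Y_i\right],
\]

with weights

\[
\hat{\omega}_{d,i} = \frac{1}{\sum_{j=1}^{n} S_j} \cdot \frac{I\{D_i = d\}}{\pi_d(X_i, p(W_i, \hat{\beta}_s), \hat{\beta}_d)}.
\]

It is straightforward to rewrite the estimator as

\[
\hat{\Delta} = \frac{1}{n} \sum_{i=1}^{n} \lambda_i(x, z, s, d, \hat{\beta}) \cdot Y_i,
\]

with

\begin{align*}
\lambda_i(x, z, s, d, \hat{\beta}) &= n \cdot S_i \cdot (\hat{\omega}_{d',i} - \hat{\omega}_{d'',i}) \\
&= \frac{n}{\sum_{j=1}^{n} S_j} \cdot S_i \cdot \left(\frac{I\{D_i = d'\}}{\pi_{d'}(X_i, p(W_i, \hat{\beta}_s), \hat{\beta}_{d'})} - \frac{I\{D_i = d''\}}{\pi_{d''}(X_i, p(W_i, \hat{\beta}_s), \hat{\beta}_{d''})}\right) \\
&= \frac{S_i}{\hat{\Pi}} \cdot \left(\frac{I\{D_i = d'\}}{\pi_{d'}(X_i, p(W_i, \hat{\beta}_s), \hat{\beta}_{d'})} - \frac{I\{D_i = d''\}}{\pi_{d''}(X_i, p(W_i, \hat{\beta}_s), \hat{\beta}_{d''})}\right),
\end{align*}

where $\hat{\Pi}$ denotes the unconditional probability to be observed, $\hat{\Pi} \equiv (\sum_{j=1}^{n} S_j)/n$. This allows us to formulate the estimator of $\Delta$ as the value $\hat{\Delta}$ satisfying

\[
\frac{1}{n} \sum_{i=1}^{n} h(X_i, Z_i, S_i, D_i; \hat{\beta}, \hat{\Delta})
= \hat{\Delta} - \frac{1}{n} \sum_{i=1}^{n} \lambda_i(x, z, s, d; \hat{\beta}) \cdot Y_i = 0,
\]

which constitutes the second ingredient of the GMM estimator. As in Lechner (2009), one particularity of this otherwise standard parametric GMM problem (see Hansen, 1982, and Newey and McFadden, 1994) is that some of the moment conditions depend only on a subset of unknown parameters. I.e., the moment conditions $g$ related to $\beta$ do not depend on $\Delta$ and furthermore, $L_s(s, \beta_s)$ does not depend on $\beta_d$.

²⁶ Murphy & Topel (1985) prove that $\sqrt{n}(\hat{\beta}_d - \beta_d) \rightarrow N(0, \Sigma)$, with

\[
\Sigma = R_2^{-1} + R_2^{-1}\left[R_3' R_1^{-1} R_3 - R_4' R_1^{-1} R_3 - R_3' R_1^{-1} R_4\right] R_2^{-1},
\]
\[
R_1 = -E\left[\frac{\partial^2 L_s(s, \beta_s)}{\partial \beta_s \partial \beta_s'}\right], \quad
R_2 = -E\left[\frac{\partial^2 L_d(d, \beta_d, \beta_s)}{\partial \beta_d \partial \beta_d'}\right], \quad
R_3 = -E\left[\frac{\partial^2 L_d(d, \beta_d, \beta_s)}{\partial \beta_s \partial \beta_d'}\right], \quad
R_4 = E\left[\frac{\partial L_s(s, \beta_s)}{\partial \beta_s} \left(\frac{\partial L_d(d, \beta_d, \beta_s)}{\partial \beta_d}\right)'\right],
\]

where ‘$'$’ denotes transposed.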
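The equivalence between the weight form and the $\lambda_i$ form of the estimator can be verified numerically. The sketch below assumes a binary treatment with $\pi_0 = 1 - \pi_1$ and takes the treatment propensities as given inputs; in the setting of this section they would be predicted from the two-step model. The data and function names are hypothetical.

```python
def ipw_ate(y, s, d, pi1):
    """ATE among observed outcomes using the weights omega_{d,i}.

    y[i] may hold any placeholder when s[i] == 0, since it is multiplied
    by s[i]. pi1[i] stands for pi_1(X_i, p(W_i)); pi_0 is taken to be
    1 - pi_1, which assumes a binary treatment.
    """
    n_observed = sum(s)
    total = 0.0
    for yi, si, di, p in zip(y, s, d, pi1):
        omega1 = (di / p) / n_observed
        omega0 = ((1 - di) / (1.0 - p)) / n_observed
        total += si * (omega1 - omega0) * yi
    return total


def ipw_ate_lambda(y, s, d, pi1):
    """Same estimator written as (1/n) * sum_i lambda_i * Y_i."""
    n = len(y)
    share_observed = sum(s) / n  # Pi-hat, the share of observed outcomes
    lambdas = [si / share_observed * (di / p - (1 - di) / (1.0 - p))
               for si, di, p in zip(s, d, pi1)]
    return sum(lam * yi for lam, yi in zip(lambdas, y)) / n


# Toy data (hypothetical); outcomes of unobserved units are set to 0.0.
y = [2.0, 0.0, 3.0, 1.0, 0.0, 4.0]
s = [1, 0, 1, 1, 0, 1]
d = [1, 1, 0, 0, 0, 1]
pi1 = [0.5, 0.6, 0.4, 0.5, 0.3, 0.8]

ate_weights = ipw_ate(y, s, d, pi1)
ate_lambda = ipw_ate_lambda(y, s, d, pi1)
print(ate_weights, ate_lambda)
```

Both functions divide by the number of observed units rather than by the sum of the inverse-probability weights, which mirrors the definition of $\hat{\omega}_{d,i}$ above.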
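The regularity conditions required for consistency and asymptotic normality in this framework of sequential estimators were established by Newey (1984): Data must be generated from stationary and ergodic processes, the moment functions and the respective derivatives must exist and must be measurable and continuous, the parameters must be finite and not at the boundary of the parameter space, and the derivatives of the moment conditions w.r.t. the parameters must have full rank. Furthermore, the sample moments must converge to their population counterparts with decreasing variances and to uniquely identified values of the unknown parameters. Applying the results of Newey (1984) and using the partitioned inverse formula on the matrix of derivatives (w.r.t. the unknown parameters $\beta_s$, $\beta_d$, $\Delta$) of the moment conditions, the asymptotic variance of the ATE estimator is equal to

\begin{align*}
\mathrm{asVar}(\sqrt{n}\hat{\Delta})
={}& H_\Delta^{-1} E\big[\{h(\cdot) + H_{\beta_d} G_{\beta_d}^{-1} g(\cdot) - (H_{\beta_s} G_{\beta_d} - H_{\beta_d} G_{\beta_s}) K_{\beta_s}^{-1} G_{\beta_d}^{-1} k(\cdot)\} \\
& \quad \times \{h(\cdot) + H_{\beta_d} G_{\beta_d}^{-1} g(\cdot) - (H_{\beta_s} G_{\beta_d} - H_{\beta_d} G_{\beta_s}) K_{\beta_s}^{-1} G_{\beta_d}^{-1} k(\cdot)\}'\big] H_\Delta^{-1} \\
={}& V_{hh} + H_{\beta_d} G_{\beta_d}^{-1} V_{gh} - (H_{\beta_s} G_{\beta_d} - H_{\beta_d} G_{\beta_s}) K_{\beta_s}^{-1} G_{\beta_d}^{-1} V_{kh} \\
& + H_{\beta_d} G_{\beta_d}^{-1} V_{gg} G_{\beta_d}^{-1\prime} H_{\beta_d}' + V_{hg} G_{\beta_d}^{-1\prime} H_{\beta_d}' - (H_{\beta_s} G_{\beta_d} - H_{\beta_d} G_{\beta_s}) K_{\beta_s}^{-1} G_{\beta_d}^{-1} V_{kg} G_{\beta_d}^{-1\prime} H_{\beta_d}' \\
& + (H_{\beta_s} G_{\beta_d} - H_{\beta_d} G_{\beta_s}) K_{\beta_s}^{-1} G_{\beta_d}^{-1} V_{kk} G_{\beta_d}^{-1\prime} K_{\beta_s}^{-1\prime} (H_{\beta_s} G_{\beta_d} - H_{\beta_d} G_{\beta_s})' \\
& - V_{hk} G_{\beta_d}^{-1\prime} K_{\beta_s}^{-1\prime} (H_{\beta_s} G_{\beta_d} - H_{\beta_d} G_{\beta_s})' - H_{\beta_d} G_{\beta_d}^{-1} V_{gk} G_{\beta_d}^{-1\prime} K_{\beta_s}^{-1\prime} (H_{\beta_s} G_{\beta_d} - H_{\beta_d} G_{\beta_s})',
\end{align*}

where ‘$'$’ denotes transposed and

\begin{align*}
H_\Delta &\equiv E\left[\frac{\partial h(\cdot)}{\partial \Delta}\right] = 1, \quad
H_{\beta_d} \equiv -E\left[\frac{\partial h(\cdot)}{\partial \beta_d}\right] = -E\left[\frac{\partial \lambda_i(\cdot)}{\partial \beta_d} Y_i\right], \quad
H_{\beta_s} \equiv -E\left[\frac{\partial h(\cdot)}{\partial \beta_s}\right] = -E\left[\frac{\partial \lambda_i(\cdot)}{\partial \beta_s} Y_i\right], \\
G_{\beta_d} &\equiv E\left[\frac{\partial g(\cdot)}{\partial \beta_d}\right], \quad
G_{\beta_s} \equiv E\left[\frac{\partial g(\cdot)}{\partial \beta_s}\right], \quad
K_{\beta_s} \equiv E\left[\frac{\partial k(\cdot)}{\partial \beta_s}\right], \\
V_{hh} &\equiv E[h(\cdot)^2] = \mathrm{Var}[\lambda_i(x, z, s, d, \hat{\beta}) \cdot Y_i], \quad
V_{gg} \equiv E[g(\cdot) g(\cdot)'], \quad
V_{kk} \equiv E[k(\cdot) k(\cdot)'], \\
V_{gh} &\equiv E[g(\cdot) h(\cdot)], \quad V_{hg} \equiv V_{gh}', \quad
V_{kh} \equiv E[k(\cdot) h(\cdot)], \quad V_{hk} \equiv V_{kh}', \quad
V_{kg} \equiv E[k(\cdot) g(\cdot)'], \quad V_{gk} \equiv V_{kg}'.
\end{align*}

Ignoring the estimation of the nested propensity score would amount to assuming that $\partial \lambda_i(\cdot) / \partial \beta_d = 0$ and $\partial \lambda_i(\cdot) / \partial \beta_s = 0$ such that $\mathrm{asVar}(\sqrt{n}\hat{\Delta}) = \mathrm{Var}[\lambda_i(x, z, s, d, \hat{\beta}) \cdot Y_i]$.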
Note that this is what the Abadie & Imbens (2006) variance estimator does for the nearest neighbor matching estimator, which is why it is inconsistent in the framework considered in this paper. As acknowledged by Lechner (2009), the full variance might be smaller or larger than $\mathrm{Var}[\lambda_i(x, z, s, d, \hat{\beta}) \cdot Y_i]$, depending on $\partial \lambda_i(\cdot) / \partial \beta$ and on the correlation of the moment conditions. A consistent estimator of the asymptotic variance is obtained by using the sample analogues of the terms in the formula or by bootstrapping.

We conclude this section by establishing a condition for the estimation of unconditional quantile functions required to estimate QTEs. We defined the estimator of $Q^\tau_{Y^d}$ as

\[
\hat{Q}^\tau_{Y^d} = \arg\min_y \sum_{i | S = 1} \hat{\omega}_{d,i} \cdot \rho_\tau(Y_i - y)
= \arg\min_y \frac{1}{n} \sum_{i=1}^{n} \left[n \cdot S_i \cdot \hat{\omega}_{d,i} \cdot \rho_\tau(Y_i - y)\right].
\]

This implies the first order condition

\[
\frac{1}{n} \sum_{i=1}^{n} h_\tau(x_i, z_i, s_i, d_i; \hat{\beta}, \hat{Q}^\tau_{Y^d})
= \frac{1}{n} \sum_{i=1}^{n} \left[\frac{S_i}{\hat{\Pi}} \cdot \frac{I\{D_i = d\}}{\pi_d(X_i, p(W_i, \hat{\beta}_s), \hat{\beta}_d)} \cdot I\{Y_i < \hat{Q}^\tau_{Y^d}\} - \tau\right] = 0,
\]

which immediately serves as condition for GMM estimation. The asymptotic variance of the asymptotically normal estimator $\hat{Q}^\tau_{Y^d}$ can be obtained in a similar way as outlined for the ATE estimator. As a consequence, the difference $\hat{\Delta}_\tau = \hat{Q}^\tau_{Y^{d'}} - \hat{Q}^\tau_{Y^{d''}}$ for distinct treatments $d' \neq d''$ is asymptotically normal, too. $\hat{\Delta}_\tau$ involves independent terms, see for instance the argumentation in Firpo (2007a). Therefore, the asymptotic variance of $\hat{\Delta}_\tau$ can be easily obtained from the asymptotic variances of $\hat{Q}^\tau_{Y^{d'}}$, $\hat{Q}^\tau_{Y^{d''}}$ as the covariance term is zero.

A.4 Identification in a randomized experiment with censored outcomes

Throughout this paper we assumed that treatment assignment is non-random and only unconfounded conditional on observed covariates $X$ and that $X$ also affects selection.
This is plausible in many interesting evaluation problems such as wage equations, where factors such as tenure or experience are likely to affect both the probability to work and the potential wage and may be confounders of the treatment ‘education’.

Let us now assume that the treatment is randomly assigned (i.e., independent of $X$) in the total population²⁷ such that the treatment propensity score in the observed population is only a function of the sample selection propensity score, i.e., $\pi_d(p(\cdot)) = \Pr(D = d \mid p(\cdot))$, and the latter is only a function of $Z$ and $D$, $p(D, Z) = \Pr(S = 1 \mid D, Z)$. This is useful for randomized experiments or lotteries with partially observed outcomes where outcome censoring is non-random. Consider for instance the effect of school vouchers assigned by a lottery on college admission test scores several years later. If only a subpopulation takes the test and the participation probability is a function of the lottery win, point identification generally requires an exclusion restriction to adjust for selection bias. Let $Z$ denote an instrument satisfying this restriction.

Without loss of generality, we will discuss identification for a binary treatment $D \in \{1, 0\}$. Let $\Pr(S = 1 \mid D = d, Z) = p_d(Z)$ denote the selection probability conditional on $D = d$. Then, $\Pr(S = 1 \mid D, Z) = p_1(Z) \cdot D + p_0(Z) \cdot (1 - D)$. We denote the treatment propensity in the observed population conditional on $p_d(Z)$ by $\Pr(D = 1 \mid p_d(Z)) = \pi_1(p_d(Z))$. For a fixed $Z = z$, $\pi_1(p_1(z)) \neq \pi_1(p_0(z))$, otherwise $D$ is unrelated to $p_d(Z)$ and selection into college admission tests is ignorable. Identification of treatment effects requires that treated and nontreated observations with the same selection propensity score are available, which is obviously only feasible if $Z$ shifts the selection probability. I.e., it must hold that $\pi_1(p_1(z')) = \pi_1(p_0(z''))$ for some values $z' \neq z''$. In general, $Z$ needs to be continuous for point identification.
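The requirement that treated and nontreated units share selection probabilities can be inspected mechanically when the instrument takes only two values. The sketch below builds the four selection probabilities for binary $D$ and $Z$ from additive shifts and reports which values occur in both treatment groups. The additive construction and all numbers are illustrative assumptions, not part of the identification argument.

```python
def selection_cells(base, shift_d, shift_z):
    """Selection probabilities Pr(S=1 | D=d, Z=z) for binary D and Z.

    The additive form base + shift_d * d + shift_z * z is a hypothetical
    choice for illustration; values are rounded to avoid float noise.
    """
    return {(d, z): round(base + shift_d * d + shift_z * z, 10)
            for d in (0, 1) for z in (0, 1)}


def common_support(cells):
    """Selection probabilities attained by both treated and nontreated."""
    treated = {p for (d, _), p in cells.items() if d == 1}
    nontreated = {p for (d, _), p in cells.items() if d == 0}
    return treated & nontreated


# D and Z shift the selection probability equally in the same direction.
same_direction = common_support(selection_cells(0.2, 0.1, 0.1))
# D and Z shift it equally in absolute terms but in opposite directions.
opposite_direction = common_support(selection_cells(0.3, 0.1, -0.1))
# Unequal shifts: no selection probability is shared across groups.
unequal_shifts = common_support(selection_cells(0.2, 0.1, 0.05))
print(same_direction, opposite_direction, unequal_shifts)
```

In the first two configurations only a single selection probability is shared, so overlap holds for part of the observed population at most; in the third there is no overlap at all.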
To gain some intuition, assume the converse that $Z$ is discrete and either 1 or 0. Let $\Pr(S = 1 \mid D = d, Z = z) = p_{dz}$. Then,

\[
\Pr(S = 1 \mid D, Z) = p_1(Z) \cdot D + p_0(Z) \cdot (1 - D)
= [p_{11} \cdot Z + p_{10} \cdot (1 - Z)] \cdot D + [p_{01} \cdot Z + p_{00} \cdot (1 - Z)] \cdot (1 - D).
\]

For the identification of the ATET, all treated observations with $p_{11}$ and $p_{10}$, respectively, have to be compared to non-treated units with equal selection probabilities. However, in general, $p_{11} \neq p_{10} \neq p_{01} \neq p_{00}$. Let us consider two special cases where at least some combinations of $D$ and $Z$ yield equal sample selection propensity scores among treated and nontreated. Firstly, let $D$ and $Z$ shift $p$ equally into the same direction, e.g., increase the selection probability. Then, $p_{10} = p_{01}$, but $p_{11} \neq p_{00}$ (and $p_{11} \neq p_{01}$) such that effects could only be point identified for a subpopulation. Secondly, let $D$ and $Z$ shift $p$ equally in absolute terms, but in opposite directions. Then, $p_{11} = p_{00}$, but $p_{10} \neq p_{01}$ (and $p_{10} \neq p_{00}$). Thus, even in special cases, point identification is infeasible for the entire observed population if $Z$ is binary. There is, loosely speaking, an empty cells problem with respect to the sample selection propensity score.

This is not necessarily true for the scenario considered throughout this paper, where $X$ needs to be conditioned on in $p(W) = \Pr(S = 1 \mid D, X, Z)$ and in $\Pr(D = d \mid X, p(W)) = \pi_d(X, p(W))$ for unconfoundedness. If $X$ is continuous and its range is sufficiently large, there may be common support in $p(W)$ for discrete $Z$. Even if this is not the case, there might still be common support in $\pi_d(X, p(W))$ if the continuous $X$ is sufficiently powerful in shifting $\pi_d(X, p(W))$. In the latter case identification fails if we match treated and nontreated observations directly on $X$ and $p(W)$ due to empty cells w.r.t. $p(W)$, but matching on $\pi_d(X, p(W))$ is feasible.

²⁷ The author thanks Josh Angrist, Michael Lechner, and Blaise Melly for comments motivating the following discussion.
This result is related to the dimensionality reduction argument in the selection on observables framework advocating propensity score matching rather than direct matching to avoid empty cells for particular combinations of covariate values.

In any case, whether $Z$ is continuous or discrete, it needs to be ‘sufficiently’ relevant²⁸ for $p$. To see this, reconsider the randomized framework with censored outcomes and assume the extreme case that $Z$ is not a relevant instrument for $p$ at all. Then, $\Pr(D = 1 \mid \Pr(S = 1 \mid D, Z)) = \Pr(D = 1 \mid \Pr(S = 1 \mid D))$ and nonparametric identification breaks down. The same holds true conditional on $X$, implying that $\Pr(D = 1 \mid X, \Pr(S = 1 \mid D, X, Z)) = \Pr(D = 1 \mid X, \Pr(S = 1 \mid D, X))$.

²⁸ Simulation methods may be used to investigate what ‘sufficiently relevant’ means in a particular scenario.

References

Abadie, A. & Imbens, G. (2006), ‘Large sample properties of matching estimators for average treatment effects’, Econometrica 74, 235–267.
Ahn, H. & Powell, J. (1993), ‘Semiparametric estimation of censored selection models with a nonparametric selection mechanism’, Journal of Econometrics 58, 3–29.
Andrews, D. & Schafgans, M. (1998), ‘Semiparametric estimation of the intercept of a sample selection model’, Review of Economic Studies 65, 497–517.
Angrist, J. (1997), ‘Conditional independence in sample selection models’, Economics Letters 54, 103–112.
Angrist, J., Bettinger, E. & Kremer, M. (2004), ‘Long-term educational consequences of secondary school vouchers: Evidence from administrative records in Colombia’, NBER Working Paper no. W10713.
Bhalotra, S. & Sanhueza, C. (2002), ‘Parametric and semi-parametric estimations of the return to schooling in South Africa’, unpublished manuscript.
Buchinsky, M. (1998), ‘The dynamics of changes in the female wage distribution in the USA: A quantile regression approach’, Journal of Applied Econometrics 13, 1–30.
Buchinsky, M. (2001), ‘Quantile regression with sample selection: Estimating women’s return to education in the U.S.’, Empirical Economics 26, 87–113.
Busso, M., DiNardo, J. & McCrary, J. (2008), ‘Finite sample properties of semiparametric estimators of average treatment effects’, unpublished manuscript.
Caliendo, M. & Kopeinig, S. (2008), ‘Some practical guidance for the implementation of propensity score matching’, Journal of Economic Surveys 22, 31–72.
Chamberlain, G. (1986), ‘Asymptotic efficiency in semiparametric models with censoring’, Journal of Econometrics 32, 189–218.
Cosslett, S. (1987), ‘Efficiency bounds for distribution-free estimators of the binary choice and censored regression models’, Econometrica 55, 559–585.
Cosslett, S. (1991), Distribution-free estimator of a regression model with sample selectivity, in W. Barnett, J. Powell & G. Tauchen, eds, ‘Nonparametric and semiparametric methods in econometrics and statistics’, Cambridge University Press, Cambridge, UK, pp. 175–198.
Das, M., Newey, W. & Vella, F. (2003), ‘Nonparametric estimation of sample selection models’, Review of Economic Studies 70, 33–58.
Firpo, S. (2007a), ‘Efficient semiparametric estimation of quantile treatment effects’, Econometrica 75, 259–276.
Firpo, S. (2007b), ‘Inequality treatment effects’, unpublished manuscript.
Fitzgerald, J., Gottschalk, P. & Moffitt, R. (1998), ‘An analysis of the impact of sample attrition on the second generation of respondents in the Michigan Panel Study of Income Dynamics’, Journal of Human Resources 33, 300–344.
Frölich, M. (2001), ‘Applied higher-dimensional nonparametric regression’, University of St. Gallen Discussion Paper no. 2001-12.
Frölich, M. (2006), ‘Non-parametric regression for binary dependent variables’, Econometrics Journal 9, 511–540.
Gabler, S., Laisney, F. & Lechner, M. (1993), ‘Seminonparametric estimation of binary-choice models with an application to labor-force participation’, Journal of Business & Economic Statistics 11, 61–80.
Gallant, A. & Nychka, D. (1987), ‘Semi-nonparametric maximum likelihood estimation’, Econometrica 55, 363–390.
Gerfin, M. (1996), ‘Parametric and semi-parametric estimation of the binary response model of labour market participation’, Journal of Applied Econometrics 11, 321–339.
Gronau, R. (1974), ‘Wage comparisons-a selectivity bias’, Journal of Political Economy 82, 1119–1143.
Hall, P., Racine, J. & Li, Q. (2004), ‘Cross-validation and the estimation of conditional probability densities’, Journal of the American Statistical Association 99, 1015–1026.
Han, A. (1987), ‘Non-parametric analysis of a generalized regression model: The maximum rank correlation estimator’, Journal of Econometrics 35, 303–316.
Hansen, L. (1982), ‘Large sample properties of generalized method of moment estimators’, Econometrica 50, 1029–1054.
Heckman, J. J. (1974), ‘Shadow prices, market wages and labor supply’, Econometrica 42, 679–694.
Heckman, J. J. (1976), ‘The common structure of statistical models of truncation, sample selection and limited dependent variables and a simple estimator for such models’, Annals of Economic and Social Measurement 5, 475–492.
Heckman, J. J. (1979), ‘Sample selection bias as a specification error’, Econometrica 47, 153–161.
Heckman, J. J. (1990), ‘Varieties of selection bias’, American Economic Review, Papers and Proceedings 80, 313–318.
Heckman, J. J. & Navarro-Lozano, S. (2004), ‘Using matching, instrumental variables, and control functions to estimate economic choice models’, The Review of Economics and Statistics 86, 30–57.
Heckman, J. J. & Vytlacil, E. (2005), ‘Structural equations, treatment effects, and econometric policy evaluation’, Econometrica 73, 669–738.
Heckman, J. J., Ichimura, H. & Todd, P. (1997), ‘Matching as an econometric evaluation estimator: Evidence from evaluating a job training programme’, Review of Economic Studies 64, 605–654.
Heckman, J. J., Ichimura, H. & Todd, P. (1998), ‘Matching as an econometric evaluation estimator’, Review of Economic Studies 65, 261–294.
Heckman, J. J., Urzua, S. & Vytlacil, E. (2006), ‘Understanding instrumental variables in models with essential heterogeneity’, The Review of Economics and Statistics 88, 389–432.
Hirano, K. & Imbens, G. W. (2004), The propensity score with continuous treatments, in A. Gelman & X. L. Meng, eds, ‘Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives’, New York: Wiley, pp. 73–84.
Hirano, K., Imbens, G. W. & Ridder, G. (2003), ‘Efficient estimation of average treatment effects using the estimated propensity score’, Econometrica 71, 1161–1189.
Horowitz, J. L. (1992), ‘A smoothed maximum score estimator for the binary response model’, Econometrica 60, 505–531.
Horvitz, D. G. & Thompson, D. J. (1952), ‘A generalization of sampling without replacement from a finite universe’, Journal of the American Statistical Association 47, 663–685.
Huber, M. & Melly, B. (2008), ‘Quantile regression in the presence of sample selection’, unpublished manuscript.
Ichimura, H. (1993), ‘Semiparametric least squares (SLS) and weighted SLS estimation of single-index models’, Journal of Econometrics 58, 71–120.
Ichino, A., Mealli, F. & Nannicini, T. (2008), ‘From temporary help jobs to permanent employment: what can we learn from matching estimators and their sensitivity?’, Journal of Applied Econometrics 23, 305–327.
Imbens, G. W. (2000), ‘The role of the propensity score in estimating dose-response functions’, Biometrika 87, 706–710.
Imbens, G. W. (2004), ‘Nonparametric estimation of average treatment effects under exogeneity: a review’, The Review of Economics and Statistics 86, 4–29.
Imbens, G. W. & Angrist, J. (1994), ‘Identification and estimation of local average treatment effects’, Econometrica 62, 467–475.
Khan, S. & Tamer, E. (2007), ‘Irregular identification, support conditions, and inverse weight estimation’, unpublished manuscript.
Kitagawa, T. (2008), ‘Testing for exclusion restriction in the selection model’, unpublished manuscript.
Klein, R. W. & Spady, R. H. (1993), ‘An efficient semiparametric estimator for binary response models’, Econometrica 61, 387–421.
Koenker, R. (2005), Quantile Regression, Cambridge University Press.
Koenker, R. & Bassett, G. (1978), ‘Regression quantiles’, Econometrica 46, 33–50.
Koenker, R. & Bassett, G. (1982), ‘Robust tests for heteroskedasticity based on regression quantiles’, Econometrica 50, 43–62.
Kumar, A. (2006), ‘Nonparametric conditional density estimation of labour force participation’, Applied Economics Letters 13, 835–841.
Lechner, M. (1999), ‘Earnings and employment effects of continuous off-the-job training in East Germany after unification’, Journal of Business and Economic Statistics 17, 74–90.
Lechner, M. (2001), Identification and estimation of causal effects of multiple treatments under the conditional independence assumption, in M. Lechner & F. Pfeiffer, eds, ‘Econometric Evaluations of Active Labor Market Policies in Europe’, Heidelberg: Physica.
Lechner, M. (2007), ‘A note on the relation of weighting and matching estimators’, University of St. Gallen Discussion Paper no. 2007-34.
Lechner, M. (2009), ‘Sequential causal models for the evaluation of labor market programs’, Journal of Business and Economic Statistics 27, 71–83.
Lechner, M. & Melly, B. (2007), ‘Earnings effects of training programs’, IZA Discussion Paper no. 2926.
Lee, D. S. (2005), ‘Training, wages, and sample selection: estimating sharp bounds on treatment effects’, NBER Working Paper no. W11721.
Li, Q., Racine, J. & Wooldridge, J. (2009), ‘Efficient estimation of average treatment effects with mixed categorical and continuous data’, forthcoming in the Journal of Business and Economic Statistics.
Manski, C. F. (1975), ‘Maximum score estimation of the stochastic utility model of choice’, Journal of Econometrics 3, 205–228.
Manski, C. F. (1989), ‘Anatomy of the selection problem’, The Journal of Human Resources 24, 343–360.
Manski, C. F. (1994), The selection problem, in C. Sims, ed., ‘Advances in Econometrics: Sixth World Congress’, Cambridge University Press, pp. 143–170.
Martins, M. F. O. (2001), ‘Parametric and semiparametric estimation of sample selection models: An empirical application to the female labour force in Portugal’, Journal of Applied Econometrics 16, 23–39.
Melenberg, B. & van Soest, A. (1996), ‘Parametric and semi-parametric modelling of vacation expenditures’, Journal of Applied Econometrics 11(1), 59–76.
Melly, B. (2006), ‘Estimation of counterfactual distributions using quantile regression’, unpublished manuscript.
Mroz, T. A. (1987), ‘The sensitivity of an empirical model of married women’s hours of work to economic and statistical assumptions’, Econometrica 55, 765–799.
Mulligan, C. B. & Rubinstein, Y. (2008), ‘Selection, investment, and women’s relative wages over time’, Quarterly Journal of Economics 123, 1061–1110.
Murphy, K. M. & Topel, R. H. (1985), ‘Estimation and inference in two-step econometric models’, Journal of Business and Economic Statistics 3, 88–97.
Newey, W. K. (1984), ‘A method of moments interpretation of sequential estimators’, Economics Letters 14, 201–206.
Newey, W. K. (1999), ‘Two-step series estimation of sample selection models’, MIT Working Papers no. 99-04.
Newey, W. K. (2007), ‘Nonparametric continuous/discrete choice models’, International Economic Review 48, 1429–1439.
Newey, W. K. & McFadden, D. (1994), Large sample estimation and hypothesis testing, in R. Engle & D. McFadden, eds, ‘Handbook of Econometrics’, Elsevier, Amsterdam.
Newey, W. K., Powell, J. L. & Walker, J. (1990), ‘Semiparametric estimation of selection models: Some empirical results’, American Economic Review 80, 324–328.
Pagan, A. & Ullah, A. (1999), Nonparametric Econometrics, Cambridge University Press, Cambridge.
Powell, J. (1987), ‘Semiparametric estimation of bivariate latent variable models’, unpublished manuscript, University of Wisconsin-Madison.
Robins, J. M. & Rotnitzky, A. (1995), ‘Semiparametric efficiency in multivariate regression models with missing data’, Journal of the American Statistical Association 90, 122–129.
Robins, J. M. & Rotnitzky, A. (1997), ‘Analysis of semi-parametric regression models with non-ignorable non-response’, Statistics in Medicine 16, 81–102.
Robins, J. M., Rotnitzky, A. & Zhao, L. P. (1995), ‘Analysis of semiparametric regression models for repeated outcomes in the presence of missing data’, Journal of the American Statistical Association 90, 106–121.
Robinson, P. M. (1988), ‘Root-n-consistent semiparametric regression’, Econometrica 56, 931–954.
Rosenbaum, P. & Rubin, D. B. (1983), ‘The central role of the propensity score in observational studies for causal effects’, Biometrika 70, 41–55.
Rubin, D. (1974), ‘Estimating causal effects of treatments in randomized and nonrandomized studies’, Journal of Educational Psychology 66, 688–701.
Rubin, D. B. (1990), ‘Formal modes of statistical inference for causal effects’, Journal of Statistical Planning and Inference 25, 279–292.
Sekhon, J. S. (2007), ‘Multivariate and propensity score matching software with automated balance optimization: The Matching package for R’, forthcoming in the Journal of Statistical Software.
Vella, F. (1998), ‘Estimating models with sample selection bias: A survey’, The Journal of Human Resources 33, 127–169.
Vytlacil, E. (2002), ‘Independence, monotonicity, and latent index models: An equivalence result’, Econometrica 70, 331–341.
Wooldridge, J. (2002), ‘Inverse probability weighted M-estimators for sample selection, attrition and stratification’, Portuguese Economic Journal 1, 141–162.
Wunsch, C. & Lechner, M. (2008), ‘What did all the money do? On the general ineffectiveness of recent West German labour market programmes’, Kyklos 61, 134–174.
Zhao, Z. (2008), ‘Sensitivity of propensity score methods to the specifications’, Economics Letters 98, 309–319.