NEW SEMIPARAMETRIC PAIRWISE DIFFERENCE ESTIMATORS FOR PANEL DATA SAMPLE SELECTION MODELS*

María Engracia Rochina-Barrachina¹

* I am grateful to Bo Honoré, Myoung-jae Lee and Frank Windmeijer for useful comments and suggestions. Thanks are also owed to participants at the Econometric Society European Meeting (ESEM), August/September 1999, Santiago de Compostela, Spain. Financial support from the Spanish foundation "Fundación Ramón Areces" is gratefully acknowledged. The usual disclaimer applies.
¹ Department of Economics, Universidad de Valencia and University College London.

Abstract

In this paper, estimation of the coefficients in a "double-index" panel data sample selection model is considered under the assumption that the selection function depends on the conditional means of some observable variables. We present two methods. The first is a "weighted double pairwise difference estimator", because it is based on the comparison of individuals in time differences. The second is a "single pairwise difference estimator", because only differences over time for each individual are required. The finite sample properties of the estimators are investigated by Monte Carlo experiments. Their advantages are that neither distributional assumptions nor a parametric selection mechanism are needed, and that heteroskedasticity over time is allowed for.

1. INTRODUCTION

In a panel data sample selection model where both the selection equation and the regression equation of interest may contain individual effects allowed to be correlated with the observable variables, Wooldridge (1995) proposed a method for correcting for selection bias. For a panel of two time periods, Kyriazidou (1997) proposes an estimator imposing weaker distributional assumptions. Also for two time periods, a more parametric approach that dispenses with some of the assumptions of the methods above has been developed by Rochina-Barrachina (1999). The last two estimators can easily be generalised to more than two time periods.

The method of Wooldridge (1995), because it is based on the estimation of a model in levels, needs the assumption of a linear projection of the individual effects in the equation of interest on the leads and lags of the explanatory variables. The other two methods overcome this problem by estimating a model in differences over time for a given individual: time differencing for the same individual eliminates the individual effects from the regression equation. The approach of Kyriazidou (1997) is the least parametric of the three, in the sense that the distribution of all unobservables is left unspecified and arbitrary correlation between individual effects and regressors is allowed. The price paid is an additional assumption, the so-called conditional exchangeability assumption for the errors in the model. This assumption allows for individual heteroskedasticity of unknown form, but it imposes homoskedasticity over time. The advantage of the estimator proposed by Rochina-Barrachina (1999) is that it allows the variance of the errors to vary over time, thereby relaxing the assumption that the errors for a given individual are homoskedastic. The price paid for this is the need to assume a trivariate normal distribution for the errors in differences in the main equation jointly with the composed errors in the selection rules for the two time periods being pair-differenced.
According to the results of a Monte Carlo investigation of the finite-sample properties of the Wooldridge (1995) and Kyriazidou (1997) estimators (Rochina-Barrachina, 1996), important sources of bias or lack of precision in the estimates come from misspecification problems related to the individual effects in the main equation and from violations of the conditional exchangeability assumption. Rochina-Barrachina's (1999) estimator avoids both factors, as can be seen in the Monte Carlo experiments presented in that work. However, the need to assume a trivariate normal distribution for the errors may call into question the robustness of the estimator against misspecification of the error distribution. The work in this paper has been developed with the aim of keeping the properties of Rochina-Barrachina's (1999) estimator while allowing for a free joint trivariate distribution.

In this paper, estimation of the coefficients in a "double-index" selectivity bias model is considered under the assumption that the selection correction function depends only on the conditional means of some observable selection variables. We present two alternative methods. The first follows the familiar two-step approach proposed by Heckman (1976, 1979) for selection models. The procedure first estimates, consistently and nonparametrically, the conditional means of the selection variables. In the second step we not only take pair differences for the same individual over time (to eliminate the individual effects, as in Kyriazidou (1997) and Rochina-Barrachina (1999)) but, after this, we also take pairwise differences across individuals to eliminate the sample selection correction term (the idea of pairwise differencing across individuals in a cross-section setting appears in Powell (1987) and Ahn and Powell (1993)). To the model resulting from this double differencing we apply a weighted least squares regression, with decreasing weights for pairs of individuals with larger differences in their "double index" variables, and hence larger differences in their selection correction terms.

The alternative method needs only pairwise differences over time for the same individual, but involves three steps. The first is identical to the corresponding step of the other method: we estimate nonparametrically the conditional means of the selection variables. In the second step we estimate, by nonparametric regression, the conditional means of pairwise differences in explanatory variables and of pairwise differences in dependent variables, given the selection variables (the "double index") estimated in the first step. The third step uses these nonparametric regression estimates to write a model in the spirit of the semiparametric regression model of Robinson (1988), which is estimated by OLS.

The paper is organised as follows. Section 2 describes the model, discusses some related identification issues, and reviews the assumptions on the sample selection correction terms in the available difference estimators for panel data sample selection models. Section 3 presents the new estimators. Section 4 reports results of a small Monte Carlo simulation study of their finite sample performance. In Section 5 we show the link between the two estimators. Section 6 gives concluding remarks, and the Appendices provide formulae for the asymptotic variance-covariance matrices.

2. THE MODEL AND THE AVAILABLE ESTIMATORS

Our object of study is a panel data sample selection model.
In this model we are interested in the estimation of the regression coefficients β in the equation

    y_it = x_it β + α_i + ε_it ,   i = 1,...,N ,  t = 1,...,T ,                        (2.1)

    d*_it = f_t(z_i) − c_i − u_it ,   d_it = 1{d*_it ≥ 0} ,                            (2.2)

where z_i = (z_i1,...,z_iT). Here x_it and z_i are vectors of explanatory variables (which may have components in common), ε_it and u_it are unobserved disturbances, the α_i are individual-specific effects allowed to be correlated with the explanatory variables x_i, and the c_i are individual-specific effects uncorrelated with z_i. Whether or not observations for y_it are available is denoted by the dummy variable d_it.

In (2.2) there is no need to impose any parametric assumption on the form of the selection indicator index f_t(z_i). In fact, by assuming that it depends on all the leads and lags of an F-dimensional vector of conditioning variables z, we allow for an individual effects structure correlated with the explanatory variables and/or for sample selection indices with a lagged endogenous variable as explanatory variable. This flexibility is convenient because, although the form of this function may not be derived from some underlying behavioural model, the set of conditioning variables which govern the selection probability may be known in advance. Like misspecification of the parametric form of the selection function, misspecification of the parametric form of the index function results in general in inconsistent estimators of the coefficients in the equation of interest, as pointed out by Ahn and Powell (1993).

Time differencing the observational equation (2.1) for those observations which have d_it = d_is = 1 (s ≠ t), we get

    y_it − y_is = (x_it − x_is)β + (ε_it − ε_is).                                      (2.3)

It might be the case that we do not want to specify any selection indicator function, but just want to assume that selection depends on a (T·F)-vector z_i. In this case, by assuming that (ε_it − ε_is) is mean independent of x_it, x_is, z_i conditional on d_it = d_is = 1, the expectation of (ε_it − ε_is) conditional on selection (i.e. d_it = d_is = 1) is a function of z_i only, so that the expectation of (y_it − y_is) conditional on selection takes the form

    E[y_it − y_is | x_it, x_is, z_i, d_it = d_is = 1]
      = (x_it − x_is)β + E[ε_it − ε_is | x_it, x_is, z_i, d_it = d_is = 1]
      = (x_it − x_is)β + λ_ts(z_i),                                                    (2.4)

and consequently a selection-corrected regression equation for (y_it − y_is) is given by

    y_it − y_is = (x_it − x_is)β + λ_ts(z_i) + (e_it − e_is),                          (2.5)

where we have taken out of the error term (ε_it − ε_is) in (2.3) its conditional mean λ_ts(z_i), which is driven by sample selection. Thus E[(e_it − e_is) | x_it, x_is, z_i, d_it = d_is = 1] = 0 by construction, and λ_ts(·) is an unknown function of the (T·F)-vector z_i.

Equation (2.5) provides insight concerning identification. Notice that if some linear combination (x_it − x_is)γ of (x_it − x_is) were equal to a function of z_i, then there would be asymptotically perfect multicollinearity among the variables on the right-hand side of equation (2.5), and β could not be estimated from a regression of observed (y_it − y_is) on (x_it − x_is) and the unknown function λ_ts(·) of z_i. The reason is that any approximation to the unknown function λ_ts(·) will also be able to approximate the linear combination of (x_it − x_is), resulting in asymptotic perfect multicollinearity.
To guarantee that, for any nontrivial γ, there is no measurable function Γ(z_i) such that (x_it − x_is)γ = Γ(z_i), we need to impose the following identification assumption:

Assumption 1: E{ d_it d_is [(x_it − x_is) − E(x_it − x_is | z_i)]′ [(x_it − x_is) − E(x_it − x_is | z_i)] } is non-singular; i.e. for any γ ≠ 0 there is no measurable function Γ(z_i) such that (x_it − x_is)γ = Γ(z_i).

Accordingly, identification of β requires the strong exclusion restriction that none of the components of (x_it, x_is) can be an exact linear combination of components of z_i. This implies that (x_it, x_is) and z_i cannot have any components in common. As in sample selection models individual components of the vector z_i typically appear in the vector of regressors x_it, x_is of the main equation, we are interested in structures for the selection correction component that permit identification in this situation. If we do not want identification to rely on strong exclusion restrictions, we should impose more structure on λ_ts(z_i) for the stochastic restriction E[(e_it − e_is) | x_it, x_is, z_i, d_it = d_is = 1] = 0 to identify β. In the literature there are different ways to impose this structure for models with sample selection. The restricted form of the selection correction in (2.5) is typically derived by imposing restrictions on the behaviour of the indicator variables d_iτ (τ = t, s) given z_i; that is, the indicator variables d_iτ are assumed to depend upon f_τ(z_i) through the binary response model in (2.2). In the remainder of this section we review this literature, so as to clarify the contribution of the methods proposed in Section 3. The following classification corresponds to different degrees of distributional assumptions on the unobservables of the model, and to whether or not a parametric form is imposed on the index function in the selection equation.

Different structures on the form of the selection correction λ_ts(z_i)

Case A. One way of imposing more structure on the form of the selection correction λ_ts(z_i) is as follows:

    λ_ts(z_i) = E[ε_it − ε_is | x_it, x_is, z_i, d_it = d_is = 1]
      = E[ε_it − ε_is | x_it, x_is, z_i, c_i + u_it ≤ f_t(z_i), c_i + u_is ≤ f_s(z_i)]
      = E[ε_it − ε_is | x_it, x_is, z_i, c_i + u_it ≤ f(z_i, γ_t), c_i + u_is ≤ f(z_i, γ_s)]
      = Λ{ f(z_i, γ_t), f(z_i, γ_s); F_3[(ε_it − ε_is), (c_i + u_it), (c_i + u_is) | x_it, x_is, z_i] }
      = Λ{ f(z_i, γ_t), f(z_i, γ_s) },                                                 (2.6)

where the function Λ{·,·} is unknown and the f(·,·) are scalar single-index functions of known parametric form (which can be linear, but not necessarily). The joint conditional distribution function F_3 of the error terms (ε_it − ε_is), (c_i + u_it), (c_i + u_is) given x_it, x_is, z_i depends only upon the double index {f(z_i, γ_t), f(z_i, γ_s)}. A consequence of ignorance concerning the form of this distribution is that the functional form of Λ{·,·} is unknown.

The selection correction term λ_ts(z_i) can be written as in (2.6) when [(ε_it − ε_is), (c_i + u_it), (c_i + u_is)] is independent of x_it, x_is, z_i, or alternatively, when (c_i + u_it), (c_i + u_is) are independent of x_it, x_is, z_i and (ε_it − ε_is) is mean independent of x_it, x_is, z_i conditional on (c_i + u_it), (c_i + u_is). The conditional mean independence assumption always holds if [(ε_it − ε_is), (c_i + u_it), (c_i + u_is)] is independent of x_it, x_is, z_i, but we do not require (ε_it − ε_is) itself to be independent of x_it, x_is, z_i. Under either of the two alternative sets of assumptions, the expectation of (ε_it − ε_is) conditional on selection
(i.e. d_it = d_is = 1) is a function only of {f(z_i, γ_t), f(z_i, γ_s)}, so that the expectation of (y_it − y_is) conditional on selection takes the form

    E[y_it − y_is | x_it, x_is, z_i, d_it = d_is = 1] = (x_it − x_is)β + Λ{f(z_i, γ_t), f(z_i, γ_s)}.   (2.7)

The selection-corrected regression equation for (y_it − y_is) is given by

    y_it − y_is = (x_it − x_is)β + Λ{f(z_i, γ_t), f(z_i, γ_s)} + e_its ,               (2.8)

with E[e_its | x_it, x_is, z_i, d_it = d_is = 1] = 0. We need the following identification assumption for β to be identified in (2.8):

Assumption 2: E{ d_it d_is [(x_it − x_is) − E(x_it − x_is | f(z_i, γ_t), f(z_i, γ_s))]′ [(x_it − x_is) − E(x_it − x_is | f(z_i, γ_t), f(z_i, γ_s))] } is non-singular; i.e. for any γ ≠ 0 there is no measurable function Γ(f(z_i, γ_t), f(z_i, γ_s)) such that (x_it − x_is)γ = Γ(f(z_i, γ_t), f(z_i, γ_s)).

Now we incorporate more structure on λ_ts(z_i) by adding, as extra identifying information, that the distribution of the indicators d_iτ (τ = t, s) depends on the double index {f(z_i, γ_t), f(z_i, γ_s)}. The double index structure of the selection correction permits identification even when individual components of the conditioning vector z_i appear in the regressors x_it, x_is.

Case B. A fully standard parametric approach applied to (2.6) leads to

    λ_ts(z_i) = E[ε_it − ε_is | x_it, x_is, z_i, d_it = d_is = 1]
      = E[ε_it − ε_is | x_it, x_is, z_i, c_i + u_it ≤ f_t(z_i), c_i + u_is ≤ f_s(z_i)]
      = E[ε_it − ε_is | x_it, x_is, z_i, c_i + u_it ≤ z_i γ_t, c_i + u_is ≤ z_i γ_s]
      = Λ{ z_i γ_t, z_i γ_s; Φ_3[(ε_it − ε_is), (c_i + u_it), (c_i + u_is) | x_it, x_is, z_i] }
      = Λ{ z_i γ_t, z_i γ_s },                                                         (2.9)

where the f(·) are scalar aggregators of linear parametric form in the selection equation, and we have imposed strong stochastic restrictions by specifying the joint conditional distribution function F_3 of the error terms (ε_it − ε_is), (c_i + u_it), (c_i + u_is) given x_it, x_is, z_i as a trivariate normal distribution function Φ_3. Under these parametric assumptions, the form of the selection term, to be added as an additional regressor to the differenced equation in (2.3), can be worked out (see Rochina-Barrachina (1999)). Under this fully parametric approach, the estimation method developed in Rochina-Barrachina (1999) is a two-step estimator. The method eliminates the individual effects from the equation of interest by taking time differences, conditional on observability of the individual in two time periods. Two correction terms, whose form depends upon the linear scalar aggregator function and the trivariate normal joint distribution assumed for the unobservables of the model, are worked out. Given consistent first-step estimates of these terms, simple least squares on the equation of interest can be used to obtain consistent estimates of β in the second step. Because of the linearity assumption on f(·), the estimator under Case B corresponds to the so-called "More parametric new estimator" in Rochina-Barrachina (1999).

Case C. Relaxing the parametric form imposed in Case B on the index functions f(·, γ_τ),
we get

    λ_ts(z_i) = E[ε_it − ε_is | x_it, x_is, z_i, d_it = d_is = 1]
      = E[ε_it − ε_is | x_it, x_is, z_i, c_i + u_it ≤ f_t(z_i), c_i + u_is ≤ f_s(z_i)]
      = E[ε_it − ε_is | x_it, x_is, z_i, c_i + u_it ≤ F⁻¹[h_t(z_i)], c_i + u_is ≤ F⁻¹[h_s(z_i)]]
      = Λ{ F⁻¹[h_t(z_i)], F⁻¹[h_s(z_i)]; F_3[(ε_it − ε_is), (c_i + u_it), (c_i + u_is) | x_it, x_is, z_i] }
      = Λ{ Φ⁻¹[h_t(z_i)], Φ⁻¹[h_s(z_i)]; Φ_3[(ε_it − ε_is), (c_i + u_it), (c_i + u_is) | x_it, x_is, z_i] },   (2.10)

where the selection indicator indices f_τ(·), τ = t, s, are unknown and of unrestricted form. We have still imposed, as in Case B, strong stochastic restrictions by specifying the joint conditional distribution of the errors (ε_it − ε_is), (c_i + u_it), (c_i + u_is) given x_it, x_is, z_i as trivariate normal. The values of these semiparametric indices in the selection equation are recovered by applying the inversion rule f_t(z_i) = Φ⁻¹[h_t(z_i)] and f_s(z_i) = Φ⁻¹[h_s(z_i)], where the conditional expectations h_τ(z_i) = E(d_iτ | z_i), τ = t, s, are replaced with nonparametric estimators ĥ_τ(z_i) = Ê(d_iτ | z_i), such as kernel estimators. Given the unrestricted treatment of the functions f_τ(·) in (2.2), the estimator under Case C corresponds to the three-step estimator called the "Less parametric new estimator" in Rochina-Barrachina (1999).

For both Case B and Case C, although Rochina-Barrachina's (1999) estimators are based upon an independence assumption whereby [(ε_it − ε_is), (c_i + u_it), (c_i + u_is)]′ is independent of x_it, x_is, z_i, with joint normality of the error terms, for the methods to work it is sufficient to have: a) marginal normality of (c_i + u_it) and (c_i + u_is), and consequently joint normality of (c_i + u_it) and (c_i + u_is); b) independence of (c_i + u_it) and (c_i + u_is) from x_it, x_is, z_i; c) conditional mean independence of (ε_it − ε_is) from x_it, x_is, z_i once we condition on (c_i + u_it) and (c_i + u_is); and d) a linear projection of (ε_it − ε_is) on [(c_i + u_it), (c_i + u_is)]. Furthermore, the normality of (c_i + u_it) and (c_i + u_is) could be relaxed in favour of some other distributional assumption, but it can then be difficult to give a closed form for the sample selection correction term, as is possible in the normal case. Under a), b), c) and d),

    E(ε_it − ε_is | ν_its) = ψ′_its ν_its ,  with  ν_its = {(c_i + u_it), (c_i + u_is)}′,   (2.11)

where ψ_its = (ψ_ts, ψ_st)′ = E⁻¹(ν_its ν′_its) E((ε_it − ε_is) ν_its). Then the selection bias is

    E(ε_it − ε_is | c_i + u_it ≤ f_t(z_i), c_i + u_is ≤ f_s(z_i)) = ψ′_its E(ν_its | c_i + u_it ≤ f_t(z_i), c_i + u_is ≤ f_s(z_i)),   (2.12)

an expression which can be worked out with the results for a truncated normal distribution in Tallis (1961), and which leads to the same sample selection correction terms as in Rochina-Barrachina (1999) under full joint normality. Rochina-Barrachina's (1999) estimators (Case B and Case C) do not technically require exclusion restrictions.

Case D. Assume the selection indices are linear, with individual effects η_i in the selection equation, so that selection is governed by u_iτ ≤ z_iτ γ − η_i. Then

    λ_ts(z_it, z_is) = E[ε_it − ε_is | x_it, x_is, z_it, z_is, η_i, α_i, d_it = d_is = 1]
      = E[ε_it | x_it, x_is, z_it, z_is, η_i, α_i, u_it ≤ z_it γ − η_i, u_is ≤ z_is γ − η_i]
        − E[ε_is | x_it, x_is, z_it, z_is, η_i, α_i, u_is ≤ z_is γ − η_i, u_it ≤ z_it γ − η_i]
      = Λ{ z_it γ − η_i, z_is γ − η_i; F_3[ε_it, u_it, u_is | x_it, x_is, z_it, z_is, η_i, α_i] }
        − Λ{ z_is γ − η_i, z_it γ − η_i; F_3[ε_is, u_is, u_it | x_it, x_is, z_it, z_is, η_i, α_i] } = 0,   (2.13)

where the equality to zero holds if z_it γ = z_is γ and F_3[ε_it, u_it, u_is | x_it, x_is, z_it, z_is, η_i, α_i] = F_3[ε_is, u_is, u_it | x_it, x_is, z_it, z_is, η_i, α_i].
There are no prior distributional assumptions on the unobserved error components, but they are subject to the joint conditional exchangeability assumption above. The idea of imposing these conditions, under which first differencing for a given individual not only eliminates the individual effects in the main equation but also the sample selection effects, is exploited by the estimator developed by Kyriazidou (1997). Conditional on a given individual, the estimation method is developed independently of the individual effects in the selection equation. For this reason we do not need to consider explicitly, parametrically or nonparametrically, the correlation between the individual effects in that equation and the explanatory variables. In Kyriazidou's (1997) model, identification of β requires E[(x_t − x_s)′(x_t − x_s) d_t d_s | (z_t − z_s)γ = 0] to be finite and non-singular. Given that we require support of (z_t − z_s)γ at zero, non-singularity requires an exclusion restriction on the set of regressors, namely that at least one of the variables in z_it is not contained in x_it.

Summarising, Rochina-Barrachina's (1999) approach imposes strong stochastic restrictions by specifying the joint conditional distribution of the error terms (ε_it − ε_is), (c_i + u_it), (c_i + u_is) as trivariate normal. Under this assumption, "sample selectivity regressors" that asymptotically purge the equation of interest of its selectivity bias can be computed, and the corrected model can be estimated by OLS on the selected subsample of individuals observed in the two time periods. However, if the joint distribution of the error terms is misspecified, then the estimator of β will in general be inconsistent. The semiparametric method developed by Kyriazidou (1997) relaxes the assumption of a known parametric form of the joint distribution, but imposes a parametric form for the index function f_t(·) and the so-called "joint conditional exchangeability" assumption for the time-varying errors in the model. The two semiparametric methods for panel data sample selection models proposed in this paper avoid these limitations of the available methods.

3. THE PROPOSED ESTIMATORS

3.1. Weighted Double Pairwise Difference Estimator (WDPDE)

We assume here that the conditional mean of the differenced error term (ε_it − ε_is) = (y_it − y_is) − (x_it − x_is)β in (2.3) depends on x_it, x_is, z_i only through h_t(z_i), h_s(z_i), where the two indices h_t(z_i), h_s(z_i) are probabilities defined as

    h_t(z_i) = Pr(d_it = 1 | z_i) = E(d_it | z_i) = E(d_it | x_it, x_is, z_i),
    h_s(z_i) = Pr(d_is = 1 | z_i) = E(d_is | z_i) = E(d_is | x_it, x_is, z_i).          (3.1)

In this case the differenced error term of the main equation satisfies the mean double-index restriction

    E[ε_it − ε_is | x_it, x_is, z_i, d_it = d_is = 1]
      = E[ε_it − ε_is | h_t(z_i), h_s(z_i), d_it = d_is = 1]
      = λ[h_t(z_i), h_s(z_i)]  a.s.                                                    (3.2)

According to this assumption, the sample selection correction term to be included in (2.3) must be a continuous function of the probabilities in (3.1); the regressors x_it, x_is and z_i should not enter the correction term separately. Although it is not explicit in (3.2), outside the case of independence between the error terms (ε_it − ε_is), (c_i + u_it), (c_i + u_is) and the regressors, it is unlikely that the assumption holds.
Precisely because we assume this independence, the assumption of a known parametric form z_i γ_τ, τ = t, s, for the indices in the selection equation can be relaxed in this approach. Under the index restriction in (3.2) we consider estimation of the parameter vector β of a "double-index, partially linear" model of the form

    y_it − y_is = (x_it − x_is)β + λ[h_t(z_i), h_s(z_i)] + (e_it − e_is),               (3.3)

where λ(·,·) is an unknown, smooth function of two scalar, unobservable "indices" h_t(z_i), h_s(z_i). It follows from (3.2) that the error term (e_it − e_is) has, by construction, conditional mean zero:

    E[e_it − e_is | x_it, x_is, z_i, d_it = d_is = 1]
      = E[e_it − e_is | h_t(z_i), h_s(z_i), d_it = d_is = 1] = 0  a.s.                 (3.4)

We need some method for estimating the unobservable conditional expectation terms in (3.1). A natural choice is the nonparametric kernel method

    ĥ_t(z_i) = Σ_{l=1}^N K_il d_lt / Σ_{l=1}^N K_il ,   ĥ_s(z_i) = Σ_{l=1}^N K_il d_ls / Σ_{l=1}^N K_il ,
    K_il ≡ K((z_i − z_l)/g_1N).                                                        (3.5)

It is interesting to see what happens if we base estimation just on (3.2), (3.3) and (3.4), that is, if we develop an estimator that relies only on differences over time for a given individual. The result is that, even under the new set-up of conditioning on probabilities, we cannot avoid the "exchangeability" assumption of Kyriazidou's (1997) method. To see this, decompose the conditional mean of the differenced error in (3.2) into two terms:

    E[ε_it − ε_is | x_it, x_is, z_i, d_it = d_is = 1]
      = E[ε_it | x_it, x_is, z_i, d_it = d_is = 1] − E[ε_is | x_it, x_is, z_i, d_it = d_is = 1]
      = λ{h_t(z_i), h_s(z_i); F[ε_it, (c_i + u_it), (c_i + u_is) | x_it, x_is, z_i]}
        − λ{h_s(z_i), h_t(z_i); F[ε_is, (c_i + u_is), (c_i + u_it) | x_it, x_is, z_i]}
      = λ_its − λ_ist ,                                                                (3.6)

where for λ_its = λ_ist we need h_t(z_i) = h_s(z_i) and the "conditional exchangeability" assumption

    F[ε_it, ε_is, (c_i + u_it), (c_i + u_is) | x_it, x_is, z_i]
      ≡ F[ε_is, ε_it, (c_i + u_is), (c_i + u_it) | x_it, x_is, z_i].                   (3.7)

It is important to notice that this "conditional exchangeability" assumption implies, for the first-step estimator, the conditional stationarity assumption

    F_{(c_i + u_it) | x_it, x_is, z_i} ≡ F_{(c_i + u_is) | x_it, x_is, z_i}.           (3.8)

Estimation methods compatible with this condition are the conditional maximum score estimator (Manski, 1987), the conditional smoothed maximum score estimator (Kyriazidou, 1994; Charlier, Melenberg, and van Soest, 1995), and the conditional maximum likelihood estimator (Chamberlain, 1980). All these methods are independent of the individual fixed effects in a structural sample selection equation, which is why (3.8) can be rewritten as

    F_{u_it | x_it, x_is, z_i, c_i} ≡ F_{u_is | x_it, x_is, z_i, c_i}
      ⟺ F_{u_it | x_it, x_is, z_it, z_is, η_i} ≡ F_{u_is | x_it, x_is, z_it, z_is, η_i}.   (3.9)

The use of these methods implies a linearity assumption for the index in the selection rule, according to which f_t(z_i) in (2.2) is assumed to equal z_it γ − η_i + c_i. According to Ahn and Powell (1993), if the latent regression function is linear, conditioning on probabilities is equivalent to conditioning on z_iτ γ, τ = t, s. Given the known parametric form of the selection indices, we would no longer need to assume independence between the error terms and the regressors; anticipating this result, we kept the regressors in the conditioning set of (3.6). Under the linearity assumption, identification requires some component of z to be excluded from x.
Conversely, if the true latent regression function is non-linear in z, we have identification even without exclusion restrictions, because the non-linear terms are implicitly excluded from the regression function of interest. We would then end up with the method developed by Kyriazidou (1997), where (3.6) is rewritten using as indices not the probabilities but the linear indices z_it γ, z_is γ, and where in both (3.6) and (3.7) the conditioning set is x_it, x_is, z_it, z_is in place of x_it, x_is, z_i. In that setting it is necessary to assume that a root-n-consistent estimator γ̂ of the true γ is available, something that is not needed in our approach.

In sample selection models with cross-section data, pairs of observations are constructed across individuals. To date, in panel data sample selection models they have been constructed not across individuals but over time for the same individual (Kyriazidou (1997)). In our approach, the pairs of observations are constructed across individuals in differences over time. The motivation of the method is both to eliminate the individual effects and to get rid of the sample selection problem. The drawback of Kyriazidou's (1997) estimator is that elimination of the sample selection effects needed the so-called "joint conditional exchangeability assumption". In our method, given a pair of observations {[(y_it − y_is), (x_it − x_is)], [(y_jt − y_js), (x_jt − x_js)]} with d_it = d_is = 1, d_jt = d_js = 1, characterised by the vector h_its ≡ (h_it, h_is) = (h_jt, h_js) ≡ h_jts,

    [(y_it − y_is) − (y_jt − y_js)] = [(x_it − x_is) − (x_jt − x_js)]β
      + {λ[E(d_it | z_i), E(d_is | z_i)] − λ[E(d_jt | z_j), E(d_js | z_j)]}
      + [(e_it − e_is) − (e_jt − e_js)]
      = [(x_it − x_is) − (x_jt − x_js)]β + [(e_it − e_is) − (e_jt − e_js)],             (3.10)

where we have assumed

    E{[(y_it − y_is) − (y_jt − y_js)] − [(x_it − x_is) − (x_jt − x_js)]β | (h_t(z_i), h_s(z_i)) = (h_t(z_j), h_s(z_j)), d_it = d_is = 1, d_jt = d_js = 1}
      = λ[E(d_it | z_i), E(d_is | z_i)] − λ[E(d_jt | z_j), E(d_js | z_j)],              (3.11)

which vanishes when the two vectors of indices coincide, and by construction

    E[(e_it − e_is) − (e_jt − e_js) | (h_t(z_i), h_s(z_i)) = (h_t(z_j), h_s(z_j)), d_it = d_is = 1, d_jt = d_js = 1] = 0.   (3.12)

Closeness of the vectors of conditional means will be weighted by the bivariate kernel weights

    ω̂_ijts ≡ (1/g²_2N) k((ĥ_its − ĥ_jts)/g_2N) d_it d_is d_jt d_js .                  (3.13)

The estimator will be of the form

    β̂ = [Ŝ_xx]⁻¹ Ŝ_xy ,
    Ŝ_xx ≡ [N(N−1)/2]⁻¹ Σ_{i<j} ω̂_ijts [(x_it − x_is) − (x_jt − x_js)]′ [(x_it − x_is) − (x_jt − x_js)],
    Ŝ_xy ≡ [N(N−1)/2]⁻¹ Σ_{i<j} ω̂_ijts [(x_it − x_is) − (x_jt − x_js)]′ [(y_it − y_is) − (y_jt − y_js)].   (3.14)

Thus our WDPDE is defined with a closed-form solution that comes from a weighted least squares regression of the distinct differences (y_it − y_is) − (y_jt − y_js) in dependent variables on the distinct differences (x_it − x_is) − (x_jt − x_js) in regressors, using the ω̂_ijts of (3.13) as bivariate kernel weights. We only have to include pairs of observations for individuals observed in two time periods, and we have to exclude (asymptotically downweight) pairs of individuals for which h_its ≠ h_jts. A minimal computational sketch of both steps follows.
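To fix ideas, the two steps just described can be written in a few lines of Python. This is a sketch, not the paper's code: a product Gaussian kernel stands in for the higher-order bias-reducing kernels used in Section 4, and all function names and array layouts are illustrative.

```python
import numpy as np

def gaussian_kernel(u):
    """Product Gaussian kernel, applied along the last axis of u."""
    return np.exp(-0.5 * u**2).prod(axis=-1) / (2 * np.pi) ** (u.shape[-1] / 2)

def first_step_probs(z, d, g1):
    """Leave-one-out kernel estimates of h_t(z_i) = E(d_it | z_i), as in (3.5).

    z : (N, T*F) conditioning variables, d : (N,) selection dummies for period t,
    g1 : first-step bandwidth g_1N.
    """
    N = z.shape[0]
    h = np.empty(N)
    for i in range(N):
        w = gaussian_kernel((z[i] - z) / g1)
        w[i] = 0.0                      # leave observation i out
        h[i] = np.dot(w, d) / w.sum()
    return h

def wdpde(dy, dx, h_ts, d_both, g2):
    """Weighted double pairwise difference estimator, eqs. (3.13)-(3.14).

    dy : (N,) time differences y_it - y_is, dx : (N, k) differences x_it - x_is,
    h_ts : (N, 2) estimated (h_t, h_s) per individual, d_both : (N,) products d_it*d_is,
    g2 : second-step bandwidth g_2N.  The binomial normaliser in (3.14) cancels
    between S_xx and S_xy and is therefore omitted.
    """
    N, k = dx.shape
    Sxx = np.zeros((k, k))
    Sxy = np.zeros(k)
    for i in range(N - 1):
        for j in range(i + 1, N):
            w = gaussian_kernel((h_ts[i] - h_ts[j]) / g2) / g2**2
            w *= d_both[i] * d_both[j]                  # weight (3.13)
            ddx = dx[i] - dx[j]                         # double difference in regressors
            Sxx += w * np.outer(ddx, ddx)
            Sxy += w * ddx * (dy[i] - dy[j])
    return np.linalg.solve(Sxx, Sxy)
```

The weights shrink towards zero as the two individuals' estimated index vectors move apart, which is exactly how pairs with h_its ≠ h_jts are excluded in the limit.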
The advantages of this estimator are the following. No distributional assumptions for the error terms are needed, in contrast with the estimators in Rochina-Barrachina (1999) or Wooldridge (1995), and no "time-reversibility" or "conditional exchangeability" assumption is needed, in contrast with Kyriazidou (1997). We do not need conditions for a given individual over time to eliminate the selection terms, but conditions among individuals in time differences. For comparability with Kyriazidou's (1997) notation, we can write

    F[(ε_it − ε_is), (ε_jt − ε_js), (c_i + u_it), (c_i + u_is), (c_j + u_jt), (c_j + u_js) | h_t(z_i), h_s(z_i), h_t(z_j), h_s(z_j)]
      ≡ F[(ε_jt − ε_js), (ε_it − ε_is), (c_j + u_jt), (c_j + u_js), (c_i + u_it), (c_i + u_is) | h_t(z_i), h_s(z_i), h_t(z_j), h_s(z_j)].   (3.15)

We require (ε_it − ε_is), (c_i + u_it), (c_i + u_is) to be i.i.d. across individuals and independent of the individual-specific vector x_it, x_is, z_i. In other words, we cannot allow the functional form of F, or λ, to vary across individuals. This is crucial to our method for eliminating the sample selection effect. It is not required that (ε_it − ε_is), (c_i + u_it), (c_i + u_is) be i.i.d. across time for the same individual: the functional form of F, or λ, can vary across time.

3.2. Single Pairwise Difference Estimator (SPDE)

We generalise Robinson (1988) to the case of panel data sample selection models. In a model like the one in (3.3),

    y_it − y_is = (x_it − x_is)β + λ[h_t(z_i), h_s(z_i)] + (e_it − e_is),               (3.16)

we have already eliminated the individual effects in the main regression by taking time differences for a given individual. First, we can estimate the two indices h_t(z_i), h_s(z_i), which correspond to the probabilities defined in (3.1), with the same nonparametric kernel estimator (3.5). Second, we take expectations conditional on the probability indices and on observability in the two time periods to get

    E(y_it − y_is | h_t(z_i), h_s(z_i), d_it = d_is = 1)
      = E(x_it − x_is | h_t(z_i), h_s(z_i), d_it = d_is = 1)β + λ[h_t(z_i), h_s(z_i)].   (3.17)

To get rid of the selection bias in (3.16), we subtract from (3.16) its conditional expectation (3.17), obtaining the "centred" equation

    (y_it − y_is) − E(y_it − y_is | h_t(z_i), h_s(z_i), d_it = d_is = 1)
      = {(x_it − x_is) − E(x_it − x_is | h_t(z_i), h_s(z_i), d_it = d_is = 1)}β + (e_it − e_is).   (3.18)

In the second step we insert in (3.18) the nonparametric kernel regression estimators of E(y_it − y_is | ĥ_t(z_i), ĥ_s(z_i), d_it = d_is = 1) and E(x_it − x_is | ĥ_t(z_i), ĥ_s(z_i), d_it = d_is = 1). Specifically, an estimated value of these conditional means can be constructed by fitting a kernel regression. Using the same kernel as in (3.5) above, the estimated conditional means are of the form

    E_N(y_it − y_is | ĥ_t(z_i), ĥ_s(z_i), d_it = d_is = 1) = Σ_{j≠i} ω̂_ijts (y_jt − y_js) / Σ_{j≠i} ω̂_ijts ,
    E_N(x_it − x_is | ĥ_t(z_i), ĥ_s(z_i), d_it = d_is = 1) = Σ_{j≠i} ω̂_ijts (x_jt − x_js) / Σ_{j≠i} ω̂_ijts ,
    ω̂_ijts ≡ (1/g²_2N) k((ĥ_its − ĥ_jts)/g_2N) d_it d_is d_jt d_js .                  (3.19)

Finally, in the third step, we apply a least squares regression of the differences (y_it − y_is) − E_N(y_it − y_is | ĥ_t(z_i), ĥ_s(z_i), d_it = d_is = 1) on the differences in regressors (x_it − x_is) − E_N(x_it − x_is | ĥ_t(z_i), ĥ_s(z_i), d_it = d_is = 1) to get

    β̂ = [Ŝ_xx]⁻¹ Ŝ_xy ,
    Ŝ_xx ≡ Σ_{i=1}^N d_it d_is {(x_it − x_is) − E_N(x_it − x_is | ĥ_t(z_i), ĥ_s(z_i), d_it = d_is = 1)}′
           {(x_it − x_is) − E_N(x_it − x_is | ĥ_t(z_i), ĥ_s(z_i), d_it = d_is = 1)},
    Ŝ_xy ≡ Σ_{i=1}^N d_it d_is {(x_it − x_is) − E_N(x_it − x_is | ĥ_t(z_i), ĥ_s(z_i), d_it = d_is = 1)}′
           {(y_it − y_is) − E_N(y_it − y_is | ĥ_t(z_i), ĥ_s(z_i), d_it = d_is = 1)}.   (3.20)

A sketch of this three-step procedure follows.
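The SPDE can be sketched along the same lines, reusing `gaussian_kernel` and the first-step probabilities from the previous block. Again this is an illustration under the same assumptions, not the paper's implementation.

```python
import numpy as np  # gaussian_kernel as defined in the WDPDE sketch above

def spde(dy, dx, h_ts, d_both, g2):
    """Single pairwise difference estimator, eqs. (3.19)-(3.20).

    Residualises the time differences on the estimated double index by
    leave-one-out kernel regression (Robinson-type centring), then applies OLS.
    Arguments as in the WDPDE sketch.
    """
    N, k = dx.shape
    sel = d_both.astype(bool)
    Ey = np.zeros(N)
    Ex = np.zeros((N, k))
    for i in np.flatnonzero(sel):
        w = gaussian_kernel((h_ts[i] - h_ts) / g2) / g2**2
        w *= d_both                      # only selected j contribute, as in (3.19)
        w[i] = 0.0                       # leave-one-out kernel regression
        sw = w.sum()
        Ey[i] = np.dot(w, dy) / sw
        Ex[i] = w @ dx / sw
    ry = (dy - Ey)[sel]                  # centred dependent-variable differences
    rx = (dx - Ex)[sel]                  # centred regressor differences
    return np.linalg.solve(rx.T @ rx, rx.T @ ry)   # third-step OLS, eq. (3.20)
```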
4. MONTE CARLO RESULTS

In this section we report the results of a small simulation study to illustrate the finite-sample performance of the proposed estimators. Each Monte Carlo experiment is concerned with estimating the scalar parameter β in the model

    y_it = x_it β + α_i + ε_it ,  observed if d_it = 1;
    d*_it = z_1it γ_1 + z_2it γ_2 − η_i − u_it ,   d_it = 1{d*_it ≥ 0},
    i = 1,...,N ,  t = 1, 2.                                                            (4.1)

The true value of β, γ_1 and γ_2 is 1; z_1it and z_2it follow a N(0,1); x_it is equal to the variable z_2it. The individual effects are generated as

    η_i = −[(z_1i1 + z_1i2)/2 + (z_2i1 + z_2i2)/2 + (z_1i1 z_2i1 + z_1i2 z_2i2)/2] + 0.07 χ²₂(0,1),
    α_i = (x_i1 + x_i2)/2 + χ²₂(0,1),

where χ²₂(0,1) denotes a central χ² variable with 2 degrees of freedom, normalised to mean zero and unit variance. This design of α_i reflects the fact that we do not need to restrict its correlation with the explanatory variables to be linear. The index f_t(z_i) in (2.2) coincides in our experimental design with z_1it γ_1 + z_2it γ_2 + [(z_1i1 + z_1i2)/2 + (z_2i1 + z_2i2)/2 + (z_1i1 z_2i1 + z_1i2 z_2i2)/2]. The time-varying errors are u_it = χ²₂(0,1) and ε_it = 0.8 u_it + 0.6 χ²₂(0,1). The errors in the main equation are generated as a linear function of the errors in the selection equation, which guarantees the existence of non-random selection into the sample. We report results for normalised central χ² distributions with 2 degrees of freedom; our estimators are distribution-free methods and are therefore robust to this departure from normality. A sketch of the data generating process follows.
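The design (4.1) can be simulated as below. This is a sketch under our reading of the design: in particular, the χ²₂ component of α_i is taken with unit scale, an assumption rather than a value stated explicitly in the text.

```python
import numpy as np

def simulate(N, rng):
    """One Monte Carlo draw from design (4.1); beta = gamma_1 = gamma_2 = 1."""
    def chi2_norm(size):
        # central chi^2(2), normalised to mean 0 and variance 1
        return (rng.chisquare(2, size) - 2.0) / 2.0
    z1 = rng.standard_normal((N, 2))     # z_1it ~ N(0,1), t = 1, 2
    z2 = rng.standard_normal((N, 2))     # z_2it ~ N(0,1)
    x = z2                               # x_it = z_2it
    eta = -(z1.sum(1) / 2 + z2.sum(1) / 2 + (z1 * z2).sum(1) / 2) + 0.07 * chi2_norm(N)
    alpha = x.sum(1) / 2 + chi2_norm(N)  # unit chi^2 scale assumed here
    u = chi2_norm((N, 2))
    eps = 0.8 * u + 0.6 * chi2_norm((N, 2))   # links main and selection errors
    d = (z1 + z2 - eta[:, None] - u >= 0).astype(float)
    y = x + alpha[:, None] + eps              # y_it observed only where d_it = 1
    return y, x, z1, z2, d
```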
The results, with 100 replications and different sample sizes, are presented in Table 1 (WDPDE) and Table 2 (SPDE). It is fair to say that we would probably need larger sample sizes than the ones included in the experiments to exploit fully the properties of these estimators. The tables report the estimated mean bias of the estimators and the small-sample standard errors (SE); as not all the moments of the estimators may exist in finite samples, some quantile-based measures, the median bias and the median absolute deviation (MAD), are also reported.

In Panel A we report the finite sample properties of the estimator that ignores sample selection. The purpose of presenting these results is to make explicit the importance of the sample selection problem in our experimental design. In Table 1, this estimator is obtained by applying least squares to the model in double differences, with correction for sample selection ignored, on the sample of individuals observed in both time periods, i.e. those with d_i1 = d_i2 = 1; in Table 2, by applying least squares to the model in single differences over time for a given individual observed in both time periods. In Panels B and C we implement second- (R=1), fourth- (R=3), and sixth-order (R=5) bias-reducing kernels of Bierens (1987). They correspond to a normal, a mixture of two normals and a mixture of three normals, respectively. The bandwidth sequence for the first step is² g_N = g·N^(−1/[2(R+1)+2T·f]), where T = 2 is the number of time periods and f = 2 is the dimension of z_i. The first-step probabilities h_1(z_i) and h_2(z_i) are estimated by leave-one-out kernel estimators (this is theoretically convenient), constructed as in (3.5) but without z_i being used in estimating ĥ_τ(z_i); the summations in (3.5) should read l ≠ i. The bandwidth sequence for the weights in the second step of the WDPDE and the SPDE is g_N = g·N^(−1/[2(R+1)+2q]), where q = 2 is the dimension of the vectors h_ts. The constant part g of the bandwidth was chosen equal to 1, 0.5 or 3 in both steps; there was no serious attempt at optimal choice.

² Following the best uniform consistency rate in Bierens (1987) for multivariate kernels. If we were focused on convergence in distribution, the optimal rate would be obtained by setting g_N = g·N^(−1/[2(R+1)+T·f]).

From both tables we see that in Panels B and C the estimators are less biased than the estimator that ignores correction for sample selection. The biases are all positive; they increase as the kernel order increases and they diminish with sample size. The best behaviour is found with the combination R = 1 and constant part of the bandwidth g = 1. Some anomalous results for sample size 1000 may call for some trimming to ensure that all the kernel estimators are well behaved. The SPDE performs slightly better than the WDPDE, which may originate in the extra differencing present in the latter method.

TABLE 1: Weighted Double Pairwise Difference Estimator (WDPDE)
(Design (4.1): u_it = χ²₂(0,1); ε_it = 0.8 u_it + 0.6 χ²₂(0,1); α_i = (x_i1 + x_i2)/2 + χ²₂(0,1); η_i = −[(z_1i1 + z_1i2)/2 + (z_2i1 + z_2i2)/2 + (z_1i1 z_2i1 + z_1i2 z_2i2)/2] + 0.07 χ²₂(0,1).)

Panel A: Ignoring correction for sample selection
N      Mean Bias   Median Bias   SE       MAD
250    0.1099      0.1186        0.1416   0.1194
500    0.0937      0.1005        0.1239   0.1024
750    0.0933      0.0911        0.1075   0.0911
1000   0.0912      0.0887        0.1015   0.0887

Panel B
R=1 & g=1
N      Mean Bias   Median Bias   SE       MAD
250    0.0650      0.0735        0.1444   0.1143
500    0.0426      0.0496        0.1139   0.0747
750    0.0258      0.0248        0.0827   0.0580
1000   0.0282      0.0244        0.0782   0.0580

R=1 & g=0.5
N      Mean Bias   Median Bias   SE       MAD
250    0.0969      0.0842        0.1811   0.1194
500    0.0679      0.0806        0.1419   0.1101
750    0.0499      0.0398        0.1025   0.0655
1000   0.0570      0.0629        0.0991   0.0744

R=1 & g=3
N      Mean Bias   Median Bias   SE       MAD
250    0.0917      0.0913        0.1365   0.0966
500    0.0762      0.0816        0.1137   0.0861
750    0.0707      0.0777        0.0935   0.0777
1000   0.0700      0.0654        0.0882   0.0654

Panel C
R=3 & g=1
N      Mean Bias   Median Bias   SE       MAD
250    0.0713      0.0754        0.1465   0.1055
500    0.0576      0.0605        0.1112   0.0888
750    0.0464      0.0550        0.0864   0.0709
1000   0.0531      0.0589        0.0844   0.0662

R=5 & g=1
N      Mean Bias   Median Bias   SE       MAD
250    0.0852      0.0879        0.1439   0.1051
500    0.0669      0.0722        0.1099   0.0808
750    0.0700      0.0612        0.0949   0.0625
1000   0.0735      0.0749        0.0907   0.0751

TABLE 2: Single Pairwise Difference Estimator (SPDE)
(Design (4.1), as in Table 1.)

Panel A: Ignoring correction for sample selection
N      Mean Bias   Median Bias   SE       MAD
250    0.1090      0.1156        0.1412   0.1182
500    0.0940      0.1001        0.1242   0.1011
750    0.0930      0.0906        0.1074   0.0906
1000   0.0911      0.0889        0.1014   0.0889

Panel B
R=1 & g=1
N      Mean Bias   Median Bias   SE       MAD
250    0.0448      0.0431        0.1373   0.1029
500    0.0165      0.0134        0.0942   0.0583
750    0.0074      0.0167        0.0704   0.0510
1000   0.0063      0.0053        0.0641   0.0431

R=1 & g=0.5
N      Mean Bias   Median Bias   SE       MAD
250    0.0910      0.1039        0.1497   0.1114
500    0.0494      0.0554        0.1028   0.0773
750    0.0441      0.0431        0.0792   0.0547
1000   0.0432      0.0470        0.0718   0.0597

R=1 & g=3
N      Mean Bias   Median Bias   SE       MAD
250    0.0705      0.0670        0.1255   0.0861
500    0.0616      0.0684        0.1037   0.0818
750    0.0443      0.0505        0.0753   0.0550
1000   0.0472      0.0409        0.0708   0.0495

Panel C
R=3 & g=1
N      Mean Bias   Median Bias   SE       MAD
250    0.0749      0.0680        0.1550   0.1054
500    0.0459      0.0526        0.1277   0.0781
750    0.0471      0.0356        0.1277   0.0562
1000   0.0370      0.0379        0.0837   0.0578

R=5 & g=1
N      Mean Bias   Median Bias   SE       MAD
250    0.0764      0.0672        0.1354   0.0989
500    0.0552      0.0703        0.2379   0.0844
750    0.0704      0.0525        0.1934   0.0706
1000   0.0729      0.0690        0.1600   0.0746
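For completeness, the summary measures reported in the tables can be computed from the vector of replication estimates as sketched below. We take the MAD around the true value; the text does not spell out the centring, so this is an assumption.

```python
import numpy as np

def summarize(beta_hats, beta_true=1.0):
    """Mean bias, median bias, small-sample SE and MAD over replications."""
    b = np.asarray(beta_hats)
    return {
        "mean_bias": b.mean() - beta_true,
        "median_bias": np.median(b) - beta_true,
        "se": b.std(ddof=1),
        "mad": np.median(np.abs(b - beta_true)),  # centring at the true value assumed
    }
```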
5. RELATIONSHIP BETWEEN THE WDPDE AND THE SPDE

We have presented, for both methods, least squares estimation of β as the final step of the estimation procedures, but we can also derive instrumental variables estimation of β, to make explicit the fact that no strict exogeneity is needed for the variables in the main equation. The exogenous variables z_i can be used to construct a k-dimensional vector (k being the dimension of x_it) of "instrumental variables" for (x_it − x_is). In particular, if we let the instruments be suitable functions of the conditioning variables z_i and z_j, these instruments are defined as Z_its ≡ Z_ts(z_i) for some function Z_ts: R^(F·T) → R^k. The estimator in (3.14), rewritten as a weighted instrumental variables estimator, is given by the following expression:

    β̂ = [Ŝ_Zx]⁻¹ Ŝ_Zy ,
    Ŝ_Zx ≡ [N(N−1)/2]⁻¹ Σ_{i<j} ω̂_ijts [Z_its − Z_jts]′ [(x_it − x_is) − (x_jt − x_js)],
    Ŝ_Zy ≡ [N(N−1)/2]⁻¹ Σ_{i<j} ω̂_ijts [Z_its − Z_jts]′ [(y_it − y_is) − (y_jt − y_js)].   (5.1)

For the estimator in (3.20) we can also present, as an alternative to the least squares approach, a weighted instrumental variables version. As in some other applications of kernel regression estimators, E_N(y_it − y_is | ĥ_t(z_i), ĥ_s(z_i), d_it = d_is = 1) and E_N(x_it − x_is | ĥ_t(z_i), ĥ_s(z_i), d_it = d_is = 1) cause technical difficulties associated with their random denominators, which can be small (they need not be bounded away from zero). To avoid this problem, a convenient choice of instrumental variables is the product of the original instruments Z_its with the sums in the denominators of E_N(y_it − y_is | ·) and E_N(x_it − x_is | ·); that is, the instruments Ẑ_its are defined as

    Ẑ_its ≡ Z_its · Σ_{j≠i} ω̂_ijts .                                                  (5.2)

With this definition, the coefficients of an instrumental variables regression of (y_it − y_is) − E_N(y_it − y_is | ĥ_t(z_i), ĥ_s(z_i), d_it = d_is = 1) on (x_it − x_is) − E_N(x_it − x_is | ĥ_t(z_i), ĥ_s(z_i), d_it = d_is = 1), using the instrumental variables Ẑ_its, can be shown to be algebraically equivalent to the β̂ defined in (5.1) above³:

    β̂ = [Ŝ_Zx]⁻¹ Ŝ_Zy ,
    Ŝ_Zx ≡ Σ_{i=1}^N d_it d_is (Z_its Σ_{j≠i} ω̂_ijts)′ {(x_it − x_is) − E_N(x_it − x_is | ĥ_t(z_i), ĥ_s(z_i), d_it = d_is = 1)}
         = Σ_{i=1}^N d_it d_is (Z_its Σ_{j≠i} ω̂_ijts)′ {(x_it − x_is) − [Σ_{j≠i} ω̂_ijts (x_jt − x_js)] / Σ_{j≠i} ω̂_ijts},
    Ŝ_Zy ≡ Σ_{i=1}^N d_it d_is (Z_its Σ_{j≠i} ω̂_ijts)′ {(y_it − y_is) − E_N(y_it − y_is | ĥ_t(z_i), ĥ_s(z_i), d_it = d_is = 1)}
         = Σ_{i=1}^N d_it d_is (Z_its Σ_{j≠i} ω̂_ijts)′ {(y_it − y_is) − [Σ_{j≠i} ω̂_ijts (y_jt − y_js)] / Σ_{j≠i} ω̂_ijts}.   (5.3)

³ To show that they are equivalent, we have to take into account the symmetry property of the kernel, K_ij = K_ji.

The estimators in (5.1) and (5.3) are equivalent.
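The equivalence, which rests on the kernel symmetry ω̂_ijts = ω̂_jits, can be checked numerically with a sketch along the following lines (reusing `gaussian_kernel` from Section 3; the construction of the instruments Z_ts(·) is left to the user and the names are illustrative):

```python
import numpy as np  # gaussian_kernel as defined in the Section 3 sketches

def iv_versions(dy, dx, Z, h_ts, d_both, g2):
    """Computes (5.1) and (5.3); the two estimates should coincide up to rounding."""
    N, k = dx.shape
    # pairwise weights (3.13), symmetric by construction
    W = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            if i != j:
                W[i, j] = (gaussian_kernel((h_ts[i] - h_ts[j]) / g2) / g2**2
                           * d_both[i] * d_both[j])
    # (5.1): weighted IV on double differences
    Szx = np.zeros((k, k)); Szy = np.zeros(k)
    for i in range(N - 1):
        for j in range(i + 1, N):
            Szx += W[i, j] * np.outer(Z[i] - Z[j], dx[i] - dx[j])
            Szy += W[i, j] * (Z[i] - Z[j]) * (dy[i] - dy[j])
    b1 = np.linalg.solve(Szx, Szy)
    # (5.3): IV on centred single differences with Zhat = Z * sum_j w_ij, eq. (5.2)
    sw = W.sum(1)
    Ex = (W @ dx) / sw[:, None]
    Ey = (W @ dy) / sw
    Zhat = Z * sw[:, None]
    sel = d_both.astype(bool)
    A = Zhat[sel].T @ (dx - Ex)[sel]
    c = Zhat[sel].T @ (dy - Ey)[sel]
    b2 = np.linalg.solve(A, c)
    return b1, b2
```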
6. CONCLUDING REMARKS

In this paper, estimation of the coefficients in a "double-index" selectivity bias model is considered under the assumption that the selection correction function depends only on the conditional means of some observable selection variables. We present two alternative methods. The first is a "weighted double pairwise difference estimator", because it is based on the comparison of individuals in time differences. The second is a "single pairwise difference estimator", because only differences over time for each individual are required. Their advantages with respect to already available methods are that they are distribution-free, that there is no need to assume a parametric selection mechanism, and that heteroskedasticity over time is allowed for. The methods do not require strict exogeneity for the variables in the main equation, and they are equivalent under a special type of instrumental variables.

The finite sample properties of the estimators are investigated by Monte Carlo experiments. The results of our small Monte Carlo simulation study show the following. Both estimators are less biased than the estimator that ignores correction for sample selection. The biases are all positive; they increase as the kernel order increases and they diminish with sample size. The best behaviour is found with the combination R = 1 and constant part of the bandwidth g = 1. The SPDE performs slightly better than the WDPDE, which may originate in the extra differencing present in the latter method.

REFERENCES

- AHN, H. AND J. L. POWELL (1993), "Semiparametric estimation of censored selection models with a nonparametric selection mechanism", Journal of Econometrics, 58, 3-29.
- BIERENS, H. J. (1987), "Kernel estimators of regression functions", in T. F. Bewley (ed.), Advances in Econometrics, Fifth World Congress, Volume I, Econometric Society Monographs No. 13, Cambridge University Press.
- CHAMBERLAIN, G. (1980), "Analysis of covariance with qualitative data", Review of Economic Studies, 47, 225-238.
- CHARLIER, E., B. MELENBERG, AND A. H. O. VAN SOEST (1995), "A smoothed maximum score estimator for the binary choice panel data model with an application to labour force participation", Statistica Neerlandica, 49, 324-342.
- HECKMAN, J. (1976), "The common structure of statistical models of truncation, sample selection and limited dependent variables and a simple estimator for such models", Annals of Economic and Social Measurement, 5, 475-492.
- HECKMAN, J. (1979), "Sample selection bias as a specification error", Econometrica, 47, 153-161.
- HOROWITZ, J. L. (1988), "Semiparametric M-estimation of censored linear regression models", Advances in Econometrics, 7, 45-83.
- KYRIAZIDOU, E. (1994), "Estimation of a panel data sample selection model", unpublished manuscript, Northwestern University.
- KYRIAZIDOU, E. (1997), "Estimation of a panel data sample selection model", Econometrica, 65(6), 1335-1364.
- MANSKI, C. (1987), "Semiparametric analysis of random effects linear models from binary panel data", Econometrica, 55, 357-362.
- POWELL, J. L. (1987), "Semiparametric estimation of bivariate latent variable models", Working Paper 8704, Social Systems Research Institute, University of Wisconsin, Madison (revised April 1989).
- ROBINSON, P. M. (1988), "Root-N-consistent semiparametric regression", Econometrica, 56(4), 931-954.
- ROCHINA-BARRACHINA, M. E. (1996), "Small sample properties of two different estimators of panel data sample selection models with non-parametric components", unpublished paper.
- ROCHINA-BARRACHINA, M. E. (1999), "A new estimator for panel data sample selection models", Annales d'Économie et de Statistique, 55/56, 153-181.
- WOOLDRIDGE, J. M. (1995), "Selection corrections for panel data models under conditional mean independence assumptions", Journal of Econometrics, 68, 115-132.

Appendix I. The variance-covariance matrix for the WDPDE

Within the class that Horowitz (1988) terms "semiparametric M-estimators", the WDPDE of β is defined as the minimizer of a second-order (bivariate) U-statistic,

    β̂ = argmin_β [N(N−1)/2]⁻¹ Σ_{i<j} [(Δy_its − Δy_jts) − (Δx_its − Δx_jts)β]² ω̂_ijts ≡ argmin_β U_0N(β),   (I.1)

which solves the approximate first-order condition

    [N(N−1)/2]⁻¹ Σ_{i<j} (Δx_its − Δx_jts)′ [(Δy_its − Δy_jts) − (Δx_its − Δx_jts)β̂] ω̂_ijts = 0,   (I.2)

where Δy_its = y_it − y_is, Δx_its = x_it − x_is, and ω̂_ijts is defined in expression (3.13) of the main text. The empirical loss function in (I.1) and the estimating equations in (I.2) also depend upon estimators of the nonparametric components h_its and h_jts defined in (3.1) and (3.5). To derive the influence function of an estimator satisfying (I.2), we first expand (I.2) around β, and subsequently take a functional mean-value expansion around (h_its − h_jts) to determine the effect on β̂ of the estimation of (ĥ_its − ĥ_jts). Expanding (I.2) around β we get

    0 = [N(N−1)/2]⁻¹ Σ_{i<j} (Δx_its − Δx_jts)′ [(Δy_its − Δy_jts) − (Δx_its − Δx_jts)β] ω̂_ijts
        − [N(N−1)/2]⁻¹ Σ_{i<j} (Δx_its − Δx_jts)′ (Δx_its − Δx_jts) ω̂_ijts (β̂ − β),   (I.3)

from which

    √N(β̂ − β) = {Ŝ_xx}⁻¹ √N Ŝ_xζ ,   Ŝ_xζ ≡ [N(N−1)/2]⁻¹ Σ_{i<j} (Δx_its − Δx_jts)′ (ζ_its − ζ_jts) ω̂_ijts ,   (I.4)

where ζ_its ≡ Δy_its − Δx_its β. If we analyse the components of (I.4):

1) For the first component,

    Ŝ_xx ≡ [N(N−1)/2]⁻¹ Σ_{i<j} (Δx_its − Δx_jts)′(Δx_its − Δx_jts) ω̂_ijts →p S_xx = 2Σ_xx.   (I.5)

As Ŝ_xx = U_1N is a bivariate U-statistic, by using U-statistics asymptotic theory we know

    U_1N =p (2/N) Σ_{i=1}^N E[(Δx_its − Δx_jts)′(Δx_its − Δx_jts) ω_ijts | Δx_its, h_its, d_it, d_is]
        →p 2 E{E[(Δx_its − Δx_jts)′(Δx_its − Δx_jts) ω_ijts | Δx_its, h_its, d_it, d_is]} = 2Σ_xx.   (I.6)

The matrix Σ_xx is easily handled, since (1/2)Ŝ_xx estimates it consistently.

2) For the second component, √N Ŝ_xζ, a functional mean-value expansion around (h_its − h_jts) gives

    √N Ŝ_xζ = N^(−1/2) Σ_i [2/(N−1)] Σ_{j≠i} (Δx_its − Δx_jts)′(ζ_its − ζ_jts) (1/g²_2N) k((h_its − h_jts)/g_2N) d_its d_jts
      + N^(−1/2) Σ_i [2/(N−1)] Σ_{j≠i} Σ_l (Δx_jts − Δx_lts)′(ζ_jts − ζ_lts) (1/g³_2N) k′((h*_its − h*_jts)/g_2N) d_jts d_lts K_jl (Σ_l K_jl)⁻¹ [(d_it, d_is)′ − ĥ_its],   (I.7)

where d_its ≡ d_it d_is, k′(·) is the derivative of the second-stage kernel k(·), K_jl is defined as in (3.5), and (h*_its − h*_jts) lies between the estimated and true index differences. The expression (I.7) includes derivatives of the weights with respect to (h_its − h_jts) (kernel derivatives).
For the first term on the right-hand side of (I.7),

    N^(−1/2) Σ_i [2/(N−1)] Σ_{j≠i} (Δx_its − Δx_jts)′(ζ_its − ζ_jts) (1/g²_2N) k((h_its − h_jts)/g_2N) d_its d_jts
      =p N^(−1/2) Σ_i 2 E[(Δx_its − Δx_jts)′(ζ_its − ζ_jts)(1/g²_2N) k((h_its − h_jts)/g_2N) d_its d_jts | Δx_its, ζ_its, h_its, d_its],   (I.8)

and for the second,

    N^(−1/2) Σ_i [2/(N−1)] Σ_{j≠i} Σ_l (Δx_jts − Δx_lts)′(ζ_jts − ζ_lts)(1/g³_2N) k′((h*_its − h*_jts)/g_2N) d_jts d_lts K_jl (Σ_l K_jl)⁻¹ [(d_it, d_is)′ − ĥ_its]
      =p N^(−1/2) Σ_i 2 E[(Δx_jts − Δx_lts)′(ζ_jts − ζ_lts)(1/g³_2N) k′((h_its − h_jts)/g_2N) d_jts d_lts K_jl (Σ_l K_jl)⁻¹ | h_its, (d_it, d_is)′ − ĥ_its] [(d_it, d_is)′ − ĥ_its].   (I.9)

Substituting (I.5), (I.8) and (I.9) in (I.4) we get

    √N(β̂ − β) =p Σ_xx⁻¹ N^(−1/2) Σ_i { ψ_its + Ψ_its [(d_it, d_is)′ − ĥ_its] },        (I.10)

where ψ_its and Ψ_its denote the two conditional expectations in (I.8) and (I.9), respectively. This is asymptotically normal,

    √N(β̂ − β) →d N(0, Σ_xx⁻¹ Ω Σ_xx⁻¹),                                              (I.11)

where Σ_xx is estimated by (1/2)Ŝ_xx from (I.5), and Ω can be estimated by

    Ω̂ = (1/N) Σ_i { ψ̂_its + Ψ̂_its [(d_it, d_is)′ − ĥ_its] }′ { ψ̂_its + Ψ̂_its [(d_it, d_is)′ − ĥ_its] },   (I.12)

where

    ψ̂_its = [1/(N−1)] Σ_j (Δx_its − Δx_jts)′(ζ̂_its − ζ̂_jts)(1/g²_2N) k((ĥ_its − ĥ_jts)/g_2N) d_its d_jts ,
    Ψ̂_its = [1/(N−1)] Σ_j Σ_l (Δx_jts − Δx_lts)′(ζ̂_jts − ζ̂_lts)(1/g³_2N) k′((ĥ_its − ĥ_jts)/g_2N) d_jts d_lts K_jl (Σ_l K_jl)⁻¹.   (I.13)

The general theory derived for minimizers of mth-order U-statistics can be applied to show √N-consistency and to obtain the large-sample distribution of the WDPDE for panel data sample selection models. The variance-covariance matrix of this estimator depends upon the conditional variability of the errors in the regression equation and upon the deviations of the selection indicators from their conditional means, (d_it, d_is)′ − ĥ_its.

Appendix II. The variance-covariance matrix for the SPDE

We can define the SPDE of β as the minimizer

    β̂ = argmin_β (1/N) Σ_i { [Δy_its − E_N(Δy_ts | ĥ_t(z_i), ĥ_s(z_i), d_it = d_is = 1)]
          − [Δx_its − E_N(Δx_ts | ĥ_t(z_i), ĥ_s(z_i), d_it = d_is = 1)] β }² d_it d_is ,   (II.1)

which solves the approximate first-order condition

    (1/N) Σ_i [Δx_its − E_N(Δx_ts | ĥ_t(z_i), ĥ_s(z_i), d_it = d_is = 1)]′
          { [Δy_its − E_N(Δy_ts | ·)] − [Δx_its − E_N(Δx_ts | ·)] β̂ } d_it d_is = 0.   (II.2)

Expanding (II.2) around β we get

    0 = (1/N) Σ_i [Δx_its − E_N(Δx_ts | ·)]′ { [Δy_its − E_N(Δy_ts | ·)] − [Δx_its − E_N(Δx_ts | ·)] β } d_it d_is
        − (1/N) Σ_i [Δx_its − E_N(Δx_ts | ·)]′ [Δx_its − E_N(Δx_ts | ·)] d_it d_is (β̂ − β),   (II.3)

from which
h$ t (zi ), h$ s (zi ), d it = d is = 1 = − E N ∆ y its h$ t ( zi ), h$ s( zi ), d it = d is = 1 + EN its h$ t (zi ), h$ s (zi ), d it = d is (II.5) It can be shown that the inverted matrix in (II.4) is consistent for A= [∆x ts )] [∆x ( − E ∆x ts ht (z ), hs (z ), d t = d s = 1 We shall analyse now the term estimating four (h ( z) , h ( z) , E(∆x t s ts 1 N ' ts )] ( − E ∆x ts ht (z ), hs (z ), d t = d s = 1 d it d is ? . ? in (II.4). We have to work out the effect of i =1 dimensional ) ( ht( z) , hs( z) , d t = d s = 1 , E t − the asymptotic variance of our parameter of interest summand in {[ 1 N (II.6) N infinite conditional s means )) ht( z) , hs( z) , d t = d s = 1 on . The moment condition for the N can be written as i =1 ( ) ( E m ht( z), hs( z), E ∆x ts ht( z), hs( z), d t = d s = 1 , E t − s )]}= 0 ht ( z), hs( z), d t = d s = 1 (II.7) 41 − ∆x ts h$t( zi), h$s( zi), d it = d is = 1 d it d is? ? where [ ( ) ( m ht( z) , hs( z) , E ∆x ts ht( z) , hs( z) , d t = d s = 1 , E [ )] [( ( ∆x ts − E ∆x ts ht (z), hs (z), d t = d s = 1 ' − ) − E( t − t t s s − )] = 1)]d d . ht ( z), hs( z), d t = d s = 1 = s ht( z) , hs( z) , d t = d s t s (II.8) The following four derivatives are of interest: m ( =−( ) [ E ∆x ts ht (z), hs (z), d t = d s = 1 − t )− E ( t − s s )] ht (z ), hs (z ), d t = d s = 1 d t d s , (II.9) E ( m t − s = − ∆x ) [ ht (z), hs (z), d t = d s = 1 ts )] ( − E ∆x ts ht (z ), hs (z ), d t = d s = 1 d t d s , (II.10) m =− ht (z ) [ ) [( ( E ∆x ts ht (z ), hs (z ), d t = d s = 1 t )] ( − ∆x ts − E ∆x ts ht (z ), hs (z ), d t = d s = 1 ' ' E t ( t − s t − s )− E ( t − s )] ht (z ), hs (z ), d t = d s = 1 d t d s ) ht (z ), hs (z ), d t = d s = 1 d t d s , (II.11) m =− hs (z ) [ s ) [( ( E ∆x ts ht (z ), hs (z ), d t = d s = 1 )] ( − ∆x ts − E ∆x ts ht (z ), hs (z ), d t = d s = 1 ' ' s E ( t − s t − s )− E ( t − s )] ht (z ), hs (z ), d t = d s = 1 d t d s ) ht (z ), hs (z ), d t = d s = 1 d t d s , (II.12) where t ( ) E ?ht (z), hs (z), d t = d s = 1 ( and s ( ) E ?ht (z), hs (z), d t = d s = 1 are the ) derivatives of E ?ht (z), hs (z), d t = d s = 1 with respect to ht (z ) and hs (z ) , respectively. For the moment condition in (II.7) a functional expansion around ht (z ) and hs (z ) gives 42 1 N [ N N 1 N =p + E +E +E ( ) m h$t (zi ), h$s (zi ), E N ∆x ts ht (zi ), hs (zi ), d it = d is = 1 , E N i =1 i =1 t − m ht (zi ), hs (zi ), E N ∆x ts ht (zi ), hs (zi ), d it = d is = 1 , E N ( [ ( m ( ) ) E ∆x ts ht (z ), hs (z ), d t = d s = 1 ( t − [ s − t )] ht (zi ), hs (zi ), d it = d is = 1 s )] ht (zi ), hs (zi ), d it = d is = 1 )] ( ht (z ), hs (z ), d t = d s = 1 ∆x ts − E ∆x ts ht (z ), hs (z ), d t = d s = 1 m E ( ) h z ,h z ,d = ds = 1 s t ( ) s( ) t ht (z), hs (z), d t = d s = 1 m ht ( z), hs( z), d t = d s = 1 [d t − ht ( z)] + E ht (z ) [( t − s )− E( t − s )] ht (z), hs (z), d t = d s = 1 m ht (z ), hs (z ), d t = d s = 1 [d s − hs (z )]? hs( z) ? (II.13) For our estimator, the two means of ( ) m E ?ht (z ), hs (z ), d t = d s = 1 conditional on ht (z ), hs (z ) , and d t = d s = 1 are zero (see (II.9) and (II.10), above). Furthermore, the corresponding two terms [ [ ] E m ht (z) ht (z), hs (z), d t = d s = 1 for and ] E m hs (z) ht (z), hs (z), d t = d s = 1 , according to (II.11) and (II.12), are also zero [ E( because of [ t ) − E( − s ( t − s ] ) ht( z), hs( z), d t = d s = 1 ht (z ), hs (z ), d t = d s = 1 = 0 ] ) and E ∆x ts − E ∆x ts ht (z), hs (z), d t = d s = 1 ht (z ), hs (z ), d t = d s = 1 = 0 . 
Hence, there is no effect of estimating the four infinite-dimensional nuisance parameters on the asymptotic variance of β, given that the correction terms in (II.13) are equal to zero. Therefore, we get

    √N(β̂ − β) =p A⁻¹ N^(−1/2) Σ_i [Δx_its − E(Δx_ts | h_t(z_i), h_s(z_i), d_it = d_is = 1)]′
        { (ε_it − ε_is) − E(ε_t − ε_s | h_t(z_i), h_s(z_i), d_it = d_is = 1) } d_it d_is = A⁻¹ N^(−1/2) Σ_i ψ_i ,   (II.14)

which is asymptotically normal,

    √N(β̂ − β) →d N(0, A⁻¹ E(ψψ′) A⁻¹),                                               (II.15)

where

    ψ_i ≡ [Δx_its − E(Δx_ts | h_t(z_i), h_s(z_i), d_it = d_is = 1)]′ { (ε_it − ε_is) − E(ε_t − ε_s | h_t(z_i), h_s(z_i), d_it = d_is = 1) } d_it d_is .   (II.16)

A can be estimated as in (II.4), while E(ψψ′) is estimated by replacing all the conditional means involved, (h_t(z), h_s(z), E(Δx_ts | h_t(z), h_s(z), d_t = d_s = 1), E(ε_t − ε_s | h_t(z), h_s(z), d_t = d_s = 1)), with nonparametric estimates.
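As an illustration of how this plugs in after the third step, the resulting sandwich estimator can be sketched as follows, reusing the centred quantities from the SPDE sketch in Section 3.2 (names are hypothetical; this is one plug-in reading of (II.14)-(II.16), not a prescribed implementation):

```python
import numpy as np

def spde_vcov(dy, dx, Ey, Ex, beta_hat, d_both):
    """Sandwich variance estimate for the SPDE, following (II.14)-(II.16).

    Ey, Ex : kernel-regression estimates of the conditional means from the
    third step; the population conditional means in psi_i are replaced by
    these nonparametric estimates.
    """
    sel = d_both.astype(bool)
    rx = (dx - Ex)[sel]                   # dx - E_N(dx | h_t, h_s, d = 1)
    ry = (dy - Ey)[sel] - rx @ beta_hat   # centred error differences
    N = dx.shape[0]
    A_hat = rx.T @ rx / N                 # consistent for A, cf. (II.4) and (II.6)
    psi = rx * ry[:, None]                # psi_i, eq. (II.16)
    Omega_hat = psi.T @ psi / N           # estimate of E(psi psi')
    A_inv = np.linalg.inv(A_hat)
    return A_inv @ Omega_hat @ A_inv / N  # Var(beta_hat), cf. (II.15)
```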