Introductory Econometrics Problem set 2 – solutions

Jan Zouhar
Department of Econometrics, University of Economics, Prague, zouharj@vse.cz
Problem 2.1. Use the data in wage2.gdt for both this problem and the remaining problems below.
a) Estimate the equation
$$\log(\mathit{wage}) = \beta_0 + \beta_1\,\mathit{exper} + \beta_2\,\mathit{exper}^2 + \beta_3\,\mathit{educ} + \beta_4\,\mathit{female} + \beta_5\,\mathit{nonwhite} + u \tag{1}$$
and report the results using the usual format.
After loading the dataset and creating the variables l_wage = $\log(\mathit{wage})$ and sq_exper = $\mathit{exper}^2$, we run the OLS regression to obtain
Model 1: OLS, using observations 1-526
Dependent variable: l_wage

             coefficient      std. error     t-ratio    p-value
  -------------------------------------------------------------------
  const       0.395402        0.103219        3.831     0.0001      ***
  exper       0.0389529       0.00482908      8.066     5.03e-015   ***
  sq_exper   -0.000687172     0.000107516    -6.391     3.66e-010   ***
  educ        0.0839129       0.00699065     12.00      1.80e-029   ***
  female     -0.337424        0.0363579      -9.281     4.53e-019   ***
  nonwhite   -0.0213127       0.0596998      -0.3570    0.7212

Mean dependent var    1.623268   S.D. dependent var    0.531538
Sum squared resid     89.03680   S.E. of regression    0.413793
R-squared             0.399737   Adjusted R-squared    0.393966
F(5, 520)             69.25751   P-value(F)            1.89e-55
Log-likelihood       -279.2075   Akaike criterion      570.4150
Schwarz criterion     596.0069   Hannan-Quinn          580.4354
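The usual format of reporting the results is the equation form, with standard errors in parentheses below the estimates:
$$\widehat{\log(\mathit{wage})} = \underset{(0.103)}{0.395} + \underset{(0.00483)}{0.0390}\,\mathit{exper} - \underset{(0.000108)}{0.000687}\,\mathit{exper}^2 + \underset{(0.00699)}{0.0839}\,\mathit{educ} - \underset{(0.0364)}{0.337}\,\mathit{female} - \underset{(0.0597)}{0.0213}\,\mathit{nonwhite},$$
$$n = 526, \quad R^2 = 0.400.$$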
b) Based on your results from part a, find the 99% confidence interval for $\beta_5$. Is the (partial) effect of race statistically significant at the 1% level in your equation?
The easiest way to obtain the 99% CI is to use Gretl’s built-in routines: Analysis → Confidence intervals for coefficients → α → 0.99. The result is
$$\text{99\% CI for } \beta_5 = [-0.176,\ 0.133].$$
Alternatively, we can calculate the endpoints of the interval manually using the formula
$$\text{coefficient} \pm c \cdot (\text{standard error}) = -0.02131 \pm 2.585\,(0.05970),$$
where $c = 2.585$ is the 99.5th percentile of $t_{526-5-1}$; the value can be found e.g. in Gretl’s Tools → Statistical tables.
The 99% CI contains zero, meaning that we cannot reject the null that $\beta_5 = 0$ against the two-sided alternative at the 1% level. (We could also compare the p-value for nonwhite from the regression output, 0.721, to our significance level.) We conclude that the effect of race is not significant in our equation.
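The manual calculation can be reproduced in a few lines. This is only a sketch: the estimate and standard error are copied from the Model 1 output, the critical value 2.585 is taken from the text rather than computed, and the function name is a hypothetical helper.

```python
# Estimate and standard error for nonwhite, copied from the Model 1 output.
coef = -0.0213127
std_err = 0.0596998
# 99.5th percentile of t with 520 degrees of freedom, as quoted in the text
# (assumed here, not computed).
t_crit = 2.585


def confidence_interval(estimate, se, critical_value):
    """Return (lower, upper) = estimate -/+ critical_value * se."""
    half_width = critical_value * se
    return estimate - half_width, estimate + half_width


lower, upper = confidence_interval(coef, std_err, t_crit)
contains_zero = lower <= 0.0 <= upper
print(f"99% CI for beta_5: [{lower:.3f}, {upper:.3f}]")
print("zero inside the interval:", contains_zero)
```

If the interval contains zero, the two-sided test at the matching significance level does not reject the null.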
c) Use White’s test and the Breusch-Pagan test (Tests → Heteroskedasticity → White’s test / Breusch-Pagan) to show whether Assumption MLR.5 holds. What do you conclude? (Report the value of the test statistics and the resulting p-value along with your conclusions.) What does the test tell you about the results you obtained from the regression?
After running the tests, Gretl appends the following text to the original regression output:

White’s test for heteroskedasticity
  Null hypothesis: heteroskedasticity not present
  Test statistic: LM = 24.7849
  with p-value = P(Chi-square(17) > 24.7849) = 0.0996277

Breusch-Pagan test for heteroskedasticity
  Null hypothesis: heteroskedasticity not present
  Test statistic: LM = 13.663
  with p-value = P(Chi-square(5) > 13.663) = 0.0178977
From the p-values we can immediately see that White’s test fails to reject the null of homoskedasticity at the conventional 5% level (p = 0.100), while the Breusch-Pagan test does reject at that level (p = 0.018). As I mentioned in the lectures, the power of these tests is rather limited, and if either of them rejects, the conservative approach is to proceed as if heteroskedasticity were present in the model. Recall that under heteroskedasticity, Assumption MLR.5 is violated, and our results related to standard errors, confidence intervals and hypothesis tests do not hold (i.e., the last three columns in Gretl’s regression output are unusable). On the other hand, the OLS estimates of $\beta_0, \ldots, \beta_5$ are still consistent and typically fairly efficient, so the coefficient column can be retained.
d) Using the approximation
$$\%\Delta\mathit{wage} \approx 100\,(\beta_1 + 2\beta_2\,\mathit{exper})\,\Delta\mathit{exper},$$
find the approximate return to the fifth year of experience. What is the approximate return to the twentieth year of experience?
For the fifth year of experience we have $\mathit{exper} = 5$, $\Delta\mathit{exper} = 1$; plugging these values and our estimates into the above formula gives
$$\%\Delta\widehat{\mathit{wage}} \approx 100\,[0.0390 + 2(-0.000687)(5)](1) = 3.213,$$
i.e. the wage is expected to change by approximately 3.213 per cent as a result of increasing working experience from 4 to 5 years. Analogously, for the twentieth year of experience we have
$$\%\Delta\widehat{\mathit{wage}} \approx 100\,[0.0390 + 2(-0.000687)(20)](1) = 1.152,$$
showing that the increase in wage diminishes with additional experience.
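The two returns can be checked with a short script. It is a minimal sketch that plugs the rounded estimates used in the text into the approximation; the function and constant names are hypothetical.

```python
# Rounded estimates used in the text for exper and exper^2.
BETA_EXPER = 0.0390
BETA_SQ_EXPER = -0.000687


def approx_return_pct(exper, delta_exper=1.0):
    """Approximate % change in wage: 100 * (b1 + 2 * b2 * exper) * delta_exper."""
    return 100.0 * (BETA_EXPER + 2.0 * BETA_SQ_EXPER * exper) * delta_exper


for year in (5, 20):
    print(f"year {year}: approx. return {approx_return_pct(year):.3f} %")
```

Evaluating the function on a grid of exper values is an easy way to see the return fall as experience accumulates.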
e) At what value of exper does additional experience actually begin to lower predicted $\log(\mathit{wage})$? (Or, what is the turning point in the effect of experience?) How many people have more experience in this sample? (Hint: Sorting the data using Data → Sort data → exper might help you out with the last question.)
From the lectures, we know that the turning point can easily be obtained from first-order conditions as
$$\text{turning point} = -\frac{\text{coefficient on the linear term}}{2 \times \text{coefficient on the quadratic term}} = -\frac{0.0390}{2(-0.000687)} = 28.4 \text{ years of experience}.$$
It turns out that 121 people in the sample have at least 29 years of experience (exper is recorded as an integer in our data).
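For reference, the first-order condition behind the turning-point formula can be sketched as follows, using the notation of (1) and holding the other regressors fixed. The symbol $\mathit{exper}^*$ for the turning point is introduced here for illustration only.

```latex
\begin{align*}
\frac{\partial \log(\mathit{wage})}{\partial\,\mathit{exper}}
  &= \beta_1 + 2\beta_2\,\mathit{exper} = 0
  \quad\Longrightarrow\quad
  \mathit{exper}^* = -\frac{\beta_1}{2\beta_2}, \\
\frac{\partial^2 \log(\mathit{wage})}{\partial\,\mathit{exper}^2}
  &= 2\beta_2 < 0 \quad \text{when } \beta_2 < 0 .
\end{align*}
```

With a positive estimate on the linear term and a negative estimate on the quadratic term, the stationary point is a maximum, so predicted log(wage) falls with additional experience beyond it.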
Problem 2.2. Based on (1), you want to predict the salary of a white male person with 5 years of work experience and 18 years of education. This prediction is made difficult by the presence of logarithms; read Wooldridge’s section ‘Predicting y when log(y) is the dependent variable’.
a) Find the prediction, assuming that u is normally distributed (conditional on all independent variables), i.e. that assumptions MLR.1 through MLR.6 hold.
First of all, it is convenient to re-estimate the equation with slightly modified variables: we replace exper, sq_exper and educ with
$$\mathit{exper\_5} = \mathit{exper} - 5, \qquad \mathit{sq\_exper\_5} = (\mathit{exper} - 5)^2, \qquad \mathit{educ\_18} = \mathit{educ} - 18.$$
(Use Add → Define new variable… to create these variables in Gretl.) In the new equation,
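$$\log(\mathit{wage}) = \delta_0 + \delta_1\,\mathit{exper\_5} + \beta_2\,\mathit{sq\_exper\_5} + \beta_3\,\mathit{educ\_18} + \beta_4\,\mathit{female} + \beta_5\,\mathit{nonwhite} + u, \tag{2}$$
the coefficients $\beta_2, \ldots, \beta_5$ are the same as in (1). Two things have changed (hence the different notation): the intercept, $\delta_0$, and the coefficient on the linear experience term, $\delta_1 = \beta_1 + 10\beta_2$, because $(\mathit{exper} - 5)^2 = \mathit{exper}^2 - 10\,\mathit{exper} + 25$. We can easily verify this by running the OLS for (2) in Gretl; compare the results for (2) below with those for (1) above: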
Model 2: OLS, using observations 1-526
Dependent variable: l_wage

               coefficient      std. error     t-ratio    p-value
  ---------------------------------------------------------------------
  const         2.08342         0.0451186      46.18      4.08e-186   ***
  exper_5       0.0320812       0.00381271      8.414     3.84e-016   ***
  sq_exper_5   -0.000687172     0.000107516    -6.391     3.66e-010   ***
  educ_18       0.0839129       0.00699065     12.00      1.80e-029   ***
  female       -0.337424        0.0363579      -9.281     4.53e-019   ***
  nonwhite     -0.0213127       0.0596998      -0.3570    0.7212

Mean dependent var    1.623268   S.D. dependent var    0.531538
Sum squared resid     89.03680   S.E. of regression    0.413793
R-squared             0.399737   Adjusted R-squared    0.393966
F(5, 520)             69.25751   P-value(F)            1.89e-55
Log-likelihood       -279.2075   Akaike criterion      570.4150
Schwarz criterion     596.0069   Hannan-Quinn          580.4354
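As a quick arithmetic check on how the Model 2 estimates relate to those of Model 1, one can expand $(\mathit{exper} - 5)^2 = \mathit{exper}^2 - 10\,\mathit{exper} + 25$ and recombine the Model 1 coefficients. The sketch below does this with the estimates copied from the Model 1 output; the variable names are illustrative only.

```python
# Model 1 estimates (const, exper, sq_exper, educ), copied from the output.
b0, b1, b2, b3 = 0.395402, 0.0389529, -0.000687172, 0.0839129

# Intercept of the recentred model: predicted log(wage) at
# exper = 5, educ = 18, female = 0, nonwhite = 0.
delta0 = b0 + 5 * b1 + 25 * b2 + 18 * b3

# Coefficient on exper_5: slope of log(wage) in exper evaluated at exper = 5.
delta1 = b1 + 2 * 5 * b2

print(f"implied const:   {delta0:.5f}")   # compare with const in Model 2
print(f"implied exper_5: {delta1:.7f}")   # compare with exper_5 in Model 2
```

The remaining coefficients (sq_exper_5, educ_18, female, nonwhite) are unaffected by the recentring, which matches the two outputs.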
The reason why we transformed the variables is that in the new equation, the intercept tells us something about
the wage of a white male person with 5 years of work experience and 18 years of education. For this person, (2)
reduces to
$$\log(\mathit{wage}) = \delta_0 + u, \quad \text{or} \quad \mathit{wage} = e^{\delta_0 + u}.$$
For our prediction, we will use the expected wage of the person in question, which is
$$E[\mathit{wage}] = E[e^{\delta_0 + u}] = \underbrace{e^{\delta_0}}_{A}\;\underbrace{E[e^{u}]}_{B}. \tag{3}$$
The first term, $A$, can be consistently estimated by exponentiating the OLS intercept, in our case $\hat{A} = e^{2.0834} = 8.032$. If $u \sim \text{Normal}(0, \sigma^2)$, it can be shown that $B = E[e^{u}] = e^{\sigma^2/2}$; see the lectures. An estimate of $\sigma$ is provided in the regression output under the name S.E. of regression. Therefore, we can estimate $B$ as $\hat{B} = e^{0.4138^2/2} = 1.0894$. Altogether, our estimate of the person’s wage is
$$\widehat{\mathit{wage}} = \hat{A}\hat{B} = 8.032\,(1.0894) = \$8.75 \text{ per hour}.$$
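The same prediction can be reproduced numerically. This is a minimal sketch under the normality assumption of part a, using the intercept and S.E. of regression copied from the Model 2 output; the variable names are hypothetical.

```python
import math

# Values copied from the Model 2 regression output.
delta0_hat = 2.08342   # intercept (const)
sigma_hat = 0.413793   # S.E. of regression

a_hat = math.exp(delta0_hat)           # estimate of A = exp(delta_0)
b_hat = math.exp(sigma_hat ** 2 / 2)   # estimate of B = exp(sigma^2 / 2), valid under normal u
predicted_wage = a_hat * b_hat

print(f"A_hat = {a_hat:.3f}, B_hat = {b_hat:.4f}")
print(f"predicted wage = {predicted_wage:.2f} dollars per hour")
```

Note that exponentiating the intercept alone (A_hat) would understate the expected wage, because B_hat exceeds one whenever the error variance is positive.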
b) Save the residuals from (1) to a new variable uhat, and test for normality (Variable → Normality test); the null is that uhat is normally distributed. What do you conclude?
Gretl’s output after running the tests is:
Test for normality of uhat:
Doornik-Hansen test = 10.6516, with p-value 0.00486434
Shapiro-Wilk W = 0.991748, with p-value 0.00508462
Lilliefors test = 0.0367591, with p-value ~= 0.08
Jarque-Bera test = 10.5413, with p-value 0.00514034
The null hypothesis is that u is normally distributed. This null is rejected by the Doornik-Hansen, Shapiro-Wilk
and Jarque-Bera tests at the 5% (or 1%) level, so we have to acknowledge that the assumption of normality of u
was not justified in the previous task.
c) Find the prediction once again, this time using Duan’s (1983) smearing estimate, described in the same section of Wooldridge’s book. (Hint: you will need to create a new variable, calculated as exp(uhat), and find its mean, e.g. by displaying summary statistics.)
Once again, we will base our prediction on (3), but this time, we will estimate $B$ as the average of $e^{\hat{u}}$ in the sample, i.e. $\hat{B} = n^{-1}\sum_{i=1}^{n} e^{\hat{u}_i}$. In Gretl, this can be done as follows. We have already saved the residuals in the uhat variable. Next, we create exponentiated residuals and store them in a new variable, say expuhat: Add → Define new variable… → expuhat = exp(uhat). Now we can obtain $\hat{B}$ as the mean of expuhat, e.g. via View → Summary statistics. This gives us $\hat{B} = 1.0891$, which is nearly the same as before, and does not change the predicted wage from 2.2a within the first three significant figures:
$$\widehat{\mathit{wage}} = \hat{A}\hat{B} = 8.032\,(1.0891) = \$8.75 \text{ per hour}.$$
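The smearing step itself is a one-line average. The sketch below illustrates it on a short list of hypothetical residuals (not taken from the wage data; they are chosen to sum to zero, as OLS residuals do when the model has an intercept), and then applies the value 1.0891 reported in the text to the intercept from the Model 2 output. The function name is an assumption made for illustration.

```python
import math


def duan_smearing_factor(residuals):
    """Sample average of exp(residual): Duan's estimate of E[exp(u)]."""
    return sum(math.exp(r) for r in residuals) / len(residuals)


# Hypothetical residuals for illustration only (NOT from the wage data).
example_residuals = [-0.6, -0.2, 0.0, 0.3, 0.5]
example_factor = duan_smearing_factor(example_residuals)
print(f"smearing factor on the toy residuals: {example_factor:.4f}")

# Prediction using the factor reported in the text (1.0891) and the
# intercept copied from the Model 2 output.
predicted_wage = math.exp(2.08342) * 1.0891
print(f"predicted wage = {predicted_wage:.2f} dollars per hour")
```

Unlike the estimate in 2.2a, this factor does not rely on normality of u; it only averages the exponentiated residuals.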
Problem 2.3.
a) Estimate a modified version of (1) with the level, rather than log, of wage as the dependent variable:
$$\mathit{wage} = \beta_0 + \beta_1\,\mathit{exper} + \beta_2\,\mathit{exper}^2 + \beta_3\,\mathit{educ} + \beta_4\,\mathit{female} + \beta_5\,\mathit{nonwhite} + u. \tag{4}$$
Gretl’s output from the OLS regression is given below; note that the equation form of reporting your estimates is
preferred.
Model 3: OLS, using observations 1-526
Dependent variable: wage

             coefficient      std. error     t-ratio    p-value
  -------------------------------------------------------------------
  const      -2.28278         0.746117       -3.060     0.0023      ***
  exper       0.255446        0.0349070       7.318     9.61e-013   ***
  sq_exper   -0.00444815      0.000777181    -5.723     1.77e-08    ***
  educ        0.554632        0.0505318      10.98      2.37e-025   ***
  female     -2.11579         0.262813       -8.051     5.64e-015   ***
  nonwhite   -0.157833        0.431539       -0.3657    0.7147

Mean dependent var    5.896103   S.D. dependent var    3.693086
Sum squared resid     4652.262   S.E. of regression    2.991096
R-squared             0.350280   Adjusted R-squared    0.344033
F(5, 520)             56.06904   P-value(F)            1.36e-46
Log-likelihood       -1319.651   Akaike criterion      2651.302
Schwarz criterion     2676.894   Hannan-Quinn          2661.322
b) Save the residuals ($\hat{u}$) from (4) and find the sample correlation coefficients between $\hat{u}$ and all the explanatory variables (i.e., 5 correlation coefficients). Explain the results.
After saving your residuals, the correlations can be obtained through View → Correlation matrix. All correlations are nearly zero. Actually, the fact that they are not exactly zero is attributable only to rounding errors, inherent in all computer calculations. We know that OLS residuals are uncorrelated with all explanatory variables; see the slide entitled ‘Three facts about the fitted values and residuals’ in my ‘Multiple regression’ presentation. Note in particular that the zero correlation between the residuals ($\hat{u}$) and the explanatory variables holds by construction, and therefore tells us nothing about our assumption MLR.4, which rules out correlation between the explanatory variables and the random error ($u$)!
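The "by construction" point can be illustrated with a toy example. The sketch below fits a simple (one-regressor) OLS line to hypothetical data, not the wage data set, and shows that the residuals are uncorrelated with the regressor up to floating-point error; all names and numbers are made up for illustration, and the same property holds for each regressor in a multiple regression with an intercept.

```python
def mean(values):
    return sum(values) / len(values)


def correlation(a, b):
    """Sample correlation coefficient of two equally long lists."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def ols_simple(x, y):
    """Intercept and slope of the OLS regression of y on x."""
    mx, my = mean(x), mean(y)
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return my - slope * mx, slope


# Hypothetical data for illustration only.
educ = [8, 10, 12, 12, 14, 16, 16, 18]
wage = [3.1, 4.0, 5.2, 4.4, 6.9, 7.5, 9.8, 10.3]

b0, b1 = ols_simple(educ, wage)
residuals = [w - (b0 + b1 * e) for e, w in zip(educ, wage)]
resid_corr = correlation(residuals, educ)
print("corr(residuals, educ) =", resid_corr)  # zero up to rounding error
```

Whatever data are fed in, the printed correlation is numerically zero, which is why it cannot serve as evidence for or against MLR.4.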
c) Save the fitted values $\widehat{\mathit{wage}}$ from (4) and find the sample correlation coefficient between $\mathit{wage}$ and $\widehat{\mathit{wage}}$. Is there any relationship between this correlation coefficient and the $R^2$ from the regression model? (Hint: See Wooldridge, look for the origin of the term ‘R-squared’.)
With all that we have learnt to do in Gretl so far, this one should be a breeze. The relationship with R-squared is as follows:
$$R^2 = 0.35 = 0.592^2 = [\mathrm{corr}(\mathit{wage}, \widehat{\mathit{wage}})]^2.$$
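A short sketch of why this equality holds, assuming the regression includes an intercept and writing $\widehat{\mathrm{Var}}$ and $\widehat{\mathrm{Cov}}$ for sample variances and covariances (notation introduced here for illustration): since $\mathit{wage} = \widehat{\mathit{wage}} + \hat{u}$ and the OLS residuals are uncorrelated with the fitted values in the sample,

```latex
\begin{align*}
\widehat{\mathrm{Cov}}(\mathit{wage}, \widehat{\mathit{wage}})
  &= \widehat{\mathrm{Var}}(\widehat{\mathit{wage}})
   + \widehat{\mathrm{Cov}}(\hat{u}, \widehat{\mathit{wage}})
   = \widehat{\mathrm{Var}}(\widehat{\mathit{wage}}), \\
[\mathrm{corr}(\mathit{wage}, \widehat{\mathit{wage}})]^2
  &= \frac{\widehat{\mathrm{Var}}(\widehat{\mathit{wage}})^2}
          {\widehat{\mathrm{Var}}(\mathit{wage})\,\widehat{\mathrm{Var}}(\widehat{\mathit{wage}})}
   = \frac{\widehat{\mathrm{Var}}(\widehat{\mathit{wage}})}
          {\widehat{\mathrm{Var}}(\mathit{wage})}
   = R^2 .
\end{align*}
```

The last ratio is the explained sum of squares divided by the total sum of squares, which is the usual definition of R-squared.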
d) Based on (1), calculate the predicted wage for all people in the sample ($\widehat{\mathit{wage2}}$), using Duan’s estimate as in Problem 2.2. Find the squared correlation between $\mathit{wage}$ and $\widehat{\mathit{wage2}}$, and use the result to compare the goodness of fit of (1) and (4). (See Wooldridge, same section as in Problem 2.2, for a comparison of goodness of fit for models that combine dependent variables in the level and the log form.)
In order to obtain $\widehat{\mathit{wage2}}$ as requested, the following steps need to be carried out:
(i) Open the window with OLS regression output for either (1) or (2), and save the fitted values in a new variable l_wage_hat = $\widehat{\mathit{l\_wage}}$ using Save → Fitted values.
(ii) Create exponentiated fitted values: Add → Define new variable… → A = exp(l_wage_hat); this gives us an estimate of $A$ from (3) for each individual in the sample.
(iii) Create a new variable wage2hat = $\widehat{\mathit{wage2}}$ based on (3) using Add → Define new variable… → wage2hat = A * 1.0891, where 1.0891 was our estimate of $B$ obtained using Duan’s method in 2.2c.
Finally, we can find the correlation between $\mathit{wage}$ and $\widehat{\mathit{wage2}}$; the result is $\mathrm{corr}(\mathit{wage}, \widehat{\mathit{wage2}}) = 0.625$, so the squared correlation is $0.625^2 \approx 0.391$. This is higher than the correlation of 0.592 (squared: 0.350) between fitted and actual values in 2.3c, where $\mathit{wage}$, rather than $\log(\mathit{wage})$, was the dependent variable, implying that the model with logarithmic wage has a better fit.
As an aside, note that if our only aim is to obtain the correlation between fitted and actual values, we needn’t do step (iii) above, as multiplication by a positive constant does not affect correlations. In other words, in terms of variable names created in Gretl in the above procedure, corr(wage, A) = corr(wage, wage2hat).
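The invariance of the correlation to rescaling is easy to confirm numerically. The sketch below uses hypothetical numbers (not the wage data): the list named A stands in for the exponentiated fitted values from step (ii), and 1.0891 is the smearing factor from 2.2c.

```python
def correlation(a, b):
    """Sample correlation coefficient of two equally long lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


# Hypothetical values for illustration only.
wage = [3.1, 4.0, 5.2, 4.4, 6.9, 7.5, 9.8, 10.3]
A = [2.9, 3.8, 4.6, 4.9, 6.1, 7.4, 8.8, 9.9]   # stand-in for exp(l_wage_hat)

SMEARING_FACTOR = 1.0891                        # estimate of B from 2.2c
wage2hat = [a * SMEARING_FACTOR for a in A]

corr_unscaled = correlation(wage, A)
corr_scaled = correlation(wage, wage2hat)
print(corr_unscaled, corr_scaled)               # identical up to rounding error
```

The scaling still matters for the predicted wage levels themselves; it is only the correlation, and hence the goodness-of-fit comparison, that is unaffected.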