ON REGRESSION MODELS WITH AUTOCORRELATED ERROR: SMALL SAMPLE PROPERTIES Hisashi Tanizaki
Transcription
ON REGRESSION MODELS WITH AUTOCORRELATED ERROR: SMALL SAMPLE PROPERTIES Hisashi Tanizaki
ON REGRESSION MODELS WITH AUTOCORRELATED ERROR: SMALL SAMPLE PROPERTIES Hisashi Tanizaki Graduate School of Economics, Kobe University, Kobe 657-8501, Japan e-mail: tanizaki@kobe-u.ac.jp Abstract: Using both the maximum likelihood estimator and the Bayes estimator, we consider estimating the regression model with the first-order autocorrelated error, where the initial distribution of the autocorrelated error is taken into account. For the Bayes estimator, the Gibbs sampler and the Metropolis-Hastings algorithm are utilized to obtain random draws of the parameters. As a result, the Bayes estimator is less biased and more efficient than the maximum likelihood estimator. Especially, for the autocorrelation coefficient, the Bayes estimate is much less biased than the maximum likelihood estimate. Accordingly, for the standard error of the estimated regression coefficient, the Bayes estimate is more plausible than the maximum likelihood estimate, because variance of the estimated regression coefficient depends on the estimated autocorrelation coefficient. Thus, we find that the Bayes approach might be recommended in the empirical studies. AMS Subj. Classification: 62M10, 62F15. Key Words: Autoregressive Model, Gibbs Sampler, Metropolis-Hastings Algorithm, Bayes Estimator, MLE. 1. Introduction In this paper, we consider the regression model with the first-order autocorrelated error term, where the error term is assumed to be stationary, i.e., the autocorrelation coefficient is assumed to be less than one in absolute value. The traditional estimator, i.e., the maximum likelihood estimator (MLE), is compared with the Bayes estimator (BE). Utilizing the Gibbs sampler, Chib [1] and Chib and Greenberg [2] discussed the regression model with the autocorrelated error term in a Bayesian framework, where the initial condition of the autoregressive process is ignored. In this paper, taking into account the initial density, we compare MLE and BE, where the Gibbs sampler and the Metropolis-Hastings (MH) algorithm are utilized in BE. As for MLE, it is well known that the autocorrelation coefficient is underestimated in small sample and therefore that variance of the estimated regression coefficient is biased. See, for example, Andrews [3] and Tanizaki [4, 5]. Under this situation, inference on the regression coefficient is not appropriate. We show in this paper that BE is superior to MLE because BE’s of both the autocorrelation coefficient and the variance of the error term are closer to the true values, compared with MLE’s. 2. Setup of the Model Let Xt be a 1 × k vector of exogenous variables and β be a k × 1 parameter vector. Consider the following standard linear regression model: yt = Xt β + ut , ut = ρut−1 + ²t , ²t ∼ N (0, σ 2 ), for t = 1, 2, · · · , n, where ²1 , ²2 , · · ·, ²n are assumed to be mutually independently distributed. In this model, the parameter to be estimated is given by θ ≡ (β 0 , ρ, σ 2 )0 . The unconditional density function of yt is: f (yt |β, ρ, σ 2 ) = q 1 2πσ 2 /(1 − ρ2 ) ³ exp − ´ 1 2 (y − X β) , t t 2σ 2 /(1 − ρ2 ) which corresponds to the initial density function of yt when t = 1. Let Yt be the information set up to time t, i.e., Yt = {yt , yt−1 , · · · , y1 }. The conditional density of yt given Yt−1 is: f (yt |Yt−1 , β, ρ, σ 2 ) = f (yt |yt−1 , β, ρ, σ 2 ) ³ ´ 1 1 exp − 2 ((yt − ρyt−1 ) − (Xt − ρXt−1 )β)2 . =√ 2σ 2πσ 2 Therefore, the joint density of Yn , i.e., the likelihood function, is given by: f (Yn |β, ρ, σ 2 ) = f (y1 |β, ρ, σ 2 ) 2 −n/2 = (2πσ ) n Y t=2 f (yt |Yt−1 , β, ρ, σ 2 ) 2 1/2 (1 − ρ ) ³ n ´ 1 X exp − 2 (yt∗ − Xt∗ β)2 , 2σ t=1 where yt∗ and Xt∗ represent the following transformed variables: √ 1 − ρ2 y t , for t = 1, yt∗ ≡ yt∗ (ρ) = yt − ρyt−1 , for t = 2, 3, · · · , n, Xt∗ ≡ Xt∗ (ρ) = √ 1 − ρ2 Xt , X − ρX , t t−1 for t = 1, for t = 2, 3, · · · , n, which depend on the autocorrelation coefficient ρ. (1) Maximum Likelihood Estimator: As shown above, the likelihood function is given by (1). Maximizing (1) with respect to β and σ 2 , we obtain the following expressions: n X ˆ βˆ ≡ β(ρ) =( Xt∗ 0 Xt∗ )−1 t=1 σ ˆ2 ≡ σ ˆ 2 (ρ) = n X Xt∗ 0 yt∗ , (2) t=1 n 1X ˆ 2. (yt∗ − Xt∗ β) n t=1 (3) By substituting βˆ and σ ˆ 2 into β and σ 2 in (1), we have the concentrated likelihood function: ³ ´−n/2 ˆ ρ, σ f (Yn |β, ˆ 2 ) = 2πˆ σ 2 (ρ) n (1 − ρ2 )1/2 exp(− ), 2 (4) which is a function of ρ. (4) is maximized with respect to ρ. In the next section, we obtain the maximum likelihood estimate of ρ by a simple grid search, in which the concentrated likelihood function (4) is maximized by changing the parameter value of ρ by 0.0001 in the interval between −0.9999 and 0.9999. ˆ ρ) and σ Once the solution of ρ, denoted by ρˆ, is obtained, β(ˆ ˆ 2 (ˆ ρ) lead to the 2 2 ˆ maximum likelihood estimates of β and σ . Hereafter, β, σ ˆ and ρˆ are taken ˆ ρ) and σ as the maximum likelihood estimates of β, σ 2 and ρ, i.e., β(ˆ ˆ 2 (ˆ ρ) are 2 ˆ simply written as β and σ ˆ . Variance of the estimate of θ = (β 0 , σ 2 , ρ)0 is asymptotically given by: ˆ = I −1 (θ), where I(θ) denotes the information matrix, which is repreV(θ) sented as: ! Ã ∂ 2 log f (Yn |θ) . I(θ) = −E ∂θ∂θ0 ˆ = σ 2 (Pn X ∗ 0 X ∗ )−1 in large Therefore, the variance of βˆ is given by V(β) t=1 t t ρ). For example, sample, where ρ in Xt∗ is replaced by ρˆ, i.e., Xt∗ = Xt∗ (ˆ suppose that Xt∗ has a tendency to rise over time t and that we have ρ > 0. If ˆ is also underestimated, which yields incorrect ρ is underestimated, then V(β) inference on the regression coefficient β. Thus, unless ρ is properly estimated, ˆ is also biased. In large sample, ρˆ is a consistent estimator the estimate of V(β) ˆ is not biased. However, in small sample, since it is of ρ and therefore V(β) known that ρˆ is underestimated (see, for example, Andrews [3], Tanizaki [4, ˆ is also underestimated. In addition to ρˆ, the estimate of 5]), clearly V(β) ˆ = σ 2 (Pn X ∗ 0 X ∗ )−1 σ 2 also influences inference of β, because we have V(β) t=1 t t as mentioned above. If σ 2 is underestimated, the estimated variance of β is also underestimated. σ ˆ 2 is a consistent estimator of σ 2 in large sample, but it is appropriate to consider that σ ˆ 2 is biased in small sample, because σ ˆ 2 is a function of ρˆ as in (3). Therefore, the biased estimate of ρ gives us the serious problem on inference of β. Bayes Estimator: We assume that the prior density functions of β, ρ and σ 2 are the following noninformative priors: fβ (β) ∝ const., fσ (σ 2 ) ∝ fρ (ρ) ∝ const., 1 , σ2 (5) for −∞ < β < ∞, −1 < ρ < 1 and 0 < σ < ∞. For fρ (ρ), theoretically we should have −1 < ρ < 1. As for the prior density of σ 2 , since we consider that log σ 2 has the flat prior for −∞ < log σ 2 < ∞, we obtain fσ (σ 2 ) ∝ 1/σ 2 . Combining the four densities (1) and (5), the posterior density function of β, ρ and σ 2 , denoted by fβρσ (β, ρ, σ 2 |Yn ), is represented as follows: fβρσ (β, ρ, σ 2 |Yn ) ∝ f (Yn |β, ρ, σ 2 )fβ (β)fρ (ρ)fσ (σ 2 ) 2 −(n/2+1) ∝ (σ ) 2 1/2 (1 − ρ ) n ´ 1 X exp − 2 (yt∗ − Xt∗ β)2 . (6) 2σ t=1 ³ We consider generating random draws of β, ρ and σ 2 given Yn . However, it is not easy to generate random draws of β, ρ and σ 2 from fβρσ (β, ρ, σ 2 |Yn ). Therefore, we perform the Gibbs sampler in this problem. According to the Gibbs sampler, we can sample from the posterior density function (6), using the three conditional distributions fβ|ρσ (β|ρ, σ 2 , Yn ), fρ|βσ (ρ|β, σ 2 , Yn ) and fσ|βρ (σ 2 |β, ρ, Yn ), which are proportional to fβρσ (β, ρ, σ 2 |Yn ) and are obtained as follows: n ´ X 1 0 1 ˆ ˆ , Xt∗ 0 Xt∗ )(β − β) fβ|ρσ (β|ρ, σ , Yn ) ∝ exp − (β − β) ( 2 2 σ t=1 n ³ ³ ´2 ´ 1 X fρ|βσ (ρ|β, σ 2 , Yn ) ∝ (1 − ρ2 )1/2 exp − 2 yt∗ − Xt∗ β , 2σ t=1 n ³ ´ 1 X 1 fσ|βρ (σ 2 |β, ρ, Yn ) ∝ 2 n/2+1 exp − 2 (yt∗ − Xt∗ β)2 . (σ ) 2σ t=1 2 ³ (7) (8) (9) For (7), βˆ represents the ordinary least squares (OLS) estimate, which is repP P resented by βˆ = ( nt=1 Xt∗ 0 Xt∗ )−1 ( nt=1 Xt∗ 0 yt∗ ). (7) indicates that β is sampled ˆ σ 2 (Pn X ∗ 0 X ∗ )−1 ). (8) from the multivariate normal distribution: β ∼ N (β, t=1 t t is not represented in a known distribution, where −1 < ρ < 1. Sampling from (8) is implemented by the MH algorithm, which will be discussed below. (9) P indicates that 1/σ 2 ∼ G(n/2, 2/ nt=1 ²2t ), where ²t = yt∗ − Xt∗ β. Note that G(a, b) denotes the gamma distribution with parameters a and b. See, for example, Bernardo and Smith [6], Carlin and Louis [7], Chen et al [8], Gamerman [9], Robert and Casella [10] and Smith and Roberts [11] for the Gibbs sampler and the MH algorithm. In order to generate random draws of β, ρ and σ 2 from the posterior density fβρσ (β, ρ, σ 2 |Yn ), the following procedures have to be taken: (i) Let βi , ρi and σi2 be the i-th random draws of β, ρ and σ 2 . Take the 2 initial values of (β, ρ, σ 2 ) as (β−M , ρ−M , σ−M ). P ˆ σ 2 ( n X ∗ 0 X ∗ )−1 ) as in (7), generate βi given ρi−1 , (ii) From β ∼ N (β, i−1 t=1 t t P P 2 σi−1 and Yn , where βˆ = ( nt=1 Xt∗ 0 Xt∗ )−1 ( nt=1 Xt∗ 0 yt∗ ), yt∗ = yt∗ (ρi−1 ) and Xt∗ = Xt∗ (ρi−1 ). 2 (iii) From (8), generate ρi given βi , σi−1 and Yn . Since it is not easy to generate random draws directly from (8), the MH algorithm is utilized, which is implemented as follows: (a) Generate ρ∗ from the uniform distribution between −1 and 1, which implies that the sampling density of ρ is given by f∗ (ρ|ρi−1 ) = 1/2 for −1 < ρ < 1. Compute the acceptance probability ω(ρi−1 , ρ∗ ), which is defined as: Ã ! 2 fρ|βσ (ρ∗ |βi , σi−1 , Yn )/f∗ (ρ∗ |ρi−1 ) ω(ρi−1 , ρ ) = min , 1 2 fρ|βσ (ρi−1 |βi , σi−1 , Yn )/f∗ (ρi−1 |ρ∗ ) ! Ã 2 fρ|βσ (ρ∗ |βi , σi−1 , Yn ) , 1 . = min 2 fρ|βσ (ρi−1 |βi , σi−1 , Yn ) ∗ (b) Set ρi = ρ∗ with probability ω(ρi−1 , ρ∗ ) and ρi = ρi−1 otherwise. P (iv) From 1/σ 2 ∼ G(n/2, 2/ nt=1 ²2t ) as in (9), generate σi2 given βi , ρi and Yn , where ²t = yt∗ − Xt∗ β, yt∗ = yt∗ (ρi ) and Xt∗ = Xt∗ (ρi ). (v) Repeat Steps (ii) – (iv) for i = −M + 1, −M + 2, · · · , N , where M indicates the burn-in period. Repetition of Steps (ii) – (iv) corresponds to the Gibbs sampler. For sufficiently large M , we have the following results: N 1 X g(xi ) −→ E(g(x)), N i=1 where x should be replaced by β, ρ or σ 2 . g(·) represents a function, typically g(x) = x or g(x) = x2 . Thus, we can take the Bayes estimates of β, ρ and σ 2 P P P 2 e ≡ (1/N ) N e 2 ≡ (1/N ) N as βe ≡ (1/N ) N i=1 βi , ρ i=1 ρi and σ i=1 σi , respectively. 3. Monte Carlo Experiments For the exogenous variables, we take the data shown in Table 1, which are presented in Judge et al [12, p.156]. The DGP is defined as: yt = β1 + β2 x2,t + β3 x3,t + ut , ut = ρut−1 + ²t , (10) where the ²t ’s are normally and independently distributed with E(²t ) = 0 and E(²2t ) = σ 2 . As in Judge et al [12], the parameter values are set to be β 0 = (β1 , Table 1: The Exogenous Variables x1,t and x2,t t x2,t x3,t t x2,t x3,t 1 14.53 16.74 11 20.77 19.33 2 15.30 16.81 12 21.17 17.04 3 15.92 19.50 13 21.34 16.74 4 17.41 22.12 14 22.91 19.81 5 18.37 22.34 15 22.96 31.92 6 18.83 17.47 16 23.69 26.31 7 18.84 20.24 17 24.82 25.93 8 19.71 20.37 18 25.54 21.96 9 20.01 12.71 19 25.63 24.05 10 20.26 22.98 20 28.73 25.66 β2 , β3 ) = (10, 1, 1). In this section, we utilize x2,t and x3,t given in Judge et al [12, p.156], which is shown in Table 1, and generate L samples of yt given the Xt = (1, x2,t , x3,t ) for t = 1, 2, · · · , n. That is, we perform L simulation runs for both MLE and BE, where L = 104 is taken in this section. The simulation procedure in this section is as follows: (i) Given ρ, generate random numbers of ut for t = 1, 2, · · · , n, based on the assumptions: ut = ρut−1 +²t , ²t ∼ N (0, σ 2 ), where σ 2 = 1 and ρ = −0.99, −0.98, · · ·, 0.99 are taken. (ii) Given β, Xt and ut for t = 1, 2, · · · , n, we obtain a set of data yt , t = 1, 2, · · · , n, from (10), where (β1 , β2 , β3 ) = (10, 1, 1) is assumed. (iii) Given (yt , Xt ) for t = 1, 2, · · · , n, obtain the estimates of θ = (β 0 , ρ, σ 2 )0 e respectively. by MLE and BE, which are denoted by θˆ and θ, (iv) Repeat (i) – (iii) L times, where L = 104 is taken as mentioned above. (v) From L estimates of θ, compute the arithmetic average (AVE), the standard error (SER), the root mean square error (RMSE), the skewness (Skewness), the kurtosis (Kurtosis), and the 5, 10, 25, 50, 75, 90 and 95 percent points (5%, 10%, 25%, 50%, 75%, 90% and 95%) for each estimator. For AVE and RMSE of MLE, we compute: AVE = L 1X (l) θˆ , L l=1 j RMSE = L ³1 X L l=1 (l) (θˆj − θj )2 ´1/2 , (l) for j = 1, 2, · · · , 5, where θj denotes the j-th element of θ and θˆj represents the j-th element of θˆ in the l-th simulation run. For AVE and e RMSE of BE, simply θˆ is replaced by θ. In this section, we compare BE with MLE through Monte Carlo studies. In Figure 1 we draw the relationship between ρ and ρˆ, where ρˆ denotes the arithmetic average of the 104 MLE’s, while in Figure 2 we display the e where ρe indicates the arithmetic average of relationship between ρ and ρ, 4 the 10 BE’s. In the two figures the cases of n = 10, 15, 20 are shown, and (M, N ) = (5000, 104 ) is taken in Figure 2. In Appendix, we check whether M Figure 1: The Arithmetic Average from the 104 MLE’s of AR(1) Coeff. ρˆ 1.0 45◦ Degree Line .....n = 20 ..... . . . . ... ................ n = 15 0.422 ¾ ......................... . . . . .............. ..................... n = 10 . 0.142 ¾ . . .. ................... . ρ . . ........ . ...... −1.0 −0.5 0.5 1.0 . . ? . . . . . 0.9 ............ ...... ....... . . . . . . . .. . . . . . . ...... ....... ........ ....... . . . . . −0.5 .. ....... ....... ....... ......... . . . . ... . ..... .. ........ ..... .. ........ . 0.559 ¾ −1.0 Figure 2: The Arithmetic Average from the 104 BE’s of AR(1) Coeff. ——— M = 5000 and N = 104 ——— 1.0 ρe 0.661 ¾ 0.568 ¾ 45◦ Degree Line ...... n = 20 .......................... n = 15 . . . . ... ........ ....................... . . . . . . . . . ¾ . .. n = 10 0.369 ............... ........ ....... . . . . . . ..... ....... ...... ......... . . . . ... . ρ ..... . .... .. −1.0 −0.5 .................... 0.5 1.0 ? .. 0.9 .... .. .... . .... .. ....... . . . . .... . .... .. .... .. .... ... . . . . −0.5 .... . .... .. .... . .... ... . . . . . .... .. .... .. ... −1.0 Table 2: MLE: n = 20 and ρ = 0.9 Parameter True Value AVE SER RMSE Skewness Kurtosis 5% 10% 25% 50% 75% 90% 95% β1 10 10.012 3.025 3.025 0.034 2.979 5.096 6.120 7.935 10.004 12.051 13.913 15.036 β2 1 0.999 0.171 0.171 −0.045 3.093 0.718 0.785 0.883 0.999 1.115 1.217 1.274 β3 1 1.000 0.053 0.053 −0.008 3.046 0.914 0.933 0.965 1.001 1.036 1.068 1.087 ρ 0.9 0.559 0.240 0.417 −1.002 4.013 0.095 0.227 0.426 0.604 0.740 0.825 0.863 σ2 1 0.752 0.276 0.372 0.736 3.812 0.363 0.426 0.550 0.723 0.913 1.120 1.255 Table 3: BE with M = 5000 and N = 104 : n = 20 and ρ = 0.9 Parameter True Value AVE SER RMSE Skewness Kurtosis 5% 10% 25% 50% 75% 90% 95% β1 10 10.010 2.782 2.782 0.008 3.018 5.498 6.411 8.108 10.018 11.888 13.578 14.588 β2 1 0.999 0.160 0.160 −0.029 3.049 0.736 0.798 0.891 1.000 1.107 1.205 1.258 β3 1 1.000 0.051 0.051 −0.022 2.942 0.915 0.934 0.966 1.001 1.036 1.067 1.085 ρ 0.9 0.661 0.188 0.304 −1.389 5.391 0.285 0.405 0.572 0.707 0.799 0.852 0.875 σ2 1 1.051 0.380 0.384 0.725 3.783 0.515 0.601 0.776 1.011 1.275 1.555 1.750 e lies on the and N are large enough. If the relationship between ρ and ρˆ (or ρ) ◦ 45 degree line, we can conclude that MLE (or BE) of ρ is unbiased. However, from the two figures, both estimators are biased. Take an example of ρ = 0.9 in Figures 1 and 2. When the true value is ρ = 0.9, the arithmetic averages of 104 MLE’s are given by 0.142 for n = 10, 0.422 for n = 15 and 0.559 for n = 20 (see Figure 1), while those of 104 BE’s are 0.369 for n = 10, 0.568 for n = 15 and 0.661 for n = 20 (see Figure 2). As n increases the estimators are less biased, because MLE gives us the consistent estimators. Comparing BE and MLE, BE is less biased than MLE in the small sample, because BE is closer to the 45◦ degree line than MLE. Especially, as ρ goes to one, the difference between BE and MLE becomes quite large in small sample. Tables 2 and 3 represent the basic statistics such as arithmetic average, standard error, root mean square error, skewness, kurtosis and percent points, which are computed from L = 104 simulation runs, where the case of n = 20 and ρ = 0.9 is examined. Table 2 is based on the MLE’s while Table 3 is Figure 3: Empirical Distributions for MLE and BE —— n = 20, ρ = 0.9, M = 5000, N = 104 and L = 104 —— ρˆ −0.5 0.0 0.5 1.0 0.5 1.0 ρe −0.5 0.0 obtained from the BE’s. Both MLE and BE give us the unbiased estimators of regression coefficients β1 , β2 and β3 , because the arithmetic averages from the 104 estimates of β1 , β2 and β3 , (i.e., AVE in the tables) are very close to the true parameter values, which are set to be (β1 , β2 , β3 ) = (10, 1, 1). However, in the SER and RMSE criteria, BE is better than MLE, because SER and RMSE of BE are smaller than those of MLE. From Skewness and Kurtosis in the two tables, we can see that the empirical distributions of MLE and BE of (β1 , β2 , β3 ) are very close to the normal distribution. Remember that the skewness and kurtosis of the normal distribution are given by zero and three, respectively. As for σ 2 , AVE of BE is closer to the true value than that of MLE, because AVE of MLE is 0.752 (see Table 2) and that of BE is 1.051 (see Table 3). However, in the SER and RMSE criteria, MLE is superior to BE, since SER and RMSE of MLE are given by 0.276 and 0.372 (see Table 2) while those of BE are 0.380 and 0.384 (see Table 3). The empirical distribution obtained from 104 estimates of σ 2 is skewed to the right (Skewness is positive for both MLE and BE) and has a larger kurtosis than the normal distribution because Kurtosis is greater than three for both tables. For ρ, AVE of MLE is 0.559 (Table 2) and that of BE is given by 0.661 (Table 3). As it is also seen in Figures 1 and 2, BE is less biased than MLE from the AVE criterion. Moreover, SER and RMSE of MLE are 0.240 and 0.417, while those of BE are 0.188 and 0.304. Therefore, BE is more efficient than MLE. Thus, in the AVE, SER and RMSE criteria, BE is superior to MLE with respect to ρ. The empirical distributions of MLE and BE of ρ are skewed to the left because Skewness is negative, which value is given by −1.002 in Table 2 and −1.389 in Table 3. We can see that MLE is less skewed than BE. For Kurtosis, both MLE and BE of ρ are greater than three and therefore the empirical distributions of the estimates of ρ have fat tails, compared with the normal distribution. Since Kurtosis in Table 3 is 5.391 and that in Table 2 is 4.013, the empirical distribution of BE has more kurtosis than that of MLE. Figure 3 also indicates these facts. Figure 3 corresponds to ρˆ in Table 2 and ρe in Table 3, respectively. As we can see from Skewness and Kurtosis in Tables 2 and 3, βˆi and βei , i = 1, 2, 3, are very similar to a normal distribution. From Skewness and Kurtosis, σ ˆ 2 and 2 σe are quite different from a normal distribution, but they are very similar to each other. Also, ρˆ and ρe2 are quite different from the normal distribution. e We However, the empirical distribution of ρˆ is quite different from that of ρ. can observe that ρe is more skewed to the left than ρˆ and ρe has a larger kurtosis than ρˆ. Therefore, the empirical distributions of ρˆ and ρe are shown in Figure 3. The mode of MLE ρˆ lies on the interval between 0.65 and 0.70, while that of BE ρe is between 0.75 and 0.80. From the facts that SER of ρe is smaller than that of ρˆ and that the mode of ρe is closer to the true value than that of ρˆ, ρe is distributed around the true value 0.9. 4. Summary In this paper, we have compared MLE with BE, using the regression model with the autocorrelated error term. Chib [1] and Chib and Greenberg [2] applied the Gibbs sampler to the autocorrelation model, where the initial density of the error term is ignored. Under this setup, the posterior distribution of ρ reduces to the normal distribution. Therefore, random draws of ρ given β, σ 2 and (yt , Xt ) can be easily generated. However, when the initial density of the error term is properly taken into account, the posterior distribution of ρ is not normal and it cannot be represented in an explicit functional form. Accordingly, in this paper, the MH algorithm has been applied to generate random draws of ρ from its posterior density. The obtained results are summarized as follows. Given β 0 = (10, 1, 1) and σ 2 = 1, in Figure 1 we have the relationship between ρ and ρˆ, and ρe corresponding to ρ is drawn in Figure 2. In the two figures, we can observe: (i) both MLE and BE approach the true parameter value as n is large, and (ii) BE is closer to the 45◦ degree line than MLE and accordingly BE is superior to MLE. Moreover, we have compared MLE with BE in Tables 2 and 3, where 0 β = (10, 1, 1), ρ = 0.9 and σ 2 = 1 are taken as the true values. As for the regression coefficient β, both MLE and BE gives us the unbiased estimators. However, we have obtained the result that BE of β is more efficient than Table 4: BE with (M, N ) = (5000, 5000), (1000, 104 ): n = 20 and ρ = 0.9 Parameter β1 β2 True Value 10 1 (a) (M, N ) = (5000, 5000) AVE 10.011 0.999 RMSE 2.785 0.160 50% 10.015 1.000 Skewness 0.004 −0.027 Kurtosis 3.028 3.056 (b) (M, N ) = (1000, 104 ) AVE 10.010 0.999 RMSE 2.783 0.160 50% 10.014 1.000 0.008 −0.029 Skewness Kurtosis 3.031 3.055 β3 1 ρ 0.9 σ2 1 1.000 0.052 1.001 −0.022 2.938 0.661 0.305 0.707 −1.390 5.403 1.051 0.384 1.011 0.723 3.776 1.000 0.051 1.001 −0.021 2.938 0.661 0.304 0.706 −1.391 5.404 1.051 0.384 1.011 0.723 3.774 MLE. For estimation of σ 2 , BE is less biased than MLE. In addition, BE of the autocorrelation coefficient ρ is also less biased than MLE. Therefore, as for inference on β, BE is superior to MLE, because it is plausible to consider that e Remember the estimated variance of βˆ is biased much more than that of β. that variance of βˆ depends on both ρ and σ 2 . Thus, from the simulation studies, we can conclude that BE performs much better than MLE. Appendix: Is (M, N ) = (5000, 104 ) sufficiently large? In this appendix, we check whether M and N are enough large in BE. Comparison between Tables 3 and 4(a) shows whether N = 5000 is large enough and we can see from Tables 3 and 4(b) if the burn-in period M = 1000 is large enough. For the burn-in period M , there are some diagnostic tests, which are discussed in Geweke [13] and Mengersen, Robert and GuihenneucJouyaux [14]. However, since their tests are applicable in the case of one sample path, we cannot utilize them in this section. Because L simulation runs are implemented in this paper, we have L test statistics if we apply the tests. It is not possible to evaluate L testing results at the same time. Therefore, we consider using the alternative approach to see if M = 1000 and N = 5000 are sufficient. We can conclude that N = 5000 is large enough if Table 3 is very close to Table 4(a) and that M = 1000 is enough if Table 3 is close to Table 4(b). From Tables 3 and 4, we can observe as follows. The difference between Tables 3 and 4(a) is at most 0.012 (see Kurtosis in ρ) and that between Tables 3 and 4(b) is less than or equal to 0.013 (see Kurtosis in β1 and ρ). Thus, all the values in the tables are very close to each other. Therefore, we can conclude that (M, N ) = (1000, 5000) is enough. For safety, we take the case of (M, N ) = (5000, 104 ) in this paper. References [1] S. Chib, “Bayes Regression with Autoregressive Errors: A Gibbs Sampling Approach,” Journal of Econometrics, 58 (1993), 275 – 294. [2] S. Chib and E. Greenberg, “Bayes Inference in Regression Models with ARMA(p, q) Errors,” Journal of Econometrics, 64 (1994), 183 – 206. [3] D.W.K. Andrews, “Exactly Median-Unbiased Estimation of First Order Autoregressive Unit Root Models,” Econometrica, 61 (1993), 139 – 165. [4] H. Tanizaki, “Bias Correction of OLSE in the Regression Model with Lagged Dependent Variables,” Computational Statistics and Data Analysis, 34 (2000), 495 – 511. [5] H. Tanizaki, “On Least-Squares Bias in the AR(p) Models: Bias Correction Using the Bootstrap Methods,” Unpublished Manuscript (2001), http://ht.econ.kobe-u.ac.jp/~tanizaki/cv/working/unbiased.pdf. [6] J.M. Bernardo and A.F.M. Smith, Bayesian Theory, John Wiley & Sons (1994). [7] B.P. Carlin and T.A. Louis, Bayes and Empirical Bayes Methods for Data Analysis, Chapman & Hall (1996). [8] M.H. Chen, Q.M. Shao and J.G. Ibrahim, Monte Carlo Methods in Bayesian Computation, Springer-Verlag (2000). [9] D. Gamerman, Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference, Chapman & Hall (1997). [10] C.P. Robert and G. Casella, Monte Carlo Statistical Methods, SpringerVerlag (1999). [11] A.F.M. Smith and G.O. Roberts, “Bayesian Computation via Gibbs Sampler and Related Markov Chain Monte Carlo Methods,” Journal of the Royal Statistical Society, Ser.B, 55 (1993), 3 – 23. [12] G. Judge, C. Hill, W. Griffiths and W. Lee, The Theory and Practice of Econometrics, John Wiley & Sons (1980). [13] J. Geweke, “Evaluating the Accuracy of Sampling-Based Approaches to the Calculation of Posterior Moments,” in Bayesian Statistics, Vol.4, 169 – 193 (with discussion), edited by J.M. Bernardo, J.O. Berger, A.P. Dawid and A.F.M. Smith, Oxford University Press (1992). [14] K.L. Mengersen, C.P. Robert and C. Guihenneuc-Jouyaux, “MCMC Convergence Diagnostics: A Reviewww,” in Bayesian Statistics, Vol.6, 514 – 440 (with discussion), edited by J.M. Bernardo, J.O. Berger, A.P. Dawid and A.F.M. Smith, Oxford University Press (1999).