8. The Simple Linear Regression Model
8.1 The "True" Model
8.2 Understanding the SLR Model
8.3 Predictive Intervals Using the SLR Model
8.4 Sampling Distributions of the Regression Estimates and Confidence Intervals
8.5 Hypothesis Testing

8.1 The "True" Model

• If we take a different sample of size 15, we clearly get a different best-fit regression line. That is, the line that minimizes the in-sample sum of squared errors depends on the particular sample we got.
• Here is a new set of data on 20 houses from the same neighborhood.

[Figure: regression plot of Price vs. Size for the new sample of 20 houses. Fitted line: Y = 24.7587 + 40.7068X, R-Sq = 0.769, with the regression line from the old data overlaid for comparison.]

The true model

• We now describe the "true" relationship between X and Y, or the "true regression model."
• In the last couple of sections of the notes, we considered the example of estimating the Normal(μ, σ²) model with parameters μ and σ².
• There we viewed X̄ and s_x² as estimates of μ and σ².
• We now view b0 and b1 as estimates of the parameters of a "true model."
• So what is the true regression model? Our model is:

    Y = β0 + β1X + ε

where β0 + β1X is the true line describing the linear pattern (β0 is the true intercept, β1 is the true slope), and ε captures how far off the line the points tend to be: everything about Y not captured by X.

We assume that ε has a normal distribution:

    ε ~ N(0, σ²)

Saying that ε is a normally distributed random variable is a statement about the possible values that ε could turn out to be:
• The mean of ε is zero: sometimes Y is above the line, sometimes Y is below the line, and on average Y is on the line.
• The variance of ε is σ². If σ is small, ε tends to be small (close to zero); if σ is large, ε tends to be large (far from zero).

[Figure: a picture of the model Y = β0 + β1X + ε. At each of two values x1 and x2, a normal density N(0, σ²) is centered on the line β0 + β1X; given X = x1, we expect Y to fall within the spread of that density around the line.]

[Figure: two scatterplots of Price vs. Size showing the role played by σ. If σ is large, the points scatter widely around the line; if σ is small, the points hug the line.]

The model

    Y = β0 + β1X + ε,   ε ~ N(0, σ²)

has three unknown parameters:
• β0 and β1 determine the "true line."
• σ describes how big the part of Y unrelated to X can be.

We estimate these parameters from the data. In fact, all we actually have are the data pairs (X, Y); from them we must estimate the true line (β0 and β1) and σ.
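To make the sampling-variability point concrete, here is a minimal Python simulation sketch. The true parameter values β0 = 40, β1 = 35, σ = 15 are the ones used in the housing example later in this section; the size range, sample size, and random seed are illustrative assumptions. Each simulated sample produces a different least-squares line, even though the true line never changes.

```python
import numpy as np

rng = np.random.default_rng(0)

# "True" model (values borrowed from the housing example in these notes):
# Y = 40 + 35*X + eps,  eps ~ N(0, 15^2)
beta0, beta1, sigma = 40.0, 35.0, 15.0

def simulate_and_fit(n, rng):
    """Draw one sample of size n from the true model and return the
    least-squares estimates (b0, b1) for that sample."""
    x = rng.uniform(1.0, 3.5, size=n)                      # sizes, 1000s of sqft (assumed range)
    y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=n)
    b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)    # slope = cov(x, y) / var(x)
    b0 = y.mean() - b1 * x.mean()                          # line passes through the means
    return b0, b1

# Three different samples of size 15 give three different fitted lines.
for _ in range(3):
    b0, b1 = simulate_and_fit(15, rng)
    print(f"b0 = {b0:6.2f}, b1 = {b1:6.2f}")
```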
8.2 Understanding the SLR Model

Regression reveals information about the conditional distribution of Y given X: that is, what we know about Y for a particular value of the X variable. The marginal distribution, by contrast, tells us about a random variable without conditioning on the value of another random variable.

• Recall our discussion of discrete random variables.
• We could talk about the unconditional distribution (e.g., whether bond prices go up or down), or we could talk about conditional distributions (e.g., whether bonds go up or down given that stocks went down).

Joint distribution of bonds (B) and stocks (S), where 0 = down and 1 = up:

              S = 0    S = 1    Pr(B)
    B = 0      .30      .15      .45
    B = 1      .20      .35      .55
    Pr(S)      .50      .50

Marginal or "unconditional" distribution of bonds: Pr(B=0) = .45, Pr(B=1) = .55.

Conditional distribution of bonds given stocks go down:

    Pr(B=0 | S=0) = .3/.5 = 3/5
    Pr(B=1 | S=0) = .2/.5 = 2/5

Interpretation: on days that stocks fell, there was a 3/5 chance that bonds also fell and a 2/5 chance that bonds went up.

Back to the regression problem. We can use the regression model to form the conditional expected value and variance of Y given X:

    μ_{Y|X} = E(Y | X) = E(β0 + β1X + ε | X) = β0 + β1X
    σ²_{Y|X} = Var(Y | X) = Var(β0 + β1X + ε | X) = σ²

The conditional distribution of Y given X:

    Y | X ~ N(β0 + β1X, σ²)

Example

• Suppose that the marginal distribution of sales prices is normal with a mean of 125 (thousand) and a standard deviation of 33.

[Figure: density of the N(125, 33²) marginal distribution of price.]

• Suppose that the true value of β0 is 40 and the true value of β1 is 35.
• Then the average sales price of homes with 3000 sqft is β0 + β1(3) = 40 + 35(3) = 145.

[Figure: density of the N(145, 15²) conditional distribution of price given size = 3000 sqft.]

Let's graph the probability density of the sales price both unconditionally and conditional on size = 3000 sqft.

[Figure: the two densities overlaid. The conditional density given Size = 3000 is centered at 145 and is much tighter than the unconditional density.]

8.3 Predictive Intervals Using the SLR Model

For the housing problem, suppose we knew the true parameter values:

    β0 = 40,  β1 = 35,  σ = 15

What can be said about a house with X = 3.0? From the model,

    Y | X = 3 ~ N(145, 15²)

So there is a 95% chance that a house with 3000 sqft sells for somewhere between 145 − 2(15) and 145 + 2(15), that is, between $115,000 and $175,000.

In general, given X and the model parameters, we are 95% sure that Y is in the interval

    (β0 + β1X − 2σ,  β0 + β1X + 2σ)

In real life, we don't know the true parameters of the model (β0, β1, and σ). We estimate them from the data, and we use the estimated values rather than the true, unknown values.

Our best guess for σ is (basically) the sample standard deviation of the residuals:

    s = √( (1/(n−2)) Σ e_i² )

We divide by n−2 for technical reasons. This estimate is found in the regression output as "StErr of Est":

Results of simple regression for Price

Summary measures
    Multiple R      0.9092
    R-Square        0.8267
    StErr of Est    14.1384    ← the estimate s of σ

ANOVA table
    Source        df    SS           MS           F         p-value
    Explained      1    12393.1077   12393.1077   61.9983   0.0000
    Unexplained   13    2598.6256    199.8943

Regression coefficients
                Coefficient   Std Err   t-value   p-value   Lower limit   Upper limit
    Constant    38.8847       9.0939    4.2759    0.0009    19.2385       58.5309
    Size        35.3860       4.4941    7.8739    0.0000    25.6771       45.0948
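As a quick numerical check, here is a small Python sketch of the 95% predictive-interval calculation, once with the true parameters assumed known and once plugging in the estimates (b0, b1, s) from the output above. The helper name is purely illustrative.

```python
# 95% predictive interval for a 3000 sqft house (X = 3.0):
# Y | X = x ~ N(b0 + b1*x, sd^2), so the interval is (b0 + b1*x) +/- 2*sd.
def predictive_interval(b0, b1, sd, x):
    mean = b0 + b1 * x
    return mean - 2 * sd, mean + 2 * sd

print(predictive_interval(40.0, 35.0, 15.0, 3.0))           # true params: (115.0, 175.0)
print(predictive_interval(38.8847, 35.3860, 14.1384, 3.0))  # estimates: ~(116.8, 173.3)
```

The two intervals are close here because the estimates happen to be close to the true values; in general the plug-in interval is only approximate, since it ignores the estimation error in b0, b1, and s.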
8.4 Sampling Distributions of the Regression Estimates and Confidence Intervals

• It turns out that b0 and b1 are unbiased estimates of β0 and β1, so that E(b0) = β0 and E(b1) = β1.
• The next question is: how wrong might they be?
• We need to know the standard deviation (or standard error) of b0 and b1.

Back to the housing regression: in the output above, the standard errors appear in the Std Err column of the coefficients table, s_b0 = 9.0939 for the constant and s_b1 = 4.4941 for Size.

It turns out that the standardized distance of bj from the true parameter βj follows the t_{n−2} distribution (the "t distribution with n−2 degrees of freedom"):

    (bj − βj) / s_bj ~ t_{n−2}

The confidence intervals and hypothesis tests are based on this key result. Thus, a 95% confidence interval for βj is

    (bj − t_{n−2,.025} s_bj,  bj + t_{n−2,.025} s_bj)

• This is the same notation as before; specifically, in Excel we would use =tinv(.05, n−2) to obtain t_{n−2,.025}.

Back to the housing regression. Goal: form a 95% confidence interval for β1.

    b1 = 35.4,  s_b1 = 4.5,  t_{n−2,.025} = t_{13,.025} = 2.16

Calculation: 35.4 ± (2.16)(4.5) = 35.4 ± 9.7

95% confidence interval for β1: (25.7, 45.1)

8.5 Hypothesis Testing

Suppose you are interested in testing whether the true slope parameter, β1, is equal to a particular value.

Ex: Is the "beta coefficient" from the market model equal to 1?

Or maybe you want to test whether X affects Y at all; in any simple linear regression model, we would test whether β1 is equal to zero.

Example

In finance, a well-known model for pricing assets regresses stock returns against the returns of some market index, such as the S&P 500. The slope of the regression line is referred to as the asset's "beta." In your finance course you will learn that assets with high betas tend to be viewed as more risky than assets with low betas.

Facts:
• The market portfolio (S&P 500) has a beta of 1.
• A portfolio with a beta larger than 1 should have a higher average return than the market (S&P 500).
• A portfolio with a beta less than 1 should have a smaller average return than the market (S&P 500).

Form the following null hypothesis:

    H0: β1 = β1⁰

The alternative hypothesis is defined as the opposite of the null hypothesis:

    Ha: β1 ≠ β1⁰

For the market model example, the null and alternative hypotheses are:

    H0: β1 = 1  and  Ha: β1 ≠ 1

To test H0: β1 = β1⁰, form the t-statistic:

    t = (b1 − β1⁰) / s_b1

The intuition is exactly as before. The numerator is just the difference between the estimate and the claimed value. The denominator is the standard error that accounts for the accuracy of our estimate. If the null hypothesis is true, the t-statistic should be small (in absolute value). If the null hypothesis is false, the t-statistic should be large (in absolute value).

The t-test: reject H0 at the 5% level if

    |t| = |b1 − β1⁰| / s_b1 > t_{n−2,.025}

Beta estimate for IBM (n = 240):

Regression coefficients
                Coefficient   Std Err   t-value   p-value   Lower limit   Upper limit
    Constant    0.0031        0.0046    0.6786    0.4980    −0.0059       0.0122
    Market      0.8889        0.1011    8.7891    0.0000    0.6896        1.0881

Here b1 = 0.8889 and s_b1 = 0.1011.

Test #1: Test the null hypothesis that there is no relationship between the return on IBM and the market return.

    H0: β1 = 0  and  Ha: β1 ≠ 0

t-statistic:

    t = (b1 − 0) / s_b1 = .889/.101 = 8.78

This is exactly the t-value reported by Excel! The sample is large, so the t cutoff is the same as for the Normal(0,1), about 2. Since |8.78| > 2, we reject H0 at the 5% level.

Test #2: Test the null hypothesis that IBM has the same risk as the market portfolio (i.e., test that the beta is 1).

    H0: β1 = 1  and  Ha: β1 ≠ 1

t-statistic:

    t = (b1 − 1) / s_b1 = −0.1111/0.1011 = −1.0989

Since |−1.0989| < 2, we fail to reject H0 at the 5% level.

P-values are calculated just as before:

    p-value = Pr( |t_{n−2}| > |t| )

where t_{n−2} is a random variable from the t_{n−2} distribution and t is the calculated t-statistic.

The p-value for testing β1 = 0 is part of Excel's regression output (under the column labeled "p-value"). Compute the p-value for testing β1 = 1 from the IBM regression:

    p-value = Pr( |t_238| > |−1.0989| ) = tdist(1.0989, 238, 2) = .273

So, the p-value is 27.3%.

RECALL
Loosely speaking: a small p-value means we reject!
More precisely: if the p-value < .05, then we reject at level .05.
In general: if the p-value < α, then we reject at level α.
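The same confidence interval, t-statistic, and p-value can be reproduced in a few lines of Python using scipy.stats; this is a sketch built from the IBM numbers reported in the output above, not a re-fit of the underlying return data.

```python
from scipy import stats

# IBM market-model numbers from the output above: b1 = 0.8889, s_b1 = 0.1011, n = 240
b1, s_b1, n = 0.8889, 0.1011, 240
t_crit = stats.t.ppf(0.975, df=n - 2)        # t_{n-2,.025}; Excel's =tinv(.05, n-2)

# 95% confidence interval for beta1
lo, hi = b1 - t_crit * s_b1, b1 + t_crit * s_b1
print(f"95% CI for beta1: ({lo:.4f}, {hi:.4f})")       # ~(0.6897, 1.0881)

# Test #2: H0: beta1 = 1
t_stat = (b1 - 1.0) / s_b1
p_val = 2 * stats.t.sf(abs(t_stat), df=n - 2)          # two-sided p-value
print(f"t = {t_stat:.4f}, p-value = {p_val:.3f}")      # t ~ -1.0989, p ~ 0.273
```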
Adding more X's to the prediction problem: the multiple regression model

• Clearly there is more information that is useful in predicting the sales price of a house than just the size.
• We need to expand our model to allow for this possibility.
• The expanded model is called the multiple regression model.

9. The Multiple Regression Model
9.1 The multiple regression model
9.2 Fitted values and residuals again
9.3 Prediction in the multiple regression model
9.4 Confidence intervals and hypothesis testing

9.1 The multiple regression model

• So far we have considered modeling and prediction when we have one Y variable and one X.
• Often we might think that there is more than one variable that would be useful in predicting (the size of the house is surely not the only variable that carries information about the expected sales price!).
• More information should lead to more precise prediction.

The "true" model with k independent (X) variables:

    Y = β0 + β1X1 + β2X2 + ... + βkXk + ε,   ε ~ N(0, σ²) i.i.d.

The conditional mean, the "true plane" (in a (k+1)-dimensional space!),

    E(Y | X1, X2, ..., Xk) = β0 + β1X1 + β2X2 + ... + βkXk,

is the part of Y related to the X's. The residual, ε, is independent of all the X variables!

Once there is more than one X variable, it's difficult to plot the data and/or the regression "line." One could use a 3-D plot for the case of two X variables, but anything more than two variables is basically hopeless.

Interpreting the coefficients: βj is the average (or expected) change in Y for a one-unit change in Xj, holding all of the other independent variables fixed:

    βj = ∂E(Y | X1, ..., Xk) / ∂Xj

9.2 Fitted values and residuals

Parameters to estimate: β0, β1, ..., βk, σ
Notation for the estimates: b0, b1, ..., bk, s

Fitted values (in sample):

    Ŷ_i = b0 + b1X_{1,i} + b2X_{2,i} + ... + bkX_{k,i}

Residuals:

    e_i = Y_i − Ŷ_i

Choose b0, b1, ..., bk to make the sum of squared residuals as small as possible:

    minimize Σ e_i² = Σ (Y_i − b0 − b1X_{1,i} − ... − bkX_{k,i})²

In the simple linear regression case (k = 1), we gave explicit formulae for the least-squares estimates. To do so for k > 1 requires extensive use of matrix algebra. We pass.

However:

R-squared

• Rearranging the terms in the definition of the residual: Y_i = Ŷ_i + e_i, and Var(Y) = Var(Ŷ) + Var(e).
• The R-squared has the same interpretation as before and is defined as

    R² = Var(Ŷ) / Var(Y)

• It answers the question of what fraction of the variation in the Y's is explained by the X's.
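In practice the matrix algebra is delegated to a numerical least-squares routine. Here is an illustrative Python sketch on made-up data (the predictor ranges, coefficients, and seed are all assumptions, not the notes' data); the fitted values, residuals, and R² then follow directly from the definitions above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data for a k = 2 regression (illustrative only)
n, k = 100, 2
X = np.column_stack([np.ones(n),              # column of 1s for the intercept b0
                     rng.uniform(1, 3.5, n),  # X1
                     rng.uniform(0, 40, n)])  # X2
y = X @ np.array([40.0, 35.0, -0.5]) + rng.normal(0, 15, n)

# Least-squares estimates b0, b1, b2: the matrix-algebra step the notes skip
b, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ b                            # fitted values Yhat_i
e = y - y_hat                            # residuals e_i = Y_i - Yhat_i
r_squared = np.var(y_hat) / np.var(y)    # R^2 = Var(Yhat) / Var(Y)
print(f"b = {np.round(b, 3)}, R^2 = {r_squared:.3f}")
```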
Given b0, b1, ..., bk, the estimate of the variance of the true residual (σ²) is

    s² = (1/(n−k−1)) Σ e_i²

and the estimate of the standard deviation (σ) is

    s = √( (1/(n−k−1)) Σ e_i² )

Fact: E(s²) = σ² (the estimate is unbiased).

Regression output of son's height on mom's height and dad's height:

Results of multiple regression for SHGT

Summary measures
    Multiple R      0.5083
    R-Square        0.2583    ← R²
    Adj R-Square    0.2380
    StErr of Est    2.5216    ← s

ANOVA table
    Source        df    SS         MS        F         p-value
    Explained      2    161.6780   80.8390   12.7133   0.0000
    Unexplained   73    464.1773   6.3586

Regression coefficients
                Coefficient   Std Err   t-value   p-value   Lower limit   Upper limit
    Constant    29.0036       8.4198    3.4447    0.0010    12.2230       45.7842
    MHGT        0.3601        0.1092    3.2966    0.0015    0.1424        0.5778
    FHGT        0.2726        0.1063    2.5641    0.0124    0.0607        0.4846

The Coefficient column gives b0 (Constant), b1 (MHGT), and b2 (FHGT).

9.3 Prediction in the multiple regression model

Predict Y given values X_{j,f} (j = 1, ..., k), where the prediction is obtained by plugging the X's into the fitted equation:

    Ypred = b0 + b1X_{1,f} + b2X_{2,f} + ... + bkX_{k,f}

A 95% prediction interval is

    (Ypred − 2s,  Ypred + 2s)

where s is the standard error of the prediction error.

Prediction from the multiple regression above: predict the son's height when mom's height is 64 inches and dad's height is 70 inches.

    Ypred = 29.00 + (.36)(64) + (.27)(70) = 70.94

95% prediction interval: 70.94 ± 2(2.52), i.e., roughly (65.9, 76.0).

9.4 Distribution of bj in the multiple regression

Fact: E(b0) = β0, E(b1) = β1, ..., E(bk) = βk.

Fact:

    (bj − βj) / s_bj ~ t_{n−k−1}

The n−k−1 degrees of freedom account for the fact that (k+1) parameters are being estimated. This fact can be used to construct confidence intervals and perform t-tests.

(1) Confidence intervals

A 95% confidence interval for βj is:

    (bj − t_{n−k−1,.025} s_bj,  bj + t_{n−k−1,.025} s_bj)

(2) t-tests and p-values

Test H0: βj = βj⁰ vs. Ha: βj ≠ βj⁰. Reject H0 at the 5% level if

    |t| = |bj − βj⁰| / s_bj > t_{n−k−1,.025}

    p-value = Pr( |t_{n−k−1}| > |t| )

where t_{n−k−1} is a random variable from the t_{n−k−1} distribution and t is the calculated t-statistic.

Example: regression of male MBA student height on parents' heights (n = 76).

From the coefficients table above, both "partial" effects are significantly different from zero (both p-values are below .05).

Confidence interval for the MHGT estimate:

    (0.36 − t_{73,.025}(0.109),  0.36 + t_{73,.025}(0.109)) = (0.14, 0.58)

Confidence interval for the FHGT estimate:

    (0.273 − t_{73,.025}(0.106),  0.273 + t_{73,.025}(0.106)) = (0.061, 0.485)
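As a numerical check, here is a short Python sketch that reproduces the t-statistics, two-sided p-values, and 95% confidence limits from the reported coefficients and standard errors of the heights regression (it works from the summary output above, not from the raw data).

```python
from scipy import stats

# Heights regression from the output above: n = 76 observations, k = 2 X variables
n, k = 76, 2
t_crit = stats.t.ppf(0.975, df=n - k - 1)   # t_{73,.025}, about 1.993

# (coefficient, standard error) pairs copied from the coefficients table
coefs = {"Constant": (29.0036, 8.4198),
         "MHGT": (0.3601, 0.1092),
         "FHGT": (0.2726, 0.1063)}

for name, (b, se) in coefs.items():
    t_stat = b / se                                    # t-statistic for H0: beta_j = 0
    p_val = 2 * stats.t.sf(abs(t_stat), df=n - k - 1)  # two-sided p-value
    lo, hi = b - t_crit * se, b + t_crit * se          # 95% confidence interval
    print(f"{name:8s} t = {t_stat:.4f}  p = {p_val:.4f}  CI = ({lo:.4f}, {hi:.4f})")
```

Running this recovers the t-values, p-values, and lower/upper limits printed in the regression output, up to rounding.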