Solutions for Chapter 5 by Alekhya Akula - Full
Transcription
Solutions for Chapter 5 by Alekhya Akula - Full
Chapter 5 Solutions for Resampling Methods Text book: An Introduction to Statistical Learning with Applications in R Question 1: Using basic statistical properties of the variance, as well as single variable calculus, derive (5.6). In other words, prove that α given by (5.6) does indeed minimize Var(αX + (1 − α)Y ). Solution: Minimizing Var(αX + (1 − α)Y ). As we know: Var(X+Y) = Var(X) + Var(Y) + 2Cov(X,Y) Var(aX) = 𝑎2 Var(X) Cov(aX,bY) = abCov(X,Y). So, Var(αX + (1 − α)Y ) = Var(αX) + Var((1 − α)Y) + 2Cov(αX, (1 − α)Y) = 𝛼 2 Var(X) + (1 − 𝛼)2 Var(Y) + 2α(1-α)Cov(X,Y) f(𝛼) = 𝜎 2𝑋 𝛼 2 + 𝜎 2 𝑌 (1 − 𝛼)2 + 2𝜎𝑋𝑌 (-𝛼 2 + 𝛼) Take the first derivative respect to α to find critical points: 𝑑 𝑑𝛼 f(𝛼) = 0 2𝜎 2𝑋 𝛼 + 2𝜎 2 𝑌 (1- 𝛼)(-1) + 2𝜎𝑋𝑌 (-2 𝛼 + 1) = 0 𝜎 2𝑋 𝛼 + 𝜎 2 𝑌 (1- 𝛼) + 𝜎𝑋𝑌 (-2 𝛼 + 1) = 0 (𝜎 2𝑋 + 𝜎 2 𝑌 - 2𝜎𝑋𝑌 )𝛼 - 𝜎 2 𝑌 + 𝜎𝑋𝑌 = 0 Chapter 5 Solutions for Resampling Methods Text book: An Introduction to Statistical Learning with Applications in R 𝜎 2𝑋 𝜎 2 𝑌 − 𝜎𝑋𝑌 + 𝜎 2 𝑌 − 2𝜎𝑋𝑌 Question 2: We will now derive the probability that a given observation is part of a bootstrap sample. Suppose that we obtain a bootstrap sample from a set of n observations. (a) What is the probability that the first bootstrap observation is not the jth observation from the original sample? Justify your answer. Solution: Given that there are n observations and bootstrap sampling draws items. Excluding the jth observation, the total number 1 of items are n-1. The probability is (1 − ). 𝑛 (b) What is the probability that the second bootstrap observation is not the jth observation from the original sample? Solution: Chapter 5 Solutions for Resampling Methods Text book: An Introduction to Statistical Learning with Applications in R 1 The probability is the same as above (1 − ). Again, the 𝑛 second time you pick an observation, the set of observations you start with is the same, because you are sampling with replacement. (c) Argue that the probability that the jth observation is not 1 in the bootstrap sample is (1 − )𝑛 . 𝑛 Solution: The probability that the jth sample is not the first sample in 1 your bootstrap is (1 − ) (Like we saw above questions). The 𝑛 total bootstrap sample size is n. So we need to pick n different observations and none of them should be the jth one. As bootstrapping does sampling with replacement, the probabilities of each observation are independent of one another. If that is the case, then we just have to multiply 1 1 𝑛 𝑛 (1 − ), n times, therefore the answer is (1 − )𝑛 . (d) When n = 5, what is the probability that the jth observation is in the bootstrap sample? Chapter 5 Solutions for Resampling Methods Text book: An Introduction to Statistical Learning with Applications in R Solution: The probability that the jth observation is not in the 1 bootstrap sample is (1 − ). 𝑛 The probability that the jth observation is in the bootstrap 1 sample is just going to be 1- (1 − ). 𝑛 With n=5, the answer will be 1-[(1-1/5)^5] = 1-0.8^5 = 0.672 (e) When n = 100, what is the probability that the jth observation is in the bootstrap sample? Solution: The probability that the jth observation is not in the 1 bootstrap sample is (1 − ). 𝑛 The probability that the jth observation is in the bootstrap 1 sample is just going to be 1- (1 − ). 𝑛 With n=5, the answer will be 1-[(1-1/100)^100] = 1-0.99^100 = 0.624 (f) When n = 10, 000, what is the probability that the jth observation is in the bootstrap sample? Chapter 5 Solutions for Resampling Methods Text book: An Introduction to Statistical Learning with Applications in R Solution: The probability that the jth observation is not in the 1 bootstrap sample is (1 − ). 𝑛 The probability that the jth observation is in the bootstrap 1 sample is just going to be 1- (1 − ). 𝑛 With n=5, the answer will be 1-[(1-1/10000)^10000] = 10.9999^10000 = 1-0.367 = 0.633 (g) Create a plot that displays, for each integer value of n from 1 to 100, 000, the probability that the jth observation is in the bootstrap sample. Comment on what you observe. Solution: >x = seq(1,100000) >y = sapply(x,function(n) {1- ((1- (1 / n))^n)}) >mydata = data.table(x,y) mydata %>% ggplot(aes(x=x, y=y)) + geom_point() + xlim(0,100) + ylim(0.5,1) Chapter 5 Solutions for Resampling Methods Text book: An Introduction to Statistical Learning with Applications in R #Using ggvis mydata %>%> ggvis(-x,-y) %>%> layer_points() %>%> scale_numeric(“x”,domain = c(0,100), nice= FALSE, clamp = TRUE) %>% add_axis(“y”, title=”Y”, title_offset = 50) (h) We will now investigate numerically the probability that a bootstrap sample of size n = 100 contains the jth observation. Here j = 4. We repeatedly create bootstrap samples, and each time we record whether or not the fourth observation is contained in the bootstrap sample. Chapter 5 Solutions for Resampling Methods Text book: An Introduction to Statistical Learning with Applications in R > store=rep(NA, 10000) > for(i in 1:10000) { store[i]=sum(sample (1:100, rep=TRUE)==4) >0 } > mean(store) Comment on the results obtained. Solution: >n=100000 >store=rep(NA,n) >for(i in 1:n){ + store[i] = sum(sample(1:100,rep=TRUE)==4)>0 +} >mean(store) [1] 0.633 Question 6: We continue to consider the use of a logistic regression model to predict the probability of default using income and balance on the Default data set. Chapter 5 Solutions for Resampling Methods Text book: An Introduction to Statistical Learning with Applications in R In particular, we will now compute estimates for the standard errors of the income and balance logistic regression coefficients in two different ways: (1) using the bootstrap, and (2) using the standard formula for computing the standard errors in the glm() function. Do not forget to set a random seed before beginning your analysis. Solution: #data.table is an (a) Using the summary() and glm() functions, determine the estimated standard errors for the coefficients associated with income and balance in a multiple logistic regression model that uses both predictors. Solution: >set.seed >glmModel = glm(default ~ income + balance , data = defaultData, family = binomial) >pander(summary(glmModel)) Chapter 5 Solutions for Resampling Methods Text book: An Introduction to Statistical Learning with Applications in R Estimate Income 2e-05 balance 0.005647 Intrcept -11.54 Error 4.9e-06 0.0002 0.4348 Z val 4.174 24.84 26.54 Pr> 2.991e-05 3.638e-13 2.95e-155 Dispersion parameter for binomial family taken to be 1 Null deviance: 2921 on 9999 degrees of freedom Residual deviance: 1579 on 9997 degrees of freedom (b) Write a function, boot.fn(), that takes as input the Default data set as well as an index of the observations, and that outputs the coefficient estimates for income and balance in the multiple logistic regression model. Solution: >boot.fn = function(formula, data, indices) { mydata = data[indices,] glmModel = glm(formula, data =mydata, family = binomial) return(coef(glmModel)) } Chapter 5 Solutions for Resampling Methods Text book: An Introduction to Statistical Learning with Applications in R (c) Use the boot() function together with your boot.fn() function to estimate the standard errors of the logistic regression coefficients for income and balance. Solution: >cl = makeCluster(detectCores()) >result = boot(data = defaultData, statistic = boot.fn, R = 1000, formula = default – income + balance, parallel = “snow”, ncpus = 8, cl = cl) >result (d) Comment on the estimated standard errors obtained using the glm() function and using your bootstrap function. Solution: The estimates of the bootstrap are really close to the glm summary estimates. Question 7: In Sections 5.3.2 and 5.3.3, we saw that the cv.glm() function can be used in order to compute the LOOCV test error estimate. Alternatively, one could compute those quantities using just the glm() and predict.glm() functions, and a for loop. You will now take this approach in Chapter 5 Solutions for Resampling Methods Text book: An Introduction to Statistical Learning with Applications in R order to compute the LOOCV error for a simple logistic regression model on the Weekly data set. Recall that in the context of classification problems, the LOOCV error is given in (5.4). Solution: >weeklyData = data.table(Weekly) >summary(weeklyData) Year Min.:1990 1st Qu.:1995 Median:2000 Mean:2000 3rd Qu.:2005 Max.:2010 Lag1 Min.:-18.19 1st Qu.:-1.1540 Median:0.2410 Mean:0.1506 3rd Qu.:1.4050 Max.:12.0260 Volume Min.:0.08747 1st Qu.:0.33202 Median:1.00268 Mean:1.57462 3rd Qu.:2.05373 Max.:9.32821 Lag2 Min.:-18.19 1st Qu.:-1.1540 Median:0.2410 Mean:0.1511 3rd Qu.:1.4090 Max.:12.0260 Today Min.:-18.1950 1st Qu.:-1.1540 Median:0.2410 Mean:0.1499 3rd Qu.:1.4050 Max.:12.0260 Lag3 Min.:-18.19 1st Qu.:-1.1580 Median:0.2410 Mean:0.1472 3rd Qu.:1.4090 Max.:12.0260 Lag4 Min.:-18.19 1st Qu.:-1.1580 Median:0.2380 Mean:0.1458 3rd Qu.:1.4090 Max.:12.0260 Lag5 Min.:-18.19 1st Qu.:-1.10 Median:0.2340 Mean:0.1399 3rd Qu.:1.4050 Max.:12.0260 Direction Down:484 Up: 605 NA NA NA NA (a) Fit a logistic regression model that predicts Direction using Lag1 and Lag2. Chapter 5 Solutions for Resampling Methods Text book: An Introduction to Statistical Learning with Applications in R Solution: > glmModel = glm(Direction ~ Lag1 + Lag2, data = Weekly, family = binomial) > summary(glmModel) (b) Fit a logistic regression model that predicts Direction using Lag1 and Lag2 using all but the first observation. Solution: > glmModelB = update(glmModel, subset=-1) > glmModelB = glm(Direction ~ Lag1 + Lag2, data = Weekly[-1,], family = binomial) Chapter 5 Solutions for Resampling Methods Text book: An Introduction to Statistical Learning with Applications in R > summary(glmModelB) (c) Use the model from (b) to predict the direction of the first observation. You can do this by predicting that the first observation will go up if P(Direction="Up"|Lag1, Lag2) > 0.5. Was this observation correctly classified? Solution: > testData = Weekly[1,] Chapter 5 Solutions for Resampling Methods Text book: An Introduction to Statistical Learning with Applications in R > glmProbs = predict(glmModelB, testData, type = "response") > glmPred = ifelse(glmProbs > .5, "up", "Down") > as.character(glmPred) > as.character(Weekly$Direction[1]) (d) Write a for loop from i = 1 to i = n, where n is the number of observations in the data set, that performs each of the following steps: i. Fit a logistic regression model using all but the ith observation to predict Direction using Lag1 and Lag2. ii. Compute the posterior probability of the market moving up for the ith observation. iii. Use the posterior probability for the ith observation in order to predict whether or not the market moves up. iv. Chapter 5 Solutions for Resampling Methods Text book: An Introduction to Statistical Learning with Applications in R Determine whether or not an error was made in predicting the direction for the ith observation. If an error was made, then indicate this as a 1, and otherwise indicate it as a 0. Solution: > count = rep(NA, nrow(Weekly) > for (i in 1:nrow(Weekly)) { glm.fit = glm(Direction ~ Lag1 + Lag2, data = Weekly[-i,], family = binomial) is_up = predict.glm(glm.fit, Weekly[i,], type="response") > 0.5 is_true_up = Weekly[i,]$Direction == "Up" if(is_up != is_true_up) count[i] = 1 } Sum(count) ## [1] NA NA errors. Chapter 5 Solutions for Resampling Methods Text book: An Introduction to Statistical Learning with Applications in R (e) Take the average of the n numbers obtained in (d)iv in order to obtain the LOOCV estimate for the test error. Comment on the results. Solution: 1-mean(count) ## [1] NA Question 8: We will now perform cross-validation on a simulated data set. (a) Generate a simulated data set as follows: > set.seed(1) > y=rnorm(100) > x=rnorm(100) > y=x-2*x^2+rnorm (100) In this data set, what is n and what is p? Write out the model used to generate the data in equation form. Solution: Chapter 5 Solutions for Resampling Methods Text book: An Introduction to Statistical Learning with Applications in R (b) Create a scatterplot of X against Y . Comment on what you find. Solution: >plot(x,y) Chapter 5 Solutions for Resampling Methods Text book: An Introduction to Statistical Learning with Applications in R (c) Set a random seed, and then compute the LOOCV errors that result from fitting the following four models using least squares: i. Y = β0 + β1X + ε ii. Y = β0 + β1X + β2X2 + ε iii. Y = β0 + β1X + β2X2 + β3X3 + ε iv. Y = β0 + β1X + β2X2 + β3X3 + β4X4 + ε. Note you may find it helpful to use the data.frame() function to create a single data set containing both X and Y . Chapter 5 Solutions for Resampling Methods Text book: An Introduction to Statistical Learning with Applications in R Solution: set.seed(10) > y <- rnorm(100) > x <- rnorm(100) > y = x-2 * x^2 + rnorm(100) > plot(x,y) Chapter 5 Solutions for Resampling Methods Text book: An Introduction to Statistical Learning with Applications in R (d) Repeat (c) using another random seed, and report your results. Are your results the same as what you got in (c)? Why? Solution: set.seed(10) > y <- rnorm(100) > x <- rnorm(100) > plot(x,y) > simulated <- data.frame(x,y) > cv.error <- rep(0.5) > library(boot) > for (i in 1:5) { glm.fit <- glm(y ~ poly(x,i), data=simulated) cv.error[i] <- cv.glm(simulated,glm.fit)$delta[1] } Chapter 5 Solutions for Resampling Methods Text book: An Introduction to Statistical Learning with Applications in R The results are not the same with different seeds. That is because the seed sets the values of X, which then sets the values of Y. If we change the seed, the random numbers we generate for X change and that gives us different results. (e) Which of the models in (c) had the smallest LOOCV error? Is this what you expected? Explain your answer. Chapter 5 Solutions for Resampling Methods Text book: An Introduction to Statistical Learning with Applications in R Solution: The model with the second degree polynomial gives the lowest error. With some seeds, the 3rd degree polynomial gives a lower result, but over several seeds, the 2nd degree polynomial is lower. This is what we expected because when we plot x and y, we see a quadratic relationship there. (f) Comment on the statistical significance of the coefficient estimates that results from fitting each of the models in (c) using least squares. Do these results agree with the conclusions drawn based on the cross-validation results? Solution: > set.seed(10) > y <- rnorm(100) > x <- rnorm(100) > y = x-2 * x^2 + rnorm(100) > plot(x,y) > simulated <- data.frame(x,y) > cv.error <- rep(0.5) > for (i in 1:5) { + glm.fit <- glm(y ~ poly(x,i), data=simulated) Chapter 5 Solutions for Resampling Methods Text book: An Introduction to Statistical Learning with Applications in R + print(i) + print(summary(glm.fit)) + print("&&&&&&&&&&") + cv.error[i] <- cv.glm(simulated,glm.fit)$delta[1] +} Question 9: We will now consider the Boston housing data set, from the MASS library. Chapter 5 Solutions for Resampling Methods Text book: An Introduction to Statistical Learning with Applications in R (a) Based on this data set, provide an estimate for the population mean of medv. Call this estimate ˆμ. Solution: >library(MASS) >load(Boston) >boot.mean<-function(data=Boston, index=1:506){ mean(data$medv[index]) } >set.seed(1) >mu<-boot.mean(Boston,sample(506,506,replace=T)) (b) Provide an estimate of the standard error of ˆμ. Interpret this result. Hint: We can compute the standard error of the sample mean by dividing the sample standard deviation by the square root of the number of observations. Solution: >boot.sd<-function(data=Boston, index=1:506){ sd(data$medv[index]) } >set.seed(1) Chapter 5 Solutions for Resampling Methods Text book: An Introduction to Statistical Learning with Applications in R >x<-boot.sd(Boston,sample(506,506,replace=T)) >SE<-x/sqrt(506) (c) Now estimate the standard error of ˆμ using the bootstrap. How does this compare to your answer from (b)? Solution: >boot(Boston, boot.mean,1000) (d) Based on your bootstrap estimate from (c), provide a 95 % con- fidence interval for the mean of medv. Compare it to the results obtained using t.test(Boston$medv). Hint: You can approximate a 95 % confidence interval using the formula [ˆμ − 2SE(ˆμ), μˆ + 2SE(ˆμ)]. Solution: >confint1<-mu-2*SE >confint2<-mu+2*SE >confint1 >confint2 >t.test(Boston$medv) (e) Based on this data set, provide an estimate, ˆμmed, for the median value of medv in the population. Chapter 5 Solutions for Resampling Methods Text book: An Introduction to Statistical Learning with Applications in R Solution: >boot.median<-function(data=Boston, index=1:506){ median(data$medv[index]) } >set.seed(1) >median<-boot.median(Boston,sample(506,506,replace=T)) (f) We now would like to estimate the standard error of ˆμmed. Unfortunately, there is no simple formula for computing the standard error of the median. Instead, estimate the standard error of the median using the bootstrap. Comment on your findings. Solution: >boot(Boston, boot.median, R=1000) The median value of medv is 21.9 and the std error is 0.38. So, using these 2 values, we can find the 95% confidence intervals. >quantile(Boston$medv, c(0.1)) (g) Based on this data set, provide an estimate for the tenth percentile of medv in Boston suburbs. Call this quantity ˆμ0.1. (You can use the quantile() function.) Chapter 5 Solutions for Resampling Methods Text book: An Introduction to Statistical Learning with Applications in R Solution: >boot.quantile<-function(data=Boston, index=1:506){ quantile(data$medv[index], c(0.1)) } >set.seed(1) >mu01<-boot.quantile(Boston,sample(506,506,replace=T)) >mu01 (h) Use the bootstrap to estimate the standard error of ˆμ0.1. Comment on your findings. Solution: >boot(Boston, boot.quantile, R=1000) The bootstrap estimate of the boot.quantile statistic is very close to what we got by running it on the entire data set. The Std Error is 0.5066