Solutions for Chapter 5 by Alekhya Akula - Full

Chapter 5
Solutions for Resampling Methods
Text book: An Introduction to Statistical Learning with Applications in R
Question 1: Using basic statistical properties of the
variance, as well as single variable calculus, derive (5.6). In
other words, prove that α given by (5.6) does indeed
minimize Var(αX + (1 − α)Y ).
Solution: Minimizing Var(αX + (1 − α)Y ).
As we know:
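$\mathrm{Var}(X+Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X,Y)$
$\mathrm{Var}(aX) = a^2\,\mathrm{Var}(X)$
$\mathrm{Cov}(aX,bY) = ab\,\mathrm{Cov}(X,Y)$
So,
$\mathrm{Var}(\alpha X + (1-\alpha)Y) = \mathrm{Var}(\alpha X) + \mathrm{Var}((1-\alpha)Y) + 2\,\mathrm{Cov}(\alpha X, (1-\alpha)Y)$
$= \alpha^2\,\mathrm{Var}(X) + (1-\alpha)^2\,\mathrm{Var}(Y) + 2\alpha(1-\alpha)\,\mathrm{Cov}(X,Y)$
Writing $\sigma_X^2 = \mathrm{Var}(X)$, $\sigma_Y^2 = \mathrm{Var}(Y)$ and $\sigma_{XY} = \mathrm{Cov}(X,Y)$:
$f(\alpha) = \sigma_X^2\alpha^2 + \sigma_Y^2(1-\alpha)^2 + 2\sigma_{XY}(-\alpha^2+\alpha)$
Take the first derivative with respect to α and set it to zero to find the critical point:
$\frac{d}{d\alpha} f(\alpha) = 0$
$2\sigma_X^2\alpha + 2\sigma_Y^2(1-\alpha)(-1) + 2\sigma_{XY}(-2\alpha+1) = 0$
$\sigma_X^2\alpha - \sigma_Y^2(1-\alpha) + \sigma_{XY}(-2\alpha+1) = 0$
$(\sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY})\alpha - \sigma_Y^2 + \sigma_{XY} = 0$
$\alpha = \frac{\sigma_Y^2 - \sigma_{XY}}{\sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY}}$
The second derivative is $f''(\alpha) = 2(\sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY}) = 2\,\mathrm{Var}(X-Y) \ge 0$, so this critical point is a minimum.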
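As a quick numerical sanity check, the closed-form minimiser α = (σ_Y² − σ_XY) / (σ_X² + σ_Y² − 2σ_XY) can be compared with a one-dimensional numerical search. This is only a sketch: the variance and covariance values below are made up for illustration and do not come from the textbook.

```r
# Hypothetical values chosen only for illustration
var_x  <- 1.0   # Var(X)
var_y  <- 1.25  # Var(Y)
cov_xy <- 0.5   # Cov(X, Y)

# Variance of alpha * X + (1 - alpha) * Y as a function of alpha
f <- function(alpha) {
  alpha^2 * var_x + (1 - alpha)^2 * var_y + 2 * alpha * (1 - alpha) * cov_xy
}

# Closed-form minimiser from the derivation
alpha_formula <- (var_y - cov_xy) / (var_x + var_y - 2 * cov_xy)

# Numerical minimiser from base R's one-dimensional optimiser
alpha_numeric <- optimize(f, interval = c(-5, 5))$minimum

print(c(formula = alpha_formula, numeric = alpha_numeric))
```

The two values should agree up to the tolerance of optimize().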
Question 2: We will now derive the probability that a given
observation is part of a bootstrap sample. Suppose that we
obtain a bootstrap sample from a set of n observations.
(a) What is the probability that the first bootstrap
observation is not the jth observation from the original
sample? Justify your answer.
Solution:
Each bootstrap draw picks one of the n original observations with
equal probability, so the probability that the first draw is the jth
observation is 1/n. The probability that it is not the jth
observation is therefore (1 − 1/n).
(b) What is the probability that the second bootstrap
observation is not the jth observation from the original
sample?
Solution:
The probability is the same as above, (1 − 1/n). Because the
bootstrap samples with replacement, the second draw is made from the
same set of n observations as the first.
(c) Argue that the probability that the jth observation is not
in the bootstrap sample is (1 − 1/n)^n.
Solution:
The probability that any single bootstrap draw is not the jth
observation is (1 − 1/n), as we saw in (a) and (b). The bootstrap
sample consists of n draws, and none of them may be the jth
observation. Because bootstrapping samples with replacement, the
draws are independent of one another, so we multiply (1 − 1/n) by
itself n times. Therefore the answer is (1 − 1/n)^n.
(d) When n = 5, what is the probability that the jth
observation is in the bootstrap sample?
Solution:
The probability that the jth observation is not in the
bootstrap sample is (1 − 1/n)^n.
The probability that the jth observation is in the bootstrap
sample is therefore 1 − (1 − 1/n)^n.
With n = 5, the answer is 1 − (1 − 1/5)^5 = 1 − 0.8^5 = 0.672.
(e) When n = 100, what is the probability that the jth
observation is in the bootstrap sample?
Solution:
The probability that the jth observation is not in the
bootstrap sample is (1 − 1/n)^n.
The probability that the jth observation is in the bootstrap
sample is therefore 1 − (1 − 1/n)^n.
With n = 100, the answer is 1 − (1 − 1/100)^100 = 1 − 0.99^100
= 0.634.
(f) When n = 10,000, what is the probability that the jth
observation is in the bootstrap sample?
Solution:
The probability that the jth observation is not in the
bootstrap sample is (1 − 1/n)^n.
The probability that the jth observation is in the bootstrap
sample is therefore 1 − (1 − 1/n)^n.
With n = 10,000, the answer is 1 − (1 − 1/10000)^10000 = 1 − 0.9999^10000 = 1 − 0.368 = 0.632.
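The three answers in (d) to (f) can be reproduced with one small helper. The sketch below also prints the limiting value 1 − 1/e, which the probabilities approach as n grows; the function name prob_in_sample is introduced here for illustration only.

```r
# Probability that the jth observation appears in a bootstrap sample of size n
prob_in_sample <- function(n) 1 - (1 - 1 / n)^n

n_values <- c(5, 100, 10000)
probs <- sapply(n_values, prob_in_sample)
names(probs) <- n_values
print(round(probs, 3))   # 0.672, 0.634, 0.632

# Limiting value as n grows: 1 - exp(-1)
limit <- 1 - exp(-1)
print(round(limit, 3))   # 0.632
```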
(g) Create a plot that displays, for each integer value of n
from 1 to 100,000, the probability that the jth observation is
in the bootstrap sample. Comment on what you observe.
Solution:
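> library(data.table)
> library(ggplot2)   # ggplot(), geom_point(), xlim(), ylim()
> library(magrittr)  # %>%
> x = seq(1, 100000)
> y = sapply(x, function(n) {1 - ((1 - (1 / n))^n)})
> mydata = data.table(x, y)
> # Show the first 100 values of n, where the curve changes most
> mydata %>%
+   ggplot(aes(x = x, y = y)) +
+   geom_point() + xlim(0, 100) + ylim(0.5, 1)
The probability starts at 1 for n = 1, falls quickly, and levels off near 1 − 1/e ≈ 0.632.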
# Using ggvis
> library(ggvis)
> mydata %>% ggvis(~x, ~y) %>% layer_points() %>%
+   scale_numeric("x", domain = c(0, 100), nice = FALSE, clamp = TRUE) %>%
+   add_axis("y", title = "Y", title_offset = 50)
(h) We will now investigate numerically the probability that
a bootstrap sample of size n = 100 contains the jth
observation. Here j = 4.
We repeatedly create bootstrap samples, and each time we
record whether or not the fourth observation is contained in
the bootstrap sample.
> store=rep(NA, 10000)
> for(i in 1:10000) {
+   store[i]=sum(sample(1:100, rep=TRUE)==4)>0
+ }
> mean(store)
Comment on the results obtained.
Solution:
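> n = 100000
> store = rep(NA, n)
> for (i in 1:n) {
+   store[i] = sum(sample(1:100, rep = TRUE) == 4) > 0
+ }
> mean(store)
[1] 0.633
The simulated proportion agrees closely with the theoretical value 1 − (1 − 1/100)^100 ≈ 0.634 from (e).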
Question 6: We continue to consider the use of a logistic
regression model to predict the probability of default using
income and balance on the Default data set.
In particular, we will now compute estimates for the
standard errors of the income and balance logistic
regression coefficients in two different ways:
(1) using the bootstrap, and
(2) using the standard formula for computing the standard
errors in the glm() function. Do not forget to set a random
seed before beginning your analysis.
Solution:
#data.table is an
(a) Using the summary() and glm() functions, determine the
estimated standard errors for the coefficients associated
with income and balance in a multiple logistic regression
model that uses both predictors.
Solution:
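> library(ISLR)        # provides the Default data set
> library(data.table)
> library(pander)
> defaultData = data.table(Default)  # assumed definition, mirroring weeklyData in Question 7
> set.seed(1)  # any fixed seed works here
> glmModel = glm(default ~ income + balance, data = defaultData, family = binomial)
> pander(summary(glmModel))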
              Estimate    Std. Error   z value   Pr(>|z|)
income        2e-05       4.9e-06      4.174     2.991e-05
balance       0.005647    0.0002       24.84     3.638e-13
(Intercept)   -11.54      0.4348       -26.54    2.95e-155

Dispersion parameter for binomial family taken to be 1
Null deviance: 2921 on 9999 degrees of freedom
Residual deviance: 1579 on 9997 degrees of freedom
(b) Write a function, boot.fn(), that takes as input the
Default data set as well as an index of the observations, and
that outputs the coefficient estimates for income and
balance in the multiple logistic regression model.
Solution:
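> # Statistic for boot(): refit the model on the resampled rows and return the coefficients
> boot.fn = function(formula, data, indices) {
+   mydata = data[indices, ]   # rows selected for this bootstrap replicate
+   glmModel = glm(formula, data = mydata, family = binomial)
+   return(coef(glmModel))
+ }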
(c) Use the boot() function together with your boot.fn()
function to estimate the standard errors of the logistic
regression coefficients for income and balance.
Solution:
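> library(boot)
> library(parallel)   # makeCluster(), detectCores(), stopCluster()
> cl = makeCluster(detectCores())
> result = boot(data = defaultData, statistic = boot.fn, R = 1000,
+   formula = default ~ income + balance,
+   parallel = "snow", ncpus = 8, cl = cl)
> stopCluster(cl)
> result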
(d) Comment on the estimated standard errors obtained
using the glm() function and using your bootstrap function.
Solution:
The bootstrap standard errors for income and balance are very
close to those reported by summary() for the glm() fit.
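The comparison can be tried without the Default data or a cluster. The sketch below is an illustration only: it uses simulated data with made-up coefficients as a stand-in, a hypothetical statistic function coef_fn, and a plain single-process call to boot().

```r
library(boot)

set.seed(1)
# Simulated stand-in for the Default data (hypothetical coefficients)
n <- 1000
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- rbinom(n, 1, plogis(-1 + 0.8 * sim$x1 - 0.5 * sim$x2))

# Statistic for boot(): refit on the resampled rows, return the two slopes
coef_fn <- function(data, indices) {
  fit <- glm(y ~ x1 + x2, data = data[indices, ], family = binomial)
  coef(fit)[c("x1", "x2")]
}

# Standard errors from the usual glm() formula
fit_full <- glm(y ~ x1 + x2, data = sim, family = binomial)
se_formula <- summary(fit_full)$coefficients[c("x1", "x2"), "Std. Error"]

# Standard errors from the bootstrap: spread of the replicate estimates
boot_out <- boot(sim, coef_fn, R = 200)
se_boot <- apply(boot_out$t, 2, sd)

print(rbind(formula = se_formula, bootstrap = se_boot))
```

With a correctly specified model and a reasonably large sample, the two rows should be similar, which is the pattern reported for the Default data.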
Question 7: In Sections 5.3.2 and 5.3.3, we saw that the
cv.glm() function can be used in order to compute the
LOOCV test error estimate. Alternatively, one could compute
those quantities using just the glm() and predict.glm()
functions, and a for loop. You will now take this approach in
order to compute the LOOCV error for a simple logistic
regression model on the Weekly data set. Recall that in the
context of classification problems, the LOOCV error is given
in (5.4).
Solution:
>weeklyData = data.table(Weekly)
>summary(weeklyData)
Statistic   Year   Lag1      Lag2      Lag3      Lag4      Lag5      Volume    Today
Min.        1990   -18.19    -18.19    -18.19    -18.19    -18.19    0.08747   -18.1950
1st Qu.     1995   -1.1540   -1.1540   -1.1580   -1.1580   -1.10     0.33202   -1.1540
Median      2000   0.2410    0.2410    0.2410    0.2380    0.2340    1.00268   0.2410
Mean        2000   0.1506    0.1511    0.1472    0.1458    0.1399    1.57462   0.1499
3rd Qu.     2005   1.4050    1.4090    1.4090    1.4090    1.4050    2.05373   1.4050
Max.        2010   12.0260   12.0260   12.0260   12.0260   12.0260   9.32821   12.0260

Direction: Down 484, Up 605
(a) Fit a logistic regression model that predicts Direction
using Lag1 and Lag2.
Solution:
> glmModel = glm(Direction ~ Lag1 + Lag2, data = Weekly,
family = binomial)
> summary(glmModel)
(b) Fit a logistic regression model that predicts Direction
using Lag1 and Lag2 using all but the first observation.
Solution:
> # Either call drops the first observation; the second is the explicit form
> glmModelB = update(glmModel, subset = -1)
> glmModelB = glm(Direction ~ Lag1 + Lag2, data = Weekly[-1,], family = binomial)
> summary(glmModelB)
(c) Use the model from (b) to predict the direction of the first
observation. You can do this by predicting that the first
observation will go up if P(Direction="Up"|Lag1, Lag2) > 0.5.
Was this observation correctly classified?
Solution:
> testData = Weekly[1,]
> glmProbs = predict(glmModelB, testData, type = "response")
> glmPred = ifelse(glmProbs > .5, "Up", "Down")
> as.character(glmPred)
> as.character(Weekly$Direction[1])
> glmPred == Weekly$Direction[1]   # TRUE if the first observation was correctly classified
(d) Write a for loop from i = 1 to i = n, where n is the number
of observations in the data set, that performs each of the
following steps:
i. Fit a logistic regression model using all but the ith
observation to predict Direction using Lag1 and Lag2.
ii. Compute the posterior probability of the market
moving up for the ith observation.
iii. Use the posterior probability for the ith observation in
order to predict whether or not the market moves up.
iv. Determine whether or not an error was made in predicting
the direction for the ith observation. If an error was made,
then indicate this as a 1, and otherwise indicate it as a 0.
Solution:
> count = rep(0, nrow(Weekly))   # 1 = misclassified, 0 = correct
> for (i in 1:nrow(Weekly)) {
+   glm.fit = glm(Direction ~ Lag1 + Lag2, data = Weekly[-i,], family = binomial)
+   is_up = predict.glm(glm.fit, Weekly[i,], type = "response") > 0.5
+   is_true_up = Weekly[i,]$Direction == "Up"
+   if (is_up != is_true_up)
+     count[i] = 1
+ }
> sum(count)   # number of misclassified observations
(e) Take the average of the n numbers obtained in (d)iv in
order to obtain the LOOCV estimate for the test error.
Comment on the results.
Solution:
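> mean(count)
mean(count) is the proportion of misclassified observations, which is the LOOCV estimate of the test error rate; 1 - mean(count) is the corresponding accuracy.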
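The for-loop approach can be cross-checked against cv.glm() from the boot package. The sketch below is an illustration under stated assumptions: it uses simulated data rather than Weekly, and a 0/1 cost function written for this example so that cv.glm() counts misclassifications at the 0.5 threshold.

```r
library(boot)

set.seed(1)
# Simulated stand-in for a two-predictor classification problem (hypothetical coefficients)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- rbinom(n, 1, plogis(0.5 * sim$x1 - 0.25 * sim$x2))

# Manual LOOCV: leave out observation i, fit, predict it, record a 0/1 error
errors <- rep(0, n)
for (i in seq_len(n)) {
  fit <- glm(y ~ x1 + x2, data = sim[-i, ], family = binomial)
  p <- predict(fit, sim[i, ], type = "response")
  errors[i] <- as.integer((p > 0.5) != (sim$y[i] == 1))
}
loocv_manual <- mean(errors)

# Same estimate via cv.glm(): the cost is 1 when the prediction is on the wrong side of 0.5
cost <- function(y, p) mean(abs(y - p) > 0.5)
full_fit <- glm(y ~ x1 + x2, data = sim, family = binomial)
loocv_boot <- cv.glm(sim, full_fit, cost = cost)$delta[1]

print(c(manual = loocv_manual, cv.glm = loocv_boot))
```

Both numbers are the average of the n leave-one-out 0/1 errors, so they should match.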
Question 8: We will now perform cross-validation on a
simulated data set.
(a) Generate a simulated data set as follows:
> set.seed(1)
> y=rnorm(100)
> x=rnorm(100)
> y=x-2*x^2+rnorm (100)
In this data set, what is n and what is p? Write out the
model used to generate the data in equation form.
Solution:
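Reading the simulation code above: rnorm(100) draws 100 values, so n = 100, and there is a single underlying predictor x, so p = 1 (or p = 2 if x and x^2 are counted as separate terms). Assuming the final rnorm(100) term plays the role of a standard normal error, the generating model can be written as:

```latex
Y = X - 2X^2 + \varepsilon, \qquad \varepsilon \sim N(0, 1)
```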
(b) Create a scatterplot of X against Y . Comment on what
you find.
Solution:
>plot(x,y)
(c) Set a random seed, and then compute the LOOCV errors
that result from fitting the following four models using least
squares:
i. Y = β0 + β1X + ε
ii. Y = β0 + β1X + β2X^2 + ε
iii. Y = β0 + β1X + β2X^2 + β3X^3 + ε
iv. Y = β0 + β1X + β2X^2 + β3X^3 + β4X^4 + ε.
Note you may find it helpful to use the data.frame() function
to create a single data set containing both X and Y .
Solution:
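> library(boot)
> set.seed(10)
> y <- rnorm(100)
> x <- rnorm(100)
> y = x - 2 * x^2 + rnorm(100)
> plot(x, y)
> simulated <- data.frame(x, y)
> cv.error <- rep(0, 4)
> for (i in 1:4) cv.error[i] <- cv.glm(simulated, glm(y ~ poly(x, i), data = simulated))$delta[1]
> cv.error   # LOOCV errors for models i to iv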
(d) Repeat (c) using another random seed, and report your
results. Are your results the same as what you got in (c)?
Why?
Solution:
> set.seed(20)   # a different seed from (c)
> y <- rnorm(100)
> x <- rnorm(100)
> y = x - 2 * x^2 + rnorm(100)
> plot(x, y)
> simulated <- data.frame(x, y)
> cv.error <- rep(0, 5)
> library(boot)
> for (i in 1:5) {
+   glm.fit <- glm(y ~ poly(x, i), data = simulated)
+   cv.error[i] <- cv.glm(simulated, glm.fit)$delta[1]
+ }
> cv.error
The results are not the same with different seeds. The data are
regenerated after each call to set.seed(), so changing the seed
changes the values of X and therefore of Y, and the fitted models and
their errors change with them. For a fixed data set, LOOCV itself
involves no random splitting and returns the same errors whatever
the seed.
(e) Which of the models in (c) had the smallest LOOCV error?
Is this what you expected? Explain your answer.
Solution:
The model with the second degree polynomial gives the
lowest error. With some seeds, the 3rd degree polynomial
gives a lower result, but over several seeds, the 2nd degree
polynomial is lower. This is what we expected because when
we plot x and y, we see a quadratic relationship there.
(f) Comment on the statistical significance of the coefficient
estimates that results from fitting each of the models in (c)
using least squares. Do these results agree with the
conclusions drawn based on the cross-validation results?
Solution:
> library(boot)
> set.seed(10)
> y <- rnorm(100)
> x <- rnorm(100)
> y = x - 2 * x^2 + rnorm(100)
> plot(x, y)
> simulated <- data.frame(x, y)
> cv.error <- rep(0, 5)
> for (i in 1:5) {
+   glm.fit <- glm(y ~ poly(x, i), data = simulated)
+   print(i)
+   print(summary(glm.fit))   # coefficient estimates with their p-values
+   print("&&&&&&&&&&")
+   cv.error[i] <- cv.glm(simulated, glm.fit)$delta[1]
+ }
Question 9: We will now consider the Boston housing data
set, from the MASS library.
(a) Based on this data set, provide an estimate for the
population mean of medv. Call this estimate μ̂.
Solution:
> library(MASS)
> data(Boston)
> boot.mean <- function(data = Boston, index = 1:506) {
+   mean(data$medv[index])
+ }
> mu <- boot.mean(Boston)   # sample mean over all 506 observations
(b) Provide an estimate of the standard error of μ̂. Interpret
this result. Hint: We can compute the standard error of the
sample mean by dividing the sample standard deviation by
the square root of the number of observations.
Solution:
> boot.sd <- function(data = Boston, index = 1:506) {
+   sd(data$medv[index])
+ }
> x <- boot.sd(Boston)   # sample standard deviation of medv
> SE <- x / sqrt(506)
(c) Now estimate the standard error of μ̂ using the
bootstrap. How does this compare to your answer from (b)?
Solution:
> library(boot)
> boot(Boston, boot.mean, R = 1000)
(d) Based on your bootstrap estimate from (c), provide a 95%
confidence interval for the mean of medv. Compare it to
the results obtained using t.test(Boston$medv). Hint: You
can approximate a 95% confidence interval using the
formula [μ̂ − 2SE(μ̂), μ̂ + 2SE(μ̂)].
Solution:
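> set.seed(1)
> boot.out <- boot(Boston, boot.mean, R = 1000)
> SE.boot <- sd(boot.out$t)   # bootstrap standard error of the mean
> mu <- mean(Boston$medv)
> confint1 <- mu - 2 * SE.boot
> confint2 <- mu + 2 * SE.boot
> confint1
> confint2
> t.test(Boston$medv)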
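To see how the pieces of (b) to (d) fit together without the boot package, the bootstrap can also be written by hand. The sketch below is illustrative only: it uses simulated values with a made-up mean and spread as a stand-in for medv, and the variable names are introduced here.

```r
set.seed(1)
# Simulated stand-in for Boston$medv (hypothetical mean and spread)
values <- rnorm(506, mean = 22.5, sd = 9)

mu_hat <- mean(values)
# Formula-based standard error: sample sd divided by sqrt(n)
se_formula <- sd(values) / sqrt(length(values))

# Bootstrap by hand: resample with replacement, recompute the mean each time
boot_means <- replicate(2000, mean(sample(values, replace = TRUE)))
se_boot <- sd(boot_means)

# Approximate 95% interval from the bootstrap SE, next to the t-test interval
ci_two_se <- c(mu_hat - 2 * se_boot, mu_hat + 2 * se_boot)
ci_t <- as.numeric(t.test(values)$conf.int)

print(c(formula = se_formula, bootstrap = se_boot))
print(rbind(two_se = ci_two_se, t_test = ci_t))
```

For a sample mean the two standard errors should be close, and the ±2 SE interval should be slightly wider than the t-test interval because 2 is a little larger than the t quantile.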
(e) Based on this data set, provide an estimate, μ̂_med, for
the median value of medv in the population.
Solution:
> boot.median <- function(data = Boston, index = 1:506) {
+   median(data$medv[index])
+ }
> mu.med <- boot.median(Boston)   # sample median over all 506 observations
(f) We now would like to estimate the standard error of
μ̂_med. Unfortunately, there is no simple formula for
computing the standard error of the median. Instead,
estimate the standard error of the median using the
bootstrap. Comment on your findings.
Solution:
>boot(Boston, boot.median, R=1000)
The median value of medv is 21.2 and the bootstrap standard error is
about 0.38. With these two values we can form an approximate 95%
confidence interval, [μ̂_med − 2SE, μ̂_med + 2SE].
>quantile(Boston$medv, c(0.1))
(g) Based on this data set, provide an estimate for the tenth
percentile of medv in Boston suburbs. Call this quantity
μ̂_0.1. (You can use the quantile() function.)
Solution:
> boot.quantile <- function(data = Boston, index = 1:506) {
+   quantile(data$medv[index], c(0.1))
+ }
> mu01 <- boot.quantile(Boston)   # 10th percentile over all 506 observations
> mu01
(h) Use the bootstrap to estimate the standard error of
μ̂_0.1. Comment on your findings.
Solution:
>boot(Boston, boot.quantile, R=1000)
The bootstrap estimate of the 10th percentile is very close to the
value obtained on the entire data set.
The bootstrap standard error is 0.5066.
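The median in (f) and the tenth percentile in (h) can share one statistic function, because boot() forwards extra named arguments to the statistic. The sketch below is an illustration under assumptions: it uses simulated, skewed values as a stand-in for medv, and quantile_fn and its argument q are names introduced here.

```r
library(boot)

set.seed(1)
# Hypothetical skewed stand-in for Boston$medv
values <- data.frame(medv = rexp(506, rate = 1 / 22))

# Statistic for boot(): the q-th quantile of the resampled values
quantile_fn <- function(data, index, q) {
  unname(quantile(data$medv[index], q))
}

# Extra arguments after R are passed through to quantile_fn
boot_median <- boot(values, quantile_fn, R = 1000, q = 0.5)
boot_p10 <- boot(values, quantile_fn, R = 1000, q = 0.1)

# Bootstrap standard errors: spread of the replicate estimates
se_median <- sd(boot_median$t)
se_p10 <- sd(boot_p10$t)

print(c(median = boot_median$t0, se = se_median))
print(c(p10 = boot_p10$t0, se = se_p10))
```

The t0 component is the statistic on the full data set and t holds the replicate values, so sd() of t is the standard error that boot() prints.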