Assignment - Jeff Goldsmith

Biostatistics P8111, Spring 2015
Homework 4
Due April 10 by 6:00pm
Please email an electronic copy (PDF) of your solutions to ajg2202@cumc.columbia.edu. Use the title
“P8111HW4_Lastname_Firstname.pdf”.
Solutions to Problem 1 can be typed or handwritten and scanned; Problems 2-4 should be completed using R
markdown. In addition to your PDF, please also submit the .Rmd that produces your written solutions to
Problems 2-4 with the title “P8111HW4_Lastname_Firstname.Rmd”.
Problem 1. [5 points]
Consider the multiple linear regression model
$y = X\beta + \epsilon$ with response vector $y$, design matrix $X$, coefficient vector $\beta$, and error vector $\epsilon$. In Lecture 16, we found
ridge regression estimates by using
$$\hat{\beta}_R = \arg\min_{\beta} \; (y - X\beta)^T (y - X\beta) + \lambda \beta^T P \beta$$
with penalty matrix $P$ and tuning parameter $\lambda$.
We now make the additional assumptions
(a) That errors are multivariate normal: $\epsilon \sim N(0, \sigma^2 I)$.
(b) That regression coefficients are also multivariate normal: $\beta \sim N(0, \sigma_\beta^2 P^{-1})$.
Show that, under these assumptions, the MLE for beta is the same as the ridge regression estimate (it is
sufficient to show that maximizing the likelihood is equivalent to minimizing the objective function above).
What is the relationship between $\lambda$, $\sigma^2$, and $\sigma_\beta^2$?
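One way to get a feel for the ridge objective in this problem is to check numerically that its minimizer has the closed form $(X^T X + \lambda P)^{-1} X^T y$. The following is a minimal sketch on simulated data; the sample size, the identity penalty matrix, and the value `lambda = 2` are assumptions chosen only for illustration.

```r
# Simulated data (hypothetical; not from the course)
set.seed(1)
n <- 50
p <- 3
X <- matrix(rnorm(n * p), n, p)
beta_true <- c(1, -2, 0.5)
y <- as.vector(X %*% beta_true + rnorm(n))

P <- diag(p)    # assumed penalty matrix (identity)
lambda <- 2     # assumed tuning parameter

# Penalized least squares objective: (y - Xb)'(y - Xb) + lambda * b'Pb
ridge_objective <- function(beta) {
  resid <- y - X %*% beta
  as.numeric(crossprod(resid) + lambda * t(beta) %*% P %*% beta)
}

# Closed-form minimizer of the quadratic objective
beta_closed <- as.vector(solve(crossprod(X) + lambda * P, crossprod(X, y)))

# Direct numerical minimization of the same objective
beta_numeric <- optim(rep(0, p), ridge_objective, method = "BFGS")$par

# The two sets of estimates should agree up to optimizer tolerance
cbind(closed = beta_closed, numeric = beta_numeric)
```

The sketch only confirms what the objective function computes; it does not replace the likelihood argument the problem asks for.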
Problem 2 [2+3+1+2+3+4 = 15 points]
A cross-sectional study of Nepalese children was carried out to understand the relationships between various
measures of growth, including arm circumference, weight, and height; also included are age and sex. Data for
2706 children can be downloaded using
data = read.csv("http://jeffgoldsmith.com/P8111/P8111_HWs/nepalese.csv")
In this problem, the goal is to build an interpretable model for arm circumference as an outcome using all
other available covariates.
(a) Build a piece-wise linear model using the continuous predictors weight and age. How many knots would
you use for each, and where would you place them? How did you make this choice?
(b) Your collaborators believe that the effect of weight on arm circumference can be described by a piecewise
linear fit with a single knot at 7. They also believe that the effect of weight on arm circumference is
different for boys and for girls, both before and after the changepoint. To examine these hypotheses, write
down a regression model with (1) a piecewise linear fit for weight, (2) sex, and (3) weight x sex interactions
before and after the changepoint. Interpret the coefficients of this model.
(c) Fit the model in part (b) and test the hypothesis that the slope for weight after the changepoint is the
same for boys and girls.
(d) Again using the model in part (b), test the null hypothesis that after the changepoint, weight is not
associated with arm circumference in girls.
(e) Fit an additive model with smooth terms for weight and height; plot and briefly interpret the coefficients.
(f) Build a model for arm circumference using all available covariates. Briefly describe your model building
process, including forms for the mean you considered and the criteria you used to choose among models, and
the major findings of this process. Your description should be no more than one paragraph and use at most
one figure.
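For part (b), a piecewise linear term with a knot at 7 is commonly coded with a truncated-line basis, `pmax(weight - 7, 0)`. The sketch below illustrates that coding on simulated data; the data frame `sim`, the column names `armc` and `weight_after7`, and the sex labels are hypothetical and may not match the columns in nepalese.csv.

```r
# Simulated stand-in for the study data (hypothetical names and values)
set.seed(2)
n <- 200
sim <- data.frame(
  weight = runif(n, 3, 15),
  sex = factor(sample(c("boy", "girl"), n, replace = TRUE))
)

# Truncated-line basis: zero below the knot, (weight - 7) above it
sim$weight_after7 <- pmax(sim$weight - 7, 0)

# Simulated outcome whose weight slope changes at the knot
sim$armc <- 10 + 0.6 * sim$weight - 0.3 * sim$weight_after7 +
  rnorm(n, sd = 0.5)

# Sex main effect plus sex interactions with both weight terms
fit_pw <- lm(armc ~ sex * (weight + weight_after7), data = sim)
coef_names <- names(coef(fit_pw))
summary(fit_pw)$coefficients
```

With this coding, the coefficient on `weight_after7` is the change in slope after the knot for the reference group, and the `sex:weight_after7` interaction is the between-sex difference in that change.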
Problem 3. [2+3+1+4=10 points]
In Homework 3, we modeled the LIDAR dataset, available in the SemiPar package, using polynomial
regression. We now compare that method to a penalized spline model fit.
(a) Present a polynomial model for these data (this can be your model from Homework 2 or a different one).
What is your model? How did you choose this? Does it fit the data well?
(b) Use gam() in the mgcv package to fit a penalized spline model for these data, and plot the resulting fit.
(c) Based on visual inspection, do you prefer the polynomial or penalized spline model?
(d) Based on a cross-validation analysis, do you prefer the polynomial or penalized spline model?
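The comparison in parts (b) and (d) might be set up along the following lines. This is a sketch on simulated data rather than the LIDAR data; the variable names `x` and `y`, the cubic polynomial, the REML smoothing-parameter selection, and the choice of 5 folds are all assumptions made for illustration.

```r
library(mgcv)

# Simulated nonlinear data (hypothetical stand-in for the LIDAR data)
set.seed(3)
n <- 200
sim <- data.frame(x = sort(runif(n, 0, 1)))
sim$y <- sin(2 * pi * sim$x) + rnorm(n, sd = 0.3)

# Penalized spline fit; smoothing parameter chosen by REML
fit_gam <- gam(y ~ s(x), data = sim, method = "REML")
plot(fit_gam, residuals = TRUE)

# 5-fold cross-validation comparing a cubic polynomial and the spline
folds <- sample(rep(1:5, length.out = n))
cv_mse <- sapply(1:5, function(k) {
  train <- sim[folds != k, ]
  test <- sim[folds == k, ]
  fit_poly <- lm(y ~ poly(x, 3), data = train)
  fit_ps <- gam(y ~ s(x), data = train, method = "REML")
  c(poly = mean((test$y - predict(fit_poly, newdata = test))^2),
    spline = mean((test$y - predict(fit_ps, newdata = test))^2))
})

# Average held-out mean squared error for each model
rowMeans(cv_mse)
```

Both models are refit within each fold, so the held-out error reflects the full fitting procedure, including smoothing-parameter selection.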
Problem 4. [1+4+2+3=10 points]
In this project, you will analyze data gathered to understand the effects of several variables on a child’s
birthweight. This dataset consists of 4306 children and includes the following variables:
• babysex: baby’s sex (male = 1, female = 2)
• bhead: baby’s head circumference at birth (centimeters)
• blength: baby’s length at birth (centimeters)
• bwt: baby’s birth weight (grams)
• delwt: mother’s weight at delivery (pounds)
• fincome: family monthly income (in hundreds, rounded)
• frace: father’s race (1 = White, 2 = Black, 3 = Asian, 4 = Puerto Rican, 8 = Other, 9 = Unknown)
• gaweeks: gestational age in weeks
• malform: presence of malformations that could affect weight (0 = absent, 1 = present)
• menarche: mother’s age at menarche (years)
• mheigth: mother’s height (inches)
• momage: mother’s age at delivery (years)
• mrace: mother’s race (1 = White, 2 = Black, 3 = Asian, 4 = Puerto Rican, 8 = Other)
• parity: number of live births prior to this pregnancy
• pnumlbw: previous number of low birth weight babies
• pnumgsa: number of prior small for gestational age babies
• ppbmi: mother’s pre-pregnancy BMI
• ppwt: mother’s pre-pregnancy weight (pounds)
• smoken: average number of cigarettes smoked per day during pregnancy
• wtgain: mother’s weight gain during pregnancy (pounds)
This dataset can be accessed using
data = read.csv("http://jeffgoldsmith.com/P8111/P8111_HWs/BWT.csv").
(a) Pose a linear regression model for bwt as the response and remaining variables as predictors and construct
the corresponding design matrix. Describe any steps in this process.
(b) Estimate the regression coefficients using lasso penalization, using cross validation to choose the tuning
parameter. Make a plot showing coefficient values as a function of λ and a plot showing the cross-validation
curve.
(c) For your chosen tuning parameter, fit the model using the complete data. What variables remain in the
model? Are these reasonable predictors of birthweight?
(d) Implement forward stepwise selection for this dataset. Does this procedure result in the same covariates?
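Parts (a) and (d) could be approached roughly as follows. This sketch uses a small simulated data frame with a handful of the variables listed above rather than BWT.csv; the simulated values and the use of base R's `step()`, which adds terms by AIC, are assumptions and only one of several reasonable ways to carry out forward selection.

```r
# Simulated stand-in with a few of the listed variables (hypothetical values)
set.seed(4)
n <- 300
sim <- data.frame(
  babysex = sample(1:2, n, replace = TRUE),
  mrace = sample(c(1, 2, 3, 4, 8), n, replace = TRUE),
  gaweeks = rnorm(n, 39, 2),
  blength = rnorm(n, 50, 2.5),
  smoken = rpois(n, 3)
)
sim$bwt <- 3100 + 100 * (sim$gaweeks - 39) + 80 * (sim$blength - 50) -
  5 * sim$smoken + rnorm(n, sd = 300)

# Numeric category codes must become factors before building the design matrix
sim$babysex <- factor(sim$babysex, levels = 1:2, labels = c("male", "female"))
sim$mrace <- factor(sim$mrace)

# Design matrix: intercept, dummy columns for factors, numeric columns as-is
X <- model.matrix(bwt ~ ., data = sim)
dim(X)

# Forward stepwise selection by AIC, from the intercept-only model
null_fit <- lm(bwt ~ 1, data = sim)
full_formula <- bwt ~ babysex + mrace + gaweeks + blength + smoken
fwd_fit <- step(null_fit, scope = full_formula,
                direction = "forward", trace = 0)
names(coef(fwd_fit))
```

The same design matrix, usually with the intercept column dropped, is the kind of input a lasso routine expects, which makes it straightforward to compare the covariates retained by each method.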