Assignment - Jeff Goldsmith
Biostatistics P8111, Spring 2015
Homework 4
Due April 10 by 6:00pm

Please email an electronic copy (PDF) of your solutions to ajg2202@cumc.columbia.edu. Use the title “P8111HW4_Lastname_Firstname.pdf”. Solutions to Problem 1 can be typed or handwritten and scanned; Problems 2-4 should be completed using R markdown. In addition to your PDF, please also submit the .Rmd that produces your written solutions to Problems 2-4 with the title “P8111HW4_Lastname_Firstname.Rmd”.

Problem 1. [5 points]
Consider the multiple linear regression model $y = X\beta + \epsilon$ with response vector $y$, design matrix $X$, coefficient vector $\beta$, and error vector $\epsilon$. In Lecture 16, we found ridge regression estimates by using

$$\hat{\beta}_R = \arg\min_{\beta} \left\{ (y - X\beta)^T (y - X\beta) + \lambda \beta^T P \beta \right\}$$

with penalty matrix $P$ and tuning parameter $\lambda$. We now make the additional assumptions

(a) That errors are multivariate normal: $\epsilon \sim N(0, \sigma^2 I)$.
(b) That regression coefficients are also multivariate normal: $\beta \sim N(0, \sigma_\beta^2 P^{-1})$.

Show that, under these assumptions, the MLE for $\beta$ is the same as the ridge regression estimate (it is sufficient to show that maximizing the likelihood is equivalent to minimizing the objective function above). What is the relationship between $\lambda$, $\sigma^2$, and $\sigma_\beta^2$?

Problem 2. [2+3+1+2+3+4 = 15 points]
A cross-sectional study of Nepalese children was carried out to understand the relationships between various measures of growth, including arm circumference, weight, and height; also included are age and sex. Data for 2706 children can be downloaded using

    data = read.csv("http://jeffgoldsmith.com/P8111/P8111_HWs/nepalese.csv")

In this problem, the goal is to build an interpretable model for arm circumference as an outcome using all other available covariates.

(a) Build a piecewise linear model using the continuous predictors weight and age. How many knots would you use for each, and where would you place them? How did you make this choice?
(b) Your collaborators believe that the effect of weight on arm circumference can be described by a piecewise linear fit with a single knot at 7. They also believe that the effect of weight on arm circumference is different for boys and for girls, both before and after the changepoint. To examine these hypotheses, write down a regression model with (1) a piecewise linear fit for weight, (2) sex, and (3) weight × sex interactions before and after the changepoint. Interpret the coefficients of this model.
(c) Fit the model in part (b) and test the hypothesis that the slope for weight after the changepoint is the same for boys and girls.
(d) Again using the model in part (b), test the null hypothesis that after the changepoint, weight is not associated with arm circumference in girls.
(e) Fit an additive model with smooth terms for weight and height; plot and briefly interpret the coefficients.
(f) Build a model for arm circumference using all available covariates. Briefly describe your model building process, including forms for the mean you considered and the criteria you used to choose among models, and the major findings of this process. Your description should be no more than one paragraph and use at most one figure.

Problem 3. [2+3+1+4 = 10 points]
In Homework 3, we modeled the LIDAR dataset, available in the SemiPar package, using polynomial regression. We now compare that method to a penalized spline model fit.

(a) Present a polynomial model for these data (this can be your model from Homework 2 or a different one). What is your model? How did you choose this? Does it fit the data well?
(b) Use gam() in the mgcv package to fit a penalized spline model for these data, and plot the resulting fit.
(c) Based on visual inspection, do you prefer the polynomial or penalized spline model?
(d) Based on a cross-validation analysis, do you prefer the polynomial or penalized spline model?

Problem 4. [1+4+2+3 = 10 points]
In this project, you will analyze data gathered to understand the effects of several variables on a child’s birthweight. This dataset consists of 4306 children and includes the following variables:

• babysex: baby’s sex (male = 1, female = 2)
• bhead: baby’s head circumference at birth (centimeters)
• blength: baby’s length at birth (centimeters)
• bwt: baby’s birth weight (grams)
• delwt: mother’s weight at delivery (pounds)
• fincome: family monthly income (in hundreds, rounded)
• frace: father’s race (1 = White, 2 = Black, 3 = Asian, 4 = Puerto Rican, 8 = Other, 9 = Unknown)
• gaweeks: gestational age in weeks
• malform: presence of malformations that could affect weight (0 = absent, 1 = present)
• menarche: mother’s age at menarche (years)
• mheigth: mother’s height (inches)
• momage: mother’s age at delivery (years)
• mrace: mother’s race (1 = White, 2 = Black, 3 = Asian, 4 = Puerto Rican, 8 = Other)
• parity: number of live births prior to this pregnancy
• pnumlbw: previous number of low birth weight babies
• pnumgsa: number of prior small for gestational age babies
• ppbmi: mother’s pre-pregnancy BMI
• ppwt: mother’s pre-pregnancy weight (pounds)
• smoken: average number of cigarettes smoked per day during pregnancy
• wtgain: mother’s weight gain during pregnancy (pounds)

This dataset can be accessed using

    data = read.csv("http://jeffgoldsmith.com/P8111/P8111_HWs/BWT.csv")

(a) Pose a linear regression model with bwt as the response and the remaining variables as predictors, and construct the corresponding design matrix. Describe any steps in this process.
(b) Estimate the regression coefficients using lasso penalization, using cross-validation to choose the tuning parameter. Make a plot showing coefficient values as a function of λ and a plot showing the cross-validation curve.
(c) For your chosen tuning parameter, fit the model using the complete data. What variables remain in the model? Are these reasonable predictors of birthweight?
(d) Implement forward stepwise selection for this dataset. Does this procedure select the same covariates as the lasso?
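As a starting point for the mechanics of Problem 4 parts (a) and (d), the base R sketch below shows one possible workflow. It runs on simulated data, not on BWT.csv: the subset of columns, the sample size, the `noise` column, and the coefficients used to generate `bwt` are all assumptions made purely for illustration. Numeric category codes are converted to factors so that model.matrix() expands them into indicator columns, and step() performs forward selection by AIC, which is only one of several reasonable selection criteria.

```r
set.seed(1)

# Simulated stand-in for the birthweight data (values are NOT from BWT.csv).
n <- 200
sim <- data.frame(
  babysex = sample(1:2, n, replace = TRUE),
  mrace   = sample(c(1, 2, 3, 4, 8), n, replace = TRUE),
  gaweeks = rnorm(n, mean = 39, sd = 2),
  blength = rnorm(n, mean = 50, sd = 2.5),
  smoken  = rpois(n, lambda = 3),
  noise   = rnorm(n)  # hypothetical pure-noise predictor
)

# Assumed data-generating model, chosen only so the example has signal.
sim$bwt <- 3100 + 60 * (sim$gaweeks - 39) + 80 * (sim$blength - 50) -
  10 * sim$smoken + rnorm(n, sd = 300)

# Numeric codes for categories must become factors before building X.
sim$babysex <- factor(sim$babysex, levels = 1:2, labels = c("male", "female"))
sim$mrace   <- factor(sim$mrace, levels = c(1, 2, 3, 4, 8))

# Design matrix: intercept, one indicator per non-reference factor level,
# and the continuous predictors unchanged.
X <- model.matrix(bwt ~ ., data = sim)
print(dim(X))
print(colnames(X))

# Forward stepwise selection by AIC, starting from the intercept-only model.
null_fit <- lm(bwt ~ 1, data = sim)
upper <- reformulate(setdiff(names(sim), "bwt"), response = "bwt")
fwd_fit <- step(null_fit, scope = list(lower = ~ 1, upper = upper),
                direction = "forward", trace = 0)

# Terms retained by the forward search.
selected <- attr(terms(fwd_fit), "term.labels")
print(selected)
```

With the real data, the same calls would apply after reading in the file and converting each coded categorical variable (for example babysex, frace, mrace, and malform) to a factor; the selected terms can then be compared against the nonzero lasso coefficients from part (c).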