Fitting latency models in epidemiological studies
Transcription
Fitting latency models in epidemiological studies
Paper 75011 Fitting latency models in epidemiological studies Bryan Langholz, University of Southern California, Los Angeles, CA David B. Richardson, University of North Carolina, Chapel Hill, NC ABSTRACT Latency models address important questions about the timing of exposure and subsequent disease risk in epidemiologic research. However, the non-standard form of such models and the complexity of time-varying exposure histories that characterize many epidemiologic studies make such models difficult to fit using standard software packages. SAS offers software tools that can be used to fit “non-standard” latency models. In particular, SAS procedures NLP and NLMIXED provide the flexibility to specify quite general latency models and maximize the likelihood for latency parameters. While the methods require more sophistication than using a package or procedure that is specialized to the usual log-linear form, the additional programming complexity is not great. The methods are illustrated and compared by fitting latency models for radon exposure and lung cancer mortality rates from a cohort study of Colorado Plateau uranium miners. It was found that the conditional logistic likelihood is more computationally efficient than the unconditional and that PROC NLP is much faster than PROC NLMIXED. KEYWORDS: cohort studies, case-control studies, conditional logistic likelihood, epidemiology, general relative risk models, partial likelihood, PROC NLP, PROC NLMIXED INTRODUCTION Latency models are used to describe the change in risk of disease due to a given exposure as a function of time since that exposure. Characterization of the evolution of the relative risk of disease following exposure to an agent is important for understanding the long term consequences of exposure. If exposure occurs at a single point in time, then risk as a function of time-since-exposure is typically estimated directly as the change in disease risk or rate as a function of since that exposure. The situation can be complex when exposure is protracted such as the case with many occupational and environmental exposures. For instance, in a study of Colorado Plateau uranium miners, radon exposure in the course of underground mining of uranium continued throughout a miner’s work life, and varied over time, depending on the mine location and conditions in the mine at a given time [Archer et al., 1973, Hornung and Meinhardt, 1987, Stram et al., 1999]. Ad-hoc methods for investigation of latency for extended exposures include estimation of risk as a function of time since first or last exposure. These approaches are relatively easy to implement but do not characterize the evolution of risk over time. While more sophisticated approaches based on informative models of latency have been described [e.g., Thomas, 1982, 1988, Breslow and Day, 1987, Langholz et al., 1999, Hauptmann et al., 2001, Berhane et al., 2008], they have rarely been used, in large part because of they are not accommodated by standard modeling software. Recently, methods for fitting a quite general class of latency models exploiting the very general modeling facilities available in SAS procedure NLMIXED were described [Richardson, 2009]. This approach was based on an unconditional logistic likelihood with “nuisance” strata parameters for age intervals. Even more recently, methods have been describe to fit conditional logistic likelihoods, including partial likelihoods for proportional hazards models, using PROC NLP [Langholz and Richardson, 2010]. In this paper, we combine the insights about fitting latency models given in Richardson [2009] with those about fitting conditional logistic likelihoods described in Langholz and Richardson [2010]. We show how latency models can be fitted using procedure NLMIXED or NLP using the conditional logistic likelihood and compare the performance of the unconditional likelihood fitting approach to the conditional by fitting latency models for radon exposure and lung cancer mortality rates to data from a cohort of uranium miners from the Colorado Plateau. METHODS LATENCY FUNCTIONS Let t be the current age and u index the ages at exposure, with d(u) the dose at age u and w(t − u; α) the latency curve t − u years in the past that depends on parameter vector α, then the effective dose at age t from an exposure incurred at age u is d(u)w(t − u; α). The total effective dose at age t, D(t; α), is then given by X D(t; α) = d(u)w(t − u; α). u The latency function and latency parameters for a number of latency models are shown in Table 1 [e.g., Thomas, 1988, Langholz et al., 1999, Hauptmann et al., 2001, Richardson, 2009]. 1 Table 1: Some latency functions. Lag Piecewise constanta Splineb Bilinear Exponential decay Log-normal Weighting function w(v) I(v ≥ α) PK k=1 I(Ck−1 < v ≤ Ck ) αk PK k=1 g(v, αk ) (v − α0 )/(α1 − α0 ) if α0 < v ≤ α1 (α2 − v)/(α2 − α1 ) if α1 < v ≤ α2 0 otherwise (v − α0 )/(α1 − α0 ) if α0 < v ≤ α1 exp(−(v − α1 ) × log(2)/α2 if α1 < v 0 otherwise pdf(Lognormal,v, θ, λ) Gamma pdf(Gamma,a, λ) a (C k−1 , Ck ] are time intervals. b g(v, α ) are basis functions and α may be a vector. k k Parameters α α1 , . . . , αK α1 , . . . , αK α0 , α 1 , α 2 α0 , α 1 , α 2 θ, λ a, λ The total effective dose summarizes the exposure history into a single value that at each age can be related to disease risk. A form of this relationship must be chosen or determined from the data. For various reasons, both theoretical and empirical, radiation effects have been modeled as the excess relative risk as a linear function of total effective dose. This is expressed as λ(t, D(t; α); β) = λ0 (t)(1 + βD(t; α)) (1) where λ0 (t) is the rate of disease at age t in a (comparable) unexposed population and β is the dose-response parameter; the change in relative risk per unit effective dose. Combining this dose-response model with the total effective dose formula (), we note that X X βD(t; α) = β d(u)w(t − u; α) = d(u) βw(t − u; α) (2) u u so that β w(t − u; α) gives the excess relative risk per unit dose ascribed to exposure at t − u years in the past. Other forms for the rates may be more appropriate in other situations such as the log linear form (Cox model) λ(t, D(t; α); β) = λ0 (t) exp(βD(t; α)) or more complex relationships. FITTING LATENCY MODELS A major impediment to the investigation of latency in characterizing exposure-disease relationships has been computational. Given that standard software used for the modeling of data from epidemiologic studies accommodate only dose-response models of the log-linear form, there are two aspects of model (1) that are “non-standard.” The first is that the excess relative risk form is not log-linear. Second, to estimate the latency parameters α, the effective dose D(t; α) given in () needs to be updated at each iteration in the fitting algorithm, again not accommodated by standard fitting software. In order to fit latency models to the Colorado Plateau uranium miners data, Langholz et al. [1999] organized cohort data into time and strata determined risk sets and then nested case-control sampled from the risk sets to reduce the computational burden. Lengthy “scripts” were written for the specialized package EPICURE that used a grid search to fit the models (http://hydra.usc.edu/timefactors/examples/exampl.html, topic 5). Recently, Richardson [2009] described how SAS PROC NLMIXED can be used to fit latency models. He used the same nested case-control sampling approach to reduce computational burden and analyzed the data as stratified binary data using an unconditional logistic likelihood, with a separate stratum parameter for each risk set. When disease is rare, the unconditional logistic likelihood approach yields parameter estimates that are close to the partial likelihood. Langholz and Richardson [2010] described how to use PROC NLP (or NLMIXED) to accommodate conditional logistic likelihoods, including how to perform partial likelihood analyses from cohort risk set or nested case-control data. This involves a organizing the data as a single record per risk set and computing the conditional logistic likelihood contributions as a “general model.” This approach has the advantage over the unconditional logistic likelihood that risk set strata parameters are avoided so that fitting is faster and more reliable. For simplicity, the nested case-control data should have only one case per set. We discuss the analysis of tied failure times or grouped time cohort data in the Discussion. UNCONDITIONAL LOGISTIC LIKELIHOOD Figure 1 shows the SAS macro latency described in Richardson [2009]. Briefly, the annual exposures d(u) are given as a variable list in macro variable exphx while the latency weighting is entered into the macro variable wt and the regression 2 Figure 1: Unconditional logistic analysis of nested case-control data. %MACRO latency (data= , case= , exphx=, age= , parms= , lag=, wt=, regmodel= ); PROC NLMIXED DATA=&data TECH=congra ; PARMS &parms; lag=&lag ; edose=0; ARRAY annexp &exphx; endage=FLOOR(&age)-lag; endage=MIN(endage,hbound(annexp)); DO Y = 1 TO endage; t = (&age -y-lag+ (182/365) ); edose = edose + ( ( &wt ) * annexp{y} ); END; odds=®model ; p=odds/(1+odds); MODEL &case ˜ BINARY(p); RUN; %MEND latency; Figure 2: Conditional logistic likelihood analysis of nested case-control data. %MACRO latency (data= , nv=, maxsetsize=, age= , parms= , lag=, wt=, regmodel= ); PROC NLMIXED DATA=&data tech=congra ; PARMS &parms; lag=&lag ; * note: maxsetsize is set globally by the make_case_control macro; ARRAY z[&nv,&maxsetsize] _z1-_z%EVAL(&nv*&maxsetsize); endage=FLOOR(&age)-lag; endage=MIN(endage,80); sum = 0; _ntot = _ntot; DO imem = _ntot TO 1 by -1; edose=0; DO Y = 1 TO endage; t = (&age -y-lag-(182/365)); edose = edose + ( ( &wt ) * z[y,imem]); END; phi = ®model; sum=sum + phi; end; dum = 1; * last rr is for the case; L = phi / sum; MODEL dum ˜ GENERAL(log(L)); RUN; %MEND latency; model in regmodel. Further details and examples are given in [Richardson, 2009]. CONDITIONAL (PARTIAL) LIKELIHOOD To fit models via the conditional logistic likelihood, we first organize the data into a “case-control set” structure with one line per case-control set and has been described previously [Langholz and Richardson, 2010]. Briefly, covariate information from all case-control set members is put into a covariate array z arranged in blocks of maxsetsize, the maximum number of subjects in any case-control set, for each of the nv covariates. In each covariate block, the case’s covariate is first, followed by that of the controls, with z values over the number in the case-control set ntot to missing (and ignored in the analysis). The macro make case control to generate the case-control set structured data is available at http://hydra.usc.edu/timefactors/examples/exampl.html, topic 12. A macro to fit latency models via the conditional logistic likelihood using the case-control set organized data is shown in Figure 2 and is similar to the unconditional logistic likelihood fitting macro in Figure 1. The macro variables nv and maxsetsize are for the number of covariates and the size of largest case-control (or risk) set. Dose is computed as a latency weighted function wt of annual exposures and is then a rate ratio is computed via the regression model regmodel. However, this is done for all members of the case-control set so that there is a double loop first over members imem of the set, and second over age Y for a given member. The denominator of the likelihood is computed as the sum of member rate ratios phi and the numerator is the case-rate ratio, who is last in the case-control member loop. 3 Figure 3: Conditional logistic likelihood analysis of nested case-control data using PROC NLP. %MACRO latency (data= , nv=, maxsetsize=, age= , parms= , lag=, wt=, regmodel=, profile=); PROC NLP DATA=&data GRADCHECK=none COV=2; PARMS &parms; PROFILE &profile /ALPHA=.05; lag=&lag ; ARRAY z[&nv,&maxsetsize] _z1-_z%EVAL(&nv*&maxsetsize); endage=FLOOR(&age)-lag; endage=MIN(endage,80); sum = 0; _ntot = _ntot; DO i = _ntot TO 1 by -1; edose=0; DO Y = 1 TO endage; t = (&age -Y-lag); edose = edose + ( ( &wt ) * z[y,i]); END; phi = ®model; sum=sum + phi; end; * last rr is for the case; logL = log(phi / sum); MAX logL; RUN; %MEND latency; PROFILE CONFIDENCE INTERVALS USING PROC NLP Wald confidence intervals are symmetric around the estimate and are obtained by subtracting and adding to the estimate 1.96 times the standard errors of the estimates. While these are appropriate in many standard model settings, asymmetrical intervals that take the constraints on the parameters and the skewed distribution of the estimates into account are often preferred. The profile likelihood may be asymmetrical and is obtained by finding the values of β such that twice the log-likelihood differs from its maximum by 3.84(=1.962 ). This requires maximizing the likelihood for other parameters, at a fixed value of the parameter of interest, and can be quite computationally intensive. Langholz and Richardson [2010] suggest that PROC NLP may be preferred over PROC NLMIXED because PROC NLP can compute profile likelihood confidence intervals via the PROFILE command. Figure 3 shows the macro for fitting the conditional logistic likelihood using PROC NLP. All the basic programming statements are the same as those used with PROC NLMIXED, with the model statement replaced by the NLP command MAX logL to maximize the log likelihood. Profile confidence intervals are computed for variables listing in the macro variable profile. EXAMPLE To illustrate and compare the approaches to fitting latency models, we used the nested case-control sample from the Colorado Plateau uranium miners data that was used previously illustrate the use of latency methods [Langholz et al., 1999, Richardson, 2009]. Briefly, the cohort data consisted 2704 miners who started employment in the Colorado Plateau after 1950 and were followed for lung cancer death to December, 1990. Tied ages of death were broken randomly and a risk set for each lung cancer mortality case was defined as all cohort members who were alive and on-study at age of the case’s death and who attained that age during the five-year calendar period in which the case died; i.e. risk sets with age as time scale with stratification by five year calendar period; a data set that consists of 55,964 miner/time records. Forty controls were sampled from the risk sets to form the nested case-control data set, with 10,322 miner/time records, used for the analysis. As in Richardson [2009], three models were fitted: simple cumulative dose (time constant), bilinear with α0 fixed at 0, and log-normal. Each model was fitted using the unconditional logistic likelihood, with a separate stratum parameter for each risk set, and using the conditional logistic (partial) likelihood, in the latter case with procedures NLMIXED and NLP to do the fitting. For each model and method, we tabulated the dose-response and latency parameter estimates, standard errors, and model deviance as well as the time required to fit the model on a Lenovo x300 Thinkpad laptop computer. Finally, we estimated Wald and profile likelihood confidence intervals for the dose-response parameter β, the latter using the PROFILE command in PROC NLP. The data set and SAS code used to do these analyses have been posted at http://hydra.usc.edu/timefactors/examples/exampl.html, topic 12. RESULTS Table 2 gives the parameter estimates, standard errors, and deviances from each of the fitted models. The latency and doseresponse parameters are required in the model specification for both unconditional and conditional logistic likelihood methods 4 Table 2: Latency model and radon dose-response parameters and standard errors and model deviances from unconditional and conditional logistic likelihood fitting. Type of logistic likelihood Unconditional (NLMIXED) Conditional (NLMIXED or NLP) Latency function Time constant Parameter β Estimate 0.32 SE 0.096 Deviance 2329.2 Estimate 0.28 SE 0.083 Deviance 1815.3 Bilinear α1 α2 β 8.6 34.1 0.83 3.8 2.9 0.30 2316.7 9.8 35.0 0.78 4.2 2.6 0.30 1803.5 Lognormal θ λ β 2.7 0.59 15.2 0.16 0.13 5.3 2318.2 2.7 0.53 13.3 0.14 0.11 4.7 1805.2 Table 3: CPU time to fit latency models. Model Time constant Bilinear Log-normal Unconditional logistic (NLMIXED) 4 min 15 sec 9 min 3 sec 17 min 21 sec Conditional logistic (NLMIXED) 8 sec 1 min 21 sec 3 min 29 sec Conditional logistic (NLP) 1 sec 25 sec 29 sec Table 4: Wald and profile likelihood 95% confidence intervals for the dose response parameter β. Wald 0.12-0.44 0.19-1.37 1.6-25.0 β Time constant 0.28 Bilinear 0.78 Lognormal 13.3 a No convergence. Profile 0.16-0.52 –a 6.9-28.7 but the unconditional also included parameters for each of the 263 risk set defined strata. There are differences between the unconditional logistic likelihood estimates and standard errors and those from the conditional logistic, but these are relatively small. Also, while the different likelihoods yield different absolute deviances, the between-model differences in the deviances are similar. Thus, at least for our example, results from the two likelihoods are consistent with each other. The results using conditional logistic likelihood fitted with PROC NLP were virtually identical to those using PROC NLMIXED, with differences in the third decimal. Table 3 shows the reported CPU times to fit each of the models. The time required to fit the conditional logistic likelihood is a number of orders of magnitude less than needed to fit the unconditional, probably because fitting a large number of strata parameters is avoided. The last column shows the computing times for the conditional logistic likelihood fitting using PROC NLP. These times are a number of orders of magnitude smaller than the conditional logistic fitted using PROC NLMIXED. Table 4 Wald and profile likelihood 95% confidence intervals from each of the latency models. The intervals are somewhat different, with the profile likelihood interval shifted toward larger values relative to the Wald interval. The profile likelihood intervals were computationally challenging. For the log-normal model, computation of the interval took about 8 additional minutes of CPU time compared to just 29 seconds to estimate the β parameter and standard errors. Profile likelihood limits for bilinear model β parameter could not be computed using the latency macro because some parameter combinations resulted in negative likelihood contributions. Even after constraining parameters to assure positive rate ratios (phi in the code), the search failed. DISCUSSION We have shown that the strategy for fitting latency models using an unconditional logistic likelihood given in Richardson [2009] is easily adapted to the conditional logistic setting. Although the effective dose value D(t; α) is a function of the latency function parameters α, the code required is fairly straightforward and easy to implement. When appropriate, the conditional logistic likelihood will be preferable because it is the natural partial likelihood for the cohort (or nested case-control data) and for the practical reason that fitting is much faster than the unconditional likelihood. Further, we found that PROC NLP was much faster at fitting the conditional likelihood than PROC NLMIXED probably because of factors related to the mixed modeling components of NLMIXED, which we do not use in these analyses. Finally, we found that profile likelihood confidence 5 intervals could be computed using PROC NLP for parameters in some of the latency models but not all, and that computing these intervals takes some time. In situations when profile limits do not converge, it may be possible to identify the confidence interval by plotting the log-likelihood over a grid of parameter values. We have focused on risk set or nested case-control data with “no tied failure times.” When there are ties, these can be randomly broken as was done for the miners cohort data. When breaking the ties does not seem appropriate because, for instance, failure status is determined only over larger time intervals, then the unconditional logistic likelihood provides a valid approach to analyze the grouped time data. An alternative may be estimation via the conditional logistic likelihood for multiple failures [Langholz and Richardson, 2010]. Whenever one is fitting complex models, it is important to be sure that the estimated parameters are reasonable and well represent the data. For latency models, latency parameters are only estimable when there evidence of dose response so it is important to first establish an exposure-disease association. Then, descriptive latency models, such as the piecewise constant model can provide an idea of the general shape of the latency function and often suggest good starting values for latency model parameters. REFERENCES V.E. Archer, J.K. Wagoner, and F.E. Lundin. Uranium mining and cigarette smoking effects on man. Journal of Occupational Medicine, 15:204–211, 1973. K. Berhane, M. Hauptmann, and B. Langholz. Using tensor product splines in modeling exposure-time-response relationships: application to the colorado plateau uranium miners cohort. Stat Med, 27(26):5484–5496, 2008. N. E. Breslow and N. E. Day. Statistical Methods in Cancer Research. Volume II – The Design and Analysis of Cohort Studies, volume 82 of IARC Scientific Publications. International Agency for Research on Cancer, Lyon, 1987. M. Hauptmann, K. Behrane, B. Langholz, and Lubin J.H. Using splines to analyze latency in the colorado plateau uranium miners cohort. Journal of Epidemiology and Biostatistics, 6:417–424, 2001. R.W. Hornung and T.J. Meinhardt. Quantitative risk assessment of lung cancer in U. S. uranium miners. Health Physics, 52: 417–430, 1987. B. Langholz and D. B. Richardson. Fitting general relative risk models for survival time and matched case-control analysis. American Journal of Epidemiology, 171:377–83, 2010. B. Langholz, N. Rothman, S. Wacholder, and D.C. Thomas. Cohort studies for characterizing measured genes. Monographs Journal of the National Cancer Institute, 26:39–42, 1999. D. B. Richardson. Latency models for analyses of protracted exposures. Epidemiology, 20:395–9, 2009. D.O. Stram, B. Langholz, M. Huberman, and D.C. Thomas. Correcting for dosemetry error in a reanalysis of lung cancer mortality for the colorado plateau uranium miners cohort. Health Physics, 77:265–275, 1999. D.C. Thomas. Temporal effects and interactions in cancer: Implications of carcinogenic models. In R.L. Prentice and A.S. Whittemore, editors, Environmental Epidemiology: Risk Assesment, pages 107–121. Society for Industrial and Applied Mathematics, Philadelphia, 1982. D.C. Thomas. Exposure-time-response relationships with applications to cancer epidemiology. Annual Review of Public Health, 9:451–482, 1988. 6 CONTACT INFORMATION Comments and questions are valued and encouraged. Contact the authors: Bryan Langholz Department of Preventive Medicine USC Keck School of Medicine 2001 N Soto Street Second Floor, MC 9237 Los Angeles, CA 90089 langholz@usc.edu http://hydra.usc.edu/langholz David B. Richardson Department of Epidemiology School of Public Health University of North Carolina Chapel Hill, NC 27599-7435 david.richardson@unc.edu SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies. 7