Fitting latency models in epidemiological studies

Transcription

Fitting latency models in epidemiological studies
Paper 75011
Fitting latency models in epidemiological studies
Bryan Langholz, University of Southern California, Los Angeles, CA
David B. Richardson, University of North Carolina, Chapel Hill, NC
ABSTRACT
Latency models address important questions about the timing of exposure and subsequent disease risk in epidemiologic
research. However, the non-standard form of such models and the complexity of time-varying exposure histories that characterize
many epidemiologic studies make such models difficult to fit using standard software packages. SAS offers software tools that
can be used to fit “non-standard” latency models. In particular, SAS procedures NLP and NLMIXED provide the flexibility to
specify quite general latency models and maximize the likelihood for latency parameters. While the methods require more
sophistication than using a package or procedure that is specialized to the usual log-linear form, the additional programming
complexity is not great. The methods are illustrated and compared by fitting latency models for radon exposure and lung cancer
mortality rates from a cohort study of Colorado Plateau uranium miners. It was found that the conditional logistic likelihood is
more computationally efficient than the unconditional and that PROC NLP is much faster than PROC NLMIXED.
KEYWORDS: cohort studies, case-control studies, conditional logistic likelihood, epidemiology, general relative risk models,
partial likelihood, PROC NLP, PROC NLMIXED
INTRODUCTION
Latency models are used to describe the change in risk of disease due to a given exposure as a function of time since that
exposure. Characterization of the evolution of the relative risk of disease following exposure to an agent is important for
understanding the long term consequences of exposure. If exposure occurs at a single point in time, then risk as a function of
time-since-exposure is typically estimated directly as the change in disease risk or rate as a function of since that exposure. The
situation can be complex when exposure is protracted such as the case with many occupational and environmental exposures.
For instance, in a study of Colorado Plateau uranium miners, radon exposure in the course of underground mining of uranium
continued throughout a miner’s work life, and varied over time, depending on the mine location and conditions in the mine
at a given time [Archer et al., 1973, Hornung and Meinhardt, 1987, Stram et al., 1999]. Ad-hoc methods for investigation of
latency for extended exposures include estimation of risk as a function of time since first or last exposure. These approaches
are relatively easy to implement but do not characterize the evolution of risk over time. While more sophisticated approaches
based on informative models of latency have been described [e.g., Thomas, 1982, 1988, Breslow and Day, 1987, Langholz
et al., 1999, Hauptmann et al., 2001, Berhane et al., 2008], they have rarely been used, in large part because of they are not
accommodated by standard modeling software. Recently, methods for fitting a quite general class of latency models exploiting
the very general modeling facilities available in SAS procedure NLMIXED were described [Richardson, 2009]. This approach
was based on an unconditional logistic likelihood with “nuisance” strata parameters for age intervals. Even more recently,
methods have been describe to fit conditional logistic likelihoods, including partial likelihoods for proportional hazards models,
using PROC NLP [Langholz and Richardson, 2010]. In this paper, we combine the insights about fitting latency models given
in Richardson [2009] with those about fitting conditional logistic likelihoods described in Langholz and Richardson [2010]. We
show how latency models can be fitted using procedure NLMIXED or NLP using the conditional logistic likelihood and compare
the performance of the unconditional likelihood fitting approach to the conditional by fitting latency models for radon exposure
and lung cancer mortality rates to data from a cohort of uranium miners from the Colorado Plateau.
METHODS
LATENCY FUNCTIONS
Let t be the current age and u index the ages at exposure, with d(u) the dose at age u and w(t − u; α) the latency curve t − u
years in the past that depends on parameter vector α, then the effective dose at age t from an exposure incurred at age u is
d(u)w(t − u; α). The total effective dose at age t, D(t; α), is then given by
X
D(t; α) =
d(u)w(t − u; α).
u
The latency function and latency parameters for a number of latency models are shown in Table 1 [e.g., Thomas, 1988,
Langholz et al., 1999, Hauptmann et al., 2001, Richardson, 2009].
1
Table 1: Some latency functions.
Lag
Piecewise constanta
Splineb
Bilinear
Exponential decay
Log-normal
Weighting function w(v)
I(v ≥ α)
PK
k=1 I(Ck−1 < v ≤ Ck ) αk
PK
k=1 g(v, αk )

 (v − α0 )/(α1 − α0 ) if α0 < v ≤ α1
(α2 − v)/(α2 − α1 ) if α1 < v ≤ α2
 0
otherwise

 (v − α0 )/(α1 − α0 )
if α0 < v ≤ α1
exp(−(v − α1 ) × log(2)/α2 if α1 < v
 0
otherwise
pdf(Lognormal,v, θ, λ)
Gamma
pdf(Gamma,a, λ)
a (C
k−1 , Ck ] are time intervals.
b g(v, α ) are basis functions and α may be a vector.
k
k
Parameters
α
α1 , . . . , αK
α1 , . . . , αK
α0 , α 1 , α 2
α0 , α 1 , α 2
θ, λ
a, λ
The total effective dose summarizes the exposure history into a single value that at each age can be related to disease risk.
A form of this relationship must be chosen or determined from the data. For various reasons, both theoretical and empirical,
radiation effects have been modeled as the excess relative risk as a linear function of total effective dose. This is expressed as
λ(t, D(t; α); β) = λ0 (t)(1 + βD(t; α))
(1)
where λ0 (t) is the rate of disease at age t in a (comparable) unexposed population and β is the dose-response parameter; the
change in relative risk per unit effective dose. Combining this dose-response model with the total effective dose formula (), we
note that
X
X
βD(t; α) = β
d(u)w(t − u; α) =
d(u) βw(t − u; α)
(2)
u
u
so that β w(t − u; α) gives the excess relative risk per unit dose ascribed to exposure at t − u years in the past.
Other forms for the rates may be more appropriate in other situations such as the log linear form (Cox model)
λ(t, D(t; α); β) = λ0 (t) exp(βD(t; α))
or more complex relationships.
FITTING LATENCY MODELS
A major impediment to the investigation of latency in characterizing exposure-disease relationships has been computational.
Given that standard software used for the modeling of data from epidemiologic studies accommodate only dose-response
models of the log-linear form, there are two aspects of model (1) that are “non-standard.” The first is that the excess
relative risk form is not log-linear. Second, to estimate the latency parameters α, the effective dose D(t; α) given in ()
needs to be updated at each iteration in the fitting algorithm, again not accommodated by standard fitting software. In
order to fit latency models to the Colorado Plateau uranium miners data, Langholz et al. [1999] organized cohort data into
time and strata determined risk sets and then nested case-control sampled from the risk sets to reduce the computational
burden. Lengthy “scripts” were written for the specialized package EPICURE that used a grid search to fit the models
(http://hydra.usc.edu/timefactors/examples/exampl.html, topic 5). Recently, Richardson [2009] described how
SAS PROC NLMIXED can be used to fit latency models. He used the same nested case-control sampling approach to reduce
computational burden and analyzed the data as stratified binary data using an unconditional logistic likelihood, with a separate
stratum parameter for each risk set. When disease is rare, the unconditional logistic likelihood approach yields parameter
estimates that are close to the partial likelihood. Langholz and Richardson [2010] described how to use PROC NLP (or
NLMIXED) to accommodate conditional logistic likelihoods, including how to perform partial likelihood analyses from cohort
risk set or nested case-control data. This involves a organizing the data as a single record per risk set and computing the
conditional logistic likelihood contributions as a “general model.” This approach has the advantage over the unconditional
logistic likelihood that risk set strata parameters are avoided so that fitting is faster and more reliable.
For simplicity, the nested case-control data should have only one case per set. We discuss the analysis of tied failure times or
grouped time cohort data in the Discussion.
UNCONDITIONAL LOGISTIC LIKELIHOOD
Figure 1 shows the SAS macro latency described in Richardson [2009]. Briefly, the annual exposures d(u) are given as
a variable list in macro variable exphx while the latency weighting is entered into the macro variable wt and the regression
2
Figure 1: Unconditional logistic analysis of nested case-control data.
%MACRO latency (data= , case= , exphx=, age= , parms= , lag=, wt=, regmodel= );
PROC NLMIXED DATA=&data TECH=congra ;
PARMS &parms;
lag=&lag ;
edose=0;
ARRAY annexp &exphx;
endage=FLOOR(&age)-lag; endage=MIN(endage,hbound(annexp));
DO Y = 1 TO endage;
t = (&age -y-lag+ (182/365) );
edose = edose + ( ( &wt ) * annexp{y} );
END;
odds=&regmodel ;
p=odds/(1+odds);
MODEL &case ˜ BINARY(p);
RUN;
%MEND latency;
Figure 2: Conditional logistic likelihood analysis of nested case-control data.
%MACRO latency (data= , nv=, maxsetsize=, age= , parms= , lag=, wt=, regmodel= );
PROC NLMIXED DATA=&data tech=congra ;
PARMS &parms;
lag=&lag ;
* note: maxsetsize is set globally by the make_case_control macro;
ARRAY z[&nv,&maxsetsize] _z1-_z%EVAL(&nv*&maxsetsize);
endage=FLOOR(&age)-lag; endage=MIN(endage,80);
sum = 0; _ntot = _ntot;
DO imem = _ntot TO 1 by -1;
edose=0;
DO Y = 1 TO endage;
t = (&age -y-lag-(182/365));
edose = edose + ( ( &wt ) * z[y,imem]);
END;
phi = &regmodel;
sum=sum + phi;
end;
dum = 1;
* last rr is for the case;
L = phi / sum;
MODEL dum ˜ GENERAL(log(L));
RUN;
%MEND latency;
model in regmodel. Further details and examples are given in [Richardson, 2009].
CONDITIONAL (PARTIAL) LIKELIHOOD
To fit models via the conditional logistic likelihood, we first organize the data into a “case-control set” structure with one line per
case-control set and has been described previously [Langholz and Richardson, 2010]. Briefly, covariate information from all
case-control set members is put into a covariate array z arranged in blocks of maxsetsize, the maximum number of subjects
in any case-control set, for each of the nv covariates. In each covariate block, the case’s covariate is first, followed by that of
the controls, with z values over the number in the case-control set ntot to missing (and ignored in the analysis). The macro
make case control to generate the case-control set structured data is available at
http://hydra.usc.edu/timefactors/examples/exampl.html, topic 12. A macro to fit latency models via the conditional
logistic likelihood using the case-control set organized data is shown in Figure 2 and is similar to the unconditional logistic
likelihood fitting macro in Figure 1. The macro variables nv and maxsetsize are for the number of covariates and the size of
largest case-control (or risk) set. Dose is computed as a latency weighted function wt of annual exposures and is then a rate
ratio is computed via the regression model regmodel. However, this is done for all members of the case-control set so that
there is a double loop first over members imem of the set, and second over age Y for a given member. The denominator of
the likelihood is computed as the sum of member rate ratios phi and the numerator is the case-rate ratio, who is last in the
case-control member loop.
3
Figure 3: Conditional logistic likelihood analysis of nested case-control data using PROC NLP.
%MACRO latency (data= , nv=, maxsetsize=, age= , parms= , lag=, wt=, regmodel=, profile=);
PROC NLP DATA=&data GRADCHECK=none COV=2;
PARMS &parms;
PROFILE &profile /ALPHA=.05;
lag=&lag ;
ARRAY z[&nv,&maxsetsize] _z1-_z%EVAL(&nv*&maxsetsize);
endage=FLOOR(&age)-lag; endage=MIN(endage,80);
sum = 0; _ntot = _ntot;
DO i = _ntot TO 1 by -1;
edose=0;
DO Y = 1 TO endage;
t = (&age -Y-lag);
edose = edose + ( ( &wt ) * z[y,i]);
END;
phi = &regmodel;
sum=sum + phi;
end;
* last rr is for the case;
logL = log(phi / sum);
MAX logL;
RUN;
%MEND latency;
PROFILE CONFIDENCE INTERVALS USING PROC NLP
Wald confidence intervals are symmetric around the estimate and are obtained by subtracting and adding to the estimate 1.96
times the standard errors of the estimates. While these are appropriate in many standard model settings, asymmetrical intervals
that take the constraints on the parameters and the skewed distribution of the estimates into account are often preferred. The
profile likelihood may be asymmetrical and is obtained by finding the values of β such that twice the log-likelihood differs from
its maximum by 3.84(=1.962 ). This requires maximizing the likelihood for other parameters, at a fixed value of the parameter
of interest, and can be quite computationally intensive. Langholz and Richardson [2010] suggest that PROC NLP may be
preferred over PROC NLMIXED because PROC NLP can compute profile likelihood confidence intervals via the PROFILE
command. Figure 3 shows the macro for fitting the conditional logistic likelihood using PROC NLP. All the basic programming
statements are the same as those used with PROC NLMIXED, with the model statement replaced by the NLP command
MAX logL to maximize the log likelihood. Profile confidence intervals are computed for variables listing in the macro variable
profile.
EXAMPLE
To illustrate and compare the approaches to fitting latency models, we used the nested case-control sample from the Colorado
Plateau uranium miners data that was used previously illustrate the use of latency methods [Langholz et al., 1999, Richardson,
2009]. Briefly, the cohort data consisted 2704 miners who started employment in the Colorado Plateau after 1950 and were
followed for lung cancer death to December, 1990. Tied ages of death were broken randomly and a risk set for each lung cancer
mortality case was defined as all cohort members who were alive and on-study at age of the case’s death and who attained
that age during the five-year calendar period in which the case died; i.e. risk sets with age as time scale with stratification by
five year calendar period; a data set that consists of 55,964 miner/time records. Forty controls were sampled from the risk sets
to form the nested case-control data set, with 10,322 miner/time records, used for the analysis.
As in Richardson [2009], three models were fitted: simple cumulative dose (time constant), bilinear with α0 fixed at 0, and
log-normal. Each model was fitted using the unconditional logistic likelihood, with a separate stratum parameter for each risk
set, and using the conditional logistic (partial) likelihood, in the latter case with procedures NLMIXED and NLP to do the fitting.
For each model and method, we tabulated the dose-response and latency parameter estimates, standard errors, and model
deviance as well as the time required to fit the model on a Lenovo x300 Thinkpad laptop computer. Finally, we estimated
Wald and profile likelihood confidence intervals for the dose-response parameter β, the latter using the PROFILE command in
PROC NLP.
The data set and SAS code used to do these analyses have been posted at
http://hydra.usc.edu/timefactors/examples/exampl.html, topic 12.
RESULTS
Table 2 gives the parameter estimates, standard errors, and deviances from each of the fitted models. The latency and doseresponse parameters are required in the model specification for both unconditional and conditional logistic likelihood methods
4
Table 2: Latency model and radon dose-response parameters and standard errors and model deviances from unconditional
and conditional logistic likelihood fitting.
Type of logistic likelihood
Unconditional (NLMIXED)
Conditional (NLMIXED or NLP)
Latency function
Time constant
Parameter
β
Estimate
0.32
SE
0.096
Deviance
2329.2
Estimate
0.28
SE
0.083
Deviance
1815.3
Bilinear
α1
α2
β
8.6
34.1
0.83
3.8
2.9
0.30
2316.7
9.8
35.0
0.78
4.2
2.6
0.30
1803.5
Lognormal
θ
λ
β
2.7
0.59
15.2
0.16
0.13
5.3
2318.2
2.7
0.53
13.3
0.14
0.11
4.7
1805.2
Table 3: CPU time to fit latency models.
Model
Time constant
Bilinear
Log-normal
Unconditional
logistic (NLMIXED)
4 min 15 sec
9 min 3 sec
17 min 21 sec
Conditional
logistic (NLMIXED)
8 sec
1 min 21 sec
3 min 29 sec
Conditional
logistic (NLP)
1 sec
25 sec
29 sec
Table 4: Wald and profile likelihood 95% confidence intervals for the dose response parameter β.
Wald
0.12-0.44
0.19-1.37
1.6-25.0
β
Time constant
0.28
Bilinear
0.78
Lognormal
13.3
a No convergence.
Profile
0.16-0.52
–a
6.9-28.7
but the unconditional also included parameters for each of the 263 risk set defined strata. There are differences between the
unconditional logistic likelihood estimates and standard errors and those from the conditional logistic, but these are relatively
small. Also, while the different likelihoods yield different absolute deviances, the between-model differences in the deviances
are similar. Thus, at least for our example, results from the two likelihoods are consistent with each other. The results using
conditional logistic likelihood fitted with PROC NLP were virtually identical to those using PROC NLMIXED, with differences in
the third decimal.
Table 3 shows the reported CPU times to fit each of the models. The time required to fit the conditional logistic likelihood is
a number of orders of magnitude less than needed to fit the unconditional, probably because fitting a large number of strata
parameters is avoided. The last column shows the computing times for the conditional logistic likelihood fitting using PROC
NLP. These times are a number of orders of magnitude smaller than the conditional logistic fitted using PROC NLMIXED.
Table 4 Wald and profile likelihood 95% confidence intervals from each of the latency models. The intervals are somewhat
different, with the profile likelihood interval shifted toward larger values relative to the Wald interval. The profile likelihood
intervals were computationally challenging. For the log-normal model, computation of the interval took about 8 additional
minutes of CPU time compared to just 29 seconds to estimate the β parameter and standard errors. Profile likelihood limits for
bilinear model β parameter could not be computed using the latency macro because some parameter combinations resulted
in negative likelihood contributions. Even after constraining parameters to assure positive rate ratios (phi in the code), the
search failed.
DISCUSSION
We have shown that the strategy for fitting latency models using an unconditional logistic likelihood given in Richardson [2009]
is easily adapted to the conditional logistic setting. Although the effective dose value D(t; α) is a function of the latency
function parameters α, the code required is fairly straightforward and easy to implement. When appropriate, the conditional
logistic likelihood will be preferable because it is the natural partial likelihood for the cohort (or nested case-control data)
and for the practical reason that fitting is much faster than the unconditional likelihood. Further, we found that PROC NLP
was much faster at fitting the conditional likelihood than PROC NLMIXED probably because of factors related to the mixed
modeling components of NLMIXED, which we do not use in these analyses. Finally, we found that profile likelihood confidence
5
intervals could be computed using PROC NLP for parameters in some of the latency models but not all, and that computing
these intervals takes some time. In situations when profile limits do not converge, it may be possible to identify the confidence
interval by plotting the log-likelihood over a grid of parameter values.
We have focused on risk set or nested case-control data with “no tied failure times.” When there are ties, these can be
randomly broken as was done for the miners cohort data. When breaking the ties does not seem appropriate because, for
instance, failure status is determined only over larger time intervals, then the unconditional logistic likelihood provides a valid
approach to analyze the grouped time data. An alternative may be estimation via the conditional logistic likelihood for multiple
failures [Langholz and Richardson, 2010].
Whenever one is fitting complex models, it is important to be sure that the estimated parameters are reasonable and well
represent the data. For latency models, latency parameters are only estimable when there evidence of dose response so it is
important to first establish an exposure-disease association. Then, descriptive latency models, such as the piecewise constant
model can provide an idea of the general shape of the latency function and often suggest good starting values for latency
model parameters.
REFERENCES
V.E. Archer, J.K. Wagoner, and F.E. Lundin. Uranium mining and cigarette smoking effects on man. Journal of Occupational
Medicine, 15:204–211, 1973.
K. Berhane, M. Hauptmann, and B. Langholz. Using tensor product splines in modeling exposure-time-response relationships:
application to the colorado plateau uranium miners cohort. Stat Med, 27(26):5484–5496, 2008.
N. E. Breslow and N. E. Day. Statistical Methods in Cancer Research. Volume II – The Design and Analysis of Cohort Studies,
volume 82 of IARC Scientific Publications. International Agency for Research on Cancer, Lyon, 1987.
M. Hauptmann, K. Behrane, B. Langholz, and Lubin J.H. Using splines to analyze latency in the colorado plateau uranium
miners cohort. Journal of Epidemiology and Biostatistics, 6:417–424, 2001.
R.W. Hornung and T.J. Meinhardt. Quantitative risk assessment of lung cancer in U. S. uranium miners. Health Physics, 52:
417–430, 1987.
B. Langholz and D. B. Richardson. Fitting general relative risk models for survival time and matched case-control analysis.
American Journal of Epidemiology, 171:377–83, 2010.
B. Langholz, N. Rothman, S. Wacholder, and D.C. Thomas. Cohort studies for characterizing measured genes. Monographs
Journal of the National Cancer Institute, 26:39–42, 1999.
D. B. Richardson. Latency models for analyses of protracted exposures. Epidemiology, 20:395–9, 2009.
D.O. Stram, B. Langholz, M. Huberman, and D.C. Thomas. Correcting for dosemetry error in a reanalysis of lung cancer
mortality for the colorado plateau uranium miners cohort. Health Physics, 77:265–275, 1999.
D.C. Thomas. Temporal effects and interactions in cancer: Implications of carcinogenic models. In R.L. Prentice and A.S.
Whittemore, editors, Environmental Epidemiology: Risk Assesment, pages 107–121. Society for Industrial and Applied
Mathematics, Philadelphia, 1982.
D.C. Thomas. Exposure-time-response relationships with applications to cancer epidemiology. Annual Review of Public Health,
9:451–482, 1988.
6
CONTACT INFORMATION
Comments and questions are valued and encouraged. Contact the authors:
Bryan Langholz
Department of Preventive Medicine
USC Keck School of Medicine
2001 N Soto Street
Second Floor, MC 9237
Los Angeles, CA 90089
langholz@usc.edu
http://hydra.usc.edu/langholz
David B. Richardson
Department of Epidemiology
School of Public Health
University of North Carolina
Chapel Hill, NC 27599-7435
david.richardson@unc.edu
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in
the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.
7