Characterizing the Performance of the Conway-Maxwell Poisson Generalized Linear Model
Royce A. Francis¹, Srinivas Reddy Geedipally², Seth D. Guikema¹, Soma Sekhar Dhavala³, Dominique Lord², Sarah LaRocca¹

¹ Department of Geography and Environmental Engineering, Johns Hopkins University, 313 Ames Hall, 3400 N. Charles St., Baltimore, MD 21218
{royce.francis@jhu.edu, sguikema@jhu.edu, larocca@jhu.edu}
² Zachry Department of Civil Engineering, Texas A&M University, 3136 TAMU, College Station, TX 77843-3136
{srinivas8@tamu.edu, d-lord@tamu.edu}
³ Department of Statistics, Texas A&M University, 3143 TAMU, College Station, TX 77843-3143
{somasd@gmail.com}
Authors’ Footnote: Royce Francis is Postdoctoral Fellow, Seth Guikema is Assistant Professor, and Sarah LaRocca is Graduate Research Assistant in the Department of Geography and Environmental Engineering, The Johns Hopkins University, Baltimore, MD 21218 (e-mail: royce.francis@jhu.edu, sguikema@jhu.edu, larocca@jhu.edu). Srinivas Geedipally is Engineering Research Associate, Dominique Lord is Assistant Professor, Texas Transportation Institute, College Station, TX 77843 (e-mail: srinivas8@tamu.edu, d-lord@tamu.edu). Soma Dhavala is Postdoctoral Fellow, Department of Statistics, Texas A&M University, College Station, TX 77843 (e-mail: somasd@gmail.com). This work was partially supported by the National Science Foundation (grant ECCS-0725823), the U.S. Department of Energy (grant BER-FG02-08ER64644) and the Whiting School of Engineering. All opinions in this paper are those of the authors and do not necessarily reflect the views of the sponsors.
Characterizing the Performance of the Conway-Maxwell Poisson GLM
ABSTRACT
This paper documents the performance of a maximum likelihood estimator (MLE) for the Conway-Maxwell-Poisson (COM-Poisson) generalized linear model (GLM). The COM-Poisson distribution was originally developed as an extension of the Poisson distribution in 1962 and has the unique characteristic that it can handle both under-dispersed and over-dispersed count data. Parameter estimation for this model is done using a maximum likelihood estimation approach for the COM-Poisson GLM with a single link function. The objectives of this paper are to (1) characterize the parameter estimation accuracy of the MLE implementation of the COM-Poisson GLM and (2) estimate the prediction accuracy of the COM-Poisson GLM. We use simulated datasets to assess the performance of the COM-Poisson GLM. The results of the study indicate that the COM-Poisson GLM is flexible enough to model under-, equi- and over-dispersed datasets with different sample mean values. The results also show that the COM-Poisson GLM yields accurate parameter estimates. Furthermore, we show that the Shmueli et al. (2005) asymptotic approximation of the mean of the COM-Poisson distribution is an accurate approximation even when the sample mean of the data is substantially below the lower bound previously suggested. However, the approximation is less accurate for very low sample mean values, and we characterize the degree of this inaccuracy. The COM-Poisson GLM provides a promising and flexible approach for performing count data regression.

Keywords: Generalized linear model, count data, maximum likelihood estimation, regression

1. INTRODUCTION
The Poisson family of discrete distributions stands as a benchmark for analyzing count data. However, because the Poisson distribution itself has well-known limitations due to the assumed mean-variance structure, a number of generalizations have been proposed. The Conway-Maxwell Poisson (COM-Poisson) distribution is one of these generalizations. Originally developed in 1962 as a method for modeling both under-dispersed and over-dispersed count data (Conway and Maxwell 1962), the COM-Poisson distribution was then revisited by Shmueli et al. (2005) after a period in which it was not widely used. Shmueli et al. (2005) derived many of the basic properties of the distribution. The COM-Poisson belongs to the exponential family as well as to the two-parameter power series family of distributions. It introduces an extra parameter, ν, which governs the rate of decay of successive ratios of probabilities. It nests the usual Poisson (ν = 1), geometric (ν = 0) and Bernoulli (ν = ∞) distributions, and it allows for both thicker and thinner tails than those of the Poisson distribution (Boatwright, Borle and Kadane 2003, Shmueli, et al. 2005). The conjugate priors for the parameters of the COM-Poisson distribution have also been derived (Kadane, Shmueli, Minka, Borle and Boatwright 2006).

The COM-Poisson distribution has recently become much more widely used, including studies analyzing word length (Shmueli et al. 2005), birth process models (Ridout and Besbeas 2004), prediction of purchase timing and quantity decisions (Boatwright et al. 2003), quarterly sales of clothing (Shmueli et al. 2005), internet search engine visits (Telang, Boatwright and Mukhopadhyay 2004), the timing of bid placement and extent of multiple bidding (Borle, Boatwright and Kadane 2006), developing cure rate survival models (Rodrigues et al. 2009), modeling electric power system reliability (Guikema and Coffelt 2008), modeling the number of car breakdowns (Mamode Khan and Jowaheer 2010) and modeling motor vehicle crashes (***** 2008, 2010).

Only recently has the COM-Poisson distribution been used in a generalized linear model setting, and the estimation efficiency and bias have not been adequately assessed. Guikema and Coffelt (2006, 2008) introduced a generalized linear model (GLM) built on the COM-Poisson, and ***** (2008, 2010) then utilized this model to analyze traffic accident data. The approach of Guikema and Coffelt (2006, 2008) depended on MCMC for fitting a dual-link GLM based on the COM-Poisson distribution. It also used a reformulation of the COM-Poisson to provide a more direct centering parameter than the original COM-Poisson formulation. Sellers and Shmueli (2008) developed an MLE for a single-link GLM based on the original COM-Poisson distribution. Jowaheer and Mamode Khan (2009) compared the efficiency of quasi-likelihood and MLE approaches for estimating the parameters of a single-link COM-Poisson GLM in the case of equidispersion based on simulated data sets.

Aside from the limited tests of Jowaheer and Mamode Khan (2009), there has not been any evaluation of the accuracy or bias of parameter estimates from the COM-Poisson distribution, particularly for cases with underdispersion and overdispersion. Given that a major advantage of the COM-Poisson GLM is its ability to handle both underdispersion and overdispersion within a single conditional distribution, testing the estimation accuracy and bias is a critical need. The objectives of this paper are to: (1) evaluate the estimation accuracy of the Guikema and Coffelt COM-Poisson GLM for datasets characterized by overdispersion, underdispersion and equidispersion with different means based on a maximum likelihood estimator, and (2) characterize the accuracy of the asymptotic approximation of the mean of the COM-Poisson distribution suggested by Shmueli et al. (2005) for the Guikema and Coffelt COM-Poisson GLM. In addition, we briefly discuss potential convergence issues with the infinite series in the normalizing factor of the distribution. We based our analysis on 900 simulated data sets representing 9 different mean-variance relationships spanning underdispersion, overdispersion, and equidispersion for low, moderate, and high means.

This paper is organized as follows. The next section describes the COM-Poisson distribution and its GLM framework. The third section presents our research method. The fourth section gives the results of our computational study. The fifth section gives a brief discussion of the results, and the sixth section provides concluding comments.

2. BACKGROUND
This section describes the characteristics of the COM-Poisson distribution and the COM-Poisson GLM framework.

Parameterization. The COM-Poisson distribution was first introduced by Conway and Maxwell (1962) for modeling queues and service rates. Although the COM-Poisson distribution is not particularly new, it was largely unstudied and unused until Shmueli et al. (2005) derived the basic properties of the distribution.

The COM-Poisson distribution is a two-parameter extension of the Poisson distribution that generalizes some well-known distributions including the Poisson, Bernoulli, and geometric distributions (Shmueli et al. 2005). It also offers a more flexible alternative to distributions derived from these discrete distributions, such as the binomial and negative binomial distributions. The COM-Poisson distribution can handle both under-dispersion (variance less than the mean) and over-dispersion (variance greater than the mean). The probability mass function (PMF) of the COM-Poisson for the discrete count Y is given as:

$$P(Y = y) = \frac{\lambda^{y}}{(y!)^{\nu}\, Z(\lambda, \nu)}, \qquad y = 0, 1, 2, \ldots \tag{1}$$

$$Z(\lambda, \nu) = \sum_{n=0}^{\infty} \frac{\lambda^{n}}{(n!)^{\nu}} \tag{2}$$

Here λ is a centering parameter that is directly related to the mean of the observations and ν is the shape parameter of the COM-Poisson distribution. Z is the normalizing constant. The condition ν > 1 corresponds to under-dispersed data, ν < 1 to over-dispersed data, and ν = 1 to equi-dispersed (Poisson) data. Several common PMFs are special cases of the COM-Poisson with the original formulation. Specifically, setting ν = 0 with λ < 1 yields the geometric distribution, ν → ∞ yields the Bernoulli distribution in the limit, and ν = 1 yields the Poisson distribution. This flexibility greatly expands the types of problems for which the COM-Poisson distribution can be used to model count data.

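As a minimal sketch of the original (λ, ν) formulation, the PMF can be evaluated numerically by truncating the infinite series for Z. The function names, the stopping rule and the tolerance below are illustrative assumptions and are not taken from the implementation used in this study.

```python
import math


def com_poisson_log_z(lam, nu, tol=1e-12, max_terms=100_000):
    """Log of Z(lam, nu) = sum_{n>=0} lam**n / (n!)**nu, truncated numerically.

    Assumes lam > 0 and nu >= 0; the series diverges for nu == 0 and lam >= 1.
    """
    log_terms = []
    largest = -math.inf
    for n in range(max_terms):
        log_term = n * math.log(lam) - nu * math.lgamma(n + 1)
        log_terms.append(log_term)
        largest = max(largest, log_term)
        # Stop once the terms are decreasing and negligible next to the largest.
        if n > 0 and log_term < log_terms[-2] and log_term - largest < math.log(tol):
            break
    return largest + math.log(sum(math.exp(t - largest) for t in log_terms))


def com_poisson_pmf(y, lam, nu):
    """P(Y = y) = lam**y / ((y!)**nu * Z(lam, nu))."""
    log_p = y * math.log(lam) - nu * math.lgamma(y + 1) - com_poisson_log_z(lam, nu)
    return math.exp(log_p)
```

With ν = 1 the function reproduces the Poisson PMF, and with ν = 0 and λ < 1 it reproduces the geometric PMF (1 − λ)λ^y, which gives a quick check on the special cases described above.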
The asymptotic expressions for the mean and variance of the COM-Poisson derived by Shmueli et al. (2005) are given by Equations (3) and (4) below.
(3)
(4)
The COM-Poisson distribution does not have closed-form expressions for its moments in terms of the parameters λ and ν. However, the mean can be approximated through a few different approaches, including (i) using the mode, (ii) including only the first few terms of Z when ν is large, (iii) bounding E[Y] when ν is small, and (iv) using an asymptotic expression for Z in Equation (1). Shmueli et al. (2005) used the last approach to derive the approximation for Z and the mean as:

$$E[Y] \approx \lambda^{1/\nu} + \frac{1}{2\nu} - \frac{1}{2} \tag{5}$$

Using the same approximation for Z as in Shmueli et al. (2005), the variance can be approximated as:

$$Var[Y] \approx \frac{1}{\nu}\, \lambda^{1/\nu} \tag{6}$$

Shmueli et al. (2005) suggest that these approximations may not be accurate for ν > 1 or λ^{1/ν} < 10.

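The quality of the asymptotic mean and variance approximations can be checked against moments computed directly from the truncated series. The sketch below is illustrative only: the function names, the fixed truncation point and the parameter pairs in the loop are assumptions, not values from the simulation study.

```python
import math


def com_poisson_moments(lam, nu, max_terms=5000):
    """Mean and variance from the series truncated at max_terms (an assumed cutoff)."""
    log_w = [n * math.log(lam) - nu * math.lgamma(n + 1) for n in range(max_terms)]
    shift = max(log_w)
    w = [math.exp(t - shift) for t in log_w]  # far-tail weights underflow to 0.0
    z = sum(w)
    mean = sum(n * wn for n, wn in enumerate(w)) / z
    second = sum(n * n * wn for n, wn in enumerate(w)) / z
    return mean, second - mean ** 2


def asymptotic_moments(lam, nu):
    """Mean ~ lam**(1/nu) + 1/(2 nu) - 1/2 and variance ~ lam**(1/nu) / nu."""
    centre = lam ** (1.0 / nu)
    return centre + 1.0 / (2.0 * nu) - 0.5, centre / nu


# Illustrative (nu, lam**(1/nu)) pairs; not the scenarios used in the paper.
for nu, centre in [(0.5, 20.0), (1.0, 20.0), (2.0, 20.0), (0.5, 0.8), (2.0, 0.8)]:
    lam = centre ** nu
    exact = com_poisson_moments(lam, nu)
    approx = asymptotic_moments(lam, nu)
    print(f"nu={nu:3.1f} lam^(1/nu)={centre:5.1f} "
          f"mean {exact[0]:7.3f} vs {approx[0]:7.3f}  "
          f"var {exact[1]:7.3f} vs {approx[1]:7.3f}")
```

Because the weights are rescaled by the largest log-term before exponentiation, the comparison does not overflow even when λ^{1/ν} is large.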
Despite its flexibility and attractiveness, the COM-Poisson has limitations in its usefulness as a basis for a GLM, as documented in Guikema and Coffelt (2008). In particular, neither λ nor ν provides a clear centering parameter. While λ is approximately the mean when ν is close to one, it differs substantially from the mean for small ν. Given that ν would be expected to be small for over-dispersed data, this would make a COM-Poisson model based on the original COM-Poisson formulation difficult to interpret and use for over-dispersed data.

Guikema and Coffelt (2008) proposed a re-parameterization using a new parameter µ = λ^{1/ν} to provide a clear centering parameter. This new formulation of the COM-Poisson is summarized in Equations (7) and (8) below:

$$P(Y = y) = \frac{1}{S(\mu, \nu)} \left( \frac{\mu^{y}}{y!} \right)^{\nu} \tag{7}$$

$$S(\mu, \nu) = \sum_{n=0}^{\infty} \left( \frac{\mu^{n}}{n!} \right)^{\nu} \tag{8}$$

By substituting λ = µ^ν in equations (4), (5), and (41) of Shmueli et al. (2005), the mean and variance of Y are given in terms of the new formulation as $E[Y] = \frac{1}{\nu} \frac{\partial \log S}{\partial \log \mu}$ and $Var[Y] = \frac{1}{\nu^{2}} \frac{\partial^{2} \log S}{\partial \log^{2} \mu}$, with asymptotic approximations E[Y] ≈ µ + 1/(2ν) − 1/2 and Var[Y] ≈ µ/ν especially accurate once µ > 10. With this new parameterization, the integral part of µ is the mode, leaving µ as a reasonable centering parameter. The substitution λ = µ^ν also allows ν to keep its role as a shape parameter. That is, if ν < 1, the variance is greater than the mean, while ν > 1 leads to under-dispersion. In this paper we investigate the accuracy of the approximation E[Y] ≈ µ + 1/(2ν) − 1/2 more closely.

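The claim that the integral part of µ is the mode follows from the ratio of successive probabilities, P(Y = y)/P(Y = y − 1) = (µ/y)^ν, which exceeds one exactly when y < µ. A minimal numerical sketch of the re-parameterized PMF is given below; the function names and the fixed truncation point are assumptions made for illustration.

```python
import math


def log_s(mu, nu, max_terms=2000):
    """Log of S(mu, nu) = sum_{n>=0} (mu**n / n!)**nu, truncated at an assumed cutoff."""
    log_terms = [nu * (n * math.log(mu) - math.lgamma(n + 1)) for n in range(max_terms)]
    shift = max(log_terms)
    return shift + math.log(sum(math.exp(t - shift) for t in log_terms))


def pmf_mu(y, mu, nu):
    """Re-parameterized PMF: P(Y = y) = (mu**y / y!)**nu / S(mu, nu)."""
    return math.exp(nu * (y * math.log(mu) - math.lgamma(y + 1)) - log_s(mu, nu))


def mode(mu, nu, upper=500):
    """Most probable count, found by direct search over the unnormalized log-weights."""
    return max(range(upper), key=lambda y: nu * (y * math.log(mu) - math.lgamma(y + 1)))
```

For a non-integer µ such as 5.7, the search returns 5 for both an over-dispersed (ν < 1) and an under-dispersed (ν > 1) shape parameter, so µ centers the distribution regardless of ν.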
This new formulation provides a good basis for developing a COM-Poisson GLM. The clear centering parameter provides a basis on which the centering link function can be built, allowing ease of interpretation across a wide range of values of the shape parameter. Furthermore, the shape parameter ν provides a basis for using a second link function to allow the amount of over-dispersion, equi-dispersion or under-dispersion to vary across measurements.

Generalized Linear Model. Guikema and Coffelt (2008) developed a COM-Poisson GLM framework for modeling discrete count data using the reformulation of the COM-Poisson given in equations (7) and (8). This dual-link GLM framework, in which both the mean and the variance depend on covariates, is given in equations (9-12), where Y is the count random variable being modeled and xi and zi are covariates. There are p covariates used in the centering link function and q covariates used in the shape link function. The sets of parameters used in the two link functions do not need to be identical. If a single-link model is desired, the second link given by equation (12) can be removed, allowing a single ν to be estimated directly.
(9)
(10)
(11)
(12)
In our simulation exercise, we employ equations (11) and (12) as a single link function. The COM-Poisson log-likelihood for a single observation with a single link function may now be written as:
(13)
Let
. The log-likelihood for the entire dataset (N observations) is given as:
(14)
To obtain the maximum likelihood estimates (MLE) for the parameters, we suggest using unconstrained optimization to avoid convergence issues in the iterative solution method. To use unconstrained optimization, let
. The likelihood now takes the form:
(15)
To find the MLE for the parameters, we first find the partial derivatives of the log-likelihood, l, with respect to the coefficients and the dispersion parameter, which are given as:
(16)
and,
.
(17)
These equations are solved using the iteratively re-weighted least squares approaches described in Nelder and Wedderburn (1972) and Wood (2006).
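
The log-likelihood being maximized can be sketched directly from the re-parameterized PMF, P(Y = y) = (µ^y / y!)^ν / S(µ, ν). The code below is a hedged illustration, not the estimation routine used in this study: the function names are hypothetical, a log link for µ is assumed, and passing the shape as log ν is one assumed way to make the problem unconstrained.

```python
import math


def log_s_from_log_mu(log_mu, nu, max_terms=2000):
    """Log of S(mu, nu), truncated at an assumed cutoff and evaluated in log-space."""
    log_terms = [nu * (n * log_mu - math.lgamma(n + 1)) for n in range(max_terms)]
    shift = max(log_terms)
    return shift + math.log(sum(math.exp(t - shift) for t in log_terms))


def loglik(beta, log_nu, X, y):
    """Sum over i of nu * (y_i * log(mu_i) - log(y_i!)) - log S(mu_i, nu).

    Assumptions of this sketch: log(mu_i) = beta[0] + sum_j beta[j] * x_ij, and
    nu = exp(log_nu), so any real log_nu gives a valid positive shape parameter.
    """
    nu = math.exp(log_nu)
    total = 0.0
    for x_i, y_i in zip(X, y):
        log_mu = beta[0] + sum(b * x for b, x in zip(beta[1:], x_i))
        total += nu * (y_i * log_mu - math.lgamma(y_i + 1))
        total -= log_s_from_log_mu(log_mu, nu)
    return total
```

Setting log_nu = 0 recovers the ordinary Poisson log-likelihood, because S(µ, 1) = exp(µ). That identity is a convenient check before handing the function to a general-purpose optimizer.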
To complete the MLE, we must obtain the information matrix at the MLE of the log-likelihood to estimate the standard errors of the coefficient estimates. Sellers and Shmueli (2008) present useful results implied by the likelihood equations for finding the information matrix at the MLE, which will later be useful. By setting equations (16) and (17) equal to zero, we see the following:
(18)
and,
(19)
These results not only permit solution of the MLE using iteratively re-weighted least squares, but also allow the information matrix (Hessian of the log-likelihood) to be solved as a function of the “true” parameters. The information matrix at the MLE of the log-likelihood, I, is the expected second derivative matrix of the log-likelihood:
(20)
The parts of the information matrix follow (equations 21-23):
Iβ :
(21)
Iβ,ν:
(22)
Iν :
(23)
The information matrix at the MLE may now be used to obtain the standard errors of the regression coefficients. Based on a large-sample approximation, the general large-sample limit for the parameters
is (Wood 2006):
.
(24)
Due to the GLM formulation, and our use of the Poisson distribution, the standard errors follow directly from this result (e.g.,
), and we can find the confidence intervals for each parameter:
.
(25)
The GLM described above is highly flexible and readily interpreted. It can model under-dispersed datasets, over-dispersed datasets, and datasets that contain intermingled under-dispersed and over-dispersed counts (for dual-link COM-Poisson GLM models only). While we have not presented a dual-link COM-Poisson GLM, in the dual-link case the variance may be allowed to depend on the covariate values, which can be important if high (or low) values of some covariates tend to be variance-decreasing while high (or low) values of other covariates tend to be variance-increasing. The parameters have a direct link to either the mean or the variance, providing insight into the behavior and driving factors in the problem, and the mean and variance of the predicted counts are readily approximated based on the covariate values and regression parameter estimates.

3. METHODOLOGY
To test the estimation accuracy and computational burden of the MLE implementation of the COM-Poisson GLM of Guikema and Coffelt (2008), we simulated a number of datasets from the COM-Poisson GLM with known regression parameters that correspond to a range of mean and variance values. We then estimated the regression parameters of the COM-Poisson GLM using the MLE implementation and compared the estimates to the known parameter values. In this section, we present the procedural details.

Data Simulation. In order to characterize the accuracy of the parameter estimates from the COM-Poisson GLM, one hundred datasets, with 1,000 observations each, were randomly generated for each of nine different scenarios corresponding to different levels of dispersion and mean. The nine scenarios include simulated datasets of under-dispersed, equi-dispersed and over-dispersed data. For each level of dispersion, three different sample means were used: high mean (~ 20.0), moderate mean (~ 5.0) and low mean (~ 0.8). Each of these 900 datasets was then used as input for the COM-Poisson GLM, and the resulting parameter estimates were compared to the known parameter values that had been used to generate the datasets.

Initially, we simulated 1,000 values of the covariates X1 and X2 from a uniform distribution on [0, 1]. The centering parameter µ and shape parameter ν were then generated according to Equations (11) and (12) with known (assigned) regression parameters. Note that the same covariates are used for both the centering and shape parameters. Realizations from the COM-Poisson distribution are then generated using the inverse CDF method (Devroye 1986).

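The inverse CDF step can be sketched as follows: tabulate the cumulative probabilities of the re-parameterized COM-Poisson, draw a uniform variate, and return the smallest count whose cumulative probability reaches it. This is an illustrative sketch under stated assumptions: the function names and coefficient values are hypothetical, a log link for µ is assumed, and ν is held fixed across observations for brevity.

```python
import bisect
import math
import random


def com_poisson_cdf_table(mu, nu, tail=1e-12, max_terms=100_000):
    """Cumulative probabilities P(Y <= y), y = 0, 1, ..., for (mu**y / y!)**nu / S."""
    log_w = []
    largest = -math.inf
    for n in range(max_terms):
        lw = nu * (n * math.log(mu) - math.lgamma(n + 1))
        log_w.append(lw)
        largest = max(largest, lw)
        # Past the mode the weights only decrease, so a negligible weight ends the table.
        if n > mu and lw - largest < math.log(tail):
            break
    weights = [math.exp(lw - largest) for lw in log_w]
    total = sum(weights)
    cdf, running = [], 0.0
    for w in weights:
        running += w / total
        cdf.append(running)
    return cdf


def sample_com_poisson(mu, nu, rng, cdf=None):
    """Inverse-CDF draw: smallest y whose cumulative probability reaches a uniform u."""
    if cdf is None:
        cdf = com_poisson_cdf_table(mu, nu)
    u = rng.random()
    return min(bisect.bisect_left(cdf, u), len(cdf) - 1)


def simulate_dataset(beta, nu, n_obs, seed):
    """Rows of (x1, x2, y) with covariates uniform on [0, 1] and an assumed log link."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_obs):
        x1, x2 = rng.random(), rng.random()
        mu = math.exp(beta[0] + beta[1] * x1 + beta[2] * x2)
        rows.append((x1, x2, sample_com_poisson(mu, nu, rng)))
    return rows


# Hypothetical coefficients, chosen only to illustrate the call.
dataset = simulate_dataset(beta=[1.0, 0.5, -0.3], nu=0.8, n_obs=1000, seed=1)
```

Using `bisect_left` on the tabulated CDF keeps each draw at logarithmic cost in the table length, and the final `min` guards against a uniform variate that exceeds the last tabulated value through rounding.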
The regression parameter values were selected in such a way that the shape parameter ν was always set between 0 and 1 for simulating the over-dispersed datasets, above 1 for the under-dispersed datasets and approximately 1 for the equi-dispersed datasets. The parameters that were assigned in simulating the datasets are given in the table below. Table 1 summarizes the characteristics of the simulation scenarios.

Testing Protocol. The coefficients of the COM-Poisson GLM were estimated using the MLE described above, implemented in R (R Core Development Team 2008).

4. RESULTS
This section assesses the performance of the Guikema and Coffelt COM-Poisson GLM for the nine scenarios mentioned above. The results concerning the accuracy of the asymptotic approximation of the mean of the COM-Poisson suggested by Shmueli et al. (2005) are also discussed.

Parameter Estimation Accuracy. Table 2 reports the fraction of simulated datasets where the 95% confidence interval does not contain the “true” parameter value for the parameter estimates. Our results show that, generally, the true coefficient values are contained within the 95% confidence interval with 95% frequency. Some instances where this is not the case are the dispersion parameter under S1 and S7, and the intercept under S3. Estimating the dispersion parameter under the MLE approach presented may be difficult. In only two cases (S2, S6) does the 95% confidence interval contain the true value of the dispersion parameter with greater than 95% frequency. While under most scenarios the 95% confidence interval contains the true parameter value with greater than 90% frequency, the two highest fractions of simulated datasets where the confidence interval does not contain the “true” parameter value correspond to estimation of the dispersion parameter. The true value of the dispersion parameter may be difficult to ascertain with the level of confidence assumed by the α-level selected.

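The coverage fractions of the kind reported in Table 2 can be computed with a few lines of code. The sketch below assumes normal-theory (Wald) intervals of the form estimate ± z × standard error; the function names and the numbers in the usage line are hypothetical and are not results from this study.

```python
from statistics import NormalDist


def wald_interval(estimate, std_error, level=0.95):
    """Normal-theory interval: estimate +/- z * std_error (an assumed interval form)."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return estimate - z * std_error, estimate + z * std_error


def miss_fraction(estimates, std_errors, true_value, level=0.95):
    """Fraction of simulated datasets whose interval does not contain the true value."""
    misses = 0
    for est, se in zip(estimates, std_errors):
        lower, upper = wald_interval(est, se, level)
        if not (lower <= true_value <= upper):
            misses += 1
    return misses / len(estimates)


# Made-up estimates and standard errors from four hypothetical datasets.
fraction = miss_fraction([1.0, 1.3, 0.9, 1.05], [0.1, 0.1, 0.1, 0.1], true_value=1.0)
```

A well-calibrated 95% interval should produce a miss fraction near 0.05 over many simulated datasets, which is the benchmark against which the scenario-level fractions are read.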
Histograms of the MLE linear and dispersion parameters are plotted and compared with the true parameters in Figures 1-3. Each figure corresponds to a specific dispersion level and each subplot corresponds to a different dataset scenario. Figure 1 illustrates histograms for the over-dispersed scenarios (S1-S3). While the parameter estimates generally seem to be appropriate for all of the linear parameters, β, the dispersion parameter, ν, for the high-mean overdispersed scenario (S1) is overestimated. The dispersion parameters for the moderate- and low-mean scenarios (S2 and S3) appear to be better behaved, and the ranges of the parameter distributions in the low-mean case are much wider (1.0-3.0) than in the moderate- and high-mean cases (0.1-0.5) for each parameter estimated. While this pattern is not replicated for the under- and equidispersed scenarios, the dispersion parameter distribution for all scenarios except S3 is wider than those of its attendant linear parameter distributions. For example, Figure 2 illustrates the plots for the under-dispersed scenarios (S4-S6). As with the overdispersed cases, all linear parameter estimates are well behaved. In the underdispersed scenarios, the dispersion parameters are also well behaved, although the dispersion parameter distribution for the high-mean case (S4) seems slightly skewed to the right. Furthermore, the low-mean scenario (S6) has a wider parameter estimate distribution for each parameter when compared to its high- and moderate-mean counterparts. Figure 3 replicates this pattern for the equidispersed scenarios (S7-S9), though the magnitudes of the parameter ranges are generally smaller than those for the overdispersed scenarios (S1-S3) and slightly larger than those for the underdispersed scenarios (S4-S6).

Prediction Bias. The bias of an estimator is defined as the difference between an estimator's expected value and the true value of the parameter being estimated. If the bias is zero then the estimator is said to be unbiased. The bias of an estimator $\hat{\theta}$ is calculated as $E(\hat{\theta}) - \theta$, where θ is the true value of the parameter and the estimator $\hat{\theta}$ is a function of the observed data.

The bias of the parameters β and ν under each scenario is calculated as the difference between their average estimates from the 100 simulated datasets and the true (or assigned) value in each scenario.

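In code, this Monte Carlo bias is simply the average of the estimates across simulated datasets minus the assigned value. The function name and the numbers below are hypothetical and serve only to illustrate the calculation.

```python
from statistics import mean


def estimator_bias(estimates, true_value):
    """Average estimate across simulated datasets minus the assigned (true) value."""
    return mean(estimates) - true_value


# Made-up estimates of one coefficient from three hypothetical datasets.
bias = estimator_bias([0.9, 1.1, 1.3], true_value=1.0)
```

A positive value indicates that the estimator overshoots the assigned parameter on average, and a negative value indicates that it undershoots.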
The bias of the centering parameter ‘µ’ is calculated as

$$E(\hat{\mu}) - \mu = (\hat{\beta}_0 - \beta_0) + (\hat{\beta}_1 - \beta_1) X_1 + (\hat{\beta}_2 - \beta_2) X_2 \tag{26}$$

The bias of the dispersion parameter ‘ν’ is calculated as
(27)
Where
and
are the bias in the parameters and
is the average value of the independent variable.

Figure 4 reports the parameter bias for each parameter under each scenario. The overdispersed scenarios are shown in the left panel, the underdispersed in the middle panel, and the equidispersed in the right panel. For the overdispersed scenarios, the bias did not change much for the parameters, except for the intercept parameter under S3. For the underdispersed and equidispersed scenarios, the first two linear parameters and the dispersion parameter demonstrated little sensitivity to the dispersion and mean levels, while the intercept showed a marked sensitivity to these simulated conditions. As the mean decreased in these scenarios, the bias decreased for the intercept. In addition, Figure 4 suggests that the bias is largest for the overdispersed scenarios relative to the underdispersed and equidispersed scenarios.

Figures 5 and 6 report the bias in the mean estimates, given the parameter estimates. First, consider Figure 5. This plot illustrates, for each scenario, the distribution of the bias in the mean for each simulated observation in each dataset under each scenario (N=100,000). These plots suggest that the COM-Poisson GLM is biased in inferences drawn under a low-mean scenario where overdispersion is expected (S3). Furthermore, although the range of the bias distribution is largest for the large-mean equidispersed scenario (S7), the modes of the remaining bias distributions seem close to zero. This observation is corroborated by Figure 6, which illustrates the prediction accuracy under each scenario. The prediction accuracy scatterplots reflect the information encoded in Figure 5; with the exception of S2 and S3, the COM-Poisson GLM MLE estimator appears to be approximately unbiased. While the COM-Poisson GLM performs better for high and moderate means for all three categories of dispersion, the moderate- and low-mean scenarios under under- and equi-dispersion seem more comparable in their levels of accuracy than the same comparison in the over-dispersed scenarios. For over-dispersed and equi-dispersed datasets, the performance is worse for all low sample mean values. The COM-Poisson GLM works well for all sample mean values for the under-dispersed datasets.

Accuracy of the Asymptotic Mean Approximation. The centering parameter µ is believed to adequately approximate the mean when µ > 10 based on the asymptotic approximation developed by Shmueli et al. (2005). However, the deviation of µ for mean values below 10 (µ < 10) has not been investigated. We chose one sample from each of the nine scenarios as a basis for estimating the accuracy of the asymptotic mean approximation. First, the µ and ν parameters were calculated from the estimated parameters. We examined the goodness of this approximation by simulating 100,000 random values from the COM-Poisson distribution for a given µ and ν. We then plotted the mean of the simulated values against the asymptotic mean approximation (E[Y] ≈ µ + 1/(2ν) − 1/2). The results showed that the asymptotic mean approximates the true mean accurately even for 10 > E[Y] > 5. As the sample mean value decreases below 5, the accuracy of the approximation drops. As seen in Figure 7, the asymptotic approximation holds well for all datasets with high and moderate mean values irrespective of the dispersion in the data. The approximation is also accurate for low sample mean values for under-dispersed datasets. The accuracy drops substantially for the low sample mean values for over-dispersed and equi-dispersed datasets. There is not much difference between the asymptotic mean approximation and the true mean for E[Y] > 10. This difference begins to increase slightly at the moderate mean values, although the difference is not large. The difference can clearly be observed for the low sample mean values, particularly for the over-dispersed and equi-dispersed datasets. This shows that one must be careful in using the asymptotic approximation for the mean of the COM-Poisson GLM to estimate future event counts for datasets characterized by low sample mean values.

Convergence of S(µ,ν). When using the MLE approach to estimate parameters for the COM-Poisson GLM, it is necessary to compute S(µ,ν), an infinite sum involving the parameters µ and ν, as given in Equation 8. However, for some values of µ and ν, difficulties arise in achieving convergence for the sum. Table 3 summarizes the combinations of µ and ν requiring greater than 170 terms to achieve convergence (defined as ε = 0.0001) in S(µ,ν). These combinations are significant because calculating S(µ,ν) requires computing n!, where n is the index of the term in the infinite sum. Therefore, with combinations of µ and ν requiring summation of greater than 170 terms for convergence, it is necessary to calculate the factorial of numbers larger than 170. However, the built-in factorial function in R, as in most software, cannot be used for numbers greater than 170, because 170! is the largest factorial that can be represented as an IEEE double-precision value (Press et al. 2007). Thus, for certain combinations of µ and ν, it is necessary to use an alternate (and likely slower or less precise) algorithm for computing the factorial. However, as shown in Table 3, such difficulties will only arise when using data sets with a very high mean, and are therefore unlikely to be problematic in most applications. In this paper, all combinations of values of µ and ν required 170 or fewer terms to reach convergence.

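One common way around the 170! limit is to work with log n! (the log-gamma function) and accumulate the series in log-space, so that the factorial itself is never formed. The sketch below illustrates the idea; it is not the approach used in this study, the function names are hypothetical, and the stopping rule (a term smaller than ε times the running sum, past the mode) is an assumed reading of the convergence criterion.

```python
import math


def log_add(a, b):
    """log(exp(a) + exp(b)) computed without leaving log-space."""
    hi, lo = max(a, b), min(a, b)
    return hi + math.log1p(math.exp(lo - hi))


def terms_to_converge(mu, nu, eps=1e-4, max_terms=1_000_000):
    """Return (number of terms summed, log of the partial sum) for S(mu, nu).

    Each term (mu**n / n!)**nu is handled as nu * (n * log(mu) - lgamma(n + 1)).
    """
    log_sum = 0.0  # the n = 0 term equals 1, whose log is 0
    for n in range(1, max_terms):
        log_term = nu * (n * math.log(mu) - math.lgamma(n + 1))
        log_sum = log_add(log_sum, log_term)
        # Terms decrease once n exceeds mu; stop when the latest one is negligible.
        if n > mu and log_term - log_sum < math.log(eps):
            return n + 1, log_sum
    raise RuntimeError("series did not converge within max_terms")


# Illustrative parameter pairs; not the combinations listed in Table 3.
few_terms, _ = terms_to_converge(5.0, 1.0)
many_terms, _ = terms_to_converge(200.0, 0.5)
```

Because only logarithms are stored, the second call runs past 170 terms without overflow, whereas converting 171! to a double-precision float fails.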
5. DISCUSSION
This paper shows that the COM-Poisson GLM is flexible in handling count data irrespective of the dispersion in the data. First, the true parameters lie in the 95% confidence interval for nearly all cases, except as noted above, and are generally close to the estimated value of the parameters. The confidence intervals were found to be wider for the low mean values for both the centering and shape parameters. The bias in the prediction of the parameters and the mean also does not appear to be sensitive to assumptions we have made concerning sample mean and dispersion level, except for the intercept, whose bias decreased as the sample mean decreased under the under- and equi-dispersed scenarios. Even at the low sample mean values, the bias is considerably less for under-dispersed datasets than for over-dispersed and equi-dispersed datasets. Despite its flexibility in handling count data with all dispersions, the COM-Poisson GLM suffers from important limitations for moderate- and low-mean over-dispersed data. The Negative Binomial (Poisson-gamma) models exhibit similar behavior (Lord 2006). Second, the asymptotic approximation of the mean suggested by Shmueli et al. (2005) approximates the true mean adequately for E[Y] > 5. This value, obtained by numerical analysis of the COM-Poisson GLM, is substantially lower than the lower bound value of 10 suggested by Shmueli et al. (2005). As the sample mean value decreases, the accuracy of the approximation becomes lower. The asymptotic approximation is accurate for all datasets with high and moderate sample mean values irrespective of the dispersion in the data. The approximation is also accurate for low sample mean values for under-dispersed datasets. However, the accuracy drops substantially for low sample mean values for over-dispersed and equi-dispersed datasets.

In summary, this paper has documented the performance of the COM-Poisson GLM for datasets characterized by different levels of dispersion and sample mean values. The results of this study showed that COM-Poisson GLMs can handle under-, equi- and over-dispersed datasets with different mean values, although the confidence intervals are found to be wider for low sample mean values. Despite its limitations for low sample mean values for over-dispersed datasets, the COM-Poisson GLM is still a flexible method for analyzing count data. The asymptotic approximation of the mean is accurate for all datasets with high and moderate sample mean values irrespective of the dispersion in the data, and it is also accurate for low sample mean values for under-dispersed datasets. The COM-Poisson GLM is a promising, flexible regression model for count data.

REFERENCES
413
***** (2008), "Extension of the Application of Conway-Maxwell-Poisson Models:
414
Analyzing Traffic Crashes Exhibiting Underdispersion," Working Papers.
415
416
***** (2009), "Extension of the Application of Conway-Maxwell-Poisson Models:
417
Analyzing Traffic Crash Data Exhibiting Under-Dispersion," Risk Analysis, in press.
418
419
Boatwright, P., Borle, S., and Kadane, J. B. (2003), "A Model of the Joint Distribution of
420
Purchase Quantity and Timing," Journal of the American Statistical Association, 98, 564-
421
572.
422
423
Borle, S., Boatwright, P., and Kadane, J. B. (2006), "The Timing of Bid Placement and
424
Extent of Multiple Bidding: An Empirical Investigation Using Ebay Online Auctions,"
425
Statistical Science, 21, 194-205.
426
427
Conway, R. W., and Maxwell, W. L. (1962), "A Queuing Model with State Dependent
428
Service Rates," Journal of Industrial Engineering, 12, 132-136.
429
430
Guikema, S. D., and Coffelt, J. P. (2008), "A Flexible Count Data Regression Model for Risk
431
Analysis.," Risk Analysis, 28, 213-223.
432
22
433
Jowaheer, V. and N.A. Mamode Khan. 2009. "Estimating Regerssion Effects in COM
434
Poisson Generalized Linear Model," World Academy of Science, Engineering and
435
Technology. 53, 213-223.
436
437
Kadane, J. B., Shmueli, G., Minka, T. P., Borle, S., and Boatwright, P. (2006), "Conjugate
438
Analysis of the Conway-Maxwell-Poisson Distribution," Bayesian Analysis, 1, 363-374.
439
440
Lord, D. (2006), "Modeling Motor Vehicle Crashes Using Poisson-Gamma Models:
441
Examining the Effects of Low Sample Mean Values and Small Sample Size on the
442
Estimation of the Fixed Dispersion Parameter," Accident Analysis & Prevention, 38, 751-
443
766.
444
445
Lord, D., Guikema, S. D., and Geedipally, S. (2008), "Application of the Conway-Maxwell-
446
Poisson Generalized Linear Model for Analyzing Motor Vehicle Crashes," Accident Analysis
447
& Prevention, 40, 1123-1134.
448
449
Mamode Khan, N., and Jowaheer, V., (2010), "A comparison of marginal and joint
450
generalized quasi-likelihood estimating equations based on the Com-Poisson GLM:
451
Application to car breakdowns data," International Journal of Mathematical and Statistical
452
Sciences 2:2.
453
454
Nelder, J. A., and Wedderburn, R. W. M. (1972), "Generalized Linear Models," Journal of
455
the Royal Statistical Society, Series A, 135, 370-384.
23
456
457
Press, W. H., Teukolsky, S. A., Vetterling, W. T., Flannery, B. P. (2007). Numerical Recipes:
458
The Art of Scientific Computing, New York: Cambridge University Press.
459
460
Ridout, M. S., and Besbeas, P. (2004), "An Empirical Model for Underdispersed Count
461
Data," Statistical Modelling, 4, 77-89.
462
463
Sellers, K. F., and Shmueli, G. (2010), "A Flexible Regression Model for Count Data,"
464
Annals of Applied Statistics, In Press.
465
466
Shmueli, G., Minka, T. P., Kadane, J. B., Borle, S., and Boatwright, P. (2005), "A Useful
467
Distribution for Fitting Discrete Data: Revival of the Conway-Maxwell-Poisson
468
Distribution.," Applied Statistics, 54, 127-142.
469
470
R Core Development Team (2008), "R: A Language and Environment for Statistical
471
Computing."
472
473
Rodrigues, J., de Castro, M., Cancho, V. G., Balakrishnan, N. (2009), "COM–Poisson cure
474
rate survival models and an application to a cutaneous melanoma data," Journal of Statistical
475
Planning and Inference, 139, 3605-3611.
476
477
Telang, R., Boatwright, P., and Mukhopadhyay, T. (2004), "A Mixture Model for Internet
478
Search-Engine Visits," Journal of Marketing, 41, 206-214.
24
479
480
Wood, S. N. (2006), Generalized Additive Models: An Introduction with R, eds. B. P. Carlin,
481
C. Chatfield, M. A. Tanner and J. Zidek, New York: Chapman and Hall/CRC.
482
483
25
483
TABLES AND FIGURES

Table 1: Assigned parameters for the case study. The scenarios are labeled S1 through S9.

             Over-dispersed data             Under-dispersed data            Equi-dispersed data
             High      Moderate  Low         High      Moderate  Low         High      Moderate  Low
             mean      mean      mean        mean      mean      mean        mean      mean      mean
Parameter    S1        S2        S3          S4        S5        S6          S7        S8        S9
β0           3.0       1.3       -2.0        3.0       1.7       0.2         3.0       1.7       0.2
β1           0.15      0.15      0.15        0.15      0.15      0.15        0.15      0.15      0.15
β2           -0.15     -0.15     -0.15       -0.15     -0.15     -0.15       -0.15     -0.15     -0.15
ν0           -0.4      -1.3      -1.3        1.0       1.0       1.2         0         0         0

Table 2: Fraction of cases where "true" parameter values do not fall in confidence interval.

Dispersion        Scenario         β0      β1      β2      ν
Overdispersed     Hi Mean (S1)     0.08    0.03    0.02    0.45
                  Mod Mean (S2)    0.03    0.04    0.03    0.03
                  Lo Mean (S3)     0.22    0.01    0.04    0.08
Underdispersed    Hi Mean (S4)     0.05    0.04    0.04    0.07
                  Mod Mean (S5)    0.04    0.01    0.05    0.09
                  Lo Mean (S6)     0.04    0.04    0.05    0.05
Equidispersed     Hi Mean (S7)     0.06    0.06    0.01    0.35
                  Mod Mean (S8)    0.05    0.03    0.04    0.10
                  Lo Mean (S9)     0.04    0.04    0.03    0.06
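
Each entry in Table 2 is the fraction of simulated datasets whose confidence interval for a parameter excludes the assigned value. A minimal sketch of that calculation follows, assuming the interval bounds from each fitted model are already available. The function name and the example bounds are hypothetical and are not results from this study.

```python
def noncoverage_fraction(lower, upper, true_value):
    """Fraction of intervals [lower_i, upper_i] that exclude true_value."""
    if len(lower) != len(upper) or not lower:
        raise ValueError("lower and upper must be non-empty and equal in length")
    misses = sum(
        1 for lo, hi in zip(lower, upper) if not (lo <= true_value <= hi)
    )
    return misses / len(lower)


# Hypothetical interval bounds for four simulated datasets (not study results).
lower = [0.10, 0.12, 0.17, 0.08]
upper = [0.20, 0.19, 0.25, 0.16]
print(noncoverage_fraction(lower, upper, true_value=0.15))
```

For a nominal 95% interval, a fraction near 0.05 would be expected, so markedly larger entries point to poor interval coverage for that parameter and scenario.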

Table 3: Combinations of λ and ν values requiring greater than 170 terms for convergence of Z(λ,ν) (Equation 2).

µ          ν
≥ 190      0.05
≥ 110      0.10
≥ 102      0.15
≥ 98       0.20
≥ 105      0.25
≥ 110      0.30
≥ 111      0.35
≥ 112      0.40
≥ 113      0.45
≥ 117      0.50
≥ 118      0.55
≥ 120      0.60
≥ 121      0.65
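
The term counts behind Table 3 can be explored numerically. The sketch below counts how many leading terms of the normalizing series are needed before the next term becomes negligible. It assumes the series has the form Z = Σ (µ^n / n!)^ν, which may differ in notation from Equation 2, and the relative tolerance and stopping rule are hypothetical, so the counts it returns need not match Table 3. Summing in log space avoids evaluating n! directly, which overflows double precision beyond 170!.

```python
import math


def terms_for_convergence(mu, nu, rel_tol=1e-10, max_terms=100_000):
    """Count series terms until a term is below rel_tol times the partial sum.

    Assumed series: Z = sum over n >= 0 of (mu**n / n!)**nu, summed in log space.
    """
    log_mu = math.log(mu)
    log_sum = -math.inf
    for n in range(max_terms):
        log_term = nu * (n * log_mu - math.lgamma(n + 1))
        # Stable update of log(exp(log_sum) + exp(log_term)).
        hi, lo = max(log_sum, log_term), min(log_sum, log_term)
        log_sum = hi + math.log1p(math.exp(lo - hi))
        # Terms decrease once n + 1 > mu, so only test for convergence past that.
        if n + 1 > mu and log_term - log_sum < math.log(rel_tol):
            return n + 1
    raise RuntimeError("series did not converge within max_terms")


if __name__ == "__main__":
    # Hypothetical (mu, nu) pairs chosen only to show the trend with nu.
    for mu, nu in [(50.0, 1.0), (50.0, 0.5), (50.0, 0.1)]:
        print(f"mu={mu}, nu={nu}: {terms_for_convergence(mu, nu)} terms")
```

For a fixed µ, smaller values of ν flatten the decay of the terms, so more terms are needed, which is the qualitative pattern Table 3 reflects for strongly over-dispersed data.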
Figure 1: Histograms for parameter estimates in overdispersed scenarios. S1 (high mean) first row, S2 (moderate mean) second row, S3 (low mean) third row. "True" parameter values indicated by vertical red lines.
Figure 2: Histograms for parameter estimates in underdispersed scenarios. S4 (high mean) first row, S5 (moderate mean) second row, S6 (low mean) third row. "True" parameter values indicated by vertical red lines.
Figure 3: Histograms for parameter estimates in equidispersed scenarios. S7 (high mean) first row, S8 (moderate mean) second row, S9 (low mean) third row. "True" parameter values indicated by vertical red lines.
Figure 4: Prediction bias for parameter estimates under each scenario. β0 blue diamonds, β1 red squares, β2 yellow triangles, ν indicated by “x” on green line.
Figure 5: Prediction accuracy histograms for overdispersed (top row, S1-S3),
underdispersed (middle row, S4-S6), and equidispersed (bottom row, S7-S9) datasets.
Figure 6: Prediction accuracy scatterplots for overdispersed (top row, S1-S3),
underdispersed (middle row, S4-S6), and equidispersed (bottom row, S7-S9) datasets.
Figure 7: Mean versus asymptotic approximation for each scenario.
