Count Models 1

Sociology 8811 Lecture 12
Copyright © 2007 by Evan Schofer
Do not copy or distribute without permission
Count Variables
• Many dependent variables are counts: nonnegative integers
• # crimes a person has committed in lifetime
• # children living in a household
• # new companies founded in a year (in an industry)
• # social protests per month in a city
– Can you think of others?
Count Variables
• Count variables can be modeled with OLS
regression… but:
– 1. Linear models can yield negative predicted
values… whereas counts are never negative
• Similar to the problem of the Linear Probability Model
– 2. Count variables are often highly skewed
• Ex: # crimes committed this year… most people are
zero or very low; a few people are very high
• Extreme skew violates the normality assumption of
OLS regression.
Count Models
• Two most common count models:
• Poisson Regression Model
• Negative Binomial Regression Model
• Both based on the Poisson distribution:
• μ = expected count (and variance)
– Called lambda (λ) in some texts; I rely on Freese & Long 2006
• y = observed count
$$P(y \mid \mu) = \frac{e^{-\mu}\,\mu^{y}}{y!}$$
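The Poisson formula can be checked numerically. The sketch below is only an illustration: the function name `poisson_pmf` and the value μ = 2.5 are assumptions, not taken from the lecture. It confirms that the probabilities sum to one and that the mean and variance both equal μ.

```python
import math

def poisson_pmf(y, mu):
    """Probability of observing count y when the expected count is mu."""
    return math.exp(-mu) * mu ** y / math.factorial(y)

mu = 2.5  # illustrative expected count (assumed, not from the lecture)

# Counts above 59 have negligible probability for this mu, so we stop there.
probs = [poisson_pmf(y, mu) for y in range(60)]

total = sum(probs)
mean = sum(y * p for y, p in enumerate(probs))
variance = sum((y - mean) ** 2 * p for y, p in enumerate(probs))

print(f"P(y=0) = {probs[0]:.4f}")
print(f"sum = {total:.6f}, mean = {mean:.4f}, variance = {variance:.4f}")
```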
Poisson Regression
• Strategy: Model the log of μ (the expected count) as a function of Xs
• Quite similar to modeling log odds in logit
• Again, the log form avoids negative values
$$\ln \mu = \sum_{j=1}^{K} \beta_j X_{ji}$$
• Which can be written as:
$$\mu = e^{\sum_{j=1}^{K} \beta_j X_{ji}}$$
Poisson Regression: Example
• Hours per week spent on web
[Figure: histogram of www hours per week (x-axis: 0 to 50 hours; y-axis: density, 0 to .2)]
Poisson Regression: Web Use
• Output = similar to logistic regression
. poisson wwwhr male age educ lowincome babies

Poisson regression                                Number of obs   =       1552
                                                  LR chi2(5)      =     525.66
                                                  Prob > chi2     =     0.0000
Log likelihood = -8598.488                        Pseudo R2       =     0.0297

------------------------------------------------------------------------------
       wwwhr |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   .3595968   .0210578    17.08   0.000     .3183242    .4008694
         age |  -.0097401   .0007891   -12.34   0.000    -.0112867   -.0081934
        educ |   .0205217    .004046     5.07   0.000     .0125917    .0284516
   lowincome |  -.1168778   .0236503    -4.94   0.000    -.1632316   -.0705241
      babies |  -.1436266   .0224814    -6.39   0.000    -.1876892   -.0995639
       _cons |   1.806489   .0641575    28.16   0.000     1.680743    1.932236
------------------------------------------------------------------------------
Men spend more time on
the web than women
Number of young children in
household reduces web use
Poisson Regression: Stata Output
• Stata output yields familiar statistics:
– Standard errors, z/t- values, and p-values for
coefficient hypothesis tests
– Pseudo R-square for model fit
• Not a great measure… but gives a crude explained
variance
– MLE log likelihood
– Likelihood ratio test: Chi-square and p-value
• Comparing to null model (constant only)
• Tests can also be conducted on nested models with the
Stata command “lrtest”.
Interpreting Coefficients
• In Poisson Regression, Y is typically
conceptualized as a rate…
• Positive coefficients indicate higher rate; negative =
lower rate
• Like logit, Poisson models are non-linear
• Coefficients don’t have a simple linear interpretation
• Like logit, model has a log form;
exponentiation aids interpretation
• Exponentiated coefficients are multiplicative
• Analogous to odds ratios… but called “incidence rate
ratios”.
Interpreting Coefficients
• Exponentiated coefficients: indicate effect of
unit change of X on rate
• In Stata: “incidence rate ratios”: “poisson … , irr”
• e^b = 2.0 indicates that the rate doubles for each unit
change in X
• e^b = .5 indicates that the rate drops by half for each unit
change in X
• Recall: Exponentiated coefs are multiplicative
• If e^b = 5.0, a 2-point change in X isn’t 10; it is 5 * 5 = 25
– Also: you must invert to see opposite effects
• If e^b = 5.0, a 1-point decrease in X isn’t -5, it is 1/5 = .2
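The multiplicative logic can be verified directly. This sketch takes the rate ratio of 5.0 from the example above; the variable names are illustrative.

```python
import math

irr = 5.0            # exponentiated coefficient from the example
b = math.log(irr)    # the underlying (unexponentiated) coefficient

# A 2-unit increase in X adds 2*b on the log scale, so the rate is multiplied twice.
two_unit_increase = math.exp(2 * b)

# A 1-unit decrease subtracts b on the log scale, which inverts the multiplier.
one_unit_decrease = math.exp(-1 * b)

print(f"2-unit increase multiplies the rate by {two_unit_increase:.1f}")
print(f"1-unit decrease multiplies the rate by {one_unit_decrease:.1f}")
```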
Interpreting Coefficients
• Again, exponentiated coefficients (rate ratios)
can be converted to % change
• Formula: (eb - 1) * 100%
• Ex: if e^b = .5, then (.5 - 1) * 100% = -50%, a 50% decrease in rate.
Interpreting Coefficients
• Exponentiated coefficients yield multiplier:
. poisson wwwhr male age educ lowincome babies

Poisson regression                                Number of obs   =       1552
                                                  LR chi2(5)      =     525.66
                                                  Prob > chi2     =     0.0000
Log likelihood = -8598.488                        Pseudo R2       =     0.0297

------------------------------------------------------------------------------
       wwwhr |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   .3595968   .0210578    17.08   0.000     .3183242    .4008694
         age |  -.0097401   .0007891   -12.34   0.000    -.0112867   -.0081934
        educ |   .0205217    .004046     5.07   0.000     .0125917    .0284516
   lowincome |  -.1168778   .0236503    -4.94   0.000    -.1632316   -.0705241
      babies |  -.1436266   .0224814    -6.39   0.000    -.1876892   -.0995639
       _cons |   1.806489   .0641575    28.16   0.000     1.680743    1.932236
------------------------------------------------------------------------------
Exponentiation of .359 = 1.43; rate is 1.43 times higher for men
(1.43-1) * 100 = 43% more
Exp(-.14) = .87. Each baby reduces rate by factor of .87
(.87-1) * 100 = 13% less
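The same arithmetic can be applied to every coefficient at once. The sketch below copies the coefficients from the Poisson output for web use; the helper names `irr` and `pct_change` are illustrative, not Stata commands.

```python
import math

# Coefficients copied from the Poisson output for wwwhr.
coefs = {
    "male": 0.3595968,
    "age": -0.0097401,
    "educ": 0.0205217,
    "lowincome": -0.1168778,
    "babies": -0.1436266,
}

def irr(b):
    """Incidence rate ratio: the multiplier on the rate per unit change in X."""
    return math.exp(b)

def pct_change(b):
    """Percent change in the rate per unit change in X."""
    return (math.exp(b) - 1) * 100

for name, b in coefs.items():
    print(f"{name:>10}: IRR = {irr(b):.2f}, change = {pct_change(b):+.1f}%")
```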
Predicted Counts
• Stata “predict varname, n” computes predicted
value for each case
. predict predwww if e(sample), n
. list wwwhr predwww if e(sample)
     +------------------+
     | wwwhr    predwww |
     |------------------|
  1. |     1   5.659943 |
  2. |     3   7.090338 |
  3. |     2   5.281404 |
 12. |     5    6.09473 |
 13. |     4   6.968055 |
 15. |     3   5.815624 |
 16. |     0   5.539187 |
 19. |     0   7.207257 |
 20. |     8    8.03906 |
 21. |     5   4.400002 |
 23. |     1    6.77004 |
 24. |     1   4.806245 |
 25. |     8   5.710855 |
 27. |    12   3.687142 |
 33. |    40   4.997193 |
Some of the predictions are close
to the observed values…
Many of the predictions are quite bad…
Recall that the model fit was VERY poor!
Predicted Probabilities
• Stata extension “prcount” can compute
probabilities for each possible count outcome
• For all cases, or for particular groups
• It plugs count values (m), Xs, & βs into the formula:
$$P(y = m \mid X) = \frac{e^{-\hat{\mu}}\,\hat{\mu}^{m}}{m!}, \qquad \hat{\mu} = e^{X\hat{\beta}}$$

Rate:         5.7446   [ 5.6238,  5.8655]
Pr(y=0|x):    0.0032   [ 0.0028,  0.0036]
Pr(y=1|x):    0.0184   [ 0.0165,  0.0202]
Pr(y=2|x):    0.0528   [ 0.0486,  0.0570]
Pr(y=3|x):    0.1011   [ 0.0953,  0.1069]
Pr(y=4|x):    0.1452   [ 0.1399,  0.1505]
Pr(y=5|x):    0.1668   [ 0.1642,  0.1694]
Pr(y=6|x):    0.1597   [ 0.1589,  0.1606]
Pr(y=7|x):    0.1311   [ 0.1276,  0.1345]
Pr(y=8|x):    0.0941   [ 0.0897,  0.0986]
Pr(y=9|x):    0.0601   [ 0.0560,  0.0642]

x=       male        age       educ   lowincome      babies
     .4503866  40.992912  14.345361    .7371134   .20296392
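The rate and probabilities in this output can be reproduced by hand. The sketch below copies the coefficients from the Poisson output and the values of X (the sample means) from the output above; the function and variable names are illustrative. It reproduces only the point estimates, not the bracketed intervals.

```python
import math

# Coefficients from the Poisson output for wwwhr.
cons = 1.806489
coefs = {"male": 0.3595968, "age": -0.0097401, "educ": 0.0205217,
         "lowincome": -0.1168778, "babies": -0.1436266}

# Values of X at which the prediction is made (the sample means shown above).
x = {"male": 0.4503866, "age": 40.992912, "educ": 14.345361,
     "lowincome": 0.7371134, "babies": 0.20296392}

# Expected count: exponentiate the linear predictor.
rate = math.exp(cons + sum(coefs[k] * x[k] for k in coefs))

def poisson_pmf(m, mu):
    """Probability of observing count m when the expected count is mu."""
    return math.exp(-mu) * mu ** m / math.factorial(m)

print(f"Rate: {rate:.4f}")
for m in range(10):
    print(f"Pr(y={m}|x): {poisson_pmf(m, rate):.4f}")
```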
Issue: Exposure
• Poisson outcome variables are typically
conceptualized as rates
• Web hours per week
• Number of crimes committed in past year
• Issue: Cases may vary in exposure to “risk”
of a given outcome
• To properly model rates, we must account for the fact
that some cases have greater exposure than others
• Ex: # crimes committed in lifetime
– Older people have greater opportunity to have higher counts
• Alternately, exposure may vary due to research design
– Ex: Some cases followed for longer time than others…
Issue: Exposure
• Poisson (and other count models) can
address varying exposure:
$$\mu_i t_i = e^{\sum_{j=1}^{K} \beta_j X_{ji} + \ln(t_i)}$$
• Where t_i = exposure time for case i
• It is easy to incorporate into Stata, too:
• Ex: poisson NumCrimes SES income, exposure(age)
• Note: Also works with other “count” models.
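A minimal sketch of what the exposure term does: ln(t) enters the exponent with its coefficient fixed at 1, so the expected count is just the rate multiplied by exposure time. The linear predictor value and the exposure times below are hypothetical.

```python
import math

def expected_count(xb, exposure):
    """Expected count when ln(exposure) is added to the linear predictor xb."""
    return math.exp(xb + math.log(exposure))

xb = -2.0              # hypothetical linear predictor (sum of coefficients times Xs)
rate = math.exp(xb)    # expected count per unit of exposure

for t in (20, 40, 60):  # hypothetical exposure times, e.g. age in years
    print(f"exposure={t}: expected count = {expected_count(xb, t):.3f} "
          f"(rate * t = {rate * t:.3f})")
```

Doubling the exposure doubles the expected count while leaving the rate unchanged.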
Poisson Model Assumptions
• Poisson regression makes a big assumption:
that the variance equals the mean μ (“equidispersion”)
• In other words, the mean and variance are the same
• This assumption is often not met in real data
• The variance is often greater than μ: overdispersion
– Consequence of overdispersion: Standard errors
will be underestimated
• Potential for overconfidence in results; rejecting H0
when you shouldn’t!
• Note: overdispersion doesn’t necessarily affect
predicted counts (compared to alternative models).
Poisson Model Assumptions
• Overdispersion is most often caused by highly
skewed dependent variables
– Often due to variables with high numbers of zeros
• Ex: Number of traffic tickets per year
• Most people have zero, some can have 50!
• Mean of variable is low, but SD is high
– Other examples of skewed outcomes
• # of scholarly publications
• # cigarettes smoked per day
• # riots per year (for sample of cities in US).
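A rough first check for overdispersion is to compare the mean and variance of the outcome. The sketch below uses only the 15 observed wwwhr values shown in the predicted-count listing, which is an excerpt and not the full sample of 1552. It also ignores the Xs, whereas the Poisson assumption is about the variance given the Xs, so treat it as a crude screen.

```python
import statistics

# The 15 observed wwwhr values from the predicted-count listing (an excerpt only).
wwwhr = [1, 3, 2, 5, 4, 3, 0, 0, 8, 5, 1, 1, 8, 12, 40]

mean = statistics.mean(wwwhr)
variance = statistics.variance(wwwhr)  # sample variance (n - 1 denominator)
ratio = variance / mean

print(f"mean = {mean:.2f}, variance = {variance:.2f}, ratio = {ratio:.1f}")
```

A ratio far above 1, driven here largely by the single case reporting 40 hours, points toward overdispersion.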
Negative Binomial Regression
• Strategy: Modify the Poisson model to
address overdispersion
• Add an “error” term to the basic model:
$$\mu = e^{\sum_{j=1}^{K} \beta_j X_{ji} + \varepsilon_i}$$
• Additional model assumptions:
• Expected value of exponentiated error = 1: E(e^ε) = 1
• Exponentiated error is Gamma distributed
• We hope that these assumptions are more plausible
than the equidispersion assumption!
Negative Binomial Regression
• Full negative binomial model:
$$P(y \mid X) = \frac{\Gamma(y + \alpha^{-1})}{y!\,\Gamma(\alpha^{-1})} \left(\frac{\alpha^{-1}}{\alpha^{-1} + \mu}\right)^{\alpha^{-1}} \left(\frac{\mu}{\alpha^{-1} + \mu}\right)^{y}$$
• Note that the model incorporates a new
parameter: α
• Alpha represents the extent of overdispersion
• If α = 0 the model reduces to simple Poisson regression
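The negative binomial formula is easier to trust once it is evaluated numerically. The sketch below assumes the standard parameterization in which the variance is μ + αμ²; the values μ = 5.0 and α = 1.35 are illustrative, and the function names are not from any package. It works on the log scale with `math.lgamma` to avoid overflow in the Gamma function.

```python
import math

def poisson_pmf(y, mu):
    """Poisson probability, computed on the log scale."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))

def negbin_pmf(y, mu, alpha):
    """Negative binomial probability with mean mu and dispersion alpha > 0."""
    r = 1.0 / alpha  # alpha^{-1} in the formula
    log_p = (math.lgamma(y + r) - math.lgamma(y + 1) - math.lgamma(r)
             + r * math.log(r / (r + mu))
             + y * math.log(mu / (r + mu)))
    return math.exp(log_p)

mu, alpha = 5.0, 1.35  # illustrative values (assumed)

# Sum far enough into the tail that the omitted probability is negligible.
probs = [negbin_pmf(y, mu, alpha) for y in range(3000)]
mean = sum(y * p for y, p in enumerate(probs))
variance = sum((y - mean) ** 2 * p for y, p in enumerate(probs))

print(f"mean = {mean:.4f}, variance = {variance:.4f}")
print(f"mu + alpha * mu^2 = {mu + alpha * mu ** 2:.4f}")

# As alpha approaches 0 the probabilities approach the Poisson ones.
print(f"NB with alpha=1e-6: {negbin_pmf(3, mu, 1e-6):.6f}, "
      f"Poisson: {poisson_pmf(3, mu):.6f}")
```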
Negative Binomial Regression
• Question: Is alpha (α) = 0?
• If so, we can use Poisson regression
• If not, overdispersion is present; Poisson is inadequate
• Strategy: conduct a statistical test of the
hypothesis: H0: α = 0; H1: α > 0
• Stata provides this information when you run a negative
binomial model:
• Likelihood ratio test (G²) for alpha
• P-value < .05 indicates that overdispersion is present;
negative binomial is preferred
• If P>.05, just use Poisson regression
– So you don’t have to make assumptions about gamma dist….
Negative Binomial Regression
• Interpreting coefficients: Identical to poisson
regression
• Predicted probabilities: Can be done. You
must use the big negative binomial formula
• Plugging in observed Xs, estimates of α, βs…
$$\hat{P}(y \mid X) = \frac{\Gamma(y + \alpha^{-1})}{y!\,\Gamma(\alpha^{-1})} \left(\frac{\alpha^{-1}}{\alpha^{-1} + \hat{\mu}}\right)^{\alpha^{-1}} \left(\frac{\hat{\mu}}{\alpha^{-1} + \hat{\mu}}\right)^{y}$$
• Probably best to get Stata to do this one…
• Long & Freese created command: prvalue
Negative Binomial Example: Web Use
• Note: βs are similar but SEs change a lot!

Negative binomial regression                      Number of obs   =       1552
                                                  LR chi2(5)      =      57.80
                                                  Prob > chi2     =     0.0000
Log likelihood = -4368.6846                       Pseudo R2       =     0.0066

------------------------------------------------------------------------------
       wwwhr |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   .3617049   .0634391     5.70   0.000     .2373666    .4860433
         age |  -.0109788   .0024167    -4.54   0.000    -.0157155    -.006242
        educ |   .0171875   .0120853     1.42   0.155    -.0064992    .0408742
   lowincome |  -.0916297   .0724074    -1.27   0.206    -.2335457    .0502862
      babies |  -.1238295   .0624742    -1.98   0.047    -.2462767   -.0013824
       _cons |   1.881168   .1966654     9.57   0.000     1.495711    2.266625
-------------+----------------------------------------------------------------
    /lnalpha |   .2979718   .0408267                       .217953    .3779907
-------------+----------------------------------------------------------------
       alpha |   1.347124   .0549986                      1.243529    1.459349
------------------------------------------------------------------------------
Likelihood-ratio test of alpha=0:  chibar2(01) = 8459.61 Prob>=chibar2 = 0.000
Note: Standard Error for education increased from .004
to .012! Effect is no longer statistically significant.
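The change in significance follows directly from the larger standard error. The sketch below recomputes z and the two-sided p-value for education from the coefficients and standard errors in the two outputs; `z_and_p` is an illustrative helper, and the p-value uses the normal distribution via `math.erfc`.

```python
import math

def z_and_p(coef, se):
    """z statistic and two-sided p-value under a standard normal distribution."""
    z = coef / se
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# Education coefficient and standard error from each output.
z_pois, p_pois = z_and_p(0.0205217, 0.004046)    # Poisson
z_nb, p_nb = z_and_p(0.0171875, 0.0120853)       # negative binomial

print(f"Poisson:           z = {z_pois:.2f}, p = {p_pois:.3f}")
print(f"Negative binomial: z = {z_nb:.2f}, p = {p_nb:.3f}")
print(f"SE ratio (NB / Poisson) = {0.0120853 / 0.004046:.2f}")
```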
Negative Binomial Example: Web Use
• Note: Info on overdispersion is provided
Negative binomial regression                      Number of obs   =       1552
                                                  LR chi2(5)      =      57.80
                                                  Prob > chi2     =     0.0000
Log likelihood = -4368.6846                       Pseudo R2       =     0.0066

------------------------------------------------------------------------------
       wwwhr |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   .3617049   .0634391     5.70   0.000     .2373666    .4860433
         age |  -.0109788   .0024167    -4.54   0.000    -.0157155    -.006242
        educ |   .0171875   .0120853     1.42   0.155    -.0064992    .0408742
   lowincome |  -.0916297   .0724074    -1.27   0.206    -.2335457    .0502862
      babies |  -.1238295   .0624742    -1.98   0.047    -.2462767   -.0013824
       _cons |   1.881168   .1966654     9.57   0.000     1.495711    2.266625
-------------+----------------------------------------------------------------
    /lnalpha |   .2979718   .0408267                       .217953    .3779907
-------------+----------------------------------------------------------------
       alpha |   1.347124   .0549986                      1.243529    1.459349
------------------------------------------------------------------------------
Likelihood-ratio test of alpha=0:  chibar2(01) = 8459.61 Prob>=chibar2 = 0.000
Alpha is clearly > 0! Overdispersion is evident; LR test p<.05
You should not use Poisson Regression in this case
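The likelihood-ratio statistic can be reproduced from the log likelihoods of the two models (-8598.488 for Poisson, -4368.6846 for negative binomial). In the sketch below, the p-value halves the upper tail of a chi-square with 1 degree of freedom, on the understanding that "chibar2(01)" denotes a 50:50 mixture of chi-square distributions with 0 and 1 degrees of freedom because α = 0 lies on the boundary of the parameter space. Variable names are illustrative.

```python
import math

ll_poisson = -8598.488    # log likelihood from the Poisson output
ll_negbin = -4368.6846    # log likelihood from the negative binomial output

# Likelihood-ratio statistic: twice the gain in log likelihood.
g2 = 2 * (ll_negbin - ll_poisson)

# Upper tail of chi-square(1) is erfc(sqrt(g2 / 2)); halve it for the mixture.
p_value = 0.5 * math.erfc(math.sqrt(g2 / 2))

print(f"G2 = {g2:.2f}")
print(f"p-value = {p_value:.3f}")
```

The statistic matches the 8459.61 in the output, and the p-value is far below .05.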
General Remarks
• Poisson & Negative binomial models suffer all
the same basic issues as “normal” regression
• Model specification / omitted variable bias
• Multicollinearity
• Outliers/influential cases
– Also, both models are estimated by maximum likelihood, so sample size matters
• N > 500 = fine; N < 100 can be worrisome
– Results aren’t necessarily wrong if N<100;
– But it is a possibility; and hard to know when problems crop up
• Plus ~10 cases per independent variable.
General Remarks
• It is often useful to try both Poisson and
Negative Binomial models
• The latter allows you to test for overdispersion
• Use the LR test on alpha (α) to guide model choice
– If you don’t suspect overdispersion and alpha appears
to be zero, use Poisson regression
• It makes fewer assumptions
– Such as gamma-distributed error.
Example: Labor Militancy
[Table not shown: results from Isaac & Christiansen 2002]
Note: Results are presented as % change