Count Models 1
Sociology 8811, Lecture 12
Copyright © 2007 by Evan Schofer
Do not copy or distribute without permission

Count Variables
• Many dependent variables are counts: nonnegative integers
  • # crimes a person has committed in lifetime
  • # children living in a household
  • # new companies founded in a year (in an industry)
  • # social protests per month in a city
  – Can you think of others?

Count Variables
• Count variables can be modeled with OLS regression… but:
  – 1. Linear models can yield negative predicted values… whereas counts are never negative
    • Similar to the problem of the Linear Probability Model
  – 2. Count variables are often highly skewed
    • Ex: # crimes committed this year… most people are zero or very low; a few people are very high
    • Extreme skew violates the normality assumption of OLS regression.

Count Models
• Two most common count models:
  • Poisson Regression Model
  • Negative Binomial Regression Model
• Both based on the Poisson distribution:

$$P(y \mid \mu) = \frac{e^{-\mu}\,\mu^{y}}{y!}$$

  • μ = expected count (and variance)
    – Called lambda (λ) in some texts; I rely on Long & Freese 2006
  • y = observed count

Poisson Regression
• Strategy: Model log of μ as a function of Xs
  • Quite similar to modeling log odds in logit
  • Again, the log form avoids negative values

$$\ln \mu_i = \sum_{j=1}^{K} \beta_j X_{ji}$$

• Which can be written as:

$$\mu_i = e^{\sum_{j=1}^{K} \beta_j X_{ji}}$$

Poisson Regression: Example
• Hours per week spent on web
[Figure: histogram of www hours per week (x-axis, 0 to 50) against density (y-axis, 0 to .2)]

Poisson Regression: Web Use
• Output = similar to logistic regression

    . poisson wwwhr male age educ lowincome babies

    Poisson regression                                Number of obs   =       1552
                                                      LR chi2(5)      =     525.66
                                                      Prob > chi2     =     0.0000
    Log likelihood = -8598.488                        Pseudo R2       =     0.0297

    ------------------------------------------------------------------------------
           wwwhr |      Coef.  Std. Err.        z   P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            male |   .3595968   .0210578    17.08   0.000     .3183242    .4008694
             age |  -.0097401   .0007891   -12.34   0.000    -.0112867   -.0081934
            educ |   .0205217    .004046     5.07   0.000     .0125917    .0284516
       lowincome |  -.1168778   .0236503    -4.94   0.000    -.1632316   -.0705241
          babies |  -.1436266   .0224814    -6.39   0.000    -.1876892   -.0995639
           _cons |   1.806489   .0641575    28.16   0.000     1.680743    1.932236
    ------------------------------------------------------------------------------

• Men spend more time on the web than women
• Number of young children in household reduces web use

Poisson Regression: Stata Output
• Stata output yields familiar statistics:
  – Standard errors, z- or t-values, and p-values for coefficient hypothesis tests
  – Pseudo R-square for model fit
    • Not a great measure… but gives a crude sense of explained variance
  – MLE log likelihood
  – Likelihood ratio test: Chi-square and p-value
    • Comparing to null model (constant only)
    • Tests can also be conducted on nested models with the Stata command “lrtest”.

Interpreting Coefficients
• In Poisson Regression, Y is typically conceptualized as a rate…
  • Positive coefficients indicate higher rate; negative = lower rate
• Like logit, Poisson models are non-linear
  • Coefficients don’t have a simple linear interpretation
• Like logit, model has a log form; exponentiation aids interpretation
  • Exponentiated coefficients are multiplicative
  • Analogous to odds ratios… but called “incidence rate ratios”.
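The two building blocks introduced so far, the Poisson formula and the log link, can be checked numerically. The following is a minimal Python sketch and not part of the original slides: the function names, the value mu = 3.0, and the two coefficients in the last line are hypothetical and chosen only for illustration. It sums the Poisson probabilities to confirm that the mean and the variance both equal μ, and shows that exponentiating a negative linear predictor still gives a positive expected count.

```python
import math


def poisson_pmf(y: int, mu: float) -> float:
    """P(y | mu) = exp(-mu) * mu**y / y!, computed on the log scale."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))


def poisson_moments(mu: float, max_y: int = 200):
    """Approximate the mean, variance, and total probability over 0..max_y."""
    probs = [poisson_pmf(y, mu) for y in range(max_y + 1)]
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return mean, var, sum(probs)


def expected_count(betas, xs) -> float:
    """Log link: mu = exp(sum_j beta_j * x_j), positive for any inputs."""
    return math.exp(sum(b * x for b, x in zip(betas, xs)))


# Hypothetical mu: the mean and the variance should both come out equal to it.
mean, var, total = poisson_moments(mu=3.0)
print(round(mean, 6), round(var, 6), round(total, 6))

# Hypothetical coefficients: the linear predictor is -1.5, yet mu is positive.
print(expected_count([-2.0, 0.5], [1.0, 1.0]))
```

The log scale (via math.lgamma) is used only to avoid overflow in y! for large counts; the result is the same formula as on the slide.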
Interpreting Coefficients
• Exponentiated coefficients: indicate effect of unit change of X on rate
  • In Stata: “incidence rate ratios”: “poisson … , irr”
  • e^β = 2.0 indicates that the rate doubles for each unit change in X
  • e^β = .5 indicates that the rate drops by half for each unit change in X
• Recall: Exponentiated coefs are multiplicative
  • If e^β = 5.0, a 2-point increase in X doesn’t multiply the rate by 10; it multiplies it by 5 * 5 = 25
  – Also: you must invert to see opposite effects
  • If e^β = 5.0, a 1-point decrease in X isn’t -5; it multiplies the rate by 1/5 = .2

Interpreting Coefficients
• Again, exponentiated coefficients (rate ratios) can be converted to % change
  • Formula: (e^β - 1) * 100%
  • Ex: if e^β = .5, then (.5 - 1) * 100% = -50%, a 50% decrease in rate.

Interpreting Coefficients
• Exponentiated coefficients yield multiplier:

    . poisson wwwhr male age educ lowincome babies

    Poisson regression                                Number of obs   =       1552
                                                      LR chi2(5)      =     525.66
                                                      Prob > chi2     =     0.0000
    Log likelihood = -8598.488                        Pseudo R2       =     0.0297

    ------------------------------------------------------------------------------
           wwwhr |      Coef.  Std. Err.        z   P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            male |   .3595968   .0210578    17.08   0.000     .3183242    .4008694
             age |  -.0097401   .0007891   -12.34   0.000    -.0112867   -.0081934
            educ |   .0205217    .004046     5.07   0.000     .0125917    .0284516
       lowincome |  -.1168778   .0236503    -4.94   0.000    -.1632316   -.0705241
          babies |  -.1436266   .0224814    -6.39   0.000    -.1876892   -.0995639
           _cons |   1.806489   .0641575    28.16   0.000     1.680743    1.932236
    ------------------------------------------------------------------------------

• Exponentiation of .359 = 1.43; the rate for men is 1.43 times the rate for women
  • (1.43 - 1) * 100 = 43% more
• Exp(-.14) = .87; each baby multiplies the rate by a factor of .87
  • (.87 - 1) * 100 = 13% less

Predicted Counts
• Stata “predict varname, n” computes predicted value for each case

    . predict predwww if e(sample), n
    . list wwwhr predwww if e(sample)

         +------------------+
         | wwwhr    predwww |
         |------------------|
      1. |     1   5.659943 |
      2. |     3   7.090338 |
      3. |     2   5.281404 |
     12. |     5    6.09473 |
     13. |     4   6.968055 |
     15. |     3   5.815624 |
     16. |     0   5.539187 |
     19. |     0   7.207257 |
     20. |     8    8.03906 |
     21. |     5   4.400002 |
     23. |     1    6.77004 |
     24. |     1   4.806245 |
     25. |     8   5.710855 |
     27. |    12   3.687142 |
     33. |    40   4.997193 |
         +------------------+

• Some of the predictions are close to the observed values…
• Many of the predictions are quite bad…
• Recall that the model fit was VERY poor!

Predicted Probabilities
• Stata extension “prcount” can compute probabilities for each possible count outcome
  • For all cases, or for particular groups
  • It plugs count values (m), Xs, & βs into the formula:

$$P(y = m \mid X) = \frac{e^{-\hat{\mu}}\,\hat{\mu}^{m}}{m!}, \qquad \hat{\mu} = e^{X\hat{\beta}}$$

    Rate:         5.7446   [ 5.6238,  5.8655]
    Pr(y=0|x):    0.0032   [ 0.0028,  0.0036]
    Pr(y=1|x):    0.0184   [ 0.0165,  0.0202]
    Pr(y=2|x):    0.0528   [ 0.0486,  0.0570]
    Pr(y=3|x):    0.1011   [ 0.0953,  0.1069]
    Pr(y=4|x):    0.1452   [ 0.1399,  0.1505]
    Pr(y=5|x):    0.1668   [ 0.1642,  0.1694]
    Pr(y=6|x):    0.1597   [ 0.1589,  0.1606]
    Pr(y=7|x):    0.1311   [ 0.1276,  0.1345]
    Pr(y=8|x):    0.0941   [ 0.0897,  0.0986]
    Pr(y=9|x):    0.0601   [ 0.0560,  0.0642]

             male        age       educ  lowincome     babies
    x=   .4503866  40.992912  14.345361   .7371134  .20296392

Issue: Exposure
• Poisson outcome variables are typically conceptualized as rates
  • Web hours per week
  • Number of crimes committed in past year
• Issue: Cases may vary in exposure to “risk” of a given outcome
  • To properly model rates, we must account for the fact that some cases have greater exposure than others
  • Ex: # crimes committed in lifetime
    – Older people have greater opportunity to have higher counts
  • Alternately, exposure may vary due to research design
    – Ex: Some cases followed for longer time than others…

Issue: Exposure
• Poisson (and other count models) can address varying exposure:

$$\mu_i t_i = e^{\sum_{j=1}^{K} \beta_j X_{ji} + \ln(t_i)}$$

  • Where t_i = exposure time for case i, so μ_i t_i is the expected count for case i
• It is easy to incorporate into Stata, too:
  • Ex: poisson NumCrimes SES income, exposure(age)
  • Note: Also works with other “count” models.
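The hand calculations on the preceding slides (rate ratios, percent change, the predicted rate, predicted probabilities, and exposure) can be reproduced in a few lines. The following is a minimal Python sketch and not part of the original slides. The coefficients and covariate values are copied from the Stata output above; the function names and the exposure values 1.0 and 2.0 are hypothetical. Because the inputs are the rounded numbers printed by Stata, results should agree with the slides only to about the displayed precision.

```python
import math

# Poisson coefficients copied from the web-use output above.
coefs = {
    "male": 0.3595968, "age": -0.0097401, "educ": 0.0205217,
    "lowincome": -0.1168778, "babies": -0.1436266,
}
cons = 1.806489

# Covariate values copied from the "x=" row of the predicted-probability output.
x_row = {
    "male": 0.4503866, "age": 40.992912, "educ": 14.345361,
    "lowincome": 0.7371134, "babies": 0.20296392,
}


def irr(b: float) -> float:
    """Incidence rate ratio: multiplier on the rate per unit change in X."""
    return math.exp(b)


def pct_change(b: float) -> float:
    """Percent change in the rate per unit change in X: (e^b - 1) * 100."""
    return (math.exp(b) - 1.0) * 100.0


def predicted_rate(x: dict) -> float:
    """mu = exp(constant + sum_j b_j * x_j)."""
    return math.exp(cons + sum(coefs[k] * x[k] for k in coefs))


def poisson_prob(m: int, mu: float) -> float:
    """P(y = m | X) = exp(-mu) * mu**m / m!."""
    return math.exp(-mu) * mu ** m / math.factorial(m)


def expected_count(x: dict, exposure: float) -> float:
    """With exposure t, the expected count is t * exp(Xb) = exp(Xb + ln t)."""
    return exposure * predicted_rate(x)


for name, b in coefs.items():
    print(f"{name:10s} IRR = {irr(b):.3f}  change = {pct_change(b):+.1f}%")

mu = predicted_rate(x_row)
print(f"rate = {mu:.4f}")
for m in range(10):
    print(f"Pr(y={m}|x) = {poisson_prob(m, mu):.4f}")

# Hypothetical exposure values: doubling exposure doubles the expected count.
print(expected_count(x_row, 2.0) / expected_count(x_row, 1.0))
```

Entering ln(t) with its coefficient fixed at 1 is what lets exposure scale the expected count proportionally, which is why the final ratio is 2.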
Poisson Model Assumptions
• Poisson regression makes a big assumption: that the variance of y equals μ (“equidispersion”)
  • In other words, the mean and variance are the same
  • This assumption is often not met in real data
  • Dispersion is often greater than μ: overdispersion
    – Consequence of overdispersion: Standard errors will be underestimated
      • Potential for overconfidence in results; rejecting H0 when you shouldn’t!
      • Note: overdispersion doesn’t necessarily affect predicted counts (compared to alternative models).

Poisson Model Assumptions
• Overdispersion is most often caused by highly skewed dependent variables
  – Often due to variables with high numbers of zeros
    • Ex: Number of traffic tickets per year
    • Most people have zero, some can have 50!
    • Mean of variable is low, but SD is high
  – Other examples of skewed outcomes
    • # of scholarly publications
    • # cigarettes smoked per day
    • # riots per year (for sample of cities in US).

Negative Binomial Regression
• Strategy: Modify the Poisson model to address overdispersion
  • Add an “error” term to the basic model:

$$\mu_i = e^{\sum_{j=1}^{K} \beta_j X_{ji} + \varepsilon_i}$$

• Additional model assumptions:
  • Expected value of exponentiated error = 1: E(e^ε) = 1
  • Exponentiated error is Gamma distributed
  • We hope that these assumptions are more plausible than the equidispersion assumption!

Negative Binomial Regression
• Full negative binomial model:

$$P(y \mid X) = \frac{\Gamma(y + \alpha^{-1})}{y!\,\Gamma(\alpha^{-1})} \left(\frac{\alpha^{-1}}{\alpha^{-1} + \mu}\right)^{\alpha^{-1}} \left(\frac{\mu}{\alpha^{-1} + \mu}\right)^{y}$$

• Note that the model incorporates a new parameter: α
  • Alpha represents the extent of overdispersion
  • If α = 0 the model reduces to simple Poisson regression

Negative Binomial Regression
• Question: Is alpha (α) = 0?
  • If so, we can use Poisson regression
  • If not, overdispersion is present; Poisson is inadequate
• Strategy: conduct a statistical test of the hypothesis: H0: α = 0; H1: α > 0
  • Stata provides this information when you run a negative binomial model:
  • Likelihood ratio test (G²) for alpha
  • P-value < .05 indicates that overdispersion is present; negative binomial is preferred
  • If P > .05, just use Poisson regression
    – So you don’t have to make assumptions about gamma dist….

Negative Binomial Regression
• Interpreting coefficients: Identical to Poisson regression
• Predicted probabilities: Can be done. You must use the big Neg Binomial formula
  • Plugging in observed Xs, estimates of α, βs…

$$\hat{P}(y \mid X) = \frac{\Gamma(y + \hat{\alpha}^{-1})}{y!\,\Gamma(\hat{\alpha}^{-1})} \left(\frac{\hat{\alpha}^{-1}}{\hat{\alpha}^{-1} + \hat{\mu}}\right)^{\hat{\alpha}^{-1}} \left(\frac{\hat{\mu}}{\hat{\alpha}^{-1} + \hat{\mu}}\right)^{y}$$

  • Probably best to get Stata to do this one…
  • Long & Freese created command: prvalue

Negative Binomial Example: Web Use
• Note: βs are similar but SEs change a lot!

    Negative binomial regression                      Number of obs   =       1552
                                                      LR chi2(5)      =      57.80
                                                      Prob > chi2     =     0.0000
    Log likelihood = -4368.6846                       Pseudo R2       =     0.0066

    ------------------------------------------------------------------------------
           wwwhr |      Coef.  Std. Err.        z   P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            male |   .3617049   .0634391     5.70   0.000     .2373666    .4860433
             age |  -.0109788   .0024167    -4.54   0.000    -.0157155    -.006242
            educ |   .0171875   .0120853     1.42   0.155    -.0064992    .0408742
       lowincome |  -.0916297   .0724074    -1.27   0.206    -.2335457    .0502862
          babies |  -.1238295   .0624742    -1.98   0.047    -.2462767   -.0013824
           _cons |   1.881168   .1966654     9.57   0.000     1.495711    2.266625
    -------------+----------------------------------------------------------------
        /lnalpha |   .2979718   .0408267                       .217953    .3779907
    -------------+----------------------------------------------------------------
           alpha |   1.347124   .0549986                      1.243529    1.459349
    ------------------------------------------------------------------------------
    Likelihood-ratio test of alpha=0:  chibar2(01) = 8459.61 Prob>=chibar2 = 0.000

• Note: Standard error for education increased from .004 to .012! The effect is no longer statistically significant.

Negative Binomial Example: Web Use
• Note: Info on overdispersion is provided at the bottom of the same output:

    -------------+----------------------------------------------------------------
        /lnalpha |   .2979718   .0408267                       .217953    .3779907
    -------------+----------------------------------------------------------------
           alpha |   1.347124   .0549986                      1.243529    1.459349
    ------------------------------------------------------------------------------
    Likelihood-ratio test of alpha=0:  chibar2(01) = 8459.61 Prob>=chibar2 = 0.000

• Alpha is clearly > 0! Overdispersion is evident; LR test p < .05
• You should not use Poisson Regression in this case

General Remarks
• Poisson & Negative binomial models suffer all the same basic issues as “normal” regression
  • Model specification / omitted variable bias
  • Multicollinearity
  • Outliers/influential cases
  – Also, both models are estimated by Maximum Likelihood
    • N > 500 = fine; N < 100 can be worrisome
      – Results aren’t necessarily wrong if N < 100;
      – But it is a possibility, and it is hard to know when problems crop up
    • Plus ~10 cases per independent variable.

General Remarks
• It is often useful to try both Poisson and Negative Binomial models
  • The latter allows you to test for overdispersion
  • Use the LR test on alpha (α) to guide model choice
    – If you don’t suspect overdispersion and alpha appears to be zero, use Poisson Regression
      • It makes fewer assumptions
        – Such as gamma-distributed error.

Example: Labor Militancy
Isaac & Christiansen 2002
Note: Results are presented as % change
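Returning to the web-use example, the overdispersion test and the negative binomial formula can be checked by hand. The following is a minimal Python sketch and not part of the original slides. The two log likelihoods and /lnalpha are copied from the Stata output above; the function names and the value mu = 5.7 are hypothetical. It assumes two standard results that the slides do not state: the chibar2(01) p-value is half the upper tail of a chi-square with 1 degree of freedom, and the variance implied by this negative binomial form is μ + αμ².

```python
import math


def poisson_pmf(y: int, mu: float) -> float:
    """Poisson probability, computed on the log scale."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))


def negbin_pmf(y: int, mu: float, alpha: float) -> float:
    """Negative binomial probability with mean mu and dispersion alpha > 0."""
    inv = 1.0 / alpha  # alpha^-1 in the formula
    log_p = (
        math.lgamma(y + inv) - math.lgamma(y + 1) - math.lgamma(inv)
        + inv * math.log(inv / (inv + mu))
        + y * math.log(mu / (inv + mu))
    )
    return math.exp(log_p)


def mean_and_variance(pmf, max_y: int = 3000):
    """Approximate the mean and variance by summing over 0..max_y."""
    probs = [pmf(y) for y in range(max_y + 1)]
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return mean, var


# Values copied from the Stata output above.
ll_poisson = -8598.488
ll_negbin = -4368.6846
lnalpha = 0.2979718
alpha = math.exp(lnalpha)

# Likelihood-ratio test of alpha = 0: G2 = 2 * (LL_negbin - LL_poisson).
g2 = 2.0 * (ll_negbin - ll_poisson)
# Assumption: chibar2(01) p-value = half the upper tail of chi-square(1).
p_value = 0.5 * math.erfc(math.sqrt(g2 / 2.0))

mu = 5.7  # hypothetical expected count, chosen only for illustration
nb_mean, nb_var = mean_and_variance(lambda y: negbin_pmf(y, mu, alpha))

print(f"alpha = {alpha:.6f}")
print(f"G2 = {g2:.2f}, p = {p_value:.3g}")
print(f"NB mean = {nb_mean:.4f}, variance = {nb_var:.4f}")
print(f"mu + alpha*mu^2 = {mu + alpha * mu ** 2:.4f}")
# As alpha approaches 0, the negative binomial approaches the Poisson.
print(negbin_pmf(3, 5.0, 1e-6), poisson_pmf(3, 5.0))
```

With α above 1, the implied variance is many times the mean, which is consistent with the much larger standard errors in the negative binomial output.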