Taming the Statistical Shrew
Richard M. Rosenfeld MD, MPH, FAAP
Professor of Otolaryngology, SUNY Downstate Medical Center
Chairman of Otolaryngology, Long Island College Hospital, Brooklyn, NY

Statistics
The science and art of collecting, summarizing, and analyzing data that are subject to random variation
Last JM, A Dictionary of Epidemiology, 4th ed., Oxford Univ. Press 2001

Statistics are used to:
Develop sound judgment about data applicable to clinical care
Read the literature critically, understanding potential errors and fallacies
Apply epidemiologic information to patient care and disease prevention
Reach correct conclusions about diagnostic procedures and laboratory results
Publish and critique journal manuscripts
Create and evaluate research protocols
Dawson B, Trapp RG. Basic & Clinical Biostatistics, 3rd ed. NY: Lange 2001

Mummy Powder Cures Common Cold within 24 Hours for 85.7% of Subjects!!!
Cured 86%, not cured 14%

How Confident Should You Be?
95% Confidence Interval vs. Sample Size for a Success Rate of 86%
Successes/total sample | Success rate | 95% confidence interval
6/7 | 86% | 42 – 100%
12/14 | 86% | 57 – 98%
24/28 | 86% | 67 – 96%
48/56 | 86% | 74 – 94%
96/112 | 86% | 78 – 92%
192/224 | 86% | 80 – 90%

Precision
The quality of being sharply defined or stated. Statistical precision is the inverse of the variance for an estimate.
Last JM, A Dictionary of Epidemiology, 4th ed., Oxford Univ. Press 2001
Defined as: Consistency or repeatability
Threatened by: Random error (variance)
Independent of: Systematic error (bias)

Would you cross this rickety old bridge?

Rule of Threes
The 95% CI Upper Limit Given No Events in n Trials is 3/n
Based on a Poisson Distribution
Trials with no events | Upper limit of 95% CI
10 | 30%
15 | 20%
20 | 15%
25 | 12%
30 | 10%
35 | 9%
40 | 8%
45 | 7%
50 | 6%
The rule of threes can address the following type of question: “I am told by my physician that I need a serious operation and there has not been a fatality in 20 she performed.
What is the potential postoperative mortality based on this information?”
Van Belle G. Statistical Rules of Thumb. NY: Wiley Interscience, 2002.

Should you believe a “zero” result? It’s all a question of confidence.

Great Mysteries in Ear Tube Surgery
12 o’clock, 6 o’clock
I say “put it here”… … but they put it there

Accuracy
The degree to which a measurement or an estimate based on measurement represents the true value of the attribute that is being measured
Last JM, A Dictionary of Epidemiology, 4th ed., Oxford Univ. Press 2001
Defined as: Nearness to the truth
Threatened by: Systematic error (bias)
Independent of: Random error (variance)

New Technique for Tonsillectomy Reduces Pain, Improves Satisfaction

The Magic of Cherry-Picking Your Sample
1. AB, age 7y: normal diet by 2 days
2. RR, age 5y: ate everything next day like nothing happened; everyone brought gifts but she didn’t need them
3. PV, age 4y: normal diet next day
4. AC, age 8y: normal diet next day
5. MC, age 4y: went to dance class and ate pizza next day
6. NC, age 7y: ate everything next day like nothing happened
7. LG, age 4y: ate Pudgies’ Chicken, macaroni & cheese next day
8. KS, age 4y: normal diet next day
9. NM, age 5y: normal diet next day
10. DW, age 8y: no pain after surgery
11. AC, age 4y: ate Chinese food the same day for dinner

Validity
Diagram labels: Study Sample; Findings in the Study; Observation; Truth in the Study; Internal Validity; Inference; Truth in the Accessible Population; Generalization; Truth in the Target Population; Target Population; External Validity

Effect of Early vs. Delayed Tympanostomy Tubes for OME on Child Development
Paradise et al, 2001-2007
RCT of 588 children, identified from monthly screening of 6,350 healthy infants, with cumulative duration of bilateral OME >90d or unilateral OME >135d, of which 429 were randomized to prompt vs.
delayed (6-9m) tube insertion
Most children had unilateral OME (63%) or discontinuous OME (67%)
Bilateral continuous OME was uncommon (18%)
Early-treatment group had delays in tube placement: 31% within 30d, 33% in 31-60d, 15% in 61-180d, 18% never
Impact of early-tube placement was only 10% fewer days with OME over 24 months (30 vs. 40%), which equals 36 days per year
Developmental and academic tests at age 3y, 4y, 5y, 6y, and 9-11y showed no difference in outcomes for the prompt- vs. delayed-tube group
NEJM 2001;344:1179-87, Pediatr Infect Dis J 2003;22:309-14, Pediatrics 2003;112:265-77, NEJM 2005;353:576-86, NEJM 2007;356:248-61

Anatomy of an Estimate
Not All Statistics Are Created Equal
Is it Precise? – Are the results consistent and repeatable?
Is it Accurate? – Does it reflect the true value of the attribute being measured?
Is it Valid? – Can we make inferences based on the estimate?

Mummy Powder Cures 85.7% of Colds within 24 hours
Low Budget Study
Precision: 46 – 99% (N = 7)
Accuracy: Inclusion by stuffy nose; outcome by telephone contact
Validity: Judgmental sample drawn from waiting room of local chiropractor's office

Mummy Powder Cures 85.7% of Colds within 24 hours
High Budget Study
Precision: 81 – 90% (N = 224)
Accuracy: Inclusion by X-ray and RAST; outcome by rhinomanometry
Validity: Two-stage cluster sample drawn from most recent US census report

Controlled Clinical Trial
James Lind, Scottish Surgeon, 1716-1794
Tröhler U (2003). James Lind and scurvy: 1747 to 1795. The James Lind Library (www.jameslindlibrary.org)

Which Case Series Are Worth Publishing?
Rosenfeld RM, Otolaryngol HNS 2007
The best case series report uncommon situations or deal with circumstances where RCTs would be unethical or impractical, AND:
1. Include a consecutive, well-defined sample of subjects that is fully described so readers can judge relevance
2. Report interventions with enough detail for reproduction, including any adjunctive treatments allowed
3.
Account for all patients initially enrolled, and follow them long enough to overcome random disease fluctuations
4. Perform statistical analysis, preferably multivariate
5. Reach justifiable conclusions, devoid of “efficacy” claims
Otolaryngol Head Neck Surg 2007; 136:337-9

First Randomized Trial (Sealed Envelopes)
Medical Research Council, BMJ 1948
First clinical trial (streptomycin for tuberculosis) using random numbers and sealed envelopes, instead of the old practice of alternating cases
BMJ 1948; 2:769-82.
Centralized randomization scheme

Why Bother to Randomize?
What’s wrong with thoughtful, individualized allocation of patients to treatment or no treatment by insightful clinicians?
1. Randomization eliminates allocation bias, which can give false or misleading results when clinicians allocate treatment
2. Randomization provides proper estimates of random error, which are required for valid statistical analysis

Mummy Powder for Adenovirus URI
Symptom relief for 150 patients randomized to cellulose placebo (n=75) vs. mummy powder (n=75) for adenovirus upper respiratory infection
Cellulose placebo 76%, mummy powder 63%
Rate difference = -13%
χ2 = 3.14, P = .076

P Value
P value is the probability that a test statistic would be as extreme as or more extreme than observed if the null hypothesis were true
Last JM, A Dictionary of Epidemiology, 4th ed., Oxford Univ. Press 2001
P < .05: An alternative to the null hypothesis might better explain the observations
P ≥ .05: The null hypothesis satisfactorily explains the observations

P Value Dichotomy
Investigators may arbitrarily set their own significance levels, but in most biomedical and epidemiologic work, a study whose probability value is less than 5% (P < .05) or 1% (P < .01) is considered sufficiently unlikely to have occurred by chance to justify the designation “statistically significant”

Ronald A.
Fisher
British Mathematician and Biologist, 1890-1962
Introduced P values and randomization into agricultural research in 1920s
“If P is between 0.1 and 0.9 there is certainly no reason to suspect the (null) hypothesis tested. If it is below 0.02 it is strongly indicated that the hypothesis fails to account for the whole of the facts. We shall not often be astray if we draw a conventional line at 0.05…”

Red Hot News Flash!!
New Studies Prove Efficacy of Recombinant Mummy Powder for the Common Cold!!
New England Journal of Medicine: Mummy beats placebo, P < .05, N = 3,000 !!!
Journal of Low Budget Research: Mummy beats placebo, P < .05, N = 30 !!!

Statistical vs. Clinical Significance
For a Binary Outcome in 2 Groups
Group size | Smallest detectable absolute group difference
15 | 45%
20 | 40%
30 | 35%
40 | 30%
60 | 25%
90 | 20%
170 | 15%
400 | 10%
1500 | 5%
For example, a group size of 15 (N = 30) would need an absolute difference in outcome between groups of at least 45% to reach statistical significance

Amoxicillin for Acute Otitis Media
Kaleida et al, Pediatrics 1991
Children aged 7m – 12y with 980 episodes of AOM randomized to amoxicillin vs. placebo for 14 days
Group | Success | Failure
Amoxicillin | 96% | 4%
Placebo | 92% | 8%
Absolute increase in success rate: 4% [1 – 7%], P = .015
Relative decrease in failure rate: 50% [14 – 70%], P = .015
Pediatrics 1991; 87: 466-474

Absolute vs. Relative Risk
Your chance of winning > $50K in a lottery is 1:80 million
You are 12 times more likely to get killed in a 1 mile car ride to buy a ticket than to actually win the lottery

Anatomy of an Estimate
Not All Statistics Are Created Equal
Is it Precise? – Are the results consistent and repeatable?
Is it Accurate? – Does it reflect the true value of the attribute being measured?
Is it Valid? – Can we make inferences based on the estimate?

Bias in Treatment Studies
Bias is a systematic deviation from the truth, which may occur in the collection, analysis, interpretation, publication, or review of data
1.
Design bias occurs when the study is planned to include subjects, endpoints, or outcomes that are more likely to support prior expectations
2. Ascertainment bias is caused by studying a subject sample that does not fairly represent the larger population to which results are to be applied
3. Allocation (selection) bias occurs when groups vary in prognosis because of demographics, illness severity, or other baseline characteristics
4. Observer (detection or measurement) bias can distort how outcomes are assessed if the observer is aware of the treatment received
5. Reviewer bias can lead to erroneous conclusions when an author selectively cites published studies that favor a particular viewpoint
Rosenfeld RM. Disclosure. Otolaryngol Head Neck Surg 2008; In press

Theriac for Sale: Universal Antidote for Poisoning – Also cures Aprosexia

Theriac for Aprosexia*
Randomized Double-Blind Clinical Trial
Research hypothesis: Theriac is more effective than placebo in treating aprosexia
Null hypothesis: Theriac is equivalent to or less effective than placebo
* Aprosexia is defined as the inability to concentrate due to ocular, aural, or mental deficits or to mental weakness
Stedman’s Medical Dictionary, 23rd edition

Theriac for Aprosexia
Clinical cure rate for 50 patients randomized to placebo (n=25) vs. theriac (n=25) for intractable aprosexia
Placebo 32%, theriac 76%
Rate difference = 44%
χ2 = 9.74, P = .002

Type I (Alpha) Error
Probability of Occurrence is P value
Decision situation | Type I error
Null hypothesis | Reject even though true
Diagnostic test | False positive
Clinical trial | Promote worthless therapy
Judicial system | Sentence the innocent
Used car selection | Reject dependable car
Brown GW. Errors, types I and II. Am J Dis Child 137:586-91

Ronald Fisher: P value, 1925
Jerzy Neyman: Confidence Interval, 1937

P values vs.
Confidence Intervals for a Rate Difference of 44%
Sample size | P value
50 | P = .002
100 | P < .001
200 | P < .001
500 | P < .001
1000 | P < .001
Rosenfeld RM. Taming the Statistical Shrew. In: Johnson JT (ed), Instructional Courses Volume Six. St. Louis, Mosby Year Book; 1993.

Confidence Interval
The computed interval with a given probability, e.g., 95%, that the true value of a variable (mean, proportion, or rate) is contained within the interval
Last JM, A Dictionary of Epidemiology, 4th ed., Oxford Univ. Press 2001
Defined as: Range of sample means consistent with the data
Threatened by: Small samples (low precision)
Related to, but better than: P values
Gardner MJ. Br Med J 1986;292:746-50

Confidence Intervals
Use At Least 12 Observations in Constructing a Confidence Interval
Gerald van Belle. Statistical Rules of Thumb. NY: Wiley Interscience; 2002
Precision does not vary linearly with sample size, but is related to the square root of the number of observations
Rule of thumb: the width of a CI decreases rapidly until 12 observations are reached then decreases slowly
For example, for a sample size of 15, the half-width of a 95% CI is 0.5 standard deviations

Statistics in Otolaryngology Journals
Wasserman et al, Otolaryngol HNS
Survey of 1,924 clinical research articles from the 4 leading peer-reviewed otolaryngology journals in 1993, 1998, and 2003
[Bar chart: percentage of articles with an internal control group, P-values, and confidence intervals in 1993, 1998, and 2003. Extracted bar labels: 36%, 39%, 45%, 43%, 35%, 26%, 1%, 4%, 8%]
Otolaryngol Head & Neck Surg 2006; 134:717-23

Confidence Intervals
Overlapping Confidence Intervals Do Not Imply Non-Significance
Gerald van Belle. Statistical Rules of Thumb.
NY: Wiley Interscience; 2002
It is sometimes claimed that if two independent statistics have overlapping confidence intervals, then they are not significantly different
This is true for substantial overlap, but the overlap can be surprisingly large and the means still significantly different
Rule of thumb: overlaps of 25% or less still suggest statistical significance

Red Hot News Flash!!
Randomized Controlled Trial Proves Efficacy of Theriac for Intractable Aprosexia!!
Theriac 44% more effective than placebo for aprosexia (95% CI, 19-69%). There is less than a 2 in 1,000 probability that this is a chance finding!

Red Hot News Flash!!
Urgent Alert from the Food and Drug Administration
Complete and irreversible hair loss linked to theriac ingestion for aprosexia! 50% of subjects became bald within one year of therapy.

Univariate Analysis of Risk Factors for Post-Theriac Baldness
Statistically significant factors: Season of therapy (P = .018), Shoe size (P = .002), Favorite color (P = .007)
Statistically non-significant factors: Geographic region, Ethnic group, Climate, Height, Weight, Socioeconomic status, Hair type, Gender, Hair color, Age, Race, Eye color, Family history, Diet, Exercise, Smoking, Alcohol

Red Hot News Flash!!
FDA Issues New Precautions When Using Theriac
Unless you seek baldness, theriac is not recommended for winter-time aprosexia if your shoe size is 9 and your favorite color is aquamarine

You seem in fine health, Mr. Cosgrove, but let’s give you a series of tests. I’m sure we can find something wrong.

Consequences of Performing 20 Statistical Tests on a Single Set of Data
Assuming that:
Each test is performed with an alpha level of .05
And the observed findings are caused solely by random variations
The probability of:
1 or more type I errors is 64%
2 or more type I errors is 26%
3 or more type I errors is 7%
A.K.A. If you torture the data sufficiently, they will eventually confess to something!
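
The multiple-testing probabilities on the preceding slide can be reproduced with a binomial calculation. The sketch below is illustrative only and not part of the original talk: it assumes the 20 tests are independent (the slide does not say so), and the function name is hypothetical. Under that assumption the third figure works out to about 7.5%, which the slide appears to report as 7%.

```python
from math import comb


def prob_at_least_k_type_i_errors(n_tests: int, alpha: float, k: int) -> float:
    """Probability of k or more type I errors among n_tests tests.

    Assumes every null hypothesis is true and the tests are independent
    (an assumption made for this sketch), so the number of false positives
    follows a Binomial(n_tests, alpha) distribution.
    """
    # Sum the binomial probabilities of 0, 1, ..., k-1 false positives,
    # then take the complement.
    p_fewer_than_k = sum(
        comb(n_tests, i) * alpha**i * (1 - alpha) ** (n_tests - i)
        for i in range(k)
    )
    return 1.0 - p_fewer_than_k


if __name__ == "__main__":
    for k in (1, 2, 3):
        p = prob_at_least_k_type_i_errors(n_tests=20, alpha=0.05, k=k)
        print(f"{k} or more type I errors: {p:.1%}")
```

For a single error the formula reduces to 1 - (1 - alpha)^n, so with 20 tests at alpha = .05 it is 1 - 0.95^20, or roughly 64%.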
Statistical Tests for Associating an Outcome with Predictor Variables
Data scale for outcome | Parametric test | Nonparametric test
Nominal or ordinal | Discriminant analysis | Log-linear model
Dichotomous | Discriminant analysis | Multiple-logistic regression
Numerical, 1 predictor | Pearson correlation | Spearman rank correlation
Numerical, 2 predictors | ANOVA, ANCOVA | (none)
Numerical, ≥ 2 predictors | Multiple linear regression | (none)
Censored | Cox regression | (none)
Parametric tests are used when group size ≥ 30 or if < 30 with normal distribution

Sample Size for Multivariate Analysis
Obtain At Least 10 Events For Every Variable Investigated
Assume that 20% of subjects are expected to have the event of interest and there are 5 predictor variables
About 10 events per variable are needed to get stable estimates of the regression coefficients
Therefore, about 10 x 5 or 50 events are needed, making the necessary sample size 250 subjects
Gerald van Belle. Statistical Rules of Thumb. New York: Wiley Interscience; 2002

Statistical Tests for Comparing Three or More Groups
Independent samples
Data scale | Parametric test | Nonparametric test
Dichotomous or nominal | (none) | χ2 test, log-likelihood ratio
Ordinal | (none) | Kruskal-Wallis ANOVA, χ2 for trend
Numerical | One-way ANOVA | Kruskal-Wallis ANOVA
Matched, paired, or repeated samples
Data scale | Parametric test | Nonparametric test
Dichotomous | (none) | Mantel-Haenszel χ2, Cochran’s Q
Ordinal | (none) | Friedman ANOVA
Numerical | Repeated ANOVA | Friedman ANOVA
Parametric tests are used when group size ≥ 30 or if < 30 with normal distribution

We need theriac! But baldness is not an option…

Unicorn Horn for Aprosexia
Randomized Double-Blind Clinical Trial
Research hypothesis: Unicorn horn is within 0.20 as effective as theriac
Null hypothesis: Theriac is more effective than unicorn horn by at least 0.20
Sample size calculation = 300 per group (600 overall) based on:
Alpha = .05 (one-sided)
Beta = .20
Cure rate, theriac = .60
Cure rate, unicorn = .40

Unicorn Horn vs.
Theriac for Aprosexia
Clinical cure rate for 50 patients randomized to theriac (n=25) vs. unicorn horn (n=25) for intractable aprosexia
Theriac 64%, unicorn horn 56%
Rate difference = -8%
χ2 = 0.33, P = .560

Type II (Beta) Error
Power is One Minus Beta
Decision situation | Type II error
Null hypothesis | Accept even though false
Diagnostic test | False negative
Clinical trial | Miss a difference between groups
Judicial system | Free the guilty
Used car selection | Buy a lemon
Brown GW. Errors, types I and II. Am J Dis Child 137:586-91

Effect of Sample Size on Power and Precision for a Rate Difference of 8%
Sample size | P value | Power
50 | P = .56 | 17%
100 | P = .41 | 26%
200 | P = .25 | 34%
300 | P = .16 | 54%
600 | P = .06 | 80%
Circles indicate the point estimate for the absolute rate difference, and vertical bars indicate the 95% confidence intervals. Positive values favor theriac. Only the largest trial has adequate statistical power.
Rosenfeld RM. Taming the Statistical Shrew. In: Johnson JT (ed), Instructional Courses Volume Six. St. Louis, Mosby Year Book; 1993.

Oxford: Oxford University Press, 2001
Philadelphia: American College of Physicians, 1997

Looking Beyond P values
Statistical Savvy 101
All P values; Significant P value:
How many hypotheses were tested?
How many groups were compared?
Is the result clinically important?
What is the magnitude of outcome?
Is the result precise, accurate, and valid?
Was the correct statistical test used?
Non-significant P value:
Was the statistical power adequate?
Is the result clinically important?

“Start out with the conviction that absolute truth is hard to reach in matters relating to our fellow creatures, healthy or diseased, that slips in observation are inevitable even with the best trained faculties, that errors in judgment must occur in the practice of an art which consists largely in balancing probabilities.
Start, I say, with this attitude of mind, and mistakes will be acknowledged and regretted; but instead of a slow process of self-deception, with ever increasing inability to recognize truth, you will draw from your errors the very lessons which may enable you to avoid their repetition.”
Sir William Osler, Aequanimitas 1904

Taming the Statistical Shrew 2008
Thank you for your kind attention!
Richard M. Rosenfeld
richrosenfeld@msn.com