Statistics 100 – Two-sample Inference
Inference for difference in means:

We have two random variables, $X$ and $Y$, and we assume
• $X \sim N(\mu_X, \sigma_X)$
• $Y \sim N(\mu_Y, \sigma_Y)$

Want to make inferences about $\mu_Y - \mu_X$.

Example: Remember her? You were asked on the first day of class how old she is.

Question: To the extent that the students in the class were a representative sample from the Harvard student population, what inferences can we make about the difference in average guessed age of the girl between male and female students?
• Confidence interval for the difference in mean guessed age between male and female students
• Hypothesis test for whether a difference in mean guessed age exists between male and female students

[Figure: Girl’s guessed age by gender. x-axis: student gender (Female, Male); y-axis: guessed age (yrs).]

Typical application: determining the effect of a treatment from experimental data (though applications to survey data are also common)
• Two treatments (“treatment” and “control”)
• Want to infer whether there is a difference between the mean response when exposed to treatment versus when exposed to control.

Two situations we’ll consider:
1. Independent samples (e.g., samples from a completely randomized design)
   (a) Assume no restrictions on (population) standard deviations
   (b) Assume (population) standard deviations are identical
2. Matched pairs design

Analysis for independent samples:

Let $\bar{x}$, $s_X$ and $n_X$ be the sample mean, sample standard deviation, and sample size for the $X$ variable, and let $\bar{y}$, $s_Y$ and $n_Y$ be the same quantities for the $Y$ variable.

100C% CI for $\mu_Y - \mu_X$:

$$(\bar{y} - \bar{x}) \pm t^* \sqrt{\frac{s_X^2}{n_X} + \frac{s_Y^2}{n_Y}}$$

where $t^*$ is the 100C% critical value from a t-distribution whose degrees of freedom is the smaller of $n_Y - 1$ and $n_X - 1$.
Example: Effect of MSG (mono-sodium glutamate) on weight of ovaries in rats:

A completely randomized experiment was performed to compare average weights (in milligrams) of ovaries in female rats when fed diets with and without MSG. Eleven female rats were fed a diet containing MSG, and ten “control” rats were fed a normal diet. For the treated rats, the mean and standard deviation were 29.35 and 4.55, respectively. For the control rats, the mean and standard deviation were 21.86 and 10.09, respectively. Find a 90% CI for the difference in average weights of rat ovaries when fed MSG versus not.

Solution: Let $X$ be the ovary weight of a control rat, and $Y$ be the ovary weight of a treated rat. Assume $X \sim N(\mu_X, \sigma_X)$ and $Y \sim N(\mu_Y, \sigma_Y)$. Computed from the data, we have

$\bar{x} = 21.86$, $\bar{y} = 29.35$
$s_X = 10.09$, $s_Y = 4.55$
$n_X = 10$, $n_Y = 11$

Want a 90% CI for $\mu_Y - \mu_X$.

Step 1: Degrees of freedom for the t-distribution in the non-pooled procedure is the smaller of $11 - 1$ and $10 - 1$, which is 9.

Step 2: Critical value for 90% confidence on the t(9) distribution is 1.833 (from Table D).

$$(29.35 - 21.86) \pm 1.833 \sqrt{\frac{(10.09)^2}{10} + \frac{(4.55)^2}{11}} = 7.49 \pm 6.37 = (1.12, 13.86)$$

We are 90% confident that the difference in average ovary weights between MSG-fed rats and ordinary rats is from 1.12 to 13.86 milligrams. As an aside, because 0 is not in the interval, $\mu_Y - \mu_X = 0$ is not “plausible.”

Hypothesis testing:

In most cases, $H_o: \mu_Y = \mu_X$ (can also be written as $H_o: \mu_Y - \mu_X = 0$).

The alternative hypothesis is either
• $H_a: \mu_Y > \mu_X$,
• $H_a: \mu_Y < \mu_X$, or
• $H_a: \mu_Y \neq \mu_X$.

The first two correspond to one-sided tests, and the third corresponds to a two-sided test.

Test statistic:

$$t = \frac{(\bar{y} - \bar{x}) - (\mu_Y - \mu_X)}{\sqrt{\frac{s_X^2}{n_X} + \frac{s_Y^2}{n_Y}}} = \frac{\bar{y} - \bar{x}}{\sqrt{\frac{s_X^2}{n_X} + \frac{s_Y^2}{n_Y}}}$$

where the second form uses $\mu_Y - \mu_X = 0$ under $H_o$. Under $H_o$, the test statistic has a t-distribution where the degrees of freedom is the smaller of $n_Y - 1$ and $n_X - 1$.
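The non-pooled interval and test statistic above can be checked with a few lines of Python. This is only a sketch under stated assumptions: the function names (`nonpooled_se`, `nonpooled_ci`, `nonpooled_t`) are illustrative and not part of the course material, and the critical value $t^*$ is still looked up by hand in Table D and passed in, since the standard library has no t-distribution.

```python
import math

def nonpooled_se(s_x, n_x, s_y, n_y):
    """Standard error of ybar - xbar, with no assumption that sigma_X = sigma_Y."""
    return math.sqrt(s_x**2 / n_x + s_y**2 / n_y)

def nonpooled_ci(xbar, s_x, n_x, ybar, s_y, n_y, t_star):
    """CI for mu_Y - mu_X; t_star comes from a table with df = min(n_x, n_y) - 1."""
    diff = ybar - xbar
    margin = t_star * nonpooled_se(s_x, n_x, s_y, n_y)
    return diff - margin, diff + margin

def nonpooled_t(xbar, s_x, n_x, ybar, s_y, n_y):
    """Test statistic for Ho: mu_Y = mu_X (non-pooled procedure)."""
    return (ybar - xbar) / nonpooled_se(s_x, n_x, s_y, n_y)

# MSG example: X = control rats, Y = treated rats, t* = 1.833 on t(9)
low, high = nonpooled_ci(21.86, 10.09, 10, 29.35, 4.55, 11, t_star=1.833)
print(round(low, 2), round(high, 2))  # matches the hand computation (1.12, 13.86)
```

Passing `t_star` explicitly keeps the sketch faithful to the table-based workflow in these notes; software that computes its own degrees of freedom may give a slightly different interval.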
Another example: Do telephone calls made by the sales division of a company last longer, on average, than calls made by the customer service department? A random sample of 40 calls from the sales division revealed an average of 10.26 minutes per call with a standard deviation of 8.65 minutes. A random sample of 20 calls from the customer service department provided an average of 6.93 minutes per call with a standard deviation of 4.93 minutes. Test at the $\alpha = 0.01$ level.

Solution: Let $X$ be the length of a phone call (minutes) placed by a customer service representative, and let $Y$ be the length of a call placed by a salesman. Assume $X \sim N(\mu_X, \sigma_X)$ and $Y \sim N(\mu_Y, \sigma_Y)$.

$\bar{x} = 6.93$, $\bar{y} = 10.26$
$s_X = 4.93$, $s_Y = 8.65$
$n_X = 20$, $n_Y = 40$

Step 1: $H_o: \mu_Y = \mu_X$, $H_a: \mu_Y > \mu_X$

Step 2: The significance level is 0.01.

Step 3: Degrees of freedom for the t-distribution in the non-pooled procedure is the smaller of $40 - 1$ and $20 - 1$, which is 19.

Step 4: Calculate the test statistic:

$$t = \frac{10.26 - 6.93}{\sqrt{\frac{4.93^2}{20} + \frac{8.65^2}{40}}} = \frac{3.33}{1.75} = 1.90$$

Step 5: Calculate the p-value: On 19 degrees of freedom, we have $1.729 < 1.90 < 2.093$, so that $0.025 < P(t > 1.90) < 0.05$. For this one-sided test, therefore,

0.025 < p-value < 0.05

Thus, at the $\alpha = 0.01$ level, we cannot reject $H_o$. We do not have enough evidence to conclude that sales division personnel spend more time per call, on average, than customer service representatives.

Comments:
• $\bar{y} - \bar{x}$ is a point estimate of $\mu_Y - \mu_X$.
• $t^* \sqrt{\frac{s_X^2}{n_X} + \frac{s_Y^2}{n_Y}}$ is the margin of error for the confidence interval.
• $\sqrt{\frac{s_X^2}{n_X} + \frac{s_Y^2}{n_Y}}$ is the standard error of $\bar{y} - \bar{x}$.
• These techniques assume $\sigma_Y$ may be different from $\sigma_X$.

Another major possibility: What if it is believable that $\sigma_X = \sigma_Y = \sigma$? In this case, we are assuming that
• $X \sim N(\mu_X, \sigma)$
• $Y \sim N(\mu_Y, \sigma)$

We can form a single estimate of the (common) standard deviation. Define

$$s_p = \sqrt{\frac{(n_X - 1)s_X^2 + (n_Y - 1)s_Y^2}{n_X + n_Y - 2}}$$

to be the “pooled” estimate of $\sigma$.
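As a small illustration of the pooled estimate just defined, the formula translates directly into code. The helper name `pooled_sd` is a hypothetical choice for this sketch. The numbers plugged in are the summary values from the achievement-test example that appears later in these notes (standard deviations 10.2 and 13.2 with sample sizes 12 and 16).

```python
import math

def pooled_sd(s_x, n_x, s_y, n_y):
    """Pooled estimate of the common standard deviation sigma.

    Each sample variance is weighted by its degrees of freedom (n - 1).
    """
    numerator = (n_x - 1) * s_x**2 + (n_y - 1) * s_y**2
    return math.sqrt(numerator / (n_x + n_y - 2))

# Summary values from the achievement-test example in these notes
sp = pooled_sd(10.2, 12, 13.2, 16)
print(round(sp, 2))  # 12.02, as in the worked example
```

Because the weights are the degrees of freedom, the pooled value always lies between the two sample standard deviations and sits closer to the one from the larger sample.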
Assuming $\sigma_Y = \sigma_X$, 100C% CI for $\mu_Y - \mu_X$:

$$(\bar{y} - \bar{x}) \pm t^* s_p \sqrt{\frac{1}{n_X} + \frac{1}{n_Y}}$$

where $t^*$ is the critical value from a t-distribution on $n_Y + n_X - 2$ degrees of freedom.

Hypothesis testing:

Again, in most cases, $H_o: \mu_Y = \mu_X$.

Test statistic:

$$t = \frac{(\bar{y} - \bar{x}) - (\mu_Y - \mu_X)}{s_p \sqrt{\frac{1}{n_X} + \frac{1}{n_Y}}} = \frac{\bar{y} - \bar{x}}{s_p \sqrt{\frac{1}{n_X} + \frac{1}{n_Y}}}$$

Under $H_o$, the test statistic has a t-distribution on $n_X + n_Y - 2$ degrees of freedom.

Using pooled vs non-pooled procedure:
• If $s_X$ and $s_Y$ are not within a factor of 1.5, use non-pooled procedures.
• If $s_X$ and $s_Y$ are within a factor of 1.5, use pooled procedures.

Examples:
• $s_X = 2.84$ and $s_Y = 3.92$. Because $3.92/2.84 = 1.38 < 1.5$, use the pooled procedure.
• $s_X = 1.43$ and $s_Y = 5.91$. Because $5.91/1.43 = 4.13 > 1.5$, use the non-pooled procedure.

In the previous two data examples, the sample standard deviations were too far apart to use the pooled procedure:
• In the study examining the effect of MSG on rat ovaries, we had $s_X = 10.09$ and $s_Y = 4.55$, so because $10.09/4.55 = 2.22 > 1.5$, we should not use the pooled procedure.
• In the study on the comparison of phone call lengths, we had $s_X = 4.93$ and $s_Y = 8.65$, so because $8.65/4.93 = 1.75 > 1.5$, we should not use the pooled procedure.

Example: The Wide Range Achievement Test is given to a stratified random sample of 12 six-year-olds and 16 seven-year-olds. The average score and standard deviation for the six-year-olds were 27.5 and 10.2, respectively. The average score and standard deviation for the seven-year-olds were 44.0 and 13.2, respectively. Find a 95% confidence interval for the difference in average scores for six- and seven-year-olds.

Solution: Let $X$ be the score of a six-year-old and let $Y$ be the score of a seven-year-old. It is reasonable to assume approximately $X \sim N(\mu_X, \sigma_X)$ and $Y \sim N(\mu_Y, \sigma_Y)$. We want to find a 95% CI for $\mu_Y - \mu_X$.
We’re given:

$\bar{x} = 27.5$, $\bar{y} = 44.0$
$s_X = 10.2$, $s_Y = 13.2$
$n_X = 12$, $n_Y = 16$

Step 1: Because the standard deviations are less than a factor of 1.5 from each other ($13.2/10.2 = 1.29 < 1.5$), use the pooled procedure.

Step 2: Degrees of freedom for the t-distribution in the pooled procedure is $12 + 16 - 2 = 26$.

Step 3: Critical value for 95% confidence on the t(26) distribution is 2.056 (from Table D).

Step 4: Compute $s_p$.

$$s_p = \sqrt{\frac{(12 - 1)10.2^2 + (16 - 1)13.2^2}{12 + 16 - 2}} = 12.02$$

Step 5: The confidence interval is

$$(44.0 - 27.5) \pm 2.056(12.02)\sqrt{\frac{1}{12} + \frac{1}{16}} = 16.5 \pm 9.44 = (7.06, 25.94)$$

We are 95% confident that the difference in average score between six-year-olds and seven-year-olds is from 7.06 to 25.94.

Another example: An economist was curious whether the news media gives equal coverage to good news and bad news that is of equal importance. He looked at television reports of changes in the unemployment rate over the time period from 1973 to 1985. Out of the 171 times that the unemployment rate was reported to have increased, the average news time and standard deviation were 161.8 seconds and 110.8 seconds, respectively. Out of the 170 times that the unemployment rate was reported to have decreased, the average news time and standard deviation were 123.6 seconds and 103.9 seconds, respectively. At the $\alpha = 0.05$ significance level, is there a difference in the amount of time the media devotes to good news versus bad news?

Solution: Let $X$ be the time for a report that the unemployment rate decreased, and let $Y$ be the time for a report that the unemployment rate increased. It is reasonable to assume $X \sim N(\mu_X, \sigma_X)$ and $Y \sim N(\mu_Y, \sigma_Y)$.

$\bar{x} = 123.6$, $\bar{y} = 161.8$
$s_X = 103.9$, $s_Y = 110.8$
$n_X = 170$, $n_Y = 171$

Step 1: Because the standard deviations are less than a factor of 1.5 from each other ($110.8/103.9 = 1.066 < 1.5$), use the pooled procedure.

Step 2: $H_o: \mu_Y = \mu_X$, $H_a: \mu_Y \neq \mu_X$

Step 3: The significance level is 0.05.
Step 4: Degrees of freedom for the t-distribution in the pooled procedure is $170 + 171 - 2 = 339$. On Table D, round down to 100.

Step 5: Calculate $s_p$:

$$s_p = \sqrt{\frac{(170 - 1)103.9^2 + (171 - 1)110.8^2}{170 + 171 - 2}} = 107.42$$

Step 6: Calculate the test statistic:

$$t = \frac{161.8 - 123.6}{107.42\sqrt{\frac{1}{170} + \frac{1}{171}}} = \frac{38.2}{11.63} = 3.28$$

Step 7: Calculate the p-value: On 100 degrees of freedom, we have $3.174 < 3.28 < 3.390$, so that $0.0005 < P(t > 3.28) < 0.001$. The two-sided p-value is $P(t > 3.28) + P(t < -3.28)$, which is twice $P(t > 3.28)$, so that

2(0.0005) < p-value < 2(0.001), or 0.001 < p-value < 0.002

Thus, at the $\alpha = 0.05$ level, we can reject $H_o$. The news media devotes a different amount of time, on average, to unemployment rates increasing versus decreasing.

To pool or not to pool... what is the question?!
• Use of the pooled procedure makes a strong assumption ($\sigma_X = \sigma_Y$) about the population.
• If this assumption is correct, then inferences about $\mu_Y - \mu_X$ will be more precise. If the assumption is wrong, then inferences about $\mu_Y - \mu_X$ will be mostly meaningless.
• Using the non-pooled procedure is “safer,” though the trade-off is that inferences will not be as precise as the pooled procedure if $\sigma_X = \sigma_Y$.

Recap: Steps for inference for a difference in means:
1. Determine whether to use the non-pooled procedure or the pooled procedure (based on examining the sample standard deviations).
2. Carry out the appropriate confidence interval or hypothesis test.

Let’s try with the age-guessing example.

Remarks:
• Data were assumed to come from a CRD (experiment), or from an SRS or stratified sample (survey, observational study).
• Again, both $X$ and $Y$ are assumed to be approximately normally distributed.
• The t-procedures are fairly robust to non-normality of the data, but it is usually a good idea to check the data for strong skewness or outliers.

Matched pairs design:

When the data from two samples come from a matched pairs experiment, using the previous methods will not be precise enough.
Can do better by incorporating the knowledge that the observations are paired.

Motivating example: Suppose a gasoline distributor wants to know whether an additive improves cars’ mileage. He designs a matched pairs experiment where 10 cars are given, in random order, ordinary gasoline and gasoline with the additive.

Car    Mileage With Additive    Mileage Ordinary Gasoline
1      25.7                     24.9
2      20.0                     18.8
3      28.4                     27.7
4      13.7                     13.0
5      18.8                     17.8
6      12.5                     11.3
7      28.4                     27.8
8       8.1                      8.2
9      23.1                     23.1
10     10.4                      9.9
avg    18.91                    18.25
sd      7.47                     7.42

[Figure: Car mileage with and without gasoline additive. x-axis: with and without additive (Without additive, With additive); y-axis: car mileage.]

Let $X$ be the mileage of a car without the additive, and let $Y$ be the mileage of a car with the additive. Want to make an inference about $\mu_Y - \mu_X$.

An Idea: Let $D = Y - X$ be the difference between the $Y$ and $X$ measurements within a pair. Then $D \sim N(\mu_D, \sigma_D)$. Because $\mu_D = \mu_Y - \mu_X$, can perform inference on $\mu_D$.

Data for a matched pairs analysis: Observe $X$ values $x_1, x_2, \ldots, x_n$ and $Y$ values $y_1, y_2, \ldots, y_n$. Compute the differences, $d_1 = y_1 - x_1$, $d_2 = y_2 - x_2$, ..., $d_n = y_n - x_n$. Let $\bar{d}$ and $s_D$ be the sample mean and standard deviation of the differences. Inference for this two-sample problem has now been reduced to a 1-sample problem by analyzing the within-pair differences.

Car    Mileage With Additive    Mileage Ordinary Gasoline    Difference
1      25.7                     24.9                          0.8
2      20.0                     18.8                          1.2
3      28.4                     27.7                          0.7
4      13.7                     13.0                          0.7
5      18.8                     17.8                          1.0
6      12.5                     11.3                          1.2
7      28.4                     27.8                          0.6
8       8.1                      8.2                         –0.1
9      23.1                     23.1                          0.0
10     10.4                      9.9                          0.5
avg    18.91                    18.25                         0.66
sd      7.47                     7.42                         0.443

[Figure: Car mileage differences with and without gasoline additive. Axis: difference in car mileage.]

100C% CI for $\mu_Y - \mu_X$:

$$\bar{d} \pm t^* \frac{s_D}{\sqrt{n}}$$

where $t^*$ is the critical value from a t-distribution on $n - 1$ degrees of freedom.

Hypothesis testing:

In most cases, $H_o: \mu_Y = \mu_X$, which is identical to $H_o: \mu_D = 0$.

Test statistic:

$$t = \frac{\bar{d} - \mu_D}{s_D/\sqrt{n}} = \frac{\bar{d}}{s_D/\sqrt{n}}$$

Under $H_o$, the test statistic has a t-distribution on $n - 1$ degrees of freedom.

To construct a 95% CI for the average increase due to the gasoline additive, we have $n = 10$, $\bar{d} = 0.66$ and $s_D = 0.443$. For 95% confidence and df = 9, we have $t^* = 2.262$. So the 95% confidence interval for $\mu_Y - \mu_X$ is given by

$$\left(0.66 - 2.262\,\frac{0.443}{\sqrt{10}},\; 0.66 + 2.262\,\frac{0.443}{\sqrt{10}}\right) = (0.66 - 0.32,\; 0.66 + 0.32) = (0.34, 0.98)$$

We are 95% confident that the average increase in mileage due to the additive is between 0.34 and 0.98 miles per gallon. As an aside, if we used the independent sample method (pooled procedure), the 95% CI would be (−6.34, 7.66).

Example: Effects of alcohol on hypoxia. Ten male subjects were taken to a simulated altitude of 25,000 ft and given tasks to perform. The time (in seconds) at which useful consciousness ended was measured for each subject. Three days later, the experiment was repeated with the same ten subjects 1 hour after subjects took 0.5 cc of 100-proof whiskey per pound of body weight. The time (in seconds) at which useful consciousness ended when whiskey was ingested was then recorded. Does whiskey significantly reduce the average time of useful consciousness (at the 0.05 level)?

The differences within each subject were: 76, 190, 590, 390, 65, –55, –5, 530, 175, and 0.

Let $X$ be the “survival” time with whiskey, and let $Y$ be the survival time without whiskey. Let $D = Y - X$ be the time difference for a subject (positive if whiskey reduces the time of consciousness). Assume $D \sim N(\mu_D, \sigma_D)$, where $\mu_D = \mu_Y - \mu_X$. From these data, we can compute $\bar{d} = 195.6$ and $s_D = 230.53$.

Solution:

$H_o: \mu_Y = \mu_X \leftrightarrow \mu_D = 0$
$H_a: \mu_Y > \mu_X \leftrightarrow \mu_D > 0$

Significance level: $\alpha = 0.05$

Test statistic:

$$t = \frac{\bar{d} - \mu_D}{s_D/\sqrt{n}} = \frac{195.6 - 0}{230.53/\sqrt{10}} = 2.68$$

Computing a p-value: On 9 degrees of freedom, the one-sided p-value is between 0.01 and 0.02 (because $2.398 < 2.68 < 2.821$ from Table D).
Because the p-value is less than 0.05, we conclude that there is sufficient evidence that the whiskey is associated with a lower time of useful consciousness.

Why not act as if data came from a CRD?
• Typically responses within pairs are much more similar than responses between pairs.
• For matched pairs design, it is usually true that the variability of $\bar{y} - \bar{x}$ is much smaller than estimated by the “independent sample” method.
• Analysis of within-pair differences takes advantage of similarity within pairs.

Inference for difference in proportions:

Usual situation:
• Experimental data with 2 treatments (treatment group and control group), or
• Survey data from stratified sample with 2 strata

Response variable for each unit is a binary categorical variable (“success” or “failure”). The information is summarized as sample proportions of success for each group.

Example: A study in 2001 sponsored by the National Sleep Foundation asked a random sample of 995 U.S. adults whether they snore. The data can be summarized by whether subjects were under 30 years old or 30 and over, and appear in the following table:

Age group       Snores    Does not snore
Under 30 y/o      48          136
30+ y/o          318          493

Among younger participants, the fraction of the sample that snored was

$\hat{p}_Y = 48/(48 + 136) = 0.261$

Among older participants, the fraction was

$\hat{p}_O = 318/(318 + 493) = 0.392$

What inferences can we make about the difference in the proportion of snoring in the U.S. population between younger and older people?

Notation: Let
$n_X$ = Sample size of control group
$n_Y$ = Sample size of treatment group
$X$ = # of “successes” in control group
$Y$ = # of “successes” in treatment group

Also let
$\hat{p}_X = X/n_X$ = sample proportion of “successes” in X group
$\hat{p}_Y = Y/n_Y$ = sample proportion of “successes” in Y group

Assume
$X \sim B(n_X, p_X)$
$Y \sim B(n_Y, p_Y)$

Want to make inferences about $p_Y - p_X$.

100C% CI for $p_Y - p_X$:

$$(\hat{p}_Y - \hat{p}_X) \pm z^* \sqrt{\frac{\hat{p}_X(1 - \hat{p}_X)}{n_X} + \frac{\hat{p}_Y(1 - \hat{p}_Y)}{n_Y}}$$

where $z^*$ is the critical value from the row of Table D labeled $z^*$.
Comments:
• This confidence interval is approximate because we used normal probabilities to approximate binomial probabilities.
• The term $z^* \sqrt{\frac{\hat{p}_X(1 - \hat{p}_X)}{n_X} + \frac{\hat{p}_Y(1 - \hat{p}_Y)}{n_Y}}$ is the margin of error of the confidence interval.
• The term $\sqrt{\frac{\hat{p}_X(1 - \hat{p}_X)}{n_X} + \frac{\hat{p}_Y(1 - \hat{p}_Y)}{n_Y}}$ is the standard error of $\hat{p}_Y - \hat{p}_X$.

Worthwhile comment: Making inferences about parameters connected with binomial distributions never involves use of the t-distribution even though we’re estimating a population standard deviation from data.

Example: To determine if absenteeism is more of a problem among male or female workers at your company, you obtain a stratified random sample of 130 female employees and 140 male employees. Looking at their records, you note that 12 of the females were absent from work for more than five days last year, and 14 of the males were absent for more than five days. Construct a 90% confidence interval for the difference in rates of absenteeism between male and female employees.

Solution: Let $X$ be the number of male employees out of a sample of 140 that were absent for more than five days last year, and let $Y$ be the analogous number for women out of 130. Then

$X \sim B(140, p_X)$
$Y \sim B(130, p_Y)$

Want a 90% CI for $p_Y - p_X$. We have $\hat{p}_X = 14/140 = 0.1$, and $\hat{p}_Y = 12/130 = 0.0923$. For 90% confidence, $z^* = 1.645$ (from the $z^*$ row of Table D). The confidence interval for $p_Y - p_X$ is computed as

$$(0.0923 - 0.1) \pm 1.645 \sqrt{\frac{0.1(1 - 0.1)}{140} + \frac{0.0923(1 - 0.0923)}{130}} = -0.0077 \pm 0.059 = (-0.0667, 0.0513)$$

We are 90% confident that the difference between the female and male absenteeism rates ($p_Y - p_X$) is between –0.0667 and 0.0513.

Hypothesis testing:

In virtually all cases, $H_o: p_Y = p_X$.

Possible alternative hypotheses:
• $H_a: p_Y > p_X$,
• $H_a: p_Y < p_X$, or
• $H_a: p_Y \neq p_X$.

Acting as though $H_o$ is true, let $p_Y = p_X = p$.

Pooled estimate of $p$:

$$\hat{p} = \frac{x + y}{n_X + n_Y}$$

where $x$ and $y$ are the observed number of “successes” in each group.

Test statistic:

$$z = \frac{(\hat{p}_Y - \hat{p}_X) - (p_Y - p_X)}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_X} + \frac{1}{n_Y}\right)}} = \frac{\hat{p}_Y - \hat{p}_X}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_X} + \frac{1}{n_Y}\right)}}$$

Under $H_o$, the test statistic has a standard normal distribution, so the p-value can be computed from Table A.

Example: A survey was performed to estimate the proportion of registered voters that vote. A stratified random sample of 400 employed people and 450 unemployed people was obtained, all of whom were registered to vote. Among the employed people, 262 voted in the last election, and among the unemployed people, 244 voted. Test at the $\alpha = 0.05$ level whether the proportions of employed and unemployed registered voters in the population that voted in the last election are the same.

Solution: Let $X$ be the number out of a sample of 400 employed people that voted in the last election, and let $Y$ be the number out of a sample of 450 unemployed people that voted in the last election. Then

$X \sim B(400, p_X)$
$Y \sim B(450, p_Y)$

Want to test

$H_o: p_Y = p_X$
$H_a: p_Y \neq p_X$

at the $\alpha = 0.05$ significance level.

Calculate $\hat{p}_X = 262/400 = 0.655$ and $\hat{p}_Y = 244/450 = 0.542$.

Pooled $\hat{p}$:

$$\hat{p} = \frac{262 + 244}{400 + 450} = 0.595$$

Test statistic:

$$z = \frac{\hat{p}_Y - \hat{p}_X}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_X} + \frac{1}{n_Y}\right)}} = \frac{0.542 - 0.655}{\sqrt{0.595(1 - 0.595)\left(\frac{1}{400} + \frac{1}{450}\right)}} = -\frac{0.113}{0.034} = -3.32$$

Calculate the p-value: For this two-sided test,

p-value = $P(Z < -3.32) + P(Z > 3.32) = 0.0005 + 0.0005 = 0.001$

Because 0.001 < 0.05, we should reject $H_o$. We have significant evidence that the probability people vote depends on whether they are employed.

Comments:
• Need to assume $X$ and $Y$ are independent binomially distributed random variables. In other words, the data came from two independent binomial samples.
• Must have $n_X$ and $n_Y$ large enough to use the normal approximation. This means:
  – $n_X \hat{p}_X$ and $n_X(1 - \hat{p}_X)$ must both be greater than 10, and
  – $n_Y \hat{p}_Y$ and $n_Y(1 - \hat{p}_Y)$ must both be greater than 10.
  For this course, the samples will be large enough so you don’t need to check.
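The interval and test for $p_Y - p_X$ from independent samples can be sketched in Python using only the standard library. The function names (`two_prop_ci`, `two_prop_z`, `two_sided_p`) are illustrative assumptions, not anything defined in the course, and `statistics.NormalDist` stands in for Table A. The counts are those of the absenteeism and voting examples above. Note that the code carries unrounded proportions, so the voting statistic comes out near −3.34 rather than the hand-rounded −3.32; the conclusion is the same.

```python
import math
from statistics import NormalDist

def two_prop_ci(x, n_x, y, n_y, z_star):
    """Approximate CI for p_Y - p_X using the unpooled standard error."""
    p_x, p_y = x / n_x, y / n_y
    se = math.sqrt(p_x * (1 - p_x) / n_x + p_y * (1 - p_y) / n_y)
    diff = p_y - p_x
    return diff - z_star * se, diff + z_star * se

def two_prop_z(x, n_x, y, n_y):
    """z statistic for Ho: p_Y = p_X, using the pooled estimate of p."""
    p_pool = (x + y) / (n_x + n_y)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_x + 1 / n_y))
    return (y / n_y - x / n_x) / se

def two_sided_p(z):
    """Two-sided p-value from the standard normal distribution."""
    return 2 * NormalDist().cdf(-abs(z))

# Absenteeism example: X = men (14 of 140), Y = women (12 of 130), 90% CI
ci = two_prop_ci(14, 140, 12, 130, z_star=1.645)

# Voting example: X = employed (262 of 400), Y = unemployed (244 of 450)
z = two_prop_z(262, 400, 244, 450)
p_value = two_sided_p(z)
print(ci, z, p_value)
```

The interval uses the unpooled standard error while the test uses the pooled one, mirroring the two formulas in the notes: pooling is only justified when acting as though $H_o$ is true.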
Inference for $p_Y - p_X$ with matched pairs: McNemar’s procedure

Example: Data were collected on movie evaluations by Chicago film critics Roger Ebert and Gene Siskel. The data consisted of 111 movies from April 1995 through September 1996.

                        Siskel
Ebert           Thumbs Down    Thumbs Up
Thumbs Down         24             10
Thumbs Up           13             64

Did Ebert and Siskel give “Thumbs up” equally often?

Notation:

             X = 0      X = 1      Total
Y = 0        w00        w10        n − y
Y = 1        w01        w11        y
Total        n − x      x          n

Here $w_{ij}$ is the number of pairs with $X = i$ and $Y = j$. Notice $\hat{p}_X = x/n$ and $\hat{p}_Y = y/n$.

100C% CI for $p_Y - p_X$:

$$(\hat{p}_Y - \hat{p}_X) \pm z^* \left(\frac{1}{n}\sqrt{w_{01} + w_{10}}\right)$$

Siskel and Ebert example: Suppose we want a 90% confidence interval for the difference in “thumbs up” rate between Siskel and Ebert. Letting $X$ represent Siskel and $Y$ represent Ebert, $\hat{p}_X = (10 + 64)/111 = 0.667$, and $\hat{p}_Y = (13 + 64)/111 = 0.694$. Also, we have $n = 111$, and for 90% confidence, $z^* = 1.645$. Finally, $w_{01} = 13$ and $w_{10} = 10$. So,

$$(\hat{p}_Y - \hat{p}_X) \pm z^* \left(\frac{1}{n}\sqrt{w_{01} + w_{10}}\right) = 0.694 - 0.667 \pm 1.645\left(\frac{1}{111}\sqrt{13 + 10}\right) = 0.027 \pm 0.071 = (-0.044, 0.098)$$

Thus we are 90% confident that the difference in “thumbs up” rates between Siskel and Ebert is between −0.044 and 0.098. It is worth noting that, assuming independent samples, the confidence interval would be (−0.076, 0.130).

Hypothesis testing with paired samples:

Use the same null and alternative hypotheses as with independent samples (always have $H_o: p_Y = p_X$, while $H_a$ depends on the context of the problem).

Test statistic:

$$z = \frac{w_{01} - w_{10}}{\sqrt{w_{10} + w_{01}}}$$

Under $H_o$, the test statistic has a standard normal distribution, so compute the p-value from Table A.

Note: Positive values of $w_{01} - w_{10}$ (and therefore positive values of $z$) correspond to evidence that $p_Y - p_X > 0$; negative values correspond to evidence that $p_Y - p_X < 0$.

Example: On the 1994 General Social Survey, respondents were asked whether a person has the right to end his/her life if the person has an incurable disease, and whether a doctor can assist in ending a person’s life if the person has an incurable disease.
The data on 1825 respondents are below:

                        Self-suicide
Assisted suicide      No        Yes
No                   435         90
Yes                  203       1097

At the $\alpha = 0.01$ significance level, test whether there is a difference in self-suicide and assisted-suicide acceptability rates.

Solution: Let $X$ be the number of respondents accepting self-suicide out of 1825, and $Y$ be the number of respondents accepting assisted-suicide out of 1825.

$H_o: p_Y = p_X$
$H_a: p_Y \neq p_X$

From the table, we have $w_{01} = 203$ and $w_{10} = 90$.

Test statistic:

$$z = \frac{w_{01} - w_{10}}{\sqrt{w_{10} + w_{01}}} = \frac{203 - 90}{\sqrt{90 + 203}} = 113/\sqrt{293} = 6.60$$

For this two-sided test, the p-value is

p-value = $P(Z < -6.60) + P(Z > 6.60) = 2P(Z < -6.60)$

From Table A, $P(Z < -6.60) \approx 0$, so that the p-value is less than 0.01. We can reject the hypothesis that the acceptability rates are the same for both types of suicide at the 0.01 level.

Comments:
• Both $X$ and $Y$ must come from binomial sampling, but need not be independent.
• Interestingly, for matched pairs, the standard error of $\hat{p}_Y - \hat{p}_X$ does not depend on the data in which there is agreement.
• Standard errors based on matched pairs are generally smaller than those based on independent samples; this reflects the extra information in the study design.
• The sample sizes must be large enough to use the normal approximation. One rule of thumb is to have the sum of $w_{01}$ and $w_{10}$ be at least 25.
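The matched-pairs formulas above depend only on the two disagreement counts, which makes them easy to sketch in code. The function names `mcnemar_ci` and `mcnemar_z` are hypothetical labels for this illustration, and `statistics.NormalDist` is used in place of Table A. The sketch relies on the identity $\hat{p}_Y - \hat{p}_X = (w_{01} - w_{10})/n$, which follows from the notation table because the agreement count $w_{11}$ cancels.

```python
import math
from statistics import NormalDist

def mcnemar_ci(w01, w10, n, z_star):
    """CI for p_Y - p_X from matched pairs, using only the discordant counts."""
    diff = (w01 - w10) / n                      # equals p_hat_Y - p_hat_X
    margin = z_star * math.sqrt(w01 + w10) / n
    return diff - margin, diff + margin

def mcnemar_z(w01, w10):
    """z statistic for Ho: p_Y = p_X with paired binary responses."""
    return (w01 - w10) / math.sqrt(w01 + w10)

# Siskel (X) and Ebert (Y): w01 = 13, w10 = 10, n = 111, 90% confidence
low, high = mcnemar_ci(13, 10, 111, z_star=1.645)
print(round(low, 3), round(high, 3))  # (-0.044, 0.098), as computed by hand

# General Social Survey example: w01 = 203, w10 = 90
z = mcnemar_z(203, 90)
p_value = 2 * NormalDist().cdf(-abs(z))  # two-sided p-value
print(round(z, 2))  # 6.6
```

Neither function takes the agreement counts as arguments, which is the code-level counterpart of the comment that the standard error does not depend on the data in which there is agreement.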