SELECTING THE SAMPLE SIZE FOR ESTIMATING POPULATION MEANS AND TOTALS

At some point in the design of a survey, someone must decide on the size of the sample to be selected from the population. So far we have discussed a sampling procedure (simple random sampling) but have said nothing about the number of observations to be included in the sample. The implications of this decision are clear. Observations cost money. Hence, if the sample is too large, time and talent are wasted. Conversely, if the sample is too small, we have bought inadequate information for the time and effort expended and have again been wasteful.

The number of observations needed to estimate a population mean \(\mu\) with a bound on the error of estimation of magnitude \(B\) is found by setting two standard deviations of the estimator, \(\bar{y}\), equal to \(B\) and solving this expression for \(n\). That is, we must solve

\[ 2\sqrt{V(\bar{y})} = 2\sqrt{\frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right)} = B. \]

Squaring both sides and collecting the terms in \(n\) gives the sample size required to estimate \(\mu\) with a bound on the error of estimation \(B\):

\[ n = \frac{N\sigma^2}{(N-1)D + \sigma^2}, \qquad \text{where } D = \frac{B^2}{4}. \]

Solving for \(n\) in a practical situation presents a problem because the population variance \(\sigma^2\) is unknown. Since a sample variance \(s^2\) is frequently available from prior experimentation, we can obtain an approximate sample size by replacing \(\sigma^2\) with \(s^2\). We will illustrate a method for guessing a value of \(\sigma^2\) when very little prior information is available. If \(N\) is large, as it usually is, the \((N-1)\) can be replaced by \(N\) in the denominator of the equation to get

\[ n = \frac{N\sigma^2}{ND + \sigma^2}. \]

Example 5
The average amount of money \(\mu\) for a hospital's accounts receivable must be estimated. Although no prior data are available to estimate the population variance \(\sigma^2\), it is known that most accounts lie within a $100 range. There are \(N = 1000\) open accounts. Find the sample size needed to estimate \(\mu\) with a bound on the error of estimation \(B\) = $3.

Solution
We need an estimate of \(\sigma^2\), the population variance.
Since the range is often approximately equal to four standard deviations (\(4\sigma\)), one-fourth of the range will provide an approximate value of \(\sigma\). Hence \(\sigma\) is taken to be approximately 25 and \(\sigma^2 = 625\). Then

\[ D = \frac{B^2}{4} = \frac{3^2}{4} = 2.25 \]

\[ n = \frac{N\sigma^2}{(N-1)D + \sigma^2} = \frac{1000(625)}{999(2.25) + 625} = 217.56 \]

That is, we need approximately 218 observations to estimate \(\mu\), the mean accounts receivable, with a bound on the error of estimation of $3.00.

In like manner, we can determine the number of observations needed to estimate a population total \(\tau\) with a bound on the error of estimation of magnitude \(B\). The required sample size is found by setting two standard deviations of the estimator equal to \(B\) and solving this expression for \(n\). Proceeding as we did earlier, we get

\[ n = \frac{N\sigma^2}{(N-1)D + \sigma^2}, \qquad \text{where } D = \frac{B^2}{4N^2}. \]

Example 6
An investigator is interested in estimating the total weight gain in 0 to 4 weeks for \(N = 1000\) chicks fed on a new ration. Obviously, to weigh each bird would be time-consuming and tedious. Therefore, determine the number of chicks to be sampled in this study in order to estimate \(\tau\) with a bound on the error of estimation equal to 1000 grams. Many similar studies on chick nutrition have been run in the past. Using data from these studies, the investigator found that \(\sigma^2\), the population variance, was approximately equal to 36.00 grams\(^2\). Determine the required sample size.

Solution
We can obtain an approximate sample size using the previous equation with \(\sigma^2\) equal to 36.0:

\[ D = \frac{B^2}{4N^2} = \frac{(1000)^2}{4(1000)^2} = 0.25 \]

\[ n = \frac{N\sigma^2}{(N-1)D + \sigma^2} = \frac{1000(36)}{999(0.25) + 36} = 125.98 \]

The investigator, therefore, needs to weigh \(n = 126\) chicks to estimate \(\tau\), the total weight gain for \(N = 1000\) chicks in 0 to 4 weeks, with a bound on the error of estimation equal to 1000 grams.

ESTIMATION OF A POPULATION PROPORTION

The investigator conducting a sample survey is frequently interested in estimating the proportion of the population that possesses a specified characteristic.
For example, a congressional leader investigating the merits of an 18-year-old voting age may want to estimate the proportion of the potential voters in the district between the ages of 18 and 21. A marketing research group may be interested in the proportion of the total sales market in diet preparations that is attributable to a particular product. That is, what percentage of sales is accounted for by a particular product? A forest manager may be interested in the proportion of trees with a diameter of 12 inches or more. Television ratings are often determined by estimating the proportion of the viewing public that watches a particular program.

You will recognize that all these examples exhibit a characteristic of the binomial experiment, that is, an observation either does belong or does not belong to the category of interest. For example, one can estimate the proportion of eligible voters in a particular district by examining population census data for several of the precincts within the district. An estimate of the proportion of voters between 18 and 21 years of age for the entire district will be the fraction of potential voters from the precincts sampled that fell into this age range.

In subsequent discussion we denote the population proportion and its estimator by the symbols \(p\) and \(\hat{p}\), respectively. The properties of \(\hat{p}\) for simple random sampling parallel those of the sample mean \(\bar{y}\) if the response measurements are defined as follows: let \(y_i = 0\) if the \(i\)th element sampled does not possess the specified characteristic and \(y_i = 1\) if it does. Then the total number of elements in a sample of size \(n\) possessing a specified characteristic is \(\sum_{i=1}^{n} y_i\).

If we draw a simple random sample of size \(n\), the sample proportion \(\hat{p}\) is the fraction of the elements in the sample that possess the characteristic of interest.
For example, the estimate \(\hat{p}\) of the proportion of eligible voters between the ages of 18 and 21 in a certain district is

\[ \hat{p} = \frac{\text{number of voters sampled between the ages of 18 and 21}}{\text{number of voters sampled}} = \frac{\sum_{i=1}^{n} y_i}{n} = \bar{y}. \]

In other words, \(\hat{p}\) is the average of the 0 and 1 values from the sample. Similarly, we can think of the population proportion as the average of the 0 and 1 values for the entire population (that is, \(p = \mu\)).

Estimator of the population proportion \(p\):

\[ \hat{p} = \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} \]

Estimated variance of \(\hat{p}\):

\[ \hat{V}(\hat{p}) = \frac{\hat{p}\hat{q}}{n-1}\left(\frac{N-n}{N}\right), \qquad \text{where } q = 1 - p \text{ and } \hat{q} = 1 - \hat{p} \]

Bound on the error of estimation:

\[ 2\sqrt{\hat{V}(\hat{p})} = 2\sqrt{\frac{\hat{p}\hat{q}}{n-1}\left(\frac{N-n}{N}\right)} \]

Example 7
A simple random sample of \(n = 100\) college seniors was selected to estimate (1) the fraction of \(N = 300\) seniors going on to graduate school and (2) the fraction of students that have held part-time jobs during college. Let \(y_i\) and \(x_i\) (\(i = 1, 2, \ldots, 100\)) denote the responses of the \(i\)th student sampled. We will set \(y_i = 0\) if the \(i\)th student does not plan to attend graduate school and \(y_i = 1\) if he does. Similarly, let \(x_i = 0\) if he has not held a part-time job sometime during college and \(x_i = 1\) if he has. Using the sample data presented in the accompanying table, estimate \(p_1\), the proportion of seniors planning to attend graduate school, and \(p_2\), the proportion of seniors who have had a part-time job sometime during their college careers (summers included).

Table (columns Student, y, x; rows for students 8 through 95 are not shown):
Student: 1 2 3 4 5 6 7 ... 96 97 98 99 100
y and x entries: 1 0 0 1 0 0 0 0 1 1 1 0 0 ... 0 1 0 0 1 1 0 1 1 1
Totals: \(\sum y_i = 15\), \(\sum x_i = 65\)

Solution
The sample proportions are given by

\[ \hat{p}_1 = \frac{15}{100} = 0.15, \qquad \hat{p}_2 = \frac{65}{100} = 0.65. \]

Bounds on the error of estimation for \(p_1\) and \(p_2\), respectively, are

\[ 2\sqrt{\hat{V}(\hat{p}_1)} = 2\sqrt{\frac{\hat{p}_1\hat{q}_1}{n-1}\left(\frac{N-n}{N}\right)} = 2\sqrt{\frac{(.15)(.85)}{99}\left(\frac{300-100}{300}\right)} = 2(0.0293) = 0.059 \]

and

\[ 2\sqrt{\hat{V}(\hat{p}_2)} = 2\sqrt{\frac{\hat{p}_2\hat{q}_2}{n-1}\left(\frac{N-n}{N}\right)} = 2\sqrt{\frac{(.65)(.35)}{99}\left(\frac{300-100}{300}\right)} = 2(0.0391) = 0.078. \]

Thus we estimate that 15% of the seniors plan to attend graduate school, with a bound on the error of estimation equal to .059. We estimate that 65% of the seniors have held a part-time job during college, with a bound on the error of estimation equal to .078.

We have shown that the population proportion \(p\) can be regarded as the average of the 0 and 1 values for the entire population. Hence the problem of determining the sample size required to estimate \(p\) to within \(B\) units should be analogous to determining a sample size for estimating \(\mu\) with a bound on the error of estimation \(B\). You will recall that the required sample size for estimating \(\mu\) is given by

\[ n = \frac{N\sigma^2}{(N-1)D + \sigma^2}, \qquad \text{where } D = \frac{B^2}{4}. \]

The corresponding sample size needed to estimate \(p\) can be found by replacing \(\sigma^2\) with \(pq\).

Sample size required to estimate \(p\) with a bound on the error of estimation \(B\):

\[ n = \frac{Npq}{(N-1)D + pq}, \qquad \text{where } D = \frac{B^2}{4}. \]

In a practical situation we do not know \(p\). An approximate sample size can be found by replacing \(p\) with an estimated value. Frequently, such an estimate can be obtained from similar past surveys. However, if no such prior information is available, we can take \(p = .5\) to obtain a conservative sample size (one that is likely to be larger than required). Since \(pq = .25\) in that case, this yields

\[ n = \frac{N}{4(N-1)D + 1}. \]

Example 8
Student government leaders at a college want to conduct a survey to determine the proportion of students that favors a proposed honor code.
Since interviewing \(N = 2000\) students in a reasonable length of time is almost impossible, determine the sample size (number of students to be interviewed) needed to estimate \(p\) with a bound on the error of estimation of magnitude \(B = .05\). Assume that no prior information is available to estimate \(p\).

Solution
We can approximate the required sample size when no prior information is available by setting \(p = .5\). We have

\[ D = \frac{B^2}{4} = \frac{(.05)^2}{4} = .000625 \]

\[ n = \frac{N}{4(N-1)D + 1} = \frac{2000}{4(1999)(.000625) + 1} = 333.47 \]

That is, 334 students must be interviewed to estimate the proportion of students that favors the proposed honor code with a bound on the error of estimation of \(B = .05\).

Example 9
Referring to Example 8, suppose that in addition to estimating the proportion of students that favors the proposed honor code, student government leaders also want to estimate the proportion of students who feel the student union building adequately serves their needs. Determine the combined sample size required for a survey to estimate \(p_1\), the proportion that favors the proposed honor code, and \(p_2\), the proportion that believes the student union adequately serves their needs, with bounds on the errors of estimation of magnitude \(B_1 = .05\) and \(B_2 = .07\). Although no prior information is available to estimate \(p_1\), approximately 60% of the students believed the union adequately met their needs in a similar survey run the previous year.

Solution
In this example we must determine a sample size \(n\) that will allow us to estimate \(p_1\) with a bound \(B_1 = .05\) and \(p_2\) with a bound \(B_2 = .07\). First, we determine the sample sizes that satisfy each objective separately. The larger of the two will then be the combined sample size for a survey to meet both objectives. From Example 8, the sample size required to estimate \(p_1\) with a bound on the error of estimation of \(B_1 = .05\) was \(n = 334\) students. We can use data from the survey of the previous year to determine the sample size needed to estimate \(p_2\).
We have

\[ D = \frac{B^2}{4} = \frac{(.07)^2}{4} = .001225 \]

\[ n = \frac{Npq}{(N-1)D + pq} = \frac{2000(.6)(.4)}{(1999)(.001225) + (.6)(.4)} = 178.52 \]

That is, 179 students must be interviewed to estimate \(p_2\). The sample size required to achieve both objectives in one survey is 334, the larger of the two sample sizes.

SAMPLING WITH PROBABILITIES PROPORTIONAL TO SIZE

Previous work in this chapter has depended on the sample being a simple random sample, according to Definition 1. We will now show that varying the probabilities with which different sampling units are selected is sometimes advantageous. Suppose, for example, we wish to estimate the number of job openings in a city by sampling industrial firms within the city. Typically, many such firms will be quite small and employ few workers, while some firms will be very large. In a simple random sample, size of firm is not taken into account, and a typical sample will contain mostly small firms. But the information desired (number of job openings) is heavily influenced by the large firms. Thus we should be able to improve on the simple random sample by giving the large firms a greater chance to appear in the sample. A method for accomplishing this is called sampling with probabilities proportional to size, or pps sampling.

For a sample \(y_1, y_2, \ldots, y_n\) from a population of size \(N\), let \(\pi_i\) = probability that \(y_i\) appears in the sample.
Unbiased estimators of \(\tau\) and \(\mu\), along with their estimated variances and bounds on the error of estimation, are as follows.

Estimator of the population total \(\tau\):

\[ \hat{\tau}_{pps} = \frac{1}{n}\sum_{i=1}^{n}\frac{y_i}{\pi_i} \]

Estimated variance of \(\hat{\tau}_{pps}\):

\[ \hat{V}(\hat{\tau}_{pps}) = \frac{1}{n(n-1)}\sum_{i=1}^{n}\left(\frac{y_i}{\pi_i} - \hat{\tau}_{pps}\right)^2 \]

Bound on the error of estimation:

\[ 2\sqrt{\hat{V}(\hat{\tau}_{pps})} = 2\sqrt{\frac{1}{n(n-1)}\sum_{i=1}^{n}\left(\frac{y_i}{\pi_i} - \hat{\tau}_{pps}\right)^2} \]

Estimator of the population mean \(\mu\):

\[ \hat{\mu}_{pps} = \frac{1}{N}\hat{\tau}_{pps} = \frac{1}{Nn}\sum_{i=1}^{n}\frac{y_i}{\pi_i} \]

Estimated variance of \(\hat{\mu}_{pps}\):

\[ \hat{V}(\hat{\mu}_{pps}) = \frac{1}{N^2 n(n-1)}\sum_{i=1}^{n}\left(\frac{y_i}{\pi_i} - \hat{\tau}_{pps}\right)^2 \]

Bound on the error of estimation:

\[ 2\sqrt{\hat{V}(\hat{\mu}_{pps})} = 2\sqrt{\frac{1}{N^2 n(n-1)}\sum_{i=1}^{n}\left(\frac{y_i}{\pi_i} - \hat{\tau}_{pps}\right)^2} \]

These estimators are unbiased for any choices of \(\pi_i\), but it is clearly in the best interest of the experimenter to choose these probabilities so that the variances of the estimators are as small as possible. The best practical way to choose the \(\pi_i\)'s is to choose them proportional to a known measurement that is highly correlated with \(y_i\). In the problem of estimating the total number of job openings, firms can be sampled with probabilities proportional to their total work force, which should be known fairly accurately before the sample is selected. The number of job openings per firm is not known before sampling, but it should be highly correlated with the total number of workers in the firm.

Example 10
An investigator wishes to estimate the average number of defects per board on boards of electronic components manufactured for installation in computers. The boards contain varying numbers of components, and the investigator feels that the number of defects should be positively correlated with the number of components on a board. Thus pps sampling is used, with the probability of selecting any one board for the sample being proportional to the number of components on that board. A sample of \(n = 4\) boards is to be selected from the \(N = 10\) boards of one day's production.
The numbers of components on the 10 boards are, respectively, 10, 12, 22, 8, 16, 24, 9, 10, 8, 31. Show how to select \(n = 4\) boards with probabilities proportional to size.

Solution
We list the number of components (our measure of size) in a column and list the cumulative ranges and desirable \(\pi_i\)'s in adjacent columns, as follows:

Board   Number of components   Cumulative range   \(\pi_i\)
1       10                     1-10               10/150
2       12                     11-22              12/150
3       22                     23-44              22/150
4       8                      45-52              8/150
5       16                     53-68              16/150
6       24                     69-92              24/150
7       9                      93-101             9/150
8       10                     102-111            10/150
9       8                      112-119            8/150
10      31                     120-150            31/150

There are 150 components in the population to be sampled. We can think of these components as being numbered from 1 to 150. The cumulative range column keeps track of the interval of numbered components on each board. Board number 1 has the first 10 components, board number 2 has components 11 through 22, and so on. The \(\pi_i\)'s are simply the number of components per board divided by the total number of components. The boards having greater numbers of components have larger probabilities of selection.

To choose the sample of \(n = 4\) boards, we enter the random number table and select four random numbers between 1 and 150. The numbers we selected were 14, 56, 94, and 25. We locate these numbers in the cumulative range column. The boards corresponding to those range intervals constitute the sample. Since 14 lies in the range of board 2, that board enters the sample. Similarly, 56 lies in the range of board 5, 94 lies in the range of board 7, and 25 lies in the range of board 3. Thus the sample consists of boards 2, 3, 5, and 7. These boards have been selected with probabilities proportional to their numbers of components. Note that with this method we could have sampled a particular board more than once.

Example 11
After the sampling of Example 10 was completed, the number of defects found on boards 2, 3, 5, and 7 were, respectively, 1, 3, 2, and 1.
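Estimate the average number of defects per board, and place a bound on the error of estimation.

Solution

\[ \hat{\mu}_{pps} = \frac{1}{Nn}\sum_{i=1}^{n}\frac{y_i}{\pi_i} = \frac{1}{10(4)}\left[1\left(\frac{150}{12}\right) + 3\left(\frac{150}{22}\right) + 2\left(\frac{150}{16}\right) + 1\left(\frac{150}{9}\right)\right] = 1.71 \]

Since \(\hat{\tau}_{pps} = N\hat{\mu}_{pps} = 17.1\),

\[ \hat{V}(\hat{\mu}_{pps}) = \frac{1}{N^2 n(n-1)}\sum_{i=1}^{n}\left(\frac{y_i}{\pi_i} - \hat{\tau}_{pps}\right)^2 = \frac{1}{10^2(4)(3)}\left[\left(\frac{150}{12} - 17.1\right)^2 + \left(3\left(\frac{150}{22}\right) - 17.1\right)^2 + \left(2\left(\frac{150}{16}\right) - 17.1\right)^2 + \left(\frac{150}{9} - 17.1\right)^2\right] = 0.0295 \]

\[ 2\sqrt{\hat{V}(\hat{\mu}_{pps})} = 2\sqrt{0.0295} = 0.34 \]

The estimate of the average number of defects per board, with a bound on the error of estimation, is then 1.71 ± .34, and the interval (1.37, 2.05) provides an approximate 95% confidence interval for the average number of defects per board.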
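
The selection step of Example 10 and the estimate of Example 11 can be checked with a short Python sketch. The function names select_pps and pps_mean_estimate are illustrative and do not come from the text. The sketch assumes sampling with replacement, as in the cumulative-range method, and it reuses the four random numbers quoted in Example 10 rather than drawing new ones.

```python
from bisect import bisect_left
from itertools import accumulate
from math import sqrt


def select_pps(sizes, random_numbers):
    """Map each random number in 1..sum(sizes) to a 1-based unit label.

    The running totals of `sizes` are the upper ends of the cumulative
    ranges, so a number r belongs to the first unit whose upper end is >= r.
    """
    upper_ends = list(accumulate(sizes))
    return [bisect_left(upper_ends, r) + 1 for r in random_numbers]


def pps_mean_estimate(y, pi, N):
    """Return (mu_hat, bound) for a pps sample.

    y  : observed values for the sampled units
    pi : selection probabilities of the same units
    N  : number of units in the population
    """
    n = len(y)
    ratios = [yi / p for yi, p in zip(y, pi)]
    tau_hat = sum(ratios) / n          # estimator of the population total
    mu_hat = tau_hat / N               # estimator of the population mean
    var_mu = sum((r - tau_hat) ** 2 for r in ratios) / (N ** 2 * n * (n - 1))
    return mu_hat, 2 * sqrt(var_mu)


# Data from Examples 10 and 11.
components = [10, 12, 22, 8, 16, 24, 9, 10, 8, 31]
total = sum(components)                               # 150
boards = select_pps(components, [14, 56, 94, 25])     # [2, 5, 7, 3]

defects = {2: 1, 3: 3, 5: 2, 7: 1}                    # defects per sampled board
y = [defects[b] for b in boards]
pi = [components[b - 1] / total for b in boards]

mu_hat, bound = pps_mean_estimate(y, pi, N=len(components))
print(boards)                                # [2, 5, 7, 3]
print(f"{mu_hat:.2f} +/- {bound:.2f}")       # 1.71 +/- 0.34
```

The order in which the boards are drawn does not affect the estimate, because the estimator only sums over the sampled units. If a board were drawn twice, it would simply appear twice in `boards` and contribute twice to the sums.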