Statistics - Haese Mathematics
7 Statistics

Contents:
A Key statistical concepts
B Describing data
C Normal distributions
D The standard normal distribution
E Finding quantiles (k-values)
F Investigating properties of normal distributions
G Distribution of sample means
H Hypothesis testing for a mean
I Confidence intervals for means
J Review

Y:\HAESE\SA_12STU-2ed\SA12STU-2_07\219SA12STU-2_07.CDR Thursday, 2 November 2006 3:10:48 PM PETERDELL SA_12STU-2

INTRODUCTION

The word statistics was introduced into the English language by the Scottish politician Sir John Sinclair (1754 – 1835). He borrowed it from Germany where, as he put it, it meant, “an inquiry for the purpose of ascertaining the political strength of a country”. The meaning he wished to give to the word was an “inquiry into the state of a country, for the purpose of ascertaining the quantum of happiness enjoyed by its inhabitants, and the means of future improvement.” You can still recognise the word “state” in statistics.

Words that are commonly used in Statistics:

• Population: A collection of individuals about which we want to draw conclusions.
• Census: The collection of information from the whole population.
• Sample: A selection of information from a subset of the population.
• Data (singular datum): Information about individuals in a population.
• Parameter: A numerical quantity measuring some aspect of a population.
• Statistic: A quantity calculated from data gathered from a sample. It is usually used to estimate a population parameter.
• Distribution: The pattern of variation of data.

A KEY STATISTICAL CONCEPTS

RANDOM SAMPLES

A population generally consists of a large number of individuals. Because of expense and time factors it is often only practical to select a sample rather than use the whole population.

A random sample is a sample where every individual has the same chance of being selected.
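As a small illustration of the definition of a random sample, the following Python sketch draws a sample in which every individual has the same chance of being selected. The population of 1000 numbered individuals, the sample size of 50 and the function name draw_random_sample are assumptions made for this illustration only; they do not come from the text.

```python
import random


def draw_random_sample(population, sample_size, seed=None):
    """Return a simple random sample drawn without replacement.

    random.Random.sample gives every individual the same chance of
    being selected, which is the defining property of a random sample.
    """
    rng = random.Random(seed)  # a seed only makes the draw repeatable
    return rng.sample(population, sample_size)


# Hypothetical population: individuals labelled 1 to 1000.
population = list(range(1, 1001))
sample = draw_random_sample(population, 50, seed=7)
print(len(sample), "individuals selected")
```

A sample chosen this way does not favour any individual, in contrast to the convenience samples described in the discussion that follows.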
A sampling technique is biased if it tends to systematically select members of the population with certain properties and not select those that do not have these properties. In other words, it favours some individuals above others.

DISCUSSION: SAMPLING

In the following scenarios, can you suggest a likely population? Can you think of any reasons the sampling techniques might be biased?

• People in the local shopping centre on Saturday morning were asked how many computers they have in their household.
• After a program likely to be watched by older people, a television station asked viewers to vote on the use of hand-held phones in cars.
• A local paper advertised for volunteers to test the usefulness of fish oil in a diet.

Many sampling techniques have been developed to avoid bias. In this book it will be assumed that any sample is a random, unbiased sample.

DESCRIPTIVE AND INFERENTIAL STATISTICS

Descriptive statistics are concerned with collecting, summarising and describing the characteristics of data. With descriptive statistics we are only concerned with the data collected and make no effort to generalise it to any other data, such as for the population.

In inferential statistics we select a random sample and we use the information from it to make generalisations about the population from which the sample was taken.

EXAMPLES OF PARAMETERS AND STATISTICS

Recall that: a parameter is a numerical characteristic of a population and a statistic is a numerical characteristic of a sample.

Note: Parameter goes with Population; Statistic goes with Sample.

For example, when examining the mean age of people in retirement villages throughout Australia, the mean age found would be a parameter.
If we took a random sample of 300 people from the population of all retirement village persons, then the mean age would be a statistic.

Example 1

A business is considering purchasing 50 000 blank CDs to make CDs of their new text books. It will make the purchase if no more than 1.5% of the CDs are defective. Because of the expense and time factors in testing all 50 000 CDs the business decides to test a random sample of 600 for defects. They will then use the results of this sample to estimate the percentage of defectives for the population to be purchased.

a What is the population size?
b What is the sample size?
c What population parameter is of interest to the business?
d What statistic is being used to estimate the parameter?

a The population consists of the blank CDs to be purchased, and its size is 50 000.
b The sample size is 600.
c The population parameter being considered is the percentage of CDs which are defective.
d The statistic being used is the percentage of CDs which are defective in the sample. As 1.5% of 600 = 9, the business would make the purchase if 9 or fewer CDs in the sample were found to be defective.

THE PROCEDURE USED IN AN INFERENTIAL PROBLEM

In this course the key application is to examine a random sample in order to make appropriate statements or inferences about the population. Generally speaking there are five steps to address in any inferential problem. They are:

Step 1: State the population we are interested in examining.
Step 2: Collect data from a random sample of sufficient size from the population. Note: What is meant by sufficient size is covered in a later chapter.
Step 3: Examine the relevant information from the sample.
Step 4: Use the results of the sample analysis to make an inference about the population.
Step 5: Give a measure of the reliability of the inference made.

Example 2

For the CD purchase in Example 1 list the procedural steps for the inferential problem.

Step 1: The population consists of all 50 000 CDs.
Step 2: To avoid unnecessary costs and wasting time we must first decide on the sample size. 600 has been decided upon, so we collect 600 data values at random. We record only whether the CD is defective or not.
Step 3: Find the percentage of defective CDs in the sample.
Step 4: The inference will be to provide an estimate of the percentage of defective CDs for the whole population. For example, if 12 CDs are defective in the sample our inference would be that approximately 12/600 = 2% would be defective in the population.
Step 5: The estimate from the sample is not likely to be equal to the exact value for the population. Some indication of the possible error for the estimate should therefore be given.

An example of such a statement as in Step 5 is:

If we had many shipments of 50 000 CDs and in each we found that 12 in a sample of 600 were defective, then in 95% of these shipments there would be between 440 and 1560 defective CDs.

This type of statement is usually condensed to:

We are 95% confident that about 440 to 1560 CDs are defective.

The main thrusts of this course are to:

• determine confidence intervals in which a certain population parameter should lie at a particular level of confidence (commonly 90%, 95%, 99%)
• devise and use particular tests of hypotheses about population means
• determine what sample sizes should return a particular level of confidence in given situations.
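The text does not say at this stage how the figures 440 and 1560 in Example 2 were obtained. The Python sketch below is one plausible reconstruction: it assumes the usual normal approximation for a sample proportion with the multiplier 1.96 for 95% confidence. Both the method and the function name defective_count_interval are assumptions made for illustration; under these assumptions the calculation reproduces the quoted range.

```python
import math


def defective_count_interval(defective, sample_size, population_size, z=1.96):
    """Approximate interval for the number of defectives in the population.

    Assumes the normal approximation for a sample proportion:
    p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n), with z = 1.96 for 95%.
    """
    p_hat = defective / sample_size
    standard_error = math.sqrt(p_hat * (1 - p_hat) / sample_size)
    low = max(0.0, p_hat - z * standard_error)
    high = min(1.0, p_hat + z * standard_error)
    # Scale the proportions up to counts for the whole shipment.
    return low * population_size, high * population_size


low, high = defective_count_interval(12, 600, 50_000)
print(round(low), round(high))  # 440 1560
```

Confidence intervals are treated properly later in the chapter; this sketch is only meant to show that the statement in Step 5 comes from a calculation on the sample, not from inspecting the whole shipment.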
EXERCISE 7A

1 A new drug called Cobrasyl, a derivative of cobra venom, is to be approved for the treatment of high blood pressure in humans. A research team treats 127 high blood pressure patients with the drug and in 119 cases it reduces their blood pressure to an acceptable level.
a What is the sample of interest?
b What is the population of interest?

2 In 2006, 800 computer workers throughout Australia were surveyed and asked a question. The question was: “Is your main interest in developing software or in using already developed software?” 83% said that developing software was their main interest.
a What is the population of interest?
b What is the parameter of interest?
c What statistic is used to estimate the parameter?

3 A South Australian processor of seafood needs to estimate the average weight of a prawn in a catch. A sample of 352 prawns was selected and found to have an average weight of 53.8 grams.
a What is the population the processor is interested in?
b What is the parameter of interest?
c What statistic does the processor use to estimate the parameter?

4 Last December Tina visited four supermarkets A, B, C and D on the same day. She recorded the price per kilogram of various fruits in the table below:

Store   Oranges   Apples   Bananas
A       $2.35     $2.15    $1.70
B       $2.45     $2.55    $2.00
C       $2.50     $2.60    $2.10
D       $2.25     $2.05    $1.90

Determine whether the following statements are descriptive or inferential:
a In this city, bananas are cheaper than oranges.
b If you buy a kilogram of each of the three different types of fruit from the one store, you pay the same total amounts at stores A and D.
c Of the four stores, the store with the most expensive apples also had the most expensive oranges and bananas.
d In general, store C has the most expensive fruit.
e Of the four stores, store C has the most expensive fruit.
(Careful! What is the population and what is the sample?)

B DESCRIBING DATA

This section reviews the main concepts from Year 11 so that students can reacquaint themselves with the terminology used in statistics.

A variable is a quantity that can have different values for different individuals in the population.

Since variables are sometimes used to describe random processes, they are often called random variables. Variables are usually denoted by capital letters such as X. Individual values, called observations or outcomes, are denoted by lower case letters such as x.

We shall deal with two types of variables: categorical and quantitative.

A categorical or nominal variable can be described by a quality or characteristic that is essentially non-numeric. Individuals are described by different categories.

Examples of categorical data are:

Variable                                      Possible values
• X is the gender of a person                 x = male or female
• C is the type of motor car                  c = Holden, Ford, Toyota
• M is the membership of a political party    m = ALP, LIB, DEM

A quantitative or numerical variable takes numerical values.

There are essentially two different types of numerical variable.

A numerical discrete variable takes discrete number values only. It is often a result of counting.

Examples of discrete variables are:

Variable                                        Possible values
• X is the number of people in a household      x = 1, 2, 3, 4, ......
• T is the mark out of 10 for a test            t = 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10

A numerical continuous variable can take any numerical value in an interval. A continuous variable is often a result of measuring.
Examples of continuous variables are:

Variable                                                        Possible values
• W is the weight of newborn babies                             w is likely to be in the interval from 0.5 kg to 5 kg.
• X is the amount of water in a 500 litre rain water tank       x is any volume between 0 and 500 litres.

Since continuous variables take on values in intervals, they are also called interval variables.

The essential difference between a categorical and a quantitative variable is that we can do arithmetic with quantitative variables, but not with categorical variables. In this book we are mainly concerned with the mean and the standard deviation.

THE MEAN AND STANDARD DEVIATION (REVIEW)

The mean of a sample of n numbers, x_1, x_2, ......, x_n is:

$\bar{x} = \dfrac{x_1 + x_2 + \dots + x_n}{n} = \dfrac{1}{n}\sum_{i=1}^{n} x_i$

The Greek letter Σ (sigma) is used to denote the summation of numbers, so $\sum_{i=1}^{n} x_i = x_1 + x_2 + \dots + x_n$ (read “the sum of all x_i for i = 1 to n”).

The endpoints of the summation, i = 1 to n, are sometimes omitted, so the mean can be written as $\frac{1}{n}\sum x_i$ or even $\frac{1}{n}\sum x$.

The mean of a population is usually denoted by the Greek letter μ (mu), so $\mu = \frac{1}{n}\sum x$.

We can get a much clearer picture of a data set if, in addition to having a measure for the centre, we also have an indication of how the data is spread.

For example, the mean weight of oranges from a particular orchard and the mean weight of salt bagged by a machine may both be 500 grams, but the variation in the weights of oranges is likely to be much greater than that of bags of salt. The data for oranges will therefore have a greater spread.

The most commonly used measure of spread about the mean is the standard deviation.

The standard deviation of a sample is a little different from the standard deviation of a population.
In a sample of size n, the sample standard deviation, usually denoted by s, is:

$s = \sqrt{\dfrac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1}} = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n - 1}}$

In a population of size n, the population standard deviation, usually denoted by the Greek letter σ (sigma), is:

$\sigma = \sqrt{\dfrac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + \dots + (x_n - \mu)^2}{n}} = \sqrt{\dfrac{\sum (x_i - \mu)^2}{n}}$

The reason for this difference is rather technical and, at this stage, we do not attempt to explain it. Statisticians know that dividing by n − 1 makes s², as calculated from the above formula, an unbiased estimate of the population variance σ². Notice that for large n, dividing by n − 1 or by n gives virtually the same value.

The mean and standard deviation can also be calculated from frequency tables. The frequency f_i of a quantity x_i is the number of times it occurs.

For a population of size n, the formulae for the mean and standard deviation become:

$\mu = \dfrac{f_1 x_1 + f_2 x_2 + f_3 x_3 + \dots + f_k x_k}{n}$ and $\sigma = \sqrt{\dfrac{(x_1 - \mu)^2 f_1 + (x_2 - \mu)^2 f_2 + \dots + (x_k - \mu)^2 f_k}{n}}$

Notice that $\mu = \left(\dfrac{f_1}{n}\right) x_1 + \left(\dfrac{f_2}{n}\right) x_2 + \left(\dfrac{f_3}{n}\right) x_3 + \dots + \left(\dfrac{f_k}{n}\right) x_k$.

$\dfrac{f_i}{n}$ is the proportion of x_i in the population. For large values of n, the experimental probability p_i of randomly selecting x_i from the population is taken to be $p_i = \dfrac{f_i}{n}$.

So, using $p_i = \dfrac{f_i}{n}$,

$\mu = p_1 x_1 + p_2 x_2 + p_3 x_3 + \dots + p_k x_k = \sum p_i x_i$.

Similarly for the population standard deviation:

$\sigma = \sqrt{\left(\dfrac{f_1}{n}\right)(x_1 - \mu)^2 + \left(\dfrac{f_2}{n}\right)(x_2 - \mu)^2 + \dots + \left(\dfrac{f_k}{n}\right)(x_k - \mu)^2}$

which leads to $\sigma = \sqrt{\sum p_i (x_i - \mu)^2}$.

Example 3

A magazine store claims 23% of its customers purchase one magazine, 38% purchase two, 21% purchase three, 13% purchase four, and 5% purchase five. Find the mean and the standard deviation of X, the number of magazines sold to a customer.
The probability table is:

x_i   0      1      2      3      4      5
p_i   0.00   0.23   0.38   0.21   0.13   0.05

Now $\mu = \sum p_i x_i = 0.23 \times 1 + 0.38 \times 2 + 0.21 \times 3 + 0.13 \times 4 + 0.05 \times 5 = 2.39$

i.e., in the long run, the average number purchased per customer is 2.39.

Also, $\sigma = \sqrt{\sum p_i (x_i - \mu)^2} = \sqrt{0.23 \times (1 - 2.39)^2 + 0.38 \times (2 - 2.39)^2 + \dots + 0.05 \times (5 - 2.39)^2} \approx 1.12$

Example 4

‘Cheap Car Insurance’ insures used cars valued at $6000 under these conditions.
A $6000 will be paid to the owner for total loss
B for damage between $3000 and $5999, $3500 will be paid
C for damage between $1500 and $2999, $1000 will be paid
D for damage less than $1500, nothing will be paid.
From statistical information the insurance company knows that in any year the probabilities of A, B, C and D are 0.03, 0.12, 0.35 and 0.50 respectively.
If the company wishes to receive $80 more than its expected payout on each policy, what should it charge for the policy?

Let X be the random variable of payouts, so the probability table is:

x_i   0      1000   3500   6000
p_i   0.50   0.35   0.12   0.03

The expected payout is the mean, μ, and
$\mu = \sum p_i x_i = (0.50) \times 0 + (0.35) \times 1000 + (0.12) \times 3500 + (0.03) \times 6000 = 950$

The company expects to pay out $950 on average in the long run, so it should charge $950 + $80 = $1030.

EXERCISE 7B

1 Australian crayfish is exported to Asian markets. The buyers are prepared to pay high prices when the crayfish arrive still alive. If X is the number of deaths per dozen crayfish, the probability function for X is given by:

x_i      0      1      2      3      4      5      >5
P(x_i)   0.54   0.26   0.15   0.03   0.01   0.01   0.00

a What is the mean number of deaths per dozen crayfish?
b Find σ, the standard deviation for the probability distribution.
2 A random variable X has probability function given by $P(x) = k(0.4)^x (0.6)^{3-x}$ for x = 0, 1, 2, 3.
a Find P(x) for x = 0, 1, 2 and 3 and hence find k.
b Find the mean and standard deviation for the distribution.

3 An insurance policy covers a $20 000 sapphire ring against theft and loss. If it is stolen the insurance company will pay the policy owner in full. If it is lost they will pay the owner $8000. From past experience the insurance company knows that the probability of theft is 0.0025 and of being lost is 0.03. How much should the company charge to cover the ring if they want a $100 expected return?

4 Use technology to find the mean and standard deviation of the two samples, A and B, of weights given in grams.

A
498.8  500.2  500.4  499.9  500.4  500.6  498.9  498.2  500.1  501.9
500.8  498.6  499.7  498.6  499.0  498.8  499.1  500.7  500.7  501.3
501.1  501.5  499.0  499.7  498.4  501.1  500.1  499.9  500.9  499.2

B
545.5  543.4  399.8  511.3  616.3  496.7  337.8  650.2  426.3  522.2
664.0  415.1  416.0  425.4  419.9  503.7  427.8  474.2  459.9  390.5
428.5  451.9  590.1  613.5  402.3  318.3  478.1  502.2  626.4  435.7

Which of the samples is the weights of bags of salt, and which is the weights of oranges?

5 Test marks out of 10 are recorded in the following frequency table:

Mark        0   1   2   3   4   5   6    7    8   9   10
Frequency   2   1   0   4   5   8   12   15   7   3   5

a Find the mean and standard deviation of these scores.
b Calculate the percentage difference between using the formulae for population standard deviation and sample standard deviation.

6 Using $\sigma^2 = \sum p_i (x_i - \mu)^2$ show that $\sigma^2 = \sum p_i x_i^2 - \mu^2$.
(Hint: $\sigma^2 = \sum p_i (x_i - \mu)^2 = p_1 (x_1 - \mu)^2 + p_2 (x_2 - \mu)^2 + \dots + p_n (x_n - \mu)^2$. Expand σ² and regroup the terms.)
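The formulae μ = Σ p_i x_i and σ = √(Σ p_i (x_i − μ)²) used in Examples 3 and 4 can be checked with a few lines of Python. This is only a sketch; the function name mean_and_sd is chosen for illustration, and the data are the probability tables from the two worked examples.

```python
import math


def mean_and_sd(values, probabilities):
    """Mean and standard deviation of a discrete probability distribution.

    mu    = sum of p_i * x_i
    sigma = square root of the sum of p_i * (x_i - mu)^2
    """
    mu = sum(p * x for x, p in zip(values, probabilities))
    variance = sum(p * (x - mu) ** 2 for x, p in zip(values, probabilities))
    return mu, math.sqrt(variance)


# Example 3: number of magazines purchased per customer.
mu_magazines, sigma_magazines = mean_and_sd(
    [0, 1, 2, 3, 4, 5], [0.00, 0.23, 0.38, 0.21, 0.13, 0.05]
)

# Example 4: insurance payouts in dollars.
mu_payout, _ = mean_and_sd([0, 1000, 3500, 6000], [0.50, 0.35, 0.12, 0.03])

print(round(mu_magazines, 2), round(sigma_magazines, 2))  # 2.39 1.12
print(round(mu_payout + 80))  # premium in dollars: 1030
```

The same function can be applied to any of the probability tables in Exercise 7B once the probabilities are known.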
C NORMAL DISTRIBUTIONS

Many quantities reflect the combined effect of a large number of random factors. For example:

• The yield of a wheat plant is the combined result of many unpredictable factors such as genes, rainfall, sunshine, and its position in the field where it was seeded.
• The weight of a packet of sultanas is the sum of the weights of each individual sultana, and it is unlikely a packet labelled as 1 kg will weigh exactly 1 kg.

DISCUSSION: THE EFFECT OF RANDOM FACTORS

• Consider at least three factors that affect each of the following:
  a the weight of a newly born piglet
  b the time to complete an assignment
  c the mark achieved in an examination
  d the number of goals scored in a netball match.
• For each of the above random variables, suggest why the distribution might be
  a symmetric
  b bell shaped.

The next investigation explores the distribution of a quantity that is the combined result of different factors.

INVESTIGATION 1: SOME PROPERTIES OF A NORMAL DISTRIBUTION

Consider the time it takes Les to walk home from school. We have broken this into the following stages with the time it takes to complete each stage:

Stage   What is happening                        Time
1       Cross the road in front of the school    up to 1 minute
2       Walk to the shopping centre              5 ± 2 minutes
3       Walk through the shopping centre         3 ± 2 minutes
4       Cross a road                             up to 1 minute
5       Buy a loaf of bread                      up to 2 minutes
6       Talk with a friend                       up to 2 minutes
7       Walk the remaining distance home         2 ± 1 minutes

Question: According to the table, what is the longest time it may take Les to walk home? What is the shortest time?

If Les wanted to study the distribution of the time it takes to walk home, he could keep a daily record, but the amount of data collected would be very small.
Les could also use the information given in the table and use a spreadsheet or a calculator to simulate the time it takes to walk home. The following instructions are set up for a spreadsheet, but the procedure will also work on a calculator.

What to do:

1 Open the spreadsheet “Normal distribution”. A spreadsheet with the following headings will appear.

2 In each of the cells A2 to G2, under the headings ‘Stage 1’ to ‘Stage 7’, type in the formulae shown in the table. Do not forget to start each formula with an = sign.
Note: rand() calculates a random number between 0 and 1.
Question: What does 5 + (4*rand() − 2) calculate?

3 In cell N2, below the heading ‘Total time’, type in the formula =sum(A2:M2)
Question: What does this formula calculate?

4 Drag the formulae in cells A2 to N2 down to fill all cells A251 to N251. Pressing the F9 function key will produce another random sample.
The numbers in cell P2 under the heading ‘Mean’, and in cell Q2 under the heading ‘Standard Deviation’, are the mean and standard deviation of the numbers in cells N2 to N251.
The number in cell R2 under the heading ‘No. within 1 st. dev.’ gives the number of values within 1 standard deviation of the mean. For example, if the mean x̄ = 12.96 and the standard deviation s = 1.82, then this cell gives the number of values that lie between x̄ − s = 11.14 and x̄ + s = 14.78.
Similarly, the numbers in cells S2 and T2 give the number of values within 2 and 3 standard deviations of the mean respectively.
The graph that appears is the histogram of data in cells N2 to N251.
If you are having difficulty setting up this spreadsheet, click on the tag ‘Normal 2’ to open a finished version.
5 Calculate the proportion of data values within each interval. For example, if there are 169 values within 1 standard deviation of the mean, the proportion of values in the interval = 169/250 = 0.676.

6 Copy and fill in the following table for 5 different samples. The entries of the first line may not agree with your values.

Sample no.   Mean x̄   Stdev s   x̄ − s to x̄ + s     x̄ − 2s to x̄ + 2s   x̄ − 3s to x̄ + 3s
                                 Count   Propn.      Count   Propn.      Count   Propn.
1            12.96     1.82      169     0.676
2
3
4
5

What do you notice about the proportions of data in each of the intervals?

In the following we change the value of the factors and then add more factors.

7 Change the formulae in cells A2 to G2 as shown in the table.

8 Repeat steps 4 to 6.

9 Add the following formulae in cells H2 to M2:

10 Repeat steps 4 to 6.

From Investigation 1 you should have discovered that changing the number and values of factors may change the mean and standard deviation, but leaves the following unchanged:

• The shape of the histogram is symmetric about the mean.
• Approximately 68% of the data lies between 1 standard deviation below the mean and 1 standard deviation above the mean.
• Approximately 95% of the data lies between 2 standard deviations below and 2 standard deviations above the mean.
• Approximately 99.7% of the data lies between 3 standard deviations below and 3 standard deviations above the mean.

Note: It is a rare event for an outcome to lie more than 3 standard deviations from the mean, that is, outside the range from μ − 3σ to μ + 3σ. In a sample of 1000, you would only expect about 3 such cases.

A smooth curve drawn through the midpoints of each column of the histogram would ideally look like the graph displayed.
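The spreadsheet simulation of Investigation 1 can also be sketched in Python. The spreadsheet formulae themselves are not reproduced in this text, so the sketch assumes that “up to 1 minute” means a uniform random time between 0 and 1 and that “5 ± 2 minutes” means a uniform random time between 3 and 7, in line with the formula 5 + (4*rand() − 2) quoted in step 2. The names simulate_walks and summarise are chosen for illustration.

```python
import random
import statistics

# (low, high) limits in minutes for the seven stages of the walk home.
# Assumption: each stage time is uniformly distributed between its limits.
STAGES = [(0, 1), (3, 7), (1, 5), (0, 1), (0, 2), (0, 2), (1, 3)]


def simulate_walks(n, seed=None):
    """Simulate n total walking times, one per row of the spreadsheet."""
    rng = random.Random(seed)
    return [sum(rng.uniform(low, high) for low, high in STAGES) for _ in range(n)]


def summarise(times):
    """Return the mean, the sample standard deviation and the proportions
    of values within 1, 2 and 3 standard deviations of the mean."""
    mean = statistics.mean(times)
    s = statistics.stdev(times)
    proportions = [
        sum(abs(t - mean) <= k * s for t in times) / len(times) for k in (1, 2, 3)
    ]
    return mean, s, proportions


times = simulate_walks(250, seed=2006)
mean, s, props = summarise(times)
print(f"mean = {mean:.2f}, s = {s:.2f}")
print("proportions within 1, 2, 3 st. dev.:", [round(p, 3) for p in props])
```

Changing the seed plays the role of pressing F9 in the spreadsheet. The individual values change from run to run, but the three proportions should stay close to 68%, 95% and 99.7%.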
[Figure: normal curve with μ − σ, μ and μ + σ marked on the horizontal axis. The curve is concave between the two points of inflection and convex outside them.]

Note the points of inflection at μ − σ and μ + σ.

The above information is typical of a family of normal distributions. Curves with this shape are known as normal curves. Because of their characteristic shape, they are also called bell-shaped curves.

[Figure: normal curve with the horizontal axis marked at μ − 3σ, μ − 2σ, μ − σ, μ, μ + σ, μ + 2σ and μ + 3σ. From left to right, the areas of the regions are 0.15%, 2.35%, 13.5%, 34%, 34%, 13.5%, 2.35% and 0.15%.]

Variables which are the combined result of many random factors are often approximately normal.

The normal variable X with mean μ and standard deviation σ is denoted by X ∼ N(μ, σ²).

CONTINUOUS PROBABILITY DENSITY FUNCTIONS

For any distribution of data, whether it is a normal distribution or not, the function whose smooth curve approximates the histogram of the data is called a probability density function or pdf.

If the variable X is normally distributed, N(μ, σ²), the probability density function is

$f(x) = \dfrac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}$.

Probability density functions f have the following properties:

• f(x) > 0 for all values of x.
• The area between the graph of f and the horizontal axis is 1, since the total of all probabilities is 1.
• The proportion of outcomes of the variable X between the values a and b is the area between the graph of f and the horizontal axis for a ≤ x ≤ b.

Notice that: $\Pr(a \leq X \leq b) = \int_a^b f(x)\,dx$

For a continuous variable X, the probability that X is exactly equal to a particular value a is zero.

For example, the probability an egg will weigh exactly 72.9 g is zero.

If you were to weigh an egg on scales that weigh to the nearest 0.1 g, a weight of 72.9 g means the weight lies somewhere between 72.85 and 72.95 grams. Presumably an egg has to weigh something, and it could be 72.9 grams, but you will never know.
No matter how accurate your scales are, you can only ever know the weight of an egg within a range.

So, for a continuous variable we can only talk about the probability an event lies in an interval.

Notice that: if X is continuous, Pr(a ≤ X ≤ b), Pr(a < X ≤ b), Pr(a ≤ X < b) and Pr(a < X < b) all have the same value. Why? This would not be correct if X were discrete.

Example 5

The chest measurements of 18 year old male footballers are normally distributed with a mean of 95 cm and a standard deviation of 8 cm.
a Find the percentage of randomly chosen footballers with chest measurements between:
  i 87 cm and 103 cm
  ii 103 cm and 111 cm
b Find the probability of randomly choosing a footballer with a chest measurement between 87 cm and 111 cm.

For the distribution of chest measurements, the mean μ = 95 cm and the standard deviation σ = 8 cm.

[Figure: normal curve with 87, 95, 103 and 111 cm marked at μ − σ, μ, μ + σ and μ + 2σ. The areas of the three regions between these values are 34%, 34% and 13.5%.]

a i We need the percentage between μ − σ and μ + σ. This is 68%.
  ii We need the percentage between μ + σ and μ + 2σ. This is 13.5%.
b The percentage between μ − σ and μ + 2σ is 68% + 13.5% = 81.5%.
  So the probability is 0.815.

EXERCISE 7C

1 What is the probability that a normally distributed value lies between:
a 1σ below the mean and 1σ above the mean
b the mean and the value 1σ above the mean
c the mean and the value 2σ below the mean
d the mean and the value 3σ above the mean?

2 Suppose the heights of 16 year old male students are normally distributed with a mean of 170 cm and a standard deviation of 8 cm. Find the percentage of male students whose height is:
a between 162 cm and 170 cm
b between 170 cm and 186 cm.
Find the probability that a student from this group has a height:
c between 178 cm and 186 cm
d less than 162 cm
e less than 154 cm
f greater than 162 cm.

3 The time T minutes it takes Charlotte to go to work is normally distributed with mean 50 minutes and standard deviation of 5 minutes. Every morning Charlotte leaves for work at 8 am.
a If work starts at 9 am, what is the probability Charlotte will be late for work?
b If Charlotte works 250 days a year, how many times can she expect to be late?

4 Explain why each of the following variables might be normally distributed:
a the chest size of 18 year old Australian males
b the length of adult female sharks
c the protein content of each kilogram of corn grown in the same field.

5 A farmer has a flock of 237 crossbred lambs. The mean weight of the flock is 35 kg with a standard deviation of 2 kg.
a Explain why the weights of the lambs might be normally distributed.
b If lambs weighing between 33 kg and 39 kg are suitable for export, how many lambs in this flock could the farmer expect to be able to export?

6 The weights of hens’ eggs are normally distributed with mean 65 grams and standard deviation 6 grams.
a Determine the probability that a randomly selected egg has weight
  i greater than 53 g
  ii less than 71 g
  iii between 59 g and 77 g.
b In one week the hens lay 1286 eggs. How many of these eggs are expected to be
  i greater than 53 g
  ii less than 71 g
  iii between 59 g and 77 g?

7 The marks for a geography examination are normally distributed with mean 65 and standard deviation 11.
a A geography student is chosen at random. Determine the probability that the student scored
  i less than 76 marks
  ii between 43 and 76 marks.
b If the top 16% of students receive an A grade, what was the minimum mark for an A?
c If 2582 students sit for the examination, how many of them would be expected to score less than 32 marks?
8 The weights of Jason’s oranges are normally distributed. 84% of the crop weigh more than 152 grams and 16% weigh more than 200 grams.
a Find μ and σ for the crop.
b What proportion of the oranges weigh between 152 grams and 224 grams?

9 The heights of 13 year old boys are normally distributed. 97.5% of them are above 131 cm and 2.5% are above 179 cm.
a Find μ and σ for the height distribution.
b A 13 year old boy is randomly chosen. What is the probability that his height lies between 143 cm and 191 cm?

10 Using the same set of axes, quickly sketch the graphs of the density functions for each of the following distributions:
a N(0, 3²)
b N(0, (0.5)²)
c N(−5, 1²)
d N(3, 0.25).

11 Each of the following is a graph of a normal distribution with different vertical scales:

[Figure: three normal curves A, B and C. The x-axis of A is marked −2.5, −2, −1.5; the x-axis of B is marked −20, −10, 0, 10, 20; the x-axis of C is marked −4, −2, 0, 2, 4.]

a Write down the mean μ for each of these distributions.
b Which of the distributions has standard deviation
  i σ = 0.1
  ii σ = 1
  iii σ = 10?
c Which of the distributions has the largest spread?

D THE STANDARD NORMAL DISTRIBUTION

For each value of μ and σ there is a different normal distribution N(μ, σ²). As illustrated by Investigation 1, all normal distributions have one important property in common: the probability of an event occurring depends only on the number of standard deviations the event is from the mean.

If x is an observation from a normal distribution with mean μ and standard deviation σ, the z-score of x is the number of standard deviations x is from the mean.

The diagram shows how the z-score is related to a normal curve.
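Since the z-score counts standard deviations from the mean, it can be computed as z = (x − μ)/σ. The Python sketch below applies this to the chest measurements of Example 5 (μ = 95 cm, σ = 8 cm). It also uses statistics.NormalDist from the Python standard library (Python 3.8 or later) to show that the exact probability is close to the value 0.815 obtained from the rounded percentages. The function name z_score is chosen for illustration.

```python
from statistics import NormalDist


def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu."""
    return (x - mu) / sigma


# Example 5: chest measurements with mean 95 cm and standard deviation 8 cm.
z_upper = z_score(111, 95, 8)  # 2 standard deviations above the mean
z_lower = z_score(87, 95, 8)   # 1 standard deviation below the mean

# Probability of a measurement between 87 cm and 111 cm.
chest = NormalDist(mu=95, sigma=8)
probability = chest.cdf(111) - chest.cdf(87)

print(z_lower, z_upper)
print(round(probability, 3))
```

The small difference from 0.815 arises because the percentages 34% and 13.5% used in the text are rounded values.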
[Figure: normal distribution curve. The horizontal axis is labelled with the actual scores μ − 3σ, μ − 2σ, μ − σ, μ, μ + σ, μ + 2σ, μ + 3σ and with the corresponding z-scores −3, −2, −1, 0, 1, 2, 3. From left to right the band areas are 0.15%, 2.35%, 13.5%, 34%, 34%, 13.5%, 2.35%, 0.15%.]
z-scores are particularly useful when comparing two measurements made using different μ and σ. But be careful! These comparisons will only be reasonable if both measurements are approximately normal.

Example 6
The local school has kept records of all its athletics competitions. It was found that the time, in minutes, to run the men’s 800 metres was normally distributed as N(3.4, (0.2)²). The women’s long jump, in metres, was normally distributed as N(4.3, (0.4)²). In 1980 John won the 800 metre race with a time of 3.2 minutes. In 2006 his daughter Anne came second in the long jump with a distance of 5.1 m.
a i Sketch the graphs of the two distributions using the same scale for the z-scores from −3 to +3.
ii Put the actual times/distances below each of the z-scores on the graphs.
iii Calculate the z-scores for John and Anne, and mark these on the graphs.
iv Shade the area under the respective graphs to represent performances that were better than those of John and Anne.
b Of all the students who participated in these two events, what proportion would have performed better than i John ii Anne?
c If 1000 students had participated in each of these two events, how many would have performed better than i John ii Anne?
d Of the father and daughter, who had the better result?
a i/ii/iv
[Figure: John’s time. Normal curve with band areas 0.15%, 2.35%, 13.5%, 34%, 34%, 13.5%, 2.35%, 0.15%; the region labelled “better than John” is shaded.]
z-score:             −3    −2    −1    0     1     2     3
actual time (min):   2.8   3.0   3.2   3.4   3.6   3.8   4.0
[Figure: Anne’s distance. Normal curve with the same band areas; the region labelled “better than Anne” is shaded.]
z-score:               −3    −2    −1    0     1     2     3
actual distance (m):   3.1   3.5   3.9   4.3   4.7   5.1   5.5
iii John’s time was 3.2 − 3.4 = −0.2 minutes from the mean. Since the standard deviation is 0.2 minutes, John ran the 800 metres in a time of 1 standard deviation less than the mean. The z-score of John’s performance is −1.
The distance Anne jumped was 5.1 − 4.3 = 0.8 m above the mean. Since the standard deviation is 0.4 metres, Anne jumped a distance of 2 standard deviations above the mean. The z-score of Anne’s performance is +2.
b i The proportion less than μ − σ is 0.16, so 16% of all participants performed better than John.
ii The proportion greater than μ + 2σ is 0.025, so only 2.5% of all participants performed better than Anne.
c i Of 1000 participants, 16% of 1000 = 160 were better than John.
ii 2.5% of 1000 = 25 were better than Anne; one of these happened to be competing on the same day as Anne.
d Anne’s long jump was more outstanding than her father’s 800 metre race.

EXERCISE 7D.1
1 In a year 12 class, the marks for a Geography test marked out of 50 were normally distributed with mean of 34 and standard deviation of 6. The marks for an English essay out of 20 were normally distributed with a mean of 12 and standard deviation of 1.5. Val received a mark of 40 for her Geography and 15 for her English essay.
a Sketch the graphs of the two distributions below one another using the same scale for the z-scores from −3 to +3. Put the actual marks below each z-score on the graph.
b For which of the two subjects did Val receive the higher % mark?
c Calculate the z-score for each of Val’s results.
i Mark these z-scores on the two graphs.
ii Shade the region on the two graphs of scores which were better than Val’s.
d What proportion of the students performed better than Val in Geography, and what proportion performed better than Val in English?
e If there were 32 students in the class, how many performed better than Val in Geography and how many in English?
f In which of these two assessments did Val perform better?
2 Suppose that the weight W of bags of sugar filled by a machine is normally distributed with mean μ = 504 grams and standard deviation σ = 2 grams. A quality controller rejects any bags of sugar with weight less than 500 grams. Across town, the weight A of bags of apples filled by an assistant in a green grocer shop is normally distributed with mean weight 5 kilograms and standard deviation 500 grams. Bags weighing less than 4½ kg are rejected by a quality controller.
a Sketch the graphs of the two distributions below one another using the same scale for the z-scores from −3 to +3. Put the actual weights below each z-score on the graph.
b Calculate the z-score for each of the two quality controls, and shade in the regions corresponding to the weights of bags that are rejected.
c Which of the two quality controllers is the more stringent, i.e., rejects the larger proportion of bags?

Example 7
Suppose examination scores are normally distributed with mean mark μ = 63 and standard deviation σ = 12 marks.
a What is the z-score for a mark of 80?
b If Hua’s z-score is −1.5, what is Hua’s actual score?
a A mark of 80 is 80 − 63 = 17 above the mean.
Since the standard deviation is 12, this is $\frac{17}{12} \approx 1.42$ standard deviations above the mean. So, the z-score is 1.42.
b Hua’s mark is −1.5 standard deviations from the mean. Since the standard deviation is 12, this is 12 × (−1.5) = −18 marks from the mean. Since the mean is 63, Hua’s mark is 63 + (−18) = 45.
[Figure: two normal curves, one marking the score of 80 and one marking Hua’s mark, each drawn over the axis below.]
z-score:       −3   −2   −1   0    1    2    3
actual mark:   27   39   51   63   75   87   99
3 Suppose the distribution of the diameter (in cm) of oranges from a tree is N(10, 2²).
a Sketch a graph of the distribution that displays both the actual diameters as well as the z-score along the horizontal axis.
b Find the z-score for each of the following diameters: i 12 cm ii 9 cm iii 13 cm
c Oranges are to be dumped if their diameters have a z-score of less than −2. What is the diameter of oranges that are to be dumped?
d If there are 120 oranges on the tree, how many will be dumped?
4 The volume of milk cartons filled by a machine is normally distributed with mean 504 mL and standard deviation of 1.5 mL.
a What is the z-score of a carton containing 506 mL of milk?
b What is the volume of milk in a carton with a z-score of −1.5?

If x is an observation from a normal distribution with mean μ and standard deviation σ, the z-score of x can be calculated from the formula $z = \frac{x - \mu}{\sigma}$.
If the variable X is normally distributed with mean μ and standard deviation σ, then $Z = \frac{X - \mu}{\sigma}$ is called the standard normal distribution. The variable Z is the number of standard deviations X is from the mean.
Notice that, if x = μ then z = 0 and if x = μ + σ then z = 1.
Hence, the mean of Z is 0 and the standard deviation of Z is 1.
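The z-score formula and its inverse can be checked numerically. The following is a minimal Python sketch and is not part of the original text; the function names `z_score` and `value_from_z` are illustrative assumptions. It reworks the two parts of Example 7.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu."""
    return (x - mu) / sigma


def value_from_z(z, mu, sigma):
    """Actual measurement that corresponds to a given z-score."""
    return mu + z * sigma


# Example 7: examination marks with mu = 63 and sigma = 12
print(round(z_score(80, 63, 12), 2))  # z-score of a mark of 80
print(value_from_z(-1.5, 63, 12))     # Hua's actual mark
```

Writing the two directions as separate functions mirrors the two kinds of question in the exercises: from a measurement to a z-score, and from a z-score back to a measurement.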
Example 8
Find the probability that the standard normal distribution Z lies between −2 and 1.
The graph of the Z-distribution is shown:
[Figure: standard normal curve with z from −3 to 3; the shaded bands are 13.5% (from −2 to −1), 34% (from −1 to 0) and 34% (from 0 to 1).]
The probability Z lies between −2 and 1 is the proportion of observations that lie between 2 standard deviations to the left of the mean and 1 standard deviation to the right of the mean. This is about 0.815.

EXERCISE 7D.2
1 The table shows Emma’s midyear exam results. The exam results for each subject are normally distributed with mean μ and standard deviation σ shown in the table.
a Find the z-score for each of Emma’s subjects.
b Arrange Emma’s subjects from ‘best’ to ‘worst’ in terms of the z-scores.
Subject       Emma’s score   μ    σ
English       12             10   1.1
Chinese       27             20   3.0
Geography     84             55   18
Biology       34             25   10
Mathematics   84             50   15
2 Calculate the following probabilities. In each case sketch the graph of the Z-distribution, shading in the region of interest.
a Pr(−1 < Z < 1) b Pr(−1 < Z < 3) c Pr(−1 < Z < 0) d Pr(Z < 2) e Pr(−1 < Z) f Pr(Z > 1)

USING TECHNOLOGY TO FIND PROBABILITIES
So far we have only used integer z-scores to calculate probabilities. By refining the methods used in Investigation 1 we can calculate probabilities for other z-scores. To see how to use your calculator to do this, click on the icon.
When working with normal distributions, you are advised to sketch a graph of the normal distribution and shade in the areas of interest.
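For readers working without a graphics calculator, probabilities for the standard normal distribution can also be approximated in software. The sketch below is an illustration only: it assumes Python 3.8 or later and uses `NormalDist` from the standard library `statistics` module, and the helper name `prob_between` is hypothetical. It recomputes the probability from Example 8.

```python
from statistics import NormalDist

Z = NormalDist(mu=0, sigma=1)  # standard normal distribution


def prob_between(dist, a, b):
    """Pr(a <= X <= b), found as a difference of cumulative probabilities."""
    return dist.cdf(b) - dist.cdf(a)


# Example 8: probability that Z lies between -2 and 1
p = prob_between(Z, -2, 1)
print(round(p, 4))
```

Under these assumptions the result is about 0.8186, slightly different from the 0.815 obtained by adding the rounded band percentages 13.5% + 34% + 34%.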
Example 9
Use technology to illustrate and calculate:
a Pr(−0.41 ≤ Z ≤ 0.67) b Pr(Z ≤ 1.5) c Pr(Z > 0.84)
a For a TI, Pr(a ≤ Z ≤ b) can be calculated using normalcdf(a, b, 0, 1).
Pr(−0.41 ≤ Z ≤ 0.67) = normalcdf(−0.41, 0.67, 0, 1) ≈ 0.408
Note: When μ = 0 and σ = 1 we can simply use normalcdf(a, b).
b Pr(Z ≤ 1.5) = normalcdf(−E99, 1.5, 0, 1) ≈ 0.933
Note: −E99 is the largest negative number on a calculator.
c Pr(Z > 0.84) = normalcdf(0.84, E99, 0, 1) ≈ 0.200
Note: E99 is the largest positive number on a calculator.

EXERCISE 7D.3
1 If Z is the standard normal distribution, find the following probabilities. In each case sketch the regions.
a Pr(−0.86 ≤ Z ≤ 0.32) b Pr(−2.3 ≤ Z ≤ 1.5) c Pr(Z ≤ 1.2) d Pr(Z ≤ −0.53) e Pr(Z > 1.3) f Pr(Z > −1.4) g Pr(Z > 4)

With modern technology we can calculate probabilities for normal distributions which have not been standardised. Click on the icon to see how this is done.

Example 10
If X is N(10, 2.3²), find these probabilities: a Pr(8 ≤ X ≤ 11) b Pr(X ≤ 12) c Pr(X > 9). Illustrate.
a Pr(8 ≤ X ≤ 11) = normalcdf(8, 11, 10, 2.3) ≈ 0.476
b Pr(X ≤ 12) = normalcdf(−E99, 12, 10, 2.3) ≈ 0.808
c Pr(X > 9) = normalcdf(9, E99, 10, 2.3) ≈ 0.668

2 If the random variable X is N(70, 3²), find these probabilities:
a Pr(60.6 < X ≤ 68.4) b Pr(X > 74) c Pr(X ≤ 68)
3 Suppose the variable X is normally distributed with mean μ = 58.3 and standard deviation σ = 8.96.
a Let the z-score of x = 50.6 be z₁ and the z-score of x = 68.9 be z₂.
i Calculate z₁ and z₂. ii Find Pr(z₁ ≤ Z ≤ z₂).
b Find Pr(50.6 ≤ X ≤ 68.9) directly from your calculator.
c Compare the answers to a and b.
4 Suppose X is N(50, 5²). Calculate Pr(a < X ≤ 51) for each of the following values of a. Give your answers to 5 decimal places.
a a = 45 b a = 35 c a = 25 d a = 15 e a = 0
Compare the answers of a to e with Pr(X ≤ 51).

Example 11
In 1972 the heights of SANFL players were found to be normally distributed with mean 179 cm and standard deviation 7 cm. Find the probability that in 1972 a player was:
a at least 175 cm tall b between 170 cm and 190 cm.
If X is the height of a player then X is normally distributed with mean μ = 179 and standard deviation σ = 7.
a We need to find Pr(X ≥ 175) = normalcdf(175, E99, 179, 7) ≈ 0.716
b We need to find Pr(170 ≤ X ≤ 190) = normalcdf(170, 190, 179, 7) ≈ 0.843

5 The height of 18 year old men is normally distributed with mean 182.3 cm and standard deviation 9.6 cm. Find the probability that a randomly selected 18 year old man is:
a at least 180 cm tall b at most 190 cm tall c between 175 and 185 cm.
6 The weight of hens’ eggs is normally distributed with mean 42.3 g and standard deviation 5.9 g. Find the probability that a randomly selected egg is:
a at most 50 g b at least 45 g c between 35 g and 45 g.
7 The speed of cars passing the supermarket is normally distributed with mean 56.3 kmph and standard deviation 7.4 kmph. Find the probability that a randomly selected car is travelling at:
a between 60 and 75 kmph b at most 70 kmph c at least 60 kmph.
8 The lengths of metal bolts produced by a machine are found to be normally distributed with a mean of 19.8 cm and a standard deviation of 0.3 cm. Find the probability that a bolt selected at random from the machine will have a length between 19.7 and 20 cm.
9 The IQs of secondary school students from a particular area are believed to be normally distributed with a mean of 103 and a standard deviation of 15.1. Find the probability that a student will have an IQ:
a of at least 115 b that is less than 75 c between 95 and 105.
10 The average weekly earnings of the students at a local high school are found to be approximately normally distributed with a mean of $40 and a standard deviation of $6. What proportion of students would you expect to earn:
a between $30 and $50 per week b at least $50 per week?
11 The lengths of Murray Cod caught in the River Murray are found to be normally distributed with a mean of 41 cm and a standard deviation of 3.317 cm.
a Find the probability that a cod is at least 50 cm.
b What proportion of cod measure between 40 cm and 50 cm?
c In a sample of 200 cod, how many of them would you expect to be at least 45 cm?

E FINDING QUANTILES (k-VALUES)
Let X be the random variable of the length in mm of a snail shell. Suppose that X is normally distributed with mean μ = 23.6 mm and standard deviation σ = 3.1 mm.
A snail farmer wants to harvest some of his snails, but only those whose shell lengths are amongst the longest 5%. The problem is to find k such that Pr(X < k) = 95%. The number k is known as a quantile, and in this case the 95% quantile.
When finding quantiles we are given a probability and are asked to calculate the corresponding measurement. This is the inverse of finding probabilities, and we use the inverse normal function. Click on the icon to obtain instructions for using your calculator.
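The inverse normal calculation for the snail shells can also be sketched in software. This is an illustration rather than the method of the text: it assumes Python 3.8 or later and the standard library `statistics.NormalDist`, whose `inv_cdf` method plays the role of an inverse normal function; the variable names are hypothetical.

```python
from statistics import NormalDist

shell_length = NormalDist(mu=23.6, sigma=3.1)  # snail shell length in mm

# 95% quantile: the value k with Pr(X < k) = 0.95
k = shell_length.inv_cdf(0.95)
print(round(k, 1))

# Check: feeding k back into the cumulative distribution returns 0.95
print(round(shell_length.cdf(k), 2))
```

Under these assumptions k comes out at about 28.7 mm, so only shells longer than roughly 28.7 mm would be among the longest 5%.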
For the above example, the TI instruction is k = invNorm(0.95, 23.6, 3.1) ≈ 28.7
The instruction k = invNorm(0.95) will assume that the mean μ = 0 and the standard deviation σ = 1.
[Figure: normal curve for X with μ = 23.6 and σ = 3.1; the area to the left of k is 95%.]

Example 12
If Z has a standard normal distribution, find k if Pr(Z < k) = 0.73.
[Figure: standard normal curve with μ = 0 and σ = 1; the area to the left of k is 73%.]
Using a TI, k = invNorm(0.73, 0, 1) ≈ 0.613
This means 73% of the values are expected to be less than 0.613.

EXERCISE 7E
1 Z has a standard normal distribution. Illustrate with a sketch and find k if:
a Pr(Z ≤ k) = 0.81 b Pr(Z ≤ k) = 0.58 c Pr(Z ≤ k) = 0.17
2 X ~ N(20, 3²). Illustrate with a sketch and find k if:
a Pr(X ≤ k) = 0.348 b Pr(X ≤ k) = 0.878 c Pr(X ≤ k) = 0.5
3 a Show that Pr(−k ≤ Z ≤ k) = 2 Pr(Z ≤ k) − 1.
b If Z is standard normally distributed, find k if:
i Pr(−k ≤ Z ≤ k) = 0.238 ii Pr(−k ≤ Z ≤ k) = 0.7004

Example 13
A university professor determines that 80% of this year’s History candidates should pass the final examination. The examination results are expected to be normally distributed with mean 62 and standard deviation 13. Find the lowest score necessary to pass the examination.
Let X denote the final examination result, so X ~ N(62, 13²).
[Figure: normal curve for X with mean 62; the area to the left of k is 20%.]
We need to find k such that Pr(X > k) = 0.8
∴ Pr(X ≤ k) = 0.2
∴ k = invNorm(0.2, 62, 13)
∴ k ≈ 51.059
So, the minimum pass mark is 51.

4 The length of a fish species is normally distributed with mean 35 cm and standard deviation 8 cm. The fisheries department has decided that the smallest 10% of the fish are not to be harvested. What is the size of the smallest fish that can be harvested?
5 The length of screws produced by a machine is normally distributed with mean 75 mm and standard deviation 0.1 mm. If a screw is too long it is automatically rejected.
If 1% of screws are rejected, what is the length of the smallest screw to be rejected?
6 The average score for a Physics test was 46 and the standard deviation of the scores was 15. Assuming that the scores were normally distributed, the teacher decided to award an A to the top 7% of the students in the class. What is the lowest score that a student needed in order to achieve an A?
7 The volume of cool drink in a bottle filled by a machine is normally distributed with mean 503 mL and standard deviation 0.5 mL. 1% of the bottles are rejected because they are underfilled, and 2% are rejected because they are overfilled; otherwise they are kept for retail. What range of volumes is in the bottles that are kept?

Note: Z-scores are essential for finding unknown values of μ and/or σ.

Example 14
An adult scallop population is known to have a standard deviation of 5.9 g. If 15% of scallops weigh less than 58.2 g, find the mean weight of the population.
Let the mean weight of the population be μ g. If X g denotes the weight of an adult scallop, then X ~ N(μ, 5.9²).
[Figure: normal curve with unknown mean μ and σ = 5.9; the area to the left of 58.2 is 15%.]
As we do not know μ we cannot use invNorm directly, but we can find the z-value.
Now Pr(X ≤ 58.2) = 0.15
∴ $\Pr\left(Z \le \frac{58.2 - \mu}{5.9}\right) = 0.15$
∴ $\frac{58.2 - \mu}{5.9} = \text{invNorm}(0.15) \approx -1.0364$
∴ 58.2 − μ ≈ −6.1
∴ μ ≈ 64.3
So, the mean weight is 64.3 g.

8 The arrival times of buses at a depot are normally distributed with standard deviation of 5 minutes. If 10% of the buses arrive before 3:45 pm, what is the mean arrival time of buses at the depot?
9 The IQ of a population has a standard deviation of 15. In a school 20% of students have an IQ larger than 125. What is the mean IQ of students in this school?
10 The distance an athlete can jump is normally distributed with mean 5.2 m. If 20% of the jumps by this athlete are less than 5 m, what is the standard deviation?
11 The weekly income of a greengrocer is normally distributed with a mean of $6100. If 85% of the time the weekly income exceeds $6000, what is the standard deviation?

Example 15
Find the mean and standard deviation of a normally distributed random variable X if Pr(X ≤ 20) = 0.1 and Pr(X > 29) = 0.15.
X ~ N(μ, σ²) where we have to find μ and σ.
[Figure: normal curve with area 0.1 to the left of x₁ = 20 (z-score z₁) and area 0.15 to the right of x₂ = 29 (z-score z₂).]
We start by finding z₁ and z₂ which correspond to x₁ = 20 and x₂ = 29.
Now $z_1 = \frac{20 - \mu}{\sigma} = \text{invNorm}(0.1) \approx -1.282$, so 20 − μ = −1.282σ .... (1)
and $z_2 = \frac{29 - \mu}{\sigma} = \text{invNorm}(0.85) \approx 1.036$, so 29 − μ = 1.036σ .... (2)
Solving these two equations gives μ ≈ 25.0 and σ ≈ 3.88.

12 a Find the mean and the standard deviation of a normally distributed random variable X, if Pr(X > 80) = 0.1 and Pr(X ≤ 30) = 0.15.
b In a Mathematics examination it was found that 10% of the students scored at least 80, and no more than 15% scored under 30. Assuming the scores are normally distributed, what proportion of students scored more than 50?
13 The diameters of pistons manufactured by a company are normally distributed. Only those pistons whose diameters lie between 3.994 and 4.006 cm are acceptable.
a Find the mean and the standard deviation of the distribution if 4% of the pistons are rejected as being too small, and 5% are rejected as being too large.
b Determine the probability that the diameter of a randomly chosen piston lies between 3.997 cm and 4.003 cm.

F INVESTIGATING PROPERTIES OF NORMAL DISTRIBUTIONS
In the previous section a number of assertions were made about the standard deviation.
In this section some of these assertions will be justified.

INVESTIGATION 2 THE GEOMETRIC SIGNIFICANCE OF μ AND σ
What to do:
1 The normal probability density function is $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$. Use technology to graph this function for a μ = 6, σ = 1 b μ = 6, σ = 2.
2 Show that the derivative of f(x) is $f'(x) = -\frac{x-\mu}{\sigma^2}\, f(x)$.
3 Use the result in 2 to show that f(x) has a maximum value at x = μ.
4 Show that $f''(x) = -\frac{1}{\sigma^4}\left(\sigma^2 - (x-\mu)^2\right) f(x)$.
5 Use the result of 4 to find the points of inflection of f(x).

From Investigation 2 you should have discovered that the points of inflection occur at x = μ + σ and x = μ − σ.
[Figure: normal curve with the points of inflection marked at x = μ − σ and x = μ + σ, each a horizontal distance σ from the line x = μ.]
Consequently: For a given normal curve the standard deviation is uniquely determined as the horizontal distance from the vertical line x = μ to a point of inflection.

INVESTIGATION 3 CALCULATING PROBABILITIES FROM NORMAL DISTRIBUTIONS
To find probabilities from a normal distribution you need to be able to find areas between the graph of $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$ and the x-axis.
A simple way to estimate these probabilities is to approximate them with areas of rectangles that fit snugly around the curve. The area beneath the smooth curve is approximately equal to the sum of the areas of the rectangles.
What to do:
Use a spreadsheet to:
• calculate the area of each rectangle using area = base × height
• add the areas of rectangles to find an approximate area below the curve.
Details of how to set up a spreadsheet can be found by clicking on the icon.

G DISTRIBUTION OF SAMPLE MEANS
Suppose a dietician wants to know the mean weight of thirteen year old Australian boys.
It is impractical to weigh each thirteen year old boy in Australia, but the dietician could find the mean weight of a randomly selected sample of, say, 10 boys. The mean weight of the sample of 10 boys is a statistic that is then used to estimate the population parameter.
Clearly the mean weight depends on the sample. If another health worker had selected a different sample of 10 boys, it would be unlikely that the two sample means would be the same. The statistic, the sample mean weight, is a new variable. Repeated sampling can be used to discover how this variable is distributed. In particular we want to know how the mean of the sample means and the standard deviation of the sample means are related to the parent population of 13 year old boys.
The following investigation explores the relation between the statistic “sample mean” and the parameter “population mean”.

INVESTIGATION 4 A SIMPLE RANDOM SAMPLER
Suppose a school has 216 thirteen year old boys. Let the variable X be the weight in kg of the boys. The table shows all the possible values of X in random order.
1 31.2 35.7 33.8 36.7 30.9 35.9 35.5 36.4 32.9 27.9 33.2 32.8 32.0 33.2 35.4 32.0 33.8 30.8 34.8 37.3 31.9 36.4 33.6 29.0 30.2 35.0 36.7 34.5 30.5 32.1 36.3 2 34.0 33.6 29.2 31.6 30.9 32.7 38.9 34.4 33.6 32.5 35.0 35.4 32.0 32.0 31.0 35.3 33.2 30.4 28.0 32.7 32.5 34.6 36.2 33.3 32.7 36.7 36.4 31.1 35.2 30.2 33.6 3 35.4 31.2 33.5 30.5 35.0 31.4 27.5 32.5 32.5 30.5 32.5 32.4 31.7 29.6 30.1 36.3 34.1 37.1 37.1 35.1 34.9 34.3 33.2 32.5 29.9 32.9 32.3 32.1 32.9 35.9 31.6 4 37.3 33.6 31.4 31.3 33.4 30.3 32.5 36.7 33.0 30.8 33.2 34.9 33.4 30.7 32.4 29.8 31.1 32.1 35.2 32.8 29.7 30.8 32.3 34.6 34.2 32.5 33.6 29.2 30.6 35.7 29.5 5 34.3 31.9 32.6 31.6 27.0 33.4 34.0 33.2 29.4 34.2 31.8 35.3 30.6 34.5 32.6 29.1 36.1 38.7 37.1 32.4 35.5 35.4 32.7 37.5 30.4 30.8 33.0 29.9 31.0 32.1 33.2 6 32.4 32.0 31.4 33.7 35.9 33.9 28.5 27.1 40.6 29.0 38.4 34.0 36.2 36.4 37.1 32.0 31.6 34.2 35.7 34.0 31.4 29.9 34.4 29.2 36.4 32.4 30.0 34.6 31.6 37.6 33.2 30.8 33.3 34.9 31.8 33.3 29.4 30.6 33.1 32.0 31.4 31.9 31.5 35.0 32.3 29.3 32.0 35.3 37.7 34.6 35.7 34.9 36.6 30.2 29.4 35.4 35.5 32.0 30.4 29.7 33.7
What to do:
1 Select a sample of 10 boys from this population by:
a rolling a die to select one of the 6 blocks
b rolling the die again to select a row in the block
c rolling the die again to select a boy in the row
d counting off 10 boys from left to right, starting from the boy you selected.
If the 3 rolls of the die produced {3, 2, 4}, the boy selected has weight 30.1 kg. The sample selected is presented in the first column of the table.
2 Copy and enter your data in the following table.
Number    Sample 1   Sample 2   Sample 3   Sample 4   Sample 5
1         30.1
2         34.9
3         32.3
4         34.9
5         31.4
6         33.0
7         32.4
8         29.7
9         33.6
10        30.6
mean, x̄   32.3
3 The last row in this table consists of 5 sample means. The variable of sample means can be denoted by X̄₁₀. The bar on the top indicates it is a variable of means; the subscript 10 indicates that the means are of samples of size ten. The last row of your table is a sample of size 5 from the distribution of X̄₁₀.
4 Combine your results with those of the other students of your class. Draw a histogram of the sample means.
5 Calculate the mean and the standard deviation of the sample means.
6 Compare the mean and the standard deviation you found in 5 with the mean weight 33.1 kg and standard deviation 2.54 kg of the 216 boys.

From Investigation 4 you should have discovered that the sample means are close to the population mean. The mean of the sample means should be particularly close to the population mean. You should also have noted that the standard deviation of the sample means is smaller than the standard deviation of the population.
The following important investigation uses a computer to speed up sampling and obtain a more accurate picture of how the standard deviation of the sample means is related to the standard deviation of the population.
In this investigation it is important to distinguish between:
• The original population, sometimes referred to as the “parent population”, with a random variable X which has mean μ and standard deviation σ. In Investigation 4 the parent population consists of 216 thirteen year old boys. The mean μ = 33.1 kg and standard deviation σ = 2.54 kg.
and
• The new population with variable X̄ₙ, consisting of all statistics of sample means.
The subscript n indicating the sample size is sometimes omitted and the variable just written X̄.
A typical outcome of X̄ is a sample mean $\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n}$.
In Investigation 4 a typical outcome is the mean weight of 10 boys.
The investigation explores the shape of the distribution of the random variable X̄, its mean $\mu_{\bar{X}}$ or $\mu(\bar{X})$, and its standard deviation $\sigma_{\bar{X}}$ or $\sigma(\bar{X})$.

INVESTIGATION 5 A COMPUTER BASED RANDOM SAMPLER
In this investigation we examine the variation in sample means. We examine samples taken from symmetric distributions as well as one that is skewed.
We start by sampling from a population which has a normal distribution. The heights of 18 year old Australian males may be approximately normal.
What to do:
1 Click on the icon given alongside. This opens a worksheet named Samples with a number of buttons. Click on each of these buttons in turn.
2 Sample size: from which you can select the numbers n = 10, 20, 40, 80, 160. Start with n = 10.
3 Find sample means: finds the means of each of two hundred different samples.
4 Analyse: lists the two hundred sample means. It finds the standard deviation $s_{\bar{X}}$ and draws a histogram of these sample means. It also superimposes a normal probability density function. This output is shown on the worksheet named Analysis. Note that the first graph on this worksheet is the graph of the probability density function of the population, and that the axes differ from that of the other graphs.
5 Make a copy of the table alongside. Enter the value of $(s_{\bar{X}})^2$ in the first column next to n = 10.
n     Trial 1 (s_X̄)²   Trial 2 (s_X̄)²   Trial 3 (s_X̄)²   Trial 4 (s_X̄)²
10
20
40
80
160
6 Go back to the worksheet named Samples and change the sample size to 20. Repeat steps 3, 4, and 5.
Enter the value of $(s_{\bar{X}})^2$ next to n = 20 in the table.
7 Repeat for samples of size 40, 80 and 160.
8 We wish to see how $(s_{\bar{X}})^2$ is related to the standard deviation of the population. However, $(s_{\bar{X}})^2$ can vary quite a lot, so to spot the pattern more clearly you should repeat the experiment another 3 times.
9 From your experiment, determine a relationship between the square of the standard deviation of the sample means, $(s_{\bar{X}})^2$, and the square of the population standard deviation.
10 Now click on the icon to sample data from a population with a uniform distribution. These distributions are very commonly used in computer games where, for example, cards have to be selected at random. Complete an analysis of this data by repeating the above procedure and recording all results.
11 Now click on the icon to sample data from a population with an exponential distribution. These distributions are notoriously skewed. They are commonly used in modelling lifetimes, such as the lifetime of light globes. Complete an analysis of this data by repeating the above procedure and recording all results.

From the investigation you should have discovered the following:
If X is a random variable with mean μ and standard deviation σ then the random variable X̄ₙ of sample means of size n has:
• mean $\mu_{\bar{X}} = \mu$, the same as the mean of the random variable X
• standard deviation $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$.
Furthermore, for large values of n, X̄ₙ is approximately normal.
You should notice:
• The histogram of the sample means becomes symmetric and starts to take on a bell-like shape. For large values of n it becomes approximately normal.
• The mean of the sample means approximates the population mean.
Individual points selected from any distribution are likely to come from either side of the mean, and differences are likely to average out.
[Figure: three samples (Sample 1, Sample 2, Sample 3), each consisting of values x₁, x₂, x₃, ..., xₙ drawn from a population with mean μ, give sample means x̄₁, x̄₂, x̄₃ which cluster around $\mu_{\bar{X}}$.]
• As the sample size increases, there is less variability.
• This diagram shows what happens if the sample size n increases.
[Figure: distributions of the sample means with successively smaller $\sigma_{\bar{X}}$ as n increases.]
The spread decreases since $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$ and $\mu_{\bar{X}} = \mu$.
In the Appendix the behaviour of the mean and the standard deviation is explored algebraically. It is beyond the level of this course to show why the distribution of the sample means is approximately normal.

Example 16
The life expectancy X of a certain brand of AAA battery is known to have a mean μ = 27 hours and standard deviation σ = 3.25 hours. The batteries are sold in packets of 6. Let the random variable X̄₆ be the mean life expectancy of the batteries in a packet.
a The 6 batteries in a packet were tested and the number of hours they lasted were: 25.3, 21.6, 27.75, 22.25, 35.5, 28.5. What is the corresponding outcome of the random variable X̄₆?
b If the numbers of hours lasted by batteries in a packet of six were x₁, x₂, x₃, x₄, x₅, x₆, what is the corresponding outcome of X̄₆?
c What is the mean and standard deviation of X̄₆?
a The outcomes of X̄₆ are the means of the life expectancies of 6 batteries in a packet. In this case the outcome of X̄₆ is the statistic $\bar{x} = \frac{25.3 + 21.6 + 27.75 + 22.25 + 35.5 + 28.5}{6} \approx 26.8$.
b If the batteries in the packet lasted for x₁, x₂, x₃, x₄, x₅, x₆ hours, the corresponding outcome of X̄₆ is the statistic $\bar{x} = \frac{x_1 + x_2 + x_3 + x_4 + x_5 + x_6}{6}$.
c The mean of X̄₆ is the same as the mean of X, so $\mu_{\bar{X}_6} = 27$ hours.
Since the standard deviation of X is 3.25, the standard deviation of $\bar{X}_6$ is
$\sigma_{\bar{X}_6} = \dfrac{\sigma}{\sqrt{6}} = \dfrac{3.25}{\sqrt{6}} \approx 1.327$

EXERCISE 7G.1
1 A machine produces sheets of cardboard with mean thickness 3 mm and standard deviation 0.12 mm. A quality controller checks the thickness of each sheet in 10 different places. Let the random variable X be the thickness of the cardboard at any point, and let the random variable $\bar{X}_{10}$ be the mean thickness of the 10 points.
a The quality controller records the following thicknesses in mm from a sample of 10 points: 3.02, 2.77, 3.08, 2.89, 3.21, 2.79, 2.97, 3.07, 2.94, 3.01. What is the corresponding outcome of the random variable $\bar{X}_{10}$?
b If the quality controller records 10 outcomes of X as $x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_{10}$, what is the corresponding statistic of $\bar{X}_{10}$?
c What are the mean and standard deviation of $\bar{X}_{10}$?
2 Records show that a machine has been producing screws with mean length 75 mm and standard deviation 0.5 mm. Screws are packaged in lots of 50. Let the random variable $\bar{X}_{50}$ be the mean length of a screw in a packet. Find the mean and standard deviation of $\bar{X}_{50}$.
3 The time it takes a train from Adelaide to Belair to complete its journey is known to have a mean of 40 minutes and standard deviation of 3 minutes. An inspector times 8 such trips. Let $\bar{X}_8$ be the mean travel time of a sample of 8 trips. Find the mean and standard deviation of $\bar{X}_8$.
4 Suppose the probability a coin falls heads is p and the probability it falls tails is q = 1 − p. Let the random variable X = 1 if it falls heads and X = 0 if it falls tails.
a Show that the mean of X is p.
b Show that the standard deviation of X is $\sqrt{pq} = \sqrt{p(1-p)}$.
c Let $\bar{X}_n$ be the sample mean of n tosses of the coin.
i Find the mean and standard deviation of $\bar{X}_n$.
ii Describe in words how $\bar{X}_n$ is related to the tosses of a coin.

In general, knowing the mean and standard deviation of a random variable X is insufficient information to calculate probabilities. However, we are able to calculate probabilities in the special case where X is normally distributed. Not only that, but if X is normally distributed, the random variable $\bar{X}_n$ of sample means of size n is also normally distributed.

Example 17
Suppose the random variable X is normally distributed with mean 40 and standard deviation 10. Let $\bar{X}_{20}$ be the sample means of size 20. Find:
a $\Pr(35 < X < 45)$
b $\Pr(35 < \bar{X}_{20} < 45)$.

a $\Pr(35 < X < 45) = \text{normalcdf}(35, 45, 40, 10) \approx 0.383$
b The mean of $\bar{X}_{20}$ = mean of X = 40. The standard deviation of $\bar{X}_{20}$ is $\dfrac{10}{\sqrt{20}}$.
$\Pr(35 < \bar{X}_{20} < 45) = \text{normalcdf}\left(35, 45, 40, \tfrac{10}{\sqrt{20}}\right) \approx 0.975$
Notice that about 38% of the individual outcomes are in the interval 35 < X < 45, but the great majority of the sample means lie in this interval.

Example 18
The time T it takes to serve a customer at a railway station ticket booth is normally distributed with mean 45 seconds and standard deviation 20 seconds. You only have 10 minutes to buy your ticket or you will miss your train. If there is a line of 11 people in front of you waiting to be served, what is the probability you will catch the train?

Including yourself there are 12 persons in the line to be served. To complete buying your ticket in less than 10 minutes, the mean serving time per person has to be less than $\dfrac{10 \times 60}{12} = 50$ seconds.
Let the random variable $\bar{T}_{12}$ be the mean time to serve 12 persons.
Since T is normally distributed with mean 45 and standard deviation 20, $\bar{T}_{12}$ is normally distributed with mean 45 and standard deviation $\dfrac{20}{\sqrt{12}}$.
$\Pr(\bar{T}_{12} < 50) = \text{normalcdf}\left(-\text{E}99, 50, 45, \tfrac{20}{\sqrt{12}}\right) \approx 0.807$
So, the probability of catching the train is 0.807.

5 Suppose the random variable X is normally distributed with mean 80 and standard deviation 20. Let $\bar{X}_{10}$ be the sample means of size 10. Find:
a $\Pr(75 < X < 85)$
b $\Pr(75 < \bar{X}_{10} < 85)$
6 Let the random variable X be the IQ of 17 year old girls. Suppose X is normally distributed with mean 105 and standard deviation 15.
a Find the probability that an individual 17 year old girl has an IQ of more than 110.
b Find the probability that the mean IQ of a class of twenty 17 year old girls is greater than 110.
7 A manufacturer of chocolates produces chocolates of mean weight 20 g and standard deviation 5 g. A box of 13 such chocolates is sold with the claim that the nett weight in the box is 250 g. Assuming the weights are normally distributed:
a For what proportion of boxes is this claim correct?
b If the manufacturer decides to increase the number of chocolates to 15 per box, for what proportion of boxes is the claim now true?

THE CENTRAL LIMIT THEOREM
In the previous investigation, we also observed that the distribution of the sample means $\bar{X}$ is approximately normal.

The Central Limit Theorem
Suppose X is a random variable which is not necessarily normally distributed, but has mean $\mu$ and standard deviation $\sigma$. For sufficiently large n, the distribution $\bar{X}_n$ of the sample means of size n is approximately normal with mean $\mu_{\bar{X}} = \mu$ and standard deviation $\sigma_{\bar{X}} = \dfrac{\sigma}{\sqrt{n}}$.

Note:
• There is no simple answer as to how large n should be before the Central Limit Theorem can be applied. It depends on many factors including how much accuracy is required. If the population is very skewed it may require a large sample size n, whereas if the population is symmetric a small sample size n may be sufficient.
As a rule of thumb, n > 30 is often used, but each case must be considered on its merits.
• In the special case where the population is normally distributed, the distribution $\bar{X}$ of the sample means is always normal.

THE SAMPLING ERROR
We are trying to estimate the population mean using a sample mean. By only looking at a small portion of the population, the sample mean is likely to be different from the population mean.
The standard deviation $\sigma_{\bar{X}} = \dfrac{\sigma}{\sqrt{n}}$ of the sample means $\bar{X}$ is a measure of the variability of sample means, and is called the sampling error or the standard error.

Note:
• Unless the population is small, the population size is almost irrelevant.
• The larger the value of n, the smaller the sampling error. A sufficiently large sample should give an accurate estimate of the mean. However, making the sample size too big may be expensive and may not improve the reliability of the estimate by much.
For example, a sample size of 1000 gives a sampling error of $\sigma_{\bar{X}} = \dfrac{\sigma}{\sqrt{1000}} \approx \dfrac{\sigma}{32}$, whereas a sample of 4000, four times the size, only halves the sampling error.

Example 19
Two histograms of samples, each of size 400, are shown below. One is from a uniform distribution X with mean 10 and standard deviation 5.77. The other is from the distribution $\bar{X}_{36}$ of the sample means of size 36 selected from the distribution X. Note that the scales are not the same in the two diagrams.
[Figure: Histogram A and Histogram B, frequency against interval.]
a Which of the two histograms is from $\bar{X}_{36}$? Give reasons for your answer.
b From the diagram estimate $\Pr(\bar{X}_{36} < 9)$.
c Find the approximate mean and standard deviation of $\bar{X}_{36}$.
d Use the histogram to estimate the probability that $\bar{X}_{36}$ is within one standard deviation of the mean.

a The data in Histogram A is less spread out than that in Histogram B, and appears clustered around 10. Histogram A is the histogram for the distribution $\bar{X}_{36}$.
b To find $\Pr(\bar{X}_{36} < 9)$ we count the numbers in all the bins before the bin [9, 9.25), and use the fact that there are 400 in the sample. We get:
$\Pr(\bar{X}_{36} < 9) = \dfrac{15 + 15 + 12 + 3 + 2 + 3 + 2 + 1}{400} = \dfrac{53}{400} \approx 0.13$
Your answer may vary a little depending on how well you can read the numbers on the graph.
c The mean of $\bar{X}_{36}$ = mean of X = 10.
The standard deviation $\sigma_{\bar{X}} = \dfrac{\sigma}{\sqrt{36}} = \dfrac{5.77}{6} \approx 0.962$
d $\Pr(10 - 0.96 < \bar{X}_{36} < 10 + 0.96) = \Pr(9.04 < \bar{X}_{36} < 10.96) \approx \Pr(9 < \bar{X}_{36} < 11)$
$= \dfrac{30 + 27 + 39 + 44 + 45 + 42 + 31 + 30}{400} = 0.72$
This crude estimate compares with 0.68 when using the normal approximation.

EXERCISE 7G.2
1 The IQ measurements of a population have mean 100 and standard deviation 15. Many hundreds of random samples of size 36 are taken from the population and a relative frequency histogram of the sample means is formed.
a What would we expect the mean of the samples to be?
b What would we expect the standard deviation of the samples to be?
c What would we expect the shape of the histogram to look like?
2 Two histograms of sample size 300 each are shown below.
[Figure: Histogram A and Histogram B, frequency against interval.]
One is from a life expectancy distribution X with mean 10 and standard deviation 10. The other is from the distribution $\bar{X}_{64}$ of the sample means of size 64 selected from the distribution X. Note that the scales are not the same in the two diagrams.
a Which of the two histograms is from $\bar{X}_{64}$? Give reasons for your answer.
b From the diagram estimate $\Pr(\bar{X}_{64} < 9)$.
c Find the approximate mean and standard deviation of $\bar{X}_{64}$.
d Use the histogram to estimate the probability that $\bar{X}_{64}$ is within one standard deviation of the mean. How does this answer compare with using the normal approximation?

Example 20
The age of men in Australia is distributed with mean 43 and standard deviation 8. If a sample of 67 men is selected from the population of Australian men, what is the probability the sample mean is:
a less than 42
b greater than 45
c between 40 and 45?

Let the random variable $\bar{X}$ be the mean age of samples of 67 Australian males. Assuming n = 67 is sufficiently large for the Central Limit Theorem to apply, $\bar{X}$ is approximately normal with mean 43 and standard deviation $\sigma_{\bar{X}} = \dfrac{8}{\sqrt{67}}$.
a $\Pr(\bar{X} < 42) = \text{normalcdf}\left(-\text{E}99, 42, 43, \tfrac{8}{\sqrt{67}}\right) \approx 0.153$
b $\Pr(\bar{X} > 45) = \text{normalcdf}\left(45, \text{E}99, 43, \tfrac{8}{\sqrt{67}}\right) \approx 0.0204$
c $\Pr(40 < \bar{X} < 45) = \text{normalcdf}\left(40, 45, 43, \tfrac{8}{\sqrt{67}}\right) \approx 0.979$

3 During a one week period in Sydney the mean price of an orange was 42.8 cents with standard deviation 8.7 cents. Find the probability that the mean price per orange from a case of 60 oranges was less than 45 cents.
4 The mean energy content of a fruit bar is 1067 kJ with standard deviation 61.7 kJ. Find the probability that the mean energy content of a sample of 30 fruit bars is more than 1050 kJ/bar.
5 The mean sodium content of a box of cheese rings is 1183 mg with standard deviation 88.6 mg. Find the probability that the mean sodium content per box for a sample of 50 boxes lies between 1150 mg and 1200 mg.
6 Customers at a clothing store are in the shop for a mean time of 18 minutes with standard deviation 5.3 minutes. What is the probability that in a sample of 37 customers the mean stay in the shop is between 17 and 20 minutes?
7 The mean contents of a can of cola is 382 mL, even though it says 375 mL on a can. The statistician at the factory says that the standard deviation is steady at 16.2 mL. Find the probability that a slab of three dozen cans has mean contents less than 375 mL per can.

Example 21
A population is known to have a standard deviation of 8 but has an unknown mean $\mu$. In order to estimate $\mu$, the mean of a random sample of 60 is found. Find the probability that this estimate is out by less than 2.

Let the random variable $\bar{X}$ be the mean of samples of 60. As the sample size is larger than 30, we assume that $\bar{X}$ is normally distributed with mean $\mu$ and standard deviation $\dfrac{8}{\sqrt{60}}$.
We need to find $\Pr(-2 < \bar{X} - \mu < 2)$.
Now $\Pr(-2 < \bar{X} - \mu < 2) = \Pr\left(\dfrac{-2}{8/\sqrt{60}} < \dfrac{\bar{X} - \mu}{8/\sqrt{60}} < \dfrac{2}{8/\sqrt{60}}\right)$
$= \Pr\left(-\dfrac{\sqrt{60}}{4} < Z < \dfrac{\sqrt{60}}{4}\right)$
$= \text{normalcdf}\left(-\tfrac{\sqrt{60}}{4}, \tfrac{\sqrt{60}}{4}, 0, 1\right)$
$\approx 0.947$

8 A sample of 375 people will be used to estimate the mean number of hours that will be lost due to sickness this year. Last year the standard deviation for the number of hours lost was 67 and we will use this as the standard deviation this year. What is the probability that the estimate is in error by less than ten hours?
9 A concerned union member wishes to estimate the hourly wage of shop assistants in Adelaide.
He decides to randomly survey 300 shop assistants to calculate the sample mean. Assuming that the standard deviation is $1.27, find the probability that the estimate of the population mean is in error by 10 cents or more.

INVESTIGATION 6 CHOCKBLOCKS
Chockblock produce mini chocolate bars which vary a little in weight. The machine used to make them produces bars whose weights are normally distributed with mean 18.2 grams and standard deviation 3.3 grams. 25 bars are then placed in a packet for sale. Hundreds of thousands of packets are produced each year with mean weight $\bar{X}$.
What to do:
1 What are the mean $\mu_{\bar{X}}$ and standard deviation $\sigma_{\bar{X}}$ of $\bar{X}$?
2 Printed on each packet is the nett weight of contents, 425 grams. What is the manufacturer claiming about the mean weight of each bar?
3 What percentage of their packets will be rejected because they fail to meet the 425 gram claim?
4 An additional bar is added to each packet with the nett weight claim retained at 425 grams.
a What is the minimum acceptable claim now?
b What are the mean $\mu_{\bar{X}}$ and standard deviation $\sigma_{\bar{X}}$ now?
c What percentage of these packets would we expect to reject?

H HYPOTHESIS TESTING FOR A MEAN
Claims are often made about the population mean of some quantities. For example, it is claimed that the mean protein content of a 1 litre carton of milk is 39 grams. The truth of this claim can only be known by measuring the protein content of every 1 litre carton of milk, clearly an impossible task. It is, however, possible to draw reasonable conclusions from measuring the protein content of a random selection of cartons.
A statistical hypothesis is a statement about a population parameter. The parameter could be a population mean or a proportion.
In this section we will test hypotheses concerning the mean $\mu$.

HYPOTHESIS ABOUT MEANS
When a statement is made about a product, it is usually tested statistically before changes to the product are made.
For example, suppose a consumer makes the statement that the mean protein content in 1 litre cartons of milk is not 39 grams. The milk company does not want to go to the expense of changing packaging until it is statistically shown that the mean protein content is indeed not 39 grams. The company will start with the assumption that their claim is true, and that whatever the consumer's tests showed was just random fluctuation.
This assumption or statement of no change is called the null hypothesis and is usually denoted $H_0$. The alternative hypothesis, denoted $H_a$, is that the statistical evidence is sufficient to accept the consumer's claim, i.e., that the milk company's statement is false.
So, we consider two hypotheses:
• a null hypothesis $H_0$ which is a statement of no difference or no change. It is assumed to be true until sufficient evidence is provided so that it is rejected.
• an alternative hypothesis $H_a$ which is a statement that there is a difference or change which has to be established. Supporting evidence is necessary if it is to be accepted.

HYPOTHESIS TESTING WHEN THE POPULATION IS NORMALLY DISTRIBUTED
We want to test the claim that the mean protein content of 1 litre cartons of milk is 39 grams.
The null hypothesis is $H_0$: $\mu = 39$
The alternative hypothesis is $H_a$: $\mu \neq 39$
Suppose we select a sample of 10 cartons of milk and find that for this sample the mean protein content is $\bar{x} = 38.4$ grams.
We need to determine whether this difference is likely to be due to random fluctuation or chance, or whether it is sufficient evidence to say the milk company's statement is incorrect.
Since the protein content of milk is a result of many different factors, it is reasonable to assume that the protein content of 1 litre cartons of milk is normally distributed. Suppose it is known that the standard deviation of protein in 1 litre containers of milk is $\sigma = 0.8$ grams.
Let X be the protein content of a 1 litre container of milk, so according to the null hypothesis, $X \sim N(39, 0.8^2)$.
Let the random variable $\bar{X}$ be the mean protein content of a sample of 10 one litre cartons.
Hence $\bar{X} \sim N\left(\mu, \left(\dfrac{\sigma}{\sqrt{n}}\right)^2\right)$, i.e., $\bar{X} \sim N\left(39, \left(\dfrac{0.8}{\sqrt{10}}\right)^2\right)$.
We use this to calculate the z-score of the observed value $\bar{x} = 38.4$ grams:
$z = \dfrac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \dfrac{38.4 - 39}{0.8/\sqrt{10}} \approx -2.37$
So $\bar{x}$ is 2.37 standard deviations below the mean.
If the difference between the observed value of $\bar{x}$ and the mean is due to chance alone, it could just as likely have been 2.37 standard deviations to the left or to the right of the mean. So, the probability that $\bar{X}$ is 2.37 standard deviations or more either side of the mean is a measure of how likely this is to occur.
Now $\Pr(Z \leq -2.37 \text{ or } Z \geq 2.37) = 2 \times \Pr(Z \leq -2.37)$ {symmetry}
$= 2 \times \text{normalcdf}(-\text{E}99, -2.37)$
$= 0.0178$
so the probability of this event happening is small.
One of the problems with random processes is that differences can always be due to chance. However, the practical solution is to reject the null hypothesis if the probability of the observed or more extreme results occurring is small.
The probability $\alpha$ at which we reject the null hypothesis is called the significance level of the test. Common significance levels are $\alpha = 0.05$ or 5% and $\alpha = 0.01$ or 1%.
In the above example, $\Pr(Z \leq -2.37 \text{ or } Z \geq 2.37) = 0.0178$.
This is less than 0.05 so we would reject the null hypothesis at the significance level of 0.05, but not at the significance level of 0.01.

The procedure for testing a hypothesis is set out below. Each step is followed by the corresponding result for the milk cartons example.

Step 1: State the null hypothesis $H_0$: $\mu = \mu_0$ and the alternative hypothesis $H_a$: $\mu \neq \mu_0$.
Milk cartons example: $H_0$: $\mu = 39$, $H_a$: $\mu \neq 39$

Step 2: Select a significance level, usually 0.05. Unless otherwise stated, the level of 0.05 is used in this book.

Step 3: From a sample, calculate the sample mean $\bar{x}$. If the parent population is normally distributed with mean $\mu$ and standard deviation $\sigma$, then the random variable $\bar{X}$ of sample means has the normal distribution $N\left(\mu, \left(\dfrac{\sigma}{\sqrt{n}}\right)^2\right)$. This distribution is called the null distribution. The null distribution is critical. It allows us to calculate the probability of the observed or more extreme events happening if the null hypothesis is true.
Milk cartons example: $\bar{X} \sim N(39, 0.253^2)$

Step 4: Use the sample mean $\bar{x}$ to find the test statistic $z = \dfrac{\bar{x} - \mu}{\sigma/\sqrt{n}}$. The Z-test derives its name from this statistic.
Milk cartons example: $z = -2.37$

Step 5: Calculate the P-value, which is the probability of all observations having a z-value more extreme than the test statistic z found in Step 4. Since we include the extreme outcomes either side of the mean, we call this a two-sided Z-test. Only two-sided tests are considered in this course.
Milk cartons example: $P = \Pr(Z \leq -2.37 \text{ or } Z \geq 2.37) = 0.0178$

Step 6:
• Reject the null hypothesis if the P-value is less than the significance level decided on in Step 2. The smaller the P-value is, the stronger the evidence against the null hypothesis.
• If the P-value is larger than the significance level decided on in Step 2, do not reject the null hypothesis.
Milk cartons example: Since P < 0.05 we reject the null hypothesis at the 0.05 level. Since P > 0.01, we do not reject the null hypothesis at the 0.01 level.

When a null hypothesis is not rejected, the terms "retain" and "accept" are often used. This does not mean that the null hypothesis is true, but rather that there is not enough evidence to show it is not true.
Similarly, when rejecting the null hypothesis, it is often stated that the alternative hypothesis is "accepted". This does not mean that the alternative hypothesis is true. However, if the null hypothesis is true, the outcome that led to rejecting it is a very unlikely one. The P-value tells you just how unlikely.

Example 22
A Mathematics coaching school knows that the results for their final test are normally distributed with population mean 74% and standard deviation 7%. A new coaching technique which is cheaper to implement but reported to have the same results is trialled by the school. In a trial of 40 students it is found that the mean score for the final test is 72% with standard deviation 6%. Is there sufficient evidence at the 5% level to conclude that the final test scores will be different?

Step 1: $H_0$: $\mu = 74$, $H_a$: $\mu \neq 74$
Step 2: Significance level is 0.05
Step 3: The sample mean $\bar{x} = 72$. Let the random variable $\bar{X}$ be the sample means, so the null distribution is $\bar{X} \sim N\left(\mu, \left(\dfrac{\sigma}{\sqrt{n}}\right)^2\right)$, i.e., $\bar{X} \sim N\left(74, \left(\dfrac{7}{\sqrt{40}}\right)^2\right)$.
Step 4: The test statistic is $z = \dfrac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \dfrac{72 - 74}{7/\sqrt{40}} \approx -1.81$
Step 5: The P-value is $P = \Pr(Z \leq -1.81 \text{ or } Z \geq 1.81) = 2 \times \Pr(Z \leq -1.81) \approx 0.0708$
Step 6: As P = 0.0708 > 0.05 there is insufficient evidence to reject the null hypothesis that the new coaching produces the same results as the old technique.
We thus accept that the new technique has the same result as the old technique.
Notice that we use $\sigma$ and not s for the Z-test.

Note: If $H_0$ is rejected,
• the direction of the difference is determined by the value of $\bar{x}$
• we still do not know how accurate the claim was.

EXERCISE 7H.1
1 A random variable X is normally distributed with a standard deviation $\sigma = 4$. It is claimed that the mean of X is $\mu = 17$.
a To test this claim a random sample of n = 50 was taken and the sample mean $\bar{x}$ was found to be 16.
i Write down the hypotheses $H_0$ and $H_a$.
ii Write down the null distribution.
iii Calculate the test statistic.
iv Calculate the P-value.
v What conclusion is there at the 0.05 level?
b Suppose that a random sample of n = 70 was taken and $\bar{x} = 16$. What can you now conclude at the 0.05 level?
2 A random variable X is normally distributed with a standard deviation $\sigma = 6$. A random sample of 40 was taken and the sample mean was found to be $\bar{x} = 61.4$. Use this information to test the claim that the population mean of X is $\mu = 60$.

Example 23
The bottlers of Groutt claim that the mean volume of bottles is 503 mL. To test this claim 10 bottles were selected. The measurements are listed below to the nearest 0.1 mL:
502.5, 501.0, 501.5, 503.9, 498.7, 505.7, 504.6, 499.4, 501.8, 501.1
Test the claim made by the bottlers of Groutt at the 5% level if it is known that the population standard deviation $\sigma$ is 1.8 mL.

We need to test the null hypothesis $H_0$: $\mu = 503$ against the alternative hypothesis $H_a$: $\mu \neq 503$.
Let X be the volume of each bottle of Groutt. As the bottling of liquids is subject to many random fluctuations, it is reasonable to assume that X is normally distributed with mean $\mu$ and standard deviation $\sigma$.
Let $\bar{X}$ be the distribution of the sample means, so the null distribution of $\bar{X}$ is $N\left(\mu, \left(\dfrac{\sigma}{\sqrt{n}}\right)^2\right)$.
From the null hypothesis we assume that $\mu = 503$.
From the sample we find that $\bar{x} = 502.02$, so the test statistic is
$z = \dfrac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \dfrac{502.02 - 503}{1.8/\sqrt{10}} \approx -1.722$
The P-value is $P = \Pr(Z \leq -1.722 \text{ or } Z \geq 1.722) = 2 \times \Pr(Z \leq -1.722) \approx 0.0851$
As P > 0.05 there is insufficient evidence to reject the claim that the volume of bottles of Groutt is 503 mL, i.e., we accept that the mean volume could be 503 mL.

3 A market gardener claims that the carrots in his field have a mean weight of 50 grams. Before buying the crop a buyer pulls 20 carrots at random. She finds that their individual weights in grams are:
57.6 34.7 53.9 52.5 61.8 51.5 61.3 49.2 56.8 55.9
57.9 58.8 44.3 58.3 49.3 56.0 59.5 47.0 58.0 47.2
a Explain why it is reasonable to assume that the carrots' weights are normally distributed.
b Test the claim made by the market gardener if it is known that the standard deviation for the whole crop is 7.1 grams.
4 The length of screws produced by a machine is known to be normally distributed with standard deviation $\sigma = 0.08$ cm. The machine is supposed to produce screws with a mean length of $\mu = 2.00$ cm. A quality controller selects a random sample of 15 screws and finds that the mean length of the 15 screws is $\bar{x} = 2.04$ cm with sample standard deviation of s = 0.09 cm. Does this justify the need to adjust the machine?
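The calculation in Example 23 can also be checked numerically. The following Python sketch is an illustration only and is not part of the original text. It assumes Python's standard `statistics.NormalDist` class as a stand-in for the calculator's normalcdf function, and the helper name `two_sided_z_test` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean


def two_sided_z_test(sample, mu0, sigma):
    """Return (z, p) for H0: mu = mu0 against Ha: mu != mu0, with sigma known."""
    n = len(sample)
    x_bar = mean(sample)
    z = (x_bar - mu0) / (sigma / sqrt(n))   # test statistic
    p = 2 * NormalDist().cdf(-abs(z))       # area in both tails, by symmetry
    return z, p


# Volumes (mL) of the 10 bottles of Groutt listed in Example 23
volumes = [502.5, 501.0, 501.5, 503.9, 498.7, 505.7, 504.6, 499.4, 501.8, 501.1]

z, p = two_sided_z_test(volumes, mu0=503, sigma=1.8)
print(f"z = {z:.3f}, P-value = {p:.4f}")
print("reject H0 at the 5% level" if p < 0.05 else "do not reject H0 at the 5% level")
```

The same helper could be applied to the carrot weights in question 3 by changing the data, the claimed mean and the known standard deviation.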
HYPOTHESIS TESTING WHEN THE POPULATION IS NOT NECESSARILY NORMALLY DISTRIBUTED
In the examples we have seen so far, the variable X was normally distributed and so the distribution of sample means $\bar{X}$ was normally distributed also. This may not be true if X is not normally distributed.
However, if the sample size n is sufficiently large, the Central Limit Theorem tells us that $\bar{X}$ is approximately normally distributed with mean $\mu$ and standard deviation $\dfrac{\sigma}{\sqrt{n}}$.
We can use this fact to test claims about population means.

Example 24
Susan's resting pulse rate has been 55 beats per minute for many years with standard deviation $\sigma = 2.6$ bpm. During a 5 day period she checks her resting pulse rate 8 times a day at regular intervals and finds that it has mean 56.2. Is there sufficient evidence, at a 5% level, to conclude that Susan's pulse rate has changed?

The null hypothesis is $H_0$: $\mu = 55$. The alternative hypothesis is $H_a$: $\mu \neq 55$.
The significance level is $\alpha = 0.05$.
The number in the sample is n = 5 × 8 = 40 and the sample mean is $\bar{x} = 56.2$.
The population standard deviation is $\sigma = 2.6$.
Let X be Susan's resting pulse rate.
We do not know how the random variable X is distributed, but if we assume that n is large enough for the Central Limit Theorem to apply then the null distribution for the sample means $\bar{X}$ is approximately normal with mean $\mu = 55$ and standard deviation $\dfrac{\sigma}{\sqrt{n}} = \dfrac{2.6}{\sqrt{40}} = 0.411$.
Entering this information into the calculator gives a P-value of P = 0.00351.
As P = 0.00351 < 0.05 there is evidence at the 0.05 level to reject the null hypothesis. We accept the alternative hypothesis $H_a$ that Susan's pulse rate has changed.
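Example 24 leaves the P-value to the calculator. As a hedged illustration, the Python sketch below shows one way that value could be obtained by working directly with the null distribution of the sample means. It assumes Python's standard `statistics.NormalDist` class in place of the calculator, and the variable names are chosen here for illustration only.

```python
from math import sqrt
from statistics import NormalDist

mu0 = 55        # mean pulse rate under H0 (beats per minute)
sigma = 2.6     # known population standard deviation
n = 5 * 8       # 8 readings a day for 5 days
x_bar = 56.2    # observed sample mean

standard_error = sigma / sqrt(n)                  # sigma / sqrt(n)
null_dist = NormalDist(mu=mu0, sigma=standard_error)

# Two-sided P-value: twice the smaller tail area beyond the observed mean
lower_tail = null_dist.cdf(x_bar)
p_value = 2 * min(lower_tail, 1 - lower_tail)

print(f"standard error = {standard_error:.3f}")
print(f"P-value = {p_value:.5f}")
```

Working on the scale of the sample means avoids computing z explicitly. Standardising first and using the standard normal distribution gives the same P-value.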
EXERCISE 7H.2
1 Globe Industries make torch globes whose lifetimes have standard deviation $\sigma = 9$ hours. If the globes last too long, people will have no need to buy new ones, but if they do not last long enough, people will stop buying them. A quality controller is to ensure that globes made by a machine have a mean life of 80 hours. The quality controller selects a sample of 50 globes and finds that they have a mean life of 83 hours.
a What is the null hypothesis the quality controller is testing?
b Assuming that a sample of n = 50 is large enough for the Central Limit Theorem to apply, what is the null distribution the quality controller will be using?
c Is there sufficient reason at the 5% level for the quality controller to adjust the machine?
2 Let X be the outcome of the roll of a fair six-sided die. The mean outcome of such a die is $\mu = 3.5$ with standard deviation $\sigma = 1.708$. Jack thinks his die may not be fair. To test this he rolls the die 100 times and finds that the mean of the 100 rolls is 3.2.
a What null hypothesis is Jack testing?
b Briefly explain why the outcomes of a roll of a fair die are not normally distributed.
c Assuming that a sample of size n = 100 is large enough for the Central Limit Theorem to apply, what is the null distribution Jack should be using?
d Does Jack have enough evidence at the 5% level to claim the die is not fair?
e Jack's sister Betty rolls the same die 200 times and finds that the mean of her sample is also 3.2. Would Betty come to the same conclusion as Jack?
3 While peaches are being canned, 250 mg of preservative is supposed to be added by a dispensing device. It is known that the standard deviation of preservative added is 7.3 mg. To check the machine, the quality controller obtains 60 random samples of dispensed preservative and finds that the mean preservative added was 242.6 mg.
At a 5% level, is there sufficient evidence that the machine is not dispensing a mean of 250 mg?
4 In recent times the mean age for New Zealand women on their first wedding day is 23.6 years with a standard deviation of 2.9 years. To determine if this differs from Australian women, a survey of 32 women was carried out. It was found that the mean age was 24.3 years. Test whether there is a significant difference at a 5% level.

REJECTION REGION FOR THE NULL HYPOTHESIS $H_0$
To test the null hypothesis $H_0$: $\mu = \mu_0$ against the alternative hypothesis $H_a$: $\mu \neq \mu_0$ we have used the test statistic $z = \dfrac{\bar{x} - \mu}{\sigma/\sqrt{n}}$.
Assuming that z > 0, our test at the 5% significance level has been to reject the null hypothesis if the P-value
$P = \Pr(Z \leq -z \text{ or } Z \geq z) < 0.05$
i.e., $2 \times \Pr(Z \leq -z) < 0.05$
i.e., $\Pr(Z \leq -z) < 0.025$.
But invNorm(0.025) ≈ −1.96, and so we reject the null hypothesis at the 5% level if the test statistic $z \leq -1.96$ or $z \geq 1.96$.
[Figure: standard normal curve with the rejection regions (RR) of $H_0$ shaded below −1.96 and above 1.96, each of area 0.025.]
The rejection region for the null hypothesis $H_0$ is the set of values of the test statistic for which the null hypothesis is rejected.
The 5% rejection region for the null hypothesis $H_0$: $\mu = \mu_0$ is the set $\{z : z \leq -1.96 \text{ or } z \geq 1.96\}$.

Example 25
A liquor chain claims that the mean price of wine has not changed from what it was 12 months ago. Records show that 12 months ago the mean price was $13.45 for a 750 mL bottle. A random sample of prices of 389 different bottles of wine is taken from several stores and the mean price is $13.30 and the standard deviation is $0.25. Is there sufficient evidence at the 5% level to reject the claim?

$H_0$: $\mu = 13.45$, $H_a$: $\mu \neq 13.45$
We use s = 0.25 to estimate $\sigma$ as n is large.
Assuming that the sample of size n = 389 is large enough for the Central Limit Theorem to apply, we find the test statistic
\[ z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{13.30 - 13.45}{0.25/\sqrt{389}} \approx -11.8 \]
Since z < −1.96 we reject the null hypothesis that there is no difference in the price and accept the alternative hypothesis that the price has changed.
Note that the calculator also calculates the test statistic z when using the 2-sided Z-test.

EXERCISE 7H.3
For questions 1 and 2, test the hypothesis using the rejection region for the null hypothesis. In each case you may assume that the sample size n is large enough for the Central Limit Theorem to apply.
1 Quickshave produces disposable razorblades. They claim that the mean number of shaves before a blade has to be thrown away is 13. A researcher wishes to test the claim and asks 30 men to supply data on how many shaves they got from one of the Quickshave blades. The researcher found that the mean of the sample was 12.8. Use this information to test the manufacturer's claim at a 5% level if the population standard deviation σ is 1.6.
2 It is claimed that the mean disposable income of households in a country town is $50 per week. To test this claim, 36 households were sampled and it was found that the mean disposable income of the 36 families was $47. Use this to test the claim that the mean disposable income is not $50 per week if the population standard deviation σ = $12.

Example 26
To test the hypothesis H0: μ = 40 against Ha: μ ≠ 40, a random sample of size 60 was taken and found to have mean x̄ and standard deviation s = 7. For what values of x̄ will the null hypothesis be rejected at the 5% level? Assume that the sample size is large enough for the Central Limit Theorem to apply.
The test statistic
\[ z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \approx \frac{\bar{x} - 40}{7/\sqrt{60}} = \frac{\bar{x} - 40}{0.9037} \]
The null hypothesis will be rejected if z ≤ −1.96 or if z ≥ 1.96
i.e., if \( \frac{\bar{x} - 40}{0.9037} \le -1.96 \) or if \( \frac{\bar{x} - 40}{0.9037} \ge 1.96 \)
∴ x̄ ≤ 40 − 1.96 × 0.9037 or x̄ ≥ 40 + 1.96 × 0.9037
The null hypothesis will be rejected if x̄ ≤ 38.2 or x̄ ≥ 41.8.

3 To test the hypothesis H0: μ = −23 against Ha: μ ≠ −23, a random sample of size 100 was taken and found to have mean x̄. For what values of x̄ will the null hypothesis be rejected at the 5% level? You may assume that the sample size is large enough for the Central Limit Theorem to apply and that the population standard deviation σ = 4.
4 The volume of soft drinks dispensed by a machine is normally distributed with standard deviation 3 mL. A quality controller has to adjust the machine if the mean volume dispensed is not 504 mL. To test the machine the quality controller finds the mean volume x̄ of 20 randomly selected bottles every hour. For what values of x̄ should the quality controller not adjust the machine?

DISCUSSION
The null hypothesis H0 assumes that the population mean μ is exactly equal to μ0. This is required to set up the null distribution needed to calculate probabilities. However, if the variable X that is being tested is continuous, the probability that μ is exactly equal to μ0 is zero! Does this mean that if you take a large enough sample, and have a measuring instrument that can measure outcomes of X accurately enough, you can always reject the null hypothesis?
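The discussion question can be explored numerically. The following Python sketch is an illustration only and is not part of the original text: the helper name two_sided_z_test and the numbers (μ0 = 10, x̄ = 10.1, σ = 1) are hypothetical, chosen so that the difference between x̄ and μ0 stays fixed while n grows. It uses the test statistic z = (x̄ − μ0)/(σ/√n) and the two-sided P-value 2 × Pr(Z ≤ −|z|) from this section.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_z_test(x_bar, mu0, sigma, n):
    """Return (z, p_value) for H0: mu = mu0 against Ha: mu != mu0."""
    z = (x_bar - mu0) / (sigma / sqrt(n))
    # Two-sided P-value: probability in both tails beyond |z|
    p_value = 2 * NormalDist().cdf(-abs(z))
    return z, p_value


# Hypothetical numbers: the same small difference of 0.1, larger and larger samples
for n in (25, 100, 400, 1600):
    z, p = two_sided_z_test(x_bar=10.1, mu0=10.0, sigma=1.0, n=n)
    print(f"n = {n:5d}  z = {z:5.2f}  P-value = {p:.4f}  reject H0: {p < 0.05}")
```

With these assumed numbers the P-value drops below 0.05 once n reaches 400, even though the difference of 0.1 has not changed.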
Compare the formal sentence, “There is a statistically significant difference between the population mean μ and μ0.” with what is commonly understood by, “There is a significant difference between the population mean μ and μ0.”

I CONFIDENCE INTERVALS FOR MEANS
In this section we show how to use a sample mean x̄ to calculate an interval in which we expect the population mean μ to lie. As with all statistics, our estimate x̄ could by chance be very far from μ, and we can never be absolutely sure that μ lies within the interval. We can, however, know how probable it is that μ lies in the interval.
A confidence interval estimate of a parameter (in this case the population mean μ) is an interval of values between two limits, together with a percentage indicating our confidence that the parameter lies in that interval.
We now consider how a so-called 95% confidence interval is constructed.
We start by finding the number a for which the standard normal distribution Z has probability Pr(−a < Z < a) = 0.95.
Because of the symmetry of the graph of the normal distribution, the statement reduces to
Pr(Z < −a) = 0.025
∴ −a = invNorm(0.025)
−a = −1.95996
a ≈ 1.96
[Figure: standard normal curve with central area 0.95 between −a and a, and area 0.025 in each tail.]
So, Pr(−1.96 < Z < 1.96) = 0.95
This means that: In any normal distribution, 95% of the outcomes lie within 1.96 standard deviations from the mean.
So, suppose the random variable X is normally distributed as N(μ, σ²).
If X̄ is the random variable of sample means of size n, then \( \bar{X} \sim N\left(\mu, \left(\frac{\sigma}{\sqrt{n}}\right)^2\right) \).
∴ 95% of all x̄ lie in the interval \( \mu - 1.96\frac{\sigma}{\sqrt{n}} < \bar{x} < \mu + 1.96\frac{\sigma}{\sqrt{n}} \).
In the diagram we have shown a few x̄ values in this interval as well as one that is not in this interval.
[Figure: distribution of sample means centred at μ with the central 95% marked, and the intervals around four sample means x̄1, x̄2, x̄3, x̄4. Notice that the interval calculated for one of the sample means does not contain μ.]
Note that each of the x̄ is in the middle of a line segment. All of these segments have the same length as the line segment from \( \mu - 1.96\frac{\sigma}{\sqrt{n}} \) to \( \mu + 1.96\frac{\sigma}{\sqrt{n}} \).
Since Pr(−1.96 < Z < 1.96) = 0.95 we know \( \Pr\left(-1.96 < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < 1.96\right) = 0.95 \).
So for the outcome x̄ within the confidence interval,
\( \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} < 1.96 \) and \( \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} > -1.96 \)
∴ \( \bar{x} - \mu < 1.96\frac{\sigma}{\sqrt{n}} \) and \( \bar{x} - \mu > -1.96\frac{\sigma}{\sqrt{n}} \)
∴ \( \mu > \bar{x} - 1.96\frac{\sigma}{\sqrt{n}} \) and \( \mu < \bar{x} + 1.96\frac{\sigma}{\sqrt{n}} \)
This says that if we were to take many samples of size n and calculate the sample mean x̄ for each of these samples, then for about 95% of these sample means, the population mean μ would lie in the interval
\[ \bar{x} - 1.96\frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + 1.96\frac{\sigma}{\sqrt{n}} . \]
So, the 95% confidence interval for μ is from \( \bar{x} - 1.96\frac{\sigma}{\sqrt{n}} \) to \( \bar{x} + 1.96\frac{\sigma}{\sqrt{n}} \).
[Figure: the confidence interval drawn as a line segment centred at x̄, extending 1.96 σ/√n on each side, from the lower limit x̄ − 1.96 σ/√n to the upper limit x̄ + 1.96 σ/√n.]
Confidence intervals for different confidence levels can be constructed for the population mean μ in a similar way.
Remember that we cannot be absolutely sure that μ will lie within the confidence interval, but we can be confident that 95% of the time it will be.

INVESTIGATION 7 CONFIDENCE LEVELS AND INTERVALS
To obtain a greater understanding of confidence levels and intervals, click on the icon to visit a random sampler demonstration. This will calculate confidence intervals at various levels of your choice (90%, 95%, 98% or 99%) and count the intervals which include the population mean.
Note: Consider samples of different size but all with mean 10 and standard deviation 2.
The 95% confidence interval is \( 10 - \frac{1.960 \times 2}{\sqrt{n}} < \mu < 10 + \frac{1.960 \times 2}{\sqrt{n}} \).
For various values of n we have:
n      Confidence interval
20     9.123 < μ < 10.877
50     9.446 < μ < 10.554
100    9.608 < μ < 10.392
200    9.723 < μ < 10.277
[Figure: the four confidence intervals for n = 20, 50, 100 and 200 drawn on a number line from 9 to 11, with μ = 10 marked.]
We see that increasing the sample size produces confidence intervals of shorter width.

Example 27
A sample of 60 yabbies was taken from a dam. The sample mean weight of the yabbies was 84.6 grams. Find the 95% confidence interval for the population mean if the population standard deviation is 16.8 grams.
We are given that x̄ = 84.6 and σ = 16.8.
The 95% confidence interval is: \( \bar{x} - 1.96\frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + 1.96\frac{\sigma}{\sqrt{n}} \)
i.e., \( 84.6 - \frac{1.96 \times 16.8}{\sqrt{60}} < \mu < 84.6 + \frac{1.96 \times 16.8}{\sqrt{60}} \)
∴ 80.3 < μ < 88.9
So, we are 95% confident that the population mean weight of yabbies lies between 80.3 grams and 88.9 grams.

Example 28
The fat content (in grams) of 30 randomly selected pasties at the local bakery was determined and recorded as:
15.1 14.8 13.7 15.6 15.1 16.1 16.6 17.4 17.5 15.7
16.2 16.6 15.1 12.9 17.4 16.5 17.2 17.3 16.1 16.5
16.7 16.8 17.2 17.6 16.1 13.2 17.3 13.9 14.0 14.7
Determine a 95% confidence interval for the mean fat content of all pasties made if the population standard deviation is 1.35 grams.
From a calculator x̄ = 15.90 and we are given σ = 1.35.
The 95% confidence interval for μ is \( \bar{x} - 1.96\frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + 1.96\frac{\sigma}{\sqrt{n}} \)
∴ \( 15.90 - 1.96 \times \frac{1.35}{\sqrt{30}} < \mu < 15.90 + 1.96 \times \frac{1.35}{\sqrt{30}} \)
∴ 15.4 < μ < 16.4
So, we are 95% confident that the mean fat content of all pasties produced lies between 15.4 g and 16.4 g.

EXERCISE 7I.1
1 A random sample of n individuals is selected from a population with known standard deviation 11. The sample mean is 81.6.
a Find a 95% confidence interval for μ if: i n = 36 ii n = 100.
b In changing n from 36 to 100, how does the width of the confidence interval change?
2 Neville works for a software company. He keeps records of the times customers have to wait to receive telephone support for their software. During a six month period he logs 167 calls, and the mean waiting time is 8.7 minutes. Find a 95% confidence interval for estimating the mean waiting time for all telephone customer calls for support if the population standard deviation is 2.08 minutes.
3 A breakfast cereal manufacturer uses a machine to deliver the cereal into plastic packets which then go into cardboard boxes. The quality controller randomly samples 75 packets and obtains a sample mean of 513.8 grams. Construct a 95% confidence interval in which the true population mean should lie if the population standard deviation is 14.9 grams.
4 A sample of 42 patients from a drug rehabilitation program showed a mean length of stay on the program of 38.2 days. Estimate with a 95% confidence interval the average length of stay for all patients on the program if the population standard deviation is 4.7 days.
5 To work out the credit limit of a prospective credit card holder, a company gives points based on factors such as employment, income, home and car ownership, and general credit history. A statistician working for the company randomly samples 40 applicants and determines the point total for each. These are:
84 82 63 53 76 71 66 60 67 61
63 76 80 78 54 75 71 72 67 80
64 74 60 70 59 72 70 56 63 61
81 58 82 68 77 68 74 68 69 72
a Determine the sample mean x̄ and standard deviation s.
b Using s to estimate σ, determine a 95% confidence interval that the company would use to estimate the mean point score for the population of applicants.
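The interval x̄ ± 1.96 σ/√n used in Examples 27 and 28 can also be computed in a few lines of Python. This is a minimal sketch and not part of the original text: the function name confidence_interval is hypothetical, and it takes the multiplier from the standard library's inverse normal function (about 1.95996 at the 95% level) instead of the rounded value 1.96, which does not change the results to one decimal place.

```python
from math import sqrt
from statistics import NormalDist


def confidence_interval(x_bar, sigma, n, level=0.95):
    """Return (lower, upper) limits of the confidence interval for mu.

    Assumes the population standard deviation sigma is known (or is
    estimated by s from a large sample), as in the examples above.
    """
    # Number a with Pr(-a < Z < a) = level; about 1.96 when level = 0.95
    a = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = a * sigma / sqrt(n)
    return x_bar - half_width, x_bar + half_width


# Example 27: 60 yabbies, sample mean 84.6 g, sigma = 16.8 g
lower, upper = confidence_interval(84.6, 16.8, 60)
print(f"{lower:.1f} < mu < {upper:.1f}")   # 80.3 < mu < 88.9

# Example 28: 30 pasties, sample mean 15.90 g, sigma = 1.35 g
lower, upper = confidence_interval(15.90, 1.35, 30)
print(f"{lower:.1f} < mu < {upper:.1f}")   # 15.4 < mu < 16.4
```

Passing a different level, for example level=0.99, gives intervals at other confidence levels in the same way.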
It is possible to obtain confidence intervals at any level of confidence from graphics calculators. Click on the icon to see how to do this on your calculator.

Example 29
A 95% confidence interval for a mean μ of a population was recorded as 8.5617 ≤ μ ≤ 9.4383. This estimate was based on a sample of size n = 60. Use this information to calculate
a x̄, the sample mean
b σ, the population standard deviation which was used to calculate the confidence interval.
a The 95% confidence interval is \( \bar{x} - 1.96\frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + 1.96\frac{\sigma}{\sqrt{n}} \)
So, \( \bar{x} - 1.96\frac{\sigma}{\sqrt{n}} = 8.5617 \) and \( \bar{x} + 1.96\frac{\sigma}{\sqrt{n}} = 9.4383 \)
Adding these equations gives 2x̄ = 8.5617 + 9.4383 = 18 and so x̄ = 9.
b Substituting n = 60 and x̄ = 9 into \( \bar{x} - 1.96\frac{\sigma}{\sqrt{n}} = 8.5617 \) gives \( 9 - 1.96\frac{\sigma}{\sqrt{60}} \approx 8.5617 \)
∴ \( 1.96\frac{\sigma}{\sqrt{60}} \approx 0.4383 \)
∴ \( \sigma \approx 0.4383 \times \frac{\sqrt{60}}{1.96} \approx 1.732 \)

6 A 95% confidence interval for the mean μ of a population is based on a sample of n = 50, and given by 3.5842 ≤ μ ≤ 4.4158. Find:
a x̄, the sample mean
b σ, the population standard deviation which was used to calculate the confidence interval.
7 A 95% confidence interval for the mean μ of a population is given by 19.685 ≤ μ ≤ 22.315. If the population standard deviation is σ = 6, what was the sample size?

DETERMINING HOW LARGE A SAMPLE SHOULD BE
When designing an experiment in which we wish to estimate the population mean, the size of the sample is an important consideration. Finding the sample size is a problem that can be solved using the confidence interval.
Let us revisit Example 28 on the fat content of pasties.
The question arises: ‘How large should a sample be if we wish to be 95% confident that the sample mean will differ from the population mean by less than 0.3 grams?’ i.e., −0.3 < μ − x̄ < 0.3
Now the 95% confidence interval for μ is: \( \bar{x} - 1.96\frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + 1.96\frac{\sigma}{\sqrt{n}} \)
Hence \( -1.96\frac{\sigma}{\sqrt{n}} < \mu - \bar{x} < 1.96\frac{\sigma}{\sqrt{n}} \)
and we need to find n when \( 1.96\frac{\sigma}{\sqrt{n}} = 0.3 \).
So, \( \sqrt{n} = \frac{1.96\sigma}{0.3} = \frac{1.96 \times 1.35}{0.3} \approx 8.82 \) and so n ≈ 78.
Thus, a sample of 78 pasties should be taken.

Example 30
Revisit the yabbies from the dam problem of Example 27. Suppose we wish to find the sample size needed to be 95% confident that the sample mean differs from the population mean by less than 5 grams. What sample size should be taken?
Now \( -1.96\frac{\sigma}{\sqrt{n}} < \mu - \bar{x} < 1.96\frac{\sigma}{\sqrt{n}} \)
so we need to find n such that \( 1.96\frac{\sigma}{\sqrt{n}} = 5 \), i.e., \( \frac{1.96 \times 16.8}{\sqrt{n}} = 5 \)
∴ \( n = \left(\frac{1.96 \times 16.8}{5}\right)^2 \approx 43.37 \)
A sample of 44 yabbies should be taken.

EXERCISE 7I.2
1 A researcher wishes to estimate the mean weight of adult crayfish in South Australian waters. She knows that the population standard deviation σ is 250.5 grams. How large must a sample be so that she is 95% confident that the sample mean differs from the population mean by less than 70 grams?
2 A porridge manufacturer samples 80 packets of porridge and finds that the sample standard deviation s of the contents' weight is 17.8 grams. If s is used to estimate the population standard deviation σ, how many packets must be sampled to be 95% confident that the sample mean differs from the population mean by less than 3 grams?
3 Patients from an alcohol rehabilitation program participate for various lengths of time with a standard deviation of 4.7 days.
How many patients would have to be sampled to be 95% confident that the sample mean number of days on the program differs from the population mean by less than 1.8 days?

Consider the typical 95% confidence interval shown in the diagram.
[Figure: a confidence interval of width w, from x̄ − 1.96 σ/√n to x̄ + 1.96 σ/√n, centred at x̄.]
The width of this interval is \( w = 2 \times \frac{1.96\sigma}{\sqrt{n}} \).
In taking a sufficiently large sample size n we can make w as small as we like.
As \( w = 2 \times \frac{1.96\sigma}{\sqrt{n}} \), \( \sqrt{n} = \frac{2 \times 1.96\sigma}{w} \) and so \( n = \left(\frac{2 \times 1.96\sigma}{w}\right)^2 \).
When we wish to estimate the population mean from a sample of size n at a 95% confidence level, the sample size is given by
\[ n = \left(\frac{2 \times 1.96\sigma}{w}\right)^2 \]
where σ is the population standard deviation and w is the confidence interval width.
In Example 30, w = 2 × 5 and σ = 16.8. Thus, \( n = \left(\frac{2 \times 1.96 \times 16.8}{10}\right)^2 \approx 43.37 \), etc.
Since n is an integer, n = 44 would give a 95% confidence interval of width about 10 grams.

4 A population is known to have standard deviation σ = 34. Find the sample size n that should be taken to find a 95% confidence interval for the population mean μ of width:
a w = 5
b w = 1
c w = 0.1
5 A manufacturer of bottled water knows that the machine dispenses water into 1 litre bottles with a standard deviation of 2.3 mL. The machine needs to be checked regularly to ensure it is still delivering the correct volume. How many bottles should a quality controller be checking to find a 95% confidence interval of width:
a 2 mL
b 1 mL
c 0.5 mL?
6 a If the size n of a sample is doubled, by how much will the width of a 95% confidence interval decrease?
b How much larger do you have to make a sample size to halve the width of a 95% confidence interval?

USING A CONFIDENCE INTERVAL FOR A CLAIM ABOUT μ
Confidence intervals provide an estimate for the size of the population mean μ. They can also be used to assess claims about population means.
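One way to picture "assessing a claim" is as a simple check of whether the claimed value of μ lies inside the confidence interval. The Python sketch below is an illustration only: the function names are hypothetical, it reuses the yabby data of Example 27 (x̄ = 84.6, σ = 16.8, n = 60), and the claimed means of 85 grams and 90 grams are invented for the purpose of the demonstration.

```python
from math import sqrt
from statistics import NormalDist


def confidence_interval(x_bar, sigma, n, level=0.95):
    """Return (lower, upper) limits of the confidence interval for mu."""
    a = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 when level = 0.95
    half_width = a * sigma / sqrt(n)
    return x_bar - half_width, x_bar + half_width


def claim_is_consistent(claimed_mu, x_bar, sigma, n, level=0.95):
    """True if the claimed population mean lies inside the confidence interval."""
    lower, upper = confidence_interval(x_bar, sigma, n, level)
    return lower <= claimed_mu <= upper


# Example 27 data; the claimed means 85 and 90 are hypothetical
print(claim_is_consistent(85, 84.6, 16.8, 60))   # True: 85 is inside 80.3 < mu < 88.9
print(claim_is_consistent(90, 84.6, 16.8, 60))   # False: 90 is outside the interval
```

A claimed value inside the interval is not proved correct; the sample is merely consistent with it at the chosen confidence level.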
For example:
Suppose the volume V of fruit juice dispensed by a machine is normally distributed with mean μ litres which can be adjusted, and standard deviation σ = 0.0015 litre (1½ mL, about ¼ of a teaspoon) which is fixed.
Suppose a manufacturer needs to fill cartons with 1 litre of fruit juice. To ensure that almost all cartons contain at least 1 litre, the value of the mean μ is set at 1.005 litre. A quality controller takes a sample of n cartons and, with very accurate measurements, finds that the sample mean v̄ = 1.00499 litres.
We want to test the hypotheses H0: μ = 1.005, Ha: μ ≠ 1.005 for various large values of n.
Note that for sufficiently large n the null hypothesis will be rejected at the 5% level. For such values of n the difference is statistically significant at the 5% level even though the difference of 0.01 mL (hardly a drop) is not significant as the word is commonly understood.

Example 31
Suppose the volume V of cool drinks dispensed into cartons by a machine is normally distributed with mean μ which can be adjusted, and standard deviation 10 mL which is fixed. The value of μ is supposed to be 1005 mL, but the machine operator notices that actually μ = 995 mL. The operator therefore adjusts the volume dispensed by the machine. A quality controller tests 25 cartons and finds that their mean volume is 1007 mL.
a Construct a 95% confidence interval for the volume μ dispensed by the machine.
b Use the 95% confidence interval to assess the claim that the volume dispensed by the machine has increased.
c Can we conclude that the volume of μ is now larger than 1005 mL?
a The confidence interval is 1003 ≤ μ ≤ 1011.
b Since 995 is less than all the values in the 95% confidence interval we can be confident that the population mean has increased.
c Although the sample statistic 1007 mL is larger than 1005, the smallest number in the 95% confidence interval for μ is 1003 mL. This means that μ could be as small as 1003 mL, and there is not enough evidence to support the claim that μ > 1005 mL.
Note: This question is closely related to testing the hypotheses H0: μ = 1005, Ha: μ ≠ 1005.

EXERCISE 7I.3
1 Suppose the time it takes Joan to run 100 metres is normally distributed with mean μ = 12.46 seconds and standard deviation 1 second. To improve her time Joan goes on a training program. After the training program, Joan finds that the mean time from 12 trial runs is now 11.62 seconds.
a Construct a 95% confidence interval for Joan's mean assuming the standard deviation has not changed.
b Use the result of part a to assess the claims:
i Joan's time to run 100 metres has improved.
ii Joan is now better than Betty whose time for the 100 metres is 11.97 seconds.
2 A complaint was made to a call centre that it took a mean time of 12 minutes before a caller was put through to an operator. After changes were made, the call centre claimed that the service had improved. To check this claim, a consumer group made 40 calls to the centre. They found the mean waiting time was 8 minutes with a standard deviation of 3 minutes. Assuming that 40 is large enough for the Central Limit Theorem to apply, construct a 95% confidence interval for the mean waiting time μ. Does the confidence interval support the call centre's claim? (Use s to estimate σ.)
3 The distance D a golfer can hit a ball is randomly distributed with a mean μ = 115 metres and standard deviation σ = 32 metres.
a After spending time with a professional the golfer measured the distance of 30 drives. The results of the drives in metres were as follows:
133 153 110 93 142 135 62 150 127 119
171 143 92 162 128 149 73 39 138 152
163 174 152 141 129 87 118 112 84 149
Assuming that the sample of 30 is large enough for the Central Limit Theorem to apply, calculate a 95% confidence interval for the mean distance μ the golfer can now hit the ball. Does the confidence interval provide enough evidence to support the claim the golfer has improved?
b The golfer decided to have another trial of 50 drives. Suppose the mean of the 50 trials is the same as in part a.
i Explain briefly why increasing the number of trials could make a difference to a drive length.
ii Does the new information provide evidence that the golfer has improved?

OTHER APPLICATIONS OF CONFIDENCE INTERVALS
Example 32
A buyer for a restaurant chain goes to a seafood wholesaler to inspect a large catch of 50 000 prawns. She has instructions to buy the catch only if the prawns are heavy enough. The buyer selects a sample of 60 prawns and finds that their mean weight is 57.2 grams. It is known that the population standard deviation σ is 4.2 grams.
a Find the 95% confidence interval for the population mean.
b The buyer claims she is 95% confident that no more than 10% of the prawns weigh less than 50 grams. Use the confidence interval found in part a to justify this claim. You may assume that the weights of prawns are normally distributed.
a Using technology, the 95% confidence interval for the population mean μ is 56.1 ≤ μ ≤ 58.3.
b The smallest value in the 95% confidence interval is 56.1, and so the buyer can be 95% confident that the population mean μ ≥ 56.1.
[Figure: number line marking 50.0 and the confidence interval (CI) from 56.1 to 58.3, centred at 57.2.]
If W is the weight of prawns, then W ∼ N(μ, σ²). If we use μ = 56.1 and σ = 4.2, then using technology Pr(W < 50) = 0.0732. Hence 7.32%, or less than 10% of the prawns weigh less than 50 grams.

EXERCISE 7I.4
1 The manager of a golf club claimed that the income of most of its members was in excess of $75 000 and thus its members could afford to pay increased annual subscriptions. To show that this claim was not valid, the members sought the help of a statistician. The statistician examined a random sample of 113 club members and found that the mean income was $96 318. It is known that the standard deviation of the members' incomes is $14 268.
a Find the 95% confidence interval for the population mean income of all members.
b The statistician claimed that he was 95% certain that no more than 10% of the members had an income of less than $75 000. Assuming that the income of members is normally distributed, how could you justify the statistician's claim?
2 Fabtread manufacture motorcycle tyres. Under normal test conditions the stopping time for motor cycles travelling at 60 km/h is 3.45 seconds with standard deviation 0.17 seconds. Their production team has just designed and manufactured a new tyre tread. They take 41 stopping time measurements with the new tyres and find the mean time is 3.03 seconds.
a Calculate a 95% confidence interval for the mean stopping time of the new tyres.
b The team claims that they are 95% certain that less than 15% of the stopping times of their new tyres will exceed the 3.45 seconds of the old tyres. Assuming that the stopping time is normally distributed, how could you justify the team's claim?

EXTENSION TO CONFIDENCE INTERVALS OTHER THAN 95%
There are often good reasons to find confidence intervals other than those of 95%.
In areas like medicine, a researcher may want to have more certainty when making decisions and often may prefer a confidence interval of 99%. In other areas where the outcomes of decisions are not so important, people may be satisfied with 90% confidence intervals. Your calculator can produce confidence intervals at any level.

EXERCISE 7I.5
1 The mean μ of a population is unknown, but its standard deviation is 10. In order to estimate μ a random sample of size n = 35 was selected. The mean of the sample was found to be 28.9.
a Find a 95% confidence interval for μ.
b Find a 99% confidence interval for μ.
c In changing the confidence level from 95% to 99%, how does the width of the confidence interval change?
2 If the P% confidence interval for μ is \( \bar{x} - a\left(\frac{\sigma}{\sqrt{n}}\right) < \mu < \bar{x} + a\left(\frac{\sigma}{\sqrt{n}}\right) \) then for P = 95, a = 1.960. Find a if P is:
a 99
b 80
c 85
d 96.
3 The choice of the confidence level to be used is made by an experimenter. Why is it that experimenters do not always choose confidence intervals of at least 99%?

J REVIEW
REVIEW SET 7A
1 The arm lengths of 18 year old females are normally distributed with mean 64 cm and standard deviation 4 cm.
a Find the percentage of 18 year old females whose arm lengths are:
i between 60 cm and 72 cm
ii greater than 60 cm.
b Find the probability that if an 18 year old female is chosen at random, she will have an arm length in the range 56 cm to 68 cm.
2 a If Z has a standard normal distribution, find k if Pr(Z ≤ k) = 0.95.
b If X ∼ N(23, 2.6²) find k if Pr(X < k) = 0.6.
3 In a mathematics test out of 40 marks, the mean mark was 28.3 and the standard deviation was 4.1. The marks were all integers and the minimum pass mark was set at 24.
Assuming marks were approximately normal, what proportion of the students:
a passed the test
b scored more than 20
c scored between 25 and 35?
4 The weights of apples from an orchard are known to be normally distributed with mean μ = 350 grams and standard deviation σ = 25 grams. The apples are packed in boxes of 50 each.
a How many apples in a box would you expect to weigh more than 375 grams, and how many less than 325 grams?
b In 500 boxes, how many apples would you expect to have a weight between 325 and 375 grams?
5 To test the hypotheses H0: μ = 36 and Ha: μ ≠ 36 a random sample of n = 20 was selected. The outcomes are listed below:
38 22 43 21 36 44 20 49 36 30
42 43 38 28 33 22 29 25 28 34
Use this information to test the null hypothesis at the 5% level if the population standard deviation is 10 grams.
6 The standard deviation in the weight of cereal boxes is 23.6 grams. How many boxes must be sampled from the population to be 95% confident that the sample mean differs from the population mean by less than 4 grams?
7 A factory canning apricots uses a machine to deliver the fruit and syrup into cans. The quality controller randomly samples 65 cans and finds that the mean mass of contents is 828.2 grams.
a Construct a 95% confidence interval in which the true population mean should lie if the population standard deviation is 16.3 grams.
b What should the sample size be to construct a confidence interval of half the width of that in a?
8 a Kerry's marks for an English essay and a Chemistry test were 26 out of 40 and 82% respectively.
i Explain briefly why the information given is not sufficient to determine whether Kerry's results are better in English than in Chemistry.
ii Suppose that the marks of all students in both the English essay and the Chemistry test were normally distributed as N(22, 4²) and N(75, 7²) respectively. Use this information to determine which of Kerry's two marks is better.
iii If there were 50 students sitting for the English essay, how many would have scored more than Kerry?
b Les is to sit for five subjects in the final examination. Because of many different factors that determine examination marks, the marks Les can expect in each exam are normally distributed. Suppose that the mean μ and standard deviation σ = 2 are the same for each exam. If μ = 12 calculate the probability that Les will gain a total mark for the five subjects of between 60 and 70.
c The value of the mean μ depends on the time t hours that Les studies. It is given by μ = 16 − 8/(t + 2).
i For how long must Les study to achieve a value of μ = 15?
ii Les's total score for the five examinations was 65. Use this information to test the hypotheses H0: μ = 15 and Ha: μ ≠ 15.
iii Use the total score of 65 to construct a 95% confidence interval for the mean μ. Use this interval to estimate a range of times Les might have studied for the examination.

REVIEW SET 7B
1 Find the mean and standard deviation of these two samples of lengths given in cm:
A 170.1 169.4 169.5 170.4 169.8 170.5 170.0 170.0 169.9 170.2 170.0 169.9 169.9 170.5 170.0 170.3 170.8 170.1 169.7 170.0
B 177 166 153 167 176 173 169 161 172 170 162 178 174 179 171 148 184 178 174 175
Which of the above is a sample of heights of 15 year old boys, and which is a sample of length of planks cut by a machine?
2 The contents of a certain brand of soft drink can is normally distributed with mean 377 mL and standard deviation 4.2 mL.
a Find the percentage of cans with contents:
i less than 368.6 mL
ii between 372.8 mL and 389.6 mL
b Find the probability of randomly selecting a can with contents between 364.4 mL and 381.2 mL.
3 The life of a Xenon battery is known to be normally distributed with a mean of 33.2 weeks and a standard deviation of 2.8 weeks.
a Find the probability that a randomly selected battery will last at least 35 weeks.
b For how many weeks can the manufacturer expect the batteries to last before 8% of them fail?
4 The length of steel rods produced by a machine is normally distributed with a standard deviation of 3 mm. It is found that 2% of all rods are less than 25 mm long. Find the mean length of rods produced by the machine.
5 a If Z has a standard normal distribution, find a if Pr(Z ≤ a) = 0.9.
b If X ∼ N(15.6, 2²) find a if Pr(X < a) = 0.9.
6 A manufacturer claims that his canned soup contains 135 mg of salt. To check this claim a consumer tested 87 cans for salt content and found that the mean was 139.6 mg. It is known that the population standard deviation is 22.8 mg. At a 5% level is there sufficient evidence to reject the manufacturer's claim?
7 To test the null hypothesis H0: μ = 2000 and Ha: μ ≠ 2000, a random sample of n = 75 was selected and found to have mean x̄ = 1840.
a If the population standard deviation σ = 690, is there sufficient evidence to reject the null hypothesis at the 5% level?
b For what values of the sample mean x̄ would you not reject the null hypothesis at the 5% level?
8 A telephone call centre handles many calls each day. Let T be the time in minutes taken to answer a call. In 2006 the mean answering time for a call was μ = 4.3 minutes with standard deviation σ = 1.2 minutes.
  Let T̄ be the mean time taken to answer a random sample of 100 calls.
  a The two histograms below show the distribution of a sample of size 50 taken from T. Note that the horizontal scale and the bin width are the same in both histograms, but the vertical scales are different.
    [Figure: Histogram A and Histogram B, each plotting frequency against time (min) on a horizontal scale from 0 to 8. The frequency axis runs from 0 to 40 in Histogram A and from 0 to 6 in Histogram B.]
    Identify the histogram that represents a sample from T. Explain your answer.
  b i Assuming that n = 100 is sufficiently large, explain why the distribution of T̄ is approximately normal with mean 4.3 minutes and standard deviation 0.12 minutes.
    ii Calculate the probability Pr(T̄ ≤ 4.35).
    iii Hence calculate the probability that an operator in the call centre can be occupied in answering 100 calls for less than seven and a quarter hours.
  c As well as answering routine calls, the supervisor of the call centre also handles unusual cases that are too complicated for other staff to handle. When the supervisor was timed her mean time to answer 100 calls was T̄ = 4.6 minutes.
    i Use the statistic T̄ = 4.6 minutes to test the hypothesis H0: μ = 4.3 and Ha: μ ≠ 4.3, at 5% level.
    ii The supervisor is asked to explain why she is taking too long to answer questions. What reasons can the supervisor provide to claim that the Central Limit Theorem does not apply to her?

REVIEW SET 7C

1 Sketch the graph of X ~ N(3, 2²). On the horizontal axis mark in the z-scores as well as their corresponding x values. Calculate these probabilities:
  a Pr(−1 ≤ X ≤ 1)
  b Pr(−1 ≤ Z ≤ 1).
2 Staplers are manufactured for $5.00 each and are sold for $20.00 each. The staplers are guaranteed to last three years.
  The mean life is actually 3.42 years and the standard deviation 0.4 years. If the life of these staplers is normally distributed, how much profit would we expect from selling a batch of 2000 (with a maximum of one replacement)?
3 The edible part of a batch of Coffin Bay oysters is normally distributed with mean 38.6 grams and standard deviation 6.3 grams. Given that the random variable X is the mass of a Coffin Bay oyster, find:
  a a if Pr(38.6 − a ≤ X ≤ 38.6 + a) = 0.6826
  b b if Pr(X > b) = 0.8413.
4 King prawns are favourite items on the menu of Stirling Caterers. From past experience the manager knows that people on average eat 325 g of prawns with standard deviation 86 g. The manager is to cater for a wedding of 80 guests and decides to purchase 27.5 kg of prawns. What is the probability that the caterer will run out of prawns?
5 For export purposes peaches must be neither too small nor too large. A grower claims that the peaches in his orchard have a mean weight of 300 grams, just right for export. A buyer knows that the population standard deviation is 30 grams, and he wants to test the grower’s claim.
  a What hypotheses should the buyer consider?
  b Suppose the buyer selects a random sample of 100 peaches and finds that their mean weight x̄ = 310 grams.
    i What is the null distribution the buyer should use?
    ii Calculate the test statistic z for this sample.
    iii Does this sample support the grower’s claim at the 5% level?
6 The average width of snail shells of a local species needs to be estimated. It is known that the standard deviation is 1.4 mm. Pauline takes a random sample of 200 snails and measures the width of each shell to the nearest mm. The results are shown in the table alongside.
  a Find the sample mean.
  b Determine a 95% confidence interval for the population mean μ.
  Width (mm)   22  23  24  25  26  27  28  29
  Frequency     1   3  17  43  68  41  24   3

7 Suppose the weight X of apricots is normally distributed with μ = 90 grams and σ = 10 grams.
  a Calculate the proportion of apricots with weight less than 88 grams.
  b In a box of 100 apricots, how many would you expect to weigh less than 88 g?
  c The apricots are packaged into boxes of 100 each. What proportion of the boxes will have apricots with a mean weight less than 88 g?
  d On each of the boxes of 100 apricots is printed that the nett weight is 8.8 kilograms. In a shipment of 500 boxes, for how many is the weight less than 8.8 kilograms?
8 The time T it takes Laura to travel to work is normally distributed with mean μ minutes and standard deviation 10 minutes. Laura’s work starts at 9 o’clock in the morning.
  a Suppose μ = 40 minutes and Laura leaves for work at a quarter past eight in the morning.
    i What is the probability she will be late?
    ii If there are 250 working days in a year, how often would Laura be expected to be late to work in a year?
  b Laura does not know the value of μ and decides to keep a 10 day record of the time it takes her to go to work. Let T̄₁₀ be the distribution of the mean time over 10 days it takes Laura to go to work.
    i Briefly describe the distribution T̄₁₀ in terms of the distribution T it takes Laura to go to work.
    ii Suppose Laura found that for her sample of 10 days the mean time to travel to work was T̄₁₀ = 35 minutes. Use this information to test the hypotheses H0: μ = 40 and Ha: μ ≠ 40, at 5% level.
    iii Calculate the 95% confidence interval for μ.
    iv How large a sample should Laura take to obtain a 95% confidence interval of width 2.48 minutes?
  c After keeping records for a year consisting of 250 working days, Laura found that the mean travelling time to work was 31.52 minutes. She wants to be 95% certain that she will be at work before 9 o’clock at least 90% of the time in the following year. To the nearest minute, what is the latest time you would advise Laura to leave home? Give reasons for your answer.
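
Several questions in these review sets use the same three calculations for a mean with known population standard deviation: a two-sided z-test, a confidence interval, and the sample size needed for an interval of a given width. The sketch below is one possible way to check such working with Python's standard library. The function names and the numbers in the usage section are illustrative assumptions, not taken from the text, and the code is not a substitute for the methods set out in the chapter.

```python
from math import ceil, sqrt
from statistics import NormalDist

# Standard normal distribution Z ~ N(0, 1)
Z = NormalDist()


def z_test_two_sided(sample_mean, mu0, sigma, n):
    """Test H0: mu = mu0 against Ha: mu != mu0 with known sigma.

    Returns the test statistic z and the two-sided p-value.
    """
    standard_error = sigma / sqrt(n)
    z = (sample_mean - mu0) / standard_error
    p_value = 2 * Z.cdf(-abs(z))
    return z, p_value


def confidence_interval(sample_mean, sigma, n, level=0.95):
    """Confidence interval for mu with known sigma."""
    z_star = Z.inv_cdf(0.5 + level / 2)  # about 1.96 when level = 0.95
    margin = z_star * sigma / sqrt(n)
    return sample_mean - margin, sample_mean + margin


def sample_size_for_width(sigma, width, level=0.95):
    """Smallest n for which the interval is no wider than `width`."""
    z_star = Z.inv_cdf(0.5 + level / 2)
    return ceil((2 * z_star * sigma / width) ** 2)


if __name__ == "__main__":
    # Hypothetical data: sigma = 10, n = 25, sample mean 52, H0: mu = 50
    z, p = z_test_two_sided(52, 50, 10, 25)
    print("z =", round(z, 3), " p-value =", round(p, 4))
    print("95% CI:", confidence_interval(52, 10, 25))
    print("n for width 4:", sample_size_for_width(10, 4))
```

At the 5% level the null hypothesis is rejected when the p-value is below 0.05, which is equivalent to |z| exceeding about 1.96.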