Statistics - Haese Mathematics

Chapter 7
Statistics

Contents:
A Key statistical concepts
B Describing data
C Normal distributions
D The standard normal distribution
E Finding quantiles (k-values)
F Investigating properties of normal distributions
G Distribution of sample means
H Hypothesis testing for a mean
I Confidence intervals for means
J Review
INTRODUCTION
The word statistics was introduced into the English language by the
Scottish politician Sir John Sinclair (1754 – 1835). He borrowed it
from Germany where, as he put it, it meant,
“an inquiry for the purpose of ascertaining the political
strength of a country”.
The meaning he wished to give to the word was an
“inquiry into the state of a country, for the purpose of
ascertaining the quantum of happiness enjoyed by its
inhabitants, and the means of future improvement.”
You can still recognise the word “state” in statistics.
Words that are commonly used in Statistics:
• Population: A collection of individuals about which we want to draw conclusions.
• Census: The collection of information from the whole population.
• Sample: A selection of information from a subset of the population.
• Data (singular datum): Information about individuals in a population.
• Parameter: A numerical quantity measuring some aspect of a population.
• Statistic: A quantity calculated from data gathered from a sample. It is usually used to estimate a population parameter.
• Distribution: The pattern of variation of data.
A KEY STATISTICAL CONCEPTS
RANDOM SAMPLES
A population generally consists of a large number of individuals. Because of expense and
time factors it is often only practical to select a sample rather than use the whole population.
A random sample is a sample where every individual has the same chance of being selected.
A sampling technique is biased if it tends to systematically select members of the population
with certain properties and not select those that do not have these properties. In other words
it favours some individuals above others.
DISCUSSION
SAMPLING
In the following scenarios, can you suggest a likely population?
Can you think of any reasons the sampling techniques might be biased?
• People in the local shopping centre on Saturday morning were asked how many computers they have in their household.
• After a program likely to be watched by older people, a television station asked viewers to vote on the use of hand-held phones in cars.
• A local paper advertised for volunteers to test the usefulness of fish oil in a diet.
Many sampling techniques have been developed to avoid bias. In this book it will be assumed
that any sample is a random, unbiased sample.
DESCRIPTIVE AND INFERENTIAL STATISTICS
Descriptive statistics are concerned with collecting, summarising and describing the
characteristics of data.
With descriptive statistics we are only concerned with the data collected and make no effort
to generalise it to any other data, such as for the population.
In inferential statistics we select a random sample and we use the information from it
to make generalisations about the population from which the sample was taken.
EXAMPLES OF PARAMETERS AND STATISTICS
Recall that:
a parameter is a numerical characteristic of a population and
a statistic is a numerical characteristic of a sample.
Note: Parameter goes with Population; Statistic goes with Sample.
For example, when examining the mean age of people in retirement villages throughout
Australia, the mean age found would be a parameter. If we took a random sample of 300
people from the population of all retirement village persons, then the mean age would be a
statistic.
Example 1
A business is considering purchasing 50 000 blank CDs to make CDs of their new text books. It will make the purchase if no more than 1.5% of the CDs are defective.
Because of the expense and time factors in testing all 50 000 CDs the business decides to test a random sample of 600 for defects. They will then use the results of this sample to estimate the percentage of defectives for the population to be purchased.
a What is the population size?
b What is the sample size?
c What population parameter is of interest to the business?
d What statistic is being used to estimate the parameter?

a The population is the set of blank CDs to be purchased and its size is 50 000.
b The sample size is 600.
c The population parameter being considered is the percentage of CDs which are defective.
d The statistic being used is the percentage of CDs which are defective in the sample. As 1.5% of 600 = 9, the business would make the purchase if 9 or fewer CDs in the sample were found to be defective.
THE PROCEDURE USED IN AN INFERENTIAL PROBLEM
In this course the key application is to examine a random sample in order to make appropriate
statements or inferences about the population.
Generally speaking there are five steps to address in any inferential problem. They are:
Step 1: State the population we are interested in examining.
Step 2: Collect data from a random sample of sufficient size from the population.
Note: What is meant by sufficient size is covered in a later chapter.
Step 3: Examine the relevant information from the sample.
Step 4: Use the results of the sample analysis to make an inference about the population.
Step 5: Give a measure of the reliability of the inference made.
Example 2
For the CD purchase in Example 1 list the procedural steps for the inferential
problem.
Step 1: The population consists of all 50 000 CDs.
Step 2: To avoid unnecessary costs and wasting time we must first decide on the sample size. 600 has been decided upon, so we collect 600 data values at random. We record only whether the CD is defective or not.
Step 3: Find the percentage of defective CDs in the sample.
Step 4: The inference will be to provide an estimate of the percentage of defective CDs for the whole population. For example, if 12 CDs are defective in the sample our inference would be that approximately 12/600 = 2% would be defective in the population.
Step 5:
The estimate from the sample is not likely to be equal to the exact
value for the population. Some indication of the possible error for the
estimate should therefore be given.
An example of such a statement as in Step 5 is:
If we had many shipments of 50 000 CDs and in each we found that 12 in a sample of
600 were defective, then in 95% of these shipments there would be between 440 and
1560 defective CDs.
This type of statement is usually condensed to:
We are 95% confident that about 440 to 1560 CDs are defective.
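The text does not say how the figures 440 and 1560 were obtained. One common method that reproduces them is the normal-approximation interval for a proportion, sketched below. The function name `proportion_interval` and the multiplier 1.96 (the usual 95% value) are assumptions made for this illustration, not something stated in the chapter.

```python
import math

def proportion_interval(defective, sample_size, z=1.96):
    """Approximate confidence interval for a population proportion.

    Assumes the normal approximation; z = 1.96 corresponds to 95% confidence.
    """
    p_hat = defective / sample_size                      # sample proportion, 12/600 = 0.02
    margin = z * math.sqrt(p_hat * (1 - p_hat) / sample_size)
    return p_hat - margin, p_hat + margin

low, high = proportion_interval(12, 600)
population = 50_000

# Scale the interval for the proportion up to a number of CDs in the shipment.
print(round(low * population), round(high * population))  # 440 1560
```

The interval for the proportion is roughly 0.9% to 3.1%, which scales to about 440 to 1560 of the 50 000 CDs.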
The main thrusts of this course are to:
• determine confidence intervals in which a certain population parameter should lie at a particular level of confidence (commonly 90%, 95%, 99%)
• devise and use particular tests of hypotheses about population means
• determine what sample sizes should return a particular level of confidence in given situations.
EXERCISE 7A
1 A new drug called Cobrasyl, a derivative of cobra
venom, is to be approved for the treatment of high
blood pressure in humans.
A research team treats 127 high blood pressure
patients with the drug and in 119 cases it reduces
their blood pressure to an acceptable level.
a What is the sample of interest?
b What is the population of interest?
2 In 2006, 800 computer workers throughout Australia were surveyed and asked a question.
The question was: “Is your main interest in developing software or in using already
developed software?” 83% said that developing software was their main interest.
a What is the population of interest?
b What is the parameter of interest?
c What statistic is used to estimate the parameter?
3 A South Australian processor of seafood needs to estimate the average weight of a prawn in a catch. A sample of 352 prawns was selected and found to have an average weight of 53.8 grams.
a What is the population the processor is
interested in?
b What is the parameter of interest?
c What statistic does the processor use to
estimate the parameter?
4 Last December Tina visited four supermarkets A, B, C and D on the same day. She recorded the price per kilogram of various fruits in the table below:

Store   Oranges   Apples   Bananas
A       $2.35     $2.15    $1.70
B       $2.45     $2.55    $2.00
C       $2.50     $2.60    $2.10
D       $2.25     $2.05    $1.90

Determine whether the following statements are descriptive or inferential:
a In this city, bananas are cheaper than oranges.
b If you buy a kilogram of each of the three different types of fruit from the one
store, you pay the same total amounts at stores A and D.
c Of the four stores, the store with the most expensive apples also had the most
expensive oranges and bananas.
d In general, store C has the
most expensive fruit.
e Of the four stores, store C
has the most expensive
fruit. (Careful! What is the
population and what is the
sample?)
B DESCRIBING DATA
This section will review the main concepts from Year 11 so that students will reacquaint
themselves with the terminology used in statistics.
A variable is a quantity that can have different values for different individuals in the
population.
Since variables are sometimes used to describe random processes, they are often called
random variables.
Variables are usually denoted by capital letters such as X. Individual values, called observations or outcomes, are denoted by lower case letters such as x.
We shall deal with two types of variables: categorical and quantitative.
A categorical or nominal variable can be described by a quality or characteristic that
is essentially non-numeric. Individuals are described by different categories.
Examples of categorical data are:
Variable                                    Possible values
• X is the gender of a person               x = male or female
• C is the type of motor car                c = Holden, Ford, Toyota
• M is the membership of political party    m = ALP, LIB, DEM
A quantitative or numerical variable takes numerical values.
There are essentially two different types of numerical variable.
A numerical discrete variable takes discrete number values only.
It is often a result of counting.
Examples of discrete variables are:
Variable                                      Possible values
• X is the number of people in a household    x = 1, 2, 3, 4, ...
• T is the mark out of 10 for a test          t = 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
A numerical continuous variable can take any numerical value in an interval.
A continuous variable is often a result of measuring.
Examples of continuous variables are:
Variable                                                    Possible values
• W is the weight of newborn babies                         w is likely to be in the interval from 0.5 kg to 5 kg.
• X is the amount of water in a 500 litre rain water tank   x is any volume between 0 and 500 litres.
Since continuous variables take on values in intervals, they are also called interval variables.
The essential difference between a categorical and a quantitative variable is that we can do
arithmetic with quantitative variables, but not with categorical variables.
In this book we are mainly concerned with the mean and the standard deviation.
THE MEAN AND STANDARD DEVIATION (REVIEW)
The mean of a sample of n numbers \(x_1, x_2, \dots, x_n\) is:
\[ \bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i \]
The Greek letter \(\Sigma\) (sigma) is used to denote the summation of numbers, so
\[ \sum_{i=1}^{n} x_i = x_1 + x_2 + \dots + x_n \]
(read “the sum of all \(x_i\) for i = 1 to n”).
The endpoints of the summation, i = 1 to n, are sometimes omitted, so the mean can be written as \(\frac{1}{n}\sum x_i\) or even \(\frac{1}{n}\sum x\).
The mean of a population is usually denoted by the Greek letter \(\mu\) (mu), so \(\mu = \frac{1}{n}\sum x\).
We can get a much clearer picture of a data set if, in addition to having a measure for the
centre, we also have an indication of how the data is spread.
For example, the mean weight of oranges from a particular orchard and the mean weight of
salt bagged by a machine may both be 500 grams, but the variation in the weights of oranges
is likely to be much greater than that of bags of salt. The data for oranges will therefore have
a greater spread.
The most commonly used measure of spread about the mean is the standard deviation.
The standard deviation of a sample is a little different from the standard deviation of a
population.
In a sample of size n, the sample standard deviation, usually denoted by s, is:
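\[ s = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n-1}} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}} \]
In a population of size n, the population standard deviation, usually denoted by the Greek letter \(\sigma\) (sigma), is:
\[ \sigma = \sqrt{\frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + \dots + (x_n - \mu)^2}{n}} = \sqrt{\frac{\sum (x_i - \mu)^2}{n}} \]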
The reason for this difference is rather technical and, at this stage, we do not attempt to explain it.
Statisticians know that dividing by n − 1, as in the above formula for s, makes s² an unbiased estimate of the population variance σ². For this reason s is used to estimate the population standard deviation σ.
Notice that for large n, the two formulae give virtually the same value.
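As a small illustration of the two formulae, the sketch below uses Python's standard `statistics` module, where `stdev` divides by n − 1 and `pstdev` divides by n. The eight data values are invented purely for the illustration and do not come from the text.

```python
import math
import statistics

# Invented illustrative data (not from the text).
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)

mean = statistics.mean(data)            # (x1 + x2 + ... + xn) / n
s = statistics.stdev(data)              # sample formula, divides by n - 1
sigma = statistics.pstdev(data)         # population formula, divides by n

print(mean, round(s, 3), round(sigma, 3))

# The two results differ only by the factor sqrt(n / (n - 1)),
# which approaches 1 as n becomes large.
print(round(s / sigma, 4), round(math.sqrt(n / (n - 1)), 4))
```

For these eight values the sum of squared deviations is 32, so the population formula gives exactly 2 while the sample formula gives a slightly larger value.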
The mean and standard deviation can also be calculated from frequency tables.
The frequency fi of a quantity xi is the number of times it occurs.
For a population of size n, the formulae for the mean and standard deviation become:
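\[ \mu = \frac{f_1 x_1 + f_2 x_2 + f_3 x_3 + \dots + f_k x_k}{n} \quad\text{and}\quad \sigma = \sqrt{\frac{(x_1 - \mu)^2 f_1 + (x_2 - \mu)^2 f_2 + \dots + (x_k - \mu)^2 f_k}{n}} \]
Notice that
\[ \mu = \left(\frac{f_1}{n}\right)x_1 + \left(\frac{f_2}{n}\right)x_2 + \left(\frac{f_3}{n}\right)x_3 + \dots + \left(\frac{f_k}{n}\right)x_k. \]
\(\frac{f_i}{n}\) is the proportion of \(x_i\) in the population. For large values of n, the experimental probability \(p_i\) of randomly selecting \(x_i\) from the population is taken to be \(p_i = \frac{f_i}{n}\).
So, using \(p_i = \frac{f_i}{n}\),
\[ \mu = p_1 x_1 + p_2 x_2 + p_3 x_3 + \dots + p_k x_k = \sum p_i x_i. \]
Similarly for the population standard deviation:
\[ \sigma = \sqrt{\left(\frac{f_1}{n}\right)(x_1 - \mu)^2 + \left(\frac{f_2}{n}\right)(x_2 - \mu)^2 + \dots + \left(\frac{f_k}{n}\right)(x_k - \mu)^2} \]
which leads to
\[ \sigma = \sqrt{\sum p_i (x_i - \mu)^2}. \]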
Example 3
A magazine store claims 23% of its customers purchase one magazine, 38% purchase
two, 21% purchase three, 13% purchase four, and 5% purchase five. Find the mean
and the standard deviation of X, the number of magazines sold to a customer.
The probability table is:

x_i   0      1      2      3      4      5
p_i   0.00   0.23   0.38   0.21   0.13   0.05

Now
\[ \mu = \sum p_i x_i = 0.23 \times 1 + 0.38 \times 2 + 0.21 \times 3 + 0.13 \times 4 + 0.05 \times 5 = 2.39 \]
i.e., in the long run, the average number purchased per customer is 2.39.
Also,
\[ \sigma = \sqrt{\sum p_i (x_i - \mu)^2} = \sqrt{0.23 \times (1 - 2.39)^2 + 0.38 \times (2 - 2.39)^2 + \dots + 0.05 \times (5 - 2.39)^2} \approx 1.12 \]
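The arithmetic in Example 3 can be checked with a few lines of Python. This is only a sketch of the two formulae applied to the probability table; the variable names are chosen for the illustration.

```python
import math

# Probability table from Example 3 (the x = 0 column has probability 0 and is omitted).
magazines = [1, 2, 3, 4, 5]
probabilities = [0.23, 0.38, 0.21, 0.13, 0.05]

# Mean: the sum of p_i * x_i.
mu = sum(p * x for x, p in zip(magazines, probabilities))

# Standard deviation: the square root of the sum of p_i * (x_i - mu)^2.
sigma = math.sqrt(sum(p * (x - mu) ** 2 for x, p in zip(magazines, probabilities)))

print(round(mu, 2), round(sigma, 2))  # 2.39 1.12
```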
Example 4
‘Cheap Car Insurance’ insures used cars valued at $6000 under these conditions.
A $6000 will be paid to the owner for total loss
B for damage between $3000 and $5999, $3500 will be paid
C for damage between $1500 and $2999, $1000 will be paid
D for damage less than $1500, nothing will be paid.
From statistical information the insurance company knows that in any year the probabilities of A, B, C and D are 0.03, 0.12, 0.35 and 0.50 respectively.
If the company wishes to receive $80 more than its expected payout on each policy, what should it charge for the policy?
Let X be the random variable of payouts, so the probability table is:

x_i   0      1000   3500   6000
p_i   0.50   0.35   0.12   0.03

The expected payout is the mean, μ, and
μ = Σ p_i x_i
  = (0.50) × 0 + (0.35) × 1000 + (0.12) × 3500 + (0.03) × 6000
  = 950
The company expects to pay out $950 on average in the long run, so it should charge $950 + $80 = $1030.
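The same expected-value calculation can be written as a small reusable function. The helper `expected_value` below is a hypothetical name for this sketch; it also checks that the probabilities sum to 1, which is a useful sanity check when building a probability table by hand.

```python
import math

def expected_value(values, probs):
    """Mean of a discrete random variable: the sum of p_i * x_i."""
    if not math.isclose(sum(probs), 1.0):
        raise ValueError("probabilities must sum to 1")
    return sum(p * x for x, p in zip(values, probs))

# Payout table from Example 4.
payouts = [0, 1000, 3500, 6000]
payout_probs = [0.50, 0.35, 0.12, 0.03]

expected_payout = expected_value(payouts, payout_probs)
premium = expected_payout + 80          # the company wants $80 above its expected payout

print(round(expected_payout, 2), round(premium, 2))
```

This agrees with the worked solution: an expected payout of $950 and a premium of $1030.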
EXERCISE 7B
1 Australian crayfish is exported to Asian markets. The
buyers are prepared to pay high prices when the crayfish
arrive still alive. If X is the number of deaths per dozen
crayfish, the probability function for X is given by:
x_i      0      1      2      3      4      5      >5
P(x_i)   0.54   0.26   0.15   0.03   0.01   0.01   0.00

a What is the mean number of deaths per dozen crayfish?
b Find σ, the standard deviation for the probability distribution.
2 A random variable X has probability function given by
\( P(x) = k(0.4)^x (0.6)^{3-x} \) for x = 0, 1, 2, 3.
a Find P(x) for x = 0, 1, 2 and 3 and hence find k.
b Find the mean and standard deviation for the distribution.
3 An insurance policy covers a $20 000 sapphire ring against theft and loss. If it is stolen the insurance company will pay the policy owner in full. If it is lost they will pay the owner $8000. From past experience the insurance company knows that the probability of theft is 0.0025 and of being lost is 0.03. How much should the company charge to cover the ring if they want a $100 expected return?
4 Use technology to find the mean and standard deviation of the two samples, A and B,
of weights given in grams.
A 498.8 500.2 500.4 499.9 500.4 500.6 498.9 498.2 500.1 501.9
500.8 498.6 499.7 498.6 499.0 498.8 499.1 500.7 500.7 501.3
501.1 501.5 499.0 499.7 498.4 501.1 500.1 499.9 500.9 499.2
B 545.5 543.4 399.8 511.3 616.3 496.7 337.8 650.2 426.3 522.2
664.0 415.1 416.0 425.4 419.9 503.7 427.8 474.2 459.9 390.5
428.5 451.9 590.1 613.5 402.3 318.3 478.1 502.2 626.4 435.7
Which of the samples is the weights of bags of salt, and which is the weights of oranges?
5 Test marks out of 10 are recorded in the following frequency table:
Mark        0   1   2   3   4   5   6    7    8   9   10
Frequency   2   1   0   4   5   8   12   15   7   3   5
a Find the mean and standard deviation of these scores.
b Calculate the percentage difference between using the formulae for population
standard deviation and sample standard deviation.
6 Using \( \sigma^2 = \sum p_i (x_i - \mu)^2 \), show that \( \sigma^2 = \sum p_i x_i^2 - \mu^2 \).
(Hint: \( \sigma^2 = \sum p_i (x_i - \mu)^2 = p_1 (x_1 - \mu)^2 + p_2 (x_2 - \mu)^2 + \dots + p_n (x_n - \mu)^2 \).
Expand \( \sigma^2 \) and regroup the terms.)
C NORMAL DISTRIBUTIONS
Many quantities reflect the combined effect of a large number of random factors.
For example:
• The yield of a wheat plant is the combined result of many unpredictable factors such as genes, rainfall, sunshine, and its position in the field where it was seeded.
• The weight of a packet of sultanas is the sum of the weights of each individual sultana, and it is unlikely a packet labelled as 1 kg will weigh exactly 1 kg.
DISCUSSION
THE EFFECT OF RANDOM FACTORS
• Consider at least three factors that affect each of the following:
a the weight of a newly born piglet
b the time to complete an assignment
c the mark achieved in an examination
d the number of goals scored in a netball match.
• For each of the above random variables, suggest why the distribution might be
a symmetric b bell shaped.
The next investigation explores the distribution of a quantity that is the combined result of
different factors.
INVESTIGATION 1 SOME PROPERTIES OF A NORMAL DISTRIBUTION
Consider the time it takes Les to walk home from school. We have broken
this into the following stages with the time it takes to complete each stage:
Stage   What is happening                       Time
1       Cross the road in front of the school   up to 1 minute
2       Walk to the shopping centre             5 ± 2 minutes
3       Walk through the shopping centre        3 ± 2 minutes
4       Cross a road                            up to 1 minute
5       Buy a loaf of bread                     up to 2 minutes
6       Talk with a friend                      up to 2 minutes
7       Walk the remaining distance home        2 ± 1 minutes

Question: According to the table, what is the longest time it may take Les to walk home? What is the shortest time?
If Les wanted to study the distribution of the time it takes to walk home, he could keep a
daily record, but the amount of data collected would be very small.
Les could also use the information given in the table and use a spreadsheet or a calculator
to simulate the time it takes to walk home.
The following instructions are set up for a spreadsheet, but the procedure will also work
on a calculator.
What to do:
SPREADSHEET
1 Open the spreadsheet “Normal distribution”.
A spreadsheet with the following headings will appear.
2 In each of the cells A2 to G2, under the headings ‘Stage 1’ to ‘Stage 7’, type in the
formulae shown in the table. Do not forget to start each formula with an = sign.
Note: rand() calculates a random number between 0 and 1.
Question: What does 5 + (4*rand() − 2) calculate?
3 In cell N2, below the heading ‘Total time’, type in the formula =sum(A2:M2)
Question: What does this formula calculate?
4 Drag the formulae in cells A2 to N2 down to fill all cells A251 to N251. Pressing
the F9 function key will produce another random sample.
The numbers in cell P2 under the heading ‘Mean’, and in cell Q2 under the heading
‘Standard Deviation’, are the mean and standard deviation of the numbers in cells N2
to N251.
The number in cell R2 under the heading ‘No. within 1 st. dev.’ gives the number of
values within 1 standard deviation of the mean. For example, if the mean x̄ = 12.96
and the standard deviation s = 1.82, then this cell gives the number of values that
lie between x̄ − s = 11.14 and x̄ + s = 14.78. Similarly, the numbers in cells
S2 and T2 give the number of values within 2 and 3 standard deviations of the mean
respectively.
The graph that appears is the
histogram of data in cells N2 to N251.
If you are having difficulty setting up this spreadsheet, click on the tag ‘Normal 2’ to
open a finished version.
5 Calculate the proportion of data values within each interval. For example, if there are 169 values within 1 standard deviation of the mean, the proportion of values in the interval = 169/250 = 0.676.
6 Copy and fill in the following table for 5 different samples. The entries of the first
line may not agree with your values.
Sample   Mean    Stdev   x̄ − s to x̄ + s     x̄ − 2s to x̄ + 2s    x̄ − 3s to x̄ + 3s
no.      x̄       s       Count   Propn.     Count   Propn.      Count   Propn.
1        12.96   1.82    169     0.676
2
3
4
5
What do you notice about the proportions of data in each of the intervals?
In the following we change the value of the factors and then add more factors.
7 Change the formulae in cells A2 to G2 as shown in the table.
8 Repeat steps 4 to 6.
9 Add the following formulae in cells H2 to M2:
10 Repeat steps 4 to 6.
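For readers without the spreadsheet, the first part of the investigation can be sketched in Python. The spreadsheet's formula table is not reproduced in this text, so the stage formulas below are assumptions inferred from the time table: "up to 1 minute" is taken as `random()`, and "5 ± 2 minutes" as `5 + (4*random() - 2)`, matching the one formula quoted in step 2. The seed is arbitrary.

```python
import random
import statistics

def walk_time(rng):
    """One simulated walk home in minutes (stage formulas are assumptions)."""
    stages = [
        rng.random(),                # stage 1: up to 1 minute
        5 + (4 * rng.random() - 2),  # stage 2: 5 ± 2 minutes
        3 + (4 * rng.random() - 2),  # stage 3: 3 ± 2 minutes
        rng.random(),                # stage 4: up to 1 minute
        2 * rng.random(),            # stage 5: up to 2 minutes
        2 * rng.random(),            # stage 6: up to 2 minutes
        2 + (2 * rng.random() - 1),  # stage 7: 2 ± 1 minutes
    ]
    return sum(stages)

def proportion_within(data, k):
    """Proportion of values within k sample standard deviations of the mean."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)
    count = sum(1 for x in data if mean - k * s <= x <= mean + k * s)
    return count / len(data)

rng = random.Random(7)                           # arbitrary seed, for repeatability
times = [walk_time(rng) for _ in range(250)]     # 250 rows, like cells N2 to N251

print("mean:", round(statistics.mean(times), 2))
print("st. dev.:", round(statistics.stdev(times), 2))
for k in (1, 2, 3):
    print("within", k, "st. dev.:", round(proportion_within(times, k), 3))
```

Changing the seed plays the role of pressing F9: the mean and standard deviation move a little, but the three proportions stay close to the same values.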
From Investigation 1 you should have discovered that changing the number and values of
factors may change the mean and standard deviation, but leaves the following unchanged:
• The shape of the histogram is symmetric about the mean.
• Approximately 68% of the data lies between 1 standard deviation below the mean and 1 standard deviation above the mean.
• Approximately 95% of the data lies between 2 standard deviations below and 2 standard deviations above the mean.
• Approximately 99.7% of the data lies between 3 standard deviations below and 3 standard deviations above the mean.
Note:
It is a rare event for an outcome to lie more than 3 standard deviations from the mean, that is, outside the range from μ − 3σ to μ + 3σ. In a sample of 1000, you would only expect about 3 cases.
A smooth curve drawn through the midpoints of each column of the histogram would ideally look like the graph displayed.
Note the points of inflection at μ − σ and μ + σ.
[Figure: bell-shaped curve with μ − σ, μ and μ + σ marked on the horizontal axis. The curve is concave between the points of inflection at μ − σ and μ + σ, and convex outside them.]
The above information is typical of a family of normal distributions. Curves with this shape
are known as normal curves. Because of their characteristic shape, they are also called
bell-shaped curves.
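[Figure: normal curve with the horizontal axis marked at μ − 3σ, μ − 2σ, μ − σ, μ, μ + σ, μ + 2σ and μ + 3σ. On each side of the mean the areas are 34% between μ and 1σ away, 13.5% between 1σ and 2σ away, 2.35% between 2σ and 3σ away, and 0.15% beyond 3σ.]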
Variables which are the combined result of many random factors are often approximately
normal.
The normal variable X with mean μ and standard deviation σ is denoted by X ~ N(μ, σ²).
CONTINUOUS PROBABILITY DENSITY FUNCTIONS
For any distribution of data, whether it is a normal distribution or not, the function whose
smooth curve approximates the histogram of the data is called a probability density function
or pdf.
If the variable X is normally distributed, N(μ, σ²), the probability density function is
\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}. \]
Probability density functions f have the following properties:
• f(x) ≥ 0 for all values of x.
• The area between the graph of f and the horizontal axis is 1, since the total of all probabilities is 1.
• The proportion of outcomes of the variable X between the values a and b is the area between the graph of f and the horizontal axis for a ≤ x ≤ b.
Notice that:
\[ \Pr(a \le X \le b) = \int_a^b f(x)\,dx \]
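To see the link between the density function and the percentages found in Investigation 1, the sketch below codes the normal density formula directly and approximates the area under it with a simple midpoint rule. The helper names `normal_pdf` and `area`, the choice μ = 0, σ = 1, and the number of steps are all assumptions made for this illustration.

```python
import math

def normal_pdf(x, mu, sigma):
    """Normal probability density function with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def area(f, a, b, steps=10_000):
    """Approximate the integral of f from a to b using the midpoint rule."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

mu, sigma = 0.0, 1.0

def f(x):
    return normal_pdf(x, mu, sigma)

# Area within 1, 2 and 3 standard deviations of the mean.
for k in (1, 2, 3):
    print(k, round(area(f, mu - k * sigma, mu + k * sigma), 4))

# The total area is 1; the tails beyond 10 standard deviations are negligible.
print(round(area(f, mu - 10 * sigma, mu + 10 * sigma), 4))
```

The three areas come out at about 0.6827, 0.9545 and 0.9973, which are the 68%, 95% and 99.7% figures quoted earlier.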
For a continuous variable X, the probability X is exactly equal to a point a is zero.
For example, the probability an egg will weigh exactly 72.9 g is zero.
If you were to weigh an egg on scales that weigh to the nearest 0.1 g, a weight of 72.9 g
means the weight lies somewhere between 72.85 and 72.95 grams.
Presumably an egg has to weigh something, and it could be 72.9 grams, but you will never
know. No matter how accurate your scales are, you can only ever know the weight of an egg
within a range.
So, for a continuous variable we can only talk about the probability an event lies in an
interval.
Notice that:
if X is continuous, Pr(a ≤ X ≤ b), Pr(a < X ≤ b), Pr(a ≤ X < b)
and Pr(a < X < b) all have the same value. Why?
This would not be correct if X was discrete.
Example 5
The chest measurements of 18 year old male footballers are normally distributed with
a mean of 95 cm and a standard deviation of 8 cm.
a Find the percentage of randomly chosen footballers with chest measurements
between:
i 87 cm and 103 cm
ii 103 cm and 111 cm
b Find the probability of randomly choosing a footballer with a chest measurement
between 87 cm and 111 cm.
For the distribution of chest measurements, the mean μ = 95 cm and the standard deviation σ = 8 cm.
[Figure: normal curve with 87, 95, 103 and 111 marked at μ − σ, μ, μ + σ and μ + 2σ. The areas are 34% between 87 and 95, 34% between 95 and 103, and 13.5% between 103 and 111.]
a i We need the percentage between μ − σ and μ + σ. This is 68%.
ii We need the percentage between μ + σ and μ + 2σ. This is 13.5%.
b The percentage between μ − σ and μ + 2σ is 68% + 13.5% = 81.5%.
So the probability is 0.815.
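The percentages used in Example 5 are rounded. As a cross-check, the sketch below uses `NormalDist` from Python's standard `statistics` module to compute the same probabilities more precisely; the helper name `between` is an assumption made for this illustration.

```python
from statistics import NormalDist

# Chest measurements from Example 5: mean 95 cm, standard deviation 8 cm.
chest = NormalDist(mu=95, sigma=8)

def between(dist, a, b):
    """Pr(a <= X <= b) as the difference of two cumulative probabilities."""
    return dist.cdf(b) - dist.cdf(a)

print(round(between(chest, 87, 103), 3))   # part a i,  about 0.683
print(round(between(chest, 103, 111), 3))  # part a ii, about 0.136
print(round(between(chest, 87, 111), 3))   # part b,    about 0.819
```

The small differences from 68%, 13.5% and 81.5% arise only because the percentages on the diagram are rounded.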
EXERCISE 7C
1 What is the probability that a normally distributed value lies between:
a 1σ below the mean and 1σ above the mean
b the mean and the value 1σ above the mean
c the mean and the value 2σ below the mean
d the mean and the value 3σ above the mean?
2 Suppose the heights of 16 year old male students are normally distributed with a mean
of 170 cm and a standard deviation of 8 cm. Find the percentage of male students whose
height is:
a between 162 cm and 170 cm
b between 170 cm and 186 cm.
Find the probability that a student from this group has a height:
c between 178 cm and 186 cm
d less than 162 cm
e less than 154 cm
f greater than 162 cm.
3 The time T minutes it takes Charlotte to go to work is normally distributed with mean
50 minutes and standard deviation of 5 minutes. Every morning Charlotte leaves for
work at 8 am.
a If work starts at 9 am, what is the probability Charlotte will be late for work?
b If Charlotte works 250 days a year, how many times can she expect to be late?
4 Explain why each of the following variables might be normally distributed:
a the chest size of 18 year old Australian males
b the length of adult female sharks
c the protein content of each kilogram of corn grown in the same field.
5 A farmer has a flock of 237 crossbred lambs. The mean weight of the flock is 35 kg
with a standard deviation of 2 kg.
a Explain why the weights of the lambs might be normally distributed.
b If lambs between the weights of 33 to 39 kg are suitable for export, how many
lambs in this flock could the farmer expect to be able to export?
6 The weights of hens’ eggs are normally distributed with mean 65 grams and standard
deviation 6 grams.
a Determine the probability that a randomly selected egg has weight
i greater than 53 g ii less than 71 g iii between 59 g and 77 g.
b In one week the hens lay 1286 eggs. How many of these eggs are expected to be
i greater than 53 g ii less than 71 g iii between 59 g and 77 g.
7 The marks for a geography examination are normally distributed with mean 65 and
standard deviation 11.
a A geography student is chosen at random. Determine the probability that the student scored
i less than 76 marks ii between 43 and 76 marks.
b If the top 16% of students receive an A grade, what was the minimum mark
for an A?
c If 2582 students sit for the examination, how many of them would be expected to
score less than 32 marks?
8 The weights of Jason’s oranges are normally distributed. 84% of the crop weigh more
than 152 grams and 16% weigh more than 200 grams.
a Find μ and σ for the crop.
b What proportion of the oranges weigh between 152 grams and 224 grams?
9 The heights of 13 year old boys are normally distributed. 97.5% of them are above 131 cm and 2.5% are above 179 cm.
a Find μ and σ for the height distribution.
b A 13-year old boy is randomly chosen. What is the probability that his height lies
between 143 cm and 191 cm?
10 Using the same set of axes, quickly sketch the graphs of the density functions for each
of the following distributions:
a N(0, 3²)
b N(0, (0.5)²)
c N(−5, 1²)
d N(3, 0.25).
11 Each of the following is a graph of a normal distribution with different vertical scales:
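[Figure: three normal curves, labelled A, B and C, each drawn against an x-axis. The tick labels recovered from the three axes are −2.5 to −1.5, −20 to 20, and −4 to 4.]
a Write down the mean μ for each of these distributions.
b Which of the distributions has standard deviation
i σ = 0.1
ii σ = 1
iii σ = 10?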
c Which of the distributions has the largest spread?
D THE STANDARD NORMAL DISTRIBUTION
For each value of μ and σ there is a different normal distribution N(μ, σ²).
As illustrated by Investigation 1, all normal distributions have one important property in
common: the probability of an event occurring depends only on the number of standard
deviations the event is from the mean.
If x is an observation from a normal distribution with mean μ and standard deviation σ,
the z-score of x is the number of standard deviations x is from the mean.
The diagram shows how the z-score is related to a normal curve.
[Figure: normal distribution curve. The horizontal axis is labelled with the actual score (μ − 3σ, μ − 2σ, μ − σ, μ, μ + σ, μ + 2σ, μ + 3σ) and with the z-score (−3, −2, −1, 0, 1, 2, 3). From left to right the areas of the bands are 0.15%, 2.35%, 13.5%, 34%, 34%, 13.5%, 2.35%, 0.15%.]
z-scores are particularly useful when comparing two measurements that come from
distributions with different μ and σ. But be careful! These comparisons will only be
reasonable if both distributions are approximately normal.
Example 6
The local school has kept records of all its athletics competitions. It was found that
the time, in minutes, to run the men's 800 metres was normally distributed as
N(3.4, (0.2)²). The women's long jump, in metres, was normally distributed as
N(4.3, (0.4)²). In 1980 John won the 800 metre race with a time of 3.2 minutes. In
2006 his daughter Anne came second in the long jump with a distance of 5.1 m.
a i Sketch the graphs of the two distributions using the same scale for the
z-scores from −3 to +3.
ii Put the actual times/distances below each of the z-scores on the graphs.
iii Calculate the z-scores for John and Anne, and mark these on the graphs.
iv Shade the area under the respective graphs to represent performances that
were better than those of John and Anne.
b Of all the students who participated in these two events, what proportion would
have performed better than i John ii Anne?
c If 1000 students had participated in each of these two events, how many would
have performed better than i John ii Anne?
d Of the father and daughter, who had the better result?
a i/ii/iv
[Figure: two normal curves drawn on the same z-score scale, one for John's time and one for Anne's distance, with the regions "better than John" and "better than Anne" shaded. Band areas are as in the diagram at the start of this section.]

z-score               −3    −2    −1     0     1     2     3
actual time (min)     2.8   3.0   3.2   3.4   3.6   3.8   4.0
actual distance (m)   3.1   3.5   3.9   4.3   4.7   5.1   5.5

iii John's time was 3.2 − 3.4 = −0.2 minutes from the mean.
Since the standard deviation is 0.2 minutes, John ran the 800 metres in a
time of 1 standard deviation less than the mean.
The z-score of John's performance is −1.
The distance Anne jumped was 5.1 − 4.3 = 0.8 m above the mean.
Since the standard deviation is 0.4 metres, Anne jumped a distance of 2
standard deviations above the mean.
The z-score of Anne's performance is +2.
b i The proportion less than μ − σ is 0.16, so 16% of all participants
performed better than John.
ii The proportion greater than μ + 2σ is 0.025, so only 2.5% of all
participants performed better than Anne.
c i Of 1000 participants, 16% of 1000 = 160 were better than John.
ii 2.5% of 1000 = 25 were better than Anne; one of these happened
to be competing on the same day as Anne.
d Anne’s long jump was more outstanding than her father’s 800 metre race.
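The calculations in Example 6 can be checked with a short Python sketch. It assumes Python's standard `statistics.NormalDist` class as a stand-in for the calculator; the variable and function names are illustrative only. Note that the exact proportions (about 0.159 and 0.023) differ slightly from the rounded 16% and 2.5% read from the diagram.

```python
from statistics import NormalDist

# Distributions taken from Example 6.
run_800m = NormalDist(mu=3.4, sigma=0.2)   # times in minutes (lower is better)
long_jump = NormalDist(mu=4.3, sigma=0.4)  # distances in metres (higher is better)

def z_score(x, dist):
    """Number of standard deviations x lies from the mean of dist."""
    return (x - dist.mean) / dist.stdev

z_john = z_score(3.2, run_800m)
z_anne = z_score(5.1, long_jump)

# Better than John means a faster (smaller) time.
better_than_john = run_800m.cdf(3.2)
# Better than Anne means a longer (larger) jump.
better_than_anne = 1 - long_jump.cdf(5.1)

print(round(z_john, 2), round(z_anne, 2))
print(round(better_than_john, 4), round(better_than_anne, 4))
```

Comparing the two proportions gives the same conclusion as part d: fewer participants are expected to beat Anne than to beat John.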
EXERCISE 7D.1
1 In a year 12 class, the marks for a Geography test marked out of 50 were normally
distributed with mean of 34 and standard deviation of 6. The marks for an English essay
out of 20 were normally distributed with a mean of 12 and standard deviation of 1.5.
Val received a mark of 40 for her Geography and 15 for her English essay.
a Sketch the graphs of the two distributions below one another using the same scale
for the z-scores from −3 to +3.
Put the actual marks below each z-score on the graph.
b For which of the two subjects did Val receive the higher % mark?
c Calculate the z-score for each of Val’s results.
i Mark these z-scores on the two graphs.
ii Shade the region on the two graphs of scores which were better than Val’s.
d What proportion of the students performed better than Val in Geography, and what
proportion performed better than Val in English?
e If there were 32 students in the class, how many performed better than Val in
Geography and how many in English?
f In which of these two assessments did Val perform better?
2 Suppose that the weight W of bags of sugar filled by a machine is normally distributed
with mean μ = 504 grams and standard deviation σ = 2 grams.
A quality controller rejects any bags of sugar with weight less than 500 grams.
Across town, the weight A of bags of apples filled by an assistant in a greengrocer's shop
is normally distributed with mean weight 5 kilograms and standard deviation 500 grams.
Bags weighing less than 4½ kg are rejected by a quality controller.
a Sketch the graphs of the two distributions below one another using the same scale
for the z-scores from −3 to +3.
Put the actual weights below each z-score on the graph.
b Calculate the z-score for each of the two quality controls, and shade in the regions
corresponding to the weights of bags that are rejected.
c Which of the two quality controllers is the more stringent, i.e., rejects the larger
proportion of bags?
Example 7
Suppose examination scores are normally distributed with mean mark μ = 63 and
standard deviation of σ = 12 marks.
a What is the z-score for a mark of 80?
b If Hua's z-score is −1.5, what is Hua's actual score?
a A mark of 80 is 80 − 63 = 17 above the mean.
Since the standard deviation is 12, this is 17/12 ≈ 1.42 standard
deviations above the mean.
So, the z-score is 1.42.
b Hua's mark is −1.5 standard deviations from the mean.
Since the standard deviation is 12, this is 12 × (−1.5) = −18
marks from the mean.
Since the mean is 63, Hua's mark is 63 + (−18) = 45.
[Figures: two normal curves, one marking the score of 80 and one marking Hua's mark, each with the axis below.]

z-score       −3   −2   −1    0    1    2    3
actual mark   27   39   51   63   75   87   99
3 Suppose the distribution of the diameter (in cm) of oranges from a tree is N(10, 2²).
a Sketch a graph of the distribution that displays both the actual diameters as well as
the z-score along the horizontal axis.
b Find the z-score for each of the following diameters:
i 12 cm ii 9 cm iii 13 cm
c Oranges are to be dumped if their diameters have a z-score of less than −2.
What is the diameter of oranges that are to be dumped?
d If there are 120 oranges on the tree, how many will be dumped?
4 The volume of milk cartons filled by a machine is normally distributed with mean 504
mL and standard deviation of 1.5 mL.
a What is the z-score of a carton containing 506 mL of milk?
b What is the volume of milk in a carton with a z-score of −1.5?
If x is an observation from a normal distribution with mean μ and standard deviation σ, the
z-score of x can be calculated from the formula $z = \dfrac{x - \mu}{\sigma}$.
If the variable X is normally distributed with mean μ and standard deviation σ, then
$Z = \dfrac{X - \mu}{\sigma}$ is called the standard normal distribution.
The variable Z is the number of standard deviations X is from the mean.
Notice that, if x = μ then z = 0 and if x = μ + σ then z = 1.
Hence, the mean of Z is 0 and the standard deviation of Z is 1.
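As a minimal sketch, the z-score formula and its rearrangement x = μ + zσ can be written as two small Python functions. The function names are illustrative, and the numbers are those of Example 7.

```python
def z_score(x, mu, sigma):
    """z = (x - mu) / sigma: standard deviations between x and the mean."""
    return (x - mu) / sigma

def from_z(z, mu, sigma):
    """Rearranged formula: the observation x that has z-score z."""
    return mu + z * sigma

# Example 7: exam marks with mean 63 and standard deviation 12.
print(round(z_score(80, 63, 12), 2))   # z-score of a mark of 80
print(from_z(-1.5, 63, 12))            # Hua's mark for z = -1.5
```

The two printed values should match the 1.42 and 45 obtained in Example 7.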
Example 8
Find the probability that the standard normal distribution Z lies between −2 and 1.
The graph of the Z-distribution is shown:
[Figure: standard normal curve with z from −3 to 3; the shaded bands between −2 and 1 have areas 13.5%, 34% and 34%.]
The probability Z lies between −2 and 1 is the proportion of observations that lie
between 2 standard deviations to the left of the mean and 1 standard deviation to
the right of the mean. This is about 13.5% + 34% + 34% = 81.5%, or 0.815.
EXERCISE 7D.2
1 The table shows Emma's midyear exam results. The exam results for each subject are
normally distributed with mean μ and standard deviation σ shown in the table.

Subject       Emma's score    μ     σ
English            12         10    1.1
Chinese            27         20    3.0
Geography          84         55    18
Biology            34         25    10
Mathematics        84         50    15

a Find the z-score for each of Emma's subjects.
b Arrange Emma's subjects from 'best' to 'worst' in terms of the z-scores.
2 Calculate the following probabilities. In each case sketch the graph of the Z-distribution
shading in the region of interest.
a Pr(−1 < Z < 1)
b Pr(−1 < Z < 3)
c Pr(−1 < Z < 0)
d Pr(Z < 2)
e Pr(−1 < Z)
f Pr(Z > 1)
USING TECHNOLOGY TO FIND PROBABILITIES
So far we have only used integer z-scores to calculate probabilities. By
refining the methods used in Investigation 1 we can calculate probabilities for
other z-scores. To see how to use your calculator to do this, click on the icon.
When working with normal distributions, you are advised to sketch a graph of the normal
distribution and shade in the areas of interest.
Example 9
Use technology to illustrate and calculate:
a Pr(−0.41 ≤ Z ≤ 0.67)
b Pr(Z ≤ 1.5)
c Pr(Z > 0.84)
a For a TI, Pr(a ≤ Z ≤ b) can be calculated using normalcdf(a, b, 0, 1).
Pr(−0.41 ≤ Z ≤ 0.67)
= normalcdf(−0.41, 0.67, 0, 1)
≈ 0.408
Note: When μ = 0 and σ = 1 we can simply use normalcdf(a, b).
b Pr(Z ≤ 1.5)
= normalcdf(−E99, 1.5, 0, 1)
≈ 0.933
Note: −E99 is the largest negative number on a calculator.
c Pr(Z > 0.84)
= normalcdf(0.84, E99, 0, 1)
≈ 0.200
Note: E99 is the largest positive number on a calculator.
[Figures: for each part, a standard normal curve with the region of interest shaded.]
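For readers without a TI calculator, the following Python sketch mimics the normalcdf instruction. It assumes the standard library class `statistics.NormalDist`; the wrapper name `normalcdf` is simply chosen here to match the calculator and is not a Python built-in. Infinite bounds play the role of ±E99.

```python
from statistics import NormalDist

def normalcdf(lower, upper, mu=0.0, sigma=1.0):
    """Pr(lower <= X <= upper) for X normally distributed as N(mu, sigma^2)."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(upper) - dist.cdf(lower)

INF = float("inf")   # stands in for the calculator's E99

print(round(normalcdf(-0.41, 0.67), 3))   # part a
print(round(normalcdf(-INF, 1.5), 3))     # part b
print(round(normalcdf(0.84, INF), 3))     # part c
```

The three results should agree, to three decimal places, with the values found in Example 9.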
EXERCISE 7D.3
1 If Z has the standard normal distribution, find the following probabilities.
In each case sketch the regions.
a Pr(−0.86 ≤ Z ≤ 0.32)
b Pr(−2.3 ≤ Z ≤ 1.5)
c Pr(Z ≤ 1.2)
d Pr(Z ≤ −0.53)
e Pr(Z > 1.3)
f Pr(Z > −1.4)
g Pr(Z > 4)
With modern technology we can calculate probabilities for normal
distributions which have not been standardised. Click on the icon to
see how this is done.
Example 10
If X is N(10, 2.3²), find these probabilities:
a Pr(8 ≤ X ≤ 11)
b Pr(X ≤ 12)
c Pr(X > 9). Illustrate.
a Pr(8 ≤ X ≤ 11)
= normalcdf(8, 11, 10, 2.3)
≈ 0.476
b Pr(X ≤ 12)
= normalcdf(−E99, 12, 10, 2.3)
≈ 0.808
c Pr(X > 9)
= normalcdf(9, E99, 10, 2.3)
≈ 0.668
[Figures: for each part, a normal curve centred at 10 with the region of interest shaded.]
2 If the random variable X is N(70, 3²), find these probabilities:
a Pr(60.6 < X ≤ 68.4)
b Pr(X > 74)
c Pr(X ≤ 68)
3 Suppose the variable X is normally distributed with mean μ = 58.3 and standard
deviation σ = 8.96.
a Let the z-score of x = 50.6 be z₁ and the z-score of x = 68.9 be z₂.
i Calculate z₁ and z₂.
ii Find Pr(z₁ ≤ Z ≤ z₂).
b Find Pr(50.6 ≤ X ≤ 68.9) directly from your calculator.
c Compare the answers to a and b.
4 Suppose X is N(50, 5²). Calculate Pr(a < X ≤ 51) for each of the following values
of a. Give your answers to 5 decimal places.
a a = 45
b a = 35
c a = 25
d a = 15
e a = 0
Compare the answers of a to e with Pr(X ≤ 51).
Example 11
In 1972 the heights of SANFL players were found to be normally distributed with
mean 179 cm and standard deviation 7 cm. Find the probability that in 1972
a player was: a at least 175 cm tall b between 170 cm and 190 cm.
If X is the height of a player then X is normally distributed with mean μ = 179
and standard deviation σ = 7.
a We need to find
Pr(X ≥ 175)
= normalcdf(175, E99, 179, 7)
≈ 0.716
b We need to find
Pr(170 ≤ X ≤ 190)
= normalcdf(170, 190, 179, 7)
≈ 0.843
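The sketch below repeats Example 11 in Python and also works part b through z-scores, to illustrate that standardising does not change the probability. It assumes `statistics.NormalDist` from the standard library; the variable names are illustrative.

```python
from statistics import NormalDist

heights = NormalDist(mu=179, sigma=7)   # Example 11: player heights in cm
standard = NormalDist()                 # Z, with mean 0 and standard deviation 1

# Working directly with X.
p_at_least_175 = 1 - heights.cdf(175)
p_170_to_190 = heights.cdf(190) - heights.cdf(170)

# Working with z-scores instead.
z1 = (170 - heights.mean) / heights.stdev
z2 = (190 - heights.mean) / heights.stdev
p_via_z = standard.cdf(z2) - standard.cdf(z1)

print(round(p_at_least_175, 3), round(p_170_to_190, 3), round(p_via_z, 3))
```

The last two values should be identical, which is the comparison asked for in question 3 above.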
5 The height of 18 year old men is normally distributed with mean 182.3 cm and standard
deviation 9.6 cm. Find the probability that a randomly selected 18 year old man is:
a at least 180 cm tall
b at most 190 cm tall
c between 175 and 185 cm.
6 The weight of hens' eggs is normally distributed with mean 42.3 g and standard deviation
5.9 g. Find the probability that a randomly selected egg is:
a at most 50 g
b at least 45 g
c between 35 g and 45 g.
7 The speed of cars passing the supermarket is normally distributed with mean 56.3 kmph
and standard deviation 7.4 kmph. Find the probability that a randomly selected car is
travelling at:
a between 60 and 75 kmph b at most 70 kmph
c at least 60 kmph.
8 The lengths of metal bolts produced by a machine are found to be normally distributed
with a mean of 19.8 cm and a standard deviation of 0.3 cm. Find the probability that a
bolt selected at random from the machine will have a length between 19.7 and 20 cm.
9 The IQs of secondary school students from a particular area are believed to be normally
distributed with a mean of 103 and a standard deviation of 15.1. Find the probability
that a student will have an IQ:
a of at least 115
b that is less than 75
c between 95 and 105.
10 The average weekly earnings of the students at a local high school are found to be
approximately normally distributed with a mean of $40 and a standard deviation of $6.
What proportion of students would you expect to earn:
a between $30 and $50 per week
b at least $50 per week?
11 The lengths of Murray Cod caught in the River Murray are found to be normally
distributed with a mean of 41 cm and a standard deviation of 3.317 cm.
a Find the probability that a cod is at least 50 cm.
b What proportion of cod measure between 40 cm and 50 cm?
c In a sample of 200 cod, how many of them would you expect to be at least 45 cm?
E FINDING QUANTILES (k-VALUES)
Let X be the random variable of the length in mm of a snail shell.
Suppose that X is normally distributed with mean μ = 23.6 mm
and standard deviation σ = 3.1 mm. A snail farmer wants to
harvest some of his snails, but only those whose shell lengths
are amongst the longest 5%. The problem is to find k such that
Pr(X < k) = 95%.
The number k is known as a quantile, and in this case the 95% quantile.
When finding quantiles we are given a probability and are asked to calculate the corresponding
measurement. This is the inverse of finding probabilities, and we use the inverse normal
function.
Click on the icon to obtain instructions for using your calculator.
For the above example, the TI instruction is
k = invNorm(0.95, 23.6, 3.1) ≈ 28.7
The instruction k = invNorm(0.95) will
assume that the mean μ = 0, and the
standard deviation σ = 1.
[Figure: normal curve for X with μ = 23.6 and σ = 3.1; the area to the left of k is shaded and labelled 95%.]
Example 12
If Z has a standard normal distribution, find k if
Pr(Z < k) = 0.73
Using a TI,
k = invNorm(0.73, 0, 1)
≈ 0.613
[Figure: standard normal curve with μ = 0 and σ = 1; the area to the left of k is shaded and labelled 73%.]
This means 73% of the values are expected to be less than 0.613.
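A Python counterpart to invNorm, given as a sketch: it assumes the `inv_cdf` method of `statistics.NormalDist` from the standard library, and the wrapper name `inv_norm` is made up here to mirror the calculator instruction.

```python
from statistics import NormalDist

def inv_norm(p, mu=0.0, sigma=1.0):
    """The value k with Pr(X < k) = p when X is N(mu, sigma^2)."""
    return NormalDist(mu, sigma).inv_cdf(p)

k_snail = inv_norm(0.95, 23.6, 3.1)   # 95% quantile of snail shell lengths (mm)
k_z = inv_norm(0.73)                  # Example 12, standard normal

print(round(k_snail, 1), round(k_z, 3))

# Finding a quantile undoes finding a probability:
print(round(NormalDist(23.6, 3.1).cdf(k_snail), 2))
```

The final line recovers the probability 0.95, which shows that the quantile function is the inverse of the cumulative probability.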
EXERCISE 7E
1 Z has a standard normal distribution. Illustrate with a sketch and find k if:
a Pr(Z ≤ k) = 0.81
b Pr(Z ≤ k) = 0.58
c Pr(Z ≤ k) = 0.17
2 X ~ N(20, 3²). Illustrate with a sketch and find k if:
a Pr(X ≤ k) = 0.348
b Pr(X ≤ k) = 0.878
c Pr(X ≤ k) = 0.5
3 a Show that Pr(−k ≤ Z ≤ k) = 2 Pr(Z ≤ k) − 1.
b If Z is standard normally distributed, find k if:
i Pr(−k ≤ Z ≤ k) = 0.238
ii Pr(−k ≤ Z ≤ k) = 0.7004
Example 13
A university professor determines that 80% of this year's History candidates should
pass the final examination. The examination results are expected to be normally
distributed with mean 62 and standard deviation 13. Find the lowest score necessary
to pass the examination.
Let X denote the final examination result, so X ~ N(62, 13²).
We need to find k such that Pr(X > k) = 0.8
∴ Pr(X ≤ k) = 0.2
∴ k = invNorm(0.2, 62, 13)
∴ k ≈ 51.059
So, the minimum pass mark is 51.
[Figure: normal curve for X with mean 62; the area to the left of k is shaded and labelled 20%.]
4 The length of a fish species is normally
distributed with mean 35 cm and standard
deviation 8 cm. The fisheries department
has decided that the smallest 10% of the
fish are not to be harvested. What is the size
of the smallest fish that can be harvested?
5 The length of screws produced by a machine is normally distributed with mean 75 mm
and standard deviation 0.1 mm. If a screw is too long it is automatically rejected. If 1%
of screws are rejected, what is the length of the smallest screw to be rejected?
6 The average score for a Physics test was 46 and the standard deviation of the scores was
15. Assuming that the scores were normally distributed, the teacher decided to award
an A to the top 7% of the students in the class. What is the lowest score that a student
needed in order to achieve an A?
7 The volume of cool drink in a bottle filled by a machine is normally distributed with
mean 503 mL and standard deviation 0.5 mL. 1% of the bottles are rejected because they
are underfilled, and 2% are rejected because they are overfilled; otherwise they are kept
for retail. What range of volumes is in the bottles that are kept?
Note: Z-scores are essential for finding unknown values of μ and/or σ.
Example 14
An adult scallop population is known to have a standard deviation of 5.9 g. If 15%
of scallops weigh less than 58.2 g, find the mean weight of the population.
Let the mean weight of the population be μ g.
If X g denotes the weight of an adult scallop,
then X ~ N(μ, 5.9²).
As we do not know μ we cannot use the
invNorm directly, but we can find the z-value.
[Figure: normal curve with unknown mean μ and σ = 5.9; the area to the left of 58.2 is shaded and labelled 15%.]
Now Pr(X ≤ 58.2) = 0.15
∴ $\Pr\left(Z \le \dfrac{58.2 - \mu}{5.9}\right) = 0.15$
∴ $\dfrac{58.2 - \mu}{5.9} = \text{invNorm}(0.15) \approx -1.0364$
∴ 58.2 − μ ≈ −6.1
∴ μ ≈ 64.3
So, the mean weight is 64.3 g.
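The method of Example 14 can be sketched in Python as follows. It assumes `statistics.NormalDist` for the inverse normal step; the function name `mean_from_quantile` is hypothetical and chosen only for this illustration.

```python
from statistics import NormalDist

def mean_from_quantile(x, p, sigma):
    """Mean mu of a normal distribution with known sigma, given Pr(X <= x) = p."""
    z = NormalDist().inv_cdf(p)   # z-score that cuts off probability p
    return x - z * sigma          # rearranged from z = (x - mu) / sigma

# Example 14: 15% of scallops weigh less than 58.2 g, sigma = 5.9 g.
print(round(mean_from_quantile(58.2, 0.15, 5.9), 1))
```

The same rearrangement, solved for σ instead of μ, handles questions where the mean is known and the standard deviation is not.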
8 The arrival times of buses at a depot are normally distributed with standard deviation of
5 minutes. If 10% of the buses arrive before 3:45 pm, what is the mean arrival time of
buses at the depot?
9 The IQ of a population has a standard deviation of 15. In a school 20% of students have
an IQ larger than 125. What is the mean IQ of students in this school?
10 The distance an athlete can jump is normally distributed with mean 5.2 m. If 20% of
the jumps by this athlete are less than 5 m, what is the standard deviation?
11 The weekly income of a greengrocer is normally distributed with a mean of $6100. If
85% of the time the weekly income exceeds $6000, what is the standard deviation?
Example 15
Find the mean and standard deviation of a normally distributed random variable X
if Pr(X ≤ 20) = 0.1 and Pr(X > 29) = 0.15.
X ~ N(μ, σ²) where we have to find μ and σ.
[Figure: normal curve with mean μ; the area to the left of x = 20 is 0.1 and the area to the right of x = 29 is 0.15.]
We start by finding z₁ and z₂ which correspond to x₁ = 20 and x₂ = 29.
Now $z_1 = \dfrac{20 - \mu}{\sigma} = \text{invNorm}(0.1) \approx -1.282$
∴ 20 − μ = −1.282σ .... (1)
and $z_2 = \dfrac{29 - \mu}{\sigma} = \text{invNorm}(0.85) \approx 1.036$
∴ 29 − μ = 1.036σ .... (2)
Solving these two equations gives μ ≈ 25.0 and σ ≈ 3.88.
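As a sketch of the final step, the two simultaneous equations can be solved in Python. Subtracting (1) from (2) eliminates μ and gives σ; substituting back gives μ. The code assumes `statistics.NormalDist`, and the function name `mean_and_sd` is illustrative only.

```python
from statistics import NormalDist

def mean_and_sd(x1, p1, x2, p2):
    """Solve for mu and sigma given Pr(X <= x1) = p1 and Pr(X <= x2) = p2."""
    z1 = NormalDist().inv_cdf(p1)
    z2 = NormalDist().inv_cdf(p2)
    sigma = (x2 - x1) / (z2 - z1)   # (2) minus (1): x2 - x1 = (z2 - z1) * sigma
    mu = x1 - z1 * sigma            # substitute sigma back into (1)
    return mu, sigma

# Example 15: Pr(X <= 20) = 0.1 and Pr(X > 29) = 0.15, so Pr(X <= 29) = 0.85.
mu, sigma = mean_and_sd(20, 0.1, 29, 1 - 0.15)
print(round(mu, 1), round(sigma, 2))
```

Note that the upper-tail probability 0.15 has to be converted to the lower-tail probability 0.85 before the inverse normal is applied, exactly as in the worked solution.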
12 a Find the mean and the standard deviation of a normally distributed random variable
X, if Pr(X > 80) = 0.1 and Pr(X ≤ 30) = 0.15.
b In a Mathematics examination it was found that 10% of the students scored at least
80, and no more than 15% scored under 30. Assuming the scores are normally
distributed, what proportion of students scored more than 50?
13 The diameters of pistons manufactured by a company are normally distributed. Only
those pistons whose diameters lie between 3:994 and 4:006 cm are acceptable.
a Find the mean and the standard deviation of the distribution if 4% of the pistons
are rejected as being too small, and 5% are rejected as being too large.
b Determine the probability that the diameter of a randomly chosen piston lies
between 3.997 cm and 4.003 cm.
F INVESTIGATING PROPERTIES OF NORMAL DISTRIBUTIONS
In the previous section a number of assertions were made about the standard deviation. In
this section some of these assertions will be justified.
INVESTIGATION 2
THE GEOMETRIC SIGNIFICANCE OF μ AND σ
What to do:
1 The normal probability density function is $f(x) = \dfrac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$.
Use technology to graph this function for a μ = 6, σ = 1 b μ = 6, σ = 2.
2 Show that the derivative of f(x) is $f'(x) = -\dfrac{x-\mu}{\sigma^2}\, f(x)$.
3 Use the result in 2 to show that f(x) has a maximum value at x = μ.
4 Show that $f''(x) = -\dfrac{1}{\sigma^4}\left(\sigma^2 - (x-\mu)^2\right) f(x)$.
5 Use the result of 4 to find the points of inflection of f(x).
From Investigation 2 you should have discovered that
the points of inflection occur at x = μ + σ and x = μ − σ.
[Figure: normal curve with the points of inflection marked at x = μ − σ and x = μ + σ, each a horizontal distance σ from the line x = μ.]
Consequently:
For a given normal curve the standard deviation is uniquely determined as the
horizontal distance from the vertical line x = μ to a point of inflection.
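This geometric fact can be checked numerically. The sketch below is not part of the investigation: it assumes `statistics.NormalDist` for the density function and uses a central-difference estimate of the second derivative (the helper name `second_derivative` and the step sizes are choices made here). It tests the sign of f″ just either side of x = μ ± σ for μ = 6, σ = 2.

```python
from statistics import NormalDist

mu, sigma = 6, 2
f = NormalDist(mu, sigma).pdf   # the normal probability density function

def second_derivative(x, h=1e-3):
    """Central-difference estimate of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2

for point in (mu - sigma, mu + sigma):
    toward_mean = point + 0.1 if point < mu else point - 0.1
    away_from_mean = point - 0.1 if point < mu else point + 0.1
    print(point,
          second_derivative(toward_mean) < 0,     # concave down nearer the mean
          second_derivative(away_from_mean) > 0)  # concave up further out
```

Both tests should print True at each point, meaning the curve changes concavity there, which is what a point of inflection requires.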
INVESTIGATION 3
CALCULATING PROBABILITIES FROM NORMAL DISTRIBUTIONS
To find probabilities from a normal distribution you need to be able to find
areas between the graph of $f(x) = \dfrac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$
and the x-axis.
A simple way to estimate these probabilities is to approximate them with areas
of rectangles that fit snugly around the
curve.
The area beneath the smooth curve is
approximately equal to the sum of the
areas of the rectangles.
What to do:
Use a spreadsheet to:
• calculate the area of each rectangle using area = base × height
• add the areas of rectangles to find an approximate area below the curve.
Details of how to set up a spreadsheet can be found by clicking on
the icon.
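The rectangle method can also be sketched in Python rather than a spreadsheet. This is only one possible set-up: it assumes each rectangle's height is the value of f at the midpoint of its base, and the function names and the choice of 1000 rectangles are illustrative, not taken from the spreadsheet instructions.

```python
import math

def normal_pdf(x, mu, sigma):
    """The normal probability density function f(x)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def area_by_rectangles(a, b, mu, sigma, n=1000):
    """Approximate the area under f between a and b using n rectangles."""
    base = (b - a) / n
    total = 0.0
    for i in range(n):
        midpoint = a + (i + 0.5) * base
        total += base * normal_pdf(midpoint, mu, sigma)   # area = base x height
    return total

print(round(area_by_rectangles(-1, 1, 0, 1), 4))   # Pr(-1 < Z < 1)
print(round(area_by_rectangles(-2, 1, 0, 1), 4))   # Pr(-2 < Z < 1)
```

The second value comes out close to 0.819, slightly above the 0.815 obtained in Example 8, because the percentages on the diagram are rounded.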
G DISTRIBUTION OF SAMPLE MEANS
Suppose a dietician wants to know the mean
weight of thirteen year old Australian boys.
It is impractical to weigh each thirteen year
old boy in Australia, but the dietician could
find the mean weight of a randomly selected
sample of, say, 10 boys.
The mean weight of the sample of 10 boys
is a statistic that is then used to estimate the
population parameter.
Clearly the mean weight depends on the sample. If another health worker had selected a
different sample of 10 boys, it would be unlikely that the two sample means would be
the same.
The statistic, the sample mean weight, is a new variable. Repeated sampling can be used to discover
how this variable is distributed. In particular we want to know how the mean
of the sample means and the standard deviation of the sample means are related to the parent
population of 13 year old boys.
The following investigation explores the relation between the statistic “sample mean” and the
parameter “population mean”.
INVESTIGATION 4
A SIMPLE RANDOM SAMPLER
Suppose a school has 216 thirteen year old boys.
Let the variable X be the weight in kg of the boys.
The table shows all the possible values of X in random order.
[Table: weights in kg of the 216 boys. The block and row layout of the printed table was lost in extraction, so the values are listed here in extraction order; the leading digits 1 to 6 appear to be the table's row labels.]
1 31.2 35.7 33.8 36.7 30.9 35.9 35.5 36.4 32.9 27.9 33.2 32.8 32.0 33.2 35.4 32.0 33.8 30.8 34.8 37.3 31.9 36.4 33.6 29.0 30.2 35.0 36.7 34.5 30.5 32.1 36.3
2 34.0 33.6 29.2 31.6 30.9 32.7 38.9 34.4 33.6 32.5 35.0 35.4 32.0 32.0 31.0 35.3 33.2 30.4 28.0 32.7 32.5 34.6 36.2 33.3 32.7 36.7 36.4 31.1 35.2 30.2 33.6
3 35.4 31.2 33.5 30.5 35.0 31.4 27.5 32.5 32.5 30.5 32.5 32.4 31.7 29.6 30.1 36.3 34.1 37.1 37.1 35.1 34.9 34.3 33.2 32.5 29.9 32.9 32.3 32.1 32.9 35.9 31.6
4 37.3 33.6 31.4 31.3 33.4 30.3 32.5 36.7 33.0 30.8 33.2 34.9 33.4 30.7 32.4 29.8 31.1 32.1 35.2 32.8 29.7 30.8 32.3 34.6 34.2 32.5 33.6 29.2 30.6 35.7 29.5
5 34.3 31.9 32.6 31.6 27.0 33.4 34.0 33.2 29.4 34.2 31.8 35.3 30.6 34.5 32.6 29.1 36.1 38.7 37.1 32.4 35.5 35.4 32.7 37.5 30.4 30.8 33.0 29.9 31.0 32.1 33.2
6 32.4 32.0 31.4 33.7 35.9 33.9 28.5 27.1 40.6 29.0 38.4 34.0 36.2 36.4 37.1 32.0 31.6 34.2 35.7 34.0 31.4 29.9 34.4 29.2 36.4 32.4 30.0 34.6 31.6 37.6 33.2 30.8 33.3 34.9 31.8 33.3 29.4 30.6 33.1 32.0 31.4 31.9 31.5 35.0 32.3 29.3 32.0 35.3 37.7 34.6 35.7 34.9 36.6 30.2 29.4 35.4 35.5 32.0 30.4 29.7 33.7
What to do:
1 Select a sample of 10 boys from this population by:
a rolling a die to select one of the 6 blocks
b rolling the die again to select a row in the block
c rolling the die again to select a boy in the row
d counting off 10 boys from left to right from the boy you selected.
If the 3 rolls of the die produced {3, 2, 4}, the boy selected has weight 30.1 kg.
The sample selected is presented in the first column of the table.
2 Copy and enter your data in the following table.

Number     Sample 1   Sample 2   Sample 3   Sample 4   Sample 5
1          30.1
2          34.9
3          32.3
4          34.9
5          31.4
6          33.0
7          32.4
8          29.7
9          33.6
10         30.6
mean, x̄    32.3

3 The last row in this table consists of 5 sample means.
The variable of sample means can be denoted by $\overline{X}_{10}$. The bar on the top indicates
it is a variable of means; the subscript 10 indicates that the means are of samples of
size ten.
The last row of your table is a sample of size 5 from the distribution of $\overline{X}_{10}$.
4 Combine your results with those of the other students of your class.
Draw a histogram of the sample means.
5 Calculate the mean and the standard deviation of the sample means.
6 Compare the mean and the standard deviation you found in 5 with the mean weight
33.1 kg and standard deviation 2.54 kg of the 216 boys.
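As a small illustration of steps 2 and 6, the sketch below computes the mean of the sample printed as Sample 1 and compares it with the population mean of 33.1 kg quoted in step 6. The variable names are illustrative; only the ten weights and the population mean come from the text.

```python
from statistics import mean

# Sample 1 from the table: the ten weights (kg) counted off from the selected boy.
sample_1 = [30.1, 34.9, 32.3, 34.9, 31.4, 33.0, 32.4, 29.7, 33.6, 30.6]
population_mean = 33.1   # mean weight of the 216 boys, as given in step 6

x_bar = mean(sample_1)
print(round(x_bar, 1))                     # the sample mean entered in the table
print(round(x_bar - population_mean, 1))   # how far this sample mean is from 33.1
```

A different sample of ten boys would give a different value of the sample mean, which is why the sample mean is itself treated as a variable.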
From Investigation 4 you should have discovered that the sample means are close to the
population mean. The mean of the sample means should be particularly close to the population
mean.
You should also have noted that the standard deviation of the sample means is smaller than
the standard deviation of the population.
The following important investigation uses a computer to speed up sampling and obtain a
more accurate picture of how the standard deviation of the sample means is related to the
standard deviation of the population.
In this investigation it is important to distinguish between:
• The original population, sometimes referred to as the "parent population", with a
random variable X which has mean μ and standard deviation σ.
In Investigation 4 the parent population consists of 216 thirteen year old boys.
The mean μ = 33.1 kg and standard deviation σ = 2.54 kg.
and
• The new population with variable $\overline{X}_n$, consisting of all statistics of sample means.
The subscript n indicating the sample size is sometimes omitted and the variable
just written $\overline{X}$.
A typical outcome of $\overline{X}$ is a sample mean $\bar{x} = \dfrac{x_1 + x_2 + \dots + x_n}{n}$.
In Investigation 4 a typical outcome is the mean weight of 10 boys.
The investigation explores the shape of the distribution of the random variable $\overline{X}$, its
mean $\mu_{\overline{X}}$ or $\mu(\overline{X})$, and its standard deviation $\sigma_{\overline{X}}$ or $\sigma(\overline{X})$.
INVESTIGATION 5
A COMPUTER BASED RANDOM SAMPLER
In this investigation we examine the variation in sample means.
We examine samples taken from symmetric distributions as well as one that
is skewed.
We start by sampling from a population which has a normal distribution. The heights of
18 year old Australian males may be approximately normal.
What to do:
1 Click on the icon given alongside. This opens a worksheet named
Samples with a number of buttons. Click on each of these buttons
in turn.
2 Sample size: from which you can select the numbers n = 10, 20, 40, 80, 160.
Start with n = 10.
3 Find sample means: finds the means of each of two hundred different samples.
4 Analyse: lists the two hundred sample means.
It finds the standard deviation $s_{\overline{X}}$ of these sample means and
draws a histogram of them.
It also superimposes a normal probability density function.
This output is shown on the worksheet named Analysis.
Note that the first graph on this worksheet is the graph of the probability density
function of the population, and that the axes differ from those of the other graphs.
5 Make a copy of the table below.
Enter the value of $(s_{\overline{X}})^2$ in the first
column next to n = 10.

n      Trial 1 $(s_{\overline{X}})^2$   Trial 2 $(s_{\overline{X}})^2$   Trial 3 $(s_{\overline{X}})^2$   Trial 4 $(s_{\overline{X}})^2$
10
20
40
80
160

6 Go back to the worksheet named
Samples and change the sample size
to 20. Repeat steps 3, 4, and 5.
Enter the value of $(s_{\overline{X}})^2$ next to
n = 20 in the table.
7 Repeat for samples of size 40, 80
and 160.
8 We wish to see how $(s_{\overline{X}})^2$ is related to the standard deviation of the population.
However, $(s_{\overline{X}})^2$ can vary quite a lot, so to spot the pattern more clearly you should
repeat the experiment another 3 times.
9 From your experiment, determine a relationship between the square of the standard
deviation of the sample means, $(s_{\overline{X}})^2$, and the square of the population standard deviation.
10 Now click on the icon to sample data from a population with a
uniform distribution. These distributions are very commonly used
in computer games where, for example, cards have to be selected
at random. Complete an analysis of this data by repeating the
above procedure and recording all results.
11 Now click on the icon to sample data from a population with an
exponential distribution. These distributions are notoriously skew.
They are commonly used in modelling lifetimes, such as the
lifetime of light globes. Complete an analysis of this data by
repeating the above procedure and recording all results.
From the investigation you should have discovered the following:
If X is a random variable with mean μ and standard deviation σ then the random
variable $\overline{X}_n$ of sample means of size n has:
• mean $\mu_{\overline{X}} = \mu$, the same as the mean of the random variable X
• standard deviation $\sigma_{\overline{X}} = \dfrac{\sigma}{\sqrt{n}}$.
Furthermore, for large values of n, $\overline{X}_n$ is approximately normal.
You should notice:
• The histogram of the sample means becomes symmetric and starts
to take on a bell-like shape. For large values of n it becomes
approximately normal.
• The mean of the sample means approximates the population mean.
Individual points selected from any distribution are likely to come from either
side of the mean, and differences are likely to average out.
[Figure: three samples x₁, x₂, x₃, ..., xₙ drawn from a population with mean μ; their sample means x̄₁, x̄₂, x̄₃ cluster around $\mu_{\overline{X}}$.]
• As the sample size increases, there is less variability.
• This diagram shows what happens if the sample size n increases.
[Figure: three distributions of sample means, each with its spread $\sigma_{\overline{X}}$ marked, becoming narrower as n increases.]
The spread decreases since $\sigma_{\overline{X}} = \dfrac{\sigma}{\sqrt{n}}$
and $\mu_{\overline{X}} = \mu$.
In the Appendix the behaviour of the mean and the standard deviation is explored algebraically. It is beyond the level of this course to show why the distribution of the sample means is approximately normal.
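The result $\sigma_{\bar X} = \frac{\sigma}{\sqrt{n}}$ can also be checked by simulation. The sketch below is only an illustration and is not the spreadsheet used in the investigation: the helper name `sample_mean_sd`, the number of trials, the seed, and the choice of an exponential population with mean 10 (so $\sigma = 10$) are all assumptions made for this example.

```python
import random
import statistics


def sample_mean_sd(draw, n, trials=2000, seed=1):
    """Standard deviation of `trials` sample means, each from a sample of size n."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = [statistics.fmean(draw(rng) for _ in range(n)) for _ in range(trials)]
    return statistics.stdev(means)


def draw_exponential(rng):
    """One value from a (hypothetical) exponential population with mean 10."""
    return rng.expovariate(1 / 10)


sigma = 10  # for an exponential population the standard deviation equals the mean
for n in (10, 40, 160):
    observed = sample_mean_sd(draw_exponential, n)
    predicted = sigma / n ** 0.5
    print(f"n = {n:3d}: observed {observed:.3f}, predicted {predicted:.3f}")
```

Each time n is multiplied by 4, the observed spread of the sample means should roughly halve, in line with the formula.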
Example 16
The life expectancy X of a certain brand of AAA battery is known to have mean $\mu = 27$ hours and standard deviation $\sigma = 3.25$ hours. The batteries are sold in packets of 6. Let the random variable $\bar X_6$ be the mean life expectancy of batteries in a packet.
a The 6 batteries in a packet were tested and the numbers of hours they lasted were: 25.3, 21.6, 27.75, 22.25, 35.5, 28.5. What is the corresponding outcome of the random variable $\bar X_6$?
b If the numbers of hours lasted by batteries in a packet of six were $x_1, x_2, x_3, x_4, x_5, x_6$, what is the corresponding outcome of $\bar X_6$?
c What are the mean and standard deviation of $\bar X_6$?

a The outcomes of $\bar X_6$ are the means of the life expectancies of 6 batteries in a packet. In this case the outcome of $\bar X_6$ is the statistic
$\bar x = \frac{25.3 + 21.6 + 27.75 + 22.25 + 35.5 + 28.5}{6} \approx 26.8$
b If the batteries in the packet lasted for $x_1, x_2, x_3, x_4, x_5, x_6$ hours, the corresponding outcome of $\bar X_6$ is the statistic
$\bar x = \frac{x_1 + x_2 + x_3 + x_4 + x_5 + x_6}{6}$.
c The mean of $\bar X_6$ is the same as the mean of X, so $\mu_{\bar X_6} = 27$ hours.
Since the standard deviation of X is 3.25, the standard deviation of $\bar X_6$ is
$\sigma_{\bar X_6} = \frac{\sigma}{\sqrt{6}} = \frac{3.25}{\sqrt{6}} \approx 1.327$
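The arithmetic in Example 16 can be reproduced in a few lines. This is a minimal sketch using only the Python standard library; the variable names are chosen for this illustration.

```python
import math
import statistics

# Battery lifetimes (hours) for the packet tested in part a
hours = [25.3, 21.6, 27.75, 22.25, 35.5, 28.5]

x_bar = statistics.fmean(hours)            # outcome of the sample mean for this packet
sigma = 3.25                               # known population standard deviation
sd_of_mean = sigma / math.sqrt(len(hours)) # standard deviation of the sample mean

print(round(x_bar, 1), round(sd_of_mean, 3))
```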
EXERCISE 7G.1
1 A machine produces sheets of cardboard with mean thickness 3 mm and standard deviation 0.12 mm. A quality controller checks the thickness of each sheet in 10 different places. Let the random variable X be the thickness of the cardboard at any point, and let the random variable $\bar X_{10}$ be the mean thickness of the 10 points.
a The quality controller records the following thicknesses in mm from a sample of 10 points: 3.02, 2.77, 3.08, 2.89, 3.21, 2.79, 2.97, 3.07, 2.94, 3.01. What is the corresponding outcome of the random variable $\bar X_{10}$?
b If the quality controller records 10 outcomes of X as $x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_{10}$, what is the corresponding statistic of $\bar X_{10}$?
c What are the mean and standard deviation of $\bar X_{10}$?
2 Records show that a machine has been producing screws with mean length 75 mm and standard deviation 0.5 mm. Screws are packaged in lots of 50. Let the random variable $\bar X_{50}$ be the mean length of a screw in a packet.
Find the mean and standard deviation of $\bar X_{50}$.
3 The time it takes a train from Adelaide to Belair to complete its journey is known to have a mean of 40 minutes and standard deviation of 3 minutes. An inspector times 8 such trips. Let $\bar X_8$ be the mean travel time of a sample of 8 trips. Find the mean and standard deviation of $\bar X_8$.
4 Suppose the probability a coin falls heads is p and the probability it falls tails is $q = 1 - p$.
Let the random variable X = 1 if it falls heads and X = 0 if it falls tails.
a Show that the mean of X is p.
b Show that the standard deviation of X is $\sqrt{pq} = \sqrt{p(1 - p)}$.
c Let $\bar X_n$ be the sample mean of n tosses of the coin.
i Find the mean and standard deviation of $\bar X_n$.
ii Describe in words how $\bar X_n$ is related to the tosses of a coin.
In general, knowing the mean and standard deviation of a random variable X is insufficient information to calculate probabilities. However, we are able to calculate probabilities in the special case where X is normally distributed. Not only that, but if X is normally distributed, the random variable $\bar X_n$ of sample means of size n is also normally distributed.
Example 17
Suppose the random variable X is normally distributed with mean 40 and standard deviation 10. Let $\bar X_{20}$ be the sample means of size 20. Find:
a $\Pr(35 < X < 45)$
b $\Pr(35 < \bar X_{20} < 45)$.

a $\Pr(35 < X < 45) = \text{normalcdf}(35, 45, 40, 10) \approx 0.383$
b The mean of $\bar X_{20}$ = mean of X = 40.
The standard deviation of $\bar X_{20}$ is $\frac{10}{\sqrt{20}}$.
$\Pr(35 < \bar X_{20} < 45) = \text{normalcdf}(35, 45, 40, \frac{10}{\sqrt{20}}) \approx 0.975$
Notice that about 38% of the individual outcomes are in the interval 35 < X < 45, but about 97.5% of the sample means lie in this interval.
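For readers without a graphics calculator, the normalcdf calculations can be sketched with `statistics.NormalDist` from the Python standard library. The helper name `prob_between` is an assumption made for this illustration; it mirrors the argument order normalcdf(lower, upper, mean, standard deviation) used in the text, and the second call uses samples of size 20.

```python
from math import sqrt
from statistics import NormalDist


def prob_between(lower, upper, mu, sigma):
    """Pr(lower < X < upper) for X normal with mean mu and standard deviation sigma."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(upper) - dist.cdf(lower)


# Individual outcomes: X normal with mean 40 and standard deviation 10
p_individual = prob_between(35, 45, 40, 10)

# Sample means of size 20: same mean, standard deviation 10 / sqrt(20)
p_sample_mean = prob_between(35, 45, 40, 10 / sqrt(20))

print(round(p_individual, 3), round(p_sample_mean, 3))
```

The interval is the same in both calls; only the standard deviation changes, which is why the second probability is so much larger.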
Example 18
The time T it takes to serve a customer at a railway station ticket booth is normally
distributed with mean 45 seconds and standard deviation 20 seconds. You only have
10 minutes to buy your ticket or you will miss your train. If there is a line of 11
people in front of you waiting to be served, what is the probability you will catch the
train?
Including yourself there are 12 persons in the line to be served.
To complete buying your ticket in less than 10 minutes the mean serving time per person has to be less than $\frac{10 \times 60}{12} = 50$ seconds.
Let the random variable $\bar T_{12}$ be the mean time to serve 12 persons.
Since T is normally distributed with mean 45 and standard deviation 20, $\bar T_{12}$ is normally distributed with mean 45 and standard deviation $\frac{20}{\sqrt{12}}$.
$\Pr(\bar T_{12} < 50) = \text{normalcdf}(-\text{E}99, 50, 45, \frac{20}{\sqrt{12}}) \approx 0.807$
So, the probability of catching the train is about 0.807.
5 Suppose the random variable X is normally distributed with mean 80 and standard deviation 20. Let $\bar X_{10}$ be the sample means of size 10. Find:
a $\Pr(75 < X < 85)$
b $\Pr(75 < \bar X_{10} < 85)$
6 Let the random variable X be the IQ of 17 year old girls. Suppose X is normally
distributed with mean 105 and standard deviation 15.
a Find the probability that an individual 17 year old girl has an IQ of more than 110.
b Find the probability that the mean IQ of a class of twenty 17 year old girls is greater
than 110.
7 A manufacturer of chocolates produces chocolates of mean weight 20 g and standard
deviation 5 g. A box of 13 such chocolates is sold with the claim that the nett weight in
the box is 250 g. Assuming the weights are normally distributed:
a For what proportion of boxes is this claim correct?
b If the manufacturer decides to increase the number of chocolates to 15 per box, for
what proportion of boxes is the claim now true?
THE CENTRAL LIMIT THEOREM
In the previous investigation, we also observed that the distribution of the sample means $\bar X$ is approximately normal.
The Central Limit Theorem
Suppose X is a random variable which is not necessarily normally distributed, but has mean $\mu$ and standard deviation $\sigma$. For sufficiently large n, the distribution $\bar X_n$ of the sample means of size n is approximately normal with mean $\mu_{\bar X} = \mu$ and standard deviation $\sigma_{\bar X} = \frac{\sigma}{\sqrt{n}}$.
Note:
• There is no simple answer as to how large n should be before the central limit theorem can be applied. It depends on many factors including how much accuracy is required. If the population is very skew it may require a large sample size n, whereas if the population is symmetric a small sample size n may be sufficient. As a rule of thumb, $n > 30$ is often used, but each case must be considered on its merits.
• In the special case where the population is normally distributed, the distribution $\bar X$ of the sample means is always normal.
THE SAMPLING ERROR
We are trying to estimate the population mean using a sample mean. By only looking at a
small portion of the population, the sample mean is likely to be different from the population
mean.
The standard deviation $\sigma_{\bar X} = \frac{\sigma}{\sqrt{n}}$ of the sample means $\bar X$ is a measure of the variability of sample means, and is called the sampling error or the standard error.
Note:
• Unless the population is small, the population size is almost irrelevant.
• The larger the value of n, the smaller the sampling error. A sufficiently large sample should give an accurate estimate of the mean. However, making the sample size too big may be expensive and may not improve the reliability of the estimate by much.
For example, a sample size of 1000 gives a sampling error of $\sigma_{\bar X} = \frac{\sigma}{\sqrt{1000}} \approx \frac{\sigma}{32}$, whereas a sample of 4000, four times the size, only halves the sampling error.
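The halving can be seen directly from the formula. As a short worked check (using only the symbols already defined above):

```latex
\sigma_{\bar X}\,\Big|_{n = 4000}
  = \frac{\sigma}{\sqrt{4000}}
  = \frac{\sigma}{\sqrt{4}\,\sqrt{1000}}
  = \frac{1}{2}\cdot\frac{\sigma}{\sqrt{1000}}
  \approx \frac{\sigma}{63}
```

More generally, multiplying the sample size by k divides the sampling error by $\sqrt{k}$, so each further gain in precision costs proportionally more data.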
Example 19
Two histograms of samples, each of size 400, are shown below. One is from a uniform distribution X with mean 10 and standard deviation 5.77. The other is from the distribution $\bar X_{36}$ of the sample means of size 36 selected from the distribution X.
Note that the scales are not the same in the two diagrams.
[Figure: Histogram A, frequency against interval, with bins of width 0.25 running from <7 to [12.75, 13). Histogram B, frequency against interval, with bins of width 1 running from <0 to [19, 20).]
a Which of the two histograms is from $\bar X_{36}$? Give reasons for your answer.
b From the diagram estimate $\Pr(\bar X_{36} < 9)$.
c Find the approximate mean and standard deviation of $\bar X_{36}$.
d Use the histogram to estimate the probability that $\bar X_{36}$ is within one standard deviation of the mean.

a The data in Histogram A is less spread out than that in Histogram B, and appears clustered around 10. Histogram A is the histogram for the distribution $\bar X_{36}$.
b To find $\Pr(\bar X_{36} < 9)$ we count the numbers in all the bins before the bin [9, 9.25), and use the fact that there are 400 in the sample. We get:
$\Pr(\bar X_{36} < 9) = \frac{15 + 15 + 12 + 3 + 2 + 3 + 2 + 1}{400} = \frac{53}{400} \approx 0.13$
Your answer may vary a little depending on how well you can read the numbers on the graph.
c The mean of $\bar X_{36}$ = mean of X = 10.
The standard deviation $\sigma_{\bar X} = \frac{\sigma}{\sqrt{36}} = \frac{5.77}{6} \approx 0.962$
d $\Pr(10 - 0.96 < \bar X_{36} < 10 + 0.96) = \Pr(9.04 < \bar X_{36} < 10.96)$
$\approx \Pr(9 < \bar X_{36} < 11)$
$= \frac{30 + 27 + 39 + 44 + 45 + 42 + 31 + 30}{400}$
$= 0.72$
This crude estimate compares with 0.68 when using the normal approximation.
EXERCISE 7G.2
1 The IQ measurements of a population have mean 100 and standard deviation 15. Many
hundreds of random samples of size 36 are taken from the population and a relative
frequency histogram of the sample means is formed.
a What would we expect the mean of the samples to be?
b What would we expect the standard deviation of the samples to be?
c What would we expect the shape of the histogram to look like?
2 Two histograms of sample size 300 each are shown below. One is from a life expectancy distribution X with mean 10 and standard deviation 10. The other is from the distribution $\bar X_{64}$ of the sample means of size 64 selected from the distribution X. Note that the scales are not the same in the two diagrams.
[Figure: Histogram A and Histogram B, each showing frequency against interval. One histogram has bins of width 2 running from <0 to [52, 54); the other has bins of width 0.25 running from <7 to [14.25, 14.5).]
a Which of the two histograms is from $\bar X_{64}$? Give reasons for your answer.
b From the diagram estimate $\Pr(\bar X_{64} < 9)$.
c Find the approximate mean and standard deviation of $\bar X_{64}$.
d Use the histogram to estimate the probability that $\bar X_{64}$ is within one standard deviation of the mean. How does this answer compare with using the normal approximation?
Example 20
The age of men in Australia is distributed with mean 43 and standard deviation 8. If a sample of 67 men is selected from the population of Australian men, what is the probability the sample mean is:
a less than 42
b greater than 45
c between 40 and 45?

Let the random variable $\bar X$ be the mean age of samples of 67 Australian males.
Assuming n = 67 is sufficiently large for the Central Limit Theorem to apply, $\bar X$ is approximately normal with mean 43 and standard deviation $\sigma_{\bar X} = \frac{8}{\sqrt{67}}$.
a $\Pr(\bar X < 42) = \text{normalcdf}(-\text{E}99, 42, 43, \frac{8}{\sqrt{67}}) \approx 0.153$
b $\Pr(\bar X > 45) = \text{normalcdf}(45, \text{E}99, 43, \frac{8}{\sqrt{67}}) \approx 0.0204$
c $\Pr(40 < \bar X < 45) = \text{normalcdf}(40, 45, 43, \frac{8}{\sqrt{67}}) \approx 0.979$
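The three probabilities in Example 20 can be checked with a short sketch. It assumes, as the example does, that the Central Limit Theorem applies for n = 67; the variable names are chosen for this illustration only.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 43, 8, 67

# Approximate distribution of the sample mean under the Central Limit Theorem
x_bar_dist = NormalDist(mu, sigma / sqrt(n))

p_less_than_42 = x_bar_dist.cdf(42)                     # part a
p_greater_than_45 = 1 - x_bar_dist.cdf(45)              # part b
p_between = x_bar_dist.cdf(45) - x_bar_dist.cdf(40)     # part c

print(round(p_less_than_42, 3), round(p_greater_than_45, 4), round(p_between, 3))
```

Using `cdf` directly removes the need for the calculator's stand-in bounds of ±E99 for "minus infinity" and "infinity".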
3 During a one week period in Sydney the mean price of an orange was 42.8 cents with standard deviation 8.7 cents. Find the probability that the mean price per orange from a case of 60 oranges was less than 45 cents.
4 The mean energy content of a fruit bar is 1067 kJ with standard deviation 61.7 kJ. Find the probability that the mean energy content of a sample of 30 fruit bars is more than 1050 kJ/bar.
5 The mean sodium content of a box of cheese rings is 1183 mg with standard deviation 88.6 mg. Find the probability that the mean sodium content per box for a sample of 50 boxes lies between 1150 mg and 1200 mg.
6 Customers at a clothing store are in the shop for a mean time of 18 minutes with standard deviation 5.3 minutes. What is the probability that in a sample of 37 customers the mean stay in the shop is between 17 and 20 minutes?
7 The mean contents of a can of cola is 382 mL, even though it says 375 mL on a can. The statistician at the factory says that the standard deviation is steady at 16.2 mL. Find the probability that a slab of three dozen cans has mean contents less than 375 mL per can.
Example 21
A population is known to have a standard deviation of 8 but has an unknown mean $\mu$. In order to estimate $\mu$, the mean of a random sample of 60 is found. Find the probability that this estimate is out by less than 2.

Let the random variable $\bar X$ be the mean of samples of 60. As the sample size is larger than 30, we assume that $\bar X$ is normally distributed with mean $\mu$ and standard deviation $\frac{8}{\sqrt{60}}$.
We need to find $\Pr(-2 < \bar X - \mu < 2)$.
Now $\Pr(-2 < \bar X - \mu < 2) = \Pr\left(\frac{-2}{8/\sqrt{60}} < \frac{\bar X - \mu}{8/\sqrt{60}} < \frac{2}{8/\sqrt{60}}\right)$
$= \Pr\left(-\frac{\sqrt{60}}{4} < Z < \frac{\sqrt{60}}{4}\right)$
$= \text{normalcdf}\left(-\frac{\sqrt{60}}{4}, \frac{\sqrt{60}}{4}, 0, 1\right)$
$\approx 0.947$
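The standardising step in Example 21 does not depend on the unknown $\mu$, so it can be wrapped in a small function. The sketch below is an illustration only: the function name `prob_error_within` is an assumption, and it relies on the same normal approximation for the sample mean as the example.

```python
from math import sqrt
from statistics import NormalDist


def prob_error_within(margin, sigma, n):
    """Pr(|sample mean - mu| < margin), assuming the sample mean is normal
    with standard deviation sigma / sqrt(n)."""
    z = margin / (sigma / sqrt(n))       # margin expressed in standard errors
    return 2 * NormalDist().cdf(z) - 1   # area between -z and z under N(0, 1)


p = prob_error_within(margin=2, sigma=8, n=60)
print(round(p, 3))
```

Because the function takes the margin, the standard deviation and the sample size as arguments, the same call pattern applies to other "estimate is in error by less than ..." questions.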
8 A sample of 375 people will be used to estimate
the mean number of hours that will be lost due
to sickness this year. Last year the standard deviation for the number of hours lost was 67 and
we will use this as the standard deviation this
year. What is the probability that the estimate is
in error by less than ten hours?
9 A concerned union member wishes to estimate the hourly wage of shop assistants in Adelaide. He decides to randomly survey 300 shop assistants to calculate the sample mean. Assuming that the standard deviation is \$1.27, find the probability that the estimate of the population mean is in error by 10 cents or more.
INVESTIGATION 6
CHOCKBLOCKS
Chockblock produce mini chocolate bars which vary a little in weight. The machine used to make them produces bars whose weights are normally distributed with mean 18.2 grams and standard deviation 3.3 grams. 25 bars are then placed in a packet for sale. Hundreds of thousands of packets are produced each year with mean weight $\bar X$.
What to do:
1 What are the mean $\mu_{\bar X}$ and standard deviation $\sigma_{\bar X}$ of $\bar X$?
2 Printed on each packet is the nett weight of contents, 425 grams. What is the manufacturer claiming about the mean weight of each bar?
3 What percentage of their packets will be rejected because they fail to meet the 425
gram claim?
4 An additional bar is added to each packet with the nett weight claim retained at 425
grams.
a What is the minimum acceptable claim now?
b What are the mean $\mu_{\bar X}$ and standard deviation $\sigma_{\bar X}$ now?
c What percentage of these packets would we expect to reject?
H
HYPOTHESIS TESTING FOR A MEAN
Claims are often made about the population mean of some
quantities.
For example, it is claimed that the mean protein content of a
1 litre carton of milk is 39 grams. The truth of this claim can
only be known by measuring the protein content of every 1
litre carton of milk, clearly an impossible task. It is, however,
possible to draw reasonable conclusions from measuring the
protein content of a random selection of cartons.
A statistical hypothesis is a statement about a population parameter. The parameter
could be a population mean or a proportion.
In this section we will test hypotheses concerning the mean ¹.
HYPOTHESIS ABOUT MEANS
When a statement is made about a product, it is usually tested statistically before changes to
the product are made.
For example, suppose a consumer makes the statement that the mean protein content in
1 litre cartons of milk is not 39 grams. The milk company does not want to go to the
expense of changing packaging until it is statistically shown that the mean protein content is
indeed not 39 grams. The company will start with the assumption that their claim is true,
and whatever tests the consumer did were just random fluctuations. This assumption or
statement of no change is called the null hypothesis and is usually denoted H0.
The alternative hypothesis denoted Ha is that the statistical evidence is sufficient to accept
the consumer’s claim, i.e., that the milk company’s statement is false.
So, we consider two hypotheses:
• a null hypothesis $H_0$ which is a statement of no difference or no change. It is assumed to be true until sufficient evidence is provided so that it is rejected.
• an alternative hypothesis $H_a$ which is a statement that there is a difference or change which has to be established. Supporting evidence is necessary if it is to be accepted.
HYPOTHESIS TESTING WHEN THE POPULATION IS NORMALLY
DISTRIBUTED
We want to test the claim that the mean protein content of 1 litre cartons of milk is 39 grams.
The null hypothesis is $H_0: \mu = 39$.
The alternative hypothesis is $H_a: \mu \neq 39$.
Suppose we select a sample of 10 cartons of milk and find that for this sample the mean protein content is $\bar x = 38.4$ grams.
We need to determine the likelihood that this difference is
due to random fluctuation or chance, or whether it is
sufficient evidence to say the milk company’s statement is
incorrect.
Since the protein content of milk is a result of many
different factors, it is reasonable to assume that the protein
content of 1 litre cartons of milk is normally distributed.
Suppose it is known that the standard deviation of protein in 1 litre containers of milk is $\sigma = 0.8$ grams.
Let X be the protein content of a 1 litre container of milk, so according to the null hypothesis, $X \sim N(39, 0.8^2)$.
Let the random variable $\bar X$ be the mean protein content of a sample of 10 one litre cartons.
Hence $\bar X \sim N\left(\mu, \left(\frac{\sigma}{\sqrt{n}}\right)^2\right)$, i.e., $\bar X \sim N\left(39, \left(\frac{0.8}{\sqrt{10}}\right)^2\right)$.
We use this to calculate the z-score of the observed value $\bar x = 38.4$ grams.
$z = \frac{\bar x - \mu}{\sigma / \sqrt{n}} = \frac{38.4 - 39}{0.8 / \sqrt{10}} \approx -2.37$
So the number of standard deviations that $\bar x$ is from the mean is $-2.37$.
If the difference between the observed value of $\bar x$ and the mean is due to chance alone, it could just as likely have been 2.37 standard deviations to the left or right of the mean. So, the probability that $\bar X$ is 2.37 standard deviations or more either side of the mean is a measure of how likely this is to occur.
Now $\Pr(Z \leq -2.37 \text{ or } Z \geq 2.37) = 2 \times \Pr(Z \leq -2.37)$ {symmetry}
$= 2 \times \text{normalcdf}(-\text{E}99, -2.37)$
$\approx 0.0178$
so the probability of this event happening is small.
One of the problems with random processes is that differences can always be due to chance.
However, the practical solution is to reject the null hypothesis if the probability of the observed
or more extreme results occurring is small.
The probability $\alpha$ at which we reject the null hypothesis is called the significance level of the test. Common significance levels are $\alpha = 0.05$ or 5% and $\alpha = 0.01$ or 1%.
In the above example, $\Pr(Z \leq -2.37 \text{ or } Z \geq 2.37) = 0.0178$. This is less than 0.05 so we would reject the null hypothesis at the significance level of 0.05, but not at the significance level of 0.01.
The procedure for testing a hypothesis is set out below, with the milk cartons example shown at each step.
Step 1: State the null hypothesis $H_0: \mu = \mu_0$ and the alternative hypothesis $H_a: \mu \neq \mu_0$.
Milk cartons example: $H_0: \mu = 39$, $H_a: \mu \neq 39$
Step 2: Select a significance level, usually 0.05. Unless otherwise stated, the level of 0.05 is used in this book.
Step 3: From a sample, calculate the sample mean $\bar x$. If the parent population is normally distributed with mean $\mu$ and standard deviation $\sigma$, then the random variable $\bar X$ of sample means has the normal distribution $N\left(\mu, \left(\frac{\sigma}{\sqrt{n}}\right)^2\right)$.
$N\left(\mu, \left(\frac{\sigma}{\sqrt{n}}\right)^2\right)$ is called the null distribution.
The null distribution is critical. It allows us to calculate the probability of the observed or more extreme events happening if the null hypothesis is true.
Milk cartons example: $\bar X \sim N(39, 0.253^2)$
Step 4: Use the sample mean $\bar x$ to find the test statistic $z = \frac{\bar x - \mu}{\sigma / \sqrt{n}}$.
The Z-test derives its name from this statistic.
Milk cartons example: $z \approx -2.37$
Step 5: Calculate the probability of all observations having z-values more extreme than the test statistic z found in Step 4.
The P-value is the probability of all observations having a z-value more extreme than the test statistic.
Since we include the extreme outcomes either side of the mean, we call this a two-sided Z-test. Only two-sided tests are considered in this course.
Milk cartons example: $P = \Pr(Z \leq -2.37 \text{ or } Z \geq 2.37) = 0.0178$
Step 6:
• Reject the null hypothesis if the P-value is less than the significance level decided on in Step 2. The smaller the P-value is, the stronger the evidence against the null hypothesis.
Milk cartons example: Since P < 0.05 we reject the null hypothesis at the 0.05 level.
• If the P-value is larger than the significance level decided on in Step 2, do not reject the null hypothesis.
Milk cartons example: Since P > 0.01, we do not reject the null hypothesis at the 0.01 level.
When a null hypothesis is not rejected, the terms “retain” and “accept” are often used. This
does not mean that the null hypothesis is true, but rather that there is not enough evidence to
show it is not true.
Similarly, when rejecting the null hypothesis, it is often stated that the alternative hypothesis
is “accepted”. This does not mean that the alternative hypothesis is true. However, if the null
hypothesis is true, the outcome that led to rejecting it is a very unlikely one. The P-value
tells you just how unlikely.
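The six steps of the two-sided Z-test can be sketched as a single function. This is an illustrative sketch rather than the calculator procedure used in the book: the function name `two_sided_z_test` and its return format are assumptions, and it presumes the population standard deviation $\sigma$ is known and the sample means are (at least approximately) normally distributed.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_z_test(x_bar, mu0, sigma, n, alpha=0.05):
    """Test H0: mu = mu0 against Ha: mu != mu0.

    Returns (z, p_value, reject) where reject is True when p_value < alpha.
    """
    z = (x_bar - mu0) / (sigma / sqrt(n))      # Step 4: test statistic
    p_value = 2 * NormalDist().cdf(-abs(z))    # Step 5: both tails, by symmetry
    return z, p_value, p_value < alpha         # Step 6: compare with alpha


# Milk cartons example: x_bar = 38.4, mu0 = 39, sigma = 0.8, n = 10
z, p_value, reject_at_5 = two_sided_z_test(38.4, 39, 0.8, 10)
_, _, reject_at_1 = two_sided_z_test(38.4, 39, 0.8, 10, alpha=0.01)
print(round(z, 2), round(p_value, 4), reject_at_5, reject_at_1)
```

The P-value printed here is computed from the unrounded test statistic, so its last digit can differ slightly from a value obtained by first rounding z to $-2.37$.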
Example 22
A Mathematics coaching school knows that the results for their final test are
normally distributed with population mean 74% and standard deviation 7%. A new
coaching technique which is cheaper to implement but reported to have the same
results is trialled by the school. In a trial of 40 students it is found that the mean
score for the final test is 72% with standard deviation 6%. Is there sufficient
evidence at the 5% level to conclude that the final test scores will be different?
Step 1: $H_0: \mu = 74$, $H_a: \mu \neq 74$
Step 2: Significance level is 0.05
Step 3: The sample mean is $\bar x = 72$.
Let the random variable $\bar X$ be the sample means, so the null distribution is $\bar X \sim N\left(\mu, \left(\frac{\sigma}{\sqrt{n}}\right)^2\right)$, i.e., $\bar X \sim N\left(74, \left(\frac{7}{\sqrt{40}}\right)^2\right)$.
Step 4: The test statistic is $z = \frac{\bar x - \mu}{\sigma / \sqrt{n}} = \frac{72 - 74}{7 / \sqrt{40}} \approx -1.81$
Step 5: The P-value is $P = \Pr(Z \leq -1.81 \text{ or } Z \geq 1.81)$
$= 2 \times \Pr(Z \leq -1.81)$
$\approx 0.0708$
Step 6: As P = 0.0708 > 0.05 there is insufficient evidence to reject the null hypothesis that the new coaching technique produces the same results as the old technique. We thus accept that the new technique has the same result as the old technique.
Notice that we use $\sigma$ and not s for the Z-test.
Note: If $H_0$ is rejected,
• the direction of the difference is determined by the value of $\bar x$
• we still do not know how accurate the claim was.
EXERCISE 7H.1
1 A random variable X is normally distributed with a standard deviation $\sigma = 4$. It is claimed that the mean of X is $\mu = 17$.
a To test this claim a random sample of n = 50 was taken and the sample mean $\bar x$ was found to be 16.
i Write down the hypotheses $H_0$ and $H_a$.
ii Write down the null distribution.
iii Calculate the test statistic.
iv Calculate the P-value.
v What conclusion is there at the 0.05 level?
b Suppose that a random sample of n = 70 was taken and $\bar x = 16$. What can you now conclude at the 0.05 level?
2 A random variable X is normally distributed with a standard deviation $\sigma = 6$. A random sample of 40 was taken and the sample mean was found to be $\bar x = 61.4$.
Use this information to test the claim that the population mean of X is $\mu = 60$.
Example 23
The bottlers of Groutt claim that the mean volume of bottles is 503 mL.
To test this claim 10 bottles were selected.
The measurements are listed below to the nearest 0.1 mL:
502.5, 501.0, 501.5, 503.9, 498.7, 505.7, 504.6, 499.4, 501.8, 501.1
Test the claim made by the bottlers of Groutt at the 5% level if it is known that the population standard deviation $\sigma$ is 1.8 mL.

We need to test:
the null hypothesis $H_0: \mu = 503$
against the alternative hypothesis $H_a: \mu \neq 503$
Let X be the volume of each bottle of Groutt. As the bottling of liquids is subject to many random fluctuations, it is reasonable to assume that X is normally distributed with mean $\mu$ and standard deviation $\sigma$.
Let $\bar X$ be the distribution of the sample means, so the null distribution of $\bar X$ is $N\left(\mu, \left(\frac{\sigma}{\sqrt{n}}\right)^2\right)$.
From the null hypothesis we assume that $\mu = 503$.
From the sample we find that $\bar x = 502.02$, so the test statistic
$z = \frac{\bar x - \mu}{\sigma / \sqrt{n}} = \frac{502.02 - 503}{1.8 / \sqrt{10}} \approx -1.722$
The P-value is $P = \Pr(Z \leq -1.722 \text{ or } Z \geq 1.722)$
$= 2 \times \Pr(Z \leq -1.722)$
$\approx 0.0851$
As P > 0.05 there is insufficient evidence to reject the claim that the mean volume of bottles of Groutt is 503 mL,
i.e., we accept that the mean volume could be 503 mL.
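When the raw measurements are available, as in Example 23, the sample mean, test statistic and P-value can all be computed from the data. The following is a minimal sketch under the example's assumptions (known $\sigma = 1.8$ mL, normally distributed volumes); the variable names are chosen for this illustration.

```python
from math import sqrt
from statistics import NormalDist, fmean

# Measured volumes (mL) of the 10 bottles in Example 23
volumes = [502.5, 501.0, 501.5, 503.9, 498.7, 505.7, 504.6, 499.4, 501.8, 501.1]
mu0 = 503      # claimed mean volume under the null hypothesis
sigma = 1.8    # known population standard deviation

x_bar = fmean(volumes)                              # sample mean
z = (x_bar - mu0) / (sigma / sqrt(len(volumes)))    # test statistic
p_value = 2 * NormalDist().cdf(-abs(z))             # two-sided P-value

print(round(x_bar, 2), round(z, 3), round(p_value, 4), p_value < 0.05)
```

Note that the sample standard deviation of the ten measurements is never used: the Z-test relies on the known population value $\sigma$.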
3 A market gardener claims that the carrots in his field have a mean weight of 50 grams.
Before buying the crop a buyer pulls 20 carrots at random. She finds that their individual
weights in grams are:
57.6 34.7 53.9 52.5 61.8 51.5 61.3 49.2 56.8 55.9
57.9 58.8 44.3 58.3 49.3 56.0 59.5 47.0 58.0 47.2
a Explain why it is reasonable to assume that the carrots' weights are normally distributed.
b Test the claim made by the market gardener if it is known that the standard deviation for the whole crop is 7.1 grams.
4 The length of screws produced by a machine is known to be normally distributed with standard deviation $\sigma = 0.08$ cm.
The machine is supposed to produce screws with a mean length of $\mu = 2.00$ cm.
A quality controller selects a random sample of 15 screws and finds that the mean length of the 15 screws is $\bar x = 2.04$ cm with sample standard deviation of s = 0.09 cm.
Does this justify the need to adjust the machine?
HYPOTHESIS TESTING WHEN THE POPULATION IS NOT NECESSARILY
NORMALLY DISTRIBUTED
In the examples we have seen so far, the variable X was normally distributed and so the distribution of sample means $\bar X$ was normally distributed also. This may not be true if X is not normally distributed. However, if the sample size n is sufficiently large, the Central Limit Theorem tells us that $\bar X$ is approximately normally distributed with mean $\mu$ and standard deviation $\frac{\sigma}{\sqrt{n}}$.
We can use this fact to test claims about population means.
Example 24
Susan's resting pulse rate has been 55 beats per minute for many years with standard deviation $\sigma = 2.6$ bpm. During a 5 day period she checks her resting pulse rate 8 times a day at regular intervals and finds that it has mean 56.2.
Is there sufficient evidence, at a 5% level, to conclude that Susan's pulse rate has changed?

The null hypothesis is $H_0: \mu = 55$. The alternative hypothesis is $H_a: \mu \neq 55$.
The significance level is $\alpha = 0.05$.
The number in the sample is $n = 5 \times 8 = 40$ and the sample mean is $\bar x = 56.2$.
The population standard deviation is $\sigma = 2.6$.
Let X be Susan's resting pulse rate. We do not know how the random variable X is distributed, but if we assume that n is large enough for the Central Limit Theorem to apply then the null distribution for the sample means $\bar X$ is approximately normally distributed with mean $\mu = 55$ and standard deviation $\frac{\sigma}{\sqrt{n}} = \frac{2.6}{\sqrt{40}} \approx 0.411$.
Entering this information into the calculator gives a P-value of P = 0.00351. As P = 0.00351 < 0.05 there is evidence at the 0.05 level to reject the null hypothesis.
We accept the alternative hypothesis $H_a$ that Susan's pulse rate has changed.
EXERCISE 7H.2
1 Globe Industries make torch globes with standard deviation life time of $\sigma = 9$ hours. If
the globes last too long, people will have no need to buy new ones, but if they do not
last long enough, people will stop buying them. A quality controller is to ensure that
globes made by a machine have a mean life of 80 hours. The quality controller selects
a sample of 50 globes and finds that they have a mean life of 83 hours.
a What is the null hypothesis the quality controller is testing?
b Assuming that a sample of n = 50 is large enough for the Central Limit Theorem
to apply, what is the null distribution the quality controller will be using?
c Is there sufficient reason at the 5% level for the quality controller to adjust the
machine?
2 Let X be the outcome of the roll of a fair six-sided die. The mean outcome of such a die is $\mu = 3.5$ with standard deviation $\sigma = 1.708$. Jack thinks his die may not be fair. To test this he rolls the die 100 times and finds that the mean of the 100 rolls is 3.2.
a What null hypothesis is Jack testing?
b Briefly explain why the outcomes of a roll of a fair die are not normally distributed.
c Assuming that a sample of size n = 100 is large enough for the Central Limit
Theorem to apply, what is the null distribution Jack should be using?
d Does Jack have enough evidence at the 5% level to claim the die is not fair?
e Jack’s sister Betty rolls the same die 200 times and finds that the mean of her sample
is also 3.2. Would Betty come to the same conclusion as Jack?
3 While peaches are being canned, 250 mg of preservative is supposed to be added by a dispensing device.
It is known that the standard deviation of preservative added is 7.3 mg.
To check the machine, the quality controller obtains 60 random samples of dispensed preservative and
finds that the mean preservative added was 242.6 mg.
At a 5% level, is there sufficient evidence that the machine is not dispensing a mean of 250 mg?
4 In recent times the mean age for New Zealand women on their first wedding day is 23.6
years with a standard deviation of 2.9 years. To determine if this differs from Australian
women, a survey of 32 women was carried out. It was found that the mean age was
24.3 years. Test whether there is a significant difference at a 5% level.
REJECTION REGION FOR THE NULL HYPOTHESIS H0
To test the null hypothesis H0: μ = μ0 against the alternative hypothesis Ha: μ ≠ μ0
we have used the test statistic
z = (x̄ − μ) / (σ/√n).
Assuming that z > 0, our test at the 5% significance level has been to reject the null
hypothesis if the P-value
P = Pr(Z ≤ −z or Z ≥ z) < 0.05
i.e., 2 × Pr(Z ≤ −z) < 0.05
i.e., Pr(Z ≤ −z) < 0.025.
But invNorm(0.025) ≈ −1.96, and so we reject the null hypothesis at the 5%
level if the test statistic
z ≤ −1.96 or z ≥ 1.96.
[Figure: standard normal curve with the rejection regions (RR of H0) shaded, each of area 0.025, to the left of −1.96 and to the right of 1.96.]
The rejection region for the null hypothesis H0 is the set of values of the test
statistic for which the null hypothesis is rejected.
The 5% rejection region for the null hypothesis H0: μ = μ0 is the set
{z : z ≤ −1.96 or z ≥ 1.96}.
Example 25
A liquor chain claims that the mean price of wine has not changed from what it was
12 months ago. Records show that 12 months ago the mean price was $13.45 for a
750 mL bottle. A random sample of prices of 389 different bottles of wine is taken
from several stores and the mean price is $13.30 and the standard deviation is $0.25.
Is there sufficient evidence at the 5% level to reject the claim?
H0: μ = 13.45, Ha: μ ≠ 13.45
We use s = 0.25 to estimate σ as n is large.
Assuming that the sample of size n = 389 is large enough for the Central Limit
Theorem to apply, we find the test statistic
z = (x̄ − μ) / (σ/√n) = (13.30 − 13.45) / (0.25/√389) ≈ −11.8
Since z < −1.96 we reject the null hypothesis that there is no difference in the
price and accept the alternative hypothesis that the price has changed.
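The rejection region test in this example can be sketched in a few lines of Python. This is an illustration only; the helper names `z_statistic` and `in_rejection_region` are assumptions introduced here, and the critical value is computed rather than typed in as 1.96.

```python
from math import sqrt
from statistics import NormalDist

def z_statistic(sample_mean, mu0, sigma, n):
    """Test statistic z = (sample mean - mu0) / (sigma / sqrt(n))."""
    return (sample_mean - mu0) / (sigma / sqrt(n))

def in_rejection_region(z, alpha=0.05):
    """True if z lies in the two-sided rejection region at level alpha."""
    critical = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 when alpha = 0.05
    return z <= -critical or z >= critical

# Wine prices: s = 0.25 is used as the estimate of sigma, as in the example.
z = z_statistic(sample_mean=13.30, mu0=13.45, sigma=0.25, n=389)
print(round(z, 1), in_rejection_region(z))
```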
Note that the calculator also calculates the test statistic z when using the 2-sided Z-test.
EXERCISE 7H.3
For questions 1 and 2, test the hypothesis using the rejection region for the null hypothesis.
In each case you may assume that the sample size n is large enough for the Central Limit
Theorem to apply.
1 Quickshave produces disposable razorblades. They claim that the mean number of shaves
before a blade has to be thrown away is 13. A researcher wishes to test the claim and asks
30 men to supply data on how many shaves they got from one of the Quickshave blades. The
researcher found that the mean of the sample was 12.8. Use this information to test the
manufacturer’s claim at a 5% level if the population standard deviation σ is 1.6.
2 It is claimed that the mean disposable income of households in a country town is $50 per
week. To test this claim, 36 households were sampled and it was found that the mean
disposable income of the 36 families was $47. Use this to test the claim that the mean
disposable income is not $50 per week if the population standard deviation σ = $12.
Example 26
To test the hypothesis H0: μ = 40 against Ha: μ ≠ 40, a random sample
of size 60 was taken and found to have mean x̄ and standard deviation s = 7.
For what values of x̄ will the null hypothesis be rejected at the 5% level? Assume
that the sample size is large enough for the Central Limit Theorem to apply.
The test statistic
z = (x̄ − μ) / (σ/√n) ≈ (x̄ − 40) / (7/√60) ≈ (x̄ − 40) / 0.9037
The null hypothesis will be rejected if z ≤ −1.96 or if z ≥ 1.96
i.e., if (x̄ − 40) / 0.9037 ≤ −1.96 or if (x̄ − 40) / 0.9037 ≥ 1.96
∴ x̄ ≤ 40 − 1.96 × 0.9037 or x̄ ≥ 40 + 1.96 × 0.9037
The null hypothesis will be rejected if x̄ ≤ 38.2 or x̄ ≥ 41.8.
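The same rearrangement can be carried out numerically. The sketch below is a hedged illustration; the function name `rejection_bounds` is an assumption, and s = 7 is used in place of σ exactly as in the example.

```python
from math import sqrt
from statistics import NormalDist

def rejection_bounds(mu0, sigma, n, alpha=0.05):
    """Return (lower, upper): H0 is rejected when the sample mean is <= lower or >= upper."""
    critical = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    margin = critical * sigma / sqrt(n)
    return mu0 - margin, mu0 + margin

lower, upper = rejection_bounds(mu0=40, sigma=7, n=60)
print(f"reject H0 if the sample mean <= {lower:.1f} or >= {upper:.1f}")
```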
3 To test the hypothesis H0: μ = −23 against Ha: μ ≠ −23, a random sample
of size 100 was taken and found to have mean x̄.
For what values of x̄ will the null hypothesis be rejected at the 5% level? You may
assume that the sample size is large enough for the Central Limit Theorem to apply and
that the population standard deviation σ = 4.
4 The volume of soft drinks dispensed by a machine is normally distributed with standard
deviation 3 mL. A quality controller has to adjust the machine if the mean volume
dispensed is not 504 mL. To test the machine the quality controller finds the mean
volume x̄ of 20 randomly selected bottles every hour. For what values of x̄ should the
quality controller not adjust the machine?
DISCUSSION
The null hypothesis H0 assumes that the population mean μ is exactly
equal to μ0. This is required to set up the null distribution needed to
calculate probabilities. However, if the variable X that is being tested is
continuous, the probability that μ is exactly equal to μ0 is zero!
Does this mean that if you take a large enough sample, and have a measuring instrument that
can measure outcomes of X accurately enough, you can always reject the null hypothesis?
Compare the formal sentence, “There is a statistically significant difference between the
population mean μ and μ0.” with what is commonly understood by, “There is a significant
difference between the population mean μ and μ0.”
I
CONFIDENCE INTERVALS FOR MEANS
In this section we show how to use a sample mean x̄ to calculate an interval in which we
expect the population mean μ to lie. As with all statistics, our estimate x̄ could by chance
be very far from μ, and we can never be absolutely sure that μ lies within the interval. We
can, however, say how confident we are that μ lies in the interval.
A confidence interval estimate of a parameter (in this case the population mean μ)
is an interval of values between two limits, together with a percentage indicating our
confidence that the parameter lies in that interval.
We now consider how a so-called 95% confidence interval is constructed.
We start by finding the number a for which the standard normal distribution Z has probability
Pr(−a < Z < a) = 0.95.
Because of the symmetry of the graph of the
normal distribution, the statement reduces to
Pr(Z < −a) = 0.025
∴ −a = invNorm(0.025)
−a = −1.95996
a ≈ 1.96
[Figure: standard normal curve with central area 0.95 between −a and a, and area 0.025 in each tail.]
So, Pr(−1.96 < Z < 1.96) = 0.95
This means that:
In any normal distribution, 95% of the outcomes lie within 1:96 standard deviations
from the mean.
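The value 1.96 can be checked without a graphics calculator. This is a small sketch using Python's standard library, where `inv_cdf` plays the role of invNorm; the variable names are illustrative.

```python
from statistics import NormalDist

Z = NormalDist()                      # standard normal: mean 0, standard deviation 1
a = -Z.inv_cdf(0.025)                 # solves Pr(Z < -a) = 0.025
central_area = Z.cdf(a) - Z.cdf(-a)   # Pr(-a < Z < a)
print(f"a = {a:.5f}, Pr(-a < Z < a) = {central_area:.2f}")
```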
So, suppose the random variable X is normally distributed as N(μ, σ²).
If X̄ is the random variable of sample means of size n, then X̄ ~ N(μ, (σ/√n)²).
∴ 95% of all x̄ lie in the interval μ − 1.96 σ/√n < x̄ < μ + 1.96 σ/√n.
In the diagram we have shown a few x̄ values in this interval as well as one that
is not in this interval.
[Figure: the sampling distribution of x̄ with the central 95% region marked, and below it line segments centred on the sample means x̄1, x̄2, x̄3 and x̄4, with μ marked. Notice that the interval calculated for the sample mean lying outside the 95% region does not contain μ.]
Note that each of the x̄ is in the middle of a line segment. All of these segments have the
same length as the line segment from μ − 1.96 σ/√n to μ + 1.96 σ/√n.
Since Pr(−1.96 < Z < 1.96) = 0.95 we know Pr(−1.96 < (X̄ − μ)/(σ/√n) < 1.96) = 0.95.
So for the outcome x̄ within the confidence interval,
(x̄ − μ)/(σ/√n) < 1.96 and (x̄ − μ)/(σ/√n) > −1.96
∴ x̄ − μ < 1.96 σ/√n and x̄ − μ > −1.96 σ/√n
∴ μ > x̄ − 1.96 σ/√n and μ < x̄ + 1.96 σ/√n
This says that if we were to take many samples of size n and calculate the sample mean x̄
for each of these samples, then for about 95% of these sample means, the population mean
μ would lie in the interval
x̄ − 1.96 σ/√n < μ < x̄ + 1.96 σ/√n.
So, the 95% confidence interval for μ is from x̄ − 1.96 σ/√n to x̄ + 1.96 σ/√n.
[Figure: number line showing the lower limit x̄ − 1.96 σ/√n, the centre x̄ and the upper limit x̄ + 1.96 σ/√n, each limit a distance 1.96 σ/√n from x̄.]
Confidence intervals for different confidence levels can be constructed for the population mean μ
in a similar way. Remember that we cannot be absolutely sure that μ will lie within the
confidence interval, but for about 95% of samples an interval constructed in this way will contain μ.
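The "95% of the time" statement can be explored by simulation. The sketch below is not from the text: the population values μ = 10 and σ = 2, the sample size 20, the number of trials and the fixed seed are all assumptions chosen for illustration, and the function name `coverage` is hypothetical.

```python
import random
from math import sqrt

def coverage(mu, sigma, n, trials, seed=0):
    """Fraction of simulated 95% confidence intervals that contain mu."""
    rng = random.Random(seed)            # fixed seed so the run is repeatable
    margin = 1.96 * sigma / sqrt(n)      # half-width of each interval
    hits = 0
    for _ in range(trials):
        # Draw one sample of size n from N(mu, sigma^2) and find its mean.
        sample_mean = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if sample_mean - margin < mu < sample_mean + margin:
            hits += 1
    return hits / trials

print(coverage(mu=10, sigma=2, n=20, trials=5000))
```

The printed proportion should be close to 0.95, though it varies a little from one seed to another.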
INVESTIGATION 7
CONFIDENCE LEVELS AND INTERVALS
To obtain a greater understanding of confidence levels and intervals, click
on the icon to visit a random sampler demonstration. This will
calculate confidence intervals at various levels of your
choice (90%, 95%, 98% or 99%) and count the intervals
which include the population mean.
Note: Consider samples of different size but all with mean 10 and standard deviation 2.
The 95% confidence interval is 10 − (1.960 × 2)/√n < μ < 10 + (1.960 × 2)/√n.
For various values of n we have:
n      Confidence interval
20     9.123 < μ < 10.877
50     9.446 < μ < 10.554
100    9.608 < μ < 10.392
200    9.723 < μ < 10.277
[Figure: the four intervals for n = 20, 50, 100 and 200 drawn on a number line from 9 to 11, each centred at 10.]
We see that increasing the sample size produces confidence intervals of shorter width.
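The intervals in the note can be regenerated with a short loop. This is an illustrative sketch; the function name `interval_95` is an assumption, and the multiplier 1.960 is taken from the note.

```python
from math import sqrt

def interval_95(sample_mean, sigma, n):
    """95% confidence interval for the population mean, sigma known."""
    margin = 1.960 * sigma / sqrt(n)
    return sample_mean - margin, sample_mean + margin

for n in (20, 50, 100, 200):
    low, high = interval_95(10, 2, n)
    print(f"n = {n:3d}: {low:.3f} < mu < {high:.3f}  (width {high - low:.3f})")
```

Each increase in n gives a narrower interval because the margin is proportional to 1/√n.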
Example 27
A sample of 60 yabbies was taken from a dam. The sample mean weight of the
yabbies was 84.6 grams. Find the 95% confidence interval for the population mean if
the population standard deviation is 16.8 grams.
We are given that x̄ = 84.6 and σ = 16.8.
The 95% confidence interval is:
x̄ − 1.96 σ/√n < μ < x̄ + 1.96 σ/√n
i.e., 84.6 − (1.96 × 16.8)/√60 < μ < 84.6 + (1.96 × 16.8)/√60
∴ 80.3 < μ < 88.9
So, we are 95% confident that the population mean weight of yabbies lies
between 80.3 grams and 88.9 grams.
Example 28
The fat content (in grams) of 30 randomly selected pasties at the local bakery was
determined and recorded as:
15.1 14.8 13.7 15.6 15.1 16.1 16.6 17.4 16.1 13.9
17.5 15.7 16.2 16.6 15.1 12.9 17.4 16.5 13.2 14.0
17.2 17.3 16.1 16.5 16.7 16.8 17.2 17.6 17.3 14.7
Determine a 95% confidence interval for the mean fat content of all pasties made if
the population standard deviation is 1.35 grams.
From a calculator x̄ = 15.90 and we are given σ = 1.35.
The 95% confidence interval for μ is
x̄ − 1.96 σ/√n < μ < x̄ + 1.96 σ/√n
∴ 15.90 − 1.96 × 1.35/√30 < μ < 15.90 + 1.96 × 1.35/√30
∴ 15.4 < μ < 16.4
So, we are 95% confident that the mean fat content of all pasties produced lies
between 15.4 g and 16.4 g.
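Both the calculator step and the interval can be checked from the raw data. This is a minimal sketch; the list name `fat_content` is an assumption, and the order of the values within the list does not affect the result.

```python
from math import sqrt
from statistics import mean

# Fat content (grams) of the 30 pasties in the example.
fat_content = [
    15.1, 14.8, 13.7, 15.6, 15.1, 16.1, 16.6, 17.4, 16.1, 13.9,
    17.5, 15.7, 16.2, 16.6, 15.1, 12.9, 17.4, 16.5, 13.2, 14.0,
    17.2, 17.3, 16.1, 16.5, 16.7, 16.8, 17.2, 17.6, 17.3, 14.7,
]
sigma = 1.35                       # population standard deviation, given
n = len(fat_content)
x_bar = mean(fat_content)
margin = 1.96 * sigma / sqrt(n)    # half-width of the 95% interval
print(f"x_bar = {x_bar:.2f}")
print(f"{x_bar - margin:.1f} < mu < {x_bar + margin:.1f}")
```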
EXERCISE 7I.1
1 A random sample of n individuals is selected from a population with known standard
deviation 11. The sample mean is 81.6.
a Find a 95% confidence interval for μ if:
i n = 36 ii n = 100.
b In changing n from 36 to 100, how does the width of the confidence interval change?
2 Neville works for a software company. He keeps records of the times customers have to
wait to receive telephone support for their software. During a six month period he logs
167 calls, and the mean waiting time is 8.7 minutes. Find a 95% confidence interval
for estimating the mean waiting time for all telephone customer calls for support if the
population standard deviation is 2.08 minutes.
3 A breakfast cereal manufacturer uses a machine to
deliver the cereal into plastic packets which then go
into cardboard boxes. The quality controller randomly samples 75 packets and obtains a sample mean
of 513.8 grams. Construct a 95% confidence interval
in which the true population mean should lie if the
population standard deviation is 14.9 grams.
4 A sample of 42 patients from a drug rehabilitation program showed a mean length of
stay on the program of 38.2 days. Estimate with a 95% confidence interval the average
length of stay for all patients on the program if the population standard deviation is 4.7
days.
5 To work out the credit limit of a prospective credit card holder, a company gives points
based on factors such as employment, income, home and car ownership, and general
credit history. A statistician working for the company randomly samples 40 applicants
and determines the point total for each. These are:
84 82 63 53 76 71 66 60 67 61
63 76 80 78 54 75 71 72 67 80
64 74 60 70 59 72 70 56 63 61
81 58 82 68 77 68 74 68 69 72
a Determine the sample mean x̄ and standard deviation s.
b Using s to estimate σ, determine a 95% confidence interval that the company would
use to estimate the mean point score for the population of applicants.
It is possible to obtain confidence intervals at any level of confidence
from graphics calculators. Click on the icon to see how to do this on
your calculator.
Example 29
A 95% confidence interval for a mean μ of a population was recorded as
8.5617 ≤ μ ≤ 9.4383. This estimate was based on a sample of size n = 60.
Use this information to calculate
a x̄, the sample mean
b σ, the population standard deviation which was used to calculate the
confidence interval.
a The 95% confidence interval is x̄ − 1.96 σ/√n < μ < x̄ + 1.96 σ/√n
So, x̄ − 1.96 σ/√n = 8.5617 and x̄ + 1.96 σ/√n = 9.4383
Adding these equations gives 2x̄ = 8.5617 + 9.4383 = 18 and so x̄ = 9.
b Substituting n = 60 and x̄ = 9 into
x̄ − 1.96 σ/√n = 8.5617 gives 9 − 1.96 σ/√60 ≈ 8.5617
∴ 1.96 σ/√60 ≈ 0.4383
∴ σ ≈ 0.4383 × √60/1.96 ≈ 1.732
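The two steps of this example amount to taking the midpoint and half-width of the interval. A hedged Python sketch follows; the function name `recover_mean_and_sigma` is an assumption introduced for illustration.

```python
from math import sqrt

def recover_mean_and_sigma(lower, upper, n, z=1.96):
    """Work backwards from a confidence interval to the sample mean and sigma."""
    x_bar = (lower + upper) / 2      # the interval is centred on the sample mean
    margin = (upper - lower) / 2     # equals z * sigma / sqrt(n)
    sigma = margin * sqrt(n) / z
    return x_bar, sigma

x_bar, sigma = recover_mean_and_sigma(8.5617, 9.4383, n=60)
print(f"x_bar = {x_bar:.4f}, sigma = {sigma:.3f}")
```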
6 A 95% confidence interval for the mean μ of a population is based on a sample of
n = 50, and given by 3.5842 ≤ μ ≤ 4.4158. Find:
a x̄, the sample mean
b σ, the population standard deviation which was used to calculate the confidence
interval.
7 A 95% confidence interval for the mean μ of a population is given by
19.685 ≤ μ ≤ 22.315. If the population standard deviation is σ = 6, what was the
sample size?
DETERMINING HOW LARGE A SAMPLE SHOULD BE
When designing an experiment in which we wish to estimate the population mean, the size
of the sample is an important consideration. Finding the sample size is a problem that can be
solved using the confidence interval.
Let us revisit Example 28 on the fat content of pasties. The question arises:
‘How large should a sample be if we wish to be 95% confident that the sample mean will
differ from the population mean by less than 0.3 grams?’
i.e., −0.3 < μ − x̄ < 0.3
Now the 95% confidence interval for μ is:
x̄ − 1.96 σ/√n < μ < x̄ + 1.96 σ/√n
Hence −1.96 σ/√n < μ − x̄ < 1.96 σ/√n
and we need to find n when 1.96 σ/√n = 0.3.
So, √n = 1.96σ/0.3 = (1.96 × 1.35)/0.3 ≈ 8.82 and so n ≈ 78.
Thus, a sample of 78 pasties should be taken.
Example 30
Revisit the yabbies from the dam problem of Example 27. Suppose we wish to find
the sample size needed to be 95% confident that the sample mean differs from the
population mean by less than 5 grams. What sample size should be taken?
Now −1.96 σ/√n < μ − x̄ < 1.96 σ/√n
so we need to find n such that 1.96 σ/√n = 5 i.e., (1.96 × 16.8)/√n = 5
∴ n = ((1.96 × 16.8)/5)² ≈ 43.37
A sample of 44 yabbies should be taken.
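Both sample size calculations follow the same pattern: solve for n, then round up to the next whole number. The sketch below assumes a hypothetical helper named `sample_size`.

```python
from math import ceil

def sample_size(sigma, max_error, z=1.96):
    """Smallest whole n for which z * sigma / sqrt(n) <= max_error."""
    return ceil((z * sigma / max_error) ** 2)

print(sample_size(sigma=1.35, max_error=0.3))   # pasties
print(sample_size(sigma=16.8, max_error=5))     # yabbies
```

Rounding up rather than to the nearest integer matters: rounding 43.37 down to 43 would leave the margin slightly larger than 5 grams.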
EXERCISE 7I.2
1 A researcher wishes to estimate the mean weight of adult crayfish in South Australian
waters. She knows that the population standard deviation σ is 250.5 grams. How large
must a sample be so that she is 95% confident that the sample mean differs from the
population mean by less than 70 grams?
2 A porridge manufacturer samples 80 packets of porridge and finds that the sample standard
deviation s of the contents’ weight is 17.8 grams. If s is used to estimate the
population standard deviation σ, how many packets must be sampled to be 95% confident
that the sample mean differs from the population mean by less than 3 grams?
3 Patients from an alcohol rehabilitation program participate for various lengths of time
with a standard deviation of 4.7 days. How many patients would have to be sampled to
be 95% confident that the sample mean number of days on the program differs from the
population mean by less than 1.8 days?
Consider the typical 95% confidence interval shown in the diagram.
[Figure: an interval of width w on a number line, from x̄ − 1.96 σ/√n to x̄ + 1.96 σ/√n, centred at x̄.]
The width of this interval is w = 2 × 1.96σ/√n.
In taking a sufficiently large sample size n we can make w as small as we like.
As w = 2 × 1.96σ/√n, √n = (2 × 1.96σ)/w and so n = ((2 × 1.96σ)/w)².
When we wish to estimate the population mean from a sample of size n at a 95%
confidence level, the sample size is given by
n = ((2 × 1.96σ)/w)²
where σ is the population standard deviation
and w is the confidence interval width.
In Example 30, w = 2 × 5 and σ = 16.8. Thus, n = ((2 × 1.96 × 16.8)/10)² ≈ 43.37, etc.
Since n is an integer, n = 44 would give a 95% confidence interval of width about 10 grams.
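To see why 44 rather than 43 is chosen, the width can be evaluated on either side of 43.37. This is a small illustrative sketch; `interval_width` is a hypothetical helper name.

```python
from math import sqrt

def interval_width(sigma, n, z=1.96):
    """Width w = 2 * z * sigma / sqrt(n) of the confidence interval."""
    return 2 * z * sigma / sqrt(n)

for n in (43, 44):
    print(n, round(interval_width(16.8, n), 2))
```

With 43 yabbies the width is slightly more than 10 grams, and with 44 it is slightly less.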
4 A population is known to have standard deviation σ = 34. Find the sample size n that
should be taken to find a 95% confidence interval for the population mean μ of width:
a w = 5
b w = 1
c w = 0.1
5 A manufacturer of bottled water knows that the machine dispenses water into 1 litre
bottles with a standard deviation of 2.3 mL. The machine needs to be checked regularly
to ensure it is still delivering the correct volume. How many bottles should a quality
controller be checking to find a 95% confidence interval of width:
a 2 mL
b 1 mL
c 0.5 mL?
6 a If the size n of a sample is doubled, by how much will the width of a 95% confidence
interval decrease?
b How much larger do you have to make a sample size to halve the width of a 95%
confidence interval?
USING A CONFIDENCE INTERVAL FOR A CLAIM ABOUT μ
Confidence intervals provide an estimate for the size of the population mean μ. They can
also be used to assess claims about population means. For example:
Suppose the volume V of fruit juice dispensed by a machine is normally distributed with
mean μ litres which can be adjusted, and standard deviation σ = 0.0015 litre (1½ mL, about
¼ of a teaspoon) which is fixed.
Suppose a manufacturer needs to fill cartons with 1 litre of fruit juice. To ensure that almost
all cartons contain at least 1 litre, the value of the mean μ is set at 1.005 litre.
A quality controller takes a sample of n cartons and, with very accurate measurements, finds
that the sample mean v̄ = 1.00499 litres. We want to test the hypotheses H0: μ = 1.005,
Ha: μ ≠ 1.005 for various large values of n.
Note that for sufficiently large n the null hypothesis will be rejected at the 5% level.
For such values of n the difference is statistically significant at the 5% level even though the
difference of 0.01 mL (hardly a drop) is not significant as the word is commonly understood.
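The effect of n on the test can be seen by computing the P-value for a few sample sizes. In this sketch the three values of n are illustrative assumptions, not values from the text; the mean, sample mean and standard deviation are those given above.

```python
from math import sqrt
from statistics import NormalDist

mu0, v_bar, sigma = 1.005, 1.00499, 0.0015   # litres, from the text

def p_value(n):
    """Two-sided P-value for H0: mu = mu0 with a sample of size n."""
    z = (v_bar - mu0) / (sigma / sqrt(n))
    return 2 * NormalDist().cdf(-abs(z))

for n in (10_000, 100_000, 1_000_000):       # illustrative sample sizes
    print(f"n = {n:>9,}: P-value = {p_value(n):.4f}")
```

The same tiny difference of 0.01 mL moves from not significant to significant at the 5% level purely because n grows.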
Example 31
Suppose the volume V of cool drinks dispensed into cartons by a machine is
normally distributed with mean μ which can be adjusted, and standard deviation
10 mL which is fixed. The value of μ is supposed to be 1005 mL, but the machine
operator notices that actually μ = 995 mL. The operator therefore adjusts the volume
dispensed by the machine. A quality controller tests 25 cartons and finds that their
mean volume is 1007 mL.
a Construct a 95% confidence interval for the mean volume μ dispensed by the machine.
b Use the 95% confidence interval to assess the claim that the volume dispensed by
the machine has increased.
c Can we conclude that μ is now larger than 1005 mL?
a The confidence interval is 1003 ≤ μ ≤ 1011.
b Since 995 is less than all the values in the 95% confidence interval we can be
confident that the population mean has increased.
c Although the sample statistic 1007 mL is larger than 1005 mL, the smallest number
in the 95% confidence interval for μ is 1003 mL. This means that μ could be as
small as 1003 mL, and there is not enough evidence to support the claim that
μ > 1005 mL.
Note: This question is closely related to testing the hypotheses H0: μ = 1005,
Ha: μ ≠ 1005.
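The reasoning in parts a to c can be sketched as a comparison of each claimed value with the lower limit of the interval. The function name `interval_95` is an assumption used for illustration.

```python
from math import sqrt

def interval_95(sample_mean, sigma, n):
    """95% confidence interval for the population mean, sigma known."""
    margin = 1.96 * sigma / sqrt(n)
    return sample_mean - margin, sample_mean + margin

low, high = interval_95(sample_mean=1007, sigma=10, n=25)
print(f"{low:.0f} <= mu <= {high:.0f}")
# A claim "mu has risen above c" is supported only if c lies below the whole interval.
print("increase from 995 mL supported:", 995 < low)
print("mu > 1005 mL supported:", 1005 < low)
```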
EXERCISE 7I.3
1 Suppose the time it takes Joan to run 100 metres is normally distributed with mean
μ = 12.46 seconds and standard deviation 1 second. To improve her time Joan goes on
a training program. After the training program, Joan finds that the mean time from 12
trial runs is now 11.62 seconds.
a Construct a 95% confidence interval for Joan’s mean assuming the standard deviation
has not changed.
b Use the result of part a to assess the claims:
i Joan’s time to run 100 metres has improved.
ii Joan is now better than Betty whose time for the 100 metres is 11.97 seconds.
2 A complaint was made to a call centre that it took a mean time of 12 minutes before a
caller was put through to an operator. After changes were made, the call centre claimed
that the service had improved. To check this claim, a consumer group made 40 calls to
the centre. They found the mean waiting time was 8 minutes with a standard deviation
of 3 minutes. Assuming that 40 is large enough for the Central Limit Theorem to apply,
construct a 95% confidence interval for the mean waiting time μ. Does the confidence
interval support the call centre’s claim? (Use s to estimate σ.)
3 The distance D a golfer can hit a ball is randomly distributed with a mean μ = 115
metres and standard deviation σ = 32 metres.
a After spending time with a professional the golfer measured the distance of 30
drives. The results of the drives in metres were as follows:
133 153 110 93 142 135 62 150 127 112
119 171 143 92 162 128 149 73 39 84
138 152 163 174 152 141 129 87 118 149
Assuming that the sample of 30 is large enough for the Central Limit Theorem to
apply, calculate a 95% confidence interval for the mean distance μ the golfer can
now hit the ball. Does the confidence interval provide enough evidence to support
the claim the golfer has improved?
b The golfer decided to have another trial of 50 drives. Suppose the mean of the 50
trials is the same as in part a.
i Explain briefly why increasing the number of trials could make a difference
to a drive length.
ii Does the new information provide evidence that the golfer has improved?
OTHER APPLICATIONS OF CONFIDENCE INTERVALS
Example 32
A buyer for a restaurant chain goes to a seafood wholesaler to inspect a large catch
of 50 000 prawns. She has instructions to buy the catch only if the prawns are heavy
enough. The buyer selects a sample of 60 prawns and finds that their mean weight is
57.2 grams. It is known that the population standard deviation σ is 4.2 grams.
a Find the 95% confidence interval for the population mean.
b The buyer claims she is 95% confident that no more than 10% of the prawns
weigh less than 50 grams. Use the confidence interval found in part a to justify
this claim. You may assume that the weights of prawns are normally distributed.
a Using technology, the 95% confidence interval for the population mean μ is
56.1 ≤ μ ≤ 58.3.
b The smallest value in the 95% confidence interval is 56.1, and so the buyer can be 95%
confident that the population mean μ ≥ 56.1.
[Figure: number line marking 50.0 and the confidence interval (CI) from 56.1 to 58.3, centred at 57.2.]
If W is the weight of prawns, then W ~ N(μ, σ²).
If we use μ = 56.1 and σ = 4.2, then using technology Pr(W < 50) = 0.0732.
Hence 7.32%, or less than 10% of the prawns weigh less than 50 grams.
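The two "using technology" steps can be reproduced as follows. This is a sketch under the same assumptions as the example: normally distributed weights, and the rounded lower limit 56.1 used as the least favourable value of μ. The variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

x_bar, sigma, n = 57.2, 4.2, 60
margin = 1.96 * sigma / sqrt(n)
low, high = x_bar - margin, x_bar + margin
print(f"{low:.1f} <= mu <= {high:.1f}")

# Least favourable case for the buyer: mu at the rounded lower limit used in the text.
p_light = NormalDist(mu=56.1, sigma=sigma).cdf(50)
print(f"Pr(W < 50) = {p_light:.4f}")
```

Any larger value of μ inside the interval gives an even smaller proportion of prawns under 50 grams, which is why the lower limit is the value to test.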
EXERCISE 7I.4
1 The manager of a golf club claimed that the income of most of its members was in excess
of $75 000 and thus its members could afford to pay increased annual subscriptions. To
show that this claim was not valid, the members sought the help of a statistician.
The statistician examined a random sample of 113 club members and found that the mean
income was $96 318. It is known that the standard deviation of the members’ incomes
is $14 268.
a Find the 95% confidence interval for the population mean income of all members.
b The statistician claimed that he was 95% certain that no more than 10% of the
members had an income of less than $75 000.
Assuming that the income of members is normally distributed, how could you justify
the statistician’s claim?
2 Fabtread manufacture motorcycle tyres. Under normal test conditions the stopping time
for motor cycles travelling at 60 km/h is 3.45 seconds with standard deviation 0.17
seconds. Their production team has just designed and manufactured a new tyre tread.
They take 41 stopping time measurements with the new tyres and find the mean time is
3.03 seconds.
a Calculate a 95% confidence interval for the mean stopping time of the new tyres.
b The team claims that they are 95% certain that less than 15% of the stopping times
of their new tyres will exceed the 3.45 seconds of the old tyres.
Assuming that the stopping time is normally distributed, how could you justify the
team’s claim?
EXTENSION TO CONFIDENCE INTERVALS OTHER THAN 95%
There are often good reasons to find confidence intervals other than those of 95%. In areas
like medicine, a researcher may want to have more certainty when making decisions and often
may prefer a confidence interval of 99%. In other areas where the outcomes of decisions are
not so important, people may be satisfied with 90% confidence intervals.
Your calculator can produce confidence intervals at any level.
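For levels other than 95% only the multiplier changes. The sketch below computes it for the three levels mentioned above; the function name `z_multiplier` is an assumption introduced for illustration.

```python
from statistics import NormalDist

def z_multiplier(confidence):
    """Multiplier a such that Pr(-a < Z < a) = confidence."""
    # The two tails share the remaining probability equally.
    return NormalDist().inv_cdf((1 + confidence) / 2)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}: a = {z_multiplier(level):.3f}")
```

A higher confidence level gives a larger multiplier and therefore a wider interval for the same data.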
EXERCISE 7I.5
1 The mean μ of a population is unknown, but its standard deviation is 10. In order to
estimate μ a random sample of size n = 35 was selected. The mean of the sample was
found to be 28.9.
a Find a 95% confidence interval for μ.
b Find a 99% confidence interval for μ.
c In changing the confidence level from 95% to 99%, how does the width of the
confidence interval change?
2 If the P% confidence interval for μ is x̄ − a(σ/√n) < μ < x̄ + a(σ/√n) then
for P = 95, a = 1.960. Find a if P is: a 99 b 80 c 85 d 96.
3 The choice of the confidence level to be used is made by an experimenter. Why is it
that experimenters do not always choose confidence intervals of at least 99%?
J
REVIEW
REVIEW SET 7A
1 The arm lengths of 18 year old females are normally distributed with mean 64 cm
and standard deviation 4 cm.
a Find the percentage of 18 year old females whose arm lengths are:
i between 60 cm and 72 cm
ii greater than 60 cm.
b Find the probability that if an 18 year old female is chosen at random, she will
have an arm length in the range 56 cm to 68 cm.
2 a If Z has a standard normal distribution, find k if Pr(Z ≤ k) = 0.95.
b If X ~ N(23, 2.6²) find k if Pr(X < k) = 0.6.
3 In a mathematics test out of 40 marks, the mean mark was 28.3 and the standard
deviation was 4.1. The marks were all integers and the minimum pass mark was set
at 24. Assuming marks were approximately normal, what proportion of the students:
a passed the test
b scored more than 20
c scored between 25 and 35?
4 The weights of apples from an orchard are known to be normally distributed with
mean μ = 350 grams and standard deviation σ = 25 grams. The apples are packed
in boxes of 50 each.
a How many apples in a box would you expect to weigh more than 375 grams,
and how many less than 325 grams?
b In 500 boxes, how many apples would you expect to have a weight between 325
and 375 grams?
5 To test the hypotheses H0: μ = 36 and Ha: μ ≠ 36 a random sample of n = 20
was selected. The outcomes are listed below:
38 22 43 21 36 44 20 49 36 30
42 43 38 28 33 22 29 25 28 34
Use this information to test the null hypothesis at the 5% level if the population
standard deviation is 10 grams.
6 The standard deviation in the weight of cereal boxes is 23.6 grams. How many boxes
must be sampled from the population to be 95% confident that the sample mean differs
from the population mean by less than 4 grams?
7 A factory canning apricots uses a machine to deliver the fruit and syrup into cans. The
quality controller randomly samples 65 cans and finds that the mean mass of contents
is 828.2 grams.
a Construct a 95% confidence interval in which the true population mean should
lie if the population standard deviation is 16.3 grams.
b What should the sample size be to construct a confidence interval of half the
width of that in a?
8 a Kerry’s marks for an English essay and a Chemistry test were 26 out of 40 and
82% respectively.
i Explain briefly why the information given is not sufficient to determine
whether Kerry’s results are better in English than in Chemistry.
ii Suppose that the marks of all students in both the English essay and the
Chemistry test were normally distributed as N(22, 4²) and N(75, 7²) respectively.
Use this information to determine which of Kerry’s two marks is better.
iii If there were 50 students sitting for the English essay, how many would
have scored more than Kerry?
b Les is to sit for five subjects in the final examination. Because of many different
factors that determine examination marks, the marks Les can expect in each exam
are normally distributed. Suppose that the mean μ and standard deviation σ = 2
are the same for each exam.
If μ = 12 calculate the probability that Les will gain a total mark for the five
subjects of between 60 and 70.
c The value of the mean μ depends on the time t hours that Les studies. It is given
by μ = 16 − 8/(t + 2).
i For how long must Les study to achieve a value of μ = 15?
ii Les’s total score for the five examinations was 65. Use this information to
test the hypotheses H0: μ = 15 and Ha: μ ≠ 15.
iii Use the total score of 65 to construct a 95% confidence interval for the mean
μ. Use this interval to estimate a range of times Les might have studied for
the examination.
REVIEW SET 7B
1 Find the mean and standard deviation of these two samples of lengths given in cm:
A 170.1 169.4 169.5 170.4 169.8 170.5 170.0 170.0 170.3 170.8
  170.0 169.9 170.2 170.0 169.9 169.9 170.5 170.1 169.7 170.0
B 177 166 153 167 176 173 169 161 172 174
  170 162 178 174 179 171 148 184 178 175
Which of the above is a sample of heights of 15 year old boys, and which is a sample
of lengths of planks cut by a machine?
2 The contents of a certain brand of soft drink can is normally distributed with mean
377 mL and standard deviation 4.2 mL.
a Find the percentage of cans with contents:
i less than 368.6 mL
ii between 372.8 mL and 389.6 mL
b Find the probability of randomly selecting a can with contents between 364.4
mL and 381.2 mL.
3 The life of a Xenon battery is known to be normally distributed with a mean of
33.2 weeks and a standard deviation of 2.8 weeks.
a Find the probability that a randomly selected battery will last at least 35 weeks.
b For how many weeks can the manufacturer expect the batteries to last before 8%
of them fail?
4 The length of steel rods produced by a machine is normally distributed with a standard
deviation of 3 mm. It is found that 2% of all rods are less than 25 mm long. Find
the mean length of rods produced by the machine.
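Question 4 works backwards from a tail proportion to the mean. A minimal sketch of that idea, assuming the relation x = μ + zσ where z is the standard normal quantile for the given proportion, is shown below. The function name mean_from_lower_tail is illustrative only.

```python
from statistics import NormalDist


def mean_from_lower_tail(x, proportion_below, sigma):
    """Solve x = mu + z * sigma for mu, where Pr(Z < z) = proportion_below."""
    # A proportion below 0.5 gives a negative quantile.
    z = NormalDist().inv_cdf(proportion_below)
    return x - z * sigma


mu = mean_from_lower_tail(x=25, proportion_below=0.02, sigma=3)
print(f"mean length = {mu:.2f} mm")

# Check: with this mean, about 2% of rods should be shorter than 25 mm.
print(f"Pr(X < 25) = {NormalDist(mu, 3).cdf(25):.4f}")
```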
5 a If Z has a standard normal distribution, find a if Pr(Z ≤ a) = 0.9.
b If X ∼ N(15.6, 2²), find a if Pr(X < a) = 0.9.
6 A manufacturer claims that his canned soup contains 135 mg of salt. To check this
claim a consumer tested 87 cans for salt content and found that the mean was 139.6
mg. It is known that the population standard deviation is 22.8 mg. At a 5% level is
there sufficient evidence to reject the manufacturer’s claim?
7 To test the null hypothesis H0: μ = 2000 against Ha: μ ≠ 2000, a random sample
of n = 75 was selected and found to have mean x̄ = 1840.
a If the population standard deviation σ = 690, is there sufficient evidence to reject
the null hypothesis at the 5% level?
b For what values of the sample mean x̄ would you not reject the null hypothesis at
the 5% level?
8 A telephone call centre handles many calls each day. Let T be the time in minutes
taken to answer a call.
In 2006 the mean answering time for a call was μ = 4.3 minutes with standard
deviation σ = 1.2 minutes.
Let T̄ be the mean time taken to answer a random sample of 100 calls.
a The two histograms below show the distribution of a sample of size 50 taken
from T. Note that the horizontal scale and the bin width are the same in both
histograms, but the vertical scales are different.
[Figure: Histogram A and Histogram B, each showing frequency against time (min) from 0 to 8. The frequency axis runs from 0 to 40 in Histogram A and from 0 to 6 in Histogram B.]
Identify the histogram that represents a sample from T .
Explain your answer.
b i Assuming that n = 100 is sufficiently large, explain why the distribution
of T̄ is approximately normal with mean 4.3 minutes and standard deviation
0.12 minutes.
ii Calculate the probability Pr(T̄ ≤ 4.35).
iii Hence calculate the probability that an operator in the call centre can be
occupied in answering 100 calls for less than seven and a quarter hours.
c As well as answering routine calls, the supervisor of the call centre also handles
unusual cases that are too complicated for other staff to handle. When the
supervisor was timed, her mean time to answer 100 calls was T̄ = 4.6 minutes.
i Use the statistic T̄ = 4.6 minutes to test the hypotheses H0: μ = 4.3 and
Ha: μ ≠ 4.3, at the 5% level.
ii The supervisor is asked to explain why she is taking too long to answer
questions. What reasons can the supervisor provide to claim that the Central
Limit Theorem does not apply to her?
REVIEW SET 7C
1 Sketch the graph of X ∼ N(3, 2²).
On the horizontal axis mark in the z-scores as well as their corresponding x values.
Calculate these probabilities: a Pr(−1 ≤ X ≤ 1) b Pr(−1 ≤ Z ≤ 1).
2 Staplers are manufactured for $5.00 each and are sold for $20.00 each. The staplers
are guaranteed to last three years. The mean life is actually 3.42 years and the
standard deviation 0.4 years. If the life of these staplers is normally distributed, how
much profit would we expect from selling a batch of 2000 (with a maximum of one
replacement)?
3 The edible part of a batch of Coffin Bay oysters is normally distributed with mean
38.6 grams and standard deviation 6.3 grams. Given that the random variable X is
the mass of a Coffin Bay oyster, find:
a a if Pr(38.6 − a ≤ X ≤ 38.6 + a) = 0.6826
b b if Pr(X > b) = 0.8413.
4 King prawns are favourite items on the menu of Stirling Caterers. From past experience the manager knows that people on average eat 325 g of prawns with standard
deviation 86 g. The manager is to cater for a wedding of 80 guests and decides to
purchase 27.5 kg of prawns. What is the probability that the caterer will run out of
prawns?
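Question 4 concerns the total eaten by 80 guests rather than a single guest. A rough sketch of the calculation follows, assuming the guests eat independently and that the total is approximately normal with mean nμ and standard deviation σ√n. The function name prob_total_exceeds is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def prob_total_exceeds(n, mu, sigma, supply):
    """Pr(sum of n independent amounts > supply), using a normal approximation."""
    # The sum has mean n * mu and standard deviation sigma * sqrt(n).
    total = NormalDist(n * mu, sigma * sqrt(n))
    return 1 - total.cdf(supply)


# 80 guests, 325 g mean, 86 g standard deviation, 27.5 kg = 27 500 g purchased.
p_run_out = prob_total_exceeds(n=80, mu=325, sigma=86, supply=27_500)
print(f"Pr(run out) = {p_run_out:.4f}")
```

Working in grams throughout avoids mixing units between the per-guest figures and the amount purchased.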
5 For export purposes peaches must be neither too small nor too large. A grower claims
that the peaches in his orchard have a mean weight of 300 grams, just right for export.
A buyer knows that the population standard deviation is 30 grams, and he wants to
test the grower’s claim.
a What hypotheses should the buyer consider?
b Suppose the buyer selects a random sample of 100 peaches and finds that their
mean weight x̄ = 310 grams.
i What is the null distribution the buyer should use?
ii Calculate the test statistic z for this sample.
iii Does this sample support the grower’s claim at the 5% level?
6 The average width of snail shells of a local species needs to be estimated. It is
known that the standard deviation is 1.4 mm. Pauline takes a random sample of
200 snails and measures the width of each shell to the nearest mm. The results are
shown in the table below.
Width (mm)   22   23   24   25   26   27   28   29
Frequency     1    3   17   43   68   41   24    3
a Find the sample mean.
b Determine a 95% confidence interval for the population mean μ.
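For the frequency table in question 6, the sample mean is a weighted average of the widths, and the interval is x̄ ± zσ/√n. The sketch below assumes a known σ and a normal-based interval; the names grouped_mean and z_interval are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Widths (mm) and frequencies from the table in question 6.
widths = [22, 23, 24, 25, 26, 27, 28, 29]
freqs = [1, 3, 17, 43, 68, 41, 24, 3]


def grouped_mean(values, frequencies):
    """Return (mean, n) for data given as a frequency table."""
    n = sum(frequencies)
    total = sum(v * f for v, f in zip(values, frequencies))
    return total / n, n


def z_interval(x_bar, sigma, n, confidence=0.95):
    """Confidence interval for the mean when sigma is known."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * sigma / sqrt(n)
    return x_bar - half_width, x_bar + half_width


x_bar, n = grouped_mean(widths, freqs)
low, high = z_interval(x_bar, sigma=1.4, n=n)
print(f"n = {n}, mean = {x_bar:.2f} mm, 95% CI = ({low:.3f}, {high:.3f})")
```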
7 Suppose the weight X of apricots is normally distributed with μ = 90 grams and
σ = 10 grams.
a Calculate the proportion of apricots with weight less than 88 grams.
b In a box of 100 apricots, how many would you expect to weigh less than 88 g?
c The apricots are packaged into boxes of 100 each. What proportion of the boxes
will have apricots with a mean weight less than 88 g?
d On each of the boxes of 100 apricots is printed that the nett weight is 8.8 kilograms.
In a shipment of 500 boxes, for how many is the weight less than 8.8
kilograms?
8 The time T it takes Laura to travel to work is normally distributed with mean μ
minutes and standard deviation 10 minutes. Laura’s work starts at 9 o’clock in the
morning.
a Suppose μ = 40 minutes and Laura leaves for work at a quarter past eight in
the morning.
i What is the probability she will be late?
ii If there are 250 working days in a year, how often would Laura be expected
to be late to work in a year?
b Laura does not know the value of μ and decides to keep a 10 day record of the
time it takes her to go to work. Let T̄₁₀ be the distribution of the mean time
over 10 days it takes Laura to go to work.
i Briefly describe the distribution T̄₁₀ in terms of the distribution T it takes
Laura to go to work.
ii Suppose Laura found that for her sample of 10 days the mean time to travel
to work was T̄₁₀ = 35 minutes. Use this information to test the hypotheses
H0: μ = 40 and Ha: μ ≠ 40, at the 5% level.
iii Calculate the 95% confidence interval for μ.
iv How large a sample should Laura take to obtain a 95% confidence interval
of width 2.48 minutes?
c After keeping records for a year consisting of 250 working days, Laura found
that the mean travelling time to work was 31.52 minutes. She wants to be 95%
certain that she will be at work before 9 o’clock at least 90% of the time in the
following year. To the nearest minute, what is the latest time you would advise
Laura to leave home? Give reasons for your answer.