STAB57H3: Introduction to Statistics
Winter, 2015
Instructor: Jabed Tomal
Department of Computer and Mathematical Sciences
University of Toronto Scarborough
Toronto, ON
Canada
March 18, 2015
Jabed Tomal (U of T)
Statistics
March 18, 2015
1 / 31
Relationships Among Variables:
In science, biological science, social science, and business,
researchers are interested in knowing the relationships among
variables.
Some examples are:
1. In business, it might be important to know the relationship
   between sales of a product and the amount of advertising
   expenditure.
2. A company manager might be interested in knowing the
   relationship between an employee’s performance on a job and
   the employee’s aptitude test score.
3. In environmental physics, a researcher might be interested in
   predicting global temperature using the amount of carbon dioxide
   placed into the atmosphere.
4. Goal: predicting the length of hospital stay of a surgical patient.
   Our interest might be in knowing the relationship between the
   length of stay in the hospital and the severity of the operation.
Example: Grade Point Average
The director of admissions of a small college selected 120 students at
random from the new freshman class in a study to determine whether
a student’s grade point average (GPA) at the end of the freshman year
(Y ) can be predicted from the ACT test score (X ). The results of the
study follow.
i:     1      2      3      ···    118    119    120
Xi:    21     14     28     ···    28     16     28
Yi:    3.897  3.885  3.778  ···    3.914  1.860  2.948
1. Does any relationship exist between the two variables, grade
   point average and ACT test score?
2. If a relationship exists between the two variables, can grade point
   average be predicted using the ACT test score?
Example: Property Assessments
The data that follow show assessed value for property tax purposes
(X1 , in thousand dollars) and sales price (X2 , in thousand dollars) for a
sample of 15 parcels of land for industrial development sold recently in
“arm’s length” transactions in a tax district.
i:      1      2      3      ···    13     14     15
X1i:    13.9   16.0   10.3   ···    14.9   12.9   15.8
X2i:    28.6   34.7   21.0   ···    35.1   30.0   36.2
1. Are the two variables associated with each other?
2. What is the strength of association? Weak, moderate, or strong?
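One common way to put a number on the strength of a linear association is the sample correlation coefficient. The slides do not introduce it at this point, so treating it as the measure of strength is an assumption of this sketch. Only the six parcels printed above are used (the remaining nine are elided), so the result describes those six rows and not the full sample; the function name `pearson_r` is illustrative.

```python
import math

# Only the six parcels printed on the slide; the other nine are elided there.
assessed = [13.9, 16.0, 10.3, 14.9, 12.9, 15.8]   # X1, thousand dollars
price = [28.6, 34.7, 21.0, 35.1, 30.0, 36.2]      # X2, thousand dollars

def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(assessed, price)
print(f"r on the six displayed parcels = {r:.3f}")
```

A value of r near +1 or -1 would suggest a strong linear association, and a value near 0 a weak one.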
Two primary goals of analyzing relationships among variables
are:
1. to identify whether or not a relationship exists among the
   variables (that is, whether the relationship is weak, moderately
   weak, moderately strong, or strong, and whether the variables are
   negatively or positively associated), and
2. to identify the form of the relationship (linear or non-linear).
Notation:
1. Let Π be a population of interest. (Example: Students enrolled in
   STAB57.)
2. Let X(π) be a measurement taken on subject π ∈ Π. (Example: Time
   spent (in hours) per week studying course materials by a student
   enrolled in STAB57.)
3. Let Y(π) be another measurement taken on subject π ∈ Π. (Example:
   Midterm grade of a particular student enrolled in STAB57.)
1. A linear relationship between two variables X and Y can be
   expressed as
       Y = α + βX,
   where α and β are real constants. Here, β = 0 indicates
   no linear relationship between X and Y.
2. A non-linear relationship between two variables X and Y can be
   expressed, for example, as
       Y = α + β exp(X),
   where α and β are real constants. Again, β = 0 indicates
   no relationship of this non-linear form between X and Y.
The Definition of Relationship:
1. Suppose we observe a set of values of Y for a particular
   value of X = x. (Example: Students who studied one hour per
   week will have a set of midterm marks.)
2. Hence, we can think of a distribution of Y conditioned on X = x.
3. Two variables X and Y are related if the conditional
   distribution of Y given X = x changes as x changes.
   (Example: As the time spent (in hours) per week studying course
   materials changes from 1 to 2, the average midterm mark
   changes from 35 to 55.)
The Strength of Relationship:
1. If we see large changes in the conditional distribution of Y given
   X = x as x changes, then we say a strong relationship exists.
2. If we see small changes in the conditional distribution of Y given
   X = x as x changes, then we say a weak relationship exists.
Exercise 10.1.1 Prove that discrete random variables X and Y are
unrelated if and only if X and Y are independent.
The Role of Statistical Models:
1. The relationship between two variables is completely described by
   the set of conditional distributions of Y given X.
2. Example: Suppose Y is global temperature and X is the amount of
   carbon dioxide placed into the atmosphere. Then, perhaps the
   conditional distribution of Y given X = x can be expressed as
       Y | X = x ∼ N(µ(x), σ²), with µ(x) = α + βx,
   where the conditional mean of Y changes with x,
   i.e., µ(x) = E(Y | X = x) = α + βx. The conditional variance of Y
   is fixed and does not depend on x, i.e., var(Y | X = x) = σ².
3. Here, α and β are called the intercept and slope parameters,
   respectively.
4. Assuming that the statistical model is correct, the two variables Y
   and X are unrelated to each other if and only if β = 0.
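To see what this model says, we can simulate from it. The sketch below draws Y from N(α + βx, σ²) at a few values of x and summarizes each conditional sample. The parameter values and the helper name `draw_y` are hypothetical, chosen only for illustration.

```python
import random
import statistics

# Hypothetical parameter values, chosen only for illustration.
ALPHA, BETA, SIGMA = 14.0, 0.5, 1.0

def draw_y(x, rng, n=5000):
    """Draw n values from the conditional distribution N(ALPHA + BETA*x, SIGMA^2)."""
    return [rng.gauss(ALPHA + BETA * x, SIGMA) for _ in range(n)]

rng = random.Random(57)  # fixed seed so the run is reproducible
summary = {}
for x in (1.0, 2.0, 3.0):
    ys = draw_y(x, rng)
    summary[x] = (statistics.mean(ys), statistics.stdev(ys))
    print(f"x = {x}: mean = {summary[x][0]:.2f}, sd = {summary[x][1]:.2f}")
```

The sample means should move with x along α + βx while the sample standard deviations stay near σ. Setting BETA to 0 would make the three conditional samples look alike, which is the unrelated case.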
Response and Predictor Variables:
1. If we expect a change in the variable Y for a change in the
   variable X, then we say Y depends on X. Hence, Y and X are
   called the dependent and independent variables, respectively.
   Assuming that the relationship is unidirectional, the variables Y
   and X are called the response and predictor variables, respectively.
2. Example: The midterm mark (Y) and the time (X) spent per
   week studying the course materials can be termed the
   response and predictor variables, respectively.
The Role of Statistical Models:
1. We might have more than one predictor variable corresponding
   to a response variable. Such a relationship can be simplified by a
   statistical model as follows.
2. Let Y be the response and X1, X2, ···, Xk be k predictor variables.
   The statistical model is
       Y | X1 = x1, ···, Xk = xk ∼ N(µ(x), σ²),
       with µ(x) = β0 + β1 x1 + ··· + βk xk,
   where the conditional mean of Y changes with x = (x1, ···, xk),
   i.e., µ(x) = E(Y | X = x) = β0 + β1 x1 + ··· + βk xk. The conditional
   variance of Y is fixed and does not depend on x, i.e.,
   var(Y | X = x) = σ².
3. Assuming that the statistical model is correct, the variables Y and
   X = (X1, ···, Xk) are unrelated if and only if β1 = ··· = βk = 0.
Example (one response and more than one predictor): A hospital
administrator wished to study the relation between patient satisfaction
(Y ) and patient’s age (X1 , in years), severity of illness (X2 , an index),
and anxiety level (X3 , an index). The administrator randomly selected
46 patients and collected the data presented below, where larger
values of Y , X2 , and X3 are, respectively, associated with more
satisfaction, increased severity of illness, and more anxiety.
i:      1     2     3     ···   44    45    46
X1i:    50    36    40    ···   45    37    28
X2i:    51    46    48    ···   51    53    46
X3i:    2.3   2.3   2.2   ···   2.2   2.1   1.8
Yi:     48    57    66    ···   68    59    92
Regression Models:
1. Let Y be the response and X1, X2, ···, Xk be k predictor variables.
   The regression assumption specifies that the relationship
   between the response and the predictors is expressed through the
   conditional distribution of Y given X1, X2, ···, Xk:
       Y | X1, ···, Xk ∼ N(β0 + β1 X1 + ··· + βk Xk, σ²),
   that is, E(Y | X) = β0 + β1 X1 + ··· + βk Xk.
2. This model can be re-expressed as
       Y = β0 + β1 X1 + ··· + βk Xk + Z,
   where Z ∼ N(0, σ²) is independent of X1, ···, Xk.
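The two ways of writing the regression model describe the same random mechanism. As a small check under assumed values, the sketch below draws Y once directly from the normal distribution and once as mean plus error, using two generators with the same seed. The coefficients, σ, and the predictor values are hypothetical.

```python
import math
import random

# Hypothetical coefficients and predictor values for k = 2 predictors.
BETAS = (1.0, 2.0, -0.5)      # beta0, beta1, beta2
SIGMA = 1.5
x = (3.0, 4.0)

mean = BETAS[0] + sum(b * xi for b, xi in zip(BETAS[1:], x))

# Two generators with the same seed produce the same underlying normal draws.
rng_a = random.Random(2015)
rng_b = random.Random(2015)

# Form 1: draw Y directly from N(mean, SIGMA^2).
y_direct = [rng_a.gauss(mean, SIGMA) for _ in range(5)]
# Form 2: draw the error Z from N(0, SIGMA^2) and add it to the mean.
y_additive = [mean + rng_b.gauss(0.0, SIGMA) for _ in range(5)]

same = all(math.isclose(a, b) for a, b in zip(y_direct, y_additive))
print(same)
```

The comparison only illustrates that shifting an N(0, σ²) error by the conditional mean gives an N(mean, σ²) response; it is not a proof.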
Cause-Effect Relationships:
1. Suppose we have a response Y and a predictor X. If the
   conditional distribution of Y given X = x changes as x changes,
   then we say that the two variables are related to each other. In
   a simple linear regression setup we write
       E(Y | X = x) = β0 + β1 x,
   where β0 and β1 are called the intercept and slope parameters,
   respectively.
2. If the changes in Y can be attributed to the changes in
   X only, then we say there exists a cause-effect relationship
   between Y and X.
Cause-Effect Relationships:
1. Through extensive research, scientists have established that there
   exists a cause-effect relationship between a person’s smoking status
   and coronary heart disease.
Confounding Variables:
1. Suppose there exists a relationship between a response Y and a
   predictor X, of the form
       E(Y | X = x) = β0 + β1 x,
   where β0 and β1 are called the intercept and slope parameters,
   respectively.
2. If another variable Z is related to both Y and X, then we
   consider Z a confounding variable. Including the variable Z in
   the model changes the relationship between Y and X
   as follows:
       E(Y | X = x, Z = z) = β0∗ + β1∗ x + β2∗ z,
   where β0∗ ≠ β0 and β1∗ ≠ β1.
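The change from β1 to β1∗ can be seen in a simulation. The sketch below generates data in which Z affects both X and Y, then computes the least-squares slope of X with and without Z in the model. The data-generating process is hypothetical, and least-squares fitting is used here only as a convenient estimator; the slides have not introduced it at this point.

```python
import random

rng = random.Random(10)
n = 5000

# Hypothetical data-generating process: Z influences both X and Y.
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 1) for zi in z]
y = [1 + 2 * xi + 3 * zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]

def centered_sum(a, b):
    """Sum of products of deviations from the means of a and b."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))

sxx, szz, sxz = centered_sum(x, x), centered_sum(z, z), centered_sum(x, z)
sxy, szy = centered_sum(x, y), centered_sum(z, y)

# Least-squares slope of Y on X alone (Z ignored).
slope_without_z = sxy / sxx

# Least-squares slopes of Y on X and Z together (normal equations, 2 predictors).
det = sxx * szz - sxz ** 2
slope_with_z = (szz * sxy - sxz * szy) / det
slope_of_z = (sxx * szy - sxz * sxy) / det

print(f"slope of x, Z ignored:  {slope_without_z:.2f}")
print(f"slope of x, Z included: {slope_with_z:.2f}")
```

In this setup the slope of x with Z ignored settles near 3.5, while the slope with Z included settles near the value 2 used to generate the data. The gap is the confounding effect of Z.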
Example: Confounding Variables:
1. We want to establish a relationship between grade point average
   and gender. Suppose female students secured a higher GPA than
   male students.
2. On the other hand, most of the male students and only a few of the
   female students hold a part-time job.
3. Including the variable part-time job status in the model might
   change the relationship between grade point average and
   gender.
4. Here, part-time job status is a confounding variable.
Experiments:
1. In an experiment, we randomly sample n = n1 + n2 + n3 items from
   a population Π, as we want to make inferences regarding the
   population. Random sampling helps eliminate selection bias.
2. Suppose we have one response Y and one predictor X which
   has 3 levels x1, x2 and x3 (say). We randomly assign x1, x2 and x3
   to n1, n2 and n3 items, respectively. Such random assignment of
   X values helps eliminate any confounding effects of other
   variables.
3. We then observe the values of the response variable Y.
4. Statistical inferences based on data collected via an experiment
   can support the conclusion that a cause-effect relationship
   exists.
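The sampling and assignment steps above can be sketched as follows. The population of 1000 labelled items and the group sizes are hypothetical; only the two randomization steps come from the slide.

```python
import random
from collections import Counter

rng = random.Random(21)

# Hypothetical population of 1000 labelled items and group sizes n1, n2, n3.
population = list(range(1000))
group_sizes = {"x1": 4, "x2": 5, "x3": 3}
n = sum(group_sizes.values())

# Step 1: simple random sample of n items from the population.
sample = rng.sample(population, n)

# Step 2: build the list of levels (n1 copies of x1, ...) and shuffle it,
# so that each sampled item receives its level at random.
levels = [level for level, size in group_sizes.items() for _ in range(size)]
rng.shuffle(levels)
assignment = dict(zip(sample, levels))

print(Counter(assignment.values()))
```

Shuffling the full list of levels keeps the group sizes fixed at n1, n2 and n3 while leaving which item gets which level to chance.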
Example: In a small-scale experimental study of the relation between
degree of brand liking (Y) and moisture content (X1) and sweetness
(X2) of the product, the following results were obtained from an
experiment based on a completely randomized design. The data are
coded below:
i:      1     2     3     ···   14    15    16
X1i:    4     4     4     ···   10    10    10
X2i:    2     4     2     ···   4     2     4
Yi:     64    73    61    ···   95    94    100
Observational Studies:
1. In an observational study, the sample items are not randomly
   selected from the population Π. Hence, the inferences regarding
   the population might be flawed.
2. The levels of the predictor variables are not randomly assigned to
   the items. Confounding effects of other variables might be
   present.
3. Statistical inferences based on data collected via an observational
   study do not necessarily imply a cause-effect relationship
   between variables.
4. While experiments reside at the top of the hierarchy,
   observational studies reside at the bottom.
Example: Property Assessments
The data that follow show assessed value for property tax purposes
(X1 , in thousand dollars) and sales price (X2 , in thousand dollars) for a
sample of 15 parcels of land for industrial development sold recently in
“arm’s length” transactions in a tax district.
i:      1      2      3      ···    13     14     15
X1i:    13.9   16.0   10.3   ···    14.9   12.9   15.8
X2i:    28.6   34.7   21.0   ···    35.1   30.0   36.2
Experimental Design (Design of Experiments):
1. Let us consider a simple setup where the goal is to determine
   whether a cause-effect relationship exists between a response Y
   and a predictor X (also called a factor) defined on a population Π.
2. We randomly select a sample of experimental units (previously
   called items) π1, π2, ···, πn from the population Π.
3. The values x1, x2, ···, xk of X are called levels. When the number
   of possible values of X is large or infinite, we select (perhaps
   randomly) a finite set of values of X that spans the entire range
   of X well.
4. Each level xi of X is then randomly assigned to ni
   (i = 1, 2, ···, k) experimental units. Finally, the values of Y are
   observed for the sampled experimental units.
Experimental Design (Design of Experiments):
1. After random assignment to the experimental units, each level of
   the factor variable is called a treatment.
2. If the conditional distribution of Y at a particular level xi
   shows large variability, then we choose ni to be large.
Example
A rental car company wants to investigate whether the type of car
rented affects the length of the rental period. An experiment is run for
one week at a particular location, and 10 rental contracts are selected
at random for each car type. The results are shown in the following
table.
Type of Car     Observations
Sub-compact     3   5   3   7   6   5   3   2   1   6
Compact         1   3   4   7   5   6   3   2   1   7
Midsize         4   1   3   5   7   1   2   4   2   7
Full size       3   5   7   5  10   3   4   7   2   7
Is there evidence to support a claim that the type of car rented
affects the length of the rental contract?
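One standard way to approach this question is a one-way analysis of variance, which compares the variation between car types with the variation within them. The slides have not introduced this technique at this point, so its use here is an assumption. The sketch reads the table as four car types with ten observations each, and the helper name `one_way_f` is illustrative.

```python
# Rental lengths read from the slide's table, ten contracts per car type.
rentals = {
    "Sub-compact": [3, 5, 3, 7, 6, 5, 3, 2, 1, 6],
    "Compact":     [1, 3, 4, 7, 5, 6, 3, 2, 1, 7],
    "Midsize":     [4, 1, 3, 5, 7, 1, 2, 4, 2, 7],
    "Full size":   [3, 5, 7, 5, 10, 3, 4, 7, 2, 7],
}

def one_way_f(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

f_stat, df1, df2 = one_way_f(list(rentals.values()))
print(f"F = {f_stat:.2f} on ({df1}, {df2}) degrees of freedom")
```

The F statistic would then be compared with an F distribution on (3, 36) degrees of freedom. A value close to 1 means the differences between car types are about as large as the variation within each type.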
Control Treatment, the Placebo:
1. In experimental design, we often choose one level of the predictor
   variable to be zero, i.e., X = 0, where X represents the dose of
   a treatment. The zero value of the predictor variable is called a
   control treatment and serves as a baseline against which we
   assess the effect of the treatment.
2. In medical experiments, we might assign a zero dose of a drug (the
   so-called sugar pill) to some patients to measure the efficacy of
   the drug in alleviating disease symptoms. Such a control treatment
   is also known as a placebo. Sometimes, patients feel better with a
   placebo; this is called the placebo effect.
Blinding:
1. Blinding: The patient does not know whether he/she is receiving
   a placebo or a drug.
2. Double-blinding: Neither the patients nor the experimenter know
   the identity of the treatment assignments.
Exercise 10.1.3: Suppose that a census is conducted on a population
and the joint distribution of (X , Y ) is obtained as in the following table.
         X = 1    X = 2
Y = 1    0.15     0.12
Y = 2    0.18     0.09
Y = 3    0.40     0.06
Determine whether or not a relationship exists between Y and X .
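Following the definition given earlier, one way to check for a relationship is to compute the conditional distribution of Y given each value of X and see whether it changes. The sketch below does this for the table above, reading the columns as X and the rows as Y. The function name `conditional_y_given_x` is illustrative.

```python
# Joint probabilities P(X = x, Y = y) from the table, keyed by (x, y).
joint = {
    (1, 1): 0.15, (2, 1): 0.12,
    (1, 2): 0.18, (2, 2): 0.09,
    (1, 3): 0.40, (2, 3): 0.06,
}

def conditional_y_given_x(joint, x):
    """Return {y: P(Y = y | X = x)} computed from the joint table."""
    marginal_x = sum(p for (xi, _), p in joint.items() if xi == x)
    return {y: p / marginal_x for (xi, y), p in joint.items() if xi == x}

cond = {x: conditional_y_given_x(joint, x) for x in (1, 2)}
for x, dist in cond.items():
    print(x, {y: round(p, 4) for y, p in dist.items()})

# Under the slides' definition, Y and X are related if these distributions differ.
related = any(abs(cond[1][y] - cond[2][y]) > 1e-9 for y in (1, 2, 3))
print("related:", related)
```

Each conditional probability is the joint probability divided by the marginal probability of the corresponding value of X, which is the column total in the table.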
Exercise 10.1.14: Suppose we have a quantitative response variable
Y and two categorical predictor variables W and X , both taking values
in {0, 1}. Suppose the conditional distributions of Y are given by
Y |W = 0, X = 0 ∼ N(3, 5)
Y |W = 1, X = 0 ∼ N(3, 5)
Y |W = 0, X = 1 ∼ N(4, 5)
Y |W = 1, X = 1 ∼ N(4, 5)
Does W have a relationship with Y ? Does X have a relationship with
Y ? Explain your answers.