lectures. - University of Toronto
STAB57H3: Introduction to Statistics
Winter 2015

Instructor: Jabed Tomal
Department of Computer and Mathematical Sciences
University of Toronto Scarborough
Toronto, ON, Canada
March 18, 2015

Jabed Tomal (U of T) Statistics March 18, 2015 1 / 31

Relationships Among Variables:

In science, including the biological and social sciences, and in business, researchers are often interested in the relationships among variables.

Some examples are:

1. In business, it might be important to know the relationship between the sales of a product and the amount spent on advertising.
2. A company manager might be interested in the relationship between an employee's performance on the job and the employee's aptitude test score.
3. In environmental physics, a researcher might be interested in predicting global temperature from the amount of carbon dioxide released into the atmosphere.
4. When the goal is to predict the length of hospital stay of a surgical patient, our interest might be in the relationship between the length of stay and the severity of the operation.

Example: Grade Point Average

The director of admissions of a small college selected 120 students at random from the new freshman class in a study to determine whether a student's grade point average (GPA) at the end of the freshman year ($Y$) can be predicted from the ACT test score ($X$). The results of the study follow.

    i      X_i (ACT score)   Y_i (GPA)
    1      21                3.897
    2      14                3.885
    3      28                3.778
    ...    ...               ...
    118    28                3.914
    119    16                1.860
    120    28                2.948

1. Does a relationship exist between the two variables, grade point average and ACT test score?
2. If a relationship exists between the two variables, can grade point average be predicted using the ACT test score?
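As a rough illustration of the second question, the sketch below fits a least-squares line to the six (ACT, GPA) pairs displayed in the table above. This is only an assumption-laden sketch: the other 114 observations are not reproduced in these slides, so the printed numbers say nothing reliable about the full study, and `least_squares_line` is a hypothetical helper name, not something taken from the course.

```python
# Six (ACT score, GPA) pairs copied from the rows displayed above.
# Assumption: these are only 6 of the 120 observations, so the fitted
# values are purely illustrative.
act = [21, 14, 28, 28, 16, 28]
gpa = [3.897, 3.885, 3.778, 3.914, 1.860, 2.948]


def least_squares_line(x, y):
    """Return (intercept, slope, correlation) for a simple linear fit of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of squares and cross-products around the sample means.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    correlation = s_xy / (s_xx * s_yy) ** 0.5
    return intercept, slope, correlation


intercept, slope, correlation = least_squares_line(act, gpa)
print(f"fitted line: GPA = {intercept:.3f} + {slope:.4f} * ACT")
print(f"sample correlation: {correlation:.3f}")
```

With the complete data set, the same function could be applied to all 120 pairs; a slope close to zero would suggest that ACT score carries little linear information about GPA.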
Example: Property Assessments

The data that follow show the assessed value for property tax purposes ($X_1$, in thousands of dollars) and the sales price ($X_2$, in thousands of dollars) for a sample of 15 parcels of land for industrial development sold recently in "arm's length" transactions in a tax district.

    i      X_1i (assessed value)   X_2i (sales price)
    1      13.9                    28.6
    2      16.0                    34.7
    3      10.3                    21.0
    ...    ...                     ...
    13     14.9                    35.1
    14     12.9                    30.0
    15     15.8                    36.2

1. Are the two variables associated with each other?
2. What is the strength of the association: weak, moderate, or strong?

The two primary goals of analyzing relationships among variables are:

1. to identify whether or not a relationship exists among the variables (that is, whether the relationship is weak, moderately weak, moderately strong, or strong, and whether the variables are negatively or positively associated), and
2. to identify the form of the relationship (linear or non-linear).

Notation:

1. Let $\Pi$ be a population of interest. (Example: students enrolled in STAB57.)
2. Let $X(\pi)$ be a measurement taken on subject $\pi \in \Pi$. (Example: time spent, in hours per week, studying course materials by a student enrolled in STAB57.)
3. Let $Y(\pi)$ be another measurement taken on subject $\pi \in \Pi$. (Example: midterm grade of a particular student enrolled in STAB57.)

The form of a relationship can be linear or non-linear:

1. A linear relationship between two variables $X$ and $Y$ can be expressed as $Y = \alpha + \beta X$, where $\alpha$ and $\beta$ are real constants. Here, $\beta = 0$ indicates no linear relationship between $X$ and $Y$.
2. A non-linear relationship between two variables $X$ and $Y$ can be expressed, for example, as $Y = \alpha + \beta \exp(X)$, where $\alpha$ and $\beta$ are real constants.
   Again, $\beta = 0$ indicates no non-linear relationship of this form between $X$ and $Y$.

The Definition of Relationship:

1. Suppose we observe a set of values of $Y$ at a particular value $X = x$. (Example: students who studied one hour per week will have a set of midterm marks.)
2. Hence, we can think of a distribution of $Y$ conditional on $X = x$.
3. Two variables $X$ and $Y$ are related if the conditional distribution of $Y$ given $X = x$ changes as $x$ changes. (Example: as the time spent per week studying course materials changes from 1 hour to 2 hours, the average midterm mark changes from 35 to 55.)

The Strength of Relationship:

1. If we see large changes in the conditional distribution of $Y$ given $X = x$ as $x$ changes, then we say that a strong relationship exists.
2. If we see small changes in the conditional distribution of $Y$ given $X = x$ as $x$ changes, then we say that a weak relationship exists.

Exercise 10.1.1

Prove that discrete random variables $X$ and $Y$ are unrelated if and only if $X$ and $Y$ are independent.

The Role of Statistical Models:

1. The relationship between two variables is completely described by the set of conditional distributions of $Y$ given $X$.
2. Example: Let $Y$ be global temperature and $X$ the amount of carbon dioxide released into the atmosphere. Then perhaps the conditional distribution of $Y$ given $X = x$ can be expressed as

   $$Y \mid X = x \sim N(\mu(x), \sigma^2), \qquad \mu(x) = \alpha + \beta x,$$

   where the conditional mean of $Y$ changes with $x$, i.e., $\mu(x) = E(Y \mid X = x) = \alpha + \beta x$, and the conditional variance of $Y$ is fixed and does not depend on $x$, i.e., $\mathrm{var}(Y \mid X = x) = \sigma^2$.
3. Here, $\alpha$ and $\beta$ are called the intercept and slope parameters, respectively.
4. Assuming that the statistical model is correct, the two variables $Y$ and $X$ are unrelated if and only if $\beta = 0$.

Response and Predictor Variables:

1. If we expect a change in the variable $Y$ for a change in the variable $X$, then we say that $Y$ depends on $X$. Hence, $Y$ and $X$ are called the dependent and independent variables, respectively. Assuming that the relationship is unidirectional, $Y$ and $X$ are called the response and predictor variables, respectively.
2. Example: The midterm mark ($Y$) and the time ($X$) spent per week studying the course materials can be termed the response and predictor variables, respectively.

The Role of Statistical Models:

1. We might have more than one predictor variable for a response variable. Such a relationship can be described by a statistical model as follows.
2. Let $Y$ be the response and $X_1, X_2, \ldots, X_k$ be $k$ predictor variables. The statistical model is

   $$Y \mid X_1 = x_1, \ldots, X_k = x_k \sim N(\mu(x), \sigma^2), \qquad \mu(x) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k,$$

   where the conditional mean of $Y$ changes with $x = (x_1, \ldots, x_k)$, i.e., $\mu(x) = E(Y \mid X = x) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$, and the conditional variance of $Y$ is fixed and does not depend on $x$, i.e., $\mathrm{var}(Y \mid X = x) = \sigma^2$.
3. Assuming that the statistical model is correct, $Y$ and $X = (X_1, \ldots, X_k)$ are unrelated if and only if $\beta_1 = \cdots = \beta_k = 0$.

Example (one response and more than one predictor):

A hospital administrator wished to study the relation between patient satisfaction ($Y$) and the patient's age ($X_1$, in years), severity of illness ($X_2$, an index), and anxiety level ($X_3$, an index).
The administrator randomly selected 46 patients and collected the data presented below, where larger values of $Y$, $X_2$, and $X_3$ are associated with more satisfaction, increased severity of illness, and more anxiety, respectively.

    i      X_1i (age)   X_2i (severity)   X_3i (anxiety)   Y_i (satisfaction)
    1      50           51                2.3              48
    2      36           46                2.3              57
    3      40           48                2.2              66
    ...    ...          ...               ...              ...
    44     45           51                2.2              68
    45     37           53                2.1              59
    46     28           46                1.8              92

Regression Models:

1. Let $Y$ be the response and $X_1, X_2, \ldots, X_k$ be $k$ predictor variables. The regression assumption then specifies that the relationship between the response and the predictors is expressed through the conditional distribution of $Y$ given $X_1, X_2, \ldots, X_k$:

   $$Y \mid X_1, \ldots, X_k \sim N(\beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k, \sigma^2),$$

   that is, $E(Y \mid X) = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k$.
2. This model can be re-expressed as

   $$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k + Z,$$

   where $Z \sim N(0, \sigma^2)$.

Cause-Effect Relationships:

1. Suppose we have a response $Y$ and a predictor $X$. If the conditional distribution of $Y$ given $X = x$ changes as $x$ changes, then we say that the two variables are related to each other. In a simple linear regression setup we write $E(Y \mid X = x) = \beta_0 + \beta_1 x$, where $\beta_0$ and $\beta_1$ are called the intercept and slope parameters, respectively.
2. If the changes in $Y$ can be attributed to the changes in $X$ only, then we say that a cause-effect relationship exists between $Y$ and $X$.
3. Through extensive research, scientists have established that a cause-effect relationship exists between a person's smoking status and coronary heart disease.

Confounding Variables:

1. Suppose there exists a relationship between a response $Y$ and a predictor $X$.
   Suppose the relationship is as follows:

   $$E(Y \mid X = x) = \beta_0 + \beta_1 x,$$

   where $\beta_0$ and $\beta_1$ are called the intercept and slope parameters, respectively.
2. If another variable $Z$ is related to both $Y$ and $X$, then we consider $Z$ a confounding variable. Including $Z$ in the model changes the relationship between $Y$ and $X$ as follows:

   $$E(Y \mid X = x, Z = z) = \beta_0^* + \beta_1^* x + \beta_2^* z,$$

   where $\beta_0^* \neq \beta_0$ and $\beta_1^* \neq \beta_1$.

Example: Confounding Variables:

1. We want to establish a relationship between grade point average and gender. Suppose female students earned higher GPAs than male students.
2. On the other hand, most of the male students and only a few of the female students hold a part-time job.
3. Including the variable part-time job status in the model might redefine the relationship between grade point average and gender.
4. Here, part-time job status is a confounding variable.

Experiments:

1. In an experiment, we randomly sample $n = n_1 + n_2 + n_3$ items from a population $\Pi$, because we want to make inferences about the population. Random sampling helps eliminate selection bias.
2. Suppose we have one response $Y$ and one predictor $X$ with 3 levels, say $x_1$, $x_2$, and $x_3$. We randomly assign $x_1$, $x_2$, and $x_3$ to $n_1$, $n_2$, and $n_3$ items, respectively. Such random assignment of $X$ values helps eliminate confounding effects of other variables.
3. We then observe the values of the response variable $Y$.
4. Statistical inferences based on data collected via an experiment can support the conclusion that a cause-effect relationship exists.
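The random assignment in step 2 above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the item labels, the level names, the group sizes, and the helper name `randomly_assign` are all hypothetical, and only the standard-library `random` module is used.

```python
import random


def randomly_assign(units, levels, counts, seed=None):
    """Randomly assign each level to the requested number of units.

    units  -- list of sampled items
    levels -- list of predictor levels, e.g. ["x1", "x2", "x3"]
    counts -- number of units per level; must sum to len(units)
    """
    if sum(counts) != len(units):
        raise ValueError("counts must sum to the number of units")
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)  # a random order, so no unit characteristic drives its level
    assignment = {}
    start = 0
    for level, count in zip(levels, counts):
        # The next `count` units in the shuffled order receive this level.
        for unit in shuffled[start:start + count]:
            assignment[unit] = level
        start += count
    return assignment


# Hypothetical example: 9 sampled items, 3 levels, n1 = n2 = n3 = 3.
units = [f"item{i}" for i in range(1, 10)]
assignment = randomly_assign(units, ["x1", "x2", "x3"], [3, 3, 3], seed=57)
for unit in units:
    print(unit, "->", assignment[unit])
```

Shuffling the whole list once and then cutting it into blocks guarantees the group sizes $n_1$, $n_2$, $n_3$ exactly, which independent coin flips per item would not.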
Example:

In a small-scale experimental study of the relation between degree of brand liking ($Y$) and the moisture content ($X_1$) and sweetness ($X_2$) of the product, the following results were obtained from an experiment based on a completely randomized design. The data are coded below:

    i      X_1i (moisture)   X_2i (sweetness)   Y_i (brand liking)
    1      4                 2                  64
    2      4                 4                  73
    3      4                 2                  61
    ...    ...               ...                ...
    14     10                4                  95
    15     10                2                  94
    16     10                4                  100

Observational Studies:

1. In an observational study, the sample items are not randomly selected from the population $\Pi$. Hence, inferences about the population might be flawed.
2. The levels of the predictor variables are not randomly assigned to the items. Confounding effects of other variables might be present.
3. Statistical inferences based on data collected via an observational study do not necessarily imply a cause-effect relationship between variables.
4. While experiments reside at the top of the hierarchy of evidence, observational studies reside at the bottom.

Example: Property Assessments

The data that follow show the assessed value for property tax purposes ($X_1$, in thousands of dollars) and the sales price ($X_2$, in thousands of dollars) for a sample of 15 parcels of land for industrial development sold recently in "arm's length" transactions in a tax district.

    i      X_1i (assessed value)   X_2i (sales price)
    1      13.9                    28.6
    2      16.0                    34.7
    3      10.3                    21.0
    ...    ...                     ...
    13     14.9                    35.1
    14     12.9                    30.0
    15     15.8                    36.2

Experimental Design (Design of Experiments):

1. Let us consider a simple setup where the goal is to determine whether a cause-effect relationship exists between a response $Y$ and a predictor $X$ (also called a factor) defined on a population $\Pi$.
2. We randomly select a sample of experimental units (previously called items) $\pi_1, \pi_2, \ldots, \pi_n$ from the population $\Pi$.
3. The values $x_1, x_2, \ldots, x_k$ of $X$ are called levels. When the number of possible values of $X$ is large or infinite, we select (perhaps randomly) a finite set of values that spans the entire range of $X$ well.
4. Level $x_i$ of $X$ is then randomly assigned to $n_i$ experimental units, $i = 1, 2, \ldots, k$. Finally, the values of $Y$ are observed for the sampled experimental units.
5. After random assignment to the experimental units, each level of the factor is called a treatment.
6. If the conditional distribution of $Y$ at a particular level $x_i$ shows large variability, then we choose $n_i$ to be large.

Example

A rental car company wants to investigate whether the type of car rented affects the length of the rental period. An experiment is run for one week at a particular location, and 10 rental contracts are selected at random for each car type. The results are shown in the following table.

    Type of Car    Observations
    Sub-compact    3   5   3   7   6   5   3   2   1   6
    Compact        1   3   4   7   5   6   3   2   1   7
    Midsize        4   1   3   5   7   1   2   4   2   7
    Full size      3   5   7   5  10   3   4   7   2   7

Is there evidence to support a claim that the type of car rented affects the length of the rental contract?

Control Treatment, the Placebo:

1. In experimental design, we often choose one level of the predictor variable to be zero, i.e., $X = 0$, where $X$ represents the dose of a treatment. The zero level of the predictor variable is called a control treatment and serves as a baseline against which we assess the effect of the treatment.
2. In medical experiments, we might assign a zero dose of a drug (the so-called sugar pill) to some patients in order to measure the efficacy of the drug in alleviating disease symptoms. Such a control treatment is also known as a placebo. Sometimes patients feel better with a placebo; this is called the placebo effect.

Blinding:

1. Blinding: the situation in which a patient does not know whether he/she is receiving a placebo or a drug.
2. Double-blinding: the situation in which neither the patients nor the experimenter know the treatment assignments.

Exercise 10.1.3:

Suppose that a census is conducted on a population and the joint distribution of $(X, Y)$ is obtained as in the following table.

             X = 1   X = 2
    Y = 1    0.15    0.12
    Y = 2    0.18    0.09
    Y = 3    0.40    0.06

Determine whether or not a relationship exists between $Y$ and $X$.

Exercise 10.1.14:

Suppose we have a quantitative response variable $Y$ and two categorical predictor variables $W$ and $X$, both taking values in $\{0, 1\}$. Suppose the conditional distributions of $Y$ are given by

    $Y \mid W = 0, X = 0 \sim N(3, 5)$
    $Y \mid W = 1, X = 0 \sim N(3, 5)$
    $Y \mid W = 0, X = 1 \sim N(4, 5)$
    $Y \mid W = 1, X = 1 \sim N(4, 5)$

Does $W$ have a relationship with $Y$? Does $X$ have a relationship with $Y$? Explain your answers.
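The definition of a relationship used in these slides (the conditional distribution of $Y$ given $X = x$ changes as $x$ changes) can be checked numerically for a discrete joint table. The sketch below does this for the joint probabilities of Exercise 10.1.3; it is only one possible way to organize the computation, and the function names `conditional_given_x` and `is_related` are hypothetical helpers introduced here for illustration.

```python
def conditional_given_x(joint):
    """Return {x: {y: P(Y = y | X = x)}} from a joint table {(x, y): P(X = x, Y = y)}.

    Assumes every value of x has positive marginal probability.
    """
    marginal_x = {}
    for (x, _y), p in joint.items():
        marginal_x[x] = marginal_x.get(x, 0.0) + p
    conditional = {}
    for (x, y), p in joint.items():
        conditional.setdefault(x, {})[y] = p / marginal_x[x]
    return conditional


def is_related(joint, tol=1e-9):
    """True if the conditional distribution of Y given X = x changes with x."""
    dists = list(conditional_given_x(joint).values())
    first = dists[0]
    return any(
        abs(d.get(y, 0.0) - first.get(y, 0.0)) > tol
        for d in dists[1:]
        for y in set(first) | set(d)
    )


# Joint probabilities P(X = x, Y = y) copied from the table in Exercise 10.1.3.
joint = {
    (1, 1): 0.15, (2, 1): 0.12,
    (1, 2): 0.18, (2, 2): 0.09,
    (1, 3): 0.40, (2, 3): 0.06,
}

for x, dist in sorted(conditional_given_x(joint).items()):
    print(f"P(Y = y | X = {x}):", {y: round(p, 3) for y, p in sorted(dist.items())})
print("related:", is_related(joint))
```

Comparing every conditional distribution with the first one is enough here, because the variables are unrelated exactly when all the conditional distributions coincide.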