What is correlation?
Social Sciences Research Methods Centre
Basic Quantitative Analysis, Session 1
Nicole Janz
ssrmcta@hermes.cam.ac.uk

Computer Password
• your user identification (ps342@cam.ac.uk)
• your password for Desktop Services system
Don’t have your password? Go NOW to the Computing Service helpdesk

Course Objectives
• Understand theory, assumptions, and calculation of bivariate statistics
• Develop software skills
• Results interpretation
• Apply bivariate tests to your own research

Readings & Weekly Quiz
• Come prepared
• Weekly quiz
• Exam
More info at: www.nicolejanz.de/teaching

Course Outline
Week 1: Correlation (today)
Week 2: Chi-square tests
Week 3: T-tests
Week 4: One-way ANOVAs
Part lecture, part exercises using R

Recap: revision from last term
• State two characteristics of a normal distribution.
• A null hypothesis states that there is … effect.
• “People who are more wealthy are happier”: is this a one- or two-tailed hypothesis? Why?
• A deviation is the distance (difference) between an observation and the ….
• Variance and standard deviation are a measure of …

Recap: What is a normal distribution?
• bell-shaped curve
• symmetrical
• unimodal
• majority of scores lie around the centre (mean)
• frequency decreases when we deviate from the centre
• extends to +/- infinity
• many naturally occurring distributions are normal

Outline for Today
PART I: Theory
• What is correlation / why use it?
• Visual display of correlation: scatterplots
• Simple relationship calculation: covariance
• Better: Pearson's correlation coefficient 'r'
• Alternatives to Pearson's r: Spearman, Kendall
PART II: R Practice

Correlation Part I: Theory

What is correlation?
Correlation is a way of measuring the extent to which two variables are related. When X varies, do we consistently see similar variation in Y?
Give examples of two variables that could be related from your field!

Why use correlation?
• Check if variables are related before I run complicated models
• Check if my measurement is valid by comparing with other measures of same variable
• Multicollinearity in regression
Most papers report a bivariate correlation matrix as a first step

Three ways to assess correlation between two continuous variables
1. Scatterplots
2. Covariance
3. Correlation coefficients (Pearson and alternatives)

1. Visual: Scatterplot
The best way to start exploring a relationship between variables is graphically.
Scatterplots plot one variable against the other. They are very useful to compare two continuous variables. They should always be our first step.
[Figure: example scatterplots. Source: Field, Ch 6]
What kind of relationship do you see?

Types of relationship
Are studying for an exam & exam results correlated?
Positive relationship: variables move in the same direction
- As studying ↑, exam score ↑
- As studying ↓, exam score ↓
Negative relationship: variables move in opposite directions
- As studying ↑, exam score ↓
- As studying ↓, exam score ↑
What is ‘null relationship’?

Three ways to assess correlation between two continuous variables
1. Scatterplots
2. Covariance
3. Correlation coefficients (Pearson and alternatives)

From visual to calculation: Covariance
• We look at how much each score deviates from the mean = variance of single variable
• If both variables deviate from the mean in a similar way, they are likely to be related

2. Covariance of two variables
[Figure: deviations from the mean for variable 1 and variable 2. Source: Field, Ch 6]
Do you see a covariance? Do you see a pos/neg. relationship?

Calculation: (1) calculate variance for one variable
To calculate the variance, we sum the squared deviations, and then divide by N − 1.
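As a minimal sketch of the calculation just described, the variance can be computed by hand in R and checked against the built-in var(). The five values are the miles.x figures used later in Example 1; the name var_by_hand is only illustrative and not part of the course script.

```r
# Five observations (the miles.x values from Example 1 later in the session)
x <- c(5, 4, 4, 6, 8)

n <- length(x)                   # number of observations
deviations <- x - mean(x)        # distance of each score from the mean
sum_sq <- sum(deviations^2)      # sum of squared deviations
var_by_hand <- sum_sq / (n - 1)  # divide by n - 1

var_by_hand                      # variance computed step by step
var(x)                           # built-in function, same n - 1 divisor
```

Both lines should print the same value, since var() also divides by n - 1.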
$$ s^2 = \frac{\sum (x_i - \bar{x})^2}{N - 1} $$

(2) Include the second variable: Covariance of two variables
$$ s^2 = \frac{\sum (x_i - \bar{x})^2}{N - 1} $$
We now have two variables, X and Y, and put them into the equation. The numerator becomes the sum of the cross-product deviations:
$$ \operatorname{cov}(x, y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{N - 1} $$

Solution: Standardization
This is where we convert data to a common unit of measurement = standardization.
The intuition is that if we divide a deviation of x by the standard deviation of x, it tells us the distance from the mean in standard deviation units.
$$ r = \frac{\operatorname{cov}_{xy}}{s_x s_y} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{(N - 1)\, s_x s_y} $$
We divide the covariance by the standard deviations of both variables, multiplied together.

Three ways to measure correlation between two continuous variables
1. Scatterplots
2. Covariance
3. Correlation coefficients (Pearson and alternatives)

3. Correlation coefficients: Pearson’s r
The standardized version of covariance is known as the correlation coefficient. It is unaffected by the units of measurement (e.g. miles or kilometers).

What does the correlation coefficient r tell us?
• The sign (+/-) of the correlation coefficient indicates the direction of the relationship (positive/negative)
• r always lies between -1 for a perfect negative correlation, and +1 for a perfect positive correlation
• 0 indicates no linear relationship

What is “big”?
• The absolute value of r indicates the strength of the relationship.
• When two variables X and Y are highly correlated: they have |r| ≈ 1
• Rule of thumb: ± 0.1 represents a small effect, ± 0.3 a moderate effect, ± 0.5 a large effect
Refer to your own field: 0.8 = large in political science

Correlation examples
• ± 0.1 represents a small effect
• ± 0.3 is moderate
• ± 0.5 is a large effect

Testing the significance of r
• The null hypothesis is that the population correlation coefficient is equal to zero: H0: ρ = 0
• This means there is no effect and no relationship between the variables
• Significance test: “What is the probability of obtaining an r at least as large as the observed one by chance, if ρ = 0 in the population?”

Significance and p-value
We want to reject the H-Null (‘no effect’): p<.05 means
• The result is significant
• There is less than a 5% chance of obtaining the result if no real effect exists. This means practically: the relationship between the two variables is unlikely to be due to chance alone
• There is strong reason to believe we found a correlation between our variables

This is two lines in R
cor(variable1,variable2)
This gives you a number (the correlation coefficient) between -1 and 1; 0 means no correlation. No info on significance.
cor.test(variable1,variable2)
This tests if your correlation coefficient is significant.

Additional information: R2
If you square the correlation coefficient, you get an indication of the proportion of variance shared between two variables
• r tells us the extent to which variables move in the same direction
• r-squared tells us how much they overlap = R2, also called the coefficient of determination
E.g. if 20% of variance is shared by 2 variables, 80% must be explained by other variables not included.

Write up examples
The bivariate relationship between miles.x and miles.y was positive (r=.87), with the two items sharing roughly 76% of their variance. However, the relationship was not significant.
There was a … correlation between self-control and moral behaviour (r=…, p< …).
Self-control accounted for …% of the variance in behavioural outcomes.
See papers in your field.

Reporting results
• It is convention to report the sign (if negative), value and significance level of the correlation when you report your findings.
• Correlation coefficients are reported without the zero before the decimal point, e.g., r = .87
• Significance is often indicated using cut-offs, e.g., p<.05, p<.01
• Results are typically presented in the past tense.

Three ways to measure correlation between two continuous variables
1. Scatterplots
2. Covariance
3. Correlation coefficients (Pearson, Spearman)

When to use Pearson’s r
• Two continuous variables (take on any value) & normally distributed
• Ordered variables (>10 categories) count as “continuous”
• binary variable and a continuous variable
• two binary variables
• variables show linear relationships (straight lines)

Are data skewed? Check histograms

Alternative: Spearman’s correlation coefficient
Spearman’s rho is a valuable alternative to Pearson’s r when data deviate strongly from normality. It is obtained by replacing the observations by their rank and computing the correlation.
Use it for
• Continuous variables & skewed (not norm. distrib)
• Ordinal data (ranked in an order <10 categories)
See Dalgaard (2008) p. 120 ff.

In practice: often a choice between two tests
In the weekly quiz use either Pearson’s or Spearman’s
Pearson’s r: cor.test()
• Two continuous variables & normally distributed
• Ordered variables (>10 categories) count as “continuous”
• binary variable and a continuous variable
• two binary variables
Spearman’s rho: cor.test( , method="spearman")
• Continuous variables & skewed (not norm. distrib)
• Ordinal data (ranked in an order <10 categories, e.g.
whether people got a fail, a pass, a merit or a distinction in their exam)
• Kendall’s tau: like Spearman’s, for small samples where many scores have the same rank; less popular (…, method="kendall")

Three ways to measure correlation between two continuous variables
1. Scatterplots (visual)
2. Covariance (calculate)
3. Correlation coefficients (calculate: Pearson, Spearman)

Caveats
• Correlation ≠ Causation
• Non-significance ≠ No relationship
• Outliers, non-normal distributions and non-linearity will distort your correlation coefficient.
• Correlations may differ for subgroups in your data. Consider some separate analyses (e.g., males and females)
• Correlation is the basis for regression. In regression the value of one variable is used to predict the value of another

Correlation Part II: R Practice
• Download Rscript for this session from www.nicolejanz.de/teaching/bivariate.html (the course is now called Basic Quantitative Analysis but it used to be called bivariate)
• If you download the Rscript in Internet Explorer, it will convert the .R file into a .txt file. Use Firefox!!!
• Save it in U:// hard drive My Documents (or a folder of your choice within that)
• Open the script “Janz_M3S1_ex.R” from RStudio

Scatterplot and Covariance in R
plot(variable1,variable2,main="mytitle")
cov(variable1,variable2)

Example 1: Scatterplot and Covariance
Variable miles.x: How many miles can you cover on inline skates in one day?
Variable miles.y: How many miles can you cover by bike in one day?

     miles.x miles.y
[1,]       5       8
[2,]       4       9
[3,]       4      10
[4,]       6      13
[5,]       8      15

1. Create scatterplot of miles.x against miles.y
2. Calculate the covariance of miles.x and miles.y
Rscript: Janz_M3S1_ex.R Example 1

Example 2: Covariance Problem
We are still in the same Rscript: Janz_M3S1_ex.R Example 2
1. Convert the numbers from miles to kilometers (*1.6) and call the new variables kilometers.x and kilometers.y
2. Get scatterplots for the new variables
3. Calculate the covariance for the new variables
The covariance depends on the unit of measurement (e.g. miles or kilometers)!

Example 3 - 4
Example 3: Let’s check if the correlation coefficient between our variables in kilometers and miles is the same.
Example 4: Calculate correlation coefficient for adverts watched and packets bought.

Example 5
Variable 1: minutes a lecturer talks without pause
Variable 2: desire to attend next class
• Test if the correlation is significant!
• Check R-squared of our variables minutes a lecturer talks without pause and desire to attend next class
Remember: p<.05 means the result is significant, i.e. the relationship between the two variables is unlikely to be due to chance

Example 6
Data set CPS1985 from the AER package in R
Is wage associated with education levels?
We will go through 4 steps:
1. Histograms (to check distribution)
2. Scatterplots
3. Correlations, R-squared
4. Interpreting output
Example taken from Kleiber/Zeileis: Applied Econometrics with R, 2008 p.52

Example 7
World Data set
Is public expenditure associated with the percentage of urban population in a country?
We will go through 4 steps:
1. Histograms (to check distribution)
2. Scatterplots
3. Correlations, R-squared
4. Interpreting output

Example 8
Student Survey Data
We will go through 5 steps:
1. Histograms (to check distribution)
2. Scatterplots of variables of interest
3. Correlations, R-squared, significance test
4. Interpreting output
5. Write up results
If you finished early: do the same with Example 8 (see R)

Take home message
Scatterplot: plot()
Histogram: hist()
Correlation: cor.test()
Specify method: method = "pearson" (default), method = "spearman"

Thank you!
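The steps from Examples 1 to 5 can be sketched in one short script. This is a hedged illustration, not the course script Janz_M3S1_ex.R: it assumes the five data pairs from Example 1, and names such as cov_by_hand, r_by_hand and r_squared are made up for this sketch.

```r
# Data from Example 1 (miles covered on inline skates and by bike)
miles.x <- c(5, 4, 4, 6, 8)
miles.y <- c(8, 9, 10, 13, 15)
n <- length(miles.x)

# Step 1: always look at the scatterplot first
plot(miles.x, miles.y, main = "Miles: skates vs bike")

# Step 2: covariance = sum of cross-product deviations divided by n - 1
cov_by_hand <- sum((miles.x - mean(miles.x)) * (miles.y - mean(miles.y))) / (n - 1)
cov(miles.x, miles.y)               # built-in equivalent

# The covariance changes with the unit of measurement
kilometers.x <- miles.x * 1.6
kilometers.y <- miles.y * 1.6
cov(kilometers.x, kilometers.y)     # larger than the covariance in miles

# Step 3: Pearson's r = covariance / (sd of x * sd of y)
r_by_hand <- cov_by_hand / (sd(miles.x) * sd(miles.y))
cor(miles.x, miles.y)               # built-in equivalent
cor(kilometers.x, kilometers.y)     # identical: r does not depend on units

# Significance test and shared variance (r-squared)
result <- cor.test(miles.x, miles.y)
result$estimate                     # the correlation coefficient
result$p.value                      # compare with the .05 cut-off
r_squared <- unname(result$estimate)^2

# Spearman alternative; exact = FALSE because tied values
# (two 4s in miles.x) rule out an exact p-value
cor.test(miles.x, miles.y, method = "spearman", exact = FALSE)
```

With these five pairs r is about .87 and r-squared about .76, which matches the write-up example, and the p-value sits just above .05, so the relationship is reported as not significant.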
Nicole Janz
www.nicolejanz.de

Some extra slides that might be helpful

Literature
• Andy Field’s Companion Website with Data and Rcode for Chapter 6: http://www.uk.sagepub.com/books/Book236067 (click companion website, register)
• Correlation in R: http://www.statmethods.net/stats/correlations.html

Survey Data: Variables of Interest
• TEMPERATURE: Estimate the current temperature (in degrees Celsius) in Cambridge right now!
• STATISTICS ANXIETY: How do you feel about statistics? Please rate the following statements from our textbook's online guide. E.g. Statistics make me cry
• SELF-CONTROL: Do you agree or disagree with the following statements about yourself? e.g. I often act on the spur of the moment without stopping to think. / I often do whatever brings me pleasure here and now, even at the cost of some distant goal. ...
• MORAL BEHAVIOR: Do you think it is very wrong, wrong, a little wrong or not wrong at all to...? e.g. Run a red light on a bicycle. / Push in or 'jump' a queue.
Carefully read what higher numbers in the variables mean, see R script.

Recap: What data are not continuous?
Categorical data are distinct categories and often integers
• Binary data take on only two categories. Example: Did you vote? {No=0, Yes=1}
• Ordinal data are categories that can be ordered. Example: Do you support 2010 Health Care reform? {Does too little=1, Just right=2, Doesn’t do enough=3}
• Nominal data take on name values lacking a unique ordering. Example: Which candidate do you prefer? {Obama, Romney, Another Republican}

Watch out when…
… a binary variable is used that has an underlying continuous concept (“continuous dichotomy”, see Field p. 229)
This usually happens when you recode a variable from a continuous scale to a binary variable. Example: GDP is a continuous variable, but you re-code it to 0=under 1000, 1=over 1000.
Then use biserial correlation: polyserial() in the R package polycor
Or: When you have categorical variables (green, blue, Obama, Romney etc).
There are other tests for this (see next sessions).

Mathematical notation
• $\bar{x}$: a “bar” indicates this is the mean of a variable
• $|x|$: the absolute value of x (drop any minus signs)
• $x_i^3$: some superscripts tell us to raise a variable to a power; this says raise $x_i$ to the third power

Types of variables
Continuous (any numerical value):
• Interval: Equal intervals on the variable represent equal differences in the property being measured, e.g. the difference between 6 and 8 is equivalent to the difference between 13 and 15
• Ratio: The same as an interval variable, but the ratios of scores on the scale must also make sense, e.g. a score of 16 on an anxiety scale means that the person is, in reality, twice as anxious as someone scoring 8
Categorical (entities are divided into distinct categories):
• Binary: There are only two categories, e.g. dead=0, alive=1
• Nominal: There are more than two categories, e.g. whether someone is a vegetarian, vegan, or fruitarian
• Ordinal: The same as a nominal variable but the categories have a logical order, e.g. whether people got a fail, a pass, a merit or a distinction in their exam

More correlation examples

In a normal distribution:
• 68% of obs fall within 1 SD of the mean
• 95% of obs fall within 2 SD of the mean
• 99.7% fall within 3 SD of the mean.
This is known as the empirical rule.

How unlikely does the Null-Hypothesis have to be?
• The unlikeliness of the Null-Hypothesis is called statistical significance. There are three main cut-off points for statistical significance, often referred to as p-values.
  - p < 0.05: If the Null-Hypothesis is true, there is a chance of less than 1:20 (5%) of observing a pattern this strong.
  - p < 0.01: If the Null-Hypothesis is true, there is a chance of less than 1:100 (1%) of observing a pattern this strong.
  - p < 0.001: If the Null-Hypothesis is true, there is a chance of less than 1:1,000 (0.1%) of observing a pattern this strong.
• The smaller the p-value, the more evidence there is that we should reject the null hypothesis.
• If an observed pattern is unlikely to be due to chance, we reject the Null-Hypothesis.
• P-values also (somewhat confusingly) correspond to what are called alpha levels, so an alpha of .05 means a significance level of 5%.
http://www.sagepub.com/upm-data/40007_Chapter8.pdf

• One-tailed tests have the advantage that it is easier to reject the Null-Hypothesis; results are more likely to be statistically significant.
• However, one-tailed tests require that one has very strong reasons (before the start of the study) to expect that an effect can only go in one direction.
• As a general rule, therefore, most researchers prefer two-tailed tests (and statistics programmes routinely report two-tailed tests of significance).

• Two-tailed testing is used when we want to test a research hypothesis that a parameter is not equal (≠) to some value
Pictures are from 2005 Brooks/Cole, Thomson Learning

Why divide the standard deviation of a sample by n-1?
Source: Andy Field, Ch 2
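One way to explore this question is a small simulation, offered here as an illustrative sketch and not as the explanation given in Field. It assumes repeated samples of size 5 drawn from a normal population whose true variance is 1; the seed, sample size, number of repetitions and variable names are arbitrary choices made for this example.

```r
set.seed(42)     # arbitrary seed so the run is reproducible
n <- 5           # size of each sample
reps <- 10000    # number of simulated samples

# Draw many small samples from a population whose true variance is 1
# (replicate() returns a matrix with n rows and reps columns)
samples <- replicate(reps, rnorm(n, mean = 0, sd = 1))

# Sample variance with divisor n - 1 (what var() uses) ...
var_n_minus_1 <- apply(samples, 2, var)
# ... and the same sums of squares divided by n instead
var_n <- var_n_minus_1 * (n - 1) / n

mean(var_n_minus_1)   # on average close to the true value of 1
mean(var_n)           # on average too small, close to (n - 1) / n = 0.8
```

Dividing by n underestimates the population variance on average, because deviations are measured from the sample mean and not from the unknown population mean; dividing by n - 1 corrects for this. The standard deviation is the square root of that variance.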