What is correlation?

Social Sciences Research
Methods Centre
Basic Quantitative Analysis
Session 1
Nicole Janz
ssrmcta@hermes.cam.ac.uk
Computer Password
•  your user identification (ps342@cam.ac.uk)
•  your password for Desktop Services system
Don’t have your password?
Go NOW to the Computing Service helpdesk
2
Course Objectives
•  Understand theory, assumptions, and
calculation of bivariate statistics
•  Develop software skills
•  Results interpretation
•  Apply bivariate tests to your own research
3
Readings & Weekly Quiz
Come prepared
Weekly quiz
Exam
More info at: www.nicolejanz.de/teaching
4
Course Outline
Week 1: Correlation (today)
Week 2: Chi-square tests
Week 3: T-tests
Week 4: One-way ANOVAs
Part lecture, part exercises using R
5
Recap: revision from last term
•  State two characteristics of a normal distribution.
•  A null hypothesis states that there is … effect.
•  “People who are more wealthy are happier” – is this
a one- or two-tailed hypothesis? Why?
•  A deviation is the distance (difference) between an
observation and the ….
•  Variance and standard deviation are a measure of …
6
Recap:
What is a normal distribution?
•  bell shaped curve
•  symmetrical
•  unimodal
•  majority of scores lies around
the centre (mean)
•  frequency decreases when we
deviate from the centre
•  Extends to +/- infinity
•  many naturally occurring
distributions are normal
7
Outline for Today
PART I: Theory
•  What is correlation / why use it?
•  Visual display of correlation: scatterplots
•  Simple relationship calculation: covariance
•  Better: Pearson's correlation coefficient ‘r’
•  Alternatives to Pearson's r: Spearman, Kendall
PART II: R Practice
8
Correlation
Part I: Theory
9
What is correlation?
Correlation is a way of measuring the extent to
which two variables are related.
When X varies, do we consistently see similar
variation in Y?
Give examples of two variables that could
be related from your field!
10
Why use correlation?
•  Check if variables are related before I run
complicated models
•  Check if my measurement is valid by
comparing with other measures of same
variable
•  Multicollinearity in regression
Most papers report a bivariate correlation
matrix as a first step
11
Three ways to assess correlation
between two continuous variables
1.  Scatterplots
2.  Covariance
3.  Correlation coefficients (Pearson and
alternatives)
14
1. Visual: Scatterplot
The best way to start exploring a relationship
between variables is graphically
Scatterplots plot one variable against the
other.
They are very useful to compare two
continuous variables.
They should always be our first step.
15
Figure source: Field, Ch 6
What kind of relationship do you see?
16
17
Types of relationship
Are studying for an exam & exam results correlated?
Positive relationship: variables move in the
same direction
- As studying ↑, exam score ↑
- As studying ↓, exam score ↓
Negative relationship: variables move in
opposite directions
- As studying ↑, exam score ↓
- As studying ↓, exam score ↑
What is a ‘null relationship’?
19
Three ways to assess correlation
between two continuous variables
1.  Scatterplots
2.  Covariance
3.  Correlation coefficients (Pearson and
alternatives)
20
From visual to calculation:
Covariance
•  We look at how much each score deviates
from the mean = variance of single variable
•  If both variables deviate from the mean in a
similar way, they are likely to be related
21
2. Covariance of
two variables
mean
Variance
variable 1
Variance
Variable 2
Figure source: Field, Ch 6
Deviation from mean
Do you see a covariance? Do you see a positive or negative relationship?
22
Calculation: (1) calculate variance
for one variable
To calculate the variance, we sum the squared
deviations, and then divide by n-1.
Variance = $s^2 = \frac{\sum_i (x_i - \bar{x})^2}{N - 1}$
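A minimal R sketch of this calculation, using the miles.x values from Example 1 of this session. The object names (deviations, s2) are illustrative assumptions and are not taken from the course script.

```r
# miles.x values from Example 1 of this session
miles.x <- c(5, 4, 4, 6, 8)

# Deviation of each score from the mean
deviations <- miles.x - mean(miles.x)

# Sum the squared deviations, then divide by N - 1
s2 <- sum(deviations^2) / (length(miles.x) - 1)

s2            # variance calculated by hand
var(miles.x)  # R's built-in var() also divides by N - 1
```

Both lines should print the same value, which is a quick way to check the hand calculation.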
23
(2) Include the second variable:
Covariance of two variables
Variance = $s^2 = \frac{\sum_i (x_i - \bar{x})^2}{N - 1}$
We now have two variables, X and Y,
and put them into the equation
Cross-product deviations:
$\mathrm{cov}(x, y) = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{N - 1}$
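The covariance formula can be sketched in R as follows, again with the miles.x and miles.y values from Example 1. The names cross_products and cov_xy are illustrative assumptions.

```r
# Data from Example 1 of this session
miles.x <- c(5, 4, 4, 6, 8)
miles.y <- c(8, 9, 10, 13, 15)

# Cross-product deviations: deviation of x times deviation of y, per observation
cross_products <- (miles.x - mean(miles.x)) * (miles.y - mean(miles.y))

# Sum the cross-products and divide by N - 1
cov_xy <- sum(cross_products) / (length(miles.x) - 1)

cov_xy                 # covariance calculated by hand
cov(miles.x, miles.y)  # built-in cov() for comparison
```

A positive result means the two variables tend to deviate from their means in the same direction.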
24
Solution: Standardization
This is where we convert data to a common unit of
measurement = standardization.
The intuition is that if we divide a deviation of x by the
standard deviation of x, it tells us the distance from the
mean in standard deviation units.
$r = \frac{\mathrm{cov}_{xy}}{s_x s_y} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{(N - 1)\, s_x s_y}$
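As a rough check of the standardization step, the sketch below divides the covariance by the product of the two standard deviations and compares the result with R's cor(). It uses the Example 1 data; the name r_by_hand is an illustrative assumption.

```r
# Data from Example 1 of this session
miles.x <- c(5, 4, 4, 6, 8)
miles.y <- c(8, 9, 10, 13, 15)

# Standardize the covariance: divide by the product of the standard deviations
r_by_hand <- cov(miles.x, miles.y) / (sd(miles.x) * sd(miles.y))

r_by_hand              # Pearson's r calculated by hand
cor(miles.x, miles.y)  # built-in cor() uses Pearson's r by default
```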
25
We divide by the standard deviations of both variables, multiplied together.
Three ways to measure correlation
between two continuous variables
1.  Scatterplots
2.  Covariance
3.  Correlation coefficients (Pearson and
alternatives)
26
3. Correlation coefficients: Pearson’s r
The standardized version of covariance is known as the
correlation coefficient. It is relatively unaffected by units of
measurement.
27
What does the correlation coefficient r
tell us?
The sign (+/-) of the correlation coefficient indicates the
direction of the relationship (positive/negative)
r always lies between
-1 for a perfect negative correlation, and
+1 for a perfect positive correlation
0 indicates no linear relationship
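A small sketch of what these values look like in R. The data are constructed purely for illustration (they are not from the course script): two perfectly linear pairs, and one pair that is perfectly related but not linearly.

```r
# Illustrative values only, symmetric around zero
x <- -5:5

cor(x, 2 * x + 1)  # perfect positive linear relationship: r = +1
cor(x, -x)         # perfect negative linear relationship: r = -1

# y depends entirely on x, but the relationship is a curve, not a line
y <- x^2
r_curve <- cor(x, y)
r_curve            # r is 0 although x and y are clearly related
```

This is why r = 0 is read as "no linear relationship" rather than "no relationship", and why a scatterplot should come first.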
28
What is “big”?
•  The absolute value of r indicates the strength of the
relationship.
•  When two variables X and Y are highly correlated, they
have |r| ≈ 1
•  Rule of thumb:
± 0.1 represents a small effect
± 0.3 a moderate effect
± 0.5 a large effect
Refer to your own field: 0.8 = large in political science
29
Correlation examples
± 0.1 represents a small effect
± 0.3 is a moderate effect
± 0.5 is a large effect
30
Testing the significance of r
•  The null hypothesis is that the population correlation
coefficient is equal to zero: H0: ρ = 0
•  This means there is no effect and no relationship
between the variables
•  Significance test: “What is the probability of obtaining
an r at least as large as the observed one by chance if ρ = 0
in the population?”
Significance and p-value
We want to reject the null hypothesis (‘no effect’):
p<.05 means
•  The result is significant
•  There is a less than 5% chance of obtaining a result at least
this extreme if no real effect exists. In practice: the
relationship between the two variables is unlikely to be due
to chance alone
•  There is strong reason to believe we found a
correlation between our variables
This is two lines in R
cor(variable1,variable2)
This gives you a number (the correlation coefficient)
between -1 and 1; 0 means no correlation
No info on significance
cor.test(variable1,variable2)
This tests if your correlation coefficient is significant
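A minimal sketch of these two calls, using the Example 1 data from this session in place of variable1 and variable2. The object name result is an illustrative assumption.

```r
# Data from Example 1 of this session
miles.x <- c(5, 4, 4, 6, 8)
miles.y <- c(8, 9, 10, 13, 15)

# Line 1: the correlation coefficient only, no information on significance
cor(miles.x, miles.y)

# Line 2: the same coefficient plus a significance test
result <- cor.test(miles.x, miles.y)
result

# Individual parts of the output can be extracted by name
result$estimate  # the correlation coefficient r
result$p.value   # the p-value of the two-sided test
```

With only five observations the p-value here is just above .05, so even a large r is not significant at the 5% level.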
33
Additional information: R²
If you square the correlation coefficient,
you get an indication of the proportion of variance
shared between two variables
•  r tells us the extent to which variables move in the
same direction
•  r-squared tells us about how much they overlap
= R², also called the coefficient of determination
E.g. If 20% of variance is shared by 2 variables, 80%
must be explained by other variables not included.
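In R, the shared variance can be sketched by squaring the coefficient from cor(). The example uses the Example 1 data; the name r_squared is an illustrative assumption.

```r
# Data from Example 1 of this session
miles.x <- c(5, 4, 4, 6, 8)
miles.y <- c(8, 9, 10, 13, 15)

r <- cor(miles.x, miles.y)

# Square r to get the proportion of shared variance
r_squared <- r^2

# Expressed as a percentage, rounded to whole numbers
round(r_squared * 100)
```

The rounded percentage is the figure that would go into a write-up of the shared variance.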
34
Write up examples
The bivariate relationship between miles.x and
miles.y was positive (r=.87), with the two items
sharing roughly 76% of their variance. However, the
relationship was not significant.
There was a … correlation between self-control and moral
behaviour (r=…, p< …). Self-control accounted for …% of
the variance in behavioural outcomes.
See papers in your field.
35
Reporting results
•  It is convention to report the sign (if negative),
value and significance level of the correlation
when you report your findings.
•  Correlation coefficients are reported without the
zero before the decimal point
e.g., r = .87
•  Significance is often indicated using cut-offs e.g.,
p<.05 , p<.01
•  Results are typically presented in the past tense.
36
Three ways to measure correlation
between two continuous variables
1.  Scatterplots
2.  Covariance
3.  Correlation coefficients (Pearson,
Spearman)
37
When to use Pearson’s r
•  Two continuous variables (take on any value) & normally
distributed
•  Ordered variables (>10 categories) count as “continuous”
•  binary variable and a continuous variable
•  two binary variables
•  variables show linear relationships (straight lines)
38
Are data skewed? Check histograms
39
Alternative: Spearman’s correlation
coefficient
Spearman’s rho is a valuable alternative to Pearson’s r
when the data deviate strongly from normality.
It is obtained by replacing the observations by their ranks and
computing Pearson’s correlation on the ranks.
Use it for
•  Continuous variables & skewed (not norm. distrib)
•  Ordinal data (ranked in an order <10 categories)
See Dalgaard (2008) p. 120 ff.
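The rank idea can be sketched in R by ranking the observations with rank() and then computing an ordinary correlation. The Example 1 data are reused here only for illustration; the names rho_by_hand and rho are assumptions.

```r
# Data from Example 1 of this session (miles.x contains a tie: 4 and 4)
miles.x <- c(5, 4, 4, 6, 8)
miles.y <- c(8, 9, 10, 13, 15)

# Replace observations by their ranks (ties receive the average rank)
rank(miles.x)
rank(miles.y)

# Pearson's correlation computed on the ranks
rho_by_hand <- cor(rank(miles.x), rank(miles.y))

# Built-in Spearman correlation for comparison
rho <- cor(miles.x, miles.y, method = "spearman")

rho_by_hand
rho
```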
40
In practice: often a choice between 2 tests
In the weekly quiz use either Pearson’s or Spearman’s
Pearson’s r: cor.test()
•  Two continuous variables & normally distributed
•  Ordered variables (>10 categories) count as “continuous”
•  binary variable and a continuous variable
•  two binary variables
Spearman’s rho: cor.test( , method="spearman")
•  Continuous variables & skewed (not norm. distrib)
•  Ordinal data (ranked in an order, <10 categories, e.g. whether
people got a fail, a pass, a merit or a distinction in their exam)
Kendall’s tau: like Spearman’s, for small samples where many scores
have the same rank; less popular (…, method="kendall")
41
Three ways to measure correlation
between two continuous variables
1.  Scatterplots (visual)
2.  Covariance (calculate)
3.  Correlation coefficients (calculate: Pearson,
Spearman)
42
Caveats
Correlation ≠ Causation
Non-significance ≠ No relationship
Outliers, non-normal distributions and non-linearity will
distort your correlation coefficient.
Correlations may differ for subgroups in your data.
Consider some separate analyses (e.g., males and
females)
Correlation is the basis for regression. In regression the
value of one variable is used to predict the value of
another
43
Correlation
Part II: R Practice
44
•  Download Rscript for this session from
www.nicolejanz.de/teaching/bivariate.html
(the course is now called Basic Quantitative Analysis but it used to
be called bivariate)
•  If you download the Rscript in Internet
Explorer, it will convert the .R file into
a .txt file. Use Firefox!!!
•  Save it in U:// hard drive My Documents (or
a folder of your choice within that)
•  Open the script “Janz_M3S1_ex.R” from
Rstudio
45
Scatterplot and Covariance in R
plot(variable1, variable2, main="mytitle")
cov(variable1, variable2)
46
Example 1: Scatterplot and
Covariance
Variable: miles.x
How many miles can you
cover on inline skates in
one day?
Variable: miles.y
How many miles can you
cover by bike in one day?
     miles.x miles.y
[1,]       5       8
[2,]       4       9
[3,]       4      10
[4,]       6      13
[5,]       8      15
1.  Create a scatterplot of miles.x
against miles.y
2.  Calculate the covariance of
miles.x and miles.y
Rscript:
Janz_M3S1_ex.R
Example 1
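One possible way to work through Example 1 is sketched below. It is not the content of Janz_M3S1_ex.R; the object names miles and covariance and the plot labels are assumptions made for illustration.

```r
# Enter the two variables
miles.x <- c(5, 4, 4, 6, 8)     # miles on inline skates
miles.y <- c(8, 9, 10, 13, 15)  # miles by bike

# Bind them into a matrix to see the values side by side
miles <- cbind(miles.x, miles.y)
miles

# 1. Scatterplot of miles.x against miles.y
plot(miles.x, miles.y,
     main = "Miles per day: inline skates vs. bike",
     xlab = "miles.x (inline skates)",
     ylab = "miles.y (bike)")

# 2. Covariance of miles.x and miles.y
covariance <- cov(miles.x, miles.y)
covariance
```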
47
Example 2: Covariance Problem
1.  Convert the numbers from miles to
kilometers (*1.6) and call the new
variables kilometers.x and
kilometers.y
2.  Get scatterplots for the new
variables
3.  Calculate the covariance for the
new variables
The covariance depends on the unit of
measurement
(e.g. miles or kilometers)!
We are still in
the same
Rscript:
Janz_M3S1_ex.R
Example 2
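A short sketch of the unit problem, assuming the Example 1 data and the conversion factor of 1.6 given above. The names cov_miles and cov_km are illustrative assumptions.

```r
# Data from Example 1 of this session
miles.x <- c(5, 4, 4, 6, 8)
miles.y <- c(8, 9, 10, 13, 15)

# Convert miles to kilometers
kilometers.x <- miles.x * 1.6
kilometers.y <- miles.y * 1.6

cov_miles <- cov(miles.x, miles.y)
cov_km    <- cov(kilometers.x, kilometers.y)

cov_miles
cov_km

# Both variables were multiplied by 1.6, so the covariance grows by 1.6 * 1.6
cov_km / cov_miles
```

The relationship between the variables has not changed, yet the covariance has, which is why it cannot be compared across different units.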
48
Example 3 - 4
Example 3
Let’s check if the correlation coefficient
between our variables in kilometers and
miles is the same.
Example 4
Calculate correlation coefficient for adverts
watched and packets bought.
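For Example 3, a minimal sketch of the comparison, assuming the Example 1 data and the 1.6 conversion from Example 2. The names r_miles and r_km are illustrative assumptions.

```r
# Data from Example 1 and the conversion from Example 2
miles.x <- c(5, 4, 4, 6, 8)
miles.y <- c(8, 9, 10, 13, 15)
kilometers.x <- miles.x * 1.6
kilometers.y <- miles.y * 1.6

# Correlation coefficient in both units
r_miles <- cor(miles.x, miles.y)
r_km    <- cor(kilometers.x, kilometers.y)

r_miles
r_km

# TRUE if the two coefficients agree (up to floating-point tolerance)
isTRUE(all.equal(r_miles, r_km))
```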
49
Example 5
Variable 1: minutes a lecturer talks without pause
Variable 2: desire to attend next class
•  Test if the correlation is significant!
•  Check R-squared of our variables minutes a lecturer
talks without pause and desire to attend next class
Remember:
p<.05 means result is significant, i.e. the relationship
between the two variables is not by chance
50
Example 6
Data set CPS1985 from the AER package in R
Is wage associated with education levels?
We will go through 5 steps:
1.  Histograms (to check distribution)
2.  Scatterplots
3.  Correlations, R-squared
4.  Interpreting output
Example taken from Kleiber/Zeileis: Applied Econometrics with R, 2008 p.52
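A hedged sketch of how these steps might look for Example 6. It assumes the AER package is installed and uses the wage and education columns of CPS1985. Choosing Pearson's r here is an assumption; if the histograms show strong skew, method = "spearman" could be used instead. The name wage_test is illustrative.

```r
# Only run if the AER package is available
if (requireNamespace("AER", quietly = TRUE)) {
  data("CPS1985", package = "AER")

  # 1. Histograms to check the distributions
  hist(CPS1985$wage, main = "Wage")
  hist(CPS1985$education, main = "Education")

  # 2. Scatterplot of wage against education
  plot(CPS1985$education, CPS1985$wage,
       xlab = "Education (years)", ylab = "Wage")

  # 3. Correlation with significance test, and R-squared
  wage_test <- cor.test(CPS1985$education, CPS1985$wage)
  print(wage_test)
  print(unname(wage_test$estimate)^2)
} else {
  message("Package 'AER' is not installed; try install.packages('AER').")
}
```

Step 4, interpreting the output, means reading off the sign and size of r, the p-value, and the shared variance from the printed results.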
51
Example 7
World Data set
Is public expenditure associated with the
percentage of urban population in a country?
We will go through 5 steps:
1.  Histograms (to check distribution)
2.  Scatterplots
3.  Correlations, R-squared
4.  Interpreting output
52
Example 8
Student Survey Data
We will go through 5 steps:
1.  Histograms (to check distribution)
2.  Scatter plots of variables of interest
3.  Correlations, R-squared, significance test
4.  Interpreting output
5.  Write up results
If you finished early: do the same with Example 8 (see R)
53
Take home message
Scatterplot
plot()
Histogram
hist()
Correlation
cor.test()
Specify method
method = "pearson" (default)
method = "spearman"
54
Thank you !
Nicole Janz
www.nicolejanz.de
Some extra slides
that might be helpful
Literature
Andy Field’s Companion Website with Data and
Rcode for Chapter 6:
http://www.uk.sagepub.com/books/Book236067
(click companion website, register)
Correlation in R:
http://www.statmethods.net/stats/correlations.html
57
Survey Data Variables of Interest
•  TEMPERATURE: Estimate the current temperature (in degrees
Celsius) in Cambridge right now!
•  STATISTICS ANXIETY: How do you feel about statistics? Please rate
the following statements from our textbook's online guide. E.g. Statistics
make me cry
•  SELF-CONTROL: Do you agree or disagree with the following
statements about yourself? e.g. I often act on the spur of the moment
without stopping to think. / I often do whatever brings me pleasure here
and now, even at the cost of some distant goal. ...
•  MORAL BEHAVIOR: Do you think it is very wrong, wrong, a little
wrong or not wrong at all to...? e.g. Run a red light on a bicycle. / Push in
or 'jump' a queue.
Carefully read what higher numbers in the variables mean, see R script.
58
Recap: What data are not continuous?
Categorical data are distinct categories and often integers
•  Binary data take on only two categories
Example: Did you vote? {No=0, Yes=1}
•  Ordinal data are categories that can be ordered
Example: Do you support 2010 Health Care reform?
{Does too little=1, Just right=2, Doesn’t do enough=3}
•  Nominal data take on name values lacking a unique
ordering
Examples: Which candidate do you prefer?
{Obama, Romney, Another Republican}
59
Watch out when…
… a binary variable is used that has an underlying
continuous concept (“continuous dichotomy”, see Field p.
229)
This usually happens when you recode a variable from a
continuous scale to a binary variable. Example: GDP is a
continuous variable, but you re-code it to 0=under 1000,
1=over 1000.
Then use biserial correlation: polyserial()
in the R package polycor
Or:
When you have categorical variables (green, blue, Obama,
Romney etc). There are other tests for this (see next sessions).
60
Mathematical notation
$\bar{x}$: a “bar” indicates this is the mean of a variable
$|x|$: the absolute value of x (drop any minus signs)
$x_i^3$: some superscripts tell us to raise a variable to a
power; this says raise xi to the third power
Types of variables
Continuous (any numerical value):
Interval: Equal intervals on the variable represent equal differences in
the property being measured
e.g. the difference between 6 and 8 is equivalent to the difference
between 13 and 15
Ratio: The same as an interval variable, but the ratios of scores on the
scale must also make sense
e.g. a score of 16 on an anxiety scale means that the person is, in
reality, twice as anxious as someone scoring 8
Categorical (entities are divided into distinct categories):
Binary: There are only two categories, e.g. dead=0, alive=1
Nominal: There are more than two categories
e.g. whether someone is a vegetarian, vegan, or fruitarian
Ordinal: The same as a nominal variable but the categories have a
logical order
e.g. whether people got a fail, a pass, a merit or a distinction in their
exam
63
More correlation examples
64
In a normal distribution:
• 68% of obs fall within 1 SD of the mean
• 95% of obs fall within 2 SD of the mean
• 99.7% of obs fall within 3 SD of the mean.
This is known as the empirical rule.
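These percentages can be checked in R with pnorm(), the cumulative distribution function of the normal distribution. The helper name within_sd is an illustrative assumption.

```r
# Proportion of a normal distribution lying within k standard deviations of the mean
within_sd <- function(k) pnorm(k) - pnorm(-k)

# Percentages for 1, 2 and 3 standard deviations, rounded to one decimal place
round(100 * within_sd(1:3), 1)
```

The results are the exact values behind the rounded figures of the empirical rule.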
65
How unlikely does the Null-Hypothesis have to be?
•  The unlikeliness of the Null-Hypothesis is called statistical significance.
There are three main cut-off points for statistical significance, often
referred to as p-values.
- p < 0.05: There is a chance of less than 1:20 (5%) that the observed pattern is due to chance.
- p < 0.01: There is a chance of less than 1:100 (1%) that the observed pattern is due to chance.
- p < 0.001: There is a chance of less than 1:1,000 (0.1%) that the observed pattern is due to chance.
•  The smaller the p-value, the more evidence there is that we should reject
the null hypothesis. If an observed pattern is unlikely to be due to chance,
we reject the Null-Hypothesis.
•  P-values also (somewhat confusingly) correspond to what are called alpha
levels, so an alpha of .05 means a significance level of 5%.
66
http://www.sagepub.com/upm-data/40007_Chapter8.pdf
•  One-tailed tests have the advantage that it is easier to reject the
Null-Hypothesis; results are more likely to be statistically significant.
•  However, one-tailed tests require that one has very strong reasons (before
the start of the study) that an effect can only go into one direction.
•  As a general rule, therefore, most researchers prefer two-tailed tests (and
statistics programmes routinely report two-tailed tests of significance).
68
•  Two-tailed testing is used when we want to test a research hypothesis that
a parameter is not equal (≠) to some value.
Pictures are from 2005 Brooks/Cole, Thomson Learning
69
Why divide the standard deviation of a sample by n-1?
Source: Andy Field, Ch 2
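One way to see the reason is a small exact enumeration, sketched here for the variance (the standard deviation is its square root). The tiny population of 1, 2, 3 and all object names are assumptions made purely for illustration; this is not taken from Field.

```r
# A tiny illustrative population
population <- c(1, 2, 3)

# True population variance (divide by N, because this is the whole population)
pop_var <- mean((population - mean(population))^2)

# Every possible ordered sample of size 2, drawn with replacement
samples <- expand.grid(first = population, second = population)

# Sample variance dividing by n - 1 (what var() does)
var_n_minus_1 <- apply(samples, 1, var)

# Sample variance dividing by n
var_n <- apply(samples, 1, function(s) mean((s - mean(s))^2))

pop_var              # the target
mean(var_n_minus_1)  # matches the population variance on average
mean(var_n)          # too small on average
```

Dividing by n underestimates the population variance on average because deviations are measured from the sample mean rather than the unknown population mean; dividing by n-1 corrects for this.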