Descriptive statistics - University of Warwick
DTC Quantitative Research Methods
Descriptive Statistics
Thursday 16th October 2014

Some relevant online course extracts
• Cramer (1998) Chapter 2: Measurement and univariate analysis.
• Diamond and Jefferies (2001) Chapter 5: Measures and displays of spread.
• Sarantakos (2007) Chapter 5: Graphical displays.
• Huizingh (2007) Chapter 12: SPSS material.

Some basic terminology
• Quantitative measures are typically referred to as variables.
• Some variables are generated directly via the data generation process, but other, derived variables may be constructed from the original set of variables later on.
• As the next slide indicates, variables are frequently referred to in more specific ways.

Cause(s) and effect…?
• Often, one variable (and occasionally more than one variable) is viewed as being the dependent variable.
• Variables which are viewed as impacting upon this variable, or outcome, are often referred to as independent variables.
• However, for some forms of statistical analyses, independent variables are referred to in more specific ways (as can be seen within the menus of SPSS for Windows).

Levels of measurement (Types of quantitative data)
• A nominal variable relates to a set of categories such as ethnic groups or political parties which is not ordered.
• An ordinal variable relates to a set of categories in which the categories are ordered, such as social classes or levels of educational qualification.
• An interval-level variable relates to a ‘scale’ measure, such as age or income, that can be subjected to mathematical operations such as averaging.

How many variables?
• The starting point for statistical analyses is typically an examination of the distributions of values for the variables of interest. Such examinations of variables one at a time are a form of univariate analysis.
• Once a researcher moves on to looking at relationships between pairs of variables she or he is engaging in bivariate analyses.
• … and if they attempt to explain why two variables are related with reference to another variable or variables they have moved on to a form of multivariate analysis.

Looking at categorical variables
• For nominal/ordinal variables this largely means looking at the frequencies of each category, often pictorially using, say, bar charts or pie-charts.
• It is usually easier to get a sense of the relative importance of the various categories if one converts the frequencies into percentages!

Example of a frequency table

Place met marital or cohabiting partner     Frequency       %
At school, college or university                  872    12.4
At/through work                                  1405    19.9
In a pub/cafe/restaurant/bar/club                2096    29.7
At a social event organised by friend(s)         1055    14.9
Other                                            1631    23.1
TOTAL                                            7059   100.0

Example of a pie-chart
[Pie-chart of the categories in the frequency table: At school, college or university; At/through work; In a pub/cafe/restaurant/bar/club; At a social event organised by friend(s); Other]

What are percentages?
• It may seem self-evident, but percentages are a form of descriptive statistic.
• Specifically, they are useful in describing the distributions (of frequencies) for nominal or ordinal (i.e. categorical) variables.
• When we consider interval-level variables or more than one variable, we need (somewhat) more sophisticated descriptive statistics.

Descriptive statistics...
• ... are data summaries which provide an alternative to graphical representations of distributions of values (or relationships)
• ... aim to describe key aspects of distributions of values (or relationships)
• ... are of most relevance when we are thinking about interval-level variables (scales)

Description or inference?
• Descriptive statistics summarise relevant features of a set of values.
• Inferential statistics help researchers decide whether features of quantitative data from a sample can be safely concluded to be present in the population.
• Generalizing from a sample to a population is part of the process of statistical inference.
• One objective may be to produce an estimate of the proportion of people in the population with a particular characteristic, i.e. a process of estimation.

Types of (univariate) descriptive statistics
Measures of ...
• ... location (averages)
• ... spread
• ... skewness (asymmetry)
• ... kurtosis
• We typically want to know about the first two, sometimes about the third, and rarely about the fourth!

What is ‘kurtosis’ anyway?
• Increasing kurtosis is associated with the “movement of probability mass from the shoulders of a distribution into its center and tails.” (Balanda, K.P. and MacGillivray, H.L. 1988. ‘Kurtosis: A Critical Review’, The American Statistician 42:2: 111–119.)
• Below, kurtosis increases from left to right...

Visualising ‘scale’ variables
• For interval-level data the appropriate visual summary of a distribution is a histogram. Examining it can allow the researcher to assess whether it is reasonable to assume that the quantity of interest has a particular distributional shape (and whether it exhibits skewness).
• Unlike bar charts, distances along the ‘horizontal’ dimension of a histogram have a well-defined, consistent meaning: i.e. they represent differences between values on the interval-level scale in question.

Example of a histogram

Measures of location
• Mean (the arithmetic average of the values, i.e. the result of dividing the sum of the values by the total number of cases)
• Median (the middle value, when the values are ranked/ordered)
• Mode (the most common value)

... and measures of spread
• Standard deviation (and Variance) (This is linked with the mean, as it is based on averaging [squared] deviations from it. The variance is simply the standard deviation squared.)
• Interquartile range / Quartile deviation (These are linked with the median, as they are also based on the values placed in order.)
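
As a rough illustration of the measures of location and spread listed above, the following Python sketch computes them with the standard library's statistics module. The attendance counts are the seminar figures used in the worked standard deviation example in these slides; the variable names are illustrative, and the quartiles use the module's default ('exclusive') convention, which is only one of several in use, so other software may report a slightly different interquartile range.

```python
from statistics import mean, median, mode, quantiles, stdev, variance

# Number of seminars attended by a sample of ten undergraduates
attended = [5, 4, 4, 7, 9, 8, 9, 4, 6, 5]

# Measures of location
location = {
    "mean": mean(attended),      # sum of the values / number of cases
    "median": median(attended),  # middle value once the values are ordered
    "mode": mode(attended),      # most common value
}

# Measures of spread linked with the mean.
# variance() and stdev() are the sample versions (they divide by n - 1).
spread = {
    "variance": variance(attended),
    "sd": stdev(attended),
}

# Measures of spread linked with the median.
# Assumption: quartiles follow the module's default 'exclusive' method.
q1, q2, q3 = quantiles(attended, n=4)
spread["iqr"] = q3 - q1                        # interquartile range
spread["quartile_deviation"] = (q3 - q1) / 2   # half the interquartile range

print(location)
print(spread)
```

For these values the mean is 6.1, the median 5.5 and the mode 4, with a sample variance of 4.1 and a standard deviation of about 2.02.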

Measures of location and spread: an example (household size)
[Two histograms of household size, one per region (West Midlands, London); the summary statistics for each panel are given in the order in which they appear on the slide]
Mean = 2.94, Median = 2, Mode = 2; s.d. = 1.93, skewness = 2.10, kurtosis = 5.54
Mean = 2.96, Median = 3, Mode = 2; s.d. = 1.58, skewness = 1.27, kurtosis = 2.24

Why is the standard deviation so important?
• The standard deviation (or, more precisely, the variance) is important because it introduces the idea of summarising variation in terms of summed, squared deviations.
• And it is also central to some of the statistical theory used in statistical testing/statistical inference...

An example of the calculation of a standard deviation
• Number of seminars attended by a sample of undergraduates: 5, 4, 4, 7, 9, 8, 9, 4, 6, 5
• Mean = 61/10 = 6.1
• Variance = ((5 – 6.1)² + (4 – 6.1)² + (4 – 6.1)² + (7 – 6.1)² + (9 – 6.1)² + (8 – 6.1)² + (9 – 6.1)² + (4 – 6.1)² + (6 – 6.1)² + (5 – 6.1)²)/(10 – 1) = 36.9/9 = 4.1
• Standard deviation = Square root of variance = 2.025

The Empire Median Strikes Back!
• Comparing descriptive statistics between groups can be done graphically in a rather nice way using a form of display called a ‘boxplot’.
• Boxplots are based on medians and quartiles rather than on the more commonly found mean and standard deviation.

Example of a boxplot

Moving on to bivariate ‘descriptive statistics’...
• These are referred to as ‘Measures of association’, as they quantify the (strength of the) association between two variables.
• The most well-known of these is the (Pearson) correlation coefficient, often referred to as ‘the correlation coefficient’, or even ‘the correlation’.
• This quantifies the closeness of the relationship between two interval-level variables (scales).

Positive and negative relationships

Positive or direct relationships
• If the points cluster around a line that runs from the lower left to upper right of the graph area, then the relationship between the two variables is positive or direct.
• An increase in the value of x is more likely to be associated with an increase in the value of y.
• The closer the points are to the line, the stronger the relationship.

Negative or inverse relationships
• If the points tend to cluster around a line that runs from the upper left to lower right of the graph, then the relationship between the two variables is negative or inverse.
• An increase in the value of x is more likely to be associated with a decrease in the value of y.

Working out the correlation coefficient (Pearson’s r)
• Pearson’s r tells us how much one variable changes as the values of another change, i.e. their covariation.
• Variation is measured with the standard deviation. This measures average variation of each variable from the mean for that variable.
• Covariation is measured by calculating the amount by which each value of X varies from the mean of X and the amount by which each value of Y varies from the mean of Y, multiplying each pair of differences together, and finding the average of the products (by dividing their sum by n – 1).
• Pearson’s r is calculated by dividing this by (SD of x) x (SD of y) in order to standardize it.

$$ r = \frac{\sum (x - \bar{X})(y - \bar{Y})}{(n - 1)\, s_x s_y} $$

Working out the correlation coefficient (Pearson’s r)
• Because r is standardized it will always fall between +1 and -1.
• A correlation of either 1 or -1 means perfect (linear) association between the two variables.
• A correlation of 0 means that there is no linear association.
• Note: correlation does not mean causation. We can only investigate causation by reference to our theory. However (thinking about it the other way round) there is unlikely to be causation if there is not correlation.

A scatterplot of the values of two interval-level variables

Example of calculating a correlation coefficient (corresponding to the last slide)
• X = 5, 4, 4, 7, 9, 8, 9, 4, 6, 5; Mean(X) = 6.1
• Y = 8, 7, 9, 7, 8, 8, 8, 5, 5, 6; Mean(Y) = 7.1
• (5 – 6.1)(8 – 7.1) + (4 – 6.1)(7 – 7.1) + ... etc.
• = –0.99 + 0.21 + ... = 7.9 (Covariation)
• S.D. (X) = 2.02; S.D. (Y) = 1.37
• (7.9 / 9) / (2.02 x 1.37) = 0.316

Looking at the relationship between two categorical variables
If two variables are nominal or ordinal, i.e. categorical, we can look at the relationship between them in the form of a cross-tabulation, using percentages to summarize the pattern. (Typically, if there is one variable that can be viewed as depending on the other, i.e. a dependent variable, and the categories of this variable make up the columns of the cross-tabulation, then it makes sense to have percentages that sum to 100% across each row; these are referred to as row percentages.)

An example of a cross-tabulation (from Jamieson et al., 2002#)
‘When you and your current partner first decided to set up home or move in together, did you think of it as a permanent arrangement or something that you would try and then see how it worked?’

                              Both ‘permanent’   Both ‘try and see’   Different answers   TOTAL
Cohabiting without marriage   15 (48%)           4 (13%)              12 (39%)            31 (100%)
Cohabited and then married    16 (67%)           1 (4%)               7 (29%)             24 (100%)
Married without cohabiting    9 (100%)           0 (0%)               0 (0%)              9 (100%)

# Jamieson, L. et al. 2002. ‘Cohabitation and commitment: partnership plans of young men and women’, Sociological Review 50.3: 356–377.

Alternative forms of percentage
• In the following example, row percentages allow us to compare outcomes between the categories of an independent variable.
• However, we can also use column percentages to look at the composition of each category of the dependent variable.
• In addition, we can use total percentages to look at how the cases are distributed across combinations of the two variables.
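
The three kinds of percentage described above can be sketched in Python using the counts from the cohabitation cross-tabulation (Jamieson et al., 2002). This is only a minimal sketch: the names counts, pct, row_pct, col_pct and total_pct are illustrative rather than taken from any package, and percentages are rounded to one decimal place.

```python
# Counts from the cohabitation cross-tabulation (Jamieson et al., 2002).
# Columns: both 'permanent', both 'try and see', different answers.
counts = {
    "Cohabiting without marriage": [15, 4, 12],
    "Cohabited and then married": [16, 1, 7],
    "Married without cohabiting": [9, 0, 0],
}


def pct(part, whole):
    """Return part as a percentage of whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


row_totals = {label: sum(row) for label, row in counts.items()}
col_totals = [sum(col) for col in zip(*counts.values())]
grand_total = sum(row_totals.values())

# Row percentages: each row sums to (about) 100%
row_pct = {
    label: [pct(c, row_totals[label]) for c in row]
    for label, row in counts.items()
}

# Column percentages: each column sums to (about) 100%
col_pct = {
    label: [pct(c, t) for c, t in zip(row, col_totals)]
    for label, row in counts.items()
}

# Total percentages: all cells together sum to (about) 100%
total_pct = {
    label: [pct(c, grand_total) for c in row]
    for label, row in counts.items()
}

for label in counts:
    print(label, row_pct[label], col_pct[label], total_pct[label])
```

Rounded to whole numbers, the row percentages reproduce those shown in the table (48%, 13% and 39% for the first row).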

Example Cross-tabulation II: Row percentages
Class origin * Class destination Crosstabulation
Cells show Count (% within Class origin); columns are Class destination.

Class origin    Service          Intermediate     Working          Total
Service         730 (58.8%)      323 (26.0%)      189 (15.2%)      1242 (100.0%)
Intermediate    857 (27.6%)      1140 (36.7%)     1108 (35.7%)     3105 (100.0%)
Working         786 (15.5%)      1385 (27.2%)     2916 (57.3%)     5087 (100.0%)
Total           2373 (25.2%)     2848 (30.2%)     4213 (44.7%)     9434 (100.0%)

Derived from: Goldthorpe, J.H. with Llewellyn, C. and Payne, C. (1987). Social Mobility and Class Structure in Modern Britain (2nd Edition). Oxford: Clarendon Press.

Example Cross-tabulation II: Column percentages
Class origin * Class destination Crosstabulation
Cells show Count (% within Class destination); columns are Class destination.

Class origin    Service          Intermediate     Working          Total
Service         730 (30.8%)      323 (11.3%)      189 (4.5%)       1242 (13.2%)
Intermediate    857 (36.1%)      1140 (40.0%)     1108 (26.3%)     3105 (32.9%)
Working         786 (33.1%)      1385 (48.6%)     2916 (69.2%)     5087 (53.9%)
Total           2373 (100.0%)    2848 (100.0%)    4213 (100.0%)    9434 (100.0%)

Example Cross-tabulation II: Total percentages
Class origin * Class destination Crosstabulation
Cells show Count (% of Total); columns are Class destination.

Class origin    Service          Intermediate     Working          Total
Service         730 (7.7%)       323 (3.4%)       189 (2.0%)       1242 (13.2%)
Intermediate    857 (9.1%)       1140 (12.1%)     1108 (11.7%)     3105 (32.9%)
Working         786 (8.3%)       1385 (14.7%)     2916 (30.9%)     5087 (53.9%)
Total           2373 (25.2%)     2848 (30.2%)     4213 (44.7%)     9434 (100.0%)

Percentages and Association
• It is possibly self-evident that the differences between the percentages in different rows (or columns) can collectively be viewed as measuring association.
• In the case of a 2x2 cross-tabulation (i.e. one with two rows and two columns), the difference between the percentages is a measure of association for that cross-tabulation.
• But there are other ways of quantifying the association in the cross-tabulation…

Odds ratios as a measure of association
• The patterns in the social mobility table examined in an earlier session can clearly be expressed as differences in percentages (e.g. the differences between the percentages of sons with fathers in classes I and VII who are themselves in classes I and VII).
• However, an alternative way of quantifying these class differences is to compare the odds of class I fathers having sons in class I as opposed to class VII with the odds of class VII fathers having sons in class I as opposed to class VII.
• The ratio of these two sets of odds is an odds ratio, which will have a value of close to 1.0 if the two sets of odds are similar, i.e. if there is little or no difference between the chances of being in classes I and VII for sons with fathers in classes I and VII respectively.

Odds Ratios vs. % Differences
An Example: Gender and Higher Education

Age 30-39     Degree        No Degree
Men           56 (13.0%)    374
Women         70 (13.8%)    438
% difference = -0.8%; Odds ratio = (56/374)/(70/438) = 0.937

Age 40-49     Degree        No Degree
Men           56 (14.4%)    334
Women         38 (9.1%)     378
% difference = 5.3%; Odds ratio = (56/334)/(38/378) = 1.668

Age 50-59     Degree        No Degree
Men           34 (9.9%)     308
Women         18 (5.2%)     329
% difference = 4.7%; Odds ratio = (34/308)/(18/329) = 2.018

Choice of measure can matter!
• The choice of differences between percentages versus odds ratios as a way of quantifying differences between groups can matter, as in the preceding example of the ‘effect’ of gender on the likelihood of having a degree, according to age.
• The % difference values of 4.7%, 5.3% and -0.8% suggest that inequality increased before it disappeared, whereas the odds ratios of 2.018, 1.668 and 0.937 suggest a small decrease in inequality before a larger decrease led to approximate equality!
• Evidently, there are competing ways of measuring association in a cross-tabulation. But neither differences between percentages nor odds ratios provide an overall summary of the association in a cross-tabulation…

Another measure of association
• If we need an overall measure of association for two cross-tabulated (categorical) variables, one standard possibility is Cramér’s V.
• Like the Pearson correlation coefficient, it has a maximum of 1, and 0 indicates no relationship, but it cannot take negative values, and it makes no assumption of linearity.
• It is derived from a test statistic (inferential statistic), chi-square, which we will consider in a later session…

An example of Cramér’s V
Cramér’s V = 0.074

Other measures of association for cross-tabulations…
• In a literature review more than thirty years ago, Goodman and Kruskal identified several dozen of these: Goodman, L.A. and Kruskal, W.H. 1979. Measures of association for cross classifications. New York, Springer-Verlag.
• … and I added one of my own, Tog, which measures inequality (in a particular way) where both variables are ordinal…

One of Tog’s (distant) relatives

What if one variable is a set of categories, and the other is a scale?
• The equivalent to comparing percentages in this instance is comparing means… but there may be quite a lot of these!
• So one possible overall measure of association used in this situation is eta² (η²) (eta-squared).
• But this is a less familiar measure (at least to researchers in some social science disciplines!)
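
On its usual definition, eta-squared is the between-group sum of squares divided by the total sum of squares, i.e. the proportion of the variation in the scale variable that is accounted for by differences between the category means. The sketch below works through that calculation in Python. The three groups and their values are hypothetical, chosen only to keep the arithmetic easy to follow, and the variable names are illustrative.

```python
from statistics import mean

# Hypothetical data: values of a scale variable within three categories
groups = {
    "A": [2, 3, 4],
    "B": [4, 5, 6],
    "C": [6, 7, 8],
}

all_values = [v for values in groups.values() for v in values]
grand_mean = mean(all_values)

# Total variation: squared deviations of every value from the grand mean
ss_total = sum((v - grand_mean) ** 2 for v in all_values)

# Between-group variation: squared deviations of each group mean from the
# grand mean, weighted by the number of cases in the group
ss_between = sum(
    len(values) * (mean(values) - grand_mean) ** 2
    for values in groups.values()
)

eta_squared = ss_between / ss_total
print(f"eta-squared = {eta_squared:.3f}")
```

With these made-up values the group means are 3, 5 and 7 around a grand mean of 5, so eta² = 24/30 = 0.8.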