Presentation Slides
The Analysis of Complex Surveys
David Bellhouse

A Typical Complex Design
Stratified two-stage cluster sampling
Strata
• geographical areas
First stage units
• smaller areas within the larger areas
Second stage units
• households
Clusters
• all individuals in the household

Why a Complex Design?
Classical reasons
• better coverage of the entire region of interest (stratification)
• efficient for interviewing: less travel, less costly
A new reason
• to reduce cost, the sample is piggybacked on a sample chosen by a complex design such as the Canadian Labour Force Survey
Problem: estimation and analysis are more complex

Examples of Complex Designs
• Ontario Health Survey
• Canadian Community Health Survey
• Youth in Transition Survey
• Youth Risk Behavior Survey (U.S.)

Ontario Health Survey
• carried out in 1990
• health status of the population was measured
• data were collected relating to the risk factors associated with major causes of morbidity and mortality in Ontario
• survey of 61,239 persons was carried out in a stratified two-stage cluster sample by Statistics Canada

OHS Sample Selection
• strata: public health units
  – divided into rural and urban strata
• first stage: enumeration areas defined by the 1986 Census of Canada and selected by pps (probability proportional to size)
• second stage: dwellings selected by SRS (simple random sampling)
• cluster: all persons in the dwelling

Youth in Transition Survey
Reading cohort (15 year-olds): stratified two-stage sampling
• school population stratified by province, language of instruction and enrollment size
• 1,200 schools selected within strata
• eligible students selected within each sampled school
  – the initial student sample size for the survey conducted in 2000 was 38,000
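
The two-stage selection described for the OHS (first-stage areas chosen by pps, then dwellings by SRS within each chosen area) can be sketched as follows. This is a minimal illustration, not the procedure Statistics Canada used: the function names, the systematic variant of pps, and the toy dwelling counts are all assumptions made for the example. It also computes each sampled dwelling's inverse inclusion probability.

```python
import random

def pps_systematic(sizes, n, rng):
    """Pick n distinct unit indices with probability proportional to size.

    Systematic PPS (an assumed variant): lay the units end to end along a
    line of length sum(sizes) and take n equally spaced points from a
    random start. Unit i is included with probability
    n * sizes[i] / sum(sizes), provided no unit exceeds the interval.
    """
    total = sum(sizes)
    interval = total / n
    if max(sizes) > interval:
        raise ValueError("a unit exceeds the sampling interval")
    start = rng.random() * interval
    chosen, upper, i = [], sizes[0], 0  # upper = cumulative size through unit i
    for k in range(n):
        point = start + k * interval
        while upper <= point and i < len(sizes) - 1:
            i += 1
            upper += sizes[i]
        chosen.append(i)
    return chosen

def two_stage_sample(dwellings_per_area, n_areas, m_dwellings, seed=0):
    """Return (area, dwelling, weight) triples for a two-stage sample."""
    rng = random.Random(seed)
    total = sum(dwellings_per_area)
    sample = []
    for area in pps_systematic(dwellings_per_area, n_areas, rng):
        M = dwellings_per_area[area]
        p_first = n_areas * M / total   # first-stage inclusion probability
        p_second = m_dwellings / M      # SRS within the selected area
        weight = 1.0 / (p_first * p_second)
        for dwelling in rng.sample(range(M), m_dwellings):
            sample.append((area, dwelling, weight))
    return sample

# Hypothetical dwelling counts for seven enumeration areas in one stratum.
toy_counts = [120, 80, 200, 150, 50, 100, 300]
toy_sample = two_stage_sample(toy_counts, n_areas=3, m_dwellings=10)
```

Because the size measure here is the dwelling count itself and the same number of dwellings is taken in every selected area, every sampled dwelling ends up with the same weight, total / (n_areas * m_dwellings).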

Canadian Community Health Survey
Piggybacked on the Canadian Labour Force Survey (LFS)
• LFS design
  – stratified by province and economic regions
  – geographical areas (usually enumeration areas) within strata chosen with probability proportional to the population size of the area
  – dwellings chosen within geographical areas
  – all persons in the dwelling interviewed

Youth Risk Behavior Survey
Stratified three-stage cluster sample
• strata are metropolitan statistical areas
• primary sampling units are large counties or groups of smaller counties
• second stage units are schools within counties
• third stage units are classes within schools
• all students within a class are interviewed

Basic Problem in Survey Data Analysis
≠

Issues
iid (independent and identically distributed) assumption
• the assumption does not hold in complex surveys because of correlations induced by the sampling design or because of the population structure
• blindly applying standard programs to the analysis can lead to incorrect results

Two Simple Examples to Illustrate the Problems Involved in Analyzing “Complex” Samples
• an old Ontario lottery called Lottario that is similar to Lotto 6/49
• a pay equity lawsuit involving a stratified sampling design

Lottery Example
Lottario – old Ontario lottery, a Lotto 6/39
• seven numbers chosen on a draw night
  – six regular numbers and a bonus number
• winning numbers collected for 167 draws ending in January 1982
• want to test whether each of the 39 numbers (or balls) has the same chance of being chosen

Breakdown of Independence Assumption
• On a draw night, the numbers are chosen by simple random sampling without replacement
  – numbers chosen within a draw are not independent
• Between draws the balls are replaced to be drawn again on the next draw
  – numbers chosen between draws are independent

Lottario Draws up to January 1982
[Figure: bar chart of frequency (0 to 50) by ball number (1 to 39)]

Chi-square
Test for Equality of Proportions
• Ignoring the lack of independence
  – test statistic = 48.674
  – p-value = 0.115
• Taking into account the lack of independence
  – test statistic = (38/32)*48.674 = 57.8
  – p-value = 0.021

Pay Equity Example
Pay equity survey dispute: Canada Post and PSAC
• two job evaluations on the same set of people (and same set of information) carried out in 1987 and 1993
• rank correlation between the two sets of job values obtained through the evaluations was 0.539
• assumption to obtain a valid estimate of correlation: pairs of observations are iid

Scatterplot of Evaluations
[Figure: scatterplot of rank in 1993 (0 to 200) against rank in 1987 (0 to 200)]
• Rank correlation is 0.539

A Stratified Design with Distinct Differences Between Strata
• the pay level increases with each pay category (four in number)
• the job value also generally increases with each pay category
• therefore the observations are not iid

Scatterplot by Pay Category
[Figure: scatterplot of rank in 1993 (0 to 200) against rank in 1987 (0 to 200), points marked by pay category 2, 3, 4, 5]

Correlations within Level
Correlations within each pay level
• Level 2: -0.293
• Level 3: -0.010
• Level 4: 0.317
• Level 5: 0.496
Only Level 4 is significantly different from 0

Available Software for Complex Survey Analysis
• commercial packages:
  – STATA
  – SAS
  – SPSS
  – SUDAAN
• noncommercial package:
  – R

Typical Survey Data File
stratum  psu  initwt  finalwt  age  race  ethnicty  educ  sex
      7   20       8       10   14     2         2     7    1
      7   20       8       10   13     2         2     7    1
      7   20       8       10   12     2         2     5    2
      7   20       8       10   15     2         2     8    1
      7   20       8       10   14     2         1     7    1
      7   20       8       14   14     1         2     9    1
      7   20       8       14   16     1         2     9    2
     12   21       8       10   17     2         2     9    1
     12   21       8       10   16     2         2     9    2
     12   21       8       10   14     2         9     8    1
     12   21       8       10   16     2         2     9    1
     12   21       8       10   16     2         2    10    1
     12   21       8       10   16     2         2     9    1
     12   21       8       10   18     2         2    11    1
     12   21       8       14   17     1         2    11    1
     12   21       8       14   17     1         2    11    1
     12   21       8       10   16     2         2     9    1
     12   21       8       14   17     1         2    10    1
     12   21       8       10   16     2         2     9    1
     12   21       8       10   13     2         2     7    1
     12   21       8       10   17     2         2    10    1

Survey Weights: Definitions
initial weight
• equal to the inverse of the inclusion probability of the unit
final weight
• initial weight adjusted for nonresponse,
poststratification and/or benchmarking
• interpreted as the number of units in the population that the sample unit represents

Interpretation
• the survey weight for a particular sample unit is the number of units in the population that the unit represents
[Figure: legend entries Not sampled, Wt = 2, Wt = 5, Wt = 6, Wt = 7]

Some Consequences of Ignoring the Weights: Survey of Youth in Custody
• first U.S. survey of youths confined to long-term, state-operated institutions
• complemented existing Children in Custody censuses
• companion survey to the Surveys of State Prisons
• the data contain information on criminal histories, family situations, drug and alcohol use, and peer group activities
• survey carried out in 1989 using stratified systematic sampling

SYC Design
strata
• type (a): groups of smaller institutions
• type (b): individual larger institutions
sampling units
• strata type (a)
  – first stage: institution by probability proportional to size of the institution
  – second stage: individual youths in custody
• strata type (b)
  – individual youths in custody
• individuals chosen by systematic random sampling

Effect of the Weights
• Example: age distribution, Survey of Youth in Custody
Age      Counts   Sum of Weights
11            1               28
12            9              149
13           53              764
14          167             2143
15          372             3933
16          622             5983
17          634             5189
18          334             2778
19          196             1763
20          122             1164
21           57              567
22           27              273
23           14              150
24           13              128
Totals     2621            25012

Unweighted Histogram
[Figure: Age Distribution of Youth in Custody, proportion (0 to 0.3) by age (11 to 24)]

Weighted Histogram
[Figure: Age Distribution of Youth in Custody, proportion (0 to 0.3) by age (11 to 24)]

Weighted versus Unweighted
[Figure: Weighted and Unweighted Histograms, proportion (0 to 0.3) by age (11 to 24), legend: Weighted, Unweighted]

General Approach to Analysis with Standard Software
• the software usually handles stratified two-stage cluster samples
• if there are more than two stages of
sampling, the later stages are usually ignored in the analysis
Reason
$V \cong \frac{S_1^2}{l} + \frac{S_2^2}{lm} + \frac{S_3^2}{lmn}$

Typical Models used in Analysis
Ordinary Regression
$E(y) = X\beta$
Logistic Regression
$\ln\left(\frac{\pi}{1-\pi}\right) = X\beta$

General Consequences of Using the Sampling Weights but Ignoring the Sampling Design
Inferences are usually based on the quadratic form
$(\hat{\beta} - \beta)^T \hat{V}^{-1} (\hat{\beta} - \beta)$
• $V$ is the variance-covariance matrix of the regression parameter estimates
• ignoring the survey design leads to estimates of $V$ that are too small
• therefore estimates of $V^{-1}$ are larger than they should be
• leads to test statistics that are larger than they should be (you find a significant result more often than you should)
• leads to confidence interval statements that are narrower than they should be

Regression Example: Ontario Health Survey
• size of the circle is related to the sum of the survey weights in the estimate
• more data in the BMI range 17 to 29 approximately
[Figure: DBMI versus BMI (binned), DBMI (15 to 30) against BMI (12 to 42)]

Ontario Health Survey
Regress desired body mass index (DBMI) on body mass index (BMI)
              Intercept             Slope
              Estimate   S.E.       Estimate   S.E.
STATA         10.877     0.141      0.4958     0.0058
Unweighted    11.196     0.064      0.4716     0.0025
Weighted      10.877     0.065      0.4858     0.0026

Youth Risk Behavior Survey
Recall
• sampling design is a stratified three-stage cluster sample
• need only to give the strata and first stage unit identifiers to do the analysis with available software

Demo in SPSS
Log into MyVlab
Youth Risk Behavior Survey Data File in an SPSS sav file
Find the survey design variables in the file
Prepare for Analysis by Specifying the Design
Specifying the design variables
Design options to choose from
Logistic Regression Analysis

Other Approaches
• The estimate of the variance of the regression parameters is obtained using a technique called Taylor linearization
  – the cluster identifiers are needed to carry out this procedure
  – due to privacy constraints StatCan will provide this information only through an RDC (Research Data Centre)
  – you need to apply to get into an RDC
• Alternate approach: the bootstrap
  – the bootstrap for complex surveys differs from the bootstrap for iid data sets
  – data file consists of several sets of bootstrap weights
  – calculate the estimates for each set of bootstrap weights and look at the variability in the estimates
  – can be done using SAS macros
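
The bootstrap-weight procedure in the last bullets can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the SAS macros mentioned on the slide: the function names are hypothetical, the estimate is a simple weighted mean, the variance is taken as the average squared deviation of the replicate estimates from the full-sample estimate (one common convention), and the toy values are made up for the example.

```python
from math import sqrt

def weighted_mean(y, w):
    """Weighted mean of y using survey weights w."""
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)

def bootstrap_se(y, final_wt, boot_wts):
    """Estimate and standard error from supplied sets of bootstrap weights.

    y         -- observed values, one per respondent
    final_wt  -- final survey weights, one per respondent
    boot_wts  -- list of bootstrap weight sets, each the same length as y
    """
    estimate = weighted_mean(y, final_wt)
    # Recompute the estimate once per set of bootstrap weights.
    replicates = [weighted_mean(y, w) for w in boot_wts]
    # Assumed convention: mean squared deviation from the full-sample estimate.
    variance = sum((r - estimate) ** 2 for r in replicates) / len(replicates)
    return estimate, sqrt(variance)

# Made-up toy data: two respondents and four sets of bootstrap weights.
toy_y = [1.0, 3.0]
toy_final_wt = [10.0, 10.0]
toy_boot_wts = [[20.0, 0.0], [0.0, 20.0], [10.0, 10.0], [10.0, 10.0]]
toy_estimate, toy_se = bootstrap_se(toy_y, toy_final_wt, toy_boot_wts)
```

The same loop applies to any estimator: replace weighted_mean with a weighted regression fit and collect the coefficient of interest from each replicate.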