Presentation Slides

The Analysis of Complex Surveys
David Bellhouse
A Typical Complex Design
Stratified two-stage cluster sampling
Strata
• geographical areas
First stage units
• smaller areas within the larger areas
Second stage units
• households
Clusters
• all individuals in the household
Why a Complex Design?
Classical reasons
• better coverage of the entire region of interest
(stratification)
• efficient for interviewing: less travel, less costly
A new reason
• to reduce cost the sample is piggybacked on a
sample chosen by a complex design such as the
Canadian Labour Force Survey
Problem: estimation and analysis are more
complex
Examples of Complex Designs
• Ontario Health Survey
• Canadian Community Health Survey
• Youth in Transition Survey
• Youth Risk Behavior Survey (U.S.)
Ontario Health Survey
• carried out in 1990
• health status of the population was
measured
• data were collected relating to the risk
factors associated with major causes of
morbidity and mortality in Ontario
• survey of 61,239 persons was carried
out in a stratified two-stage cluster
sample by Statistics Canada
OHS
Sample Selection
• strata: public health
units – divided into
rural and urban strata
• first stage:
enumeration areas
defined by the 1986
Census of Canada and
selected by pps
• second stage:
dwellings selected by
SRS
• cluster: all persons in
the dwelling
Youth in Transition Survey
Reading cohort (15-year-olds): stratified two-stage
sampling
• school population stratified by province,
language of instruction and enrollment size
• 1,200 schools selected within strata
• eligible students selected within each sampled
school
– the initial student sample size for the survey
conducted in 2000 was 38,000.
Canadian Community Health Survey
Piggybacked on the Canadian Labour Force
Survey (LFS)
• LFS design
– stratified by province and economic regions
– geographical areas (usually enumeration
areas) within strata chosen with probability
proportional to the population size of the area
– dwellings chosen within geographical areas
– all persons in the dwelling interviewed
Youth Risk Behavior Survey
Stratified three-stage cluster sample
• strata are metropolitan statistical areas
• primary sampling units are large counties
or groups of smaller counties
• second stage units are schools within
counties
• third stage units are classes within schools
• all students within a class are interviewed
Basic Problem in Survey Data Analysis
[Figure: two images separated by a ≠ sign]
Issues
iid (independent and identically distributed)
assumption
• the assumption does not hold in
complex surveys because of correlations
induced by the sampling design or
because of the population structure
• blindly applying standard programs to the
analysis can lead to incorrect results
Two Simple Examples to Illustrate the
Problems Involved in Analyzing “Complex”
Samples
• an old Ontario lottery called Lottario that is
similar to Lotto 6/49
• a pay equity lawsuit involving a stratified
sampling design
Lottery Example
Lottario – old Ontario lottery, a Lotto 6/39
• seven numbers chosen on a draw night –
six regular numbers and a bonus number
• winning numbers collected for 167 draws
ending in January 1982
• want to test whether each of the 39
numbers (or balls) has the same chance of
being chosen
Breakdown of Independence
Assumption
• On a draw night, the numbers are chosen
by simple random sampling without
replacement
– numbers chosen within a draw are not
independent
• Between draws the balls are replaced to
be drawn again on the next draw
– numbers chosen between draws are
independent
[Figure: bar chart "Lottario Draws up to January 1982"; x-axis: Ball Number (1 to 39), y-axis: Frequency (0 to 50)]
Chi-square Test for Equality of
Proportions
• Ignoring the lack of independence
– test statistic = 48.674
– p-value = 0.115
• Taking into account the lack of
independence
– test statistic = (38/32)*48.674 = 57.8
– p-value = 0.021
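The factor 38/32 is not derived on the slide. One reading, offered here as an assumption, is that it is the inverse of the finite population correction (N − n)/(N − 1) for drawing n = 7 balls without replacement from N = 39, which shrinks the variance of the ball counts relative to the multinomial model behind the usual chi-square test. The sketch below reproduces the two statistics and p-values under that reading; the helper `chi2_sf_even_df` is written for this illustration and is only valid for even degrees of freedom (here 38).

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable (closed form, even df only)."""
    if df % 2 != 0:
        raise ValueError("this closed form only holds for even df")
    half = x / 2.0
    term, total = 1.0, 1.0          # k = 0 term of the Poisson-type sum
    for k in range(1, df // 2):     # k = 1 .. df/2 - 1
        term *= half / k
        total += term
    return math.exp(-half) * total

N_BALLS = 39          # balls in the drum
BALLS_PER_DRAW = 7    # six regular numbers plus the bonus number
naive_stat = 48.674   # chi-square statistic reported on the slide
df = N_BALLS - 1

# Assumed correction: sampling without replacement within a draw
fpc = (N_BALLS - BALLS_PER_DRAW) / (N_BALLS - 1)   # 32/38
adjusted_stat = naive_stat / fpc                   # (38/32) * 48.674

p_naive = chi2_sf_even_df(naive_stat, df)
p_adjusted = chi2_sf_even_df(adjusted_stat, df)

print(f"naive:    statistic = {naive_stat:.3f}, p = {p_naive:.3f}")
print(f"adjusted: statistic = {adjusted_stat:.1f}, p = {p_adjusted:.3f}")
```

Ignoring the dependence within a draw understates the statistic, so the naive test fails to reject at the 5% level while the adjusted test rejects.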
Pay Equity Example
Pay equity survey dispute: Canada Post and
PSAC
• two job evaluations on the same set of people
(and same set of information) carried out in 1987
and 1993
• rank correlation between the two sets of job
values obtained through the evaluations was
0.539
• assumption to obtain a valid estimate of
correlation: pairs of observations are iid
Scatterplot of Evaluations
[Figure: scatterplot; x-axis: Rank in 1987 (0 to 200), y-axis: Rank in 1993 (0 to 200)]
• Rank correlation is 0.539
A Stratified Design with Distinct
Differences Between Strata
• the pay level increases with each pay
category (four in number)
• the job value also generally increases with
each pay category
• therefore the observations are not iid
Scatterplot by Pay Category
[Figure: scatterplot with points marked by pay category (2, 3, 4, 5); x-axis: Rank in 1987 (0 to 200), y-axis: Rank in 1993 (0 to 200)]
Correlations within Level
Correlations within each pay level
• Level 2: –0.293
• Level 3: –0.010
• Level 4: 0.317
• Level 5: 0.496
Only Level 4 is significantly different from 0
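The gap between the pooled rank correlation and the within-level correlations can be reproduced with a toy data set. The sketch below uses made-up values (not the Canada Post data): within each of four hypothetical pay levels the two evaluations are uncorrelated, but both increase from level to level, so pooling the levels produces a large rank correlation. The functions `ranks` and `spearman` are illustrative helpers that assume no tied values.

```python
def ranks(values):
    """Ranks 1..n of the values; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        r[idx] = rank
    return r

def spearman(x, y):
    """Spearman rank correlation via the sum of squared rank differences (no ties)."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Synthetic strata: the same scrambled pattern inside every level,
# shifted upward by 10 points per level.
pattern = [1, 3, 0, 2]
strata = {}
for level in (2, 3, 4, 5):
    base = 10 * level
    x = [base + i for i in range(4)]      # "1987" job values
    y = [base + p for p in pattern]       # "1993" job values
    strata[level] = (x, y)

within = {level: spearman(x, y) for level, (x, y) in strata.items()}
pooled_x = [v for x, _ in strata.values() for v in x]
pooled_y = [v for _, y in strata.values() for v in y]
pooled = spearman(pooled_x, pooled_y)

print("within-level correlations:", within)
print("pooled correlation:", round(pooled, 3))
```

In this toy example every within-level correlation is zero, yet the pooled value is above 0.9, because the pooled figure mostly reflects differences between strata.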
Available Software for Complex Survey
Analysis
• commercial packages
– STATA
– SAS
– SPSS
– SUDAAN
• noncommercial package
– R
Typical Survey Data File
stratum  psu  initwt  finalwt  age  race  ethnicty  educ  sex
      7   20       8       10   14     2         2     7    1
      7   20       8       10   13     2         2     7    1
      7   20       8       10   12     2         2     5    2
      7   20       8       10   15     2         2     8    1
      7   20       8       10   14     2         1     7    1
      7   20       8       14   14     1         2     9    1
      7   20       8       14   16     1         2     9    2
     12   21       8       10   17     2         2     9    1
     12   21       8       10   16     2         2     9    2
     12   21       8       10   14     2         9     8    1
     12   21       8       10   16     2         2     9    1
     12   21       8       10   16     2         2    10    1
     12   21       8       10   16     2         2     9    1
     12   21       8       10   18     2         2    11    1
     12   21       8       14   17     1         2    11    1
     12   21       8       14   17     1         2    11    1
     12   21       8       10   16     2         2     9    1
     12   21       8       14   17     1         2    10    1
     12   21       8       10   16     2         2     9    1
     12   21       8       10   13     2         2     7    1
     12   21       8       10   17     2         2    10    1
Survey Weights: Definitions
initial weight
• equal to the inverse of the inclusion
probability of the unit
final weight
• initial weight adjusted for nonresponse,
poststratification and/or benchmarking
• interpreted as the number of units in the
population that the sample unit represents
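A small numerical sketch of these definitions follows. The inclusion probability, the response counts and the y-values are hypothetical, chosen only so that the resulting weights (8 and 10) resemble the initwt and finalwt values in the data file slide; the weighting-class nonresponse adjustment shown is one common choice and is not necessarily the adjustment used for any survey discussed here.

```python
def initial_weight(inclusion_prob):
    """Initial (design) weight: inverse of the unit's inclusion probability."""
    return 1.0 / inclusion_prob

def nonresponse_adjusted(weight, n_sampled, n_responded):
    """Weighting-class adjustment: respondents also represent the nonrespondents."""
    return weight * n_sampled / n_responded

# Hypothetical unit selected with probability 1/8
w_initial = initial_weight(1 / 8)                                      # 8.0
# Hypothetical class in which 4 of 5 sampled units responded
w_final = nonresponse_adjusted(w_initial, n_sampled=5, n_responded=4)  # 10.0

# Using final weights: each respondent stands for `weight` population units
final_weights = [10, 10, 14]
y = [1, 0, 1]                        # hypothetical 0/1 characteristic
estimated_population = sum(final_weights)
estimated_total = sum(w * v for w, v in zip(final_weights, y))

print(w_initial, w_final, estimated_population, estimated_total)
```

The sum of the final weights estimates the population size, and the weighted sum of y estimates the number of population units with the characteristic.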
Interpretation
• the survey weight for a particular sample unit is the number of
units in the population that the unit represents
[Figure legend: Not sampled, Wt = 2, Wt = 5, Wt = 6, Wt = 7]
Some Consequences of Ignoring the
Weights: Survey of Youth in Custody
• first U.S. survey of youths confined to long-term, state-operated institutions
• complemented existing Children in Custody
censuses.
• companion survey to the Surveys of State
Prisons
• the data contain information on criminal
histories, family situations, drug and alcohol
use, and peer group activities
• survey carried out in 1989 using stratified
systematic sampling
SYC Design
strata
• type (a) groups of smaller institutions
• type (b) individual larger institutions
sampling units
• strata type (a)
– first stage: institution, selected with probability proportional to
the size of the institution
– second stage: individual youths in custody
• strata type (b)
– individual youths in custody
• individuals chosen by systematic random
sampling
Effect of the Weights
• Example: age distribution, Survey of Youth in Custody

Age      Counts   Sum of Weights
11            1               28
12            9              149
13           53              764
14          167             2143
15          372             3933
16          622             5983
17          634             5189
18          334             2778
19          196             1763
20          122             1164
21           57              567
22           27              273
23           14              150
24           13              128
Totals     2621            25012
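The unweighted and weighted age distributions can be computed directly from the counts and sums of weights in this table. The sketch below does only that arithmetic; the variable names are chosen for this illustration.

```python
ages = list(range(11, 25))
counts = [1, 9, 53, 167, 372, 622, 634, 334, 196, 122, 57, 27, 14, 13]
weight_sums = [28, 149, 764, 2143, 3933, 5983, 5189,
               2778, 1763, 1164, 567, 273, 150, 128]

# Proportions ignoring the weights and proportions using the weights
unweighted = [c / sum(counts) for c in counts]
weighted = [w / sum(weight_sums) for w in weight_sums]

print("Age  Unweighted  Weighted")
for age, u, w in zip(ages, unweighted, weighted):
    print(f"{age:>3}  {u:10.3f}  {w:8.3f}")

# Most common age under each approach
mode_unweighted = ages[unweighted.index(max(unweighted))]
mode_weighted = ages[weighted.index(max(weighted))]
print("modal age, unweighted:", mode_unweighted)
print("modal age, weighted:  ", mode_weighted)
```

The largest difference is at age 17, whose share drops from about 0.24 unweighted to about 0.21 weighted; the most common age shifts from 17 to 16 once the weights are used.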
Unweighted Histogram
[Figure: histogram "Age Distribution of Youth in Custody" (unweighted); x-axis: Age (11 to 24), y-axis: Proportion (0 to 0.3)]
Weighted Histogram
[Figure: histogram "Age Distribution of Youth in Custody" (weighted); x-axis: Age (11 to 24), y-axis: Proportion (0 to 0.3)]
Weighted versus Unweighted
[Figure: "Weighted and Unweighted Histograms"; x-axis: Age (11 to 24), y-axis: Proportion (0 to 0.3); series: Weighted, Unweighted]
General Approach to Analysis
with Standard Software
• the software usually handles stratified two-stage
cluster samples
• if there are more than two stages of sampling
the latter stages are usually ignored in the
analysis
Reason
$$V \cong \frac{S_1^2}{l} + \frac{S_2^2}{lm} + \frac{S_3^2}{lmn}$$
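The formula on this slide, V ≅ S1²/l + S2²/(lm) + S3²/(lmn), is not annotated. A plausible reading, stated here as an assumption, is that S1², S2², S3² are the variance components at the first, second and third stages and l, m, n are the numbers of units sampled at each stage. Under that reading the later terms are divided by progressively larger numbers, which is why dropping them usually changes the variance little. The sketch below uses hypothetical values to show the size of each term.

```python
def three_stage_terms(s1_sq, s2_sq, s3_sq, l, m, n):
    """Return the three terms of V ~ S1^2/l + S2^2/(l*m) + S3^2/(l*m*n)."""
    first = s1_sq / l
    second = s2_sq / (l * m)
    third = s3_sq / (l * m * n)
    return first, second, third

# Hypothetical inputs: equal variance components at every stage,
# 50 first-stage units, 10 second-stage units each, 20 third-stage units each.
first, second, third = three_stage_terms(4.0, 4.0, 4.0, l=50, m=10, n=20)
total = first + second + third
share_first = first / total

print(f"first stage term:  {first:.4f}")
print(f"second stage term: {second:.4f}")
print(f"third stage term:  {third:.4f}")
print(f"share of variance from the first stage: {share_first:.1%}")
```

With these made-up numbers the first-stage term accounts for roughly 90% of the approximate variance.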
Typical Models used in Analysis
Ordinary Regression
$$E(y) = X\beta$$
Logistic Regression
$$\ln\left(\frac{\pi}{1-\pi}\right) = X\beta$$
General Consequences of Using the Sampling
Weights but Ignoring the Sampling Design
Inferences are usually based on the quadratic form
$$(\hat{\beta} - \beta)^T \hat{V}^{-1} (\hat{\beta} - \beta)$$
• V is the variance-covariance matrix of the regression
parameter estimates
• ignoring the survey design leads to estimates of V that
are too small
• therefore estimates of $V^{-1}$ are larger than they should be
• leads to test statistics that are larger than they should be
(you find a significant result more often than you should)
• leads to confidence interval statements that are narrower
than they should be
Regression Example: Ontario Health Survey
• size of the circle is related to the sum of the survey weights in
the estimate
• more data in the BMI range 17 to 29 approximately
[Figure: scatterplot "DBMI versus BMI (binned)"; x-axis: BMI (12 to 42), y-axis: DBMI (15 to 30)]
Ontario Health Survey
Regress desired body mass index (DBMI) on body
mass index (BMI)
             STATA    Unweighted   Weighted
Intercept
  Estimate   10.877     11.196      10.877
  S.E.        0.141      0.064       0.065
Slope
  Estimate   0.4958     0.4716      0.4858
  S.E.       0.0058     0.0025      0.0026
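The standard errors on this slide show how much the design matters. The sketch below assumes that the STATA column (S.E. 0.141 and 0.0058) is the design-based analysis and that the Weighted column (S.E. 0.065 and 0.0026) uses the weights but ignores the design; that reading of the columns is an assumption. It computes the ratio of standard errors and its square, a quantity often called a design effect (a term not used in the slides).

```python
# (estimate, standard error) pairs as read from the slide
design_based = {"intercept": (10.877, 0.141), "slope": (0.4958, 0.0058)}
weighted_naive = {"intercept": (10.877, 0.065), "slope": (0.4858, 0.0026)}

se_ratio = {}
variance_ratio = {}
for name in design_based:
    _, se_design = design_based[name]
    _, se_naive = weighted_naive[name]
    se_ratio[name] = se_design / se_naive
    variance_ratio[name] = se_ratio[name] ** 2

for name in design_based:
    print(f"{name}: S.E. ratio = {se_ratio[name]:.2f}, "
          f"variance ratio = {variance_ratio[name]:.2f}")
```

Under this reading the design-based standard errors are a little more than twice the naive ones, so a confidence interval that ignores the design would be less than half as wide as it should be.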
Youth Risk Behavior Survey
Recall
• sampling design is a stratified three-stage
cluster sample
• need only to give the strata and first stage
unit identifiers to do the analysis with
available software
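To see why the strata and first-stage unit identifiers are enough, consider the usual between-PSU (sometimes called ultimate cluster) variance estimator for a weighted total, which treats first-stage units as if sampled with replacement within strata. The sketch below is an illustration with hypothetical records, not the algorithm of any particular package: it aggregates weighted values to the PSU level and measures the variability of the PSU totals within each stratum, so later sampling stages never enter the calculation.

```python
from collections import defaultdict

def total_and_variance(records):
    """records: iterable of (stratum, psu, weight, y).

    Returns the weighted total and its between-PSU variance estimate.
    """
    psu_totals = defaultdict(lambda: defaultdict(float))
    for stratum, psu, weight, y in records:
        psu_totals[stratum][psu] += weight * y

    estimate = 0.0
    variance = 0.0
    for stratum, totals in psu_totals.items():
        z = list(totals.values())          # weighted PSU totals in this stratum
        n_h = len(z)
        if n_h < 2:
            raise ValueError(f"stratum {stratum} needs at least two PSUs")
        mean_z = sum(z) / n_h
        estimate += sum(z)
        variance += n_h / (n_h - 1) * sum((v - mean_z) ** 2 for v in z)
    return estimate, variance

# Hypothetical data: two strata, two PSUs each
records = [
    (1, 1, 10, 1), (1, 1, 10, 0),
    (1, 2, 10, 1), (1, 2, 10, 1),
    (2, 3, 14, 1),
    (2, 4, 14, 0), (2, 4, 14, 1),
]
estimate, variance = total_and_variance(records)
print("estimated total:", estimate)
print("estimated variance:", variance)
```

Any subsampling inside a PSU is already reflected in the variability of the PSU totals, which is the reason the later stages can be left out of the design specification.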
Demo in SPSS
Log into MyVlab
Youth Risk Behavior Survey
Data File in an SPSS sav file
Find the survey design variables in the file
Prepare for Analysis by Specifying the Design
Specifying the design variables
Design options to choose from
Logistic Regression Analysis
Other Approaches
• The estimate of the variance of the regression
parameters is obtained using a technique called
Taylor linearization
– the cluster identifiers are needed to carry out this
procedure
– due to privacy constraints StatCan will provide this
information only through an RDC (Research Data Centre)
– you need to apply to get into an RDC
• Alternate approach – the bootstrap
– the bootstrap for complex surveys works differently
from the bootstrap for iid data sets
– data file consists of several sets of bootstrap weights
– calculate the estimates for each set of bootstrap
weights and look at the variability in the estimates
– can be done using SAS macros
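The bootstrap-weight procedure can be sketched in a few lines. The example below is a minimal illustration with hypothetical data and only four sets of bootstrap weights (real files carry many more); it re-estimates a weighted mean with each set and summarizes the spread of the replicates around the full-sample estimate. Centering on the full-sample estimate and dividing by the number of replicates is one common convention and is assumed here; the documentation of the survey in question should be checked for the exact formula.

```python
def weighted_mean(values, weights):
    """Weighted mean of the values."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)

def bootstrap_variance(values, final_weights, bootstrap_weight_sets):
    """Variance of a weighted mean from replicate (bootstrap) weights."""
    full = weighted_mean(values, final_weights)
    replicates = [weighted_mean(values, bw) for bw in bootstrap_weight_sets]
    var = sum((r - full) ** 2 for r in replicates) / len(replicates)
    return full, var

# Hypothetical respondents, final weights, and bootstrap weight sets
values = [1, 0, 1, 0]
final_weights = [10, 10, 10, 10]
bootstrap_weight_sets = [
    [20, 0, 10, 10],
    [0, 20, 10, 10],
    [10, 10, 20, 0],
    [10, 10, 0, 20],
]

estimate, variance = bootstrap_variance(values, final_weights, bootstrap_weight_sets)
print("estimate:", estimate)
print("bootstrap variance:", variance)
print("bootstrap standard error:", variance ** 0.5)
```

The same loop works for any estimator that accepts weights, such as regression coefficients, and it needs only the bootstrap weights on the file rather than the cluster identifiers.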