ERRORS IN SAMPLE SURVEYS

TAUQUEER AHMAD
Indian Agricultural Statistics Research Institute
Library Avenue, New Delhi-110 012
tauqueer@iasri.res.in
1. Introduction
In probability sampling, when the observation yi on the ith unit is the correct value for that
unit, the error of the estimate arises purely from random sampling variation, i.e. because only
a fraction of the units is measured instead of the complete population. This deviation of the
sample statistic from the population parameter is usually called sampling error. It is well
known that if all units of the population are measured, the estimate will be free from
sampling error. In practice, however, it may not always be possible to obtain the true
observation yi on the ith unit. Consequently, an estimate based on a sample will also involve
errors other than sampling errors. All errors in estimation that are not the result of
sampling are called non-sampling errors, i.e. they form the residual category. Sampling
errors arise because the estimate is based on a ‘part’ of the ‘whole’, while non-sampling
errors arise mainly from some departure from the prescribed rules of the survey, whether in
survey design, field work, tabulation or analysis of data. This is why census results,
though free from sampling errors, are subject to various types of non-sampling errors.
Sometimes these non-sampling errors are more important than the sampling errors and may
affect the results substantially.
2. Classification of Non Sampling Errors
Non-sampling errors arise from numerous factors and at almost every stage of a survey,
from planning to report writing. To study the different aspects of non-sampling errors
effectively, it is desirable to classify them according to the source, the stage of the
survey, or the type of error.
One approach is to classify non-sampling errors by the stage of the survey at which they
occur. The three major stages of a survey are
1. Survey design and preparation,
2. Data collection, and
3. Data processing and analysis.
This classification is more useful for discussing measures for the control of non-sampling
errors. A survey activity checklist for control of non-sampling errors is given in the Appendix.
A second approach is to classify non-sampling errors on the basis of the source or type of
error. The three major categories under this classification are
(i) Coverage errors,
(ii) Non-response errors, and
(iii) Measurement or response errors.
This type of classification is more useful for discussing the implications of non-sampling
errors and for suggesting methods of obtaining unbiased estimators in the presence of such
errors. These categories are discussed in detail in the following sections.
3. Coverage Errors
The objective of any survey is to make inferences about a desired or Target Population.
For this purpose, the sample is selected by applying an appropriate randomized procedure to
a sampling frame in which all the units of the Target Population are supposed to be
represented uniquely. Coverage errors arise mainly from the use of a faulty frame of
sampling units. For example, if a household survey selects its sample from an old list of
households prepared for a population census a few years earlier, newly added households
will not form part of the sampling frame, whereas a number of households that have
already migrated will remain in it. The use of such frames may thus lead either to the
inclusion of some units not belonging to the Target Population or to the omission of some
units that do belong to it. Coverage errors may also arise from incorrect specifications
or ignorance of the correct procedure by field workers, failure to identify the actual units
selected, intentional or unintentional enumeration of wrong units by the enumerator, etc.
Some dishonest enumerators complete questionnaires for imaginary, made-up households and
submit them in place of the actual households. In the USA this practice is known as
curbstoning.
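The two kinds of frame defect described above (omission of units that belong to the Target Population and inclusion of units that do not) can be illustrated with a small sketch. The household identifiers and the function name coverage_report are hypothetical and serve only as an illustration; in practice the true Target Population is of course not available as a list.

```python
def coverage_report(frame, target):
    """Compare a sampling frame with the target population (sets of unit ids)."""
    frame, target = set(frame), set(target)
    missing = target - frame   # undercoverage: units that can never be selected
    foreign = frame - target   # overcoverage: units that no longer belong
    return {
        "undercoverage": sorted(missing),
        "overcoverage": sorted(foreign),
        "undercoverage_rate": len(missing) / len(target),
        "overcoverage_rate": len(foreign) / len(frame),
    }


# Hypothetical data: an old census list versus the households present today.
old_census_list = {"H01", "H02", "H03", "H04", "H05"}
current_households = {"H02", "H03", "H04", "H05", "H06", "H07"}

report = coverage_report(old_census_list, current_households)
print(report["undercoverage"])  # newly added households missing from the frame
print(report["overcoverage"])   # migrated households still listed in the frame
```

Here H06 and H07 play the role of newly added households and H01 that of a household that has migrated but remains on the list.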
Rules of association also frequently cause non-sampling errors. For example, in
household surveys the choice between de jure and de facto status (de jure: usual residence;
de facto: actual presence of the individual at the time of interview) may be a cause of
some non-sampling errors.
Hansen, Hurwitz and Jabine (1964) presented a detailed account of dealing with imperfect
frames and proposed a useful technique, known as the predecessor-successor method, to obtain
information on omissions in the frame. Seal (1962) discussed the use of outdated frames
in large scale surveys, assuming the changes in the population to be a continuous
stochastic process. Hartley (1962) proposed the use of two or more frames to overcome
the problem of incomplete frames. Singh (1983,1986,1989) presented a mathematical
formulation of the predecessor-successor method for estimating the total number of
units missing from the frame and the total of the character under study for the
Target Population.
4. Non Response Errors
Non-response errors arise from various causes at every stage of the survey, from
design and planning to execution. Most non-response errors, however, arise from
• Not-at-home: the respondent may not be at home when the enumerators call, and
• Refusal: the respondent may refuse to provide information to the enumerators for one
reason or another. (In most surveys there is no legal obligation to respond.)
A Panel on Incomplete Data was established by the Committee on National Statistics,
National Research Council, Washington in 1977. The panel prepared three volumes on
‘Incomplete Data in Sample Surveys’, published by Academic Press in 1983. These
publications provide detailed information on several case studies, theory and
bibliographies.
The first attempt to deal with the problem of non-response was perhaps made by Hansen
and Hurwitz (1946) through the call-back method. They assumed the population to be divided
into two classes: (i) a response class, where respondents respond at the first attempt, and (ii)
a non-response class, where respondents do not respond at the first attempt. Another
method, which obtains unbiased estimators from the information collected from respondents
at the first attempt only, was proposed by Politz and Simmons (1949). Kish and Hess
(1959) proposed adding a sample of non-responding units from previous surveys to
obtain information about the non-respondents. Another method often used to deal with
non-response is substitution, i.e. replacing non-respondent units by other similar
units. However, substitution does not eliminate the bias due to non-response.
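The two-class view behind the call-back method can be made concrete with a short numerical sketch. The text above does not state the estimator; the code assumes the form usually attributed to Hansen and Hurwitz, in which a sub-sample of the first-attempt non-respondents is followed up and the two class means are weighted by their shares n1/n and n2/n of the original sample. The function name and the data are illustrative only.

```python
from statistics import mean


def hansen_hurwitz_mean(first_attempt, followup, n_nonresp):
    """Estimate the population mean from a call-back design.

    first_attempt : values from the n1 units that responded at the first attempt
    followup      : values from the sub-sample of non-respondents that was called back
    n_nonresp     : n2, the number of sampled units that did not respond at first
    """
    n1 = len(first_attempt)
    n = n1 + n_nonresp
    # Each class mean is weighted by the share of its class in the original sample.
    return (n1 * mean(first_attempt) + n_nonresp * mean(followup)) / n


# Hypothetical sample of n = 10 units: 4 respond at once, 6 do not,
# and 3 of those 6 are reached by call-backs.
respondents = [10, 12, 14, 12]
called_back = [20, 22, 18]
estimate = hansen_hurwitz_mean(respondents, called_back, n_nonresp=6)
naive = mean(respondents)  # ignores the non-response class altogether
print(estimate, naive)
```

In this made-up example the non-respondents have larger values than the respondents, so the mean of the first-attempt respondents alone understates the weighted estimate.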
4.1 Method of Imputation for Missing Data
The extent of non-response varies greatly between questions. Items such as race
and sex usually have little non-response; on the other hand, receipts of income from
various sources may have high non-response (Kalton, Kasprzyk and Santos 1981). The
multivariate nature of surveys, with all variables potentially subject to missing data,
suggests the need for a general purpose strategy for handling item non-response.
Imputation, defined as the process of estimating individual missing values in a data set, has
become quite popular for dealing with item non-response. Kalton (1982) has given three
important desirable features of imputation procedures:
• As with weighting adjustments for total non-response, it aims to reduce biases in survey
estimates arising from missing data,
• By assigning values at the micro-level and thus allowing analysis to be conducted as if the
data set were complete, it makes analysis easier to conduct and results easier to
present. Complex algorithms to estimate population parameters in the presence of
missing data are not required, and
• The results obtained from different analyses are bound to be consistent, a feature
which need not apply with an incomplete data set.
Imputation of missing data does, however, have its drawbacks. It is a last-resort activity,
which may be justifiable for statistical data, and it is certainly not a cure for poor data
quality but often a symptom of it. It does not necessarily lead to estimates that are less biased
than those obtained from the incomplete data set; indeed the biases could be much greater,
depending on the imputation procedure and the form of estimate. There is also the risk
that the analyst may treat the completed data set as if all the data were actual responses,
thereby overstating the precision of the survey estimates. Therefore, analysts working
with a data set containing imputed values should proceed with caution, and be aware of
the extent of imputation for the variables in their analysis as well as the details of the
procedures used.
Platek and Gray (1983) discussed the total survey error model for dealing with imputation
methodology and obtained the contribution of the different components to the total variance.
Singh and Rai (1983) examined the effect of various imputation procedures on survey
results and studied some important imputation procedures empirically.
4.1.1 Traditional Methods of Imputation
An imputation procedure is a procedure that imputes, for each missing value, a value
that is assumed to be quite close to the true missing value. A wide variety of
imputation methods have been developed for assigning values for missing item responses
(Kalton and Kasprzyk, 1986). Imputation is most useful when the value imputed for
any missing item is based on homogeneous imputation classes.
Deductive Imputation: Sometimes the missing answer to an item can be deduced with
certainty from the pattern of responses to other items. Edit checks should check for
consistency between responses to related items. When the edit checks constrain a missing
response to only one possible value, deductive imputation can be employed. Deductive
imputation is the ideal form of imputation.
Mean Imputation: Missing values are replaced by the mean of all responding values for
the variable. This can be done based on the whole dataset or separately for different
categories of respondents defined by combinations of selected classification variables.
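Mean imputation, overall or within classes, can be sketched as follows. The record layout (a list of dictionaries with None marking a missing value), the key names and the function name mean_impute are assumptions made for this illustration. A class with no respondents would raise a KeyError in this sketch; classes would have to be collapsed first.

```python
from collections import defaultdict
from statistics import mean


def mean_impute(records, value_key, class_key=None):
    """Replace missing values (None) by the respondent mean, overall or within classes."""
    def cls(rec):
        return rec[class_key] if class_key is not None else "all"

    reported = defaultdict(list)
    for rec in records:
        if rec[value_key] is not None:
            reported[cls(rec)].append(rec[value_key])
    class_means = {c: mean(v) for c, v in reported.items()}

    completed = []
    for rec in records:
        rec = dict(rec)                        # copy so the input is left untouched
        rec["imputed"] = rec[value_key] is None  # keep a flag for later analysis
        if rec["imputed"]:
            rec[value_key] = class_means[cls(rec)]
        completed.append(rec)
    return completed


# Hypothetical records with two missing incomes.
records = [
    {"region": "north", "income": 100},
    {"region": "north", "income": None},
    {"region": "north", "income": 140},
    {"region": "south", "income": 60},
    {"region": "south", "income": None},
]
overall = mean_impute(records, "income")
by_region = mean_impute(records, "income", class_key="region")
print([r["income"] for r in overall])
print([r["income"] for r in by_region])
```

The "imputed" flag is kept so that an analyst can see how much of a variable was imputed rather than reported.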
Zero Imputation: It is a method of imputation in which zero is substituted for the
missing data when a unit fails to respond.
Regression Imputation: This method uses respondent data to regress the variable for
which imputations are required on a set of auxiliary variables. The regression equation is
then used to predict the values for the missing responses. The imputed value may either be
the predicted value or the predicted value plus some residual. There are several ways in
which the residual may be obtained.
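A minimal sketch of regression imputation with a single auxiliary variable is given below. It assumes a simple least-squares line fitted on the respondents; drawing the residual at random from the respondents' fitted residuals is only one of the several possible ways mentioned above, chosen here for illustration. The function name and the data are hypothetical.

```python
import random
from statistics import mean


def regression_impute(x, y, add_residual=False, seed=0):
    """Impute missing y (None) from a least-squares line fitted on the respondents."""
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    xbar = mean(p[0] for p in pairs)
    ybar = mean(p[1] for p in pairs)
    sxx = sum((xi - xbar) ** 2 for xi, _ in pairs)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in pairs)
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    residuals = [yi - (intercept + slope * xi) for xi, yi in pairs]

    rng = random.Random(seed)
    completed = []
    for xi, yi in zip(x, y):
        if yi is None:
            yi = intercept + slope * xi          # predicted value
            if add_residual:
                yi += rng.choice(residuals)      # assumption: random respondent residual
        completed.append(yi)
    return completed


# Hypothetical data: auxiliary variable x known for all units, y missing for one.
x = [1, 2, 3, 4, 5]
y = [2, 4, None, 8, 10]
completed = regression_impute(x, y)
print(completed)
```

Adding a residual is meant to restore some of the variability that is lost when every missing value is placed exactly on the fitted line.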
Cold-deck Imputation: Missing values are replaced by values of older data, e.g. from a
previous survey, which could furthermore be adjusted for trend.
Hot-deck Imputation: In general, a hot-deck procedure is a duplication process - when a
value is missing from a sample, a reported value is duplicated to represent this missing
value. The adjective “hot” refers to imputing with values from the current sample. This
procedure usually has some classification process associated with it. All of the sample
units are classified into disjoint groups so that the units are as homogeneous as possible
within each group. For each missing value, a reported value is imputed which is in the
same classification group. Thus, the assumption is made that within each classification
group the non-respondents follow the same distribution as the respondents. Current survey
practice uses many variations of hot-deck procedures.
A sequential hot-deck procedure is one in which the sample is put in some type of order
within each classification group, and for each missing value the previous reported value is
duplicated. For example, the ordering might be based on a geographic variable. The result
of a geographic ordering is that the reported value duplicated for a missing value is from a
unit which is geographically close to the unit with the missing value. The sequential
hot-deck suffers the disadvantage that it may easily make multiple uses of donors, a feature
that leads to a loss of precision in survey estimates.
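The sequential hot-deck can be sketched in a few lines. The record layout, the key names ("id", "district", "km", "yield") and the function name are hypothetical. The sketch also counts how often each donor is used, which makes the multiple-use problem visible.

```python
def sequential_hot_deck(records, value_key, class_key, order_key):
    """Within each class, sort by order_key and copy the previous reported value forward."""
    completed = [dict(r) for r in records]   # copies, so the input is left untouched
    donor_uses = {}
    last_reported = {}                       # class -> (donor id, value) seen most recently
    for rec in sorted(completed, key=lambda r: (r[class_key], r[order_key])):
        cls = rec[class_key]
        if rec[value_key] is not None:
            last_reported[cls] = (rec["id"], rec[value_key])
        elif cls in last_reported:
            donor_id, value = last_reported[cls]
            rec[value_key] = value
            rec["donor"] = donor_id
            donor_uses[donor_id] = donor_uses.get(donor_id, 0) + 1
        # A missing value with no earlier respondent in its class stays missing.
    return completed, donor_uses


# Hypothetical plots ordered by distance (km) along a road within each district.
plots = [
    {"id": 1, "district": "A", "km": 1, "yield": 30},
    {"id": 2, "district": "A", "km": 2, "yield": None},
    {"id": 3, "district": "A", "km": 3, "yield": None},
    {"id": 4, "district": "B", "km": 1, "yield": 50},
    {"id": 5, "district": "B", "km": 2, "yield": None},
]
completed, donor_uses = sequential_hot_deck(plots, "yield", "district", "km")
print([r["yield"] for r in completed])
print(donor_uses)
```

In this example two consecutive non-respondents in district A both receive the value of plot 1, so that donor is used twice.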
The above disadvantage of the sequential hot-deck is avoided in the hierarchical
hot-deck method. The procedure sorts respondents and non-respondents into a large number of
imputation classes from a detailed categorization of a sizeable set of auxiliary variables.
Non-respondents are then matched with respondents on a hierarchical basis, in the sense
that if a match cannot be made in the initial imputation class, classes are collapsed and the
match is made at a lower level of detail.
Another form of hot-deck method is distance function matching, which assigns a
non-respondent the value of the ‘nearest’ respondent, where ‘nearest’ is defined in terms of a
distance function for the auxiliary variable. Various forms of distance function have been
proposed and the function can be constructed to reduce the multiple uses of donors by
incorporating a penalty for each use.
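Distance function matching with a penalty for repeated donor use might be sketched as below. The absolute difference on one auxiliary variable and the additive penalty per previous use are assumptions made for this illustration; the text does not prescribe a particular distance function.

```python
def nearest_neighbour_impute(respondents, nonrespondents, penalty=0.0):
    """respondents: list of (x, y) pairs; nonrespondents: list of x values.

    Returns the imputed y values and the number of times each respondent was used.
    """
    uses = [0] * len(respondents)
    imputed = []
    for x_miss in nonrespondents:
        # Distance on the auxiliary variable, inflated by a penalty per previous use.
        j = min(
            range(len(respondents)),
            key=lambda i: abs(respondents[i][0] - x_miss) + penalty * uses[i],
        )
        uses[j] += 1
        imputed.append(respondents[j][1])
    return imputed, uses


# Hypothetical data: three non-respondents all lie closest to the first respondent.
resp = [(10, 100), (12, 130), (30, 300)]
nonresp = [10.5, 10.6, 10.4]
plain, plain_uses = nearest_neighbour_impute(resp, nonresp)
spread, spread_uses = nearest_neighbour_impute(resp, nonresp, penalty=2.0)
print(plain, plain_uses)
print(spread, spread_uses)
```

Without a penalty the same donor is used three times; with the penalty the second non-respondent is matched to the next nearest respondent instead.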
Multiple Imputation: Because many imputation methods do not preserve
distributional properties, multiple imputation is advocated as a way of improving the
ability to make inferences from data where imputation has been undertaken, particularly
when the proportion of missing values is high. Multiple imputation retains the
advantages of single imputation, such as completing the data set and using expert
knowledge for imputation, and rectifies its major disadvantages (Rubin, 1986). As its name
suggests, multiple imputation replaces each missing value by a vector composed of M ≥ 2
possible values. The M values are ordered in the sense that the first components of the
vectors for the missing values are used to create one completed data set, the second
components of the vectors are used to create the second completed data set, and so on.
There are some practical difficulties with multiple imputation: there is generally a desire
to produce one definitive micro data set for public use rather than several that will give
slightly different results, and the typical data user may not be willing to analyse several
data sets in order to obtain each answer.
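The construction of M completed data sets can be sketched as follows. Two simplifying assumptions are made: each missing value is filled by a random draw from the respondents (a random hot-deck, which is cruder than a fully model-based multiple imputation), and the M estimates of the mean are combined with the usual rules attributed to Rubin, where the total variance is the average within-imputation variance plus (1 + 1/M) times the between-imputation variance. The sampling variance of each mean ignores any finite population correction. All names and data are illustrative.

```python
import random
from statistics import mean, variance


def multiply_impute(values, m=5, seed=0):
    """Create m completed copies of `values`, filling each None with a random respondent value."""
    rng = random.Random(seed)
    donors = [v for v in values if v is not None]
    return [
        [v if v is not None else rng.choice(donors) for v in values]
        for _ in range(m)
    ]


def combine_means(completed_sets):
    """Combine the m sample means into one estimate and one total variance."""
    m = len(completed_sets)
    estimates = [mean(d) for d in completed_sets]
    within = mean(variance(d) / len(d) for d in completed_sets)  # average sampling variance
    between = variance(estimates)                                # variation across imputations
    total = within + (1 + 1 / m) * between
    return mean(estimates), within, total


# Hypothetical variable with two missing values.
values = [12, None, 15, 11, None, 14, 13]
sets = multiply_impute(values, m=5, seed=1)
estimate, within, total = combine_means(sets)
print(estimate, within, total)
```

The between-imputation term is what a single completed data set cannot show: it reflects the extra uncertainty caused by not knowing the missing values.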
4.1.2 New Methods of Imputation
Recent advances in methods and computing capabilities have made possible the
application of more complex statistical modelling techniques like non-parametric
regression; neural networks including the multilayer perceptron, self-organizing maps,
support vector machines, etc. for the purpose of imputation.
Measures to Study the Effectiveness of Imputations
To study the effectiveness of different imputation methods the following three measures
have been computed:
Mean Departure (MD): The mean departure denotes the mean of the differences between the
imputed value and the true value for the missing units.
$$\mathrm{MD} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i^{k} - y_i\right) = \bar{y}^{k} - \bar{y}$$
where $y_i$ and $y_i^{k}$ denote the actual value and the value imputed by the k-th imputation
method for the i-th unit. When computed for a large number of samples, MD may provide a
measure of the bias of the method of imputation.
Mean Absolute Departure (MAD): The mean absolute departure is used to denote the
mean of the absolute deviation of imputed values from the actual values, i.e.,
$$\mathrm{MAD} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i^{k} - y_i\right|$$
When computed for a large number of samples, MAD provides a measure of the closeness
with which the imputation method reconstructs the missing values.
Standard Deviation Departure (SDD): To study the extent to which the various imputation
methods disturb the distribution of the character under study, the standard deviation
departure (SDD) is used, which is defined as the difference between the standard deviation
of the actual values and the standard deviation of the imputed values.
A simulation study showed that, except for the zero substitution method,
all the imputation methods performed almost equally well and all three measures
worked well. Also, as expected, the departures increase with the non-response rate
for all the imputation methods.
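The three measures defined above can be computed as in the following sketch for one set of paired actual and imputed values. The function name is hypothetical, the sample standard deviation (divisor n - 1) is assumed for SDD, and the data are made up; they are not the results of the simulation study.

```python
from statistics import mean, stdev


def departures(actual, imputed):
    """Return (MD, MAD, SDD) for paired actual and imputed values."""
    diffs = [yk - y for y, yk in zip(actual, imputed)]  # imputed minus actual
    md = mean(diffs)
    mad = mean(abs(d) for d in diffs)
    sdd = stdev(actual) - stdev(imputed)
    return md, mad, sdd


# Hypothetical case: four missing units all receive the same imputed value,
# as would happen under overall mean imputation.
actual = [10, 20, 30, 40]
imputed = [25, 25, 25, 25]
md, mad, sdd = departures(actual, imputed)
print(md, mad, sdd)
```

In this example MD is zero although no value is reconstructed exactly, while MAD and SDD are large, which is why the three measures are read together.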
5. Measurement or Response Errors
Response errors arise in data collection or in taking observations and are mainly contributed
by the respondent, the enumerator, or both. Response errors refer to the differences
between the individual true value and the corresponding observed sample value,
irrespective of the reasons for the discrepancies. For example, in an agricultural survey a
householder may report a total area for his holding that differs from the cadastral
data. Sometimes the measurement devices or techniques may be defective and may cause
observational errors. Response errors are often accidental, but they may also be
introduced purposely or may arise from lack of information. Deliberate errors may be due
to fear or prestige, or simply to a wish to conform to what respondents think is
appropriate. Women generally declare themselves younger. People raise their level of
education or their occupation: an Assistant declares himself a Manager, a Compounder a
Medical Practitioner, etc. Similarly, people exaggerate their salary, rent, money spent on
food, clothing etc. Mc Ford (1951) showed
how people tried to appear well informed. Respondents were asked if they had heard
about some particular magazines, writers, pieces of legislation etc. that in fact never
existed. A very large proportion of respondents answered ‘Yes’.
Given the importance of measurement errors in survey sampling, an International
Conference on ‘Measurement Errors in Surveys’ was held during Nov 11-14, 1990 in
Tucson, Arizona, sponsored by the Survey Research Methods Section of the American
Statistical Association. Thirty-two invited papers presented at the conference have been
published in book form as ‘Measurement Errors in Surveys’, edited by Paul P. Biemer,
Robert M. Groves, Lars E. Lyberg, Nancy A. Mathiowetz and Seymour Sudman (Wiley
Interscience, John Wiley & Sons Inc., 1991).
5.1 Study of Measurement Errors
In recent years much research on sampling practice has been devoted to the study of
measurement errors. The objectives are to discover the components that make large
contributions and to find ways of eliminating or decreasing their contributions.
Ideally, the best method is to obtain the correct value yi. The approach is, however, limited
to items that can be measured correctly by some alternative method. Belloc (1954)
compared data on hospitalisation as reported in household interviews with the hospital
records for the individuals. Checks of this type, called ‘record checks’, are possible with
items such as age, occupation, price paid, etc.
An alternative method is to remeasure by an independent method that is more accurate.
Kish and Lansing (1954) engaged professional appraisers to estimate the selling price of
homes that had already been reported by the homeowners.
Another possibility is to reinterview a sub-sample of respondents with more qualified
enumerators and with more accurate measuring devices.
5.2 Interpenetrating Subsampling
This important technique, proposed by Mahalanobis (1946) mainly for estimation of
variance, is particularly useful for the study of correlated errors. In simplest terms, a random
sample of n units is divided at random into k sub-samples, each containing
m = n/k units. The fieldwork and processing of the samples are planned in such a manner that
there is no correlation between the errors of measurement of units in different
sub-samples. The most important factor introducing correlation is the bias of
enumerators. Thus, if each of the k enumerators is assigned to a different sub-sample and
there is no correlation between the errors of measurement for different interviewers, we can
easily estimate the contribution of interviewer bias to the variance and also test
the significance of the null hypothesis of no interviewer bias. For the mathematical
treatment of observational errors, mathematical models based on the assumption that repeated
observations can be made on a unit have been proposed by Sukhatme and Seth (1952) and
Hansen et al. (1953, 1961, 1964).
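The idea can be sketched with a one-way analysis of variance across enumerators. This is one common way of using interpenetrating sub-samples and is offered as an illustration under that assumption, not as the formulation of the papers cited above. The between-enumerator and within-enumerator mean squares give an F ratio for the hypothesis of no interviewer bias (to be compared with tabulated F values on k - 1 and k(m - 1) degrees of freedom), and (MSB - MSW)/m is a simple estimate of the interviewer variance component. Function names and data are hypothetical.

```python
import random
from statistics import mean


def split_sample(units, k, seed=0):
    """Divide a sample at random into k sub-samples of m = n // k units each."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    m = len(shuffled) // k
    return [shuffled[i * m:(i + 1) * m] for i in range(k)]


def interviewer_anova(subsamples):
    """subsamples: k lists, each holding the m observations collected by one enumerator."""
    k = len(subsamples)
    m = len(subsamples[0])
    grand = mean(v for s in subsamples for v in s)
    means = [mean(s) for s in subsamples]
    msb = m * sum((mu - grand) ** 2 for mu in means) / (k - 1)
    msw = sum((v - mu) ** 2 for s, mu in zip(subsamples, means) for v in s) / (k * (m - 1))
    f_ratio = msb / msw
    interviewer_var = max((msb - msw) / m, 0.0)  # truncated at zero
    return msb, msw, f_ratio, interviewer_var


# Random allocation of 12 hypothetical unit labels to k = 3 enumerators.
assignment = split_sample(range(1, 13), k=3, seed=1)

# Hypothetical observations returned by the three enumerators.
observed = [[10, 12, 11, 13], [15, 17, 16, 18], [10, 11, 12, 13]]
msb, msw, f_ratio, interviewer_var = interviewer_anova(observed)
print(f_ratio, interviewer_var)
```

Because the sub-samples are formed at random, a large F ratio points to differences between enumerators rather than between the units they happened to visit.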
APPENDIX
CHECK LIST FOR CONTROL OF NON SAMPLING ERRORS
Survey Activity and Action
1. General Planning
Action: Has any such survey been conducted earlier?
2. Selection of topics and items
Action: Number and length of questions, reference period,
concepts, frame, sampling design, sampling units and
rules of association, methods of data collection,
development of questionnaires, pre testing for refining
and estimating cost factors, outline of tabulation,
interviewer selection and training.
3. Data collection
Action: Schedule of field supervision, editing of completed
questionnaires, re-interview of sub-sample, suggestions
for improvement in subsequent surveys.
4. Data processing and analysis
Action: Correct and unique identification of each questionnaire,
instruction for manual edit and coding, verification of
coders' work, computer processing, method of estimation
and tabulation.
5. Report writing
Action: Description of survey design, concepts and definitions,
sampling and non sampling errors and suggestions for
future surveys.
References and Suggested Reading
Belloc, B.B. (1954). Validation of morbidity survey data by comparison with hospital
records. J. Amer. Stat. Assoc., 49, 832-846.
Hansen, M.H. and Hurwitz, W.N. (1946). The problem of non response in sample surveys.
J. Amer. Stat. Assoc., 41, 517-529.
Hanson, R.H. and Marks, E.S. (1958). Influence of the interviewer on the accuracy of
survey results. J. Amer. Stat. Assoc., 53, 635-655.
Hansen, M.H., Hurwitz, W.N. and Bershad, M. (1961). Measurement errors in census and
surveys. Bull. Int. Stat. Inst., 38, 2, 359-374.
Hansen, M.H., Hurwitz, W.N. and Jabine, T.B. (1964). The use of imperfect lists for
probability sampling at the U.S. Bureau of the Census. Bull. Int. Stat. Inst.,
40.
Kish, L. and Lansing, J.B. (1954). Response errors in estimating the value of homes.
J. Amer. Stat. Assoc., 49, 520-538.
Mahalanobis, P.C. (1946). Recent experiments in statistical sampling in the Indian
Statistical Institute. J. Roy. Stat. Soc., 109, 325-370.
Politz, A.N. and Simmons, W.R. (1949). An attempt to get the ‘not at homes’ into the
sample without call backs. J. Amer. Stat. Assoc., 44, 9-31, and 45, 136-137.
Seal, K.C. (1962). Use of outdated frames in large scale sample surveys. Calcutta Statist.
Assoc. Bull., 11.
Singh, R. (1983). On the use of incomplete frames in sample surveys. Biom. J., 25, 545-549.
Singh, R. (1985). Estimation from incomplete data in longitudinal surveys. JSPI, 7, 163-170.
Singh, R. (1986). Predecessor-Successor Method. Encyclopedia of Statistical Sciences,
V.7, 137-139. John Wiley & Sons Inc.
Singh, R. and Rai, T. (1983). Use of Imputations for Missing Data in Census and Surveys.
Project Report, Indian Agricultural Statistics Research Institute (ICAR), New
Delhi.
Sukhatme, P.V. and Seth, G.R. (1952). Non sampling errors in surveys. J. Indian Soc.
Agril. Statist., 4, 5-41.
Zarkovich, S.S. (1966). Quality of Statistical Data. F.A.O., Rome.