Small Sample Corrections for LTS and MCD

G. Pison, S. Van Aelst, and G. Willems
Department of Mathematics and Computer Science, Universitaire Instelling Antwerpen
(UIA), Belgium
E-mail: gpison, vaelst, gewillem@uia.ua.ac.be
Summary. The least trimmed squares estimator and the minimum covariance determinant
estimator (Rousseeuw, 1984) are frequently used robust estimators of regression and of location and scatter. Consistency factors can be computed for both methods to make the estimators
consistent at the normal model. However, for small data sets these factors do not make the
estimator unbiased. Based on simulation studies we therefore construct formulas which allow
us to compute small sample correction factors for all sample sizes and dimensions without
having to carry out any new simulations. We give some examples to illustrate the effect of the
correction factor.
Key words: Robustness, Least Trimmed Squares estimator, Minimum Covariance
Determinant estimator, Bias
1 Introduction
The classical estimators of regression and multivariate location and scatter can be
heavily influenced when outliers are present in the data set. To overcome this problem Rousseeuw (1984) introduced the least trimmed squares (LTS) estimator as a
robust alternative for least squares regression and the minimum covariance determinant (MCD) estimator instead of the empirical mean and covariance estimators.
Consistency factors can be computed to make the LTS scale and MCD scatter
estimators consistent at the normal model. However, these consistency factors are
not sufficient to make the LTS scale or MCD scatter unbiased for small sample sizes.
Simulations and examples with small sample sizes clearly show that these estimators underestimate the true scatter such that too many observations are identified as
outliers.
To solve this problem we construct small sample correction factors which allow
us to identify outliers correctly. For several sample sizes and dimensions we
carried out Monte Carlo simulations with data generated from the standard Gaussian
distribution. Based on the results we then derive a formula which approximates
the actual correction factors very well. These formulas allow us to compute the
correction factor at any sample size and dimension immediately without having
to carry out any new simulations.
Research Assistant with the FWO Belgium
In Section 2, we focus on the LTS scale estimator. We start with a motivating
example and then introduce the Monte Carlo simulation study. Based on the simulation results we construct the function which yields finite sample corrections for all
$n$ and $p$. Similarly, correction factors for the MCD scatter estimator are constructed
in Section 3. The reweighted version of both methods is briefly treated in Section
4. In Section 5 we apply the LTS and MCD to real data sets to illustrate the effect
of the small sample correction factor. Section 6 gives the conclusions.
2 Least Trimmed Squares Estimator
Consider the regression model
$$y_i = \theta^t x_i + e_i \tag{1}$$
for $i = 1, \ldots, n$. Here $x_i = (x_{i1}, \ldots, x_{ip})^t \in \mathbb{R}^p$ are the regressors, $y_i \in \mathbb{R}$ is
the response and $e_i \in \mathbb{R}$ is the error term. We assume that the errors $e_1, \ldots, e_n$ are
independent of the carriers and are i.i.d. according to $N(0, \sigma^2)$, which is the usual
assumption for outlier identification and inference. For every $\theta \in \mathbb{R}^p$ we denote the
corresponding residuals by $r_i(\theta) = r_i := y_i - \theta^t x_i$, and $r^2_{1:n} \le \cdots \le r^2_{n:n}$ denote
the squared ordered residuals.
The LTS estimator searches for the optimal subset of size $h$ whose least squares
fit has the smallest sum of squared residuals. Formally, for $0.5 \le \alpha \le 1$, the LTS
estimator $\hat\theta$ minimizes the objective function
$$k_\alpha \sqrt{\frac{1}{h(\alpha)} \sum_{i=1}^{h(\alpha)} r^2_{i:n}} \tag{2}$$
where $k_\alpha = \left( \frac{1}{\alpha} \int_{-q_\alpha}^{q_\alpha} u^2 \, d\Phi(u) \right)^{-1/2}$ with $q_\alpha = \Phi^{-1}\left(\frac{\alpha+1}{2}\right)$ is the consistency factor
for the LTS scale (Croux and Rousseeuw, 1992) and $h = h(\alpha)$ determines the subset
size.
When $\alpha = 0.5$, $h(\alpha)$ equals $[(n+p+1)/2]$ which yields the highest breakdown value ($50\%$), and when $\alpha = 1$, $h(\alpha)$ equals $n$, such that we obtain the least
squares estimator. For other values of $\alpha$ we compute the subset size by linear interpolation. To compute the LTS we use the FAST-LTS algorithm of Rousseeuw and
Van Driessen (1998). The LTS estimate of the error scale is given by the minimum
of the objective function (2).
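As a minimal sketch of the quantities entering objective (2), the Python code below evaluates the subset size, the consistency factor and the objective for the residuals of one candidate fit. The function names are illustrative, and the particular linear interpolation used for $h(\alpha)$ is an assumption: the text only states that interpolation between $[(n+p+1)/2]$ and $n$ is used. The code does not search for the minimizing fit, which is the job of FAST-LTS.

```python
import math
from statistics import NormalDist

def subset_size(alpha, n, p):
    """h(alpha): [(n+p+1)/2] at alpha = 0.5, n at alpha = 1, linear in between (assumed form)."""
    h_min = (n + p + 1) // 2
    return math.floor(2 * h_min - n + 2 * (n - h_min) * alpha)

def lts_consistency_factor(alpha):
    """k_alpha = ((1/alpha) * int_{-q}^{q} u^2 dPhi(u))^(-1/2), q = Phi^{-1}((alpha+1)/2)."""
    if alpha >= 1.0:
        return 1.0  # the full second moment of the standard normal is 1
    std = NormalDist()
    q = std.inv_cdf((alpha + 1.0) / 2.0)
    # Integration by parts: int_{-q}^{q} u^2 phi(u) du = alpha - 2 q phi(q)
    truncated_second_moment = alpha - 2.0 * q * std.pdf(q)
    return 1.0 / math.sqrt(truncated_second_moment / alpha)

def lts_objective(residuals, alpha, p):
    """Objective (2) evaluated for the residuals of one candidate fit."""
    n = len(residuals)
    h = subset_size(alpha, n, p)
    smallest = sorted(r * r for r in residuals)[:h]
    return lts_consistency_factor(alpha) * math.sqrt(sum(smallest) / h)
```

For $\alpha = 1$ the factor is 1 and all $n$ squared residuals enter the sum, so the objective reduces to the root mean squared residual of least squares.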
2.1 Example
In this example we generated $n = 30$ points such that the predictor variables are
generated from a multivariate standard Gaussian $N_5(0, I)$ distribution and the response variable comes from the univariate standard Gaussian distribution. We used
the LTS estimator with $\alpha = 0.5$ to analyse this data set and computed the robust
standardized residuals $r_i(\hat\theta)/\hat\sigma$ based on the LTS estimates $\hat\theta$ and $\hat\sigma$. Using the cutoff values $2.5$ and $-2.5$ we expect to find approximately 1% of outliers in the case
of normally distributed errors. Hence, we expect to find at most one outlier in our
example. In Figure 1a the robust standardized residuals of the observations are plotted. We see that LTS flags considerably more outlying objects than expected. The
main problem is that LTS underestimates the scale of the residuals. Therefore the
robust standardized residuals are too large, and too many observations are flagged
as outliers.
[Figure 1: two panels plotting robust standardized residuals against the observation index (1 to 30); panel (a) uncorrected standardized residuals, panel (b) corrected standardized residuals.]
Fig. 1. Robust standardized residuals (a) without correction factors, and (b) with correction
factors of a generated data set with $n = 30$ objects and $p = 5$ regressors.
2.2 Monte Carlo Simulation Study
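To determine correction factors for small data sets, first a Monte Carlo simulation
study is carried out for several sample sizes $n$ and dimensions $p$. In the simulation
we also consider the distribution of $x$ to be Gaussian. Note that the LTS estimator
$\hat\theta = (\hat\theta_1^t, \hat\theta_2)^t$, with $\hat\theta_1$ the slope vector and $\hat\theta_2$ the intercept, is regression, scale and
affine equivariant (see Rousseeuw and Leroy, 1987, p. 132). Writing $x_i$ here for the regressors without the constant term, this means that
$$\hat\theta_1(A^t x_i,\ s y_i + v^t x_i + w) = A^{-1}\left(s\,\hat\theta_1(x_i, y_i) + v\right)$$
$$\hat\theta_2(A^t x_i,\ s y_i + v^t x_i + w) = s\,\hat\theta_2(x_i, y_i) + w$$
for every vector $v$ of the same dimension as $x_i$, $s \neq 0$, $w \in \mathbb{R}$ and nonsingular matrix $A$. Also the LTS
scale $\hat\sigma$ is affine equivariant meaning that
$$\hat\sigma^2(s\, r_i) = s^2\, \hat\sigma^2(r_i)$$
for every $s \neq 0$. From these equivariances it follows that
$$\hat\sigma^2\left(s y_i + v^t x_i + w - \hat\theta_1^t(A^t x_i, s y_i + v^t x_i + w)\, A^t x_i - \hat\theta_2(A^t x_i, s y_i + v^t x_i + w)\right) = s^2\, \hat\sigma^2\left(y_i - \hat\theta_1^t(x_i, y_i)\, x_i - \hat\theta_2(x_i, y_i)\right). \tag{3}$$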
Therefore it suffices to consider standard Gaussian distributions for $x$ and $e$ since
(3) shows that this correction factor remains valid for any Gaussian distribution of
$x$ and $e$.
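The invariance used here can be checked in a few lines. The derivation below is a sketch in notation introduced only for this illustration: $\hat\theta_1$ and $\hat\theta_2$ are the slope and intercept fitted to the original data $(x_i, y_i)$, tildes mark the transformed data $\tilde x_i = A^t x_i$, $\tilde y_i = s y_i + v^t x_i + w$, and regression, scale and affine equivariance are assumed to give $\tilde\theta_1 = A^{-1}(s\hat\theta_1 + v)$ and $\tilde\theta_2 = s\hat\theta_2 + w$.

```latex
\begin{aligned}
\tilde y_i - \tilde\theta_1^t \tilde x_i - \tilde\theta_2
  &= (s y_i + v^t x_i + w) - (s\hat\theta_1 + v)^t (A^{-1})^t A^t x_i - (s\hat\theta_2 + w) \\
  &= s y_i + v^t x_i - s\,\hat\theta_1^t x_i - v^t x_i - s\,\hat\theta_2 \\
  &= s\,\bigl(y_i - \hat\theta_1^t x_i - \hat\theta_2\bigr).
\end{aligned}
```

Every residual is multiplied by the same constant $s$, so the scale estimate is multiplied by $|s|$ and the standardized residuals change at most in sign. A correction factor calibrated on standard Gaussian data therefore carries over to any other Gaussian model.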
In the simulation, for sample size $n$ and dimension $p$ we generate regressors
$X^{(j)} \in \mathbb{R}^{n \times p}$ and a response variable $Y^{(j)} \in \mathbb{R}^{n \times 1}$. For each
dataset $Z^{(j)} = (X^{(j)}, Y^{(j)})$, $j = 1, \ldots, m$, we then determine
the LTS scale $\hat\sigma^{(j)}$ of the residuals corresponding
to the LTS fit. Finally, the mean $m(\hat\sigma) := \frac{1}{m} \sum_{j=1}^{m} \hat\sigma^{(j)}$ is
computed.
If the estimator is unbiased we have $E[\hat\sigma] = 1$ for this model, so we expect that
also $m(\hat\sigma)$ equals approximately 1. In general,
denote $c^\alpha_{p,n} := 1/m(\hat\sigma)$, then $E[c^\alpha_{p,n}\, \hat\sigma]$
equals approximately 1, so we can use $c^\alpha_{p,n}$ as a finite-sample correction factor to
make the LTS scale unbiased.
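The simulation scheme can be sketched in Python for the simplest case, the intercept-only model ($p = 1$). This is an illustration under stated assumptions, not the setup used for the tables: in the location case the optimal $h$-subset is a run of $h$ consecutive order statistics, so the exact LTS scale is found by scanning windows instead of running FAST-LTS. The function names, the number of replications and the seed are arbitrary choices.

```python
import math
import random
from statistics import NormalDist

def lts_scale_location(y, alpha):
    """Exact LTS scale for the intercept-only model (p = 1)."""
    n, p = len(y), 1
    h_min = (n + p + 1) // 2
    h = math.floor(2 * h_min - n + 2 * (n - h_min) * alpha)
    ys = sorted(y)
    best = math.inf
    # For a location fit the optimal h-subset consists of h consecutive order statistics.
    for start in range(n - h + 1):
        window = ys[start:start + h]
        centre = sum(window) / h
        best = min(best, sum((v - centre) ** 2 for v in window))
    if alpha < 1.0:
        std = NormalDist()
        q = std.inv_cdf((alpha + 1.0) / 2.0)
        k = 1.0 / math.sqrt(1.0 - 2.0 * q * std.pdf(q) / alpha)  # consistency factor
    else:
        k = 1.0
    return k * math.sqrt(best / h)

def simulated_mean_scale(n, alpha, reps, seed=0):
    """Average LTS scale over `reps` standard Gaussian samples of size n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        total += lts_scale_location(sample, alpha)
    return total / reps

mean_scale = simulated_mean_scale(n=20, alpha=0.5, reps=500, seed=1)
correction = 1.0 / mean_scale  # finite-sample correction factor for this (n, p, alpha)
```

The resulting `mean_scale` can be compared with the entry for $n = 20$, $p = 1$ in Table 1; with only 500 replications it is subject to noticeable simulation error.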
To determine the correction factor we performed $m = 1000$ simulations for
different sample sizes $n$ and dimensions $p$, and for several values of $\alpha$. For the
model with intercept ($x_{ip} = 1$) we denote the resulting
correction factor $c^\alpha_{p,n}$ and for
the model without intercept it is denoted by $\tilde c^\alpha_{p,n}$.
From the simulations, we found empirically that for fixed $n$ and $p$ the mean
$m(\hat\sigma)$ is approximately linear as a function of $\alpha$. Therefore we reduced the actual
simulations to cases with $\alpha = 0.5$ and $\alpha = 0.875$. For values of $\alpha$ in between
we then determine the correction factor by linear interpolation. If $\alpha = 1$ then least
squares regression is carried out. In this case, we do not need a correction factor
because this estimator is unbiased. So, if $0.875 \le \alpha \le 1$ we interpolate between the
value of $m(\hat\sigma)$ for $\alpha = 0.875$ and 1 to determine the correction factor.
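The interpolation rule can be written down directly. The sketch below assumes that the two simulated settings are $\alpha = 0.5$ and $\alpha = 0.875$, that the mean scale (not its reciprocal) is interpolated, and that the mean equals 1 at $\alpha = 1$. The function names are illustrative.

```python
def interpolated_mean(alpha, mean_500, mean_875):
    """Linearly interpolate the simulated mean scale in alpha.

    Nodes: alpha = 0.5 and alpha = 0.875 (simulated), alpha = 1 (least squares, mean 1).
    """
    if not 0.5 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0.5, 1]")
    if alpha <= 0.875:
        weight = (alpha - 0.5) / 0.375
        return (1.0 - weight) * mean_500 + weight * mean_875
    weight = (alpha - 0.875) / 0.125
    return (1.0 - weight) * mean_875 + weight * 1.0

def correction_factor(alpha, mean_500, mean_875):
    """Finite-sample correction factor: reciprocal of the interpolated mean."""
    return 1.0 / interpolated_mean(alpha, mean_500, mean_875)
```

For instance, with the simulated means 0.46 and 0.83 reported for $n = 30$, $p = 5$, `correction_factor(0.5, 0.46, 0.83)` returns $1/0.46$ and `correction_factor(1.0, 0.46, 0.83)` returns 1.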
In Table 1, the mean $m(\hat\sigma)$ for LTS with intercept and $\alpha = 0.5$ is given for
several values of $n$ and $p$. We clearly see that when the sample size $n$ is small,
$m(\hat\sigma)$ is very small. Moreover, for $n$ fixed, the mean becomes smaller when the
dimension $p$ increases. Note that for fixed $p$ the mean increases monotonically to 1, so for
large samples the consistency factor suffices to make the estimator unbiased. Table 2
shows the result for $\alpha = 0.875$. In comparison with Table 1 we see that these values
of $m(\hat\sigma)$ are higher such that the correction factor $c^{0.875}_{p,n}$ will be smaller than $c^{0.5}_{p,n}$
for the same value of $n$ and $p$. Similar results were found for LTS without intercept.

Table 1. $m(\hat\sigma)$ for $\alpha = 0.5$ and for several sample sizes $n$ and dimensions $p$

p \ n    20    25    30    35    50    55    80    85   100
1      0.71  0.77  0.77  0.81  0.84  0.86  0.89  0.90  0.91
3      0.49  0.58  0.60  0.65  0.71  0.74  0.79  0.81  0.83
5      0.35  0.45  0.46  0.53  0.60  0.64  0.71  0.72  0.75
8      0.25  0.26  0.34  0.36  0.49  0.51  0.62  0.62  0.67
Table 2. $m(\hat\sigma)$ for $\alpha = 0.875$ and for several sample sizes $n$ and dimensions $p$

p \ n    20    25    30    35    50    55    80    85   100
1      0.91  0.94  0.94  0.96  0.97  0.97  0.98  0.98  0.99
3      0.83  0.86  0.88  0.90  0.93  0.94  0.95  0.96  0.97
5      0.73  0.77  0.83  0.83  0.89  0.90  0.93  0.93  0.95
8      0.56  0.69  0.72  0.75  0.84  0.85  0.90  0.90  0.92
2.3 Finite Sample Corrections
We now construct a function which approximates the actual correction factors obtained from the simulations and allows us to compute the correction factor at any
sample size $n$ and dimension $p$ immediately without having to carry out any new
simulations. First, for a fixed dimension $p$ we plotted the mean $m(\hat\sigma)$ versus the
number of observations $n$. We made plots for several dimensions $p$ ($1 \le p \le 10$),
for $\alpha = 0.5$ and $\alpha = 0.875$, and for LTS with and without intercept. Some plots are
shown in Figure 2.
From these plots we see that for fixed $p$ the mean $m(\hat\sigma)$ has a smooth pattern as a
function of $n$. For fixed $p$ we used the model
$$f^\alpha_p(n) = 1 + \frac{\gamma}{n^\beta} \tag{4}$$
to fit the mean $m(\hat\sigma)$ as a function of $n$. Hence, for each $p$ and $\alpha$ we obtain the corresponding parameters $\gamma := \gamma_{p,\alpha}$ and $\beta := \beta_{p,\alpha}$ for LTS with intercept and
$\tilde\gamma := \tilde\gamma_{p,\alpha}$, $\tilde\beta := \tilde\beta_{p,\alpha}$ for LTS without intercept. In Figure 2 the functions $f^\alpha_p$ obtained by using the
model (4) are superimposed. We see that the function values $f^\alpha_p(n)$ approximate the
actual values of $m(\hat\sigma)$ obtained from the simulations very well.
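To illustrate how a model of the form $1 + \gamma/n^\beta$ can be fitted, the Python sketch below uses the simulated means reported for $p = 1$, $\alpha = 0.5$ (LTS with intercept) in Table 1. The fitting method is an assumption made for this example: it is ordinary least squares on the log scale, using $\log(1 - m) = \log(-\gamma) - \beta \log n$, and not necessarily the procedure that produced the fitted curves in the figures.

```python
import math

# Table 1, row p = 1 (LTS with intercept, alpha = 0.5): sample sizes and simulated means.
N_VALUES = [20, 25, 30, 35, 50, 55, 80, 85, 100]
MEANS = [0.71, 0.77, 0.77, 0.81, 0.84, 0.86, 0.89, 0.90, 0.91]

def fit_power_model(ns, means):
    """Fit mean ~ 1 + gamma / n**beta by least squares on the log scale (requires mean < 1)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(1.0 - m) for m in means]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return -math.exp(intercept), -slope  # (gamma, beta)

def f_model(n, gamma, beta):
    """Approximating function 1 + gamma / n**beta."""
    return 1.0 + gamma / n ** beta

gamma, beta = fit_power_model(N_VALUES, MEANS)
max_error = max(abs(f_model(n, gamma, beta) - m) for n, m in zip(N_VALUES, MEANS))
```

Because the simulated means lie below 1, the fitted $\gamma$ is negative and the fitted function increases to 1 as $n$ grows.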
When the regression dataset
has a dimension that was included in our simulation
study, then the functions $f^\alpha_p(n)$ already yield a correction factor for all possible
values of $n$. However, when the data set has another dimension, then we have not
yet determined the corresponding correction factor. To be able to obtain
correction
factors for these higher dimensions we fitted the function values $f^\alpha_p(q p^2)$ for $q = 3$
and $q = 5$ as a function
of the number of dimensions $p$ ($p \ge 2$). In Figure 3 we
plotted the values $f^\alpha_p(q p^2)$
versus the dimension $p$ for the LTS with intercept and fixed $\alpha$. Also in Figure 3 we see a smooth pattern. Note that the function values
$f^\alpha_p(q p^2)$
converge to 1 as $p$ goes to infinity since we know from (4) that
$f^\alpha_p(q p^2)$
goes to 1 if $q p^2$
goes to infinity. The model we used to fit the values
$f^\alpha_p(q p^2)$
as a
function of $p$ is given by
$$g^\alpha_q(p) = 1 + \frac{\eta}{p^\kappa}. \tag{5}$$
By fitting this model for $q = 3$ and $5$ and $\alpha = 0.5$ and $0.875$ we obtain the corresponding parameters $\eta := \eta_{q,\alpha}$ and $\kappa := \kappa_{q,\alpha}$ for LTS with intercept and $\tilde\eta := \tilde\eta_{q,\alpha}$, $\tilde\kappa := \tilde\kappa_{q,\alpha}$
for LTS without intercept. From Figure 3 we see that the resulting functions fit the points very well.

[Figure 2: four panels (a) to (d) plotting the simulated mean against the sample size $n$, with the fitted function superimposed.]
Fig. 2. The approximating function $f^\alpha_p(n)$ for four combinations of $p$ and $\alpha$: panels (a) and (c) LTS without intercept, panels (b) and (d) LTS with intercept.

[Figure 3: two panels (a) and (b) plotting the function values against the dimension $p$, with the fitted function superimposed.]
Fig. 3. The approximating function $g^\alpha_q(p)$ for LTS with intercept, panels (a) and (b).
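Finally, for any $n$ and $p$ we now have the following procedure to determine
the corresponding correction factor for the LTS scale estimator. For the LTS with
intercept the correction factor in the case $p = 1$ is given by $c^\alpha_{1,n} := 1/f^\alpha_1(n)$ where
$f^\alpha_1(n) = 1 + \gamma_{1,\alpha}/n^{\beta_{1,\alpha}}$. In the case $p > 1$, we first solve the following system of
equations
$$1 + \frac{\gamma_{p,\alpha}}{(3p^2)^{\beta_{p,\alpha}}} = 1 + \frac{\eta_{3,\alpha}}{p^{\kappa_{3,\alpha}}} \tag{6}$$
$$1 + \frac{\gamma_{p,\alpha}}{(5p^2)^{\beta_{p,\alpha}}} = 1 + \frac{\eta_{5,\alpha}}{p^{\kappa_{5,\alpha}}} \tag{7}$$
to obtain the estimates $\hat\gamma_{p,\alpha}$ and $\hat\beta_{p,\alpha}$ of the parameter values $\gamma_{p,\alpha}$ and $\beta_{p,\alpha}$. Note that
the system of Equations (6) and (7) can be rewritten into a linear system of equations
by taking logarithms. The corresponding correction factor is then given by
$c^\alpha_{p,n} := 1/\hat f^\alpha_p(n)$ where $\hat f^\alpha_p(n) = 1 + \hat\gamma_{p,\alpha}/n^{\hat\beta_{p,\alpha}}$. Similarly, we also obtain the correction
factors for the LTS without intercept.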
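The step of solving the two equations by taking logarithms can be sketched as follows. The sketch assumes the equations have the form $1 + \gamma/(q p^2)^\beta = 1 + \eta_q/p^{\kappa_q}$ for $q = 3$ and $q = 5$, with negative $\eta_q$ and $\gamma$ so that the logarithms exist. The parameter values in the example call are hypothetical and chosen only to give plausible numbers; they are not the fitted values of the paper.

```python
import math

def solve_gamma_beta(p, eta3, kappa3, eta5, kappa5):
    """Solve 1 + gamma/(q p^2)^beta = 1 + eta_q / p^kappa_q for q = 3 and q = 5."""
    # Right-hand sides on the log scale: y_q = log(-eta_q / p^kappa_q).
    y3 = math.log(-eta3) - kappa3 * math.log(p)
    y5 = math.log(-eta5) - kappa5 * math.log(p)
    # Linear system: log(-gamma) - beta * log(q p^2) = y_q for q = 3, 5.
    beta = (y3 - y5) / math.log(5.0 / 3.0)
    gamma = -math.exp(y3 + beta * math.log(3.0 * p * p))
    return gamma, beta

def correction_factor(n, gamma, beta):
    """Reciprocal of the approximating function 1 + gamma / n**beta."""
    return 1.0 / (1.0 + gamma / n ** beta)

# Hypothetical parameter values for illustration only.
gamma_hat, beta_hat = solve_gamma_beta(p=4, eta3=-0.8, kappa3=0.5, eta5=-0.6, kappa5=0.5)
factor_n30 = correction_factor(30, gamma_hat, beta_hat)
```

With these hypothetical inputs the approximating function equals 0.6 at $n = 3p^2 = 48$ and 0.7 at $n = 5p^2 = 80$, and the two solved parameters reproduce both values exactly.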
Using this procedure we obtain the functions shown in Figure 4. We can clearly
see that these functions are nearly the same as the original functions $f^\alpha_p(n)$ shown
in Figure 2.
Let us reconsider the example of Section 2.1. The corrected LTS estimator
with $\alpha = 0.5$ is now used to analyse the dataset. The resulting robust
standardized residuals
are plotted in Figure 1b. Using the corresponding Gaussian quantile cutoffs $\Phi^{-1}(\cdot)$
and $-\Phi^{-1}(\cdot)$ we find 1 outlier, which corresponds with the 2.5% of outliers we
expect to find. Also, we clearly see that the corrected residuals are much smaller
than the uncorrected ones. The corrected residuals range between $-3$ and $2$ while the
uncorrected residuals reach values as low as $-5$. We conclude that the scale is not
underestimated when we use the LTS estimator with small sample corrections and
therefore it gives more reliable values for the standardized residuals and more reliable outlier identification.
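The effect of the correction on outlier flagging can be sketched in a few lines of Python. The residuals below are hypothetical, and the raw scale of 0.46 is chosen to mimic the simulated mean for $n = 30$, $p = 5$, $\alpha = 0.5$; the cutoff 2.5 is the one used in Section 2.1. The function names are illustrative.

```python
def standardized_residuals(residuals, scale, correction=1.0):
    """Divide residuals by the (optionally corrected) robust scale estimate."""
    return [r / (correction * scale) for r in residuals]

def flag_outliers(residuals, scale, correction=1.0, cutoff=2.5):
    """Indices of observations whose absolute standardized residual exceeds the cutoff."""
    std = standardized_residuals(residuals, scale, correction)
    return [i for i, value in enumerate(std) if abs(value) > cutoff]

# Hypothetical residuals; the raw scale underestimates a true scale of 1.
residuals = [0.3, -1.4, 0.9, 2.7, -0.2, 1.3, -1.9]
raw_scale = 0.46
uncorrected = flag_outliers(residuals, raw_scale)
corrected = flag_outliers(residuals, raw_scale, correction=1.0 / 0.46)
```

Multiplying the scale by the correction factor shrinks every standardized residual by the same ratio, so only observations that are extreme relative to the corrected scale remain flagged.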
Finally, we investigated whether the correction factor is also valid when working
with non-normal explanatory variables. In Table 3 we give the mean $m(\hat\sigma)$ for some
simulation setups where we used exponential, student (with 3 df.) and cauchy distributed carriers. The approximated values $\hat f^\alpha_p(n)$ of $m(\hat\sigma)$ obtained with normally
distributed carriers are given between brackets. From Table 3 we see that the difference between the simulated value and the correction factor is very small. Therefore,
we conclude that in general, also for nonnormal carrier distributions, the correction
factor makes the LTS scale unbiased.

[Figure 4: four panels (a) to (d) plotting the simulated mean against the sample size $n$, with the approximating function superimposed.]
Fig. 4. The approximation $\hat f^\alpha_p(n)$ for four combinations of $p$ and $\alpha$: panels (a) and (c) LTS without intercept, panels (b) and (d) LTS with intercept.

Table 3. $m(\hat\sigma)$ for several other distributions of the carriers; the approximations obtained with normally distributed carriers are given between brackets. The four columns correspond to four sample sizes $n$.

exp, without intercept        0.84    0.91    0.94    0.96
                             (0.82)  (0.91)  (0.94)  (0.96)
student, with intercept       0.50    0.67    0.75    0.80
                             (0.52)  (0.68)  (0.75)  (0.79)
cauchy, with intercept        0.63    0.83    0.88    0.92
                             (0.63)  (0.81)  (0.88)  (0.91)

3 Minimum Covariance Determinant Estimator
The MCD estimates the location vector $\mu$ and
the scatter matrix $\Sigma$. Suppose we
have a dataset $Z_n = \{z_i;\ i = 1, \ldots, n\} \subset \mathbb{R}^p$, then the MCD searches for the subset
of $h = h(\alpha)$ observations whose covariance matrix has the lowest determinant.
For $0.5 \le \alpha \le 1$, its objective is to minimize the determinant of
$$c_\alpha\, \frac{1}{h(\alpha)} \sum_{i=1}^{h(\alpha)} (z_i - \hat\mu_n)(z_i - \hat\mu_n)^t \tag{8}$$
where the sum runs over the $h(\alpha)$ observations of the subset and $\hat\mu_n = \frac{1}{h(\alpha)} \sum_{i=1}^{h(\alpha)} z_i$. The factor
$c_\alpha = \alpha / F_{\chi^2_{p+2}}(q_\alpha)$ with $q_\alpha = \chi^2_{p,\alpha}$ makes the MCD scatter estimator
consistent at
the normal model (see Croux and Haesbroeck, 1999). The MCD center is then the
mean of the optimal subset and the MCD scatter is a multiple of its covariance
matrix as given by (8). A fast algorithm has been constructed to compute the MCD
(Rousseeuw and Van Driessen, 1999).
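The consistency factor can be evaluated numerically. The Python sketch below assumes the form $c_\alpha = \alpha / F_{\chi^2_{p+2}}(\chi^2_{p,\alpha})$ and uses only the standard library: the chi-square distribution function is computed from the power series of the regularized incomplete gamma function and inverted by bisection. All function names are illustrative.

```python
import math

def regularized_lower_gamma(a, x):
    """P(a, x) via its power series (adequate for the moderate x used here)."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    for k in range(1, 1000):
        term *= x / (a + k)
        total += term
        if term < 1e-16 * total:
            break
    return total * math.exp(a * math.log(x) - x - math.lgamma(a))

def chi2_cdf(x, df):
    """Chi-square distribution function with df degrees of freedom."""
    return regularized_lower_gamma(df / 2.0, x / 2.0)

def chi2_quantile(prob, df):
    """Invert the chi-square distribution function by bisection."""
    lo, hi = 0.0, float(df)
    while chi2_cdf(hi, df) < prob:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, df) < prob:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def mcd_consistency_factor(alpha, p):
    """c_alpha = alpha / F_{chi2_{p+2}}(chi2_{p, alpha}); equals 1 when alpha = 1."""
    if alpha >= 1.0:
        return 1.0
    q_alpha = chi2_quantile(alpha, p)
    return alpha / chi2_cdf(q_alpha, p + 2)
```

For $p = 2$ the result can be checked by hand, since $\chi^2_{2,\alpha} = -2\log(1-\alpha)$ and $F_{\chi^2_4}(x) = 1 - e^{-x/2}(1 + x/2)$. For $p = 1$ the factor equals the square of the LTS scale consistency factor.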
3.1 Example
Similarly as for LTS, we generated data from a multivariate standard Gaussian distribution. For $n = 20$ observations of $N_p(0, I)$ we computed the MCD estimates
for a fixed $\alpha$. As cutoff value to determine outliers the 97.5% quantile of the $\chi^2_p$
distribution is used. Since no outliers are present, we therefore expect that MCD will
find at most one outlier in this case. Nevertheless, the MCD estimator identifies considerably more outlying objects, as shown in Figure 5a where we plotted the robust distances of the
20 observations. Hence a similar problem arises as with LTS. The MCD estimator
underestimates the volume of the scatter matrix, such that the robust distances are
too large. Therefore the MCD identifies too many observations as outliers.
[Figure 5: two panels plotting robust distances against the observation index (1 to 20); panel (a) uncorrected robust distances, panel (b) corrected robust distances.]
Fig. 5. Robust distances (a) without correction factors, (b) with correction factors, of a generated data set with $n = 20$ objects and $p$ dimensions.
3.2 Monte Carlo Simulation Study
A Monte Carlo simulation study is carried out for several sample sizes $n$ and dimensions $p$. We generated datasets $X^{(j)} \in \mathbb{R}^{n \times p}$ from the standard Gaussian distribution. It suffices to consider the standard
Gaussian distribution since the MCD
is affine
equivariant (see Rousseeuw and Leroy, 1987, page 262). For each dataset
$X^{(j)}$, $j = 1, \ldots, m$, we then determine the MCD scatter matrix $\hat\Sigma^{(j)}$. If the estimator
is unbiased, we have that $E[\hat\Sigma] = I_p$ so we expect that the $p$-th root of the
determinant of $\hat\Sigma$ equals 1. Therefore, the mean of the $p$-th root of the determinant
given by
$m(|\hat\Sigma|) := \frac{1}{m} \sum_{j=1}^{m} \left(|\hat\Sigma^{(j)}|\right)^{1/p}$, where $|A|$ denotes the determinant of
a square matrix $A$, is computed. Denote $d^\alpha_{p,n} := 1/m(|\hat\Sigma|)$, then we expect that the
determinant
of $d^\alpha_{p,n}\, \hat\Sigma$ equals approximately 1. Similarly as for LTS, we now use
$d^\alpha_{p,n}$ as a finite-sample correction factor for MCD. We performed $m = 1000$ simulations for different sample sizes $n$ and dimensions $p$, and for several values of $\alpha$ to
compute the correction factors.
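The simulation can be sketched for a tiny bivariate case in Python. This is an illustration under stated assumptions, not the setup behind the reported results: the exact MCD subset is found by enumerating all $h$-subsets, which is only feasible for very small $n$, the consistency factor uses the closed forms available for $p = 2$, and the subset size follows the same interpolation as assumed for LTS. Names, seed and number of replications are arbitrary.

```python
import itertools
import math
import random

def mcd_det_root_2d(points, h, c_alpha):
    """p-th root (p = 2) of det(c_alpha * S) for the h-subset with smallest det(S)."""
    best = math.inf
    for subset in itertools.combinations(points, h):
        mx = sum(x for x, _ in subset) / h
        my = sum(y for _, y in subset) / h
        sxx = sum((x - mx) ** 2 for x, _ in subset) / h
        syy = sum((y - my) ** 2 for _, y in subset) / h
        sxy = sum((x - mx) * (y - my) for x, y in subset) / h
        best = min(best, sxx * syy - sxy * sxy)
    # det(c * S) = c^2 * det(S) in dimension 2, so the square root is c * sqrt(det(S)).
    return c_alpha * math.sqrt(max(best, 0.0))

def consistency_factor_2d(alpha):
    """alpha / F_{chi2_4}(chi2_{2,alpha}) using closed forms valid for p = 2."""
    q = -2.0 * math.log(1.0 - alpha)
    return alpha / (1.0 - math.exp(-q / 2.0) * (1.0 + q / 2.0))

def simulated_mean_det_root(n, alpha, reps, seed=0):
    """Average p-th root of the MCD determinant over standard Gaussian samples."""
    rng = random.Random(seed)
    p = 2
    h_min = (n + p + 1) // 2
    h = math.floor(2 * h_min - n + 2 * (n - h_min) * alpha)
    c_alpha = consistency_factor_2d(alpha)
    total = 0.0
    for _ in range(reps):
        pts = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n)]
        total += mcd_det_root_2d(pts, h, c_alpha)
    return total / reps

mean_root = simulated_mean_det_root(n=10, alpha=0.5, reps=100, seed=1)
correction = 1.0 / mean_root  # finite-sample correction factor for this (n, p, alpha)
```

For realistic sample sizes the enumeration would have to be replaced by a fast approximate algorithm such as the one of Rousseeuw and Van Driessen (1999).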
From the simulation study similar results as for LTS were obtained. Empirically
we found that the mean $m(|\hat\Sigma|)$ is approximately linear as a function of $\alpha$ so we
reduced the actual simulations to cases with $\alpha = 0.5$ and $\alpha = 0.875$. The other
values of $\alpha$ are determined by linear interpolation. Also here we saw that the mean
is very small when the sample size $n$ is small, and for fixed $p$ the mean increases
monotonically to 1 when $n$ goes to infinity.
3.3 Finite Sample Corrections
We now construct a function which approximates the actual correction factors obtained from the simulations. The same setup as for LTS is used. Model (4) and for
$p \ge 2$ also model (5) with $q = 2$ and $q = 3$ are used to derive a function
which
yields a correction factor for every $n$ and $p$. The function values $\hat f^\alpha_p(n)$ obtained
from this procedure are illustrated in Figure 6. In this figure the mean $m(|\hat\Sigma|)$ is
plotted versus the sample size $n$ for a fixed $p$ and $\alpha$ and superimposed are the functions $\hat f^\alpha_p(n)$. We see that the function values $\hat f^\alpha_p(n)$
are very close to the original
values obtained from the simulations.

[Figure 6: two panels (a) and (b) plotting the simulated mean against the sample size $n$ (40 to 100), with the approximating function superimposed.]
Fig. 6. The approximation $\hat f^\alpha_p(n)$ for two combinations of $p$ and $\alpha$, panels (a) and (b).
Finally, we return to the example in Section 3.1. We now use the corrected MCD
estimator to analyse the dataset. The resulting robust distances are plotted in Figure
5b. Using the same cutoff value we now find 1 outlier which corresponds to the
2.5% of outliers that is expected. Note that the corrected distances are much smaller
than the uncorrected ones. The corrected distances are all below $15.5$ while the
uncorrected distances range between $0$ and $20$. When we use the MCD with small
sample corrections the volume of the MCD scatter estimator is not underestimated
anymore, so we obtain more reliable robust distances and outlier identification.
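Why the distances shrink can be made explicit with a short Python sketch. It relies only on the definition of the robust distance, $d(z)^2 = (z - \hat\mu)^t \hat\Sigma^{-1} (z - \hat\mu)$: multiplying the scatter estimate by a correction factor divides every distance by the square root of that factor. The distances, the correction factor and the choice $p = 2$ (which gives a closed-form chi-square quantile) are hypothetical values for illustration.

```python
import math

def corrected_distances(raw_distances, correction):
    """Robust distances after multiplying the scatter estimate by `correction`."""
    return [d / math.sqrt(correction) for d in raw_distances]

def chi2_cutoff_2d(level=0.975):
    """Square root of the chi-square quantile with 2 degrees of freedom (closed form)."""
    return math.sqrt(-2.0 * math.log(1.0 - level))

raw = [1.1, 2.4, 3.0, 4.2, 9.5]  # hypothetical uncorrected robust distances, p = 2
factor = 2.5                      # hypothetical finite-sample correction factor
cutoff = chi2_cutoff_2d()
flag_raw = [d > cutoff for d in raw]
flag_corrected = [d > cutoff for d in corrected_distances(raw, factor)]
```

The cutoff is applied to the distances themselves, so the square root of the chi-square quantile is used; applying the quantile to squared distances is equivalent.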
4 Reweighted LTS and MCD
To increase the efficiency of the LTS and MCD, the reweighted version of these
estimators is often used in practice (Rousseeuw and Leroy, 1987). Similarly to the
initial LTS and MCD, the reweighted LTS scale and MCD scatter are not unbiased
at small samples even when the consistency factor is included. Therefore, we also
determine small sample corrections for the reweighted LTS and MCD based on the
corrected LTS and MCD as initial estimators. We performed Monte Carlo studies
similar to those for the initial LTS and MCD to compute the finite-sample correction factor for several sample sizes $n$ and dimensions $p$. Based on these simulation
results, we then constructed functions which determine the finite sample correction
factor for all $n$ and $p$.
5 Examples
Let us now look at some real data examples. First we consider the Coleman data
set which contains information on 20 schools from the Mid-Atlantic and New England states, drawn from a population studied by Coleman et al. (1966). The dataset
contains 5 predictor variables which are the staff salaries per pupil ($x_1$), the percent
of white-collar fathers ($x_2$), the socioeconomic status composite deviation ($x_3$), the
mean teacher's verbal test score ($x_4$) and the mean mother's educational level ($x_5$).
The response variable $y$ measures the verbal mean test score. Analyzing this dataset
using LTS with intercept for a fixed $\alpha$, we obtain the standardized residuals shown
in Figure 7. Figure 7a is based on LTS without correction factor while Figure 7b is
based on the corrected LTS. The corresponding results for the reweighted LTS are
shown in Figures 7c and 7d. Based on the uncorrected LTS several objects are identified
as outliers. On the other hand, by using the corrected LTS the standardized residuals
are rescaled and only 2 huge outliers and a boundary case are left. The standardized
residuals of the uncorrected LTS reach values up to 15 while the values of the
corrected LTS reach values up to 5. Also when using the reweighted LTS we
can see that the uncorrected LTS finds 5 outliers and 2 boundary cases while the
corrected version only finds 2 outliers.
[Figure 7: four panels plotting standardized residuals against the observation index (1 to 20); (a) uncorrected LTS, (b) corrected LTS, (c) uncorrected reweighted LTS, (d) corrected reweighted LTS.]
Fig. 7. Robust standardized residuals for the Coleman data based on LTS with intercept: (a) uncorrected, (b) corrected, (c) uncorrected reweighted, and (d) corrected reweighted.

In the second example we consider the aircraft dataset (Gray, 1985) which deals
with 23 single-engine aircraft built between 1947 and 1979. We use the MCD for a fixed
$\alpha$ to analyse the independent variables which are Aspect Ratio ($x_1$), Lift-to-Drag ratio ($x_2$), Weight ($x_3$) and Thrust ($x_4$). Based on MCD without correction
factor we obtain the robust distances shown in Figure 8a. We see that 4 observations
are identified as outliers of which aircraft 15 is a boundary case. The robust distance
of aircraft 14 equals 494. If we use the corrected MCD then we obtain the robust
distances in Figure 8b where the boundary case has disappeared. Note that the robust distances have been rescaled. For example the robust distance of aircraft 14 is
reduced to 395. Similar results are obtained for the reweighted MCD as shown by
Figures 8c and 8d.
[Figure 8: four panels plotting robust distances against the observation index; (a) uncorrected MCD, (b) corrected MCD, (c) uncorrected reweighted MCD, (d) corrected reweighted MCD. Points beyond the plotted range are annotated with their values.]
Fig. 8. Robust distances for the aircraft data based on MCD: (a) uncorrected, (b) corrected, (c) uncorrected reweighted, and (d) corrected reweighted.

6 Conclusions
Even when a consistency factor is included, this is not sufficient to make the LTS
and MCD unbiased at small samples. Consequently, the LTS based standardized
residuals and the MCD based robust distances are too large such that too many observations are identified as outliers. To solve this problem, we performed Monte
Carlo simulations to compute correction factors for several sample sizes $n$ and dimensions $p$. Based on the simulation results we constructed functions that allow
us to determine the correction factor for all sample sizes and all dimensions. Similar results have been obtained for the reweighted LTS and MCD. Some examples
have been given to illustrate the difference between the uncorrected and corrected
estimators.
References
J. Coleman et al. Equality of educational opportunity. U.S. Department of Health, Washington D.C., 1966.
C. Croux and G. Haesbroeck. Influence function and efficiency of the minimum covariance
determinant scatter matrix estimator. The J. of Multivariate Analysis, 71:161–190, 1999.
C. Croux and P.J. Rousseeuw. A class of high-breakdown scale estimators based on subranges. Comm. Statist., Theory Meth., 21:1935–1951, 1992.
J.B. Gray. Graphics for regression diagnostics. In Am. Statist. Assoc. Proceedings of the
Statist. Computing Section, pages 102–107, 1985.
P.J. Rousseeuw. Least median of squares regression. J. Am. Statist. Assoc., 79:871–880,
1984.
P.J. Rousseeuw and A.M. Leroy. Robust regression and outlier detection. Wiley, New York,
1987.
P.J. Rousseeuw and K. Van Driessen. Computing LTS regression for large data sets. Technical
report, University of Antwerp, 1998. Submitted.
P.J. Rousseeuw and K. Van Driessen. A fast algorithm for the minimum covariance determinant estimator. Technometrics, 41(3):212–223, 1999.