Journal of Statistical Computation and Simulation
Vol. 78, No. 8, August 2008, 701–712

Some small-sample properties of some recently proposed multivariate outlier detection techniques

RAND R. WILCOX*

Department of Psychology, University of Southern California, California, USA
*Email: rwilcox@usc.edu

(Received 29 September 2006; in final form 27 January 2007)

Recently, several new robust multivariate estimators of location and scatter have been proposed that provide new and improved methods for detecting multivariate outliers. But for small sample sizes, there are no results on how these new multivariate outlier detection techniques compare in terms of p_n, their outside rate per observation (the expected proportion of points declared outliers) under normality. And there are no results comparing their ability to detect truly unusual points based on the model that generated the data. Moreover, there are no results comparing these methods with two fairly new techniques that do not rely on some robust covariance matrix. It is found that for an approach based on the orthogonal Gnanadesikan–Kettenring estimator, p_n can be very unsatisfactory with small sample sizes, but a simple modification gives much more satisfactory results. Similar problems were found when using the median ball algorithm, but a modification proved to be unsatisfactory. The translated-biweights (TBS) estimator generally performs well with a sample size of n ≥ 20 and when dealing with p-variate data where p ≤ 5. But with p = 8 it can be unsatisfactory, even with n = 200.
A projection method as well as the minimum generalized variance method generally perform best, but with p ≤ 5, conditions where the TBS method is preferable are described. In terms of detecting truly unusual points, the methods can differ substantially depending on where the outliers happen to be, the number of outliers present, and the correlations among the variables.

Keywords: Robust methods; OGK estimator; TBS estimator; Median ball algorithm; Minimum generalized variance technique; Projection methods

1. Introduction

In various settings, multivariate outlier detection methods play an important role [1, 2]. When choosing a multivariate outlier detection technique, at least two fundamental properties are of interest. The first is the outside rate per observation, say p_n, which is the expected proportion of outliers among a sample of size n. When sampling from a multivariate normal distribution, generally it is desired to have p_n reasonably small, say 0.05, and often methods are 'tuned' to achieve this goal, at least when n is large [3]. With small to moderate sample sizes, however, typically it is unclear how well this goal is achieved and, at least in some cases, some adjustment is needed when n is small. A recent example in the univariate case can be found in Carling [4], who suggested a modification of the boxplot rule so that the outside rate per observation is fairly stable as a function of n.

A second fundamental goal is to avoid masking. Roughly, a method is said to suffer from masking if the very presence of outliers causes them to be missed. Let M be some multivariate measure of location, based on data randomly sampled from some p-variate distribution, and let C be some measure of scatter. If M is the usual sample mean and C the usual covariance matrix, based on X_1, ..., X_n, then the classic approach is to use the Mahalanobis distance

    D_i = ((X_i - M)' C^{-1} (X_i - M))^{1/2}    (1)

and declare X_i an outlier if D_i is sufficiently large. In particular, if the goal is to have p_n = α, declare X_i an outlier if

    D_i ≥ (χ^2_{1-α/2, p})^{1/2},    (2)

the square root of the 1 - α/2 quantile of a chi-squared distribution with p degrees of freedom. But it is well known that this method suffers from masking [1], roughly because the usual sample mean and covariance matrix are not robust. That is, outliers can greatly influence their values, which can cause D_i to be small even when X_i is highly atypical.

A seemingly natural approach to avoid masking is to take M and C in equation (1) to be some robust measure of location and scatter and then use equation (2). Campbell [5] proposed using a particular M-estimator, but the M-estimator used has a rather unsatisfactory breakdown point, where the breakdown point of an estimator is the smallest proportion of points that must be altered to make the estimate arbitrarily large or small. The M-estimator used has a breakdown point of only 1/(p + 1), meaning that masking can be a problem, particularly as p gets large. Consequently, Rousseeuw and van Zomeren [3] suggest using the minimum volume ellipsoid (MVE) estimator, which was introduced by Rousseeuw [6] and which is discussed in detail by Rousseeuw and Leroy [1]. It seems that this method performs quite well in terms of achieving p_n ≈ 0.05 [2].
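For reference, the classical rule in equations (1) and (2), which the robust methods described below are meant to replace, can be sketched in a few lines of base R; the function name is illustrative only, and the rule flags points whose Mahalanobis distance from the sample mean exceeds the cutoff. Because the mean and covariance matrix are not robust, this is precisely the rule that suffers from masking.

    # Classical (non-robust) Mahalanobis outlier rule, equations (1) and (2).
    # X: an n x p data matrix; alpha: nominal outside rate per observation.
    classic.out <- function(X, alpha = 0.05) {
      X <- as.matrix(X)
      M <- colMeans(X)                       # usual mean vector
      C <- cov(X)                            # usual covariance matrix
      D <- sqrt(mahalanobis(X, M, C))        # D_i of equation (1)
      crit <- sqrt(qchisq(1 - alpha / 2, df = ncol(X)))  # cutoff in equation (2)
      which(D >= crit)                       # indices of flagged points
    }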
But concerns have been expressed by Olive [7] as well as by Hawkins and Olive [8], and Fung [9] describes conditions where the MVE approach can declare too many points outliers. Rousseeuw and van Driessen [10] suggest replacing the MVE estimator with the fast minimum covariance determinant (FMCD) estimator, but with small to moderate sample sizes, p_n becomes unstable and might exceed 0.05 by an unacceptable amount [2].

There are at least three alternatives to the MVE and FMCD estimators that might be used instead: the median ball algorithm (MBA) proposed by Olive [7], the so-called orthogonal Gnanadesikan–Kettenring (OGK) estimator suggested by Maronna and Zamar [11], and the translated-biweights (TBS) estimator derived by Rocke [12]. But it seems that nothing is known about how these methods perform in terms of p_n. One goal here is to report some small-sample results relevant to this issue. Another goal is to include results on two other outlier detection methods that do not use Mahalanobis distance.

2. Description of the methods

The first portion of this section describes the measures of location and scatter that were considered when using equations (1) and (2). Then the computational details for the other two outlier detection methods are provided.

2.1 The OGK estimator

In its general form, the OGK estimator, derived by Maronna and Zamar [11], is applied as follows. Let σ(X) and μ(X) be any measures of dispersion and location, respectively. The method begins with the robust covariance between any two variables, say X and Y, which was proposed by Gnanadesikan and Kettenring [13]:

    cov(X, Y) = (1/4)(σ(X + Y)^2 - σ(X - Y)^2).    (3)

When σ(X) and μ(X) are the usual standard deviation and mean, respectively, the usual covariance between X and Y results. Here, following Maronna and Zamar, σ(X) is taken to be the tau scale of Yohai and Zamar [14]. Let

    W_c(x) = (1 - (x/c)^2)^2 I(|x| ≤ c)

and

    ρ_c(x) = min(x^2, c^2),

where the indicator function I(|x| ≤ c) = 1 if |x| ≤ c and 0 otherwise. For the univariate sample X_1, ..., X_n, let MAD(X) be the median of |X_1 - M_x|, ..., |X_n - M_x|, where M_x is the usual median of X_1, ..., X_n, and let

    ω_i = W_{c1}((X_i - M_x)/MAD(X)).

Then the location and scale statistics are defined as

    μ(X) = Σ ω_i X_i / Σ ω_i

and

    σ(X)^2 = (MAD(X)^2/n) Σ ρ_{c2}((X_i - μ(X))/MAD(X)).

Using this measure of scale in equation (3), the resulting measure of covariance will be denoted by υ(X, Y). Again following Maronna and Zamar, the choices for c_1 and c_2 are taken to be 4.5 and 3, respectively.

Following the notation in Maronna and Zamar [11], let x_i be the ith row of the n × p matrix X. Then Maronna and Zamar define a scatter matrix V(X) and a location vector t(X) as follows:

1. Let D = diag(σ(X_1), ..., σ(X_p)) and y_i = D^{-1} x_i, i = 1, ..., n.
2. Compute U = (U_{jk}) by applying υ to the columns of Y.
3. Compute the eigenvectors e_j of U and call E the matrix whose columns are the e_j's.
4. Let A = DE and z_i = A^{-1} x_i, in which case V(X) = A Λ A' and t(X) = A ν, where Λ = diag(σ(Z_1)^2, ..., σ(Z_p)^2) and ν = (μ(Z_1), ..., μ(Z_p))'.

Maronna and Zamar [11] note that the above procedure can be iterated and report results suggesting that a single iteration be used. More precisely, compute V and t for Z (the matrix corresponding to the z_i computed in step 4) and then express them in the original coordinate system, namely, V_2 = A V(Z) A' and t_2(X) = A t(Z).
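To make the building blocks concrete, here is a minimal base-R sketch of the tau scale and the Gnanadesikan–Kettenring covariance in equation (3), with c_1 = 4.5 and c_2 = 3; the function names are illustrative and are not from any published package.

    # Tau scale of Yohai and Zamar, as defined above.
    tau.scale <- function(x, c1 = 4.5, c2 = 3) {
      Mx <- median(x)
      madx <- median(abs(x - Mx))                # MAD(X), unnormalized
      u <- (x - Mx) / madx
      w <- (1 - (u / c1)^2)^2 * (abs(u) <= c1)   # weights W_{c1}
      mu <- sum(w * x) / sum(w)                  # tau measure of location
      sig2 <- (madx^2 / length(x)) *
        sum(pmin(((x - mu) / madx)^2, c2^2))     # rho_{c2} applied and averaged
      sqrt(sig2)
    }

    # Gnanadesikan-Kettenring covariance, equation (3), using the tau scale.
    gk.cov <- function(x, y) {
      (tau.scale(x + y)^2 - tau.scale(x - y)^2) / 4
    }

With the usual standard deviation in place of tau.scale, gk.cov reduces to the ordinary sample covariance, which is the motivation behind equation (3).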
Maronna and Zamar show that the estimate can be improved by a reweighting step. Let

    d_i = (Σ_{j=1}^p ((z_{ij} - μ(Z_j))/σ(Z_j))^2)^{1/2}

and ω_i = I(d_i ≤ d_0), where

    d_0 = χ^2_{p,β} med(d_1, ..., d_n) / χ^2_{p,0.5},

χ^2_{p,β} is the β quantile of the chi-squared distribution with p degrees of freedom and 'med' denotes the median. Then the measure of location is now estimated to be

    t_ω = Σ ω_i x_i / Σ ω_i

and the measure of scatter is

    V_ω = Σ ω_i (x_i - t_ω)(x_i - t_ω)' / Σ ω_i.

Here, when using the OGK estimator, equation (2) was used to check for outliers. Results reported by Maronna and Zamar [11] suggest using β = 0.9, but here it is found that this can result in p_n exceeding 0.05 by a considerable amount when n is small; moreover, p_n is rather unstable as a function of n. A preliminary simulation was run with the goal of possibly adjusting β so that p_n is approximately 0.05 under multivariate normality. Based on simulation results with n = 10, 20, 50, 100, 200 and p = 2(1)10, the following choice for β is suggested:

    β = max(0.95, min(0.99, 1/n + 0.94)),

and this choice is used henceforth.

2.2 The TBS estimator

Rocke [12] proposed what is called a TBS estimator. Generally, S-estimators of multivariate location and scatter are the values θ̂ and S that minimize |S|, the determinant of S, subject to

    (1/n) Σ_{i=1}^n ξ(((X_i - θ̂)' S^{-1} (X_i - θ̂))^{1/2}) = b_0,    (4)

where b_0 is some constant and ξ is a non-decreasing function. But Rocke [12] showed that S-estimators can be sensitive to outliers even if the breakdown point is close to 0.5. He suggested an alternative approach where the function ξ(d) is defined as follows. Let m and c be values to be determined. For 0 ≤ d < m,

    ξ(d) = d^2/2.

When m ≤ d ≤ m + c,

    ξ(d) = m^2/2 - m^2(m^4 - 5m^2c^2 + 15c^4)/(30c^4)
           + d^2 (1/2 + m^4/(2c^4) - m^2/c^2)
           + d^3 (4m/(3c^2) - 4m^3/(3c^4))
           + d^4 (3m^2/(2c^4) - 1/(2c^2))
           - 4md^5/(5c^4) + d^6/(6c^4),

and for d > m + c,

    ξ(d) = m^2/2 + c(5c + 16m)/30.

The values of m and c can be chosen to achieve the desired breakdown point and the asymptotic rejection probability, roughly referring to the probability that a point will get zero weight when the sample size is large. If the asymptotic rejection probability is to be γ, say, then m and c are determined by

    E_{χ^2_p}[ξ(d)] = b_0

and

    m + c = (χ^2_{p,1-γ})^{1/2}.

An iterative estimation method is used to compute the measures of location and scatter [15], which requires an initial estimate of location and scatter. Here this initial estimate is the MVE estimator, computed with the S-PLUS function cov.mve, but some results on using an alternative initial estimate are mentioned in section 3. As with the OGK estimator, when using TBS, checks for outliers are based on equation (2).

2.3 Median ball algorithm

Following Olive [7], the reweighted MBA (RMBA) begins with two initial estimates of location and scatter, both of which are based on an iterative algorithm. The basic strategy is as follows. For the jth estimator (j = 1, 2), let (T_{0,j}, C_{0,j}) be some starting value. Compute all n Mahalanobis distances D_i(T_{0,j}, C_{0,j}) based on this measure of location and scatter. The next iteration consists of computing the usual mean and covariance matrix based on the c_n ≈ n/2 cases corresponding to the smallest distances, yielding (T_{1,j}, C_{1,j}). Repeating this process based on D_i(T_{1,j}, C_{1,j}) yields an updated measure of location and scatter, (T_{2,j}, C_{2,j}). As done by Olive, (T_{5,j}, C_{5,j}) is used here.
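A minimal base-R sketch of one such concentration sequence follows; it illustrates the strategy just described and is not Olive's RMBA code, and the function name is illustrative.

    # One MBA concentration sequence: starting from (T0, C0), take the mean
    # and covariance of the roughly n/2 cases with the smallest Mahalanobis
    # distances, and iterate five times, as done by Olive [7].
    concentrate <- function(X, T0, C0, iters = 5) {
      X <- as.matrix(X)
      cn <- ceiling(nrow(X) / 2)
      for (k in 1:iters) {
        d2 <- mahalanobis(X, T0, C0)             # squared distances
        keep <- order(d2)[1:cn]                  # the cn closest cases
        T0 <- colMeans(X[keep, , drop = FALSE])
        C0 <- cov(X[keep, , drop = FALSE])
      }
      list(T = T0, C = C0)
    }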
The first of the two starting values used by Olive takes (T_{0,1}, C_{0,1}) to be the usual mean and covariance matrix. The other starting value, (T_{0,2}, C_{0,2}), is the usual mean and covariance based on the c_n cases that are closest to the coordinatewise median in Euclidean distance. Let (T_A, C_A) = (T_{5,i}, C_{5,i}), where i = 1 if the determinant |C_{5,1}| ≤ |C_{5,2}|, and i = 2 otherwise. The RMBA estimator of location is T_A and the measure of scatter is

    C_RMBA = [med(D_i^2(T_A, C_A)) / χ^2_{p,0.5}] C_A.

Olive and Hawkins [16] show that the RMBA estimator is √n consistent. (The R function RMBA, available at www.math.siu.edu/olive/rpack.txt, computes the RMBA estimate of location and scatter and was used in the simulations described in section 3.)

It was found that if Mahalanobis distance is computed using the RMBA estimator, and points are declared outliers using equation (2) with α = 0.975, the outside rate per observation is reasonably close to 0.05 under normality, provided that n/p ≥ 10, at least for 2 ≤ p ≤ 12. But otherwise the outside rate per observation can be very unsatisfactory. For example, with n = 20 and p = 5 it was estimated to exceed 0.24, regardless of the correlation among the variables. So this approach is not as satisfactory as other methods considered when n is small, but the RMBA measure of location does seem to be practical when used in conjunction with the minimum generalized variance (MGV) method described next.

2.4 The MGV method

The next approach is based on what is called the MGV method. It has certain similarities to the forward search method described by Atkinson et al. [17, p. 4], but it differs in ways that will become clear. From basic multivariate techniques, the generalized variance is the determinant of the usual covariance matrix; it reflects how tightly a cloud of points is clustered together. The method in this section exploits the fact that the generalized variance is not robust; a single unusual point can greatly inflate its value. The MGV method is applied as follows.

1. Initially, all n points are described as belonging to set A.
2. Find the p points that are most centrally located. There are many ways this might be done and two are considered here. The first is based on

       d_i = Σ_{j=1}^p Σ_{ℓ=1}^n ((X_{ℓj} - X_{ij})/MAD_j)^2,    (5)

   where MAD_j is the value of the median absolute deviation (MAD) based on X_{1j}, ..., X_{nj}. The p most centrally located points are taken to be the points having the p smallest d_i values. The second and generally more satisfactory approach takes the p most centrally located points to be the p points having the smallest Mahalanobis distance based on the RMBA estimators, T_A and C_RMBA.
3. Remove the p centrally located points from set A and put them into set B. At this step, the generalized variance of the points in set B is zero.
4. If, among the points remaining in set A, the ith point is put in set B, the generalized variance of the points in set B will be changed to some value labeled s^2_{gi}. That is, associated with every point remaining in A is the value s^2_{gi}, the generalized variance that results when it, and it only, is placed in set B. Compute s^2_{gi} for every point in A.
5. Among the s^2_{gi} values computed in the previous step, permanently remove the point associated with the smallest s^2_{gi} value from set A and put it in set B. That is, find the point in set A that is most tightly clustered together with the points in set B.
   Once this point is identified, it is permanently removed from A and remains in B henceforth.
6. Repeat steps 4 and 5 until all points are in set B.

The first p points removed from set A have a generalized variance of zero, which is labeled s^2_{g(1)} = ... = s^2_{g(p)} = 0. When the next point is removed from A and put into B (using steps 4 and 5), the resulting generalized variance of set B is labeled s^2_{g(p+1)}, and continuing this process, each point has associated with it some generalized variance when it is put into set B.

Based on the process just described, the ith point has associated with it one of the generalized variances just computed. For convenience, this generalized variance associated with the ith point, s^2_{g(j)}, is labeled C_i. The p deepest points have C values of zero. Points located at the edges of a scatterplot have the highest C values, meaning that they are relatively far from the center of the cloud of points. A strategy for detecting outliers is simply to apply some good univariate outlier rule to the C_i values. Note that a point would not be declared an outlier if C_i is small, only if C_i is large.

In terms of maintaining an outside rate per observation that is stable as a function of n and p, and approximately equal to 0.05 under normality, a boxplot rule for detecting outliers seems best when p = 2, and for p > 2 a slight generalization of Carling's [4] modification of the boxplot rule appears to perform well (see the sketch at the end of this section). In particular, if p = 2, then declare the ith point an outlier if

    C_i > q_2 + 1.5(q_2 - q_1),    (6)

where q_1 and q_2 are the ideal fourths based on the C_i values. For p > 2 variables, replace equation (6) with

    C_i > M_C + (χ^2_{0.975,p})^{1/2} (q_2 - q_1),    (7)

where (χ^2_{0.975,p})^{1/2} is the square root of the 0.975 quantile of a chi-squared distribution with p degrees of freedom and M_C is the usual median of the C_i values.

A possible criticism, when detecting outliers among the C_i values, is that the interquartile range has a breakdown point of 0.25. Ideally, a univariate outlier detection method would have a breakdown point of 0.5, the highest possible value. This can be achieved with a commonly used MAD-median rule. When p = 2, for example, this means that a point X_i is declared an outlier if

    |C_i - M_C| / (MAD_C/0.6745) > 2.24,    (8)

where MAD_C is the value of MAD based on the C values. A concern about this approach is that the outside rate per observation is no longer stable as a function of n, and no method for correcting this problem is available at this time.

For completeness, details of the forward search method suggested by Atkinson et al. [17] are outlined in order to provide some sense of how it compares to the MGV method. The forward search method begins with a subset of m points. Based on these m points, Mahalanobis distances are computed for all n points, and the m + 1 points having the smallest Mahalanobis distances form the new subset. This is repeated until all n points are included. Then a plot is created [17, p. 7] and simulations are used to determine the kind of fluctuations that are to be expected in this plot. The Mahalanobis distances are calculated for each sample and ordered. Another approach is to plot the change in the Mahalanobis distances as each new point is added; see figure 1.3 in Atkinson et al. So unlike MGV, ellipsoidal regions are used to determine whether points are outliers.
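Returning to the MGV method, a minimal base-R sketch of the decision rules in equations (6) and (7) follows, assuming the C_i values have already been computed by the process above; idealf and mgv.flag are illustrative names, not functions from a published package.

    # Ideal fourths of a numeric vector, returned as c(q1, q2).
    idealf <- function(x) {
      n <- length(x)
      j <- floor(n / 4 + 5 / 12)
      h <- n / 4 + 5 / 12 - j
      xs <- sort(x)
      k <- n - j + 1
      c((1 - h) * xs[j] + h * xs[j + 1],      # q1
        (1 - h) * xs[k] + h * xs[k - 1])      # q2
    }

    # MGV decision rule applied to the generalized-variance values C.
    mgv.flag <- function(C, p) {
      q <- idealf(C)
      if (p == 2) {
        crit <- q[2] + 1.5 * (q[2] - q[1])                            # equation (6)
      } else {
        crit <- median(C) + sqrt(qchisq(0.975, p)) * (q[2] - q[1])    # equation (7)
      }
      which(C > crit)
    }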
In some sense the forward search approach might compete well with the methods considered here, but due to the role of the plots that are used, including it in the simulations used here is difficult at best.

2.5 A projection method

Consider any projection of the data onto a straight line. A projection-type method for detecting outliers among multivariate data is based on the idea that if a point is an outlier, then it should be an outlier for some projection of the n points. So if it were possible to consider all possible projections, and if for some projection a point is an outlier, then the point is declared an outlier. Not all projections can be considered, so following Wilcox [2], the strategy is to orthogonally project the data onto all n lines formed by the center of the data cloud, as represented by ξ̂, and each X_i. It seems natural that ξ̂ should have a high breakdown point. Here, two choices for ξ̂ were considered. The first is a weighted mean, based in part on FMCD, which is computed with the S-PLUS function cov.mcd. More precisely, Mahalanobis distance is computed using the FMCD measures of location and scatter, any points flagged as outliers using equation (2) are removed, and the measure of location is taken to be the mean of the remaining values. The second measure of location considered was RMBA.

The computational details are as follows. Fix i, and for the point X_i, orthogonally project all n points onto the line connecting ξ̂ and X_i, and let D_ij be the distance between ξ̂ and X_j based on this projection. More formally, let

    A_i = X_i - ξ̂,    B_j = X_j - ξ̂,

where both A_i and B_j are column vectors having length p, and let

    C_j = (A_i' B_j / B_j' B_j) B_j,

j = 1, ..., n. Then when projecting the points onto the line between X_i and ξ̂, the distance of the jth point from ξ̂ is

    D_ij = ||C_j|| = (C_{j1}^2 + ... + C_{jp}^2)^{1/2}.

Here, an extension of Carling's modification of the boxplot rule (similar to the modification used by the MGV method) is used to check for outliers among the D_ij values. Let

    ℓ = [n/4 + 5/12],

where [·] is the greatest integer function, and let

    h = n/4 + 5/12 - ℓ.

For fixed i, let D_{i(1)} ≤ ... ≤ D_{i(n)} be the n distances written in ascending order. The ideal fourths associated with the D_ij values are

    q_1 = (1 - h) D_{i(ℓ)} + h D_{i(ℓ+1)}

and

    q_2 = (1 - h) D_{i(ℓ')} + h D_{i(ℓ'-1)},

where ℓ' = n - ℓ + 1. Then the jth point is declared an outlier if

    D_ij > M_D + (χ^2_{0.975,p})^{1/2} (q_2 - q_1),    (9)

where M_D is the usual sample median based on the D_{i1}, ..., D_{in} values.

The process just described is for a single projection; for fixed i, points are projected onto the line connecting X_i to ξ̂. Repeating this process for each i, i = 1, ..., n, a point is declared an outlier if, for any of these projections, it satisfies equation (9). This will be called method OP, which has certain similarities with a projection method suggested by Peña and Prieto [18]. One important difference is that the method used by Peña and Prieto is based on the usual sample mean, which is not robust and which in turn could result in masking.

As was the case with the MGV method, a simple and seemingly desirable modification of the method just described is to replace the interquartile range with the MAD measure of scale based on the values D_{i1}, ..., D_{in}. So here, MAD is the median of the values |D_{i1} - M_D|, ..., |D_{in} - M_D|, which is denoted by MAD_i. Then the jth point is declared an outlier if, for any i,

    D_ij > M_D + (χ^2_{0.95,p})^{1/2} MAD_i/0.6745.    (10)
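A minimal base-R sketch of a single projection step and the check in equation (9) follows; replacing the interquartile range with MAD_i/0.6745 gives the variant in equation (10). Here, idealf is the helper from the MGV sketch above, op.flag.one is an illustrative name, and center plays the role of ξ̂.

    # One projection step of method OP: project all points onto the line
    # through center and X_i, then apply the rule in equation (9).
    op.flag.one <- function(X, center, i) {
      X <- as.matrix(X)
      p <- ncol(X)
      A <- X[i, ] - center                     # A_i = X_i - center
      B <- sweep(X, 2, center)                 # rows are B_j = X_j - center
      Bnorm <- sqrt(rowSums(B^2))
      D <- abs(as.numeric(B %*% A)) / Bnorm    # D_ij = ||C_j||
      D[Bnorm == 0] <- 0                       # a point at the center has distance 0
      q <- idealf(D)
      crit <- median(D) + sqrt(qchisq(0.975, p)) * (q[2] - q[1])   # equation (9)
      which(D > crit)
    }

Method OP applies op.flag.one for every i = 1, ..., n and declares a point an outlier if it is flagged in at least one projection.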
Equation (10) represents an approximation of the method given by equation (1.3) in Donoho and Gasko [19]. Again, an appealing feature of MAD is that it has a higher finite-sample breakdown point than the interquartile range. But a negative feature of equation (10) is that the outside rate per observation appears to be less stable as a function of n. In the bivariate case, for example, it is approximately 0.09 with n = 10 and drops below 0.02 as n increases. For the same situations, the outside rate per observation using equation (9) ranges, approximately, between 0.043 and 0.038.

3. Simulation results

Simulations were used to estimate p_n under normality, and some additional simulations were used as a partial check on the ability of the methods to detect truly unusual points. All simulations were done in S-PLUS, which has a built-in function for computing the TBS estimator. The OGK estimator was applied using software written by K. Konis, with a minor error corrected by V. Todorov. The MBA was applied with the previously mentioned R code written by D. Olive. Methods OP (using FMCD) and MGV were applied with software in Wilcox [2].

When generating data from a multivariate normal distribution, a common correlation, ρ, was used, each method was applied, and the number of points flagged as outliers was recorded. (All marginal distributions have mean 0 and variance 1.) This process was replicated 1000 times, resulting in p̂_n, the average proportion of points declared outliers. Table 1 shows the results for ρ = 0 and 0.9. (Results under MGV used the first measure of location mentioned in section 2.4; the column MGV(RMBA) reports results when the MGV method uses RMBA as the measure of location.) It is noted that some additional simulations were run where the correlation between variables j and k was taken to be ρ_{jk} = jk/(p + 1)^2. The results were very similar to those for ρ = 0, so for brevity they are not reported.

The goal is to have p_n reasonably close to 0.05. If we view a method as unacceptable when p̂_n exceeds 0.075, no method is acceptable among all of the conditions considered. This is not surprising when p is large relative to n. For example, Rousseeuw and van Zomeren [3] indicate that their method should be used only when n/p ≥ 5. All indications are that with n/p ≥ 5, methods MGV and OP provide adequate control over p_n, at least when p ≤ 12, but when p = 2, methods OGK and TBS can be unsatisfactory when n = 20.

Table 1. Outside rate per observation.

  n     ρ     p    OGK     TBS     MGV     OP      MGV(RMBA)
  20    0.0   2    0.047   0.078   0.068   0.045   0.069
  20    0.9   2    0.081   0.078   0.068   0.049   0.072
  100   0.0   2    0.054   0.031   0.053   0.026   0.052
  100   0.9   2    0.093   0.005   0.052   0.019   0.053
  20    0.0   5    0.071   0.037   0.064   0.067   0.063
  20    0.9   5    0.062   0.038   0.068   0.011   0.065
  100   0.0   5    0.061   0.028   0.030   0.035   0.031
  100   0.9   5    0.062   0.027   0.033   0.005   0.031
  20    0.0   8    0.090   0.015   0.091   0.081   0.088
  20    0.9   8    0.069   0.032   0.046   0.064   0.091
  40    0.0   8    0.091   0.031   0.044   0.064   0.045
  40    0.9   8    0.075   0.062   0.053   0.003   0.044
  60    0.0   8    0.081   0.033   0.038   0.054   0.066
  60    0.9   8    0.068   0.069   0.046   0.003   0.039
  100   0.0   8    0.068   0.107   0.040   0.045   0.044
  100   0.9   8    0.061   0.109   0.046   0.002   0.064
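The estimates in table 1 were produced by a simulation of the general form sketched below in base R; the data generation by a Cholesky factor and the function name est.pn are assumptions for illustration, with out.fun standing for any of the detection rules sketched earlier (a function returning the indices of flagged points).

    # Estimate the outside rate per observation p_n under normality with a
    # common correlation rho, averaging over 1000 replications as in section 3.
    est.pn <- function(n, p, rho, out.fun, nrep = 1000) {
      R <- matrix(rho, p, p)
      diag(R) <- 1                              # common-correlation matrix
      U <- chol(R)                              # R = t(U) %*% U
      rates <- replicate(nrep, {
        X <- matrix(rnorm(n * p), n, p) %*% U   # N(0, R) data
        length(out.fun(X)) / n                  # proportion declared outliers
      })
      mean(rates)                               # the estimate of p_n
    }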
A general concern about TBS is that as n increases, it is unclear that p_n approaches 0.05. And perhaps a more serious concern is that situations are encountered where p_n increases beyond 0.1 as n gets large. This is the case with p = 8, as indicated in table 1, and a similar result was obtained with p = 12. The reason for this is unknown. As a partial check on this result, the built-in S-PLUS function that computes Rocke's TBS estimator was replaced by S-PLUS code written by Prof. Rocke, but the same result was obtained. One feature worth noting is that the algorithm relies on an initial estimate of location and scatter, which can be altered when using Rocke's code. It was found that altering the initial estimate from MVE to the weighted mean used by method MGV had a considerable impact on p_n. But regardless of which initial estimate was used, improved control over p_n was not obtained, and using the OGK estimator failed to correct this problem as well. Some additional simulations were run with p = 5 and n = 200; now p_n is approximately 0.025, so there is no indication that a similar problem occurs with p = 5. But caution seems warranted, and it would seem that for p > 5, if the goal is to control the outside rate per observation, some other method should be used.

Another issue has to do with the ability of a method to correctly identify points that are truly unusual based on the model that generated the data. And a related issue is, as unusual points are added to the data, what happens to the ability of a method to detect them? To provide at least some results relevant to these issues, first consider the case p = 2, n = 20, ρ = 0.9, and suppose an additional point is added at (1, −1). This point has Mahalanobis distance 4.47, meaning that its distance from the origin is unusually large. Of interest is the probability of a correct decision (PCD) regarding this point when applying the methods under study. The first line of table 2 reports the estimated PCD for this special case. The second line reports the estimated PCD when there are two points added at (1, −1). (So the notation (1, −1)^2 indicates that two points were added at (1, −1).) Results are also given when adding 3, 4, 5 and 6 points.

Table 2. Estimated PCD, n = 20, ρ = 0.9.

  Point                    OGK     TBS     MGV     OP      MGV(RMBA)
  p = 2
  (1, −1)^1                0.99    0.99    0.88    0.78    0.98
  (1, −1)^2                0.93    0.98    0.86    0.66    0.94
  (1, −1)^3                0.82    0.95    0.78    0.54    0.85
  (1, −1)^4                0.65    0.84    0.78    0.54    0.69
  (1, −1)^5                0.42    0.18    0.09    0.54    0.35
  (1, −1)^6                0.22    0.17    0.04    0.03    0.04
  (2.3, −2.3)^1            0.18    0.28    0.21    0.26    0.17
  (2.3, −2.3)^2            0.17    0.14    0.07    0.34    0.08
  p = 5
  (1, −1, 0, 0, 0)^1       0.63    0.62    0.61    0.20    0.61
  (1, −1, 0, 0, 0)^2       0.33    0.03    0.47    0.12    0.51
  (2.3, −2.3, 0, 0, 0)^1   1.00    1.00    0.51    0.30    0.42

For one or two points, methods OGK and TBS have a higher estimated PCD than methods MGV and OP. But as the number of outliers is increased from four to five, method TBS performs in a relatively poor fashion, as does method MGV. Note, however, that MGV(RMBA) performs relatively well. That is, in this case, the choice of the measure of location used by the MGV method makes a practical difference. With five outliers, method OP is best, with methods OGK and MGV(RMBA) performing reasonably well.
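The exact mechanics of the PCD simulations are not spelled out in the text, but an estimate of this kind can be formed along the following lines; this base-R sketch, with the illustrative name est.pcd, plants k copies of an outlier in otherwise bivariate normal data and records how often the rule flags the planted points, which is an assumption about the design rather than a description of the code actually used.

    # Estimate the PCD for k planted copies of the point pt, with n clean
    # bivariate normal observations having correlation rho.
    est.pcd <- function(out.fun, pt = c(1, -1), k = 1, n = 20, rho = 0.9,
                        nrep = 1000) {
      R <- matrix(c(1, rho, rho, 1), 2, 2)
      U <- chol(R)
      hits <- replicate(nrep, {
        X <- matrix(rnorm(n * 2), n, 2) %*% U
        X <- rbind(X, matrix(pt, k, 2, byrow = TRUE))  # k planted outliers
        all((n + 1):(n + k) %in% out.fun(X))           # all planted points caught?
      })
      mean(hits)                                       # estimated PCD
    }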
To complicate matters, the location of an outlier can affect the ability of a method to detect it, even when its Mahalanobis distance is extreme. Consider again the case where a single outlier is placed at (1, −1), only now another outlier is placed at (2.3, −2.3). This second point has a Mahalanobis distance of 10.23, which would seem to be exceptionally large. But the estimated PCD for this second point is only 0.18, 0.28, 0.21, 0.26 and 0.17 for methods OGK, TBS, MGV, OP and MGV(RMBA), respectively. With three outliers at (1, −1), the estimates are now 0.17, 0.14, 0.07, 0.34 and 0.13. So now, method OP is best.

Finally, the bottom portion of table 2 shows some results when p = 5, there is a single outlier at (2.3, −2.3, 0, 0, 0), and there are one or more outliers at (1, −1, 0, 0, 0). The entries in table 2 are the estimated PCD when trying to detect the outlier at (1, −1, 0, 0, 0). Now methods MGV and MGV(RMBA) perform as well as or better than the other three methods, and with three outliers at (1, −1, 0, 0, 0), they offer a distinct advantage. But in terms of detecting the outlier at (2.3, −2.3, 0, 0, 0), they do not compete well with methods OGK and TBS. For example, the last line in table 2 shows the estimated PCD with three points at (1, −1, 0, 0, 0) when the goal is to detect the outlier at (2.3, −2.3, 0, 0, 0). As indicated, the PCD is estimated to be 0.51 using method MGV, versus 1.00 when using methods OGK and TBS.

To add perspective, when adding outliers to the data, the outside rate per observation among the original n observations, generated from a multivariate normal distribution, was checked. With p = 5, n = 20 and two outliers at (1, −1, 0, 0, 0) and (2.3, −2.3, 0, 0, 0), the rates for OGK, TBS, OP, MGV and MGV(RMBA) were 0.130, 0.229, 0.036, 0.058 and 0.028, respectively. So both OGK and TBS are unsatisfactory; methods OP, MGV and MGV(RMBA) are preferable. Increasing n to 30, again with two outliers, the rates are 0.063, 0.071, 0.007, 0.019 and 0.021. With n = 100, all five methods have estimated rates less than 0.02, with methods OP, MGV and MGV(RMBA) having estimates less than 0.01.

4. Concluding remarks

No single method dominates, and no single method is always satisfactory in terms of controlling p_n, the outside rate per observation. But for routine use, methods OP, MGV and MGV(RMBA) seem best based on the estimated p_n values reported here. However, for p ≤ 5, a case can be made for using TBS, the main exception being p = 2 and n = 20. The simple strategy of using the OGK estimator with β fixed was found to be unsatisfactory. When β is adjusted as suggested in section 2, its performance improves considerably, but MGV and OP seem preferable in general.

In terms of detecting true outliers, the situation is complex. In particular, the ability of a method to detect an outlier depends on where the outlier is located relative to the entire cloud of points, and it can depend on how many outliers there happen to be. In some situations, method TBS is a clear winner, but situations arise where the reverse is true. So if p ≤ 5, it would seem that TBS should be given serious consideration, both in terms of p_n and its ability to detect true outliers. But with p > 5, all indications are that methods OP, MGV and MGV(RMBA) are preferable, with MGV(RMBA) performing a bit better than MGV; even when p ≤ 5, they can have a higher PCD than TBS.

References

[1] Rousseeuw, P.J. and Leroy, A.M., 1987, Robust Regression & Outlier Detection (New York: Wiley).
[2] Wilcox, R.R., 2005, Introduction to Robust Estimation and Hypothesis Testing (2nd edn) (San Diego, CA: Academic Press).
[3] Rousseeuw, P.J. and van Zomeren, B.C., 1990, Unmasking multivariate outliers and leverage points. Journal of the American Statistical Association, 85, 633–639.
[4] Carling, K., 2000, Resistant outlier rules and the non-Gaussian case. Computational Statistics & Data Analysis, 33, 249–258.
[5] Campbell, N.A., 1980, Robust procedures in multivariate analysis I: Robust covariance estimation. Applied Statistics, 29, 231–237.
[6] Rousseeuw, P.J., 1985, Multivariate estimation with high breakdown point. In: W. Grossmann, G. Pflug and W. Wertz (Eds) Mathematical Statistics and Applications, B (Dordrecht: Reidel Publishing), pp. 283–297.
[7] Olive, D.J., 2004, A resistant estimator of multivariate location and dispersion. Computational Statistics & Data Analysis, 46, 93–102.
[8] Hawkins, D.M. and Olive, D., 2002, Inconsistency of resampling algorithms for high-breakdown regression estimators and a new algorithm. Journal of the American Statistical Association, 97, 136–147.
[9] Fung, W.-K., 1993, Unmasking outliers and leverage points: A confirmation. Journal of the American Statistical Association, 88, 515–519.
[10] Rousseeuw, P.J. and Van Driessen, K., 1999, A fast algorithm for the minimum covariance determinant estimator. Technometrics, 41, 212–223.
[11] Maronna, R.A. and Zamar, R.H., 2002, Robust estimates of location and dispersion for high-dimensional datasets. Technometrics, 44, 307–317.
[12] Rocke, D.M., 1996, Robustness properties of S-estimators of multivariate location and shape in high dimension. Annals of Statistics, 24, 1327–1345.
[13] Gnanadesikan, R. and Kettenring, J.R., 1972, Robust estimates, residuals and outlier detection with multiresponse data. Biometrics, 28, 81–124.
[14] Yohai, V.J. and Zamar, R., 1988, High breakdown point estimates of regression by means of the minimization of an efficient scale. Journal of the American Statistical Association, 86, 403–413.
[15] Rocke, D.M. and Woodruff, D.L., 1993, Computation of robust estimates of multivariate location and shape. Statistica Neerlandica, 47, 27–42.
[16] Olive, D.J. and Hawkins, D.M., 2006, Robustifying robust estimators. Preprint available online at: www.math.siu.edu/olive/preprints.htm.
[17] Atkinson, A.C., Riani, M. and Cerioli, A., 2004, Exploring Multivariate Data with the Forward Search (New York: Springer).
[18] Peña, D. and Prieto, F.J., 2001, Multivariate outlier detection and robust covariance matrix estimation. Technometrics, 43, 286–299.
[19] Donoho, D.L. and Gasko, M., 1992, Breakdown properties of the location estimates based on halfspace depth and projected outlyingness. Annals of Statistics, 20, 1803–1827.
[20] Rocke, D.M. and Woodruff, D.L., 1996, Identification of outliers in multivariate data. Journal of the American Statistical Association, 91, 1047–1061.