Nonparametric Two-Sample Methods for Ranked-Set Sample Data Michael A. F
Transcription
Nonparametric Two-Sample Methods for Ranked-Set Sample Data Michael A. F
Nonparametric Two-Sample Methods for Ranked-Set Sample Data Michael A. F LIGNER and Steven N. M AC E ACHERN A new collection of procedures is developed for the analysis of two-sample, ranked-set samples, providing an alternative to the Bohn–Wolfe procedure. These procedures split the data based on the ranks in the ranked-set sample and lead to tests for the centers of distributions, confidence intervals, and point estimators. The advantages of the new tests are that they require essentially no assumptions about the mechanism by which rankings are imperfect, that they maintain their level whether rankings are perfect or imperfect, that they lead to generalizations of the Bohn–Wolfe procedure that can be used to increase power in the case of perfect rankings, and that they allow one to analyze both balanced and unbalanced ranked-set samples. A new class of imperfect ranking models is proposed, and the performance of the procedure is investigated under these models. When rankings are random, a theorem is presented which characterizes efficient data splits. Because random rankings are equivalent to iid samples, this theorem applies to a wide class of statistics and has implications for a variety of computationally intensive methods. KEY WORDS: Mann–Whitney; Perceptual error models; Pitman efficacy; Ranking models; Thurstonian model; Wilcoxon. 1. INTRODUCTION There are many experimental settings where it is far cheaper to recruit a unit to participate in the study and to make an informal measurement on it than it is to make the formal measurement that is traditionally analyzed with statistical methods. Ranked-set sampling provides a design that allows one to exploit this differential cost of recruitment and measurement. A basic description of a ranked-set sample drawn from a single population is as follows: k2 units are collected as iid draws from the population. These k2 units are partitioned, at random, into k sets, each of size k. In the first set of k units, the response judged to be smallest is chosen for full measurement; in the second set, the response judged to be second smallest is chosen; in the third set, the response judged to be third smallest is chosen; and so on, until, in the last set, the response judged to be largest (kth smallest) is chosen. This process is repeated, say, m times, to produce a (balanced) ranked-set sample with mk fully measured units. The full measurements, along with the associated ranks, constitute the data from the ranked-set sample. When sampling more than one population, independent rankedset samples are drawn from each population. The ranked-set sampling design was first described by McIntyre (1952), who proposed it for estimation of pasture yields, where a trained eye can order a set of plots by yield fairly accurately. Since then, an extensive literature on the sampling plan has been produced, with particular attention given to development of novel ranked-set sample estimators, comparison of ranked-set sample estimators to those based on simple random samples, and selection of the set size. Review articles by Kaur, Patil, Sinha, and Taillie (1995) and Bohn (1996) described much of this work and showed some of the wide variety of applications to which the technique has been put. We note the particular influence of Stokes and Sager (1988), who exploited the connection between stratified sampling and ranked-set sampling (the k ranking classes loosely correspond to k distinct strata) to consider estimation of a distribution function while relaxing all assumptions on the ranking process. The general Michael A. Fligner is Professor Emeritus, Department of Statistics, The Ohio State University, Columbus, OH 43210 (E-mail: maf@stat.ohio-state.edu). Steven N. MacEachern is Professor, Department of Statistics, The Ohio State University, Columbus, OH 43210 (E-mail: snm@stat.ohio-state.edu). This material is based on work supported by the National Science Foundation under awards DMS-00-72526 and SES-0437251 and by the National Security Agency under award MSPF-04G-109. tenor of results is that a ranked-set sample provides more accurate inference than does an iid sample of the same size. Bohn and Wolfe (1992, 1994) developed and investigated a version of the Mann–Whitney–Wilcoxon (MW) rank sum test appropriate for use with balanced ranked-set sample data. They found that their test outperformed the MW test based on independent random samples from the two populations. However, their test relies on the assumption that the units in a set are ranked perfectly. When rankings are imperfect, the level of their test rises, perhaps substantially. In this article we develop statistical inference for a version of the MW test for data collected from two populations with independent ranked-set samples. The proposed test is distribution free. It does not depend on either perfect rankings or exact knowledge of the mechanism that yields imperfect rankings. The procedure works for both balanced and unbalanced ranked-set samples. For shift alternatives, it leads to confidence intervals and, through the Hodges– Lehmann device, to an estimator of the difference in locations of the two populations. Properties of the proposed procedure are investigated through asymptotic theory and simulation. For the simulation, a new class of models is proposed for imperfect rankings. The proposed procedure, which relies on minimal assumptions about the judgment ranking process, is shown to be competitive at both extremes for the quality of rankings. When judgment rankings are perfect, the new procedure is competitive with the BW procedure; when judgment rankings are random, the new procedure is asymptotically equivalent to the MW test. Section 2 establishes notation, describes the statistic on which the new tests are based, and provides the main result for the asymptotic equivalence to existing procedures under certain settings. This section compares the new method to the traditional MW test and then compares our new procedure to the BW procedures. Section 3 provides a further description of the theoretical results, focusing on the consistency of the test statistic and the Pitman efficacy of the hypothesis test. Section 4 introduces a new class of models for imperfect rankings. Section 5 gives the results of a simulation. Section 6 applies the new method to an experiment on spray deposits, and Section 7 concludes with a discussion. 1107 © 2006 American Statistical Association Journal of the American Statistical Association September 2006, Vol. 101, No. 475, Theory and Methods DOI 10.1198/016214506000000410 1108 Journal of the American Statistical Association, September 2006 2. PRELIMINARY RESULTS 2.1 Balanced Ranked-Set Samples To construct a ranked-set sample (RSS) from a single population, the experimenter must select k mutually independent random samples, each of size k. In the rth random sample, the experimenter measures the item judged to be the rth smallest observation among the k, r = 1, . . . , k. This process is repeated m times with each replication of the process known as a cycle. This is referred to as a balanced ranked-set sample because the same number of observations is taken on each judgment order statistic. In the two-sample situation, a balanced ranked-set sample is obtained from each population. The necessary notation follows. For population 1, let X[r]i denote the observation in sample r of cycle i judged to be the rth smallest. The ranked-set sample of size M = mk from population 1 is X[1]1 , . . . , X[1]m , . . . , X[k]1 , . . . , X[k]m , where F(x) denotes the underlying continuous distribution of this population. In addition, X[r]1 , . . . , X[r]m is a random sample with cdf F[r] (x). If the judgment ranking is perfect, then F[r] (x) = F(r) (x), the distribution of the rth-order statistic of a random sample of size k from F(x); if the judgment ranking is made at random, then F[r] (x) = F(x). Similarly, we let Y[1]1 , . . . , Y[1]n , . . . , Y[k]1 , . . . , Y[k]n denote an RSS of size N = nk from a second population with continuous cdf G(y) = F(y − ), with Y[r]1 , . . . , Y[r]n a random sample with cdf G[r] (y). Note that the sample from the Y population is based on n cycles, although we are assuming that the set size (number of items ranked) is the same in both ranked-set samples. Although there are important differences between ranked-set samples and stratified samples, we sometimes refer to F[r] (x) and G[r] (y) as the distributions of the rth strata, r = 1, . . . , k. Bohn and Wolfe (1992) proposed a distribution-free test of H0 : = 0 versus H1 : > 0 under the assumption of perfect ranking. Their test statistic is given by BW = k n k m ψ Y[s]t − X[i] j s=1 t=1 i=1 j=1 = k k s=1 i=1 Tsi , where Tsi = nt=1 m j=1 ψ(Y[s]t − X[i] j ) and ψ(x) = 1 if x > 0 and ψ(x) = 0 otherwise. The null distribution of the BW statistic is distribution-free under the assumption of perfect ranking. Although all arrangements of the mk X’s and nk Y’s are no longer equally likely under the null hypothesis, the assumption of perfect rankings enables us to compute the probabilities of the various rankings. These probabilities do not depend on the underlying distribution, and so the distribution-free property holds. Interestingly, the BW statistic is also distributionfree when the rankings are random as it then has the same distribution as the MW statistic based on independent samples of size mk and nk. Thus, the distribution of BW under random ranking is quite different from its distribution under perfect ranking, and the level of the test under random ranking is greatly inflated. The intermediate case of imperfect ranking is studied in Bohn and Wolfe (1994), where it was shown that the effect of milder forms of imperfect ranking is also to inflate the level of the BW test. Our proposed test is based on the statistic T= k Tii . i=1 The null distribution of each Tii is that of the MW statistic based on sample sizes m and n, because under the null hypothesis F[r] (x) = G[r] (x), r = 1, . . . , k. This null distribution for each Tii remains the same whether the judgment ranking is perfect, imperfect, or random. In the case of imperfect rankings, we make only the reasonable assumption that the mechanism for imperfect rankings is the same for both populations under the null hypothesis. Because the Tii are mutually independent, the null distribution of T is that of the convolution of k independent MW statistics based on sample sizes m and n whether the judgment ranking is perfect, imperfect, or random. An immediate advantage of the proposed statistic T over BW is that the distribution-free property does not require perfect rankings. An inspection of the formula for the proposed statistic T shows that it is based on a total of kmn comparisons among the X’s and Y’s, whereas the BW statistic is based on k2 mn comparisons, as is the standard MW statistic for samples of size mk and nk. The considerable reduction in the number of comparisons suggests that in the case of perfect ranking there may be a loss of information relative to the BW statistic. We might expect an additional loss of information because our procedure is valid under perfect, imperfect, or even random rankings. Fortunately, in those cases where information is lost, the effect is minimal. Our proposed test assumes that the data have been collected as independent ranked-set samples. The judgment ordering can be perfect as was assumed by Bohn and Wolfe, although this assumption was quite strong, particularly as the number of items ranked increases. Imperfect judgment ordering occurs in various degrees with the most extreme case corresponding to complete randomness; that is, the item judged as the rth smallest is equivalent to having chosen an item at random from the k observations in the set. The purpose of using ranked-set samples is to gain information over simple random sampling. If we knew the judgment ordering was going to be completely at random, there would be little reason to collect data as a ranked-set sample, and we could just collect simple random samples from each population and use the standard MW statistic for comparing two populations. However, we now show that the proposed statistic T is asymptotically equivalent to the MW if the rankings are completely random, whereas it will be shown in the next section that with more informative judgment orderings it will outperform the MW statistic based on comparable sample sizes. This result for random rankings is somewhat surprising as our test is based on kmn comparisons among the X’s and Y’s, whereas the MW statistic is based on k2 mn comparisons. Under random ranking, the joint distribution of the rankedset samples is the same as a random sample of M = mk X’s with cdf F(x) and an independent random sample of N = nk Y’s with cdf G(y). Now suppose we divide the X’s at random into k groups of size m and the Y’s at random into k groups of size n. Let MWi be the MW statistic computed on the ith group Fligner and MacEachern: Nonparametric Two-Sample Methods 1109 of X’s and ith group of Y’s only, and let RMW = k MWi . i=1 We call RMW the k random-split Mann–Whitney statistic. Letting MW denote the full MW statistic computed on all M = mk X’s and N = nk Y’s, we now show that the difference between MW and RMW, suitably normalized, goes in probability to 0. d d Because, under random ranking, T = RMW and MW = BW, d where = denotes equal in distribution, the theorem also shows the asymptotic stochastic equivalence of the BW statistic and the proposed statistic T under random ranking. Theorem 2.1. Let X1 , X2 , . . . , Xkm and Y1 , Y2 , . . . , Ykn denote independent random samples from populations with continuous cdf’s F(x) and G(y), respectively. Let k be a fixed constant, M = km, N = kn, and assume 0 < λ = limM,N→∞ (M/ (M + N)) < 1. Then √ RMW MW − M+N 2 kmn k mn converges in probability to 0 as M, N → ∞. Proof. Dividing the X’s at random into k groups of size m, we let Xr,1 , . . . , Xr,m denote the rth group of X’s, and dividing the Y’s at random into k groups of size n, we let Yr,1 , . . . , Yr,n denote the rth group of Y’s, r = 1, . . . , k. Then, letting MWi = n m t=1 j=1 ψ(Yi,t − Xi,j ), RMW = k MWi i=1 and MW = k n k m ψ(Ys,t − Xi,j ), s=1 t=1 i=1 j=1 where we note that the sum in MW is over all pairs of X’s and Y’s. Lehmann (1951) extended Hoeffding’s (1948) U-statistic theorem to two-sample U-statistics (see also Randles and Wolfe 1991). Using this result, it follows from the projection of the MW onto sums of iid random variables that 1 MW 1 = F(Ys,t ) + (1 − G(Xi,j )) 2 km k mn kn k k n s=1 t=1 m i=1 j=1 1 . + op √ M+N (1) Applying the projection to the individual MWi gives 1 MWi 1 = F(Yi,t ) + (1 − G(Xi,j )) mn n m n m t=1 j=1 + op 1 , √ M+N i = 1, . . . , k. Now, by combining (1) and (2), it follows immediately that k 1 MW RMW 1 MWi . = = 2 + op √ kmn k mn k mn M+N i=1 (2) Theorem 2.1 relies on the description of the MW statistic as a nearly linear function of the data. If the statistic were exactly linear, as is the difference in sample means, there would be no loss of information in a k random split. That is, Y −X, the difference in means of the kn Y observations and the km X observations, can be recovered exactly as the average of the differences in means of the k random groups. Although the MW statistic is not linear in the X’s and Y’s, Theorem 2.1 follows from the approximate linearity of the MW statistic. Approximate linearity is a common property of estimators and has been used to show that many classes of statistics have limiting normal distributions. Consequently, a similar argument for the equivalence of a k-random-split statistic and the corresponding full statistic applies to a large group of estimators and test statistics. In Section 5, it will be shown via simulation that, even for small to moderate sample sizes, the RMW statistic loses little information relative to the full MW statistic. For some statistics that are more complex, particularly those for which computational time is worse than O(M + N), use of one or even several k random splits can be preferable to processing the complete dataset. However, this must be weighed against the fact that the results will depend on the random splits used. The next result is a comparison of the exact variances of the MW and RMW statistics. Letting γ = F dG, both RMW/kmn and MW/k2 mn provide unbiased estimates of γ . Using standard results for the variance of U-statistics (Randles and Wolfe 1991), it follows that var(RMW/kmn) k[(m − 1)ζ0,1 + (n − 1)ζ1,0 + ζ1,1 ] = →1 (km − 1)ζ0,1 + (kn − 1)ζ1,0 + ζ1,1 var(MW/k2 mn) as m, n → ∞, where ζ0,1 = F 2 dG − γ 2 , ζ1,0 = (1 − G)2 dF − γ 2 , and ζ1,1 = γ (1 − γ ). Under H0 , for k = 3 and m = n = 10, the preceding expression for the ratio of variances takes the value 1.033, suggesting the near equivalence of MW and RMW, even for relatively small sample sizes. We now consider the case of perfect rankings. In this situation, the null distribution of the BW statistic depends on the fact that comparisons are being made between F[r] (x) = F(r) (x) and G[s] (y) = G(s) (y) for r, s = 1, . . . , k. Because of this, the distinction between the BW statistic and the proposed statistic T involves more than just a reduction in the number of comparisons. Specifically, the set of comparisons used in the calculation of T are intended to be those comparisons that are more informative for detecting the alternative. Because, under the shift model, G(r) (y) = F(r) (y − ), r = 1, . . . , k, we would expect comparisons within the strata determined by the different judgment order statistics to be the most useful for detecting a shift due to the reduction in variability within the strata. Surprisingly, dropping the less informative (noisier) comparisons can actually increase power. Thus, it is possible in some cases for T to have greater power than BW, despite the reduction in the number of comparisons. For a better understanding, consider the class of statistics Wc = k i=1 Tii + c Tsi s=i for some constant c. For any choice of c, under perfect rankings the null distribution of Wc is distribution-free. This is because the probabilities of the various orderings of the mk X’s 1110 Journal of the American Statistical Association, September 2006 the balanced case, neither the mi nor the ni can depend on i. From a design standpoint, imbalance can be useful because not all judgment order statistics contain the same information regarding a difference in the populations. Ozturk and Wolfe (2000) investigated the design issue under the assumption of perfect rankings and under the restriction that the mi (ni ) are either equal to some common value, m (n), or are 0. In this section, we consider a more general lack of balance, where the mi and ni , i = 1, . . . , k, are arbitrary. Our focus is on the development of appropriate testing procedures within this context. The proposed procedure T can easily be adapted to unbalanced designs, namely, those that do not use the same sample size for each of the judgment order statistics. When the design is unbalanced, the question arises as to how the Tii should be combined to form the final statistic. For testing H0 : = 0 versus H1 : > 0, a natural statistic would be of the form Figure 1. Plots of Pitman ARE When Rankings Are Perfect and k = 2. A value of c = 0 corresponds to T , whereas a value of 1 corresponds to BW. The highest line is for the uniform distribution, the middle line is for the normal distribution, and the lowest line is for the t distribution with 3 df. and nk Y’s can be computed under perfect ranking and Wc depends on the data only through these orderings. The class Wc includes both the BW statistic (c = 1) and the proposed statistic T (c = 0). Although dropping the cross-stratum comparisons (c = 0) entails some loss of information, for many distributions the BW statistic tends to overweight these comparisons. When k = 2, the Pitman asymptotic relative efficiency (ARE) of BW relative to the MW statistic was shown by Bohn and Wolfe (1992) to be 1.5 for any distribution under perfect rankings. Because BW is just the Mann–Whitney computed on the rankedset sample, this efficiency results from the improvement in RSS over simple random sampling (SRS) for perfect ranking. The Pitman ARE of T relative to the MW statistic depends on the underlying distribution, even under perfect rankings. The exact expression will be given in the next section. However, Figure 1 shows the Pitman ARE between Wc and the MW statistic as a function of c for the t distribution with 3 degrees of freedom, the normal distribution, and the uniform distribution. From Figure 1, we see that neither BW (c = 1) nor T (c = 0) is optimal for any of these three distributions under perfect ranking. Although the optimal c depends on the underlying distribution, it appears that for these three distributions a choice of c around .6 would produce a test with reasonably good properties. We also note that, even under perfect rankings, for k = 2 the loss of information when using T instead of BW is minimal for the normal distribution and is larger for the heavier-tailed t distribution, but that T performs better than BW for the uniform. Finally, the weights in the expression for Wc can be made more general because the optimal statistic would not necessarily treat all of the Tii or Tsi , s = i, in an equivalent manner. Although an exploration of this might be interesting for very small values of k, once k reaches 4 or 5 it becomes of less practical interest, as the assumption of perfect ranking becomes less tenable as k increases. 2.2 Unbalanced Ranked-Set Samples Suppose that mi and ni are the number of X and Y observations, respectively, in the ith judgment class for i = 1, . . . , k. In Tα = k i=1 αi Tii , m i ni with weights αi suitably chosen to reflect the imbalance. For simplicity, the weights should not involve the underlying distribution. In addition, the weights should ensure that Tα coincides with the proposed statistic T in the balanced case and that, under random rankings, the procedures based on Tα should not suffer relative to the MW. The rationale for this last requirement is the same as in the balanced case. Because each statistic Tα is distribution-free without the assumption of perfect ranking, if the procedure is equivalent to the MW under random ranking, then we wish to gain information provided the judgment orderings are informative. Now suppose that the null hypothesis is true and that the rankings are random. Then, for each i = 1, . . . , k for which mi ni > 0, 1 Tii = E m i ni 2 and 1 1 1 m i + ni Tii = σi2 = + . var m i ni 12 mi ni 12 mi ni Because the Tii are independent, standard results show that, under the null hypothesis and for random rankings, var(Tα ) is minimized by choosing σ −2 αi = k i −2 . t=1 σt (3) by (3). Let Define Tα˜ to be Tα with the weights specified λi = mi /(mi + ni ), M = ki=1 mi , and N = ki=1 ni . Provided λi → λ, 0 < λ < 1 for i = 1, . . . , k as mi , ni → ∞, it follows easily that var(Tα˜ ) → 1. var(MW/mn) (4) Thus, the proportions, (mi + ni )/(M + N), of observations in the different strata determined by the judgment order statistics are not important; only that the balance, λi , between the X’s and Y’s be the same within each stratum. A consequence of this, proved in Section 3, is that the Pitman ARE of Tα˜ relative Fligner and MacEachern: Nonparametric Two-Sample Methods to MW is 1 under random ranking, provided λi → λ, 0 < λ < 1 as mi , ni → ∞ for i = 1, . . . , k. The result in expression (4) has some rather surprising implications for the RMW, because, under random ranking, the joint distribution of the ranked-set samples is the same as under simple random sampling. Suppose that, under simple random sampling from the two populations, there are a total of 30 X’s and 30 Y’s. The usual MW is then based on 30 × 30 = 900 comparisons. The balanced RMW with k = 3 would randomly divide the X’s into three groups of size 10 and the Y’s into three groups of size 10 and then compute three MW statistics using 10 X’s and 10 Y’s for each MW. Adding these MW statistics gives an RMW that is based on only 300 comparisons in total. We have already seen that the exact ratio of the variances in (4) in this case is 1.033; that is, the two statistics are nearly equivalent. Now suppose we randomly divide the X’s into two groups of size 10 and 20 and the Y’s into groups of size 20 and 10. Then we compute two MW statistics, the first using the 10 X’s and the 20 Y’s and the second using the 20 X’s and the 10 Y’s, violating the equality of the λi . Combining these two MW statistics using the optimal weights αi provided previously yields a statistic based on 400 total comparisons, 200 in each MW. Evaluation of the ratio of variances in (4) gives a value of 1.143 for this setting. Thus, despite the 100 additional comparisons over the balanced RMW, the imbalance in the ratio of X’s to Y’s in the “strata” produces a net loss! To better understand this, suppose that instead of comparing the groups via the MW statistic we used the difference in the sample means. Thus, we start with the overall estimate Y − X based on 30 X’s and 30 Y’s. If we compute three differences in means based on three groups with 10 X’s and 10 Y’s and then optimally combine them, we just get back Y − X. Now suppose we randomly divide the X’s into two groups of size 10 and 20 and the Y’s into groups of size 10 and 20. If we match the 20 X’s with the 20 Y’s and the 10 X’s with the 10 Y’s, compute the two differences in means, and optimally combine them we get back Y − X. However, if we match the 20 X’s with the 10 Y’s and the 10 X’s with the 20 Y’s, compute the two differences in means, and optimally combine them, the resulting statistic is the difference in a weighted average of the Y’s minus a weighted average in the X’s. The result is an estimate that is more variable than Y − X. The relevance of the preceding example can be seen by considering the approximate linearity expression (1) provided in the proof of Theorem 2.1. To a good approximation, the overall MW behaves like the difference in two averages of iid random variables. The same approximate linearity can be applied to the individual Tii . However, when there is inequality in the λi , even with the optimal weights, the statistic Tα˜ behaves like the difference in two weighted averages of iid random variables, where the weights are unequal. The unequal weights in these averages produce poorer performance as in the example using means. Requiring the equality of the λi between strata ensures behavior of Tα˜ that is asymptotically equivalent to the full MW under random ranking. 3. PERFECT RANKING: LARGE–SAMPLE PROPERTIES In this section large-sample properties of Tα˜ are obtained under the assumption of perfect ranking. This is necessary for 1111 comparison with the Bohn–Wolfe procedure, which is valid only under this assumption. In the next section a new class of models for imperfect rankings is introduced, and the behavior of Tα˜ is studied for several members of this class. The Pitman efficiency is evaluated for the shift model G(y) = F(y − ). For perfect ranking, under the shift model, G(i) (y) = F(i) (y − ) for each order statistic, i = 1, . . . , k. Additionally, assume the balance criterion within strata, mi /(mi + ni ) → λ, i = 1, . . . , k. Using the fact that the shift model holds for each rank-specific distribution and the result in (7) concerning the null variance of Tα˜ , it follows easily that the efficacy of Tα˜ under perfect ranking is k 2 12λ(1 − λ) αi f(i) (x) dx. (5) i=1 If the rankings are random, replacing f(i) (x) by f (x), i = 1, . . . , k, in (5) gives the efficacy of Tα˜ under random ranking. Because ki=1 αi = 1, this shows the Pitman ARE of Tα˜ is 1 with respect to MW under random ranking, as asserted in Section 2. Note that in situations where an imperfect-ranking model ensures that the shift model holds within each of the strata, the efficacy of Tα˜ under this imperfect-ranking model is obtained by replacing f(i) (x) with f[i] (x), i = 1, . . . , k, in expression (5). When F is stochastically smaller than G, we have each F(i) is stochastically smaller than the corresponding G(i) , i = 1, . . . , k. Because Tα˜ is a convex combination of MW statistics (MWi /mi ni ) based on F(i) and G(i) , i = 1, . . . , k, the consistency of Tα˜ under perfect ranking follows from the consistency of each individual MW test for a stochastically ordered alternative. In Section 4 a class of models for imperfect rankings is developed, which ensures that, under suitable conditions, if F is stochastically smaller than G, we have each F[i] is stochastically smaller than the corresponding G[i] , i = 1, . . . , k. This leads to some general results for the consistency of Tα˜ under imperfect-ranking models. Under perfect ranking and a balanced ranked-set sample as in Bohn and Wolfe, the Pitman efficiencies of BW relative to MW are 1.5, 2, and 2.5 for k = 2, k = 3, and k = 4, respectively (see Bohn and Wolfe 1992). The increased efficiency of the BW statistic relative to the MW statistic is due to the reduction in the null variance of the statistic and, thus, does not depend on the underlying distribution. Table 1 provides the Pitman efficiency of Tα˜ relative to the MW under perfect ranking for the normal, t(3), and uniform distributions. The results in the table are for balanced ranked-set samples (αi = 1/k) and set sizes k = 2, 3, and 4. Some of these situations are presented in the simulation study of Section 5. Both Tα˜ and MW have the same asymptotic null variance. The increased efficiency of Tα˜ relative to MW is due to the derivative of the mean function being larger because of the selected within-strata comparisons being made by the Tα˜ statistic. Table 1. Pitman Efficiencies of Tα˜ Relative to MW Distribution k=2 k=3 k=4 Normal t (3) Uniform 1.48 1.40 1.78 1.95 1.80 2.56 2.42 2.21 3.35 1112 Journal of the American Statistical Association, September 2006 This is why the efficiency depends on the underlying distribution. By comparing the results of the table to the efficiencies of BW relative to MW given previously, we see that BW is more efficient than Tα˜ for the t(3), only slightly more efficient for the normal, and less efficient for the uniform. These efficiency results agree well with the simulation results presented in Section 5. 4. IMPERFECT–RANKING MODELS Ranked-set sampling presumes that an investigator can examine a set of k units and assign ranks to them. Assignment of ranks is subjective and, in typical contexts, may not correspond to a perfect ranking of the (unmeasured) Xi , i = 1, . . . , k. The literature contains two main classes of imperfect-ranking models. The first class is based on a ranking of the units’ perceived values, with the perceived values tied directly to the unmeasured, true values of the units. The second class (Bohn and Wolfe 1994) is based on the selection of order statistics. In this section we introduce a third class of models that overlaps the first class, and we investigate properties of the new models. The behavior of Tα˜ is studied under these models. Dell and Clutter (1972) presented the first class of models, which have an antecedent special case dating at least to Thurstone (1927). They posited a set of true values Xi , i = 1, . . . , k, for the units in a set. Along with its true value, each unit has a perceived value, Zi = Xi + i , where the i are iid draws from some distribution. This leads to the set of k vectors (Xi , Zi ). The units are ranked according to the Zi , with the Xi traveling along as concomitants. Importantly, the Zi can be considered a mental construct used only to generate the ranking that produces the X[i] . The models do not state that one can record the Zi , and so these values are not accessible to build regression models, perform diagnostics, and so forth. Thus, the ordered set of pairs correspond to (X[i] , Z(i) ). Dell and Clutter’s model has several nice features. Among them, the probability of ranking Xi as larger than Xj depends on the difference between the two true values of the units, δ = Xi − Xj . The probability equals P(i − j > −δ). It is an increasing function of δ. The model also yields stochastic ordering of the distributions of the X[i] , i = 1, . . . , k. However, in some circumstances, the model seems inadequate, because the probability that Xi is ranked larger than Xj depends only on δ, but not on the value of Xi . Often, one imagines that it is more difficult to correctly rank two units with large values and a separation of δ than it is to correctly rank two units with small values and a separation of δ. See Presnell and Bohn (1999) for a discussion on the two classes of ranking models. An imperfect-ranking model based on perceived values should satisfy the following properties: 4. The rank-specific distributions should be ordered. That is, the distributions of X[i] should be stochastically increasing in i, i = 1, . . . , k. 5. For the two-sample problem, if the distribution of Y is stochastically greater than the distribution of X, then Y[i] is stochastically greater than X[i] for each i = 1, . . . , k. To develop a family of models that have these properties, we rely on notation that facilitates presentation of cross-population comparisons. We write, as a model for actual value (T) and perceived value (U), T|θ ∼ Fθ , (6) U|T, θ ∼ GT , (7) where the distribution GT does not involve the parameter θ . The two populations are determined by values of the parameter θ . We use θ1 ≤ θ2 to index the X and Y populations, respectively. Note that the distribution of a perceived value, U, depends only on the actual value of the unit, T. It depends on the population, indexed by θ , indirectly, through the influence of θ on T. The new family of models supplements the hierarchical model in (6) and (7) with the assumptions that the Fθ are monotone likelihood ratio (MLR) in T and that the GT are MLR in U. We refer to a model in this new family as an MLR-ranking model. For Dell and Clutter’s model, the family of distributions GT is a location family with location parameter T. We allow much more general forms for the GT . We recall several familiar facts about MLR families as they relate to our problem. First, recall that the MLR property implies stochastic ordering of the distributions. That is, if U|T is MLR in U, then the distributions GT are stochastically increasing in T. Thus, the assumption that U|T is MLR in U ensures that property 1 holds. Second, if U|T is MLR in U, then, for any strictly monotone g(·), U|g(T) is also MLR in U. Through choice of g, this establishes property 2. Third, if U|T is MLR in T, and if V is a coarsened version of U created through interval censoring, then V|T is MLR in V. Creating a coarsening of U in this way allows us to produce models with property 3. Fourth, if U|T is MLR in U, then T|U is MLR in T (see, e.g., Whitt 1979). This fact allows us to examine features of the distribution of the actual values given the perceived values and is used in the theoretical development to follow. Lemma 4.1. Assume that the MLR-ranking model holds. Then the distributions U|θ are stochastically increasing in θ . Proof. The MLR-ranking model implies that the distribution of T|θ2 is stochastically greater than that of T|θ1 . The family of distributions of U|T are stochastically increasing in T. Combining these two facts, we see that the distribution of U|θ2 is stochastically greater than the distribution of U|θ1 . The next theorem shows that property 4 holds. 1. Perceived values associated with larger actual values should tend to be larger. Formally, the distribution for Z|X = x should be stochastically increasing in x. 2. The family of models should allow for a trend in the probability of misranking units for fixed δ as a function of X1 . 3. The family of models should allow for ties in the rankings (i.e., the model should allow for a ranker being unable to rank some units). Theorem 4.2. Assume that the MLR-ranking model holds and that a ranked-set sample with set size k is chosen from some population. Then the distribution of T[i] is stochastically smaller than that of T[ j] for i < j. Proof. Focus on a single set of size k. T[i] and T[ j] are concomitants of U(i) and U( j) , respectively. Also, U(i) ≤ U( j) . Consider now the distribution of T|U = ui or T|U = uj . The Fligner and MacEachern: Nonparametric Two-Sample Methods distribution of T|U is MLR in T and is, therefore, stochastically increasing in U. This establishes the result. The proof of property 5 follows the same line as the proof of the previous theorem. To prove the result, we must handle two additional difficulties—that we have two populations indexed by different parameter values and that we wish to compare observations that have the same rank. To this end, we consider a one-dimensional path through the space for (u, θ ). Define a monotone path, P(s) = (u(s), θ (s)), to be a path for which the functions u(s) and θ (s) are nondecreasing. The next lemma is useful in the proof of the upcoming theorem. Lemma 4.3. Assume that the MLR-ranking model holds. Then each of the following distributions is MLR in T: a. T|U = u, θ for fixed U = u. b. T|U, θ for fixed θ . c. T|P for any monotone path P. Proof. Throughout the proof, take t1 < t2 and θ1 < θ2 . a. With U = u fixed and using g(u|θ ) to represent the marginal density of g, we have f (t1 |U = u, θ1 ) f (t1 |θ1 )g(u|t1 , θ1 )/g(u|θ1 ) = f (t1 |U = u, θ2 ) f (t1 |θ2 )g(u|t1 , θ2 )/g(u|θ2 ) ≥ f (t2 |θ1 )g(u|t2 , θ1 )/g(u|θ1 ) f (t2 |θ2 )g(u|t2 , θ2 )/g(u|θ2 ) = f (t2 |U = u, θ1 ) . f (t2 |U = u, θ2 ) This inequality establishes the MLR property. b. With θ fixed, we have a joint distribution on T and U. Because U|T is MLR in U, it follows that T|U is MLR in T. c. This part follows from parts a and b. We work with an alternate formulation of the likelihood ratios and take s2 > s1 : f (t1 |U = u(s1 ), θ (s1 )) f (t1 |U = u(s1 ), θ (s2 )) ≥ f (t2 |U = u(s1 ), θ (s1 )) f (t2 |U = u(s1 ), θ (s2 )) ≥ f (t1 |U = u(s2 ), θ (s2 )) . f (t2 |U = u(s2 ), θ (s2 )) Rewriting the inequality, we have f (t1 |U = u(s1 ), θ (s1 )) f (t2 |U = u(s1 ), θ (s1 )) ≥ , f (t1 |U = u(s2 ), θ (s2 )) f (t2 |U = u(s2 ), θ (s2 )) which establishes part c. Now we present the theorem that establishes property 5. Theorem 4.4. Assume that the MLR-ranking model holds, with the X population indexed by θ1 and the Y population indexed by θ2 , where θ1 < θ2 . Assume that ranked-set samples with set size k are collected independently from the X and Y populations. Then the distribution of X[i] is stochastically smaller than the distribution of Y[i] for each i = 1, . . . , k. Proof. Lemma 4.1 establishes that the distribution U|θ1 is stochastically smaller than is the distribution U|θ2 . To distinguish between the perceived values for observations from the X and Y populations, we use Z and W, respectively, and so we have that Z is stochastically smaller than W and, hence, that Z(i) is stochastically smaller than W(i) for each i = 1, . . . , k. 1113 Fix i and imagine that Z(i) and W(i) are generated in a coupled fashion. That is, a single uniform variate is used to generate both perceived values with the inverse cdf method. Then the realized values are ordered so that Z(i) ≤ W(i) . Consider now the distribution of T[i] |U(i) = u, θ . For us, the perceived X and Y values are z and w, respectively, with z ≤ w. The conditional distribution for T[i] is the same as that for T|U = u, θ . Define a monotone path, P, which has P(0) = (z, θ1 ) and P(1) = (w, θ2 ). Such a path exists because z ≤ w and θ1 < θ2 . Applying Lemma 4.3, we have that the distributions along this path are MLR in T and, hence, that the distributions for T are stochastically increasing. Noting that the distribution for Y[i] |W(i) , θ2 corresponds to a larger value of s [where s indexes the path P(s)] than does the distribution of X[i] |Z(i) , θ1 establishes the result. This has established stochastic ordering of the distributions for each stratum. The Mann–Whitney test is known to be consistent against stochastically ordered distributions, and so each within-stratum test is consistent. The next corollary establishes that the proposed test is consistent by combining tests across the (finite) number of strata. Corollary 4.5. Assume that the MLR-ranking model holds, with the X population indexed by θ1 and the Y population indexed by θ2 , where θ1 > θ2 . Further assume that ranked-set samples with set size k are collected independently from the X and Y populations and that, for some i ∈ {1, . . . , k}, both mi and ni tend to ∞. Then the test based on Tα˜ is consistent for a test of H0 : θ1 = θ2 against the alternative HA : θ1 > θ2 . We note that milder properties may result in models that do not yield X[i] stochastically smaller than Y[i] for each i. For example, it is easy to construct a counterexample where X is stochastically smaller than Y, the model for U|T remains MLR in U, but where, for some i, X[i] is not stochastically smaller than Y[i] . Example 1: Additive perceptual errors. The classic example of perceptual error has been used by Thurstone and by Dell and Clutter: T|θ ∼ N(θ, τ 2 ), U|T ∼ N(T, σ 2 ). At each stage of the model, the normal distributions, indexed by their mean, form an MLR family. Dell and Clutter’s models have been used with a variety of families replacing the normal model for the true values. The normal distribution for the perceptual values can also be replaced with a different distribution having mean T. Example 2: Nonadditive perceptions 1. At times, we must step outside the mode of additive perceptual errors, taking us beyond Dell and Clutter’s model. One method of accomplishing this is to use a link function. An alternative version of Thurstone’s model suggests that in some instances comparisons of positive quantities, such as distances and areas, are made on the basis of the relative size of the objects. This suggests the following model, which exhibits property 2: T|θ ∼ lognormal(θ, τ 2 ), U|T ∼ N(log(T), σ 2 ). 1114 Journal of the American Statistical Association, September 2006 Property 1 ensures that the family of lognormal distributions (with fixed τ 2 ) is MLR in T. The link function log(·) adjusts the true values so that two values separated by δ are closer together for perceptual purposes if the true values are large than if they are small. A reviewer has suggested (and we agree) that this qualitative effect should hold for judging the relative sizes of books selected from a library shelf. Replacing the log link function with an arbitrary monotonic transformation preserves the MLR property for U|T and, hence, preserves the MLR model. Example 3: Nonadditive perceptions 2. Models for nonadditive perceptual error can be obtained in many ways. As an example, we write a model for perceptions, which is parameterized by T: T|θ ∼ Weibull(θ, β), U|T ∼ gamma(cT, 1), T θ /β follows an exponential distribution with mean 1. where For the gamma distribution, cT is the shape parameter. The parameter c governs the quality of the rankings, with larger c indicating better rankings. The gamma(cT, 1) perceptual portion of the model can be replaced by a gamma(α, cT) distribution. Example 4: Tied ranks 1. Empirically, we have observed (MacEachern, Stasny, and Wolfe 2004) that rankers feel that they are sometimes unable to make a judgment about a pair of items. When the true values lie in an interval, we may write the model T|θ ∼ beta Mθ, M(1 − θ ) , U|T ∼ binomial(n, T). The parameter n in the binomial distribution determines the likelihood of a tie in the rankings. This yields property 3, a feature that the additive perceptual errors of Dell and Clutter’s model cannot accommodate. Example 5: Tied ranks 2. A second approach to ties is to begin with a continuous model for the perceptual values and then coarsen the perceptual space. For example, we may write T|θ ∼ N(θ, τ 2 ), V|T ∼ N(T, σ 2 ), U|V ∼ round(V). Because the model for V is MLR, so is the model for U. The models produced in this fashion include categorical models for a response, such as those associated with ordinal regression, where a set of cut points defines the categories. 5. FINITE–SAMPLE COMPARISON OF THE PROCEDURES In this section we compare the performances of the Bohn– Wolfe (BW) and proposed (T) procedures based on ranked-set samples and the Mann–Whitney–Wilcoxon (MW) and randomsplit Mann–Whitney–Wilcoxon (RMW) procedures based on simple random samples. The comparison is based on a simulation study where we investigated a variety of set sizes and underlying distributions and where we considered a variety of quality of rankings, ranging from perfect to random. The simulation study investigated far more cases than are presented here. We have selected cases for presentation that demonstrate a range of behaviors for the tests. The reported simulations consist of four models for rankings with various set sizes k and total sample sizes M and N. The first three models are additive perceptual-error models Zi = Xi + i , i = 1, . . . , k, as described in Example 1 of Section 4. The perceptual errors i are normally distributed, with the true values Xi having a t distribution with 3 degrees of freedom, the normal distribution, or the uniform distribution. These correspond to tests of H0 : θ = 0 versus HA : θ = 0. The correlation between Zi and Xi determines the information content of the rankings. We have used models with ρ = 1 for perfect ranking and ρ = .9 and .7 for imperfect rankings. Once the correlation drops below .7, we have found that there is little information contained in the rankings. Under perfect ranking, the resulting models correspond to the three models for which the Pitman efficiency of the proposed test Tα relative to BW was computed in Section 3. The fourth model considered is the Weibull distribution with gamma perceptions as described in Example 3 of Section 4, where the test is for H0 : β = 1 versus HA : β = 1. The parameter c governs the quality of the rankings. This provides an example of an asymmetric population distribution, where the distributions do not differ by a shift parameter but instead are stochastically ordered. The simulation itself relied on variance reduction techniques, using a common stream of random numbers to investigate all distributions, shifts, and ρ’s with a common set size and total sample size. The same stream of random numbers was used to generate the ranked-set samples and the simple random samples. For the MW, RMW, and T tests, randomized tests were used, so that the level would be exactly .05. The BW procedure declares a cutoff for significance based on the asymptotic normal distribution of the test statistic under perfect rankings, but because the level is not maintained when the ranking is imperfect, no further attempt was made to control the level of this procedure. Each power curve is based on 10,000 replicates. The left panels of Figure 2 contain the results for the additive perceptual-error model when the distribution of the true values is standard normal, the set size is 3, and the sample sizes are M = N = 36. The right panels contain the results for the additive perceptual-error model when the distribution of the true values is uniform, the set size is 4, and the sample sizes are M = N = 48. Each of the six panels displays four simulated power curves. The lowest two curves in each panel are for the MW and RMW procedures, with the RMW showing slightly less power than the MW procedure. When rankings are perfect, the BW curve is a little higher than the T curve for the normal distribution, whereas the situation is reversed for the uniform distribution. This observation agrees with the Pitman efficiency results. The BW test makes no adjustment for the imperfections in rankings, and so its level rises as the rankings worsen. Although the BW test provides greater power to detect θ > 0, this is attributable to its higher level. In contrast to this behavior, the T test’s power curve drops toward that of the RMW procedure, based on a simple random sample. The advantage of the new ranked-set sample test over the procedures based on simple random samples is substantial when ρ = .9 and still noticeable when ρ = .7. The level of the BW test has risen to .110 for Fligner and MacEachern: Nonparametric Two-Sample Methods 1115 Figure 2. Plots of Power Curves for Four Tests. The left panels are for M = N = 36 and k = 3. The population distribution is normal. The right panels are for M = N = 48 and k = 4. The population distribution is uniform. In all plots, the broken line is T , the highest solid line is BW, and the two low, solid, nearly indistinguishable lines are MW and RMW. the lower left panel and to .140 for the lower right panel. For random rankings, the levels rise to .165 and .215, respectively. Figure 3 contains results for set size 2 and sample sizes M = N = 24. The left panels are for the additive perceptualerror model when the distribution of the true values is a t with 3 degrees of freedom, and the right panels are for the Weibull distribution with gamma perceptions. When rankings are perfect, the BW curve is slightly higher than the T curve for the t model and is almost identical to T for the Weibull model. For imperfect rankings, the situations are quite similar to those of Figure 2. The level of the BW test rises to .095 and .099 for the lower left and right panels, respectively. Under random rankings, the level of the BW test rises to .111 for both the normal model and the Weibull model. The patterns that we see in Figures 2 and 3 held throughout the other situations that were simulated. Following are the main conclusions reached from the simulation: • The ranked-set sampling procedures are preferable to the simple random-sampling procedures when the quality of the rankings is moderate to high. With low-quality rankings, the advantage of a ranked-set sample design is moderate, perhaps nearly disappearing. • The RMW procedure is a near equivalent of the MW procedure. The more extensive simulations show that the small difference apparent in the figures shrinks as the within-strata sample sizes grow. When these sample sizes are 16 (the next smallest within-strata sample size we investigated), the difference is small. • Under perfect rankings, the BW procedure and the T procedure are nearly equal. The T procedure is sometimes a little better than the BW procedure, sometimes nearly equal, and sometimes a little worse. Which procedure is better depends on the underlying distribution. The Pitman AREs provide a good match to the patterns that we see for finite sample sizes. • When rankings are imperfect, the actual level of the BW procedure rises. It may be considerably higher than the nominal level. The new procedure maintains its level and provides additional power compared to procedures based on a simple random sample. Importantly, our procedure 1116 Journal of the American Statistical Association, September 2006 Figure 3. Plots of Power Curves for Four Tests. All panels are for M = N = 24 and k = 2. The left panels have a population distribution that is t with 3 df. The right panels are for the Weibull model. In all plots, the broken line is T , the highest solid line is BW, and the two low, solid, nearly indistinguishable lines are MW and RMW. does not rely on knowledge of the form of the imperfectranking model or on the specific parameters that determine the imperfect-ranking model. 6. EXAMPLE Investigators at Horticulture Research International and the University of Kent conducted a pilot study in 1997 to evaluate the effectiveness of ranked-set sampling. The pilot study (see Murray, Ridout, and Cross 2000) focused on an experiment to compare the coverage of spray deposits on the leaves of apple trees under two different sprayer settings. Two plots of trees were sprayed with a fluorescent water-soluble tracer at 2% concentration. The first plot was sprayed at a high volume with a coarse nozzle on the sprayer to produce large droplet sizes (the coarse treatment), and the second was sprayed at a low volume with a fine nozzle on the sprayer to produce small droplet sizes (the fine treatment). The variable of interest (%Cover) is the percentage of the leaf surface covered by the spray. A second, related variable (Deposit) quantified the amount of spray on the leaf surface. Deposit was measured by washing the leaf surface with 5 mL of water and measuring the relative concentration of the tracer. We will restrict our attention to %Cover. One hundred and twenty-five leaves were sampled from each plot, and each sample was randomly divided into 25 sets of size 5. For each set of size 5, an observer ranked the leaves according to the perceived coverage based on the visual appearance of the leaf surface under ultraviolet light. The actual value of the variable %Cover was measured using an image analysis system (Optimax V). Because the purpose of the experiment was to investigate the efficacy of RSS as a technique, %Cover was measured on all 125 leaves on each treatment. Measuring %Cover on all leaves allows us to investigate two important issues. The first is evaluation of the quality of the observer’s ranking, which can be done because we can determine the true ranking in each set of 5. The second is to simulate many ranked-set samples of size 25 and many simple random samples of size 25, all drawn from the same dataset. This gives us a better comparison of the competing ranked-set sampling procedures as well as enabling us to compare them to simple random-sampling procedures. To draw an RSS from the data, we randomly select five sets in which the observation judged to Fligner and MacEachern: Nonparametric Two-Sample Methods be the minimum is chosen, a different five sets in which the observation judged to be the second smallest is chosen, and so on. The resulting sample is a balanced ranked-set sample. Such a sample is chosen independently for each of the two treatments. To select an SRS, a single observation is selected at random from each of the 25 sets for each treatment. This method of selecting an SRS matches the conditional division of leaves into sets (which was accomplished at random) for the RSS and results in more equitable comparisons. We first consider the quality of the rankings. In the simulation study, the quality of the rankings was measured using the correlation between perceived and actual values. Because the perceived values are unobservable, we use the following idea to link the quality of rankings in the example to the simulation study. For the 25 sets of rankings made by the observer in the coarse treatment, the average value of the Kendall tau distance (Randles and Wolfe 1991) between the observer’s rankings and the true rankings was 1.48, and for the fine treatment the average tau distance was 1.40. The distributions of the tau distances were quite similar as well. We simulated the Dell–Clutter model with a normal population distribution, normal errors, and a correlation of .9 to obtain 10,000 sets of five perceived and true observations. After ranking the sets of perceived and true values, we found the average tau distance between the rankings to be 1.43, in fairly good agreement with the example (a correlation of .7 gave an average tau distance of 2.52). This gives us a sense that the observer’s rankings in the example, while not perfect, were of fairly high quality. The correlation of .9 relates the quality of the rankings in the example, in a broad sense, to the simulation study results. We return to this point in the following discussion. Although an examination of the data shows considerable overlap between the two treatments, the coarse treatment (mean = .221, sd = .162) had a clearly higher %Cover than the fine treatment (mean = .153, sd = .115). In fact, an MW test based on all 125 observations on each treatment yields a two-tailed p value of .001. We simulated 10,000 ranked-set samples of size 25 from each treatment, and for each sample we computed both the BW statistic and the T statistic. In addition, 10,000 simple random samples of size 25 were obtained from each treatment, and the MW statistic and the RMW statistic were computed. For two-sided level-.05 tests, rejection rates for the BW test, the T test, the MW test, and the RMW test were .7500, .5445, .2852, and .2783, respectively. Because the coarse treatment apparently concentrates on larger values than does the fine treatment, a large rejection rate points to effectiveness of the test in detecting this difference. There is close agreement between the MW and the RMW with the MW performing just slightly better, as was shown in the simulation. The ranked-set sampling design, in conjunction with the T statistic, shows a clear superiority to the MW procedure. As expected, the BW test rejects substantially more often than the proposed procedure because the level of the BW test is not maintained under imperfect ranking. Our previous discussion of the quality of rankings suggests that in this example a Dell–Clutter model with a correlation of .9 may provide a rough approximation to the quality of rankings for this observer. We simulated 10,000 replications under the null hypothesis of identical populations with the Dell–Clutter model, a correlation of .9, and a set size of 5. The simulated level of the BW 1117 procedure was .1086, whereas with the same data, the level of the proposed procedure was .0510. This potential doubling of the level can easily explain the increased number of rejections for the BW test. If the proposed procedure were performed at the .10 level instead of the .05 level on the 10,000 ranked-set samples simulated from the example, the rejection rate would increase from .5445 to .7009. Importantly, although the quality of rankings in this example is quite good, the results suggest that the rise in level due to even small departures from perfect ranking can make the BW procedure inappropriate as a tool for comparing two samples. 7. CONCLUSIONS From a theoretical perspective, a ranked-set sample design has been shown, in many contexts, to be a preferable design to a simple random-sample design. The additional information, collected in the form of ranks associated with the observations, allows one to construct more powerful tests and more accurate estimators. The drawback of many theoretical results is that they rely on very strong assumptions, such as perfect rankings, balanced samples, exact symmetry of a population distribution, or rankings that, while imperfect, can be described with perfect knowledge. The central problem for ranked-set sampling is the development of procedures that are robust to the sort of violation of assumptions likely to be seen in practice. It is this issue that we have addressed in this article. One practical issue that our procedures address is how to analyze an unbalanced ranked-set sample. Whether the imbalance arises by design, through loss of units to full measurement, or through judgment poststratification (see MacEachern et al. 2004), the procedures we discuss allow us to make a full range of inferences about the shift parameter, . The second practical issue that we address is ensuring that hypothesis tests have an actual level equal to their nominal level. The level of the BW test rises to unacceptable levels when ranking is imperfect. We have developed a new class of models for imperfect ranking that are based on perceptions about the actual units in a set. This began with an intuitively reasonable set of properties that models based on perceived values should satisfy. A set of assumptions were then developed, which ensure that these properties hold. A series of examples illustrates the scope of the models. [Received May 2003. Revised December 2005.] REFERENCES Bohn, L. L. (1996), “A Review of Nonparametric Ranked-Set Sampling Methodology,” Communications in Statistics, Part A—Theory and Methods, 25, 2675–2685. Bohn, L. L., and Wolfe, D. A. (1992), “Nonparametric Two-Sample Procedures for Ranked-Set Samples Data,” Journal of the American Statistical Association, 87, 552–561. (1994), “The Effect of Imperfect Judgment Rankings on Properties of Procedures Based on the Ranked-Set Samples Analog of the Mann–Whitney– Wilcoxon Statistic,” Journal of the American Statistical Association, 89, 168–176. Dell, T. R., and Clutter, J. L. (1972), “Ranked Set Sampling Theory With Order Statistics Background,” Biometrics, 28, 545–555. Hoeffding, W. (1948), “A Class of Statistics With Asymptotically Normal Distribution,” The Annals of Mathematical Statistics, 19, 293–315. Kaur, A., Patil, G. P., Sinha, A. K., and Taillie, C. (1995), “Ranked Set Sampling: An Annotated Bibliography,” Environmental and Ecological Statistics, 2, 25–54. Lehmann, E. L. (1951), “Consistency and Unbiasedness of Certain Nonparametric Tests,” The Annals of Mathematical Statistics, 22, 165–179. 1118 MacEachern, S. N., Stasny, E. A., and Wolfe, D. A. (2004), “Judgment PostStratification With Imprecise Rankings,” Biometrics, 60, 207–215. McIntyre, G. A. (1952), “A Method of Unbiased Selective Sampling Using Ranked Sets,” Australian Journal of Agricultural Research, 3, 385–390. Murray, R. A., Ridout, M. S., and Cross, J. V. (2000), “The Use of Ranked Set Sampling in Spray Deposit Assessment,” Aspects of Applied Biology, 57, 141–146. Ozturk, O., and Wolfe, D. A. (2000), “An Improved Ranked Set Two-Sample Mann–Whitney–Wilcoxon Test,” The Canadian Journal of Statistics, 28, 123–135. Journal of the American Statistical Association, September 2006 Presnell, B., and Bohn, L. L. (1999), “U-Statistics and Imperfect Ranking in Ranked Set Sampling,” Journal of Nonparametric Statistics, 10, 111–126. Randles, R. H., and Wolfe, D. A. (1991), Introduction to the Theory of Nonparametric Statistics (2nd ed.), New York: Wiley. Stokes, S. L., and Sager, T. W. (1988), “Characterization of a Ranked-Set Sample With Application to Estimating Distribution Functions,” Journal of the American Statistical Association, 83, 374–381. Thurstone, L. L. (1927), “A Law of Comparative Judgement,” Psychological Reviews, 34, 273–286. Whitt, W. (1979), “A Note on the Influence of the Sample on the Posterior Distribution,” Journal of the American Statistical Association, 74, 424–426.