Estimation and sample size calculations for matching performance of biometric authentication
Michael E. Schuckers
Department of Mathematics, Computer Science and Statistics, St. Lawrence University, Canton, NY 13617 USA, and Center for Identification Technology Research (CITeR)

Abstract

Performance of biometric authentication devices can be measured in a variety of ways. The most common way is by calculating the false accept and false reject rates, usually referred to as FAR and FRR, respectively. In this paper we present two methodologies for creating confidence intervals for matching error rates. The approach that we take is based on a general parametric model. Because we utilize a parametric model, we are able to ‘invert’ our confidence intervals to develop appropriate sample size calculations that account for both the number of attempts per person and the number of individuals to be tested, a first for biometric authentication testing. The need for sample size calculations that achieve this is currently acute in biometric authentication. These methods are approximate and their small sample performance is assessed through a simulation study. The distribution we use for simulating data is one that arises repeatedly in actual biometric tests.

Key words: False match rate, false non-match rate, intra-individual correlation, logit transformation, Beta-binomial distribution, confidence intervals, Monte Carlo simulation, sample size calculations

1991 MSC: 62F11, 62F25, 62N99

Email address: schuckers@stlawu.edu (Michael E. Schuckers).

This research was made possible by generous funding from the Center for Identification Technology Research (CITeR) at West Virginia University and by NSF grant CNS-0325640, which is cooperatively funded by the National Science Foundation and the United States Department of Homeland Security.
Preprint submitted to Pattern Recognition, 21 November 2005

1 Introduction

Biometric authentication devices or biometric authenticators aim to match a presented physiological image against one or more stored physiological images. The matching performance of a biometric authenticator (BA) is an important aspect of its overall performance. Since each matching decision from a BA results in either a “reject” or an “accept”, the two most common measures of performance are the false accept rate and the false reject rate. These rates are often abbreviated as FAR and FRR, respectively, and the estimation of these quantities is the focus of this paper. The methods described herein are equally applicable to false match rates and false non-match rates since data for these summaries are collected in a similar manner.

As acknowledged in a variety of papers, including [1], [2], and more recently [3], there is an ongoing need for assessing the uncertainty in the estimation of these error rates for a BA. In particular there is an immense need for tools that assess the uncertainty in estimation via confidence intervals (CI’s) and for sample size calculations grounded in such intervals. Along these lines, the question of how many individuals to test is particularly difficult because biometric devices are generally tested on multiple individuals and each individual is tested multiple times. In this paper we present two CI methodologies for estimating the matching error rates for a BA. We then ‘invert’ the better performing of these CI’s to create sample size calculations for error rate estimation.

Several methods for the estimation of FAR and FRR appear in the biometrics literature. These generally fall into two approaches: parametric and nonparametric.
Among the parametric approaches, [4] uses the binomial distribution to make confidence intervals and obtain sample size calculations, while [5] models the features by a multivariate Gaussian distribution to obtain estimates of the error rates. The approach of [6] is to assume that individual error rates follow a Gaussian distribution. In a similar vein, [7] assumes a Beta distribution for the individual error rates and uses maximum likelihood estimation to create CI’s. Several non-parametric approaches have also been considered. [8] outlined exact methods for estimating the error rates for binomial data as well as for estimating the FAR when cross-comparisons are used. Two resampling methods have been proposed: the ‘block bootstrap’ or ‘subsets bootstrap’ of [1] and the balanced repeated replicates approach of [9]. It is worth noting that non-parametric methods do not allow for sample size calculations since it is not possible to ‘invert’ these calculations. The approach taken in this paper will be a parametric one that allows for sample size calculations.

In addition to CI methods, several approximate methods have been proposed for sample size calculations. “Doddington’s Rule” [10] states that one should collect data until there are 30 errors. Likewise the “Rule of 3” [2] is that 3/(the number of attempts) is an appropriate upper bound for a 95% CI for the overall error rate when zero errors are observed. However, both of these methods make use of the binomial distribution, which is often an unacceptable choice for biometric data [8]. [6] developed separate calculations for the number of individuals to test and for the number of tests per individual. Here we propose a single formula that accounts for both.

In this paper we propose two methods for estimating the error rates for BA’s. The first of these methods is a simplification of the methodology based on maximum likelihood estimation that is found in [7].
The advantage of the approach taken here is that it does not depend on numerical maximization methods. The second is a completely new method based on a transformation. For both methods we present simulations to describe how these methods perform under a variety of simulated conditions. In order to validate the simulation strategy we show that the versatile distribution used in our simulation is a ‘good’ fit for real data collected on BA’s. Finally we develop sample size calculations based on the second methodology since that method performed better.

The paper is organized in the following manner. Section 2 describes a general model formulated by [11] for dealing with overdispersion in binary data. There we also introduce the notation we will use throughout the paper. Section 3 introduces our CI methodologies based on this model. Results from a simulation study of the performance of these CI’s are found in Section 4. In that section, we also present an argument for the validity and applicability of the simulated data. Section 5 derives sample size calculations based on the second model. Finally, Section 6 contains a discussion of the results presented here and their implications.

2 Extravariation Model

As mentioned above, several approaches to modelling error rates from a BA have been developed. In order to develop sample size calculations, we take a flexible parametric approach. Previously [7] presented an extravariation model for estimating FARs and FRRs based on the Beta-binomial distribution, which assumes that error rates for each individual follow a Beta distribution. Here we follow [11] in assuming the first two moments of a Beta-binomial model but we do not utilize the assumptions about the shape of the Beta distribution for individual error rates found in [7]. We also replace the numerical estimation methods therein with closed form calculations. Details of this model are given below.
We begin by assuming an underlying population error rate, either FAR or FRR, of π. Following [2], let n be the number of comparison pairs tested and let $m_i$ be the number of decisions made regarding the $i$th comparison pair, with $i = 1, 2, \ldots, n$. We define a comparison pair broadly to encompass any measurement of a biometric image to another image, of an image to a template or of a template to a template. This enables us to model both FAR and FRR in the same manner. Then for the $i$th comparison pair, let $X_i$ represent the observed number of incorrect decisions from the $m_i$ attempts and let $p_i = m_i^{-1} X_i$ represent the observed proportion of errors from the $m_i$ observed decisions from the $i$th comparison pair. We assume that the $X_i$’s are conditionally independent given $m_i$, n, π and ρ. Then,

$$
\begin{aligned}
E[X_i \mid \pi, \rho, m_i] &= m_i \pi, \\
\mathrm{Var}[X_i \mid \pi, \rho, m_i] &= m_i \pi (1 - \pi)\bigl(1 + (m_i - 1)\rho\bigr),
\end{aligned}
\tag{1}
$$

where ρ is a term representing the degree of extravariation in the model. The assumption of conditional independence is the same one that is implicit in the ‘subset bootstrap’ of [1]. The ρ found in (1) is often referred to as the intra-class correlation coefficient, see e.g. [12] or [13]. Here we will refer to it as the intra-comparison correlation. The model in (1) reduces to the binomial if ρ = 0 or $m_i = 1$ for all i and, thus, (1) is a generalization of the binomial that allows for within-comparison correlation.

3 Confidence Intervals

The previous section introduced notation for the extravariation model. Here we use this model for estimating an error rate π. Suppose that we have n observed $X_i$’s from a test of a biometric authentication device. We can then use that data to estimate the parameters of our model. Let

$$
\hat{\pi} = \left(\sum_{i=1}^{n} m_i\right)^{-1} \sum_{i=1}^{n} X_i, \qquad
\hat{\rho} = \frac{\sum_{i=1}^{n} \left[ X_i (X_i - 1) - 2\hat{\pi}(m_i - 1) X_i + m_i (m_i - 1)\hat{\pi}^2 \right]}{\sum_{i=1}^{n} m_i (m_i - 1)\,\hat{\pi}(1 - \hat{\pi})}.
\tag{2}
$$

This estimation procedure for ρ is given by [14], while here we use a traditional unbiased estimator of π.
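As a concrete reading of (2), the following is a minimal Python sketch of the two moment estimators. The function name `moment_estimates` and the toy counts in the usage line are illustrative assumptions, not part of the paper; the sketch does not truncate negative values of the estimated correlation.

```python
from typing import Sequence, Tuple


def moment_estimates(x: Sequence[int], m: Sequence[int]) -> Tuple[float, float]:
    """Return (pi_hat, rho_hat) as in (2).

    x[i] is the number of errors observed for comparison pair i,
    m[i] is the number of decisions made for that pair.
    """
    if len(x) != len(m) or len(x) == 0:
        raise ValueError("x and m must be non-empty and of equal length")
    pi_hat = sum(x) / sum(m)
    pair_terms = sum(mi * (mi - 1) for mi in m)
    if not 0.0 < pi_hat < 1.0 or pair_terms == 0:
        # The denominator of rho_hat vanishes in these cases.
        raise ValueError("rho_hat is undefined for these data")
    numerator = sum(
        xi * (xi - 1) - 2 * pi_hat * (mi - 1) * xi + mi * (mi - 1) * pi_hat ** 2
        for xi, mi in zip(x, m)
    )
    rho_hat = numerator / (pair_terms * pi_hat * (1 - pi_hat))
    return pi_hat, rho_hat


# Toy data (hypothetical): all errors fall on a single comparison pair,
# so the within-pair correlation estimate is at its maximum.
print(moment_estimates([0, 0, 2, 0], [2, 2, 2, 2]))  # (0.25, 1.0)
```

Because (2) is a moment estimator, the estimated correlation can be negative when errors are spread more evenly across pairs than a binomial model would predict.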
3.1 Traditional Confidence Interval

We simplify the approach of [7] by replacing the maximum likelihood estimates with a moments-based approach. Thus we have estimates $\hat{\pi}$ of the error rate π and $\hat{\rho}$ of the intra-comparison correlation ρ. We use these to evaluate the standard error of $\hat{\pi}$ following (1), assuming that the image pairs tested are conditionally independent of each other. The estimated variance of $\hat{\pi}$ is then

$$
\hat{V}[\hat{\pi}] = \hat{V}\!\left[\frac{\sum X_i}{\sum m_i}\right]
= \Bigl(\sum m_i\Bigr)^{-2} \sum \hat{V}[X_i]
\approx \frac{\hat{\pi}(1 - \hat{\pi})\bigl(1 + (m_0 - 1)\hat{\rho}\bigr)}{\sum m_i}
\tag{3}
$$

where

$$
m_0 = \bar{m} - \frac{\sum_{i=1}^{n} (m_i - \bar{m})^2}{\bar{m}\, n (n - 1)}
\tag{4}
$$

and $\bar{m} = n^{-1} \sum_{i=1}^{n} m_i$. Note that in the notation of [7], $(1 + (m_0 - 1)\rho) = C$. We can create a nominally 100 × (1 − α)% CI for π from this. Using the results in [15] about the sampling distribution of $\hat{\pi}$, we get the following interval

$$
\hat{\pi} \pm z_{1-\alpha/2} \left[ \frac{\hat{\pi}(1 - \hat{\pi})\bigl(1 + (m_0 - 1)\hat{\rho}\bigr)}{\sum m_i} \right]^{1/2}
\tag{5}
$$

where $z_{1-\alpha/2}$ represents the $100 \times (1 - \alpha/2)$th percentile of a Gaussian distribution. Our use of the Gaussian or Normal distribution is justified by the asymptotic properties of these estimators [14, 15].

3.2 Logit Confidence Interval

One of the traditional difficulties with estimation of proportions near zero (or one) is that sampling distributions of the estimated proportions are non-Gaussian. Another problem is that CI’s for proportions, such as that given in (5), are not constrained to fall within the interval (0, 1). The latter is specifically noted by [2]. One method that has been used to compensate for both of these is to transform the proportions to another scale. Many transformations for proportions have been proposed, including the logit, probit and arcsin of the square root, e.g. [14]. [16] has an extensive discussion of transformed CI’s of the kind that we are proposing here. Below we use the logit or log-odds transformation to create CI’s for the error rate π. [17] offers a specific discussion of CI’s based on a logit transformation for binomial proportions.
Table 1
Sample Confidence Intervals Based on the Traditional Approach

| Modality | Rate | n | m | $\hat{\pi}$ | $\hat{\rho}$ | Lower Endpoint | Upper Endpoint |
|---|---|---|---|---|---|---|---|
| Hand | FAR | 2450 | 5 | 0.0637 | 0.0621 | 0.0588 | 0.0685 |
| Finger | FAR | 2450 | 5 | 0.0589 | 0.0021 | 0.0548 | 0.0631 |
| Face | FAR | 2450 | 5 | 0.0189 | 0.0066 | 0.0158 | 0.0206 |
| Hand | FRR | 50 | 10 | 0.0520 | 0.0000 | 0.0325 | 0.0715 |
| Finger | FRR | 50 | 10 | 0.0420 | 0.0666 | 0.0198 | 0.0642 |
| Face | FRR | 50 | 10 | 0.0300 | 0.0759 | 0.0106 | 0.0494 |

The logit or log-odds transformation is one of the most commonly used transformations in statistics. For this reason, we focus on the logit over other transformations. Define $\mathrm{logit}(\pi) \equiv \ln\left(\frac{\pi}{1 - \pi}\right)$ as the natural logarithm of the odds of an error occurring. The logit function has a domain of (0, 1) and a range of (−∞, ∞). One advantage of using the logit transformation is that we move from a bounded parameter space, π ∈ (0, 1), to an unbounded one, logit(π) ∈ (−∞, ∞). Thus, our approach is as follows. We first transform our estimand, π, and our estimator, $\hat{\pi}$, to $\gamma \equiv \mathrm{logit}(\pi)$ and $\hat{\gamma} \equiv \mathrm{logit}(\hat{\pi})$, respectively. Next, we create a CI for γ using an approximation to the standard error of $\hat{\gamma}$. Finally, we invert the endpoints of that interval back to the original scale.

Letting $\mathrm{ilogit}(\gamma) \equiv \mathrm{logit}^{-1}(\gamma) = \frac{e^{\gamma}}{1 + e^{\gamma}}$, we can create a 100(1 − α)% CI using $\hat{\gamma}$. To do this we use a Delta method expansion for the estimated standard error of $\hat{\gamma}$. (The Delta method, as it is known in the statistical literature, is simply a one step Taylor series expansion of the variance. See [14] for more details.) Then our CI on the transformed scale is

$$
\hat{\gamma} \pm z_{1-\alpha/2} \left( \frac{1 + (m_0 - 1)\hat{\rho}}{\hat{\pi}(1 - \hat{\pi})\, \bar{m} n} \right)^{1/2}
\tag{6}
$$

where $\bar{m} = \frac{1}{n} \sum_{i=1}^{n} m_i$. Thus (6) gives a CI for γ = logit(π) and we will refer to the endpoints of this interval as $\gamma_L$ and $\gamma_U$ for lower and upper, respectively. The final step for making a CI for π is to take the ilogit of both endpoints of this interval. Thus an approximate 100(1 − α)% CI for π is

$$
\bigl(\mathrm{ilogit}(\gamma_L),\ \mathrm{ilogit}(\gamma_U)\bigr).
\tag{7}
$$

The interval (7) is asymmetric because the logit is not a linear transformation. This differs from traditional CI’s, which are plus or minus a margin of error. However, this interval has the same properties as other CI’s. (See [18] for a rigorous definition of a CI.) In addition, this interval is guaranteed to fall inside the interval (0, 1) as long as at least one error is observed. In the next section we focus on how well the CI’s found in (5) and (7) perform for reasonable values of π, ρ, m, and n.

Table 2
Sample Confidence Intervals Based on the Logit Approach

| Modality | Rate | n | m | $\hat{\pi}$ | $\hat{\rho}$ | Lower Endpoint | Upper Endpoint |
|---|---|---|---|---|---|---|---|
| Hand | FAR | 2450 | 5 | 0.0637 | 0.0621 | 0.0590 | 0.0687 |
| Finger | FAR | 2450 | 5 | 0.0589 | 0.0021 | 0.0549 | 0.0633 |
| Face | FAR | 2450 | 5 | 0.0189 | 0.0066 | 0.0160 | 0.0208 |
| Hand | FRR | 50 | 10 | 0.0520 | 0.0000 | 0.0356 | 0.0753 |
| Finger | FRR | 50 | 10 | 0.0420 | 0.0666 | 0.0246 | 0.0708 |
| Face | FRR | 50 | 10 | 0.0300 | 0.0759 | 0.0156 | 0.0568 |

3.3 Examples

To illustrate these methods we present example CI’s for both of the CI methods given above. Results for the traditional approach and the logit approach are found in Tables 1 and 2, respectively. The data used for these intervals come from [19]. In that paper the authors investigated data from three biometric modalities (face, fingerprint and hand geometry) and recorded the match scores for ten within-individual image pairs of 50 people and for five between-individual image pairs for those same 50 individuals. Note that the between-individual cross comparisons here are not symmetric and thus there were 49 × 50 = 2450 comparison pairs in the sense we are using here. Thus there are 500 decisions comparing an individual to themselves and 12250 decisions comparing an individual to another individual.

Several things are apparent from these results. For this data the two intervals produce similar endpoints on the same data. This is a result of the relatively large n’s.
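The two intervals compared in Tables 1 and 2 can be sketched in a few lines of Python. This is an illustrative implementation of (3)-(7) under the assumption that the summary estimates $\hat{\pi}$ and $\hat{\rho}$ are already available; the function names `design_m0`, `traditional_ci` and `logit_ci` are hypothetical and only the standard library is used.

```python
from math import exp, log, sqrt
from statistics import NormalDist
from typing import Sequence, Tuple


def design_m0(m: Sequence[int]) -> float:
    """m0 from (4); equals the common m when all m_i are identical."""
    n = len(m)
    m_bar = sum(m) / n
    if n < 2:
        return m_bar
    return m_bar - sum((mi - m_bar) ** 2 for mi in m) / (m_bar * n * (n - 1))


def traditional_ci(pi_hat: float, rho_hat: float, m: Sequence[int],
                   alpha: float = 0.05) -> Tuple[float, float]:
    """Symmetric interval (5) on the original scale."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    variance = pi_hat * (1 - pi_hat) * (1 + (design_m0(m) - 1) * rho_hat) / sum(m)
    half_width = z * sqrt(variance)
    return pi_hat - half_width, pi_hat + half_width


def logit_ci(pi_hat: float, rho_hat: float, m: Sequence[int],
             alpha: float = 0.05) -> Tuple[float, float]:
    """Interval (7): build (6) on the logit scale, then map back with ilogit."""
    if not 0.0 < pi_hat < 1.0:
        raise ValueError("the logit is undefined when pi_hat is 0 or 1")
    z = NormalDist().inv_cdf(1 - alpha / 2)
    gamma_hat = log(pi_hat / (1 - pi_hat))
    # sum(m) equals m_bar * n, the quantity in the denominator of (6).
    se_gamma = sqrt((1 + (design_m0(m) - 1) * rho_hat)
                    / (pi_hat * (1 - pi_hat) * sum(m)))

    def ilogit(g: float) -> float:
        return 1 / (1 + exp(-g))

    return ilogit(gamma_hat - z * se_gamma), ilogit(gamma_hat + z * se_gamma)


# Hand FRR row of Tables 1 and 2: n = 50 pairs, m = 10 decisions each.
m = [10] * 50
print("traditional: (%.4f, %.4f)" % traditional_ci(0.0520, 0.0, m))
print("logit:       (%.4f, %.4f)" % logit_ci(0.0520, 0.0, m))
```

With the Hand FRR inputs, both functions agree with the tabulated endpoints to four decimal places, which is a useful sanity check on the reconstruction of (5) and (6).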
As noted earlier, the logit CI is asymmetric, with intervals that are larger, while the traditional confidence interval is symmetric.

Table 3
Goodness-of-fit test results for hand geometry FAR’s from data found in [19]

| Threshold | $\hat{\pi}$ | p-value |
|---|---|---|
| 80 | 0.1136 | 0.0017 |
| 70 | 0.0637 | 0.1292 |
| 60 | 0.0272 | 0.6945 |
| 50 | 0.0098 | 0.9998 |
| 40 | 0.0016 | 0.9972 |
| 30 | 0.0008 | 0.9996 |

4 Assessing Performance

To test the small sample performance of these CI’s we simulate data from a variety of different scenarios. Simulations were run because they give the best gauge of performance for statistical methodology under a wide variety of parameter combinations. We will refer to each parameter combination as a scenario. Assuming that the simulated data is similar in structure to observed data, we get a much better understanding of performance from simulation than from looking at a single observed set of data. Below we argue that the distribution that we use for simulation is an excellent fit to data from [19]. Further details on the Monte Carlo approach to evaluating statistical methodology can be found in [20]. Under each scenario, 1000 data sets were generated and from each data set a nominally 95% CI was calculated. The percentage of times that π is captured inside these intervals is recorded and referred to as the empirical coverage probability or, simply, the coverage. For a 95% CI, we should expect the coverage to be 95%. However, this is not always the case, especially for small sample sizes. Below we consider a full factorial simulation study using the following values: n = (1000, 2000), m = (5, 10), π = (0.005, 0.01, 0.05, 0.1), and ρ = (0.1, 0.2, 0.4, 0.8). For simplicity we let $m_i = m$, $i = 1, \ldots, n$, for these simulations. These values were chosen to determine their impact on the coverage of the confidence intervals. Specifically, these values of π were chosen to be representative of possible values for a BA, while the values of ρ were chosen to represent a larger range than would be expected.
Performance for both of the methods given in this paper is exemplary when 0.1 < π < 0.9. Because of the symmetry of binary estimation, it is sufficient to consider only values of π less than 0.1.

Table 4
Goodness-of-fit test results for fingerprint FAR’s from data found in [19]

| Threshold | $\hat{\pi}$ | p-value |
|---|---|---|
| 10 | 0.0930 | 0.8191 |
| 20 | 0.0589 | 0.9761 |
| 30 | 0.0292 | 0.9276 |
| 40 | 0.0114 | 0.3726 |
| 50 | 0.0074 | 0.7563 |
| 60 | 0.0042 | 0.9541 |
| 70 | 0.0032 | 0.9781 |
| 80 | 0.0016 | 0.9972 |
| 90 | 0.0004 | 0.9987 |

4.1 Goodness-of-Fit Tests

Because it is easy to generate from a Beta-binomial, we would like to utilize this distribution for our simulations. To determine whether or not the Beta-binomial distribution is appropriate for generating data, we considered biometric decision data from [19] and computed “goodness-of-fit” test statistics and p-values, which are discussed by several authors, e.g. [21]. The idea of a “goodness-of-fit” test is that we fit the distribution to be tested and determine if the observed data are significantly different from this structure. Summaries based on the data are compared to summaries based on the null distributional form. In the case of the Beta-binomial distribution, we compare the expected counts if the data perfectly followed a Beta-binomial distribution to the observed counts. [21] gives an excellent introduction to these tests. Tables 3, 4 and 5 summarize the results of these tests for FAR’s across the three modalities. Tables 6, 7 and 8 repeat that analysis for FRR’s. Note that for hand match scores we accept below the given threshold, while for finger and face match scores we accept above the given threshold. For all of these tables small p-values indicate lack of fit and that the null hypothesis that the Beta-binomial distribution fits this data should be rejected. A more general “goodness-of-fit” test is given by [22] for when the value of $m_i$ varies across comparison pairs.
Table 5
Goodness-of-fit test results for facial recognition FAR’s from data found in [19]

| Threshold | $\hat{\pi}$ | p-value |
|---|---|---|
| 60 | 0.0876 | < 0.0001 |
| 50 | 0.0446 | 0.3124 |
| 45 | 0.0291 | 0.9539 |
| 40 | 0.0182 | 0.9948 |
| 35 | 0.0446 | 0.5546 |
| 30 | 0.0047 | 0.9323 |
| 25 | 0.0024 | 0.9908 |

Table 6
Goodness-of-fit test results for hand geometry FRR’s from data found in [19]

| Threshold | $\hat{\pi}$ | p-value |
|---|---|---|
| 100 | 0.1120 | 0.7803 |
| 120 | 0.0520 | 0.7813 |
| 140 | 0.0280 | 0.9918 |
| 160 | 0.0180 | 0.9986 |
| 180 | 0.0120 | 0.9950 |
| 200 | 0.0100 | 0.9905 |
| 220 | 0.0080 | 0.9795 |
| 240 | 0.0060 | 0.9255 |

Looking at Tables 3 to 5 as well as Tables 6 to 8, we can readily see that the Beta-binomial fits both FAR and FRR quite well for all three modalities. Only two of the fifty-one thresholds considered resulted in a rejection of the Beta-binomial distribution as inappropriate. This is approximately what we would expect by chance alone using a significance level of 5%. For this analysis we reported on a subset of thresholds that produced FAR’s and FRR’s near or below 0.1. This choice was made because it is unlikely that a BA would be implemented with error rates above that cutoff. It is important to note that we are not arguing here that binary decision data, the $X_i$’s, from a biometric experiment will always follow a Beta-binomial distribution. Nor are we stating that Beta-binomial data is necessary for the use of these CI’s. (As mentioned above, we are only specifying the first two moments of the distribution of the $X_i$’s instead of specifying a particular shape for the distribution.) Rather, what we conclude from the results in Tables 3 through 8 is that the Beta-binomial is a reasonable distribution for simulation of small sample decision data since it fit data from three different modalities well. Thus we generate $X_i$’s from a Beta-binomial distribution to test the performance of the CI methods specified above.

4.2 Simulation Results

Before presenting the simulation results, it is necessary to summarize our goals.
First, we want to determine for what combinations of parameters the methodology achieves coverage close to the nominal level, in this case 95%. Second, because we are dealing with simulations, we should focus on overall trends rather than on specific outcomes. If we repeated these same simulations again, we would see slight changes in the coverages of individual scenarios but the overall trends should remain. Third, we would like to be able to categorize which parameter combinations give appropriate performance. We use the Monte Carlo approach because it is more complete than would be found in the evaluation of a “real” data set. See, e.g., [20] for a complete discussion. Evaluations from observed test data give a less complete assessment of how well an estimation method performs since there is no way to consider all the possible parameter combinations from such data.

4.2.1 Traditional confidence interval performance

Using (5) we calculated coverage for each scenario. The results of this simulation can be found in Table 9. Several clear patterns emerge. Coverage increases as π increases, as ρ decreases, as n increases and as m increases. This is exactly as we would have expected. More observations should increase our ability to accurately estimate π. Similarly the assumption of approximate Normality will be most appropriate when π is moderate (far from zero and far from one) and when ρ is small. This CI performs well except when π < 0.01 and ρ ≥ 0.2.
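The coverage experiment described in this section can be sketched as follows. This is a hedged, standard-library illustration rather than the code used for Tables 9 and 10: the function name `simulate_coverage`, the seed handling, and the treatment of degenerate data sets (counted as misses) are assumptions, and the Beta parameters are chosen so that the Beta-binomial has mean π and intra-comparison correlation ρ = 1/(a + b + 1).

```python
import random
from math import log, sqrt
from statistics import NormalDist
from typing import Tuple


def simulate_coverage(n: int, m: int, pi: float, rho: float, reps: int = 1000,
                      alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Empirical coverage of the traditional (5) and logit (7) intervals.

    Assumes m >= 2, 0 < pi < 1 and 0 < rho < 1, with m_i = m for every pair.
    """
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    # Beta(a, b) pair-level error rates: mean pi, correlation rho.
    a = pi * (1 - rho) / rho
    b = (1 - pi) * (1 - rho) / rho
    gamma = log(pi / (1 - pi))
    hits_traditional = hits_logit = 0
    for _ in range(reps):
        x = []
        for _ in range(n):
            p_i = rng.betavariate(a, b)
            x.append(sum(rng.random() < p_i for _ in range(m)))
        pi_hat = sum(x) / (n * m)
        if not 0.0 < pi_hat < 1.0:
            continue  # degenerate data set: counted as a miss for both methods
        numerator = sum(xi * (xi - 1) - 2 * pi_hat * (m - 1) * xi
                        + m * (m - 1) * pi_hat ** 2 for xi in x)
        rho_hat = numerator / (n * m * (m - 1) * pi_hat * (1 - pi_hat))
        inflation = max(0.0, 1 + (m - 1) * rho_hat)  # guard against sqrt of a negative
        # Traditional interval (5).
        half_width = z * sqrt(pi_hat * (1 - pi_hat) * inflation / (n * m))
        if abs(pi_hat - pi) <= half_width:
            hits_traditional += 1
        # Logit interval: pi is inside (7) exactly when logit(pi) is inside (6).
        se_gamma = sqrt(inflation / (pi_hat * (1 - pi_hat) * n * m))
        if abs(log(pi_hat / (1 - pi_hat)) - gamma) <= z * se_gamma:
            hits_logit += 1
    return hits_traditional / reps, hits_logit / reps


# A deliberately small scenario so the pure-Python loop finishes quickly.
print(simulate_coverage(n=200, m=5, pi=0.1, rho=0.1, reps=200, seed=1))
```

Running the full factorial design with 1000 replications per scenario is slow in pure Python; a vectorized random number library would be a natural substitution. Individual coverages will differ from the tabulated ones because the random number streams differ.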
Table 7
Goodness-of-fit test results for fingerprint FRR’s from data found in [19]

| Threshold | $\hat{\pi}$ | p-value |
|---|---|---|
| 50 | 0.1140 | 0.4134 |
| 40 | 0.0980 | 0.5160 |
| 30 | 0.0880 | 0.1554 |
| 20 | 0.0760 | 0.5121 |
| 10 | 0.0589 | 0.9761 |
| 5 | 0.0120 | 0.9950 |
| 1 | 0.0020 | 0.9999 |

Table 8
Goodness-of-fit test results for facial recognition FRR’s from data found in [19]

| Threshold | $\hat{\pi}$ | p-value |
|---|---|---|
| 45 | 0.1060 | 0.2614 |
| 50 | 0.0660 | 0.9509 |
| 55 | 0.0540 | 0.5353 |
| 60 | 0.0500 | 0.5885 |
| 65 | 0.0300 | 0.9216 |
| 70 | 0.0180 | 0.9067 |
| 75 | 0.0140 | 0.9067 |
| 80 | 0.0060 | 0.9985 |
| 85 | 0.0040 | 0.9996 |
| 90 | 0.0040 | 0.9996 |
| 95 | 0.0040 | 0.9996 |
| 100 | 0.0040 | 0.9996 |
| 105 | 0.0020 | 1.0000 |

There is quite a range of coverages, from a high of 0.959 to a low of 0.896, with a mean coverage of 0.940. One way to think about ρ is that it governs how much ‘independent’ (in a statistical sense) information can be found in the data. Higher values of ρ indicate that there is less ‘independent’ information in the data. This performance is not surprising since binary data is difficult to assess when there is a high degree of correlation within a comparison. One reasonable rule of thumb is that the CI performs well when $n^{\dagger} \pi \geq 10$, where

$$
n^{\dagger} = \frac{nm}{1 + (m_0 - 1)\rho}
\tag{8}
$$

is referred to as the effective sample size in the statistics literature [23].

4.2.2 Logit confidence interval performance

To assess how well this second interval estimates π we repeated the simulation using (6) to create our intervals. Output from these simulations is summarized in Table 10. Again coverage should be approximately 95% for a nominally 95% CI. Looking at the results found in Table 10, we note that there are very similar patterns to those found in the previous section. However, it is clear that the coverage here is generally higher than for the traditional interval. As before our interest is in overall trends. In general, coverage increases as π increases, as ρ decreases, as m increases and as n increases. Coverages range from a high of 0.969 to a low of 0.930 with a mean coverage of 0.949.
Only one of the coverages, that for n = 1000, m = 5, π = 0.005 and ρ = 0.4, is of concern here. That value seems anomalous when compared to the coverage obtained when n = 1000, m = 5, π = 0.005 and ρ = 0.8. Otherwise the CI based on a logit transformation performed extremely well. Overall, coverage for the logit CI is higher than for the traditional confidence interval. It performs well when $n^{\dagger} \pi \geq 5$. Thus, use of this CI is appropriate when the number of comparison pairs is roughly half what would be needed for the traditional CI.

5 Sample size calculations

As discussed earlier and highlighted by [3], there is a pressing need for appropriate sample size calculations for testing of BA’s. Here we present sample size calculations using the logit transformed interval since it gives better coverage and requires fewer observations for its usage than the traditional approach. (It is straightforward to solve (5), as we do below for (6), to achieve a specified margin of error.) Because of the way that matching performance for BA’s is assessed, there are effectively two sample sizes for a biometric test: n and m. The calculations given below solve for n, the number of comparison pairs, conditional on knowledge of m, the number of decisions per comparison pair.
Table 9
Empirical Coverage Probabilities for Traditional Confidence Interval

| n | m | π | ρ = 0.1 | ρ = 0.2 | ρ = 0.4 | ρ = 0.8 |
|---|---|---|---|---|---|---|
| 1000 | 5 | 0.005 | 0.935 | 0.926 | 0.925 | 0.907 |
| 1000 | 5 | 0.010 | 0.935 | 0.935 | 0.929 | 0.928 |
| 1000 | 5 | 0.050 | 0.952 | 0.945 | 0.954 | 0.947 |
| 1000 | 5 | 0.100 | 0.943 | 0.957 | 0.949 | 0.953 |
| 1000 | 10 | 0.005 | 0.931 | 0.926 | 0.922 | 0.896 |
| 1000 | 10 | 0.010 | 0.941 | 0.934 | 0.924 | 0.924 |
| 1000 | 10 | 0.050 | 0.945 | 0.949 | 0.948 | 0.945 |
| 1000 | 10 | 0.100 | 0.949 | 0.944 | 0.947 | 0.950 |
| 2000 | 5 | 0.005 | 0.951 | 0.928 | 0.941 | 0.927 |
| 2000 | 5 | 0.010 | 0.947 | 0.958 | 0.934 | 0.932 |
| 2000 | 5 | 0.050 | 0.941 | 0.959 | 0.947 | 0.954 |
| 2000 | 5 | 0.100 | 0.946 | 0.944 | 0.951 | 0.938 |
| 2000 | 10 | 0.005 | 0.941 | 0.927 | 0.930 | 0.919 |
| 2000 | 10 | 0.010 | 0.945 | 0.941 | 0.941 | 0.940 |
| 2000 | 10 | 0.050 | 0.942 | 0.941 | 0.941 | 0.940 |
| 2000 | 10 | 0.100 | 0.949 | 0.950 | 0.953 | 0.950 |

Each cell represents the coverage based on 1000 simulated data sets.

Table 10
Empirical Coverage Probabilities for Logit Confidence Interval

| n | m | π | ρ = 0.1 | ρ = 0.2 | ρ = 0.4 | ρ = 0.8 |
|---|---|---|---|---|---|---|
| 1000 | 5 | 0.005 | 0.949 | 0.940 | 0.930 | 0.952 |
| 1000 | 5 | 0.010 | 0.946 | 0.946 | 0.935 | 0.960 |
| 1000 | 5 | 0.050 | 0.969 | 0.949 | 0.935 | 0.960 |
| 1000 | 5 | 0.100 | 0.938 | 0.946 | 0.946 | 0.948 |
| 1000 | 10 | 0.005 | 0.952 | 0.937 | 0.941 | 0.952 |
| 1000 | 10 | 0.010 | 0.948 | 0.945 | 0.943 | 0.964 |
| 1000 | 10 | 0.050 | 0.952 | 0.947 | 0.952 | 0.959 |
| 1000 | 10 | 0.100 | 0.945 | 0.944 | 0.950 | 0.952 |
| 2000 | 5 | 0.005 | 0.952 | 0.960 | 0.944 | 0.951 |
| 2000 | 5 | 0.010 | 0.965 | 0.939 | 0.951 | 0.953 |
| 2000 | 5 | 0.050 | 0.953 | 0.956 | 0.951 | 0.945 |
| 2000 | 5 | 0.100 | 0.940 | 0.965 | 0.950 | 0.952 |
| 2000 | 10 | 0.005 | 0.947 | 0.937 | 0.954 | 0.956 |
| 2000 | 10 | 0.010 | 0.960 | 0.944 | 0.945 | 0.943 |
| 2000 | 10 | 0.050 | 0.946 | 0.951 | 0.939 | 0.947 |
| 2000 | 10 | 0.100 | 0.954 | 0.957 | 0.944 | 0.950 |

Each cell represents the coverage based on 1000 simulated data sets.

Our sample size calculations require the specification of a priori estimates of π and ρ. This is typical of any sample size calculation. In the next section we discuss suggestions for selecting values of π and ρ as part of a sample size calculation. The asymmetry of the logit interval provides a challenge relative to the typical sample size calculation.
Thus rather than specifying the margin of error as is typical, we will specify the desired upper bound for the CI, call it $\pi_{max}$. Given the nature of BA’s and their usage, it seems somewhat natural to specify the highest acceptable value for the range of the interval. We then set the upper endpoint of (6) equal to $\mathrm{logit}(\pi_{max})$ and solve for n.

Given (6), we can determine the appropriate sample size needed to estimate π with a certain level of confidence, 1 − α, to within a specified upper bound, $\pi_{max}$. Since it is not possible to simultaneously solve for m and n, we propose a conditional solution. First, specify appropriate values for π, ρ, $\pi_{max}$, and 1 − α. Second, fix m, the number of attempts per comparison. We assume for sample size calculations that $m_i = m$ for all i. (If significant variability in the $m_i$’s is anticipated then we recommend using a value of m that is slightly less than the anticipated average of the $m_i$’s.) Third, solve for n, the number of comparisons to be tested. We then find n via the following equation, given the other quantities:

$$
n = \left( \frac{z_{1-\alpha/2}}{\mathrm{logit}(\pi_{max}) - \mathrm{logit}(\pi)} \right)^{2} \frac{1 + (m - 1)\rho}{m \pi (1 - \pi)}.
\tag{9}
$$

The above follows directly from (6). To illustrate this, suppose we want to estimate π to an upper bound of $\pi_{max} = 0.01$ with 99% confidence and we believe π to be 0.005 and ρ to be 0.2. If we plan on testing each comparison pair 5 times we would need

$$
n = \left\lceil \left( \frac{2.576}{\mathrm{logit}(0.01) - \mathrm{logit}(0.005)} \right)^{2} \frac{1 + (5 - 1)0.2}{5(0.005)(0.995)} \right\rceil = \lceil 984.92 \rceil = 985.
\tag{10}
$$

So we would need to test 985 comparison pairs 5 times each to achieve a 99% CI with an upper bound of 0.01. If asymmetric cross-comparisons are to be used among multiple individuals, then one could replace n on the left-hand side of (9) with $n^{*}(n^{*} - 1)$ and solve for $n^{*}$. In the example above, $n^{*} = 32$ would be the required number of individuals. In the case of symmetric cross comparisons one would solve $n^{*}(n^{*} - 1)/2 \geq 985$, which yields $n^{*} = 45$ individuals assuming the conditions specified above.
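The calculation in (9) and (10) is easy to script. The sketch below is an illustration using only the Python standard library; the function names `required_pairs` and `required_individuals` are hypothetical, and rounding up to the next integer is assumed as in (10).

```python
from math import ceil, log
from statistics import NormalDist


def logit(p: float) -> float:
    return log(p / (1 - p))


def required_pairs(pi: float, rho: float, m: int, pi_max: float,
                   alpha: float = 0.05) -> int:
    """Number of comparison pairs n from (9), rounded up."""
    if not 0.0 < pi < pi_max < 1.0:
        raise ValueError("need 0 < pi < pi_max < 1")
    z = NormalDist().inv_cdf(1 - alpha / 2)
    n = (z / (logit(pi_max) - logit(pi))) ** 2 * (1 + (m - 1) * rho) / (m * pi * (1 - pi))
    return ceil(n)


def required_individuals(n_pairs: int, symmetric: bool = False) -> int:
    """Smallest n* whose cross-comparisons supply at least n_pairs pairs."""
    divisor = 2 if symmetric else 1
    k = 2
    while k * (k - 1) // divisor < n_pairs:
        k += 1
    return k


# Worked example (10): pi = 0.005, rho = 0.2, m = 5, pi_max = 0.01, 99% confidence.
n = required_pairs(pi=0.005, rho=0.2, m=5, pi_max=0.01, alpha=0.01)
print(n)                                        # 985
print(required_individuals(n))                  # 32 (asymmetric cross-comparisons)
print(required_individuals(n, symmetric=True))  # 45 (symmetric cross-comparisons)
```

Looping `required_pairs` over several values of m, with the other inputs held fixed, gives the kind of trade-off between n and the total number of decisions mn that Table 11 reports.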
Table 11 contains additional values of n for given values of m. In addition this table contains mn, the total number of “decisions” that would be needed to achieve the specified upper bound for this CI. Clearly the relationship between mn and n is non-linear. This concurs with the observation of [2] when they discuss the “non-stationarity” of collecting biometric data.

Table 11
n necessary to create a 99% confidence interval with π = 0.005, $\pi_{max}$ = 0.01, ρ = 0.2 for various values of m

| m | n | mn |
|---|---|---|
| 2 | 1642 | 3284 |
| 5 | 985 | 4925 |
| 8 | 821 | 6568 |
| 10 | 767 | 7670 |
| 12 | 730 | 8760 |
| 15 | 694 | 10410 |
| 20 | 657 | 13140 |
| 30 | 621 | 18630 |

6 Discussion

The recent Biometric Research Agenda stated clearly that one of the fundamental needs for research on BA’s was the development of “statistical understanding of biometric systems sufficient to produce models useful for performance evaluation and prediction,” [3, p. 3]. The methodologies discussed in this paper are a significant step toward that. This paper adds two significant tools for testers of biometric identification devices: well-understood CI methodology and a formula for determining the number of individuals to be tested. These are significant advances on core issues in the evaluation, assessment and development of biometric authentication devices. Below we discuss the properties of these methods and outline some future directions for research in this area.

The models we have developed are based on the following widely applicable assumptions. First we assume that the moments of the $X_i$’s are given by (1). Second we assume that attempts made by each comparison are conditionally independent given the model parameters. We reiterate that an analysis of data found in [19] suggests that these are reasonable assumptions. For any BA, its matching performance is often critical to the overall performance of the system in which it is embedded. In this paper we have presented two new methodologies for creating a CI for an error rate.
The logit transformed CI, (6), had superior performance to the traditional CI. This methodology did well when $n^{\dagger} \pi > 5$. Though this study presented results only for 95% CI’s, it is reasonable to assume performance will be similar for other confidence levels.

Further, we have presented methodology for determining the number of attempts needed for making a CI. This is an immediate consequence of using a parametric CI. Because of the asymmetry of this CI, it is necessary to specify the upper bound for the CI as well as specifying m, π and ρ. All sample size calculations carried out before data is collected require estimates of parameters. To choose estimates we suggest the following possibilities in order of importance.

(1) Use estimates for π and ρ from previous studies collected under similar circumstances.
(2) Conduct a pilot study with some small number of comparisons and a value of m that will likely be used in the full experiment. That will allow for reasonable estimates of π and ρ.
(3) Make a reasonable estimate based on knowledge of the BA and the environment in which it will be tested. One strategy here is to overestimate π and ρ, which will generally yield n larger than is needed.

As outlined above, this now gives BA testers an important tool for determining the number of comparisons and the number of decisions per comparison pair necessary for assessing a single FAR or FRR.

References

[1] R. M. Bolle, N. K. Ratha, S. Pankanti, Error analysis of pattern recognition systems – the subsets bootstrap, Computer Vision and Image Understanding 93 (2004) 1–33.
[2] T. Mansfield, J. L. Wayman, Best practices in testing and reporting performance of biometric devices, on the web at www.cesg.gov.uk/site/ast/biometrics/media/BestPractice.pdf (2002).
[3] E. P. Rood, A. K. Jain, Biometric research agenda, Report of the NSF Workshop (2003).
[4] W. Shen, M. Surette, R.
Khanna, Evaluation of automated biometrics-based identification and verification systems, Proceedings of the IEEE 85 (9) (1997) 1464–1478. [5] M. Golfarelli, D. Maio, D. Maltoni, On the error-reject trade-off in biometric verification systems, IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (7) (1997) 786–796. [6] I. Guyon, J. Makhoul, R. Schwartz, V. Vapnik, What size test set gives good error rate estimates, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (1) (1998) 52–64. 18 [7] M. E. Schuckers, Using the beta-binomial distribution to assess performance of a biometric identification device, International Journal of Image and Graphics 3 (3) (2003) 523–529. [8] J. L. Wayman, Confidence interval and test size estimation for biometric data, in: Proceedings of IEEE AutoID ’99, 1999, pp. 177–184. [9] R. J. Michaels, T. E. Boult, Efficient evaluation of classification and recognition systems, in: Proceedings of the International Conference on Computer Vision and Pattern Recognition, 2001. [10] G. R. Doddington, M. A. Przybocki, A. F. Martin, D. A. Reynolds, The NIST speaker recognition evaluation: overview methodology, systems, results, perspective, Speech Communication 31 (2-3) (2000) 225–254. [11] D. F. Moore, Modeling the extraneous variance in the presence of extra-binomial variation, Applied Statistics 36 (1) (1987) 8–14. [12] G. W. Snedecor, W. G. Cochran, Statistical Methods, 8th Edition, Iowa State University Press, 1995. [13] W. G. Cochran, Sampling Techniques, 3rd Edition, John Wiley & Sons, New York, 1977. [14] J. L. Fleiss, B. Levin, M. C. Paik, Statistical Methods for Rates and Proportions, John Wiley & Sons, Inc., 2003. [15] D. F. Moore, Asymptotic properties of moment estimators for overdispersed counts and proportions, Biometrika 73 (3) (1986) 583–588. [16] A. Agresti, Categorical Data Analysis, John Wiley & Sons, New York, 1990. [17] R. G. 
Newcombe, Logit confidence intervals and the inverse sinh transformation, The American Statistician 55 (3) (2001) 200–202. [18] M. J. Schervish, Theory of Statistics, Springer-Verlag, New York, 1995. [19] A. Ross, A. K. Jain, Information fusion in biometrics, Pattern Recognition Letters 24 (13) (2003) 2115–2125. [20] J. E. Gentle, Random Number Generation and Monte Carlo Methods, SpringerVerlag, 2003. [21] D. D. Wackerly, W. M. III, R. L. Scheaffer, Mathematical Statistics with Applications, 6th Edition, Duxbury, 2002. [22] S. T. Garren, R. L. Smith, W. W. Piegorsch, Bootstrap goodness-of-fit test for the beta-binomial model, Journal of Applied Statistics 28 (5) (2001) 561–571. [23] L. Kish, Survey Sampling, John Wiley & Sons, New York, 1965. 19