To evaluate the performance of speaker recognition systems, a detection cost function defined as a
weighted sum of the probabilities of type I and type II errors is employed. The speaker datasets may exhibit data dependency because the same subjects are used multiple times. Using the standard errors of the detection cost function, computed by means of the two-layer nonparametric two-sample bootstrap
method, a significance test is performed to determine whether the difference between the measured
performance levels of two speaker recognition algorithms is statistically significant. While
conducting the significance test, the correlation coefficient between two systems’ detection cost
functions is taken into account. Examples are provided.
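To make the test concrete, the following is a minimal Python sketch of a two-sided z-test on the difference of two detection cost functions, with the covariance term driven by the correlation coefficient between the two systems. It assumes the SEs have already been estimated by the two-layer bootstrap; the function name, the significance level, and the numerical values are illustrative, not taken from any actual evaluation.

```python
import numpy as np
from scipy.stats import norm

def dcf_significance_test(dcf_a, se_a, dcf_b, se_b, rho, alpha=0.05):
    """Two-sided z-test for the difference of two detection cost
    functions (DCFs), accounting for the correlation between systems.

    dcf_a, dcf_b : measured DCF values of the two systems
    se_a, se_b   : bootstrap standard errors of the two DCFs
    rho          : correlation coefficient between the systems' DCFs
    """
    # SE of the difference; the covariance term shrinks it when the
    # two systems' errors are positively correlated
    se_diff = np.sqrt(se_a**2 + se_b**2 - 2.0 * rho * se_a * se_b)
    z = (dcf_a - dcf_b) / se_diff
    p_value = 2.0 * (1.0 - norm.cdf(abs(z)))
    return z, p_value, p_value < alpha

# Hypothetical numbers, for illustration only
z, p, significant = dcf_significance_test(0.032, 0.0021, 0.027, 0.0019, rho=0.6)
```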
The National Institute of Standards and Technology conducts an ongoing series of Speaker
Recognition Evaluations (SRE). Speaker detection performance is measured using a detection cost
function defined as a weighted sum of the probabilities of type I and type II errors. The sampling
variability can result in measurement uncertainties. In our prior study, data independence was assumed when the nonparametric two-sample bootstrap method was used to compute the standard errors (SE) of the detection cost function, based on our extensive bootstrap variability studies in ROC analysis on large datasets. In this article, the data dependency caused by multiple uses of the same
subjects is taken into account. The data are grouped into target sets and non-target sets, and each set
contains multiple scores. One-layer and two-layer bootstrap methods are proposed, depending on whether the two-sample bootstrap resampling takes place only on the target and non-target sets, or subsequently also on the target and non-target scores within the resampled sets, respectively. The SEs of the detection cost function obtained with these two methods are compared with those computed under the assumption of data independence. It is found that data dependency increases both the estimated SEs and the variations of the SEs. Some suggestions regarding test design are provided.
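As a sketch of the two-layer scheme just described: the first bootstrap layer resamples whole target and non-target sets with replacement, and the second layer resamples scores within each selected set. The grouping by subject, the detection cost parameters, and all names below are assumptions of this illustration, not the NIST SRE configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def dcf(target_scores, nontarget_scores, threshold,
        c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Detection cost function: a weighted sum of the miss and
    false-alarm probabilities at a given decision threshold
    (cost weights and prior are illustrative, not the SRE values)."""
    p_miss = np.mean(target_scores < threshold)
    p_fa = np.mean(nontarget_scores >= threshold)
    return c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa

def two_layer_bootstrap_se(target_sets, nontarget_sets, threshold,
                           n_boot=2000):
    """target_sets / nontarget_sets: lists of 1-D arrays, one array
    of scores per subject (set)."""
    stats = np.empty(n_boot)
    for b in range(n_boot):
        # layer 1: resample whole sets with replacement
        t_sets = [target_sets[i] for i in
                  rng.integers(len(target_sets), size=len(target_sets))]
        n_sets = [nontarget_sets[i] for i in
                  rng.integers(len(nontarget_sets), size=len(nontarget_sets))]
        # layer 2: resample scores within each selected set
        t_scores = np.concatenate([rng.choice(s, size=len(s)) for s in t_sets])
        n_scores = np.concatenate([rng.choice(s, size=len(s)) for s in n_sets])
        stats[b] = dcf(t_scores, n_scores, threshold)
    return stats.std(ddof=1)
```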
The National Institute of Standards and Technology (NIST) conducts the Speaker Recognition Evaluations (SRE), an ongoing series of evaluation projects. In the NIST SRE, speaker detection
performance is measured using a detection cost function, which is defined as a weighted sum of
probabilities of type I error and type II error. The sampling variability can result in measurement
uncertainties of the detection cost function. Hence, while evaluating and comparing the
performances of speaker recognition systems, the uncertainties of measures must be taken into
account. In this article, the uncertainties of detection cost functions in terms of standard errors (SE)
and confidence intervals are computed using the nonparametric two-sample bootstrap methods based
on our extensive bootstrap variability studies previously conducted on large datasets. Data independence is assumed because, when the metric of area under a receiver operating characteristic curve is employed, the bootstrap SEs closely match the analytical SEs computed from the Mann-Whitney statistic for independent and identically distributed samples. Examples are
provided.
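To illustrate the kind of comparison behind that assumption, here is a hedged sketch that contrasts the two-sample bootstrap SE of the AUC with a nonparametric analytical SE for i.i.d. samples. The analytical variance used below is the DeLong-style placement-value estimate, a stand-in for whatever exact Mann-Whitney variance formula the study used; the synthetic continuous scores are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def auc_mann_whitney(pos, neg):
    """AUC via the Mann-Whitney statistic: the fraction of
    (genuine, impostor) pairs correctly ordered, ties counted as 1/2."""
    diff = pos[:, None] - neg[None, :]
    return np.mean(diff > 0) + 0.5 * np.mean(diff == 0)

def auc_se_analytical(pos, neg):
    """DeLong-style nonparametric SE of the Mann-Whitney AUC for
    i.i.d. samples (continuous scores assumed, so ties are ignored)."""
    # placement values: per-score success rates against the other sample
    wins = pos[:, None] > neg[None, :]
    v10, v01 = wins.mean(axis=1), wins.mean(axis=0)
    return np.sqrt(v10.var(ddof=1) / len(pos) + v01.var(ddof=1) / len(neg))

def auc_se_bootstrap(pos, neg, n_boot=2000):
    """Two-sample bootstrap SE: resample the genuine and impostor
    scores independently and recompute the AUC each iteration."""
    stats = [auc_mann_whitney(rng.choice(pos, len(pos)),
                              rng.choice(neg, len(neg)))
             for _ in range(n_boot)]
    return np.std(stats, ddof=1)

# Synthetic i.i.d. scores: the two SEs should agree closely
pos = rng.normal(1.0, 1.0, 400)    # genuine (target) scores
neg = rng.normal(0.0, 1.0, 4000)   # impostor (non-target) scores
print(auc_se_analytical(pos, neg), auc_se_bootstrap(pos, neg))
```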
To evaluate the performance of fingerprint-image matching algorithms on large datasets, a receiver
operating characteristic (ROC) curve is applied. From the operational perspective, the true accept
rate (TAR) of the genuine scores at a specified false accept rate (FAR) of the impostor scores and/or
the equal error rate (EER) are often employed. Using the standard errors of these metrics computed
using the nonparametric two-sample bootstrap based on our studies of bootstrap variability on large
fingerprint datasets, a significance test is performed to determine whether the difference between the performance of one algorithm and a hypothesized value, or the difference between the performances of two algorithms with their correlation taken into account, is statistically significant. If the alternative hypothesis is accepted, the sign of the difference determines which algorithm performs better. Examples are provided.
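For concreteness, a small sketch of the two operational metrics named above, computed directly from genuine and impostor score samples. The convention that higher scores indicate genuine matches, the quantile-based thresholding, and the grid search for the EER are assumptions of this sketch.

```python
import numpy as np

def tar_at_far(genuine, impostor, far_target=0.001):
    """True accept rate of the genuine scores at the threshold where
    the impostor scores yield (approximately) the specified FAR."""
    # threshold at the (1 - FAR) quantile of the impostor scores
    threshold = np.quantile(impostor, 1.0 - far_target)
    return np.mean(genuine >= threshold)

def eer(genuine, impostor, n_grid=10000):
    """Equal error rate: the operating point where the false reject
    rate of the genuine scores equals the false accept rate of the
    impostor scores, found here by a simple grid search."""
    lo = min(genuine.min(), impostor.min())
    hi = max(genuine.max(), impostor.max())
    thresholds = np.linspace(lo, hi, n_grid)
    frr = np.array([np.mean(genuine < t) for t in thresholds])
    far = np.array([np.mean(impostor >= t) for t in thresholds])
    i = np.argmin(np.abs(frr - far))
    return 0.5 * (frr[i] + far[i])
```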
A nonparametric inferential statistical data analysis is presented. The utility of this method is
demonstrated through analyzing results from minutiae exchange with two-finger fusion. The
analysis focuses on high-accuracy vendors and two modes of matching standard fingerprint templates: 1) Native Matching, where the same vendor generates the templates and the matcher, and 2) Scenario 1 Interoperability, where vendor A's enrollment template is matched to vendor B's authentication template using vendor B's matcher. The purpose of this analysis is to make inferences about the underlying population from sample data, which provides insights at an
aggregate level. This is very different from the data analysis presented in the MINEX04 report
in which vendors are individually ranked and compared. Using the nonparametric bootstrap
bias-corrected and accelerated (BCa) method, 95 % confidence intervals are computed for each
mean error rate. Nonparametric significance tests are then applied to further determine if the
difference between two underlying populations is real or occurred by chance with a certain probability. Results from this method show that, at a greater-than-95 % confidence level, there is a significant degradation in the accuracy of Scenario 1 Interoperability with respect to Native Matching; the difference in error rates can reach, on average, a two-fold increase in the False Non-Match Rate. Additionally, it is proved why two-finger fusion using the sum rule is more accurate than single-finger matching under the same conditions. Results of a simulation are also presented to show the significance of the confidence intervals derived from small sample sizes, such as the six error rates available in some of our cases.
KEYWORDS: Monte Carlo methods, Biometrics, Error analysis, Tolerancing, Statistical analysis, Receivers, Standards development, Information technology, Data analysis, Homeland security
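A minimal sketch of a BCa confidence interval for a mean error rate, using SciPy's general-purpose bootstrap routine rather than the study's own implementation; the six false-non-match rates below are hypothetical numbers echoing the small-sample cases mentioned above.

```python
import numpy as np
from scipy.stats import bootstrap

# Hypothetical sample of six false-non-match rates (illustrative only)
error_rates = np.array([0.011, 0.014, 0.009, 0.016, 0.012, 0.013])

# Bias-corrected and accelerated (BCa) 95 % interval for the mean
res = bootstrap((error_rates,), np.mean, method='BCa',
                confidence_level=0.95, n_resamples=9999,
                random_state=np.random.default_rng(2))
print(res.confidence_interval)
```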
The fingerprint datasets in many cases may exceed millions of samples. Thus, the needed size of a biometric evaluation test sample is an important issue in terms of both efficiency and accuracy. In this article, an empirical method, namely, Chebyshev's inequality in combination with simple random sampling, is applied to determine the sample size for biometric applications. No parametric model is assumed, since the underlying distribution functions of the similarity scores are unknown. The performance of a fingerprint-image matcher is measured by a Receiver Operating Characteristic (ROC) curve. Both the area under an ROC curve and the True Accept Rate (TAR) at an operational False Accept Rate (FAR) are employed. The Chebyshev greater-than-95% intervals for these two criteria, based on 500 Monte Carlo iterations, are computed for different sample sizes as well as for both high- and low-quality fingerprint-image matchers. The stability of such Monte Carlo calculations with respect to the number of iterations is also explored. The choice of sample size depends on the matchers' qualities as well as on which performance criterion is invoked. In general, for 6,000 match similarity scores, 50,000 to 70,000 scores randomly selected from 35,994,000 non-match similarity scores can ensure accuracy with greater-than-95% probability.
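The general recipe described above can be sketched as follows: repeatedly draw a simple random sample of non-match scores, recompute the performance statistic, and form a distribution-free Chebyshev interval around the Monte Carlo mean. The statistic used here is TAR at a fixed FAR; the function names, the sampling-without-replacement choice, and the score conventions are assumptions of this sketch, not the article's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(3)

def chebyshev_interval(nonmatch_scores, genuine_scores, sample_size,
                       far_target=0.001, n_iter=500, coverage=0.95):
    """Monte Carlo Chebyshev interval for TAR at a fixed FAR under
    simple random sampling of the non-match scores.  Chebyshev's
    inequality guarantees at least `coverage` probability within
    k standard deviations for any distribution, with
    k = 1 / sqrt(1 - coverage) (k ~ 4.47 for 95 %)."""
    k = 1.0 / np.sqrt(1.0 - coverage)
    tars = np.empty(n_iter)
    for i in range(n_iter):
        # simple random sample of the non-match similarity scores
        sample = rng.choice(nonmatch_scores, size=sample_size, replace=False)
        threshold = np.quantile(sample, 1.0 - far_target)
        tars[i] = np.mean(genuine_scores >= threshold)
    mu, sigma = tars.mean(), tars.std(ddof=1)
    return mu - k * sigma, mu + k * sigma
```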