1. Introduction

For the typical diagnostic radiology study, several readers (usually 4 to 10 radiologists) assign confidence-of-disease ratings to each case (i.e., subject) based on one or more corresponding radiologic images, using one or more tests (typically imaging modalities), with the numbers of diseased and nondiseased cases each typically between 25 and 100. The resulting data are called multi-reader multi-case (MRMC) data. These studies are typically used to compare different imaging modalities with respect to reader performance. Often, measures of reader performance are functions of the estimated receiver-operating-characteristic (ROC) curve, such as the area under the ROC curve (AUC). Throughout we assume AUC is the reader performance metric of interest. Two commonly used methods for analyzing reader performance outcomes that allow conclusions to generalize to both the reader and case populations are the method proposed by Obuchowski and Rockette1 and later modified by Hillis,2 which will be referred to as the “OR” method, and the method proposed by Gallas3 and Gallas et al.,4 which will be referred to as the “Gallas” method. The most frequently used model for simulating MRMC data has been the model first proposed by Roe and Metz5 and later generalized by Hillis,6 Abbey7 and Gallas and Hillis.8 We will refer to each of these models as a “Roe and Metz” or “RM” model when there is no need to distinguish between them. These RM models have been used for evaluating MRMC analysis and sample size methods. As discussed by Hillis,9 these RM models generate continuous confidence-of-disease ratings based on an underlying binormal model for each reader, with the separation between the normal and abnormal rating distributions varying across readers. The parameter settings included in the original RM paper5 result in RM “null” models, where the mean AUC across readers is the same for each test.
These null models are useful for evaluating the performance of MRMC methods with respect to type I error for the hypothesis of equal test AUCs. However, these null models can result in correlations for the simulated ratings that would be different if the two tests were identical. For example, it will be shown that between-test correlations of case ratings generated from the RM null model are less than or equal to corresponding correlations when the two tests are identical. An RM null model where the two tests are identical will be referred to as an “identical-test” model. Although there is no reason to compare two tests that are known to be identical, sometimes it is of interest to compare two tests that are quite similar in most respects, e.g., when the two tests are the same imaging modality but used with slightly different radiation doses. For this situation a researcher likely would want to test whether the lower-dose modality is noninferior or equivalent to the higher-dose modality. For such situations, it is important to know that the MRMC analysis method being used performs well when the tests are close to being identical, not only in terms of AUC, but in other ways. A discussion of how to determine parameter settings that result in an identical-test model is not provided in the original RM paper or in any of the previously mentioned papers that generalize the original RM model. The purpose of this paper is to show how to formulate an RM identical-test model and to show its usefulness for validating the need for the error covariance constraints employed by the OR method. A summary of the paper is as follows: a review of the various RM models is provided in Sec. 2, the definition and derivation of an RM identical-test model are provided in Sec. 3 with illustrative examples in Sec. 4, a brief review of the conventional OR, unconstrained OR and Gallas methods is provided in Sec. 5 with simulation studies comparing the methods in Sec.
6, a discussion of how a negative OR variance can occur is presented in Sec. 7 with illustrative simulation studies in Sec. 8, followed by a summary and discussion in Sec. 9.

2. Roe and Metz Null Models: Original, Constrained, and Unconstrained Unequal-Variance

2.1. Original RM Null Model

Let $X$ denote a confidence-of-disease rating assigned by a reader to a case; $X$ is often called a decision variable (DV). The original RM simulation model proposed by Roe and Metz5 is a mixed four-factor (test, reader, case, and truth) ANOVA model for $X$ with case nested within truth; test, reader, and truth crossed; test and truth treated as fixed factors; and reader and case treated as random factors. Using their notation, their null model is given as

$$X_{ijkt} = \mu_{+} I(t = +) + R_{jt} + C_{kt} + (\tau R)_{ijt} + (\tau C)_{ikt} + (RC)_{jkt} + (\tau RC)_{ijkt} + E_{ijkt}, \tag{1}$$

where $X_{ijkt}$ denotes the confidence-of-disease rating assigned to case $k$ of truth state $t$ by reader $j$ when reading under test $i$, with $t$ = “−” indicating a nondiseased case and $t$ = “+” indicating a diseased case. Here $\mu_{+}$ is the expected difference in the means for the diseased and nondiseased DV distributions, $I(t = +)$ is an indicator function that takes the value 1 when $t = +$ and 0 when $t = -$, $R_{jt}$ is the interaction effect of reader $j$ and truth state $t$, $C_{kt}$ is the effect of case $k$ nested within truth state $t$, the multiple symbols in parentheses denote interactions, and $E_{ijkt}$ is the error term. By comparison, the nonnull model given by Roe and Metz is the same as Eq. (1) except that it also includes a test-by-truth interaction term, denoted by $\tau_{it}$, which is implicitly set to zero in the null model Eq. (1).

All effects in Eq. (1) are random except for $\mu_{+}$. The random effects are mutually independent and normally distributed with zero means. Roe and Metz denote the corresponding variance components by $\sigma_R^2$, $\sigma_C^2$, $\sigma_{\tau R}^2$, $\sigma_{\tau C}^2$, $\sigma_{RC}^2$, $\sigma_{\tau RC}^2$, and $\sigma_E^2$.
They note that $\sigma_{\tau RC}^2$ and $\sigma_E^2$ cannot be estimated separately for this model with no replications, and hence define

$$\sigma_{\varepsilon}^2 \equiv \sigma_{\tau RC}^2 + \sigma_E^2. \tag{2}$$

Although not mentioned by Roe and Metz, the omission of effects that do not depend on truth is justified by the invariance of the ROC curve to location shifts; thus, inclusion of these terms would not change the ROC curve for a given reader. Note that interactions with truth are denoted only by a subscript in Eq. (1). Roe and Metz constrain the sum of the error variance and variance components involving case to be equal to one:

$$\sigma_C^2 + \sigma_{\tau C}^2 + \sigma_{RC}^2 + \sigma_{\varepsilon}^2 = 1. \tag{3}$$

It can be shown (e.g., Hillis9) that the reader nondiseased and diseased DV distributions have unit variances (and hence their ROC curves are symmetric about the negative 45-deg diagonal), with the reader true AUCs varying across the reader population and having the same expectation for each test. Furthermore, a randomly selected reader has the same ROC curve under each test.

2.2. Constrained and Unconstrained Unequal-Variance RM Null Models

In practice, estimated binormal-model nondiseased and diseased DV variances for a fixed reader are often different, with diseased subjects typically having more variable case ratings. To better emulate real data, Hillis6 modified the original RM model by allowing the error variance and variance components involving case to depend on truth, with variance components involving diseased cases set equal to those involving normal cases multiplied by the factor $b^{-2}$. Specifically, the null model is given by Eq. (1) with variance components (using an obvious notation) denoted as

$$\sigma_R^2,\ \sigma_{\tau R}^2,\ \sigma_{C(t)}^2,\ \sigma_{\tau C(t)}^2,\ \sigma_{RC(t)}^2,\ \sigma_{\varepsilon(t)}^2, \quad t = -, +, \tag{4}$$

with

$$\sigma_{C(+)}^2 = b^{-2}\sigma_{C(-)}^2,\quad \sigma_{\tau C(+)}^2 = b^{-2}\sigma_{\tau C(-)}^2,\quad \sigma_{RC(+)}^2 = b^{-2}\sigma_{RC(-)}^2,\quad \sigma_{\varepsilon(+)}^2 = b^{-2}\sigma_{\varepsilon(-)}^2. \tag{5}$$

Similar to Eq. (3), the constraint

$$\sigma_{C(-)}^2 + \sigma_{\tau C(-)}^2 + \sigma_{RC(-)}^2 + \sigma_{\varepsilon(-)}^2 = 1 \tag{6}$$

is imposed. It follows that

$$\sigma_{C(+)}^2 + \sigma_{\tau C(+)}^2 + \sigma_{RC(+)}^2 + \sigma_{\varepsilon(+)}^2 = b^{-2}.$$

Following Hillis,6 we refer to this as the “constrained unequal-variance RM null model.” It follows6 from Eq. (5) that setting $b = 1$ results in the original RM model and that $b$ is the conventional binormal-model slope coefficient for each reader’s ROC curve.
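The data-generating process described in Secs. 2.1 and 2.2 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the author's simulation code: the function name `simulate_rm_null`, the dictionary keys, and the variance values in the usage lines are hypothetical; the triple-interaction and pure-error terms are drawn together as a single error term; and diseased-case variance components are taken to be the nondiseased ones multiplied by `b ** -2`, with the reader components left unscaled.

```python
import numpy as np


def simulate_rm_null(n_readers, n_neg, n_pos, mu_plus, var, b=1.0,
                     n_tests=2, seed=None):
    """Sketch of a constrained unequal-variance RM null simulation.

    `var` holds nondiseased variance components under hypothetical keys:
    "R", "tR" (reader, test-by-reader), "C", "tC", "RC" (case-related),
    and "eps" (combined error). Returns a dict mapping "-" and "+" to
    rating arrays of shape (n_tests, n_readers, n_cases).
    """
    rng = np.random.default_rng(seed)

    def draw(key, shape, scale=1.0):
        # Zero-mean normal effect with variance var[key] * scale.
        return rng.normal(0.0, np.sqrt(var[key] * scale), shape)

    ratings = {}
    for truth, n_cases, s in (("-", n_neg, 1.0), ("+", n_pos, b ** -2)):
        mean = mu_plus if truth == "+" else 0.0
        ratings[truth] = (
            mean
            + draw("R", (1, n_readers, 1))              # reader x truth
            + draw("tR", (n_tests, n_readers, 1))       # test x reader
            + draw("C", (1, 1, n_cases), s)             # case within truth
            + draw("tC", (n_tests, 1, n_cases), s)      # test x case
            + draw("RC", (1, n_readers, n_cases), s)    # reader x case
            + draw("eps", (n_tests, n_readers, n_cases), s)  # error
        )
    return ratings


# Illustrative values only; the four case-related components sum to 1.
var = {"R": 0.0055, "tR": 0.0055, "C": 0.3, "tC": 0.3, "RC": 0.2, "eps": 0.2}
sample = simulate_rm_null(5, 25, 25, mu_plus=1.5, var=var, b=0.771, seed=1)
```

Setting `b=1.0` gives equal nondiseased and diseased variances, which corresponds to the original RM null model.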
A more general RM null model, called the “unconstrained unequal-variance RM null model” by Hillis,9 results if the variance components $\sigma_{C(+)}^2$, $\sigma_{\tau C(+)}^2$, $\sigma_{RC(+)}^2$, and $\sigma_{\varepsilon(+)}^2$ are not constrained to satisfy any particular relationship with $\sigma_{C(-)}^2$, $\sigma_{\tau C(-)}^2$, $\sigma_{RC(-)}^2$, and $\sigma_{\varepsilon(-)}^2$. This model includes the original and constrained unequal-variance RM null models as special cases.

2.3. Comparison of the RM Null Models

The original RM null model and the constrained and unconstrained unequal-variance RM null models all have the same mixed linear model formulation, given by Eq. (1); all of them also constrain the sum of the variance components corresponding to effects involving nondiseased cases to be equal to 1, as given by Eq. (6). The null models differ only with respect to their constraints on the variance components corresponding to effects involving diseased cases, with the original RM model requiring that the variance components be the same as those for the nondiseased cases, the constrained unequal-variance model requiring that they differ by a factor of $b^{-2}$ from those for the nondiseased cases, and the unconstrained unequal-variance model not placing any constraints on them.

3. Proposed RM Identical-Test Model

3.1. Definition of Identical-Test Model

I define two tests to be “identical” if they are the same in all respects. I will derive an RM identical-test model by applying this definition to an unconstrained unequal-variance RM null model; since this model includes the original and constrained unequal-variance RM null models as special cases, the derivation can also be applied to those models. Recall that the unconstrained unequal-variance RM null model is defined by mixed linear model Eq. (1) with variance components given by Eq. (4) subject only to constraint Eq. (6).

3.2. Derivation of an RM Identical-Test Model

In this section I derive the RM identical-test model by modifying the unconstrained unequal-variance RM null model.
The definition of identical tests implies that model effects (excluding the error term) cannot differ by test in an RM identical-test model. Thus, if tests and are identical, it follows that model effects in Eq. (1) that include test do not depend on the value of the test subscript . Specifically, , , and in Eq. (1) cannot depend on the value of subscript ; hence , , and . Thus I can derive the RM identical-test model from the unequal-variance RM null model using the following result. Result 1. Setting test subscript values (i) in Eq. (1) equal to 1 for model effects (excluding the error term) results in the corresponding RM identical-test model. Applying Result 1 results in none of the model effects that include test depending on the value of the test subscript, since it will be the same for all of these effects. Applying Result 1 to the unequal-variance RM null model given by mixed linear model Eq. (1) with variance components Eq. (4) subject only to constraint Eq. (6) results in the identical-test RM null model where is the identical-test model DV. Consolidating random effects results in the equivalent model whereCorresponding variance components for , , , and are given as It follows from Eqs. (6) and (9)–(11) that In summary, the RM identical-test model derived from the unconstrained unequal-variance null RM model is given by model Eq. (7) with variance components Eqs. (8)–(11) and constraint Eq. (12). Because the original, constrained unequal variance, and unconstrained unequal variance RM null models specify values for without specifying specific values for either or , values must be assigned to , for the null model in order to determine values for and in the identical-test model using Eqs. (10) and (11). For simplicity, for the remainder of this paper I will assume in the unconstrained unequal variance RM model, resulting in in the identical-test model. On the other hand, if the values for or are specified, then the values for and can be computed using Eqs. 
(10) and (11).

When using the identical-test model for simulations, a rating is simulated for each of tests 1 and 2. Because the only term on the right of Eq. (7) that depends on test is the error term, it follows that for a given reader, case, and truth status, the ratings for the two tests differ only because their error term values are not the same. Note that because the derivation was based on an RM null model, the resulting RM identical-test model is also an RM null model and is a special case of the unconstrained unequal-variance RM null model.

3.3. Comparison of the RM Null Model and the Corresponding RM Identical-Test Model

The following relationships for the ratings generated from an unconstrained unequal-variance RM null model and its corresponding RM identical-test model, Eqs. (7)–(15), can be shown.
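One of these relationships, that between-test covariances under the null model cannot exceed those under the identical-test model while total variances agree, can be checked numerically. The sketch below rests on assumptions: it folds each test-dependent variance component into its test-free counterpart, as the consolidation in Sec. 3.2 suggests, and it takes the triple-interaction variance to be zero so that the error variance is unchanged. The function names, dictionary keys, and variance values are hypothetical and apply to a single truth state.

```python
import math


def to_identical_test(var):
    """Fold test-dependent variance components into test-free ones.

    Assumes the test-by-reader-by-case variance is zero, so the
    reader-by-case and error components carry over unchanged.
    """
    return {
        "R": var["R"] + var["tR"], "tR": 0.0,
        "C": var["C"] + var["tC"], "tC": 0.0,
        "RC": var["RC"], "eps": var["eps"],
    }


def between_test_cov(var, same_reader=True, same_case=True):
    """Covariance of two ratings from different tests, same truth state.

    Only effects that do not involve test are shared across tests.
    """
    cov = 0.0
    if same_reader:
        cov += var["R"]
    if same_case:
        cov += var["C"]
    if same_reader and same_case:
        cov += var["RC"]
    return cov


# Illustrative nondiseased variance components (not taken from Table 1).
null = {"R": 0.0055, "tR": 0.0055, "C": 0.3, "tC": 0.3, "RC": 0.2, "eps": 0.2}
ident = to_identical_test(null)

# Total rating variance is preserved by the conversion.
assert math.isclose(sum(null.values()), sum(ident.values()))
# Between-test covariance never decreases.
assert between_test_cov(ident) >= between_test_cov(null)
```

Because the conversion only reallocates variance, an existing null-model simulator can be reused unchanged by passing it the converted components.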
In summary, we see that the only difference between the rating distributions for the two models is that the between-test covariances for the unconstrained unequal-variance RM model can be less than those for the RM identical-test model. 3.4.RM Identical-Test Model Expressed in Terms of a Null Unconstrained Unequal-Variance RM Model with Altered Variance ComponentsIt follows from Eqs. (7)–(15) that the RM identical-test model can be expressed in terms of an unconstrained unequal-variance RM null model, with RM identical-test variance components (indicated by an overline) defined in terms of the unconstrained unequal-variance RM null model variance components as follows, with The advantage of this approach is that for simulations, an unconstrained unequal-variance RM null model that is already programmed can be easily modified to produce identical-test simulations by altering the values of the null model variance components using Eqs. (16)–(22). 3.5.General Definition of an RM Identical-Test ModelIt follows from Eqs. (12) and (16)–(22) that an unconstrained unequal-variance null RM is an RM identical-test model if it can be expressed by mixed linear model Eq. (1) with andThis result can also be applied to original RM null or constrained unequal-variance RM null models, since they are specific applications of the unconstrained unequal-variance RM null model. In particular, it follows from Eqs. (5), (23), and (24) that a constrained unequal-variance null RM or an original RM null model is an RM identical-test model if it can be expressed by mixed linear model Eq. (1) with and constraint Eq. (24).4.Examples of RM Null and RM Identical-Test ModelsTable 1 illustrates the derivation of several RM identical-test models from RM null models using Eqs. (16)–(22). In row 1 of the table are the parameter values for one of the RM null models proposed by Roe and Metz.5 In row 2 are the variance components for the corresponding RM identical-test model, computed using Eqs. 
(16)–(22). Table 1Examples of RM null models and corresponding RM identical-test models.
Notes: “Const.” = constrained; “Unconst.” = unconstrained; and “var.” = variance; $b = 0.771$ for the constrained unequal-variance RM model, RM model (b). ^a $A_z$ is equal to the median AUC across the reader population; the purpose of the parentheses is to indicate that it is not an RM model parameter used for simulating data, but rather is included to provide additional information about the model. It is computed using $A_z = \Phi\left(\mu_{+}/\sqrt{\sigma_{-}^2 + \sigma_{+}^2}\right)$, where $\sigma_{-}^2 = \sigma_{C(-)}^2 + \sigma_{\tau C(-)}^2 + \sigma_{RC(-)}^2 + \sigma_{\varepsilon(-)}^2 = 1$ and $\sigma_{+}^2 = \sigma_{C(+)}^2 + \sigma_{\tau C(+)}^2 + \sigma_{RC(+)}^2 + \sigma_{\varepsilon(+)}^2$.

Similarly, in row 3 are the parameter values for a constrained unequal-variance null RM model given by Hillis,6 which has the same median AUC and the same variance components for random effects involving nondiseased cases as the original RM null model in row 1, but sets $b = 0.771$ so that the median mean-to-sigma ratio10 will be 4.50. In row 4 are the corresponding identical-test parameter values, derived using Eqs. (16)–(22). Finally, in row 5 is an unconstrained unequal-variance null RM model6 with the corresponding identical-test model variance components, again derived using Eqs. (16)–(22), given in row 6.

5. Review and Comparison of Conventional OR, Unconstrained OR, and Gallas MRMC Methods

5.1. OR Method

The OR method assumes a test × reader factorial ANOVA model for AUC estimates and other reader performance measure estimates resulting from an MRMC study, with each AUC estimate corresponding to one reader using one of several tests (typically an imaging modality). Here we are assuming the study design discussed in the first paragraph of Sec. 1. Unlike a conventional ANOVA model, the errors are assumed to be correlated to account for correlation due to each reader evaluating the same cases. The OR model is given as where is the fixed intercept term, denotes the fixed effect of test , denotes the random effect of reader , denotes the random test × reader interaction, and is the error term.
The and are assumed to be mutually independent and normally distributed with zero means and respective variances and . (We include “OR” in effect and variance component subscripts to distinguish OR effects and variance components from similarly notated RM-model quantities.) The are assumed to be normally distributed with mean zero and variance and are assumed uncorrelated with the and . Three possible error covariances are assumed: The OR model assumes11 The OR model can alternatively be described with population correlations instead of the covariances, i.e., with replaced by , .These error variance-covariance parameters are typically estimated by averaging corresponding fixed-reader estimates computed using the jackknife,12–14 bootstrap,14,15 or the method proposed by DeLong et al.16 (DeLong), with DeLong only for empirical AUC estimates. These three estimation methods are consistent but are not unbiased. An unbiased error covariance method, which we will refer to as the “unbiased” method, was recently proposed by Hillis17 for use when empirical AUC is the outcome. This method utilizes the unbiased method fixed-reader method discussed by Gallas (Ref. 3, p. 362) for estimating the error variance [which Gallas notes is equivalent to the expressions given by Bamber (Ref. 18, p. 402)] and extensions of it for estimating the error covariances. OR analysis using this method is included in the freely available R software package MRMCaov.19 5.2.Conventional OR Test Statistic and Variance EstimateThe conventional OR test statistic for testing the null hypothesis of no test effect () is given as where , , is the number of tests, is the number of readers and and are the and estimates. Here a subscript replaced by a dot indicates the average across the corresponding levels; e.g., . Under , has an approximate distribution with numerator degrees of freedom and denominator degrees of freedom2For tests, Eq. 
(29) can be written in the form where is the OR estimate for the variance of .Note that Eqs. (29)–(32) incorporate the error-covariance constraints given in Eq. (27). We will sometimes refer to these as the “conventional OR” statistics, denominator degrees of freedom estimate and variance estimate, to distinguish them from the unconstrained versions of these statistics discussed below. 5.3.Unconstrained OR Test Statistics and Variance EstimateThe importance of the OR constraints given in Eq. (27) will be demonstrated by simulations in Sec. 6. In this section the OR test statistics and variance estimate are defined without constraints Eq. (27) imposed. Use of these unconstrained test statistics in place of Eqs. (29)–(32) will be called the “unconstrained OR” method. The unconstrained OR test statistics, denominator degrees of freedom and variance are given as and whereNote that and that Eq. (35) is not defined if Eq. (36) is not positive.5.4.Equivalence of Gallas and Unconstrained OR F StatisticsWhen the outcome is the empirical AUC and there are two tests, Hillis17 has shown that the Gallas method statistic for testing the null hypothesis of no difference in test AUCs is equivalent to the unconstrained OR method statistic Eq. (35) when the unbiased covariance estimation method is used to compute and . However, the Gallas denominator degrees of freedom estimate differs from the conventional and unconstrained OR denominator degrees of freedom estimates. 5.5.Relationship of OR Model and RM Identical-Test ModelHillis9 derived the OR parameters for the distribution of empirical AUC estimates simulated using the unconstrained unequal-variance RM model. I show in Appendix A that it follows from these results that for data simulated from the unconstrained unequal-variance RM identical-test model These results are intuitive. 
The first result states that the expected AUCs (as given by for test ) must be the same for each test and the second result states that and must be equal, which makes sense since for equal tests the covariances have the same definition. To understand the third result, we note that it can be shown [Ref. 9, p. 2069] that is equal to half of the variance of the within-reader differences of the expected AUCs; under the assumption of the identical-test RM model, these differences are zero, and hence . 6.Simulation Studies Comparing Conventional and Unconstrained OR Based on RM Null and Identical-Test Models6.1.Simulation Study Using Tables 1(a) and 1(b) RM Null and Identical-Test ModelsMulti-reader rating data for five readers, each reading the same cases under two tests, were simulated based on the original RM null and corresponding RM identical-test models, and on the constrained unequal variance RM null and corresponding RM identical-test models, given in Tables 1(a) and 1(b), respectively. (Results based on the Table 1(c) model are omitted from Table 1 for brevity and because the Table 1(c) RM null model parameter values, unlike the other two RM null model parameter values, have not been previously suggested in the literature.) For each model, 5000 simulated MRMC samples were generated for case sample sizes of 25/25 and 50/50 each, where “25/25” indicates 25 nondiseased and 25 diseased cases. The empirical AUC was computed for each simulated MRMC sample with OR error covariances estimated using the unbiased error-covariance method. The null hypothesis of equal test AUCs, versus the two-sided alternative hypothesis, was tested at the 0.05 significance level using both the conventional and unconstrained OR test statistics, given by Eqs. (31) and (35). Results of the simulations, presented in Table 2, include the empirical type I error rate; the proportion of samples having negative variance estimates, as defined by Eq. (32) or Eq. 
(36); and the proportion of negative values for . Table 2OR analysis results using the unbiased covariance method, for 5 readers reading the same cases under both tests, with 5000 MRMC samples simulated from Table 1 RM null models (a) and (b) and their corresponding identical-test models for each case size combination.
Notes: see Table 1 for definitions of “RM model” and “Model type”; OR F statistic = “Conventional” if the OR constrained F statistic Eq. (32) is used and = “Unconstrained” if the unconstrained F statistic Eq. (33) is used; “25/25” indicates 25 nondiseased and 25 diseased cases; “type I error” is the proportion of samples where the null hypothesis of no test effect is rejected; “NA” stands for “not applicable” and indicates that the empirical type I error rate could not be computed because the variance of the test statistic, computed using Eq. (36), was negative for some samples. Note that although $\widehat{\mathrm{Cov}}_2 - \widehat{\mathrm{Cov}}_3$ is not constrained in this table, in the computation of the conventional OR variance it is constrained to be nonnegative.

If the variance estimate was negative, the type I error rate could not be computed because the test statistic Eq. (31) or Eq. (35), which is required for deciding whether to reject the null hypothesis of equal test AUCs, was not defined for all the simulated samples; this situation is indicated in Table 2 by “NA” (not applicable).

6.1.1. RM null model results

We see from Table 2 that when model type = “null,” the empirical type I error rates are the same for the conventional and unconstrained OR methods, with the type I error rates varying between 0.048 and 0.051. That these rates are the same can be explained by Eq. (37) and by the nonnegativity of $\widehat{\mathrm{Cov}}_2 - \widehat{\mathrm{Cov}}_3$ for all the samples (as indicated in the last column).

6.1.2. Identical-test model results

In contrast to the null model results reported above, we see in Table 2 that the identical-test model type I error rates depend on whether the conventional or unconstrained OR method was used.

Conventional OR results

For the identical-test models the conventional OR type I error rates vary between 0.033 and 0.046 with no negative variance estimates.
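The contrast between the conventional and unconstrained variance estimates can be made concrete with a small numerical sketch. It assumes the standard two-test OR form, in which the variance estimate for the difference of reader-averaged AUCs is $(2/r)\,[\mathrm{MS}(T \times R) + r\,(\widehat{\mathrm{Cov}}_2 - \widehat{\mathrm{Cov}}_3)]$ for $r$ readers, with the covariance difference truncated at zero in the conventional version. Eqs. (32) and (36) are not reproduced in this section, so this form is an assumption, and the function name and input values are hypothetical.

```python
import numpy as np


def or_variance_two_tests(auc, cov2, cov3, constrained=True):
    """Variance estimate for the difference of reader-averaged AUCs.

    `auc` is a (2, readers) array of reader-level AUC estimates.
    Assumes the standard two-test OR form; with constrained=True the
    difference cov2 - cov3 is truncated at zero.
    """
    auc = np.asarray(auc, dtype=float)
    n_tests, n_readers = auc.shape
    # Test-by-reader interaction residuals and mean square.
    resid = (auc
             - auc.mean(axis=1, keepdims=True)
             - auc.mean(axis=0, keepdims=True)
             + auc.mean())
    ms_tr = (resid ** 2).sum() / ((n_tests - 1) * (n_readers - 1))
    diff = cov2 - cov3
    if constrained:
        diff = max(diff, 0.0)
    return (2.0 / n_readers) * (ms_tr + n_readers * diff)


# Illustrative inputs: nearly identical tests, cov2 estimate below cov3.
auc = [[0.80, 0.85, 0.90],
       [0.81, 0.84, 0.90]]
v_conv = or_variance_two_tests(auc, cov2=0.0004, cov3=0.0006)
v_unc = or_variance_two_tests(auc, cov2=0.0004, cov3=0.0006,
                              constrained=False)
```

With these inputs the interaction mean square is small, so the negative covariance difference pushes the unconstrained estimate below zero while the conventional estimate stays positive.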
Unconstrained OR resultsFor the identical-test models, all of the unconstrained OR type I error rates were undefined (as indicated by “NA” in Table 2) because of negative variance estimates. For the original RM identical-test and constrained unequal variance RM identical-test models, respective negative unconstrained OR variance rates were 0.039 and 0.049 for 25/25 samples and 0.026 and 0.031 for 50/50 samples. Note that these negative variance estimate rates apply also to the Gallas statistic, since it is the same as the unconstrained OR statistic, as discussed in Sec. 5.4. 6.2.Simulation Study Using Original Roe and Metz Null Models and Corresponding Identical-Test Null ModelsIn the original Roe and Metz5 paper, four different variance component “structures,” denoted as “HL,” “LL,” “HH,” and “LH,” are given for , 1.5, and 2.50, resulting in twelve different parameter combinations. In the upper half of Table 3 are the four variance component structures for . In the lower half of Table 3 are the corresponding RM identical-test model structures that result from application of Eqs. (16)–(22) to the structures in the upper half of Table 3. (See Table 8 in Appendix B for a similar table that includes all twelve parameter combinations and corresponding RM identical-test parameter specifications.) Table 3Subset of original 12 sets of Roe and Metz5 (RM) null simulation model parameter values and corresponding RM identical-test model parameter values. The Table 4 simulation results are based on these parameter values. The complete set of 12 sets of parameter values is included in Table 8.
Notes: $\mu_{+}$ is the median and mean separation of the normal and abnormal DV distributions across the reader population, and $A_z = \Phi(\mu_{+}/\sqrt{2})$ is the median reader-specific true area under the ROC curve.

For each parameter combination in Table 3, 2000 MRMC samples were simulated for each of 6 combinations of 3 reader levels (3, 5, and 10 readers) and 2 sample size levels (25/25 and 50/50). Each set of 2000 samples was analyzed using the conventional and unconstrained OR methods, using both unbiased and DeLong error-covariance estimates. For each error covariance method and model type (null or identical-test), Table 4 presents the analysis results for each reader and sample size combination, averaged across the four structures in Table 3. For example, the type I error of 0.061 in the first row of Table 4 is the average of four empirical type I error rates, corresponding to the four original RM null model structures, resulting from performing a conventional OR analysis using the DeLong covariance method on each of 2000 simulated MRMC samples for each structure, with each simulated MRMC sample containing rating data from 3 readers reading 25 nondiseased and 25 diseased cases. (For brevity, averages of the four empirical type I error rates are reported rather than the rates for each separate structure, since the averages are sufficient to reveal the problem of negative variances with the unconstrained OR method.)

Table 4 Conventional and unconstrained OR analysis results using the DeLong and unbiased error covariance methods, for MRMC samples simulated from the original RM null model and the corresponding RM identical-test model parameter values given in Table 3. Readers read the same cases under both tests.
For each combination of structure (HL, LL, HH, or LH), error covariance method (DeLong or unbiased), readers (3, 5, or 10) and case sample sizes (25/25 or 50/50), 2000 MRMC samples were simulated and analyzed using both conventional and unconstrained OR with empirical AUC being the outcome. This table presents those analysis results averaged across the four structures. For example, the conventional OR type I error of 0.061 in the first line is the average of the four conventional OR empirical type I error rates computed for the four parameter structures in Table 3, based on 2000 simulated MRMC samples for each structure for 3 readers each reading 25 nondiseased and 25 diseased cases.
Notes: “covariance” = error covariance method used with the OR method; “N” = number of parameter structures in Table 3 that results are averaged across; “25/25” indicates 25 nondiseased and 25 diseased cases; “type I” is the empirical type I error rate; “var < 0” is the proportion of samples where the variance estimate for the difference of the reader-averaged test AUCs is negative, and hence the OR F statistic is not defined; $\mathrm{Cov}_2$, $\mathrm{Cov}_3$, and $\sigma_{\tau R:OR}^2$ are OR parameter estimates (which are the same for the conventional and unconstrained OR methods); AUC1 and AUC2 are the empirical AUCs for tests 1 and 2, respectively.

Results of the simulations, presented in Table 4, include the empirical type I error rate and the negative-variance rate. The negative-variance rate is the proportion of samples having negative variance estimates, as defined by Eq. (32) or Eq. (36), for both the conventional and unconstrained OR methods. As in Table 2, a value of “NA” for the type I error rate indicates at least one sample had a negative variance estimate, and hence an undefined type I error rate. Table 4 also includes the averages of the empirical AUC estimates for tests 1 and 2 and the averages of the OR estimates for $\mathrm{Cov}_2$, $\mathrm{Cov}_3$, and $\sigma_{\tau R:OR}^2$; these last three estimates depend on the OR error covariance method but not on the use of conventional or unconstrained OR. From Table 4, I make the following remarks.
7.Understanding How a Negative Variance OccursWe can rewrite Eq. (36) in the form whereThe term will never be negative because cannot be negative. Thus can be negative only if is sufficiently negative to result in . For the unconstrained unequal-variance RM identical-test model, and have the same distributions; thus and has a symmetric distribution about 0. It follows that will be negative with probability 0.5, which is in agreement with the results in Table 2 where the negative rates are . It has been shown by Hillis,17 under the assumption of the unconstrained unequal-variance RM model, that Here and elsewhere in this section I often express , because it has been shown9 that these correlations remain approximately constant for a given RM model across different reader sample sizes and case sample sizes, making them easy to interpret. To simplify the discussion, I now assume that the estimates are unbiased, which is the case when the unbiased error-covariance method is used with OR. Making this assumption, it follows from Eqs. (41)–(43) that Although Eq. (44) assumes unbiased and estimates, typically we expect the right side of Eq. (44) to approximate the left side when a reasonable alternative error covariance estimation method is used, such as the jackknife or DeLong method. From Eq. (44) it follows that increases as and increase, assuming all other parameters in the model remain the same. Thus, recalling that for an RM identical-test model and , it seems likely that the probability of a negative variance will decrease as or increase. On the other hand, because Eq. (44) does not depend on the difference of the AUCs, as shown by the omission of and , there is no indication that the probability of a negative variance will decrease or increase as the magnitude of the AUC difference increases. 8.Simulation Studies for Examining Effects of and on Negative Variance Rates8.1.PurposeThe simulations in Sec. 
6 established the usefulness of the identical-test RM model for detecting the negative variance problem inherent in using the unconstrained OR procedure. A natural follow-up question to ask is, “to what extent does the unconstrained OR procedure have this problem when the conditions of the identical-test RM model are not exactly satisfied?” The purpose of this section is to empirically address this question by simulating data from RM simulation models that are not identical-test RM models. As discussed in Sec. 5.5, data simulated from an identical-test RM model results in AUC estimates such that three conditions are true: (1) the tests have equal expected AUCs; (2) the OR and parameters are equal, or equivalently, where and are the OR correlations defined by Eq. (28); and (3) the OR test-by-reader interaction variance component is zero. These conditions are implied by Eqs. (38)–(40). In this section I simulate data from RM models that have been formulated such that not all of these conditions are true, and thus none of the simulation models are identical-test RM models. The results of these simulations will allow us to answer the question posed in the previous paragraph, as well as to provide support for the conjectures in Sec. 7 regarding the associations between each of the three conditions and the negative variance rate. 8.2.Simulations8.2.1.OverviewData are simulated that result in OR distributions with parameter values similar to those estimated for two real datasets that are analyzed by the unconstrained OR method with empirical AUC as the outcome and using the unbiased error-covariance method. In each of the 2 examples, 10,000 MRMC samples are simulated from 8 different constrained unequal-variance RM models, with each corresponding empirical AUC distribution corresponding to one of eight possible combinations of 2 different levels for and . The two levels are 0.01 and 0.04 for , 0.0000 and 0.0002 for , and 0.00 and 0.04 for (note that ). 
All of these values are representative of real datasets. The case and reader sample sizes for the simulated MRMC samples are the same as for the original datasets. Although for both of the original datasets, a negative value for is not included as one of the study design parameters because the OR model assumes .

8.2.2. Example: simulations based on Kundel dataset

Kundel et al.20 compared reader AUCs for hard-copy and soft-copy computed radiograph chest images selected randomly from a medical intensive care unit. Four radiologists blindly read both types of images obtained from the same patients. Six months separated the end of the hard-copy readings and the start of the soft-copy readings. A five-point ordinal scale was used to rate the likelihood of the presence of the condition (which we will consider to be the disease) implied by the reason for requesting the corresponding examination. Ninety-five images, consisting of 29 diseased and 66 nondiseased images, were read under each test condition. The difference of the empirical AUC estimates was 0.0375 () and was , computed from a conventional OR analysis using the unbiased covariance method. The unconstrained variance estimate was not negative. The OR parameter estimates for the original data, using the unbiased covariance method, are shown in Table 5(a). In Table 5(b) are eight sets of parameter values similar to the original data estimates, corresponding to the eight possible combinations of the levels of and . Table 5(c) presents constrained unequal-variance RM model parameter values that result in simulated data that can be described by the OR parameters in Table 5(b); these were computed using the algorithm developed by Hillis et al.21 Because some of these RM models are not null models, in Eq.
(1) is replaced by ; thus is the expected difference in the means for the diseased and nondiseased DV distributions for test

Table 5. Simulations based on the Kundel et al.20 dataset showing effects of $r_2 - r_3$, $\sigma^2_{\tau R:OR}$, and $\mathrm{AUC}_1 - \mathrm{AUC}_2$ on negative variance rates. In this study, 4 readers read the same 29 diseased and 66 nondiseased cases.
Note: “(var < 0) rate” = proportion of samples where $\widehat{\mathrm{var}}_{OR;\,unconstrained}(\hat{\theta}_{1\bullet} - \hat{\theta}_{2\bullet}) < 0$.

Table 5(d) presents the OR parameter estimates and the negative variance and negative rates computed from the simulated data, based on the RM models in Table 5(c). The excellent agreement between Tables 5(b) and 5(d) confirms that the RM model parameter values in Table 5(c) were appropriately chosen. In Table 5(d) negative variance rates range between 0.8% and 4.2% and negative rates range between 14% and 41%. Figure 1 displays a plot of the negative variance rate for each combination of the true values of and . The labels on the -axis indicate the levels of and , with “LL” indicating both at the lowest level, “LH” indicating the low level of and the high level of , etc. From Fig. 1, we see that higher negative variance rates are associated with lower levels of both and , in agreement with the conjectures in the previous section. The effect of is minimal except when both and are at their lowest levels, as shown by the first pair of points; for this situation the negative variance rate is higher for the larger magnitude of . As noted in the previous section, there was no indication from Eq. (44) as to whether there would be any effect from .

8.2.3. Example: simulations based on Franken dataset

Franken et al.22 compared the diagnostic accuracy of interpreting clinical neonatal radiographs using a picture archiving and communication system workstation versus plain film. The case sample consisted of 100 chest or abdominal radiographs (67 abnormal and 33 normal). The readers were four radiologists with considerable experience in interpreting neonatal examinations. The readers indicated whether each patient had normal or abnormal findings and their degree of confidence in this judgment using a 5-point ordinal scale. The difference of the empirical AUC estimates was 0.0109 () and was , computed from a conventional OR analysis using the unbiased covariance method.
The unconstrained variance estimate was negative. Table 6 gives results for this dataset in the same format as Table 5. Similar to Table 5, there is excellent agreement between Tables 6(b) and 6(d) that confirms that the RM model parameter values in Table 6(c) were appropriately chosen. In Table 6(d), negative variance rates range between 0.3% and 2.2% and negative rates range between 7% and 36%.

Table 6. Simulations based on the Franken et al.22 dataset showing effects of $r_2 - r_3$, $\sigma^2_{\tau R:OR}$, and $\mathrm{AUC}_1 - \mathrm{AUC}_2$ on negative variance rates. In this study, 4 readers read the same 67 diseased and 33 nondiseased cases.
Note: “(var < 0) rate” = proportion of samples where $\widehat{\mathrm{var}}_{OR;\,unconstrained}(\hat{\theta}_{1\bullet} - \hat{\theta}_{2\bullet}) < 0$.

Figure 2 displays a plot of the negative variance rate for each combination of the true values of and . Similar to Fig. 1, higher negative variance rates are associated with lower levels of both and , in agreement with the conjectures in the previous section. The effect of is minimal, with the negative variance rate slightly higher for the larger magnitude of when is at its low level, as shown by the first two pairs of points.

8.2.4. Summary of simulation results

From the results of the simulations in Secs. 8.2.2 and 8.2.3, we saw that the negative variance problem for the unconstrained OR method is present even when conditions Eqs. (38)–(40), which are implied by the identical-test RM model, do not hold. Moreover, the simulations supported the conjectures given in Sec. 7. Specifically, we saw that while higher negative variance rates were associated with lower levels of both and , there was little association with the magnitude of . Surprisingly, the largest effect of the magnitude of the AUC difference, shown by the first pair of points in Figs. 1 and 2, shows the negative variance rate to be higher for the larger AUC difference magnitude of 0.04. In summary, these results suggest that negative variance estimates can be a problem for the unconstrained OR procedure when and are small, regardless of the difference in the AUCs, with the negative variance rate diminishing with increasing numbers of readers and cases. However, we caution that these findings are based on only two simulation studies and will need to be confirmed by additional studies.

9. Summary and Discussion

Sometimes it is of interest to compare two tests that may be similar in most respects, such as when noninferiority or equivalence testing is appropriate.
For this situation it is important to be able to assess how well a particular MRMC analysis method performs, and hence there is a need for simulation models that emulate this situation. This need was the motivation for developing the RM identical-test model, where the two tests are exactly the same. The derivation of the RM identical-test model from a particular RM null model was straightforward: simply change the test subscript for all of the RM null model effects to 1, which results in none of the test effects depending on test. This derivation was illustrated for the unconstrained unequal-variance RM model,9 which includes the constrained equal-variance6 and original5 RM null models as special cases. It was shown that the null and corresponding identical-test model rating distributions are the same and the within-test covariances are the same, but the between-test covariances for the null model can be less than those for the identical-test model. In terms of the reader empirical AUCs computed from ratings generated from the identical-test model, it was shown that the expected test empirical AUC estimates are equal, and are equal, and . The RM identical-test simulations showed how the performance of the unconstrained OR method is unacceptable because of a nontrivial percentage of negative variance estimates. Because negative estimates can occur, the significance level cannot be estimated unless the action to be taken when a negative estimate occurs has been specified in advance of the analysis and is incorporated into the simulation study. In contrast, the conventional OR method did not have the negative variance problem because its variance estimate can never be negative, and it had an acceptable type I error rate. The original RM null model5 simulations also revealed that the unconstrained OR variance estimate could be negative, but the rates were much smaller than for the identical-test model. 
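The contrast between the two variance estimates can be made concrete with a small numerical sketch. The code below assumes the standard two-test OR form, in which the estimated variance of the difference of the reader-averaged AUCs is (2/r)[MS(T*R) + r(Cov2 − Cov3)], with the conventional method replacing Cov2 − Cov3 by max(Cov2 − Cov3, 0). This form is an assumption here because the corresponding equations are not reproduced in this section; the function names and input values are hypothetical and chosen only for illustration.

```python
def or_variance_of_difference(ms_tr, cov2, cov3, n_readers, constrained=True):
    """Estimated variance of the difference of two reader-averaged AUCs.

    Assumed form (not taken verbatim from this section):
        (2 / r) * [MS(T*R) + r * (Cov2 - Cov3)]
    The conventional (constrained) OR method truncates Cov2 - Cov3 at zero;
    the unconstrained version does not.
    """
    if n_readers < 1:
        raise ValueError("n_readers must be at least 1")
    if ms_tr < 0:
        raise ValueError("a mean square cannot be negative")
    cov_diff = cov2 - cov3
    if constrained:
        cov_diff = max(cov_diff, 0.0)
    return (2.0 / n_readers) * (ms_tr + n_readers * cov_diff)


def negative_rate(variance_estimates):
    """Proportion of variance estimates that are strictly negative."""
    estimates = list(variance_estimates)
    if not estimates:
        raise ValueError("need at least one estimate")
    return sum(v < 0 for v in estimates) / len(estimates)


if __name__ == "__main__":
    # Hypothetical estimates: Cov2 estimate smaller than Cov3 estimate, 4 readers.
    ms_tr, cov2, cov3, r = 0.0002, 0.0003, 0.0005, 4
    v_unc = or_variance_of_difference(ms_tr, cov2, cov3, r, constrained=False)
    v_con = or_variance_of_difference(ms_tr, cov2, cov3, r, constrained=True)
    print("unconstrained:", v_unc)  # negative for these inputs
    print("conventional: ", v_con)  # cannot be negative
    print("negative rate:", negative_rate([v_unc, v_con]))
```

With these inputs the unconstrained estimate is negative while the conventional estimate is not, which is the behavior the simulations quantify. When the Cov2 estimate is at least as large as the Cov3 estimate, the two functions return the same value.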
Of course, in practice we would rarely expect two tests to be identical. But if an analysis method does not perform satisfactorily when two tests are exactly the same, then it seems likely that the performance will also not be acceptable when the tests are “close” to being identical. This situation was illustrated in the simulations in Sec. 8, where RM models were created to result in OR distributions somewhat similar to those for two real datasets. In both of those examples, there were nontrivial rates of negative variance estimates (3.2% and 1.55%) with moderate deviations from an identical-test model with respect to two categories ( and ) and a slight deviation with respect to the other category (). Furthermore, the results of the two examples in Sec. 8 suggest that increasing the AUC difference does not reduce the negative-variance rate; if future research shows this relationship to hold in general, then this result implies that negative variance rates can be nontrivial even when the AUC difference is substantial. Although there has never been any suggestion in the literature that the unconstrained version of OR should be used instead of the conventional version, the findings of this paper are relevant because of the relationship between the unconstrained OR method and the often-used Gallas analysis method. As discussed in Sec. 5, it has recently been shown17 that the Gallas test statistic for comparing two tests is equivalent to the unconstrained OR test statistic when empirical AUC is the outcome and the unbiased error-covariance method is used. Thus we recommend that the Gallas method not be used. For the Gallas method to be a statistically acceptable method, there would have to be a defined follow-up analysis procedure to use if the Gallas variance is negative, as well as simulation studies validating the performance of this two-step approach.
In my opinion, it is much easier to interpret an RM model in terms of the OR parameter values describing the resulting empirical AUC distribution based on data simulated from the model, as opposed to interpreting the RM parameter values in terms of the distribution of the confidence-of-disease ratings. It was shown in Sec. 5.5 that an unconstrained unequal-variance RM identical-test model will have an empirical AUC distribution with (or equivalently, ), no reader-by-test interaction, and equal expected test AUC values. These three OR relationships are intuitively obvious for identical tests and they can be thought of as the criteria by which tests can be identical in terms of the empirical AUC distributions. In contrast, it has been shown [Ref. 9, Tables 4 and 6] that the original5 12 sets of RM model parameter values lead to OR distributions with identical expected test AUC values, but with , and for 10 of the sets, ; for the other 2 sets, . To obtain some perspective on the size of the interaction variance component, we note that implies that the middle 95% probability range is 0.08 for the true difference for a randomly selected reader, as discussed in Hillis and Schartz;23 for this reason, we consider to be at least moderate test-by-reader interaction. Thus, in terms of the 3 OR identical-test criteria, 10 of the 12 original RM parameter sets describe tests that are similar with respect only to the OR equal-test AUC criterion, with the other 2 sets describing tests also approximately similar with respect to the OR criterion. But none of them describes tests that are approximately similar with respect to all three OR criteria. In summary, the RM identical-test model is useful because it allows for assessment of an MRMC analysis method for the situation where the two tests are identical and it is easy to derive from a previously formulated RM null model. Ideally, an MRMC analysis method would be assessed with respect to a wide range of underlying rating models.
Thus the RM identical-test model should typically be used in conjunction with other RM models. For brevity, results of the simulation studies in this paper have been limited to the minimum needed to accomplish the two purposes of the paper: to show how to formulate an identical-test RM model and to show its usefulness for validating the need for the OR error covariance constraints. For example, a more extensive analysis might include estimating the type I error, not just for the two-sided nonequivalence set of hypotheses, but also for the noninferiority and equivalence sets of hypotheses; reporting results in Tables 4 and 5 for each structure instead of averaging across the four structures; and reporting results for more combinations of RM parameter values. Finally, for future research, I recommend creating a new set of RM model parameter sets that correspond to real datasets, as was done in Sec. 8. Doing this will allow for a better understanding of what types of studies are emulated by the simulated data. Recently21 an algorithm has been developed that maps OR parameter estimates obtained from real datasets to constrained unequal-variance RM model parameter values; this algorithm can be easily implemented using the R function OR_to_RMH, available in the R package MRMCaov.19 This algorithm was utilized in Sec. 8 to create the RM parameter values corresponding to the two real datasets.

10. Appendix A Derivation of the Relationships Between the OR Model and Unconstrained Unequal-Variance RM Identical-Test Model Given in Sec. 5.5

The OR parameters that describe the distribution of the empirical AUC estimates computed from MRMC data simulated from the unconstrained unequal-variance RM model have been expressed as functions of the RM model parameters by Hillis.9 Table 7 presents the relationships for the three OR parameters given in Sec. 5.5.
Table 7. The OR Cov2 and Cov3 parameters corresponding to empirical AUC estimates computed from MRMC data simulated from the unconstrained unequal-variance RM model, expressed as functions of the RM model parameters.
Notes: these results are taken from Table 3 in Hillis.9 $F_{BVN}(\cdot,\cdot;\rho)$ is the standardized bivariate normal distribution function with correlation $\rho$; $\delta_i = \mu_+ + \tau_{i+}$; $V = \sigma^2_{fixed(-)} + \sigma^2_{fixed(+)} + 2(\sigma^2_R + \sigma^2_{\tau R})$, where $\sigma^2_{fixed(-)} = \sigma^2_{C(-)} + \sigma^2_{\tau C(-)} + \sigma^2_{RC(-)} + \sigma^2_{\varepsilon(-)}$ and $\sigma^2_{fixed(+)} = \sigma^2_{C(+)} + \sigma^2_{\tau C(+)} + \sigma^2_{RC(+)} + \sigma^2_{\varepsilon(+)}$; $c_1 = 1/(n_0 n_1)$; $c_2 = (n_1 - 1)/(n_0 n_1)$; $c_3 = (n_0 - 1)/(n_0 n_1)$; and $c_4 = (1 - n_0 - n_1)/(n_0 n_1)$.

The unconstrained unequal-variance RM model assumed in Table 7 is the same as the unconstrained unequal-variance RM null model discussed in Sec. 5.5, but with the addition of the test-by-truth interaction effect to the mixed linear model Eq. (1). It follows that the expected difference between the nondiseased and diseased decision-variable distributions is for test 1 and for test 2. Relationships Eqs. (38)–(40) in Sec. 5.5 are for the unconstrained unequal-variance RM identical-test model, which is the same as the model assumed in Table 7 with the following constraints imposed and Relationships Eqs. (38)–(40) follow directly from the results in Table 8 when constraints Eqs. (45) and (46) are imposed. Specifically where and where

Table 8. The 12 original sets of Roe and Metz5 (RM) null simulation model parameter values and corresponding RM identical-test model parameter values. Table 3 is a subset of this table.
Notes: $\mu_+$ is the median and mean separation of the normal and abnormal DV distributions across the reader population, and $A_z = \Phi(\mu_+/\sqrt{2})$ is the median reader-specific true area under the ROC curve.

In addition, we can write and the first term on the right in the Table 7 expression for can be expressed in the form Replacing the first term on the right in the Table 7 expression for by yields

11. Appendix B Original Roe and Metz Null Simulation Model Parameter Values and Corresponding RM Identical-Test Model Parameter Values

For completeness, Table 8 lists the 12 sets of the original Roe and Metz5 (RM) null simulation model parameter values and the corresponding RM identical-test model parameter values. Table 3 is a subset of this table.

Code, Data, and Materials Availability

The two datasets (VanDyke and Kundel) analyzed in this article are publicly available as part of the R package MRMCaov.19 Code for performing the conventional OR analysis using the unbiased, DeLong, or jackknife error covariance method is also included in the MRMCaov package. Although code for performing unconstrained OR analysis is not included in MRMCaov, one can perform the unconstrained OR analysis based on the MRMCaov conventional OR analysis output.

Acknowledgments

This research was supported by the National Institute of Biomedical Imaging and Bioengineering (NIBIB) of the National Institutes of Health (Grant No. R01EB025174). Some of the information presented in this paper was presented in a prior SPIE proceedings paper by the author.24 I thank two reviewers and the editor for their very helpful comments and suggestions. This content is solely the responsibility of the author and does not necessarily represent the official views of the National Institutes of Health.

References

N. A. Obuchowski and H. E. Rockette,
“Hypothesis testing of diagnostic accuracy for multiple readers and multiple tests: an ANOVA approach with dependent observations,” Commun. Stat. - Simul. Comput. 24(2), 285–308 (1995). https://doi.org/10.1080/03610919508813243
S. L. Hillis, “A comparison of denominator degrees of freedom methods for multiple observer ROC analysis,” Stat. Med. 26(3), 596–619 (2007). https://doi.org/10.1002/sim.2532
B. D. Gallas, “One-shot estimate of MRMC variance: AUC,” Acad. Radiol. 13(3), 353–362 (2006). https://doi.org/10.1016/j.acra.2005.11.030
B. D. Gallas et al., “A framework for random-effects ROC analysis: biases with the bootstrap and other variance estimators,” Commun. Stat. - Theory Methods 38(15), 2586–2603 (2009). https://doi.org/10.1080/03610920802610084
C. A. Roe and C. E. Metz, “Dorfman-Berbaum-Metz method for statistical analysis of multireader, multimodality receiver operating characteristic data: validation with computer simulation,” Acad. Radiol. 4(4), 298–303 (1997). https://doi.org/10.1016/S1076-6332(97)80032-3
S. L. Hillis, “Simulation of unequal-variance binormal multireader ROC decision data: an extension of the Roe and Metz simulation model,” Acad. Radiol. 19(12), 1518–1528 (2012). https://doi.org/10.1016/j.acra.2012.09.011
C. K. Abbey, F. W. Samuelson and B. D. Gallas, “Statistical power considerations for a utility endpoint in observer performance studies,” Acad. Radiol. 20(7), 798–806 (2013). https://doi.org/10.1016/j.acra.2013.02.008
B. D. Gallas and S. L. Hillis, “Generalized Roe and Metz receiver operating characteristic model: analytic link between simulated decision scores and empirical AUC variances and covariances,” J. Med. Imaging 1(3), 031006 (2014). https://doi.org/10.1117/1.JMI.1.3.031006
S. L. Hillis, “Relationship between Roe and Metz simulation model for multireader diagnostic data and Obuchowski-Rockette model parameters,” Stat. Med. 37(13), 2067–2093 (2018). https://doi.org/10.1002/sim.7616
S. L. Hillis and K. S. Berbaum, “Using the mean-to-sigma ratio as a measure of the improperness of binormal ROC curves,” Acad. Radiol. 18(2), 143–154 (2011). https://doi.org/10.1016/j.acra.2010.09.002
S. L. Hillis, “A marginal-mean ANOVA approach for analyzing multireader multicase radiological imaging data,” Stat. Med. 33(2), 330–360 (2014). https://doi.org/10.1002/sim.5926
M. H. Quenouille, “Approximate tests of correlation in time series,” J. R. Stat. Soc. Ser. B 11, 68–84 (1949). https://doi.org/10.1111/j.2517-6161.1949.tb00023.x
J. Shao and D. Tu, The Jackknife and Bootstrap, Springer-Verlag, New York (1995).
B. Efron, The Jackknife, the Bootstrap and Other Resampling Plans, SIAM, Philadelphia (1982).
B. Efron and R. Tibshirani, An Introduction to the Bootstrap, Chapman and Hall, New York (1993).
E. R. DeLong, D. M. DeLong and D. L. Clarke-Pearson, “Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach,” Biometrics 44(3), 837–845 (1988). https://doi.org/10.2307/2531595
S. L. Hillis, “Relationship between Obuchowski–Rockette–Hillis and Gallas methods for analyzing multi-reader diagnostic imaging data with empirical AUC as the reader performance measure,” Biostat. Epidemiol., 1–38 (2022). https://doi.org/10.1080/24709360.2022.2062115
D. Bamber, “Area above ordinal dominance graph and area below receiver operating characteristic graph,” J. Math. Psychol. 12(4), 387–415 (1975). https://doi.org/10.1016/0022-2496(75)90001-2
B. J. Smith, S. L. Hillis and L. L. Pesce, “MRMCaov: multi-reader multi-case analysis of variance,” https://cran.r-project.org/package=MRMCaov (2023).
H. Kundel et al., “Accuracy of bedside chest hard-copy screen-film versus hard- and soft-copy computed radiographs in a medical intensive care unit: receiver operating characteristic analysis,” Radiology 205(3), 859–863 (1997). https://doi.org/10.1148/radiology.205.3.9393548
S. L. Hillis, B. J. Smith and W. Chen, “Determining Roe and Metz model parameters for simulating multireader multicase confidence-of-disease rating data based on real-data or conjectured Obuchowski–Rockette parameter estimates,” J. Med. Imaging 9(4), 045501 (2022). https://doi.org/10.1117/1.JMI.9.4.045501
E. A. Franken Jr. et al., “Evaluation of a digital workstation for interpreting neonatal examinations: a receiver operating characteristic study,” Invest. Radiol. 27(9), 732–737 (1992). https://doi.org/10.1097/00004424-199209000-00016
S. L. Hillis and K. M. Schartz, “Multireader sample size program for diagnostic studies: demonstration and methodology,” J. Med. Imaging 5(4), 045503 (2018). https://doi.org/10.1117/1.JMI.5.4.045503
S. L. Hillis, “Identical-test Roe and Metz simulation model for validating multi-reader methods of analysis for comparing different radiologic imaging modalities,” Proc. SPIE 12035, 120350E (2022). https://doi.org/10.1117/12.2612691
Biography

Stephen L. Hillis received his PhD in statistics in 1987 and an MFA in music in 1978, both from the University of Iowa. Currently, he is working as a research professor in the Departments of Radiology and Biostatistics at the University of Iowa. He is the author of more than 100 peer-reviewed journal articles and four book chapters. Since 1998, his research has focused on methodology for multireader diagnostic radiologic imaging studies.