In psychometric practice, the parameter estimates of a standard item response theory (IRT) model can become biased when item-response data (persons’ individual responses to test items) contain outliers relative to the model. Moreover, the manual removal of outliers can be a time-consuming and difficult task, and removing outliers discards information that could be used in parameter estimation. To address these concerns, a Bayesian IRT model that includes person and latent item-response outlier parameters, in addition to person ability and item parameters, is proposed and illustrated. The model is defined by item characteristic curves (ICCs) that are each specified by a robust, Student’s t-distribution function. The outlier parameters and the robust ICCs enable the model to automatically identify item-response outliers and to make the estimates of the person ability and item parameters more robust to outliers. Hence, under this IRT model, it is unnecessary to remove outliers from the data analysis. Our IRT model is illustrated through the analysis of two data sets, involving dichotomous- and polytomous-response items, respectively.
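The robust-ICC idea can be sketched numerically: replacing the usual logistic or normal-ogive link with a Student's t CDF gives the curve heavier tails, so aberrant responses exert less pull on estimation. The parameterization below, P(theta) = T_nu(a * (theta - b)), is an illustrative assumption rather than the authors' exact specification, and the CDF is computed by simple numerical integration.

```python
import math

def t_pdf(x, nu):
    """Density of Student's t with nu degrees of freedom."""
    c = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return c * (1 + x * x / nu) ** (-(nu + 1) / 2)

def t_cdf(x, nu, lo=-40.0, n=4000):
    """CDF via Simpson's rule on [lo, x]; accurate enough for illustration."""
    if x <= lo:
        return 0.0
    h = (x - lo) / n
    s = t_pdf(lo, nu) + t_pdf(x, nu)
    for i in range(1, n):
        s += (4 if i % 2 else 2) * t_pdf(lo + i * h, nu)
    return s * h / 3

def robust_icc(theta, a, b, nu):
    """Robust item characteristic curve: a heavy-tailed t link in place of
    the logistic/normal ogive."""
    return t_cdf(a * (theta - b), nu)
```

As with any symmetric link, the curve passes through 0.5 at theta == b; the smaller nu is, the heavier the tails and the less influence an extreme response has.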
Test items scored as polytomous have the potential to display multidimensionality across rating scale score categories. This article uses a multidimensional nominal response model (MNRM) to examine the possibility that the proficiency dimension/dimensional composite best measured by a polytomously scored item may vary by score category, an issue not generally considered in multidimensional item response theory (MIRT). Some practical considerations in exploring rubric-related multidimensionality, including potential consequences of not attending to it, are illustrated through simulation examples. A real-data application to the study of item format effects uses the 2007 administration of the Trends in International Mathematics and Science Study (TIMSS) among eighth graders in the United States.
This article examines the interdependency of two context effects that are known to occur regularly in large-scale assessments: item position effects and effects of test-taking effort on the probability of correctly answering an item. A microlongitudinal design was used to measure test-taking effort over the course of a 60-min large-scale assessment. Two components of test-taking effort were investigated: initial effort and change in effort. Both components of test-taking effort significantly affected the probability of solving an item. In addition, it was found that participants’ current test-taking effort diminished considerably across the course of the test. Furthermore, a substantial linear position effect was found, which indicated that item difficulty increased during the test. This position effect varied considerably across persons. Concerning the interplay of position effects and test-taking effort, it was found that only the change in effort moderated the position effect and that persons differed with respect to this moderation effect. The consequences of these results for the reliability and validity of large-scale assessments are discussed.
The Model With Internal Restrictions on Item Difficulty (MIRID; Butter, 1994) has been useful for investigating cognitive behavior in terms of the processes that lead to that behavior. The main objective of the MIRID model is to enable one to test how component processes influence the complex cognitive behavior in terms of the item parameters. The original MIRID model is, however, a fairly restricted model for a number of reasons. One of these restrictions is that the model treats items as fixed and does not fit measurement contexts where the concept of random items is needed. In this article, random item approaches to the MIRID model are proposed, and both simulation and empirical studies to test and illustrate the random item MIRID models are conducted. The simulation and empirical studies show that the random item MIRID models provide more accurate estimates when substantial random errors exist, and thus these models may be more beneficial.
The assumption of local independence is central to all item response theory (IRT) models. Violations can lead to inflated estimates of reliability and problems with construct validity. For the most widely used fit statistic, Q3, there are currently no well-documented suggestions of the critical values that should be used to indicate local dependence (LD), and for this reason, a variety of arbitrary rules of thumb are used. In this study, an empirical data example and Monte Carlo simulation were used to investigate the different factors that can influence the null distribution of residual correlations, with the objective of proposing guidelines that researchers and practitioners can follow when making decisions about LD during scale development and validation. It is recommended that a parametric bootstrapping procedure be implemented in each separate situation to obtain the critical value of LD applicable to the data set, and example critical values are provided for a number of data structure situations. The results show that for the Q3 fit statistic, no single critical value is appropriate for all situations, as the percentiles in the empirical null distribution are influenced by the number of items, the sample size, and the number of response categories. Furthermore, the results show that LD should be considered relative to the average observed residual correlation, rather than to a uniform value, as this results in more stable percentiles for the null distribution of an adjusted fit statistic.
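As a rough sketch of the procedure described above: Q3 correlates the model residuals of two items across persons, and a parametric bootstrap under an LD-free model yields a data-set-specific critical value. The Rasch setup with known abilities and difficulties below is a simplifying assumption for illustration, not the study's full estimation pipeline.

```python
import math
import random

def p_rasch(theta, b):
    """Rasch probability of a correct response."""
    return 1 / (1 + math.exp(-(theta - b)))

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (c - my) for a, c in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((c - my) ** 2 for c in y)
    return sxy / math.sqrt(sxx * syy)

def q3(data, thetas, bs, i, j):
    """Q3 = correlation of model residuals for items i and j."""
    e_i = [row[i] - p_rasch(t, bs[i]) for row, t in zip(data, thetas)]
    e_j = [row[j] - p_rasch(t, bs[j]) for row, t in zip(data, thetas)]
    return pearson(e_i, e_j)

def bootstrap_critical(thetas, bs, i, j, reps=200, alpha=0.05, rng=None):
    """Parametric bootstrap: simulate LD-free data under the fitted model and
    take the upper (1 - alpha) percentile of the null Q3 values."""
    rng = rng or random.Random(1)
    vals = []
    for _ in range(reps):
        data = [[int(rng.random() < p_rasch(t, b)) for b in bs] for t in thetas]
        vals.append(q3(data, thetas, bs, i, j))
    vals.sort()
    return vals[math.ceil((1 - alpha) * reps) - 1]

rng = random.Random(7)
thetas = [rng.gauss(0, 1) for _ in range(300)]
bs = [-1.0, -0.3, 0.4, 1.1]
crit = bootstrap_critical(thetas, bs, i=0, j=1)
```

An observed Q3 for the item pair would then be flagged as LD only if it exceeds `crit`, a threshold specific to this test length, sample size, and response format.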
Concurrent calibration using anchor items has proven to be an effective alternative to separate calibration and linking for developing large item banks, which are needed to support continuous testing. In principle, anchor-item designs and estimation methods that have proven effective with dominance item response theory (IRT) models, such as the 3PL model, should also lead to accurate parameter recovery with ideal point IRT models, but surprisingly little research has been devoted to this issue. This study, therefore, had two purposes: (a) to develop software for concurrent calibration with what is now the most widely used ideal point model, the generalized graded unfolding model (GGUM); and (b) to compare the efficacy of different GGUM anchor-item designs and develop empirically based guidelines for practitioners. A Monte Carlo study was conducted to compare the efficacy of three anchor-item designs in vertical and horizontal linking scenarios. The authors found that a block-interlaced design provided the best parameter recovery in nearly all conditions. The implications of these findings for concurrent calibration with the GGUM and practical recommendations for pretest designs involving ideal point computer adaptive testing (CAT) applications are discussed.
Forced-choice item response theory (IRT) models are being more widely used as a way of reducing response biases in noncognitive research and operational testing contexts. As applications have increased, there has been a growing need for methods to link parameters estimated in different examinee groups as a prelude to measurement equivalence testing. This study compared four linking methods for the Zinnes and Griggs (ZG) pairwise preference ideal point model. A Monte Carlo simulation compared test characteristic curve (TCC) linking, item characteristic curve (ICC) linking, mean/mean (M/M) linking, and mean/sigma (M/S) linking. The results indicated that ICC linking and the simpler M/M and M/S methods performed better than TCC linking, and there were no substantial differences among the top three approaches. In addition, in the absence of possible contamination of the common (anchor) item subset due to differential item functioning, five items should be adequate for estimating the metric transformation coefficients. Our article presents the necessary equations for ZG linking and provides recommendations for practitioners who may be interested in developing and using pairwise preference measures for research and selection purposes.
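The M/M and M/S methods compared above both estimate a linear metric transformation theta* = A*theta + B from the anchor items' parameter estimates in the two groups. A minimal sketch with generic parameter names and the usual dominance-model conventions (the ZG model's own linking equations are not reproduced here):

```python
import statistics

def mean_sigma(b_ref, b_foc):
    """M/S linking: match the mean and SD of the anchor item difficulties."""
    A = statistics.pstdev(b_ref) / statistics.pstdev(b_foc)
    B = statistics.mean(b_ref) - A * statistics.mean(b_foc)
    return A, B

def mean_mean(a_ref, a_foc, b_ref, b_foc):
    """M/M linking: slope from mean discriminations, intercept from mean
    difficulties (focal scale transformed onto the reference scale)."""
    A = statistics.mean(a_foc) / statistics.mean(a_ref)
    B = statistics.mean(b_ref) - A * statistics.mean(b_foc)
    return A, B

# Error-free anchors built from a known transformation are recovered exactly:
# b_foc = (b_ref - B) / A and a_foc = a_ref * A.
true_A, true_B = 1.25, -0.40
b_ref = [-1.2, -0.5, 0.1, 0.8, 1.4]
a_ref = [0.9, 1.1, 1.3, 0.8, 1.0]
b_foc = [(b - true_B) / true_A for b in b_ref]
a_foc = [a * true_A for a in a_ref]
A_ms, B_ms = mean_sigma(b_ref, b_foc)
A_mm, B_mm = mean_mean(a_ref, a_foc, b_ref, b_foc)
```

With sampling error or DIF contamination in the anchors, the two methods diverge, which is what motivates the comparison in the study; five clean anchors sufficed in the reported simulations.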
Constructed-response items are commonly used in educational and psychological testing, and the answers to those items are typically scored by human raters. In current rater monitoring processes, validity scoring is used to ensure that the scores assigned by raters do not deviate severely from the standards of rating quality. In this article, an adaptive rater monitoring approach that may potentially improve the efficiency of current rater monitoring practice is proposed. Based on the Rasch partial credit model and known developments in multidimensional computerized adaptive testing, two essay selection methods—namely, the D-optimal method and the single Fisher information method—are proposed. These two methods intend to select the most appropriate essays based on what is already known about a rater’s performance. Simulation studies, using a simulated essay bank and a cloned real essay bank, show that the proposed adaptive rater monitoring methods can recover rater parameters with far fewer essays. Finally, future challenges and potential solutions are discussed.
Although person-fit analysis has a long-standing tradition within item response theory, it has been applied in combination with dominance response models almost exclusively. In this article, a popular log likelihood-based parametric person-fit statistic under the framework of the generalized graded unfolding model is used. Results from a simulation study indicate that the person-fit statistic performed relatively well in detecting midpoint response style patterns and not so well in detecting extreme response style patterns.
It is common to encounter polytomous and nominal responses with latent variables in social or behavior research, and a variety of polytomous and nominal item response theory (IRT) models are available for applied researchers across diverse settings. With its flexibility and scalability, the Bayesian approach using the Markov chain Monte Carlo (MCMC) method demonstrates great advantages for polytomous and nominal IRT models. However, the potential of the Bayesian approach cannot be fully realized without model formulations that cover various models and effective fit measures for model assessment or criticism. This research first provided formulations for typical models that are representative of different modeling groups. Then, a series of discrepancy measures that can offer diagnostic information for model-data misfit were introduced. Simulation studies showed that the formulations worked as expected and that some of the fit measures were more useful than others, with their usefulness varying across situations.
The logistic regression (LR) procedure for testing differential item functioning (DIF) typically depends on asymptotic sampling distributions. The likelihood ratio test (LRT) usually relies on the asymptotic chi-square distribution. Also, the Wald test is typically based on the asymptotic normality of the maximum likelihood (ML) estimation, and the Wald statistic is tested using the asymptotic chi-square distribution. However, in small samples, the asymptotic assumptions may not work well. The penalized maximum likelihood (PML) estimation removes the first-order finite sample bias from the ML estimation, and the bootstrap method constructs the empirical sampling distribution. This study compares the performances of the LR procedures based on the LRT, Wald test, penalized likelihood ratio test (PLRT), and bootstrap likelihood ratio test (BLRT) in terms of statistical power and Type I error for testing uniform and non-uniform DIF. The result of the simulation study shows that the LRT with the asymptotic chi-square distribution works well even in small samples.
Methods for testing differential item functioning (DIF) require that the reference and focal groups are linked on a common scale using group-invariant anchor items. Several anchor-selection strategies have been introduced in an item response theory framework. However, popular strategies often utilize likelihood ratio testing with all-others-as-anchors that requires multiple model fittings. The current study explored alternative anchor-selection strategies based on a modified version of the Wald χ2 test that is implemented in flexMIRT and IRTPRO, and made comparisons with methods based on the popular likelihood ratio test. Accuracies of anchor identification of four different strategies (two testing methods combined with two selection criteria), along with the power and Type I error associated with respective follow-up DIF tests, are presented. Implications for applied researchers and suggestions for future research are discussed.
Cognitive diagnostic computerized adaptive testing (CD-CAT) can be divided into two broad categories: (a) single-purpose tests, which are based on the subject’s knowledge state (KS) alone, and (b) dual-purpose tests, which are based on both the subject’s KS and traditional ability level (θ).
Test fraud has recently received increased attention in the field of educational testing, and the use of comprehensive integrity analysis after test administration is recommended for investigating different types of potential test fraud. One type of test fraud involves answer copying between two examinees, and numerous statistical methods have been proposed in the literature to screen and identify unusual response similarity or irregular response patterns on multiple-choice tests. The current study examined the classification performance of answer-copying indices, measured by the area under the receiver operating characteristic (ROC) curve, under different item response theory (IRT) models (one- [1PL], two- [2PL], and three-parameter [3PL] models, and the nominal response model [NRM]) using both simulated and real response vectors. The results indicated that although there was a slight increase in performance for the low-copying condition (20%) when nominal response outcomes were used, the indices performed similarly for the 40% and 60% copying conditions when dichotomous response outcomes were utilized. The results also indicated that the performance with simulated response vectors was almost identically reproducible with real response vectors.
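The ROC-based evaluation used above reduces to the area under the curve (AUC), which by the rank (Mann-Whitney) identity equals the probability that a randomly chosen copying pair receives a higher index value than a randomly chosen noncopying pair. A minimal sketch with hypothetical index scores:

```python
def auc(copier_scores, noncopier_scores):
    """AUC via the Mann-Whitney identity: the proportion of (copier,
    noncopier) pairs where the copier's index is higher; ties count half."""
    wins = ties = 0
    for c in copier_scores:
        for n in noncopier_scores:
            if c > n:
                wins += 1
            elif c == n:
                ties += 1
    return (wins + ties / 2) / (len(copier_scores) * len(noncopier_scores))
```

An AUC of 1.0 means the index separates copiers from noncopiers perfectly; 0.5 means it performs no better than chance.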
Cognitive diagnostic computerized adaptive testing (CD-CAT) purports to obtain useful diagnostic information with the great efficiency brought by CAT technology. Most of the existing CD-CAT item selection algorithms are evaluated when test length is fixed and relatively long, but some applications of CD-CAT, such as in interim assessment, require obtaining the cognitive pattern with a short test. The mutual information (MI) algorithm proposed by Wang is the first endeavor to accommodate this need. To reduce the computational burden, Wang provided a simplified scheme, but at the price of a scale/sign change in the original index. As a result, it is very difficult to combine it with some popular constraint management methods. The current study proposes two high-efficiency algorithms, the posterior-weighted cognitive diagnostic model (CDM) discrimination index (PWCDI) and the posterior-weighted attribute-level CDM discrimination index (PWACDI), by modifying the CDM discrimination index. They can be considered an extension of the Kullback–Leibler (KL) and posterior-weighted KL (PWKL) methods. A pre-calculation strategy has also been developed to address the computational issue. Simulation studies indicate that the newly developed methods can produce results comparable with or better than the MI and PWKL in both short and long tests. The other major advantage is that the computational issue has been addressed more elegantly than in MI: PWCDI and PWACDI can run as fast as PWKL. More importantly, they do not suffer from the problem of scale/sign change as MI does and, thus, can be used with constraint management methods in a straightforward manner.
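For context, the PWKL baseline that PWCDI and PWACDI extend can be sketched compactly: an item's index is the Kullback-Leibler divergence between its response distribution at the current attribute-pattern estimate and at each candidate pattern, weighted by the posterior over patterns. The two-attribute, DINA-style success probabilities below are hypothetical.

```python
import math

def kl_bern(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def pwkl(p_item, posterior, alpha_hat):
    """Posterior-weighted KL index for one item.
    p_item: dict pattern -> success probability; posterior: dict pattern -> mass."""
    p_hat = p_item[alpha_hat]
    return sum(w * kl_bern(p_hat, p_item[a]) for a, w in posterior.items())

# Hypothetical item that discriminates pattern (1, 1) from all others.
p_item = {(0, 0): 0.2, (0, 1): 0.2, (1, 0): 0.2, (1, 1): 0.8}
posterior = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.3, (1, 1): 0.3}
index = pwkl(p_item, posterior, alpha_hat=(1, 1))
```

The item with the largest index is administered next; the proposed PWCDI/PWACDI replace the KL kernel with a (pre-computable) CDM discrimination index, which is what avoids MI's scale/sign problem.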
Clinical psychologists are advised to assess clinical and statistical significance when assessing change in individual patients. Individual change assessment can be conducted using either the methodologies of classical test theory (CTT) or item response theory (IRT). Researchers have been optimistic about the possible advantages of using IRT rather than CTT in change assessment. However, little empirical evidence is available to support the alleged superiority of IRT in the context of individual change assessment. In this study, the authors compared the CTT and IRT methods with respect to their Type I error and detection rates. Preliminary results revealed that IRT is indeed superior to CTT in individual change detection, provided that the tests consist of at least 20 items. For shorter tests, however, CTT is generally better at correctly detecting change in individuals. The results and their implications are discussed.
Researchers are commonly interested in group comparisons such as comparisons of group means, called impact, or comparisons of individual scores across groups. A meaningful comparison can be made between the groups when there is no differential item functioning (DIF) or differential test functioning (DTF). During the past three decades, much progress has been made in detecting DIF and DTF. However, little research has been conducted on what researchers can do after such detection. This study presents and evaluates a confirmatory multigroup multidimensional item response model to obtain the purified item parameter estimates, person scores, and impact estimates on the primary dimension, controlling for the secondary dimension due to DIF. In addition, the item response model approach was compared with current practices of DIF treatment such as deleting and ignoring DIF items and using multigroup item response models through simulation studies. The authors suggested guidelines for DIF treatment based on the simulation study results.
Attitude surveys are widely used in the social sciences. It has been argued that the underlying response process to attitude items may be more aligned with the ideal-point (unfolding) process than with the cumulative (dominance) process, and therefore, unfolding item response theory (IRT) models are more appropriate than dominance IRT models for these surveys. Missing data and don’t know (DK) responses are common in attitude surveys, and they may not be ignorable in the likelihood for parameter estimation. Existing unfolding IRT models often treat missing data or DK as missing at random. In this study, a new class of unfolding IRT models for nonignorable missing data and DK was developed, in which the missingness and DK were assumed to measure a hierarchy of latent traits, which may be correlated with the latent attitude that the test is intended to measure. The Bayesian approach with Markov chain Monte Carlo methods was used to estimate the parameters of the new models. Simulation studies demonstrated that the parameters were recovered fairly well, and ignoring nonignorable missingness or DK resulted in poor parameter estimates. An empirical example of a religious belief scale about health was given.
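The dominance-versus-unfolding contrast can be made concrete with a toy single-peaked response curve (not the GGUM or the models proposed here, and with the maximum arbitrarily scaled to 1): agreement is most likely when the person's position is near the item's location and falls off in both directions, unlike a monotone dominance ICC.

```python
import math

def unfolding_prob(theta, delta, tau=1.0):
    """Toy ideal-point response curve: a Gaussian-shaped function of the
    distance between person location theta and item location delta."""
    return math.exp(-((theta - delta) ** 2) / (2 * tau ** 2))
```

A person far above an item's location disagrees with it just as a person far below does ("disagreeing from above"), which is the defining feature a dominance model cannot capture.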
A recent article in this journal addressed the choice between specialized heuristics and mixed-integer programming (MIP) solvers for automated test assembly. This reaction is to comment on the mischaracterization of the general nature of MIP solvers in this article, highlight the quite inefficient modeling of the test-assembly problems used in its empirical examples, and counter these examples by presenting the MIP solutions for a set of 35 real-world multiple-form assembly problems.
Forced-choice questionnaires have been proposed as a way to control some response biases associated with traditional questionnaire formats (e.g., Likert-type scales). Whereas classical scoring methods have issues of ipsativity, item response theory (IRT) methods have been claimed to accurately account for the latent trait structure of these instruments. In this article, the authors propose the multi-unidimensional pairwise preference two-parameter logistic (MUPP-2PL) model, a variant within Stark, Chernyshenko, and Drasgow’s MUPP framework for items that are assumed to fit a dominance model. They also introduce a Markov chain Monte Carlo (MCMC) procedure for estimating the model’s parameters. The authors present the results of a simulation study, which shows appropriate goodness of recovery in all studied conditions. A comparison of the newly proposed model with the Thurstonian IRT model of Brown and Maydeu-Olivares led to the conclusion that both models are theoretically very similar and that the Bayesian estimation procedure of the MUPP-2PL may provide a slightly better recovery of the latent space correlations and a more reliable assessment of the latent trait estimation errors. An application of the model to a real data set shows convergence between the two estimation procedures. However, there is also evidence that the MCMC may be advantageous regarding the item parameters and the latent trait correlations.
In recent years, there has been a surge of interest in measuring noncognitive constructs in educational and managerial/organizational settings. For the most part, these noncognitive constructs have been and continue to be measured using Likert-type (ordinal response) scales, which are susceptible to several types of response distortion. To deal with these response biases, researchers have proposed using forced-choice format, which requires respondents or raters to evaluate cognitive, affective, or behavioral descriptors presented in blocks of two or more. The workhorse for this measurement endeavor is the item response theory (IRT) model developed by Zinnes and Griggs (Z-G), which was first used as the basis for a computerized adaptive rating scale (CARS), and then extended by many organizational scientists. However, applications of the Z-G model outside of organizational contexts have been limited, primarily due to the lack of publicly available software for parameter estimation. This research effort addressed that need by developing a Markov chain Monte Carlo (MCMC) estimation program, called MCMC Z-G, which uses a Metropolis-Hastings-within-Gibbs algorithm to simultaneously estimate Z-G item and person parameters. This publicly available computer program MCMC Z-G can run on both Mac OS® and Windows® platforms.
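A generic Metropolis-Hastings update of the kind used inside such a Gibbs sampler can be sketched as follows. The Rasch likelihood and standard normal prior below stand in for the Z-G pairwise preference likelihood purely for illustration; only the accept/reject mechanics carry over.

```python
import math
import random

def loglik(theta, responses, bs):
    """Bernoulli log-likelihood under a simple Rasch model (an illustrative
    stand-in for the Z-G likelihood)."""
    ll = 0.0
    for x, b in zip(responses, bs):
        p = 1 / (1 + math.exp(-(theta - b)))
        ll += math.log(p if x else 1 - p)
    return ll

def mh_step(theta, responses, bs, rng, step=0.5):
    """One Metropolis-Hastings update: normal random-walk proposal, standard
    normal prior on theta, symmetric proposal so no correction term."""
    prop = theta + rng.gauss(0, step)
    log_ratio = (loglik(prop, responses, bs) - prop ** 2 / 2) \
              - (loglik(theta, responses, bs) - theta ** 2 / 2)
    return prop if math.log(rng.random() + 1e-300) < log_ratio else theta

rng = random.Random(42)
responses = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
bs = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, -0.8, 0.3, -0.2, 0.7]
chain = [0.0]
for _ in range(2000):
    chain.append(mh_step(chain[-1], responses, bs, rng))
post_mean = sum(chain[500:]) / len(chain[500:])  # discard burn-in
```

In a within-Gibbs scheme, an update of this form cycles over every person and item parameter in turn, conditioning on the current values of all the others.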
A simulation study was conducted to investigate the efficacy of multiple indicators multiple causes (MIMIC) methods for multi-group uniform and non-uniform differential item functioning (DIF) detection. DIF was simulated to originate from one or more sources involving combinations of two background variables, gender and ethnicity. Three implementations of MIMIC DIF methods were compared: constrained baseline, free baseline, and a new sequential-free baseline. When the MIMIC assumption of equal factor variance across comparison groups was satisfied, the sequential-free baseline method provided excellent Type I error and power, with results similar to an idealized free baseline method that used a designated DIF-free anchor, and results much better than a constrained baseline method, which used all items other than the studied item as an anchor. However, when the equal factor variance assumption was violated, all methods showed inflated Type I error. Finally, despite the efficacy of the two free baseline methods for detecting DIF, identifying the source(s) of DIF was problematic, especially when background variables interacted.
Item response theory (IRT) models provide an appropriate alternative to the classical ordinal confirmatory factor analysis (CFA) during the development of patient-reported outcome measures (PROMs). Current literature has identified the assessment of IRT model fit as both challenging and underdeveloped. This study evaluates the performance of Ordinal Bayesian Instrument Development (OBID), a Bayesian IRT model with a probit link function approach, through applications in two breast cancer-related instrument development studies. The primary focus is to investigate an appropriate method for comparing Bayesian IRT models in PROMs development. An exact Bayesian leave-one-out cross-validation (LOO-CV) approach is implemented to assess prior selection for the item discrimination parameter in the IRT model and subject content experts’ bias (in a statistical sense and not to be confused with psychometric bias as in differential item functioning) toward the estimation of item-to-domain correlations. Results support the utilization of content subject experts’ information in establishing evidence for construct validity when sample size is small. However, the incorporation of subject experts’ content information in the OBID approach can be sensitive to the level of expertise of the recruited experts. More stringent efforts need to be invested in the appropriate selection of subject experts to efficiently use the OBID approach and reduce potential bias during PROMs development.
Even in the age of abundant and fast computing resources, concurrency requirements for large-scale online testing programs still put an uninterrupted delivery of computer-adaptive tests at risk. In this study, to increase the concurrency for operational programs that use the shadow-test approach to adaptive testing, we explored various strategies aimed at reducing the number of reassembled shadow tests without compromising measurement quality. Strategies requiring a fixed interval between reassemblies, a certain minimal change in the interim ability estimate since the last assembly before triggering a reassembly, or a hybrid of the two yielded substantial reductions in the number of reassemblies without degradation in measurement accuracy. The strategies effectively prevented unnecessary reassemblies due to adapting to the noise in the early test stages. They also highlighted the practicality of the shadow-test approach by minimizing the computational load involved in its use of mixed-integer programming.
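The three triggering strategies can be sketched as simple decision rules. The threshold values and the reading of the hybrid as requiring both conditions are assumptions for illustration, not the study's exact implementation.

```python
def fixed_interval(pos, last_pos, k=3):
    """Reassemble only after at least k items since the last assembly."""
    return pos - last_pos >= k

def min_change(theta, theta_last, delta=0.3):
    """Reassemble only if the interim ability estimate has moved by >= delta
    since the last assembly."""
    return abs(theta - theta_last) >= delta

def hybrid(pos, last_pos, theta, theta_last, k=3, delta=0.3):
    """Hybrid rule: require both the interval and the minimal-change
    condition before triggering a reassembly."""
    return fixed_interval(pos, last_pos, k) and min_change(theta, theta_last, delta)
```

Early in the test, interim ability estimates are noisy, so rules like these suppress reassemblies that would merely chase noise, which is the effect the study reports.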
In generalizability theory (G theory), one-facet models are specified to be additive, which is equivalent to the assumption that subject-by-facet interaction effects are absent. In this article, the authors first derive estimators of variance components (VCs) for nonadditive models and show that, in some cases, they are different from their counterparts in additive models. The authors then demonstrate and later confirm with a simulation study that when the subject-by-facet interaction exists, but the additive-model formulas are used, the VC of subjects is underestimated. Consequently, generalizability coefficients are also underestimated. Thus, depending on the nature of interaction effects, an appropriate model, either additive or nonadditive, should be used in applications of G theory. The nonadditive G theory developed in this article generalizes current G theory and uses data at hand to determine when additive or nonadditive models should be used to estimate VCs. Finally, the implications of the findings are discussed in light of an analysis of real data.
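The additive-model VC estimators for a persons-by-items (p x i) one-facet design follow the usual expected-mean-square algebra; a minimal sketch (the article's nonadditive estimators differ and are not reproduced here):

```python
def one_facet_vcs(X):
    """Additive-model variance component estimators for a persons x items
    (p x i) design, derived from the standard expected mean squares."""
    n_p, n_i = len(X), len(X[0])
    grand = sum(sum(row) for row in X) / (n_p * n_i)
    p_means = [sum(row) / n_i for row in X]
    i_means = [sum(X[p][i] for p in range(n_p)) / n_p for i in range(n_i)]
    ss_p = n_i * sum((m - grand) ** 2 for m in p_means)
    ss_i = n_p * sum((m - grand) ** 2 for m in i_means)
    ss_tot = sum((X[p][i] - grand) ** 2
                 for p in range(n_p) for i in range(n_i))
    ms_res = (ss_tot - ss_p - ss_i) / ((n_p - 1) * (n_i - 1))
    var_p = (ss_p / (n_p - 1) - ms_res) / n_i   # subject VC
    var_i = (ss_i / (n_i - 1) - ms_res) / n_p   # facet (item) VC
    return var_p, var_i, ms_res

# Noise-free additive example: person effects [0, 1, 2], item effects [0, 0.5].
X = [[0.0, 0.5], [1.0, 1.5], [2.0, 2.5]]
var_p, var_i, var_res = one_facet_vcs(X)
```

When a subject-by-facet interaction is present, it is absorbed into `ms_res`; by the article's argument, the additive formulas then underestimate `var_p`, and with it the generalizability coefficient var_p / (var_p + ms_res / n_i).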
Sato and Tatsuoka suggested the caution index, several extended caution indices (ECIs), and their standardized versions. Among these indices, the standardized versions of the second and fourth ECIs (denoted as
Online calibration is a technology-enhanced architecture for item calibration in computerized adaptive tests (CATs). Many CATs are administered continuously over a long term and rely on large item banks. To ensure test validity, these item banks need to be frequently replenished with new items, and these new items need to be pretested before being used operationally. Online calibration dynamically embeds pretest items in operational tests and calibrates their parameters as response data are gradually obtained through the continuous test administration. This study extends existing formulas, procedures, and algorithms for dichotomous item response theory models to the generalized partial credit model, a popular model for items scored in more than two categories. A simulation study was conducted to investigate the developed algorithms and procedures under a variety of conditions, including two estimation algorithms, three pretest item selection methods, three seeding locations, two numbers of score categories, and three calibration sample sizes. Results demonstrated acceptable estimation accuracy of the two estimation algorithms in some of the simulated conditions. A variety of findings also emerged concerning the interaction effects of the included factors, and recommendations were made accordingly.
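For reference, the generalized partial credit model (GPCM) assigns category probabilities from cumulative step logits; a minimal sketch with generic step difficulties:

```python
import math

def gpcm_probs(theta, a, bs):
    """GPCM category probabilities for an item with step difficulties bs.
    Category k's cumulative logit is the sum of a*(theta - b_v) for v <= k;
    probabilities come from the softmax over those logits."""
    logits = [0.0]
    for b in bs:
        logits.append(logits[-1] + a * (theta - b))
    m = max(logits)                       # stabilize the softmax numerically
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

With m step difficulties the item has m + 1 score categories; online calibration for the GPCM must estimate the discrimination a and all the b_v for each pretest item from the accumulating responses.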
A classification method is presented for adaptive classification testing with a multidimensional item response theory (IRT) model in which items are intended to measure multiple traits, that is, within-dimensionality. The reference composite is used with the sequential probability ratio test (SPRT) to make decisions and decide whether testing can be stopped before reaching the maximum test length. Item-selection methods are provided that maximize the determinant of the information matrix at the cutoff point or at the projected ability estimate. A simulation study illustrates the efficiency and effectiveness of the classification method. Simulations were run with the new item-selection methods, random item selection, and maximization of the determinant of the information matrix at the ability estimate. The study also showed that the SPRT with multidimensional IRT has the same characteristics as the SPRT with unidimensional IRT and results in more accurate classifications than the latter when used for multidimensional data.
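The SPRT decision rule referenced above can be sketched in the unidimensional case: compare the log-likelihood ratio of two abilities straddling the cutoff against Wald's bounds, stopping early once a bound is crossed. The Rasch likelihood and the specific thresholds are illustrative assumptions; the article's method applies this logic along the reference composite of a multidimensional model.

```python
import math

def sprt_decision(responses, bs, theta_cut, delta=0.3, alpha=0.05, beta=0.05):
    """Classify by comparing the log-likelihood ratio of theta_cut + delta
    versus theta_cut - delta (Rasch model) against Wald's SPRT bounds."""
    def loglik(theta):
        s = 0.0
        for x, b in zip(responses, bs):
            p = 1 / (1 + math.exp(-(theta - b)))
            s += math.log(p if x else 1 - p)
        return s
    lr = loglik(theta_cut + delta) - loglik(theta_cut - delta)
    if lr >= math.log((1 - beta) / alpha):
        return "pass"
    if lr <= math.log(beta / (1 - alpha)):
        return "fail"
    return "continue"
```

Testing stops as soon as the rule returns "pass" or "fail"; otherwise another item is selected (e.g., to maximize the information determinant at the cutoff) and the ratio is updated.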
In applications of cognitive diagnostic models (CDMs), practitioners usually face the difficulty of choosing appropriate CDMs and building accurate Q-matrices. However, the functions of model-fit indices that are supposed to inform model and Q-matrix choices are not well understood. This study examines the performance of several promising model-fit indices in selecting the model and Q-matrix under different sample size conditions. The relative performance of the Akaike information criterion and the Bayesian information criterion in model and Q-matrix selection appears to depend on the complexity of the data-generating models, the Q-matrices, and the sample sizes. Among the absolute fit indices, MX2 is least sensitive to sample size under correct model and Q-matrix specifications and performs best in terms of power. Sample size is found to be the most influential factor on model-fit index values. Consequences of selecting an inaccurate model and Q-matrix for the classification accuracy of attribute mastery are also evaluated.
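For reference, the relative fit indices compared above are simple penalized log-likelihoods; BIC's penalty grows with sample size, which is one reason their relative performance can depend on n:

```python
import math

def aic(loglik, k):
    """Akaike information criterion: penalty of 2 per free parameter."""
    return -2 * loglik + 2 * k

def bic(loglik, k, n):
    """Bayesian information criterion: penalty of ln(n) per free parameter."""
    return -2 * loglik + k * math.log(n)
```

For n above about 7 (ln n > 2), BIC penalizes model complexity more heavily than AIC, so as sample size grows the two criteria can favor different CDMs or Q-matrices for the same data.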
This brief research report shows how different applications of the Multigroup Ethnic Identity Measure (MEIM) can have implications for the interpretation of the role of ethnic identity in research. Throughout the MEIM’s widespread use, notable inconsistencies lie in how the measure has been applied. This report uses empirical data to demonstrate differences in statistical inference due to these differences in usage.
Item position effects can seriously bias analyses in educational measurement, especially when multiple matrix sampling designs are deployed. In such designs, item position effects may easily occur if not explicitly controlled for. Still, in practice it usually turns out to be rather difficult—or even impossible—to completely control for effects due to the position of items. The objectives of this article are to show how item position effects can be modeled using the linear logistic test model with additional error term (LLTM+) in the framework of generalized linear mixed models (GLMMs), to explore in a simulation study how well the LLTM+ holds the nominal Type I risk threshold, to conduct power analysis for this model, and to examine the sensitivity of the LLTM+ to designs that are not completely balanced concerning item position. Overall, the LLTM+ proved suitable for modeling item position effects when a balanced design is used. With decreasing balance, the model tends to be more conservative in the sense that true item position effects are more unlikely to be detected. Implications for linking and equating procedures which use common items are discussed.
The single-strategy deterministic, inputs, noisy "and" gate (SS-DINA) model has previously been extended to a model called the multiple-strategy deterministic, inputs, noisy "and" gate (MS-DINA) model to address more complex situations where examinees can use multiple problem-solving strategies during the test. The main purpose of this article is to adapt an efficient estimation algorithm, the Expectation–Maximization algorithm, that can be used to fit the MS-DINA model when the joint attribute distribution is most general (i.e., saturated). The article also examines through a simulation study the impact of sample size and test length on the fit of the SS-DINA and MS-DINA models, and the implications of misfit on item parameter recovery and attribute classification accuracy. In addition, an analysis of fraction subtraction data is presented to illustrate the use of the algorithm with real data. Finally, the article concludes by discussing several important issues associated with multiple-strategies models for cognitive diagnosis.
This article discusses four item selection rules to design efficient individualized tests for the random weights linear logistic test model (RWLLTM): minimum posterior-weighted
In questionnaires, items can be presented in a grouped format (same-scale items are presented in the same block) or in a randomized format (items from one scale are mixed with items from other scales). Some researchers have advocated the grouped format because it enhances discriminant validity. The current study demonstrates that positioning items in separate blocks of a questionnaire may indeed lead to increased discriminant validity, but this can happen even in instances where discriminant validity should not be present. In particular, the authors show that splitting an established unidimensional scale into two arbitrary blocks of items separated by unrelated buffer items results in the emergence of two clearly identifiable but artificial factors that show discriminant validity.
Andrich, Marais, and Humphry showed formally that Waller’s procedure, which removes responses to multiple choice (MC) items that are likely to have been guessed, eliminates the bias in the Rasch model (RM) estimates of difficult items, making those items more difficult. However, they did not study the consequences for the person proficiency estimates. This article shows that when the procedure is applied, the more proficient persons, who are least likely to guess, benefit by a greater amount than the less proficient, who are most likely to guess. This apparently surprising result is explained by appreciating that the more proficient persons answer difficult items correctly at a greater rate than do the less proficient, even when the latter guess some items correctly. As a consequence, increasing the difficulty of the difficult items benefits the more proficient persons more than the less proficient ones. Analyses of a simulated and a real example are shown for illustration. So as not to disadvantage the more proficient persons, it is suggested that Waller’s procedure be used when the RM is used to analyze MC items.
The issue of latent trait granularity in diagnostic models is considered, comparing and contrasting latent trait and latent class models used for diagnosis. Relationships between conjunctive cognitive diagnosis models (CDMs) with binary attributes and noncompensatory multidimensional item response models are explored, leading to a continuous generalization of the Noisy Input, Deterministic "And" Gate (NIDA) model. A model that combines continuous and discrete latent variables is proposed that includes a noncompensatory item response theory (IRT) term and a term following the discrete attribute Deterministic Input, Noisy "And" Gate (DINA) model in cognitive diagnosis. The Tatsuoka fraction subtraction data are analyzed with the proposed models as well as with the DINA model, and classification results are compared. The applicability of the continuous latent trait model and the combined IRT and CDM is discussed, and arguments are given for development of simple models for complex cognitive structures.
This study investigated the efficacy of the l_z person fit statistic for detecting aberrant responding with unidimensional pairwise preference (UPP) measures, constructed and scored based on the Zinnes–Griggs item response theory (IRT) model, which has been used for a variety of recent noncognitive testing applications. Because UPP measures are used to collect both "self-" and "other" reports, the capability of l_z to detect two of the most common and potentially detrimental response sets, namely fake good and random responding, was explored. The effectiveness of l_z was studied using empirical and theoretical critical values for classification, along with test length, test information, the type of statement parameters, and the percentage of items answered aberrantly (20%, 50%, 100%). It was found that l_z was ineffective in detecting fake good responding, with power approaching zero in the 100% aberrance conditions. However, l_z was highly effective in detecting random responding, with power approaching 1.0 in long-test, high information conditions, and there was no diminution in efficacy when using marginal maximum likelihood estimates of statement parameters in place of the true values. Although using empirical critical values for classification provided slightly higher power and more accurate Type I error rates, theoretical critical values, corresponding to a standard normal distribution, provided nearly as good results.
In "compensatory" multidimensional item response theory (IRT) models, latent ability scores are typically assumed to be independent and combine additively to influence the probability of responding to an item correctly. However, testing situations arise where modeling an additive relationship between latent abilities is not appropriate or desired. In these situations, "noncompensatory" models may be better suited to handle this phenomenon. Unfortunately, relatively few estimation studies have been conducted using these types of models and effective estimation of the parameters by maximum-likelihood has not been well established. In this article, the authors demonstrate how noncompensatory models may be estimated with a Metropolis–Hastings Robbins–Monro hybrid (MH-RM) algorithm and perform a computer simulation study to determine how effective this algorithm is at recovering population parameters. Results suggest that although the parameters are not recovered accurately in general, the empirical fit was consistently better than a competing product-constructed IRT model and latent ability scores were also more accurately recovered.
When students solve problems, their proficiency in a particular subject may influence how well they perform in a similar, but different area of study. For example, studies have shown that science ability may have an effect on the mastery of mathematics skills, which in turn may affect how examinees respond to mathematics items. From this view, it becomes natural to examine the relationship of performance on a particular area of study to the mastery of attributes on a related subject. To examine such an influence, this study proposes a covariate extension to the deterministic input noisy "and" gate (DINA) model by applying a latent class regression framework. The DINA model has been selected for the study as it is known for its parsimony, easy interpretation, and potential extension of the covariate framework to more complex cognitive diagnostic models. In this approach, covariates can be specified to affect items or attributes. Real-world data analysis using the fourth-grade Trends in International Mathematics and Science Study (TIMSS) data showed significant relationships between science ability and attributes in mathematics. Simulation study results showed stable recovery of parameters and latent classes for varying sample sizes. These findings suggest further applications of covariates in a cognitive diagnostic modeling framework that can aid the understanding of how various factors influence mastery of fine-grained attributes.
The performance of chi-square difference tests based on limited information estimation methods has not been extensively examined for differential functioning, particularly in the context of multidimensional item response theory (MIRT) models. Chi-square tests for detecting differential item functioning (DIF) and global differential item functioning (GDIF) in an MIRT model were conducted using two robust weighted least square estimators: weighted least square with adjusted means and variance (WLSMV) and weighted least square with adjusted means (WLSM), and the results were evaluated in terms of Type I error rates and rejection rates. The present study demonstrated systematic test procedures for detecting different types of GDIF and DIF in multidimensional tests. For the chi-square tests for detecting GDIF, WLSM tended to produce inflated Type I error rates for small sample size conditions, whereas WLSMV appeared to yield lower error rates than the expected value on average. In addition, WLSM produced higher rejection rates than WLSMV. For the chi-square tests for detecting DIF, WLSMV appeared to yield somewhat higher rejection rates than WLSM for all DIF tests except for the omnibus test. The error rates for both estimators were close to the expected value on average.
Item replenishment is important for maintaining a large-scale item bank. In this article, the authors consider calibrating new items based on pre-calibrated operational items under the deterministic inputs, noisy-and-gate model, the specification of which includes the so-called Q-matrix, as well as the slipping and guessing parameters. Making use of the maximum likelihood and Bayesian estimators for the latent knowledge states, the authors propose two methods for the calibration. These methods are applicable to both traditional paper–pencil–based tests, for which the selection of operational items is prefixed, and computerized adaptive tests, for which the selection of operational items is sequential and random. Extensive simulations are done to assess and to compare the performance of these approaches. Extensions to other diagnostic classification models are also discussed.
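The DINA item response function referenced above has a simple closed form driven by the slipping and guessing parameters; a minimal sketch (function and variable names are illustrative, not from the article):

```python
import numpy as np

def dina_prob(alpha, q, slip, guess):
    """P(correct) under the DINA model for one item.

    alpha: binary attribute-mastery vector for an examinee
    q:     binary Q-matrix row for the item (required attributes)
    slip, guess: item slipping and guessing parameters
    """
    # eta = 1 iff the examinee masters every attribute the item requires
    eta = int(np.all(alpha >= q))
    return (1 - slip) ** eta * guess ** (1 - eta)

# A master of all required attributes succeeds with probability 1 - slip;
# a non-master succeeds only by guessing, with probability guess
print(dina_prob(np.array([1, 1]), np.array([1, 1]), slip=0.1, guess=0.2))  # 0.9
print(dina_prob(np.array([1, 0]), np.array([1, 1]), slip=0.1, guess=0.2))  # 0.2
```

Calibrating a new item then amounts to estimating its slip and guess given (estimated) knowledge states for the examinees who answered it.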
The maximum likelihood (ML) and the weighted likelihood (WL) estimators are commonly used to obtain proficiency level estimates with pre-calibrated item parameters. Both estimators have the same asymptotic standard error (ASE) that can be easily derived from the expected information function of the test. However, the accuracy of this asymptotic formula is unclear with short tests when only a few items are administered. The purpose of this paper is to compare the ASE of these estimators with their exact values, evaluated at the proficiency-level estimates. The exact standard error (SE) is computed by generating the full exact sample distribution of the estimators, so its practical feasibility is limited to small tests (except under the Rasch model). A simulation study was conducted to compare the ASE and the exact SE of the ML and WL estimators, with the "true" SE (i.e., computed as the exact SE with the true proficiency levels). It is concluded that with small tests, the exact SEs are less biased and return smaller root mean square error values than the asymptotic SEs, while as expected, the two estimators return similar results with longer tests.
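The ASE discussed above is the reciprocal square root of the expected test information evaluated at the proficiency estimate; a minimal sketch under the Rasch model (the item difficulties are illustrative):

```python
import math

def rasch_ase(theta, difficulties):
    """Asymptotic SE of the ML proficiency estimate under the Rasch model:
    ASE(theta) = 1 / sqrt(I(theta)), where I(theta) = sum_j p_j (1 - p_j)."""
    info = 0.0
    for b in difficulties:
        p = 1.0 / (1.0 + math.exp(-(theta - b)))  # Rasch ICC
        info += p * (1.0 - p)
    return 1.0 / math.sqrt(info)

# A 5-item test yields a large ASE; this is exactly the short-test regime
# where the asymptotic formula is least trustworthy
print(round(rasch_ase(0.0, [-1.0, -0.5, 0.0, 0.5, 1.0]), 3))
```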
The violation of the assumption of local independence when applying item response theory (IRT) models has been shown to have a negative impact on all estimates obtained from the given model. Numerous indices and statistics have been proposed to aid analysts in the detection of local dependence (LD). A Monte Carlo study was conducted to evaluate the relative performance of selected LD measures in conditions considered typical of studies collecting psychological assessment data. Both the Jackknife Slope Index and the likelihood ratio statistic G^2 are available across the two IRT models used and displayed adequate to good performance in most simulation conditions. The use of these indices together is the final recommendation for applied analysts. Future research areas are discussed.
For classification problems in psychology (e.g., clinical diagnosis), batteries of tests are often administered. However, not every test or item may be necessary for accurate classification. In the current article, a combination of classification and regression trees (CART) and stochastic curtailment (SC) is introduced to reduce assessment length of questionnaire batteries. First, the CART algorithm provides relevant subscales and cutoffs needed for accurate classification, in the form of a decision tree. Second, for every subscale and cutoff appearing in the decision tree, SC reduces the number of items needed for accurate classification. This procedure is illustrated by post hoc simulation on a data set of 3,579 patients, to whom the Mood and Anxiety Symptoms Questionnaire (MASQ) was administered. Subscales of the MASQ are used for predicting diagnoses of depression. Results show that CART-SC provided an assessment length reduction of 56%, without loss of accuracy, compared with the more traditional prediction method of performing linear discriminant analysis on subscale scores. CART-SC appears to be an efficient and accurate algorithm for shortening test batteries.
Growing reliance on complex constructed response items has generated considerable interest in automated scoring solutions. Many of these solutions are described in the literature; however, relatively few studies have been published that compare automated scoring strategies. Here, comparisons are made among five strategies for machine-scoring examinee performances of computer-based case simulations, a complex item format used to assess physicians’ patient-management skills as part of the Step 3 United States Medical Licensing Examination. These strategies utilize expert judgments to obtain various (a) case-specific or (b) generic scoring algorithms. The various compromises between efficiency, validity, and reliability that characterize each scoring approach are described and compared.
One of the distinctions between classical test theory and item response theory is that the former focuses on sum scores and their relationship to true scores, whereas the latter concerns item responses and their relationship to latent scores. Although item response theory is often viewed as the richer of the two theories, sum scores are still often used in practice. The issue addressed here is how to conduct item response modeling when only sum scores are available for some respondents; that is, their item responses are missing, but their sum scores are known. The author reviews the important role of sum scores in item response theory and shows how to estimate item response models using sum scores as data in lieu of item responses. The author also shows how this can be easily implemented in a Bayesian framework using the software package Just Another Gibbs Sampler (JAGS), and provides three examples for illustration.
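When only a sum score is observed, its model-implied distribution given the item-correct probabilities follows from local independence via a standard recursion over items (often attributed to Lord and Wingersky); a minimal sketch for dichotomous items:

```python
def sum_score_probs(p):
    """Distribution of the sum score given item-correct probabilities p,
    assuming local independence, built by adding one item at a time."""
    dist = [1.0]  # P(sum = 0) before any items are considered
    for pj in p:
        new = [0.0] * (len(dist) + 1)
        for s, mass in enumerate(dist):
            new[s] += mass * (1 - pj)   # item answered incorrectly
            new[s + 1] += mass * pj     # item answered correctly
        dist = new
    return dist

# Two items with p = .5 each: sum scores 0, 1, 2 occur with probs .25, .5, .25
print(sum_score_probs([0.5, 0.5]))  # [0.25, 0.5, 0.25]
```

This sum-score likelihood is what allows respondents with missing item responses, but known totals, to contribute to estimation.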
The Monte Carlo approach which has previously been implemented in traditional computerized adaptive testing (CAT) is applied here to cognitive diagnostic CAT to test the ability of this approach to address multiple content constraints. The performance of the Monte Carlo approach is compared with the performance of the modified maximum global discrimination index (MMGDI) method on simulations in which the only content constraint is on the number of items that measure each attribute. The results of the two simulation experiments show that (a) the Monte Carlo method fulfills all the test requirements and produces satisfactory measurement precision and item exposure results and (b) the Monte Carlo method outperforms the MMGDI method when the Monte Carlo method applies either the posterior-weighted Kullback–Leibler algorithm or the hybrid Kullback–Leibler information as the item selection index. Overall, the recovery rate of the knowledge states, the distribution of the item exposure, and the utilization rate of the item bank are improved when the Monte Carlo method is used.
Multiple imputation (MI) has become a highly useful technique for handling missing values in many settings. In this article, the authors compare the performance of a MI model based on empirical Bayes techniques to a direct maximum likelihood analysis approach that is known to be robust in the presence of missing observations. Specifically, they focus on handling of missing item scores in multilevel cross-classification item response data structures that may require more complex imputation techniques, and for situations where an imputation model can be more general than the analysis model. Through a simulation study and an empirical example, the authors show that MI is more effective in estimating missing item scores and produces unbiased parameter estimates of explanatory item response theory models formulated as cross-classified mixed models.
The majority of large-scale assessments develop various score scales that are either linear or nonlinear transformations of raw scores for better interpretations and uses of assessment results. The current formula for coefficient alpha (α; the commonly used reliability coefficient) only provides internal consistency reliability estimates of raw scores. This article presents a general form of α and extends its use to estimate internal consistency reliability for nonlinear scale scores (used for relative decisions). The article also examines this estimator of reliability using different score scales with real data sets of both dichotomously scored and polytomously scored items. Different score scales show different estimates of reliability. The effects of transformation functions on reliability of different score scales are also explored.
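The raw-score coefficient alpha that the article generalizes can be computed directly from an examinees-by-items score matrix; a minimal sketch (assuming complete data):

```python
import numpy as np

def coefficient_alpha(scores):
    """Cronbach's alpha for an examinees-by-items score matrix:
    alpha = k/(k-1) * (1 - sum of item variances / variance of total score)."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

# Perfectly parallel items yield alpha = 1
print(coefficient_alpha([[1, 1], [0, 0], [1, 1], [0, 0]]))  # 1.0
```

The article's point is that applying this formula to raw scores says nothing about the reliability of a nonlinearly transformed scale score, which is what its generalized form addresses.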
Interest in developing computerized adaptive testing (CAT) under cognitive diagnosis models (CDMs) has increased recently. CAT algorithms that use a fixed-length termination rule frequently lead to different degrees of measurement precision for different examinees. Fixed precision, in which the examinees receive the same degree of measurement precision, is a major advantage of CAT over nonadaptive testing. In addition to the precision issue, test security is another important issue in practical CAT programs. In this study, the authors implemented two termination criteria for the fixed-precision rule and evaluated their performance under two popular CDMs using simulations. The results showed that using the two criteria with the posterior-weighted Kullback–Leibler information procedure for selecting items could achieve the prespecified measurement precision. A control procedure was developed to control item exposure and test overlap simultaneously among examinees. The simulation results indicated that in contrast to no method of controlling exposure, the control procedure developed in this study could maintain item exposure and test overlap at the prespecified level at the expense of only a few more items.
This study examines adverse consequences of using hierarchical linear modeling (HLM) that ignores rater effects to analyze ratings collected by multiple raters in longitudinal research. The most severe consequence of using HLM ignoring rater effects is the biased estimation of Levels 1 and 2 fixed effects and potentially incorrect significance tests about them. A cross-classified random effects model (CCREM) is proposed as an alternative to HLM. A Monte Carlo study and an empirical evaluation confirm that CCREM performs better than does HLM in dealing with rater effects. Strengths, limitations, and implications of the study are discussed.
Most methods for fitting cognitive diagnosis models to educational test data and assigning examinees to proficiency classes require the Q-matrix that associates each item in a test with the cognitive skills (attributes) needed to answer it correctly. In most cases, the Q-matrix is not known but is constructed from the (fallible) judgments of experts in the educational domain. It is widely recognized that a misspecification of the Q-matrix can negatively affect the estimation of the model parameters, which may then result in the misclassification of examinees. This article develops a Q-matrix refinement method based on the nonparametric classification method (Chiu & Douglas, in press), and comparisons of the residual sum of squares computed from the observed and the ideal item responses. The method is evaluated with three simulation studies and an application to real data. Results show that the method can identify and correct misspecified entries in the Q-matrix, thereby improving its accuracy.
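The residual-sum-of-squares comparison between observed and ideal item responses can be sketched as follows, assuming DINA-style (conjunctive) ideal responses; the names are illustrative, not the article's:

```python
import numpy as np

def ideal_response(alpha, q_row):
    # Ideal (deterministic) response: correct iff every required attribute is mastered
    return int(np.all(alpha >= q_row))

def item_rss(observed, alphas, q_row):
    """Residual sum of squares between observed responses to one item and
    the ideal responses implied by a candidate Q-matrix row, over examinees."""
    ideal = np.array([ideal_response(a, q_row) for a in alphas])
    return float(np.sum((np.asarray(observed) - ideal) ** 2))
```

In a refinement scheme of this kind, the candidate q-vector minimizing the RSS for an item is taken as evidence that the current Q-matrix entry is misspecified.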
Many latent traits in the human sciences have a hierarchical structure. This study aimed to develop a new class of higher order item response theory models for hierarchical latent traits that are flexible in accommodating both dichotomous and polytomous items, to estimate both item and person parameters jointly, to allow users to specify customized item response functions, and to go beyond two orders of latent traits and the linear relationship between latent traits. Parameters of the new class of models can be estimated using the Bayesian approach with Markov chain Monte Carlo methods. Through a series of simulations, the authors demonstrated that the parameters in the new class of models can be well recovered with the computer software WinBUGS, and the joint estimation approach was more efficient than multistaged or consecutive approaches. Two empirical examples of achievement and personality assessments were given to demonstrate applications and implications of the new models.
This article proposes a generalized distance discriminating method for tests with polytomous responses (GDD-P). The new method is the polytomous extension of an item response theory (IRT)-based cognitive diagnostic method that identifies examinees’ ideal response patterns (IRPs) based on a generalized distance index. The similarity between observed response patterns and IRPs in the polytomous-response situation is measured by the GDD-P index, and attribute patterns can be recognized via the relationship between attribute patterns and IRPs. Feasible designs for a polytomous Q-matrix and for scoring items with polytomous responses are also discussed. In a simulation, the classification accuracy of the GDD-P method for tests with polytomous responses was investigated, and the results indicated that the proposed method had promising performance in recognizing examinees’ attribute patterns.
Hierarchical generalized linear models (HGLMs) have been used to assess differential item functioning (DIF). For model identification, some literature assumed that the reference (majority) and focal (minority) groups have an equal mean ability so that all items in a test can be assessed for DIF. In reality, it is very unlikely that the two groups have an identical mean. If so, other model identification procedures should be adopted. A feasible procedure for model identification is to set an item that is the most likely to be DIF-free as a reference, so that the two groups can have different means and the other items can be assessed for DIF. In Simulation Study 1, several methods based on HGLMs in selecting DIF-free items were compared. In Simulation Study 2, those items assessed as DIF-free were anchored, and the other items were assessed for DIF. This new method was compared with the traditional method based on HGLMs in which the two groups are assumed to have an equal mean in terms of the Type I error rate and the power rate. The results showed that the new method outperformed the traditional method when the two groups did not have an equal mean.
The purpose of this research was to develop observed score and true score equating procedures to be used in conjunction with the multidimensional item response theory (MIRT) framework. Three equating procedures—two observed score procedures and one true score procedure—were created and described in detail. One observed score procedure was presented as a direct extension of unidimensional IRT (UIRT) observed score equating and is referred to as the "Full MIRT Observed Score Equating Procedure." The true score procedure and the second observed score procedure incorporated unidimensional approximation procedures to equate exams using UIRT equating principles. These procedures are referred to as the "Unidimensional Approximation of MIRT True Score Equating Procedure" and the "Unidimensional Approximation of MIRT Observed Score Equating Procedure," respectively. Three exams were used to conduct UIRT observed score and true score equating, MIRT observed score and true score equating, and equipercentile equating. The equipercentile equating procedure was conducted for the purpose of comparison because this procedure does not explicitly violate the IRT assumption of unidimensionality. Results indicated that the MIRT equating procedures performed more similarly to the equipercentile equating procedure than the UIRT equating procedures, presumably due to the violation of the unidimensionality assumption under the UIRT equating procedures.
Common test items play an important role in equating alternate test forms under the common item nonequivalent groups design. When the item response theory (IRT) method is applied in equating, inconsistent item parameter estimates among common items can lead to large bias in equated scores. It is prudent to evaluate inconsistency in parameter estimates of common items before conducting IRT equating. The evaluation of inconsistency in parameter estimates is typically achieved through detecting outliers in the common item set. In this study, a linear regression method is proposed as a detection method. The newly proposed method was compared with a traditional method in various conditions. The results of this study confirmed the necessity of detecting and removing outlying common items. The results also show that the newly proposed method performed better than did the traditional method in most conditions.
Polytomous attributes, particularly those defined as part of the test development process, can provide additional diagnostic information. The present research proposes the polytomous generalized deterministic inputs, noisy, "and" gate (pG-DINA) model to accommodate such attributes. The pG-DINA model allows input from substantive experts to specify attribute levels and is a general model that subsumes various reduced models. In addition to model formulation, the authors evaluate the viability of the proposed model by examining how well the model parameters can be estimated under various conditions, and compare its classification accuracy against that of the conventional G-DINA model with a modified classification rule. A real-data example is used to illustrate the application of the model in practice.