MetaTOC: stay on top of your field, easily

Journal of Educational Measurement

Impact factor: 0.66 | 5-year impact factor: 1.038 | Print ISSN: 0022-0655 | Online ISSN: 1745-3984 | Publisher: Wiley Blackwell (Blackwell Publishing)

Subjects: Mathematical Psychology, Educational Psychology, Applied Psychology

Most recent papers:

  • Issue Information.

    Journal of Educational Measurement. September 03, 2018
    Journal of Educational Measurement, Volume 55, Issue 3, Pages 355-356, Fall 2018.
    September 03, 2018   doi: 10.1111/jedm.12152   open full text
  • Development of Information Functions and Indices for the GGUM‐RANK Multidimensional Forced Choice IRT Model.
    Seang‐Hwane Joo, Philseok Lee, Stephen Stark.
    Journal of Educational Measurement. September 03, 2018
    Abstract: This research derived information functions and proposed new scalar information indices to examine the quality of multidimensional forced choice (MFC) items based on the RANK model. We also explored how GGUM‐RANK information, latent trait recovery, and reliability varied across three MFC formats: pairs (two response alternatives), triplets (three alternatives), and tetrads (four alternatives). As expected, tetrad and triplet measures provided substantially more information than pairs, and MFC items composed of statements with high discrimination parameters were most informative. The methods and findings of this study will help practitioners to construct better MFC items, make informed projections about reliability with different MFC formats, and facilitate the development of MFC triplet‐ and tetrad‐based computerized adaptive tests. - Journal of Educational Measurement, Volume 55, Issue 3, Pages 357-372, Fall 2018.
    September 03, 2018   doi: 10.1111/jedm.12183   open full text
  • A Comparison of Procedures for Estimating Person Reliability Parameters in the Graded Response Model.
    David M. LaHuis, Kinsey B. Bryant‐Lees, Shotaro Hakoyama, Tyler Barnes, Andrea Wiemann.
    Journal of Educational Measurement. September 03, 2018
    Abstract: Person reliability parameters (PRPs) model temporary changes in individuals’ attribute level perceptions when responding to self‐report items (higher levels of PRPs represent less fluctuation). PRPs could be useful in measuring careless responding and traitedness. However, it is unclear how well current procedures for estimating PRPs can recover parameter estimates. This study assesses these procedures in terms of mean error (ME), average absolute difference (AAD), and reliability using simulated data with known values. Several prior distributions for PRPs were compared across a number of conditions. Overall, our results revealed little difference between the χ and lognormal distributions as priors for estimated PRPs. Both distributions produced estimates with reasonable levels of ME; however, the AAD of the estimates was high. AAD did improve slightly as the number of items increased, suggesting that increasing the number of items would ameliorate this problem. Similarly, a larger number of items was necessary to produce reasonable levels of reliability. Based on our results, several conclusions are drawn and implications for future research are discussed. - Journal of Educational Measurement, Volume 55, Issue 3, Pages 421-432, Fall 2018.
    September 03, 2018   doi: 10.1111/jedm.12186   open full text
  • The Impact of Multidimensionality on Extraction of Latent Classes in Mixture Rasch Models.
    Yoonsun Jang, Seock‐Ho Kim, Allan S. Cohen.
    Journal of Educational Measurement. September 03, 2018
    Abstract: This study investigates the effect of multidimensionality on extraction of latent classes in mixture Rasch models. In this study, two‐dimensional data were generated under varying conditions. The two‐dimensional data sets were analyzed with one‐ to five‐class mixture Rasch models. Results of the simulation study indicate that the mixture Rasch model tended to extract more latent classes than the number of dimensions simulated, particularly when the multidimensional structure of the data was more complex. In addition, the number of extracted latent classes decreased as the dimensions were more highly correlated, regardless of multidimensional structure. An analysis of the empirical multidimensional data also shows that the number of latent classes extracted by the mixture Rasch model is larger than the number of dimensions measured by the test. - Journal of Educational Measurement, Volume 55, Issue 3, Pages 403-420, Fall 2018.
    September 03, 2018   doi: 10.1111/jedm.12185   open full text
  • Subjective Priors for Item Response Models: Application of Elicitation by Design.
    Allison Ames, Elizabeth Smith.
    Journal of Educational Measurement. September 03, 2018
    Abstract: Bayesian methods incorporate model parameter information prior to data collection. Eliciting information from content experts is an option, but has seen little implementation in Bayesian item response theory (IRT) modeling. This study aims to use ethical reasoning content experts to elicit prior information and incorporate this information into Markov chain Monte Carlo (MCMC) estimation. A six‐step elicitation approach is followed, with relevant details at each stage for two IRT item parameters: difficulty and guessing. Results indicate that using content experts is the preferred approach, rather than noninformative priors, for both parameter types. The use of a noninformative prior with small samples produced dramatically different results compared to those from content expert–elicited priors. The WAMBS (When to worry and how to Avoid the Misuse of Bayesian Statistics) checklist is used to aid in comparisons. - Journal of Educational Measurement, Volume 55, Issue 3, Pages 373-402, Fall 2018.
    September 03, 2018   doi: 10.1111/jedm.12184   open full text
  • Detecting Differential Item Discrimination (DID) and the Consequences of Ignoring DID in Multilevel Item Response Models.
    Woo‐yeol Lee, Sun‐Joo Cho.
    Journal of Educational Measurement. September 01, 2017
    Cross‐level invariance in a multilevel item response model can be investigated by testing whether the within‐level item discriminations are equal to the between‐level item discriminations. Testing the cross‐level invariance assumption is important for understanding constructs in multilevel data. However, in most multilevel item response model applications, cross‐level invariance is assumed without being tested. In this study, methods for detecting differential item discrimination (DID) across levels, and the consequences of ignoring DID, are illustrated and discussed with the use of multilevel item response models. Simulation results showed that the likelihood ratio test (LRT) performed well in detecting global DID at the test level when some portion of the items exhibited DID. At the item level, the Akaike information criterion (AIC), the sample‐size adjusted Bayesian information criterion (saBIC), LRT, and Wald test showed a satisfactory rejection rate (>.8) when some portion of the items exhibited DID and the items had lower intraclass correlations (or higher DID magnitudes). When DID was ignored, the item discrimination estimates and their standard errors were the most adversely affected. Implications of the findings and limitations are discussed.
    September 01, 2017   doi: 10.1111/jedm.12148   open full text
  • Modeling Skipped and Not‐Reached Items Using IRTrees.
    Dries Debeer, Rianne Janssen, Paul De Boeck.
    Journal of Educational Measurement. September 01, 2017
    When dealing with missing responses, two types of omissions can be discerned: items can be skipped or not reached by the test taker. When the occurrence of these omissions is related to the proficiency process, the missingness is nonignorable. The purpose of this article is to present a tree‐based IRT framework for modeling responses and omissions jointly, taking into account that test takers as well as items can contribute to the two types of omissions. The proposed framework covers several existing models for missing responses, and many IRTree models can be estimated using standard statistical software. Further, simulated data are used to show that ignoring missing responses is less robust than is often assumed. Finally, as an illustration of its applicability, the IRTree approach is applied to data from the 2009 PISA reading assessment.
    September 01, 2017   doi: 10.1111/jedm.12147   open full text
  • Structured Constructs Models Based on Change‐Point Analysis.
    Hyo Jeong Shin, Mark Wilson, In‐Hee Choi.
    Journal of Educational Measurement. September 01, 2017
    This study proposes a structured constructs model (SCM) to examine measurement in the context of a multidimensional learning progression (LP). The LP is assumed to have features that go beyond a typical multidimensional IRT model, in that there are hypothesized to be certain cross‐dimensional linkages that correspond to requirements between the levels of the different dimensions. The new model builds on multidimensional item response theory models and change‐point analysis to add cut‐score and discontinuity parameters that embody these substantive requirements. This modeling strategy allows us to place the examinees in the appropriate LP level and simultaneously to model the hypothesized requirement relations. Results from a simulation study indicate that the proposed change‐point SCM recovers the generating parameters well. When the hypothesized requirement relations are ignored, the model fit tends to become worse, and the model parameters appear to be more biased. Moreover, the proposed model can be used to find validity evidence to support or disprove the theoretical links initially hypothesized in the LP through empirical data. We illustrate the technique with data from an assessment system designed to measure student progress in a middle‐school statistics and modeling curriculum.
    September 01, 2017   doi: 10.1111/jedm.12146   open full text
  • Optimal Linking Design for Response Model Parameters.
    Michelle D. Barrett, Wim J. van der Linden.
    Journal of Educational Measurement. September 01, 2017
    Linking functions adjust for differences between identifiability restrictions used in different instances of the estimation of item response model parameters. These adjustments are necessary when results from those instances are to be compared. As linking functions are derived from estimated item response model parameters, parameter estimation error automatically propagates into linking error. This article explores an optimal linking design approach in which mixed‐integer programming is used to select linking items to minimize linking error. Results indicate that the method holds promise for selection of linking items.
    September 01, 2017   doi: 10.1111/jedm.12145   open full text
  • Detecting Item Drift in Large‐Scale Testing.
    Hongwen Guo, Frederic Robin, Neil Dorans.
    Journal of Educational Measurement. September 01, 2017
    The early detection of item drift is an important issue for frequently administered testing programs because items are reused over time. Unfortunately, operational data tend to be very sparse and do not lend themselves to frequent monitoring analyses, particularly for on‐demand testing. Building on existing residual analyses, the authors propose an item index that requires only moderate‐to‐small sample sizes to form data for time‐series analysis. Asymptotic results are presented to facilitate statistical significance tests. The authors show that the proposed index combined with time‐series techniques may be useful in detecting and predicting item drift. Most important, this index is related to a well‐known differential item functioning analysis so that a meaningful effect size can be proposed for item drift detection.
    September 01, 2017   doi: 10.1111/jedm.12144   open full text
  • Person‐Fit Statistics for Joint Models for Accuracy and Speed.
    Jean‐Paul Fox, Sukaesi Marianti.
    Journal of Educational Measurement. June 01, 2017
    Response accuracy and response time data can be analyzed with a joint model to measure ability and speed of working, while accounting for relationships between item and person characteristics. In this study, person‐fit statistics are proposed for joint models to detect aberrant response accuracy and/or response time patterns. The person‐fit tests take the correlation between ability and speed into account, as well as the correlation between item characteristics. They are posited as Bayesian significance tests, which have the advantage that the extremeness of a test statistic value is quantified by a posterior probability. The person‐fit tests can be computed as by‐products of a Markov chain Monte Carlo algorithm. Simulation studies were conducted in order to evaluate their performance. For all person‐fit tests, the simulation studies showed good detection rates in identifying aberrant patterns. A real data example is given to illustrate the person‐fit statistics for the evaluation of the joint model.
    June 01, 2017   doi: 10.1111/jedm.12143   open full text
  • Evaluating Statistical Targets for Assembling Parallel Mixed‐Format Test Forms.
    Dries Debeer, Usama S. Ali, Peter W. van Rijn.
    Journal of Educational Measurement. June 01, 2017
    Test assembly is the process of selecting items from an item pool to form one or more new test forms. Often new test forms are constructed to be parallel with an existing (or an ideal) test. Within the context of item response theory, the test information function (TIF) or the test characteristic curve (TCC) are commonly used as statistical targets to obtain this parallelism. In a recent study, Ali and van Rijn proposed combining the TIF and TCC as statistical targets, rather than using only a single statistical target. In this article, we propose two new methods using this combined approach, and compare these methods with single statistical targets for the assembly of mixed‐format tests. In addition, we introduce new criteria to evaluate the parallelism of multiple forms. The results show that single statistical targets can be problematic, while the combined targets perform better, especially in situations with increasing numbers of polytomous items. Implications of using the combined target are discussed.
    June 01, 2017   doi: 10.1111/jedm.12142   open full text
  • A New Statistic for Detection of Aberrant Answer Changes.
    Sandip Sinharay, Minh Q. Duong, Scott W. Wood.
    Journal of Educational Measurement. June 01, 2017
    As noted by Fremer and Olson, analysis of answer changes is often used to investigate testing irregularities because the analysis is readily performed and has proven its value in practice. Researchers such as Belov, Sinharay and Johnson, van der Linden and Jeon, van der Linden and Lewis, and Wollack, Cohen, and Eckerly have suggested several statistics for detection of aberrant answer changes. This article suggests a new statistic that is based on the likelihood ratio test. An advantage of the new statistic is that it follows the standard normal distribution under the null hypothesis of no aberrant answer changes. It is demonstrated in a detailed simulation study that the Type I error rate of the new statistic is very close to the nominal level and the power of the new statistic is satisfactory in comparison to those of several existing statistics for detecting aberrant answer changes. The new statistic and several existing statistics were shown to provide useful information for a real data set. Given the increasing interest in analysis of answer changes, the new statistic promises to be useful to measurement practitioners.
    June 01, 2017   doi: 10.1111/jedm.12141   open full text
  • Stabilizing Conditional Standard Errors of Measurement in Scale Score Transformations.
    Tim Moses, YoungKoung Kim.
    Journal of Educational Measurement. June 01, 2017
    The focus of this article is on scale score transformations that can be used to stabilize conditional standard errors of measurement (CSEMs). Three transformations for stabilizing the estimated CSEMs are reviewed, including the traditional arcsine transformation, a recently developed general variance stabilization transformation, and a new method proposed in this article involving cubic transformations. Two examples are provided and the three scale score transformations are compared in terms of how well they stabilize CSEMs estimated from compound binomial and item response theory (IRT) models. Advantages of the cubic transformation are demonstrated with respect to CSEM stabilization and other scaling criteria (e.g., scale score distributions that are more symmetric).
    June 01, 2017   doi: 10.1111/jedm.12140   open full text
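The variance-stabilizing role of the traditional arcsine transformation reviewed in this abstract can be illustrated with a short sketch. This is the textbook binomial case only, not the article's cubic method, and the function names are illustrative: under a binomial model with n items, the raw-score CSEM varies with true proportion p, while the delta-method CSEM after g(x) = arcsin(sqrt(x/n)) is approximately 1/(2*sqrt(n)) for every p.

```python
import math

def raw_csem(p, n):
    # Binomial conditional SEM on the raw-score scale: sqrt(n*p*(1-p)), varies with p.
    return math.sqrt(n * p * (1 - p))

def arcsine_csem(p, n):
    # Delta-method CSEM after g(x) = asin(sqrt(x/n)):
    # g'(x) evaluated at x = n*p, times the raw-score SEM.
    # The product simplifies to 1/(2*sqrt(n)), constant in p.
    dgdx = 1.0 / (2 * n * math.sqrt(p * (1 - p)))
    return dgdx * raw_csem(p, n)
```

Evaluating both at several p values shows the raw CSEM peaking at p = .5 while the transformed CSEM stays flat, which is the stabilization property the article builds on.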
  • Dual‐Objective Item Selection Criteria in Cognitive Diagnostic Computerized Adaptive Testing.
    Hyeon‐Ah Kang, Susu Zhang, Hua‐Hua Chang.
    Journal of Educational Measurement. June 01, 2017
    The development of cognitive diagnostic‐computerized adaptive testing (CD‐CAT) has provided a new perspective for gaining information about examinees' mastery on a set of cognitive attributes. This study proposes a new item selection method within the framework of dual‐objective CD‐CAT that simultaneously addresses examinees' attribute mastery status and overall test performance. The new procedure is based on the Jensen‐Shannon (JS) divergence, a symmetrized version of the Kullback‐Leibler divergence. We show that the JS divergence resolves the noncomparability problem of the dual information index and has close relationships with Shannon entropy, mutual information, and Fisher information. The performance of the JS divergence is evaluated in simulation studies in comparison with the methods available in the literature. Results suggest that the JS divergence achieves parallel or more precise recovery of latent trait variables compared to the existing methods and maintains practical advantages in computation and item pool usage.
    June 01, 2017   doi: 10.1111/jedm.12139   open full text
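For readers unfamiliar with the Jensen-Shannon divergence underlying this item selection method, a minimal sketch of the general formula for discrete distributions (illustrative function names, not the article's item selection index): JS(P, Q) = KL(P || M)/2 + KL(Q || M)/2 with M = (P + Q)/2, which symmetrizes and bounds the Kullback-Leibler divergence.

```python
import math

def kl(p, q):
    # Kullback-Leibler divergence between two discrete distributions;
    # terms with p_i = 0 contribute zero by convention.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    # Jensen-Shannon divergence: average KL to the mixture M = (P + Q)/2.
    # Symmetric, and bounded above by log(2) in nats.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Unlike KL, js(p, q) == js(q, p) and it is finite even when the supports differ, which is what makes symmetrized indices comparable across candidate items.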
  • Structural Zeros and Their Implications With Log‐Linear Bivariate Presmoothing Under the Internal‐Anchor Design.
    Hyung Jin Kim, Robert L. Brennan, Won‐Chan Lee.
    Journal of Educational Measurement. June 01, 2017
    In equating, when common items are internal and scoring is conducted in terms of the number of correct items, some pairs of total scores (X) and common‐item scores (V) can never be observed in a bivariate distribution of X and V; these pairs are called structural zeros. This simulation study examines how equating results compare for different approaches to handling structural zeros. The study considers four approaches: the no‐smoothing, unique‐common, total‐common, and adjusted total‐common approaches. This study led to four main findings: (1) the total‐common approach generally had the worst results; (2) for relatively small effect sizes, the unique‐common approach generally had the smallest overall error; (3) for relatively large effect sizes, the adjusted total‐common approach generally had the smallest overall error; and, (4) if sole interest focuses on reducing bias only, the adjusted total‐common approach was generally preferable. These results suggest that, when common items are internal and log‐linear bivariate presmoothing is performed, structural zeros should be maintained, even if there is some loss in the moment preservation property.
    June 01, 2017   doi: 10.1111/jedm.12138   open full text
  • Statistically Modeling Individual Students’ Learning Over Successive Collaborative Practice Opportunities.
    Jennifer Olsen, Vincent Aleven, Nikol Rummel.
    Journal of Educational Measurement. March 06, 2017
    Within educational data mining, many statistical models capture the learning of students working individually. However, not much work has been done to extend these statistical models of individual learning to a collaborative setting, despite the effectiveness of collaborative learning activities. We extend a widely used model (the additive factors model) to account for the effect of collaboration on individual learning, including having the help of a partner and getting to observe/help a partner. We find evidence that models that include these collaborative features have a better fit than the original models for performance data and that learning rates estimated using the extended models provide insights into how collaboration benefits individual students’ learning outcomes.
    March 06, 2017   doi: 10.1111/jedm.12137   open full text
  • Mapping an Experiment‐Based Assessment of Collaborative Behavior Onto Collaborative Problem Solving in PISA 2015: A Cluster Analysis Approach for Collaborator Profiles.
    Katharina Herborn, Maida Mustafić, Samuel Greiff.
    Journal of Educational Measurement. March 06, 2017
    Collaborative problem solving (CPS) assessment is a new academic research field with a number of educational implications. In 2015, the Programme for International Student Assessment (PISA) assessed CPS with a computer‐simulated human‐agent (H‐A) approach that claimed to measure 12 individual CPS skills for the first time. After reviewing the approach, we conceptually embedded a computer‐based collaborative behavior assessment (COLBAS) into the overarching PISA 2015 CPS approach. COLBAS is an H‐A CPS assessment instrument that can be used to measure certain aspects of CPS. In addition, we applied a model‐based cluster analysis to the embedded COLBAS aspects. The analysis revealed three types of student collaborator profiles that differed in cognitive performance and motivation: (a) passive low‐performing (non‐)collaborators, (b) active high‐performing collaborators, and (c) compensating collaborators.
    March 06, 2017   doi: 10.1111/jedm.12135   open full text
  • Modeling Data From Collaborative Assessments: Learning in Digital Interactive Social Networks.
    Mark Wilson, Perman Gochyyev, Kathleen Scalise.
    Journal of Educational Measurement. March 06, 2017
    This article summarizes assessment of cognitive skills through collaborative tasks, using field test results from the Assessment and Teaching of 21st Century Skills (ATC21S) project. This project, sponsored by Cisco, Intel, and Microsoft, aims to help educators around the world equip students with the skills to succeed in future career and college goals. In this article, ATC21S collaborative assessments focus on the project's “ICT Literacy—Learning in digital networks” learning progression. The article includes a description of the development of the learning progression, as well as examples and the logic behind the instrument construction. Assessments were administered to randomly assigned pairs of students in a demonstration digital environment. Modeling of results employed unidimensional and multidimensional item response models, with and without random effects for groups. The results indicated that, based on this data set, the models that take group into consideration in both the unidimensional and the multidimensional analyses fit better. However, the group‐level variances were substantially higher than the individual‐level variances. This indicates that a total estimate of group plus individual is likely more informative than an individual estimate alone, but also that the performances of the pairs dominated the performances of the individuals. Implications are discussed in the results and conclusions.
    March 06, 2017   doi: 10.1111/jedm.12134   open full text
  • Measuring Student Engagement During Collaboration.
    Peter F. Halpin, Alina A. von Davier, Jiangang Hao, Lei Liu.
    Journal of Educational Measurement. March 06, 2017
    This article addresses performance assessments that involve collaboration among students. We apply the Hawkes process to infer whether the actions of one student are associated with increased probability of further actions by his/her partner(s) in the near future. This leads to an intuitive notion of engagement among collaborators, and we consider a model‐based index that can be used to quantify this notion. The approach is illustrated using a simulation‐based task designed for science education, in which pairs of collaborators interact using online chat. We also consider the empirical relationship between chat engagement and task performance, finding that less engaged collaborators were less likely to revise their responses after being given an opportunity to share their work with their partner.
    March 06, 2017   doi: 10.1111/jedm.12133   open full text
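A minimal sketch of the univariate Hawkes conditional intensity behind this engagement index, using the standard exponential kernel (the parameterization is the textbook one, not necessarily the article's): each past event at time t_i raises the current event rate by alpha * exp(-beta * (t - t_i)), so one collaborator's recent actions transiently increase the expected rate of further actions.

```python
import math

def hawkes_intensity(t, events, mu, alpha, beta):
    # Conditional intensity lambda(t) = mu + sum over past events t_i < t
    # of alpha * exp(-beta * (t - t_i)): a baseline rate mu plus
    # exponentially decaying excitation from each earlier event.
    return mu + sum(alpha * math.exp(-beta * (t - ti)) for ti in events if ti < t)
```

With no prior events the intensity equals the baseline mu; immediately after an event it jumps by nearly alpha and then decays back toward mu at rate beta.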
  • Modeling Collaborative Interaction Patterns in a Simulation‐Based Task.
    Jessica J. Andrews, Deirdre Kerr, Robert J. Mislevy, Alina A. von Davier, Jiangang Hao, Lei Liu.
    Journal of Educational Measurement. March 06, 2017
    Simulations and games offer interactive tasks that can elicit rich data, providing evidence of complex skills that are difficult to measure with more conventional items and tests. However, one notable challenge in using such technologies is making sense of the data generated in order to make claims about individuals or groups. This article presents a novel methodological approach that uses the process data and performance outcomes from a simulation‐based collaborative science assessment to explore the propensities of dyads to interact in accordance with certain interaction patterns. Further exploratory analyses examine how the approach can be used to answer important questions in collaboration research regarding gender and cultural differences in collaborative behavior and how interaction patterns relate to performance outcomes.
    March 06, 2017   doi: 10.1111/jedm.12132   open full text
  • Assessing Students in Human‐to‐Agent Settings to Inform Collaborative Problem‐Solving Learning.
    Yigal Rosen.
    Journal of Educational Measurement. March 06, 2017
    In order to understand potential applications of collaborative problem‐solving (CPS) assessment tasks, it is necessary to examine empirically the multifaceted student performance that may be distributed across collaboration methods and purposes of the assessment. Ideally, each student should be matched with various types of group members and must apply the skills in varied contexts and tasks. One solution to these assessment demands is to use computer‐based (virtual) agents to serve as the collaborators in the interactions with students. This article proposes a human‐to‐agent (H‐A) approach for formative CPS assessment and describes an international pilot study aimed at providing preliminary empirical findings on the use of H‐A CPS assessment to inform collaborative learning. Overall, the findings showed promise in terms of using an H‐A CPS assessment task as a formative tool for structuring effective groups in the context of CPS online learning.
    March 06, 2017   doi: 10.1111/jedm.12131   open full text
  • Designs for Operationalizing Collaborative Problem Solving for Automated Assessment.
    Claire Scoular, Esther Care, Friedrich W. Hesse.
    Journal of Educational Measurement. March 06, 2017
    Collaborative problem solving is a complex skill set that draws on social and cognitive factors. The construct remains in its infancy due to lack of empirical evidence that can be drawn upon for validation. The differences and similarities between two large‐scale initiatives that reflect this state of the art, in terms of underlying assumptions about the construct and approach to task development, are outlined. The goal is to clarify how definitions of the nature of the construct impact the approach to design of assessment tasks. Illustrations of two different approaches to the development of a task designed to elicit behaviors that manifest the construct are presented. The method highlights the degree to which these approaches might constrain a comprehensive assessment of the construct.
    March 06, 2017   doi: 10.1111/jedm.12130   open full text
  • Further Study of the Choice of Anchor Tests in Equating.
    Tammy J. Trierweiler, Charles Lewis, Robert L. Smith.
    Journal of Educational Measurement. December 01, 2016
    In this study, we describe what factors influence the observed score correlation between an (external) anchor test and a total test. We show that the anchor to full‐test observed score correlation is based on two components: the true score correlation between the anchor and total test, and the reliability of the anchor test. Findings using an analytical approach suggest that making an anchor test a miditest does not generally maximize the anchor to total test correlation. Results are discussed in the context of what conditions maximize the correlations between the anchor and total test.
    December 01, 2016   doi: 10.1111/jedm.12128   open full text
  • Autoscoring Essays Based on Complex Networks.
    Xiaohua Ke, Yongqiang Zeng, Haijiao Luo.
    Journal of Educational Measurement. December 01, 2016
    This article presents a novel method, the Complex Dynamics Essay Scorer (CDES), for automated essay scoring using complex network features. Texts produced by college students in China were represented as scale‐free networks (e.g., a word adjacency model) from which typical network features, such as the in‐/out‐degrees, clustering coefficient (CC), and dynamic networks, were obtained. The CDES integrates the classical concepts of network feature representation and essay score series variation. Several experiments indicated that the network measures distinguish essays of different quality, demonstrating that complex networks can be developed for autoscoring tasks. The average agreement of the CDES and human rater scores was 86.5%, and the average Pearson correlation was .77. The results indicate that the CDES produced functional complex systems and autoscored Chinese essays in a manner consistent with human raters. Our research suggests potential applications in other areas of educational assessment.
    December 01, 2016   doi: 10.1111/jedm.12127   open full text
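The word-adjacency network representation mentioned in this abstract can be sketched very simply. This is an illustrative fragment, not the CDES feature set: a directed edge is drawn from each token to its successor, and the out-degree (number of distinct successors per word) is one of the basic network features such a model yields.

```python
from collections import defaultdict

def word_adjacency(tokens):
    # Build a directed word-adjacency network: an edge from each token
    # to the token that follows it. Returns the out-degree (count of
    # distinct successors) for every word that has at least one edge out.
    successors = defaultdict(set)
    for a, b in zip(tokens, tokens[1:]):
        successors[a].add(b)
    return {word: len(nbrs) for word, nbrs in successors.items()}
```

Richer features used in this line of work (clustering coefficients, degree distributions for the scale-free test) are computed on the same graph.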
  • Asymptotic Standard Errors of Observed‐Score Equating With Polytomous IRT Models.
    Björn Andersson.
    Journal of Educational Measurement. December 01, 2016
    In observed‐score equipercentile equating, the goal is to make scores on two scales or tests measuring the same construct comparable by matching the percentiles of the respective score distributions. If the tests consist of different items with multiple categories for each item, a suitable model for the responses is a polytomous item response theory (IRT) model. The parameters from such a model can be utilized to derive the score probabilities for the tests and these score probabilities may then be used in observed‐score equating. In this study, the asymptotic standard errors of observed‐score equating using score probability vectors from polytomous IRT models are derived using the delta method. The results are applied to the equivalent groups design and the nonequivalent groups design with either chain equating or poststratification equating within the framework of kernel equating. The derivations are presented in a general form and specific formulas for the graded response model and the generalized partial credit model are provided. The asymptotic standard errors are accurate under several simulation conditions relating to sample size, distributional misspecification and, for the nonequivalent groups design, anchor test length.
    December 01, 2016   doi: 10.1111/jedm.12126   open full text
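The delta method used in this derivation can be illustrated generically (a numerical-gradient sketch, not the article's closed-form results for equating functions): the approximate variance of a smooth function f of an estimator is grad(f)' Sigma grad(f), where Sigma is the estimator's covariance matrix.

```python
def delta_method_se(f, theta, cov, eps=1e-6):
    # Delta-method standard error of f(theta_hat):
    # sqrt(grad' * Sigma * grad), with the gradient of f at theta
    # approximated by central differences of step eps.
    k = len(theta)
    grad = []
    for i in range(k):
        up, dn = list(theta), list(theta)
        up[i] += eps
        dn[i] -= eps
        grad.append((f(up) - f(dn)) / (2 * eps))
    var = sum(grad[i] * cov[i][j] * grad[j] for i in range(k) for j in range(k))
    return var ** 0.5
```

For a linear f the approximation is exact, e.g. f(theta) = theta_1 + theta_2 with an identity covariance gives SE = sqrt(2); for equating functions of score probabilities the same machinery yields the asymptotic standard errors the article derives analytically.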
  • Diagnostic Profiles: A Standard Setting Method for Use With a Cognitive Diagnostic Model.
    Gary Skaggs, Serge F. Hein, Jesse L. M. Wilkins.
    Journal of Educational Measurement. December 01, 2016
    This article introduces the Diagnostic Profiles (DP) standard setting method for setting a performance standard on a test developed from a cognitive diagnostic model (CDM), the outcome of which is a profile of mastered and not‐mastered skills or attributes rather than a single test score. In the DP method, the key judgment task for panelists is a decision on whether or not individual cognitive skill profiles meet the performance standard. A randomized experiment was carried out in which secondary mathematics teachers were randomly assigned to either the DP method or the modified Angoff method. The standard setting methods were applied to a test of student readiness to enter high school algebra (Algebra I). While the DP profile judgments were perceived to be more difficult than the Angoff item judgments, there was a high degree of agreement among the panelists for most of the profiles. In order to compare the methods, cut scores were generated from the DP method. The results of the DP group were comparable to the Angoff group, with less cut score variability in the DP group. The DP method shows promise for testing situations in which diagnostic information is needed about examinees and where that information needs to be linked to a performance standard.
    December 01, 2016   doi: 10.1111/jedm.12125   open full text
  • A Short Note on the Relationship Between Pass Rate and Multiple Attempts.
    Ying Cheng, Cheng Liu.
    Journal of Educational Measurement. December 01, 2016
    For a certification, licensure, or placement exam, allowing examinees to take multiple attempts at the test could effectively change the pass rate. Change in the pass rate can occur without any change in the underlying latent trait, and can be an artifact of multiple attempts and imperfect reliability of the test. By deriving formulae to compute the pass rate under two definitions, this article provides tools for testing practitioners to compute and evaluate the change in the expected pass rate when a certain (maximum) number of attempts is allowed without any change in the latent trait. This article also includes a simulation study that considers change in ability and differential motivation of examinees to retake the test. Results indicate that the general trend shown by the analytical results is maintained—that is, the marginal expected pass rate increases with more attempts when the testing volume is defined as the total number of test takers, and decreases with more attempts when the testing volume is defined as the total number of test attempts.
    December 01, 2016   doi: 10.1111/jedm.12124   open full text
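The mechanism described — pass rates shifting with retake opportunities even though the latent trait is fixed — can be illustrated with a small classical-test-theory simulation. The reliability, cut score, and retake rule below are illustrative choices, not the article's formulae:

```python
import numpy as np

rng = np.random.default_rng(0)
n, cut, rel = 100_000, 0.0, 0.8            # examinees, cut score, reliability
theta = rng.normal(0, 1, n)                # latent trait, fixed across attempts
err_sd = np.sqrt(1 / rel - 1)              # error SD implied by the reliability

def pass_rates(max_attempts):
    """Pass rate per test taker and per test attempt when only failing
    examinees retake, up to max_attempts times."""
    passed = np.zeros(n, bool)
    attempts = np.zeros(n, int)
    for _ in range(max_attempts):
        retake = ~passed
        attempts[retake] += 1
        score = theta[retake] + rng.normal(0, err_sd, retake.sum())
        passed[retake] |= score >= cut
    return passed.mean(), passed.sum() / attempts.sum()

for k in (1, 2, 3):
    print(k, pass_rates(k))
```

The per-taker rate climbs with each allowed attempt while the per-attempt rate falls, mirroring the trend the abstract reports for the two testing-volume definitions.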
  • Effect Size Measures for Differential Item Functioning in a Multidimensional IRT Model.
    Youngsuk Suh.
    Journal of Educational Measurement. December 01, 2016
    This study adapted an effect size measure used for studying differential item functioning (DIF) in unidimensional tests and extended the measure to multidimensional tests. Two effect size measures were considered in a multidimensional item response theory model: signed weighted P‐difference and unsigned weighted P‐difference. The performance of the effect size measures was investigated under various simulation conditions including different sample sizes and DIF magnitudes. As another way of studying DIF, the χ2 difference test was included to compare the result of statistical significance (statistical tests) with that of practical significance (effect size measures). The adequacy of existing effect size criteria used in unidimensional tests was also evaluated. Both effect size measures worked well in estimating true effect sizes, identifying DIF types, and classifying effect size categories. Finally, a real data analysis was conducted to support the simulation results.
    December 01, 2016   doi: 10.1111/jedm.12123   open full text
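The two effect sizes can be written as weighted averages of the between-group difference in item-correct probabilities over an ability grid, with the sign either retained or taken in absolute value. The 2PL curves and focal-group weights below are illustrative stand-ins:

```python
import numpy as np

def weighted_p_diff(theta, w, p_ref, p_foc):
    """Signed and unsigned weighted P-difference on a quadrature grid;
    w holds focal-group ability weights summing to 1."""
    d = p_ref(theta) - p_foc(theta)
    return np.sum(w * d), np.sum(w * np.abs(d))

icc = lambda a, b: (lambda t: 1 / (1 + np.exp(-a * (t - b))))
theta = np.linspace(-4, 4, 81)
w = np.exp(-theta**2 / 2)
w /= w.sum()
# uniform DIF: same discrimination, shifted difficulty for the focal group
signed, unsigned = weighted_p_diff(theta, w, icc(1.2, 0.0), icc(1.2, 0.5))
print(round(signed, 3), round(unsigned, 3))
```

For uniform DIF the two indices coincide; nonuniform DIF (crossing curves) drives them apart, which is what makes reporting both informative.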
  • Higher Education Value Added Using Multiple Outcomes.
    Joniada Milla, Ernesto San Martín, Sébastien Van Bellegem.
    Journal of Educational Measurement. September 01, 2016
    In this article we develop a methodology for the joint value added analysis of multiple outcomes that takes into account the inherent correlation among them. This is especially crucial in the analysis of higher education institutions. We use a unique Colombian database on universities, which contains scores in five domains tested in a standardized exit examination that is compulsory in order to graduate. We develop a new estimation procedure that accommodates any number of outcomes. Another novelty of our method concerns the structure of the random‐effect covariance matrix: effects of the same school can be correlated, and this correlation is allowed to vary among schools.
    September 01, 2016   doi: 10.1111/jedm.12114   open full text
  • Investigating College Learning Gain: Exploring a Propensity Score Weighting Approach.
    Ou Lydia Liu, Huili Liu, Katrina Crotts Roohr, Daniel F. McCaffrey.
    Journal of Educational Measurement. September 01, 2016
    Learning outcomes assessment has been widely used by higher education institutions both nationally and internationally. One of its popular uses is to document learning gains of students. Prior studies have recognized the potential imbalance between freshmen and seniors in terms of their background characteristics and their prior academic performance and have used linear regression adjustments for these differences, which some researchers have argued are not fully adequate. We explored an alternative adjustment via propensity score weighting to balance the samples on background variables including SAT score, gender, and ethnicity. Results involving a cross‐sectional sample of freshmen and seniors from seven groups of majors within a large research university showed that students in most of the majors demonstrated significant learning gain. Additionally, there was a slight difference in learning gain rankings across major groupings when compared to multiple regression results.
    September 01, 2016   doi: 10.1111/jedm.12112   open full text
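The weighting idea — reweighting one group so its covariate distribution matches the other's — can be sketched with a logistic propensity model fit by Newton's method. The data-generating numbers and single SAT covariate are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4000
sat = rng.normal(0, 1, n)                         # standardized covariate
senior = rng.binomial(1, 1 / (1 + np.exp(-sat)))  # seniors skew toward high SAT

# fit P(senior | sat) with a logistic model via Newton's method
X = np.column_stack([np.ones(n), sat])
beta = np.zeros(2)
for _ in range(10):
    p = 1 / (1 + np.exp(-X @ beta))
    W = p * (1 - p)
    beta += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (senior - p))
p = 1 / (1 + np.exp(-X @ beta))

# weight freshmen by the odds p/(1-p) so they resemble the senior group
w = np.where(senior == 1, 1.0, p / (1 - p))
raw_gap = sat[senior == 1].mean() - sat[senior == 0].mean()
wt_gap = sat[senior == 1].mean() - np.average(sat[senior == 0],
                                              weights=w[senior == 0])
print(round(raw_gap, 2), round(wt_gap, 2))
```

After weighting, the SAT gap between the groups shrinks toward zero; the study applies the same logic with gender and ethnicity added to the propensity model.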
  • Pretest‐Posttest‐Posttest Multilevel IRT Modeling of Competence Growth of Students in Higher Education in Germany.
    Susanne Schmidt, Olga Zlatkin‐Troitschanskaia, Jean‐Paul Fox.
    Journal of Educational Measurement. September 01, 2016
    Longitudinal research in higher education faces several challenges. Appropriate methods of analyzing competence growth of students are needed to deal with those challenges and thereby obtain valid results. In this article, a pretest‐posttest‐posttest multivariate multilevel IRT model for repeated measures is introduced, designed to address educational research questions arising from a German research project. In this model, dependencies between repeated observations of the same students are considered not, as is usual, by clustering observations within participants but rather by clustering observations within semesters. Estimation of the model is conducted within a Bayesian framework. Results indicate that competences grew over time. Gender, intelligence, motivation, and prior education could explain differences in the level of competence among business and economics students.
    September 01, 2016   doi: 10.1111/jedm.12115   open full text
  • No Second Chance to Make a First Impression: The “Thin‐Slice” Effect on Instructor Ratings and Learning Outcomes in Higher Education.
    Preeti G. Samudra, Inah Min, Kai S. Cortina, Kevin F. Miller.
    Journal of Educational Measurement. September 01, 2016
    Prior research has found strong and persistent effects of instructor first impressions on student evaluations. Because these studies look at real classroom lessons, this finding fits two different interpretations: (1) first impressions may color student experience of instruction regardless of lesson quality, or (2) first impressions may provide valid evidence for instructional quality. By using scripted lessons, we experimentally investigated how first impression and instruction quality related to learning and evaluation of instruction among college students. Results from two studies indicate that quality of instruction is the strongest determinant of student factual and conceptual learning, but that both instructional quality and first impressions affect evaluations of the instructor. First impressions matter, but our findings suggest that lesson quality matters more.
    September 01, 2016   doi: 10.1111/jedm.12116   open full text
  • Integrating the Analysis of Mental Operations Into Multilevel Models to Validate an Assessment of Higher Education Students’ Competency in Business and Economics.
    Sebastian Brückner, James W. Pellegrino.
    Journal of Educational Measurement. September 01, 2016
    The Standards for Educational and Psychological Testing indicate that validation of assessments should include analyses of participants’ response processes. However, such analyses typically are conducted only to supplement quantitative field studies with qualitative data, and seldom are such data connected to quantitative data on student or item performance. This paper presents an example of how data from an analysis of mental operations collected using a sociocognitive approach can be quantitatively integrated with other data on student and item performance to validate in part an assessment of higher education students’ competency in business and economics. Evidence of forward reasoning and paraphrasing as mental operations is obtained using the think‐aloud method. As part of the validity argument and to enhance credibility of the findings, the generalized linear models are expressed as multilevel models in which the analyses of response processes are aligned with quantitative findings from large‐scale field studies.
    September 01, 2016   doi: 10.1111/jedm.12113   open full text
  • How Developments in Psychology and Technology Challenge Validity Argumentation.
    Robert J. Mislevy.
    Journal of Educational Measurement. September 01, 2016
    Validity is the sine qua non of properties of educational assessment. While a theory of validity and a practical framework for validation have emerged over the past decades, most of the discussion has addressed familiar forms of assessment and psychological framings. Advances in digital technologies and in cognitive and social psychology have expanded the range of purposes, targets of inference, contexts of use, forms of activity, and sources of evidence we now see in educational assessment. This article discusses some of these developments and how concepts and representations that are employed to design and use assessments, hence to frame validity arguments, can be extended accordingly. Ideas are illustrated with a variety of examples, with an emphasis on assessment in higher education.
    September 01, 2016   doi: 10.1111/jedm.12117   open full text
  • Semiparametric Item Response Functions in the Context of Guessing.
    Carl F. Falk, Li Cai.
    Journal of Educational Measurement. June 01, 2016
    We present a logistic function of a monotonic polynomial with a lower asymptote, allowing additional flexibility beyond the three‐parameter logistic model. We develop a maximum marginal likelihood‐based approach to estimate the item parameters. The new item response model is demonstrated on math assessment data from a state, and a computationally efficient strategy for choosing the order of the polynomial is demonstrated. Finally, our approach is tested through simulations and compared to response function estimation using smoothed isotonic regression. Results indicate that our approach can result in small gains in item response function recovery and latent trait estimation.
    June 01, 2016   doi: 10.1111/jedm.12111   open full text
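A minimal version of such an item response function pairs a lower asymptote with a logistic of a monotone polynomial. Restricting to odd powers with positive coefficients is one simple way (not necessarily the paper's parameterization) to guarantee monotonicity; setting the cubic coefficient to zero recovers the three-parameter logistic model:

```python
import numpy as np

def irf(theta, c, b0, b1, b3):
    """Lower asymptote c plus (1 - c) times a logistic of a cubic
    m(theta); m is monotone increasing whenever b1 > 0 and b3 > 0."""
    m = b0 + b1 * theta + b3 * theta**3
    return c + (1 - c) / (1 + np.exp(-m))

theta = np.linspace(-3, 3, 61)
p = irf(theta, c=0.2, b0=0.0, b1=1.0, b3=0.15)   # illustrative parameters
print(bool(np.all(np.diff(p) > 0)), float(p.min()))
```

The extra polynomial terms let the curve steepen or flatten asymmetrically around its inflection point, which is the flexibility beyond the 3PL that the abstract describes.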
  • On the Issue of Item Selection in Computerized Adaptive Testing With Response Times.
    Bernard P. Veldkamp.
    Journal of Educational Measurement. June 01, 2016
    Many standardized tests are now administered via computer rather than paper‐and‐pencil format. The computer‐based delivery mode brings with it certain advantages. One advantage is the ability to adapt the difficulty level of the test to the ability level of the test taker in what has been termed computerized adaptive testing (CAT). A second advantage is the ability to record not only the test taker's response to each item (i.e., question), but also the amount of time the test taker spends considering and answering each item. Combining these two advantages, various methods were explored for utilizing response time data in selecting appropriate items for an individual test taker. Four strategies for incorporating response time data were evaluated, and the precision of the final test‐taker score was assessed by comparing it to a benchmark value that did not take response time information into account. While differences in measurement precision and testing times were expected, results showed that the strategies did not differ much with respect to measurement precision but that there were differences with regard to the total testing time.
    June 01, 2016   doi: 10.1111/jedm.12110   open full text
  • Using Networks to Visualize and Analyze Process Data for Educational Assessment.
    Mengxiao Zhu, Zhan Shu, Alina A. von Davier.
    Journal of Educational Measurement. June 01, 2016
    New technology enables interactive and adaptive scenario‐based tasks (SBTs) to be adopted in educational measurement. At the same time, it is a challenging problem to build appropriate psychometric models to analyze data collected from these tasks, due to the complexity of the data. This study focuses on process data collected from SBTs. We explore the potential of using concepts and methods from social network analysis to represent and analyze process data. Empirical data were collected from the assessment of Technology and Engineering Literacy, conducted as part of the National Assessment of Educational Progress. For the activity sequences in the process data, we created a transition network using weighted directed networks, with nodes representing actions and directed links connecting two actions only if the first action is followed by the second action in the sequence. This study shows how visualization of the transition networks represents process data and provides insights for item design. This study also explores how network measures are related to existing scoring rubrics and how detailed network measures can be used to make intergroup comparisons.
    June 01, 2016   doi: 10.1111/jedm.12107   open full text
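The transition-network construction described — actions as nodes, directed edges weighted by how often one action immediately follows another — reduces to counting consecutive pairs across examinees' action sequences. The action labels here are invented for illustration:

```python
from collections import Counter

def transition_network(sequences):
    """Weighted directed edges: counts of (action_i -> action_{i+1})
    aggregated over all examinee action sequences."""
    edges = Counter()
    for seq in sequences:
        edges.update(zip(seq, seq[1:]))
    return edges

# toy action logs from a scenario-based task (hypothetical labels)
logs = [["open", "zoom", "measure", "submit"],
        ["open", "measure", "zoom", "measure", "submit"]]
net = transition_network(logs)
print(net[("measure", "submit")])
```

Network measures (degree, path counts, edge weights) computed on this structure are what the study relates to scoring rubrics and intergroup comparisons.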
  • Equating With Miditests Using IRT.
    Joseph Fitzpatrick, William P. Skorupski.
    Journal of Educational Measurement. June 01, 2016
    The equating performance of two internal anchor test structures—miditests and minitests—is studied for four IRT equating methods using simulated data. Originally proposed by Sinharay and Holland, miditests are anchors that have the same mean difficulty as the overall test but less variance in item difficulties. Four popular IRT equating methods were tested, and both the means and SDs of the true ability of the group to be equated were varied. We evaluate equating accuracy marginally and conditional on true ability. Our results suggest miditests perform about as well as traditional minitests for most conditions. Findings are discussed in terms of comparability to the typical minitest design and the trade‐off between accuracy and flexibility in test construction.
    June 01, 2016   doi: 10.1111/jedm.12109   open full text
  • A Comparison of Linking Methods for Estimating National Trends in International Comparative Large‐Scale Assessments in the Presence of Cross‐National DIF.
    Karoline A. Sachse, Alexander Roppelt, Nicole Haag.
    Journal of Educational Measurement. June 01, 2016
    Trend estimation in international comparative large‐scale assessments relies on measurement invariance between countries. However, cross‐national differential item functioning (DIF) has been repeatedly documented. We ran a simulation study to compare the trend estimation performance of national item parameters, which require trends to be computed separately for each country, with that of two linking methods employing international item parameters across several conditions. The trend estimates based on the national item parameters were more accurate than the trend estimates based on the international item parameters when cross‐national DIF was present. Moreover, the use of fixed common item parameter calibrations led to biased trend estimates. The detection and elimination of DIF can reduce this bias but is also likely to increase the total error.
    June 01, 2016   doi: 10.1111/jedm.12106   open full text
  • Monitoring Items in Real Time to Enhance CAT Security.
    Jinming Zhang, Jie Li.
    Journal of Educational Measurement. June 01, 2016
    An IRT‐based sequential procedure is developed to monitor items for enhancing test security. The procedure uses a series of statistical hypothesis tests to examine whether the statistical characteristics of each item under inspection have changed significantly during CAT administration. This procedure is compared with a previously developed CTT‐based procedure through simulation studies. The results show that when the total number of examinees is fixed, both procedures can control the rate of Type I errors at any reasonable significance level by choosing an appropriate cutoff point while maintaining a low rate of Type II errors. Further, the IRT‐based method has a much lower Type II error rate (i.e., greater power) than the CTT‐based method when the number of compromised items is small (e.g., 5); this advantage can be realized if the IRT‐based procedure is applied in an active mode, in the sense that flagged items can be replaced with new items.
    June 01, 2016   doi: 10.1111/jedm.12104   open full text
  • The Effect of Differential Motivation on IRT Linking.
    Marie‐Anne Mittelhaëuser, Anton A. Béguin, Klaas Sijtsma.
    Journal of Educational Measurement. September 01, 2015
    The purpose of this study was to investigate whether simulated differential motivation between the stakes for operational tests and anchor items produces an invalid linking result if the Rasch model is used to link the operational tests. This was done for an external anchor design and a variation of a pretest design. The study also investigated whether a constrained mixture Rasch model could identify latent classes in such a way that one latent class represented high‐stakes responding while the other represented low‐stakes responding. The results indicated that for an external anchor design, the Rasch linking result was only biased when the motivation level differed between the subpopulations to which the anchor items were administered. However, the mixture Rasch model did not identify the classes representing low‐stakes and high‐stakes responding. When a pretest design was used to link the operational tests by means of a Rasch model, the linking result was found to be biased in each condition. Bias increased as percentage of students showing low‐stakes responding to the anchor items increased. The mixture Rasch model only identified the classes representing low‐stakes and high‐stakes responding under a limited number of conditions.
    September 01, 2015   doi: 10.1111/jedm.12080   open full text
  • Filtering Data for Detecting Differential Development.
    Matthieu J. S. Brinkhuis, Marjan Bakker, Gunter Maris.
    Journal of Educational Measurement. September 01, 2015
    The amount of data available in the context of educational measurement has vastly increased in recent years. Such data are often incomplete, involve tests administered at different time points and during the course of many years, and can therefore be quite challenging to model. In addition, intermediate results like grades or report cards being available to pupils, teachers, parents, and policy makers might influence future performance, which adds to the modeling difficulties. We propose the use of simple data filters to obtain a reduced set of relevant data, which allows for simple checks on the relative development of persons, items, or both.
    September 01, 2015   doi: 10.1111/jedm.12078   open full text
  • Comparing the Effectiveness of Self‐Paced and Collaborative Frame‐of‐Reference Training on Rater Accuracy in a Large‐Scale Writing Assessment.
    Kevin R. Raczynski, Allan S. Cohen, George Engelhard, Zhenqiu Lu.
    Journal of Educational Measurement. September 01, 2015
    There is a large body of research on the effectiveness of rater training methods in the industrial and organizational psychology literature. Less has been reported in the measurement literature on large‐scale writing assessments. This study compared the effectiveness of two widely used rater training methods—self‐paced and collaborative frame‐of‐reference training—in the context of a large‐scale writing assessment. Sixty‐six raters were randomly assigned to the training methods. After training, all raters scored the same 50 representative essays prescored by a group of expert raters. A series of generalized linear mixed models were then fitted to the rating data. Results suggested that the self‐paced method was equivalent in effectiveness to the more time‐intensive and expensive collaborative method. Implications for large‐scale writing assessments and suggestions for further research are discussed.
    September 01, 2015   doi: 10.1111/jedm.12079   open full text
  • A Stepwise Test Characteristic Curve Method to Detect Item Parameter Drift.
    Rui Guo, Yi Zheng, Hua‐Hua Chang.
    Journal of Educational Measurement. September 01, 2015
    An important assumption of item response theory is item parameter invariance. Sometimes, however, item parameters are not invariant across different test administrations due to factors other than sampling error; this phenomenon is termed item parameter drift. Several methods have been developed to detect drifted items. However, most of the existing methods were designed to detect drift in individual items, which may not be adequate for test characteristic curve–based linking or equating. One example is item response theory–based true score equating, whose goal is to generate a conversion table relating number‐correct scores on two forms based on their test characteristic curves. This article introduces a stepwise test characteristic curve method that detects item parameter drift iteratively from the test characteristic curves themselves, without needing to set any predetermined critical values. Comparisons are made between the proposed method and two existing methods under the three‐parameter logistic item response model through simulation and real data analysis. Results show that the proposed method produces a small difference in test characteristic curves between administrations, an accurate conversion table, and a good classification of drifted and nondrifted items, while at the same time retaining a large number of linking items.
    September 01, 2015   doi: 10.1111/jedm.12077   open full text
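The core quantity in a TCC-based drift check is the gap between the two administrations' test characteristic curves, recomputed as candidate items are dropped. One stepwise iteration of that idea, under a 2PL with made-up parameters (the article works with the 3PL and iterates to convergence), looks like:

```python
import numpy as np

def tcc_gap(a1, b1, a2, b2, theta, keep):
    """Max absolute difference between the two administrations' test
    characteristic curves, using only the items flagged in `keep`."""
    p1 = 1 / (1 + np.exp(-a1[keep][:, None] * (theta - b1[keep][:, None])))
    p2 = 1 / (1 + np.exp(-a2[keep][:, None] * (theta - b2[keep][:, None])))
    return np.abs(p1.sum(axis=0) - p2.sum(axis=0)).max()

theta = np.linspace(-3, 3, 61)
a1 = np.array([1.0, 1.2, 0.8]); b1 = np.array([-0.5, 0.0, 0.5])
a2 = a1.copy();                 b2 = b1 + np.array([0.0, 0.8, 0.0])  # item 1 drifts
keep = np.ones(3, dtype=bool)
# one stepwise iteration: drop the item whose removal most shrinks the gap
gaps = [tcc_gap(a1, b1, a2, b2, theta, keep & (np.arange(3) != j))
        for j in range(3)]
print(int(np.argmin(gaps)))    # index of the item whose removal helps most
```

Repeating this step until the gap stops shrinking yields the drift classification without fixed critical values, which is the spirit of the stepwise procedure.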
  • Criterion‐Related Validity: Assessing the Value of Subscores.
    Mark L. Davison, Ernest C. Davenport, Yu‐Feng Chang, Kory Vue, Shiyang Su.
    Journal of Educational Measurement. September 01, 2015
    Criterion‐related profile analysis (CPA) can be used to assess whether subscores of a test or test battery account for more criterion variance than does a single total score. Application of CPA to subscore evaluation is described, compared to alternative procedures, and illustrated using SAT data. Considerations other than validity and reliability are discussed, including broad societal goals (e.g., affirmative action), fairness, and ties in expected criterion predictions. In simulation data, CPA results were sensitive to subscore correlations, sample size, and the proportion of criterion‐related variance accounted for by the subscores. CPA can be a useful component in a thorough subscore evaluation encompassing subscore reliability, validity, distinctiveness, fairness, and broader societal goals.
    September 01, 2015   doi: 10.1111/jedm.12081   open full text
  • Differential Item Functioning Assessment in Cognitive Diagnostic Modeling: Application of the Wald Test to Investigate DIF in the DINA Model.
    Likun Hou, Jimmy de la Torre, Ratna Nandakumar.
    Journal of Educational Measurement. March 27, 2014
    Analyzing examinees’ responses using cognitive diagnostic models (CDMs) has the advantage of providing diagnostic information. To ensure the validity of the results from these models, differential item functioning (DIF) in CDMs needs to be investigated. In this article, the Wald test is proposed to examine DIF in the context of CDMs. This study explored the effectiveness of the Wald test in detecting both uniform and nonuniform DIF in the DINA model through a simulation study. Results of this study suggest that for relatively discriminating items, the Wald test had Type I error rates close to the nominal level. Moreover, its viability was underscored by the medium to high power rates for most investigated DIF types when DIF size was large. Furthermore, the performance of the Wald test in detecting uniform DIF was compared to that of the traditional Mantel‐Haenszel (MH) and SIBTEST procedures. The results of the comparison study showed that the Wald test was comparable to or outperformed the MH and SIBTEST procedures. Finally, the strengths and limitations of the proposed method and suggestions for future studies are discussed.
    March 27, 2014   doi: 10.1111/jedm.12036   open full text
  • The Random‐Effect DINA Model.
    Hung‐Yu Huang, Wen‐Chung Wang.
    Journal of Educational Measurement. March 27, 2014
    The DINA (deterministic inputs, noisy “and” gate) model has been widely used in cognitive diagnosis tests and in the process of test development. The slip and guess parameters in the DINA model's item response function capture, respectively, answering incorrectly despite mastery and answering correctly without it. This study aimed to extend the DINA model by using the random‐effect approach to allow examinees to have different probabilities of slipping and guessing. Two extensions of the DINA model were developed and tested to represent the random components of slipping and guessing. The first model assumed that a random variable can be incorporated in the slipping parameters to allow examinees to have different levels of caution. The second model assumed that the examinees’ ability may increase the probability of a correct response if they have not mastered all of the required attributes of an item. The results of a series of simulations based on Markov chain Monte Carlo methods showed that the model parameters and attribute‐mastery profiles can be recovered relatively accurately from the generating models and that neglect of the random effects produces biases in parameter estimation. Finally, a fraction subtraction test was used as an empirical example to demonstrate the application of the new models.
    March 27, 2014   doi: 10.1111/jedm.12035   open full text
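For reference, the base DINA response function that both extensions build on: an examinee answers correctly with probability 1 − s if all attributes the item requires are mastered, and with probability g otherwise. The profiles, Q-matrix, and parameter values here are toy inputs; the random-effect versions described above would let s or g vary across examinees:

```python
import numpy as np

def dina_p_correct(alpha, q, slip, guess):
    """DINA: correct with prob 1 - slip if the examinee masters every
    attribute the item requires (eta = 1), else with prob guess."""
    eta = np.all(alpha[:, None, :] >= q[None, :, :], axis=2)
    return np.where(eta, 1 - slip, guess)

alpha = np.array([[1, 1], [1, 0]])     # two examinees' attribute profiles
q = np.array([[1, 1], [1, 0]])         # Q-matrix: attributes each item needs
slip, guess = np.array([0.1, 0.2]), np.array([0.25, 0.3])
print(dina_p_correct(alpha, q, slip, guess))
```

The conjunctive ("and-gate") structure is visible in `eta`: missing any single required attribute drops the examinee to the guessing probability.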
  • Local Observed‐Score Kernel Equating.
    Marie Wiberg, Wim J. van der Linden, Alina A. von Davier.
    Journal of Educational Measurement. March 27, 2014
    Three local observed‐score kernel equating methods that integrate methods from the local equating and kernel equating frameworks are proposed. The new methods were compared with their earlier counterparts with respect to such measures as bias—as defined by Lord's criterion of equity—and percent relative error. The local kernel item response theory observed‐score equating method, which can be used for any of the common equating designs, had a small amount of bias, a low percent relative error, and a relatively low kernel standard error of equating, even when the accuracy of the test was reduced. The local kernel equating methods for the nonequivalent groups with anchor test generally had low bias and were quite stable against changes in the accuracy or length of the anchor test. Although all proposed methods showed small percent relative errors, the local kernel equating methods for the nonequivalent groups with anchor test design had somewhat larger standard error of equating than their kernel method counterparts.
    March 27, 2014   doi: 10.1111/jedm.12034   open full text
  • Evaluating Equating Accuracy and Assumptions for Groups That Differ in Performance.
    Sonya Powers, Michael J. Kolen.
    Journal of Educational Measurement. March 27, 2014
    Accurate equating results are essential when comparing examinee scores across exam forms. Previous research indicates that equating results may not be accurate when group differences are large. This study compared the equating results of frequency estimation, chained equipercentile, item response theory (IRT) true‐score, and IRT observed‐score equating methods. Using mixed‐format test data, equating results were evaluated for group differences ranging from 0 to .75 standard deviations. As group differences increased, equating results became increasingly biased and dissimilar across equating methods. Results suggest that the size of group differences, the likelihood that equating assumptions are violated, and the equating error associated with an equating method should be taken into consideration when choosing an equating method.
    March 27, 2014   doi: 10.1111/jedm.12033   open full text
  • Multidimensional CAT Item Selection Methods for Domain Scores and Composite Scores With Item Exposure Control and Content Constraints.
    Lihua Yao.
    Journal of Educational Measurement. March 27, 2014
    The intent of this research was to find an item selection procedure in the multidimensional computer adaptive testing (CAT) framework that yielded higher precision for both the domain and composite abilities, had a higher usage of the item pool, and controlled the exposure rate. Five multidimensional CAT item selection procedures (minimum angle; volume; minimum error variance of the linear combination; minimum error variance of the composite score with optimized weight; and Kullback‐Leibler information) were studied and compared with two methods for item exposure control (the Sympson‐Hetter procedure and the fixed‐rate procedure, the latter of which simply places a limit on the item exposure rate) using simulated data. The maximum priority index method was used for the content constraints. Results showed that the Sympson‐Hetter procedure yielded better precision than the fixed‐rate procedure but had much lower item pool usage and took more time. The five item selection procedures performed similarly under Sympson‐Hetter. For the fixed‐rate procedure, there was a trade‐off between the precision of the ability estimates and the item pool usage, and the five procedures showed different patterns: (1) Kullback‐Leibler had better precision but lower item pool usage; (2) minimum angle and volume had balanced precision and item pool usage; and (3) the two methods minimizing the error variance had the best item pool usage and comparable overall score recovery but less precision for certain domains. The priority index for content constraints and item exposure was implemented successfully.
    March 27, 2014   doi: 10.1111/jedm.12032   open full text
  • An Assessment of the Nonparametric Approach for Evaluating the Fit of Item Response Models.
    Tie Liang, Craig S. Wells, Ronald K. Hambleton.
    Journal of Educational Measurement. March 27, 2014
    As item response theory has been more widely applied, investigating the fit of a parametric model has become an important part of the measurement process, yet promising solutions for detecting model misfit in IRT remain scarce. Douglas and Cohen introduced a general nonparametric approach, RISE (Root Integrated Squared Error), for detecting model misfit. The purposes of this study were to extend the use of RISE to more general and comprehensive applications by manipulating a variety of factors (e.g., test length, sample size, IRT models, ability distribution). The results from the simulation study demonstrated that RISE outperformed G2 and S‐X2 in that it controlled Type I error rates and provided adequate power under the studied conditions. In the empirical study, RISE detected reasonable numbers of misfitting items compared to G2 and S‐X2, and RISE gave a much clearer picture of the location and magnitude of misfit for each misfitting item. In addition, there was no practical consequence for classification before and after replacement of the misfitting items detected by the three fit statistics.
    March 27, 2014   doi: 10.1111/jedm.12031   open full text
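RISE summarizes misfit as a weighted root integrated squared error between the parametric item response function and a nonparametric estimate of it. A bare-bones version over a quadrature grid, with two logistic curves standing in for the fitted and nonparametrically estimated IRFs:

```python
import numpy as np

def rise(w, p_model, p_nonpar):
    """Root integrated (here: summed over a quadrature grid) squared
    error between parametric and nonparametric IRFs, weighted by w."""
    return np.sqrt(np.sum(w * (p_model - p_nonpar) ** 2))

theta = np.linspace(-4, 4, 81)
w = np.exp(-theta**2 / 2)
w /= w.sum()                                    # ability-density weights
p_fit = 1 / (1 + np.exp(-1.0 * theta))          # fitted model curve
p_np = 1 / (1 + np.exp(-1.7 * (theta - 0.4)))   # kernel-smoothed stand-in
print(round(float(rise(w, p_fit, p_np)), 3))
```

In practice the nonparametric curve comes from kernel smoothing of observed responses, and a null distribution for the statistic is obtained by resampling rather than from a closed form.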
  • Longitudinal Multistage Testing.
    Steffi Pohl.
    Journal of Educational Measurement. January 23, 2014
    This article introduces longitudinal multistage testing (lMST), a special form of multistage testing (MST), as a method for adaptive testing in longitudinal large‐scale studies. In lMST designs, test forms of different difficulty levels are used, with scores on a pretest determining the routing to these forms. Since lMST allows for testing in paper‐and‐pencil mode, it may represent an alternative to conventional testing (CT) in assessments for which other adaptive testing designs are not applicable. In this article the performance of lMST is compared to CT in terms of test targeting as well as bias and efficiency of ability and change estimates. A simulation study investigated the effects of the stability of ability across waves, the difficulty levels of the different test forms, and the number of link items between the test forms.
    January 23, 2014   doi: 10.1111/jedm.12028   open full text
  • Adjoined Piecewise Linear Approximations (APLAs) for Equating: Accuracy Evaluations of a Postsmoothing Equating Method.
    Tim Moses.
    Journal of Educational Measurement. January 23, 2014
    The purpose of this study was to evaluate the use of adjoined and piecewise linear approximations (APLAs) of raw equipercentile equating functions as a postsmoothing equating method. APLAs are less familiar than other postsmoothing equating methods (i.e., cubic splines), but their use has been described in historical equating practices of large‐scale testing programs. This study used simulations to evaluate APLA equating results and compare these results with those from cubic spline postsmoothing and from several presmoothing equating methods. The overall results suggested that APLAs based on four line segments have accuracy advantages similar to or better than cubic splines and can sometimes produce more accurate smoothed equating functions than those produced using presmoothing methods.
    January 23, 2014   doi: 10.1111/jedm.12027   open full text
  • Multilevel Modeling of Item Position Effects.
    Anthony D. Albano.
    Journal of Educational Measurement. January 23, 2014
    In many testing programs it is assumed that the context or position in which an item is administered does not have a differential effect on examinee responses to the item. Violations of this assumption may bias item response theory estimates of item and person parameters. This study examines the potentially biasing effects of item position. A hierarchical generalized linear model is formulated for estimating item‐position effects. The model is demonstrated using data from a pilot administration of the GRE wherein the same items appeared in different positions across the test form. Methods for detecting and assessing position effects are discussed, as are applications of the model in the contexts of test development and item analysis.
    January 23, 2014   doi: 10.1111/jedm.12026   open full text
  • The Long‐Term Sustainability of IRT Scaling Methods in Mixed‐Format Tests.
    Lisa A. Keller, Ronald K. Hambleton.
    Journal of Educational Measurement. January 23, 2014
    Due to recent research in equating methodologies indicating that some methods may be more susceptible to the accumulation of equating error over multiple administrations, the sustainability of several item response theory methods of equating over time was investigated. In particular, the paper is focused on two equating methodologies: fixed common item parameter scaling (with two variations, FCIP‐1 and FCIP‐2) and the Stocking and Lord characteristic curve scaling technique in the presence of nonequivalent groups. Results indicated that the improvements made to fixed common item parameter scaling in the FCIP‐2 method were sustained over time. FCIP‐2 and Stocking and Lord characteristic curve scaling performed similarly in many instances and produced more accurate results than FCIP‐1. The relative performance of FCIP‐2 and Stocking and Lord characteristic curve scaling depended on the nature of the change in the ability distribution: Stocking and Lord characteristic curve scaling captured the change in the distribution more accurately than FCIP‐2 when the change was different across the ability distribution; FCIP‐2 captured the changes more accurately when the change was consistent across the ability distribution.
    January 23, 2014   doi: 10.1111/jedm.12025   open full text
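The Stocking and Lord characteristic curve method discussed above chooses scale-transformation constants A and B to minimize the squared distance between test characteristic curves on a common scale. A minimal sketch for 2PL items (parameter names and the quadrature grid are illustrative assumptions):

```python
import math

def tcc(thetas, a, b):
    # Test characteristic curve for 2PL items: sum of item response probabilities
    return [sum(1.0 / (1.0 + math.exp(-ai * (t - bi))) for ai, bi in zip(a, b))
            for t in thetas]

def stocking_lord_loss(A, B, thetas, a_new, b_new, a_ref, b_ref):
    """Criterion minimized over (A, B): squared gap between the reference-form
    TCC and the new-form TCC after placing new-form parameters on the
    reference scale via a* = a / A, b* = A * b + B."""
    a_t = [ai / A for ai in a_new]
    b_t = [A * bi + B for bi in b_new]
    ref = tcc(thetas, a_ref, b_ref)
    trans = tcc(thetas, a_t, b_t)
    return sum((r - s) ** 2 for r, s in zip(ref, trans))
```

In practice the loss is minimized numerically over A and B; FCIP methods instead fix the common-item parameters during calibration, which is why the two approaches can diverge when ability shifts unevenly across the distribution.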
  • Generalization of the Lord‐Wingersky Algorithm to Computing the Distribution of Summed Test Scores Based on Real‐Number Item Scores.
    Seonghoon Kim.
    Journal of Educational Measurement. January 23, 2014
    With known item response theory (IRT) item parameters, Lord and Wingersky provided a recursive algorithm for computing the conditional frequency distribution of number‐correct test scores, given proficiency. This article presents a generalized algorithm for computing the conditional distribution of summed test scores involving real‐number item scores. The generalized algorithm is distinct from the Lord‐Wingersky algorithm in that it explicitly incorporates the task of figuring out all possible unique real‐number test scores in each recursion. Some applications of the generalized recursive algorithm, such as IRT test score reliability estimation and IRT proficiency estimation based on summed test scores, are illustrated with a short test by varying scoring schemes for its items.
    January 23, 2014   doi: 10.1111/jedm.12024   open full text
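The generalized recursion described above can be sketched as follows; a dictionary keyed by the accumulated real-number score makes the "all possible unique test scores in each recursion" bookkeeping explicit. This is an illustrative sketch, not the paper's code:

```python
def summed_score_dist(item_scores, item_probs):
    """Conditional distribution of the summed test score, given proficiency.
    item_scores[i]: the possible (real-number) scores for item i
    item_probs[i]:  their conditional probabilities at a fixed theta
    Returns a dict mapping each attainable summed score to its probability."""
    dist = {0.0: 1.0}
    for scores, probs in zip(item_scores, item_probs):
        new = {}
        for s, w in dist.items():
            for x, p in zip(scores, probs):
                t = round(s + x, 10)  # merge numerically identical sums
                new[t] = new.get(t, 0.0) + w * p
        dist = new
    return dist
```

With 0/1 item scores this reduces to the original Lord-Wingersky recursion; with weighted or fractional scoring the dictionary keys track whatever unique summed scores arise.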
  • Rater Comparability Scoring and Equating: Does Choice of Target Population Weights Matter in This Context?
    Gautam Puhan.
    Journal of Educational Measurement. January 23, 2014
    When a constructed‐response test form is reused, raw scores from the two administrations of the form may not be comparable. The solution to this problem requires a rescoring, at the current administration, of examinee responses from the previous administration. The scores from this “rescoring” can be used as an anchor for equating. In this equating, the choice of weights for combining the samples to define the target population can be critical. In rescored data, the anchor usually correlates very strongly with the new form but only moderately with the reference form. This difference has a predictable impact: the equating results are most accurate when the target population is the reference form sample, least accurate when the target population is the new form sample, and somewhere in the middle when the new form and reference form samples are equally weighted in forming the target population.
    January 23, 2014   doi: 10.1111/jedm.12023   open full text
  • Evaluating the Wald Test for Item‐Level Comparison of Saturated and Reduced Models in Cognitive Diagnosis.
    Jimmy de la Torre, Young‐Sun Lee.
    Journal of Educational Measurement. January 23, 2014
    This article used the Wald test to evaluate the item‐level fit of a saturated cognitive diagnosis model (CDM) relative to the fits of the reduced models it subsumes. A simulation study was carried out to examine the Type I error and power of the Wald test in the context of the G‐DINA model. Results show that when the sample size is small and a large number of attributes is required, the Type I error rate of the Wald test for the DINA and DINO models can be higher than the nominal significance levels, while the Type I error rate of the A‐CDM is closer to the nominal significance levels. However, with larger sample sizes, the Type I error rates for the three models are closer to the nominal significance levels. In addition, the Wald test has excellent statistical power to detect when the true underlying model is none of the reduced models examined, even for relatively small sample sizes. The performance of the Wald test was also examined with real data. With an increasing number of CDMs from which to choose, this article provides an important contribution toward advancing the use of CDMs in practical educational settings.
    January 23, 2014   doi: 10.1111/jedm.12022   open full text
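A generic Wald statistic of the kind used for such item-level comparisons can be sketched as follows. The restriction matrix R encodes the constraints a reduced model imposes on the saturated item parameters; the names are illustrative, and this is not the authors' implementation:

```python
import numpy as np

def wald_statistic(beta_hat, cov, R):
    """W = (R b)' [R V R']^{-1} (R b).
    beta_hat: estimated saturated-model item parameters
    cov:      their estimated covariance matrix
    R:        restriction matrix (each row is one linear constraint)
    Under H0 (the restrictions hold), W is asymptotically chi-square
    with rank(R) degrees of freedom."""
    r = R @ beta_hat
    return float(r @ np.linalg.solve(R @ cov @ R.T, r))
```

A large W rejects the reduced model for that item in favor of the saturated one.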
  • Situations Where It Is Appropriate to Use Frequency Estimation Equipercentile Equating.
    Hongwen Guo, Hyeonjoo J. Oh, Daniel Eignor.
    Journal of Educational Measurement. September 22, 2013
    In operational equating situations, frequency estimation equipercentile equating is considered only when the old and new groups have similar abilities. This study investigates the frequency estimation assumptions under various situations, from the standpoints of both theoretical interest and practical use, and shows that frequency estimation equating can be used in circumstances where it is not normally applied. To link the theoretical results with practice, statistical methods are proposed for checking the frequency estimation assumptions based on available data: observed‐score distributions and item difficulty distributions of the forms. In addition to the conventional use of frequency estimation equating when the group abilities are similar, three situations are identified when the group abilities are dissimilar: (a) the two forms and the observed conditional score distributions are similar (in this situation, the frequency estimation assumptions are likely to hold, and frequency estimation equating is appropriate); (b) the forms are similar but the observed conditional score distributions are not (frequency estimation equating is not appropriate); and (c) the forms are not similar but the observed conditional score distributions are (frequency estimation equating is not appropriate). Statistical analysis procedures for comparing distributions are provided. Data from a large‐scale test are used to illustrate the use of frequency estimation equating when the group difference in ability is large.
    September 22, 2013   doi: 10.1111/jedm.12021   open full text
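The frequency estimation assumption discussed above — that the conditional distribution of total score given anchor score is population invariant — yields a synthetic-population score distribution by reweighting. A minimal sketch under that assumption (an illustrative sketch, not the authors' procedure):

```python
def fe_synthetic_dist(joint1, h2, w1):
    """joint1[x][v]: P(X = x, V = v) in the new-form group (population 1).
    h2[v]:          anchor-score distribution in the reference group (pop. 2).
    w1:             weight given to population 1 in the synthetic population.
    Under the FE assumption P1(X = x | V = v) = P2(X = x | V = v), the
    synthetic distribution is w1 * f1(x) + (1 - w1) * sum_v f1(x|v) h2(v)."""
    nv = len(h2)
    h1 = [sum(row[v] for row in joint1) for v in range(nv)]  # anchor marginal, pop 1
    f1 = [sum(row) for row in joint1]                        # score marginal, pop 1
    f2 = [sum(row[v] / h1[v] * h2[v] for v in range(nv) if h1[v] > 0)
          for row in joint1]                                 # implied pop-2 score dist
    return [w1 * a + (1.0 - w1) * b for a, b in zip(f1, f2)]
```

When the two groups' anchor distributions coincide, the reweighting has no effect; the larger the group difference, the more the choice of w1 can matter — which is the sensitivity the study probes.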
  • More Issues in Observed‐Score Equating.
    Wim J. van der Linden.
    Journal of Educational Measurement. September 22, 2013
    This article is a response to the commentaries on the position paper on observed‐score equating by van der Linden (this issue). The response focuses on the more general issues in these commentaries, such as the nature of the observed scores that are equated, the importance of test‐theory assumptions in equating, the necessity to use multiple equating transformations, and the choice of conditioning variables in equating.
    September 22, 2013   doi: 10.1111/jedm.12020   open full text
  • Comments on “Some Conceptual Issues in Observed‐Score Equating” by Wim J. van der Linden.
    Eric T. Bradlow.
    Journal of Educational Measurement. September 22, 2013
    The van der Linden article (this issue) provides a roadmap for future research in equating. My belief is that the roadmap begins and ends with collecting auxiliary data that can be utilized to provide improved equating, especially when data are sparse or equating beyond simple moments is desired.
    September 22, 2013   doi: 10.1111/jedm.12019   open full text
  • Statistical Models and Inference for the True Equating Transformation in the Context of Local Equating.
    Jorge González B., Matthias von Davier.
    Journal of Educational Measurement. September 22, 2013
    Based on Lord's criterion of equity of equating, van der Linden (this issue) revisits the so‐called local equating method and offers alternative as well as new thoughts on several topics including the types of transformations, symmetry, reliability, and population invariance appropriate for equating. A remarkable aspect is the definition of equating as a standard statistical inference problem in which the true equating transformation is the parameter of interest, to be estimated and assessed as in any standard evaluation of an estimator of an unknown parameter in statistics. We believe that putting equating methods in a general statistical model framework would be an interesting and useful next step in the area. van der Linden's conceptual article on equating is certainly an important contribution to this task.
    September 22, 2013   doi: 10.1111/jedm.12018   open full text
  • On Attempting to Do What Lord Said Was Impossible: Commentary on van der Linden's “Some Conceptual Issues in Observed‐Score Equating”.
    Neil J. Dorans.
    Journal of Educational Measurement. September 22, 2013
    van der Linden (this issue) uses words differently than Holland and Dorans. This difference in language usage is a source of some confusion in van der Linden's critique of what he calls equipercentile equating. I address these differences in language. van der Linden maintains that there are only two requirements for score equating. I maintain that the requirements he discards have practical utility and are testable. The score equity requirement proposed by Lord implies that observed‐score equating is either unnecessary or impossible. Strong equity serves as the fulcrum for van der Linden's thesis. His proposed solution to the equity problem takes inequitable measures and aligns conditional error score distributions, resulting in a family of linking functions, one for each level of θ. In reality, θ is never known. Use of an anchor test as a proxy poses many practical problems, including defensibility.
    September 22, 2013   doi: 10.1111/jedm.12017   open full text
  • Local Equating Using the Rasch Model, the OPLM, and the 2PL IRT Model—or—What Is It Anyway if the Model Captures Everything There Is to Know About the Test Takers?
    Matthias von Davier, Jorge González B., Alina A. von Davier.
    Journal of Educational Measurement. September 22, 2013
    Local equating (LE) is based on Lord's criterion of equity. It defines a family of true transformations that aim at the ideal of equitable equating. van der Linden (this issue) offers a detailed discussion of common issues in observed‐score equating relative to this local approach. By assuming an underlying item response theory model, one of the main features of LE is that it adjusts the equated raw scores using conditional distributions of raw scores given an estimate of the ability of interest. In this article, we argue that this feature disappears when using a Rasch model for the estimation of the true transformation, while the one‐parameter logistic model and the two‐parameter logistic model do provide a local adjustment of the equated score.
    September 22, 2013   doi: 10.1111/jedm.12016   open full text
  • Comments on van der Linden's Critique and Proposal for Equating.
    Paul W. Holland.
    Journal of Educational Measurement. September 22, 2013
    While agreeing with van der Linden (this issue) that test equating needs better theoretical underpinnings, my comments criticize several aspects of his article. His examples are, for the most part, worthless; he does not use well‐established terminology correctly; his view of 100 years of attempts to give a theoretical basis for equating is unreasonably dismissive; he exhibits no understanding of the role of the synthetic population for anchor test equating for the nonequivalent groups with anchor test design; he is obtuse regarding the condition of symmetry, requiring it of the estimand but not of the estimator; and his proposal for a foundational basis for all test equating, the “true equating transformation,” allows a different equating function for every examinee, which is way past what equating actually does or hopes to achieve. Most importantly, he appears to think that criticism of others is more important than improved insight that moves a field forward based on the work of many other theorists whose contributions have improved the practice of equating.
    September 22, 2013   doi: 10.1111/jedm.12015   open full text
  • Some Conceptual Issues in Observed‐Score Equating.
    Wim J. van der Linden.
    Journal of Educational Measurement. September 22, 2013
    In spite of all of the technical progress in observed‐score equating, several of the more conceptual aspects of the process still are not well understood. As a result, the equating literature struggles with rather complex criteria of equating, lack of a test‐theoretic foundation, confusing terminology, and ad hoc analyses. A return to Lord's foundational criterion of equity of equating, a derivation of the true equating transformation from it, and mainstream statistical treatment of the problem of estimating the transformation for various data‐collection designs are proposed as a solution to these problems.
    September 22, 2013   doi: 10.1111/jedm.12014   open full text
  • Unidimensional Interpretations for Multidimensional Test Items.
    Nilufer Kahraman.
    Journal of Educational Measurement. June 11, 2013
    This article considers potential problems that can arise in estimating a unidimensional item response theory (IRT) model when some test items are multidimensional (i.e., show a complex factorial structure). More specifically, this study examines (1) the consequences of model misfit on IRT item parameter estimates due to unintended minor item‐level multidimensionality, and (2) whether a Projection IRT model can provide a useful remedy. A real‐data example is used to illustrate the problem and also is used as a base model for a simulation study. The results suggest that ignoring item‐level multidimensionality might lead to inflated item discrimination parameter estimates when the proportion of multidimensional test items to unidimensional test items is as low as 1:5. The Projection IRT model appears to be a useful tool for updating unidimensional item parameter estimates of multidimensional test items for a purified unidimensional interpretation.
    June 11, 2013   doi: 10.1111/jedm.12012   open full text
  • Measuring Growth With Vertical Scales.
    Derek C. Briggs.
    Journal of Educational Measurement. June 11, 2013
    A vertical score scale is needed to measure growth across multiple tests in terms of absolute changes in magnitude. Since the warrant for subsequent growth interpretations depends upon the assumption that the scale has interval properties, the validation of a vertical scale would seem to require methods for distinguishing interval scales from ordinal scales. In taking up this issue, two different perspectives on educational measurement are contrasted: a metaphorical perspective and a classical perspective. Although the metaphorical perspective is more predominant, at present it provides no objective methods whereby the properties of a vertical scale can be validated. In contrast, when taking a classical perspective, the axioms of additive conjoint measurement can be used to test the hypothesis that the latent variable underlying a vertical scale is quantitative (supporting ratio or interval properties) rather than merely qualitative (supporting ordinal or nominal properties). The application of such an approach is illustrated with both a hypothetical example and by drawing upon recent research that has been conducted on the Lexile scale for reading comprehension.
    June 11, 2013   doi: 10.1111/jedm.12011   open full text
  • Estimation Methods for One‐Parameter Testlet Models.
    Hong Jiao, Shudong Wang, Wei He.
    Journal of Educational Measurement. June 11, 2013
    This study demonstrated the equivalence between the Rasch testlet model and the three‐level one‐parameter testlet model and explored the Markov Chain Monte Carlo (MCMC) method for model parameter estimation in WINBUGS. The estimation accuracy from the MCMC method was compared with those from the marginalized maximum likelihood estimation (MMLE) with the expectation‐maximization algorithm in ConQuest and the sixth‐order Laplace approximation estimation in HLM6. The results indicated that the estimation methods had significant effects on the bias of the testlet variance and ability variance estimation, the random error in the ability parameter estimation, and the bias in the item difficulty parameter estimation. The Laplace method best recovered the testlet variance while the MMLE best recovered the ability variance. The Laplace method resulted in the smallest random error in the ability parameter estimation while the MCMC method produced the smallest bias in item parameter estimates. Analyses of three real tests generally supported the findings from the simulation and indicated that the estimates for item difficulty and ability parameters were highly correlated across estimation methods.
    June 11, 2013   doi: 10.1111/jedm.12010   open full text
  • Modeling Item‐Position Effects Within an IRT Framework.
    Dries Debeer, Rianne Janssen.
    Journal of Educational Measurement. June 11, 2013
    Changing the order of items between alternate test forms to prevent copying and to enhance test security is a common practice in achievement testing. However, these changes in item order may affect item and test characteristics. Several procedures have been proposed for studying these item‐order effects. The present study explores the use of descriptive and explanatory models from item response theory for detecting and modeling these effects in a one‐step procedure. The framework also allows for consideration of the impact of individual differences in position effect on item difficulty. A simulation was conducted to investigate the impact of a position effect on parameter recovery in a Rasch model. As an illustration, the framework was applied to a listening comprehension test for French as a foreign language and to data from the PISA 2006 assessment.
    June 11, 2013   doi: 10.1111/jedm.12009   open full text
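A simple instance of the kind of model described above is a Rasch model with a linear item-position effect on difficulty. The linear form and the parameter names are illustrative assumptions, not the authors' exact specification:

```python
import math

def p_correct(theta, b, delta, position):
    """Rasch success probability with a linear position effect:
    logit P = theta - (b + delta * position).
    A positive delta makes the item effectively harder the later it
    appears in the test form."""
    return 1.0 / (1.0 + math.exp(-(theta - b - delta * position)))
```

Fitting delta (possibly with person-specific variants) within an explanatory IRT framework is what allows position effects to be detected and parameter recovery to be studied in one step.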
  • Detection of Test Collusion via Kullback–Leibler Divergence.
    Dmitry I. Belov.
    Journal of Educational Measurement. June 11, 2013
    The development of statistical methods for detecting test collusion is a new research direction in the area of test security. Test collusion may be described as large‐scale sharing of test materials, including answers to test items. Current methods of detecting test collusion are based on statistics also used in answer‐copying detection. Therefore, in computerized adaptive testing (CAT) these methods lose power because the actual test varies across examinees. This article addresses that problem by introducing a new approach that works in two stages: in Stage 1, test centers with an unusual distribution of a person‐fit statistic are identified via Kullback–Leibler divergence; in Stage 2, examinees from identified test centers are analyzed further using the person‐fit statistic, where the critical value is computed without data from the identified test centers. The approach is extremely flexible. One can employ any existing person‐fit statistic. The approach can be applied to all major testing programs: paper‐and‐pencil testing (P&P), computer‐based testing (CBT), multiple‐stage testing (MST), and CAT. Also, the definition of test center is not limited by the geographic location (room, class, college) and can be extended to support various relations between examinees (from the same undergraduate college, from the same test‐prep center, from the same group at a social network). The suggested approach was found to be effective in CAT for detecting groups of examinees with item pre‐knowledge, meaning those with access (possibly unknown to us) to one or more subsets of items prior to the exam.
    June 11, 2013   doi: 10.1111/jedm.12008   open full text
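Stage 1 of the approach above compares each test center's binned distribution of a person-fit statistic against the overall distribution via Kullback-Leibler divergence. A minimal sketch (the binning and threshold choice are assumptions; any person-fit statistic could supply the distributions):

```python
import math

def kl_divergence(p, q):
    """D(P || Q) for discrete (binned) distributions; assumes q[i] > 0
    wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def flag_centers(center_dists, overall_dist, threshold):
    """Stage 1: flag test centers whose binned person-fit distribution
    diverges from the overall distribution by more than the threshold."""
    return [c for c, d in center_dists.items()
            if kl_divergence(d, overall_dist) > threshold]
```

Flagged centers then go to Stage 2, where individual examinees are screened with the person-fit statistic using a critical value computed without data from the flagged centers.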
  • Relative and Absolute Fit Evaluation in Cognitive Diagnosis Modeling.
    Jinsong Chen, Jimmy de la Torre, Zao Zhang.
    Journal of Educational Measurement. June 11, 2013
    As with any psychometric model, the validity of inferences from cognitive diagnosis models (CDMs) determines the extent to which these models can be useful. For inferences from CDMs to be valid, it is crucial that the fit of the model to the data is ascertained. Based on a simulation study, this study investigated the sensitivity of various fit statistics for absolute or relative fit under different CDM settings. The investigation covered various types of model–data misfit that can occur with the misspecifications of the Q‐matrix, the CDM, or both. Six fit statistics were considered: –2 log likelihood (–2LL), Akaike's information criterion (AIC), Bayesian information criterion (BIC), and residuals based on the proportion correct of individual items (p), the correlations (r), and the log‐odds ratio of item pairs (l). An empirical example involving real data was used to illustrate how the different fit statistics can be employed in conjunction with each other to identify different types of misspecifications. With these statistics and the saturated model serving as the basis, relative and absolute fit evaluation can be integrated to detect misspecification efficiently.
    June 11, 2013   doi: 10.1111/j.1745-3984.2012.00185.x   open full text
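For reference, the three relative fit statistics named above have simple closed forms in terms of the maximized log-likelihood (a sketch; n_params is the number of free model parameters and n_obs the number of examinees):

```python
import math

def minus2ll(log_lik):
    # -2LL: the deviance; smaller indicates better fit, with no complexity penalty
    return -2.0 * log_lik

def aic(log_lik, n_params):
    # Akaike's information criterion: a fixed penalty of 2 per parameter
    return -2.0 * log_lik + 2.0 * n_params

def bic(log_lik, n_params, n_obs):
    # Bayesian information criterion: the penalty grows with sample size
    return -2.0 * log_lik + n_params * math.log(n_obs)
```

Because BIC penalizes parameters more heavily at realistic sample sizes, it tends to favor reduced CDMs over the saturated model more often than AIC does.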