Path models with observed composites based on multiple items (e.g., mean or sum score of the items) are commonly used to test interaction effects. Under this practice, researchers generally assume that the observed composites are measured without errors. In this study, we reviewed and evaluated two alternative methods within the structural equation modeling (SEM) framework, namely, the reliability-adjusted product indicator (RAPI) method and the latent moderated structural equations (LMS) method, which can both flexibly take into account measurement errors. Results showed that both these methods generally produced unbiased estimates of the interaction effects. On the other hand, the path model—without considering measurement errors—led to substantial bias and a low confidence interval coverage rate of nonzero interaction effects. Other findings and implications for future studies are discussed.
Cluster randomized trials involving participants nested within intact treatment and control groups are commonly performed in various educational, psychological, and biomedical studies. However, recruiting and retaining intact groups present various practical, financial, and logistical challenges to evaluators, and cluster randomized trials are often performed with a low number of clusters (~20 groups). Although multilevel models are often used to analyze nested data, researchers may be concerned about potentially biased results due to having only a few groups under study. Cluster bootstrapping has been suggested as an alternative procedure when analyzing clustered data, though it has seen very little use in educational and psychological studies. Using a Monte Carlo simulation that varied the number of clusters, average cluster size, and intraclass correlations, we compared standard errors using cluster bootstrapping with those derived using ordinary least squares regression and multilevel models. Results indicate that cluster bootstrapping, though more computationally demanding, can be used as an alternative procedure for the analysis of clustered data when treatment effects at the group level are of primary interest. Supplementary material showing how to perform cluster bootstrapped regressions using R is also provided.
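The idea of cluster bootstrapping can be illustrated with a minimal Python sketch. This is not the authors' supplementary R code: the function names (`simulate`, `treatment_effect`, `cluster_bootstrap_se`), the simulated data, and the choice to resample clusters separately within each arm are all assumptions made for illustration. The key point is that whole clusters, rather than individual participants, are resampled with replacement.

```python
import random
import statistics


def simulate(n_clusters=20, cluster_size=15, effect=0.5, icc=0.2, seed=1):
    """Hypothetical data generator: outcomes share a cluster-level random effect."""
    rng = random.Random(seed)
    clusters = []
    for j in range(n_clusters):
        treated = j % 2 == 0  # alternate assignment so both arms are populated
        u = rng.gauss(0, icc ** 0.5)  # cluster effect with variance = icc
        y = [effect * treated + u + rng.gauss(0, (1 - icc) ** 0.5)
             for _ in range(cluster_size)]
        clusters.append({"treated": treated, "y": y})
    return clusters


def treatment_effect(clusters):
    """Difference in mean outcome between treated and control participants."""
    treated = [y for c in clusters if c["treated"] for y in c["y"]]
    control = [y for c in clusters if not c["treated"] for y in c["y"]]
    return statistics.mean(treated) - statistics.mean(control)


def cluster_bootstrap_se(clusters, n_boot=1000, seed=0):
    """Bootstrap standard error obtained by resampling whole clusters.

    Assumption: clusters are resampled within each arm so that every
    bootstrap sample contains both treated and control clusters.
    """
    rng = random.Random(seed)
    treated = [c for c in clusters if c["treated"]]
    control = [c for c in clusters if not c["treated"]]
    estimates = []
    for _ in range(n_boot):
        sample = (rng.choices(treated, k=len(treated))
                  + rng.choices(control, k=len(control)))
        estimates.append(treatment_effect(sample))
    return statistics.stdev(estimates)


clusters = simulate()
print("estimate:", treatment_effect(clusters))
print("cluster bootstrap SE:", cluster_bootstrap_se(clusters, n_boot=500))
```

Because the resampling unit is the cluster, the within-cluster dependence is carried into every bootstrap sample, which is why the resulting standard error reflects the nesting without an explicit multilevel model.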
This article outlines a procedure for examining the degree to which a common factor may be dominating additional factors in a multicomponent measuring instrument consisting of binary items. The procedure rests on an application of the latent variable modeling methodology and accounts for the discrete nature of the manifest indicators. The method provides point and interval estimates (a) of the proportion of the variance explained by all factors, which is due to the common (global) factor and (b) of the proportion of the variance explained by all factors, which is due to some or all other (local) factors. The discussed approach can also be readily used as a means of assessing approximate unidimensionality when considering application of unidimensional versus multidimensional item response modeling. The procedure is similarly utilizable in case of highly discrete (e.g., Likert-type) ordinal items, and is illustrated with a numerical example.
The correlated trait–correlated method (CTCM) model for the analysis of multitrait–multimethod (MTMM) data is known to suffer convergence and admissibility (C&A) problems. We describe a little known and seldom applied reparameterized version of this model (CTCM-R) based on Rindskopf’s reparameterization of the simpler confirmatory factor analysis model. In a Monte Carlo study, we compare the CTCM, CTCM-R, and the correlated trait–correlated uniqueness (CTCU) models in terms of C&A, model fit, and parameter estimation bias. The CTCM-R model largely avoided C&A problems associated with the more traditional CTCM model, producing C&A solutions nearly as often as the CTCU model, but also avoiding parameter estimation biases known to plague the CTCU model. As such, the CTCM-R model is an attractive alternative for the analysis of MTMM data.
A first-order latent growth model assesses change in an unobserved construct from a single score and is commonly used across different domains of educational research. However, examining change using a set of multiple response scores (e.g., scale items) affords researchers several methodological benefits not possible when using a single score. A curve of factors (CUFFS) model assesses change in a construct from multiple response scores but its use in the social sciences has been limited. In this article, we advocate the CUFFS for analyzing a construct’s latent trajectory over time, with an emphasis on applying this model to educational research. First, we present a review of longitudinal factorial invariance, a condition necessary for ensuring that the measured construct is the same across time points. Next, we introduce the CUFFS model, followed by an illustration of testing factorial invariance and specifying a univariate and a bivariate CUFFS model to longitudinal data. To facilitate implementation, we include syntax for specifying these statistical methods using the free statistical software R.
A latent variable modeling method for studying measurement invariance when evaluating latent constructs with multiple binary or binary scored items with no guessing is outlined. The approach extends the continuous indicator procedure described by Raykov and colleagues, utilizes similarly the false discovery rate approach to multiple testing, and permits one to locate violations of measurement invariance in loading or threshold parameters. The discussed method does not require selection of a reference observed variable and is directly applicable for studying differential item functioning with one- or two-parameter item response models. The extended procedure is illustrated on an empirical data set.
Recent research has explored the use of models adapted from Mokken scale analysis as a nonparametric approach to evaluating rating quality in educational performance assessments. A potential limiting factor to the widespread use of these techniques is the requirement for complete data, as practical constraints in operational assessment systems often limit the use of complete rating designs. In order to address this challenge, this study explores the use of missing data imputation techniques and their impact on Mokken-based rating quality indicators related to rater monotonicity, rater scalability, and invariant rater ordering. Simulated data and real data from a rater-mediated writing assessment were modified to reflect varying levels of missingness, and four imputation techniques were used to impute missing ratings. Overall, the results indicated that simple imputation techniques based on rater and student means result in generally accurate recovery of rater monotonicity indices and rater scalability coefficients. However, discrepancies between violations of invariant rater ordering in the original and imputed data are somewhat unpredictable across imputation methods. Implications for research and practice are discussed.
Statistical mediation analysis allows researchers to identify the most important mediating constructs in the causal process studied. Identifying specific mediators is especially relevant when the hypothesized mediating construct consists of multiple related facets. The general definition of the construct and its facets might relate differently to an outcome. However, current methods do not allow researchers to study the relationships between general and specific aspects of a construct to an outcome simultaneously. This study proposes a bifactor measurement model for the mediating construct as a way to parse variance and represent the general aspect and specific facets of a construct simultaneously. Monte Carlo simulation results are presented to help determine the properties of mediated effect estimation when the mediator has a bifactor structure and a specific facet of a construct is the true mediator. This study also investigates the conditions under which researchers can detect the mediated effect when the multidimensionality of the mediator is ignored and the mediator is treated as unidimensional. Simulation results indicated that the mediation model with a bifactor mediator measurement model had unbiased estimates and adequate power to detect the mediated effect with a sample size greater than 500 and medium a- and b-paths. Also, results indicate that parameter bias and detection of the mediated effect in both the data-generating model and the misspecified model vary as a function of the amount of facet variance represented in the mediation model. This study contributes to the largely unexplored area of measurement issues in statistical mediation analysis.
The Angoff standard setting method relies on content experts to review exam items and make judgments about the performance of the minimally proficient examinee. Unfortunately, at times content experts may have gaps in their understanding of specific exam content. These gaps are particularly likely to occur when the content domain is broad and/or highly technical, or when non-expert stakeholders are included in a standard setting panel (e.g., parents, administrators, or union representatives). When judges lack expertise regarding specific exam content, the ratings associated with those items may be biased. This study attempts to illustrate the impact of rating unfamiliar items on Angoff passing scores. The study presents a comparison of Angoff ratings for typical items with those identified by judges as containing unfamiliar content. The results indicate that judges tend to perceive unfamiliar items as being artificially difficult, resulting in systematically lower Angoff ratings. The results suggest that when judges are forced to rate unfamiliar items, the validity of the resulting classification decision may be jeopardized.
Critics of null hypothesis significance testing suggest that (a) its basic logic is invalid and (b) it addresses a question that is of no interest. In contrast to (a), I argue that the underlying logic of hypothesis testing is actually extremely straightforward and compelling. To substantiate that, I present examples showing that hypothesis testing logic is routinely used in everyday life. These same examples also refute (b) by showing circumstances in which the logic of hypothesis testing addresses a question of prime interest. Null hypothesis significance testing may sometimes be misunderstood or misapplied, but these problems should be addressed by improved education.
The article provides perspectives on p values, null hypothesis testing, and alternative techniques in light of modern robust statistical methods. Null hypothesis testing and p values can provide useful information provided they are interpreted in a sound manner, which includes taking into account insights and advances that have occurred during the past 50 years. There are, of course, limitations to what null hypothesis testing and p values reveal about data. But modern advances make it clear that there are serious limitations and concerns associated with conventional confidence intervals, standard Bayesian methods, and commonly used measures of effect size. Many of these concerns can be addressed using modern robust methods.
In 1881, Donald MacAlister posed a problem in the Educational Times that remains relevant today. The problem centers on the statistical evidence for the effectiveness of a treatment based on a comparison between two proportions. A brief historical sketch is followed by a discussion of two default Bayesian solutions, one based on a one-sided test between independent rates, and one on a one-sided test between dependent rates. We demonstrate the current-day relevance of MacAlister’s original question with a modern-day example about the effectiveness of an educational program.
There has been much controversy over the null hypothesis significance testing procedure, with much of the criticism centered on the problem of inverse inference. Specifically, p gives the probability of the finding (or one more extreme) given the null hypothesis, whereas the null hypothesis significance testing procedure involves drawing a conclusion about the null hypothesis given the finding. Many critics have called for null hypothesis significance tests to be replaced with confidence intervals. However, confidence intervals also suffer from a version of the inverse inference problem. The only known solution to the inverse inference problem is to use the famous theorem by Bayes, but this involves commitments that many researchers are not willing to make. However, it is possible to ask a useful question for which inverse inference is not a problem and that leads to the computation of the coefficient of confidence. In turn, and much more important, using the coefficient of confidence implies the desirability of switching from the current emphasis on a posteriori inferential statistics to an emphasis on a priori inferential statistics.
An alternative to null hypothesis significance testing is presented and discussed. This approach, referred to as observation-oriented modeling, is centered on model building in an effort to explicate the structures and processes believed to generate a set of observations. In terms of analysis, this novel approach complements traditional methods based on means, variances, and covariances with methods of pattern detection and analysis. Using data from a previously published study by Shoda et al., the basic tenets and methods of observation-oriented modeling are demonstrated and compared with traditional methods, particularly with regard to null hypothesis significance testing.
Brain-imaging technology has boosted the quantification of neurobiological phenomena underlying human mental operations and their disturbances. Since its inception, drawing inference on neurophysiological effects has hinged on classical statistical methods, especially the general linear model. The tens of thousands of variables per brain scan were routinely tackled by independent statistical tests on each voxel. This circumvented the curse of dimensionality in exchange for neurobiologically imperfect observation units, a challenging multiple comparisons problem, and limited scaling to currently growing data repositories. Yet, the ever-increasing information granularity of neuroimaging data repositories has launched a rapidly increasing adoption of statistical learning algorithms. These scale naturally to high-dimensional data, extract models from data rather than prespecifying them, and are empirically evaluated for extrapolation to unseen data. The present article portrays commonalities and differences between long-standing classical inference and upcoming generalization inference relevant for conducting neuroimaging research.
We present three strategies to replace the null hypothesis statistical significance testing approach in psychological research: (1) visual representation of cognitive processes and predictions, (2) visual representation of data distributions and choice of the appropriate distribution for analysis, and (3) model comparison. The three strategies have been proposed earlier, so we do not claim originality. Here we propose to combine the three strategies and use them not only as analytical and reporting tools but also to guide the design of research. The first strategy involves a visual representation of the cognitive processes involved in solving the task at hand in the form of a theory or model together with a representation of a pattern of predictions for each condition. The second approach is the GAMLSS approach, which consists of providing a visual representation of distributions to fit the data, and choosing the best distribution that fits the raw data for further analyses. The third strategy is the model comparison approach, which compares the model of the researcher with alternative models. We present a worked example in the field of reasoning, in which we follow the three strategies.
Because of the continuing debates about statistics, many researchers may feel confused about how to analyze and interpret data. Current guidelines in psychology advocate the use of effect sizes and confidence intervals (CIs). However, researchers may be unsure about how to extract effect sizes from factorial designs. Contrast analysis is helpful because it can be used to test specific questions of central interest in studies with factorial designs. It weighs several means and combines them into one or two sets that can be tested with t tests. The effect size produced by a contrast analysis is simply the difference between means. The CI of the effect size informs directly about direction, hypothesis exclusion, and the relevance of the effects of interest. However, any interpretation in terms of precision or likelihood requires the use of likelihood intervals or credible intervals (Bayesian). These various intervals and even a Bayesian t test can be obtained easily with free software. This tutorial reviews these methods to guide researchers in answering the following questions: When I analyze mean differences in factorial designs, where can I find the effects of central interest, and what can I learn about their effect sizes?
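A contrast of the kind described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the tutorial's own procedure: the function name `contrast`, the 2 x 2 cell data, and the weights are hypothetical, and the standard error assumes a between-subjects design with a pooled within-cell variance. With weights of +0.5 and -0.5 the contrast estimate is literally the difference between two averaged sets of means.

```python
import math
import statistics


def contrast(groups, weights):
    """Estimate a linear contrast of cell means with a pooled-variance t statistic.

    groups  : list of lists, one list of scores per design cell
    weights : contrast weights, one per cell, summing to zero
    """
    if len(groups) != len(weights):
        raise ValueError("need one weight per group")
    if abs(sum(weights)) > 1e-12:
        raise ValueError("contrast weights must sum to zero")

    means = [statistics.mean(g) for g in groups]
    ns = [len(g) for g in groups]
    df = sum(ns) - len(groups)  # error degrees of freedom

    # Pooled within-cell variance (mean square error).
    ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
    mse = ss_within / df

    estimate = sum(w * m for w, m in zip(weights, means))
    se = math.sqrt(mse * sum(w ** 2 / n for w, n in zip(weights, ns)))
    return {"estimate": estimate, "se": se, "t": estimate / se, "df": df}


# Hypothetical 2 x 2 design, cells ordered A1B1, A1B2, A2B1, A2B2.
cells = [[4, 5, 6], [6, 7, 8], [1, 2, 3], [3, 4, 5]]

# Main effect of A: mean of the A1 cells minus mean of the A2 cells.
result = contrast(cells, [0.5, 0.5, -0.5, -0.5])
print(result)
```

A confidence interval follows as the estimate plus or minus a critical t value (for the returned `df`) times `se`; the critical value is left to the reader's statistical software because the Python standard library does not provide the t distribution.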
Psychometric measurement models are only valid if measurement invariance holds between test takers of different groups. Global model tests, such as the well-established likelihood ratio (LR) test, are sensitive to violations of measurement invariance, such as differential item functioning and differential step functioning. However, these traditional approaches are only applicable when comparing previously specified reference and focal groups, such as males and females. Here, we propose a new framework for global model tests for polytomous Rasch models based on a model-based recursive partitioning algorithm. With this approach, a priori specification of reference and focal groups is no longer necessary, because they are automatically detected in a data-driven way. The statistical background of the new framework is introduced along with an instructive example. A series of simulation studies illustrates and compares its statistical properties to the well-established LR test. While both the LR test and the new framework are sensitive to differential item functioning and differential step functioning and respect a given significance level regardless of true differences in the ability distributions, the new data-driven approach is more powerful when the group structure is not known a priori—as will usually be the case in practical applications. The usage and interpretation of the new method are illustrated in an empirical application example. A software implementation is freely available in the R system for statistical computing.
The generalized partial credit model (GPCM) is often used for polytomous data; however, the nominal response model (NRM) allows for the investigation of how adjacent categories may discriminate differently when items are positively or negatively worded. Ten items from three different self-reported scales were used (anxiety, depression, and perceived stress), and authors wrote an additional item worded in the opposite direction to pair with each original item. Sets of the original and reverse-worded items were administered, and responses were analyzed using the two models. The NRM fit significantly better than the GPCM, and it was able to detect category responses that may not function well. Positively worded items tended to be more discriminating than negatively worded items. For the depression scale, category boundary locations tended to have a larger range for the positively worded items than for the negatively worded items from both models. Some pairs of items functioned comparably when reverse-worded, but others did not. If an examinee responds in an extreme category to an item, the same examinee is not necessarily likely to respond in an extreme category at the opposite end of the rating scale to a similar item worded in the opposite direction. Results of this study may support the use of scales composed of items worded in the same direction, and particularly in the positive direction.
In confirmatory factor analysis quite similar models of measurement serve the detection of the difficulty factor and the factor due to the item-position effect. The item-position effect refers to the increasing dependency among the responses to successively presented items of a test whereas the difficulty factor is ascribed to the wide range of item difficulties. The similarity of the models of measurement hampers the dissociation of these factors. Since the item-position effect should theoretically be independent of the item difficulties, the statistical ex post manipulation of the difficulties should enable the discrimination of the two types of factors. This method was investigated in two studies. In the first study, Advanced Progressive Matrices (APM) data of 300 participants were investigated. As expected, the factor thought to be due to the item-position effect was observed. In the second study, using data simulated to show the major characteristics of the APM data, the wide range of items with various difficulties was set to zero to reduce the likelihood of detecting the difficulty factor. Despite this reduction, however, the factor now identified as item-position factor, was observed in virtually all simulated datasets.
Self-report surveys are widely used to measure adolescent risk behavior and academic adjustment, with results having an impact on national policy, assessment of school quality, and evaluation of school interventions. However, data obtained from self-reports can be distorted when adolescents intentionally provide inaccurate or careless responses. The current study illustrates the problem of invalid respondents in a sample (N = 52,012) from 323 high schools that responded to a statewide assessment of school climate. Two approaches for identifying invalid respondents were applied, and contrasts between the valid and invalid responses revealed differences in means, prevalence rates of student adjustment, and associations among reports of bullying victimization and student adjustment outcomes. The results lend additional support for the need to screen for invalid responders in adolescent samples.
Null hypothesis significance testing (NHST) provides an important statistical toolbox, but there are a number of ways in which it is often abused and misinterpreted, with bad consequences for the reliability and progress of science. Parts of the contemporary NHST debate, especially in the psychological sciences, are reviewed, and a suggestion is made that a new distinction between strongly, weakly, and very weakly anti-NHST positions is likely to bring added clarity to the debate.
Null hypothesis significance testing (NHST) has been the subject of debate for decades and alternative approaches to data analysis have been proposed. This article addresses this debate from the perspective of scientific inquiry and inference. Inference is an inverse problem and application of statistical methods cannot reveal whether effects exist or whether they are empirically meaningful. Hence, drawing conclusions from the outcomes of statistical analyses is subject to limitations. NHST has been criticized for its misuse and the misconstruction of its outcomes, also stressing its inability to meet expectations that it was never designed to fulfil. Ironically, alternatives to NHST are identical in these respects, something that has been overlooked in their presentation. Three of those alternatives are discussed here (estimation via confidence intervals and effect sizes, quantification of evidence via Bayes factors, and mere reporting of descriptive statistics). None of them offers a solution to the problems that NHST is purported to have, all of them are susceptible to misuse and misinterpretation, and some bring about their own problems (e.g., Bayes factors have a one-to-one correspondence with p values, but they are entirely deprived of an inferential framework). Those alternatives also fail to cover a broad area of inference not involving distributional parameters, where NHST procedures remain the only (and suitable) option. Like knives or axes, NHST is not inherently evil; only misuse and misinterpretation of its outcomes need to be eradicated.
We briefly discuss the philosophical basis of science, causality, and scientific evidence, by introducing the hidden but most fundamental principle of science: the similarity principle. The principle’s use in scientific discovery is illustrated with Simpson’s paradox and other examples. In discussing the value of null hypothesis statistical testing, the controversies in multiple regression, and multiplicity issues in statistics, we describe how these difficult issues should be handled based on our interpretation of the similarity principle.
This article considers the nature and place of tests of statistical significance (ToSS) in science, with particular reference to psychology. Despite the enormous amount of attention given to this topic, psychology’s understanding of ToSS remains deficient. The major problem stems from a widespread and uncritical acceptance of null hypothesis significance testing (NHST), which is an indefensible amalgam of ideas adapted from Fisher’s thinking on the subject and from Neyman and Pearson’s alternative account. To correct for the deficiencies of the hybrid, it is suggested that psychology avail itself of two important and more recent viewpoints on ToSS, namely the neo-Fisherian and the error-statistical perspectives. The neo-Fisherian perspective endeavors to improve on Fisher’s original account and rejects key elements of Neyman and Pearson’s alternative. In contrast, the error-statistical perspective builds on the strengths of both statistical traditions. It is suggested that these more recent outlooks on ToSS are a definite improvement on NHST, especially the error-statistical position. It is suggested that ToSS can play a useful, if limited, role in psychological research. At the end, some lessons learnt from the extensive debates about ToSS are presented.
Bayesian and classical statistical approaches are based on different types of logical principles. In order to avoid mistaken inferences and misguided interpretations, the practitioner must respect the inference rules embedded into each statistical method. Ignoring these principles leads to the paradoxical conclusions that the hypothesis
P values have been critiqued on several grounds but remain entrenched as the dominant inferential method in the empirical sciences. In this article, we elaborate on the fact that in many statistical models, the one-sided P value has a direct Bayesian interpretation as the approximate posterior mass for values lower than zero. The connection between the one-sided P value and posterior probability mass reveals three insights: (1) P values can be interpreted as Bayesian tests of direction, to be used only when the null hypothesis is known from the outset to be false; (2) as a measure of evidence, P values are biased against a point null hypothesis; and (3) with N fixed and effect size variable, there is an approximately linear relation between P values and Bayesian point null hypothesis tests.
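The correspondence between the one-sided P value and posterior mass can be checked numerically in the simplest textbook case. The sketch below is an assumption-laden illustration, not the article's analysis: it takes a normal mean with known standard deviation and a flat (improper uniform) prior, in which case the posterior for the mean is normal with mean equal to the sample mean and standard deviation sigma / sqrt(n), and the match is exact rather than approximate. The function names and the example numbers are hypothetical.

```python
import math
from statistics import NormalDist


def one_sided_p(xbar, sigma, n):
    """One-sided P value for H0: mu <= 0 against H1: mu > 0 (z test, known sigma)."""
    z = xbar / (sigma / math.sqrt(n))
    return 1.0 - NormalDist().cdf(z)


def posterior_mass_below_zero(xbar, sigma, n):
    """Posterior probability that mu < 0 under a flat prior on mu.

    Assumption: known sigma and a flat prior give the posterior
    mu | data ~ Normal(xbar, sigma / sqrt(n)).
    """
    posterior = NormalDist(mu=xbar, sigma=sigma / math.sqrt(n))
    return posterior.cdf(0.0)


# Hypothetical summary statistics.
xbar, sigma, n = 0.3, 1.0, 50
print("one-sided P value:        ", one_sided_p(xbar, sigma, n))
print("posterior mass below zero:", posterior_mass_below_zero(xbar, sigma, n))
```

Both quantities equal 1 - Φ(z) with z = x̄ / (σ / √n), so the two printed values agree up to floating-point error. With unknown variance or an informative prior the agreement would only be approximate.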
Synthesizing results from multiple studies is a daunting task during which researchers must tackle a variety of challenges. The task is even more demanding when studying developmental processes longitudinally and when different instruments are used to measure constructs. Data integration methodology is an emerging field that enables researchers to pool data drawn from multiple existing studies. To date, these methods are not commonly utilized in the social and behavioral sciences, even though they can be very useful for studying various complex developmental processes. This article illustrates the use of two data integration methods, the data fusion and the parallel analysis approaches. The illustration makes use of six longitudinal studies of mathematics ability in children with a goal of examining individual changes in mathematics ability and determining differences in the trajectories based on sex and socioeconomic status. The studies vary in their assessment of mathematics ability and in the timing and number of measurement occasions. The advantages of using a data fusion approach, which can allow for the fitting of more complex growth models that might not otherwise have been possible to fit in a single data set, are emphasized. The article concludes with a discussion of the limitations and benefits of these approaches for research synthesis.
The current study proposes novel methods to predict multistage testing (MST) performance without conducting simulations. This method, called MST test information, is based on analytic derivation of standard errors of ability estimates across theta levels. We compared standard errors derived analytically to the simulation results to demonstrate the validity of the proposed method in both measurement precision and classification accuracy. The results indicate that the MST test information effectively predicted the performance of MST. In addition, the results of the current study highlighted the relationship among the test construction, MST design factors, and MST performance.
Drawing parallels to classical test theory, this article clarifies the difference between rater accuracy and reliability and demonstrates how category marginal frequencies affect rater agreement and Cohen’s kappa (κ). Category assignment paradigms are developed: comparing raters to a standard (index) versus comparing two raters to one another (concordance), using both nonstochastic and stochastic category membership. Using a probability model to express category assignments in terms of rater accuracy and random error, it is shown that observed agreement (Po) depends only on rater accuracy and number of categories; however, expected agreement (Pe) and κ depend additionally on category frequencies. Moreover, category frequencies affect Pe and κ solely through the variance of the category proportions, regardless of the specific frequencies underlying the variance. Paradoxically, some judgment paradigms involving stochastic categories are shown to yield higher κ values than their nonstochastic counterparts. Using the stated probability model, assignments to categories were generated for 552 combinations of paradigms, rater and category parameters, category frequencies, and number of stimuli. Observed means and standard errors for Po, Pe, and κ were fully consistent with theory expectations. Guidelines for interpretation of rater accuracy and reliability are offered, along with a discussion of alternatives to the basic model.
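For readers who want the three quantities side by side, the standard definitions of observed agreement, chance-expected agreement, and Cohen's kappa can be sketched as follows. This is a generic illustration of the usual formula, kappa = (Po - Pe) / (1 - Pe), not the article's probability model; the function name `cohens_kappa` and the ratings are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater1, rater2):
    """Return (Po, Pe, kappa) for two raters' category assignments."""
    if len(rater1) != len(rater2) or not rater1:
        raise ValueError("raters must label the same non-empty set of stimuli")
    n = len(rater1)

    # Observed agreement: proportion of stimuli given the same category.
    po = sum(a == b for a, b in zip(rater1, rater2)) / n

    # Expected agreement: sum over categories of the product of the
    # two raters' marginal proportions.
    c1, c2 = Counter(rater1), Counter(rater2)
    pe = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / n ** 2

    if pe == 1.0:
        raise ValueError("kappa is undefined when expected agreement is 1")
    return po, pe, (po - pe) / (1 - pe)


# Hypothetical ratings of ten stimuli into categories A and B.
r1 = ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"]
r2 = ["A", "A", "A", "A", "B", "B", "B", "B", "B", "A"]
po, pe, kappa = cohens_kappa(r1, r2)
print(f"Po = {po:.2f}, Pe = {pe:.2f}, kappa = {kappa:.2f}")
```

Because Pe is built from the marginal category proportions, changing the category frequencies changes Pe and therefore kappa even when Po is held fixed, which is the dependence the abstract describes.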
The purpose of this study is to compare alternative multidimensional scaling (MDS) methods for constraining the stimuli on the circumference of a circle and on the surface of a sphere. Specifically, the existing MDS-T method for plotting the stimuli on the circumference of a circle is applied, and its extension is proposed for constraining the stimuli on the surface of a sphere. The data analyzed come from previous research and concern Maslach and Jackson’s burnout syndrome and Holland’s vocational personality types. The configurations for the same data on the circle and the sphere shared similarities but also had differences; that is, the general item groupings were the same, but most of the differences across the two methods resulted in more meaningful interpretations for the three-dimensional configuration. Furthermore, in most cases, items and/or scales could be better discriminated from each other on the sphere.
To date, small sample problems with latent growth models (LGMs) have not received the same amount of attention in the literature as those with related mixed-effect models (MEMs). Although many models can be interchangeably framed as an LGM or an MEM, LGMs uniquely provide criteria to assess global data–model fit. However, previous studies have demonstrated poor small sample performance of these global data–model fit criteria, and three post hoc small sample corrections have been proposed and shown to perform well with complete data. However, these corrections use sample size in their computation, whose value is unclear when missing data are accommodated with full information maximum likelihood, as is common with LGMs. A simulation is provided to demonstrate the inadequacy of these small sample corrections in the near ubiquitous situation in growth modeling where data are incomplete. Then, a missing data correction for the small sample correction equations is proposed and shown through a simulation study to perform well in various conditions found in practice. An applied developmental psychology example is then provided to demonstrate how disregarding missing data in small sample correction equations can greatly affect assessment of global data–model fit.
Application of MIRT modeling procedures is dependent on the quality of parameter estimates provided by the estimation software and techniques used. This study investigated model parameter recovery of two popular MIRT packages, BMIRT and flexMIRT, under some common measurement conditions. These packages were specifically selected to investigate the model parameter recovery of three item parameter estimation techniques, namely, Bock–Aitkin EM (BA-EM), Markov chain Monte Carlo (MCMC), and Metropolis–Hastings Robbins–Monro (MH-RM) algorithms. The results demonstrated that all estimation techniques had similar root mean square error values when larger sample sizes and longer tests were used. Depending on the number of dimensions, sample size, and test length, each estimation technique exhibited some strengths and weaknesses. Overall, the BA-EM technique was found to have shorter estimation time with all test specifications.
Student Growth Percentiles (SGPs) increasingly are being used in the United States for inferences about student achievement growth and educator effectiveness. Emerging research has indicated that SGPs estimated from observed test scores have large measurement errors. As such, little is known about "true" SGPs, which are defined in terms of nonlinear functions of latent achievement attributes for individual students and their distributions across students. We develop a novel framework using latent regression multidimensional item response theory models to study distributional properties of true SGPs. We apply these methods to several cohorts of longitudinal item response data from more than 330,000 students in a large urban metropolitan area to provide new empirical information about true SGPs. We find that true SGPs are correlated 0.3 to 0.5 across mathematics and English language arts, and that they have nontrivial relationships with individual student characteristics, particularly student race/ethnicity and absenteeism. We evaluate the potential of using these relationships to improve the accuracy of SGPs estimated from observed test scores, finding that accuracy gains even under optimal circumstances are modest. We also consider the properties of SGPs averaged to the teacher level, widely used for teacher evaluations. We find that average true SGPs for individual teachers vary substantially as a function of the characteristics of the students they teach. We discuss implications of our findings for the estimation and interpretation of SGPs at both the individual and aggregate levels.
Cognitive diagnosis models are diagnostic models used to classify respondents into homogenous groups based on multiple categorical latent variables representing the measured cognitive attributes. This study aims to present longitudinal models for cognitive diagnosis modeling, which can be applied to repeated measurements in order to monitor attribute stability of individuals and to account for respondent dependence. Models based on combining latent transition analysis modeling and the DINA and DINO cognitive diagnosis models were developed and then evaluated through a Monte Carlo simulation study. The study results indicate that the proposed models provide adequate convergence and correct classification rates.
Reliable measurements are key to social science research. Multiple measures of reliability of the total score have been developed, including coefficient alpha, coefficient omega, the greatest lower bound reliability, and others. Among these, the coefficient alpha has been most widely used, and it is reported in nearly every study involving the measure of a construct through multiple items in social and behavioral research. However, it is known that coefficient alpha underestimates the true reliability unless the items are tau-equivalent, and coefficient omega is deemed as a practical alternative to coefficient alpha in estimating measurement reliability of the total score. However, many researchers noticed that the difference between alpha and omega is minor in applications. Since the observed differences in alpha and omega can be due to sampling errors, the purpose of the present study, therefore, is to propose a method to evaluate the difference of coefficient alpha (
Typically, in education and psychology research, the investigator collects data and subsequently performs descriptive and inferential statistics. For example, a researcher might compute group means and use the null hypothesis significance testing procedure to draw conclusions about the populations from which the groups were drawn. We propose an alternative inferential statistical procedure that is performed prior to data collection rather than afterwards. To use this procedure, the researcher specifies how close she or he desires the group means to be to their corresponding population means and how confident she or he wishes to be that this actually is so. We derive an equation that provides researchers with a way to determine the sample size needed to meet the specifications concerning closeness and confidence, regardless of the number of groups.
The theoretical reason for the presence of differential item functioning (DIF) is that data are multidimensional and two groups of examinees differ in their underlying ability distribution for the secondary dimension(s). Therefore, the purpose of this study was to determine how much the secondary ability distributions must differ before DIF is detected. Two-dimensional binary data sets were simulated using a compensatory multidimensional item response theory (MIRT) model, incrementally varying the mean difference on the second dimension between reference and focal group examinees while systematically increasing the correlation between dimensions. Three different DIF detection procedures were used to test for DIF: (1) SIBTEST, (2) Mantel–Haenszel, and (3) logistic regression. Results indicated that even with a very small mean difference on the secondary dimension, smaller than typically considered in previous research, DIF will be detected. Additional analyses indicated that even with the smallest mean difference considered in this study, 0.25, statistically significant differences will almost always be found between reference and focal group examinees on subtest scores consisting of items measuring the secondary dimension.
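As a rough illustration of the Mantel–Haenszel procedure named above, the following Python sketch computes the common odds ratio across matched score strata and converts it to the ETS delta scale. The function names and the tuple layout of each stratum are assumptions made for this example and are not taken from the study; the counts in the usage note are invented.

```python
import math


def mh_common_odds_ratio(strata):
    """Mantel-Haenszel common odds ratio across matched score strata.

    Each stratum is a tuple (a, b, c, d) of counts (hypothetical layout):
      a = reference group correct, b = reference group incorrect,
      c = focal group correct,     d = focal group incorrect.
    """
    num = 0.0
    den = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # an empty stratum carries no information
        num += a * d / n
        den += b * c / n
    if den == 0:
        raise ValueError("odds ratio undefined: denominator is zero")
    return num / den


def mh_delta(alpha):
    """Convert the common odds ratio to the ETS delta scale."""
    return -2.35 * math.log(alpha)
```

For instance, `mh_common_odds_ratio([(20, 10, 10, 20), (10, 10, 10, 10)])` combines two invented strata into a single odds ratio. A ratio above 1 (a negative delta) indicates that, at matched score levels, the item favors the reference group.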
Assessing global interrater agreement is difficult as most published indices are affected by the presence of mixtures of agreements and disagreements. A previously proposed method was shown to be specifically sensitive to global agreement, excluding mixtures, but also negatively biased. Here, we propose two alternatives in an attempt to find what makes such methods so specific. The first method, R_{B}, is found to be unbiased while still rejecting mixtures; it detects agreement with good power and is little affected by unequal category prevalence as soon as there are more than two categories.
The item-position effect describes how an item’s position within a test, that is, the number of previously completed items, affects the response to this item. Previously, this effect was represented by constraints reflecting simple courses, for example, a linear increase. Because of the inflexibility of these representations, our aim was to examine whether adapted representations are more appropriate than the existing ones. Models of confirmatory factor analysis were used for testing the different representations. Analyses were conducted by means of simulated data that followed the covariance pattern of Raven’s Advanced Progressive Matrices (APM) items. Since the item-position effect has been demonstrated repeatedly for the APM, it is a very suitable measure for our investigations. Results revealed no remarkable improvement by using an adapted representation. Possible reasons causing these results are discussed.
This article extends the procedure outlined in the article by Raykov, Marcoulides, and Tong for testing congruence of latent constructs to the setting of binary items and clustering effects. In this widely used setting in contemporary educational and psychological research, the method can be used to examine if two or more homogeneous multicomponent instruments with distinct components measure the same construct. The approach is useful in scale construction and development research as well as in construct validation investigations. The discussed method is illustrated with data from a scholastic aptitude assessment study.
Meta-analysis is a significant methodological advance that is increasingly important in research synthesis. Fundamental to meta-analysis is the presumption that effect sizes, such as the standardized mean difference (SMD), based on scores from different measures are comparable. It has been argued that population observed score SMDs based on scores from different measures A and B will be equal only if the conjunction of three conditions is met: construct equivalence (CE), equal reliabilities (ER), and the absence of differential test functioning (DTF) in all subpopulations of the combined populations of interest. It has also been speculated that the results of a meta-analysis of SMDs might differ between circumstances in which the SMDs included in the meta-analysis are based on measures that all meet the conjunction of these conditions and circumstances in which the conjunction of these conditions is violated. No previous studies have tested this conjecture. This Monte Carlo study investigated this hypothesis. A population of studies comparing one of five hypothetical treatments with a placebo condition was simulated. The SMDs in these simulated studies were based on true scores from six hypothetical measures. The scores from some of these measures met the conjunction of CE, ER, and the absence of DTF, while others failed to meet CE. Three meta-analyses were conducted using both fixed effects and random effects methods. The results suggested that the results of meta-analyses can vary to a practically significant degree when the SMDs were based on scores from measures failing to meet the CE condition. Implications for future research are considered.
The clinical assessment of mental disorders can be a time-consuming and error-prone procedure, consisting of a sequence of diagnostic hypothesis formulation and testing aimed at restricting the set of plausible diagnoses for the patient. In this article, we propose a novel computerized system for the adaptive testing of psychological disorders. The proposed system combines a mathematical representation of psychological disorders, known as the "formal psychological assessment," with an algorithm designed for the adaptive assessment of an individual’s knowledge. The assessment algorithm is extended and adapted to the new application domain. Testing the system on a real sample of 4,324 healthy individuals, screened for obsessive-compulsive disorder, we demonstrate the system’s ability to support clinical testing, both by identifying the correct critical areas for each individual and by reducing the number of posed questions with respect to a standard written questionnaire.
Researchers continue to be interested in efficient, accurate methods of estimating coefficients of covariates in mixture modeling. Including covariates related to the latent class analysis not only may improve the ability of the mixture model to clearly differentiate between subjects but also makes interpretation of latent group membership more meaningful. Very few studies have been conducted that compare the performance of various approaches to estimating covariate effects in mixture modeling, and fewer yet have considered more complicated models such as growth mixture models where the latent class variable is more difficult to identify. A Monte Carlo simulation was conducted to investigate the performance of four estimation approaches: (1) the conventional three-step approach, (2) the one-step maximum likelihood (ML) approach, (3) the pseudo class (PC) approach, and (4) the three-step ML approach in terms of their ability to recover covariate effects in the logistic regression class membership model within a growth mixture modeling framework. Results showed that when class separation was large, the one-step ML approach and the three-step ML approach displayed much less biased covariate effect estimates than either the conventional three-step approach or the PC approach. When class separation was poor, estimation of the relation between the dichotomous covariate and latent class variable was severely affected when the new three-step ML approach was used.
In behavioral sciences broadly, estimating growth models with Bayesian methods is becoming increasingly common, especially to combat small samples common with longitudinal data. Although Mplus is becoming an increasingly common program for applied research employing Bayesian methods, the limited selection of prior distributions for the elements of covariance structures makes more general software more advantageous under certain conditions. However, as a disadvantage of general software’s flexibility, few preprogrammed commands exist for specifying covariance structures. For instance, PROC MIXED has a few dozen such preprogrammed options, but when researchers divert to a Bayesian framework, software offers no such guidance and requires researchers to manually program these different structures, which is no small task. As such, the literature has noted that empirical papers tend to simplify their covariance matrices to circumvent this difficulty, which is not desirable because such a simplification will likely lead to biased estimates of variance components and standard errors. To facilitate wider implementation of Bayesian growth models that properly model covariance structures, this article overviews how to generally program a growth model in SAS PROC MCMC and then demonstrates how to program common residual error structures. Full annotated SAS code and an applied example are provided.
A number of studies have found multiple indicators multiple causes (MIMIC) models to be an effective tool in detecting uniform differential item functioning (DIF) for individual items and item bundles. A recently developed MIMIC-interaction model is capable of detecting both uniform and nonuniform DIF in the unidimensional item response theory (IRT) framework. The goal of the current study is to extend the MIMIC-interaction model for detecting DIF in the context of multidimensional IRT modelling and examine the performance of the multidimensional MIMIC-interaction model under various simulation conditions with respect to Type I error and power rates. Simulation conditions include DIF pattern and magnitude, test length, correlation between latent traits, sample size, and latent mean differences between focal and reference groups. The results of this study indicate that power rates of the multidimensional MIMIC-interaction model under uniform DIF conditions were higher than those of nonuniform DIF conditions. When anchor item length and sample size increased, power for detecting DIF increased. Also, the equal latent mean condition tended to produce higher power rates than the different mean condition. Although the multidimensional MIMIC-interaction model was found to be a reasonably useful tool for identifying uniform DIF, the performance of the model in detecting nonuniform DIF appeared to be questionable.
Purification of the test has been a well-accepted procedure in enhancing the performance of tests for differential item functioning (DIF). As defined by Lord, purification requires reestimation of ability parameters after removing DIF items before conducting the final DIF analysis. IRTPRO 3 is a recently updated program for analyses in item response theory, with built-in DIF tests but not purification procedures. A simulation study was conducted to investigate the effect of two new methods of purification. The results suggested that one of the purification procedures showed significantly improved power and Type I error. The procedure, which can be cumbersome by hand, can be easily applied by practitioners by using the web-based program developed for this study.
There is an increasing demand for assessments that can provide more fine-grained information about examinees. In response to the demand, diagnostic measurement provides students with feedback on their strengths and weaknesses on specific skills by classifying them into mastery or nonmastery attribute categories. These attributes often form a hierarchical structure because student learning and development is a sequential process where many skills build on others. However, it remains to be seen if we can use information from the attribute structure and work that into the design of the diagnostic tests. The purpose of this study is to introduce three approaches of Q-matrix design and investigate their impact on classification results under different attribute structures. Results indicate that the adjacent approach provides higher accuracy in a shorter test length when compared with other Q-matrix design approaches. This study provides researchers and practitioners guidance on how to design the Q-matrix in diagnostic tests, which are in high demand from educators.
A fundamental assumption in computerized adaptive testing is that item parameters are invariant with respect to context—items surrounding the administered item. This assumption, however, may not hold in forced-choice (FC) assessments, where explicit comparisons are made between items included in the same block. We empirically examined the influence of context on item parameters by comparing parameter estimates from two FC instruments. The first instrument was composed of blocks of three items, whereas in the second, the context was manipulated by adding one item to each block, resulting in blocks of four. The item parameter estimates were highly similar. However, a small number of significant deviations were observed, confirming the importance of context when designing adaptive FC assessments. Two patterns of such deviations were identified, and methods to reduce their occurrences in an FC computerized adaptive testing setting were proposed. It was shown that with a small proportion of violations of the parameter invariance assumption, score estimation remained stable.
Various tests to check the homogeneity of variance assumption have been proposed in the literature, yet there is no consensus as to their robustness when the assumption of normality does not hold. This simulation study evaluated the performance of 14 tests for the homogeneity of variance assumption in one-way ANOVA models in terms of Type I error control and statistical power. Seven factors were manipulated: number of groups, average number of observations per group, pattern of sample sizes in groups, pattern of population variances, maximum variance ratio, population distribution shape, and nominal alpha level for the test of variances. Overall, the Ramsey conditional, O’Brien, Brown–Forsythe, Bootstrap Brown–Forsythe, and Levene with squared deviations tests maintained adequate Type I error control, performing better than the others across all the conditions. The power for each of these five tests was acceptable and the power differences were subtle. Guidelines for selecting a valid test for assessing the tenability of this critical assumption are provided based on average cell size.
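To make one of the tests named above concrete, the following Python sketch computes the Brown–Forsythe statistic, that is, a one-way ANOVA F ratio on absolute deviations from each group's median. It is a minimal illustration using only the standard library; the function name is an assumption for this example, the data in the usage note are invented, and the sketch does not reproduce the simulation design of the study.

```python
from statistics import mean, median


def brown_forsythe_f(groups):
    """Return (F, df1, df2) for the Brown-Forsythe homogeneity of variance test.

    groups: a list of lists of numeric observations, one list per group.
    """
    # Step 1: absolute deviations of each observation from its group median.
    z = []
    for g in groups:
        med = median(g)
        z.append([abs(x - med) for x in g])

    # Step 2: one-way ANOVA on the absolute deviations.
    k = len(z)
    n_total = sum(len(g) for g in z)
    grand_mean = sum(sum(g) for g in z) / n_total
    group_means = [mean(g) for g in z]

    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(z, group_means)
    )
    ss_within = sum(
        (v - m) ** 2 for g, m in zip(z, group_means) for v in g
    )

    df1 = k - 1
    df2 = n_total - k
    f_stat = (ss_between / df1) / (ss_within / df2)
    return f_stat, df1, df2
```

For example, `brown_forsythe_f([[1, 2, 3, 4, 5], [2, 4, 6, 8, 10]])` compares two invented groups whose spreads differ by a factor of two. The resulting F is referred to an F distribution with `df1` and `df2` degrees of freedom; replacing the median with the mean in step 1 gives the original Levene test.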
Several researchers have recommended that level-specific fit indices should be applied to detect the lack of model fit at any level in multilevel structural equation models. Although we concur with their view, we note that these studies did not sufficiently consider the impact of intraclass correlation (ICC) on the performance of level-specific fit indices. Our study proposed to fill this gap in the methodological literature. A Monte Carlo study was conducted to investigate the performance of (a) level-specific fit indices derived by a partially saturated model method (e.g.,
Molenaar extended Mokken’s original probabilistic-nonparametric scaling models for use with polytomous data. These polytomous extensions of Mokken’s original scaling procedure have facilitated the use of Mokken scale analysis as an approach to exploring fundamental measurement properties across a variety of domains in which polytomous ratings are used, including rater-mediated educational assessments. Because their underlying item step response functions (i.e., category response functions) are defined using cumulative probabilities, polytomous Mokken models can be classified as cumulative models based on the classifications of polytomous item response theory models proposed by several scholars. In order to permit a closer conceptual alignment with educational performance assessments, this study presents an adjacent-categories variation on the polytomous monotone homogeneity and double monotonicity models. Data from a large-scale rater-mediated writing assessment are used to illustrate the adjacent-categories approach, and results are compared with the original formulations. Major findings suggest that the adjacent-categories models provide additional diagnostic information related to individual raters’ use of rating scale categories that is not observed under the original formulation. Implications are discussed in terms of methods for evaluating rating quality.
Although multidimensional adaptive testing (MAT) has been proven to be highly advantageous with regard to measurement efficiency when several highly correlated dimensions are measured, there are few operational assessments that use MAT. This may be due to issues of constraint management, which is more complex in MAT than it is in unidimensional adaptive testing. Very few studies have examined the performance of existing constraint management methods (CMMs) in MAT. The present article focuses on the effectiveness of two promising heuristic CMMs in MAT for varying levels of imposed constraints and for various correlations between the measured dimensions. Through a simulation study, the multidimensional maximum priority index (MMPI) and multidimensional weighted penalty model (MWPM), as an extension of the weighted penalty model, are examined with regard to measurement precision and constraint violations. The results show that both CMMs are capable of addressing complex constraints in MAT. However, measurement precision losses were found to differ between the MMPI and MWPM. While the MMPI appears to be more suitable for use in assessment situations involving few to a moderate number of constraints, the MWPM should be used when numerous constraints are involved.
This study defines subpopulation item parameter drift (SIPD) as a change in item parameters over time that is dependent on subpopulations of examinees, and hypothesizes that the presence of SIPD in anchor items is associated with bias and/or lack of invariance in three psychometric outcomes. Results show that SIPD in anchor items is associated with a lack of invariance in dimensionality structure of an anchor test, a lack of invariance in scaling coefficients across subpopulations, and a lack of invariance in ability estimates. It is demonstrated that these effects go beyond what can be understood from item parameter drift or differential item functioning.
Field experiments in education frequently assign entire groups such as schools to treatment or control conditions. These experiments sometimes incorporate a longitudinal component in which, for example, students are followed over time to assess differences in the average rate of linear change or the rate of acceleration. In this study, we provide methods for power analysis in three-level polynomial change models for cluster randomized designs (i.e., treatment assigned to units at the third level). Power computations take into account clustering effects at the second and third levels, the number of measurement occasions, the impact of sample sizes at different levels (e.g., number of schools or students), and covariate effects. An illustrative example is presented that shows how power is influenced by the number of measurement occasions and by sample sizes and covariates at the second or third levels.
Mixture item response theory (IRT) models have been suggested as an efficient method of detecting the different response patterns derived from latent classes when developing a test. In testing situations, multiple latent traits measured by a battery of tests can exhibit a higher-order structure, and mixtures of latent classes may occur on different orders and influence the item responses of examinees from different classes. This study aims to develop a new class of higher-order mixture IRT models by integrating mixture IRT models and higher-order IRT models to address these practical concerns. The proposed higher-order mixture IRT models can accommodate both linear and nonlinear models for latent traits and incorporate diverse item response functions. The Rasch model was selected as the item response function, metric invariance was assumed in the first simulation study, and multiparameter IRT models without an assumption of metric invariance were used in the second simulation study. The results show that the parameters can be recovered fairly well using WinBUGS with Bayesian estimation. A larger sample size resulted in a better estimate of the model parameters, and a longer test length yielded better individual ability recovery and latent class membership recovery. The linear approach outperformed the nonlinear approach in the estimation of first-order latent traits, whereas the opposite was true for the estimation of the second-order latent trait. Additionally, imposing identical factor loadings between the second- and first-order latent traits by fitting the mixture bifactor model resulted in biased estimates of the first-order latent traits and item parameters. Finally, two empirical analyses are provided as an example to illustrate the applications and implications of the new models.
Even though there is an increasing interest in response styles, the field lacks a systematic investigation of the bias that response styles potentially cause. Therefore, a simulation was carried out to study this phenomenon with a focus on applied settings (reliability, validity, scale scores). The influence of acquiescence and extreme response style was investigated, and independent variables were, for example, the number of reverse-keyed items. Data were generated from a multidimensional item response model. The results indicated that response styles may bias findings based on self-report data and that this bias may be substantial if the attribute of interest is correlated with response style. However, in the absence of such correlations, bias was generally very small, especially for extreme response style and if acquiescence was controlled for by reverse-keyed items. An empirical example was used to illustrate and validate the simulations. In summary, it is concluded that the threat of response styles may be smaller than feared.
The purpose of this article is to highlight the distinction between the reliability of test scores and the fit of psychometric measurement models, reminding readers why it is important to consider both when evaluating whether test scores are valid for a proposed interpretation and/or use. It is often the case that an investigator judges both the reliability of scores and the fit of a corresponding measurement model to be either acceptable or unacceptable for a given situation, but these are not the only possible outcomes. This article focuses on situations in which model fit is deemed acceptable, but reliability is not. Data were simulated based on the item characteristics of the PROMIS (Patient Reported Outcomes Measurement Information System) anxiety item bank and analyzed using methods from classical test theory, factor analysis, and item response theory. Analytic techniques from different psychometric traditions were used to illustrate that reliability and model fit are distinct, and that disagreement among indices of reliability and model fit may provide important information bearing on a particular validity argument, independent of the data analytic techniques chosen for a particular research application. We conclude by discussing the important information gleaned from the assessment of reliability and model fit.
This study examined the performance of a proposed iterative Wald approach for detecting differential item functioning (DIF) between two groups when preknowledge of anchor items is absent. The iterative approach utilizes the Wald-2 approach to identify anchor items and then iteratively tests for DIF items with the Wald-1 approach. Monte Carlo simulation was conducted across several conditions including the number of response options, test length, sample size, percentage of DIF items, DIF effect size, and type of cumulative DIF. Results indicated that the iterative approach performed well for polytomous data in all conditions, with well-controlled Type I error rates and high power. For dichotomous data, the iterative approach also exhibited better control over Type I error rates than the Wald-2 approach without sacrificing the power in detecting DIF. However, inflated Type I error rates were found for the iterative approach in conditions with dichotomous data, noncompensatory DIF, large percentage of DIF items, and medium to large DIF effect sizes. Nevertheless, the Type I error rates were substantially less inflated in those conditions compared with the Wald-2 approach.
This note is concerned with examining the relationship between within-group and between-group variances in two-level nested designs. A latent variable modeling approach is outlined that permits point and interval estimation of their ratio and allows their comparison in a multilevel study. The procedure can also be used to test various hypotheses about the discrepancy between these two variances and assist with their relationship interpretability in empirical investigations. The method can also be utilized as an addendum to point and interval estimation of the popular intraclass correlation coefficient in hierarchical designs. The discussed approach is illustrated with a numerical example.
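The link between the variance ratio discussed above and the intraclass correlation coefficient can be shown with a short Python sketch. It relies only on the standard definition of the ICC as the between-group share of total variance; the function names are assumptions for this example, and the sketch does not implement the latent variable estimation procedure of the note.

```python
def icc_from_variances(var_between, var_within):
    """Intraclass correlation: between-group variance over total variance."""
    total = var_between + var_within
    if total <= 0:
        raise ValueError("total variance must be positive")
    return var_between / total


def icc_from_ratio(within_to_between_ratio):
    """ICC expressed through the ratio r = var_within / var_between.

    Dividing numerator and denominator of the ICC by var_between
    gives ICC = 1 / (1 + r).
    """
    if within_to_between_ratio < 0:
        raise ValueError("a variance ratio cannot be negative")
    return 1.0 / (1.0 + within_to_between_ratio)
```

Both functions describe the same quantity: a within-to-between ratio of 3 corresponds to an ICC of 0.25, so an interval estimate for the ratio maps directly onto an interval for the ICC because the transformation is monotone.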
Performance of students in low-stakes testing situations has been a concern and focus of recent research. However, researchers who have examined the effect of stakes on performance have not been able to compare low-stakes performance to truly high-stakes performance of the same students. Results of such a comparison are reported in this article. GRE test takers volunteered to take an additional low-stakes test, of either verbal or quantitative reasoning, as part of a research study immediately following their operational high-stakes test. Analyses of performance under the high- and low-stakes situations revealed that the level of effort in the low-stakes situation (as measured by the amount of time on task) strongly predicted the stakes effect on performance (difference between test scores in low- and high-stakes situations). Moreover, the stakes effect virtually disappeared for participants who spent at least one-third of the allotted time in the low-stakes situation. For this group of test takers (more than 80% of the total sample), the correlations between the low- and high-stakes scores approached the upper bound possible considering the reliability of the test.
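The upper bound mentioned above can be illustrated with the classical test theory attenuation relation, under which the correlation between two observed scores cannot exceed the square root of the product of their reliabilities. The Python sketch below assumes this standard relation applies; the function names are assumptions for this example, the reliability values in the usage note are hypothetical, and none of it is taken from the article.

```python
import math


def max_observed_correlation(rel_x, rel_y):
    """Largest observed correlation possible given two score reliabilities."""
    for rel in (rel_x, rel_y):
        if not 0.0 <= rel <= 1.0:
            raise ValueError("reliabilities must lie in [0, 1]")
    return math.sqrt(rel_x * rel_y)


def disattenuated_correlation(r_xy, rel_x, rel_y):
    """Observed correlation corrected for unreliability in both scores."""
    bound = max_observed_correlation(rel_x, rel_y)
    if bound == 0:
        raise ValueError("correction undefined when a reliability is zero")
    return r_xy / bound
```

With hypothetical reliabilities of 0.81 for both test administrations, the bound is about 0.81, so an observed correlation close to that value would imply that the two sets of scores rank test takers almost identically once measurement error is taken into account.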
In a pioneering research article, Wollack and colleagues suggested the "erasure detection index" (EDI) to detect test tampering. The EDI can be used with or without a continuity correction and is assumed to follow the standard normal distribution under the null hypothesis of no test tampering. When used without a continuity correction, the EDI often has inflated Type I error rates. When used with a continuity correction, the EDI has satisfactory Type I error rates, but smaller power compared with the EDI without a continuity correction. This article suggests three methods for detecting test tampering that do not rely on the assumption of a standard normal distribution under the null hypothesis. It is demonstrated in a detailed simulation study that the performance of each suggested method is slightly better than that of the EDI. The EDI and the suggested methods were applied to a real data set. The suggested methods, although more computation intensive than the EDI, seem to be promising in detecting test tampering.
Growth mixture modeling is generally used for two purposes: (1) to identify mixtures of normal subgroups and (2) to approximate oddly shaped distributions by a mixture of normal components. Often in applied research this methodology is applied to both of these situations indiscriminately, using the same fit statistics and likelihood ratio tests. This can lead to the overextraction of latent classes and the attribution of substantive meaning to these spurious classes. The goals of this study are (1) to explore the performance of the Bayesian information criterion (BIC), sample-adjusted BIC, and bootstrap likelihood ratio test in growth mixture modeling analysis with nonnormally distributed outcome variables and (2) to examine the effects of nonnormal time-invariant covariates in the estimation of the number of latent classes when outcome variables are normally distributed. For both of these goals, we will include nonnormal conditions not considered previously in the literature. Two simulation studies were conducted. Results show that spurious classes may be selected and optimal solutions obtained in the data analysis when the population departs from normality even when the nonnormality is only present in time-invariant covariates.
This article reproduces correspondence between Georg Rasch of The University of Copenhagen and Benjamin Wright of The University of Chicago in the period from January 1966 to July 1967. This correspondence reveals their struggle to operationalize a unidimensional measurement model with sufficient statistics for responses in a set of ordered categories. The article then explains the original approach taken by Rasch, Wright, and Andersen, and then how, from a different tack originating in 1961 and culminating in 1978, three distinct stages of development led to the current relatively simple and elegant form of the model. The article shows that over this period of almost two decades, the demand for sufficiency of a unidimensional parameter of the object of measurement, which enabled the separation of this parameter from the parameter of the instrument, drove the theoretical development of the model.
The measurement error in principal components extracted from a set of fallible measures is discussed and evaluated. It is shown that as long as one or more measures in a given set of observed variables contains error of measurement, so also does any principal component obtained from the set. The error variance in any principal component is shown to be (a) bounded from below by the smallest error variance in a variable from the analyzed set and (b) bounded from above by the largest error variance in a variable from that set. In the case of a unidimensional set of analyzed measures, it is pointed out that the reliability and criterion validity of any principal component are bounded from above by these respective coefficients of the optimal linear combination with maximal reliability and criterion validity (for a criterion unrelated to the error terms in the individual measures). The discussed psychometric features of principal components are illustrated on a numerical data set.
This article describes an approach to test scoring, referred to as delta scoring (D-scoring), for tests with dichotomously scored items. The D-scoring uses information from item response theory (IRT) calibration to facilitate computations and interpretations in the context of large-scale assessments. The D-score is computed from the examinee’s response vector, which is weighted by the expected difficulties (not "easiness") of the test items. The expected difficulty of each item is obtained as an analytic function of its IRT parameters. The D-scores are independent of the sample of test-takers as they are based on expected item difficulties. It is shown that the D-scale performs a good bit better than the IRT logit scale by criteria of scale intervalness. To equate D-scales, it is sufficient to rescale the item parameters, thus avoiding tedious and error-prone procedures of mapping test characteristic curves under the method of IRT true score equating, which is often used in the practice of large-scale testing. The proposed D-scaling proved promising under its current piloting with large-scale assessments and the hope is that it can efficiently complement IRT procedures in the practice of large-scale testing in the field of education and psychology.
This study examined the predictors and psychometric outcomes of survey satisficing, wherein respondents provide quick, "good enough" answers (satisficing) rather than carefully considered answers (optimizing). We administered surveys to university students and respondents—half of whom held college degrees—from a for-pay survey website, and we used an experimental method to randomly assign the participants to survey formats, which presumably differed in task difficulty. Based on satisficing theory, we predicted that ability, motivation, and task difficulty would predict satisficing behavior and that satisficing would artificially inflate internal consistency reliability and both convergent and discriminant validity correlations. Indeed, results indicated effects for task difficulty and motivation in predicting survey satisficing, and satisficing in the first part of the study was associated with improved internal consistency reliability and convergent validity but also worse discriminant validity in the second part of the study. Implications for research designs and improvements are discussed.
This article introduces an entropy-based measure of data–model fit that can be used to assess the quality of logistic regression models. Entropy has previously been used in mixture-modeling to quantify how well individuals are classified into latent classes. The current study proposes the use of entropy for logistic regression models to quantify the quality of classification and separation of group membership. Entropy complements preexisting measures of data–model fit and provides unique information not contained in other measures. Hypothetical data scenarios, an applied example, and Monte Carlo simulation results are used to demonstrate the application of entropy in logistic regression. Entropy should be used in conjunction with other measures of data–model fit to assess how well logistic regression models classify cases into observed categories.
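As a rough illustration of the idea in the preceding abstract, the sketch below computes a normalized (relative) entropy statistic from the predicted probabilities of a binary logistic regression. The abstract does not state the article's exact formula, so this borrows the relative entropy index commonly reported in mixture modeling and applies it to two observed categories; the function name `normalized_entropy` and the choice of that index are assumptions made for illustration only.

```python
import math

def normalized_entropy(probs):
    """Relative entropy for a two-category classifier (assumed form).

    probs: predicted P(y = 1) for each case, e.g. from a fitted
    logistic regression. Returns a value in [0, 1]; values near 1
    mean cases are assigned to a category with near certainty,
    values near 0 mean the predicted probabilities sit near 0.5.
    """
    total = 0.0
    for p in probs:
        for q in (p, 1.0 - p):
            if q > 0.0:                 # treat 0 * log(0) as 0
                total -= q * math.log(q)
    # Divide by the maximum possible entropy, n * ln(K), with K = 2.
    return 1.0 - total / (len(probs) * math.log(2))

# Hypothetical predicted probabilities, not data from the article.
print(normalized_entropy([0.95, 0.10, 0.80, 0.02]))
```

Under this assumed form, the index summarizes how sharply the model separates cases, which is why it would be read alongside, not instead of, other fit measures.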
Bullying among youth is recognized as a serious student problem, especially in middle school. The most common approach to measuring bullying is through student self-report surveys that ask questions about different types of bullying victimization. Although prior studies have shown that question-order effects may influence participant responses, no study has examined these effects with middle school students. A randomized experiment (n = 5,951 middle school students) testing the question-order effect found that changing the sequence of questions can result in 45% higher prevalence rates. These findings raise questions about the accuracy of several widely used bullying surveys.
The number of performance assessments continues to increase around the world, and it is important to explore new methods for evaluating the quality of ratings obtained from raters. This study describes an unfolding model for examining rater accuracy. Accuracy is defined as the difference between observed and expert ratings. Dichotomous accuracy ratings (0 = inaccurate, 1 = accurate) are unfolded into three latent categories: inaccurate below expert ratings, accurate ratings, and inaccurate above expert ratings. The hyperbolic cosine model (HCM) is used to examine dichotomous accuracy ratings from a statewide writing assessment. This study suggests that HCM is a promising approach for examining rater accuracy, and that the HCM can provide a useful interpretive framework for evaluating the quality of ratings obtained within the context of rater-mediated assessments.
Methods to assess the significance of mediated effects in education and the social sciences are well studied and fall into two categories: single sample methods and computer-intensive methods. A popular single sample method to detect the significance of the mediated effect is the test of joint significance, and a popular computer-intensive method to detect the significance of the mediated effect is the bias-corrected bootstrap method. Both these methods are used for testing the significance of mediated effects in structural equation models (SEMs). A recent study by Leth-Steensen and Gallitto (2015) provided evidence that the test of joint significance was more powerful than the bias-corrected bootstrap method for detecting mediated effects in SEMs, which is inconsistent with previous research on the topic. The goal of this article was to investigate this surprising result and describe two issues related to testing the significance of mediated effects in SEMs which explain the inconsistent results regarding the power of the test of joint significance and the bias-corrected bootstrap found by Leth-Steensen and Gallitto (2015). The first issue was that the bias-corrected bootstrap method was conducted incorrectly. The bias-corrected bootstrap was used to estimate the standard error of the mediated effect as opposed to creating confidence intervals. The second issue was that the correlation between the path coefficients of the mediated effect was ignored as an important aspect of testing the significance of the mediated effect in SEMs. The results of the replication study confirmed prior research on testing the significance of mediated effects. That is, the bias-corrected bootstrap method was more powerful than the test of joint significance, and the bias-corrected bootstrap method had elevated Type I error rates in some cases. Additional methods for testing the significance of mediated effects in SEMs were considered and limitations and future directions were discussed.
The multilevel latent class model (MLCM) is a multilevel extension of a latent class model (LCM) that is used to analyze data with a nested structure. The nonparametric version of an MLCM assumes a discrete latent variable at a higher-level nesting structure to account for the dependency among observations nested within a higher-level unit. In the present study, a simulation study was conducted to investigate the impact of ignoring the higher-level nesting structure. Three criteria—the model selection accuracy, the classification quality, and the parameter estimation accuracy—were used to evaluate the impact of ignoring the nested data structure. The results of the simulation study showed that ignoring higher-level nesting structure in an MLCM resulted in the poor performance of the Bayesian information criterion to recover the true latent structure, the inaccurate classification of individuals into latent classes, and the inflation of standard errors for parameter estimates, while the parameter estimates were not biased. This article concludes with remarks on ignoring the nested structure in nonparametric MLCMs, as well as recommendations for applied researchers when LCM is used for data collected from a multilevel nested structure.
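The distinction drawn in the preceding abstract, using the bias-corrected bootstrap to build a confidence interval rather than to estimate a standard error, can be sketched as follows. This is a simplified illustration only: it uses a single-mediator model estimated by ordinary least squares on observed variables rather than an SEM, and the function names (`mediated_effect`, `bc_bootstrap_ci`) and defaults are hypothetical, not taken from the article.

```python
import random
from statistics import NormalDist

def _cross(u, v):
    """Sum of cross-products of deviations from the means."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))

def mediated_effect(x, m, y):
    """a*b, with a from m ~ x and b the slope of m in y ~ x + m (OLS)."""
    sxx, smm = _cross(x, x), _cross(m, m)
    sxm, sxy, smy = _cross(x, m), _cross(x, y), _cross(m, y)
    a = sxm / sxx
    b = (sxx * smy - sxm * sxy) / (sxx * smm - sxm ** 2)
    return a * b

def bc_bootstrap_ci(x, m, y, n_boot=2000, level=0.95, seed=0):
    """Bias-corrected bootstrap confidence interval for a*b."""
    rng = random.Random(seed)
    n = len(x)
    est = mediated_effect(x, m, y)
    boots = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample cases
        boots.append(mediated_effect([x[i] for i in idx],
                                     [m[i] for i in idx],
                                     [y[i] for i in idx]))
    boots.sort()
    nd = NormalDist()
    # Bias correction z0: share of bootstrap estimates below the estimate.
    prop = sum(b < est for b in boots) / n_boot
    prop = min(max(prop, 1 / (n_boot + 1)), n_boot / (n_boot + 1))
    z0 = nd.inv_cdf(prop)
    alpha = 1 - level

    def pick(z):
        # Adjusted percentile, mapped to an index in the sorted estimates.
        p = nd.cdf(2 * z0 + z)
        return boots[min(n_boot - 1, max(0, int(p * n_boot)))]

    return pick(nd.inv_cdf(alpha / 2)), pick(nd.inv_cdf(1 - alpha / 2))
```

In this sketch the mediated effect would be judged significant when the returned interval excludes zero; the interval endpoints come from adjusted percentiles of the bootstrap distribution, not from a bootstrap standard error plugged into a normal-theory test.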
We investigated methods of including covariates in two-level models for cluster randomized trials to increase power to detect the treatment effect. We compared multilevel models that included either an observed cluster mean or a latent cluster mean as a covariate, as well as the effect of including Level 1 deviation scores in the model. A Monte Carlo simulation study was performed manipulating effect sizes, cluster sizes, number of clusters, intraclass correlation of the outcome, patterns of missing data, and the squared correlations between Level 1 and Level 2 covariates and the outcome. We found no substantial difference between models with observed means or latent means with respect to convergence, Type I error rates, coverage, and bias. However, coverage could fall outside of acceptable limits if a latent mean is included as a covariate when cluster sizes are small. In terms of statistical power, models with observed means performed similarly to models with latent means, but better when cluster sizes were small. A demonstration is provided using data from a study of the Tools for Getting Along intervention.
A method for evaluating the validity of multicomponent measurement instruments in heterogeneous populations is discussed. The procedure can be used for point and interval estimation of criterion validity of linear composites in populations representing mixtures of an unknown number of latent classes. The approach permits also the evaluation of between-class validity differences as well as within-class validity coefficients. The method can similarly be used with known class membership when distinct populations are investigated, their number is known beforehand and membership in them is observed for the studied subjects, as well as in settings where only the number of latent classes is known. The discussed procedure is illustrated with numerical data.
Multilevel modeling (MLM) is frequently used to detect cluster-level group differences in cluster randomized trials and observational studies. Group differences on the outcomes (posttest scores) are detected by controlling for the covariate (pretest scores) as a proxy variable for unobserved factors that predict future attributes. The pretest and posttest scores that are most often used in MLM are total scores. In prior research, there have been concerns regarding measurement error when total scores are used in MLM. In this article, using ordinary least squares and an attenuation formula, we derive the measurement error correction formula for cluster-level group difference estimates from MLM in the presence of measurement error in the outcome, the covariate, or both. Examples are provided to illustrate the correction formula in cluster randomized and observational studies using recently developed between-cluster reliability coefficients.
There are many reasons to believe that open-ended (OE) and multiple-choice (MC) items elicit different cognitive demands of students. However, empirical evidence that supports this view is lacking. In this study, we investigated the reactions of test takers to an interactive assessment with immediate feedback and answer-revision opportunities for the two types of items. Eighth-grade students solved mathematics problems, both MC and OE, with standard instructions and feedback-and-revision opportunities. An analysis of scores based on revised answers in feedback mode revealed gains in measurement precision for OE items but not for MC items. These results are explained through the concept of effortful engagement—the OE format encourages more mindful engagement with the items in interactive mode. This interpretation is supported by analyses of response times and test takers’ reports.
The present study investigates different approaches to adding covariates and the impact in fitting mixture item response theory models. Mixture item response theory models serve as an important methodology for tackling several psychometric issues in test development, including the detection of latent differential item functioning. A Monte Carlo simulation study is conducted in which data generated according to a two-class mixture Rasch model with both dichotomous and continuous covariates are fitted to several mixture Rasch models with misspecified covariates to examine the effects of covariate inclusion on model parameter estimation. In addition, both complete response data and incomplete response data with different types of missingness are considered in the present study in order to simulate practical assessment settings. Parameter estimation is carried out within a Bayesian framework via Markov chain Monte Carlo algorithms.
Standard approaches for estimating item response theory (IRT) model parameters generally work under the assumption that the latent trait being measured by a set of items follows the normal distribution. Estimation of IRT parameters in the presence of nonnormal latent traits has been shown to generate biased person and item parameter estimates. A number of methods, including Ramsay curve item response theory, have been developed to reduce such bias, and have been shown to work well for relatively large samples and long assessments. An alternative approach to the nonnormal latent trait and IRT parameter estimation problem, the nonparametric Bayesian estimation approach, has recently been introduced into the literature. Very early work with this method has shown that it could be an excellent option for use when fitting the Rasch model when assumptions cannot be made about the distribution of the model parameters. The current simulation study was designed to extend research in this area by expanding the simulation conditions under which it is examined and to compare the nonparametric Bayesian estimation approach to the Ramsay curve item response theory, marginal maximum likelihood, maximum a posteriori, and the Bayesian Markov chain Monte Carlo estimation method. Results of the current study suggest that the nonparametric Bayesian estimation approach may be a preferred option when fitting a Rasch model in the presence of nonnormal latent traits and item difficulties, as it proved to be most accurate in virtually all scenarios that were simulated in this study.
In this article, an overview is given of four methods to perform factor score regression (FSR), namely regression FSR, Bartlett FSR, the bias avoiding method of Skrondal and Laake, and the bias correcting method of Croon. The bias correcting method is extended to include a reliable standard error. The four methods are compared with each other and with structural equation modeling (SEM) by using analytic calculations and two Monte Carlo simulation studies to examine their finite sample characteristics. Several performance criteria are used, such as the bias using the unstandardized and standardized parameterization, efficiency, mean square error, standard error bias, Type I error rate, and power. The results show that the bias correcting method, with the newly developed standard error, is the only suitable alternative to SEM. While it has a higher standard error bias than SEM, it has a comparable bias, efficiency, mean square error, power, and Type I error rate.
The purpose of the present studies was to test the hypothesis that the psychometric characteristics of ability scales may be significantly distorted if one accounts for emotional factors during test taking. Specifically, the present studies evaluate the effects of anxiety and motivation on the item difficulties of the Rasch model. In Study 1, the validity of a reading comprehension scale was evaluated using the Rasch model with 60 students with learning disabilities (LD). Item parameters were retested for the presence of anxiety and results indicated that the scale was substantially more difficult in its presence. Study 2 replicated the findings of Study 1 using maladaptive motivation and extended them with the inclusion of adaptive motivational variables in order to reverse the effect. Results using students with and without LD indicated that the difficulty levels of the scale were lower for students with LD, in the presence of positive motivation, compared with a typical student group. Study 3 extended the dichotomous hierarchical generalized linear model with polytomous data. The measures of an ability test were adjusted for the presence of anxiety and results indicated that differential item functioning was observed at both the global level and the most difficult ability item. It is concluded that the difficulty levels of a scale are heavily influenced by situational factors during testing, such as students’ entry levels of motivation and affect.
Mokken scale analysis is a probabilistic nonparametric approach that offers statistical and graphical tools for evaluating the quality of social science measurement without placing potentially inappropriate restrictions on the structure of a data set. In particular, Mokken scaling provides a useful method for evaluating important measurement properties, such as invariance, in contexts where response processes are not well understood. Because rater-mediated assessments involve complex interactions among many variables, including assessment contexts, student artifacts, rubrics, individual rater characteristics, and others, rater-assigned scores are suitable candidates for Mokken scale analysis. The purposes of this study are to describe a suite of indices that can be used to explore the psychometric quality of data from rater-mediated assessments and to illustrate the substantive interpretation of Mokken-based statistics and displays in this context. Techniques that are commonly used in polytomous applications of Mokken scaling are adapted for use with rater-mediated assessments, with a focus on the substantive interpretation related to individual raters. Overall, the findings suggest that indices of rater monotonicity, rater scalability, and invariant rater ordering based on Mokken scaling provide diagnostic information at the level of individual raters related to the requirements for invariant measurement. These Mokken-based indices serve as an additional suite of diagnostic tools for exploring the quality of data from rater-mediated assessments that can supplement rating quality indices based on parametric models.
A latent variable modeling procedure is discussed that can be used to test if two or more homogeneous multicomponent instruments with distinct components are measuring the same underlying construct. The method is widely applicable in scale construction and development research and can also be of special interest in construct validation studies. The approach can be readily utilized in empirical settings with observed measure nonnormality and/or incomplete data sets. The procedure is based on testing model nesting restrictions, and it can be similarly employed to examine the collapsibility of latent variables evaluated by multidimensional measuring instruments. The outlined method is illustrated with two data examples.
We use data from a large-scale experiment conducted in Indiana in 2009-2010 to examine the impact of two interim assessment programs (mCLASS and Acuity) across the mathematics and reading achievement distributions. Specifically, we focus on whether the use of interim assessments has a particularly strong effect on improving outcomes for low achievers. Quantile regression is used to estimate treatment effects across the entire achievement distribution (i.e., provide estimates in the lower, middle, or upper tails). Results indicate that in Grades 3 to 8 (particularly third, fifth, and sixth) lower achievers seem to benefit more from interim assessments than higher achieving students.
Although differences in goodness-of-fit indices (GOFs) have been advocated for assessing measurement invariance, studies that advanced recommended differential cutoffs for adjudicating invariance actually utilized a very limited range of values representing the quality of indicator variables (i.e., magnitude of loadings). Because quality of measurement has been found to be relevant in the context of assessing data-model fit in single-group models, this study used simulation and population analysis methods to examine the extent to which quality of measurement affects GOFs for tests of invariance in multiple group models. Results show that McDonald’s NCI is minimally affected by loading magnitude and sample size when testing invariance in the measurement model, while differences in the comparative fit index vary widely when testing both measurement and structural variance as measurement quality changes, making it difficult to pinpoint a common value that suggests reasonable invariance.
In this article, a new model for test response times is proposed that combines latent class analysis and the proportional hazards model with random effects in a similar vein as the mixture factor model. The model assumes the existence of different latent classes. In each latent class, the response times are distributed according to a class-specific proportional hazards model. The class-specific proportional hazards models relate the response times of each subject to his or her work pace, which is considered as a random effect. The latent class extension of the proportional hazards model allows for differences in response strategies between subjects. The differences can be captured in the hazard functions, which trace the progress individuals make over time when working on an item. The model can be calibrated with marginal maximum likelihood estimation. The fit of the model can either be assessed with information criteria or with a test of model fit. In a simulation study, the performance of the proposed approaches to model calibration and model evaluation is investigated. Finally, the model is used for a real data set.
This article addresses the problem of testing the difference between two correlated agreement coefficients for statistical significance. A number of authors have proposed methods for testing the difference between two correlated kappa coefficients, which require either the use of resampling methods or the use of advanced statistical modeling techniques. In this article, we propose a technique similar to the classical pairwise t test for means, which is based on a large-sample linear approximation of the agreement coefficient. We illustrate the use of this technique with several known agreement coefficients including Cohen’s kappa, Gwet’s AC1, Fleiss’s generalized kappa, Conger’s generalized kappa, Krippendorff’s alpha, and the Brennan–Prediger coefficient. The proposed method is very flexible, can accommodate several types of correlation structures between coefficients, and requires neither advanced statistical modeling skills nor considerable computer programming experience. The validity of this method is tested with a Monte Carlo simulation.
The present study tested the possibility of operationalizing levels of knowledge acquisition based on Vygotsky’s theory of cognitive growth. An assessment tool (SAM-Math) was developed to capture a hypothesized hierarchical structure of mathematical knowledge consisting of procedural, conceptual, and functional levels. In Study 1, SAM-Math was administered to 4th-grade students (N = 2,216). The results of Rasch analysis indicated that the test provided an operational definition for the construct of mathematical competence that included the three levels of mastery corresponding to the theoretically based hierarchy of knowledge. In Study 2, SAM-Math was administered to students in 4th, 6th, 8th, and 10th grades (N = 396) to examine developmental changes in the levels of mathematics knowledge. The results showed that the mastery of mathematical concepts presented in elementary school continued to deepen beyond elementary school, as evidenced by a significant growth in conceptual and functional levels of knowledge. The findings are discussed in terms of their implications for psychological theory, test design, and educational practice.
The standardized mean difference (SMD) is perhaps the most important meta-analytic effect size. It is typically used to represent the difference between treatment and control population means in treatment efficacy research. It is also used to represent differences between populations with different characteristics, such as persons who are depressed and those who are not. Measurement error in the independent variable (IV) attenuates SMDs. In this article, we derive a formula for the SMD that explicitly represents accuracy of classification of persons into populations on the basis of scores on an IV. We suggest an alternate version of the SMD less vulnerable to measurement error in the IV. We derive a novel approach to correcting the SMD for measurement error in the IV and show how this method can also be used to reliability correct the unstandardized mean difference. We compare this reliability correction approach with one suggested by Hunter and Schmidt in a series of Monte Carlo simulations. Finally, we consider how the proposed reliability correction method can be used in meta-analysis and suggest future directions for both research and further theoretical development of the proposed reliability correction method.
Differential item functioning (DIF) for an item between two groups is present if, for the same person location on a variable, persons from different groups have different expected values for their responses. Applying only to dichotomously scored items in the popular Mantel–Haenszel (MH) method for detecting DIF in which persons are classified by their total scores on an instrument, Andrich and Hagquist articulated the concept of artificial DIF and showed that as an artifact of the MH method, real DIF in one item favoring one group inevitably induces artificial DIF favoring the other group in all other items. Using the dichotomous Rasch model in which the total score for a person is a sufficient statistic, and therefore justifies classifying persons by their total scores, Andrich and Hagquist showed that to distinguish between real and artificial DIF in an item identified by the MH method, a sequential procedure for resolving items is implied. Using the polytomous Rasch model, this article generalizes the concept of artificial DIF to polytomous items, in which multiple item parameters play a role. The article shows that the same principle of resolving items sequentially as with dichotomous items applies also to distinguishing between real and artificial DIF with polytomous items. A real example and a small simulated example that parallels the real example are used illustratively.
This study established an effect size measure for differential functioning for items and tests’ noncompensatory differential item functioning (NCDIF). The Mantel–Haenszel parameter served as the benchmark for developing NCDIF’s effect size measure for reporting moderate and large differential item functioning in test items. The effect size of NCDIF is influenced by the model, the discrimination parameter, and the difficulty parameter. Therefore, tables of NCDIF’s effect size were presented at given levels of a, b, and c parameters. In addition, a general effect size recommendation for moderate and large NCDIF is also established.
Correlation attenuation due to measurement error and a corresponding correction, the deattenuated correlation, have been known for over a century. Nevertheless, the deattenuated correlation remains underutilized. A few studies in recent years have investigated factors affecting the deattenuated correlation, and a couple of them provide alternative solutions based on the deattenuated correlation. One study proposed bootstrap confidence intervals (CIs) for the deattenuated correlation. However, CI research for the deattenuated correlation is in the beginning phases. Therefore, the bootstrapped deattenuated correlation CIs are investigated for 95% coverage through a Monte Carlo simulation that includes nonnormal distributions. Overall, both the bias-corrected and accelerated (BCa) and percentile bootstrap (PB) CIs had good performance, but the BCa CIs had slightly better coverage. In addition, with the exception of the Pareto distribution, both CIs had good coverage under all simulation conditions and across all other investigated distributions (i.e., the Normal, Uniform, Triangular, Beta, and Laplace).
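The deattenuated correlation discussed in the preceding abstract is the classical correction for attenuation: the observed correlation divided by the square root of the product of the two reliabilities. The sketch below pairs that formula with a percentile bootstrap (PB) interval. It is a minimal illustration under the assumption that the reliabilities are treated as fixed, known values; the study's own procedure may handle them differently, and the function names are hypothetical.

```python
import math
import random

def pearson(x, y):
    """Sample Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def deattenuate(r_xy, rel_x, rel_y):
    """Correction for attenuation: r_xy / sqrt(rel_x * rel_y)."""
    return r_xy / math.sqrt(rel_x * rel_y)

def percentile_bootstrap_ci(x, y, rel_x, rel_y,
                            n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for the deattenuated correlation.

    Assumption: rel_x and rel_y are held fixed across resamples.
    """
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample pairs
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        stats.append(deattenuate(pearson(xs, ys), rel_x, rel_y))
    stats.sort()
    alpha = 1 - level
    lo = stats[math.floor(alpha / 2 * (n_boot - 1))]
    hi = stats[math.ceil((1 - alpha / 2) * (n_boot - 1))]
    return lo, hi
```

Note that a deattenuated estimate, and therefore an interval endpoint, can exceed 1 in absolute value when the reliabilities are low; the sketch does not truncate such values.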
This report summarizes an empirical study that addresses two related topics within the context of writing assessment—illusory halo and how much unique information is provided by multiple analytic scores. Specifically, we address the issue of whether unique information is provided by analytic scores assigned to student writing, beyond what is depicted by holistic scores, and to what degree multiple analytic scores assigned by a single rater display evidence of illusory halo. To that end, we analyze student responses to an expository writing prompt that were scored by six groups of raters—four groups assigned single analytic scores, one group assigned multiple analytic scores, and one group assigned holistic scores—using structural equation modeling. Our results suggest that there is evidence of illusory halo when raters assign multiple analytic scores to a single student response and that, at best, only two factors seem to be distinguishable in analytic writing scores assigned to expository essays.
An automated item selection procedure in Mokken scale analysis partitions a set of items into one or more Mokken scales, if the data allow. Two algorithms are available that pursue the same goal of selecting Mokken scales of maximum length: Mokken’s original automated item selection procedure (AISP) and a genetic algorithm (GA). Minimum sample size requirements for the two algorithms to obtain stable, replicable results have not yet been established. In practical scale construction reported in the literature, we found that researchers used sample sizes ranging from 133 to 15,022 respondents. We investigated the effect of sample size on the assignment of items to the correct scales. Using a misclassification of 5% as a criterion, we found that the AISP and the GA algorithms minimally required 250 to 500 respondents when item quality was high and 1,250 to 1,750 respondents when item quality was low.
Differential item functioning (DIF) indicates the violation of the invariance assumption, for instance, in models based on item response theory (IRT). For item-wise DIF analysis using IRT, a common metric for the item parameters of the groups that are to be compared (e.g., for the reference and the focal group) is necessary. In the Rasch model, therefore, the same linear restriction is imposed in both groups. Items in the restriction are termed the "anchor items". Ideally, these items are DIF-free to avoid artificially augmented false alarm rates. However, the question of how to select DIF-free anchor items appropriately is still a major challenge. Furthermore, various authors point out the lack of new anchor selection strategies and the lack of a comprehensive study, especially for dichotomous IRT models. This article reviews existing anchor selection strategies that do not require any knowledge prior to DIF analysis, offers a straightforward notation, and proposes three new anchor selection strategies. An extensive simulation study is conducted to compare the performance of the anchor selection strategies. The results show that an appropriate anchor selection is crucial for suitable item-wise DIF analysis. The newly suggested anchor selection strategies outperform the existing strategies and can reliably locate a suitable anchor when the sample sizes are large enough.
a-Stratified computerized adaptive testing with b-blocking (AST), as an alternative to the widely used maximum Fisher information (MFI) item selection method, can effectively balance item pool usage while providing accurate latent trait estimates in computerized adaptive testing (CAT). However, previous comparisons of these methods have treated item parameter estimates as if they are the true population parameter values. Consequently, capitalization on chance may occur. In this article, we examined the performance of the AST method under more realistic conditions where item parameter estimates instead of true parameter values are used in the CAT. Its performance was compared against that of the MFI method when the latter is used in conjunction with Sympson–Hetter or randomesque exposure control. Results indicate that the MFI method, even when combined with exposure control, is susceptible to capitalization on chance. This is particularly true when the calibration sample size is small. On the other hand, AST is more robust to capitalization on chance. Consistent with previous investigations using true item parameter values, AST yields much more balanced item pool usage, with a small loss in the precision of latent trait estimates. The loss is negligible when the test is as long as 40 items.
Researchers using factor analysis tend to dismiss the significant ill fit of factor models by presuming that if their factor model is close-to-fitting, it is probably close to being properly causally specified. Close fit may indeed result from a model being close to properly causally specified, but close-fitting factor models can also be seriously causally misspecified. This article illustrates a variety of nonfactor causal worlds that are perfectly, but inappropriately, fit by factor models. Seeing nonfactor worlds that are perfectly yet erroneously fit via factor models should help researchers understand that close-to-fitting factor models may seriously misrepresent the world’s causal structure. Statistical cautions regarding the factor model’s proclivity to fit when it ought not to fit have been insufficiently publicized and are rarely heeded. A research commitment to understanding the world’s causal structure, combined with clear examples of factor mismodeling should spur diagnostic assessment of significant factor model failures—including reassessment of published failing factor models.
An essential feature of the linear logistic test model (LLTM) is that item difficulties are explained using item design properties. By taking advantage of this explanatory aspect of the LLTM, in a mixture extension of the LLTM, the meaning of latent classes is specified by how item properties affect item difficulties within each class. To improve the interpretations of latent classes, this article presents a mixture generalization of the random weights linear logistic test model (RWLLTM). In detail, the present study considers individual differences in their multidimensional aspects, a general propensity (random intercept) and random coefficients of the item properties, as well as the differences among the fixed coefficients of the item properties. As an empirical illustration, data on verbal aggression were analyzed by comparing applications of the one- and two-class LLTM and RWLLTM. Results suggested that the two-class RWLLTM yielded better agreement with the empirical data than the other models. Moreover, relations between two random effects explained differences between the two classes detected by the mixture RWLLTM. Evidence from a simulation study indicated that the Bayesian estimation used in the present study appeared to recover the parameters in the mixture RWLLTM fairly well.
Many scales contain both positively and negatively worded items. Reverse recoding of negatively worded items might not be enough for them to function as positively worded items do. In this study, we commented on the drawbacks of existing approaches to wording effect in mixed-format scales and used bi-factor item response theory (IRT) models to test the assumption of reverse coding and evaluate the magnitude of the wording effect. The parameters of the bi-factor IRT models can be estimated with existing computer programs. Two empirical examples from the Program for International Student Assessment and the Trends in International Mathematics and Science Study were given to demonstrate the advantages of the bi-factor approach over traditional ones. It was found that the wording effect in these two data sets was substantial and that ignoring the wording effect resulted in overestimated test reliability and biased person measures.
This article presents a comparative judgment approach for holistically scored constructed response tasks. In this approach, the grader rank orders (rather than rates) the quality of a small set of responses. A prior automated evaluation of responses guides both set formation and scaling of rankings. Sets are formed to have similar prior scores, and subsequent rankings by graders serve to update the prior scores of responses. Final response scores are determined by weighting the prior and ranking information. This approach allows for scaling comparative judgments on the basis of a single ranking, eliminates rater effects in scoring, and offers a conceptual framework for combining human and automated evaluation of constructed response tasks. To evaluate this approach, groups of graders evaluated responses to two tasks using either the ranking approach (with sets of 5 responses) or the traditional rating approach. Results varied by task and the relative weighting of prior versus ranking information, but in general the ranking scores showed comparable generalizability (reliability) and validity coefficients.
Cultural consensus theory (CCT) is a data aggregation technique with many applications in the social and behavioral sciences. We describe the intuition and theory behind a set of CCT models for continuous type data using maximum likelihood inference methodology. We describe how bias parameters can be incorporated into these models. We introduce two extensions to the basic model in order to account for item rating easiness/difficulty. The first extension is a multiplicative model and the second is an additive model. We show how the multiplicative model is related to the Rasch model. We describe several maximum-likelihood estimation procedures for the models and discuss issues of model fit and identifiability. We describe how the CCT models could be used to give alternative consensus-based measures of reliability. We demonstrate the utility of both the basic and extended models on a set of essay rating data and give ideas for future research.
The name "SAT" has become synonymous with college admissions testing; it has been dubbed "the gold standard." Numerous studies on its reliability and predictive validity show that the SAT predicts college performance beyond high school grade point average. Surprisingly, studies of the factorial structure of the current version of today’s SAT, revised in 2005, have not been reported, if conducted. One purpose of this study was to examine the factorial structure of two administrations of the SAT (October 2010 and May 2011), testing competing models (e.g., one-factor—general ability; two factor—mathematics and "literacy"; three factor—mathematics, critical reading, and writing). We found support for the two-factor model with revise-in-context writing items loading on (and bridging) a reading and writing factor equally, thereby bridging these factors into a literacy factor. A second purpose was to draw tentative implications of our finding for the "next generation" SAT or other college readiness exams in light of Common Core State Standards Consortia efforts, suggesting that combining critical reading and writing (including the essay) would offer unique revision opportunities. More specifically, a reading and writing (combined) construct might pose a relevant problem or issue with multiple documents to be used to answer questions about the issue(s) (multiple-choice, short answer) and to write an argumentative/analytical essay based on the documents provided. In this way, there may not only be an opportunity to measure students’ literacy but also perhaps students’ critical thinking—key factors in assessing college readiness.
A direct approach to point and interval estimation of Cronbach’s coefficient alpha for multiple component measuring instruments is outlined. The procedure is based on a latent variable modeling application with widely circulated software. As a by-product, using sample data the method permits ascertaining whether the population discrepancy between alpha and the composite reliability coefficient may be practically negligible for a given empirical setting. The outlined approach is illustrated with numerical data.
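For orientation, the classical sample formula for coefficient alpha can be sketched as below. This is only the textbook point estimate, not the latent variable modeling procedure the abstract outlines, and it gives no interval estimate; the function name and the persons-by-components data layout are assumptions made for illustration.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Sample coefficient alpha.

    `scores` is a list of rows, one per person, each row holding that
    person's scores on the k components (assumed layout).
    alpha = k / (k - 1) * (1 - sum of component variances / variance of total score)
    """
    k = len(scores[0])
    if k < 2:
        raise ValueError("alpha needs at least two components")
    # Variance of each component across persons (n - 1 denominator).
    component_vars = [variance([row[j] for row in scores]) for j in range(k)]
    # Variance of the total (sum) score across persons.
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(component_vars) / total_var)
```

With perfectly parallel, identical components the estimate equals 1, and with mutually uncorrelated components it equals 0, which gives two quick sanity checks for any implementation.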
Conventional differential item functioning (DIF) detection methods (e.g., the Mantel–Haenszel test) can be used to detect DIF only across observed groups, such as gender or ethnicity. However, research has found that DIF is not typically fully explained by an observed variable. True sources of DIF may include unobserved, latent variables, such as personality or response patterns. The factor mixture model (FMM) is designed to detect unobserved sources of heterogeneity in factor models. The current study investigated use of the FMM for detecting between-class latent DIF and class-specific observed DIF. Factors that were manipulated included the DIF effect size and the latent class probabilities. Model fit indices (Akaike information criterion [AIC], Bayesian information criterion [BIC], sample size–adjusted BIC, and consistent AIC) were assessed for their detection of the correct DIF model. The recovery of DIF parameters was also assessed. Results indicated that FMMs with binary outcomes performed well in terms of DIF detection and recovery of large DIF effects. When class probabilities were unequal with small DIF effects, performance decreased for fit indices, power, and the recovery of DIF effects compared with equal class probability conditions. Inflated Type I errors were found for non-DIF items across simulation conditions. Results and future research directions for applied and methodological research are discussed.
The authors analyze the effectiveness of the R² and delta log odds ratio effect size measures when using logistic regression analysis to detect differential item functioning (DIF) in dichotomous items. A simulation study was carried out, and the Type I error rate and power estimates under conditions in which only statistical testing was used were compared with the rejection rates obtained when statistical testing was combined with an effect size measure based on recommended cutoff criteria. The manipulated variables were sample size, impact between groups, percentage of DIF items in the test, and amount of DIF. The results showed that false-positive rates were higher when applying only the statistical test than when an effect size decision rule was used in combination with a statistical test. Type I error rates were affected by the number of test items with DIF, as well as by the magnitude of the DIF. With respect to power, when a statistical test was used in conjunction with effect size criteria to determine whether an item exhibited a meaningful magnitude of DIF, the delta log odds ratio effect size measure performed better than R². Power was affected by the percentage of DIF items in the test and also by sample size. The study highlights the importance of using an effect size measure to avoid false identification.
The present study assessed the impact of sample size on the power and fit of structural equation modeling applied to functional brain connectivity hypotheses. The data consisted of time-constrained minimum norm estimates of regional brain activity during performance of a reading task obtained with magnetoencephalography. Power analysis was first conducted for an autoregressive model with 5 latent variables (brain regions), each defined by 3 indicators (successive activity time bins). A series of simulations were then run by generating data from an existing pool of 51 typical readers (aged 7.5-12.5 years). Sample sizes ranged between 20 and 1,000 participants and for each sample size 1,000 replications were run. Results were evaluated using chi-square Type I errors, model convergence, mean RMSEA (root mean square error of approximation) values, confidence intervals of the RMSEA, structural path stability, and -Fit index values. Results suggested that 70 to 80 participants were adequate to model relationships reflecting close to not so close fit as per MacCallum et al.’s recommendations. Sample sizes of 50 participants were associated with satisfactory fit. It is concluded that structural equation modeling is a viable methodology to model complex regional interdependencies in brain activation in pediatric populations.
Growth mixture modeling has gained much attention in applied and methodological social science research recently, but the selection of the number of latent classes for such models remains a challenging issue, especially when the assumption of proper model specification is violated. The current simulation study compared the performance of a linear growth mixture model (GMM) for determining the correct number of latent classes against a completely unrestricted multivariate normal mixture model. Results revealed that model convergence is a serious problem that has been underestimated by previous GMM studies. Based on two ways of dealing with model nonconvergence, the performance of the two types of mixture models and a number of model fit indices in class identification are examined and discussed. This article provides suggestions to practitioners who want to use GMM for their research.
A study in a university clinic/laboratory investigated adaptive Bayesian scaling as a supplement to interpretation of scores on the Mini-IPIP. A "probability of belonging" in categories of low, medium, or high on each of the Big Five traits was calculated after each item response, and the process continued until all items had been used or until a predetermined criterion for the posterior probability had been met. The study found higher levels of correspondence with the IPIP-50 score categories using the adaptive Bayesian scaling than with the Mini-IPIP alone. The number of additional items ranged from a mean of 2.9 to 12.5, contingent on the level of certainty desired.
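The item-by-item updating with a posterior stopping criterion described above can be sketched generically with Bayes' rule. This is a hedged illustration only: the function names, the 0.9 threshold, and the per-item likelihood values are hypothetical and are not taken from the study.

```python
def bayes_update(prior, likelihood):
    """One Bayes step: posterior is proportional to prior times likelihood.

    Both arguments are dicts keyed by category ("low", "medium", "high").
    `likelihood[c]` is the assumed probability of the observed item
    response given membership in category c.
    """
    unnormalized = {c: prior[c] * likelihood[c] for c in prior}
    total = sum(unnormalized.values())
    return {c: v / total for c, v in unnormalized.items()}


def classify_sequentially(prior, item_likelihoods, threshold=0.9):
    """Update after each item; stop once some category reaches the threshold
    or the items run out. Returns the posterior and the number of items used."""
    posterior = dict(prior)
    used = 0
    for likelihood in item_likelihoods:
        posterior = bayes_update(posterior, likelihood)
        used += 1
        if max(posterior.values()) >= threshold:
            break
    return posterior, used
```

Raising the threshold requires more items before stopping, which is the trade-off between certainty and test length that the abstract reports.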
In this study, we explored the potential for machine scoring of short written responses to the Classroom-Video-Analysis (CVA) assessment, which is designed to measure teachers’ usable mathematics teaching knowledge. We created naïve Bayes classifiers for CVA scales assessing three different topic areas and compared computer-generated scores to those assigned by trained raters. Using cross-validation techniques, average correlations between rater- and computer-generated total scores exceeded .85 for each assessment, providing some evidence for convergent validity of machine scores. These correlations remained moderate to large when we controlled for length of response. Machine scores exhibited internal consistency, which we view as a measure of reliability. Finally, correlations between machine scores and another measure of teacher knowledge were close in size to those observed for human scores, providing further evidence for the validity of machine scores. Findings from this study suggest that machine learning techniques hold promise for automating scoring of the CVA.
A popular method to assess measurement invariance of a particular item is based on likelihood ratio tests with all other items as anchor items. The results of this method are often only reported in terms of statistical significance, and researchers have proposed different methods to empirically select anchor items. It is unclear, however, how many anchor items should be selected and which method will provide the "best" results using empirical data. In the present study, we examined the impact of using different numbers of anchor items on effect size indices when investigating measurement invariance on a personality questionnaire in two different assessment situations. Results suggested that the effect size indices were not influenced by using different numbers of anchor items. The values were comparable across the different numbers of anchor items used and were small, which indicates that the effect of differential functioning at the item and test level is very small if not negligible. Practical implications, including the use of anchor items and effect size indices in practice, are discussed.
The simultaneous item bias test (SIBTEST) method regression procedure and the differential item functioning (DIF)-free-then-DIF strategy are applied to the logistic regression (LR) method simultaneously in this study. These procedures are used to adjust the effects of matching true score on observed score and to better control the Type I error rates of the LR method in assessing DIF, respectively. The performance and the detailed procedure, including anchor length, of the newly proposed method are investigated through a series of simulation studies. The results show that the standard LR method yielded inflated Type I error rates as the percentage of DIF items or group ability differences increased, whereas the newly proposed method produced less inflated results. It controlled Type I error rates well in these conditions as the length of anchor increased. However, the usually suggested one-anchor or four-anchor rule of the DIF-free-then-DIF strategy is not long enough for methods that use the raw score as the matching variable. In general, the newly proposed method with eight anchor items yielded well-controlled Type I error rates under all study conditions, even with 40% DIF items in the test and a group ability difference equal to one standard deviation. It is recommended that both the SIBTEST correction procedure and the DIF-free-then-DIF strategy be applied to the LR method when assessing DIF.
This article introduces a new construct, Computer User Learning Aptitude (CULA). To establish construct validity, CULA is embedded in a nomological network that extends the technology acceptance model (TAM). Specifically, CULA is posited to affect perceived usefulness and perceived ease of use, the two underlying TAM constructs. Furthermore, we examine several antecedents of CULA by relying on the second language learning literature. These include computer anxiety, tolerance of ambiguity, and risk taking. Conceptualization of CULA is based on the observation that computer systems use language as communication between the computer and the user, making system usage significantly dependent on the ability of the individual to learn the language. We posit that learning to communicate with computer technology is akin to learning a second language, that is, a language learned after the first language(s) or native language(s); we refer to this language as computerese. The proposed construct, CULA, measures the aptitude of an individual to learn computerese, and it is specified as a second-order variable. It includes measures of three critical facets of computerese pertaining to general hardware/software, programming, and the Internet. CULA is significantly related to computer anxiety, tolerance of ambiguity, and risk taking, as well as to the TAM constructs.
Response styles, the tendency to respond to Likert-type items irrespective of content, are a widely known threat to the reliability and validity of self-report measures. However, it is still debated how to measure and control for response styles such as extreme responding. Recently, multiprocess item response theory models have been proposed that allow for separating multiple response processes in rating data. The rationale behind these models is to define process variables that capture psychologically meaningful aspects of the response process like, for example, content- and response style-related processes. The aim of the present research was to test the validity of this approach using two large data sets. In the first study, responses to a 7-point rating scale were disentangled, and it was shown that response style-related and content-related processes were selectively linked to extraneous criteria of response styles and content. The second study, using a 4-point rating scale, focused on a content-related criterion and revealed a substantial suppression effect of response style. The findings have implications for both basic and applied fields, namely, for modeling response styles and for the interpretation of rating data.
Item response theory (IRT) models allow model–data fit to be assessed at the individual level by using person-fit indices. This assessment is also feasible when IRT is used to model test–retest data. However, person-fit developments for this type of modeling are virtually nonexistent. This article proposes a general person-fit approach for test–retest data, which is based on practical likelihood-based indices. The approach is intended for two types of assumption regarding trait levels—stability and change—and can be used with a variety of IRT models. It consists of two groups of indices: (a) overall indices based on the full test–retest pattern, which are more powerful and are intended to flag a respondent as potentially inconsistent; and (b) partial indices intended to provide additional information about the location and sources of misfit. Furthermore, because the overall procedures assume local independence under repetition, a statistic for assessing the presence of retest effects at the individual level is also proposed. The functioning of the procedures was assessed by using simulation and is illustrated with two empirical studies: a stability study based on graded-response items and a change study based on binary items. Finally, limitations and further lines of research are discussed.
It is known that sum score-based methods for the identification of differential item functioning (DIF), such as the Mantel–Haenszel (MH) approach, can be affected by Type I error inflation in the absence of any DIF effect. This may happen when the items differ in discrimination and when there is item impact. On the other hand, outlier DIF methods have been developed that are robust against this Type I error inflation, although they are still based on the MH DIF statistic. The present article gives an explanation for why the common MH method is indeed vulnerable to the inflation effect whereas the outlier DIF versions are not. In a simulation study, we were able to produce the Type I error inflation by inducing item impact and item differences in discrimination. At the same time and in parallel with the Type I error inflation, the dispersion of the DIF statistic across items was increased. As expected, the outlier DIF methods did not seem sensitive to impact and differences in item discrimination.
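For readers unfamiliar with the MH DIF statistic referred to above, the standard Mantel–Haenszel common odds ratio over sum-score strata, together with its transformation to the ETS delta scale, can be sketched as follows. The function names and the tuple layout of each stratum are assumptions for illustration; the outlier DIF procedures discussed in the abstract are not implemented here.

```python
import math


def mh_common_odds_ratio(strata):
    """Mantel-Haenszel common odds ratio across matched score strata.

    Each stratum is a tuple (a, b, c, d) (assumed layout):
      a = reference group correct,  b = reference group incorrect,
      c = focal group correct,      d = focal group incorrect.
    alpha_MH = sum(a * d / n) / sum(b * c / n), with n = a + b + c + d.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # empty stratum contributes nothing
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator


def mh_d_dif(alpha_mh):
    """ETS delta-scale transformation: MH D-DIF = -2.35 * ln(alpha_MH)."""
    return -2.35 * math.log(alpha_mh)
```

A value of `alpha_mh` near 1 (MH D-DIF near 0) indicates no DIF on this statistic; values above 1 indicate that the item favors the reference group at matched sum scores.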
This study examined the empirical differences between the tendency to omit items and reading ability by applying tree-based item response (IRTree) models to the Japanese data of the Programme for International Student Assessment (PISA) held in 2009. For this purpose, existing IRTree models were expanded to contain predictors and to handle multilevel data. The results revealed that Japanese students were more likely to omit open-ended items than closed-ended items, despite the fact that average item difficulty of the open-ended items was lower than that of the closed-ended items. Variances of the omission tendency were larger than those of reading ability, especially for open-ended items. Female students tended to omit more closed-ended items but fewer open-ended items than males, but the female students showed higher reading ability on average. Use of control strategies was negatively correlated with the difficulty of reading items and with the tendency to omit open-ended items. After controlling for other student properties, use of memorization strategies was negatively correlated with reading ability, which was opposite to the simple correlation. The results clearly show that the omission tendency can be differentiated from reading ability. School-level means of socioeconomic status and teacher stimulation explained more variance of both the omission tendency and reading ability than student-level properties. This implies that school-level interventions will be more effective than student-level instructions.
This study compared four item-selection procedures developed for use with severely constrained computerized adaptive tests (CATs). Severely constrained CATs refer to those adaptive tests that seek to meet a complex set of constraints that are often not conclusive to each other (i.e., an item may contribute to the satisfaction of several constraints at the same time). The procedures examined in the study included the weighted deviation model (WDM), the weighted penalty model (WPM), the maximum priority index (MPI), and the shadow test approach (STA). In addition, two modified versions of the MPI procedure were introduced to deal with an edge case condition that results in the item selection procedure becoming dysfunctional during a test. The results suggest that the STA worked best among all candidate methods in terms of measurement accuracy and constraint management. The other three heuristic approaches did not differ significantly in measurement accuracy and constraint management at the lower bound level. However, the WPM method appears to perform considerably better in overall constraint management than either the WDM or MPI method. Limitations and future research directions are also discussed.
Observational methods are increasingly being used in classrooms to evaluate the quality of teaching. Operational procedures for observing teachers are somewhat arbitrary in existing measures and vary across different instruments. To study the effect of different observation procedures on score reliability and validity, we conducted an experimental study that manipulated the length of observation and order of presentation of 40-minute videotaped lessons from secondary grade classrooms. Results indicate that two 20-minute observation segments presented in random order produce the most desirable effect on score reliability and validity. This suggests that 20-minute occasions may be sufficient time for a rater to observe true characteristics of teaching quality assessed by the measure used in the study, and that randomizing the order in which segments are rated may reduce construct-irrelevant variance arising from carryover effects and rater drift.
This exploratory study investigated potential sources of setting accommodation resulting in differential item functioning (DIF) on math and reading assessments for examinees with varied learning characteristics. The examinees were those who participated in large-scale assessments and were tested in either standardized or accommodated testing conditions. The data were examined using multilevel measurement modeling, latent class analyses (LCA), and log-linear and odds ratio analyses. The results indicate that LCA models yielded substantially better fits to the observed data when they included only one covariate (total scores) than when they included multiple covariates. Consistent patterns that emerged from the results also show that the observed math and reading DIF can be explained by examinees’ latent abilities, accommodation status, and characteristics (including gender, home language, and learning attitudes). The present study not only confirmed previous findings that examinees’ characteristics are helpful in identifying sources of DIF but also addressed some limitations of previous studies by using an alternative and viable covariate strategy for LCA models.
The practice of screening students to identify behavioral and emotional risk is gaining momentum, with limited guidance regarding the frequency with which screenings should occur. Screening frequency decisions are influenced by the stability of the constructs assessed and changes in risk status over time. This study investigated the 4-year longitudinal stability of behavioral and emotional risk screening scores among a sample of youth to examine change in risk status over time. Youth (N = 156) completed a self-report screening measure, the Behavioral and Emotional Screening System, at 1-year intervals in the 8th through 11th grades. Categorical and dimensional stability coefficients, as well as transitions across risk status categories, were analyzed. A latent profile analysis was conducted to determine if there were salient and consistent patterns of screening scores over time. Stability coefficients were moderate to large, with stronger coefficients across shorter time intervals. Latent profile analysis pointed to a three-class solution in which classes were generally consistent with risk categories and stable across time. Results showed that the vast majority of students continued to be classified within the same risk category across time points. Implications for practice and future research needs are discussed.
This research note contributes to the discussion of methods that can be used to identify useful auxiliary variables for analyses of incomplete data sets. A latent variable approach is discussed, which is helpful in finding auxiliary variables with the property that if included in subsequent maximum likelihood analyses they may enhance considerably the plausibility of the underlying assumption of data missing at random. The auxiliary variables can also be considered for inclusion alternatively in imputation models for following multiple imputation analyses. The approach can be particularly helpful in empirical settings where violations of missing at random are suspected, and is illustrated with data from an aging research study.
When item parameter estimates are used to estimate the ability parameter in item response models, the standard error (SE) of the ability estimate must be corrected to reflect the error carried over from item calibration. For maximum likelihood (ML) ability estimates, a corrected asymptotic SE is available, but it requires a long test and the covariance matrix of item parameter estimates, which may not be available. An alternative SE can be obtained using the bootstrap. The first purpose of this article is to propose a bootstrap procedure for the SE of ML ability estimates when item parameter estimates are used for scoring. The second purpose is to conduct a simulation to compare the performance of the proposed bootstrap SE with the asymptotic SE under different test lengths and different magnitudes of item calibration error. Both SE estimates closely approximated the empirical SE when the test was long (i.e., 40 items) and when the true ability value was close to the mean of the ability distribution. However, neither SE estimate was uniformly superior: the asymptotic SE tended to underpredict the empirical SE, and the bootstrap SE tended to overpredict the empirical SE. The results suggest that the choice of SE depends on the type and purpose of the test. Additional implications of the results are discussed.
For computerized adaptive tests (CATs) to work well, they must have an item pool with sufficient numbers of good quality items. Many researchers have pointed out that, in developing item pools for CATs, not only is the item pool size important but also the distribution of item parameters and practical considerations such as content distribution and item exposure issues. Yet, there is little research on how to design item pools to have those desirable features. The research reported in this article provided step-by-step hands-on guidance on the item pool design process by applying the bin-and-union method to design item pools for a large-scale licensure CAT employing a complex adaptive testing algorithm with variable test length, a decision-based stopping rule, content balancing, and exposure control. The design process involved extensive simulations to identify several alternative item pool designs and evaluate their performance against a series of criteria. The design output included the desired item pool size and item parameter distribution. The results indicate that the mechanism used to identify the desirable item pool features functions well and that two recommended item pool designs would support satisfactory performance of the operational testing program.
In the social sciences, latent traits often have a hierarchical structure, and data can be sampled from multiple levels. Both hierarchical latent traits and multilevel data can occur simultaneously. In this study, we developed a general class of item response theory models to accommodate both hierarchical latent traits and multilevel data. The freeware WinBUGS was used for parameter estimation. A series of simulations were conducted to evaluate the parameter recovery and the consequence of ignoring the multilevel structure. The results indicated that the parameters were recovered fairly well; ignoring multilevel structures led to poor parameter estimation, overestimation of test reliability for the second-order latent trait, and underestimation of test reliability for the first-order latent traits. The Bayesian deviance information criterion and posterior predictive model checking were helpful for model comparison and model-data fit assessment. Two empirical examples that involve an ability test and a teaching effectiveness assessment are provided.
The nominal response model (NRM), a much understudied polytomous item response theory (IRT) model, provides researchers the unique opportunity to evaluate within-item category distinctions. Polytomous IRT models, such as the NRM, are frequently applied to psychological assessments representing constructs that are unlikely to be normally distributed in the population. Unfortunately, software that estimates these models with the MML/EM algorithm frequently employs a set of normal quadrature points, effectively ignoring the true shape of the latent trait distribution. To address this problem, the current research implements an alternative estimation approach, Ramsay Curve Item Response Theory (RC-IRT), to provide more accurate item parameter estimates under the NRM for ordered polytomous items when the latent trait distribution is normal, skewed, or bimodal. Based on the results of improved item parameter recovery under RC-IRT, it is recommended that RC-IRT estimation be implemented whenever a researcher considers that the construct being measured has the potential of being nonnormally distributed.
In this study, smoothing and scaling approaches are compared for estimating subscore-to-composite scaling results involving composites computed as rounded and weighted combinations of subscores. The considered smoothing and scaling approaches included those based on raw data, on smoothing the bivariate distribution of the subscores, on smoothing the bivariate distribution of the subscore and weighted composite, and two weighted averages of the raw and smoothed marginal distributions. Results from simulations showed that the approaches differed in terms of their estimation accuracy for scaling situations with smaller and larger sample sizes, and on weighted composite distributions of varied complexity.
Typically, longitudinal growth modeling based on item response theory (IRT) requires repeated measures data from a single group with the same test design. If operational or item exposure problems are present, the same test may not be employed to collect data for longitudinal analyses, and tests at multiple time points are constructed with unique item sets, as well as a set of common items (i.e., anchor test), for a study of examinee growth. In this study, three IRT approaches to examinee growth modeling were applied to a single-group anchor test design and their examinee growth estimates were compared. In terms of tracking individual growth, growth patterns in the examinee population distribution, and the overall model–data fit, results show the importance of modeling the serial correlation over multiple time points and other additional dependence coming from the use of the unique item sets, as well as the anchor test.
Data from competence tests usually show a number of missing responses on test items due to both omitted and not-reached items. Different approaches for dealing with missing responses exist, and there are no clear guidelines on which of those to use. While classical approaches rely on an ignorable missing data mechanism, the most recently developed model-based approaches account for nonignorable missing responses. Model-based approaches include the missing propensity in the measurement model. Although these models are very promising, the assumptions made in these models have not yet been tested for plausibility in empirical data. Furthermore, studies investigating the performance of different approaches have only focused on one kind of missing response at once. In this study, we investigated the performance of classical and model-based approaches in empirical data, accounting for different kinds of missing responses simultaneously. We confirmed the existence of a unidimensional tendency to omit items. Indicating nonignorability of the missing mechanism, missing tendency due to both omitted and not-reached items correlated with ability. However, results on parameter estimation showed that ignoring missing responses was sufficient to account for missing responses, and that the missing propensity was not needed in the model. The results from the empirical study were corroborated in a complete case simulation.
Invariant relationships in the internal mechanisms of estimating achievement scores on educational tests serve as the basis for concluding that a particular test is fair with respect to statistical bias concerns. Equating invariance and differential item functioning are both concerned with invariant relationships yet are treated separately in the psychometric literature. Connecting these two facets of statistical invariance is critical for developing a holistic definition of fairness in educational measurement, for fostering a deeper understanding of the nature and causes of equating invariance and a lack thereof, and for providing practitioners with guidelines for addressing reported score-level equity concerns. This study hypothesizes that differential item functioning manifested in anchor items of an assessment will have an effect on equating dependence. Findings show that when anchor item differential item functioning varies across forms in a differential manner across subpopulations, population invariance of equating can be compromised.
Extreme response style (ERS) is a systematic tendency for a person to endorse extreme options (e.g., strongly disagree, strongly agree) on Likert-type or rating-scale items. In this study, we develop a new class of item response theory (IRT) models to account for ERS so that the target latent trait is free from the response style and the tendency of ERS is quantified. Parameters of these new models can be estimated with marginal maximum likelihood estimation methods or Bayesian methods. In this study, we use the freeware program WinBUGS, which implements Bayesian methods. In a series of simulations, we find that the parameters are recovered fairly well; ignoring ERS by fitting standard IRT models resulted in biased estimates, and fitting the new models to data without ERS did little harm. Two empirical examples are provided to illustrate the implications and applications of the new models.
A challenge associated with traditional mixture regression models (MRMs), which rest on the assumption of normally distributed errors, is determining the number of unobserved groups. Specifically, even slight deviations from normality can lead to the detection of spurious classes. The current work aims to (a) examine how sensitive the commonly used model selection indices are in class enumeration of MRMs with nonnormal errors, (b) investigate whether a skew-normal MRM can accommodate nonnormality, and (c) illustrate the potential of this model with a real data analysis. Simulation results indicate that model information criteria are not useful for class determination in MRMs unless errors follow a perfect normal distribution. The skew-normal MRM can accurately identify the number of latent classes in the presence of normal or mildly skewed errors, but fails to do so in severely skewed conditions. Furthermore, across the experimental conditions it is seen that some parameter estimates provided by the skew-normal MRM become more biased as skewness increases whereas others remain unbiased. Discussion of these results in the context of the applicability of skew-normal MRMs is provided.
Previous research has demonstrated that differential item functioning (DIF) methods that do not account for multilevel data structure could result in too frequent rejection of the null hypothesis (i.e., no DIF) when the intraclass correlation coefficient (ICC) of the studied item was the same as the ICC of the total score. The current study extended previous research by comparing the performance of DIF methods when the ICC of the studied item was less than the ICC of the total score, a condition that may be observed with considerable frequency in practice. The performance of two simple and frequently used DIF methods that do not account for multilevel data structure, the Mantel–Haenszel test (MH) and logistic regression (LR), was compared with the performance of a complex and less frequently used DIF method that does account for multilevel data structure, hierarchical logistic regression (HLR). Simulation indicated that HLR and LR performed equivalently in terms of significance tests under most conditions, and MH was conservative across most of the conditions. The effect size estimate of HLR was as accurate and consistent as the effect size estimates of LR and MH under the Rasch model and was more accurate and consistent than LR and MH effect size estimates under the two-parameter item response theory model. The results of the current study provide evidence to help researchers further understand the comparative performance between complex and simple modeling for DIF detection under multilevel data structure.
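As an illustration of the simpler of the methods compared above, the following minimal sketch computes the standard Mantel–Haenszel common odds ratio across score strata and its conversion to the delta scale (-2.35 times the natural log of the odds ratio). The function names and the tuple layout of each stratum are assumptions made for this example, and the sketch deliberately ignores any multilevel structure, as the MH procedure itself does; it is not the simulation code used in the study.

```python
import math

def mantel_haenszel_odds_ratio(strata):
    """Common odds ratio across matched score strata.

    Each stratum is a tuple (a, b, c, d) of counts (layout assumed here):
      a = reference group, correct    b = reference group, incorrect
      c = focal group, correct        d = focal group, incorrect
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # an empty stratum contributes nothing
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator

def mh_delta(alpha):
    """Convert the common odds ratio to the delta scale (MH D-DIF)."""
    return -2.35 * math.log(alpha)

# Hypothetical counts: both groups perform identically in every stratum.
no_dif = [(40, 10, 40, 10), (20, 30, 20, 30)]
# Hypothetical counts: the reference group is favored in the single stratum.
favors_reference = [(40, 10, 30, 20)]

alpha_null = mantel_haenszel_odds_ratio(no_dif)
alpha_dif = mantel_haenszel_odds_ratio(favors_reference)
```

An odds ratio of 1 (delta of 0) indicates no DIF; values above 1 correspond to negative delta values, which indicate an item that favors the reference group.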
Latent growth curve models with piecewise functions are flexible and useful analytic models for investigating individual behaviors that exhibit distinct phases of development in observed variables. As an extension of this framework, this study considers a piecewise linear–linear latent growth mixture model (LGMM) for describing segmented change of individual behavior over time where the data come from a mixture of two or more unobserved subpopulations (i.e., latent classes). Thus, the focus of this article is to illustrate the practical utility of piecewise linear–linear LGMM and then to demonstrate how this model could be fit as one of many alternatives, including the more conventional LGMMs with functions such as linear and quadratic. To carry out this study, data (N = 214) obtained from procedural learning task research were used to fit the three alternative LGMMs: (a) a two-class LGMM using a linear function, (b) a two-class LGMM using a quadratic function, and (c) a two-class LGMM using a piecewise linear–linear function, where the time of transition from one phase to another (i.e., knot) is not known a priori, and thus is a parameter to be estimated.
The usefulness of the l_{z} person-fit index was investigated with achievement test data from 20 exams given to more than 3,200 college students. Results for three methods of estimating the latent trait (θ) showed that the distributions of l_{z} were not consistent with its theoretical distribution, resulting in general overfit to the item response theory model and underidentification of potentially nonfitting response vectors. The distributions of l_{z} were not improved for the Bayesian estimation method. A follow-up Monte Carlo simulation study using item parameters estimated from real data resulted in mean l_{z} approximating the theoretical value of 0.0 for one of three estimation methods, but all standard deviations were substantially below the theoretical value of 1.0. Use of the l_{z} distributions from these simulations resulted in levels of identification of significant misfit consistent with the nominal error rates. The reasons for the nonstandardized distributions of l_{z} observed in both these data sets were investigated in additional Monte Carlo simulations. Previous studies showed that the distribution of item difficulties was primarily responsible for the nonstandardized distributions, with smaller effects for item discrimination and guessing. It is recommended that with real tests, identification of significantly nonfitting examinees be based on empirical distributions of l_{z} generated from Monte Carlo simulations using item parameters estimated from real data.
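For readers unfamiliar with the index, a minimal sketch of the standard l_{z} computation for dichotomous items follows: the observed log-likelihood of a response vector is centered by its expected value and divided by its standard deviation, both evaluated at the model-implied probabilities of a correct response. The function name and the example probabilities are assumptions made for illustration; in practice the probabilities would come from an item response model evaluated at an estimated trait level, which is the source of the distributional problems the abstract describes.

```python
import math

def lz_statistic(responses, probs):
    """Standardized log-likelihood person-fit index for dichotomous items.

    responses: iterable of 0/1 item scores
    probs: model-implied probabilities of a correct response, each strictly
           between 0 and 1 (assumed to be supplied by the caller)
    """
    log_lik = 0.0    # observed log-likelihood of the response vector
    expected = 0.0   # its expectation under the model
    variance = 0.0   # its variance under the model
    for u, p in zip(responses, probs):
        q = 1.0 - p
        log_lik += u * math.log(p) + (1 - u) * math.log(q)
        expected += p * math.log(p) + q * math.log(q)
        variance += p * q * math.log(p / q) ** 2
    return (log_lik - expected) / math.sqrt(variance)

# Hypothetical probabilities for five items ordered from easy to hard.
probs = [0.9, 0.7, 0.5, 0.3, 0.1]
lz_consistent = lz_statistic([1, 1, 1, 0, 0], probs)  # model-consistent pattern
lz_aberrant = lz_statistic([0, 0, 0, 1, 1], probs)    # reversed, aberrant pattern
```

Large negative values flag response vectors that are unlikely under the model; the model-consistent pattern above yields a positive value and the reversed pattern a strongly negative one.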
When missing values are present in item response data, there are a number of ways one might impute a correct or incorrect response to a multiple-choice item. There are significantly fewer methods for imputing the actual response option an examinee may have provided if he or she had not omitted the item either purposely or accidentally. This article applies the multiple-choice model, a multiparameter logistic model that allows for in-depth distractor analyses, to impute response options for missing data in multiple-choice items. Following a general introduction of the issues involved with missing data, the article describes the details of the multiple-choice model and demonstrates its use for multiple imputation of missing item responses. A simple simulation example is provided to demonstrate the accuracy of the imputation method by comparing true item difficulties (p values) and item–total correlations (r values) to those estimated after imputation. Missing data are simulated according to three different types of missing mechanisms: missing completely at random, missing at random, and missing not at random.
The assessment of test data for the presence of differential item functioning (DIF) is a key component of instrument development and validation. Among the many methods that have been used successfully in such analyses is the mixture modeling approach. Using this approach to identify the presence of DIF has been touted as potentially superior for gaining insights into the etiology of DIF, as compared to using intact groups. Recently, researchers have expanded on this work to incorporate multilevel mixture modeling, for cases in which examinees are nested within schools. The current study further expands on this multilevel mixture modeling for DIF detection by using a multidimensional multilevel mixture model that incorporates multiple measured dimensions, as well as the presence of multiple subgroups in the population. This model was applied to a national sample of third-grade students who completed math and language tests. Results of the analysis demonstrate that the multidimensional model provides more complete information regarding the nature of DIF than do separate unidimensional models.
Determining sample size requirements for structural equation modeling (SEM) is a challenge often faced by investigators, peer reviewers, and grant writers. Recent years have seen a large increase in SEMs in the behavioral science literature, but consideration of sample size requirements for applied SEMs often relies on outdated rules-of-thumb. This study used Monte Carlo data simulation techniques to evaluate sample size requirements for common applied SEMs. Across a series of simulations, we systematically varied key model properties, including number of indicators and factors, magnitude of factor loadings and path coefficients, and amount of missing data. We investigated how changes in these parameters affected sample size requirements with respect to statistical power, bias in the parameter estimates, and overall solution propriety. Results revealed a range of sample size requirements (i.e., from 30 to 460 cases) and meaningful patterns of association between parameters and sample size, and highlighted the limitations of commonly cited rules-of-thumb. The broad "lessons learned" for determining SEM sample size requirements are discussed.
The performance of the normal theory bootstrap (NTB), the percentile bootstrap (PB), and the bias-corrected and accelerated (BCa) bootstrap confidence intervals (CIs) for coefficient omega was assessed through a Monte Carlo simulation under conditions not previously investigated. Of particular interest were nonnormal Likert-type and binary items. The results show a clear order in performance. The NTB CI had the best performance in that it had more consistently acceptable coverage under the simulation conditions investigated. The results suggest that the NTB CI can be used for sample sizes larger than 50. The NTB CI is still a good choice for a sample size of 50 so long as there are more than 5 items. If one does not wish to make the normality assumption about coefficient omega, then the PB CI for sample sizes of 100 or more or the BCa CI for sample sizes of 150 or more are good choices.
The objective of this article was to find an optimal decision rule for identifying polytomous items with large or moderate amounts of differential functioning. The effectiveness of combining statistical tests with effect size measures was assessed using logistic discriminant function analysis and two effect size measures: R^{2} and conditional log odds ratio in delta scale (Δ_{LR}). Four independent variables were manipulated: (a) different sample sizes for the reference and focal groups (1,000/500, 1,000/250, 500/250), (b) impact between reference and focal group (equal-ability distribution, i.e., no impact; or different-ability distribution, i.e., impact), (c) the percentage of differential item functioning (DIF) items in a test (0%; 12%, i.e., only the first three items of the test; 20%, i.e., the first five items of the test; 32%, i.e., the first eight items of the test), and (d) direction of DIF (one-sided and both-sided). The magnitudes of DIF were indirectly manipulated through the percentage of DIF items and DIF direction, and they were simulated to be moderate or large. The results show that the false positive rates were low when an effect size decision rule was used in combination with a statistical test, and they were very low when R^{2} effect size criteria were applied. With respect to power, when a statistical test was used in conjunction with effect size criteria to determine whether an item exhibited a meaningful magnitude of DIF, we found when using the Δ_{LR} decision rule that the percentage of meaningful DIF items was higher with greater amounts of DIF. Examining DIF by means of blended statistical tests, in other words, those incorporating both the p value and effect size measures, can be recommended as a procedure for classifying items displaying DIF.
Type I error rates in multiple regression, and hence the chance for false positive research findings, can be drastically inflated when multiple regression models are used to analyze data that contain random measurement error. This article shows the potential for inflated Type I error rates in commonly encountered scenarios and provides new insights into the causes of this problem. Computer simulations and an illustrative example are used to demonstrate that when the predictor variables in a multiple regression model are correlated and one or more of them contains random measurement error, Type I error rates can approach 1.00, even for a nominal level of 0.05. The most important factors causing the problem are summarized and the implications are discussed. The authors use Zumbo’s Draper–Lindley–de Finetti framework to show that the inflation in Type I error rates results from a mismatch between the data researchers have, the assumptions of the statistical model, and the inferences they hope to make.
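The mechanism described above can be made concrete with a small population-level calculation. The sketch below is not the authors' simulation; it assumes a deliberately simple setup chosen for illustration: the outcome depends only on a standardized true predictor T1 with slope beta1, the observed X1 equals T1 plus classical (independent) measurement error with a given reliability, and a second standardized predictor X2 is measured without error, correlates r with T1, and has no effect of its own. Under those assumptions, solving the normal equations for the regression of the outcome on X1 and X2 gives the large-sample slopes computed here; the function name and argument names are hypothetical.

```python
def population_slopes(beta1, r, reliability):
    """Large-sample OLS slopes when only X1 contains measurement error.

    beta1: true slope of the outcome on the error-free predictor T1
    r: correlation between T1 and the second predictor X2 (true effect zero)
    reliability: var(T1) / var(X1), strictly between 0 and 1 (or exactly 1)
    Returns (slope on X1, slope on X2).
    """
    denom = 1.0 - reliability * r ** 2
    slope_x1 = beta1 * reliability * (1.0 - r ** 2) / denom
    slope_x2 = beta1 * r * (1.0 - reliability) / denom
    return slope_x1, slope_x2

# Assumed illustrative values: beta1 = 0.5, r = 0.5, reliability = 0.8.
b1_error, b2_error = population_slopes(0.5, 0.5, 0.8)
# With perfect reliability the spurious slope on X2 vanishes.
b1_clean, b2_clean = population_slopes(0.5, 0.5, 1.0)
```

Because the slope on X2 converges to a nonzero value whenever reliability is below 1 and r is nonzero, a significance test of that slope rejects more and more often as the sample grows, even though X2 has no true effect. This is one way to see how the rejection rate can climb toward 1.00 at a nominal level of 0.05.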
Although a substantial amount of research has been conducted on differential item functioning in testing, studies have focused on detecting differential item functioning rather than on explaining how or why it may occur. Some recent work has explored sources of differential functioning using explanatory and multilevel item response models. This study uses hierarchical generalized linear modeling to examine differential performance due to gender and opportunity to learn, two variables that have been examined in the literature primarily in isolation, or in terms of mean performance as opposed to item performance. The relationships between item difficulty, gender, and opportunity to learn are explored using data for three countries from an international survey of preservice mathematics teachers.
The relationship between saturated path-analysis models and their fit to data is revisited. It is demonstrated that a saturated model need not fit perfectly or even well a given data set when fit to the raw data is examined, a criterion currently frequently overlooked by researchers utilizing path analysis modeling techniques. The potential of individual case residuals for saturated model fit assessment is revealed by showing how they can be used to examine local fit, as opposed to overall fit, to sense possible model deficiencies or misspecifications, and to suggest model improvements when needed. The discussion is illustrated with several numerical examples.
A study was designed to examine a multidimensional measure of children’s coping in the academic domain as part of a larger model of motivational resilience. Using items tapping multiple ways of dealing with academic problems, including five adaptive ways (strategizing, help-seeking, comfort-seeking, self-encouragement, and commitment) and six maladaptive ways (confusion, escape, concealment, self-pity, rumination, and projection), analyses of self-reports collected from 1,020 third through sixth graders in fall and spring of the same school year showed that item sets marking each way of coping were generally unidimensional and internally consistent; and confirmatory analyses showed that multidimensional models provided a good fit to the data for both adaptive and maladaptive coping at both time points. Of greatest interest were the connections of these ways of coping to the constructs from a model of motivational resilience. As predicted, adaptive coping was positively correlated with students’ self-system processes of relatedness, competence, and autonomy as well as their ongoing engagement and reengagement, and negatively correlated with their catastrophizing appraisals and emotional reactivity. Maladaptive coping showed the opposite pattern of correlations. The potential utility of the measure, the different scores derived from it, and the role of constructive coping in motivational resilience are discussed.
Classroom observation of teachers is a significant part of educational measurement; measurements of teacher practice are being used in teacher evaluation systems across the country. This research investigated whether observations made live in the classroom and from video recording of the same lessons yielded similar inferences about teaching. Using scores on the Classroom Assessment Scoring System–Secondary (CLASS-S) from 82 algebra classrooms, we explored the effect of observation mode on inferences about the level or ranking of teaching in a single lesson or in a classroom for a year. We estimated the correlation between scores from the two observation modes and tested for mode differences in the distribution of scores, the sources of variance in scores, and the reliability of scores using generalizability and decision studies for the latter comparisons. Inferences about teaching in a classroom for a year were relatively insensitive to observation mode. However, time trends in the raters’ use of the score scale were significant for two CLASS-S domains, leading to mode differences in the reliability and inferences drawn from individual lessons. Implications for different modes of classroom observation with the CLASS-S are discussed.
This study compares the progressive-restricted standard error (PR-SE) exposure control procedure to three commonly used procedures in computerized adaptive testing, the randomesque, Sympson–Hetter (SH), and no exposure control methods. The performance of these four procedures is evaluated using the three-parameter logistic model under the manipulated conditions of item pool size (small vs. large) and stopping rules (fixed-length vs. variable-length). PR-SE provides the advantage of similar constraints to SH, without the need for a preceding simulation study to execute it. Overall for the large and small item banks, the PR-SE method administered almost all of the items from the item pool, whereas the other procedures administered about 52% or less of the large item bank and 80% or less of the small item bank. The PR-SE yielded the smallest amount of item overlap between tests across conditions and administered fewer items on average than SH. PR-SE obtained these results with similar, and acceptable, measurement precision compared to the other exposure control procedures while vastly improving on item pool usage.
Large-scale experiments that involve nested structures may assign treatment conditions either to subgroups such as classrooms or to individuals such as students within subgroups. Key aspects of the design of such experiments include knowledge of the variance structure in higher levels and the sample sizes necessary to reach sufficient power to detect the treatment effect. This study provides methods for maximizing power within a fixed budget in three-level block randomized balanced designs with two levels of nesting, where, for example, students are nested within classrooms and classrooms are nested within schools, and schools and classrooms are random effects. The power computations take into account the costs of units of different levels, the variance structure at the second (e.g., classroom) and third (e.g., school) levels, and the sample sizes (e.g., number of Level-1, Level-2, and Level-3 units).
This note is concerned with a latent variable modeling approach for the study of differential item functioning in a multigroup setting. A multiple-testing procedure that can be used to evaluate group differences in response probabilities on individual items is discussed. The method is readily employed when the aim is also to locate possible sources of differential item functioning in homogeneous behavioral measuring instruments across two or more populations under investigation. The approach is readily applicable using the popular software Mplus and R and is illustrated with a numerical example.
There has been growing interest in comparing achievement goal orientations across ethnic groups. Such comparisons, however, cannot be made until validity evidence has been collected to support the use of an achievement goal orientation instrument for that purpose. Therefore, this study investigates the measurement invariance of a particular measure of achievement goal orientation, the modified Achievement Goal Questionnaire (AGQ-M), across African American and White university students. Confirmatory factor analyses support measurement invariance across the two groups. These findings provide additional validity evidence for the newly conceptualized 2 × 2 framework of achievement goal orientation and for the equivalence of functioning of the AGQ-M across these distinct groups. Because this level of invariance is established, researchers can make more valid inferences about differences in the AGQ-M scores across African American and White students.