This article describes the analysis of regression-discontinuity designs (RDDs) using the
Researchers designing multisite and cluster randomized trials of educational interventions will usually conduct a power analysis in the planning stage of the study. To conduct the power analysis, researchers often use estimates of intracluster correlation coefficients and effect sizes derived from an analysis of survey data. When there is heterogeneity in treatment effects across the clusters in the study, these parameters will need to be adjusted to produce an accurate power analysis for a hierarchical trial design. The relevant adjustment factors are derived and presented in the current article. The adjustment factors depend upon the covariance between treatment effects and cluster-specific average values of the outcome variable, illustrating the need for better information about this parameter. The results in the article also facilitate understanding of the relative power of multisite and cluster randomized studies conducted on the same population by showing how the parameters necessary to compute power in the two types of designs are related. This is accomplished by relating parameters defined by linear mixed model specifications to parameters defined in terms of potential outcomes.
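For context, a minimal sketch of the balanced-design variance and power formulas that underlie such comparisons is given below. These are textbook expressions for a cluster randomized design and a multisite (within-site randomized) design, not the heterogeneity-adjusted factors derived in the article, and all parameter values are hypothetical.

```r
# Variance of the estimated average treatment effect under balanced designs:
# J clusters/sites of size n, between-cluster variance tau2, within-cluster
# variance sigma2, and variance of site-specific treatment effects omega2.
var_crt       <- function(J, n, tau2, sigma2)   4 * (tau2 + sigma2 / n) / J
var_multisite <- function(J, n, omega2, sigma2) (omega2 + 4 * sigma2 / n) / J

# Hypothetical population values with ICC = tau2 / (tau2 + sigma2) = 0.15
J <- 40; n <- 20; tau2 <- 0.15; sigma2 <- 0.85; omega2 <- 0.05; delta <- 0.25
ncp_crt <- delta / sqrt(var_crt(J, n, tau2, sigma2))
ncp_ms  <- delta / sqrt(var_multisite(J, n, omega2, sigma2))

# Approximate two-sided power (upper tail only) from the noncentral t distribution
c(power_crt       = 1 - pt(qt(0.975, J - 2), J - 2, ncp = ncp_crt),
  power_multisite = 1 - pt(qt(0.975, J - 1), J - 1, ncp = ncp_ms))
```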
Meta-analysis is a statistical technique that allows an analyst to synthesize effect sizes from multiple primary studies. The open-source statistical environment R is quickly becoming a popular choice for estimating meta-analysis models. The meta-analytic community has contributed to this growth by developing numerous packages specific to meta-analysis. The purpose of this study is to locate all publicly available meta-analytic R packages. We located 63 packages via a comprehensive online search. To help make the functionality of these packages clear to the field, we describe each package, recommend applications for researchers interested in using R for meta-analyses, provide a brief tutorial of two meta-analysis packages, and make suggestions for future meta-analytic R package creators.
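As a rough illustration of the kind of workflow such packages support, the following sketch fits a random-effects model with metafor, one widely used meta-analysis package (the abstract does not name the two packages covered in its tutorial). The data frame mydata and its summary-statistic columns are hypothetical.

```r
library(metafor)

# Compute standardized mean differences from hypothetical per-study summary
# statistics, then fit a random-effects model via REML.
dat <- escalc(measure = "SMD",
              m1i = m1i, sd1i = sd1i, n1i = n1i,
              m2i = m2i, sd2i = sd2i, n2i = n2i,
              data = mydata)
res <- rma(yi, vi, data = dat, method = "REML")
summary(res)   # pooled effect, between-study variance, heterogeneity tests
forest(res)    # forest plot of study-level and pooled estimates
```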
Causal mediation analysis is the study of mechanisms—variables measured between a treatment and an outcome that partially explain their causal relationship. The past decade has seen an explosion of research in causal mediation analysis, resulting in both conceptual and methodological advancements. However, many of these methods have been out of reach for applied quantitative researchers, due to their complexity and the difficulty of implementing them in standard statistical software distributions. The
An increasing concern of producers of educational assessments is fraudulent behavior during the assessment (van der Linden, 2009). Benefiting from item preknowledge (e.g., Eckerly, 2017; McLeod, Lewis, & Thissen, 2003) is one type of fraudulent behavior. This article suggests two new test statistics for detecting individuals who may have benefited from item preknowledge; the statistics can be used for both nonadaptive and adaptive assessments that may include dichotomous items, polytomous items, or both. Each new statistic has an asymptotic standard normal null distribution. Detailed simulation studies demonstrate that the Type I error rates of the new statistics are close to the nominal level and that the new statistics have greater power than an existing statistic for addressing the same problem.
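Because the abstract states only that the statistics are asymptotically standard normal under the null, the sketch below illustrates the generic flagging rule and Type I error check implied by that property, using simulated null values rather than the authors' actual statistics.

```r
set.seed(123)
alpha <- 0.05

# Under the null (no preknowledge) each statistic is approximately N(0, 1),
# so flag an examinee when the statistic exceeds the one-sided critical value.
critical   <- qnorm(1 - alpha)
null_stats <- rnorm(1e5)           # stand-in for statistics computed under H0
mean(null_stats > critical)        # empirical Type I error rate, close to alpha
```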
Test score distributions of schools or demographic groups are often summarized by frequencies of students scoring in a small number of ordered proficiency categories. We show that heteroskedastic ordered probit (HETOP) models can be used to estimate means and standard deviations of multiple groups’ test score distributions from such data. Because the scale of HETOP estimates is indeterminate up to a linear transformation, we develop formulas for converting the HETOP parameter estimates and their standard errors to a scale in which the population distribution of scores is standardized. We demonstrate and evaluate this novel application of the HETOP model with a simulation study and using real test score data from two sources. We find that the HETOP model produces unbiased estimates of group means and standard deviations, except when group sample sizes are small. In such cases, we demonstrate that a "partially heteroskedastic" ordered probit (PHOP) model can produce estimates with a smaller root mean squared error than the fully heteroskedastic model.
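A minimal sketch of the rescaling step described here, mapping group means and standard deviations to a metric in which the overall population distribution is standardized, follows. It uses the usual mixture mean and variance identities and omits the article's standard error formulas; all input values are hypothetical.

```r
# Convert estimated group means/SDs from an arbitrary (probit) scale to a scale
# on which the population distribution of scores has mean 0 and SD 1.
standardize_hetop <- function(mu, sigma, p) {
  # mu, sigma: group means and SDs; p: group proportions (must sum to 1)
  mu_pop <- sum(p * mu)
  sd_pop <- sqrt(sum(p * (sigma^2 + (mu - mu_pop)^2)))
  list(mu_std = (mu - mu_pop) / sd_pop, sigma_std = sigma / sd_pop)
}

standardize_hetop(mu = c(-0.2, 0.1, 0.4), sigma = c(0.9, 1.0, 1.1), p = c(0.3, 0.4, 0.3))
```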
Equivalence assessment is becoming an increasingly important topic in many application areas, including behavioral and social sciences research. Although more powerful tests exist, the two one-sided tests (TOST) procedure is a technically transparent and widely accepted method for establishing statistical equivalence. Alternatively, a direct extension of Welch’s solution for the Behrens–Fisher problem is preferred for equivalence testing of means when the homogeneity of variance assumption is violated. For advance planning of equivalence studies, this article describes both exact and nearly exact power functions of the heteroscedastic TOST procedure and develops useful approaches to optimal sample size determination under various allocation and cost considerations. Detailed numerical illustrations and simulation studies are presented to demonstrate the distinct features of the suggested techniques and the potential deficiency of an existing method. Moreover, computer programs are provided to facilitate the implementation of the described sample size procedures. The proposed formulas and algorithms are recommended over existing results for their technical transparency, overall performance, and diverse utility.
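A minimal sketch of the heteroscedastic TOST itself (not the power or sample size routines the article develops) is given below; it simply runs two one-sided Welch t tests against hypothetical equivalence bounds of ±delta on the raw mean-difference scale.

```r
# Two one-sided Welch t tests: equivalence is declared only if both reject.
tost_welch <- function(x, y, delta, alpha = 0.05) {
  lower <- t.test(x, y, mu = -delta, alternative = "greater", var.equal = FALSE)
  upper <- t.test(x, y, mu =  delta, alternative = "less",    var.equal = FALSE)
  list(p_lower    = lower$p.value,
       p_upper    = upper$p.value,
       equivalent = lower$p.value < alpha && upper$p.value < alpha)
}

set.seed(1)
tost_welch(x = rnorm(40, mean = 0.0, sd = 1),
           y = rnorm(25, mean = 0.1, sd = 2),
           delta = 0.5)
```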
In recent years, a wide array of tools has emerged for conducting educational data mining (EDM) and/or learning analytics (LA) research. In this article, we hope to highlight some of the most widely used, most accessible, and most powerful tools available to researchers interested in conducting EDM/LA research. We will highlight the utility these tools have for common data preprocessing and analysis steps in a typical research project, as well as more descriptive information such as price point and user-friendliness. We will also highlight niche tools in the field, such as those used for Bayesian knowledge tracing (BKT), data visualization, text analysis, and social network analysis. Finally, we will discuss the importance of familiarizing oneself with multiple tools—a data analysis toolbox—for the practice of EDM/LA research.
A review of the software Just Another Gibbs Sampler (JAGS) is provided. We cover aspects related to history and development and the elements a user needs to know to get started with the program, including (a) definition of the data, (b) definition of the model, (c) compilation of the model, and (d) initialization of the model. An example using a latent class model with large-scale education data is provided to illustrate how easily JAGS can be implemented in R. We also cover details surrounding the many programs implementing JAGS. We conclude with a discussion of the newest features and upcoming developments. JAGS is constantly evolving and is developing into a flexible, user-friendly program with many benefits for Bayesian inference.
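As a rough sketch of steps (a) through (d) when calling JAGS from R, the following fits a toy normal-mean model with the rjags package rather than the latent class model used in the article; the data values are hypothetical.

```r
library(rjags)

# (a) Definition of the data
y <- c(5.1, 4.8, 5.6, 5.0, 4.7)
jags_data <- list(y = y, N = length(y))

# (b) Definition of the model in the BUGS/JAGS language
model_string <- "
model {
  for (i in 1:N) { y[i] ~ dnorm(mu, tau) }
  mu  ~ dnorm(0, 1.0E-4)
  tau ~ dgamma(0.001, 0.001)
}"

# (c) Compilation and (d) initialization of the model
jm <- jags.model(textConnection(model_string), data = jags_data,
                 inits = list(mu = 0, tau = 1), n.chains = 2)
update(jm, 1000)                                        # burn-in
post <- coda.samples(jm, variable.names = c("mu", "tau"), n.iter = 5000)
summary(post)
```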
This article reviews PROC IRT, which was added to Statistical Analysis Software (SAS) in 2014. We provide an introductory overview of a free version of SAS, describe what PROC IRT offers for item response theory (IRT) analysis and how one can use PROC IRT, and discuss how other SAS macros and procedures may complement the IRT functionality of PROC IRT.
Detection of differential item functioning (DIF) by use of the logistic modeling approach has a long tradition. One big advantage of the approach is that it can be used to investigate nonuniform DIF (NUDIF) as well as uniform DIF (UDIF). The classical approach allows one to detect DIF by distinguishing between multiple groups. We propose an alternative method that combines recursive partitioning methods (or trees) with logistic regression methodology to detect UDIF and NUDIF in a nonparametric way. The output of the method is a set of trees that visualize, in a simple way, the structure of DIF in an item, showing which variables interact, and in which way, when generating DIF. In addition, we consider a logistic regression method in which DIF can be induced by a vector of covariates, which may include categorical as well as continuous covariates. The methods are investigated in simulation studies and illustrated by two applications.
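For orientation, a minimal sketch of the classical logistic regression DIF tests that the proposed tree-based method builds on is given below (the tree-growing and split-selection steps themselves are not reproduced); resp, score, and group are hypothetical variables for one item response, a matching score, and a grouping covariate.

```r
# Likelihood ratio tests for uniform and nonuniform DIF in a single item.
dif_logistic <- function(resp, score, group) {
  m0 <- glm(resp ~ score,         family = binomial)  # no DIF
  m1 <- glm(resp ~ score + group, family = binomial)  # uniform DIF (group main effect)
  m2 <- glm(resp ~ score * group, family = binomial)  # nonuniform DIF (interaction)
  list(uniform_DIF    = anova(m0, m1, test = "LRT"),
       nonuniform_DIF = anova(m1, m2, test = "LRT"))
}
```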
Meijer and van Krimpen-Stoop noted that the number of person-fit statistics (PFSs) that have been designed for computerized adaptive tests (CATs) is relatively modest. This article partially addresses that concern by suggesting three new PFSs for CATs. The statistics are based on tests for a change point and can be used to detect an abrupt change in test performance of examinees during a CAT. The Type I error rate and power of the statistics are computed from a detailed simulation study. The performances of the new statistics are compared with those of four existing PFSs using receiver operating characteristics curves. The new statistics are then computed using data from an operational and high-stakes CAT. The new PFSs appear promising for assessment of person fit for CATs.
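The abstract does not give the form of the three statistics, so the sketch below only illustrates the general change-point idea: scanning every split point of a response string for an abrupt shift in model residuals. Here x is a hypothetical 0/1 response vector and p the model-implied success probabilities.

```r
# Generic change-point person-fit statistic: the maximum standardized difference
# between mean residuals before and after each possible split point.
change_point_pfs <- function(x, p) {
  n   <- length(x)
  res <- x - p            # item-level residuals under the fitted IRT model
  v   <- p * (1 - p)      # Bernoulli variances
  z <- sapply(1:(n - 1), function(k) {
    (mean(res[1:k]) - mean(res[(k + 1):n])) /
      sqrt(sum(v[1:k]) / k^2 + sum(v[(k + 1):n]) / (n - k)^2)
  })
  max(abs(z))             # large values suggest an abrupt change in test performance
}
```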
This article revisits how the end points of plotted line segments should be selected when graphing interactions involving a continuous target predictor variable. Under the standard approach, end points are chosen at ±1 or 2 standard deviations from the target predictor mean. However, when the target predictor and moderator are correlated or the conditional variance of the target predictor depends on the moderator variable value, these end points may reside in regions with little or no supporting data, encouraging potentially erroneous interpretations of the interaction, in particular, and patterns in the data, in general. Tumble graphs are introduced to minimize the likelihood of these problems. The utility of the Tumble graph over the standard approach is demonstrated with a real data example.
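A minimal sketch of the idea follows: each simple-slope segment is plotted only over the region of the target predictor actually supported at that moderator value, here taken as the conditional mean ±1 SD from a linear regression of x on m. The data are simulated and the choice of moderator values is hypothetical.

```r
set.seed(1)
n <- 300
m <- rnorm(n)                                   # moderator
x <- 0.6 * m + rnorm(n, sd = 0.8)               # target predictor, correlated with m
y <- 0.3 * x + 0.2 * m + 0.25 * x * m + rnorm(n)
fit    <- lm(y ~ x * m)
x_on_m <- lm(x ~ m)                             # model for the conditional distribution of x | m

plot(x, y, col = "grey60", pch = 16)
for (m0 in quantile(m, c(0.25, 0.75))) {
  cm <- predict(x_on_m, newdata = data.frame(m = m0))   # conditional mean of x given m = m0
  cs <- sd(resid(x_on_m))                               # conditional SD of x given m
  xs <- c(cm - cs, cm + cs)                             # segment end points with data support
  ys <- predict(fit, newdata = data.frame(x = xs, m = m0))
  lines(xs, ys, lwd = 2)
}
```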
Recently, there has been an increase in the number of cluster randomized trials (CRTs) to evaluate the impact of educational programs and interventions. These studies are often powered for the main effect of treatment to address the "what works" question. However, program effects may vary by individual characteristics or by context, making it important to also consider power to detect moderator effects. This article presents a framework for calculating statistical power for moderator effects at all levels for two- and three-level CRTs. Annotated R code is included to make the calculations accessible to researchers and to increase the regularity with which a priori power analyses for moderator effects in CRTs are conducted.
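The article's own annotated R code is not reproduced here; as a stand-in, the sketch below computes power for the treatment main effect and for a cluster-level binary moderator (treatment-by-moderator interaction) in a balanced two-level CRT, assuming a 50/50 treatment split, a 50/50 moderator split, and a standardized outcome. All parameter values are hypothetical.

```r
# J = number of clusters, n = students per cluster, rho = intraclass correlation,
# delta = standardized effect size, alpha = two-sided significance level.
crt_moderator_power <- function(J, n, rho, delta, alpha = 0.05) {
  v_main <- 4 * (rho + (1 - rho) / n) / J   # variance of the main-effect estimate
  v_mod  <- 4 * v_main                      # balanced binary level-2 moderator
  power_t <- function(v, df) {
    ncp  <- delta / sqrt(v)
    crit <- qt(1 - alpha / 2, df)
    1 - pt(crit, df, ncp) + pt(-crit, df, ncp)
  }
  c(main_effect = power_t(v_main, df = J - 2),
    moderator   = power_t(v_mod,  df = J - 4))
}

crt_moderator_power(J = 40, n = 25, rho = 0.15, delta = 0.25)
```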
When cluster randomized experiments are analyzed as if units were independent, test statistics for treatment effects can be anticonservative. Hedges proposed a correction for such tests by scaling them to control their Type I error rate. This article generalizes the Hedges correction from a posttest-only experimental design to more common designs used in practice. We show that for many experimental designs, the generalized correction controls its Type I error while the Hedges correction does not. The generalized correction, however, necessarily has low power due to its control of the Type I error. Our results imply that using the Hedges correction as prescribed, for example, by the What Works Clearinghouse can lead to incorrect inferences and has important implications for evidence-based education.
Unless strong assumptions are made, nonparametric identification of principal causal effects can only be partial, and bounds (or sets) for the causal effects are established. In the presence of a secondary outcome, recent results exist to sharpen the bounds by exploiting conditional independence assumptions. More general results, though not embedded in a causal framework, can be found in concentration graphical models with a latent variable. The aim of this article is to establish a link between the two settings and to show that adapting and extending results pertaining to concentration graphical models can help achieve identification of principal causal effects in studies where more than one additional outcome is available. Model selection criteria are also suggested. An empirical illustrative example is provided, using data from a real social experiment.
We present types of constructs, individual- and cluster-level, and their confirmatory factor analytic validation models when data are from individuals nested within clusters. When a construct is theoretically individual level, spurious construct-irrelevant dependency in the data may appear to signal cluster-level dependency; in such cases, however, and consistent with theory, a single-level analysis with a correction for dependency may be appropriate. Regarding cluster-level constructs, we discuss two types—shared and configural—and present appropriate validation models. Illustrative validation analyses with individual, shared, and configural constructs are provided using empirical data as well as simple simulations demonstrating the spurious effects that can occur with nested data. The article concludes with future directions to be examined in construct validation in multilevel settings.
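As a rough sketch of the kind of validation model discussed here, the following fits a two-level confirmatory factor analysis in lavaan, with a within-level (individual) factor and a between-level (configural cluster-level) factor on the same indicators. The data frame dat, the cluster identifier school, and the indicators y1-y4 are hypothetical, and the shared-construct and dependency-correction variants from the article are not shown.

```r
library(lavaan)

# Two-level CFA: the same indicators load on an individual-level factor at
# level 1 and on a cluster-level (configural) factor at level 2.
model <- '
  level: 1
    f_within  =~ y1 + y2 + y3 + y4
  level: 2
    f_between =~ y1 + y2 + y3 + y4
'
fit <- cfa(model, data = dat, cluster = "school")
summary(fit, fit.measures = TRUE)
```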
We address the problem of selecting the best of a set of units based on a criterion variable, when its value is recorded for every unit subject to estimation, measurement, or another source of error. The solution is constructed in a decision-theoretical framework, incorporating the consequences (ramifications) of the various kinds of error that can be committed. The related problems of classifying the units into a small number of groups and ranking them are solved by a similar approach. An application is presented involving retention rates in the undergraduate courses of a university.
To assess the direct and indirect effect of an intervention, multilevel 2-1-1 studies with intervention randomized at the upper (class) level and mediator and outcome measured at the lower (student) level are frequently used in educational research. In such studies, the mediation process may flow through the student-level mediator (the within indirect effect) or a class-aggregated mediator (the contextual indirect effect). In this article, we cast mediation analysis within the counterfactual framework and clarify the assumptions needed to identify the within and contextual indirect effects. We show that, unlike the contextual indirect effect, the within indirect effect can be estimated without bias in linear models in the presence of unmeasured confounders of the mediator–outcome relationship at the upper level that exert additive effects on the mediator and the outcome. When unmeasured confounding occurs at the individual level, neither indirect effect is identified. We propose sensitivity analyses to assess the robustness of the within and contextual indirect effects under lower- and upper-level confounding, respectively.
Multilevel modeling techniques are becoming more popular in handling data with multilevel structure in educational and behavioral research. Recently, researchers have paid more attention to cross-classified data structure that naturally arises in educational settings. However, unlike traditional single-level research, methodological studies about multilevel effect size have been rare and those that have recently appeared had an emphasis on strictly hierarchical data structure. This article extends the work on multilevel standardized mean differences from strictly hierarchical structure to both fully and partially cross-classified structures. Analytically derived formulae for calculating effect sizes and the corresponding sampling variances (or standard errors) are presented, verified by simulation results, and illustrated with real data examples. Implications for primary research studies and meta-analyses are discussed.
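The article's derived formulas are not reproduced here; as a rough sketch of the general idea, the code below fits a cross-classified random effects model with lme4 and standardizes the treatment contrast by the square root of the sum of all variance components (one common choice of standardizer). The data frame dat and the variables trt, school, and neighborhood are hypothetical.

```r
library(lme4)

# Students cross-classified by school and neighborhood; trt is a 0/1 indicator.
fit <- lmer(y ~ trt + (1 | school) + (1 | neighborhood), data = dat)

# Standardized mean difference using the total variance as the standardizer.
vc <- as.data.frame(VarCorr(fit))        # random-effect and residual variances
d  <- fixef(fit)["trt"] / sqrt(sum(vc$vcov))
d
```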
Second-order item response theory models have been used for assessments consisting of several domains, such as content areas. We extend the second-order model to a third-order model for assessments that include subdomains nested in domains. Using a graphical model framework, it is shown how the model does not suffer from the curse of multidimensionality. We apply unidimensional, second-order, and third-order item response models to the 2007 Trends in International Mathematics and Science Study. Our findings suggest that deviations from unidimensionality are more pronounced at the content domain level than at the cognitive domain level and that deviations from unidimensionality at the content domain level become negligible after taking into account topic areas.
Extreme response set, the tendency to prefer the lowest or highest response option when confronted with a Likert-type response scale, can lead to misfit of item response models such as the generalized partial credit model. Recently, a series of intrinsically multidimensional item response models have been hypothesized, wherein tendency toward extreme response set is simultaneously estimated alongside one or more psychological constructs of interest. The multidimensional nominal response model (MNRM) is a divide-by-total model that allows person parameters for response sets, including extreme response set. The proportional thresholds model (PTM) is a difference model with response set parameters. The present study introduces a two-decision model (TDM) as an alternative to the MNRM and PTM and compares all three on data from assessments used in employee selection.
In many surveys, the data comprise a large number of categorical variables that suffer from item nonresponse. Standard methods for multiple imputation, like log-linear models or sequential regression imputation, can fail to capture complex dependencies and can be difficult to implement effectively in high dimensions. We present a fully Bayesian, joint modeling approach to multiple imputation for categorical data based on Dirichlet process mixtures of multinomial distributions. The approach automatically models complex dependencies while being computationally expedient. The Dirichlet process prior distributions enable analysts to avoid fixing the number of mixture components at an arbitrary number. We illustrate repeated sampling properties of the approach using simulated data. We apply the methodology to impute missing background data in the 2007 Trends in International Mathematics and Science Study.
In education randomized control trials (RCTs), the misreporting of student outcome data could lead to biased estimates of average treatment effects (ATEs) and their standard errors. This article discusses a statistical model that adjusts for misreported binary outcomes for two-level, school-based RCTs, where it is assumed that misreporting could occur for students with truly undesirable outcomes, but not for those with truly desirable outcomes. A latent variable index approach using study baseline data is employed to model both the misreporting and binary outcome decision processes, separately for treatments and controls, using random effects probit models to adjust for school-level clustering. Quasi-Newton maximum likelihood methods are developed to obtain consistent estimates of the ATE parameter and the unobserved misreporting rates. The estimation approach is demonstrated using self-reported arrest data from a large-scale RCT of Job Corps, the nation’s largest residential training program for disadvantaged youths between the ages of 16 and 24.
Since heterogeneity between reliability coefficients is usually found in reliability generalization studies, moderator analyses constitute a crucial step for that meta-analytic approach. In this study, different procedures for conducting mixed-effects meta-regression analyses were compared. Specifically, four transformation methods for the reliability coefficients, two estimators of the residual between-studies variance, and two methods for testing the significance of regression coefficients were combined in a Monte Carlo simulation study. The different methods were compared in terms of bias and mean square error (MSE) of the slope estimates, and Type I error and statistical power rates of the slope statistical tests. The results of the simulation study did not vary as a function of the residual variance estimator. All transformation methods provided negatively biased estimates, but both bias and MSE were reasonably small in all cases. In contrast, important differences were found regarding the statistical tests, with the method proposed by Knapp and Hartung showing closer adherence to the nominal significance level and higher power rates than the standard method.
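A minimal sketch of the comparison between the standard (normal-based) test and the Knapp–Hartung test in a mixed-effects meta-regression, using the metafor package (which the abstract does not itself name), is shown below; yi, vi, and moderator are hypothetical columns holding transformed reliability coefficients, their sampling variances, and a study-level moderator.

```r
library(metafor)

# Mixed-effects meta-regression on transformed reliability coefficients.
res_z  <- rma(yi, vi, mods = ~ moderator, data = dat, method = "REML", test = "z")
res_kh <- rma(yi, vi, mods = ~ moderator, data = dat, method = "REML", test = "knha")

summary(res_z)    # standard normal-based test of the slope
summary(res_kh)   # Knapp-Hartung adjusted test of the slope
```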