When cluster randomized trials are used to evaluate school-based nutritional interventions such as school lunch programs, design-stage estimates of the required sample size must take into account the correlation in outcomes among individuals within each cluster (e.g., classrooms, schools, or districts). Estimates of the necessary parameters have been carefully developed for educational interventions, but for nutritional interventions the literature is thin.
Using data from two large multi-school, multi-district impact evaluations conducted in the United States, this article calculates estimates of the design parameters required for sizing school-based nutritional studies. The large size of the trials (252 and 1,327 schools) yields precise estimates of the parameters of interest. Variance components are estimated by fitting random-intercept multilevel models in Stata.
School-level intraclass correlations are similar to those typically found for educational outcomes. In particular, school-level estimates range from less than .01 to .26 across the two studies, and district-level estimates range from less than .01 to .19. This suggests that cluster randomized trials of nutritional interventions may require numbers of schools similar to those used in education studies to detect similar effect sizes.
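The core calculation behind such estimates can be sketched as follows (the abstract reports models fit in Stata; the Python version below is a minimal stand-in with hypothetical column names and data file).

```python
# Minimal sketch: estimate a school-level ICC from a two-level
# random-intercept model. Column names and the data file are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("student_outcomes.csv")  # one row per student

# Unconditional model: outcome with a random intercept for school.
model = smf.mixedlm("score ~ 1", data=df, groups=df["school_id"])
fit = model.fit(reml=True)

tau2 = fit.cov_re.iloc[0, 0]   # between-school variance
sigma2 = fit.scale             # within-school (residual) variance
icc = tau2 / (tau2 + sigma2)
print(f"school-level ICC = {icc:.3f}")
```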
Over the past two decades, the lack of reliable empirical evidence concerning the effectiveness of educational interventions has motivated a new wave of research in education in sub-Saharan Africa (and across most of the world) that focuses on impact evaluation through rigorous research designs such as experiments. Often these experiments draw on the random assignment of entire clusters, such as schools, to accommodate the multilevel structure of schooling and the theory of action underlying many school-based interventions. Planning effective and efficient school-randomized studies, however, requires plausible values of the intraclass correlation coefficient (ICC) and the variance explained by covariates during the design stage. The purpose of this study was to improve the planning of two-level school-randomized studies in sub-Saharan Africa by providing empirical estimates of the ICC and the variance explained by covariates for education outcomes in 15 countries.
Our investigation drew on large-scale representative samples of sixth-grade students in 15 countries in sub-Saharan Africa and included over 60,000 students across 2,500 schools. We examined two core education outcomes: standardized achievement in reading and mathematics. We estimated a series of two-level hierarchical linear models with students nested within schools to inform the design of two-level school-randomized trials.
The analyses suggested that outcomes were substantially clustered within schools but that the magnitude of the clustering varied considerably across countries. Similarly, the results indicated that covariate adjustment generally reduced clustering but that the prognostic value of such adjustment varied across countries.
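A companion sketch, again with hypothetical names, shows how the variance explained by a covariate at each level can be obtained by comparing unconditional and conditional random-intercept models.

```python
# Sketch (hypothetical variable names): quantify how much of the between-
# and within-school variance a covariate explains, via the proportional
# reduction in each variance component after adding the covariate.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("achievement.csv")

uncond = smf.mixedlm("reading ~ 1", df, groups=df["school_id"]).fit(reml=True)
cond = smf.mixedlm("reading ~ ses", df, groups=df["school_id"]).fit(reml=True)

r2_between = 1 - cond.cov_re.iloc[0, 0] / uncond.cov_re.iloc[0, 0]
r2_within = 1 - cond.scale / uncond.scale
print(f"variance explained: between-school {r2_between:.2f}, "
      f"within-school {r2_within:.2f}")
```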
Cluster randomized controlled trials (CRCTs) often require a large number of clusters in order to detect small effects with high probability. However, there are contexts where it may be possible to design a CRCT with a much smaller number of clusters (10 or fewer) and still detect meaningful effects.
The objective is to offer recommendations for best practices in design and analysis for small CRCTs.
I use simulations to examine alternative design and analysis approaches. Specifically, I examine (1) which analytic approaches control Type I errors at the desired rate, (2) which design and analytic approaches yield the most power, (3) what the design effect of spurious correlations is, and (4) under which specific scenarios impacts of different sizes can be detected with high probability.
I find that (1) mixed effects modeling and using Ordinary Least Squares (OLS) on data aggregated to the cluster level both control the Type I error rate, (2) randomization within blocks is always recommended, but how best to account for blocking through covariate adjustment depends on whether the precision gains offset the degrees-of-freedom loss, (3) power calculations can be accurate when design effects from small-sample spurious correlations are taken into account, and (4) it is very difficult to detect small effects with just four clusters, but with six or more clusters, there are realistic circumstances under which small effects can be detected with high probability.
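A stripped-down version of this kind of simulation is sketched below: data are generated under the null for a handful of clusters and analyzed with OLS on cluster means. All parameter values are illustrative, and the sketch does not reproduce the article's full set of design and analysis conditions.

```python
# Simulation sketch (illustrative parameters, not the article's): Type I
# error of cluster-level analysis with very few clusters. Under the null,
# OLS on cluster means (equivalently a t-test on cluster means) should
# reject at roughly the nominal 5% rate.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
J, n, icc, alpha = 6, 50, 0.10, 0.05   # clusters, cluster size, ICC, level
reps, rejections = 5000, 0

for _ in range(reps):
    u = rng.normal(0, np.sqrt(icc), J)              # cluster effects
    e = rng.normal(0, np.sqrt(1 - icc), (J, n))     # student-level noise
    y_bar = (u[:, None] + e).mean(axis=1)           # cluster means
    treat = rng.permutation([1] * (J // 2) + [0] * (J - J // 2))
    t, p = stats.ttest_ind(y_bar[treat == 1], y_bar[treat == 0])
    rejections += p < alpha

print(f"empirical Type I error with {J} clusters: {rejections / reps:.3f}")
```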
Even a well-designed randomized controlled trial (RCT) can produce ambiguous results. This article highlights a case in which full-sample results from a large-scale RCT in the United Kingdom differ from results for a subsample of survey respondents.
Our objective is to ascertain the source of the discrepancy in inferences across data sources and, in doing so, to highlight important threats to the reliability of the causal conclusions derived from even the strongest research designs.
The study analyzes administrative data to shed light on the source of the differences between the estimates. We explore the extent to which heterogeneous treatment impacts and survey nonresponse might explain these differences. We suggest checks that assess the external validity of survey-measured impacts, which in turn provide an opportunity to test the effectiveness of different weighting schemes for removing bias. The subjects included 6,787 individuals who participated in a large-scale social policy experiment.
Our results were not definitive but suggest nonresponse bias is the main source of the inconsistent findings.
The results caution against overconfidence in drawing conclusions from RCTs and highlight the need for great care to be taken in data collection and analysis. In particular, given the modest size of impacts expected in most RCTs, small discrepancies between data sources can alter the results. Survey data remain important as a source of information on outcomes not recorded in administrative data. However, linking survey and administrative data is strongly recommended where possible.
For a variety of reasons, researchers and evidence-based clearinghouses synthesizing the results of multiple studies often have very few studies that are eligible for any given research question. This situation is less than optimal for meta-analysis as it is usually practiced, that is, by employing inverse variance weights, which allows more informative studies to contribute relatively more to the analysis. This article outlines the choices available for synthesis when there are few studies to synthesize. As background, we review the synthesis practices used in several projects done at the behest of governmental agencies and private foundations. We then discuss the strengths and limitations of different approaches to meta-analysis in a limited information environment. Using examples from the U.S. Department of Education’s What Works Clearinghouse as case studies, we conclude with a discussion of Bayesian meta-analysis as a potential solution to the challenges encountered when attempting to draw inferences about the effectiveness of interventions from a small number of studies.
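For readers unfamiliar with the weighting the passage refers to, the following sketch shows a standard inverse-variance synthesis with hypothetical effect sizes from just three studies, alongside a DerSimonian-Laird random-effects variant; it is illustrative only and is not one of the clearinghouse case studies.

```python
# Sketch of the standard inverse-variance approach discussed above, with
# made-up effect sizes, to show how the weights behave when only a handful
# of studies are available; a DerSimonian-Laird random-effects variant is
# shown for comparison.
import numpy as np

d = np.array([0.25, 0.10, 0.40])        # hypothetical study effect sizes
v = np.array([0.02, 0.05, 0.08])        # hypothetical sampling variances

w = 1 / v                                # fixed-effect (inverse-variance) weights
fe = np.sum(w * d) / np.sum(w)

# DerSimonian-Laird estimate of between-study variance (tau^2)
q = np.sum(w * (d - fe) ** 2)
c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
tau2 = max(0.0, (q - (len(d) - 1)) / c)

w_re = 1 / (v + tau2)                    # random-effects weights
re = np.sum(w_re * d) / np.sum(w_re)
print(f"fixed effect = {fe:.3f}, tau^2 = {tau2:.3f}, random effects = {re:.3f}")
```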
Prior research has investigated design parameters for assessing average program impacts on achievement outcomes with cluster randomized trials (CRTs). Less is known about parameters important for assessing differential impacts.
This article develops a statistical framework for designing CRTs to assess differences in impact among student subgroups and presents initial estimates of critical parameters.
Effect sizes and minimum detectable effect sizes for average and differential impacts are calculated before and after conditioning on effects of covariates using results from several CRTs. Relative sensitivities to detect average and differential impacts are also examined.
Student outcomes from six CRTs are analyzed.
Achievement in math, science, reading, and writing.
The ratio of between-cluster variation in the slope of the moderator to the total variance, the "moderator gap variance ratio," is important for designing studies to detect differences in impact between student subgroups. This quantity is the analogue of the intraclass correlation coefficient. Typical values were .02 for gender and .04 for socioeconomic status. For the studies considered, estimates of differential impact were in many cases larger than those of average impact, and after conditioning on the effects of covariates, similar power was achieved for detecting average and differential impacts of the same size.
Measuring differential impacts is important for addressing questions of equity and generalizability and for guiding the interpretation of subgroup impact findings. Adequate power for doing so is in some cases achievable with CRTs designed to measure average impacts. Continued collection of parameters for assessing differential impacts is the next step.
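One plausible way to estimate a quantity of this kind is sketched below, using a random slope for a hypothetical binary moderator; the article's exact definition and estimation procedure may differ, and all names are illustrative.

```python
# Hedged sketch: one way to estimate a quantity like the "moderator gap
# variance ratio" described above, via a random slope for a hypothetical
# binary moderator (`female`). The article's definition may differ.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("crt_students.csv")

m = smf.mixedlm("score ~ female", df, groups=df["school_id"],
                re_formula="~female").fit(reml=True)

slope_var = m.cov_re.iloc[1, 1]     # between-school variance in the gender gap
total_var = df["score"].var()       # crude stand-in for total outcome variance
print(f"moderator gap variance ratio (approx.) = {slope_var / total_var:.3f}")
```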
Past studies have examined factors associated with reductions in bias in comparison group studies (CGSs). The companion work to this article extends that framework to investigate the accuracy of generalized inferences from CGSs.
This article empirically examines levels of bias in CGS-based impact estimates when used for generalization, and reductions in bias resulting from covariate adjustment. It assesses potential for bias reduction against criteria from past studies.
Multisite trials are used to generate impact estimates based on cross-site comparisons that are evaluated against site-specific experimental benchmarks. Strategies for reducing bias are evaluated. Results from two experiments are considered.
Students in Grades K–3 in 79 schools in Tennessee and students in Grades 4–8 in 82 schools in Alabama.
Grades K–3 Stanford Achievement Test reading and math scores; Grades 4–8 Stanford Achievement Test (SAT) 10 reading scores.
Generalizing impacts to sites through estimates based on between-site nonexperimental comparisons leads to bias from between-site differences in average performance, between-site differences in impact, and the covariation between these quantities. The first of these sources of bias is the largest. Covariate adjustments reduce bias but not completely. Criteria for bias reduction from past studies appear to extend to generalized inferences based on CGSs.
When generalizing from a CGS, results may be affected by bias from differences between the study and inference sites in both average performance and average impact. The same factors may underlie both forms of bias. Researchers and practitioners can assess the validity of generalized inferences from CGSs by applying criteria for bias reduction from past studies.
Mathematics professional development is widely offered, typically with the goal of improving teachers’ content knowledge, the quality of teaching, and ultimately students’ achievement. Recently, new assessments focused on mathematical knowledge for teaching (MKT) have been developed to assist in the evaluation and improvement of mathematics professional development. This study presents empirical estimates of average program change in MKT and its variation with the goal of supporting the design of experimental trials that are adequately powered to detect a specified program effect. The study drew on a large database representing five different assessments of MKT and, collectively, 326 professional development programs and 9,365 teachers. Results from cross-classified hierarchical growth models showed that standardized average change estimates across the five assessments ranged from a low of 0.16 standard deviations (SDs) to a high of 0.26 SDs. Power analyses based on the estimated pre- and posttest changes indicated that hundreds of teachers are needed to detect changes in knowledge at the lower end of the distribution. Even studies powered to detect effects at the higher end of the distribution will require substantial resources to conduct rigorous experimental trials. Empirical benchmarks that describe average program change and its variation provide a useful preliminary resource for interpreting the relative magnitude of effect sizes associated with professional development programs and for designing adequately powered trials.
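A back-of-the-envelope check conveys the scale involved; the sketch below uses a simple paired pre-post power calculation, not the article's cross-classified growth-model framework, and the design assumptions are illustrative.

```python
# Back-of-the-envelope sketch (not the article's cross-classified design):
# roughly how many teachers a simple paired pre-post comparison would need
# to detect standardized changes of 0.16 and 0.26 SDs with 80% power.
from statsmodels.stats.power import TTestPower

for d in (0.16, 0.26):
    n = TTestPower().solve_power(effect_size=d, alpha=0.05, power=0.80)
    print(f"effect = {d:.2f} SD -> about {n:.0f} teachers")
```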
The federal government’s emphasis on supporting the implementation of evidence-based programs has fueled a need to conduct and assess rigorous evaluations of programs. Through partnerships with researchers, policy makers, and practitioners, evidence reviews—projects that identify, assess, and summarize existing research in a given area—play an important role in supporting the quality of these evaluations and how the findings are used. These reviews encourage the use of sound scientific principles to identify, select, and implement evidence-based programs. The goals and standards of each review determine its conclusions about whether a given evaluation is of high quality or a program is effective. It can be difficult for decision makers to synthesize the body of evidence when faced with results from multiple program evaluations.
This study examined 14 federally funded evidence reviews to identify commonalities and differences in their assessments of evidence of effectiveness.
There were both similarities and significant differences across the reviews. In general, the evidence reviews agreed on the broad critical elements to consider when assessing evaluation quality, such as research design, low attrition, and baseline equivalence. The similarities suggest that, despite differences in topic and the availability of existing research, reviews typically favor evaluations that limit potential bias in their estimates of program effects. However, the way in which some of the elements were assessed, such as what constituted acceptable amounts of attrition, differed. Further, and more substantially, the reviews showed greater variation in how they conceptualized "effectiveness."
There is a need for greater guidance regarding design parameters and empirical benchmarks for social and behavioral outcomes to inform assumptions in the design and interpretation of cluster randomized trials (CRTs).
We calculated empirical reference values for critical research design parameters associated with statistical power for children’s social and behavioral outcomes, including effect sizes, intraclass correlations (ICCs), and proportions of variance explained by a covariate at different levels (R²).
Children from kindergarten to Grade 5 in the samples from four large CRTs evaluating the effectiveness of two classroom- and two school-level preventive interventions.
Teacher ratings of students’ social and behavioral outcomes using the Teacher Observation of Classroom Adaptation–Checklist and the Social Competence Scale–Teacher.
Two types of effect size benchmarks were calculated: (1) normative expectations for change and (2) policy-relevant demographic performance gaps. The ICCs and R² values were calculated using two-level hierarchical linear modeling (HLM), in which students were nested within schools, and three-level HLM, in which students were nested within classrooms and classrooms were nested within schools.
Comprehensive tables of benchmarks and ICC values are provided to help prevention researchers interpret the effect sizes of interventions and conduct power analyses for designing CRTs of children’s social and behavioral outcomes. The discussion also demonstrates how to use the parameter reference values provided in this article to calculate the sample size for two- and three-level CRT designs.
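The sample-size logic such a demonstration walks through can be sketched with the standard two-level formula for the minimum detectable effect size of a school-randomized trial; every input value below is illustrative rather than taken from the article.

```python
# Minimal sketch of a two-level CRT sample-size calculation: find the
# smallest number of schools J for which the minimum detectable effect size
# (MDES) falls below a target, given an ICC, covariate R^2 values, and a
# fixed number of students per school. Inputs are illustrative.
from math import sqrt
from scipy import stats


def mdes(J, n, icc, r2_between=0.0, r2_within=0.0,
         p_treat=0.5, alpha=0.05, power=0.80):
    """Standard two-level MDES for a school-randomized trial."""
    df = J - 2
    m = stats.t.ppf(1 - alpha / 2, df) + stats.t.ppf(power, df)
    var_term = (icc * (1 - r2_between) / (p_treat * (1 - p_treat) * J)
                + (1 - icc) * (1 - r2_within) / (p_treat * (1 - p_treat) * J * n))
    return m * sqrt(var_term)


target = 0.20
J = 4
while mdes(J, n=60, icc=0.15, r2_between=0.50, r2_within=0.30) > target:
    J += 2  # keep the two arms balanced
print(f"about {J} schools needed for an MDES of {target}")
```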
To limit the influence of attrition bias in assessments of intervention effectiveness, several federal evidence reviews have established a standard for acceptable levels of sample attrition in randomized controlled trials. These evidence reviews include the What Works Clearinghouse (WWC), the Home Visiting Evidence of Effectiveness Review, and the Teen Pregnancy Prevention Evidence Review. We believe the WWC attrition standard may constitute the first use of model-based, empirically supported bounds on attrition bias in the context of a federally sponsored systematic evidence review. Meeting the WWC attrition standard (or one of the attrition standards based on the WWC standard) is now an important consideration for researchers conducting studies that could potentially be reviewed by the WWC (or other evidence reviews).
The purpose of this article is to explain the WWC attrition model, to show how that model is used to establish attrition bounds, and to assess the sensitivity of the attrition bounds to key parameter values.
Results are based on equations derived in the article and values generated by applying those equations to a range of parameter values.
The authors find that the attrition boundaries are more sensitive to the maximum level of bias that an evidence review is willing to tolerate than to other parameters in the attrition model.
The authors conclude that the most productive refinements to existing attrition standards may be with respect to the definition of "maximum tolerable bias."
Conducting a systematic review in social policy is a resource-intensive process in terms of time and funds. It is thus important to understand the scope of the evidence base of a topic area prior to conducting a synthesis of primary research in order to maximize these resources. One approach to conserving resources is to map out the available evidence prior to undertaking a traditional synthesis. A few examples of this approach exist in the form of gap maps, overviews of reviews, and systematic maps supported by social policy and systematic review agencies alike. Despite this growing call for alternative approaches to systematic reviews, it is still common for systematic review teams to embark on a traditional in-depth review only.
This article describes a three-stage approach to systematic reviewing that was applied to a systematic review focusing on interventions for smallholder farmers in Africa. We argue that this approach proved useful in helping us to understand the evidence base.
By applying preliminary steps as part of a three-stage approach, we were able to concentrate the resources needed for a traditional systematic review on a more focused research question. This enabled us to identify and fill real knowledge gaps, build on work that had already been done, and avoid wasting resources on areas of work that would have no useful outcome. It also facilitated meaningful engagement between the review team and our key policy stakeholders.
In 2002, the U.S. Department of Education’s Institute of Education Sciences (IES) established the What Works Clearinghouse (WWC) at the confluence of a push to improve education research quality, a shift toward evidence-based decision-making, and an expansion of systematic reviews. In addition to providing decision makers with evidence to inform their choices, a systematic review sets expectations regarding study quality and execution for research on program efficacy. In this article, we examine education research through the filter of a long running systematic review to assess research quality over time and the role of the systematic review in producing evidence.
Using the WWC’s database of reviewed studies, we explored the relationships between study characteristics and dispositions as well as the differences by topic area and changes over time.
Through its design standards, the WWC has defined its requirements for a study to be considered causal evidence, which may have been one of the factors contributing to observed improvement in the quality of education research over the past 15 years. The levels and rates of studies meeting standards have been increasing over the life of the WWC. Additionally, the number and proportion of studies excluded due to ineligible design are decreasing. Thus, less research is ineligible due to design issues, and more eligible studies are meeting standards. As IES continues to conduct and fund studies designed to meet standards, and more decisions are directly tied to evidence, the body of rigorous education research may continue to grow.
Saving plays a crucial role in the process of economic growth. However, one main reason why poor people often do not save is that they lack financial knowledge. Improving the savings culture of children through financial education is a promising way to develop savings attitudes and behavior early in life.
Alongside Berry, Karlan, and Pradhan, this study is one of the first to examine the effects of social and financial education training and a children’s club developed by Aflatoun on savings attitudes and behavior among primary school children in Uganda.
A randomized phase-in approach was used, randomizing the order in which schools implemented the program (school-level randomization). The treatment group consisted of students in schools where the program was implemented, while the control group consisted of students in schools where the program had not yet been implemented. The program lasted 3 months and comprised 16 hours of sessions. We compared posttreatment variables for the treatment and control groups.
Study participants included 1,746 students, of which 936 students were from 22 schools that were randomly assigned to receive the program between May and July 2011; the remaining 810 students attended 22 schools that did not implement the program during the study period.
Indicators for children’s savings attitudes and behavior were key outcomes.
The intervention increased awareness of money, money recording, and savings attitudes. The results also provide some evidence, although less robust, that the intervention increased actual savings.
A short financial literacy and social training program can considerably improve children’s savings attitudes and behavior.
Exposure to media violence might have detrimental effects on psychological adjustment and is associated with aggression-related attitudes and behaviors. As a result, many media literacy programs have been implemented to tackle this major public health issue. However, there is little evidence about their effectiveness. Evaluating design effectiveness, particularly with regard to the targeting process, would prevent adverse effects and improve the evaluation of evidence-based media literacy programs.
The present research examined whether different relational lifestyles may explain differences in the effects of an antiviolence intervention program.
Based on relational and lifestyles theory, the authors designed a randomized controlled trial and applied a 2 (treatment: experimental vs. control) × 4 (lifestyle class, derived from the data using latent class analysis: communicative vs. autonomous vs. meta-reflexive vs. fractured) analysis of variance.
Seven hundred and thirty-five Italian students distributed in 47 classes participated anonymously in the research (51.3% females).
Participants completed a lifestyle questionnaire as well as measures of attitudes and behavioral intentions, which served as the dependent variables.
The results indicated that the program was effective in changing adolescents’ attitudes toward violence. However, behavioral intentions toward consumption of violent video games were moderated by lifestyles. Those with communicative relational lifestyles showed fewer intentions to consume violent video games, while a boomerang effect was found among participants with problematic lifestyles.
Adolescents’ lifestyles played an important role in influencing the effectiveness of an intervention aimed at changing behavioral intentions toward the consumption of violent video games. For that reason, audience lifestyle segmentation analysis should be considered an essential technique for designing, evaluating, and improving media literacy programs.
Injury and violence prevention strategies have greater potential for impact when they are based on scientific evidence. Systematic reviews of the scientific evidence can contribute key information about which policies and programs might have the greatest impact when implemented. However, systematic reviews have limitations, such as lack of implementation guidance and contextual information, that can limit the application of knowledge. "Technical packages," developed by knowledge brokers such as the federal government, nonprofit agencies, and academic institutions, have the potential to be an efficient mechanism for making information from systematic reviews actionable. Technical packages provide information about specific evidence-based prevention strategies, along with the estimated costs and impacts, and include accompanying implementation and evaluation guidance to facilitate adoption, implementation, and performance measurement. We describe how systematic reviews can inform the development of technical packages for practitioners, provide examples of technical packages in injury and violence prevention, and explain how enhancing review methods and reporting could facilitate the use and applicability of scientific evidence.
Systematic reviews sponsored by federal departments or agencies play an increasingly important role in disseminating information about evidence-based programs and have become a trusted source of information for administrators and practitioners seeking evidence-based programs to implement. These users vary in their knowledge of evaluation methods and their ability to interpret systematic review findings. They must consider factors beyond program effectiveness when selecting an intervention, such as the relevance of the intervention to their target population, community context, and service delivery system; readiness for replication and scale-up; and the ability of their service delivery system or agency to implement the intervention.
To support user decisions about adopting evidence-based practices, this article discusses current systematic review practices and alternative approaches to synthesizing and presenting findings and providing information.
We reviewed the publicly available information on review methodology and findings for eight federally funded systematic reviews in the labor, education, early childhood, mental health/substance abuse, family support, and criminal justice topic areas.
The eight federally sponsored evidence reviews we examined all provide information that can help users to interpret findings on evidence of effectiveness and to make adoption decisions. However, they are uneven in the amount, accessibility, and consistency of information they report. For all eight reviews, there is room for improvement in supporting users’ adoption decisions through more detailed, accessible, and consistent information in these areas.
Systematic reviews—which identify, assess, and summarize existing research—are usually designed to determine whether research shows that an intervention has evidence of effectiveness, rather than whether an intervention will work under different circumstances. The reviews typically focus on the internal validity of the research and do not consistently incorporate information on external validity into their conclusions.
In this article, we focus on how systematic reviews address external validity.
We conducted a brief scan of 19 systematic reviews and a more in-depth examination of information presented in a systematic review of home visiting research.
We found that many reviews do not provide information on generalizability, such as statistical representativeness, but focus on factors likely to increase heterogeneity (e.g., numbers of studies or settings) and report on context. The latter may help users decide whether the research characteristics—such as sample demographics or settings—are similar to their own. However, we found that differences in reporting, such as which variables are included and how they are measured, make it difficult to summarize across studies or make basic determinations of sample characteristics, such as whether the majority of a sample was unemployed or married.
Evaluation research and systematic reviews would benefit from reporting guidelines for external validity to ensure that key information is reported across studies.
In this article, we examine whether a well-executed comparative interrupted time series (CITS) design can produce valid inferences about the effectiveness of a school-level intervention. This article also explores the trade-off between bias reduction and precision loss across different methods of selecting comparison groups for the CITS design and assesses whether choosing matched comparison schools based only on preintervention test scores is sufficient to produce internally valid impact estimates.
We conduct a validation study of the CITS design based on the federal Reading First program as implemented in one state using results from a regression discontinuity design as a causal benchmark.
Our results contribute to the growing base of evidence regarding the validity of nonexperimental designs. We demonstrate that the CITS design can, in our example, produce internally valid estimates of program impacts when multiple years of preintervention outcome data (test scores in the present case) are available and when a set of reasonable criteria are used to select comparison organizations (schools in the present case).
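For readers unfamiliar with the design, one common CITS specification is sketched below; it is not necessarily the exact model used in the validation study, and the variable names and program year are hypothetical.

```python
# Hedged sketch of one common comparative interrupted time series (CITS)
# specification, not necessarily the article's exact model: deviations of
# treated schools from their own baseline trend are compared with the same
# deviations for comparison schools. Names and the program year are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("school_year_scores.csv")              # one row per school-year
df["post"] = (df["year"] >= 2003).astype(int)           # first program year (illustrative)
df["years_since"] = (df["year"] - 2003).clip(lower=0)

m = smf.ols(
    "score ~ treated * (year + post + years_since)",
    data=df,
).fit(cov_type="cluster", cov_kwds={"groups": df["school_id"]})

# The treated:post and treated:years_since terms carry the impact estimates.
print(m.params.filter(like="treated:"))
```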
Systematic reviews help policy makers and practitioners make sense of research findings in a particular program, policy, or practice area by synthesizing evidence across multiple studies. However, the link between review findings and practical decision-making is rarely one-to-one. Policy makers and practitioners may use systematic review findings to help guide their decisions, but they may also rely on other information sources or personal judgment.
To describe a recent effort by the U.S. federal government to narrow the gap between review findings and practical decision-making. The Teen Pregnancy Prevention (TPP) Evidence Review was launched by the U.S. Department of Health and Human Services (HHS) in 2009 as a systematic review of the TPP literature. HHS has used the review findings to determine eligibility for federal funding for TPP programs, marking one of the first attempts to directly link systematic review findings with federal funding decisions.
The high stakes attached to the review findings required special considerations in designing and conducting the review. To provide a sound basis for federal funding decisions, the review had to meet accepted methodological standards. However, the review team also had to account for practical constraints of the funding legislation and needs of the federal agencies responsible for administering the grant programs. The review team also had to develop a transparent process for both releasing the review findings and updating them over time. Prospective review authors and sponsors must recognize both the strengths and limitations of this approach before applying it in other areas.
This article describes the methods used for a systematic review of the oral health intervention literature for a target population, people with intellectual and developmental disabilities (I/DD); the review spans a broad range of interventions and study types and was conducted with specialized software.
The aim of this article is to demonstrate the review strategy, using the free, online Systematic Review Data Repository (SRDR) tool, for oral health interventions aimed at reducing disparities between people with I/DD and the general population.
Researchers used online title/abstract review (Abstrackr) and data extraction (SRDR) tools to structure the literature review and data extraction. A practicing clinician and an expert methodologist completed the quality review for each study. The data extraction team reported on the experience of using and customizing the SRDR.
Using the SRDR, the team developed four extraction templates for eight key questions and completed extraction on 125 articles.
This report discusses the advantages and disadvantages of using an electronic tool, such as the SRDR, in completing a systematic review in an area of growing research. This review provides valuable insight for researchers who are considering the use of the SRDR.
Intuitionistic fuzzy sets (IFS) represent a methodology for quantifying latent variables in questionnaire analysis through membership and non-membership functions, which are linked by an uncertainty function.
We aim to apply an IFS approach to the problem of students’ satisfaction with university teaching. Such a framework can take into account one source of uncertainty related to items and another related to subjects.
A new technique for IFS analysis is set forth and generalized to a multivariate scenario. Potential advantages of the IFS perspective relative to other, nonfuzzy approaches are outlined.
We apply this method to a national program of university course evaluation, focusing in particular on the outcomes of two master’s programs in Statistics.
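The basic IFS building block referred to above, membership and non-membership degrees linked by a residual hesitation (uncertainty) term, can be sketched as follows; this reflects only the standard definition, not the article's specific multivariate technique.

```python
# Sketch of the standard intuitionistic fuzzy set (IFS) triple: membership,
# non-membership, and the hesitation (uncertainty) margin that links them.
# This illustrates the general definition only, not the article's estimator.
from dataclasses import dataclass


@dataclass
class IFSElement:
    membership: float        # mu, degree of satisfaction
    non_membership: float    # nu, degree of dissatisfaction

    def __post_init__(self):
        if not (0 <= self.membership and 0 <= self.non_membership
                and self.membership + self.non_membership <= 1):
            raise ValueError("IFS requires mu, nu >= 0 and mu + nu <= 1")

    @property
    def hesitation(self) -> float:
        """pi = 1 - mu - nu, the residual uncertainty."""
        return 1 - self.membership - self.non_membership


item = IFSElement(membership=0.6, non_membership=0.3)
print(f"{item.hesitation:.2f}")   # about 0.1 of the judgment remains undecided
```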
Given increasing concerns about the relevance of research to policy and practice, there is growing interest in assessing and enhancing the external validity of randomized trials: determining how useful a given randomized trial is for informing a policy question for a specific target population.
This article highlights recent advances in assessing and enhancing external validity, with a focus on the data needed to make ex post statistical adjustments to enhance the applicability of experimental findings to populations potentially different from their study sample.
We use a case study to illustrate how to generalize treatment effect estimates from a randomized trial sample to a target population, in particular comparing the sample of children in a randomized trial of a supplemental program for Head Start centers (the Research-Based, Developmentally Informed study) to the national population of children eligible for Head Start, as represented in the Head Start Impact Study.
For this case study, common data elements between the trial sample and population were limited, making reliable generalization from the trial sample to the population challenging.
To answer important questions about external validity, more publicly available data are needed. In addition, future studies should make an effort to collect measures similar to those in other data sets. Measure comparability between population data sets and randomized trials that use samples of convenience will greatly enhance the range of research and policy relevant questions that can be answered.
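One common ex post adjustment of the kind discussed here reweights trial participants by the inverse odds of trial membership estimated from covariates available in both data sets; the sketch below is illustrative and uses hypothetical column names rather than the actual study variables.

```python
# Sketch of inverse-odds-of-sample-membership reweighting for generalizing a
# trial estimate to a target population. Column names are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

trial = pd.read_csv("trial_children.csv").assign(in_trial=1)
pop = pd.read_csv("population_children.csv").assign(in_trial=0)
stacked = pd.concat([trial, pop], ignore_index=True)

# Probability of being in the trial, given covariates common to both data sets.
ps = smf.logit("in_trial ~ age + poverty + home_language", data=stacked).fit()
p = ps.predict(trial)

# Inverse-odds weights push the trial sample toward the population profile.
trial["w"] = (1 - p) / p
t = trial[trial["treat"] == 1]
c = trial[trial["treat"] == 0]
pate = (np.average(t["outcome"], weights=t["w"])
        - np.average(c["outcome"], weights=c["w"]))
print(f"reweighted (population) impact estimate: {pate:.3f}")
```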
This article offers important statistics to evaluators planning future evaluations in southeast Africa. There are few to no published statistics describing the variance of southeast African agricultural and household indicators.
We seek to publish the standard deviations, intracluster correlation coefficients (ICCs), and R²s from outcomes and covariates used in a 2014 quasi-experimental evaluation of the Millennium Challenge Corporation’s Mozambique Farmer Income Support Project (FISP) and thus guide researchers in their calculation of design effects relevant to future evaluations in the region.
We summarize data from a roughly 168-item farmer survey conducted in 1,227 households during June–July 2014 in coconut-farming regions of the Zambezia province in Mozambique. We report descriptive statistics, estimates of ICCs, and R²s obtained from linear regression models with cluster random effects. We consider three different cluster definitions.
We report ICCs for a range of different specifications. For the FISP evaluation, the average design effect for education outcomes is 1.16, and the average design effect for consumption-based wealth measures is 1.23. For agriculture-related outcomes, the average design effect is 1.05 for income measures, 1.47 for knowledge, and 1.64 for sales of specific crops.
We offer a detailed picture of the variance structure of agricultural and other outcomes in Mozambique. Our results indicate that the design effects associated with these outcomes are less than the rule-of-thumb design effect (2.0) used in the nutrition studies commonly cited in research on this region.
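The design effects reported above follow from the standard approximation relating the intracluster correlation and the average cluster size; the numbers in the sketch below are illustrative, not the FISP values.

```python
# The standard design-effect approximation behind figures like those above:
# DEFF = 1 + (m - 1) * ICC, where m is the average cluster size.
# The inputs below are illustrative, not values from the evaluation.
def design_effect(avg_cluster_size: float, icc: float) -> float:
    return 1 + (avg_cluster_size - 1) * icc

print(design_effect(avg_cluster_size=12, icc=0.04))   # ~1.44
print(design_effect(avg_cluster_size=12, icc=0.09))   # ~1.99, near the 2.0 rule of thumb
```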
Large-scale randomized experiments are important for determining how policy interventions change average outcomes. Researchers have begun developing methods to improve the external validity of these experiments. One new approach is a balanced sampling method for site selection, which does not require random sampling and takes into account the practicalities of site recruitment including high nonresponse.
The goal of balanced sampling is to develop a strategic sample selection plan that results in a sample that is compositionally similar to a well-defined inference population. To do so, a population frame is created and then divided into strata, which "focuses" recruiters on specific subpopulations. Units within these strata are then ranked, thus identifying similar "replacement" sites that can be recruited when the ideal site refuses to participate in the experiment.
In this article, we consider how a balanced sample strategic site selection method might be implemented in a welfare policy evaluation.
We find that simply developing a population frame can be challenging, with three possible and reasonable options arising in the welfare policy arena. Using relevant study-specific contextual variables, we craft a recruitment plan that considers nonresponse.
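A simplified sketch of the stratify-and-rank step follows, under the assumption that similarity is measured as a standardized Euclidean distance to each stratum's covariate profile; the actual ranking variables and metric would be study specific.

```python
# Simplified sketch of the stratify-and-rank step described above: within
# each stratum, rank candidate sites by distance to the stratum's covariate
# profile so that similar "replacement" sites are identified in advance.
# The distance metric and column names are assumptions for illustration.
import pandas as pd

frame = pd.read_csv("county_frame.csv")   # one row per candidate site
covs = ["caseload", "urbanicity", "poverty_rate"]

z = (frame[covs] - frame[covs].mean()) / frame[covs].std()   # standardize
centroids = z.groupby(frame["stratum"]).transform("mean")
frame["distance"] = ((z - centroids) ** 2).sum(axis=1) ** 0.5
frame["rank_in_stratum"] = frame.groupby("stratum")["distance"].rank()

# Recruiters work down each stratum's list; the next-ranked site is the
# designated replacement if a higher-ranked site declines.
print(frame.sort_values(["stratum", "rank_in_stratum"]).head())
```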
Violent drug markets are not as prominent as they once were in the United States, but they still exist and are associated with significant crime and lower quality of life. The drug market intervention (DMI) is an innovative strategy that uses focused deterrence, community engagement, and incapacitation to reduce crime and disorder associated with these markets. Although studies show that DMI can reduce crime and overt drug activity, one perspective is prominently missing from these evaluations: those who purchase drugs.
This study explores the use of respondent-driven sampling (RDS)—a statistical sampling method—to approximate a representative sample of drug users who purchased drugs in a targeted DMI market to gain insight into the effect of a DMI on market dynamics.
Using RDS, we recruited individuals who reported hard drug use (crack or powder cocaine, heroin, methamphetamine, or illicit use of prescription opioids) in the last month to participate in a survey. The main survey asked about drug use, drug purchasing, and drug market activity before and after the DMI; a secondary survey asked about network characteristics and recruitment.
Our sample of 212 respondents met key RDS assumptions, suggesting that the characteristics of our weighted sample approximate the characteristics of the drug user network. The weighted estimates for market purchasers are generally valid for inferences about the aggregate population of customers, but a larger sample size is needed to make stronger inferences about the effects of a DMI on drug market activity.
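Weighting of this kind is commonly done in the style of the RDS-II (Volz-Heckathorn) estimator, in which respondents are weighted inversely to their reported network degree; the sketch below is a generic illustration with hypothetical column names, not necessarily the estimator used in this study.

```python
# Sketch of an RDS-II (Volz-Heckathorn) style estimate: respondents are
# weighted inversely to their reported network degree because well-connected
# users are over-sampled by referral chains. Column names are hypothetical.
import numpy as np
import pandas as pd

rds = pd.read_csv("rds_survey.csv")      # one row per recruited respondent
w = 1.0 / rds["network_degree"]

share_post_dmi = np.average(rds["bought_in_market_post"], weights=w)
print(f"weighted share still purchasing in the target market: {share_post_dmi:.2%}")
```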
Policy makers and researchers are frequently interested in understanding how effective a particular intervention may be for a specific population. One approach is to assess the degree of similarity between the sample in an experiment and the population. Another approach is to combine information from the experiment and the population to estimate the population average treatment effect (PATE).
Several methods for assessing the similarity between a sample and a population currently exist, as do methods for estimating the PATE. In this article, we investigate the properties of six of these methods and statistics at the small sample sizes common in education research (i.e., 10–70 sites), evaluating the utility of rules of thumb developed from observational studies in the generalization case.
In small random samples, large differences between the sample and the population can arise simply by chance, and many of the statistics commonly used in generalization are a function of both the sample size and the number of covariates being compared. The rules of thumb developed in observational studies (which are commonly applied in generalization) are much too conservative given the small sample sizes found in generalization.
These findings imply that sharp inferences to large populations from small experiments are difficult even with probability sampling. Features of random samples should be kept in mind when evaluating the extent to which results from experiments conducted on nonrandom samples might generalize.
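A minimal sketch of one family of the statistics discussed above, covariate-by-covariate standardized mean differences between a trial sample and its inference population, is shown below with hypothetical column names.

```python
# Sketch: covariate-by-covariate standardized mean differences (SMDs)
# between an experimental sample and its inference population.
# Column names are hypothetical.
import pandas as pd

sample = pd.read_csv("trial_sites.csv")
population = pd.read_csv("population_sites.csv")
covs = ["pct_frl", "enrollment", "pct_ell", "urban"]

smd = (sample[covs].mean() - population[covs].mean()) / population[covs].std()
print(smd.abs().sort_values(ascending=False))
# With only 10-70 sites, sizable SMDs can arise by chance alone, which is why
# observational-study rules of thumb (e.g., |SMD| < 0.25) can mislead here.
```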
There is an increased focus on randomized trials for proximal behavioral outcomes in early childhood research. However, planning sample sizes for such designs requires extant information on the size of effect, variance decomposition, and effectiveness of covariates.
The purpose of this article is to use a recent, large, representative sample of kindergartners from the Early Childhood Longitudinal Study to estimate design parameters for use in planning cluster randomized trials. A secondary objective is to compare the math and reading results with those from the previous kindergarten cohort of 1999.
For each measure, fall–spring gains in effect size units are calculated. In addition, multilevel models are fit to estimate variance components, which are used to calculate intraclass correlations (ICCs) and R² statistics. The implications of the reported parameters are summarized in tables of the school sample sizes required to detect small effects.
The outcomes include information about student scores regarding learning behaviors, general behaviors, and academic abilities.
Aside from math and reading, there were small gains in these measures from fall to spring, with effect sizes between about .1 and .2. In addition, the nonacademic ICCs are smaller than the academic ICCs but are still nontrivial. Use of a pretest covariate is generally effective in reducing the required sample size in power analyses. The ICCs for math and reading are smaller for the current sample than for the 1999 sample.
Despite rapid advances in research on the evaluation of complex interventions, debate on evaluation methods and approaches still mainly revolves around the conventional and mostly outdated positivist–constructivist dichotomy. The lack of a clear conceptual and theoretical framework from which to choose appropriate evaluation approaches and methods means that approaches are often misused by both researchers and practitioners.
Using three case studies, this article shows how different approaches should and should not be used in practice according to levels of nonlinearity. Both the theoretical development and the case studies presented in this article rely heavily on interviews conducted by the author with program management and staff, evaluation managers, heads of evaluation units, and evaluators in several countries across two continents, along with a quantitative survey.
This article expands the classic discussion on evaluation approaches, adapting it to current managerial demands, increased complexity, and newly developed methodologies. It suggests an operational tool for categorizing evaluations and then matching evaluation approaches to the circumstances and the evaluation objectives.
The findings suggest that approaches that are not congruent with levels of nonlinearity may hinder attempts to accurately evaluate results, leaving evaluation commissioners dissatisfied with the evaluation process and the methods applied. In contrast, analyzing the nonlinear and structural elements of complexity separately allows an extended categorization of evaluation approaches to be matched to the nonlinearity of the programs being evaluated.
In this exploratory study, we wanted to know how evaluators differentiate collaborative approaches to evaluation (CAE) perceived to be successful from those perceived to be less-than-successful.
In an online questionnaire survey, we obtained 320 responses from evaluators who practice CAE (i.e., evaluations on which program stakeholders coproduce evaluation knowledge). Respondents identified two specific CAE projects from their own experience, one they believed to be "highly successful" and another they considered "far less successful than [they] had hoped," and offered their comments and reflections about them. They rated the respective evaluations on 5-point opinion and frequency scales covering (i) antecedent stakeholder perspectives, (ii) the purposes and justifications for collaborative inquiry, and (iii) the form such inquiry takes.
The results showed that successful evaluations, relative to their less-than-successful counterparts, tended to reflect higher levels of agreement among stakeholders about the focal program; higher intentionality estimates of evaluation justification and espoused purposes; and wider ranges and deeper levels of stakeholder participation. No differences were found for control of technical decision-making, and evaluators tended to lead evaluation decision making, regardless of success condition.
The results are discussed in terms of implications for ongoing research on CAE.
Variations in local context bedevil the assessment of external validity: the ability to generalize about effects of treatments. For evaluation, the challenges of assessing external validity are intimately tied to the translation and spread of evidence-based interventions. This makes external validity a question for decision makers, who need to determine whether to endorse, fund, or adopt interventions that were found to be effective and how to ensure high quality once they spread.
To present the rationale for using theory to assess external validity and the value of more systematic interaction of theory and practice.
We review advances in external validity, program theory, practitioner expertise, and local adaptation. Examples are provided for program theory, its adaptation to diverse contexts, and generalizing to contexts that have not yet been studied. The often critical role of practitioner experience is illustrated in these examples. Work is described that the Robert Wood Johnson Foundation is supporting to study treatment variation and context more systematically.
Researchers and developers generally see a limited range of contexts in which the intervention is implemented. Individual practitioners see a different and often a wider range of contexts, albeit not a systematic sample. Organized and taken together, however, practitioner experiences can inform external validity by challenging the developers and researchers to consider a wider range of contexts. Researchers have developed a variety of ways to adapt interventions in light of such challenges.
In systematic programs of inquiry, as opposed to individual studies, the problems of context can be better addressed. Evaluators have advocated an interaction of theory and practice for many years, but the process can be made more systematic and useful. Systematic interaction can set priorities for assessment of external validity by examining the prevalence and importance of context features and treatment variations. Practitioner interaction with researchers and developers can assist in sharpening program theory, reducing uncertainty about treatment variations that are consistent or inconsistent with the theory, inductively ruling out the ones that are harmful or irrelevant, and helping set priorities for more rigorous study of context and treatment variation.
Government and private funders increasingly require social service providers to adopt program models deemed "evidence based," particularly as defined by evidence-based program registries, such as What Works Clearinghouse and National Registry of Evidence-Based Programs and Practices. These registries summarize the evidence about programs’ effectiveness, giving near-exclusive priority to evidence from experimental-design evaluations. The registries’ goal is to aid decision making about program replication, but critics suspect the emphasis on evidence from experimental-design evaluations, while ensuring strong internal validity, may inadvertently undermine that goal, which requires strong external validity as well.
The objective of this study is to determine the extent to which the registries’ reports provide information about context-specific program implementation factors that affect program outcomes and would thus support decision making about program replication and adaptation.
A research-derived rubric was used to rate the extent of context-specific reporting in the population of seven major registries’ evidence summaries (N = 55) for youth development programs.
Nearly all (91%) of the reports provide context-specific information about program participants, but far fewer provide context-specific information about implementation fidelity and other variations in program implementation (55%), the program’s environment (37%), costs (27%), quality assurance measures (22%), implementing agencies (19%), or staff (15%).
Evidence-based program registries provide insufficient information to guide context-sensitive decision making about program replication and adaptation. Registries should supplement their evidence base with nonexperimental evaluations and revise their methodological screens and synthesis-writing protocols to prioritize reporting—by both evaluators and the registries themselves—of context-specific implementation factors that affect program outcomes.
Italy is a country with low math achievement, especially in the Southern regions. Moreover, national student assessments are recent, and rigorous policy evaluation is lacking. This study presents the results of one of the first randomized controlled trials implemented in Italian schools to measure the effects of a professional development (PD) program for teachers on student math achievement. The program was already at scale when it was evaluated.
Assessing the effects of a PD program for math teachers on their students’ achievement and making suggestions for future policy evaluations.
A large-scale clustered randomized controlled trial was conducted. It involved 175 lower secondary schools (sixth to eighth grade) in four of the lowest performing Italian regions. Alongside national standardized math assessments, the project collected a wide range of information.
Math in lower secondary schools.
Math achievement as measured by standardized tests provided by the National Education Assessment Institute (Istituto Nazionale per la Valutazione del Sistema di Istruzione e Formazione); teacher and student practices and attitudes collected through questionnaires.
Findings suggest that the program had no significant impact on math scores during the first year (when the program was held). Nonetheless, some heterogeneity was detected, as the treatment does seem "to work" for middle-aged teachers. Moreover, effects on teaching practices and student attitudes emerged.
Some effects attributable to the intervention were detected. Moreover, this project shows that a rigorous approach to evaluation is feasible even in a context that lacks attention to evidence-based policies, such as the Italian school system.
Therapeutic emotion work is performed by health care providers as they manage their own feelings as well as those of colleagues and patients as part of efforts to improve the physical and psychosocial health outcomes of patients. It has yet to be examined within the context of traumatic brain injury rehabilitation.
To evaluate the impact of a research-based theater intervention on emotion work practices of neurorehabilitation staff.
Data were collected at baseline and at 3 and 12 months postintervention in the inpatient neurorehabilitation units of two rehabilitation hospitals in central urban Canada.
Participants (N = 33) were recruited from nursing, psychology, allied health, recreational therapy, and chaplaincy.
Naturalistic observations (N = 204.5 hr) of a range of structured and unstructured activities in public and private areas, and semistructured interviews (N = 87) were conducted.
Preintervention analysis indicated emotion work practices were characterized by stringent self-management of empathy, suppression of client grief, adeptness with client anger, and discomfort with reactions of family and spouses. Postintervention analysis indicated significant staff changes in a relationality orientation, specifically improvements in outreach to homosexual and heterosexual family care partners, and support for sexual orientation and intimacy expression. No improvements were demonstrated in grief support.
Emotion work has yet to be the focus of initiatives to improve neurorehabilitative care. Our findings suggest the dramatic arts are well positioned to improve therapeutic emotion work and effect cultures of best practice. Recommendations are made for interprofessional educational initiatives to improve responses to client grief and potential intimate partner violence.
Cluster-randomized experiments that assign intact groups such as schools or school districts to treatment conditions are increasingly common in educational research. Such experiments are inherently multilevel designs whose sensitivity (statistical power and precision of estimates) depends on the variance decomposition across levels. This variance decomposition is usually summarized by the intraclass correlation (ICC) structure and, if covariates are used, the effectiveness of the covariates in explaining variation at each level of the design.
This article provides a compilation of school- and district-level ICC values of academic achievement and related covariate effectiveness based on state longitudinal data systems. These values are designed to be used for planning group-randomized experiments in education. The use of these values to compute statistical power and plan two- and three-level group-randomized experiments is illustrated.
We fit several hierarchical linear models to state data by grade and subject to estimate ICCs and covariate effectiveness. The total sample size is over 4.8 million students. We then compare our average of state estimates with the national work by Hedges and Hedberg.
Prior research has focused primarily on empirically estimating design parameters for cluster-randomized trials (CRTs) of mathematics and reading achievement. Little is known about how design parameters compare across other educational outcomes.
This article presents empirical estimates of design parameters that can be used to appropriately power CRTs in science education and compares them to estimates using mathematics and reading.
Estimates of intraclass correlations (ICCs) are computed for unconditional two-level (students in schools) and three-level (students in schools in districts) hierarchical linear models of science achievement. Relevant student- and school-level pretest and demographic covariates are then considered, and estimates of variance explained are computed.
Five consecutive years of Texas student-level data for Grades 5, 8, 10, and 11.
Science, mathematics, and reading achievement raw scores as measured by the Texas Assessment of Knowledge and Skills.
Findings show that ICCs in science range from .172 to .196 across grades and are generally higher than comparable statistics in mathematics, .163–.172, and reading, .099–.156. When available, a 1-year lagged student-level science pretest explains the most variability in the outcome. The 1-year lagged school-level science pretest is the best alternative in the absence of a 1-year lagged student-level science pretest.
Science educational researchers should utilize design parameters derived from science achievement outcomes.
Today, the ability of deoxyribonucleic acid (DNA) evidence to place persons at crime scenes with near certainty is broadly accepted by criminal investigators, courts, policy makers, and the public. However, the public safety benefits of investments in DNA databases are largely unknown and research attempting to quantify these benefits is only gradually emerging. Given the inherent difficulty in randomly assigning offenders to treatment and comparison groups for the purpose of inferring specific deterrence and probative effects (PREs) of DNA databases, this study developed an alternate strategy for extracting these effects from transactional data.
Reoffending patterns of a large cohort of offenders released from the Florida Department of Corrections custody between 1996 and 2004 were analyzed across a range of criminal offense categories. First, an identification strategy using multiple clock models was developed that linked the two simultaneous effects of DNA databases to different clocks measuring the same events. Then, a semiparametric approach was developed for estimating the models.
The estimation models yielded mixed results. Small deterrent effects (2–3% reductions in recidivism risk attributable to deterrence) were found only for robbery and burglary. However, strong PREs (20–30% increases in recidivism risk) were uncovered for most offense categories.
The probative and deterrent effects of DNA databases can be elucidated through innovative semiparametric models.
Few methods have been defined for evaluating the individual and collective impacts of academic research centers. In this project, with input from injury center directors, we systematically defined indicators to assess the progress and contributions of individual Injury Control Research Centers (ICRCs) and, ultimately, to monitor progress of the overall injury center program.
We used several methods to derive a list of recommended priority and supplemental indicators. These included a review of the published literature, telephone interviews with selected federal agency staff, an e-mail survey of injury center directors, an e-mail survey of staff at the Centers for Disease Control and Prevention (CDC), a two-stage Delphi process conducted by e-mail, and an in-person focus group with injury center directors. We derived the final indicators from an analysis of ratings of potential indicators by center directors and CDC staff. We also examined qualitative responses to open-ended items that addressed conceptual and implementation issues.
All currently funded ICRCs participated in at least one part of the process, resulting in a list of 27 primary indicators (some with subcomponents), 31 supplemental indicators, and multiple suggestions for using the indicators.
Our results support an approach that combines standardized definitions and quantifiable indicators with qualitative reporting, which allows consideration of center distinctions and priorities. The center directors urged caution in using the indicators, given funding constraints and each center's unique institutional environment. Although our work focused on injury research centers, we suggest that these indicators may also be useful to academic research centers of other types.
Discussions of the economics of scholarly communication are usually devoted to Open Access, rising journal prices, publisher profits, and boycotts. Those discussions ignore what appears to be a much more important development in this market. Publishers, through the oft-reviled Big Deal packages, are providing much greater and more egalitarian access to the journal literature, an approximation to true Open Access. In the process, they are also marginalizing libraries and obtaining a greater share of the resources going into scholarly communication. This is enabling a continuation of publisher profits as well as of what for decades has been called "unsustainable journal price escalation." It is also inhibiting the spread of Open Access and potentially leading to an oligopoly of publishers controlling distribution through large-scale licensing. The Big Deal practices are worth studying for several general reasons. The degree to which publishers succeed in diminishing the role of libraries may be an indicator of the degree and speed at which universities transform themselves. More importantly, these Big Deals appear to point the way to the future of the whole economy, where progress is characterized by declining privacy, increasing price discrimination, increasing opaqueness in pricing, increasing reliance on low-paid or unpaid work of others for profits, and business models that depend on customer inertia.
There is a long scholarly debate on the trade-off between research and teaching in various fields, but relatively little study of the phenomenon in law. This analysis examines the relationship between the two core academic activities at one particular school, the University of Chicago Law School, which is considered one of the most productive in legal academia.
We measure scholarly productivity by the total number of publications by each professor in each year, and we approximate teaching performance with course loads and average scores on student evaluations for each course. In OLS regressions, we estimate scholarly output as a function of teaching loads, faculty characteristics, and other controls. We also estimate teaching evaluation scores as a function of scholarly productivity, fixed effects for year and course subject, and faculty characteristics.
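A minimal sketch of the two specifications is given below, assuming a hypothetical professor-by-year panel with illustrative column names (publications, courses_taught, eval_score, and so on); it is not the study's actual code or variable list.

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical professor-by-year panel; all column names are placeholders.
panel = pd.read_csv("faculty_panel.csv")

# (1) Scholarly output as a function of teaching load, faculty
#     characteristics, and other controls.
prod = smf.ols(
    "publications ~ courses_taught + years_since_degree + tenured",
    data=panel,
).fit(cov_type="HC1")

# (2) Teaching evaluations as a function of scholarly productivity, with
#     year and course-subject fixed effects; the squared term allows the
#     "rises at a decreasing rate" pattern described in the results.
evals = smf.ols(
    "eval_score ~ publications + I(publications ** 2)"
    " + years_since_degree + C(year) + C(subject)",
    data=panel,
).fit(cov_type="HC1")

print(prod.summary())
print(evals.summary())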
Net of other factors, we find that, under some specifications, research and teaching are positively correlated. In particular, we find that students' perceptions of teaching quality rise with the total amount of scholarship, but at a decreasing rate. We also find that certain personal characteristics correlate with productivity.
The recent debate on the mission of American law schools has hinged on the assumption that a trade-off exists between teaching and research, and this article’s analysis, although limited in various ways, casts some doubt on that assumption.
Many psychological processes unfold over time, necessitating longitudinal research designs. Longitudinal research poses a host of methodological challenges, foremost of which is participant attrition. Building on Dillman's work, we review how social influence and relationship research informs retention strategies in longitudinal studies.
Objective: We introduce the tailored panel management (TPM) approach, which is designed to establish communal norms that increase commitment to a longitudinal study; this commitment, in turn, increases response rates and buffers against attrition. Specifically, we discuss practices regarding compensation, communication, consistency, and credibility that build longer term commitment to panel participation.
Research design: Throughout the article, we describe how TPM is being used in a national longitudinal study of undergraduate minority science students. TheScienceStudy is a continuing panel with 12 waves of data collected across 6 academic years and response rates ranging from 70% to 92%. Although more than 90% of participants have left or graduated from their undergraduate degree programs, this highly mobile group remains engaged in the study. TheScienceStudy has usable longitudinal data from 96% of the original panel.
Conclusion: This article combines social psychological theory, current best practice, and a detailed case study to illustrate the TPM approach to longitudinal data collection. The approach provides guidance for other longitudinal researchers and advocates for empirical research into longitudinal research methodologies.
Background: Authorship and inventorship are the key attribution rights that contribute to a scientist's reputation and professional achievement. This article discusses the concepts of coinventorship and coauthorship in the legal and sociological literature, as well as in journals' publication guidelines and technology transfer offices' recommendations. It also discusses the relative importance of social and legal norms in the allocation of scientific credit.
Method: The article critically reviews the literature on inventorship and authorship in academic science and derives some policy implications for the institutional mechanisms that allocate scientific credit. It reports and assesses recent empirical evidence on the importance of social norms for the attribution of inventorship and authorship in teams of scientists. Finally, it discusses those norms from a social welfare perspective.
Result: The social norms that regulate the distribution of authorship and inventorship do not reflect exclusively the relative contribution of each team member but also the members' relative seniority or status. In the case of inventorship, such social norms appear to be as important as the legal norms whose observance is often invoked by technology transfer officers.
Conclusion: Authorship and inventorship appear to be obsolete in the sense that they do not capture the increasing division of labor and responsibility typical of contemporary scientific research teams. The informative value of both authorship and inventorship attributions may therefore be much more limited than recent evaluation exercises assume.