Integrated speaking test tasks (integrated tasks) provide reading and/or listening input to serve as the basis for test-takers to formulate their oral responses. This study examined the influence of topical knowledge on integrated speaking test performance and compared independent speaking test performance and integrated speaking test performance in terms of how each was related to topical knowledge. The researchers derived four integrated tasks from TOEFL iBT preparation materials, developed four independent speaking test tasks (independent tasks), and validated four topical knowledge tests (TKTs) on a group of 421 EFL learners. For the main study, they invited another 352 students to respond to the TKTs and to perform two independent tasks and two integrated tasks. Half of the test takers took the independent tasks and integrated tasks on one topic combination while the other half took tasks on another topic combination. Data analysis, drawing on a series of path analyses, led to two major findings. First, topical knowledge significantly impacted integrated speaking test performance in both topic combinations. Second, the impact of topical knowledge on the two types of speaking test performances was topic dependent. Implications are proposed in light of these findings.
This study investigates test-takers’ processing while completing banked gap-fill tasks, designed to test reading proficiency, in order to test theoretically based expectations about the variation in cognitive processes of test-takers across levels of performance. Twenty-eight test-takers’ eye traces on 24 banked gap-fill items (on six tasks) were analysed according to seven online eye-tracking measures representing overall, text and task processing. Variation in processing was related to test-takers’ level of performance on the tasks overall. In particular, as hypothesized, lower-scoring students exerted more cognitive effort on local reading and lower-level cognitive processing in contrast to test-takers who attained higher scores. The findings of different cognitive processes associated with variation in scores illuminate the construct measured by banked gap-fill items, and therefore have implications for test design and the validity of score interpretations.
This paper presents an approach to standard setting that combines the prototype group method (PGM; Eckes, 2012) with a receiver operating characteristic (ROC) analysis. The combined PGM–ROC approach is applied to setting cut scores on a placement test of English as a foreign language (EFL). To implement the PGM, experts first named learners whom they considered to be typical of each of five levels of language proficiency as specified by the Common European Framework of Reference for Languages (CEFR; Council of Europe, 2001). Out of a total of 3,310 examinees taking different trial versions of the placement test, 470 learner prototypes were identified. For this set of prototypes, Rasch model estimates of EFL proficiency served as input to a series of ROC analyses, one for each pair of adjacent proficiency levels. Cut scores were derived using the Youden index that maximizes the overall rate of correct classification and minimizes the overall rate of misclassification. Findings confirmed that this method allows for the setting of cut scores that show a high level of classification accuracy in terms of the correspondence with expert categorizations of examinee prototypes. In addition, the ROC-based cut scores were associated with higher classification accuracy than cut scores derived from a logistic regression analysis of the same data. Potential further uses and implications of the PGM–ROC approach in the context of language testing and assessment are discussed.
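To illustrate the ROC step described above, the following minimal sketch locates a cut score between two adjacent proficiency levels by maximizing the Youden index (J = sensitivity + specificity − 1). The simulated Rasch estimates, level labels, and sample sizes are hypothetical placeholders, not the study's prototype data.

```python
# Sketch: locating a cut score between two adjacent proficiency levels with an
# ROC analysis and the Youden index. All data below are simulated placeholders.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
# Rasch proficiency estimates (logits) for expert-nominated prototypes of two adjacent levels
theta_lower = rng.normal(loc=-0.5, scale=0.8, size=120)   # prototypes judged to be at the lower level
theta_upper = rng.normal(loc=1.0, scale=0.8, size=110)    # prototypes judged to be at the upper level

scores = np.concatenate([theta_lower, theta_upper])
labels = np.concatenate([np.zeros_like(theta_lower), np.ones_like(theta_upper)])  # 1 = upper level

fpr, tpr, thresholds = roc_curve(labels, scores)
youden_j = tpr - fpr                       # J = sensitivity + specificity - 1
cut = thresholds[np.argmax(youden_j)]      # threshold maximizing correct classification
print(f"cut score (logits): {cut:.2f}, J = {youden_j.max():.2f}")
```

Running one such analysis per pair of adjacent levels yields the full set of level boundaries on the Rasch scale.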
This study, conducted by two researchers who were also multiple-choice question (MCQ) test item writers at a private English-medium university in an English as a foreign language (EFL) context, was designed to shed light on the factors that influence test-takers’ perceptions of difficulty in English for academic purposes (EAP) vocabulary, with the aim of improving test writers’ judgments on difficulty. The research consisted of a survey of 588 test-takers, followed by a focus group interview, aimed at investigating the relative influences of test-taker factors and word factors on difficulty perceptions. Results reveal a complex interaction of factors influencing perceived difficulty dominated by the educational, and particularly, the social context. Factors traditionally associated with vocabulary difficulty, such as abstractness and word length, appeared to have little influence. The researchers concluded that rather than basing their intuitions regarding vocabulary difficulty on language-lesson input or surface features of words, EAP vocabulary test writers need a clear understanding of test-takers’ difficulty perceptions, and how these emerge from interactions between academic, social and linguistic factors. As a basis for EAP vocabulary item writer training, four main implications are drawn, related to test-takers’ social and educational background, field of study, the features of academic words, and the test itself.
The present study investigates integrated writing assessment performances with regard to the linguistic features of complexity, accuracy, and fluency (CAF). Given the increasing presence of integrated tasks in large-scale and classroom assessments, validity evidence is needed for the claim that their scores reflect targeted language abilities. Four hundred and eighty integrated writing essays from the Internet-based Test of English as a Foreign Language (TOEFL) were analyzed using CAF measures with correlation and regression to determine how well these linguistic features predict scores on reading–listening–writing tasks. The results indicate a cumulative impact on scores from these three features. Fluency was found to be the strongest predictor of integrated writing scores. Analysis of error type revealed that morphological errors contributed more to the regression statistic than syntactic or lexical errors. Complexity was significant but had the lowest correlation to score across all variables.
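As a sketch of how CAF measures can be related to integrated writing scores with regression, the snippet below enters fluency, accuracy, and complexity cumulatively and compares the variance explained at each step. The file and column names are assumptions for illustration only; the study's actual measures and modeling choices may differ.

```python
# Hedged illustration: cumulative (hierarchical) regression of integrated-writing
# scores on CAF measures. Data file and column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("integrated_essays.csv")   # assumed columns: score, complexity, accuracy, fluency

m1 = smf.ols("score ~ fluency", data=df).fit()
m2 = smf.ols("score ~ fluency + accuracy", data=df).fit()
m3 = smf.ols("score ~ fluency + accuracy + complexity", data=df).fit()

for label, model in [("fluency only", m1), ("+ accuracy", m2), ("+ complexity", m3)]:
    print(f"{label:14s} R^2 = {model.rsquared:.3f}")
```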
The importance of functional adequacy as an essential component of L2 proficiency has been observed by several authors (Pallotti, 2009; De Jong, Steinel, Florijn, Schoonen, & Hulstijn, 2012a, b). The rationale underlying the present study is that the assessment of writing proficiency in L2 is not fully possible without taking into account the functional dimension of L2 production. In the paper a rating scale for functional adequacy is proposed, containing four dimensions: (1) content, (2) task requirements, (3) comprehensibility, and (4) coherence and cohesion. The scale is an adaptation of the global rating scale of functional adequacy employed with expert raters in earlier studies (Kuiken, Vedder, & Gilabert, 2010; Kuiken & Vedder, 2014). The new rating scale for functional adequacy was then tried out by a group of non-expert raters, who assessed the functional adequacy of a corpus of argumentative texts written by native and non-native writers of Dutch and Italian. The results showed that functional adequacy in L2 writing can be reliably measured by a rating scale comprising four different subscales.
English language proficiency or English language development (ELP/D) standards guide how content-specific instruction and assessment are practiced by teachers and how English learners (ELs) at varying levels of English proficiency can perform on grade-level-specific academic standards in K–12 US schools. With the transition from the state-developed Indiana ELP/D standards adopted in 2003 to the World Class Instructional Design and Assessment (WIDA) English language development standards adopted in 2013, this paper explores Indiana’s ELP/D standards’ 14-year history and how its EL/Bilingual district leaders have interpreted and implemented these two sets of standards between the school years 2002–03 and 2015–16.
Using critical leadership and feminism within a narrative design, EL/Bilingual leaders illuminate distinct leadership logics as they mediate and implement ELP/D standards in their districts. Academic content standards are regarded with greater privilege, complicating how EL/Bilingual leaders can position ELP/D standards. Restricted by this standards hierarchy, EL/Bilingual leaders found limited educational venues in which to discuss the performance-based nature of ELP/D standards. Implications for assessment, policy and leadership preparation are discussed.
In language programs, it is crucial to place incoming students into appropriate levels to ensure that course curriculum and materials are well targeted to their learning needs. Deciding how and where to set cutscores on placement tests is thus of central importance to programs, but previous studies in educational measurement disagree as to which standard-setting method (or methods) should be employed in different contexts. Furthermore, the results of different standard-setting methods rarely converge on a single set of cutscores, and standard-setting procedures within language program placement testing contexts specifically have been relatively understudied. This study aims to compare and evaluate three different standard-setting procedures – the Bookmark method (a test-centered approach), the Borderline group method (an examinee-centered approach), and cluster analysis (a statistical approach) – and to discuss the ways in which they do and do not provide valid and reliable information regarding placement cut-offs for an intensive English program at a large Midwestern university in the USA. As predicted, the cutscores derived from the different methods did not converge on a single solution, necessitating a means of judging between divergent results. We discuss methods of evaluating cutscores, explicate the advantages and limitations associated with each standard-setting method, recommend against using statistical approaches for most English for academic purposes (EAP) placement contexts, and demonstrate how specific psychometric qualities of the exam can affect the results obtained using those methods. Recommendations for standard setting, exam development, and cutscore use are discussed.
In this article we present a new method for estimating children’s total vocabulary size based on a language corpus in German. We drew a virtual sample of different lexicon sizes from a corpus and let the virtual sample "take" a vocabulary test by comparing whether the items were included in the virtual lexicons or not. This enabled us to identify the relation between test performance and total lexicon size. We then applied this relation to the test results of a real sample of children (grades 1–8, aged 6 to 14) and young adults (aged 18 to 25) and estimated their total vocabulary sizes. Average absolute vocabulary sizes ranged from 5,900 lemmas in first grade to 73,000 for adults, with significant increases between adjacent grade levels except from first to second grade. Our analyses also allowed us to observe parts of speech and morphological development. Results thus shed light on the course of vocabulary development during primary school.
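The core simulation idea can be sketched as follows: draw virtual lexicons of known size from a corpus, let them "take" the test, and tabulate the size-to-score relation, which is then inverted to estimate a real child's vocabulary size from a test score. The toy code below samples lexicons uniformly from a hypothetical corpus; the study's actual corpus and sampling scheme are not reproduced here.

```python
# Toy sketch of corpus-based vocabulary-size estimation. Corpus size, item count,
# and uniform sampling are simplifying assumptions, not the study's procedure.
import numpy as np

rng = np.random.default_rng(1)
n_corpus_lemmas = 250_000                                             # hypothetical corpus lexicon
test_item_ids = rng.choice(n_corpus_lemmas, size=40, replace=False)   # 40-item vocabulary test

def expected_score(lexicon_size, n_virtual=50):
    """Mean proportion of test items covered by random virtual lexicons of a given size."""
    props = []
    for _ in range(n_virtual):
        lexicon = rng.choice(n_corpus_lemmas, size=lexicon_size, replace=False)
        props.append(np.isin(test_item_ids, lexicon).mean())
    return float(np.mean(props))

# Inverting this size-to-score curve maps a real child's test score onto an
# estimated total vocabulary size.
for size in (5_000, 20_000, 50_000, 80_000):
    print(size, round(expected_score(size), 3))
```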
The present study examines the relative importance of attributes within and across items by applying four cognitive diagnostic assessment models. Drawing on the models’ capacity to indicate inter-attribute relationships that reflect examinees’ response behaviors, the study analyzes scored test-taker responses to four forms of the TOEFL reading and listening comprehension sections. The results are interpreted to determine whether the subskills defined in each subtest contribute equally within an item and across items. The study also discusses whether the empirical results support the claims for compensatory processing of L2 comprehension skills and presents practical implications of the findings. The article concludes with the limitations of the study and suggestions for future research.
Sparse-rated data are common in operational performance-based language tests, as an inevitable result of assigning examinee responses to a fraction of available raters. The current study investigates the precision of two generalizability-theory methods (i.e., the rating method and the subdividing method) specifically designed to accommodate the technical complexity involved in estimating score reliability from sparse-rated data. Examining the estimation precision of reliability is of great importance because the utility of any performance-based language test depends on its reliability. Results suggest that when some raters are expected to have greater score variability than other raters (e.g., a mixture of novice and experienced raters being deployed in a rating session), the subdividing method is recommended as it yields more precise reliability estimates. When all raters are expected to exhibit similar variability in their scoring, both the rating and subdividing methods are equally precise in estimating score reliability, and the rating method is recommended for operational use, as it is easier to implement in practice. Informed by these methodological results, the current study also demonstrates a step-by-step analysis for investigating the score reliability from sparse-rated data taken from a large-scale English speaking proficiency test. Implications for operational performance-based language tests are discussed.
In this study, two alternative theoretical models were compared, in order to analyze which of them best explains primary school children’s text comprehension skills. The first one was based on the distinction between two types of answers requested by the comprehension test: local or global. The second model involved texts’ input modality: written or oral. For this purpose, a new instrument that assesses listening and reading comprehension skills (ALCE battery; Bonifacci et al., 2014) was administered to a large sample of 1,658 Italian primary school students. The two models were tested separately for the five grades (first to fifth grade). Furthermore, a third model, which included both the types of answers and the texts’ input modality, was considered. Results of confirmatory factor analyses suggested that all models are adequate, but the second one (reading vs. listening) provided a better fit. The major role of the distinction between input modalities is discussed in relation to individual differences and developmental trajectories in text comprehension. Theoretical and clinical implications are discussed.
The aim of this study was to develop, for the benefit of both test takers and test score users, enhanced TOEFL ITP® test score reports that go beyond the simple numerical scores that are currently reported. To do so, we applied traditional scale anchoring (proficiency scaling) to item difficulty data in order to develop performance descriptors for multiple levels of each of the three sections of the TOEFL ITP. A (novel) constraint was that these levels should correspond to those established in an earlier study that mapped (i.e., aligned) TOEFL ITP scores to a widely accepted framework for describing language proficiency – the Common European Framework of Reference (CEFR). The data used in the present study came from administrations of five current operational forms of the recently revised TOEFL ITP test. The outcome of the effort is a set of performance descriptors for each of several levels of TOEFL ITP scores for each of the three sections of the test. The contribution, we believe, constitutes (1) an enhancement of the interpretation of scores for one widely used assessment of English language proficiency and (2) a modest contribution to the literature on developing proficiency descriptors – an approach that combines elements of both scale anchoring and test score mapping.
This paper reports a post-hoc analysis of the influence of the lexical difficulty of cue sentences on performance in an elicited imitation (EI) task to assess oral production skills for 645 child L2 English learners in instructional settings. This formed part of a large-scale investigation into the effectiveness of foreign language teaching in Polish primary schools. EI item design and scoring, IRT, and post-hoc lexical analysis of items are described in detail. The research aim was to determine how much the lexical complexity of items (lexical density, morphological complexity, function word density, and sentence length) contributed to item difficulty and scores. Sentence length predicted item difficulty better when measured as the number of words than as the number of syllables. Function words also contributed, and their importance to EI item construction is discussed. It is suggested that future research should examine phonological aspects of cue sentences to explain potential sources of variability. EI is shown to be a reliable and robust method for young L2 learners, with potential for classroom assessment by teachers of emergent oral production skills.
Cloze tests have been the subject of numerous studies regarding their function and use in both first language and second language contexts (e.g., Jonz & Oller, 1994; Watanabe & Koyama, 2008). From a validity standpoint, one area of investigation has been the extent to which cloze tests measure reading ability beyond the sentence level. Using test data from 50 30-item cloze passages administered to 2,298 Japanese and 5,170 Russian EFL students, this study examined the degree to which linguistic features of cloze passages and items influenced item difficulty. Using a common set of 10 anchor items, all 50 tests were modeled in terms of person ability and item difficulty onto a single scale using many-faceted Rasch measurement (k = 1314). Principal components analysis was then used to categorize 25 linguistic item- and passage-level variables for the 50 cloze tests and their respective items, from which three components each were identified for the passage-level and item-level variables. These six factors along with item difficulty were then entered into both a hierarchical structural equation model and a linear multiple regression to determine the degree to which difficulty in cloze tests could be explained separately by passage and item features. Comparisons were further made by looking at differences in models by nationality and by proficiency level (e.g., high and low). The analyses revealed noteworthy differences in mean item difficulties and in the variance structures between passage- and item-level features, as well as between different examinee proficiency groups.
Language programs need multiple test forms for secure administrations and effective placement decisions, but can they have confidence that scores on alternate test forms have the same meaning? In large-scale testing programs, various equating methods are available to ensure the comparability of forms. The choice of equating method is informed by estimates of quality, namely the method with the least error as defined by random error, systematic error, and total error. This study compared seven different equating methods to no equating – mean, linear Levine, linear Tucker, chained equipercentile, circle-arc, nominal weights mean, and synthetic. A non-equivalent groups anchor test (NEAT) design was used to compare two listening and reading test forms based on small samples (one with 173 test takers, the other with 88) at a university’s English for Academic Purposes (EAP) program. The equating methods were evaluated based on the amount of error they introduced and their practical effects on placement decisions. It was found that two types of error (systematic and total) could not be reliably computed owing to the lack of an adequate criterion; consequently, only random error was compared. Among the seven methods, the circle-arc method introduced the least random error as estimated by the standard error of equating (SEE). Classification decisions made using the seven methods differed from no equating; all methods indicated that fewer students were ready for university placement. Although interpretations regarding the best equating method could not be made, circle-arc equating reduced the amount of random error in scores, had reportedly low bias in other studies, accounted for form and person differences, and was relatively easy to compute. It was chosen as the method to pilot in an operational setting.
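For readers less familiar with equating, the simplified sketch below shows two of the simpler equating functions (mean and linear) and a bootstrap estimate of the standard error of equating. It assumes a single-group design with simulated scores; it does not reproduce the NEAT design or the chained, circle-arc, and other methods compared in the study.

```python
# Simplified illustration (not the study's NEAT design): mean and linear equating
# of form X onto form Y, with a bootstrap standard error of equating (SEE).
# All scores are simulated placeholders.
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(24, 6, size=88)    # scores on the new form X (small sample)
y = rng.normal(26, 5, size=173)   # scores on the reference form Y

def mean_equate(score, x_ref, y_ref):
    return score + (y_ref.mean() - x_ref.mean())

def linear_equate(score, x_ref, y_ref):
    return y_ref.mean() + (y_ref.std(ddof=1) / x_ref.std(ddof=1)) * (score - x_ref.mean())

def bootstrap_see(equate_fn, score, x_ref, y_ref, n_boot=2000):
    """SD of the equated score across bootstrap resamples of both forms."""
    reps = []
    for _ in range(n_boot):
        xb = rng.choice(x_ref, size=len(x_ref), replace=True)
        yb = rng.choice(y_ref, size=len(y_ref), replace=True)
        reps.append(equate_fn(score, xb, yb))
    return float(np.std(reps, ddof=1))

raw = 25
for name, fn in [("mean", mean_equate), ("linear", linear_equate)]:
    print(name, round(fn(raw, x, y), 2), "SEE =", round(bootstrap_see(fn, raw, x, y), 2))
```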
This study assessed the factor structure of the Test of English for International Communication (TOEIC®) Listening and Reading test, and its invariance across subgroups of test-takers. The subgroups were defined by (a) gender, (b) age, (c) employment status, (d) time spent studying English, and (e) having lived in a country where English is the main language. The study results indicated that a correlated two-factor model corresponding to the two language abilities of listening and reading best accounted for the factor structure of the test. In addition, the underlying construct had the same structure across the test-taker subgroups studied. There were, however, significant differences in the means of the latent construct across the subgroups. This study provides empirical support for the current score reporting practice for the TOEIC test, suggests that the test scores have the same meaning across studied test-taker subgroups, and identifies possible test-taker background characteristics that affect English language abilities as measured by the TOEIC test.
Reading comprehension tests are often assumed to measure the same, or at least similar, constructs. Yet, reading is not a single but a multidimensional form of processing, which means that variations in terms of reading material and item design may emphasize one aspect of the construct at the cost of another. The educational systems in Denmark, Norway, and Sweden share a number of traits, and over the past decade, the development of national test instruments, especially for reading, has been highly influenced by international surveys of student achievement. In this study, national tests of L1 reading comprehension in secondary school in the three Scandinavian countries are compared in order to reveal the present range of diversity/commonality within the three test domains. The analysis employs both qualitative and quantitative aspects of data, including frameworks, text samples, task samples, and scoring guidelines from 2011 to 2014. Findings indicate that the three tests differ substantially from each other, not only in terms of the intentional and operative constructs of reading to be measured, but also in terms of testing methods and stability over time. Implications for the future development of reading comprehension assessment are discussed.
Language proficiency constitutes a crucial barrier for prospective international teaching assistants (ITAs). Many US universities administer screening tests to ensure that ITAs possess the required academic oral English proficiency for their TA duties. Such ITA screening tests often elicit a sample of spoken English, which is evaluated in terms of multiple aspects by trained raters. In this light, ITA screening tests provide an advantageous context in which to gather rich information about test taker performances. This study introduces a systematic way of extracting meaningful information for major stakeholders from an ITA screening test administered at a US university. In particular, this study illustrates how academic oral English proficiency profiles can be identified based on test takers’ subscale score patterns, and discusses how the resulting profiles can be used as feedback for ITA training and screening policy makers, the ITA training program of the university, ESL instructors, and test takers. The proficiency profiles were identified using finite mixture modeling based on the subscale scores of 960 test takers. The modeling results suggested seven profile groups. These groups were interpreted and labeled based on the characteristic subscale score patterns of their members. The implications of the results are discussed, with the main focus on how such information can help ITA policy makers, the ITA training program, ESL instructors, and test takers make important decisions.
Educational policies such as Race to the Top in the USA affirm a central role for testing systems in government-driven reform efforts. Such reform policies are often referred to as the global education reform movement (GERM). Changes observed with the GERM style of testing demand socially engaged validity theories that include consequential research. The article revisits the Standards and Kane’s interpretive argument (IA) and argues that the role envisioned for consequences remains impoverished. Guided by theory of action, the article presents a validity framework, which targets policy-driven assessments and incorporates a social role for consequences. The framework proposes a coherent system that makes explicit the interconnections among policy ambitions, testing functions, and the levels/sectors that are affected. The article calls for integrating consequences into technical quality documentation, demands a more realistic delineation of stakeholders and their roles, and compels engagement in policy research.
Elicited imitation (EI) has been widely used to examine second language (L2) proficiency and development and was an especially popular method in the 1970s and early 1980s. However, as the field embraced more communicative approaches to both instruction and assessment, the use of EI diminished, and the construct-related validity of EI scores as a representation of language proficiency was called into question. Current uses of EI, while not discounting the importance of communicative activities and assessments, tend to focus on the importance of processing and automaticity. This study presents a systematic review of EI in an effort to clarify the construct and usefulness of EI tasks in L2 research.
The review underwent two phases: a narrative review and a meta-analysis. We surveyed 76 theoretical and empirical studies from 1970 to 2014, to investigate the use of EI in particular with respect to the research/assessment context and task features. The results of the narrative review provided a theoretical basis for the meta-analysis. The meta-analysis utilized 24 independent effect sizes based on 1089 participants obtained from 21 studies. To investigate evidence of construct-related validity for EI, we examined the following: (1) the ability of EI scores to distinguish speakers across proficiency levels; (2) correlations between scores on EI and other measures of language proficiency; and (3) key task features that moderate the sensitivity of EI.
Results of the review demonstrate that EI tasks vary greatly in terms of task features; however, EI tasks in general have a strong ability to discriminate between speakers across proficiency levels (Hedges’ g = 1.34). Additionally, construct, sentence length, and scoring method were identified as moderators for the sensitivity of EI. Findings of this study provide supportive construct-related validity evidence for EI as a measure of L2 proficiency and inform appropriate EI task development and administration in L2 research and assessment.
The literature provides consistent evidence that there is a strong relationship between language proficiency and math achievement. However, research results conflict over whether the longitudinal relationship between the two increases or decreases over time. This study explored longitudinal data and adopted quantile regression analyses to overcome several limitations in past research. The goal of the study is to obtain more accurate and richer information on the long-term relationship between language and math while taking socioeconomic status, gender, and ethnicity into consideration. Results confirmed a persistent relationship between math achievement and all the factors explored. More importantly, the analyses revealed that the strength of the relationship between language and math differed for students with various abilities both within and across grades. Model comparison suggests that language demand contributes to the achievement gap between ELLs and non-ELLs in math. There also seems to be a disadvantage for the geographically isolated group in academic achievement. Interpretation and implications for teaching and assessment are discussed.
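The quantile-regression idea can be sketched as follows: estimate the language coefficient at several points of the math-achievement distribution while conditioning on background variables. The data file and column names below are hypothetical, not the study's dataset.

```python
# Hedged sketch: quantile regression of math achievement on language proficiency
# plus background covariates, at several quantiles. Column names are assumptions.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("longitudinal_math.csv")  # assumed columns: math, lang, ses, female, ell

for q in (0.10, 0.25, 0.50, 0.75, 0.90):
    fit = smf.quantreg("math ~ lang + ses + female + ell", data=df).fit(q=q)
    print(f"q = {q:.2f}: language coefficient = {fit.params['lang']:.3f}")
```

Comparing the language coefficients across quantiles shows whether the language–math link is stronger for lower- or higher-achieving students, which is the kind of distribution-sensitive information ordinary least squares cannot provide.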
Placement and screening tests serve important functions, not only with regard to placing learners at appropriate levels of language courses but also with a view to maximizing the effectiveness of administering test batteries. We examined two widely reported formats suitable for these purposes, the discrete decontextualized Yes/No vocabulary test and the embedded contextualized C-test format, in order to determine which format can explain more variance in measures of listening and reading comprehension. Our data stem from a large-scale assessment with over 3000 students in the German secondary educational context; the four measures relevant to our study were administered to a subsample of 559 students. Using regression analysis on observed scores and SEM on a latent level, we found that the C-test outperforms the Yes/No format in both methodological approaches. The contextualized nature of the C-test seems to be able to explain large amounts of variance in measures of receptive language skills. The C-test, being a reliable, economical and robust measure, appears to be an ideal candidate for placement and screening purposes. In a side-line of our study, we also explored different scoring approaches for the Yes/No format. We found that using the hit rate and the false-alarm rate as two separate indicators yielded the most reliable results. These indicators can be interpreted as a measure of vocabulary breadth and as a guessing factor, respectively, and they allow guessing to be controlled for.
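To make the scoring discussion concrete, the sketch below computes the hit rate and false-alarm rate as two separate indicators for one test-taker, alongside a simple hit-minus-false-alarm correction. The responses are invented, and the correction shown is a generic one rather than necessarily the approach adopted in the study.

```python
# Illustrative Yes/No vocabulary scoring: hit rate and false-alarm rate as two
# separate indicators, plus one common guessing correction. Data are invented.
import numpy as np

def yes_no_scores(responses, is_real_word):
    """responses: 1 = 'yes, I know it'; is_real_word: 1 = real word, 0 = pseudoword."""
    responses = np.asarray(responses)
    is_real_word = np.asarray(is_real_word)
    hit_rate = responses[is_real_word == 1].mean()          # 'yes' to real words
    false_alarm_rate = responses[is_real_word == 0].mean()  # 'yes' to pseudowords
    corrected = hit_rate - false_alarm_rate                  # a simple correction for guessing
    return hit_rate, false_alarm_rate, corrected

resp = [1, 1, 0, 1, 1, 0, 1, 0, 1, 0]    # one test-taker's yes/no answers
real = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]    # 6 real words, 4 pseudowords
print(yes_no_scores(resp, real))
```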
This study explores the extent to which topic and background knowledge of topic affect spoken performance in a high-stakes speaking test. It is argued that evidence of a substantial influence may introduce construct-irrelevant variance and undermine test fairness. Data were collected from 81 non-native speakers of English who performed on 10 topics across three task types. Background knowledge and general language proficiency were measured using self-report questionnaires and C-tests respectively. Score data were analysed using many-facet Rasch measurement and multiple regression. Findings showed that for two of the three task types, the topics used in the study generally exhibited difficulty measures which were statistically distinct. However, the size of the differences in topic difficulties was too small to have a large practical effect on scores. Participants’ different levels of background knowledge were shown to have a systematic effect on performance. However, these statistically significant differences also failed to translate into practical significance. Findings hold implications for speaking performance assessment.
The present study used the mixed Rasch model (MRM) to identify subgroups of readers within a sample of students taking an EFL reading comprehension test. Six hundred and two (602) Chinese college students took a reading test and a lexico-grammatical knowledge test and completed a Metacognitive and Cognitive Strategy Use Questionnaire (MCSUQ) (Zhang, Goh, & Kunnan, 2014). MRM analysis revealed two latent classes. Class 1 was more likely to score highly on reading in-depth (RID) items. Students in this class had significantly higher general English proficiency, better lexico-grammatical knowledge, and reported using reading strategies more frequently, especially planning, monitoring, and integrating strategies. In contrast, Class 2 was more likely to score highly on skimming and scanning (SKSN) items, but had relatively lower mean scores for lexico-grammatical knowledge and general English proficiency; they also reported using strategies less frequently than did Class 1. The implications of these findings and further research are discussed.
Previous research in second language writing has shown that, when scoring performance assessments, even trained raters can exhibit significant differences in severity. When raters disagree, using discussion to try to reach a consensus is one popular form of score resolution, particularly in contexts with limited resources, as it does not require adjudication by a third rater. However, from an assessment validation standpoint, questions remain about the impact of negotiation on the scoring inference of a validation argument (Kane, 2006, 2012). Thus, this mixed-methods study evaluates the impact of score negotiation on scoring consistency in second language writing assessment, as well as negotiation’s potential contributions to raters’ understanding of test constructs and the local curriculum. Many-faceted Rasch measurement (MFRM) was used to analyze scores (n = 524) from the writing section of an EAP placement exam and to quantify how negotiation affected rater severity, self-consistency, and bias toward individual categories and test takers. Semi-structured interviews with raters (n = 3) documented their perspectives about how negotiation affects scoring and teaching. In this study, negotiation did not change rater severity, though it greatly reduced measures of rater bias. Furthermore, rater comments indicated that negotiation supports a nuanced understanding of the rubric categories and increases positive washback on teaching practices.
American Sign Language (ASL) is one of the most commonly taught languages in North America. Yet, few assessment instruments for ASL proficiency have been developed, none of which have adequately demonstrated validity. We propose that the American Sign Language Discrimination Test (ASL-DT), a recently developed measure of learners’ ability to discriminate phonological and morphophonological contrasts in ASL, provides an objective overall measure of ASL proficiency. In this study, the ASL-DT was administered to 194 participants at beginning, intermediate, and high levels of ASL proficiency, a subset of whom (N = 57) was also administered the Sign Language Proficiency Interview (SLPI), a widely used subjective proficiency measure. Using Rasch analysis to model ASL-DT item difficulty and person ability, we tested the ability of the ASL-DT Rasch measure to detect participant proficiency group mean differences and compared its discriminant performance to the SLPI ratings for classifying individuals into their pre-assigned proficiency groups using receiver operating characteristic statistics. The ASL-DT Rasch measure outperformed the SLPI ratings, indicating that the ASL-DT may provide a valid objective measure of overall ASL proficiency. As such, the ASL-DT Rasch measure may provide a useful complement to measures such as the SLPI in comprehensive sign language assessment programs.
Cognitive diagnostic models (CDMs) have great promise for providing diagnostic information to aid learning and instruction, and a large number of CDMs have been proposed. However, the assumptions and performances of different CDMs and their applications in regard to reading comprehension tests are not fully understood. In the present study, we compared the performance of a saturated model (G-DINA), two compensatory models (DINO, ACDM), and two non-compensatory models (DINA, RRUM) with the Michigan English Language Assessment Battery (MELAB) reading test. Compared to the saturated G-DINA model, the ACDM showed comparable model fit and similar skill classification results. The RRUM was slightly worse than the ACDM and G-DINA in terms of model fit and classification results, whereas the more restrictive DINA and DINO performed much worse than the other three models. The findings of this study highlighted the process and considerations pertinent to model selection in applications of CDMs with reading tests.
This study explores the attitudes of raters of English speaking tests towards the global spread of English and the challenges in rating speakers of Indian English in descriptive speaking tasks. The claims put forward by language attitude studies indicate a validity issue in English speaking tests: listeners tend to hold negative attitudes towards speakers of non-standard English, and judge them unfavorably. As there are no adequate measures of listener/rater attitude towards emerging varieties of English in language assessment research, a Rater Attitude Instrument comprising a three-phase self-measure was developed. It comprises 11 semantic differential scale items and 31 Likert scale items representing three attitude dimensions of feeling, cognition, and behavior tendency as claimed by psychologists. Confirmatory factor analysis supported a two-factor structure with acceptable model fit indices. This measure represents a new initiative to examine raters’ psychological traits as a source of validity evidence in English speaking tests to strengthen arguments about test-takers’ English language proficiency in response to the change of sociolinguistic landscape. The implications for norm selection in English oral tests are also discussed.
Perceptual (mis)matches between teachers and learners are said to affect learning success or failure. Self-assessment, as a formative assessment tool, may, inter alia, be considered a means to minimize such mismatches. Therefore, the present study investigated the extent to which learners’ assessment of their own speaking performance, before and after their being provided with a list of agreed-upon scoring criteria followed by a practice session, matches that of their teachers. To this end, 29 EFL learners and six EFL teachers served as participants; the learners were asked to assess their audio-recorded speaking performance before and after being provided with the scoring criteria and practice session. The teachers were also asked to assess the learners’ performance according to the same criteria. Finally, the learners were required to evaluate the effectiveness of doing self-assessment in the form of reflection papers. The results revealed a significant difference between the learners’ assessment of their own speaking ability on the two occasions. The findings also suggested that providing the learners with the scoring criteria and the follow-up practice session minimized the existing mismatches between learner assessment and teacher assessment. Moreover, the inductive analysis of the reflection papers yielded a number of themes suggesting that, despite some limitations, the learners’ overall evaluation of the effectiveness of speaking self-assessment was positive.
This study explores the construct validity of speaking tasks included in the TOEFL iBT (e.g., integrated and independent speaking tasks). Specifically, advanced natural language processing (NLP) tools, MANOVA difference statistics, and discriminant function analyses (DFA) are used to assess the degree to which and in what ways responses to these tasks differ with regard to linguistic characteristics. The findings lend support to using a variety of speaking tasks to assess speaking proficiency. Namely, with regard to linguistic differences, the findings suggest that responses to performance tasks can be accurately grouped based on whether a task is independent or integrated. The findings also suggest that although the independent tasks included in the TOEFL iBT may represent a single construct, responses to integrated tasks vary across task sub-type.
We addressed Deville and Chalhoub-Deville’s (2006), Schoonen’s (2012), and Xi and Mollaun’s (2006) call for research into the contextual features that are considered related to person-by-task interactions in the framework of generalizability theory in two ways. First, we quantitatively synthesized the generalizability studies to determine the percentage of variation in L2 speaking and L2 writing performance that was accounted for by tasks, raters, and their interaction. Second, we examined the relationships between person-by-task interactions and moderator variables. We used 28 datasets from 21 studies for L2 speaking, and 22 datasets from 17 studies for L2 writing. Across modalities, most of the score variation was explained by examinees’ performance; the interaction effects of tasks or raters were greater than the independent effects of tasks or raters. Task and task-related interaction effects explained a greater percentage of the score variances than did the rater and rater-related interaction effects. The variances associated with the person-by-task interactions were larger for assessments based on both general and academic contexts than for those based only on academic contexts. Further, large person-by-task interactions were related to analytic scoring and scoring criteria with task-specific language features. These findings derived from L2 speaking studies indicate that contexts, scoring methods, and scoring criteria might lead to varied performance across tasks. Consequently, constructs need to be defined with particular care.
Data from 787 international undergraduate students at an urban university in the United States were used to demonstrate the importance of separating a sample into meaningful subgroups in order to demonstrate the ability of an English language assessment to predict the first-year grade point average (GPA). For example, when all students were pooled in a single analysis, the correlation of scores from the Test of English as a Foreign Language (TOEFL) with GPA was .18; in a subsample of engineering students from China, the correlation with GPA was .58, or .77 when corrected for range restriction. Similarly, the corrected correlation of the TOEFL Reading score with GPA for Chinese business students changed dramatically (from .01 to .36) when students with an extreme discrepancy between their receptive (reading/listening) and productive (speaking/writing) scores were trimmed from the sample.
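For readers unfamiliar with the correction mentioned here, the sketch below applies the standard Thorndike Case II formula for direct range restriction, which adjusts an observed predictor–criterion correlation for the reduced predictor variance in a selected subgroup. The observed correlation and standard deviations are illustrative values, not the study's data.

```python
# Sketch of the Thorndike Case II correction for direct range restriction.
# The numbers below are illustrative, not taken from the study.
import math

def correct_range_restriction(r_obs, sd_unrestricted, sd_restricted):
    """Return the correlation corrected to the unrestricted predictor SD."""
    u = sd_unrestricted / sd_restricted
    return (r_obs * u) / math.sqrt(1 + r_obs**2 * (u**2 - 1))

# e.g., an observed r of .58 in a subgroup whose TOEFL SD is 60% of the full-group SD
print(round(correct_range_restriction(0.58, sd_unrestricted=10.0, sd_restricted=6.0), 2))
```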
The rise in the affordability of quality video production equipment has resulted in increased interest in video-mediated tests of foreign language listening comprehension. Although research on such tests has continued fairly steadily since the early 1980s, studies have relied on analyses of raw scores, despite the growing prevalence of item response theory in the field of language testing as a whole. The present study addresses this gap by comparing data from identical, counter-balanced multiple-choice listening test forms employing three text types (monologue, conversation, and lecture) administered to 164 university students of English in Japan. Data were analyzed via many-facet Rasch modeling to compare the difficulties of the audio and video formats; to investigate interactions between format and text-type, and format and proficiency level; and to identify specific items biased toward one or the other format. Finally, items displaying such differences were subjected to differential distractor functioning analyses. No interactions between format and text-type, or format and proficiency level, were observed. Four items were discovered displaying format-based differences in difficulty, two of which were found to correspond to possible acting anomalies in the videos. The author argues for further work focusing on item-level interactions with test format.
The scoring of constructed responses may introduce construct-irrelevant factors to a test score and affect its validity and fairness. Fatigue is one of the factors that could negatively affect human performance in general, yet little is known about its effects on a human rater’s scoring quality on constructed responses. In this study, we compared the scoring quality of 72 raters under four shift conditions differing on the shift length (total scoring time in a day) and session length (time continuously spent on a task). About 14,000 audio responses to four TOEFL iBT speaking tasks were scored, including 5446 validity responses that have pre-assigned "true" scores used to measure scoring accuracy. Our results suggest that the overall scoring accuracy is high for the TOEFL iBT Speaking Test, but varying levels of rating accuracy and consistency exist across shift conditions. The raters working the shorter shifts or shorter sessions on average maintain greater rating productivity, accuracy, and consistency than those working longer shifts or sessions do. The raters working the 6-hour shift with three 2-hour sessions outperform those under other shift conditions in both rating accuracy and consistency.
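A hypothetical sketch of how scoring accuracy on validity responses could be summarized by shift condition is given below; the file and column names are assumptions, and the study's actual indices of accuracy and consistency go beyond the two simple summaries shown here.

```python
# Hedged sketch: summarizing rater accuracy on validity responses (items with
# pre-assigned "true" scores) by shift condition. Column names are assumptions.
import pandas as pd

ratings = pd.read_csv("validity_ratings.csv")  # assumed: rater_id, shift_condition, true_score, assigned_score

ratings["exact_agree"] = (ratings["assigned_score"] == ratings["true_score"]).astype(int)
ratings["abs_error"] = (ratings["assigned_score"] - ratings["true_score"]).abs()

summary = ratings.groupby("shift_condition").agg(
    exact_agreement=("exact_agree", "mean"),   # proportion of exact matches with the true score
    mean_abs_error=("abs_error", "mean"),      # average size of scoring deviations
    n_responses=("assigned_score", "size"),
)
print(summary)
```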
There is still relatively little research on how well the CEFR and similar holistic scales work when they are used to rate L2 texts. Using both multifaceted Rasch analyses and qualitative data from rater comments and interviews, the ratings obtained by using a CEFR-based writing scale and the Finnish National Core Curriculum scale for L2 writing were examined to validate the rating process used in the study of the linguistic basis of the CEFR in L2 Finnish and English. More specifically, we explored the quality of the ratings and the rating scales across different tasks and across the two languages. As the task is an integral part of the data-gathering procedure, the relationship of task performance across the scales and languages was also examined. We believe the kinds of analyses reported here are also relevant to other SLA studies that use rating scales in their data-gathering process.
This study examines three controversial aspects in differential item functioning (DIF) detection by logistic regression (LR) models: first, the relative effectiveness of different analytical strategies for detecting DIF; second, the suitability of the Wald statistic for determining the statistical significance of the parameters of interest; and third, the degree of equivalence between the main DIF classification systems. Different strategies for testing LR models, and different DIF classification systems, were compared using data obtained from the University of Tehran English Proficiency Test (UTEPT). The data obtained from 400 test takers who hold a master’s degree in science and engineering or humanities were investigated for DIF. The data were also analyzed with the Mantel–Haenszel procedure in order to have an appropriate comparison for detecting uniform DIF. The article provides some guidelines for DIF detection using LR models that can be useful for practitioners in the field of language testing and assessment.
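The nested-model logic of LR DIF detection referred to above can be sketched as follows for a single item, with likelihood-ratio tests for the uniform and non-uniform DIF terms and a pseudo-R² change as an effect-size gauge. The data file and column names are hypothetical, and the study's own comparisons additionally involve the Wald statistic and the Mantel–Haenszel procedure.

```python
# Hedged sketch of logistic-regression DIF for one item: three nested models
# (matching score; + group; + group-by-score interaction). Data are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2

df = pd.read_csv("utept_item_data.csv")  # assumed: item_correct (0/1), total_score, group (0/1)

m1 = smf.logit("item_correct ~ total_score", data=df).fit(disp=0)
m2 = smf.logit("item_correct ~ total_score + group", data=df).fit(disp=0)
m3 = smf.logit("item_correct ~ total_score + group + total_score:group", data=df).fit(disp=0)

def lr_test(restricted, full, df_diff):
    """Likelihood-ratio chi-square test between two nested fitted models."""
    stat = 2 * (full.llf - restricted.llf)
    return stat, chi2.sf(stat, df_diff)

print("uniform DIF (group effect):", lr_test(m1, m2, 1))
print("non-uniform DIF (interaction):", lr_test(m2, m3, 1))
print("pseudo-R2 change (m1 -> m3):", m3.prsquared - m1.prsquared)  # effect-size gauge
```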
This study examined the relative effectiveness of the multidimensional bi-factor model and multidimensional testlet response theory (TRT) model in accommodating local dependence in testlet-based reading assessment with both dichotomously and polytomously scored items. The data used were 14,089 test-takers’ item-level responses to the testlet-based reading comprehension section of the Graduate School Entrance English Exam (GSEEE) in China administered in 2011. The results showed that although the bi-factor model was the best-fitting model, followed by the TRT model, and the unidimensional 2-parameter logistic/graded response (2PL/GR) model, the bi-factor model produced essentially the same results as the TRT model in terms of item parameter, person ability and standard error estimates. It was also found that the application of the unidimensional 2PL/GR model had a bigger impact on the item slope parameter estimates, person ability estimates, and standard errors of estimates than on the intercept parameter estimates. It is hoped that this study might help to guide test developers and users to choose the measurement model that best satisfies their needs based on available resources.
Research on the relationship between English language proficiency standards and academic content standards serves to provide information about the extent to which English language learners (ELLs) are expected to encounter academic language use that facilitates their content learning, such as in mathematics and science. Standards-to-standards correspondence thus contributes to validity evidence regarding ELL achievements in a standard-based assessment system. The current study aims to examine the reliability of reviewer judgments about language performance indicators associated with academic disciplines in standards-to-standards correspondence studies in the US K–12 settings. Ratings of cognitive complexity germane to the language performance indicators were collected from 20 correspondence studies with over 500 reviewers, consisting of content experts and ESL specialists. Using generalizability theory, we evaluate reviewer reliability and standard errors of measurement in their ratings with respect to the number of reviewers. Results show that depending on the particular grades and subject areas, 3–6 reviewers are needed to achieve acceptable reliability and to control for reasonable measurement errors in their judgments.
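To illustrate the kind of decision-study projection that underlies a "3–6 reviewers" recommendation, the sketch below estimates variance components from a fully crossed indicators-by-reviewers rating matrix and projects dependability and the absolute standard error of measurement for different panel sizes. The simulated matrix and the simple one-facet design are assumptions; the study's design is likely more complex.

```python
# Simplified one-facet G-theory sketch (indicators crossed with reviewers):
# estimate variance components, then run a D-study over panel sizes.
# The rating matrix is simulated, not the study's data.
import numpy as np

rng = np.random.default_rng(3)
n_ind, n_rev = 60, 10                                   # indicators x reviewers, fully crossed
ratings = (rng.normal(0, 1.0, (n_ind, 1))               # indicator (object of measurement) effect
           + rng.normal(0, 0.2, (1, n_rev))             # reviewer severity effect
           + rng.normal(0, 0.8, (n_ind, n_rev)))        # interaction/residual

grand = ratings.mean()
ss_p = n_rev * ((ratings.mean(axis=1) - grand) ** 2).sum()
ss_r = n_ind * ((ratings.mean(axis=0) - grand) ** 2).sum()
ss_res = ((ratings - grand) ** 2).sum() - ss_p - ss_r

ms_p = ss_p / (n_ind - 1)
ms_r = ss_r / (n_rev - 1)
ms_res = ss_res / ((n_ind - 1) * (n_rev - 1))

var_res = ms_res                                        # sigma^2(pr,e)
var_p = (ms_p - ms_res) / n_rev                         # sigma^2(p): indicators
var_r = (ms_r - ms_res) / n_ind                         # sigma^2(r): reviewers

# D-study: dependability (phi) and absolute SEM for different numbers of reviewers
for n_prime in (1, 3, 6, 10):
    phi = var_p / (var_p + (var_r + var_res) / n_prime)
    sem_abs = ((var_r + var_res) / n_prime) ** 0.5
    print(f"{n_prime} reviewers: phi = {phi:.2f}, absolute SEM = {sem_abs:.2f}")
```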
The focus of this paper is on the design, administration, and scoring of a dynamically administered elicited imitation test of L2 English morphology. Drawing on Vygotskian sociocultural psychology, particularly the concepts of zone of proximal development and dynamic assessment, we argue that support provided during the elicited imitation test both reveals and promotes the continued growth of emerging L2 capacities. Following a discussion of the theoretical and methodological background to the study, we present a single case analysis of one advanced L2 English speaker (L1 Korean). First, we present overall scores, which include three types: an "actual" score, based on first responses only; a "mediated" score, which is weighted to account for those abilities that become possible only with support; and a learning potential score, which may be used as a predictor of readiness to benefit from further instruction. Second, we illustrate how an item analysis can be useful in developing a detailed diagnostic profile of the learner that accounts for changes in the learner’s need for, and responsiveness to, support over the course of the task. In concluding, we consider the implications of our approach to dynamically assessing elicited imitation tasks and directions for further research.
Psychometric properties of the Phonological Awareness Literacy Screening for Kindergarten (PALS-K) instrument were investigated in a sample of 2844 first-time public school kindergarteners. PALS-K is a widely used English literacy screening assessment. Exploratory factor analysis revealed a theoretically defensible measurement structure that was found to replicate in a randomly selected hold-out sample when examined through the lens of confirmatory factor analytic methods. Multigroup latent variable comparisons between Spanish-speaking English-language learners (ELLs) and non-ELL students largely demonstrated the PALS-K to yield configural and metric invariance with respect to associations between subtests and latent dimensions. In combination, these results support the educational utility of the PALS-K as a tool for assessing important reading constructs and informing early interventions across groups of Spanish-speaking ELL and non-ELL students.
The Katzenberger Hebrew Language Assessment for Preschool Children (henceforth: the KHLA) is the first comprehensive, standardized language assessment tool developed in Hebrew specifically for older preschoolers (4;0–5;11 years). The KHLA is a norm-referenced, Hebrew-specific assessment, based on well-established psycholinguistic principles, as well as on the established knowledge in the field of normal language development in the preschool years. The main goal of the study is to evaluate the KHLA as a tool for identification of language-impaired Hebrew-speaking preschoolers and to find out whether the test distinguishes between typically developing (TDL) and language-impaired children. The aim of the application of the KHLA is to characterize the language skills of Hebrew-speaking children with specific language impairment (SLI). The tasks included in the assessment are considered in the literature to be the sensitive areas of language skills appropriate for assessing children with SLI. Participants included 454 (383 TDL and 71 SLI) mid–high SES, monolingual native speakers of Hebrew, aged 4;0–5;11 years. The assessment included six subtests (with a total of 171 items): Auditory Processing, Lexicon, Grammar, Phonological Awareness, Semantic Categorization, and Narration of Picture Series. The study focuses on the psychometric aspect of the test. The KHLA was found useful for distinguishing between TDL and SLI children when identification was based on the total Z-score, or on at least two of the subtest-specific Z-scores, falling at or below a –1.25 SD cutoff. The results provide a ranking order for assessment: Grammar, Auditory Processing, Semantic Categorization, Narration of Picture Series/Lexicon, and Phonological Awareness. The main clinical implications of this study are to consider the optimal cutoff point of –1.25 SD for diagnosis of SLI children and to apply the entire test for assessment. In cases where the clinician may decide to assess only two or three subtests, it is recommended that the ranking order be applied as described in the study.
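The reported decision rule can be expressed compactly as follows; the data frame and column names are placeholders, and the subtest labels simply follow the abstract.

```python
# Sketch of the reported decision rule: flag a child as at risk for SLI when the
# total Z-score, or at least two subtest Z-scores, fall at or below -1.25 SD.
# The file and column names are hypothetical placeholders.
import pandas as pd

SUBTESTS = ["auditory_processing", "lexicon", "grammar",
            "phonological_awareness", "semantic_categorization", "narration"]
CUTOFF = -1.25

def flag_sli(row):
    low_subtests = sum(row[s] <= CUTOFF for s in SUBTESTS)
    return row["total_z"] <= CUTOFF or low_subtests >= 2

children = pd.read_csv("khla_z_scores.csv")   # assumed columns: child_id, total_z, plus the six subtests
children["flagged"] = children.apply(flag_sli, axis=1)
print(children["flagged"].mean())             # proportion of the sample flagged
```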
Testlets are subsets of test items that are based on the same stimulus and are administered together. Tests that contain testlets are in widespread use in language testing, but they also share a fundamental problem: Items within a testlet are locally dependent with possibly adverse consequences for test score interpretation and use. Building on testlet response theory (Wainer, Bradlow, & Wang, 2007), the listening section of the Test of German as a Foreign Language (TestDaF) was analyzed to determine whether, and to which extent, testlet effects were present. Three listening passages (i.e., three testlets) with 8, 10, and 7 items, respectively, were analyzed using a two-parameter logistic testlet response model. The data came from two live exams administered in April 2010 (N = 2859) and November 2010 (N = 2214). Results indicated moderate effects for one testlet, and small effects for the other two testlets. As compared to a standard IRT analysis, neglecting these testlet effects led to an overestimation of test reliability and an underestimation of the standard error of ability estimates. Item difficulty and item discrimination estimates remained largely unaffected. Implications for the analysis and evaluation of testlet-based tests are discussed.
It is currently unclear to what extent a spontaneous language sample of a given number of utterances is representative of a child’s ability in morphology and syntax. This lack of information about the regularity of children’s linguistic productions and the reliability of spontaneous language samples has serious implications for language testing based upon natural language. This study investigates the reliability of children’s spontaneous language samples by using a test-retest procedure to examine repeated samples of various lengths (50, 100, 150, and 200 utterances) in regard to morpheme production in 23 typically developing children aged 2;6 to 3;6. Analyses indicate that out of the five morphosyntactic categories studied, one (the contracted auxiliary) achieves an ICC for absolute agreement over .6 using 100 utterances while most others (past tense, third-person singular and the uncontracted ‘be’ in an auxiliary form) fail to reach a correlation above .52 even when samples of 200 utterances are compared. The study indicates that (1) 200-utterance samples did not provide a significantly greater degree of reliability than 100-utterance samples; (2) several structures that children were able to produce did not show up in a 200-utterance sample; and (3) earlier acquired morphemes were not used more reliably than more recently acquired items. The notion of reliability and its importance in the area of spontaneous language samples and language testing are also discussed.
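For reference, an absolute-agreement ICC of this kind corresponds to a two-way random-effects, single-measure coefficient (Shrout & Fleiss ICC(2,1)), which can be computed as in the sketch below. The simulated test-retest matrix stands in for the study's morpheme-production percentages and is not its data.

```python
# Sketch: two-way random-effects, absolute-agreement, single-measure ICC (Shrout &
# Fleiss ICC(2,1)) for test-retest data. The matrix below is simulated.
import numpy as np

def icc_2_1(x):
    """x: n_subjects x k_occasions matrix of scores."""
    x = np.asarray(x, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()   # between-children
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()   # between-occasions
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

# Percent correct use of one morpheme in two 100-utterance samples (23 children, made-up values)
rng = np.random.default_rng(4)
ability = rng.uniform(20, 90, size=23)
samples = np.column_stack([ability + rng.normal(0, 12, 23), ability + rng.normal(0, 12, 23)])
print(round(icc_2_1(samples), 2))
```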
‘Vocabulary and structural knowledge’ (Grabe, 1991) appears to be a key component of reading ability. However, is this component to be taken as a unitary one or is structural knowledge a separate factor that can therefore also be tested in isolation in, say, a test of syntax? If syntax can be singled out (e.g. in order to investigate its contribution to reading ability), this test of syntactic knowledge would require validation. The usefulness and reliability of using expert judgments as a means of analysing the content or difficulty of test items in language assessment has been questioned for more than two decades. Still, groups of expert judges are often called upon as they are perceived to be the only or at least a very convenient way of establishing key features of items. Such judgments, however, are particularly opaque and thus problematic when judges are required to make categorizations where categories are only vaguely defined or are ontologically questionable in themselves. This is, for example, the case when judges are asked to classify the content of test items based on a distinction between lexis and syntax, a dichotomy corpus linguistics has suggested cannot be maintained. The present paper scrutinizes a study by Shiotsu (2010) that employed expert judgments, on the basis of which claims were made about the relative significance of the components ‘syntactic knowledge’ and ‘vocabulary knowledge’ in reading in a second language. By both replicating and partially replicating Shiotsu’s (2010) content analysis study, the paper problematizes not only the issue of the use of expert judgments, but, more importantly, their usefulness in distinguishing between construct components that might, in fact, be difficult to distinguish anyway. This is particularly important for an understanding and diagnosis of learners’ strengths and weaknesses in reading in a second language.
The research described in this article investigates test takers’ cognitive processing while completing onscreen IELTS (International English Language Testing System) reading test items. The research aims, among other things, to contribute to our ability to evaluate the cognitive validity of reading test items (Glaser, 1991; Field, in press).
The project focused on differences in the reading behaviours of successful and unsuccessful candidates while completing IELTS test items. A group of Malaysian undergraduates (n = 71) took an onscreen test consisting of two IELTS reading passages with 11 test items. Eye movements of a random sample of these participants (n = 38) were tracked. Stimulated recall interview data were collected to assist in the interpretation of the eye-tracking data.
Findings demonstrated significant differences between successful and unsuccessful test takers on a number of dimensions, including their ability to read expeditiously (Khalifa & Weir, 2009), and their focus on particular aspects of the test items and texts, while no observable difference was noted in other items. This offers new insights into the cognitive processes of candidates during reading tests. Findings will be of value to examination boards preparing reading tests, to teachers and learners, and also to researchers interested in the cognitive processes of readers.
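The abstract does not specify the statistical tests used for these group comparisons. Purely as an illustration of one way such a comparison might be run, the sketch below contrasts a hypothetical eye-tracking measure (total fixation duration on an item) between higher- and lower-scoring groups with a nonparametric test; all values are invented.

```python
from scipy.stats import mannwhitneyu

# Hypothetical total fixation durations (seconds) on a test item.
successful   = [4.2, 3.8, 5.1, 4.6, 3.9, 4.4]
unsuccessful = [6.3, 5.9, 7.1, 6.8, 5.5, 6.0]

# Nonparametric comparison of the two score groups on this measure.
stat, p = mannwhitneyu(successful, unsuccessful, alternative="two-sided")
print(stat, p)
```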
This study examined the influence of prompt characteristics on the averages of all scores given to test taker responses on the TOEFL iBT™ integrated Read-Listen-Write (RLW) writing tasks for multiple administrations from 2005 to 2009. In the context of TOEFL iBT RLW tasks, the prompt consists of a reading passage and a lecture.
To understand characteristics of individual prompts, 107 previously administered RLW prompts were evaluated by participants on nine measures of perceived task difficulty via a questionnaire. Because some of the RLW prompts were administered more than once, multilevel modeling analyses were conducted to examine the relationship between ratings of the prompt characteristics and the average RLW scores, while taking into account dependency among the observed average RLW scores and controlling for differences in the English ability of the test takers across administrations.
Results showed that some of the variation in the average RLW scores was attributable to differences in the English ability of the test takers that also varied across administrations. Two variables related to perceived task difficulty, distinctness of ideas within the prompt and difficulty of ideas in the passage, were also identified as potential sources of variation in the average RLW scores.
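For readers who want a concrete picture of the analysis, the sketch below shows one way such a multilevel model might be specified in Python with statsmodels, with repeated administrations nested within prompts. The data are simulated and the variable names (idea_distinctness, passage_idea_difficulty, mean_ability, avg_rlw_score) are assumptions for illustration, not the study's actual variables.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_prompts, n_admin = 50, 4          # simulated: 50 prompts, 4 administrations each
n = n_prompts * n_admin

# Prompt-level questionnaire ratings (one value per prompt).
prompt_ratings = pd.DataFrame({
    "prompt_id": np.arange(n_prompts),
    "idea_distinctness": rng.normal(size=n_prompts),
    "passage_idea_difficulty": rng.normal(size=n_prompts),
})

# Administration-level data: cohort ability varies across administrations.
df = pd.DataFrame({
    "prompt_id": np.repeat(np.arange(n_prompts), n_admin),
    "mean_ability": rng.normal(size=n),
}).merge(prompt_ratings, on="prompt_id")

# Simulated average RLW score with a random prompt intercept.
df["avg_rlw_score"] = (3.0
                       + 0.2 * df["idea_distinctness"]
                       - 0.2 * df["passage_idea_difficulty"]
                       + 0.4 * df["mean_ability"]
                       + np.repeat(rng.normal(scale=0.2, size=n_prompts), n_admin)
                       + rng.normal(scale=0.3, size=n))

# Random intercepts for prompts; fixed effects for prompt ratings and cohort ability.
result = smf.mixedlm(
    "avg_rlw_score ~ idea_distinctness + passage_idea_difficulty + mean_ability",
    data=df, groups=df["prompt_id"],
).fit()
print(result.summary())
```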
Development and administration of institutional ESL placement tests require a great deal of financial and human resources. Due to a steady increase in the number of international students studying in the United States, some US universities have started to consider using standardized test scores for ESL placement. The English Placement Test (EPT) is a locally administered ESL placement test at the University of Illinois at Urbana-Champaign (UIUC). This study examines the appropriateness of using pre-arrival SAT, ACT, and TOEFL iBT test scores as an alternative to the EPT for placement of international undergraduate students into one of the two levels of ESL writing courses at UIUC. Exploratory analysis shows that only the lowest SAT Reading and ACT English scores, and the highest TOEFL iBT total and Writing section scores, can separate students into the two placement courses. However, the number of undergraduate ESL students who scored at the lowest and highest ends of each of these test scales has been very low over the last six years (less than 5%). Thus, setting cutoff scores for such a small fraction of the ESL population may not be very practical. As far as the majority of the undergraduate ESL population is concerned, there is about a 40% chance that they may be misplaced if the placement decision is made solely on the basis of the standardized test scores.
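As an illustration of how a misplacement rate of this kind can be estimated, the sketch below compares hypothetical EPT-based placements with the placements implied by a single standardized-test cutoff; the cutoff, scores, and column names are invented for the example and do not come from the study.

```python
import numpy as np
import pandas as pd

# Hypothetical records: EPT-based placement (the criterion) and TOEFL iBT total scores.
df = pd.DataFrame({
    "ept_level":   ["lower", "upper", "upper", "lower", "upper", "lower"],
    "toefl_total": [78, 102, 95, 88, 110, 84],
})

cutoff = 90  # illustrative cutoff only, not one recommended by the study
df["toefl_level"] = np.where(df["toefl_total"] >= cutoff, "upper", "lower")

# Proportion of students whose test-score placement disagrees with the EPT placement.
misplacement_rate = (df["toefl_level"] != df["ept_level"]).mean()
print(f"Estimated misplacement rate at cutoff {cutoff}: {misplacement_rate:.0%}")
```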
The present study examined pedagogic components of Chinese reading literacy in a representative sample of 1164 Grade 7, 9 and 11 Chinese students (mean age of 15 years) from 11 secondary schools in Hong Kong, with each student tested for about 2.5 hours. Multiple group confirmatory factor analyses showed that across the three grade levels, the eight reading literacy constructs (Essay Writing, Morphological Compounding, Correction of Characters and Words, Segmentation of Text, Text Comprehension, Copying of Characters and Words, Writing to Dictation and Reading Aloud), each subserved by multiple indicators, had differential concurrent prediction of scaled internal school performance in reading and composing. Writing–reading and their interactive effects were foremost in their predictive power, followed by performance in error correction and writing to dictation, morphological compounding, segmenting text, and copying, with reading aloud playing a negligible role. Our battery of tasks, with some refinement, could serve as a screening instrument for secondary Chinese students struggling with Chinese reading literacy.
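The differential prediction reported above can be pictured with an ordinary regression of school performance on the eight construct scores. The sketch below uses simulated data and assumed variable names, and stands in for, rather than reproduces, the study's multiple group confirmatory factor analyses.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
constructs = ["essay_writing", "morph_compounding", "error_correction",
              "text_segmentation", "text_comprehension", "copying",
              "dictation", "reading_aloud"]

# Simulated stand-in data: standardized construct scores for 300 students.
df = pd.DataFrame(rng.standard_normal((300, len(constructs))), columns=constructs)
df["school_performance"] = df[constructs].mean(axis=1) + rng.standard_normal(300)

# Relative coefficient sizes give a rough analogue of 'differential prediction'.
model = smf.ols("school_performance ~ " + " + ".join(constructs), data=df).fit()
print(model.params.sort_values(ascending=False))
```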
This study investigated the relationship between latent components of academic English language ability and test takers’ study-abroad and classroom learning experiences through a structural equation modeling approach in the context of TOEFL iBT® testing. Data from the TOEFL iBT public dataset were used. The results showed that test takers’ performance on the test’s four skill sections, namely listening, reading, writing, and speaking, could be accounted for by two correlated latent components: the ability to listen, read, and write, and the ability to speak English. This two-factor model held equivalently across two groups of test takers, with one group having been exposed to an English-speaking environment and the other without such experience. Imposing a mean structure on the factor model led to the finding that the groups did not differ in terms of their standings on the factor means. The relationship between learning contexts and the latent ability components was further examined in structural regression models. The results of this study suggested an alternative characterization of the ability construct of the TOEFL test-taking population, and supported the comparability of the language ability developed in the home-country and the study-abroad groups. The results also shed light on the impact of studying abroad and home-country learning on language ability development.
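A rough sketch of how the correlated two-factor measurement model described above might be written in lavaan-style syntax with the semopy package is shown below. The indicator names are assumptions, and a real analysis would also need identification constraints for the single-indicator speaking factor as well as the multi-group and mean-structure extensions reported in the study.

```python
import semopy

# Correlated two-factor measurement model (sketch): LRW groups the listening,
# reading, and writing section scores; Speak is a single-indicator factor for
# the speaking score and would need identification constraints (e.g., a fixed
# residual variance) in a real analysis.
model_desc = """
LRW =~ listening + reading + writing
Speak =~ speaking
LRW ~~ Speak
"""

model = semopy.Model(model_desc)
# With a data frame `df` of section scores, one would then call, for example:
# model.fit(df)
# print(model.inspect())
```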
When implementing standard setting procedures, there are two major concerns: variance between panelists and efficiency in conducting multiple rounds of judgments. With regard to the former, there is concern over the consistency of the cut scores set by different panelists. If the cut scores show an inordinately wide range, then further rounds of group discussion are required to reach consensus, which in turn leads to the latter concern. The Yes/No Angoff procedure is typically implemented across several rounds, with panelists revising their original decisions for each item based on discussion with co-panelists between rounds. The purpose of this paper is to demonstrate a framework for evaluating the judgments made in the standard setting process. The Multifaceted Rasch model was applied as a tool to evaluate the quality of standard setting in a language assessment context. The results indicate that the Multifaceted Rasch model offers a promising approach to examining variability in standard setting procedures. In addition, the model can identify aberrant decision making by individual panelists, which can be used as feedback for both standard setting designers and panelists.
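To give a sense of the model being used as an evaluation tool, the following is a minimal sketch of a dichotomous many-facet Rasch formulation for a Yes/No Angoff judgment, with facets for panelist leniency, item difficulty, and round. The parameter values are invented, and the sketch is not the paper's actual estimation procedure, which would normally be carried out with dedicated Rasch software.

```python
import numpy as np

def p_yes(panelist_leniency, item_difficulty, round_effect):
    """Probability that a panelist judges 'yes' for an item (sketch).

    Log-odds are modelled additively from facet parameters, as in a
    many-facet Rasch model: easier items and more lenient panelists
    raise the probability of a 'yes' judgment.
    """
    logit = panelist_leniency - item_difficulty - round_effect
    return 1.0 / (1.0 + np.exp(-logit))

# Illustrative values only: a moderately difficult item, a slightly lenient
# panelist, and a small round effect.
print(p_yes(panelist_leniency=0.3, item_difficulty=0.4, round_effect=-0.1))
```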