Integrated speaking test tasks (integrated tasks) provide reading and/or listening input to serve as the basis for test-takers to formulate their oral responses. This study examined the influence of topical knowledge on integrated speaking test performance and compared independent speaking test performance and integrated speaking test performance in terms of how each was related to topical knowledge. The researchers derived four integrated tasks from TOEFL iBT preparation materials, developed four independent speaking test tasks (independent tasks), and validated four topical knowledge tests (TKTs) on a group of 421 EFL learners. For the main study, they invited another 352 students to respond to the TKTs and to perform two independent tasks and two integrated tasks. Half of the test takers took the independent tasks and integrated tasks on one topic combination while the other half took tasks on another topic combination. Data analysis, drawing on a series of path analyses, led to two major findings. First, topical knowledge significantly impacted integrated speaking test performance in both topic combinations. Second, the impact of topical knowledge on the two types of speaking test performances was topic dependent. Implications are proposed in light of these findings.
This study investigates test-takers’ processing while completing banked gap-fill tasks, designed to test reading proficiency, in order to test theoretically based expectations about the variation in cognitive processes of test-takers across levels of performance. Twenty-eight test-takers’ eye traces on 24 banked gap-fill items (on six tasks) were analysed according to seven online eye-tracking measures representing overall, text and task processing. Variation in processing was related to test-takers’ level of performance on the tasks overall. In particular, as hypothesized, lower-scoring students exerted more cognitive effort on local reading and lower-level cognitive processing in contrast to test-takers who attained higher scores. The findings of different cognitive processes associated with variation in scores illuminate the construct measured by banked gap-fill items, and therefore have implications for test design and the validity of score interpretations.
This paper presents an approach to standard setting that combines the prototype group method (PGM; Eckes, 2012) with a receiver operating characteristic (ROC) analysis. The combined PGM–ROC approach is applied to setting cut scores on a placement test of English as a foreign language (EFL). To implement the PGM, experts first named learners whom they considered to be typical of each of five levels of language proficiency as specified by the Common European Framework of Reference for Languages (CEFR; Council of Europe, 2001). Out of a total of 3,310 examinees taking different trial versions of the placement test, 470 learner prototypes were identified. For this set of prototypes, Rasch model estimates of EFL proficiency served as input to a series of ROC analyses, one for each pair of adjacent proficiency levels. Cut scores were derived using the Youden index that maximizes the overall rate of correct classification and minimizes the overall rate of misclassification. Findings confirmed that this method allows for the setting of cut scores that show a high level of classification accuracy in terms of the correspondence with expert categorizations of examinee prototypes. In addition, the ROC-based cut scores were associated with higher classification accuracy than cut scores derived from a logistic regression analysis of the same data. Potential further uses and implications of the PGM–ROC approach in the context of language testing and assessment are discussed.
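To illustrate the ROC step described above, the following minimal sketch locates a cut score between two adjacent proficiency levels by maximizing the Youden index (J = sensitivity + specificity − 1). The simulated Rasch estimates, level labels, and sample sizes are hypothetical placeholders, not the study's prototype data.

```python
# Sketch: locating a cut score between two adjacent proficiency levels with an
# ROC analysis and the Youden index. All data below are simulated placeholders.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
# Rasch proficiency estimates (logits) for expert-nominated prototypes of two adjacent levels
theta_lower = rng.normal(loc=-0.5, scale=0.8, size=120)   # prototypes judged to be at the lower level
theta_upper = rng.normal(loc=1.0, scale=0.8, size=110)    # prototypes judged to be at the upper level

scores = np.concatenate([theta_lower, theta_upper])
labels = np.concatenate([np.zeros_like(theta_lower), np.ones_like(theta_upper)])  # 1 = upper level

fpr, tpr, thresholds = roc_curve(labels, scores)
youden_j = tpr - fpr                       # J = sensitivity + specificity - 1
cut = thresholds[np.argmax(youden_j)]      # threshold maximizing correct classification
print(f"cut score (logits): {cut:.2f}, J = {youden_j.max():.2f}")
```

Running one such analysis per pair of adjacent levels yields the full set of level boundaries on the Rasch scale.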
This study, conducted by two researchers who were also multiple-choice question (MCQ) test item writers at a private English-medium university in an English as a foreign language (EFL) context, was designed to shed light on the factors that influence test-takers’ perceptions of difficulty in English for academic purposes (EAP) vocabulary, with the aim of improving test writers’ judgments on difficulty. The research consisted of a survey of 588 test-takers, followed by a focus group interview, aimed at investigating the relative influences of test-taker factors and word factors on difficulty perceptions. Results reveal a complex interaction of factors influencing perceived difficulty dominated by the educational, and particularly, the social context. Factors traditionally associated with vocabulary difficulty, such as abstractness and word length, appeared to have little influence. The researchers concluded that rather than basing their intuitions regarding vocabulary difficulty on language-lesson input or surface features of words, EAP vocabulary test writers need a clear understanding of test-takers’ difficulty perceptions, and how these emerge from interactions between academic, social and linguistic factors. As a basis for EAP vocabulary item writer training, four main implications are drawn, related to test-takers’ social and educational background, field of study, the features of academic words, and the test itself.
The present study investigates integrated writing assessment performances with regard to the linguistic features of complexity, accuracy, and fluency (CAF). Given the increasing presence of integrated tasks in large-scale and classroom assessments, validity evidence is needed for the claim that their scores reflect targeted language abilities. Four hundred and eighty integrated writing essays from the Internet-based Test of English as a Foreign Language (TOEFL) were analyzed using CAF measures with correlation and regression to determine how well these linguistic features predict scores on reading–listening–writing tasks. The results indicate a cumulative impact on scores from these three features. Fluency was found to be the strongest predictor of integrated writing scores. Analysis of error type revealed that morphological errors contributed more to the regression statistic than syntactic or lexical errors. Complexity was significant but had the lowest correlation to score across all variables.
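As a sketch of how CAF measures can be related to integrated writing scores with regression, the snippet below enters fluency, accuracy, and complexity cumulatively and compares the variance explained at each step. The file and column names are assumptions for illustration only; the study's actual measures and modeling choices may differ.

```python
# Hedged illustration: cumulative (hierarchical) regression of integrated-writing
# scores on CAF measures. Data file and column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("integrated_essays.csv")   # assumed columns: score, complexity, accuracy, fluency

m1 = smf.ols("score ~ fluency", data=df).fit()
m2 = smf.ols("score ~ fluency + accuracy", data=df).fit()
m3 = smf.ols("score ~ fluency + accuracy + complexity", data=df).fit()

for label, model in [("fluency only", m1), ("+ accuracy", m2), ("+ complexity", m3)]:
    print(f"{label:14s} R^2 = {model.rsquared:.3f}")
```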
The importance of functional adequacy as an essential component of L2 proficiency has been observed by several authors (Pallotti, 2009; De Jong, Steinel, Florijn, Schoonen, & Hulstijn, 2012a, b). The rationale underlying the present study is that the assessment of writing proficiency in L2 is not fully possible without taking into account the functional dimension of L2 production. In the paper a rating scale for functional adequacy is proposed, containing four dimensions: (1) content, (2) task requirements, (3) comprehensibility, and (4) coherence and cohesion. The scale is an adaptation of the global rating scale of functional adequacy employed with expert raters in earlier studies (Kuiken, Vedder, & Gilabert, 2010; Kuiken & Vedder, 2014). The new rating scale for functional adequacy was then tried out by a group of non-expert raters, who assessed the functional adequacy of a corpus of argumentative texts written by native and non-native writers of Dutch and Italian. The results showed that functional adequacy in L2 writing can be reliably measured by a rating scale comprising four different subscales.
English language proficiency or English language development (ELP/D) standards guide how content-specific instruction and assessment are practiced by teachers and how English learners (ELs) at varying levels of English proficiency can perform on grade-level-specific academic standards in K–12 US schools. With the transition from the state-developed Indiana ELP/D standards adopted in 2003 to the World Class Instructional Design and Assessment (WIDA) English language development standards adopted in 2013, this paper explores Indiana’s ELP/D standards’ 14-year history and how its EL/Bilingual district leaders have interpreted and implemented these two sets of standards between the school years 2002–03 and 2015–16.
Using critical leadership and feminism within a narrative design, EL/Bilingual leaders illuminate distinct leadership logics as they mediate and implement ELP/D standards in their districts. Academic content standards are regarded with greater privilege, complicating how EL/Bilingual leaders can position ELP/D standards. Restricted by this standards hierarchy, EL/Bilingual leaders found limited educational venues in which to discuss the performance-based nature of ELP/D standards. Implications for assessment, policy and leadership preparation are discussed.
In language programs, it is crucial to place incoming students into appropriate levels to ensure that course curriculum and materials are well targeted to their learning needs. Deciding how and where to set cutscores on placement tests is thus of central importance to programs, but previous studies in educational measurement disagree as to which standard-setting method (or methods) should be employed in different contexts. Furthermore, the results of different standard-setting methods rarely converge on a single set of cutscores, and standard-setting procedures within language program placement testing contexts specifically have been relatively understudied. This study aims to compare and evaluate three different standard-setting procedures – the Bookmark method (a test-centered approach), the Borderline group method (an examinee-centered approach), and cluster analysis (a statistical approach) – and to discuss the ways in which they do and do not provide valid and reliable information regarding placement cut-offs for an intensive English program at a large Midwestern university in the USA. As predicted, the cutscores derived from the different methods did not converge on a single solution, necessitating a means of judging between divergent results. We discuss methods of evaluating cutscores, explicate the advantages and limitations associated with each standard-setting method, recommend against using statistical approaches for most English for academic purposes (EAP) placement contexts, and demonstrate how specific psychometric qualities of the exam can affect the results obtained using those methods. Recommendations for standard setting, exam development, and cutscore use are discussed.
In this article we present a new method for estimating children’s total vocabulary size based on a language corpus in German. We drew a virtual sample of different lexicon sizes from a corpus and let the virtual sample "take" a vocabulary test by comparing whether the items were included in the virtual lexicons or not. This enabled us to identify the relation between test performance and total lexicon size. We then applied this relation to the test results of a real sample of children (grades 1–8, aged 6 to 14) and young adults (aged 18 to 25) and estimated their total vocabulary sizes. Average absolute vocabulary sizes ranged from 5,900 lemmas in first grade to 73,000 for adults, with significant increases between adjacent grade levels except from first to second grade. Our analyses also allowed us to observe parts of speech and morphological development. Results thus shed light on the course of vocabulary development during primary school.
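The core simulation idea can be sketched as follows: draw virtual lexicons of known size from a corpus, let them "take" the test, and tabulate the size-to-score relation, which is then inverted to estimate a real child's vocabulary size from a test score. The toy code below samples lexicons uniformly from a hypothetical corpus; the study's actual corpus and sampling scheme are not reproduced here.

```python
# Toy sketch of corpus-based vocabulary-size estimation. Corpus size, item count,
# and uniform sampling are simplifying assumptions, not the study's procedure.
import numpy as np

rng = np.random.default_rng(1)
n_corpus_lemmas = 250_000                                             # hypothetical corpus lexicon
test_item_ids = rng.choice(n_corpus_lemmas, size=40, replace=False)   # 40-item vocabulary test

def expected_score(lexicon_size, n_virtual=50):
    """Mean proportion of test items covered by random virtual lexicons of a given size."""
    props = []
    for _ in range(n_virtual):
        lexicon = rng.choice(n_corpus_lemmas, size=lexicon_size, replace=False)
        props.append(np.isin(test_item_ids, lexicon).mean())
    return float(np.mean(props))

# Inverting this size-to-score curve maps a real child's test score onto an
# estimated total vocabulary size.
for size in (5_000, 20_000, 50_000, 80_000):
    print(size, round(expected_score(size), 3))
```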
The present study examines the relative importance of attributes within and across items by applying four cognitive diagnostic assessment models. Drawing on the models’ capacity to indicate inter-attribute relationships that reflect examinees’ response behaviors, the study analyzes scored test-taker responses to four forms of the TOEFL reading and listening comprehension sections. The results are interpreted to determine whether the subskills defined in each subtest contribute equally within an item and across items. The study also discusses whether the empirical results support the claims for compensatory processing of L2 comprehension skills and presents practical implications of the findings. The article concludes with the limitations of the study and suggestions for future research.
Sparse-rated data are common in operational performance-based language tests, as an inevitable result of assigning examinee responses to a fraction of available raters. The current study investigates the precision of two generalizability-theory methods (i.e., the rating method and the subdividing method) specifically designed to accommodate the technical complexity involved in estimating score reliability from sparse-rated data. Examining the estimation precision of reliability is of great importance because the utility of any performance-based language test depends on its reliability. Results suggest that when some raters are expected to have greater score variability than other raters (e.g., a mixture of novice and experienced raters being deployed in a rating session), the subdividing method is recommended as it yields more precise reliability estimates. When all raters are expected to exhibit similar variability in their scoring, both the rating and subdividing methods are equally precise in estimating score reliability, and the rating method is recommended for operational use, as it is easier to implement in practice. Informed by these methodological results, the current study also demonstrates a step-by-step analysis for investigating the score reliability from sparse-rated data taken from a large-scale English speaking proficiency test. Implications for operational performance-based language tests are discussed.
In this study, two alternative theoretical models were compared, in order to analyze which of them best explains primary school children’s text comprehension skills. The first one was based on the distinction between two types of answers requested by the comprehension test: local or global. The second model involved texts’ input modality: written or oral. For this purpose, a new instrument that assesses listening and reading comprehension skills (ALCE battery; Bonifacci et al., 2014) was administered to a large sample of 1,658 Italian primary school students. The two models were tested separately for the five grades (first to fifth grade). Furthermore, a third model, which included both the types of answers and the texts’ input modality, was considered. Results of confirmatory factor analyses suggested that all models are adequate, but the second one (reading vs. listening) provided a better fit. The major role of the distinction between input modalities is discussed in relation to individual differences and developmental trajectories in text comprehension. Theoretical and clinical implications are discussed.
The aim of this study was to develop, for the benefit of both test takers and test score users, enhanced TOEFL ITP® test score reports that go beyond the simple numerical scores that are currently reported. To do so, we applied traditional scale anchoring (proficiency scaling) to item difficulty data in order to develop performance descriptors for multiple levels of each of the three sections of the TOEFL ITP. A (novel) constraint was that these levels should correspond to those established in an earlier study that mapped (i.e., aligned) TOEFL ITP scores to a widely accepted framework for describing language proficiency – the Common European Framework of Reference (CEFR). The data used in the present study came from administrations of five current operational forms of the recently revised TOEFL ITP test. The outcome of the effort is a set of performance descriptors for each of several levels of TOEFL ITP scores for each of the three sections of the test. The contribution, we believe, constitutes (1) an enhancement of the interpretation of scores for one widely used assessment of English language proficiency and (2) a modest contribution to the literature on developing proficiency descriptors – an approach that combines elements of both scale anchoring and test score mapping.
This paper reports a post-hoc analysis of the influence of the lexical difficulty of cue sentences on performance in an elicited imitation (EI) task to assess oral production skills for 645 child L2 English learners in instructional settings. This formed part of a large-scale investigation into the effectiveness of foreign language teaching in Polish primary schools. EI item design and scoring, IRT, and post-hoc lexical analysis of items are described in detail. The research aim was to determine how much the lexical complexity of items (lexical density, morphological complexity, function word density, and sentence length) contributed to item difficulty and scores. Sentence length predicted item difficulty better when measured as the number of words than as the number of syllables. Function words also contributed, and their importance to EI item construction is discussed. It is suggested that future research should examine phonological aspects of cue sentences to explain potential sources of variability. EI is shown to be a reliable and robust method for young L2 learners, with potential for classroom assessment by teachers of emergent oral production skills.
Cloze tests have been the subject of numerous studies regarding their function and use in both first language and second language contexts (e.g., Jonz & Oller, 1994; Watanabe & Koyama, 2008). From a validity standpoint, one area of investigation has been the extent to which cloze tests measure reading ability beyond the sentence level. Using test data from 50 30-item cloze passages administered to 2,298 Japanese and 5,170 Russian EFL students, this study examined the degree to which linguistic features of cloze passages and items influenced item difficulty. Using a common set of 10 anchor items, all 50 tests were modeled in terms of person ability and item difficulty onto a single scale using many-faceted Rasch measurement (k = 1314). Principal components analysis was then used to categorize 25 linguistic item- and passage-level variables for the 50 cloze tests and their respective items, from which three components each were identified for the passage-level and item-level variables. These six factors along with item difficulty were then entered into both a hierarchical structural equation model and a linear multiple regression to determine the degree to which difficulty in cloze tests could be explained separately by passage and item features. Comparisons were further made by looking at differences in models by nationality and by proficiency level (e.g., high and low). The analyses revealed noteworthy differences in mean item difficulties and in the variance structures between passage- and item-level features, as well as between different examinee proficiency groups.
Language programs need multiple test forms for secure administrations and effective placement decisions, but can they have confidence that scores on alternate test forms have the same meaning? In large-scale testing programs, various equating methods are available to ensure the comparability of forms. The choice of equating method is informed by estimates of quality, namely the method with the least error as defined by random error, systematic error, and total error. This study compared seven different equating methods to no equating – mean, linear Levine, linear Tucker, chained equipercentile, circle-arc, nominal weights mean, and synthetic. A non-equivalent groups anchor test (NEAT) design was used to compare two listening and reading test forms based on small samples (one with 173 test takers, the other with 88) at a university’s English for Academic Purposes (EAP) program. The equating methods were evaluated based on the amount of error they introduced and their practical effects on placement decisions. It was found that two types of error (systematic and total) could not be reliably computed owing to the lack of an adequate criterion; consequently, only random error was compared. Among the seven methods, the circle-arc method introduced the least random error as estimated by the standard error of equating (SEE). Classification decisions made using the seven methods differed from no equating; all methods indicated that fewer students were ready for university placement. Although interpretations regarding the best equating method could not be made, circle-arc equating reduced the amount of random error in scores, had reportedly low bias in other studies, accounted for form and person differences, and was relatively easy to compute. It was chosen as the method to pilot in an operational setting.
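For readers less familiar with equating, the simplified sketch below shows two of the simpler equating functions (mean and linear) and a bootstrap estimate of the standard error of equating. It assumes a single-group design with simulated scores; it does not reproduce the NEAT design or the chained, circle-arc, and other methods compared in the study.

```python
# Simplified illustration (not the study's NEAT design): mean and linear equating
# of form X onto form Y, with a bootstrap standard error of equating (SEE).
# All scores are simulated placeholders.
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(24, 6, size=88)    # scores on the new form X (small sample)
y = rng.normal(26, 5, size=173)   # scores on the reference form Y

def mean_equate(score, x_ref, y_ref):
    return score + (y_ref.mean() - x_ref.mean())

def linear_equate(score, x_ref, y_ref):
    return y_ref.mean() + (y_ref.std(ddof=1) / x_ref.std(ddof=1)) * (score - x_ref.mean())

def bootstrap_see(equate_fn, score, x_ref, y_ref, n_boot=2000):
    """SD of the equated score across bootstrap resamples of both forms."""
    reps = []
    for _ in range(n_boot):
        xb = rng.choice(x_ref, size=len(x_ref), replace=True)
        yb = rng.choice(y_ref, size=len(y_ref), replace=True)
        reps.append(equate_fn(score, xb, yb))
    return float(np.std(reps, ddof=1))

raw = 25
for name, fn in [("mean", mean_equate), ("linear", linear_equate)]:
    print(name, round(fn(raw, x, y), 2), "SEE =", round(bootstrap_see(fn, raw, x, y), 2))
```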
This study assessed the factor structure of the Test of English for International Communication (TOEIC®) Listening and Reading test, and its invariance across subgroups of test-takers. The subgroups were defined by (a) gender, (b) age, (c) employment status, (d) time spent studying English, and (e) having lived in a country where English is the main language. The study results indicated that a correlated two-factor model corresponding to the two language abilities of listening and reading best accounted for the factor structure of the test. In addition, the underlying construct had the same structure across the test-taker subgroups studied. There were, however, significant differences in the means of the latent construct across the subgroups. This study provides empirical support for the current score reporting practice for the TOEIC test, suggests that the test scores have the same meaning across studied test-taker subgroups, and identifies possible test-taker background characteristics that affect English language abilities as measured by the TOEIC test.
Reading comprehension tests are often assumed to measure the same, or at least similar, constructs. Yet, reading is not a single but a multidimensional form of processing, which means that variations in terms of reading material and item design may emphasize one aspect of the construct at the cost of another. The educational systems in Denmark, Norway, and Sweden share a number of traits, and over the past decade, the development of national test instruments, especially for reading, has been highly influenced by international surveys of student achievement. In this study, national tests of L1 reading comprehension in secondary school in the three Scandinavian countries are compared in order to reveal the present range of diversity/commonality within the three test domains. The analysis employs both qualitative and quantitative aspects of data, including frameworks, text samples, task samples, and scoring guidelines from 2011 to 2014. Findings indicate that the three tests differ substantially from each other, not only in terms of the intentional and operative constructs of reading to be measured, but also in terms of testing methods and stability over time. Implications for the future development of reading comprehension assessment are discussed.
Language proficiency constitutes a crucial barrier for prospective international teaching assistants (ITAs). Many US universities administer screening tests to ensure that ITAs possess the required academic oral English proficiency for their TA duties. Such ITA screening tests often elicit a sample of spoken English, which is evaluated in terms of multiple aspects by trained raters. In this light, ITA screening tests provide an advantageous context in which to gather rich information about test taker performances. This study introduces a systematic way of extracting meaningful information for major stakeholders from an ITA screening test administered at a US university. In particular, this study illustrates how academic oral English proficiency profiles can be identified based on test takers’ subscale score patterns, and discusses how the resulting profiles can be used as feedback for ITA training and screening policy makers, the ITA training program of the university, ESL instructors, and test takers. The proficiency profiles were identified using finite mixture modeling based on the subscale scores of 960 test takers. The modeling results suggested seven profile groups. These groups were interpreted and labeled based on the characteristic subscale score patterns of their members. The implications of the results are discussed, with the main focus on how such information can help ITA policy makers, the ITA training program, ESL instructors, and test takers make important decisions.
Educational policies such as Race to the Top in the USA affirm a central role for testing systems in government-driven reform efforts. Such reform policies are often referred to as the global education reform movement (GERM). Changes observed with the GERM style of testing demand socially engaged validity theories that include consequential research. The article revisits the Standards and Kane’s interpretive argument (IA) and argues that the role envisioned for consequences remains impoverished. Guided by theory of action, the article presents a validity framework, which targets policy-driven assessments and incorporates a social role for consequences. The framework proposes a coherent system that makes explicit the interconnections among policy ambitions, testing functions, and the levels/sectors that are affected. The article calls for integrating consequences into technical quality documentation, demands a more realistic delineation of stakeholders and their roles, and compels engagement in policy research.
Elicited imitation (EI) has been widely used to examine second language (L2) proficiency and development and was an especially popular method in the 1970s and early 1980s. However, as the field embraced more communicative approaches to both instruction and assessment, the use of EI diminished, and the construct-related validity of EI scores as a representation of language proficiency was called into question. Current uses of EI, while not discounting the importance of communicative activities and assessments, tend to focus on the importance of processing and automaticity. This study presents a systematic review of EI in an effort to clarify the construct and usefulness of EI tasks in L2 research.
The review underwent two phases: a narrative review and a meta-analysis. We surveyed 76 theoretical and empirical studies from 1970 to 2014, to investigate the use of EI in particular with respect to the research/assessment context and task features. The results of the narrative review provided a theoretical basis for the meta-analysis. The meta-analysis utilized 24 independent effect sizes based on 1089 participants obtained from 21 studies. To investigate evidence of construct-related validity for EI, we examined the following: (1) the ability of EI scores to distinguish speakers across proficiency levels; (2) correlations between scores on EI and other measures of language proficiency; and (3) key task features that moderate the sensitivity of EI.
Results of the review demonstrate that EI tasks vary greatly in terms of task features; however, EI tasks in general have a strong ability to discriminate between speakers across proficiency levels (Hedges’ g = 1.34). Additionally, construct, sentence length, and scoring method were identified as moderators for the sensitivity of EI. Findings of this study provide supportive construct-related validity evidence for EI as a measure of L2 proficiency and inform appropriate EI task development and administration in L2 research and assessment.
The literature provides consistent evidence that there is a strong relationship between language proficiency and math achievement. However, research results conflict over whether the longitudinal relationship between the two increases or decreases over time. This study explored longitudinal data and adopted quantile regression analyses to overcome several limitations in past research. The goal of the study is to obtain more accurate and richer information on the long-term relationship between language and math while taking socioeconomic status, gender, and ethnicity into consideration. Results confirmed a persistent relationship between math achievement and all the factors explored. More importantly, the analyses revealed that the strength of the relationship between language and math differed for students with various abilities both within and across grades. Model comparison suggests that language demand contributes to the achievement gap between ELLs and non-ELLs in math. There also seems to be a disadvantage for the geographically isolated group in academic achievement. Interpretation and implications for teaching and assessment are discussed.
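The quantile-regression idea can be sketched as follows: estimate the language coefficient at several points of the math-achievement distribution while conditioning on background variables. The data file and column names below are hypothetical, not the study's dataset.

```python
# Hedged sketch: quantile regression of math achievement on language proficiency
# plus background covariates, at several quantiles. Column names are assumptions.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("longitudinal_math.csv")  # assumed columns: math, lang, ses, female, ell

for q in (0.10, 0.25, 0.50, 0.75, 0.90):
    fit = smf.quantreg("math ~ lang + ses + female + ell", data=df).fit(q=q)
    print(f"q = {q:.2f}: language coefficient = {fit.params['lang']:.3f}")
```

Comparing the language coefficients across quantiles shows whether the language–math link is stronger for lower- or higher-achieving students, which is the kind of distribution-sensitive information ordinary least squares cannot provide.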
Placement and screening tests serve important functions, not only with regard to placing learners at appropriate levels of language courses but also with a view to maximizing the effectiveness of administering test batteries. We examined two widely reported formats suitable for these purposes, the discrete decontextualized Yes/No vocabulary test and the embedded contextualized C-test format, in order to determine which format can explain more variance in measures of listening and reading comprehension. Our data stem from a large-scale assessment with over 3000 students in the German secondary educational context; the four measures relevant to our study were administered to a subsample of 559 students. Using regression analysis on observed scores and SEM on a latent level, we found that the C-test outperforms the Yes/No format in both methodological approaches. The contextualized nature of the C-test seems to be able to explain large amounts of variance in measures of receptive language skills. The C-test, being a reliable, economical and robust measure, appears to be an ideal candidate for placement and screening purposes. In a side-line of our study, we also explored different scoring approaches for the Yes/No format. We found that using the hit rate and the false-alarm rate as two separate indicators yielded the most reliable results. These indicators can be interpreted as a measure of vocabulary breadth and as a guessing factor, respectively, and they allow guessing to be controlled for.
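To make the scoring discussion concrete, the sketch below computes the hit rate and false-alarm rate as two separate indicators for one test-taker, alongside a simple hit-minus-false-alarm correction. The responses are invented, and the correction shown is a generic one rather than necessarily the approach adopted in the study.

```python
# Illustrative Yes/No vocabulary scoring: hit rate and false-alarm rate as two
# separate indicators, plus one common guessing correction. Data are invented.
import numpy as np

def yes_no_scores(responses, is_real_word):
    """responses: 1 = 'yes, I know it'; is_real_word: 1 = real word, 0 = pseudoword."""
    responses = np.asarray(responses)
    is_real_word = np.asarray(is_real_word)
    hit_rate = responses[is_real_word == 1].mean()          # 'yes' to real words
    false_alarm_rate = responses[is_real_word == 0].mean()  # 'yes' to pseudowords
    corrected = hit_rate - false_alarm_rate                  # a simple correction for guessing
    return hit_rate, false_alarm_rate, corrected

resp = [1, 1, 0, 1, 1, 0, 1, 0, 1, 0]    # one test-taker's yes/no answers
real = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]    # 6 real words, 4 pseudowords
print(yes_no_scores(resp, real))
```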
This study explores the extent to which topic and background knowledge of topic affect spoken performance in a high-stakes speaking test. It is argued that evidence of a substantial influence may introduce construct-irrelevant variance and undermine test fairness. Data were collected from 81 non-native speakers of English who performed on 10 topics across three task types. Background knowledge and general language proficiency were measured using self-report questionnaires and C-tests respectively. Score data were analysed using many-facet Rasch measurement and multiple regression. Findings showed that for two of the three task types, the topics used in the study generally exhibited difficulty measures which were statistically distinct. However, the size of the differences in topic difficulties was too small to have a large practical effect on scores. Participants’ different levels of background knowledge were shown to have a systematic effect on performance. However, these statistically significant differences also failed to translate into practical significance. Findings hold implications for speaking performance assessment.
The present study used the mixed Rasch model (MRM) to identify subgroups of readers within a sample of students taking an EFL reading comprehension test. Six hundred and two (602) Chinese college students took a reading test and a lexico-grammatical knowledge test and completed a Metacognitive and Cognitive Strategy Use Questionnaire (MCSUQ) (Zhang, Goh, & Kunnan, 2014). MRM analysis revealed two latent classes. Class 1 was more likely to score highly on reading in-depth (RID) items. Students in this class had significantly higher general English proficiency, better lexico-grammatical knowledge, and reported using reading strategies more frequently, especially planning, monitoring, and integrating strategies. In contrast, Class 2 was more likely to score highly on skimming and scanning (SKSN) items, but had relatively lower mean scores for lexico-grammatical knowledge and general English proficiency; they also reported using strategies less frequently than did Class 1. The implications of these findings and further research are discussed.
Previous research in second language writing has shown that, when scoring performance assessments, even trained raters can exhibit significant differences in severity. When raters disagree, using discussion to try to reach a consensus is one popular form of score resolution, particularly in contexts with limited resources, as it does not require adjudication by a third rater. However, from an assessment validation standpoint, questions remain about the impact of negotiation on the scoring inference of a validation argument (Kane, 2006, 2012). Thus, this mixed-methods study evaluates the impact of score negotiation on scoring consistency in second language writing assessment, as well as negotiation’s potential contributions to raters’ understanding of test constructs and the local curriculum. Many-faceted Rasch measurement (MFRM) was used to analyze scores (n = 524) from the writing section of an EAP placement exam and to quantify how negotiation affected rater severity, self-consistency, and bias toward individual categories and test takers. Semi-structured interviews with raters (n = 3) documented their perspectives about how negotiation affects scoring and teaching. In this study, negotiation did not change rater severity, though it greatly reduced measures of rater bias. Furthermore, rater comments indicated that negotiation supports a nuanced understanding of the rubric categories and increases positive washback on teaching practices.
American Sign Language (ASL) is one of the most commonly taught languages in North America. Yet, few assessment instruments for ASL proficiency have been developed, none of which have adequately demonstrated validity. We propose that the American Sign Language Discrimination Test (ASL-DT), a recently developed measure of learners’ ability to discriminate phonological and morphophonological contrasts in ASL, provides an objective overall measure of ASL proficiency. In this study, the ASL-DT was administered to 194 participants at beginning, intermediate, and high levels of ASL proficiency, a subset of whom (N = 57) was also administered the Sign Language Proficiency Interview (SLPI), a widely used subjective proficiency measure. Using Rasch analysis to model ASL-DT item difficulty and person ability, we tested the ability of the ASL-DT Rasch measure to detect participant proficiency group mean differences and compared its discriminant performance to the SLPI ratings for classifying individuals into their pre-assigned proficiency groups using receiver operating characteristic statistics. The ASL-DT Rasch measure outperformed the SLPI ratings, indicating that the ASL-DT may provide a valid objective measure of overall ASL proficiency. As such, the ASL-DT Rasch measure may provide a useful complement to measures such as the SLPI in comprehensive sign language assessment programs.
Cognitive diagnostic models (CDMs) have great promise for providing diagnostic information to aid learning and instruction, and a large number of CDMs have been proposed. However, the assumptions and performances of different CDMs and their applications in regard to reading comprehension tests are not fully understood. In the present study, we compared the performance of a saturated model (G-DINA), two compensatory models (DINO, ACDM), and two non-compensatory models (DINA, RRUM) with the Michigan English Language Assessment Battery (MELAB) reading test. Compared to the saturated G-DINA model, the ACDM showed comparable model fit and similar skill classification results. The RRUM was slightly worse than the ACDM and G-DINA in terms of model fit and classification results, whereas the more restrictive DINA and DINO performed much worse than the other three models. The findings of this study highlighted the process and considerations pertinent to model selection in applications of CDMs with reading tests.
This study explores the attitudes of raters of English speaking tests towards the global spread of English and the challenges in rating speakers of Indian English in descriptive speaking tasks. The claims put forward by language attitude studies indicate a validity issue in English speaking tests: listeners tend to hold negative attitudes towards speakers of non-standard English, and judge them unfavorably. As there are no adequate measures of listener/rater attitude towards emerging varieties of English in language assessment research, a Rater Attitude Instrument comprising a three-phase self-measure was developed. It comprises 11 semantic differential scale items and 31 Likert scale items representing three attitude dimensions of feeling, cognition, and behavior tendency as claimed by psychologists. Confirmatory factor analysis supported a two-factor structure with acceptable model fit indices. This measure represents a new initiative to examine raters’ psychological traits as a source of validity evidence in English speaking tests to strengthen arguments about test-takers’ English language proficiency in response to the change of sociolinguistic landscape. The implications for norm selection in English oral tests are also discussed.
Perceptual (mis)matches between teachers and learners are said to affect learning success or failure. Self-assessment, as a formative assessment tool, may, inter alia, be considered a means to minimize such mismatches. Therefore, the present study investigated the extent to which learners’ assessment of their own speaking performance, before and after their being provided with a list of agreed-upon scoring criteria followed by a practice session, matches that of their teachers. To this end, 29 EFL learners and six EFL teachers served as participants; the learners were asked to assess their audio-recorded speaking performance before and after being provided with the scoring criteria and practice session. The teachers were also asked to assess the learners’ performance according to the same criteria. Finally, the learners were required to evaluate the effectiveness of doing self-assessment in the form of reflection papers. The results revealed a significant difference between the learners’ assessment of their own speaking ability on the two occasions. The findings also suggested that providing the learners with the scoring criteria and the follow-up practice session minimized the existing mismatches between learner assessment and teacher assessment. Moreover, the inductive analysis of the reflection papers yielded a number of themes suggesting that, despite some limitations, the learners’ overall evaluation of the effectiveness of speaking self-assessment was positive.
This study explores the construct validity of speaking tasks included in the TOEFL iBT (e.g., integrated and independent speaking tasks). Specifically, advanced natural language processing (NLP) tools, MANOVA difference statistics, and discriminant function analyses (DFA) are used to assess the degree to which and in what ways responses to these tasks differ with regard to linguistic characteristics. The findings lend support to using a variety of speaking tasks to assess speaking proficiency. Namely, with regard to linguistic differences, the findings suggest that responses to performance tasks can be accurately grouped based on whether a task is independent or integrated. The findings also suggest that although the independent tasks included in the TOEFL iBT may represent a single construct, responses to integrated tasks vary across task sub-type.
We addressed Deville and Chalhoub-Deville’s (2006), Schoonen’s (2012), and Xi and Mollaun’s (2006) call for research into the contextual features that are considered related to person-by-task interactions in the framework of generalizability theory in two ways. First, we quantitatively synthesized the generalizability studies to determine the percentage of variation in L2 speaking and L2 writing performance that was accounted for by tasks, raters, and their interaction. Second, we examined the relationships between person-by-task interactions and moderator variables. We used 28 datasets from 21 studies for L2 speaking, and 22 datasets from 17 studies for L2 writing. Across modalities, most of the score variation was explained by examinees’ performance; the interaction effects of tasks or raters were greater than the independent effects of tasks or raters. Task and task-related interaction effects explained a greater percentage of the score variances than did the rater and rater-related interaction effects. The variances associated with the person-by-task interactions were larger for assessments based on both general and academic contexts than for those based only on academic contexts. Further, large person-by-task interactions were related to analytic scoring and scoring criteria with task-specific language features. These findings derived from L2 speaking studies indicate that contexts, scoring methods, and scoring criteria might lead to varied performance across tasks. Consequently, constructs need to be defined with particular care.
Data from 787 international undergraduate students at an urban university in the United States were used to demonstrate the importance of separating a sample into meaningful subgroups in order to demonstrate the ability of an English language assessment to predict the first-year grade point average (GPA). For example, when all students were pooled in a single analysis, the correlation of scores from the Test of English as a Foreign Language (TOEFL) with GPA was .18; in a subsample of engineering students from China, the correlation with GPA was .58, or .77 when corrected for range restriction. Similarly, the corrected correlation of the TOEFL Reading score with GPA for Chinese business students changed dramatically (from .01 to .36) when students with an extreme discrepancy between their receptive (reading/listening) and productive (speaking/writing) scores were trimmed from the sample.
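For readers unfamiliar with the correction mentioned here, the sketch below applies the standard Thorndike Case II formula for direct range restriction, which adjusts an observed predictor–criterion correlation for the reduced predictor variance in a selected subgroup. The observed correlation and standard deviations are illustrative values, not the study's data.

```python
# Sketch of the Thorndike Case II correction for direct range restriction.
# The numbers below are illustrative, not taken from the study.
import math

def correct_range_restriction(r_obs, sd_unrestricted, sd_restricted):
    """Return the correlation corrected to the unrestricted predictor SD."""
    u = sd_unrestricted / sd_restricted
    return (r_obs * u) / math.sqrt(1 + r_obs**2 * (u**2 - 1))

# e.g., an observed r of .58 in a subgroup whose TOEFL SD is 60% of the full-group SD
print(round(correct_range_restriction(0.58, sd_unrestricted=10.0, sd_restricted=6.0), 2))
```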
The rise in the affordability of quality video production equipment has resulted in increased interest in video-mediated tests of foreign language listening comprehension. Although research on such tests has continued fairly steadily since the early 1980s, studies have relied on analyses of raw scores, despite the growing prevalence of item response theory in the field of language testing as a whole. The present study addresses this gap by comparing data from identical, counter-balanced multiple-choice listening test forms employing three text types (monologue, conversation, and lecture) administered to 164 university students of English in Japan. Data were analyzed via many-facet Rasch modeling to compare the difficulties of the audio and video formats; to investigate interactions between format and text-type, and format and proficiency level; and to identify specific items biased toward one or the other format. Finally, items displaying such differences were subjected to differential distractor functioning analyses. No interactions between format and text-type, or format and proficiency level, were observed. Four items were discovered displaying format-based differences in difficulty, two of which were found to correspond to possible acting anomalies in the videos. The author argues for further work focusing on item-level interactions with test format.
The scoring of constructed responses may introduce construct-irrelevant factors to a test score and affect its validity and fairness. Fatigue is one of the factors that could negatively affect human performance in general, yet little is known about its effects on a human rater’s scoring quality on constructed responses. In this study, we compared the scoring quality of 72 raters under four shift conditions differing on the shift length (total scoring time in a day) and session length (time continuously spent on a task). About 14,000 audio responses to four TOEFL iBT speaking tasks were scored, including 5446 validity responses that have pre-assigned "true" scores used to measure scoring accuracy. Our results suggest that the overall scoring accuracy is high for the TOEFL iBT Speaking Test, but varying levels of rating accuracy and consistency exist across shift conditions. The raters working the shorter shifts or shorter sessions on average maintain greater rating productivity, accuracy, and consistency than those working longer shifts or sessions do. The raters working the 6-hour shift with three 2-hour sessions outperform those under other shift conditions in both rating accuracy and consistency.
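A hypothetical sketch of how scoring accuracy on validity responses could be summarized by shift condition is given below; the file and column names are assumptions, and the study's actual indices of accuracy and consistency go beyond the two simple summaries shown here.

```python
# Hedged sketch: summarizing rater accuracy on validity responses (items with
# pre-assigned "true" scores) by shift condition. Column names are assumptions.
import pandas as pd

ratings = pd.read_csv("validity_ratings.csv")  # assumed: rater_id, shift_condition, true_score, assigned_score

ratings["exact_agree"] = (ratings["assigned_score"] == ratings["true_score"]).astype(int)
ratings["abs_error"] = (ratings["assigned_score"] - ratings["true_score"]).abs()

summary = ratings.groupby("shift_condition").agg(
    exact_agreement=("exact_agree", "mean"),   # proportion of exact matches with the true score
    mean_abs_error=("abs_error", "mean"),      # average size of scoring deviations
    n_responses=("assigned_score", "size"),
)
print(summary)
```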
There is still relatively little research on how well the CEFR and similar holistic scales work when they are used to rate L2 texts. Using both multifaceted Rasch analyses and qualitative data from rater comments and interviews, the ratings obtained by using a CEFR-based writing scale and the Finnish National Core Curriculum scale for L2 writing were examined to validate the rating process used in the study of the linguistic basis of the CEFR in L2 Finnish and English. More specifically, we explored the quality of the ratings and the rating scales across different tasks and across the two languages. As the task is an integral part of the data-gathering procedure, the relationship of task performance across the scales and languages was also examined. We believe the kinds of analyses reported here are also relevant to other SLA studies that use rating scales in their data-gathering process.
This study examines three controversial aspects in differential item functioning (DIF) detection by logistic regression (LR) models: first, the relative effectiveness of different analytical strategies for detecting DIF; second, the suitability of the Wald statistic for determining the statistical significance of the parameters of interest; and third, the degree of equivalence between the main DIF classification systems. Different strategies for testing LR models, and different DIF classification systems, were compared using data obtained from the University of Tehran English Proficiency Test (UTEPT). The data obtained from 400 test takers who hold a master’s degree in science and engineering or humanities were investigated for DIF. The data were also analyzed with the Mantel–Haenszel procedure in order to have an appropriate comparison for detecting uniform DIF. The article provides some guidelines for DIF detection using LR models that can be useful for practitioners in the field of language testing and assessment.
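The nested-model logic of LR DIF detection referred to above can be sketched as follows for a single item, with likelihood-ratio tests for the uniform and non-uniform DIF terms and a pseudo-R² change as an effect-size gauge. The data file and column names are hypothetical, and the study's own comparisons additionally involve the Wald statistic and the Mantel–Haenszel procedure.

```python
# Hedged sketch of logistic-regression DIF for one item: three nested models
# (matching score; + group; + group-by-score interaction). Data are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2

df = pd.read_csv("utept_item_data.csv")  # assumed: item_correct (0/1), total_score, group (0/1)

m1 = smf.logit("item_correct ~ total_score", data=df).fit(disp=0)
m2 = smf.logit("item_correct ~ total_score + group", data=df).fit(disp=0)
m3 = smf.logit("item_correct ~ total_score + group + total_score:group", data=df).fit(disp=0)

def lr_test(restricted, full, df_diff):
    """Likelihood-ratio chi-square test between two nested fitted models."""
    stat = 2 * (full.llf - restricted.llf)
    return stat, chi2.sf(stat, df_diff)

print("uniform DIF (group effect):", lr_test(m1, m2, 1))
print("non-uniform DIF (interaction):", lr_test(m2, m3, 1))
print("pseudo-R2 change (m1 -> m3):", m3.prsquared - m1.prsquared)  # effect-size gauge
```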
This study examined the relative effectiveness of the multidimensional bi-factor model and multidimensional testlet response theory (TRT) model in accommodating local dependence in testlet-based reading assessment with both dichotomously and polytomously scored items. The data used were 14,089 test-takers’ item-level responses to the testlet-based reading comprehension section of the Graduate School Entrance English Exam (GSEEE) in China administered in 2011. The results showed that although the bi-factor model was the best-fitting model, followed by the TRT model, and the unidimensional 2-parameter logistic/graded response (2PL/GR) model, the bi-factor model produced essentially the same results as the TRT model in terms of item parameter, person ability and standard error estimates. It was also found that the application of the unidimensional 2PL/GR model had a bigger impact on the item slope parameter estimates, person ability estimates, and standard errors of estimates than on the intercept parameter estimates. It is hoped that this study might help to guide test developers and users to choose the measurement model that best satisfies their needs based on available resources.
Research on the relationship between English language proficiency standards and academic content standards serves to provide information about the extent to which English language learners (ELLs) are expected to encounter academic language use that facilitates their content learning, such as in mathematics and science. Standards-to-standards correspondence thus contributes to validity evidence regarding ELL achievements in a standard-based assessment system. The current study aims to examine the reliability of reviewer judgments about language performance indicators associated with academic disciplines in standards-to-standards correspondence studies in the US K–12 settings. Ratings of cognitive complexity germane to the language performance indicators were collected from 20 correspondence studies with over 500 reviewers, consisting of content experts and ESL specialists. Using generalizability theory, we evaluate reviewer reliability and standard errors of measurement in their ratings with respect to the number of reviewers. Results show that depending on the particular grades and subject areas, 3–6 reviewers are needed to achieve acceptable reliability and to control for reasonable measurement errors in their judgments.
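To illustrate the kind of decision-study projection that underlies a "3–6 reviewers" recommendation, the sketch below estimates variance components from a fully crossed indicators-by-reviewers rating matrix and projects dependability and the absolute standard error of measurement for different panel sizes. The simulated matrix and the simple one-facet design are assumptions; the study's design is likely more complex.

```python
# Simplified one-facet G-theory sketch (indicators crossed with reviewers):
# estimate variance components, then run a D-study over panel sizes.
# The rating matrix is simulated, not the study's data.
import numpy as np

rng = np.random.default_rng(3)
n_ind, n_rev = 60, 10                                   # indicators x reviewers, fully crossed
ratings = (rng.normal(0, 1.0, (n_ind, 1))               # indicator (object of measurement) effect
           + rng.normal(0, 0.2, (1, n_rev))             # reviewer severity effect
           + rng.normal(0, 0.8, (n_ind, n_rev)))        # interaction/residual

grand = ratings.mean()
ss_p = n_rev * ((ratings.mean(axis=1) - grand) ** 2).sum()
ss_r = n_ind * ((ratings.mean(axis=0) - grand) ** 2).sum()
ss_res = ((ratings - grand) ** 2).sum() - ss_p - ss_r

ms_p = ss_p / (n_ind - 1)
ms_r = ss_r / (n_rev - 1)
ms_res = ss_res / ((n_ind - 1) * (n_rev - 1))

var_res = ms_res                                        # sigma^2(pr,e)
var_p = (ms_p - ms_res) / n_rev                         # sigma^2(p): indicators
var_r = (ms_r - ms_res) / n_ind                         # sigma^2(r): reviewers

# D-study: dependability (phi) and absolute SEM for different numbers of reviewers
for n_prime in (1, 3, 6, 10):
    phi = var_p / (var_p + (var_r + var_res) / n_prime)
    sem_abs = ((var_r + var_res) / n_prime) ** 0.5
    print(f"{n_prime} reviewers: phi = {phi:.2f}, absolute SEM = {sem_abs:.2f}")
```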
The focus of this paper is on the design, administration, and scoring of a dynamically administered elicited imitation test of L2 English morphology. Drawing on Vygotskian sociocultural psychology, particularly the concepts of zone of proximal development and dynamic assessment, we argue that support provided during the elicited imitation test both reveals and promotes the continued growth of emerging L2 capacities. Following a discussion of the theoretical and methodological background to the study, we present a single case analysis of one advanced L2 English speaker (L1 Korean). First, we present overall scores, which include three types: an "actual" score, based on first responses only; a "mediated" score, which is weighted to account for those abilities that become possible only with support; and a learning potential score, which may be used as a predictor of readiness to benefit from further instruction. Second, we illustrate how an item analysis can be useful in developing a detailed diagnostic profile of the learner that accounts for changes in the learner’s need for, and responsiveness to, support over the course of the task. In concluding, we consider the implications of our approach to dynamically assessing elicited imitation tasks and directions for further research.
Psychometric properties of the Phonological Awareness Literacy Screening for Kindergarten (PALS-K) instrument were investigated in a sample of 2844 first-time public school kindergarteners. PALS-K is a widely used English literacy screening assessment. Exploratory factor analysis revealed a theoretically defensible measurement structure that was found to replicate in a randomly selected hold-out sample when examined through the lens of confirmatory factor analytic methods. Multigroup latent variable comparisons between Spanish-speaking English-language learners (ELLs) and non-ELL students largely demonstrated the PALS-K to yield configural and metric invariance with respect to associations between subtests and latent dimensions. In combination, these results support the educational utility of the PALS-K as a tool for assessing important reading constructs and informing early interventions across groups of Spanish-speaking ELL and non-ELL students.
The Katzenberger Hebrew Language Assessment for Preschool Children (henceforth: the KHLA) is the first comprehensive, standardized language assessment tool developed in Hebrew specifically for older preschoolers (4;0–5;11 years). The KHLA is a norm-referenced, Hebrew-specific assessment, based on well-established psycholinguistic principles, as well as on the established knowledge in the field of normal language development in the preschool years. The main goal of the study is to evaluate the KHLA as a tool for identification of language-impaired Hebrew-speaking preschoolers and to find out whether the test distinguishes between typically developing (TDL) and language-impaired children. The aim of the application of the KHLA is to characterize the language skills of Hebrew-speaking children with specific language impairment (SLI). The tasks included in the assessment are considered in the literature to be the sensitive areas of language skills appropriate for assessing children with SLI. Participants included 454 (383 TDL and 71 SLI) mid–high SES, monolingual native speakers of Hebrew, aged 4;0–5;11 years. The assessment included six subtests (with a total of 171 items): Auditory Processing, Lexicon, Grammar, Phonological Awareness, Semantic Categorization, and Narration of Picture Series. The study focuses on the psychometric aspect of the test. The KHLA was found useful for distinguishing between TDL and SLI children when identification was based on the total Z-score, or on at least two of the subtest-specific Z-scores, falling at or below a –1.25 SD cutoff. The results provide a ranking order for assessment: Grammar, Auditory Processing, Semantic Categorization, Narration of Picture Series/Lexicon, and Phonological Awareness. The main clinical implications of this study are to consider the optimal cutoff point of –1.25 SD for diagnosis of SLI children and to apply the entire test for assessment. In cases where the clinician may decide to assess only two or three subtests, it is recommended that the ranking order be applied as described in the study.
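The reported decision rule can be expressed compactly as follows; the data frame and column names are placeholders, and the subtest labels simply follow the abstract.

```python
# Sketch of the reported decision rule: flag a child as at risk for SLI when the
# total Z-score, or at least two subtest Z-scores, fall at or below -1.25 SD.
# The file and column names are hypothetical placeholders.
import pandas as pd

SUBTESTS = ["auditory_processing", "lexicon", "grammar",
            "phonological_awareness", "semantic_categorization", "narration"]
CUTOFF = -1.25

def flag_sli(row):
    low_subtests = sum(row[s] <= CUTOFF for s in SUBTESTS)
    return row["total_z"] <= CUTOFF or low_subtests >= 2

children = pd.read_csv("khla_z_scores.csv")   # assumed columns: child_id, total_z, plus the six subtests
children["flagged"] = children.apply(flag_sli, axis=1)
print(children["flagged"].mean())             # proportion of the sample flagged
```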
Testlets are subsets of test items that are based on the same stimulus and are administered together. Tests that contain testlets are in widespread use in language testing, but they also share a fundamental problem: Items within a testlet are locally dependent with possibly adverse consequences for test score interpretation and use. Building on testlet response theory (Wainer, Bradlow, & Wang, 2007), the listening section of the Test of German as a Foreign Language (TestDaF) was analyzed to determine whether, and to which extent, testlet effects were present. Three listening passages (i.e., three testlets) with 8, 10, and 7 items, respectively, were analyzed using a two-parameter logistic testlet response model. The data came from two live exams administered in April 2010 (N = 2859) and November 2010 (N = 2214). Results indicated moderate effects for one testlet, and small effects for the other two testlets. As compared to a standard IRT analysis, neglecting these testlet effects led to an overestimation of test reliability and an underestimation of the standard error of ability estimates. Item difficulty and item discrimination estimates remained largely unaffected. Implications for the analysis and evaluation of testlet-based tests are discussed.
It is currently unclear to what extent a spontaneous language sample of a given number of utterances is representative of a child’s ability in morphology and syntax. This lack of information about the regularity of children’s linguistic productions and the reliability of spontaneous language samples has serious implications for language testing based upon natural language. This study investigates the reliability of children’s spontaneous language samples by using a test-retest procedure to examine repeated samples of various lengths (50, 100, 150, and 200 utterances) in regard to morpheme production in 23 typically developing children aged 2;6 to 3;6. Analyses indicate that out of the five morphosyntactic categories studied, one (the contracted auxiliary) achieves an ICC for absolute agreement over .6 using 100 utterances while most others (past tense, third-person singular and the uncontracted ‘be’ in an auxiliary form) fail to reach a correlation above .52 even when samples of 200 utterances are compared. The study indicates that (1) 200-utterance samples did not provide a significantly greater degree of reliability than 100-utterance samples; (2) several structures that children were able to produce did not show up in a 200-utterance sample; and (3) earlier acquired morphemes were not used more reliably than more recently acquired items. The notion of reliability and its importance in the area of spontaneous language samples and language testing are also discussed.
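For reference, an absolute-agreement ICC of this kind corresponds to a two-way random-effects, single-measure coefficient (Shrout & Fleiss ICC(2,1)), which can be computed as in the sketch below. The simulated test-retest matrix stands in for the study's morpheme-production percentages and is not its data.

```python
# Sketch: two-way random-effects, absolute-agreement, single-measure ICC (Shrout &
# Fleiss ICC(2,1)) for test-retest data. The matrix below is simulated.
import numpy as np

def icc_2_1(x):
    """x: n_subjects x k_occasions matrix of scores."""
    x = np.asarray(x, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()   # between-children
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()   # between-occasions
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

# Percent correct use of one morpheme in two 100-utterance samples (23 children, made-up values)
rng = np.random.default_rng(4)
ability = rng.uniform(20, 90, size=23)
samples = np.column_stack([ability + rng.normal(0, 12, 23), ability + rng.normal(0, 12, 23)])
print(round(icc_2_1(samples), 2))
```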
‘Vocabulary and structural knowledge’ (Grabe, 1991) appears to be a key component of reading ability. However, is this component to be taken as a unitary one or is structural knowledge a separate factor that can therefore also be tested in isolation in, say, a test of syntax? If syntax can be singled out (e.g. in order to investigate its contribution to reading ability), this test of syntactic knowledge would require validation. The usefulness and reliability of using expert judgments as a means of analysing the content or difficulty of test items in language assessment has been questioned for more than two decades. Still, groups of expert judges are often called upon as they are perceived to be the only or at least a very convenient way of establishing key features of items. Such judgments, however, are particularly opaque and thus problematic when judges are required to make categorizations where categories are only vaguely defined or are ontologically questionable in themselves. This is, for example, the case when judges are asked to classify the content of test items based on a distinction between lexis and syntax, a dichotomy corpus linguistics has suggested cannot be maintained. The present paper scrutinizes a study by Shiotsu (2010) that employed expert judgments, on the basis of which claims were made about the relative significance of the components ‘syntactic knowledge’ and ‘vocabulary knowledge’ in reading in a second language. By both replicating and partially replicating Shiotsu’s (2010) content analysis study, the paper problematizes not only the issue of the use of expert judgments, but, more importantly, their usefulness in distinguishing between construct components that might, in fact, be difficult to distinguish anyway. This is particularly important for an understanding and diagnosis of learners’ strengths and weaknesses in reading in a second language.
The research described in this article investigates test takers’ cognitive processing while completing onscreen IELTS (International English Language Testing System) reading test items. The research aims, among other things, to contribute to our ability to evaluate the cognitive validity of reading test items (Glaser, 1991; Field, in press).
The project focused on differences in the reading behaviours of successful and unsuccessful candidates while completing IELTS test items. A group of Malaysian undergraduates (n = 71) took an onscreen test consisting of two IELTS reading passages with 11 test items. Eye movements of a random sample of these participants (n = 38) were tracked. Stimulated recall interview data were collected to assist in the interpretation of the eye-tracking data.
Findings demonstrated significant differences between successful and unsuccessful test takers on a number of dimensions, including their ability to read expeditiously (Khalifa & Weir, 2009), and their focus on particular aspects of the test items and texts, while no observable difference was noted in other items. This offers new insights into the cognitive processes of candidates during reading tests. Findings will be of value to examination boards preparing reading tests, to teachers and learners, and also to researchers interested in the cognitive processes of readers.
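The abstract does not specify the statistical tests used for these group comparisons. Purely as an illustration of one way such a comparison might be run, the sketch below contrasts a hypothetical eye-tracking measure (total fixation duration on an item) between higher- and lower-scoring groups with a nonparametric test; all values are invented.

```python
from scipy.stats import mannwhitneyu

# Hypothetical total fixation durations (seconds) on a test item.
successful   = [4.2, 3.8, 5.1, 4.6, 3.9, 4.4]
unsuccessful = [6.3, 5.9, 7.1, 6.8, 5.5, 6.0]

# Nonparametric comparison of the two score groups on this measure.
stat, p = mannwhitneyu(successful, unsuccessful, alternative="two-sided")
print(stat, p)
```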
This study examined the influence of prompt characteristics on the averages of all scores given to test taker responses on the TOEFL iBT™ integrated Read-Listen-Write (RLW) writing tasks for multiple administrations from 2005 to 2009. In the context of TOEFL iBT RLW tasks, the prompt consists of a reading passage and a lecture.
To understand characteristics of individual prompts, 107 previously administered RLW prompts were evaluated by participants on nine measures of perceived task difficulty via a questionnaire. Because some of the RLW prompts were administered more than once, multilevel modeling analyses were conducted to examine the relationship between ratings of the prompt characteristics and the average RLW scores, while taking into account dependency among the observed average RLW scores and controlling for differences in the English ability of the test takers across administrations.
Results showed that some of the variation in the average RLW scores was attributable to differences in the English ability of the test takers that also varied across administrations. Two variables related to perceived task difficulty, distinctness of ideas within the prompt and difficulty of ideas in the passage, were also identified as potential sources of variation in the average RLW scores.
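For readers who want a concrete picture of the analysis, the sketch below shows one way such a multilevel model might be specified in Python with statsmodels, with repeated administrations nested within prompts. The data are simulated and the variable names (idea_distinctness, passage_idea_difficulty, mean_ability, avg_rlw_score) are assumptions for illustration, not the study's actual variables.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_prompts, n_admin = 50, 4          # simulated: 50 prompts, 4 administrations each
n = n_prompts * n_admin

# Prompt-level questionnaire ratings (one value per prompt).
prompt_ratings = pd.DataFrame({
    "prompt_id": np.arange(n_prompts),
    "idea_distinctness": rng.normal(size=n_prompts),
    "passage_idea_difficulty": rng.normal(size=n_prompts),
})

# Administration-level data: cohort ability varies across administrations.
df = pd.DataFrame({
    "prompt_id": np.repeat(np.arange(n_prompts), n_admin),
    "mean_ability": rng.normal(size=n),
}).merge(prompt_ratings, on="prompt_id")

# Simulated average RLW score with a random prompt intercept.
df["avg_rlw_score"] = (3.0
                       + 0.2 * df["idea_distinctness"]
                       - 0.2 * df["passage_idea_difficulty"]
                       + 0.4 * df["mean_ability"]
                       + np.repeat(rng.normal(scale=0.2, size=n_prompts), n_admin)
                       + rng.normal(scale=0.3, size=n))

# Random intercepts for prompts; fixed effects for prompt ratings and cohort ability.
result = smf.mixedlm(
    "avg_rlw_score ~ idea_distinctness + passage_idea_difficulty + mean_ability",
    data=df, groups=df["prompt_id"],
).fit()
print(result.summary())
```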
Development and administration of institutional ESL placement tests require a great deal of financial and human resources. Due to a steady increase in the number of international students studying in the United States, some US universities have started to consider using standardized test scores for ESL placement. The English Placement Test (EPT) is a locally administered ESL placement test at the University of Illinois at Urbana-Champaign (UIUC). This study examines the appropriateness of using pre-arrival SAT, ACT, and TOEFL iBT test scores as an alternative to the EPT for placement of international undergraduate students into one of the two levels of ESL writing courses at UIUC. Exploratory analysis shows that only the lowest SAT Reading and ACT English scores, and the highest TOEFL iBT total and Writing section scores, can separate students into the two placement courses. However, the number of undergraduate ESL students who scored at the lowest and highest ends of each of these test scales has been very low over the last six years (less than 5%). Thus, setting cutoff scores for such a small fraction of the ESL population may not be very practical. As far as the majority of the undergraduate ESL population is concerned, there is about a 40% chance that they may be misplaced if the placement decision is made solely on the basis of the standardized test scores.
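As an illustration of how a misplacement rate of this kind can be estimated, the sketch below compares hypothetical EPT-based placements with the placements implied by a single standardized-test cutoff; the cutoff, scores, and column names are invented for the example and do not come from the study.

```python
import numpy as np
import pandas as pd

# Hypothetical records: EPT-based placement (the criterion) and TOEFL iBT total scores.
df = pd.DataFrame({
    "ept_level":   ["lower", "upper", "upper", "lower", "upper", "lower"],
    "toefl_total": [78, 102, 95, 88, 110, 84],
})

cutoff = 90  # illustrative cutoff only, not one recommended by the study
df["toefl_level"] = np.where(df["toefl_total"] >= cutoff, "upper", "lower")

# Proportion of students whose test-score placement disagrees with the EPT placement.
misplacement_rate = (df["toefl_level"] != df["ept_level"]).mean()
print(f"Estimated misplacement rate at cutoff {cutoff}: {misplacement_rate:.0%}")
```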
The present study examined pedagogic components of Chinese reading literacy in a representative sample of 1164 Grade 7, 9 and 11 Chinese students (mean age of 15 years) from 11 secondary schools in Hong Kong, with each student tested for about 2.5 hours. Multiple group confirmatory factor analyses showed that across the three grade levels, the eight reading literacy constructs (Essay Writing, Morphological Compounding, Correction of Characters and Words, Segmentation of Text, Text Comprehension, Copying of Characters and Words, Writing to Dictation and Reading Aloud), each subserved by multiple indicators, had differential concurrent prediction of scaled internal school performance in reading and composing. Writing–reading and their interactive effects were foremost in their predictive power, followed by performance in error correction and writing to dictation, morphological compounding, segmenting text, and copying, with reading aloud playing a negligible role. Our battery of tasks, with some refinement, could serve as a screening instrument for secondary Chinese students struggling with Chinese reading literacy.
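The differential prediction reported above can be pictured with an ordinary regression of school performance on the eight construct scores. The sketch below uses simulated data and assumed variable names, and stands in for, rather than reproduces, the study's multiple group confirmatory factor analyses.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
constructs = ["essay_writing", "morph_compounding", "error_correction",
              "text_segmentation", "text_comprehension", "copying",
              "dictation", "reading_aloud"]

# Simulated stand-in data: standardized construct scores for 300 students.
df = pd.DataFrame(rng.standard_normal((300, len(constructs))), columns=constructs)
df["school_performance"] = df[constructs].mean(axis=1) + rng.standard_normal(300)

# Relative coefficient sizes give a rough analogue of 'differential prediction'.
model = smf.ols("school_performance ~ " + " + ".join(constructs), data=df).fit()
print(model.params.sort_values(ascending=False))
```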
This study investigated the relationship between latent components of academic English language ability and test takers’ study-abroad and classroom learning experiences through a structural equation modeling approach in the context of TOEFL iBT® testing. Data from the TOEFL iBT public dataset were used. The results showed that test takers’ performance on the test’s four skill sections, namely listening, reading, writing, and speaking, could be accounted for by two correlated latent components: the ability to listen, read, and write, and the ability to speak English. This two-factor model held equivalently across two groups of test takers, with one group having been exposed to an English-speaking environment and the other without such experience. Imposing a mean structure on the factor model led to the finding that the groups did not differ in terms of their standings on the factor means. The relationship between learning contexts and the latent ability components was further examined in structural regression models. The results of this study suggested an alternative characterization of the ability construct of the TOEFL test-taking population, and supported the comparability of the language ability developed in the home-country and the study-abroad groups. The results also shed light on the impact of studying abroad and home-country learning on language ability development.
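A rough sketch of how the correlated two-factor measurement model described above might be written in lavaan-style syntax with the semopy package is shown below. The indicator names are assumptions, and a real analysis would also need identification constraints for the single-indicator speaking factor as well as the multi-group and mean-structure extensions reported in the study.

```python
import semopy

# Correlated two-factor measurement model (sketch): LRW groups the listening,
# reading, and writing section scores; Speak is a single-indicator factor for
# the speaking score and would need identification constraints (e.g., a fixed
# residual variance) in a real analysis.
model_desc = """
LRW =~ listening + reading + writing
Speak =~ speaking
LRW ~~ Speak
"""

model = semopy.Model(model_desc)
# With a data frame `df` of section scores, one would then call, for example:
# model.fit(df)
# print(model.inspect())
```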
When implementing standard setting procedures, there are two major concerns: variance between panelists and efficiency in conducting multiple rounds of judgments. With regard to the former, there is concern over the consistency of the cut scores set by different panelists. If the cut scores show an inordinately wide range, then further rounds of group discussion are required to reach consensus, which in turn leads to the latter concern. The Yes/No Angoff procedure is typically implemented across several rounds, with panelists revising their original decisions for each item based on discussion with co-panelists between rounds. The purpose of this paper is to demonstrate a framework for evaluating the judgments made in the standard setting process. The Multifaceted Rasch model was applied as a tool to evaluate the quality of standard setting in a language assessment context. The results indicate that the Multifaceted Rasch model offers a promising approach to examining variability in standard setting procedures. In addition, the model can identify aberrant decision making by individual panelists, which can be used as feedback for both standard setting designers and panelists.
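To give a sense of the model being used as an evaluation tool, the following is a minimal sketch of a dichotomous many-facet Rasch formulation for a Yes/No Angoff judgment, with facets for panelist leniency, item difficulty, and round. The parameter values are invented, and the sketch is not the paper's actual estimation procedure, which would normally be carried out with dedicated Rasch software.

```python
import numpy as np

def p_yes(panelist_leniency, item_difficulty, round_effect):
    """Probability that a panelist judges 'yes' for an item (sketch).

    Log-odds are modelled additively from facet parameters, as in a
    many-facet Rasch model: easier items and more lenient panelists
    raise the probability of a 'yes' judgment.
    """
    logit = panelist_leniency - item_difficulty - round_effect
    return 1.0 / (1.0 + np.exp(-logit))

# Illustrative values only: a moderately difficult item, a slightly lenient
# panelist, and a small round effect.
print(p_yes(panelist_leniency=0.3, item_difficulty=0.4, round_effect=-0.1))
```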