Competence Assessment of Students With Special Educational Needs—Identification of Appropriate Testing Accommodations
Anna Südkamp^a, Steffi Pohl^b, & Sabine Weinert^c
^a TU Dortmund University, Germany
^b Freie Universität Berlin, Germany
^c University of Bamberg, Germany
Article received 23 October 2014 / revised 16 March 2015 / accepted 18 May 2015 / available online 1 June 2015
Abstract
Including students with special educational needs
in learning (SEN-L) is a challenge for large-scale
assessments. In order to draw inferences with respect to
students with SEN-L and to compare their scores to those of students in
general education, one needs to ensure that the measurement
model is reliable and that the same construct is measured for
different samples and test forms. In this article, we focus on
testing the appropriateness of competence assessments for
students with SEN-L. We specifically asked how the reading
competence of students with SEN-L may be assessed reliably and
comparably. We thoroughly evaluated different testing
accommodations for students with SEN-L. The reading competence
of N = 433 students with SEN-L was assessed using a standard
reading test, a reduced test version, and an easy test
version. Also, N = 5,208 general education students and a
group of N = 490 low-performing students were tested. Results
show that all three reading test versions are suitable for a
reliable and comparable measurement of reading competence in
students without SEN-L. For students with SEN-L, the
accommodated test versions considerably reduced the amount of
missing values and resulted in better psychometric properties
than the standard test. They did not, however, show
satisfactory item fit and measurement invariance. Implications
for future research are discussed.
Keywords: students
with special educational needs; testing accommodations;
reading competence; large-scale assessment
Corresponding author: Anna Südkamp, Emil-Figge-Str. 50, 44227 Dortmund, Germany, Phone: +49 231 755 6570, Fax: +49 231 755 6572, E-Mail: anna.suedkamp@tu-dortmund.de
DOI: http://dx.doi.org/10.14786/flr.v3i2.130
1. Introduction
Large-scale assessments generally aim at drawing
inferences about individuals’ knowledge, competencies, and
skills (Popham, 2000). Today, educational assessments play an
important role as they inform students, parents, educators,
policymakers, and the public about the effectiveness of
educational services (Pellegrino, Chudowsky, & Glaser,
2001). Using results from large-scale assessments,
researchers can study factors influencing the acquisition and
development of competencies and derive strategies on the
improvement of educational systems. Often, assessments are
meant to serve even more ambitious purposes such as supporting
student learning (Chudowsky & Pellegrino, 2003). Assessing
students’ domain-specific competencies (e.g., reading
competence, mathematical competence) is a key aspect of most
large-scale assessments today (Weinert, 2001).
In this study, we focus on the
assessment of competencies of students with special
educational needs (SEN) in large-scale assessments. While
national large-scale assessments like the National Assessment
of Educational Progress (NAEP) in the United States and
international assessments like the Programme for International
Student Assessment (PISA) have established sophisticated
methods for the assessment of students without SEN, testing
students with SEN has proven to be challenging. In order to
inform strategies for the assessment of students with SEN, we
evaluate whether and if so, how students with SEN may be
tested reliably and comparably to general education students.
For this purpose, students with and without SEN were tested
with accommodated and non-accommodated test versions. On the
level of the single items, we carefully checked the
reliability and comparability of the test scores obtained with
the different test versions as reliability and comparability
are necessary prerequisites for drawing meaningful inferences
from large-scale assessments.
1.1 Assessing Reading Competence of Students With SEN
Large-scale
assessments usually aim at describing the abilities of
students within a country across the whole spectrum of the
educational system or even across countries. This also
includes students with SEN. In our notion, students with SEN
include all students who are provided with special educational
services due to a physical or mental impairment. In Germany,
special schools are established for students with SEN. The
special school system—in turn—is highly differentiated itself.
There are special schools for students with special
educational needs in learning, visual impairments, hearing
disability/impairment, specific language/speech impairments,
physical handicaps/disabilities, severe intellectual
impairment/disability, emotional and behavioral difficulties,
comprehensive SEN, and students with health impairment.
So far,
comparatively little is known about the educational careers of
students with SEN and their development of competencies across
the life span (Heydrich, Weinert, Nusser, Artelt, &
Carstensen, 2013; Ysseldyke et al., 1998). However, there is
evidence that for students with SEN, reading problems pose one
of the greatest barriers to success in school (Kavale &
Reece, 1992; Swanson, 1999). Learning to read is a tedious
process requiring psycholinguistic, perceptual, cognitive, and
social skills (Gee, 2004). Beyond the basic acquisition of the
alphabet system (i.e., letter-sound correspondence and
spelling patterns), reading expertise implies phonological
processing and decoding skills, linguistic knowledge
(vocabulary, grammar), and text comprehension skills (Durkin,
1993; Verhoeven & van Leeuwe, 2008). According to Kintsch
(2007), text comprehension can be seen as a combination of
text-based processes that integrate previous knowledge to a
mental representation of the text. It is thus a form of
cognitive construction in which the individual takes an active
role. Text comprehension entails deep-level problem-solving
processes that enable readers to construct meaning from text
and derives from the intentional interaction between reader
and text (Duke & Pearson, 2002; Durkin, 1993).
On average,
students with SEN show lower reading performance in
large-scale assessments than students without SEN (Thurlow,
2010; Thurlow, Bremer, & Albus, 2008; Ysseldyke et al.,
1998). For example, for the NAEP 1998 reading assessment in
grades 4 and 8, Lutkus, Mazzeo, Zhang, and Jerry (2004) report
lower average scale scores for students with SEN compared to
students without SEN. Within the German KESS study (Bos et
al., 2009) reading competence of seventh graders in special
schools was compared to the reading competence of fourth
graders in general education settings. Results demonstrated
that fourth grade primary school students outperformed
students with SEN in seventh grade in reading competence, the
difference being about one third of a standard deviation.
Drawing on data from a three-year longitudinal study, Wu et
al. (2012) found that, compared to their general education
peers, students receiving special educational services were
more likely to score below the 10th percentile for several
years in a row. In light of these findings, different reasons
for the low performance of students with SEN have been
discussed (Abedi et al., 2011). First, some students with SEN
have difficulties related to the comprehension of text (e.g.,
lack of knowledge of common text structures, restricted
language competencies, inappropriate use of background
knowledge while reading; Gersten, Fuchs, Williams, &
Baker, 2001). Reading problems of students with SEN in upper
elementary and middle school are likely to be complex and
heterogeneous resulting,
for example, from a lack of phonological processing and
decoding skills, a lack
of linguistic knowledge (vocabulary, grammar), and a lack of
text comprehension skills, or from a combination of problems
in these areas. Second, lower performance could be attributed
to a lack of opportunities to learn and to low teacher
expectations (Woodcock & Vialle, 2011). Third, there could
be barriers for students with disabilities in large-scale
assessments that lead to unfair testing conditions (Pitoniak
& Royer, 2001). According to Thurlow (2010), a combination
of all these factors is likely. Taking the norm of test
fairness seriously, large-scale studies try to ensure that
students with disabilities will not be confronted with unfair
testing conditions. That is why testing accommodations are
often employed for students with SEN.
1.2 Providing Students With SEN With Testing Accommodations
The
provision of testing accommodations for individuals with
disabilities is a highly controversial issue in the assessment
literature (Pitoniak & Royer, 2001; Sireci, Scarpati,
& Li, 2005). Generally, testing accommodations are defined
as changes in test administration that are meant to reduce
construct-irrelevant difficulty associated with students’
disability-related impediments to performance. According to
the Standards for Educational and Psychological Testing,
accommodations comprise “any action taken in response to a
determination that an individual’s disability requires a
departure from established testing protocol. Depending on
circumstances, such accommodation may include modification of
test administration processes or modification of test content”
(American Educational Research Association, 1999, p. 110).
Note that some authors differentiate between accommodations and modifications: While
accommodations are not meant to change the nature of the
construct being measured, modifications result in a change in
the test and equally affect all students taking it
(Hollenbeck, Tindal, & Almond, 1998; Tindal, Heath,
Hollenbeck, Almond, & Harniss, 1998). In this article,
however, we use the definition of the Standards for
Educational and Psychological Testing.
Due to the
many types of disabilities, various accommodations have been
provided when testing students with SEN. Accommodations
include, for example, modification of presentation
format—including the use of braille or large-print booklets
for visually-impaired examinees and the use of written or
signed test directions for hearing-impaired examinees—and
modification of timing, including extended testing time or
frequent breaks (Koretz & Barton, 2003). In the 1998 NAEP
reading assessment, a sample of students with varying
disabilities and students with limited English proficiency
were assigned to the following accommodations based on their
individual needs: one-on-one testing, small-group testing,
extended time, oral reading of directions, signing of
directions, use of magnifying equipment, and use of an aide
for transcribing responses.
Changes to the test bear the risk of altering the construct being measured. If accommodated tests for students with SEN measure a different construct than the standard test for general education students, the competence scores of the two student groups are not comparable. Thus, it is critically important to test whether testing accommodations result in
reliable and comparable competence measures
(Borsboom, 2006; Millsap, 2011). Lutkus et
al. (2004) address the issue of whether the NAEP reading
construct remains comparable for accommodated versus
non-accommodated students by analyzing differential item
functioning (DIF). DIF exists when subjects with the same
trait level have a different probability of endorsing an item.
Only very few items were found to have statistically
significant DIF for the focal group (accommodated students)
versus the reference group (non-accommodated students), which
indicated measurement invariance across subgroups being
assessed with different tests. In contrast, Koretz (1997) did
find indications of DIF as 13 of 22 common items showed strong
DIF when comparing item difficulty for students with SEN
tested with accommodations and students without SEN tested
under standard conditions using data from the Kentucky
Instructional Results Information System assessment. In PISA
2012, samples of students with SEN were also tested with
accommodated test versions (a shortened test version of the
standard test and a test version including easier items).
Here, the results on the psychometric properties of the
accommodated test versions are still to be published (Müller,
Sälzer, Mang, & Prenzel, 2014). In sum, the results
concerning the use of testing accommodations are inconsistent.
A major concern remains that in some cases accommodations may
alter the test to the extent that accommodated and
non-accommodated tests are no longer comparable (Abedi et al.,
2011; Bielinski, Thurlow, Ysseldyke, Freidebach, &
Freidebach, 2001; Cormier, Altman, Shyyan, & Thurlow,
2010).
However,
one drawback of the studies by Lutkus et al. (2004) and Koretz
(1997) is that in the analyses, different accommodations are
not distinguished although different accommodations may have
different effects. Another disadvantage is that students with
SEN are often compared to students without SEN at the same
grade level. Here, students with SEN and students without SEN
differ not only in terms of their SEN status but also in terms
of their expected achievement level. The study by Yovanoff and
Tindal (2007) is one of the rare studies using an alternative
comparison group where students with SEN in grade 3 are
compared to students without SEN in grade 2.
Another
issue is that comparisons usually involve students without SEN
receiving the standard test and students with SEN receiving
the accommodated test versions. By doing this, possible DIF
may be due to both testing accommodations and problems of
testing students with SEN. In order to disentangle the
appropriateness of test accommodations from the testability
problems of students with SEN, the effects of test
accommodations should separately be tested in a group of
students without SEN. In the same vein, Pitoniak and Royer
(2001) identify three major challenges for research on testing
accommodations: variability in examinees, variability in
accommodations, and small sample sizes (also see Geisinger,
1994). In the present study, we approach these challenges by
focusing on students with special educational needs in
learning (SEN-L), by focusing on specific accommodations
appropriate for students with SEN-L, and by using a study
design that incorporates a group of low-achieving students
without SEN for evaluating the appropriateness of testing
accommodations.
1.3 Testing Students With Special Educational Needs in Learning (SEN-L)
While
providing students with physical, hearing, and visual
impairments with testing accommodations is rather accepted,
Pitoniak and Royer (2001) stress the importance of studying
the effects of testing accommodations on test validity (or
comparability), especially when testing students with learning
disabilities. In this study, we focus on students with SEN-L
in Germany, that is, all students who are provided with
special educational services due to a general learning
disability[1].
In Germany, students are assigned to the SEN-L group when
their learning, academic achievement, and/or learning behavior
are impaired (KMK, 2012) and when students' cognitive abilities
are below the normal range (Grünke, 2004). In contrast to students
with SEN-L, students with (specific) learning disabilities
(e.g., a reading disorder) are not necessarily impaired in
their general cognitive abilities. In Germany, the decision of
whether a student has special educational needs in learning is
based on a diagnostic procedure and made collaboratively by
parents, teachers, consultants, and school administrations.
About 78% of the SEN-L students in Germany (KMK, 2012) do not
attend regular schools but attend special schools with
specific programs and training tailored to those who are
unable to follow school lessons and subject matter in regular
classes.
In fact,
students with SEN-L compose the largest group of students with
special educational needs in Germany (KMK, 2012). Comparably,
students with learning disabilities compose the largest group
of students with disabilities in the United States (Cortiella
& Horowitz,
2014; US Department of Education, 2013). Our assumption is that the
acceptance of testing accommodations for students with SEN-L
is low, because the disabilities of students with SEN-L (e.g.,
information processing restrictions) are very likely to
interfere with the construct that is to be measured (e.g.,
reading literacy). In turn, respective testing accommodations
are likely to be construct-relevant. There are two
test accommodations typically implemented for students with
SEN-L: extended test time and “out-of-level” testing. Extended
test time is usually implemented in order to compensate for
information-processing restrictions in students with SEN-L. In
his review on the appropriateness of extended time
accommodations for students with SEN—including students with
SEN-L among others—Lovett (2010) identified two studies with a
substantial number of differentially functioning items, while DIF
was negligible in one other study. A prominent hypothesis
regarding extended test time is the differential boost
hypothesis (Fuchs, Fuchs, Eaton, Hamlett, & Karns,
2000), which states that students with SEN benefit more from
extended time than students without SEN. In their review on test
accommodations for students with SEN including 14 studies on
extended time, Sireci et al. (2005) conclude that students with
SEN as well as students without SEN benefit from extended test
time. In only one of the reviewed studies did students with SEN benefit more from extended test time than students without SEN.
Another
common method is to provide students with SEN-L with an
out-of-level test, which was originally meant for testing
younger children (Thurlow, Elliott, & Ysseldyke, 1999). Similarly, alternate
assessments that test lower-level reading and mathematical
skills or skills that are precursory to reading and numerical
literacy can be applied (Zebehazy, Zigmond, & Zimmerman,
2012). Both methods aim at avoiding undue frustrations for
students with SEN-L and at improving the accuracy of
measurement. Critics of out-of-level and alternate assessments
argue that students with SEN-L are faced with low expectations
due to the assessment and are prevented from taking the
standard tests, and thus consider the assessments to be
inappropriate for accountability assessment. Nevertheless,
Thurlow et al. (1999) consider out-of-level testing a good
opportunity to test students with SEN, if one can make sure
that a common scale across disparate grade levels is
available. Such a common scale may be achieved by using
methods of Item Response Theory (IRT), given that the items
measure the same construct. When scaling Oregon’s early
reading alternate assessment onto the first general statewide
benchmark reading assessment in grade 3, Yovanoff and Tindal
(2007) identified
good psychometric properties of the alternate assessment and
no severe DIF between students with SEN (grade 3) and students
without SEN (grade 2). However, data that support either the
use or nonuse of out-of-level testing or alternate assessments
is still rare (Minnema, Thurlow, Bielinski, & Scott, 2000;
see the study by Zebehazy et al., 2012, which focuses on
visually impaired students, for an exception).
2. Research Questions
As prior research has shown, testing competencies
of students with SEN-L represents a challenge for large-scale
assessments (Thurlow, 2010). Assessing competencies of
students with SEN-L with tests that have been developed for
students without SEN-L may fail to result in satisfactory item
fit measures and may be associated with differential item
functioning, which impedes the opportunity to compare the
competence scores of students with and without SEN. The
present study aims to evaluate different strategies of testing
students with SEN-L. Generally, we address the question of
whether and how satisfactory item fit measures and measurement
invariant test scores can be obtained for students with SEN-L
in large-scale-assessments. We evaluate whether standard tests
developed for students without SEN-L and testing
accommodations for students with SEN-L result in reliable and
comparable measures of reading competence. If a reliable and
comparable measurement of reading competence can be achieved,
substantial research on the competence level, predictors of
reading competence and competence development, as well as
group differences may be investigated.
In this study, two major research questions are
addressed: First, we investigate whether a reduction in test
difficulty and a reduction of the number of items lead to test results comparable to those of students tested without accommodations.
We approach these questions by testing students in general
education, for whom reliable and valid competence scores can
be obtained using a standard reading test. As the accommodated
test versions are targeted towards a lower competence level,
we did not use the whole group of students in general
education, but focused on the subgroup of low-achieving
students. Secondly, we explore whether these accommodations
are suitable for testing students with SEN-L.
3. Method
3.1 Sample and Design
We collected data within the German National
Educational Panel Study (NEPS). The NEPS is a national,
large-scale longitudinal multicohort study that investigates
the development of competencies across the lifespan (Blossfeld
& von Maurice, 2011; Blossfeld, von Maurice, &
Schneider, 2011). The study aims at providing high-quality,
user-friendly data on competence development and educationally
relevant processes for an international scientific community
(Barkow et al., 2011).
Between 2009 and 2012, six representative start
cohorts (Aßmann et al., 2011) were sampled, including about
60,000 individuals from early childhood to adulthood. Specific
target groups include migrants (Kristen et al., 2011) and
students with SEN-L (Heydrich et al., 2013). All participants
are accompanied on their individual educational pathways
through a collection of data on competencies (Weinert et al.,
2011), learning environments (Bäumer, Preis, Roßbach, Stecher,
& Klieme, 2011), educational decisions (Stocké, Blossfeld,
Hoenig, & Sixt, 2011), and educational returns (Gross,
Jobst, Jungbauer-Gans, & Schwarze, 2011). Following the
principles of universal design (Dolan & Hall, 2001;
Thompson, Johnstone, Anderson, & Miller, 2005), the NEPS
aims at providing a basis for fair and equitable measures of
competencies for all individuals.
In the present study, we used data from three
different studies of students in fifth grade. These studies
comprise a) a representative sample of general education
students (main sample), b) a sample of students with SEN-L,
and c) a group of students in the lowest academic track (LAT).
The response rates in these studies were 55%, 45%, and 63%,
respectively. In the main sample there were N = 5,208 general
education students, including N = 700 students in the
lowest academic track (see Aßmann, Steinhauer, & Zinn,
2012, for more information on the NEPS main sample). On
average, these students were M_age = 10.95 (SD_age = .53) years old and 48.3% were female (0.7% had a missing response on age, 0.2% had a missing response on gender). About
24.1% of the students reported that they spoke a language
other than German at home. The sample of students with SEN-L
draws on a feasibility study with N = 433 students who
were recruited at special schools for children with SEN-L in
Germany. Students in this sample were M_age = 11.41 (SD_age = .63) years old and 43.3% were female (0.7% had a missing
response on gender). In this sample, about 30.1% of the
students reported that they spoke a language other than German
at home.
In this feasibility study, we applied two
accommodated test versions that aimed at a) reducing the
difficulty of the test and b) reducing the test length (and
thereby increasing the testing time per item). In order to
discern whether test items do not function properly because
the accommodations change the test construct or whether
students with SEN-L still have problems with the test, we included a group of low-achieving students without SEN.
This group consisted of a separate sample of N = 490 students
enrolled in the lowest academic track, or Hauptschule. Students in this sample were M_age = 11.28 (SD_age = .63) years old and 48.4% were female. About 29.8% of the
students in the LAT spoke a language other than German at
home.
Focusing on this sample, we evaluated whether the
accommodated test versions yield reliable test scores and
whether they assess the same construct as the standard reading
test. Students without SEN were tested because, for this group, it has already been shown that reliable and valid competence assessments can be obtained using the standard reading test. Thus, we could investigate the impact of the testing accommodations and disentangle testing problems resulting from badly constructed accommodated test versions from testing problems resulting from the assessment of students with SEN-L.
We restricted our sample to students in the lowest academic
track, because the accommodated test versions were targeted
towards a lower competence level. For students in general
education in higher academic tracks, the test accommodations
would be too easy and, as a consequence of such low test
targeting, could result in aberrant response patterns (due to
motivation problems) as well as in low item discriminations
(due to the low variability in item responses). Implementing
this group of low-achieving students allowed us to investigate
whether the accommodated test versions generally result in
reliable and comparable measures of competence.
All students were tested in the middle of fifth
grade in November and December 2010. Data were collected by
the International Association for the Evaluation of
Educational Achievement (IEA) Data Processing and Research
Center (DPC). Students participated in the study voluntarily,
so student and parental consent was necessary. Each student
who participated in the study received 5 euros.
3.2 Measures and Procedures
Within all three samples, reading literacy as
well as mathematical competence was assessed. The orientation
towards the functionality and everyday relevance of the
competencies studied is one central aspect of the NEPS
framework for the assessment of competencies. It draws on the
concept of literacy in international comparative studies with
a focus on enabling participation in society (see OECD, 1999).
In this study, we focus on the assessment of reading literacy.
Within the NEPS, the reading competence assessment focuses on
text comprehension. All reading tests are developed based on a
framework for the assessment of reading competence (Gehrer,
Zimmermann, Artelt, & Weinert, 2013). This framework has
been developed based on theoretical and pragmatic
considerations that take earlier concepts and studies of
reading competence within large-scale assessments into
account. The most important dimensions within the framework
are text types, cognitive requirements, and task formats.
Concerning text types, texts with commenting, information,
literacy-aesthetic, instruction, and advertising functions are
included. The cognitive requirements comprise finding
information in the text, drawing text-related conclusions, and
reflecting and assessing. Across all age groups, the items in
the test are either simple multiple choice (MC) items, complex
MC items, or matching items. Complex multiple-choice (CMC)
items present a common stimulus followed by a number of MC
questions with two response options each. Matching (MA) items
consist of a common stimulus followed by a number of
statements, which require assigning a list of response options
to these statements (see Gehrer, Zimmermann, Artelt, &
Weinert (2012) for a full description of the framework
including information on text types, cognitive requirements,
item formats, and example items).
3.2.1 Standard reading test
The standard reading test was designed for
students enrolled in the regular school system. It was
developed based on the conceptual framework sketched above.
Students were asked to read five different texts and answer
questions focusing on the content of these texts (Gehrer,
Zimmermann, Artelt, & Weinert, 2013). The test for
students in fifth grade included a text about a continent
(information function), a recipe (instruction function), an
invitation (advertising function), a critical statement on a
societal topic (commenting function), and a fictive story
about a famous character (literacy-aesthetic function). In the
analysis of the standard reading test, 56 items were included, as subtasks of complex MC and matching items were treated as single items. When subtasks were combined, the standard reading test comprised 33 questions, which students had to complete within 30 minutes. For testing general education
students, the test has shown good psychometric properties
(Pohl, Haberkorn, Hardt, & Wiegand, 2012).
3.2.2 Reading test with accommodations
Based on the standard reading test, two
accommodated test versions were administered in this study. As
mentioned above, typical testing accommodations for students
with SEN-L include extended testing time and “out of level”
testing. Within the NEPS, time for testing a
domain-specific or domain-general competence is limited to 30
minutes. Under this restriction, we decided to develop one
accommodated test version by reducing test length (reduced test),
resulting in an increased testing time per item. One text and its respective nine items, plus an additional 10 difficult items, were removed. The text on the societal topic and its items were removed because these items had proven to be comparatively difficult in prior item analyses in samples of general education students. In order to facilitate scaling of the different test versions on the same scale, an
anchor item design (e.g., Kolen & Brennan, 2004) was used
for linking the different test versions. For this design, a sufficient number of items needs to be identical across all test forms. Therefore, in the reduced test, four texts and 37 items remained the same as in the standard reading test and functioned as anchor items in this design. We use the term "anchor item" for items that are identical in the standard test and in the accommodated test versions. As a result of
reducing the length of the test, one text function was left
out in the reduced test version (the commenting function).
Still, the anchor items represented all three cognitive
requirements. Note that while this accommodation mainly served
to reduce test length, it also reduced item difficulty.
We decided to
develop a second accommodated test version (easy test) that mainly aimed at reducing the difficulty of the
standard test. Therefore, three texts and their respective 37
items from the standard reading test were removed (the text
about the continent, the critical statement on a societal
topic, and the fictive story about a famous character). These
texts and their respective items were replaced with three texts
and 23 items that had been developed for younger children in
grade 3—including a text on the human body (information
function), a short story about a family (literacy-aesthetic
function), and an invitation (advertising function). This
procedure can be considered as some sort of “out-of-level”
testing. However, two texts remained the same as in the
standard reading test as we used an anchor item design. Based
on prior item analysis in samples of general education
students in grade 5, five especially difficult items were
eliminated from these texts. This procedure resulted in 12
overlapping items in the standard reading test and the easy
test version. These items were used as anchor items in this
design. In sum, the easy test version included 35 items.
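The resulting linking design can be summarized in a small sketch (a minimal illustration based only on the item counts reported above; the variable names are ours and not part of the NEPS materials):

# Linking design of the three reading test versions (item counts from the text above).
# Anchor items are items shared with the standard test; they place all versions on one scale.
standard_items = 56    # 33 questions; subtasks of complex MC and matching items counted separately
reduced_anchors = 37   # standard items retained in the reduced test
easy_anchors = 12      # standard items retained in the easy test
easy_new_items = 23    # grade-3 items added in the easy test
assert reduced_anchors == standard_items - 9 - 10  # one text (9 items) plus 10 difficult items removed
assert easy_anchors + easy_new_items == 35         # total length of the easy test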
Overall, 5,208 general education students
including 700 students from the lowest academic track were
tested with the standard reading test. Students with SEN-L
took the standard reading test (N = 176), the reduced
test (N = 173), or
the easy test (N =
84) by random assignment. The additional sample of N = 490 students
from the lowest academic track was randomly assigned to the
reduced test (N =
332) and the easy test (N
= 158). Note that the standard reading test was not
administered to this sample of students in the lowest academic
track. For investigating the appropriateness of the standard
reading test for students in the LAT, the subsample of the
main sample of general education students attending schools of
the lowest academic track was used (N = 700).
In order to control for fatigue and acquaintance
effects, the order of the different tests was rotated within
the booklet in almost all test versions. The standard test and
the easy test were administered either before or after a
mathematics test. For the analyses, due to sample size issues,
the test order was ignored and the different conditions were
analyzed together. Due to sample size limitations, there was
no rotation of the position of the reduced test; the reduced
test was only administered before the mathematics test. For
the comparison of estimated item difficulties with general
education students, data from these students refer to the same
position within the booklet as data of students with SEN-L or
the students in the LAT. So no bias is to be expected from
test position.
4. Analyses
4.1 The Model
We scaled the data within the framework of Item
Response Theory (IRT). In accordance with the scaling
procedure for competence data in the NEPS (Pohl &
Carstensen, 2012; 2013), we used a Rasch model (Rasch, 1960)
estimated in ConQuest (Wu, Adams, Wilson, & Haldane,
2007). In this model, a unidimensional measurement structure with equal loadings across items is assumed. Various fit indices
are available that describe the psychometric properties of the
tests.
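For reference, the Rasch model specifies the probability of a correct response of person v on item i as a function of the person's ability and the item's difficulty (standard notation, added here for illustration; it is not quoted from the NEPS documentation):

P(X_{vi} = 1 \mid \theta_v, b_i) = \frac{\exp(\theta_v - b_i)}{1 + \exp(\theta_v - b_i)},

where \theta_v is the ability of person v and b_i the difficulty of item i. The equal loadings mentioned above correspond to fixing all item discriminations to one.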
As described above, the reading test included
complex MC and matching items. These items consisted of a set
of subtasks that were aggregated to a polytomous variable in
the final scaling model in the NEPS. When aggregating the
responses on the subtasks to a single polytomous super-item,
we lose information on the single subtasks. Since in this
study we were interested in the fit of the items, we treated
the subtasks of complex MC and matching items as single
dichotomous items in the analyses. As such, we could not
account for possible local item dependence within each set of
subtasks. We applied the Rasch model to every test version
(standard test, reduced test, easy test) and sample (students
with SEN-L, students in the LAT).
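The difference between the two ways of handling subtasks can be illustrated with a short sketch (an assumed data layout for one complex MC item; this is not the actual NEPS scaling code):

# Subtask scores (0/1) of one complex multiple-choice item with four subtasks.
subtask_scores = [1, 0, 1, 1]

# NEPS scaling model: aggregate the subtasks to one polytomous super-item (score 0-4).
polytomous_super_item = sum(subtask_scores)   # here: 3

# Present study: keep each subtask as a separate dichotomous Rasch item.
dichotomous_items = list(subtask_scores)      # here: [1, 0, 1, 1]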
4.2 Item Fit
In order to investigate whether the standard test
and the accommodated reading tests reliably measured reading
competence, we evaluated different fit measures. These
included the weighted mean square (WMNSQ; Wright &
Masters, 1982), item discrimination, point-biserial
correlation of the distractors with the total score and the
empirically approximated item characteristic curve (ICC). All
of these measures provide information on how well the items
fit a unidimensional Rasch model.
As Wu (1997) showed, fit statistics depend on the
sample size. The larger the sample size, the smaller the WMNSQ
and the greater the t-value. Thus, since the group of students
with SEN-L differs in sample size from the group of students
in the LAT, we considered different evaluation criteria for
the interpretation of the WMNSQ. In this study, we report item
discrimination, which describes the point-biserial correlation
of the item with the total score (i.e., the relative number of correct responses out of the total number of valid responses). A
well-fitting item should have a high positive correlation—that
is, subjects with a high ability should score higher on the
item than subjects with a low ability. For an easier
interpretation, we report the discrimination not only in
absolute values, but classify the item fit regarding the
discrimination into acceptable item fit (discrimination >
.2), slight misfit (discrimination between .1 and .2) and
strong misfit (discrimination < .1). Furthermore,
point-biserial correlations of incorrect response options and
the total score are evaluated. The correlations of the
incorrect responses with the total score allow for a thorough
investigation of the performance of the distractors. A good
item fit would imply a negative or zero correlation of the
distractor with the total score. Distractors with a high
positive correlation may indicate an ambiguity in relation to
the correct response. Finally, empirically approximated item
characteristic curves (ICC) were considered. These describe
whether the number of correct responses corresponds to the
theoretically implied response probability at each competence
level.
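As a rough illustration of the discrimination index and the classification used here, consider the following sketch (our own code for illustration; the actual scaling was done in ConQuest, and the function and variable names are assumptions):

import numpy as np

def item_discriminations(responses):
    """Point-biserial correlation of each item with the total score.
    responses: persons x items array of 0/1 scores, with np.nan for missing responses."""
    total = np.nanmean(responses, axis=1)  # relative number of correct responses per person
    discr = np.empty(responses.shape[1])
    for i in range(responses.shape[1]):
        valid = ~np.isnan(responses[:, i])
        discr[i] = np.corrcoef(responses[valid, i], total[valid])[0, 1]
    return discr

def classify_fit(discrimination):
    """Cut-offs used in this article."""
    if discrimination > 0.2:
        return "acceptable item fit"
    if discrimination > 0.1:
        return "slight misfit"
    return "strong misfit"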
4.3 Measurement Invariance
Reading scores of students with SEN-L versus
students in general education can only be compared when the
tests are measurement invariant—that is, when there is no
differential item functioning (DIF). Measurement invariance is
furthermore a necessary assumption for linking the different
test forms. When measurement invariance holds—and thus there
is no DIF—the probability of endorsing an item is the same for
students with SEN-L and those without SEN-L who have the same
ability. The presence of DIF is an indication that the
respective reading test measures a different reading construct
for both target groups, and thus that the reading scores
between the target groups may not be compared.
We tested DIF for each test version (standard,
reduced, easy) and each target group (students with SEN-L,
students in the LAT) by comparing the estimated item
difficulties in the respective test version and target group
to the estimated item difficulty of the same items for
students in the main sample of the NEPS. Students with SEN-L
as well as students in the LAT were, thus, compared to general
education students in the main sample. There is one exception:
The group of students in the LAT was not tested with the
standard reading test. In order to estimate DIF for that group
on the standard test, we used the data of the students in the
lowest academic track of the main sample of general education
students. For this, we separated the main sample into students
in the lowest academic track and students in other tracks and
compared the estimated item difficulty between both groups.
We estimated DIF in a multi-facet IRT model,
estimating separate item difficulties for general education
students and for the respective target group. In line with the
benchmarks chosen in the NEPS (Pohl & Carstensen, 2012),
we considered absolute differences in item difficulties
greater than 0.6 to be noticeable and absolute differences
greater than 1 to be strong DIF. Note that these benchmarks
serve here as an orientation for interpretation. To get a
thorough picture, we also report the absolute DIF value. Also
note that while in the standard test DIF may be investigated
for all items, DIF in the reduced test and the easy test may
only be investigated for the anchor items. In the reduced test
and the easy test there are anchor items that allow linking of
the different test versions. As described above, there are 37
anchor items in the reduced test and 12 anchor items in the
easy test.
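The DIF benchmarks described above can be summarized in a brief sketch (our own illustration; the item difficulties themselves are estimated in the multi-facet IRT model, and all names are assumptions):

def classify_dif(b_reference, b_focal):
    """Compare estimated item difficulties (in logits) between the reference group
    (general education students) and a focal group (e.g., students with SEN-L).
    Both arguments map anchor-item IDs to their estimated difficulties."""
    classification = {}
    for item, b_ref in b_reference.items():
        if item not in b_focal:
            continue  # item is not an anchor item in this test version
        dif = abs(b_focal[item] - b_ref)
        if dif > 1.0:
            classification[item] = "strong DIF"
        elif dif > 0.6:
            classification[item] = "noticeable DIF"
        else:
            classification[item] = "no considerable DIF"
    return classification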
5. Results
In the following, we will first report the occurrence of missing values in each test form. Then we will present item fit for the different test forms and samples, followed by a further investigation of reasons for item misfit. Next, results on the comparability of test
scores are presented. The results on item fit and measurement
invariance are then considered together for evaluating the
appropriateness of the different test forms for assessing
competencies of students with SEN-L.
5.1 Missing Responses
Table 1 depicts the mean of the relative amount
of different kinds of missing responses for each of the target
groups and test versions. Similar to the main study (Pohl et
al., 2012), there is a large number of missing responses—on
average, up to 19% of the items are missing. The amount of
missing responses is larger in the students with SEN-L group
than in the group of students in the LAT for all test versions
and all types of missing responses.
Comparing the different test versions, the lowest
number of omitted items is found in the easy test version.
This is probably due to the fact that the easy test version
contains many easy items and that omission of items is related
to the difficulty of the item (see, e.g., Pohl, Gräfe, &
Rose, 2014). The lowest number of not reached items is found
in the reduced test version. Thus, the reduction of texts and
items to work on within the given assessment time does
increase the number of items reached. The lowest number of
invalid missing responses occurs in the reduced test version.
This is likely because the reduced test version contains fewer
matching items; this is the item format with the largest
number of invalid responses (Pohl et al., 2012).
Table 1
Averages of the Relative Frequency of Missing Responses (in %)

Type of missing response          | Test     | SEN-L M | LAT M
Omitted                           | standard | 6.72    | 4.78
                                  | reduced  | 5.20    | 2.70
                                  | easy     | 2.01    | 0.92
Not reached                       | standard | 10.46   | 9.45
                                  | reduced  | 3.90    | 1.04
                                  | easy     | 5.63    | 3.44
Invalid                           | standard | 1.13    | 0.44
                                  | reduced  | 0.48    | 0.18
                                  | easy     | 1.59    | 0.18
Total number of missing responses | standard | 18.31   | 14.67
                                  | reduced  | 9.58    | 3.92
                                  | easy     | 9.22    | 4.53

Note. SEN-L = Special educational needs in learning; LAT = Lowest academic track.
5.2 Item Fit
5.2.1 Standard test
First we analyzed item fit for the standard
reading test for students with SEN-L and students in the
lowest academic track. Overall, item discrimination is
relatively small for students with SEN-L. The mean item
discrimination is .25 (it is .34 in the lowest academic
track). Four items show a slight misfit (discrimination
between .1 and .2) and 10 items a strong misfit
(discrimination less than .1). In the lowest academic track,
there is only one item with a strong misfit and nine items
with a slight misfit.
Evaluation of further fit measures for students
with SEN-L confirms these results. Table 2 depicts the number
of misfitting items for the WMNSQ, ICC, and point-biserial
correlations. Summarizing these results, a large number of items in the standard test do not fit. The EAP-reliability of competence scores for students in the lowest academic track is sufficiently high (Rel = 0.823), while it is considerably lower for students with SEN-L (Rel = 0.652). We can conclude
that students with SEN-L may not be tested appropriately with
the standard reading test. In contrast, fit indices in the
lowest academic track indicate a relatively good item fit that
is comparable to the fit found in the main sample of general
education students (see Pohl et al., 2012 for the results in
the main study). The results indicate that the test is
appropriate not only for the main sample including students
attending higher academic tracks but also for low-performing
students.
Table 2
Number of Items With Misfit Indicated by Weighted Mean Square (WMNSQ), Item Characteristic Curve (ICC), and Point-Biserial Correlations

Fit measure                 | Test     | SEN-L | LAT
WMNSQ                       | Standard | 7     | 7
                            | Reduced  | 2     | 1
                            | Easy     | 1     | 3
ICC                         | Standard | 15    | 9
                            | Reduced  | 17    | 2
                            | Easy     | 12    | 1
Point-biserial correlations | Standard | 21    | 3
                            | Reduced  | 14    | 1
                            | Easy     | 5     | 0

Note. SEN-L = Special educational needs in learning; LAT = Lowest academic track.
5.2.2 Reduced test
The item discriminations of the items in the
reduced test version indicate a better item fit for students
in the LAT than for students with SEN-L. For both target
groups, the reduced test shows better item fit indices than
the standard test version. For students with SEN-L, there are
six items with a slight misfit (discrimination between .1 and
.2) and five items with a strong misfit (discrimination below
.1). Note that the items showing misfit in the standard reading test do not necessarily also show low discriminations in the reduced test. This may indicate that problems with testing students with SEN-L do not necessarily lie in the specific items, but may reflect other aspects of testing. The
mean item discrimination is .28. In contrast, for students in
the LAT the mean item discrimination is .47 and there is only
one item with a slight misfit and one item with a strong
misfit. Note that the item with the strong misfit was also
problematic in the main sample. Evaluation of the WMNSQ, the
ICCs, as well as of the point-biserial correlations of the
responses (see Table 2) corroborates these findings. The
results show that the items in the reduced test version have a
good item fit for students in the LAT. They have, however, an
insufficient fit in the students with SEN-L group.
Nevertheless, the item fit in the students with SEN-L group is
better for the reduced test than for the standard test. As in
the standard test, EAP-reliability was sufficiently high for
students in the lowest academic track (Rel = 0.850)
but it was not sufficient for students with SEN-L
(Rel = 0.525).
5.2.3 Easy test
The items in the easy test fit the data for both
target groups better than the standard test. For students with
SEN-L there are only four items with a slight misfit and three
items with a strong misfit. The mean item discrimination for
students with SEN-L is .30, while it is .46 for the students
in the LAT. In the LAT group, there is no item
with an unsatisfactory discrimination. Also the other fit
measures evaluated (see Table 2) show that the items in the
easy test version fit the model in the group of students in
the LAT but show some misfit in the students with SEN-L group.
The EAP-reliability for students in the lowest academic track
was high (Rel = 0.877), while it was not
satisfactory for students with SEN-L (Rel = 0.600). Compared to the other two test versions, the easy test version
shows the best model fit for students with SEN-L.
5.3 Investigation of Item Misfit
We further investigated the occurrence of item
misfit based on test characteristics. We did not find any
systematic relationship between item misfit and the different
dimensions of the conceptual framework of the reading test
(text function, cognitive requirements, and item format).
However, we did find a relationship between item misfit and
item difficulty.
5.3.1 Standard test
The correlation of the item difficulty estimated
in the main sample—thus being independent of the measurement
model in the SEN-L group—and item discrimination within the
students with SEN-L group is -.492. The more difficult an item, the lower the discrimination. This may be an
indication of disadvantageous test targeting—that is,
inappropriate item difficulties for this target group. The
items in the standard test are too difficult for students with
SEN-L (mean item difficulty with the mean of the reading
ability set to zero = 0.58 logits), while item difficulties
match the abilities of the students of the lowest academic
track well and are in fact rather easy (mean item difficulty
= -0.41 logits). Here, the correlation between item
difficulty estimated in the main sample and item
discrimination for students in the lowest academic track is
-.324. Note that since the measurement model of the standard
test in the lowest academic track was estimated based on a
subsample of the main sample, estimated item difficulty is not
independent of the estimated item discrimination in the sample
of students of the lowest academic track in the main sample.
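The notion of test targeting used here can be written compactly (our own shorthand, not a formula from the NEPS documentation): with the mean ability of the group constrained to zero, targeting is simply the mean estimated item difficulty,

\text{targeting} = \frac{1}{I}\sum_{i=1}^{I} b_i \qquad \text{with } \bar{\theta} = 0,

so that positive values (such as 0.58 logits for students with SEN-L) indicate a test that is, on average, too difficult for the group, while negative values (such as -0.41 logits in the lowest academic track) indicate a test that is, on average, rather easy.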
5.3.2 Reduced test
In the group of students in the LAT, item fit of the reduced test is not substantially correlated with item difficulty (cor = -.06), whereas it is considerably negatively correlated in the students with SEN-L group (cor = -.43). Within
students in the LAT, there is no relationship between item
difficulty and item misfit, while in the students with SEN-L
group, items with high difficulty show larger item misfit.
This may also be a result of the small variance in item
discrimination in the group of students in the LAT for this
test version. Test targeting shows that the reduced test is
still too difficult for students with SEN-L (mean item
difficulty = 0.43 logits) but too easy for students in the
lowest academic track of general education (mean item
difficulty = -1.03 logits).
5.3.3 Easy test
Since most of the items in the easy test are not
part of the standard test, we did not compute correlations
between item difficulty and item fit. However, we did
investigate test targeting. Here, the easy test
version is also too easy for students in the LAT (mean item
difficulty = -0.99 logits) and too hard for students with
SEN-L (mean item difficulty = 0.61 logits). Note that the easy
test version is even more difficult than the reduced test
version.
5.4 Measurement Invariance
5.4.1 Standard test
Table 3 shows the differences in estimated item difficulties, first, between general education students and students with SEN-L and, second, between students
in the lowest academic track and students in other tracks of
the main sample taking the standard test version. For students
with SEN-L, negative values in the table indicate a higher
item difficulty compared to general education students while
positive values indicate lower item difficulty. For students
in the lowest academic track, negative values indicate a
higher item difficulty for these students compared to students
in other tracks in the main sample and positive values
indicate a lower item difficulty.
Table 3
Differential Item Functioning (DIF) in the Different Test Versions and Student Groups

Item     | Difficulty | SEN-L Standard | SEN-L Reduced | SEN-L Easy | LAT Standard | LAT Reduced | LAT Easy
REG50110 | -1.909     | -1.010         | -0.942        |            | -0.304       | -0.256      |
REG50121 | -2.814     | -1.678         | -1.200        |            | -0.320       |  0.246      |
REG50122 | -2.063     | -0.926         | -0.800        |            | -0.276       | -0.360      |
REG50123 | -2.078     | -0.444         | -0.848        |            | -0.222       | -0.072      |
REG50124 | -2.236     | -0.510         | -0.930        |            | -0.140       | -0.246      |
REG50125 | -2.202     | -1.018         | -0.752        |            | -0.260       | -0.442      |
REG50126 | -1.793     | -0.652         | -0.512        |            |  0.018       |  0.858      |
REG50127 | -2.173     | -0.714         | -1.234        |            | -0.414       | -0.362      |
REG50130 | -0.805     | -0.850         | -0.090        |            | -0.068       | -0.106      |
REG50140 | -0.148     | -0.382         | -0.288        |            | -0.132       | -0.024      |
REG50150 |  0.874     | -0.400         |               |            |  0.200       |             |
REG50161 |  0.542     | -1.688         | -1.388        |            | -0.536       | -0.302      |
REG50162 |  0.149     | -1.020         | -0.130        |            | -0.310       | -0.686      |
REG50163 |  0.035     | -0.348         |  0.422        |            | -0.342       | -0.330      |
REG50164 | -0.076     | -1.320         | -0.774        |            | -0.466       | -0.116      |
REG50165 |  0.048     | -0.302         | -0.042        |            | -0.428       | -0.368      |
REG50170 |  2.351     |  0.570         |               |            | -0.294       |             |
REG50210 | -1.411     | -1.054         | -0.566        | -0.564     | -0.352       | -0.574      | -0.304
REG50220 |  1.436     |  1.200         |  1.602        |  1.360     |  0.576       |  0.414      |  0.490
REG50230 | -1.187     | -0.926         | -0.814        | -0.850     | -0.148       |  0.006      | -0.148
REG50240 |  0.050     | -0.082         |  0.146        |  0.232     | -0.094       | -0.044      | -0.226
REG50250 |  0.667     |  0.164         |  0.344        | -0.096     |  0.134       |  0.170      | -0.018
REG50261 | -1.352     | -0.318         |               |            | -0.204       |             |
REG50262 |  1.924     |  0.580         |               |            | -0.038       |             |
REG50263 |  2.159     |  0.172         |               |            |  0.088       |             |
REG50264 |  2.167     | -0.290         |               |            |  0.188       |             |
REG50265 |  2.195     |  0.724         |               |            |  0.180       |             |
REG50266 |  2.221     |  1.016         |               |            |  0.116       |             |
REG50310 | -0.867     | -0.824         | -1.254        | -0.756     | -0.318       | -0.142      | -0.444
REG50320 | -1.425     | -0.982         | -0.870        | -0.798     | -0.464       | -0.196      | -0.066
REG50330 | -1.185     | -1.654         | -1.632        | -1.020     | -0.440       | -0.106      | -0.154
REG50340 | -0.158     | -0.570         |  0.378        |  0.026     | -0.186       |  0.078      |  0.030
REG50350 |  0.838     |  0.082         |  0.310        |  0.420     |  0.102       |  0.028      |  0.222
REG50360 | -0.887     | -0.844         | -0.324        | -0.324     | -0.274       | -0.062      | -0.130
REG50370 |  0.140     | -0.256         |  0.318        | -0.058     |  0.020       |  0.288      | -0.100
REG50410 |  0.885     |  0.370         |               |            |  0.206       |             |
REG50421 | -0.481     |  0.042         |               |            |  0.342       |             |
REG50422 | -0.225     |  0.380         |               |            |  0.666       |             |
REG50423 |  0.243     |  1.268         |               |            |  0.536       |             |
REG50430 |  2.371     |  0.772         |               |            |  0.080       |             |
REG50452 |  0.531     |  1.586         |               |            |  0.590       |             |
REG50440 |  1.922     |  1.264         |               |            |  0.374       |             |
REG50451 |  0.183     |  1.526         |               |            |  0.716       |             |
REG50460 |  1.356     |  0.436         |               |            |  0.100       |             |
REG50510 | -0.898     | -0.532         | -0.618        |            | -0.594       |  0.044      |
REG50521 | -0.313     | -0.052         |  0.878        |            | -0.366       | -0.058      |
REG50522 | -0.635     | -0.156         |  0.966        |            | -0.080       |  0.416      |
REG50523 | -0.004     |  0.366         |  1.066        |            |  0.214       |  0.250      |
REG50524 | -0.634     |  0.256         |  0.872        |            | -0.318       |  0.704      |
REG50530 |  1.487     |  0.770         |               |            |  0.206       |             |
REG50540 |  0.064     | -0.262         |  0.748        |            | -0.428       | -0.096      |
REG50551 | -0.035     |  0.064         | -0.756        |            |  0.030       | -0.090      |
REG50552 |  1.135     |  0.188         |  1.214        |            | -0.184       | -0.108      |
REG50553 |  0.385     |  0.210         |  0.540        |            | -0.452       |  0.214      |
REG50560 |  1.125     |  0.716         |  0.938        |            |  0.624       |  0.666      |
REG50570 |  0.515     | -0.334         |  0.392        |            | -0.330       |  0.206      |

Note. SEN-L = Special educational needs in learning; LAT = Lowest academic track.
The results clearly show measurement invariance
for students in the lowest academic track and large
differences in estimated item difficulties for students with
SEN-L. For students in the lowest academic track, of the 56
items there is no item with strong DIF (absolute difference in
item difficulties greater than 1) and only three items with
slight DIF (absolute difference in item difficulties between
0.6 and 1). For students with SEN-L there are 12 items with
slight DIF and 14 items with strong DIF. The results indicate
that measurement invariance holds for students in the lowest
academic track but that the test measures a different
construct for the group of students with SEN-L compared to
general education students. Thus, reading test scores for
students with SEN-L are not comparable to test scores for
general education students.
5.4.2 Reduced test
Table 3 also shows DIF for the accommodated test
versions. In the reduced test, for students with SEN-L, 15 out
of 38 items have slight DIF and eight items have strong DIF.
Only 15 items show no considerable DIF. Thus, the measurement
of reading competence with the reduced test is different from
that of general education students with the standard test.
This does, however, not seem to be a result of the test
accommodation. Within the group of students in the LAT, measurement invariance holds, as only three items show slight
DIF. The results indicate that for students with SEN-L the
measurement model, and thus, the measured construct, is
different from that of students in general education.
5.4.3 Easy test
In the easy test, for students with SEN-L three
out of twelve anchor items show noticeable DIF and two items
show strong DIF. There are only seven items with no noticeable
DIF. In contrast, in the LAT group there is no noteworthy DIF
in the easy test and only four items show slight DIF in the
reduced test. While measurement invariance may be assumed for
the students in the LAT, it does not hold for students with
SEN-L. Again, differences in the measurement model do not seem
to be induced by the test accommodation, but rather reflect a
specific testing problem of students with SEN-L.
5.5 Item Fit and Measurement Invariance
Considering both criteria—item fit and
measurement invariance—how many items with good psychometric
properties are left within the different groups and test
versions? Is it possible to construct a test out of
well-fitting items? Figure 1 shows the discrimination and DIF
of the items in the standard test version for students with
SEN-L (a) and for students of the lowest academic track (b).
The grey lines give the rules of thumb for the evaluation of
the items. Items with a discrimination > .2 and an absolute DIF < 0.6 show no noticeable misfit or DIF. Items with a discrimination between .1 and .2 and/or an absolute DIF between 0.6 and 1 show noticeable but not considerable misfit and/or DIF. Items with a discrimination < .1 and/or an absolute DIF > 1 show considerable misfit and/or DIF. These items should not be used for testing.
Figure 1a) shows that a considerable number of items do not
meet the fit and DIF criteria in the SEN-L group. Only 22 out
of 56 items show good fit and DIF indices. Thirteen items show
a slight misfit in at least one of the two criteria and 21
items exceed at least one of the criteria for a strong misfit
or large DIF. There are obviously not many items left that
meet the criteria of a good test. For students of the lowest
academic track (Figure 1b), there is only one item with a
slight misfit in either of the two criteria and seven items
with a strong deviation from at least one of the two criteria.
Thus, there are 48 items that meet the criteria of a good test
in the lowest academic track group of the main sample.
Figure 1. Discrimination and differential item functioning of the items in the regular test: a) students with SEN-L; b) students in the lowest academic track of the main sample. SEN-L = Special educational needs in learning. (see pdf)
In the reduced test (see Figure 2a), for students
with SEN-L, 13 out of 38 items show a strong misfit and/or
DIF, 16 items show a slight deviation from at least one of the
two criteria and only nine items are suitable for testing
considering both
criteria. There are a high number of items that may not be
used on a test. Again, in the LAT group the items fit both
criteria very well (Figure 2b). Only one out of 38 items needs
to be excluded due to strong misfit or DIF, and only three
items show a slight misfit and/or DIF. Thirty-four items meet
the criteria of fit and measurement invariance. The low DIF
values in the LAT group provide evidence in support of the
argument that reducing the test length (i.e., increasing the
testing time per text and item) does not threaten the
comparability of the results. Thus, reducing test length may
be an appropriate accommodation. However, this accommodation
is not sufficient to reliably and comparably measure reading
competence for students with SEN-L.
Figure 2. Discrimination and differential item functioning of the items in the reduced test: a) students with SEN-L; b) students in the lowest academic track. SEN-L = Special educational needs in learning. (see pdf)
Since there are only 12 items in the easy test
that may be tested for DIF, we refrained from plotting the
different evaluation criteria for this test version. It may,
however, be noted that of the 12 items, four show a slight misfit or DIF and two show a strong one. Only
six of the 12 anchor items meet the criteria of fit and DIF.
Since linking may only be done using 12 items, losing six
items due to fit and DIF problems raises questions as to the
appropriateness of this accommodated test version for the
group of students with SEN-L. As a comparison, in the LAT
group there is only one of these 12 items with a slight misfit
and one with a strong misfit. The results in the LAT group are
an indication that reducing the difficulty of the test does
result in reliable and comparable reading competence measures.
However, this test accommodation alone is not sufficient for appropriately assessing students with SEN-L.
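The linking procedure itself is not spelled out in this section; as a rough, hypothetical sketch of why the number of usable anchor items matters, a simple mean/mean shift between two forms (cf. Kolen & Brennan, 2004) can be written as follows. All anchor difficulties are invented, and the sketch is not the study's actual linking method.

    def linking_constant(anchor_ref, anchor_acc):
        # Shift that places the accommodated form on the reference scale by
        # aligning the mean difficulty of the common anchor items.
        return sum(anchor_ref) / len(anchor_ref) - sum(anchor_acc) / len(anchor_acc)

    # Twelve hypothetical anchor difficulties per form; excluding misfitting
    # anchors leaves fewer common items and makes the constant less stable.
    ref = [-1.2, -0.8, -0.5, -0.3, -0.1, 0.0, 0.2, 0.4, 0.6, 0.9, 1.1, 1.4]
    acc = [-1.0, -0.6, -0.4, -0.1, 0.1, 0.2, 0.5, 0.6, 0.8, 1.1, 1.3, 1.6]
    print(round(linking_constant(ref, acc), 3))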
6.
Discussion
The present research dealt with the question of
how competencies of students with SEN-L may be assessed
reliably and comparably to general education students. We
assessed the reading competence of students with SEN-L using a
standard reading test, a reduced reading test, and an easy
reading test. We used a group of low-achieving students
without SEN to test whether the test accommodations alter the
measured construct. The results showed that all three reading
test versions are suitable for a reliable and comparable
measurement of reading competence in students without SEN.
Reducing both test length and item difficulty resulted in
reliable measures that are comparable to those of a standard
test for general education students. For students with SEN-L,
the accommodated test versions considerably reduced the amount
of missing values. They did not, however, show satisfactory item fit and measurement invariance. Although the testing accommodations improved item fit and measurement invariance
for students with SEN-L as compared to using a standard
reading test, there are still many items unsuitable for a
reliable and comparable assessment of reading competence in
students with SEN-L. Thus, the competence scores assessed by
the tests in this study are neither suitable for a substantive
interpretation of the competence level of students with SEN-L,
nor may they be used for a valid comparison of competence
levels between students with SEN-L and students in general
education.
Concerning the testing accommodations implemented
in this study, the reduced
test primarily aimed at compensating for
information-processing restrictions in students with SEN-L
(e.g., for slow processing speed) while the easy test primarily
aimed at adapting the test to a reduced competence level in
reading (by reducing test difficulty in general), thereby improving the accuracy of measurement and avoiding undue frustration for students with SEN-L. Since we showed—within
the group of students in the LAT—that the items in the
accommodated test versions have a good fit, we may conclude
that the misfit in the group of SEN-L students is not due to
badly constructed items or to the fact that the test versions
changed the measured construct. Misfit of items in the SEN
sample must be due to problems in testing this specific target
group. Our analyses on test targeting showed that even the
accommodated test versions are too difficult for students with
SEN-L. Since item fit became better for accommodated versions,
which were composed of easier items than the standard test, we
hypothesize that a further reduction in item difficulty may
help to improve testing of students with SEN-L. This
hypothesis is corroborated by the negative correlation of item
difficulty and discrimination. Still, both testing
accommodations focus on general problems faced by students
with SEN-L when reading (slow processing speed, reduced
competence level in reading). In future research, it would be
desirable to identify more specific reading problems of
students with SEN-L that can be addressed in testing
accommodations. Another explanation for item misfit in the
sample of students with SEN-L may lie in the test-taking
behavior (such as guessing or item omission, see Pohl,
Südkamp, Hardt, Carstensen, & Weinert, 2015). It is also
possible that differences in item fit between the students in
the LAT and the students with SEN-L are due to differences in
school curricula.
Comparing the three test versions—the standard
test, the reduced test, and the easy test—in the LAT group,
the accommodated test versions resulted in better competence
measures than the standard test. For students with SEN-L, the
easy test showed the best results regarding item fit, test
targeting, and DIF. Since in the reading test, items are
grouped into sets belonging to different texts, constructing a reading test from well-fitting and measurement invariant items is a difficult endeavor. This is different in other
competence domains of the NEPS that do not have such a strong
testlet structure (see Weinert et al., 2011, for a description
of the tests).
6.1
Strengths and Limitations
Studying the effects of testing accommodations
not only in groups of students with SEN-L but also in groups
of students in general education (here: low-performing
students) is a promising approach to the identification of
appropriate testing accommodations. In many previous studies,
accommodated test versions were only applied to students with
SEN. Thus, one could not disentangle whether poor psychometric properties of accommodated tests and changes in the measured construct were due to the testing accommodations or to testability
problems of students with SEN. Using the LAT group allowed us
to investigate whether the applied testing accommodations
generally provide reliable and measurement invariant measures
of reading competence. With the results in the group of LAT
students, we ruled out the possibility that the misfit and lack of measurement invariance for students with SEN-L are due to changes in the measured construct resulting from a reduction in test length
or reduction in item difficulty. Considering the wide range of
competence levels of students in general education, students
in the LAT are the group of students without SEN that is closest in competence level to students with SEN. Thus, the accommodated test versions—which are targeted towards students
with SEN-L—will still be better targeted to students in the
LAT than to all students in general education.
The study’s strength also lies in the use of a
sophisticated methodological approach and the evaluation of
various measures of item fit in addition to differential item
functioning. Other studies on the assessment of students with SEN that use IRT methods mainly report DIF but leave
out information on item fit in the sample of students with SEN
(Abedi, Leon, & Kao, 2008; Bolt & Ysseldyke, 2008).
Considering the group of students with SEN-L,
using data from a relatively large representative sample
allows us to draw credible conclusions. However, our samples
of students with SEN-L and students in the LAT group
considerably differed in their size. There were about twice as
many students in the LAT group as in the group of students with SEN-L. For some testing conditions, the sample was
comparatively small. For example, only 84 students with SEN-L
were assessed with the reduced test version. Due to the large
number of missing responses, there were items with just 52
valid responses. Fit and DIF measures may, as a consequence,
be unreliable. We tried to account for this in the evaluation
of the fit and DIF criteria.
One might also argue that the group of students
with SEN-L is still a highly heterogeneous one, including, for
example, students with different performance and ability
profiles in the cognitive domain. Compared to prior research,
however, the target population is rather homogeneous as
students with SEN in areas other than learning (e.g., those
with physical impairments) are excluded. Other studies
investigated the appropriateness of competence assessments in even
more heterogeneous groups of students (e.g., Lutkus et al.,
2004, including students with disabilities in general).
Possible testing problems may, however, only occur for
students with specific disabilities (e.g., for students with
SEN-L, but not for students with visual impairments) or for
specific testing accommodations. Analyzing the whole group of
students with disabilities and running analyses across all
types of testing accommodations may mask possible testing
effects. In our study we focused on a specific group of
students with SEN and analyzed different testing
accommodations separately.
Item misfit and DIF need not be caused by all students with SEN-L; they may be driven by only a subgroup of students. However, we did not account for interindividual
differences within our samples in this study. In ongoing
research, we (Pohl et al., 2015) use a person-based approach
and try to empirically identify groups of students with SEN-L
whose assessment is especially challenging. Here, we assume
that individual student characteristics (e.g., individual test-taking strategies, cognitive performance profiles) are related
to testability[2].
6.2
Implications and Future Research
Incorporating easy instead of hard items into a test (e.g., as done in the easy test version) is, methodologically speaking, a form of adaptive testing. Adaptive testing is currently being discussed in large-scale studies such as
the NAEP (Xu, Sikali, Oranje, & Kulick, 2011), the
Programme for International Student Assessment (PISA; Pearson,
2011), and the NEPS (Pohl, 2014). If better test targeting is
one of the key issues for testing students with SEN-L,
adaptive testing procedures for general education students may
well be extended to include students with SEN-L. One way to
systematically reduce difficulty in reading tests for students
with SEN might be a reduction in grammatical and lexical
complexity of texts and items (Abedi et al., 2011). In
upcoming feasibility studies within the NEPS, seventh graders
with SEN-L will be tested with a standard reading test that is
reduced in grammatical and lexical complexity. In another
feasibility study in grade 3 we will examine the effects of
newly developed test instructions on students’ test
performance, missing values, and invalid answers, as well as
on their motivation, and test anxiety.
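As a hypothetical illustration of the routing idea behind such adaptive or multi-stage designs (the cutoff and module names are invented and do not describe the NEPS instruments), a minimal two-stage assignment might look like this:

    def route_module(routing_score: int, cutoff: int = 4) -> str:
        # Assign the second-stage module from a short first-stage raw score.
        return "easy module" if routing_score < cutoff else "standard module"

    for score in (2, 4, 7):
        print(score, "->", route_module(score))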
There are numerous and manifold arguments for the
inclusion of students with SEN in large-scale assessments.
However, the issue of whether students with SEN-L may be
assessed reliably and comparably in large-scale assessments
—and if so, how—remains an important and complex question. In our study, we aim to present a sophisticated
design and a comprehensive methodological approach to these
questions and to shed light on them. We think that the
systematic identification of specific testing accommodations
for groups of students with SEN is a promising approach.
Keypoints
So far, data on the acquisition
and development of competencies of students with special
educational needs in learning (SEN-L) are rare.
Assessing competencies
of students with special educational needs within large-scale assessments is challenging.
This study addresses the question
of whether and how satisfactory item fit measures and measurement invariant test scores can be obtained for students with SEN-L in large-scale assessments.
Testing accommodations may result in reliable competence measures that are comparable to those of the standard test.
The investigated
testing accommodations helped to some extent to increase the
testability of students with SEN-L.
The systematic
identification of further appropriate testing accommodations
is a promising approach to the assessment of students with
SEN-L.
Acknowledgments
This paper uses data from the National
Educational Panel Study (NEPS). From 2008 to 2013, NEPS data
were collected as part of the Framework Program for the
Promotion of Empirical Educational Research funded by the
German Federal Ministry of Education and Research (BMBF). As
of 2014, the NEPS survey is carried out by the Leibniz
Institute for Educational Trajectories (LIfBi) at the
University of Bamberg in cooperation with a nationwide
network.
We especially thank Cordula Artelt, Claus H.
Carstensen, Jana Heydrich, Lena Nusser, and Markus
Messingschlager for their contribution to this study. Our
thanks also go to the staff of the NEPS administration of
surveys and to the methods group.
We would also like to thank the anonymous
reviewers for their comments on earlier versions of the
manuscript and Erika Fisher for copy editing services.
References
Abedi, J.,
Leon, S., & Kao, J. (2008). Examining
differential item functioning in reading assessments for
students with disabilities. (CRESST Report 744). Los
Angeles, CA: University of California, Los Angeles, National
Center for Research on Evaluation, Standards, and Student
Testing.
Abedi, J.,
Leon, S., Kao, J., Bayley, R., Ewers, N., Herman, J., &
Mundhenk, K. (2011). Accessible
reading assessments for students with disabilities: The role
of cognitive, grammatical, lexical, and textual/visual
features (CRESST Report 785). Los Angeles, CA:
University of California, Los Angeles, National Center for
Research on Evaluation, Standards, and Student Testing.
American
Educational Research Association, American Psychological
Association, & National Council on Measurement in
Education (1999). Standards
for educational and psychological testing. Washington,
DC: American Educational Research Association.
Aßmann, C., Steinhauer, H. W., Kiesl, H., Koch, S.,
Schönberger, B., Müller-Kuller, A., … Blossfeld,
H.-P. (2011). Sampling designs of the National Educational
Panel Study: Challenges and solutions. Zeitschrift für
Erziehungswissenschaft, 14, 51-65.
doi:10.1007/s11618-011-0181-8
Aßmann, C., Steinhauer, H. W., & Zinn, S. (2012).
Weighting
the fifth and ninth grader cohort samples of the National
Educational Panel Study, panel cohorts (Technical Report).
Bamberg, Germany: University of Bamberg, National Educational Panel Study. Retrieved from
https://www.nepsdata.de/Portals/0/NEPS/Datenzentrum/Forschungsdaten/SC3/1-0-0/SC3_SC4_1-0-0_Weighting_EN.pdf.
Bäumer, T., Preis, N., Roßbach, H.-G., Stecher, L.,
& Klieme, E. (2011). Education processes in
life-course-specific learning environments. Zeitschrift für
Erziehungswissenschaft, 14, 87-101.
doi:10.1007/s11618-011-0183-6
Barkow, I., Leopold, T., Raab, M., Schiller, D.,
Wenzig, K., Blossfeld, H.-P., & Rittberger, M. (2011).
RemoteNEPS: Data dissemination in a collaborative workspace. Zeitschrift für
Erziehungswissenschaft, 14, 315-325. doi:
10.1007/s11618-011-0192-5
Bielinski, J., Thurlow, M. L., Ysseldyke, J. E.,
Freidebach, J., & Freidebach, M. (2001). Read-aloud accommodations:
Effects on multiple-choice reading and math items (NCEO
Technical Report 31). Minneapolis, MN: University of Minnesota,
National Center on Educational Outcomes.
Blossfeld, H.-P., & von Maurice, J. (2011). Education as a lifelong
process. Zeitschrift
für Erziehungswissenschaft, 14, 19-34.
doi:10.1007/s11618-011-0179-2
Blossfeld, H.-P., von Maurice, J., & Schneider,
T. (2011). The National Educational Panel Study: Need, main
features, and research potential. Zeitschrift für
Erziehungswissenschaft, 14, 5-17.
doi:10.1007/s11618-011-0178-3
Bolt, S. E.,
& Ysseldyke, J. (2008). Accommodating students with
disabilities in large-scale testing: A comparison of
differential item functioning (DIF) identified across
disability types. Journal
of Psychoeducational Assessment, 26, 121-138.
doi:10.1177/0734282907307703
Borsboom, D.
(2006). The attack of the psychometricians. Psychometrika,
71, 425-440. doi: 10.1007/s11336-006-1447-6
Bos, W., Bonsen, M., Gröhlich, C., Guill, K., May,
P., Rau, A., et al. (2009). KESS 7: Kompetenzen und
Einstellungen von Schülerinnen und Schülern—Jahrgangsstufe 7
[KESS 7: Competencies and attitudes of students in grade 7].
Hamburg, Germany: Behörde für Bildung und Sport.
Chudowsky,
N., & Pellegrino, J. (2003). Large-scale assessment that
support student learning: What will it take? Theory into Practice, 42,
75-83. doi:10.1207/s15430421tip4201_10
Cormier, D.
C., Altman, J., Shyyan, V., & Thurlow, M. L. (2010). A summary of the
research on the effects of test accommodations: 2007-2008
(Technical Report 56). Minneapolis, MN: University of
Minnesota, National Center on Educational Outcomes.
Cortiella,
C., & Horowitz, S. H. (2014). The state of learning
disabilities: Facts, trends and emerging issues. New
York: National Center for Learning Disabilities.
Dolan, R.
P., & Hall, T. E. (2001). Universal design for learning:
Implications for large-scale assessment. IDA Perspectives, 27,
22-25.
Duke, N. K.,
& Pearson, P. D. (2002). Effective practices for
developing reading comprehension. In A. E. Farstrup & S.
J. Samuels (Eds.), What
research has to say about reading instruction (pp.
205–242). Newark, DE: International Reading Association.
Durkin, D.
(1993). Teaching them
to read. Boston, MA: Allyn and Bacon.
Fuchs, L. S., Fuchs, D.,
Eaton, S. B., Hamlett, C. L., & Karns, K. M. (2000). Supplementing
teacher judgments of mathematics test accommodations with
objective data. School
Psychology Review, 29, 65–85.
Gee, J. P. (2004).
Reading as situated language: A sociocognitive perspective.
In R. B. Ruddell & N. J. Unrau (Eds.), Theoretical models and
processes of reading (pp. 116-132). Newark:
International Reading Association.
Gehrer, K., Zimmermann, S.,
Artelt, C., & Weinert, S. (2013). NEPS framework for
assessing reading competence and results from an adult pilot
study. Journal of
Educational Research Online, 5, 50-79.
Gehrer, K.,
Zimmermann, S., Artelt, C., & Weinert, S. (2012). The assessment of
reading competence (including sample items for grade 5 and
9) [Scientific Use File 2012, Version 1.0.0.] Bamberg:
University of Bamberg, National Educational Panel Study.
Geisinger,
K. F. (1994). Psychometric issues in testing students with
disabilities. Applied
Measurement in Education, 7, 121-140.
doi:10.1207/s15324818ame0702_2
Gersten, R.,
Fuchs, L. S., Williams, J. P., & Baker, S. (2001).
Teaching reading comprehension strategies to students with
learning disabilities: A review of research. Review of Educational
Research, 71, 279-320. doi:10.3102/00346543071002279
Gross, C.,
Jobst, A., Jungbauer-Gans, M., & Schwarze, J. (2011). Educational
returns over the life course. Zeitschrift für
Erziehungswissenschaft, 14, 139-153.
doi:10.1007/s11618-011-0195-2
Grünke, M. (2004). Lernbehinderung [Learning
Disabilities]. In Lauth, G.,
Grünke, M., & Brunstein, J. (Eds.). Interventionen bei
Lernstörungen [Interventions for learning disorders] (pp. 65-77).
Göttingen: Hogrefe.
Heydrich, J., Weinert, S., Nusser, L., Artelt, C.,
& Carstensen, C. H. (2013). Including
students with special educational needs into large-scale
assessments of competencies: Challenges and approaches with
the German National Educational Panel Study (NEPS). Journal of Educational
Research Online, 5, 217-240.
Hollenbeck,
K., Tindal, G., & Almond, P. (1998). Teachers’ knowledge of
accommodations as a validity issue in high-stakes testing. The Journal of Special
Education, 32, 175-183.
Kavale, K.
A., & Reece, J. H. (1992). The character of learning
disabilities. Learning
Disability Quarterly, 15, 74-94. doi:
http://dx.doi.org/10.2307/1511010
Kintsch, W.
(2007). Comprehension:
A paradigm for cognition. Cambridge, UK: Cambridge
University Press.
KMK – Sekretariat der Ständigen Konferenz der
Kultusminister der Länder in der Bundesrepublik Deutschland
[Standing Conference of the Ministers of Education and Cultural
Affairs of Germany] (2012). Sonderpädagogische
Förderung in Schulen 2001–2010 [Special education in
schools 2001–2010]. Retrieved from
http://www.kmk.org/fileadmin/pdf/Statistik/KomStat/Dokumentation_SoPaeFoe_2010.pdf
Kolen M. J., & Brennan R. L. (2004). Test equating,
scaling, and linking. New York, NY: Springer-Verlag.
Koretz, D.
M. (1997). The
assessment of students with disabilities in Kentucky
(CSE Technical Report 431). Los Angeles, CA: CRESST/RAND
Institute on Education and Training.
Koretz, D. M., & Barton, K. E. (2003). Assessing
students with disabilities: Issues and evidence (CSE
Technical Report 587). Los Angeles, CA: University of
California, Center for the Study of Evaluation.
Kristen, C., Edele, A., Kalter, F., Kogan, I.,
Schulz, B., Stanat, P., & Will, G. (2011). The
education of migrants and their children across the life
course. Zeitschrift für
Erziehungswissenschaft, 14, 121-137.
doi:10.1007/s11618-011-0194-3
Lovett, B.
J. (2010). Extended time testing accommodations for students
with disabilities: Answers to five fundamental questions. Review of Educational
Research, 80, 611-638. doi:10.3102/0034654310364063
Lutkus, A.
D., Mazzeo, J., Zhang, J., & Jerry, L. (2004). Including special-needs
students in the NAEP 1998 reading assessment part II:
Results for students with disabilities and limited-English
proficient students (Research Report ETS-NAEP 04-R01).
Princeton, NJ: ETS.
Millsap, R.
E. (2011). Statistical
approaches to measurement invariance. New York, NY:
Routledge.
Minnema, J.,
Thurlow, M., Bielinski, J., & Scott, J. (2000). Past and present
understandings of out-of-level testing: A research
synthesis. (Out-of-Level Testing Project Report 1).
Minneapolis, MN: University of Minnesota, National Center on
Educational Outcomes. Retrieved
from http://education.umn.edu/NCEO/OnlinePubs/OOLT1.html
Müller, K., Sälzer, C., Mang, J., & Prenzel, M.
(2014, March). Kompetenzen
von Schülerinnen und Schüler mit besonderem Förderbedarf. Ergebnisse aus dem PISA
2012 Förderschul-Oversample [Competencies
of students with special educational needs. Results from the
PISA 2012 oversample of special schools]. Paper presented at
the Conference of the German Association for Empirical
Educational Research, Frankfurt, Germany.
OECD –
Organisation for Economic Co-Operation and Development.
(1999). Measuring
student knowledge and skills: A new framework for
assessment. Paris, France: OECD.
Pearson
(2011, October 7th).
Pearson to develop
frameworks for OECD’s PISA student assessment for 2015
[Pearson announcement]. Retrieved from
http://www.pearson.com/news/2011/october/pearson-to-develop-frameworks-for-oecds-pisa-student-assessment-f.html?article=true
Pellegrino,
J., Chudowsky, N., & Glaser, R. (2001). Knowing what students
know: The science and design of educational assessment.
Washington, D. C.: National Academy Press.
Pitoniak, M.
J., & Royer, J. M. (2001). Testing accommodations for
examinees with disabilities: A review of psychometric, legal,
and social policy issues. Review of Educational
Research, 71, 53-104. doi:10.3102/00346543071001053
Pohl, S.
(2014). Longitudinal multi-stage testing. Journal of Educational
Measurement, 50, 447-468.
doi: 10.1111/jedm.12028
Pohl, S., & Carstensen, C. H. (2012). NEPS
Technical Report: Scaling the data of the competence test (NEPS
Working Paper No. 14). Bamberg, Germany: University of
Bamberg, National Educational Panel Study.
Pohl, S., & Carstensen, C. H. (2013). Scaling the competence
tests in the National Educational Panel Study—Many questions,
some answers, and further challenges. Journal for Educational
Research Online, 5, 189-216.
Pohl, S.,
Gräfe, L., & Rose, N. (2014). Dealing with omitted and not
reached items in competence tests - Evaluating approaches
accounting for missing responses in IRT models. Educational and
Psychological Measurement, 74, 423-452. doi:
10.1177/0013164413504926
Pohl, S.,
Haberkorn, K., Hardt, K., & Wiegand, E. (2012). NEPS technical report
for reading—Scaling results of starting cohort 3 in fifth
grade (NEPS Working Paper No. 15). Bamberg, Germany:
University of Bamberg, National Educational Panel Study.
Pohl, S.,
Südkamp, A., Hardt, K., Carstensen, C. H., & Weinert, S.
(2015). Testability and
test-taking behavior of students with special educational
needs in large-scale assessments. Manuscript submitted
for publication.
Popham, W.
J. (2000). Educational
measurement. Boston, MA: Allyn and Bacon.
Rasch, G.
(1960). Probabilistic
models for some intelligence and attainment tests.
Copenhagen: Nielsen & Lydiche (Expanded Edition, Chicago,
University of Chicago Press, 1980).
Ritchey, K.
D., Silverman, R. D., Schatschneider, C., & Speece, D. L.
(2015). Prediction and stability of reading problems in middle
childhood. Journal of
Learning Disabilities, 48, 298-309.
doi:10.1177/0022219413498116
Sireci, S.
G., Scarpati, S. E., & Li, S. (2005). Test accommodations
for students with disabilities: An analysis of the interaction
hypothesis. Review of
Educational Research, 75, 457-490.
doi:10.3102/00346543075004457
Stocké, V.,
Blossfeld, H.-P., Hoenig, K., & Sixt, M. (2011). Social
inequality and educational decisions in the life course. Zeitschrift für
Erziehungswissenschaft, 14, 103-199.
doi:10.1007/s11618-011-0193-4
Swanson, H. L.
(1999). Reading research for students with LD: A meta-analysis
of intervention outcomes. Journal of Learning
Disabilities, 32, 504-532. doi:10.1177/002221949903200605
Thompson, S.
J., Johnstone, C. J., Anderson, M. E., & Miller, N. A.
(2005). Considerations
for the development and review of universally designed
assessments (NCEO Technical Report 42). Minneapolis, MN:
University of Minnesota, National Center on Educational
Outcomes.
Thurlow, M.
L. (2010). Steps toward creating fully accessible reading
assessments. Applied
Measurement in Education, 23, 121-131.
doi:10.1080/08957341003673765
Thurlow, M.
L., Bremer, C., & Albus, D. (2008). Good news and bad news in
disaggregated subgroup reporting to the public on 2005-2006
assessment results (Technical Report 52). Minneapolis,
MN: University of Minnesota, National Center on Educational
Outcomes.
Thurlow, M.,
Elliott, J., & Ysseldyke, J. (1999). Out-of-level testing:
Pros and cons (Policy Directions No. 9). Minneapolis,
MN: University of Minnesota, National Center on Educational
Outcomes. Retrieved from
http://education.umn.edu/NCEO/OnlinePubs/Policy9.htm
Tindal, G.,
Heath, B., Hollenbeck, K., Almond, P., & Harniss, M.
(1998). Accommodating students with disabilities on
large-scale tests: An experimental study. Exceptional Children, 64,
439–450.
U.S.
Department of Education, National Center for Education
Statistics. (2013). Digest of Education
Statistics, 2012 (NCES 2014-015).
Verhoeven,
L., & van Leeuwe, J. (2008). Prediction of the development
of reading comprehension: A longitudinal study. Applied Cognitive
Psychology, 22, 407-423. doi:10.1002/acp.1414
Weinert, F.
E. (2001). Concept of competence: A conceptual clarification.
In D. S. Rychen & L. H. Salganik (Eds.), Defining and selecting
key competencies (pp. 45-66). Seattle: Hogrefe
& Huber.
Weinert, S., Artelt, C., Prenzel, M., Senkbeil, M.,
Ehmke, T., & Carstensen, C. H. (2011). Development
of competencies across the life span. Zeitschrift für
Erziehungswissenschaft, 14, 67-86.
doi:10.1007/s11618-011-0182-7
Woodcock,
S., & Vialle, W. (2011). Are we exacerbating students’
learning disabilities? An investigation of pre-service
teachers’ attributions of the educational outcomes of students
with learning disabilities. Annals of Dyslexia, 61, 223-241.
doi:10.1007/s11881-011-0058-9
Wright, B.
D., & Masters, G. N. (1982). Rating scale analysis:
Rasch measurement. Chicago, IL: MESA Press.
Wu, M.
(1997). The development and
application of a fit test for use with marginal maximum
likelihood estimation and generalized item response models
(Unpublished doctoral dissertation). Melbourne, Australia:
University of Melbourne.
Wu, M.,
Adams, R. J., Wilson, M., & Haldane, S. (2007). ConQuest 2.0 [Computer software]. Camberwell, Australia: ACER Press.
Wu, Y.-C.,
Liu, K. K., Thurlow, M. L., Lazarus, S. S., Altman, J., &
Christian, E. (2012). Characteristics
of low performing special education and non-special
education students on large-scale assessments (Technical
Report 60). Minneapolis, MN: University of Minnesota, National
Center on Educational Outcomes.
Xu, X., Sikali, E., Oranje, A., & Kulick, E.
(2011, April). Multi-stage
testing in educational survey assessments.
Paper presented at the annual meeting of the National Council
on Measurement in Education (NCME), New Orleans, LA.
Yovanoff,
P., & Tindal, G. (2007). Scaling early reading alternate
assessments with statewide measures. Exceptional Children, 73,
184-201.
Ysseldyke,
J. E., Thurlow, M. L., Langenfeld, K. L., Nelson, R. J.,
Teelucksingh, E., & Seyfarth, A. (1998). Educational results for
students with disabilities: What do the data tell us?
(Technical Report 23). Minneapolis, MN: University of
Minnesota, National Center on Educational Outcomes.
Zebehazy, K. T., Zigmond, N., & Zimmerman, G. J.
(2012). Ability or access-ability: Differential item functioning
of items on alternate performance-based assessment tests for
students with visual impairments. Journal of Visual
Impairment & Blindness, 106, 325-338.
[1]
As for the term “learning disabilities”, the term SEN-L is
not clearly defined. Note that we refer to a heterogeneous
group of students with multifaceted etiology.
[2] In the present study, differences
in test taking in students with and without SEN-L might
also be caused by differences in school curricula. This
alternative hypothesis could be tested by comparing
students with SEN-L attending general education and
special schools. However, in Germany only few students
with SEN-L attended general education schools at the time
of data collection and these students often differ in
individual as well as in social background characteristics
from students attending special schools.