Higher Order Testlet Response Models for Hierarchical Latent Traits and Testlet-Based Items
Educational and Psychological Measurement, 73(3), 491–511
© The Author(s) 2012
Reprints and permissions: sagepub.com/journalsPermissions.nav
DOI: 10.1177/0013164412454431
epm.sagepub.com

Hung-Yu Huang¹ and Wen-Chung Wang²
Abstract
Both testlet design and hierarchical latent traits are fairly common in educational and
psychological measurements. This study aimed to develop a new class of higher
order testlet response models that consider both local item dependence within testlets and a hierarchy of latent traits. Due to high dimensionality, the authors adopted
the Bayesian approach implemented in the WinBUGS freeware for parameter estimation. A series of simulations were conducted to evaluate parameter recovery,
consequences of model misspecification, and effectiveness of model–data fit statistics.
Results show that the parameters of the new models can be recovered well. Ignoring
the testlet effect led to a biased estimation of item parameters, underestimation of
factor loadings, and overestimation of test reliability for the first-order latent traits.
The Bayesian deviance information criterion and the posterior predictive model
checking were helpful for model comparison and model–data fit assessment. Two
empirical examples of ability tests and nonability tests are given.
Keywords
item response theory, testlet response theory, higher order models, hierarchical
latent trait, Bayesian methods
¹ Taipei Municipal University of Education, Taipei, Taiwan
² The Hong Kong Institute of Education, Hong Kong, Hong Kong SAR

Corresponding Author:
Wen-Chung Wang, 10 Lo Ping Road, Tai Po, New Territories, Hong Kong SAR.
Email: wcwang@ied.edu.hk

Testlet design, in which a set of items share a common stimulus (e.g., a reading
comprehension passage or a figure), has been widely used in educational and
psychological tests. Many test developers find testlet design attractive because of its
efficiency in item writing and administration. However, testlet design poses a challenge to standard item response theory (IRT) models, because items within a testlet
are connected with the same stimulus such that the usual assumption of local item
independence in standard IRT models may not hold. Fitting standard IRT models to
such data leads to biased parameter estimation and overestimation of test reliability
(Thissen, Steinberg, & Mooney, 1989; Wainer & Lukhele, 1997; Wainer & Thissen,
1996; Wainer & Wang, 2000). Various testlet response models have been developed
to account for the local dependence among items within a testlet by adding a set of
random-effect parameters to standard IRT models, one for each testlet (Bradlow,
Wainer, & Wang, 1999; Wainer, Bradlow, & Du, 2000; Wainer, Bradlow, & Wang,
2007; W.-C. Wang & Wilson, 2005).
In the human sciences, latent traits may be hierarchical. For example, a hierarchical structure orders mental abilities with g (general ability) at the top of the hierarchy
and other broad abilities (e.g., mathematical, spatial-mechanical, and verbal reasoning) lower down. Specifically, the Wechsler Adult Intelligence Scale measures a total
of three orders of latent traits. At the first order, there are 13 subtests, each measuring
a specific ability. At the second order, these 13 specific abilities are grouped into four
broad abilities: verbal comprehension, perceptual reasoning, working memory, and
processing speed. At the third order, these four broad abilities constitute a general
mental ability (Ryan & Schnakenberg-Ott, 2003). The SAT Reasoning Test consists
of three sections: critical reading, mathematics, and writing. Each section receives a
score on a scale of 200 to 800. Total scores are calculated by adding up the scores of
the three sections. One can treat the tests as measuring three first-order latent traits
(critical reading, mathematics, and writing) and one second-order latent trait (scholastic aptitude). Similarly, the Graduate Record Examinations consist of three sections of verbal, quantitative, and analytical writing. One can treat the examinations as
measuring three first-order latent traits (verbal reasoning, quantitative reasoning, and
analytical writing skills) and one second-order latent trait (scholastic aptitude).
Recently, several higher order IRT models for hierarchical latent traits have been
developed. These include the cognitive diagnosis models by de la Torre and Douglas
(2004), multidimensional models with a hierarchical structure by Sheng and Wikle
(2008), and higher order IRT models by de la Torre and Song (2009), de la Torre
and Hong (2010), and Huang, Wang, and Chen (2010). In the IRT literature, testlet
response models and higher order IRT models were developed separately. Since hierarchical latent traits may be measured by testlet-based items, it is of great value to
develop a class of IRT models that consider both testlet design and hierarchical
structure in latent traits, which is the main purpose of this study. This article is organized as follows. First, higher order IRT models are briefly introduced and a new
class of higher order testlet response models is formulated. Second, parameter estimation and model–data fit assessment for the new class of models are described.
Third, simulations are conducted to assess parameter recovery, effects of model misspecification, and model–data fit indices using the WinBUGS computer program
(Spiegelhalter, Thomas, & Best, 2003), and the results are summarized. Fourth, two
empirical examples of achievement and personality assessments are presented to
demonstrate the applications of the new class of higher order testlet response models.
Fifth, conclusions are drawn and suggestions for future study are provided.
Higher Order IRT Models
Assume the hierarchy of latent traits is in two orders. Each of the first-order latent
traits is measured by a set of unidimensional items, and these first-order latent traits
are governed by the same second-order latent trait. Let $u_n^{(2)}$ be the second-order latent trait and $u_{nv}^{(1)}$ be the vth first-order latent trait for person n. The relationship between these two orders of latent traits is assumed to be

$$u_{nv}^{(1)} = b_v u_n^{(2)} + e_{nv}^{(1)}, \tag{1}$$

where $e_{nv}^{(1)}$ is assumed to be normally distributed with mean zero and independent of other es and us, and $b_v$ is the regression weight (factor loading) between the second-order latent trait and the vth first-order latent trait. More orders of latent traits (e.g., the third order) can be readily generalized.
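As an illustration of Equation 1, the following minimal Python sketch draws second-order and first-order traits. It is not code from the study: the function name `simulate_first_order` is hypothetical, NumPy is assumed, and the residual variance is taken to be 1 − b_v² so that each first-order trait has unit variance (an assumption chosen to match a standard normal scaling of the traits).

```python
import numpy as np

def simulate_first_order(loadings, n_persons, seed=0):
    """Draw traits following u1[n, v] = b[v] * u2[n] + e[n, v] (Equation 1)."""
    rng = np.random.default_rng(seed)
    b = np.asarray(loadings, dtype=float)        # one loading b_v per first-order trait
    u2 = rng.standard_normal(n_persons)          # second-order trait, N(0, 1)
    resid_sd = np.sqrt(1.0 - b**2)               # assumption: unit-variance first-order traits
    e = rng.standard_normal((n_persons, b.size)) * resid_sd
    u1 = u2[:, None] * b + e
    return u2, u1

u2, u1 = simulate_first_order([0.9, 0.8, 0.7], n_persons=20000)
# Under this scaling the loading equals the correlation between the two orders.
corr = [float(np.corrcoef(u2, u1[:, v])[0, 1]) for v in range(u1.shape[1])]
```

With unit-variance traits, the sample correlations in `corr` should fall close to the loadings supplied.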
For a dichotomous item measuring the first-order latent trait, the item response
function can be the one-, two-, or three-parameter logistic model (Birnbaum, 1968;
Rasch, 1960) or other IRT functions. In the three-parameter logistic model, the probability of a correct response to item i in test v for person n is
$$P_{ni1v} = p_{iv} + (1 - p_{iv}) \times \frac{\exp[a_{iv}(u_{nv}^{(1)} - d_{iv})]}{1 + \exp[a_{iv}(u_{nv}^{(1)} - d_{iv})]}, \tag{2}$$

where $a_{iv}$ is the slope (discrimination) parameter, $d_{iv}$ is the location (difficulty) parameter, and $p_{iv}$ is the asymptotic (pseudo-guessing) parameter of item i in test v. Combining Equations 1 and 2 leads to

$$P_{ni1v} = p_{iv} + (1 - p_{iv}) \times \frac{\exp[a_{iv}(b_v u_n^{(2)} - d_{iv} + e_{nv}^{(1)})]}{1 + \exp[a_{iv}(b_v u_n^{(2)} - d_{iv} + e_{nv}^{(1)})]}. \tag{3}$$
If the item response function follows the two-parameter logistic model, piv in
Equation 3 becomes zero. If it follows the one-parameter logistic model, piv and aiv
in Equation 3 become zero and one, respectively. Equation 3 is referred to as the
three-parameter higher order IRT model (3P-HIRT). If the item response function is
the two-parameter logistic model, then it is called the 2P-HIRT. If the item response
function is the one-parameter logistic model, then it is called the 1P-HIRT (de la
Torre & Hong, 2010; Huang et al., 2010).
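The 3P-HIRT response function and its reductions can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `p_correct_3p_hirt` and its argument names are hypothetical, and NumPy is assumed.

```python
import numpy as np

def p_correct_3p_hirt(u2, e1, b, a, d, p):
    """Probability of a correct response under the 3P-HIRT (Equation 3)."""
    logit = a * (b * u2 - d + e1)
    # exp(x) / (1 + exp(x)) written in the equivalent form 1 / (1 + exp(-x))
    return p + (1.0 - p) / (1.0 + np.exp(-logit))

# Setting p = 0 gives the 2P-HIRT; setting p = 0 and a = 1 gives the 1P-HIRT.
p_1p = p_correct_3p_hirt(u2=0.0, e1=0.0, b=0.9, a=1.0, d=0.0, p=0.0)
p_3p = p_correct_3p_hirt(u2=0.0, e1=0.0, b=0.9, a=1.0, d=0.0, p=0.2)
```

For a person located exactly at the item difficulty, the pseudo-guessing parameter raises the success probability from one half to p + (1 − p)/2.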
For a polytomous item measuring the first-order latent trait, the item response
function can be the generalized partial credit model (Muraki, 1992), the partial credit
model (Masters, 1982), the rating scale model (Andrich, 1978), or the graded
response model (Samejima, 1969). Take the generalized partial credit model as an
example. The log odds are defined as
$$\log\left(\frac{P_{nijv}}{P_{ni(j-1)v}}\right) = a_{iv}(u_{nv}^{(1)} - d_{iv} - t_{ijv}), \tag{4}$$

where $P_{nijv}$ and $P_{ni(j-1)v}$ are the probabilities of scoring j and j − 1 on item i in test v for person n, respectively; $d_{iv}$ is the overall difficulty of item i in test v; $t_{ijv}$ is the jth threshold parameter of item i in test v; and the others are defined as above.
Combining Equations 1 and 4 leads to
$$\log\left(\frac{P_{nijv}}{P_{ni(j-1)v}}\right) = a_{iv}(b_v u_n^{(2)} - d_{iv} - t_{ijv} + e_{nv}^{(1)}). \tag{5}$$
If the item response function follows the partial credit model, then aiv = 1. If it
follows the rating scale model, then aiv = 1 and tijv = tjv. If it follows the graded
response model, then cumulative logit should be used as the link function and a common set of step parameters are used across items. The corresponding higher order
IRT model can be easily formed (Huang et al., 2010). When the item response function is the generalized partial credit model, the corresponding higher order IRT
model is denoted as the GP-HIRT. Likewise, when item response function is the partial credit model, rating scale model, or graded response model, the corresponding
higher order IRT model is denoted as the PC-HIRT, RS-HIRT, or GR-HIRT,
respectively.
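The adjacent-category log odds in Equation 4 determine the full set of category probabilities. A minimal Python sketch of that conversion follows; the function name `gpcm_probs` is hypothetical, NumPy is assumed, and the first-order trait value is passed in directly (under Equation 5 it would be b_v u_n^(2) + e_nv^(1)).

```python
import numpy as np

def gpcm_probs(u1, a, d, t):
    """Category probabilities for scores 0..J from the log odds in Equation 4."""
    t = np.asarray(t, dtype=float)
    steps = a * (u1 - d - t)                           # log(P_j / P_{j-1}) for j = 1..J
    log_num = np.concatenate(([0.0], np.cumsum(steps)))  # score 0 is the reference
    log_num -= log_num.max()                           # guard against overflow
    weights = np.exp(log_num)
    return weights / weights.sum()

probs = gpcm_probs(u1=0.3, a=1.2, d=0.1, t=[-1.0, 0.0, 1.0])
```

Setting `a = 1` gives the partial credit model, and sharing `t` across items gives the rating scale model.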
Higher Order Testlet Response Models
Assume that testlet-based items are used and the testlet effect exists in each testlet. To
account for the testlet effect, one can add a random-effect parameter to Equation 2
(dichotomous item) or Equation 4 (polytomous item):
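$$P_{ni1v} = p_{iv} + (1 - p_{iv}) \times \frac{\exp[a_{iv}(u_{nv}^{(1)} - d_{iv} + g_{nd(i)v})]}{1 + \exp[a_{iv}(u_{nv}^{(1)} - d_{iv} + g_{nd(i)v})]}, \tag{6}$$

$$\log\left(\frac{P_{nijv}}{P_{ni(j-1)v}}\right) = a_{iv}(u_{nv}^{(1)} - d_{iv} - t_{ijv} + g_{nd(i)v}), \tag{7}$$
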
where gnd(i)v is an additional latent trait to account for local dependence among items
within testlet d of test v and is assumed to be normally distributed and independent
of u and other gs (Wainer et al., 2007; W.-C. Wang & Wilson, 2005). The g variance
depicts the magnitude of the testlet effect: the larger the variance, the stronger the
testlet effect is. When the latent traits are hierarchical, one can combine Equations 1
and 6 or Equations 1 and 7:
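$$P_{ni1v} = p_{iv} + (1 - p_{iv}) \times \frac{\exp[a_{iv}(b_v u_n^{(2)} - d_{iv} + e_{nv}^{(1)} + g_{nd(i)v})]}{1 + \exp[a_{iv}(b_v u_n^{(2)} - d_{iv} + e_{nv}^{(1)} + g_{nd(i)v})]}, \tag{8}$$

$$\log\left(\frac{P_{nijv}}{P_{ni(j-1)v}}\right) = a_{iv}(b_v u_n^{(2)} - d_{iv} - t_{ijv} + e_{nv}^{(1)} + g_{nd(i)v}). \tag{9}$$
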
Equation 8 is referred to as the three-parameter higher order testlet response model
(3P-HTM). If piv = 0 for all iv, then the 3P-HTM is reduced to the two-parameter
higher order testlet response model (2P-HTM). If piv = 0 and aiv = 1 for all iv, then
the 3P-HTM is reduced to the one-parameter higher order testlet response model (1P-HTM). Equation 9 is referred to as the generalized partial credit higher order testlet
response model (GP-HTM), because its item response function is the generalized partial credit model. Likewise, if the item response function is the partial credit model, rating
scale model, or graded response model, then it is denoted as the PC-HTM, RS-HTM,
or GR-HTM, respectively.
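To illustrate how the testlet effect induces dependence between items, the sketch below simulates two items from the same testlet under the simplest case of Equation 8 (a = 1, p = 0, d = 0). It is a hedged illustration only: the function names, the loading of 0.9, the sample size, and the unit-variance scaling of the first-order trait are all assumptions made for the example.

```python
import numpy as np

def simulate_testlet_pair(testlet_var, b=0.9, n_persons=20000, seed=0):
    """Simulate two 1P-HTM items (a = 1, p = 0, d = 0) sharing one testlet."""
    rng = np.random.default_rng(seed)
    u2 = rng.standard_normal(n_persons)                         # second-order trait
    e1 = rng.standard_normal(n_persons) * np.sqrt(1.0 - b**2)   # first-order residual
    g = rng.standard_normal(n_persons) * np.sqrt(testlet_var)   # testlet effect
    logit = b * u2 + e1 + g                                     # Equation 8 with a = 1, d = 0
    prob = 1.0 / (1.0 + np.exp(-logit))
    # Both items share the same success probability but have independent noise.
    return (rng.random((n_persons, 2)) < prob[:, None]).astype(int)

def pair_correlation(responses):
    """Pearson correlation between the two simulated item scores."""
    return float(np.corrcoef(responses[:, 0], responses[:, 1])[0, 1])

r_no_testlet = pair_correlation(simulate_testlet_pair(testlet_var=0.0))
r_testlet = pair_correlation(simulate_testlet_pair(testlet_var=1.0))
```

A larger testlet variance should yield a larger inter-item correlation, which is the extra dependence that a model without g cannot represent.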
In the new class of HTM, the item response function is not limited to a specific
function. Users are allowed to assign any function to any item. The HTM can accommodate the situation where a test consists of both independent items and testlet-based
items and both dichotomous items and polytomous items. In addition, it is feasible to
assign different item response functions to different items, even in the same test. For
example, some dichotomous items may follow the one-parameter logistic model, and
others follow the two-parameter or three-parameter logistic model. Some polytomous
items follow the rating scale model, and others follow the generalized partial credit
model. Users are also allowed to create their own customized item response
functions.
For model identification, one can set $u^{(2)} \sim N(0, 1)$ and $u_{nv}^{(1)} \sim N(0, 1)$ so that $b_v$
can be interpreted as the correlation between $u^{(2)}$ and $u_v^{(1)}$, as in traditional linear factor analysis. When the item response function follows the family of Rasch models,
the common slope parameter across all items in the same test can be freely estimated.
MCMC Estimation and Bayesian Model–Data Fit Assessment
The new class of higher order testlet response models often involves many random-effect parameters: u, g, and e. Although it is possible to develop likelihood-based
estimation methods for higher order testlet response models, these methods become
computationally burdensome because of high-dimensional numerical integration and
may fail to converge. Bayesian estimation with the Markov chain Monte Carlo
(MCMC) methods, which have been widely used in complicated IRT models (Bolt
& Lall, 2003; Bradlow et al., 1999; Fox & Glas, 2003; Kang & Cohen, 2007; Klein
Entink, Fox, & van der Linden, 2009; Wainer et al., 2000, 2007), were thus used to
estimate the parameters. In Bayesian estimation, specifications of a statistical model,
prior distributions of model parameters, and observed data are required to produce a
joint posterior distribution for the model parameters. MCMC methods provide an
alternative and simple way to simulate the joint posterior distribution of the unknown
quantities and obtain simulation-based estimates of posterior parameters of interest.
The Metropolis-Hastings and the Gibbs sampling algorithms (Gelfand & Smith,
1990; Geman & Geman, 1984) are two major procedures for constructing a Markov
chain whose stationary distribution is the target (posterior) density. In this study, we used
the WinBUGS freeware (Spiegelhalter et al., 2003) to simulate Markov chains, and
the mean of the joint posterior distribution was treated as the parameter estimate.
WinBUGS is flexible enough to allow users to specify different models for different items within or between tests. For example, a test may consist of both dichotomous items and polytomous items, in which some dichotomous items follow the
1PLM, some dichotomous items follow the 3PLM, some polytomous items follow
the partial credit model, and some polytomous items follow the rating scale model.
A test may consist of both independent items (no testlet effects) and testlet-based
items, exclusively independent items, or exclusively testlet-based items. All these
combinations can be easily formulated in WinBUGS.
Posterior predictive model checking (PPMC; Gelman, Meng, & Stern, 1996) is a
Bayesian model–data fit checking technique to assess the plausibility of posterior
predictive replicated data against observed data, which has the advantage of a strong
theoretical basis and an intuitively appealing simplicity applied with numerical evidence. Let y be the observed data and $y^{rep}$ be the replicated data. The posterior predictive density function of $y^{rep}$ given model parameter v can be given by

$$P(y^{rep} \mid y) = \int P(y^{rep} \mid v) P(v \mid y)\, dv. \tag{10}$$

A test statistic T is chosen to detect the systematic discrepancy between the
observed data and the replicated data. The posterior predictive p value is defined as
follows to summarize the comparison between the two test statistics over a large
number of iterations:
$$p \equiv P[T(y^{rep}) \ge T(y) \mid y] = \int_{T(y^{rep}) \ge T(y)} P(y^{rep} \mid y)\, dy^{rep}. \tag{11}$$

An extreme p value (close to 0 or 1) indicates a poor model–data fit.
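In practice the integral in Equation 11 is approximated by the proportion of MCMC draws in which the replicated statistic is at least as large as the observed one. The sketch below shows that Monte Carlo approximation; the function name `posterior_predictive_p` is hypothetical, and the input arrays are assumed to have been computed elsewhere, one value per retained draw.

```python
import numpy as np

def posterior_predictive_p(t_rep, t_obs):
    """Share of draws with T(y_rep) >= T(y); approximates Equation 11."""
    t_rep = np.asarray(t_rep, dtype=float)
    t_obs = np.asarray(t_obs, dtype=float)   # scalar, or one value per draw
    return float(np.mean(t_rep >= t_obs))

# Toy values only: four draws of the replicated statistic against T(y) = 2.
ppp = posterior_predictive_p([1.0, 2.0, 3.0, 4.0], 2.0)
```

Passing `t_obs` as an array allows for discrepancy measures that depend on the parameters and therefore change from draw to draw.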
According to the nature of the higher order testlet response model (HTM), four
statistics were chosen to assess model-data fit in this study. The first one is the
Bayesian chi-square test (Sinharay, Johnson, & Stern, 2006), which evaluates the
overall model–data fit:
$$\sum_n \sum_i \frac{[y_{ni} - E(y_{ni} \mid v)]^2}{\mathrm{Var}(y_{ni} \mid v)}, \tag{12}$$

where subscript n is for person and subscript i is for item. The second statistic evaluates factor structure. It compares the reproduced correlation matrix with the original
correlation matrix. If the difference between the two matrices is huge, then the factors
do not account for a great deal of the variance in the original correlation matrix, and
thus, the model–data fit is poor.
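For dichotomous responses, the Bayesian chi-square in Equation 12 can be evaluated at a single posterior draw as sketched below. This is an assumption-laden illustration: the function name `bayesian_chi_square` is hypothetical, and the expectation and variance are taken to be the Bernoulli values P and P(1 − P) implied by the model probabilities at that draw.

```python
import numpy as np

def bayesian_chi_square(y, prob):
    """Equation 12 for 0/1 data: sum of squared standardized residuals."""
    y = np.asarray(y, dtype=float)         # persons x items matrix of 0/1 scores
    prob = np.asarray(prob, dtype=float)   # model-implied success probabilities
    return float(np.sum((y - prob) ** 2 / (prob * (1.0 - prob))))

# Toy data: two persons, two items, all success probabilities equal to 0.5.
chi2 = bayesian_chi_square([[1, 0], [0, 1]], [[0.5, 0.5], [0.5, 0.5]])
```

Evaluating the same function on observed and replicated data at each draw gives the pair of statistics compared in the posterior predictive p value.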
The third and fourth statistics measure the association among item pairs within a
testlet for dichotomous items and polytomous items, respectively. For dichotomous
items, we compute the odds ratios between item pairs within a testlet to depict uncontrolled dependence between items. Let $N_{kk'}$ be the number of persons scoring k on
the first item and k′ on the second item. The odds ratio (OR) between an item pair is
given by

$$\mathrm{OR} = \frac{N_{11} N_{00}}{N_{10} N_{01}}. \tag{13}$$

For polytomous items, the Spearman rank correlation between item pairs within a
testlet is computed to assess between-item association. If any systematic discrepancy
is found between the odds ratios (or the rank correlations) obtained from the observed
data and the replicated data, misfit due to the unexplained variance from item dependence is detected (Sinharay et al., 2006). To reduce the computational burden and
concentrate on testlet effects, only the odds ratios and the correlations for item pairs
within a testlet are computed, and those across testlets are not computed.
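The odds ratio in Equation 13 can be computed from two vectors of 0/1 item scores as in the sketch below. The function name `odds_ratio` is hypothetical, and the handling of empty cells (returning infinity or NaN) is a choice made for this example rather than anything specified in the text.

```python
import numpy as np

def odds_ratio(item1, item2):
    """Odds ratio N11 * N00 / (N10 * N01) for two dichotomous items."""
    x1 = np.asarray(item1)
    x2 = np.asarray(item2)
    n11 = int(np.sum((x1 == 1) & (x2 == 1)))
    n00 = int(np.sum((x1 == 0) & (x2 == 0)))
    n10 = int(np.sum((x1 == 1) & (x2 == 0)))
    n01 = int(np.sum((x1 == 0) & (x2 == 1)))
    if n10 * n01 == 0:
        # Example-only convention for empty off-diagonal cells.
        return float("inf") if n11 * n00 > 0 else float("nan")
    return n11 * n00 / (n10 * n01)

# Toy data: eight persons answering two items from the same testlet.
or_value = odds_ratio([1, 1, 1, 0, 0, 0, 1, 0], [1, 1, 0, 0, 0, 1, 1, 0])
```

In a posterior predictive check, the same function would be applied to the observed data and to each replicated data set for every within-testlet item pair.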
For model comparison, one can use the Bayesian deviance information criterion
(DIC), which simultaneously takes into account model fit and model complexity
(Spiegelhalter, Best, Carlin, & van der Linde, 2002):
$$\mathrm{DIC} = \bar{D} + P_D, \tag{14}$$

where $\bar{D}$ is the posterior expectation of the deviance and $P_D$ is defined as the difference between the posterior mean of the deviance and the deviance at the posterior
mean of the parameters.
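Equation 14 can be evaluated directly from MCMC output, as in the following sketch. The function name `dic` is hypothetical; it assumes the deviance has been recorded at every retained draw and evaluated once at the posterior means of the parameters.

```python
import numpy as np

def dic(deviance_draws, deviance_at_posterior_mean):
    """Return (DIC, P_D) following Equation 14."""
    d_bar = float(np.mean(deviance_draws))        # posterior mean of the deviance
    p_d = d_bar - deviance_at_posterior_mean      # effective number of parameters
    return d_bar + p_d, p_d

# Toy numbers only, not results from the study.
dic_value, p_d = dic([100.0, 102.0, 104.0], 98.0)
```

When two models are fit to the same data, the one with the smaller DIC is preferred.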
Method
Simulation Design
The simulations comprised two major parts: dichotomous items and polytomous
items. Each part had three first-order latent traits and one second-order latent trait.
Each test measuring the first-order latent trait consisted of either 20 dichotomous
testlet-based items or 20 four-point polytomous testlet-based items. All the g variances (testlet effects) were set at one, suggesting a rather large testlet effect. The
number of testlets in the 20-item test was either two (each testlet had 10 items—a
long testlet) or four (each testlet had 5 items—a short testlet). Although the g variances were all equal across testlets, their effects were stronger in the long testlets.
For dichotomous items, the item response functions in each test followed the one-parameter, two-parameter, or three-parameter testlet response model. For polytomous
items, the item response functions in each test followed the generalized partial credit,
partial credit, rating scale, or graded response testlet response model. The factor loadings (the correlations between the second-order latent trait and the three first-order
latent traits) were set at .9, .8, and .7, respectively.
In the dichotomous items, the difficulty parameters were sampled from U(−2, 2),
the slope parameters were sampled from N(1, 0.25) and truncated above 0, and the
pseudo-guessing parameters were sampled from U(0, 0.3). In the polytomous items,
the step difficulty parameters were generated from U(−2.5, 2.5), and the location
parameters were generated from U(−1.5, 1.5) with −1, 0, and 1 as the category
threshold parameters, respectively. The specifications of item parameters were in line
with those commonly found in practice. The item parameters were generated independently, consistent with the IRT literature. If necessary, multivariate normal distribution can be assumed for the item parameters (van der Linden, Klein Entink, &
Fox, 2010). A total of 5,000 persons were generated when the item response function
of dichotomous items followed the three-parameter model. However, a total of 2,000
persons were generated when the item response function followed the other models
because a large sample size is required for estimating the pseudo-guessing parameters. All the variances of latent traits were set at one.
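The generation of dichotomous item parameters described above can be sketched in Python as follows. The authors used their own Matlab program, so this is only an illustrative stand-in: the function name is hypothetical, the second argument of N(1, 0.25) is assumed to be a variance (standard deviation 0.5), and truncation is done by redrawing nonpositive slopes.

```python
import numpy as np

def generate_dichotomous_items(n_items, seed=0):
    """Draw difficulty, slope, and pseudo-guessing parameters for one test."""
    rng = np.random.default_rng(seed)
    d = rng.uniform(-2.0, 2.0, n_items)          # difficulty ~ U(-2, 2)
    a = rng.normal(1.0, 0.5, n_items)            # slope ~ N(1, 0.25), variance assumed
    while np.any(a <= 0.0):                      # keep only positive slopes
        bad = a <= 0.0
        a[bad] = rng.normal(1.0, 0.5, int(bad.sum()))
    p = rng.uniform(0.0, 0.3, n_items)           # pseudo-guessing ~ U(0, 0.3)
    return d, a, p

d, a, p = generate_dichotomous_items(20)
```

Twenty items per test matches the test length in the simulation design.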
The simulated data sets were analyzed in two ways: First, the data-generating
model was also the analysis model (i.e., the HTM was fit to HTM data). Second, the
standard higher order IRT model (HIRT) was fit to HTM data. We wrote a Matlab
computer program to generate item responses. Ten replications were made under
each condition, mainly because each replication could take several days of computer
time, and there were many conditions. In our experience, 10 replications appeared to
be sufficient to gain reliable inferences because the sampling variation across replications appeared to be rather small. Other studies that used Bayesian estimation for
complicated IRT models also used 10 replications or fewer (Bolt & Lall, 2003;
Cao & Stokes, 2008; Klein Entink et al., 2009; Li, Bolt, & Fu, 2006; van der Linden
et al., 2010).
Analysis
The freeware WinBUGS 1.4 with MCMC methods was used to estimate the parameters. We specified a normal prior with mean 0 and variance 4 for all the location and
threshold parameters, a lognormal prior with mean 0 and variance 1 for the slope
parameters, a beta prior with both hyperparameters equal to 1 for the pseudo-guessing
parameters, a normal prior with mean 0.5 and variance 10 accommodating a truncated
distribution between 0 and 1 for the factor loadings, and a gamma prior with both
hyperparameters equal to 0.01 for the inverse of g variances (testlet effects). A total
of 15,000 iterations were obtained, with the first 5,000 iterations discarded as the burn-in. These
settings were consistent with the literature when WinBUGS was used to fit IRT
models.
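To get a feel for the prior specifications listed above, one can draw from them directly. The sketch below is not WinBUGS code and makes several assumptions: the lognormal mean and variance are taken to be on the log scale, the second gamma hyperparameter is taken to be a rate, the truncated normal is drawn by simple rejection sampling, and the function name `draw_priors` is hypothetical.

```python
import numpy as np

def draw_priors(n_draws, seed=0):
    """Draw from priors matching the stated settings (assumptions in comments)."""
    rng = np.random.default_rng(seed)
    location = rng.normal(0.0, np.sqrt(4.0), n_draws)   # N(0, variance 4)
    slope = rng.lognormal(0.0, 1.0, n_draws)            # log-scale mean 0, variance 1 (assumed)
    guessing = rng.beta(1.0, 1.0, n_draws)              # Beta(1, 1)
    # N(0.5, variance 10) restricted to (0, 1) via rejection sampling.
    loading = np.empty(0)
    while loading.size < n_draws:
        cand = rng.normal(0.5, np.sqrt(10.0), 4 * n_draws)
        loading = np.concatenate((loading, cand[(cand > 0.0) & (cand < 1.0)]))
    loading = loading[:n_draws]
    # Gamma(0.01, 0.01) on the inverse variance; NumPy expects shape and scale = 1 / rate.
    precision = rng.gamma(0.01, 1.0 / 0.01, n_draws)
    return {"location": location, "slope": slope, "guessing": guessing,
            "loading": loading, "precision": precision}

draws = draw_priors(1000)
```

Because the normal prior on the loadings has a variance of 10, it is nearly flat over the interval (0, 1) after truncation.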
After monitoring the convergence diagnostic according to the multivariate potential scale reduction factor (Brooks & Gelman, 1998) with three parallel chains for
the first simulated data set across all analysis models, we found that the chain lengths
were sufficient to reach stationarity for all structural parameters. In addition, the last
200 MCMC samples for latent trait estimates were used to compute test reliability
(precision of person measures).
For each estimator, we computed the bias and the root mean square error (RMSE):

$$\mathrm{Bias}(E(z_r)) = \sum_{r=1}^{R} (\mathrm{EAP}(\hat{z}_r) - z) / 10, \tag{15}$$

$$\mathrm{RMSE}(E(z_r)) = \sqrt{\sum_{r=1}^{R} (\mathrm{EAP}(\hat{z}_r) - z)^2 / 10}, \tag{16}$$

where z was the generating value and $\mathrm{EAP}(\hat{z}_r)$ was the expected a posteriori estimate
in replication r.
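Equations 15 and 16 amount to the mean error and the root of the mean squared error across replications. A minimal Python sketch follows; the function name `bias_and_rmse` is hypothetical, and it divides by the number of estimates supplied, which equals 10 in the study's design.

```python
import numpy as np

def bias_and_rmse(estimates, true_value):
    """Bias (Equation 15) and RMSE (Equation 16) of EAP estimates of one parameter."""
    err = np.asarray(estimates, dtype=float) - true_value
    bias = float(err.mean())
    rmse = float(np.sqrt(np.mean(err ** 2)))
    return bias, rmse

# Toy estimates of a parameter whose generating value is 1.0.
bias, rmse = bias_and_rmse([1.1, 0.9, 1.2, 0.8], 1.0)
```

Errors that cancel in the bias still contribute to the RMSE, so both are needed to judge parameter recovery.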
It was expected that (a) the parameter recovery would be satisfactory when the
HTM was fit to HTM data because the data-generating model and the analysis model
were identical; (b) the parameter estimation would be biased and the test reliability
would be overestimated when the HIRT was fit to HTM data because the testlet effect
existed but was not considered; (c) the longer the testlet, the worse the estimation of
parameters and test reliability would be when the testlet effect was not considered;
and (d) the DIC would favor the data-generating HTM rather than the corresponding
HIRT, and the PPMC would show a good model–data fit for the generating HTM and
a poor fit for the corresponding HIRT.
Results
Parameter Recovery of Dichotomous Items
Due to space constraints, the bias and RMSE for individual parameters are not
reported; instead, their means and standard deviations are provided. Table 1 summarizes the bias values when the data-generating HTM (left-hand side) and the
HIRT (right-hand side) were fit to HTM data. The bias values yielded by the HTM
appeared to be very close to zero and were consistent with those found in the literature when data-generating dichotomous testlet response models were fit (Bradlow et
al., 1999; Wainer et al., 2000; W.-C. Wang & Wilson, 2005). In comparison, the 1PHTM yielded bias values slightly better than those by the 2P-HTM, which were
slightly better than those by the 3P-HTM. The difference in bias values between long
and short testlets was small. When the testlet effect was ignored and the HIRT was
fit to HTM data, the bias values were many times more extreme than those when the
HTM was fit. The underestimation of the three factor loadings by the HIRT was the
most pronounced, with bias values ranging from −0.193 to −0.043.
Table 2 summarizes the RMSE values when the data-generating HTM (left-hand
side) and the HIRT (right-hand side) were fit to HTM data. The HTM yielded small
—
—
9
8
213
—
—
24
15
23
46
68
—
—
10
26
12
5
27
60
50
28
28
211
34
27
5
4
44
Short
218
43
Long
11
21
Short
2-Parameter
40
30
1
6
3
16
42
19
50
51
154
Long
25
43
6
29
0
29
53
62
72
103
186
Short
3-Parameter
—
—
2178
2158
2149
—
—
97
5
9
132
Long
—
—
270
269
243
—
—
237
2
30
158
Short
1-Parameter
—
—
2167
2137
2128
—
—
90
101
28
212
Long
—
—
276
267
275
—
—
254
79
26
113
Short
2-Parameter
HIRT
Note: HTM = higher order testlet response model; IRT = item response theory; HIRT = higher order IRT model; — = not applicable.
Difficulty
Mean
29
SD
19
Slope
Mean
220
SD
10
Pseudo-guessing
Mean
—
SD
—
Factor loading
22
b1
19
b2
21
b3
Testlet variance
Mean
5
SD
19
Long
1-Parameter
HTM
Table 1. Bias of Dichotomous Items (Multiplied by 1,000) When the HTM and HIRT Were Fit to HTM Data.
—
—
2193
2147
2121
46
123
223
264
186
496
Long
—
—
277
286
266
45
91
6
151
169
356
Short
3-Parameter
110
43
81
31
—
—
34
28
41
175
61
23
5
—
—
29
24
26
118
26
Long
63
15
Short
167
55
18
22
28
—
—
90
34
110
47
Short
2-Parameter
122
41
31
27
24
57
32
114
57
209
137
Long
129
35
17
21
17
70
42
179
110
253
159
Short
3-Parameter
—
—
179
159
153
—
—
101
6
133
53
Long
—
—
73
72
46
—
—
43
4
156
66
Short
1-Parameter
—
—
169
139
131
—
—
141
71
207
105
Long
—
—
78
69
80
—
—
106
59
141
62
Short
2-Parameter
HIRT
Difficulty
Mean
64
SD
17
Slope
Mean
41
SD
5
Pseudo-guessing
Mean
—
SD
—
Factor loading
19
b1
27
b2
39
b3
Testlet variance
Mean
81
SD
18
Long
1-Parameter
HTM
—
—
194
149
123
116
87
332
196
446
346
Long
—
—
78
88
68
99
67
185
86
367
255
Short
3-Parameter
Table 2. Root Mean Square Error of Dichotomous Items (Multiplied by 1,000) When the HTM and HIRT Were Fit to HTM Data.
RMSE values, especially for the 1P-HTM. In contrast, when the testlet effect was
ignored and the HIRT was fit, the RMSE values became many times larger than
those in the HTM, especially for the factor loadings. A factorial analysis of variance
was conducted to evaluate the effects of the three independent variables: (a) method
(HIRT or HTM), (b) testlet length (short or long), and (c) model (1P, 2P, or 3P).
When the RMSE of difficulty parameters was the dependent variable, the partial η²
was 17% for the main effect of model, 9% for the main effect of method, and less
than 2% for the other effects. When the RMSE of discrimination parameters was the
dependent variable, the partial η² was 10% for the main effect of method and less
than 2% for the other effects. When the RMSE of pseudo-guessing parameters was
the dependent variable, the partial η² was 8% for the main effect of method and less
than 1% for the other effects. When the RMSE of factor loadings was the dependent
variable, the partial η² was 81% for the main effect of method, 52% for the main
effect of testlet length, 43% for the interaction effect of method and testlet length,
and less than 3% for the other effects. In short, method had a large impact on the
parameter recovery, which demonstrated that ignoring the testlet effects by fitting
the HIRT to HTM data yielded poor parameter estimation.
Parameter Recovery of Polytomous Items
Table 3 summarizes the bias values when the data-generating HTM (left-hand side)
and the HIRT (right-hand side) were fit to HTM data. Similar to those found in
dichotomous items, the bias values yielded by the HTM were very close to zero and
were consistent with those found in the literature when data-generating polytomous
testlet response models were fit (W.-C. Wang & Wilson, 2005; X. Wang, Bradlow,
& Wainer, 2002). In comparison, the Rasch family models (partial credit and rating
scale models) had a slightly smaller bias than the two-parameter models (generalized
partial credit and graded response models). The difference in bias values between
long and short testlets was small. When the testlet effect was ignored and the HIRT
was fit to HTM data, most bias values were many times more extreme than those
when the HTM was fit. The three factor loadings were substantially underestimated.
Table 4 summarizes the RMSE values when the data-generating HTM (left-hand
side) and the HIRT (right-hand side) were fit to HTM data. The results were consistent with those found in dichotomous items (Table 2). The HTM yielded small
RMSE values, whereas the HIRT yielded RMSE values many times larger than those
in the HTM. According to the factorial analysis of variance, when the RMSE of location parameters was the dependent variable, the partial η² was 43% for the main
effect of method and less than 3% for the other effects. When the RMSE of discrimination parameters was the dependent variable, the partial η² was 5% for the main
effect of method and less than 1% for the other effects. When the RMSE of factor
loadings was the dependent variable, the partial η² was 73% for the main effect of
method, 48% for the main effect of testlet length, 42% for the interaction effect of
method and testlet length, 16% for the interaction effect of method and model, 10%
23
57
227
25
25
12
7
67
60
225
3
27
4
22
26
22
Long
1
22
Short
43
29
0
23
21
216
18
8
42
Short
Generalized
Partial Credit
29
15
29
0
6
212
3
22
15
Long
3
15
1
28
210
27
6
3
12
Short
Rating
Scale
45
23
24
29
22
213
17
25
22
Long
27
27
26
214
6
213
25
211
21
Short
Graded
Response
—
—
2124
2118
2127
260
7
214
315
Long
—
—
287
279
276
2235
3
213
321
Short
Partial
Credit
—
—
2186
2140
2119
256
151
220
326
Long
—
—
287
277
269
2231
149
0
329
Short
Generalized
Partial Credit
—
—
2161
2153
2122
261
3
10
329
—
—
277
293
284
2238
7
19
324
Short
Rating
Scale
Long
HIRT
Location
Mean 212
SD
29
Slope
Mean 224
SD
11
Factor loading
25
b1
8
b2
213
b3
Testlet variance
Mean
22
SD
24
Long
Partial
Credit
HTM
Table 3. Bias of Polytomous Items (Multiplied by 1,000) When the HTM and HIRT Were Fit to HTM Data.
—
—
2182
2160
2131
258
160
18
323
Long
—
—
291
290
258
2243
157
25
315
Short
Graded
Response
114
42
63
21
21
26
35
144
31
79
23
32
5
22
16
27
75
21
110
21
14
24
30
61
22
110
43
60
11
42
27
25
19
2
43
10
Long
61
7
20
21
28
17
0
44
9
Short
Rating Scale
87
19
25
29
29
50
19
56
16
Long
77
17
24
31
23
61
26
55
15
Short
Graded
Response
—
—
132
126
132
64
4
293
155
Long
—
—
87
82
78
236
3
281
160
Short
Partial Credit
—
—
187
142
122
151
86
302
160
Long
—
—
89
80
75
238
144
307
167
Short
Generalized
Partial Credit
HIRT
Location
Mean
82
SD
23
Slope
Mean
37
SD
5
Factor loading
29
b1
36
b2
31
b3
Testlet variance
Mean
60
SD
6
Short
Long
Long
Short
Generalized
Partial Credit
Partial Credit
HTM
—
—
165
155
124
63
3
285
168
Long
—
—
78
93
87
238
7
286
159
Short
Rating Scale
Table 4. Root Mean Square Error of Polytomous Items (Multiplied by 1,000) When the HTM and HIRT Were Fit to HTM Data.
—
—
183
161
133
158
91
280
169
Long
—
—
93
93
61
249
154
279
153
Short
Graded
Response
for the main effect of model, and less than 2% for the other effects. Apparently,
method played a major role in parameter recovery, and ignoring testlet effect yielded
poor parameter estimation.
Test Reliability
The mean test reliabilities for dichotomous and polytomous items across the
10 replications are shown in Table 5. Because the HTM was the data-generating model,
the test reliability obtained from the HTM was treated as the gold standard against
which the test reliability obtained from the HIRT was compared. Fitting the HIRT to
the HTM data (for both dichotomous and polytomous items) substantially overestimated
the test reliability of the first-order latent traits, because the testlet effect was
mistakenly treated as true variance. The test reliability of the second-order latent
trait, however, appeared to be less affected. A possible explanation is that the
overestimation of the first-order test reliabilities and the underestimation of the
factor loadings cancelled out, so that the test reliability of the second-order latent
trait was estimated appropriately under the HIRT.
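The mechanism behind the overestimation can be illustrated with a small simulation. The sketch below is not the Bayesian reliability computation used in this study; it swaps in Cronbach's alpha on continuous scores as a simpler stand-in, and every name and setting in it (`simulate_scores`, `cronbach_alpha`, the variances, the sample size) is a hypothetical choice made for illustration. Scores are generated as trait plus testlet effect plus noise. Alpha computed over individual items counts the shared testlet effect as true variance, whereas alpha computed over testlet sums does not.

```python
import random
import statistics


def cronbach_alpha(columns):
    """Cronbach's alpha for a list of score columns (one list per scoring unit)."""
    k = len(columns)
    totals = [sum(scores) for scores in zip(*columns)]
    unit_variance_sum = sum(statistics.pvariance(col) for col in columns)
    return k / (k - 1) * (1 - unit_variance_sum / statistics.pvariance(totals))


def simulate_scores(n_persons=2000, n_testlets=4, items_per_testlet=5,
                    testlet_sd=0.7, error_sd=1.0, seed=1):
    """Continuous item scores: trait + person-specific testlet effect + noise.

    All settings are illustrative assumptions, not the design of the study.
    """
    rng = random.Random(seed)
    items = [[] for _ in range(n_testlets * items_per_testlet)]
    for _ in range(n_persons):
        theta = rng.gauss(0.0, 1.0)              # latent trait, unit variance
        for d in range(n_testlets):
            gamma = rng.gauss(0.0, testlet_sd)   # shared by items in testlet d
            for i in range(items_per_testlet):
                score = theta + gamma + rng.gauss(0.0, error_sd)
                items[d * items_per_testlet + i].append(score)
    return items, items_per_testlet


items, m = simulate_scores()

# Ignoring testlets: every item is treated as a separate, independent unit.
alpha_items = cronbach_alpha(items)

# Respecting testlets: sum the items within each testlet before computing alpha.
testlet_sums = [
    [sum(values) for values in zip(*items[start:start + m])]
    for start in range(0, len(items), m)
]
alpha_testlets = cronbach_alpha(testlet_sums)

print(f"item-level alpha: {alpha_items:.3f}, testlet-level alpha: {alpha_testlets:.3f}")
```

Under these assumed settings the item-level coefficient comes out noticeably higher than the testlet-level one, which mirrors the direction of the effect reported for the first-order latent traits.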
Posterior Predictive Model Checking
When the Bayesian chi-square and the reproduced correlation were used to evaluate
model–data fit using the PPMC, as Table 6 shows, fitting the HTM to HTM data consistently yielded a good model–data fit (slightly overconservative); however, fitting
the HIRT to the HTM data yielded a poor fit only for the one-parameter models. In
other words, these two statistics appeared not to be sensitive enough to model misspecification in the multiparameter models. With respect to the odds ratio statistic for
dichotomous items, the mean number of misfits across 120 item pairs was computed
across 10 replications. According to this statistic, fitting the HTM yielded a very small
number of misfits, whereas fitting the HIRT yielded a very large number of misfits,
especially for the one-parameter models. In addition, the longer the testlets were, the
more powerful the odds ratio statistic was. With respect to the rank correlation for
polytomous items, the mean number of misfits across 120 item pairs was computed. Like the
odds ratio statistic for dichotomous items, the rank correlation performed fairly well in
distinguishing the true model from the misspecified model. In addition to the PPMC, the
Bayesian DIC was computed to compare the HTM and the HIRT. The HTM always had
a smaller DIC value than the corresponding HIRT. Because of the small number of
replications, the above PPMC results should be interpreted with caution.
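As a rough illustration of how a PPMC check with the odds ratio can be organized, the sketch below computes the odds ratio for one pair of dichotomous items and then a posterior predictive p value from a set of replicated statistics. It is a minimal sketch under stated assumptions, not the WinBUGS code used in the study: the function names, the 0.5 continuity correction, and the .05/.95 cutoffs for flagging misfit are all assumptions introduced here.

```python
def odds_ratio(item_a, item_b):
    """Odds ratio for two 0/1 score vectors.

    Adding 0.5 to every cell avoids division by zero; this correction
    is an assumption of the sketch.
    """
    n11 = n10 = n01 = n00 = 0.5
    for a, b in zip(item_a, item_b):
        if a and b:
            n11 += 1
        elif a:
            n10 += 1
        elif b:
            n01 += 1
        else:
            n00 += 1
    return (n11 * n00) / (n10 * n01)


def posterior_predictive_p(observed_stat, replicated_stats):
    """Share of replicated statistics at least as large as the observed one."""
    hits = sum(1 for r in replicated_stats if r >= observed_stat)
    return hits / len(replicated_stats)


def is_misfit(p_value, lower=0.05, upper=0.95):
    """Flag extreme p values; the cutoffs are assumed, not taken from the article."""
    return p_value < lower or p_value > upper


# Toy data: two items whose responses agree more often than chance.
observed = odds_ratio([1, 1, 1, 0, 0, 0, 1, 0], [1, 1, 1, 0, 0, 0, 0, 1])

# Hypothetical odds ratios computed from data sets replicated under a fitted model.
replicated = [1.0, 1.2, 0.8, 2.0, 1.5, 0.9, 1.1, 1.4, 1.3, 0.7]

p_value = posterior_predictive_p(observed, replicated)
print(observed, p_value, is_misfit(p_value))
```

In a full analysis, one replicated data set would be drawn per retained MCMC iteration, and the check would be repeated for every item pair to obtain a count of misfitting pairs.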
Two Empirical Examples
Example 1: Ability Tests With Dichotomous Items
In Taiwan, junior high school graduates who wish to enter senior high schools have
to take the Basic Competence Tests for Junior High School Students, which consist
Table 5. Mean Test Reliability for Dichotomous and Polytomous Items Across Replications.

                                      Long              Short
                                  HTM     HIRT      HTM     HIRT
Dichotomous items
  1-Parameter
    Test 1                        .67     .81       .72     .79
    Test 2                        .65     .81       .71     .79
    Test 3                        .63     .81       .70     .79
    2nd order                     .66     .65       .71     .71
  2-Parameter
    Test 1                        .66     .80       .71     .77
    Test 2                        .66     .83       .72     .80
    Test 3                        .62     .80       .68     .77
    2nd order                     .66     .66       .70     .70
  3-Parameter
    Test 1                        .62     .76       .67     .72
    Test 2                        .62     .78       .67     .75
    Test 3                        .59     .73       .63     .70
    2nd order                     .62     .61       .67     .66
Polytomous items
  Partial credit
    Test 1                        .71     .92       .79     .90
    Test 2                        .70     .92       .78     .90
    Test 3                        .68     .92       .77     .90
    2nd order                     .70     .69       .75     .76
  Generalized partial credit
    Test 1                        .71     .91       .78     .89
    Test 2                        .70     .92       .79     .90
    Test 3                        .69     .91       .76     .88
    2nd order                     .70     .69       .75     .75
  Rating scale
    Test 1                        .72     .92       .79     .90
    Test 2                        .70     .92       .78     .90
    Test 3                        .69     .92       .77     .90
    2nd order                     .70     .70       .76     .77
  Graded response
    Test 1                        .70     .91       .78     .89
    Test 2                        .70     .92       .78     .90
    Test 3                        .68     .91       .77     .89
    2nd order                     .69     .68       .75     .74

Note: HTM = higher order testlet response model; IRT = item response theory; HIRT = higher order
IRT model. All the standard deviations of test reliabilities were between .01 and .02.
of five subjects: language, mathematics, English, social sciences, and nature sciences.
In the 2006 tests, there were 48 items in language, 33 items in mathematics, 45 items
in English, 63 items in social sciences, and 58 items in nature sciences; all were in
Table 6. Frequencies of Misfit in Posterior Predictive Model Checking in 10 Replications.

                                      Long              Short
                                  HTM     HIRT      HTM     HIRT
1-Parameter
  Bayesian chi-square             0       8         0       0
  Reproduced correlation          0       8         0       0
  Odds ratio (mean)               0.34    9.38      0.42    9.06
2-Parameter
  Bayesian chi-square             0       0         0       0
  Reproduced correlation          0       0         0       0
  Odds ratio (mean)               0.00    8.71      0.07    6.03
3-Parameter
  Bayesian chi-square             0       0         0       0
  Reproduced correlation          0       0         0       0
  Odds ratio (mean)               0.02    8.73      0.06    5.85
Partial credit
  Bayesian chi-square             0       0         0       10
  Reproduced correlation          0       1         0       10
  Rank correlation (mean)         0.15    10.00     0.21    9.70
Generalized partial credit
  Bayesian chi-square             0       0         0       0
  Reproduced correlation          0       0         0       0
  Rank correlation (mean)         0.00    9.68      0.01    9.23
Rating scale
  Bayesian chi-square             0       0         0       10
  Reproduced correlation          0       2         0       10
  Rank correlation (mean)         0.14    10.00     0.31    9.70
Graded response
  Bayesian chi-square             0       0         0       0
  Reproduced correlation          0       0         0       0
  Rank correlation (mean)         0.00    9.68      0.00    9.34

Note: HTM = higher order testlet response model; IRT = item response theory; HIRT = higher order
IRT model.
multiple-choice format. Every test consisted of both independent items and testlet-based
items. Altogether, there were 23 testlets in the five tests. Each of the five tests
was treated as measuring a first-order latent trait. The five first-order latent traits
were treated as governed by the same second-order latent trait, referred to as "academic
ability." A total of 5,000 examinees (2,639 boys and 2,361 girls) were randomly sampled
from more than 300,000 examinees.
The HTM and HIRT with item response functions following the 1PLM, 2PLM,
and 3PLM were fit to the data, resulting in a total of six models. WinBUGS was used
for parameter estimation. According to the Bayesian DIC, the 3P-HTM had the best
fit. According to the PPMC with the Bayesian chi-square, reproduced correlation,
and odds ratio statistics, the 3P-HTM also had a good fit. The estimates were between
0.60 and 10.71 (M = 2.60) for the slope parameters, between −2.72 and 1.53 (M =
−0.19) for the difficulty parameters, and between 0.02 and 0.47 (M = 0.21) for the
pseudo-guessing parameters. The factor loadings were .95, .90, .94, .96, and .98, and
the test reliabilities were .95, .94, .94, .96, and .96, for language, English,
mathematics, social sciences, and nature sciences, respectively. The test reliability for the
second-order latent trait was .97. The high factor loadings were not surprising,
because all five tests were designed to measure basic competence for junior high
school students. The standard deviations of γ for the 23 testlets were between 0.10
and 0.52 (M = 0.26). Compared to the unit standard deviation of the first-order latent
traits (set at 1 for model identification), the random testlet effects ranged from
trivial to moderate. When the testlet effects were ignored by fitting the 3P-HIRT
to the data, the resulting factor loadings and test reliabilities were very similar to
those found under the 3P-HTM, mainly because the testlet effects were rather small.
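To give a feel for the size of these estimates, the sketch below generates one response from a higher order three-parameter testlet model using the mean estimates reported above (slope 2.60, difficulty −0.19, pseudo-guessing 0.21, testlet standard deviation 0.26, and the language factor loading of .95). The functional forms are assumptions of this sketch: the testlet effect is added inside the logit as a(θ − b + γ), and the first-order trait is written as loading × second-order trait plus a residual with variance 1 − loading², so that the first-order trait has unit variance. The function names are hypothetical.

```python
import math
import random


def three_pl_testlet_prob(theta, gamma, a, b, c):
    """P(correct) under an assumed form: c + (1 - c) * logistic(a * (theta - b + gamma))."""
    z = a * (theta - b + gamma)
    return c + (1.0 - c) / (1.0 + math.exp(-z))


def draw_first_order_trait(theta2, loading, rng):
    """First-order trait = loading * second-order trait + residual.

    The residual variance 1 - loading**2 keeps Var(theta_v) at 1 (assumed form).
    """
    residual_sd = math.sqrt(1.0 - loading ** 2)
    return loading * theta2 + rng.gauss(0.0, residual_sd)


rng = random.Random(7)
theta2 = rng.gauss(0.0, 1.0)                          # second-order trait
theta_v = draw_first_order_trait(theta2, 0.95, rng)   # loading reported for language
gamma = rng.gauss(0.0, 0.26)                          # mean testlet SD reported above

p = three_pl_testlet_prob(theta_v, gamma, a=2.60, b=-0.19, c=0.21)
response = 1 if rng.random() < p else 0
print(round(p, 3), response)
```

With a testlet standard deviation of 0.26 against a unit-variance trait, γ shifts the logit only slightly for most examinees, which is in line with the observation that fitting the 3P-HIRT changed the results very little.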
Example 2: Nonability Tests With Polytomous Items
Three tests were used to measure pathological Internet use in Taiwan (Lin & Liu,
2003), including the Internet Hostility Questionnaire (31 four-point items), the
Chinese Internet Addiction Scale (26 four-point items), and the Questionnaire of
Cognitive Distortions of Pathological Internet Use (16 four-point items). A single
total score was reported for each test: the higher the total score, the more serious the
Internet hostility, Internet addiction, or cognitive distortion. Each of the three tests
was treated as measuring a first-order latent trait. The three first-order latent traits
were treated as governed by a second-order latent trait, referred to as "pathological
Internet use." Each test consisted of several "subtests," which were treated as testlets
because, after partitioning out the target latent trait (θ), the specific latent trait
(γ) left for each subtest was considered unimportant (a nuisance). There were 15 testlets
altogether in the three tests. A total of 987 Taiwanese Internet users (college students;
625 males and 362 females) responded to the three tests.
The HTM and HIRT with item response functions following the PCM, GPCM,
RSM, and GRM were fit to the data, resulting in a total of eight models. WinBUGS
was used. According to the Bayesian DIC, the GP-HTM had the best fit. According
to the PPMC with the Bayesian chi-square, reproduced correlation, and rank correlation, the GP-HTM also had a good fit. The estimates were between 0.10 and 6.77
(M = 1.59) for the slope parameters and between −8.88 and 6.49 (M = 0.44) for the
location parameters. The factor loadings were .68, .61, and .83, and the test reliabilities
were .82, .89, and .81 for Internet addiction, Internet hostility, and cognitive
distortions, respectively. The test reliability for the second-order latent trait was .71.
Judging from the factor loadings, cognitive distortions contributed more to pathological
Internet use than Internet hostility and Internet addiction did. The standard deviations
of γ for the 15 testlets were between 0.40 and 3.23 (M = 1.41). Compared to the unit
standard deviation of the first-order latent traits, most of the testlet effects were very
large. When the large testlet effects were ignored by fitting the GPC-HIRT to
the data, the factor loadings were .60, .61, and .80, and the test reliabilities were .91,
.94, and .88, for Internet addiction, Internet hostility, and cognitive distortions,
respectively. The test reliability for the second-order latent trait was .70. Apparently,
ignoring the testlet effects led to an underestimation of factor loadings and an
overestimation of test reliabilities for the first-order latent traits, which is consistent
with the results found in the simulations.
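Model selection in both examples relied on the Bayesian DIC. As a reminder of how the criterion is assembled (Spiegelhalter et al., 2002), the sketch below computes the effective number of parameters as the posterior mean deviance minus the deviance at the posterior mean of the parameters, adds it to the posterior mean deviance, and picks the model with the smallest value. The function names, model labels, and deviance values are hypothetical placeholders, not results from either example; in practice WinBUGS reports these quantities directly.

```python
def dic(deviance_draws, deviance_at_posterior_mean):
    """Return (DIC, pD) from MCMC deviance draws.

    pD  = mean deviance - deviance evaluated at the posterior mean
    DIC = mean deviance + pD
    """
    mean_deviance = sum(deviance_draws) / len(deviance_draws)
    p_d = mean_deviance - deviance_at_posterior_mean
    return mean_deviance + p_d, p_d


def best_model(dic_by_model):
    """Name of the model with the smallest DIC."""
    return min(dic_by_model, key=dic_by_model.get)


# Hypothetical deviance draws for two candidate models (illustrative numbers only).
dic_a, pd_a = dic([1010.0, 1020.0, 1030.0], 1000.0)
dic_b, pd_b = dic([1060.0, 1070.0, 1080.0], 1060.0)

print(dic_a, pd_a)
print(best_model({"model_A": dic_a, "model_B": dic_b}))
```

Because pD penalizes complexity, a testlet model with many extra random effects is preferred only when its gain in fit outweighs that penalty.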
Discussion and Conclusion
The new class of HTM was developed for tests that measure hierarchical latent traits
with testlet-based items. The HTM is very flexible because the item response function
is not limited to any specific form: different items can follow different item
response functions, even within the same test. Because of the high dimensionality, we
used the Bayesian approach with MCMC methods, as implemented in WinBUGS, for
parameter estimation. A series of simulations were conducted to evaluate parameter
recovery of the HTM, the consequences of ignoring testlet effects on the estimation of
parameters and test reliability, and the effectiveness of model–data fit statistics. The
results show that (a) the parameters of the HTM were recovered very well; (b) when
testlet effects existed but were ignored by fitting the HIRT, the resulting item
parameter estimates were biased, the factor loadings were underestimated, and the
test reliability of the first-order latent traits was overestimated; and (c) the Bayesian
DIC and the PPMC were helpful in model comparison and model–data fit checking.
The simulations were conducted on personal computers with a 2.66-GHz Intel Core
i5 processor. Each replication took approximately 2 to 6 days. The computation time is feasible
for most real data analyses but may not be feasible for comprehensive simulations,
which was the main reason why only 10 replications were conducted in this study. To
facilitate parameter estimation of the HTM, one may adopt standard IRT computer
programs, such as BILOG-MG, MULTILOG, or PARSCALE, to fit each test at the
first order, and then use the parameter estimates as starting values for the HTM using
WinBUGS. Even better, a stand-alone computer program should be developed for the
HTM.
In the HTM, we formulated two orders of latent traits. The HTM can be easily
extended to accommodate more orders, but at the cost of a heavier computational
burden. In recent years, multilevel IRT models have been developed to describe
multilevel structures in persons (e.g., persons nested within schools; Fox, 2005). It
would be of great value to combine the HTM with multilevel models. Future studies can
explore these theoretical issues and develop more efficient computer
programs for these complicated models.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship,
and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship,
and/or publication of this article: The first author was supported by the National Science
Council (Grant No. NSC 100-2410-H-133-015).
References
Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43,
561-573.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s ability.
In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 397-479).
Reading, MA: Addison-Wesley.
Bolt, D. M., & Lall, V. F. (2003). Estimation of compensatory and noncompensatory
multidimensional item response models using Markov chain Monte Carlo. Applied
Psychological Measurement, 27, 395-414.
Bradlow, E. T., Wainer, H., & Wang, X. (1999). A Bayesian random effects model for testlets.
Psychometrika, 64, 153-168.
Brooks, S. P., & Gelman, A. (1998). General methods for monitoring convergence of iterative
simulations. Journal of Computational and Graphical Statistics, 7, 434-455.
Cao, J., & Stokes, S. L. (2008). Bayesian IRT guessing models for partial guessing behaviors.
Psychometrika, 73, 209-230.
de la Torre, J., & Douglas, J. A. (2004). Higher-order latent trait models for cognitive
diagnosis. Psychometrika, 69, 333-353.
de la Torre, J., & Hong, Y. (2010). Parameter estimation with small sample size: A higher order
IRT model approach. Applied Psychological Measurement, 34, 267-285.
de la Torre, J., & Song, H. (2009). Simultaneous estimation of overall and domain abilities:
A higher order IRT model approach. Applied Psychological Measurement, 33, 620-639.
Fox, J.-P. (2005). Multilevel IRT using dichotomous and polytomous items. British Journal of
Mathematical and Statistical Psychology, 58, 145-172.
Fox, J.-P., & Glas, C. A. W. (2003). Bayesian modeling of measurement error in predictor
variables using item response theory. Psychometrika, 68, 169-191.
Gelfand, A. E., & Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal
densities. Journal of the American Statistical Association, 85, 398-409.
Gelman, A., Meng, X.-L., & Stern, H. S. (1996). Posterior predictive assessment of model
fitness via realized discrepancies. Statistica Sinica, 6, 733-807.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions and the Bayesian
restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6,
721-741.
Huang, H.-Y., Wang, W.-C., & Chen, P.-H. (2010). An item response model with hierarchical
latent traits. Paper presented at the annual meeting of the American Educational Research
Association, Denver, CO.
Kang, T., & Cohen, A. S. (2007). IRT model selection methods for dichotomous items.
Applied Psychological Measurement, 31, 331-358.
Klein Entink, R. H., Fox, J.-P., & van der Linden, W. J. (2009). A multivariate multilevel
approach to the modeling of accuracy and speed of test takers. Psychometrika, 74, 21-48.
Li, Y., Bolt, D. M., & Fu, J. (2006). A comparison of alternative models for testlets. Applied
Lin, S. S. J., & Liu, E. Z.-F. (2003). Study on Internet flaming and hostility: Cognitive process,
situational factors, and the building of an automatic sorting system of hostile posters (I)
(NSC Research Report No. NSC92-2520-S009-005). Taipei, Taiwan: National Science
Council.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174.
Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm.
Applied Psychological Measurement, 16, 159-176.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests.
Copenhagen, Denmark: Institute of Educational Research. (Expanded edition, 1980;
Chicago, IL: University of Chicago Press)
Ryan, J. J., & Schnakenberg-Ott, S. D. (2003). Scoring reliability on the Wechsler Adult
Intelligence Scale–Third Edition (WAIS-III). Assessment, 10, 151-159.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores.
Psychometrika Monograph Supplement, No. 17.
Sheng, Y., & Wikle, C. K. (2008). Bayesian multidimensional IRT models with a hierarchical
structure. Educational and Psychological Measurement, 68, 413-430.
Sinharay, S., Johnson, M. S., & Stern, H. S. (2006). Posterior predictive assessment of item
response theory models. Applied Psychological Measurement, 30, 298-321.
Spiegelhalter, D. J., Best, N. G., Carlin, B. P., & van der Linde, A. (2002). Bayesian measures
of model complexity and fit. Journal of the Royal Statistical Society, Series B:
Methodological, 64, 583-616.
Spiegelhalter, D. J., Thomas, A., & Best, N. (2003). WinBUGS version 1.4 [Computer
program]. Cambridge, England: MRC Biostatistics Unit, Institute of Public Health.
Thissen, D., Steinberg, L., & Mooney, J. A. (1989). Trace lines for testlets: A use of
multiple-categorical-response models. Journal of Educational Measurement, 26, 247-260.
van der Linden, W. J., Klein Entink, R. H., & Fox, J.-P. (2010). IRT parameter estimation with
response times as collateral information. Applied Psychological Measurement, 34, 327-347.
Wainer, H., Bradlow, E. T., & Du, Z. (2000). Testlet response theory: An analog for the 3-PL
useful in adaptive testing. In W. J. van der Linden & C. A. W. Glas (Eds.), Computerized
adaptive testing: Theory and practice (pp. 245-269). Dordrecht, the Netherlands: Kluwer
Academic.
Wainer, H., Bradlow, E. T., & Wang, X. (2007). Testlet response theory and its applications.
New York, NY: Cambridge University Press.
Wainer, H., & Lukhele, R. (1997). How reliable are TOEFL scores? Educational and
Wainer, H., & Thissen, D. (1996). How is reliability related to the quality of test scores? What
is the effect of local dependence on reliability? Educational Measurement: Issues and
Practice, 15, 22-29.
Wainer, H., & Wang, X. (2000). Using a new statistical model for testlets to score TOEFL.
Journal of Educational Measurement, 37, 203-220.
Wang, W.-C., & Wilson, M. R. (2005). The Rasch testlet model. Applied Psychological
Measurement, 29, 126-149.
Wang, X., Bradlow, E. T., & Wainer, H. (2002). A general Bayesian model for testlets: Theory
and applications. Applied Psychological Measurement, 26, 109-128.
