A hierarchical latent response model for inferences about examinee engagement in terms of guessing and item-level non-response

In low-stakes assessments, test performance has few or no consequences for examinees themselves, so that examinees may not be fully engaged when answering the items. Instead of engaging in solution behaviour, disengaged examinees might randomly guess or generate no response at all. When ignored, examinee disengagement poses a severe threat to the validity of results obtained from low-stakes assessments. Statistical modelling approaches in educational measurement have been proposed that account for non-response or for guessing, but do not consider both types of disengaged behaviour simultaneously. We bring together research on modelling examinee engagement and research on missing values and present a hierarchical latent response model for identifying and modelling the processes associated with examinee disengagement jointly with the processes associated with engaged responses. To that end, we employ a mixture model that identifies disengagement at the item-by-examinee level by assuming different data-generating processes underlying item responses and omissions, respectively, as well as response times associated with engaged and disengaged behaviour. By modelling examinee engagement with a latent response framework, the model allows assessing how examinee engagement relates to ability and speed as well as to identify items that are likely to evoke disengaged test-taking behaviour. An illustration of the model by means of an application to real data is presented.


Introduction
The aim of large-scale assessments (LSAs) is to measure examinee competencies using test items. In doing so, it is assumed that examinees actively try to determine the correct answer to every item by employing their abilities (Schnipke & Scrams, 1997; Wang & Xu, 2015). Most comparative LSAs, however, are low-stakes for examinees and aim at system-level comparisons. As such, examinee test performance in most LSAs has few or no consequences for examinees themselves and examinees may not be fully engaged when attempting the items. When disengaged, examinees might attempt items without applying their abilities, but instead proceed quickly through the assessment by randomly guessing on multiple-choice (MC) items, answering items with an open-response (OR) format only perfunctorily, or generating no response at all (Verbić & Tomić, 2009; Wise & Gao, 2017). Such disengaged test-taking behaviour poses a severe threat to the validity of results obtained from LSAs since test scores assumed to reflect the level of competency may be confounded with the level of disengagement (Braun, Kirsch, & Yamamoto, 2011). Identifying and understanding the processes associated with examinee disengagement is therefore paramount for drawing valid inferences on examinee ability.
In this study, we argue that both rapid guesses and item omissions can be understood as indicators of examinee disengagement (see Wise & Gao, 2017). To capture the underlying processes, we bring together research on modelling examinee engagement and research on item-level non-response and provide a generalized modelling framework that identifies disengagement by jointly considering information on responses, omissions, and response times (RTs).
The remainder of this article is structured as follows: First, we review current approaches for identifying examinee disengagement as well as for handling item omissions. Second, we present a hierarchical latent response framework for examinee disengagement in terms of guessing and omitting. We then evaluate the statistical performance of the proposed model, illustrate how it differs from current approaches for identifying examinee disengagement and handling item omissions, and illustrate its application employing data from the Programme for International Student Assessment (PISA) 2015.

Response-time-based scoring techniques
In RT-based scoring methods for identifying and filtering disengaged responses, responses associated with RTs below a certain threshold are considered to be rapid guesses. Different approaches exist for establishing these thresholds. The most heuristic threshold method is to define a common threshold for all items representing the minimum amount of time needed to give an engaged response (Wise, Kingsbury, Thomason, & Kong, 2004). Item-specific thresholds can be established by setting the threshold to, for example, 10% of the average time (Wise & Ma, 2012), by visually assessing bimodal RT distributions for a distinctive gap (Wise, Pastor, & Kong, 2009), or by assessing RT distributions jointly with the conditional proportion correct in order to identify an RT threshold at which accuracy exceeds what would be expected from random responding (Goldhammer et al., 2016;Guo et al., 2016;Lee & Jia, 2014).
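As a rough illustration of an item-specific threshold rule, the following sketch flags responses whose RT falls below a fixed fraction of the item's mean RT, in the spirit of the 10% rule mentioned above. The function name, the data layout, and the example RTs are hypothetical and are not taken from any of the cited studies.

```python
from statistics import mean


def rapid_guess_flags(rts_by_item, fraction=0.10):
    """Flag RTs below `fraction` times the mean RT of the same item.

    rts_by_item maps an item label to a list of RTs in seconds
    (hypothetical layout chosen for this sketch).
    """
    flags = {}
    for item, rts in rts_by_item.items():
        threshold = fraction * mean(rts)  # item-specific threshold
        flags[item] = [rt < threshold for rt in rts]
    return flags


# Made-up RTs for two items and four examinees.
rts = {
    "item_1": [40.0, 55.0, 2.0, 63.0],
    "item_2": [90.0, 5.0, 110.0, 95.0],
}
flags = rapid_guess_flags(rts)
```

With these made-up values, the thresholds are 4 s and 7.5 s, so only the 2 s and 5 s responses are flagged. Because the threshold depends on the item mean, a different `fraction` or a different summary statistic can change the share of flagged responses considerably.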

Model-based approaches
Model-based approaches aiming to identify disengaged test-taking behaviour usually apply mixture modelling techniques, with responses, and, if considered, RTs assumed to stem from two different processes: solution behaviour and rapid guessing behaviour. For responses stemming from solution behaviour, customary item response theory (IRT) models are assumed. That is, probability correct is modelled as a function of examinee ability and item difficulty. Responses stemming from rapid guessing processes are assumed to contain no information on ability; probability correct under disengaged behaviour is thus assumed to correspond to the probability of guessing correctly at chance level (Schnipke & Scrams, 1997;Wang & Xu, 2015). RTs are either assumed to stem from different lognormal distributions with different means and variances associated with solution and random guessing behaviour (Meyer, 2010;Schnipke & Scrams, 1997;Wang & Xu, 2015) or employed to predict latent class membership (Pokropek, 2016).

Assumptions and limitations
RT-based scoring techniques for identifying examinee disengagement are rather heuristic and might considerably disagree in the rate of responses classified as rapid guesses (Lee & Jia, 2014) or perfunctory answers (Goldhammer et al., 2016). For instance, Goldhammer et al. (2016) have reported proportions of perfunctory answers ranging from 0.05% to 8.20% for different threshold methods applied to the same data set. Mixture models for disengaged behaviour, on the other hand, often come with strong assumptions regarding the processes underlying examinee disengagement. In mixture models for disengaged behaviour, mixing proportions are allowed at the population (Meyer, 2010), examinee (Cao & Stokes, 2008; Mislevy & Verhelst, 1990; Wang & Xu, 2015), item (Schnipke & Scrams, 1997) or item-by-examinee level (Pokropek, 2016). While models allowing for varying mixing proportions at the item level assume that items can evoke disengaged behaviour to a different degree, they assume all examinees to be equally prone to show disengaged behaviour. Models assuming examinee-specific mixing proportions allow for the probability of being disengaged to vary across examinees, while the proportion of disengaged responses is assumed to be constant across items. The probability of disengaged responses, however, has repeatedly been shown to be related to both examinee characteristics such as academic ability or achievement goals and item characteristics such as response format or position (Goldhammer et al., 2016; Lee & Jia, 2014; Wise et al., 2009). Considering this when modelling examinee engagement renders it necessary to allow for mixing proportions at the item-by-examinee level. To our knowledge, the grade of membership framework presented by Erosheva (2002) and adapted for identifying examinee disengagement by Pokropek (2016) is the only framework that allows for mixing proportions at the item-by-examinee level. It does so by regressing item-by-examinee-level mixing proportions on the associated RTs.
In addition, mixture models for identifying examinee disengagement do not model the probability of being engaged jointly with ability but rather as an independent process. Thus, these models assume ability and engagement to be unrelated constructs. In RT-based scoring approaches, on the other hand, item responses identified to be the result of guessing behaviour are often coded as missing and therefore ignored when estimating ability. Doing so comes with the assumption that the missing responses induced through such filtering techniques are ignorable in the sense that they are missing at random (MAR) given the observed (engaged) responses and the background variables considered, and that the processes leading to disengaged item responses are unrelated to ability (Pokropek, 2016; Rios et al., 2017; Rubin, 1976). A rich body of research, however, suggests that motivation and the tendency to show guessing behaviour are indeed related to ability (Boe, May, & Boruch, 2002; Braun et al., 2011; Goldhammer et al., 2016; Wise & DeMars, 2005; Wise et al., 2009). Not taking this into account has been shown to yield biased ability estimates (Pokropek, 2016; Rios et al., 2017). To overcome these limitations, there is a need for a model-based approach that allows for the probability of observing disengaged behaviour to vary at the item-by-examinee level, as well as joint modelling of the processes underlying examinee disengagement and ability.

Omissions
Various studies have related the occurrence of item omissions to lack of examinee motivation (Cosgrove, 2011; Jakwerth & Stancavage, 2003; Köhler, Pohl, & Carstensen, 2015a; Verbić & Tomić, 2009; Wise & Gao, 2017). Decline in test scores over time, for instance, has been attributed to a decline in examinee motivation, with an increase in omission rates taken as an indicator of examinee disengagement (Cosgrove, 2011; Sachse, Mahler, & Pohl, 2019). Likewise, it has been suggested to employ the rate of item omissions on background questionnaires as an indicator of disengagement in cognitive assessments, with the rationale being that examinees who are not motivated to fill out the background questionnaire might also be less motivated to engage with the items of the cognitive assessment (Boe et al., 2002).
Notwithstanding, there is an ongoing discussion about the treatment of item omissions in the cognitive assessments of LSAs. Operationally in LSAs there is considerable variety in the treatment of item omissions, where omissions are either ignored, scored as incorrect, or scored as partially correct (see Pohl, Gräfe, & Rose, 2014, for an overview). While scoring item omissions as wrong assumes the probability of a correct response to an omitted item to be zero (Rose, von Davier, & Xu, 2010), ignoring item omissions implies ignorability (Rose et al., 2010). In the case that ignorability does not hold, ignoring missing data jeopardizes validity of inference and can induce bias to person and item parameter estimates (de Ayala, Plake, & Impara, 2001; Culbertson, 2011; Finch, 2008; Köhler, Pohl, & Carstensen, 2015b; Pohl et al., 2014; Rose, 2013; Rose et al., 2010).

Response-time-based scoring techniques
RT-based scoring techniques for item omissions aim to distinguish item omissions occurring due to processes different from and similar to those operating when examinees generate (engaged) responses. For item omissions associated with RTs remarkably shorter than RTs associated with observed responses, it is assumed that the examinee did not engage with the item but skipped it without trying to solve it. Item omissions associated with RTs that do not notably differ from RTs associated with (wrong) observed responses are assumed to have occurred for skill-related reasons, since the examinee engaged sufficiently long with the item to generate a response, but decided not to. To distinguish between these two types of omissions, the Programme for the International Assessment of Adult Competencies (PIAAC) employs a 5-s scoring rule, where item omissions associated with RTs exceeding 5 s are treated as wrong. Otherwise, item omissions are considered not attempted and treated as missing responses in all further analyses (Yamamoto, Khorramdel, & von Davier, 2013). Recent approaches for RT-based scoring of omitted responses extend this rationale by allowing for item-specific, empirically derived thresholds (Frey, Spoden, Goldhammer, & Wenzel, 2018;Weeks, von Davier, & Yamamoto, 2016).
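The 5-s rule described above amounts to a simple recoding of omitted responses. The sketch below is only an illustration of that rule as summarized in this section: the function name and the use of `None` for a missing response are assumptions of this sketch, not part of any operational PIAAC scoring code.

```python
def score_omission(rt_seconds, threshold=5.0):
    """Recode an omitted response based on its RT.

    Returns 0 (scored as wrong) if the RT exceeds the threshold,
    and None (treated as missing / not attempted) otherwise.
    """
    if rt_seconds > threshold:
        return 0
    return None


# Hypothetical RTs (in seconds) of four omitted responses.
omission_rts = [1.2, 5.0, 12.3, 48.0]
scores = [score_omission(rt) for rt in omission_rts]
```

Replacing the fixed `threshold` by an item-specific value would correspond to the empirically derived thresholds mentioned at the end of the paragraph.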

Model-based approaches
In model-based approaches for non-ignorable item omissions, the missingness mechanism assumed to underlie item omissions is usually modelled via an additional manifest or latent variable which represents the examinees' propensity to omit items (Holman & Glas, 2005; O'Muircheartaigh & Moustaki, 1999; Rose et al., 2010), such that response and omission behaviour are modelled jointly.
For response indicators $u_{ij}$, representing person $i$'s response on item $j$, customary IRT models are employed, with the probability of a correct response being modelled as a function of ability $\theta_i$ and item difficulty $b_j$:
$$p(u_{ij} = 1 \mid \theta_i, b_j) = \frac{\exp(\theta_i - b_j)}{1 + \exp(\theta_i - b_j)}. \tag{1}$$
Omission indicators $d_{ij}$ contain information on whether examinee $i$ generated a response to item $j$, with 1 indicating an item omission and 0 an observed response. The probability of an item omission is modelled as
$$p(d_{ij} = 1 \mid \xi_i, a_j) = \frac{\exp(\xi_i - a_j)}{1 + \exp(\xi_i - a_j)}, \tag{2}$$
with $\xi_i$ denoting person omission propensity and $a_j$ item omission difficulty. A multivariate normal distribution is assumed for ability and omission propensity. Traditionally, model-based approaches have relied only on information as to whether a response has been observed or not. More recent work has extended model-based approaches for non-ignorable item omissions by integrating them with models for RTs, allowing for different processes determining the time examinees require to generate a response or to omit an item. Doing so allows assessment of the degree to which these processes differ and, as such, for a finer-grained understanding of the occurrence of item omissions as well as test-taking behaviour in general.

Assumptions and limitations
Although RT-based scoring techniques for item omissions allow different types of item omissions to be distinguished, they assume that either item omissions are ignorable or the probability of solving an omitted item is zero (Lord, 1983; Rose, 2013). By modelling omission propensity jointly with ability, model-based approaches for item omissions overcome these assumptions, allow assessing how examinee ability relates to the probability of omitting responses, and have been shown to yield unbiased item and person parameter estimates, even when the missingness mechanism is non-ignorable in the sense that parameters of the response model are not distinct from those of the missingness model (Holman & Glas, 2005; Pohl et al., 2014; Rose et al., 2010; Ulitzsch et al., 2019). If one were to consider omissions as indicators of disengaged behaviour while assuming that all observed responses stem from solution behaviour and that examinees do not omit while engaged, the omission propensity in these models can be understood as an examinee disengagement parameter that is modelled jointly with ability. As such, these models overcome the assumption of independence between the processes governing disengaged behaviour and ability inherent to model-based approaches for disengaged guessing. They are, however, restrictive in that they assume all item omissions to stem from the same data-generating processes and all observed responses to stem from engaged response processes.

Proposed model
Conceptualizing disengaged test-taking behaviour in terms of both randomly guessing (or producing perfunctory answers) and omitting, we present a hierarchical latent response model for identifying and modelling the processes associated with examinee disengagement jointly with the processes associated with engaged responses.¹ We thereby bring together research on examinee disengagement and non-response behaviour. Addressing limitations of previously developed approaches, the speed-accuracy + engagement (SA+E) model allows for item-by-examinee-specific engagement probabilities, defines engagement in terms of both random guessing (or perfunctory answers) and disengaged item omissions, and models processes associated with examinee disengagement jointly with ability. To that end, we employ mixture models that identify disengagement at the item-by-examinee level by assuming different data-generating processes underlying item responses, omissions, and RTs associated with engaged and disengaged behaviour. Item-by-examinee mixing proportions are modelled with a latent response framework employing an IRT model. The framework is shown in Figure 1, where the left- and right-hand parts depict the models for disengaged and engaged behaviour, respectively.
Following Wang and Xu (2015), latent engagement indicators $\Delta_{ij}$ denote whether examinee $i$ has engaged in solution behaviour when attempting item $j$ or not, with 0 and 1 indicating disengaged and solution behaviour, respectively. Whether or not examinee $i$ generated an engaged response to item $j$ is not observable. Engaged and disengaged behaviours, however, are assumed to result in different distributions of item responses, omissions, and RTs.

Engaged behaviour
When attempting items in an engaged manner, examinees are assumed to generate engaged responses to all items attempted. That is, if $\Delta_{ij} = 1$ the probability of a correct response on response indicator $u_{ij}$ is assumed to be a function of person ability $\theta_i$ and the item's difficulty $b_j$. In line with the frameworks of analysis implemented in major LSAs such as PISA (OECD, 2017), we present the framework employing a Rasch model for item responses as given by equation (1). Following van der Linden (2007) and Wang and Xu (2015), RTs $t_{ij}$, denoting the time examinee $i$ interacted with item $j$, are assumed to follow a lognormal distribution governed by examinee working speed $\tau_i$ and item time intensity $\beta_j$ when associated with an engaged response:
$$\ln t_{ij} \mid \Delta_{ij} = 1 \sim N(\beta_j - \tau_i, \sigma_E^2). \tag{3}$$
For reasons of simplicity, we assume a common residual variance $\sigma_E^2$ (van der Linden, 2007).
Item omissions are assumed not to occur when examinees are engaged. We therefore fix the probability of observing an item omission to zero if $\Delta_{ij} = 1$. Thus,
$$p(d_{ij} = 1 \mid \Delta_{ij} = 1) = 0.$$
Conversely, this restriction corresponds to the assumption that examinee disengagement is observable in the case an item is omitted.

Disengaged behaviour
When disengaged ($\Delta_{ij} = 0$), we assume that examinees either randomly guess or omit. Whether examinees omit or guess is modelled via an examinee-specific but not item-specific omission probability $o_i$ which describes the probability that examinee $i$ omits ($d_{ij} = 1$) rather than guesses ($d_{ij} = 0$) when attempting an item in a disengaged manner. $o_i$ is modelled as a function of ability $\theta_i$ and speed $\tau_i$ via a logistic regression, thereby allowing for differences in omission behaviour depending on the examinee's ability and speed level:
$$o_i = p(d_{ij} = 1 \mid \Delta_{ij} = 0, \theta_i, \tau_i) = \frac{\exp(\gamma_0 + \gamma_\theta \theta_i + \gamma_\tau \tau_i)}{1 + \exp(\gamma_0 + \gamma_\theta \theta_i + \gamma_\tau \tau_i)}.$$
For observed disengaged responses, the probability of a correct guess is assumed to be determined by a common guessing parameter $c$ (Schnipke & Scrams, 1997; Wang & Xu, 2015):
$$p(u_{ij} = 1 \mid \Delta_{ij} = 0, d_{ij} = 0) = c.$$
Following Schnipke and Scrams (1997), we assume that neither person nor item characteristics affect the distribution of RTs when examinees are disengaged and produce responses by guessing or omit items. Thus, under $\Delta_{ij} = 0$, RTs for all items and examinees are assumed to follow a lognormal distribution governed by a common mean across all items and examinees $\beta_D$ and variance $\sigma_D^2$:
$$\ln t_{ij} \mid \Delta_{ij} = 0 \sim N(\beta_D, \sigma_D^2).$$
In the proposed framework, it is assumed that examinees tend to require less time to interact with an item when disengaged than to read, understand, and generate an engaged response to the item (Wise, 2017). We incorporate this assumption by assuming that all time intensities for the RTs associated with engaged behaviour $\beta_j$ are the sum of the common mean $\beta_D$ and an item-specific, positive offset parameter $\beta_j^*$. That is,
$$\beta_j = \beta_D + \beta_j^*, \quad \beta_j^* > 0.$$
The offset parameter $\beta_j^*$ indicates how much longer examinees need to engage with the item to generate an engaged response rather than to omit or guess.
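To make the logistic regression for the omission probability concrete, the following sketch evaluates it for a few hypothetical examinees. The function name is an assumption of this sketch; the coefficients (intercept 0, ability slope −1, speed slope −10) correspond to data-generating values reported for the simulation study later in the article and serve only as an illustration.

```python
import math


def omission_probability(ability, speed, intercept, slope_ability, slope_speed):
    """Probability of omitting rather than guessing, given disengagement."""
    linear = intercept + slope_ability * ability + slope_speed * speed
    return 1.0 / (1.0 + math.exp(-linear))


# Illustrative coefficients: intercept 0, ability slope -1, speed slope -10.
coefs = (0.0, -1.0, -10.0)

average = omission_probability(0.0, 0.0, *coefs)  # average ability and speed
able = omission_probability(1.0, 0.0, *coefs)     # ability one unit above average
fast = omission_probability(0.0, 0.2, *coefs)     # speed 0.2 units above average
```

Under these coefficients, an average examinee is equally likely to omit or to guess when disengaged, whereas more able and faster examinees are more likely to guess than to omit.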

Higher-order models
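Whether examinee $i$ engaged in solution behaviour when attempting item $j$ is only partially observable; however, it determines the measurement properties of the observed responses and associated RTs. Engagement indicators $\Delta_{ij}$ thus represent latent response variables (Maris, 1995). For the probability that examinee $i$ is engaged when attempting item $j$, $p(\Delta_{ij} = 1)$, we assume a Rasch model with
$$p(\Delta_{ij} = 1 \mid \varphi_i, \iota_j) = \frac{\exp(\varphi_i - \iota_j)}{1 + \exp(\varphi_i - \iota_j)},$$
where $\varphi_i$ denotes examinee $i$'s engagement and $\iota_j$ gives item $j$'s engagement difficulty. Examinee engagement determines whether examinees tend to approach items engagedly. Engagement difficulty determines how easily examinees interact with an item engagedly. All person parameters are assumed to be multivariate normally distributed with mean vector
$$\mu_P = (\mu_\varphi, \mu_\theta, \mu_\tau)$$
and covariance matrix
$$\Sigma_P = \begin{pmatrix} \sigma_\varphi^2 & \sigma_{\varphi\theta} & \sigma_{\varphi\tau} \\ \sigma_{\varphi\theta} & \sigma_\theta^2 & \sigma_{\theta\tau} \\ \sigma_{\varphi\tau} & \sigma_{\theta\tau} & \sigma_\tau^2 \end{pmatrix}.$$
When a Rasch model is employed for responses and engagement indicators, the model can be identified by setting the expectations of $\varphi$, $\theta$, and $\tau$ to zero. Item parameters are modelled as fixed effects.² The proposed model's likelihood can be written as
$$L = \prod_{i=1}^{N} \int\!\!\int\!\!\int \prod_{j=1}^{K} \left[ \begin{aligned} & p(\Delta_{ij} = 1 \mid \varphi_i, \iota_j)\,(1 - d_{ij})\, p(u_{ij} \mid \theta_i, b_j)\, f(t_{ij} \mid \tau_i, \beta_j, \sigma_E^2) \\ & + \big(1 - p(\Delta_{ij} = 1 \mid \varphi_i, \iota_j)\big)\, o_i^{d_{ij}} \big((1 - o_i)\, c^{u_{ij}} (1 - c)^{1 - u_{ij}}\big)^{1 - d_{ij}} f(t_{ij} \mid \beta_D, \sigma_D^2) \end{aligned} \right] g(\varphi, \theta, \tau \mid \mu_P, \Sigma_P)\, d\varphi\, d\theta\, d\tau.$$
As can be seen, the framework allows for mixture distributions of responses and RTs at the item-by-examinee level, with the first row representing the model for engaged and the second the model for disengaged test-taking behaviour. The mixing proportions $p(\Delta_{ij} = 1 \mid \varphi_i, \iota_j)$ and $1 - p(\Delta_{ij} = 1 \mid \varphi_i, \iota_j)$ are modelled as a function of examinee engagement $\varphi_i$ and engagement difficulty parameters $\iota_j$ with an IRT model. $g(\varphi, \theta, \tau \mid \mu_P, \Sigma_P)$ denotes the multivariate normal density of the person parameters. Note that in the case where examinee $i$ omits item $j$, the first row does not contribute to the likelihood function, thereby incorporating the assumption that examinee $i$'s engagement status is observable in the case where $d_{ij} = 1$.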
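The mixture structure at the item-by-examinee level can be sketched in code for a single observation with known person parameters. This is a minimal illustration under stated assumptions, not the authors' Stan implementation: all function and dictionary names are hypothetical, Rasch-type logistic forms are assumed for the response and engagement models, engaged log-RTs are assumed to be normal with mean equal to time intensity minus speed, and the parameter values are chosen for illustration (several of them match data-generating values of the simulation study).

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def lognormal_pdf(t, mu, sigma):
    """Density of t when ln(t) is normal with mean mu and SD sigma."""
    z = (math.log(t) - mu) / sigma
    return math.exp(-0.5 * z * z) / (t * sigma * math.sqrt(2.0 * math.pi))


def sae_likelihood(u, d, t, engagement, ability, speed, item, common):
    """Conditional likelihood of one (response u, omission d, RT t) triple."""
    p_engaged = logistic(engagement - item["engagement_difficulty"])
    g0, g_ability, g_speed = common["omission_coefs"]
    p_omit = logistic(g0 + g_ability * ability + g_speed * speed)
    rt_dis = lognormal_pdf(t, common["rt_mean_dis"], common["rt_sd_dis"])

    if d == 1:
        # Omissions are assumed to occur only under disengagement, so the
        # engaged component contributes nothing here.
        return (1.0 - p_engaged) * p_omit * rt_dis

    p_correct = logistic(ability - item["difficulty"])
    resp_eng = p_correct if u == 1 else 1.0 - p_correct
    time_intensity = common["rt_mean_dis"] + item["time_offset"]
    rt_eng = lognormal_pdf(t, time_intensity - speed, common["rt_sd_eng"])
    resp_dis = common["guessing"] if u == 1 else 1.0 - common["guessing"]

    engaged = p_engaged * resp_eng * rt_eng
    disengaged = (1.0 - p_engaged) * (1.0 - p_omit) * resp_dis * rt_dis
    return engaged + disengaged


# Illustrative parameter values (assumptions of this sketch).
item = {"engagement_difficulty": -3.0, "difficulty": 0.0, "time_offset": 0.75}
common = {
    "rt_mean_dis": 3.0,
    "rt_sd_dis": math.sqrt(1.95),
    "rt_sd_eng": math.sqrt(0.15),
    "guessing": 0.25,
    "omission_coefs": (0.0, -1.0, -10.0),
}

slow_correct = sae_likelihood(1, 0, 45.0, 0.0, 0.0, 0.0, item, common)
omitted = sae_likelihood(None, 1, 3.0, 0.0, 0.0, 0.0, item, common)
```

The full likelihood additionally multiplies such terms over items and integrates over the multivariate normal distribution of the person parameters, which is what the MCMC sampler handles in practice.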

Prior distributions
Bayesian estimation techniques are employed to facilitate model estimation. For the prior distribution for the person parameter variance-covariance matrix $\Sigma_P$, we follow a separation strategy where the correlation matrix $\Omega_P$ and person parameter standard deviations $S_P$ are separated out (Barnard, McCulloch, & Meng, 2000), that is, $\Sigma_P = \operatorname{diag}(S_P)\, \Omega_P\, \operatorname{diag}(S_P)$. Such separation strategies have been shown to yield unbiased parameter estimates of variances and correlations even under conditions with smaller sample sizes (Alvarez, Niemi, & Simpson, 2014). Furthermore, separation strategies circumvent the dependencies between variances and correlations inherent to inverse Wishart priors (Alvarez et al., 2014; Gelman & Hill, 2007). Following recommendations by the Stan Development Team (2017), we employ an LKJ prior (Lewandowski, Kurowicka, & Joe, 2009) with shape 1 for the correlation matrix $\Omega_P$, implying a uniform distribution on the correlation parameters, and half Cauchy priors with location 0 and scale 5 for each element of $S_P$.
Following Fox (2010), we employ diffuse normal priors with mean 0 and standard deviation 10 for all engagement difficulties $\iota_j$, difficulties $b_j$, time intensity offsets $\beta_j^*$, as well as the common mean $\beta_D$ and each element of the vector of logistic regression parameters $\gamma$. For the residual standard deviation of logarithmized engaged RTs $\sigma_E$ and the common standard deviation $\sigma_D$ we suggest diffuse half Cauchy priors with location 0 and scale 5. For the common guessing parameter $c$ we employ a diffuse Beta(1, 1) prior.

Parameter recovery
To investigate estimability of the SA+E model, a simulation study was performed. We addressed two major research questions. First, the simulation study served to investigate whether true parameter values can satisfactorily be recovered under realistic conditions. Second, we aimed to identify boundary conditions concerning the sparseness of information on examinee disengagement for the detection thereof.

Data generation
Data were generated according to the SA+E model, employing R version 3.5.1 (R Development Core Team, 2017). To evaluate model performance under realistic research conditions, data-generating values were chosen to resemble parameter estimates reported in the empirical example below. To identify possible challenging conditions, we varied factors that are relevant for data sparseness in disengaged behaviour. Four variables were manipulated: the number of examinees (250, 500, 1,000), representing low, medium, and large sample sizes per item encountered in LSAs with balanced incomplete block designs (Gonzalez & Rutkowski, 2010); the number of items (10, 20); the rate of disengaged behaviour in the data set of size $N \times K$ (5%, 10%), reflecting rates of disengaged rapid guesses typically found in data from LSAs (Goldhammer et al., 2016; Lee & Jia, 2014) as well as low to medium omission rates (OECD, 2013); and the percentage of omissions as opposed to guessing in disengaged behaviour (10%, 50%, 90%). Since omissions are assumed to occur only when examinees are disengaged, we suspect that sufficiently high omission rates facilitate estimation. Estimation might be more challenging when examinees mainly guess when disengaged, or when guessing is hard to detect due to low incidence.
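Assuming the four manipulated factors are fully crossed, the design grid can be enumerated as follows. The variable names are hypothetical and chosen for this sketch only.

```python
from itertools import product

# Levels of the four manipulated factors.
n_examinees = (250, 500, 1000)
n_items = (10, 20)
disengagement_rate = (0.05, 0.10)
omission_share = (0.10, 0.50, 0.90)  # share of omissions within disengaged behaviour

# Fully crossed design: one tuple per simulation condition.
conditions = list(product(n_examinees, n_items, disengagement_rate, omission_share))
n_conditions = len(conditions)
```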
Our manipulation of variables led to 3 × 2 × 2 × 3 = 36 conditions. For each condition, 50 data sets were generated. Using the mvrnorm function from the MASS package (Venables & Ripley, 2002), person parameters were randomly drawn from a multivariate normal distribution. We set engagement $\varphi$, ability $\theta$, and speed $\tau$ variances to 3.50, 1.00, and 0.05, respectively. Correlations of engagement with ability, cor($\varphi$, $\theta$), and speed, cor($\varphi$, $\tau$), were set to .55 and .20, respectively. The correlation between ability and speed, cor($\theta$, $\tau$), was set to −.40. Such negative correlations between ability $\theta$ and speed $\tau$ indicate that examinees showing higher levels of ability operate at a lower speed level and are rather common for low-stakes LSAs (Goldhammer et al., 2014). For all item parameter types, we considered five different values, stemming from sequences $\{\iota_0 + 0.5l\}_{l=1}^{5}$ for engagement difficulties $\iota$, $\{-1 + 0.5l\}_{l=1}^{5}$ for difficulties $b$, and $\{3 + 0.25l\}_{l=1}^{5}$ for time intensities $\beta$. For tests of length K = 10 and K = 20 these sequences were repeated twice and four times, respectively. To obtain rates of disengaged behaviour of 5% and 10%, $\iota_0$ was set to −5 and −4.25, respectively. This resulted in item-level disengagement rates ranging from 1.11% to 8.20% and from 2.35% to 17.38% under conditions with overall disengagement rates of 5% and 10%, respectively. The logistic regression parameters were set to $\gamma_\theta = -1$ and $\gamma_\tau = -10$. For omission rates in disengaged behaviour of 10%, 50%, and 90%, the intercept was set to $\gamma_0 = -3$, $\gamma_0 = 0$, and $\gamma_0 = 3$, respectively. The probability correct for disengaged responses was set to $c = .25$ for all items. Logarithmized disengaged RTs were drawn from a normal distribution with mean $\beta_D = 3$ and variance $\sigma_D^2 = 1.95$. The common residual variance for logarithmized engaged RTs was set to $\sigma_E^2 = 0.15$.
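The data-generating person covariance matrix and item parameter sequences can be assembled from the values stated above. The sketch below is an assumption-laden illustration rather than the authors' R code: all names are hypothetical, the covariance matrix is built as the product of standard deviations and correlations, and "repeating" a sequence is assumed to mean tiling the five values in order.

```python
import math

# Person parameters: engagement, ability, speed.
names = ["engagement", "ability", "speed"]
variances = {"engagement": 3.50, "ability": 1.00, "speed": 0.05}
correlations = {
    ("engagement", "ability"): 0.55,
    ("engagement", "speed"): 0.20,
    ("ability", "speed"): -0.40,
}


def correlation(a, b):
    if a == b:
        return 1.0
    return correlations[(a, b)] if (a, b) in correlations else correlations[(b, a)]


# Covariance = SD_a * correlation(a, b) * SD_b.
cov = [[math.sqrt(variances[a]) * correlation(a, b) * math.sqrt(variances[b])
        for b in names] for a in names]


def item_values(start, step, test_length):
    """Five values start + step * l (l = 1, ..., 5), tiled to the test length."""
    base = [start + step * l for l in range(1, 6)]
    return base * (test_length // 5)


engagement_difficulty = item_values(-5.0, 0.5, 10)  # 5% disengagement condition
difficulty = item_values(-1.0, 0.5, 10)
time_intensity = item_values(3.0, 0.25, 10)
# Offsets relative to the common disengaged mean of 3 are positive by design.
time_offsets = [value - 3.0 for value in time_intensity]
```

Drawing person parameters would then require a multivariate normal sampler using `cov`, for instance the `mvrnorm` function named in the text.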

Estimation procedure
Bayesian estimation was conducted using Stan version 2.18 (Carpenter et al., 2017), employing the RSTAN package (Guo, Gabry, & Goodrich, 2018) for R version 3.5.1 (R Development Core Team, 2017). For sampling from the posterior distributions, Stan employs the No-U-Turn sampler (Hoffman & Gelman, 2014), an adaptive form of Hamiltonian Monte Carlo sampling (Neal, 2011). Data were analysed employing the SA+E model. On each data set, we ran four Markov chain Monte Carlo (MCMC) chains with 10,000 iterations each, with the first 5,000 employed as warm-up. The number of iterations was chosen based on conclusions drawn from pre-analyses, inspecting potential scale reduction factor (PSRF) values, trace plots, and effective sample sizes (ESSs). Stan code for the SA+E model is provided in the Supporting Information.

Results
Statistical performance was evaluated in terms of convergence and efficiency of the MCMC chains as well as bias and efficiency of parameter estimates. We assessed convergence on the basis of PSRF values. Replications with PSRF values below 1.10 for all parameters were considered as being converged (Gelman & Rubin, 1992; Gelman & Shirley, 2011). The efficiency of the estimation procedure was evaluated by considering ESS (Kass, Carlin, Gelman, & Neal, 1998), indicating the degree of precision with which the empirical mean of the MCMC chains approximates the expected value of the posterior distribution (Lüdtke, Robitzsch, & Wagner, 2018). Following Zitzmann and Hecht (2019), we considered an ESS above 400 for all parameters as sufficient. Table 1 displays proportions of replications with PSRF values below 1.10 as well as ESSs above 400 across all conditions. Convergence rates as indicated by PSRF values below 1.10 were at least 90% under all conditions with K = 20 items. Under conditions with K = 10 and smaller sample sizes (N ≤ 500), however, convergence was challenged, with the lowest convergence rate being .82.
In some cells of the simulation design with N ≤ 500, proportions of replications with ESSs for all parameters higher than 400 were somewhat lower than convergence rates as evaluated on the basis of PSRF values. This indicates that although the chains converged and mixed well, the parameter space was explored rather slowly and more iterations might be needed to ensure good approximation of the posterior mean (Zitzmann & Hecht, 2019). Further assessments of convergence behaviour of replications with PSRF values above 1.10 showed very poor, if any, mixing of the MCMC chains, with PSRF values of up to 573.45, indicating that engaged and disengaged behaviours were not separable. Since the mean across chains that did not show any mixing is not meaningful and non-converged solutions would not be interpreted in practice, we excluded replications with PSRF values exceeding 1.10 from all subsequent analyses.
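For readers unfamiliar with the PSRF, the following sketch computes its basic form for a single parameter from several chains of equal length. It is a simplified illustration (no chain splitting and no degrees-of-freedom correction), so values may differ slightly from those reported by Stan; the function name and the toy chains are hypothetical.

```python
from statistics import mean, variance


def psrf(chains):
    """Basic potential scale reduction factor for one parameter.

    chains: list of equally long lists of post-warm-up draws.
    """
    n = len(chains[0])
    within = mean([variance(chain) for chain in chains])
    between = n * variance([mean(chain) for chain in chains])
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5


# Toy draws: two chains exploring the same region versus two chains
# stuck in different regions of the parameter space.
mixed = [[0.1, -0.2, 0.3, 0.0], [0.2, -0.1, 0.0, -0.3]]
stuck = [[0.0, 0.1, -0.1, 0.0], [5.0, 5.1, 4.9, 5.0]]

converged = psrf(mixed) < 1.10
not_converged = psrf(stuck) > 1.10
```

Chains that settle in different regions produce a large between-chain variance relative to the within-chain variance, which is the pattern behind very large PSRF values.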
To evaluate bias and efficiency of parameter estimates, we assessed the median and 50% ranges of posterior means. Good parameter recovery was found under all conditions with a sufficiently high number of examinees (N = 1,000) and items (K = 20). Under conditions with fewer items and examinees, engagement variance, engagement difficulty, as well as regression parameters for predicting the probability of omitting rather than guessing when being disengaged were sensitive to bias when little information on examinee disengagement was available. All remaining parameters could be recovered without systematic bias across all conditions of the simulation design. Results for these are given in the Supporting Information. As was to be expected, efficiency in parameter estimates as indicated by narrower 50% ranges increased with an increasing number of both examinees and items for all parameter types. In addition, parameters associated with disengagement were estimated more efficiently under conditions with higher omission rates, that is, under conditions with a higher portion of disengaged behaviour being directly observable.
Results for person parameter variances and correlations are given in Figure 2. Engagement variance var($\varphi$) estimates were upwardly biased under conditions with sparse information on examinee disengagement, such that under rather challenging conditions with only 250 examinees, 10 items, and a low disengagement rate of 5% of which only 10% were item omissions, median var($\varphi$) estimates of 4.09 were observed, as compared to the data-generating value of 3.50. However, bias decreased rapidly with an increasing number of examinees as well as higher omission rates, such that under conditions with N ≥ 500, medians of posterior means were extremely close to the true value.
Likewise, under conditions with smaller data sets, engagement difficulties $\iota$ were sensitive to bias for items with low rates of disengaged behaviour, that is, when the true parameter was small (see Figure 3). This effect was further intensified when disengaged behaviour was not directly observable and consisted predominantly of random guesses. Accordingly, parameter estimates for the smallest data-generating value assessed in the simulation study of −4.50 (corresponding to an item-level disengagement rate of only 1%) were most sensitive to bias under conditions with only 10% of disengaged behaviour resulting in item omissions, such that under the condition with only N = 250 examinees and K = 10 items, a median of parameter estimates of −4.74 was observed. Bias decreased rapidly with an increasing number of examinees as well as higher percentages of omissions in disengaged behaviour. Under conditions with N = 1,000 examinees, differences were extremely close to zero for all values of $\iota$ considered. Regression parameters were challenging to estimate under conditions with fewer than K = 20 items (see Figure 4). Under such conditions, highly negative as well as highly positive intercepts $\gamma_0$, resulting in disengaged behaviour consisting mainly of rapid guesses and omissions, respectively, were biased with median parameter estimates ranging from −2.95 to −3.47 and from 3.16 to 3.76, as compared to the true values of −3 and 3, respectively. In addition, slopes for the regression of omission probability on speed, $\gamma_\tau$, were underestimated under conditions with K = 10.

Illustrating the model
To illustrate how the SA+E model differs conceptually from current approaches for identifying examinee disengagement as well as for handling item omissions, we took data from single replications of the simulation study for conditions with a disengagement rate of 10%, of which 50% took the form of item omissions. We compared parameter estimates for the SA+E model with those obtained from models that either model the occurrence of item omissions but assume all observed responses to stem from engaged response processes, or filter disengaged behaviour but assume engagement to be unrelated to ability and item omissions to be ignorable. We chose to compare the SA+E model to the speed-accuracy + omission (SA+O) model and the mixture model for identifying examinee engagement presented by Wang and Xu (2015), representing two recent modelling approaches for omissions and the identification of disengaged guessing behaviour, respectively. Adopting the graphical notation of the SA+E framework, the models are depicted in Figures 5 and 6, respectively. The SA+O model models the omission process according to equation (2). All responses are assumed to stem from engaged response processes and are thus modelled with a Rasch model as in equation (1). Different data-generating processes are assumed for RTs associated with responses and omissions, respectively. RTs associated with responses are modelled as a function of speed and time intensity, as in equation (3). RTs associated with item omissions are modelled analogously, but with a different set of item and person parameters (omission time intensity and omission speed), thereby allowing examinees to operate at different speed levels when generating responses and omitting. The SA+O model assumes a joint distribution for ability, speed, omission propensity, and omission speed.
The Wang and Xu model is a mixture modelling approach for examinee disengagement in terms of guessing. The model assumes a person-specific engagement probability that is constant across items and distinct from ability, that is, $p(\Delta_{ij} = 1) = \pi_i$, with $\pi_i$ denoting examinee i's engagement probability. For responses and RTs, the model assumes models for engaged and disengaged examinees that are equivalent to those assumed in the SA+E framework. When specifying the Wang and Xu model, item omissions were ignored. The Wang and Xu model differs from the SA+E model in the treatment of item omissions as well as in that it assumes engagement probability to be unrelated to ability and constant across items.³ All models were estimated employing the same set-up for model estimation as in the simulation study. To investigate the effects of different test lengths and sample sizes, we varied the number of items (10, 20) and examinees (250, 500, 1,000).
Differences in ability estimates between the SA+O model and the SA+E model (given in Figure 7 as a function of engagement estimated using the SA+E framework as well as the number of item omissions) are close to zero for examinees with high engagement, that is, for examinees who rarely guess or omit items. With increasing disengagement, however, differences in ability estimates between the SA+O and SA+E models increase. These differences result from assuming all responses to be engaged as well as from misspecifying engagement (or omission propensity) by neglecting the fact that disengaged examinees tend not only to omit but also to guess. This is also reflected in the differences in item difficulties (given in Figure 8 as a function of engagement difficulty). Due to assuming all responses to be engaged, difficulties of easy items tend to be overestimated. This effect is especially pronounced for items with higher engagement difficulties, as these tend to be guessed on more often. The Wang and Xu model, too, gives ability estimates for examinees with higher engagement that are very close to those obtained from the SA+E model (see Figure 9). However, due to neglecting the fact that ability and engagement are positively related, ability for examinees with lower engagement is overestimated by the Wang and Xu model. This also results in systematically lower item difficulties (see Figure 10), with differences being higher for easy items. Since ability and engagement are positively correlated in the data-generating model, engaged responses are more likely to be observed for more able examinees, resulting in difficulties being underestimated (Rose, 2013).
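The conceptual difference between a person-specific engagement probability that is constant across items and an engagement probability that varies by item and examinee can be sketched with Bayes' rule for a two-component mixture on log RTs. This is a simplified illustration, not the estimation procedure of either model: it uses RTs only (no responses or omissions), all component means and standard deviations are hypothetical, and the item-by-person prior assumes a logistic form in person engagement minus engagement difficulty.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution, applied here to log response times."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posterior_engaged(log_rt, prior_engaged, eng=(4.5, 0.5), dis=(3.5, 1.2)):
    """Posterior probability of engagement given one log RT (Bayes' rule).

    eng and dis are HYPOTHETICAL (mean, sd) pairs for engaged and
    disengaged log RTs; they are not estimates from the paper.
    """
    w_eng = prior_engaged * normal_pdf(log_rt, *eng)
    w_dis = (1.0 - prior_engaged) * normal_pdf(log_rt, *dis)
    return w_eng / (w_eng + w_dis)

def logistic(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

log_rt = math.log(20.0)  # a 20-second response time

# Person-level prior, constant across items (hypothetical value).
pi_i = 0.90
post_constant = posterior_engaged(log_rt, pi_i)

# Item-by-person prior, ASSUMING P(engaged) = logistic(phi_i - iota_j).
phi_i = 2.0  # hypothetical person engagement
post_by_item = {
    iota_j: posterior_engaged(log_rt, logistic(phi_i - iota_j))
    for iota_j in (-4.5, 0.6)  # low and high engagement difficulty
}

print(f"constant prior: {post_constant:.3f}")
for iota_j, post in post_by_item.items():
    print(f"iota = {iota_j:4.1f}: {post:.3f}")
```

With an item-by-person prior, the same response time yields a lower posterior probability of engagement on an item with a high engagement difficulty than on one with a low engagement difficulty, whereas a constant prior treats both items alike.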
For both model comparisons, differences in ability and item difficulty estimates are similar for different numbers of examinees and items.

Empirical example
To illustrate the use of the SA+E model for detecting and understanding disengagement, we employed data from PISA 2015. We focused on mathematical literacy block number 1, comprising K = 12 items, of which three had an OR format and nine were MC. For reasons of simplicity, we dichotomized partial credit items, scoring partially correct as incorrect. We applied the model to several samples of students from different countries, all of which led to comparable conclusions. As an example, results for the Austrian subset, containing N = 844 examinees, are reported. The data set under consideration had an omission rate of 10.40%. Item-level omission rates ranged from 0.04% for the MC item administered at position 1 to 34.60% for the OR item administered at position 5. An additional 0.48% of responses were missing due to not-reached items. These were ignored in the estimation.

Estimation and model checking
For estimation, the same set-up as in the simulation study was employed. To take into account that the item block contained different item types, we specified item-type-specific probabilities correct when answering perfunctorily ($c_O$) or guessing ($c_M$) on OR and MC items, respectively. In addition, we allowed for item-type-specific regression intercepts $\gamma_{O0}$ and $\gamma_{M0}$ determining the probability of omitting instead of perfunctorily answering or guessing on an item with an OR or MC format, respectively. After 10,000 iterations per chain, the highest PSRF value and lowest ESS were 1.002 and 3,471.55, respectively. Model fit was evaluated employing posterior predictive checks (Gelman & Hill, 2007). For these, we simulated 30 data sets by drawing parameters from the posterior distribution and visually compared observed and simulated proportions correct and omitted as well as distributions of observed and simulated RTs. RT distributions were predicted well by the model. Although overall proportions correct were predicted well by the model, for some items, comparisons of observed and predicted probabilities correct indicate that a more complex measurement model, such as a two-parameter logistic model, might fit the data better. Likewise, for some items, the model underpredicted item omissions for examinees with higher proportions of item omissions. Possible model extensions with less restrictive assumptions concerning measurement models for responses, RTs, and latent response indicators are addressed in the discussion in Section 8. Plots for posterior predictive checks are given in the Supporting Information.
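The logic of such a posterior predictive check can be sketched as follows. The sketch is deliberately reduced: it uses a plain Rasch model for responses only (no omissions, RTs, or engagement mixture), the "observed" data and the "posterior draws" are hypothetical stand-ins generated inside the script, and a printed range replaces the visual comparison used in the paper.

```python
import math
import random

def simulate_rasch(theta, b, rng):
    """Draw a persons-by-items 0/1 matrix from a Rasch model."""
    return [[int(rng.random() < 1.0 / (1.0 + math.exp(-(t - b_j)))) for b_j in b]
            for t in theta]

def proportions_correct(data):
    """Proportion of correct responses per item."""
    n_persons = len(data)
    return [sum(row[j] for row in data) / n_persons for j in range(len(data[0]))]

rng = random.Random(1)
n_persons, n_items, n_draws = 200, 12, 30  # 30 replicated data sets, as in the text

# HYPOTHETICAL stand-ins for the observed data and the fitted model.
theta_true = [rng.gauss(0.0, 1.0) for _ in range(n_persons)]
b_true = [-1.5 + 3.0 * j / (n_items - 1) for j in range(n_items)]
observed = simulate_rasch(theta_true, b_true, rng)

replicated_props = []
for _ in range(n_draws):
    # Fake "posterior draw": true values plus noise (illustration only).
    theta_s = [t + rng.gauss(0.0, 0.3) for t in theta_true]
    b_s = [b_j + rng.gauss(0.0, 0.1) for b_j in b_true]
    replicated_props.append(proportions_correct(simulate_rasch(theta_s, b_s, rng)))

# Compare observed item proportions correct with the simulated range.
observed_props = proportions_correct(observed)
for j in range(n_items):
    sims = sorted(draw[j] for draw in replicated_props)
    print(f"item {j + 1:2d}: observed {observed_props[j]:.2f}, "
          f"simulated range [{sims[0]:.2f}, {sims[-1]:.2f}]")
```

In an actual application, the parameter draws would come from the fitted model's posterior, and the same comparison would be repeated for proportions omitted and for RT distributions.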

Results
Probabilities correct for disengaged guesses on MC items and perfunctory answers on OR items indicate that while examinees correctly guessed with a probability of .23 [.19, .28] on MC items, they were highly unlikely (.11 [.08, .15]) to answer an OR item correctly when answering only perfunctorily. Examinees tended to spend on average $\exp(\beta_D) = \exp(3.54) = 34.53$ s on an item when approaching it in a disengaged manner. At the same time, there was considerable variation in logarithmized RTs associated with disengaged behaviour, with $\sigma^2_D = 1.40$ [1.31, 1.49]. Means of the posterior distribution of person parameter variances and correlations, together with 95% highest density intervals, are displayed in Table 2. More able examinees tended to be more engaged. Furthermore, more engaged as well as more able examinees tended to work at a slower pace when generating engaged responses. The intercepts of the logistic regression predicting the probability of omitting rather than randomly guessing or answering perfunctorily indicate that, when disengaged, examinees with average ability and speed were more likely to guess than to omit on MC items and more likely to omit than to answer perfunctorily on OR items ($\gamma_{M0} = -0.71$ [−0.94, −0.47], $\gamma_{O0} = 0.45$ [0.26, 0.67]). The slopes indicate that examinees with higher ability and higher speed tended to guess or perfunctorily answer rather than to omit when disengaged ($\gamma_\theta = -0.74$ [−0.98, −0.52]; $\gamma_\tau = -4.79$ [−6.04, −3.70]). Item parameters and 95% highest density intervals are depicted in Figure 11. Item numbers for OR items are given in bold type. Examinees were more likely to disengage on more difficult items ($\mathrm{cor}(\iota, b) = .68$) as well as on items with higher time intensity offsets ($\mathrm{cor}(\iota, \beta^*) = .81$). Time intensity offsets $\beta^*$ indicate that examinees tended, on average, to require exp(0.24) = 1.27 to exp(1.73) = 5.65 times longer to generate engaged responses to these items than they tended to interact with items in a disengaged manner.
Engagement difficulty parameters ranged from −6.55 (item 1) to 0.60 (item 5). For these, respectively 0.05% and 59.22% of item-by-examinee interactions were classified as disengaged. Note that for item 5, the model-implied disengagement rate was notably higher than the item-level omission rate of 34.60%. The difference between the expected disengagement rate and the observed omission rate can be attributed to guessing (or perfunctory answers), illustrating that examinees both omitted and answered perfunctorily when disengaged. This is also illustrated in an overall expected disengagement rate of 18.23% as compared to the omission rate of 10.40%.
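To make the reported regression intercepts (−0.71 for MC items, 0.45 for OR items) easier to interpret, they can be converted into probabilities of omitting given disengagement. The sketch below assumes that the regression has the usual logistic form with an intercept plus slopes for ability and speed; the function and constant names are hypothetical. Speed is held at zero throughout because its scale is not given in this section.

```python
import math

def logistic(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

# Posterior means reported in the text.
GAMMA_M0 = -0.71    # intercept, MC items
GAMMA_O0 = 0.45     # intercept, OR items
GAMMA_THETA = -0.74  # slope for ability
GAMMA_TAU = -4.79    # slope for speed

def p_omit(intercept: float, theta: float = 0.0, tau: float = 0.0) -> float:
    """P(omit | disengaged) under an ASSUMED logistic regression form."""
    return logistic(intercept + GAMMA_THETA * theta + GAMMA_TAU * tau)

# Examinee with average ability and speed (theta = tau = 0).
p_omit_mc = p_omit(GAMMA_M0)  # about 0.33: guessing more likely than omitting
p_omit_or = p_omit(GAMMA_O0)  # about 0.61: omitting more likely than answering

print(f"MC items: P(omit | disengaged) = {p_omit_mc:.2f}")
print(f"OR items: P(omit | disengaged) = {p_omit_or:.2f}")

# A more able examinee (theta = 1, hypothetical value) omits less often.
print(f"MC items, theta = 1: {p_omit(GAMMA_M0, theta=1.0):.2f}")
```

Values below .5 mean that a disengaged examinee is more likely to produce a response (guess or perfunctory answer) than to omit; values above .5 mean the opposite.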
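As a rough worked example, the share of disengaged behaviour that did not surface as omissions can be read off as the difference between the reported rates. This is only an approximation: it assumes, in line with the model, that every omission is classified as disengaged and that both rates refer to the same set of item-by-examinee interactions.

```latex
\begin{align*}
\text{overall:}\quad 18.23\% - 10.40\% &= 7.83 \text{ percentage points} \\
\text{item 5:}\quad 59.22\% - 34.60\% &= 24.62 \text{ percentage points}
\end{align*}
```

Under these assumptions, roughly 8 percentage points overall and roughly 25 percentage points on item 5 would reflect disengaged guesses or perfunctory answers rather than omissions.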

Discussion
The SA+E model presented in this paper brings together research on modelling examinee engagement and research on missing values and provides a framework for identifying and modelling examinee disengagement in terms of both random guesses and perfunctory answers as well as in terms of omissions. By employing a latent response approach with engagement probabilities modelled as a function of person and item parameters, the model allows for classifying disengaged behaviour at the item-by-examinee level as well as for assessment of item and examinee characteristics associated with such behaviour. In addition, the model allows for differences in disengaged test-taking behaviour across examinees by regressing the probability of omitting rather than randomly guessing or answering perfunctorily on ability and speed. The SA+E framework complements and refines recent approaches for examinee disengagement as well as non-ignorable item omissions. Compared to RT-based scoring methods separating engaged and disengaged responses and/or item omissions by defining RT thresholds (Frey et al., 2018; Lee & Jia, 2014; Wise & DeMars, 2006), the SA+E framework comes with less strict assumptions concerning RT distributions associated with engaged and disengaged behaviour since these are allowed to overlap. Compared to previous model-based approaches for identifying disengaged examinee behaviour (Meyer, 2010; Pokropek, 2016; Schnipke & Scrams, 1997; Wang & Xu, 2015), the model allows disengaged behaviour to vary across both items and examinees while considering engagement when estimating ability. In this regard, the model also adds to a broader class of models that employ mixture modelling to identify differences in examinee behaviour. Few model-based approaches for detecting differences in examinee behaviour allow for differences at the item-by-examinee level (Erosheva, 2002; Molenaar & de Boeck, 2018; Pokropek, 2016). These, however, do not allow these differences to be related to different levels of ability. In this context, the proposed framework can be adapted to suit other applications and further model developments seeking to identify behavioural differences at the item-by-examinee level while modelling the underlying processes jointly with ability.
We illustrated the model's advantages by showing that ability estimates and item difficulties can be biased when neglecting the fact that examinees tend to omit and guess when disengaged, that engagement is related to ability, and that engagement probabilities tend to vary across items. Our findings corroborate findings from previous studies on ignoring guessing behaviour and item omissions as well as on neglecting the relationship between engagement and ability (Pohl et al., 2014; Pokropek, 2016; Rios et al., 2017; Rose et al., 2010; Wang & Xu, 2015).
The model yields unbiased and efficient parameter estimates under conditions with at least N = 500 examinees and K = 20 items even under disengagement rates of as low as 5% and unbalanced proportions of item omissions and guesses for disengaged behaviour. Under conditions with fewer items or examinees, low disengagement rates pose a threat to obtaining unbiased and efficient parameter estimates. We therefore recommend applying the model to smaller data sets with N < 500 or K < 20 only when omission rates are high, that is, at least 5%. Due to the model's complexity, convergence might be more challenging to achieve under conditions with few items and examinees.
When no convergence can be reached, it is likely that disengaged behaviour predominantly consists of item omissions (e.g., for tests with complex item formats where observed responses are unlikely to stem from guesses or perfunctory answers). In such cases, model-based approaches for modelling omission processes (Holman & Glas, 2005; Ulitzsch et al., 2019) pose a less complex alternative to the SA+E model. When seeing item omissions as indicators of disengaged behaviour, omission propensity in model-based approaches for item omissions is equivalent to the engagement variable, with examinee disengagement manifesting itself only in item omissions, while all observed responses are assumed to stem from engaged response processes. Under such assumptions, examinee engagement would be fully observable, with engagement indicators $\Delta_{ij}$ corresponding to the negation of omission indicators, $1 - d_{ij}$ (see the Supporting Information). Likewise, when no omissions occurred, the model can easily be simplified by assuming disengaged behaviour to result in guessing only while still jointly modelling engagement, ability, and speed (see the Supporting Information).
In the empirical example, we found examinee engagement and ability to be related. At the same time, the only moderate correlation between engagement and ability provides supporting evidence that engagement and ability represent different constructs. In addition, we found engagement to vary considerably across items and examinees. Items that were more complex in terms of difficulty and time intensity were found to evoke disengagement more easily. This is in line with findings from previous studies employing threshold methods for identifying examinee disengagement (Lee & Jia, 2014; Wise et al., 2009). In addition, we illustrated that both item omissions and guessing are prevalent in LSA data and thus both need to be considered.

Limitations and future directions
In the SA+E model, identifying examinee disengagement is facilitated by assuming that all item omissions stem from examinee disengagement while allowing for observed responses to stem from either solution or guessing behaviour. Thus, similarly to previous model-based approaches for item omissions (Holman & Glas, 2005; Ulitzsch et al., 2019), the SA+E model assumes that all item omissions stem from the same data-generating process. The model does not allow for engaged item omissions, which might occur when examinees omit items after seriously reading and considering the item. Such mechanisms have been discussed as plausible (Becker & Pohl, 2016; Mislevy & Wu, 1996; Robitzsch, 2014). Extending the model to allow for different omission mechanisms is therefore a pertinent topic for future research.
Furthermore, examinees might not work with a constant level of engagement throughout the test. While some examinees might be disengaged throughout the whole test, it is easy to imagine that others might be engaged at the beginning of the assessment but become more disengaged towards the end. Non-stationarity of person variables can in principle be incorporated by adding additional linear or nonlinear terms (see Fox & Marianti, 2016, for an extension of the speed-accuracy model that allows for varying speed across the test).
Although the SA+E model allows for the occurrence of item omissions to vary across examinees when these approach an item in a disengaged manner, it is still rather restrictive in that it assumes the probability of omitting rather than guessing to be a function of ability and speed only. A variety of other examinee- or item-specific factors such as demographic variables or item features might determine disengaged test-taking strategies. Considering these therefore constitutes a promising extension of the SA+E model.
The proposed model assumes examinee disengagement to result in random guesses, perfunctory answers, and item omissions. Examinee disengagement can, however, manifest itself in a variety of test-taking behaviours different from those considered in the proposed model. Examinees could, for instance, still employ solution strategies on an item but just try less hard (Debeer & Janssen, 2013) or still use their ability to some extent for differentiating among responses while guessing (San Martín, del Pino, & de Boeck, 2006). In its most extreme form, examinee disengagement might result in quitting the assessment altogether. In fact, examinees who spend only a short time on a test without reaching the time limit or the end of the test are more likely to guess on the items they attempted (Cao & Stokes, 2008). A model for not-reached items due to quitting has been proposed by Ulitzsch, von Davier, and Pohl (in press). Integrating research on modelling quitting behaviour with research on examinee disengagement would enrich research on examinee disengagement as well as provide further insights into examinee test-taking behaviour.
Assessing the joint distribution of person variables yields valuable insights into examinee behaviour. In addition, relating engagement to, for example, demographic variables or personality can provide additional insight into possible reasons for examinee disengagement and help identify groups of persons with a high prevalence of disengagement. For instance, omission propensity has been shown to be relatively stable across different domains and to be related to demographic variables such as gender (Köhler et al., 2015a). Similar effects could be expected for examinee engagement. Furthermore, relating examinee engagement to self-reported test-taking motivation, as for example administered in PISA (OECD, 2017), could be used to validate the assumptions made in the proposed model.
The model was presented employing a Rasch model for item responses as well as its RT equivalent, with time discrimination parameters fixed to be the same across all items. Although Rasch modelling is in accordance with the analysis frameworks of major LSAs (OECD, 2017; Pohl & Carstensen, 2012), such assumptions might not always hold for the data at hand. In fact, in the empirical example, posterior predictive analyses revealed that less restrictive measurement models for responses might indeed fit the data better. Implementing these into the model, however, is not trivial since it can be challenging to distinguish engaged and disengaged responses on items with low item discrimination. Such model extensions therefore remain a task for future research.
Similarly, in future research, the model could be extended to include more complex measurement models for latent response indicators. For these, the SA+E model assumes a unidimensional Rasch model. This assumption is likely to be violated when examinees differ in the level of engagement with which they approach different types of items. Previous research suggests that this might indeed be the case. Omission behaviour, for instance, has often been found to differ for items with a simple MC format and items with a more complex response format (Köhler et al., 2015b; Koretz, 1993). Likewise, in the empirical example, for some items, the model underpredicted item omissions for examinees with a higher number of omissions. In this context, specifying a multidimensional measurement model for latent response indicators might model disengaged test-taking behaviour more adequately.
In addition, Molenaar, Bolsinova, and Vermunt (2018) have shown that violations of the lognormal assumptions for RTs may jeopardize correct classifications in mixture IRT models employing RTs for identifying differences in examinee behaviour. As a solution, a semi-parametric approach based on categorizing RTs has been suggested that can easily be integrated with the SA+E framework.
In the current paper, Bayesian techniques were employed for model estimation. Although this yielded good parameter recovery with sample sizes as low as N = 500, estimation was rather time-intensive: under the conditions with the largest data sets (N = 1,000, K = 20), estimation took approximately 24 hours. Research questions in educational research often concern multiple groups, constructs, or points in time, and involve larger data sets. Bayesian estimation might thus not always be feasible. With technical and algorithmic advances, we expect this to be resolved. Until then, future research may also consider the feasibility of maximum likelihood estimation for the proposed model. Here the challenge will be to obtain convergence and valid solutions when the prevalence of item omissions and guessing is low on some items.