Graduation Year


Document Type




Degree Name

Doctor of Philosophy (Ph.D.)

Degree Granting Department

Educational Measurement and Research

Major Professor

Yi-Hsin Chen, Ph.D.

Committee Member

Robert Dedrick, Ph.D.

Committee Member

John Ferron, Ph.D.

Committee Member

Danielle Dennis, Ph.D.

Committee Member

Stephen Stark, Ph.D.


Keywords

DIF, HGLM, PIRLS, Rasch model, reading comprehension, simulation study


Hierarchical generalized linear modeling (HGLM) has become increasingly popular for detecting differential item functioning (DIF) in multilevel contexts because of its flexibility and efficiency. This dissertation investigates and compares four HGLM DIF detection models under various DIF conditions with multilevel data. Three studies were conducted: two simulation studies and one empirical study. Simulation Study One compared the performance of the two-level HGLM and the three-level HGLM with an individual-level covariate (HGLM-IL) in detecting individual-level DIF items with two-level and three-level data structures. Simulation Study Two compared three three-level HGLM DIF detection models with three-level data: HGLM-IL, HGLM with both individual-level and cluster-level covariates (HGLM-both), and HGLM with individual-level, cluster-level, and cross-level covariates (HGLM-full). Admissible solution rates (ASRs), the rates at which positive, finite variances of parameter estimates were obtained, were examined. Type I error and statistical power of DIF detection, as well as the power of the fit indices, were the outcome variables in the two simulation studies. ANOVA with generalized eta-squared was used to evaluate the impact of the design factors on these outcome variables. Study Three was an empirical DIF analysis. Reading comprehension data were downloaded from the Progress in International Reading Literacy Study (PIRLS), and the four HGLM DIF detection models used in the simulation studies were applied to detect individual-level DIF items in the PIRLS 2016 reading comprehension tests.
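As an illustration of the two detection outcomes named above, the following minimal sketch computes an empirical Type I error rate and power from simulated DIF flags. The function name, argument layout, and toy data are hypothetical and are not taken from the dissertation; the sketch assumes only the standard definitions (Type I error as the proportion of DIF-free items flagged, power as the proportion of true DIF items flagged).

```python
def rejection_rates(flags, is_dif):
    """Return (type_i_error, power) pooled over replications.

    flags  : list of replications; each is a list of booleans, True when
             the item was flagged as DIF in that replication (hypothetical layout).
    is_dif : list of booleans giving the simulated truth for each item.
    """
    false_pos = true_pos = n_null = n_dif = 0
    for rep in flags:
        for flagged, dif in zip(rep, is_dif):
            if dif:
                n_dif += 1
                true_pos += bool(flagged)   # correct detection
            else:
                n_null += 1
                false_pos += bool(flagged)  # false alarm
    type_i = false_pos / n_null if n_null else float("nan")
    power = true_pos / n_dif if n_dif else float("nan")
    return type_i, power


# Toy example: 2 replications, 4 items, only the last item has true DIF.
truth = [False, False, False, True]
sim_flags = [
    [True, False, False, True],
    [False, False, False, True],
]
type_i, power = rejection_rates(sim_flags, truth)
```

In this toy run, one of six DIF-free item tests is flagged and the true DIF item is flagged in both replications.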

Results indicated that when the true data had two levels (i.e., item level and individual level), the ASRs for the two-level HGLM were satisfactory, reaching 100%, while the ASRs for HGLM-IL were less than 50% under most conditions. The ASRs of all four HGLM DIF models were satisfactory when the data had three levels (i.e., item level, individual level, and cluster level). All four HGLM DIF models controlled Type I error adequately around the nominal level, except that the two-level HGLM had inflated Type I error rates when the data had three levels. The power of HGLM-full was lower than that of the other three models. The power of the fit indices (e.g., AIC, BIC) was around .50 when the two-level HGLM and HGLM-IL were compared. When the three three-level HGLM DIF models were compared, the fit indices rarely selected HGLM-full. The empirical study showed that both fit indices selected HGLM-both. Moreover, the two-level HGLM and HGLM-IL identified exactly the same 11 DIF items, and HGLM-both identified 10 of those items. HGLM-full found fewer DIF items, two of which were not identified by the other three HGLM models.

Computing the intraclass correlation (ICC) before selecting an HGLM DIF detection model is highly recommended, even when the grouping variable of interest is at the individual level. With two-level data (i.e., ICC = 0), the two-level HGLM DIF model is recommended for detecting individual-level DIF items because of its adequate Type I error control and high ASRs. When the data have three levels (i.e., ICC ≠ 0), both the three-level HGLM-IL and HGLM-both DIF models are recommended because of their adequate Type I error control and ASRs. Note that all HGLM DIF models fail to achieve satisfactory power when the number of clusters (CN), cluster size (CS), and DIF magnitude are small (e.g., CN = 30, CS = 6). Finally, fit indices should be used with caution for HGLM DIF model selection. Comparing the DIF items flagged by several plausible HGLM models under the specific data conditions may be more reliable.
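The ICC check recommended above can be sketched as follows. The abstract does not state how the ICC was computed, so this example assumes one common convention for a logistic random-intercept model: the latent-scale ICC, in which the lowest-level residual variance is fixed at π²/3. The function name and the example variance are hypothetical.

```python
import math


def latent_icc(cluster_variance):
    """Latent-scale ICC for a logistic random-intercept model (assumed convention).

    cluster_variance : estimated variance of the cluster-level random
                       intercept; must be non-negative.
    """
    if cluster_variance < 0:
        raise ValueError("variance must be non-negative")
    residual = math.pi ** 2 / 3  # variance of the standard logistic distribution
    return cluster_variance / (cluster_variance + residual)


# No cluster-level variance gives ICC = 0 (effectively two-level data).
icc_none = latent_icc(0.0)
# A hypothetical cluster variance of 0.5 gives a nonzero ICC.
icc_some = latent_icc(0.5)
```

Under this convention, an ICC of zero corresponds to the two-level case described above, and a nonzero value indicates a cluster level that the model may need to represent.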