Graduation Year
2022
Document Type
Dissertation
Degree
Ph.D.
Degree Name
Doctor of Philosophy (Ph.D.)
Degree Granting Department
Mathematics and Statistics
Major Professor
Kandethody Ramachandran, Ph.D.
Committee Member
Chris Tsokos, Ph.D.
Committee Member
Lu Lu, Ph.D.
Committee Member
Howard Goldstein, Ph.D.
Keywords
Multivariate Adaptive Regression Splines (MARS), Ensemble Methods, Shrinkage Methods, Tree-based Modeling, Lexical Characteristics
Abstract
Poor methodological and statistical practices can lead to unreliable results. Collaboration between statisticians and researchers can remedy this. Early education intervention research rarely uses advanced statistical techniques. Within early education, vocabulary instruction has been well studied, yet outcomes continue to be underwhelming. The specialized knowledge and expertise statisticians possess have the potential to enhance word learning research through sophisticated analyses that are not commonly used. Choosing vocabulary words for instruction can be a daunting task and is highly subjective. To aid the selection process, researchers use a word selection framework that groups words into three tiers. Even with words organized into these tiers, there is still considerable variability when selecting words for instruction. Other factors could be related to word learning, and these, combined with a word’s tier, would better organize words for instruction. Recent research has examined the lexical characteristics that influence children’s word learning and recognition. Multivariate linear regression and stepwise regression are two common statistical analyses used to model these relations. These models can be appropriate in certain situations, but the assumptions they rely on may not be satisfied in the context of word learning models. Interdisciplinary collaboration between statisticians and word learning researchers could lead to more appropriate modeling approaches that better describe the influence of lexical characteristics on word learning.
The purpose of this three-part dissertation is to advance word learning research by implementing sophisticated statistical techniques that are not commonly used. (i) First, we introduced and compared the theoretical frameworks of statistical and machine learning techniques, such as shrinkage methods and ensemble learning, that can be applied to word learning data. (ii) The performance of these advanced techniques was compared using fit measures and an example subset of the data. We demonstrated why multivariate adaptive regression splines (MARS) is a better choice for a robust word learning model by comparing it to advanced statistical and machine learning techniques, as well as to methods typically used by education researchers, such as multivariate linear regression and stepwise regression. (iii) Three word learning datasets were modeled using MARS to examine the relations among lexical characteristics and children’s word learning. This was done to see if results were consistent with the first analysis and to determine the differential effects lexical characteristics had on word learning across grade levels. Words were characterized by various lexical factors, including age of acquisition, word frequency, level of concreteness, neighborhood density, and phonotactic probability. Compared to multivariate linear regression and stepwise regression, the different statistical and machine learning techniques performed well, but MARS proved to be superior for its balance of accuracy and interpretability. Results indicated that age of acquisition and level of concreteness were the most relevant predictors of word learning. Children had difficulty learning words that were rated older than their age and that were highly abstract. The points at which learning declined appeared to shift as children aged. By examining the hinge points in the fitted MARS models, we can determine these thresholds for word learning.
Using the final models for each grade level, we can predict the number of students expected to learn a given word based on its lexical characteristics. This information can be used to systematically organize vocabulary targets into an optimal sequence for instruction.
Scholar Commons Citation
Sanders, Houston T., "Effective Statistical and Machine Learning Methods to Analyze Children's Vocabulary Learning" (2022). USF Tampa Graduate Theses and Dissertations.
https://digitalcommons.usf.edu/etd/10353