Mark My Words! On the Automated Prediction of Lexical Difficulty for Foreign Language Readers

Tack, Anaïs
Ph.D. Thesis
UCLouvain & KU Leuven
The goal of this doctoral research is to automatically predict which words in a text are difficult for non-native speakers. This prediction is crucial because good text comprehension is strongly determined by vocabulary knowledge. If a text contains too high a percentage of unknown words, the reader is likely to struggle to understand it. To support the non-native reader well, we must first be able to predict the number of difficult words. Usually, this is done manually, based on expertise or prior vocabulary tests. However, such methods are not practical for reading in a computer-based environment such as a tablet or an online learning platform. In these cases, the predictions need to be automated.
The thesis is divided into three parts. The first part contains a systematic review of the relevant scientific literature. The synthesis covers 50 years of research and 140 peer-reviewed publications on the statistical prediction of lexical competence in non-native readers. Among other things, the analyses show that the literature is split into two fields of research that have little connection with each other. On the one hand, there is a long tradition of experimental research in second language acquisition (SLA) and computer-assisted language learning (CALL). These experimental studies mainly test the effect of certain factors (e.g., repeating difficult words or adding electronic glosses) on the learning of unknown words during reading. On the other hand, recent studies in natural language processing (NLP) rely on artificial intelligence to automatically predict difficult words.
Moreover, the literature review identifies three limitations that were further studied in this doctoral research. The first limitation is the lack of contextualized measures and predictions. Although research shows that the context in which a word occurs is an important factor, predictions are often based on vocabulary tests that present words in isolation, among other things. The second limitation is the lack of personalized measures and predictions. Although SLA research has shown that there are many differences among non-native readers, recent studies in artificial intelligence make predictions based on aggregate data. The final limitation is that the majority of studies (74%) focus on English as a foreign language. Consequently, this doctoral research adopts a contextualized and personalized approach and focuses on Dutch and French as foreign languages.
The second part examines two measures of lexical difficulty for non-native readers. On the one hand, it investigates how words are introduced in didactic reading materials labeled with CEFR levels. This study introduces a new graded lexical database for Dutch, namely NT2Lex (Tack et al., 2018). The innovative feature of this database is that the frequency per difficulty level was calculated for each word sense, disambiguated based on the sentence context. However, the results show important inconsistencies in how etymologically related translations occur in the Dutch and French databases. Therefore, this difficulty measure does not yet seem valid as a basis for an automated system. On the other hand, the second part investigates how non-native speakers themselves perceive difficult words during reading. The perception of difficulty is important to predict because the learner’s attention is a determining factor in the learning process (Schmidt, 2001). The study introduces new data for readers of French. An important goal of these data is to make correct predictions for all words in the text, which contrasts with SLA studies that focus on a limited number (Mdn = 22) of target words in the text. Moreover, the analyses show that the data can be used to develop a personalized and contextualized system.
The third part examines two types of predictive models developed on the aforementioned data, namely mixed-effects models and artificial neural networks. The results validate the idea that perceptions of lexical difficulty can be predicted primarily on the basis of "word surprisal", a central concept in information theory. Furthermore, the analyses show that commonly used performance statistics (such as accuracy and F-score) are sensitive to individual differences in rates of difficulty. Because these statistics are therefore not appropriate for comparing predictions across learners, the D and Phi coefficients are used instead. Moreover, the results clearly show that a personalized model makes significantly better predictions than a non-personalized model. The results also show that a contextualized model discriminates difficulty better, although these improvements are not always significant for each learner.
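Two of the quantities mentioned above can be illustrated with a minimal Python sketch. It is not the implementation used in the thesis: the function names, the toy probabilities, and the two hypothetical learners are assumptions made purely for illustration. Surprisal is computed as the negative base-2 logarithm of a word's probability in context, and the Phi coefficient is computed from a 2x2 table of predicted versus reported difficulty. The D coefficient is left out because the abstract does not define it.

```python
import math


def surprisal(probability: float) -> float:
    """Surprisal in bits: -log2 P(word | context)."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)


def confusion_counts(gold, predicted):
    """Count (tp, fp, fn, tn) for binary labels, where 1 means 'difficult'."""
    tp = fp = fn = tn = 0
    for g, p in zip(gold, predicted):
        if g == 1 and p == 1:
            tp += 1
        elif g == 0 and p == 1:
            fp += 1
        elif g == 1 and p == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    """Proportion of words whose label was predicted correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def phi(tp, fp, fn, tn):
    """Phi coefficient of the 2x2 table; defined as 0.0 when a margin is empty."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical learners (assumed data): A reports 5% of 100 words as
# difficult, B reports 40%. The toy model never predicts difficulty.
gold_a = [1] * 5 + [0] * 95
gold_b = [1] * 40 + [0] * 60
never_difficult = [0] * 100

counts_a = confusion_counts(gold_a, never_difficult)
counts_b = confusion_counts(gold_b, never_difficult)

print("surprisal of a word with P = 0.125:", surprisal(0.125))
print("learner A: accuracy", accuracy(*counts_a), "phi", phi(*counts_a))
print("learner B: accuracy", accuracy(*counts_b), "phi", phi(*counts_b))
```

Under these assumed data, a model that never predicts difficulty reaches an accuracy of 0.95 for the first learner and 0.60 for the second, while Phi is 0 for both. This is one way to see why accuracy alone is hard to compare across learners with different rates of difficulty.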