When you learn a new language, you’ll find that some of the new vocabulary is easy to memorise, and some of it more difficult. In several recent studies we looked at retrieval practice data from Dutch secondary school students learning English and French words and phrases (van der Velde et al., 2021, 2023). In one of these studies, we found systematic differences in memorability between different vocabulary items. For instance, the English word pilot turned out to be very easy to memorise, perhaps because it is so similar to its Dutch counterpart piloot, while learners had much more trouble making the word contradiction (Dutch: tegenstelling) stick.
Learning a new language requires memorising thousands of vocabulary items. Knowing which items are difficult to master and which aren’t can help us understand and improve the learning process. For example, teachers can spend additional time with their students on the most difficult material. Similarly, an adaptive learning system can automatically give learners more opportunities to practise the most challenging items, so that they’ll remember those difficult words just as well as easier ones.
Vocabulary Map
Given the large number of words and phrases that students learn, it can be challenging to glean insights from their practice data. Which items are difficult to memorise? Are certain types of word harder to learn than others?
This is a map of English-language vocabulary items that Dutch secondary school students learn. On this map, each vocabulary item is represented by a point. Its colour describes its memorability, specifically its predicted rate of forgetting (see our paper for details on how we calculate this value). A higher rate of forgetting means that we’d expect the item to be forgotten more quickly after being studied, while a lower rate of forgetting means that the item is easier to memorise. These rates of forgetting are based on the retrieval practice data of about 140,000 learners. Vocabulary items that are similar in meaning will appear closer together on the map, while dissimilar items appear further apart. To define similarity, I have converted each word into a so-called embedding, a collection of numbers that forms a mathematical description of its meaning. The more similar the embeddings of two items are, the more similar the meaning of the associated words is likely to be, and the closer together they appear on the map (1).
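To make "similar embeddings" a little more concrete, here is a minimal sketch in Python. It uses cosine similarity, which is one common way to compare embeddings and not necessarily the measure behind the map. The four-dimensional vectors are made up for illustration; real fastText embeddings have three hundred dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings, invented for this example only.
toy_embeddings = {
    "finger": [0.9, 0.8, 0.1, 0.0],
    "ear":    [0.8, 0.9, 0.2, 0.1],
    "pilot":  [0.1, 0.0, 0.9, 0.8],
}

body_pair = cosine_similarity(toy_embeddings["finger"], toy_embeddings["ear"])
mixed_pair = cosine_similarity(toy_embeddings["finger"], toy_embeddings["pilot"])

print(f"finger ~ ear:   {body_pair:.2f}")
print(f"finger ~ pilot: {mixed_pair:.2f}")
```

With these toy values, the two body-related words score much higher than the unrelated pair, which is the kind of relationship that places items near each other or far apart on the map.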
Even without knowing which vocabulary item is represented by each point, we can already see some interesting structure in this map. For instance, there are groups of clustered points in several places: vocabulary items that are tightly connected in some way. Some areas of the map also appear to feature more difficult items than other areas.
Below you see the same kind of map for French-language vocabulary. This map more clearly shows several clusters that are populated by particularly easy or difficult items.
In the visualisations below, I have added text labels to each of the points on the map, so that we can better understand the structure.
For example, the English-language map (left) contains a cluster of words related to the body, in which we can see that certain words (sprain, bruise, throat) are relatively difficult to memorise, while others (lungs, finger, ear) are easier. On the French map (right), we find this cluster of first-person singular verbs, which also shows substantial variability in difficulty:
As we can see from these examples, the memorability of a second-language vocabulary item depends on more than its location in semantic space alone. Indeed, it’s very unlikely that any single factor would fully explain the memorability of a word or phrase. We know, for instance, that word recall can be affected by lexical features like word length and frequency, as well as by semantic features such as concreteness and animacy (Madan, 2021). In addition, second-language learners may find words that are more recognisable to them, such as pilot in the example above, easier to memorise than words containing unfamiliar sounds or patterns (Hulstijn, 2001). Nevertheless, plotting this kind of language map does give us insight into the relative difficulty of (sets of) vocabulary items related to a given topic, which students, teachers, and instructional designers can use to make learning a new language just that little bit easier.
(1) A note on embeddings: By training statistical models on large volumes of written text, we can obtain a detailed numerical description of the linguistic context in which each word tends to appear. These descriptions are called word vectors or embeddings. According to the idea of distributional semantics, the meaning of a word is derived from the context in which it occurs. In the words of linguist John Rupert Firth: “You shall know a word by the company it keeps” (1957). By encoding a word’s context, an embedding therefore tells us something about its meaning. Two words which both occur in comparable contexts (such as glass and bottle, which both show up in the same place in sentences like She ordered a ___ of wine) will have similar embeddings. For this analysis, I used the fastText embeddings trained on a vast amount of text from Common Crawl and Wikipedia (Grave et al., 2018). I used Uniform Manifold Approximation and Projection (UMAP; McInnes, Healy, & Melville, 2020) as implemented in the R package uwot (Melville, 2022) to reduce the high-dimensional embeddings from three hundred dimensions to two dimensions, so that the data can be plotted. You can find my code here.
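The distributional idea in this note can be illustrated with a small count-based sketch in Python. This is not how fastText is trained, and it is not the code linked above. It only shows that words used in the same contexts end up with similar vectors. The toy corpus, the `context_vector` helper, and the window size of two are all assumptions made for this example.

```python
import math
from collections import Counter

# Hypothetical toy corpus; real embeddings are trained on vastly more text.
sentences = [
    "she ordered a glass of wine",
    "she ordered a bottle of wine",
    "he poured a glass of water",
    "he poured a bottle of water",
    "the pilot landed the plane",
    "the pilot flew the plane",
]

def context_vector(target, sentences, window=2):
    """Count the words that appear within `window` positions of `target`."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            if word != target:
                continue
            start = max(0, i - window)
            neighbours = words[start:i] + words[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

glass = context_vector("glass", sentences)
bottle = context_vector("bottle", sentences)
pilot = context_vector("pilot", sentences)

print("glass ~ bottle:", round(cosine_similarity(glass, bottle), 2))
print("glass ~ pilot: ", round(cosine_similarity(glass, pilot), 2))
```

In this toy corpus, glass and bottle occur in identical contexts and so receive matching count vectors, while pilot shares no context words with either.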