Today, talking to your computer is no longer reserved for science fiction movies or imaginary friends. The development of deep learning technologies and the rapid increase in computing power have paved the way for the emergence of a technology that allows a computer to convert human speech into written text in real time: automatic speech recognition (ASR). At the same time, the global COVID-19 pandemic accelerated the shift to remote and hybrid learning in educational settings, making digital learning solutions more important than ever. The potential of ASR in such applications appears to be limitless: Imagine a world where interactive conversations with virtual tutors feel as natural as chatting with a friend. As it facilitates human-like interactions, intelligent speech technology is starting to be applied in today’s classroom. Will intelligent speech technology revolutionize the way we learn?
How does an ASR system ‘understand’ speech?
In order to transcribe human speech to text, a typical ASR system first extracts a number of characteristics, such as fluctuations in pitch, amplitude or rhythm, from a raw speech signal. These features summarize an utterance in a series of numbers that can be interpreted by the computer. Next, ASR systems use a deep learning model to associate the speech characteristics with specific written words. This so-called acoustic model is trained on large amounts of data, in order to recognize which patterns in pitch, rhythm or amplitude are indicative of specific spoken words.
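The first step described above — summarizing an utterance as a series of numbers — can be illustrated with a minimal sketch. The two features below (frame energy as a proxy for amplitude, and zero-crossing rate as a crude proxy for pitch) are simplified stand-ins: real ASR front ends typically compute log-mel spectrograms or MFCCs.

```python
import numpy as np

def extract_features(signal, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Summarize a raw waveform as one small feature vector per frame
    (a minimal sketch; real systems use richer spectral features)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    features = []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        frame = signal[start:start + frame_len]
        energy = float(np.sqrt(np.mean(frame ** 2)))           # amplitude (RMS)
        crossings = int(np.sum(np.abs(np.diff(np.sign(frame)))) // 2)
        features.append([energy, crossings])                   # crude pitch proxy
    return np.array(features)

# One second of a 440 Hz tone as a stand-in for recorded speech.
t = np.linspace(0, 1, 16000, endpoint=False)
feats = extract_features(np.sin(2 * np.pi * 440 * t))
print(feats.shape)  # (number of frames, 2): one feature vector per 10 ms hop
```

The resulting matrix — one short vector of numbers per 10 milliseconds of audio — is what the acoustic model actually sees in place of the raw waveform.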
Many current ASR systems combine acoustic models with language models. Language models provide contextual information and linguistic knowledge to enhance the accuracy and coherence of transcriptions. They help in resolving ambiguities and predicting word sequences. Finally, many current ASR systems use a lexicon: a pronunciation dictionary that specifies how phonemes — the units of sound in spoken language — map onto graphemes — written letters. The synergy between acoustic and language models forms a crucial aspect of ASR technology, allowing for the seamless conversion of speech into text.
Early ASR systems faced numerous challenges in recognizing patterns in speech. For example, limited amounts of available training data and less sophisticated algorithms resulted in low transcription accuracies, especially in noisy surroundings or if speakers’ voice characteristics deviated from the speakers where the system was trained on. Today, training datasets have grown in size, and breakthroughs in deep learning techniques have resulted in remarkable progress. Although 10 years ago, ASR services were only offered by a select number of specialized research groups, today, many prominent parties in the technology landscape, such as Google, Microsoft and Amazon offer ASR services in a growing number of languages. Such systems can transcribe speech to text with high accuracy, even for speakers with accents, or in noisy (classroom) situations.
Learning to speak a new language
Learning to speak is one of the key aspects of learning a new language. One of the advantages of using speech technology in digital learning is the ability to automatically learn the pronunciation of words when learning vocabulary items. ASR systems are already employed on a large scale by private language learning apps that use speech recognition software to help users improve their pronunciation and speaking skills. To a smaller extent, a number of major educational publishing houses have begun to implement ASR technology in their digital learning solutions.
Although scientific research into the efficiency of such systems is limited, some studies have compared the effectiveness of ASR-based vocabulary- and pronunciation learning to learning in more traditional settings. For example, 63 students with Arabic nationality learned English vocabulary and pronunciation, either with an ASR-based virtual tutor, or with a regular teacher. The results showed that the students using the ASR-based system learned more words and received higher pronunciation scores than the students who received regular, teacher-based instruction. The authors argue that the application of speech technology allows users to practice speaking in a low-pressure environment and receive immediate feedback on their pronunciation, which is often not possible in traditional classroom settings. Finally, they argue that speech technology can be a solution for people who have difficulty writing or typing. For example, typing can be a challenging task for people with dyslexia, and speech technology provides an alternative way to provide input to a learning system.
Clever personalized learning systems
The potential of speech technology extends beyond pronunciation learning, and has shown promising results in the domain of personalized learning applications. Personalized learning systems aid in memorizing information by adapting to the needs of individual learners. Such systems measure learning behavior to estimate the optimal point in time to repeat items or provide feedback to the learner. Information that is already mastered by the learner should not receive much attention, whereas materials that are difficult for the learner should be repeated frequently. In adaptive learning systems, a challenge is to identify which materials are already mastered and which materials are not. Recent work suggests that it is possible to use the prosodic information in speech – in other words, the way in which the user gives a spoken response – to estimate the extent to which the user has successfully memorized a response. More specifically, users that raise their voice while giving a response are likely to be uncertain of their answer, and increased loudness and speaking speed are generally predictors of high memory strength and accuracy answers. Interestingly, even if no human experimenter was present (i.e., learners were interacting with a fully digital learning application) participants used prosodic cues when giving their answers. In short, detecting prosodic features in speech may be a relatively straightforward and computationally inexpensive way to automatically detect a learner’s memory strength, and provide a learning experience that is tailored towards the needs of the individual learner.
Despite its broad potential, ASR technology faces challenges when it comes to accurately recognizing speech in children. One reason is the inherent variability in children’s speech due to their developing vocal apparatus and pronunciation patterns, which differ from those of adults. ASR systems trained primarily on adult speech struggle to accurately interpret and transcribe children’s speech. Similarly, minority groups may exhibit distinctive accents, dialects, or speech patterns that are not adequately represented in ASR training data, leading to reduced recognition performance. To address these challenges, researchers are working on developing more inclusive and diverse training datasets that encompass a broader range of speech characteristics. Another approach involves leveraging data augmentation techniques to artificially generate diverse speech samples, allowing models to generalize better across different speech patterns.
Another issue that needs to be addressed is the energy consumption associated with speech-to-text technology. As ASR systems rely on computationally expensive processes (e.g., storing and analyzing large datasets), ASR systems use substantial amounts of energy, especially when deployed at scale. A recent study showed that — depending on the geographical location of the user and the exact methods used — training an ASR system can result in an emission of more than 100 kgs of carbon dioxide, or the equivalent of driving 1000 kms by car. By optimizing the deep learning models, as well as carefully weighing the added benefits of more advanced models to their higher computational cost (i.e., very minor improvements can be obtained at a high carbon price) these issues can be addressed.
Overall, ASR technology has developed rapidly over the last few years. In its current form, it has the potential to offer personalized, speech-based learning to a large number of people, including those with writing – or spelling disabilities. However, the technology faces difficulties in providing accurate transcriptions in specific groups, such as children — an area where accurate transcriptions in educational settings seem especially relevant. Furthermore, the carbon footprint currently associated with speech to text transcriptions is substantial, and research should focus on minimizing it before it is deployed at scale in educational settings.