Hackerman Hall B17 @ 3400 N. Charles Street, Baltimore, MD 21218
Code-switching is a commonly occurring phenomenon in many multilingual communities, wherein a speaker switches between languages within a single utterance. Conventional Word Error Rate (WER) is not sufficient for measuring the performance of code-mixed languages due to ambiguities in transcription, misspellings, and borrowing of words across two different writing systems. These rendering errors artificially inflate the WER of an Automatic Speech Recognition (ASR) system and complicate its evaluation. Furthermore, they make it harder to accurately evaluate modeling errors originating from code-switched language and acoustic models. Multilingual ASR systems allow for the joint training of data-rich and data-scarce languages in a single model. This enables data and parameter sharing across languages, which is especially beneficial for the data-scarce languages. Most state-of-the-art multilingual models require the encoding of language information and are therefore not as flexible or scalable when expanding to newer languages. Language-independent multilingual models help to address this issue, and are also better suited for multicultural societies where several languages are frequently used together (but often rendered with different writing systems).
In this talk, I will discuss the use of a new metric, transliteration-optimized Word Error Rate (toWER), to evaluate ASR systems on code-switched languages. This metric smooths out many of the irregularities by mapping all text to one writing system. I will also discuss a new approach to building a language-agnostic multilingual ASR system which transforms all languages to one writing system through a many-to-one transliteration transducer. Thus, similar-sounding acoustics are mapped to a single, canonical target sequence of graphemes, effectively separating the modeling and rendering problems. We show with Indic languages that the language-agnostic multilingual model achieves up to 10% relative reduction in Word Error Rate over a language-dependent multilingual model.
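The core idea behind the metric can be sketched in a few lines: transliterate both the reference and the hypothesis into one canonical writing system, then compute ordinary word-level edit distance. This is only a minimal illustration with a hypothetical two-entry transliteration table; the actual system described in the talk uses a learned many-to-one transliteration transducer.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)]

# Hypothetical transliteration map standing in for the many-to-one
# transducer: Devanagari and Latin renderings of the same spoken word
# collapse to one canonical form.
TRANSLIT = {"नमस्ते": "namaste"}

def to_canonical(text):
    return [TRANSLIT.get(w, w.lower()) for w in text.split()]

def wer(ref, hyp):
    """Conventional WER: compares surface forms as written."""
    r, h = ref.split(), hyp.split()
    return edit_distance(r, h) / len(r)

def to_wer(ref, hyp):
    """toWER-style score: map to one writing system, then score."""
    r, h = to_canonical(ref), to_canonical(hyp)
    return edit_distance(r, h) / len(r)

ref = "नमस्ते india"
hyp = "namaste India"
print(wer(ref, hyp))     # 1.0 -- both words mismatch on surface form
print(to_wer(ref, hyp))  # 0.0 -- same words after canonicalization
```

The example shows how a correctly recognized code-switched utterance can score 100% WER purely because of rendering differences, while the transliteration-optimized score reflects the true recognition quality.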
Bhuvana Ramabhadran (IEEE Fellow 2017, ISCA Fellow 2017) currently leads a team of researchers at Google, focusing on multilingual speech recognition and speech synthesis. Previously, she was a Distinguished Research Staff Member and Manager in IBM Research AI at the IBM T. J. Watson Research Center, where she led a team of researchers in the Speech Technologies Group and coordinated activities across IBM's worldwide laboratories in the areas of speech recognition, synthesis, and spoken term detection. She has served as an elected member of the IEEE SPS Speech and Language Technical Committee (SLTC), and as its Vice Chair and Chair (2014-2016); she also served on the IEEE SPS conference board (2017-2018) and the editorial board of the IEEE Transactions on Audio, Speech, and Language Processing (2011-2015). She currently serves on the IEEE Flanagan Award Committee and is the Regional Director-At-Large for Region 6. She also serves on the ISCA board. Her research interests include speech recognition and synthesis algorithms, statistical modeling, signal processing, and machine learning. Some of her recent work has focused on understanding neural networks and on methods to merge speech synthesis and recognition systems.