Can a system learn to recognise (and understand) speech with little or no supervision? In contrast to modern speech recognition systems, which are trained on thousands of hours of transcribed speech audio, human infants acquire language without any text-based supervision. Models that can emulate this would be beneficial for low-resource speech processing and word learning in robotics, and could improve our understanding of language acquisition in humans.
A system of this kind will need to take advantage of (1) unlabelled audio from its surroundings, (2) co-occurring signals for grounding, and (3) interaction with its environment. In this talk I will focus on (1) and (2). We will first look at unsupervised acoustic unit discovery: learning the phonetic inventory of a language from unlabelled speech audio alone. I will specifically discuss recent methods using vector quantised neural networks, and will give the practical example of voice conversion using discovered units. In the second part, I will talk about incorporating visual context for word learning, specifically introducing the task of multi-modal one-shot learning from images and speech. I will describe recent work comparing and combining unsupervised and transfer learning for this task.
Herman is a senior lecturer in Electrical and Electronic Engineering at Stellenbosch University in South Africa. His core interest is in developing speech and language processing models that can learn from as little supervision as possible. Herman was an organiser of the Deep Learning Indaba 2018, and he has received two Google Faculty Awards. Before joining Stellenbosch, he was a postdoc at TTI-Chicago, working with Karen Livescu and Greg Shakhnarovich on multi-modal machine learning models combining speech and vision. Before that, he did his PhD with Sharon Goldwater, Aren Jansen and Simon King at the University of Edinburgh.