November 8, 2019

12:00 pm / 1:15 pm


Hackerman B17

Abstract: The increased availability of high-resolution cameras and array microphones in live meetings, video production, and camera-enabled assistant devices has created opportunities for exploiting multiple modalities in speech applications. This presentation summarizes initial work at Google in fusing audio and visual information to improve the performance of speech recognition and speaker tracking. We show that multimodal approaches provide significant improvement in both speech recognition and speaker diarization, especially under noisy conditions. However, these gains are not always robust to missing modalities, and there is considerable work to be done to make audio/visual speech processing practical. Results from our initial multimodal ASR and speaker diarization experiments will be presented.
Bio: Rick Rose has been a research scientist at Google in New York City since October 2014. While at Google he has contributed to efforts in far-field speech recognition, acoustic modeling for ASR, speaker diarization, and audio-visual speech processing. Before coming to Google, he served as a Professor of Electrical and Computer Engineering at McGill University in Montreal starting in 2004, as a member of research staff at AT&T Labs / Bell Labs, and as a member of staff at MIT Lincoln Labs. He received his PhD in Electrical Engineering from the Georgia Institute of Technology. He has been active in the IEEE Signal Processing Society and is an IEEE Fellow.