April 13, 2021

12:00 pm / 1:00 pm



Recorded Seminar:


Kristen Grauman, PhD


Department of Computer Science
University of Texas at Austin

& Facebook AI Research


“Sights, sounds, and space: Audio-visual learning in 3D environments”


Moving around in the world is naturally a multisensory experience, but today’s embodied agents are deaf, restricted to solely their visual perception of the environment. We explore audio-visual learning in complex, acoustically and visually realistic 3D environments. By both seeing and hearing, the agent must learn to navigate to a sounding object, use echolocation to anticipate its 3D surroundings, and discover the link between its visual inputs and spatial sound.

To support this goal, we introduce SoundSpaces: a platform for audio rendering based on geometrical acoustic simulations for two sets of publicly available 3D environments (Matterport3D and Replica). SoundSpaces makes it possible to insert arbitrary sound sources in an array of real-world scanned environments. Building on this platform, we pursue a series of audio-visual spatial learning tasks. Specifically, in audio-visual navigation, the agent is tasked with traveling to a sounding target in an unfamiliar environment (e.g., go to the ringing phone). In audio-visual floorplan reconstruction, a short video with audio is converted into a house-wide map, where audio allows the system to “see” behind the camera and behind walls. For self-supervised feature learning, we explore how echoes observed in training can enrich an RGB encoder for downstream spatial tasks including monocular depth estimation. Our results suggest how audio can benefit visual understanding of 3D spaces, and our work lays groundwork for new research in embodied AI with audio-visual perception.


Kristen Grauman is a Professor in the Department of Computer Science at the University of Texas at Austin and a Research Scientist in Facebook AI Research (FAIR). Her research in computer vision and machine learning focuses on visual recognition, video, and embodied perception. Before joining UT-Austin in 2007, she received her Ph.D. at MIT. She is an IEEE Fellow, AAAI Fellow, Sloan Fellow, and recipient of the 2013 Computers and Thought Award. She and her collaborators have been recognized with several Best Paper awards in computer vision, including a 2011 Marr Prize and a 2017 Helmholtz Prize (test of time award). She currently serves as an Associate Editor-in-Chief for PAMI and previously served as a Program Chair of CVPR 2015 and NeurIPS 2018.