July 12, 2018

10:30 am / 11:30 am

Abstract

In this talk, we present issues in natural language modeling for text entry in languages that use noisy (i.e., non-standard) romanization strategies, with a particular focus on languages using Indic scripts. We discuss romanization strategies, and present data indicating that this sort of romanization typically amounts to a rough phonetic transcription. We present Gboard keyboards that make use of models very similar to widely used grapheme-to-phoneme models. We also discuss language modeling of romanized text directly.

Bio

Brian Roark is a computational linguist working on various topics in natural language processing. His research interests include: syntactic parsing of text and speech; language modeling for automatic speech recognition and other applications; supervised and unsupervised learning of language and parsing models; text entry, accessibility and augmentative & alternative communication (AAC).

Before joining Google as a research scientist in 2013, he was a faculty member for 9 years in the Center for Spoken Language Understanding (CSLU) at Oregon Health & Science University (OHSU), part of what used to be the Oregon Graduate Institute (OGI). Before that, he was in the Speech Algorithms Department at AT&T Labs – Research from 2001 to 2004. He received his PhD in the Department of Cognitive and Linguistic Sciences at Brown University in 2001.