November 9, 2021

12:00 pm / 1:00 pm


Virtually; Zoom Link TBA

Recorded seminar:

"Deep learning for biological sequences"

Jean-Philippe Vert, PhD

Research Scientist

Google Brain


Abstract: In recent years, deep learning has revolutionized natural language processing (NLP), and is increasingly used to analyze biological sequences including DNA, RNA and proteins. While many deep learning architectures and techniques successful in NLP can be directly applied to biological sequences, there are also specificities in biological sequences that should be taken into account to adapt NLP techniques to that context. In this talk I will discuss several such specificities, including the fact that 1) biological sequences have no natural separation as a sequence of words, 2) a double-stranded DNA sequence can be represented by two reverse-complement sequences, and 3) a natural way to compare homologous biological sequences is to align them. In each case, I will show how the biological constraints can lead to specific models, and illustrate empirically the benefits of incorporating such prior knowledge on several tasks such as metagenomics read binning, protein-DNA binding prediction, or protein annotation.

Biography: Jean-Philippe Vert is a research scientist at Google Brain in Paris and adjunct research professor at PSL Mines ParisTech's Centre for Computational Biology. Prior to joining Google in 2018, he worked as a postdoc in computational biology at Kyoto University (2001-2002), research professor and founding director of the Centre for Computational Biology at Mines ParisTech (2003-2018), team leader at the Curie Institute in Paris on computational biology of cancer (2008-2018), Miller visiting professor at UC Berkeley (2015-2016), and research professor at the department of mathematics of Ecole normale superieure in Paris (2016-2018). He graduated from Ecole Polytechnique (1995) and Corps des Mines (1998), and holds a PhD in mathematics from Paris 6 University (2001). His research interests concern the development of statistical and machine learning methods, particularly to model complex, high-dimensional and structured data, with an application focus on computational biology, genomics and precision medicine. His recent contributions include new methods to embed structured data such as strings, graphs or permutations into vector spaces, regularization techniques to learn from limited amounts of data, and computationally efficient techniques for pattern detection and feature selection. He is also working on several medical applications in cancer research, including quantifying and modeling cancer heterogeneity, predicting response to therapy, and modeling the genome and epigenome of cancer cells at the single-cell level.