February 4, 2020

12:00 pm / 1:30 pm


Clark Hall, Room 316

Seminar 12:00 pm – 1:00 pm
Lunch 1:00 pm – 1:30 pm

 CIS / MIINDS Seminar

Tuesday, February 4, 2020 at 12:00 pm

Kavli NDI North, Clark 316

“DirtyData: statistical learning on non-curated databases”

Gaël Varoquaux

Research Director, Parietal, INRIA

Director of the scikit-learn operations at INRIA Foundation

Member of the board of the Paris-Saclay Center for Data Science

If you would like to meet with Gaël Varoquaux, please sign up for a time on the Google sheet at: https://docs.google.com/spreadsheets/d/1sBkXn2FQoYf8HaVAm0MwWtSy3n5qyM1moR8lt2Zw0SY/edit#gid=632884729

Abstract: While growing amounts and diversity of data bring many promises to empirical studies, they also impose more and more human curation before statistical analysis. “Dirty data” is reported as the worst roadblock to data science in practice [1]. One challenge is that in many data-science applications, for instance in healthcare or social sciences, the data are not measurements that naturally have a homogeneous structure, but rather heterogeneous entries and columns of different natures. The analysts must invest significant manual effort to cast the data in a representation amenable to statistical learning, traditionally using database-cleaning methods. Our goal in the DirtyData research axis is to unite statistical learning and database techniques to work directly on non-curated databases.

I will present two recent contributions to building a statistical-learning framework on non-curated databases. First, we tackle the problem of non-normalized categorical columns, e.g., with typos or nomenclature variations. We introduce two approaches to inject the data into a vector space, based either on a character-level Gamma-Poisson factorization to recover latent categories, or on exploiting unstudied properties of min-hash vectors that lead to very fast stateless transformations of string inclusions into simple vector inequalities [2]. Second, we study supervised learning in the presence of missing values [3]. We show that in missing-at-random settings, simple imputation by the mean is consistent for powerful supervised models. We also stress that in missing-not-at-random settings, imputing may render supervised learning impossible, and we study simple practical solutions. Studying the seemingly simple case of data generated with a linear mechanism shows that fitting imputation and linear models is brittle, and that it is preferable to forgo imputation and fit richer models [4].
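As a rough illustration of the min-hash idea mentioned in the abstract, the sketch below encodes a string as a vector of minima of hashed character 3-grams. It is not the implementation from [2]: the function names (char_ngrams, salted_hash, minhash_encode), the choice of 3-grams, the 8 dimensions, and the use of salted BLAKE2b as the hash family are all assumptions made for this example. It only shows why substring inclusion turns into a coordinate-wise inequality: a minimum taken over a larger set of n-grams can never exceed the minimum over a subset of it.

```python
import hashlib


def char_ngrams(text, n=3):
    """Return the set of character n-grams of `text`.

    Strings no longer than n are treated as a single token (an assumption
    of this sketch, so that short strings still get an encoding).
    """
    if len(text) <= n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def salted_hash(token, seed):
    """Map `token` to a float in [0, 1); each integer seed acts as one hash function."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little") / 2**64


def minhash_encode(text, dim=8, n=3):
    """Encode `text` as `dim` minima of hashed n-grams.

    The transformation is stateless: nothing is fitted on the data, so an
    unseen category (for instance one with a typo) is encoded the same way.
    """
    grams = char_ngrams(text, n)
    return [min(salted_hash(g, seed) for g in grams) for seed in range(dim)]


if __name__ == "__main__":
    short = minhash_encode("london")
    longer = minhash_encode("london, uk")
    # Every 3-gram of "london" also occurs in "london, uk", so each coordinate
    # of the longer string's vector is at most the corresponding coordinate
    # of the shorter string's vector.
    print(all(b <= a for a, b in zip(short, longer)))  # True by construction
```

The inequality is guaranteed only when the n-gram set of one string is contained in that of the other; for unrelated strings the coordinates can compare in either direction.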


[2] Encoding high-cardinality string categorical variables, P Cerda, G Varoquaux, https://arxiv.org/abs/1907.01860

[3] On the consistency of supervised learning with missing values, J Josse, N Prost, E Scornet, G Varoquaux, https://arxiv.org/abs/1902.06931

[4] Linear predictor on linearly-generated data with missing values: non consistency and solutions, M Le Morvan, N Prost, J Josse, E Scornet, G Varoquaux, accepted at AISTATS 2020.

Bio: Gaël Varoquaux is a tenured research director at Inria. His research focuses on statistical-learning tools for data science and scientific inference. Since 2008, he has been exploring data-intensive approaches to understand brain function and mental health. More generally, he develops tools to make machine learning easier, with statistical models suited for real-life, uncurated data, and software for data science. He co-founded scikit-learn, one of the reference machine-learning toolboxes, and helped build various central tools for data analysis in Python. Varoquaux has contributed key methods for learning on spatial data, matrix factorizations, and modeling covariance matrices. He has a PhD in quantum physics and is a graduate of Ecole Normale Superieure, Paris.