Mapping textual mentions to entities in a knowledge graph is a key step in using knowledge graphs, called Named Entity Disambiguation (NED). A key challenge in NED is generalizing to rarely seen (tail) entities. Traditionally NED uses hand-tuned patterns from a knowledge base to capture rare, but reliable, signals. Hand-built features make it challenging to deploy and maintain NED, especially in multiple locales. While at Apple in 2018, we built a self-supervised system for NED that was deployed in a handful of locales and that improved performance of downstream models significantly. However, due to the fog of production, it was unclear what aspects of these models were most valuable. Motivated by this experience, we built Bootleg, a clean-slate, open-source, self-supervised system to improve tail performance using a simple transformer-based architecture. Bootleg improves tail generalization through a new inverse regularization scheme to favor more generalizable signals automatically. Bootleg-like models are used by several downstream applications. As a result, quality issues fixed in one application may need to be fixed independently in many applications. Thus, we initiate the study of techniques to fix systematic errors in self-supervised models using weak supervision, augmentation, and training set refinement. Bootleg achieves new state-of-the-art performance on the three major NED benchmarks by up to 3.3 F1 points, and it improves performance over BERT baselines on tail slices by 50.1 F1 points.
Bootleg is open source at http://hazyresearch.stanford.edu/bootleg/.
Christopher (Chris) Ré is an associate professor in the Department of Computer Science at Stanford University. He is in the Stanford AI Lab and is affiliated with the Statistical Machine Learning Group. His recent work is to understand how software and hardware systems will change as a result of machine learning, along with a continuing, petulant drive to work on math problems. Research from his group has been incorporated into scientific and humanitarian efforts, such as the fight against human trafficking, along with products from technology and enterprise companies. He has cofounded four companies based on his research into machine learning systems: SambaNova and Snorkel, along with two companies that are now part of Apple, Lattice (DeepDive) in 2017 and Inductiv (HoloClean) in 2020.
He received a SIGMOD Dissertation Award in 2010, an NSF CAREER Award in 2011, an Alfred P. Sloan Fellowship in 2013, a Moore Data Driven Investigator Award in 2014, the VLDB Early Career Award in 2015, the MacArthur Foundation Fellowship in 2015, and an Okawa Research Grant in 2016. His research contributions have spanned database theory, database systems, and machine learning, and his work has won best paper at a premier venue in each area, respectively, at PODS 2012, SIGMOD 2014, and ICML 2016.