Reliable automatic evaluation of dialogue systems in an interactive environment is long overdue. An ideal environment for evaluating dialogue systems, also known as the Turing test, needs to involve human interaction, which is usually not affordable for large-scale experiments. Though researchers have attempted to use metrics from language generation tasks (e.g., perplexity, BLEU) or model-based reinforcement learning methods (e.g., self-play evaluation) for automatic evaluation, these methods show only very weak correlation with actual human evaluation in practice.
To bridge this gap, we propose a new framework named ENIGMA for estimating human evaluation scores, based on recent advances in off-policy evaluation in reinforcement learning. ENIGMA requires only a small amount of pre-collected experience data, and therefore does not involve human interaction with the target policy during evaluation, making automatic evaluation feasible. More importantly, ENIGMA is model-free and agnostic to the behavior policies used to collect the experience data, which significantly alleviates the technical difficulties of modeling complex dialogue environments and human behaviors. Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
Tuo Zhao (https://www2.isye.gatech.edu/~tzhao80/) is an assistant professor at Georgia Tech. He received his Ph.D. degree in Computer Science from Johns Hopkins University. His research mainly focuses on developing methodologies, algorithms and theories for machine learning, especially deep learning. He is also actively working on neural language models and open-source machine learning software for scientific data analysis. He has received several awards, including first place in the INDI ADHD-200 global competition, the ASA best student paper award on statistical computing, the INFORMS best paper award on data mining, and a Google faculty research award.