Huawei Russian Research Institute: Special Challenges

1. RussianSuperGLUE: A Russian language understanding evaluation benchmark, 2020.

Authors: Shavrina, Tatiana; Fenogenova, Alena; Emelyanov, Anton; Shevelev, Denis; Artemova, Ekaterina; Malykh, Valentin; Mikhailov, Vladislav; Tikhonova, Maria; Chertok, Andrey; Evlampiev, Andrey

Publication:

In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Abstract:

In this paper, we introduce an advanced Russian general language understanding evaluation benchmark, Russian SuperGLUE. Recent advances in the field of universal language models and transformers require the development of a methodology for their broad diagnostics and for testing general intellectual skills: detection of natural language inference, commonsense reasoning, and the ability to perform simple logical operations regardless of text subject or lexicon. For the first time, a benchmark of nine tasks, collected and organized analogously to the SuperGLUE methodology, was developed from scratch for the Russian language. We also provide baselines, human-level evaluation, an open-source framework for evaluating models, and an overall leaderboard of transformer models for the Russian language. In addition, we present the first results of comparing multilingual models on the translated diagnostic test set and offer first steps towards further expanding or assessing state-of-the-art models independently of language.

Link: https://aclanthology.org/2020.emnlp-main.381/
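
As a rough illustration of how such a leaderboard can aggregate results (a generic sketch only, not the benchmark's actual open-source evaluation framework; the task names in the comment are examples of RussianSuperGLUE tasks), per-task scores can be averaged into an overall figure:

    def task_accuracy(preds, gold):
        """Accuracy for one classification task (e.g., entailment labels)."""
        assert len(preds) == len(gold)
        return sum(p == g for p, g in zip(preds, gold)) / len(gold)

    def overall_score(per_task_scores):
        """Unweighted mean over tasks, as on SuperGLUE-style leaderboards."""
        return sum(per_task_scores.values()) / len(per_task_scores)

    # Example usage: overall_score({"RCB": 0.65, "DaNetQA": 0.71, "PARus": 0.59})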

2. SumTitles: a summarization dataset with low extractiveness, 2020.

Authors: Malykh, Valentin; Chernis, Konstantin; Artemova, Ekaterina; Piontkovskaya, Irina

Publication:

In Proceedings of the 28th International Conference on Computational Linguistics

Abstract:

The existing dialogue summarization corpora are significantly extractive. We introduce a methodology for evaluating dataset extractiveness and present a new low-extractive corpus of movie dialogues for abstractive text summarization, along with a baseline evaluation. The corpus contains 153k dialogues and consists of three parts: 1) automatically aligned subtitles, 2) automatically aligned scenes from scripts, and 3) manually aligned scenes from scripts. We also present the alignment algorithm used to construct the corpus.

Link: https://aclanthology.org/2020.coling-main.503/
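
The abstract does not spell out the extractiveness methodology; one common way to quantify it (an assumption here, in the spirit of the coverage and density metrics of Grusky et al., 2018, not necessarily the paper's exact procedure) is to greedily match shared token fragments between a dialogue and its summary:

    def greedy_fragments(source, summary):
        """Greedily extract shared token fragments of a summary.

        At each summary position, take the longest token span that also
        occurs verbatim in the source; otherwise advance by one token.
        """
        fragments, i = [], 0
        while i < len(summary):
            best = 0
            for j in range(len(source)):
                k = 0
                while (i + k < len(summary) and j + k < len(source)
                       and summary[i + k] == source[j + k]):
                    k += 1
                best = max(best, k)
            if best:
                fragments.append(summary[i:i + best])
                i += best
            else:
                i += 1
        return fragments

    def coverage(source, summary):
        """Fraction of summary tokens copied from the source."""
        return sum(map(len, greedy_fragments(source, summary))) / len(summary)

    def density(source, summary):
        """Mean squared fragment length: rewards long verbatim copies."""
        return sum(len(f) ** 2 for f in greedy_fragments(source, summary)) / len(summary)

Low coverage and density indicate an abstractive (low-extractive) corpus, which is the property the paper targets.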

3. Grad-TTS: A diffusion probabilistic model for text-to-speech, 2021.

Authors: Popov, Vadim; Vovk, Ivan; Gogoryan, Vladimir; Sadekova, Tasnima; Kudinov, Mikhail

Publication:

In Proceedings of the 38th International Conference on Machine Learning (ICML)

Abstract:

Recently, denoising diffusion probabilistic models and generative score matching have shown high potential in modelling complex data distributions, while stochastic calculus has provided a unified point of view on these techniques, allowing for flexible inference schemes. In this paper we introduce Grad-TTS, a novel text-to-speech model with a score-based decoder that produces mel-spectrograms by gradually transforming noise predicted by the encoder and aligned with the text input by means of Monotonic Alignment Search. The framework of stochastic differential equations helps us generalize conventional diffusion probabilistic models to the case of reconstructing data from noise with different parameters, and allows us to make this reconstruction flexible by explicitly controlling the trade-off between sound quality and inference speed. Subjective human evaluation shows that Grad-TTS is competitive with state-of-the-art text-to-speech approaches in terms of Mean Opinion Score.

Link: http://proceedings.mlr.press/v139/popov21a.html
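
A minimal sketch of the idea behind a score-based decoder of this kind, assuming a trained network score_net(x, mu, t) approximating the score and an encoder output mu already aligned with the text (both names and the linear noise schedule are placeholder assumptions, not the paper's API). The number of solver steps is the explicit quality/speed knob the abstract refers to:

    import torch

    @torch.no_grad()
    def sample_mel(score_net, mu, n_steps=50, beta_min=0.05, beta_max=20.0):
        """First-order Euler solver for the reverse-time probability-flow ODE.

        mu: encoder output aligned with text, shape (batch, n_mel, frames).
        Fewer steps means faster inference at the cost of quality.
        """
        x = mu + torch.randn_like(mu)          # start from noise centred at mu
        h = 1.0 / n_steps                      # step size over t in (0, 1]
        for i in range(n_steps):
            t = 1.0 - i * h                    # integrate backwards from t=1 to 0
            beta_t = beta_min + t * (beta_max - beta_min)   # linear noise schedule
            t_batch = torch.full((x.size(0),), t, device=x.device)
            score = score_net(x, mu, t_batch)  # approximates grad log p_t(x)
            # Forward-SDE drift pulls x toward mu; subtracting the score term
            # undoes the diffusion when the ODE is integrated backwards.
            x = x - h * beta_t * (0.5 * (mu - x) - 0.5 * score)
        return x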

4. Revisiting Mahalanobis Distance for Transformer-Based Out-of-Domain Detection, 2021.

Authors: Podolskiy, Alexander; Lipin, Dmitry; Bout, Andrey; Artemova, Ekaterina; Piontkovskaya, Irina

Publication: 

In Proceedings of the AAAI Conference on Artificial Intelligence

Abstract:

Real-life applications that rely heavily on machine learning, such as dialog systems, demand out-of-domain detection methods. Intent classification models should be equipped with a mechanism to distinguish seen intents from unseen ones, so that the dialog agent can reject the latter and avoid undesired behavior. However, despite increasing attention paid to the task, the best practices for out-of-domain intent detection have not yet been fully established. This paper conducts a thorough comparison of out-of-domain intent detection methods. We prioritize methods that do not require access to out-of-domain data during training, since gathering such data is extremely time- and labor-consuming due to the lexical and stylistic variation of user utterances. We evaluate multiple contextual encoders and methods proven to be efficient on three common datasets for intent classification, expanded with out-of-domain utterances. Our main findings show that fine-tuning Transformer-based encoders on in-domain data leads to superior results. Mahalanobis distance, together with utterance representations derived from Transformer-based encoders, outperforms other methods by a wide margin (1-5% in terms of AUROC) and establishes new state-of-the-art results for all datasets. The broader analysis shows that the reason for this success is that the fine-tuned Transformer constructs homogeneous representations of in-domain utterances, which are geometrically separated from out-of-domain utterances. In turn, the Mahalanobis distance captures this disparity easily.

Link: https://ojs.aaai.org/index.php/AAAI/article/view/17612
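
The core of the method can be sketched directly from the abstract: fit per-intent centroids and a shared covariance on in-domain utterance embeddings from the fine-tuned Transformer (the pooling used to obtain those embeddings is left abstract here), then score a new utterance by its Mahalanobis distance to the nearest centroid:

    import numpy as np

    def fit_mahalanobis(embeddings, labels):
        """Fit per-intent centroids and one covariance shared across intents."""
        classes = np.unique(labels)
        mus = {c: embeddings[labels == c].mean(axis=0) for c in classes}
        centered = np.vstack([embeddings[labels == c] - mus[c] for c in classes])
        sigma = centered.T @ centered / len(embeddings)
        sigma_inv = np.linalg.pinv(sigma)   # pseudo-inverse for numerical safety
        return mus, sigma_inv

    def ood_score(x, mus, sigma_inv):
        """Distance to the closest in-domain centroid; large => likely OOD."""
        return min(float((x - mu) @ sigma_inv @ (x - mu)) for mu in mus.values())

Thresholding ood_score then separates seen intents from unseen ones; AUROC over these scores is the metric the abstract reports.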

5. Gaussian LPCNet for multisample speech synthesis, 2020.

Authors: Popov, Vadim; Kudinov, Mikhail; Sadekova, Tasnima

Publication:

In Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Abstract:

The LPCNet vocoder has recently been presented to the TTS community and is gaining popularity due to its effectiveness and the high quality of the speech it synthesizes. In this work, we present a modification of LPCNet that is 1.5x faster, has half as many non-zero parameters, and synthesizes speech of the same quality. This improvement comes mostly from two features that we introduce into the original architecture: the proposed vocoder is designed to generate a 16-bit signal instead of an 8-bit µ-companded signal, and it predicts two consecutive excitation values at a time, independently of each other. To show that these modifications do not degrade quality, we train models for five different languages and perform extensive human evaluation.

Link: https://ieeexplore.ieee.org/abstract/document/9053337/
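
A sketch of the second modification, assuming the network emits Gaussian parameters for two consecutive excitation samples (the (mean, log-std) parameterization is an assumption; the abstract only states that the two values are predicted at a time, independently of each other):

    import torch

    def sample_excitation_pair(params):
        """Draw two consecutive excitation samples from independent Gaussians.

        params: tensor of shape (..., 4) = (mean1, log_std1, mean2, log_std2).
        A continuous 16-bit output replaces the original 8-bit mu-law softmax,
        and producing two samples per network step roughly halves the number
        of recurrent-network invocations per second of audio.
        """
        mean1, log_std1, mean2, log_std2 = params.unbind(-1)
        e1 = mean1 + log_std1.exp() * torch.randn_like(mean1)
        e2 = mean2 + log_std2.exp() * torch.randn_like(mean2)  # independent of e1
        return e1, e2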