The LeVoice Far-field Speech Recognition System for VOiCES from a Distance Challenge 2019

孙乐逸
2023-12-01

Yulong Liang, Lin Yang, Xuyang Wang, Yingjie Li, Chen Jia, Junjie Wang
Lenovo Research
Liangyl3@lenovo.com

Speech recognition
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/1944.pdf

This paper describes our submission to the “VOiCES from a Distance Challenge 2019”, which is designed to foster research in speaker recognition and automatic speech recognition (ASR), with a special focus on single-channel distant/far-field audio under noisy conditions. We focused on the ASR task under the fixed condition, in which the training data was clean and small, while the development and test data were noisy and mismatched with it. Our system therefore combined data augmentation, weighted-prediction-error based speech enhancement, acoustic models based on different networks, TDNN- or LSTM-based language model rescoring, and ROVER. Experiments on the development set and the evaluation set showed that front-end processing, data augmentation and system fusion contributed most to the performance gains, and the final system achieved word error rates of 15.91% and 19.6% on the two sets, respectively.
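The word error rate quoted in the abstract is conventionally the word-level edit distance between the recognizer output and the reference transcript, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the challenge's official scoring tool; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()  # assumption: words are separated by whitespace
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

For example, scoring the hypothesis "the cat sit on mat" against the reference "the cat sat on the mat" counts one substitution and one deletion over six reference words. Because insertions are counted, the value can exceed 1.0 for very poor hypotheses.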

  1. Introduction

Since the accuracy of close-talking, noise-free speech recognition is approaching the best possible human speech recognition performance [1-4], more and more researchers have turned their attention to far-field and noisy scenarios [5-8]. The “VOiCES from a Distance Challenge 2019” [9][10] is a competition designed to foster research in speaker recognition and automatic speech recognition (ASR), with a special focus on single-channel distant/far-field audio under noisy conditions. The challenge is based on the newly released Voices Obscured in Complex Environmental Settings (VOiCES) corpus, and the training data is an 80-hour subset of the Librispeech dataset. The challenge has two tasks, speaker recognition and ASR, each with fixed and open training conditions. The main difficulty in each task is that the training data is small and is mismatched with the evaluation data.

A great deal of research has been conducted on far-field speech recognition. This work can be divided into two categories. In the first, researchers process the evaluation data in the front end so that it better matches the model. In the second, researchers train acoustic models (AM) and language models in the back end so that the model parameters match the data under the test conditions as closely as possible. For front-end processing, the main methods, such as Optimal Modified Minimum Mean-Square Error Log-Spectral Amplitude and Improved Minimal Controlled Recursive Averaging (OMLSA-IMCRA) [11] and Weighted Prediction Error (WPE) [12][13], are used for dereverberation and denoising. For the back end, the main methods include applying different acoustic model architectures, such as Deep Neural Network (DNN), Time-Delay Neural Network (TDNN) [5], factorized TDNN (TDNNF), Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM), as well as model parameter optimization, Neural Network Language Model (NNLM) based rescoring and multi-model fusion. The goal is to reduce the mismatch between the distant speech to be recognized and the training condition.

Because the given training set was clean speech, while the development set and the evaluation set contained speech under complex conditions with different kinds of noise and reverberation, we took several measures to optimize recognition performance. First, to address the lack of training data, we expanded the dataset through data augmentation strategies, adding reverberation and noise. Second, we trained acoustic models with different network architectures. Third, we added a rescoring mechanism based on the one-pass decoding lattices. Finally, ROVER [14] was used to exploit the complementarity among the different systems. The rest of this paper is organized as follows. Section 2 introduces each component of the system. Section 3 shows the ASR results obtained on the VOiCES corpus. Section 4 concludes the paper.
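The first measure listed in the introduction, expanding clean training data by adding reverberation and noise, can be illustrated with a small sketch. The introduction does not say how the reverberation or noise was generated, so everything below is an assumption for illustration: the toy exponentially decaying impulse response, the helper names (`synthetic_rir`, `add_noise`, `augment`) and the parameter values are hypothetical, and a real pipeline would typically use measured or simulated room impulse responses and recorded noise.

```python
import math
import random


def convolve(signal, impulse_response):
    """Full linear convolution of two lists of samples."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for k, h in enumerate(impulse_response):
            out[i + k] += s * h
    return out


def synthetic_rir(length, decay, rng):
    """Toy room impulse response (assumption): direct path plus decaying random tail."""
    rir = [rng.gauss(0.0, 1.0) * math.exp(-decay * n) for n in range(length)]
    rir[0] = 1.0  # direct path
    return rir


def power(samples):
    """Mean squared amplitude of a list of samples."""
    return sum(v * v for v in samples) / len(samples)


def add_noise(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise power ratio is `snr_db`, then mix."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    noise_power = power(noise)
    if noise_power == 0.0:
        raise ValueError("noise must not be silent")
    scale = math.sqrt(power(speech) / (noise_power * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]


def augment(clean, noise, snr_db, rng, rir_length=64, decay=0.05):
    """Reverberate a clean utterance, then add noise at the requested SNR."""
    rir = synthetic_rir(rir_length, decay, rng)
    reverberant = convolve(clean, rir)[:len(clean)]  # keep the original length
    return add_noise(reverberant, noise, snr_db)
```

A possible use is to call `augment` several times per clean utterance with different random seeds and SNR values, so that each pass over the training set sees a different distorted copy. The SNR here is defined relative to the reverberant signal, which is one of several reasonable conventions.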
