ICASSP 2019: Seq2Seq Attentional Siamese Neural Networks for Text-dependent Speaker Verification

松刚豪
2023-12-01

Yichi Zhang; University of Rochester
Meng Yu; Tencent
Na Li; Tencent
Chengzhu Yu; Tencent
Jia Cui; Tencent
Dong Yu; Tencent

https://ieeexplore.ieee.org/document/8682676

Seq2Seq Attentional Siamese Neural Networks for Text-dependent Speaker Verification
Abstract:
In this paper, we present a Sequence-to-Sequence Attentional Siamese Neural Network (Seq2Seq-ASNN) that leverages temporal alignment information for end-to-end speaker verification.
In prior works on speaker-discriminative neural networks, utterance-level evaluation/enrollment speaker representations are usually calculated.
Our proposed model, utilizing a sequence-to-sequence (Seq2Seq) attention mechanism, maps the frame-level evaluation representation into enrollment feature domain and further generates an utterance-level evaluation-enrollment joint vector for final similarity measure.
Feature learning, attention mechanism, and metric learning are jointly optimized using an end-to-end loss function.
Experimental results show that our proposed model outperforms various baseline methods, including the traditional i-Vector/PLDA method, multi-enrollment end-to-end speaker verification models, d-vector approaches, and a self-attention model, for text-dependent speaker verification on a Tencent internal voice wake-up dataset.

SECTION 1. INTRODUCTION
Speaker verification is the process of verifying, based on a speaker’s enrolled utterances, whether an evaluation utterance belongs to that speaker.
It can be categorized into text-dependent and text-independent tasks [1].
In text-dependent systems, transcripts of enrollment are constrained to a specific phrase [2], which is not the case in text-independent systems.
Because of the constraint on phonetic variability, text-dependent speaker verification usually achieves robust verification results with very short enrollment utterances.
With the proliferation of smart home/vehicles and mobile applications, human-machine interactions through voice command are becoming widespread where text-dependent speaker verification is essential.
For example, an ideal application scenario would be speech-assisted devices continuously listening for specific wake-up keywords spoken only by a certain speaker, where text-dependent speaker verification is necessary for personalized service and prevention of unauthorized usage.
Traditional techniques for text-dependent speaker verification include GMM-UBM [3], GMM-SVM [4], and i-Vector/PLDA [5].
Recently, inspired by the huge success of applying Deep Neural Networks (DNN) in Automatic Speech Recognition (ASR) [6], deep learning based text-dependent speaker verification has become popular.
In [2], [7], speaker discriminative DNNs are investigated to extract frame-level features, which are treated with equal importance and aggregated into reliable utterance-level speaker representations called d-vectors.
Utterance-level features from the test speaker and enrolled speakers are then scored using a predefined cosine distance [8] or PLDA [9] similarity measure.
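As a concrete illustration (ours, not the paper's), cosine scoring of two utterance-level d-vectors is simply a normalized dot product; a minimal NumPy sketch:

```python
import numpy as np

def cosine_score(d_eval: np.ndarray, d_enroll: np.ndarray) -> float:
    """Cosine similarity between evaluation and enrollment d-vectors."""
    return float(np.dot(d_eval, d_enroll)
                 / (np.linalg.norm(d_eval) * np.linalg.norm(d_enroll)))
```

A threshold on this score then accepts or rejects each trial.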
The end-to-end text-dependent speaker verification system has also attracted much attention due to its simple training procedure and effective inference scheme.
In [10], the last frame output of the LSTM layer is taken as the d-vector for the evaluation and enrollment representations, respectively; the two d-vectors are then scored by cosine distance followed by logistic regression to obtain the similarity score.
In [11], a normalized score is calculated for each LSTM frame, and all frames are weighted-averaged to generate the d-vector.
A similar attention mechanism is applied to a triplet-loss model in [12].
Another attention-based model in [13] uses additional phonetic model information to learn the attention weights for each evaluation and enrollment utterance.
However, in [11], [12], [13], the evaluation and enrollment branches each implement their own attention mechanism, and no evaluation-enrollment joint information is utilized.
For a better end-to-end training, the mismatch in the phonetic contexts and duration between the evaluation and enrollment can be resolved by a sequence-to-sequence (Seq2Seq) temporal alignment.
The original Seq2Seq attention is widely used in machine translation [14] and image captioning [15], where alignments are learned between source and target sequences.
This motivates us to learn temporal alignment between enrollment and evaluation utterances.
In this paper, we propose a Seq2Seq style attentional Siamese neural network model, named Seq2Seq-ASNN, for the above purpose.
A Siamese neural network consists of two towers with identical structures for encoding individual input features.
It has been successfully applied to many image/video/audio tasks such as face verification [16], object tracking in videos [17], and sound search by vocal imitation [18], [19].
The proposed Siamese neural network encodes an enrollment and an evaluation utterance with separate towers.
Each tower is composed of a convolutional layer followed by a recurrent layer to extract the temporal-frequency feature representation.
Then the frame-level features extracted from the two towers are aligned with attention weights and combined into an utterance-level evaluation-enrollment joint vector by a Seq2Seq attention mechanism.
The dual-tower feature extraction, Seq2Seq attention mechanism, and the verification scoring are jointly trained by optimizing the end-to-end loss.
The rest of the paper is organized as follows: We describe the proposed Seq2Seq-ASNN in Section 2.
The experimental setup is summarized in Section 3.
We present the evaluation results in Section 4 and conclude this paper in Section 5.

SECTION 2. THE PROPOSED SEQ2SEQ-ASNN
The typical speaker verification protocol includes three phases: training, enrollment, and evaluation [10].
In the training phase, our proposed network learns to extract internal speaker representations from a pair of utterances.
The encoding network includes two parts, a CRNN (CNN + GRU) and an attention network as shown in Figure 1.
![Fig. 1](https://ieeexplore.ieee.org/mediastore_new/IEEE/content/media/8671773/8682151/8682676/zhang1-p5-zhang-large.gif)
Fig. 1. Architecture of the proposed Seq2Seq-ASNN model for end-to-end speaker verification.

After feature extraction by the CRNN, the Seq2Seq attention mechanism takes frame-level features to compute attention weights for temporal alignment between evaluation and enrollment representations.
Finally, two fully connected layers produce a binary decision on whether the two utterances belong to the same speaker.
All the parameters in the whole system are jointly trained using an end-to-end criterion on positive (i.e., two input utterances share the same speaker identity, a.k.a. target samples in the testing phase) and negative (i.e., two input utterances belong to different speakers, a.k.a. impostor samples in the testing phase) pairs, as described in Section 2.5.
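As a hedged sketch of this stage (the head's hidden size is our assumption, and the 64-dim joint vector matches the attention sketch in Section 2.3), the two fully connected layers and the pairwise end-to-end loss could look like this in PyTorch:

```python
import torch
import torch.nn as nn

# Hypothetical metric-learning head: two fully connected layers mapping the
# utterance-level joint vector (assumed 64-dim) to a same-speaker logit,
# trained with binary cross-entropy on positive/negative pairs.
head = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))
criterion = nn.BCEWithLogitsLoss()

def pair_loss(joint_vec: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """joint_vec: (B, 64) joint vectors; labels: (B,) 1 = target, 0 = impostor."""
    return criterion(head(joint_vec).squeeze(-1), labels.float())
```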
While the attention model in [13] is learned on individual utterances, our attention model is trained in a Seq2Seq manner where both evaluation and enrollment frame-level features are required to produce an utterance-level joint vector.
Besides, although the enrollment and verification phases are implemented in one shot, enrollment frame-level features could still be extracted and saved beforehand for real-time verification deployment.
Finally, in end-to-end settings like [10] and [13], the evaluation and enrollment branches are combined at the “Metric Learning” stage, while our proposed method couples the two branches at the “Attention Mechanism” stage to generate the utterance-level joint vector.

2.1. Preprocessing
The evaluation and enrollment utterances are sampled at 16 kHz and recorded for shorter than 3 seconds.
Each utterance is zero-padded at the end to maintain a length of 3 seconds, and then converted to a 128-band log-mel spectrogram with a 32 ms analysis window and 16 ms overlap, resulting in a dimensionality of 128 frequency bins by 188 time frames.
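The paper does not name its feature-extraction toolchain; a minimal sketch of this preprocessing, assuming librosa, would be:

```python
import numpy as np
import librosa

SR = 16000               # 16 kHz sampling rate
MAX_LEN = 3 * SR         # utterances are zero-padded to 3 seconds
N_FFT = int(0.032 * SR)  # 32 ms analysis window (512 samples)
HOP = int(0.016 * SR)    # 16 ms overlap between 32 ms windows -> 16 ms hop

def log_mel(path: str) -> np.ndarray:
    """Load one utterance and return its (128, 188) log-mel spectrogram."""
    y, _ = librosa.load(path, sr=SR)
    y = np.pad(y[:MAX_LEN], (0, max(0, MAX_LEN - len(y))))  # zero-pad at the end
    mel = librosa.feature.melspectrogram(y=y, sr=SR, n_fft=N_FFT,
                                         hop_length=HOP, n_mels=128)
    return librosa.power_to_db(mel)  # 128 mel bins x 188 time frames
```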
2.2. Feature Learning
Each tower of the Siamese network consists of a convolutional layer and a recurrent layer.
The model parameters are shown on the upper right side of Figure 1.
The convolutional layer has 12 filters with Rectified Linear Unit (ReLU) activations and a receptive field of 5 × 5, followed by a 2(frequency) × 5(time) max-pooling.
For each time step, we concatenate the features across different channels, project them to a 48-dimensional layer, and finally feed them into a GRU layer with 32 hidden units.
Up to now, we get the frame-level features for both evaluation and enrollment utterances.
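A minimal PyTorch sketch of one tower (the layer sizes follow the paper; convolution padding and tensor layout are our assumptions):

```python
import torch
import torch.nn as nn

class Tower(nn.Module):
    """One tower of the Siamese encoder: CNN -> projection -> GRU."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 12, kernel_size=5, padding=2)  # 12 filters, 5x5
        self.pool = nn.MaxPool2d((2, 5))             # 2 (freq) x 5 (time) max-pooling
        self.proj = nn.Linear(12 * 64, 48)           # concat channels -> 48-dim
        self.gru = nn.GRU(48, 32, batch_first=True)  # 32 hidden units

    def forward(self, x):                        # x: (B, 1, 128, 188)
        h = self.pool(torch.relu(self.conv(x)))  # (B, 12, 64, 37)
        h = h.permute(0, 3, 1, 2).flatten(2)     # (B, 37, 12*64)
        h = self.proj(h)                         # (B, 37, 48)
        out, _ = self.gru(h)                     # frame-level features (B, 37, 32)
        return out
```

One such tower per branch encodes the enrollment and the evaluation spectrograms.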
2.3. Seq2Seq Attention Mechanism
Rather than averaging the frame-level CRNN features to produce an utterance-level representation for enrollment and evaluation respectively, we adopt a Seq2Seq attention mechanism to first align these two frame-level feature sequences.
In particular, each enrollment frame can be aligned to a weighted average of evaluation frames.
This average representation is then concatenated with the original enrollment feature to form a unified feature sequence of the two utterances, which is then averaged to generate an evaluation-enrollment joint vector.
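The frame-scoring function is not specified in this excerpt; a minimal sketch of the alignment-and-concatenation step with dot-product scores (our assumption) in PyTorch:

```python
import torch
import torch.nn.functional as F

def joint_vector(enroll: torch.Tensor, evalu: torch.Tensor) -> torch.Tensor:
    """enroll: (B, Te, D) and evalu: (B, Tv, D) frame-level features."""
    scores = torch.bmm(enroll, evalu.transpose(1, 2))  # (B, Te, Tv) frame scores
    weights = F.softmax(scores, dim=-1)                # attend over eval frames
    context = torch.bmm(weights, evalu)                # (B, Te, D) aligned eval
    unified = torch.cat([enroll, context], dim=-1)     # (B, Te, 2D) unified seq
    return unified.mean(dim=1)                         # (B, 2D) joint vector
```

With the 32-dimensional GRU features above, the resulting joint vector is 64-dimensional, matching the metric-learning head sketched earlier.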
