Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

丁曦
2023-12-01

Baidu Research – Silicon Valley AI Lab

Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, Erich Elsen, Jesse Engel, Linxi Fan, Christopher Fougner, Tony Han, Awni Hannun, Billy Jun, Patrick LeGresley, Libby Lin, Sharan Narang, Andrew Ng, Sherjil Ozair, Ryan Prenger, Jonathan Raiman, Sanjeev Satheesh, David Seetapun, Shubho Sengupta, Yi Wang, Zhiqian Wang, Chong Wang, Bo Xiao, Dani Yogatama, Jun Zhan, Zhenyao Zhu

Abstract

We show that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech—two vastly different languages. Because it replaces entire pipelines of hand-engineered components with neural networks, end-to-end learning allows us to handle a diverse variety of speech including noisy environments, accents and different languages. Key to our approach is our application of HPC techniques, resulting in a 7x speedup over our previous system. Because of this efficiency, experiments that previously took weeks now run in days. This enables us to iterate more quickly to identify superior architectures and algorithms. As a result, in several cases, our system is competitive with the transcription of human workers when benchmarked on standard datasets. Finally, using a technique called Batch Dispatch with GPUs in the data center, we show that our system can be inexpensively deployed in an online setting, delivering low latency when serving users at scale.

1 Introduction

Decades worth of hand-engineered domain knowledge has gone into current state-of-the-art automatic speech recognition (ASR) pipelines. A simple but powerful alternative solution is to train such ASR models end-to-end, using deep learning to replace most modules with a single model. We present the second generation of our speech system that exemplifies the major advantages of end-to-end learning. The Deep Speech 2 ASR pipeline approaches or exceeds the accuracy of Amazon Mechanical Turk human workers on several benchmarks, works in multiple languages with little modification, and is deployable in a production setting. It thus represents a significant step towards a single ASR system that addresses the entire range of speech recognition contexts handled by humans. Since our system is built on end-to-end deep learning, we can employ a spectrum of deep learning techniques: capturing large training sets, training larger models with high performance computing, and methodically exploring the space of neural network architectures. We show that through these techniques we are able to reduce error rates of our previous end-to-end system in English by up to 43%, and can also recognize Mandarin speech with high accuracy.

One of the challenges of speech recognition is the wide range of variability in speech and acoustics. As a result, modern ASR pipelines are made up of numerous components including complex feature extraction, acoustic models, language and pronunciation models, speaker adaptation, etc. Building and tuning these individual components makes developing a new speech recognizer very hard, especially for a new language. Indeed, many parts do not generalize well across environments or languages, and it is often necessary to support multiple application-specific systems in order to provide acceptable accuracy. This state of affairs is different from human speech recognition: people have the innate ability to learn any language during childhood, using general skills to learn language. After learning to read and write, most humans can transcribe speech with robustness to variation in environment, speaker accent and noise, without additional training for the transcription task. To meet the expectations of speech recognition users, we believe that a single engine must learn to be similarly competent: able to handle most applications with only minor modifications, and able to learn new languages from scratch without dramatic changes. Our end-to-end system puts this goal within reach, allowing us to approach or exceed the performance of human workers on several tests in two very different languages: Mandarin and English.

Since Deep Speech 2 (DS2) is an end-to-end deep learning system, we can achieve performance gains by focusing on three crucial components: the model architecture, large labeled training datasets, and computational scale. This approach has also yielded great advances in other application areas such as computer vision and natural language. This paper details our contribution to these three areas for speech recognition, including an extensive investigation of model architectures and the effect of data and model size on recognition performance. In particular, we describe numerous experiments with neural networks trained with the Connectionist Temporal Classification (CTC) loss function to predict speech transcriptions from audio.
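
To make this concrete, the following is a minimal sketch of CTC training in PyTorch (our illustration, not the paper's custom GPU implementation; all shapes and hyperparameters are placeholders): an RNN reads spectrogram frames, emits per-frame character probabilities, and the CTC loss aligns them to the target transcription without any frame-level labels.

import torch
import torch.nn as nn

T, N, F, H, C = 200, 4, 161, 256, 29    # frames, batch, features, hidden, characters (index 0 is the CTC blank)

rnn = nn.GRU(input_size=F, hidden_size=H, bidirectional=True)
proj = nn.Linear(2 * H, C)
ctc = nn.CTCLoss(blank=0)

x = torch.randn(T, N, F)                      # a batch of spectrogram frames (time-major)
targets = torch.randint(1, C, (N, 20))        # character indices of the transcriptions
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 20, dtype=torch.long)

hidden, _ = rnn(x)
log_probs = proj(hidden).log_softmax(dim=-1)  # (T, N, C) per-frame log-probabilities
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                               # gradients flow through the whole network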

We consider networks composed of many layers of recurrent connections, convolutional filters, and nonlinearities, as well as the impact of a specific instance of Batch Normalization (BatchNorm) applied to RNNs. We not only find networks that produce much better predictions than those in previous work, but also find instances of recurrent models that can be deployed in a production setting with no significant loss in accuracy.
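
As a rough illustration of the BatchNorm variant discussed later (sequence-wise normalization, where statistics are computed over both the batch and time dimensions of the input-to-hidden term), here is a sketch; the shapes are illustrative and this is not our production code:

import torch

def sequence_batch_norm(x, gamma, beta, eps=1e-5):
    # x has shape (T, N, F): time, batch, features; normalize each
    # feature over all frames of all utterances in the minibatch
    mean = x.mean(dim=(0, 1), keepdim=True)
    var = x.var(dim=(0, 1), unbiased=False, keepdim=True)
    return gamma * (x - mean) / torch.sqrt(var + eps) + beta

x = torch.randn(200, 4, 256)
y = sequence_batch_norm(x, torch.ones(256), torch.zeros(256))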

Beyond the search for better model architecture, deep learning systems benefit greatly from large quantities of training data. We detail our data capturing pipeline that has enabled us to create larger datasets than what is typically used to train speech recognition systems. Our English speech system is trained on 11,940 hours of speech, while the Mandarin system is trained on 9,400 hours. We use data synthesis to further augment the data during training.

Training on large quantities of data usually requires the use of larger models. Indeed, our models have many more parameters than those used in our previous system. Training a single model at these scales requires tens of exaFLOPs that would require 3-6 weeks to execute on a single GPU.
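
For a rough sense of scale (our illustrative arithmetic, not a figure from the paper): at a sustained throughput of about 3 teraFLOP/second on a single GPU, 10 exaFLOPs of training compute works out to 10^19 / (3 x 10^12) ≈ 3.3 x 10^6 seconds, or roughly five and a half weeks, consistent with the 3-6 week figure.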

This makes model exploration a very time consuming exercise, so we have built a highly optimized training system that uses 8 or 16 GPUs to train one model. In contrast to previous large-scale training approaches that use parameter servers and asynchronous updates, we use synchronous SGD, which is easier to debug while testing new ideas, and also converges faster for the same degree of data parallelism.
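
The idea of synchronous SGD is simple enough to sketch in a few lines (a toy NumPy simulation under our own simplifying assumptions; the real system exchanges gradients over the network):

import numpy as np

def sync_sgd_step(params, worker_grads, lr):
    # each worker contributes a gradient computed on its shard of the
    # minibatch; the gradients are combined with an all-reduce so every
    # replica applies exactly the same update and never diverges
    grad = sum(worker_grads) / len(worker_grads)
    return params - lr * grad

params = np.zeros(4)
worker_grads = [np.random.randn(4) for _ in range(8)]   # e.g. 8 GPUs
params = sync_sgd_step(params, worker_grads, lr=0.1)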

To make the entire system efficient, we describe optimizations for a single GPU as well as improvements to scalability for multiple GPUs. We employ optimization techniques typically found in High Performance Computing to improve scalability. These optimizations include a fast implementation of the CTC loss function on the GPU, and a custom memory allocator. We also use carefully integrated compute nodes and a custom implementation of all-reduce to accelerate inter-GPU communication (a sketch of the ring-style pattern follows this paragraph). Overall the system sustains approximately 50 teraFLOP/second when training on 16 GPUs. This amounts to 3 teraFLOP/second per GPU, which is about 50% of peak theoretical performance. This scalability and efficiency cuts training times down to 3 to 5 days, allowing us to iterate more quickly on our models and datasets.
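
The following is a sequential NumPy simulation of the ring all-reduce pattern (our illustration of the general technique; the paper does not publish its implementation, and a real version runs the steps in parallel over the network). Each of P workers passes 1/P of the data around a ring, so per-link bandwidth stays constant as P grows:

import numpy as np

def ring_all_reduce(grads):
    P = len(grads)
    chunks = [list(np.array_split(g.astype(float), P)) for g in grads]
    for s in range(P - 1):              # reduce-scatter phase
        for p in range(P):              # worker p sends chunk (p - s) to worker p + 1
            c = (p - s) % P
            chunks[(p + 1) % P][c] = chunks[(p + 1) % P][c] + chunks[p][c]
    for s in range(P - 1):              # all-gather phase
        for p in range(P):              # worker p forwards a fully reduced chunk
            c = (p + 1 - s) % P
            chunks[(p + 1) % P][c] = chunks[p][c]
    return [np.concatenate(ch) for ch in chunks]

grads = [np.ones(8) * (p + 1) for p in range(4)]   # 4 workers
print(ring_all_reduce(grads)[0])                   # every entry is 1 + 2 + 3 + 4 = 10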

We benchmark our system on several publicly available test sets and compare the results to our previous end-to-end system. Our goal is to eventually reach human-level performance not only on specific benchmarks, where it is possible to improve through dataset-specific tuning, but on a range of benchmarks that reflects a diverse set of scenarios. To that end, we have also measured the performance of human workers on each benchmark for comparison. We find that our system outperforms humans in some commonly-studied benchmarks and has significantly closed the gap in much harder cases. In addition to public benchmarks, we show the performance of our Mandarin system on internal datasets that reflect real-world product scenarios.

Deep learning systems can be challenging to deploy at scale. Large neural networks are computationally expensive to evaluate for each user utterance, and some network architectures are more easily deployed than others. Through model exploration, we find high-accuracy, deployable network architectures, which we detail here. We also employ a batching scheme suitable for GPU hardware called Batch Dispatch that leads to an efficient, real-time implementation of our Mandarin engine on production servers (a sketch of the idea follows this paragraph). Our implementation achieves a 98th percentile compute latency of 67 milliseconds, while the server is loaded with 10 simultaneous audio streams.
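
A toy version of the batching idea (the scheduling details here are our own simplification, not the paper's algorithm): requests queue up as they arrive, and whenever the GPU is free, everything currently waiting is processed as one batch, trading a small amount of latency for much better GPU utilization.

import queue
import threading

requests = queue.Queue()

def dispatch_loop(run_batch, max_batch=10):
    while True:
        batch = [requests.get()]          # block until at least one request arrives
        while len(batch) < max_batch:     # then grab whatever else is already waiting
            try:
                batch.append(requests.get_nowait())
            except queue.Empty:
                break
        run_batch(batch)                  # one forward pass serves every stream in the batch

threading.Thread(target=dispatch_loop,
                 args=(lambda batch: print(len(batch), "streams"),),
                 daemon=True).start()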

The remainder of the paper is as follows. We begin with a review of related work in deep learning, end-to-end speech recognition, and scalability in Section 2. Section 3 describes the architectural and algorithmic improvements to the model and Section 4 explains how to efficiently compute them. We discuss the training data and steps taken to further augment the training set in Section 5. An analysis of results for the DS2 system in English and Mandarin is presented in Section 6. We end with a description of the steps needed to deploy DS2 to real users in Section 7.

2 Related Work

This work is inspired by previous work in both deep learning and speech recognition. Feed-forward neural network acoustic models were explored more than 20 years ago. Recurrent neural networks and networks with convolution were also used in speech recognition around the same time. More recently DNNs have become a fixture in the ASR pipeline, with almost all state-of-the-art speech work containing some form of deep neural network. Convolutional networks have also been found beneficial for acoustic models. Recurrent neural networks, typically LSTMs, are just beginning to be deployed in state-of-the-art recognizers and work well together with convolutional layers for the feature extraction. Models with both bidirectional and unidirectional recurrence have been explored as well.

End-to-end speech recognition is an active area of research, showing compelling results both when used to re-score the outputs of a DNN-HMM system and as a standalone system. Two methods are currently used to map variable-length audio sequences directly to variable-length transcriptions. The RNN encoder-decoder paradigm uses an encoder RNN to map the input to a fixed-length vector and a decoder network to expand the fixed-length vector into a sequence of output predictions. Adding an attentional mechanism to the decoder greatly improves performance of the system, particularly with long inputs or outputs. In speech, the RNN encoder-decoder with attention performs well in predicting both phonemes and graphemes.
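
As a reference point for the attention mechanism mentioned above, here is a minimal dot-product attention step in PyTorch (a generic textbook formulation with illustrative shapes, not the specific models cited): the decoder state scores every encoder output, and the context vector is their softmax-weighted sum.

import torch

def attend(decoder_state, encoder_outputs):
    # decoder_state: (N, H); encoder_outputs: (T, N, H)
    scores = (encoder_outputs * decoder_state.unsqueeze(0)).sum(-1)   # (T, N)
    weights = torch.softmax(scores, dim=0)                            # normalize over time
    return (weights.unsqueeze(-1) * encoder_outputs).sum(0)          # context: (N, H)

context = attend(torch.randn(4, 256), torch.randn(100, 4, 256))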

The other commonly used technique for mapping variable-length audio input to variable-length output is the CTC loss function coupled with an RNN to model temporal information. The CTC-RNN model performs well in end-to-end speech recognition with grapheme outputs. The CTC-RNN model has also been shown to work well in predicting phonemes, though a lexicon is still needed in this case. Furthermore, it has been necessary to pre-train the CTC-RNN network with a DNN cross-entropy network that is fed frame-wise alignments from a GMM-HMM system. In contrast, we train the CTC-RNN networks from scratch without the need of frame-wise alignments for pre-training.
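
To see why CTC removes the need for frame-wise alignments, consider its decoding rule; the sketch below shows the simplest (greedy, best-path) variant, standard CTC practice rather than anything specific to this paper: take the most likely symbol at each frame, collapse repeats, then drop blanks.

def ctc_greedy_decode(frame_ids, blank=0):
    out, prev = [], None
    for i in frame_ids:
        if i != prev and i != blank:   # collapse repeats, skip blanks
            out.append(i)
        prev = i
    return out

# frames h h - e - l l - l o  ->  h e l l o  (with '-' as the blank)
print(ctc_greedy_decode([8, 8, 0, 5, 0, 12, 12, 0, 12, 15]))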

Exploiting scale in deep learning has been central to the success of the field thus far. Training on a single GPU resulted in substantial performance gains, which were subsequently scaled linearly to two or more GPUs. We take advantage of work in increasing individual GPU efficiency for low-level deep learning primitives. We build on the past work in using model-parallelism, data-parallelism or a combination of the two to create a fast and highly scalable system for training deep RNNs in speech recognition.

Data has also been central to the success of end-to-end speech recognition, with over 7000 hours of labeled speech used in Deep Speech 1 (DS1). Data augmentation has been highly effective in improving the performance of deep learning in computer vision. This has also been shown to improve speech systems. Techniques used for data augmentation in speech range from simple noise addition to complex perturbations such as simulating changes to the vocal tract length and rate of speech of the speaker.
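
Illustrative versions of the two simplest augmentations just mentioned (generic signal-processing sketches, not the paper's pipeline): additive noise mixed in at a chosen signal-to-noise ratio, and a naive rate change by linear resampling.

import numpy as np

def add_noise(signal, noise, snr_db):
    # scale the noise so the mix has the requested SNR in decibels;
    # assumes the noise clip is at least as long as the signal
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return signal + scale * noise[: len(signal)]

def change_speed(signal, rate):
    # rate > 1 speeds the utterance up, rate < 1 slows it down
    positions = np.arange(0, len(signal) - 1, rate)
    return np.interp(positions, np.arange(len(signal)), signal)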

Existing speech systems can also be used to bootstrap new data collection. In one approach, the authors use one speech engine to align and filter a thousand hours of read speech. In another approach, a heavy-weight offline speech recognizer is used to generate transcriptions for tens of thousands of hours of speech. This is then passed through a filter and used to re-train the recognizer, resulting in significant performance gains. We draw inspiration from these past approaches in bootstrapping larger datasets and data augmentation to increase the effective amount of labeled data for our system.
