In the age of OTT platforms, there are still some who prefer to download movies/videos from YouTube/Facebook/Torrents (shush!) over streaming. I am one of them, and on one such occasion, I couldn't find the subtitle file for a particular movie I had downloaded. Then the idea for AutoSub struck me, and since I had worked with DeepSpeech previously, I decided to use it to generate subtitles for my movie.
Given a video file as input, my goal was to generate an .srt file. The subtitles could then be imported into any modern video player. In this article, I'm going to walk you through some of the code. You can find the project on my GitHub here, with instructions on how to install it locally.
Prerequisites: an intermediate understanding of Python, some familiarity with automatic speech recognition engines, and a basic understanding of signal processing will be helpful.
Note: This is my first article on Medium. If you have any suggestions or doubts, please leave them in the comments. Happy reading :)
Mozilla DeepSpeech
DeepSpeech is an open-source speech-to-text engine based on the original Deep Speech research paper by Baidu. It is one of the best speech recognition tools out there, given its versatility and ease of use. It is built using TensorFlow, is trainable on custom datasets, comes pre-trained on the huge Mozilla Common Voice dataset, and is licensed under the Mozilla Public License. The best part is that we can download the model files and perform inference locally within just a couple of minutes!
DeepSpeech does have its issues, though. The model struggles with non-native English accents. There is a workaround: fine-tuning the model on a custom dataset in the language we want to recognize. I'll write another article on how to do that soon.
If you’re working with speech recognition tasks, I strongly recommend you have a look at DeepSpeech.
AutoSub
Let’s start off by installing some packages we’ll be needing. All commands have been tested on Ubuntu 18.04 in a pip virtual environment.
1. FFmpeg: FFmpeg is the leading multimedia framework, able to decode, encode, transcode, mux, demux, stream, filter, and play pretty much anything that humans and machines have created. We need it to extract the audio from our input video file.
$ sudo apt-get install ffmpeg
2. DeepSpeech: Install the Python package from PyPI and download the model file. The scorer file is optional but greatly increases accuracy.
$ pip install deepspeech==0.8.2

# Model file (~190 MB)
$ wget https://github.com/mozilla/DeepSpeech/releases/download/v0.8.2/deepspeech-0.8.2-models.pbmm

# Scorer file (~900 MB)
$ wget https://github.com/mozilla/DeepSpeech/releases/download/v0.8.2/deepspeech-0.8.2-models.scorer
Now that we’re all set up, let’s first extract the audio from our input video file using FFmpeg. We need to create a subprocess to run UNIX commands. DeepSpeech expects the input audio file to be sampled at 16kHz and hence the arguments to ffmpeg given below.
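Here's a minimal sketch of that extraction step. The helper name extract_audio and the quiet-logging flags are my own choices; the arguments that matter for DeepSpeech are -ac 1 (mono), -ar 16000 (16 kHz sample rate), and -vn (drop the video stream):

import subprocess

def extract_audio(input_file, audio_file):
    """Extract 16 kHz mono WAV audio from the input video using FFmpeg."""
    command = [
        "ffmpeg", "-hide_banner", "-loglevel", "warning",
        "-i", input_file,
        "-ac", "1",       # mono channel
        "-ar", "16000",   # 16 kHz sample rate, as DeepSpeech expects
        "-vn",            # drop the video stream
        audio_file,
    ]
    subprocess.run(command, check=True)

extract_audio("movie.mp4", "movie.wav")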
Now, suppose our input video file is 2 hours long. Running DeepSpeech inference on the whole file is not generally recommended. I've tried it, and it didn't work out too well. One workaround is to split the audio file on its silent segments. After splitting, we have multiple small files containing speech, which we can run inference on individually. This is done using pyAudioAnalysis.
The following function uses read_audio_file() and silenceRemoval() from pyAudioAnalysis and generates segment limits marking where speech begins and ends. The arguments control the smoothing window size (in seconds) and a weight factor in (0, 1). Using these segment limits, smaller audio files are written to disk.
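A sketch of that splitting step might look like the following. Note that the exact names depend on your pyAudioAnalysis version: recent releases expose silence_removal(), which is the renamed silenceRemoval() mentioned above. The window and weight values here are illustrative, not the project's tuned ones:

import os
from scipy.io import wavfile
from pyAudioAnalysis import audioBasicIO
from pyAudioAnalysis import audioSegmentation as aS

def split_on_silence(audio_file, out_dir, smoothing=1.0, weight=0.2):
    """Split a WAV file into speech segments using silence detection."""
    rate, signal = audioBasicIO.read_audio_file(audio_file)
    # Returns a list of [start, end] limits in seconds for each speech
    # segment. Older pyAudioAnalysis releases call this silenceRemoval().
    segments = aS.silence_removal(signal, rate, 0.05, 0.05,
                                  smooth_window=smoothing, weight=weight)
    for start, end in segments:
        # Encode the limits in the filename so timestamps can be
        # recovered later when writing the SRT file.
        name = os.path.join(out_dir, f"{start:.2f}-{end:.2f}.wav")
        wavfile.write(name, rate, signal[int(start * rate):int(end * rate)])
    return segments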
We now need to run DeepSpeech inference on these files individually and write the inferred text to an SRT file. Let's start by creating an instance of the DeepSpeech Model and adding the scorer file. We then read each audio file into a NumPy array and feed it into the speech-to-text function to produce the transcription. As mentioned above, the segment files are saved with their segment limits (in seconds) in the filename. We need to extract those limits and convert them into a suitable form before writing to the SRT file. The write function is defined here.
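Here's a sketch of this stage, assuming (from the splitting step above) that each segment file is named start-end.wav. Model(), enableExternalScorer(), and stt() are the DeepSpeech 0.8 Python API; the timestamp helper and file layout are my own:

import os
import wave
import numpy as np
from deepspeech import Model

def format_timestamp(seconds):
    """Convert seconds (float) to the SRT timestamp format HH:MM:SS,mmm."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3600000)
    minutes, millis = divmod(millis, 60000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

ds = Model("deepspeech-0.8.2-models.pbmm")
ds.enableExternalScorer("deepspeech-0.8.2-models.scorer")

with open("movie.srt", "w") as srt:
    counter = 1
    for fname in sorted(os.listdir("segments"),
                        key=lambda f: float(f.split("-")[0])):
        # Recover the segment limits encoded in the filename.
        start, end = (float(x) for x in fname[:-4].split("-"))
        with wave.open(os.path.join("segments", fname), "rb") as fin:
            audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)
        text = ds.stt(audio)  # run inference on this segment
        if text:
            srt.write(f"{counter}\n")
            srt.write(f"{format_timestamp(start)} --> {format_timestamp(end)}\n")
            srt.write(f"{text}\n\n")
            counter += 1

Loading the model once outside the loop matters here: initialization takes a few seconds, while stt() on a short segment is fast.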
The whole process shouldn’t take more than 60% of the original video file duration. Here’s a video showing a sample run on my laptop.
There’s one more area I hope to improve in the future. The inferred text is unformatted. We need to add proper punctuation, correct possible small mistakes in words (off by one letter), and split very long segments into smaller ones (although this will be difficult to automate).
That's it! If you've made it this far, thank you for sticking around :)