The official Kaldi introduction to egs is here:
http://kaldi-asr.org/doc/examples.html
Below are just brief notes; if you need a particular recipe, read the README inside its folder carefully.
aishell: contains two folders; s5 is a speech recognition demo and v1 is a speaker recognition demo. The data is AISHELL-1.
aishell2: only s5, i.e., speech recognition.
ami: the AMI Meeting Corpus. Only s5 and s5b, both speech recognition; s5b is presumably an improvement over the s5 setup.
an4: speech recognition only.
apiai_decode: speech recognition only. This directory contains scripts on how to use a pre-trained chain English model and the Kaldi base code to recognize any number of wav files.
aspire: speech recognition. This recipe is JHU's submission to the ASpIRE challenge. It uses the Fisher-English corpus for training the acoustic and language models. It uses impulse responses and noises from the RWCP, AIR and Reverb2014 databases to create multi-condition data.
aurora4: uses the Wall Street Journal corpus; contains both clean speech and speech with artificially added noise.
babel: speech recognition. Demos of four approaches, including a multilingual demo.
babel_multilang: same as above; presumably a dedicated multilingual recipe.
bentham: handwriting recognition (OCR). This directory contains example scripts for handwriting recognition on the Bentham dataset: http://www.transcriptorium.eu/~htrcontest/contestICFHR2014/public_html/. In the ICFHR 2014 contest, the best performing system in the unrestricted track obtained a WER of 8.6%.
bn_music_speech: music/speech discrimination. The MUSAN corpus is required for system training. It is available at: http://www.openslr.org/17/
The test requires Broadcast News data. The LDC catalog numbers are:
speech LDC97S44
transcripts LDC97T22
callhome_diarization: This directory contains example scripts for speaker diarization on a portion of CALLHOME used in the 2000 NIST speaker recognition evaluation. The 2000 NIST SRE is required, and has an LDC catalog number LDC2001S97. Additional data sources (mostly past NIST SREs, Switchboard, etc.) are required to train the systems in the subdirectories. See the corresponding README.txt files in the subdirectories for more details. The subdirectories "v1" and so on are different diarization recipes. The recipe in v1 demonstrates a standard approach using iVectors, PLDA scoring and agglomerative hierarchical clustering.
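The v1 clustering step can be pictured with a toy sketch. This is not the Kaldi recipe: the real system scores segment pairs with PLDA on iVectors, while here plain Euclidean distance on made-up "embeddings" stands in for the scorer; only the agglomerative merge loop is the point.

```python
import numpy as np

def ahc(points, n_clusters):
    """Naive average-linkage agglomerative clustering (O(n^3); toy only)."""
    clusters = [[i] for i in range(len(points))]

    def dist(a, b):
        # average pairwise distance between two clusters of segment indices
        return np.mean([np.linalg.norm(points[i] - points[j])
                        for i in a for j in b])

    while len(clusters) > n_clusters:
        # find and merge the closest pair of clusters
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)

    labels = np.empty(len(points), dtype=int)
    for k, c in enumerate(clusters):
        labels[c] = k
    return labels

rng = np.random.default_rng(0)
# Toy "speaker embeddings": two well-separated speakers, 4 segments each.
segs = np.vstack([rng.normal(0.0, 0.1, size=(4, 8)),
                  rng.normal(3.0, 0.1, size=(4, 8))])
labels = ahc(segs, n_clusters=2)
print(labels)
```

In the real recipe the stopping point (number of speakers) is either given or chosen by thresholding the PLDA scores; here it is simply fixed at 2.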
callhome_egyptian: speech recognition.
chime1: speech recognition demo. This is a Kaldi setup for the 1st CHiME challenge. See
http://spandh.dcs.shef.ac.uk/projects/chime/challenge.html
for more detailed information.
chime2: speech recognition demo, same as above.
chime3: speech recognition demo, same as above.
chime4: speech recognition demo, same as above.
chime5: speech recognition demo, same as above.
cifar: image classification. This directory contains example scripts for image classification with the CIFAR-10 and CIFAR-100 datasets, which are available for free from
https://www.cs.toronto.edu/~kriz/cifar.html.
This demonstrates applying the nnet3 framework to image classification for
fixed size images.
commonvoice: speech recognition. This is a Kaldi recipe for the Mozilla Common Voice corpus v1. See https://voice.mozilla.org/data for additional details.
The amount of training audio is approximately 240 hours.
csj: speech recognition, using a Japanese corpus: the Corpus of Spontaneous Japanese.
dihard_2018: This is a Kaldi recipe for the First DIHARD Speech Diarization Challenge. DIHARD is a new annual challenge focusing on "hard" diarization; that is, speech diarization for challenging corpora where there is an expectation that the current state of the art will fare poorly, including, but not limited to: clinical interviews, extended child language acquisition recordings, YouTube videos and "speech in the wild" (e.g., recordings in restaurants).
See https://coml.lscp.ens.fr/dihard/index.html for details.
The subdirectories "v1" and so on are different speaker diarization recipes. The recipe in v1 demonstrates a standard approach using a full-covariance GMM-UBM, i-vectors, PLDA scoring and agglomerative hierarchical clustering. The example in v2 demonstrates DNN speaker embeddings, PLDA scoring and agglomerative hierarchical clustering.
fame: the FAME! Speech Corpus, a Frisian corpus. Includes both speech recognition and speaker recognition.
farsdat: speech recognition. FARSDAT is the Persian counterpart of TIMIT; a Persian corpus.
fisher_callhome_spanish: speech recognition; Spanish corpus.
fisher_english: speech recognition; the Fisher-English corpus.
fisher_swbd: speech recognition. This one has no README.
For an introduction to the SWBD corpus: https://catalog.ldc.upenn.edu/LDC97S62
formosa: speech recognition. ### Welcome to the demo recipe of the Formosa Speech in the Wild (FSW) Project ###
The language habits of Taiwanese people are different from those of other Mandarin speakers (both accents and cultures) [1]. In particular, Taiwanese use traditional Chinese characters (i.e., 繁體中文). To address this issue, a Taiwanese speech corpus collection project, "Formosa Speech in the Wild (FSW)", was initiated in 2017 to improve the development of Taiwanese-specific speech recognition techniques.
gale_arabic: speech recognition; Arabic corpus. GALE Phase 2 Arabic Broadcast Conversation Speech was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 200 hours of Arabic broadcast conversation speech collected in 2006 and 2007 by LDC as part of the DARPA GALE (Global Autonomous Language Exploitation) Program.
gale_mandarin: speech recognition; Mandarin Chinese corpus. This recipe is trained on LDC2013S08 (text transcripts from LDC2013T20), which is GALE Phase 2 Chinese Broadcast News speech: 126 hours of Mandarin Chinese broadcast news speech collected in 2006 and 2007 by LDC and HKUST.
gp: multilingual speech recognition. About the GlobalPhone corpus:
This is a corpus of read sentences from newspapers in 19 different languages, recorded under varying degrees of "clean" conditions. There are roughly 15-20 hours of training data for each language, as well as DEV and EVAL sets of roughly 2 hours each.
heroico: Spanish, speech recognition.
hkust: speech recognition; Mandarin Chinese telephone speech (the HKUST corpus).
hub4_english: speech recognition; the English Broadcast News (HUB4) corpus. Uses ten LDC data releases.
hub4_spanish: Spanish, speech recognition.
iam: handwriting recognition. This directory contains example scripts for handwriting recognition on
the IAM dataset:
http://www.fki.inf.unibe.ch/databases/iam-handwriting-database
iban: speech recognition; the Iban language of Malaysia.
ifnenit: This directory contains example scripts for handwriting recognition on the Arabic IFN/ENIT dataset: http://www.ifnenit.com
You’ll need to register at their website to be able to download the dataset.
librispeech: speech recognition. The LibriSpeech corpus is a large (1000 hour) corpus of English read speech derived from audiobooks in the LibriVox project, sampled at 16kHz.
The accents are various and not marked, but the majority are US English. It is available for download for free at http://www.openslr.org/12/. It was prepared as a speech recognition corpus by Vassil Panayotov.
lre: language identification.
lre07: language identification. This directory (lre07) contains example recipes for the 2007 NIST
Language Evaluation. The subdirectory v1 demonstrates the standard
LID system, which is an I-Vector based recipe using full covariance
GMM-UBM and logistic regression model. The subdirectory v2 demonstrates
the LID system using a time delay deep neural network based UBM
which is used to replace the GMM-UBM of v1. The DNN is trained using
about 1800 hours of the English portion of Fisher.
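As a rough illustration of the v1 backend idea (a multinomial logistic regression classifier over fixed-length utterance vectors), here is a toy sketch in plain NumPy. The "i-vectors" are random synthetic clusters, not real features, and the trainer is plain batch gradient descent rather than anything taken from the recipe.

```python
import numpy as np

rng = np.random.default_rng(1)
n_lang, dim, per_lang = 3, 10, 50
# Toy "i-vectors": one Gaussian cluster per language.
means = rng.normal(size=(n_lang, dim)) * 3
X = np.vstack([means[k] + rng.normal(size=(per_lang, dim))
               for k in range(n_lang)])
y = np.repeat(np.arange(n_lang), per_lang)

# Multinomial logistic regression trained with batch gradient descent.
W = np.zeros((dim, n_lang))
b = np.zeros(n_lang)
onehot = np.eye(n_lang)[y]
for _ in range(200):
    logits = X @ W + b
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)             # softmax posteriors
    grad = p - onehot                             # cross-entropy gradient
    W -= 0.1 * X.T @ grad / len(X)
    b -= 0.1 * grad.mean(axis=0)

acc = ((X @ W + b).argmax(axis=1) == y).mean()
print(acc)
```

The real v1 system extracts i-vectors with a full-covariance GMM-UBM before this classification stage; the sketch only covers the classifier.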
madcat_ar: MADCAT (Multilingual Automatic Document Classification Analysis and Translation) Arabic Corpus is an LDC dataset (LDC2012T15, LDC2013T09, LDC2013T15) for handwriting recognition. The dataset contains abstracts from news-related passages and blogs. The xml file for each page provides line segmentation and word segmentation information and also provides the writing condition (writing style, speed, carefulness) of the page. It is a large dataset with a total of 42k page images, 750k (600k training, 75k dev, 75k eval) line images and 305 writers. The major text is in Arabic but it also contains English letters and numerals. The dataset contains about 95k unique words and 160 unique characters. The dataset has been used in the NIST 2010 and 2013 evaluations (the OpenHaRT Arabic large vocabulary unconstrained handwritten text recognition competition, possibly with different splits) for the line-level recognition task; 16.1% WER was obtained for line-level recognition in that competition.
More info: https://catalog.ldc.upenn.edu/LDC2012T15,
https://catalog.ldc.upenn.edu/LDC2013T09/,
https://catalog.ldc.upenn.edu/LDC2013T15/.
madcat_zh: This directory contains example scripts for handwriting recognition on
the MADCAT Chinese HWR dataset (LDC2014T13).
This dataset consists of handwritten Chinese documents, scanned
at high resolution and annotated for each line and token.
More info: https://catalog.ldc.upenn.edu/LDC2014T13
mini_librispeech: speech recognition; no README.
multi_en: speech recognition. This is a WIP English LVCSR (Large Vocabulary Continuous Speech Recognition) recipe that trains on data from multiple corpora.
ptb: language modeling using the Penn Treebank corpus.
reverb: speech recognition with reverberation. About the REVERB challenge ASR task:
This is a Kaldi recipe for the REVERB (REverberant Voice Enhancement and Recognition Benchmark) challenge. The challenge assumes the scenario of capturing utterances spoken by a single stationary distant-talking speaker with 1-channel (1ch), 2-channel (2ch) or 8-channel (8ch) microphone arrays in reverberant meeting rooms. It features both real recordings and simulated data, a part of which simulates the real recordings. The ASR challenge task consists of improving the recognition accuracy of the same reverberant speech. The background noise is mostly stationary and the signal-to-noise ratio is modest.
See http://reverb2014.dereverberation.com for more details.
rimes:Rimes is a French handwriting recognition database created by A2iA.
rm: speech recognition. The example Dan uses in his slides to walk through the ASR pipeline.
sitw: speaker recognition in real-world conditions. This directory (sitw) contains example scripts for the Speakers in the Wild (SITW) Speaker Recognition Challenge. The SITW corpus is required, and can be obtained by following the directions at the url
http://www.speech.sri.com/projects/sitw/
sprakbanken: speech recognition, Danish. About the sprakbanken corpus:
This corpus is a free corpus originally collected by NST for ASR purposes and currently hosted by the Norwegian libraries. The corpus is multilingual and contains Swedish, Norwegian (Bokmål) and Danish. The current setup uses the Danish subcorpus. The vocabulary is large and there are approx. 350 hours of read-aloud speech with associated text scripts.
Some months ago the corpus was republished here: http://www.nb.no/sprakbanken/#ticketsfrom?lang=en
sprakbanken-swe: same as above; speech recognition, Swedish.
sre08: speaker recognition; the 2008 evaluation corpus. This directory (sre08) contains example scripts for speaker identification, not speech recognition.
sre10: speaker recognition. This directory (sre10) contains example scripts for the NIST SRE 2010 speaker recognition evaluation.
sre16: speaker recognition, same as above; the 2016 evaluation corpus.
svhn: This directory contains example scripts for image classification with the SVHN (Street View House Numbers) dataset, which is available for free from http://ufldl.stanford.edu/housenumbers/.
This demonstrates applying the nnet3 framework to image classification for
fixed size images.
swahili: speech recognition; Swahili speech corpus.
swbd: speech recognition; Switchboard corpus and Fisher corpus.
tedlium: speech recognition.
thchs30: Chinese speech recognition. THCHS30 is an open Chinese speech database published by the Center for Speech and Language Technology (CSLT) at Tsinghua University.
The original recording was conducted in 2002 by Dong Wang, supervised by Prof. Xiaoyan Zhu, at the Key State Lab of Intelligence and System, Department of Computer Science, Tsinghua University, and the original name was 'TCMSD', standing for 'Tsinghua Continuous Mandarin Speech Database'. The publication after 13 years was initiated by Dr. Dong Wang and supported by Prof. Xiaoyan Zhu. We hope to provide a toy database for new researchers in the field of speech recognition. Therefore, the database is totally free for academic users.
The database can be downloaded from openslr:
http://www.openslr.org/18/
or from the CSLT server:
http://data.cslt.org/thchs30/README.html
tidigits: English digit recognition. The TIDIGITS database consists of men, women, boys and girls reading digit strings of varying lengths; these are sampled at 20 kHz. It's available from the LDC as catalog number LDC93S10.
timit: Available as LDC corpus LDC93S1, TIMIT is one of the original
clean speech databases. Description of catalog from LDC
(http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S1):
“The TIMIT corpus of read speech is designed to provide speech data
for acoustic-phonetic studies and for the development and evaluation
of automatic speech recognition systems. TIMIT contains broadband
recordings of 630 speakers of eight major dialects of American English,
each reading ten phonetically rich sentences. The TIMIT corpus includes
time-aligned orthographic, phonetic and word transcriptions as well as
a 16-bit, 16kHz speech waveform file for each utterance.”
Note: please do not use this TIMIT setup as a generic example of how to run Kaldi, as TIMIT has a very nonstandard structure. Any of the other setups
would be better for this purpose: e.g. librispeech/s5 is quite nice, and is
free; yesno is very tiny and fast to run and is also free; and wsj/s5 has an
unusually complete set of example scripts which may however be confusing.
s5: monophone and triphone GMM/HMM systems trained with maximum likelihood, followed by SGMM and DNN recipes.
Training is done on 48 phonemes (see Lee and Hon, "Speaker-Independent Phone Recognition Using Hidden Markov Models", IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 11, pp. 1641-1648, November 1989). In scoring we map to 39 phonemes, as is usually done in conference papers.
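The 48-to-39 folding at scoring time amounts to a simple many-to-one phone map. The sketch below shows the idea with an illustrative subset of the standard Lee and Hon foldings; it is not the complete table the recipe uses.

```python
# Illustrative subset of the Lee & Hon 48->39 phone foldings.
FOLD = {
    "ao": "aa", "ax": "ah", "ix": "ih",
    "el": "l", "en": "n", "zh": "sh",
    "cl": "sil", "vcl": "sil", "epi": "sil",
}

def fold(phones):
    """Map a hypothesis or reference phone sequence onto the 39-phone set."""
    return [FOLD.get(p, p) for p in phones]

print(fold(["ao", "zh", "k", "cl", "ih"]))  # → ['aa', 'sh', 'k', 'sil', 'ih']
```

Both the reference and the hypothesis are folded this way before the error rate is computed, so confusions within a folded group are not counted as errors.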
tunisian_msa: speech recognition. A Kaldi recipe for Arabic using the Tunisian MSA corpus.
uw3: This directory contains example scripts for optical character recognition (i.e. OCR) on the UW3 dataset (it's a printed English OCR corpus):
http://isis-data.science.uva.nl/events/dlia//datasets/uwash3.html
voxceleb: speaker recognition. This is a Kaldi recipe for speaker verification using the VoxCeleb1 and VoxCeleb2 corpora. See http://www.robots.ox.ac.uk/~vgg/data/voxceleb/
and
http://www.robots.ox.ac.uk/~vgg/data/voxceleb2/
for additional details and information on how to obtain them.
Note: This recipe requires ffmpeg to be installed and its location included
in $PATH
The subdirectories “v1” and so on are different speaker recognition
recipes. The recipe in v1 demonstrates a standard approach using a
full-covariance GMM-UBM, iVectors, and a PLDA backend. The example
in v2 demonstrates DNN speaker embeddings with a PLDA backend.
voxforge: VoxForge is a site that collects speech; you can get the corpus for free. VoxForge is a free speech corpus and acoustic model repository for open source speech recognition engines.
VoxForge was set up to collect transcribed speech to create a free GPL speech corpus for use with open source speech recognition engines. The speech audio files will be 'compiled' into acoustic models for use with open source speech recognition engines such as Julius, ISIP, Sphinx and HTK (note: HTK has distribution restrictions).
VoxForge has used LibriVox as a source of audio data since 2007.
This demo includes an online speech recognition example. So-called offline recognition means that when you start recognizing an utterance, you already have both its beginning and its end; online means you only have the beginning, not the end.
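The offline/online distinction can be illustrated with a toy sketch (this is not Kaldi code; `decode_frame` is a made-up stand-in for a real acoustic model and decoder):

```python
def decode_frame(sample):
    # Placeholder "decoder": classify each sample as speech (S) or silence (_).
    return "S" if abs(sample) > 0.5 else "_"

def offline_decode(waveform):
    # Offline: the full utterance is available up front.
    return "".join(decode_frame(s) for s in waveform)

def online_decode(stream, chunk_size=4):
    # Online: audio arrives incrementally; yield a partial hypothesis
    # after each chunk, before the end of the utterance is known.
    hyp = ""
    for i in range(0, len(stream), chunk_size):
        hyp += "".join(decode_frame(s) for s in stream[i:i + chunk_size])
        yield hyp

wav = [0.0, 0.9, 0.8, 0.1, 0.0, 0.7, 0.2, 0.0]
print(offline_decode(wav))        # → "_SS__S__"
print(list(online_decode(wav)))   # → ["_SS_", "_SS__S__"]
```

The final online hypothesis matches the offline result here, but a real online decoder must commit to partial results it cannot revise with future context, which is the whole difficulty.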
vystadial_cz: speech recognition, Czech; also includes an online recognition demo.
vystadial_en: English, speech recognition; also includes an online recognition demo. It appears to come from the same paper as the one above, and there seems to be a dialogue system as well. If you need it, read the README yourself. The data comprise over 41 hours of speech in English.
The English recordings were collected from humans interacting via telephone
calls with statistical dialogue systems, designed to provide the user
with information on a suitable dining venue in the town.
The data collection process is described in detail
in article “Free English and Czech telephone speech corpus shared under the CC-BY-SA 3.0 license”
published for LREC 2014 (To Appear).
We use common Kaldi decoders in the scripts (gmm-latgen-faster through steps/decode.sh).
However, the main purpose of providing the data and scripts
is training acoustic models for real-time speech recognition unit
for dialog system ALEX, which uses modified real-time Kaldi OnlineLatgenRecogniser.
The modified Kaldi decoders are NOT required for running the scripts!
wsj: speech recognition; the Wall Street Journal corpus. A recommended starter example for newcomers learning speech recognition with Kaldi. About the Wall Street Journal corpus:
This is a corpus of read
sentences from the Wall Street Journal, recorded under clean conditions.
The vocabulary is quite large. About 80 hours of training data.
Available from the LDC as either: [ catalog numbers LDC93S6A (WSJ0) and LDC94S13A (WSJ1) ]
or: [ catalog numbers LDC93S6B (WSJ0) and LDC94S13B (WSJ1) ]
The latter option is cheaper and includes only the Sennheiser
microphone data (which is all we use in the example scripts).
yesno: speech recognition; this one has the most material available online. Recommended for beginners. The "yesno" corpus is a very small dataset of recordings of one individual saying yes or no multiple times per recording, in Hebrew. It is available from http://www.openslr.org/1.
It is mainly included here as an easy way to test out the Kaldi scripts.
The test set is perfectly recognized at the monophone stage, so the dataset is not exactly challenging.
yomdle_fa: This directory contains example scripts for OCR on the Yomdle and Slam datasets. Training is done on the Yomdle dataset and testing is done on Slam. LM rescoring is also done with extra corpus data obtained from various newswires (e.g. Hamshahri).
yomdle_korean: same as above.
yomdle_russian: same as above.
yomdle_tamil: same as above.
yomdle_zh: same as above; presumably Chinese.
zeroth_korean: Korean, speech recognition. The Zeroth-Korean Kaldi example is from the Zeroth Project. The Zeroth project introduces a free Korean speech corpus and aims to make Korean speech recognition more broadly accessible to everyone. This project was developed in collaboration between Lucas Jo (@Atlas Guide Inc.) and Wonkyum Lee (@Gridspace Inc.).
In this example, we use 51.6 hours of transcribed Korean audio as training data (22,263 utterances, 105 people, 3,000 sentences) and 1.2 hours of transcribed Korean audio as testing data (457 utterances, 10 people). Besides audio and transcriptions, we provide a pre-trained/designed language model, lexicon and morpheme-based segmenter (Morfessor).
The database can also be downloaded from openslr:
http://www.openslr.org/40
The database is licensed under Attribution 4.0 International (CC BY 4.0).
This folder contains a speech recognition recipe based on the WSJ/Librispeech examples.
For more details about Zeroth project, please visit:
https://github.com/goodatlas/zeroth
Finally done. Dan really is a heavyweight; he has worked on so many projects, and this is surely only part of them. Seniority really does come from projects and time!
A quick summary:
Besides the big English speech recognition projects, the examples include many smaller speech-related projects, for example:
Speaker recognition:
aishell, fame, sitw, sre08, sre10, sre16, voxceleb
Image recognition, OCR, multilingual handwriting recognition:
bentham, cifar, iam, ifnenit, madcat_ar, madcat_zh, rimes, svhn, uw3,
yomdle_fa, yomdle_korean, yomdle_russian, yomdle_tamil, yomdle_zh
Speaker diarization (Prof. DeLiang Wang is a big name in this area):
callhome_diarization, dihard_2018
Multilingual speech recognition:
csj: Japanese
fame: Frisian
farsdat: Persian
fisher_callhome_spanish: Spanish
formosa: Taiwanese Mandarin
gale_arabic: Arabic
gale_mandarin: Mandarin Chinese
gp: multilingual
heroico: Spanish
hkust: Mandarin (HKUST)
hub4_spanish: Spanish
iban: Iban (Malaysia)
lre
lre07
sprakbanken: Danish
sprakbanken-swe: Swedish
swahili: Swahili
thchs30: Chinese
tunisian_msa: Arabic (Tunisian MSA)
vystadial_cz: Czech
zeroth_korean: Korean
Speech recognition with reverberation:
reverb
Online speech recognition:
voxforge
vystadial_cz: Czech
vystadial_en
Some of the challenges involved:
CHiME:
Speech Processing in Everyday Environments (CHiME 2018)
7th September, Microsoft, Hyderabad (a satellite event of Interspeech 2018)
http://spandh.dcs.shef.ac.uk/chime_workshop/chime2018/
DIHARD:
The Second DIHARD Speech Diarization Challenge
https://coml.lscp.ens.fr/dihard/index.html
LRE:
NIST 2017 Language Recognition Evaluation (a language recognition challenge)
https://www.nist.gov/itl/iad/mig/nist-2017-language-recognition-evaluation
Tsinghua in China also hosts a language recognition challenge.
REVERB (REverberant Voice Enhancement and Recognition Benchmark) challenge
https://reverb2014.dereverberation.com/
SRE:
NIST 2019 Speaker Recognition Evaluation
https://www.nist.gov/itl/iad/mig/nist-2019-speaker-recognition-evaluation
The results and papers from NIST speaker recognition evaluations are usually published at the following year's Interspeech.
In China, AISHELL also hosts speaker recognition challenges.
SITW:
The Speakers in the Wild (SITW) Speaker Recognition Challenge
http://www.speech.sri.com/projects/sitw/
Abbreviations:
LVCSR: Large Vocabulary Continuous Speech Recognition
------------ Added 2019.10.28: ------------------------------------
The following was found elsewhere; also for reference:
1. babel: IARPA Babel program. The corpora come from the Babel project; mainly examples of speech recognition and keyword search for low-resource languages, including Pashto, Persian, Turkish, Vietnamese and so on. According to the literature the results are not great, with WERs above 50.
2. sre08: "Speaker Recognition Evaluations"; speaker recognition.
3. aurora4: homepage: http://aurora.hsnr.de/. Studies various kinds of noise; speech recognition with noisy data, i.e., robust speech recognition.
4. hkust: Mandarin speech recognition with the HKUST (Hong Kong University of Science and Technology) corpus.
5. callhome_egyptian: Egyptian Arabic speech recognition.
6. chime_wsj0: data from the CHiME challenge, which targets recognition of telephone, meeting and distant-microphone data.
7. fisher_english: two-channel English telephone speech.
8. gale_arabic: Arabic from the Global Autonomous Language Exploitation (GALE) program.
9. gp: the GlobalPhone project, global telephone speech: 19 different languages, 15-20 hours of speech per language.
10. lre: language identification.
11. wsj: the Wall Street Journal corpus; it seems all the scripts started from this one.
12. swbd: the Switchboard corpus.
13. tidigits: recognition of digit strings spoken by men, women and children.
14. voxforge: an open-source speech collection project.
15. timit: American English speech of different genders and accents, with word-level annotation. The name combines Texas Instruments (TI) and Massachusetts Institute of Technology (MIT), hence TIMIT.
16. tedlium: the data is here:
http://www.openslr.org/resources/7/TEDLIUM_release1.tar.gz
English speech from TED talks, created by the Laboratoire d'Informatique de l'Université du Maine (LIUM).
17. vystadial_cz: a dataset of telephone conversations in Czech.
18. vystadial_en: a dataset of telephone conversations in English.
19. yesno: recognition of the two words "yes" and "no"; call it command-word recognition.
20. rm: DARPA Resource Management Continuous Speech Corpora.