Using Kaldi: a summary of the common examples under egs and what they do

张森

2023-12-01

Example table

Glossary: the abbreviations used in the table are defined in the list that follows it.

Example under egs	Data source; task	Tools used
aidatatang_200zh/s5	Datatang 200-hour open-source Mandarin corpus; speech recognition	LM+MFCC+Mono+Triphone(tri1:deltas;tri2:delta+delta-delta;tri3a:lda+mllt)+fMLLR+SAT+TDNN
aishell/v1	OpenSLR 33 data; speaker recognition	MFCC+UBM+PLDA
aishell/s5	OpenSLR 33 data; speech recognition	LM+MFCC+Mono+Triphone+fMLLR+SAT+TDNN
aishell2/s5	AISHELL-2; speech recognition	LM+GMM-HMM(MFCC+Mono+Triphone)+TDNN
ami/s5/run_ihm.sh	----; speech recognition	IHM (individual headset microphone): LM+MFCC+Mono+Triphone+tri4a(LDA+MLLT+SAT)+DNN+TDNN
ami/s5/run_mdm.sh	----; speech recognition	MDM (multiple distant microphones): LM+MFCC+Mono+Triphone+SAT+MMI+DNN(dnn+lda+mllt)+TDNN
ami/s5/run_sdm.sh	----; speech recognition	SDM (single distant microphone): LM+MFCC+Mono+Triphone+SAT+MMI+DNN(dnn+lda+mllt)+TDNN
ami/s5b	----; speech recognition	LM+MFCC+tri1(deltas)+tri2(lda+mllt)+tri3(lda+mllt+sat)+tdnn
an4/s5	AN4; speech recognition	LM+MFCC+tri1(deltas)+tri2(lda+mllt)+tri3(lda+mllt+sat)
apiai_decode/s5	16 kHz audio; decoding only, no model training	omitted
aspire/s5	corpora3/LDC/LDC2005T19, corpora3/LDC/LDC2004S13, corpora3/LDC/LDC2005S13; speech recognition	LM+MFCC+CMVN+Mono+Triphone+fMLLR+SAT+build_silprob.sh+TDNN+TDNN_LSTM
aurora4/s5	corpora5/LDC/LDC93S6B, corpora5/AURORA; speech recognition	MFCC+tri1(deltas)+tri2(deltas)+tri2b(lda_mllt)+tri3b(lda+mllt+sat)+TDNN
babel/s5		There are many run scripts, so only the distinctive parts are listed: plp+pitch+feats+(ffv)+mono+tri1+tri2+tri3(deltas)+tri4(lda_mllt)+sat+SGMM(fmllr+ubm+sgmm)+MMI
bentham/v1/run_end2end.sh	corpora5/handwriting_ocr/hwr1/ICDAR-HTR-Competition-2015; image recognition, OCR, end-to-end recognition	features+cmvn+lm+e2e_cnn
bn_music_speech/v1/	corpora5/LDC/LDC97S44, corpora/LDC/LDC97T22; music/speech discrimination	MFCC+UBM+vad_GMM
callhome_diarization/v1	swbd; speaker diarization of CALLHOME telephone calls	MFCC+VAD+UBM+PLDA+Cluster
callhome_diarization/v2/	swbd; speaker diarization of CALLHOME telephone calls	xvector+vad+data augmentation+mfcc+plda+cluster+diag(ubm)+VB
callhome_egyptian/s5	omitted; speech recognition	mfcc+cmvn+mono+Triphone+sat+fmllr+tdnn
casia_hwdb/v1	corpora5/handwriting_ocr/CASIA_HWDB/Offline; end-to-end handwriting recognition	omitted
chime1-6	omitted; speech recognition
cifar/v1	cifar; image recognition	omitted
cmu_cslu_kids/s5	omitted; speech recognition	LM+MFCC+CMVN+Mono+Triphone+MMI+Boosting+MPE+SAT+VTLN+tdnnf
cnceleb/v1	CN-Celeb dataset; speaker recognition	MFCC+UBM+PLDA
commonvoice/s5	corpus v1; speech recognition	LM+MFCC+Mono+Triphone+fmllr+tdnn
csj/s5	Japanese corpus; speech recognition	LM+MFCC+CMVN+GMM-HMM+fmllr+(sgmm, tdnn, dnn, rnnlm, etc.)
dihard_2018/v1	omitted; speaker diarization	MFCC+UBM+PLDA+Cluster
dihard_2018/v2	omitted; speaker diarization	MFCC+data augmentation+cmvn+xvector+plda+cluster
egs/fame	Frisian corpus; speech recognition in s5, speaker recognition in v1 and v2	s5: mfcc+cmvn+mono+triphone+sgmm+dnn+dnn_fbank; v1: standard pipeline, omitted; v2: adds ubm+dnn
farsdat/s5	Persian corpus; speech recognition	MFCC+CMVN+Mono+tri1(deltas + delta-deltas)+tri2(LDA + MLLT)+tri3(LDA + MLLT + SAT)+SGMM+MMI + SGMM2
fisher_callhome_spanish/s5	Spanish corpus; speech recognition	MFCC+CMVN+Mono+deltas+deltas+lda_mllt+fmllr+sgmm+mmi+tdnn_1g
fisher_english/s5	Fisher-English corpus; speech recognition	MFCC+CMVN+deltas+deltas+lda_mllt+fmllr+sat
fisher_swbd/s5	SWBD corpus; speech recognition	lm+mfcc+cmvn+mono+delta+delta+delta+lda_mllt+fmllr+sat+lmrescore
formosa/s5	Taiwanese Mandarin; speech recognition	lm+mfcc+pitch+cmvn+mono+delta+delta+lda_mllt+fmllr+sat+tdnn
gale_arabic	Arabic corpus; speech recognition	s5:lm+mfcc+cmvn+mono+delta+delta+lda_mllt+sat+fmllr+mmi+sgmm+dnn, s5b:lm+mfcc+cmvn+mono+delta+lda_mllt+sat+fmllr+tdnn, s5c:lm+mfcc+mono+delta+lda_mllt+sat+fmllr+tdnn, s5d:lm+mfcc+cmvn+mono+delta+lda_mllt+sat+fmllr+tdnn+tdnn_lstm
gale_mandarin/s5	Mandarin Chinese corpus; speech recognition	lm+mfcc+cmvn+mono+delta+lda_mllt+MMI+MPE+sat+fmllr+UBM+sgmm
gop/s5	omitted; GOP (goodness of pronunciation) pronunciation scoring	omitted
gp	Three languages, 15-20 h each; multilingual speech recognition	omitted
heroico/s5	Spanish; speech recognition	lm+mfcc+cmvn+mono+delta+lda_mllt+sat+fmllr+tdnn
hi_mia/v1	OpenSLR; wake-word recognition	omitted
hkust/s5	Mandarin conversational telephone speech; speech recognition	lm+mfcc+cmvn+mono+delta+delta+lda_mllt+fmllr+sat+nnet2_ms+tdnn+tdnn
hub4_english/s5	English Broadcast News (HUB4) corpus; speech recognition	lm+mfcc+cmvn+mono+delta+lda_mllt+sat+fmllr
hub4_spanish/s5	Spanish; speech recognition	lm+mfcc+cmvn+mono+delta+delta+delta+lda_mllt+sat+fmllr
iam	Handwriting data; image recognition	omitted
iban	Iban (a language spoken in Malaysia); speech recognition	lm+mfcc+cmvn+mono+delta+lmrescore+delta+lmrescore+lda_mllt+lmrescore+sat+fmllr+ubm+sgmm+lmrescore (distinctive in that every decode is followed by lmrescore)
ifnenit	Handwriting data; image recognition	omitted
librispeech/s5	English	lm+mfcc+cmvn+mono+deltas+lmrescore+lda_mllt+lmrescore+sat+fmllr+tdnn (fairly complete apart from the lack of data augmentation)
lre/v1	----; language identification	mfcc+vad+ubm+vtln+ivector
lre07	----; language identification	v1:vtln+mfcc+ubm+ivector, v2:vtln+mfcc+ubm+ivector_dnn+dnn
madcat_ar, madcat_zh	Handwriting data; image text recognition	omitted
malach/s5	MALACH data; speech recognition	mfcc+cmvn+lda_mllt+sat+fmllr+tdnn
mandarin_bn_bc/s5	LDC; speech recognition	lm+mfcc+pitch+cmvn+mono+delta+lda_mllt+sat+fmllr+tdnn+tdnn_lstm
material/s5	Swahili; speech recognition	lm+mfcc+cmvn+mono+delta+lda_mllt+sat+fmllr+LM modification
mgb2_arabic/s5	MGB-2 corpus; speech recognition	lm+mfcc+cmvn+mono+delta+delta+lda_mllt+sat+fmllr+dnn
mgb5/s5	MGB-5 corpus	lm+mfcc+cmvn+mono+delta+delta+lda_mllt+sat+fmllr+sgmm+tdnn
mini_librispeech/s5	OpenSLR 31; speech recognition	lm+mfcc+cmvn+mono+delta+lda_mllt+sat+fmllr+lmrescore+tdnn
mobvoi/v1	Data provided by Mobvoi; speech recognition	data augmentation+mfcc+cmvn+tdnn
mobvoihotwords/v1	omitted; speech recognition	data augmentation+mfcc+cmvn+fmllr+tdnn
multi_cn/s5	Chinese (OpenSLR); speech recognition	lm+mfcc+pitch+cmvn+mono+delta+delta+lda_mllt+sat+fmllr+cnn_tdnn
multi_en/s5	English; speech recognition	lm+mfcc+cmvn+mono+delta+delta+delta+lda_mllt+fmllr+sat
ptb/s5	Penn Treebank corpus; language modeling	omitted
reverb/s5	----; speech recognition on reverberant speech	mfcc+cmvn+mono+delta+lda_mllt+sat+fmllr+tdnn
rimes/v1	French handwriting; image text recognition	omitted
rm/s5	Speech recognition (the example used in Dan's slides to explain the speech recognition pipeline)	mfcc+plp+cmvn+mono+delta+lda_mllt+denlats+mmi+mpe+sat+fmllr+ubm+mmi_fmmi+sgmm2+tdnn+tdnn_online_cmn
sitw	SITW data; speaker recognition in real-world conditions	v1:mfcc+vad+ubm+ivector+data augmentation+lda+plda, v2:mfcc+vad+data augmentation+xvector+lda+plda
snips/v1	Wake word; speech recognition	mfcc+cmvn+data augmentation+mfcc+cmvn+mono+fmllr+tdnn
spanish_dimex100/s5	Mexican Spanish; speech recognition	mfcc+cmvn+mono+delta+lda_mllt+denlats+mm
sprakbanken/s5	Danish; speech recognition	mfcc+cmvn+irstlm+mono+delta+delta+lda_mllt+sat+fmllr+tdnn_lstm
sprakbanken_swe/s5	Swedish; speech recognition	mfcc+cmvn+irstlm+mono+delta+delta+lda_mllt+sat+fmllr+local/sprak_run_nnet_cpu.sh
sre08/v1	LDC2011S05; speaker recognition	mfcc+vad+ubm+ivector+lda+plda
sre10	NIST SRE 2010; speaker recognition	v1:mfcc+vad+ubm+ivector+plda, v2:mfcc+vad+ubm+ivector_dnn+plda
sre16	NIST SRE 2016 enroll; speaker recognition	v1:mfcc+vad+ubm+ivector+data augmentation+mfcc+ivector+plda, v2:mfcc+vad+data augmentation+mfcc+cmvn+xvector+plda
svhn/v1	Street View House Numbers; image recognition	omitted
swahili/s5	Swahili speech corpus; speech recognition	mfcc+cmvn+mono+delta+lda_mllt+sat+fmllr+denlats+mmi+ubm+mmi_fmmi+ubm+sgmm+denlats_sgmm+mmi_sgmm
swbd	Switchboard corpus, Fisher corpus; speech recognition	s5:mfcc+cmvn+mono+delta+delta+lda_mllt+fmllr+sgmm+sat+fmllr+denlats+mmi+ubm+mmi_fmmi, s5b:mfcc+cmvn+mono+delta+delta+lda_mllt+fmllr+sat+fmllr+denlats+mmi+ubm+mmi_fmmi, s5c:mfcc+cmvn+mono+delta+delta+lda_mllt+fmllr+lmrescore+mmi+ubm+mmi_fmmi+lmrescore
tedlium	----; speech recognition	s5:mfcc+cmvn+mono+delta+lda_mllt+sat+fmllr+denlats+mmi+dnn, s5_r2:mfcc+cmvn+mono+delta+lmscore+lda_mllt+sat+fmllr+tdnn, s5_r2_wsj:mfcc+cmvn+mono+delta+lda_mllt+sat+fmllr, s5_r3:mfcc+cmvn+mono+delta+lda_mllt+sat+fmllr+tdnn
thchs30/s5	Chinese; speech recognition	mfcc+cmvn+lm+mono+delta+lda_mllt+sat+fmllr+quick+dnn
tidigits/s5	LDC93S10; English digit speech recognition	mfcc+cmvn+mono+delta
timit/s5	LDC93S1; speech recognition	mfcc+cmvn+mono+delta+lda_mllt+sat+fmllr+ubm+sgmm+mmi_sgmm+dnn
tunisian_msa/s5	Tunisian corpus; speech recognition	mfcc+cmvn+mono+lda_mllt+sat+fmllr+tdnn
uw3/v1	----; image recognition	omitted
voxceleb	VoxCeleb1 and VoxCeleb2 corpora; speaker recognition	v1:mfcc+vad+ubm+ivector+lda+plda, v2:mfcc+vad+data augmentation+cmvn+xvector+lda+plda
voxforge/s5	Free speech corpus available from VoxForge; speech recognition	mfcc+cmvn+mono+delta+delta+lda_mllt+denlats+mmi+mpe+sat+fmllr+ubm+mmi_fmmi+sgmm
vystadial_cz	Czech; speech recognition	s5:mfcc+cmvn+mono+delta+delta+lda_mllt+denlats+mmi, s5b:mfcc+cmvn+mono+delta+lda_mllt+sat+fmllr+tdnn
vystadial_en/s5	English; speech recognition	mfcc+cmvn+mono+delta+delta+lda_mllt+denlats+mmi+mpe
wsj/s5	Wall Street Journal data; speech recognition	mfcc+cmvn+mono+delta+lmrescore+lda_mllt+lmrescore+sat+fmllr+tdnn
yesno/s5	yesno data; speech recognition	mfcc+cmvn+mono
yomdle_fa, yomdle_korean, yomdle_russian, yomdle_tamil, yomdle_zh	OCR data; image recognition	omitted
zeroth_korean/s5	Korean; speech recognition	mfcc+cmvn+mono+delta+lmrescore+lda_mllt+sat+fmllr+rebuildlm+lmrescore+fmllr+sat+tdnn

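LM: language model
MFCC: Mel-frequency cepstral coefficients
CMVN: cepstral mean and variance normalization
Mono: monophone model training
Triphone: triphone model training; typically tri1: deltas; tri2: delta+delta-delta; tri3a: lda+mllt
GMM: Gaussian mixture model
HMM: hidden Markov model
SGMM: subspace Gaussian mixture model (subspace GMM), which can substantially reduce the number of GMM parameters
GMM-HMM: MFCC+Mono+Triphone
MLLT: maximum likelihood linear transform
CMLLR/fMLLR: constrained maximum likelihood linear regression / feature-space maximum likelihood linear regression; makes the features more robust to speaker variation
SAT: speaker adaptive training
VTLN: vocal tract length normalisation. Mainly used in speech recognition to remove differences in vocal tract length, for example between male and female speakers. HTK includes source code for it and the HTK Book describes it. It works by warping the centre frequencies of the Mel filterbank.
LDA: linear discriminant analysis
PLDA: probabilistic linear discriminant analysis
CE: cross-entropy, a frame-level training criterion (usually the default)
MMI/bMMI: maximum mutual information / boosted MMI, sentence-level discriminative training criteria; steps/train_mmi.sh
MPE: minimum phone error, a discriminative criterion that minimizes phone-level error; steps/train_mpe.sh
sMBR: state-level minimum Bayes risk, which minimizes state-level error
lattice: word lattice; used by lmrescore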
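To make the CMVN entry concrete, here is a minimal sketch in plain Python. It assumes the features are a list of frames, each a list of floats, and it normalizes over a single utterance for simplicity; the function name `apply_cmvn` and its `norm_vars` flag are illustrative, not Kaldi's API, and Kaldi recipes normally accumulate these statistics per speaker.

```python
import math

def apply_cmvn(frames, norm_vars=True):
    """Shift every feature dimension to zero mean and, optionally,
    scale it to unit variance over the given frames."""
    num_frames = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / num_frames for d in range(dim)]
    scales = [1.0] * dim
    if norm_vars:
        for d in range(dim):
            var = sum((f[d] - means[d]) ** 2 for f in frames) / num_frames
            # Leave constant dimensions unscaled to avoid dividing by zero.
            scales[d] = 1.0 / math.sqrt(var) if var > 0.0 else 1.0
    return [[(f[d] - means[d]) * scales[d] for d in range(dim)]
            for f in frames]

# Toy 3-frame, 2-dimensional "utterance" (made-up numbers).
feats = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
normed = apply_cmvn(feats)
```

After normalization each column of `normed` has mean 0 and variance 1, which removes per-recording offsets and scale differences before model training.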

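Script notes:

Script	Purpose
utils/subset_data_dir.sh	Selects a subset of the data; used to build a small initial model and then enlarge the training set step by step
steps/train_mono.sh	Monophone model training
steps/align.sh, steps/align_si.sh, steps/align_fmllr.sh	Forced alignment
steps/train_sat.sh	Speaker adaptive training, usually followed by fmllr alignment; the alignments before the first SAT pass come from si or fmllr alignment, and SAT is typically run for two rounds
steps/get_prons.sh	Computes pronunciation and silence probabilities from the training data and rebuilds the lang directory; see fisher_swbd/s5 for an example
steps/make_plp_pitch.sh	Extracts PLP and pitch features
steps/make_plp.sh	Extracts PLP features
utils/fix_data_dir.sh	Cleans up and makes the data directory consistent
steps/make_fbank.sh	Extracts fbank features; usually used together with local/nnet/run_dnn_fbank.sh
steps/make_mfcc.sh	Extracts MFCC features; MFCCs discard some information relative to fbank
steps/compute_cmvn_stats.sh	Computes CMVN (cepstral mean and variance normalization) statistics; used for speech recognition
local/train_irstlm.sh	Builds an LM with the IRSTLM toolkit
local/nnet3/xvector/prepare_feats.sh	CMVN (cepstral normalization); used for speaker recognition
steps/align_fmllr.sh	fMLLR alignment
steps/train_mmi.sh	MMI discriminative training (sentence-level criterion)
steps/train_mpe.sh	MPE (minimum phone error) discriminative training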
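sid/train_diag_ubm.sh, sid/train_full_ubm.sh, steps/train_ubm.sh	UBM training
steps/train_sgmm2.sh, steps/align_sgmm2.sh, steps/make_denlats_sgmm2.sh	SGMM training
sid/compute_vad_decision_gmm.sh	Computes GMM-based VAD decisions
sid/compute_vad_decision.sh	Uses frame energy to select the speech portions of the audio
local/run_lmrescore.sh	Rescores lattices with an RNN language model
local/run_wpe.sh, local/run_beamformit.sh	Microphone-array processing (WPE dereverberation and BeamformIt beamforming) used as front-end enhancement; the code is in chime5/s5b/run.sh, which also contains code for adding noise and reverberation
steps/data/reverberate_data_dir.py, steps/data/augment_data_dir.py	Add noise and reverberation; used for data augmentation
chime6/s5_track2/local/train_diarizer.sh	Trains the xvector DNN
local/vtln.sh	Removes vocal-tract-length differences between male and female speakers
local/chain/run_tdnnf.sh, local/chain/run_tdnn.sh	TDNN training scripts; compared with TDNN, TDNN-F inserts a lower-dimensional bottleneck layer between each pair of layers
local/nnet3/run_tdnn.sh	nnet3 TDNN
local/chain/run_tdnn_1g.sh	Similar to tdnn_1f with some adjustments; see fisher_callhome_spanish/s5 for an example
steps/train_deltas.sh	Usually used for tri1, sometimes also for tri2 and tri3
steps/train_lda_mllt.sh	LDA+MLLT; usually used for tri2, tri3, tri2b, tri3b, depending on how the recipe author names the stages
steps/train_quick.sh	Trains a model on top of the existing features (without any feature-space learning)
local/run_sgmm2.sh	SGMM training
local/nnet/run_dnn.sh	DNN training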
local/online/run_nnet2_ms.sh
local/csj_run_rnnlm.sh	Trains an RNNLM for rescoring Japanese
diarization/vad_to_segments.sh	Turns VAD decisions on the audio into segments
diarization/score_plda.sh, diarization/cluster.sh	PLDA scoring, then clustering on the scores to merge segments from the same speaker; generally used when speaker ids are not known
local/nnet3/xvector/prepare_feats_for_egs.sh, local/nnet3/xvector/run_xvector.sh, sid/nnet3/xvector/extract_xvectors.sh	CMVN and xvector extraction
ivector-mean, ivector-compute-lda, ivector-compute-plda	LDA and PLDA training
ivector-plda-scoring	PLDA scoring
sid/train_diag_ubm.sh, sid/train_full_ubm.sh, sid/train_ivector_extractor.sh	Standard ivector extraction; see fame/v1 for an example
sid/init_full_ubm_from_dnn.sh, sid/train_ivector_extractor_dnn.sh, sid/extract_ivectors_dnn.sh	DNN-based ivector extraction; see fame/v2 for an example
copy-feats	Views ark files; commonly used when merging files
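The "deltas" stage that appears in most recipes (steps/train_deltas.sh) trains on features with first- and second-order time derivatives appended. The sketch below uses the common regression formula with a window of 2 frames and edge frames repeated at the boundaries. The function names are illustrative, and the values at the utterance edges may differ slightly from what Kaldi's own tools produce, so treat it as an explanation of the idea, not a drop-in replacement.

```python
def compute_deltas(frames, window=2):
    """Regression-based delta: sum_n n*(c[t+n] - c[t-n]) / (2*sum_n n^2),
    clamping indices at the start and end of the utterance."""
    num_frames = len(frames)
    dim = len(frames[0])
    denom = 2.0 * sum(n * n for n in range(1, window + 1))
    deltas = []
    for t in range(num_frames):
        row = []
        for d in range(dim):
            acc = 0.0
            for n in range(1, window + 1):
                ahead = frames[min(t + n, num_frames - 1)][d]
                behind = frames[max(t - n, 0)][d]
                acc += n * (ahead - behind)
            row.append(acc / denom)
        deltas.append(row)
    return deltas

def add_deltas(frames, window=2):
    """Append delta and delta-delta features to every frame."""
    d1 = compute_deltas(frames, window)
    d2 = compute_deltas(d1, window)
    return [f + a + b for f, a, b in zip(frames, d1, d2)]

# A 1-dimensional ramp: away from the edges its slope is 1 per frame.
ramp = [[float(t)] for t in range(7)]
with_deltas = add_deltas(ramp)
```

With 13-dimensional MFCCs as input, `add_deltas` would give 39-dimensional frames, the usual input of a delta-based triphone stage.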

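Summary

Speech recognition

Data augmentation: adding noise, adding music, adding reverberation, speed perturbation, SpecAugment
Feature extraction: MFCC, pitch, CMVN, fbank, ubm
ASR training: mono+triphone+tdnn, where the triphone stage varies (deltas, LDA, MLLT, fMLLR, SGMM, etc.) and tdnn may be replaced by another network
Training criteria: CE, MMI/bMMI, MPE, sMBR
LM: decode with a smaller LM first, then rescore with an RNNLM (mainly to save time); the full LM can also be used directly, but that is slower.
ASR: the data is usually split into train (training set), dev (development set) and test (test set), and hyperparameters are tuned on the dev results. In addition, train is often split into several subsets so that the amount of data and the number of model parameters grow as training proceeds.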
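The train/dev/test split and the growing training subsets described above can be sketched as follows. This is an illustration under stated assumptions, not what any particular recipe does: `utt2spk` is a plain dict from utterance id to speaker id, the split keeps speakers disjoint across the three sets (an assumption added here), and the function names, fractions and subset sizes are made up.

```python
import random

def split_by_speaker(utt2spk, dev_frac=0.1, test_frac=0.1, seed=0):
    """Assign whole speakers to dev and test; the rest go to train."""
    speakers = sorted(set(utt2spk.values()))
    random.Random(seed).shuffle(speakers)
    n_dev = max(1, int(len(speakers) * dev_frac))
    n_test = max(1, int(len(speakers) * test_frac))
    dev_spk = set(speakers[:n_dev])
    test_spk = set(speakers[n_dev:n_dev + n_test])
    split = {"train": [], "dev": [], "test": []}
    for utt in sorted(utt2spk):
        spk = utt2spk[utt]
        if spk in dev_spk:
            split["dev"].append(utt)
        elif spk in test_spk:
            split["test"].append(utt)
        else:
            split["train"].append(utt)
    return split

def growing_subsets(train_utts, sizes, seed=0):
    """Nested random subsets of train, one per training stage."""
    shuffled = list(train_utts)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[:min(n, len(shuffled))] for n in sizes]

# Toy data: 10 speakers with 5 utterances each.
utt2spk = {f"spk{s:02d}-utt{u}": f"spk{s:02d}"
           for s in range(10) for u in range(5)}
split = split_by_speaker(utt2spk)
stages = growing_subsets(split["train"], sizes=[10, 20, 100])
```

A small first subset would feed the monophone stage and the larger ones the later triphone stages; in a real recipe the subsets are data directories created with utils/subset_data_dir.sh.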

Speaker recognition

If there are no segments, run VAD first to remove the silent portions
Feature extraction: ivector, xvector
Training: ubm, lda/plda, cluster
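The VAD step mentioned above can be illustrated with a deliberately simplified energy threshold. The sketch assumes one log-energy value per frame and a 10 ms frame shift; the function names and the fixed threshold are hypothetical, and Kaldi's own energy VAD applies additional smoothing that is not reproduced here.

```python
def energy_vad(log_energies, threshold):
    """Mark a frame as speech when its log-energy exceeds the threshold."""
    return [e > threshold for e in log_energies]

def decisions_to_segments(decisions, frame_shift=0.01):
    """Merge consecutive speech frames into (start, end) times in seconds."""
    segments = []
    start = None
    for t, is_speech in enumerate(decisions):
        if is_speech and start is None:
            start = t                      # a speech run begins
        elif not is_speech and start is not None:
            segments.append((start * frame_shift, t * frame_shift))
            start = None                   # the speech run ended
    if start is not None:                  # speech lasts until the final frame
        segments.append((start * frame_shift, len(decisions) * frame_shift))
    return segments

# Made-up log-energies: two quiet frames, three loud, one quiet, two loud.
energies = [1.0, 1.0, 9.0, 9.0, 9.0, 1.0, 9.0, 9.0]
segments = decisions_to_segments(energy_vad(energies, threshold=5.0))
```

The resulting segments play the role of a segments file: only the speech regions are passed on to ivector or xvector extraction.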
