1. First install HTK-3.4.1 and HDecode-3.4.1, then apply the HTS-2.3_for_HTK-3.4.1 patch.
Everything is available from this site: http://hts.sp.nitech.ac.jp/
a. The versions to download are:
HTK-3.4.1: http://htk.eng.cam.ac.uk/download.shtml
HDecode-3.4.1: http://htk.eng.cam.ac.uk/extensions/index.shtml
b. Unpack the archives:
$ tar -zxvf HTK-3.4.1.tar.gz
$ tar -zxvf HDecode-3.4.1.tar.gz
c. Then apply the downloaded patch HTS-2.3_for_HTK-3.4.1.patch:
$ cd htk
$ patch -p1 -d . < path/HTS-2.3_for_HTK-3.4.1.patch
d. Run the configure script:
$ ./configure
For more detail on the options, run ./configure --help
e. Build and install:
$ make
$ make install (prepend sudo if needed)
One part of the HTK build may fail with an error about ARCH being undefined; adding
#define ARCH "darwin" to esignal.c fixes it.
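For reference, the define goes near the top of the file ("darwin" matched the macOS build here; use a string for your platform, e.g. "linux"):

/* in HTKLib/esignal.c, after the existing #include lines */
#define ARCH "darwin"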
2. Install SPTK
SPTK-3.10:http://sp-tk.sourceforge.net/
The installation steps are:
a. Unpack the archive
b. cd into the unpacked directory
c. ./configure
d. make
e. make install
3. Install Tcl/Tk and Snack:
$ sudo apt-get install tcl tk libsnack2
4. Install the hts_engine API
hts_engine_API-1.10:http://hts-engine.sourceforge.net/
The installation steps are the same as for SPTK.
5. Install speech_tools. Download: http://www.cstr.ed.ac.uk/downloads/festival/2.4/
speech_tools and festival are hosted at the same address; speech_tools must be built first.
$ sudo apt-get install g++ (the build needs the g++ toolchain, so make sure it is installed; not necessarily required if it already is)
$ sudo apt-get install libncurses5-dev (this may fail to install directly; if so, use aptitude to install the dependencies first and then the package, as sketched below)
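The aptitude fallback I mean is roughly this (a sketch; it assumes aptitude itself installs cleanly):
$ sudo apt-get install aptitude
$ sudo aptitude install libncurses5-dev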
$ ./configure
$ make
$ make test (should report that the build succeeded)
6. Install festival
$ ./configure
$ make
$ make install (I don't remember whether this step is needed; try it)
$ make test (this reports errors until you download festvox_don, festlex_POSLEX, festlex_OALD, festlex_CMU, festvox_kallpc16k, and festvox_rablpc16k, and unpack them into the corresponding festival directories)
Even then some errors may remain, but the festival command itself runs fine, so I ignored them.
Download HTS-demo_CMU-ARCTIC-ADAPT.tar.bz2; for installation, see the install file inside.
After make, I found make install would not complete, so I followed http://www.cs.columbia.edu/~ecooper/tts/training.html instead.
$ perl path/scripts/Training.pl path/scripts/Config.pm > train.log 2> err.log (I ran it without the redirections; reportedly with them, if something fails you can tell at which step it stopped and pick up from there instead of starting over).
After training succeeds, the corresponding wav files are generated under path/HTS-demo_CMU-ARCTIC-SLT/gen/qst001.
Those are the steps, though I haven't fully worked out how the pieces fit together yet.
7. Download festvox: https://github.com/festvox/festvox (remember to run make)
8. Training data and the training procedure
Note: http://www.cs.columbia.edu/~ecooper/tts/data.html (the guide followed for the steps below)
1) Directory setup
Create an empty directory tree. The guide creates it with the command below, but I could not find the template folder it copies from, so I simply replicated the directory layout from the demo and filled it with my own files.
The command given in the guide is:
cp -r /proj/tts/hts-2.3/template_si_htsengine /path/to/yourvoicename
After creating the directory tree, cd into it: cd yourvoicename
Then, in scripts/Config.pm, set $prjdir to the top-level directory you created.
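For example (a sketch; the assignment sits near the top of Config.pm, and the path below is a placeholder):

$prjdir = '/path/to/yourvoicename';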
2) Prepare the HTS data
a. Prepare the .raw data
If your audio is in some .wav format other than 16k, use sox to convert it:
sox input.wav -r 16000 output.wav
sox can be installed with sudo apt install sox; if that fails, download and build it yourself from https://sourceforge.net/projects/sox/files/sox/.
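Since the wav files used later (the wav/ directory in step b) must be 16 kHz, 16-bit, mono, sox can enforce all three in one pass with its standard rate/channel/bit-depth flags:
$ sox input.wav -r 16000 -c 1 -b 16 output.wav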
Then convert the wav files to raw format:
ch_wave -c 0 -F 32000 -otype raw in.wav | x2x +sf | interpolate -p 2 -d | ds -s 43 | x2x +fs > out.raw
For some reason this did not work for me directly, even though the path was already in my environment variables,
so I switched into speech_tools/bin and ran
./ch_wave -c 0 -F 32000 -otype raw in.wav | x2x +sf | interpolate -p 2 -d | ds -s 43 | x2x +fs > out.raw
and that produced the raw files I wanted.
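With many files, the same pipeline can be run in a loop (a sketch assuming bash, ch_wave and the SPTK tools on PATH, input wavs in wav/, and an existing raw/ output directory):
$ for f in wav/*.wav; do ch_wave -c 0 -F 32000 -otype raw "$f" | x2x +sf | interpolate -p 2 -d | ds -s 43 | x2x +fs > "raw/$(basename "$f" .wav).raw"; done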
b. Create the utt files
Set up the environment variables first, for example:
export PATH=/proj/tts/tools/babel_scripts/build/festival/bin:$PATH
export PATH=/proj/tts/tools/babel_scripts/build/speech_tools/bin:$PATH
export FESTVOXDIR=/proj/tts/tools/babel_scripts/build/festvox
export ESTDIR=/proj/tts/tools/babel_scripts/build/speech_tools
The transcript file is named txt.done.data (just create a plain txt file and rename it). Each line follows the format below: the audio file name first, then the transcript. Also prepare a separate txt file for each audio clip; the guide does not require this, but the demo includes them, so I prepared them too. The file must not contain any blank lines.
( uniph_0001 "a whole joy was reaping." )
( uniph_0002 "but they've gone south." )
( uniph_0003 "you should fetch azure mike." )
Fullcontext labels using EHMM alignment
Prepare the directory:
$FESTVOXDIR/src/clustergen/setup_cg cmu us awb_arctic
cmu and awb_arctic can be changed; do not change the middle argument if you are working with English pronunciation.
Copy the wav files into the wav/ directory; they should be 16 kHz, 16-bit, mono, RIFF-format audio.
Run the following three commands:
./bin/do_build build_prompts
./bin/do_build label
./bin/do_build build_utts
If all three commands succeed, you will find the resulting files under festival/utts. (You need to make festvox/src/ehmm/src first, so that ConvertFeatsFileToBinaryFormat, edec, ehmm, FeatureExtraction, ScaleBinaryFeats, and so on are generated under festvox/src/ehmm/bin; without that, the second command above fails.)
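For reference, that build is just (assuming $FESTVOXDIR points at the festvox checkout from step 7):
$ cd $FESTVOXDIR/src/ehmm/src
$ make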
Getting alignment scores from EHMM
Go into $FESTVOXDIR/src/ehmm/src/ehmm.cc, find the function ProcessSentence, and add this line after lh is computed:
cout << "Utterance: " << tF << " LL: " << lh << endl;
Then go to the top-level $FESTVOXDIR directory and rebuild with make.
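That is:
$ cd $FESTVOXDIR
$ make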
c. If you want to synthesize specific sentences of your own, see: http://www.cs.columbia.edu/~ecooper/tts/gen_eng.html
Otherwise leave this alone and use the original Alice example.
Then place the generated files under the directory you created:
Place your .raw files in yourvoicename/data/raw.
Place your .utt files in yourvoicename/data/utts.
Place your gen labels in yourvoicename/data/labels/gen.
d. Steps to build the data
Under yourvoicename/data you will find a Makefile. All of the steps below should be executed from yourvoicename/data.
a. Edit LOWERF0 and UPPERF0 in the Makefile. The standard range is 75~600 Hz; you can fine-tune these to fit your speaker and get better results.
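For example (the variable names are the ones already in the demo Makefile; the values shown are the standard defaults):

# F0 search range in Hz; narrow both toward your speaker's range for better results
LOWERF0 = 75
UPPERF0 = 600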
b. Extract the per-utterance acoustic features used in the next step. Run:
make features
c. Combine the acoustic features extracted in the previous step into a single composite .cmp file per utterance. Run:
make cmp
d. Convert the utt files to HTS format. This step creates labels/full/*.lab, the fullcontext labels, and labels/mono/*.lab, which are monophone labels for each utterance. Run:
make lab
The fullcontext labels (full) contain phonemes in context as determined by the frontend. The monophone labels (mono) are just the phoneme sequence. Both formats give the start and end times of each phoneme in ten-millionths of a second, so to get times in seconds, add a decimal point seven digits from the end.
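For example, a hypothetical line from a mono label file

3250000 6150000 ih

says the phoneme "ih" starts at 0.3250000 s and ends at 0.6150000 s.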
e. These files are "Master Label Files," which can contain all of the information in the .lab files in one file, or can contain pointers to the individual .lab files. We will be creating .mlf files that are pointers to the .lab files. Run:
make mlf
f. This step creates full.list, full_all.list, and mono.list, which are lists of all of the unique labels. full.list contains all of the fullcontext labels in the training data, and full_all.list contains all of the training labels plus all of the gen labels. Run:
make list
** Note that make list as-is in the demo scripts relies on the cmp files already being there -- it checks that both a cmp and a lab file exist before adding the labels to the list. However, it does not use any of the information actually in the cmp file beyond checking that it exists.
g. This step creates the training and generation script files, train.scp and gen.scp: a list of the files you want to use to train the voice, and a list of the files from which you want to synthesize examples. Run:
make scp
If you only want to train on a subset, edit train.scp accordingly.
I left the question file unmodified.
3) With the above done, train the voice: http://www.cs.columbia.edu/~ecooper/tts/training.html
perl scripts/Training.pl scripts/Config.pm > train.log 2> err.log
As before, I simply ran it without the redirections:
perl path/to/scripts/Training.pl path/to/scripts/Config.pm
Since I used only 50 sentences, each 2~6 s long, training finished very quickly.
Finally, the test synthesis utterances can be found under gen/qst001/ver1. hts_engine contains utterances synthesized using hts_engine, and 1mix, 2mix, and stc contain synthesis using either SPTK or STRAIGHT.
The different speech waveform generation methods are as follows (from this thread):