kaldi Mandarin speech recognition: THCHS-30 model-training code and configuration parameters explained

Posted by bbzz2 on 2017-06-06

Monophone

Training the monophone model
  # Flat start and monophone training, with delta-delta features.
  # This script applies cepstral mean normalization (per speaker).
  # monophone: train the monophone model
  steps/train_mono.sh --boost-silence 1.25 --nj $n --cmd "$train_cmd" data/mfcc/train data/lang exp/mono || exit 1;
  # test the monophone model
  local/thchs-30_decode.sh --mono true --nj $n "steps/decode.sh" exp/mono data/mfcc &

Usage of train_mono.sh:
  echo "Usage: steps/train_mono.sh [options] <data-dir> <lang-dir> <exp-dir>"
  echo " e.g.: steps/train_mono.sh data/train.1k data/lang exp/mono"
  echo "main options (for others, see top of script file)"
The parameter settings below train the baseline monophone HMM for 40 iterations, realigning the data at each iteration listed in realign_iters:
  # Begin configuration section.
  nj=4
  cmd=run.pl
  scale_opts="--transition-scale=1.0 --acoustic-scale=0.1 --self-loop-scale=0.1"
  num_iters=40      # Number of iterations of training
  max_iter_inc=30   # Last iter to increase #Gauss on.
  totgauss=1000     # Target #Gaussians.
  careful=false
  boost_silence=1.0 # Factor by which to boost silence likelihoods in alignment
  realign_iters="1 2 3 4 5 6 7 8 9 10 12 14 16 18 20 23 26 29 32 35 38";
  config=           # name of config file.
  stage=-4
  power=0.25        # exponent to determine number of gaussians from occurrence counts
  norm_vars=false   # deprecated, prefer --cmvn-opts "--norm-vars=false"
  cmvn_opts=        # can be used to add extra options to cmvn.
  # End configuration section.
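To see how num_iters, max_iter_inc and totgauss interact, here is a minimal sketch (not the Kaldi code itself; the initial Gaussian count of 120 is an assumed value, since the real one comes from gmm-init-mono and depends on the data): the model grows by a fixed increment per iteration until iteration max_iter_inc, approaching totgauss.

```shell
numgauss=120      # assumed initial #Gauss (in Kaldi this is read from 0.mdl)
totgauss=1000     # target #Gaussians (script default)
max_iter_inc=30   # last iteration on which to increase #Gauss
num_iters=40

# Fixed per-iteration increment, integer arithmetic as in shell.
incgauss=$(( (totgauss - numgauss) / max_iter_inc ))
for x in $(seq 1 $num_iters); do
  # (EM training, and realignment on realign_iters iterations, would happen here)
  if [ $x -le $max_iter_inc ]; then
    numgauss=$(( numgauss + incgauss ))
  fi
done
echo "final #Gauss: $numgauss"
```

With these numbers the increment is 29 per iteration, so the final count lands just under the target (990) because of integer division.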

thchs-30_decode.sh tests the monophone model. Internally it uses mkgraph.sh to build the full recognition network and output a finite-state transducer, and finally runs decode.sh, which takes the language model and the test data as input and computes the WER.
  #decode word
  utils/mkgraph.sh $opt data/graph/lang $srcdir $srcdir/graph_word || exit 1;
  $decoder --cmd "$decode_cmd" --nj $nj $srcdir/graph_word $datadir/test $srcdir/decode_test_word || exit 1
  #decode phone
  utils/mkgraph.sh $opt data/graph_phone/lang $srcdir $srcdir/graph_phone || exit 1;
  $decoder --cmd "$decode_cmd" --nj $nj $srcdir/graph_phone $datadir/test_phone $srcdir/decode_test_phone || exit 1
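The WER that the scoring step reports is the standard edit-distance rate: (substitutions + deletions + insertions) divided by the number of reference words. A toy calculation with made-up error counts:

```shell
# Illustrative numbers only; in practice these come from aligning the
# decoded hypotheses against the reference transcripts.
sub=7; del=2; ins=1; ref_words=100
wer=$(( (sub + del + ins) * 100 / ref_words ))
echo "WER: ${wer}%"
```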

align_si.sh aligns the given data with the given model. It is usually run before training a new model, taking the previous model as input and writing the alignments to <align-dir>.
  #monophone_ali
  steps/align_si.sh --boost-silence 1.25 --nj $n --cmd "$train_cmd" data/mfcc/train data/lang exp/mono exp/mono_ali || exit 1;
  # Computes training alignments using a model with delta or
  # LDA+MLLT features.
  # If you supply the "--use-graphs true" option, it will use the training
  # graphs from the source directory (where the model is). In this
  # case the number of jobs must match with the source directory.
  echo "usage: steps/align_si.sh <data-dir> <lang-dir> <src-dir> <align-dir>"
  echo "e.g.: steps/align_si.sh data/train data/lang exp/tri1 exp/tri1_ali"
  echo "main options (for others, see top of script file)"
  echo " --config <config-file> # config containing options"
  echo " --nj <nj> # number of parallel jobs"
  echo " --use-graphs true # use graphs in src-dir"
  echo " --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs."

Triphone

Train context-dependent triphone models, using the monophone model as input.
  #triphone
  steps/train_deltas.sh --boost-silence 1.25 --cmd "$train_cmd" 2000 10000 data/mfcc/train data/lang exp/mono_ali exp/tri1 || exit 1;
  #test tri1 model
  local/thchs-30_decode.sh --nj $n "steps/decode.sh" exp/tri1 data/mfcc &

The relevant configuration in train_deltas.sh is shown below; its positional inputs are the number of tree leaves (2000 above) and the total number of Gaussians (10000 above):
  # Begin configuration.
  stage=-4 # This allows restarting partway, when something went wrong.
  config=
  cmd=run.pl
  scale_opts="--transition-scale=1.0 --acoustic-scale=0.1 --self-loop-scale=0.1"
  realign_iters="10 20 30";
  num_iters=35      # Number of iterations of training
  max_iter_inc=25   # Last iter to increase #Gauss on.
  beam=10
  careful=false
  retry_beam=40
  boost_silence=1.0 # Factor by which to boost silence likelihoods in alignment
  power=0.25        # Exponent for number of gaussians according to occurrence counts
  cluster_thresh=-1 # for build-tree: control final bottom-up clustering of leaves
  norm_vars=false   # deprecated. Prefer --cmvn-opts "--norm-vars=true"
                    # or the option --cmvn-opts "--norm-means=false"
  cmvn_opts=
  delta_opts=
  context_opts=     # use "--context-width=5 --central-position=2" for quinphone
  # End configuration.
  echo "Usage: steps/train_deltas.sh <num-leaves> <tot-gauss> <data-dir> <lang-dir> <alignment-dir> <exp-dir>"
  echo "e.g.: steps/train_deltas.sh 2000 10000 data/train_si84_half data/lang exp/mono_ali exp/tri1"

LDA_MLLT

Transform the features with LDA and MLLT, and train triphone models on top of the LDA+MLLT features.

LDA+MLLT refers to the way we transform the features after computing the MFCCs: we splice across several frames, reduce the dimension (to 40 by default) using Linear Discriminant Analysis (LDA), and then later estimate, over multiple iterations, a diagonalizing transform known as MLLT or STC.

詳情可參考 http://kaldi-asr.org/doc/transform.html


  #triphone_ali
  steps/align_si.sh --nj $n --cmd "$train_cmd" data/mfcc/train data/lang exp/tri1 exp/tri1_ali || exit 1;
  #lda_mllt
  steps/train_lda_mllt.sh --cmd "$train_cmd" --splice-opts "--left-context=3 --right-context=3" 2500 15000 data/mfcc/train data/lang exp/tri1_ali exp/tri2b || exit 1;
  #test tri2b model
  local/thchs-30_decode.sh --nj $n "steps/decode.sh" exp/tri2b data/mfcc &
The configuration in train_lda_mllt.sh:
  # Begin configuration.
  cmd=run.pl
  config=
  stage=-5
  scale_opts="--transition-scale=1.0 --acoustic-scale=0.1 --self-loop-scale=0.1"
  realign_iters="10 20 30";
  mllt_iters="2 4 6 12";
  num_iters=35      # Number of iterations of training
  max_iter_inc=25   # Last iter to increase #Gauss on.
  dim=40
  beam=10
  retry_beam=40
  careful=false
  boost_silence=1.0 # Factor by which to boost silence likelihoods in alignment
  power=0.25        # Exponent for number of gaussians according to occurrence counts
  randprune=4.0     # This is approximately the ratio by which we will speed up the
                    # LDA and MLLT calculations via randomized pruning.
  splice_opts=
  cluster_thresh=-1 # for build-tree: control final bottom-up clustering of leaves
  norm_vars=false   # deprecated. Prefer --cmvn-opts "--norm-vars=false"
  cmvn_opts=
  context_opts=     # use "--context-width=5 --central-position=2" for quinphone.
  # End configuration.
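The interaction between --splice-opts and dim can be seen with a little arithmetic. Assuming the usual 13-dimensional MFCC features (the typical Kaldi default), splicing 3 frames of left context and 3 of right context yields a 7-frame window, and LDA then projects it down to dim=40:

```shell
mfcc_dim=13  # assumed base feature dimension
left=3; right=3
dim=40       # LDA output dimension (script default)

spliced=$(( mfcc_dim * (left + 1 + right) ))
echo "spliced dim: $spliced -> LDA output dim: $dim"
```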

SAT

Perform speaker-adaptive training using feature-space Maximum Likelihood Linear Regression (fMLLR).
This does Speaker Adapted Training (SAT), i.e. train on fMLLR-adapted features.  It can be done on top of either LDA+MLLT, or delta and delta-delta features.  If there are no transforms supplied in the alignment directory, it will estimate transforms itself before building the tree (and in any case, it estimates transforms a number of times during training).
  #lda_mllt_ali
  steps/align_si.sh --nj $n --cmd "$train_cmd" --use-graphs true data/mfcc/train data/lang exp/tri2b exp/tri2b_ali || exit 1;
  #sat
  steps/train_sat.sh --cmd "$train_cmd" 2500 15000 data/mfcc/train data/lang exp/tri2b_ali exp/tri3b || exit 1;
  #test tri3b model
  local/thchs-30_decode.sh --nj $n "steps/decode_fmllr.sh" exp/tri3b data/mfcc &
The configuration of train_sat.sh:
  # Begin configuration section.
  stage=-5
  exit_stage=-100   # you can use this to require it to exit at the
                    # beginning of a specific stage. Not all values are
                    # supported.
  fmllr_update_type=full
  cmd=run.pl
  scale_opts="--transition-scale=1.0 --acoustic-scale=0.1 --self-loop-scale=0.1"
  beam=10
  retry_beam=40
  careful=false
  boost_silence=1.0 # Factor by which to boost silence likelihoods in alignment
  context_opts=     # e.g. set this to "--context-width 5 --central-position 2" for quinphone.
  realign_iters="10 20 30";
  fmllr_iters="2 4 6 12";
  silence_weight=0.0 # Weight on silence in fMLLR estimation.
  num_iters=35      # Number of iterations of training
  max_iter_inc=25   # Last iter to increase #Gauss on.
  power=0.2         # Exponent for number of gaussians according to occurrence counts
  cluster_thresh=-1 # for build-tree: control final bottom-up clustering of leaves
  phone_map=
  train_tree=true
  tree_stats_opts=
  cluster_phones_opts=
  compile_questions_opts=
  # End configuration section.
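The silence_weight=0.0 setting means silence frames contribute nothing to the fMLLR statistics, so the per-speaker transform is estimated from speech frames only. A toy sketch of that weighting (the frame/posterior stream below is made up; in Kaldi the actual weighting is done by weight-silence-post on per-frame posteriors):

```shell
silence_weight=0
frames="sil:1 a:1 b:1 sil:1 c:1"   # toy phone:posterior pairs for 5 frames

total=0
for f in $frames; do
  phone=${f%%:*}; post=${f##*:}
  # Scale the posterior of silence frames by silence_weight.
  if [ "$phone" = "sil" ]; then post=$(( post * silence_weight )); fi
  total=$(( total + post ))
done
echo "effective frames in fMLLR stats: $total"
```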

decode_fmllr.sh: decodes with the speaker-adapted model.
Decoding script that does fMLLR.  This can be on top of delta+delta-delta, or LDA+MLLT features.
  # There are 3 models involved potentially in this script,
  # and for a standard, speaker-independent system they will all be the same.
  # The "alignment model" is for the 1st-pass decoding and to get the
  # Gaussian-level alignments for the "adaptation model" the first time we
  # do fMLLR. The "adaptation model" is used to estimate fMLLR transforms
  # and to generate state-level lattices. The lattices are then rescored
  # with the "final model".
  #
  # The following table explains where we get these 3 models from.
  # Note: $srcdir is one level up from the decoding directory.
  #
  #   Model               Default source:
  #
  #   "alignment model"   $srcdir/final.alimdl           --alignment-model <model>
  #                       (or $srcdir/final.mdl if alimdl absent)
  #   "adaptation model"  $srcdir/final.mdl              --adapt-model <model>
  #   "final model"       $srcdir/final.mdl              --final-model <model>
Quick
Train a model on top of existing features (no feature-space learning of any kind is done). This script initializes the model (i.e., the GMMs) from the previous system's model. That is: for each state in the current model (after tree building), it chooses the closest state in the old model, judging the similarity by the overlap of counts in the tree stats.
  #sat_ali
  steps/align_fmllr.sh --nj $n --cmd "$train_cmd" data/mfcc/train data/lang exp/tri3b exp/tri3b_ali || exit 1;
  #quick
  steps/train_quick.sh --cmd "$train_cmd" 4200 40000 data/mfcc/train data/lang exp/tri3b_ali exp/tri4b || exit 1;
  #test tri4b model
  local/thchs-30_decode.sh --nj $n "steps/decode_fmllr.sh" exp/tri4b data/mfcc &
The configuration of train_quick.sh:
  # Begin configuration..
  cmd=run.pl
  scale_opts="--transition-scale=1.0 --acoustic-scale=0.1 --self-loop-scale=0.1"
  realign_iters="10 15"; # Only realign twice.
  num_iters=20      # Number of iterations of training
  maxiterinc=15     # Last iter to increase #Gauss on.
  batch_size=750    # batch size to use while compiling graphs... memory/speed tradeoff.
  beam=10           # alignment beam.
  retry_beam=40
  stage=-5
  cluster_thresh=-1 # for build-tree: control final bottom-up clustering of leaves
  # End configuration section.
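The "closest state by overlap of tree-stats counts" initialization described above amounts to an argmax over the old states for each new leaf. A toy sketch with made-up overlap counts (the state names and numbers are illustrative, not from any real model):

```shell
# For one new tree leaf, pick the old-model state with the largest
# overlap count in the tree statistics.
best_state=""; best_overlap=-1
for pair in old1:40 old2:75 old3:10; do   # old_state:overlap_count
  state=${pair%%:*}; overlap=${pair##*:}
  if [ "$overlap" -gt "$best_overlap" ]; then
    best_overlap=$overlap; best_state=$state
  fi
done
echo "new leaf initialized from $best_state (overlap $best_overlap)"
```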