Reading the AISHELL-2 Scripts in Kaldi

2018-12-06  氢离子游离

Preface

The utils and steps directories contain the shared scripts that implement the common pipeline.

Dataset overview

AISHELL-2 is by far the largest free speech corpus available for Mandarin ASR research.

1. DATA

Training data: 1000 hours of clean read speech, recorded over the iOS channel.

Evaluation data: currently AISHELL2-2018A-EVAL is released, containing a development set and a test set. Both sets are available across the three channel conditions (iOS, Android, Mic).

2. RECIPE

Based on the Kaldi standard system, AISHELL-2 provides a self-contained Mandarin ASR recipe covering data preparation, GMM-HMM training, and chain-model training.

Script outline

run.sh walkthrough

1. Data preparation

local/prepare_all.sh ${trn_set} ${dev_set} ${tst_set} || exit 1;

input: trn_set / dev_set / tst_set
The three arguments are the paths to the training, development, and test sets; this step runs data preparation for each of the three sets.
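A hedged example of how the three variables might be set before this call (the corpus paths are purely illustrative, not from the recipe):

# illustrative locations of the raw AISHELL-2 partitions
trn_set=/data/AISHELL-2/iOS/data
dev_set=/data/AISHELL-2/iOS/dev
tst_set=/data/AISHELL-2/iOS/test
local/prepare_all.sh ${trn_set} ${dev_set} ${tst_set} || exit 1;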

2. GMM-HMM training

local/run_gmm.sh --nj $nj --stage $gmm_stage

3. Chain model training

local/chain/run_tdnn.sh --nj $nj

4. Decoding results

local/show_results.sh

The style of AISHELL-2's run.sh differs from WSJ's: run.sh only exposes the three main framework stages (data preparation, GMM-HMM training, and chain training) plus result display, each delegated to a script under local/.

prepare_all.sh walkthrough

1. Dictionary preparation

Downloads the raw DaCiDian (《大辞典》) data from GitHub and converts it into Kaldi's dictionary format.

local/prepare_dict.sh data/local/dict || exit 1

2. Generate wav.scp, text (word-segmented), utt2spk, spk2utt

 local/prepare_data.sh ${trn_set} data/local/dict data/local/train data/train || exit 1;
 local/prepare_data.sh ${dev_set} data/local/dict data/local/dev   data/dev   || exit 1;
 local/prepare_data.sh ${tst_set} data/local/dict data/local/test  data/test  || exit 1;

(The following sections use a single dataset as the example.)
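For reference, the generated files follow Kaldi's standard data-directory format; the lines below are a hedged illustration (all IDs and paths are made up):

# wav.scp : <utterance-id> <wav-path>
spk001_utt001 /path/to/AISHELL-2/wav/spk001/utt001.wav
# text    : <utterance-id> <word-segmented transcript>
spk001_utt001 厨房 用具
# utt2spk : <utterance-id> <speaker-id>
spk001_utt001 spk001
# spk2utt : <speaker-id> <utterance-id> <utterance-id> ...
spk001 spk001_utt001 spk001_utt002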

3. Generate the lexicon FST (L.fst)

utils/prepare_lang.sh --position-dependent-phones false \
  data/local/dict "<UNK>" data/local/lang data/lang || exit 1;
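prepare_lang.sh reads lexicon.txt (among other files) from the dict directory; a hedged illustration of its format, with made-up entries in DaCiDian's pinyin-style phone set:

# lexicon.txt : <word> <phone sequence>
<UNK> SPN
你好 n i3 h ao3
北京 b ei3 j ing1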

4. Language model preparation

local/train_lms.sh \
     data/local/dict/lexicon.txt data/local/train/text data/local/lm || exit 1;
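train_lms.sh builds a trigram LM over the training transcripts; the next step reads it from data/local/lm/3gram-mincount/lm_unpruned.gz. The result is a gzipped ARPA-format file; an abbreviated illustration of the format (all counts and probabilities made up):

\data\
ngram 1=137000
ngram 2=4200000
ngram 3=6800000

\1-grams:
-4.1 北京 -0.6
...

\end\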

5. Generate the grammar FST (G.fst)

This step converts the language model into G.fst, so that it can later be composed with the L.fst built above, exploiting the strengths of the FST framework.

utils/format_lm.sh data/lang data/local/lm/3gram-mincount/lm_unpruned.gz \
    data/local/dict/lexicon.txt data/lang_test || exit 1;
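Conceptually, the later graph-building step composes the lexicon and grammar FSTs; a simplified sketch of what utils/mkgraph.sh does internally with Kaldi's fstbin tools (the real script adds disambiguation-symbol handling and more):

# compose L and G, then determinize and minimize (simplified)
fsttablecompose data/lang_test/L_disambig.fst data/lang_test/G.fst | \
  fstdeterminizestar --use-log=true | \
  fstminimizeencoded > LG.fst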

local/run_gmm.sh walkthrough

1. Feature extraction: MFCC & CMVN

mfccdir should be some place with a largish disk where you want to store MFCC features.

1.1 MFCC extraction

Combine MFCC and pitch features together
Note: This file is based on make_mfcc.sh and make_pitch_kaldi.sh

steps/make_mfcc_pitch.sh --pitch-config conf/pitch.conf --cmd "$train_cmd" --nj $nj \
      data/$x exp/make_mfcc/$x mfcc || exit 1;

1.2 CMVN statistics computation

steps/compute_cmvn_stats.sh data/$x exp/make_mfcc/$x mfcc || exit 1;

Note: the CMVN statistics do land in the same mfcc directory as the features, because the same archive directory (mfcc) is passed as the storage argument; they are written as cmvn_*.ark with a corresponding cmvn.scp.
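Two hedged sanity checks on the resulting files (standard Kaldi binaries; with make_mfcc_pitch.sh the expected dimension is 13 MFCC + 3 pitch = 16):

# print the feature dimension
feat-to-dim scp:data/train/feats.scp -
# dump the per-speaker CMVN statistics as readable text
copy-matrix scp:data/train/cmvn.scp ark,t:- | head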

1.3 Data check

This script makes sure that only the segments present in all of "feats.scp", "wav.scp" [if present], segments [if present], text, and utt2spk are present in any of them. It puts the original contents of data-dir into data-dir/.backup.

utils/fix_data_dir.sh data/$x

2. Split off training subsets

Subset the training data for fast startup. Here ${x} ranges over several subset sizes (the later stages use data/train_100k and data/train_300k), so the early training stages can run on smaller nested subsets.

utils/subset_data_dir.sh data/train ${x}000 data/train_${x}k
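A hedged sketch of how this expands when wrapped in the recipe's loop (the exact sizes may differ):

# create nested training subsets, e.g. 100k and 300k utterances
for x in 100 300; do
  utils/subset_data_dir.sh data/train ${x}000 data/train_${x}k
done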

3. Monophone model training

3.1 Training

The main outputs are final.mdl and tree. The core training flow is iterative: re-align the data, accumulate GMM/HMM statistics, then update the model parameters.

steps/train_mono.sh --cmd "$train_cmd" --nj $nj \
    data/train_100k data/lang exp/mono || exit 1;
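A simplified, hedged sketch of one inner iteration of steps/train_mono.sh (single job, beams and options omitted; the real script also scales transition/acoustic weights):

# features: apply per-speaker CMVN, then add delta features
feats="ark,s,cs:apply-cmvn --utt2spk=ark:data/train_100k/utt2spk \
  scp:data/train_100k/cmvn.scp scp:data/train_100k/feats.scp ark:- | \
  add-deltas ark:- ark:- |"
iter=5; dir=exp/mono
# 1) re-align against the current model, using precompiled training graphs
gmm-align-compiled $dir/$iter.mdl "ark:gunzip -c $dir/fsts.1.gz|" \
  "$feats" "ark:|gzip -c >$dir/ali.1.gz"
# 2) accumulate GMM/HMM sufficient statistics from the alignments
gmm-acc-stats-ali $dir/$iter.mdl "$feats" \
  "ark:gunzip -c $dir/ali.1.gz|" $dir/$iter.acc
# 3) re-estimate the model parameters
gmm-est $dir/$iter.mdl $dir/$iter.acc $dir/$((iter+1)).mdl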

3.2 Decoding

Decode the test sets with the freshly trained model and compute accuracy and related metrics.

3.2.1 Build the decoding graph

Builds a fully expanded decoding graph (HCLG.fst) that encodes the language model, the lexicon, context dependency, and the HMM structure.
The script's original comment follows:
This script creates a fully expanded decoding graph (HCLG) that represents all the language-model, pronunciation dictionary (lexicon), context-dependency,and HMM structure in our model. The output is a Finite State Transducer that has word-ids on the output, and pdf-ids on the input (these are indexes that resolve to Gaussian Mixture Models).
See
http://kaldi-asr.org/doc/graph_recipe_test.html
(this is compiled from this repository using Doxygen,
the source for this part is in src/doc/graph_recipe_test.dox)

utils/mkgraph.sh data/lang_test exp/mono exp/mono/graph || exit 1;
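A hedged sanity check on the compiled graph using an OpenFst tool:

# print basic properties (states, arcs, FST type) of the graph
fstinfo exp/mono/graph/HCLG.fst | head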

3.2.2 Decode the test sets

Decoding calls gmm-latgen-faster or gmm-latgen-faster-parallel and writes lattices to lat.JOB.gz.

# for dev
steps/decode.sh --cmd "$decode_cmd" --config conf/decode.conf --nj ${dev_nj} \
  exp/mono/graph data/dev exp/mono/decode_dev
# for test
steps/decode.sh --cmd "$decode_cmd" --config conf/decode.conf --nj ${test_nj} \
  exp/mono/graph data/test exp/mono/decode_test
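Scoring writes one result file per LM weight into each decode directory; a hedged way to pick the best score (Mandarin recipes typically score CER, so the files are cer_* rather than wer_*):

# show the best score over all LM-weight/penalty combinations
for d in exp/mono/decode_dev exp/mono/decode_test; do
  grep WER $d/cer_* | utils/best_wer.sh
done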


3.3 Alignment

Force-aligns the training data with the trained model, for use by later training stages.

steps/align_si.sh --cmd "$train_cmd" --nj $nj \
    data/train_300k data/lang exp/mono exp/mono_ali || exit 1;
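The alignments end up in exp/mono_ali/ali.JOB.gz; a hedged way to inspect them:

# dump the alignments as a per-phone CTM (utt, channel, start, duration, phone-id)
ali-to-phones --ctm-output exp/mono_ali/final.mdl \
  "ark:gunzip -c exp/mono_ali/ali.1.gz|" - | head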

4. Triphone model training

4.1 Training

4.2 Decoding

4.3 Alignment

These three stages follow the same pattern as the monophone stage (train, then build a graph and decode, then re-align), using triphone models; see the sketch after this list.
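A hedged sketch of what the three subsections typically look like in a Kaldi recipe (standard steps/ scripts; the leaf/Gaussian counts and subset sizes are illustrative, and the actual AISHELL-2 recipe may differ):

# 4.1 train a delta-feature triphone model on the monophone alignments
steps/train_deltas.sh --cmd "$train_cmd" 2500 20000 \
  data/train_300k data/lang exp/mono_ali exp/tri1 || exit 1;
# 4.2 build the graph and decode dev/test
utils/mkgraph.sh data/lang_test exp/tri1 exp/tri1/graph || exit 1;
steps/decode.sh --cmd "$decode_cmd" --config conf/decode.conf --nj ${dev_nj} \
  exp/tri1/graph data/dev exp/tri1/decode_dev
# 4.3 re-align with the new model for the next stage
steps/align_si.sh --cmd "$train_cmd" --nj $nj \
  data/train_300k data/lang exp/tri1 exp/tri1_ali || exit 1;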

local/chain/run_tdnn.sh walkthrough

I have not yet fully worked through Kaldi's chain models; this section will be filled in later.

Reference

Zhang Hanpei's blog (张涵沛的博客)
