Kaldi Chinese Speech Recognition, Based on thchs30 (3)

Picking up where we left off, let's continue reading run.sh.

#you can obtain the database by uncommting the following lines
#[ -d $thchs ] || mkdir -p $thchs || exit 1
#echo "downloading THCHS30 at $thchs ..."
#local/download_and_untar.sh $thchs http://www.openslr.org/resources/18 data_thchs30 || exit 1
#local/download_and_untar.sh $thchs http://www.openslr.org/resources/18 resource || exit 1
#local/download_and_untar.sh $thchs http://www.openslr.org/resources/18 test-noise || exit 1
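There is not much to say here: these lines download the thchs30 speech data packages and unpack them into the corresponding directory. They are commented out in the stock run.sh, meaning you only enable them if you want the script to do the download. We have already downloaded the data, so we leave them as they are.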

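Recall from last time that, because memory may run short, it is safer to run run.sh one step at a time. The place to start is this line in run.sh:
#data preparation
Everything after it is a shell command. I recommend running these commands one at a time, otherwise the run can stop partway with puzzling gaps and errors. How do you run them one at a time?
Comment out the rest with :<<! ... ! This pair works like /* */ in C, and the ... in the middle stands for the content being commented out.
That brings us to the first step after the marker, the data preparation call.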
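The block-comment trick can be sketched as follows. This is a minimal illustration, not a quote from run.sh: the delimiter SKIP and the variable ran are names chosen for the example. It quotes the delimiter, which is an assumption beyond the :<<! form above; with an unquoted delimiter the shell still expands $(...) and backticks inside the skipped text, so commands in there could run anyway.

```shell
# Track which steps actually ran (variable name is illustrative only).
ran="step1"

# ':' is the shell's no-op command; the here-document feeds it text it ignores.
# Quoting the delimiter ('SKIP') stops all expansion inside the skipped block.
: <<'SKIP'
ran="$ran,step2"
echo "this line is never executed: $(date)"
SKIP

ran="$ran,step3"
echo "executed: $ran"
```

To run the next step, move the opening line of the skipped block further down the script.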

#data preparation
#generate text, wav.scp, utt2pk, spk2utt
local/thchs-30_data_prep.sh $H $thchs/data_thchs30 || exit 1;

So we start by running local/thchs-30_data_prep.sh, which does the data preparation. Let's look at what is inside it first.

#!/bin/bash
# Copyright 2016 Tsinghua University (Author: Dong Wang, Xuewei Zhang). Apache 2.0.
# 2016 LeSpeech (Author: Xingyu Na)

#This script prepares the data directory for thchs30 recipe.

#It reads the corpus and get wav.scp and transcriptions.

dir=$1
corpus_dir=$2

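These two variables are simply the two arguments of the command shown earlier, local/thchs-30_data_prep.sh $H $thchs/data_thchs30, namely $H and $thchs/data_thchs30.

$1 is $H, which run.sh sets with H=`pwd`, so it is the directory run.sh is started from.

$2 is $thchs/data_thchs30. Earlier in run.sh we have thchs=/opt/kaldi/egs/thchs30/thchs30-openslr,
so $thchs/data_thchs30 is /opt/kaldi/egs/thchs30/thchs30-openslr/data_thchs30, the directory that holds the audio.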
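How the two arguments arrive in the script can be tried in isolation. In this small sketch show_args is a hypothetical stand-in for the real script and only reports what it receives; the thchs path is the one used in this series and may differ on your machine.

```shell
H=$(pwd)
thchs=/opt/kaldi/egs/thchs30/thchs30-openslr

# Hypothetical stand-in for local/thchs-30_data_prep.sh: it binds the
# positional parameters the same way and prints them.
show_args() {
  dir=$1          # first argument, here $H
  corpus_dir=$2   # second argument, here $thchs/data_thchs30
  echo "dir=$dir"
  echo "corpus_dir=$corpus_dir"
}

show_args "$H" "$thchs/data_thchs30"
```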

cd $dir #enter that directory
echo "creating data/{train,dev,test}" #print the message "creating data/{train,dev,test}"
mkdir -p data/{train,dev,test} #create data and its subdirectories; the prepared files are generated under them shortly

#create wav.scp, utt2spk.scp, spk2utt.scp, text #as I understand it, these are the index files that describe the audio

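To spell this out: from the audio file names and the annotations, the script creates wav.scp, utt2spk, spk2utt and text, plus word.txt and phone.txt. (The comment says utt2spk.scp and spk2utt.scp, but the files actually written are named utt2spk and spk2utt.)
In wav.scp the first column is the recording ID <recording-id> and the second is the path of the audio file <extended-filename>. This script writes no segments file, so the recording ID is the same as the utterance ID.
Example: A11_000 /opt/kaldi/egs/thchs30/thchs30-openslr/data_thchs30/train/A11_0.wav

In utt2spk the first column is the utterance ID <utterance-id> and the second is the speaker ID <speaker-id>.
Example: A11_000 A11
In spk2utt the first column is the speaker <speaker-id>, followed by all utterances of that speaker <utterance-id1> <utterance-id2> …

This is the file produced later by converting data/train/utt2spk into data/train/spk2utt.
In word.txt the first column is the utterance ID <utterance-id> and the rest of the line is the spoken text. We look at how it is generated further down.

Example: A11_000 绿是阳春烟景大块文章的底色四月的林峦更是绿得鲜活秀媚诗意盎然
In phone.txt the first column is the utterance ID <utterance-id> and the rest of the line is the phone-level transcription of the spoken text. We also look at how it is generated further down.
Example: A11_000 l v4 sh ix4 ii iang2 ch un1 ii ian1 j ing3 d a4 k uai4 uu un2 zh ang1 d e5 d i3 s e4 s iy4 vv ve4 d e5 l in2 l uan2 g eng4 sh ix4 l v4 d e5 x ian1 h uo2 x iu4 m ei4 sh ix1 ii i4 aa ang4 r an2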
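Since all of these files are keyed by the same utterance IDs, a quick consistency check is easy to write. The sketch below is only an illustration on made-up sample files: same_utt_ids is a hypothetical helper, not a Kaldi tool, and it assumes the files are already sorted the same way.

```shell
# Hypothetical helper: succeeds only if every file lists the same IDs,
# in the same order, in its first column.
same_utt_ids() {
  ref=$(cut -d' ' -f1 "$1")
  shift
  for f in "$@"; do
    if [ "$(cut -d' ' -f1 "$f")" != "$ref" ]; then
      echo "ID mismatch in $f" >&2
      return 1
    fi
  done
}

# Tiny sample files in a scratch directory (contents are made up).
tmp=$(mktemp -d)
printf 'A11_000 /path/to/A11_0.wav\nA11_001 /path/to/A11_1.wav\n' > "$tmp/wav.scp"
printf 'A11_000 A11\nA11_001 A11\n' > "$tmp/utt2spk"
printf 'A11_000 word1 word2\nA11_001 word3\n' > "$tmp/text"

same_utt_ids "$tmp/wav.scp" "$tmp/utt2spk" "$tmp/text" && echo "IDs consistent"
```

On a real setup the same call can be pointed at data/train/wav.scp, data/train/utt2spk and data/train/text once the script has run.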

(
#loop over the three subsets; each pass generates the files for one of them
for x in train dev test; do
echo "cleaning data/$x" #progress message
cd $dir/data/$x #enter the directory of this subset
rm -rf wav.scp utt2spk spk2utt word.txt phone.txt text #delete any existing copies of these files so they are regenerated from scratch
echo "preparing scps and text in data/$x" #progress message
#updated new "for loop" figured out the compatibility issue with Mac created by Xi Chen, in 03/06/2018 #i.e. the for loop below was rewritten to fix a compatibility problem on Mac
#for nn in `find $corpus_dir/$x/*.wav | sort -u | xargs -i basename {} .wav`; do
for nn in `find $corpus_dir/$x -name "*.wav" | sort -u | xargs -I {} basename {} .wav`; do #find the *.wav files of this subset, sort them and drop duplicate lines, then strip the directory and the .wav suffix
spkid=`echo $nn | awk -F"_" '{print "" $1}'` #speaker ID: the part before the underscore, e.g. A11
spk_char=`echo $spkid | sed 's/\([A-Z]\).*/\1/'` #the letter prefix of the speaker ID, e.g. A
spk_num=`echo $spkid | sed 's/[A-Z]\([0-9]\)/\1/'` #the number of the speaker, e.g. 11
spkid=$(printf '%s%.2d' "$spk_char" "$spk_num") #rebuild the speaker ID with the number zero-padded to two digits
utt_num=`echo $nn | awk -F"_" '{print $2}'` #utterance number: the part after the underscore, e.g. 0
uttid=$(printf '%s%.2d_%.3d' "$spk_char" "$spk_num" "$utt_num") #utterance ID with the utterance number zero-padded to three digits, e.g. A11_000

echo $uttid $corpus_dir/$x/$nn.wav >> wav.scp #append the utterance ID and the full path of the wav file, e.g.
#A11_000 /opt/kaldi/egs/thchs30/thchs30-openslr/data_thchs30/train/A11_0.wav
echo $uttid $spkid >> utt2spk #append the utterance ID and the speaker ID, e.g. A11_000 A11
echo $uttid `sed -n 1p $corpus_dir/data/$nn.wav.trn` >> word.txt #append the utterance ID and line 1 of the matching .trn file, which is the Chinese word transcription, e.g.
#A11_000 绿是阳春烟景大块文章的底色四月的林峦更是绿得鲜活秀媚诗意盎然
echo $uttid `sed -n 3p $corpus_dir/data/$nn.wav.trn` >> phone.txt #append the utterance ID and line 3 of the matching .trn file, which is the phone transcription, e.g.
#A11_000 l v4 sh ix4 ii iang2 ch un1 ii ian1 j ing3 d a4 k uai4 uu un2 zh ang1 d e5 d i3 s e4 s iy4 vv ve4 d e5 l in2 l uan2 g eng4 sh ix4 l v4 d e5 x ian1 h uo2 x iu4 m ei4 sh ix1 ii i4 aa ang4 r an2

done
#text is a copy of word.txt; then every file is sorted in place
cp word.txt text
sort wav.scp -o wav.scp
sort utt2spk -o utt2spk
sort text -o text
sort phone.txt -o phone.txt
done
) || exit 1
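To see what the ID handling in the loop does to a single file name, here is a minimal sketch. It assumes the two sed expressions use the capture groups \( and \), which is how I believe the recipe writes them. normalize_id is a hypothetical helper name and not part of thchs-30_data_prep.sh.

```shell
# Hypothetical helper: prints "<utterance-id> <speaker-id>" for a wav
# base name such as A11_0, following the same steps as the loop body.
normalize_id() {
  nn=$1
  spkid=$(echo "$nn" | awk -F"_" '{print $1}')          # part before "_"
  spk_char=$(echo "$spkid" | sed 's/\([A-Z]\).*/\1/')    # leading letter
  spk_num=$(echo "$spkid" | sed 's/[A-Z]\([0-9]\)/\1/')  # speaker number
  utt_num=$(echo "$nn" | awk -F"_" '{print $2}')         # part after "_"
  spkid=$(printf '%s%.2d' "$spk_char" "$spk_num")        # pad to 2 digits
  uttid=$(printf '%s%.2d_%.3d' "$spk_char" "$spk_num" "$utt_num")
  echo "$uttid $spkid"
}

normalize_id A11_0   # prints: A11_000 A11
normalize_id A2_45   # prints: A02_045 A02
```

The zero padding gives every ID the same width, so a plain text sort puts A02 before A11 and keeps each speaker's utterances in numeric order.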

utils/utt2spk_to_spk2utt.pl data/train/utt2spk > data/train/spk2utt
#call utils/utt2spk_to_spk2utt.pl to convert the utt2spk file into spk2utt; the next two lines do the same for dev and test
utils/utt2spk_to_spk2utt.pl data/dev/utt2spk > data/dev/spk2utt
utils/utt2spk_to_spk2utt.pl data/test/utt2spk > data/test/spk2utt

echo "creating test_phone for phone decoding" #build a copy of the test set for phone-level decoding
(
rm -rf data/test_phone && cp -R data/test data/test_phone || exit 1 #delete any old data/test_phone, then copy data/test to data/test_phone
cd data/test_phone && rm text && cp phone.txt text || exit 1 #inside the copy, delete the original text and use phone.txt as text

)

Now let's look at the utils/utt2spk_to_spk2utt.pl script.

#!/usr/bin/env perl
# Copyright 2010-2011 Microsoft Corporation

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
# MERCHANTABLITY OR NON-INFRINGEMENT.
# See the Apache 2 License for the specific language governing permissions and
# limitations under the License.

# converts an utt2spk file to a spk2utt file.
# Takes input from the stdin or from a file argument;
# output goes to the standard out.

if ( @ARGV > 1 ) {
die "Usage: utt2spk_to_spk2utt.pl [ utt2spk ] > spk2utt";
}

while(<>){
@A = split(" ", $_);
@A == 2 || die "Invalid line in utt2spk file: $_";
($u,$s) = @A;
if(!$seen_spk{$s}) {
$seen_spk{$s} = 1;
push @spklist, $s;
}
push (@{$spk_hash{$s}}, "$u");
}
foreach $s (@spklist) {
$l = join(' ',@{$spk_hash{$s}});
print "$s $l\n";

}
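For readers who do not read Perl, the same grouping can be sketched with awk in the shell. This is an illustration of the idea only, not a replacement for the Kaldi script: the function name is made up, and unlike the Perl version it does not reject lines that do not have exactly two fields.

```shell
# Illustrative re-implementation: group utterance IDs by speaker,
# keeping speakers in the order in which they first appear.
utt2spk_to_spk2utt_sketch() {
  awk '
    {
      if (!($2 in utts)) {        # first time we see this speaker
        order[++n] = $2
        utts[$2] = $1
      } else {                    # append to the list for this speaker
        utts[$2] = utts[$2] " " $1
      }
    }
    END {
      for (i = 1; i <= n; i++) print order[i], utts[order[i]]
    }
  ' "$@"
}

printf 'A11_000 A11\nA11_001 A11\nB02_000 B02\n' | utt2spk_to_spk2utt_sketch
# prints:
# A11 A11_000 A11_001
# B02 B02_000
```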

All this script does is the conversion. That's it for now; let's finish processing these files first. To be continued...
