## VoxForge scripts for Kaldi

Some weeks ago there was a question on the Kaldi mailing list about the possibility of creating a Kaldi recipe using VoxForge data. For those not familiar with it, VoxForge is a project whose goal is to collect speech data for various languages, which can then be used to train acoustic models for automatic speech recognition. The project was founded and is maintained, to the best of my knowledge, by Ken MacLean, and it thrives thanks to the generous contributions of a great number of volunteers, who record sample utterances using the Java applet available on the website, or submit pre-recorded data. As far as I know this is the largest body of free (in both of the usual senses of the word) speech data readily available for acoustic model training. It seemed like a good idea to develop a Kaldi recipe that can be used by people who want to try the toolkit, but don't have access to the commercial corpora. My previous recipe, based on freely available features for a subset of the RM data, can also be used for that purpose, but it has somewhat limited functionality. This post describes the data preparation steps specific to VoxForge's data.

Prerequisites
As usual, the following instructions assume you are using a Linux operating system. The scripts try to install some of the external tools they need, but they also assume that certain tools and libraries are already present. These should either already be available on your system or be easily installable through its package manager. I will try to enumerate the main dependencies when describing the respective parts of the recipe that make use of them.
If you don't have Kaldi installed, you need to download and install it by following the steps in the documentation. If you already have it installed, make sure you have a recent version that includes this recipe. It can be found in egs/voxforge/s5 under Kaldi's installation directory. From this point on, all paths in this blog entry are relative to this directory, unless otherwise noted.

Before doing anything else you should set the DATA_ROOT variable in ./path.sh to point to a directory residing on a drive with enough free space to store several tens of gigabytes of data (about 25GB should be enough with the default recipe configuration at the time of writing).
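For example, the relevant line in ./path.sh might end up looking like this (the path itself is made up and purely illustrative; use whatever location suits your machine):

```shell
# In ./path.sh: point DATA_ROOT at a disk with at least ~25GB free.
# The path below is a hypothetical example.
export DATA_ROOT=/media/secondary/voxforge
```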
You can use ./getdata.sh to download VoxForge's data. Like many other parts of the recipe it hasn't been extensively tested, but hopefully it will work. It assumes you have wget installed on your system, downloads the 16kHz versions of the data archives to ${DATA_ROOT}/tgz, and extracts them to ${DATA_ROOT}/extracted. If you want to free several gigabytes by deleting the archives after the extraction finishes, you can add a "--deltgz true" parameter:

./getdata.sh --deltgz true

Configuration and starting the recipe
It's usually recommended to run Kaldi recipes by hand, by copy/pasting the commands from the respective run.sh scripts. If you do this, be sure to source path.sh first, in order to set the paths and the LC_ALL=C variable. If this environment variable is not set, you may run into sometimes hard-to-diagnose problems, due to the different sort orders used in other locales. Or, if you are like me and prefer to just start "/bin/bash run.sh" to execute all steps automatically, you can do that too, but you may want to modify several variables first. If you happen to have a cluster of machines with Sun Grid Engine installed, you may want to modify the train and decode commands in cmd.sh to match your configuration. A related parameter in run.sh is njobs, which defines the maximum number of parallel processes to be executed. The current default is 2, which is perhaps too low even for a relatively new home desktop machine. There are several parameters toward the start of the run.sh script that I will explain when discussing the parts of the recipe they affect.

The recipe is structured according to the currently recommended ("s5") Kaldi script style. The main sub-directories used are:

- local/ - hosts the scripts that are specific to each recipe. This is mostly data normalization code, which takes the information from the particular speech database and transforms it into the files/data structures that the subsequent steps expect. The work of adapting new data for use with Kaldi, including this recipe, mostly involves writing and modifying scripts that fall into this category.
- conf/ - small configuration files, specifying things like feature extraction parameters and decoding beams.
- steps/ - scripts implementing various acoustic model training methods, mostly by calling Kaldi's binary tools and the scripts in utils/.
- utils/ - scripts performing small, low-level tasks, like adding disambiguation symbols to lexicons, converting between symbolic and integer representations of words etc.
- data/ - this is where the various metadata produced at recipe run time is stored. Most of it results from the work of the scripts in local/.
- exp/ - the most important output of the recipe goes here. This includes the acoustic models and the recognition results.
- tools/ - a non-standard directory, specific to this recipe, that I am using to put various external tools into (more about these later).

Note: Keep in mind that steps/ and utils/ are shared between all "s5" scripts and are just symlinks pointing to egs/wsj/s5. That means you should be careful if you make changes there, as this may affect the other "s5" recipes.

Data subset selection
The recipe has the option to train and test on a subset of VoxForge's data. For (almost) every submission directory there is an etc/README file with meta-information such as the pronunciation dialect and the gender of the speaker. For English there are many pronunciation dialects, but from an (admittedly very limited) sample of the data I've got the impression that some of the recordings with dialect set to, say, "European English" or "Indian English" sound distinctively non-native. So you have the option to select a subset of the dialects if you like. In my limited experience it seems that the speech tagged with "American English" has a relatively lower percentage of non-native speech intermixed.
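The dialect selection works by matching the dialect string from each submission's etc/README against an egrep-style regular expression. A quick way to check what a given pattern will keep (the sample dialect strings below are made up for illustration):

```shell
# A few dialect strings of the kind found in the submissions' etc/README files.
printf '%s\n' 'American English' 'British English' \
              'European English' 'Indian English' > dialects.txt

# Keep only the (presumably) native dialects, using an extended regex:
grep -E '(American)|(British)|(Australia)|(Zealand)' dialects.txt
# Matches 'American English' and 'British English' only.
```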
The default for the recipe is currently set to produce an acoustic model with good coverage of (presumably) native English speakers:

dialects="((American)|(British)|(Australia)|(Zealand))"

The local/voxforge_select.sh script creates symbolic links to the matching submission subdirectories in ${DATA_ROOT}/selected, and the subsequent steps work just on this data. If you want to select all of VoxForge's English speech, you should perhaps set this to:

dialects="English"

I don't have experience with the non-English audio at VoxForge, so I can't comment on how the scripts should be modified to work with it.

Mapping anonymous speakers to unique IDs
VoxForge allows for anonymous speaker registration and speech submission. This is not ideal from the viewpoint of Kaldi's scripts, because they perform various speaker-dependent transforms. The "anonymous" speech is recorded under different environment/channel conditions (microphones used, background noise etc.), by speakers that may be both male and female and may have different accents. So instead of lumping all this data together, I decided to give each speaker a unique identity. The script local/voxforge_fix_data.sh renames all "anonymous" speakers to "anonDDDD" (where D is a decimal digit), based on the submission date. This is not entirely precise, of course, because it may give two different IDs to the same speaker who made recordings on two different dates, and it may also give the same ID to two or more different "anonymous" speakers who happened to submit speech on the same date.

Train/test set splitting and normalizing the metadata
The next step is to split the data into train and test sets, and to produce the relevant transcription and speaker-dependent information. These steps are performed by a rather convoluted and not particularly efficient script called local/voxforge_data_prep.sh. The number of speakers to be assigned to the test set is defined in run.sh. The actual test set speakers are chosen at random.
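In shell terms, such a random speaker split can be sketched roughly like this (the speaker IDs and the choice of 2 test speakers are made up; the actual selection logic lives in local/voxforge_data_prep.sh and differs in detail):

```shell
# Hypothetical list of speaker IDs.
printf '%s\n' spk01 spk02 spk03 spk04 spk05 > all_spk

# Shuffle, then take the first N speakers as the test set and the rest as train.
nspk_test=2
sort -R all_spk > shuffled                   # random order (GNU coreutils)
head -n "$nspk_test" shuffled > test_spk
tail -n +"$((nspk_test + 1))" shuffled > train_spk

wc -l < test_spk     # 2
wc -l < train_spk    # 3
```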
This is probably not an ideal arrangement, because the speakers will be different each time voxforge_data_prep.sh is started, and the WER for the test set will probably be slightly different each time too. I think this is not very important in this case, however, because VoxForge still doesn't have predefined, high-quality train and test sets, nor a test-time language model. It has something along those lines, but there are speakers with utterances in both sets, and I wanted the sets to be disjoint.

The script assumes you have the FLAC codec installed on your machine, because some of the files are encoded in this lossless compression format. One solution would be to convert the files beforehand into WAV format, but the recipe instead uses a nifty feature of Kaldi called extended filenames to convert the audio on demand. For example, in data/local/train_wav.scp, which contains a list of files to be used for feature extraction, there are lines like:

benkay_20090111-ar_ar-09 flac -c -d --silent /media/secondary/voxforge/selected/benkay-20090111-ar/flac/ar-09.flac |

This means, in effect, that when ar-09.flac needs to be converted into MFCC features, flac is invoked first to decode the file, and the decoded .wav-format stream is passed to compute-mfcc-feats using a Unix pipe. It seems there are missing or improperly formatted transcripts for some speakers, and these are ignored by the data preparation script. You can see these errors in the exp/data_prep/make_trans.log file.

Building the language model
I decided to use the MITLM toolkit to estimate the test-time language model. IRSTLM is installed by default under $KALDI_ROOT/tools by Kaldi's installation scripts, but I had mixed experience with this toolkit before and decided to try a different tool this time. A script called local/voxforge_prepare_lm.sh installs MITLM in tools/mitlm-svn and then trains a language model on the train set.
The installation of MITLM assumes you have svn, the GNU autotools, C++ and Fortran compilers, as well as the Boost C++ libraries installed. The order of the LM is given by a variable named lm_order in run.sh.
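To make the lm_order setting a bit more concrete: an order-2 (bigram) model is estimated from counts of adjacent word pairs in the training transcripts. The toy two-line "corpus" below is invented just to show what gets counted; the actual estimation (with smoothing etc.) is of course done by MITLM:

```shell
# A made-up two-line training corpus.
printf '%s\n' 'the cat sat' 'the cat ran' > text

# Extract raw bigram counts: pair each word with its successor, then count.
awk '{ for (i = 1; i < NF; i++) print $i, $(i+1) }' text | sort | uniq -c | sort -rn
# The most frequent bigram is "the cat", with a count of 2.
```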

Preparing the dictionary
The script used for this task is local/voxforge_prepare_dict.sh. It first downloads CMU's pronunciation dictionary and prepares a list of the words that are found in the train set, but not in cmudict. Pronunciations for these words are automatically generated using Sequitur G2P, which is installed under tools/g2p. The installation assumes you have NumPy, SWIG and a C++ compiler on your system. Because the training of Sequitur models takes a lot of time, this script downloads and uses a pre-built model trained on cmudict instead.
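The OOV-word extraction can be pictured as a set difference between the train-set vocabulary and the dictionary's word list. A minimal sketch with made-up word lists (the real script works on the actual corpus vocabulary and cmudict):

```shell
# Made-up, sorted word lists: the training vocabulary and the dictionary words.
printf '%s\n' HELLO KALDI WORLD   > train_words   # already in sorted order
printf '%s\n' GOODBYE HELLO WORLD > dict_words    # already in sorted order

# comm -23 prints the lines unique to the first file: the OOV words,
# which would then be handed to Sequitur G2P for pronunciation generation.
comm -23 train_words dict_words
# Prints: KALDI
```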

Decoding
These were the most important steps specific to this recipe. Most of the rest is just borrowed from the WSJ and RM scripts.
The decoding results for some of the steps are as follows:
exp/mono/decode/wer_10
%WER 64.96 [ 5718 / 8803, 320 ins, 1615 del, 3783 sub ]
%SER 96.29 [ 935 / 971 ]

exp/tri2a/decode/wer_13
%WER 44.63 [ 3929 / 8803, 412 ins, 824 del, 2693 sub ]
%SER 87.95 [ 854 / 971 ]

exp/tri2b_mmi/decode_it4/wer_12
%WER 38.94 [ 3428 / 8803, 401 ins, 753 del, 2274 sub ]
%SER 81.77 [ 794 / 971 ]


These results were obtained using monophone (mono), maximum likelihood trained triphone (tri2a) and discriminatively trained triphone (tri2b_mmi) acoustic models. Now, I know the results are not very inspiring and may even look somewhat disheartening, but keep in mind they were produced using a very poor language model: a bigram estimated on just the quite small corpus of training set transcripts. Another factor contributing to the relatively poor results is that typically about 2-3% of the words (depending on the random test set selection) are found only in the test set, and are thus out-of-vocabulary. Not only do they have no chance of being recognized themselves, but they also reduce the chance for the words surrounding them to be correctly decoded.
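As a sanity check on how the numbers in these lines fit together: in Kaldi's scoring output the error count in each %WER line is the sum of insertions, deletions and substitutions, and the percentage is that sum over the number of reference words. For the monophone result above:

```shell
# %WER 64.96 [ 5718 / 8803, 320 ins, 1615 del, 3783 sub ]
awk 'BEGIN {
    ins = 320; del = 1615; subs = 3783; ref = 8803
    err = ins + del + subs                       # 5718 total errors
    printf "errors=%d WER=%.2f%%\n", err, 100 * err / ref
}'
# Prints: errors=5718 WER=64.96%
```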
One thing that I haven't had the time to try yet is changing the number of states and Gaussians in the acoustic model. As I've mentioned, the training and decoding commands in run.sh were copied from the RM recipe. My guess is that if the numbers of states and PDFs are increased somewhat, that will lower the WER by 2-3% at least. This can be done by tweaking the relevant lines in run.sh, e.g.
# train tri2a [delta+delta-deltas]
steps/train_deltas.sh --cmd "$train_cmd" 1800 9000 \
  data/train data/lang exp/tri1_ali exp/tri2a || exit 1;

In the above, 1800 is the number of PDFs and 9000 is the total number of Gaussians in the system.

There is also something else worth mentioning here. As you may have noticed already, I am using a relatively small number of speakers (40) for testing. This is mostly because the set of prompts used by the audio submission applet is limited, and so there are many utterances duplicated across speakers. Because the test utterances are excluded when training the LM, we don't want too many test set speakers, as this would also mean less text for training the language model, and thus even worse performance. Just to check that everything is OK with the acoustic model, I ran some tests using language models trained on both the train and the test set (WARNING: this is considered "cheating" and a very bad practice - please don't do this at home!). The results were about 3% WER with a trigram and about 17% with a bigram model, with a discriminatively trained AM.

### 3 comments

1. Thanks for sharing your thoughts with us.. they are really interesting.. I would like to read more from you. Sound Testing

2. Thanks for sharing... Was really helpful

3. Hi, thanks for this. I am having this problem with the training:

   steps/make_mfcc.sh: [info]: no segments file exists: assuming wav.scp indexed by utterance.
   queue.pl: error submitting jobs to queue (return status was 512)
   queue log file is exp/make_mfcc/train/q/make_mfcc_train.log, command was qsub -v PATH -cwd -S /bin/bash -j y -l arch=*64* -o exp/make_mfcc/train/q/make_mfcc_train.log -l mem_free=2G,ram_free=2G -t 1:2 /home/kaldi/kaldi/egs/voxforge/s5/exp/make_mfcc/train/q/make_mfcc_train.sh >>exp/make_mfcc/train/q/make_mfcc_train.log 2>&1
   qsub: illegal -c value ""
   usage: qsub [-a date_time] [-A account_string] [-b secs] [-c [ none | { enabled | periodic | shutdown | depth= | dir= | interval=}... ] [-C directive_prefix] [-d path] [-D path] [-e path] [-h] [-I] [-j oe] [-k {oe}] [-l resource_list] [-m n|{abe}] [-M user_list] [-N jobname] [-o path] [-p priority] [-P proxy_user] [-q queue] [-r y|n] [-S path] [-t number_to_submit] [-T type] [-u user_list] [-w] path [-W otherattributes=value...] [-v variable_list] [-V ] [-x] [-X] [-z] [script]

   I know almost nothing about bash code, I am actually learning, and I found that this error comes from steps/make_mfcc.sh, in this part of the code:

   $cmd JOB=1:$nj $logdir/make_mfcc_${name}.JOB.log \
     compute-mfcc-feats $vtln_opts --verbose=2 --config=$mfcc_config \
       scp,p:$logdir/wav_${name}.JOB.scp ark:- \| \
     copy-feats --compress=$compress ark:- \
       ark,scp:$mfccdir/raw_mfcc_$name.JOB.ark,$mfccdir/raw_mfcc_$name.JOB.scp \
     || exit 1;

But I really don't know how to fix it. Can you give me a hand? I already installed the necessary external libraries and other stuff.
Thank you so much.
Regards,

Carlos Cortes