I got bitten recently by an issue in a Kaldi decoding cascade I was working on. I was getting more than 40% WER, although the language and acoustic models I was using suggested the decoder should be able to do much better than that. After a lot of head-scratching I found that the reason for the sub-optimal performance was simply that I had failed to add self-loops to the lexicon transducer (L). These are needed in order to pass through the special "#0" symbol used in Kaldi's recipes to make the grammar transducer (G) determinizable. The effect of this omission was that the back-off arcs in the bigram G I was using were effectively cut off, leading to a highly non-stochastic LG cascade with a very spiky distribution over the allowed word sequences and hence the high WER. After adding the self-loops, the WER went down to 17% without changing anything else.
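For concreteness, here is a rough sketch of the kind of command that adds these self-loops, modeled on the way Kaldi's utils/prepare_lang.sh builds L_disambig.fst; the directory layout and file names below are placeholders and will differ between setups:

  # Look up the integer ids of the "#0" disambiguation symbol on the
  # phone (input) side and the word (output) side of L.
  # (lang/ and the FST file names here are hypothetical.)
  phone_disambig_symbol=$(grep -w '#0' lang/phones.txt | awk '{print $2}')
  word_disambig_symbol=$(grep -w '#0' lang/words.txt | awk '{print $2}')

  # Add a #0:#0 self-loop to the relevant states of L, so that the #0
  # labels on G's back-off arcs can be matched during composition.
  fstaddselfloops "echo $phone_disambig_symbol |" "echo $word_disambig_symbol |" \
    < lang/L_noloop.fst > lang/L_disambig.fst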
At that point I realized that I didn't have detailed knowledge of the decoding graph construction process and decided to take a closer look. One problem here is that the actual large-vocabulary cascades are hard to observe directly. In my experience GraphViz requires exorbitant CPU time and memory even for graphs orders of magnitude smaller than those used in LVCSR. Even if that weren't a problem, the full optimized HCLG WFST is in my humble opinion beyond human ability to comprehend (at least beyond my abilities). We can easily build a small-scale version of the graph, however, and I think this is mostly OK, because using small-scale models to design and test various constructs is a widely used and proven method in engineering and science. This blog entry is not meant to be a replacement for the famous hbka.pdf (a.k.a. The Holy Book of WFSTs) or the excellent Kaldi decoding-graph creation recipe by Dan Povey, but it can hopefully serve as a complement to them.
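To give an idea of the small-scale approach, a toy version of such a graph can be rendered directly with OpenFst's fstdraw and GraphViz. A minimal sketch, assuming a miniature lang directory containing phones.txt, words.txt and L.fst (the paths are placeholders):

  # Render the toy lexicon transducer as a PDF. For graphs of this size
  # dot finishes instantly; for a real LVCSR-sized HCLG it would be hopeless.
  fstdraw --portrait=true --isymbols=lang/phones.txt --osymbols=lang/words.txt \
    lang/L.fst | dot -Tpdf > L.pdf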