Kaldi is a relatively new addition to the open source speech recognition toolkits, officially released about a year ago. It's based on the weighted finite-state transducer (WFST) paradigm and is mostly oriented toward the research community. There are many things I like about Kaldi. First of all, I like the overall "openness" of the project: the source code is distributed under the very permissive Apache license, and even the $\LaTeX$ sources for the papers about Kaldi are in the repo. The design decision to use relatively small, single-purpose tools that can be pipelined Unix-style makes the code very clean and easy to follow. The project is very actively developed and supports experimental options like neural-net-based acoustic features and GPGPU acceleration, although I haven't had the chance to play with these yet. Last but not least, Kaldi comes with extensive documentation.
There is also a large and growing number of recipes for the most widely used speech databases. My problem, however, was that I am not an academic, just a coder who likes to play from time to time with technologies that look interesting and promising. I wanted to get an idea of how Kaldi works, but I don't have access to these expensive datasets. That's why I started searching for a way to use a publicly available database. The best option I have found so far is a subset of the RM1 dataset, freely available from CMU. In fact, the data only includes pre-extracted feature vectors, not the original audio. The details of feature extraction in Sphinx are almost certainly different from those in Kaldi, but as long as the features used in training and decoding match, we should be OK. Kaldi already has a recipe for RM, so modifying it to use CMU's subset was a rather trivial exercise. This post describes how to use this modified recipe in case other people want to try it.
Update: As of Feb 27, 2012, a slightly modified version of this recipe is part of the official Kaldi distribution.
Poor man's Kaldi recipe