<p align="center"></p>
--------------------------------------------------------------------------------
[![Build Status]][Travis]
PyKaldi is a Python scripting layer for the [Kaldi] speech recognition toolkit.
It provides easy-to-use, low-overhead, first-class Python wrappers for the C++
code in Kaldi and [OpenFst] libraries. You can use PyKaldi to write Python code
for things that would otherwise require writing C++ code, such as calling
low-level Kaldi functions, manipulating Kaldi and OpenFst objects, or
implementing new Kaldi tools.
You can think of Kaldi as a large box of legos that you can mix and match to
build custom speech recognition solutions. The best way to think of PyKaldi is
as a supplement, a sidekick if you will, to Kaldi. In fact, PyKaldi is at its
best when it is used alongside Kaldi. To that end, replicating the functionality
of myriad command-line tools, utility scripts and shell-level recipes provided
by Kaldi is a non-goal for the PyKaldi project.
## Overview
- [Getting Started](#getting-started)
- [About PyKaldi](#about-pykaldi)
- [Coverage Status](#coverage-status)
- [Installation](#installation)
- [FAQ](#faq)
- [Citing](#citing)
- [Contributing](#contributing)
## Getting Started
Like Kaldi, PyKaldi is primarily intended for speech recognition researchers and
professionals. It is jam-packed with goodies that one would need to build
Python software taking advantage of the vast collection of utilities,
algorithms and data structures provided by the Kaldi and OpenFst libraries.
If you are not familiar with FST-based speech recognition or have no interest in
having access to the guts of Kaldi and OpenFst in Python, but only want to run a
pre-trained Kaldi system as part of your Python application, do not fret.
PyKaldi includes a number of high-level application oriented modules, such as
[`asr`], [`alignment`] and [`segmentation`], that should be accessible to most
Python programmers.
If you are interested in using PyKaldi for research or building advanced ASR
applications, you are in luck. PyKaldi comes with everything you need to read,
write, inspect, manipulate or visualize Kaldi and OpenFst objects in Python. It
includes Python wrappers for most functions and methods that are part of the
public APIs of Kaldi and OpenFst C++ libraries.
* If you want to read/write files that are produced/consumed by Kaldi tools,
  check out I/O and table utilities in the [`util`] package.
* If you want to work with Kaldi matrices and vectors, e.g. convert them to
  [NumPy] ndarrays and vice versa, check out the [`matrix`] package.
* If you want to use Kaldi for feature extraction and transformation, check out
  the [`feat`], [`ivector`] and [`transform`] packages.
* If you want to work with lattices or other FST structures produced/consumed
  by Kaldi tools, check out the [`fstext`], [`lat`] and [`kws`] packages.
* If you want low-level access to Gaussian mixture models, hidden Markov models
  or phonetic decision trees in Kaldi, check out the [`gmm`], [`sgmm2`],
  [`hmm`] and [`tree`] packages.
* If you want low-level access to Kaldi neural network models, check out the
  [`nnet3`], [`cudamatrix`] and [`chain`] packages.
* If you want to use the decoders and language modeling utilities in Kaldi,
  check out the [`decoder`], [`lm`], [`rnnlm`], [`tfrnnlm`] and [`online2`]
  packages.
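For instance, going from a Kaldi archive of feature matrices to [NumPy]
ndarrays takes only a few lines using the [`util`] and [`matrix`] packages. The
snippet below is a minimal sketch, assuming an existing archive `feats.ark`
produced by a Kaldi feature extraction tool.
```python
from kaldi.util.table import SequentialMatrixReader

# Iterate over the feature matrices in a Kaldi archive (assumed to exist)
# and inspect each one as a NumPy ndarray.
with SequentialMatrixReader("ark:feats.ark") as reader:
    for key, feats in reader:
        feats_np = feats.numpy()  # NumPy view of the Kaldi matrix
        print(key, feats_np.shape, feats_np.mean())
```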
Interested readers who would like to learn more about Kaldi and PyKaldi might
find the following resources useful:
* [Kaldi Docs]: Read these to learn more about Kaldi.
* [PyKaldi Docs]: Consult these to learn more about the PyKaldi API.
* [PyKaldi Examples]: Check these out to see PyKaldi in action.
* [PyKaldi Paper]: Read this to learn more about the design of PyKaldi.
Since automatic speech recognition (ASR) in Python is undoubtedly the "killer
app" for PyKaldi, we will go over a few ASR scenarios to get a feel for the
PyKaldi API. We should note that PyKaldi does not provide any high-level
utilities for training ASR models, so you need to train your models using Kaldi
recipes or use pre-trained models available online. This is simply because
there is no high-level ASR training API in the Kaldi C++ libraries. Kaldi ASR
models are trained using complex shell-level [recipes][Kaldi Recipes] that
handle everything from data preparation to the orchestration of the myriad
Kaldi executables used in training. This is by design and unlikely to change in
the future. PyKaldi does provide wrappers for the low-level ASR training
utilities in the Kaldi C++ libraries, but those are not much use unless you
want to build an ASR training pipeline in Python from basic building blocks,
which is no easy task. Continuing with the lego analogy, this task is akin to
building [this][Lego Chiron] given a truck full of all the legos you might
need. If you are crazy enough to try though, please don't let this paragraph
discourage you. Before we started building PyKaldi, we thought that was a
madman's task too.
### Automatic Speech Recognition in Python
The PyKaldi [`asr`] module includes a number of easy-to-use, high-level classes to
make it dead simple to put together ASR systems in Python. Ignoring the
boilerplate code needed for setting things up, doing ASR with PyKaldi can be as
simple as the following snippet of code:
```python
asr = SomeRecognizer.from_files("final.mdl", "HCLG.fst", "words.txt", opts)
with SequentialMatrixReader("ark:feats.ark") as feats_reader:
    for key, feats in feats_reader:
        out = asr.decode(feats)
        print(key, out["text"])
```
In this simplified example, we first instantiate a hypothetical recognizer
`SomeRecognizer` with the paths for the model `final.mdl`, the decoding graph
`HCLG.fst` and the symbol table `words.txt`. The `opts` object contains the
configuration options for the recognizer. Then, we instantiate a [PyKaldi table
reader][`util.table`] `SequentialMatrixReader` for reading the feature
matrices stored in the [Kaldi archive][Kaldi Archive Docs] `feats.ark`. Finally,
we iterate over the feature matrices and decode them one by one. Here we are
simply printing the best ASR hypothesis for each utterance, so we are only
interested in the `"text"` entry of the output dictionary `out`. Keep in mind
that the output dictionary contains a number of other useful entries, such as
the frame-level alignment of the best hypothesis and a weighted lattice
representing the most likely hypotheses. Admittedly, not all ASR pipelines will be as simple
as this example, but they will often have the same overall structure. In the
following sections, we will see how we can adapt the code given above to
implement more complicated ASR pipelines.
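For instance, if we also want the alignments and the lattices, we can pick them
out of the output dictionary and save the lattices to a Kaldi archive with a
[PyKaldi table writer][`util.table`]. The snippet below is a sketch along those
lines: `SomeRecognizer` and `opts` are the same placeholders as above, it
assumes a lattice-generating recognizer whose output dictionary includes
`"alignment"` and `"lattice"` entries, and `lat.ark` is just an illustrative
output path.
```python
from kaldi.util.table import SequentialMatrixReader, CompactLatticeWriter

asr = SomeRecognizer.from_files("final.mdl", "HCLG.fst", "words.txt", opts)
with SequentialMatrixReader("ark:feats.ark") as feats_reader, \
     CompactLatticeWriter("ark:lat.ark") as lat_writer:
    for key, feats in feats_reader:
        out = asr.decode(feats)
        print(key, out["text"])                      # best hypothesis
        print(key, len(out["alignment"]), "frames")  # frame-level alignment
        lat_writer[key] = out["lattice"]             # save the lattice for later use
```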
#### Offline ASR using Kaldi Models
This is the most common scenario. We want to do offline ASR using pre-trained
Kaldi models, such as [ASpIRE chain models]. Here we are using the term "models"
loosely to refer to everything one would need to put together an ASR system. In
this specific example, we are going to need:
* a [neural network acoustic model][Kaldi Neural Network Docs],
* a [transition model][Kaldi Transition Model Docs],
* a [decoding graph][Kaldi Decoding Graph Docs],
* a [word symbol table][Kaldi Symbol Table Docs],
* and a couple of feature extraction [configs][Kaldi Config Docs].
Note that you can use this example code to decode with [ASpIRE chain models].
```python
from kaldi.asr import NnetLatticeFasterRecognizer
from kaldi.decoder import LatticeFasterDecoderOptions
from kaldi.nnet3 import NnetSimpleComputationOptions
from kaldi.util.table import SequentialMatrixReader, CompactLatticeWriter
# Set the paths and read/write specifiers
model_path = "models/aspire/final.mdl"
graph_path = "models/aspire/graph_pp/HCLG.fst"
symbols_path = "models/aspire/graph_pp/words.txt"
feats_rspec = ("ark:compute-mfcc-feats --config=models/aspire/conf/mfcc.conf "
"scp:wav.scp ark:- |")
ivectors_rspec = (feats_rspec + "ivector-extract-online2 "