pytorch-kaldi-master

Category: Artificial Intelligence / Neural Networks / Deep Learning
Development tool: Python
File size: 351KB
Downloads: 0
Upload date: 2020-06-01 00:03:38
Uploader: jagoern
Description: PyTorch-Kaldi is an open-source repository for developing state-of-the-art DNN/HMM speech recognition systems. The DNN part is managed by PyTorch, while feature extraction, label computation, and decoding are performed with the Kaldi toolkit.

File list (size in bytes, date):
RESULTS (340, 2019-10-24)
best_wer.sh (1450, 2019-10-24)
cfg (0, 2019-10-24)
cfg\DIRHA_baselines (0, 2019-10-24)
cfg\DIRHA_baselines\DIRHA_GRU_fmllr.cfg (4462, 2019-10-24)
cfg\DIRHA_baselines\DIRHA_MLP_fmllr.cfg (4919, 2019-10-24)
cfg\DIRHA_baselines\DIRHA_liGRU_fmllr.cfg (4491, 2019-10-24)
cfg\Librispeech_baselines (0, 2019-10-24)
cfg\Librispeech_baselines\libri_GRU_fmllr.cfg (4506, 2019-10-24)
cfg\Librispeech_baselines\libri_LSTM_fmllr.cfg (4495, 2019-10-24)
cfg\Librispeech_baselines\libri_MLP_fmllr.cfg (3928, 2019-10-24)
cfg\Librispeech_baselines\libri_RNN_fmllr.cfg (4485, 2019-10-24)
cfg\Librispeech_baselines\libri_liGRU_fmllr.cfg (4534, 2019-10-24)
cfg\TIMIT_baselines (0, 2019-10-24)
cfg\TIMIT_baselines\TIMIT_CNN_fbank.cfg (4617, 2019-10-24)
cfg\TIMIT_baselines\TIMIT_CNN_raw.cfg (6313, 2019-10-24)
cfg\TIMIT_baselines\TIMIT_GRU_fbank.cfg (6663, 2019-10-24)
cfg\TIMIT_baselines\TIMIT_GRU_fmllr.cfg (6663, 2019-10-24)
cfg\TIMIT_baselines\TIMIT_GRU_mfcc.cfg (6661, 2019-10-24)
cfg\TIMIT_baselines\TIMIT_LSTM_fbank.cfg (6665, 2019-10-24)
cfg\TIMIT_baselines\TIMIT_LSTM_fmllr.cfg (6665, 2019-10-24)
cfg\TIMIT_baselines\TIMIT_LSTM_fmllr_cudnn.cfg (6501, 2019-10-24)
cfg\TIMIT_baselines\TIMIT_LSTM_mfcc.cfg (6663, 2019-10-24)
cfg\TIMIT_baselines\TIMIT_MLP_fbank.cfg (6631, 2019-10-24)
cfg\TIMIT_baselines\TIMIT_MLP_fbank_autoencoder.cfg (3746, 2019-10-24)
cfg\TIMIT_baselines\TIMIT_MLP_fbank_prod.cfg (7025, 2019-10-24)
cfg\TIMIT_baselines\TIMIT_MLP_fmllr.cfg (6628, 2019-10-24)
cfg\TIMIT_baselines\TIMIT_MLP_mfcc.cfg (6626, 2019-10-24)
cfg\TIMIT_baselines\TIMIT_MLP_mfcc_basic.cfg (3738, 2019-10-24)
cfg\TIMIT_baselines\TIMIT_MLP_mfcc_basic_flex.cfg (3835, 2019-10-24)
cfg\TIMIT_baselines\TIMIT_RNN_fbank.cfg (6662, 2019-10-24)
cfg\TIMIT_baselines\TIMIT_RNN_fmllr.cfg (6661, 2019-10-24)
cfg\TIMIT_baselines\TIMIT_RNN_mfcc.cfg (6660, 2019-10-24)
cfg\TIMIT_baselines\TIMIT_SRU_fbank.cfg (5090, 2019-10-24)
cfg\TIMIT_baselines\TIMIT_SincNet_raw.cfg (6287, 2019-10-24)
cfg\TIMIT_baselines\TIMIT_liGRU_fbank.cfg (6691, 2019-10-24)
cfg\TIMIT_baselines\TIMIT_liGRU_fmllr.cfg (6691, 2019-10-24)
... ...

# The PyTorch-Kaldi Speech Recognition Toolkit

PyTorch-Kaldi is an open-source repository for developing state-of-the-art DNN/HMM speech recognition systems. The DNN part is managed by PyTorch, while feature extraction, label computation, and decoding are performed with the Kaldi toolkit.

This repository contains the latest version of the PyTorch-Kaldi toolkit (PyTorch-Kaldi-v1.0). To take a look at the previous version (PyTorch-Kaldi-v0.1), [click here](https://bitbucket.org/mravanelli/pytorch-kaldi-v0.0/src/master/).

If you use this code or part of it, please cite the following paper:

*M. Ravanelli, T. Parcollet, Y. Bengio, "The PyTorch-Kaldi Speech Recognition Toolkit", [arXiv](https://arxiv.org/abs/1811.07453)*

```
@inproceedings{pytorch-kaldi,
  title     = {The PyTorch-Kaldi Speech Recognition Toolkit},
  author    = {M. Ravanelli and T. Parcollet and Y. Bengio},
  booktitle = {In Proc. of ICASSP},
  year      = {2019}
}
```

The toolkit is released under a **Creative Commons Attribution 4.0 International license**. You can copy, distribute, and modify the code for research, commercial, and non-commercial purposes. We only ask that you cite the paper referenced above.

To improve the transparency and replicability of speech recognition results, we give users the possibility to release their PyTorch-Kaldi models within this repository. Feel free to contact us (or open a pull request) for that. Moreover, if your paper uses PyTorch-Kaldi, it can also be advertised in this repository.

[See a short introductory video on the PyTorch-Kaldi Toolkit](https://www.youtube.com/watch?v=VDQaf0SS4K0&t=2s)

## Next Version: SpeechBrain

We are happy to announce the SpeechBrain project (https://speechbrain.github.io/), which aims to develop an **open-source, all-in-one** toolkit based on PyTorch. SpeechBrain will significantly extend the functionality of the current PyTorch-Kaldi toolkit. The goal is to develop a *single*, *flexible*, and *user-friendly* toolkit that can be used to easily develop state-of-the-art systems for speech recognition (both end-to-end and HMM-DNN), speaker recognition, speech separation, multi-microphone signal processing (e.g., beamforming), self-supervised learning, and many other tasks.

The project will be led by Mila and is sponsored by Samsung, Nvidia, and Dolby. SpeechBrain will also benefit from the collaboration and expertise of other companies such as Facebook/PyTorch, IBM Research, and FluentAI.

We are actively looking for collaborators. Feel free to contact us at speechbrainproject@gmail.com if you are interested in collaborating. Thanks to our sponsors, we are also able to hire interns working at Mila on the SpeechBrain project. The ideal candidate is a PhD student with experience in PyTorch and speech technologies (send your CV to speechbrainproject@gmail.com).

The development of SpeechBrain will require some months before a working repository is available. Meanwhile, we will continue to provide support for the PyTorch-Kaldi project. Stay tuned!
## Table of Contents

* [Introduction](#introduction)
* [Prerequisites](#prerequisites)
* [How to install](#how-to-install)
* [Recent Updates](#recent-updates)
* [Tutorials:](#timit-tutorial)
  * [TIMIT tutorial](#timit-tutorial)
  * [Librispeech tutorial](#librispeech-tutorial)
* [Toolkit Overview:](#overview-of-the-toolkit-architecture)
  * [Toolkit architecture](#overview-of-the-toolkit-architecture)
  * [Configuration files](#description-of-the-configuration-files)
* [FAQs:](#how-can-i-plug-in-my-model)
  * [How can I plug in my model?](#how-can-i-plug-in-my-model)
  * [How can I tune the hyperparameters?](#how-can-i-tune-the-hyperparameters)
  * [How can I use my own dataset?](#how-can-i-use-my-own-dataset)
  * [How can I plug in my own features?](#how-can-i-plug-in-my-own-features)
  * [How can I transcribe my own audio files?](#how-can-i-transcript-my-own-audio-files)
  * [Batch size, learning rate, and dropout scheduler](#batch-size,-learning-rate,-and-dropout-scheduler)
  * [How can I contribute to the project?](#how-can-i-contribute-to-the-project)
* [EXTRA:](#speech-recognition-from-the-raw-waveform-with-sincnet)
  * [Speech recognition from the raw waveform with SincNet](#speech-recognition-from-the-raw-waveform-with-sincnet)
  * [Joint training between speech enhancement and ASR](#joint-training-between-speech-enhancement-and-asr)
  * [Distant Speech Recognition with DIRHA](#distant-speech-recognition-with-dirha)
  * [Training an autoencoder](#training-an-autoencoder)
* [References](#references)

## Introduction

The PyTorch-Kaldi project aims to bridge the gap between the Kaldi and PyTorch toolkits, trying to inherit the efficiency of Kaldi and the flexibility of PyTorch. PyTorch-Kaldi is not only a simple interface between these toolkits: it also embeds several useful features for developing modern speech recognizers. For instance, the code is specifically designed to naturally plug in user-defined acoustic models. As an alternative, users can exploit several pre-implemented neural networks that can be customized using intuitive configuration files. PyTorch-Kaldi supports multiple feature and label streams as well as combinations of neural networks, enabling the use of complex neural architectures. The toolkit is publicly released along with rich documentation and is designed to work properly both locally and on HPC clusters.

Some features of the new version of the PyTorch-Kaldi toolkit:

- Easy interface with Kaldi.
- Easy plug-in of user-defined models.
- Several pre-implemented models (MLP, CNN, RNN, LSTM, GRU, Li-GRU, SincNet).
- Natural implementation of complex models based on multiple features, labels, and neural architectures.
- Easy and flexible configuration files.
- Automatic recovery from the last processed chunk.
- Automatic chunking and context expansion of the input features.
- Multi-GPU training.
- Designed to work locally or on HPC clusters.
- Tutorials on the TIMIT and Librispeech datasets.

## Prerequisites

1. If not already done, install Kaldi (http://kaldi-asr.org/). As suggested during the installation, do not forget to add the path of the Kaldi binaries to $HOME/.bashrc. For instance, make sure that .bashrc contains the following paths:

```
export KALDI_ROOT=/home/mirco/kaldi-trunk
PATH=$PATH:$KALDI_ROOT/tools/openfst
PATH=$PATH:$KALDI_ROOT/src/featbin
PATH=$PATH:$KALDI_ROOT/src/gmmbin
PATH=$PATH:$KALDI_ROOT/src/bin
PATH=$PATH:$KALDI_ROOT/src/nnetbin
export PATH
```

Remember to change the KALDI_ROOT variable to your own path.
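If you prefer to script this check, a minimal sketch along the following lines (purely illustrative, not part of the toolkit) verifies that the Kaldi binaries exported above are visible on PATH; it only assumes the two binary names mentioned in this README:

```python
# Illustrative check (not part of PyTorch-Kaldi): make sure the Kaldi binaries
# exported in .bashrc can actually be found on PATH.
import shutil

for tool in ("copy-feats", "hmm-info"):
    location = shutil.which(tool)
    if location is None:
        print(f"WARNING: '{tool}' not found on PATH -- check KALDI_ROOT in your .bashrc")
    else:
        print(f"{tool}: {location}")
```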
As a first test of the installation, open a bash shell, type "copy-feats" or "hmm-info", and make sure no errors appear.

2. If not already done, install PyTorch (http://pytorch.org/). We tested our code on PyTorch 1.0 and PyTorch 0.4; older versions of PyTorch are likely to raise errors. To check your installation, type "python" and, once inside the console, type "import torch" and make sure no errors appear.

3. We recommend running the code on a GPU machine. Make sure that the CUDA libraries (https://developer.nvidia.com/cuda-downloads) are installed and working correctly. We tested our system on CUDA 9.0, 9.1, and 8.0. Make sure that Python is installed (the code is tested with Python 2.7 and Python 3.7). Even though it is not mandatory, we suggest using Anaconda (https://anaconda.org/anaconda/python).

## Recent updates

**19 Feb. 2019: updates:**

- It is now possible to dynamically change the batch size, learning rate, and dropout factors during training. We thus implemented a scheduler that supports the following formalism within the config files (a small parsing sketch is given at the end of this section):

```
batch_size_train = 128*12 | 64*10 | 32*2
```

The line above means: do 12 epochs with a batch size of 128, 10 epochs with a batch size of 64, and 2 epochs with a batch size of 32. A similar formalism can be used for learning rate and dropout scheduling. [See this section for more information](#batch-size,-learning-rate,-and-dropout-scheduler).

**5 Feb. 2019: updates:**

1. Our toolkit now supports parallel data loading (i.e., the next chunk is stored in memory while the current chunk is being processed). This allows a significant speed-up.
2. When performing monophone regularization, users can now set "dnn_lay = N_lab_out_mono". This way, the number of monophones is automatically inferred by our toolkit.
3. We integrated the kaldi-io toolkit from the [kaldi-io-for-python](https://github.com/vesis84/kaldi-io-for-python) project into data_io.py.
4. We provide a better hyperparameter setting for SincNet ([see this section](#speech-recognition-from-the-raw-waveform-with-sincnet)).
5. We released some baselines with the DIRHA dataset ([see this section](#distant-speech-recognition-with-dirha)). We also provide some configuration examples for a simple autoencoder ([see this section](#training-an-autoencoder)) and for a system that jointly trains a speech enhancement and a speech recognition module ([see this section](#joint-training-between-speech-enhancement-and-asr)).
6. We fixed some minor bugs.

**Notes on the next version:**

In the next version, we plan to further extend the functionality of our toolkit, supporting more models and feature formats. The goal is to make our toolkit suitable for other speech-related tasks such as end-to-end speech recognition, speaker identification, keyword spotting, speech separation, speech activity detection, speech enhancement, etc. If you would like to propose some novel functionalities, please give us your feedback by [filling in this survey](https://docs.google.com/forms/d/12jd-QP5m8NAJVpiypvtVGy1n_d2iuWaLozXq5hsg4yA/edit?usp=sharing).
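As promised above, here is a minimal sketch of how a `value*epochs | ...` scheduler string can be expanded into one value per training epoch. The function is purely illustrative and is not the toolkit's own parser:

```python
# Illustrative expansion of a scheduler string such as "128*12 | 64*10 | 32*2"
# into one value per training epoch. Not the toolkit's own implementation.
def expand_schedule(spec):
    """Expand "value*epochs | value*epochs | ..." into a per-epoch list."""
    per_epoch = []
    for block in spec.split("|"):
        value, n_epochs = (tok.strip() for tok in block.split("*"))
        per_epoch.extend([value] * int(n_epochs))
    return per_epoch

# 24 entries: twelve '128', ten '64', two '32'. Values are kept as strings so
# the caller can cast them to int (batch size) or float (learning rate, dropout).
print(expand_schedule("128*12 | 64*10 | 32*2"))
```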
## How to install

To install PyTorch-Kaldi, follow these steps:

1. Make sure all the software recommended in the "Prerequisites" section is installed and working correctly.
2. Clone the PyTorch-Kaldi repository:

```
git clone https://github.com/mravanelli/pytorch-kaldi
```

3. Go into the project folder and install the needed packages with:

```
pip install -r requirements.txt
```

## TIMIT tutorial

In the following, we provide a short tutorial on the PyTorch-Kaldi toolkit based on the popular TIMIT dataset.

1. Make sure you have the TIMIT dataset. If not, it can be downloaded from the LDC website (https://catalog.ldc.upenn.edu/LDC93S1).

2. Make sure the Kaldi and PyTorch installations are fine. Also make sure that your KALDI paths are currently working (you should add the Kaldi paths to .bashrc as reported in the "Prerequisites" section). For instance, type "copy-feats" and "hmm-info" and make sure no errors appear.

3. Run the Kaldi s5 baseline of TIMIT. This step is necessary to compute the features and labels later used to train the PyTorch neural network. We recommend running the full TIMIT s5 recipe (including the DNN training):

```
cd kaldi/egs/timit/s5
./run.sh
./local/nnet/run_dnn.sh
```

This way, all the necessary files are created and the user can directly compare the results obtained by Kaldi with those achieved with our toolkit.

4. Compute the alignments (i.e., the phone-state labels) for the test and dev data with the following commands (go into $KALDI_ROOT/egs/timit/s5). If you want to use tri3 alignments, type:

```
steps/align_fmllr.sh --nj 4 data/dev data/lang exp/tri3 exp/tri3_ali_dev
steps/align_fmllr.sh --nj 4 data/test data/lang exp/tri3 exp/tri3_ali_test
```

If you want to use DNN alignments (as suggested), type:

```
steps/nnet/align.sh --nj 4 data-fmllr-tri3/train data/lang exp/dnn4_pretrain-dbn_dnn exp/dnn4_pretrain-dbn_dnn_ali
steps/nnet/align.sh --nj 4 data-fmllr-tri3/dev data/lang exp/dnn4_pretrain-dbn_dnn exp/dnn4_pretrain-dbn_dnn_ali_dev
steps/nnet/align.sh --nj 4 data-fmllr-tri3/test data/lang exp/dnn4_pretrain-dbn_dnn exp/dnn4_pretrain-dbn_dnn_ali_test
```

5. We start this tutorial with a very simple MLP network trained on MFCC features. Before launching the experiment, take a look at the configuration file *cfg/TIMIT_baselines/TIMIT_MLP_mfcc_basic.cfg*. See the [Description of the configuration files](#description-of-the-configuration-files) for a detailed description of all its fields.

6. Change the config file according to your paths. In particular:

- Set "fea_lst" to the path of your MFCC training list (which should be in $KALDI_ROOT/egs/timit/s5/data/train/feats.scp).
- Add your path (e.g., $KALDI_ROOT/egs/timit/s5/data/train/utt2spk) into "--utt2spk=ark:".
- Add your CMVN transformation (e.g., $KALDI_ROOT/egs/timit/s5/mfcc/cmvn_train.ark).
- Add the folder where the labels are stored (e.g., $KALDI_ROOT/egs/timit/s5/exp/dnn4_pretrain-dbn_dnn_ali for training data and $KALDI_ROOT/egs/timit/s5/exp/dnn4_pretrain-dbn_dnn_ali_dev for dev data).

To avoid errors, make sure that all the paths in the cfg file exist. **Please avoid using paths containing bash variables, since paths are read literally and are not automatically expanded** (e.g., use /home/mirco/kaldi-trunk/egs/timit/s5/exp/dnn4_pretrain-dbn_dnn_ali instead of $KALDI_ROOT/egs/timit/s5/exp/dnn4_pretrain-dbn_dnn_ali).
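A small helper along these lines can catch the two most common mistakes at this stage: paths that do not exist and unexpanded bash variables. This is a purely illustrative sketch (not part of the toolkit) that simply tokenizes the cfg file and checks anything that looks like a path:

```python
# Illustrative sketch (not part of PyTorch-Kaldi): crude check that every
# path-like token in a cfg file exists and contains no bash variables.
import os
import re
import sys

cfg_file = sys.argv[1] if len(sys.argv) > 1 else "cfg/TIMIT_baselines/TIMIT_MLP_mfcc_basic.cfg"

with open(cfg_file) as f:
    text = f.read()

# Crude tokenization: any token containing a "/" is treated as a candidate path.
for token in sorted(set(re.split(r"[\s=,]+", text))):
    if "/" not in token:
        continue
    path = token.split("ark:")[-1]          # drop Kaldi "ark:" prefixes
    if "$" in path:
        print("unexpanded bash variable:", path)
    elif path.startswith("/") and not os.path.exists(path):
        print("path not found:", path)
```

Running such a check before launching an experiment can save a failure several chunks into training.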
7. Run the ASR experiment:

```
python run_exp.py cfg/TIMIT_baselines/TIMIT_MLP_mfcc_basic.cfg
```

This script starts a full ASR experiment and performs the training, validation, forward, and decoding steps. A progress bar shows the evolution of all the aforementioned phases. The script *run_exp.py* progressively creates the following files in the output directory:

- *res.res*: a file that summarizes training and validation performance across the various epochs.
- *log.log*: a file that contains possible errors and warnings.
- *conf.cfg*: a copy of the configuration file.
- *model.svg*: a picture that shows the considered model and how the various neural networks are connected. This is really useful for debugging models that are more complex than this one (e.g., models based on multiple neural networks).
- The folder *exp_files* contains several files that summarize the evolution of training and validation over the various epochs. For instance, the *.info files report chunk-specific information such as the chunk loss, the error, and the training time. The *.cfg files are the chunk-specific configuration files (see the general architecture for more details), while the *.lst files report the list of features used to train each specific chunk.
- At the end of training, a directory called *generated outputs* containing plots of the loss and errors over the various training epochs is created.

**Note that you can stop the experiment at any time.** If you run the script again, it will automatically restart from the last correctly processed chunk. The training can take a couple of hours, depending on the available GPU. Note also that if you would like to change some parameters of the configuration file (e.g., n_chunks=, fea_lst=, batch_size_train=, ...), you must specify a different output folder (output_folder=).

**Debug:** If you run into some errors, we suggest the following checks:

1. Take a look at the standard output.
2. If it is not helpful, take a look at the log.log file.
3. Take a look at the function run_nn in the core.py library. Add some prints in the various parts of the function to isolate the problem and figure out the issue.

8. At the end of training, the phone error rate (PER%) is appended to the res.res file. To see more details on the decoding results, you can go into "decoding_test" in the output folder and take a look at the various files created. For this specific example, we obtained the following *res.res* file:

```
ep=000 tr=['TIMIT_tr'] loss=3.3*** err=0.721 valid=TIMIT_dev loss=2.268 err=0.591 lr_architecture1=0.080000 time(s)=86
ep=001 tr=['TIMIT_tr'] loss=2.137 err=0.570 valid=TIMIT_dev loss=1.990 err=0.541 lr_architecture1=0.080000 time(s)=87
ep=002 tr=['TIMIT_tr'] loss=1.896 err=0.524 valid=TIMIT_dev loss=1.874 err=0.516 lr_architecture1=0.080000 time(s)=87
ep=003 tr=['TIMIT_tr'] loss=1.751 err=0.494 valid=TIMIT_dev loss=1.819 err=0.504 lr_architecture1=0.080000 time(s)=88
ep=004 tr=['TIMIT_tr'] loss=1.***5 err=0.472 valid=TIMIT_dev loss=1.775 err=0.494 lr_architecture1=0.080000 time(s)=89
ep=005 tr=['TIMIT_tr'] loss=1.560 err=0.453 valid=TIMIT_dev loss=1.773 err=0.493 lr_architecture1=0.080000 time(s)=88
.........
ep=020 tr=['TIMIT_tr'] loss=0.968 err=0.304 valid=TIMIT_dev loss=1.***8 err=0.446 lr_architecture1=0.002500 time(s)=89
ep=021 tr=['TIMIT_tr'] loss=0.965 err=0.304 valid=TIMIT_dev loss=1.***9 err=0.446 lr_architecture1=0.002500 time(s)=90
ep=022 tr=['TIMIT_tr'] loss=0.960 err=0.302 valid=TIMIT_dev loss=1.652 err=0.447 lr_architecture1=0.001250 time(s)=88
ep=023 tr=['TIMIT_tr'] loss=0.959 err=0.301 valid=TIMIT_dev loss=1.651 err=0.446 lr_architecture1=0.000625 time(s)=88
%WER 18.1 | 192 7215 | 84.0 11.9 4.2 2.1 18.1 99.5 | -0.583 | /home/mirco/pytorch-kaldi-new/exp/TIMIT_MLP_basic5/decode_TIMIT_test_out_dnn1/score_6/ctm_39phn.filt.sys
```

The achieved PER(%) is 18.1%. Note that there can be some variability in the results, due to different initializations on different machines.
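Since res.res follows the simple line format shown above, it is easy to post-process, for instance to track the training and validation error per epoch. The sketch below is illustrative only (the res.res path is just an example; use the output_folder set in your cfg file):

```python
# Illustrative parser (not part of the toolkit) for the per-epoch lines of
# res.res; each "ep=..." line carries a training and a validation loss/error.
import re

res_file = "exp/TIMIT_MLP_basic/res.res"    # example path: use your own output_folder

with open(res_file) as f:
    for line in f:
        if not line.startswith("ep="):
            continue                         # skip the final "%WER ..." line
        epoch = int(re.search(r"ep=(\d+)", line).group(1))
        errs = [float(v) for v in re.findall(r"err=([\d.]+)", line)]
        if len(errs) >= 2:
            print(f"epoch {epoch:03d}  train_err={errs[0]:.3f}  valid_err={errs[1]:.3f}")
```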
We believe that averaging the performance obtained with different initialization seeds (i.e., changing the *seed* field in the config file) is crucial for TIMIT, since the natural performance variability might completely hide the experimental evidence. We noticed a standard deviation of about 0.2% for the TIMIT experiments.

If you want to change the features, you first have to compute them with the Kaldi toolkit. To compute fbank features, open *$KALDI_ROOT/egs/timit/s5/run.sh* and compute them with the following lines:

```
feadir=fbank

for x in train dev test; do
  steps/make_fbank.sh --cmd "$train_cmd" --nj $feats_nj data/$x exp/make_fbank/$x $feadir
  steps/compute_cmvn_stats.sh data/$x exp/make_fbank/$x $feadir
done
```

Then, change the aforementioned configuration file with the new feature list. If you have already run the full TIMIT Kaldi recipe, you can directly find the fMLLR features in *$KALDI_ROOT/egs/timit/s5/data-fmllr-tri3*. If you feed the neural network with such features, you should expect a substantial performance improvement, due to the adoption of speaker adaptation.

In the TIMIT_baselines folder, we propose several other examples of possible TIMIT baselines. Similarly to the previous example, you can run them by simply typing:

```
python run_exp.py $cfg_file
```

There are some examples with recurrent (TIMIT_RNN*, TIMIT_LSTM*, TIMIT_GRU*, TIMIT_liGRU*) and CNN architectures (TIMIT_CNN*). We also propose a more advanced model (TIMIT_DNN_liGRU_DNN_mfcc+fbank+fmllr.cfg) in which we use a combination of feed-forward and recurrent neural networks fed by a concatenation of mfcc, fbank, and fmllr features. Note that the latter configuration file corresponds to the best architecture described in the reference paper. As you might see from the above-mentioned configuration files, we improve the ASR performance by including some tricks such as monophone regularization (i.e., we jointly estimate both context-dependent and context-independent targets). The following table reports the results obtained by running the latter systems (average PER%):

| Model | mfcc | fbank | fMLLR |
| ------ | ----- | ------ | ------ |
| Kaldi DNN Baseline | ----- | ------ | 18.5 |
| MLP | 18.2 | 18.7 | 16.7 |
| RNN | 17.7 | 17.2 | 15.9 |
| SRU | ----- | 16.6 | ----- |
| LSTM | 15.1 | 14.3 | 14.5 |
| GRU | 16.0 | 15.2 | 14.9 |
| li-GRU | **15.5** | **14.9** | **14.2** |

Results show that, as expected, fMLLR features outperform MFCC and FBANK coefficients, thanks to the speaker adaptation process. Recurrent models significantly outperform the standard MLP, especially when using LSTM, GRU, and Li-GRU architectures, which effectively address ... ...
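As a complement to the seed-averaging advice given earlier in this tutorial, the final PER of several runs that differ only in the *seed* field can be collected from the last %WER line of each res.res file. A minimal sketch, with hypothetical experiment folder names:

```python
# Illustrative only (not part of the toolkit): report mean and standard
# deviation of the PER across runs trained with different seeds.
# The folder names below are hypothetical placeholders.
import statistics

runs = ["exp/TIMIT_MLP_seed1234", "exp/TIMIT_MLP_seed2234", "exp/TIMIT_MLP_seed3234"]

pers = []
for run in runs:
    with open(f"{run}/res.res") as f:
        wer_lines = [line for line in f if line.startswith("%WER")]
    pers.append(float(wer_lines[-1].split()[1]))   # "%WER 18.1 | ..." -> 18.1

print(f"PER over {len(pers)} seeds: "
      f"{statistics.mean(pers):.1f} +/- {statistics.stdev(pers):.1f}")
```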
