transformer-aan

Category: Pattern recognition (vision/speech, etc.)
Development tool: Python
File size: 657KB
Downloads: 1
Upload date: 2020-07-16 11:31:20
Uploader: sh-1993
Description: Code for "Accelerating Neural Transformer via an Average Attention Network" (ACL 2018)

File list (size in bytes, date):
LICENSE (1510, 2020-07-16)
code (0, 2020-07-16)
code\thumt (0, 2020-07-16)
code\thumt\__init__.py (50, 2020-07-16)
code\thumt\data (0, 2020-07-16)
code\thumt\data\__init__.py (50, 2020-07-16)
code\thumt\data\dataset.py (10514, 2020-07-16)
code\thumt\data\record.py (5326, 2020-07-16)
code\thumt\data\vocab.py (747, 2020-07-16)
code\thumt\interface (0, 2020-07-16)
code\thumt\interface\__init__.py (204, 2020-07-16)
code\thumt\interface\model.py (820, 2020-07-16)
code\thumt\layers (0, 2020-07-16)
code\thumt\layers\__init__.py (133, 2020-07-16)
code\thumt\layers\attention.py (13212, 2020-07-16)
code\thumt\layers\nn.py (5979, 2020-07-16)
code\thumt\layers\rnn_cell.py (4979, 2020-07-16)
code\thumt\models (0, 2020-07-16)
code\thumt\models\__init__.py (594, 2020-07-16)
code\thumt\models\rnnsearch.py (14632, 2020-07-16)
code\thumt\models\seq2seq.py (6964, 2020-07-16)
code\thumt\models\transformer.py (16651, 2020-07-16)
code\thumt\scripts (0, 2020-07-16)
code\thumt\scripts\build_vocab.py (2304, 2020-07-16)
code\thumt\scripts\checkpoint_averaging.py (4000, 2020-07-16)
code\thumt\scripts\convert_old_model.py (4900, 2020-07-16)
code\thumt\scripts\convert_vocab.py (493, 2020-07-16)
code\thumt\scripts\input_converter.py (4882, 2020-07-16)
code\thumt\scripts\shuffle_corpus.py (1319, 2020-07-16)
code\thumt\utils (0, 2020-07-16)
code\thumt\utils\__init__.py (50, 2020-07-16)
code\thumt\utils\bleu.py (3038, 2020-07-16)
code\thumt\utils\hooks.py (13111, 2020-07-16)
code\thumt\utils\loss.py (350, 2020-07-16)
code\thumt\utils\mrt_utils.py (9155, 2020-07-16)
code\thumt\utils\parallel.py (3735, 2020-07-16)
code\thumt\utils\sample.py (160, 2020-07-16)
... ...

# transformer-aan

Source code for "Accelerating Neural Transformer via an Average Attention Network" (ACL 2018). The source code is developed upon THUMT.

> The THUMT version used for the experiments in our paper was downloaded on Jan 11, 2018.
> This repository is a fork of [transformer-aan](https://github.com/bzhangGo/transformer-aan).

# About AAN Structure

We introduce two sub-layers for the AAN in our ACL paper: an FFN layer (Eq. (1)) and a gating layer (Eq. (2)). However, after extensive experiments we observe that **the FFN layer is redundant and can be removed without loss of translation quality**. In addition, removing the FFN layer reduces the number of model parameters, slightly improves the training speed, and largely improves the decoding speed.

**For re-implementation, we suggest that other researchers use the AAN model without the FFN sub-layer!** See how we [disable this layer](https://github.com/bzhangGo/transformer-aan/blob/master/code/thumt/models/transformer.py#L137). A minimal sketch of the AAN computation is given below.

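To make the structure concrete, the following is a minimal NumPy sketch of one AAN sub-layer, not the repository's TensorFlow implementation: the cumulative average over previous target positions (computed either with a cumulative sum or with an explicit mask matrix, cf. the `aan_mask` option described later), the optional position-wise FFN of Eq. (1), and the gating layer of Eq. (2). All names and the random placeholder weights are illustrative, and the surrounding residual connection and layer normalization are omitted.

```python
import numpy as np

def cumulative_average_cumsum(y):
    """Accumulative average over positions 0..t (inclusive), via a cumulative sum."""
    steps = np.arange(1, y.shape[0] + 1)[:, None]        # 1, 2, ..., n_tgt
    return np.cumsum(y, axis=0) / steps                  # y: [n_tgt, d]

def cumulative_average_mask(y):
    """The same quantity via a mask-matrix multiplication (the `aan_mask` variant)."""
    n = y.shape[0]
    mask = np.tril(np.ones((n, n))) / np.arange(1, n + 1)[:, None]  # row t averages 0..t
    return mask.dot(y)

def aan_sublayer(y, params, use_ffn=False):
    """Cumulative average -> optional FFN (Eq. (1)) -> input/forget gates (Eq. (2))."""
    g = cumulative_average_cumsum(y)
    if use_ffn:
        g = np.maximum(g.dot(params["W1"]) + params["b1"], 0).dot(params["W2"]) + params["b2"]
    gates = np.concatenate([y, g], axis=-1).dot(params["Wg"]) + params["bg"]
    i, f = np.split(1.0 / (1.0 + np.exp(-gates)), 2, axis=-1)       # sigmoid gates
    return i * y + f * g

if __name__ == "__main__":
    n_tgt, d = 5, 8
    np.random.seed(0)
    params = {
        "W1": np.random.randn(d, 4 * d), "b1": np.zeros(4 * d),
        "W2": np.random.randn(4 * d, d), "b2": np.zeros(d),
        "Wg": np.random.randn(2 * d, 2 * d), "bg": np.zeros(2 * d),
    }
    y = np.random.randn(n_tgt, d)
    # Both accumulative-average implementations agree.
    assert np.allclose(cumulative_average_cumsum(y), cumulative_average_mask(y))
    print(aan_sublayer(y, params, use_ffn=False).shape)  # (5, 8)
```
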
# Other Implementations

* [Marian](https://github.com/marian-nmt/marian): an efficient NMT toolkit implemented in C++
* [Neutron](https://github.com/anoidgit/transformer): a PyTorch NMT toolkit
* [translate](https://github.com/pytorch/translate): a fairseq-based NMT translation toolkit
* [OpenNMT](https://github.com/OpenNMT/OpenNMT-py): a PyTorch NMT toolkit

## File structure

* `train.sh`: the training script with the configuration we used.
* `test.sh`: the testing script.
* The `train` and `test` directories were generated on the WMT14 en-de translation task.
  * `train/eval/log` records the approximate BLEU score on the development set during training.
  * `test/` contains the decoded development and test sets, for researchers interested in the translations generated by our model.

The processed WMT14 en-de dataset can be found at Transformer-AAN-Data. (The original files are downloaded from the Stanford NMT website.)

## Requirements

* Python: 2.7
* TensorFlow >= 1.4.1 (the version used for the experiments in our paper is 1.4.1)

## Training Parameters

```
batch_size=3125,device_list=[0],eval_steps=5000,train_steps=100000,save_checkpoint_steps=1500,shared_embedding_and_softmax_weights=true,shared_source_target_embedding=false,update_cycle=8,aan_mask=True,use_ffn=False
```

1. `train_steps`: the total number of training steps; we used 100000 in most experiments.
2. `eval_steps`: we compute the approximate BLEU score on the development set every 5000 training steps.
3. `shared_embedding_and_softmax_weights`: we share the target-side word embedding and the target-side pre-softmax parameters.
4. `shared_source_target_embedding`: we use separate source and target vocabularies, so the source-side and target-side word embeddings are not shared.
5. `aan_mask`:
    - This setting enables the mask-matrix multiplication for the accumulative-average computation (see the sketch in the AAN structure section above).
    - Without this setting, we use the native `tf.cumsum()` implementation.
    - In practice, the speed of the two implementations is similar.
    - For long target sentences, we recommend the native implementation because it is more memory-efficient.
6. `use_ffn`:
    - With this setting, the AAN model includes the FFN layer presented in Eq. (1) of our paper.
    - Why do we add this option? Because the FFN introduces many model parameters and significantly slows down our model.
    - Without the FFN, our AAN delivers very similar performance, as shown in Table 2 of our paper.
    - Furthermore, we find that in some cases removing the FFN even improves the AAN's performance.
7. `batch_size`, `device_list`, `update_cycle`: these control parallel training. For one training step, the procedure is as follows:

    ```
    for device_i in device_list:             # runs in parallel
        for cycle_i in range(update_cycle):  # runs in sequence
            train a batch of size `batch_size`
            collect gradients and costs
    update the model
    ```

    Therefore, the effective training batch size is `batch_size x len(device_list) x update_cycle` (3125 x 1 x 8 = 25000 with the configuration above).

    * In our paper we trained the model on one GPU card, so we set `device_list` to `[0]`. For researchers with more GPU cards available, we encourage you to decrease `update_cycle` and enlarge `device_list`; this improves training speed. In particular, training one model for WMT14 en-de with `batch_size=3125, device_list=[0,1,2,3,4,5,6,7], update_cycle=1` takes less than one day.

## Discussions

We have received several questions from other researchers, and we would like to share some of the discussion here.

1. Why can the AAN accelerate the Transformer by a factor of 4~7? *The acceleration is for the Transformer without a cache strategy.*

    In theory, suppose the source and target sentences have lengths `n_src` and `n_tgt` respectively, and the model dimension is `d`. For one step of the Transformer decoder, the original model has a computational complexity of `O(n_tgt * d^2 (self-attention) + n_src * d^2 (cross-attention) + d^2 (FFN))`. By contrast, the AAN has a computational complexity of `O(d^2 (AAN FFN+gate) + n_src * d^2 (cross-attention))`. Therefore, the theoretical acceleration is around `(n_tgt + n_src) / n_src`: the longer the target sentence, the larger the acceleration. A toy calculation of this ratio is given after this section.

* More discussions are welcome :).

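To make the quoted per-step cost terms concrete, here is a toy Python calculation; the model dimension and sentence lengths are arbitrary example values, constant factors are dropped, and it only illustrates how the theoretical ratio above grows with the target length.

```python
def transformer_step_cost(n_src, n_tgt, d):
    # self-attention + cross-attention + FFN terms quoted above
    return n_tgt * d ** 2 + n_src * d ** 2 + d ** 2

def aan_step_cost(n_src, d):
    # AAN FFN+gate + cross-attention terms quoted above
    return d ** 2 + n_src * d ** 2

d, n_src = 512, 30
for n_tgt in (10, 30, 90):
    ratio = float(transformer_step_cost(n_src, n_tgt, d)) / aan_step_cost(n_src, d)
    print("n_tgt=%d  speedup=%.2f" % (n_tgt, ratio))  # roughly (n_tgt + n_src) / n_src
```
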
## Citation

Please cite the following paper:

> Biao Zhang, Deyi Xiong and Jinsong Su. *Accelerating Neural Transformer via an Average Attention Network*. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.

```
@InProceedings{zhang-Etal:2018:ACL2018accelerating,
  author    = {Zhang, Biao and Xiong, Deyi and Su, Jinsong},
  title     = {Accelerating Neural Transformer via an Average Attention Network},
  booktitle = {Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics},
  month     = {July},
  year      = {2018},
  address   = {Melbourne, Australia},
  publisher = {Association for Computational Linguistics},
}
```

## Contact

For any further comments or questions about AAN, please email Biao Zhang.