# transformer-aan
Source code for "Accelerating Neural Transformer via an Average Attention Network"
The source code is developed upon THUMT.
> The THUMT version used for the experiments in our paper was downloaded on Jan 11, 2018.
> Fork from [transformer-aan](https://github.com/bzhangGo/transformer-aan)
# About AAN Structure
We introduce two sub-layers for AAN in our ACL paper: an FFN layer (Eq. (1)) and a gating layer (Eq. (2)). However, after extensive experiments, we observe that **the FFN layer is redundant and can be removed without loss of translation quality**. Removing the FFN layer also reduces the number of model parameters, slightly improves training speed, and largely improves decoding speed.
**For re-implementation, we suggest that other researchers use the AAN model without the FFN sub-layer!** See how we [disable this layer](https://github.com/bzhangGo/transformer-aan/blob/master/code/thumt/models/transformer.py#L137).
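To make the two AAN sub-layers concrete, here is a minimal NumPy sketch of the cumulative-average operation and the gating sub-layer (Eq. (2)). This is not the repository's TensorFlow code; `W` and `b` are placeholder gate parameters, and both the `tf.cumsum()`-style and mask-matrix variants of the average are shown for comparison.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cumulative_average(y):
    """Running mean over target positions (the `tf.cumsum()` variant).
    y: [n_tgt, d]; position t attends uniformly to positions 1..t."""
    steps = np.arange(1, y.shape[0] + 1)[:, None]
    return np.cumsum(y, axis=0) / steps

def cumulative_average_mask(y):
    """Equivalent mask-matrix variant (the `aan_mask=True` setting):
    row t of the lower-triangular mask averages positions 1..t."""
    n = y.shape[0]
    mask = np.tril(np.ones((n, n))) / np.arange(1, n + 1)[:, None]
    return mask @ y

def aan_layer(y, W, b):
    """Gating sub-layer (Eq. (2)): gates are computed from the
    concatenation [y_t; g_t] and mix the layer input with the
    cumulative average. W ([2d, 2d]) and b ([2d]) are hypothetical
    placeholders, not the repository's variables."""
    g = cumulative_average(y)
    d = y.shape[1]
    gates = sigmoid(np.concatenate([y, g], axis=1) @ W + b)
    i, f = gates[:, :d], gates[:, d:]
    return i * y + f * g
```

The two average implementations produce identical results; the mask variant materializes an `n_tgt x n_tgt` matrix, which is why the README recommends the cumsum variant for long target sentences.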
# Other Implementations
* [Marian](https://github.com/marian-nmt/marian): an efficient NMT toolkit implemented in C++.
* [Neutron](https://github.com/anoidgit/transformer): a PyTorch NMT toolkit.
* [translate](https://github.com/pytorch/translate): a fairseq-based NMT toolkit.
* [OpenNMT](https://github.com/OpenNMT/OpenNMT-py): a PyTorch NMT toolkit.
## File structure:
`train.sh`: the training script with the configuration we used.
`test.sh`: the testing script.
The `train` and `test` directories were generated on the WMT14 en-de translation task.
* `train/eval/log` records the approximate BLEU score on the development set during training.
* `test/` contains the decoded development and test sets, for researchers interested in the translations generated by our model.
The processed WMT14 en-de dataset can be found at Transformer-AAN-Data. (The original files were downloaded from the Stanford NMT website.)
## Requirements
* Python: 2.7
* TensorFlow >= 1.4.1 (the version used for the experiments in our paper is 1.4.1)
## Training Parameters
```
batch_size=3125,device_list=[0],eval_steps=5000,train_steps=100000,save_checkpoint_steps=1500,shared_embedding_and_softmax_weights=true,shared_source_target_embedding=false,update_cycle=8,aan_mask=True,use_ffn=False
```
1. train_steps: the total number of training steps; we used 100000 in most experiments.
2. eval_steps: we compute the approximate BLEU score on the development set every 5000 training steps.
3. shared_embedding_and_softmax_weights: we shared the target-side word embedding and the target-side pre-softmax parameters.
4. shared_source_target_embedding: we used separate source and target vocabularies, so the source-side and target-side word embeddings were not shared.
5. aan_mask:
	- This setting enables the mask-matrix multiplication for the cumulative-average computation.
	- Without this setting, we use the native `tf.cumsum()` implementation.
	- In practice, both implementations run at a similar speed.
	- For long target sentences, we recommend the native implementation, because it is more memory-efficient.
6. use_ffn:
	- With this setting, the AAN model includes the FFN layer as presented in Eq. (1) in our paper.
	- Why do we provide this option? Because the FFN layer introduces many model parameters and significantly slows down our model.
	- Without the FFN layer, AAN achieves very similar performance, as shown in Table 2 in our paper.
	- Moreover, we surprisingly find that in some cases removing the FFN layer improves AAN's performance.
7. batch_size, device_list, update_cycle: these control parallel training. One training step proceeds as follows:
```
for device_i in device_list:            # runs in parallel
    for cycle_i in range(update_cycle): # runs in sequence
        train a batch of size `batch_size`
        collect gradients and costs
update the model
```
Therefore, the actual training batch size is `batch_size x len(device_list) x update_cycle`.
* In our paper, we trained the model on one GPU, so we set device_list to [0]. For researchers with more GPUs available, we encourage you to reduce update_cycle and extend device_list, which improves training speed. In particular, training one model for WMT14 en-de with `batch_size=3125, device_list=[0,1,2,3,4,5,6,7], update_cycle=1` takes less than 1 day.
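The effective batch size arithmetic above can be checked directly. The sketch below is framework-agnostic and hypothetical, not THUMT's API; it only illustrates that the single-GPU and 8-GPU settings process the same amount of data per parameter update.

```python
def effective_batch_size(batch_size, device_list, update_cycle):
    """Data processed per parameter update: each device runs
    `update_cycle` sub-batches before the model is updated once."""
    return batch_size * len(device_list) * update_cycle

# The paper's single-GPU setting and the suggested 8-GPU setting
# are equivalent in effective batch size:
single = effective_batch_size(3125, [0], 8)
multi = effective_batch_size(3125, [0, 1, 2, 3, 4, 5, 6, 7], 1)
print(single, multi)  # prints 25000 25000
```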
## Discussions
We have received several questions from other researchers, and we'd like to share some of the discussion here.
1. Why can AAN accelerate the Transformer by a factor of 4~7?
*The acceleration is measured against the Transformer without a caching strategy.*
In theory, suppose the source and target sentences have lengths `n_src` and `n_tgt` respectively, and the model dimension is `d`. One step of the original Transformer decoder has a computational complexity of `O([n_tgt d^2] (self-attention) + [n_src d^2] (cross-attention) + [d^2] (FFN))`. By contrast, one AAN decoder step costs `O([d^2] (AAN FFN+Gate) + [n_src d^2] (cross-attention))`.
Therefore, the theoretical acceleration is around `(n_tgt + n_src) / n_src`: the longer the target sentence, the larger the acceleration.
* More discussions are welcome :).
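As a back-of-the-envelope check of the ratio above (keeping only the dominant `d^2` terms; this is a sketch, not code from the repository):

```python
def predicted_speedup(n_src, n_tgt):
    """Per-step ratio from the discussion above:
    Transformer step ~ (n_tgt + n_src) * d^2 vs. AAN step ~ n_src * d^2."""
    return (n_tgt + n_src) / n_src

# Longer target sentences give a larger predicted speedup:
for n_tgt in (30, 60, 90):
    print(n_tgt, predicted_speedup(30, n_tgt))  # ratios 2.0, 3.0, 4.0
```

The observed 4~7x speedup also reflects constant factors that this per-step ratio ignores, such as the cost of recomputing the target-side prefix at every decoding step in the uncached Transformer.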
## Citation
Please cite the following paper:
> Biao Zhang, Deyi Xiong and Jinsong Su. *Accelerating Neural Transformer via an Average Attention Network*. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.
```
@InProceedings{zhang-Etal:2018:ACL2018accelerating,
  author    = {Zhang, Biao and Xiong, Deyi and Su, Jinsong},
  title     = {Accelerating Neural Transformer via an Average Attention Network},
  booktitle = {Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics},
  month     = {July},
  year      = {2018},
  address   = {Melbourne, Australia},
  publisher = {Association for Computational Linguistics},
}
```
## Contact
For any further comments or questions about AAN, please email
Biao Zhang.