
Upload date: 2021-02-27 15:23:53
Uploader: sh-1993
Description: Transformer language model (GPT-2) with sentencepiece tokenizer

.dockerignore (34, 2021-02-27)
.travis.yml (311, 2021-02-27)
Dockerfile (533, 2021-02-27) (1421, 2021-02-27) (252, 2021-02-27)
lm (0, 2021-02-27)
lm\ (0, 2021-02-27)
lm\ (89, 2021-02-27)
lm\ (4578, 2021-02-27)
lm\ (1375, 2021-02-27)
lm\ (576, 2021-02-27)
lm\gpt_2_tf (0, 2021-02-27)
lm\gpt_2_tf\ (0, 2021-02-27)
lm\gpt_2_tf\ (6983, 2021-02-27)
lm\gpt_2_tf\ (2994, 2021-02-27)
lm\gpt_2_tf\ (11281, 2021-02-27)
lm\ (9037, 2021-02-27)
lm\ (14185, 2021-02-27)
lm\ (7116, 2021-02-27)
lm_web_ui (0, 2021-02-27)
lm_web_ui\ (0, 2021-02-27)
lm_web_ui\ (4518, 2021-02-27)
lm_web_ui\requirements.txt (64, 2021-02-27)
lm_web_ui\templates (0, 2021-02-27)
lm_web_ui\templates\about.jinja2 (405, 2021-02-27)
lm_web_ui\templates\base.jinja2 (396, 2021-02-27)
lm_web_ui\templates\index.jinja2 (4493, 2021-02-27)
requirements.lambda.txt (120, 2021-02-27)
requirements.txt (90, 2021-02-27) (628, 2021-02-27)
tests (0, 2021-02-27)
tests\ (0, 2021-02-27)
tests\shakespeare (0, 2021-02-27)
tests\shakespeare\test (0, 2021-02-27)
tests\shakespeare\test\macbeth.txt (100152, 2021-02-27)
tests\shakespeare\train (0, 2021-02-27)
tests\shakespeare\train\hamlet.txt (175132, 2021-02-27)
... ...

Training GPT-2 transformer language model with sentencepiece tokenizer
======================================================================

Train a GPT-2 transformer language model on your own corpora with
sentencepiece tokenization.

This repo contains a PyTorch implementation of GPT-2 that supports
multi-GPU training. It also contains a TensorFlow implementation in
``lm/gpt_2_tf``, which is no longer developed. Both implementations share
the same data preparation scripts. The TF training command is
``gpt-2-tf-train`` and needs TensorFlow 1.13. The documentation below is
for the PyTorch version.

.. contents::

Installation
------------

Python 3.6+ is required with torch nightly or 1.6.0+.
Working in a virtualenv is assumed below.
Install the appropriate version of PyTorch first, and then::

    pip install -r requirements.txt
    python setup.py develop

Usage
-----

Instructions are below. See also ``tests/`` for a complete pipeline demo
on a small corpus (takes a minute on a CPU).

Prepare data for training
+++++++++++++++++++++++++

Corpus format: a directory with top-level ``train``, ``valid`` and ``test``
folders. Each top-level folder may contain sub-folders. Inside them, there
must be UTF-8 encoded text files with the ``.txt`` extension.

The commands that train the sentencepiece model and encode the corpus
support multiple corpora. The examples below assume the corpora can be
listed as ``data/corpora-*``.

1. Train the sentencepiece model (``sp-text.txt`` can be removed after
   running). This can consume a large amount of memory; adjust the
   sentencepiece arguments as advised if needed (this is not supported in
   the ``sp-train`` command directly)::

       sp-train data/corpora-* sp-text.txt sp-model

2. Encode the corpora, producing numpy files::

       sp-encode data/corpora-* sp-model.model data/encoded

Training
++++++++

Example command::

    gpt-2 run-root data/encoded sp-model.model

``run-root`` will contain model checkpoints and json-lines logs, which can
be plotted in a Jupyter notebook with ``json_log_plots.plot("run-root")``,
with the number of tokens seen on the X axis.

Default hyperparameters correspond to the released "small" GPT-2 model.

When multiple GPUs are available, they are used for training with the help
of ``torch.distributed``.

If the path exists and the ``--clean`` key is NOT passed, training is
resumed. Note that all parameters still need to be specified, and model
parameters need to match.

Notes on training parameters:

- ``--batch-size`` is per-GPU, so you don't need to re-tune it when
  changing the number of GPUs; just use the maximum that fits into memory.
- ``--g-accum-gradients`` is the global number of gradient accumulations;
  it must be divisible by the number of GPUs. The effective global batch
  size is always ``batch_size * g_accum_gradients``.
- ``--lr`` does not need to be changed when changing ``--batch-size``,
  ``--g-accum-gradients``, the number of GPUs or ``--n-ctx``: the loss is
  already scaled appropriately.

Inference
+++++++++

Example command::

    gpt-2-gen run-root "Artificial intelligence"

``run-root`` is the directory that contains the model checkpoints.
``"Artificial intelligence"`` is the text prefix used as a starting point
for generating tokens.

Notes on inference parameters:

- ``--tokens-to-generate``: number of tokens to generate, default is 42.
- ``--top-k``: number of token candidates to generate for each position
  (beam width), default is 8.

License & credits
-----------------

License is MIT.

The TensorFlow GPT-2 model is taken from, and the TensorFlow GPT-2 training
code is based on, external sources whose links are missing here. The
PyTorch port is based on the original OpenAI code.

The test Shakespeare corpus under ``tests/shakespeare`` is in the public
domain.

See also the OpenAI GPT-2 paper and blog post.
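
As a supplement to the notes on training parameters, the relationship between ``--batch-size``, ``--g-accum-gradients`` and the number of GPUs can be checked with a few lines of arithmetic. The sketch below is not code from this repository: the function name ``accumulation_plan`` and the returned keys are hypothetical, and it assumes that the global accumulation count is split evenly across GPUs, which is what the divisibility requirement suggests.

```python
def accumulation_plan(batch_size: int, g_accum_gradients: int, n_gpus: int) -> dict:
    """Illustrative arithmetic only (hypothetical helper, not part of the repo).

    batch_size        -- per-GPU batch size (as in --batch-size)
    g_accum_gradients -- global number of gradient accumulations
    n_gpus            -- number of GPUs taking part in training
    """
    if min(batch_size, g_accum_gradients, n_gpus) < 1:
        raise ValueError("all arguments must be positive integers")
    if g_accum_gradients % n_gpus != 0:
        raise ValueError("g_accum_gradients must be divisible by the number of GPUs")

    # Assumption: each GPU performs an equal share of the accumulations.
    per_gpu_accum_steps = g_accum_gradients // n_gpus

    return {
        "per_gpu_accum_steps": per_gpu_accum_steps,
        "samples_per_gpu_per_update": batch_size * per_gpu_accum_steps,
        # Matches the README formula: batch_size * g_accum_gradients.
        "effective_global_batch": batch_size * g_accum_gradients,
    }


if __name__ == "__main__":
    # Same settings on 2 and on 4 GPUs: the effective global batch is unchanged.
    for gpus in (2, 4):
        print(gpus, accumulation_plan(batch_size=4, g_accum_gradients=8, n_gpus=gpus))
```

Under this assumption the effective global batch depends only on the two command-line values and not on the GPU count, which is consistent with the note that ``--lr`` does not need re-tuning when the number of GPUs changes.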


