transformer-lm

Category: GPT/ChatGPT
Development tool: Python
File size: 258KB
Downloads: 0
Upload date: 2021-02-27 15:23:53
Uploader: sh-1993
Description: Transformer language model (GPT-2) with sentencepiece tokenizer

File list (name, size in bytes, date):
.dockerignore (34, 2021-02-27)
.travis.yml (311, 2021-02-27)
Dockerfile (533, 2021-02-27)
lambda.py (1421, 2021-02-27)
lambda.sh (252, 2021-02-27)
lm (0, 2021-02-27)
lm\__init__.py (0, 2021-02-27)
lm\common.py (89, 2021-02-27)
lm\data.py (4578, 2021-02-27)
lm\fire_utils.py (1375, 2021-02-27)
lm\generate.py (576, 2021-02-27)
lm\gpt_2_tf (0, 2021-02-27)
lm\gpt_2_tf\__init__.py (0, 2021-02-27)
lm\gpt_2_tf\model.py (6983, 2021-02-27)
lm\gpt_2_tf\sample.py (2994, 2021-02-27)
lm\gpt_2_tf\train.py (11281, 2021-02-27)
lm\inference.py (9037, 2021-02-27)
lm\main.py (14185, 2021-02-27)
lm\model.py (7116, 2021-02-27)
lm_web_ui (0, 2021-02-27)
lm_web_ui\__init__.py (0, 2021-02-27)
lm_web_ui\main.py (4518, 2021-02-27)
lm_web_ui\requirements.txt (64, 2021-02-27)
lm_web_ui\templates (0, 2021-02-27)
lm_web_ui\templates\about.jinja2 (405, 2021-02-27)
lm_web_ui\templates\base.jinja2 (396, 2021-02-27)
lm_web_ui\templates\index.jinja2 (4493, 2021-02-27)
requirements.lambda.txt (120, 2021-02-27)
requirements.txt (90, 2021-02-27)
setup.py (628, 2021-02-27)
tests (0, 2021-02-27)
tests\__init__.py (0, 2021-02-27)
tests\shakespeare (0, 2021-02-27)
tests\shakespeare\test (0, 2021-02-27)
tests\shakespeare\test\macbeth.txt (100152, 2021-02-27)
tests\shakespeare\train (0, 2021-02-27)
tests\shakespeare\train\hamlet.txt (175132, 2021-02-27)
... ...

Training GPT-2 transformer language model with sentencepiece tokenizer
=======================================================================

.. image:: https://img.shields.io/travis/lopuhin/transformer-lm/master.svg
    :target: https://travis-ci.org/lopuhin/transformer-lm
    :alt: Build Status

Training a GPT-2 transformer language model on your own corpora with
`sentencepiece <https://github.com/google/sentencepiece>`_ tokenization.

This repo contains a PyTorch implementation of GPT-2, which supports multi-GPU
training. It also contains a TensorFlow implementation in ``lm/gpt_2_tf``,
which is no longer developed. Both share the same data preparation scripts.
The TF training command is ``gpt-2-tf-train`` and requires TensorFlow 1.13.
The documentation below is for the PyTorch version.

.. contents::

Installation
------------

Python 3.6+ is required, with torch nightly or 1.6.0+. Working in a virtualenv
is assumed below. Install the appropriate version of PyTorch first (see
https://pytorch.org/), and then::

    pip install -r requirements.txt
    python setup.py develop

Usage
-----

Instructions are below. See also ``tests/test_shakespeare.sh`` for a complete
pipeline demo on a small corpus (takes a minute on a CPU).

Prepare data for training
+++++++++++++++++++++++++

Corpus format: a directory with top-level ``train``, ``valid`` and ``test``
folders. Each top-level folder may contain sub-folders. Inside them, there
must be utf-8 encoded text files with the ``.txt`` extension.

The commands that train the sentencepiece model and encode the corpus support
multiple corpora; the examples below assume they can be listed as
``data/corpora-*``.

1. Train the sentencepiece model (``sp-text.txt`` can be removed after
   running). This can consume a large amount of memory; adjust the
   sentencepiece arguments as advised if needed (this is not supported in the
   ``sp-train`` command directly)::

       sp-train data/corpora-* sp-text.txt sp-model

2. Encode the corpora, producing numpy files::

       sp-encode data/corpora-* sp-model.model data/encoded

Training
++++++++

Example command::

    gpt-2 run-root data/encoded sp-model.model

``run-root`` would contain model checkpoints and json-lines logs, which can be
plotted in a jupyter notebook with ``json_log_plots.plot("run-root")``, with
the number of tokens seen on the X axis.

Default hyperparameters correspond to the released "small" GPT-2 model.

When multiple GPUs are available, they are used for training with the help of
``torch.distributed``.

If the run path already exists and the ``--clean`` flag is NOT passed,
training is resumed. Note that all parameters still need to be specified, and
model parameters need to match.

Notes on training parameters:

- ``--batch-size`` is per-GPU, so you don't need to re-tune it when changing
  the number of GPUs; just use the maximum that fits into memory.
- ``--g-accum-gradients`` is the global number of gradient accumulations; it
  must be divisible by the number of GPUs. The effective global batch size is
  always ``batch_size * g_accum_gradients``.
- ``--lr`` does not need to be changed when changing ``--batch-size``,
  ``--g-accum-gradients``, the number of GPUs or ``--n-ctx``: the loss is
  already scaled appropriately.

Inference
+++++++++

Example command::

    gpt-2-gen run-root "Artificial intelligence"

``run-root`` would contain model checkpoints. ``"Artificial intelligence"`` is
the text prefix used as a starting point for generating tokens.

Notes on inference parameters:

- ``--tokens-to-generate``: number of tokens to generate, default is 42.
- ``--top-k``: number of token candidates to generate for each position (beam
  width), default is 8 (a generic sampling sketch follows below).
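The ``--top-k`` option restricts generation at each step to the ``k``
highest-scoring candidate tokens. Below is a minimal, generic sketch of top-k
sampling in PyTorch; it illustrates the general technique only, not the exact
sampling code used by ``gpt-2-gen``, and the function name and toy logits are
made up for the example::

    import torch

    def sample_top_k(logits: torch.Tensor, top_k: int = 8) -> int:
        """Sample one token id from the ``top_k`` highest-scoring logits."""
        values, indices = torch.topk(logits, top_k)       # keep the k best candidates
        probs = torch.softmax(values, dim=-1)             # renormalise over those k
        choice = torch.multinomial(probs, num_samples=1)  # sample one of them
        return int(indices[choice])

    # Toy example: a fake next-token distribution over a 10-token vocabulary.
    print(sample_top_k(torch.randn(10), top_k=3))

Repeating this for each new position generates a continuation of the prefix
one token at a time.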
License & credits
-----------------

License is MIT.

The TensorFlow GPT-2 model is taken from
https://github.com/openai/gpt-2/blob/master/src/model.py and the TensorFlow
GPT-2 training code is based on
https://github.com/nshepperd/gpt-2/blob/finetuning/train.py

The PyTorch port is based on the original OpenAI code.

The test Shakespeare corpus under ``tests/shakespeare`` is from
http://shakespeare.mit.edu and is in the public domain.

See also the OpenAI GPT-2 paper and blog post.
