dlrm-nvidia

Category: GPU/Graphics cards
Development tool: Python
File size: 1022KB
Downloads: 0
Upload date: 2021-09-29 17:11:30
Uploader: sh-1993
Description: DLRM by NVIDIA
(dlrm-nvidia)

File list:
Dockerfile (1004, 2021-09-30)
Dockerfile_old (402, 2021-09-30)
Dockerfile_preprocessing (2953, 2021-09-30)
LICENSE (11356, 2021-09-30)
NOTICE (129, 2021-09-30)
bind.sh (6647, 2021-09-30)
dgxa100_ccx.sh (909, 2021-09-30)
dlrm (0, 2021-09-30)
dlrm\__init__.py (0, 2021-09-30)
dlrm\cuda_ext (0, 2021-09-30)
dlrm\cuda_ext\__init__.py (166, 2021-09-30)
dlrm\cuda_ext\dot_based_interact.py (1601, 2021-09-30)
dlrm\cuda_ext\fused_gather_embedding.py (1573, 2021-09-30)
dlrm\cuda_ext\sparse_embedding.py (2591, 2021-09-30)
dlrm\cuda_src (0, 2021-09-30)
dlrm\cuda_src\dot_based_interact_ampere (0, 2021-09-30)
dlrm\cuda_src\dot_based_interact_ampere\dot_based_interact.cu (35975, 2021-09-30)
dlrm\cuda_src\dot_based_interact_ampere\dot_based_interact_fp32.cu (15965, 2021-09-30)
dlrm\cuda_src\dot_based_interact_ampere\dot_based_interact_pytorch_types.cu (3157, 2021-09-30)
dlrm\cuda_src\dot_based_interact_ampere\dot_based_interact_tf32.cu (37857, 2021-09-30)
dlrm\cuda_src\dot_based_interact_ampere\pytorch_ops.cpp (596, 2021-09-30)
dlrm\cuda_src\dot_based_interact_ampere\shared_utils.cuh (1140, 2021-09-30)
dlrm\cuda_src\dot_based_interact_volta (0, 2021-09-30)
dlrm\cuda_src\dot_based_interact_volta\dot_based_interact.cu (52763, 2021-09-30)
dlrm\cuda_src\dot_based_interact_volta\dot_based_interact_pytorch_types.cu (3064, 2021-09-30)
dlrm\cuda_src\dot_based_interact_volta\pytorch_ops.cpp (596, 2021-09-30)
dlrm\cuda_src\gather_gpu_fused.cu (11774, 2021-09-30)
dlrm\cuda_src\gather_gpu_fused_pytorch_impl.cu (4854, 2021-09-30)
dlrm\cuda_src\pytorch_embedding_ops.cpp (1240, 2021-09-30)
dlrm\cuda_src\sparse_gather (0, 2021-09-30)
dlrm\cuda_src\sparse_gather\common.h (930, 2021-09-30)
dlrm\cuda_src\sparse_gather\gather_gpu.cu (6617, 2021-09-30)
dlrm\cuda_src\sparse_gather\sparse_pytorch_ops.cpp (813, 2021-09-30)
dlrm\data (0, 2021-09-30)
dlrm\data\__init__.py (0, 2021-09-30)
dlrm\data\data_loader.py (2170, 2021-09-30)
dlrm\data\datasets.py (10806, 2021-09-30)
... ...

# DLRM For PyTorch

This repository provides a script and recipe to train the Deep Learning Recommendation Model (DLRM) to achieve state-of-the-art accuracy, and is tested and maintained by NVIDIA.

## Table Of Contents

* [Model overview](#model-overview)
  + [Model architecture](#model-architecture)
  + [Default configuration](#default-configuration)
  + [Feature support matrix](#feature-support-matrix)
    - [Features](#features)
  + [Mixed precision training](#mixed-precision-training)
    - [Enabling mixed precision](#enabling-mixed-precision)
    - [Enabling TF32](#enabling-tf32)
  + [Hybrid-parallel multi-GPU with all-2-all communication](#hybrid-parallel-multi-gpu-with-all-2-all-communication)
    - [Embedding table placement and load balancing](#embedding-table-placement-and-load-balancing)
  + [Preprocessing on GPU](#preprocessing-on-gpu)
* [Setup](#setup)
  + [Requirements](#requirements)
* [Quick Start Guide](#quick-start-guide)
* [Advanced](#advanced)
  + [Scripts and sample code](#scripts-and-sample-code)
  + [Parameters](#parameters)
  + [Command-line options](#command-line-options)
  + [Getting the data](#getting-the-data)
    - [Dataset guidelines](#dataset-guidelines)
    - [Multi-dataset](#multi-dataset)
    - [Preprocessing](#preprocessing)
      * [NVTabular](#nvtabular)
      * [Spark](#spark)
  + [Training process](#training-process)
  + [Inference process](#inference-process)
  + [Deploying DLRM Using NVIDIA Triton Inference Server](#deploying-dlrm-using-nvidia-triton-inference-server)
* [Performance](#performance)
  + [Benchmarking](#benchmarking)
    - [Training performance benchmark](#training-performance-benchmark)
    - [Inference performance benchmark](#inference-performance-benchmark)
  + [Results](#results)
    - [Training accuracy results](#training-accuracy-results)
      * [Training accuracy: NVIDIA DGX A100 (8x A100 80GB)](#training-accuracy-nvidia-dgx-a100-8x-a100-80gb)
      * [Training accuracy: NVIDIA DGX-1 (8x V100 32GB)](#training-accuracy-nvidia-dgx-1-8x-v100-32gb)
      * [Training accuracy plots](#training-accuracy-plots)
      * [Training stability test](#training-stability-test)
      * [Impact of mixed precision on training accuracy](#impact-of-mixed-precision-on-training-accuracy)
    - [Training performance results](#training-performance-results)
      * [Training performance: NVIDIA DGX A100 (8x A100 80GB)](#training-performance-nvidia-dgx-a100-8x-a100-80gb)
      * [Training performance: NVIDIA DGX A100 (8x A100 40GB)](#training-performance-nvidia-dgx-a100-8x-a100-40gb)
      * [Training performance: NVIDIA DGX-1 (8x V100 32GB)](#training-performance-nvidia-dgx-1-8x-v100-32gb)
      * [Training performance: NVIDIA DGX-2 (16x V100 32GB)](#training-performance-nvidia-dgx-2-16x-v100-32gb)
* [Release notes](#release-notes)
  + [Changelog](#changelog)
  + [Known issues](#known-issues)

## Model overview

The Deep Learning Recommendation Model (DLRM) is a recommendation model designed to make use of both categorical and numerical inputs. It was first described in [Deep Learning Recommendation Model for Personalization and Recommendation Systems](https://arxiv.org/abs/1906.00091). This repository provides a reimplementation of the codebase originally published [here](https://github.com/facebookresearch/dlrm). The scripts provided enable you to train DLRM on the [Criteo Terabyte Dataset](https://labs.criteo.com/2013/12/download-terabyte-click-logs/).

Using the scripts provided here, you can efficiently train models that are too large to fit into a single GPU. This is possible because we use a hybrid-parallel approach, which combines model parallelism for the embedding tables with data parallelism for the Top MLP. This is explained in detail in the [following sections](#hybrid-parallel-multi-gpu-with-all-2-all-communication).

This model uses a slightly different preprocessing procedure than the one found in the original implementation. You can find a detailed description of the preprocessing steps in the [Dataset guidelines](#dataset-guidelines) section.

Using DLRM, you can train a high-quality general model for providing recommendations.

This model is trained with mixed precision using Tensor Cores on Volta, Turing, and NVIDIA Ampere GPU architectures. Therefore, researchers can get results up to 3.3x faster than training without Tensor Cores while experiencing the benefits of mixed precision training. It is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.
### Model architecture

DLRM accepts two types of features: categorical and numerical. For each categorical feature, an embedding table is used to provide a dense representation of each unique value. The dense features enter the model and are transformed by a simple neural network referred to as the "bottom MLP". This part of the network consists of a series of linear layers with ReLU activations. The output of the bottom MLP and the embedding vectors are then fed into the "dot interaction" operation, which computes the dot product between each pair of these vectors. The output of "dot interaction" is then concatenated with the features resulting from the bottom MLP and fed into the "top MLP", which is also a series of dense layers with activations. The model outputs a single number which can be interpreted as the likelihood of a certain user clicking an ad.


Figure 1. The architecture of DLRM.
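To make the data flow concrete, the following is a minimal PyTorch sketch of the architecture described above. Layer widths, feature counts, and the class name are illustrative assumptions, not the repository's actual configuration.

```python
import torch
import torch.nn as nn

class DLRMSketch(nn.Module):
    def __init__(self, num_numerical=13, categorical_sizes=(1000, 1000), dim=16):
        super().__init__()
        # One embedding table per categorical feature.
        self.embeddings = nn.ModuleList(
            nn.Embedding(cardinality, dim) for cardinality in categorical_sizes
        )
        # Bottom MLP: projects numerical features to the embedding dimension.
        self.bottom_mlp = nn.Sequential(
            nn.Linear(num_numerical, 64), nn.ReLU(), nn.Linear(64, dim), nn.ReLU()
        )
        num_vectors = len(categorical_sizes) + 1  # embeddings + bottom MLP output
        num_pairs = num_vectors * (num_vectors - 1) // 2
        # Top MLP: consumes the pairwise dot products concatenated with the
        # bottom MLP output and emits a single click-likelihood logit.
        self.top_mlp = nn.Sequential(
            nn.Linear(num_pairs + dim, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, numerical, categorical):
        bottom_out = self.bottom_mlp(numerical)                      # (B, dim)
        vectors = [bottom_out] + [
            emb(categorical[:, k]) for k, emb in enumerate(self.embeddings)
        ]
        stacked = torch.stack(vectors, dim=1)                        # (B, F, dim)
        dots = torch.bmm(stacked, stacked.transpose(1, 2))           # (B, F, F)
        # "Dot interaction": keep each unordered pair's dot product once.
        i, j = torch.triu_indices(stacked.size(1), stacked.size(1), offset=1)
        interactions = dots[:, i, j]                                 # (B, F*(F-1)/2)
        return self.top_mlp(torch.cat([interactions, bottom_out], dim=1))

model = DLRMSketch()
logit = model(torch.randn(32, 13), torch.randint(0, 1000, (32, 2)))  # (32, 1)
```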

### Default configuration

The following features were implemented in this model:
- general
  - static loss scaling for Tensor Cores (mixed precision) training
  - hybrid-parallel multi-GPU training
- preprocessing
  - dataset preprocessing using Spark 3 on GPUs
  - dataset preprocessing using NVTabular on GPUs

### Feature support matrix

The following features are supported by this model:

| Feature                                  | DLRM |
|------------------------------------------|------|
| Automatic mixed precision (AMP)          | yes  |
| Hybrid-parallel multi-GPU with all-2-all | yes  |
| Preprocessing on GPU with NVTabular      | yes  |
| Preprocessing on GPU with Spark 3        | yes  |

#### Features

Automatic Mixed Precision (AMP) - enables mixed precision training without any changes to the code-base by performing automatic graph rewrites and loss scaling controlled by an environment variable.

Multi-GPU training with PyTorch distributed - our model uses `torch.distributed` to implement efficient multi-GPU training with NCCL. For details, see example sources in this repository or see the [PyTorch Tutorial](https://pytorch.org/tutorials/intermediate/dist_tuto.html).

Preprocessing on GPU with NVTabular - Criteo dataset preprocessing can be conducted using [NVTabular](https://github.com/NVIDIA/NVTabular). For more information on the framework, see [Announcing the NVIDIA NVTabular Open Beta with Multi-GPU Support and New Data Loaders](https://developer.nvidia.com/blog/announcing-the-nvtabular-open-beta-with-multi-gpu-support-and-new-data-loaders/).

Preprocessing on GPU with Spark 3 - Criteo dataset preprocessing can be conducted using [Apache Spark 3.0](https://spark.apache.org/). For more information on the framework and how to leverage GPUs for preprocessing, see [Accelerating Apache Spark 3.0 with GPUs and RAPIDS](https://developer.nvidia.com/blog/accelerating-apache-spark-3-0-with-gpus-and-rapids/).

### Mixed precision training

Mixed precision is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in the half-precision floating-point format while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores) in Volta, and following with both the Turing and Ampere architectures, significant training speedups are experienced by switching to mixed precision - up to 3.3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:
1. Porting the model to use the FP16 data type where appropriate.
2. Adding loss scaling to preserve small gradient values.

The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in [CUDA 8](https://devblogs.nvidia.com/parallelforall/tag/fp16/) in the NVIDIA Deep Learning SDK.

For information about:
- How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) documentation.
- Techniques used for mixed precision training, see the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) blog.
- APEX tools for mixed precision training, see the [NVIDIA Apex: Tools for Easy Mixed-Precision Training in PyTorch](https://devblogs.nvidia.com/apex-pytorch-easy-mixed-precision-training/) blog.

#### Enabling mixed precision

Mixed precision training is turned off by default. To turn it on, pass the `--amp` flag to the `main.py` script.

#### Enabling TF32

TensorFloat-32 (TF32) is the new math mode in [NVIDIA A100](https://www.nvidia.com/en-us/data-center/a100/) GPUs for handling the matrix math also called tensor operations. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on Volta GPUs. TF32 Tensor Cores can speed up networks using FP32, typically with no loss of accuracy. It is more robust than FP16 for models that require a high dynamic range for weights or activations. For more information, refer to the [TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/) blog post. TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by default.

### Hybrid-parallel multi-GPU with all-2-all communication

Many recommendation models contain very large embedding tables. As a result, the model is often too large to fit onto a single device. This could be easily solved by training in a model-parallel way, using either the CPU or other GPUs as "memory donors". However, this approach is suboptimal as the "memory donor" devices' compute is not utilized. In this repository, we use the model-parallel approach for the bottom part of the model (Embedding Tables + Bottom MLP) while using the usual data-parallel approach for the top part of the model (Dot Interaction + Top MLP). This way, we can train models much larger than what would normally fit into a single GPU while at the same time making the training faster by using multiple GPUs. We call this approach hybrid-parallel.

The transition from model-parallel to data-parallel in the middle of the neural net needs a specific multi-GPU communication pattern called [all-2-all](https://en.wikipedia.org/wiki/All-to-all_\(parallel_pattern\)), which is available in our [PyTorch 21.04-py3](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch/tags) NGC Docker container. In the [original DLRM whitepaper](https://arxiv.org/abs/1906.00091), this has also been referred to as "butterfly shuffle".
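As an illustration of this pattern, here is a minimal sketch built on `torch.distributed.all_to_all`. It assumes the process group is already initialized with the NCCL backend, that the global batch size divides evenly by the world size, and that every rank holds the same number of embedding tables; the function name and shapes are hypothetical, not the repository's actual implementation.

```python
import torch
import torch.distributed as dist

def exchange_embeddings(local_emb: torch.Tensor, world_size: int) -> torch.Tensor:
    """local_emb: (global_batch, local_tables, dim) -- vectors from this rank's
    tables for the whole batch. Returns (global_batch / world_size,
    local_tables * world_size, dim) -- vectors from every table, but only for
    this rank's slice of the batch."""
    # Split the batch dimension into one chunk per destination rank.
    inputs = list(local_emb.chunk(world_size, dim=0))
    outputs = [torch.empty_like(inputs[0]) for _ in range(world_size)]
    # Each rank sends chunk r to rank r and receives rank r's chunk in return --
    # the "butterfly shuffle" from the DLRM whitepaper.
    dist.all_to_all(outputs, inputs)
    # Stitch the received per-rank table groups together along the table axis.
    return torch.cat(outputs, dim=1)
```

After this exchange, each rank holds a complete set of feature vectors for its slice of the batch, so the dot interaction and top MLP can run in standard data-parallel fashion.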


In the example shown in this repository, we train models of three sizes: "small" (~15 GB), "large" (~82 GB), and "xlarge" (~142 GB). We use the hybrid-parallel approach for the "large" and "xlarge" models, as they do not fit in a single GPU.

#### Embedding table placement and load balancing

We use the following heuristic for dividing the work between the GPUs (a sketch of this heuristic appears after the Requirements section below):
- The Bottom MLP is placed on GPU-0 and no embedding tables are placed on this device.
- The tables are sorted from the largest to the smallest.
- Set `max_tables_per_gpu := ceil(number_of_embedding_tables / number_of_available_gpus)`.
- Repeat until all embedding tables have an assigned device:
  - Out of all the available GPUs, find the one with the largest amount of unallocated memory.
  - Place the largest unassigned embedding table on this GPU. Raise an exception if it does not fit.
  - If the number of embedding tables on this GPU is now equal to `max_tables_per_gpu`, remove this GPU from the list of available GPUs so that no more embedding tables will be placed on it. This ensures the all-2-all communication is well balanced between all devices.

### Preprocessing on GPU

Please refer to [the "Preprocessing" section](#preprocessing) for a detailed description of the Apache Spark 3.0 and NVTabular GPU functionality.

## Setup

The following section lists the requirements for training DLRM.

### Requirements

This repository contains a Dockerfile which extends the PyTorch NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
- [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
- [PyTorch 21.04-py3](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch/tags) NGC container
- Supported GPUs:
  - [NVIDIA Volta architecture](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
  - [NVIDIA Turing architecture](https://www.nvidia.com/en-us/design-visualization/technologies/turing-architecture/)
  - [NVIDIA Ampere architecture](https://www.nvidia.com/en-us/data-center/nvidia-ampere-gpu-architecture/)

For more information about how to get started with NGC containers, see the following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning Documentation:
- [Getting Started Using NVIDIA GPU Cloud](https://docs.nvidia.com/ngc/ngc-getting-started-guide/index.html)
- [Accessing And Pulling From The NGC Container Registry](https://docs.nvidia.com/deeplearning/frameworks/user-guide/index.html#accessing_registry)
- [Running PyTorch](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/running.html#running)

For those unable to use the PyTorch NGC container, to set up the required environment or create your own container, see the versioned [NVIDIA Container Support Matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html).
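Returning to the embedding table placement heuristic described above, here is a minimal Python sketch of that greedy procedure. The function name and the representation of table sizes and GPU capacities are hypothetical; the repository's actual placement logic may differ in detail. GPU-0 is assumed to have been reserved for the Bottom MLP before this function is called, so `gpu_capacities` covers only the GPUs that accept tables.

```python
import math

def place_tables(table_sizes, gpu_capacities):
    """Greedy placement: largest table goes to the GPU with the most free
    memory, capped at max_tables_per_gpu tables per device."""
    num_gpus = len(gpu_capacities)
    max_tables_per_gpu = math.ceil(len(table_sizes) / num_gpus)
    free = list(gpu_capacities)   # unallocated memory per GPU
    counts = [0] * num_gpus       # tables already placed per GPU
    placement = {}
    # Visit tables from the largest to the smallest, keeping original indices.
    for t in sorted(range(len(table_sizes)), key=lambda t: -table_sizes[t]):
        # Among GPUs still accepting tables, pick the one with the most free memory.
        candidates = [g for g in range(num_gpus) if counts[g] < max_tables_per_gpu]
        g = max(candidates, key=lambda g: free[g])
        if table_sizes[t] > free[g]:
            raise RuntimeError(f"embedding table {t} does not fit on any available GPU")
        placement[t] = g
        free[g] -= table_sizes[t]
        counts[g] += 1
    return placement

# Example: four tables spread over two GPUs with 10 GB of free memory each.
print(place_tables([6e9, 5e9, 3e9, 1e9], [10e9, 10e9]))  # {0: 0, 1: 1, 2: 1, 3: 0}
```

Capping each device at `max_tables_per_gpu` is what keeps the subsequent all-2-all exchange balanced: no rank ends up sending or receiving disproportionately many embedding vectors.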
## Quick Start Guide

To train your model using mixed or TF32 precision with Tensor Cores, or using FP32, perform the following steps using the default parameters of DLRM on the Criteo Terabyte dataset. For the specifics concerning training and inference, see the [Advanced](#advanced) section.

1. Clone the repository.
```
git clone https://github.com/NVIDIA/DeepLearningExamples
cd DeepLearningExamples/PyTorch/Recommendation/DLRM
```

2. Download the dataset.

You can download the data by following the instructions at: http://labs.criteo.com/2013/12/download-terabyte-click-logs/. When you have successfully downloaded and unpacked it, set `CRITEO_DATASET_PARENT_DIRECTORY` to its parent directory:
```
CRITEO_DATASET_PARENT_DIRECTORY=/raid/dlrm
```
We recommend choosing the fastest possible file system; otherwise, it may lead to an I/O bottleneck.

3. Build the DLRM Docker containers.
```bash
docker build -t nvidia_dlrm_pyt .
docker build -t nvidia_dlrm_preprocessing -f Dockerfile_preprocessing . --build-arg DGX_VERSION=[DGX-2|DGX-A100]
```

4. Start an interactive session in the NGC container to run preprocessing. The DLRM PyTorch container can be launched with:
```bash
docker run --runtime=nvidia -it --rm --ipc=host -v ${CRITEO_DATASET_PARENT_DIRECTORY}:/data/dlrm nvidia_dlrm_preprocessing bash
```

5. Preprocess the dataset.

Here are a few examples of different preprocessing commands. For details on how those scripts work, a detailed description of the dataset types (small FL=15, large FL=3, xlarge FL=2), training possibilities, and all the parameters, consult the [preprocessing section](#preprocessing). Depending on the dataset type (small FL=15, large FL=3, xlarge FL=2), run one of the following commands:

5.1. Preprocess to the small dataset (FL=15) with Spark GPU:
```bash
cd /workspace/dlrm/preproc
./prepare_dataset.sh 15 GPU Spark
```

5.2. Preprocess to the large dataset (FL=3) with Spark GPU:
```bash
cd /workspace/dlrm/preproc
./prepare_dataset.sh 3 GPU Spark
```

5.3. Preprocess to the xlarge dataset (FL=2) with Spark GPU:
```bash
cd /workspace/dlrm/preproc
./prepare_dataset.sh 2 GPU Spark
```

6. Start training.

- First, start the Docker container (the `--security-opt seccomp=unconfined` option is needed to take full advantage of processor affinity in multi-GPU training):
```bash
docker run --security-opt seccomp=unconfined --runtime=nvidia -it --rm --ipc=host -v ${PWD}/data:/data nvidia_dlrm_pyt bash
```
- single-GPU:
```bash
python -m dlrm.scripts.main --mode train --dataset /data/dlrm/binary_dataset/
```
- multi-GPU for DGX A100:
```bash
python -m torch.distributed.launch --no_python --use_env --nproc_per_node 8 \
    bash -c './bind.sh --cpu=dgxa100_ccx.sh --mem=dgxa100_ccx.sh python -m dlrm.scripts.dist_main \
    --dataset /data/dlrm/binary_dataset/ --seed 0 --epochs 1 --amp'
```
- multi-GPU for DGX-1 and DGX-2:
```bash
python -m torch.distributed.launch --no_python --use_env --nproc_per_node 8 \
    bash -c './bind.sh --cpu=exclusive -- python -m dlrm.scripts.dist_main \
    --dataset /data/dlrm/binary_dataset/ --seed 0 --epochs 1 --amp'
```

7. Start validation/evaluation.

If you want to run validation or evaluation, you can either:
- use the checkpoint obtained from the training commands above, or
- download the pretrained checkpoint from NGC. To download the checkpoint from NGC, visit the ngc.nvidia.com website and browse the available models.
Download the checkpoint files and unzip them to some path, for example, to `$CRITEO_DATASET_PARENT_DIRECTORY/checkpoints/`. The checkpoint requires around 15GB of disk space.

Commands:
- single-GPU:
```bash
python -m dlrm.scripts.main --mode test --dataset /data/dlrm/binary_dataset/ \
    --load_checkpoint_path $CRITEO_DATASET_PARENT_DIRECTORY/checkpoints/checkpoint
```
- multi-GPU for DGX A100:
```bash
python -m torch.distributed.launch --no_python --use_env --nproc_per_node 8 \
    bash -c './bind.sh --cpu=dgxa100_ccx.sh --mem=dgxa100_ccx.sh python -m dlrm.scripts.dist_main \
    --dataset /data/dlrm/binary_dataset/ --seed 0 --epochs 1 --amp \
    --load_checkpoint_path $CRITEO_DATASET_PARENT_DIRECTORY/checkpoints/checkpoint'
```
- multi-GPU for DGX-1 and DGX-2:
```bash
python -m torch.distributed.launch --no_python --use_env --nproc_per_node 8 \
    bash -c './bind.sh --cpu=exclusive -- python -m dlrm.scripts.dist_main \
    --dataset /data/dlrm/binary_dataset/ --seed 0 --epochs 1 --amp \
    --load_checkpoint_path $CRITEO_DATASET_PARENT_DIRECTORY/checkpoints/checkpoint'
```

## Advanced

The following sections provide greater details of the dataset, running training and inference, and the training results.

### Scripts and sample code

The `dlrm/scripts/main.py` script provides an entry point to most of the functionality in a single-GPU setting. Using different command-line flags allows you to run training, validation, and benchmark both training and inference on real or synthetic data.

Analogously, the `dlrm/scripts/dist_main.py` script provides an entry point for the functionality in a multi-GPU setting. It us ... ...
