TorchRec DLRM Example

dlrm_main.py trains, validates, and tests a Deep Learning Recommendation Model (DLRM) with TorchRec. The DLRM model contains both data-parallel components (e.g. the multi-layer perceptrons and interaction arch) and model-parallel components (e.g. the embedding tables). Training is pipelined so that dataloading, the data-parallel to model-parallel communications, and the forward/backward passes are overlapped. It can be run with either a random dataloader or the Criteo 1TB Click Logs dataset.
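
The overlap described above comes from TorchRec's train pipeline. Below is a minimal sketch of the pipelined loop, assuming the sharded model, optimizer, device, and dataloader have been set up as in dlrm_main.py (that setup is elided here):

from torchrec.distributed.train_pipeline import TrainPipelineSparseDist

def train_one_epoch(model, optimizer, device, dataloader) -> None:
    # model: DLRM wrapped in DistributedModelParallel (sharded embedding tables)
    # optimizer: combined dense + fused sparse optimizer
    pipeline = TrainPipelineSparseDist(model, optimizer, device)
    batches = iter(dataloader)
    while True:
        try:
            # Each call overlaps the host-to-device copy and the input_dist
            # comms (data-parallel to model-parallel) for upcoming batches
            # with the forward/backward of the current batch.
            pipeline.progress(batches)
        except StopIteration:
            break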

It has been tested on the following cloud instance types:

| Cloud | Instance Type       | GPUs             | vCPUs | Memory (GB) |
|-------|---------------------|------------------|-------|-------------|
| AWS   | p4d.24xlarge        | 8 x A100 (40GB)  | 96    | 1152        |
| Azure | Standard_ND96asr_v4 | 8 x A100 (40GB)  | 96    | 900         |
| GCP   | a2-megagpu-16g      | 16 x A100 (40GB) | 96    | 1300        |

Running

Install dependencies

pip install tqdm torchmetrics

Torchx

We recommend using torchx to run. Here we use the dist.ddp builtin:

  1. pip install torchx
  2. (optional) set up a slurm or kubernetes cluster
  3. Run:
     • locally: torchx run -s local_cwd dist.ddp -j 1x2 --script dlrm_main.py
     • remotely: torchx run -s slurm dist.ddp -j 1x8 --script dlrm_main.py

TorchRun

You can also use torchrun.

  • e.g. torchrun --nnodes 1 --nproc_per_node 2 --rdzv_backend c10d --rdzv_endpoint localhost --rdzv_id 54321 --role trainer dlrm_main.py
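
For multi-node runs, the same rendezvous flags apply; a sketch (the head-node address and port are placeholders you must supply):

torchrun --nnodes 2 --nproc_per_node 8 --rdzv_backend c10d --rdzv_endpoint $HEAD_NODE_HOST:29500 --rdzv_id 54321 --role trainer dlrm_main.py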

Preliminary Training Results

Setup:

  • Dataset: Criteo 1TB Click Logs dataset
  • CUDA 11.0, NCCL 2.10.3.
  • AWS p4d.24xlarge instances, each with 8 40GB NVIDIA A100 GPUs.

Results

Common settings across all runs:

--num_embeddings_per_feature "45833188,36746,17245,7413,20243,3,7114,1441,62,29275261,1572176,345138,10,2209,11267,128,4,974,14,48937457,11316796,40094537,452104,12606,104,35" --embedding_dim 128 --pin_memory --over_arch_layer_sizes "1024,1024,512,256,1" --dense_arch_layer_sizes "512,256,128" --epochs 1 --change_lr --shuffle_batches --learning_rate 15.0
| Number of GPUs | Collective Size of Embedding Tables (GB) | Local Batch Size | Global Batch Size | AUROC over Val Set After 1 Epoch | AUROC over Test Set After 1 Epoch | Train Records/Second | Time to Train 1 Epoch | Unique Flags |
|---|---|---|---|---|---|---|---|---|
| 1 | 91.10 | 16384 | 16384 | 0.8025434017181396 | 0.8026024103164673 | ~740,065 rec/s | 1h35m29s | --batch_size 16384 --lr_change_point 0.65 --lr_after_change_point 0.035 |
| 4 | 91.10 | 4096 | 16384 | 0.8030692934989929 | 0.8030484914779663 | ~1,458,176 rec/s | 48m39s | --batch_size 4096 --lr_change_point 0.80 --lr_after_change_point 0.20 |
| 8 | 91.10 | 2048 | 16384 | 0.802501916885376 | 0.8025660514831543 | ~1,671,168 rec/s | 43m24s | --batch_size 2048 --lr_change_point 0.80 --lr_after_change_point 0.20 |
| 8 | 91.10 | 8192 | 65536 | 0.7996258735656738 | 0.7996508479118347 | ~5,373,952 rec/s | 13m40s | --batch_size 8192 |
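
The 91.10 GB figure can be sanity-checked from the common flags above: total embedding rows × embedding_dim × 4 bytes (fp32 weights). A quick check:

# Sanity check of the embedding table footprint implied by
# --num_embeddings_per_feature and --embedding_dim above (fp32 weights).
num_embeddings = [45833188, 36746, 17245, 7413, 20243, 3, 7114, 1441, 62,
                  29275261, 1572176, 345138, 10, 2209, 11267, 128, 4, 974,
                  14, 48937457, 11316796, 40094537, 452104, 12606, 104, 35]
embedding_dim = 128
size_bytes = sum(num_embeddings) * embedding_dim * 4
print(size_bytes / 1e9)  # 91.1074688 -> the ~91.10 GB column above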

QPS (train records/second) is calculated using the following formula: it/s × local_batch_size × num_gpus. The it/s value can be found in the logs of the training runs.
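
For example, plugging in the 8-GPU, local batch size 2048 row above (the 102 it/s figure here is back-computed from the table for illustration; read it from your own logs in practice):

# Worked example of the QPS formula for the 8-GPU / local batch 2048 row.
iterations_per_second = 102   # back-computed from the table; take from logs
local_batch_size = 2048
num_gpus = 8
qps = iterations_per_second * local_batch_size * num_gpus
print(qps)  # 1671168, matching ~1,671,168 rec/s in the table above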

The final row, using 8 GPUs with a local batch size of 8192, was not tuned to hit the MLPerf benchmark; it is shown to highlight the QPS (train records/second) achievable with TorchRec.

The --change_lr flag activates the variable learning rate schedule. --lr_change_point dictates the point in the epoch (as a fraction of total iterations) at which we shift the learning rate to the value specified by --lr_after_change_point. We found that starting with a high learning rate (e.g. 15.0) and dropping to a smaller learning rate (e.g. 0.20) near the end of the first epoch (e.g. 80% of the way through) helped us converge faster to the 0.8025 MLPerf AUROC metric.
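
A minimal sketch of what this schedule does, assuming a simple step over the epoch's iterations (a hypothetical helper; the actual logic in dlrm_main.py may differ in its details):

# Step-style LR schedule sketch: keep the high initial LR until
# lr_change_point of the epoch has elapsed, then drop to
# lr_after_change_point. Illustration only, not the exact implementation.
def get_lr(iteration: int, iters_per_epoch: int,
           learning_rate: float = 15.0,
           lr_change_point: float = 0.80,
           lr_after_change_point: float = 0.20) -> float:
    if iteration < lr_change_point * iters_per_epoch:
        return learning_rate
    return lr_after_change_point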

Reproduce

Run the following command to reproduce the results for a single node (8 GPUs) on AWS. This command makes use of the aws_component.py script.

Before running, make sure to (see the example below):

  • set $PATH_TO_1TB_NUMPY_FILES to the path containing the pre-processed .npy files of the Criteo 1TB dataset
  • set $TRAIN_QUEUE to the Slurm partition that handles training jobs
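
Both values below are placeholders; substitute your own path and partition name:

export PATH_TO_1TB_NUMPY_FILES=/data/criteo_1tb/numpy   # placeholder path
export TRAIN_QUEUE=train                                # placeholder Slurm partition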

Example command:

torchx run --scheduler slurm --scheduler_args partition=$TRAIN_QUEUE,time=5:00:00 aws_component.py:run_dlrm_main --num_trainers=8 -- --pin_memory --batch_size 2048 --epochs 1 --num_embeddings_per_feature "45833188,36746,17245,7413,20243,3,7114,1441,62,29275261,1572176,345138,10,2209,11267,128,4,974,14,48937457,11316796,40094537,452104,12606,104,35" --embedding_dim 128 --dense_arch_layer_sizes "512,256,128" --over_arch_layer_sizes "1024,1024,512,256,1" --in_memory_binary_criteo_path $PATH_TO_1TB_NUMPY_FILES --learning_rate 15.0 --shuffle_batches --change_lr --lr_change_point 0.80 --lr_after_change_point 0.20

Upon scheduling the job, there should be an output that looks like this:

warnings.warn(
slurm://torchx/14731
torchx 2022-01-07 21:06:59 INFO     Launched app: slurm://torchx/14731
torchx 2022-01-07 21:06:59 INFO     AppStatus:
  msg: ''
  num_restarts: -1
  roles: []
  state: UNKNOWN (7)
  structured_error_msg: <NONE>
  ui_url: null

torchx 2022-01-07 21:06:59 INFO     Job URL: None

In this example, the job was launched to: slurm://torchx/14731.

Run the following commands to check the status of your job and read the logs:

# Status should be "RUNNING" if properly scheduled
torchx status slurm://torchx/14731

# Log file was automatically created in the directory where you launched the job from
cat slurm-14731.out

The results from the training can be found in the log file (e.g. slurm-14731.out).

Debugging

The --validation_freq_within_epoch x parameter can be used to print the AUROC every x iterations within an epoch.

The in-memory dataloader can take approximately 20-30 minutes to load the data into memory before training starts. The --mmap_mode flag can be used to load the data lazily from disk instead, which reduces training start-up time at the cost of QPS.
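
The trade-off mirrors numpy's own memory mapping; a minimal illustration of the mechanism (the filename is a placeholder):

import numpy as np

# Illustrates the mechanism behind --mmap_mode: rather than reading the whole
# .npy file into RAM up front, numpy maps it and pages rows in from disk on
# access (fast start-up, lower steady-state throughput).
dense = np.load("day_0_dense.npy", mmap_mode="r")  # placeholder filename
batch = dense[:2048]  # only the touched pages are actually read from disk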

TODO/Work In Progress

  • Write section on how to pre-process the dataset.