The TorchRec DLRM example trains, validates, and tests a Deep Learning Recommendation Model (DLRM) with TorchRec. The model contains both data-parallel components (e.g. multi-layer perceptrons and the interaction arch) and model-parallel components (e.g. embedding tables). Training is pipelined so that data loading, data-parallel to model-parallel communication, and the forward/backward passes overlap. The example can be run with either a random dataloader or the Criteo 1TB Click Logs dataset.

It has been tested on the following cloud instance types:

| Cloud | Instance Type | GPUs | vCPUs | Memory (GB) |
| --- | --- | --- | --- | --- |
| AWS | p4d.24xlarge | 8 x A100 (40GB) | 96 | 1152 |
| Azure | Standard_ND96asr_v4 | 8 x A100 (40GB) | 96 | 900 |
| GCP | a2-megagpu-16g | 16 x A100 (40GB) | 96 | 1300 |


Install dependencies

pip install tqdm torchmetrics


We recommend using torchx to run the example. Here we use the DDP builtin.

  1. pip install torchx
  2. (optional) Set up a slurm or kubernetes cluster
  3. Run the script:
     a. locally: torchx run -s local_cwd dist.ddp -j 1x2 --script
     b. remotely: torchx run -s slurm dist.ddp -j 1x8 --script


You can also use torchrun.

  • e.g. torchrun --nnodes 1 --nproc_per_node 2 --rdzv_backend c10d --rdzv_endpoint localhost --rdzv_id 54321 --role trainer

Preliminary Training Results


  • Dataset: Criteo 1TB Click Logs dataset
  • CUDA 11.0, NCCL 2.10.3.
  • AWS p4d.24xlarge instances, each with 8 40GB NVIDIA A100s.


Common settings across all runs:

--num_embeddings_per_feature "45833188,36746,17245,7413,20243,3,7114,1441,62,29275261,1572176,345138,10,2209,11267,128,4,974,14,48937457,11316796,40094537,452104,12606,104,35" --embedding_dim 128 --pin_memory --over_arch_layer_sizes "1024,1024,512,256,1" --dense_arch_layer_sizes "512,256,128" --epochs 1 --change_lr --shuffle_batches --learning_rate 15.0

| Number of GPUs | Collective Size of Embedding Tables (GiB) | Local Batch Size | Global Batch Size | AUROC over Val Set After 1 Epoch | AUROC over Test Set After 1 Epoch | Train Records/Second | Time to Train 1 Epoch | Unique Flags |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 91.10 | 16384 | 16384 | 0.8025434017181396 | 0.8026024103164673 | ~740,065 rec/s | 1h35m29s | --batch_size 16384 --lr_change_point 0.65 --lr_after_change_point 0.035 |
| 4 | 91.10 | 4096 | 16384 | 0.8030692934989929 | 0.8030484914779663 | ~1,458,176 rec/s | 48m39s | --batch_size 4096 --lr_change_point 0.80 --lr_after_change_point 0.20 |
| 8 | 91.10 | 2048 | 16384 | 0.802501916885376 | 0.8025660514831543 | ~1,671,168 rec/s | 43m24s | --batch_size 2048 --lr_change_point 0.80 --lr_after_change_point 0.20 |
| 8 | 91.10 | 8192 | 65536 | 0.7996258735656738 | 0.7996508479118347 | ~5,373,952 rec/s | 13m40s | --batch_size 8192 |
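
As a rough cross-check of the embedding table size, the sketch below sums the row counts from --num_embeddings_per_feature and multiplies by an embedding dimension of 128. The 4-byte (float32) weight size and the helper name `embedding_table_bytes` are assumptions for illustration; the source does not state the weight precision.

```python
# Row counts copied from the --num_embeddings_per_feature setting above.
NUM_EMBEDDINGS_PER_FEATURE = (
    "45833188,36746,17245,7413,20243,3,7114,1441,62,29275261,1572176,"
    "345138,10,2209,11267,128,4,974,14,48937457,11316796,40094537,"
    "452104,12606,104,35"
)
EMBEDDING_DIM = 128
BYTES_PER_WEIGHT = 4  # assumption: float32 weights


def embedding_table_bytes(rows_csv: str, dim: int, bytes_per_weight: int) -> int:
    """Total bytes for all tables: sum of rows * dim * bytes per weight."""
    rows = [int(n) for n in rows_csv.split(",")]
    return sum(rows) * dim * bytes_per_weight


num_features = len(NUM_EMBEDDINGS_PER_FEATURE.split(","))
total_rows = sum(int(n) for n in NUM_EMBEDDINGS_PER_FEATURE.split(","))
total_bytes = embedding_table_bytes(
    NUM_EMBEDDINGS_PER_FEATURE, EMBEDDING_DIM, BYTES_PER_WEIGHT
)
print(f"{num_features} features, {total_rows:,} rows, {total_bytes:,} bytes")
```

Under the float32 assumption this comes to roughly 91.1 billion bytes across the 26 tables, which is in line with the 91.10 figure reported in the table.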

QPS (train records/second) is calculated with the following formula: x it/s * local_batch_size * num_gpus, where x is the iteration rate reported in the training logs.

The final row, using 8 GPUs with a batch size of 8192, was not tuned to hit the MLPerf benchmark but is shown to highlight the QPS (train records/second) achievable with torchrec.
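
The QPS formula can be written as a small helper. This is only an illustration: the function name is hypothetical, and the iteration rates (102 it/s and 82 it/s) are the values implied by dividing the table's records/second by the global batch size, not numbers quoted from a log.

```python
def train_records_per_second(
    it_per_sec: float, local_batch_size: int, num_gpus: int
) -> float:
    """QPS = iterations/second * local batch size * number of GPUs."""
    return it_per_sec * local_batch_size * num_gpus


# 8 GPUs, local batch size 2048; 102 it/s is implied by the table row.
qps_tuned = train_records_per_second(102, 2048, 8)

# 8 GPUs, local batch size 8192; 82 it/s is implied by the final table row.
qps_large_batch = train_records_per_second(82, 8192, 8)

print(f"{qps_tuned:,.0f} rec/s, {qps_large_batch:,.0f} rec/s")
```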

The change_lr parameter activates the variable learning rate schedule. lr_change_point dictates the point in the epoch (as a fraction, e.g. 0.80) at which we shift the learning rate to the value specified by lr_after_change_point. We found that starting with a high learning rate (e.g. 15.0) and dropping to a smaller learning rate (e.g. 0.20) near the end of the first epoch (e.g. 80% through) helped us converge faster to the 0.8025 MLPerf AUROC metric.
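
The schedule described above amounts to a single step change. The sketch below is one way to express it, not the training script's actual implementation: the function name, the use of iteration counts to measure progress, and the inclusive comparison at the change point are all assumptions.

```python
def learning_rate_at(
    iteration: int,
    iterations_per_epoch: int,
    base_lr: float,
    lr_change_point: float,
    lr_after_change_point: float,
) -> float:
    """Return base_lr until the given fraction of the epoch, then the lower rate."""
    progress = iteration / iterations_per_epoch  # fraction of the epoch completed
    if progress >= lr_change_point:
        return lr_after_change_point
    return base_lr


# Values from the 4- and 8-GPU runs: start at 15.0, drop to 0.20 at 80%.
# The epoch length of 1000 iterations is a made-up number for illustration.
early = learning_rate_at(500, 1000, 15.0, 0.80, 0.20)
late = learning_rate_at(900, 1000, 15.0, 0.80, 0.20)
print(early, late)
```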


Run the following command to reproduce the results for a single node (8 GPUs) on AWS. This command makes use of the script.

Make sure to:

  • Set $PATH_TO_1TB_NUMPY_FILES to the path containing the pre-processed .npy files of the Criteo 1TB dataset.
  • Set $TRAIN_QUEUE to the partition that handles training jobs.

Example command:

torchx run --scheduler slurm --scheduler_args partition=$TRAIN_QUEUE,time=5:00:00 --num_trainers=8 -- --pin_memory --batch_size 2048 --epochs 1 --num_embeddings_per_feature "45833188,36746,17245,7413,20243,3,7114,1441,62,29275261,1572176,345138,10,2209,11267,128,4,974,14,48937457,11316796,40094537,452104,12606,104,35" --embedding_dim 128 --dense_arch_layer_sizes "512,256,128" --over_arch_layer_sizes "1024,1024,512,256,1" --in_memory_binary_criteo_path $PATH_TO_1TB_NUMPY_FILES --learning_rate 15.0 --shuffle_batches --change_lr --lr_change_point 0.80 --lr_after_change_point 0.20

Upon scheduling the job, there should be an output that looks like this:

torchx 2022-01-07 21:06:59 INFO     Launched app: slurm://torchx/14731
torchx 2022-01-07 21:06:59 INFO     AppStatus:
  msg: ''
  num_restarts: -1
  roles: []
  state: UNKNOWN (7)
  structured_error_msg: <NONE>
  ui_url: null

torchx 2022-01-07 21:06:59 INFO     Job URL: None

In this example, the job was launched to: slurm://torchx/14731.

Run the following commands to check the status of your job and read the logs:

# Status should be "RUNNING" if properly scheduled
torchx status slurm://torchx/14731

# Log file was automatically created in the directory where you launched the job from
cat slurm-14731.out

The results from the training can be found in the log file (e.g. slurm-14731.out).


The --validation_freq_within_epoch x parameter can be used to print the AUROC every x iterations through an epoch.

The in-memory dataloader can take approximately 20-30 minutes to load the data into memory before training starts. The --mmap_mode parameter can be used to load data from disk instead, which reduces start-up time for training at the cost of QPS.
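
The trade-off can be illustrated with NumPy's own memory mapping. This sketch assumes that --mmap_mode corresponds to loading the .npy files with `np.load(..., mmap_mode="r")`; the source does not confirm this, and the file name and tiny array here are made up for the example.

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical stand-in for one of the pre-processed .npy files.
    path = os.path.join(tmp_dir, "dense_sample.npy")
    np.save(path, np.arange(12, dtype=np.float32).reshape(4, 3))

    # Eager load: the whole array is read into memory up front.
    in_memory = np.load(path)

    # Memory-mapped load: returns immediately, pages are read from disk on access.
    mapped = np.load(path, mmap_mode="r")

    is_memmap = isinstance(mapped, np.memmap)
    same_values = bool(np.array_equal(in_memory, mapped))

    # Release the mapping before the temporary directory is removed.
    del mapped

print(is_memmap, same_values)
```

With a memory-mapped array the start-up cost is deferred to the first access of each region, which is why throughput during training can be lower than with data already resident in memory.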

TODO/Work In Progress

  • Write section on how to pre-process the dataset.