dlrm_main.py trains, validates, and tests a Deep Learning Recommendation Model (DLRM) with TorchRec. The DLRM model contains both data parallel components (e.g. multi-layer perceptrons and the interaction arch) and model parallel components (e.g. embedding tables). The model is pipelined so that dataloading, data-parallel to model-parallel comms, and forward/backward are overlapped. It can be run with either a random dataloader or the Criteo 1TB Click Logs dataset.
It has been tested on the following cloud instance types:
Cloud | Instance Type | GPUs | vCPUs | Memory (GB) |
---|---|---|---|---|
AWS | p4d.24xlarge | 8 x A100 (40GB) | 96 | 1152 |
Azure | Standard_ND96asr_v4 | 8 x A100 (40GB) | 96 | 900 |
GCP | a2-megagpu-16g | 16 x A100 (40GB) | 96 | 1300 |
pip install tqdm torchmetrics
We recommend using torchx to run the script. Here we use the builtin DDP component (dist.ddp):
- pip install torchx
- (optional) set up a slurm or kubernetes cluster
- run the script:
  a. locally: torchx run -s local_cwd dist.ddp -j 1x2 --script dlrm_main.py
  b. remotely: torchx run -s slurm dist.ddp -j 1x8 --script dlrm_main.py
You can also use torchrun.
- e.g.
torchrun --nnodes 1 --nproc_per_node 2 --rdzv_backend c10d --rdzv_endpoint localhost --rdzv_id 54321 --role trainer dlrm_main.py
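When launched with torchrun, each worker process receives its position in the job through the RANK, LOCAL_RANK, and WORLD_SIZE environment variables. The sketch below shows one way a script could read them. The helper name read_worker_info and the single-process fallback defaults are illustrative assumptions, not code taken from dlrm_main.py.

```python
import os
from typing import Mapping, NamedTuple


class WorkerInfo(NamedTuple):
    rank: int        # global index of this worker across all nodes
    local_rank: int  # index of this worker on its own node (often the GPU index)
    world_size: int  # total number of workers in the job


def read_worker_info(env: Mapping[str, str] = os.environ) -> WorkerInfo:
    """Read the rank variables that torchrun exports to every worker.

    Hypothetical helper: falls back to single-process defaults so the
    same code also runs when no launcher is used.
    """
    return WorkerInfo(
        rank=int(env.get("RANK", "0")),
        local_rank=int(env.get("LOCAL_RANK", "0")),
        world_size=int(env.get("WORLD_SIZE", "1")),
    )


if __name__ == "__main__":
    info = read_worker_info()
    print(f"rank {info.rank} of {info.world_size} (local rank {info.local_rank})")
```

With the torchrun command above (--nproc_per_node 2 on one node), the two workers would see a world size of 2 and local ranks 0 and 1.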
Setup:
- Dataset: Criteo 1TB Click Logs dataset
- CUDA 11.0, NCCL 2.10.3.
- AWS p4d.24xlarge instances, each with 8 40GB NVIDIA A100s.
Results
Common settings across all runs:
--num_embeddings_per_feature "45833188,36746,17245,7413,20243,3,7114,1441,62,29275261,1572176,345138,10,2209,11267,128,4,974,14,48937457,11316796,40094537,452104,12606,104,35" --embedding_dim 128 --pin_memory --over_arch_layer_sizes "1024,1024,512,256,1" --dense_arch_layer_sizes "512,256,128" --epochs 1 --change_lr --shuffle_batches --learning_rate 15.0
Number of GPUs | Collective Size of Embedding Tables (GiB) | Local Batch Size | Global Batch Size | AUROC over Val Set After 1 Epoch | AUROC Over Test Set After 1 Epoch | Train Records/Second | Time to Train 1 Epoch | Unique Flags |
---|---|---|---|---|---|---|---|---|
1 | 91.10 | 16384 | 16384 | 0.8025434017181396 | 0.8026024103164673 | ~740,065 rec/s | 1h35m29s | --batch_size 16384 --lr_change_point 0.65 --lr_after_change_point 0.035 |
4 | 91.10 | 4096 | 16384 | 0.8030692934989929 | 0.8030484914779663 | ~1,458,176 rec/s | 48m39s | --batch_size 4096 --lr_change_point 0.80 --lr_after_change_point 0.20 |
8 | 91.10 | 2048 | 16384 | 0.802501916885376 | 0.8025660514831543 | ~1,671,168 rec/s | 43m24s | --batch_size 2048 --lr_change_point 0.80 --lr_after_change_point 0.20 |
8 | 91.10 | 8192 | 65536 | 0.7996258735656738 | 0.7996508479118347 | ~5,373,952 rec/s | 13m40s | --batch_size 8192 |
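The embedding table size column can be sanity-checked from the common settings. The sketch below sums the row counts passed to --num_embeddings_per_feature and multiplies by the embedding dimension of 128. The 4 bytes per weight (fp32) is an assumption, as is the helper name embedding_table_bytes; the source does not state the weight precision.

```python
# Row counts copied from the --num_embeddings_per_feature flag in the common settings.
NUM_EMBEDDINGS_PER_FEATURE = (
    "45833188,36746,17245,7413,20243,3,7114,1441,62,29275261,1572176,345138,"
    "10,2209,11267,128,4,974,14,48937457,11316796,40094537,452104,12606,104,35"
)
EMBEDDING_DIM = 128
BYTES_PER_WEIGHT = 4  # assumption: weights are stored as fp32


def embedding_table_bytes(
    num_embeddings_csv: str,
    embedding_dim: int,
    bytes_per_weight: int = BYTES_PER_WEIGHT,
) -> int:
    """Total bytes needed for all embedding tables (hypothetical helper)."""
    total_rows = sum(int(n) for n in num_embeddings_csv.split(","))
    return total_rows * embedding_dim * bytes_per_weight


if __name__ == "__main__":
    total = embedding_table_bytes(NUM_EMBEDDINGS_PER_FEATURE, EMBEDDING_DIM)
    print(f"{total / 1e9:.2f} GB (decimal), {total / 2**30:.2f} GiB (binary)")
```

Under the fp32 assumption this comes to about 91.1 x 10^9 bytes, which is roughly 84.9 GiB. The table's 91.10 figure therefore matches the decimal-gigabyte reading of this estimate.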
QPS (train records/second) is calculated with the following formula: x it/s * local_batch_size * num_gpus. The it/s value can be found in the logs of the training run.
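As a small worked example of the QPS formula, the sketch below multiplies the iteration rate by the local batch size and the number of GPUs. The function name is illustrative, and the rate of 82 it/s is not a logged value: it is back-computed from the last row of the table (5,373,952 / (8192 * 8)).

```python
def train_records_per_second(
    iters_per_second: float, local_batch_size: int, num_gpus: int
) -> float:
    """QPS = it/s * local_batch_size * num_gpus (hypothetical helper)."""
    return iters_per_second * local_batch_size * num_gpus


if __name__ == "__main__":
    # 82 it/s is an illustrative rate back-computed from the final table row.
    qps = train_records_per_second(82, local_batch_size=8192, num_gpus=8)
    print(f"~{qps:,.0f} rec/s")
```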
The final row, using 8 GPUs with a batch size of 8192, was not tuned to hit the MLPerf benchmark but is shown to highlight the QPS (train record/second) achievable with torchrec.
The change_lr parameter activates the variable learning rate schedule. lr_change_point dictates the point in training (as a fraction, e.g. 0.80) at which the learning rate shifts to the value specified by lr_after_change_point. We found that starting with a high learning rate (e.g. 15.0) and dropping to a smaller learning rate (e.g. 0.20) near the end of the first epoch (e.g. 80% through) helped us converge faster to the 0.8025 MLPerf AUROC metric.
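The schedule can be pictured as a single step: the starting learning rate holds until the fraction given by --lr_change_point (0.80 in the 8 GPU run) and the value given by --lr_after_change_point (0.20) applies afterwards. The sketch below is a minimal illustration under that assumption. The function learning_rate_at and its progress argument are hypothetical, and the exact update logic in dlrm_main.py may differ.

```python
def learning_rate_at(
    progress: float,
    base_lr: float = 15.0,
    lr_change_point: float = 0.80,
    lr_after_change_point: float = 0.20,
    change_lr: bool = True,
) -> float:
    """Return the learning rate at a given fraction of training.

    Assumes a single step change at lr_change_point; hypothetical helper,
    not the script's own implementation.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be a fraction between 0 and 1")
    if change_lr and progress >= lr_change_point:
        return lr_after_change_point
    return base_lr


if __name__ == "__main__":
    for progress in (0.0, 0.5, 0.8, 1.0):
        print(progress, learning_rate_at(progress))
```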
Reproduce
Run the following command to reproduce the results for a single node (8 GPUs) on AWS. This command makes use of the aws_component.py script.
Be sure to:
- set $PATH_TO_1TB_NUMPY_FILES to the path with the pre-processed .npy files of the Criteo 1TB dataset.
- set $TRAIN_QUEUE to the partition that handles training jobs
Example command:
torchx run --scheduler slurm --scheduler_args partition=$TRAIN_QUEUE,time=5:00:00 aws_component.py:run_dlrm_main --num_trainers=8 -- --pin_memory --batch_size 2048 --epochs 1 --num_embeddings_per_feature "45833188,36746,17245,7413,20243,3,7114,1441,62,29275261,1572176,345138,10,2209,11267,128,4,974,14,48937457,11316796,40094537,452104,12606,104,35" --embedding_dim 128 --dense_arch_layer_sizes "512,256,128" --over_arch_layer_sizes "1024,1024,512,256,1" --in_memory_binary_criteo_path $PATH_TO_1TB_NUMPY_FILES --learning_rate 15.0 --shuffle_batches --change_lr --lr_change_point 0.80 --lr_after_change_point 0.20
Upon scheduling the job, there should be an output that looks like this:
warnings.warn(
slurm://torchx/14731
torchx 2022-01-07 21:06:59 INFO Launched app: slurm://torchx/14731
torchx 2022-01-07 21:06:59 INFO AppStatus:
msg: ''
num_restarts: -1
roles: []
state: UNKNOWN (7)
structured_error_msg: <NONE>
ui_url: null
torchx 2022-01-07 21:06:59 INFO Job URL: None
In this example, the job was launched to: slurm://torchx/14731.
Run the following commands to check the status of your job and read the logs:
# Status should be "RUNNING" if properly scheduled
torchx status slurm://torchx/14731
# Log file was automatically created in the directory where you launched the job from
cat slurm-14731.out
The results from the training can be found in the log file (e.g. slurm-14731.out).
Debugging
The --validation_freq_within_epoch x parameter can be used to print the AUROC every x iterations through an epoch.
The in-memory dataloader can take approximately 20-30 minutes to load the data into memory before training starts. The --mmap_mode parameter can be used to read the data from disk instead, which reduces start-up time for training at the cost of QPS.
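The trade-off can be illustrated with NumPy, which supports both loading styles for .npy files. The sketch below assumes that --mmap_mode corresponds to NumPy memory mapping (np.load with mmap_mode), which the source does not confirm. The helper load_array and the temporary file are illustrative only.

```python
import os
import tempfile
from typing import Optional

import numpy as np


def load_array(path: str, mmap_mode: Optional[str] = None) -> np.ndarray:
    """Load a .npy file fully into memory, or memory-map it when mmap_mode is set.

    Hypothetical helper: with mmap_mode="r" only the header is read up front,
    and pages are fetched from disk as the array is indexed.
    """
    return np.load(path, mmap_mode=mmap_mode)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.npy")
        np.save(path, np.arange(12, dtype=np.float32).reshape(4, 3))

        in_memory = load_array(path)              # reads the whole file now
        mapped = load_array(path, mmap_mode="r")  # defers reads until access

        print(type(in_memory).__name__, type(mapped).__name__)
        del mapped  # release the file mapping before the directory is removed
```

Memory mapping avoids the long up-front read, but every batch then pays for disk access, which is consistent with the lower QPS noted above.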
TODO/Work In Progress
- Write section on how to pre-process the dataset.