Name		Name	Last commit message	Last commit date
parent directory ..
data		data
modules		modules
tests		tests
README.MD		README.MD
__init__.py		__init__.py
aws_component.py		aws_component.py
dlrm_main.py		dlrm_main.py
dlrm_packager.py		dlrm_packager.py

README.MD

TorchRec DLRM Example

dlrm_main.py trains, validates, and tests a Deep Learning Recommendation Model (DLRM) with TorchRec. The DLRM model contains both data parallel components (e.g. multi-layer perceptrons & interaction arch) and model parallel components (e.g. embedding tables). The DLRM model is pipelined so that dataloading, data-parallel to model-parallel comms, and forward/backward are overlapped. Can be run with either a random dataloader or Criteo 1 TB click logs dataset.

It has been tested on the following cloud instance types:

Cloud	Instance Type	GPUs	vCPUs	Memory (GB)
AWS	p4d.24xlarge	8 x A100 (40GB)	96	1152
Azure	Standard_ND96asr_v4	8 x A100 (40GB)	96	900
GCP	a2-megagpu-16g	16 x A100 (40GB)	96	1300

Running

Install dependencies

pip install tqdm torchmetrics

Torchx

We recommend using torchx to run. Here we use the DDP builtin

pip install torchx
(optional) setup a slurm or kubernetes cluster
a. locally: torchx run -s local_cwd dist.ddp -j 1x2 --script dlrm_main.py b. remotely: torchx run -s slurm dist.ddp -j 1x8 --script dlrm_main.py

TorchRun

You can also use torchrun.

e.g. torchrun --nnodes 1 --nproc_per_node 2 --rdzv_backend c10d --rdzv_endpoint localhost --rdzv_id 54321 --role trainer dlrm_main.py

Preliminary Training Results

Setup:

Dataset: Criteo 1TB Click Logs dataset
CUDA 11.0, NCCL 2.10.3.
AWS p4d24xlarge instances, each with 8 40GB NVIDIA A100s.

Results

Common settings across all runs:

--num_embeddings_per_feature "45833188,36746,17245,7413,20243,3,7114,1441,62,29275261,1572176,345138,10,2209,11267,128,4,974,14,48937457,11316796,40094537,452104,12606,104,35" --embedding_dim_size 128 --pin_memory --over_arch_layer_sizes "1024,1024,512,256,1" --dense_arch_layer_sizes "512,256,128" --epochs 1 --change_lr --shuffle_batches --learning_rate 15.0

Number of GPUs	Collective Size of Embedding Tables (GiB)	Local Batch Size	Global Batch Size	AUROC over Val Set After 1 Epoch	AUROC Over Test Set After 1 Epoch	Train Records/Second	Time to Train 1 Epoch	Unique Flags
1	91.10	16384	16384	0.8025434017181396	0.8026024103164673	~740,065 rec/s	1h35m29s	`--batch_size 16384 --lr_change_point 0.65 --lr_after_change_point 0.035`
4	91.10	4096	16384	0.8030692934989929	0.8030484914779663	~1,458,176 rec/s	48m39s	`--batch_size 4096 --lr_change_point 0.80 --lr_after_change_point 0.20`
8	91.10	2048	16384	0.802501916885376	0.8025660514831543	~1,671,168 rec/s	43m24s	`--batch_size 2048 --lr_change_point 0.80 --lr_after_change_point 0.20`
8	91.10	8192	65536	0.7996258735656738	0.7996508479118347	~5,373,952 rec/s	13m40s	`--batch_size 8192`

QPS (train record/second) is calculated by using the following formula: x it/s * local_batch_size * num_gpus. The it/s can be found within the logs of the training results.

The final row, using 8 GPUs with a batch size of 8192, was not tuned to hit the MLPerf benchmark but is shown to highlight the QPS (train record/second) achievable with torchrec.

The change_lr parameter activates the variable learning rate schedule. lr_after_change_point is a parameter that we use to dictate the point at which we'll shift the learning rate to the value specified by lr_change_point. We found that having a high learning rate to start (e.g. 15.0) and dropping to a smaller learning rate (e.g. 0.20) near the end of the first epoch (e.g. 80% through) helped us converge faster to the 0.8025 MLPerf AUROC metric.

Reproduce

Run the following command to reproduce the results for a single node (8 GPUs) on AWS. This command makes use of the aws_component.py script.

Ensure to:

set $PATH_TO_1TB_NUMPY_FILES to the path with the pre-processed .npy files of the Criteo 1TB dataset.
set $TRAIN_QUEUE to the partition that handles training jobs

Example command:

torchx run --scheduler slurm --scheduler_args partition=$TRAIN_QUEUE,time=5:00:00 aws_component.py:run_dlrm_main --num_trainers=8 -- --pin_memory --batch_size 2048 --epochs 1 --num_embeddings_per_feature "45833188,36746,17245,7413,20243,3,7114,1441,62,29275261,1572176,345138,10,2209,11267,128,4,974,14,48937457,11316796,40094537,452104,12606,104,35" --embedding_dim 128 --dense_arch_layer_sizes "512,256,128" --over_arch_layer_sizes "1024,1024,512,256,1" --in_memory_binary_criteo_path $PATH_TO_1TB_NUMPY_FILES --learning_rate 15.0 --shuffle_batches --change_lr --lr_change_point 0.80 --lr_after_change_point 0.20

Upon scheduling the job, there should be an output that looks like this:

warnings.warn(
slurm://torchx/14731
torchx 2022-01-07 21:06:59 INFO     Launched app: slurm://torchx/14731
torchx 2022-01-07 21:06:59 INFO     AppStatus:
  msg: ''
  num_restarts: -1
  roles: []
  state: UNKNOWN (7)
  structured_error_msg: <NONE>
  ui_url: null

torchx 2022-01-07 21:06:59 INFO     Job URL: None

In this example, the job was launched to: slurm://torchx/14731.

Run the following commands to check the status of your job and read the logs:

# Status should be "RUNNING" if properly scheduled
torchx status slurm://torchx/14731

# Log file was automatically created in the directory where you launched the job from
cat slurm-14731.out

The results from the training can be found in the log file (e.g. slurm-14731.out).

Debugging

The --validation_freq_within_epoch x parameter can be used to print the AUROC every x iterations through an epoch.

The in-memory dataloader can take approximately 20-30 minutes to load the data into memory before training starts. The --mmap_mode parameter can be used to load data from disk which reduces start-up time for training at the cost of QPS.

TODO/Work In Progress

Write section on how to pre-process the dataset.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

torchrec_dlrm

torchrec_dlrm

README.MD

TorchRec DLRM Example

Running

Install dependencies

Torchx

TorchRun

Preliminary Training Results

Files

torchrec_dlrm

Directory actions

More options

Directory actions

More options

Latest commit

History

torchrec_dlrm

Folders and files

parent directory

README.MD

TorchRec DLRM Example

Running

Install dependencies

Torchx

TorchRun

Preliminary Training Results