sergey-serebryakov/mlperf-training-v0.6

0. Configurations and benchmark running

  • DGXSYSTEM=DGX1: a 2-CPU, 8-GPU server with 20 physical cores per CPU. Hyper-Threading (HT) is enabled, giving 40 logical cores per CPU.
# Build docker image
docker build --pull -t mlperf-nvidia:image_classification .

export NEXP=1
export DATADIR=<path-to-location-of-ImageNet-dataset> 
export LOGDIR=<path-to-where-you-want-to-store-logfiles>
export DGXSYSTEM=DGX1

./run.sub
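
Because `run.sub` depends on the variables exported above, a small pre-flight check can catch a missing path before a long run starts. The following is a minimal sketch, not part of the repository: the function name `preflight` is hypothetical, and creating `LOGDIR` when it does not exist is an assumption about the desired behavior.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with this repository): validate the
# environment variables used in this README before calling ./run.sub.
preflight() {
    # DATADIR must be set and point at an existing directory.
    if [ -z "${DATADIR:-}" ] || [ ! -d "$DATADIR" ]; then
        echo "DATADIR is unset or not a directory: '${DATADIR:-}'" >&2
        return 1
    fi
    # LOGDIR must be set; creating it if missing is an assumption.
    if [ -z "${LOGDIR:-}" ]; then
        echo "LOGDIR is unset" >&2
        return 1
    fi
    mkdir -p "$LOGDIR" || return 1
    # DGXSYSTEM must name the target system (for example DGX1).
    if [ -z "${DGXSYSTEM:-}" ]; then
        echo "DGXSYSTEM is unset" >&2
        return 1
    fi
    echo "OK: DGXSYSTEM=$DGXSYSTEM DATADIR=$DATADIR LOGDIR=$LOGDIR"
}
```

With the function defined, the launch could be guarded as `preflight && ./run.sub`.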

1. Problem

This problem uses the ResNet-50 CNN to do image classification.

Requirements

2. Directions

Steps to download and verify data

There is no download command: please download the dataset manually, following the instructions on the ImageNet website. We use the non-resized ImageNet dataset, packed into an MXNet RecordIO database. The images are neither resized nor normalized, and no preprocessing was performed on the raw ImageNet JPEGs.

For further instructions, see https://github.com/NVIDIA/DeepLearningExamples/blob/master/MxNet/Classification/RN50v1.5/README.md#prepare-dataset .
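
Once the dataset has been packed, it can help to confirm that the RecordIO files are actually present in the directory that will be passed as `DATADIR`. The sketch below is illustrative only: `check_recordio` is a hypothetical helper, and it assumes the packed files use the conventional `.rec` extension, which this README does not state.

```shell
#!/bin/sh
# Hypothetical helper: list RecordIO files under a directory and fail if
# none are found. The '.rec' extension is an assumption, not something
# this README specifies.
check_recordio() {
    dir=$1
    if [ -z "$dir" ] || [ ! -d "$dir" ]; then
        echo "usage: check_recordio <existing-directory>" >&2
        return 1
    fi
    # Collect matching files in a stable order.
    found=$(find "$dir" -type f -name '*.rec' | sort)
    if [ -z "$found" ]; then
        echo "no .rec files found under $dir" >&2
        return 1
    fi
    printf '%s\n' "$found"
}
```

For example, `check_recordio "$DATADIR"` prints the files it found or returns a non-zero status.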

Steps to launch training

NVIDIA DGX-1 (single node)

Launch configuration and system-specific hyperparameters for the NVIDIA DGX-1 single node submission are in the config_DGX1.sh script.

Steps required to launch single node training on NVIDIA DGX-1:

docker build --pull -t mlperf-nvidia:image_classification .
DATADIR=<path/to/data/dir> LOGDIR=<path/to/output/dir> DGXSYSTEM=DGX1 ./run.sub

NVIDIA DGX-2 (single node)

Launch configuration and system-specific hyperparameters for the NVIDIA DGX-2 single node submission are in the config_DGX2.sh script.

Steps required to launch single node training on NVIDIA DGX-2:

docker build --pull -t mlperf-nvidia:image_classification .
DATADIR=<path/to/data/dir> LOGDIR=<path/to/output/dir> DGXSYSTEM=DGX2 ./run.sub

NVIDIA DGX-1 (multi node)

Launch configuration and system-specific hyperparameters for the NVIDIA DGX-1 multi node submission are in the config_DGX1_multi.sh script.

Steps required to launch multi node training on NVIDIA DGX-1:

  1. Build the docker container and push to a docker registry
docker build --pull -t <docker/registry>/mlperf-nvidia:image_classification .
docker push <docker/registry>/mlperf-nvidia:image_classification
  2. Launch the training
source config_DGX1_multi.sh && CONT="<docker/registry>/mlperf-nvidia:image_classification" DATADIR=<path/to/data/dir> LOGDIR=<path/to/output/dir> DGXSYSTEM=DGX1_multi sbatch -N $DGXNNODES -t $WALLTIME --ntasks-per-node $DGXNGPU run.sub
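
The launch line above takes `DGXNNODES`, `WALLTIME` and `DGXNGPU` from the sourced config script, so it can be useful to preview the resulting `sbatch` invocation before submitting. The following is a sketch under assumptions: `dry_run_sbatch` is a hypothetical helper, and it assumes the config file is named `config_${DGXSYSTEM}.sh` (the pattern the file names in this README follow) and sits in the current directory.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print the sbatch command instead of
# submitting it. Assumes config_<DGXSYSTEM>.sh is in the current directory
# and defines DGXNNODES, WALLTIME and DGXNGPU.
dry_run_sbatch() {
    config="config_${DGXSYSTEM}.sh"
    if [ ! -f "$config" ]; then
        echo "missing $config" >&2
        return 1
    fi
    (
        # Source inside a subshell so the caller's environment is untouched.
        . "./$config" || exit 1
        for name in DGXNNODES WALLTIME DGXNGPU; do
            eval "value=\${$name:-}"
            if [ -z "$value" ]; then
                echo "$config did not set $name" >&2
                exit 1
            fi
        done
        echo "sbatch -N $DGXNNODES -t $WALLTIME --ntasks-per-node $DGXNGPU run.sub"
    )
}
```

Running `DGXSYSTEM=DGX1_multi dry_run_sbatch` from the directory containing the config scripts prints the command that would be submitted, without contacting the scheduler.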

NVIDIA DGX-2 (multi node)

Launch configuration and system-specific hyperparameters for the NVIDIA DGX-2 multi node submission are in the config_DGX2_multi.sh script.

Steps required to launch multi node training on NVIDIA DGX-2:

  1. Build the docker container and push to a docker registry
docker build --pull -t <docker/registry>/mlperf-nvidia:image_classification .
docker push <docker/registry>/mlperf-nvidia:image_classification
  2. Launch the training
source config_DGX2_multi.sh && CONT="<docker/registry>/mlperf-nvidia:image_classification" DATADIR=<path/to/data/dir> LOGDIR=<path/to/output/dir> DGXSYSTEM=DGX2_multi sbatch -N $DGXNNODES -t $WALLTIME --ntasks-per-node $DGXNGPU run.sub