
CUDA Error when loading checkpoint on more than one GPU #19

Open
agonzalezd opened this issue Nov 11, 2021 · 2 comments

@agonzalezd

Hello.

I am running into an issue with your code. If I try to resume training from a checkpoint with more than one GPU (I am using Docker containers), I get the following error:

  File "__main__.py", line 55, in <module>
    main(parser.parse_args())
  File "__main__.py", line 39, in main
    spawn(train_distributed, args=(replica_count, port, args, params), nprocs=replica_count, join=True)
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/opt/diffwave/src/diffwave/learner.py", line 188, in train_distributed
    _train_impl(replica_id, model, dataset, args, params)
  File "/opt/diffwave/src/diffwave/learner.py", line 163, in _train_impl
    learner.restore_from_checkpoint()
  File "/opt/diffwave/src/diffwave/learner.py", line 95, in restore_from_checkpoint
    checkpoint = torch.load(f'{self.model_dir}/{filename}.pt')
  File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 584, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 842, in _load
    result = unpickler.load()
  File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 834, in persistent_load
    load_tensor(data_type, size, key, _maybe_decode_ascii(location))
  File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 823, in load_tensor
    loaded_storages[key] = restore_location(storage, location)
  File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 174, in default_restore_location
    result = fn(storage, location)
  File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 156, in _cuda_deserialize
    return obj.cuda(device)
  File "/opt/conda/lib/python3.7/site-packages/torch/_utils.py", line 77, in _cuda
    return new_type(self.size()).copy_(self, non_blocking)
  File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 480, in _lazy_new
    return super(_CudaBase, cls).__new__(cls, *args, **kwargs)
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable

However, when training from scratch or on a single GPU, this error does not appear and training runs flawlessly.

I should add that I checked that the GPUs were completely free when launching the training.

Any advice on this issue?

Thanks in advance.
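For reference, the call that fails in the traceback is torch.load(f'{self.model_dir}/{filename}.pt') at learner.py line 95, which passes no map_location, so torch.load restores every CUDA tensor to the device it was serialized from (typically cuda:0). A minimal sketch of what a mapped load could look like, assuming the replica_id passed to train_distributed is available at that point (the helper name below is illustrative, not the project's actual API):

import torch

def restore_checkpoint_mapped(model_dir, filename, device):
    # Illustrative sketch: load the checkpoint onto an explicit device
    # instead of the GPU the tensors were saved from. Without
    # map_location, torch.load puts each CUDA tensor back on its
    # original device, usually cuda:0.
    return torch.load(f'{model_dir}/{filename}.pt', map_location=device)

# Hypothetical usage: map to this replica's GPU, or to 'cpu' first and
# then move the weights with model.load_state_dict(...).
# checkpoint = restore_checkpoint_mapped(model_dir, filename, torch.device('cuda', replica_id))
# checkpoint = restore_checkpoint_mapped(model_dir, filename, 'cpu')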

@sharvil
Contributor

sharvil commented Nov 17, 2021

Hmm, I haven't run across that error before. Sorry, I don't think I'll be of much help here.

@agonzalezd
Author

agonzalezd commented Jan 28, 2022

It somehow spawns multiple processes on a single GPU, but only on one of them.
I am launching the training on 4 GPUs: three of them each run a single process, but one of them ends up with 4 processes.
If I switch to 3 GPUs, the same thing happens: one of the GPUs ends up with 3 processes.
I cannot find anything in the code that would force this behaviour.
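One explanation consistent with this symptom is the unmapped torch.load noted above: if the checkpoint tensors are deserialized onto the GPU they were saved from (usually cuda:0) in every spawned process, each replica also holds a CUDA context on that one device, which would show up as N processes on a single GPU. A minimal sketch combining per-replica device pinning with a mapped load, assuming the replica_id from train_distributed is in scope (names are illustrative):

import torch

def load_checkpoint_for_replica(replica_id, model_dir, filename):
    # Illustrative sketch: pin this process to its own GPU first so any
    # later CUDA allocation (including checkpoint deserialization) lands
    # on cuda:<replica_id> rather than on the device the checkpoint was
    # saved from.
    device = torch.device('cuda', replica_id)
    torch.cuda.set_device(device)

    # Map the saved tensors onto this replica's device (or onto 'cpu' and
    # then load the state dict into an already-placed model).
    return torch.load(f'{model_dir}/{filename}.pt', map_location=device)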
