
CUDA Error when loading checkpoint on more than one GPU #19

Open
agonzalezd opened this issue Nov 11, 2021 · 2 comments

@agonzalezd

Hello.

I am running into an issue with your code. If I try to resume training from a checkpoint with more than one GPU (I am using Docker containers), I get the following error:

  File "__main__.py", line 55, in <module>
    main(parser.parse_args())
  File "__main__.py", line 39, in main
    spawn(train_distributed, args=(replica_count, port, args, params), nprocs=replica_count, join=True)
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/opt/diffwave/src/diffwave/learner.py", line 188, in train_distributed
    _train_impl(replica_id, model, dataset, args, params)
  File "/opt/diffwave/src/diffwave/learner.py", line 163, in _train_impl
    learner.restore_from_checkpoint()
  File "/opt/diffwave/src/diffwave/learner.py", line 95, in restore_from_checkpoint
    checkpoint = torch.load(f'{self.model_dir}/{filename}.pt')
  File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 584, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 842, in _load
    result = unpickler.load()
  File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 834, in persistent_load
    load_tensor(data_type, size, key, _maybe_decode_ascii(location))
  File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 823, in load_tensor
    loaded_storages[key] = restore_location(storage, location)
  File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 174, in default_restore_location
    result = fn(storage, location)
  File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 156, in _cuda_deserialize
    return obj.cuda(device)
  File "/opt/conda/lib/python3.7/site-packages/torch/_utils.py", line 77, in _cuda
    return new_type(self.size()).copy_(self, non_blocking)
  File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 480, in _lazy_new
    return super(_CudaBase, cls).__new__(cls, *args, **kwargs)
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable

However, when training from scratch or on a single GPU, this error does not appear and training runs flawlessly.

I should add that I checked that the GPUs were completely free when launching the training.

Any advice on this issue?

Thanks in advance.
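For reference, the call that fails in the traceback is torch.load(f'{self.model_dir}/{filename}.pt') at learner.py line 95, which passes no map_location, so torch.load restores every CUDA tensor to the device it was serialized from (typically cuda:0). A minimal sketch of what a mapped load could look like, assuming the replica_id passed to train_distributed is available at that point (the helper name below is illustrative, not the project's actual API):

import torch

def restore_checkpoint_mapped(model_dir, filename, device):
    # Illustrative sketch: load the checkpoint onto an explicit device
    # instead of the GPU the tensors were saved from. Without
    # map_location, torch.load puts each CUDA tensor back on its
    # original device, usually cuda:0.
    return torch.load(f'{model_dir}/{filename}.pt', map_location=device)

# Hypothetical usage: map to this replica's GPU, or to 'cpu' first and
# then move the weights with model.load_state_dict(...).
# checkpoint = restore_checkpoint_mapped(model_dir, filename, torch.device('cuda', replica_id))
# checkpoint = restore_checkpoint_mapped(model_dir, filename, 'cpu')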

@sharvil
Contributor

sharvil commented Nov 17, 2021

Hmm, I haven't run across that error before. Sorry, I don't think I'll be of much help here.

@agonzalezd
Author

agonzalezd commented Jan 28, 2022

It somehow spawns multiple processes on a single GPU, but only on one of them.
I am launching the training on 4 GPUs: three of them each run a single process, but one of them ends up with 4 processes.
If I switch to 3 GPUs, the same thing happens: one of the GPUs ends up with 3 processes.
I cannot find anything in the code that would force this behaviour.
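One explanation consistent with this symptom is the unmapped torch.load noted above: if the checkpoint tensors are deserialized onto the GPU they were saved from (usually cuda:0) in every spawned process, each replica also holds a CUDA context on that one device, which would show up as N processes on a single GPU. A minimal sketch combining per-replica device pinning with a mapped load, assuming the replica_id from train_distributed is in scope (names are illustrative):

import torch

def load_checkpoint_for_replica(replica_id, model_dir, filename):
    # Illustrative sketch: pin this process to its own GPU first so any
    # later CUDA allocation (including checkpoint deserialization) lands
    # on cuda:<replica_id> rather than on the device the checkpoint was
    # saved from.
    device = torch.device('cuda', replica_id)
    torch.cuda.set_device(device)

    # Map the saved tensors onto this replica's device (or onto 'cpu' and
    # then load the state dict into an already-placed model).
    return torch.load(f'{model_dir}/{filename}.pt', map_location=device)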
