TypeError: can't pickle torch.distributed.ProcessGroupNCCL objects #279
Hello @OYRQ, thank you for your interest in our work! Please visit our Custom Training Tutorial to get started, and see our Jupyter Notebook. If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we cannot help you. If this is a custom model or data training question, please note that Ultralytics does not provide free personal support. As a leader in vision ML and AI, we do offer professional consulting, from simple expert advice up to delivery of fully customized, end-to-end production solutions for our clients.
For more information please visit https://www.ultralytics.com.
Hm, I got this too after pulling from latest release. I thought I broke something in my code, but I guess not.
What do you mean? This is running single-process DistributedDataParallel, so it should be the same PID. I am working on multi-process DDP if that's what you're thinking about.
I encountered the same problem here after pulling the latest version. Any ideas how to fix it?
I would just suggest waiting a while to see if others have the same issue and if @glenn-jocher sees this. It may be due to the most recent merge e02a189. You can also use an earlier version; the 30th June one was fine for me. EDIT: Found out that this only happens for multiple GPUs because of the nccl backend. It works fine for a single GPU, so for now you can run it by setting the device to a single GPU.
@NanoCode012 Thanks for your suggestion. Branch of 30th June works fine for me.
@GWwangshuo @NanoCode012 The problem may be in the new update to the EMA; I will try to find out what's going on with it.
@yxNONG , I'm not sure what went wrong. I tried to look through the code. The issue is with saving for multi-GPU, but the only place that is related is saving the ckpt for the EMA. The only other thing in the commits is an update to ONNX.
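For anyone following along, here is a minimal, self-contained illustration of the failure mode; the class name and attribute are stand-ins, not the repo code. torch.save pickles everything reachable from the checkpoint dict, so one non-picklable attribute on the saved model is enough to break it:
import torch
import torch.nn as nn

class Unpicklable:
    # stand-in for torch.distributed.ProcessGroupNCCL, which also defines no pickling support
    def __reduce__(self):
        raise TypeError("can't pickle Unpicklable objects")

model = nn.Linear(4, 2)
model.process_group = Unpicklable()   # the kind of bookkeeping DDP attaches to a model
ckpt = {'epoch': 0, 'model': model}   # saving the whole module, as train.py does
try:
    torch.save(ckpt, 'last.pt')       # fails the same way the real checkpoint save does
except TypeError as e:
    print(e)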
@NanoCode012 Hi guys. Unfortunately I don't have access to a multi-gpu machine to debug. If anyone here finds the problem please submit a PR. Seems like the current patch is to train single-gpu. Unit tests (run on single GPU) are all passing currently.
To add a bit more detail, this issue likely originates in recent pushes to update the EMA code. I'll try to update the EMA handling to isolate it as single-GPU in all cases, as right now both the main model and the EMA are a confusing allowable mix of single GPU and DP. We swapped test.py multigpu out last month for single-gpu FP16 testing during training, so I suppose this will go well with that change. |
May I ask if you still have the test.py for multi-GPU, or can you reference it? So ema.ema.module means that it's distributed, right? In which part does it become like that? I only see that we pass model to the EMA, and it creates a deep copy called ema.ema. Where does module come from?
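For context, .module is simply the attribute both nn.DataParallel and DistributedDataParallel use to hold the wrapped model; a minimal illustration (toy model, not the repo code):
import torch.nn as nn

model = nn.Linear(4, 2)
wrapped = nn.DataParallel(model)      # DDP exposes the original model the same way
print(hasattr(model, 'module'))       # False: a plain model has no .module
print(wrapped.module is model)        # True: the wrapper holds the original underneath
So ema.ema.module would only exist if the DP/DDP-wrapped model, rather than the plain one, had been deep-copied into the EMA.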
@NanoCode012 unit tests are here. We run these in a Colab notebook. You can modify the train device to multi-GPU, i.e. 0,1. Be warned this will delete your default yolov5 directory if it exists, so you should run from a subdirectory.
# Unit tests
rm -rf yolov5 && git clone https://github.com/ultralytics/yolov5 && cd yolov5
export PYTHONPATH="$PWD" # to run *.py. files in subdirectories
pip install -r requirements.txt onnx
python3 -c "from utils.google_utils import *; gdrive_download('1n_oKgR81BJtqk75b00eAjdv03qVCQn2f', 'coco128.zip')" && mv ./coco128 ../
for x in yolov5s yolov5m yolov5l yolov5x # models
do
python train.py --weights $x.pt --cfg $x.yaml --epochs 4 --img 320 --device 0,1 # train
for di in 0 cpu # inference devices
do
python detect.py --weights $x.pt --device $di # detect official
python detect.py --weights weights/last.pt --device $di # detect custom
python test.py --weights $x.pt --device $di # test official
python test.py --weights weights/last.pt --device $di # test custom
done
python models/yolo.py --cfg $x.yaml # inspect
python models/export.py --weights $x.pt --img 640 --batch 1 # export
done
@NanoCode012 and everyone, I just pushed an EMA update which may or may not resolve this issue. The update 1) creates and maintains the EMA as a single-device model at all times, and passes it to test.py this way and to checkpoint saving this way, and 2) reverts the EMA to FP16 to reduce device 0 GPU memory usage slightly. This passes all single-GPU unit tests above, though as I said before, someone with a multi-GPU machine should run the tests themselves to verify.
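For readers without the code in front of them, a rough sketch of what a single-device EMA helper of this kind looks like; the class name, decay value, and details are illustrative, and the FP16 cast mentioned above is omitted so the sketch runs on CPU:
from copy import deepcopy
import torch
import torch.nn as nn

class ModelEMAStub:
    def __init__(self, model, decay=0.9999):
        m = model.module if hasattr(model, 'module') else model   # unwrap DP/DDP first
        self.ema = deepcopy(m).eval()                             # single-device copy, never wrapped
        self.decay = decay
        for p in self.ema.parameters():
            p.requires_grad_(False)

    def update(self, model):
        # blend EMA weights toward the live model after each optimizer step
        m = model.module if hasattr(model, 'module') else model
        with torch.no_grad():
            msd = m.state_dict()
            for k, v in self.ema.state_dict().items():
                if v.dtype.is_floating_point:
                    v.mul_(self.decay).add_(msd[k].detach(), alpha=1 - self.decay)

model = nn.Linear(4, 2)
ema = ModelEMAStub(model)
ema.update(model)   # in train.py this would run once per optimizer step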
Right now, I only have one GPU available, so I will test multiple later, but I got a weird error when running it on a single GPU. Calling ...
Second time I ran it, I got ... Third time, ... EDIT: My device 0 is busy. Could it be related? I also tried using ...
@NanoCode012 I don't know. I can't reproduce on Colab. May be specific to your environment? The tests are intended for Colab or Docker. Anything outside of that I can't speak for. |
@NanoCode012 I got your same error in the Docker container, but not on Colab, strangely enough. When I reverted the EMA to FP32, this removed the Docker error.
@glenn-jocher , thanks. That commit fixed the single-GPU issue for me. I also tested it on Colab, no issue. For multiple GPUs, the same error unfortunately persists.
Single:
@NanoCode012 Ok. If you find a solution that works please submit a PR. |
@NanoCode012 why don't you try to move the EMA definition up before the DDP init? |
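i.e., the ordering would look roughly like this; a hedged sketch only, with a toy model, a hypothetical local_rank argument, and none of the actual train.py details:
from copy import deepcopy
import torch.nn as nn

def build_model_and_ema(device='cpu', local_rank=-1):
    model = nn.Linear(4, 2).to(device)     # stand-in for the YOLO model
    ema_model = deepcopy(model).eval()     # EMA copy taken before any DDP wrapping
    if local_rank != -1:                   # only wrap in real multi-process DDP runs
        model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
    return model, ema_model                # the EMA copy never carries 'module' or 'process_group'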
Sure, that's one of the things I was looking to do. One other thing is whether I can move the parameters of the model, like model.nc, to before the DDP wrapper? I read in the PyTorch docs that modifying a model's parameters after it is wrapped in DDP can cause unexpected behavior. I was hesitant to do it, though, since I would have to modify all the calls on model in the training loop.
Ok! Parameters in that context means model weights that have gradients. The values attached to the model after DDP are class attributes.
I've tried a few things. It always errors on the torch.save line. I did some prints of the k, v pairs being copied onto the EMA, and process_group shows up among them.
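For reference, that inspection amounts to something like this (toy wrapped model as a stand-in, not the repo code):
import torch.nn as nn

model = nn.DataParallel(nn.Linear(4, 2))   # stand-in for the DDP-wrapped training model
for k, v in model.__dict__.items():
    if not k.startswith('_'):
        print(k, type(v))
# A DDP wrapper stores plain attributes such as device_ids, broadcast_buffers and
# process_group here; the ProcessGroupNCCL under 'process_group' is what torch.save can't pickle.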
@NanoCode012 @glenn-jocher
@NanoCode012 thanks for running the experiments! This process_group should definitely not be added; it must be the problem. I suppose we could insert a check into the EMA attribute update to prevent it from being added. Can you try this?
def update_attr(self, model):
    # Update EMA attributes
    for k, v in model.__dict__.items():
        if not k.startswith('_') and k != 'module' and k != 'process_group':
            setattr(self.ema, k, v)
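A quick self-contained check of the idea, with toy stand-ins for the model and the NCCL group (not the repo code):
import torch
import torch.nn as nn
from copy import deepcopy

wrapped = nn.DataParallel(nn.Linear(4, 2))   # stand-in for the DDP-wrapped training model
wrapped.nc = 80                              # the kind of attribute train.py attaches
wrapped.process_group = object()             # stand-in for the ProcessGroupNCCL
ema_model = deepcopy(wrapped.module)

for k, v in wrapped.__dict__.items():        # same filter as the suggestion above
    if not k.startswith('_') and k != 'module' and k != 'process_group':
        setattr(ema_model, k, v)

print(hasattr(ema_model, 'nc'), hasattr(ema_model, 'process_group'))   # True False
torch.save({'model': ema_model}, 'last.pt')  # saves cleanly without the process group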
Hi,
I'm running into a problem:
Traceback (most recent call last):
  File "train.py", line 394, in <module>
    train(hyp)
  File "train.py", line 331, in train
    torch.save(ckpt, last)
  File "/home/yy/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 328, in save
    _legacy_save(obj, opened_file, pickle_module, pickle_protocol)
  File "/home/yy/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 401, in _legacy_save
    pickler.dump(obj)
TypeError: can't pickle torch.distributed.ProcessGroupNCCL objects
Thanks!
environment:
ubuntu 16.04
GPU 2080Ti *4
pytorch 1.4.0
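As an aside, and independent of the EMA fix discussed above, this class of error can also be sidestepped by checkpointing state_dicts instead of whole module objects, since a state_dict contains only tensors; a minimal sketch with a toy model:
import torch
import torch.nn as nn

model = nn.Linear(4, 2)                               # stand-in for the trained model
torch.save({'model': model.state_dict()}, 'last.pt')  # only tensors are pickled
model.load_state_dict(torch.load('last.pt')['model']) # restore into a freshly built model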