HUB not working correctly with Multi-GPU custom agent setup #695

Open
1 task done
sinchinpark opened this issue May 23, 2024 · 8 comments
Assignees
Labels
bug (Something isn't working), Stale

Comments

@sinchinpark

sinchinpark commented May 23, 2024

Search before asking

  • I have searched the HUB issues and found no similar bug report.

HUB Component

Models, Training

Bug

Description

I am experiencing issues when using the HUB portal to train on a dataset with a multi-GPU custom agent setup. Specifically, I am using 2 GPUs and have changed the default parameters as follows:

device=0,1
workers=16

However, HUB does not seem to process the training data correctly and stays stuck throughout the training process. The issue persists even after training has supposedly finished, as shown in the attached screenshots.

[Screenshot: swappy-20240523_131045]

[Screenshot: swappy-20240523_132213]

Interestingly, using device=0 on the same machine with the same model works fine!
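To check whether the hang also occurs outside HUB, the two configurations could be run directly with the `yolo` CLI. The sketch below only builds the two command lines; `build_yolo_train_cmd` is a hypothetical helper, and `coco8.yaml` is a placeholder dataset (the real dataset name is masked in the logs).

```python
import shlex


def build_yolo_train_cmd(model, data, device, workers, epochs):
    """Build the argv for a `yolo detect train` run.

    `device` is passed as a string such as "0" or "0,1".
    """
    overrides = {
        "model": model,
        "data": data,
        "device": device,
        "workers": workers,
        "epochs": epochs,
    }
    return ["yolo", "detect", "train"] + [f"{k}={v}" for k, v in overrides.items()]


# Placeholder dataset; swap in the dataset actually used for training.
single = build_yolo_train_cmd("yolov8m.pt", "coco8.yaml", "0", 16, 10)
multi = build_yolo_train_cmd("yolov8m.pt", "coco8.yaml", "0,1", 16, 10)

# Print the commands so they can be pasted into a shell on the agent machine.
print(shlex.join(single))
print(shlex.join(multi))
```

If both commands finish locally, the problem is more likely in how the agent reports multi-GPU progress to HUB than in the training itself, though that is only a guess at this point.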

Logs and Errors:

Here are some potentially useful logs and errors from my custom agents:

Ultralytics HUB: View model at https://hub.ultralytics.com/models/zCnR3gSc9n1xTow1CTpS 🚀
Ultralytics YOLOv8.2.19 🚀 Python-3.10.12 torch-2.3.0+cu121 CUDA:0 (NVIDIA GeForce RTX 3090, 24253MiB)
                                                                CUDA:1 (NVIDIA GeForce RTX 3090, 24253MiB)
engine/trainer: task=detect, mode=train, model=yolov8m.pt, data=***, epochs=10, time=None, patience=100, batch=-1, imgsz=640, save=True, save_period=-1, cache=ram, device=[0, 1], workers=8, project=None, name=train2, exist_ok=False, pretrained=True, optimizer=auto, verbose=True, seed=0, deterministic=True, single_cls=False, rect=True, cos_lr=False, close_mosaic=10, resume=False, amp=True, fraction=1.0, profile=False, freeze=None, multi_scale=False, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, vid_stride=1, stream_buffer=False, visualize=False, augment=False, agnostic_nms=False, classes=None, retina_masks=False, embed=None, show=False, save_frames=False, save_txt=False, save_conf=False, save_crop=False, show_labels=True, show_conf=True, show_boxes=True, line_width=None, format=torchscript, keras=False, optimize=False, int8=False, dynamic=False, simplify=False, opset=None, workspace=4, nms=False, lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, dfl=1.5, pose=12.0, kobj=1.0, label_smoothing=0.0, nbs=64, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, bgr=0.0, mosaic=1.0, mixup=0.0, copy_paste=0.0, auto_augment=randaugment, erasing=0.4, crop_fraction=1.0, cfg=None, tracker=botsort.yaml, save_dir=runs/detect/train2
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
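Note that the `engine/trainer` line above reports `workers=8`, while the description says `workers=16` was set. Whether this is related to the hang is unknown. A small sketch for comparing the requested overrides with what the agent logged follows; `parse_trainer_args` and `diff_overrides` are hypothetical helpers, and the sample line is shortened from the log above.

```python
import re

# A value is either a bracketed list such as [0, 1] or everything up to the next comma.
PAIR = re.compile(r"(\w+)=(\[[^\]]*\]|[^,]*)")


def parse_trainer_args(line):
    """Turn an `engine/trainer: k=v, k=v, ...` log line into a dict of strings."""
    _, _, body = line.partition("engine/trainer:")
    return {key: value.strip() for key, value in PAIR.findall(body)}


def diff_overrides(requested, logged):
    """Return {key: (requested, logged)} for every requested key that differs."""
    return {
        key: (str(value), logged.get(key))
        for key, value in requested.items()
        if logged.get(key) != str(value)
    }


line = (
    "engine/trainer: task=detect, mode=train, model=yolov8m.pt, epochs=10, "
    "device=[0, 1], workers=8, rect=True"
)
logged = parse_trainer_args(line)
requested = {"device": "[0, 1]", "workers": 16}
print(diff_overrides(requested, logged))  # {'workers': ('16', '8')}
```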

Also, I encountered the following warnings multiple times:

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/conv.py:456: UserWarning: Plan failed with a cudnnException: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_NOT_SUPPORTED (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:919.)

Expected Behavior:

Training should proceed without getting stuck, show progress and metrics on the dashboard, and allow deploy/export once training has finished (as it does when using device=0).

Custom Agent Env

Python: 3.10.12
PyTorch: 2.3.0+cu121
GPUs: 2x NVIDIA GeForce RTX 3090
Ultralytics YOLOv8.2.19
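For anyone trying to reproduce this, the same details can be collected on the agent machine with a short script. This is a minimal sketch: `collect_env` and `package_version` are hypothetical helper names, and the GPU list stays empty when PyTorch or CUDA is unavailable.

```python
import platform
from importlib import metadata


def package_version(name):
    """Return the installed version of a package, or a marker if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def collect_env():
    """Gather Python, PyTorch, Ultralytics and GPU details into a dict."""
    info = {
        "python": platform.python_version(),
        "torch": package_version("torch"),
        "ultralytics": package_version("ultralytics"),
        "gpus": [],
    }
    try:
        import torch  # imported lazily so the script still runs without PyTorch

        if torch.cuda.is_available():
            info["gpus"] = [
                torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())
            ]
    except ImportError:
        pass
    return info


for key, value in collect_env().items():
    print(f"{key}: {value}")
```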

Environment

Ultralytics HUB Version: v0.1.43
Client User Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36
Operating System: Linux x86_64
Server Timestamp: 1716456982

Minimal Reproducible Example

No response

Additional

No response

@sinchinpark sinchinpark added the bug Something isn't working label May 23, 2024

👋 Hello @sinchinpark, thank you for raising an issue about Ultralytics HUB 🚀! Please visit our HUB Docs to learn more:

  • Quickstart. Start training and deploying YOLO models with HUB in seconds.
  • Datasets: Preparing and Uploading. Learn how to prepare and upload your datasets to HUB in YOLO format.
  • Projects: Creating and Managing. Group your models into projects for improved organization.
  • Models: Training and Exporting. Train YOLOv5 and YOLOv8 models on your custom datasets and export them to various formats for deployment.
  • Integrations. Explore different integration options for your trained models, such as TensorFlow, ONNX, OpenVINO, CoreML, and PaddlePaddle.
  • Ultralytics HUB App. Learn about the Ultralytics App for iOS and Android, which allows you to run models directly on your mobile device.
    • iOS. Learn about YOLO CoreML models accelerated on Apple's Neural Engine on iPhones and iPads.
    • Android. Explore TFLite acceleration on mobile devices.
  • Inference API. Understand how to use the Inference API for running your trained models in the cloud to generate predictions.

If this is a 🐛 Bug Report, please provide screenshots and steps to reproduce your problem to help us get started working on a fix.

If this is a ❓ Question, please provide as much information as possible, including dataset, model, environment details etc. so that we might provide the most helpful response.

We try to respond to all issues as promptly as possible. Thank you for your patience!

@sinchinpark
Author

Sorry, this is a duplicate of #606.

@sergiuwaxmann sergiuwaxmann self-assigned this May 23, 2024
@sergiuwaxmann
Member

sergiuwaxmann commented May 23, 2024

@sinchinpark Did you use the Custom option from the Advanced Model Configuration accordion (read more here) to change the device from 0 to 0,1?
[Screenshot: custom_device]

@sinchinpark
Author

> @sinchinpark Did you use the Custom option from the Advanced Model Configuration accordion (read more here)

Yes, I'm using the HUB portal for all operations (from importing the dataset to training the model).

@sergiuwaxmann
Member

@sinchinpark Our team will investigate this issue and I will update you as soon as possible.
Thank you for your patience!

@sinchinpark
Author

@sergiuwaxmann Thanks
BTW, this is the model ID, in case it helps with the investigation:
https://hub.ultralytics.com/models/zCnR3gSc9n1xTow1CTpS

@sergiuwaxmann
Member

@sinchinpark Thank you!


👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

For additional resources and information, please see the links below:

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO 🚀 and Vision AI ⭐

@github-actions github-actions bot added the Stale label Jun 23, 2024

2 participants