Training killed from the beginning #536

Closed
eder1234 opened this issue Jan 15, 2024 · 4 comments
Labels
app: Issue related to Ultralytics HUB App
question: A HUB question that does not involve a bug

Comments

@eder1234

Search before asking

Question

Hi, I would like to know why the training is being killed when I run it locally.

Additional

Ultralytics YOLOv8.1.1 🚀 Python-3.10.13 torch-2.1.2+cu121 CUDA:0 (NVIDIA GeForce RTX 4050 Laptop GPU, 5905MiB)
Setup complete ✅ (12 CPUs, 15.3 GB RAM, 69.3/199.9 GB disk)
Ultralytics HUB: New authentication successful ✅
Ultralytics HUB: View model at https://hub.ultralytics.com/models/HLv7cxztEUvk5eWJdJ9C 🚀
Downloading https://github.com/ultralytics/assets/releases/download/v8.1.0/yolov8s-cls.pt to 'yolov8s-cls.pt'...
100%|██████████████████████████████████████| 12.2M/12.2M [00:08<00:00, 1.54MB/s]
Ultralytics YOLOv8.1.1 🚀 Python-3.10.13 torch-2.1.2+cu121 CUDA:0 (NVIDIA GeForce RTX 4050 Laptop GPU, 5905MiB)
engine/trainer: task=classify, mode=train, model=yolov8s-cls.pt, data=https://storage.googleapis.com/ultralytics-hub.appspot.com/users/ZWUKwk47LeVGf1U0Cw0uiZmR8HQ2/datasets/F3xQK5zKgATriBNeyhOM/classify.zip, epochs=100, time=None, patience=100, batch=-1, imgsz=640, save=True, save_period=-1, cache=ram, device=0, workers=8, project=None, name=train6, exist_ok=False, pretrained=True, optimizer=auto, verbose=True, seed=0, deterministic=True, single_cls=False, rect=False, cos_lr=False, close_mosaic=10, resume=False, amp=True, fraction=1.0, profile=False, freeze=None, multi_scale=False, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, vid_stride=1, stream_buffer=False, visualize=False, augment=False, agnostic_nms=False, classes=None, retina_masks=False, embed=None, show=False, save_frames=False, save_txt=False, save_conf=False, save_crop=False, show_labels=True, show_conf=True, show_boxes=True, line_width=None, format=torchscript, keras=False, optimize=False, int8=False, dynamic=False, simplify=False, opset=None, workspace=4, nms=False, lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, dfl=1.5, pose=12.0, kobj=1.0, label_smoothing=0.0, nbs=64, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0, auto_augment=randaugment, erasing=0.4, crop_fraction=1.0, cfg=None, tracker=botsort.yaml, save_dir=runs/classify/train6
WARNING ⚠️ Skipping /home/rodriguez/datasets/classify.zip unzip as destination directory /home/rodriguez/datasets/classify is not empty.
train: /home/rodriguez/datasets/classify/train... found 2450 images in 5 classes ✅
val: /home/rodriguez/datasets/classify/val... found 525 images in 5 classes ✅
test: /home/rodriguez/datasets/classify/test... found 525 images in 5 classes ✅
Overriding model.yaml nc=1000 with nc=5

                   from  n    params  module                                       arguments
  0                  -1  1       928  ultralytics.nn.modules.conv.Conv             [3, 32, 3, 2]
  1                  -1  1     18560  ultralytics.nn.modules.conv.Conv             [32, 64, 3, 2]
  2                  -1  1     29056  ultralytics.nn.modules.block.C2f             [64, 64, 1, True]
  3                  -1  1     73984  ultralytics.nn.modules.conv.Conv             [64, 128, 3, 2]
  4                  -1  2    197632  ultralytics.nn.modules.block.C2f             [128, 128, 2, True]
  5                  -1  1    295424  ultralytics.nn.modules.conv.Conv             [128, 256, 3, 2]
  6                  -1  2    788480  ultralytics.nn.modules.block.C2f             [256, 256, 2, True]
  7                  -1  1   1180672  ultralytics.nn.modules.conv.Conv             [256, 512, 3, 2]
  8                  -1  1   1838080  ultralytics.nn.modules.block.C2f             [512, 512, 1, True]
  9                  -1  1    664325  ultralytics.nn.modules.head.Classify         [512, 5]
YOLOv8s-cls summary: 99 layers, 5087141 parameters, 5087141 gradients, 12.6 GFLOPs
Transferred 156/158 items from pretrained weights
AMP: running Automatic Mixed Precision (AMP) checks with YOLOv8n...
AMP: checks passed ✅
AutoBatch: Computing optimal batch size for imgsz=640
AutoBatch: CUDA:0 (NVIDIA GeForce RTX 4050 Laptop GPU) 5.77G total, 0.22G reserved, 0.07G allocated, 5.48G free
     Params      GFLOPs  GPU_mem (GB)  forward (ms)  backward (ms)             input   output
    5087141       12.58         0.392         34.26          20.78   (1, 3, 640, 640)   (1, 5)
    5087141       25.17         0.516         4.312          14.08   (2, 3, 640, 640)   (2, 5)
    5087141       50.34         0.761         9.301          26.13   (4, 3, 640, 640)   (4, 5)
    5087141       100.7         1.275         21.11             32   (8, 3, 640, 640)   (8, 5)
    5087141       201.4         2.282         47.01          63.57  (16, 3, 640, 640)  (16, 5)
AutoBatch: Using batch-size 23 for CUDA:0 3.45G/5.77G (60%) ✅
train: Scanning /home/rodriguez/datasets/classify/train... 2450 images, 0 corrup
val: Scanning /home/rodriguez/datasets/classify/val... 525 images, 0 corrupt: 10
optimizer: 'optimizer=auto' found, ignoring 'lr0=0.01' and 'momentum=0.937' and determining best 'optimizer', 'lr0' and 'momentum' automatically...
optimizer: AdamW(lr=0.000714, momentum=0.9) with parameter groups 26 weight(decay=0.0), 27 weight(decay=0.0005390625), 27 bias(decay=0.0)
100 epochs...

  Epoch    GPU_mem       loss  Instances       Size

0%| | 0/107 [00:00<?, ?it/s]Killed

eder1234 added the question (A HUB question that does not involve a bug) label on Jan 15, 2024

👋 Hello @eder1234, thank you for raising an issue about Ultralytics HUB 🚀! Please visit our HUB Docs to learn more:

  • Quickstart. Start training and deploying YOLO models with HUB in seconds.
  • Datasets: Preparing and Uploading. Learn how to prepare and upload your datasets to HUB in YOLO format.
  • Projects: Creating and Managing. Group your models into projects for improved organization.
  • Models: Training and Exporting. Train YOLOv5 and YOLOv8 models on your custom datasets and export them to various formats for deployment.
  • Integrations. Explore different integration options for your trained models, such as TensorFlow, ONNX, OpenVINO, CoreML, and PaddlePaddle.
  • Ultralytics HUB App. Learn about the Ultralytics App for iOS and Android, which allows you to run models directly on your mobile device.
    • iOS. Learn about YOLO CoreML models accelerated on Apple's Neural Engine on iPhones and iPads.
    • Android. Explore TFLite acceleration on mobile devices.
  • Inference API. Understand how to use the Inference API for running your trained models in the cloud to generate predictions.

If this is a 🐛 Bug Report, please provide screenshots and steps to reproduce your problem to help us get started working on a fix.

If this is a ❓ Question, please provide as much information as possible, including dataset, model, environment details etc. so that we might provide the most helpful response.

We try to respond to all issues as promptly as possible. Thank you for your patience!

pderrenger added the app (Issue related to Ultralytics HUB App) label on Jan 15, 2024
@UltralyticsAssistant
Member

@eder1234 hello! It looks like your training process is being terminated early, which could be due to a few reasons. Here are some common causes to consider:

  1. Insufficient Memory: Your GPU has 5905MiB of memory, which might not be enough for the batch size you're using. The training might be killed if the system runs out of memory. Try reducing the batch size or image size to see if that helps.

  2. System Limits: The operating system might have limits on user processes, which can cause the training to be killed if those limits are exceeded. Check your system's limits with commands like ulimit -a and adjust them if necessary.

  3. Out-of-Memory (OOM) Killer: If your system is running out of RAM, the OOM Killer might terminate processes to free up memory. Monitor your system's memory usage during training to see if this is happening.

  4. Software Issues: Ensure that all dependencies are correctly installed and that there are no conflicts between software versions.

  5. Hardware Issues: There might be issues with your hardware, such as overheating or hardware failure, that could cause the process to be killed.

To troubleshoot, you can start by monitoring system resources during training, reducing the batch size, and checking for any system logs that might indicate why the process was killed. If you continue to face issues, please provide more details, such as any error messages or logs, so we can assist you further.
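As a concrete starting point, the checks and a lighter local run could look roughly like this from a terminal. The dmesg and free commands are standard Linux tools; the yolo arguments are standard Ultralytics flags, but the dataset path and the batch/imgsz values below are only illustrative:

# Check whether the kernel's OOM killer terminated the process
sudo dmesg | grep -iE "out of memory|killed process"

# Watch RAM usage in a second terminal while training runs
free -h

# Re-launch a lighter run locally: smaller images, a fixed small batch, no RAM caching
yolo classify train model=yolov8s-cls.pt data=path/to/classify epochs=100 imgsz=224 batch=8 cache=False workers=2

Disabling cache=ram in particular avoids loading the whole dataset into system memory before the first epoch starts.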

For more detailed guidance on troubleshooting, you can refer to the Ultralytics HUB Docs. Good luck with your training! 🚀

@eder1234
Author

Thank you! Indeed, my system ran out of RAM. Therefore, I increased the swap memory and it works better now.
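For anyone hitting the same "Killed" message: on a typical Ubuntu/Debian machine, adding swap along these lines is enough (the 16G size is only an example, pick whatever your disk allows):

sudo fallocate -l 16G /swapfile   # reserve a 16 GB file for swap (size is illustrative)
sudo chmod 600 /swapfile          # restrict access to root
sudo mkswap /swapfile             # format the file as swap space
sudo swapon /swapfile             # enable it for the current session
swapon --show                     # confirm the new swap area is active

To keep the swap across reboots, the file also needs an entry in /etc/fstab.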

@UltralyticsAssistant
Member

You're welcome, @eder1234! I'm glad to hear that increasing the swap memory resolved the issue. Remember that using swap memory can slow down the training process since it's not as fast as RAM, but it's a good workaround when physical RAM is limited. If you have any more questions or run into further issues, feel free to reach out. Happy training! 🎉
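If you want to confirm how much of the training is actually spilling into swap, a couple of standard tools in a second terminal are enough (nothing Ultralytics-specific):

watch -n 2 free -h   # refresh RAM and swap usage every 2 seconds
vmstat 2             # the si/so columns show pages being swapped in and out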
