
Failed to allocate memory #1093

Closed
Ahrovan opened this issue Feb 11, 2023 · 9 comments
Labels
Nano NVidia Jetson hardware Neural network Anything for the neural network, the architecture, training, inferencing

Comments

@Ahrovan

Ahrovan commented Feb 11, 2023

donkey train --tub ./data --model ./models/myModel.h5

2 root error(s)

(0) Resource exhausted: Failed to allocate memory for the batch of component 0
[[node IteratorGetNext (defined at /projects/donkeycar/donkeycar/parts/keras.py:183) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[IteratorGetNext/_6]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

(1) Resource exhausted: Failed to allocate memory for the batch of component 0
[[node IteratorGetNext (defined at /projects/donkeycar/donkeycar/parts/keras.py:183) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Hardware/Software Details:

@Ezward
Contributor

Ezward commented Feb 11, 2023

That branch is 6 years old. You should try the main branch. We are also close to merging a new branch that upgrades to JetPack 4.6 and TensorFlow 2.9.

@Ahrovan
Author

Ahrovan commented Feb 11, 2023

Thank you, @Ezward. How do I try the new branch?
Does JetPack 4.6 have to be installed?

@Ahrovan
Author

Ahrovan commented Feb 12, 2023

@Ezward It still fails to allocate memory, now on the main branch:

using donkey v4.4.4-main ...
INFO:donkeycar.config:loading config file: ./config.py
INFO:donkeycar.config:loading personal config over-rides from ./myconfig.py
2023-02-11 19:44:42.958070: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.2
WARNING:donkeycar.pipeline.database:No model database found at /home/ahrovan/Projects/myCar/models/database.json
INFO:donkeycar.utils:get_model_by_type: model type is: linear
2023-02-11 19:44:52.458462: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2023-02-11 19:44:52.499566: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1046] ARM64 does not support NUMA - returning NUMA node zero
2023-02-11 19:44:52.499729: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1742] Found device 0 with properties:
pciBusID: 0000:00:00.0 name: NVIDIA Tegra X1 computeCapability: 5.3
coreClock: 0.9216GHz coreCount: 1 deviceMemorySize: 3.87GiB deviceMemoryBandwidth: 194.55MiB/s
2023-02-11 19:44:52.499810: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.2
2023-02-11 19:44:52.663623: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2023-02-11 19:44:52.736950: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2023-02-11 19:44:52.843827: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2023-02-11 19:44:52.975454: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2023-02-11 19:44:53.048683: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2023-02-11 19:44:53.052396: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2023-02-11 19:44:53.052787: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1046] ARM64 does not support NUMA - returning NUMA node zero
2023-02-11 19:44:53.053117: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1046] ARM64 does not support NUMA - returning NUMA node zero
2023-02-11 19:44:53.053198: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1884] Adding visible gpu devices: 0
2023-02-11 19:44:53.084356: W tensorflow/core/platform/profile_utils/cpu_utils.cc:108] Failed to find bogomips or clock in /proc/cpuinfo; cannot determine CPU frequency
2023-02-11 19:44:53.085005: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x3eb00150 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2023-02-11 19:44:53.085078: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2023-02-11 19:44:53.222570: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1046] ARM64 does not support NUMA - returning NUMA node zero
2023-02-11 19:44:53.223062: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x3eb7a360 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-02-11 19:44:53.223133: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): NVIDIA Tegra X1, Compute Capability 5.3
2023-02-11 19:44:53.248679: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1046] ARM64 does not support NUMA - returning NUMA node zero
2023-02-11 19:44:53.248849: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1742] Found device 0 with properties:
pciBusID: 0000:00:00.0 name: NVIDIA Tegra X1 computeCapability: 5.3
coreClock: 0.9216GHz coreCount: 1 deviceMemorySize: 3.87GiB deviceMemoryBandwidth: 194.55MiB/s
2023-02-11 19:44:53.248996: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.2
2023-02-11 19:44:53.249223: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2023-02-11 19:44:53.249356: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2023-02-11 19:44:53.249468: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2023-02-11 19:44:53.249573: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2023-02-11 19:44:53.249676: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2023-02-11 19:44:53.249780: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2023-02-11 19:44:53.250062: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1046] ARM64 does not support NUMA - returning NUMA node zero
2023-02-11 19:44:53.250353: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1046] ARM64 does not support NUMA - returning NUMA node zero
2023-02-11 19:44:53.250446: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1884] Adding visible gpu devices: 0
2023-02-11 19:44:53.250572: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.2
2023-02-11 19:44:57.660757: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1283] Device interconnect StreamExecutor with strength 1 edge matrix:
2023-02-11 19:44:57.660846: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1289] 0
2023-02-11 19:44:57.660895: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1302] 0: N
2023-02-11 19:44:57.661470: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1046] ARM64 does not support NUMA - returning NUMA node zero
2023-02-11 19:44:57.661845: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1046] ARM64 does not support NUMA - returning NUMA node zero
2023-02-11 19:44:57.662002: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1428] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1080 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X1, pci bus id: 0000:00:00.0, compute capability: 5.3)
INFO:donkeycar.parts.keras:Created KerasLinear with interpreter: KerasInterpreter
Model: "linear"
________________________________________________________________________________
Layer (type)          Output Shape            Param #     Connected to
================================================================================
img_in (InputLayer)   [(None, 600, 800, 3)]   0
conv2d_1 (Conv2D)     (None, 298, 398, 24)    1824        img_in[0][0]
dropout (Dropout)     (None, 298, 398, 24)    0           conv2d_1[0][0]
conv2d_2 (Conv2D)     (None, 147, 197, 32)    19232       dropout[0][0]
dropout_1 (Dropout)   (None, 147, 197, 32)    0           conv2d_2[0][0]
conv2d_3 (Conv2D)     (None, 72, 97, 64)      51264       dropout_1[0][0]
dropout_2 (Dropout)   (None, 72, 97, 64)      0           conv2d_3[0][0]
conv2d_4 (Conv2D)     (None, 70, 95, 64)      36928       dropout_2[0][0]
dropout_3 (Dropout)   (None, 70, 95, 64)      0           conv2d_4[0][0]
conv2d_5 (Conv2D)     (None, 68, 93, 64)      36928       dropout_3[0][0]
dropout_4 (Dropout)   (None, 68, 93, 64)      0           conv2d_5[0][0]
flattened (Flatten)   (None, 404736)          0           dropout_4[0][0]
dense_1 (Dense)       (None, 100)             40473700    flattened[0][0]
dropout_5 (Dropout)   (None, 100)             0           dense_1[0][0]
dense_2 (Dense)       (None, 50)              5050        dropout_5[0][0]
dropout_6 (Dropout)   (None, 50)              0           dense_2[0][0]
n_outputs0 (Dense)    (None, 1)               51          dropout_6[0][0]
n_outputs1 (Dense)    (None, 1)               51          dropout_6[0][0]
================================================================================
Total params: 40,625,028
Trainable params: 40,625,028
Non-trainable params: 0
________________________________________________________________________________
None
Using catalog /home/ahrovan/Projects/myCar/data/catalog_0.catalog
INFO:donkeycar.pipeline.types:Loading tubs from paths ['./data']
Records # Training 641
Records # Validation 161
Epoch 1/100
2023-02-11 19:45:06.600075: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2023-02-11 19:45:26.891143: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 1474560000 exceeds 10% of free system memory.
Traceback (most recent call last):
File "/home/ahrovan/env/bin/donkey", line 33, in
sys.exit(load_entry_point('donkeycar', 'console_scripts', 'donkey')())
File "/home/ahrovan/Projects/donkeycar/donkeycar/management/base.py", line 626, in execute_from_command_line
c.run(args[2:])
File "/home/ahrovan/Projects/donkeycar/donkeycar/management/base.py", line 563, in run
args.comment)
File "/home/ahrovan/Projects/donkeycar/donkeycar/pipeline/training.py", line 158, in train
show_plot=cfg.SHOW_PLOT)
File "/home/ahrovan/Projects/donkeycar/donkeycar/parts/keras.py", line 183, in train
use_multiprocessing=False)
File "/home/ahrovan/env/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 108, in _method_wrapper
return method(self, *args, **kwargs)
File "/home/ahrovan/env/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 1098, in fit
tmp_logs = train_function(iterator)
File "/home/ahrovan/env/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 780, in call
result = self._call(*args, **kwds)
File "/home/ahrovan/env/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 840, in _call
return self._stateless_fn(*args, **kwds)
File "/home/ahrovan/env/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 2829, in call
return graph_function._filtered_call(args, kwargs) # pylint: disable=protected-access
File "/home/ahrovan/env/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1848, in _filtered_call
cancellation_manager=cancellation_manager)
File "/home/ahrovan/env/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1924, in _call_flat
ctx, args, cancellation_manager=cancellation_manager))
File "/home/ahrovan/env/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 550, in call
ctx=ctx)
File "/home/ahrovan/env/lib/python3.6/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: Failed to allocate memory for the batch of component 0
[[node IteratorGetNext (defined at /Projects/donkeycar/donkeycar/parts/keras.py:183) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

(1) Resource exhausted: Failed to allocate memory for the batch of component 0
[[node IteratorGetNext (defined at /Projects/donkeycar/donkeycar/parts/keras.py:183) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[IteratorGetNext/_6]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_1794]

Function call stack:
train_function -> train_function

@Ezward
Contributor

Ezward commented Feb 21, 2023

What machine are you running training on? Are you perhaps trying to train on a Jetson Nano? I have heard of others doing this, but it is not a supported training machine. I would try reducing the batch size by changing the BATCH_SIZE configuration; try 16.
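The suggestion above goes into myconfig.py, the personal override file the trainer loads on top of config.py (visible in the log output earlier in this thread). BATCH_SIZE is the standard donkeycar config name; the default value mentioned in the comment is 16's starting point, not a documented constant, so treat the numbers as a sketch:

```python
# myconfig.py -- personal overrides loaded on top of config.py
# Smaller batches reduce the memory needed per training step,
# which matters on low-RAM machines like the Jetson Nano.
BATCH_SIZE = 16  # halve again (8, 4, ...) if OOM errors persist
```

Note, however, that as the later comments show, batch size alone did not fix this particular issue: the model itself was too large for the Nano's RAM.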

@Ahrovan
Author

Ahrovan commented Feb 22, 2023

@Ezward Yes, I am trying to train on a Jetson Nano B01. I changed BATCH_SIZE to 1, but it still failed.

@DocGarbanzo
Contributor

@Ahrovan - you seem to be using an image size of 600x800. The linear model you are trying to train is geared towards an image size of 120x160, at which it has about 500k parameters. For 600x800 you would need a model with more compression, i.e. more layers or larger strides. You can see that your model now has about 40,000,000 parameters: the Flatten layer in your model summary produces a ~400k-dimensional vector, which then feeds a 100-unit dense layer, and that alone accounts for roughly 40m parameters. This model is far too big to fit into the RAM of the Nano. So either set the image size to the standard 120x160 or modify the model architecture.

@DocGarbanzo DocGarbanzo added Neural network Anything for the neural network, the architecture, training, inferencing Nano NVidia Jetson hardware labels Feb 22, 2023
@Ahrovan
Author

Ahrovan commented Feb 23, 2023

@DocGarbanzo Thank you

@Ezward
Contributor

Ezward commented Feb 27, 2023

Note that the image should be 120 pixels high and 160 pixels wide. @Ahrovan, have you retried with the new image size?
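For reference, the camera resolution is also set in myconfig.py. IMAGE_W, IMAGE_H, and IMAGE_DEPTH are the usual donkeycar config names, but check the config.py in your own checkout for the exact spelling in your version; this is a sketch, not a guaranteed recipe:

```python
# myconfig.py -- restore the resolution the stock "linear" model expects
IMAGE_W = 160    # image width in pixels
IMAGE_H = 120    # image height in pixels
IMAGE_DEPTH = 3  # RGB channels
```

Data already recorded at 600x800 would need to be re-collected (or resized) at this resolution before training.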

@Ezward
Contributor

Ezward commented Mar 4, 2023

@Ahrovan I am going to close this; if you have more info please do add a comment to the closed issue.

@Ezward Ezward closed this as completed Mar 4, 2023