
Failed to allocate memory #1093

Closed
Ahrovan opened this issue Feb 11, 2023 · 9 comments
Labels
Nano NVidia Jetson hardware Neural network Anything for the neural network, the architecture, training, inferencing

Comments

@Ahrovan

Ahrovan commented Feb 11, 2023

donkey train --tub ./data --model ./models/myModel.h5

2 root error(s)

(0) Resource exhausted: Failed to allocate memory for the batch of component 0
[[node IteratorGetNext (defined at /projects/donkeycar/donkeycar/parts/keras.py:183) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[IteratorGetNext/_6]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

(1) Resource exhausted: Failed to allocate memory for the batch of component 0
[[node IteratorGetNext (defined at /projects/donkeycar/donkeycar/parts/keras.py:183) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Hardware/Software Details:

@Ezward
Contributor

Ezward commented Feb 11, 2023

That branch is 6 years old. You should try the main branch. We are also close to merging a new branch that upgrades to JetPack 4.6 and TensorFlow 2.9.

@Ahrovan
Author

Ahrovan commented Feb 11, 2023

Thank you, @Ezward. How do I try the new branch?
Does JetPack 4.6 have to be installed?

@Ahrovan
Author

Ahrovan commented Feb 12, 2023

@Ezward It still fails to allocate memory, now on the main branch:

using donkey v4.4.4-main ...
INFO:donkeycar.config:loading config file: ./config.py
INFO:donkeycar.config:loading personal config over-rides from ./myconfig.py
2023-02-11 19:44:42.958070: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.2
WARNING:donkeycar.pipeline.database:No model database found at /home/ahrovan/Projects/myCar/models/database.json
INFO:donkeycar.utils:get_model_by_type: model type is: linear
2023-02-11 19:44:52.458462: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2023-02-11 19:44:52.499566: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1046] ARM64 does not support NUMA - returning NUMA node zero
2023-02-11 19:44:52.499729: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1742] Found device 0 with properties:
pciBusID: 0000:00:00.0 name: NVIDIA Tegra X1 computeCapability: 5.3
coreClock: 0.9216GHz coreCount: 1 deviceMemorySize: 3.87GiB deviceMemoryBandwidth: 194.55MiB/s
2023-02-11 19:44:52.499810: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.2
2023-02-11 19:44:52.663623: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2023-02-11 19:44:52.736950: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2023-02-11 19:44:52.843827: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2023-02-11 19:44:52.975454: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2023-02-11 19:44:53.048683: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2023-02-11 19:44:53.052396: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2023-02-11 19:44:53.052787: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1046] ARM64 does not support NUMA - returning NUMA node zero
2023-02-11 19:44:53.053117: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1046] ARM64 does not support NUMA - returning NUMA node zero
2023-02-11 19:44:53.053198: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1884] Adding visible gpu devices: 0
2023-02-11 19:44:53.084356: W tensorflow/core/platform/profile_utils/cpu_utils.cc:108] Failed to find bogomips or clock in /proc/cpuinfo; cannot determine CPU frequency
2023-02-11 19:44:53.085005: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x3eb00150 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2023-02-11 19:44:53.085078: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2023-02-11 19:44:53.222570: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1046] ARM64 does not support NUMA - returning NUMA node zero
2023-02-11 19:44:53.223062: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x3eb7a360 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-02-11 19:44:53.223133: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): NVIDIA Tegra X1, Compute Capability 5.3
2023-02-11 19:44:53.248679: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1046] ARM64 does not support NUMA - returning NUMA node zero
2023-02-11 19:44:53.248849: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1742] Found device 0 with properties:
pciBusID: 0000:00:00.0 name: NVIDIA Tegra X1 computeCapability: 5.3
coreClock: 0.9216GHz coreCount: 1 deviceMemorySize: 3.87GiB deviceMemoryBandwidth: 194.55MiB/s
2023-02-11 19:44:53.248996: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.2
2023-02-11 19:44:53.249223: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2023-02-11 19:44:53.249356: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2023-02-11 19:44:53.249468: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2023-02-11 19:44:53.249573: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2023-02-11 19:44:53.249676: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2023-02-11 19:44:53.249780: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2023-02-11 19:44:53.250062: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1046] ARM64 does not support NUMA - returning NUMA node zero
2023-02-11 19:44:53.250353: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1046] ARM64 does not support NUMA - returning NUMA node zero
2023-02-11 19:44:53.250446: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1884] Adding visible gpu devices: 0
2023-02-11 19:44:53.250572: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.2
2023-02-11 19:44:57.660757: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1283] Device interconnect StreamExecutor with strength 1 edge matrix:
2023-02-11 19:44:57.660846: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1289] 0
2023-02-11 19:44:57.660895: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1302] 0: N
2023-02-11 19:44:57.661470: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1046] ARM64 does not support NUMA - returning NUMA node zero
2023-02-11 19:44:57.661845: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1046] ARM64 does not support NUMA - returning NUMA node zero
2023-02-11 19:44:57.662002: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1428] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1080 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X1, pci bus id: 0000:00:00.0, compute capability: 5.3)
INFO:donkeycar.parts.keras:Created KerasLinear with interpreter: KerasInterpreter
Model: "linear"
________________________________________________________________________________
Layer (type)          Output Shape            Param #     Connected to
================================================================================
img_in (InputLayer)   [(None, 600, 800, 3)]   0
conv2d_1 (Conv2D)     (None, 298, 398, 24)    1824        img_in[0][0]
dropout (Dropout)     (None, 298, 398, 24)    0           conv2d_1[0][0]
conv2d_2 (Conv2D)     (None, 147, 197, 32)    19232       dropout[0][0]
dropout_1 (Dropout)   (None, 147, 197, 32)    0           conv2d_2[0][0]
conv2d_3 (Conv2D)     (None, 72, 97, 64)      51264       dropout_1[0][0]
dropout_2 (Dropout)   (None, 72, 97, 64)      0           conv2d_3[0][0]
conv2d_4 (Conv2D)     (None, 70, 95, 64)      36928       dropout_2[0][0]
dropout_3 (Dropout)   (None, 70, 95, 64)      0           conv2d_4[0][0]
conv2d_5 (Conv2D)     (None, 68, 93, 64)      36928       dropout_3[0][0]
dropout_4 (Dropout)   (None, 68, 93, 64)      0           conv2d_5[0][0]
flattened (Flatten)   (None, 404736)          0           dropout_4[0][0]
dense_1 (Dense)       (None, 100)             40473700    flattened[0][0]
dropout_5 (Dropout)   (None, 100)             0           dense_1[0][0]
dense_2 (Dense)       (None, 50)              5050        dropout_5[0][0]
dropout_6 (Dropout)   (None, 50)              0           dense_2[0][0]
n_outputs0 (Dense)    (None, 1)               51          dropout_6[0][0]
n_outputs1 (Dense)    (None, 1)               51          dropout_6[0][0]
================================================================================
Total params: 40,625,028
Trainable params: 40,625,028
Non-trainable params: 0
________________________________________________________________________________
None
Using catalog /home/ahrovan/Projects/myCar/data/catalog_0.catalog
INFO:donkeycar.pipeline.types:Loading tubs from paths ['./data']
Records # Training 641
Records # Validation 161
Epoch 1/100
2023-02-11 19:45:06.600075: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2023-02-11 19:45:26.891143: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 1474560000 exceeds 10% of free system memory.
Traceback (most recent call last):
File "/home/ahrovan/env/bin/donkey", line 33, in
sys.exit(load_entry_point('donkeycar', 'console_scripts', 'donkey')())
File "/home/ahrovan/Projects/donkeycar/donkeycar/management/base.py", line 626, in execute_from_command_line
c.run(args[2:])
File "/home/ahrovan/Projects/donkeycar/donkeycar/management/base.py", line 563, in run
args.comment)
File "/home/ahrovan/Projects/donkeycar/donkeycar/pipeline/training.py", line 158, in train
show_plot=cfg.SHOW_PLOT)
File "/home/ahrovan/Projects/donkeycar/donkeycar/parts/keras.py", line 183, in train
use_multiprocessing=False)
File "/home/ahrovan/env/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 108, in _method_wrapper
return method(self, *args, **kwargs)
File "/home/ahrovan/env/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 1098, in fit
tmp_logs = train_function(iterator)
File "/home/ahrovan/env/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 780, in call
result = self._call(*args, **kwds)
File "/home/ahrovan/env/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 840, in _call
return self._stateless_fn(*args, **kwds)
File "/home/ahrovan/env/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 2829, in call
return graph_function._filtered_call(args, kwargs) # pylint: disable=protected-access
File "/home/ahrovan/env/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1848, in _filtered_call
cancellation_manager=cancellation_manager)
File "/home/ahrovan/env/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1924, in _call_flat
ctx, args, cancellation_manager=cancellation_manager))
File "/home/ahrovan/env/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 550, in call
ctx=ctx)
File "/home/ahrovan/env/lib/python3.6/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: Failed to allocate memory for the batch of component 0
[[node IteratorGetNext (defined at /Projects/donkeycar/donkeycar/parts/keras.py:183) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

(1) Resource exhausted: Failed to allocate memory for the batch of component 0
[[node IteratorGetNext (defined at /Projects/donkeycar/donkeycar/parts/keras.py:183) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[IteratorGetNext/_6]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_1794]

Function call stack:
train_function -> train_function

@Ezward
Contributor

Ezward commented Feb 21, 2023

What machine are you running training on? Are you perhaps trying to train on a Jetson Nano? I have heard of others doing this, but it is not a supported training machine. I would try reducing the batch size by changing the BATCH_SIZE configuration; try 16.
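The suggestion above goes into myconfig.py, the personal override file the trainer loads on top of config.py (visible in the log output earlier in this thread). BATCH_SIZE is the standard donkeycar config name; the default value mentioned in the comment is 16's starting point, not a documented constant, so treat the numbers as a sketch:

```python
# myconfig.py -- personal overrides loaded on top of config.py
# Smaller batches reduce the memory needed per training step,
# which matters on low-RAM machines like the Jetson Nano.
BATCH_SIZE = 16  # halve again (8, 4, ...) if OOM errors persist
```

Note, however, that as the later comments show, batch size alone did not fix this particular issue: the model itself was too large for the Nano's RAM.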

@Ahrovan
Author

Ahrovan commented Feb 22, 2023

@Ezward Yes, I am trying to train on a Jetson Nano B01. I changed BATCH_SIZE to 1, but it still failed.

@DocGarbanzo
Contributor

@Ahrovan - you seem to be using an image size of 600x800. The linear model you are trying to train is geared towards an image size of 120x160, at which it has about 500k parameters. For 600x800 you would need a model with more compression, i.e. more layers or larger strides. You can see that your model now has about 40,000,000 parameters: the Flatten layer in your model summary produces a ~400k-dimensional vector, which then feeds a 100-unit dense layer, and that alone accounts for roughly 40m parameters. This model is far too big to fit into the RAM of the Nano. So either set the image size to the standard 120x160 or modify the model architecture.

@DocGarbanzo DocGarbanzo added Neural network Anything for the neural network, the architecture, training, inferencing Nano NVidia Jetson hardware labels Feb 22, 2023
@Ahrovan
Author

Ahrovan commented Feb 23, 2023

@DocGarbanzo Thank you

@Ezward
Contributor

Ezward commented Feb 27, 2023

Note that the image should be 120 pixels high and 160 pixels wide. @Ahrovan, have you retried with the new image size?
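For reference, the camera resolution is also set in myconfig.py. IMAGE_W, IMAGE_H, and IMAGE_DEPTH are the usual donkeycar config names, but check the config.py in your own checkout for the exact spelling in your version; this is a sketch, not a guaranteed recipe:

```python
# myconfig.py -- restore the resolution the stock "linear" model expects
IMAGE_W = 160    # image width in pixels
IMAGE_H = 120    # image height in pixels
IMAGE_DEPTH = 3  # RGB channels
```

Data already recorded at 600x800 would need to be re-collected (or resized) at this resolution before training.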

@Ezward
Contributor

Ezward commented Mar 4, 2023

@Ahrovan I am going to close this; if you have more info please do add a comment to the closed issue.

@Ezward Ezward closed this as completed Mar 4, 2023