
debugging out-of-memory exception #75

Open
tlh24 opened this issue Dec 13, 2022 · 12 comments

tlh24 commented Dec 13, 2022

Hello,

I've been trying to shift a hybrid OCaml/Python program to mostly OCaml. Part of this program is a simple image collision test; when I implemented it in OCaml, it was >2x slower than the Python equivalent. Digging a bit, I noticed that the OCaml implementation needs a lot of memory. The following is a minimal working example that runs into the same memory leak / out-of-memory problem:

open Torch
open Unix

let image_count = 2048
let image_res = 30

(* 
test the ocaml equivalent of (python): 
	dbf = th.ones(image_count, image_res, image_res)
	d = th.sum((dbf - a)**2, (1,2))
	mindex = th.argmin(d)
	dist = d[mindex]
*)

let image_dist dbf img = 
	let d = Tensor.( (dbf - img) ) in
	(* per-element square and sum *)
	let d2 = Tensor.einsum ~equation:"ijk, ijk -> i" [d;d] ~path:None in
	let mindex = Tensor.argmin d2 ~dim:None ~keepdim:true 
		|> Tensor.int_value in
	let dist = Tensor.get d2 mindex |> Tensor.float_value in
	dist,mindex

let () = 
	Unix.clear_nonblock stdin; 
	Printf.printf "cuda available: %b\n%!" (Cuda.is_available ());
	let device = Torch.Device.cuda_if_available () in
	(* dbf is a tensor of images to be compared (MSE) against *)
	let dbf = Tensor.( 
		( ones [image_count; image_res; image_res] ) * (f (-1.0))) 
		|> Tensor.to_device ~device in
	let start = Unix.gettimeofday () in
	for _i = 0 to 100_000 do (
		(* generate a random image *)
		let img = Tensor.(randn [image_res; image_res] ) 
			|> Tensor.to_device ~device in
		ignore( image_dist dbf img )
		(* in the actual program, we do something with dist,mindex *)
	) done; 
	let stop = Unix.gettimeofday () in
	Printf.printf "100k image_dist calc time: %fs\n%!" 
		(stop -. start);

Would love to figure out how to get this working. I suspect tensors are being allocated on every loop iteration, and the GC is not getting around to freeing them. Would love to make it at least as performant as Python (can't believe I'm saying that!) -- perhaps by streamlining "d = th.sum((dbf - img)**2, (1,2))"?

Any advice much appreciated.

LaurentMazare (Owner) commented

That's indeed one of the disadvantages of using a GC rather than ref-counting: GPU memory is handled via RAII mechanisms in libtorch, so it is only collected when the GC triggers and collects the data, rather than being freed as early as possible.
The easy way to get around this is to trigger the GC manually, though there is a balance to find here as running the GC has a significant cost. This is done in most examples, e.g. for min-gpt on this line.
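Something along these lines, replacing the main loop of your MWE (a rough sketch, not the exact min-gpt code; collecting every 30 iterations is arbitrary and worth tuning):

    for i = 0 to 100_000 do (
        let img = Tensor.(randn [image_res; image_res])
            |> Tensor.to_device ~device in
        ignore (image_dist dbf img);
        (* periodically force a full major collection so that the finalisers
           attached to the intermediate tensors run and free the GPU memory *)
        if i mod 30 = 29 then Gc.full_major ()
    ) done;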


tlh24 commented Dec 13, 2022

Thanks, yes -- I found that adding this at the end of the for-loop body (with the loop variable _i renamed to i):

if i mod 30 = 29 then 
    Caml.Gc.major()

keeps the memory from blowing up. It also improves the performance to the point that it's faster than PyTorch. Yay.

tlh24 closed this as completed Dec 13, 2022

tlh24 commented Apr 14, 2023

Hi Laurent,

Still struggling with this issue; repeated calls to Caml.Gc.major () or Caml.Gc.full_major () don't seem to be helping.

I'm guessing that the GC sees the tensors only as their small OCaml-side pointers/structures, not as the many MB of GPU RAM that they consume, so it chooses not to deallocate them? For example, the change in GC state below corresponds to losing >10 GB of GPU RAM, thereby making the app fail.

------- before algorithm:
minor_collections:      8269
major_collections:      10
compactions:            0
forced_major_collections: 1

minor_words:    61346823
promoted_words:  8816555
major_words:    17280763

top_heap_words: 17656542
heap_words:      4194144
live_words:      3420356
free_words:       768036
largest_free:          0
fragments:          5752

live_blocks: 856672
free_blocks: 0
heap_chunks: 0

------- after algorithm (similar to Dec 12 code):
minor_collections:      8412            (+143)
major_collections:      13              (+3)
compactions:            0
forced_major_collections: 2

minor_words:    61731907                (+385084)
promoted_words:  8938926                (+122371)
major_words:    17404160                (+123397)

top_heap_words: 17656542                (same)
heap_words:      4109158                (-84986)
live_words:      3420403                (+47)
free_words:       682575                (-85461)
largest_free:          0
fragments:          6180                (+428)

live_blocks: 842693                     (-13979)
free_blocks: 0
heap_chunks: 0

------- after Gc.full_major ():
minor_collections:      8646            (+234)
major_collections:      19              (+6)
compactions:            0
forced_major_collections: 4

minor_words:    61732479                (+572)
promoted_words:  8938926                (same)
major_words:    17404160                (same)

top_heap_words: 17656542                (same)
heap_words:      4080486                (-28672)
live_words:      3374323                (-46080)
free_words:       700011                (+17436)
largest_free:          0
fragments:          6152                (-28)

live_blocks: 827333                     (-15360)
free_blocks: 0
heap_chunks: 0

Note: the code from Dec 12, with repeated calls to Caml.Gc.full_major(), consumes >50x the GPU RAM that it naively ought to... excluding the 916 MB allocated by default ...

Am considering writing the critical code in C++ & deallocating through RAII, as you mention. If there are examples of this in your source code, I would be happy to study them & report back.

Thank you again for this excellent library!

tlh24 reopened this Apr 14, 2023

tlh24 commented Apr 15, 2023

In wrapper_generated.ml, there are (deferred) calls to C.Tensor.free:
Gc.finalise C.Tensor.free t0;
via

open Ctypes
module C = Torch_bindings.C (Torch_generated)
open C.TensorG 

Is it possible to call C.Tensor.free directly? It seems easier than writing it in C++.

LaurentMazare (Owner) commented

It's indeed the case that the GC doesn't see the tensor as occupying a large amount of memory; however, this should not be an issue, as the call to Gc.full_major() should collect all the dangling memory regardless of how much it uses (the GC knowing about the memory usage is only useful for deciding when to trigger a collection).
So if you still see memory usage increasing despite regular full major collections, there is a deeper issue somewhere: it could be a bug in ocaml-torch, or your code may somehow retain references to the tensors. I would suggest reducing the example as much as possible until the memory leak disappears; hopefully this will give an idea of what is going on (and if you end up with a very short repro, that would be useful for debugging the issue if it's within ocaml-torch).
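For instance, a repro stripped down to just allocation and device transfer (a sketch; if this alone grows in GPU memory despite the forced collections, the issue is likely in the bindings, otherwise the leak is in the parts that were dropped):

    open Torch

    let () =
      let device = Device.cuda_if_available () in
      for i = 0 to 10_000 do
        (* allocate a tensor, move it to the GPU, and drop it immediately *)
        let img = Tensor.randn [ 30; 30 ] |> Tensor.to_device ~device in
        ignore img;
        Gc.full_major ();
        if i mod 1_000 = 0 then Printf.printf "iteration %d\n%!" i
      done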


tlh24 commented Apr 16, 2023 via email


tlh24 commented Apr 17, 2023

Can you try this?
https://github.com/tlh24/ocaml-torch-leaktest

Calling Gc.full_major () does not decrease the memory allocation.

Interestingly, memory allocation climbs for the first several iterations and then saturates by the 20th.


tlh24 commented Apr 25, 2023

What if this is some issue with gradient tracing (??)

LaurentMazare (Owner) commented

Sorry, I haven't found the time to look at your repro so far. Gradient tracing may indeed be a culprit, though that issue usually happens when there is some form of global accumulator, which I don't see in your code. Anyway, you can try running this within a Tensor.no_grad block to deactivate gradient tracing. It might be worth looking at this PyTorch FAQ too in case anything there is related. The fact that memory allocation climbs only for the first iterations makes me more suspicious of the allocator doing some caching; it could be interesting to check what happens when using a cpu device rather than a gpu, as well as whether the same thing happens when running similar code with the Python API.
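A minimal sketch of those two checks (assuming Tensor.no_grad wraps a thunk, as in the repo's examples; sizes and iteration counts are arbitrary):

    open Torch

    (* run the hot loop on the given device with gradient tracing disabled
       and a forced major collection on every iteration *)
    let run_loop device =
      let dbf = Tensor.ones [ 2048; 30; 30 ] |> Tensor.to_device ~device in
      for _i = 1 to 1_000 do
        Tensor.no_grad (fun () ->
          let img = Tensor.randn [ 30; 30 ] |> Tensor.to_device ~device in
          let d = Tensor.(dbf - img) in
          ignore (Tensor.einsum ~equation:"ijk, ijk -> i" [ d; d ] ~path:None));
        Gc.full_major ()
      done

    let () =
      (* compare the GPU and CPU runs to see whether the residual growth is
         specific to the CUDA caching allocator *)
      run_loop (Device.cuda_if_available ());
      run_loop Device.Cpu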


tlh24 commented Apr 25, 2023 via email

LaurentMazare (Owner) commented

I just tried your example, and it seems to me that after adding a call to Gc.full_major at the beginning of each loop iteration, the GPU memory stays roughly constant. This would tend to agree with some allocation caching taking place within libtorch, so it's not really a leak on the OCaml side.


tlh24 commented Apr 26, 2023

Yes, that makes sense -- I wonder what it's caching, though. Would be really nice to get my GPU RAM back.

FWIW, I added a C++ test (thank you, gpt4), which uses even less memory. I suppose I can FFI it? Might be useful for other ocaml-torch users?
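In case it's useful: a rough sketch of what that FFI could look like with ctypes' Foreign module, assuming the C++ side exposes a C-linkage entry point (the name and signature below are made up for illustration):

    open Ctypes
    open Foreign

    (* hypothetical C side:
         extern "C" double image_dist(const float *dbf, const float *img,
                                       int64_t n_images, int64_t res,
                                       int64_t *mindex);
       returning the minimum squared distance and writing the argmin index *)
    let image_dist =
      foreign "image_dist"
        (ptr float @-> ptr float @-> int64_t @-> int64_t @-> ptr int64_t
         @-> returning double)

Calling it would then mostly be a matter of getting raw pointers to the pixel data, e.g. via a bigarray and Ctypes.bigarray_start.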
