[Python][Docs] Provide guidelines for expectations about and how to investigate memory usage #36378
It might be good to include a link to: https://arrow.apache.org/docs/python/pandas.html#reducing-memory-use-in-table-to-pandas
All of these are notes from @jorisvandenbossche that are possibly relevant for this document: if you have a pandas DataFrame with strings, those won't show up (by default) in the pandas memory usage indicator.
Because pandas by default uses a numpy object-dtype array, i.e. literally an array of Python objects, the memory usage indicator by default only shows the size of the numpy array itself, i.e. just the pointers to the Python objects. This is because iteratively querying the size of all the Python objects in this array can be very costly. So to get the real memory usage, you have to pass `deep=True` to `DataFrame.memory_usage()`.
Some reasons that a pandas DataFrame can be a lot bigger than the table that created it: if you have nested data such as structs, those get converted to dictionaries stored in a pandas object column, which is much less efficient; and boolean values are stored as bytes instead of bits, so that is an 8x size increase.
https://arrow.apache.org/docs/python/generated/pyarrow.log_memory_allocations.html
https://github.com/bloomberg/memray
There might be differences between the amount of memory Arrow requests, and how much the allocator allocates.
Expected differences in behaviour between memory allocators: https://issues.apache.org/jira/browse/ARROW-15730
`pa.log_memory_allocations()` will show additional logging for allocations made through the default memory pool.
Add the information in this thread: #40301
We regularly get reports about potential memory usage issues (memory leaks, ..), and often we need to clarify expectations around what is being observed or give hints on how to explore the memory usage. Given this comes up regularly, it might be useful to gather such content on a page so we can point to that page instead of repeating it every time.
(recent examples: #36100, #36101)
Some aspects that could be useful to mention on such a page:
Some basic background on memory allocation. Of course we can't provide a full tutorial on this, but a few facts might help set expectations. For example this quote from Weston in a recent issue ([Python] Pyarrow Table.pylist doesn't release memory until the program terminates. #36100 (comment)) to explain why memory usage stays high:
Or similar comment from Antoine in [Python] Memory leak in pq.read_table and table.to_pandas #18431 (comment)
List the functionality in `pyarrow` that can help diagnose or verify memory usage: `pa.total_allocated_bytes()`, `release_unused`, ...
More advanced, but mention there are different memory pool implementations, so you can also try using a different one. Each memory pool might also have some options to set (eg `pa.jemalloc_set_decay_ms(0)`).
General tips and tricks (eg run your reproducer multiple times in a row -> it might not keep increasing memory usage after the first time -> in that case it's not a memory leak).
Potentially mention some external tools that can help (eg `memray`).
Other things we could add?
cc @westonpace @pitrou @AlenkaF @anjakefala