
[Python][Docs] Provide guidelines for expectations about and how to investigate memory usage #36378

Open

jorisvandenbossche (Member) opened this issue Jun 29, 2023 · 6 comments

We regularly get reports about potential memory usage issues (memory leaks, ...), and we often need to clarify expectations around what is being observed or give hints on how to explore the memory usage. Given how often this comes up, it would be useful to gather such content on a documentation page we can point to, instead of having to repeat it every time.

(recent examples: #36100, #36101)

Some aspects that could be useful to mention on such a page:

  • Some basic background on memory allocation. Of course we can't provide a full tutorial on this, but a few facts might help set expectations. For example this quote from Weston in a recent issue ([Python] Pyarrow Table.pylist doesn't release memory untill the program terminates.  #36100 (comment)) to explain why memory usage stays high:

    What is happening is that pyarrow is returning the memory back to the allocator (in these graphs I was using the system allocator so we are returning the memory to malloc). However, the allocator is not releasing this memory to the OS. This is because obtaining memory from the OS is expensive and so the allocator tries to avoid it if it can.

    Or similar comment from Antoine in [Python] Memory leak in pq.read_table and table.to_pandas #18431 (comment)

  • List the functionality in pyarrow that can help diagnose or verify memory usage: pa.total_allocated_bytes(), MemoryPool.release_unused(), ... (see the sketch after this list)

  • More advanced, but mention that there are different memory pool implementations, so you can also try using a different one. Each memory pool might also have some options to set (e.g. pa.jemalloc_set_decay_ms(0)).

  • General tips and tricks (e.g. run your reproducer multiple times in a row; if memory usage does not keep increasing after the first run, it is not a memory leak).

  • Potentially mention some external tools that can help (e.g. memray).
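
As a rough illustration of those diagnostics, a minimal sketch (exact numbers and the availability of the jemalloc/mimalloc pools depend on how pyarrow was built):

import pyarrow as pa

# How much memory pyarrow's default memory pool has currently allocated.
print(pa.total_allocated_bytes())

table = pa.table({"col": list(range(1_000_000))})
print(pa.total_allocated_bytes())  # higher now that the table's buffers are allocated

del table
print(pa.total_allocated_bytes())  # drops again once the buffers are freed

# Ask the default pool to return unused memory to the OS (best effort).
pa.default_memory_pool().release_unused()

# There are several allocator implementations; which ones are available
# depends on the build.
print(pa.default_memory_pool().backend_name)
pool = pa.system_memory_pool()  # or pa.jemalloc_memory_pool() / pa.mimalloc_memory_pool()

# jemalloc-specific knob: return dirty pages to the OS immediately
# instead of keeping them around for reuse.
pa.jemalloc_set_decay_ms(0)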

Other things we could add?

cc @westonpace @pitrou @AlenkaF @anjakefala

@anjakefala (Collaborator) commented Dec 7, 2023

All of these are notes from @jorisvandenbossche that are possibly relevant for this document:

If you have a pandas DataFrame with strings, those won't show up (by default) in the pandas memory usage indicator:

In [1]: df = pd.DataFrame({"col": ["some", "strings", "in", "my", "dataset"]*100})

In [2]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   col     500 non-null    object
dtypes: object(1)
memory usage: 4.0+ KB

In [3]: df.memory_usage()
Out[3]: 
Index     132
col      4000
dtype: int64

In [4]: df.memory_usage(deep=True)
Out[4]: 
Index      132
col      30700
dtype: int64

In [5]: df.info(memory_usage="deep")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   col     500 non-null    object
dtypes: object(1)
memory usage: 30.1 KB

Because pandas by default uses a numpy object-dtype array, i.e. literally an array of Python objects, the memory usage indicator by default only shows the size of the numpy array itself, i.e. just the pointers to the Python objects. This is because iteratively querying the size of all the Python objects in this array can be very costly. So to get the real memory usage, you have to pass deep=True to the memory_usage() method.


Verify what table.get_total_buffer_size() returns (this can give a different number than nbytes when the table's buffers are a slice of a bigger dataset: table.nbytes only counts the bytes that are actually used for table, while get_total_buffer_size() gives the total size of the referenced buffers, disregarding any offsets).
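
A minimal sketch of that difference (exact byte counts will vary):

import pyarrow as pa

table = pa.table({"col": list(range(1_000_000))})
sliced = table.slice(0, 10)

# nbytes counts only the bytes logically referenced by the slice ...
print(sliced.nbytes)
# ... while get_total_buffer_size() counts the full underlying buffers,
# which the slice keeps alive.
print(sliced.get_total_buffer_size())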


More info on split_blocks=True (an option of Table.to_pandas):

Nowadays there are not that many operations that trigger consolidation, except for copying a dataframe (copy(), which can also happen under the hood in other operations that return a copy).

In practice, consolidation means that columns of the same data type are stored in a single 2D array. So assume you have a pyarrow table with 10 float columns, which in pyarrow is represented by 10 separate 1D arrays. When converting to pandas, those 10 1D arrays are converted (and thus copied) into a single 2D array.

But this kind of consolidation also means that the conversion can at most roughly double the memory usage of the original table, not more (given that the new 2D array has the same total size as the 10 1D arrays).
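
A minimal sketch of the difference, assuming a table of ten float columns:

import pyarrow as pa

table = pa.table(
    {f"f{i}": [float(x) for x in range(100_000)] for i in range(10)}
)

# Default: the ten float64 columns are consolidated (copied) into one 2D block.
df_consolidated = table.to_pandas()

# split_blocks=True keeps one block per column, which can avoid that copy
# (optionally combined with self_destruct=True to free the Arrow buffers
# during the conversion).
df_split = table.to_pandas(split_blocks=True)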


Some reasons a pandas dataframe can be a lot bigger than the table that created it: if you have nested data such as structs, those get converted to Python dictionaries stored in an object-dtype column, which is much less efficient; and boolean values are stored as bytes instead of bits, so that is an 8x size increase.
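
The boolean case is easy to demonstrate (a sketch; exact numbers may vary slightly):

import pyarrow as pa

# Arrow bit-packs booleans: ~125,000 bytes for 1 million values.
table = pa.table({"flag": pa.array([True, False] * 500_000, type=pa.bool_())})
print(table.nbytes)

# numpy (and hence pandas) stores one byte per boolean: ~1,000,000 bytes.
df = table.to_pandas()
print(df["flag"].to_numpy().nbytes)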


Use table.schema to get the types of the columns, which is important when diagnosing memory footprint confusion.

@anjakefala (Collaborator) commented Jan 31, 2024

https://arrow.apache.org/docs/python/generated/pyarrow.log_memory_allocations.html

pa.log_memory_allocations() will log when Arrow requests and frees memory.
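
A minimal sketch of enabling it:

import pyarrow as pa

# Log every allocation/deallocation made through pyarrow's memory pools.
pa.log_memory_allocations(True)

table = pa.table({"col": list(range(1_000))})
del table

pa.log_memory_allocations(False)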

https://github.com/bloomberg/memray with the --native flag.

  • mention the summary, stats, and flamegraph --temporal subcommands as helpful (see the sketch below).
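
A sketch of a typical memray workflow (the output file name is arbitrary):

python -m memray run --native -o output.bin program.py
python -m memray summary output.bin
python -m memray stats output.bin
python -m memray flamegraph --temporal output.bin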

There might be differences between the amount of memory Arrow requests and how much memory the allocator actually reserves from the OS.

@anjakefala (Collaborator)

Expected differences in behaviour between memory allocators: https://issues.apache.org/jira/browse/ARROW-15730

@anjakefala (Collaborator)

MIMALLOC_SHOW_STATS=1 python program.py

will show additional statistics from mimalloc specifically.

@anjakefala (Collaborator)

Add the information in this thread: #40301
