
feat(python): allow low memory df.to_pandas() #6720

Closed
dxe4 wants to merge 1 commit from the implement_low_memory_to_pandas branch

Conversation

@dxe4 commented Feb 7, 2023

df.to_pandas() takes more memory than it should (benchmark attached): the current to_pandas() peaks at ~14000 MB for a ~6000 MB df, while the new version peaks at around ~6700 MB.
This change allows running the conversion faster and with lower memory, but it consumes (deletes) the original data.
Although deleting the original data is not ideal, if the choice is between an OOM (and therefore not using polars) and allowing this, I believe it is a good trade-off.
I don't know the codebase well, sorry if I made any bad assumptions.

(screenshot: memory profile comparing the two to_pandas variants)

code used for profiling (note this is a bit different from the committed code, but it shouldn't make a big difference):

import time

import pandas as pd
import polars as pl

path = "data.tsv"  # placeholder: a large tab-separated file

def current_to_pandas():
    df = pl.read_csv(path, sep="\t")
    df = df.to_pandas()

def low_mem_to_pandas():
    df = pl.read_csv(path, sep="\t")
    new_df = pd.DataFrame()
    for column in df.columns:
        # copy one column over, then free it on the polars side
        # (df._df is polars' internal handle; drop_in_place releases the column)
        new_df.loc[:, column] = df[column].to_arrow()
        df._df.drop_in_place(column)

def profile_to_pandas():
    current_to_pandas()
    time.sleep(1)
    low_mem_to_pandas()

@github-actions bot added the enhancement (New feature or an improvement of an existing feature) and python (Related to Python Polars) labels on Feb 7, 2023
@ritchie46 (Member) commented:

I am not convinced doing this column by column is faster. We use pyarrow conversion designed by Wes himself.

https://ursalabs.org/blog/fast-pandas-loading/

It might be faster in your case where you are tight on memory, but I first want to understand why you see the reported memory usage. Have you opened an issue upstream in the arrow repo?

@dxe4 (Author) commented Feb 8, 2023

@ritchie46
a) There's an open issue on pyarrow: apache/arrow#18431. The original issue is about a small csv, but on a bigger csv, like the comment here apache/arrow#18431 (comment), this seems to be roughly the scale of memory increase I experience.
b) I am not tight on memory: I have 32 GB RAM with low swappiness, and it peaks at 14000 MB (I also tested with smaller files). Although I would be tight on memory with a 20 GB df (assuming a ~240% increase at that size too).
c) Whether it's faster or not, I am not convinced either; I didn't want to go into the details of that, since I was trying to address the memory issue. To "prove" this is faster one would need to run multiple benchmarks on different machines, with different numbers of rows, numbers of columns, dtypes, and maybe even different params in read_csv (a minimal sweep sketch follows below).
d) I can try debugging the pyarrow memory issue, but I can't promise I can fix it.
e) Is there a way to achieve c) currently? If not, I can try making something.
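
For c), something like this hypothetical sweep (illustrative only; synthetic data, not the files from the benchmarks above):

import time

import polars as pl

def bench_to_pandas(n_rows: int, n_cols: int) -> float:
    # synthetic frame so row/column counts can be varied independently
    df = pl.DataFrame({f"c{i}": list(range(n_rows)) for i in range(n_cols)})
    t0 = time.perf_counter()
    df.to_pandas()
    return time.perf_counter() - t0

for n_rows in (10_000, 100_000, 1_000_000):
    for n_cols in (5, 50):
        print(f"rows={n_rows} cols={n_cols}: {bench_to_pandas(n_rows, n_cols):.3f}s")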

@dxe4 (Author) commented Feb 8, 2023

@ritchie46
I spent some time profiling arrow.
If you use deduplicate_objects=False it gets a lot better, e.g. (benchmarks attached), though it still needs more memory than my current PR:

tbl.to_pandas(*args, date_as_object=date_as_object, deduplicate_objects=False, **kwargs)

This brings peak memory down from ~14000 MB to ~9000 MB. The issues with this are:
a) I don't understand the impact of changing this.
b) In pyarrow/array.pxi the docstring says the default is False, but the function sets the default to True, like so:

code:
         bint deduplicate_objects=True,
docs:
        deduplicate_objects : bool, default False
            Do not create multiple copies Python objects when created, to save
            on memory use. Conversion will be slower.

So I am not sure if this is what caused people to report the issue to pyarrow; I posted this finding in the pyarrow issue: apache/arrow#18431 (comment)

https://github.com/apache/arrow/blob/fc1f9ebbc4c3ae77d5cfc2f9322f4373d3d19b8a/python/pyarrow/array.pxi#L724
https://github.com/apache/arrow/blob/fc1f9ebbc4c3ae77d5cfc2f9322f4373d3d19b8a/python/pyarrow/array.pxi#L690
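
To make the flag's effect concrete, here is a tiny probe (my own illustration, not from the pyarrow docs); with deduplication on, repeated values share one Python object:

import pyarrow as pa

arr = pa.array(["spam"] * 3)
dedup = arr.to_pandas(deduplicate_objects=True)
plain = arr.to_pandas(deduplicate_objects=False)
print(dedup[0] is dedup[1])  # True: rows share a single str object
print(plain[0] is plain[1])  # typically False: separate copies, hence the extra memory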


with deduplicate_objects=False:

    22   3366.0 MiB   3229.8 MiB           1       df = pl.read_csv(path, sep="\t")
    24   6022.0 MiB   2655.9 MiB           1       df = df.to_pandas()
  1164   3369.5 MiB   3369.5 MiB           1   @profile
  1165                                         def _table_to_blocks(options, block_table, categories, extension_columns):
  1167                                             # Part of table_to_blockmanager
  1168                                         
  1169                                             # Convert an arrow table to Block from the internal pandas API
  1170   3369.5 MiB      0.0 MiB           1       columns = block_table.column_names
  1171   8953.0 MiB   5583.5 MiB           2       result = pa.lib.table_to_blocks(options, block_table, categories,
  1172   3369.5 MiB      0.0 MiB           1                                       list(extension_columns.keys()))
  1173   8953.0 MiB      0.0 MiB           8       return [_reconstruct_block(item, columns, extension_columns)
  1174   8953.0 MiB      0.0 MiB           3               for item in result]

with deduplicate_objects=True:


    22   3366.0 MiB   3229.9 MiB           1       df = pl.read_csv(path, sep="\t")
    24   9880.3 MiB   6514.4 MiB           1       df = df.to_pandas()
  1165                                         def _table_to_blocks(options, block_table, categories, extension_columns):
  1167                                             # Part of table_to_blockmanager
  1168                                         
  1169                                             # Convert an arrow table to Block from the internal pandas API
  1170   3368.4 MiB      0.0 MiB           1       columns = block_table.column_names
  1171  12811.2 MiB   9442.9 MiB           2       result = pa.lib.table_to_blocks(options, block_table, categories,
  1172   3368.4 MiB      0.0 MiB           1                                       list(extension_columns.keys()))
  1173  12811.2 MiB      0.0 MiB           8       return [_reconstruct_block(item, columns, extension_columns)
  1174  12811.2 MiB      0.0 MiB           3               for item in result]

@ritchie46 (Member) commented:

Right, thanks for looking into this. Could you change this PR to expose the needed pyarrow arguments to control this? I think we should allow the user to set this, and indeed default to False.
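
Something along these lines, as a sketch (assuming to_pandas routes through to_arrow(); this is not the actual polars implementation):

import polars as pl

def to_pandas(df: pl.DataFrame, *args, date_as_object=False, **kwargs):
    # forward any pyarrow Table.to_pandas option (e.g. deduplicate_objects)
    # straight through to the underlying conversion
    return df.to_arrow().to_pandas(*args, date_as_object=date_as_object, **kwargs)

Callers could then opt in with to_pandas(df, deduplicate_objects=False).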

@dxe4 (Author) commented Feb 8, 2023

> I am not convinced doing this column by column is faster. We use pyarrow conversion designed by Wes himself.
>
> https://ursalabs.org/blog/fast-pandas-loading/

Performance wasn't why I worked on this, but if you don't mind I have a question on this:

Correct me if I'm wrong: I go column by column, so I chunk vertically; df.to_pandas() does df.iter_chunks(), so it chunks horizontally.

I use df[col].to_arrow(), which eventually boils down to to_py.rs:

    let array = pyarrow.getattr("Array")?.call_method1(
        "_import_from_c",
        (array_ptr as Py_uintptr_t, schema_ptr as Py_uintptr_t),
    )?;

instead of the df path, which does:

    let record = pyarrow
        .getattr("RecordBatch")?
        .call_method1("from_arrays", (arrays, names.to_vec()))?;

So I don't see why this can't be fast; sorry for my ignorance.

On a related note, this is from the arrow docs on writing:

> Writing in batches is effective because we in theory need to keep in memory only the current batch we are writing.

I think what I'm doing here is equivalent to that, but in a different context: I only keep in memory what hasn't yet been copied to the new df.
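
For reference, the batched-writing pattern those docs describe looks roughly like this (file name, schema and sizes made up for illustration):

import pyarrow as pa

schema = pa.schema([("x", pa.int64())])
with pa.OSFile("out.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, schema) as writer:
        # stream slices out so only one batch needs to be alive at a time
        for start in range(0, 1_000_000, 100_000):
            batch = pa.record_batch(
                [pa.array(range(start, start + 100_000), type=pa.int64())],
                schema=schema,
            )
            writer.write_batch(batch)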

@dxe4 (Author) commented Feb 8, 2023

> Thanks for looking into this. Could you change this PR to expose the needed pyarrow arguments to control this? I think we should allow the user to set this, and indeed default to False.

OK, will do later today or tomorrow.

@ghuls (Collaborator) commented Feb 8, 2023

Constructing a pandas dataframe column by column is not the best way, as pandas requires, for a lot of operations, that the dataframe is consolidated (2D arrays are created which contain all columns of the same dtype). So as soon as you run certain operations on the dataframe, you will get an additional copy. Pyarrow builds those 2D arrays under the hood, so you won't have that extra consolidation step afterwards.

https://arrow.apache.org/docs/python/pandas.html#memory-usage-and-zero-copy

You could use something like this if you want to reduce peak memory usage and accept the risk that you will trigger consolidation later:

df_pd = df.to_arrow().to_pandas(self_destruct=True, split_blocks=True)
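
Spelled out with the self_destruct caveats (path is a placeholder; the del statements matter because self_destruct can only free buffers nothing else references):

import polars as pl

path = "data.tsv"  # placeholder
df = pl.read_csv(path, sep="\t")
tbl = df.to_arrow()
del df  # drop the polars reference so the arrow buffers are uniquely owned
df_pd = tbl.to_pandas(self_destruct=True, split_blocks=True)
del tbl  # tbl is unusable after self_destruct anyway

split_blocks gives each column its own block, so the consolidated 2D copy is skipped at conversion time; the trade-off, as noted above, is that pandas may still consolidate later.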

@dxe4 force-pushed the implement_low_memory_to_pandas branch 3 times, most recently from cc7a0f1 to 5c18f2b on February 8, 2023 at 21:45
@dxe4 (Author) commented Feb 8, 2023

@ritchie46 done.
Thank you for helping; sorry I don't understand the internals.

@dxe4 force-pushed the implement_low_memory_to_pandas branch 2 times, most recently from c3bce09 to 4eae159 on February 8, 2023 at 23:35
@ghuls (Collaborator) commented Feb 9, 2023

@dxe4 This might be a better approach. It will use the same memory as the Polars dataframe for the Pandas columns:
#6756

@dxe4 (Author) commented Feb 9, 2023

@ghuls I profiled your PR and it's better, so I am closing this one.

(screenshot: memory profiles for the four flag combinations benchmarked below)

benchmark code:

import polars as pl

path = "data.tsv"  # placeholder: a large tab-separated file

# one function per combination of the two flags:
# use_pyarrow_extension_array (f/t) x deduplicate_objects (f/t)

def ff():
    df = pl.read_csv(path, sep="\t")
    df.to_pandas(use_pyarrow_extension_array=False, deduplicate_objects=False)

def ft():
    df = pl.read_csv(path, sep="\t")
    df.to_pandas(use_pyarrow_extension_array=False, deduplicate_objects=True)

def tf():
    df = pl.read_csv(path, sep="\t")
    df.to_pandas(use_pyarrow_extension_array=True, deduplicate_objects=False)

def tt():
    df = pl.read_csv(path, sep="\t")
    df.to_pandas(use_pyarrow_extension_array=True, deduplicate_objects=True)

ff()
ft()
tf()
tt()

@dxe4 closed this on Feb 9, 2023
@ghuls (Collaborator) commented Feb 9, 2023

@dxe4 Thanks for testing!
