
feat(python): allow low memory df.to_pandas() #6720

Closed
dxe4 wants to merge 1 commit from the implement_low_memory_to_pandas branch

Conversation

@dxe4 commented Feb 7, 2023

df.to_pandas() takes more memory than it should (benchmark attached): the current to_pandas() peaks at ~14000 MB for a ~6000 MB df, while the new version peaks at around ~6700 MB.
This change allows running the conversion faster and with lower memory, but it consumes (deletes) the original data.
Although deleting the original data is not ideal, if the choice is between an OOM (and therefore not using polars) and allowing this, I believe it is a good trade-off.
I don't know the codebase well, sorry if I made any bad assumptions.

(screenshot: memory profile comparing the two to_pandas variants)

code used for profiling (note this is a bit different from the committed code, but it shouldn't make a big difference):

import time

import pandas as pd
import polars as pl

path = "data.tsv"  # placeholder: a large tab-separated file

def current_to_pandas():
    df = pl.read_csv(path, sep="\t")
    df = df.to_pandas()

def low_mem_to_pandas():
    df = pl.read_csv(path, sep="\t")
    new_df = pd.DataFrame()
    for column in df.columns:
        # copy one column over, then free it on the polars side
        # (df._df is polars' internal handle; drop_in_place releases the column)
        new_df.loc[:, column] = df[column].to_arrow()
        df._df.drop_in_place(column)

def profile_to_pandas():
    current_to_pandas()
    time.sleep(1)
    low_mem_to_pandas()

@github-actions bot added the enhancement (New feature or an improvement of an existing feature) and python (Related to Python Polars) labels on Feb 7, 2023
@ritchie46 (Member) commented:

I am not convinced doing this column by column is faster. We use pyarrow conversion designed by Wes himself.

https://ursalabs.org/blog/fast-pandas-loading/

It might be faster in your case where you are tight on memory, but I first want to understand why you see the reported memory usage. Have you opened an issue upstream in the arrow repo?

@dxe4 (Author) commented Feb 8, 2023

@ritchie46
a) There's an open issue on pyarrow: apache/arrow#18431. The original issue is about a small csv, but on a bigger csv, like the comment here apache/arrow#18431 (comment), this seems to be roughly the scale of memory increase I experience.
b) I am not tight on memory: I have 32 GB RAM with low swappiness, and it peaks at 14000 MB (I also tested with smaller files). Although I would be tight on memory with a 20 GB df (assuming a ~240% increase at that size too).
c) Whether it's faster or not, I am not convinced either; I didn't want to go into the details of that, since I was trying to address the memory issue. To "prove" this is faster one would need to run multiple benchmarks on different machines, with different numbers of rows, numbers of columns, dtypes, and maybe even different params in read_csv (a minimal sweep sketch follows below).
d) I can try debugging the pyarrow memory issue, but I can't promise I can fix it.
e) Is there a way to achieve c) currently? If not, I can try making something.
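
For c), something like this hypothetical sweep (illustrative only; synthetic data, not the files from the benchmarks above):

import time

import polars as pl

def bench_to_pandas(n_rows: int, n_cols: int) -> float:
    # synthetic frame so row/column counts can be varied independently
    df = pl.DataFrame({f"c{i}": list(range(n_rows)) for i in range(n_cols)})
    t0 = time.perf_counter()
    df.to_pandas()
    return time.perf_counter() - t0

for n_rows in (10_000, 100_000, 1_000_000):
    for n_cols in (5, 50):
        print(f"rows={n_rows} cols={n_cols}: {bench_to_pandas(n_rows, n_cols):.3f}s")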

@dxe4 (Author) commented Feb 8, 2023

@ritchie46
I spent some time profiling arrow.
If you use deduplicate_objects=False it gets a lot better, e.g. (benchmarks attached), though it still needs more memory than my current PR:

tbl.to_pandas(*args, date_as_object=date_as_object, deduplicate_objects=False, **kwargs)

This brings peak memory down from ~14000 MB to ~9000 MB. The issues with this are:
a) I don't understand the impact of changing this.
b) In pyarrow/array.pxi the docstring says the default is False, but the function sets the default to True, like so:

code:
         bint deduplicate_objects=True,
docs:
        deduplicate_objects : bool, default False
            Do not create multiple copies Python objects when created, to save
            on memory use. Conversion will be slower.

So I am not sure if this is what caused people to report the issue to pyarrow; I posted this finding in the pyarrow issue: apache/arrow#18431 (comment)

https://github.com/apache/arrow/blob/fc1f9ebbc4c3ae77d5cfc2f9322f4373d3d19b8a/python/pyarrow/array.pxi#L724
https://github.com/apache/arrow/blob/fc1f9ebbc4c3ae77d5cfc2f9322f4373d3d19b8a/python/pyarrow/array.pxi#L690
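
To make the flag's effect concrete, here is a tiny probe (my own illustration, not from the pyarrow docs); with deduplication on, repeated values share one Python object:

import pyarrow as pa

arr = pa.array(["spam"] * 3)
dedup = arr.to_pandas(deduplicate_objects=True)
plain = arr.to_pandas(deduplicate_objects=False)
print(dedup[0] is dedup[1])  # True: rows share a single str object
print(plain[0] is plain[1])  # typically False: separate copies, hence the extra memory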


with deduplicate_objects=False:

    22   3366.0 MiB   3229.8 MiB           1       df = pl.read_csv(path, sep="\t")
    24   6022.0 MiB   2655.9 MiB           1       df = df.to_pandas()
  1164   3369.5 MiB   3369.5 MiB           1   @profile
  1165                                         def _table_to_blocks(options, block_table, categories, extension_columns):
  1167                                             # Part of table_to_blockmanager
  1168                                         
  1169                                             # Convert an arrow table to Block from the internal pandas API
  1170   3369.5 MiB      0.0 MiB           1       columns = block_table.column_names
  1171   8953.0 MiB   5583.5 MiB           2       result = pa.lib.table_to_blocks(options, block_table, categories,
  1172   3369.5 MiB      0.0 MiB           1                                       list(extension_columns.keys()))
  1173   8953.0 MiB      0.0 MiB           8       return [_reconstruct_block(item, columns, extension_columns)
  1174   8953.0 MiB      0.0 MiB           3               for item in result]

with deduplicate_objects=True:


    22   3366.0 MiB   3229.9 MiB           1       df = pl.read_csv(path, sep="\t")
    24   9880.3 MiB   6514.4 MiB           1       df = df.to_pandas()
  1165                                         def _table_to_blocks(options, block_table, categories, extension_columns):
  1167                                             # Part of table_to_blockmanager
  1168                                         
  1169                                             # Convert an arrow table to Block from the internal pandas API
  1170   3368.4 MiB      0.0 MiB           1       columns = block_table.column_names
  1171  12811.2 MiB   9442.9 MiB           2       result = pa.lib.table_to_blocks(options, block_table, categories,
  1172   3368.4 MiB      0.0 MiB           1                                       list(extension_columns.keys()))
  1173  12811.2 MiB      0.0 MiB           8       return [_reconstruct_block(item, columns, extension_columns)
  1174  12811.2 MiB      0.0 MiB           3               for item in result]

@ritchie46 (Member) commented:

Right, thanks for looking into this. Could you change this PR to expose the needed pyarrow arguments to control this? I think we should allow the user to set this, and indeed default to False.
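
Something along these lines, as a sketch (assuming to_pandas routes through to_arrow(); this is not the actual polars implementation):

import polars as pl

def to_pandas(df: pl.DataFrame, *args, date_as_object=False, **kwargs):
    # forward any pyarrow Table.to_pandas option (e.g. deduplicate_objects)
    # straight through to the underlying conversion
    return df.to_arrow().to_pandas(*args, date_as_object=date_as_object, **kwargs)

Callers could then opt in with to_pandas(df, deduplicate_objects=False).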

@dxe4 (Author) commented Feb 8, 2023

> I am not convinced doing this column by column is faster. We use pyarrow conversion designed by Wes himself.
>
> https://ursalabs.org/blog/fast-pandas-loading/

Performance wasn't why I worked on this, but if you don't mind I have a question on this:

Correct me if I'm wrong: I go column by column, so I chunk vertically; df.to_pandas() does df.iter_chunks(), so it chunks horizontally.

I use df[col].to_arrow(), which eventually boils down to to_py.rs:

    let array = pyarrow.getattr("Array")?.call_method1(
        "_import_from_c",
        (array_ptr as Py_uintptr_t, schema_ptr as Py_uintptr_t),
    )?;

instead of the df path, which does:

    let record = pyarrow
        .getattr("RecordBatch")?
        .call_method1("from_arrays", (arrays, names.to_vec()))?;

So I don't see why this can't be fast; sorry for my ignorance.

On a related note, this is from the arrow docs on writing:

> Writing in batches is effective because we in theory need to keep in memory only the current batch we are writing.

I think what I'm doing here is equivalent to that, but in a different context: I only keep in memory what hasn't yet been copied to the new df.
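
For reference, the batched-writing pattern those docs describe looks roughly like this (file name, schema and sizes made up for illustration):

import pyarrow as pa

schema = pa.schema([("x", pa.int64())])
with pa.OSFile("out.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, schema) as writer:
        # stream slices out so only one batch needs to be alive at a time
        for start in range(0, 1_000_000, 100_000):
            batch = pa.record_batch(
                [pa.array(range(start, start + 100_000), type=pa.int64())],
                schema=schema,
            )
            writer.write_batch(batch)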

@dxe4 (Author) commented Feb 8, 2023

> Thanks for looking into this. Could you change this PR to expose the needed pyarrow arguments to control this? I think we should allow the user to set this, and indeed default to False.

OK, will do later today or tomorrow.

@ghuls (Collaborator) commented Feb 8, 2023

Constructing a pandas dataframe column by column is not the best way, as pandas requires, for a lot of operations, that the dataframe is consolidated (2D arrays are created which contain all columns of the same dtype). So as soon as you run certain operations on the dataframe, you will get an additional copy. Pyarrow builds those 2D arrays under the hood, so you won't have that extra consolidation step afterwards.

https://arrow.apache.org/docs/python/pandas.html#memory-usage-and-zero-copy

You could use something like this if you want to reduce peak memory usage and accept the risk that you will trigger consolidation later:

df_pd = df.to_arrow().to_pandas(self_destruct=True, split_blocks=True)
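
Spelled out with the self_destruct caveats (path is a placeholder; the del statements matter because self_destruct can only free buffers nothing else references):

import polars as pl

path = "data.tsv"  # placeholder
df = pl.read_csv(path, sep="\t")
tbl = df.to_arrow()
del df  # drop the polars reference so the arrow buffers are uniquely owned
df_pd = tbl.to_pandas(self_destruct=True, split_blocks=True)
del tbl  # tbl is unusable after self_destruct anyway

split_blocks gives each column its own block, so the consolidated 2D copy is skipped at conversion time; the trade-off, as noted above, is that pandas may still consolidate later.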

@dxe4 force-pushed the implement_low_memory_to_pandas branch 3 times, most recently from cc7a0f1 to 5c18f2b on February 8, 2023 at 21:45
@dxe4 (Author) commented Feb 8, 2023

@ritchie46 done.
Thank you for helping; sorry I don't understand the internals.

@dxe4 force-pushed the implement_low_memory_to_pandas branch 2 times, most recently from c3bce09 to 4eae159 on February 8, 2023 at 23:35
@ghuls (Collaborator) commented Feb 9, 2023

@dxe4 This might be a better approach. It will use the same memory as the Polars dataframe for the Pandas columns:
#6756

@dxe4 (Author) commented Feb 9, 2023

@ghuls I profiled your PR and it's better, so I am closing this one.

(screenshot: memory profiles for the four flag combinations benchmarked below)

benchmark code:

import polars as pl

path = "data.tsv"  # placeholder: a large tab-separated file

# one function per combination of the two flags:
# use_pyarrow_extension_array (f/t) x deduplicate_objects (f/t)

def ff():
    df = pl.read_csv(path, sep="\t")
    df.to_pandas(use_pyarrow_extension_array=False, deduplicate_objects=False)

def ft():
    df = pl.read_csv(path, sep="\t")
    df.to_pandas(use_pyarrow_extension_array=False, deduplicate_objects=True)

def tf():
    df = pl.read_csv(path, sep="\t")
    df.to_pandas(use_pyarrow_extension_array=True, deduplicate_objects=False)

def tt():
    df = pl.read_csv(path, sep="\t")
    df.to_pandas(use_pyarrow_extension_array=True, deduplicate_objects=True)

ff()
ft()
tf()
tt()

@dxe4 closed this on Feb 9, 2023
@ghuls (Collaborator) commented Feb 9, 2023

@dxe4 Thanks for testing!
