feat(python): allow low memory df.to_pandas() #6720
Conversation
I am not convinced doing this column by column is faster. We use the approach from https://ursalabs.org/blog/fast-pandas-loading/. It might be faster in your case where you are tight on memory, but I first want to understand why you see the reported memory usage. Have you opened an issue upstream in the arrow repo?
@ritchie46
This goes down from ~14000 MB to ~9000 MB, so I am not sure if this is what caused people to report this issue to pyarrow. I posted this in the pyarrow issue apache/arrow#18431 (comment): https://github.com/apache/arrow/blob/fc1f9ebbc4c3ae77d5cfc2f9322f4373d3d19b8a/python/pyarrow/array.pxi#L724
Right, thanks for looking into this. Could you change this PR to expose the needed pyarrow arguments to control this? I think we should allow the user to set this and default to […]
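A sketch of what exposing those arguments could look like (a hypothetical signature, not necessarily the merged polars API):

```python
import pandas as pd

def to_pandas(self, **kwargs) -> pd.DataFrame:
    """Convert this DataFrame to pandas.

    Extra keyword arguments (e.g. self_destruct=True, split_blocks=True)
    are forwarded to pyarrow.Table.to_pandas().
    """
    return self.to_arrow().to_pandas(**kwargs)
```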
Performance wasn't why I worked on this, but if you don't mind, I have a question on this. Correct me if I'm wrong: I go column by column, so I chunk vertically; I use […] instead of the df path, which does […], so I don't see why this can't be fast, sorry for my ignorance. On a related note, this is from the arrow docs on writing: […]
I think what I'm doing here is equivalent to that, but in a different context: I only keep in memory what hasn't been copied to the new df.
OK, will do later today or tomorrow.
Constructing a pandas dataframe column by column is not the best way, as pandas requires for a lot of operations that the dataframe is consolidated (2D arrays are created which contain all columns of the same dtype). So as soon as you run certain operations on the dataframe, you will get an additional copy. Pyarrow builds those 2D arrays under the hood, so you won't have that extra consolidation step afterwards. See https://arrow.apache.org/docs/python/pandas.html#memory-usage-and-zero-copy

You could use something like this if you want to reduce peak memory usage and take the risk that you will trigger consolidation later:

```python
df_pd = df.to_arrow().to_pandas(self_destruct=True, split_blocks=True)
```
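For context, a self-contained sketch of that suggestion (the example data is made up for illustration):

```python
import numpy as np
import polars as pl

df = pl.DataFrame({"a": np.random.rand(1_000_000), "b": np.random.rand(1_000_000)})

arrow_table = df.to_arrow()
# split_blocks=True: build one pandas block per column instead of
# consolidating same-dtype columns into a single 2D array up front.
# self_destruct=True: free each Arrow buffer as soon as it has been
# converted, which lowers peak memory but leaves the table unusable.
df_pd = arrow_table.to_pandas(self_destruct=True, split_blocks=True)
del arrow_table  # its internal state has been destroyed

# Caveat: pandas may still consolidate (and copy) the blocks later when
# certain operations run on df_pd.
```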
(force-pushed from cc7a0f1 to 5c18f2b)
@ritchie46 done
(force-pushed from c3bce09 to 4eae159)
@ghuls I profiled your PR and it's better, so I am closing this. Benchmark code: […]
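The benchmark snippet itself isn't preserved above; a minimal sketch of how such a comparison could look (not the original code; Linux ru_maxrss semantics assumed):

```python
# Compare peak memory of the two conversion paths. On Linux, ru_maxrss is
# the process-lifetime peak RSS in kilobytes (macOS reports bytes), so run
# each mode in a separate process.
import resource
import sys

import numpy as np
import polars as pl

df = pl.DataFrame({f"c{i}": np.random.rand(25_000_000) for i in range(8)})

mode = sys.argv[1] if len(sys.argv) > 1 else "default"
if mode == "default":
    pdf = df.to_pandas()
else:
    # The lower-memory path discussed above.
    pdf = df.to_arrow().to_pandas(self_destruct=True, split_blocks=True)

peak_mb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
print(f"{mode}: peak RSS ~ {peak_mb:.0f} MB")
```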
@dxe4 Thanks for testing!
`df.to_pandas()` takes a lot of memory, more than it should (benchmark attached). The current `to_pandas()` takes ~14000 MB for a ~6000 MB df; the new version takes around ~6700 MB. This change allows running it faster and with lower memory, but it deletes the original data.
Although it's not ideal to delete the original data, if the choice is between OOM (and therefore not using polars) and allowing this, I believe it's a good choice.
I don't know the codebase well; sorry if I made any bad assumptions.
Code used for profiling (note: this is a bit different from the committed code, but it shouldn't make a big difference): […]
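The original snippet isn't preserved here; below is a sketch of the destructive, column-by-column conversion described in this PR (the helper name `to_pandas_low_memory` is hypothetical):

```python
import pandas as pd
import polars as pl

def to_pandas_low_memory(df: pl.DataFrame) -> pd.DataFrame:
    """Convert column by column, dropping each source column after it is
    copied, so only not-yet-converted data stays alive on the polars side.
    """
    columns = {}
    for name in df.columns:
        columns[name] = df.get_column(name).to_pandas()
        df = df.drop(name)  # rebind; the column is freed once unreferenced
    return pd.DataFrame(columns)

# Note: memory is only actually released if the caller also gives up its
# own reference to the original frame.
```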