
feat(python): Add option to use PyArrow-backed extension arrays when converting to pandas #6756

Merged

Conversation

@ghuls (Collaborator) commented Feb 9, 2023

Add a "use_pyarrow_extension_array" argument to pl.Series.to_pandas() and pl.DataFrame.to_pandas(). As of pandas 1.5.0, pandas Series and pandas DataFrame columns can be backed by PyArrow arrays, which allows zero-copy operations and preservation of null values in pandas DataFrames.

For big DataFrames this can make the conversion to a pandas DataFrame almost free, both in conversion time and memory usage:

%time df_pd = df.to_pandas()
CPU times: user 5.18 s, sys: 817 ms, total: 6 s
Wall time: 5.12 s

%time df_pd_pa = df.to_pandas(use_pyarrow_extension_array=True)
CPU times: user 1.63 ms, sys: 71 µs, total: 1.7 ms
Wall time: 1.57 ms

Preservation of null values in pandas Series:

>>> s1 = pl.Series("a", [1, 2, 3])
>>> s1.to_pandas()
0    1
1    2
2    3
Name: a, dtype: int64
>>> s1.to_pandas(use_pyarrow_extension_array=True)
0    1
1    2
2    3
Name: a, dtype: int64[pyarrow]
>>> s2 = pl.Series("b", [1, 2, None, 4])
>>> s2.to_pandas()
0    1.0
1    2.0
2    NaN
3    4.0
Name: b, dtype: float64
>>> s2.to_pandas(use_pyarrow_extension_array=True)
0       1
1       2
2    <NA>
3       4
Name: b, dtype: int64[pyarrow]
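
A minimal DataFrame-level sketch of the same comparison, assuming pandas >= 1.5.0 and pyarrow are installed:

    import polars as pl

    df = pl.DataFrame({"a": [1, 2, None], "b": ["x", "y", "z"]})

    # NumPy-backed conversion: the null in "a" forces a float64 column with NaN.
    print(df.to_pandas().dtypes)

    # PyArrow-backed conversion: nulls stay <NA> and "a" keeps an integer dtype.
    print(df.to_pandas(use_pyarrow_extension_array=True).dtypes)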

@github-actions bot added the enhancement and python labels on Feb 9, 2023
@ritchie46 (Member)

This is awesome!

Review thread on the to_pandas() docstring:

    Arguments will be sent to :meth:`pyarrow.Table.to_pandas`.
    date_as_object
        Cast dates to objects. If ``False``, convert to ``datetime64[ns]`` dtype.
    use_pyarrow_extension_array

Member:

Can we make this default to True?

Collaborator Author:

Yes, I can, if we drop Python 3.7.

Collaborator Author:

And we need to require pandas 1.5.x or higher.

Maybe you can disable it for them:

In [7]: import sys

In [8]: (sys.version_info.major, sys.version_info.minor)
Out[8]: (3, 10)

In [9]: import pandas as pd

In [10]: pd.__version__
Out[10]: '1.5.3'

Member:

Given @kylebarron's comment, let's be conservative here and set it to False.

Contributor:

IMO it would still be ideal to have it set to True, but if you're supporting pandas pre-1.5, users would get different results depending on their pandas version, which seems confusing.

Collaborator Author:

The pyarrow-backed extension array implementation is considered experimental in the pandas 1.5.0 release:
https://pandas.pydata.org/docs/whatsnew/v1.5.0.html#native-pyarrow-backed-extensionarray

    This feature is experimental, and the API can change in a future release without warning.

So defaulting to it might be too soon (even without considering the rest of the ecosystem).

@ritchie46 (Member)

Is the reverse also possible if the pd DataFrame is backed by Arrow?

@ghuls (Collaborator Author) commented Feb 9, 2023

Yes, the reverse is possible too (zero copy):

In [14]: df.shape
Out[14]: (102753538, 5)

In [15]: %time pandas_df= df.to_pandas(use_pyarrow_extension_array=False)
CPU times: user 5.07 s, sys: 779 ms, total: 5.85 s
Wall time: 5.02 s

In [10]: %time pandas_df_pa = df.to_pandas(use_pyarrow_extension_array=True)
CPU times: user 3.17 ms, sys: 0 ns, total: 3.17 ms
Wall time: 3.01 ms

In [11]: %time pandas_df_pa2 = df.to_pandas(use_pyarrow_extension_array=True)
CPU times: user 4.17 ms, sys: 63 µs, total: 4.23 ms
Wall time: 3.99 ms

In [12]: %time pandas_df_pa3 = df.to_pandas(use_pyarrow_extension_array=True)
CPU times: user 1.32 ms, sys: 0 ns, total: 1.32 ms
Wall time: 1.21 ms

In [13]: %time pl_df_from_pandas = pl.from_pandas(pandas_df_pa3)
CPU times: user 731 µs, sys: 0 ns, total: 731 µs
Wall time: 1.57 ms
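
A small self-contained sketch of that round trip (assuming pandas >= 1.5.0 and pyarrow are installed); with arrow-backed columns, pl.from_pandas can reuse the underlying Arrow buffers instead of copying:

    import polars as pl

    df = pl.DataFrame({"a": [1, 2, None, 4], "b": [0.5, 1.5, 2.5, 3.5]})

    # Polars -> pandas, keeping the data as pyarrow-backed extension arrays.
    pdf = df.to_pandas(use_pyarrow_extension_array=True)
    print(pdf.dtypes)

    # pandas (arrow-backed) -> Polars again, without materializing NumPy arrays.
    df_again = pl.from_pandas(pdf)
    print(df_again.schema)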

@kylebarron (Contributor)

IIRC, some pandas operations are less efficient on arrow data than on numpy data, and some pandas operations always return numpy memory, even if the memory started as arrow memory. So it's definitely something to make sure the user is aware of.
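
One way to see that caveat, as a hedged sketch (pandas >= 1.5.0 with pyarrow installed): whether a result is still arrow-backed can be read from its dtype, and paths such as to_numpy() always hand back NumPy memory:

    import pandas as pd
    import pyarrow as pa

    s = pd.Series([1, 2, None, 4], dtype=pd.ArrowDtype(pa.int64()))
    print(s.dtype)  # int64[pyarrow]

    # to_numpy() materializes a plain NumPy array, so the arrow backing
    # (and the null/NaN distinction it preserves) is lost on this path.
    arr = s.to_numpy()
    print(type(arr), arr.dtype)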

@ghuls force-pushed the feat_python_pandas_pyarrow_extension_arrays branch from 4a9765d to c539d1e on February 9, 2023 16:50
@alexander-beedie (Collaborator) commented Feb 9, 2023

Very cool indeed!

Unfortunately it's hard to set this conversion to True by default, for the additional reason that lots of pandas/numpy-associated libraries still don't know what to do with the NA-aware dtypes; you don't really want to produce them by default until the rest of the ecosystem catches up (we tried at work last year and it was a problem, so we reverted) :(

@ghuls force-pushed the feat_python_pandas_pyarrow_extension_arrays branch from c539d1e to f758960 on February 9, 2023 19:52
@ghuls (Collaborator Author) commented Feb 9, 2023

It is ready now: use_pyarrow_extension_array=False is the default, and use_pyarrow_extension_array=True will complain if pandas is not new enough.
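
A minimal sketch of what such a version guard could look like (illustrative names only, not necessarily the exact check in this PR):

    import pandas as pd

    def _assert_pandas_supports_pyarrow_extension_arrays() -> None:
        # Hypothetical helper: pyarrow-backed extension arrays were added in
        # pandas 1.5.0, so refuse use_pyarrow_extension_array=True on older installs.
        major, minor = (int(part) for part in pd.__version__.split(".")[:2])
        if (major, minor) < (1, 5):
            raise ValueError(
                "use_pyarrow_extension_array=True requires pandas >= 1.5.0, "
                f"but pandas {pd.__version__} is installed"
            )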
