
feat(python): Add option to use PyArrow-backed extension arrays when converting to pandas #6756

Merged

Conversation

@ghuls (Collaborator) commented Feb 9, 2023

Add a "use_pyarrow_extension_array" argument to pl.Series.to_pandas() and pl.DataFrame.to_pandas(). As of pandas 1.5.0, pandas Series and pandas DataFrame columns can be backed by PyArrow arrays, which allows zero-copy operations and preservation of null values in pandas DataFrames.

For big DataFrames this can make the conversion to a pandas DataFrame almost free, both in conversion time and memory usage:

%time df_pd = df.to_pandas()
CPU times: user 5.18 s, sys: 817 ms, total: 6 s
Wall time: 5.12 s

%time df_pd_pa = df.to_pandas(use_pyarrow_extension_array=True)
CPU times: user 1.63 ms, sys: 71 µs, total: 1.7 ms
Wall time: 1.57 ms

Preservation of null values in pandas Series:

>>> s1 = pl.Series("a", [1, 2, 3])
>>> s1.to_pandas()
0    1
1    2
2    3
Name: a, dtype: int64
>>> s1.to_pandas(use_pyarrow_extension_array=True)
0    1
1    2
2    3
Name: a, dtype: int64[pyarrow]
>>> s2 = pl.Series("b", [1, 2, None, 4])
>>> s2.to_pandas()
0    1.0
1    2.0
2    NaN
3    4.0
Name: b, dtype: float64
>>> s2.to_pandas(use_pyarrow_extension_array=True)
0       1
1       2
2    <NA>
3       4
Name: b, dtype: int64[pyarrow]
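
A minimal DataFrame-level sketch of the same comparison, assuming pandas >= 1.5.0 and pyarrow are installed:

    import polars as pl

    df = pl.DataFrame({"a": [1, 2, None], "b": ["x", "y", "z"]})

    # NumPy-backed conversion: the null in "a" forces a float64 column with NaN.
    print(df.to_pandas().dtypes)

    # PyArrow-backed conversion: nulls stay <NA> and "a" keeps an integer dtype.
    print(df.to_pandas(use_pyarrow_extension_array=True).dtypes)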

@github-actions bot added the enhancement and python labels on Feb 9, 2023
@ritchie46 (Member)

This is awesome!

Review thread on the to_pandas() docstring:

    Arguments will be sent to :meth:`pyarrow.Table.to_pandas`.
    date_as_object
        Cast dates to objects. If ``False``, convert to ``datetime64[ns]`` dtype.
    use_pyarrow_extension_array

Member:

Can we make this default to True?

Collaborator Author:

Yes, I can, if we drop Python 3.7.

Collaborator Author:

And we need to require pandas 1.5.x or higher.

Maybe you can disable it for them:

In [7]: import sys

In [8]: (sys.version_info.major, sys.version_info.minor)
Out[8]: (3, 10)

In [9]: import pandas as pd

In [10]: pd.__version__
Out[10]: '1.5.3'

Member:

Given @kylebarron's comment, let's be conservative here and set it to False.

Contributor:

IMO it would still be ideal to have it set to True, but if you're supporting pandas pre-1.5, users would get different results depending on their pandas version, which seems confusing.

Collaborator Author:

The pyarrow-backed extension array implementation is considered experimental in the pandas 1.5.0 release:
https://pandas.pydata.org/docs/whatsnew/v1.5.0.html#native-pyarrow-backed-extensionarray

    This feature is experimental, and the API can change in a future release without warning.

So defaulting to it might be too soon (even without considering the rest of the ecosystem).

@ritchie46 (Member)

Is the reverse also possible if the pd DataFrame is backed by Arrow?

@ghuls (Collaborator Author) commented Feb 9, 2023

Yes, the reverse is possible too (zero copy):

In [14]: df.shape
Out[14]: (102753538, 5)

In [15]: %time pandas_df= df.to_pandas(use_pyarrow_extension_array=False)
CPU times: user 5.07 s, sys: 779 ms, total: 5.85 s
Wall time: 5.02 s

In [10]: %time pandas_df_pa = df.to_pandas(use_pyarrow_extension_array=True)
CPU times: user 3.17 ms, sys: 0 ns, total: 3.17 ms
Wall time: 3.01 ms

In [11]: %time pandas_df_pa2 = df.to_pandas(use_pyarrow_extension_array=True)
CPU times: user 4.17 ms, sys: 63 µs, total: 4.23 ms
Wall time: 3.99 ms

In [12]: %time pandas_df_pa3 = df.to_pandas(use_pyarrow_extension_array=True)
CPU times: user 1.32 ms, sys: 0 ns, total: 1.32 ms
Wall time: 1.21 ms

In [13]: %time pl_df_from_pandas = pl.from_pandas(pandas_df_pa3)
CPU times: user 731 µs, sys: 0 ns, total: 731 µs
Wall time: 1.57 ms
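
A small self-contained sketch of that round trip (assuming pandas >= 1.5.0 and pyarrow are installed); with arrow-backed columns, pl.from_pandas can reuse the underlying Arrow buffers instead of copying:

    import polars as pl

    df = pl.DataFrame({"a": [1, 2, None, 4], "b": [0.5, 1.5, 2.5, 3.5]})

    # Polars -> pandas, keeping the data as pyarrow-backed extension arrays.
    pdf = df.to_pandas(use_pyarrow_extension_array=True)
    print(pdf.dtypes)

    # pandas (arrow-backed) -> Polars again, without materializing NumPy arrays.
    df_again = pl.from_pandas(pdf)
    print(df_again.schema)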

@kylebarron (Contributor)

IIRC, some pandas operations are less efficient on arrow data than on numpy data, and some pandas operations always return numpy memory, even if the memory started as arrow memory. So it's definitely something to make sure the user is aware of.
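
One way to see that caveat, as a hedged sketch (pandas >= 1.5.0 with pyarrow installed): whether a result is still arrow-backed can be read from its dtype, and paths such as to_numpy() always hand back NumPy memory:

    import pandas as pd
    import pyarrow as pa

    s = pd.Series([1, 2, None, 4], dtype=pd.ArrowDtype(pa.int64()))
    print(s.dtype)  # int64[pyarrow]

    # to_numpy() materializes a plain NumPy array, so the arrow backing
    # (and the null/NaN distinction it preserves) is lost on this path.
    arr = s.to_numpy()
    print(type(arr), arr.dtype)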

@ghuls force-pushed the feat_python_pandas_pyarrow_extension_arrays branch from 4a9765d to c539d1e on February 9, 2023 16:50
@alexander-beedie (Collaborator) commented Feb 9, 2023

Very cool indeed!

Unfortunately it's hard to set this conversion to True by default, for the additional reason that lots of pandas/numpy-associated libraries still don't know what to do with the NA-aware dtypes; you don't really want to produce them by default until the rest of the ecosystem catches up (we tried at work last year and it was a problem, so we reverted) :(

@ghuls force-pushed the feat_python_pandas_pyarrow_extension_arrays branch from c539d1e to f758960 on February 9, 2023 19:52
@ghuls (Collaborator Author) commented Feb 9, 2023

It is ready now: use_pyarrow_extension_array=False is the default, and use_pyarrow_extension_array=True will complain if pandas is not new enough.
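
A minimal sketch of what such a version guard could look like (illustrative names only, not necessarily the exact check in this PR):

    import pandas as pd

    def _assert_pandas_supports_pyarrow_extension_arrays() -> None:
        # Hypothetical helper: pyarrow-backed extension arrays were added in
        # pandas 1.5.0, so refuse use_pyarrow_extension_array=True on older installs.
        major, minor = (int(part) for part in pd.__version__.split(".")[:2])
        if (major, minor) < (1, 5):
            raise ValueError(
                "use_pyarrow_extension_array=True requires pandas >= 1.5.0, "
                f"but pandas {pd.__version__} is installed"
            )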
