Series round trip to pandas with arrow types copies data #7007
datapythonista added the bug (Something isn't working) and python (Related to Python Polars) labels on Feb 18, 2023
Timings also seem to confirm the copy with arrow types and not with numpy:
>>> %timeit _ = polars.from_pandas(original_polars_series.to_pandas(use_pyarrow_extension_array=False))
393 µs ± 3.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit _ = polars.from_pandas(original_polars_series.to_pandas(use_pyarrow_extension_array=True))
141 ms ± 3.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
@datapythonista I assume this is a pyarrow limitation at the moment.
# Converting a pandas Series with PyArrow results in a copy (a different pointer on every call).
In [46]: pl.from_arrow(pa.array(pandas_with_arrow))._get_ptr()
Out[46]: 140342545672704
In [47]: pl.from_arrow(pa.array(pandas_with_arrow))._get_ptr()
Out[47]: 140342545671424
In [48]: pl.from_arrow(pa.array(pandas_with_arrow))._get_ptr()
Out[48]: 140342545670976
# Converting the PyArrow extension array backing the pandas Series avoids the copy (the pointer stays the same).
In [78]: pl.from_arrow(pandas_with_arrow._data.arrays[0]._data.chunks[0])._get_ptr()
Out[78]: 140347645945408
In [79]: pl.from_arrow(pandas_with_arrow._data.arrays[0]._data.chunks[0])._get_ptr()
Out[79]: 140347645945408
In [80]: pl.from_arrow(pandas_with_arrow._data.arrays[0]._data.chunks[0])._get_ptr()
Out[80]: 140347645945408
In [81]: pl.from_arrow(pandas_with_arrow._data.arrays[0]._data.chunks[0])
Out[81]:
shape: (100000000,)
Series: '' [i64]
[
0
1
2
3
4
5
6
7
8
9
0
1
...
7
8
9
0
1
2
3
4
5
6
7
8
9
]
Seems like the problem is in Polars, which makes the copy by combining the arrow chunks even when there is just one. I opened #7084 to fix it.
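To illustrate the idea of skipping the combine step when there is only one chunk, here is a minimal standard-library sketch. It is not the Polars code and not the change in #7084; `buffer_address`, `combine_always_copy`, and `combine_with_fast_path` are hypothetical names, and `array.array` merely stands in for an arrow chunk.

```python
from array import array


def buffer_address(chunk):
    """Return the memory address of the array's underlying buffer."""
    return chunk.buffer_info()[0]


def combine_always_copy(chunks):
    """Concatenate chunks into a fresh buffer, even if there is only one."""
    combined = array("q")
    for chunk in chunks:
        combined.extend(chunk)
    return combined


def combine_with_fast_path(chunks):
    """Return a lone chunk as-is; only concatenate when there are several."""
    if len(chunks) == 1:
        return chunks[0]
    return combine_always_copy(chunks)


# A single "chunk" of int64 values (hypothetical stand-in for an arrow chunk).
single = [array("q", range(10))]

copied = combine_always_copy(single)
shared = combine_with_fast_path(single)

# Compare buffer addresses, in the spirit of the _get_ptr() checks above.
print(buffer_address(copied) == buffer_address(single[0]))
print(buffer_address(shared) == buffer_address(single[0]))
```

With a single chunk, the first comparison is expected to be False (a new buffer was allocated) and the second True (the original buffer is reused).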
Polars version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.
Issue description
I'd expect that a polars Series doesn't copy data in a roundtrip to pandas when arrow types are used:
Interestingly enough, this is true when numpy types are used with use_pyarrow_extension_array=False, but not when the data is always an arrow array. I see in @ghuls' comment that this is expected to be true: #6756 (comment)
But as shown in the reproducible example it seems to be making a copy.
Reproducible example
Expected behavior
Installed versions