Series round trip to pandas with arrow types copies data #7007

Closed
2 tasks done
datapythonista opened this issue Feb 18, 2023 · 3 comments · Fixed by #7084
Labels
bug Something isn't working python Related to Python Polars

Comments

@datapythonista
Contributor

Polars version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Issue description

I'd expect that a polars Series doesn't copy data in a roundtrip to pandas when arrow types are used:

polars.from_pandas(data.to_pandas(use_pyarrow_extension_array=True))

Interestingly, this holds when NumPy types are used (use_pyarrow_extension_array=False), but not when the data stays an Arrow array throughout.

According to @ghuls's comment, the round trip is expected to be zero-copy: #6756 (comment)

But as shown in the reproducible example it seems to be making a copy.

Reproducible example

import polars

original_polars_series = polars.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9] * 10_000_000)
print('original polars, address:', original_polars_series._get_ptr())

pandas_with_arrow = original_polars_series.to_pandas(use_pyarrow_extension_array=True)
print('pandas exported with arrow types, address:', pandas_with_arrow._data.array._data.chunks[0].buffers()[1].address)

polars_roundtrip_with_arrow = polars.from_pandas(pandas_with_arrow)
print('polars after roundtrip to pandas with arrow types, address:', polars_roundtrip_with_arrow._get_ptr())

pandas_with_numpy = original_polars_series.to_pandas(use_pyarrow_extension_array=False)
print('pandas exported with numpy types, address:', pandas_with_numpy.to_numpy().__array_interface__['data'][0])

polars_roundtrip_with_numpy = polars.from_pandas(pandas_with_numpy)
print('polars after roundtrip to pandas with numpy types, address:', polars_roundtrip_with_numpy._get_ptr())

Expected behavior

original polars, address: 140026900974656
pandas exported with arrow types, address: 140026900974656
polars after roundtrip to pandas with arrow types, address: 140028774713088     <--- I'd expect this address to stay the same as the original
pandas exported with numpy types, address: 140026900974656
polars after roundtrip to pandas with numpy types, address: 140026900974656

Installed versions

---Version info---
Polars: 0.16.6
Index type: UInt32
Platform: Linux-6.1.10-arch1-1-x86_64-with-glibc2.10
Python: 3.8.15 | packaged by conda-forge | (default, Nov 22 2022, 08:49:35) 
[GCC 10.4.0]
---Optional dependencies---
pyarrow: 10.0.1
pandas: 2.0.0.dev0+1580.g686d674f2d
numpy: 1.23.5
fsspec: 2022.11.0
connectorx: <not installed>
xlsx2csv: <not installed>
deltalake: <not installed>
matplotlib: 3.6.3

datapythonista added the bug and python labels on Feb 18, 2023
@datapythonista
Contributor Author

Timings also seem to confirm the copy with arrow types and not with numpy:

>>> %timeit _ = polars.from_pandas(original_polars_series.to_pandas(use_pyarrow_extension_array=False))
393 µs ± 3.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit _ = polars.from_pandas(original_polars_series.to_pandas(use_pyarrow_extension_array=True))
141 ms ± 3.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

@ghuls
Collaborator

ghuls commented Feb 20, 2023

@datapythonista I assume this is a pyarrow limitation at the moment.

# Converting Pandas Series with PyArrow results in a copy.
In [46]: pl.from_arrow(pa.array(pandas_with_arrow))._get_ptr()
Out[46]: 140342545672704

In [47]: pl.from_arrow(pa.array(pandas_with_arrow))._get_ptr()
Out[47]: 140342545671424

In [48]: pl.from_arrow(pa.array(pandas_with_arrow))._get_ptr()
Out[48]: 140342545670976

# Converting the PyArrow extension array backing the Pandas Series avoids the copy.
In [78]: pl.from_arrow(pandas_with_arrow._data.arrays[0]._data.chunks[0])._get_ptr()
Out[78]: 140347645945408

In [79]: pl.from_arrow(pandas_with_arrow._data.arrays[0]._data.chunks[0])._get_ptr()
Out[79]: 140347645945408

In [80]: pl.from_arrow(pandas_with_arrow._data.arrays[0]._data.chunks[0])._get_ptr()
Out[80]: 140347645945408

In [81]: pl.from_arrow(pandas_with_arrow._data.arrays[0]._data.chunks[0])
Out[81]:
shape: (100000000,)
Series: '' [i64]
[
        0
        1
        2
        3
        4
        5
        6
        7
        8
        9
        0
        1
        ...
        7
        8
        9
        0
        1
        2
        3
        4
        5
        6
        7
        8
        9
]

@datapythonista
Copy link
Contributor Author

The problem seems to be in Polars itself: it copies the data by combining the Arrow chunks even when there is only one chunk. I opened #7084 to fix it.
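
To illustrate the idea of skipping the copy for a single chunk, here is a minimal standard-library sketch. It is not the Polars implementation and not the code from #7084; `combine_chunks` and `buffer_address` are hypothetical helpers, and plain `array.array` objects stand in for Arrow chunks. It only shows why a single-chunk fast path keeps the buffer address stable, while merging several chunks has to allocate a new buffer.

```python
from array import array
from typing import List


def buffer_address(values: array) -> int:
    """Return the memory address of the array's underlying buffer."""
    address, _length = values.buffer_info()
    return address


def combine_chunks(chunks: List[array]) -> array:
    """Merge chunks into one contiguous array, copying only when needed.

    Hypothetical helper for illustration; not a Polars or PyArrow API.
    """
    if not chunks:
        raise ValueError("expected at least one chunk")
    if len(chunks) == 1:
        # Fast path: a single chunk is already contiguous, so return it as-is.
        return chunks[0]
    # Slow path: several chunks are copied into a new contiguous buffer.
    combined = array(chunks[0].typecode)
    for chunk in chunks:
        combined.extend(chunk)
    return combined


single = array("q", range(10))
same = combine_chunks([single])
merged = combine_chunks([single, array("q", range(10, 20))])

# Single chunk: the buffer is reused, so the address is unchanged.
print(buffer_address(same) == buffer_address(single))
# Several chunks: a new buffer is allocated, so the address differs.
print(buffer_address(merged) == buffer_address(single))
```

Comparing buffer addresses before and after, as in the reproducible example above, is the same check applied to this toy model.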
