implode results in extra level of nesting when run within a group_by(...).agg #16756

Closed
2 tasks done
jcrist opened this issue Jun 6, 2024 · 2 comments
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Comments

jcrist commented Jun 6, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

df = pl.DataFrame({"x": range(10)})

expr = pl.col("x").implode().alias("res")

# Apply expression without a group_by
no_group_by = df.select(expr)

# Apply expression within a group_by
with_group_by = df.group_by(pl.col("x") > 4).agg(expr)

# With the group_by the dtype is a nested list (List(List(Int64))), which is
# unexpected. For other aggregations the dtype within a group_by is the same
# as it is at the top level.
expected_dtype = pl.List(pl.Int64)
assert no_group_by["res"].dtype == expected_dtype
assert with_group_by["res"].dtype == expected_dtype

Log output

keys/aggregates are not partitionable: running default HASH AGGREGATION
Traceback (most recent call last):
  File "/home/jcristharif/Code/ibis/test.py", line 18, in <module>
    assert with_group_by["res"].dtype == expected_dtype
AssertionError

Issue description

When implode is run within a group_by(...).agg, the result gains an extra level of list nesting that isn't present when the same expression is run within a select. For other aggregations, the dtype in a select is the same as the dtype in a group_by(...).agg.

Expected behavior

The dtype of an aggregation doesn't change when run within a group_by(...).agg.

Installed versions

--------Version info---------
Polars:               0.20.31
Index type:           UInt32
Platform:             Linux-6.5.0-35-generic-x86_64-with-glibc2.38
Python:               3.10.13 | packaged by conda-forge | (main, Oct 26 2023, 18:07:37) [GCC 12.3.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          2.2.1
connectorx:           <not installed>
deltalake:            0.16.0
fastexcel:            <not installed>
fsspec:               2023.12.2
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.8.0
nest_asyncio:         1.5.8
numpy:                1.23.5
openpyxl:             <not installed>
pandas:               1.5.3
pyarrow:              16.1.0
pydantic:             2.4.2
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           1.4.49
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@jcrist jcrist added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Jun 6, 2024
@cmdlineluser
Contributor

The rationale is explained in #6487 (.implode() was previously known as .list()).

Author

jcrist commented Jun 6, 2024

Ah, thanks. I still find the behavior surprising, but since it's intended, I'll close.

@jcrist jcrist closed this as completed Jun 6, 2024