`.over()` fails with `.get`. #18191

PierXuY · 2024-08-14T14:06:45Z

Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

df = pl.DataFrame(
    {"name": list("abcdef"), "age": [21, 31, 32, 53, 45, 26], "country": list("AABBBC")}
)

out1 = df.select(
    pl.col("country").unique(),
    pl.col("name")
    .sort_by("age")
    .implode()
    .over("country", mapping_strategy="explode"),
)
print(out1)

out2 = df.select(
    pl.col("country").unique(),
    pl.col("name")
    .sort_by("age")
    .implode()
    .list
    .get(0)
    .over("country", mapping_strategy="explode"),
)
# InvalidOperationError: 'implode' followed by an aggregation is not allowed
print(out2)

Log output

InvalidOperationError: 'implode' followed by an aggregation is not allowed.

Issue description

It seems that .over() cannot be applied to an expression that ends in .list.get().

I also tried Expr.get() instead of .list.get():

out3 = df.select(
    pl.col("country").unique(),
    pl.col("name")
    .sort_by("age")
    .implode()
    .get(0)
    .over("country", mapping_strategy="explode"),
)
print(out3)

Error: OutOfBoundsError: gather indices are out of bounds.

Expected behavior

I don't understand what these two errors mean or what causes them.

I know this logic can be expressed with group_by instead. What I mainly want to know is the underlying reason for the errors and whether this is an unexpected bug.

Installed versions

--------Version info---------
Polars:               1.4.1
Index type:           UInt32
Platform:             Windows-11-10.0.22631-SP0
Python:               3.12.2 (tags/v3.12.2:6abddd9, Feb  6 2024, 21:26:36) [MSC v.1937 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          3.0.0
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2024.5.0
gevent:               <not installed>
great_tables:         <not installed>
hvplot:               <not installed>
matplotlib:           3.8.3
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             3.1.2
pandas:               2.2.1
pyarrow:              16.0.0
pydantic:             2.6.4
pyiceberg:            <not installed>
sqlalchemy:           2.0.29
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

cmdlineluser · 2024-08-14T14:24:20Z

.list.get can go after .over():

df.select(
    pl.col("name")
      .sort_by("age")
      .implode()
      .over("country", mapping_strategy="explode")
      .list.get(0)
)

# shape: (3, 1)
# ┌──────┐
# │ name │
# │ ---  │
# │ str  │
# ╞══════╡
# │ a    │
# │ c    │
# │ f    │
# └──────┘

Not sure what is happening with the Expr.get() example:

df.select(
    pl.col("name")
      .sort_by("age")
      .implode()
      .get(0)
      .over("country", mapping_strategy="explode")
)

# OutOfBoundsError: gather indices are out of bounds

Perhaps it's supposed to raise the InvalidOperationError and not be allowed to run?

PierXuY · 2024-08-14T14:32:39Z

@cmdlineluser
Thank you for your answer.

I would like to ask why .list.get cannot be used before .over?

In addition, consider:

df.select(
    pl.col("name")
      .sort_by("age")
      .implode()
      .over("country", mapping_strategy="explode")
      .list.get(0)
)

Isn't this exactly 'implode' followed by an aggregation, which seems to conflict with the content of the error message?

deanm0000 · 2024-08-14T21:10:59Z

The reason you can't do implode().list.get(0) is that implode is going to make the column a List but it doesn't do that until the context (select) resolves. It, therefore, can't actually use the List namespace yet.

The reason you can do .implode().over("country", mapping_strategy="explode").list.get(0) is that the over is a context which resolves before you're applying the list namespace to it.

What are you expecting for the output of

df.select(
    pl.col("country").unique(),
    pl.col("name")
    .sort_by("age")
    .implode()
    .list
    .get(0)
    .over("country", mapping_strategy="explode"),
)

Is it something different from what this gives?

df.select(
    pl.col("country").unique(),
    pl.col("name")
    .sort_by("age")
    .get(0)
    .over("country", mapping_strategy="explode"),
)

If I'm misunderstanding what you're trying to do, I'm happy to reopen, but for now I'm going to close.

PierXuY · 2024-08-15T00:52:13Z

The difference is that only the .get method of the list namespace has the `null_on_oob` parameter. If I use the plain .get(2) instead, an error is raised.

deanm0000 · 2024-08-15T02:00:29Z

Ahhh I see. Try doing slice(2,1).first() instead. The 2 is the offset and the 1 is the length. If the slice is out of bounds it returns 0 rows, and first() turns the empty result into null.

df.select(
    pl.col("country").unique(),
    pl.col("name")
    .sort_by("age")
    .slice(2,1).first()
    .over("country", mapping_strategy="explode"),
)
shape: (3, 2)
┌─────────┬──────┐
│ country ┆ name │
│ ---     ┆ ---  │
│ str     ┆ str  │
╞═════════╪══════╡
│ B       ┆ null │
│ A       ┆ e    │
│ C       ┆ null │
└─────────┴──────┘

Incidentally, you might want to follow this

PierXuY · 2024-08-15T03:47:16Z

Thank you! .slice is a nice method.

But I found that I was misusing .select and .unique: the order in which .unique returns its results is not guaranteed, so the country column can end up misaligned with the name column.

I used group_by instead and got the expected results:

df.group_by("country").agg(
    pl.col("name").sort_by("age").slice(2, 1).first()
)

PierXuY added the bug, needs triage, and python labels · Aug 14, 2024

deanm0000 closed this as not planned · Aug 14, 2024

deanm0000 removed the needs triage label · Aug 14, 2024
