.over() fails with .get. #18191

Closed
2 tasks done
PierXuY opened this issue Aug 14, 2024 · 6 comments
Labels
bug, python

Comments

@PierXuY

PierXuY commented Aug 14, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

df = pl.DataFrame(
    {"name": list("abcdef"), "age": [21, 31, 32, 53, 45, 26], "country": list("AABBBC")}
)

out1 = df.select(
    pl.col("country").unique(),
    pl.col("name")
    .sort_by("age")
    .implode()
    .over("country", mapping_strategy="explode"),
)
print(out1)

out2 = df.select(
    pl.col("country").unique(),
    pl.col("name")
    .sort_by("age")
    .implode()
    .list
    .get(0)
    .over("country", mapping_strategy="explode"),
)
# InvalidOperationError: 'implode' followed by an aggregation is not allowed
print(out2) 

Log output

InvalidOperationError: 'implode' followed by an aggregation is not allowed.

Issue description

It seems that .over cannot be combined with .list.get.

I also tried the code below, which calls .get directly on the expression:

out3 = df.select(
    pl.col("country").unique(),
    pl.col("name")
    .sort_by("age")
    .implode()
    .get(0)
    .over("country", mapping_strategy="explode"),
)
print(out3)

Error: OutOfBoundsError: gather indices are out of bounds.

Expected behavior

I don’t understand what these two errors mean or what causes them.

I know this logic can be expressed with group_by instead. What I mainly want to know is the underlying reason for the errors and whether this is an unexpected bug.

Installed versions

--------Version info---------
Polars:               1.4.1
Index type:           UInt32
Platform:             Windows-11-10.0.22631-SP0
Python:               3.12.2 (tags/v3.12.2:6abddd9, Feb  6 2024, 21:26:36) [MSC v.1937 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          3.0.0
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2024.5.0
gevent:               <not installed>
great_tables:         <not installed>
hvplot:               <not installed>
matplotlib:           3.8.3
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             3.1.2
pandas:               2.2.1
pyarrow:              16.0.0
pydantic:             2.6.4
pyiceberg:            <not installed>
sqlalchemy:           2.0.29
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
PierXuY added the bug, needs triage, and python labels Aug 14, 2024
@cmdlineluser
Contributor

.list.get can go after the .over() that follows .implode():

df.select(
    pl.col("name")
      .sort_by("age")
      .implode()
      .over("country", mapping_strategy="explode")
      .list.get(0)
)

# shape: (3, 1)
# ┌──────┐
# │ name │
# │ ---  │
# │ str  │
# ╞══════╡
# │ a    │
# │ c    │
# │ f    │
# └──────┘

Not sure what is happening with the Expr.get() example.

df.select(
    pl.col("name")
      .sort_by("age")
      .implode()
      .get(0)
      .over("country", mapping_strategy="explode")
)

# OutOfBoundsError: gather indices are out of bounds

Perhaps it's supposed to raise the InvalidOperationError and not be allowed to run?

@PierXuY
Author

PierXuY commented Aug 14, 2024

@cmdlineluser
Thank you for your answer.

I would like to ask why .list.get cannot be used before .over.

In addition, consider:

df.select(
    pl.col("name")
      .sort_by("age")
      .implode()
      .over("country", mapping_strategy="explode")
      .list.get(0)
)

Isn't this exactly 'implode' followed by an aggregation? That seems to conflict with the error message.

@deanm0000
Collaborator

The reason you can't do implode().list.get(0) is that implode turns the column into a List, but not until the context (select) resolves, so the List namespace can't be used on it yet.

The reason you can do .implode().over("country", mapping_strategy="explode").list.get(0) is that over is itself a context, and it resolves before the list namespace is applied to its result.

What are you expecting for the output of

df.select(
    pl.col("country").unique(),
    pl.col("name")
    .sort_by("age")
    .implode()
    .list
    .get(0)
    .over("country", mapping_strategy="explode"),
)

Is it something different from what this gives?

df.select(
    pl.col("country").unique(),
    pl.col("name")
    .sort_by("age")
    .get(0)
    .over("country", mapping_strategy="explode"),
)

If I'm misunderstanding what you're trying to do, I'm happy to reopen, but for now I'm going to close.

deanm0000 closed this as not planned Aug 14, 2024
deanm0000 removed the needs triage label Aug 14, 2024
@PierXuY
Author

PierXuY commented Aug 15, 2024

The difference is that only the .get method of the list namespace (.list.get) has the null_on_oob parameter. With plain .get, asking for an index that a group does not have, such as .get(2), raises an error.

@deanm0000
Collaborator

deanm0000 commented Aug 15, 2024

Ahhh I see. Try doing slice(2,1).first() instead. The 2 is the offset and the 1 is the length. If the slice is out of bounds, it returns 0 rows, and first() turns the empty result into null.

df.select(
    pl.col("country").unique(),
    pl.col("name")
    .sort_by("age")
    .slice(2,1).first()
    .over("country", mapping_strategy="explode"),
)
shape: (3, 2)
┌─────────┬──────┐
│ country ┆ name │
│ ---     ┆ ---  │
│ str     ┆ str  │
╞═════════╪══════╡
│ B       ┆ null │
│ A       ┆ e    │
│ C       ┆ null │
└─────────┴──────┘

Incidentally, you might want to follow this

@PierXuY
Author

PierXuY commented Aug 15, 2024

Thank you! .slice is a nice method.

But I realized that I was misusing .select and .unique: the order in which .unique returns its results is not guaranteed, so the country column can end up misaligned with the name column.

I used the group_by method instead and got the expected results:

df.group_by("country").agg(
    pl.col("name").sort_by("age").slice(2, 1).first()
)
