Filter based on nulls in multiple columns #6808
Using
You can use
All null rows:
The inverse: (I thought there may be a
It seems that if you do not reduce it down and pass multiple booleans to filter, it filters rows where all the values are true. This is why you get
And
... in your tests. It's like there is an implicit `&`.
Thank you @cmdlineluser! This makes sense; now I understand. I imagine the best practice is to always use pl.all() or pl.any() when filtering based on multiple conditions? Would you know what the problem could be for tests 4 and 5? Should they be written differently?
Sorry, I didn't notice the exceptions in 4 and 5. It has to do with using
You can remove them, as you're referring to a single column in each.
I'm not exactly sure if there are any valid use cases for passing multiple booleans to filter.
Hi @cmdlineluser, thank you, it's all clear now. It's a great help. I agree with your suggestion: it would be nice if Polars returned an error when multiple booleans are passed to filter; the forced use of pl.all() or pl.any() would make the code more explicit.
Maybe the issue should be left open to allow the Polars team to give their opinion on the matter, and perhaps something can be changed/improved if they feel it is needed.
Reopen for comments from the Polars team. Ideas/suggestions: or alternatively:
If I can summarize with the smallest possible example:

import polars as pl
df = pl.DataFrame({"a": [True, True, False], "b": [True, False, False]}).with_row_count()
df
shape: (3, 3)
┌────────┬───────┬───────┐
│ row_nr ┆ a ┆ b │
│ --- ┆ --- ┆ --- │
│ u32 ┆ bool ┆ bool │
╞════════╪═══════╪═══════╡
│ 0 ┆ true ┆ true │
│ 1 ┆ true ┆ false │
│ 2 ┆ false ┆ false │
└────────┴───────┴───────┘
df.filter(pl.col(["a", "b"]))
shape: (1, 3)
┌────────┬──────┬──────┐
│ row_nr ┆ a ┆ b │
│ --- ┆ --- ┆ --- │
│ u32 ┆ bool ┆ bool │
╞════════╪══════╪══════╡
│ 0 ┆ true ┆ true │
└────────┴──────┴──────┘

There is thus an implicit `&`. Then:

df.select(pl.when(pl.col(["a", "b"])).then("i am true").otherwise("i am false").keep_name())
shape: (3, 2)
┌────────────┬────────────┐
│ a ┆ b │
│ --- ┆ --- │
│ str ┆ str │
╞════════════╪════════════╡
│ i am true ┆ i am true │
│ i am true ┆ i am false │
│ i am false ┆ i am false │
└────────────┴────────────┘

And thus it has the same logic as my first code snippet:

df.filter(pl.when(pl.col(["a", "b"])).then(True).otherwise(False).keep_name())
shape: (1, 3)
┌────────┬──────┬──────┐
│ row_nr ┆ a ┆ b │
│ --- ┆ --- ┆ --- │
│ u32 ┆ bool ┆ bool │
╞════════╪══════╪══════╡
│ 0 ┆ true ┆ true │
└────────┴──────┴──────┘

Please let me know @lucazanna if I misunderstood your point. I am removing the bug label, because I think this is more a usage question.
Hi @zundertj, thanks for the clear explanation. Here is a new example (I'm taking your example and replacing the booleans with strings or nulls).
Let me know if there is anything I am missing.
I think we should raise an error, as the intent is clearly ambiguous.
Sorry, I'm having trouble with this (in Python Polars 0.20.31). From this issue and #7265 I understand that for the above example dataframe:
as would, and thus filtering based on that expression is ambiguous. Firstly, I don't really understand why that is ambiguous. The docs for df.filter() say:
and indeed providing the separate arguments works:
So I don't understand why. I am also confused about how I should be using
because when I try to do that I also get an error:
when I would have expected that to be exactly the right way to disambiguate, based on the other error message. Equally, trying
and even when adding brackets. Sadly, the API page for filter did not clear this up for me. Sorry for my likely very stupid beginner mistakes.
There is a Polars Discord for real-time help: https://discord.gg/4UfP5cfBE7 and also StackOverflow for help questions.

Things have changed quite a bit since this issue. Dedicated horizontal functions were introduced. So instead of:

df.filter(pl.col("*").is_null().any())

It would be:

df.filter(pl.any_horizontal(pl.all().is_null()))

The filter *args syntax is new and did not exist at the time of this issue. The arguments are implicitly combined by `&`.
Polars version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.
Issue description
Filtering a dataframe based on nulls in multiple columns does not work in a few cases.
Reproducible example
Expected behavior
Test 1: filter when all columns are null with is_null -> OK
df.filter( pl.col(['col1','col2','col3']).is_null() )
OK: Returns row 1 where all columns are null.
Test 2: Opposite of test 1 -> NOK
df.filter( ~pl.col(['col1','col2','col3']).is_null() )
Returns: row 2, where all columns are not null
Expected: row 0 and row 2 (the opposite of test 1)
Test 3: New column that is True where all columns are nulls -> NOK
df.with_columns([ pl.when(pl.col(['col1','col2','col3']).is_null()).then(True).otherwise(False).alias('all null'), ])
Returns: true, true, false
Expected: false, true, false (true only for row 1 that is all null)
Test 4: New column that is True where all columns are nulls, using separate is_null -> NOK
all_null = (pl.col(['col1']).is_null()) & (pl.col(['col2']).is_null()) & (pl.col(['col3']).is_null())
df.with_columns([ pl.when(all_null).then(True).otherwise(False).alias('all null separate check') ])
Returns:
ComputeError: Expanding more than one `col` is not yet allowed. Error originated just after operation: 'DF ["col1", "col2", "col3"]; PROJECT */3 COLUMNS; SELECTION: "None"'. This operation could not be added to the plan.
Expected: false, true, false (true only for row 1 that is all null)
Test 5: New column that is True where at least one column is not null, using separate is_not_null -> NOK
at_least_one_not_null = (pl.col(['col1']).is_not_null()) | (pl.col(['col2']).is_not_null()) | (pl.col(['col3']).is_not_null())
df.with_columns([ pl.when(at_least_one_not_null).then(True).otherwise(False).alias('at least one not null') ])
Returns:
ComputeError: Expanding more than one `col` is not yet allowed. Error originated just after operation: 'DF ["col1", "col2", "col3"]; PROJECT */3 COLUMNS; SELECTION: "None"'. This operation could not be added to the plan.
Expected: true, false, true (true for row 0 and row 2 that have at least one non-null column)
Installed versions