Filter based on nulls in multiple columns #6808

Closed
2 tasks done
lucazanna opened this issue Feb 11, 2023 · 11 comments · Fixed by #7265
Labels: enhancement (New feature or an improvement of an existing feature), python (Related to Python Polars)

Comments

@lucazanna

Polars version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Issue description

Filtering a dataframe based on nulls in multiple columns does not work in a few cases.

Reproducible example

# Test data
df = pl.DataFrame([{'col1': '1', 'col2': 'a', 'col3': None},
 {'col1': None, 'col2': None, 'col3': None},
 {'col1': '3', 'col2': 'c', 'col3': 'z'}
 ])

shape: (3, 3)
┌──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 │
│ ---  ┆ ---  ┆ ---  │
│ str  ┆ str  ┆ str  │
╞══════╪══════╪══════╡
│ 1    ┆ a    ┆ null │
│ null ┆ null ┆ null │
│ 3    ┆ c    ┆ z    │
└──────┴──────┴──────┘

Row 0: one column null, and other columns with data
Row 1: all columns null
Row 2: all columns not null

Expected behavior

Test 1: filter when all columns are null with is_null -> OK

df.filter( pl.col(['col1','col2','col3']).is_null() )
OK: Returns row 1 where all columns are null.

Test 2: Opposite of test 1 -> NOK
df.filter( ~pl.col(['col1','col2','col3']).is_null() )

Returns: row 2, where all columns are not null
Expected: row 0 and row 2 (the opposite of test 1)

Test 3: New column that is True where all columns are nulls -> NOK

df.with_columns([ pl.when(pl.col(['col1','col2','col3']).is_null()).then(True).otherwise(False).alias('all null'), ])

Returns: true, true, false
Expected: false, true, false (true only for row 1 that is all null)

Test 4: New column that is True where all columns are nulls, using separate is_null -> NOK

all_null = (pl.col(['col1']).is_null()) & (pl.col(['col2']).is_null()) & (pl.col(['col3']).is_null())
df.with_columns([ pl.when(all_null).then(True).otherwise(False).alias('all null separate check') ])

Returns:
ComputeError: Expanding more than one col is not yet allowed. Error originated just after operation: ' DF ["col1", "col2", "col3"]; PROJECT */3 COLUMNS; SELECTION: "None"
'This operation could not be added to the plan.

Expected: false, true, false (true only for row 1 that is all null)

Test 5: New column that is True where at least one column is not null, using separate is_not_null -> NOK
at_least_one_not_null = (pl.col(['col1']).is_not_null()) | (pl.col(['col2']).is_not_null()) | (pl.col(['col3']).is_not_null())
df.with_columns([ pl.when(at_least_one_not_null).then(True).otherwise(False).alias('at least one not null') ])

Returns:
ComputeError: Expanding more than one col is not yet allowed. Error originated just after operation: ' DF ["col1", "col2", "col3"]; PROJECT */3 COLUMNS; SELECTION: "None"
'This operation could not be added to the plan.

Expected: true, false, true (true for row 0 and row 2 that have at least one non-null column)

Installed versions

---Version info---
Polars: 0.16.2
Index type: UInt32
Platform: Linux-5.15.83.1-microsoft-standard-WSL2-x86_64-with-glibc2.31
Python: 3.11.1 (main, Dec  7 2022, 01:11:44) [GCC 9.4.0]
---Optional dependencies---
pyarrow: 11.0.0
pandas: 1.5.2
numpy: 1.24.0
fsspec: <not installed>
connectorx: <not installed>
xlsx2csv: 0.8.1
deltalake: <not installed>
matplotlib: 3.6.2
@lucazanna lucazanna added bug Something isn't working python Related to Python Polars labels Feb 11, 2023
@cmdlineluser
Contributor

cmdlineluser commented Feb 11, 2023

Using .select() may help to visualize things as this topic can be a little confusing:

df.select(
   pl.col(["col1", "col2", "col3"]).is_null()
)
shape: (3, 3)
┌───────┬───────┬───────┐
│ col1  | col2  | col3  │
│ ---   | ---   | ---   │
│ bool  | bool  | bool  │
╞═══════╪═══════╪═══════╡
│ false | false | true  │
│ true  | true  | true  │
│ false | false | false │
└───────┴───────┴───────┘

You can use pl.all() to reduce this down to a single true/false for each row

df.select(
   pl.all(pl.col(["col1", "col2", "col3"]).is_null())
)
shape: (3, 1)
┌───────┐
│ all   │
│ ---   │
│ bool  │
╞═══════╡
│ false │
│ true  │
│ false │
└───────┘

All null rows:

df.filter(
   pl.all(pl.col(["col1", "col2", "col3"]).is_null())
)
shape: (1, 3)
┌──────┬──────┬──────┐
│ col1 | col2 | col3 │
│ ---  | ---  | ---  │
│ str  | str  | str  │
╞══════╪══════╪══════╡
│ null | null | null │
└──────┴──────┴──────┘

The inverse: (I thought there may be a pl.none but there doesn't seem to be.)

df.filter(
   ~pl.all(pl.col(["col1", "col2", "col3"]).is_null())
)
shape: (2, 3)
┌──────┬──────┬──────┐
│ col1 | col2 | col3 │
│ ---  | ---  | ---  │
│ str  | str  | str  │
╞══════╪══════╪══════╡
│ 1    | a    | null │
│ 3    | c    | z    │
└──────┴──────┴──────┘

It seems that if you do not reduce it down and pass multiple booleans to filter, it filters where all the values are true.

This is why you get row 1

>>> df.select( pl.col(['col1','col2','col3']).is_null() )
shape: (3, 3)
┌───────┬───────┬───────┐
│ col1  | col2  | col3  │
│ ---   | ---   | ---   │
│ bool  | bool  | bool  │
╞═══════╪═══════╪═══════╡
│ false | false | true  │
│ true  | true  | true  │
│ false | false | false │
└───────┴───────┴───────┘

And row 2

>>> df.select( ~pl.col(['col1','col2','col3']).is_null() )
shape: (3, 3)
┌───────┬───────┬───────┐
│ col1  | col2  | col3  │
│ ---   | ---   | ---   │
│ bool  | bool  | bool  │
╞═══════╪═══════╪═══════╡
│ true  | true  | false │
│ false | false | false │
│ true  | true  | true  │
└───────┴───────┴───────┘

... in your tests 1 and 2

It's like there is an implicit pl.all(), meaning it's translating to pl.all(~pl.col(...)) when what you want is ~pl.all(pl.col(...)).

@lucazanna
Author

Thank you @cmdlineluser!

This makes sense, now I understand:
A pl.all() is added whenever there's a filter on multiple columns.
For a select or with_columns statement, the pl.all() has to be added manually.
And if we want to invert a filter referring to multiple columns, we also have to add the pl.all() manually.

I imagine that the best practice is to always use a pl.all() or pl.any() when filtering based on multiple conditions?

Would you know what can be the problem for the tests 4 and 5? should they be written differently?

@cmdlineluser
Contributor

Sorry, I didn't notice the exceptions in 4 and 5.

It has to do with using pl.col([...]) multiple times - specifically the [] part. #6485

[] is used to refer to multiple columns in a single .col() call - and this is only allowed once.

You can remove them as you're referring to a single column in each .col() e.g.

all_null = (pl.col('a').is_null() & pl.col('b').is_null() & pl.col('c').is_null())

I'm not exactly sure if there are any valid use cases for passing multiple booleans to .filter() without reducing them down with .any() / .all() - perhaps polars could detect this and suggest their usage.

@lucazanna
Author

Hi @cmdlineluser ,

Thank you, it's all clear now. It's a great help.

I agree with your suggestion: it would be nice if Polars returned an error when multiple booleans are passed to .filter(), asking the user to specify how the booleans should be aggregated, for example with pl.all() or pl.any().

the forced use of pl.all() or pl.any() would make the code more explicit.
It's a minor improvement but could be nice for new users.

@cmdlineluser
Contributor

Maybe the issue should be left open to allow the polars team to give their opinion on the matter and perhaps something can be changed/improved upon if they feel it is needed.

@lucazanna lucazanna reopened this Feb 12, 2023
@lucazanna
Author

Reopen for comments from the Polars team.

Ideas/suggestions:
1- passing multiple booleans (example: pl.col(['a','b','c']).is_null() ) seems to have a different behaviour in .filter() and pl.when():
.filter() applies a pl.all() to the multiple booleans, while pl.when() applies a pl.any() to the multiple booleans.
Should the behavior be aligned between .filter() and pl.when()?

or in alternative:
2- should Polars force explicit code by raising an error when multiple booleans are passed as a condition in either .filter() or pl.when()?
Polars might ask to use pl.all() or pl.any() to clarify how to consolidate multiple booleans

@zundertj zundertj added enhancement New feature or an improvement of an existing feature and removed bug Something isn't working labels Feb 18, 2023
@zundertj
Collaborator

zundertj commented Feb 18, 2023

If I can summarize with the smallest possible example:

import polars as pl
df = pl.DataFrame({"a": [True, True, False], "b": [True, False, False]}).with_row_count()
df
shape: (3, 3)
┌────────┬───────┬───────┐
│ row_nr ┆ a     ┆ b     │
│ ---    ┆ ---   ┆ ---   │
│ u32    ┆ bool  ┆ bool  │
╞════════╪═══════╪═══════╡
│ 0      ┆ true  ┆ true  │
│ 1      ┆ true  ┆ false │
│ 2      ┆ false ┆ false │
└────────┴───────┴───────┘

df.filter(pl.col(["a", "b"]))
shape: (1, 3)
┌────────┬──────┬──────┐
│ row_nr ┆ a    ┆ b    │
│ ---    ┆ ---  ┆ ---  │
│ u32    ┆ bool ┆ bool │
╞════════╪══════╪══════╡
│ 0      ┆ true ┆ true │
└────────┴──────┴──────┘

There is thus an implicit pl.all() going on, and the question is whether this should be allowed, changed to any, or something else.

Then for pl.when, I am not sure I understand what is different, as it works element-wise:

df.select(pl.when(pl.col(["a", "b"])).then("i am true").otherwise("i am false").keep_name())
shape: (3, 2)
┌────────────┬────────────┐
│ a          ┆ b          │
│ ---        ┆ ---        │
│ str        ┆ str        │
╞════════════╪════════════╡
│ i am true  ┆ i am true  │
│ i am true  ┆ i am false │
│ i am false ┆ i am false │
└────────────┴────────────┘

And thus it has the same logic as my first code snippet:

df.filter(pl.when(pl.col(["a", "b"])).then(True).otherwise(False).keep_name())
shape: (1, 3)
┌────────┬──────┬──────┐
│ row_nr ┆ a    ┆ b    │
│ ---    ┆ ---  ┆ ---  │
│ u32    ┆ bool ┆ bool │
╞════════╪══════╪══════╡
│ 0      ┆ true ┆ true │
└────────┴──────┴──────┘

Please let me know @lucazanna if I misunderstood your point about pl.when.

I am removing the bug label, because I think this is more a question on filter having an implicit all behaviour.

@lucazanna
Author

Hi @zundertj

Thanks for the clear explanation.
I tried the example again; it might have to do with the is_null() operator.

Here is a new example (I'm taking your example and replacing booleans with strings or nulls)

import polars as pl
df = pl.DataFrame({"a": [None, "2", "3"], "b": [None, None, "z"]}).with_row_count()
shape: (3, 3)
┌────────┬──────┬──────┐
│ row_nr ┆ a    ┆ b    │
│ ---    ┆ ---  ┆ ---  │
│ u32    ┆ str  ┆ str  │
╞════════╪══════╪══════╡
│ 0      ┆ null ┆ null │
│ 1      ┆ 2    ┆ null │
│ 2      ┆ 3    ┆ z    │
└────────┴──────┴──────┘

# Test 1 Filter -> returns row 0
df.filter(pl.col(['a','b']).is_null())

shape: (1, 3)
┌────────┬──────┬──────┐
│ row_nr ┆ a    ┆ b    │
│ ---    ┆ ---  ┆ ---  │
│ u32    ┆ str  ┆ str  │
╞════════╪══════╪══════╡
│ 0      ┆ null ┆ null │
└────────┴──────┴──────┘

# Test 2 With Columns and When -> returns row 0 and row 1
df.with_columns(
    pl.when(pl.col(["a", "b"]).is_null()).then(True).otherwise(False) 
    )
shape: (3, 4)
┌────────┬──────┬──────┬─────────┐
│ row_nr ┆ a    ┆ b    ┆ literal │
│ ---    ┆ ---  ┆ ---  ┆ ---     │
│ u32    ┆ str  ┆ str  ┆ bool    │
╞════════╪══════╪══════╪═════════╡
│ 0      ┆ null ┆ null ┆ true    │
│ 1      ┆ 2    ┆ null ┆ true    │
│ 2      ┆ 3    ┆ z    ┆ false   │
└────────┴──────┴──────┴─────────┘

Let me know if there is anything I am missing

@ritchie46
Member

I think we should raise an error as the intent is clearly ambiguous.

@matterhorn103
Copy link

Sorry, I'm having trouble with this (in Python Polars 0.20.31).
Since this isn't a bug report but a help question, I figured it's wrong to open a new issue and better to jump on this thread?
Or is there a better way to ask for help with Polars?

From this issue and #7265 I understand that for the above example dataframe:

# Test data
df = pl.DataFrame([{'col1': '1', 'col2': 'a', 'col3': None},
 {'col1': None, 'col2': None, 'col3': None},
 {'col1': '3', 'col2': 'c', 'col3': 'z'}
 ])

shape: (3, 3)
┌──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 │
│ ---  ┆ ---  ┆ ---  │
│ str  ┆ str  ┆ str  │
╞══════╪══════╪══════╡
│ 1    ┆ a    ┆ null │
│ null ┆ null ┆ null │
│ 3    ┆ c    ┆ z    │
└──────┴──────┴──────┘

pl.col("*").is_null() expands to

(pl.col("col1").is_null(), pl.col("col2").is_null(), pl.col("col3").is_null())

as would (pl.col(['col1','col2','col3']).is_null())

and thus filtering based on that expression is ambiguous.

Firstly, I don't really understand why that is ambiguous? The docs for df.filter() say:

Provide multiple filters using *args syntax:

df.filter(
   pl.col("foo") <= 2,
   ~pl.col("ham").is_in(["b", "c"]),
)

and indeed providing the separate arguments works:

>>> df.filter(
    pl.col("col1").is_null(),
    pl.col("col2").is_null(),
    pl.col("col3").is_null()
)
shape: (1, 3)
┌──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 │
│ ---  ┆ ---  ┆ ---  │
│ str  ┆ str  ┆ str  │
╞══════╪══════╪══════╡
│ null ┆ null ┆ null │
└──────┴──────┴──────┘

So I don't understand why df.filter(pl.col("*").is_null()) raises this new error rather than just expanding to the second example?
Is the idea that it is ambiguous because the user might think they are selecting based on any of the columns containing null when in reality they are selecting based on all of them containing null, and it's better to raise an error?

I am also confused about how I should be using pl.all() and pl.any() to disambiguate this, as suggested by the error message:

polars.exceptions.ComputeError: The predicate passed to 'LazyFrame.filter' expanded to multiple expressions: 

        col("col1").is_null(),
        col("col2").is_null(),
        col("col3").is_null(),
This is ambiguous. Try to combine the predicates with the 'all' or `any' expression.

because when I try to do that I also get an error:

>>> df.filter(pl.any(pl.col("*").is_null()))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/matt/python/venv/rfb/lib64/python3.12/site-packages/polars/functions/aggregation/vertical.py", line 115, in any
    return F.col(*names).any(ignore_nulls=ignore_nulls)
           ^^^^^^^^^^^^^
  File "/home/matt/python/venv/rfb/lib64/python3.12/site-packages/polars/functions/col.py", line 292, in __new__
    return _create_col(name, *more_names)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/matt/python/venv/rfb/lib64/python3.12/site-packages/polars/functions/col.py", line 67, in _create_col
    raise TypeError(msg)
TypeError: invalid input for `col`

Expected `str` or `DataType`, got 'Expr'.

when I would have expected that to be exactly the right way to disambiguate based on the other error message.
And indeed, that is the way ChatGPT suggests doing it.

Equally, trying df.filter(pl.col("*").is_null().any()) returns

polars.exceptions.ComputeError: The predicate passed to 'LazyFrame.filter' expanded to multiple expressions: 

        col("col1").is_null().any(),
        col("col2").is_null().any(),
        col("col3").is_null().any(),
This is ambiguous. Try to combine the predicates with the 'all' or `any' expression.

and even when adding parentheses, df.filter((pl.col("*").is_null()).any()) gives exactly the same result, which I find extremely confusing (showing that I really just don't get what the different expressions are doing, I suppose).

Sadly, the API page for pl.any() has very little explanation (for a beginner like me) of what any() and all() actually do, and the single example is not very enlightening either, as it deals with a very different sort of case.
I also can't find anything in the user guide.

Sorry for my likely very stupid beginner mistakes.
If anyone can help even a bit I'd be very much obliged.

@cmdlineluser
Contributor

@matterhorn103

There is a Polars Discord for real-time help: https://discord.gg/4UfP5cfBE7 and also StackOverflow for help questions.

Things have changed quite a bit since this issue.

Dedicated horizontal functions were introduced: pl.any_horizontal() and pl.all_horizontal().

So instead of:

df.filter(pl.col("*").is_null().any())

It would be:

df.filter(pl.any_horizontal(pl.all().is_null()))

The filter *args syntax is new and did not exist at the time of this issue.

They are implicitly combined by & (or as if surrounded by pl.all_horizontal) - so you raise a valid point.

So I don't understand why df.filter(pl.col("*").is_null()) raises this new error rather than just expanding to the second example?
