Filter based on nulls in multiple columns #6808

Closed
2 tasks done
lucazanna opened this issue Feb 11, 2023 · 11 comments · Fixed by #7265
Labels: enhancement (New feature or an improvement of an existing feature), python (Related to Python Polars)

Comments

@lucazanna

Polars version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Issue description

Filtering a dataframe based on nulls in multiple columns does not work in a few cases.

Reproducible example

# Test data
df = pl.DataFrame([{'col1': '1', 'col2': 'a', 'col3': None},
 {'col1': None, 'col2': None, 'col3': None},
 {'col1': '3', 'col2': 'c', 'col3': 'z'}
 ])

shape: (3, 3)
┌──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 │
│ ---  ┆ ---  ┆ ---  │
│ str  ┆ str  ┆ str  │
╞══════╪══════╪══════╡
│ 1    ┆ a    ┆ null │
│ null ┆ null ┆ null │
│ 3    ┆ c    ┆ z    │
└──────┴──────┴──────┘

Row 0: one column null, and other columns with data
Row 1: all columns null
Row 2: all columns not null

Expected behavior

Test 1: filter when all columns are null with is_null -> OK

df.filter( pl.col(['col1','col2','col3']).is_null() )
OK: Returns row 1 where all columns are null.

Test 2: Opposite of test 1 -> NOK
df.filter( ~pl.col(['col1','col2','col3']).is_null() )

Returns: row 2, where all columns are not null
Expected: row 0 and row 2 (the opposite of test 1)

Test 3: New column that is True where all columns are nulls -> NOK

df.with_columns([ pl.when(pl.col(['col1','col2','col3']).is_null()).then(True).otherwise(False).alias('all null'), ])

Returns: true, true, false
Expected: false, true, false (true only for row 1 that is all null)

Test 4: New column that is True where all columns are nulls, using separate is_null -> NOK

all_null = (pl.col(['col1']).is_null()) & (pl.col(['col2']).is_null()) & (pl.col(['col3']).is_null())
df.with_columns([ pl.when(all_null).then(True).otherwise(False).alias('all null separate check') ])

Returns:
ComputeError: Expanding more than one col is not yet allowed. Error originated just after operation: ' DF ["col1", "col2", "col3"]; PROJECT */3 COLUMNS; SELECTION: "None"
'This operation could not be added to the plan.

Expected: false, true, false (true only for row 1 that is all null)

Test 5: New column that is True where at least one column is not null, using separate is_not_null -> NOK
at_least_one_not_null = (pl.col(['col1']).is_not_null()) | (pl.col(['col2']).is_not_null()) | (pl.col(['col3']).is_not_null())
df.with_columns([ pl.when(at_least_one_not_null).then(True).otherwise(False).alias('at least one not null') ])

Returns:
ComputeError: Expanding more than one col is not yet allowed. Error originated just after operation: ' DF ["col1", "col2", "col3"]; PROJECT */3 COLUMNS; SELECTION: "None"
'This operation could not be added to the plan.

Expected: true, false, true (true for row 0 and row 2 that have at least one non-null column)

Installed versions

---Version info---
Polars: 0.16.2
Index type: UInt32
Platform: Linux-5.15.83.1-microsoft-standard-WSL2-x86_64-with-glibc2.31
Python: 3.11.1 (main, Dec  7 2022, 01:11:44) [GCC 9.4.0]
---Optional dependencies---
pyarrow: 11.0.0
pandas: 1.5.2
numpy: 1.24.0
fsspec: <not installed>
connectorx: <not installed>
xlsx2csv: 0.8.1
deltalake: <not installed>
matplotlib: 3.6.2
@lucazanna lucazanna added bug Something isn't working python Related to Python Polars labels Feb 11, 2023
@cmdlineluser
Contributor

cmdlineluser commented Feb 11, 2023

Using .select() may help to visualize things as this topic can be a little confusing:

df.select(
   pl.col(["col1", "col2", "col3"]).is_null()
)
shape: (3, 3)
┌───────┬───────┬───────┐
│ col1  | col2  | col3  │
│ ---   | ---   | ---   │
│ bool  | bool  | bool  │
╞═══════╪═══════╪═══════╡
│ false | false | true  │
│ true  | true  | true  │
│ false | false | false │
└───────┴───────┴───────┘

You can use pl.all() to reduce this down to a single true/false for each row

df.select(
   pl.all(pl.col(["col1", "col2", "col3"]).is_null())
)
shape: (3, 1)
┌───────┐
│ all   │
│ ---   │
│ bool  │
╞═══════╡
│ false │
│ true  │
│ false │
└───────┘

All null rows:

df.filter(
   pl.all(pl.col(["col1", "col2", "col3"]).is_null())
)
shape: (1, 3)
┌──────┬──────┬──────┐
│ col1 | col2 | col3 │
│ ---  | ---  | ---  │
│ str  | str  | str  │
╞══════╪══════╪══════╡
│ null | null | null │
└──────┴──────┴──────┘

The inverse: (I thought there may be a pl.none but there doesn't seem to be.)

df.filter(
   ~pl.all(pl.col(["col1", "col2", "col3"]).is_null())
)
shape: (2, 3)
┌──────┬──────┬──────┐
│ col1 | col2 | col3 │
│ ---  | ---  | ---  │
│ str  | str  | str  │
╞══════╪══════╪══════╡
│ 1    | a    | null │
│ 3    | c    | z    │
└──────┴──────┴──────┘

It seems that if you do not reduce it down and pass multiple booleans to filter, it filters where all the values are true.

This is why you get row 1

>>> df.select( pl.col(['col1','col2','col3']).is_null() )
shape: (3, 3)
┌───────┬───────┬───────┐
│ col1  | col2  | col3  │
│ ---   | ---   | ---   │
│ bool  | bool  | bool  │
╞═══════╪═══════╪═══════╡
│ false | false | true  │
│ true  | true  | true  │
│ false | false | false │
└───────┴───────┴───────┘

And row 2

>>> df.select( ~pl.col(['col1','col2','col3']).is_null() )
shape: (3, 3)
┌───────┬───────┬───────┐
│ col1  | col2  | col3  │
│ ---   | ---   | ---   │
│ bool  | bool  | bool  │
╞═══════╪═══════╪═══════╡
│ true  | true  | false │
│ false | false | false │
│ true  | true  | true  │
└───────┴───────┴───────┘

... in your tests 1 and 2

It's like there is an implicit pl.all(), meaning it's translating to pl.all(~pl.col(...)) when what you want is ~pl.all(pl.col(...)).

@lucazanna
Author

Thank you @cmdlineluser!

This makes sense, now I understand:
A pl.all() is added whenever there's a filter on multiple columns.
For a select or with_columns statement, the pl.all() has to be added manually.
And if we want to invert a filter referring to multiple columns, we also have to add the pl.all() manually.

I imagine that the best practice is to always use a pl.all() or pl.any() when filtering based on multiple conditions?

Would you know what can be the problem for the tests 4 and 5? should they be written differently?

@cmdlineluser
Contributor

Sorry, I didn't notice the exceptions in 4 and 5.

It has to do with using pl.col([...]) multiple times - specifically the [] part. #6485

[] is used to refer to multiple columns in a single .col() call - and this is only allowed once.

You can remove them as you're referring to a single column in each .col() e.g.

all_null = (pl.col('a').is_null() & pl.col('b').is_null() & pl.col('c').is_null())

I'm not exactly sure if there are any valid use cases for passing multiple booleans to .filter() without reducing them down with .any() / .all() - perhaps polars could detect this and suggest their usage.

@lucazanna
Author

Hi @cmdlineluser ,

Thank you, it's all clear now. It's a great help.

I agree with your suggestion: it would be nice if Polars returned an error when multiple booleans are passed to .filter(), asking the user to specify how the booleans should be aggregated, for example with pl.all() or pl.any().

the forced use of pl.all() or pl.any() would make the code more explicit.
It's a minor improvement but could be nice for new users.

@cmdlineluser
Contributor

Maybe the issue should be left open to allow the polars team to give their opinion on the matter and perhaps something can be changed/improved upon if they feel it is needed.

@lucazanna lucazanna reopened this Feb 12, 2023
@lucazanna
Author

Reopen for comments from the Polars team.

Ideas/suggestions:
1- passing multiple booleans (example: pl.col(['a','b','c']).is_null() ) seems to have a different behaviour in .filter() and pl.when():
.filter() applies a pl.all() to the multiple booleans, while pl.when() applies a pl.any() to the multiple booleans.
Should the behavior be aligned between .filter() and pl.when()?

or in alternative:
2- should Polars force explicit code by raising an error when multiple booleans are passed as a condition in either .filter() or pl.when()?
Polars might ask to use pl.all() or pl.any() to clarify how to consolidate multiple booleans

@zundertj zundertj added enhancement New feature or an improvement of an existing feature and removed bug Something isn't working labels Feb 18, 2023
@zundertj
Collaborator

zundertj commented Feb 18, 2023

If I can summarize with the smallest possible example:

import polars as pl
df = pl.DataFrame({"a": [True, True, False], "b": [True, False, False]}).with_row_count()
df
shape: (3, 3)
┌────────┬───────┬───────┐
│ row_nr ┆ a     ┆ b     │
│ ---    ┆ ---   ┆ ---   │
│ u32    ┆ bool  ┆ bool  │
╞════════╪═══════╪═══════╡
│ 0      ┆ true  ┆ true  │
│ 1      ┆ true  ┆ false │
│ 2      ┆ false ┆ false │
└────────┴───────┴───────┘

df.filter(pl.col(["a", "b"]))
shape: (1, 3)
┌────────┬──────┬──────┐
│ row_nr ┆ a    ┆ b    │
│ ---    ┆ ---  ┆ ---  │
│ u32    ┆ bool ┆ bool │
╞════════╪══════╪══════╡
│ 0      ┆ true ┆ true │
└────────┴──────┴──────┘

There is thus an implicit pl.all() going on, and the question is whether this should be allowed, changed to any, or something else.

Then for pl.when, I am not sure I understand what is different, as it works element-wise:

df.select(pl.when(pl.col(["a", "b"])).then("i am true").otherwise("i am false").keep_name())
shape: (3, 2)
┌────────────┬────────────┐
│ a          ┆ b          │
│ ---        ┆ ---        │
│ str        ┆ str        │
╞════════════╪════════════╡
│ i am true  ┆ i am true  │
│ i am true  ┆ i am false │
│ i am false ┆ i am false │
└────────────┴────────────┘

And thus it has the same logic as my first code snippet:

df.filter(pl.when(pl.col(["a", "b"])).then(True).otherwise(False).keep_name())
shape: (1, 3)
┌────────┬──────┬──────┐
│ row_nr ┆ a    ┆ b    │
│ ---    ┆ ---  ┆ ---  │
│ u32    ┆ bool ┆ bool │
╞════════╪══════╪══════╡
│ 0      ┆ true ┆ true │
└────────┴──────┴──────┘

Please let me know @lucazanna if I misunderstood your point about pl.when.

I am removing the bug label, because I think this is more a question on filter having an implicit all behaviour.

@lucazanna
Author

Hi @zundertj

Thanks for the clear explanation.
I tried the example again; it might have to do with the is_null() operator.

Here is a new example (I'm taking your example and replacing booleans with strings or nulls)

import polars as pl
df = pl.DataFrame({"a": [None, "2", "3"], "b": [None, None, "z"]}).with_row_count()
shape: (3, 3)
┌────────┬──────┬──────┐
│ row_nr ┆ a    ┆ b    │
│ ---    ┆ ---  ┆ ---  │
│ u32    ┆ str  ┆ str  │
╞════════╪══════╪══════╡
│ 0      ┆ null ┆ null │
│ 1      ┆ 2    ┆ null │
│ 2      ┆ 3    ┆ z    │
└────────┴──────┴──────┘

# Test 1 Filter -> returns row 0
df.filter(pl.col(['a','b']).is_null())

shape: (1, 3)
┌────────┬──────┬──────┐
│ row_nr ┆ a    ┆ b    │
│ ---    ┆ ---  ┆ ---  │
│ u32    ┆ str  ┆ str  │
╞════════╪══════╪══════╡
│ 0      ┆ null ┆ null │
└────────┴──────┴──────┘

# Test 2 With Columns and When -> returns row 0 and row 1
df.with_columns(
    pl.when(pl.col(["a", "b"]).is_null()).then(True).otherwise(False) 
    )
shape: (3, 4)
┌────────┬──────┬──────┬─────────┐
│ row_nr ┆ a    ┆ b    ┆ literal │
│ ---    ┆ ---  ┆ ---  ┆ ---     │
│ u32    ┆ str  ┆ str  ┆ bool    │
╞════════╪══════╪══════╪═════════╡
│ 0      ┆ null ┆ null ┆ true    │
│ 1      ┆ 2    ┆ null ┆ true    │
│ 2      ┆ 3    ┆ z    ┆ false   │
└────────┴──────┴──────┴─────────┘

Let me know if there is anything I am missing

@ritchie46
Member

I think we should raise an error as the intent is clearly ambiguous.

@matterhorn103
Copy link

Sorry, I'm having trouble with this (in Python Polars 0.20.31).
Since this isn't a bug report but a help question, I figured it's wrong to open a new issue and better to jump on this thread?
Or is there a better way to ask for help with Polars?

From this issue and #7265 I understand that for the above example dataframe:

# Test data
df = pl.DataFrame([{'col1': '1', 'col2': 'a', 'col3': None},
 {'col1': None, 'col2': None, 'col3': None},
 {'col1': '3', 'col2': 'c', 'col3': 'z'}
 ])

shape: (3, 3)
┌──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 │
│ ---  ┆ ---  ┆ ---  │
│ str  ┆ str  ┆ str  │
╞══════╪══════╪══════╡
│ 1    ┆ a    ┆ null │
│ null ┆ null ┆ null │
│ 3    ┆ c    ┆ z    │
└──────┴──────┴──────┘

pl.col("*").is_null() expands to

(pl.col("col1").is_null(), pl.col("col2").is_null(), pl.col("col3").is_null())

as would (pl.col(['col1','col2','col3']).is_null())

and thus filtering based on that expression is ambiguous.

Firstly, I don't really understand why that is ambiguous? The docs for df.filter() say:

Provide multiple filters using *args syntax:

df.filter(
   pl.col("foo") <= 2,
   ~pl.col("ham").is_in(["b", "c"]),
)

and indeed providing the separate arguments works:

>>> df.filter(
    pl.col("col1").is_null(),
    pl.col("col2").is_null(),
    pl.col("col3").is_null()
)
shape: (1, 3)
┌──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 │
│ ---  ┆ ---  ┆ ---  │
│ str  ┆ str  ┆ str  │
╞══════╪══════╪══════╡
│ null ┆ null ┆ null │
└──────┴──────┴──────┘

So I don't understand why df.filter(pl.col("*").is_null()) raises this new error rather than just expanding to the second example?
Is the idea that it is ambiguous because the user might think they are selecting based on any of the columns containing null when in reality they are selecting based on all of them containing null, and it's better to raise an error?

I am also confused about how I should be using pl.all() and pl.any() to disambiguate this, as suggested by the error message:

polars.exceptions.ComputeError: The predicate passed to 'LazyFrame.filter' expanded to multiple expressions: 

        col("col1").is_null(),
        col("col2").is_null(),
        col("col3").is_null(),
This is ambiguous. Try to combine the predicates with the 'all' or `any' expression.

because when I try to do that I also get an error:

>>> df.filter(pl.any(pl.col("*").is_null()))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/matt/python/venv/rfb/lib64/python3.12/site-packages/polars/functions/aggregation/vertical.py", line 115, in any
    return F.col(*names).any(ignore_nulls=ignore_nulls)
           ^^^^^^^^^^^^^
  File "/home/matt/python/venv/rfb/lib64/python3.12/site-packages/polars/functions/col.py", line 292, in __new__
    return _create_col(name, *more_names)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/matt/python/venv/rfb/lib64/python3.12/site-packages/polars/functions/col.py", line 67, in _create_col
    raise TypeError(msg)
TypeError: invalid input for `col`

Expected `str` or `DataType`, got 'Expr'.

when I would have expected that to be exactly the right way to disambiguate based on the other error message.
And indeed, that is the way ChatGPT suggests doing it.

Equally, trying df.filter(pl.col("*").is_null().any()) returns

polars.exceptions.ComputeError: The predicate passed to 'LazyFrame.filter' expanded to multiple expressions: 

        col("col1").is_null().any(),
        col("col2").is_null().any(),
        col("col3").is_null().any(),
This is ambiguous. Try to combine the predicates with the 'all' or `any' expression.

and even when adding parentheses, df.filter((pl.col("*").is_null()).any()) gives exactly the same result, which I find extremely confusing (showing that I really just don't get what the different expressions are doing, I suppose).

Sadly, the API page for pl.any() has very little explanation (for a beginner like me) of what any() and all() actually do, and the single example is not very enlightening either, as it deals with a very different sort of case.
I also can't find anything in the user guide.

Sorry for my likely very stupid beginner mistakes.
If anyone can help even a bit I'd be very much obliged.

@cmdlineluser
Contributor

@matterhorn103

There is a Polars Discord for real-time help: https://discord.gg/4UfP5cfBE7 and also StackOverflow for help questions.

Things have changed quite a bit since this issue.

Dedicated horizontal functions were introduced: pl.any_horizontal() and pl.all_horizontal().

So instead of:

df.filter(pl.col("*").is_null().any())

It would be:

df.filter(pl.any_horizontal(pl.all().is_null()))

The filter *args syntax is new and did not exist at the time of this issue.

They are implicitly combined by & (or as if surrounded by pl.all_horizontal) - so you raise a valid point.

So I don't understand why df.filter(pl.col("*").is_null()) raises this new error rather than just expanding to the second example?
