Column equality testing fixes #10011

brandon-b-miller · 2022-01-11T03:56:42Z

Fixes a bug where empty columns did not compare correctly, along with a few edge cases involving strings.

Partially addresses #8513

codecov · 2022-01-11T05:24:49Z

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.04@8d2a9cc).
The diff coverage is n/a.

❗ Current head ed7c630 differs from pull request most recent head a44b97f. Consider uploading reports for the commit a44b97f to get more accurate results.

@@               Coverage Diff               @@
##             branch-22.04   #10011   +/-   ##
===============================================
  Coverage                ?   10.46%           
===============================================
  Files                   ?      122           
  Lines                   ?    20523           
  Branches                ?        0           
===============================================
  Hits                    ?     2147           
  Misses                  ?    18376           
  Partials                ?        0


isVoid

Maybe we should add some tests in test_testing.py to cover these changes.

isVoid · 2022-01-14T18:53:29Z

python/cudf/cudf/testing/testing.py

+                            )
+                        )
+                    )
+                    and cp.allclose(


FYI, cupy.allclose has a parameter equal_nan that may simplify the is_nan check above.

I didn't see a way of using equal_nan but I agree that the logic was a little hard to follow in general, so I redid it here. Let me know if you think this is better.
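For reference, a minimal sketch of what the `equal_nan` parameter does. It uses NumPy as a CPU stand-in so it runs without a GPU; `cupy.allclose` documents the same `rtol`, `atol`, and `equal_nan` parameters. The arrays are made up for illustration and are not taken from the PR.

```python
import numpy as np

# Illustrative data only: identical NaN positions, tiny float difference.
left = np.array([1.0, np.nan, 3.0])
right = np.array([1.0, np.nan, 3.0 + 1e-9])

# By default NaN never compares equal to NaN, so the arrays "differ".
strict = np.allclose(left, right)

# With equal_nan=True, NaNs in matching positions are treated as equal.
lenient = np.allclose(left, right, equal_nan=True)

print(strict, lenient)  # False True
```

Note that `equal_nan` only covers NaN values in the data; it says nothing about null masks, which may be why it did not drop in directly here.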

isVoid · 2022-01-14T18:54:56Z

python/cudf/cudf/testing/testing.py

+    elif not (
+        (is_string_dtype(left) and is_numeric_dtype(right))
+        or (is_numeric_dtype(left) and is_string_dtype(right))
+    ):


I'm not sure I follow these checks. Where is the handler if the input falls into these dtypes?

Yeah, this seems odd. Are we just generally looking to avoid checking this for mismatched dtypes?

If one column is string and the other is numeric, we can assume the columns are not equal. (1 != '1'). These lines check to make sure exactly one of the columns is string and the other is numeric, in which case we avoid the entire try/except block and therefore avoid any opportunity for columns_equal to be set to True. We should end up on line 243 from there.
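A rough sketch of the condition described above, assuming the pandas versions of `is_string_dtype` and `is_numeric_dtype` as stand-ins for whatever the cuDF module imports. The helper name `is_string_numeric_mismatch` is hypothetical and does not appear in the diff.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype, is_string_dtype


def is_string_numeric_mismatch(left, right):
    """True when exactly one side is string and the other is numeric."""
    return (is_string_dtype(left) and is_numeric_dtype(right)) or (
        is_numeric_dtype(left) and is_string_dtype(right)
    )


strings = pd.Series(["1", "2"])
numbers = pd.Series([1, 2])

# A string column can never equal a numeric one (1 != '1'), so the
# value comparison can be skipped and the columns reported as unequal.
print(is_string_numeric_mismatch(strings, numbers))  # True
print(is_string_numeric_mismatch(numbers, numbers))  # False
```

In the diff the condition is negated (`elif not (...)`), so the try/except comparison only runs when this helper would return `False`.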

Is this only an issue for string types, or do we need to worry about other types as well? I see categoricals are handled above, what about list/struct dtypes?

I looked into what happens here and it would appear that when list or struct are involved we fall into the left.equals(right) check at the very end and end up with a TypeError, except in the case that we're comparing list to list or struct to struct. That should probably not happen.

I see the point I think you are making though: there's a certain set of dtypes (beyond just string) where, even if check_dtype=False, we know up front that it's a 100% mismatch between non-null elements simply because a struct can't compare as equal to a "not struct". I think these dtypes are:

- String
- List
- Struct
- Interval
- Decimal

I suppose the only edge case here would be that we should arguably return True even for that subset of dtypes if the columns are fully null.

If this seems like the correct logic I am happy to go back and wire it up as such here.

Yep, that all sounds good. For the fully null case, I think whether or not we return True should be determined by check_dtype, but if someone has a different opinion I'm open to a different result.
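One possible reading of the rule agreed on in this thread, written as a plain-Python sketch. The names `INCOMPARABLE_FAMILIES` and `known_result`, and the idea of passing dtype "families" as strings, are hypothetical simplifications for illustration; this is not the code that was merged.

```python
# Hypothetical dtype families that cannot compare equal to any other family.
INCOMPARABLE_FAMILIES = {"string", "list", "struct", "interval", "decimal"}


def known_result(left_family, right_family, left_all_null, right_all_null,
                 check_dtype):
    """Return True/False when the outcome is known up front, else None."""
    if left_family == right_family:
        return None  # same family: fall through to element-wise comparison
    if not ({left_family, right_family} & INCOMPARABLE_FAMILIES):
        return None  # e.g. int vs float: values may still compare equal
    if left_all_null and right_all_null:
        # Fully null columns: equality hinges only on whether dtypes matter.
        return not check_dtype
    return False  # a non-null struct can never equal a non-struct


print(known_result("struct", "numeric", False, False, check_dtype=False))  # False
print(known_result("struct", "numeric", True, True, check_dtype=False))    # True
print(known_result("struct", "numeric", True, True, check_dtype=True))     # False
```

Returning `None` for the undecided cases keeps the early exit separate from the element-wise comparison that follows it.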

isVoid · 2022-01-14T19:02:58Z

python/cudf/cudf/testing/testing.py

    if not columns_equal:
-        msg1 = f"{left.values_host}"
-        msg2 = f"{right.values_host}"
+        ldata = [val for val in left.to_pandas(nullable=True)]


Side question: how reliable is nullable=True, given that pandas support for the nullable float dtype is still non-public?

It seems like they're being fairly forward with these as of 1.2.0.
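To illustrate what the nullable dtypes change in the error message data, here is a pandas-only sketch (no cuDF or GPU required). It assumes pandas 1.2.0 or later, where the `Float64` extension dtype is available; the values are invented for the example.

```python
import math

import pandas as pd

# Nullable extension dtype: missing values are kept as pd.NA.
nullable = pd.Series([1.5, None, 3.0], dtype="Float64")
ldata = list(nullable)

# Plain NumPy-backed float64: missing values collapse to NaN.
plain = pd.Series([1.5, None, 3.0])
pdata = list(plain)

print(ldata[1] is pd.NA)      # True
print(math.isnan(pdata[1]))   # True
```

With the nullable form, a null stays distinguishable from a genuine NaN when the values are listed in an assertion message.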

brandon-b-miller · 2022-02-01T21:01:49Z

Does everything look OK here? cc @vyasr @isVoid

vyasr

Some suggestions for improvement here and one question.


vyasr

One last suggestion, otherwise good on my end

isVoid

It's all good. Just minor stuff below.


brandon-b-miller · 2022-02-08T01:06:39Z

@gpucibot merge

brandon-b-miller added 2 commits January 10, 2022 18:49

special case column assert

a77f861

updates

812577b

github-actions bot added the label Python (Affects Python cuDF API.) Jan 11, 2022

brandon-b-miller added 2 commits January 11, 2022 07:21

Merge branch 'branch-22.02' into fix-assert-frame-eq-dtype-kwarg

4562f75

fix other issue

5517835

brandon-b-miller added the labels bug (Something isn't working), non-breaking (Non-breaking change), and 3 - Ready for Review (Ready for review by team) Jan 11, 2022

brandon-b-miller marked this pull request as ready for review January 11, 2022 16:46

brandon-b-miller requested a review from a team as a code owner January 11, 2022 16:46

brandon-b-miller requested review from isVoid and marlenezw January 11, 2022 16:46

isVoid reviewed Jan 14, 2022

shwina changed the base branch from branch-22.02 to branch-22.04 January 20, 2022 21:24

brandon-b-miller added 2 commits January 24, 2022 17:56

Merge branch 'branch-22.02' into fix-assert-frame-eq-dtype-kwarg

f088adf

reorganize logic

042a4ed

brandon-b-miller requested review from vyasr and isVoid January 27, 2022 05:07

Merge branch 'branch-22.04' into fix-assert-frame-eq-dtype-kwarg

7c1b40d

vyasr requested changes Feb 2, 2022

address reviews, add tests

2644898

vyasr reviewed Feb 4, 2022

vyasr approved these changes Feb 4, 2022

isVoid approved these changes Feb 5, 2022

brandon-b-miller and others added 2 commits February 7, 2022 15:11

Update python/cudf/cudf/testing/testing.py

599253d

Co-authored-by: Michael Wang <isVoid@users.noreply.github.com>

minor update

a44b97f

brandon-b-miller added the label 5 - Ready to Merge (Testing and reviews complete, ready to merge) and removed the label 3 - Ready for Review (Ready for review by team) Feb 8, 2022

rapids-bot bot merged commit acb6aed into rapidsai:branch-22.04 Feb 8, 2022

This was referenced Aug 1, 2022

[BUG] cudf.testing.assert_series_equal not corrrectly testing for NaNs and assigning cp.nan to Series the same. #8513

Closed

[BUG] cudf.testing.assert_frame_equal not using rtol or atol correctly for floats #8518

Closed

AntiKnot mentioned this pull request Aug 23, 2024

[BUG] cudf.testing.assert_*_equal raises AssertionError for equivalent DecimalDtyped objects #16635

Open
