Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix STRUCT comparison between Pandas and Spark dataframes in fastparquet tests #9418

Merged
merged 8 commits into from
Oct 26, 2023

Conversation

res-life
Copy link
Collaborator

@res-life res-life commented Oct 11, 2023

closes #9399

  • Add converter from fastparquet result to cpu type result
    The fastparquet reading parquet result is flatted, and is not the same as CPU type result.
    e.g.: fastparquet Row(a.b1 = 1, a.b2 = 2) vs CPU type Row(a = Row(b1 = 1, b2 = 2))
  • Modify assert_gpu_and_cpu_are_equal_collect to add a converter function parameter.
  • update the test case in Add tests to check compatibility with fastparquet #9366

Signed-off-by: Chong Gao <res_life@163.com>
Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

@mythrocks
Copy link
Collaborator

Thanks for working on this, @res-life. I only had a couple of minor questions.

integration_tests/src/main/python/asserts.py Outdated Show resolved Hide resolved
(from_cpu, from_gpu) = result_converter_func_before_compare(from_cpu, from_gpu)
# after the above conversion, the result is changed,
# so should sort result locally before compare, add @ignore_order(local=True) to the test case
assert should_sort_locally()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we can add some assert message e.g. assert should_sort_locally() "local sort needed since the result is changed"

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

# after the above conversion, the result is changed,

This is true for the fastparquet case here, but may be false for some cases. We'd better not enforce the local sort. Callers should add it when they know it is necessary.

integration_tests/src/main/python/fastparquet_utils.py Outdated Show resolved Hide resolved
integration_tests/src/main/python/fastparquet_utils.py Outdated Show resolved Hide resolved
integration_tests/src/main/python/asserts.py Outdated Show resolved Hide resolved
(from_cpu, from_gpu) = result_converter_func_before_compare(from_cpu, from_gpu)
# after the above conversion, the result is changed,
# so should sort result locally before compare, add @ignore_order(local=True) to the test case
assert should_sort_locally()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

# after the above conversion, the result is changed,

This is true for the fastparquet case here, but may be false for some cases. We'd better not enforce the local sort. Callers should add it when they know it is necessary.

integration_tests/src/main/python/asserts.py Outdated Show resolved Hide resolved
integration_tests/src/main/python/fastparquet_utils.py Outdated Show resolved Hide resolved
@res-life
Copy link
Collaborator Author

This is true for the fastparquet case here, but may be false for some cases. We'd better not enforce the local sort. Callers should add it when they know it is necessary.

The assert will help developer fast find the failure reason.
If developer use @ignore_order(local=False) and passed in the result convert func, the result will be change, the previous sort no longer take effect, so need a new sort.

@firestarman
Copy link
Collaborator

firestarman commented Oct 12, 2023

This is true for the fastparquet case here, but may be false for some cases. We'd better not enforce the local sort. Callers should add it when they know it is necessary.

The assert will help developer fast find the failure reason. If developer use @ignore_order(local=False) and passed in the result convert func, the result will be change, the previous sort no longer take effect, so need a new sort.

If users need the non-local sort, then it will conflict with the requirement of local sort. If local sort is always needed when this conversion func is specfied, we may need to forcely do the local sort instead of an assertion. And I am not sure, will the order of the output always be changed when giving a conversion func ? This is confusing me and i think it depends on the implementation of this func.

@res-life
Copy link
Collaborator Author

res-life commented Oct 12, 2023

New changes:

  • Now flatten the GPU result like fastparquet does to compare, in this way it's more simple and reliable.

@res-life
Copy link
Collaborator Author

If users need the non-local sort, then it will conflict with the requirement of local sort. If local sort is always needed when this conversion func is specfied, we may need to forcely do the local sort instead of an assertion. And I am not sure, will the order of the output always be changed when giving a conversion func ? This is confusing me and i think it depends on the implementation of this func.

I tested, first sort on Spark, then convert(do not change order), then sort on local, and the order does not change for the result, so I removed the sort requirement. It's depending caller to sort the data or not.

mythrocks
mythrocks previously approved these changes Oct 12, 2023
@firestarman
Copy link
Collaborator

Looks good to me

Copy link
Collaborator

@winningsix winningsix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except 1 last nit.


from pyspark.sql import Row

def get_fastparquet_result_converter():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name may be misleading. It's not returning the convertor but actual the converted result.

Not sure how we plan to use this. Probably you was meaning get_fastparquet_canonicalized_result.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we plan to use this. The functions with "_" as the prefix are internal functions. This retuns a function and will be used like cpu,gpu = get_fastparquet_result_converter()(cpu, gpu). So will change the name to "get_fastparquet_result_canonicalizer".

winningsix
winningsix previously approved these changes Oct 15, 2023
@sameerz sameerz added the test Only impacts tests label Oct 16, 2023
@res-life
Copy link
Collaborator Author

build

@res-life res-life changed the title [WIP] Fix STRUCT comparison between Pandas and Spark dataframes in fastparquet tests Fix STRUCT comparison between Pandas and Spark dataframes in fastparquet tests Oct 25, 2023
@res-life res-life marked this pull request as ready for review October 25, 2023 07:04
Copy link
Collaborator

@mythrocks mythrocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Shall we merge this?

@res-life
Copy link
Collaborator Author

Test passed on my local Spark311,320,330,340,350, so I'll merge it.
Thanks all for your review.

@res-life res-life merged commit e815dcf into NVIDIA:branch-23.12 Oct 26, 2023
29 of 30 checks passed
@res-life res-life deleted the fastparquet-converter branch October 26, 2023 01:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
test Only impacts tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Fix STRUCT comparison between Pandas and Spark dataframes in fastparquet tests
6 participants