-
Notifications
You must be signed in to change notification settings - Fork 232
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix STRUCT comparison between Pandas and Spark dataframes in fastparquet tests #9418
Conversation
Signed-off-by: Chong Gao <res_life@163.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
Thanks for working on this, @res-life. I only had a couple of minor questions. |
(from_cpu, from_gpu) = result_converter_func_before_compare(from_cpu, from_gpu) | ||
# after the above conversion, the result is changed, | ||
# so should sort result locally before compare, add @ignore_order(local=True) to the test case | ||
assert should_sort_locally() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we can add some assert message e.g. assert should_sort_locally() "local sort needed since the result is changed"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# after the above conversion, the result is changed,
This is true for the fastparquet case here, but may be false for some cases. We'd better not enforce the local sort. Callers should add it when they know it is necessary.
(from_cpu, from_gpu) = result_converter_func_before_compare(from_cpu, from_gpu) | ||
# after the above conversion, the result is changed, | ||
# so should sort result locally before compare, add @ignore_order(local=True) to the test case | ||
assert should_sort_locally() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# after the above conversion, the result is changed,
This is true for the fastparquet case here, but may be false for some cases. We'd better not enforce the local sort. Callers should add it when they know it is necessary.
The assert will help developer fast find the failure reason. |
If users need the non-local sort, then it will conflict with the requirement of local sort. If local sort is always needed when this conversion func is specfied, we may need to forcely do the local sort instead of an assertion. And I am not sure, will the order of the output always be changed when giving a conversion func ? This is confusing me and i think it depends on the implementation of this func. |
New changes:
|
I tested, first sort on Spark, then convert(do not change order), then sort on local, and the order does not change for the result, so I removed the sort requirement. It's depending caller to sort the data or not. |
Looks good to me |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM except 1 last nit.
|
||
from pyspark.sql import Row | ||
|
||
def get_fastparquet_result_converter(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name may be misleading. It's not returning the convertor but actual the converted result.
Not sure how we plan to use this. Probably you was meaning get_fastparquet_canonicalized_result
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we plan to use this. The functions with "_" as the prefix are internal functions. This retuns a function and will be used like cpu,gpu = get_fastparquet_result_converter()(cpu, gpu)
. So will change the name to "get_fastparquet_result_canonicalizer".
build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Shall we merge this?
Test passed on my local Spark311,320,330,340,350, so I'll merge it. |
closes #9399
The fastparquet reading parquet result is flatted, and is not the same as CPU type result.
e.g.: fastparquet Row(a.b1 = 1, a.b2 = 2) vs CPU type Row(a = Row(b1 = 1, b2 = 2))
assert_gpu_and_cpu_are_equal_collect
to add a converter function parameter.fastparquet
#9366