Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add conditional xfail test for DISTINCT aggregates with NaN #261

Merged

Conversation

mythrocks
Copy link
Collaborator

@mythrocks mythrocks commented Jun 23, 2020

SPARK-32038 reports a regression in Apache Spark (3.0.0), in
failing to normalize NaN/Zero float values, during DISTINCT
aggregations. This causes a mismatch in results between
Apache Spark 3.0.0 on CPU, and the Rapids Accelerator (which
returns the right results).
SPARK-32038 was fixed in apache/spark#28876.

The following change introduces a conditional xfail test that passes
on Apache Spark 3.0.1 and 3.1+ (which fixes SPARK-32038),
but produces an expected failure on Spark 3.0.0.

Testing:

  1. On Apache Spark 3.0.0:
SPARK_HOME=~/workspace/dev/spark/3.0.0 ./run_pyspark_from_build.sh  -k hash -s

src/main/python/hash_aggregate_test.py::test_count_distinct_with_nan_floats[[('a', RepeatSeq), ('b', Float), ('c', Long)]][IGNORE_ORDER, APPROXIMATE_FLOAT] XFAIL [100%]
...
XFAIL src/main/python/hash_aggregate_test.py::test_count_distinct_with_nan_floats[[('a', RepeatSeq), ('b', Float), ('c', Long)]][IGNORE_ORDER, APPROXIMATE_FLOAT]
  [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate (https://github.com/apache/spark/pull/28876) Fixed in later Apache Spark releases.
  1. On Apache Spark 3.0.1:
SPARK_HOME=~/workspace/dev/spark/3.0.1 ./run_pyspark_from_build.sh  -k hash -s
...
src/main/python/hash_aggregate_test.py::test_count_distinct_with_nan_floats[[('a', RepeatSeq), ('b', Float), ('c', Long)]][IGNORE_ORDER, APPROXIMATE_FLOAT] PASSED [100%]

SPARK-32038 reports a regression in Apache Spark (3.0.0), in
failing to normalize NaN/Zero float values, during DISTINCT
aggregations. This causes a mismatch in results between
Apache Spark 3.0.0 on CPU, and the Rapids Accelerator (which
returns the right results).
SPARK-32038 was fixed in apache/spark#28876.

This commit introduces a conditional xfail test that passes
on Apache Spark 3.0.1 and 3.1+ (which fixes SPARK-32038),
but produces an expected failure on Spark 3.0.0.
@mythrocks mythrocks self-assigned this Jun 23, 2020
@mythrocks mythrocks added the test Only impacts tests label Jun 23, 2020
@mythrocks mythrocks added this to the Jun 22 - Jul 2 milestone Jun 23, 2020
@mythrocks
Copy link
Collaborator Author

build

@mythrocks
Copy link
Collaborator Author

For reference: xfail Documentation
The initial intention was to introduce an imperative xfail, based on the Apache Spark version string. But this would not execute any code after the xfail invocation.
Adding it to the @pytest.mark.xfail() lets the test run on Spark 3.0.0, and then excuses the failure. This way, we can look out for a future xpass, in case SPARK-32038 is ported back to Spark 3.0.0.

@pytest.mark.xfail(reason="count(distinct floats) fails when there are NaN values in the aggregation column."
"(https://github.com/NVIDIA/spark-rapids/issues/194)")
@pytest.mark.xfail(
condition=with_spark_session(lambda spark : spark.sparkContext.version == "3.0.0"),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great 👍

@mythrocks
Copy link
Collaborator Author

build

@mythrocks
Copy link
Collaborator Author

Hmm... I'm not sure why the status checks are blocked.

@revans2
Copy link
Collaborator

revans2 commented Jun 23, 2020

Hmm... I'm not sure why the status checks are blocked.

The build for branch-0.2 is not set up yet. Still waiting on the team to do it.

@mythrocks
Copy link
Collaborator Author

not sure why the status checks are blocked.
The build for branch-0.2 is not set up yet.

D'oh! Thanks, @revans2. Permission to merge this?

@revans2
Copy link
Collaborator

revans2 commented Jun 23, 2020

We have only been merging necessary doc changes. If it is still not up tomorrow then I will start manually testing/merging things.

@mythrocks
Copy link
Collaborator Author

We have only been merging necessary doc changes.

Gotcha. I mistakenly thought test code was acceptable. This patch will keep, so there isn't a rush. Thanks, @revans2.

@revans2
Copy link
Collaborator

revans2 commented Jun 24, 2020

build

1 similar comment
@revans2
Copy link
Collaborator

revans2 commented Jun 24, 2020

build

@revans2 revans2 merged commit fab7f9a into NVIDIA:branch-0.2 Jun 24, 2020
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
SPARK-32038 reports a regression in Apache Spark (3.0.0), in
failing to normalize NaN/Zero float values, during DISTINCT
aggregations. This causes a mismatch in results between
Apache Spark 3.0.0 on CPU, and the Rapids Accelerator (which
returns the right results).
SPARK-32038 was fixed in apache/spark#28876.

This commit introduces a conditional xfail test that passes
on Apache Spark 3.0.1 and 3.1+ (which fixes SPARK-32038),
but produces an expected failure on Spark 3.0.0.
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
SPARK-32038 reports a regression in Apache Spark (3.0.0), in
failing to normalize NaN/Zero float values, during DISTINCT
aggregations. This causes a mismatch in results between
Apache Spark 3.0.0 on CPU, and the Rapids Accelerator (which
returns the right results).
SPARK-32038 was fixed in apache/spark#28876.

This commit introduces a conditional xfail test that passes
on Apache Spark 3.0.1 and 3.1+ (which fixes SPARK-32038),
but produces an expected failure on Spark 3.0.0.
pxLi pushed a commit to pxLi/spark-rapids that referenced this pull request May 12, 2022
* Cifar10 scaffold (#127)

* SCAFFOLD Integration

scaffold learner

scaffold configs

initialize the scaffold terms on the client

update README to include SCAFFOLD description

add scaffold experiment

update urls

add LICENSE from NIID-Bench

update README and plots to reflect SCAFFOLD experiment

refactor scaffold computation steps into their own member functions

remove unused import

refactor to use ScaffoldLearner class

Scaffold learner depends on PyTorch. Renamed as such

use two standard aggregators, inherit SCAFFOLD workflow from standard ScatterAndGather

remove custom aggregator

use update_shareable

use multi-class inheritance for scaffold learner

use new aggregator that supports several DXOs

fix updating call for global model

simplify dxo/shareable handling

use built-in PTFileModelLocator

fix formatting

add PT Formatter

use built-in validation json generator; remove custom formatter

remove custom Json validation generator

update license

restore run_secure.sh

restore project yml

restore main branch learner executor

add SCAFFOLD link to example readme

remove custom validation json generator

print model owner info during validation

remove custom formatter

remove special handling of validation dxo. Not needed anymore

use zeros_like() to initialize scaffold terms

* use scaffold helper class

* formatting

* make scaffold function args consistent

* Add versioneer so nvflare.__version__ has meaningful value (#215)

* Add versioneer so nvflare.__version__ has meaningful value

* Fix coding style and license for versioneer

* Add information about cross site validation, commands, and hostname to documentation (#206)

* Add information about cross site validation to documentation

* More minor documentation updates

* Fix link

* Add persistence to admin cli (#210)

* Add persistence to admin cli

* save to ~/.nvflare

* 211 Update hello-monai example (#212)

* update hello-monai

Signed-off-by: Yiheng Wang <vennw@nvidia.com>

* decrease aggregation_epochs

Signed-off-by: Yiheng Wang <vennw@nvidia.com>

* Initial implementation of Ditto algorithm (#144)

Refactor ditto with multi inheritance

Refactor ditto with multi inheritance

Update README.md

move learners to app_common with other minor updates

update license section

update for unit test

update for unit test

update for unit test

remove redundant init parameters

Update supervised_ditto_learner.py

Update README.md

Update README.md

Update README.md

Update README.md

Update supervised_ditto_learner.py

move learners back to custom folder, will further discuss and make the move in separate PR

* Initial implementation for HA.

* Added FL client HA support.

* Enhanced the Client HA.

* HA function working for server & client.

* Working with SnapshotFilePersistor config.

* Made the overseer_agent configurable.

* Made the admin support HA.

* Adjust retry.

* Adjust admin prompt.

* Added storage_state_persistor.py

* Integrated with new overseer-agent.

* Changed to use admin Overseer-agent.

* Updated overseer_spec.py

* Fixed end line.

* Added storage and used the dev-2.1 cli.py.

* Used dev-2.1 cli.py.

Co-authored-by: Holger Roth <6304754+holgerroth@users.noreply.github.com>
Co-authored-by: Isaac Yang <isaacy@nvidia.com>
Co-authored-by: nvkevlu <55759229+nvkevlu@users.noreply.github.com>
Co-authored-by: Sean Yang <seany314@gmail.com>
Co-authored-by: Yiheng Wang <68361391+yiheng-wang-nv@users.noreply.github.com>
Co-authored-by: Ziyue Xu <71786575+ZiyueXu77@users.noreply.github.com>
@mythrocks mythrocks deleted the xfail-distinct-aggregates-spark-28876 branch February 27, 2023 18:00
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this pull request Nov 30, 2023
* moving rapids cmake include to pull from rapids version

* moving rapids cmake include to pull from rapids version

Signed-off-by: Mike Wilson <knobby@burntsheep.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
test Only impacts tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] count(distinct float_col) produces different results from CPU, for Float columns with NaNs
3 participants