
Enable tests in udf_cudf_test.py #777

Merged
merged 4 commits into from
Sep 18, 2020

Conversation

firestarman
Collaborator

@firestarman firestarman commented Sep 16, 2020

This PR is to enable all the tests in file udf_cudf_test.py.

  1. Update the configs for the cuDF tests to enable the pool and turn down the
     pool size, to avoid OOM as much as possible.
  2. Increase the size of the test data frame to avoid IPC errors when running
     with the GPU columnar pipeline. Some tasks receive empty data when the data
     frame is too small, which triggers the IPC error.
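The first change can be sketched as a conf dict merged into the tests' Spark settings. Note that the conf keys and values below are illustrative assumptions, not necessarily the exact names the plugin uses:

```python
# Hypothetical sketch of the cuDF UDF test configuration described above.
# The conf keys and values are assumptions for illustration only.
_cudf_udf_conf = {
    # Enable a pooled GPU allocator for the Python workers...
    'spark.rapids.python.memory.gpu.pooling.enabled': 'true',
    # ...but keep the pool small, since the JVM and the Python processes
    # share one GPU and a large pool risks OOM.
    'spark.rapids.python.memory.gpu.allocFraction': '0.1',
}

def with_cudf_udf_conf(base_conf):
    """Return a copy of base_conf with the cuDF UDF settings applied."""
    merged = dict(base_conf)
    merged.update(_cudf_udf_conf)
    return merged
```

A test would then pass the merged conf to its GPU session helper, keeping the shared settings in one place.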

@firestarman firestarman linked an issue Sep 16, 2020 that may be closed by this pull request
1) Update the configs for the cuDF tests to enable the pool and turn down the
   pool size to avoid OOM as much as possible.
2) Increase the size of the test data frame to avoid IPC errors. Some tasks
   receive empty data when the data frame is too small, which triggers the IPC
   error.

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman
Collaborator Author

Force pushing to add the signoff

@revans2
Collaborator

revans2 commented Sep 16, 2020

Increase the size of the test data frame to avoid IPC errors. Some tasks
receive empty data when the data frame is too small, which triggers the IPC
error.

Does the IPC error happen with the CPU only version of Spark? If we have a reproducible use case then we should file something against Spark itself. If it only happens with our replacement code then we need to do some more digging to understand why this results in an error.

@revans2
Collaborator

revans2 commented Sep 16, 2020

build

@firestarman
Collaborator Author

firestarman commented Sep 16, 2020

Does the IPC error happen with the CPU only version of Spark? If we have a reproducible use case then we should file something against Spark itself. If it only happens with our replacement code then we need to do some more digging to understand why this results in an error.

This IPC error only happens in our columnar code path, per my current findings. Yes, I will try to dig more.

@revans2
Collaborator

revans2 commented Sep 16, 2020

This IPC error only happens in our columnar code path, per my current findings.

Then either an empty batch is somehow sneaking in, or we are launching things too early, before we know whether we will get any data. Thanks for looking into this.

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman
Collaborator Author

build

Since the shim layer for Spark 3.1.0 is not ready.

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman
Collaborator Author

build

@abellina
Collaborator

@firestarman do you know why it doesn't work in 3.1.0 yet? Just curious.

@firestarman
Collaborator Author

firestarman commented Sep 17, 2020

do you know why it doesn't work in 3.1.0 yet, just curious.

WindowExecBase becomes a trait in Spark 3.1.0, so we need to add a shim layer for our GPU version, but that has not been done yet.

@firestarman
Collaborator Author

build

Collaborator

@pxLi pxLi left a comment

LGTM if the integration tests pass locally.

@jlowe jlowe added build Related to CI / CD or cleanly building test Only impacts tests labels Sep 17, 2020
@firestarman firestarman merged commit f76ed9c into NVIDIA:branch-0.3 Sep 18, 2020
@firestarman firestarman deleted the enable-cudf-udf-test branch September 18, 2020 01:15
[(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
("id", "v")
)
elements = list(map(lambda i: (i, i/1.0), range(1, 5000)))
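The hunk above replaces a five-row literal with a generated list. Its intent can be sketched as follows; the row count of 5000 comes from the diff, while the helper name is made up for illustration:

```python
# Build the enlarged (id, v) data set from the diff above. With ~5000 rows,
# every task should receive at least one non-empty batch, which is what
# works around the IPC error described in this PR.
def make_elements(n=5000):
    return [(i, i / 1.0) for i in range(1, n)]

elements = make_elements()
# In the test this feeds spark.createDataFrame(elements, ("id", "v")).
```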
Collaborator

@firestarman What is the follow on issue to fix the underlying problem? This was a workaround to not expose the bug, but I would much rather see us xfail in that situation pointing at the bug for us to track and fix it.
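The xfail approach suggested here could look like the sketch below. The issue link follows the #750 reference in this thread; the mark and test names are made up:

```python
import pytest

# Instead of enlarging the data to hide the bug, mark the affected tests as
# expected failures pointing at the tracking issue, so the bug stays visible.
small_frame_ipc_bug = pytest.mark.xfail(
    reason='IPC error when the data frame is small, tracked by '
           'https://github.com/NVIDIA/spark-rapids/issues/750')

@small_frame_ipc_bug
def test_with_small_frame():
    ...
```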

Collaborator Author

This is tracked by #750

from spark_session import with_cpu_session, with_gpu_session
from marks import allow_non_gpu, cudf_udf

pytestmark = pytest.mark.skipif(LooseVersion(spark_version()) >= LooseVersion('3.1.0'),
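The truncated pytestmark line above, completed as a sketch. Here `spark_version()` stands in for the test utility of the same name imported by the real file, and the reason string is illustrative:

```python
from distutils.version import LooseVersion

import pytest

def spark_version():
    # Stand-in for the spark_session utility that returns the running
    # Spark version string.
    return '3.0.1'

# Skip the whole module on Spark 3.1.0+ until the GPU window shim exists.
pytestmark = pytest.mark.skipif(
    LooseVersion(spark_version()) >= LooseVersion('3.1.0'),
    reason='cuDF UDF tests need a shim for Spark 3.1.0+')
```

LooseVersion compares dotted version strings component by component, so '3.1.0' sorts after '3.0.1' as intended.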
Collaborator

Here too if we are skipping a test, or a test is failing we need to have an issue that this is pointing to so we don't lose track of this.

Collaborator Author

Here it is #844

@tgravescs
Collaborator

When I just run the full integration tests from mvn (or spark-submit), these tests fail for me. I'm going to revert this change. These tests need to run automatically with no special setup from the user, and should run both from mvn test (which runs all integration tests) and via the spark-submit command specifying runtest.py.

Error I got is:
E /home/tgraves/miniconda3/bin/python: Error while finding module specification for 'rapids.daemon' (ModuleNotFoundError: No module named 'rapids')
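The error means the Python worker cannot import the `rapids` package that provides `rapids.daemon`. A quick, environment-agnostic way to check for that situation (the helper name is made up):

```python
import importlib.util

def daemon_module_available(name='rapids.daemon'):
    """Return True if the given daemon module can be found on sys.path."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # find_spec raises this when the parent package (e.g. 'rapids')
        # itself is missing, which matches the error above.
        return False

# If this returns False, the fix is to make the package importable, e.g. by
# adding the plugin's python directory to PYTHONPATH (path illustrative):
#   export PYTHONPATH=/path/to/rapids-4-spark/python:$PYTHONPATH
```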

tgravescs added a commit that referenced this pull request Sep 18, 2020
@tgravescs
Collaborator

I would also like to understand how premerge passed with this. Is the premerge job doing extra setup for the tests to run and installing these somehow?

tgravescs added a commit that referenced this pull request Sep 18, 2020
This reverts commit f76ed9c.

Signed-off-by: Thomas Graves <tgraves@nvidia.com>
tgravescs added a commit that referenced this pull request Sep 21, 2020
This reverts commit f76ed9c.

Signed-off-by: Thomas Graves <tgraves@nvidia.com>
firestarman added a commit to firestarman/spark-rapids that referenced this pull request Sep 22, 2020
* Enable tests in udf_cudf_test.py

1) Update the configs for CUDF tests to enable pool and turn down the
   pool size to avoid OOM as much as possible.
2) Increase the size of the test data frame to avoid IPC errors. Some tasks
   will get empty data when the data frame is small enough, then the IPC
   error happens.

Signed-off-by: Firestarman <firestarmanllc@gmail.com>

* Skip cudf test for premerge

Signed-off-by: Firestarman <firestarmanllc@gmail.com>

* Skip cudf tests for Spark 3.1.0+ temporarily

Since shim layer for Spark 3.1.0 is not ready.

Signed-off-by: Firestarman <firestarmanllc@gmail.com>

* Use LooseVersion to compare the version instead

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
sperlingxx pushed a commit to sperlingxx/spark-rapids that referenced this pull request Nov 20, 2020
This reverts commit f76ed9c.

Signed-off-by: Thomas Graves <tgraves@nvidia.com>
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
* Enable tests in udf_cudf_test.py

1) Update the configs for CUDF tests to enable pool and turn down the
   pool size to avoid OOM as much as possible.
2) Increase the size of the test data frame to avoid IPC errors. Some tasks
   will get empty data when the data frame is small enough, then the IPC
   error happens.

Signed-off-by: Firestarman <firestarmanllc@gmail.com>

* Skip cudf test for premerge

Signed-off-by: Firestarman <firestarmanllc@gmail.com>

* Skip cudf tests for Spark 3.1.0+ temporarily

Since shim layer for Spark 3.1.0 is not ready.

Signed-off-by: Firestarman <firestarmanllc@gmail.com>

* Use LooseVersion to compare the version instead
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
This reverts commit f76ed9c.

Signed-off-by: Thomas Graves <tgraves@nvidia.com>
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this pull request Nov 30, 2023
…IDIA#777)

Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>

Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>
Successfully merging this pull request may close these issues.

[BUG] cudf_udf_test.py is flakey
7 participants