Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable approx percentile tests #3770

Merged

Conversation

andygrove
Copy link
Contributor

@andygrove andygrove commented Oct 8, 2021

Depends on rapidsai/cudf#9403 and rapidsai/cudf#9537.

Closes #3703 and #3706.

@andygrove andygrove added the bug Something isn't working label Oct 8, 2021
@andygrove andygrove added this to the Oct 4 - Oct 15 milestone Oct 8, 2021
@andygrove andygrove self-assigned this Oct 8, 2021
@andygrove andygrove marked this pull request as draft October 8, 2021 01:06
Signed-off-by: Andy Grove <andygrove@nvidia.com>
Signed-off-by: Andy Grove <andygrove@nvidia.com>
@andygrove andygrove changed the title WIP: Enable approx percentile tests Enable approx percentile tests Oct 19, 2021
@andygrove andygrove marked this pull request as ready for review October 19, 2021 14:39
revans2
revans2 previously approved these changes Oct 22, 2021
Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few minor nits.

@andygrove
Copy link
Contributor Author

build

Copy link
Collaborator

@mythrocks mythrocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@andygrove
Copy link
Contributor Author

I am working with @mythrocks to track down the cause for the test failures and it looks like it may be due to a regression in cuDF.

@mythrocks
Copy link
Collaborator

I am working with @mythrocks to track down the cause for the test failures

It took a while to find. I should have a PR up for this fix shortly.

@andygrove andygrove changed the title Enable approx percentile tests WIP: Enable approx percentile tests Oct 27, 2021
@andygrove andygrove marked this pull request as draft October 27, 2021 16:02
mythrocks added a commit to mythrocks/cudf that referenced this pull request Oct 27, 2021
`segmented_gather()` currently assumes that null LIST rows also have
a `0` size (as defined by the difference of adjacent offsets.)

This might not hold, for example, for LIST columns that are members
of STRUCT columns whose parent null masks are superimposed on its
children. This would cause a non-empty list row to be marked null,
without compaction. This leads to errors in fetching elements of a
list row as seen in NVIDIA/spark-rapids/pull/3770.

This commit adds the handling of uncompacted LIST rows in
`segmented_gather()`.
@mythrocks
Copy link
Collaborator

I should have a PR up for this fix shortly.

Sorry for the delay. The fix is in rapidsai/cudf#9537. I've tested that test_hash_groupby_approx_percentile_double_scalar works with this fix.

$ ./run_pyspark_from_build.sh -k test_hash_groupby_approx_percentile_double_scalar
...
============================= test session starts ==============================
platform linux -- Python 3.8.12, pytest-6.2.5, py-1.10.0, pluggy-1.0.0 -- /home/mithunr/anaconda3/envs/cudf-dev-11.2-2/bin/python
cachedir: .pytest_cache
benchmark: 3.4.1 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase('/home/mithunr/workspace/dev/spark-plugin/01/integration_tests/target/run_dir/.hypothesis/examples')
rootdir: /home/mithunr/workspace/dev/spark-plugin/01/integration_tests, configfile: pytest.ini
plugins: xdist-2.3.0, benchmark-3.4.1, hypothesis-6.21.5, forked-1.3.0
collecting ... collected 11283 items / 11281 deselected / 2 selected

../../src/main/python/hash_aggregate_test.py::test_hash_groupby_approx_percentile_double_scalar[false] PASSED [ 50%]
../../src/main/python/hash_aggregate_test.py::test_hash_groupby_approx_percentile_double_scalar[true] PASSED [100%]
...
=============== 2 passed, 11281 deselected, 20 warnings in 8.10s ===============

rapids-bot bot pushed a commit to rapidsai/cudf that referenced this pull request Nov 4, 2021
`segmented_gather()` currently assumes that null LIST rows also have
a `0` size (as defined by the difference of adjacent offsets.)

This might not hold, for example, for LIST columns that are members
of STRUCT columns whose parent null masks are superimposed on its
children. This would cause a non-empty list row to be marked null,
without compaction. This leads to errors in fetching elements of a
list row as seen in NVIDIA/spark-rapids/pull/3770.

This commit adds the handling of uncompacted LIST rows in
`segmented_gather()`.

Authors:
  - MithunR (https://github.com/mythrocks)

Approvers:
  - Conor Hoekstra (https://github.com/codereport)
  - Nghia Truong (https://github.com/ttnghia)
  - David Wendt (https://github.com/davidwendt)

URL: #9537
@andygrove
Copy link
Contributor Author

build

@andygrove andygrove marked this pull request as ready for review November 5, 2021 23:11
@andygrove andygrove changed the title WIP: Enable approx percentile tests Enable approx percentile tests Nov 5, 2021
@andygrove andygrove merged commit e608ee7 into NVIDIA:branch-21.12 Nov 8, 2021
@andygrove andygrove deleted the enable-approx-percentile-tests branch November 8, 2021 14:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
5 participants