[BUG] Test failures for 0.2 when run with multiple executors #812

sameerz · 2020-09-18T15:20:48Z

There are 15 test failures on Dataproc-preview2 on Ubuntu 18 when running our integration tests. The following tests fail:

cache_test.py (1 failure)
join_test.py (1 failure)
window_function_test.py (1 failure)
qa_nightly_select_test.py (12 failures)

I ran the integration tests twice. The same first three tests failed both times, but a different set of qa_nightly_select_test.py tests failed on each run. I am attaching my cluster creation script, a log of the tests, and the spark-default.conf.
dataproc-integration-tests.tar.gz

revans2 · 2020-09-18T16:37:04Z

I reran myself and ended up with 19 failures. I'll start trying to debug them.

FAILED ../../src/main/python/cache_test.py::test_passing_gpuExpr_as_Expr - As...
FAILED ../../src/main/python/join_test.py::test_join_bucketed_table[false][IGNORE_ORDER, ALLOW_NON_GPU(DataWritingCommandExec)]
FAILED ../../src/main/python/qa_nightly_select_test.py::test_needs_sort_select[SUM(byteF) OVER (PARTITION BY byteF ORDER BY CAST(dateF AS TIMESTAMP) RANGE BETWEEN INTERVAL 1 DAYS PRECEDING AND INTERVAL 1 DAYS FOLLOWING ) as sum_total][IGNORE_ORDER, INCOMPAT, APPROXIMATE_FLOAT]
FAILED ../../src/main/python/qa_nightly_select_test.py::test_select_first_last[FIRST(byteF) GROUP BY intF][IGNORE_ORDER({'local': True}), INCOMPAT, APPROXIMATE_FLOAT]
FAILED ../../src/main/python/qa_nightly_select_test.py::test_select_first_last[FIRST(intF) GROUP BY byteF][IGNORE_ORDER({'local': True}), INCOMPAT, APPROXIMATE_FLOAT]
FAILED ../../src/main/python/qa_nightly_select_test.py::test_select_first_last[FIRST(longF) GROUP BY intF][IGNORE_ORDER({'local': True}), INCOMPAT, APPROXIMATE_FLOAT]
FAILED ../../src/main/python/qa_nightly_select_test.py::test_select_first_last[FIRST(floatF) GROUP BY intF][IGNORE_ORDER({'local': True}), INCOMPAT, APPROXIMATE_FLOAT]
FAILED ../../src/main/python/qa_nightly_select_test.py::test_select_first_last[FIRST(doubleF) GROUP BY intF][IGNORE_ORDER({'local': True}), INCOMPAT, APPROXIMATE_FLOAT]
FAILED ../../src/main/python/qa_nightly_select_test.py::test_select_first_last[FIRST(strF) GROUP BY intF][IGNORE_ORDER({'local': True}), INCOMPAT, APPROXIMATE_FLOAT]
FAILED ../../src/main/python/qa_nightly_select_test.py::test_select_first_last[FIRST(byteF) GROUP BY intF, shortF][IGNORE_ORDER({'local': True}), INCOMPAT, APPROXIMATE_FLOAT]
FAILED ../../src/main/python/qa_nightly_select_test.py::test_select_first_last[FIRST(shortF) GROUP BY intF, byteF][IGNORE_ORDER({'local': True}), INCOMPAT, APPROXIMATE_FLOAT]
FAILED ../../src/main/python/qa_nightly_select_test.py::test_select_first_last[LAST(byteF) GROUP BY intF][IGNORE_ORDER({'local': True}), INCOMPAT, APPROXIMATE_FLOAT]
FAILED ../../src/main/python/qa_nightly_select_test.py::test_select_first_last[LAST(intF) GROUP BY byteF][IGNORE_ORDER({'local': True}), INCOMPAT, APPROXIMATE_FLOAT]
FAILED ../../src/main/python/qa_nightly_select_test.py::test_select_first_last[LAST(floatF) GROUP BY intF][IGNORE_ORDER({'local': True}), INCOMPAT, APPROXIMATE_FLOAT]
FAILED ../../src/main/python/qa_nightly_select_test.py::test_select_first_last[LAST(doubleF) GROUP BY intF][IGNORE_ORDER({'local': True}), INCOMPAT, APPROXIMATE_FLOAT]
FAILED ../../src/main/python/qa_nightly_select_test.py::test_select_first_last[LAST(strF) GROUP BY intF][IGNORE_ORDER({'local': True}), INCOMPAT, APPROXIMATE_FLOAT]
FAILED ../../src/main/python/qa_nightly_select_test.py::test_select_first_last[byteF, SUM(byteF) OVER (PARTITION BY shortF ORDER BY intF ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING ) as res][IGNORE_ORDER({'local': True}), INCOMPAT, APPROXIMATE_FLOAT]
FAILED ../../src/main/python/qa_nightly_select_test.py::test_select_first_last[SUM(intF) OVER (PARTITION BY byteF ORDER BY byteF ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING ) as res][IGNORE_ORDER({'local': True}), INCOMPAT, APPROXIMATE_FLOAT]
FAILED ../../src/main/python/window_function_test.py::test_window_aggs_for_ranges[[('a', RepeatSeq), ('b', Date), ('c', Integer)]1][IGNORE_ORDER]

revans2 · 2020-09-18T16:53:41Z

All of the first/last test failures were fixed when I configured it to use a single executor. It looks like those are issues with the tests when they try to run too wide. I'll see whether the others are similar.

revans2 · 2020-09-18T16:54:53Z

test_passing_gpuExpr_as_Expr is also related to that same issue.

revans2 · 2020-09-18T16:56:05Z

test_window_aggs_for_ranges also passed with a single executor.

revans2 · 2020-09-18T16:57:55Z

test_join_bucketed_table is still failing, but I suspect that it is caused by running with an older version of the plugin. I'll try and clean things up on all of the nodes so we are running with a newer version of the plugin everywhere.

revans2 · 2020-09-18T17:16:30Z

Yup that was it. The jar on the nodes was from a few days ago and it didn't have the bucketing fix in it. Once I replaced the plugin jar with the one that has the fix the test passes. I will rerun all of the tests again with a single executor just to be sure. After that I'll turn this into an issue around the tests because they should either be updated so that they can pass with multiple executors or they should have a way to detect that they are being run incorrectly and skip themselves.
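
A minimal sketch of the second option, where a test detects that it is being run on an unsuitable cluster and skips itself. The helper name `runs_on_single_executor` and the choice of settings to inspect (`spark.master`, `spark.executor.instances`, `spark.dynamicAllocation.enabled`) are assumptions for illustration, not what the integration tests actually do. The function takes a plain dict so it can be checked without a cluster.

```python
def runs_on_single_executor(conf):
    """Hypothetical helper: guess from Spark settings whether only one executor is used.

    `conf` is a plain dict of Spark settings, for example built with
    dict(spark.sparkContext.getConf().getAll()) in PySpark.
    """
    master = conf.get("spark.master", "")
    if master == "local" or master.startswith("local["):
        # Plain local mode runs in a single JVM; "local-cluster[...]" does not match here.
        return True
    if conf.get("spark.dynamicAllocation.enabled", "false").lower() == "true":
        # The executor count can change at runtime, so assume more than one.
        return False
    # A missing setting is treated as unknown and conservatively read as multi-executor.
    return conf.get("spark.executor.instances") == "1"


print(runs_on_single_executor({"spark.master": "yarn", "spark.executor.instances": "4"}))
```

In a pytest suite, the result could feed a `pytest.mark.skipif(...)` condition on the order-sensitive tests, so they skip on a wide cluster instead of failing.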

revans2 · 2020-09-18T17:54:03Z

Moved this to 0.3 as these are test issues, not correctness issues with the plugin.

revans2 · 2020-10-12T15:09:59Z

I have found that one of the issues we are running into is that Spark's sort is a stable sort, but ours is not. I need to add Java bindings for a stable sort and update the plugin.
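
To illustrate why FIRST/LAST results are sensitive to how ties are ordered, here is a small sketch in plain Python (not Spark or the plugin; the function and data are made up for the example). When the sort key does not distinguish two rows in a group, the "first" row is whichever one the sort happens to leave first. A stable sort keeps arrival order for ties, an unstable sort may not, and arrival order itself can change between runs. Sorting on all columns removes the ties.

```python
from itertools import groupby


def first_per_group(rows, sort_key):
    """Sort rows with sort_key, then take the first value seen for each group key."""
    ordered = sorted(rows, key=sort_key)  # Python's sorted() is stable
    return {k: next(g)[1] for k, g in groupby(ordered, key=lambda r: r[0])}


# The same rows arriving in two different orders, as if partitions were read differently.
run_a = [(1, "x"), (2, "p"), (1, "y")]
run_b = [(1, "y"), (2, "p"), (1, "x")]


def by_key_only(row):
    return row[0]  # ties within a group keep arrival order


def by_all_columns(row):
    return row  # total order, no ties left


print(first_per_group(run_a, by_key_only), first_per_group(run_b, by_key_only))
print(first_per_group(run_a, by_all_columns), first_per_group(run_b, by_all_columns))
```

With the key-only sort the two runs disagree on the first value for group 1. With the all-columns sort they agree.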

revans2 · 2020-10-12T15:13:43Z

Actually, I will file a follow-on issue for that. I think the issue is more related to unnecessary sorts in some of the aggregates. I'll see if I can fix it that way first.

sameerz added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels Sep 18, 2020

revans2 self-assigned this Sep 18, 2020

revans2 added the test (Only impacts tests) label and removed the ? - Needs Triage (Need team to review and classify) label Sep 18, 2020

revans2 added the P1 (Nice to have for release) label Sep 18, 2020

revans2 removed their assignment Sep 18, 2020

tgravescs mentioned this issue Sep 22, 2020

[BUG] test_window_aggs_for_ranges intermittently fails #825

Closed

tgravescs changed the title ~~[BUG] Test failures for 0.2 on Dataproc~~ [BUG] Test failures for 0.2 when run with multiple executors Oct 8, 2020

revans2 self-assigned this Oct 9, 2020

sameerz added this to the Oct 12 - Oct 23 milestone Oct 9, 2020

This was referenced Oct 13, 2020

Update first/last tests to avoid non-determinisim and ordering differences #933

Merged

Fix some issues with non-determinism #944

Merged

[BUG] Some TPC-DS and TPC-H integration tests fail when run on multiple excutors #943

Closed

revans2 closed this as completed in #944 Oct 14, 2020

tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023

Fixing subnormal values for string to float casting kernel (NVIDIA#812)

f632c15

* Fixing subnormal values Signed-off-by: Mike Wilson <knobby@burntsheep.com>
