
[BUG] window_function_test.py::test_multi_types_window_aggs_for_rows_lead_lag[partBy failed #1811

Closed
NvTimLiu opened this issue Feb 25, 2021 · 7 comments
Labels: bug (Something isn't working)


NvTimLiu commented Feb 25, 2021

Describe the bug
CI-Nightly-ID84

11:08:18  FAILED ../../src/main/python/window_function_test.py::test_multi_types_window_aggs_for_rows[partBy:Boolean-orderBy:Boolean-String][IGNORE_ORDER, APPROXIMATE_FLOAT]
11:08:18  FAILED ../../src/main/python/window_function_test.py::test_multi_types_window_aggs_for_rows[partBy:Timestamp-orderBy:Long-String][IGNORE_ORDER, APPROXIMATE_FLOAT]

https://blossom.nvidia.com/sw-gpu-spark-jenkins/job/rapids_databricks301_nightly-pre_release-github/85/console
12:41:32  FAILED ../../src/main/python/window_function_test.py::test_multi_types_window_aggs_for_rows_lead_lag[partBy:Double-orderBy:Timestamp-Double][IGNORE_ORDER, APPROXIMATE_FLOAT]

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../src/main/python/asserts.py:338: in assert_gpu_and_cpu_are_equal_collect
    _assert_gpu_and_cpu_are_equal(func, True, conf=conf)
../../src/main/python/asserts.py:317: in _assert_gpu_and_cpu_are_equal
    from_cpu = with_cpu_session(bring_back, conf=conf)
../../src/main/python/spark_session.py:76: in with_cpu_session
    return with_spark_session(func, conf=copy)
../../src/main/python/spark_session.py:68: in with_spark_session
    ret = func(_spark)
../../src/main/python/asserts.py:178: in <lambda>
    bring_back = lambda spark: limit_func(spark).collect()
/databricks/spark/python/pyspark/sql/dataframe.py:612: in collect
    return list(_load_from_socket(sock_info, BatchedSerializer(PickleSerializer())))
/databricks/spark/python/pyspark/rdd.py:168: in _load_from_socket
    sockfile = _create_local_socket(sock_info)
/databricks/spark/python/pyspark/rdd.py:153: in _create_local_socket
    sockfile, sock = local_connect_and_auth(port, auth_secret)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

port = 41543
auth_secret = '****'

    def local_connect_and_auth(port, auth_secret):
        """
        Connect to local host, authenticate with it, and return a (sockfile, sock) for that connection.
        Handles IPV4 & IPV6, does some error handling.
        :param port
        :param auth_secret
        :return: a tuple with (sockfile, sock)
        """
        sock = None
        errors = []
        # Support for both IPv4 and IPv6.
        # On most of IPv6-ready systems, IPv6 will take precedence.
        for res in socket.getaddrinfo("127.0.0.1", port, socket.AF_UNSPEC, socket.SOCK_STREAM):
            af, socktype, proto, _, sa = res
            try:
                sock = socket.socket(af, socktype, proto)
                sock.settimeout(15)
                sock.connect(sa)
                sockfile = sock.makefile("rwb", int(os.environ.get("SPARK_BUFFER_SIZE", 65536)))
                _do_server_auth(sockfile, auth_secret)
                return (sockfile, sock)
            except socket.error as e:
                emsg = _exception_message(e)
                errors.append("tried to connect to %s, but an error occured: %s" % (sa, emsg))
                sock.close()
                sock = None
>       raise Exception("could not open socket: %s" % errors)
E       Exception: could not open socket: ["tried to connect to ('127.0.0.1', 41543), but an error occured: [Errno 111] Connection refused"]

/databricks/spark/python/pyspark/java_gateway.py:203: Exception
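
A note for future triage: the stack bottoms out in PySpark's `local_connect_and_auth`, which just loops over the `getaddrinfo` results for 127.0.0.1 and connects to the port the driver JVM advertised. `[Errno 111] Connection refused` therefore means nothing was listening on that port any more, i.e. the driver JVM had already died or been killed; it is not a problem in the test logic itself. A minimal standalone sketch of that loop (`try_connect` is a hypothetical name, and the port is assumed to have no listener):

```python
import socket

def try_connect(port):
    # Mirrors the connect loop in pyspark's local_connect_and_auth, minus
    # the auth handshake: try each addrinfo result for localhost and
    # collect per-address errors on failure.
    errors = []
    for res in socket.getaddrinfo("127.0.0.1", port, socket.AF_UNSPEC, socket.SOCK_STREAM):
        af, socktype, proto, _, sa = res
        sock = socket.socket(af, socktype, proto)
        try:
            sock.settimeout(15)
            sock.connect(sa)  # raises ECONNREFUSED immediately if nothing is listening
            return sock
        except socket.error as e:
            errors.append("tried to connect to %s, but an error occured: %s" % (sa, e))
            sock.close()
    raise Exception("could not open socket: %s" % errors)

# With no listener on the port (41543 in the failure above), this raises the
# same "could not open socket: ... Connection refused" seen in the log.
```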

Steps/Code to reproduce bug
The pytest integration tests failed on Databricks; not sure whether they might fail on other IT pipelines as well.

Expected behavior
The tests pass every run, even in the face of transient infrastructure failures.

Environment details (please complete the following information)
Databricks

@NvTimLiu NvTimLiu added bug Something isn't working ? - Needs Triage Need team to review and classify labels Feb 25, 2021
@NvTimLiu NvTimLiu changed the title [BUG] window_function_test.py:: [BUG] window_function_test.py::test_multi_types_window_aggs_for_rows_lead_lag[partBy failed Feb 25, 2021

pxLi commented Feb 26, 2021

Passed in today's run, but there still seems to be something weird when running test_multi_types_window_aggs_for_rows_lead_lag in parallel; it looks like a race condition.

CI-NIGHTLY-ID86

@tgravescs tgravescs self-assigned this Mar 1, 2021
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Mar 2, 2021
tgravescs commented

I've been running this test on a Databricks cluster and haven't been able to reproduce it. The error makes it look like the Java process went away.
I don't see this in any of the latest runs; if it comes up again we may need to change the cluster setup to leave it around for debugging.

tgravescs commented

Closing this; we should reopen it if we see it again.

@tgravescs tgravescs reopened this Mar 3, 2021
tgravescs commented

Saw this again. I looked at a currently running job and saw events like "Driver is up but is not responsive, likely due to GC", and in another build I saw "Cluster lost at least one node. Reason: Spot Termination." So I'm wondering if we are losing nodes or hitting GC issues that affect our tests. The tests run in local mode, so a lost executor node shouldn't affect us, but in the past I have seen that it did; something had changed in the conf or jars, but I was never able to debug it.
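
For context on the local-mode point, the tests build their session roughly like the hedged sketch below, so all computation stays in the driver JVM and a spot-terminated worker node should be invisible to them; the app name and memory setting are illustrative assumptions, not the pipeline's actual configuration:

```python
from pyspark.sql import SparkSession

# Local-mode session: the driver and "executors" share a single JVM, so a
# cluster worker lost to spot termination should not touch the test run.
# A memory-starved driver JVM, on the other hand, can stall in GC past the
# 15 s socket timeout (or die outright), giving the symptoms above.
spark = (SparkSession.builder
         .master("local[*]")                    # keep everything in the driver JVM
         .appName("window-test-sketch")         # hypothetical name
         .config("spark.driver.memory", "8g")   # illustrative value only
         .getOrCreate())
```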

@pxLi pxLi reopened this Apr 23, 2021

pxLi commented Apr 23, 2021

Reopening the issue.

We started seeing this abnormal hanging issue again in the 0.6 nightly tests (enabled two days ago):

[2021-04-21T07:11:16.810Z] [gw4] ../../src/main/python/window_function_test.py::test_multi_types_window_aggs_for_rows_lead_lag[partBy:Boolean-orderBy:Double-Decimal(18,3)][IGNORE_ORDER, APPROXIMATE_FLOAT]
[2021-04-21T07:15:08.691Z] ../../src/main/python/window_function_test.py::test_multi_types_window_aggs_for_rows_lead_lag[partBy:Boolean-orderBy:String-Long][IGNORE_ORDER, APPROXIMATE_FLOAT]

[2021-04-21T07:35:15.491Z] [gw0] ../../src/main/python/udf_test.py::test_window_aggregate_udf[Lower_Upper-Byte][IGNORE_ORDER]
[2021-04-21T07:39:51.924Z] ../../src/main/python/udf_test.py::test_window_aggregate_udf[Lower_Upper-Short][IGNORE_ORDER]

It appeared in builds intermittently: DB301-CI-NIGHTLY IDs 149 and 151 have this issue, while 150 works fine.


pxLi commented Apr 25, 2021

Two more failures today: DB301-CI-NIGHTLY IDs 153 and 154.

@sameerz sameerz added the ? - Needs Triage Need team to review and classify label Apr 25, 2021

pxLi commented Apr 26, 2021

After @NvTimLiu changed the instance type to 2xlarge, the tests passed as expected.
The issue appears to be related to the parallelism of the integration tests: the lack of memory caused frequent GC pauses, which failed or slowed our test processes.
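
To make that concrete: with N parallel test workers each holding its own local-mode driver JVM, the per-worker share is roughly total memory divided by N, so a bigger instance (or lower parallelism) restores GC headroom. A back-of-envelope helper, where the 8 GiB-per-worker budget is an assumption for illustration rather than a measured number:

```python
def max_test_workers(total_mem_gib, gib_per_worker=8, hard_cap=16):
    """Pick a pytest-xdist worker count that leaves each concurrent
    local-mode Spark driver enough heap to avoid GC thrashing.
    The default per-worker budget is an illustrative assumption."""
    return max(1, min(hard_cap, total_mem_gib // gib_per_worker))

# e.g. max_test_workers(32) -> 4, while doubling memory on a 2xlarge-class
# instance gives max_test_workers(64) -> 8.
```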

@pxLi pxLi closed this as completed Apr 26, 2021
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Apr 27, 2021