[BUG] test_cogroup_apply_udf[Short(not_null)] failed with pandas 2.1.X #9403

Closed
pxLi opened this issue Oct 9, 2023 · 6 comments
Labels: bug (Something isn't working), test (Only impacts tests)

@pxLi (Collaborator) commented Oct 9, 2023

Describe the bug
Currently this failure has only been seen on ARM (not sure yet whether it is ephemeral or ARM-only; we will keep monitoring).
Pipeline: rapids_nightly-dev-github-arm64, build: 13

```
E                 File "join.pyx", line 682, in pandas._libs.join.__pyx_fused_cpdef
E               TypeError: Function call with ambiguous argument types
=================================== FAILURES ===================================
___________________ test_cogroup_apply_udf[Short(not_null)] ____________________
[gw2] linux -- Python 3.9.18 /usr/bin/python

data_gen = Short(not_null)

    @ignore_order
    @pytest.mark.parametrize('data_gen', [ShortGen(nullable=False)], ids=idfn)
    def test_cogroup_apply_udf(data_gen):
        def asof_join(l, r):
            return pd.merge_asof(l, r, on='a', by='b')

        def do_it(spark):
            left, right = create_df(spark, data_gen, 500, 500)
            return left.groupby('a').cogroup(
                    right.groupby('a')).applyInPandas(
                            asof_join, schema="a int, b int")
>       assert_gpu_and_cpu_are_equal_collect(do_it, conf=arrow_udf_conf)

../../src/main/python/udf_test.py:338:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../src/main/python/asserts.py:566: in assert_gpu_and_cpu_are_equal_collect
    _assert_gpu_and_cpu_are_equal(func, 'COLLECT', conf=conf, is_cpu_first=is_cpu_first)
../../src/main/python/asserts.py:485: in _assert_gpu_and_cpu_are_equal
    run_on_cpu()
../../src/main/python/asserts.py:471: in run_on_cpu
    from_cpu = with_cpu_session(bring_back, conf=conf)
../../src/main/python/spark_session.py:116: in with_cpu_session
    return with_spark_session(func, conf=copy)
../../src/main/python/spark_session.py:100: in with_spark_session
    ret = func(_spark)
../../src/main/python/asserts.py:205: in <lambda>
    bring_back = lambda spark: limit_func(spark).collect()
/spark-3.1.1-bin-hadoop3.2/python/pyspark/sql/dataframe.py:677: in collect
    sock_info = self._jdf.collectToPython()
/spark-3.1.1-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py:1304: in __call__
    return_value = get_return_value(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

a = ('xro959822', <py4j.java_gateway.GatewayClient object at 0xffff061577c0>, 'o959821', 'collectToPython')
kw = {}
converted = PythonException('\n  An exception was thrown from the Python worker. Please see the stack trace below.\nTraceback (mos...utor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\t... 1 more\n')

    def deco(*a, **kw):
        try:
            return f(*a, **kw)
        except py4j.protocol.Py4JJavaError as e:
            converted = convert_exception(e.java_exception)
            if not isinstance(converted, UnknownException):
                # Hide where the exception came from that shows a non-Pythonic
                # JVM exception message.
>               raise converted from None
E               pyspark.sql.utils.PythonException:
E                 An exception was thrown from the Python worker. Please see the stack trace below.
E               Traceback (most recent call last):
E                 File "/spark-3.1.1-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/worker.py", line 604, in main
E                   process()
E                 File "/spark-3.1.1-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/worker.py", line 596, in process
E                   serializer.dump_stream(out_iter, outfile)
E                 File "/spark-3.1.1-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 273, in dump_stream
E                   return ArrowStreamSerializer.dump_stream(self, init_stream_yield_batches(), stream)
E                 File "/spark-3.1.1-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 81, in dump_stream
E                   for batch in iterator:
E                 File "/spark-3.1.1-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 266, in init_stream_yield_batches
E                   for series in iterator:
E                 File "/spark-3.1.1-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/worker.py", line 443, in mapper
E                   return f(df1_keys, df1_vals, df2_keys, df2_vals)
E                 File "/spark-3.1.1-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/worker.py", line 146, in <lambda>
E                   return lambda kl, vl, kr, vr: [(wrapped(kl, vl, kr, vr), to_arrow_type(return_type))]
E                 File "/spark-3.1.1-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/worker.py", line 131, in wrapped
E                   result = f(left_df, right_df)
E                 File "/spark-3.1.1-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/util.py", line 73, in wrapper
E                   return f(*args, **kwargs)
E                 File "/home/jenkins/agent/workspace/rapids_nightly-dev-github-arm64/integration_tests/src/main/python/udf_test.py", line 331, in asof_join
E                   return pd.merge_asof(l, r, on='a', by='b')
E                 File "/usr/local/lib/python3.9/dist-packages/pandas/core/reshape/merge.py", line 705, in merge_asof
E                   return op.get_result()
E                 File "/usr/local/lib/python3.9/dist-packages/pandas/core/reshape/merge.py", line 1852, in get_result
E                   join_index, left_indexer, right_indexer = self._get_join_info()
E                 File "/usr/local/lib/python3.9/dist-packages/pandas/core/reshape/merge.py", line 1133, in _get_join_info
E                   (left_indexer, right_indexer) = self._get_join_indexers()
E                 File "/usr/local/lib/python3.9/dist-packages/pandas/core/reshape/merge.py", line 2217, in _get_join_indexers
E                   return func(
E                 File "join.pyx", line 682, in pandas._libs.join.__pyx_fused_cpdef
E               TypeError: Function call with ambiguous argument types
```
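For what it's worth, the bottom of the trace suggests the failure is reproducible without Spark. A minimal standalone sketch of the failing call, assuming (per the `ShortGen` data in this test and the report in pandas-dev/pandas#55453) that the int16 dtype of the columns is what trips the fused Cython dispatch; the values and sizes below are made up for illustration:

```python
# Minimal standalone sketch of the failing merge_asof call (no Spark).
# Assumes pandas 2.1.0/2.1.1. The int16 columns mirror the Short(not_null)
# data this test feeds through Arrow; the values are illustrative only.
import numpy as np
import pandas as pd

left = pd.DataFrame({'a': np.array([1, 2, 3], dtype=np.int16),
                     'b': np.array([1, 1, 2], dtype=np.int16)})
right = pd.DataFrame({'a': np.array([1, 2, 3], dtype=np.int16),
                      'b': np.array([1, 2, 2], dtype=np.int16)})

# On affected pandas releases this raises:
#   TypeError: Function call with ambiguous argument types
print(pd.merge_asof(left, right, on='a', by='b'))
```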

Steps/Code to reproduce bug
Currently reproduces only in the nightly arm64 build.

Expected behavior
The test case should pass.

Environment details (please complete the following information)
Python packages in the ARM64 test environment:
pandas-2.1.1-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl

Successfully installed cfgv-3.4.0 distlib-0.3.7 exceptiongroup-1.1.3 execnet-2.0.2 filelock-3.12.4
findspark-2.0.1 identify-2.5.30 iniconfig-2.0.0 nodeenv-1.8.0 numpy-1.26.0 packaging-23.2
pandas-2.1.1 platformdirs-3.11.0 pluggy-1.3.0 pre-commit-3.4.0 pyarrow-13.0.0
pytest-7.4.2 pytest-order-1.1.0 pytest-xdist-3.3.1 python-dateutil-2.8.2 pytz-2023.3.post1
pyyaml-6.0.1 sre_yield-1.2 tomli-2.0.1 tzdata-2023.3 virtualenv-20.24.5

vs. the x86 environment, which uses pandas-2.0.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

pxLi added the bug (Something isn't working) and test (Only impacts tests) labels on Oct 9, 2023
sameerz added the "? - Needs Triage" (Need team to review and classify) label on Oct 9, 2023
@firestarman (Collaborator)

Python UDFs are susceptible to environment changes. Can we try pandas 2.0.3 on ARM?
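A quick, hypothetical sanity check after any downgrade, run in the same Python environment the Spark workers use, to confirm which pandas actually gets imported:

```python
# Hypothetical check: confirm the pandas version visible to the
# Spark Python workers after pinning/downgrading.
import pandas as pd
print(pd.__version__)  # expect 2.0.3 after a downgrade to pandas 2.0.x
```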

@firestarman (Collaborator)

Spark 3.1.1 only asks for pandas >= 0.23.2, so I guess pandas 2.1.1 is simply too new for Spark 3.1.1.

@pxLi (Collaborator, Author) commented Oct 10, 2023

> Python UDFs are susceptible to environment changes. Can we try pandas 2.0.3 on ARM?

Yep, we can just pin the pandas version in the Docker image (or re-install it in the test script).

@pxLi (Collaborator, Author) commented Oct 10, 2023

A similar issue was reported at pandas-dev/pandas#55453.

We will keep monitoring whether a newer release fixes the bug (if there is no short-term fix, we will have to limit the pandas requirement to 2.0.x).

pxLi changed the title from "[BUG] test_cogroup_apply_udf[Short(not_null)] failed in arm" to "[BUG] test_cogroup_apply_udf[Short(not_null)] failed with pandas 2.1.X" on Oct 10, 2023
mattahrens removed the "? - Needs Triage" (Need team to review and classify) label on Oct 10, 2023
@pxLi (Collaborator, Author) commented Oct 12, 2023

Temporarily added `pip install --force-reinstall "pandas<2.1"` to the ARM build pipeline as a workaround until the pandas fix lands (the version specifier needs quotes so the shell does not treat `<` as a redirect).

pxLi closed this as completed on Dec 4, 2023
@pxLi (Collaborator, Author) commented Dec 4, 2023

The issue has been fixed in a newer pandas release (verified that pandas 2.1.3 works fine).
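If the pin ever needs to come back, a version-gated skip in the test itself would be one option. A minimal sketch, assuming only what was verified in this thread (2.1.0/2.1.1 fail, 2.1.3 works; 2.1.2 is untested here, so it is conservatively treated as broken); `pandas_asof_skip` is a hypothetical name:

```python
# Hypothetical skip guard for pandas releases hitting pandas-dev/pandas#55453.
import pandas as pd
import pytest
from packaging.version import Version

_pd_ver = Version(pd.__version__)
# 2.1.0 and 2.1.1 are known bad; 2.1.3 is the first release verified good here.
broken_merge_asof = Version("2.1.0") <= _pd_ver < Version("2.1.3")

pandas_asof_skip = pytest.mark.skipif(
    broken_merge_asof,
    reason="pandas merge_asof dispatch bug (pandas-dev/pandas#55453)")
```

The mark could then be stacked onto `test_cogroup_apply_udf` alongside its existing decorators.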
