
[BUG] test_select_complex_field fails in MT tests #9103

Closed
abellina opened this issue Aug 24, 2023 · 4 comments · Fixed by #9113
Assignees
Labels
bug Something isn't working

Comments

@abellina (Collaborator)

The MT standalone CI is consistently failing with this test that was added last week (#9018):

13:41:22  FAILED ../../src/main/python/prune_partition_column_test.py::test_select_complex_field[parquet-True-select friends.middle, friends from {} where p=1-struct<friends:array<struct<first:string,middle:string,last:string>>>]
13:41:22  FAILED ../../src/main/python/prune_partition_column_test.py::test_select_complex_field[parquet-True-select name.first from {} where name.first = 'Jane'-struct<name:struct<first:string>>]
13:41:22  FAILED ../../src/main/python/prune_partition_column_test.py::test_select_complex_field[parquet-False-select friends.middle, friends from {} where p=1-struct<friends:array<struct<first:string,middle:string,last:string>>>][INJECT_OOM]
13:41:22  FAILED ../../src/main/python/prune_partition_column_test.py::test_select_complex_field[parquet-False-select name.first from {} where name.first = 'Jane'-struct<name:struct<first:string>>]
13:41:22  FAILED ../../src/main/python/prune_partition_column_test.py::test_select_complex_field[orc-True-select friends.middle, friends from {} where p=1-struct<friends:array<struct<first:string,middle:string,last:string>>>]
13:41:22  FAILED ../../src/main/python/prune_partition_column_test.py::test_select_complex_field[orc-True-select name.first from {} where name.first = 'Jane'-struct<name:struct<first:string>>][INJECT_OOM]
13:41:22  FAILED ../../src/main/python/prune_partition_column_test.py::test_select_complex_field[orc-False-select friends.middle, friends from {} where p=1-struct<friends:array<struct<first:string,middle:string,last:string>>>]
13:41:22  FAILED ../../src/main/python/prune_partition_column_test.py::test_select_complex_field[orc-False-select name.first from {} where name.first = 'Jane'-struct<name:struct<first:string>>][INJECT_OOM]

Here's the captured stdout showing a NoClassDefFoundError. I think we need to move some of the code that was added here under the shim so the classes can be loaded. We could also add GpuFileSourceScanExec to the unshimmed list, but I am not sure what the ramifications of that are; I'd have to spend more time looking at this.

_ test_select_complex_field[parquet-False-select friends.middle, friends from {} where p=1-struct<friends:array<struct<first:string,middle:string,last:string>>>] _

format = 'parquet'
spark_tmp_path = '/tmp/pyspark_tests//spark-egx-06-master-33386-1435859741/'
query = 'select friends.middle, friends from {} where p=1'
expected_schemata = 'struct<friends:array<struct<first:string,middle:string,last:string>>>'
is_partitioned = False
spark_tmp_table_factory = <conftest.TmpTableFactory object at 0x7f9cf22214c0>

    @pytest.mark.parametrize('query,expected_schemata', [("select friends.middle, friends from {} where p=1", "struct<friends:array<struct<first:string,middle:string,last:string>>>"),
                                                         pytest.param("select name.middle, address from {} where p=2", "struct<name:struct<middle:string>,address:string>", marks=pytest.mark.skip(reason='https://github.com/NVIDIA/spark-rapids/issues/8788')),
                                                         ("select name.first from {} where name.first = 'Jane'", "struct<name:struct<first:string>>")])
    @pytest.mark.parametrize('is_partitioned', [True, False])
    @pytest.mark.parametrize('format', ["parquet", "orc"])
    def test_select_complex_field(format, spark_tmp_path, query, expected_schemata, is_partitioned, spark_tmp_table_factory):
        table_name = spark_tmp_table_factory.get()
        data_path = spark_tmp_path + "/DATA"
        def read_temp_view(schema):
            def do_it(spark):
                spark.read.format(format).schema(schema).load(data_path + f"/{table_name}").createOrReplaceTempView(table_name)
                return spark.sql(query.format(table_name))
            return do_it
        conf={"spark.sql.parquet.enableVectorizedReader": "true"}
>       create_contacts_table_and_read(is_partitioned, format, data_path, expected_schemata, read_temp_view, conf, table_name)

../../src/main/python/prune_partition_column_test.py:190:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../src/main/python/prune_partition_column_test.py:170: in create_contacts_table_and_read
    jvm.org.apache.spark.sql.rapids.ExecutionPlanCaptureCallback.assertSchemataMatch(cpu_df._jdf, gpu_df._jdf, expected_schemata)
/var/lib/jenkins/spark/spark-3.1.2-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py:1304: in __call__
    return_value = get_return_value(
/var/lib/jenkins/spark/spark-3.1.2-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/utils.py:111: in deco
    return f(*a, **kw)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

answer = 'xro5536074'
gateway_client = <py4j.java_gateway.GatewayClient object at 0x7f9f33a942e0>
target_id = 'z:org.apache.spark.sql.rapids.ExecutionPlanCaptureCallback'
name = 'assertSchemataMatch'

    def get_return_value(answer, gateway_client, target_id=None, name=None):
        """Converts an answer received from the Java gateway into a Python object.

        For example, string representation of integers are converted to Python
        integer, string representation of objects are converted to JavaObject
        instances, etc.

        :param answer: the string returned by the Java gateway
        :param gateway_client: the gateway client used to communicate with the Java
            Gateway. Only necessary if the answer is a reference (e.g., object,
            list, map)
        :param target_id: the name of the object from which the answer comes from
            (e.g., *object1* in `object1.hello()`). Optional.
        :param name: the name of the member from which the answer comes from
            (e.g., *hello* in `object1.hello()`). Optional.
        """
        if is_error(answer)[0]:
            if len(answer) > 1:
                type = answer[1]
                value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
                if answer[1] == REFERENCE_TYPE:
>                   raise Py4JJavaError(
                        "An error occurred while calling {0}{1}{2}.\n".
                        format(target_id, ".", name), value)
E                   py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.rapids.ExecutionPlanCaptureCallback.assertSchemataMatch.
E                   : java.lang.NoClassDefFoundError: org/apache/spark/sql/rapids/GpuFileSourceScanExec
E                     at org.apache.spark.sql.rapids.ExecutionPlanCaptureCallback$$anonfun$2.applyOrElse(ExecutionPlanCaptureCallback.scala:103)
E                     at org.apache.spark.sql.rapids.ExecutionPlanCaptureCallback$$anonfun$2.applyOrElse(ExecutionPlanCaptureCallback.scala:102)
E                     at scala.PartialFunction$Lifted.apply(PartialFunction.scala:228)
E                     at scala.PartialFunction$Lifted.apply(PartialFunction.scala:224)
E                     at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanHelper.$anonfun$collect$1(AdaptiveSparkPlanHelper.scala:86)
E                     at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanHelper.$anonfun$collect$1$adapted(AdaptiveSparkPlanHelper.scala:86)
E                     at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanHelper.foreach(AdaptiveSparkPlanHelper.scala:45)
E                     at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanHelper.foreach$(AdaptiveSparkPlanHelper.scala:44)
E                     at com.nvidia.spark.rapids.AdaptiveSparkPlanHelperImpl.foreach(AdaptiveSparkPlanHelperImpl.scala:22)
E                     at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanHelper.collect(AdaptiveSparkPlanHelper.scala:86)
E                     at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanHelper.collect$(AdaptiveSparkPlanHelper.scala:83)
E                     at com.nvidia.spark.rapids.AdaptiveSparkPlanHelperImpl.collect(AdaptiveSparkPlanHelperImpl.scala:22)
E                     at org.apache.spark.sql.rapids.ExecutionPlanCaptureCallback$.assertSchemataMatch(ExecutionPlanCaptureCallback.scala:102)
E                     at org.apache.spark.sql.rapids.ExecutionPlanCaptureCallback.assertSchemataMatch(ExecutionPlanCaptureCallback.scala)
E                     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
E                     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
E                     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
E                     at java.lang.reflect.Method.invoke(Method.java:498)
E                     at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
E                     at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
E                     at py4j.Gateway.invoke(Gateway.java:282)
E                     at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
E                     at py4j.commands.CallCommand.execute(CallCommand.java:79)
E                     at py4j.GatewayConnection.run(GatewayConnection.java:238)
E                     at java.lang.Thread.run(Thread.java:750)
abellina added labels bug Something isn't working and ? - Needs Triage Need team to review and classify on Aug 24, 2023
@jlowe (Member) commented Aug 24, 2023

ExecutionPlanCaptureCallback, an unshimmed class, is referencing GpuFileSourceScanExec directly which is a shimmed class. At this point I recommend refactoring ExecutionPlanCaptureCallback so it is just a wrapper class that loads an underlying ExecutionPlanCaptureCallbackImpl class via ShimLoader to service the public methods. Otherwise every time we update this class, we're likely to accidentally touch a shimmed class from this unshimmed class.

A better solution is to refactor the code so that unshimmed classes are compiled in a separate module, with the shimmed classes in a module that depends on it. Then the build will automatically fail if any unshimmed class tries to reference a shimmed class directly. I thought there was already an issue for this, but I couldn't find it. Filed #9104 to track that approach.
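The wrapper pattern from the first suggestion can be sketched roughly as below. All names here (CallbackBase, CallbackImpl, CallbackWrapper) are illustrative, and plain Class.forName stands in for the plugin's ShimLoader; this is not the plugin's actual API.

```java
// Unshimmed entry point that never references a shimmed type directly:
// the implementation is loaded reflectively, so loading the wrapper class
// cannot trigger NoClassDefFoundError for a shimmed type.

interface CallbackBase {
    boolean schemataMatch(String expected, String actual);
}

// Stand-in for the shimmed implementation; in the plugin this would be
// compiled per Spark shim and loaded by ShimLoader's class loader.
class CallbackImpl implements CallbackBase {
    public boolean schemataMatch(String expected, String actual) {
        return expected.equals(actual);
    }
}

public class CallbackWrapper {
    private static CallbackBase impl;

    // Lazily load the implementation by name only.
    private static synchronized CallbackBase impl() throws Exception {
        if (impl == null) {
            impl = (CallbackBase) Class.forName("CallbackImpl")
                    .getDeclaredConstructor()
                    .newInstance();
        }
        return impl;
    }

    // Public method delegates to the reflectively loaded impl.
    public static boolean schemataMatch(String expected, String actual) throws Exception {
        return impl().schemataMatch(expected, actual);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(schemataMatch("struct<a:int>", "struct<a:int>"));
    }
}
```

The key property is that the wrapper's bytecode contains only the implementation's name as a string, not a symbolic class reference, so the JVM never tries to resolve the shimmed class when the wrapper itself is loaded.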

abellina self-assigned this Aug 24, 2023
@abellina (Collaborator, Author) commented Aug 24, 2023

I have a patch that implements the first suggestion from @jlowe (https://github.com/NVIDIA/spark-rapids/compare/branch-23.10...abellina:instantiate_ExecutionPlanCaptureCallback_via_ShimLoader?expand=1). But it doesn't work:

E                   py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.rapids.ExecutionPlanCaptureCallback.assertSchemataMatch.
E                   : java.lang.NoClassDefFoundError: org/apache/spark/sql/rapids/GpuFileSourceScanExec
E                   	at org.apache.spark.sql.rapids.ExecutionPlanCaptureCallbackImpl$$anonfun$2.applyOrElse(ExecutionPlanCaptureCallback.scala:121)
E                   	at org.apache.spark.sql.rapids.ExecutionPlanCaptureCallbackImpl$$anonfun$2.applyOrElse(ExecutionPlanCaptureCallback.scala:120)
E                   	at scala.PartialFunction$Lifted.apply(PartialFunction.scala:228)
E                   	at scala.PartialFunction$Lifted.apply(PartialFunction.scala:224)
E                   	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanHelper.$anonfun$collect$1(AdaptiveSparkPlanHelper.scala:86)
E                   	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanHelper.$anonfun$collect$1$adapted(AdaptiveSparkPlanHelper.scala:86)
E                   	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanHelper.foreach(AdaptiveSparkPlanHelper.scala:45)
E                   	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanHelper.foreach$(AdaptiveSparkPlanHelper.scala:44)
E                   	at com.nvidia.spark.rapids.AdaptiveSparkPlanHelperImpl.foreach(AdaptiveSparkPlanHelperImpl.scala:22)
E                   	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanHelper.collect(AdaptiveSparkPlanHelper.scala:86)
E                   	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanHelper.collect$(AdaptiveSparkPlanHelper.scala:83)
E                   	at com.nvidia.spark.rapids.AdaptiveSparkPlanHelperImpl.collect(AdaptiveSparkPlanHelperImpl.scala:22)
E                   	at org.apache.spark.sql.rapids.ExecutionPlanCaptureCallbackImpl.assertSchemataMatch(ExecutionPlanCaptureCallback.scala:120)
E                   	at org.apache.spark.sql.rapids.ExecutionPlanCaptureCallback$.assertSchemataMatch(ExecutionPlanCaptureCallback.scala:85)
E                   	at org.apache.spark.sql.rapids.ExecutionPlanCaptureCallback.assertSchemataMatch(ExecutionPlanCaptureCallback.scala)
E                   	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
E                   	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
E                   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
E                   	at java.lang.reflect.Method.invoke(Method.java:498)
E                   	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
E                   	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
E                   	at py4j.Gateway.invoke(Gateway.java:282)
E                   	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
E                   	at py4j.commands.CallCommand.execute(CallCommand.java:79)
E                   	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
E                   	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
E                   	at java.lang.Thread.run(Thread.java:750)

I think it is likely due to how the app is getting launched. I had to add the jar to extraClassPath to reproduce this exception, so it is probably an order-of-instantiation plus class loader issue like the one that also plagues the shuffle manager. In the python tests the callback is registered via spark.sql.queryExecutionListeners, so it looks a lot like the shuffle manager case.

@jlowe (Member) commented Aug 24, 2023

The problem with your patch is that the unshimmed rules are picking up not only ExecutionPlanCaptureCallback but also ExecutionPlanCaptureCallbackImpl, so even the implementation is unshimmed. Yeah, not the best suggestion of a class name on my part. So either we need to update dist/unshimmed-common-from-spark311.txt to only pick up the ExecutionPlanCaptureCallback object class and not the Impl as well, or change the implementation class name to something the wildcard unshim rule for ExecutionPlanCaptureCallback won't also pick up.
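The wildcard pickup described above can be sketched as follows. The prefix rule and the class-file names are representative assumptions, not an exact listing of dist/unshimmed-common-from-spark311.txt or the dist jar:

```java
import java.util.List;
import java.util.stream.Collectors;

public class UnshimRuleDemo {
    // Hypothetical prefix rule, mimicking a wildcard entry like
    // "ExecutionPlanCaptureCallback*" in the unshimmed-classes list.
    static final String PREFIX =
        "org/apache/spark/sql/rapids/ExecutionPlanCaptureCallback";

    // Representative class files (illustrative, not an exact jar listing).
    static final List<String> CLASS_FILES = List.of(
        "org/apache/spark/sql/rapids/ExecutionPlanCaptureCallback.class",
        "org/apache/spark/sql/rapids/ExecutionPlanCaptureCallback$.class",
        "org/apache/spark/sql/rapids/ExecutionPlanCaptureCallbackImpl.class",       // swept in by accident
        "org/apache/spark/sql/rapids/ShimmedExecutionPlanCaptureCallbackImpl.class" // escapes the rule
    );

    // Everything matching the prefix ends up in the unshimmed set,
    // including the Impl -- which is exactly the bug: the implementation
    // is unshimmed too, so its shimmed references fail at load time.
    static List<String> unshimmed() {
        return CLASS_FILES.stream()
                .filter(f -> f.startsWith(PREFIX))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        unshimmed().forEach(System.out::println);
    }
}
```

Renaming the implementation so it no longer shares the wrapper's prefix (e.g. a "Shimmed" prefix, as tried below) keeps it out of the rule's match set without touching the rule itself.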

@abellina (Collaborator, Author)

Thanks @jlowe! I am testing changing the name of the impl class to have "Shimmed" as a prefix in the name.
