
[BUG] test_select_complex_field fails in MT tests #9103

Closed
abellina opened this issue Aug 24, 2023 · 4 comments · Fixed by #9113
Assignees
Labels
bug Something isn't working

Comments

@abellina (Collaborator)

The MT standalone CI is consistently failing with this test that was added last week (#9018):

13:41:22  FAILED ../../src/main/python/prune_partition_column_test.py::test_select_complex_field[parquet-True-select friends.middle, friends from {} where p=1-struct<friends:array<struct<first:string,middle:string,last:string>>>]
13:41:22  FAILED ../../src/main/python/prune_partition_column_test.py::test_select_complex_field[parquet-True-select name.first from {} where name.first = 'Jane'-struct<name:struct<first:string>>]
13:41:22  FAILED ../../src/main/python/prune_partition_column_test.py::test_select_complex_field[parquet-False-select friends.middle, friends from {} where p=1-struct<friends:array<struct<first:string,middle:string,last:string>>>][INJECT_OOM]
13:41:22  FAILED ../../src/main/python/prune_partition_column_test.py::test_select_complex_field[parquet-False-select name.first from {} where name.first = 'Jane'-struct<name:struct<first:string>>]
13:41:22  FAILED ../../src/main/python/prune_partition_column_test.py::test_select_complex_field[orc-True-select friends.middle, friends from {} where p=1-struct<friends:array<struct<first:string,middle:string,last:string>>>]
13:41:22  FAILED ../../src/main/python/prune_partition_column_test.py::test_select_complex_field[orc-True-select name.first from {} where name.first = 'Jane'-struct<name:struct<first:string>>][INJECT_OOM]
13:41:22  FAILED ../../src/main/python/prune_partition_column_test.py::test_select_complex_field[orc-False-select friends.middle, friends from {} where p=1-struct<friends:array<struct<first:string,middle:string,last:string>>>]
13:41:22  FAILED ../../src/main/python/prune_partition_column_test.py::test_select_complex_field[orc-False-select name.first from {} where name.first = 'Jane'-struct<name:struct<first:string>>][INJECT_OOM]

Here's the captured stdout showing a NoClassDefFoundError. I think we need to move some of the code that was added here under the shim so the classes can be loaded. We could also add GpuFileSourceScanExec to the unshimmed list, but I am not sure what the ramifications of that are; I'd have to spend more time looking at this.

_ test_select_complex_field[parquet-False-select friends.middle, friends from {} where p=1-struct<friends:array<struct<first:string,middle:string,last:string>>>] _

format = 'parquet'
spark_tmp_path = '/tmp/pyspark_tests//spark-egx-06-master-33386-1435859741/'
query = 'select friends.middle, friends from {} where p=1'
expected_schemata = 'struct<friends:array<struct<first:string,middle:string,last:string>>>'
is_partitioned = False
spark_tmp_table_factory = <conftest.TmpTableFactory object at 0x7f9cf22214c0>

    @pytest.mark.parametrize('query,expected_schemata', [("select friends.middle, friends from {} where p=1", "struct<friends:array<struct<first:string,middle:string,last:string>>>"),
                                                         pytest.param("select name.middle, address from {} where p=2", "struct<name:struct<middle:string>,address:string>", marks=pytest.mark.skip(reason='https://github.com/NVIDIA/spark-rapids/issues/8788')),
                                                         ("select name.first from {} where name.first = 'Jane'", "struct<name:struct<first:string>>")])
    @pytest.mark.parametrize('is_partitioned', [True, False])
    @pytest.mark.parametrize('format', ["parquet", "orc"])
    def test_select_complex_field(format, spark_tmp_path, query, expected_schemata, is_partitioned, spark_tmp_table_factory):
        table_name = spark_tmp_table_factory.get()
        data_path = spark_tmp_path + "/DATA"
        def read_temp_view(schema):
            def do_it(spark):
                spark.read.format(format).schema(schema).load(data_path + f"/{table_name}").createOrReplaceTempView(table_name)
                return spark.sql(query.format(table_name))
            return do_it
        conf={"spark.sql.parquet.enableVectorizedReader": "true"}
>       create_contacts_table_and_read(is_partitioned, format, data_path, expected_schemata, read_temp_view, conf, table_name)

../../src/main/python/prune_partition_column_test.py:190:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../src/main/python/prune_partition_column_test.py:170: in create_contacts_table_and_read
    jvm.org.apache.spark.sql.rapids.ExecutionPlanCaptureCallback.assertSchemataMatch(cpu_df._jdf, gpu_df._jdf, expected_schemata)
/var/lib/jenkins/spark/spark-3.1.2-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py:1304: in __call__
    return_value = get_return_value(
/var/lib/jenkins/spark/spark-3.1.2-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/utils.py:111: in deco
    return f(*a, **kw)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

answer = 'xro5536074'
gateway_client = <py4j.java_gateway.GatewayClient object at 0x7f9f33a942e0>
target_id = 'z:org.apache.spark.sql.rapids.ExecutionPlanCaptureCallback'
name = 'assertSchemataMatch'

    def get_return_value(answer, gateway_client, target_id=None, name=None):
        """Converts an answer received from the Java gateway into a Python object.

        For example, string representation of integers are converted to Python
        integer, string representation of objects are converted to JavaObject
        instances, etc.

        :param answer: the string returned by the Java gateway
        :param gateway_client: the gateway client used to communicate with the Java
            Gateway. Only necessary if the answer is a reference (e.g., object,
            list, map)
        :param target_id: the name of the object from which the answer comes from
            (e.g., *object1* in `object1.hello()`). Optional.
        :param name: the name of the member from which the answer comes from
            (e.g., *hello* in `object1.hello()`). Optional.
        """
        if is_error(answer)[0]:
            if len(answer) > 1:
                type = answer[1]
                value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
                if answer[1] == REFERENCE_TYPE:
>                   raise Py4JJavaError(
                        "An error occurred while calling {0}{1}{2}.\n".
                        format(target_id, ".", name), value)
E                   py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.rapids.ExecutionPlanCaptureCallback.assertSchemataMatch.
E                   : java.lang.NoClassDefFoundError: org/apache/spark/sql/rapids/GpuFileSourceScanExec
E                     at org.apache.spark.sql.rapids.ExecutionPlanCaptureCallback$$anonfun$2.applyOrElse(ExecutionPlanCaptureCallback.scala:103)
E                     at org.apache.spark.sql.rapids.ExecutionPlanCaptureCallback$$anonfun$2.applyOrElse(ExecutionPlanCaptureCallback.scala:102)
E                     at scala.PartialFunction$Lifted.apply(PartialFunction.scala:228)
E                     at scala.PartialFunction$Lifted.apply(PartialFunction.scala:224)
E                     at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanHelper.$anonfun$collect$1(AdaptiveSparkPlanHelper.scala:86)
E                     at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanHelper.$anonfun$collect$1$adapted(AdaptiveSparkPlanHelper.scala:86)
E                     at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanHelper.foreach(AdaptiveSparkPlanHelper.scala:45)
E                     at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanHelper.foreach$(AdaptiveSparkPlanHelper.scala:44)
E                     at com.nvidia.spark.rapids.AdaptiveSparkPlanHelperImpl.foreach(AdaptiveSparkPlanHelperImpl.scala:22)
E                     at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanHelper.collect(AdaptiveSparkPlanHelper.scala:86)
E                     at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanHelper.collect$(AdaptiveSparkPlanHelper.scala:83)
E                     at com.nvidia.spark.rapids.AdaptiveSparkPlanHelperImpl.collect(AdaptiveSparkPlanHelperImpl.scala:22)
E                     at org.apache.spark.sql.rapids.ExecutionPlanCaptureCallback$.assertSchemataMatch(ExecutionPlanCaptureCallback.scala:102)
E                     at org.apache.spark.sql.rapids.ExecutionPlanCaptureCallback.assertSchemataMatch(ExecutionPlanCaptureCallback.scala)
E                     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
E                     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
E                     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
E                     at java.lang.reflect.Method.invoke(Method.java:498)
E                     at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
E                     at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
E                     at py4j.Gateway.invoke(Gateway.java:282)
E                     at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
E                     at py4j.commands.CallCommand.execute(CallCommand.java:79)
E                     at py4j.GatewayConnection.run(GatewayConnection.java:238)
E                     at java.lang.Thread.run(Thread.java:750)
abellina added labels bug Something isn't working and ? - Needs Triage Need team to review and classify on Aug 24, 2023
@jlowe (Member) commented Aug 24, 2023

ExecutionPlanCaptureCallback, an unshimmed class, is referencing GpuFileSourceScanExec directly which is a shimmed class. At this point I recommend refactoring ExecutionPlanCaptureCallback so it is just a wrapper class that loads an underlying ExecutionPlanCaptureCallbackImpl class via ShimLoader to service the public methods. Otherwise every time we update this class, we're likely to accidentally touch a shimmed class from this unshimmed class.

A better solution is to refactor the code so that unshimmed classes are compiled in a separate module, with the shimmed classes in a module that depends on it. Then the build will automatically fail if any unshimmed class tries to reference a shimmed class directly. I thought there was already an issue for this, but I couldn't find it. Filed #9104 to track that approach.
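The wrapper pattern from the first suggestion can be sketched roughly as below. All names here (CallbackBase, CallbackImpl, CallbackWrapper) are illustrative, and plain Class.forName stands in for the plugin's ShimLoader; this is not the plugin's actual API.

```java
// Unshimmed entry point that never references a shimmed type directly:
// the implementation is loaded reflectively, so loading the wrapper class
// cannot trigger NoClassDefFoundError for a shimmed type.

interface CallbackBase {
    boolean schemataMatch(String expected, String actual);
}

// Stand-in for the shimmed implementation; in the plugin this would be
// compiled per Spark shim and loaded by ShimLoader's class loader.
class CallbackImpl implements CallbackBase {
    public boolean schemataMatch(String expected, String actual) {
        return expected.equals(actual);
    }
}

public class CallbackWrapper {
    private static CallbackBase impl;

    // Lazily load the implementation by name only.
    private static synchronized CallbackBase impl() throws Exception {
        if (impl == null) {
            impl = (CallbackBase) Class.forName("CallbackImpl")
                    .getDeclaredConstructor()
                    .newInstance();
        }
        return impl;
    }

    // Public method delegates to the reflectively loaded impl.
    public static boolean schemataMatch(String expected, String actual) throws Exception {
        return impl().schemataMatch(expected, actual);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(schemataMatch("struct<a:int>", "struct<a:int>"));
    }
}
```

The key property is that the wrapper's bytecode contains only the implementation's name as a string, not a symbolic class reference, so the JVM never tries to resolve the shimmed class when the wrapper itself is loaded.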

abellina self-assigned this Aug 24, 2023
@abellina (Collaborator, Author) commented Aug 24, 2023

I have a patch that implements the first suggestion from @jlowe (https://github.com/NVIDIA/spark-rapids/compare/branch-23.10...abellina:instantiate_ExecutionPlanCaptureCallback_via_ShimLoader?expand=1). But it doesn't work:

E                   py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.rapids.ExecutionPlanCaptureCallback.assertSchemataMatch.
E                   : java.lang.NoClassDefFoundError: org/apache/spark/sql/rapids/GpuFileSourceScanExec
E                   	at org.apache.spark.sql.rapids.ExecutionPlanCaptureCallbackImpl$$anonfun$2.applyOrElse(ExecutionPlanCaptureCallback.scala:121)
E                   	at org.apache.spark.sql.rapids.ExecutionPlanCaptureCallbackImpl$$anonfun$2.applyOrElse(ExecutionPlanCaptureCallback.scala:120)
E                   	at scala.PartialFunction$Lifted.apply(PartialFunction.scala:228)
E                   	at scala.PartialFunction$Lifted.apply(PartialFunction.scala:224)
E                   	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanHelper.$anonfun$collect$1(AdaptiveSparkPlanHelper.scala:86)
E                   	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanHelper.$anonfun$collect$1$adapted(AdaptiveSparkPlanHelper.scala:86)
E                   	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanHelper.foreach(AdaptiveSparkPlanHelper.scala:45)
E                   	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanHelper.foreach$(AdaptiveSparkPlanHelper.scala:44)
E                   	at com.nvidia.spark.rapids.AdaptiveSparkPlanHelperImpl.foreach(AdaptiveSparkPlanHelperImpl.scala:22)
E                   	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanHelper.collect(AdaptiveSparkPlanHelper.scala:86)
E                   	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanHelper.collect$(AdaptiveSparkPlanHelper.scala:83)
E                   	at com.nvidia.spark.rapids.AdaptiveSparkPlanHelperImpl.collect(AdaptiveSparkPlanHelperImpl.scala:22)
E                   	at org.apache.spark.sql.rapids.ExecutionPlanCaptureCallbackImpl.assertSchemataMatch(ExecutionPlanCaptureCallback.scala:120)
E                   	at org.apache.spark.sql.rapids.ExecutionPlanCaptureCallback$.assertSchemataMatch(ExecutionPlanCaptureCallback.scala:85)
E                   	at org.apache.spark.sql.rapids.ExecutionPlanCaptureCallback.assertSchemataMatch(ExecutionPlanCaptureCallback.scala)
E                   	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
E                   	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
E                   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
E                   	at java.lang.reflect.Method.invoke(Method.java:498)
E                   	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
E                   	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
E                   	at py4j.Gateway.invoke(Gateway.java:282)
E                   	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
E                   	at py4j.commands.CallCommand.execute(CallCommand.java:79)
E                   	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
E                   	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
E                   	at java.lang.Thread.run(Thread.java:750)

I think it is likely due to how the app is getting launched. I had to add the jar to extraClassPath to reproduce this exception, so it is probably an order-of-instantiation plus class loader issue like the one that also plagues the shuffle manager. In the python tests the callback is registered via spark.sql.queryExecutionListeners, so it looks a lot like the shuffle manager case.

@jlowe (Member) commented Aug 24, 2023

The problem with your patch is that the unshimmed rules are picking up not only ExecutionPlanCaptureCallback but also ExecutionPlanCaptureCallbackImpl, so even the implementation is unshimmed. Yeah, not the best suggestion of a class name on my part. So either we need to update dist/unshimmed-common-from-spark311.txt to only pick up the ExecutionPlanCaptureCallback object class and not the Impl as well, or change the implementation class name to something the wildcard unshim rule for ExecutionPlanCaptureCallback won't also pick up.
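The wildcard pickup described above can be sketched as follows. The prefix rule and the class-file names are representative assumptions, not an exact listing of dist/unshimmed-common-from-spark311.txt or the dist jar:

```java
import java.util.List;
import java.util.stream.Collectors;

public class UnshimRuleDemo {
    // Hypothetical prefix rule, mimicking a wildcard entry like
    // "ExecutionPlanCaptureCallback*" in the unshimmed-classes list.
    static final String PREFIX =
        "org/apache/spark/sql/rapids/ExecutionPlanCaptureCallback";

    // Representative class files (illustrative, not an exact jar listing).
    static final List<String> CLASS_FILES = List.of(
        "org/apache/spark/sql/rapids/ExecutionPlanCaptureCallback.class",
        "org/apache/spark/sql/rapids/ExecutionPlanCaptureCallback$.class",
        "org/apache/spark/sql/rapids/ExecutionPlanCaptureCallbackImpl.class",       // swept in by accident
        "org/apache/spark/sql/rapids/ShimmedExecutionPlanCaptureCallbackImpl.class" // escapes the rule
    );

    // Everything matching the prefix ends up in the unshimmed set,
    // including the Impl -- which is exactly the bug: the implementation
    // is unshimmed too, so its shimmed references fail at load time.
    static List<String> unshimmed() {
        return CLASS_FILES.stream()
                .filter(f -> f.startsWith(PREFIX))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        unshimmed().forEach(System.out::println);
    }
}
```

Renaming the implementation so it no longer shares the wrapper's prefix (e.g. a "Shimmed" prefix, as tried below) keeps it out of the rule's match set without touching the rule itself.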

@abellina (Collaborator, Author)

Thanks @jlowe! I am testing changing the name of the impl class to have "Shimmed" as a prefix in the name.
