
[BUG] xgboost job failed if we enable PCBS #5138

Closed
nvliyuan opened this issue Apr 4, 2022 · 2 comments
Labels
bug Something isn't working P0 Must have for release

Comments

nvliyuan (Collaborator) commented Apr 4, 2022

Describe the bug
If we enable PCBS (the ParquetCachedBatchSerializer), the mortgage xgboost training job fails in the "Transformation and Show Result Sample" step.
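For context, a minimal PySpark sketch of the failure pattern (a hypothetical repro, not the exact mortgage job; it assumes the RAPIDS jars are already on the driver and executor classpath, and that any UDT column such as pyspark.ml's VectorUDT goes through the same code path):

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors

spark = (SparkSession.builder
         .appName("pcbs-udt-repro")
         .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
         .config("spark.sql.cache.serializer",
                 "com.nvidia.spark.ParquetCachedBatchSerializer")
         .getOrCreate())

# "features" is a VectorUDT column, i.e. a user-defined type in the schema
df = spark.createDataFrame(
    [(0.0, Vectors.dense([1.0, 2.0])),
     (1.0, Vectors.dense([3.0, 4.0]))],
    ["label", "features"])

df.cache()
df.count()   # materializes the cache through PCBS
df.show()    # reading the cached batches back as rows hits the AssertionError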
Executor error log:

Executor task launch worker for task 2.1 in stage 7.0 (TID 25) 22/04/04 11:52:29:485 ERROR PythonRunner: This may have been caused by a prior exception:
java.lang.AssertionError: assertion failed: User-defined types in Catalyst schema should have already been expanded:
(start of schema truncated in the captured log)
{
    "name" : "_col30",
    "type" : "double",
    "nullable" : false,
    "metadata" : { }
  } ]
}
        at scala.Predef$.assert(Predef.scala:223)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter.<init>(ParquetRowConverter.scala:158)
        at org.apache.spark.sql.execution.datasources.parquet.rapids.shims.v2.ShimParquetRowConverter.<init>(ShimVectorizedColumnReader.scala:45)
        at org.apache.spark.sql.execution.datasources.parquet.rapids.shims.v2.ParquetRecordMaterializer.<init>(ParquetMaterializer.scala:47)
        at com.nvidia.spark.rapids.shims.v2.ParquetCachedBatchSerializer$CachedBatchIteratorConsumer$$anon$3.$anonfun$convertCachedBatchToInternalRowIter$1(ParquetCachedBatchSerializer.scala:760)
        at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
        at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
        at com.nvidia.spark.rapids.shims.v2.ParquetCachedBatchSerializer.withResource(ParquetCachedBatchSerializer.scala:262)
        at com.nvidia.spark.rapids.shims.v2.ParquetCachedBatchSerializer$CachedBatchIteratorConsumer$$anon$3.convertCachedBatchToInternalRowIter(ParquetCachedBatchSerializer.scala:744)
        at com.nvidia.spark.rapids.shims.v2.ParquetCachedBatchSerializer$CachedBatchIteratorConsumer$$anon$3.hasNext(ParquetCachedBatchSerializer.scala:724)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
        at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:86)
        at scala.collection.Iterator.foreach(Iterator.scala:941)
        at scala.collection.Iterator.foreach$(Iterator.scala:941)
        at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:80)
        at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:307)
        at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:621)
        at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:397)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1996)
        at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:232)
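The assertion itself comes from Spark's ParquetRowConverter, which requires that the schema it receives contains no user-defined types; a cache serializer is expected to expand each UDT into its underlying sqlType() before converting cached batches back to rows. As a quick illustration using only the standard PySpark API (nothing PCBS-specific), this is the struct a VectorUDT column should have been expanded to:

from pyspark.ml.linalg import VectorUDT

# The Catalyst type that stands in for a Vector once the UDT is expanded
print(VectorUDT().sqlType().simpleString())
# struct<type:tinyint,size:int,indices:array<int>,values:array<double>>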

Environment details
Spark standalone cluster with 8 A100 GPUs (spark2a)

CMD_PARAMS="--master $SPARK_MASTER_URL \
--conf spark.driver.extraClassPath=$SPARK_RAPIDS_PLUGIN_JAR:$CUDF_JAR \
--conf spark.executor.extraClassPath=$SPARK_RAPIDS_PLUGIN_JAR:$CUDF_JAR \
--jars $SPARK_RAPIDS_PLUGIN_JAR,$CUDF_JAR,$XGBOOST4J_JAR,$XGBOOST4J_SPARK_JAR \
--py-files $SPARK_RAPIDS_PLUGIN_JAR,$CUDF_JAR,$XGBOOST4J_JAR,$XGBOOST4J_SPARK_JAR \
--driver-memory ${DRIVER_MEMORY}G \
--executor-cores $NUM_EXECUTOR_CORES \
--executor-memory ${EXECUTOR_MEMORY}G \
--conf spark.executor.resource.gpu.amount=1 \
--conf spark.rapids.sql.concurrentGpuTasks=1 \
--conf spark.locality.wait=0 \
--conf spark.rapids.memory.pinnedPool.size=2g \
--conf spark.sql.files.maxPartitionBytes=1g \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.task.resource.gpu.amount=$RESOURCE_GPU_AMT \
--conf spark.rapids.sql.enabled=True \
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--conf spark.rapids.sql.variableFloatAgg.enabled=true \
--conf spark.sql.adaptive.enabled=$USEAQE \
--conf spark.rapids.sql.explain=True \
--conf spark.rapids.sql.decimalType.enabled=true \
--conf spark.rapids.sql.incompatibleDateFormats.enabled=True \
--conf spark.rapids.sql.hasNans=false \
--conf spark.rapids.sql.csv.read.long.enabled=true \
--conf spark.rapids.sql.csv.read.double.enabled=true \
--conf spark.rapids.sql.csv.read.integer.enabled=true \
--conf spark.sql.cache.serializer=com.nvidia.spark.ParquetCachedBatchSerializer \
--conf spark.app.name=mortgage-perf
"

$SPARK_HOME/bin/pyspark $CMD_PARAMS

If we remove --conf spark.sql.cache.serializer=com.nvidia.spark.ParquetCachedBatchSerializer, the job works well.
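To double-check which serializer a session actually picked up, the static conf can be read back at runtime with the standard Spark conf API (it cannot be changed after launch):

# spark.sql.cache.serializer is a static SQL conf: readable at runtime, but
# only settable at launch via --conf as in the command above
print(spark.conf.get("spark.sql.cache.serializer"))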

nvliyuan added the "bug Something isn't working" and "? - Needs Triage Need team to review and classify" labels on Apr 4, 2022
nvliyuan changed the title from "[BUG]" to "[BUG]xgboost job fail if we enable PCBS" on Apr 4, 2022
nvliyuan changed the title from "[BUG]xgboost job fail if we enable PCBS" to "[BUG]xgboost job failed if we enable PCBS" on Apr 4, 2022
mattahrens added the "P0 Must have for release" label and removed "? - Needs Triage Need team to review and classify" on Apr 5, 2022
viadea (Collaborator) commented Apr 6, 2022

#4806 is related

nvliyuan (Collaborator, Author) commented Apr 6, 2022

Thanks @viadea. Verified with the latest snapshot jars and the issue is fixed.

nvliyuan closed this as completed on Apr 6, 2022
sameerz changed the title from "[BUG]xgboost job failed if we enable PCBS" to "[BUG] xgboost job failed if we enable PCBS" on Jun 6, 2022