
[BUG] xgboost job failed if we enable PCBS #5138

Closed
nvliyuan opened this issue Apr 4, 2022 · 2 comments
Labels
bug Something isn't working P0 Must have for release

Comments

nvliyuan (Collaborator) commented Apr 4, 2022

Describe the bug
If we enable PCBS (the ParquetCachedBatchSerializer), the mortgage xgboost training job fails in the "Transformation and Show Result Sample" step.
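For context, a minimal PySpark sketch of the failure pattern (a hypothetical repro, not the exact mortgage job; it assumes the RAPIDS jars are already on the driver and executor classpath, and that any UDT column such as pyspark.ml's VectorUDT goes through the same code path):

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors

spark = (SparkSession.builder
         .appName("pcbs-udt-repro")
         .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
         .config("spark.sql.cache.serializer",
                 "com.nvidia.spark.ParquetCachedBatchSerializer")
         .getOrCreate())

# "features" is a VectorUDT column, i.e. a user-defined type in the schema
df = spark.createDataFrame(
    [(0.0, Vectors.dense([1.0, 2.0])),
     (1.0, Vectors.dense([3.0, 4.0]))],
    ["label", "features"])

df.cache()
df.count()   # materializes the cache through PCBS
df.show()    # reading the cached batches back as rows hits the AssertionError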
Executor error log:

Executor task launch worker for task 2.1 in stage 7.0 (TID 25) 22/04/04 11:52:29:485 ERROR PythonRunner: This may have been caused by a prior exception:
java.lang.AssertionError: assertion failed: User-defined types in Catalyst schema should have already been expanded:
(start of schema truncated in the captured log)
{
    "name" : "_col30",
    "type" : "double",
    "nullable" : false,
    "metadata" : { }
  } ]
}
        at scala.Predef$.assert(Predef.scala:223)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter.<init>(ParquetRowConverter.scala:158)
        at org.apache.spark.sql.execution.datasources.parquet.rapids.shims.v2.ShimParquetRowConverter.<init>(ShimVectorizedColumnReader.scala:45)
        at org.apache.spark.sql.execution.datasources.parquet.rapids.shims.v2.ParquetRecordMaterializer.<init>(ParquetMaterializer.scala:47)
        at com.nvidia.spark.rapids.shims.v2.ParquetCachedBatchSerializer$CachedBatchIteratorConsumer$$anon$3.$anonfun$convertCachedBatchToInternalRowIter$1(ParquetCachedBatchSerializer.scala:760)
        at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
        at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
        at com.nvidia.spark.rapids.shims.v2.ParquetCachedBatchSerializer.withResource(ParquetCachedBatchSerializer.scala:262)
        at com.nvidia.spark.rapids.shims.v2.ParquetCachedBatchSerializer$CachedBatchIteratorConsumer$$anon$3.convertCachedBatchToInternalRowIter(ParquetCachedBatchSerializer.scala:744)
        at com.nvidia.spark.rapids.shims.v2.ParquetCachedBatchSerializer$CachedBatchIteratorConsumer$$anon$3.hasNext(ParquetCachedBatchSerializer.scala:724)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
        at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:86)
        at scala.collection.Iterator.foreach(Iterator.scala:941)
        at scala.collection.Iterator.foreach$(Iterator.scala:941)
        at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:80)
        at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:307)
        at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:621)
        at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:397)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1996)
        at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:232)
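The assertion itself comes from Spark's ParquetRowConverter, which requires that the schema it receives contains no user-defined types; a cache serializer is expected to expand each UDT into its underlying sqlType() before converting cached batches back to rows. As a quick illustration using only the standard PySpark API (nothing PCBS-specific), this is the struct a VectorUDT column should have been expanded to:

from pyspark.ml.linalg import VectorUDT

# The Catalyst type that stands in for a Vector once the UDT is expanded
print(VectorUDT().sqlType().simpleString())
# struct<type:tinyint,size:int,indices:array<int>,values:array<double>>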

Environment details
Spark standalone cluster with 8 A100 GPUs (spark2a)

CMD_PARAMS="--master $SPARK_MASTER_URL \
--conf spark.driver.extraClassPath=$SPARK_RAPIDS_PLUGIN_JAR:$CUDF_JAR \
--conf spark.executor.extraClassPath=$SPARK_RAPIDS_PLUGIN_JAR:$CUDF_JAR \
--jars $SPARK_RAPIDS_PLUGIN_JAR,$CUDF_JAR,$XGBOOST4J_JAR,$XGBOOST4J_SPARK_JAR \
--py-files $SPARK_RAPIDS_PLUGIN_JAR,$CUDF_JAR,$XGBOOST4J_JAR,$XGBOOST4J_SPARK_JAR \
--driver-memory ${DRIVER_MEMORY}G \
--executor-cores $NUM_EXECUTOR_CORES \
--executor-memory ${EXECUTOR_MEMORY}G \
--conf spark.executor.resource.gpu.amount=1 \
--conf spark.rapids.sql.concurrentGpuTasks=1 \
--conf spark.locality.wait=0 \
--conf spark.rapids.memory.pinnedPool.size=2g \
--conf spark.sql.files.maxPartitionBytes=1g \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.task.resource.gpu.amount=$RESOURCE_GPU_AMT \
--conf spark.rapids.sql.enabled=True \
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--conf spark.rapids.sql.variableFloatAgg.enabled=true \
--conf spark.sql.adaptive.enabled=$USEAQE \
--conf spark.rapids.sql.explain=True \
--conf spark.rapids.sql.decimalType.enabled=true \
--conf spark.rapids.sql.incompatibleDateFormats.enabled=True \
--conf spark.rapids.sql.hasNans=false \
--conf spark.rapids.sql.csv.read.long.enabled=true \
--conf spark.rapids.sql.csv.read.double.enabled=true \
--conf spark.rapids.sql.csv.read.integer.enabled=true \
--conf spark.sql.cache.serializer=com.nvidia.spark.ParquetCachedBatchSerializer \
--conf spark.app.name=mortgage-perf
"

$SPARK_HOME/bin/pyspark $CMD_PARAMS

If we remove --conf spark.sql.cache.serializer=com.nvidia.spark.ParquetCachedBatchSerializer, the job works well.
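To double-check which serializer a session actually picked up, the static conf can be read back at runtime with the standard Spark conf API (it cannot be changed after launch):

# spark.sql.cache.serializer is a static SQL conf: readable at runtime, but
# only settable at launch via --conf as in the command above
print(spark.conf.get("spark.sql.cache.serializer"))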

nvliyuan added the "bug Something isn't working" and "? - Needs Triage Need team to review and classify" labels on Apr 4, 2022
nvliyuan changed the title from "[BUG]" to "[BUG]xgboost job fail if we enable PCBS" on Apr 4, 2022
nvliyuan changed the title from "[BUG]xgboost job fail if we enable PCBS" to "[BUG]xgboost job failed if we enable PCBS" on Apr 4, 2022
mattahrens added the "P0 Must have for release" label and removed "? - Needs Triage Need team to review and classify" on Apr 5, 2022
viadea (Collaborator) commented Apr 6, 2022

#4806 is related

nvliyuan (Collaborator, Author) commented Apr 6, 2022

Thanks @viadea. Verified with the latest snapshot jars and the issue is fixed.

nvliyuan closed this as completed on Apr 6, 2022
sameerz changed the title from "[BUG]xgboost job failed if we enable PCBS" to "[BUG] xgboost job failed if we enable PCBS" on Jun 6, 2022