
[BUG] Failed to cast value false to BooleanType for partition column k1 #6026

Closed
viadea opened this issue Jul 19, 2022 · 4 comments · Fixed by #6097
Labels: bug (Something isn't working), P0 (Must have for release)

Comments

viadea (Collaborator) commented Jul 19, 2022

Env:
Databricks 10.4ML LTS
Spark RAPIDS 22.08 snapshot jar with the alluxio-auto-mount feature

When using the Alluxio auto-mount feature (spark.rapids.alluxio.automount.enabled=true), a simple query failed:

select * from table limit 10
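
For context, with auto-mount enabled the plugin mounts the S3 bucket into the Alluxio namespace and rewrites the scan paths before planning, roughly like the following (paths and master address are illustrative):

s3a://bucket/sampledata/table  ->  alluxio://<alluxio-master>:19998/bucket/sampledata/table

Rewriting the paths forces the plugin to re-infer the partitioning from the new locations, which, per the stack trace below, is where the error is raised.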

Full stacktrace:

com.databricks.backend.common.rpc.DatabricksExceptions$SQLExecutionException: java.lang.RuntimeException: Failed to cast value `false` to `BooleanType` for partition column `k1`
	at org.apache.spark.sql.errors.QueryExecutionErrors$.failedToCastValueToDataTypeForPartitionColumnError(QueryExecutionErrors.scala:638)
	at org.apache.spark.sql.execution.datasources.PartitioningUtils$.$anonfun$parsePartitions$13(PartitioningUtils.scala:236)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at scala.collection.TraversableLike.map(TraversableLike.scala:286)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
	at org.apache.spark.sql.execution.datasources.PartitioningUtils$.$anonfun$parsePartitions$12(PartitioningUtils.scala:229)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
	at scala.collection.TraversableLike.map(TraversableLike.scala:286)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
	at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:227)
	at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:141)
	at org.apache.spark.sql.execution.datasources.rapids.GpuPartitioningUtils$.inferPartitioning(GpuPartitioningUtils.scala:78)
	at com.nvidia.spark.rapids.AlluxioUtils$.replacePathIfNeeded(AlluxioUtils.scala:268)
	at com.nvidia.spark.rapids.shims.SparkShimImpl$$anon$1.convertToGpu(SparkShims.scala:146)
	at com.nvidia.spark.rapids.shims.SparkShimImpl$$anon$1.convertToGpu(SparkShims.scala:89)
	at com.nvidia.spark.rapids.SparkPlanMeta.convertIfNeeded(RapidsMeta.scala:728)
	at com.nvidia.spark.rapids.GpuOverrides$$anon$199.convertToGpu(GpuOverrides.scala:3786)
	at com.nvidia.spark.rapids.GpuOverrides$$anon$199.convertToGpu(GpuOverrides.scala:3784)
	at com.nvidia.spark.rapids.SparkPlanMeta.convertIfNeeded(RapidsMeta.scala:728)
	at com.nvidia.spark.rapids.GpuProjectExecMeta.convertToGpu(basicPhysicalOperators.scala:50)
	at com.nvidia.spark.rapids.GpuProjectExecMeta.convertToGpu(basicPhysicalOperators.scala:41)
	at com.nvidia.spark.rapids.SparkPlanMeta.convertIfNeeded(RapidsMeta.scala:728)
	at com.nvidia.spark.rapids.GpuBaseAggregateMeta.convertToGpu(aggregate.scala:978)
	at com.nvidia.spark.rapids.GpuBaseAggregateMeta.convertToGpu(aggregate.scala:833)
	at com.nvidia.spark.rapids.SparkPlanMeta.convertIfNeeded(RapidsMeta.scala:728)
	at org.apache.spark.sql.rapids.execution.GpuShuffleMeta.convertToGpu(GpuShuffleExchangeExecBase.scala:113)
	at org.apache.spark.sql.rapids.execution.GpuShuffleMeta.convertToGpu(GpuShuffleExchangeExecBase.scala:43)
	at com.nvidia.spark.rapids.SparkPlanMeta.convertIfNeeded(RapidsMeta.scala:728)
	at com.nvidia.spark.rapids.GpuBaseAggregateMeta.convertToGpu(aggregate.scala:978)
	at com.nvidia.spark.rapids.GpuBaseAggregateMeta.convertToGpu(aggregate.scala:833)
	at com.nvidia.spark.rapids.SparkPlanMeta.convertIfNeeded(RapidsMeta.scala:728)
	at com.nvidia.spark.rapids.GpuOverrides$$anon$199.convertToGpu(GpuOverrides.scala:3786)
	at com.nvidia.spark.rapids.GpuOverrides$$anon$199.convertToGpu(GpuOverrides.scala:3784)
	at com.nvidia.spark.rapids.SparkPlanMeta.convertIfNeeded(RapidsMeta.scala:728)
	at com.nvidia.spark.rapids.GpuProjectExecMeta.convertToGpu(basicPhysicalOperators.scala:50)
	at com.nvidia.spark.rapids.GpuProjectExecMeta.convertToGpu(basicPhysicalOperators.scala:41)
	at com.nvidia.spark.rapids.SparkPlanMeta.convertIfNeeded(RapidsMeta.scala:728)
	at com.nvidia.spark.rapids.SparkPlanMeta.$anonfun$convertToCpu$1(RapidsMeta.scala:607)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
	at scala.collection.TraversableLike.map(TraversableLike.scala:286)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
	at com.nvidia.spark.rapids.SparkPlanMeta.convertToCpu(RapidsMeta.scala:607)
	at com.nvidia.spark.rapids.SparkPlanMeta.convertIfNeeded(RapidsMeta.scala:730)
	at com.nvidia.spark.rapids.GpuOverrides$.com$nvidia$spark$rapids$GpuOverrides$$doConvertPlan(GpuOverrides.scala:4048)
	at com.nvidia.spark.rapids.GpuOverrides.applyOverrides(GpuOverrides.scala:4297)
	at com.nvidia.spark.rapids.GpuOverrides.$anonfun$apply$4(GpuOverrides.scala:4249)
	at com.nvidia.spark.rapids.GpuOverrides$.logDuration(GpuOverrides.scala:466)
	at com.nvidia.spark.rapids.GpuOverrides.$anonfun$apply$2(GpuOverrides.scala:4247)
	at com.nvidia.spark.rapids.GpuOverrideUtil$.$anonfun$tryOverride$1(GpuOverrides.scala:4216)
	at com.nvidia.spark.rapids.GpuOverrides.apply(GpuOverrides.scala:4259)
	at com.nvidia.spark.rapids.GpuOverrides.apply(GpuOverrides.scala:4239)
	at org.apache.spark.sql.execution.ApplyColumnarRulesAndInsertTransitions.$anonfun$apply$1(Columnar.scala:564)
	at org.apache.spark.sql.execution.ApplyColumnarRulesAndInsertTransitions.$anonfun$apply$1$adapted(Columnar.scala:564)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.sql.execution.ApplyColumnarRulesAndInsertTransitions.apply(Columnar.scala:564)
	at org.apache.spark.sql.execution.ApplyColumnarRulesAndInsertTransitions.apply(Columnar.scala:523)
	at org.apache.spark.sql.execution.QueryExecution$.$anonfun$prepareForExecution$2(QueryExecution.scala:596)
	at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:80)
	at org.apache.spark.sql.execution.QueryExecution$.$anonfun$prepareForExecution$1(QueryExecution.scala:596)
	at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
	at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
	at scala.collection.immutable.List.foldLeft(List.scala:91)
	at org.apache.spark.sql.execution.QueryExecution$.prepareForExecution(QueryExecution.scala:595)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$executedPlan$2(QueryExecution.scala:232)
	at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:80)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:151)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:265)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:968)
	at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:265)
	at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:228)
	at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:222)
	at org.apache.spark.sql.execution.QueryExecution.simpleString(QueryExecution.scala:298)
	at org.apache.spark.sql.execution.QueryExecution.org$apache$spark$sql$execution$QueryExecution$$explainString(QueryExecution.scala:361)
	at org.apache.spark.sql.execution.QueryExecution.explainStringLocal(QueryExecution.scala:325)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$8(SQLExecution.scala:176)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:360)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:160)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:968)
	at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:115)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:310)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3928)
	at org.apache.spark.sql.Dataset.collectResult(Dataset.scala:3122)
	at com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation0(OutputAggregator.scala:271)
	at com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation(OutputAggregator.scala:105)
	at com.databricks.backend.daemon.driver.SQLDriverLocal.executeSql(SQLDriverLocal.scala:115)
	at com.databricks.backend.daemon.driver.SQLDriverLocal.repl(SQLDriverLocal.scala:145)
	at com.databricks.backend.daemon.driver.DriverLocal.$anonfun$execute$11(DriverLocal.scala:605)
	at com.databricks.logging.Log4jUsageLoggingShim$.$anonfun$withAttributionContext$1(Log4jUsageLoggingShim.scala:28)
	at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
	at com.databricks.logging.AttributionContext$.withValue(AttributionContext.scala:94)
	at com.databricks.logging.Log4jUsageLoggingShim$.withAttributionContext(Log4jUsageLoggingShim.scala:26)
	at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:205)
	at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:204)
	at com.databricks.backend.daemon.driver.DriverLocal.withAttributionContext(DriverLocal.scala:60)
	at com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:240)
	at com.databricks.logging.UsageLogging.withAttributionTags$(UsageLogging.scala:225)
	at com.databricks.backend.daemon.driver.DriverLocal.withAttributionTags(DriverLocal.scala:60)
	at com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:582)
	at com.databricks.backend.daemon.driver.DriverWrapper.$anonfun$tryExecutingCommand$1(DriverWrapper.scala:615)
	at scala.util.Try$.apply(Try.scala:213)
	at com.databricks.backend.daemon.driver.DriverWrapper.tryExecutingCommand(DriverWrapper.scala:607)
	at com.databricks.backend.daemon.driver.DriverWrapper.executeCommandAndGetError(DriverWrapper.scala:526)
	at com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:561)
	at com.databricks.backend.daemon.driver.DriverWrapper.runInnerLoop(DriverWrapper.scala:431)
	at com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:374)
	at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:225)
	at java.lang.Thread.run(Thread.java:748)

	at com.databricks.backend.daemon.driver.SQLDriverLocal.executeSql(SQLDriverLocal.scala:130)
	at com.databricks.backend.daemon.driver.SQLDriverLocal.repl(SQLDriverLocal.scala:145)
	at com.databricks.backend.daemon.driver.DriverLocal.$anonfun$execute$11(DriverLocal.scala:605)
	at com.databricks.logging.Log4jUsageLoggingShim$.$anonfun$withAttributionContext$1(Log4jUsageLoggingShim.scala:28)
	at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
	at com.databricks.logging.AttributionContext$.withValue(AttributionContext.scala:94)
	at com.databricks.logging.Log4jUsageLoggingShim$.withAttributionContext(Log4jUsageLoggingShim.scala:26)
	at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:205)
	at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:204)
	at com.databricks.backend.daemon.driver.DriverLocal.withAttributionContext(DriverLocal.scala:60)
	at com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:240)
	at com.databricks.logging.UsageLogging.withAttributionTags$(UsageLogging.scala:225)
	at com.databricks.backend.daemon.driver.DriverLocal.withAttributionTags(DriverLocal.scala:60)
	at com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:582)
	at com.databricks.backend.daemon.driver.DriverWrapper.$anonfun$tryExecutingCommand$1(DriverWrapper.scala:615)
	at scala.util.Try$.apply(Try.scala:213)
	at com.databricks.backend.daemon.driver.DriverWrapper.tryExecutingCommand(DriverWrapper.scala:607)
	at com.databricks.backend.daemon.driver.DriverWrapper.executeCommandAndGetError(DriverWrapper.scala:526)
	at com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:561)
	at com.databricks.backend.daemon.driver.DriverWrapper.runInnerLoop(DriverWrapper.scala:431)
	at com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:374)
	at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:225)
	at java.lang.Thread.run(Thread.java:748)

This table is a partitioned Delta table (a Hive external partition table with underlying storage on S3), and column "k1" is the leading partition key column.
E.g.:

CREATE TABLE table (
  k1 BOOLEAN,
  c1 STRING,
  k2 TIMESTAMP,
  ...)
USING delta
PARTITIONED BY (k1, k2)
LOCATION 's3://xxx'
TBLPROPERTIES (
  'Type' = 'EXTERNAL',
  'delta.autoOptimize' = 'true',
  'delta.minReaderVersion' = '1',
  'delta.minWriterVersion' = '2')

If we disable the auto-mount feature, the query runs fine on GPU:

set spark.rapids.alluxio.automount.enabled=false;

Still trying to figure out a minimal repro as of now.

viadea added labels: bug (Something isn't working), ? - Needs Triage (Need team to review and classify) on Jul 19, 2022
viadea (Collaborator, Author) commented Jul 19, 2022

While trying to reproduce this issue, I found another one, #6029: Spark RAPIDS (without Alluxio) cannot read a Hive partition table whose partition key is boolean when spark.rapids.alluxio.pathsToReplace is set.
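
For reference, spark.rapids.alluxio.pathsToReplace takes a list of replacement rules of the form source->destination; something like the following (bucket and master address are placeholders):

spark.rapids.alluxio.pathsToReplace s3a://bucket->alluxio://<alluxio-master>:19998/bucket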

viadea (Collaborator, Author) commented Jul 20, 2022

Now I can partially reproduce the issue, with a different error message that is the same as in #6029:

Caused by: java.lang.IllegalArgumentException: 'false: class org.apache.spark.unsafe.types.UTF8String' is not supported for BooleanType, expecting Boolean.

Below is the minimum repro:

  1. Set up a Hive Metastore.
  2. Set up an Alluxio cluster locally.
  3. Have an S3 bucket to store data.
  4. Build a Spark RAPIDS jar using "Add Alluxio auto mount feature" #5925.
  5. Stage the jars needed for Spark to read from and write to S3.
    Here I am using spark-3.2.1-bin-hadoop3.2, so the below jars are put on the Spark classpath:
  • aws-java-sdk-bundle-1.11.375.jar
  • hadoop-aws-3.2.3.jar

Make sure to use the newer guava-27.0-jre.jar from Hive:

lrwxrwxrwx  1 xxx xxx       46 Jul 19 09:39 guava-27.0-jre.jar -> $HIVE_HOME/lib/guava-27.0-jre.jar

Here I am using Hive (apache-hive-3.1.2-bin), so make sure the below jars are there:

-rw-r--r--  1 xxx xxx  2747878 Oct 18  2021 guava-27.0-jre.jar
lrwxrwxrwx  1 xxx xxx       25 Oct 18  2021 mysql-connector-java.jar -> /usr/share/java/mysql.jar
lrwxrwxrwx  1 xxx xxx       63 May 23 11:55 aws-java-sdk-bundle-1.11.375.jar -> $SPARK_HOME/jars/aws-java-sdk-bundle-1.11.375.jar
lrwxrwxrwx  1 xxx xxx       51 Jul 19 10:04 hadoop-aws-3.2.3.jar -> $SPARK_HOME/jars/hadoop-aws-3.2.3.jar

To access S3 using an access key, create $SPARK_HOME/conf/hdfs-site.xml and $HIVE_HOME/conf/hdfs-site.xml as:

<?xml version="1.0"?>
<configuration>
<property>
  <name>fs.s3a.access.key</name>
  <value>xxx</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>yyy</value>
</property>
</configuration>

To access S3 from Spark, make sure to add the below settings to spark-defaults.conf:

spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.aws.credentials.provider org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
  6. Create a sample table using spark-sql:
CREATE TABLE issue6026 (
  k1 BOOLEAN,
  c1 STRING )
USING parquet
PARTITIONED BY (k1)
LOCATION 's3a://bucket/sampledata/issue6026';

INSERT INTO issue6026(k1,c1) select CAST('false' AS BOOLEAN), 'abc';
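
For reference, Hive-style partitioning encodes the partition value in the directory name, so the inserted row lands under a path like the following (file name is illustrative):

s3a://bucket/sampledata/issue6026/k1=false/part-00000-<uuid>.snappy.parquet

The string false recovered from that directory name is the value the failing cast later complains about.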
  7. A GPU run works fine:
spark-sql> select * from issue6026;
abc	false
  8. GPU + Alluxio fails:
set spark.rapids.alluxio.automount.enabled=true;
set spark.rapids.alluxio.cmd=/home/xxx/alluxio-2.8.0/bin/alluxio;
select * from issue6026;

Error:

Caused by: java.lang.IllegalArgumentException: 'false: class org.apache.spark.unsafe.types.UTF8String' is not supported for BooleanType, expecting Boolean.
	at com.nvidia.spark.rapids.GpuScalar$.from(literals.scala:315)
	at com.nvidia.spark.rapids.ColumnarPartitionReaderWithPartitionValues$.$anonfun$createPartitionValues$2(ColumnarPartitionReaderWithPartitionValues.scala:69)
	at com.nvidia.spark.rapids.RapidsPluginImplicits$MapsSafely.$anonfun$safeMap$1(implicits.scala:216)
	at com.nvidia.spark.rapids.RapidsPluginImplicits$MapsSafely.$anonfun$safeMap$1$adapted(implicits.scala:213)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at com.nvidia.spark.rapids.RapidsPluginImplicits$MapsSafely.safeMap(implicits.scala:213)
	at com.nvidia.spark.rapids.RapidsPluginImplicits$AutoCloseableProducingSeq.safeMap(implicits.scala:248)
	at com.nvidia.spark.rapids.ColumnarPartitionReaderWithPartitionValues$.createPartitionValues(ColumnarPartitionReaderWithPartitionValues.scala:68)
	at com.nvidia.spark.rapids.MultiFileReaderFunctions.$anonfun$addPartitionValues$1(GpuMultiFileReader.scala:74)
	at scala.Option.map(Option.scala:230)
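
So the partition value recovered from the rewritten path reaches the GPU scalar conversion still as a UTF8String rather than a Boolean. A minimal sketch of the type pairing being enforced (illustrative names, not the plugin's actual code):

import org.apache.spark.sql.types.{BooleanType, DataType, StringType}
import org.apache.spark.unsafe.types.UTF8String

// Hypothetical helper: mirrors the kind of check that rejects a UTF8String
// when the declared partition column type is BooleanType.
def toScalarValue(v: Any, dt: DataType): Any = (v, dt) match {
  case (b: java.lang.Boolean, BooleanType) => b            // expected pairing for BooleanType
  case (s: UTF8String, StringType)         => s.toString   // string partition values are fine
  case _ => throw new IllegalArgumentException(
    s"'$v: ${v.getClass}' is not supported for $dt, expecting Boolean.")
}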

viadea (Collaborator, Author) commented Jul 20, 2022

I can reproduce the original error by just creating a Hive Delta partition table; then the below error shows up:
Failed to cast value false to BooleanType for partition column k1
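
The Delta variant of the same minimal table (table name and location are placeholders) is enough to hit the original message:

CREATE TABLE issue6026_delta (
  k1 BOOLEAN,
  c1 STRING )
USING delta
PARTITIONED BY (k1)
LOCATION 's3a://bucket/sampledata/issue6026_delta';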

tgravescs (Collaborator) commented

So this has the same root cause as #6029.
The way we change the paths for Alluxio requires rebuilding some of the data structures (https://github.com/NVIDIA/spark-rapids/blob/branch-22.08/sql-plugin/src/main/scala/com/nvidia/spark/rapids/AlluxioUtils.scala#L88). One of those steps is inferring the partitioning, for which we call into Spark's PartitioningUtils.parsePartitions. As discovered by Allen (@wjxiz1992), https://issues.apache.org/jira/browse/SPARK-39012 went into Spark 3.3 to fix parsePartitions not handling boolean types properly.
So we either need to pull logic like that into the plugin or find a different way to update the paths.
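
A hedged sketch of the kind of handling SPARK-39012 added (names are illustrative, not the actual Spark method): values parsed from the partition path arrive as strings, and boolean columns need an explicit conversion rather than a failing cast.

import org.apache.spark.sql.types.{BooleanType, DataType, IntegerType, LongType, StringType}

// Hypothetical conversion of a raw partition-path value to the declared column type.
def castPartValue(raw: String, dt: DataType): Any = dt match {
  case BooleanType => raw.toBoolean   // "false" -> false
  case IntegerType => raw.toInt
  case LongType    => raw.toLong
  case StringType  => raw
  case other => throw new RuntimeException(
    s"Failed to cast value `$raw` to `$other` for partition column")
}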

I was experimenting with ways to update the paths by updating the PartitionSpecs, but that isn't accessible for all FileIndex types. And some CSPs have custom versions of it, so we can't just use the built-in Spark classes like PartitioningAwareFileIndex.
