[BUG] from_json ArrayIndexOutOfBoundsException in 24.02 #10659

Closed
andygrove opened this issue Apr 2, 2024 · 4 comments
Labels: bug (Something isn't working)

@andygrove
Contributor

Describe the bug

Calling from_json on JSON with nested structs can throw an ArrayIndexOutOfBoundsException when the schema provided for the nested struct has fewer fields than are present in the JSON.

Steps/Code to reproduce bug

Input file test.json

{ "x": { "a": 54321, "b": 1, "c": 2} }
{ "x": { "a": 54321, "b": 1, "c": 2} }
{ "x": { "a": 54321, "b": 1, "c": 2} }
{ "x": { "a": 54321, "b": 1, "c": 2} }

Test Setup

scala> import org.apache.spark.sql.types._

scala> val df = spark.read.text("test.json")

scala> df.write.parquet("test.parquet")

scala> val df = spark.read.parquet("test.parquet")

scala> val df2 = df.withColumn("mystruct", from_json(col("value"), new StructType(Array(StructField("x", new StructType(Array(StructField("a", DataTypes.IntegerType, false))))))))

scala> df2.show

CPU

+--------------------+---------+
|               value| mystruct|
+--------------------+---------+
|{ "x": { "a": 543...|{{54321}}|
|{ "x": { "a": 543...|{{54321}}|
|{ "x": { "a": 543...|{{54321}}|
|{ "x": { "a": 543...|{{54321}}|
+--------------------+---------+

GPU

Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1
  at org.apache.spark.sql.types.StructType.apply(StructType.scala:423)
  at com.nvidia.spark.rapids.GpuCast$.$anonfun$castStructToStruct$2(GpuCast.scala:1584)
  at com.nvidia.spark.rapids.GpuCast$.$anonfun$castStructToStruct$2$adapted(GpuCast.scala:1580)
  at scala.collection.immutable.Range.foreach(Range.scala:158)
  at com.nvidia.spark.rapids.GpuCast$.$anonfun$castStructToStruct$1(GpuCast.scala:1580)
  at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:66)
  at com.nvidia.spark.rapids.GpuCast$.castStructToStruct(GpuCast.scala:1579)
  at com.nvidia.spark.rapids.GpuCast$.doCast(GpuCast.scala:584)
  at org.apache.spark.sql.rapids.GpuJsonToStructs.$anonfun$doColumnar$7(GpuJsonToStructs.scala:227)

Cause

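The following code uses a single index to look up fields in both the from and to struct types, and assumes that the two have the same number of fields. In the repro above the target struct has only one field (a), which is consistent with the "Index 1 out of bounds for length 1" failure in StructType.apply.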

  private def castStructToStruct(
      from: StructType,
      to: StructType,
      input: ColumnView,
      options: CastOptions): ColumnVector = {
    withResource(new ArrayBuffer[ColumnVector](from.length)) { childColumns =>
      from.indices.foreach { index =>
        childColumns += doCast(
          input.getChildColumnView(index),
          from(index).dataType,
          to(index).dataType, options)
      }
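The mismatch can be illustrated without Spark or the plugin. The following is a minimal sketch, not the actual fix: `Field`, `Struct`, `pairByIndex`, and `pairByName` are hypothetical stand-ins introduced here for illustration, and matching children by name is only one possible way to avoid the shared-index assumption.

```scala
// Hypothetical stand-ins for Spark's StructField / StructType (illustration only).
final case class Field(name: String, dataType: String)

final case class Struct(fields: Vector[Field]) {
  def apply(i: Int): Field = fields(i) // throws if i is out of range
  def indices: Range = fields.indices
}

object StructCastSketch {
  // Mirrors the pattern in the excerpt: one index is applied to both schemas,
  // so it fails as soon as `from` has more fields than `to`.
  def pairByIndex(from: Struct, to: Struct): Seq[(Field, Field)] =
    from.indices.map(i => (from(i), to(i)))

  // Assumed alternative: iterate over the target schema and find each source
  // field by name, silently skipping source fields the target does not mention.
  def pairByName(from: Struct, to: Struct): Seq[(Field, Field)] =
    to.fields.flatMap { target =>
      from.fields.find(_.name == target.name).map(source => (source, target))
    }
}
```

With a three-field source (a, b, c) and a one-field target (a), `pairByIndex` throws an out-of-bounds exception at index 1, while `pairByName` yields a single (a, a) pair.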

Expected behavior
The GPU result should match the Spark CPU result shown above.

andygrove added the "bug" and "? - Needs Triage" labels on Apr 2, 2024
@andygrove
Contributor Author

I could not reproduce the issue with the latest code from branch-24.04.

scala> spark.conf.set("spark.rapids.sql.expression.JsonToStructs", true)

scala> df2.show
24/04/02 23:48:01 WARN GpuOverrides: 
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
  @Partitioning <SinglePartition$> could run on GPU
  *Exec <ProjectExec> will run on GPU
    *Expression <Alias> cast(from_json(StructField(x,StructType(StructField(a,IntegerType,false)),true), value#0, Some(UTC)) as string) AS mystruct#18 will run on GPU
      *Expression <Cast> cast(from_json(StructField(x,StructType(StructField(a,IntegerType,false)),true), value#0, Some(UTC)) as string) will run on GPU
        *Expression <JsonToStructs> from_json(StructField(x,StructType(StructField(a,IntegerType,false)),true), value#0, Some(UTC)) will run on GPU
    *Exec <FileSourceScanExec> will run on GPU

+--------------------+---------+
|               value| mystruct|
+--------------------+---------+
|{ "x": { "a": 543...|{{54321}}|
|{ "x": { "a": 543...|{{54321}}|
|{ "x": { "a": 543...|{{54321}}|
|{ "x": { "a": 543...|{{54321}}|
+--------------------+---------+

@thirtiseven
Collaborator

The test passes only after this commit in 24.04, so it looks like that commit fixed the issue.

There is some schema handling in the commit that might resolve the indexing issue in castStructToStruct, but I don't have much context on from_json.

@revans2
Collaborator

revans2 commented Apr 9, 2024

The issue still exists, at least in a few situations. The problem shows up when we ask CUDF to return data with a specific schema, but CUDF sees something that does not match the schema it expects and decides to return a struct or a list instead of an actual string.

rapidsai/cudf#15278 is the remaining case I know of that causes something like this.

mattahrens assigned revans2 and unassigned thirtiseven on Apr 9, 2024
mattahrens removed the "? - Needs Triage" label on Apr 9, 2024
@revans2
Collaborator

revans2 commented Apr 10, 2024

I also manually verified that this is working now for 24.04, once you turn on JsonToStructs. I think we can close this as there are other issues to track the remaining JSON work.

@revans2 revans2 closed this as completed Apr 10, 2024