
[BUG] [JSON] A mix of lists and structs within the same column is not supported #9353

Closed
andygrove opened this issue Sep 29, 2023 · 3 comments · Fixed by #9993
Assignees
Labels
bug Something isn't working cudf_dependency An issue or PR with this label depends on a new feature in cudf

Comments

@andygrove
Contributor

andygrove commented Sep 29, 2023

Describe the bug

Given the following input:

$ cat test.json
{ "foo": [1,2,3] }
{ "foo": { "a": 1 } }

Spark can read this and will return a string representation of the column:

scala> spark.conf.set("spark.rapids.sql.enabled", false)

scala> val df = spark.read.json("test.json")
df: org.apache.spark.sql.DataFrame = [foo: string]                              

scala> df.show
+-------+
|    foo|
+-------+
|[1,2,3]|
|{"a":1}|
+-------+

scala> df.schema
res2: org.apache.spark.sql.types.StructType = StructType(StructField(foo,StringType,true))

Spark RAPIDS fails with this error:

Caused by: ai.rapids.cudf.CudfException: CUDF failure at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-181-cuda11/thirdparty/cudf/cpp/src/io/json/json_column.cu:577: A mix of lists and structs within the same column is not supported
  at ai.rapids.cudf.Table.readJSON(Native Method)
  at ai.rapids.cudf.Table.readJSON(Table.java:1123)
  at org.apache.spark.sql.catalyst.json.rapids.JsonPartitionReader$.$anonfun$readToTable$2(GpuJsonScan.scala:270)

Steps/Code to reproduce bug

$ cat test.json
{ "foo": [1,2,3] }
{ "foo": { "a": 1 } }

Run this in spark-shell:

spark.conf.set("spark.rapids.sql.format.json.read.enabled", true)
spark.conf.set("spark.rapids.sql.format.json.enabled", true)
val df = spark.read.json("test.json")
df.show

Expected behavior
Reading this file on the GPU should produce the same result as Spark's CPU implementation (the foo column read as a String) rather than failing.

Environment details (please complete the following information)
N/A

Additional context

@andygrove andygrove added bug Something isn't working ? - Needs Triage Need team to review and classify labels Sep 29, 2023
@andygrove andygrove self-assigned this Sep 29, 2023
@andygrove
Contributor Author

andygrove commented Sep 29, 2023

When reading JSON in Spark, if a field has mixed types, Spark will infer the type as String to avoid data loss due to the uncertainty of the actual data type.

By default, Spark infers the schema from a sample of the JSON file, and if the sampled records for a specific field have different types, it falls back to the most permissive type, which is String.

In GpuJsonScan we can see that Spark has specified the schema dataSchema=StructType(StructField(foo,StringType,true)), so we need a way to respect that.
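
As a point of reference, the same String behavior can be requested explicitly on the CPU. A minimal spark-shell sketch (assuming the test.json from the bug description; the variable names are just for illustration):

import org.apache.spark.sql.types._

// Explicitly requesting StringType mirrors the dataSchema that GpuJsonScan
// receives; the CPU reader returns the raw JSON text for the mixed-type field.
val dataSchema = StructType(Seq(StructField("foo", StringType, nullable = true)))
val cpuDf = spark.read.schema(dataSchema).json("test.json")
cpuDf.show(truncate = false)  // expected rows: [1,2,3] and {"a":1}, as in the CPU output above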

We already have code in from_json that reads values as strings rather than parsing them. I ran a quick experiment that produces results close to what we need:

import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
val schema = StructType(Seq(StructField("foo", DataTypes.StringType, true)))
val df = Seq("{ \"foo\": [1,2,3] }", "{ \"foo\": { \"a\": 1 } }").toDF("json")
val df2 = df.withColumn("foo", from_json(col("json"), schema)).drop("json")
df2.show(truncate=false)

This produces:

+---------+
|      foo|
+---------+
|{[1,2,3]}|
|{{"a":1}}|
+---------+

The data type of the foo column here is StructField(foo,StructType(StructField(foo,StringType,true)),true).

There is some extra nesting but if we could remove the outer { and }, then we would have the correct result in this case.
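
For illustration only (not the plugin code), one way to peel off that extra nesting on the DataFrame side, building on df2 from the sketch above, is to select the inner string field of the struct that from_json produces:

import org.apache.spark.sql.functions._

// df2's foo column is a struct with a single string field (also named foo);
// selecting that inner field yields the raw JSON text directly.
val df3 = df2.select(col("foo").getField("foo").alias("foo"))
df3.show(truncate = false)  // expected rows: [1,2,3] and {"a":1}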

@andygrove
Contributor Author

andygrove commented Sep 29, 2023

I am not convinced that using from_json is really going to help, because we would have to do something complex: read the JSON file using Table.readJSON just to get the columns that have consistent types, then do a text file scan for the other columns (edit: that is not really feasible since the data is embedded in JSON) and invoke from_json on them, and then join the two results together.
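
To make the shape of that idea a bit more concrete, here is a rough spark-shell sketch of just the from_json half, reading each record as plain text first (hypothetical and simplified; as the edit above notes, this does not generalize once the mixed-type field is embedded in larger JSON records):

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Read each record as a raw text line, then parse only the mixed-type field
// with from_json so it comes back as an unparsed JSON string.
val mixedSchema = StructType(Seq(StructField("foo", StringType, nullable = true)))
val rawLines = spark.read.text("test.json")   // the text source exposes a single "value" column
val fooOnly = rawLines.select(from_json(col("value"), mixedSchema).getField("foo").alias("foo"))
fooOnly.show(truncate = false)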

I'll experiment some more, but we will likely need cuDF to implement a new feature that allows us to read specific columns as unparsed JSON strings. I have filed rapidsai/cudf#14239.

@revans2
Collaborator

revans2 commented Oct 2, 2023

For me this is mostly a question of how likely this is to show up in practice versus the amount of effort required to make it work. This feels unlikely to show up that often, so in the short term I am fine with just documenting it and moving on. We should ask the cuDF team whether this is something they would ever want to support. If it is, then we can work with them to try to support it. If not, then we will likely have to do something similar to what we did for Map support: take their tokenizer and write our own code to process the tokens.

@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Oct 3, 2023
@sameerz sameerz added the cudf_dependency An issue or PR with this label depends on a new feature in cudf label Oct 17, 2023