
[BUG] [JSON] A mix of lists and structs within the same column is not supported #9353

Closed
andygrove opened this issue Sep 29, 2023 · 3 comments · Fixed by #9993
Assignees
Labels
bug Something isn't working cudf_dependency An issue or PR with this label depends on a new feature in cudf

Comments

@andygrove
Contributor

andygrove commented Sep 29, 2023

Describe the bug

Given the following input:

$ cat test.json
{ "foo": [1,2,3] }
{ "foo": { "a": 1 } }

Spark can read this and will return a string representation of the column:

scala> spark.conf.set("spark.rapids.sql.enabled", false)

scala> val df = spark.read.json("test.json")
df: org.apache.spark.sql.DataFrame = [foo: string]                              

scala> df.show
+-------+
|    foo|
+-------+
|[1,2,3]|
|{"a":1}|
+-------+

scala> df.schema
res2: org.apache.spark.sql.types.StructType = StructType(StructField(foo,StringType,true))

Spark RAPIDS fails with this error:

Caused by: ai.rapids.cudf.CudfException: CUDF failure at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-181-cuda11/thirdparty/cudf/cpp/src/io/json/json_column.cu:577: A mix of lists and structs within the same column is not supported
  at ai.rapids.cudf.Table.readJSON(Native Method)
  at ai.rapids.cudf.Table.readJSON(Table.java:1123)
  at org.apache.spark.sql.catalyst.json.rapids.JsonPartitionReader$.$anonfun$readToTable$2(GpuJsonScan.scala:270)

Steps/Code to reproduce bug

$ cat test.json
{ "foo": [1,2,3] }
{ "foo": { "a": 1 } }

Run this in spark-shell:

spark.conf.set("spark.rapids.sql.format.json.read.enabled", true)
spark.conf.set("spark.rapids.sql.format.json.enabled", true)
val df = spark.read.json("test.json")
df.show

Expected behavior
Reading this file on the GPU should produce the same result as Spark's CPU implementation (the foo column read as a String) rather than failing.

Environment details (please complete the following information)
N/A

Additional context

@andygrove andygrove added bug Something isn't working ? - Needs Triage Need team to review and classify labels Sep 29, 2023
@andygrove andygrove self-assigned this Sep 29, 2023
@andygrove
Contributor Author

andygrove commented Sep 29, 2023

When reading JSON in Spark, if a field has mixed types, Spark will infer the type as String to avoid data loss due to the uncertainty of the actual data type.

By default, Spark infers the schema from a sample of the JSON file, and if the sampled records for a specific field have different types, it falls back to the most permissive type, which is String.

In GpuJsonScan we can see that Spark has specified the schema dataSchema=StructType(StructField(foo,StringType,true)), so we need a way to respect that.
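
As a point of reference, the same String behavior can be requested explicitly on the CPU. A minimal spark-shell sketch (assuming the test.json from the bug description; the variable names are just for illustration):

import org.apache.spark.sql.types._

// Explicitly requesting StringType mirrors the dataSchema that GpuJsonScan
// receives; the CPU reader returns the raw JSON text for the mixed-type field.
val dataSchema = StructType(Seq(StructField("foo", StringType, nullable = true)))
val cpuDf = spark.read.schema(dataSchema).json("test.json")
cpuDf.show(truncate = false)  // expected rows: [1,2,3] and {"a":1}, as in the CPU output above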

We already have code in from_json that reads values as strings rather than parsing them. I ran a quick experiment that produces results close to what we need:

import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
val schema = StructType(Seq(StructField("foo", DataTypes.StringType, true)))
val df = Seq("{ \"foo\": [1,2,3] }", "{ \"foo\": { \"a\": 1 } }").toDF("json")
val df2 = df.withColumn("foo", from_json(col("json"), schema)).drop("json")
df2.show(truncate=false)

This produces:

+---------+
|      foo|
+---------+
|{[1,2,3]}|
|{{"a":1}}|
+---------+

The data type of the foo column here is StructField(foo,StructType(StructField(foo,StringType,true)),true).

There is some extra nesting but if we could remove the outer { and }, then we would have the correct result in this case.
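
For illustration only (not the plugin code), one way to peel off that extra nesting on the DataFrame side, building on df2 from the sketch above, is to select the inner string field of the struct that from_json produces:

import org.apache.spark.sql.functions._

// df2's foo column is a struct with a single string field (also named foo);
// selecting that inner field yields the raw JSON text directly.
val df3 = df2.select(col("foo").getField("foo").alias("foo"))
df3.show(truncate = false)  // expected rows: [1,2,3] and {"a":1}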

@andygrove
Contributor Author

andygrove commented Sep 29, 2023

I am not convinced that using from_json is really going to help, because we would have to do something complex: read the JSON file using Table.readJSON just to get the columns that have consistent types, then do a text file scan for the other columns (edit: that is not really feasible since the data is embedded in JSON) and invoke from_json on them, and then join the two results together.
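
To make the shape of that idea a bit more concrete, here is a rough spark-shell sketch of just the from_json half, reading each record as plain text first (hypothetical and simplified; as the edit above notes, this does not generalize once the mixed-type field is embedded in larger JSON records):

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Read each record as a raw text line, then parse only the mixed-type field
// with from_json so it comes back as an unparsed JSON string.
val mixedSchema = StructType(Seq(StructField("foo", StringType, nullable = true)))
val rawLines = spark.read.text("test.json")   // the text source exposes a single "value" column
val fooOnly = rawLines.select(from_json(col("value"), mixedSchema).getField("foo").alias("foo"))
fooOnly.show(truncate = false)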

I'll experiment some more, but we will likely need cuDF to implement a new feature that allows us to read specific columns as unparsed JSON strings. I have filed rapidsai/cudf#14239.

@revans2
Collaborator

revans2 commented Oct 2, 2023

For me this is mostly a question of how likely this is to show up in practice versus the amount of effort required to make it work. This feels unlikely to show up that often, so in the short term I am fine with just documenting it and moving on. We should ask the cuDF team whether this is something they would ever want to support. If it is, then we can work with them to try to support it. If not, then we will likely have to do something similar to what we did for Map support: take their tokenizer and write our own code to process the tokens.

@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Oct 3, 2023
@sameerz sameerz added the cudf_dependency An issue or PR with this label depends on a new feature in cudf label Oct 17, 2023