[BUG] [JSON] A mix of lists and structs within the same column is not supported #9353
When reading JSON in Spark, if a field has mixed types, Spark will infer the type as `String` to avoid data loss due to the uncertainty of the actual data type. By default, Spark samples the JSON file to infer the schema, and if the sampled records for a specific field have different types, it falls back to the most permissive type, which is `String`. We already have code for this:

```scala
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

val schema = StructType(Seq(StructField("foo", DataTypes.StringType, true)))
val df = Seq("{ \"foo\": { \"a\": [1,2,3] } }", "{ \"foo\": { \"a\": { \"b\": 1 } } }").toDF("json")
val df2 = df.withColumn("foo", from_json(col("json"), schema)).drop("json")
df2.show(truncate=false)
```

This produces a `foo` column whose data type is a struct containing a single string field that holds the raw JSON. There is some extra nesting, but if we could remove the outer struct we would be left with the unparsed string we want.
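To make the sampling-based fallback concrete, here is a minimal Spark-free sketch in Python. The helper name `infer_field_type` and the type labels are illustrative, not Spark APIs; the point is only that when sampled records disagree on a field's type, the most permissive type (string) wins:

```python
import json

def infer_field_type(records, field):
    """Hypothetical mimic of Spark's JSON schema-inference fallback.

    Inspects the given field across sampled records; if the sampled
    values disagree on their type, fall back to "string".
    """
    seen = set()
    for rec in records:
        value = json.loads(rec).get(field)
        if isinstance(value, list):
            seen.add("array")
        elif isinstance(value, dict):
            seen.add("struct")
        else:
            seen.add("scalar")
    # Conflicting types across the sample -> most permissive type wins
    return seen.pop() if len(seen) == 1 else "string"

# A mix of a list and a struct in the same column infers as "string"
records = ['{"foo": [1,2,3]}', '{"foo": {"b": 1}}']
print(infer_field_type(records, "foo"))  # string
```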
I am not convinced that using `from_json` is really going to help, because we would have to do something complex with how we read the JSON file. I'll experiment some more, but we will likely need cuDF to implement a new feature that allows us to read specific columns as unparsed JSON strings. I have filed rapidsai/cudf#14239.
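The behavior requested of cuDF can be sketched on the CPU: for the selected column, return the value re-serialized as a raw JSON string regardless of whether it is a list, struct, or scalar. The function name `read_column_as_json_string` is a stand-in for illustration, not the cuDF API:

```python
import json

def read_column_as_json_string(lines, column):
    """CPU stand-in for "read this column as an unparsed JSON string".

    Each input line is one JSON record; the selected column's value is
    returned as a compact JSON string, whatever its underlying type.
    """
    out = []
    for line in lines:
        value = json.loads(line).get(column)
        out.append(None if value is None else json.dumps(value, separators=(",", ":")))
    return out

lines = ['{"foo": {"a": [1,2,3]}}', '{"foo": {"a": {"b": 1}}}']
print(read_column_as_json_string(lines, "foo"))
# ['{"a":[1,2,3]}', '{"a":{"b":1}}']
```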
For me this is mostly a question of how likely this is to show up in practice versus the amount of effort to make it work. It feels unlikely to show up that often, so in the short term I am fine with just documenting it and moving on. We should ask the cuDF team whether this is something they would ever want to support. If it is, we can work with them to try to support it. If not, we will likely have to do something similar to what we did for Map support: take their tokenizer and write our own code to process the tokens.
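The "process the tokens ourselves" idea amounts to finding where a field's value starts in the raw text and where it ends, then slicing out the unparsed substring. A naive sketch using Python's `json.JSONDecoder.raw_decode` (which reports where a parsed value ends); it only handles a top-level key and would break on nested keys of the same name, so it is illustrative only — the real plugin would operate on cuDF's tokenizer output:

```python
import json

def raw_value_span(text, key):
    """Return the unparsed JSON substring for a top-level key.

    Locates the key in the raw text, skips the colon and whitespace,
    then uses raw_decode to find where the value ends.
    """
    decoder = json.JSONDecoder()
    needle = '"%s"' % key
    idx = text.index(needle)  # naive: first occurrence only
    start = text.index(":", idx + len(needle)) + 1
    while text[start].isspace():
        start += 1
    _, end = decoder.raw_decode(text, start)
    return text[start:end]

print(raw_value_span('{"foo": {"a": [1,2,3]}, "bar": 1}', "foo"))
# {"a": [1,2,3]}
```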
Describe the bug
Given the following input:
Spark can read this and will return a string representation of the column:
Spark RAPIDS fails with this error:
Steps/Code to reproduce bug
Run this in spark-shell:
Expected behavior
Should produce the correct results and not fail.
Environment details (please complete the following information)
N/A
Additional context