[FEA] Add support for reading nested JSON in `GpuJsonScan` #10241

andygrove · 2024-01-22T17:09:26Z

Is your feature request related to a problem? Please describe.

We cannot read JSON files containing nested types with `GpuJsonScan`, even though `from_json` already supports them.

Input:

{ "a": { "b": "hello" } }
{ "a": { "b": "goodbye" } }

Test:

scala> spark.conf.set("spark.rapids.sql.format.json.enabled", true)
scala> spark.conf.set("spark.rapids.sql.format.json.read.enabled", true)
scala> spark.read.schema("a struct<b string>").json("test.json").show
24/01/22 16:59:06 WARN GpuOverrides: 
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
  @Partitioning <SinglePartition$> could run on GPU
  *Exec <ProjectExec> will run on GPU
    *Expression <Alias> toprettystring(a#37, Some(UTC)) AS toprettystring(a)#40 will run on GPU
      *Expression <ToPrettyString> toprettystring(a#37, Some(UTC)) will run on GPU
    !Exec <FileSourceScanExec> cannot run on GPU because unsupported data types StructType(StructField(b,StringType,true)) [a] in read for JSON

Describe the solution you'd like
I would like to be able to read JSON files with nested types.

Note that we already support reading nested types in from_json:

scala> import org.apache.spark.sql.types.{DataTypes, StructField, StructType}

scala> val schema = StructType(Seq(StructField("a", StructType(Seq(StructField("b", DataTypes.StringType, true))), true)))

scala> val df = spark.read.text("test.json").withColumn("json", from_json(col("value"), schema))
df: org.apache.spark.sql.DataFrame = [value: string, json: struct<a: struct<b: string>>]

scala> df.show
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
  @Partitioning <SinglePartition$> could run on GPU
  *Exec <ProjectExec> will run on GPU
    *Expression <Alias> toprettystring(value#9, Some(UTC)) AS toprettystring(value)#27 will run on GPU
      *Expression <ToPrettyString> toprettystring(value#9, Some(UTC)) will run on GPU
    *Expression <Alias> toprettystring(from_json(StructField(a,StructType(StructField(b,StringType,true)),true), value#9, Some(UTC)), Some(UTC)) AS toprettystring(json)#28 will run on GPU
      *Expression <ToPrettyString> toprettystring(from_json(StructField(a,StructType(StructField(b,StringType,true)),true), value#9, Some(UTC)), Some(UTC)) will run on GPU
        *Expression <JsonToStructs> from_json(StructField(a,StructType(StructField(b,StringType,true)),true), value#9, Some(UTC)) will run on GPU
    !Exec <FileSourceScanExec> cannot run on GPU because unsupported file format: org.apache.spark.sql.execution.datasources.text.TextFileFormat

Describe alternatives you've considered
None

Additional context
None

andygrove · 2024-01-23T17:50:36Z

Depends on rapidsai/cudf#14830

andygrove added the "feature request" and "? - Needs Triage" labels Jan 22, 2024

This was referenced Jan 22, 2024

[FEA] [EPIC] Priority JSON Issues #9458 (Open)

WIP: Add support for reading structs in GpuJsonScan #10245 (Closed)

[BUG] [JSON] GpuJsonScan type checks are not recursive #9775 (Open)

sameerz removed the "? - Needs Triage" label Jan 23, 2024

andygrove mentioned this issue Jan 30, 2024

WIP: Add support for reading structs in GpuJsonScan #10325 (Closed)
